During my master’s I would study in the business library. It was nice there. The desks were large and sturdy, the seats were comfortable, and no one was ever there because MBA students don’t actually study. Although I didn’t think much about it at the time, there were always business magazines on display everywhere. I remember one magazine in particular that said “Data Is the New Oil” on the cover. I immediately thought about how stupid and wrong that was, but I didn’t really care.
Thinking back, that cover explains a lot about how businesses behave with regard to data. They hoard it. They get as much data as they can, put it in a safe(ish) place, and never look back. They hire “data scientists” to turn the data into gold, and expect the alchemy to simply happen. That is the state of business thinking in 2018.
This post is intended to help “business people” think about data in a more nuanced and interesting way. I talk about the value of data, how data is used, and the reasons not to treat data as a commodity to be hoarded.
Of course data isn’t literally the new oil. Oil is a liquid, whereas data is not. Oil is black, and data doesn’t have a color. However, the “data as oil” analogy has led many businesses to treat the two as if they are economically similar when they are not.
First, data is not even a commodity. You wouldn’t buy “web traffic data” without asking about the source. Facebook traffic data isn’t the same as Twitter traffic data. The two might be substitutes, but they are not fungible. Twitter traffic data isn’t even the same as other Twitter traffic data, as the value depends on what is inside.
Second, the value of data is usually highly idiosyncratic. Sensor data from a factory might be useful in predicting which parts are likely to fail and should therefore be replaced, but the only entity in a position to act on that data is the factory operator. Producing this data on the grounds that “data is valuable” is idiotic if you are the only one with a use for the specific data in question; nobody else is going to pay you for it.
Third, data loses a lot of its value over time. Of course, the speed with which data loses its value depends on what it is being used for. Some data becomes nearly worthless a few minutes after it is generated. For example, I recently bought a new pair of boxing gloves. I decided to go with the white Hayabusa T3 gloves. The fact that I chose the white gloves could have been used to market the other equipment to me as well, e.g. showing me the shin guards in white instead of black. But that didn’t happen. I checked out, and now it is far less valuable to know that I bought the gloves in white. I am probably not going to buy anything from the same company ever again, and even if I do it probably won’t be for a few years when my preferences might have changed.
In a recent post I talked about the problem of old data that companies collect and never use, and I think it is worth repeating here:
Old, unused data is incredibly overvalued. Companies collect all sorts of data they never use. The data sits idly in a data warehouse somewhere, waiting for an enthusiastic data scientist to come along and turn it into gold. Except that day never comes, and it just keeps on sitting. At my current job, I pretty much have free rein when it comes to using our data. I can access whatever I want, collect whatever I want, and do whatever I want. It’s pretty cool. And you know what I don’t do? Play with old data.
Old data sucks. One, it’s never in the format that I need; after all, it was collected without any specific purpose in mind. If I want to do something with the data, I first have to set up a connection to the database, and then I have to start pulling data into memory so that I can process it. Two, old data doesn’t tell the whole story. I can use the data for analysis purposes, but I then have to walk around the building and ask people to explain my findings. I don’t know what happened in April 2016! Interpreting results can be very difficult.
I then went on to discuss how I address this problem at work by processing data in real time.
However, there is a case for keeping old data: optionality. After all, you can’t use data you never collected (*man tapping on head meme*). You might not know what you’re going to do with the data, but by not collecting it you close off all possibilities. So why not? The cost of storage is relatively small, and if you might find a valuable use in the future then it is probably worth it.
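To put “the cost of storage is relatively small” in perspective, here is a back-of-the-envelope sketch in Python. The per-gigabyte price is an assumption on my part (roughly in line with commodity cloud object storage around 2018), not a quote from any particular provider.

```python
# Assumed price: $0.023 per GB-month (ballpark for commodity
# cloud object storage circa 2018 -- an assumption, not a quote).
PRICE_PER_GB_MONTH = 0.023

def yearly_storage_cost(terabytes: float) -> float:
    """Approximate yearly cost in dollars of storing `terabytes`
    of data at the assumed per-GB-month price."""
    gigabytes = terabytes * 1024
    return gigabytes * PRICE_PER_GB_MONTH * 12

# Ten terabytes of never-touched logs works out to a few
# thousand dollars a year.
print(round(yearly_storage_cost(10), 2))
```

The exact number doesn’t matter; the point is that storage really is cheap, which is precisely why the non-storage costs end up dominating the decision.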
The “why not” (err, “optionality”) justification for bulk data collection might make sense if storage were the only cost. However, I’m not convinced it is. There are other costs that are often ignored.
First, there is a legal risk involved in storing data that relates to individuals. Before collecting and storing data, companies need to ask: Is it legal to collect this data? What can we legally do with it? What obligations do we have in relation to data subjects? What happens if there is a data breach? Europe now has the GDPR, so there are very serious consequences for not asking these questions. If you’re going to run this legal risk, you probably need a better reason than “optionality”.
Second, storing data that relates to individuals is a PR nightmare waiting to happen. Facebook is currently in the middle of a PR storm because of the Cambridge Analytica story. However, at least Facebook got something out of that illicit transaction, i.e. money. Imagine if that type of bad publicity came as a result of a data breach that occurred in relation to data the company wasn’t even using. It’d be extremely embarrassing for a manager to tell the CEO, “We were keeping that data because why not.” Why not? Because it might really hurt your company’s reputation.
Third, companies that take a “collect everything” approach run the risk of ignoring real opportunities in the present. Rather than build services to solve specific problems, data engineers might feel they are just meant to drop everything into a data lake. The analytics team will have so much data on hand that it won’t know what to do with it all, and as a result might choose the wrong data to focus on. Systems balloon in size, and eventually someone realizes how pointless everything was. However, cleaning up the mess is no small task either. By forcing people to think carefully about what data they collect and why, you focus their minds on the specific problems that matter to the company.
Of course, not all of this necessarily applies. Maybe you aren’t collecting data about people at all, in which case the legal and PR costs mentioned above probably don’t apply. It is also possible to collect everything while still keeping the data team focused on what matters. These problems can be avoided, and these risks can be mitigated. The only requirement is that managers think carefully about the value and use of data, rather than charge ahead after reading that data is the new oil.
It’s probably the case that managers rarely get fired for collecting too much data. Try looking online for questions about limiting data collection while using popular tools such as Google Analytics; very few people have ever considered the possibility of intentionally collecting less data. That, I think, is a huge problem. It’s a legal and PR problem, but it’s also an organizational problem. A data team that has no restrictions doesn’t have to think at all.
If I were starting a data department today I would focus on “data discipline”. Optionality can wait. A data team that is forced to think carefully about everything it does, that asks “why” at every step and understands the company’s goals, is going to be far more valuable than the optionality that comes with old data.
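As a concrete sketch of what “data discipline” could look like in code, here is a hypothetical allowlist-based collector: a field can only be stored if someone has explicitly approved it and written down why it is needed. The field names, reasons, and API are all invented for illustration; the point is that the default answer to “should we collect this?” becomes no.

```python
# Hypothetical sketch: every collected field must be explicitly
# approved, with a stated reason, before any event containing it
# is accepted. Field names and reasons are made up for illustration.
APPROVED_FIELDS = {
    "order_id": "needed to join purchases to support tickets",
    "product_color": "used to suggest matching accessories at checkout",
}

def collect(event: dict) -> dict:
    """Accept an event only if every field has a documented use;
    refuse events that try to smuggle in data 'because why not'."""
    unapproved = set(event) - set(APPROVED_FIELDS)
    if unapproved:
        raise ValueError(f"no stated use for fields: {sorted(unapproved)}")
    return dict(event)

collect({"order_id": 42, "product_color": "white"})   # accepted
# collect({"order_id": 42, "ip_address": "1.2.3.4"})  # raises ValueError
```

A team working against a registry like this has to ask “why” at every step, which is exactly the habit the optionality mindset erodes.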