Does BIG DATA Imply Small Effects?

Radhika Dirks

Two words are thrown around a lot these days, mostly together: Big Data. Given the abundance of information, the affordability of data storage, the ease of collection, lax privacy concerns and, even better, an eagerness to share, data is ubiquitous. How we shop, what we eat, how our hearts beat, and what the world did for millions of years. If there is any record of it, or even the possibility of digitizing it, it is being digitized. And why the hell not? For the first time ever, we have terabytes of data flowing in. What used to be a computer scientist’s nightmare has become a statistician’s dream. At every tech conference, in every journal and discussion, the message is obnoxiously loud and Swarovski clear: either you are already dabbling in Big Data or you should be. More data can only mean deeper analysis, fewer errors, richer insights, and more predictive power. Data, the story goes, speaks for itself and will tell us what to do. And bigger data will do that better. Right?

Well, not so fast. There are a couple of caveats you need to be aware of. To quote one of my super smart PhD friends, a senior data scientist at a very prominent computing firm: “most people who say big data and large analysis all have little sensible data and almost no problems to solve.” And given that I think of my (“Big Data”) startup more as a Big Effects company, I decided it’s about time someone revealed some of the behind-the-scenes truth. So here is the scoop to help you decide whether Big Data really makes sense for your company:

Big Data – the fine print: As any halfway decent scientist or statistician will tell you, the amount of data you need for reliable analysis depends first on the size of the effects (trends) you are trying to observe, provided there are any. Overcompensation doesn’t hurt you, but getting more and more confirmation of the same behavior does not give you more insight. Second, larger data sets result in smaller error, but only slowly: in most systems, statistical error shrinks as 1/SquareRoot(N), where N is the number of data points you have. So once your error is below your tolerance, having more N doesn’t help much. Based on these two points, I have often been tempted to conclude that big data implies small effects. Meaning, if you ‘need’ big data to observe your effect, the effect has to be small. Otherwise, it would have shown up in the first 100, 1000, or even 10,000 data points you collected.
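
To make that concrete, here is a minimal simulation sketch (the synthetic normal data and the crude z-style significance check are my own assumptions, nothing rigorous): a big effect shows up reliably in the first hundred points, while a tiny one only surfaces once N gets huge.

```python
import numpy as np

rng = np.random.default_rng(0)

def detection_rate(effect, n, trials=200):
    """How often a mean shift of `effect` (in standard deviations)
    clears a rough 95% significance bar in a sample of size n."""
    hits = 0
    for _ in range(trials):
        sample = rng.normal(loc=effect, scale=1.0, size=n)
        std_err = sample.std(ddof=1) / np.sqrt(n)   # error shrinks as 1/sqrt(N)
        if abs(sample.mean()) / std_err > 1.96:
            hits += 1
    return hits / trials

for effect in (0.5, 0.02):                 # a big effect vs. a small one
    for n in (100, 1_000, 10_000):
        print(f"effect={effect}, N={n:>6}: detected {detection_rate(effect, n):.0%} of the time")
```

With these made-up numbers, you should see the 0.5-sigma effect detected almost every time at N = 100, while the 0.02-sigma effect is still roughly a coin flip even at N = 10,000.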

Big Data – the hope: Models of behavior, especially those with dominating effects, can be constructed from a small subset of the data. But simple models miss out on three key things. Understanding these will help you identify what new types of data will be useful to you.

1. Predicting behavior outside of the data range: When new data falls outside of your current data range, it can have enormous predictive power. The two-diverse-peak syndrome is a classic example: with a limited range of data, you can completely miss entire behavioral regions. New data outside the range also tells you whether the behavior you see is global or local, i.e., whether it holds over the entire data range. E.g., American passengers tend to sit as far away from each other as possible on airplanes, but Japanese passengers tend to cluster together, even when they are strangers. Perhaps something useful when you are modeling seat assignments for optimal weight distribution on intercontinental flights (and care about pleasing your customers).
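
Here is a toy illustration of that trap (the true_response curve and all the numbers are invented purely for illustration): a straight-line model fit on data from a narrow range looks fine inside that range and badly misses the second regime waiting outside it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented "true" behavior: roughly linear in the regime we sampled,
# with a second peak/regime lurking outside the collected range.
def true_response(x):
    return 2.0 * x + 10.0 * np.exp(-((x - 8.0) ** 2) / 2.0)

x_seen = rng.uniform(0, 5, size=500)                   # limited data range: 0 to 5
y_seen = true_response(x_seen) + rng.normal(scale=0.5, size=500)

slope, intercept = np.polyfit(x_seen, y_seen, deg=1)   # local model fits nicely

for x_new in (4.0, 8.0):                               # inside vs. outside the range
    prediction = slope * x_new + intercept
    print(f"x={x_new}: model says {prediction:5.1f}, reality is {true_response(x_new):5.1f}")
```

The point is not this particular curve; it is that nothing in the observed 0-to-5 range even hints that the model breaks down around 8.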

2. Dynamic modeling with feedback: When data becomes abundant, you can start doing clever and meaningful things with it, because you have so much to play with that you stop worrying about whether you just got lucky. Case in point: A/B testing, popular with web analysts and the Obama campaign. Including a picture of Michelle and the kids in emails increased Barack’s campaign sign-up rates by 40%! By including feedback, you make models more realistic and increase their predictive power. And by partitioning the data set, you can test for winning combinations more cheaply and quickly than ever before.
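
For the A/B testing part, here is a rough sketch of the arithmetic behind calling a winner (the email counts are hypothetical; only the roughly 40% lift echoes the story above): a pooled two-proportion z-test on the sign-up rates of the two variants.

```python
from math import sqrt

# Hypothetical counts (not the campaign's actual numbers); B is the email
# with the family photo, A is the control.
sent_a, signed_a = 50_000, 2_000      # 4.0% sign-up rate
sent_b, signed_b = 50_000, 2_800      # 5.6% sign-up rate -> ~40% lift

p_a, p_b = signed_a / sent_a, signed_b / sent_b
p_pool = (signed_a + signed_b) / (sent_a + sent_b)
std_err = sqrt(p_pool * (1 - p_pool) * (1 / sent_a + 1 / sent_b))
z = (p_b - p_a) / std_err             # how many standard errors apart the rates are

print(f"lift: {(p_b - p_a) / p_a:.0%}, z = {z:.1f}  (anything above ~2 is unlikely to be luck)")
```

With counts this large, the z-score dwarfs the usual threshold of about 2, which is exactly the luxury abundant data buys you.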

3. Understanding correlation vs. causation: This is a mucky subject, and not as easy as most people believe. What gets labeled causation is almost always just a correlation that has been mistakenly elevated, and a genuine causal link is almost impossible to distinguish from a correlation with some unknown confounder. It’s like proving a negation – you will be hard pressed to prove that there isn’t a blue cow.
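
A small synthetic illustration of why this is so mucky (all the data here is made up): a hidden confounder can make two variables look tightly coupled even though neither causes the other, and the apparent relationship vanishes once the confounder is held roughly fixed.

```python
import numpy as np

rng = np.random.default_rng(2)

# Entirely synthetic: a hidden confounder drives both x and y.
# x never causes y, yet the two look tightly coupled.
confounder = rng.normal(size=100_000)
x = confounder + rng.normal(scale=0.3, size=100_000)
y = confounder + rng.normal(scale=0.3, size=100_000)

print(f"corr(x, y)                    = {np.corrcoef(x, y)[0, 1]:.2f}")

# Hold the confounder (roughly) fixed by looking at a thin slice of it:
# the apparent relationship between x and y disappears.
within_slice = np.abs(confounder) < 0.05
print(f"corr(x, y | confounder fixed) = {np.corrcoef(x[within_slice], y[within_slice])[0, 1]:.2f}")
```

Of course, in real data you rarely know the confounder exists, let alone get to slice on it, which is exactly the blue-cow problem.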

Now there are more impressive things you can do, computationally speaking. Come up with completely new analyses, apply little-known techniques outside their home domain to gain profitable, deep insights, or simply be the first to digitize a section of the world that has not been digitized yet. We (and other start-ups) do all of this…but lumping these (perhaps more important) tasks under Big Data is misleading. And let’s not forget that there is such a thing as good data. One of the biggest challenges in Big Data is that most of it is gunk: unstructured or irrelevant. Things like intentional sabotage (which I sometimes engage in to mask my shopping behavior) are mostly not even considered at such early stages.

Lastly, data analysis is only as good as the thought processes behind it! Thinking can be either Big Coefficient thinking or New Reality thinking (a la Scott Page). Big Coefficient thinking means building the linear or non-linear models that reveal the dominating variables, and then optimizing for that particular variable. E.g., the American Jobs Act – not to be confused with the JOBS Act. But Big Coefficient thinking can blind you to reality because of the three points (above) that you hope Big Data can solve. New Reality thinking, on the other hand, means understanding the landscape of overlapping factors, directly related or not, and initiating a change whose effects are far more drastic, with behavioral changes as a side effect! E.g., the American Interstate Highway System that resulted from the Federal-Aid Highway Act of 1956.
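
To pin down the Big Coefficient half of that, here is a minimal sketch (the feature names and coefficients are made up) of the mindset: fit a model, find the variable with the biggest coefficient, and optimize it.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up features and coefficients, purely to illustrate the mindset.
n = 5_000
features = {
    "price":        rng.normal(size=n),
    "ad_spend":     rng.normal(size=n),
    "support_wait": rng.normal(size=n),
}
X = np.column_stack(list(features.values()))
y = (0.2 * features["price"]
     + 1.5 * features["ad_spend"]
     - 0.1 * features["support_wait"]
     + rng.normal(scale=0.5, size=n))

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)          # fit a linear model
name, value = max(zip(features, coefs), key=lambda kv: abs(kv[1]))
print(f"big coefficient says: optimize '{name}' (coefficient {value:.2f})")
```

New Reality thinking is precisely what a script like this cannot see: the levers that are not in the feature list at all.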

Perhaps you are just looking for pretty graphs to impress your executive board or to justify a decision already made. But if you are thinking about investing serious money in Big Data, work out what you hope to achieve before assuming Big Data can solve your problems. This will help you avoid the double whammy of having no sensible data and no big problems.