Here is How
Data science is evolving from a kludgy mess held together by spit, tape and constant effort, into something better. However, there are some fundamental problems at its heart that MUST change in order for the field to fulfill its promise of wresting understanding from the ever-expanding universe of data we create.
For the past 3 years, my cofounder and I have been attempting to import the insights of complexity physics into the machine learning universe to better predict large societal events. Rare events, high dimensional data, classically unpredictable ‘black swans’ and ‘dragon kings’ – I bring back insights from that edge.
Here is the first thing that must be understood – the tools of the modern machine learning/AI industry were not built for what you think. Specifically they were not built to search for truth in data.
They were built by computer scientists attempting to prove that the tools could be built at all. The point was to show that the algorithms 1) worked at all, 2) worked repeatably/predictably, and, if you are lucky, 3) worked (computationally) efficiently. Then, in the beautiful practice of computer science, some of the tools used to prove these things were open sourced and we all started using them for something entirely, but not obviously, different.
We never stopped to think about the different choices that are made by someone who is interested in demonstrating the nice properties of the algorithm (the people who designed the things we use) and someone who is interested in the truth/insight that the algorithm can help us to find. Keep this distinction in mind as you step through what actually happens in your training, validation, and usage pipeline and I think you’ll start to ask important questions. Asking those questions in the land of swans and dragons led to what follows.
Clean Up Our Words
In my first company, Rotary Gallop, we had to rethink Social Choice Theory to cleanly apply it to predicting the outcome of shareholder activism in public companies. It was hard going and we made little progress, until we realized we were using a few words interchangeably to mean subtly different things. We defined those things, each getting its own word and literally that day everything cleared up. We went through a similar, but much longer process with Machine Learning at Seldn.
Our words MUST get more specific. At the moment, absurdly, we don’t have a sensible way to communicate and differentiate between:
- An ML algorithm (e.g. random forests)
- The algorithm’s function space (e.g. the space of possible trees)
- The method by which we fit/iterate/search the function space (e.g. gradient descent)
- The optimization function (the thing we are minimizing/optimizing)
- The model’s available hyperparameters (e.g. tree depth)
- The specific hyperparameter values chosen for a specific dataset – with which we then search for solutions
- The full set of parameters that lead to a specific fit (often we pretend this IS the predictive algorithm, as if a recipe IS a cake.)
- The actual, honest-to-goodness trained and fully specified solution/fit, into which we then plug inputs and make a prediction.
- The predictions themselves.
I have heard each and every one of these things referred to as “the model”. If we can’t talk easily and clearly about these very different things, how can we think clearly about the training/validation/usage pipeline? Clear thinking requires clear concepts.
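To make the distinctions concrete, here is a minimal sketch that gives several of these things their own type. Everything in it is an illustrative assumption: the class names (Algorithm, TrainingRecipe, Fit), the toy one-dimensional threshold rule, and the grid search are hypothetical stand-ins, not the vocabulary or tooling we actually used.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Algorithm:
    """The family of methods; its function space is 'all threshold rules'."""
    name: str
    hyperparameter_names: Sequence[str]  # the knobs that are available


@dataclass(frozen=True)
class TrainingRecipe:
    """Everything needed to reproduce a fit: the recipe, not the cake."""
    algorithm: Algorithm
    hyperparameters: Dict[str, float]  # the specific values chosen
    loss_name: str                     # the thing being minimized
    search_method: str                 # how the function space is searched
    random_seed: int                   # unused by this deterministic toy search


@dataclass(frozen=True)
class Fit:
    """The fully specified, trained solution that maps an input to a prediction."""
    recipe: TrainingRecipe
    learned_parameters: Dict[str, float]

    def predict(self, x: float) -> int:
        return int(x > self.learned_parameters["threshold"])


def train(recipe: TrainingRecipe, xs: Sequence[float], ys: Sequence[int]) -> Fit:
    """Grid-search the threshold that minimizes the misclassification count."""
    steps = int(recipe.hyperparameters["grid_steps"])
    lo, hi = min(xs), max(xs)
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]

    def loss(threshold: float) -> int:
        return sum(int(x > threshold) != y for x, y in zip(xs, ys))

    return Fit(recipe, {"threshold": min(candidates, key=loss)})


algorithm = Algorithm("threshold rule", ("grid_steps",))
recipe = TrainingRecipe(algorithm, {"grid_steps": 8}, "misclassification count", "grid search", 0)
fit = train(recipe, [0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1])
predictions = [fit.predict(x) for x in (0.2, 0.8)]
```

With separate names for the algorithm, the recipe, the fit, and the predictions, a sentence like "we validated the model" has to say which one it means.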
A large number of data scientists lack knowledge of the scientific method, which extends beyond statistics. But there is a larger epistemological problem in the field. Data science must stop searching the entire universe of mathematical functions. What we do now is akin to looking for an address by searching the entire known universe. Guess what: it is probably on Earth.
We have a history of more than 1,000 years of finding mathematical truths in data; it is called science. If one looks broadly across the sciences, it is obvious that the truth we find comes in certain forms. These forms are VASTLY more constrained than mathematics is. The real world has favored patterns. We need to honor those patterns first in our search strategies, and only when that fails strike out into the great dark elsewhere.
Get Real with High Dimensions
The not-so-secret dirty secret of data science/machine learning/AI is that human beings are still in the loop. Humans are artificially cutting down the parameter space based on knowledge of the field and raw gut intuition. Machines are utterly lost without them. It’s commonly referred to as the art in data science. It happens because our tools and our thinking were built for low-dimensional data spaces. If we want to leverage the world of data, we have to rethink the entire training, validation, and prediction pipeline for high dimensions.
- At the very least it means we MUST be able to train and validate a SINGLE fit. We cannot retrain repeatedly in a high dimensional space and pretend that it is always the same “thing” under discussion.
- High dimensional data spaces crossed with high dimensional function spaces give you a search space that is effectively infinite in comparison to even modern computing power. We MUST be able to track successful locations in that space and re-initialize the search in those corners as new data comes in. We cannot afford to “mine” random seeds hoping to luck into that nice fit from last month that keyed in on XYZ feature.
- Whether we are bagging solutions or not, we MUST have tools and data/model structures that keep ALL fits that pass predetermined criteria and
  - monitor their success moving forward
  - search intelligently around those solutions that survive as new data comes in
  - analyze the solutions to look for meaning in the sections of parameter space that yield a high density of fits. What does the geometry of passing prediction functions tell us?
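One way to picture the bookkeeping this list asks for is a small registry of fits. The sketch below is hypothetical throughout: the names TrackedFit and FitRegistry, the two score thresholds, and the use of a plain mean of live scores are assumptions chosen for illustration, and a real system would need persistence and a richer notion of "location in parameter space".

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class TrackedFit:
    fit_id: str
    parameters: Tuple[float, ...]  # where in parameter space this fit lives
    seed: int                      # stored so the fit never has to be "mined" again
    validation_score: float
    live_scores: List[float] = field(default_factory=list)


class FitRegistry:
    """Keeps every fit that passes a predetermined bar and tracks it afterwards."""

    def __init__(self, min_validation_score: float, min_live_score: float):
        self.min_validation_score = min_validation_score
        self.min_live_score = min_live_score
        self._fits: Dict[str, TrackedFit] = {}

    def submit(self, fit: TrackedFit) -> bool:
        """Store the fit only if it meets the validation criterion."""
        if fit.validation_score < self.min_validation_score:
            return False
        self._fits[fit.fit_id] = fit
        return True

    def record_live(self, fit_id: str, score: float) -> None:
        """Append a score observed on data that arrived after training."""
        self._fits[fit_id].live_scores.append(score)

    def survivors(self) -> List[TrackedFit]:
        """Fits with no live history yet, or whose mean live score still passes."""
        return [
            f for f in self._fits.values()
            if not f.live_scores or mean(f.live_scores) >= self.min_live_score
        ]

    def restart_points(self) -> List[Tuple[float, ...]]:
        """Parameter-space locations from which to re-initialize the next search."""
        return [f.parameters for f in self.survivors()]
```

The list returned by restart_points is also the raw material for the geometry question: it is the set of places in parameter space where passing fits have been found.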
Validation is where you find truth. It’s where you make your money. It’s how you know you are not fooling yourself. Own it. If you put all your learning results through 2-sigma validation, but your algorithms never last more than a few months, something is wrong.
- We have to track how real-life performance compares with the predictions of validation. This is the only way we can prove the existence of “future leaks” in the dataset.
- We also MUST have automated tools to supply various monkey-with-typewriter benchmarks for validation. Can your pipeline beat all of the following Dummies?
  - The Actuary – A Dummy that guesses with the correct historical frequency.
  - The Weatherman – same prediction as last period.
  - The Stockbroker – same growth rate as last period.
  - The VC – same acceleration (growth rate of growth rate) as last period.
These are not questions to be left to a good data scientist to ask; these MUST be standard benchmarks that outputs are measured against.
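The four Dummies are simple enough to write down. What follows is a rough sketch under stated assumptions: the series is a list of nonzero numbers, "growth rate" is read as the ratio between consecutive values, the Actuary is interpreted as drawing a past value at random, and the error measure (walk-forward mean absolute error) and all function names are hypothetical choices, not an established benchmark suite.

```python
import random
from typing import Callable, Dict, Sequence, Tuple

Predictor = Callable[[Sequence[float]], float]


def make_actuary(seed: int = 0) -> Predictor:
    """Draws a past value at random, so each outcome is guessed at its historical frequency."""
    rng = random.Random(seed)
    return lambda history: rng.choice(list(history))


def weatherman(history: Sequence[float]) -> float:
    """Same value as last period."""
    return history[-1]


def stockbroker(history: Sequence[float]) -> float:
    """Same growth rate (ratio of consecutive values) as last period."""
    growth = history[-1] / history[-2]
    return history[-1] * growth


def vc(history: Sequence[float]) -> float:
    """Same acceleration (growth rate of the growth rate) as last period."""
    growth = history[-1] / history[-2]
    previous_growth = history[-2] / history[-3]
    acceleration = growth / previous_growth
    return history[-1] * growth * acceleration


def walk_forward_mae(predict: Predictor, series: Sequence[float], warmup: int = 3) -> float:
    """Predict each point from the points before it and average the absolute error."""
    errors = [abs(predict(series[:t]) - series[t]) for t in range(warmup, len(series))]
    return sum(errors) / len(errors)


def beats_all_dummies(
    candidate: Predictor, series: Sequence[float], warmup: int = 3
) -> Tuple[bool, Dict[str, float]]:
    """Return whether the candidate has strictly lower error than every Dummy."""
    dummies: Dict[str, Predictor] = {
        "actuary": make_actuary(),
        "weatherman": weatherman,
        "stockbroker": stockbroker,
        "vc": vc,
    }
    dummy_mae = {name: walk_forward_mae(d, series, warmup) for name, d in dummies.items()}
    candidate_mae = walk_forward_mae(candidate, series, warmup)
    return all(candidate_mae < m for m in dummy_mae.values()), dummy_mae
```

On a series that doubles every period, for example, a candidate that extrapolates linearly beats the Weatherman but loses to the Stockbroker, which is exactly the kind of result that should be reported automatically rather than discovered by accident.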
Ascend, You Are the Algorithm
Your entire pipeline from data provider to published predictions – all of it – IS the Algorithm. We must work to make it consistent, so that we can validate and improve the entire process from start to finish.
- We MUST have system-level validation, tracking of published algorithms and a process for recognizing when published prediction functions need to be retired.
- Did your average lifetime of a predictive algorithm drop from 4 months to 1 month on all algorithms that predict weekly or shorter, but remain steady on everything that predicts monthly or longer? You don’t know, but what if you did? What would that say about the data provider you just added? How much heartache could you stop if you caught that information leak early?
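The lifetime question above becomes answerable once publication and retirement dates are recorded. Here is a small sketch of that calculation; the RetiredFunction record, the bucket boundaries (7 days or less, 28 days or more), and the before/after split on a single change date are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import mean
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class RetiredFunction:
    horizon_days: int  # how far ahead the function predicted
    published: date
    retired: date


def bucket(horizon_days: int) -> str:
    """Group prediction horizons; the cut-offs are illustrative only."""
    if horizon_days <= 7:
        return "weekly_or_shorter"
    if horizon_days >= 28:
        return "monthly_or_longer"
    return "in_between"


def mean_lifetimes(
    records: Sequence[RetiredFunction], change_date: date
) -> Dict[Tuple[str, str], float]:
    """Mean lifetime in days per (horizon bucket, before/after the pipeline change)."""
    groups: Dict[Tuple[str, str], List[int]] = {}
    for r in records:
        era = "before" if r.published < change_date else "after"
        lifetime = (r.retired - r.published).days
        groups.setdefault((bucket(r.horizon_days), era), []).append(lifetime)
    return {key: mean(days) for key, days in groups.items()}


# Hypothetical history: a data provider was added on 2021-01-01.
change = date(2021, 1, 1)
records = [
    RetiredFunction(7, date(2020, 3, 1), date(2020, 3, 1) + timedelta(days=120)),
    RetiredFunction(1, date(2020, 6, 1), date(2020, 6, 1) + timedelta(days=120)),
    RetiredFunction(7, date(2021, 2, 1), date(2021, 2, 1) + timedelta(days=30)),
    RetiredFunction(30, date(2020, 3, 1), date(2020, 3, 1) + timedelta(days=120)),
    RetiredFunction(90, date(2021, 2, 1), date(2021, 2, 1) + timedelta(days=120)),
]
lifetimes = mean_lifetimes(records, change)
```

In this made-up history the short-horizon lifetime falls from 120 days to 30 after the change while the long-horizon lifetime holds at 120, which is the pattern that would point a finger at the new data provider.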
Put Humans to Work with VR/AR
Real progress happens when human beings dig into the data. We must leverage VR/AR tech to put our visual cortex into the machine learning chain. This means experimenting with ways to visualize and move through high-dimensional data and fits in a consistent way. Only by becoming virtual explorers of this space can we develop the intuition needed to help the algorithms find better solutions.
Set Prediction Functions Free
Open Source is a beautiful thing. We should fight for a world where data scientists can train prediction functions for real world meaningful things, then open source the on-going predictions. This can only occur if we open source data and infrastructure. It also means agreeing on data/format standards for predictions to allow automated scoring and retirement of prediction functions going forward. The work must live on after training is done.
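As a sketch of what an agreed prediction format might enable, the example below serializes probabilistic predictions as JSON and scores them automatically. The field names, the choice of the Brier score, and the retirement rule are hypothetical; no such standard exists in this post. The 0.25 baseline is the Brier score earned by always predicting 50%.

```python
import json
from statistics import mean
from typing import Dict, Optional, Sequence, Tuple


def make_record(function_id: str, target: str, issued_on: str,
                resolves_on: str, probability: float) -> str:
    """Serialize one published prediction in a shared, machine-readable format."""
    return json.dumps(
        {
            "function_id": function_id,
            "target": target,            # what is being predicted
            "issued_on": issued_on,      # ISO date the prediction was published
            "resolves_on": resolves_on,  # ISO date the outcome becomes known
            "probability": probability,  # predicted probability that the event occurs
        },
        sort_keys=True,
    )


def brier_score(records: Sequence[str],
                outcomes: Dict[Tuple[str, str], int]) -> Optional[float]:
    """Mean squared error between probabilities and 0/1 outcomes; None if nothing has resolved."""
    errors = []
    for raw in records:
        rec = json.loads(raw)
        key = (rec["target"], rec["resolves_on"])
        if key in outcomes:
            errors.append((rec["probability"] - outcomes[key]) ** 2)
    return mean(errors) if errors else None


def should_retire(score: Optional[float], baseline: float = 0.25) -> bool:
    """Retire a function that does no better than always saying 50%."""
    return score is not None and score >= baseline
```

Because the records carry their own resolution dates, anyone holding the outcomes can run the scoring, not just whoever trained the prediction function.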
Two Paths Diverge in a Wood
The current “answer” to these problems relies heavily on human expertise. Far more damning, it requires a combined high level of expertise in both AI/ML and the subject matter under study. There are only two sane approaches in this situation. One is to do away with the data scientist, to work towards the Excel of data science – the no-programming-required tool any MBA can use. A terrifying fraction of the world runs on Excel, so many people are on this path.
The other road is far less traveled, far more difficult and far more important. Only by weakening the need for subject matter experts can we really harness our machines’ ability to scour the world for understanding. We must understand how the process of science really works. Not how we say it works, but the thing that we as scientists actually do to arrive at understanding. We must find a way to teach our algorithms how to assume what we assume and to see what we see. To our machines, we must make available our heuristics and our context.
A wise man once said that data science is statistics cleaved of any reference to significance or confidence. Said another way, it’s pattern recognition without really being concerned about truth. The best of our kind rely on experienced people to ask “are we fooling ourselves?” – and it works. Others simply live in a cycle of forever retraining. When we turn loose the machine with our ability to connect and leap and guess and blur, we cannot send it forth without deep and new tools. For the sake of our sanity, we must teach our algorithms to also constantly ask “Am I fooling myself?”
This will make all the difference in the future of AI.