I’m thinking about starting a competition website for ML efficiency. I’m calling it ML Golf. In a game of ML Golf, you
- find the model with the lowest regularization “score”
- which fits some data perfectly
- within a specified compute budget.
Pick A Competition Class
You competition class determines what type of hardware you get to use and how much time you get to use it.
Actually specifying how much “computation” something takes is too hard. It’s a very ambiguous word. If the hardware changes, then the optimal algorithms can change dramatically. So if you’re playing ML Golf, you have to say upfront which hardware you’ll be using.
Example hardware classes:
- Single GPU
- Single TPU
- TPU Pod
- Unbounded (Distributed Setups)
And then you pick a sub class of how much time you get on that hardware. Note that since each piece of hardware has a maximum energy throughput, time is a proxy for energy usage (but does not measure it exactly).
|SS||< 10 min|
|S||< 1 hour|
|A||< 1 day|
|B||< 1 week|
|C||< 1 month|
That means you’re only playing against people that have the same computational budget as you. You only have to punch within your weight class.
To qualify, you must find a model which has near zero training loss. Why? More below.
The “winner” is the model which qualifies while also minimizing some scoring function specified by the competition.
Example ways to score models:
- Sum of L2 / L1 / nuclear norm of all weight matrices
- Number of non-zero parameters
Why is machine learning computationally expensive?
Someone, somewhere (the domain expert), wants to predict the output of a blackbox function. They can’t see the internals of the function, but they have some previous inputs and outputs of that function, which we call data.
Machine learning is about using data to learn what’s inside the blackbox. That is our job, as the ML theorists.
Learning from data usually involves optimization. We start by searching for a model which minimizes the training loss on the data. But we also want the model to generalize to new inputs, so we add some regularization term which gives our model nice properties which we expect the blackbox function to have. For example, models with low L2-norm weight matrices might generalize better than high L2-norm. Or sparse models might generalize better than dense models.
Optimization is what takes all of our compute money.
Why does deep learning work so well?
In traditional statistical learning theory (STL), we usually think of this as the bias-variance trade-off. When we increase regularization, our models fit the training data worse but generalize better. But in the modern era, we know that deep neural networks can fit the training data nearly perfectly and still generalize well. This renders many STL generalization bounds vacuous.1
So we’re not sure why DL works so well. There are many implicit regularizers present in our optimization algorithms. Does the correct regularization depend on the distribution of the data, or is there a “universal” property that we’re looking for? The goal is not entirely clear.
Why do I have to fit the data perfectly?
The ambiguity around why deep learning works makes it difficult to research how to efficiently do it. I will happily await the day when ML theorists announce either The One True Regularization or a method for picking the best regularization every time. Until then, the field of ML efficiency (and related competitions) need explicit goals.
So what is the current relationship between the domain expert and the machine learning theorist? I believe the domain expert:
- Cleans the training data
- Gives the theorist the data
- Gives the theorist some intuition about properties of the blackbox (regularization)
- Gives the theorist a computational budget on specific hardware
- Evaluates the resulting model
And the ML theorist’s job is to
- Find a model which fits the data perfectly and
- Minimizes the regularization while
- Staying within a fixed compute budget.
Basically, a game of ML Golf is my understanding of the contract between a domain expert and a machine learning theorist.
If the domain expert doesn’t like the performance of the model we give them, they should clean up their data, or ask for a different regularizer.
Practically, the theorist can’t know what is noise in the data and what’s not; she must assume that the data all comes from the blackbox function. And empirically, it seems like we can get good performance while fitting the data perfectly anyway. So let’s not worry about it and just require all the contestants to get a perfect fit.
As for regularization, please let me know when they figure out The One True Regularization. Until then, pick your own.
Why do I have to pick a competition class?
We know that the efficiency of a given machine learning algorithm depends heavily on the hardware that runs it, and that the most efficient hardware/algorithm pairs are those that are co-designed. And in the real world (which is not Google) people have hardware lying around that is considerably less than a pod of TPUs. Real people make do with what they have, and so this is a vital part of the contract.
Can I pre-train? What about inherited hyper-parameters?
Pre-training data is still data about the blackbox function. It can be either likely inputs to the blackbox, past outputs of the blackbox, or input-output pairs. So if you use pre-training, you’re taking advantage of information about the blackbox that other contestants don’t have. So that’s obviously not fair.
Now, if the competition host specifies a pre-training dataset and has some intuitions about how that dataset relates to the blackbox function, then sure, you can pre-train. BUT if you use a pre-trained model you must enter a separate competition class which is specific to that model. Otherwise, the amount of compute you’re actually using is effectively unbounded.
Isn’t this just Kaggle?
Basically, yes. It’s just that we’ve changed the rules a little bit to focus more on efficiency and less on generalization. We also have stricter rules about pre-training.
STL generalization bounds depend on uniform convergence of the training loss to the test loss. They also depend on capacity control of our hypothesis class. In DL, the training loss is zero, which means the bound depends only on capacity control. But capacity control doesn’t know anything about the blackbox function, so how could it tell us anything about generalization? ↩