All The Ways You Can Compress BERT
Model compression reduces redundancy in a trained neural network. This is useful, since BERT barely fits on a GPU (BERT-Large does not) and definitely won’t fit on your smartphone. Better memory efficiency and faster inference can also save costs at scale.
In this post I’ll list and briefly taxonomize all the papers I’ve seen compressing BERT. Don’t see yours? Feel free to shoot me an email.
Methods
Pruning  Removes unnecessary parts of the network after training. This includes weight magnitude pruning, attention head pruning, layer pruning, and others. Some methods also impose regularization during training to make the network easier to prune (e.g., layer dropout).
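To make weight magnitude pruning concrete, here is a minimal NumPy sketch. It is not taken from any of the papers listed here: the function name `magnitude_prune` and the 50% sparsity level are illustrative assumptions, and real methods usually prune iteratively and fine-tune afterwards.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of entries with the smallest |w|.

    Hypothetical helper for illustration; returns the pruned copy and
    the boolean mask of weights that were kept.
    """
    k = int(round(sparsity * weights.size))  # number of entries to drop
    mask = np.ones(weights.size, dtype=bool)
    if k > 0:
        # Indices of the k smallest-magnitude entries in the flat matrix.
        drop = np.argpartition(np.abs(weights).ravel(), k - 1)[:k]
        mask[drop] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
pruned, mask = magnitude_prune(w, sparsity=0.5)
print(f"kept {mask.sum()} of {mask.size} weights")
```

The result is unstructured sparsity: the matrix keeps its shape, so this saves storage only if the zeros are stored in a sparse format, and it does not by itself make a dense matrix multiply any faster.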
Weight Factorization  Approximates parameter matrices by factorizing them into a product of two smaller matrices. This imposes a low-rank constraint on the matrix. Weight factorization can be applied both to token embeddings (which saves a lot of memory on disk) and to parameters in feed-forward / self-attention layers (for some speed improvements).
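One simple way to get such a factorization is a truncated SVD. The sketch below assumes that approach, which is not necessarily what any particular paper does; the helper name `low_rank_factorize`, the 256×256 matrix, and the rank of 32 are all made up for illustration.

```python
import numpy as np

def low_rank_factorize(weight, rank):
    """Return A (m x rank) and B (rank x n) so that A @ B approximates weight.

    Hypothetical helper: keeps the `rank` largest singular values.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # fold singular values into the left factor
    b = vt[:rank, :]
    return a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
a, b = low_rank_factorize(w, rank=32)

params_before = w.size
params_after = a.size + b.size
rel_error = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
print(params_before, params_after, float(rel_error))
```

A random Gaussian matrix like this one is close to full rank, so the reconstruction error here is large. The technique pays off only when the trained weights really are close to low rank, or when the model is trained with the factorized form from the start.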
Knowledge Distillation  Aka “Student Teacher.” Trains a much smaller Transformer from scratch on the pretraining / downstream data. Normally this would fail, but utilizing soft labels from a full-sized model improves optimization for unknown reasons. Some methods also distill BERT into different architectures (LSTMs, etc.) which have faster inference times. Others dig deeper into the teacher, looking not just at the output but at weight matrices and hidden activations.
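The "soft labels" part can be sketched as a loss that mixes the teacher's softened output distribution with the ordinary hard-label loss. This is a generic version under stated assumptions, not the exact objective of any paper listed here: the names `distillation_loss`, `temperature`, and `alpha` are illustrative, and some formulations additionally scale the soft term by the squared temperature.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """alpha * soft-label cross-entropy + (1 - alpha) * hard-label cross-entropy."""
    # Soft targets: the teacher's distribution, flattened by the temperature.
    soft_targets = softmax(teacher_logits, temperature)
    log_student_soft = np.log(softmax(student_logits, temperature))
    soft_loss = -(soft_targets * log_student_soft).sum(axis=-1).mean()
    # Hard targets: the usual cross-entropy against the gold labels.
    log_student = np.log(softmax(student_logits))
    hard_loss = -log_student[np.arange(len(labels)), labels].mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss

teacher = np.array([[4.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
labels = np.array([0, 1])
agreeing_student = teacher.copy()
disagreeing_student = np.array([[0.0, 1.0, 4.0], [3.0, 0.0, 1.0]])

print(distillation_loss(agreeing_student, teacher, labels))
print(distillation_loss(disagreeing_student, teacher, labels))
```

A student that matches the teacher gets the lower loss. In actual training this scalar would be backpropagated through the student only, with the teacher held fixed.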
Weight Sharing  Some weights in the model share the same value as other parameters in the model. For example, ALBERT uses the same weight matrices for every self-attention layer in BERT.
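Cross-layer sharing is easy to see in a toy model where every layer is a single weight matrix. The sketch below is only an illustration of the idea: `TinyLayer`, `build_stack`, and `count_unique_params` are hypothetical names, and a real Transformer layer has many more parameters than this.

```python
import numpy as np

class TinyLayer:
    """Stand-in for a Transformer layer: one square weight matrix."""
    def __init__(self, weight):
        self.weight = weight

    def __call__(self, x):
        return np.tanh(x @ self.weight)

def build_stack(num_layers, hidden, share, rng):
    """Build a stack whose layers either share one matrix or own their own."""
    if share:
        w = rng.normal(scale=0.1, size=(hidden, hidden))
        return [TinyLayer(w) for _ in range(num_layers)]  # same array object
    return [TinyLayer(rng.normal(scale=0.1, size=(hidden, hidden)))
            for _ in range(num_layers)]

def count_unique_params(layers):
    """Count parameters, counting each distinct array only once."""
    unique = {id(layer.weight): layer.weight.size for layer in layers}
    return sum(unique.values())

rng = np.random.default_rng(0)
shared = build_stack(num_layers=12, hidden=16, share=True, rng=rng)
unshared = build_stack(num_layers=12, hidden=16, share=False, rng=rng)
print(count_unique_params(shared), count_unique_params(unshared))

x = rng.normal(size=(2, 16))
for layer in shared:  # the forward pass still applies all 12 layers
    x = layer(x)
```

Sharing cuts the number of stored parameters by the number of layers in this toy case, but the forward pass still runs every layer, so it does not reduce compute on its own.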
Quantization  Truncates floating-point numbers to only use a few bits (which causes round-off error). The quantization values can also be learned either during or after training.
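As a rough illustration of the round-off trade-off, here is symmetric linear quantization of float32 weights to 8-bit integers with a single scale per matrix. This is one simple scheme chosen for the example, not the method of any paper in this post, and `quantize_int8` and `dequantize` are hypothetical helper names.

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 using one shared scale (symmetric scheme)."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid dividing by zero
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_error = float(np.abs(w - w_hat).max())
print(w.nbytes, q.nbytes, max_error, scale / 2)
```

Storage drops by 4x relative to float32, and the round-off error per weight is bounded by about half the scale. Learned quantization, as mentioned above, would instead tune the scale (or the set of quantization values) against a training objective.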
Pretrain vs. Downstream  Some methods only compress BERT w.r.t. certain downstream tasks. Others compress BERT in a way that is task-agnostic.
Papers
Comparison of Results
We’re just going to do our best here, and report whatever the papers claim. Mainly, we’ll look at parameter reduction, inference speedup^{1}, and accuracy.^{2}^{3}
If you’re looking for practical winners, I would go with ALBERT, DistilBERT, MobileBERT, Q-BERT, LayerDrop, and RPP. You might be able to stack some of these methods.^{4} But some of the pruning papers are more scientific than practical, so maybe check out those, too.
Bonus Papers / Blog Posts
Sparse Transformer: Concentrated Attention Through Explicit Selection
Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks
Adaptively Sparse Transformers
Compressing BERT for Faster Prediction
Update 11/19/19: Bibtex and bonus papers
Update 11/24/19: Added section with comparison of results
Update 11/25/19: Added “Attentive Student Meets…”

^{1} Note that not all compression methods make models faster. Unstructured pruning is notoriously difficult to speed up via GPU parallelization. One of the papers claims that in Transformers, the computation time is dominated by the softmax computation, rather than matrix multiplication.

^{2} It would be nice if we could come up with a single number to capture what we really care about. Like F1.

^{3} Some of these percentages are measured against BERT-Large instead of BERT-Base, FYI.

^{4} How different compression methods interact is an open research question.