All The Ways You Can Compress BERT
Model compression reduces redundancy in a trained neural network. This is useful, since BERT barely fits on a GPU (BERT-Large does not) and definitely won’t fit on your smartphone. Improved memory and inference efficiency can also save costs at scale.
In this post I’ll list and briefly taxonomize all the papers I’ve seen compressing BERT. Don’t see yours? Feel free to shoot me an email.
Update 3/3/20: A survey paper of BERT compression methods has been released by Ganesh et al.
Methods
Pruning - Removes unnecessary parts of the network after training. This includes weight magnitude pruning, attention head pruning, layer pruning, and others. Some methods also impose regularization during training to make the network more prunable (e.g., LayerDrop).
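As a concrete illustration, here is a minimal sketch of unstructured weight-magnitude pruning in PyTorch. The function name and thresholding scheme are my own for illustration, not taken from any particular paper:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries of a weight matrix.

    `sparsity` is the fraction of weights to remove (0.6 keeps the largest 40%).
    """
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    # Threshold at the k-th smallest absolute value.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask

# Prune 60% of a BERT-Base-sized feed-forward matrix (768 x 3072).
w = torch.randn(768, 3072)
w_pruned = magnitude_prune(w, sparsity=0.6)
print((w_pruned == 0).float().mean())  # ~0.60
```

Note that the resulting sparsity is unstructured, which is why it rarely translates into wall-clock speed-ups on GPUs (see footnote 1).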
Weight Factorization - Approximates parameter matrices by factorizing them into a product of two smaller matrices. This imposes a low-rank constraint on the matrix. Weight factorization can be applied either to token embeddings (which saves a lot of memory on disk) or to parameters in feed-forward / self-attention layers (for some speed improvement).
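A rough sketch of the idea, using an SVD of a random "embedding" matrix to build the two factors. In ALBERT the factors are learned directly rather than computed post hoc, and the shapes below are only illustrative:

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Approximate a (d_out x d_in) matrix as A @ B with A: d_out x rank, B: rank x d_in."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # fold the singular values into A
    B = Vh[:rank, :]
    return A, B

# Factor a 30k-vocab x 768-hidden embedding into 30k x 128 and 128 x 768,
# roughly a 5x reduction in embedding parameters.
E = torch.randn(30000, 768)
A, B = low_rank_factorize(E, rank=128)
print(A.shape, B.shape, ((A @ B) - E).norm() / E.norm())
```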
Knowledge Distillation - Aka “Student Teacher.” Trains a much smaller Transformer from scratch on the pre-training / downstream data. Normally this would fail, but utilizing soft labels from a full-sized model improves optimization for reasons not fully understood. Some methods also distill BERT into different architectures (LSTMs, etc.) which have faster inference times. Others dig deeper into the teacher, matching not just the output but also weight matrices and hidden activations.
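The core soft-label objective looks roughly like this (a generic distillation loss; the temperature and mixing weight below are placeholder values, and individual papers differ in the details):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix cross-entropy on hard labels with KL divergence between
    temperature-softened teacher and student distributions."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # T**2 keeps the soft-label gradients on a comparable scale as T changes.
    kd = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * (T ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy example: batch of 8, 3 classes (e.g. an MNLI-style task).
student_logits = torch.randn(8, 3, requires_grad=True)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```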
Weight Sharing - Some parameters in the model share their values with other parameters. For example, ALBERT reuses the same weight matrices for every Transformer layer in BERT, rather than learning a separate set per layer.
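A minimal sketch of cross-layer sharing, using PyTorch’s generic nn.TransformerEncoderLayer as a stand-in for a BERT layer (the class name and dimensions are illustrative, not ALBERT’s actual implementation):

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Run the *same* layer num_layers times instead of stacking distinct
    layers, so the parameter count no longer grows with depth."""
    def __init__(self, hidden=768, heads=12, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)  # identical weights at every depth
        return x

# 12 "layers" of compute, but only one layer's worth of parameters.
model = SharedLayerEncoder()
out = model(torch.randn(2, 128, 768))  # (batch, seq_len, hidden)
```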
Quantization - Truncates floating-point numbers so they use only a few bits (which causes round-off error). The quantization values can also be learned, either during or after training.
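A bare-bones sketch of post-training symmetric 8-bit quantization, with a single floating-point scale per matrix (real schemes are usually per-channel and/or learned):

```python
import torch

def quantize_8bit(weight: torch.Tensor):
    """Symmetric linear quantization: store int8 values plus one fp32 scale."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp((weight / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(768, 768)
q, scale = quantize_8bit(w)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())  # round-off error is at most ~scale / 2
```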
Pre-train vs. Downstream - Some methods only compress BERT w.r.t. certain downstream tasks. Others compress BERT in a way that is task-agnostic.
Papers
Comparison of Results
We’re just going to do our best here, and report whatever the papers claim. Mainly, we’ll look at parameter reduction, inference speed-up,[1] and accuracy.[2][3]
If you’re looking for practical winners, I would go with ALBERT, DistilBERT, MobileBERT, Q-BERT, LayerDrop, RPP. You might be able to stack some of these methods.[4] But some of the pruning papers are more scientific than practical, so maybe check out those, too.
If you found this post useful, please consider citing it as:
Bonus Papers / Blog Posts
Sparse Transformer: Concentrated Attention Through Explicit Selection
Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks
Adaptively Sparse Transformers
Compressing BERT for Faster Prediction
Update 11/19/19: Bibtex and bonus papers
Update 11/24/19: Added section with comparison of results
Update 11/25/19: Added “Attentive Student Meets…”
1. Note that not all compression methods make models faster. Unstructured pruning is notoriously difficult to speed up via GPU parallelization. One of the papers claims that in Transformers, computation time is dominated by the softmax rather than by matrix multiplication.
2. It would be nice if we could come up with a single number to capture what we really care about, like F1.
3. Some of these percentages are measured against BERT-Large instead of BERT-Base, FYI.
4. How different compression methods interact is an open research question.