A bag of tricks to reduce training or inference latency, memory use, and storage requirements for large language models.
Compress the model
- Quantization (post-training) — Scale and round the trained weights into a lower-precision format (e.g., int8). No retraining is needed.
- Mixed precision — Using a combination of lower (e.g., float16) and higher (e.g., float32) precision arithmetic to balance performance and accuracy.
- LoRA (Low-Rank Adaptation of Large Language Models) — Reduces fine-tuning cost by freezing the pretrained weights and learning only a low-rank update to the large matrices. Fine-tuning is faster, and you can share just the LoRA weights (orders of magnitude smaller than a fully fine-tuned model). Used often in Stable Diffusion.
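As a minimal sketch of the post-training quantization bullet above: symmetric round-to-nearest int8 quantization with a single per-tensor scale (real quantizers typically use per-channel scales and calibration data):

```python
import numpy as np

def quantize_int8(w):
    # symmetric per-tensor scheme: map the largest |weight| to +/-127
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.42, -1.27, 0.05, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)  # reconstruction error is at most half a quantization step
```

Storage drops 4x versus float32; the accuracy cost depends on how the weight distribution fits the int8 grid.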
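The low-rank idea behind LoRA can be sketched in a few lines: freeze the pretrained matrix W and train only a rank-r update B·A. The shapes, rank, and initialization below are illustrative, not any particular library's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8  # layer shape and LoRA rank (illustrative)

W = rng.standard_normal((d, k)).astype(np.float32)         # frozen pretrained weight
A = rng.standard_normal((r, k)).astype(np.float32) * 0.01  # trainable down-projection
B = np.zeros((d, r), dtype=np.float32)                     # trainable up-projection, zero-init

def lora_forward(x):
    # base path plus low-rank update: W x + B (A x); only A and B are trained
    return W @ x + B @ (A @ x)

# the shareable adapter is just A and B: 2*d*r values versus d*k for the full matrix
```

With B initialized to zero, the adapted layer starts out identical to the pretrained one, and the adapter here is 8192 values versus 262144 for the full matrix.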
Prune the model
- Pruning algorithms identify weights that can be zeroed out with little loss of accuracy, so they can be skipped at inference time. For example, SparseGPT claims its one-shot algorithm can prune large GPT-family models to 50% sparsity without retraining.
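To illustrate the idea only, here is the simplest baseline, magnitude pruning (not SparseGPT's actual solver, which corrects the remaining weights as it prunes): zero out the smallest-magnitude fraction of the weights:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    # zero out the `sparsity` fraction of weights with the smallest magnitude
    # (assumes 0 <= sparsity < 1)
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k]
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.array([0.1, -0.5, 0.3, -0.9, 0.2, 0.7, -0.05, 0.4], dtype=np.float32)
pruned, mask = magnitude_prune(w)  # half the entries become exact zeros
```

The resulting zeros only pay off if the storage format or inference kernels actually exploit the sparsity pattern.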
Restrict the domain (fine-tune a smaller model)
- Task-specific fine-tuning — Retrain a smaller model on a dataset specific to the target task, trading general capability for reduced size and complexity.
- Model arbitrage — Generate targeted training data from a larger model to train a specific smaller model.
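A rough sketch of the model-arbitrage loop: collect (input, teacher output) pairs for supervised fine-tuning of the small model. `query_teacher` is a hypothetical placeholder standing in for any large-model API call, and the sentiment task is an invented example:

```python
# `query_teacher` is a placeholder, not a real library function; in practice
# it would call the large model's API.
def query_teacher(prompt: str) -> str:
    return "LABEL"  # stand-in for the teacher model's response

def build_distillation_set(seed_inputs):
    """Generate (input, teacher_output) pairs to fine-tune a small student model."""
    dataset = []
    for text in seed_inputs:
        prompt = f"Classify the sentiment of: {text}"  # task-specific template
        dataset.append({"input": text, "target": query_teacher(prompt)})
    return dataset

pairs = build_distillation_set(["great movie", "terrible service"])
```

The student then trains only on this narrow distribution, which is what lets it be much smaller than the teacher.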
Dispatch to multiple small models
- Model ensembles — Combining the outputs of multiple smaller models, each specialized in a sub-task, to improve overall performance. A similarity search on embeddings, or some other heuristic, can determine which models to call.
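A toy version of the embedding-based dispatch: route each query to the specialist whose centroid embedding is closest. The bag-of-words `embed` function and the two specialist names are illustrative stand-ins for a real embedding model and real models:

```python
import numpy as np

VOCAB = ["code", "bug", "function", "law", "contract", "court"]

def embed(text):
    # toy bag-of-words embedding; a real router would use a learned encoder
    v = np.array([text.lower().count(w) for w in VOCAB], dtype=np.float32)
    n = np.linalg.norm(v)
    return v / n if n else v

# one centroid per specialist, built here from a single example query each
specialists = {
    "code-model": embed("fix a bug in this function code"),
    "legal-model": embed("review this contract before court under the law"),
}

def dispatch(query):
    scores = {name: float(embed(query) @ c) for name, c in specialists.items()}
    return max(scores, key=scores.get)  # route to the nearest specialist

dispatch("why does this function have a bug")  # routes to "code-model"
```

In practice the centroids would come from many example queries per specialist, and a fallback model would handle queries far from every centroid.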
Cache the inputs
- Cache repeated responses
- Cache semantically similar inputs
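Both caching bullets can be sketched with one structure: store (embedding, response) pairs and return a cached response when a new query's embedding is close enough. `toy_embed` and the similarity threshold are illustrative stand-ins; a real system would use a sentence encoder and a vector index:

```python
import numpy as np

WORDS = ["cost", "hello", "hi", "how", "much", "price"]

def toy_embed(text):
    # toy bag-of-words embedding; a stand-in for a real sentence encoder
    v = np.array([text.lower().count(w) for w in WORDS], dtype=np.float32)
    n = np.linalg.norm(v)
    return v / n if n else v

class SemanticCache:
    def __init__(self, embed, threshold=0.8):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # (embedding, response) pairs

    def get(self, query):
        q = self.embed(query)
        for emb, response in self.entries:
            if float(q @ emb) >= self.threshold:  # cosine similarity of unit vectors
                return response  # hit: skip the expensive model call
        return None  # miss: caller invokes the model, then calls put()

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

cache = SemanticCache(toy_embed)
cache.put("how much does it cost?", "It costs $10.")
cache.get("how much would it cost")  # hit: semantically similar phrasing
```

A threshold of 1.0 degenerates to exact-match caching of repeated responses; lowering it trades correctness risk for a higher hit rate.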