Date: 2023-03-29

A bag of tricks to reduce training or inference latency, or memory and storage requirements, for large language models.

Compress the model

Quantization (post-training): Scale the weights into a lower-precision range (e.g., int8) and round them. No retraining is needed.
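As a rough illustration of "normalize and round", here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The function names and the sample weights are made up for this example; real toolkits usually quantize per channel or per group and handle outliers more carefully.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is then stored as one byte plus a shared scale, instead of four bytes for float32.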

Mixed precision: Use a combination of lower (e.g., float16) and higher (e.g., float32) precision arithmetic to balance performance and accuracy.
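One reason the higher-precision half matters: small weight updates can vanish entirely in float16. The sketch below uses only the standard `struct` module to round values to float16 and float32. The update size and step count are arbitrary choices for illustration, and this is not how a training framework implements mixed precision.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


update = 1e-4   # a small per-step weight update (illustrative value)
steps = 1000

w_half = to_float16(1.0)     # weight kept only in float16
w_master = to_float32(1.0)   # float32 "master" copy of the same weight

for _ in range(steps):
    # float16 spacing just above 1.0 is 2**-10 (about 0.00098),
    # so adding 1e-4 rounds straight back to 1.0 every time.
    w_half = to_float16(w_half + to_float16(update))
    # float32 has enough precision to accumulate the update.
    w_master = to_float32(w_master + update)

print(w_half, w_master)
```

The float16 weight never moves, while the float32 copy ends up near 1.1. This is why mixed-precision setups commonly run the heavy matrix math in low precision but accumulate weight updates in float32.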

Fewer computations

LoRA (Low-Rank Adaptation of Large Language Models): A method that cuts the cost of fine-tuning by freezing the pretrained weights and training only a small low-rank update to the large weight matrices. Fine-tuning is faster, and you can share just the LoRA weights (orders of magnitude smaller than a fine-tuned model). Often used with Stable Diffusion.
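To make the size argument concrete, here is a minimal sketch of a LoRA-style layer in plain Python. The dimensions, rank, and initialization scales are assumptions picked for illustration, and the scaling factor that LoRA implementations usually apply to the update is omitted.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


d_out, d_in, rank = 64, 64, 4   # illustrative sizes
random.seed(0)

# Frozen pretrained weight matrix (random stand-in here).
W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A starts random, B starts at zero,
# so the update B @ A is zero before any training.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]


def lora_forward(x):
    """Compute W x + B (A x) for a column vector x given as a list of [value] rows."""
    base = matmul(W, x)
    delta = matmul(B, matmul(A, x))
    return [[b[0] + d[0]] for b, d in zip(base, delta)]


full_params = d_out * d_in            # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters updated (and shared) with LoRA
print(full_params, lora_params)
```

With these sizes the adapter has 512 parameters against 4096 for the full matrix, and the gap widens as the matrices grow while the rank stays small.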

Prune the model

Pruning algorithms determine which weights can be dropped (set to zero) at inference time. For example, the SparseGPT authors claim their algorithm can prune models to 50% sparsity without retraining.
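For intuition only, the simplest pruning criterion is weight magnitude: zero out the weights closest to zero. The sketch below is plain magnitude pruning, not SparseGPT's algorithm, and the function name and sample weights are made up.

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical weights for illustration.
weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

Zeros only save time or memory when the storage format or hardware can skip them, which is why the sparsity pattern matters in practice.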

Restrict the domain (fine-tune a smaller model)

Task-specific fine-tuning: Retrain a smaller pretrained model on a dataset specific to the target task. A narrow domain needs less capacity, so the smaller model may be enough.

Model arbitrage: Generate targeted training data from a larger model and use it to train a smaller, task-specific model.
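The data-generation step might look roughly like the sketch below. `fake_teacher` is a hypothetical stand-in for a call to the larger model, and the prompt/completion record layout is an assumption, not a required format.

```python
import json


def build_training_set(prompts, teacher):
    """Label each prompt with the larger model's output."""
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]


def fake_teacher(prompt):
    # Hypothetical stand-in for the large model; a real one would be an API or local call.
    return "positive" if "great" in prompt else "negative"


prompts = [
    "Review: great battery life",
    "Review: screen cracked in a week",
    "Review: great value",
]
records = build_training_set(prompts, fake_teacher)

# One JSON object per line, a common layout for fine-tuning data.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

The resulting file then serves as the fine-tuning set for the smaller model.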

Dispatch to multiple small models

Model ensembles: Combine the outputs of multiple smaller models, each specialized in a sub-task, to improve overall performance. A router might use a similarity search on embeddings, or some other heuristic, to decide which models to call.
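A similarity-based dispatcher could be sketched as follows. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the model names and route descriptions are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Each hypothetical specialist model is described by a short text.
ROUTES = {
    "sql_model": embed("write sql query database table select join"),
    "summary_model": embed("summarize article text short summary"),
}


def route(query):
    """Return the name of the model whose description is most similar to the query."""
    q = embed(query)
    return max(ROUTES, key=lambda name: cosine(q, ROUTES[name]))


print(route("select rows from the database table"))
print(route("give me a short summary of this article"))
```

A production router would likely add a similarity threshold and a fallback model for queries that match no specialist well.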

Cache the inputs
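One way to read this: if the same prompt arrives twice, return the stored answer instead of running the model again. A minimal sketch, assuming exact-match caching after whitespace normalization; `PromptCache` and `fake_model` are hypothetical names, and a real cache would also need eviction and invalidation.

```python
import hashlib


class PromptCache:
    """Wrap a model callable and reuse outputs for repeated prompts."""

    def __init__(self, model):
        self.model = model
        self.store = {}
        self.hits = 0

    def _key(self, prompt):
        # Collapse whitespace so trivially different inputs share one entry.
        normalized = " ".join(prompt.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def __call__(self, prompt):
        key = self._key(prompt)
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = self.model(prompt)
        return self.store[key]


calls = []


def fake_model(prompt):
    # Hypothetical stand-in for an expensive model call.
    calls.append(prompt)
    return prompt.upper()


cached = PromptCache(fake_model)
first = cached("What is LoRA?")
second = cached("  What  is LoRA? ")   # same prompt up to whitespace
print(first, second, len(calls), cached.hits)
```

The second call is served from the cache, so the model runs only once.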