A Hacker's Guide to LLM Optimization

Mar 29, 2023

A bag of tricks to reduce training or inference latency, and the memory and storage requirements, of large language models.

Compress the model

  • Quantization (post-training) — Normalize and round the weights to a lower-precision format; no retraining is needed (first sketch below).
  • Mixed precision — Use a combination of lower-precision (e.g., float16) and higher-precision (e.g., float32) arithmetic to balance speed and accuracy (second sketch below).
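
A minimal sketch of post-training quantization, assuming symmetric per-tensor rounding to int8 (real toolchains typically use per-channel scales and calibration data):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric post-training quantization: scale to the int8 range and round."""
    scale = np.abs(weights).max() / 127.0  # map the largest-magnitude weight to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```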
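
And a mixed-precision training sketch using PyTorch's autocast and GradScaler (assumes a CUDA device; the model and loss are toy placeholders):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid float16 gradient underflow

for _ in range(10):
    x = torch.randn(32, 512, device="cuda")
    opt.zero_grad()
    with torch.cuda.amp.autocast():  # matmuls run in float16, sensitive ops stay in float32
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()    # backprop on the scaled loss
    scaler.step(opt)                 # unscales gradients, then steps the optimizer
    scaler.update()
```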

Fewer computations

  • LoRA (Low-Rank Adaptation of Large Language Models) — Freeze the pretrained weights and train only a low-rank update to each large matrix, which cuts fine-tuning compute and memory. Fine-tuning is faster, and you can share just the LoRA weights (orders of magnitude smaller than a full fine-tuned model). Used often in Stable Diffusion; a sketch follows below.
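
A minimal sketch of the LoRA idea: keep the original weight matrix frozen and learn a low-rank update BA, with rank r much smaller than the layer dimensions (the rank and scaling values here are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable parameters vs. ~590K in the base layer
```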

Prune the model

  • Pruning algorithms determine which weights can be zeroed out and skipped at inference time. For example, SparseGPT claims to prune large GPT-family models to 50% sparsity in one shot, without retraining. A simple magnitude-based sketch follows below.
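
SparseGPT itself solves a layer-wise weight-reconstruction problem; as a stand-in, here is the simplest form of magnitude pruning (zero the smallest 50% of weights by absolute value):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the smallest weights by magnitude; sparse kernels can then skip them."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

w = np.random.randn(1024, 1024)
pruned = magnitude_prune(w, sparsity=0.5)
print("fraction zeroed:", (pruned == 0).mean())  # ~0.5
```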

Restrict the domain (fine-tune a smaller model)

  • Task-specific fine-tuning — Retrain on a smaller dataset specific to the target task; a narrow domain often lets a much smaller model do the job (first sketch below).
  • Model arbitrage — Generate targeted training data with a larger model and use it to train a specialized smaller model (second sketch below).
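
A fine-tuning sketch using the Hugging Face Trainer; the base model, dataset, and hyperparameters are illustrative placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Illustrative choices: a small base model and a small slice of a sentiment dataset.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

dataset = load_dataset("imdb", split="train[:2000]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length"),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```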
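
Model arbitrage is distillation by API: have the big model label examples, then fine-tune the small one on the result. A sketch using the OpenAI chat completions endpoint (the prompt, examples, and file name are made up):

```python
import json
import openai  # assumes OPENAI_API_KEY is set in the environment

unlabeled = ["the battery lasts two days", "screen cracked within a week"]

with open("distilled.jsonl", "w") as f:
    for text in unlabeled:
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Label the review as positive or negative. Reply with one word."},
                {"role": "user", "content": text},
            ],
        )
        label = resp["choices"][0]["message"]["content"].strip().lower()
        f.write(json.dumps({"text": text, "label": label}) + "\n")

# distilled.jsonl now serves as training data for a much smaller classifier.
```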

Dispatch to multiple small models

  • Model ensembles — Combine the outputs of multiple smaller models, each specialized in a sub-task, to improve overall performance. A similarity search on embeddings, or some other heuristic, can decide which models to call (see the sketch below).
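
A sketch of embedding-based dispatch: embed the query once and route it to whichever specialist's centroid is closest. The embedding model and the route names are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # a small, fast embedding model

# One centroid per specialist, computed from example queries it handles well.
routes = {
    "sql_model": embedder.encode(["write a query to join two tables"])[0],
    "code_model": embedder.encode(["fix this python exception"])[0],
    "chat_model": embedder.encode(["what should i cook tonight"])[0],
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dispatch(query: str) -> str:
    q = embedder.encode([query])[0]
    return max(routes, key=lambda name: cosine(q, routes[name]))

print(dispatch("SELECT the top ten customers by revenue"))  # likely "sql_model"
```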

Cache the inputs

  • Cache repeated responses — An exact-match lookup for byte-identical prompts.
  • Cache semantically similar inputs — Embed prompts and reuse a stored response when a new prompt is close enough to a cached one. Both are combined in the sketch below.
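
Both strategies in one sketch: an exact-match dictionary backed by an embedding-similarity fallback. The `embed` and `call_llm` functions and the 0.95 threshold are placeholders:

```python
import numpy as np

exact_cache: dict[str, str] = {}
semantic_cache: list[tuple[np.ndarray, str]] = []  # (embedding, response) pairs

def cached_call(prompt: str, embed, call_llm, threshold: float = 0.95) -> str:
    if prompt in exact_cache:             # 1. exact repeat: free
        return exact_cache[prompt]
    q = embed(prompt)
    for vec, response in semantic_cache:  # 2. semantically similar: also free
        sim = np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec))
        if sim >= threshold:
            return response
    response = call_llm(prompt)           # 3. miss: pay for a model call, then cache it
    exact_cache[prompt] = response
    semantic_cache.append((q, response))
    return response
```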