Optimizing $Model.cpp

Jul 25, 2023

llama.cpp and llama.c are both libraries to perform fast inference for Llama-based models. They are written in low-level languages (C++/C) and use quantization tricks, device hardware, and precision hacks to run Llama faster (although thoroughly researched benchmarks are hard to come by). We don’t know the popularity of libraries like llama.cpp in production. Still, they have captured the zeitgeist, with Shopify CEO Tobi Lütke and Meta CEO Mark Zuckerberg mentioning llama.cpp in interviews (Tobi has even sent a few PRs).

But speed doesn’t come for free; the major tradeoff is extensibility. The obvious ones:

  • You can’t inference different types of models outside the Llama family. Something like the Huggingface Transformers library will always support more models (probably) at the cost of being slower.
  • Llama.cpp was initially only CPU-bound. While that’s great for running inference in more places (embedded, etc.), it isn’t great for running inference fast at scale (for which you most likely want to use accelerators like GPUs). It now supports more devices, but at the cost of being just as complex (if not more) than competing libraries.
  • Needs its own class of debugging tools vs. using more generic layers like PyTorch and Transformers by Huggingface. This isn’t always bad — sometimes, erasing the assumptions made for a previous generation of models can lead to significantly easier stacks. Although it’s hard to do something like this without corporate support (e.g., Meta or Huggingface).


  • Will Llama-family models become ubiquitous enough to make a Llama serving (or training) layer a real abstraction for LLMs? I don’t think that’s completely unlikely.
  • Quantized model formats like GGML are lossy, meaning they are injective mappings from formats like PyTorch (i.e., you can’t convert GGML to PyTorch without losing information). Not necessarily a bad thing, but where do network effects accrue? Especially as new methods emerge, lossless, “universal donor” models won’t go anywhere. Quantization methods aren’t standardized yet.
  • A corollary — does quantization hold across different models? Across Llama-family models? I don’t know the research on this one.
  • How debuggable are these libraries? Two arguments: It’s easier to debug C++ directly than C++ or Rust embedded within Python (or vice versa?). Especially as these libraries delve deeper into device acceleration (e.g., GPUs), I imagine debugging PyTorch layers would be easier than bespoke and specialized C++.
Daily posts on startups, engineering, and AI