Lessons from llama.cpp

Nov 3, 2023

Llama.cpp is a C/C++ implementation of Meta's LLaMA architecture, and it anchors one of the most active open-source communities around LLM inference.

Why did llama.cpp become the Schelling point around LLM inference? Why not the official Python implementation by Meta? Why not something written in TensorFlow, PyTorch, or another machine learning framework rather than a bespoke one?

Runs everywhere. Llama.cpp was originally a CPU-only library. CPU-only meant orders of magnitude less code to work with. Writing it in C/C++ also meant it could be easily imported into higher-level languages via bindings: Go bindings power ollama (because Go is one of the easiest languages to write a good CLI tool in). Support for Apple Silicon and GPU backends came later. But CPU-first was clearly the best way to get llama.cpp into the hands of developers quickly (and in as many places as possible).
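
To see why a plain C interface travels so well, here is a minimal sketch of the kind of header an FFI layer (cgo, ctypes, and so on) can bind directly. The names are hypothetical, not llama.cpp's actual API:

    /* llama_mini.h: a hypothetical, simplified C ABI in the spirit of
     * llama.cpp's llama.h. Plain functions and opaque structs (no classes,
     * templates, or exceptions) are what make thin, one-file bindings
     * possible from Go, Python, Node, and nearly everything else. */
    #ifdef __cplusplus
    extern "C" {
    #endif

    typedef struct llama_mini_model llama_mini_model; /* opaque handle */

    llama_mini_model *llama_mini_load(const char *path);

    /* Write up to n bytes of completion into out; return bytes written,
     * or a negative value on error. */
    int llama_mini_generate(llama_mini_model *m, const char *prompt,
                            char *out, int n);

    void llama_mini_free(llama_mini_model *m);

    #ifdef __cplusplus
    }
    #endif

Opaque pointers plus free functions are the whole trick: every mainstream language's FFI can call an interface like this without knowing anything about C++.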

Schelling point for low-level features. Just as LangChain subsumed every high-level LLM feature (like chain-of-thought and RAG), llama.cpp has done the same for low-level features. The constrained-decoding ideas behind ReLLM and ParserLLM found their way into llama.cpp (see this initial PR; for what it's worth, they are in LangChain as well). It's hard to know what will be important, so many features end up in the library. Over time, some of these will be difficult to maintain and will probably need to find a new home.
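
As a concrete example, the grammar-based sampling that landed in llama.cpp takes a GBNF grammar file and masks out any token that would violate it during sampling. A sketch of what such a grammar looks like (contents illustrative):

    # answer.gbnf: constrain output to a tiny JSON object
    root  ::= "{" ws "\"answer\":" ws value ws "}"
    value ::= "\"yes\"" | "\"no\""
    ws    ::= [ \t\n]*

Passed to the main example via --grammar-file, this guarantees syntactically valid output regardless of what the model would otherwise prefer to say.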

Custom model format (“library lock-in”). GGML/GGUF is a custom file format for quantized models. Quantization is a one-way transformation: you can dequantize back to floats, but you can never recover the original full-precision weights. GGML models only work with llama.cpp (although it's all open-source, so you could write your own runtime). A bespoke format was a necessary development (since llama.cpp doesn't build on something like PyTorch), but it had some strategic implications as well.
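
To make the one-way nature concrete, here is a simplified sketch of blockwise 4-bit quantization in the spirit of GGML's Q4 formats. The names are illustrative; real GGML packs two 4-bit codes per byte, stores fp16 scales, and uses hand-written SIMD kernels:

    /* Simplified blockwise 4-bit quantization, in the spirit of GGML's Q4_0. */
    #include <math.h>
    #include <stdint.h>

    #define BLOCK 32

    typedef struct {
        float   scale;      /* per-block scale factor */
        uint8_t q[BLOCK];   /* 4-bit codes, one per byte here for clarity */
    } block_q4;

    static block_q4 quantize(const float *x) {
        float amax = 0.0f;  /* largest magnitude in the block */
        for (int i = 0; i < BLOCK; i++) {
            float a = fabsf(x[i]);
            if (a > amax) amax = a;
        }
        block_q4 b;
        b.scale = amax > 0.0f ? amax / 7.0f : 1.0f; /* map [-amax, amax] to [-7, 7] */
        for (int i = 0; i < BLOCK; i++) {
            int v = (int)lroundf(x[i] / b.scale);   /* round-to-nearest: the bits */
            b.q[i] = (uint8_t)(v + 8);              /* discarded here are gone    */
        }
        return b;
    }

    static void dequantize(const block_q4 *b, float *out) {
        for (int i = 0; i < BLOCK; i++)
            out[i] = b->scale * ((int)b->q[i] - 8); /* an approximation only */
    }

Round-to-nearest discards the low-order bits of every weight, so dequantize recovers only an approximation. That is why you can't get back to the original fp16 checkpoint from a quantized GGML file.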

Bet on the right horse (LLaMA). While other libraries like Hugging Face Transformers are general purpose, llama.cpp was able to focus on a single model architecture. That focus enabled all sorts of optimizations. GGML only worked for LLaMA models (until GGUF, its replacement, came along). The developer, Georgi Gerganov, had built a similar project, whisper.cpp, a few months earlier for OpenAI's Whisper speech-to-text model; it was successful, but not on the same scale.