The Free Lunch of Model Distillation

Aug 25, 2023

Model distillation uses one model (the teacher) to generate training data for a second model (the student). Synthetic data produced this way has been shown to transfer the teacher’s knowledge and significantly improve the student (I prefer to think of it in finance terms as model arbitrage).
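
A minimal sketch of that pipeline in Python. The teacher call is a stub here — any hosted API or local checkpoint could fill it in — and the prompts and file name are illustrative, not from any particular system:

```python
import json
from typing import Callable


def distill_dataset(teacher: Callable[[str], str],
                    prompts: list[str]) -> list[dict]:
    """Query the teacher once per prompt and keep each
    (prompt, completion) pair as a supervised training example
    for the student model."""
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]


def call_teacher(prompt: str) -> str:
    # Stub: replace with a real call to your teacher model
    # (a hosted API or a local checkpoint).
    return "<teacher completion for: " + prompt + ">"


if __name__ == "__main__":
    seeds = [
        "Write a Python function that reverses a string.",
        "Explain the difference between a list and a tuple.",
    ]
    with open("distilled.jsonl", "w") as f:
        for pair in distill_dataset(call_teacher, seeds):
            f.write(json.dumps(pair) + "\n")
```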

Meta released its Code Llama models, a family of LLMs built for code generation on top of Llama. One model described in the paper, “Unnatural Code Llama,” was missing from the downloadable weights. It was trained on synthetic data, and the name most likely nods to the Unnatural Instructions paper, which describes how to generate large training sets with little manual work.
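
The core loop in Unnatural Instructions is easy to sketch: show a model a handful of seed tasks, ask it to invent a new one, then feed accepted outputs back into the seed pool. A minimal illustration; the seed tasks and prompt template below are my own stand-ins, not the paper’s actual prompts:

```python
import random

SEED_TASKS = [
    "Write a SQL query returning the top 5 customers by revenue.",
    "Implement binary search over a sorted list in Python.",
    "Refactor a nested for-loop into a list comprehension.",
]

TEMPLATE = """Here are examples of programming tasks:

{examples}

Invent one new programming task, different from the above:"""


def generation_prompt(seed_tasks: list[str], k: int = 3) -> str:
    """Build a few-shot prompt from k random seeds. Each model
    response that survives filtering (dedup, length checks) can be
    appended to seed_tasks, so the pool grows with little manual work."""
    shots = "\n".join(f"- {t}" for t in random.sample(seed_tasks, k))
    return TEMPLATE.format(examples=shots)


print(generation_prompt(SEED_TASKS))
```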

Model distillation matters both technically and strategically. Proprietary models can have their knowledge and hard-earned capabilities distilled out of them. Even if some providers have rules against distillation (e.g., OpenAI says you cannot use its generated data to train a competing model), there will be ways around them. User-submitted data. Crawled data. Other workarounds. Some of the implications:

  • Data isn’t oil, at least not in the way it was before. If you don’t have any data, you can generate it.
  • Model compression. Using large models to train tiny, hyper-task-specific ones can drastically change the economics of inference (see the sketch after this list).
  • If you have your own data, you can make it go much further by extending it via a synthetic data set.
  • Creation is now cheaper than curation (in some cases).
  • APIs aren’t safe. They can be used against you to harvest your model’s outputs.
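
On the compression point above, here is a rough sketch of the student side: fine-tuning a small open model on distilled (prompt, completion) pairs with Hugging Face Transformers. The model choice, hyperparameters, and the distilled.jsonl format (from the first sketch) are assumptions of mine, not a recipe from the Code Llama paper:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

STUDENT = "gpt2"  # stand-in for any small open model

tokenizer = AutoTokenizer.from_pretrained(STUDENT)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token
model = AutoModelForCausalLM.from_pretrained(STUDENT)

# distilled.jsonl holds {"prompt": ..., "completion": ...} records,
# e.g. the output of the distill_dataset sketch above.
dataset = load_dataset("json", data_files="distilled.jsonl")["train"]


def tokenize(example):
    # Join prompt and completion into one causal-LM training sequence.
    return tokenizer(example["prompt"] + "\n" + example["completion"],
                     truncation=True, max_length=512)


tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student",
                           per_device_train_batch_size=4,
                           num_train_epochs=3),
    train_dataset=tokenized,
    # mlm=False makes the collator copy input_ids into labels,
    # i.e. plain next-token prediction on the teacher's outputs.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm=False),
)
trainer.train()
trainer.save_model("student")
```

The resulting student won’t match the teacher in general, but on a narrow task distribution it can get surprisingly close at a fraction of the inference cost — which is the whole economic point.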