The RLHF Advantage

Jul 21, 2023
We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF.  — Llama 2

Reinforcement learning from human feedback (RLHF) is one of the most effective techniques for aligning large language model (LLM) behavior with human preferences and instruction following. In its simplest form, human preferences are sampled by having annotators choose which of two model outputs they prefer; that feedback is then used to train a reward model.
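
To make the mechanics concrete, here is a minimal sketch of how pairwise preference data is typically turned into a reward model: score both responses with the same model and push the chosen response's score above the rejected one's. This is an illustrative toy, not Meta's or OpenAI's implementation; the linear "reward head" over pooled embeddings stands in for a full fine-tuned LLM backbone.

```python
# Minimal pairwise reward-model training sketch (illustrative, not any lab's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # In practice this is an LLM scoring prompt + response tokens;
        # here a linear head over a pooled embedding stands in for it.
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, embed_dim) representation of prompt + response
        return self.score(pooled).squeeze(-1)  # (batch,) scalar rewards

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: maximize P(chosen > rejected) = sigmoid(r_c - r_r)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy training step on random embeddings standing in for annotated preference pairs.
model = RewardModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

chosen_emb = torch.randn(8, 768)    # embeddings of preferred responses
rejected_emb = torch.randn(8, 768)  # embeddings of non-preferred responses

loss = preference_loss(model(chosen_emb), model(rejected_emb))
opt.zero_grad()
loss.backward()
opt.step()
```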

OpenAI uses RLHF extensively in its models and spends significant sums on human annotators. Now Meta is doing the same with Llama 2. Interestingly, Meta does not even appear to have hit the limits of RLHF's effectiveness with Llama 2:

Scaling trends for the reward model. More data and a larger-sized model generally improve accuracy, and it appears that our models have not yet saturated from learning on training data. — Llama 2

What does this mean? Some ideas:

  • Base models are the commodity; RLHF is the complement. A curated reward model could turn base models into differentiated products (see the sketch after this list).
  • Human annotation is still important. Many thought the data labeling companies of the last ML wave would be left behind in the age of LLMs, but they may be more relevant than ever.
  • Is human preference the right model? It’s frustrating when chat-based models refuse to answer a tricky question, and some models trade off helpfulness for safety. On the other hand, we don’t want to perpetuate the biases and worst parts of the Internet in our models. This is obviously a much deeper and more complex topic.
  • Is RLHF a path-dependent product of OpenAI, or is it the right long-term strategy? OpenAI is a pioneer of reinforcement learning (most of its pre-GPT work was RL). Is reinforcement learning the most effective way to steer LLMs, or was it just the hammer that OpenAI researchers knew best? Both can be true.
  • Who owns the best data for RLHF? Not all data is created equal. What kind of feedback system will be most effective for building a reward model for future LLMs? Companies like Google have enormous amounts of data, but they might not have the right data.
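
On the commodity/complement point above, one concrete way a reward model differentiates a product is best-of-n reranking: sample several candidates from a commodity base model and let the reward model pick the winner. The sketch below assumes Hugging Face transformers with GPT-2 as a stand-in base model and a hypothetical `reward_fn` placeholder; any pairwise-trained reward model could plug in instead.

```python
# Illustrative best-of-n sampling: a reward model re-ranks candidates from a
# commodity base model, so the reward model is where differentiation lives.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

def reward_fn(text: str) -> float:
    # Placeholder scorer; a real system would call the trained reward model.
    return float(len(set(text.split())))  # toy proxy: lexical diversity

prompt = "Explain RLHF in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = base_model.generate(
    **inputs,
    do_sample=True,
    num_return_sequences=4,      # n candidates from the base model
    max_new_tokens=40,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
best = max(candidates, key=reward_fn)  # reward model picks the winner
print(best)
```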