We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF. — Llama 2

Reinforcement learning from human feedback (RLHF) is one of the most effective techniques for aligning large language model (LLM) behavior with human preferences and for improving instruction following. In its simplest form, human annotators are shown two model outputs for the same prompt and choose the one they prefer. These preference labels are then used to train a reward model, which scores outputs and provides the training signal for the reinforcement learning step.
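To make the reward model step concrete, here is a minimal sketch of how pairwise preferences can be turned into a training objective. It assumes the common pairwise ranking (Bradley-Terry style) loss, where the model is penalized when the preferred output does not score higher than the rejected one. The function names and the plain-float scores are illustrative assumptions; a real reward model would be a neural network producing these scores, and this is not a reproduction of any particular lab's training code.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the annotator-preferred output scores well above
    the rejected one, and large when the ranking is reversed.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def pairwise_accuracy(score_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) score pairs that are ranked correctly."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)


# Hypothetical reward scores for three annotated comparisons:
# each tuple is (score of preferred output, score of rejected output).
example_pairs = [(1.8, 0.3), (0.2, 0.9), (2.5, 2.1)]
mean_loss = sum(pairwise_preference_loss(c, r) for c, r in example_pairs) / len(example_pairs)
accuracy = pairwise_accuracy(example_pairs)
```

Accuracy here is simply the share of comparisons in which the reward model agrees with the human annotator, which is one natural way to measure how well a reward model has learned the preferences.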

OpenAI uses RLHF extensively in its models and spends significant sums on human annotators. Meta is now doing the same with Llama 2. Interestingly, Meta does not appear to have reached the limit of RLHF's effectiveness with Llama 2:

Scaling trends for the reward model. More data and a larger-sized model generally improve accuracy, and it appears that our models have not yet saturated from learning on training data. — Llama 2

What does this mean? Some ideas: