The Context Length Observation

Nov 7, 2023

Large language models can only consider a limited amount of text at one time when generating a response or prediction. That limit is called the context length, and it differs across models.

But one trend is interesting. Context length is increasing.

  • GPT-1 (2018) had a context length of 512 tokens.
  • GPT-2 (2019) supported 1,024.
  • GPT-3 (2020) supported 2,048.
  • GPT-3.5 (2022) supported 4,096.
  • GPT-4 (2023) first supported 8,192 tokens, then 32,768. Now, GPT-4 Turbo supports up to 128,000.

Just using the OpenAI models for comparison, context length has, on average, doubled every year for the last five years. That trend suggests an observation akin to Moore’s Law:

The maximum context length of state-of-the-art Large Language Models is expected to double approximately every two years, driven by advances in neural network architectures, data processing techniques, and hardware capabilities.

Context length is generally hard to scale. The attention mechanism does quadratically more work as the context grows, and for years its memory footprint grew quadratically too (until FlashAttention). It’s even harder to get models to actually use longer contexts (early long-context models had trouble recalling information in the middle of the prompt).

Why does it matter? Without a long context, understanding relationships and dependencies across large portions of text is difficult. Small context windows force documents to be chunked up and processed piece by piece (with something like retrieval-augmented generation).
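As a rough sketch of that chunk-and-retrieve workaround (the chunk size, overlap, scoring function, and input file below are all hypothetical choices, not any particular library’s API):

```python
def chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split a long document into overlapping pieces small enough
    to fit inside a limited context window."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve(chunks: list[str], query: str, k: int = 3) -> list[str]:
    """Crude relevance ranking by word overlap; a real system would
    use embeddings and a vector index instead."""
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)[:k]

# Only the top-k chunks make it into the prompt; everything else is invisible
# to the model. That blind spot is what long contexts remove.
document = open("book.txt").read()  # hypothetical input file
context = "\n---\n".join(retrieve(chunk(document), "Who betrayed the protagonist?"))
```

Anything that spans chunk boundaries (a plot thread, a cross-file dependency) is easy to lose this way.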

With long enough context lengths, we might ask questions about entire books or write full books from a single prompt. We might analyze an entire codebase in one pass, or extract useful information from mountains of legal documents with complex interdependencies.

What might lead to longer context lengths?

Advances in architecture. Innovations like FlashAttention compute exact attention without ever materializing the full attention matrix, cutting its memory footprint from quadratic to linear in context length and speeding it up substantially in practice. Doubling the context length no longer means quadrupling the memory cost.
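To see where the quadratic cost comes from, here is a minimal NumPy sketch of plain scaled dot-product attention (an illustration, not FlashAttention itself); the (n, n) score matrix is exactly what FlashAttention avoids storing:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Scaled dot-product attention over a sequence of n tokens.

    Q, K, V: (n, d) arrays. The scores matrix below is (n, n), so its memory
    grows quadratically with context length; FlashAttention computes the same
    result in tiles without ever storing it in full."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) <- the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d)

# At n = 128,000 tokens, a single fp32 score matrix alone would take
# 128_000**2 * 4 bytes, roughly 65 GB, which is why the naive version doesn't scale.
```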

Rotary Position Embedding (RoPE) is another architectural enhancement that makes context length scale more gracefully. It encodes position by rotating query and key vectors so that attention depends on relative offsets, which makes it easier to extend models to contexts longer than they were trained on.
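A minimal NumPy sketch of the rotation, following the pairing of features and the base of 10,000 from the RoFormer paper (everything else here is illustrative):

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply Rotary Position Embedding to x of shape (seq_len, dim).

    Each consecutive pair of features is rotated by an angle proportional to
    the token's position, so the dot product between a rotated query and a
    rotated key depends only on their relative offset."""
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per feature pair
    angles = np.outer(np.arange(seq_len), freqs)    # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                 # even / odd features
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```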

Advances in data processing techniques. You can increase context length in two ways. The first is to pretrain the model on longer sequences from the start. That’s difficult because it’s much more computationally expensive, and long training documents are hard to come by (most documents in CommonCrawl have fewer than 2,000 tokens).
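One way to get a feel for the data problem is simply to measure it. A quick sketch using the tiktoken tokenizer (the document list is a placeholder for whatever corpus you are inspecting):

```python
import tiktoken

# Count tokens per document with the cl100k_base tokenizer and see how many
# documents would actually fill a long context window.
enc = tiktoken.get_encoding("cl100k_base")
docs = ["..."]  # placeholder: e.g. documents loaded from a Common Crawl dump

lengths = [len(enc.encode(d)) for d in docs]
long_enough = sum(1 for n in lengths if n >= 16_384)
print(f"{long_enough} of {len(docs)} documents would fill a 16k context window")
```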

The second, more common way is to fine-tune a base model with a longer context window. Code Llama, for example, is fine-tuned on 16k-token sequences on top of Llama 2 (which has a 4k context length).
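Context-extension fine-tunes like this typically adjust RoPE first so the longer positions look familiar to the model. Two common recipes, sketched with illustrative numbers: position interpolation squeezes the new positions into the angle range seen during training, while Code Llama’s recipe (as described in its paper) raises the rotation base so angles grow more slowly per position.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angle for each (position, feature-pair) under RoPE."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, freqs)               # (len(positions), dim/2)

train_len, target_len, dim = 4096, 16384, 128       # illustrative sizes

# Position interpolation: squeeze target_len positions into the angle range
# seen during training, then fine-tune briefly on long sequences.
interpolated = rope_angles(np.arange(target_len) * train_len / target_len, dim)

# Base rescaling: keep positions as-is but raise the base so angles grow
# more slowly per position (Code Llama reportedly moved it from 1e4 to 1e6).
rescaled = rope_angles(np.arange(target_len), dim, base=1_000_000.0)
```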

Advances in hardware capabilities. Finally, the more efficient we make the attention mechanism and the other bottlenecks in training and inference, the more context length can scale with advances in the underlying hardware.
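One concrete bottleneck at inference time is the KV cache, which grows linearly with context length. A back-of-the-envelope sketch, assuming a Llama-2-70B-like configuration (80 layers, 8 grouped-query KV heads of dimension 128, fp16 values):

```python
# Rough KV-cache size per sequence: both keys and values are cached for
# every layer, KV head, and token, at 2 bytes per fp16 value.
layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2

def kv_cache_bytes(seq_len: int) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

for n in (4_096, 32_768, 128_000):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 2**30:5.1f} GiB per sequence")
```

Under those assumptions, 128k tokens is roughly 39 GiB of cache per sequence before counting weights or activations, which is why memory capacity and bandwidth matter as much as raw compute here.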

There’s still work to be done. How do we decide what a single context should contain when preparing training data? It’s simple enough when the data is one self-contained document (a book, a webpage, a source file). But how should we represent an entire codebase in the training data? Or a semester’s worth of lectures from a college class? Or a long online discussion? Or a person’s medical records from their entire life?