Python has long had a monopoly on data workflows — everything from data analysis to data science to machine learning. Anything that can't be done in SQL is done in Python. But Python won't be the language for LLMs.
Why did Python become the language for data workflows?
- Cross-platform. Data analysts are much more likely than software engineers to work on Windows. Python was one of the first languages to have a simple cross-platform toolchain.
- Dynamic typing. Data science is often exploratory. As a result, code churns at a much higher rate. Why go through the trouble of typing every NumPy array, each with a different shape, when the code will never go to production and might be replaced soon?
- REPL / scripting. Python's interactive interpreter suits exploratory work. Jupyter supports kernels for other languages, so why do we so rarely see them used in notebooks?
- Built-in data structures. Python has first-class support for sets, dictionaries, and lists. Its import/namespace design also contrasts with Ruby's global namespace. There's more to unpack here, but the main question on this point is: why Python over Ruby?
- Brevity over verbosity. Java is also cross-platform (via the JVM), but it is a verbose language. Python, on the other hand, is succinct, and there is usually one "Pythonic" way to do something.
These features, along with many others, led the data community to aggregate in Python. In addition, there are economies of scale to languages — the more libraries that exist, the more productive those languages get, especially within a niche/workflow.
But LLMs will break this monopoly.
- Simple interfaces served over the web. Whether or not many companies end up self-hosting these models, the interface remains simple. Sure, you can use OpenAI's Python library to make a call to the completion API, but you can just as quickly run a curl command to do the same. Or send an HTTP request from any language.
- Data-lite (preparing the model). Before, you had to clean your data and convert it to specific data structures — e.g., a model might accept an embedding as the input. These data structures were often heavy and complex to pickle or serialize over the wire, so they stayed in Python. Now, natural language fits cleanly in a string (or a binary image or audio file without special encoding).
- Data-lite (calling the model). While some companies may still fine-tune and pack their data into these models somehow, many other workflows can be done with a small amount of data (e.g., in JSONL), a multi-line string of examples (few-shot), or nothing at all (zero-shot). High latency also means that developers will move the model calls as close to the application as possible ("on the edge").
- Performance-critical libraries are not written in Python. Most of the low-level libraries are simply Python wrappers over C++ or Rust. In theory, these can be called from other languages. Language boundaries are blurring.
- Deployed on the edge. Lower latency means happier users. I imagine there will be providers who offer edge-colocated LLMs for fast inference.
- For application end-developers. You won't need a complicated data pipeline to get started with LLMs or to call them. You don't need a data science certificate to call these models either (it's just plain text, for now).
- Type safety. LLMs can return any schema (or none at all). Instead of parsing plain text, developers will prefer to restrict LLM calls to a known schema. What language would be best for this? TypeScript looks like a strong candidate, since expressing the same constraints directly in JSON Schema is too complex and verbose.
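
To make a few of these points concrete, here is a minimal sketch of what an LLM call might look like from application code: few-shot examples held in an ordinary string, a plain HTTP request, and a reply checked against a known schema. Everything named here is an assumption for illustration. The endpoint URL, the `prompt` body field, and the `SentimentReply` shape are hypothetical and do not correspond to any particular provider's API.

```typescript
// Hypothetical reply shape the application expects from the model.
interface SentimentReply {
  label: "positive" | "negative" | "neutral";
  confidence: number;
}

// Few-shot examples are just a multi-line string; no special data structures.
const FEW_SHOT = `Review: "Loved it, would buy again."
{"label": "positive", "confidence": 0.95}
Review: "Broke after two days."
{"label": "negative", "confidence": 0.9}`;

interface LlmRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build a plain HTTP request. The URL and the body layout are placeholders.
function buildRequest(review: string): LlmRequest {
  const prompt = `${FEW_SHOT}\nReview: ${JSON.stringify(review)}\n`;
  return {
    url: "https://llm.example.com/v1/complete", // hypothetical endpoint
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
    },
  };
}

// Runtime type guard: narrows unknown JSON to SentimentReply.
function isSentimentReply(value: unknown): value is SentimentReply {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const label = v["label"];
  const confidence = v["confidence"];
  return (
    (label === "positive" || label === "negative" || label === "neutral") &&
    typeof confidence === "number" &&
    confidence >= 0 &&
    confidence <= 1
  );
}

// Parse the model's raw text; return null if it is not valid JSON
// or does not match the expected schema.
function parseReply(raw: string): SentimentReply | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isSentimentReply(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```

In a runtime that provides a global `fetch` (recent Node.js versions and browsers), the request could be sent with `fetch(req.url, req.init)` and the response text passed to `parseReply`. The hand-written guard is only one option; a schema-validation library could play the same role.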