Modeling Context Length vs. Information Retrieval Cost in LLMs

Mar 21, 2023

Large language models are unique because you can get good results with in-context learning (i.e., prompting) at inference time. This is much cheaper and more flexible than fine-tuning a model.

But what happens when you have too much data to fit in a prompt but don’t want to fine-tune? How do you provide the proper context for the model?

You have a few choices:

  • Use the model with the largest context window. For example, most models have a limit of 4k tokens (prompt and completion combined), but GPT-4 has a 32k-token window.
  • Use a vector database to perform a similarity search to filter down the relevant context for the model. Only a subset (e.g., the three most similar sentences or paragraphs) are included in the prompt.
  • Use a traditional search engine (e.g., Elasticsearch, Bing) to retrieve information. Unlike similarity search, more of the semantic work is left to you (but the results may be more relevant).
  • Use an alternative architecture where the model does some routing to more specific models or performs information retrieval itself (e.g., Google’s Pathways architecture).

What will be the dominant architecture in the future? Here’s some napkin math on the cost of the different methods. It’s a bit of an apples-and-oranges comparison (there are use cases that only work with a specific method), but this looks only at the use case of augmenting in-context learning with the relevant data.

(Let’s assume 1 page ~= 500 words, and 1 sentence ~= 15 words, 1 word ~= 5 characters).

Using the largest model. With large context lengths, let’s estimate there’s a 9:1 split between prompt tokens (currently $0.06/1k tokens) and sampled tokens ($0.12/1k tokens). This comes out to a blended $0.066 / 1k tokens.

Using OpenAI’s tokenizer, 1 token ~= 4 characters in English, or 100 tokens ~= 75 words.

At the full token capacity, that’s $2.112 per query containing 24,000 words.
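
In code, the napkin math looks like this (a rough sketch; the prices and the 9:1 prompt-to-completion split are the assumptions stated above):

```python
# Cost of stuffing the full 32k-token window on every query (Mar 2023 prices).
PROMPT_PRICE = 0.06 / 1000    # $/prompt token
SAMPLED_PRICE = 0.12 / 1000   # $/sampled token
PROMPT_SHARE = 0.9            # assumed 9:1 split between prompt and completion

blended = PROMPT_SHARE * PROMPT_PRICE + (1 - PROMPT_SHARE) * SAMPLED_PRICE
context_tokens = 32_000
cost_per_query = context_tokens * blended
words_per_query = context_tokens * 0.75   # 100 tokens ~= 75 words

print(f"blended rate: ${blended * 1000:.3f}/1k tokens")  # ~$0.066
print(f"cost per query: ${cost_per_query:.3f}")          # ~$2.112
print(f"words per query: {words_per_query:,.0f}")        # ~24,000
```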

Vector search. You can convert chunks of text to vectors; for simplicity, let’s say one sentence per vector. In practice, chunks might be larger (paragraphs) or smaller (single tokens).

Vector sizes. Let’s use 1536 dimensions since that’s the size of OpenAI’s embeddings. In practice, you would probably use a lower dimensionality to store in a vector database (768 or even 256). Pinecone, a vector database, has a standard tier that roughly fits up to 5mm 768-dimensional vectors, costing $0.096/hour or ~$70/mo. This includes compute. Let’s assume this equates to 2.5mm 1536-dimensional vectors.

A rough calculation of the storage size required for 2.5mm 1536-dimensional vectors (assuming float32).

2.5mm vectors * 1536 dimensions * 4 bytes per dimension ~= 15GB

At one sentence (~15 words) per vector, that’s about 37.5mm words of source text, significantly larger than even the largest context window. Assuming 100 queries/day, that’s ~$0.023 per query.
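
The same sketch for the vector-search side (the Pinecone tier sizing, one ~15-word sentence per vector, and 100 queries/day are the assumptions above):

```python
# Storage footprint and per-query cost of the vector database.
vectors = 2_500_000
dims = 1536
bytes_per_float = 4  # float32

storage_gb = vectors * dims * bytes_per_float / 1e9
words_indexed = vectors * 15           # one ~15-word sentence per vector
monthly_cost = 70                      # assumed Pinecone standard tier, $/mo
queries_per_month = 100 * 30           # 100 queries/day

print(f"storage: ~{storage_gb:.1f} GB")                             # ~15.4 GB
print(f"indexed words: ~{words_indexed / 1e6:.1f}mm")               # ~37.5mm
print(f"cost per query: ${monthly_cost / queries_per_month:.3f}")   # ~$0.023
```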

Of course, you still need to put the relevant documents in the prompt.

Essentially, as long as the similarity search trims the prompt by at least ~1% of its tokens ($0.023 / $2.112 ≈ 1.1%, or roughly 350 of the 32k tokens), running the vector search first pays for itself. Today, that’s a no-brainer.

Both the numerator ($/vector search) and the denominator ($/token) are likely to decrease over time, but the $/token costs will probably fall much faster than the vector database costs. If token costs fall 10x faster, the break-even point moves to ~10% of the prompt. Maybe a different story then.

Additional costs: maintaining the vector database infrastructure, the added latency of making a database call first (though who knows how slow the 32k-prompt models will be — a different calculation), the margin of error (which approach finds the relevant information more often?), and the developer experience (a data pipeline vs. “put it all in the prompt”).


Framework-Defined Infrastructure

Mar 20, 2023

What would cloud-native Ruby-on-Rails look like?

  • Route handlers mapped to AWS Lambda, Google Cloud Functions, or Azure Functions (sketched after this list).
  • An ORM that automatically provisions and migrates databases on AWS RDS, DynamoDB, Google Cloud SQL, or Azure SQL.
  • Static files that are uploaded and cached on an automatically provisioned CDN endpoint like AWS CloudFront, Google Cloud CDN, and Azure CDN.
  • Deployments that build and package themselves as reproducible Docker containers.
  • Mailers that automatically configure and send via AWS SES, background jobs that run via ephemeral containers or functions, and more.
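
A hypothetical sketch of the first bullet: routes declared in application code, with the framework (not the developer) emitting the serverless function definitions. The `App` class and `synthesize()` method are made up for illustration; real implementations (e.g., Vercel) derive this mapping from framework conventions such as file-based routing, and an IaC layer applies the result.

```python
# Hypothetical framework-defined infrastructure: the route table *is* the
# deployment spec. Nothing here is a real Vercel or AWS API.
class App:
    def __init__(self):
        self.routes = {}

    def get(self, path):
        def register(handler):
            self.routes[path] = handler
            return handler
        return register

    def synthesize(self):
        """Derive one serverless function definition per route handler."""
        return [
            {
                "function_name": handler.__name__,
                "trigger": {"type": "http", "method": "GET", "path": path},
                "runtime": "python3.11",
                "memory_mb": 256,
            }
            for path, handler in self.routes.items()
        ]

app = App()

@app.get("/users")
def list_users():
    return {"users": []}

# The framework, not the developer, produces the infrastructure definition.
print(app.synthesize())
```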

Vercel has coined this idea as framework-defined infrastructure, and I think it's directionally the future.

Why now? We saw two downstream effects as the public cloud APIs matured and higher-level abstractions were developed (e.g., Kubernetes, FaaS). The first was serverless primitives (scale-to-zero and elastic scaling), and the second was infrastructure-as-code.

Framework-defined infrastructure uses both to avoid complex state management at the framework level (serverless primitives give it enough individually mappable pieces) and at the infrastructure level (declarative configuration moves complex state management to the provider).

While framework-defined infrastructure seems like a step in the right direction, I wonder whether it is a net benefit to companies like Vercel or cloud providers like AWS. The age-old question of value creation vs. value capture.

Suppose the infrastructure is serverless, and the mapping from code to infrastructure is already well-defined in an open-source framework or API. What is the developer paying the provider of framework-defined infrastructure for? (More in IaC: Strength or Weakness for Cloud Providers?)

Ruby on Rails created immense value for startups (see the list of startups that got to market with Rails), but its authors captured relatively little of it – Tobias Lütke (Shopify) and DHH (Basecamp) indirectly monetized their contributions to Rails through more domain-specific startups.

The Missing Semester of CS

Mar 19, 2023

MIT has a pragmatic course that covers proficiency with software tools. The idea is that you use these tools so often that they go from being a mere fact of the vocation to being a problem-solving technique in their own right. While my advice is that you should focus on theory and first principles at school, knowing these concepts can help you learn (and extend) the theory.

Here's what would be in my course (you can see MIT's here):

  • Command line essentials. The terminal is still the entry point for most developer tasks. Learn it. Understand the UNIX philosophy. Essential: string manipulation, SSH, git, grep, tar, cURL, UNIX pipes. (I don't think you need to learn vim or emacs anymore).
  • Package management. Understand your system's default package manager (e.g., apt on Ubuntu or brew on macOS): how to install, remove, and query packages. Have a good understanding of language-level package managers: installing, removing, and querying. Are you installing packages globally or locally? What runtime are you using? A lot of time gets wasted here.
  • Build systems. You don't need to know them all, but have a good idea of what files, modules, or packages get compiled when you run a build command, and how to write a simple Makefile.
  • Basic networking. How to expose a service to the internet (any method you prefer). How to connect to a remote machine. How to forward a port.
  • One scripting language. You should know at least one scripting language, whether that's bash, Python, or something else. It doesn't matter which, but you should be able to do quick manipulations and batch operations in it.
  • One data analysis tool. Pandas would be my choice, but R or Excel is acceptable. You should be able to quickly generate obvious insights from structured data (e.g., JSON or CSV): means, medians, unique values, etc. You should be able to do basic data cleaning and graph basic data (see the Pandas sketch after this list).
  • One SQL dialect. You don't have to know complex aggregate functions or common table expressions, but you should know how to SELECT, INSERT, GROUP, and filter data.
  • Debugging. I don't think understanding how to use a language-specific debugger would be on the curriculum. Instead, general-purpose debugging techniques. How to read a stack trace. Print debugging (effectively). Methods for identifying and solving different bugs – finding when a bug was introduced (bisect), tracing a bug across multiple services, runtime vs. build time bugs.
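
For the data analysis bullet, here's the level of Pandas fluency I have in mind (the CSV file and column names are made up):

```python
import pandas as pd

# Hypothetical orders export with columns: customer, region, amount, created_at.
df = pd.read_csv("orders.csv", parse_dates=["created_at"])

# Basic cleaning: drop rows missing an amount, normalize a text column.
df = df.dropna(subset=["amount"])
df["region"] = df["region"].str.strip().str.lower()

# Obvious insights: means, medians, unique values, simple group-bys.
print(df["amount"].describe())                  # count, mean, std, min, max
print(df["region"].value_counts())              # unique values and frequencies
print(df.groupby("region")["amount"].median())  # median order size per region

# Basic graphing (requires matplotlib).
df.groupby(df["created_at"].dt.to_period("M"))["amount"].sum().plot(kind="bar")
```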

The Value of Software Generalists

Mar 18, 2023

We’ve always known that software engineering skills are key to unlocking the power of ML. Some large companies (FAANG) have gone as far as preferring to hire software engineers and teach them ML to work on applied problems (rather than the reverse).

LLMs really put the elephant in the room.  All of a sudden the ML is abstracted away and the jobs to be done are design, engineering, UX, etc. Yes LLMs/NLP are only a subset of ML, but seems like a tipping point with respect to how people think about skills.

Hamel Husain

A familiar story: taking software engineering's best principles and injecting them into auxiliary technical stacks – the modern data stack (data observability, versioning, orchestrators rebased on Kubernetes), the machine learning stack (cloud-native distributed training and inference on Kubernetes), or even domain-specific "Ops" like FinOps and HRMs (human resource management).

There's immense value in being a software engineering generalist. Knowing how to build and deploy a service. Knowing how to write a script to transform some data. Knowing how to do common tasks like authentication, querying a database, setting up a developer environment, SSH-ing into a machine, compiling software, debugging, and more.


Foundational Models Are Commodities

Mar 17, 2023

There are over 24 public LLMs from 8 providers (OpenAI, Google, Meta, AI21, EleutherAI, Anthropic, Bloom, Salesforce, and more) for developers to choose from. You can train one from scratch with only public data and still get good results (see LLaMA).

Developers can switch out a model with a single line of code. In addition, new models are incorporated across libraries as soon as they are released.

There are still trade-offs between latency, cost, size, data, and more when choosing the right model. But the data is in:

Foundational models are commodities.

And yet, foundational models by themselves are not enough.

  • It isn't easy to orchestrate calls between LLMs, internal databases, and APIs.
  • Chain-of-thought prompting can increase reasoning ability, but it doesn't come out of the box (see the sketch after this list).
  • Augmenting context length (e.g., filtering items via a vector similarity search first) requires extra infrastructure.
  • DSLs (like ChatML) might be needed to serve more domain-specific use cases.
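
For example, a minimal sketch of chain-of-thought prompting as plain prompt construction; the exemplar and the commented-out `complete()` call are placeholders, not any specific library's API:

```python
# Chain-of-thought prompting: include a worked reasoning exemplar so the model
# produces intermediate steps before the final answer.
COT_EXEMPLAR = (
    "Q: A library has 120 books and lends out 45. It then receives 30 more. "
    "How many books does it have?\n"
    "A: It starts with 120 books. After lending 45, it has 120 - 45 = 75. "
    "After receiving 30 more, it has 75 + 30 = 105. The answer is 105."
)

def build_cot_prompt(question: str) -> str:
    # Append "let's think step by step" scaffolding after the exemplar.
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nA: Let's think step by step."

prompt = build_cot_prompt("A train travels 60 km/h for 2.5 hours. How far does it go?")

# `complete` stands in for whichever LLM client you use; the orchestration
# lives in your application code, not the model.
# response = complete(model="gpt-4", prompt=prompt)
print(prompt)
```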

On OpenAI's Kubernetes Cluster

Mar 16, 2023

As someone who used to work on Kubernetes and distributed ML on Kubernetes, I dug into some of the publicly available facts about how OpenAI runs a 7,500+ node Kubernetes cluster as scalable infrastructure for its large language models. [1] [2]

Kubernetes vs. HPC. Many might object and say that OpenAI should be running on HPC frameworks like Slurm instead of Kubernetes. My (biased) answer: the developer experience and cloud-native integrations of Kubernetes more than make up for some of the shortcomings. Developers today deploy with containers. Nodes are heterogeneous (and ephemeral). Secrets, blob storage, and volume mounts other than NFS are necessary. You have to build many of these things yourself in HPC, but it's much easier in Kubernetes. Developer experience matters.

Cluster-wide MPI. All pods participate in a single MPI (message-passing interface) communicator. You can think of a bunch of parallel jobs doing work and then doing a batch operation (e.g., batch normalization) across all nodes. OpenAI built its own, but I would use the operators and custom resources in the Kubeflow project (I worked on Kubeflow at Google).
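
As a toy illustration of the pattern (not OpenAI's code), here is what a cluster-wide collective looks like with mpi4py; in a Kubeflow-style setup, the MPIJob operator would launch one such process per pod:

```python
# Run with, e.g.: mpirun -n 4 python allreduce_demo.py
# Each rank does local work, then all ranks combine results in a single
# collective operation across the whole communicator.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Local "work": each rank computes a partial result (stand-in for per-GPU gradients).
local = np.full(4, float(rank))

# Collective step across every member of the communicator.
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    print(f"world size={comm.Get_size()}, reduced={total}")
```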

Scheduler. You can swap out the default scheduler in Kubernetes and replace it with something more specific. It sounds like OpenAI tried this and ran into issues, but, in theory, it's possible. This is one of the points I made in MLOps, Convergent or Divergent?

A service mesh? Traffic shaping? It sounds like OpenAI doesn't use a complicated service mesh or network overlay on top of Kubernetes, if any at all. Instead, they do minimal service discovery when the pods start (and join the MPI group) and then communicate over SSH via pod IPs.

They might benefit from something like Cilium instead. It could also help with traffic shaping for pods that have significant internet bandwidth (crawling websites?), and it's lightweight enough not to add much overhead (it's eBPF).

Vertically scaled vs. federated vs. multiple clusters. It's often easier to have multiple clusters than one giant cluster. For example, the official limit for Kubernetes clusters is 5,000 nodes (~300,000 containers), but some experiments by the scalability SIG have shown Kubernetes orchestrating up to 15,000 nodes.