What does it take to integrate an LLM into your infrastructure? Here's a look at some components that might make up an LLMOps stack beyond the usual training and inference pipelines.
First, the requirements. What makes LLMOps interesting is the assumption that LLMs may:
- Take action beyond generating text: call APIs, execute code, or modify resources
- Generate and execute dynamic workflows
- Run middleware on requests (e.g., augment them with additional data)
- Schedule specific requests near data or models
- Run inference across multiple models with different modalities (text, code, image, audio)
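To make the middleware requirement concrete, here is a minimal sketch of a request passing through a chain of augmentation steps before it reaches a model. Everything in it is hypothetical: the `Request` type, the two middleware functions, and the hard-coded data they attach stand in for real lookups against a profile service or a search index.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Request:
    prompt: str
    context: list[str] = field(default_factory=list)


# A middleware takes a request and returns a (possibly augmented) request.
Middleware = Callable[[Request], Request]


def add_user_profile(req: Request) -> Request:
    # Hypothetical lookup; a real system would query a profile service.
    return Request(req.prompt, req.context + ["user_tier=pro"])


def add_retrieved_docs(req: Request) -> Request:
    # Hypothetical retrieval step; stands in for a search or vector lookup.
    return Request(req.prompt, req.context + [f"doc about: {req.prompt}"])


def run_middleware(req: Request, chain: list[Middleware]) -> Request:
    # Apply each middleware in order; each one sees the previous one's output.
    for mw in chain:
        req = mw(req)
    return req


def render(req: Request) -> str:
    # Flatten the collected context and the prompt into the final model input.
    return "\n".join(req.context + [req.prompt])
```

Calling `run_middleware(Request("reset my password"), [add_user_profile, add_retrieved_docs])` returns a request whose context holds both additions, in chain order. Keeping each step a plain function makes it easy to add, remove, or reorder steps per route.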
Policy Engine — Agents calling other services must have the appropriate authorization. Which secrets or environment variables should (or can) be mounted? Which network overlays can requests operate on? Existing policy engines might work here, but the front-end configuration tooling must improve.
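As a rough illustration of the checks described above, the sketch below authorizes an agent's request against a per-agent policy covering actions, secrets, and networks. It is not modeled on any particular policy engine; `Policy`, `ActionRequest`, `authorize`, and the example agent and secret names are all made up for this example. The one design choice worth noting is default deny: an agent with no policy gets nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    agent: str
    allowed_actions: frozenset[str]
    allowed_secrets: frozenset[str]
    allowed_networks: frozenset[str]


@dataclass(frozen=True)
class ActionRequest:
    agent: str
    action: str
    secrets: frozenset[str] = frozenset()
    network: str = "none"


def authorize(req: ActionRequest, policies: dict[str, Policy]) -> tuple[bool, str]:
    """Return (allowed, reason). Anything not explicitly allowed is denied."""
    policy = policies.get(req.agent)
    if policy is None:
        return False, "no policy for agent"
    if req.action not in policy.allowed_actions:
        return False, f"action {req.action!r} not allowed"
    extra = req.secrets - policy.allowed_secrets
    if extra:
        return False, f"secrets not allowed: {sorted(extra)}"
    if req.network not in policy.allowed_networks:
        return False, f"network {req.network!r} not allowed"
    return True, "ok"


# Hypothetical policy: one agent, one action, one secret, one network overlay.
POLICIES = {
    "billing-agent": Policy(
        agent="billing-agent",
        allowed_actions=frozenset({"http.get"}),
        allowed_secrets=frozenset({"BILLING_API_KEY"}),
        allowed_networks=frozenset({"internal"}),
    )
}
```

A request from `billing-agent` for `http.get` on the `internal` network with `BILLING_API_KEY` passes; the same agent asking for `shell.exec` or for a secret outside its policy is refused with a reason that can be logged.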
Data plane (i.e., “the node”) — The execution environment for the workload. Since it is executing dynamic workloads, it must be sandboxed. Good options here are (1) containers, (2) WebAssembly, or (3) microVMs like Firecracker. The data plane should probably be able to run all of them, since they suit different use cases. Containers are great because they also provide a packaging format (images) for running different services (runtimes, CLI tools, etc.). It should be easy to integrate with your existing infrastructure.
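For the container option, one way to picture a locked-down execution is the `docker run` invocation below, built as an argument list. This is a sketch under assumptions: the function name, the default image, and the specific limits are illustrative choices, and the flags shown reduce what the workload can reach rather than guaranteeing isolation. Containers still share the host kernel, which is one reason microVMs remain on the list.

```python
def sandboxed_docker_argv(
    code: str,
    image: str = "python:3.12-slim",  # assumed image; any image with python works
    memory: str = "256m",
    cpus: str = "0.5",
    pids: int = 64,
) -> list[str]:
    """Build a `docker run` command that executes untrusted Python code
    with no network, a read-only root filesystem, and resource limits."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access
        "--read-only",                          # read-only root filesystem
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--memory", memory,                     # memory limit
        "--cpus", cpus,                         # CPU limit
        "--pids-limit", str(pids),              # cap on process count
        image, "python", "-c", code,
    ]
```

The list can be passed to `subprocess.run(argv, capture_output=True, text=True, timeout=30)` on a host with Docker installed. Passing the code as a single argument, rather than interpolating it into a shell string, avoids shell injection on the host side.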
Controller — Dynamic workflows must be orchestrated and routed to the right data planes. This could mean (1) generating or executing dynamic workflows, (2) dispatching to smaller models, or (3) reconciling some declarative state.
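The reconciliation case is the easiest of the three to sketch. Below, a controller compares desired and observed replica counts per model and computes the actions needed to converge. The state shape (a dict of model name to replica count), the action tuples, and the model names are assumptions made for this example; a real controller would issue the actions against the data plane instead of applying them to a dict.

```python
def plan(desired: dict[str, int], observed: dict[str, int]) -> list[tuple[str, str, int]]:
    """Return the actions that move observed replica counts toward desired."""
    actions = []
    for name in sorted(desired.keys() | observed.keys()):
        want = desired.get(name, 0)
        have = observed.get(name, 0)
        if want > have:
            actions.append(("start", name, want - have))
        elif want < have:
            actions.append(("stop", name, have - want))
    return actions


def reconcile(desired: dict[str, int], observed: dict[str, int]) -> dict[str, int]:
    """Apply the plan to a copy of the observed state and return the result."""
    state = dict(observed)
    for verb, name, n in plan(desired, observed):
        state[name] = state.get(name, 0) + (n if verb == "start" else -n)
    # Drop entries that were scaled to zero.
    return {name: count for name, count in state.items() if count > 0}
```

With `desired = {"summarizer": 2, "code-model": 1}` and `observed = {"summarizer": 1, "image-model": 1}`, the plan starts one `code-model`, stops the `image-model`, and starts one more `summarizer`. Running `plan` again on the reconciled state returns an empty list, which is the property that makes a reconcile loop safe to repeat.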
Control plane (i.e., “the API server”) — The management layer that provides the entry point for configuration and requests. The crucial part of the control plane is that it is separate from the data plane. That means you can scale each one independently, and the data plane can be sufficiently locked down.
Persistence — The persistence layer for LLMs today is the vector database. However, persistence could mean anything from key-value stores to search engines to relational databases. Some APIs will be developed here: how data is persisted, retrieved, and added to the model inference step.
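To show the three steps named above (persist, retrieve, add to inference) in one place, here is a toy in-memory store with cosine-similarity search. It is a sketch, not a stand-in for any real vector database: `InMemoryVectorStore` and `build_prompt` are invented names, and the embeddings are hand-written two-dimensional vectors where a real system would use an embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        # Persist: store the text alongside its embedding.
        self._items.append((text, embedding))

    def search(self, query: list[float], k: int = 2) -> list[str]:
        # Retrieve: rank stored items by similarity to the query embedding.
        ranked = sorted(self._items, key=lambda item: cosine(query, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(question: str, retrieved: list[str]) -> str:
    # Add to inference: prepend retrieved text to the question.
    context = "\n".join(f"- {text}" for text in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The same `add` and `search` interface could sit in front of a key-value store or a search engine; only the ranking inside `search` would change.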
Scheduler — Some workloads need specific scheduling, e.g., colocating compute with data for inference, coscheduling, or batch scheduling for training. Many distributed systems frameworks require extra work to implement these algorithms.
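Colocation is the simplest of these to illustrate. The sketch below filters out nodes without enough free GPUs, then scores the rest by whether they already hold the dataset and the model weights. The `Node` and `Workload` types, the node contents, and the scoring weights (2 for data, 1 for model) are arbitrary assumptions for the example, not a recommendation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_gpus: int
    datasets: set[str]
    models: set[str]


@dataclass
class Workload:
    gpus: int
    dataset: str
    model: str


def score(node: Node, workload: Workload) -> Optional[int]:
    """Return None if the node cannot fit the workload, else a locality score."""
    if node.free_gpus < workload.gpus:
        return None
    points = 0
    if workload.dataset in node.datasets:
        points += 2  # assumed weight: data locality matters most
    if workload.model in node.models:
        points += 1  # assumed weight: cached model weights help too
    return points


def schedule(nodes: list[Node], workload: Workload) -> Optional[Node]:
    """Pick the feasible node with the best score, breaking ties by free GPUs."""
    feasible = []
    for node in nodes:
        s = score(node, workload)
        if s is not None:
            feasible.append((s, node))
    if not feasible:
        return None
    return max(feasible, key=lambda pair: (pair[0], pair[1].free_gpus))[1]
```

The filter-then-score split keeps hard constraints (capacity) separate from preferences (locality), so new preferences can be added without touching feasibility.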