As someone who used to work on Kubernetes and distributed ML on Kubernetes, I dug into some of the publicly available facts about how OpenAI runs a Kubernetes cluster of 7,500+ nodes to produce scalable infrastructure for their large language models. [1] [2]
Kubernetes vs. HPC. Many might object and say that OpenAI should be running on an HPC framework like Slurm instead of Kubernetes. My (biased) answer: the developer experience and cloud-native integrations of Kubernetes more than make up for some of its shortcomings. Developers today deploy with containers. Nodes are heterogeneous (and ephemeral). Secrets, blob storage, and volume mounts other than NFS are necessary. You would have to build much of this yourself on an HPC stack, but it comes largely for free in Kubernetes, as the sketch below illustrates. Developer experience matters.
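To make the "for free" claim concrete, here is a minimal sketch using the official Kubernetes Python client: mounting a secret into a pod is a few declarative lines. The names ("train-worker", "api-credentials") are hypothetical, purely for illustration, not anything from OpenAI's setup.

```python
# Sketch: mounting a Kubernetes Secret into a pod with the official
# Python client. In an HPC stack, distributing credentials like this
# typically requires custom tooling.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-worker"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="trainer",
                image="python:3.11",
                volume_mounts=[
                    client.V1VolumeMount(
                        name="creds", mount_path="/etc/creds", read_only=True
                    )
                ],
            )
        ],
        volumes=[
            client.V1Volume(
                name="creds",
                secret=client.V1SecretVolumeSource(secret_name="api-credentials"),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```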
Cluster-wide MPI. All pods participate in a single MPI (Message Passing Interface) communicator. You can think of a bunch of parallel jobs doing work and then performing a collective operation (e.g., a gradient all-reduce or synchronized batch normalization) across all nodes. OpenAI built its own tooling for this, but I would use the operators and custom resources in the Kubeflow project (I worked on Kubeflow at Google).
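A minimal sketch of what a single cluster-wide communicator buys you, using mpi4py for illustration (OpenAI hasn't published its stack at this level of detail): every worker joins one communicator and participates in a single collective call.

```python
# Sketch: a cluster-wide all-reduce over MPI.COMM_WORLD with mpi4py.
# Launch with e.g. `mpirun -np <world_size> python allreduce_demo.py`;
# Kubeflow's MPI operator wires up an equivalent launch across pods.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()

# Stand-in for per-worker state, e.g. local gradients or batch statistics.
local = np.full(4, fill_value=float(rank))

# One collective across *all* workers: sum, then average.
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)
mean = total / world_size

if rank == 0:
    print(f"averaged across {world_size} workers: {mean}")
```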
Scheduler. You can swap out the default scheduler in Kubernetes and replace it with something more specific to your workload. It sounds like OpenAI tried this and ran into issues, but, in theory, it's possible (see the sketch below). This is one of the points I made in MLOps, Convergent or Divergent?
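Mechanically, opting into a replacement scheduler is just a field on the pod spec; the hard part is writing a scheduler that holds up at this scale. A sketch, where "gpu-gang-scheduler" is a hypothetical scheduler name:

```python
# Sketch: pointing a pod at a non-default scheduler via schedulerName.
# Any scheduler watching for pods with this name will place the pod
# instead of the default scheduler.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="worker-0"),
    spec=client.V1PodSpec(
        scheduler_name="gpu-gang-scheduler",  # instead of "default-scheduler"
        containers=[client.V1Container(name="trainer", image="python:3.11")],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```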
A service mesh? Traffic shaping? It sounds like OpenAI doesn't run a complicated service mesh or network overlay on top of Kubernetes, if any at all. Instead, they do minimal service discovery when the pods start (and join the MPI group) and then communicate over SSH via pod IPs.
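Minimal service discovery of that kind might look like the following sketch: on startup, each pod asks the API server once for its peers' IPs and then talks to them directly. The namespace and label selector are hypothetical examples.

```python
# Sketch of startup-time service discovery: query the API server once
# for peer pod IPs, then communicate directly over those IPs
# (no Service objects, no mesh).
from kubernetes import client, config

config.load_incluster_config()  # running inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="training",
    label_selector="job=llm-run-42,role=worker",
)
peer_ips = [p.status.pod_ip for p in pods.items if p.status.pod_ip]

# From here, rank assignment and the SSH/MPI bootstrap use peer_ips directly.
print(f"discovered {len(peer_ips)} peers: {peer_ips}")
```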
They might benefit from something like Cilium instead. It could also help with traffic shaping for pods that consume significant internet bandwidth (crawling websites?), and it's lightweight enough (being eBPF-based) not to add much network overhead of its own.
Vertically scaled vs. federated vs. multiple clusters. It's often easier to have multiple clusters than one giant cluster. For example, the official limit for Kubernetes clusters is 5,000 nodes (~300,000 containers), but some experiments by the scalability SIG have shown Kubernetes orchestrating up to 15,000 nodes.