Machine Learning Operations (MLOps), Convergent or Divergent?

Jun 26, 2021

At Google, I helped develop an open-source MLOps platform built on Kubernetes called Kubeflow. Many companies found the abstractions helpful, but I have thought deeply about whether MLOps would diverge into a separate toolchain or whether DevOps tooling would converge to cover machine learning use cases.

The premise of MLOps is that large-scale machine learning deployments need specialized infrastructure tooling: new ways to define and execute pipelines for data cleaning and ingestion, distributed batch and job schedulers for large training jobs, and specialized API infrastructure for inference (prediction) endpoints. These are new problems that didn't exist ten years ago, and real issues faced by actual companies.
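To make the pipeline point concrete, here is a minimal sketch assuming the v1-era Kubeflow Pipelines SDK (`kfp`). The step names, step bodies, and the bucket path are hypothetical placeholders, not a real pipeline:

```python
# A minimal pipeline sketch, assuming the v1-era Kubeflow Pipelines SDK
# (`pip install kfp`). Step bodies and paths are hypothetical placeholders.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func


def clean_data(raw_path: str) -> str:
    """Hypothetical cleaning step: read raw data, write a cleaned copy."""
    cleaned_path = raw_path + ".cleaned"
    # ... real cleaning logic would go here ...
    return cleaned_path


def train_model(cleaned_path: str) -> str:
    """Hypothetical training step: fit a model on the cleaned data."""
    model_path = cleaned_path + ".model"
    # ... real training logic would go here ...
    return model_path


# Each plain Python function becomes a containerized pipeline step.
clean_op = create_component_from_func(clean_data)
train_op = create_component_from_func(train_model)


@dsl.pipeline(name="clean-and-train")
def clean_and_train(raw_path: str = "gs://example-bucket/raw.csv"):
    cleaned = clean_op(raw_path)
    train_op(cleaned.output)
```

Each step runs as its own container in its own Kubernetes pod, which is exactly where the overlap with DevOps begins.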

I believe that DevOps and MLOps should converge. Applying DevOps best practices to machine learning is the most straightforward way to advance the field. Kubernetes can already execute distributed training jobs, and some developers have implemented a new pluggable gang scheduler for Kubernetes, which is essential for MLOps. Inference APIs are usually just HTTP wrappers around Python functions, though they can clearly become more sophisticated with RPC and observability: tracing, logging, and metrics. Two quick sketches below make both points concrete.

Whether MLOps and DevOps actually will converge, I'm not sure. Market forces may keep them as separate categories. Data scientists, the primary users and sometimes architects of these systems, are not infrastructure engineers. But I'm willing to bet that DevOps engineers will be the most critical players in the MLOps space. How do you think Jeff Dean went from architecting large-scale distributed systems to becoming the father of machine learning at Google?
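First, the training-job claim: submitting a (single-worker) training job to Kubernetes takes little more than the official Python client. The container image and command here are hypothetical:

```python
# A hedged sketch of launching a training job on Kubernetes with the official
# Python client (`pip install kubernetes`). Image and command are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig, as `kubectl` would

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-model"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry a failed trainer at most twice
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/trainer:latest",  # hypothetical
                        command=["python", "train.py"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```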
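Second, the inference claim: a minimal sketch of the HTTP-wrapper pattern, using Flask with a placeholder in place of a real model:

```python
# A minimal sketch of an inference endpoint: an HTTP wrapper around a Python
# function. The "model" is a placeholder for a real trained model.
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict(features):
    # Placeholder: a real service would load a trained model once at startup.
    return sum(features)


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A POST to `/predict` with a JSON body like `{"features": [1, 2, 3]}` returns a prediction; everything beyond this, RPC, tracing, logging, and metrics, is where the extra sophistication comes in.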