The Data Stack As a Snowflake

Sep 16, 2021

The modern data stack has completely changed in the last few years. There are scalable data warehouses to store all of your data (structured and now unstructured), there's the unbundling of ETL, and there's the bundling of reverse ETL.

But so far, this stack has operated as a Snowflake: it exists outside the realm of platform and application engineers. The one exception is at the bottom of the stack, where Databricks has embraced Kubernetes and running Apache Spark natively on Kubernetes has become one of the simplest and most reliable ways to handle big data workloads.
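Spark has shipped a native Kubernetes scheduler since 2.3, and it was marked generally available in 3.1, so a job can target a cluster directly. The snippet below is a minimal sketch in PySpark, assuming PySpark is installed, the cluster is reachable and authenticated from wherever the driver runs, and a suitable image exists; the API server URL, namespace, and image name are placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: the URL, namespace, and image below are placeholders.
spark = (
    SparkSession.builder
    .appName("spark-on-k8s-sketch")
    # The k8s:// scheme selects Spark's native Kubernetes scheduler.
    .master("k8s://https://kubernetes.example.com:6443")
    .config("spark.kubernetes.namespace", "data")
    .config("spark.kubernetes.container.image", "example.com/spark-py:3.1.2")
    .config("spark.executor.instances", "2")
    .getOrCreate()
)

# Executors are launched as pods in the cluster; the count below runs on them.
print(spark.range(10_000_000).count())
spark.stop()
```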

But the rest of the stack hasn't followed yet. Data pipelines are more fragile than typical DevOps pipelines, yet just as critical to production, if not more so. Reproducibility and declarative configuration are rare, despite data's declarative roots in SQL. Data engineers bring an entirely different skill set to the job, but they lack the tools that make software engineers efficient at dealing with infrastructure (and for good reason). Tools like dbt keep the abstraction simple for data analysts while providing basic software-engineering features like versioning and environment management. Open-source startups are tackling the problem by making deployment easy with containerization, but they aren't quite Kubernetes-native yet.
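To make that concrete, here is a hypothetical sketch of what declarative, versioned configuration with environment management could look like when expressed as plain code checked into git. None of these classes or field names belong to dbt or any other tool; they are invented for illustration.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class WarehouseTarget:
    account: str
    database: str
    schema: str

@dataclass(frozen=True)
class PipelineConfig:
    name: str
    schedule: str  # cron expression
    target: WarehouseTarget

# The base configuration lives in version control alongside the pipeline code.
BASE = PipelineConfig(
    name="orders_daily",
    schedule="0 3 * * *",
    target=WarehouseTarget(account="acme", database="analytics", schema="marts"),
)

# Environments diverge only where they must; everything else is inherited.
DEV = replace(BASE, target=replace(BASE.target, database="analytics_dev"))
PROD = BASE

if __name__ == "__main__":
    print(DEV)
    print(PROD)
```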

Why don't we declare our data pipelines as code? Why don't we provision data connectors as serverless or autoscaling deployments? Why are we stuck using templating engines when we could do configuration-as-code? Why don't we connect the data stack and the application stack in a meaningful way? I suspect that as the data stack matures, the two stacks will have to converge. And the data stack won't be a Snowflake anymore.
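To make the configuration-as-code question concrete, here is a hypothetical sketch of provisioning a containerized data connector as a Kubernetes Deployment with plain Python rather than a YAML template, roughly the pattern tools like Pulumi and cdk8s already apply to application infrastructure. The connector name, image, and environment variables are invented for illustration.

```python
import json
from typing import Dict

def connector_deployment(name: str, image: str, replicas: int, env: Dict[str, str]) -> dict:
    """Build a Kubernetes Deployment manifest for a containerized connector."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "env": [{"name": k, "value": v} for k, v in env.items()],
                        }
                    ]
                },
            },
        },
    }

if __name__ == "__main__":
    manifest = connector_deployment(
        name="postgres-to-warehouse",  # placeholder connector
        image="example.com/connectors/postgres-source:1.0",
        replicas=2,
        env={"SOURCE_HOST": "db.internal", "DESTINATION": "warehouse"},
    )
    # Kubernetes accepts JSON manifests, so this can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

Because the manifest is ordinary code, it can be typed, tested, and reviewed the same way the application stack already is.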