Data-Local Machine Learning

Mar 8, 2023

Data is slow and expensive to move around. What if we moved our compute to the data instead, running functions, containers, and other jobs right next to where the data is stored? Here's what's been tried and where things might go from here.

Integrated compute over a distributed object store (Manta). The earliest cloud-native version of this that I've seen is Manta from Joyent, which was started back in 2011. The insight, from Bryan Cantrill (Sun, DTrace, Joyent, Oxide), was that Solaris Zones (a precursor to modern containers) could provide the isolation needed to run compute directly on top of an object store. Unfortunately, the idea was probably ahead of its time. Docker containers were built on Linux containers (not Solaris Zones), and Kubernetes and the public clouds came to dominate container orchestration and object storage.

Snowflake has user-defined functions (UDFs) that can be written in Python, Java, or JavaScript. UDFs have existed for nearly as long as databases themselves, but only more recently have they been supported in languages other than SQL. Even so, the benefits of running your function close to the data are almost outweighed by the awkwardness of defining and calling that function from SQL.
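To make the UDF pattern concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. This is not Snowflake's API, and the function name `normalize_email` and the `users` table are hypothetical. It shows the general shape: a function written in a non-SQL language is registered with the database and then invoked from a query, so the rows never leave the database to be processed.

```python
import sqlite3


def normalize_email(raw):
    """Hypothetical cleanup function, called by the database once per row."""
    if raw is None:
        return None
    return raw.strip().lower()


# In-memory SQLite database used purely as a stand-in for a warehouse.
conn = sqlite3.connect(":memory:")

# Register the Python function under a SQL-visible name taking 1 argument.
conn.create_function("normalize_email", 1, normalize_email)

conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "  Alice@Example.COM "), (2, "bob@example.com"), (3, None)],
)

# The function is called from SQL, next to the data.
rows = conn.execute(
    "SELECT id, normalize_email(email) FROM users ORDER BY id"
).fetchall()
print(rows)
```

Even in this small sketch the awkwardness is visible: the logic lives in Python, but registering it and calling it both go through the SQL layer.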

Another strategy is implementing machine learning at the database layer. BigQuery ML, MindsDB, and, more recently, PostgresML are all examples of this. With these, data analysts and data scientists can call models directly from SQL. Usually, that means lower latency and less boilerplate for shifting data around. The downside is that SQL isn't great for procedural logic. For example, cleaning data, experimenting, and visualizing data are often hard or impossible directly in SQL.
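As a rough illustration of what "the model lives in the database" means, the sketch below fits a one-variable least-squares line using nothing but SQL aggregates and then scores new rows with a plain SQL query. It again uses sqlite3 as a stand-in, not the actual syntax of BigQuery ML, MindsDB, or PostgresML, and the `sales`, `model`, and `candidates` tables are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy training data that follows revenue = 2 * ad_spend + 1.
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
)

# "Training": closed-form simple linear regression computed inside the
# database with aggregates; the result is stored as a one-row table.
conn.execute(
    """
    CREATE TABLE model AS
    SELECT
        (COUNT(*) * SUM(ad_spend * revenue) - SUM(ad_spend) * SUM(revenue))
        / (COUNT(*) * SUM(ad_spend * ad_spend) - SUM(ad_spend) * SUM(ad_spend))
            AS slope,
        AVG(ad_spend) AS mean_x,
        AVG(revenue) AS mean_y
    FROM sales
    """
)

# New rows to score.
conn.execute("CREATE TABLE candidates (ad_spend REAL)")
conn.executemany("INSERT INTO candidates VALUES (?)", [(5.0,), (10.0,)])

# "Inference": predictions come from a SQL query, so no data is exported.
# The intercept is derived as mean_y - slope * mean_x.
predictions = conn.execute(
    """
    SELECT c.ad_spend,
           m.slope * c.ad_spend + (m.mean_y - m.slope * m.mean_x) AS predicted
    FROM candidates AS c CROSS JOIN model AS m
    ORDER BY c.ad_spend
    """
).fetchall()
print(predictions)
```

A single regression fits comfortably in SQL; the trouble starts once you need loops, branching, or plots, which is exactly where procedural code would normally take over.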