2023-08-02

The idea of version control for data has existed for decades. Countless startups and products have tried to tackle the problem, but nothing has come close to success. Why? Is the idea flawed?

Datasets grow and accumulate history significantly faster than code repositories, so it’s challenging to version data in a performant manner. Doing so needs bespoke tools: version control systems that work well for code do not work well for large datasets. There are many issues here, but it boils down to the data structures and index design.

What counts as a change is also unclear. A single new record? A single modified record? A single modified field? A schema change?

Merge is not defined for many data types. History often implies mergeability, but many data types don’t have obvious merge strategies. Diffing fields like JSON requires an understanding of the format’s semantics. Merging binary blobs doesn’t make sense.
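
To make the granularity question concrete, here is a minimal Python sketch. It is an illustration, not how any particular data versioning tool works. It assumes each snapshot is a dict mapping a primary key to a flat record, and the names `diff_records` and `json_equal` are hypothetical. The first function reports changes at the record level and at the field level; the second shows why a JSON field needs a semantic comparison rather than a text comparison.

```python
import json


def diff_records(old, new):
    """Diff two snapshots keyed by primary key (hypothetical helper).

    Returns added and removed records, plus per-field (old, new) pairs
    for records present in both snapshots.
    """
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    modified = {}
    for k in old.keys() & new.keys():
        # A field missing on one side shows up as None, so a schema
        # change (new column) looks like a change to every record.
        fields = {
            f: (old[k].get(f), new[k].get(f))
            for f in old[k].keys() | new[k].keys()
            if old[k].get(f) != new[k].get(f)
        }
        if fields:
            modified[k] = fields
    return {"added": added, "removed": removed, "modified": modified}


def json_equal(a, b):
    """Compare two JSON documents by parsed value, not by text."""
    return json.loads(a) == json.loads(b)


v1 = {1: {"name": "Ada", "city": "London"}, 2: {"name": "Bob", "city": "Paris"}}
v2 = {1: {"name": "Ada", "city": "Berlin"}, 3: {"name": "Cy", "city": "Oslo"}}
changes = diff_records(v1, v2)

# Same object, different key order: a line-based diff sees a change,
# a semantic comparison does not.
left = '{"a": 1, "b": 2}'
right = '{"b": 2, "a": 1}'
same_text = left == right
same_value = json_equal(left, right)
```

Even this toy version has to pick a side on the questions above: it treats the primary key as the identity of a record, and it cannot tell a missing field from a field explicitly set to null. It also says nothing about how to merge two conflicting edits to the same field.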