Data Versioning

Aug 2, 2023

The idea of version control for data has existed for decades. Countless startups and products have tried to tackle the problem, but none has come close to succeeding. Why? Is the idea itself flawed?

  • Data Volume. Datasets grow and accumulate history significantly faster than code repositories, which makes performant data versioning challenging.
  • Needs bespoke tools. Version control systems that work well for code do not work well for large datasets. There are many issues here, but it largely comes down to the data structures and index design (see the first sketch after this list).
  • Data Sensitivity. Storing secrets and sensitive data in version control is generally considered bad practice. It’s hard but doable to keep this kind of data out of regular code repositories (e.g., via runtime environment variables or fetching from a secret store), but databases are often the primary store for sensitive data (e.g., PII).
  • Schema changes. What happens to versioned history when the schema changes? Do older versions need to be migrated before they can be compared with newer ones?
  • What constitutes a version? A single new record? A single modified record? A single modified field? A schema change?
  • Merge is not defined for many data types. History often implies mergeability, but many data types don’t have obvious merge strategies. Diffing fields like JSON requires an understanding of the format’s semantics (see the JSON sketch below), and merging binary blobs doesn’t make sense at all.
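
To make the data-structure point concrete, here is a rough Python sketch contrasting a file-oriented snapshot model (hash the whole table as one blob, the way a code VCS sees a large file) with a record-oriented one (hash each row). The table contents and IDs are made up for illustration; real systems use far more sophisticated chunking and indexing.

```python
import hashlib
import json

def blob_hash(rows):
    # File-oriented view: hash the whole table as a single blob,
    # the way a code VCS sees a large CSV or JSON file.
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()

def record_hashes(rows):
    # Record-oriented view: hash each row separately, keyed by id,
    # so a single edit only touches a single entry.
    return {
        row["id"]: hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()
        for row in rows
    }

v1 = [{"id": i, "value": i * 10} for i in range(1000)]   # made-up table
v2 = [dict(row) for row in v1]
v2[42]["value"] = 999                                     # one modified record

print(blob_hash(v1) == blob_hash(v2))    # False: the whole snapshot is "new"

h1, h2 = record_hashes(v1), record_hashes(v2)
print([rid for rid in h1 if h1[rid] != h2[rid]])          # [42]: only one row changed
```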
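
And a minimal sketch of why JSON diffs need semantic awareness: a text-level comparison flags reordered keys as a change, while a parsed comparison does not, and finding what actually changed means walking the structure field by field. The documents here are invented examples.

```python
import json

a = '{"name": "ada", "tags": ["ml", "etl"]}'
b = '{"tags": ["ml", "etl"], "name": "ada"}'   # same document, keys reordered

# Text-level comparison (what a line-based diff sees): a spurious difference.
print(a == b)                          # False

# Semantic comparison (parse first): no difference at all.
print(json.loads(a) == json.loads(b))  # True

# A real change has to be found structurally, field by field.
base = json.loads(a)
edited = json.loads('{"name": "ada", "tags": ["ml"]}')
print({key for key in base if base.get(key) != edited.get(key)})   # {'tags'}
```

Neither sketch solves merging; they only illustrate why line-oriented tooling reports differences that aren’t there and misses the structure that is.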