Virgo: a Graph-based Configuration Language

Over the last few years, I've worked on open source distributed systems in Go at Google. I've thought a lot about dependency management, systems configuration, programming languages, and compilers.

Again and again, I saw the same fundamental data structure underpinning these technologies: the directed acyclic graph. The most frustrating part was modeling graph-based configuration in languages that optimized for hierarchical data structures. That's why I created Virgo.

Virgo is a graph-based configuration language. It has two main features: edge definitions and vertex definitions. A Virgo configuration file parses into an adjacency list. You can achieve similar results by layering extra conventions and restrictions on top of YAML or JSON, but much like YAML optimizes for human readability, Virgo optimizes for natural graph readability, editability, and representation.

// config.vgo

a -> b, c, d -> e <- f, g
A graphical representation of the Virgo graph
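
To make the adjacency list idea concrete, here is a hand-written sketch of what the one-line example above might expand to. It assumes that each comma-separated group connects to every vertex in the neighboring group, and the Graph type and exampleGraph function are names made up for this illustration, not the types that Virgo's parser returns.

```go
package main

import "fmt"

// Graph is a plain adjacency list: each vertex maps to the vertices it points to.
// This type is hypothetical and is not the struct returned by the Virgo library.
type Graph map[string][]string

// exampleGraph spells out the edges that `a -> b, c, d -> e <- f, g`
// is assumed to describe.
func exampleGraph() Graph {
	return Graph{
		"a": {"b", "c", "d"},
		"b": {"e"},
		"c": {"e"},
		"d": {"e"},
		"e": {},
		"f": {"e"},
		"g": {"e"},
	}
}

func main() {
	g := exampleGraph()
	for _, v := range []string{"a", "b", "c", "d", "e", "f", "g"} {
		fmt.Println(v, "->", g[v])
	}
}
```

Eight edges fit on a single line of Virgo, while the equivalent map needs one entry per vertex.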

Virgo is open to proposals and language changes. Please open up an issue to start a discussion at https://github.com/r2d4/virgo.

Graphs are everywhere in configuration management. One graph that engineers may be familiar with is the Makefile target graph. The make tool topologically sorts the targets that it resolves, which lets it build the files in order. Virgo's CLI and Go library let developers replicate this feature easily.

clean -> parser, lexer -> "src files" -> test

parser = `goyacc parser.y`
lexer  = `golex lex.l`
clean  = `rm lex.yy.go parser.go || true`
test   = `go test -v`
"src files"  = `go build ./...`
A simple example to build the Virgo CLI tool with the language itself.

There are two entry points for parsing a Virgo file. You can use the Go library found in the same repository to parse the file into a native Go struct. There is also a published CLI binary that exposes the parsing function for other environments.

package main

import (
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/pkg/errors"
	"matt-rickard.com/virgo/pkg/virgo"
)

func main() {
	// log.Fatal prints the error and exits with status 1.
	if err := run("config.vgo"); err != nil {
		log.Fatal(err)
	}
}

// run parses the Virgo file, topologically sorts its graph, and prints
// the vertex definitions in sorted order.
func run(fname string) error {
	f, err := os.ReadFile(fname)
	if err != nil {
		return errors.Wrap(err, "reading file")
	}
	g, err := virgo.Parse(f)
	if err != nil {
		return errors.Wrap(err, "parsing virgo file")
	}

	nodes, err := virgo.TopSort(g)
	if err != nil {
		return errors.Wrap(err, "topological sort")
	}

	// Collect each vertex's definition in topological order.
	out := []string{}
	for _, n := range nodes {
		out = append(out, g.Vertices[n]...)
	}
	fmt.Println(strings.Join(out, "\n"))
	return nil
}
Code snippet to read a Virgo file, topologically sort the graph and print out the vertex definitions for each node in order.
$ virgo run build.vgo
Or build with the CLI tool

One operation we frequently want to perform on graphs is a topological sort. A topological sorting is a linear ordering of vertices such that for every directed edge u -> v, vertex u comes before v in the ordering.
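
For readers who want to see the definition in action, here is a minimal sketch of a topological sort using Kahn's algorithm over a plain map. It is not how the Virgo library implements TopSort; the topSort function and the way the build graph's edges are spelled out are assumptions made for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// topSort returns the vertices of g ordered so that for every edge u -> v,
// u appears before v. g maps each vertex to its successors.
func topSort(g map[string][]string) ([]string, error) {
	// Count incoming edges for every vertex, including those that only
	// appear as a successor.
	indegree := map[string]int{}
	for u, vs := range g {
		if _, ok := indegree[u]; !ok {
			indegree[u] = 0
		}
		for _, v := range vs {
			indegree[v]++
		}
	}

	// Start with the vertices that have no incoming edges.
	var ready []string
	for v, d := range indegree {
		if d == 0 {
			ready = append(ready, v)
		}
	}
	sort.Strings(ready) // sorted so the output is deterministic

	var order []string
	for len(ready) > 0 {
		u := ready[0]
		ready = ready[1:]
		order = append(order, u)

		next := append([]string(nil), g[u]...)
		sort.Strings(next)
		for _, v := range next {
			indegree[v]--
			if indegree[v] == 0 {
				ready = append(ready, v)
			}
		}
	}

	// Any vertex left unvisited is part of a cycle.
	if len(order) != len(indegree) {
		return nil, errors.New("graph has a cycle")
	}
	return order, nil
}

func main() {
	// The build graph from earlier, with its edges written out by hand.
	g := map[string][]string{
		"clean":     {"parser", "lexer"},
		"parser":    {"src files"},
		"lexer":     {"src files"},
		"src files": {"test"},
	}
	order, err := topSort(g)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order)
}
```

A graph can have several valid orderings. Here parser and lexer may run in either order, as long as both come after clean and before "src files".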

The CLI tool topologically sorts the graph, and can even start from a particular vertex (analogous to a Make target).

$ virgo run build.vgo:parser

Build systems are not the only type of configuration schema that can benefit from a graphical representation. Some other examples include:

  • deployment of microservices
  • docker build instructions
  • continuous integration pipelines
  • package dependencies
  • git commits

For full documentation on the language and features of Virgo, visit the GitHub page https://github.com/r2d4/virgo.

Reproducibility in Practice

What is reproducibility in software engineering? It works on my machine

Reproducibility is the confidence that an application will behave similarly in development, test, and production. Reproducibility can make it easier to track down bugs and fix them.

Reproducibility is a confidence interval, not an absolute concept. It forces the engineer to understand the tradeoffs between flexibility and correctness and determine under which conditions reproducibility is desired or necessary.

Here are a few examples of reproducibility in practice.

Vendoring

Vendoring is the process of saving the state of all software dependencies. Usually this takes the form of downloading all the dependencies and committing them to source control. Examples include npm's node_modules and Go's vendor folder.

Why is vendoring useful?

  • Resolved compile-time or runtime dependencies may differ between environments. A package may already exist locally in a different version, or a maintainer may have updated the package on the remote package repository. By committing a tested set of dependencies, other developers can be more confident that the project will build and run on their machine.

Reproducible Vendoring

Committing dependencies to the repository is a good step, but how were those dependencies resolved?

  • When it comes time to update, how do we know what transitive dependencies will also need to be updated?
  • How do we trust that a developer did not sneak in malicious code when committing a large number of vendored dependencies?

By having a manifest of checksums and versions for each user-specified dependency, along with a program that can "solve" the transitive dependencies based on that file, we can solve both of these problems. The same program can be run in CI to shift the trust model to the solver binary instead of the code review. The solver binary will also be able to update dependencies transitively.

Examples: Go's go.mod, npm's package-lock.json.
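
The checksum half of that idea can be sketched in a few lines of Go. This is a toy illustration, not how go.mod or package-lock.json tooling works: the manifest is an in-memory map, the dependency names are made up, and the checksum and verify functions are hypothetical helpers.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// checksum returns the hex-encoded SHA-256 digest of b.
func checksum(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// verify returns the sorted names of dependencies that are missing,
// altered, or not listed in the manifest at all.
func verify(manifest map[string]string, vendored map[string][]byte) []string {
	var bad []string
	for name, want := range manifest {
		content, ok := vendored[name]
		if !ok || checksum(content) != want {
			bad = append(bad, name)
		}
	}
	for name := range vendored {
		if _, ok := manifest[name]; !ok {
			bad = append(bad, name)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	// Hypothetical vendored dependencies, keyed by made-up import paths.
	vendored := map[string][]byte{
		"example.com/liba": []byte("liba v1.2.0 source"),
		"example.com/libb": []byte("libb v0.4.1 source"),
	}
	// The manifest records the checksum of each dependency as it was tested.
	manifest := map[string]string{
		"example.com/liba": checksum(vendored["example.com/liba"]),
		"example.com/libb": checksum(vendored["example.com/libb"]),
	}

	// Simulate a change slipped into a large vendoring commit.
	vendored["example.com/libb"] = []byte("libb v0.4.1 source + backdoor")

	fmt.Println("mismatched:", verify(manifest, vendored))
}
```

Running a check like this in CI means a reviewer only has to read the manifest diff, not every vendored file.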

Declarative Configuration

Declarative configuration, in contrast to imperative configuration, describes a desired state of software, rather than the explicit commands to create that state.

Like vendoring, the declarative model is more reproducible because the imperative model does not account for the current state of the environment. Has the application already been deployed? Does a folder exist or need to be created? While the imperative model can imperatively check all of these conditions initially at runtime, the state may change over time and produce undesirable results. Once you start watching the state continuously, you've arrived at the declarative model.

Like reproducible vendoring, the declarative model is about shifting the reproducibility burden to an application - in this case, a reconciler or a controller that manages the state.
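
The core of such a reconciler can be sketched as a function that diffs desired state against observed state. The sketch below is a simplification under stated assumptions: state is modeled as a map from a resource name to a version string, and the reconcile function and the resource names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// reconcile compares the desired state with the actual state and returns
// the actions needed to make the actual state match, sorted for stable output.
func reconcile(desired, actual map[string]string) []string {
	var actions []string
	for name, want := range desired {
		have, ok := actual[name]
		switch {
		case !ok:
			actions = append(actions, "create "+name)
		case have != want:
			actions = append(actions, "update "+name)
		}
	}
	// Anything running that is not declared gets removed.
	for name := range actual {
		if _, ok := desired[name]; !ok {
			actions = append(actions, "delete "+name)
		}
	}
	sort.Strings(actions)
	return actions
}

func main() {
	desired := map[string]string{"web": "v2", "db": "v1"}
	actual := map[string]string{"web": "v1", "cache": "v1"}

	for _, a := range reconcile(desired, actual) {
		fmt.Println(a)
	}
}
```

A controller would call a function like this in a loop, re-reading the actual state each time, so the same declaration converges to the same result regardless of where the environment started.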

Most importantly, declarative configuration allows for infrastructure as code. It allows you to codify the state of the infrastructure, which means that it can be reproduced more easily.

Containers

Containers can be used to provide reproducibility in terms of the root filesystem, environment variables, PID, and user.

Containers can provide reproducibility in two aspects:

  • Runtime reproducibility
  • Build reproducibility

By ensuring the rootfs, environment variables, and user in running deployments, we can reduce the possibility of an ill-provisioned node, badly behaved sibling processes, and unexpected filesystem state.

In contrast to the previous strategies for reproducibility, containers are about creating a specification bundle that behaves the same on any Linux kernel. Namespaces make sure that the process's view of the world looks the same regardless of the state of the world.

You can think of build reproducibility in a similar manner, except that it concerns the state of the world when the artifact is built rather than when it is run. Note well: the Dockerfile doesn't provide great reproducibility, in the sense that it still doesn't solve the issue of vendored dependencies or the availability and reproducibility of networked dependencies, but it is a step in the right direction.

Byte-for-byte reproducible builds on the same "environment"

Build a binary, take the checksum, and build it again on a similar machine. Chances are, you won't get the same checksum as you did before. Why?

Compilers like GCC can capture the build path and embed it in the compiled output, and nondeterministic random IDs can be injected as well.

Build systems like Bazel, Pants, and Buck are all aiming to be reproducible build systems.

There is an effort to make Debian packages reproducible as well.