Screenshots as the Universal API

Oct 3, 2022

With advancements in machine learning, screenshots are quickly becoming a universal data format. It's now (relatively) easy to extract meaning (image-to-text), layout information (object recognition), text (optical character recognition, OCR), and other metadata (formatting, fonts, etc.).

Now, with diffusion-based models like Stable Diffusion and DALL-E, we have an encoder – text-to-image.

Screenshots-as-API solves a few problems:

  • Easier to parse than highly complex layout formats. When I wrote about Rethinking the PDF, I didn't consider images an alternative. But image models are generic and don't need to understand the PDF encoding. Screenshots-as-API could mean big changes for existing unstable APIs like web crawlers. Now that websites are mostly dynamic, it's difficult to fully hydrate a website, parse the layout, and extract the same experience that an end-user would get (open-source tools like Puppeteer from Google make this easier, but there are many edge cases). What if it were easier to parse a screenshot of the page?
  • Universally available, easily copyable. While images aren't the most efficient encoding method for text, they can be the simplest for humans. Excel has had a screenshot-to-table feature for some time because some tables are notoriously difficult to copy (how do you solve that generically at the text level?). You can copy objects from photos in the latest iOS 16 update.
  • Permissionless. Many applications won't allow you to export data. Screenshots are always available (similar to the era of web crawling).
  • More complex metadata. Look how effective image search is on mobile – you can search for people, places, things, and more. Some of this comes from the actual image metadata, but the rest is inferred with on-device models. Automatically encoding this data in traditional formats like PDF takes much longer.
An image is worth a thousand words.
Subscribe for daily posts on startups & engineering.

The Steffen Boarding Method

Oct 2, 2022

Boarding an airplane is no fun. Usually, it's done by airline status or back-to-front. In a wildly unprofitable industry where time is the major constraint – isn't there a quicker boarding method? A survey of queuing algorithms.

Front-to-back: This is essentially boarding the plane serially – the remainder of the plane is always empty while you wait for a few passengers to stow their luggage and find their seats.

Back-to-front: Mythbusters (2014) found that this is actually the slowest method other than front-to-back. People might have to get up to provide access to the window and middle seats. Some are left standing, waiting for other passengers to stow their luggage (for which there might not even be room).

Random ordering: Easiest to implement. Performs fairly well in all of these tests and simulations that I've read online. The fact that it does so well gives me hope that there's probably a much more optimal method.

Out-to-in: Board window seats first, then middle seats, and finally aisle seats.

Out-to-in (staggered): The Steffen method (2008) follows the out-to-in strategy but alternates between odd-numbered and even-numbered rows (e.g., window seats in odd rows, window seats in even rows, middle seats in odd rows, middle seats in even rows, etc.), so that passengers boarding together are not stowing luggage in adjacent rows. Steffen used a Markov Chain Monte Carlo model to test different methods. The model assumes that stowing luggage is the major constraint in boarding.

Out-to-in, back-to-front: Also known as the reverse pyramid, this method was developed by some researchers at Arizona State University in 2006. You can read an article about it here.

Slow/Fast: This paper (Lorentzian-geometry-based analysis of airplane boarding policies, 2019) finds that boarding slow passengers first is the most effective. The authors categorize "slow" passengers as those who require extra assistance (e.g., elderly or children) or those with overhead bin luggage.
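To make the orderings above concrete, here is a minimal Python sketch that generates a few of them and compares them in a toy aisle model. Everything in it is an assumption for illustration: the 3+3 cabin with seats A to F, the function names, reading "odd/even" as row parity, boarding each staggered wave back to front, and a model in which the only delay is a fixed number of ticks spent stowing luggage while blocking the aisle. It is not Steffen's model and it ignores seat shuffling entirely.

```python
import random
from typing import List, Optional, Tuple

Seat = Tuple[int, str]  # (row, letter); row 1 is at the front, next to the door

# Assumed cabin: one aisle, seats A-F in every row, A and F at the windows.
GROUPS = [("A", "F"), ("B", "E"), ("C", "D")]  # window, middle, aisle


def front_to_back(rows: int) -> List[Seat]:
    return [(r, s) for r in range(1, rows + 1) for s in "ABCDEF"]


def back_to_front(rows: int) -> List[Seat]:
    return [(r, s) for r in range(rows, 0, -1) for s in "ABCDEF"]


def random_order(rows: int, seed: int = 0) -> List[Seat]:
    seats = front_to_back(rows)
    random.Random(seed).shuffle(seats)
    return seats


def out_to_in(rows: int, seed: int = 0) -> List[Seat]:
    rng = random.Random(seed)
    order: List[Seat] = []
    for letters in GROUPS:
        wave = [(r, s) for r in range(1, rows + 1) for s in letters]
        rng.shuffle(wave)  # the order inside a wave is unspecified, so shuffle it
        order.extend(wave)
    return order


def staggered_out_to_in(rows: int) -> List[Seat]:
    order: List[Seat] = []
    for letters in GROUPS:
        for parity in (1, 0):  # odd rows first, then even rows
            for r in range(rows, 0, -1):  # assumption: back to front within a wave
                if r % 2 == parity:
                    order.extend((r, s) for s in letters)
    return order


def boarding_time(order: List[Seat], rows: int, stow_ticks: int = 3) -> int:
    """Ticks until everyone is seated; only stowing luggage blocks the aisle."""
    # aisle[pos] holds (target_row, stow_ticks_left) or None; index 0 is unused.
    aisle: List[Optional[Tuple[int, int]]] = [None] * (rows + 2)
    waiting = list(order)
    seated = 0
    ticks = 0
    while seated < len(order):
        ticks += 1
        for pos in range(rows, 0, -1):  # move the passengers furthest back first
            passenger = aisle[pos]
            if passenger is None:
                continue
            target, left = passenger
            if target == pos:  # at the right row: keep stowing, then sit down
                if left <= 1:
                    aisle[pos] = None
                    seated += 1
                else:
                    aisle[pos] = (target, left - 1)
            elif aisle[pos + 1] is None:  # otherwise step forward if there is room
                aisle[pos + 1] = passenger
                aisle[pos] = None
        if waiting and aisle[1] is None:  # next passenger steps through the door
            row, _letter = waiting.pop(0)
            aisle[1] = (row, stow_ticks)
    return ticks


rows = 10
for name, order in [
    ("front-to-back", front_to_back(rows)),
    ("back-to-front", back_to_front(rows)),
    ("random", random_order(rows)),
    ("out-to-in", out_to_in(rows)),
    ("out-to-in (staggered)", staggered_out_to_in(rows)),
]:
    print(f"{name:>22}: {boarding_time(order, rows)} ticks")
```

In this toy model the staggered order spreads each wave across rows that are two apart, so several passengers can stow at the same time instead of queueing behind one another. The absolute numbers mean nothing; only the relative ordering is worth looking at, and it will shift with the assumed stow time and cabin size.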

Of course, there are other considerations with these boarding methods:

  • Temporarily splitting up families and passengers traveling together
  • Intentionally unfair – consideration for airline status
  • Fair distribution of overhead bin space (who gets it when it's limited?)
  • No incentive to optimize – weak market forces in a government-subsidized industry with enormous barriers to entry.

While queueing algorithms are fascinating (not as much as elevator algorithms), maybe there's an outside-of-the-box solution.

  • Boarding times increased as airlines increased prices for checked luggage. Fewer passengers bringing carry-on luggage means faster boarding times.
  • Today's algorithms are enacted by seat block because it hasn't been practical to call out specific passengers or seat numbers to board. Now, we have the technology to notify individual passengers via mobile. This could be used to make the algorithms more granular.
  • Would the flight crew be faster at stowing luggage than passengers? Can carry-on luggage be stowed ahead of time?
  • Assigned seats vs. unassigned seats (e.g., Southwest)

The Promise of Write Once, Run Anywhere

Oct 1, 2022

The promise of a runtime that lets programmers Write once, run anywhere has been a recurring one for decades. While many of the technologies that tried to fulfill the promise went on to be successful (Java), the goalposts continued to move, and there was not one runtime to rule them all.

Portability is often the goal.

The idea can be traced all the way back to the 1960s with the start of virtualization (the IBM M44/44X that emulated multiple IBM 7044 mainframes).

The catchy slogan wouldn't come around until 1995 when Sun Microsystems was marketing Java and the JVM. While Java had many advantages, its portability was instrumental to its success.

It continued in the mobile era with React Native and Flutter: the promise of creating an application that targets both iOS and Android. While it works in many cases, targeting the native platform is the optimal choice for teams with enough resources (see Facebook Messenger).

The write once, run anywhere story continued at the infrastructure level with Docker and containers – providing portability of applications (on any system that had an up-to-date Linux kernel). Of course, the JVM and container layers sometimes clash: portability can exist at one layer but not another. JVM-supported architectures may require different Docker images, and it isn't always clear which layer gets to manage resource limits.

Electron applications let developers write cross-platform desktop applications.

The newest entry is WebAssembly. Many LLVM-based languages can target WebAssembly bytecode. That means you can run the same code on a variety of platforms – in the browser, on the server, or at the edge.

Even if it doesn't become a universal runtime, a new technology that creates greater portability for developers creates new opportunities.

Why Stadia Failed

Sep 30, 2022

Google is shutting down its streaming game platform, Stadia. They are refunding all Stadia purchases – both hardware and software (most likely to break the meme).

The model makes sense (and I believe the demand is there):

  • Consumers don't need to buy specialized hardware.
  • Zero downtime spent downloading or updating games.
  • New business models and go-to-market: freemium, subscription, or licensing.
  • New markets: casual gamers, streaming creators, etc.

Some predictions on why it failed

  • Organizational dysfunction (high level) – hard to create cross-team integrations, e.g., between Stadia and YouTube, Chromecast, Chrome, etc.
  • Technical challenges – Latency is not a solved problem for most markets and requires massive external infrastructure and political movement (see Google Fiber's failure).
  • Wrong market – the best games don't need Stadia, but casual games could benefit from Stadia's low friction. However, Stadia launched with name-brand and high-end titles. Google could even have repurposed the technology to stream other applications.
  • Lack of content – maybe there weren't enough games? Time will tell how Microsoft's Xbox Cloud Gaming turns out, which has more titles (as Microsoft is also a publisher).
  • Poor business development – The game industry is new for Google, and navigating it might have proven to be too difficult. Much like the company has struggled in Cloud, the company's ad-tech DNA might not translate to gaming. Many game studios have a deep connection to the Microsoft stack – from graphics card drivers to developer tools. Google is a Linux shop.

Of course, maybe Google's management is acting too conservatively and making a mistake by prematurely shutting down Stadia. In a bull market, Google might not have been faced with this decision. It will be hard for Google to rationalize getting back into this market now that they've left (see Google Code/GitHub).


Thoughts on GitHub Actions

Sep 29, 2022

GitHub Actions is probably the closest thing to good CI/CD we've seen in the market for a while. That's because, historically, there are two glaring problems with CI/CD startups – the problem is so generalized that the product ends up being a distributed job scheduler, and the margin over cloud storage and compute is thin and undifferentiated (compared to other SaaS).

Things GitHub Actions gets right

  • Container-native. Container-native CI systems are slowly replacing systems like Jenkins. There are some issues (building Docker-in-Docker, caching, and dependencies), but GitHub Actions handles them reasonably well.
  • Reusability, in theory. While the end implementation leaves much to be desired, the idea of composable actions and libraries for common CI/CD actions is an exciting one.
  • DAG, in theory. There's been a variety of push and pull between event-based CI/CD workflows vs. graph-based workflows. After working with both extensively, DAGs are much easier to reason about.
  • Limited access to the machine. It's tempting to let developers SSH into build machines to debug issues, install software, and perform other tasks. Cattle, not pets.
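One reason graph-based workflows are easier to reason about is that the execution order falls straight out of the dependency graph. The sketch below uses Python's standard-library graphlib to turn a hypothetical set of jobs into "waves" that could run in parallel. The job names and the dictionary are made up for illustration, loosely mirroring how a job lists the jobs it needs; this is not how GitHub Actions schedules jobs internally.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
JOBS: Dict[str, Set[str]] = {
    "lint": set(),
    "build": set(),
    "unit-test": {"build"},
    "integration-test": {"build"},
    "deploy": {"lint", "unit-test", "integration-test"},
}


def execution_waves(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group jobs into waves; every job in a wave has all dependencies done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave finished successfully
    return waves


for number, wave in enumerate(execution_waves(JOBS), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

With an event-based design, the same ordering has to be reconstructed by tracing which event triggers which handler; with a DAG, a cycle or a missing dependency is an error you can detect before anything runs.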

Where I think we could still improve

  • Not easily self-hosted. Actions, as a business selling compute and storage, won't scale. Besides the improved UX and better distribution, it's not differentiated from TravisCI and similar products. Let it be a true on-ramp for selling cloud services (in my own VPC). I want AWS to build a similar type of library (open-source) that makes it dead simple to do self-hosted runners. Self-hosting a runner on GitHub Actions is clunky right now. I'd even do it on Azure as a managed service.
  • Native cloud IAM. CI/CD machines are notorious security holes. They usually have permission to deploy anywhere (including production!) and can be triggered by anyone who can push code. You can configure this through Actions, but why not have it baked into the framework (as an API, endpoint, or other configuration)? There are permissions, but I want to bring my existing IAM configurations.
  • As an aside, caching Docker (or BuildKit) builds on a few self-hosted runners will be significantly more effective for the majority of teams than GitHub Actions' caching mechanisms.
  • Code, not YAML. Maybe contrarian, but I believe that TypeScript will become the lingua franca of configuration as code. Much simpler for developers to reason about a typed schema and injected variables than with custom YAML templating.
  • Not easily run locally. Like self-hosted runners, you can do this, but it wasn't built into the framework as a first-class design decision. CI/CD cycles are long (that's part of the problem!), and pushing job configuration just to test it creates long feedback cycles.

Simple Group Theory

Sep 28, 2022

Learning group theory through a simple example. No math knowledge needed.

Start with a triangle with vertices labeled 1,2,3, starting from the top and going clockwise. You can perform two functions on the triangle – either a horizontal flip (f) or a clockwise rotation (r).

You can already see some patterns. For example, three rotations (let's use the notation rrr or r^3) give you the original orientation. Two flips (ff or f^2) also give you the original orientation.

Can you generate all possible orientations of the triangle only using horizontal flips and clockwise rotations? Hint: there are 6.

Try it on a piece of paper.

Some other interesting properties you might have found:

  • Given any orientation, you can always do a series of flips and/or rotations to bring you back to the original.
  • Harder, but "rearranging the parentheses" doesn't matter. These two function compositions give the same triangle: f(rr) and (fr)r.

With just flips and rotations, you can generate all permutations of a triangle. Together, these elements make up a mathematical group (specifically the symmetric group S3).
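If you'd rather check the pencil-and-paper exercise by machine, here is a small Python sketch. It assumes an orientation is stored as the labels read clockwise from the top, (top, bottom-right, bottom-left), that the flip mirrors the triangle left to right, and that the words it records list moves in the order they are applied. Those encoding choices are mine, not part of the original example.

```python
from collections import deque
from typing import Dict, Tuple

Triangle = Tuple[int, int, int]  # labels at (top, bottom-right, bottom-left)


def r(t: Triangle) -> Triangle:
    """Clockwise rotation: every label moves one corner clockwise."""
    top, right, left = t
    return (left, top, right)


def f(t: Triangle) -> Triangle:
    """Horizontal flip: the top stays put, the two bottom corners swap."""
    top, right, left = t
    return (top, left, right)


def generate(start: Triangle = (1, 2, 3)) -> Dict[Triangle, str]:
    """Breadth-first search over everything reachable with r and f.

    Maps each orientation to one shortest word that produces it, with the
    moves written in the order they are applied ("" is the starting triangle).
    """
    words: Dict[Triangle, str] = {start: ""}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for name, move in (("r", r), ("f", f)):
            nxt = move(current)
            if nxt not in words:
                words[nxt] = words[current] + name
                queue.append(nxt)
    return words


orientations = generate()
for triangle, word in orientations.items():
    print(triangle, word or "(start)")
print(len(orientations), "orientations in total")
```

The search stops on its own because every new flip or rotation eventually lands on an orientation that has already been seen.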

Here's a diagram of the elements and functions. Because of the symmetry, it looks nice (you can draw these for any symmetric group; they are called Cayley graphs).

By doing this example, you've inadvertently proven the three axioms you need to show that this is a group.

The formal definition of a group and a short proof:

A group is a set G with a binary operation on G that satisfies these four axioms¹:

  • Associativity: For all a, b, c in G, (a • b) • c = a • (b • c).
  • Identity: There exists an element e in G such that, for each a in G, e • a = a and a • e = a.
  • Inverse: For each a in G, there exists an element b in G such that a • b = e and b • a = e, where e is the identity element.
  • Closure: For each a, b in G, a • b and b • a are contained in G.

For the symmetric group S3 above, let the elements be the triangle orientations expressed as flips and rotations: {1, r, f, r^2, rf, r^2f}. You can verify that these line up with the unique triangles in the diagram above. We write 1 for r^3 or f^2 because it is the identity element.

You can prove all the axioms above by observation since there are only six elements.
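"By observation" can also mean "by brute force". The sketch below, under the assumption that each element is encoded as a permutation of the corner positions 0, 1, 2 and that the operation is function composition, checks every axiom over all combinations of the six elements. The names are illustrative only.

```python
from itertools import permutations, product
from typing import List, Tuple

Perm = Tuple[int, ...]  # p[i] is where position i is sent

G: List[Perm] = list(permutations(range(3)))  # the six elements of S3
IDENTITY: Perm = (0, 1, 2)


def compose(p: Perm, q: Perm) -> Perm:
    """Apply q first, then p (ordinary function composition)."""
    return tuple(p[q[i]] for i in range(3))


closure = all(compose(a, b) in G for a, b in product(G, repeat=2))

associativity = all(
    compose(compose(a, b), c) == compose(a, compose(b, c))
    for a, b, c in product(G, repeat=3)
)

identity = all(compose(IDENTITY, a) == a == compose(a, IDENTITY) for a in G)

inverse = all(
    any(compose(a, b) == IDENTITY == compose(b, a) for b in G) for a in G
)

# Not an axiom, but worth noticing: the order of operations matters.
commutative = all(compose(a, b) == compose(b, a) for a, b in product(G, repeat=2))

print("closure:      ", closure)
print("associativity:", associativity)
print("identity:     ", identity)
print("inverse:      ", inverse)
print("commutative:  ", commutative)
```

The last check is there because a group does not have to be commutative, and this one is not: flipping then rotating is a different orientation from rotating then flipping.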

A more rigorous proof actually proves that all bijections (e.g., a permutation) from a set to itself form a group under function composition (a small example of the Inventor's Paradox!).

Other examples of groups:

  • The set of integers (positive, negative, and zero) under addition. Multiplication does not work: for a = 2, there is no integer b such that 2 • b = 1 (1/2 is not an integer).
  • The set of nonzero fractions ("rational numbers") under multiplication.
  • The set of all fractions, including zero, under addition (under multiplication, zero would have no inverse).
  • The set of legal moves on a Rubik's cube.

Other fun group facts:

Number theory and cryptography make heavy use of groups. Elliptic curve cryptography uses the group formed by the points on an elliptic curve.

We've completely classified all finite simple groups, the building blocks of every group with a finite set of elements (the proof is over 10,000 pages long). There are 18 families of groups that follow a simple pattern, and 26 groups that don't. Of those 26, there's a group called the Monster Group that contains 20 of them (including itself). It has 808,017,424,794,512,875,886,459,904,961,710,757,005,754,368,000,000,000 elements.
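A number that long is easy to mistype, so here is a quick sanity check. It rebuilds the count from the prime factorization usually quoted for the order of the Monster Group (the factorization is not from this post; it is the standard one found in reference tables) and compares it with the figure above.

```python
from math import prod

# Prime factorization usually quoted for the order of the Monster Group.
FACTORS = {
    2: 46, 3: 20, 5: 9, 7: 6, 11: 2, 13: 3,
    17: 1, 19: 1, 23: 1, 29: 1, 31: 1, 41: 1, 47: 1, 59: 1, 71: 1,
}

QUOTED = int(
    "808,017,424,794,512,875,886,459,904,961,710,757,005,754,368,000,000,000"
    .replace(",", "")
)

order = prod(prime ** exponent for prime, exponent in FACTORS.items())

print("digits:", len(str(order)))
print("matches the quoted figure:", order == QUOTED)
```

Python integers have arbitrary precision, so the product is computed exactly rather than as a floating-point approximation.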

Cayley graph of the symmetric group S4

¹ Usually denoted as three axioms, as closure is given by the shorthand "binary operation on G", which implies a map from G × G into G.