The New AI Moats

“We Have No Moat, And Neither Does OpenAI,” a supposedly leaked document from Google, makes some interesting points. The competitive landscape shifts, and so do the moats.

What is no longer a moat

Data is no longer a moat. For example, GPT-3 and Stable Diffusion were trained on public datasets by companies or groups with zero proprietary data. Now, model arbitrage captures any difference between publicly available models: just use one to generate training data for another. But what about code training data on GitHub? The Pile (the dataset used for many of the open-source LLMs) includes more than 192,000 GitHub repositories with over 100 stars. Plus, other LLMs are far from the only way to generate synthetic training data for code.
Foundational models are no longer a moat. I’ve written about this several times over the last 2 and a half years.

Nov ’21 — Open-sourced GPT-J

Jul ’22 — OPT, DiffusionLM, Commoditization of LLMs: Part 1

Aug ’22 — Stable Diffusion, Midjourney, Commoditization of LLMs: Part 2

Feb ’23 — LLaMA, Commoditization of LLMs: Part 3

Capital is no longer a moat (for training). Training these models through model distillation (arbitrage) or techniques like LoRA is becoming dramatically cheaper. Alpaca’s training data was generated for less than $500 in OpenAI API usage. There’s an appetite in the private markets to fund competitors (Anthropic, Cohere, StabilityAI, HuggingFace). That said, I think the market will move faster than any one company can build and train large, expensive models (new architectures, better techniques, etc.).
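To make the LoRA point concrete, here is a back-of-the-envelope sketch of why it cuts training cost. LoRA freezes a weight matrix of shape d_out x d_in and trains two small matrices of rank r instead, so the trainable parameter count drops from d_out * d_in to r * (d_out + d_in). The layer width of 4096 and the rank of 8 below are illustrative assumptions, not figures from any particular model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when updating the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update W + B @ A,
    where B is (d_out x rank) and A is (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative (assumed) sizes: one square projection layer, low rank.
d = 4096
rank = 8

full = full_finetune_params(d, d)   # 16,777,216
lora = lora_params(d, d, rank)      # 65,536
reduction = full / lora             # 256.0

print(f"full fine-tune: {full:,} trainable params")
print(f"LoRA (r={rank}): {lora:,} trainable params")
print(f"reduction: {reduction:.0f}x fewer trainable params for this layer")
```

This only counts trainable parameters for a single layer. Actual savings in dollars depend on the model, the hardware, and how many layers get adapters, but the order-of-magnitude gap is what makes cheap fine-tuning possible.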

The new (old) moats in AI

Distribution. We still don’t know the most important distribution channel for these models. Will it be the web, mobile, desktops, laptops, or another form factor? Google’s moat is stronger in some of these than others. Chrome is near peak popularity, and Chromium is the de facto standard for all non-mobile browsers. Pixel and Android aren’t iPhone and iOS, but there are still over 3 billion active Android devices. So who is best positioned given different outcomes? LLMs in the browser: Google. LLMs on-device: Apple. (Very Large) LMs over the web: OpenAI (for now). Thousands of small LLMs: HuggingFace, AWS (for now). Best for code: Microsoft (VSCode, GitHub, etc.). If RLHF is most important: user-facing SaaS.
Brand (narrative). The biggest blow to Google’s moat is narrative and brand. Before, Google was seen as the center of AI innovation. Now, the narrative has slipped away, and Google is cast as slow, bureaucratic, outdated, and out of touch. The reality is probably somewhere in the middle, but the narrative can be self-fulfilling. OpenAI captured the narrative with the launch of ChatGPT.
Talent. Open source can solve many problems (Linus’s Law — with enough eyeballs, all bugs are shallow). But there are many things that open source can’t do. Foundational research is one (but it’s expensive). There’s a reason why Transformers were invented at Google.
Hardware. The biggest AI companies have partnered with cloud or hardware providers. Why? Training and inference are expensive. Access to GPUs can be limited if you’re on your own. Google may suffer from diseconomies of scale due to open-source alternatives to their stack, but vertical integration might be a benefit depending on how things play out.
Regulation. Regulation makes it harder for new companies to enter the market. Meanwhile, companies with deep pockets and deep connections (Google, Microsoft, etc.) might be able to work with regulators to enact favorable regulations around AI systems.