Implementing LLMs in the Browser

LLMs are coming to the browser. Inference there is still really slow, but running the computation on the client side is much cheaper than paying for server-side inference. The browser is also the ultimate delivery mechanism: no downloading packages, setting up a programming environment, or getting an API key. Of course, client-side LLMs won’t be used for everything. Initially, expect testing, playgrounds, and freemium getting-started experiences for products.

There are generally two strategies for getting LLMs working in the browser:

Compile C/C++ or Rust to WebAssembly. Take a fairly vanilla library like ggml and use Emscripten to compile it to WebAssembly (examples: a Wasm fork of ggml, WasmGPT). Optionally, target the new WebGPU runtime, as WebLLM does.

Implement transformers in vanilla JavaScript. Transformers.js is one example. These models don’t have the most complicated architecture; typically, they can be implemented in less than a thousand lines of code (nanoGPT is 369 lines, with comments). You might also target WebGPU with this strategy, as WebGPT does.
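To give a feel for how small the core of the second strategy is, here is a minimal sketch of single-head causal self-attention in plain TypeScript. It is not taken from nanoGPT, Transformers.js, or WebGPT. The names `Matrix`, `softmax`, and `causalAttention` are made up for illustration, and the sketch assumes the query, key, and value matrices are already computed, so it leaves out the learned projections, multiple heads, layer norm, and the MLP that a full model needs.

```typescript
// A matrix is an array of rows: one row per token position.
type Matrix = number[][];

// Numerically stable softmax over one row of attention scores.
function softmax(row: number[]): number[] {
  const max = Math.max(...row);
  const exps = row.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Single-head causal self-attention.
// q, k, v each have shape [positions][dim]; position t attends only to positions <= t.
function causalAttention(q: Matrix, k: Matrix, v: Matrix): Matrix {
  const dim = q[0].length;
  const scale = 1 / Math.sqrt(dim);
  return q.map((qt, t) => {
    // Scaled dot product of this query with every key up to and including position t.
    const scores = k
      .slice(0, t + 1)
      .map((ks) => scale * qt.reduce((acc, x, i) => acc + x * ks[i], 0));
    const weights = softmax(scores);
    // Weighted average of the value rows that are visible from position t.
    return v[0].map((_, j) => weights.reduce((acc, w, s) => acc + w * v[s][j], 0));
  });
}
```

With all-zero queries every visible position gets the same weight, so each output row is just the running average of the value rows. Nested arrays keep the sketch readable; a real implementation would more likely use flat `Float32Array` buffers.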

Now, combine a WebAssembly LLM in the browser with a WebAssembly Python interpreter in the browser, and you might get some interesting applications that are sandboxed by default.
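The "sandboxed by default" part comes from how a WebAssembly module is instantiated: it can reach the host only through the import object it is handed. The sketch below shows that with the standard `WebAssembly.Module` and `WebAssembly.Instance` constructors. The hand-assembled `add` module is a tiny stand-in chosen for illustration. A real build of an LLM or a Python interpreter would be produced by a toolchain like Emscripten and loaded asynchronously.

```typescript
// A hand-assembled Wasm module exporting add(a: i32, b: i32) -> i32.
// It stands in for a real compiled library, purely for illustration.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic "\0asm" and version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function section: one function using type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export section: "add" = function 0
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // code: local.get 0, local.get 1, i32.add
]);

// The import object is the module's entire view of the outside world.
// It is empty here, so the module gets no access to the DOM, network, or files.
const imports = {};

// Synchronous compilation is fine for a module this small.
const wasmModule = new WebAssembly.Module(wasmBytes);
const instance = new WebAssembly.Instance(wasmModule, imports);

const add = instance.exports.add as (a: number, b: number) => number;
console.log(add(2, 3)); // 5
```

Anything the module should be able to do beyond pure computation, such as logging or fetching weights, has to be passed in explicitly through `imports`.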

WebGPU will ship in Chrome on May 2nd. Unlike WebGL, it exposes more advanced GPU features and general-purpose compute primitives.
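Since WebGPU won’t be available everywhere, code that targets it will probably want a fallback. Here is a minimal sketch of that check, assuming a Wasm build exists to fall back to. `navigator.gpu.requestAdapter()` is the real WebGPU entry point, while `GpuLike`, `NavigatorLike`, `Backend`, and `pickBackend` are hypothetical names for this example, declared locally so the sketch compiles without WebGPU type definitions.

```typescript
// Minimal structural types (assumptions for this sketch) covering only what we call.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = "webgpu" | "wasm";

// Prefer WebGPU when the browser exposes it and an adapter is available;
// otherwise fall back to a WebAssembly build.
async function pickBackend(nav: NavigatorLike | undefined): Promise<Backend> {
  if (!nav || !nav.gpu) return "wasm";
  try {
    // requestAdapter() resolves to null when no suitable GPU adapter is found.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm";
  }
}
```

In a page, the call would look something like `pickBackend(navigator as NavigatorLike)`. Passing the navigator in as an argument keeps the function easy to exercise outside a browser.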