React LLM: Run Models in the Browser with Headless Components

May 15, 2023

react-llm is a set of headless React components to run an LLM completely clientside in the browser with WebGPU, starting with useLLM.

There’s a live demo running on chat.matt-rickard.com. To show how to use the library (it’s designed for you to bring your own UI), I put together a quick retro UI that looks like an AOL Instant Messenger window with a “SmartestChild” buddy. It only works on the newest versions of Chrome (113+) on desktop.

LLMs are both (1) expensive to run inference on and (2) hard to self-host. There’s been a lot of work to run these models in the browser (“the new OS”), but the results are tough to set up and integrate into modern front-end frameworks. What if you could serve models entirely clientside? With WebGPU shipping, that’s starting to become a reality.
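
Since support is still rolling out, an app should feature-detect WebGPU before trying to load a model. WebGPU is exposed as navigator.gpu, so a minimal check (in TypeScript) looks like this; how you fall back on unsupported browsers is up to you:

// Returns true if the browser exposes WebGPU (Chrome 113+ on desktop, at the time of writing).
export function supportsWebGPU(): boolean {
  return typeof navigator !== "undefined" && "gpu" in navigator;
}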

react-llm sets everything up for you: an off-the-main-thread worker that fetches the model from a CDN (HuggingFace), loads the cross-compiled WebAssembly components (like the tokenizer and model bindings), and manages the model state (the attention KV cache, and more). Everything runs clientside: the model is cached and inference happens in the browser. Conversations are stored in session storage.

  • Everything about the model is customizable, from the system prompt to the user and assistant role names (see the sketch after this list).
  • Completion options like max tokens and stop sequences are available in the API.
  • Supports the LLaMA family of models (starting with Vicuna 13B).
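
A rough sketch of what that configuration might look like (the prop and field names below are illustrative assumptions, not the library’s documented API; see the docs for the real surface):

// Illustrative sketch only: prop and field names are assumptions, not the documented API.
<ModelProvider
  config={{
    systemPrompt: "You are SmartestChild, a friendly AIM buddy.",
    userRoleName: "User",
    assistantRoleName: "SmartestChild",
    maxTokens: 250,
    stopSequences: ["User:"],
  }}
>
  <YourApp />
</ModelProvider>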

The API is simple — use it as a React hook or context provider:

<ModelProvider>
    <YourApp />
</ModelProvider>

Then, in your component:

const {send, conversation, init} = useLLM()
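
Wired into a component, a minimal chat UI might look roughly like this (the import path, the shape of conversation, and the exact signatures are assumptions for illustration; the docs have the real types):

// Rough sketch: the import path, conversation shape, and function signatures are assumptions.
import { useState } from "react";
import { useLLM } from "@react-llm/headless"; // package name assumed

function Chat() {
  const { send, conversation, init } = useLLM();
  const [input, setInput] = useState("");

  return (
    <div>
      <button onClick={() => init()}>Load model</button>
      <ul>
        {conversation?.messages?.map((m, i) => (
          <li key={i}>
            {m.role}: {m.text}
          </li>
        ))}
      </ul>
      <input value={input} onChange={(e) => setInput(e.target.value)} />
      <button onClick={() => { send(input); setInput(""); }}>Send</button>
    </div>
  );
}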

See the docs for the entire API.

How does it work? There are many moving parts, and not surprisingly, it requires a lot of coordination between systems engineering, browser APIs, and frontend frameworks.

  1. SentencePiece (the tokenizer) and the Apache TVM runtime are compiled to WebAssembly with Emscripten. The folks working on Apache TVM and MLC have done much of the low-level work to get the runtime working in the browser. These libraries were originally written in C++ and Python.
  2. Both of these are initialized in a Web Worker, off the main thread. This lets inference happen outside the main render thread, so it doesn’t block the UI. The worker is packaged alongside the React hooks (a sketch of this pattern follows the list).
  3. The worker downloads the model from HuggingFace and initializes the runtime and tokenizer.
  4. Finally, there’s some tedious state management and plumbing to make it all easily consumable from React: hooks, contexts, and providers that make it easy to use across your application.
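
To make step 2 concrete, the off-the-main-thread structure looks roughly like the sketch below. This is the generic pattern, not react-llm’s actual worker protocol or message shapes:

// Generic sketch of the pattern, not react-llm's actual message protocol.
// worker.ts: runs off the main thread and owns the tokenizer, runtime, and model state.
self.onmessage = async (e: MessageEvent) => {
  if (e.data.type === "init") {
    // Fetch the model weights and wasm, instantiate the tokenizer and runtime.
    self.postMessage({ type: "ready" });
  } else if (e.data.type === "generate") {
    // Tokenize the prompt, run inference, and stream tokens back as they are produced.
    self.postMessage({ type: "token", value: "hello" });
  }
};

// main thread: the hook posts messages and surfaces responses as React state.
const worker = new Worker(new URL("./worker.ts", import.meta.url), { type: "module" });
worker.onmessage = (e) => { /* update React state (conversation, loading progress) */ };
worker.postMessage({ type: "init" });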

The browser is the new operating system.