Llama 2 in the Browser

Aug 30, 2023

Back in May, I got Vicuna 7B, a chat-tuned version of the original Llama model, running entirely in the browser via the new WebGPU APIs that had shipped in Chrome. I open-sourced a React library, react-llm, to make it easy to use.
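If you want to gate the in-browser path on support, a quick WebGPU feature check is enough. The helper below is my own sketch, but `navigator.gpu` and `requestAdapter()` are the standard WebGPU entry points:

```ts
// Feature detection: in-browser inference needs WebGPU, which Chrome 113
// shipped by default. Types normally come from @webgpu/types; the cast
// keeps this snippet self-contained.
async function hasWebGPU(): Promise<boolean> {
  const gpu = (navigator as any).gpu; // undefined where WebGPU isn't supported
  if (!gpu) return false;
  // requestAdapter() resolves to null if no compatible GPU is exposed.
  const adapter = await gpu.requestAdapter();
  return adapter !== null;
}
```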

Today, I’m releasing an updated version of this on Thiggle. It supports Llama 2 Chat in the 7B and 13B variations, as well as Vicuna 7B and Redpajama 3B. The interface is updated for more advanced use cases: you can adjust generation parameters such as temperature, top-p, stop sequences, system prompts, maximum generation length, and repetition penalty, and each parameter has a short description in its hover detail.
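For a sense of what those knobs do, here’s a small TypeScript sketch of a generation config. The field names and defaults are my own shorthand, not the playground’s exact schema:

```ts
// Illustrative only: these names and defaults are shorthand for the knobs
// the playground exposes, not its actual schema.
interface GenerationConfig {
  systemPrompt: string;      // instruction prepended to frame the chat
  temperature: number;       // higher = more random sampling
  topP: number;              // nucleus sampling: keep tokens covering this probability mass
  maxTokens: number;         // hard cap on generated length
  repetitionPenalty: number; // values > 1 discourage repeating recent tokens
  stopSequences: string[];   // generation halts when any of these appear
}

const config: GenerationConfig = {
  systemPrompt: "You are a helpful assistant.",
  temperature: 0.7,
  topP: 0.9,
  maxTokens: 512,
  repetitionPenalty: 1.1,
  stopSequences: ["</s>"],
};
```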

You can access the in-browser playground at thiggle.com/local-llm.

You can also use the Model API Gateway to compare other models against the Llama 2 models, including the largest Llama chat variant (70B), which runs in the cloud (for now).
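Comparing models through a gateway boils down to sending the same prompt to each model and lining up the completions. The URL, field names, and auth scheme below are placeholders, not the documented gateway API:

```ts
// Hypothetical request shape: the URL, fields, and auth header are
// placeholders, not the documented gateway API. The point is the pattern:
// send one prompt to several models, then line up the completions.
const API_KEY = "YOUR_API_KEY"; // placeholder credential

async function compareModels(prompt: string, models: string[]) {
  const completions = await Promise.all(
    models.map((model) =>
      fetch("https://example.com/v1/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${API_KEY}`,
        },
        body: JSON.stringify({ model, prompt }),
      }).then((res) => res.json())
    )
  );
  return models.map((model, i) => ({ model, completion: completions[i] }));
}

// Example: compareModels("Summarize WebGPU in one sentence.",
//   ["llama-2-7b-chat", "llama-2-70b-chat"]);
```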

What’s the future of on-device AI? It’s already pervasive in my personal and professional work. Quick image-generation prototypes are easier to do with a locally hosted web UI (like Automatic1111) before moving to more robust cloud models. For LLMs, it’s a little trickier: the best models are too large to fit on a consumer device, and the most useful small ones still need additional infrastructure around them.

For now, I think an interesting path is a hybrid approach: run small models on-device where they’re good enough, and fall back to larger cloud-hosted models when they’re not.