Local AI: Part 1

What happens when parts of the AI stack go on-device? It’s already fairly easy to run Stable Diffusion locally, although these solutions won’t be the ones that reach the majority of consumers. You can use DiffusionBee, Automatic1111’s web UI, or just run it yourself via Python or a Jupyter Notebook.

So what could be run locally, and where? Local AI has interesting properties: low latency, data sovereignty, offline operation, and lower cost.

Training from scratch is probably out of the question. Anything beyond fine-tuning needs distributed computing across many accelerators.

But fine-tuning is a different question. You can fine-tune Stable Diffusion (via Dreambooth) on a single NVIDIA T4 with about 15GB of GPU memory (from my own experience).

Inference is relatively cheap but still infeasible on most end-user devices. However, the open-sourcing of models has outsourced a lot of the optimization work to the community. Already, some folks have found ways to reduce the memory footprint needed. You can even run Stable Diffusion with only 3.2 GB of VRAM, at the cost of speed (huggingface).
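To make the low-memory route concrete, here is a minimal sketch using the Hugging Face diffusers library. It loads the weights in half precision and turns on attention slicing, which trades some speed for a lower peak VRAM footprint. The function name generate_image and the model ID CompVis/stable-diffusion-v1-4 are my assumptions for illustration, and the actual memory use will depend on your GPU, image size, and library version.

```python
def generate_image(prompt: str, model_id: str = "CompVis/stable-diffusion-v1-4"):
    """Generate one image with reduced GPU memory use (needs a CUDA GPU).

    generate_image and the default model_id are illustrative assumptions,
    not something prescribed by the article.
    """
    # Imported lazily so this file can be loaded without the heavy dependencies.
    import torch
    from diffusers import StableDiffusionPipeline

    # Half-precision weights take roughly half the memory of 32-bit weights.
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")

    # Compute attention in slices rather than all at once:
    # slower per image, but a lower peak VRAM requirement.
    pipe.enable_attention_slicing()

    # The pipeline returns a list of PIL images; take the first one.
    return pipe(prompt).images[0]
```

Calling generate_image("a lighthouse at dusk").save("out.png") would write the result to disk, assuming torch and diffusers are installed and the model weights can be downloaded.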

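Something like GPT-J (6 billion parameters) needs about 24 GB just for its weights at 32-bit precision, and more in practice for activations and loading. GPT-3 (175 billion parameters) probably needs somewhere on the order of 350 GB at 16-bit precision (a back-of-the-envelope guess).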
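A quick way to sanity-check figures like these is to multiply the parameter count by the bytes used per parameter. The sketch below does only that: it counts weights, not activations, caches, or framework overhead, so treat the results as lower bounds. The helper name weight_memory_gb and the precision table are mine, and the parameter counts are rounded.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Rounded parameter counts; weights only, so real usage will be higher.
MODELS = {"GPT-J": 6e9, "GPT-3": 175e9}

for name, params in MODELS.items():
    for precision in ("fp32", "fp16", "int8"):
        gb = weight_memory_gb(params, precision)
        print(f"{name:6s} {precision}: {gb:6.0f} GB")
```

Under these assumptions GPT-J comes to 24 GB at 32-bit and 12 GB at 16-bit, while GPT-3 comes to 350 GB at 16-bit.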

For reference, iPhones have about 6GB of RAM. Some Samsung phones have up to 16GB.

But raw memory size is the wrong way to look at the problem. Things could change quickly:

Optimizations to the model (code, architecture, framework, etc.)
Specialized hardware on-device (e.g., tensor processing units)
Specialized models (fine-tuned, smaller context models, higher compression, tuned hyperparameters)
Improvements to the hardware (e.g., Moore’s Law)
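To see how much compression alone could move the line, here is a rough sketch that flips the question around: given a phone's RAM, how large a model could fit at each precision? The device figures are the ones quoted above. The usable_fraction of 0.5 is purely my assumption (the OS and other apps need memory too), as are the helper name and the int4 entry.

```python
# Bytes per parameter at each precision (int4 packs two weights per byte).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

# RAM figures quoted in the text, in GB.
DEVICE_RAM_GB = {"iPhone": 6, "high-end Samsung": 16}


def max_params_billions(ram_gb: float, precision: str,
                        usable_fraction: float = 0.5) -> float:
    """Largest model (billions of parameters) whose weights fit in RAM.

    usable_fraction is a hypothetical allowance for the share of RAM
    the model may occupy; it is an assumption, not a measured value.
    """
    usable_bytes = ram_gb * 1e9 * usable_fraction
    return usable_bytes / BYTES_PER_PARAM[precision] / 1e9


for device, ram in DEVICE_RAM_GB.items():
    for precision in BYTES_PER_PARAM:
        size = max_params_billions(ram, precision)
        print(f"{device:17s} {precision}: up to {size:.2f}B parameters")
```

Even with this crude model, dropping from 32-bit to 4-bit weights raises the ceiling eightfold, which is why raw memory size by itself says little about what will eventually run on-device.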