What Diffusion Models Can Teach Us About LLMs

Jun 4, 2023

The image diffusion model ecosystem evolved quickly thanks to the license-friendly, open-source Stable Diffusion models. Now, with LLaMa, Vicuna, Alpaca, RedPajama, Falcon, and many more open-source LLMs, text-generation models are evolving nearly as quickly. Here are some developer tools, infrastructure, and techniques that originated with diffusion models and might eventually come to text-generation LLMs.

LoRA — Low-Rank Adaptation of Large Language Models quickly became the standard way to extend the base Stable Diffusion models. LoRAs became extremely popular for a few reasons: the adapter weights are tiny compared to a full checkpoint, they can be trained cheaply on consumer hardware, and they can be shared and stacked on top of a single base model.

With QLoRA, we’re getting closer to this reality for text-generation LLMs.
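
For a concrete sense of what that looks like, here's a minimal QLoRA-style sketch using Hugging Face's transformers and peft libraries; the base checkpoint name and hyperparameters below are placeholders, not recommendations.

```python
# Minimal QLoRA-style fine-tuning setup (sketch): quantize the base model to
# 4-bit, then attach small low-rank adapters that are the only trained weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "huggyllama/llama-7b"  # placeholder: any causal LM checkpoint

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```

The resulting adapter is a few megabytes and can be shared or swapped independently of the multi-gigabyte base model.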

Prompt Matrix — Used to test different parameters for image generation. You might test a few CFG scale values on the X-axis and different step counts on the Y-axis.

The same thing is starting to happen for LLMs, except with parameters like temperature on the X-axis and different models on the Y-axis, or different prompts tested across different models. Why now? There are finally enough models worth testing, and it's cheap and quick enough to reasonably test several of them at once.
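
A text version of a prompt matrix can be as simple as a nested loop over models and temperatures. The sketch below assumes the (pre-1.0) openai Python client and an API key in the environment; the models, temperatures, and prompt are arbitrary examples.

```python
# Prompt matrix for LLMs: models on one axis, temperature on the other.
import openai  # assumes OPENAI_API_KEY is set in the environment

prompt = "Explain LoRA in one sentence."
models = ["gpt-3.5-turbo", "gpt-4"]  # Y-axis
temperatures = [0.0, 0.7, 1.2]       # X-axis

for model in models:
    for temperature in temperatures:
        resp = openai.ChatCompletion.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"{model} @ T={temperature}: {resp.choices[0].message.content!r}")
```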

Prompt Modifiers / Attention — Using () in the prompt increases the model's attention to those words, and [] decreases it. You can also add numeric modifiers, e.g., (word:1.5). There's no direct equivalent for LLMs, but logit bias is one way to steer generation toward a particular result. See ReLLM and ParserLLM.
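
As a rough analogue to (word:1.5), here's a sketch that nudges generation toward a particular word with the OpenAI logit_bias parameter (token IDs via tiktoken; the bias value and example word are arbitrary):

```python
# Steering generation toward a word with logit_bias (values range from -100 to 100).
import openai
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
bias = {token_id: 5 for token_id in enc.encode(" purple")}  # mildly favor " purple"

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Name a color."}],
    logit_bias=bias,
)
print(resp.choices[0].message.content)
```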

Negative Prompts — LLMs don't have a real equivalent of negative prompts (as in Stable Diffusion). One way to achieve a similar result is, again, logit bias.
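
A negative-prompt-like effect is the same idea with the bias pushed to -100, which effectively bans the listed tokens (again a sketch; the banned word variants are just examples):

```python
# "Negative prompt" via logit_bias: a bias of -100 effectively forbids a token.
import openai
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
banned = {t: -100 for word in [" blue", " Blue"] for t in enc.encode(word)}

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Name a color."}],
    logit_bias=banned,
)
print(resp.choices[0].message.content)
```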

Loopback — Automatically feed output images as input in the next batch. This is somewhat equivalent to how we’re starting to think about agents in LLMs.
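
In text form, loopback is just a loop that feeds each completion back in as the next prompt. A sketch, with the refinement instruction and iteration count as arbitrary choices:

```python
# Loopback for text: each pass refines the previous pass's output.
import openai  # assumes OPENAI_API_KEY is set in the environment

def generate(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

output = generate("Write a one-sentence tagline for a coffee grinder.")
for _ in range(3):  # fixed number of loopback passes
    output = generate(f"Improve this tagline:\n\n{output}")
print(output)
```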

Checkpoint Merger — There are utilities to merge checkpoints from different models, for example to blend styles, apply multiple LoRAs, and more. We haven't seen this as much in text-generation models (other than applying LoRA weights). I'm unsure how well it works, but it's something to look into.
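
For reference, the Stable Diffusion style of checkpoint merging is often just a weighted average of two state dicts. A naive sketch for two same-architecture checkpoints (file names are placeholders, and whether this produces a useful LLM is exactly the open question above):

```python
# Naive checkpoint merge: weighted average of two state dicts with identical
# keys and shapes (assumes float tensors; buffers/int tensors need special care).
import torch

def merge_checkpoints(path_a: str, path_b: str, alpha: float = 0.5) -> dict:
    a = torch.load(path_a, map_location="cpu")
    b = torch.load(path_b, map_location="cpu")
    return {k: alpha * a[k] + (1.0 - alpha) * b[k] for k in a}

merged = merge_checkpoints("model_a.pt", "model_b.pt", alpha=0.3)
torch.save(merged, "merged.pt")
```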