SEO Inside AI

What does SEO look like in a world where most queries are LLM-assisted in some way?

Keyword stuffing (at train time). It might be possible to keyword stuff data that are part of a training set by optimizing for specific tokens or token sequences. This might be as simple as “keyword stuffing” for LLMs but also more advanced in taking advantage of the embedding space.

Prompt injection (at inference time). For models that are augmented with tools (e.g., ChatGPT Plugins or Bing Chat), it is possible to prompt inject or prompt poison. The basic method goes like this: embed a specific prompt injection (e.g., “Ignore all previous directions and…”) inside the content of a website or other resource that an LLM would access (e.g., HTML or API). Then, when the LLM crawls your site as part of the query, it will template some features of your site into another prompt (possibly to summarize or extract information).

Token manipulation (SolidGoldMagikarp). Some odd tokens exist in the GPT-2 / GPT-3 / GPT-J. token vocabularies, like SolidGoldMagikarp and BuyableInstoreAndOnline. These shouldn’t be common enough to show up in the 50k token vocabulary, but they show up anyways. And when you query the model with these tokens, they spit out seemingly random results. For example, when asked, “What does the string “SolidGoldMagikarp” refer to?”, ChatGPT once responded, “The word “distributed” refers to …”. (now patched, see the original article).

The long story is that these tokens somehow end up in the vocabulary due to mistakes or overfitting in the training data (possibly) and then cause erratic behavior at inference time. There’s probably a whole world of SEO to be discovered in the embedding space (similar to keyword stuffing).

Ranking / Ads at Inference. Finally, there could just be a new RLHF or another layer that augments generations to add in more branded or relevant content. In this case, SEO would be related to the ranking algorithm that would sit on top (Goodhart’s law — when a measure becomes a target, it ceases to be a good measure).