Probabilistic Data Structures and LLMs

Bloom filters are a data structure that answers the question: is an element part of a set? It does it in a remarkably efficient way: the time needed to either add items or check whether an item is in the set is a fixed constant O(k) and independent of the number of items already in the set. A fixed-sized Bloom filter can represent an arbitrarily large number of elements.

There’s no free lunch — Bloom filters don’t answer in yes/no but rather “possibly yes/definitely no.” That means no false negatives, but false positives are possible.

LLMs are not probabilistic data structures, but it could be interesting to view them as building blocks for new probabilistic data structures. The next token probability provides an interesting interface to “most likely answer” / “logistic probability” (through wording a question as yes/no and inspecting the token probabilities of yes/no). They are space efficient. The time complexity of search and query operations is independent of the number of items searched over (or the complexity of the underlying structure).

You can (maybe) use these properties to both (1) implement new versions of probabilistic data structures like bloom filters and (2) create entirely new probabilistic data structures that weren’t possible before.

What would one look like? Not fully thought out, but initial thoughts.

Replacing the hash function with output from the LLM. Here’s an example — take a chunk of text, use the top N probable next tokens, and hash them. Probalistic mapping from X tokens to R^N.
Move up the stack. Problems that might have been solved with something like a Bloom filter might be able to be answered with an LLM. Things like collaborative editing platforms, plagiarism detection, spam detection, spell checking, and caching.