Context-Free Grammar Parsing with LLMs

May 14, 2023

Last week, I open-sourced a method to coerce LLMs into only generating a specific structure for a given regex pattern (ReLLM on Github). The library has proven extremely useful for a handful of tasks I’ve been doing with LLMs (everything from categorization to simple agent automation).

However, I left a part of it as “an exercise to the reader.” I claimed it could also coerce LLMs into only generating more complex structures, like valid JSON or XML. However, I didn’t show a worked example. The natural extension of supporting regular expressions would be to support context-free grammar.

If you remember your freshman year computer science course, regular languages are the simplest type of formal language (described by a regular expression) and can be described by a finite number of states (finite automata). Context-free grammars (CFGs) are a step up in complexity from regular languages. For example, they might define a language of all strings with balanced parentheses (you can’t do this with regular languages).

I’ve implemented this in ParserLLM. You’ll need (1) a prompt and a (2) context-free grammar to restrict the output.

The general strategy goes like this:

First, define a context-free grammar. You might use this for a simplified version of JSON (in EBNF form):

?start: value

    ?value: object

          | array

          | string

          | "true"             -> true

          | "false"            -> false

          | "null"             -> null

    array  : "[" [value ("," value)*] "]"

    object : "{" [pair ("," pair)*] "}"

    pair   : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING

    %import common.SIGNED_NUMBER

    %import common.WS

    %ignore WS

Next, to practically support multiple CFGs, use a parser generator to parse the language. In the example, I use Lark, simply because it’s written in Python and fairly easy to use.

We’ll run the partial output through the parser generator. At step zero, this is just the empty string. The parser will return all of the possible next tokens. You can see the valid first completion of this grammar is any “value,” which can be an array, string, true/false, or null. This means the valid starting tokens are `{`, `[`, `”`, `”true”`, `”false”`, and `”null”`.

Next, we’ll compile those tokens to their regular expressions. Now we have an equivalent problem to ReLLM. Simply run the regexps through ReLLM to generate the next possible token. ReLLM will squash the logits of the non-matching characters and the LLM will only consider valid partial or full next tokens.

Iterate until max tokens are reached, or the parser sees only an empty string or stop token as the next token.

Some interesting features:

  • You can describe the syntax of most programming and configuration languages as a CFG.
  • The LLM won’t produce an invalid result, but there’s no guarantee it will finish and produce a stop token.