
Use FSMs for scanning during grammar-guided generation#178

Merged
brandonwillard merged 4 commits into dottxt-ai:main from brandonwillard:lex-state-merged-fsm-approach
Sep 11, 2023

Conversation

@brandonwillard
Member

@brandonwillard brandonwillard commented Jul 10, 2023

This PR is the first step to closing #170 using the approach described in our paper "Efficient Guided Generation for Large Language Models".

In other words, the changes in this PR allow us to create "index" dicts for full grammars. By "index" we mean a dict that maps partial parse states to subsets of a vocabulary that are valid continuations of the parse state.

NOTE: The indexing work is being split off into its own PR.
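
To make the "index" idea above concrete, here is a minimal sketch (not the PR's actual API; the DFA, `walk`, and `build_index` names are hypothetical): map each FSM state to the set of vocabulary token ids whose strings the FSM can consume starting from that state.

```python
# Hypothetical sketch of an "index" dict: state -> valid continuation token ids.
# The toy DFA below accepts non-empty digit strings.

def step(state, char):
    """Transition function of a toy DFA; None means the dead state."""
    return 1 if char.isdigit() else None  # state 1 is accepting

def walk(state, string):
    """Run the DFA over `string` from `state`; None if the scan dies."""
    for ch in string:
        state = step(state, ch)
        if state is None:
            return None
    return state

def build_index(vocab, states):
    """vocab: {token_id: token_string}. Build the state -> token-id-set map."""
    return {
        state: {tid for tid, s in vocab.items() if walk(state, s) is not None}
        for state in states
    }

vocab = {0: "12", 1: "a", 2: "3"}
index = build_index(vocab, states=[0, 1])
# Only the digit-only strings "12" and "3" survive from either state.
```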

TODO:

  • Use a more efficient DFA implementation.
  • Use DFA minimization on the parse state FSMs.
    This will remove a large number of unnecessary states and decrease the overall cost of constructing indices.
  • Handle ignored tokens in terminal_symbol_scan.
    They are currently excluded from scanning entirely, which incorrectly rejects vocabulary strings consisting only of ignored tokens (e.g. " " in a Python grammar).
  • Convert the terminal symbol FSM states used to compute the indices to their corresponding parser state (aggregate/unioned) FSM states.
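
The ignored-token pitfall in the TODO list can be sketched as follows (hypothetical terminal/ignore patterns; a scan that matches only grammar terminals rejects strings made entirely of ignored tokens, even though the parser would skip them):

```python
import re

TERMINALS = [re.compile(r"\d+")]  # e.g. an INT terminal
IGNORED = [re.compile(r"\s+")]    # e.g. whitespace in a Python-like grammar

def scan_terminals_only(s):
    """Scan against grammar terminals only (the current, buggy behavior)."""
    return any(p.fullmatch(s) for p in TERMINALS)

def scan_with_ignored(s):
    """Scan against terminals plus ignored tokens (the intended behavior)."""
    return any(p.fullmatch(s) for p in TERMINALS + IGNORED)

assert not scan_terminals_only(" ")  # " " wrongly excluded from the vocabulary
assert scan_with_ignored(" ")        # " " is a perfectly valid continuation
```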

@brandonwillard brandonwillard marked this pull request as draft July 10, 2023 03:36
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 4 times, most recently from 714cb7b to 955de6d Compare July 13, 2023 20:25
@rlouf rlouf added the structured generation Linked to structured generation label Jul 17, 2023
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 11 times, most recently from 99b8aaf to 9eaff91 Compare July 24, 2023 19:51
if not terminated and state == fsm.initial:
    return None, None

return (None if not terminated else i), accepted_states
Member

I'm surprised I did not see any bug related to this
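
For reference, the conditional expression binds tighter than the comma, so the final `return` above yields a 2-tuple in both branches (illustrative values only):

```python
# `None if not terminated else i, accepted_states` parses as
# `((None if not terminated else i), accepted_states)`.
terminated, i, accepted_states = False, 3, {1, 2}
result = (None if not terminated else i, accepted_states)
assert result == (None, {1, 2})

terminated = True
result = (None if not terminated else i, accepted_states)
assert result == (3, {1, 2})
```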

@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 8 times, most recently from b65b38a to c9d223e Compare July 27, 2023 22:49
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 11 times, most recently from ba833fd to 46c045d Compare August 4, 2023 23:47
@rlouf rlouf mentioned this pull request Aug 6, 2023
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 3 times, most recently from 01ad2b3 to 299d54d Compare August 7, 2023 05:28
@dpsimpson

It is I, Daniel, smasher of for loops, obfuscator of code, here to tell you that you can replace the for loop here (the one starting at line 108 of /examples/parsing.py) with the probably more efficient, if somewhat less clear, version below
(assuming that next_vocab is a list of integers; otherwise add some code to make it so):

ids = torch.tensor(next_vocab, dtype=torch.int64).unsqueeze(0)  # same number of dims as mask; scatter_ requires int64 indices
mask = mask.scatter_(1, ids, torch.zeros_like(ids, dtype=mask.dtype))  # trailing _ is the in-place version

Is this a sensible change for an example? No. Will it be faster when next_vocab is big and you're on a CUDA device? It should be.
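
For intuition, the scatter call computes the same result as the loop it replaces. A pure-Python model of the semantics (hypothetical `mask` and `next_vocab` values, no torch required):

```python
# Model of mask.scatter_(1, ids, 0) along dim 1: zero out the mask entries
# whose column index appears in next_vocab.
mask = [[1.0] * 6]       # shape (1, 6)
next_vocab = [1, 3, 4]   # token ids to zero out

for tid in next_vocab:   # the loop the vectorized scatter replaces
    mask[0][tid] = 0.0

assert mask == [[1.0, 0.0, 1.0, 0.0, 0.0, 1.0]]
```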

@samuela

samuela commented Aug 23, 2023

OOC, how do this work and the corresponding paper relate to "Parsing with Derivatives" from @mattmight et al.?

@brandonwillard
Member Author

OOC, how do this work and the corresponding paper relate to "Parsing with Derivatives" from @mattmight et al.?

We're aware of that work, but this PR does not use anything from it directly. The work being done here is almost exclusively concerned with the adaptation of existing LALR(1) parsers and the use of the "indices" described in our technical paper.

All the terminal symbol regexes in each parse-state-dependent lexer are
combined/unioned into a single FSM, and scanning is performed according to those
combined FSMs.  The function `fsm_union` was added for that purpose.

Since the parser needs to know exactly which terminal symbols were matched, we
now need the ability to determine exactly which sub-FSM (i.e. one of the
combined terminal symbol FSMs) accepted an input string.  The function
`get_sub_fsms_from_seq` serves this purpose.
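
An illustrative sketch of the idea (not the PR's implementation; the `TERMINALS` table and `accepting_terminals` helper are hypothetical stand-ins for the unioned FSM and `get_sub_fsms_from_seq`):

```python
import re

# Terminal-symbol recognizers that would be unioned into one FSM.
TERMINALS = {
    "INT": re.compile(r"\d+"),
    "NAME": re.compile(r"[A-Za-z_]\w*"),
}

def accepting_terminals(string):
    """Report which terminal sub-FSMs accept `string` after a scan."""
    return {name for name, pat in TERMINALS.items() if pat.fullmatch(string)}

assert accepting_terminals("42") == {"INT"}
assert accepting_terminals("foo") == {"NAME"}
assert accepting_terminals("+") == set()  # no sub-FSM accepts
```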
@brandonwillard
Member Author

I'm going to merge this after the tests pass and create follow-up issues for the remaining work mentioned in the description. Doing so will fix the flaky FSM tests.

@Wehzie

Wehzie commented Sep 21, 2023

I'm very interested in this PR. Will there be documentation detailing how a context free grammar can be specified for guided generation?

@brandonwillard
Member Author

I'm very interested in this PR. Will there be documentation detailing how a context free grammar can be specified for guided generation?

Yes, and hopefully very soon! We have a development roadmap for integrating this into our Sequence framework and user interface and providing a performant sampling-based approach (to be followed by indexing and more). The next steps on that path are taking place in #272 right now.

@Wehzie

Wehzie commented Oct 18, 2023

Thank you for the update! Seeing as #272 is closed, what are the next steps to get DSLs running? Regarding Sequence and a user interface, I'm not finding information on that in the repo or your website. Do you have some more information about your plans there?

@rlouf
Member

rlouf commented Oct 19, 2023

Working on it!


Labels

enhancement, structured generation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants