
Use FSMs for scanning during grammar-guided generation#178

Merged
brandonwillard merged 4 commits into dottxt-ai:main from brandonwillard:lex-state-merged-fsm-approach
Sep 11, 2023

Conversation

@brandonwillard
Member

@brandonwillard brandonwillard commented Jul 10, 2023

This PR is the first step to closing #170 using the approach described in our paper "Efficient Guided Generation for Large Language Models".

In other words, the changes in this PR allow us to create "index" dicts for full grammars. By "index" we mean a dict that maps partial parse states to subsets of a vocabulary that are valid continuations of the parse state.

NOTE: The indexing work is being split off into its own PR.
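
To make the "index" idea above concrete, here is a minimal sketch (not the PR's actual API; the DFA, `walk`, and `build_index` names are hypothetical): map each FSM state to the set of vocabulary token ids whose strings the FSM can consume starting from that state.

```python
# Hypothetical sketch of an "index" dict: state -> valid continuation token ids.
# The toy DFA below accepts non-empty digit strings.

def step(state, char):
    """Transition function of a toy DFA; None means the dead state."""
    return 1 if char.isdigit() else None  # state 1 is accepting

def walk(state, string):
    """Run the DFA over `string` from `state`; None if the scan dies."""
    for ch in string:
        state = step(state, ch)
        if state is None:
            return None
    return state

def build_index(vocab, states):
    """vocab: {token_id: token_string}. Build the state -> token-id-set map."""
    return {
        state: {tid for tid, s in vocab.items() if walk(state, s) is not None}
        for state in states
    }

vocab = {0: "12", 1: "a", 2: "3"}
index = build_index(vocab, states=[0, 1])
# Only the digit-only strings "12" and "3" survive from either state.
```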

TODO:

  • Use a more efficient DFA implementation.
  • Use DFA minimization on the parse state FSMs.
    This will remove a large number of unnecessary states and decrease the overall cost of constructing indices.
  • Handle ignored tokens in terminal_symbol_scan.
    They are currently excluded from scanning entirely, which incorrectly rejects vocabulary strings consisting only of ignored tokens (e.g. " " in a Python grammar).
  • Convert the terminal symbol FSM states used to compute the indices to their corresponding parser state (aggregate/unioned) FSM states.
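
The ignored-token pitfall in the TODO list can be sketched as follows (hypothetical terminal/ignore patterns; a scan that matches only grammar terminals rejects strings made entirely of ignored tokens, even though the parser would skip them):

```python
import re

TERMINALS = [re.compile(r"\d+")]  # e.g. an INT terminal
IGNORED = [re.compile(r"\s+")]    # e.g. whitespace in a Python-like grammar

def scan_terminals_only(s):
    """Scan against grammar terminals only (the current, buggy behavior)."""
    return any(p.fullmatch(s) for p in TERMINALS)

def scan_with_ignored(s):
    """Scan against terminals plus ignored tokens (the intended behavior)."""
    return any(p.fullmatch(s) for p in TERMINALS + IGNORED)

assert not scan_terminals_only(" ")  # " " wrongly excluded from the vocabulary
assert scan_with_ignored(" ")        # " " is a perfectly valid continuation
```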

@brandonwillard brandonwillard marked this pull request as draft July 10, 2023 03:36
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 4 times, most recently from 714cb7b to 955de6d Compare July 13, 2023 20:25
@rlouf rlouf added the structured generation Linked to structured generation label Jul 17, 2023
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 11 times, most recently from 99b8aaf to 9eaff91 Compare July 24, 2023 19:51
if not terminated and state == fsm.initial:
    return None, None

return (None if not terminated else i), accepted_states
Member

I'm surprised I did not see any bug related to this
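
For reference, the conditional expression binds tighter than the comma, so the final `return` above yields a 2-tuple in both branches (illustrative values only):

```python
# `None if not terminated else i, accepted_states` parses as
# `((None if not terminated else i), accepted_states)`.
terminated, i, accepted_states = False, 3, {1, 2}
result = (None if not terminated else i, accepted_states)
assert result == (None, {1, 2})

terminated = True
result = (None if not terminated else i, accepted_states)
assert result == (3, {1, 2})
```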

@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 8 times, most recently from b65b38a to c9d223e Compare July 27, 2023 22:49
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 11 times, most recently from ba833fd to 46c045d Compare August 4, 2023 23:47
@rlouf rlouf mentioned this pull request Aug 6, 2023
@brandonwillard brandonwillard force-pushed the lex-state-merged-fsm-approach branch 3 times, most recently from 01ad2b3 to 299d54d Compare August 7, 2023 05:28
@dpsimpson

It is I, Daniel, smasher of for loops, obfuscator of code, here to tell you that you can replace the for loop here (the one starting at line 108 of /examples/parsing.py) with the probably more efficient, if somewhat less clear, version below
(assuming that next_vocab is a list of integers; otherwise add some code to make it so):

ids = torch.tensor(next_vocab, dtype=torch.int64).unsqueeze(0)  # same number of dims as mask; scatter_ requires int64 indices
mask = mask.scatter_(1, ids, torch.zeros_like(ids, dtype=mask.dtype))  # trailing _ is the in-place version

Is this a sensible change for an example? No. Will it be faster when next_vocab is big and you're on a CUDA device? It should be.
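
For intuition, the scatter call computes the same result as the loop it replaces. A pure-Python model of the semantics (hypothetical `mask` and `next_vocab` values, no torch required):

```python
# Model of mask.scatter_(1, ids, 0) along dim 1: zero out the mask entries
# whose column index appears in next_vocab.
mask = [[1.0] * 6]       # shape (1, 6)
next_vocab = [1, 3, 4]   # token ids to zero out

for tid in next_vocab:   # the loop the vectorized scatter replaces
    mask[0][tid] = 0.0

assert mask == [[1.0, 0.0, 1.0, 0.0, 0.0, 1.0]]
```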

@samuela

samuela commented Aug 23, 2023

OOC, how do this work and the corresponding paper relate to "Parsing with Derivatives" from @mattmight et al.?

@brandonwillard
Member Author

OOC, how do this work and the corresponding paper relate to "Parsing with Derivatives" from @mattmight et al.?

We're aware of that work, but this PR does not use anything from it directly. The work being done here is almost exclusively concerned with the adaptation of existing LALR(1) parsers and the use of the "indices" described in our technical paper.

All the terminal symbol regexes in each parse-state-dependent lexer are
combined/unioned into a single FSM, and scanning is performed according to those
combined FSMs.  The function `fsm_union` was added for that purpose.

Since the parser needs to know exactly which terminal symbols were matched, we
now need the ability to determine exactly which sub-FSM (i.e. one of the
combined terminal symbol FSMs) accepted an input string.  The function
`get_sub_fsms_from_seq` serves this purpose.
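
An illustrative sketch of the idea (not the PR's implementation; the `TERMINALS` table and `accepting_terminals` helper are hypothetical stand-ins for the unioned FSM and `get_sub_fsms_from_seq`):

```python
import re

# Terminal-symbol recognizers that would be unioned into one FSM.
TERMINALS = {
    "INT": re.compile(r"\d+"),
    "NAME": re.compile(r"[A-Za-z_]\w*"),
}

def accepting_terminals(string):
    """Report which terminal sub-FSMs accept `string` after a scan."""
    return {name for name, pat in TERMINALS.items() if pat.fullmatch(string)}

assert accepting_terminals("42") == {"INT"}
assert accepting_terminals("foo") == {"NAME"}
assert accepting_terminals("+") == set()  # no sub-FSM accepts
```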
@brandonwillard
Member Author

I'm going to merge this after the tests pass and create follow-up issues for the remaining work mentioned in the description. Doing so will fix the flaky FSM tests.

@Wehzie

Wehzie commented Sep 21, 2023

I'm very interested in this PR. Will there be documentation detailing how a context free grammar can be specified for guided generation?

@brandonwillard
Member Author

I'm very interested in this PR. Will there be documentation detailing how a context free grammar can be specified for guided generation?

Yes, and hopefully very soon! We have a development roadmap for integrating this into our Sequence framework and user interface and providing a performant sampling-based approach (to be followed by indexing and more). The next steps on that path are taking place in #272 right now.

@Wehzie

Wehzie commented Oct 18, 2023

Thank you for the update! Seeing as #272 is closed, what are the next steps to get DSLs running? Regarding Sequence and a user interface, I'm not finding information on that in the repo or your website. Do you have some more information about your plans there?

@rlouf
Member

rlouf commented Oct 19, 2023

Working on it!


Labels

enhancement, structured generation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants