Use FSMs for scanning during grammar-guided generation#178
Conversation
714cb7b to
955de6d
Compare
99b8aaf to
9eaff91
Compare
| if not terminated and state == fsm.initial: | ||
| return None, None | ||
|
|
||
| return None if not terminated else i, accepted_states |
There was a problem hiding this comment.
I'm surprised I did not see any bug related to this
b65b38a to
c9d223e
Compare
ba833fd to
46c045d
Compare
01ad2b3 to
299d54d
Compare
|
It is I, Daniel, smasher of for loops, obsfucator of code here to tell you that you can replace the for loop here (aka the one starting at line 108 of /examples/parsing.py with the probably more efficient, if somewhat less clear Is this a sensible change for an example? No. Will it be faster when |
|
OOC how does this work and the corresponding paper relate to the "Parsing with Derivatives" from @mattmight et al? |
We're aware of that work, but this PR does not use anything from it directly. The work being done here is almost exclusively concerned with the adaptation of existing LALR(1) parsers and the use of the "indices" described in our technical paper. |
All the terminal symbols regexs in each parse-state-dependent lexer are combined/unioned into a single FSM, and scanning is performed according to those combined FSMs. The function `fsm_union` was added for that purpose. Since the parser needs to know exactly which terminal symbols were matched, we now need to the ability to determine exactly which sub-FSM (i.e. one of the combined terminal symbol FSMs) accepted an input string. The function `get_sub_fsms_from_seq` serves this purpose.
|
I'm going to merge this after the tests pass and create follow-up issues for the remaining work mentioned in the description. Doing so will fix the flaky FSM tests. |
|
I'm very interested in this PR. Will there be documentation detailing how a context free grammar can be specified for guided generation? |
Yes, and hopefully very soon! We have a development roadmap for integrating this into our |
|
Thank you for the update! Seeing as #272 is closed, what are the next steps to get DSLs running? Regarding Sequence and a user interface, I'm not finding information on that in the repo or your website. Do you have some more information about your plans there? |
|
Working on it! |
This PR is the first step to closing #170 using the approach described in our paper "Efficient Guided Generation for Large Language Models".
In other words, the changes in this PR allow us to create "index"
dicts for full grammars. By "index" we mean adictthat maps partial parse states to subsets of a vocabulary that are valid continuations of the parse state.NOTE: The indexing work is being split off into its own PR.
TODO:
This will remove a large number of unnecessary states and decrease the overall cost of constructing indices.
terminal_symbol_scan.They are currently being completely ignored, which incorrectly excludes vocabulary strings consisting of only ignored tokens (e.g.
" "in a Python grammar).