✨ feat(mq-lang): add incremental CST parser #1400
Conversation
Pull request overview
Adds an incremental CST parsing facility to mq-lang (behind the cst feature) intended to support efficient re-parsing on edits (e.g., editor/LSP scenarios).
Changes:
- Expose `IncrementalParser` and `TextEdit` from mq-lang when `feature = "cst"` is enabled.
- Refactor the CST parser to operate on token slices with an explicit cursor and add `parse_with_ranges`.
- Add a new incremental parser implementation plus basic tests.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| crates/mq-lang/src/lib.rs | Re-exports the incremental CST API and adjusts CST parse recovery token handling. |
| crates/mq-lang/src/cst/parser.rs | Refactors parser to slice+cursor and adds parse_with_ranges / range tracking. |
| crates/mq-lang/src/cst/incremental.rs | New incremental CST parser with token-diffing, partial re-parse, and splice-based updates. |
| crates/mq-lang/src/cst.rs | Exposes the new incremental module. |
```rust
// If token structure is identical, just update positions and source.
if prefix >= old_changed_end && prefix >= new_changed_end {
    self.source = new_source.to_string();
    self.tokens = new_tokens;
    return (&self.nodes, &self.errors);
}
```
In update, the "token structure identical" fast-path updates self.tokens/self.source but returns without rebuilding self.nodes. Since CST nodes store Shared<Token> (including range), this can leave nodes holding stale token ranges whenever the lexer normalizes different spellings into the same TokenKind (e.g., line_ending treats \n and \r\n as TokenKind::NewLine). Consider tightening this condition to only skip when the entire token stream is identical (e.g., self.tokens == new_tokens, including ranges), or otherwise re-parse/rebuild nodes so token ranges stay consistent with the updated source.
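The kind-vs-identity distinction can be seen with a toy token type; this is a sketch, with a hypothetical `Tok` struct standing in for mq-lang's `Shared<Token>`:

```rust
// Toy stand-in for mq-lang's tokens: same kind, different byte ranges
// (e.g. "\n" vs "\r\n" both lexing to NewLine). Types are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct Tok {
    kind: &'static str,
    range: (usize, usize),
}

fn same_kinds(a: &[Tok], b: &[Tok]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.kind == y.kind)
}

fn main() {
    let old = vec![Tok { kind: "NewLine", range: (0, 1) }]; // "\n"
    let new = vec![Tok { kind: "NewLine", range: (0, 2) }]; // "\r\n"
    // A kind-only comparison takes the fast path...
    assert!(same_kinds(&old, &new));
    // ...but full equality (kind + range) correctly rejects it,
    // which is the tightened condition the comment suggests.
    assert_ne!(old, new);
}
```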
```rust
// Adjust suffix group token ranges by the delta.
for range in &mut self.node_token_ranges[last_affected..] {
    range.0 = ((range.0 as isize) + delta) as usize;
    range.1 = ((range.1 as isize) + delta) as usize;
}
```

```rust
// Splice in the new nodes and ranges.
self.nodes.splice(first_affected..last_affected, new_nodes);
self.node_token_ranges.splice(first_affected..last_affected, new_ranges);
```
IncrementalParser splices self.nodes using indices (first_affected..last_affected) derived from self.node_token_ranges, but node_token_ranges is described as "one range per statement group" while nodes is a flat Vec<Shared<Node>> returned by Parser (which includes expr nodes plus separator token nodes like |, ;, eof). If node_token_ranges.len() != nodes.len(), these indices won't refer to the same logical units and the splice can replace the wrong nodes (or panic if assumptions change). The incremental cache should store and splice the same unit that the ranges index (either group the CST nodes by statement, or compute and store a range per top-level node so indexing stays aligned).
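One way to keep the two collections aligned is to make the spliced unit own both its nodes and its token range; a sketch with hypothetical types (plain strings stand in for `Shared<Node>`):

```rust
// Hypothetical cache unit: each spliced element carries both its CST nodes
// and the half-open token range that indexes them, so the node list and the
// range list can never drift out of sync when splicing.
#[derive(Debug, Clone, PartialEq)]
struct StatementGroup {
    nodes: Vec<String>,          // stand-in for Vec<Shared<Node>>
    token_range: (usize, usize), // half-open [start, end) into the token stream
}

fn main() {
    let mut groups = vec![
        StatementGroup { nodes: vec!["upcase".into(), "|".into()], token_range: (0, 2) },
        StatementGroup { nodes: vec!["downcase".into(), "eof".into()], token_range: (2, 4) },
    ];
    // Replacing a group replaces its nodes and its range in one operation.
    groups.splice(1..2, [StatementGroup {
        nodes: vec!["ltrim".into(), "eof".into()],
        token_range: (2, 4),
    }]);
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[1].nodes[0], "ltrim");
}
```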
```rust
/// Applies a byte-offset-based [`TextEdit`] to the source and updates the parse.
pub fn apply_edit(&mut self, edit: &TextEdit) -> (&[Shared<Node>], &ErrorReporter) {
    let mut new_source = self.source.clone();
    new_source.replace_range(edit.start..edit.end, &edit.new_text);
    self.update(&new_source)
}
```
apply_edit uses String::replace_range(edit.start..edit.end, ...) with byte offsets, which will panic if start/end are not on UTF-8 char boundaries (or if the range is otherwise invalid). Since this is a public API, consider changing it to return a Result and validating is_char_boundary/range bounds before mutating the string, so malformed edits from editor/LSP inputs don't crash the process.
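A validated variant might look like the following sketch; the `EditError` type and the helper name are illustrative, not part of mq-lang:

```rust
// Sketch: validate byte offsets before splicing, so malformed editor/LSP
// input yields an error instead of a panic. `EditError` is hypothetical.
#[derive(Debug, PartialEq)]
enum EditError {
    OutOfBounds,
    NotCharBoundary,
}

fn checked_replace_range(
    source: &mut String,
    start: usize,
    end: usize,
    new_text: &str,
) -> Result<(), EditError> {
    if start > end || end > source.len() {
        return Err(EditError::OutOfBounds);
    }
    if !source.is_char_boundary(start) || !source.is_char_boundary(end) {
        return Err(EditError::NotCharBoundary);
    }
    source.replace_range(start..end, new_text);
    Ok(())
}

fn main() {
    let mut s = String::from("héllo");
    // Offset 2 falls inside the two-byte 'é': replace_range would panic here.
    assert_eq!(checked_replace_range(&mut s, 1, 2, "e"), Err(EditError::NotCharBoundary));
    // A boundary-aligned edit succeeds.
    assert_eq!(checked_replace_range(&mut s, 1, 3, "e"), Ok(()));
    assert_eq!(s, "hello");
}
```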
```rust
#[test]
fn test_incremental_append() {
    let source = "upcase";
    let mut parser = IncrementalParser::new(source);

    let (nodes, errors) = parser.update("upcase | downcase");
    assert!(!errors.has_errors());
    assert!(!nodes.is_empty());
}
```
The current incremental-parser tests only assert non-empty output / no errors, but they don't validate that incremental updates produce the same CST + token ranges as a full re-parse, nor do they cover tricky cases like edits that change only newline encoding (\r\n ↔ \n) or edits confined to the middle of a multi-statement program. Adding assertions that compare IncrementalParser results against Parser::new(&lexed_tokens).parse_with_ranges() for a variety of edits would help catch incorrect splice/range logic.
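The invariant being asked for can be expressed with a toy model; here, splitting on `|` stands in for real parsing, and none of these helpers exist in mq-lang:

```rust
// Toy model: "parsing" is splitting on '|'. The property under test mirrors
// the suggested assertion: after any edit, an incrementally maintained
// result must equal a from-scratch parse of the edited source.
fn full_parse(src: &str) -> Vec<String> {
    src.split('|').map(|s| s.trim().to_string()).collect()
}

fn main() {
    let mut cached = full_parse("upcase | downcase | trim");
    // Simulate an incremental update confined to the middle statement.
    cached[1] = "ltrim".to_string();
    let edited = "upcase | ltrim | trim";
    // Incremental result == full re-parse of the edited source.
    assert_eq!(cached, full_parse(edited));
}
```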
```rust
/// Parses the full program and also returns token index ranges for each top-level node group.
///
/// Each entry `(start, end)` in the returned `Vec` corresponds to the node group at the same
/// index in `Vec<Shared<Node>>` and represents the half-open token index range `[start, end)`.
```
parse_with_ranges documents that the returned Vec<(start,end)> is parallel to Vec<Shared<Node>>, but parse_program only pushes ranges at statement boundaries (e.g., after |, ;, end, eof). This can make ranges.len() differ from nodes.len(), so consumers cannot safely index nodes[i] with ranges[i]. Either change the implementation to emit a range per top-level node (ensuring equal lengths), or change the API/docs to return explicit statement-group data and update callers accordingly.
Suggested change:

```diff
-/// Parses the full program and also returns token index ranges for each top-level node group.
-///
-/// Each entry `(start, end)` in the returned `Vec` corresponds to the node group at the same
-/// index in `Vec<Shared<Node>>` and represents the half-open token index range `[start, end)`.
+/// Parses the full program and also returns token index ranges for each top-level
+/// statement group.
+///
+/// The returned `Vec<(start, end)>` contains half-open token index ranges `[start, end)`
+/// delimiting top-level statement groups, where groups are separated by `|`, `;`, `end`,
+/// or end-of-file. Multiple nodes may belong to the same statement group and share a
+/// single `(start, end)` entry.
+///
+/// The number of ranges is **not** guaranteed to equal the number of returned nodes,
+/// and callers must not assume a 1:1, index-parallel correspondence between
+/// `Vec<Shared<Node>>` and `Vec<(usize, usize)>`. Instead, treat the ranges as group
+/// boundaries over the token stream.
```
```diff
 }

 if nodes.is_empty() {
-    match self.tokens.peek() {
+    match self.peek() {
         Some(token) => self.errors.report(ParseError::UnexpectedToken(Shared::clone(token))),
```
parse_program only appends to ranges when it hits specific boundary tokens (eof/pipe/semicolon/def/macro/…). If the loop exits because the token slice ends (peek() becomes None), the final statement group won't get a range entry. This makes parse_with_ranges unreliable for parsing sub-slices (e.g. incremental reparsing); consider pushing a final `(stmt_start, self.pos)` range when parsing at the root and `nodes` is non-empty before returning.
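The fix amounts to flushing the still-open group when the token slice runs out; a standalone sketch over boundary indices (the helper name is illustrative):

```rust
// Sketch: turn boundary token positions into half-open group ranges,
// flushing a final range when input ends without a trailing boundary token.
fn group_ranges(boundaries: &[usize], token_len: usize) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for &b in boundaries {
        ranges.push((start, b));
        start = b;
    }
    if start < token_len {
        // Final group: the slice ended without `|`, `;`, `end`, or eof.
        ranges.push((start, token_len));
    }
    ranges
}

fn main() {
    // Boundaries after tokens 3 and 6, but 8 tokens total: the last two
    // tokens form a group that only the final flush captures.
    assert_eq!(group_ranges(&[3, 6], 8), vec![(0, 3), (3, 6), (6, 8)]);
}
```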
```rust
// Find the changed region by comparing token kinds.
// We compare kinds (not positions) because positions shift with every edit.
let prefix = self
    .tokens
    .iter()
    .zip(new_tokens.iter())
    .take_while(|(a, b)| tokens_same_kind(a, b))
    .count();

let old_suffix = self
    .tokens
    .iter()
    .rev()
    .zip(new_tokens.iter().rev())
    .take_while(|(a, b)| tokens_same_kind(a, b))
    .count();

let old_changed_end = self.tokens.len().saturating_sub(old_suffix);
let new_changed_end = new_tokens.len().saturating_sub(old_suffix);

// If token structure is identical, just update positions and source.
if prefix >= old_changed_end && prefix >= new_changed_end {
    self.source = new_source.to_string();
    self.tokens = new_tokens;
    return (&self.nodes, &self.errors);
}

// Find the affected top-level node groups.
// A group is affected if its token range overlaps [prefix, old_changed_end).
let first_affected = self.node_token_ranges.partition_point(|&(_, end)| end <= prefix);
let last_affected = self
    .node_token_ranges
    .partition_point(|&(start, _)| start < old_changed_end);

// Compute the token range in old tokens that the affected groups cover.
let reparse_old_start = self
    .node_token_ranges
    .get(first_affected)
    .map(|r| r.0)
    .unwrap_or(prefix);
let reparse_old_end = if last_affected > 0 {
    self.node_token_ranges
        .get(last_affected - 1)
        .map(|r| r.1)
        .unwrap_or(old_changed_end)
} else {
    old_changed_end
};

// Compute the corresponding range in new tokens.
// Tokens before `reparse_old_start` are unchanged → same count in new tokens.
let reparse_new_start = reparse_old_start;
// Tokens at old_suffix distance from the end are unchanged.
let delta = new_tokens.len() as isize - self.tokens.len() as isize;
let reparse_new_end = ((reparse_old_end as isize) + delta).max(reparse_new_start as isize) as usize;

// Re-parse the affected token slice using the new full token array.
// We start parsing from reparse_new_start and parse until reparse_new_end.
let (new_nodes, new_ranges, _) = Self::parse_tokens_range(&new_tokens, reparse_new_start, reparse_new_end);

// Adjust suffix group token ranges by the delta.
for range in &mut self.node_token_ranges[last_affected..] {
    range.0 = ((range.0 as isize) + delta) as usize;
    range.1 = ((range.1 as isize) + delta) as usize;
}

// Splice in the new nodes and ranges.
self.nodes.splice(first_affected..last_affected, new_nodes);
self.node_token_ranges.splice(first_affected..last_affected, new_ranges);

self.source = new_source.to_string();
self.tokens = new_tokens;

// Re-run error collection on the affected region by doing a lightweight
// full re-parse of errors only (re-use existing errors for unaffected regions).
// For simplicity, we re-parse the entire source for errors when the structure changed.
let (_, _, errors) = Self::parse_tokens(&self.tokens);
self.errors = errors;
```
After splicing new_nodes into self.nodes, the remaining prefix/suffix nodes are reused from the previous parse but self.tokens is replaced with new_tokens. Because CST Nodes store Shared<Token> (with ranges), reused nodes will still point at old tokens/locations from the previous source, so the resulting CST mixes tokens from multiple versions of the source. To keep ranges correct, either keep token objects stable across updates or rebuild/retokenize the reused nodes to reference the current new_tokens.
Suggested change:

```diff
-// Find the changed region by comparing token kinds.
-// We compare kinds (not positions) because positions shift with every edit.
-let prefix = self
-    .tokens
-    .iter()
-    .zip(new_tokens.iter())
-    .take_while(|(a, b)| tokens_same_kind(a, b))
-    .count();
-let old_suffix = self
-    .tokens
-    .iter()
-    .rev()
-    .zip(new_tokens.iter().rev())
-    .take_while(|(a, b)| tokens_same_kind(a, b))
-    .count();
-let old_changed_end = self.tokens.len().saturating_sub(old_suffix);
-let new_changed_end = new_tokens.len().saturating_sub(old_suffix);
-// If token structure is identical, just update positions and source.
-if prefix >= old_changed_end && prefix >= new_changed_end {
-    self.source = new_source.to_string();
-    self.tokens = new_tokens;
-    return (&self.nodes, &self.errors);
-}
-// Find the affected top-level node groups.
-// A group is affected if its token range overlaps [prefix, old_changed_end).
-let first_affected = self.node_token_ranges.partition_point(|&(_, end)| end <= prefix);
-let last_affected = self
-    .node_token_ranges
-    .partition_point(|&(start, _)| start < old_changed_end);
-// Compute the token range in old tokens that the affected groups cover.
-let reparse_old_start = self
-    .node_token_ranges
-    .get(first_affected)
-    .map(|r| r.0)
-    .unwrap_or(prefix);
-let reparse_old_end = if last_affected > 0 {
-    self.node_token_ranges
-        .get(last_affected - 1)
-        .map(|r| r.1)
-        .unwrap_or(old_changed_end)
-} else {
-    old_changed_end
-};
-// Compute the corresponding range in new tokens.
-// Tokens before `reparse_old_start` are unchanged → same count in new tokens.
-let reparse_new_start = reparse_old_start;
-// Tokens at old_suffix distance from the end are unchanged.
-let delta = new_tokens.len() as isize - self.tokens.len() as isize;
-let reparse_new_end = ((reparse_old_end as isize) + delta).max(reparse_new_start as isize) as usize;
-// Re-parse the affected token slice using the new full token array.
-// We start parsing from reparse_new_start and parse until reparse_new_end.
-let (new_nodes, new_ranges, _) = Self::parse_tokens_range(&new_tokens, reparse_new_start, reparse_new_end);
-// Adjust suffix group token ranges by the delta.
-for range in &mut self.node_token_ranges[last_affected..] {
-    range.0 = ((range.0 as isize) + delta) as usize;
-    range.1 = ((range.1 as isize) + delta) as usize;
-}
-// Splice in the new nodes and ranges.
-self.nodes.splice(first_affected..last_affected, new_nodes);
-self.node_token_ranges.splice(first_affected..last_affected, new_ranges);
-self.source = new_source.to_string();
-self.tokens = new_tokens;
-// Re-run error collection on the affected region by doing a lightweight
-// full re-parse of errors only (re-use existing errors for unaffected regions).
-// For simplicity, we re-parse the entire source for errors when the structure changed.
-let (_, _, errors) = Self::parse_tokens(&self.tokens);
-self.errors = errors;
+// To keep CST node token ranges consistent with the current source,
+// we always rebuild the CST from the freshly lexed tokens. This avoids
+// mixing nodes that reference old `Token` instances with nodes that
+// reference new ones.
+let (new_nodes, new_ranges, new_errors) = Self::parse_tokens(&new_tokens);
+self.source = new_source.to_string();
+self.tokens = new_tokens;
+self.nodes = new_nodes;
+self.node_token_ranges = new_ranges;
+self.errors = new_errors;
```
```rust
let (nodes, errors) = parser.apply_edit(&edit);
assert!(!errors.has_errors());
assert!(!nodes.is_empty());
assert_eq!(parser.source(), "upcase | ltrim");
```
test_apply_edit’s expected source string doesn’t match what replace_range(11..19, "ltrim") does on "upcase() | downcase()" (it only replaces downcase, leaving the trailing () intact). As written, this assertion should fail; update the offsets and/or expected output to match the edit semantics you want to test.
Suggested change:

```diff
-assert_eq!(parser.source(), "upcase | ltrim");
+assert_eq!(parser.source(), "upcase() | ltrim()");
```
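The corrected expectation can be checked directly against `String::replace_range`:

```rust
fn main() {
    let mut s = String::from("upcase() | downcase()");
    // Bytes 11..19 cover exactly "downcase"; the trailing "()" is untouched.
    s.replace_range(11..19, "ltrim");
    assert_eq!(s, "upcase() | ltrim()");
}
```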
No description provided.