
✨ feat(mq-lang): add incremental CST parser #1400

Merged
harehare merged 4 commits into main from feat/incremental-cst-parser
Mar 6, 2026

Conversation


harehare (Owner) commented Mar 6, 2026

No description provided.

Copilot AI review requested due to automatic review settings March 6, 2026 14:48
Contributor

Copilot AI left a comment

Pull request overview

Adds an incremental CST parsing facility to mq-lang (behind the cst feature) intended to support efficient re-parsing on edits (e.g., editor/LSP scenarios).

Changes:

  • Expose IncrementalParser and TextEdit from mq-lang when feature = "cst" is enabled.
  • Refactor the CST parser to operate on token slices with an explicit cursor and add parse_with_ranges.
  • Add a new incremental parser implementation + basic tests.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| crates/mq-lang/src/lib.rs | Re-exports the incremental CST API and adjusts CST parse recovery token handling. |
| crates/mq-lang/src/cst/parser.rs | Refactors the parser to slice+cursor and adds parse_with_ranges / range tracking. |
| crates/mq-lang/src/cst/incremental.rs | New incremental CST parser with token diffing, partial re-parse, and splice-based updates. |
| crates/mq-lang/src/cst.rs | Exposes the new incremental module. |

Comment on lines +112 to +117
// If token structure is identical, just update positions and source.
if prefix >= old_changed_end && prefix >= new_changed_end {
self.source = new_source.to_string();
self.tokens = new_tokens;
return (&self.nodes, &self.errors);
}

Copilot AI Mar 6, 2026

In update, the "token structure identical" fast-path updates self.tokens/self.source but returns without rebuilding self.nodes. Since CST nodes store Shared<Token> (including range), this can leave nodes holding stale token ranges whenever the lexer normalizes different spellings into the same TokenKind (e.g., line_ending treats \n and \r\n as TokenKind::NewLine). Consider tightening this condition to only skip when the entire token stream is identical (e.g., self.tokens == new_tokens, including ranges), or otherwise re-parse/rebuild nodes so token ranges stay consistent with the updated source.
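A minimal, self-contained sketch of the hazard described above, using simplified stand-ins for mq-lang's real `Token`/`TokenKind` types (the names and shapes here are illustrative, not the actual API):

```rust
// Simplified stand-ins for mq-lang's Token/TokenKind, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum TokenKind {
    Ident(String),
    NewLine,
}

#[derive(Debug, Clone, PartialEq)]
struct Token {
    kind: TokenKind,
    // Byte range in the source that this token covers.
    range: (usize, usize),
}

fn tokens_same_kind(a: &Token, b: &Token) -> bool {
    a.kind == b.kind
}

fn main() {
    // "a\nb" vs "a\r\nb": if the lexer normalizes both line endings to
    // NewLine, the kinds match even though the byte ranges of the trailing
    // tokens differ.
    let old = vec![
        Token { kind: TokenKind::Ident("a".into()), range: (0, 1) },
        Token { kind: TokenKind::NewLine, range: (1, 2) },
        Token { kind: TokenKind::Ident("b".into()), range: (2, 3) },
    ];
    let new = vec![
        Token { kind: TokenKind::Ident("a".into()), range: (0, 1) },
        Token { kind: TokenKind::NewLine, range: (1, 3) },
        Token { kind: TokenKind::Ident("b".into()), range: (3, 4) },
    ];

    let kinds_identical = old.len() == new.len()
        && old.iter().zip(&new).all(|(a, b)| tokens_same_kind(a, b));
    // The kind-only fast path would fire here...
    assert!(kinds_identical);
    // ...but the tokens (including ranges) are not identical, so reused CST
    // nodes would keep stale ranges. Comparing full tokens avoids this.
    assert!(old != new);
}
```

Tightening the fast-path condition to full token equality (`self.tokens == new_tokens`, ranges included) makes the skip safe at the cost of firing less often.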

Copilot uses AI. Check for mistakes.
Comment on lines +151 to +161

// Adjust suffix group token ranges by the delta.
for range in &mut self.node_token_ranges[last_affected..] {
range.0 = ((range.0 as isize) + delta) as usize;
range.1 = ((range.1 as isize) + delta) as usize;
}

// Splice in the new nodes and ranges.
self.nodes.splice(first_affected..last_affected, new_nodes);
self.node_token_ranges.splice(first_affected..last_affected, new_ranges);


Copilot AI Mar 6, 2026

IncrementalParser splices self.nodes using indices (first_affected..last_affected) derived from self.node_token_ranges, but node_token_ranges is described as "one range per statement group" while nodes is a flat Vec<Shared<Node>> returned by Parser (which includes expr nodes plus separator token nodes like |, ;, eof). If node_token_ranges.len() != nodes.len(), these indices won't refer to the same logical units and the splice can replace the wrong nodes (or panic if assumptions change). The incremental cache should store and splice the same unit that the ranges index (either group the CST nodes by statement, or compute and store a range per top-level node so indexing stays aligned).
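One way to guarantee the alignment asked for above is to store nodes and their token ranges in a single vector of statement groups, so splice indices always refer to the same logical units. A sketch, where `Group` is a hypothetical cache entry rather than mq-lang's actual representation:

```rust
// Hypothetical cache entry: one statement group holds its nodes and its
// half-open token range together, so they cannot drift out of alignment.
#[derive(Debug, Clone, PartialEq)]
struct Group {
    nodes: Vec<String>,          // stand-in for Vec<Shared<Node>>
    token_range: (usize, usize), // half-open [start, end) over the token stream
}

fn splice_groups(cache: &mut Vec<Group>, first: usize, last: usize, new_groups: Vec<Group>) {
    // One vector, one index space: splicing replaces whole groups, so the
    // nodes and ranges of untouched groups stay paired by construction.
    cache.splice(first..last, new_groups);
}

fn main() {
    let mut cache = vec![
        Group { nodes: vec!["upcase".into(), "|".into()], token_range: (0, 2) },
        Group { nodes: vec!["downcase".into()], token_range: (2, 3) },
    ];
    // Replace the second statement group only.
    let replacement = vec![Group { nodes: vec!["ltrim".into()], token_range: (2, 3) }];
    splice_groups(&mut cache, 1, 2, replacement);
    assert_eq!(cache.len(), 2);
    assert_eq!(cache[1].nodes, vec!["ltrim".to_string()]);
}
```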

Comment on lines +174 to +179
/// Applies a byte-offset-based [`TextEdit`] to the source and updates the parse.
pub fn apply_edit(&mut self, edit: &TextEdit) -> (&[Shared<Node>], &ErrorReporter) {
let mut new_source = self.source.clone();
new_source.replace_range(edit.start..edit.end, &edit.new_text);
self.update(&new_source)
}

Copilot AI Mar 6, 2026

apply_edit uses String::replace_range(edit.start..edit.end, ...) with byte offsets, which will panic if start/end are not on UTF-8 char boundaries (or if the range is otherwise invalid). Since this is a public API, consider changing it to return a Result and validating is_char_boundary/range bounds before mutating the string, so malformed edits from editor/LSP inputs don't crash the process.
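A sketch of the validated, non-panicking variant suggested above. `EditError` and `try_apply` are hypothetical names for illustration, and `TextEdit` here is a simplified stand-in for the PR's type:

```rust
// Simplified stand-in for the PR's TextEdit (byte offsets into the source).
#[derive(Debug)]
struct TextEdit {
    start: usize,
    end: usize,
    new_text: String,
}

// Hypothetical error type for rejected edits.
#[derive(Debug, PartialEq)]
enum EditError {
    OutOfBounds,
    NotCharBoundary,
}

fn try_apply(source: &str, edit: &TextEdit) -> Result<String, EditError> {
    // Reject inverted or out-of-range edits before touching the string.
    if edit.start > edit.end || edit.end > source.len() {
        return Err(EditError::OutOfBounds);
    }
    // replace_range panics on non-boundary offsets, so check first.
    if !source.is_char_boundary(edit.start) || !source.is_char_boundary(edit.end) {
        return Err(EditError::NotCharBoundary);
    }
    let mut out = source.to_string();
    out.replace_range(edit.start..edit.end, &edit.new_text);
    Ok(out)
}

fn main() {
    // A valid edit: replace "downcase" (bytes 9..17).
    let ok = try_apply(
        "upcase | downcase",
        &TextEdit { start: 9, end: 17, new_text: "ltrim".into() },
    );
    assert_eq!(ok.unwrap(), "upcase | ltrim");

    // "é" is two bytes; offset 1 splits it and is rejected instead of panicking.
    let bad = try_apply("é", &TextEdit { start: 1, end: 2, new_text: String::new() });
    assert_eq!(bad, Err(EditError::NotCharBoundary));
}
```

This keeps malformed LSP edits (stale offsets, wrong encoding) from aborting the host process.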

Comment on lines +243 to +251
#[test]
fn test_incremental_append() {
let source = "upcase";
let mut parser = IncrementalParser::new(source);

let (nodes, errors) = parser.update("upcase | downcase");
assert!(!errors.has_errors());
assert!(!nodes.is_empty());
}

Copilot AI Mar 6, 2026

The current incremental-parser tests only assert non-empty output / no errors, but they don't validate that incremental updates produce the same CST + token ranges as a full re-parse, nor do they cover tricky cases like edits that change only the newline encoding (e.g., \n vs \r\n) or edits confined to the middle of a multi-statement program. Adding assertions that compare IncrementalParser results against Parser::new(&lexed_tokens).parse_with_ranges() for a variety of edits would help catch incorrect splice/range logic.
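The assertion shape being asked for can be illustrated with a toy stand-in (this `IncrementalParser` simply re-parses and `full_parse` splits on `|`; neither is mq-lang's real implementation, the point is only the invariant incremental == from-scratch):

```rust
// Toy "full parser": split a pipeline on '|' and trim each segment.
fn full_parse(source: &str) -> Vec<String> {
    source.split('|').map(|s| s.trim().to_string()).collect()
}

// Toy incremental parser used only to show the property-style check.
struct IncrementalParser {
    source: String,
    nodes: Vec<String>,
}

impl IncrementalParser {
    fn new(source: &str) -> Self {
        Self { source: source.to_string(), nodes: full_parse(source) }
    }
    fn update(&mut self, new_source: &str) -> &[String] {
        self.source = new_source.to_string();
        self.nodes = full_parse(new_source);
        &self.nodes
    }
}

fn main() {
    let mut p = IncrementalParser::new("upcase");
    // Cover an append, a newline-encoding change, and a mid-program edit.
    for edit in ["upcase | downcase", "upcase |\r\ndowncase", "trim | upcase | ltrim"] {
        let incremental = p.update(edit).to_vec();
        // The invariant: incremental results must match a from-scratch parse.
        assert_eq!(incremental, full_parse(edit));
    }
}
```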

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +163 to +166
/// Parses the full program and also returns token index ranges for each top-level node group.
///
/// Each entry `(start, end)` in the returned `Vec` corresponds to the node group at the same
/// index in `Vec<Shared<Node>>` and represents the half-open token index range `[start, end)`.

Copilot AI Mar 6, 2026

parse_with_ranges documents that the returned Vec<(start,end)> is parallel to Vec<Shared<Node>>, but parse_program only pushes ranges at statement boundaries (e.g., after |, ;, end, eof). This can make ranges.len() differ from nodes.len(), so consumers cannot safely index nodes[i] with ranges[i]. Either change the implementation to emit a range per top-level node (ensuring equal lengths), or change the API/docs to return explicit statement-group data and update callers accordingly.

Suggested change
/// Parses the full program and also returns token index ranges for each top-level node group.
///
/// Each entry `(start, end)` in the returned `Vec` corresponds to the node group at the same
/// index in `Vec<Shared<Node>>` and represents the half-open token index range `[start, end)`.
/// Parses the full program and also returns token index ranges for each top-level
/// statement group.
///
/// The returned `Vec<(start, end)>` contains half-open token index ranges `[start, end)`
/// delimiting top-level statement groups, where groups are separated by `|`, `;`, `end`,
/// or end-of-file. Multiple nodes may belong to the same statement group and share a
/// single `(start, end)` entry.
///
/// The number of ranges is **not** guaranteed to equal the number of returned nodes,
/// and callers must not assume a 1:1, index-parallel correspondence between
/// `Vec<Shared<Node>>` and `Vec<(usize, usize)>`. Instead, treat the ranges as group
/// boundaries over the token stream.

Copilot AI review requested due to automatic review settings March 6, 2026 15:00
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment on lines 339 to 343
}

if nodes.is_empty() {
match self.tokens.peek() {
match self.peek() {
Some(token) => self.errors.report(ParseError::UnexpectedToken(Shared::clone(token))),

Copilot AI Mar 6, 2026

parse_program only appends to ranges when it hits specific boundary tokens (eof/pipe/semicolon/def/macro/…). If the loop exits because the token slice ends (peek() becomes None), the final statement group won’t get a range entry. This makes parse_with_ranges unreliable for parsing sub-slices (e.g. incremental reparsing); consider pushing a final (stmt_start, self.pos) range when root and nodes is non-empty before returning.
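The missing trailing-range case can be shown in isolation. A standalone sketch (tokens modeled as plain `char`s, not mq-lang's token type) of grouping a token stream into half-open statement ranges, including the final push the comment asks for:

```rust
// Split a token stream into half-open statement-group ranges, where '|' and
// ';' are boundary tokens. Tokens are modeled as chars for illustration.
fn group_ranges(tokens: &[char]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut stmt_start = 0;
    for (i, t) in tokens.iter().enumerate() {
        if *t == '|' || *t == ';' {
            ranges.push((stmt_start, i + 1)); // include the separator token
            stmt_start = i + 1;
        }
    }
    // Without this final push, a program that ends without a boundary token
    // (the peek() == None case) would lose its last statement group.
    if stmt_start < tokens.len() {
        ranges.push((stmt_start, tokens.len()));
    }
    ranges
}

fn main() {
    let toks: Vec<char> = "a|b".chars().collect();
    // The trailing "b" gets its own range only because of the final push.
    assert_eq!(group_ranges(&toks), vec![(0, 2), (2, 3)]);
}
```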

Comment on lines +92 to +170
// Find the changed region by comparing token kinds.
// We compare kinds (not positions) because positions shift with every edit.
let prefix = self
.tokens
.iter()
.zip(new_tokens.iter())
.take_while(|(a, b)| tokens_same_kind(a, b))
.count();

let old_suffix = self
.tokens
.iter()
.rev()
.zip(new_tokens.iter().rev())
.take_while(|(a, b)| tokens_same_kind(a, b))
.count();

let old_changed_end = self.tokens.len().saturating_sub(old_suffix);
let new_changed_end = new_tokens.len().saturating_sub(old_suffix);

// If token structure is identical, just update positions and source.
if prefix >= old_changed_end && prefix >= new_changed_end {
self.source = new_source.to_string();
self.tokens = new_tokens;
return (&self.nodes, &self.errors);
}

// Find the affected top-level node groups.
// A group is affected if its token range overlaps [prefix, old_changed_end).
let first_affected = self.node_token_ranges.partition_point(|&(_, end)| end <= prefix);
let last_affected = self
.node_token_ranges
.partition_point(|&(start, _)| start < old_changed_end);

// Compute the token range in old tokens that the affected groups cover.
let reparse_old_start = self
.node_token_ranges
.get(first_affected)
.map(|r| r.0)
.unwrap_or(prefix);
let reparse_old_end = if last_affected > 0 {
self.node_token_ranges
.get(last_affected - 1)
.map(|r| r.1)
.unwrap_or(old_changed_end)
} else {
old_changed_end
};

// Compute the corresponding range in new tokens.
// Tokens before `reparse_old_start` are unchanged → same count in new tokens.
let reparse_new_start = reparse_old_start;
// Tokens at old_suffix distance from the end are unchanged.
let delta = new_tokens.len() as isize - self.tokens.len() as isize;
let reparse_new_end = ((reparse_old_end as isize) + delta).max(reparse_new_start as isize) as usize;

// Re-parse the affected token slice using the new full token array.
// We start parsing from reparse_new_start and parse until reparse_new_end.
let (new_nodes, new_ranges, _) = Self::parse_tokens_range(&new_tokens, reparse_new_start, reparse_new_end);

// Adjust suffix group token ranges by the delta.
for range in &mut self.node_token_ranges[last_affected..] {
range.0 = ((range.0 as isize) + delta) as usize;
range.1 = ((range.1 as isize) + delta) as usize;
}

// Splice in the new nodes and ranges.
self.nodes.splice(first_affected..last_affected, new_nodes);
self.node_token_ranges.splice(first_affected..last_affected, new_ranges);

self.source = new_source.to_string();
self.tokens = new_tokens;

// Re-run error collection on the affected region by doing a lightweight
// full re-parse of errors only (re-use existing errors for unaffected regions).
// For simplicity, we re-parse the entire source for errors when the structure changed.
let (_, _, errors) = Self::parse_tokens(&self.tokens);
self.errors = errors;


Copilot AI Mar 6, 2026

After splicing new_nodes into self.nodes, the remaining prefix/suffix nodes are reused from the previous parse but self.tokens is replaced with new_tokens. Because CST Nodes store Shared<Token> (with ranges), reused nodes will still point at old tokens/locations from the previous source, so the resulting CST mixes tokens from multiple versions of the source. To keep ranges correct, either keep token objects stable across updates or rebuild/retokenize the reused nodes to reference the current new_tokens.

Suggested change
// Find the changed region by comparing token kinds.
// We compare kinds (not positions) because positions shift with every edit.
let prefix = self
.tokens
.iter()
.zip(new_tokens.iter())
.take_while(|(a, b)| tokens_same_kind(a, b))
.count();
let old_suffix = self
.tokens
.iter()
.rev()
.zip(new_tokens.iter().rev())
.take_while(|(a, b)| tokens_same_kind(a, b))
.count();
let old_changed_end = self.tokens.len().saturating_sub(old_suffix);
let new_changed_end = new_tokens.len().saturating_sub(old_suffix);
// If token structure is identical, just update positions and source.
if prefix >= old_changed_end && prefix >= new_changed_end {
self.source = new_source.to_string();
self.tokens = new_tokens;
return (&self.nodes, &self.errors);
}
// Find the affected top-level node groups.
// A group is affected if its token range overlaps [prefix, old_changed_end).
let first_affected = self.node_token_ranges.partition_point(|&(_, end)| end <= prefix);
let last_affected = self
.node_token_ranges
.partition_point(|&(start, _)| start < old_changed_end);
// Compute the token range in old tokens that the affected groups cover.
let reparse_old_start = self
.node_token_ranges
.get(first_affected)
.map(|r| r.0)
.unwrap_or(prefix);
let reparse_old_end = if last_affected > 0 {
self.node_token_ranges
.get(last_affected - 1)
.map(|r| r.1)
.unwrap_or(old_changed_end)
} else {
old_changed_end
};
// Compute the corresponding range in new tokens.
// Tokens before `reparse_old_start` are unchanged → same count in new tokens.
let reparse_new_start = reparse_old_start;
// Tokens at old_suffix distance from the end are unchanged.
let delta = new_tokens.len() as isize - self.tokens.len() as isize;
let reparse_new_end = ((reparse_old_end as isize) + delta).max(reparse_new_start as isize) as usize;
// Re-parse the affected token slice using the new full token array.
// We start parsing from reparse_new_start and parse until reparse_new_end.
let (new_nodes, new_ranges, _) = Self::parse_tokens_range(&new_tokens, reparse_new_start, reparse_new_end);
// Adjust suffix group token ranges by the delta.
for range in &mut self.node_token_ranges[last_affected..] {
range.0 = ((range.0 as isize) + delta) as usize;
range.1 = ((range.1 as isize) + delta) as usize;
}
// Splice in the new nodes and ranges.
self.nodes.splice(first_affected..last_affected, new_nodes);
self.node_token_ranges.splice(first_affected..last_affected, new_ranges);
self.source = new_source.to_string();
self.tokens = new_tokens;
// Re-run error collection on the affected region by doing a lightweight
// full re-parse of errors only (re-use existing errors for unaffected regions).
// For simplicity, we re-parse the entire source for errors when the structure changed.
let (_, _, errors) = Self::parse_tokens(&self.tokens);
self.errors = errors;
// To keep CST node token ranges consistent with the current source,
// we always rebuild the CST from the freshly lexed tokens. This avoids
// mixing nodes that reference old `Token` instances with nodes that
// reference new ones.
let (new_nodes, new_ranges, new_errors) = Self::parse_tokens(&new_tokens);
self.source = new_source.to_string();
self.tokens = new_tokens;
self.nodes = new_nodes;
self.node_token_ranges = new_ranges;
self.errors = new_errors;

let (nodes, errors) = parser.apply_edit(&edit);
assert!(!errors.has_errors());
assert!(!nodes.is_empty());
assert_eq!(parser.source(), "upcase | ltrim");

Copilot AI Mar 6, 2026

test_apply_edit’s expected source string doesn’t match what replace_range(11..19, "ltrim") does on "upcase() | downcase()" (it only replaces downcase, leaving the trailing () intact). As written, this assertion should fail; update the offsets and/or expected output to match the edit semantics you want to test.

Suggested change
assert_eq!(parser.source(), "upcase | ltrim");
assert_eq!(parser.source(), "upcase() | ltrim()");

@codspeed-hq

codspeed-hq bot commented Mar 6, 2026

Merging this PR will not alter performance

✅ 29 untouched benchmarks


Comparing feat/incremental-cst-parser (e424842) with main (460ce4e)¹

Open in CodSpeed

Footnotes

  ¹ No successful run was found on main (2c20e82) during the generation of this report, so 460ce4e was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@harehare harehare merged commit b6e3a52 into main Mar 6, 2026
6 checks passed
@harehare harehare deleted the feat/incremental-cst-parser branch March 6, 2026 23:34

2 participants