✨ feat(mq-lang): add incremental CST parser #1400
Conversation
Pull request overview
Adds an incremental CST parsing facility to mq-lang (behind the cst feature) intended to support efficient re-parsing on edits (e.g., editor/LSP scenarios).
Changes:
- Expose `IncrementalParser` and `TextEdit` from mq-lang when `feature = "cst"` is enabled.
- Refactor the CST parser to operate on token slices with an explicit cursor and add `parse_with_ranges`.
- Add a new incremental parser implementation plus basic tests.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| crates/mq-lang/src/lib.rs | Re-exports the incremental CST API and adjusts CST parse recovery token handling. |
| crates/mq-lang/src/cst/parser.rs | Refactors parser to slice+cursor and adds parse_with_ranges / range tracking. |
| crates/mq-lang/src/cst/incremental.rs | New incremental CST parser with token-diffing, partial re-parse, and splice-based updates. |
| crates/mq-lang/src/cst.rs | Exposes the new incremental module. |
```rust
// If token structure is identical, just update positions and source.
if prefix >= old_changed_end && prefix >= new_changed_end {
    self.source = new_source.to_string();
    self.tokens = new_tokens;
    return (&self.nodes, &self.errors);
}
```
In update, the "token structure identical" fast-path updates self.tokens/self.source but returns without rebuilding self.nodes. Since CST nodes store Shared<Token> (including range), this can leave nodes holding stale token ranges whenever the lexer normalizes different spellings into the same TokenKind (e.g., line_ending treats \n and \r\n as TokenKind::NewLine). Consider tightening this condition to only skip when the entire token stream is identical (e.g., self.tokens == new_tokens, including ranges), or otherwise re-parse/rebuild nodes so token ranges stay consistent with the updated source.
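The kind-vs-identity distinction can be seen with a toy token type; this is a sketch, with a hypothetical `Tok` struct standing in for mq-lang's `Shared<Token>`:

```rust
// Toy stand-in for mq-lang's tokens: same kind, different byte ranges
// (e.g. "\n" vs "\r\n" both lexing to NewLine). Types are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct Tok {
    kind: &'static str,
    range: (usize, usize),
}

fn same_kinds(a: &[Tok], b: &[Tok]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.kind == y.kind)
}

fn main() {
    let old = vec![Tok { kind: "NewLine", range: (0, 1) }]; // "\n"
    let new = vec![Tok { kind: "NewLine", range: (0, 2) }]; // "\r\n"
    // A kind-only comparison takes the fast path...
    assert!(same_kinds(&old, &new));
    // ...but full equality (kind + range) correctly rejects it,
    // which is the tightened condition the comment suggests.
    assert_ne!(old, new);
}
```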
```rust
// Adjust suffix group token ranges by the delta.
for range in &mut self.node_token_ranges[last_affected..] {
    range.0 = ((range.0 as isize) + delta) as usize;
    range.1 = ((range.1 as isize) + delta) as usize;
}
```

```rust
// Splice in the new nodes and ranges.
self.nodes.splice(first_affected..last_affected, new_nodes);
self.node_token_ranges.splice(first_affected..last_affected, new_ranges);
```
IncrementalParser splices self.nodes using indices (first_affected..last_affected) derived from self.node_token_ranges, but node_token_ranges is described as "one range per statement group" while nodes is a flat Vec<Shared<Node>> returned by Parser (which includes expr nodes plus separator token nodes like |, ;, eof). If node_token_ranges.len() != nodes.len(), these indices won't refer to the same logical units and the splice can replace the wrong nodes (or panic if assumptions change). The incremental cache should store and splice the same unit that the ranges index (either group the CST nodes by statement, or compute and store a range per top-level node so indexing stays aligned).
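One way to keep the two collections aligned is to make the spliced unit own both its nodes and its token range; a sketch with hypothetical types (plain strings stand in for `Shared<Node>`):

```rust
// Hypothetical cache unit: each spliced element carries both its CST nodes
// and the half-open token range that indexes them, so the node list and the
// range list can never drift out of sync when splicing.
#[derive(Debug, Clone, PartialEq)]
struct StatementGroup {
    nodes: Vec<String>,          // stand-in for Vec<Shared<Node>>
    token_range: (usize, usize), // half-open [start, end) into the token stream
}

fn main() {
    let mut groups = vec![
        StatementGroup { nodes: vec!["upcase".into(), "|".into()], token_range: (0, 2) },
        StatementGroup { nodes: vec!["downcase".into(), "eof".into()], token_range: (2, 4) },
    ];
    // Replacing a group replaces its nodes and its range in one operation.
    groups.splice(1..2, [StatementGroup {
        nodes: vec!["ltrim".into(), "eof".into()],
        token_range: (2, 4),
    }]);
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[1].nodes[0], "ltrim");
}
```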
```rust
/// Applies a byte-offset-based [`TextEdit`] to the source and updates the parse.
pub fn apply_edit(&mut self, edit: &TextEdit) -> (&[Shared<Node>], &ErrorReporter) {
    let mut new_source = self.source.clone();
    new_source.replace_range(edit.start..edit.end, &edit.new_text);
    self.update(&new_source)
}
```
apply_edit uses String::replace_range(edit.start..edit.end, ...) with byte offsets, which will panic if start/end are not on UTF-8 char boundaries (or if the range is otherwise invalid). Since this is a public API, consider changing it to return a Result and validating is_char_boundary/range bounds before mutating the string, so malformed edits from editor/LSP inputs don't crash the process.
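A validated variant might look like the following sketch; the `EditError` type and the helper name are illustrative, not part of mq-lang:

```rust
// Sketch: validate byte offsets before splicing, so malformed editor/LSP
// input yields an error instead of a panic. `EditError` is hypothetical.
#[derive(Debug, PartialEq)]
enum EditError {
    OutOfBounds,
    NotCharBoundary,
}

fn checked_replace_range(
    source: &mut String,
    start: usize,
    end: usize,
    new_text: &str,
) -> Result<(), EditError> {
    if start > end || end > source.len() {
        return Err(EditError::OutOfBounds);
    }
    if !source.is_char_boundary(start) || !source.is_char_boundary(end) {
        return Err(EditError::NotCharBoundary);
    }
    source.replace_range(start..end, new_text);
    Ok(())
}

fn main() {
    let mut s = String::from("héllo");
    // Offset 2 falls inside the two-byte 'é': replace_range would panic here.
    assert_eq!(checked_replace_range(&mut s, 1, 2, "e"), Err(EditError::NotCharBoundary));
    // A boundary-aligned edit succeeds.
    assert_eq!(checked_replace_range(&mut s, 1, 3, "e"), Ok(()));
    assert_eq!(s, "hello");
}
```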
```rust
#[test]
fn test_incremental_append() {
    let source = "upcase";
    let mut parser = IncrementalParser::new(source);

    let (nodes, errors) = parser.update("upcase | downcase");
    assert!(!errors.has_errors());
    assert!(!nodes.is_empty());
}
```
The current incremental-parser tests only assert non-empty output / no errors, but they don't validate that incremental updates produce the same CST + token ranges as a full re-parse, nor do they cover tricky cases like edits that change only newline encoding (\r\n ↔ \n) or edits confined to the middle of a multi-statement program. Adding assertions that compare IncrementalParser results against Parser::new(&lexed_tokens).parse_with_ranges() for a variety of edits would help catch incorrect splice/range logic.
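The invariant being asked for can be expressed with a toy model; here, splitting on `|` stands in for real parsing, and none of these helpers exist in mq-lang:

```rust
// Toy model: "parsing" is splitting on '|'. The property under test mirrors
// the suggested assertion: after any edit, an incrementally maintained
// result must equal a from-scratch parse of the edited source.
fn full_parse(src: &str) -> Vec<String> {
    src.split('|').map(|s| s.trim().to_string()).collect()
}

fn main() {
    let mut cached = full_parse("upcase | downcase | trim");
    // Simulate an incremental update confined to the middle statement.
    cached[1] = "ltrim".to_string();
    let edited = "upcase | ltrim | trim";
    // Incremental result == full re-parse of the edited source.
    assert_eq!(cached, full_parse(edited));
}
```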
```rust
/// Parses the full program and also returns token index ranges for each top-level node group.
///
/// Each entry `(start, end)` in the returned `Vec` corresponds to the node group at the same
/// index in `Vec<Shared<Node>>` and represents the half-open token index range `[start, end)`.
```
parse_with_ranges documents that the returned Vec<(start,end)> is parallel to Vec<Shared<Node>>, but parse_program only pushes ranges at statement boundaries (e.g., after |, ;, end, eof). This can make ranges.len() differ from nodes.len(), so consumers cannot safely index nodes[i] with ranges[i]. Either change the implementation to emit a range per top-level node (ensuring equal lengths), or change the API/docs to return explicit statement-group data and update callers accordingly.
Suggested change:

```diff
-/// Parses the full program and also returns token index ranges for each top-level node group.
-///
-/// Each entry `(start, end)` in the returned `Vec` corresponds to the node group at the same
-/// index in `Vec<Shared<Node>>` and represents the half-open token index range `[start, end)`.
+/// Parses the full program and also returns token index ranges for each top-level
+/// statement group.
+///
+/// The returned `Vec<(start, end)>` contains half-open token index ranges `[start, end)`
+/// delimiting top-level statement groups, where groups are separated by `|`, `;`, `end`,
+/// or end-of-file. Multiple nodes may belong to the same statement group and share a
+/// single `(start, end)` entry.
+///
+/// The number of ranges is **not** guaranteed to equal the number of returned nodes,
+/// and callers must not assume a 1:1, index-parallel correspondence between
+/// `Vec<Shared<Node>>` and `Vec<(usize, usize)>`. Instead, treat the ranges as group
+/// boundaries over the token stream.
```
```diff
 }

 if nodes.is_empty() {
-    match self.tokens.peek() {
+    match self.peek() {
         Some(token) => self.errors.report(ParseError::UnexpectedToken(Shared::clone(token))),
```
parse_program only appends to ranges when it hits specific boundary tokens (eof/pipe/semicolon/def/macro/…). If the loop exits because the token slice ends (peek() becomes None), the final statement group won't get a range entry. This makes parse_with_ranges unreliable for parsing sub-slices (e.g. incremental reparsing); consider pushing a final `(stmt_start, self.pos)` range when parsing at the root and `nodes` is non-empty before returning.
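The fix amounts to flushing the still-open group when the token slice runs out; a standalone sketch over boundary indices (the helper name is illustrative):

```rust
// Sketch: turn boundary token positions into half-open group ranges,
// flushing a final range when input ends without a trailing boundary token.
fn group_ranges(boundaries: &[usize], token_len: usize) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for &b in boundaries {
        ranges.push((start, b));
        start = b;
    }
    if start < token_len {
        // Final group: the slice ended without `|`, `;`, `end`, or eof.
        ranges.push((start, token_len));
    }
    ranges
}

fn main() {
    // Boundaries after tokens 3 and 6, but 8 tokens total: the last two
    // tokens form a group that only the final flush captures.
    assert_eq!(group_ranges(&[3, 6], 8), vec![(0, 3), (3, 6), (6, 8)]);
}
```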
```rust
// Find the changed region by comparing token kinds.
// We compare kinds (not positions) because positions shift with every edit.
let prefix = self
    .tokens
    .iter()
    .zip(new_tokens.iter())
    .take_while(|(a, b)| tokens_same_kind(a, b))
    .count();

let old_suffix = self
    .tokens
    .iter()
    .rev()
    .zip(new_tokens.iter().rev())
    .take_while(|(a, b)| tokens_same_kind(a, b))
    .count();

let old_changed_end = self.tokens.len().saturating_sub(old_suffix);
let new_changed_end = new_tokens.len().saturating_sub(old_suffix);

// If token structure is identical, just update positions and source.
if prefix >= old_changed_end && prefix >= new_changed_end {
    self.source = new_source.to_string();
    self.tokens = new_tokens;
    return (&self.nodes, &self.errors);
}

// Find the affected top-level node groups.
// A group is affected if its token range overlaps [prefix, old_changed_end).
let first_affected = self.node_token_ranges.partition_point(|&(_, end)| end <= prefix);
let last_affected = self
    .node_token_ranges
    .partition_point(|&(start, _)| start < old_changed_end);

// Compute the token range in old tokens that the affected groups cover.
let reparse_old_start = self
    .node_token_ranges
    .get(first_affected)
    .map(|r| r.0)
    .unwrap_or(prefix);
let reparse_old_end = if last_affected > 0 {
    self.node_token_ranges
        .get(last_affected - 1)
        .map(|r| r.1)
        .unwrap_or(old_changed_end)
} else {
    old_changed_end
};

// Compute the corresponding range in new tokens.
// Tokens before `reparse_old_start` are unchanged → same count in new tokens.
let reparse_new_start = reparse_old_start;
// Tokens at old_suffix distance from the end are unchanged.
let delta = new_tokens.len() as isize - self.tokens.len() as isize;
let reparse_new_end = ((reparse_old_end as isize) + delta).max(reparse_new_start as isize) as usize;

// Re-parse the affected token slice using the new full token array.
// We start parsing from reparse_new_start and parse until reparse_new_end.
let (new_nodes, new_ranges, _) = Self::parse_tokens_range(&new_tokens, reparse_new_start, reparse_new_end);

// Adjust suffix group token ranges by the delta.
for range in &mut self.node_token_ranges[last_affected..] {
    range.0 = ((range.0 as isize) + delta) as usize;
    range.1 = ((range.1 as isize) + delta) as usize;
}

// Splice in the new nodes and ranges.
self.nodes.splice(first_affected..last_affected, new_nodes);
self.node_token_ranges.splice(first_affected..last_affected, new_ranges);

self.source = new_source.to_string();
self.tokens = new_tokens;

// Re-run error collection on the affected region by doing a lightweight
// full re-parse of errors only (re-use existing errors for unaffected regions).
// For simplicity, we re-parse the entire source for errors when the structure changed.
let (_, _, errors) = Self::parse_tokens(&self.tokens);
self.errors = errors;
```
After splicing new_nodes into self.nodes, the remaining prefix/suffix nodes are reused from the previous parse but self.tokens is replaced with new_tokens. Because CST Nodes store Shared<Token> (with ranges), reused nodes will still point at old tokens/locations from the previous source, so the resulting CST mixes tokens from multiple versions of the source. To keep ranges correct, either keep token objects stable across updates or rebuild/retokenize the reused nodes to reference the current new_tokens.
Suggested change:

```diff
-// Find the changed region by comparing token kinds.
-// We compare kinds (not positions) because positions shift with every edit.
-let prefix = self
-    .tokens
-    .iter()
-    .zip(new_tokens.iter())
-    .take_while(|(a, b)| tokens_same_kind(a, b))
-    .count();
-let old_suffix = self
-    .tokens
-    .iter()
-    .rev()
-    .zip(new_tokens.iter().rev())
-    .take_while(|(a, b)| tokens_same_kind(a, b))
-    .count();
-let old_changed_end = self.tokens.len().saturating_sub(old_suffix);
-let new_changed_end = new_tokens.len().saturating_sub(old_suffix);
-// If token structure is identical, just update positions and source.
-if prefix >= old_changed_end && prefix >= new_changed_end {
-    self.source = new_source.to_string();
-    self.tokens = new_tokens;
-    return (&self.nodes, &self.errors);
-}
-// Find the affected top-level node groups.
-// A group is affected if its token range overlaps [prefix, old_changed_end).
-let first_affected = self.node_token_ranges.partition_point(|&(_, end)| end <= prefix);
-let last_affected = self
-    .node_token_ranges
-    .partition_point(|&(start, _)| start < old_changed_end);
-// Compute the token range in old tokens that the affected groups cover.
-let reparse_old_start = self
-    .node_token_ranges
-    .get(first_affected)
-    .map(|r| r.0)
-    .unwrap_or(prefix);
-let reparse_old_end = if last_affected > 0 {
-    self.node_token_ranges
-        .get(last_affected - 1)
-        .map(|r| r.1)
-        .unwrap_or(old_changed_end)
-} else {
-    old_changed_end
-};
-// Compute the corresponding range in new tokens.
-// Tokens before `reparse_old_start` are unchanged → same count in new tokens.
-let reparse_new_start = reparse_old_start;
-// Tokens at old_suffix distance from the end are unchanged.
-let delta = new_tokens.len() as isize - self.tokens.len() as isize;
-let reparse_new_end = ((reparse_old_end as isize) + delta).max(reparse_new_start as isize) as usize;
-// Re-parse the affected token slice using the new full token array.
-// We start parsing from reparse_new_start and parse until reparse_new_end.
-let (new_nodes, new_ranges, _) = Self::parse_tokens_range(&new_tokens, reparse_new_start, reparse_new_end);
-// Adjust suffix group token ranges by the delta.
-for range in &mut self.node_token_ranges[last_affected..] {
-    range.0 = ((range.0 as isize) + delta) as usize;
-    range.1 = ((range.1 as isize) + delta) as usize;
-}
-// Splice in the new nodes and ranges.
-self.nodes.splice(first_affected..last_affected, new_nodes);
-self.node_token_ranges.splice(first_affected..last_affected, new_ranges);
-self.source = new_source.to_string();
-self.tokens = new_tokens;
-// Re-run error collection on the affected region by doing a lightweight
-// full re-parse of errors only (re-use existing errors for unaffected regions).
-// For simplicity, we re-parse the entire source for errors when the structure changed.
-let (_, _, errors) = Self::parse_tokens(&self.tokens);
-self.errors = errors;
+// To keep CST node token ranges consistent with the current source,
+// we always rebuild the CST from the freshly lexed tokens. This avoids
+// mixing nodes that reference old `Token` instances with nodes that
+// reference new ones.
+let (new_nodes, new_ranges, new_errors) = Self::parse_tokens(&new_tokens);
+self.source = new_source.to_string();
+self.tokens = new_tokens;
+self.nodes = new_nodes;
+self.node_token_ranges = new_ranges;
+self.errors = new_errors;
```
```rust
let (nodes, errors) = parser.apply_edit(&edit);
assert!(!errors.has_errors());
assert!(!nodes.is_empty());
assert_eq!(parser.source(), "upcase | ltrim");
```
test_apply_edit’s expected source string doesn’t match what replace_range(11..19, "ltrim") does on "upcase() | downcase()" (it only replaces downcase, leaving the trailing () intact). As written, this assertion should fail; update the offsets and/or expected output to match the edit semantics you want to test.
Suggested change:

```diff
-assert_eq!(parser.source(), "upcase | ltrim");
+assert_eq!(parser.source(), "upcase() | ltrim()");
```
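The corrected expectation can be checked directly against `String::replace_range`:

```rust
fn main() {
    let mut s = String::from("upcase() | downcase()");
    // Bytes 11..19 cover exactly "downcase"; the trailing "()" is untouched.
    s.replace_range(11..19, "ltrim");
    assert_eq!(s, "upcase() | ltrim()");
}
```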
No description provided.