Skip to content

Comments

feat(ast): add loc field support for AST nodes#13285

Open
mertcanaltin wants to merge 37 commits intooxc-project:mainfrom
mertcanaltin:mert/feat/add-loc-field-support
Open

feat(ast): add loc field support for AST nodes#13285
mertcanaltin wants to merge 37 commits intooxc-project:mainfrom
mertcanaltin:mert/feat/add-loc-field-support

Conversation

@mertcanaltin
Copy link

Add line and column information to AST serialization:

extend UTF8ToUtf16 converter with line tracking
support all line break types (\n, \r, \r\n, LS, PS)
add loc options to ESTree serializer configs
implement efficient offset-to-line-column conversion
add SourceLocation struct for ESTree compatibility
integrate with NAPI parser bindings

fyi: #10307

Copilot AI review requested due to automatic review settings August 24, 2025 20:09
@graphite-app
Copy link
Contributor

graphite-app bot commented Aug 24, 2025

How to use the Graphite Merge Queue

Add either label to this PR to merge it via the merge queue:

  • 0-merge - adds this PR to the back of the merge queue
  • hotfix - for urgent hot fixes, skip the queue and merge this PR next

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

@github-actions github-actions bot added A-ast Area - AST C-enhancement Category - New feature or request labels Aug 24, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds line and column information support to AST serialization by implementing the loc field for ESTree compatibility. The implementation extends the UTF8ToUtf16 converter with line tracking capabilities and integrates this functionality throughout the serialization pipeline.

  • Adds line tracking to the UTF8ToUtf16 converter with support for all line break types
  • Implements SourceLocation and Position structs for ESTree-compatible loc fields
  • Extends serialization methods to support optional loc information
  • Integrates loc functionality with NAPI parser bindings

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
napi/parser/src/types.rs Adds loc field to ParserOptions struct for controlling line/column output
napi/parser/src/lib.rs Integrates loc support into parser flow and AST serialization
crates/oxc_napi/src/lib.rs Extends UTF8ToUtf16 conversion to support line tracking
crates/oxc_estree/src/serialize/structs.rs Adds SourceLocation/Position structs and loc serialization methods
crates/oxc_estree/src/serialize/mod.rs Extends ESTreeSerializer with loc support
crates/oxc_estree/src/serialize/config.rs Updates Config structs to track loc option
crates/oxc_ast_visit/tests/test_loc_integration.rs Adds integration tests for loc functionality
crates/oxc_ast_visit/src/utf8_to_utf16/translation.rs Implements line break detection and tracking logic
crates/oxc_ast_visit/src/utf8_to_utf16/mod.rs Adds line/column conversion methods to Utf8ToUtf16
crates/oxc_ast_visit/src/utf8_to_utf16/converter.rs Adds utility method for offset-to-line-column conversion
crates/oxc_ast/src/serialize/mod.rs Extends Program serialization methods with loc support

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@codspeed-hq
Copy link

codspeed-hq bot commented Aug 24, 2025

CodSpeed Performance Report

Merging #13285 will degrade performances by 43.59%

Comparing mertcanaltin:mert/feat/add-loc-field-support (0de8b57) with main (0740016)1

Summary

❌ 1 regression
✅ 41 untouched
⏩ 3 skipped2

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Mode Benchmark BASE HEAD Change
Simulation estree[checker.ts] 104.5 ms 185.3 ms -43.59%

Footnotes

  1. No successful run was found on main (9c573ec) during the generation of this report, so 0740016 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

  2. 3 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Copy link
Member

@overlookmotel overlookmotel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously this is incomplete at present - nothing actually calls serialize_span_with_loc at the moment, so the JSON won't include locs (as per the TODO comments).

Please let me know if you need any help getting the translation table into the serializer.

I've made a couple of comments in the meantime - apologies if they're premature, and you just haven't got to this yet.

One other broader point: I don't think we need separate *_with_loc methods. serialize_span_with_loc shouldn't be needed. Ditto new_with_loc, and the extra to_estree_ts_json_with_loc etc methods. The existing methods can just have an additional loc: bool param added.

We're not at 1.0 yet, and regularly make breaking changes. Please don't worry about changing existing APIs.

I've marked the PR as draft for now.

Comment on lines 117 to 160
let mut index = 0;
while index < slice.len() {
let byte = slice[index];

// Handle line breaks
if track_lines {
let mut is_line_break = false;
let mut line_break_len = 1;

match byte {
b'\n' => is_line_break = true,
b'\r' => {
is_line_break = true;
// Check for \r\n
if index + 1 < slice.len() && slice[index + 1] == b'\n' {
line_break_len = 2;
}
}
// Unicode line separators LS (\u2028) and PS (\u2029)
// LS: E2 80 A8, PS: E2 80 A9
0xE2 if index + 2 < slice.len()
&& slice[index + 1] == 0x80
&& (slice[index + 2] == 0xA8 || slice[index + 2] == 0xA9) =>
{
is_line_break = true;
line_break_len = 3;
}
_ => {}
}

if is_line_break {
let line_end_offset = start_offset + index + line_break_len;
if line_end_offset < source_text.len() {
#[expect(clippy::cast_possible_truncation)]
let utf8_offset = line_end_offset as u32;
lines
.as_mut()
.unwrap()
.push(LineTranslation { utf8_offset, utf16_difference });
}
index += line_break_len;
continue;
}
}
Copy link
Member

@overlookmotel overlookmotel Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see a few problems here.

  1. process_slice is only called in the main search loop (while ptr < body_end_ptr { ... }) if the chunk contains non-ASCII characters. So line breaks won't be recorded unless they're at the very start or end of the source text, or they happen to have a non-ASCII char close to them.

The chunk.contains_unicode() check in that loop needs to be replaced with a check for "chunk contains unicode OR \r / \n".

Or maybe better: Check chunk for non-ASCII bytes and call process_slice if any are found (as now). Then do a 2nd check for any \n / \r bytes in the chunk, and call a different function if any are found. That 2nd function can be simpler and faster, as it doesn't have to deal with PS / LS (because they'd be caught by the non-ASCII check).

  1. process_slice is passed a small subset of the source text (in the main loop, 32 bytes). If 0xE2 is the 31st byte, then PS/LS won't be identified, because the last 2 bytes are after the end of the slice.

  2. Irregular line breaks are non-ASCII, so need a Translation record as well as LineTranslation. Simplest way to do that is skip the continue statement if line_break_len == 3.

Also, a small optimization: No need for 2 vars is_line_break and line_break_len. Just line_break_len would do, with 0 representing "not a line break".

@overlookmotel overlookmotel marked this pull request as draft August 25, 2025 22:57
@overlookmotel
Copy link
Member

I'm afraid I had to make a few changes to the UTF-8 -> UTF-16 converter, so it could also be used to convert in the opposite direction UTF-16 -> UTF-8. That'll be the cause of the merge conflicts. The changes are fairly small, so hopefully not too difficult to rebase on top of them.

@mertcanaltin mertcanaltin force-pushed the mert/feat/add-loc-field-support branch from 9cf5a92 to d8f7d8a Compare August 31, 2025 09:08
@mertcanaltin
Copy link
Author

@overlookmotel You were spot on, fixed everything:

  • Line break detection now works throughout text (not just near Unicode)
  • Added chunk boundary fallback for PS/LS edge cases
  • Removed redundant *_with_loc methods, added loc param instead
  • serialize_span now auto-translates, showing real coordinates not (0,0)

@github-actions github-actions bot added the A-parser Area - Parser label Sep 3, 2025
Copy link
Member

@overlookmotel overlookmotel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a couple of fixes just to get more of it to compile.

But still some compilation failures, and some tests are failing.

All of these should pass:

cargo test --all-features
just test-estree
cd napi/parser; pnpm run build-test; RUN_RAW_TESTS=true pnpm test; cd ../..

I've made a few comments. I don't know if they explain the test failures or not.

Just to say (and I hope you don't consider this rude): Obviously we all rely on AI these days, but in my opinion our would-be robot overlords are not quite up to the task yet. So they need a bit of human hand-holding and checking of their work.

Could you please make sure the code compiles and tests pass before next round of review?

Of course, if you run into problems or want to discuss anything, very happy to help.

Comment on lines 135 to 138
if bytes_from_start_of_range <= self.range_len_utf8 {
// Offset is within current range.
// `wrapping_add` because `range_start_utf16` can be `u32::MAX`.
*offset = self.range_start_utf16.wrapping_add(bytes_from_start_of_range);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this change was introduced while resolving merge conflicts from #13339. This PR has reverted the change #13339 made, which isn't right. Ditto other changes in this file. convert_offset_back and convert_span_back methods definitely shouldn't be removed!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I edited

Comment on lines 162 to 175
let full_offset = start_offset + index;
let has_ls_ps = if index + 2 < slice.len() {
// Can check within this slice
slice[index + 1] == 0x80
&& (slice[index + 2] == 0xA8 || slice[index + 2] == 0xA9)
} else if full_offset + 2 < source_text.len() {
// Need to check in full source text (boundary case)
let source_bytes = source_text.as_bytes();
source_bytes[full_offset + 1] == 0x80
&& (source_bytes[full_offset + 2] == 0xA8
|| source_bytes[full_offset + 2] == 0xA9)
} else {
false
};
Copy link
Member

@overlookmotel overlookmotel Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified.

  1. Always check against full source text - to remove a branch.
  2. Don't check if 2 more bytes. source_text is a &str, which is guaranteed valid UTF-8. So if a byte is 0xE2, it's guaranteed there are 2 more bytes after it. So just assert!(full_offset + 2 < source_text.len()) and then check the bytes.

That assertion can never fail. Later on, once we're completely confident of the implementation, we can remove that assert! and use unsafe indexing to get the next 2 bytes without bounds checks. But for now, safer to keep the assertion.

Note: I'm suggesting an assert! as opposed to relying on the implicit bounds checks in source_bytes[full_offset + 1] and source_bytes[full_offset + 2] because the assertion will do 1 bounds check instead of 2.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for optimize, I edited

Comment on lines 136 to 139
b'\r' => {
// Check for \r\n
if index + 1 < slice.len() && slice[index + 1] == b'\n' { 2 } else { 1 }
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think same problem here as with LS / PS. What if \r is the last byte in the chunk, and \n is in the next chunk? Need to check full source text to avoid that problem.

(but unlike PS / LS, you can't guarantee that \r isn't last byte in the file, so it shouldn't be asserted)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I edited

Comment on lines 263 to 265
// SAFETY: `ptr` is equal to or after `start_ptr`. Both are within bounds of `bytes`.
// `ptr` is derived from `start_ptr`.
let offset = unsafe { ptr.offset_from_unsigned(start_ptr) };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this into the if has_unicode || has_line_breaks { ... } block. If no line breaks or unicode chars in the block, we don't use this value, so avoid calculating it in that case.

Copy link
Author

@mertcanaltin mertcanaltin Oct 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I edited

@mertcanaltin
Copy link
Author

mertcanaltin commented Sep 10, 2025

Thank you for your feedback. I have tried to address the issues you mentioned,

You're absolutely right about the robot overlords needing human guidance 😄 - I appreciate your patience and thorough review.

@overlookmotel
Copy link
Member

Does not compile. Please see CI fails.

@mertcanaltin mertcanaltin force-pushed the mert/feat/add-loc-field-support branch from 3bfda73 to 2962db4 Compare September 13, 2025 11:36
@github-actions github-actions bot added A-linter Area - Linter A-semantic Area - Semantic A-cli Area - CLI A-transformer Area - Transformer / Transpiler A-codegen Area - Code Generation A-ast-tools Area - AST tools A-editor Area - Editor and Language Server A-formatter Area - Formatter labels Sep 13, 2025
@mertcanaltin mertcanaltin force-pushed the mert/feat/add-loc-field-support branch 2 times, most recently from 3bfda73 to 566397d Compare September 13, 2025 11:44
@mertcanaltin mertcanaltin force-pushed the mert/feat/add-loc-field-support branch from c15dd0e to 3f3efaf Compare February 19, 2026 11:54
@github-actions github-actions bot added A-linter Area - Linter A-cli Area - CLI A-ast Area - AST A-ast-tools Area - AST tools labels Feb 19, 2026
@mertcanaltin
Copy link
Author

Hey @overlookmotel I sended new commits for comments, Whenever you have time, could you take a look and resolve the open conversations?, really appreciate your patience and help throughout this PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-ast Area - AST A-ast-tools Area - AST tools A-cli Area - CLI A-linter Area - Linter A-linter-plugins Area - Linter JS plugins A-parser Area - Parser C-enhancement Category - New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants