feat(ast): add loc field support for AST nodes#13285
mertcanaltin wants to merge 37 commits into oxc-project:main
Conversation
Pull Request Overview
This PR adds line and column information support to AST serialization by implementing the loc field for ESTree compatibility. The implementation extends the UTF8ToUtf16 converter with line tracking capabilities and integrates this functionality throughout the serialization pipeline.
- Adds line tracking to the UTF8ToUtf16 converter with support for all line break types
- Implements SourceLocation and Position structs for ESTree-compatible loc fields
- Extends serialization methods to support optional loc information
- Integrates loc functionality with NAPI parser bindings
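The pieces above can be sketched roughly as follows. This is an illustrative sketch only — the struct and function names (`Position`, `SourceLocation`, `offset_to_position`) are assumptions for illustration, not the PR's actual definitions. Offset-to-line-column conversion is typically a binary search over recorded line-start offsets:

```rust
// Illustrative sketch, not the PR's actual API.

/// ESTree convention: `line` is 1-based, `column` is 0-based.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Position {
    line: u32,
    column: u32,
}

#[derive(Debug, PartialEq, Clone, Copy)]
struct SourceLocation {
    start: Position,
    end: Position,
}

/// Convert an offset to a `Position` via binary search over the offsets
/// at which each line starts (line 1 starts at offset 0).
fn offset_to_position(line_starts: &[u32], offset: u32) -> Position {
    let line_index = match line_starts.binary_search(&offset) {
        Ok(i) => i,      // offset is exactly at a line start
        Err(i) => i - 1, // offset falls inside the previous line
    };
    Position {
        line: line_index as u32 + 1,
        column: offset - line_starts[line_index],
    }
}

fn main() {
    // "ab\ncd": line 2 starts at offset 3.
    let line_starts = [0, 3];
    assert_eq!(offset_to_position(&line_starts, 0), Position { line: 1, column: 0 });
    assert_eq!(offset_to_position(&line_starts, 4), Position { line: 2, column: 1 });
}
```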
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| napi/parser/src/types.rs | Adds loc field to ParserOptions struct for controlling line/column output |
| napi/parser/src/lib.rs | Integrates loc support into parser flow and AST serialization |
| crates/oxc_napi/src/lib.rs | Extends UTF8ToUtf16 conversion to support line tracking |
| crates/oxc_estree/src/serialize/structs.rs | Adds SourceLocation/Position structs and loc serialization methods |
| crates/oxc_estree/src/serialize/mod.rs | Extends ESTreeSerializer with loc support |
| crates/oxc_estree/src/serialize/config.rs | Updates Config structs to track loc option |
| crates/oxc_ast_visit/tests/test_loc_integration.rs | Adds integration tests for loc functionality |
| crates/oxc_ast_visit/src/utf8_to_utf16/translation.rs | Implements line break detection and tracking logic |
| crates/oxc_ast_visit/src/utf8_to_utf16/mod.rs | Adds line/column conversion methods to Utf8ToUtf16 |
| crates/oxc_ast_visit/src/utf8_to_utf16/converter.rs | Adds utility method for offset-to-line-column conversion |
| crates/oxc_ast/src/serialize/mod.rs | Extends Program serialization methods with loc support |
CodSpeed Performance Report: merging #13285 will degrade performance by 43.59%.
Obviously this is incomplete at present - nothing actually calls `serialize_span_with_loc` at the moment, so the JSON won't include locs (as per the TODO comments).
Please let me know if you need any help getting the translation table into the serializer.
I've made a couple of comments in the meantime - apologies if they're premature, and you just haven't got to this yet.
One other broader point: I don't think we need separate `*_with_loc` methods. `serialize_span_with_loc` shouldn't be needed. Ditto `new_with_loc`, and the extra `to_estree_ts_json_with_loc` etc. methods. The existing methods can just have an additional `loc: bool` param added.
We're not at 1.0 yet, and regularly make breaking changes. Please don't worry about changing existing APIs.
I've marked the PR as draft for now.
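The flag-based shape the reviewer suggests could look like this toy sketch. All names here (`Span`, `serialize_span`) are hypothetical illustrations of the API shape, not the crate's real types or methods:

```rust
// Toy illustration of the suggested API shape: one method taking a
// `loc: bool` flag, instead of duplicated `*_with_loc` variants.
struct Span {
    start: u32,
    end: u32,
}

fn serialize_span(span: &Span, loc: bool) -> String {
    if loc {
        // Real code would emit line/column from the translation table;
        // here we just echo the offsets to keep the sketch self-contained.
        format!(
            r#"{{"start":{},"end":{},"loc":{{"start":{},"end":{}}}}}"#,
            span.start, span.end, span.start, span.end
        )
    } else {
        format!(r#"{{"start":{},"end":{}}}"#, span.start, span.end)
    }
}

fn main() {
    let span = Span { start: 0, end: 5 };
    assert_eq!(serialize_span(&span, false), r#"{"start":0,"end":5}"#);
    assert!(serialize_span(&span, true).contains(r#""loc""#));
}
```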
```rust
let mut index = 0;
while index < slice.len() {
    let byte = slice[index];

    // Handle line breaks
    if track_lines {
        let mut is_line_break = false;
        let mut line_break_len = 1;

        match byte {
            b'\n' => is_line_break = true,
            b'\r' => {
                is_line_break = true;
                // Check for \r\n
                if index + 1 < slice.len() && slice[index + 1] == b'\n' {
                    line_break_len = 2;
                }
            }
            // Unicode line separators LS (\u2028) and PS (\u2029)
            // LS: E2 80 A8, PS: E2 80 A9
            0xE2 if index + 2 < slice.len()
                && slice[index + 1] == 0x80
                && (slice[index + 2] == 0xA8 || slice[index + 2] == 0xA9) =>
            {
                is_line_break = true;
                line_break_len = 3;
            }
            _ => {}
        }

        if is_line_break {
            let line_end_offset = start_offset + index + line_break_len;
            if line_end_offset < source_text.len() {
                #[expect(clippy::cast_possible_truncation)]
                let utf8_offset = line_end_offset as u32;
                lines
                    .as_mut()
                    .unwrap()
                    .push(LineTranslation { utf8_offset, utf16_difference });
            }
            index += line_break_len;
            continue;
        }
    }
```
I can see a few problems here.

1. `process_slice` is only called in the main search loop (`while ptr < body_end_ptr { ... }`) if the chunk contains non-ASCII characters. So line breaks won't be recorded unless they're at the very start or end of the source text, or they happen to have a non-ASCII char close to them.

   The `chunk.contains_unicode()` check in that loop needs to be replaced with a check for "chunk contains unicode OR `\r` / `\n`".

   Or maybe better: check the chunk for non-ASCII bytes and call `process_slice` if any are found (as now). Then do a 2nd check for any `\n` / `\r` bytes in the chunk, and call a different function if any are found. That 2nd function can be simpler and faster, as it doesn't have to deal with PS / LS (because they'd be caught by the non-ASCII check).

2. `process_slice` is passed a small subset of the source text (in the main loop, 32 bytes). If `0xE2` is the 31st byte, then PS / LS won't be identified, because the last 2 bytes are after the end of the slice.

3. Irregular line breaks are non-ASCII, so they need a `Translation` record as well as a `LineTranslation`. The simplest way to do that is to skip the `continue` statement if `line_break_len == 3`.

Also, a small optimization: no need for the 2 vars `is_line_break` and `line_break_len`. Just `line_break_len` would do, with 0 representing "not a line break".
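The suggested single-variable form might look like this sketch, folded into a standalone helper for illustration (the function name is assumed; the full-source-text boundary checks raised elsewhere in the review are omitted here for brevity):

```rust
/// Sketch of the suggested optimization: return the byte length of the
/// line break starting at `index`, with 0 meaning "not a line break".
/// This folds `is_line_break` and `line_break_len` into one value.
fn line_break_len(slice: &[u8], index: usize) -> usize {
    match slice[index] {
        b'\n' => 1,
        b'\r' => {
            if slice.get(index + 1) == Some(&b'\n') { 2 } else { 1 }
        }
        // LS (U+2028) = E2 80 A8, PS (U+2029) = E2 80 A9
        0xE2 if slice.get(index + 1) == Some(&0x80)
            && matches!(slice.get(index + 2), Some(&0xA8) | Some(&0xA9)) => 3,
        _ => 0,
    }
}

fn main() {
    assert_eq!(line_break_len(b"a\nb", 1), 1);
    assert_eq!(line_break_len(b"a\r\nb", 1), 2);
    assert_eq!(line_break_len("a\u{2028}b".as_bytes(), 1), 3);
    assert_eq!(line_break_len(b"abc", 1), 0);
}
```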
I'm afraid I had to make a few changes to the UTF-8 -> UTF-16 converter, so it could also be used to convert in the opposite direction, UTF-16 -> UTF-8. That'll be the cause of the merge conflicts. The changes are fairly small, so hopefully not too difficult to rebase on top of them.
Force-pushed from 9cf5a92 to d8f7d8a
@overlookmotel You were spot on - fixed everything.
overlookmotel left a comment
I've pushed a couple of fixes just to get more of it to compile.
But still some compilation failures, and some tests are failing.
All of these should pass:

```shell
cargo test --all-features
just test-estree
cd napi/parser; pnpm run build-test; RUN_RAW_TESTS=true pnpm test; cd ../..
```

I've made a few comments. I don't know if they explain the test failures or not.
Just to say (and I hope you don't consider this rude): Obviously we all rely on AI these days, but in my opinion our would-be robot overlords are not quite up to the task yet. So they need a bit of human hand-holding and checking of their work.
Could you please make sure the code compiles and tests pass before next round of review?
Of course, if you run into problems or want to discuss anything, very happy to help.
```rust
if bytes_from_start_of_range <= self.range_len_utf8 {
    // Offset is within current range.
    // `wrapping_add` because `range_start_utf16` can be `u32::MAX`.
    *offset = self.range_start_utf16.wrapping_add(bytes_from_start_of_range);
```
```rust
let full_offset = start_offset + index;
let has_ls_ps = if index + 2 < slice.len() {
    // Can check within this slice
    slice[index + 1] == 0x80
        && (slice[index + 2] == 0xA8 || slice[index + 2] == 0xA9)
} else if full_offset + 2 < source_text.len() {
    // Need to check in full source text (boundary case)
    let source_bytes = source_text.as_bytes();
    source_bytes[full_offset + 1] == 0x80
        && (source_bytes[full_offset + 2] == 0xA8
            || source_bytes[full_offset + 2] == 0xA9)
} else {
    false
};
```
This can be simplified:

- Always check against the full source text, to remove a branch.
- Don't check if there are 2 more bytes. `source_text` is a `&str`, which is guaranteed valid UTF-8. So if a byte is `0xE2`, it's guaranteed there are 2 more bytes after it. So just `assert!(full_offset + 2 < source_text.len())` and then check the bytes.

That assertion can never fail. Later on, once we're completely confident of the implementation, we can remove that `assert!` and use unsafe indexing to get the next 2 bytes without bounds checks. But for now, it's safer to keep the assertion.

Note: I'm suggesting an `assert!` as opposed to relying on the implicit bounds checks in `source_bytes[full_offset + 1]` and `source_bytes[full_offset + 2]` because the assertion will do 1 bounds check instead of 2.
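As a sketch, the simplified check could be written like this (the `has_ls_ps` name and standalone signature are assumptions for illustration, not the PR's code):

```rust
/// Check for LS (U+2028) / PS (U+2029) at `full_offset` in the full
/// source text. Caller guarantees the byte at `full_offset` is `0xE2`.
fn has_ls_ps(source_text: &str, full_offset: usize) -> bool {
    let bytes = source_text.as_bytes();
    debug_assert_eq!(bytes[full_offset], 0xE2);
    // Valid UTF-8 guarantees 2 continuation bytes after a 0xE2 lead byte,
    // so this assertion can never fail; it replaces 2 implicit bounds
    // checks with 1 explicit one.
    assert!(full_offset + 2 < source_text.len());
    bytes[full_offset + 1] == 0x80
        && (bytes[full_offset + 2] == 0xA8 || bytes[full_offset + 2] == 0xA9)
}

fn main() {
    assert!(has_ls_ps("a\u{2028}b", 1)); // LS
    assert!(has_ls_ps("a\u{2029}b", 1)); // PS
    assert!(!has_ls_ps("a\u{2030}b", 1)); // U+2030 (E2 80 B0) is not a line break
}
```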
Thanks for the optimization, I've edited it.
```rust
b'\r' => {
    // Check for \r\n
    if index + 1 < slice.len() && slice[index + 1] == b'\n' { 2 } else { 1 }
}
```
I think the same problem exists here as with LS / PS. What if `\r` is the last byte in the chunk, and `\n` is in the next chunk? Need to check the full source text to avoid that problem.
(But unlike PS / LS, you can't guarantee that `\r` isn't the last byte in the file, so it shouldn't be asserted.)
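A minimal sketch of the fix, checking the full source text rather than the chunk (the function name and standalone signature are assumed for illustration):

```rust
/// Byte length of the line break starting at `full_offset`, where the
/// byte there is known to be `\r`. Checking the full source text means a
/// `\r\n` split across two chunks is still detected, and `get` safely
/// handles `\r` being the final byte of the file (so no assertion here).
fn cr_line_break_len(source_text: &str, full_offset: usize) -> usize {
    debug_assert_eq!(source_text.as_bytes()[full_offset], b'\r');
    if source_text.as_bytes().get(full_offset + 1) == Some(&b'\n') { 2 } else { 1 }
}

fn main() {
    assert_eq!(cr_line_break_len("a\r\nb", 1), 2);
    assert_eq!(cr_line_break_len("a\r", 1), 1); // `\r` at end of file
}
```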
```rust
// SAFETY: `ptr` is equal to or after `start_ptr`. Both are within bounds of `bytes`.
// `ptr` is derived from `start_ptr`.
let offset = unsafe { ptr.offset_from_unsigned(start_ptr) };
```
Move this into the `if has_unicode || has_line_breaks { ... }` block. If there are no line breaks or unicode chars in the block, we don't use this value, so avoid calculating it in that case.
Thank you for your feedback. I have tried to address the issues you mentioned. You're absolutely right about the robot overlords needing human guidance 😄. I appreciate your patience and thorough review.
Does not compile. Please see CI failures.
Force-pushed from 3bfda73 to 2962db4
Force-pushed from 3bfda73 to 566397d
Force-pushed from c15dd0e to 3f3efaf
Hey @overlookmotel, I've sent new commits addressing the comments. Whenever you have time, could you take a look and resolve the open conversations? I really appreciate your patience and help throughout this PR!
Add line and column information to AST serialization:

- extend UTF8ToUtf16 converter with line tracking
- support all line break types (`\n`, `\r`, `\r\n`, LS, PS)
- add loc options to ESTree serializer configs
- implement efficient offset-to-line-column conversion
- add SourceLocation struct for ESTree compatibility
- integrate with NAPI parser bindings
fyi: #10307