Skip to content

Comments

fix(span): track end tag positions correctly in naive state switching mode#124

Merged
untitaker merged 1 commit intountitaker:mainfrom
shulaoda:11-22-fix-naive-state-switching-span
Nov 27, 2025
Merged

fix(span): track end tag positions correctly in naive state switching mode#124
untitaker merged 1 commit intountitaker:mainfrom
shulaoda:11-22-fix-naive-state-switching-span

Conversation

@shulaoda
Copy link
Contributor

@shulaoda shulaoda commented Nov 22, 2025

Description

When using html5gum's naively_switch_states(true) feature to properly parse HTML content within <script>, <style>, and similar raw text tags, the tokenizer returns incorrect span (position) information for the text content inside these tags.

Specifically, the span.start value is incorrectly set to 0 instead of the actual position where the text content begins in the HTML string.

The Span Bug Example

use html5gum::{DefaultEmitter, Token, Tokenizer};

let html = r#"<script>content</script>"#;
//            ^       ^
//            0       8 (expected span.start)

let mut emitter = DefaultEmitter::<usize>::new_with_span();
emitter.naively_switch_states(true);
let tokenizer = Tokenizer::new_with_emitter(html, emitter);

for token in tokenizer {
    if let Ok(Token::String(s)) = token {
        println!("Content: {:?}", String::from_utf8_lossy(&s.0));
        println!("Span: {:?}", s.span);
        // Expected: Span { start: 8, end: 15 }
        // Actual:   Span { start: 0, end: 15 }  ❌ BUG!
    }
}

@untitaker
Copy link
Owner

@Akida31 can you take a look?

Copy link
Owner

@untitaker untitaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going to wait a bit for a second review but that makes sense to me. the pr just calls start_open_tag consistently with data state

@untitaker
Copy link
Owner

untitaker commented Nov 22, 2025

not entirely related to this PR, but it seems that in data state we also call init_string() before matching on state. not sure what that accomplishes, and if we need it in the other states...

@shulaoda
Copy link
Contributor Author

not entirely related to this PR, but it seems that in data state we also call init_string() before matching on state. not sure what that accomplishes, and if we need it in the other states...

It will only call init_string() in these situations:

#[must_use]
pub fn naive_next_state(tag_name: &[u8]) -> Option<State> {
match tag_name {
b"textarea" | b"title" => Some(State::RcData),
b"plaintext" => Some(State::PlainText),
b"script" => Some(State::ScriptData),
b"style" | b"iframe" | b"xmp" | b"noembed" | b"noframe" | b"noscript" => {
Some(State::RawText)
}
_ => None,
}
}

@untitaker
Copy link
Owner

I'm talking about the explicit init-string call in machine.rs in the data state. it seems to me this should be copied to all other text/data states, but I don't know its purpose. haven't investigated it deeply yet, maybe @Akida31 remembers since he wrote the code

shulaoda added a commit to rolldown/rolldown that referenced this pull request Nov 23, 2025
@untitaker untitaker merged commit 505de5b into untitaker:main Nov 27, 2025
8 checks passed
@untitaker
Copy link
Owner

ty @shulaoda!

@shulaoda shulaoda deleted the 11-22-fix-naive-state-switching-span branch November 28, 2025 02:29
shulaoda added a commit to rolldown/rolldown that referenced this pull request Nov 30, 2025
@Akida31
Copy link
Contributor

Akida31 commented Dec 2, 2025

I think I just overlooked these cases (and don't use the naive state switching mode), so this PR looks good to me, thanks!
(I had a hard time understanding the control flow in machine.rs, so there might be more missing cases)

@untitaker
Copy link
Owner

That's alright, I have now found more cases using fuzzing:

the fuzzer has now run since 1 day and has not found anything new.

anybody is encouraged to run the fuzzer themselves (also with the other envvars enabled) to find other cases. let me know if you need help doing so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants