Attribute names including ; cannot be converted to W3C DOM; subsequent XPath query may be incorrect #2244

@aianta

Simple reproducer below; the attached HTML text file is the document in question. Using jsoup version 1.18.3.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.jupiter.api.Test;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import static org.junit.jupiter.api.Assertions.assertTrue;

public class SelectXpathTest {

    // Absolute XPath to the checkbox input that Firefox resolves in the same document.
    private static final String xpath = "/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5]/fieldset[1]/div[2]/div/div[1]/label[1]/input[2]";

    @Test
    public void test() throws IOException {
        // Load the attached HTML file and parse it with jsoup.
        String html = Files.readString(Path.of("src/test/resources/debug.html.txt"));
        Document document = Jsoup.parse(html);

        int foundElements = document.selectXpath(xpath).size();
        System.out.println("Found %s elements".formatted(foundElements));

        // JUnit assertion: the plain `assert` keyword is skipped unless the JVM runs with -ea.
        assertTrue(foundElements > 0, "XPath should match at least one element");
    }
}

debug.html.txt

I have evidence that the XPath provided should be resolvable: I can open the attached HTML in Firefox and, using the following JS:

function getElementByXpath(path) {
    return document.evaluate(path, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
}

run the following command in the console:

getElementByXpath("/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5]/fieldset[1]/div[2]/div/div[1]/label[1]/input[2]")

to get:

<input id="assignment_text_entry" type="checkbox" value="1" name="online_submission_types[online_text_entry]" aria-label="Online Submission Type - Text Entry" style="">

Initial debugging shows that the longest prefix of the XPath that jsoup is able to resolve is:

/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]
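
For anyone who wants to repeat this kind of narrowing down, here is a minimal sketch that evaluates each prefix of an absolute XPath in turn and reports the longest one that still matches. It uses only the JDK's `javax.xml.xpath` on a tiny hand-written document rather than jsoup and the attached file, so the class name, helper names, and sample markup are all hypothetical. It also assumes a simple positional path with no `/` inside predicates.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPrefixWalk {

    /** Splits an absolute XPath such as /a/b[2]/c into /a, /a/b[2], /a/b[2]/c. */
    static List<String> prefixes(String absoluteXPath) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String step : absoluteXPath.split("/")) {
            if (step.isEmpty()) {
                continue; // the leading slash produces an empty first token
            }
            current.append('/').append(step);
            result.add(current.toString());
        }
        return result;
    }

    /** Returns the longest prefix that matches at least one node, or null if none does. */
    static String lastResolvablePrefix(Document doc, String absoluteXPath) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String last = null;
        for (String prefix : prefixes(absoluteXPath)) {
            NodeList nodes = (NodeList) xpath.evaluate(prefix, doc, XPathConstants.NODESET);
            if (nodes.getLength() == 0) {
                break; // once a prefix fails, every longer prefix fails too
            }
            last = prefix;
        }
        return last;
    }

    /** Parses a well-formed XML string into a W3C DOM document (hypothetical test helper). */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Sample document with only two <div> children under <body>.
        Document doc = parse("<html><body><div><p/></div><div/></body></html>");
        System.out.println(lastResolvablePrefix(doc, "/html/body/div[3]/p")); // prints /html/body
    }
}
```

With jsoup the same loop could presumably call `document.selectXpath(prefix)` on each prefix instead; that variant is not shown here because it was not tested against the attached document.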

When I retrieve the children for /html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1], I get 7 children, then:

/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[1] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[2] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[3] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[4] -> works
/html/body/div[3]/div[2]/div[2]/div[3]/div[1]/div/div[1]/form/div[1]/div[5] -> Nope...
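
The issue title attributes the failure to attribute names containing `;` that cannot be converted to W3C DOM. As a rough illustration of why such a name is a problem, the sketch below asks the JDK's own W3C DOM implementation to set an attribute with a given name and reports whether it is accepted. The class name, the method name, and the sample name `foo;bar` are hypothetical, since the offending attribute in the attached document is not quoted here. Whether jsoup's conversion hits exactly this check is an assumption based on the title.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class AttributeNameCheck {

    /** Returns true if the JDK's W3C DOM implementation accepts the attribute name. */
    static boolean domAcceptsAttributeName(String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element el = doc.createElement("div");
        try {
            el.setAttribute(name, "value");
            return true;
        } catch (DOMException e) {
            // The DOM raises INVALID_CHARACTER_ERR for names that are not valid XML names.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(domAcceptsAttributeName("data-id")); // an ordinary attribute name
        System.out.println(domAcceptsAttributeName("foo;bar")); // hypothetical name containing ';'
    }
}
```

An HTML parser is more lenient about attribute names than XML is, so a name that parses fine in jsoup or Firefox can still be rejected at this step, which would be consistent with the query failing only at `div[5]`.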

Labels: fixed