Wrong charset when calling page.asXml() or page.asNormalizedText() in HtmlUnit 2.58.0 #452

@qurikuduo

Hi there. When I fetch the HTML source and the plain text of a URL, the content comes back garbled.
Something does not seem to work as expected. When gzip is disabled in the Accept-Encoding header, everything works fine.
When gzip is enabled, com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse() is called and the charset of the page is set to ISO_8859_1 in HtmlUnitNekoHtmlParser.java at line 181:
charset = StandardCharsets.ISO_8859_1;
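
To illustrate the kind of garbling I mean, here is a minimal sketch that uses only the JDK, not HtmlUnit. It does not reproduce the gzip path; it only shows what happens when UTF-8 bytes are decoded as ISO-8859-1. The class name `CharsetMismatchDemo`, the helper `decode`, and the sample text are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Decode raw bytes with the given charset, as a parser does once it has picked an encoding.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false: every byte became its own character
        System.out.println(wrong.length());         // 6, one char per UTF-8 byte

        // ISO-8859-1 maps every byte to exactly one char, so the bytes can be recovered.
        String repaired = new String(wrong.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original)); // true
    }
}
```

If the page really is UTF-8 but gets parsed as ISO_8859_1, I would expect output similar to the `wrong` string above.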

Here is the code:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.net.MalformedURLException;

public class HTest {
    public static void main(String[] args) {
        final BrowserVersion browser =
                new BrowserVersion.BrowserVersionBuilder(BrowserVersion.FIREFOX)
                        // Uncomment the next line, or change "deflate" to "br", to get the correct UTF-8 charset,
                        // which the HTML source declares at line 36: <meta charset="utf-8" />
                        //.setAcceptEncodingHeader("deflate")
                        .build();
        try (final WebClient webClient = new WebClient(browser)) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            try {
                final String url = "http://www.jj.gov.cn/art/2021/8/19/art_1229539289_59057787.html";

                final HtmlPage page = webClient.getPage(url);
                final String pageAsXml = page.asXml();
                System.out.println(pageAsXml);

                final String pageAsText = page.asNormalizedText();
                System.out.println(pageAsText);
            } catch (MalformedURLException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
