Closed
Description
Hi there. When I try to fetch the HTML and the plain text of a URL, the content comes back garbled.
I think something is not working as expected. When gzip is left out of the Accept-Encoding header, everything works fine.
When gzip is enabled, com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse() is called and the charset of the page is set to ISO_8859_1 at line 181 of HtmlUnitNekoHtmlParser.java:
charset = StandardCharsets.ISO_8859_1;
Here is the code to reproduce it:
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.net.MalformedURLException;

public class HTest {
    public static void main(String[] args) {
        final BrowserVersion browser =
                new BrowserVersion.BrowserVersionBuilder(BrowserVersion.FIREFOX)
                        // Uncomment the next line (or use "br" instead of "deflate") to get the
                        // correct UTF-8 charset, which the page declares at line 36 of its
                        // HTML source: <meta charset="utf-8" />
                        //.setAcceptEncodingHeader("deflate")
                        .build();
        try (final WebClient webClient = new WebClient(browser)) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            try {
                String url = "http://www.jj.gov.cn/art/2021/8/19/art_1229539289_59057787.html";
                final HtmlPage page = webClient.getPage(url);
                // Print the page as XML, then as normalized plain text.
                final String pageAsXml = page.asXml();
                System.out.println(pageAsXml);
                final String pageAsText = page.asNormalizedText();
                System.out.println(pageAsText);
            } catch (MalformedURLException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
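The garbled output is consistent with UTF-8 bytes being decoded as ISO-8859-1. The following is a minimal sketch of that effect using only the JDK; it does not call HtmlUnit, and the class name CharsetMismatchDemo and the helper decodeAs are hypothetical names introduced for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Hypothetical helper: decode raw bytes with the given charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes to avoid source-encoding issues.
        String original = "\u4e2d\u6587";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong charset turns every byte into its own character.
        String garbled = decodeAs(utf8Bytes, StandardCharsets.ISO_8859_1);
        // Decoding with the charset the bytes were written in restores the text.
        String correct = decodeAs(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println("garbled equals original: " + garbled.equals(original));
        System.out.println("correct equals original: " + correct.equals(original));

        // ISO-8859-1 maps each byte to exactly one character, so the damage is reversible.
        String recovered = decodeAs(garbled.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println("recovered equals original: " + recovered.equals(original));
    }
}
```

If the text printed by the reproduction above can be repaired the same way (re-encode as ISO-8859-1, decode as UTF-8), that would suggest the response bytes are fine and only the charset chosen for parsing is wrong.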