Description
The problem was found in the wild on this website (it is in Dutch): https://lokaleregelgeving.overheid.nl/ZoekResultaat
<dependency>
<groupId>org.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>4.7.0</version>
</dependency>
When parsing an HTML chunk that contains unknown HTML tags that match the tag name "meta" (for example the namespaced <overheidrg:meta> tag in the unit test), the line final List<HtmlMeta> tags = getDocumentElement().getStaticElementsByTagName("meta") returns a List that contains both HtmlUnknownElement and HtmlMeta instances. The result is blindly treated as a List<HtmlMeta>, and the error only surfaces as a ClassCastException when the list is iterated a few lines later.
Arguably <E extends HtmlElement> List<E> getStaticElementsByTagName(String) is the unsafe link in the chain, but modifying that was too big of a change for me.
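To illustrate why the failure only shows up during iteration, here is a minimal, self-contained sketch. It does not use HtmlUnit: `Element`, `Meta`, `Unknown` and `byTagName` are hypothetical stand-ins for `HtmlElement`, `HtmlMeta`, `HtmlUnknownElement` and a generic lookup in the style of `getStaticElementsByTagName`. Because generics are erased at runtime, the assignment to `List<Meta>` succeeds and the cast is only checked when an element is read.

```java
import java.util.List;

public class LateCastDemo {
    // Hypothetical stand-ins for HtmlElement, HtmlMeta and HtmlUnknownElement.
    static class Element { }
    static class Meta extends Element { }
    static class Unknown extends Element { }

    // Mimics a generic lookup where the caller picks E:
    // nothing verifies that the elements really are of type E.
    @SuppressWarnings("unchecked")
    static <E extends Element> List<E> byTagName(final List<Element> all) {
        final List<?> raw = all;
        return (List<E>) raw; // unchecked cast, never fails at this point
    }

    // Returns true if iterating the list as List<Meta> throws.
    static boolean failsOnIteration(final List<Element> dom) {
        final List<Meta> tags = byTagName(dom); // no exception here
        try {
            for (final Meta meta : tags) {
                meta.hashCode(); // the compiler-inserted cast to Meta runs per element
            }
            return false;
        } catch (final ClassCastException e) {
            return true;
        }
    }

    public static void main(final String[] args) {
        final List<Element> mixed = List.of(new Meta(), new Unknown());
        System.out.println(failsOnIteration(mixed)); // prints true
    }
}
```

A list that contains only `Meta` instances iterates without error, which is why the problem stays hidden until a page contains a look-alike tag.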
The relevant portion of the stack trace:
at org.htmlunit.html.HtmlPage.getMetaTags(HtmlPage.java:1936)
at org.htmlunit.html.HtmlPage.getRefreshStringOrNull(HtmlPage.java:1448)
at org.htmlunit.html.HtmlPage.executeRefreshIfNeeded(HtmlPage.java:1356)
at org.htmlunit.html.HtmlPage.initialize(HtmlPage.java:332)
at build.struck.testBug.testHTMLUnit(testBug.java:790)
The fix is simple: filter out the elements that would fail the cast in org.htmlunit.html.HtmlPage.getMetaTags.
Proposed solution
I coded it using a stream for a quick and dirty approach. But it would be better to ask for a List and then filter before casting and thus avoid the
protected List<HtmlMeta> getMetaTags(final String httpEquiv) {
    if (getDocumentElement() == null) {
        return Collections.emptyList(); // weird case, for instance if document.documentElement has been removed
    }
    final List<HtmlMeta> tags = getDocumentElement().getStaticElementsByTagName("meta")
            .stream().filter((i) -> i instanceof HtmlMeta).map(HtmlMeta.class::cast).collect(Collectors.toList());
    final List<HtmlMeta> foundTags = new ArrayList<>();
    for (final HtmlMeta htmlMeta : tags) {
        if (httpEquiv.equalsIgnoreCase(htmlMeta.getHttpEquivAttribute())) {
            foundTags.add(htmlMeta);
        }
    }
    return foundTags;
}
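The "filter before casting" alternative mentioned above could be sketched without streams as a small generic helper. This is only an illustration using plain `java.util` types: `TypeFilter` and `filterByType` are hypothetical names, not part of HtmlUnit. In `getMetaTags` the idea would be to receive the lookup result as a list of the base element type and pass `HtmlMeta.class` as the wanted type, an assumption that has not been tested against HtmlUnit itself.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public final class TypeFilter {
    private TypeFilter() { }

    // Returns, in encounter order, only the items that are instances of the given type.
    static <T> List<T> filterByType(final Collection<?> items, final Class<T> type) {
        final List<T> result = new ArrayList<>();
        for (final Object item : items) {
            if (type.isInstance(item)) {
                result.add(type.cast(item)); // checked cast, cannot fail after isInstance
            }
        }
        return result;
    }

    public static void main(final String[] args) {
        final List<Object> mixed = List.of("meta", 42, "other", 3.14);
        final List<String> strings = filterByType(mixed, String.class);
        System.out.println(strings); // prints [meta, other]
    }
}
```

Unlike the unchecked generic cast, every element is verified with `Class.isInstance` before it enters the typed list, so later iteration cannot throw a ClassCastException.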
Unit test
This is TestNG, not JUnit, but it gets the point across.
import java.io.IOException;
import java.net.URL;

import org.htmlunit.DefaultPageCreator;
import org.htmlunit.PageCreator;
import org.htmlunit.StringWebResponse;
import org.htmlunit.WebClient;
import org.htmlunit.WebWindow;
import org.htmlunit.html.HtmlPage;
import org.testng.annotations.Test;

public class testBug {
    @Test(groups = "manual")
    public void testHTMLUnit() throws IOException {
        // An unknown, namespaced tag whose name matches "meta"
        String poison = "<overheidrg:meta xmlns:overheidrg=\"http://standaarden.overheid.nl/cvdr/terms/\">";
        String badHtml = String.format("<html><body>%s</body></html>", poison);
        URL originalUrl = new URL("https://lokaleregelgeving.overheid.nl/ZoekResultaat");
        try (WebClient client = new WebClient()) {
            StringWebResponse webResponse = new StringWebResponse(badHtml, originalUrl);
            // Use the client's current WebWindow to hold the HtmlPage
            WebWindow webWindow = client.getCurrentWindow();
            PageCreator pc = new DefaultPageCreator();
            HtmlPage page = (HtmlPage) pc.createPage(webResponse, webWindow);
            // Throws java.lang.ClassCastException: class org.htmlunit.html.HtmlUnknownElement cannot be cast to class org.htmlunit.html.HtmlMeta (org.htmlunit.html.HtmlUnknownElement and org.htmlunit.html.HtmlMeta are in unnamed module of loader 'app')
            page.initialize();
        }
    }
}
Workaround
- Subclass HtmlPage as HackHtmlPage and override protected List<HtmlMeta> getMetaTags(final String httpEquiv) as indicated above
- Subclass DefaultPageCreator as HackDefaultPageCreator and override protected HtmlPage createHtmlPage(final WebResponse webResponse, final WebWindow webWindow) throws IOException to return a HackHtmlPage
- Call webClient.setPageCreator(new HackDefaultPageCreator())