HTMLDiffer.diff hangs *JVM* on dirty input
Last week we encountered an HTML page with a lot of stray \ and ' characters scattered through it. The previous version of the same page didn't have that garbage. See the attached files: first.html is the initial version of the page and second.html is the updated version. When calculating the differences between the two with the following sample code, HTMLDiffer hangs:
    // Read both versions of the page into memory.
    String html1;
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("first.html"))) {
        StringBuilder bodyBuilder = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null)
            bodyBuilder.append(line).append('\n');
        html1 = bodyBuilder.toString();
    }
    String html2;
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("second.html"))) {
        StringBuilder bodyBuilder = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null)
            bodyBuilder.append(line).append('\n');
        html2 = bodyBuilder.toString();
    }

    InputStream oldStream = new ByteArrayInputStream(html1.getBytes());
    InputStream newStream = new ByteArrayInputStream(html2.getBytes());
    Locale locale = LocaleUtils.toLocale("hr");

    // Clean and parse both documents into DOM trees.
    HtmlCleaner cleaner = new HtmlCleaner();
    InputSource oldSource = new InputSource(oldStream);
    InputSource newSource = new InputSource(newStream);

    DomTreeBuilder oldHandler = new DomTreeBuilder();
    cleaner.cleanAndParse(oldSource, oldHandler);
    TextNodeComparator leftComparator = new TextNodeComparator(oldHandler, locale);

    DomTreeBuilder newHandler = new DomTreeBuilder();
    cleaner.cleanAndParse(newSource, newHandler);
    TextNodeComparator rightComparator = new TextNodeComparator(newHandler, locale);

    // The diff output is discarded; only the diff computation matters here.
    DiffOutput nullDiffOutput = new DiffOutput() {
        @Override
        public void generateOutput(TagNode node) throws SAXException { /* ignore */ }
    };

    HTMLDiffer differ = new HTMLDiffer(nullDiffOutput);
    System.err.println("Going to hang now :)");
    differ.diff(leftComparator, rightComparator); // HANG!
I understand that this is very unusual HTML, and it would be perfectly acceptable for the diff calculation to take a very long time. What is weird is that upon entering HTMLDiffer.diff the whole app, which runs inside a Tomcat container, stops accepting inbound connections. Other apps in the same Tomcat instance (the Manager app, for instance) stop accepting connections too, which suggests that something is happening at the JVM level.
I've tried to figure out what global resource HTMLDiffer might consume that would cause the whole JVM to go down, but couldn't find any. It is reproducible, though, and happens every time with the given HTML input.
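To help narrow this down, here is a minimal sketch of a watchdog that runs a task on a separate daemon thread and, if it does not finish within a timeout, prints every thread's stack together with the current heap usage. It uses only the standard JDK; the class name DiffWatchdog and its methods are hypothetical and not part of daisydiff. The heap figures are included on the assumption (not verified here) that memory pressure could be one of the things stalling the JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical diagnostic helper, not part of daisydiff.
public class DiffWatchdog {

    /**
     * Runs the task on a daemon worker thread. Returns true if it finished
     * within the timeout; otherwise prints a thread dump to stderr and
     * returns false.
     */
    public static boolean runWithWatchdog(Runnable task, long timeout, TimeUnit unit)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "diff-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            System.err.println(dumpAllThreads());
            // Interrupting only helps if the task checks for interruption;
            // a CPU-bound loop will keep running on the daemon thread.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    /** Returns heap usage plus the name, state and stack of every live thread. */
    public static String dumpAllThreads() {
        Runtime rt = Runtime.getRuntime();
        long usedBytes = rt.totalMemory() - rt.freeMemory();
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("heap used: %d MiB of max %d MiB%n",
                usedBytes >> 20, rt.maxMemory() >> 20));

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

In the sample above, the differ.diff(leftComparator, rightComparator) call could be wrapped in a Runnable (catching the checked SAXException inside it) and passed to runWithWatchdog with, say, a 60-second timeout. The resulting dump should show whether the worker is still busy inside the diff code and what the Tomcat connector threads are doing at that moment.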
Environment: Ubuntu 16.04.1 LTS, Apache Tomcat 8.5.6, Oracle Java 1.8.0u112, daisydiff 1.2-NX5-SNAPSHOT
I would gladly provide any additional info or a test case if needed.