Closed
Description
HTML
<html>
<body>
<div class="a">
<span class="b">Item</span>
<span class="b">Item</span>
<span class="b">Item</span>
<span class="b">Item</span>
</div>
</body>
</html>
Annotation Rules
{
"div#class=a": ["outer"],
"span#class=b": ["inner"]
}
Output XmlExtractor
<inner><outer> Item </inner><inner>Item </inner><inner>Item </inner><inner>Item</outer></inner>
Expected Output
<outer><inner> Item </inner><inner>Item </inner><inner>Item </inner><inner>Item</inner></outer>
Potential fix for XmlExtractor
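To fix this issue, the sort could take into account the length of each span (the distance between its start and end index), so that when several tags share an index the outer tag opens first and closes last: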
# [..]
tag_indices = defaultdict(list)

for start, end, label in sorted(annotated_text["label"]):
    length = end - start
    tag_indices[start].append((label, length))
    tag_indices[end].append(("/" + label, length))

current_idx = 0
tagged_content = ['<?xml version="1.0" encoding="UTF-8" ?>\n']
text = annotated_text["text"]
for index, tags in sorted(tag_indices.items()):
    tagged_content.append(text[current_idx:index])

    # Separate closing vs opening tags
    closing_tags = [t for t in tags if t[0].startswith("/")]
    opening_tags = [t for t in tags if not t[0].startswith("/")]

    # Sort closing tags by ascending length (so outer closes last) then by name
    closing_tags.sort(key=lambda x: (x[1], x[0]))
    for tag, _ in closing_tags:
        tagged_content.append(f"<{tag}>")

    # Sort opening tags by descending length (so outer opens first) then by name
    opening_tags.sort(key=lambda x: (x[1], x[0]), reverse=True)
    for tag, _ in opening_tags:
        tagged_content.append(f"<{tag}>")

    current_idx = index
tagged_content.append(text[current_idx:])
return "".join(tagged_content)
HtmlExtractor
The HtmlExtractor suffers from the same issue and I think a similar fix could be applied there.
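
To make the proposed ordering easy to check outside the library, here is a minimal, self-contained sketch of the same length-aware sorting. The function name `tag_text`, the plain `(start, end, label)` tuples, and the sample offsets are assumptions made for illustration; they are not the library's actual API, and the XML declaration is left out to keep the output short.

```python
from collections import defaultdict


def tag_text(text, labels):
    """Wrap spans of `text` in tags.

    `labels` is a list of (start, end, label) tuples (hypothetical input
    shape, chosen to mirror annotated_text["label"] from the snippet above).
    """
    tag_indices = defaultdict(list)
    for start, end, label in sorted(labels):
        length = end - start
        tag_indices[start].append((label, length))
        tag_indices[end].append(("/" + label, length))

    parts = []
    current_idx = 0
    for index, tags in sorted(tag_indices.items()):
        parts.append(text[current_idx:index])
        # Shorter spans close first, so the outer tag closes last.
        closing = sorted((t for t in tags if t[0].startswith("/")),
                         key=lambda t: (t[1], t[0]))
        # Longer spans open first, so the outer tag opens first.
        opening = sorted((t for t in tags if not t[0].startswith("/")),
                         key=lambda t: (t[1], t[0]), reverse=True)
        parts.extend(f"<{tag}>" for tag, _ in closing + opening)
        current_idx = index
    parts.append(text[current_idx:])
    return "".join(parts)


# One outer span covering four inner spans, as in the HTML example above.
text = "Item Item Item Item"
labels = [(0, 19, "outer"), (0, 4, "inner"), (5, 9, "inner"),
          (10, 14, "inner"), (15, 19, "inner")]
print(tag_text(text, labels))
# <outer><inner>Item</inner> <inner>Item</inner> <inner>Item</inner> <inner>Item</inner></outer>
```

Note that this sketch does not handle zero-length spans (where start equals end): their closing tag would be emitted before their opening tag, so they would need separate treatment.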