XmlExtractor and HtmlExtractor produce wrong tag order #93

@nicoell

HTML

<html>
   <body>
     <div class="a">
        <span class="b">Item</span>
        <span class="b">Item</span>
        <span class="b">Item</span>
        <span class="b">Item</span>
     </div>
   </body>
</html>

Annotation Rules

{
    "div#class=a": ["outer"],
    "span#class=b": ["inner"]
}

Actual Output (XmlExtractor)

<inner><outer>  Item </inner><inner>Item </inner><inner>Item </inner><inner>Item</outer></inner>

Expected Output

<outer><inner>  Item </inner><inner>Item </inner><inner>Item </inner><inner>Item</inner></outer>
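The difference between the two outputs can be checked mechanically: the actual output is not well-formed XML because the tags cross, while the expected output parses. Below is a minimal sketch using only the standard library; `is_well_formed` is a hypothetical helper for this issue and not part of the extractor API.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_fragment: str) -> bool:
    """Return True if the fragment parses as XML with a single root element."""
    try:
        ET.fromstring(xml_fragment)
    except ET.ParseError:
        return False
    return True


# Strings copied from the actual and expected outputs above.
actual = (
    "<inner><outer>  Item </inner><inner>Item </inner>"
    "<inner>Item </inner><inner>Item</outer></inner>"
)
expected = (
    "<outer><inner>  Item </inner><inner>Item </inner>"
    "<inner>Item </inner><inner>Item</inner></outer>"
)

print(is_well_formed(actual))    # crossing tags, so the parser rejects it
print(is_well_formed(expected))  # properly nested
```

A check like this could serve as a regression test once the tag order is fixed.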

Potential fix for XmlExtractor

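To fix this issue, the sorting could take the length of each annotation (end index minus start index) into account, so that at a shared index the longer (outer) annotation opens first and closes last: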

# [..]
        tag_indices = defaultdict(list)

        for start, end, label in sorted(annotated_text["label"]):
            length = end - start
            tag_indices[start].append((label, length))
            tag_indices[end].append(("/" + label, length))

        current_idx = 0
        tagged_content = ['<?xml version="1.0" encoding="UTF-8" ?>\n']
        text = annotated_text["text"]
        for index, tags in sorted(tag_indices.items()):
            tagged_content.append(text[current_idx:index])

            # Separate closing vs opening tags
            closing_tags = [t for t in tags if t[0].startswith("/")]
            opening_tags = [t for t in tags if not t[0].startswith("/")]

            # Sort closing tags by ascending length (so outer closes last) then by name
            closing_tags.sort(key=lambda x: (x[1], x[0]))
            for tag, _ in closing_tags:
                tagged_content.append(f"<{tag}>")

            # Sort opening tags by descending length (so outer opens first) then by name
            opening_tags.sort(key=lambda x: (x[1], x[0]), reverse=True)
            for tag, _ in opening_tags:
                tagged_content.append(f"<{tag}>")
            
            current_idx = index
        tagged_content.append(text[current_idx:])

        return "".join(tagged_content)
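To try the idea outside the extractor class, here is a self-contained sketch of the same length-aware ordering. It assumes the annotated text is a dict with a `"text"` string and a `"label"` list of `(start, end, label)` tuples, as in the snippet above. The function name `to_xml`, the boolean closing flag, and the sample data are made up for illustration, and the XML declaration is left out to keep the output short.

```python
from collections import defaultdict


def to_xml(annotated_text: dict) -> str:
    """Wrap each labelled span in tags, nesting longer spans around shorter ones."""
    tag_indices = defaultdict(list)
    for start, end, label in sorted(annotated_text["label"]):
        length = end - start
        tag_indices[start].append((label, length, False))  # opening tag
        tag_indices[end].append((label, length, True))     # closing tag

    text = annotated_text["text"]
    parts = []
    current_idx = 0
    for index, tags in sorted(tag_indices.items()):
        parts.append(text[current_idx:index])

        # Close shorter (inner) spans first, so the outer span closes last.
        closing = sorted((t for t in tags if t[2]), key=lambda t: (t[1], t[0]))
        # Open longer (outer) spans first.
        opening = sorted(
            (t for t in tags if not t[2]), key=lambda t: (t[1], t[0]), reverse=True
        )

        parts.extend(f"</{label}>" for label, _, _ in closing)
        parts.extend(f"<{label}>" for label, _, _ in opening)
        current_idx = index
    parts.append(text[current_idx:])
    return "".join(parts)


# Hypothetical sample: one outer span that shares its start and end
# indices with two inner spans.
sample = {
    "text": "Item Item",
    "label": [(0, 9, "outer"), (0, 4, "inner"), (5, 9, "inner")],
}
print(to_xml(sample))
# <outer><inner>Item</inner> <inner>Item</inner></outer>
```

Zero-length annotations (start equal to end) are not handled here: their closing tag would be emitted before their opening tag, so they would need separate treatment.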

HtmlExtractor

The HtmlExtractor suffers from the same issue and I think a similar fix could be applied there.
