XmlExtractor and HtmlExtractor produce wrong tag order #93

## HTML
```html
<html>
   <body>
     <div class="a">
        <span class="b">Item</span>
        <span class="b">Item</span>
        <span class="b">Item</span>
        <span class="b">Item</span>
     </div>
   </body>
</html>
```

## Annotation Rules
```py
{
    "div#class=a": ["outer"],
    "span#class=b": ["inner"]
}
```

## Output XmlExtractor
```xml
<inner><outer>  Item </inner><inner>Item </inner><inner>Item </inner><inner>Item</outer></inner>
```

## Expected Output
```xml
<outer><inner>  Item </inner><inner>Item </inner><inner>Item </inner><inner>Item</inner></outer>
```

## Potential fix for XmlExtractor
To fix this, the tag sorting could take the span length (end index minus start index) into account: at a shared index, shorter spans close first and longer spans open first.

```py
# [..]
        tag_indices = defaultdict(list)

        for start, end, label in sorted(annotated_text["label"]):
            length = end - start
            tag_indices[start].append((label, length))
            tag_indices[end].append(("/" + label, length))

        current_idx = 0
        tagged_content = ['<?xml version="1.0" encoding="UTF-8" ?>\n']
        text = annotated_text["text"]
        for index, tags in sorted(tag_indices.items()):
            tagged_content.append(text[current_idx:index])

            # Separate closing vs opening tags
            closing_tags = [t for t in tags if t[0].startswith("/")]
            opening_tags = [t for t in tags if not t[0].startswith("/")]

            # Sort closing tags by ascending length (so inner closes first, outer last), then by name
            closing_tags.sort(key=lambda x: (x[1], x[0]))
            for tag, _ in closing_tags:
                tagged_content.append(f"<{tag}>")

            # Sort opening tags by descending length (so outer opens first); reverse=True also
            # reverses the name order, which mirrors the closing order for equal-length spans
            opening_tags.sort(key=lambda x: (x[1], x[0]), reverse=True)
            for tag, _ in opening_tags:
                tagged_content.append(f"<{tag}>")
            
            current_idx = index
        tagged_content.append(text[current_idx:])

        return "".join(tagged_content)

```
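To check the idea outside the extractor, here is a minimal standalone sketch of the same sorting logic. The function name `to_tagged_text`, the input text, and the `(start, end, label)` offsets are made up for illustration (whitespace is simplified compared to the real output above), and the XML declaration is left out; this is not the library's actual code path.

```py
from collections import defaultdict


def to_tagged_text(text, labels):
    """Hypothetical helper: wrap labelled spans of `text` in properly nested tags."""
    tag_indices = defaultdict(list)
    for start, end, label in sorted(labels):
        length = end - start
        tag_indices[start].append((label, length))
        tag_indices[end].append(("/" + label, length))

    parts = []
    current_idx = 0
    for index, tags in sorted(tag_indices.items()):
        parts.append(text[current_idx:index])
        # Shorter spans close first, so the outer tag closes last.
        closing = sorted(
            (t for t in tags if t[0].startswith("/")),
            key=lambda t: (t[1], t[0]),
        )
        # Longer spans open first, so the outer tag opens first.
        opening = sorted(
            (t for t in tags if not t[0].startswith("/")),
            key=lambda t: (t[1], t[0]),
            reverse=True,
        )
        # Closing tags are emitted before opening tags at the same index.
        parts.extend(f"<{tag}>" for tag, _ in closing + opening)
        current_idx = index
    parts.append(text[current_idx:])
    return "".join(parts)


# Assumed offsets for illustration: one outer span covering four inner spans.
text = "Item Item Item Item"
labels = [
    (0, 19, "outer"),
    (0, 5, "inner"),
    (5, 10, "inner"),
    (10, 15, "inner"),
    (15, 19, "inner"),
]
result = to_tagged_text(text, labels)
print(result)
```

With these inputs the outer tag wraps all four inner tags, and the result parses as well-formed XML.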

## HtmlExtractor 
The HtmlExtractor has the same problem, and I think a similar fix could be applied there.
