Chunk Positioning After Cleaning and Recursive Splitting – Need to Retain Original Document Position for PDF Navigation / Highlight  #8761

@alexanderkhivrych

Hey @brandenchan. I’m splitting the content into chunks of 200 words or 1200 characters using recursive splitting. I then retrieve the top 5 chunks from the retrieval pipeline. For our use case it’s critical that we can display these 5 chunks and also navigate to and highlight them in the PDF.

To address this, I was thinking of using the original page number and split_idx_start to locate each chunk, but because cleaning changes the text, the offsets I get no longer line up with the original page. I’m considering Levenshtein distance or fuzzy matching to compare the processed chunk against the page text, but that approach feels risky and might not be accurate enough.
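For reference, here is a minimal sketch of the fuzzy-matching fallback I have in mind, using only Python's standard `difflib`. The function name `locate_chunk` and the sample strings are made up for illustration and are not part of any library. It also shows why this feels risky: a stray match far from the real location can widen the span, so the coverage score has to be checked.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def locate_chunk(page_text: str, chunk: str) -> Optional[Tuple[int, int, float]]:
    """Estimate where a cleaned chunk sits inside the raw page text.

    Returns (start, end, coverage), where page_text[start:end] is the
    estimated span and coverage is the fraction of chunk characters that
    were matched. Hypothetical helper, not a library API.
    """
    matcher = SequenceMatcher(None, page_text, chunk, autojunk=False)
    # The last block returned by get_matching_blocks() is a zero-size sentinel.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None
    start = blocks[0].a
    end = blocks[-1].a + blocks[-1].size
    coverage = sum(b.size for b in blocks) / len(chunk)
    return start, end, coverage


# Raw page text with the irregular whitespace that cleaning would remove.
page = "Hello   world,\n\n this  is\ta test. "
hit = locate_chunk(page, "this is")
if hit is not None:
    start, end, coverage = hit
    print(repr(page[start:end]), coverage)
```

A low coverage value would mean the chunk could not be aligned reliably and the highlight should probably be skipped.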

It would be incredible if we could backtrack and add the start and end positions to the chunk metadata to refer to the original document position.
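To make the request concrete, here is a rough sketch of what I mean by backtracking, assuming the cleaning step only collapses and strips whitespace. The cleaner records, for every character it keeps, the index that character had in the original text, so any span in the cleaned text can be mapped back. The names `clean_with_offsets` and `original_span` are hypothetical and not an existing API.

```python
from typing import List, Tuple


def clean_with_offsets(text: str) -> Tuple[str, List[int]]:
    """Collapse whitespace runs to a single space and strip both ends.

    Returns the cleaned text and a list where offsets[i] is the index in
    `text` of the character that produced cleaned[i].
    """
    cleaned: List[str] = []
    offsets: List[int] = []
    for i, ch in enumerate(text):
        if ch.isspace():
            # Emit one space per whitespace run, and none at the very start.
            if cleaned and cleaned[-1] != " ":
                cleaned.append(" ")
                offsets.append(i)
        else:
            cleaned.append(ch)
            offsets.append(i)
    # Drop the single space left behind by trailing whitespace.
    if cleaned and cleaned[-1] == " ":
        cleaned.pop()
        offsets.pop()
    return "".join(cleaned), offsets


def original_span(offsets: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map a [start, end) span in the cleaned text to the original text."""
    if not 0 <= start < end <= len(offsets):
        raise ValueError("span is outside the cleaned text")
    return offsets[start], offsets[end - 1] + 1


original = "Hello   world,\n\n this  is\ta test. "
cleaned, offsets = clean_with_offsets(original)

# Pretend the splitter produced this chunk from the cleaned text.
chunk_start = cleaned.index("this is")
chunk_end = chunk_start + len("this is")
orig_start, orig_end = original_span(offsets, chunk_start, chunk_end)
print(repr(original[orig_start:orig_end]))
```

If the cleaner and splitter carried a map like this along, the original start and end could be written into the chunk metadata next to split_idx_start and no fuzzy matching would be needed.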

Labels: P2 (Medium priority, add to the next sprint if no P1 available)
