Chunk Positioning After Cleaning and Recursive Splitting – Need to Retain Original Document Position for PDF Navigation / Highlight  

Hey @brandenchan. I’m splitting the content into chunks of 200 words or 1200 characters using recursive splitting. After that, I retrieve the top 5 chunks from the retrieval pipeline. It’s critical for our use case that we display these 5 chunks and can navigate to and highlight them in the PDF.

To address this, I was thinking of using, for example, the original page number and `split_idx_start` to find the positions, but because the text is cleaned before splitting, those offsets no longer line up with the original document. I’m considering using Levenshtein distance or fuzzy matching to compare the processed chunk with the text on the page, but that approach feels a bit risky and might not be very accurate.
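To make the fuzzy fallback concrete, here is a minimal sketch. It assumes the raw page text is available as a string, and it uses Python's standard `difflib.SequenceMatcher` (longest-matching-block comparison) rather than Levenshtein distance. The function name `locate_chunk` and the `min_block` threshold are hypothetical and not part of any library.

```python
from difflib import SequenceMatcher


def locate_chunk(page_text, chunk, min_block=4):
    """Approximate the span of `chunk` inside `page_text`.

    Returns (start, end, coverage), where start/end index into page_text and
    coverage is the fraction of chunk characters found in matching blocks.
    Returns None if no block of at least `min_block` characters matches.
    """
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in longer sequences.
    matcher = SequenceMatcher(None, page_text, chunk, autojunk=False)
    # Drop tiny blocks (and the zero-size terminator) to reduce noise.
    blocks = [b for b in matcher.get_matching_blocks() if b.size >= min_block]
    if not blocks:
        return None
    start = min(b.a for b in blocks)
    end = max(b.a + b.size for b in blocks)
    coverage = sum(b.size for b in blocks) / len(chunk)
    return start, end, coverage
```

A low coverage value is a signal that the match should not be trusted, which is exactly the risk with this approach: short or repetitive chunks can match the wrong place on the page.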

It would be incredible if we could backtrack and add the start and end positions to the chunk metadata to refer to the original document position.
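A rough sketch of what that backtracking could look like: the cleaning step records, for every character it keeps, the index of that character in the original text, so any span in the cleaned text can be mapped back. The cleaning rule here (collapsing whitespace runs into a single space) is only an assumption for illustration, and `clean_with_offsets` and `original_span` are hypothetical helpers, not existing Haystack functions.

```python
def clean_with_offsets(text):
    """Collapse whitespace runs to one space and trim both ends.

    Returns (cleaned_text, offsets), where offsets[i] is the index in the
    original text of the character at cleaned_text[i].
    """
    cleaned_chars = []
    offsets = []
    prev_space = True  # treat the start as whitespace so leading runs are dropped
    for i, ch in enumerate(text):
        if ch.isspace():
            if not prev_space:
                cleaned_chars.append(" ")
                offsets.append(i)
            prev_space = True
        else:
            cleaned_chars.append(ch)
            offsets.append(i)
            prev_space = False
    # Remove a single trailing space left over from trailing whitespace.
    if cleaned_chars and cleaned_chars[-1] == " ":
        cleaned_chars.pop()
        offsets.pop()
    return "".join(cleaned_chars), offsets


def original_span(offsets, start, end):
    """Map a non-empty [start, end) span in the cleaned text to the original text."""
    return offsets[start], offsets[end - 1] + 1
```

With something like this, the splitter could store the mapped start and end of each chunk in its metadata next to `split_idx_start`, and the PDF viewer could highlight the original span directly without any fuzzy matching.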


Chunk Positioning After Cleaning and Recursive Splitting – Need to Retain Original Document Position for PDF Navigation / Highlight #8761
