Hey @brandenchan. I’m splitting the content by 200 words or 1200 characters using recursive splitting. After that, I retrieve the top 5 chunks from the retrieval pipeline. It’s critical for our use case that we can display these 5 chunks and navigate to and highlight them in the PDF.
To address this, I was thinking of using, for example, the original page number and split_idx_start to find the positions, but because of cleaning I’m getting incorrect offsets. I’m considering Levenshtein distance or fuzzy matching to compare the processed chunk against the page text, but that approach feels a bit risky and might not be very accurate.
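A less fuzzy variant I could try first, as a rough sketch: if the cleaning mostly changes whitespace, an exact search that ignores whitespace on both sides would give back offsets in the original page text. `locate_chunk` is a hypothetical helper of mine, not something from the library, and it would fail (return `None`) as soon as cleaning removes or rewrites non-whitespace characters such as headers or footers inside a chunk.

```python
from typing import Optional, Tuple


def locate_chunk(page_text: str, chunk: str) -> Optional[Tuple[int, int]]:
    """Find `chunk` in `page_text`, ignoring all whitespace differences.

    Returns (start, end) offsets into the ORIGINAL page text, or None.
    Assumption: cleaning only added, removed, or changed whitespace.
    """
    # Original positions of every non-whitespace character on the page.
    kept = [i for i, ch in enumerate(page_text) if not ch.isspace()]
    squeezed_page = "".join(page_text[i] for i in kept)
    squeezed_chunk = "".join(ch for ch in chunk if not ch.isspace())
    if not squeezed_chunk:
        return None

    pos = squeezed_page.find(squeezed_chunk)
    if pos == -1:
        return None

    # Map the match in the squeezed text back to original offsets.
    start = kept[pos]
    end = kept[pos + len(squeezed_chunk) - 1] + 1
    return start, end
```

With `page = "Header\n\nThe  quick\nbrown   fox jumps.\n\nFooter"` and the cleaned chunk `"quick brown fox"`, the returned span covers the original text including its line break and extra spaces, so it could be used for highlighting.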
It would be incredible if we could backtrack and add the start and end positions to the chunk metadata to refer to the original document position.
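To make the idea concrete, here is a minimal sketch of what I mean, assuming a cleaning step that only collapses whitespace. The cleaner records, for every character it keeps, the index that character had in the original text, so any (start, end) in the cleaned text can be mapped back. `clean_with_offsets` and `original_span` are hypothetical names for illustration, not existing library functions, and a real cleaner (header/footer removal, etc.) would need the same bookkeeping for every transformation it applies.

```python
from typing import List, Tuple


def clean_with_offsets(text: str) -> Tuple[str, List[int]]:
    """Collapse whitespace runs to one space and strip both ends.

    Returns the cleaned text plus a list where offsets[i] is the index
    in the original `text` of the character at cleaned[i].
    """
    cleaned_chars: List[str] = []
    offsets: List[int] = []
    for i, ch in enumerate(text):
        if ch.isspace():
            # Keep one space per run; skip leading whitespace entirely.
            if cleaned_chars and cleaned_chars[-1] != " ":
                cleaned_chars.append(" ")
                offsets.append(i)
        else:
            cleaned_chars.append(ch)
            offsets.append(i)
    # Remove a single trailing space left over from the final run.
    if cleaned_chars and cleaned_chars[-1] == " ":
        cleaned_chars.pop()
        offsets.pop()
    return "".join(cleaned_chars), offsets


def original_span(offsets: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map a non-empty [start, end) span in cleaned text to original offsets."""
    return offsets[start], offsets[end - 1] + 1


original = "  The  quick\n\nbrown fox  "
cleaned, offsets = clean_with_offsets(original)

# Pretend the splitter produced this chunk from the cleaned text.
chunk = "quick brown"
start = cleaned.index(chunk)
orig_start, orig_end = original_span(offsets, start, start + len(chunk))
```

The resulting `orig_start` and `orig_end` are the kind of values I would like to see stored in the chunk metadata next to split_idx_start (the key names here are just placeholders).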