char bounds misaligns the containing textobject #198

@shi-yan

Hello,

Thank you so much for this library. I'm using it to implement a PDF viewer, and I hope it's all right to ask a question.

I want to implement text selection. To do that, I need to know the accurate bounding boxes of all chars.

But I noticed some misalignment between the char bounds (red) and the bounds of the containing text object (blue):

Image

In the above example, there is a wide space between the letter D and the ":". The char bounds don't seem to account for that wide space. As you can see, the char bounds connect directly with the preceding character.

I don't know if I have missed anything. I tried both loose_bounds and tight_bounds, and both show the misalignment.

When I print the char measurements to the console, I can see that between the two characters there is an invisible character (a space, Unicode code point 32) that has a width of zero:

Image
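
One possible caller-side workaround for text selection, given only as a minimal sketch: widen each zero-width whitespace box so that it spans the gap between its neighbours. `CharBox` and `fill_zero_width_spaces` are hypothetical names for illustration and are not part of pdfium-render. The sketch assumes the boxes have already been copied out of the library, belong to a single line, and are in left-to-right order.

```rust
/// Hypothetical stand-in for one character's horizontal extent.
/// This is NOT a pdfium-render type.
#[derive(Debug, Clone, PartialEq)]
struct CharBox {
    ch: char,
    left: f32,
    right: f32,
}

/// Widen zero-width whitespace boxes so they cover the gap between the
/// previous character's right edge and the next character's left edge.
/// Assumes `chars` holds one line of text in left-to-right order.
fn fill_zero_width_spaces(chars: &mut [CharBox]) {
    const EPS: f32 = 1e-3;
    for i in 0..chars.len() {
        let width = (chars[i].right - chars[i].left).abs();
        if !chars[i].ch.is_whitespace() || width >= EPS {
            continue; // only touch whitespace that has (almost) no width
        }
        // Fall back to the box's own edges at the start or end of the line.
        let left = if i > 0 { chars[i - 1].right } else { chars[i].left };
        let right = if i + 1 < chars.len() { chars[i + 1].left } else { chars[i].right };
        if right > left {
            chars[i].left = left;
            chars[i].right = right;
        }
    }
}

fn main() {
    // Made-up coordinates mimicking "D", a zero-width space, then ":".
    let mut line = vec![
        CharBox { ch: 'D', left: 10.0, right: 18.0 },
        CharBox { ch: ' ', left: 18.0, right: 18.0 },
        CharBox { ch: ':', left: 30.0, right: 33.0 },
    ];
    fill_zero_width_spaces(&mut line);
    println!("{:?}", line[1]);
}
```

This only papers over the symptom on the viewer side. It does not explain why the space is reported with zero width in the first place.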

Here is my code:

    for page_index in 0..page_count {
        let p = document.pages().get(page_index as u16).unwrap();
        let text_page = &p.text().unwrap();
        println!(
            "page width: {}, height: {} {}",
            p.width().value,
            p.height().value,
            page_index
        );

        let mut text_objects = Vec::new();

        for object in p.objects().iter() {
            match object {
                pdfium_render::prelude::PdfPageObject::Text(ref text) => {
                    let bounds = text.bounds().unwrap();

                    // Collect the bounds of every char in this text object.
                    let mut text_bboxes = Vec::new();
                    if let Ok(chars_in_text_object) = text_page.chars_for_object(text) {
                        for c in chars_in_text_object.iter() {
                            println!("text char: {} {}", c.loose_bounds().unwrap(), c.unicode_char().unwrap());
                            println!("translate : {:?}", c.get_translation());
                            // Flip y by subtracting from the page height.
                            text_bboxes.push(BBox::new(
                                c.loose_bounds().unwrap().left.value,
                                c.loose_bounds().unwrap().right.value,
                                p.height().value - c.loose_bounds().unwrap().top.value,
                                p.height().value - c.loose_bounds().unwrap().bottom.value,
                            ));
                        }
                    }

                    text_objects.push(TextObject {
                        bbox: BBox::new(
                            bounds.left.value,
                            bounds.right.value,
                            p.height().value - bounds.top.value,
                            p.height().value - bounds.bottom.value,
                        ),
                        char_bboxes: text_bboxes,
                        text: text.text().to_string(),
                    });
                }
                pdfium_render::prelude::PdfPageObject::Image(_) => {
                    println!("image: ");
                }
                _ => {}
            }
        }
    }

Also, I noticed some strange behavior when calling text_page.chars_for_object(text). The call seems to reset the page iterator (document.pages().iter()), which makes my for loop run forever. That is why, in the code sample above, I get the overall page count first and loop only that many times.

I can't do, for example:

for p in document.pages().iter() {
  let text_page = &p.text().unwrap();
  text_page.chars_for_object(text) // document.pages().iter() seems to be reset here
} // this will be an infinite loop for some reason.

Thank you!
