Hello,
Thank you so much for this library. I'm using it to implement a PDF viewer, and I hope it's okay to ask a question.
I want to implement text selection. To do that, I need to know the accurate bounding boxes of all chars.
But I noticed some misalignment between the char bounds (red) and the containing text object (blue):
In the above example, there is a wide space between the letter D and the ":". The char bounds don't seem to account for that space. As you can see, the bounds of the ":" connect directly with those of the preceding character.
I don't know if I have missed anything. I tried both loose_bounds() and tight_bounds(), and both show the same misalignment.
When I print the char measurements in the console, I can see that between the two characters there is an invisible character (a space, Unicode 32) that has a width of zero.
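For illustration, one possible way to compensate for such zero-width spaces would be to stretch each of them across the gap between its neighbours. The following is only a sketch under stated assumptions: CharBox and fill_zero_width_gaps are hypothetical names and not part of pdfium-render, and the logic assumes the characters arrive in reading order, left to right, on a single line.

```rust
/// Hypothetical stand-in for one character's horizontal extent, in page units.
/// This is not a pdfium-render type.
#[derive(Debug, Clone, PartialEq)]
struct CharBox {
    ch: char,
    left: f32,
    right: f32,
}

/// Widens every zero-width whitespace box so that it spans the gap between
/// its immediate neighbours. Assumes `chars` is in reading order, left to
/// right, on a single line.
fn fill_zero_width_gaps(chars: &mut [CharBox]) {
    for i in 0..chars.len() {
        let is_empty_space =
            chars[i].ch.is_whitespace() && chars[i].right <= chars[i].left;
        if !is_empty_space {
            continue;
        }
        // At the start or end of the run, fall back to the box's own edges.
        let prev_right = if i > 0 { chars[i - 1].right } else { chars[i].left };
        let next_left = if i + 1 < chars.len() {
            chars[i + 1].left
        } else {
            chars[i].right
        };
        // Only widen when there really is a gap to fill.
        if next_left > prev_right {
            chars[i].left = prev_right;
            chars[i].right = next_left;
        }
    }
}
```

With boxes for "D" ending at 18.0, a zero-width space at 18.0, and ":" starting at 30.0, the space box would be widened to cover 18.0 to 30.0, while the boxes for "D" and ":" stay untouched.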
Here is the code I use to collect the bounds:
    for page_index in 0..page_count {
        let p = document.pages().get(page_index as u16).unwrap();
        let text_page = &p.text().unwrap();
        println!(
            "page width: {}, height: {} {}",
            p.width().value,
            p.height().value,
            page_index
        );
        let mut text_objects = Vec::new();
        for object in p.objects().iter() {
            match object {
                pdfium_render::prelude::PdfPageObject::Text(ref text) => {
                    //println!("text: {}", text.text());
                    let bounds = text.bounds().unwrap();
                    //println!("text bounds: {:?}", &bounds);
                    let mut text_bboxes = Vec::new();
                    if let Ok(chars_in_text_object) = text_page.chars_for_object(text) {
                        for c in chars_in_text_object.iter() {
                            // Fetch the loose bounds once and reuse them below.
                            let loose = c.loose_bounds().unwrap();
                            println!("text char: {} {}", loose, c.unicode_char().unwrap());
                            println!("translate : {:?}", c.get_translation());
                            // Flip the y axis: PDF origin is bottom-left, mine is top-left.
                            text_bboxes.push(BBox::new(
                                loose.left.value,
                                loose.right.value,
                                p.height().value - loose.top.value,
                                p.height().value - loose.bottom.value,
                            ));
                            // break;
                        }
                    }
                    text_objects.push(TextObject {
                        bbox: BBox::new(
                            bounds.left.value,
                            bounds.right.value,
                            p.height().value - bounds.top.value,
                            p.height().value - bounds.bottom.value,
                        ),
                        char_bboxes: text_bboxes,
                        text: text.text().to_string(),
                    });
                }
                pdfium_render::prelude::PdfPageObject::Image(ref image) => {
                    println!("image: ");
                }
                _ => {}
            }
            // break;
        }
    }

Also, I noticed some strange behavior when calling text_page.chars_for_object(text). Calling it seems to reset the page iterator, document.pages().iter(), which makes my for loop infinite. That is why the code sample above gets the overall page count first and loops only that many times.
I can't do this, for example:

    for p in document.pages().iter() {
        let text_page = &p.text().unwrap();
        text_page.chars_for_object(text) // document.pages().iter() seems to be reset here
    } // this will be an infinite loop for some reason

Thank you!