Best practice scanned PDF / What model to use?

Hi all

I’m new to the OpenAI API. I’ve written a (backoffice) application which uploads documents (mainly pdf) to OpenAI to extract data.

All works perfectly, but i’m struggling with scanned pdf’s. What the best practice?

  • I can do OCR before sending the file to ChatGPT and make a searchable pdf
  • I can make an image of each pdf page and upload those using the vision API calls. I tried this but the chat then asks to upload the document, so I guess im doing something wrong here.
  • I can extract the text from the pdf and send it to the API instead of the file, but i’m worried about the results if i do that. (Position of the data,…)
  • I read making a html file of the pdf does the trick. Anyone van verify?

Additional questions:

  • anyone knows how the data extraction works on secured pdf files? Like the security which makes you can’t extract a page for example.
  • whats the best model to use? I’m now using gpt4o-mini and results are fine. But i’ve read gpt4o is cheaper for the vision calls?

Alot of questions. Hopefully alot of answers too :slight_smile: I have read alot of it but the API has changed a lot recently it seems so it’s hard to find the right answers online. Community to the rescue?

Thanks!

1 Like

I’ve not worked with pdf’s specifically before but using gpt4o has worked well with images for me so far.

My suggestion would be to do it one pdf at a time. For each page in the pdf, get the image and encode it into base64. Then add all the images in order of page number into a message object that you can send to 4o.

Just be careful about the amount of images and tokens you are sending in one message to ensure they dont cross the limits.

The prompt in the begining of the messages object can have your task description init with the developer role.

This should do the trick for you

Thanks for your answer.
I used the Assistant API so far. Guess you use the chat completion one?

Yeah. While I like the concept of the Assistant API, it just feels inflexible most of the times to me.

It’s quite possible to create a assistant like architecture using tool calls and message trails which I prefer