Loading Data from PDFs into Pharo

Well, this one took way too long to accomplish.

In order to get the information from PDFs into Pharo, I had to find a simple cross-platform binary to turn them into text. There are fancy tools for extracting data from scientific papers (authors, citations, etc.). Those depend on machine learning and are either huge (500 MB and above) or inaccurate.

I still have to decide where I will keep this binary. I execute it using PipeableOSProcess (CommandShell somehow doesn’t recognize the binary). Neither OSProcess nor CommandShell seems to be able to access all the binaries available in the terminal: I installed pdftotext using Homebrew, and neither class was able to use it.
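One possible explanation (I have not verified it) is that a process started from the Pharo image does not see the same PATH as an interactive terminal, so calling the binary by its absolute path may help. A minimal sketch, assuming OSProcess is loaded; both paths are hypothetical and need to be adjusted, and paths containing spaces would need quoting. The trailing "-" tells pdftotext to write the text to standard output instead of a file.

```smalltalk
"Sketch only: both paths below are assumptions, adjust them to your machine."
| binary pdf text lines |
binary := '/usr/local/bin/pdftotext'.   "hypothetical install location"
pdf := '/tmp/paper.pdf'.                "hypothetical input file"

"The trailing '-' asks pdftotext to write the text to stdout instead of a file."
text := (PipeableOSProcess command: binary, ' ', pdf, ' -') output.

"Split the extracted text into lines for further processing."
lines := text lines.
```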

I now have the PDF available, split up into lines. I will try to figure out which lines are names as follows: “Check whether the line below the name candidate contains an @ character, something like University, or the name of a state, town, or country”. I will have to test multiple PDFs to make sure this is accurate.
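A rough sketch of that check could look like the following. The keyword list and the sample lines are made up for illustration, and the state/town/country lookup is left out entirely, so treat this as a starting point and not as the final heuristic.

```smalltalk
"Sketch of the heuristic; keywords and sample lines are invented for illustration."
| keywords looksLikeAffiliation lines candidates |
keywords := #('university' 'institute' 'department').

"A line counts as a hint if it contains an @ or one of the keywords."
looksLikeAffiliation := [ :line |
	(line includes: $@) or: [
		keywords anySatisfy: [ :word |
			line asLowercase includesSubstring: word ] ] ].

lines := #('Loading Data from PDFs' 'Jane Doe' 'University of Somewhere' 'jane@example.org').

"Keep every line whose following line looks like an affiliation or e-mail address."
candidates := OrderedCollection new.
1 to: lines size - 1 do: [ :i |
	(looksLikeAffiliation value: (lines at: i + 1))
		ifTrue: [ candidates add: (lines at: i) ] ].
candidates
```

On these sample lines the result contains 'Jane Doe', but also 'University of Somewhere', because the e-mail address sits directly below the affiliation. That kind of false positive is exactly why the rule needs testing against several PDFs.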

One more thing: pdftotext apparently has some quirks. One PDF was extracted into a text t h a t  l o o k s  l i k e  t h i s. Given that PDF is heavily focused on layout and not information, I think I will be running into more problems like this.
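For this particular quirk, a small repair step might be enough. The sketch below assumes that, as in the example above, letters are separated by one space and words by two; other PDFs may well break that assumption. A line is only touched if every whitespace-separated token is a single character.

```smalltalk
"Sketch: collapse spaced-out letters, assuming two spaces mark a word gap."
| line tab tokens isSpacedOut fixed |
line := 't h a t  l o o k s  l i k e  t h i s'.
tab := String with: Character tab.   "temporary marker for word gaps"

"Only treat the line as spaced out if every token is a single character."
tokens := line substrings.
isSpacedOut := tokens size > 1 and: [ tokens allSatisfy: [ :token | token size = 1 ] ].

fixed := isSpacedOut
	ifTrue: [
		"Mark word gaps, drop the remaining single spaces, then restore the gaps."
		((line copyReplaceAll: '  ' with: tab)
			copyReplaceAll: ' ' with: '')
			copyReplaceAll: tab with: ' ' ]
	ifFalse: [ line ].
fixed
```

For the sample line this yields 'that looks like this'; ordinary lines pass through unchanged.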
