Well, this one was unexpected. The findTokens method splits my strings at the given delimiters (in my case, the newline). Two delimiters in a row count as one, as long as the quoteDelimiters argument is not set. So I had to set an arbitrary quoteDelimiter to prevent the empty lines of my text from being ignored.
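To make the difference concrete, here is a minimal sketch in Python rather than Pharo. It does not reproduce findTokens itself. It only contrasts a splitter that collapses runs of delimiters with one that splits at every single delimiter. The sample text and variable names are made up for illustration.

```python
import re

# Made-up sample: three lines, the middle one is empty.
text = "first line\n\nthird line"

# Collapsing behaviour: a run of newlines counts as one delimiter,
# so the empty line disappears from the result.
collapsed = [token for token in re.split(r"\n+", text) if token]

# Splitting at every single newline keeps the empty line as "".
preserved = text.split("\n")

print(collapsed)
print(preserved)
```

The second variant is the behaviour needed when empty lines carry meaning, for example as paragraph separators.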
Month: October 2015
FileReference pathString problems
I am using the pathString method of the FileReference class to get the path for my terminal command. The returned string, however, is not escaped for use in the terminal. I am now writing a method that escapes a path string so it can be used in a terminal command.
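The idea behind such an escaping method can be sketched as follows, in Python rather than Pharo and assuming a POSIX shell. The function name escape_path_for_shell and the set of "safe" characters are my own choices for illustration, not part of FileReference or of any Pharo library. Every character outside the safe set gets a backslash in front of it.

```python
import shlex
import string

# Characters assumed safe to leave unescaped in a POSIX shell (illustrative choice).
SAFE_CHARS = set(string.ascii_letters + string.digits + "/._-")

def escape_path_for_shell(path: str) -> str:
    """Put a backslash in front of every character outside SAFE_CHARS.

    Assumption: the path contains no newline. A backslash followed by a
    newline is a line continuation in POSIX shells, so it is not handled here.
    """
    if not path:
        return "''"  # an empty argument has to be quoted to survive at all
    return "".join(c if c in SAFE_CHARS else "\\" + c for c in path)

# Made-up example path with a space and parentheses.
escaped = escape_path_for_shell("/Users/me/My Papers/a(1).pdf")
print(escaped)

# Python's standard library takes the other common route: single quotes.
quoted = shlex.quote("/Users/me/My Papers/a(1).pdf")
print(quoted)
```

Both results should be read by the shell as one single argument. Backslash escaping keeps the path recognizable, while quoting needs less per-character logic.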
Loading Data from PDFs into Pharo
Well, this one took way too long to accomplish.
In order to get the information from PDFs into Pharo, I had to find a simple cross-platform binary to turn them into text. There are fancy tools for extracting data from scientific papers (authors, citations etc.). Those depend on machine learning and are either huge in size (500 MB and above) or inaccurate.
I still have to decide where I will keep this binary. I execute it using PipeableOSProcess (CommandShell somehow doesn’t recognize the binary). Neither OSProcess nor CommandShell seems to be able to access all the binaries available in the terminal. (I installed pdftotext using Homebrew. Neither of the classes was able to use it.)
I now have the PDF available, split up into lines. I will try to figure out which lines are names as follows: “Check whether the line below the name candidate contains an @ character, a word like University, or the name of a state/town/country”. I will have to test multiple PDFs to make sure this is accurate.
One more thing: pdftotext apparently has some quirks. One PDF was extracted into a text t h a t l o o k s l i k e t h i s. Given that PDF is heavily focused on layout and not on information, I think I will be running into more problems like this.
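The rule quoted above can be sketched roughly like this, in Python rather than Pharo. The keyword list, the function names and the sample lines are all hypothetical. The check for state/town/country names is left out because it would need a list of place names.

```python
# Hypothetical keyword list; a real one would need tuning against many PDFs.
AFFILIATION_HINTS = ("university", "institute", "department")

def looks_like_affiliation(line: str) -> bool:
    """True if the line contains an @ or one of the affiliation keywords."""
    lowered = line.lower()
    return "@" in line or any(hint in lowered for hint in AFFILIATION_HINTS)

def name_candidates(lines: list[str]) -> list[str]:
    """Return the lines that sit directly above an affiliation-like line.

    Lines that look like an affiliation themselves are skipped, otherwise
    a university line above an e-mail line would be reported as a name.
    """
    return [
        current
        for current, below in zip(lines, lines[1:])
        if current.strip()
        and not looks_like_affiliation(current)
        and looks_like_affiliation(below)
    ]

# Made-up lines, as they might come out of the extracted text.
sample = [
    "A Study of Things",
    "Jane Doe",
    "Example University",
    "jane@example.org",
    "Abstract",
]
print(name_candidates(sample))
```

This only looks one line ahead, so it would miss layouts where several authors are listed on one line or where the affiliation is further down.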
Introduction
In this blog I will write down the ongoing process of my Bachelor project.
I hope to be able to come back to all the posts in order to facilitate the process of writing the thesis/project (what’s the right word for this?).