XML-Import over SLR Folder

December 9, 2015 dose123

Today I visualized the accuracy of my XML-Importer and the modeling.

[Image: errorvisualization] Visualization of the accuracy of the import. Red squares are papers with wrong names; red circles are collaborators with wrong names.

It comes with popups that show the imported data of papers and authors. This way I was able to identify three problems:

Lots of names contain special characters that my XML importer can’t handle yet.
I have to transform author names so that they only use the initials of the first names.
Titles in the SLR CSV are all lowercase, whereas my importer keeps the capitalized versions. My error visualization therefore ignores letter case.

Authors that aren’t red are either already in the right format in the paper, or they are false hits. (For authors I set the required name similarity to a low value. For titles the required similarity is fairly high, so a match tolerates only about one malformed character.)
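As a rough illustration of this kind of matching, here is a minimal Python sketch (not the Pharo code of the project). It assumes a similarity ratio from the standard library's `difflib`; the two threshold values and the function names are hypothetical and only chosen to show a lenient author check next to a strict title check.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds: lenient for author names, strict for titles.
AUTHOR_THRESHOLD = 0.6
TITLE_THRESHOLD = 0.95


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_match(imported: str, reference: str, threshold: float) -> bool:
    """True if the imported string is similar enough to the reference."""
    return similarity(imported, reference) >= threshold
```

With a high title threshold, a title with a different letter case or a single wrong character still matches, while an unrelated title does not.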

So what’s next?

First I will try to fix the encoding issue. After that, I will change the model creation so that it only uses the initials of first names. Additionally, I’ll finally implement a caching system. So the TODOs, in chronological order:

Fix encoding
Fix name-format
Caching
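The name-format item could look roughly like the following Python sketch. The target format ("A. Lovelace") and the function name `to_initials` are assumptions for illustration; the sketch also assumes that the last whitespace-separated token is the family name, which will not hold for every author.

```python
def to_initials(full_name: str) -> str:
    """Reduce all given names to initials, e.g. 'Ada Lovelace' -> 'A. Lovelace'.

    Assumes the last whitespace-separated token is the family name.
    """
    parts = full_name.split()
    if len(parts) < 2:
        # A single token cannot be split into given and family name.
        return full_name.strip()
    *given, family = parts
    initials = " ".join(f"{name[0].upper()}." for name in given)
    return f"{initials} {family}"
```

Names that already use initials pass through unchanged, so the function can be applied to both sides of a comparison.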

XMLImport author detection

November 29, 2015 dose123

The XMLImport class is now able to import authors that are all aligned on one line as follows:

[Image: works1]

Comma-separated authors are also supported:

[Image: works2]

What’s not supported yet are author blocks that are stacked below each other:

[Image: notyet]

I think I will need to detect the form of a block (name above e-mail in this case) and search for the same structure below each detected author.

Maybe I will also be able to detect blocks by comparing line heights. Something like: all lines with the same line height and roughly the same horizontal position are considered part of the same block.
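The line-height idea could be sketched like this in Python. Everything here is hypothetical: the `Line` record, the use of font size as a stand-in for line height, and the `left_tolerance` value are assumptions for illustration, not part of the XMLImport class.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left: int       # horizontal position of the line
    top: int        # vertical position of the line
    font_size: int  # stands in for the line height


def group_into_blocks(lines, left_tolerance=10):
    """Group lines with the same font size and roughly the same left edge."""
    groups = []
    for line in sorted(lines, key=lambda l: (l.left, l.top)):
        for group in groups:
            first = group[0]
            if (first.font_size == line.font_size
                    and abs(first.left - line.left) <= left_tolerance):
                group.append(line)
                break
        else:
            # No existing group fits, so this line starts a new block.
            groups.append([line])
    for group in groups:
        group.sort(key=lambda l: l.top)  # read each block top to bottom
    return groups
```

Two authors printed side by side, each with a name above an e-mail address, would end up as two groups of two lines each.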

(As a side note: I have also improved the title detection)

Optimistic XML Import

November 25, 2015 dose123

I’m currently writing an optimistic importer, based on my XML-extraction binary, that makes some assumptions about the PDFs. I’m done with importing titles, and it imports the titles of most of the PDFs in the SLR folder correctly. I’ll leave it at that for now, until I have my tools for model debugging set up. Next up is the extraction of authors.

After I’m done with the optimistic import, I can look at some less convenient papers and see what I can do about them.

PDF to XML

November 25, 2015 dose123

I’m finished with the first version of my pdf-to-xml tool. I’ve decided upon a structure that looks as follows:

<page>
  <block left="84" top="72">
    <span id="f1" font-size="17" vertical-align="baseline" color="#000000" font-family="sans-serif" font-weight="bold" font-style="normal">
      Heapviz: Interactive Heap Visualization for Program
    </span>
  </block>
  <block left="174" top="92">
    <span id="f1" font-size="17" vertical-align="baseline" color="#000000" font-family="sans-serif" font-weight="bold" font-style="normal">
      Understanding and Debugging
    </span>
  </block>
</page>

Now I have to figure out how to extract the needed data from this XML. For example, how do I detect that a sentence is split across multiple blocks, like in the example above?
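One possible heuristic, sketched in Python rather than Pharo: treat a block as the continuation of the previous one if its span has the same font attributes and it sits only slightly lower on the page. The attribute list `STYLE_KEYS`, the `max_gap` value, and the assumption of one span per block are illustrative guesses, not something the tool guarantees.

```python
import xml.etree.ElementTree as ET

# Attributes that must agree for two blocks to count as the same sentence.
STYLE_KEYS = ("font-size", "font-family", "font-weight", "font-style", "color")

SAMPLE = """
<page>
  <block left="84" top="72">
    <span id="f1" font-size="17" color="#000000" font-family="sans-serif"
          font-weight="bold" font-style="normal">
      Heapviz: Interactive Heap Visualization for Program
    </span>
  </block>
  <block left="174" top="92">
    <span id="f1" font-size="17" color="#000000" font-family="sans-serif"
          font-weight="bold" font-style="normal">
      Understanding and Debugging
    </span>
  </block>
</page>
"""


def merge_split_sentences(page_xml, max_gap=30):
    """Join consecutive blocks that share a style and are vertically close."""
    merged = []
    previous_style = None
    previous_top = None
    for block in ET.fromstring(page_xml).findall("block"):
        span = block.find("span")
        if span is None:
            continue
        text = " ".join((span.text or "").split())  # normalize whitespace
        style = tuple(span.get(key) for key in STYLE_KEYS)
        top = int(block.get("top", "0"))
        continues = (
            bool(merged)
            and style == previous_style
            and 0 < top - previous_top <= max_gap
        )
        if continues:
            merged[-1] = f"{merged[-1]} {text}"
        else:
            merged.append(text)
        previous_style, previous_top = style, top
    return merged
```

For the two blocks shown above, `merge_split_sentences(SAMPLE)` yields the full title as a single string, because both spans share the same style and are 20 units apart vertically.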

UML Diagram

November 18, 2015 (updated November 19, 2015) dose123

Today I was getting a little exhausted because my code didn’t feel clean at all. That’s why I took the time to create a UML diagram with a design that feels natural to me. I will now adjust my code to fit this design.

In addition, I named my project EggShell as its working title.

Problems with examples for Roassal

November 4, 2015 (updated November 24, 2015) dose123

(Update: The examples all seem to work in the development version of Moose)

Certain examples from the learning materials on agilevisualization.com don’t work. I tried different versions of Moose and different Roassal packages. The first error I encountered was that the RTDSM class isn’t available. One recurring error is that the message trans isn’t understood.

Luckily I can correct most of the errors and still see the examples in action.

I annotate the learning pages with Scrible. I mark the parts that don’t work in the newest version of Moose (version 5.1). Up until now, the examples that didn’t work in version 5.1 also didn’t work in the version linked from agilevisualization.com (Moose 5, often for different reasons).

You can check out my current annotations here:

Part 1 – Chapter 1 – Quick Start
Part 1 – Chapter 3 – Painting with Trachel (all examples work)
Part 1 – Chapter 4 – Visualizing with Roassal

String findTokens variations

October 21, 2015 dose123

Well, this one was unexpected. The findTokens method splits my strings at the given delimiters (here, the newline). Two delimiters in a row count as one as long as the quoteDelimiters argument is not set. So I had to set an arbitrary quoteDelimiter to prevent the empty lines of my text from being ignored.
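The difference between the two behaviours can be shown with a small Python analogy. This is not the Pharo findTokens API, only an illustration of collapsing versus non-collapsing delimiters on a made-up three-line text.

```python
import re

text = "first line\n\nthird line"

# Consecutive delimiters collapse into one, so the empty line disappears.
collapsed = [token for token in re.split(r"\n+", text) if token]

# Every delimiter counts on its own, so the empty line is kept.
preserved = text.split("\n")

print(collapsed)
print(preserved)
```

If line numbers or paragraph breaks matter later on, only the second form keeps them recoverable.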

FileReference pathString problems

October 20, 2015 dose123

I am using the method pathString of the class FileReference to get the path for my terminal command. The returned string, however, is not escaped for use in the terminal. I am now writing a method that escapes a path string so it can be used in a terminal command.
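For comparison, this is what such escaping looks like in Python, where the standard library already provides `shlex.quote` for POSIX shells. The helper name `shell_command` and the example path are made up for illustration; the Pharo method would have to implement similar quoting itself.

```python
import shlex


def shell_command(binary: str, path: str) -> str:
    """Build a command line whose arguments are safe for a POSIX shell."""
    return f"{shlex.quote(binary)} {shlex.quote(path)}"


# Spaces and parentheses in the path would otherwise split or break the command.
print(shell_command("pdftotext", "/Users/me/SLR Folder/paper (1).pdf"))
```

Wrapping the whole path in single quotes is simpler than escaping each special character individually, as long as single quotes inside the path are handled too, which `shlex.quote` does.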

Loading Data from PDFs into Pharo

October 14, 2015 (updated October 20, 2015) dose123

Well, this one took way too long to accomplish.

In order to get the information from PDFs into Pharo, I had to find a simple cross-platform binary that turns them into text. There are fancy tools for extracting data (authors, citations, etc.) from scientific papers. Those depend on machine learning and are either huge (500 MB and above) or inaccurate.

I still have to decide where I will keep this binary. I execute it using PipeableOSProcess (CommandShell somehow doesn’t recognize the binary). Neither OSProcess nor CommandShell seems to be able to access all the binaries available in the terminal. (I installed pdftotext using Homebrew, and neither class was able to use it.)

I now have the PDF available split up into lines. I will try to figure out which lines are names as follows: “Check if the line below the name candidate contains an @ character, or something like University, or the name of a state/town/country”. I will have to test multiple PDFs to make sure this is accurate.
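A minimal Python sketch of this heuristic, under stated assumptions: the hint words in `AFFILIATION_HINTS` and both function names are hypothetical, and the check for state, town, or country names is left out because it would need a list of place names.

```python
# Hypothetical hint words; a real list would need tuning on actual papers.
AFFILIATION_HINTS = ("university", "institute", "department")


def looks_like_affiliation(line: str) -> bool:
    """True if the line contains an e-mail address or an affiliation hint."""
    lowered = line.casefold()
    return "@" in line or any(hint in lowered for hint in AFFILIATION_HINTS)


def name_candidates(lines):
    """Return lines that are directly followed by an e-mail or affiliation line."""
    candidates = []
    for current, below in zip(lines, lines[1:]):
        if (current.strip()
                and not looks_like_affiliation(current)
                and looks_like_affiliation(below)):
            candidates.append(current.strip())
    return candidates
```

The check on the current line keeps an affiliation that is itself followed by an e-mail address from being reported as a name.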

One more thing: pdftotext apparently has some quirks. One PDF was extracted into a text t h a t l o o k s l i k e t h i s. Given that PDF is heavily focused on layout rather than information, I think I will be running into more problems like this.

Introduction

October 14, 2015 dose123

In this blog I will write down the ongoing progress of my bachelor project.

I hope to be able to come back to all the posts in order to make writing the thesis/project (what’s the right word for this?) easier.