Navigating GeoParquet: Lessons Learned from the eMOTIONAL Cities Project

The eMOTIONAL Cities project has set out to understand how the natural and built environment can shape the feelings and emotions of those who experience it. At its core lies a Spatial Data Infrastructure (SDI), which combines a variety of datasets from the Urban Health domain. These datasets should be available to urban planners, neuroscientists and other stakeholders for analysis, for creating data products and, eventually, for making decisions based upon them.

Although the average size of a dataset is small (with a few exceptions), scientists often want to combine several of these datasets in the same analysis, which creates a use case where format efficiency matters. For this reason, we recently decided to offer GeoParquet as an alternative encoding for the 100+ vector datasets published in the SDI.

What is GeoParquet

For those who have been distracted, GeoParquet is a format which encodes vector data in Apache Parquet. There is no reinventing the wheel here: Apache Parquet is a free and open-source column-oriented data storage format, which provides efficient data compression and encoding schemes with enhanced performance for handling complex data in bulk; GeoParquet just adds spatial support on top of Parquet, leveraging the fact that most cloud data warehouses already understand it to achieve interoperability. Although GeoParquet started as a community effort, it is now on the path to becoming an OGC Standard, and you can follow (or even contribute to) the spec at: https://github.com/opengeospatial/geoparquet

GeoParquet extends Parquet by adding some metadata about the file and about each geometry column; the number of mandatory metadata fields is kept to a minimum, with some nice-to-have optional ones.

Converting & Publishing the Data

Although it is a relatively new format, GeoParquet already sports a vibrant ecosystem of implementations to choose from. After a few experiments, we decided to use the GDAL library to convert the datasets, as it integrates better with our existing pipeline.

Our source datasets are hosted in an S3 bucket in GeoJSON format, and we wanted to place the GeoParquet files in S3 as well, so the idea was to read and write the files directly from S3.

Our pipeline uses GDAL wrapped in a bash script, which reads all the GeoJSON files in a folder of an S3 bucket and places the resulting GeoParquet files in a different folder of the same bucket.

It should be noted that, in order to support GeoParquet, a GDAL version newer than 3.8.4 should be used; to make things easier, we run the right GDAL version from a Docker container. The script is available in the etl-tools repository of the eMOTIONAL Cities project under an MIT license.
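At its heart, the conversion is a single ogr2ogr call per dataset. The line below is only a minimal sketch of the kind of command involved (the bucket and file names are illustrative, and it assumes a GDAL build that includes the Parquet driver and has credentials for the bucket):

ogr2ogr -f Parquet /vsis3/my-bucket/geoparquet/dataset.parquet /vsis3/my-bucket/geojson/dataset.geojson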

The GeoParquet files were validated directly from the S3 bucket using gpq, a lightweight tool written in Go which creates, as well as validates, GeoParquet files. The gpq CLI was wrapped in another bash script, available here. No validation errors were found in the created GeoParquet files.
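As an illustration, validating a single file with the gpq CLI looks like this (the file name is just an example):

gpq validate hex350_grid_obesity_1920.parquet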
In order to make the GeoParquet datasets discoverable, they were added to each collection record of the eMOTIONAL Cities catalogue as an item link of type "application/vnd.apache.parquet", which can be negotiated by clients. See an example below for the hex350_grid_obesity_1920 collection.

        {
            "href": "https://emotional-cities.s3.eu-central-1.amazonaws.com/geoparquet/hex350_grid_obesity_1920.parquet",
            "rel": "item",
            "type": "application/vnd.apache.parquet",
            "title": "GeoParquet download link for hex350_grid_obesity_1920"
        },

The chart above shows the size of one of our largest datasets (activity_level_ldn) in different formats. GeoParquet translates into a smaller size, even when compared with binary formats such as Shapefile or GeoPackage. These smaller sizes, especially when multiplied by a large number of datasets, translate into cost savings for hosting data; they also provide a better experience for users who stream these datasets over the web for analysis.

The chart above shows the total size of eMOTIONAL Cities datasets in various formats.

Socializing the Results

Although the GeoParquet files are discoverable by machines through the OGC API – Records catalogue, more work needs to be done to ensure that humans are aware of them. These are a few initiatives that we have done, or plan to do, in order to socialise these results and encourage users to leverage the GeoParquet datasets that we expose in the SDI:

  • May 2024: GeomobLX – Lightning Talk about “GeoParquet”.
  • October 2024: eMOTIONAL Cities webinar about GeoParquet (TBD)
  • December 2024: FOSS4G World – “Adding GeoParquet to a Spatial Data Infrastructure: What, Why and How” (submitted talk)

The image below shows an eMOTIONAL Cities GeoParquet dataset in QGIS. The out-of-the-box support in widely used tools like QGIS is one of the most exciting things about GeoParquet, but we need to make sure that users know about it.

If you are curious to get your hands on some nice examples, the entire eMOTIONAL Cities catalogue is available to you. In each metadata record, you will find a link to the corresponding GeoParquet file, which you can download locally or stream into your Jupyter notebook.
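For a quick look at one of the published files without writing any code, you can also point GDAL's ogrinfo at it over HTTP (again assuming a GDAL build with the Parquet driver):

ogrinfo -so -al /vsicurl/https://emotional-cities.s3.eu-central-1.amazonaws.com/geoparquet/hex350_grid_obesity_1920.parquet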

This blog post, and the work leading to it, was possible with the collaboration of my colleague Pascallike. A big thanks to him!

Mapping the IVAucher

As a reaction to record-high fuel prices, the Portuguese government has updated the IVAucher program to allow each citizen to recover 10 cents per litre of fuel, up to a maximum of 5 EUR/month. This blog post is not going to discuss whether this is a good way of spending the public budget, or whether it is going to make a real impact on the lives of the people who manage to subscribe to this program. Instead, I want to focus on data.

Once you subscribe to the program as a consumer, you just need to fill the tank at one of the gas stations that subscribed to the program as businesses. The IVAucher website publishes a list of subscribed stations, which seems to be updated from time to time. The list is published as a PDF, with 2746 records, ordered by “distrito” and “concelho” administrative units.

When I looked for the stations around me, in the “concelho” of Lisbon, I found 67 records. In order to know where to go, I would literally need to go through each one and check whether I know the address or the name of the station. Lisbon is a big city, and I admit that there are lots of street names that I don’t know – and I don’t need to, because this is “why” we have maps. My first thought was that this data belonged on a map, and my second thought was that the data should be published in such a way that it would enable other people to create maps – and this is how this project was born.

In the five-star deployment scheme for Open Data, PDF is at the very bottom, and it is easy to understand why. There is only so much you can do with a format that is largely unstructured.

In order to be able to process these data, I had to transform them into a structured format, preferably a non-proprietary one, so I chose CSV (3 stars). This was achieved using a combination of command-line processing tools (e.g.: pdftotext, sed and grep).
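The exact expressions depend on the layout of the PDF, but the general shape of the pipeline was something like the line below (the file name and patterns are illustrative, not the ones actually used):

pdftotext -layout postos_aderentes.pdf - | grep -v '^$' | sed 's/  \+/;/g' > gas_stations.csv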

The next step was to publish these data following the FAIR principles, so that they are Findable, Accessible, Interoperable and Reusable. In order to do that, I chose the OGC API Features standard, which allows vector geospatial data to be published on the web. This standard defines a RESTful API with JSON encodings, which fits the expectations of modern web applications. I used a Python implementation of OGC API Features, called pygeoapi.
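If you want to experiment with pygeoapi yourself, a quick way to bring up a demo instance is the project's Docker image; something along these lines should work (the port mapping and image tag are just an example):

docker run -p 5000:80 geopython/pygeoapi:latest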

Before getting the data into pygeoapi, I had to georeference them. For forward geocoding, I used the OpenCage API, and more specifically its Python client, which is one of the many supported SDKs. After tweaking the parameters, the results were quite good, and I was even able to georeference some incomplete addresses, something that was not possible using the Nominatim OSM API.
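Under the hood, the Python client just calls the OpenCage REST endpoint; a raw request looks roughly like this (the address and API key below are placeholders):

curl "https://api.opencagedata.com/geocode/v1/json?q=Avenida+da+Liberdade+10,+Lisboa&key=YOUR_API_KEY"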

The next thing was to get the data into a format which supports geometry. The CSV was transformed into GeoJSON using GDAL/ogr2ogr. I could have published it as GeoJSON in pygeoapi, but indexing it into a database adds support for more functionality, so I decided to store it in a MongoDB NoSQL data store. Everything was containerised in Docker and orchestrated using this docker-compose file.
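Going back to the CSV-to-GeoJSON step, it is essentially a one-liner with ogr2ogr. A sketch, assuming the geocoded coordinates ended up in columns named longitude and latitude (the actual column names may differ):

ogr2ogr -f GeoJSON gas_stations.geojson gas_stations.csv -oo X_POSSIBLE_NAMES=longitude -oo Y_POSSIBLE_NAMES=latitude -a_srs EPSG:4326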

The application was deployed in AWS and the collection is available at this endpoint:

https://features.byteroad.net/collections/gas_stations

This means that anyone is able to consume this data and create their own maps, whether they are using QGIS, ArcGIS, JavaScript, Python, etc. All they need is an application which implements the OGC API Features standard.
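For example, fetching the first features of the collection as GeoJSON is a single request (limit is one of the standard OGC API Features query parameters):

curl "https://features.byteroad.net/collections/gas_stations/items?f=json&limit=50"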

I also created a map, using React.js and the Leaflet library. Although Leaflet does not support OGC API Features natively, I was able to fetch the data as GeoJSON, by following this approach.

The resulting application is available here:

https://ivaucher.byteroad.net

Now you can navigate through the map until you find your area of interest, or even type an address in the search box to let the map fly to that location.

Hopefully, this application will make the user experience of the IVAucher program a bit easier, but it also demonstrates the importance of using standards to leverage geospatial information. Making data available on the web is good, but it is time to move a step forward and question “how” we are making the data available, in order to ensure that its full potential is unlocked.

Data Analytics Bootcamp

I have always dreamed about making some contribution towards improving the gender balance in technology, which, as you may know, is far from ideal.

Fortunately, the opportunity arose when Katrina Walker invited me to teach the “Data Analytics” bootcamp at CodeOp, an international code school for women and TGNC individuals.

Over the 6-month course, I will share my hands-on experience with the various stages of the data analysis pipeline, specifically on how to apply various technologies to ingest, model and visualize data insights.

Rather than focusing on a specific technology, I will take the “best tool for the job” approach, which is what I do when I want to analyse data. This means learning different tools, such as Python, R, SQL or QGIS, and often combining them.

For me, “data analytics” is like a journey, where we start with a high-level problem, translate it into data and algorithms, and finally extract a high-level idea. At the start and the end of the journey, we should always be able to communicate with people who are not “data geeks”, and this is one idea that I would like to pass on in the course.

I will not add anything else, apart from saying that I am really excited to get started!


Spatial Data Mining

Social media streams may generate massive clouds of geolocated points, but how can we extract useful information from these, sometimes huge, datasets? I think machine learning and GIS can be helpful here.

My PechaKucha talk at DataBeers: “Visualizing Geolocated Tweets: a Spatial Data Mining Approach”.

Cluster Explorer Demo

Cluster explorer is a piece of software I was working on, which blends machine learning and GIS to produce a cluster visualization on top of a virtual globe.
The virtual globe uses NasaWorldWind, a Java framework for 3D geo-visualization based on OpenGL, while the clustering algorithms use the ELKI Data mining framework for unsupervised learning.
The tool allows you to cluster a set of points (for instance, geolocated tweets) using one of two algorithms (or both), and to explore the contextual information provided by the geographic layers (e.g.: OSM, Bing).

Driving Spatial Analysis Beyond the Limits of Traditional Storage

My presentation at the “Conference on Advanced Spatial Modelling and Analysis” consisted of some thoughts regarding Big Spatial Data, and how to handle it with modern technologies.

It was great to see such a motivated crowd from all generations, and to get to know the research developed by CEG, in topics such as Agent Based Modelling and Neural Networks. It was also great to talk again to Arnaud Banos, from the Complex System Institute of Paris Ile-de-France (ISC-PIF).


CSV 2 GeoJSON

Recently I had another challenge, which I believe has the characteristics of a common problem: I have a table with attributes, in CSV format, one of which is geospatial.

CSV is a structured format for storing tabular data (text and numbers), where each row corresponds to a record and each field is separated by a known character (generally a comma). It is probably one of the most common formats for distributing data, probably because it is a standard output from relational databases.

Since people often hand me data in this format, and for a number of reasons it is more convenient for me to use JSON data, I thought it would be handy to have a method for translating CSV into JSON, and this was the first milestone of this challenge.

The second milestone of this challenge is that there is some geospatial information within this data, serialized in a non-standard format, and I would like to convert it into a standard JSON format for spatial data, e.g.: GeoJSON. So the second milestone actually has two parts:

  • parse a GeoJSON geometry from the CSV fields
  • pack the geometry and the properties into a GeoJSON feature

To convert CSV (or XML) to JSON, I found this really nice website. It lets you upload a file and save the results into another file, so I could transform this:

TMC,ROADNUMBER,DIR,PROV,CCAA,StartLatitude,StartLongitude,EndLatitude,EndLongitude
E17+02412,A-2,E-90/AP-2/BARCELONA-ZARAGOZA (SOSES),LLEIDA,CATALUNYA,41.5368273,0.4387071,41.5388396,0.4638462

into this:

{
"TMC": "E17+02412",
"ROADNUMBER": "A-2",
"DIR": "E-90/AP-2/BARCELONA-ZARAGOZA (SOSES)",
"PROV": "LLEIDA",
"CCAA": "CATALUNYA",
"StartLatitude": "41.5368273",
"StartLongitude": "0.4387071",
"EndLatitude": "41.5388396",
"EndLongitude": "0.4638462"
}

This gave me a nicely formatted JSON output (the first milestone!), but as you can notice, the geometry does not conform to any OGC standard. It is actually a linestring, which is defined by a start point (StartLongitude, StartLatitude) and an end point (EndLongitude, EndLatitude).

According to the GeoJSON spec, a LineString is defined by an array of coordinates.

So the goal would be to transform the geometry above into:

"LineString",
"coordinates": [
[0.4387071, 41.5368273], [0.4638462, 41.5388396]
]

Once more, jq comes in really handy for this task.

The JSON can be transformed into an array of features using this syntax:

cat tramos.json | jq -c '[.[] | { type: "Feature", "geometry": {"type": "LineString","coordinates": [ [.StartLongitude, .StartLatitude| tonumber], [ .EndLongitude, .EndLatitude | tonumber] ] }, properties: {tmc: .TMC, roadnumber: .ROADNUMBER, dir: .DIR, prov: .PROV, ccaa: .CCAA}}]' > tramos.geojson

Since the JSON converter parses all the variables as strings, it is important to apply a filter (tonumber) to make sure that the coordinates are converted back into numbers. Each resulting feature looks like this:

{
"properties": {
"ccaa": "CATALUNYA",
"prov": "LLEIDA",
"dir": "N-IIA/SOSES/TORRES DE SEGRE/ALCARRàS",
"roadnumber": "A-2",
"tmc": "E17+02413"
},
"geometry": {
"coordinates": [
[
0.4714937,
41.5420936
],
[
0.4891472,
41.5497014
]
],
"type": "LineString"
},
"type": "Feature"
}

Since we are creating an array of features (or “FeatureCollection”), to conform with GeoJSON it is important to declare the root element too, by adding this outer element:

{ "type": "FeatureCollection","features": [ ]}

The result should be a valid GeoJSON file, which you can view and manipulate in your favourite GIS (for instance QGIS!) 🙂


Piping an API into R: a Data Science Workflow

Inspired by @jeroenhjanssens, author of the Data Science Toolbox, I decided to have a go at one of the most unfriendly data sources: an XML API.
Despite its rich syntax and query capabilities, I tend to think XML is highly verbose and human-unfriendly, which is quite discouraging if you don’t want to take advantage of all those capabilities. And in my case I didn’t: I just wanted to grab a data stream, in order to be able to build some analysis in R. APIs are generally a pain for data scientists, because they tend to want to “have a look at things” and get a general feeling for the dataset before they start building code. Normally this is not possible with an API, unless you use one of those high-end drag-and-drop interfaces, which are generally costly. But following the approach below, I was able to set up a chain of tools that enables me to reproduce this agile workflow, where you can get a feel for the dataset in R without having to write a Python client.

The first step was to pipe the XML output of the query into a file, which is easy enough to do with curl:

curl -s 'http://someurl.com/Data/Entity.ashx?Action=GetStuff&Par=59&Resolution=250&&token=OxWDsixG6n5sometoken' > out.xml

Now, if you are an XML wiz you can follow a different approach, but I personally feel more comfortable with JSON, so the next step for me was to convert the XML dump into some nice JSON, and fortunately there is another free tool for that too: xml2json

xml2json < out.xml > out.json

Having the JSON, it is possible to query it using jq, a command-line JSON parser that I find really intuitive. With the command below, I am able to narrow the dataset down to the fields I am interested in, and pipe the results into another text file. In this case I am skipping all the “headers” and grabbing an array of elements, which is what I want to analyse.

cat out.json | jq '[.Root.ResultSet.Entity[] | {color: .color, width: .with, average: .average, reference: .reference, Time: .Time}]' > test.json

Now here I could add another step to convert the JSON results into CSV, but R has interfaces to JSON, so why not use those to import the data directly. There is actually more than one package that can do this, but I had some nice results with jsonlite.

library("jsonlite")
data1 <- fromJSON("test.json")

And with these two lines of code, I have a data frame that I can use for running ML algorithms.


Post-processing OPTICSxi Clustering

In a previous post, I expressed my concerns regarding the results of OPTICSxi clustering. Namely, I mentioned an “annoying” spike effect, which shows up massively in almost every simulation (so massively that it is almost a “feature”).

A post on StackOverflow originated a useful exchange of ideas with the author of the ELKI framework. He pointed out a “weakness” of the algorithm, which bases its partitioning on the reachability distance, and that is not always synonymous with “spatial closeness”. In practice, outliers standing in the middle of clusters could be erroneously interpreted as belonging to one cluster or another.

Since this is a problem in the partitioning algorithm, the solution could involve improving the partitioning algorithm, using a different partitioning algorithm, or using a different clustering algorithm altogether. As a “quick fix”, I opted for some cluster “post-processing”, in order to remove the outliers.

So my research question was: how to identify a point that is a spatial outlier?

I tried a couple of approaches that I will discuss now, as I think they may be useful for someone or generate a useful discussion.

Image

On the image above, the coloured points are outliers, according to our definition. One simple approach would be to calculate, for each point, the average distance to all other points in the point cloud. We could then test that distance for each point and, if it is greater than a certain value (let us say 5), consider the point an outlier and remove it from the dataset. I found this approach actually yielded good results, and it was able to reduce the “spikes” that we see in this figure:

Image

To these results:

Image

The code that calculates the average distance for each point is below:

// dataMap is assumed to map DBID strings to the coordinates of each point
public static double getAverageDistance(Coordinate coord, ModifiableDBIDs ids,
        HashMap<String, Coordinate> dataMap) {
    double sum = 0.0;
    try {
        // Sum the distance from this point to every other point in the cluster
        for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance()) {
            sum += coord.distance(dataMap.get(DBIDUtil.toString(iter)));
        }
    } catch (Exception exp) {
        System.out.println("Unexpected exception: " + exp.getMessage());
    }
    // Average distance from this point to all points referenced by ids
    return sum / ids.size();
}

From the point of view of “correctness”, this algorithm suffers from the drawback of removing the outlier from the cluster (and thus converting it into “noise”). To avoid this, we would have to test whether the point actually belongs to another cluster.

Apart from that, the algorithm is computationally very “costly”: the most expensive part is computing the distances from all points to all points, literally an NxN matrix that can easily grow to huge numbers with a large cluster.

To avoid that, I tried a few different approaches. One was to calculate these distances for only part of the dataset. We know that, by the way the algorithm is written, the outliers tend to “appear” either at the beginning or the end of the cluster. With this in mind, I calculated the average based on the 60% “middle” values and tested the condition only for the first 20% and last 20% (this refinement was actually not needed, as the “costly” part of the algorithm is the distance matrix and not the condition testing). I was not able to reach any reasonable results with this approach, either because the points were not ordered (which defeats the whole purpose of my “slicing”) or because outliers were appearing outside these “classes” (i.e.: in the middle of the dataset).

The other approach I tried was to work with the “final” polygon (the convex hull of the cluster), rather than the raw points. The polygon border has only a few points when compared to the point cloud used to generate it, so it is very easy and quick to identify the “outlier” in the polygon border; however, removing it will understandably result in a “strange polygon”, because if that point had not been used to build the polygon, another point (which we don’t have at this stage!) would have been used, and so the real geometry would not be this one. This is particularly noticeable with overlapping clusters (which no longer overlap afterwards). After this experiment, it became clear that the processing would have to be done on the cluster point dataset, before building the polygon.

I ended up with an algorithm that successfully removes the “spike” effects from OPTICSxi, but it is rather costly in terms of time (more costly than OPTICSxi itself) and, unfortunately, this cost grows quadratically with the size of the dataset (the NxN distance matrix), which limits its application to big data.

UPDATE

The approach above was improved by testing the distances against the two neighbours of each point (the previous and next points), rather than against the entire matrix. With this hack, the running time of the algorithm was reduced to reasonable values that no longer grow so quickly with the size of the dataset.