Navigating GeoParquet: Lessons Learned from the eMOTIONAL Cities Project

The eMOTIONAL Cities project has set out to understand how the natural and built environment can shape the feelings and emotions of those who experience it. At its core lies a Spatial Data Infrastructure (SDI) which combines a variety of datasets from the Urban Health domain. These datasets should be available to urban planners, neuroscientists and other stakeholders, for analysis, creating data products and eventually making decisions based upon them.

Although the average size of a dataset is small (with few exceptions), scientists often want to combine several of these datasets in the same analysis, which creates a use case where we could benefit from format efficiency. For this reason, we recently decided to offer GeoParquet as an alternate encoding for the 100+ vector datasets published in the SDI.

What is GeoParquet

For those who have been distracted, GeoParquet is a format which encodes vector data in Apache Parquet. There is no reinventing the wheel here: Apache Parquet is a free and open-source column-oriented data storage format, which provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk; GeoParquet simply adds spatial support on top of Parquet, leveraging the fact that most cloud data warehouses already understand it to achieve interoperability. Although GeoParquet started as a community effort, it is now on the path to becoming an OGC Standard and you can follow (or even contribute to) the spec at: https://github.com/opengeospatial/geoparquet

GeoParquet extends Parquet by adding some metadata about the file and about each geometry column; the number of mandatory fields is kept to a minimum, with some nice-to-have optional features.
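
To make the metadata idea more concrete, here is a minimal sketch of what that "geo" metadata could look like, built with plain Python. It assumes version 1.0.0 of the specification, where (as far as we understand it) the required file-level fields are version, primary_column and columns, and each geometry column requires encoding and geometry_types. The function name and the default values are hypothetical, chosen only for illustration.

```python
import json


def build_geo_metadata(primary_column="geometry", geometry_types=("Polygon",)):
    """Return the JSON string that would be stored under the Parquet
    file-level metadata key "geo" (sketch, assuming GeoParquet 1.0.0)."""
    geo = {
        # Required file-level fields.
        "version": "1.0.0",
        "primary_column": primary_column,
        "columns": {
            primary_column: {
                # Required per-column fields; optional ones such as "crs"
                # or "bbox" are deliberately left out of this sketch.
                "encoding": "WKB",
                "geometry_types": list(geometry_types),
            }
        },
    }
    return json.dumps(geo)


if __name__ == "__main__":
    print(build_geo_metadata())
```

A real writer (GDAL, for instance) fills this in for you; the sketch only shows how little is mandatory.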

Converting & Publishing the Data

Although a relatively new format, GeoParquet already sports a vibrant ecosystem of implementations to choose from. After a few experiments, we decided to use the GDAL library to convert the datasets, as it integrates better with our existing pipeline.

It should be noted that our source datasets are hosted in an S3 bucket in GeoJSON format, and we wanted to place the GeoParquet files in S3 as well, so the idea was to read and write the files directly from S3.

Our pipeline uses GDAL wrapped in a bash script, which reads all the GeoJSON files in a folder of an S3 bucket and places the resulting GeoParquet files in a different folder in the same bucket.

It should be noted that in order to support GeoParquet, GDAL > 3.8.4 should be used; to make things easier, we run the right GDAL version from a Docker container. The script is available in the etl-tools repository of the eMOTIONAL Cities project under an MIT license.
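
The conversion step can be sketched roughly as follows. This is not the project's bash script, just a small Python illustration of the same idea: for each GeoJSON object, build an ogr2ogr call that reads and writes through GDAL's /vsis3/ virtual file system, using the Parquet driver. The bucket name, the folder names and the function names are assumptions made for the example, and AWS credentials are expected to be configured in the environment.

```python
import posixpath
import subprocess


def ogr2ogr_command(bucket, src_key, dst_prefix):
    """Build the ogr2ogr call that converts one GeoJSON object to GeoParquet."""
    stem = posixpath.splitext(posixpath.basename(src_key))[0]
    src = f"/vsis3/{bucket}/{src_key}"
    dst = f"/vsis3/{bucket}/{dst_prefix.rstrip('/')}/{stem}.parquet"
    # ogr2ogr expects the destination before the source.
    return ["ogr2ogr", "-f", "Parquet", dst, src]


def convert_all(bucket, src_keys, dst_prefix, dry_run=True):
    """Build (and, unless dry_run is set, execute) one command per source key."""
    commands = [ogr2ogr_command(bucket, key, dst_prefix) for key in src_keys]
    if not dry_run:
        for cmd in commands:
            # Requires a GDAL build with Parquet support on the PATH.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical bucket and keys; the dry run only prints the commands.
    for cmd in convert_all("my-bucket", ["geojson/a.geojson"], "geoparquet/"):
        print(" ".join(cmd))
```

Keeping a dry-run mode makes it easy to review the generated commands before anything is written to the bucket.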

The GeoParquet files were validated directly from the S3 bucket using gpq, a lightweight tool written in Go, which creates as well as validates GeoParquet files. The gpq CLI was wrapped in another bash script, available here. It should be noted that no validation errors were found in the created GeoParquet files.

In order to make the GeoParquet datasets discoverable, they were added to each collection record of the eMOTIONAL Cities catalogue as an item link of type “application/vnd.apache.parquet”, which can be negotiated by clients. See an example below for the hex350_grid_obesity_1920 collection.

        {
            "href": "https://emotional-cities.s3.eu-central-1.amazonaws.com/geoparquet/hex350_grid_obesity_1920.parquet",
            "rel": "item",
            "type": "application/vnd.apache.parquet",
            "title": "GeoParquet download link for hex350_grid_obesity_1920"
        },
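
A client could pick up that link by filtering on the media type. The snippet below is only a sketch: it assumes the record has already been fetched and parsed into a dictionary with a "links" array, and the function name and the second sample link are made up for illustration.

```python
GEOPARQUET_TYPE = "application/vnd.apache.parquet"

# Trimmed-down record; only the first link comes from the example above.
SAMPLE_RECORD = {
    "links": [
        {
            "href": "https://emotional-cities.s3.eu-central-1.amazonaws.com/geoparquet/hex350_grid_obesity_1920.parquet",
            "rel": "item",
            "type": GEOPARQUET_TYPE,
            "title": "GeoParquet download link for hex350_grid_obesity_1920",
        },
        # Hypothetical extra link, to show that other types are skipped.
        {"href": "https://example.org/record.json", "rel": "self", "type": "application/json"},
    ]
}


def find_geoparquet_links(record):
    """Return the href of every link advertised with the Parquet media type."""
    return [
        link["href"]
        for link in record.get("links", [])
        if link.get("type") == GEOPARQUET_TYPE and "href" in link
    ]


if __name__ == "__main__":
    for href in find_geoparquet_links(SAMPLE_RECORD):
        print(href)
```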

The chart above shows the size of one of our largest datasets (activity_level_ldn) in different formats. GeoParquet translates into a smaller size, even when compared with binary formats such as Shapefile or GeoPackage. These smaller sizes, especially when multiplied by a large number of datasets, will translate into cost savings for hosting data; they will also provide a better experience for users who stream these datasets over the web for the purpose of analysis.

The chart above shows the total size of eMOTIONAL Cities datasets in various formats.

Socializing the Results

Although the GeoParquet files are discoverable by machines through the OGC API – Records catalogue, more work needs to be done in order to ensure that humans are aware of them. These are a few initiatives that we have carried out, or plan to carry out, in order to socialize these results and encourage users to leverage the GeoParquet datasets that we expose in the SDI:

  • May 2024: GeomobLX – Lightning Talk about “GeoParquet”.
  • October 2024: eMOTIONAL Cities webinar about GeoParquet (TBD)
  • December 2024: FOSS4G World – “Adding GeoParquet to a Spatial Data Infrastructure: What, Why and How” (submitted talk)

The image below shows an eMOTIONAL Cities GeoParquet dataset in QGIS. The out-of-the-box support in widely used tools like QGIS is one of the most exciting things about GeoParquet, but we need to make sure that users know about it.

If you are curious to get your hands on some nice examples, the entire eMOTIONAL Cities catalogue is available to you. In each metadata record, you will find a link to the corresponding GeoParquet file, which you can download locally or stream to your Jupyter notebook.

This blog post, and the work leading to it, was possible with the collaboration of my colleague Pascallike. A big thanks to him!

Watching a Server through a Container

Lately I have been working a lot with Docker, the new kid on the block in cloud computing, which is winning the hearts of sysadmins as well as developers.

The main idea is to set up a Spatial Data Infrastructure, something that has been at the core of other projects such as Georchestra.

Unfortunately, having something running on a server is normally not a completely smooth experience, and this creates the need for a monitoring service.

After searching a bit, I found NewRelic, which provides monitoring on a service basis. I really liked the advanced functionality and the completeness of the dashboards, so it was not hard to convince myself to try it.

NewRelic provides two types of monitoring: application monitoring and server monitoring, which is what I will cover today in this post. The server monitor is basically a daemon that runs on the server and collects statistics about various metrics, such as memory usage, CPU usage, bandwidth, etc. But what really caught my eye about this solution was the ability to monitor the docker daemon and the different containers that run within it.

Unfortunately this functionality appears to be broken for docker 1.11 (my current version), but with the help of the NewRelic engineers I was able to apply a workaround.

My next step was to dockerize this solution. After all, wouldn’t it be great to spin up another container in my SDI, that would monitor the other containers AND the server?

The bad news is that the existing images of NewRelic’s server on Docker Hub do not implement the workaround. So I went and implemented my own image.

You can pull this image from the repository, with:

docker pull doublebyte/newrelic_sysmond

Then you can run it with:

docker run -d \
--privileged=true --name nrsysmond \
--pid=host \
--net=host \
-v /sys:/sys \
-v /dev:/dev \
--env="NRSYSMOND_license_key=REPLACE_BY_NEWRELIC_KEY" \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /var/log:/var/log:rw \
doublebyte/newrelic_sysmond

The privileged flag and the bindings to the host directories are necessary, because we need to be able to watch the docker daemon and collect the docker metrics.

Note that if you also want to collect memory stats of the containers, it is necessary to enable this in the kernel. The procedure is explained in the Docker documentation, but it really comes down to updating the bootloader configuration and restarting. In the case of grub, you would need to add this line to /etc/default/grub:

GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"

Then you need to update grub with:

update-grub

After a restart of the server, the docker memory statistics should be present on the server dashboard.
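
If the memory statistics do not show up, one quick thing to check is whether the kernel was really booted with the new parameters. The sketch below reads the running kernel's command line from /proc/cmdline (a standard Linux file) and reports which of the two flags are missing; the function names are my own and this is only a sanity check, not part of the NewRelic setup.

```python
REQUIRED_FLAGS = ("cgroup_enable=memory", "swapaccount=1")


def missing_cgroup_flags(cmdline, required=REQUIRED_FLAGS):
    """Return the required kernel parameters absent from a command line string."""
    present = set(cmdline.split())
    return [flag for flag in required if flag not in present]


def read_kernel_cmdline(path="/proc/cmdline"):
    """Read the command line the running Linux kernel was booted with."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    missing = missing_cgroup_flags(read_kernel_cmdline())
    if missing:
        print("Missing kernel parameters:", ", ".join(missing))
    else:
        print("Memory and swap accounting are enabled on the kernel command line.")
```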