Navigating GeoParquet: Lessons Learned from the eMOTIONAL Cities Project

The eMOTIONAL Cities project has set out to understand how the natural and built environment can shape the feelings and emotions of those who experience it. At its core lies a Spatial Data Infrastructure (SDI) that combines a variety of datasets from the Urban Health domain. These datasets should be available to urban planners, neuroscientists and other stakeholders for analysis, for creating data products and, eventually, for making decisions based upon them.

Although the average dataset is small (with a few exceptions), scientists often want to combine several of these datasets in the same analysis, which creates a use case that benefits from format efficiency. For this reason, we recently decided to offer GeoParquet as an alternative encoding for the 100+ vector datasets published in the SDI.

What is GeoParquet?

For those who haven’t been paying attention, GeoParquet is a format that encodes vector data in Apache Parquet. There is no reinventing the wheel here: Apache Parquet is a free and open-source column-oriented data storage format which provides efficient data compression and encoding schemes, with enhanced performance for handling complex data in bulk. GeoParquet just adds spatial support on top of Parquet, leveraging the fact that most cloud data warehouses already understand Parquet to achieve interoperability. Although GeoParquet started as a community effort, it is now on the path to becoming an OGC Standard, and you can follow (or even contribute to) the spec at: https://github.com/opengeospatial/geoparquet

GeoParquet extends Parquet by adding metadata about the file as a whole and about each geometry column; the number of mandatory metadata fields is kept to a minimum, with some nice-to-have optional ones.
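
For the curious, here is a minimal sketch of inspecting that metadata with pyarrow; the file name is a placeholder. The metadata is stored as JSON under the "geo" key of the Parquet key-value metadata, which pyarrow exposes on the schema.

    # A minimal sketch, assuming a local GeoParquet file named dataset.parquet.
    import json
    import pyarrow.parquet as pq

    schema = pq.read_schema("dataset.parquet")
    # GeoParquet stores its metadata as a JSON document under the "geo" key
    geo = json.loads(schema.metadata[b"geo"])

    print(geo["version"], geo["primary_column"])
    # per-column metadata: encoding, geometry types, optional crs, bbox, ...
    print(geo["columns"][geo["primary_column"]])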

Converting & Publishing the Data

Although it is a relatively new format, GeoParquet already sports a vibrant ecosystem of implementations to choose from. After a few experiments, we decided to use the GDAL library to convert the datasets, as it integrates best with our existing pipeline.

Our source datasets are hosted in an S3 bucket in GeoJSON format, and we wanted the GeoParquet files to live in S3 as well, so the plan was to read and write the files directly from S3.

Our pipeline uses GDAL wrapped in a bash script, which reads all the GeoJSON files in a folder of an S3 bucket and places the resulting GeoParquet files in a different folder of the same bucket.

Note that in order to support GeoParquet, GDAL newer than 3.8.4 should be used; to make things easier, we run the right GDAL version from a Docker container. The script is available in the etl-tools repository of the eMOTIONAL Cities project under an MIT license.
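
For illustration, this is roughly what the conversion step does, sketched in Python instead of bash; the bucket name is a placeholder, and it assumes a GDAL build with the Parquet driver plus AWS credentials available in the environment.

    # A rough sketch of the conversion step; bucket and key names are placeholders.
    from osgeo import gdal

    gdal.UseExceptions()

    # /vsis3/ is GDAL's virtual filesystem for S3, so no local copies are needed
    src = "/vsis3/my-bucket/geojson/hex350_grid_obesity_1920.geojson"
    dst = "/vsis3/my-bucket/geoparquet/hex350_grid_obesity_1920.parquet"

    gdal.VectorTranslate(dst, src, format="Parquet")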

The GeoParquet files were validated directly from the S3 bucket using gpq, a lightweight tool written in Go which can create as well as validate GeoParquet files. The gpq CLI was wrapped in another bash script, available here. No validation errors were found in the generated GeoParquet files.
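
In Python terms, that wrapper boils down to something like the sketch below; the file list is hard-coded here, whereas the real script lists the bucket contents.

    # A hedged sketch of the validation step, shelling out to the gpq CLI;
    # in the real script the file list comes from the S3 bucket.
    import subprocess

    files = ["hex350_grid_obesity_1920.parquet"]

    for path in files:
        # "gpq validate" checks a file against the GeoParquet specification
        result = subprocess.run(["gpq", "validate", path],
                                capture_output=True, text=True)
        status = "valid" if result.returncode == 0 else result.stdout
        print(path, status)
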
To make the GeoParquet datasets discoverable, they were added to each collection record of the eMOTIONAL Cities catalogue using an item link of type “application/vnd.apache.parquet”, which can be negotiated by clients. See an example below for the hex350_grid_obesity_1920 collection.

        {
            "href": "https://emotional-cities.s3.eu-central-1.amazonaws.com/geoparquet/hex350_grid_obesity_1920.parquet",
            "rel": "item",
            "type": "application/vnd.apache.parquet",
            "title": "GeoParquet download link for hex350_grid_obesity_1920"
        },
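
A client crawling the catalogue can then pick out the GeoParquet representation by its media type. Below is a small sketch; the record URL is illustrative, and the links structure follows OGC API – Records.

    # A sketch of a client selecting the GeoParquet link from a catalogue record
    # by media type; the record URL below is illustrative, not the real endpoint.
    import requests

    record = requests.get(
        "https://example-catalogue/collections/main/items/hex350_grid_obesity_1920",
        headers={"Accept": "application/json"},
    ).json()

    parquet_links = [
        link["href"]
        for link in record.get("links", [])
        if link.get("type") == "application/vnd.apache.parquet"
    ]
    print(parquet_links)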

The chart above shows the size of one of our largest datasets (activity_level_ldn) in different formats. GeoParquet translates into a smaller size, even when compared with binary formats such as Shapefile or GeoPackage. These smaller sizes, especially when multiplied by a large number of datasets, translate into cost savings for hosting the data; they also provide a better experience for users who stream these datasets over the web for analysis.

The chart above shows the total size of eMOTIONAL Cities datasets in various formats.

Socializing the Results

Although the GeoParquet files are discoverable by machines through the OGC API – Records catalogue, more work needs to be done to ensure that humans are aware of them. Here are a few initiatives that we have carried out, or plan to carry out, to socialize these results and encourage users to leverage the GeoParquet datasets we expose in the SDI:

  • May 2024: GeomobLX – Lightning Talk about “GeoParquet”.
  • October 2024: eMOTIONAL Cities webinar about GeoParquet (TBD)
  • December 2024: FOSS4G World – “Adding GeoParquet to a Spatial Data Infrastructure: What, Why and How” (submitted talk)

The image below shows an eMOTIONAL Cities GeoParquet dataset in QGIS. The out-of-the-box support in widely used tools like QGIS is one of the most exciting things about GeoParquet, but we need to make sure that users know about it.

If you are curious to get your hands on some nice examples, the entire eMOTIONAL Cities catalogue is available to you. In each metadata record, you will find a link to the corresponding GeoParquet file, which you can download locally or stream into your Jupyter notebook.
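
As a quick example, streaming the file from the record shown earlier into geopandas looks something like this (geopandas.read_parquet accepts a file-like object, so we fetch the bytes first):

    # A minimal sketch of loading a published GeoParquet file into geopandas.
    import io
    import urllib.request

    import geopandas as gpd

    url = ("https://emotional-cities.s3.eu-central-1.amazonaws.com/"
           "geoparquet/hex350_grid_obesity_1920.parquet")

    with urllib.request.urlopen(url) as resp:
        gdf = gpd.read_parquet(io.BytesIO(resp.read()))

    print(gdf.crs, gdf.shape)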

This blog post, and the work leading to it, was made possible by the collaboration of my colleague Pascallike. A big thanks to him!

Creating Responsive Maps with Vector Tiles

Vector tiles have been around for a while, and they seem to combine the best of both worlds: they provide design flexibility, something we usually associate with vector data, while enabling fast delivery, as we generally see with raster services. The MVT specification, based on Google’s Protocol Buffers format, packages geographic data into pre-defined, roughly square “tiles” for transfer over the web.

The OGC API – Tiles standard enables sharing vector tiles while ensuring interoperability among services. It is a very simple standard which formalizes what most applications are already doing in terms of tiling, while adding some interesting (optional) features. You can find more information at tiles.developer.ogc.org.

If you want to publish vector tiles using this standard, you could use pygeoapi, a Python server implementation of the OGC API suite of standards and a reference implementation of OGC API – Tiles. With its plugin architecture, pygeoapi supports many different providers to render the tiles in the backend. One option is the Elasticsearch backend (mvt-elastic), which enables rendering vector tiles on the fly from any index stored in Elasticsearch. Recently, this provider gained support for retrieving the properties (i.e., fields) along with the geometry, which is needed for client-side styling.
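
To give an idea of what a client request looks like, here is a sketch of fetching a single tile from an OGC API – Tiles endpoint; the host and tile coordinates are illustrative.

    # A sketch of requesting one vector tile following the OGC API - Tiles URL
    # template /collections/{id}/tiles/{tileMatrixSet}/{z}/{row}/{col};
    # the host and tile coordinates are illustrative.
    import requests

    base = "https://example-sdi/collections/hex350_grid_cardio_1920"
    z, row, col = 12, 1361, 2046  # roughly Inner London on WebMercatorQuad

    tile = requests.get(
        f"{base}/tiles/WebMercatorQuad/{z}/{row}/{col}",
        headers={"Accept": "application/vnd.mapbox-vector-tile"},
    )
    mvt_payload = tile.content  # protobuf-encoded Mapbox Vector Tile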

You can check some OGC API – Tiles collections in the eMOTIONAL Cities catalogue. On this map, we show the results of urban health outcomes (prevalence rates of cardiovascular diseases) in 350 m hexagonal grids of Inner London, rendered according to the mean value.

In the developer console, we can inspect how the attribute values of the vector tiles are exposed to the client.

Another option for interactive maps that require access to attributes would be to retrieve GeoJSON from an OGC API – Features endpoint. In that case, the client needs to load all the features at the start and then carry them in memory. With a high number of features, or many different layers, this can result in a less responsive application.
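
For comparison, the Features approach pulls everything in a single (potentially large) response, along the lines of the sketch below; the host and limit are illustrative.

    # A sketch of the OGC API - Features alternative: one request that returns
    # every feature as GeoJSON, which the client then keeps in memory.
    import requests

    fc = requests.get(
        "https://example-sdi/collections/hex350_grid_cardio_1920/items",
        params={"f": "json", "limit": 4000},
    ).json()

    print(len(fc["features"]))  # all features now live in client memory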

As an experiment, we loaded a web application with a base layer and two collections with 3480 and 3517 features (“hex350_grid_cardio_1920” and “hex350_grid_pm10_2019”). When the collections were loaded as vector tiles, the application took 20 milliseconds to load; when they were loaded as features, it took 6887 milliseconds.

You can check out the code for this experiment at: https://github.com/emotional-cities/vtiles-example/tree/ecities and a map showing the vector tile layers at: https://emotional-cities.github.io/vtiles-example/demo-oat.htm

Welcome to the “Obscure” World of Databases

How would we picture a non-doctor entering an operating room and performing surgery? Probably pretty badly. Leaving aside the scale of the consequences, to me this is very similar to the image of a non-database person designing a database.

For several reasons, people who have not been formally trained as computer scientists find themselves performing IT tasks, some of them very sophisticated. I don’t see anything wrong with people who were not formally trained embracing an IT career (actually, I am one of them), as long as… they do embrace it. That is: they study or research the things they don’t know, and they care about standards of quality. If they don’t, they just bring “shame” on everyone working in the field by dragging the level down.

All these thoughts – which are actually recurrent in my life – were triggered by analysing what I thought was a relational database. By going through it and refactoring it, I gathered a pretty good collection of things you should not do when designing a database. These are some ideas that I would like to clarify; if you think they are obvious, you would be surprised by what I saw.

Choosing a relational database management system (RDBMS) does not automatically mean that you have a relational database. A relational database is a database organized in terms of the relational model, as formulated by Edgar F. Codd. According to Wikipedia[1], the purpose of this model is to “provide a declarative method for specifying data and queries: users directly state what information the database contains and what information they want from it, and let the database management system software take care of describing data structures for storing the data and retrieval procedures for answering queries”. It is a requirement of the model that users state exactly what the database contains and describe properly how it is organized; only in this way is it possible to query the database and obtain exact answers.

If the database is a repository of information, some of it unknown or unnecessary and not really organized, then we are not modelling the data, only storing it. The outcome is that we cannot query the database and produce any useful answers about our data (there are no miracles!). We can store the information, but not in a much better way than if we were storing it in a filesystem. Users of this repository may implement their own ways of dealing with the data by creating relations on the fly, either by using the SQL engine or by pulling the data out and doing it somewhere else. However, they gain nothing from the data model, as they have to figure out for themselves how the data is organized, and there is no guarantee that any two people would do it the same way.

My first bullet point is that there is absolutely no point in using a relational database engine and *not* implementing a relational model; it is probably a worse solution than using a filesystem, because it may give people the wrong idea: that there is a relational model.

If people are not implementing the relational model because they do not know the data, or because the data is not organized in a relational manner, that is an explanation I can accept. In fact, I think that in many cases (some of which are currently being approached relationally) the rigid structures of a top-down design do not provide a good solution, because knowing everything (or at least a lot) about the data is a weak assumption. For these cases there is NoSQL[2], and document-oriented databases in particular provide a very flexible model, close to what people implement with filesystems. In other words: this is fine, but then don’t use a relational database engine.

On the other hand, if people are not implementing the relational model because they don’t know it, then they should learn about it, at least enough to be able to decide against it and go for a NoSQL database instead. Finally, if they do not want to learn about databases or relational models because they are “just” biologists or economists, that is also fine, but then please don’t let them design a database, especially one where you want to store valuable data. And I am afraid of the institutions that let this sort of thing happen.


[1] http://en.wikipedia.org/wiki/Relational_database

[2] http://en.wikipedia.org/wiki/NoSQL