Navigating GeoParquet: Lessons Learned from the eMOTIONAL Cities Project

The eMOTIONAL Cities project has set out to understand how the natural and built environment can shape the feelings and emotions of those who experience it. At its core lies a Spatial Data Infrastructure (SDI) which combines a variety of datasets from the Urban Health domain. These datasets should be available to urban planners, neuroscientists and other stakeholders, for analysis, creating data products and eventually making decisions based upon them.

Although the average size of a dataset is small (with few exceptions), scientists often want to combine several of these datasets in the same analysis, which creates a use case where we could benefit from format efficiency. For this reason, we recently decided to offer GeoParquet as an alternate encoding for the 100+ vector datasets published in the SDI.

What is GeoParquet

For those who have been distracted, GeoParquet is a format which encodes vector data in Apache Parquet. There is no reinventing the wheel here: Apache Parquet is a free and open-source column-oriented data storage format, which provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk; GeoParquet is just adding the spatial support on top of Parquet, leveraging the fact that most cloud data warehouses already understand it to achieve interoperability. Although GeoParquet started as a community effort, it is now on the path to become an OGC Standard and you can follow (or even contribute to) the spec on: https://github.com/opengeospatial/geoparquet

GeoParquet extends Parquet by adding some metadata about the file and for each geometry column; the number of mandatory columns is kept to a minimum, with some nice-to-have optional features.

Converting & Publishing the Data

Although a relatively new format, GeoParquet already spots a vibrant ecosystem of implementations to choose from. After a few experiments, we decided to use the GDAL library to convert the datasets, as it integrates better with our existing pipeline.

It should be noted that our source datasets are hosted in a S3 bucket in GeoJSON format and the idea was to place the GeoParquet files also in S3, so the idea was to read/write the files directly from S3.

Our pipeline uses GDAL wrapped in a bash script, which reads all the GeoJSON files in a folder on a S3 bucket and places the resulting GeoParquet files on a different folder in the same bucket. 

It should be noted that in order to support GeoParquet, GDAL > 3.8.4 should be used; to make things easier, we run GDAL with the right GDAL version from a docker container. The script is available in the etl-tools repository of the eMOTIONAL Cities project with an MIT license.

The GeoParquet files were validated directly from the S3 bucket using the gpq, a lightweight tool written in GO, which creates as well validates GeoParquet files. The gpq cli was wrapped in another bash script, available here. It should be noted that no validation errors were spotted from the created GeoParquet files.
In order to make the GeoParquet datasets discoverable, they were added to each collection record of the eMOTIONAL Cities catalogue using an item type “application/vnd.apache.parquet”, which can be negotiated by clients. See an example below for the  hex350_grid_obesity_1920 collection.

        {

            “href”:”https://emotional-cities.s3.eu-central-1.amazonaws.com/geoparquet/hex350_grid_obesity_1920.parquet”,

            “rel”:”item”,

            “type”:”application/vnd.apache.parquet”,

            “title”:”GeoParquet download link for hex350_grid_obesity_1920″

        },

The chart above shows the size of one of our largest datasets (activity_level_ldn) in different formats. GeoParquet translates into a smaller size, even when compared with binary formats such as Shapefile or GeoPackage. These smaller sizes, specially when multiplied by a large number of datasets, will translate in cost saving for hosting data; they will also provide a better experience for users which stream these datasets over the web for the purpose of analysis.

The chart above shows the total size of eMOTIONAL Cities datasets in various formats.

Socializing the Results

Although the GeoParquet files are discoverable by machines through the OGC API – Records catalogue, more work needs to be done in order to ensure that humans are aware of them. These are a few initiatives that we did, or plan to do, in order to socialise these results and encourage users to leverage the geoparquet datasets that we expose in the SDI:

  • May 2024: GeomobLX – Lighting Talk about “GeoParquet”.
  • October 2024: eMOTIONAL Cities webinar about GeoParquet (TBD)
  • December 2024: FOSS4G World – “Adding GeoParquet to a Spatial Data Infrastructure: What, Why and How” (submitted talk)

The image bellow shows an eMOTIONAL Cities GeoParquet dataset in QGIS. The out-of-the-box support in widely used tools like QGIS, is one of the most exciting things about GeoParquet, but we need to make sure that users know about it.

If you are curious to get your hands on some nice examples, the entire eMOTIONAL Cities catalogue is available to you. In each metadata record, you will find a link to the corresponding geoparquet file, which you can download locally or stream to your jupyter notebook.

This blog post ,and the work leading to it, was possible with the collaboration of my colleague Pascallike. A Big Thanks to him!

Creating Responsive Maps with Vector Tiles

Vector tiles have been around for a while and they seem to combine the best of both worlds. They provide design flexibility, something we usually associate to vector data, while enabling fast delivery, like we generally see on raster services. The mvt specification, based on Google’s protobuf format, packages geographic data into pre-defined roughly-square shaped “tiles” for transfer over the web.

The OGC API – Tiles standard, enables sharing vector tiles while ensuring interoperability among services. It is a very simple format, which formalizes what most applications are already doing in terms of tiling, while adding some interesting (optional) features. You can find more information on: tiles.developer.ogc.org .

If you want to publish vector tiles using this standard, you could use pygeoapi, which is a Python server implementation of the OGC API suite of standards and a reference implementation of OGC API – Tiles. With its plugin architecture, pygeoapi supports many different providers to render the tiles in the backend. One option could be to use the elastic search backend (mvt-elastic), which enables rendering vector tiles on the fly, from any index stored in elasticsearch. Recently, this provider also supports retrieving the properties (e.g.: fields) along with the geometry, which is needed for client side styling.

You can check some OGC API – Tiles collections in the eMOTIONAL Cities catalogue. On this map, we show the results of urban health outcomes (Prevalence rates of cardiovascular diseases) in 350m hexagonal grids of Inner London. It is rendered according to the mean value.

On the developer console, we can inspect how the attribute values of the vector tiles are exposed to the client.

Another option for interactive maps that require access to attributes, would be to retrieve a GeoJSON from an OGC API – Features endpoint. In that case, the client would need to load all the features at the start, and then carry these features in memory. If we have a high number of features, or many different layers, this could result in a less responsive application.

As an experiment, we loaded a web application with a base layer and two collections with 3480 and 3517 features (“hex350_grid_cardio_1920” and “hex350_grid_pm10_2019”). When the collections were loaded as vector tiles, the application took 20 milliseconds to load. On the other hand, when the collections were loaded as features it took 6887 milliseconds.

You can check out the code for this experiment at: https://github.com/emotional-cities/vtiles-example/tree/ecities and a map showing the vector tile layers at: https://emotional-cities.github.io/vtiles-example/demo-oat.htm

Data Analytics Bootcamp

I have always dreamed about doing some contribution towards improving the gender balance in technology, which as you may know, is far from ideal.

Fortunately the opportunity arose, when Katrina Walker has invited me to teach the “Data Analytics”  bootcamp at CodeOp, an international code school for women and TGNC individuals.

Over the 6-month course, I will share my hands-on experience with the various stages of the data analysis pipeline, specifically on how to apply various technologies to ingest, model and visualize data insights.

Rather than focusing on a specific technology, I will leverage on the “best tool for the job, approach”, which is what I do when I want to analyse data. This means learning different tools, such as Python, R, SQL or QGIS, and often combine them together.

For me “data analytics” is like a journey, where we start with a high-level problem, translate it into data and algorithms, and finally extract a high-level idea. At the start and the end of journey, we should always be able to communicate with people that are not “data geeks” and this is one idea that I would like to pass in the course.

I will not add anything else, apart that I am really excited to get started!

codeops2

Spatial Data Mining

Social media streams may generate massive clouds of geolocated points, but how can we extract useful information from these, sometimes huge, datasets? I think machine learning and GIS can be helpful here.

My PechaKucha talk at DataBeers : “Visualizing Geolocated Tweets: a Spatial Data Mining Approach”.

Driving Spatial Analysis Beyond the Limits of Traditional Storage

My presentation on the “Conference on Advanced Spatial Modelling and Analysis” consisted in some thoughts regarding Big Spatial Data, and how to handle them in terms of modern technologies.

It was great to see such a motivated crowd from all generations, and to get to know the research developed by CEG, in topics such as Agent Based Modelling and Neural Networks. It was also great to talk again to Arnaud Banos, from the Complex System Institute of Paris Ile-de-France (ISC-PIF).

p_ceg

Importing Large Spatial Datasets into PostGIS

Is PostgreSQL/PostGIS suitable for Big Data? This is a question I have been asking myself for a while, and I now have the opportunity to test it with a PostgreSQL instance from Amazon Web Services (AWS); after all, if such a service cannot deal with “Big Data”, who can?
I have a dataset with Tweets in the city of Barcelona, over a time period. The entire dataset amounts to about 3 million points (82 MB), and I have some subsets: a smaller collection of about 2 million points (55 MB) and 300 000 points (8.2. MB).

big_data

We can argue whether this dataset is, or not, Big Data; but IMHO, a table with more than 1 million records starts to get “heavy” for any relational database. Now before doing any computations of spatial algorithms (e.g.: convex hull, point in polygon), the very first step is to actually load this dataset into the database. And this is where I started to struggle.

QGIS has a plugin called SPIT, that allows to import a Shapefile into PostGIS. Unfortunately it seems to be extremely slow any of the three files that I mentioned above (including the smaller one).

spit

Switching to the command line, where I have more control knowledge about the processes that are running, I tried shp2pgsql, a tool for importing Shapefiles that comes with the PostGIS installation.

On Linux it is possible to convert and import the Shapefile in one go, redirecting the output with pipe:

shp2pgsql -s <SRID> -c -D -I <path to shapefile> <schema>.<table> | \
psql -d <databasename> -h <hostname> -U <username>

Since this command was really slow (unresponsive?) I decided to do it in two steps, to see where it was hanging:

  • Create an sql file with the insert statements.
  • Load this sql file into the database.

Unsurprisingly, the SQL file with the statements was generated quite quickly, and the processing effort was being spent on the insert queries. This situation was verified with the three Shapefiles, even when I inserted a spatial index (Gist). After thinking a bit, I came to the conclusion that the spatial index should be created within the “CREATE TABLE” statement, so it should actually speed up the queries a bit (but I did not confirm this guess).

My next attempt, was to use ogr2ogr, the GDAL tool to convert between vector formats.

ogr2ogr -f "PostgreSQL" PG:"host=host user=user dbname=db password=pass" data.shp

I have to say that I read some suggestion to prefer shp2pgsql over this tool, and in fact it was quite slow as well (at this point I was only testing the smallest of the Shapefiles).

My conclusion up to this moment, is that it is quite expensive in terms of time (and processor) to import a large dataset into a PostGIS. Some ideas that could be explored:

  • Upgrade the PostGIS server (would scale vertically make any difference?).
  • Upload the data into the Amazon server and run the import tool from there. I am not sure whether it is possible to do that, but it seems like a good improvement to do this operation locally, and using the power of the Amazon Server, rather than my desktop computer.
  • Split the Shapefile into chunks of data. I am not entirely sure how to do this, and any suggestions would be welcome.

Next steps: once I manage to import a large Shapefile into my PostGIS instance, I will post some feedback about the computation of spatial operations.

Update: Apparently I underestimated the fact that I am importing data over a network connection, with a limited latency. The best thing would be to make the import locally, an option that I unfortunately don’t have, since the Amazon RDS does not give me SSH access (it is a service, not a machine!). I decided to switch to a new approach: import a plain csv file into a database table, using the /copy command, and that was actually really fast (a few seconds).

psql database -U user -h host.eu-west-1.rds.amazonaws.com -c "\copy newt_table from 'data.csv' with DELIMITER ','"

Of course, I then had to run a query populate the point geometry from the float (x,y) values, and create an update a spatial index (for a matter efficiency it is recommended that indexes are created after, and not during the import!).

UPDATE new_table SET geom = ST_SetSRID(ST_MakePoint(lon,lat),4326);

CREATE INDEX [indexname] ON [tablename] USING GIST ( [geometryfield] );

VACUUM ANALYZE [table_name] [column_name];

That was still considerably fast (although not as much as the csv import!), and my guess is: it is because I are now running queries on the remote machine (and not uploading data via an internet connection).

Following some very usefull on Stack Overflow, I decided to retry my previous approaches, tuning the configuration parameters.

First I tried the OGR command line tool, but this time with the config flag “PG_USE_COPY YES”, to force dumping of the data:

ogr2ogr -progress -f PostgreSQL PG:"dbname='database' user='user' host='host.eu-west-1.rds.amazonaws.com' port='5432' data.shp --config PG_USE_COPY YES -nlt POINT -nln import_ogr

Although speed has improved massively, it still took me 1:03:00 to run this query.

Then I tried the shp2pgsql tool, with the “-D” flag, which also forces the dump of data:

shp2pgsql -D -s 4326 data.shp import_shp2pgsql | psql

This was remarkably fast: 00:01:04

compare_imports

Comparing the two imports, we can see that shp2pgsql did not create any index (which we already commented is a costly operation), while ogr2ogr created a gist index on the spatial field. The extra step of creating an index, has to be added to the shp2pgsql import, but as we have seen before, that operation is not a problem since we are then operating on the server.

I conclude this review, stating that shp2pgsql has a great performance in the task of  loading a Big Data spatial table into a remote PostGIS server (e.g.: Amazon RDS). It is recommended that the “-D” flag is passed to the tool, and that no indexes are created during this operation.

Bellow we can see the PostGIS geometry of the imported table, with more than 3 million points:

layer_shp2pgsql