pickle free centroids hdf by emanuel-schmid · Pull Request #1076 · CLIMADA-project/climada_python

emanuel-schmid · 2025-07-23T06:37:08Z

Changes proposed in this PR:

centroids are no longer stored as pickled shapely objects in hdf5
a compression is introduced
in case centroids are strictly points, they are stored as x and y columns
otherwise as (pickled) wkb, same as for exposures (Avoid pickling shapely object in Exposures.write_hdf5 #1051)
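
To make the two storage options concrete, here is a minimal standard-library sketch. It is not the code from this PR: the helper names `point_to_wkb` and `wkb_to_point` are hypothetical, and in practice shapely would produce the WKB. It only shows why point-only centroids fit into plain numeric x and y columns, while the fallback holds one opaque bytes value per geometry.

```python
import struct

# Little-endian WKB for a 2D point:
# byte-order flag (1 = little endian), geometry type (1 = Point), x, y.
_WKB_POINT = struct.Struct("<BIdd")


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    return _WKB_POINT.pack(1, 1, x, y)


def wkb_to_point(blob):
    """Decode a little-endian WKB point back into an (x, y) tuple."""
    byte_order, geom_type, x, y = _WKB_POINT.unpack(blob)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("not a little-endian WKB point")
    return x, y


points = [(8.55, 47.37), (-73.97, 40.78)]

# Option 1: strictly points, so two plain float columns are enough.
xy_columns = {"x": [p[0] for p in points], "y": [p[1] for p in points]}

# Option 2: general fallback, one bytes value per geometry.
wkb_column = [point_to_wkb(x, y) for x, y in points]
restored = [wkb_to_point(blob) for blob in wkb_column]
```

The numeric columns need no serialization step at all, which is presumably why they are preferred when every geometry is a point.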

This PR is a pragmatic workaround for the problem described in #1055.


chahank

Good changes, although some tests are failing now.

As a note, centroids are currently never supposed to be anything other than points, as the impact calculations cannot deal with other geometries in the centroids at the moment. Still, I think it does not hurt to have the code ready.

peanutfun

Thank you for tackling this issue! I have some concerns regarding the readability and performance of the code, especially because the entire data needs to be copied. I also think the tests need to be adapted a bit. See my comments.

peanutfun · 2025-08-21T09:22:36Z

climada/hazard/centroids/centr.py

-        store.close()
+        xycols = []
+        wkbcols = []
+        store = pd.HDFStore(file_name, mode=mode, complevel=9)


For zlib, it seems that high compression levels only slightly reduce the file size while costing a lot of performance. A lower value seems more advisable to me. See https://www.pytables.org/usersguide/optimization.html#compression-issues

Suggested change

-        store = pd.HDFStore(file_name, mode=mode, complevel=9)
+        store = pd.HDFStore(file_name, mode=mode, complevel=3)

I'll leave it as it is: in an arbitrary test, the lower level reduced CPU time by 0.4 seconds (about 15%) and increased the file size by 2M (about 10%). From a $ point of view, complevel 9 seems justified.

15% decrease vs. 10% increase seems like an argument for a lower complevel, from my point of view. But I guess it's not that relevant. In case we run into some issues, we might consider making this a method kwarg in the future.
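
The trade-off discussed in this thread can be measured on one's own data. The following is a rough sketch that uses the standard-library `zlib` module directly rather than `pd.HDFStore`, on a synthetic lon/lat grid (an assumption, not the data from the test mentioned above), so the numbers will differ from those quoted here.

```python
import struct
import time
import zlib


def grid_bytes(n=200, step=0.1):
    """Pack an n-by-n lon/lat grid as little-endian doubles (synthetic data)."""
    lon = [i * step for _ in range(n) for i in range(n)]
    lat = [j * step for j in range(n) for _ in range(n)]
    return struct.pack(f"<{2 * n * n}d", *lon, *lat)


def measure(raw, level):
    """Compress raw bytes at the given zlib level; return size, seconds, payload."""
    start = time.perf_counter()
    packed = zlib.compress(raw, level)
    elapsed = time.perf_counter() - start
    return len(packed), elapsed, packed


raw = grid_bytes()
results = {level: measure(raw, level) for level in (3, 9)}
for level, (size, elapsed, _) in results.items():
    print(f"level={level}: {size} bytes, {elapsed:.4f} s")
```

Timings on a single run are noisy, so repeating the measurement a few times before drawing conclusions is advisable.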


peanutfun · 2025-08-21T09:37:46Z

climada/hazard/centroids/centr.py

+            for col in pandas_df.columns:
+                if str(pandas_df[col].dtype) == "geometry":


Make clear that you do not want to iterate over all columns:

Suggested change

-            for col in pandas_df.columns:
-                if str(pandas_df[col].dtype) == "geometry":
+            for col in filter(lambda c: str(pandas_df[c].dtype) == "geometry", pandas_df.columns):

(Suggestion won't work because the following code needs to be indented less)

Elegant suggestion, but I'll leave it as it is. It's more "climada style" like that.

Then please add a comment. What you call "climada style" confused me quite a bit 😕

Suggested change

-            for col in pandas_df.columns:
-                if str(pandas_df[col].dtype) == "geometry":
+            # Iterate over geometry columns (only)
+            for col in pandas_df.columns:
+                if str(pandas_df[col].dtype) == "geometry":

peanutfun · 2025-08-21T09:39:38Z

climada/hazard/centroids/centr.py

                crs = metadata.get("crs")
-                gdf = gpd.GeoDataFrame(store["centroids"], crs=crs)
+                gdf = gpd.GeoDataFrame(store["centroids"])
+                for xycol in metadata.get("xy_columns", []):


Please also add a test with multiple xy_columns/wkb_columns to be stored and read.

Co-authored-by: Lukas Riedel <[email protected]>

peanutfun · 2025-08-27T11:15:54Z

climada/hazard/centroids/test/test_centr.py

+        )
+        centroids_w.write_hdf5(tmpfile)
+        centroids_r = Centroids.from_hdf5(tmpfile)
+        self.assertTrue(centroids_w == centroids_r)


Suggested change

-        self.assertTrue(centroids_w == centroids_r)
+        self.assertEqual(centroids_w, centroids_r)

(This was actually done on purpose: the idea was to make sure the overridden equality operator does what it ought to do, regardless of what exactly happens inside assertEqual, about which I have no clue.)
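
On the open question of what happens inside `assertEqual`: for objects without a registered type-specific comparison, CPython's `unittest` falls back to evaluating `first == second`, so both spellings end up calling the overridden `__eq__`. The sketch below illustrates this with a hypothetical `Box` class that stands in for any class with a custom equality operator; it is not taken from the CLIMADA test suite.

```python
import unittest


class Box:
    """Hypothetical stand-in for a class with a custom equality operator."""

    eq_calls = 0  # counts how often __eq__ is invoked

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Box.eq_calls += 1
        if not isinstance(other, Box):
            return NotImplemented
        return self.value == other.value


# A bare TestCase instance is enough to call the assertion helpers directly.
case = unittest.TestCase()
case.assertTrue(Box(1) == Box(1))  # explicit use of the == operator
case.assertEqual(Box(1), Box(1))   # also goes through Box.__eq__

print(Box.eq_calls)
```

The practical difference is mainly the failure message: `assertEqual` reports both operands, while `assertTrue` only reports that the expression was not true.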

emanuel-schmid added 2 commits July 23, 2025 08:20

hazard.io: avoid pickling geometries and compress hdf5 files

6924ce1

'changelog'

df4082b

emanuel-schmid requested review from chahank and peanutfun as code owners July 23, 2025 06:37

chahank requested changes Jul 23, 2025


emanuel-schmid added 2 commits July 25, 2025 17:44

fix column drop

6d19c26

Merge branch 'develop' into feature/pickle_free_centroids_store

6279ec5

peanutfun requested changes Aug 21, 2025


emanuel-schmid and others added 5 commits August 26, 2025 14:53

add unit tests with multiple wkb and xy columns

dc95e42

avoid crs resetting

d5d7981

explicitly ask for Points

ca86640

Co-authored-by: Lukas Riedel <[email protected]>

fix typo

822a9ae

fix point condition

a1f0fcb

peanutfun reviewed Aug 27, 2025


emanuel-schmid added 2 commits August 27, 2025 13:57

add comment about filtering geometry columns

43a744f

remove ducplicated methods and obsolete lines

4e85273

emanuel-schmid merged commit 1ebd005 into develop Aug 27, 2025
19 checks passed

emanuel-schmid deleted the feature/pickle_free_centroids_store branch August 27, 2025 13:18



3 participants
