@schlunma
Contributor

@schlunma schlunma commented Jul 31, 2025

Description

This PR adds the preprocessor align_metadata, which aligns cube metadata with a specific target project. This is useful for multi-model analyses across different projects (e.g., CMIP5 and CMIP6).

Example recipe:

# ESMValTool
---
documentation:
  description: Test
  authors:
    - schlund_manuel
  title: Test.

preprocessors:
  test:
    align_metadata:
      target_project: CMIP6
    regrid:
      target_grid: 2x2
      scheme: linear
    multi_model_statistics:
      span: overlap
      statistics: [mean]

diagnostics:
  test:
    scripts:
      null
    variables:
      clivi:
        preprocessor: test
        mip: Amon
        additional_datasets:
          - {project: OBS, dataset: CLARA-AVHRR, type: sat, version: V002-01, tier: 3, alias: MultiObsMean, timerange: 19990101/20181231}
          - {project: OBS, dataset: CLOUDSAT-L2, type: sat, version: P1-R05-gridbox-average-noprecip, tier: 3, alias: MultiObsMean, timerange: 20060101/20171231}
          - {project: native6, dataset: ERA5, type: reanaly, version: v1, tier: 3, alias: MultiObsMean, timerange: 20000101/20191231}
          - {project: OBS, dataset: ESACCI-CLOUD, type: sat, version: AVHRR-AMPM-fv3.0, tier: 2, alias: MultiObsMean, timerange: 19970101/20161231}
          - {project: OBS6, dataset: MERRA2, type: reanaly, version: 5.12.4, tier: 3, alias: MultiObsMean, timerange: 20020101/20211231}
          - {project: OBS, dataset: MODIS, type: sat, version: MYD08-M3, tier: 3, alias: MultiObsMean, timerange: 20030101/20181231}
      clwvi:
        preprocessor: test
        mip: Amon
        additional_datasets:
          - {project: OBS, dataset: CLARA-AVHRR, type: sat, version: V002-01, tier: 3, alias: MultiObsMean, timerange: 19990101/20181231}
          - {project: OBS, dataset: CLOUDSAT-L2, type: sat, version: P1-R05-gridbox-average-noprecip, tier: 3, alias: MultiObsMean, timerange: 20060101/20171231}
          - {project: native6, dataset: ERA5, type: reanaly, version: v1, tier: 3, alias: MultiObsMean, timerange: 20000101/20191231}
          - {project: OBS, dataset: ESACCI-CLOUD, type: sat, version: AVHRR-AMPM-fv3.0, tier: 2, alias: MultiObsMean, timerange: 19970101/20161231}
          - {project: OBS6, dataset: MERRA2, type: reanaly, version: 5.12.4, tier: 3, alias: MultiObsMean, timerange: 20030101/20221231}
          - {project: OBS, dataset: MODIS, type: sat, version: MYD08-M3, tier: 3, alias: MultiObsMean, timerange: 20030101/20181231}

Closes #1985

Link to documentation: https://esmvaltool--2789.org.readthedocs.build/projects/ESMValCore/en/2789/recipe/preprocessor.html#align-metadata


@schlunma schlunma added this to the v2.13.0 milestone Jul 31, 2025
@schlunma schlunma requested a review from axel-lauer July 31, 2025 13:59
@schlunma schlunma added the preprocessor Related to the preprocessor label Jul 31, 2025
@codecov

codecov bot commented Jul 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.44%. Comparing base (05f8e4d) to head (f04c75b).
⚠️ Report is 68 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2789      +/-   ##
==========================================
+ Coverage   95.42%   95.44%   +0.02%     
==========================================
  Files         260      260              
  Lines       15426    15470      +44     
==========================================
+ Hits        14720    14766      +46     
+ Misses        706      704       -2     

Contributor

@axel-lauer axel-lauer left a comment

Thanks @schlunma ! This is fantastic! For most variables, this works like a charm. For lwp, I ran into the following problem:

For the custom variable lwp, some datasets (such as the new ESACCI-CLOUD version) provide lwp directly, so it does not need to be derived, and the data specify a standard_name for lwp, which should be fine. For other datasets (e.g., MODIS, tier 3), lwp needs to be derived. In this case, the custom table for lwp is applied. All custom variables have an empty standard_name. I guess this is the reason for the following error:

ValueError: Multi-model statistics failed to merge input cubes into a single array:
0: atmosphere_mass_content_of_cloud_liquid_water / (kg m-2) (time: 444)
1: Liquid Water Path / (kg m-2)        (time: 444)
2: Liquid Water Path / (kg m-2)        (time: 444)
3: Liquid Water Path / (kg m-2)        (time: 444)
4: Liquid Water Path / (kg m-2)        (time: 444)
  cube.standard_name differs: 'atmosphere_mass_content_of_cloud_liquid_water' != ''
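
For readers following along, the mismatch reported above can be reproduced in miniature. The following is only a toy sketch in plain Python, not the iris or ESMValCore merge logic: `CubeMetadata` and `find_merge_conflicts` are hypothetical names, and the assumption that both cubes share the same long_name is made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CubeMetadata:
    """Toy stand-in for the cube-level metadata compared before a merge."""

    standard_name: str
    long_name: str
    units: str


def find_merge_conflicts(cubes):
    """Return the names of the metadata fields that differ between cubes."""
    conflicts = []
    for field in ("standard_name", "long_name", "units"):
        values = {getattr(cube, field) for cube in cubes}
        if len(values) > 1:
            conflicts.append(field)
    return conflicts


# Dataset that ships lwp with a standard_name (as described for ESACCI-CLOUD).
provided = CubeMetadata(
    standard_name="atmosphere_mass_content_of_cloud_liquid_water",
    long_name="Liquid Water Path",  # assumed identical, for illustration only
    units="kg m-2",
)
# Derived lwp: the custom table entry has an empty standard_name.
derived = CubeMetadata(standard_name="", long_name="Liquid Water Path", units="kg m-2")

print(find_merge_conflicts([provided, derived]))
```

Only the standard_name field differs here, which is the same field the error message points at.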

I tested this with the following example recipe:

# ESMValTool
---
documentation:
  title: Reasonable variable ranges for sanity checks (OBS)
  description: >
    Calculate reasonable variable ranges for sanity checks.
  authors:
    - lauer_axel
    - bock_lisa
  maintainer:
    - lauer_axel


preprocessors:

  pp_max:
    custom_order: true
    align_metadata:
      target_project: CMIP5
      strict: false
    area_statistics:
      operator: mean
    multi_model_statistics:
      ignore_scalar_coords: true
      span: full
      statistics:
        - operator: max
      keep_input_datasets: false
    climate_statistics:
      operator: max


diagnostics:

  lwp:
    description: Calculate range of reasonable monthly values for cloud liquid water path.
    variables:
      lwp_max:
        short_name: lwp
        derive: true
        preprocessor: pp_max
        mip: Amon
    additional_datasets:
      - {dataset: ESACCI-CLOUD, project: OBS6, type: sat,
        version: v3.0-AVHRR-AMPM, tier: 2,
        start_year: 1982, end_year: 2016}
      - {dataset: CLARA-AVHRR, project: OBS, type: sat,
        version: V002-01, tier: 3,
        start_year: 1982, end_year: 2018}
      - {dataset: CLOUDSAT-L2, project: OBS, type: sat,
        version: P1-R05-gridbox-average-noprecip,
        start_year: 2006, end_year: 2017, tier: 3}
      - {dataset: MAC-LWP, project: OBS, type: sat, version: v1,
        tier: 3, start_year: 1988, end_year: 2016}
      - {dataset: MODIS, project: OBS, type: sat, version: MYD08-M3,
        tier: 3, start_year: 2003, end_year: 2018}
    scripts: null

@schlunma
Contributor Author

schlunma commented Aug 1, 2025

Can you try with target_project: OBS6 (and omit strict=false)? OBS6 is defined with cmor_strict: false, i.e., it's basically the CMIP6 tables in addition to our custom tables (+ it searches all MIP tables).
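
A rough sketch of the lookup behaviour described above, assuming a strict project only consults the requested MIP table while a non-strict project also falls back to the custom tables and then to all other MIP tables. The table contents, the search order, and the names `find_variable` and `align_or_fail` are hypothetical and heavily simplified; this is not the ESMValCore implementation.

```python
# Hypothetical, heavily simplified tables; real CMOR tables hold far more.
CMIP6_TABLES = {"Amon": {"clwvi", "clivi", "tas"}}
CUSTOM_TABLE = {"lwp"}

# Mirrors the cmor_strict setting mentioned above (OBS6 is not strict).
PROJECT_IS_STRICT = {"CMIP6": True, "OBS6": False}


def find_variable(project, mip, short_name):
    """Return the table in which the variable was found, or None."""
    if short_name in CMIP6_TABLES.get(mip, set()):
        return mip
    if PROJECT_IS_STRICT[project]:
        return None  # strict projects stop at the requested table
    if short_name in CUSTOM_TABLE:
        return "custom"
    for other_mip, names in CMIP6_TABLES.items():
        if short_name in names:
            return other_mip
    return None


def align_or_fail(project, mip, short_name, strict=True):
    """Look up the target variable; raise if missing and strict is set."""
    found = find_variable(project, mip, short_name)
    if found is None and strict:
        raise ValueError(
            f"Variable '{short_name}' not available for table '{mip}' "
            f"of project '{project}'. Set `strict=False` to ignore this."
        )
    return found  # None means: nothing to align, leave the cube as it is


print(align_or_fail("OBS6", "Amon", "lwp"))
print(align_or_fail("CMIP6", "Amon", "lwp", strict=False))
```

Under these assumptions, lwp resolves through the custom table for OBS6, while for CMIP6 it is either skipped (strict=False) or rejected with the error quoted in this thread.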

@schlunma
Contributor Author

schlunma commented Aug 1, 2025

We might also want to add the standard_name to our lwp table so that it matches the CMIP6 entry. We did the same for toz recently.

@axel-lauer
Contributor

> Can you try with target_project: OBS6 (and omit strict=false)? OBS6 is defined with cmor_strict: false, i.e., it's basically the CMIP6 tables in addition to our custom tables (+ it searches all MIP tables).

If I do so, I run into the following error:

ERROR   align_metadata failed: Variable 'lwp' not available for table 'Amon' of project 'CMIP6'. Set `strict=False` to ignore this.

I guess the way to go would be to add the standard_name to our custom table entry for lwp. I'll try this next.

@schlunma
Contributor Author

schlunma commented Aug 1, 2025

Are you 100% sure you changed to target_project: OBS6? The error still says CMIP6.

Changing the table unfortunately would not help, since you need a project that is not defined as CMOR-strict.

@axel-lauer
Contributor

It now works also for lwp when applying the fix #2791, but only when setting strict: false. When omitting strict: false, I get this error message:

ERROR   align_metadata failed: Variable 'lwp' not available for table 'Amon' of project 'CMIP6'. Set `strict=False` to ignore this.

@schlunma
Contributor Author

schlunma commented Aug 1, 2025

> It now works also for lwp when applying the fix #2791, but only when setting strict: false. When omitting strict: false, I get this error message:
>
> ERROR   align_metadata failed: Variable 'lwp' not available for table 'Amon' of project 'CMIP6'. Set `strict=False` to ignore this.

I guess with #2791 you could omit the preprocessor align_metadata altogether for this variable.

@axel-lauer
Contributor

When changing target_project from CMIP6 to OBS6, everything works nicely!

Contributor

@axel-lauer axel-lauer left a comment

Everything works now. Yay! Thumbs up for getting this merged!

@schlunma
Contributor Author

schlunma commented Aug 1, 2025

@valeriupredoi would you be available for a brief technical review of this? This fixes a long-standing issue, and I would love to see this merged before my leave! Thanks 🍻

Contributor

@valeriupredoi valeriupredoi left a comment

Thanks, Manu! Looks good and very useful! The only thing I'm concerned about is how much room there is for overloading this functionality, and the chance of people trying to use it in all manner of weird use cases, e.g., replacing air pressure cube metadata with depth metadata, with iris allowing it. I might have missed it, but do you do anything with the lower-importance attributes, or do you preserve them? 🍺

@schlunma
Contributor Author

schlunma commented Aug 7, 2025

Well, if people abuse this feature one could overwrite the long_name, standard_name, and var_name of a variable with any entry from the CMOR table. Units will only be overwritten if source units are convertible to target units, otherwise an error will be raised. Coordinate metadata cannot be overwritten.

But for this, you need to specify a custom target_short_name in the preprocessor, e.g., target_short_name: pr for a variable tas. I don't think this is a big issue to be honest, the default usage of this preprocessor is pretty safe.
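
To make the rules above concrete, here is a minimal sketch in plain Python. It assumes the behaviour exactly as described in this comment and nothing more; it is not the ESMValCore code. `VarMetadata`, `TARGET_TABLE`, `CONVERTIBLE`, and `align` are hypothetical names, the convertibility check is a toy replacement for a real units library, and any conversion of the data values themselves is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarMetadata:
    var_name: str
    standard_name: str
    long_name: str
    units: str


# Hypothetical excerpt of a target CMOR table, keyed by short name.
TARGET_TABLE = {
    "tas": VarMetadata("tas", "air_temperature", "Near-Surface Air Temperature", "K"),
    "pr": VarMetadata("pr", "precipitation_flux", "Precipitation", "kg m-2 s-1"),
}

# Toy groups of mutually convertible units (stand-in for a units library).
CONVERTIBLE = [{"K", "degC"}, {"kg m-2", "g m-2"}]


def units_convertible(source, target):
    """Return True if the toy table says source can be converted to target."""
    return source == target or any(
        source in group and target in group for group in CONVERTIBLE
    )


def align(source, target_short_name=None):
    """Return metadata aligned with the target table entry.

    var_name, standard_name and long_name are taken from the table entry;
    units are only overwritten if they are convertible, else an error is raised.
    """
    target = TARGET_TABLE[target_short_name or source.var_name]
    if not units_convertible(source.units, target.units):
        raise ValueError(
            f"Cannot convert units '{source.units}' to '{target.units}'"
        )
    return target


source = VarMetadata("tas", "", "2m temperature", "degC")
print(align(source))  # default usage: aligned with the 'tas' entry

try:
    align(source, target_short_name="pr")  # the "abuse" case from above
except ValueError as exc:
    print(exc)
```

In this sketch the default call only harmonizes the names and units of tas, while pointing tas at pr fails on the units check, which is the safeguard mentioned above.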

@valeriupredoi
Contributor

> Well, if people abuse this feature one could overwrite the long_name, standard_name, and var_name of a variable with any entry from the CMOR table. Units will only be overwritten if source units are convertible to target units, otherwise an error will be raised. Coordinate metadata cannot be overwritten.
>
> But for this, you need to specify a custom target_short_name in the preprocessor, e.g., target_short_name: pr for a variable tas. I don't think this is a big issue to be honest, the default usage of this preprocessor is pretty safe.

that's what I thought too, perfect - fire at will, Manu, I mean push at will 😁

@schlunma schlunma merged commit b5240ea into main Aug 7, 2025
7 checks passed
@schlunma schlunma deleted the align_metadata_preproc branch August 7, 2025 16:22
Labels

preprocessor Related to the preprocessor

Development

Successfully merging this pull request may close these issues.

Multi-model statistics on changed standard names through CMIP projects

4 participants