
Introduce data validation (pydantic/pandera) and dynamic attribute handling #1128

Closed
lkstrp wants to merge 2 commits into PyPSA:master from lkstrp:data-validation

Conversation

@lkstrp
Member

@lkstrp lkstrp commented Jan 27, 2025

Closes #348
Closes #722
Closes #734

Changes proposed in this Pull Request

Summary

  • Add data validation (via Pydantic for data classes, pandera for DataFrames) to better handle attribute types, defaults and nullable status, introduce immutable class attributes, allow for attribute-specific checks, etc.
  • The possibilities for checks are endless; we can move a lot of the docs to validation checks, which could improve the user experience by a lot.
  • Removes the arbitrary definition when an attribute is listed in both c.dynamic and c.static.
  • Adds a mechanism for dynamic attribute initialisation. E.g. output variables are only added once a network is solved, which can be extended to "configurable" attributes.

Data Validation

  • Pydantic is used for data classes
    • Currently only for the Components class/subclasses and ComponentType. The plan is to bring it to the Network class as well, once it is split into data and logic with some refactoring.
    • Allows us to enforce types, use simpler default factories, mix immutable with mutable attributes, etc. This alone makes things more robust, for both users and developers.
  • Pandera is used for DataFrames
    • It is a DataFrame validation library that integrates with pydantic.
    • Defines DataFrame schemas for both static and dynamic data.
    • Handles types (and casting), missing columns and nullability for each attribute individually.
    • Attribute-specific settings are handled in the attribute csvs in pypsa/data/component_attrs/, whose structure has changed.
  • At the moment the checks are not very strict, and the main benefits are type safety and simplified DataFrame initialisation.
    • But the structure will allow us to set up many attribute-specific checks, which will be much easier than what is currently done in check_consistency.
    • We can raise these checks when adding data, disallow certain attribute combinations, enforce certain ranges or discrete steps (even interdependent ones), and so on. A lot of the side notes and explanations in the docs can be turned into instant feedback instead.
    • In a similar way, I would like to bring the same data validation steps to pypsa-eur at some point, where the benefits could be even bigger.

Dynamic Attributes

  • Dynamic Attribute Initialisation

    • Previously, all attributes were initialised once as placeholders and then used or not. This process can now be dynamic. For example, output variables are only added to the network once the network has been solved.
    • To simplify our existing feature set and allow for better modularity, this can also be used for input attributes. I see three different types of input attributes:
      • required: Like name etc. These must be set by the user and cannot have a default value.
      • configurable: These can be set by the user. If not, the default value is used.
      • optional: These can be set by the user. If not set, they have no effect on the optimisation. Therefore, no default value needs to be applied and they are not even added.
    • A fourth type would be output, and a fifth type custom for all attributes that are manually defined and only used in extra functionality.
    • With this approach, most networks, where users do not use the full functionality and all attributes, will be less bloated, and your network contains only the stuff you actually need.
  • Ambiguity of static or dynamic attributes

    • Previously, when dynamic data was added to an attribute, it was stored in c.dynamic, but the column still existed in n.static for both dynamic and static components. The data for dynamic components in n.static was simply unused and misleading.
    • Now an attribute can be either dynamic or static, not both. It is stored either in n.dynamic or n.static; attributes in the two containers are now mutually exclusive.
      • If some components have static and some dynamic data, the static data is cast to dynamic and removed from the static dataframe.
      • This increases the amount of data stored in the dynamic container quite a bit, which could be a problem for
        • io, which can be solved by compressing our output files.
        • readability, since c.dynamic no longer shows only genuinely dynamic values. This can be solved with the introduced DynamicDataStore, a dictionary with some extra functionality.
      • But I think the trade-off is worth it considering the usage benefits.
    • All dynamic attributes can now also be static by design, and by default they are static, so c.dynamic is now an empty dict-like store by default. For backwards compatibility, instead of an AttributeError you can still get an empty dynamic df (as was possible before) with a small warning. I am thinking of adding similar functionality for the static data.
  • nan instead of ""

    • Previously, when referencing other components and no reference was declared (the attribute is nullable), "" was used. This led to some boilerplate and also some bugs (where empty "" components existed). The data validation above makes it easy to enforce and use nan instead.
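A minimal illustration of what this change means for downstream code. The `ref` column is hypothetical, standing in for any nullable reference attribute:

```python
import numpy as np
import pandas as pd

# Illustrative only: "ref" is a stand-in for any nullable reference
# attribute. A missing reference used to be "", now it is NaN.
generators = pd.DataFrame(
    {"bus": ["bus0", "bus1"], "ref": [np.nan, "line0"]}
)

# Code that filtered on the empty string silently breaks under NaN,
# because NaN != "" evaluates to True ...
mask_old = generators["ref"] != ""

# ... so it should switch to pandas null checks instead:
mask_new = generators["ref"].notna()
```

With `""`, the old comparison worked; with NaN it treats missing references as present, which is exactly the kind of user logic the breaking-change note below warns about.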

Breaking

  • Some of these changes need to be breaking:

    • dynamic vs static attributes: As described, this can be handled fairly well via the custom dict stores, which handle attribute errors.
    • nan instead of "": This can't be handled transparently and needs a big note. I don't know how many users build logic on top of that. Fixing it is fine, but maybe not trivial.
    • The changed structure of the component attribute csvs, which might break logic based on n.components.generators.attrs: also not backwards compatible. But I don't think too many users are using this, and a fix is trivial.
  • Dynamic Attribute Handling

    • Any attribute can be either static or dynamic, not both anymore.
    • Adds dynamic "states": attributes are only added when needed.
      • E.g. output attrs only once a network is solved, or "configurable" attributes only when they are actually used in the optimisation.
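For the backwards-compatibility behaviour around `c.dynamic`, a rough sketch of the described dict-like store could look like this. Names and details are illustrative, not the actual implementation (which warns on an `AttributeError`-style access; this sketch hooks the `KeyError` path instead):

```python
import warnings

import pandas as pd

class DynamicDataStore(dict):
    """Illustrative sketch of a dict-like store: accessing an attribute
    that has no dynamic data returns an empty DataFrame aligned to the
    snapshots, with a warning, instead of raising."""

    def __init__(self, snapshots):
        super().__init__()
        self.snapshots = snapshots

    def __missing__(self, attr):
        # Called by dict.__getitem__ when the key is absent.
        warnings.warn(
            f"'{attr}' has no dynamic data; returning an empty DataFrame."
        )
        return pd.DataFrame(index=self.snapshots)

snapshots = pd.RangeIndex(3, name="snapshot")
dynamic = DynamicDataStore(snapshots)
dynamic["p_max_pu"] = pd.DataFrame({"gen0": [1.0, 0.5, 0.8]}, index=snapshots)

empty = dynamic["p_set"]  # warns; empty frame instead of a KeyError
```

Existing code that iterates `c.dynamic` only sees attributes with actual dynamic data, while old-style lookups of absent attributes keep working with a warning.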

@fneum @FabianHofmann @coroa
This is still a draft, so there is no need to review any code yet. While most things have already been discussed, I would mainly like feedback on the general concepts and the introduced breaking changes.
This PR is a bit bloated again, and data validation was not even planned for 1.0; let's say I fell into a rabbit hole. But I think it is easier to make these changes all at once, instead of spending too much time pointing people to ever-new deprecations and adding those things one by one. Not sure if there is anything else on the "would be nice but also breaks things" list that could be added.

Checklist

@lkstrp lkstrp requested review from FabianHofmann and fneum January 27, 2025 16:46
@lkstrp lkstrp marked this pull request as draft January 27, 2025 17:02
@FabianHofmann
Contributor

This is awesome! The improvements far outweigh the breaking changes, especially considering the potential long-term extensions. Let me know when I should scroll through the code.

@lkstrp
Member Author

lkstrp commented Feb 5, 2026

Closed with ref to #1487

@lkstrp lkstrp closed this Feb 5, 2026

Development

Successfully merging this pull request may close these issues.

Export bug when time-dependent variable is all 0
Add support for categorical data
