
Introduce data validation (pydantic/pandera) and dynamic attribute handling #1128

Closed
lkstrp wants to merge 2 commits into PyPSA:master from lkstrp:data-validation

Conversation

@lkstrp
Member

@lkstrp lkstrp commented Jan 27, 2025

Closes #348
Closes #722
Closes #734

Changes proposed in this Pull Request

Summary

  • Add data validation (via Pydantic for data classes, pandera for DataFrames) to better handle attribute types, defaults and nullable status, introduce immutable class attributes, allow for attribute-specific checks, etc.
  • The possibilities for checks are endless; we can move a lot of the docs to validation checks, which could improve the user experience by a lot.
  • Removes the arbitrary definition when an attribute is listed in both c.dynamic and c.static.
  • Adds a mechanism for dynamic attribute initialisation. E.g. output variables are only added once a network is solved, which can be extended to "configurable" attributes.

Data Validation

  • Pydantic is used for data classes
    • Currently only for the Components class/subclasses and ComponentType. The plan is to bring it to the Network class as well, once it is split into data and logic with some refactoring.
    • Allows us to enforce types, use simpler default factories, mix immutable with mutable attributes, etc. This alone makes things more robust, for both users and developers.
  • Pandera is used for DataFrames
    • It is a DataFrame validation library that integrates with pydantic.
    • Defines DataFrame schemas for both static and dynamic data.
    • Handles types (and casting), missing columns and nullability for each attribute individually.
    • Attribute-specific settings are handled in the attribute csvs in pypsa/data/component_attrs/, whose structure has changed.
  • At the moment the checks are not very strict, and the main benefits are type safety and simplified DataFrame initialisation.
    • But the structure will allow us to set up many attribute-specific checks, which will be much easier than what is currently done in check_consistency.
    • We can raise these checks when adding data, disallow certain attribute combinations, enforce certain ranges or discrete steps (even interdependent ones), and so on. A lot of the side notes and explanations in the docs can be turned into instant feedback instead.
    • In a similar way, I would like to bring the same data validation steps to pypsa-eur at some point, where the benefits could be even bigger.

Dynamic Attributes

  • Dynamic Attribute Initialisation

    • Previously, all attributes were initialised once as placeholders and then used or not. This process can now be dynamic. For example, output variables are only added to the network once the network has been solved.
    • To simplify our existing feature set and allow for better modularity, this can also be used for input attributes. I see three different types of input attributes:
      • required: Like name etc. These must be set by the user and cannot have a default value.
      • configurable: These can be set by the user. If not, the default value is used.
      • optional: These can be set by the user. If not set, they have no effect on the optimisation. Therefore, no default value needs to be applied and they are not even added.
    • A fourth type would be output, and a fifth type custom for all attributes that are manually defined and only used in extra functionality.
    • With this approach, most networks, where users do not use the full functionality and all attributes, will be less bloated, and your network contains only the stuff you actually need.
  • Ambiguity of static or dynamic attributes

    • Previously, when dynamic data was added to an attribute, it was stored in c.dynamic, but the column still existed in n.static for both dynamic and static components. The data for dynamic components in n.static was simply unused and misleading.
    • Now an attribute can be either dynamic or static, not both. It is stored either in n.dynamic or n.static; attributes in the two containers are now mutually exclusive.
      • If some components have static and some dynamic data, the static data is cast to dynamic and removed from the static dataframe.
      • This increases the amount of data stored in the dynamic container quite a bit, which could be a problem for
        • io, which can be solved by compressing our output files.
        • readability, since c.dynamic no longer shows only genuinely dynamic values. This can be solved with the introduced DynamicDataStore, a dictionary with some extra functionality.
      • But I think the trade-off is worth it considering the usage benefits.
    • All dynamic attributes can now also be static by design, and by default they are static, so c.dynamic is now an empty dict-like store by default. For backwards compatibility, instead of an AttributeError you can still get an empty dynamic df (as was possible before) with a small warning. I am thinking of adding similar functionality for the static data.
  • nan instead of ""

    • Previously, when referencing other components and no reference was declared (the attribute is nullable), "" was used. This led to some boilerplate and also some bugs (where empty "" components existed). The data validation above makes it easy to enforce and use nan instead.
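A minimal illustration of what this change means for downstream code. The `ref` column is hypothetical, standing in for any nullable reference attribute:

```python
import numpy as np
import pandas as pd

# Illustrative only: "ref" is a stand-in for any nullable reference
# attribute. A missing reference used to be "", now it is NaN.
generators = pd.DataFrame(
    {"bus": ["bus0", "bus1"], "ref": [np.nan, "line0"]}
)

# Code that filtered on the empty string silently breaks under NaN,
# because NaN != "" evaluates to True ...
mask_old = generators["ref"] != ""

# ... so it should switch to pandas null checks instead:
mask_new = generators["ref"].notna()
```

With `""`, the old comparison worked; with NaN it treats missing references as present, which is exactly the kind of user logic the breaking-change note below warns about.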

Breaking

  • Some of these changes need to be breaking:

    • dynamic vs static attributes: As described, this can be handled fairly well via the custom dict stores, which handle attribute errors.
    • nan instead of "": This can't be handled transparently and needs a big note. I don't know how many users build logic on top of that. Fixing it is fine, but maybe not trivial.
    • The changed structure of the component attribute csvs, which might break logic based on n.components.generators.attrs: also not backwards compatible. But I don't think too many users are using this, and a fix is trivial.
  • Dynamic Attribute Handling

    • Any attribute can be either static or dynamic, not both anymore.
    • Adds dynamic "states": attributes are only added when needed.
      • E.g. output attrs only once a network is solved, or "configurable" attributes only when they are actually used in the optimisation.
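For the backwards-compatibility behaviour around `c.dynamic`, a rough sketch of the described dict-like store could look like this. Names and details are illustrative, not the actual implementation (which warns on an `AttributeError`-style access; this sketch hooks the `KeyError` path instead):

```python
import warnings

import pandas as pd

class DynamicDataStore(dict):
    """Illustrative sketch of a dict-like store: accessing an attribute
    that has no dynamic data returns an empty DataFrame aligned to the
    snapshots, with a warning, instead of raising."""

    def __init__(self, snapshots):
        super().__init__()
        self.snapshots = snapshots

    def __missing__(self, attr):
        # Called by dict.__getitem__ when the key is absent.
        warnings.warn(
            f"'{attr}' has no dynamic data; returning an empty DataFrame."
        )
        return pd.DataFrame(index=self.snapshots)

snapshots = pd.RangeIndex(3, name="snapshot")
dynamic = DynamicDataStore(snapshots)
dynamic["p_max_pu"] = pd.DataFrame({"gen0": [1.0, 0.5, 0.8]}, index=snapshots)

empty = dynamic["p_set"]  # warns; empty frame instead of a KeyError
```

Existing code that iterates `c.dynamic` only sees attributes with actual dynamic data, while old-style lookups of absent attributes keep working with a warning.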

@fneum @FabianHofmann @coroa
This is still a draft, so there is no need to review any code yet. While most things have already been discussed, I would mainly like feedback on the general concepts and the introduced breaking changes.
This PR is a bit bloated again, and data validation was not even planned for 1.0; let's say I fell into a rabbit hole. But I think it is easier to make these changes all at once, instead of spending too much time pointing people to ever-new deprecations and adding those things one by one. Not sure if there is anything else on the "would be nice but also breaks things" list that could be added.

Checklist

@lkstrp lkstrp requested review from FabianHofmann and fneum January 27, 2025 16:46
@lkstrp lkstrp marked this pull request as draft January 27, 2025 17:02
@FabianHofmann
Contributor

This is awesome! The improvements far outweigh the breaking changes, especially considering the potential long-term extensions. Let me know when I should scroll through the code.

@lkstrp
Member Author

lkstrp commented Feb 5, 2026

Closed with ref to #1487

@lkstrp lkstrp closed this Feb 5, 2026

Development

Successfully merging this pull request may close these issues.

Export bug when time-dependent variable is all 0
Add support for categorical data
