
feat: Logging to AzureML with ML Flow #469

Merged
anaprietonem merged 74 commits into ecmwf:mlflow_azure from timothyas:feature/azure-mlflow
Nov 5, 2025

Conversation

@timothyas
Contributor

@timothyas timothyas commented Aug 7, 2025

Description

I have been logging metrics to Azure ML, which requires Azure's special blend of mlflow. This PR adds those features, which I think could be useful to a broader audience: people who run on Azure or Azure ML, or who simply log and view metrics on AML. I'm currently running on an on-prem HPC machine while logging to AML with this code, but I think it will also be necessary for those actually running on AML.

cc @mchantry

What problem does this change solve?

Using mlflow to log to Azure ML requires some azure-specific code and packages.

What issue or task does this change relate to?

No open issue, please let me know if you'd like me to open one up to describe the situation, I'd be happy to do so.

Additional notes

I got this working just enough to my liking, but there are still some issues unfortunately.

  • tests: I'm not sure how to write tests for this, since it would require logging to AML... If anyone has ideas, I'm all ears.
  • TokenAuth: not necessary with AML; see the necessary helper PR at feat: NoAuth for AML mlflow Logging anemoi-utils#200
  • terminal logs: with regular mlflow, terminal logs are updated by simply overwriting terminal_log.txt. Overwriting files is not allowed with azure-mlflow, and this write appears to happen on every running process (?), so many times over. I added try/except logic around these writes that just passes whenever a failure happens, which means terminal_log.txt stops updating pretty early in the training process. I'm not really sure what to recommend here.
  • diagnostic plots: these also seem to get written by each process (worker?). However, since each plot has a unique name, the try/except logic is harmless here: each process tries to log the plot, and if it succeeds, that means it was the first one and the plot gets stored. If the process fails, it just moves on passively since a different one beat it to the job. No problem, but maybe not the cleanest solution.
  • model checkpoints: I simply could not get model checkpoints to log to AML, and I don't know what is going on there. For my purposes I don't really care, though, since they get stored locally on the machine where I will use the checkpoints anyway. Also, @mariahpope at EPIC/NOAA has been kicking off some anemoi training runs on AML and noticed the checkpoints are stored, so this may not be a problem.
  • hyperparameter logging: in the second half of the method log_hyperparams_in_mlflow, the logger tries to log "expanded_params" after truncating them. I could not get this to work for AML despite many attempts with ChatGPT and Microsoft Copilot, which is why the Azure subclass overrides this method.
  • I think there are a lot of different ways to establish a connection to an AML workspace; I did it here with the following code block:
        import os

        from azureml.core import Workspace
        from azureml.core.authentication import ServicePrincipalAuthentication

        # Authenticate with a service principal whose credentials are
        # supplied via environment variables.
        sp_auth = ServicePrincipalAuthentication(
            tenant_id=os.environ["AZURE_TENANT_ID"],
            service_principal_id=os.environ["AZURE_CLIENT_ID"],
            service_principal_password=os.environ["AZURE_CLIENT_SECRET"],
        )

        # Connect to the AML workspace using those credentials.
        ws = Workspace(
            subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
            resource_group=aml_resource_group,
            workspace_name=aml_workspace_name,
            auth=sp_auth,
        )

Note there's a check before those lines to make sure the necessary environment variables are defined for the user. I would be very open to making this more flexible based on what true AML users think, or we could let this evolve as more users start to flex this feature.
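For illustration, that kind of pre-flight check might look like the following. This is a hypothetical sketch: the helper name and error message are not the actual PR code, only the environment variable names come from the snippet above.

```python
import os

# Environment variables required for the service-principal connection above.
REQUIRED_ENV_VARS = (
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_SUBSCRIPTION_ID",
)


def check_azure_env_vars() -> None:
    """Raise a clear error if any required Azure variable is unset."""
    missing = [name for name in REQUIRED_ENV_VARS if name not in os.environ]
    if missing:
        raise RuntimeError(
            "Logging to Azure ML requires these environment variables to be set: "
            + ", ".join(missing)
        )
```

Failing early with a list of the missing variables is friendlier than letting `os.environ[...]` raise a bare KeyError mid-setup.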

Let me know if these limitations are acceptable, and I'd be happy to put some warning flags for the user so it's clear what will and will not be logged to AML. If anyone wants these to be worked out in the PR, I'd love to hear some suggestions - I'm pretty new to AML and leaned heavily on some LLMs to make this work, but "we" all got stuck on these issues :)
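The try/except pattern described above for terminal logs and diagnostic plots can be sketched roughly as follows. This is a hypothetical helper, not the PR's actual code; `client` stands in for an `mlflow.tracking.MlflowClient` instance (the fake below avoids any real AML connection).

```python
import logging

logger = logging.getLogger(__name__)


def try_log_artifact(client, run_id: str, path: str) -> bool:
    """Attempt to log an artifact, passing silently when the upload fails.

    azure-mlflow forbids overwriting existing artifacts, and several
    processes may race to upload the same file, so failures are swallowed
    rather than raised: whichever process succeeds first stores the file.
    """
    try:
        client.log_artifact(run_id, path)
        return True
    except Exception as err:  # e.g. artifact already exists on AML
        logger.debug("Skipping artifact %s: %s", path, err)
        return False
```

This makes the plot-logging case benign (unique names, first writer wins) while explaining why terminal_log.txt, which needs repeated overwrites, quietly stops updating.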


As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

@github-project-automation github-project-automation bot moved this to To be triaged in Anemoi-dev Aug 7, 2025
@timothyas timothyas marked this pull request as draft August 7, 2025 17:36
@timothyas timothyas changed the title Logging to AzureML with ML Flow feat: Logging to AzureML with ML Flow Aug 7, 2025
@phinate
Contributor

phinate commented Aug 12, 2025

Just to chime in that I've independently noticed many of the same issues as you @timothyas as an AzureML user. Happy to help take this forward if you'd like to divide up the work; will test the current state of your branch on our infrastructure as a first step! Thanks for this. :)

@phinate
Contributor

phinate commented Aug 12, 2025

One thing I'd like to point out is that this PR is based on an older version of the AzureML Python SDK (v1) — I think we could refactor this to use the v2 version of the SDK without too much issue.

@phinate
Contributor

phinate commented Aug 14, 2025

I'm working on the SDK v2 migration on a fork-of-your-fork @timothyas, and just wanted to check: how are you (and colleague(s)) running things on AzureML? For me, I'm running a Command Job via az ml job create, and am working with that as the use-case. Want to make sure I don't end up accidentally preventing you from running things!

@mariahpope

I'm working on the SDK v2 migration on a fork-of-your-fork @timothyas, and just wanted to check: how are you (and colleague(s)) running things on AzureML? For me, I'm running a Command Job via az ml job create, and am working with that as the use-case. Want to make sure I don't end up accidentally preventing you from running things!

Hi! I have also been running Command Jobs (either via the CLI or within a notebook using the Python SDK), so that is great.

@phinate
Contributor

phinate commented Aug 14, 2025

Hi! I have also been running Command Jobs (either do it via CLI or within a notebook using python sdk), so that is great.

Excellent! I'm actually getting different results depending on how I run things at the moment; I'm trying to avoid the secrets-based approach with a ServicePrincipal, and instead use a ManagedIdentity or a UserIdentity. Not sure what you've been getting to work, but I'd love to know what your current workflow is, and whether you could test things once they're stable on our side :)

@timothyas
Contributor Author

Hey @phinate, thanks so much for all of this!! I'm really not sure why or how I'm using an older version of azure mlflow... I definitely appreciate your efforts on an upgrade and I'll update my setup accordingly. I'm curious whether that would fix some of the issues I ran into, like those related to logging hyperparameters.

I would be happy to iterate with you on updates to this PR. My workflow is kind of strange actually. I typically run on an HPC machine, but since we happen to have an azure workspace set up, I just use azure mlflow to log to that as a server. However, from the NOAA side we may be using azure resources and AML directly soon, and @mariahpope has been getting that going.

Just a heads up and explanation for a slow response, I'm on leave the rest of this week but I'll jump back on this on Monday.

@timothyas
Contributor Author

Hi folks (@anaprietonem @phinate), I believe this PR is ready for review. I think all of the functionality is there and thanks to @phinate we even have a test 🙂. My most recent changes since @phinate's test addition involved:

  • Merging upstream/main into this feature branch
  • Hacking through ruff errors. This was mostly simple, but did result in a few minor changes to the method inheritance for logging hyperparams (no functionality change, just moving things around).

When you get a chance I'd be curious to hear what folks think about this. I think we're close! Thanks for everyone's patience and attention.

@anaprietonem
Contributor

anaprietonem commented Oct 31, 2025

@timothyas, the PR looks good; I tested it and it runs fine. However, I noticed some differences in the number of total parameters being logged to mlflow compared to before. I didn't manage to get to the bottom of it, so if that's okay I will check this further and then hopefully we can merge it!

@timothyas
Contributor Author

Thanks @anaprietonem, that sounds great! Hopefully no big issues with the parameters being logged...

@phinate
Contributor

phinate commented Oct 31, 2025

Happy to also try this early next week to ensure it's working on my side!

@anaprietonem
Contributor

Thanks @anaprietonem, that sounds great! Hopefully no big issues with the parameters being logged...

I looked a bit more in detail. The difference I was seeing is mostly because the config entries are now logged slightly differently, so we see 'data.processor...' instead of 'config.data.processor...'. Since we have these entries in clean_params (the list of prefixes to remove):

       "hardware",
       "data",
       "dataloader",
       "model",
       "training",
       "diagnostics",
       "graph",

we'd need to remove them to keep the same behaviour as before. @timothyas could you make that change as part of this PR please?
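To illustrate why, here is a hypothetical sketch (not the actual anemoi clean-params code) of how such a prefix list truncates keys, and how the key format change makes the old list suddenly match:

```python
def clean_key(key: str, prefixes: tuple[str, ...]) -> str:
    """Drop a leading prefix (plus its dot) if the key starts with one."""
    for prefix in prefixes:
        if key.startswith(prefix + "."):
            return key[len(prefix) + 1:]
    return key


OLD_PREFIXES = (
    "hardware", "data", "dataloader", "model",
    "training", "diagnostics", "graph",
)

# Before this PR, keys arrived as "config.data.processor.x", so the "data"
# prefix never matched and the key was logged untouched.
assert clean_key("config.data.processor.x", OLD_PREFIXES) == "config.data.processor.x"

# Now keys arrive as "data.processor.x", which the old list truncates to
# "processor.x" -- hence the request to drop these prefixes from the list.
assert clean_key("data.processor.x", OLD_PREFIXES) == "processor.x"
```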

@timothyas
Contributor Author

@anaprietonem I just removed the prefixes that you mentioned in c9eeef0, although I have to be honest I don't really understand why that would have changed in this PR. Let me know if this is what you were looking for. Thanks!

@anaprietonem
Contributor

@anaprietonem I just removed the prefixes that you mentioned in c9eeef0, although I have to be honest I don't really understand why that would have changed in this PR. Let me know if this is what you were looking for. Thanks!

yes, that's what I was looking for. We just need to resolve the issues with the unit tests, as they seem to be failing because of ModuleNotFoundError: No module named 'marshmallow' - https://github.com/ecmwf/anemoi-core/actions/runs/19080440585/job/54507308965?pr=469

I guess this is related to the pytest.ini updates, and we might need to install the [azure] dependencies as part of the tests (see https://github.com/timothyas/anemoi-core/blob/691edc548dc50669bb4216b2c9c82718e3d490e7/.github/workflows/python-pull-request.yml) - so we could specify it in the .all section of the pyproject - what do you think @timothyas @phinate?
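For reference, folding an optional-dependency group into an "all" extra in pyproject.toml typically looks like the following. This is a generic sketch: the group names and the azureml-mlflow package are assumptions, not the actual anemoi configuration.

```toml
[project.optional-dependencies]
azure = [
    "azureml-mlflow",
]
# A self-referential extra pulls in the other groups, so
# "pip install anemoi-training[all]" also installs [azure].
all = [
    "anemoi-training[azure]",
]
```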

@timothyas
Contributor Author

That sounds fine to me @anaprietonem, I added this in 6aadb04.

@anaprietonem
Contributor

anaprietonem commented Nov 5, 2025

That sounds fine to me @anaprietonem, I added this in 6aadb04.

Aha, I meant anemoi-training[azure]. In any case, I think we are happy with the changes as they are right now, so rather than merging into main, I have created a copy of main called 'mlflow_azure'. On that branch I can help debug the tests, make the pyproject changes directly, and run the integration tests, and then we merge to main.

That means I will update the PR to point to that branch and merge it. If I hit any issues I will let you know, but I think this is ready and it's just a minor last push - thanks both for the nice work @timothyas and @phinate!

@anaprietonem anaprietonem self-requested a review November 5, 2025 08:41
@anaprietonem anaprietonem changed the base branch from main to mlflow_azure November 5, 2025 08:41
Contributor

@anaprietonem anaprietonem left a comment


PR LGTM, comments have been addressed, thanks a lot for the nice work!

@anaprietonem anaprietonem merged commit 8e11009 into ecmwf:mlflow_azure Nov 5, 2025
11 of 14 checks passed
@github-project-automation github-project-automation bot moved this from To be triaged to Done in Anemoi-dev Nov 5, 2025
@timothyas
Contributor Author

Woohoo! Thanks @anaprietonem! Let me know if you'd like any help with the other PR.

@timothyas timothyas deleted the feature/azure-mlflow branch November 6, 2025 04:25

Projects

Status: Done


6 participants