
Update ORTModule Default Opset Version to 15 #12419

Merged
Lafi7e merged 7 commits into master from weicwang/ortmodule_opset_15
Aug 5, 2022

Conversation

@Lafi7e
Contributor

@Lafi7e Lafi7e commented Aug 2, 2022

Update the ORTModule default opset version to 15. Some models, such as MoE, require the BatchNorm from the new opset version.
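As a minimal sketch of how a default opset with an environment-variable override can work (the variable name `ORTMODULE_ONNX_OPSET_VERSION` is an assumption here, not confirmed by this thread):

```python
import os

# Default opset after this PR; a hedged sketch, not the actual ORTModule code.
DEFAULT_ONNX_OPSET = 15

def resolve_opset() -> int:
    # ORTMODULE_ONNX_OPSET_VERSION is assumed to be the override knob.
    return int(os.environ.get("ORTMODULE_ONNX_OPSET_VERSION", DEFAULT_ONNX_OPSET))

print(resolve_opset())  # 15 when the variable is unset
```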

@lgtm-com

lgtm-com bot commented Aug 2, 2022

This pull request fixes 1 alert when merging a032ffa into 5d1173f - view on LGTM.com

fixed alerts:

  • 1 for Module is imported more than once

@lgtm-com

lgtm-com bot commented Aug 2, 2022

This pull request fixes 1 alert when merging f0d5526 into b39257a - view on LGTM.com

fixed alerts:

  • 1 for Module is imported more than once

```diff
-torch==1.10.0+cu113
+torch==1.11.0+cu113
 torchvision==0.11.1+cu113
 torchtext==0.11.0
```
Contributor

Changes to this file will use torch 1.11.0 for orttrainer tests. This is definitely a change we want to make, but I think it might be more involved since we might need to update a few tests in the orttrainer realm.
Is the intention of this PR to migrate orttrainer tests to a newer torch as well?

Contributor Author

I just followed what Xavier changed when he bumped the version from 12 to 14; he changed this file as well. The problem is that some ORTModule pipelines also run these ORTTrainer tests, and I'm not sure whether there is an issue if I update the torch version but keep the ORTTrainer tests on the older opset version. Since we are no longer using the ORTTrainer Python frontend, I'm guessing that as long as the tests pass, it should be OK?

Contributor Author

The change indeed broke the orttraining-distributed pipeline, so I rolled back the change for orttraining. I actually doubt that we still need the orttraining-distributed pipeline, as it is flaky; I regularly see random failures from it.

Contributor

The way the orttraining-linux-gpu-ci-pipeline works is that we build ort and run all the orttrainer tests with the torch version as specified in this requirements.txt file.
After the tests are done, we uninstall torch and install ortmodule supported version of torch (1.11.0) and then run ortmodule tests. So, if we don't update this file, and only update the opset version for ortmodule, we should be good.
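A sketch of that flow as pipeline steps (step names and test paths are hypothetical, not the actual YAML of orttraining-linux-gpu-ci-pipeline):

```yaml
steps:
  - script: pip install -r requirements.txt           # pins torch==1.10.0+cu113
  - script: python -m pytest orttraining/orttrainer   # hypothetical test path
  - script: |
      pip uninstall -y torch torchvision torchtext
      pip install torch==1.11.0+cu113                 # ORTModule-supported torch
  - script: python -m pytest orttraining/ortmodule    # hypothetical test path
```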

@askhade
Contributor

askhade commented Aug 3, 2022

Given ONNX released opset 17, can we move to 17? Is it a lot of work to be compatible with 17?

@lgtm-com

lgtm-com bot commented Aug 3, 2022

This pull request introduces 1 alert and fixes 2 when merging 5cc866e into 26dc094 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Module is imported more than once
  • 1 for Unused import

@Lafi7e
Contributor Author

Lafi7e commented Aug 3, 2022

> Given ONNX released opset 17, can we move to 17? Is it a lot of work to be compatible with 17?

Torch 1.12 supports only opset <=16, and torch 1.11 supports <=15. For opset 17, we need to wait for the next torch release. We could update to torch 1.12 and opset 16, but I checked the opset 16 changes: they mostly touch Loop and If, or add bfloat16 support, which has limited impact on ORTModule. A big jump in torch version sometimes brings extra testing effort, so maybe we should wait for the next torch release and spend more time on testing before bumping the versions.
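A tiny sketch of the torch-release / maximum-opset pairing described above (values hard-coded as stated in this thread, not queried from torch itself):

```python
# Maximum ONNX opset exportable per torch release, per the discussion above.
MAX_OPSET_BY_TORCH = {
    "1.11": 15,  # ORTModule default after this PR
    "1.12": 16,
}

def max_opset(torch_version: str) -> int:
    # Normalize e.g. "1.11.0+cu113" to "1.11" before the lookup.
    major_minor = ".".join(torch_version.split("+")[0].split(".")[:2])
    return MAX_OPSET_BY_TORCH[major_minor]

print(max_opset("1.11.0+cu113"))  # 15
```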

@lgtm-com

lgtm-com bot commented Aug 3, 2022

This pull request introduces 1 alert and fixes 2 when merging 2f94e45 into 01f3a19 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Module is imported more than once
  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Aug 3, 2022

This pull request introduces 1 alert and fixes 2 when merging c611a25 into 01f3a19 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Module is imported more than once
  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Aug 3, 2022

This pull request introduces 1 alert and fixes 2 when merging bfffa22 into 01f3a19 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Module is imported more than once
  • 1 for Unused import

baijumeswani
baijumeswani previously approved these changes Aug 5, 2022
@mszhanyi
Contributor

mszhanyi commented Aug 5, 2022

I'm curious why the ORT training Python package has CUDA 11.3 and 11.5 workflows.
Are there special rules for the CUDA version?

@Lafi7e
Contributor Author

Lafi7e commented Aug 5, 2022

> I'm curious why ort training py package has cuda 11.3 and 11.5 workflows. are there special rules for cuda version?

Not sure...

@mszhanyi
Contributor

mszhanyi commented Aug 5, 2022

> > I'm curious why ort training py package has cuda 11.3 and 11.5 workflows. are there special rules for cuda version?
>
> Not sure...

If it follows PyTorch: PyTorch has no official release package with CUDA 11.5, and it still ships packages with 10.2 on Linux.

@Lafi7e
Contributor Author

Lafi7e commented Aug 5, 2022

> > > I'm curious why ort training py package has cuda 11.3 and 11.5 workflows. are there special rules for cuda version?
> >
> > Not sure...
>
> if it follows PyTorch, PyTorch hasn't official release package with 11.5, and it still has package with 10.2 on Linux.

PR #11018 introduced this; it said PyTorch has stable releases for 11.3 and 11.5, though on the official site I saw 11.3 and 11.6. I don't think this is related to this PR, though. We can check with @baijumeswani whether we want to change 11.5 to 11.6, but in another PR.

mszhanyi
mszhanyi previously approved these changes Aug 5, 2022
```diff
 CU_VER="11.1"
 ROCM_VER="5.1.1"
-TORCH_VERSION='1.10.0'
+TORCH_VERSION='1.11.0'
```
Contributor

CU_VER and TORCH_VERSION are not the latest; please double-check them.

@Lafi7e Lafi7e dismissed stale reviews from mszhanyi and baijumeswani via a7b0df8 August 5, 2022 05:54
@lgtm-com

lgtm-com bot commented Aug 5, 2022

This pull request introduces 1 alert and fixes 2 when merging a7b0df8 into a7d6290 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Module is imported more than once
  • 1 for Unused import

@Lafi7e Lafi7e merged commit e85e31e into master Aug 5, 2022
@Lafi7e Lafi7e deleted the weicwang/ortmodule_opset_15 branch August 5, 2022 08:55