Update ORTModule Default Opset Version to 15#12419
Conversation
This pull request fixes 1 alert when merging a032ffa into 5d1173f - view on LGTM.com.
torch==1.10.0+cu113
torchvision==0.11.1+cu113
torchtext==0.11.0
torch==1.11.0+cu113
Changes to this file will make the orttrainer tests use torch 1.11.0. This is definitely a change we want to make, but I think it might be more involved, since we might need to update a few tests in the orttrainer realm.
Is the intention of this PR to migrate the orttrainer tests to a newer torch as well?
I just followed what Xavier changed when he bumped the version from 12 to 14, and he also changed this file. The problem is that some ORTModule pipelines also run these ORTTrainer tests, and I'm not sure whether updating the torch version while keeping the ORTTrainer tests on the older opset version would cause issues. Since we are no longer using the ORTTrainer Python frontend, I'm guessing that as long as the tests pass, it should be OK?
The change indeed broke the orttraining-distributed pipeline, so I rolled back the change for orttraining. I actually doubt whether we still need the orttraining-distributed pipeline, as it's flaky; I regularly see random failures from it.
The way the orttraining-linux-gpu-ci-pipeline works is that we build ORT and run all the orttrainer tests with the torch version specified in this requirements.txt file.
After those tests are done, we uninstall torch, install the ORTModule-supported version of torch (1.11.0), and then run the ORTModule tests. So if we don't update this file and only update the opset version for ORTModule, we should be good.
Given that ONNX released opset 17, can we move to 17? Is it a lot of work to be compatible with 17?
This pull request introduces 1 alert and fixes 2 when merging 5cc866e into 26dc094 - view on LGTM.com.
Torch 1.12 supports only opset <= 16, and torch 1.11 supports opset <= 15. For opset 17, we need to wait for the next torch release. We could update to torch 1.12 and opset 16, but I checked the opset 16 changes: they are mostly Loop or If updates, or bfloat16 support, which have limited impact on ORTModule. Changing the torch version by a big jump sometimes brings extra testing effort, so maybe we should wait for the next torch release and spend more time on testing before bumping the versions.
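The torch/opset compatibility constraint described above can be sketched as a small lookup table. The version pairs for torch 1.11 and 1.12 come from this thread; the 1.10 entry is inferred from the earlier opset-14 bump mentioned above, and the helper itself is illustrative, not an ONNX Runtime API:

```python
# Hypothetical sketch of the torch -> max-supported-ONNX-opset mapping
# discussed in this thread. Not an actual ORT or torch API.
MAX_OPSET_BY_TORCH = {
    (1, 10): 14,  # inferred from the earlier opset-14 default
    (1, 11): 15,  # per this thread: torch 1.11 supports <= opset 15
    (1, 12): 16,  # per this thread: torch 1.12 supports <= opset 16
}

def max_supported_opset(torch_version: str) -> int:
    """Return the highest ONNX opset the given torch release can export."""
    major, minor = (int(p) for p in torch_version.split(".")[:2])
    return MAX_OPSET_BY_TORCH[(major, minor)]
```

Under this mapping, bumping ORTModule's default to opset 15 keeps it within what the pinned torch 1.11.0 can export.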
I'm curious why the ORT training Python package has CUDA 11.3 and 11.5 workflows.
Not sure...
If it follows PyTorch: PyTorch has no official release package with CUDA 11.5, and it still has a package with 10.2 on Linux.
PR #11018 introduced this; it said PyTorch has stable releases for 11.3 and 11.5, though on the official site I saw 11.3 and 11.6. But I think this is not related to this PR. We can check with @baijumeswani whether we want to change 11.5 to 11.6, but in another PR.
CU_VER="11.1"
ROCM_VER="5.1.1"
TORCH_VERSION='1.10.0'
TORCH_VERSION='1.11.0'
CU_VER and TORCH_VERSION are not the latest; please double check.
Update the ORTModule default opset version to 15. Some models, such as MoE, require the BatchNorm from the new opset version.
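As a sketch of what "default opset with an override" could look like, here is a minimal resolver. The environment-variable name and the helper function are assumptions for illustration; check the ORTModule source for the actual override hook:

```python
import os

DEFAULT_OPSET = 15  # the new ORTModule default introduced by this PR

def resolve_opset(env=os.environ) -> int:
    """Pick the ONNX opset for export: an env override if set, else the default.

    ORTMODULE_ONNX_OPSET_VERSION is a hypothetical variable name used here
    only to illustrate the default-plus-override pattern.
    """
    override = env.get("ORTMODULE_ONNX_OPSET_VERSION")
    return int(override) if override is not None else DEFAULT_OPSET
```

With no override set, exports would use opset 15; setting the variable (e.g. to 14) would let a pipeline pin the older opset without code changes.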