Update ORTModule Default Opset Version to 15#12419
Conversation
This pull request fixes 1 alert when merging a032ffa into 5d1173f - view on LGTM.com.
torch==1.10.0+cu113
torchvision==0.11.1+cu113
torchtext==0.11.0
torch==1.11.0+cu113
Changes to this file will make the orttrainer tests use torch 1.11.0. This is definitely a change we want to make, but I think it might be more involved, since we might need to update a few tests in the orttrainer realm.
Is the intention of this PR to migrate the orttrainer tests to a newer torch as well?
I just followed what Xavier changed when he bumped the version from 12 to 14, and he also changed this file. The problem is that some ORTModule pipelines also run these ORTTrainer tests, and I'm not sure whether updating the torch version while keeping the ORTTrainer tests on the older opset version would cause issues. Since we are no longer using the ORTTrainer Python frontend, I'm guessing that as long as the tests pass, it should be OK?
The change indeed broke the orttraining-distributed pipeline, so I rolled back the change for orttraining. I actually doubt whether we still need the orttraining-distributed pipeline, as it's flaky; I regularly see random failures from it.
The way the orttraining-linux-gpu-ci-pipeline works is that we build ORT and run all the orttrainer tests with the torch version specified in this requirements.txt file.
After those tests are done, we uninstall torch, install the ORTModule-supported version of torch (1.11.0), and then run the ORTModule tests. So if we don't update this file and only update the opset version for ORTModule, we should be good.
Given that ONNX released opset 17, can we move to 17? Is it a lot of work to be compatible with 17?
This pull request introduces 1 alert and fixes 2 when merging 5cc866e into 26dc094 - view on LGTM.com.
Torch 1.12 supports only opset <= 16, and torch 1.11 supports opset <= 15. For opset 17, we need to wait for the next torch release. We could update to torch 1.12 and opset 16, but I checked the opset 16 changes: they are mostly Loop or If updates, or bfloat16 support, which have limited impact on ORTModule. Changing the torch version by a big jump sometimes brings extra testing effort, so maybe we should wait for the next torch release and spend more time on testing before bumping the versions.
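The torch/opset compatibility constraint described above can be sketched as a small lookup table. The version pairs for torch 1.11 and 1.12 come from this thread; the 1.10 entry is inferred from the earlier opset-14 bump mentioned above, and the helper itself is illustrative, not an ONNX Runtime API:

```python
# Hypothetical sketch of the torch -> max-supported-ONNX-opset mapping
# discussed in this thread. Not an actual ORT or torch API.
MAX_OPSET_BY_TORCH = {
    (1, 10): 14,  # inferred from the earlier opset-14 default
    (1, 11): 15,  # per this thread: torch 1.11 supports <= opset 15
    (1, 12): 16,  # per this thread: torch 1.12 supports <= opset 16
}

def max_supported_opset(torch_version: str) -> int:
    """Return the highest ONNX opset the given torch release can export."""
    major, minor = (int(p) for p in torch_version.split(".")[:2])
    return MAX_OPSET_BY_TORCH[(major, minor)]
```

Under this mapping, bumping ORTModule's default to opset 15 keeps it within what the pinned torch 1.11.0 can export.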
I'm curious why the ORT training Python package has CUDA 11.3 and 11.5 workflows.
Not sure...
If it follows PyTorch: PyTorch has no official release package with CUDA 11.5, and it still has a package with 10.2 on Linux.
PR #11018 introduced this; it said PyTorch has stable releases for 11.3 and 11.5, though on the official site I saw 11.3 and 11.6. But I think this is not related to this PR. We can check with @baijumeswani whether we want to change 11.5 to 11.6, but in another PR.
CU_VER="11.1"
ROCM_VER="5.1.1"
TORCH_VERSION='1.10.0'
TORCH_VERSION='1.11.0'
CU_VER and TORCH_VERSION are not the latest; please double check.
Update the ORTModule default opset version to 15. Some models, such as MoE, require the BatchNorm from the new opset version.
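As a sketch of what "default opset with an override" could look like, here is a minimal resolver. The environment-variable name and the helper function are assumptions for illustration; check the ORTModule source for the actual override hook:

```python
import os

DEFAULT_OPSET = 15  # the new ORTModule default introduced by this PR

def resolve_opset(env=os.environ) -> int:
    """Pick the ONNX opset for export: an env override if set, else the default.

    ORTMODULE_ONNX_OPSET_VERSION is a hypothetical variable name used here
    only to illustrate the default-plus-override pattern.
    """
    override = env.get("ORTMODULE_ONNX_OPSET_VERSION")
    return int(override) if override is not None else DEFAULT_OPSET
```

With no override set, exports would use opset 15; setting the variable (e.g. to 14) would let a pipeline pin the older opset without code changes.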