ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice #15618

Merged

jslhcl merged 6 commits into main from leca/OrtMemoryInfo2OrtDevice

May 1, 2023

Contributor

jslhcl commented Apr 21, 2023

Description

ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice

Motivation and Context

Currently, a “Location” is represented as OrtMemoryInfo, which is OrtDevice + OrtMemType, while OrtDevice is itself represented as DeviceType + DeviceId + MemType. This hierarchy is unnecessary, so the proposal is to define Location clearly by using OrtDevice as its abstraction.
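To make the layering concrete, here is a minimal sketch of the two representations. All type names ending in `Sketch` are hypothetical stand-ins, not the real ONNX Runtime declarations. The numeric values in the comments for CPU (0), FPGA (2) and pinned memory (1) are taken from this thread; GPU = 1 and default memory = 0 are assumptions.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-ins; the "Sketch" names are not ONNX Runtime types.
struct DeviceSketch {
  std::int8_t device_type;  // 0 = CPU, 1 = GPU (assumed), 2 = FPGA
  std::int8_t mem_type;     // 0 = default (assumed), 1 = pinned
  std::int16_t device_id;   // device ordinal

  bool operator==(const DeviceSketch& other) const {
    return device_type == other.device_type && mem_type == other.mem_type &&
           device_id == other.device_id;
  }
};

enum class MemTypeSketch { kCpuInput, kCpuOutput, kDefault };

// Before the refactor: a location is a device plus a second memory-type tag,
// even though the device already carries its own mem_type field.
struct MemoryInfoSketch {
  DeviceSketch device;
  MemTypeSketch mem_type;
};

// After the refactor: the device alone names a location.
using LocationSketch = DeviceSketch;
```

With this shape, two locations are the same exactly when their three device fields match, which is what allocator lookups can key on.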


          replace OrtMemoryInfo with OrtDevice

9bd7286

jslhcl requested review from RandySheriffH and souptc

April 21, 2023 00:43

Contributor Author

jslhcl commented Apr 21, 2023

static const DeviceType FPGA = 2;

No FPGA device in EP?

Refers to: include/onnxruntime/core/framework/ortdevice.h:18 in 9bd7286.

souptc reviewed


include/onnxruntime/core/framework/execution_provider.h Outdated

                     metadef_id_generator_ = std::make_unique<ModelMetadefIdGenerator>();
                   }
                 }
+                OrtDevice default_device_;

Member

souptc Apr 21, 2023 •

edited by jslhcl


this should be private?

and please add comments for the new member/method. #Resolved

Contributor Author

jslhcl Apr 21, 2023

it is protected.

Comments added

Lei Cao and others added 4 commits

April 21, 2023 14:25


          resolve comments

53bad43


          info_ is not initialized in EP's constructor

f20647e


          resolve conflict

3a20c5d


          DML EP use GPU as default device

6dc807a

jslhcl marked this pull request as ready for review

April 25, 2023 21:31

Contributor Author

jslhcl commented Apr 26, 2023

/azp run Linux Android Emulator QNN CI Pipeline

azure-pipelines bot commented Apr 26, 2023

Azure Pipelines successfully started running 1 pipeline(s).

souptc previously approved these changes


Member

souptc left a comment

Member

souptc commented Apr 28, 2023

static const DeviceType FPGA = 2;

FPGA is reserved for some first-party (1P) hardware.

In reply to: 1517106603

Refers to: include/onnxruntime/core/framework/ortdevice.h:18 in 9bd7286.


          resolve conflict

1e2eb26

jslhcl dismissed souptc’s stale review via

1e2eb26

April 30, 2023 15:49

Contributor Author

jslhcl commented Apr 30, 2023

/azp run Android CI Pipeline

azure-pipelines bot commented Apr 30, 2023

Azure Pipelines successfully started running 1 pipeline(s).

Contributor Author

jslhcl commented May 1, 2023

/azp run Windows GPU CI Pipeline

azure-pipelines bot commented May 1, 2023

Azure Pipelines successfully started running 1 pipeline(s).

souptc approved these changes


Member

souptc left a comment

jslhcl merged commit d58fa98 into main

jslhcl deleted the leca/OrtMemoryInfo2OrtDevice branch

May 1, 2023 17:06

fs-eire mentioned this pull request

[JSEP] fix constructor for OrtDevice #15805

Merged

fs-eire added a commit that referenced this pull request


          [JSEP] fix constructor for OrtDevice (#15805)

df7424e

### Description
Add the missing `OrtDevice` initialization in JSEP introduced by #15618

pengwa mentioned this pull request

Fix segfault for multiple GPU run (regression) #15823

Merged

pengwa added a commit that referenced this pull request


          Fix segfault for multiple GPU run (regression) (#15823)

dfac096

### Fix segfault for multiple GPU run

#15618 introduced `GetOrtDeviceByMemType`. The intent was presumably to
handle the CPU device separately in the `if` branch, but that branch
mistakenly passes the default (non-CPU) device's id to the CPU device.


```
OrtDevice CUDAExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const {
  if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) {
    return OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, default_device_.Id());
  }
  return default_device_;
}
```

We observed a segmentation fault when running multi-GPU training:

`
CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch
--nproc_per_node=2
examples/onnxruntime/training/language-modeling/run_mlm.py
--model_name_or_path distilbert-base-uncased --dataset_name wikitext
--dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10
--per_device_train_batch_size 8 --per_device_eval_batch_size 8
--do_train --do_eval --overwrite_output_dir --output_dir ./outputs222/
--seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps
400 --logging_steps 1
`

GPU0 works fine, but GPU1 throws a segmentation fault. Looking further,
a Shape node trying to allocate its output tensor fetches the
corresponding allocator with OrtDevice(Device:[DeviceType:0 MemoryType:1
DeviceId:1]). The CPU device has no device id = 1, so no allocator is
returned. When `AsStreamBasedAllocator` is then called on that
allocator, the segmentation fault occurs because no null check was done there.
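The failure mode described above can be reproduced in miniature. The sketch below is not ONNX Runtime code: `DeviceKey`, `AllocatorSketch`, `RegisterAllocators` and the two `CpuDevice*` helpers are hypothetical, and the "fixed" mapping (always use id 0 for the CPU device) is an assumption about the shape of the fix rather than a quote from it. The sketch assumes each process registers one pinned CPU allocator with id 0.

```cpp
#include <map>
#include <memory>
#include <tuple>

// Hypothetical lookup key mirroring DeviceType / MemoryType / DeviceId.
struct DeviceKey {
  int type;
  int mem_type;
  int id;
  bool operator<(const DeviceKey& o) const {
    return std::tie(type, mem_type, id) < std::tie(o.type, o.mem_type, o.id);
  }
};

struct AllocatorSketch {
  DeviceKey device;
};

using AllocatorMap = std::map<DeviceKey, std::shared_ptr<AllocatorSketch>>;

constexpr int kCpu = 0, kGpu = 1;
constexpr int kDefaultMem = 0, kPinnedMem = 1;

// Assumption: a process bound to GPU `gpu_id` registers a device allocator for
// that ordinal and a single pinned CPU allocator whose id is always 0.
AllocatorMap RegisterAllocators(int gpu_id) {
  AllocatorMap m;
  DeviceKey gpu{kGpu, kDefaultMem, gpu_id};
  DeviceKey pinned{kCpu, kPinnedMem, 0};
  m[gpu] = std::make_shared<AllocatorSketch>(AllocatorSketch{gpu});
  m[pinned] = std::make_shared<AllocatorSketch>(AllocatorSketch{pinned});
  return m;
}

// Buggy mapping: reuses the GPU ordinal as the CPU device id.
DeviceKey CpuDeviceBuggy(int gpu_id) { return {kCpu, kPinnedMem, gpu_id}; }

// Assumed fix: the CPU device id is always 0, whatever the GPU ordinal is.
DeviceKey CpuDeviceFixed(int /*gpu_id*/) { return {kCpu, kPinnedMem, 0}; }

// Returns nullptr on a miss; callers must check before dereferencing.
std::shared_ptr<AllocatorSketch> FindAllocator(const AllocatorMap& m,
                                               const DeviceKey& key) {
  auto it = m.find(key);
  return it == m.end() ? nullptr : it->second;
}
```

On GPU 0 the buggy key happens to equal the registered one, which matches the observation that only GPU1 crashed. A null check at the call site turns the remaining miss into a reportable error instead of a crash.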




ShukantPal pushed a commit to ShukantPal/onnxruntime that referenced this pull request


          ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice (…

31d6fbd

…microsoft#15618)


---------

Co-authored-by: Lei Cao <[email protected]>

ShukantPal pushed a commit to ShukantPal/onnxruntime that referenced this pull request


          [JSEP] fix constructor for OrtDevice (microsoft#15805)

420c9e4


ShukantPal pushed a commit to ShukantPal/onnxruntime that referenced this pull request


          Fix segfault for multiple GPU run (regression) (microsoft#15823)

457349a


fs-eire added a commit that referenced this pull request


          [JSEP] fix constructor for OrtDevice (#15805)

98166e9


jslhcl mentioned this pull request

change the EP device to default OrtDevice() for memoryType equals CPUInput #15903

Merged

jslhcl added a commit that referenced this pull request


          change the EP device to default OrtDevice() for memoryType equals CPU…

3b8f3a0

…Input (#15903)

### Description
Change the EP device to the default OrtDevice() when memoryType equals
CPUInput, for the CUDA, ROCm, MIGraphX and TensorRT EPs.


### Motivation and Context
My previous PR (#15618) caused random failures in the CUDA training test
GradientCheckerTest.TileGrad (see build
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=986784&view=logs&j=5076e696-f193-5f12-2d8a-703dda41a79b&t=a3824a7c-2162-5e3d-3fdd-8cf808834fbb)
and in this ROCm test:

root@a59558217e53:/workspace# pytest
orttraining/orttraining/test/python/orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax
... 
E RuntimeError: Error in backward pass execution: Non-zero status code
returned while running ATen node.
Name:'/_original_module/ATen_Grad/ATen_1' Status Message: Storage size
calculation overflowed with sizes=[72340172838076673, 72340172838076673,
128]

A potential reason: when the memType of the CUDA/TensorRT/ROCm/MIGraphX
EP is CPUInput, the corresponding device in the IAllocator's memoryInfo
was previously the default OrtDevice(), whereas after my change it became
OrtDevice(CPU, xx_PINNED, 0).

Changing it back fixed GradientCheckerTest.TileGrad in the Windows GPU
training build.
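The difference between the two mappings can be sketched as follows. This is an illustration under assumptions, not the actual EP code: `DeviceSketch`, `MemTypeSketch` and both `DeviceAfter*` functions are hypothetical names, a default-constructed device is assumed to be (CPU, default memory, id 0), and leaving CPUOutput on pinned memory is also an assumption.

```cpp
// Hypothetical stand-in for a device triple; defaults model a default-constructed device.
struct DeviceSketch {
  int device_type;
  int mem_type;
  int device_id;
  explicit DeviceSketch(int t = 0, int m = 0, int i = 0)
      : device_type(t), mem_type(m), device_id(i) {}
  bool operator==(const DeviceSketch& o) const {
    return device_type == o.device_type && mem_type == o.mem_type &&
           device_id == o.device_id;
  }
};

enum class MemTypeSketch { kCpuInput, kCpuOutput, kDefault };

constexpr int kCpu = 0;
constexpr int kPinnedMem = 1;

// Behaviour described for #15618: both CPU memory types map to pinned CPU memory.
DeviceSketch DeviceAfter15618(MemTypeSketch m, const DeviceSketch& default_device) {
  if (m == MemTypeSketch::kCpuInput || m == MemTypeSketch::kCpuOutput) {
    return DeviceSketch(kCpu, kPinnedMem, 0);
  }
  return default_device;
}

// Behaviour described for #15903: CPUInput maps back to the default device.
DeviceSketch DeviceAfter15903(MemTypeSketch m, const DeviceSketch& default_device) {
  if (m == MemTypeSketch::kCpuInput) {
    return DeviceSketch();  // plain CPU memory, as before #15618
  }
  if (m == MemTypeSketch::kCpuOutput) {
    return DeviceSketch(kCpu, kPinnedMem, 0);  // assumption: left unchanged
  }
  return default_device;
}
```

Because the two CPU devices differ only in `mem_type`, any code that compares a tensor's device against a default-constructed device would see a different answer for CPUInput under the first mapping.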

pengwa added the release:1.15 label

prathikr pushed a commit that referenced this pull request


          [JSEP] fix constructor for OrtDevice (#15805)

2ce04aa


prathikr pushed a commit that referenced this pull request


          Fix segfault for multiple GPU run (regression) (#15823)

a942fae


prathikr pushed a commit that referenced this pull request


          change the EP device to default OrtDevice() for memoryType equals CPU…

c6f36a3

…Input (#15903)


fs-eire mentioned this pull request

fix transpose optimizer on GPU EP #15988

Merged

snnn removed the release:1.15 label

snnn pushed a commit that referenced this pull request


          [JSEP] fix constructor for OrtDevice (#15805)

9e6089a


snnn pushed a commit that referenced this pull request


          Fix segfault for multiple GPU run (regression) (#15823)

6b2013c


snnn pushed a commit that referenced this pull request


          change the EP device to default OrtDevice() for memoryType equals CPU…

e684531

…Input (#15903)


snnn pushed a commit that referenced this pull request


          [JSEP] fix constructor for OrtDevice (#15805)

2ff7fa2


snnn pushed a commit that referenced this pull request


          Fix segfault for multiple GPU run (regression) (#15823)

bac66d5


snnn pushed a commit that referenced this pull request


          change the EP device to default OrtDevice() for memoryType equals CPU…

b41c3c3

…Input (#15903)


fs-eire added a commit that referenced this pull request


          fix transpose optimizer on GPU EP (#15988)

dc06c25

### Description
Because of #15618, the default allocator changed to the device allocator,
which is GPU rather than CPU. The transpose optimizer expects to read
data from initializers, so a CPU allocator is required here.

This change fixes the transpose optimizer on GPU EP.

Fixes the issue referred to in #15869, #15796
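The idea of asking for a CPU allocator explicitly, instead of taking whichever allocator is the EP's default, can be sketched like this. The types and the `PickCpuAllocator` helper are hypothetical and do not reflect how the transpose optimizer actually obtains its allocator.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-ins, not ONNX Runtime types.
enum class DeviceTypeSketch { kCpu, kGpu };

struct AllocatorSketch {
  DeviceTypeSketch device_type;
};

using AllocatorPtr = std::shared_ptr<AllocatorSketch>;

// Returns the first allocator backed by host (CPU) memory, or nullptr if the
// list has none. The "default" allocator is taken to be the first entry, so on
// a GPU EP it would be a device allocator and unsuitable for reading
// initializer data on the host.
AllocatorPtr PickCpuAllocator(const std::vector<AllocatorPtr>& allocators) {
  for (const auto& alloc : allocators) {
    if (alloc && alloc->device_type == DeviceTypeSketch::kCpu) {
      return alloc;
    }
  }
  return nullptr;
}
```

Selecting by device type keeps the caller independent of which allocator an EP happens to register first.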
