Fix segfault for multiple GPU run (regression)#15823

Merged
pengwa merged 3 commits into main from pengwa/fix_seg_fault
May 6, 2023
Conversation

@pengwa
Contributor

@pengwa pengwa commented May 5, 2023

Fix segfault for multiple GPU run

#15618 introduced `GetOrtDeviceByMemType`. The intent was to handle the CPU device differently in the `if` branch, but it mistakenly passes along the default non-CPU device id:

```
OrtDevice CUDAExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const {
  if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) {
    return OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, default_device_.Id());
  }
  return default_device_;
}
```

We observed a segfault when running multi-GPU training:

```
CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=2 examples/onnxruntime/training/language-modeling/run_mlm.py --model_name_or_path distilbert-base-uncased --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --overwrite_output_dir --output_dir ./outputs222/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 400 --logging_steps 1
```

We found that GPU0 works fine while GPU1 throws a segfault. Looking further: a Shape node tries to allocate its output tensor and fetches the corresponding allocator with ORTDevice(Device:[DeviceType:0 MemoryType:1 DeviceId:1]); since no CPU device was registered with device id = 1, a null allocator is returned. When `AsStreamBasedAllocator` is then called on that allocator, the segfault happens because no null check was done there.


@satyajandhyala
Contributor

Can we add a test case for this scenario?

@pengwa
Contributor Author

pengwa commented May 6, 2023

> Can we add a test case for this scenario?

We should have one, will add one later.

@pengwa pengwa merged commit dfac096 into main May 6, 2023
@pengwa pengwa deleted the pengwa/fix_seg_fault branch May 6, 2023 00:48
ShukantPal pushed a commit to ShukantPal/onnxruntime that referenced this pull request May 7, 2023
### Fix segfault for multiple GPU run

pengwa added a commit that referenced this pull request May 9, 2023
### Add CPU allocation test for non-CPU devices distributed run

When the CUDA EP is enabled in distributed training, CPU memory is still used for some node outputs. We already had distributed-run test coverage, but it did not cover the case where some nodes use CPU devices for storing tensor output. As a result, we hit this regression twice in the past months:
- #14050
- #15823

So we add this test to avoid future regressions.

The test graph looks like this:


![image](https://user-images.githubusercontent.com/10530022/236594940-70c68a55-18bf-4e09-bbf5-8a64895d3045.png)



prathikr pushed a commit that referenced this pull request May 16, 2023
### Fix segfault for multiple GPU run

prathikr pushed a commit that referenced this pull request May 16, 2023
### Add CPU allocation test for non-CPU devices distributed run

@snnn snnn added the triage:approved Approved for cherrypicks for release label May 18, 2023
snnn pushed a commit that referenced this pull request May 19, 2023
### Fix segfault for multiple GPU run

snnn pushed a commit that referenced this pull request May 19, 2023
### Fix segfault for multiple GPU run

snnn pushed a commit that referenced this pull request May 19, 2023
### Description
Cherry-picks 26 commits to the release branch. 
Most cherry-picks are clean merges. Except:

1. When I got conflicts in cgmanifest.json and download-deps.yml, I chose to ignore the conflicts and regenerate the two files.
2. There were some conflicts in cmake/deps.txt and onnxruntime_c_api.cc.


PR list:

[js/webgpu] fix Transpose with non-float tensor (#15819)
[js/web] fix terser reserved symbols for worker (#15864)
[JSEP] fix constructor for OrtDevice (#15805)
Bump engine.io from 6.4.1 to 6.4.2 in /js/web (#15799)
Bump engine.io from 6.4.0 to 6.4.2 in /onnxruntime/test/wasm (#15798)
[wasm] revert emsdk to v3.1.19 (#15793)
[wasm/JSEP] add threaded build to artifacts (#15777)
[js/web] add target ort.webgpu.min.js (#15780)
update ort extensions to 94142d8391c9791ec71c38336436319a2d4ac7a0 (#15688)
fix: setting builder optimization level to TRT 8.6 default (#15897)
Adust GetVersionString() GetBuildInfoString() signatures and move them to OrtApi (#15921)
Fix segfault for multiple GPU run (regression) (#15823)
android package fix (#15999)
[CoreML EP] Minor changes to allow CoreML EP to handle more nodes and models. (#15993)
Adding support for conv fp16 fusion on Resnet50v1 (#15474)
update onnx release 1.14 for docker files (#15680)
Avoid generating training documentation during packaging (#15795)
Update Conv-Add-Relu Fusion Transformation (#15834)
Fix symbolic shape infer empty value_info (#15842)
NhwcFusedConv: Add before Activation (#15837)
use __hmul2 instead of __hmul2_rn (#15852)
change the EP device to default OrtDevice() for memoryType equals CPU Input (#15903)
Fixing NhwcFusedConv fp16 (#15950)
fix topo sort in quantization tool (#16003)
[doc] add LeakyRelu to coreml supported ops (#15944)
[DML EP] Add frequent upload heap flushing (#15960)

Co-authored-by: Yulong Wang 
Co-authored-by: dependabot[bot] 
Co-authored-by: Guenther Schmuelling 
Co-authored-by: Shalva Mist 
Co-authored-by: Maximilian Müller 
Co-authored-by: Dmitri Smirnov 
Co-authored-by: pengwa 
Co-authored-by: Ashwini Khade 
Co-authored-by: Edward Chen 
Co-authored-by: Jian Chen 
Co-authored-by: liqun Fu 
Co-authored-by: Baiju Meswani 
Co-authored-by: Tianlei Wu 
Co-authored-by: Chen Fu 
Co-authored-by: Ye Wang 
Co-authored-by: cao lei 
Co-authored-by: Yufeng Li 
Co-authored-by: Rachel Guo 
Co-authored-by: Patrice Vignola
@snnn snnn removed triage:approved Approved for cherrypicks for release release:1.15 labels May 19, 2023
preetha-intel pushed a commit to intel/onnxruntime that referenced this pull request Jun 7, 2023