Fixes order fiasco of device_config_parser_hook_ #172581
ddelgadovargas-cyber wants to merge 2 commits into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172581
Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit d0c9f88 with merge base d74f9ec:

NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
c10/core/AllocatorConfig.h
```
// their own environment configuration extensions.
inline static std::function<void(const std::string&)>
    device_config_parser_hook_{nullptr};
static std::function<void(const std::string&)>& device_config_parser_hook() {
```
Suggested change:

```
- static std::function<void(const std::string&)>& device_config_parser_hook() {
+ static std::function<void(const std::string&)>& getConfigParserHook() {
```
And move this function and its comments below `static const std::unordered_set<std::string>& getKeys() {` on line 252.
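For reference, a minimal sketch of the construct-on-first-use idiom this accessor relies on, using the suggested `getConfigParserHook` name. The surrounding class and the `registerDeviceConfigParserHook` signature here are simplified stand-ins, not the exact PyTorch code:

```
#include <functional>
#include <string>
#include <utility>

struct AllocatorConfigSketch {
  // The function-local static is constructed the first time this accessor
  // runs, so even a caller executing during static initialization of another
  // translation unit gets a fully constructed std::function.
  static std::function<void(const std::string&)>& getConfigParserHook() {
    static std::function<void(const std::string&)> hook{nullptr};
    return hook;
  }

  // Registration assigns through the accessor instead of touching a
  // class-level static directly, avoiding any initialization-order hazard.
  static void registerDeviceConfigParserHook(
      std::function<void(const std::string&)>&& parser) {
    getConfigParserHook() = std::move(parser);
  }
};
```

Because the function-local static is guaranteed to be constructed on first call, any registry object running during static initialization still assigns to a live `std::function`.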
guangyey left a comment:
Thanks for your update!
This PR still needs the final stamp from Meta.

Hi @albanD, could you please help review this PR?

Thanks~
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: 1 job has failed, first few of them are: Claude Code
Details for Dev Infra team: raised by workflow job
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: 1 job has failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx942.1)
Details for Dev Infra team: raised by workflow job
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 3 checks: trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 3, 5, linux.g6.4xlarge.experimental.nvidia.gpu), trunk / linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 4, 5, linux.g6.4xlarge.experimental.nvidia.gpu), trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx942.1). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
If another static object (like `g_device_config_parse_hook_registry_instance`, created by the `REGISTER_ALLOCATOR_CONFIG_PARSE_HOOK` macro) calls `registerDeviceConfigParserHook` before `device_config_parser_hook_` has been constructed, the assignment to it (`operator=`) runs on a not-yet-initialized `std::function`, which leads to a runtime error.
When compiling with `-O1`, I see this issue:
```
src/libcxx/include/__functional/function.h:496:14: runtime error: member access within null pointer of type 'const __policy'
    #0 0x563224e28b78 in operator= crosstool/v18/stable/src/libcxx/include/__functional/function.h:496:14
    #1 0x563224e28b78 in operator= crosstool/v18/stable/src/libcxx/include/__functional/function.h:483:19
    #2 0x563224e28b78 in operator= crosstool/v18/stable/src/libcxx/include/__functional/function.h:727:8
    #3 0x563224e28b78 in c10::CachingAllocator::AcceleratorAllocatorConfig::registerDeviceConfigParserHook(std::__u::function<void (std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>> const&)>&&, std::__u::unordered_set<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>, std::__u::hash<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::equal_to<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::allocator<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>> const&) torch/c10/core/AllocatorConfig.h:263:32
    #4 0x563224e28e9d in DeviceConfigParserHookRegistry torch/c10/core/AllocatorConfig.h:369:5
    #5 0x563224e28e9d in __cxx_global_var_init.34 torch/c10/cuda/CUDAAllocatorConfig.cpp:195:1
    #6 0x563224e28e9d in _GLOBAL__sub_I_CUDAAllocatorConfig.cpp torch/c10/cuda/CUDAAllocatorConfig.cpp
    #7 0x5632459709ac in __libc_csu_init /usr/grte/v5/debug-src/src/csu/elf-init.c:88:7
    #8 0x7f748b9562e7 in __libc_start_main (/usr/grte/v5/lib64/libc.so.6+0x612e7) (BuildId: ca23ec6d935352118622ce674a8bb52d)
    #9 0x5632018f3729 in _start /usr/grte/v5/debug-src/src/csu/../sysdeps/x86_64/start.S:120
```
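To make the failure mode concrete, here is a minimal two-translation-unit sketch of the pattern described above. All names (`ConfigSketch`, `RegistrySketch`, `g_registry_sketch`) are hypothetical stand-ins; the real code lives in `c10/core/AllocatorConfig.h` and `c10/cuda/CUDAAllocatorConfig.cpp`. Note that within a single translation unit initialization order is well defined, so the crash only appears when the registry global sits in a different translation unit whose initializers happen to run first:

```
// allocator_config_sketch.h -- hypothetical stand-in for AllocatorConfig.h
#include <functional>
#include <string>
#include <utility>

struct ConfigSketch {
  // An inline static std::function has dynamic initialization: its
  // constructor runs at some unspecified point during static init.
  inline static std::function<void(const std::string&)> hook_{nullptr};

  static void registerHook(std::function<void(const std::string&)>&& f) {
    // If this runs before hook_'s constructor, operator= touches a
    // not-yet-constructed std::function -- the null '__policy' access
    // reported in the trace above.
    hook_ = std::move(f);
  }
};

// registry_sketch.cpp -- hypothetical stand-in for CUDAAllocatorConfig.cpp
struct RegistrySketch {
  RegistrySketch() {
    ConfigSketch::registerHook([](const std::string& /*env*/) {
      // device-specific config parsing would go here
    });
  }
};

// A global in another translation unit: C++ does not order its dynamic
// initialization relative to ConfigSketch::hook_'s, so this constructor
// may run first.
static RegistrySketch g_registry_sketch;
```

The function-local-static accessor shown in the review snippet above removes this hazard, since the hook is then constructed on first use rather than at an unspecified point during static initialization.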
Pull Request resolved: #172581
Approved by: https://github.com/guangyey, https://github.com/albanD