forked from ROCm/pytorch
[pg_rocm7.1_internal_testing][autogenerated] Upstream IFU on 08132025 #2
Closed: pragupta wants to merge 344 commits into pg_rocm7.1_internal_testing from pg_rocm7.1_internal_testing_IFU_08132025
Conversation
…tion plus tests from pytorch#125438 (pytorch#157786)" This reverts commit 3a2c3c8. Reverted pytorch#157786 on behalf of https://github.com/albanD due to Breaks lint ([comment](pytorch#157786 (comment)))
…Handle. (pytorch#159989) Summary: Today users outside of pytorch core cannot `#include <torch/nativert/ModelRunner.h>`. It turns out that we should place a header inside `torch/csrc/api/include/`. Placing every single nativert header there would pollute the namespace a lot, which is not what we want in general. Therefore we just create a Handle type which holds a pointer, decoupling the actual type from the header definition. Test Plan: CI Rollback Plan: Differential Revision: D79751098 Pull Request resolved: pytorch#159989 Approved by: https://github.com/dolpm
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#158340 Approved by: https://github.com/seemethere Co-authored-by: xinan.lin <[email protected]>
The test filter was wrong; it should not start with "test/". Test Plan: - wait for CI - tested locally with `python test/run_test.py --einops --verbose` Pull Request resolved: pytorch#159776 Approved by: https://github.com/atalman, https://github.com/StrongerXi
numba currently doesn't build from source due to numba/numba#10073 Pull Request resolved: pytorch#158636 Approved by: https://github.com/malfet
**Summary** Some thoughts on view-op and `_StridedShard` interaction: 1. `_StridedShard` has no impact on sharding (i.e. how tensor is partitioned) compared to `Shard`. It only changes how shards permute across the devices. 2. `view()` op on DTensor strictly forbids shard redistribution which means if `view()` may cause shard permutation across devices, it should be rejected. This is enforced in today's sharding prop for `view()`. 3. Since DTensor `view()` won't introduce any redistribution, it's certain that `placements` won't change except the inner `dim` attribute of `Shard` or `_StridedShard`. Therefore, to support `_StridedShard` in `view()` op, the only change required is to keep `_StridedShard` as `_StridedShard` in the output spec. **Test** `pytest test/distributed/tensor/test_view_ops.py` Pull Request resolved: pytorch#159656 Approved by: https://github.com/wconstab
Disables the job on PRs completely, so that we don't litter people's CI signals and use machines unnecessarily. If you want to run these xla tests, add the ciflow/unstable label to your PR Pull Request resolved: pytorch#159272 Approved by: https://github.com/atalman, https://github.com/malfet
…)" This reverts commit 4604f04. Reverted pytorch#155200 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](pytorch#138222 (comment)))
This reverts commit 15f1173. Reverted pytorch#152932 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](pytorch#138222 (comment)))
)" This reverts commit f7a66da. Reverted pytorch#138222 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](pytorch#138222 (comment)))
Fixed `test_dynamo_timed`: <img width="1030" height="389" alt="image" src="https://github.com/user-attachments/assets/02d84dd8-6a65-4f91-8d4c-48ba0a81fac1" /> Pull Request resolved: pytorch#159981 Approved by: https://github.com/angelayi
Unify the inductor debug build, following @desertfire's suggestion: pytorch#159938 (review) Pull Request resolved: pytorch#159998 Approved by: https://github.com/angelayi
Summary: Updated README with code structure and explanation of core features within profiler Test Plan: N/A Rollback Plan: Differential Revision: D79604189 Pull Request resolved: pytorch#159816 Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
…on (pytorch#155360) (pytorch#158983) Pull Request resolved: pytorch#158983 Approved by: https://github.com/eellison ghstack dependencies: pytorch#158758
Update HF components to not inherit from fsspec components and instead use the filesystem writer/reader. There doesn't seem to be much of a need for fsspec, since users are using mounted storage. Using local storage allows for performance improvements because we can take advantage of the safe_open API provided by HF safetensors (30s vs 4s to load an 8B model), a significant win over reading bytes and converting them to tensors, which is what we are doing now. Also, we can use the official methods provided by HF instead of relying on reading the metadata by bytes and loading it. Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/) Pull Request resolved: pytorch#159405 Approved by: https://github.com/saumishr
…9406) Reading the bytes and converting to tensors is much slower than using safe_open. For a 8B model across 8 ranks, took ~30s to load before this change and ~4s after. Differential Revision: [D78994259](https://our.internmc.facebook.com/intern/diff/D78994259/) Pull Request resolved: pytorch#159406 Approved by: https://github.com/saumishr ghstack dependencies: pytorch#159405
Get rid of the logic to read the metadata from the header of the safetensors file manually and use the functions provided by safe_open() to get the metadata. This is much cleaner and allows us to not rely on our own custom methods to get metadata, but to use the safetensors-provided APIs. Differential Revision: [D79460272](https://our.internmc.facebook.com/intern/diff/D79460272/) Pull Request resolved: pytorch#159681 Approved by: https://github.com/saumishr ghstack dependencies: pytorch#159405, pytorch#159406
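As a rough illustration of the safe_open-based flow described in these three changes, a minimal sketch follows (the file name and tensor keys are illustrative, not taken from the PRs):
```
from safetensors import safe_open

path = "model-00001-of-00002.safetensors"  # hypothetical shard name

with safe_open(path, framework="pt", device="cpu") as f:
    metadata = f.metadata()           # header metadata via the official API
    tensors = {}
    for key in f.keys():              # load tensors one at a time instead of raw bytes
        tensors[key] = f.get_tensor(key)
```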
…160070) This is a follow up on pytorch#159800 as other tests are still failing. Pull Request resolved: pytorch#160070 Approved by: https://github.com/aorenste
Since switching from wheel 0.34.2 to wheel 0.45.1, Python symlinks are no longer correctly created. Migrate to the packaging package for symlink creation. Pull Request resolved: pytorch#158634 Approved by: https://github.com/malfet
As discussed with @jianan-gu and @Valentine233, disable flex decoding on Windows. Pull Request resolved: pytorch#160072 Approved by: https://github.com/angelayi
) If the user provides a generator kwarg to a random op (e.g. nn.init.uniform_(..., generator=my_generator)), we can still advance that generator's state in a SPMD-global way so that each local-tensor gets appropriate values and the generator advances to the same state as if it had operated on the full tensor. Pull Request resolved: pytorch#159933 Approved by: https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/wanchaol
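A minimal sketch of the user-facing pattern this change supports (the shapes and seed are illustrative; the SPMD-global state advancement itself happens inside local-tensor mode):
```
import torch
import torch.nn as nn

g = torch.Generator(device="cpu").manual_seed(42)

w = torch.empty(4, 8)
# Passing an explicit generator threads RNG state through the init op; per the
# PR, local-tensor mode advances that state as if the op ran on the full tensor.
nn.init.uniform_(w, a=-0.1, b=0.1, generator=g)
```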
Previously we only applied this move_to_device_pass to the top-level graph. However, if we have HOOs, this pass will not be applied to the HOO submodules. This PR modifies the pass to run on all submodules. Pull Request resolved: pytorch#159992 Approved by: https://github.com/yiming0416
Summary: In qembeddingbag_byte_prepack_meta, weight.sizes() would return a concrete int. We should use .sym_size() to return a SymInt instead. Test Plan: CI Rollback Plan: Reviewed By: kqfu, henryoier Differential Revision: D79744512 Pull Request resolved: pytorch#159985 Approved by: https://github.com/jerryzh168, https://github.com/henryoier
…59691) partially generated with ``` for TESTCASE in $(ls | cut -f1 -d'.' | grep -v CPython | uniq); do if grep "$TESTCASE" -m 1 .. -r; then echo; else sl rm "$TESTCASE"* ; fi; done ``` Pull Request resolved: pytorch#159691 Approved by: https://github.com/xmfan
…mismatches in tracing and take a preferred device. (pytorch#159931) Summary: Device mismatches in tracing can most often be ignored; they are logical mismatches, not physical ones. Take any intermediate computation: it will not actually materialize in the compiled binary execution, so a device mismatch in the middle of the program is not real. The runtime never materializes those tensors on the CPU device during execution, as they are temporary allocations. If users know their tensors at graph input are all on the correct device, they can ignore all tracing errors, so users who know what they are doing should have an escape hatch to ignore any device mismatch in tracing. Users can set ``` torch._functorch.config.fake_tensor_prefer_device_type = 'mtia' ``` to forcefully override any mismatch and prefer the non-CPU device. This unblocks vLLM graph mode for MTIA. Test Plan: Added two unit tests. Rollback Plan: Differential Revision: D79698438 Pull Request resolved: pytorch#159931 Approved by: https://github.com/jansel
…ytorch#159691)" This reverts commit 36f46d0. Reverted pytorch#159691 on behalf of https://github.com/izaitsevfb due to breaking dynamo tests ([comment](pytorch#159691 (comment)))
…(but not insides of list) (pytorch#145089) Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#145089 Approved by: https://github.com/albanD, https://github.com/zou3519
Summary:
Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures.
Currently, exceptions are dumped to the console in the following format:
```
[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning:
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help..
[0/0] Ignoring this choice.
```
The exception tracebacks:
```
# inner exception
traceback:
File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers
launchers.append(result.make_launcher())
^^^^^^^^^^^^^^^^^^^^^^
File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher
self.kernel.load_kernel(device)
File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel
(self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel(
# wrapped exception
traceback:
File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout
choice.precompile()
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile
self.bmreq.precompile()
File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile
getattr(mod, self.kernel_name).precompile()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile
self._make_launchers()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers
raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
```
With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event.
The format:
```
{
"exceptions": [
{
"choice_type": "triton",
"choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
"exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
"exception": "OutOfMemoryError",
"required_memory": "262144",
"hardware_limit": "232448"
}
]
}
```
Test Plan:
buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
Rollback Plan:
Differential Revision: D79420953
Pull Request resolved: pytorch#159688
Approved by: https://github.com/stashuk-olek
pytorch#158533) For pytorch#114850, we will port distributed tests to Intel GPU. We enable Intel GPU with the following methods, trying our best to keep the original code style: - instantiate_device_type_tests() - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - enable XPU for some test paths - unify some common code under torch/testing/_internal for multiple backends, for example: - requires_nccl_version - _dynamo_dist_per_rank_init - DynamoDistributedSingleProcTestCase - DistTestCases - FSDPTestMultiThread Pull Request resolved: pytorch#158533 Approved by: https://github.com/guangyey, https://github.com/d4l3k Co-authored-by: Yu, Guangye <[email protected]>
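A minimal sketch of the backend-agnostic device selection described above, assuming a PyTorch build where the torch.accelerator API is available:
```
import torch

# Pick the active accelerator backend (cuda, xpu, ...) without hard-coding it,
# falling back to CPU when no accelerator is present.
if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator().type
else:
    device_type = "cpu"

x = torch.randn(2, 2, device=device_type)
```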
…gitsLoss` (pytorch#150282)" This reverts commit f990490. Reverted pytorch#150282 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#150282 (comment)))
…ubmodules (pytorch#157979) Fixes issue with HF Gen AI models where we mark a param as static and a get_attr node gets put in the region. The effect of this is lifting get_attr nodes to be inputs. Pull Request resolved: pytorch#157979 Approved by: https://github.com/williamwen42
Before, we would topologically sort each region individually. This works well except that if some nodes have no arguments, their order may change. To rectify this, we treat the first region as the reference region and use its sort order to sort the remaining regions. Pull Request resolved: pytorch#158814 Approved by: https://github.com/williamwen42
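A self-contained sketch of the reference-region idea (names and structure are illustrative, not the actual torch._dynamo code): sort only the first region topologically, then reuse that permutation for every other region so corresponding nodes stay aligned even when their relative order is otherwise unconstrained.
```
def sort_regions(regions, topo_index):
    """regions: list of node lists, where regions[r][i] corresponds across r.
    topo_index: maps a node to its position in the graph's topological order."""
    if not regions:
        return regions
    # Sort the first (reference) region topologically and remember the permutation.
    order = sorted(range(len(regions[0])), key=lambda i: topo_index[regions[0][i]])
    # Apply the same permutation to every region so corresponding nodes line up.
    return [[region[i] for i in order] for region in regions]
```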
get_free_symbol_uses is used to know which unbacked symbols are used by a given node. Not having get_free_symbol_uses defined correctly leads to: - elimination of some nodes because no users are detected (see the added unit test) - incorrect topological sort. Fix get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel and external kernels. ComputedBuffer with NonOwningLayout is an interesting case: when the layout is NonOwningLayout we need to access the actual view op's base layout and detect the symbols in it, because those symbols are used when we codegen the ComputedBuffer. Pull Request resolved: pytorch#160314 Approved by: https://github.com/eellison
The XCCL backend does not have the pytorch#62300 issue; add the xccl path here. Pull Request resolved: pytorch#159240 Approved by: https://github.com/guangyey, https://github.com/Skylion007, https://github.com/EikanWang
For [pytorch#114850](pytorch#114850), we will port distributed tests to Intel GPU. We enable Intel GPU with the following methods, trying our best to keep the original code style: 1. use "torch.accelerator.current_accelerator()" to determine the accelerator backend 2. enable XPU for some test paths 3. skip some test cases which Intel GPU does not support Pull Request resolved: pytorch#159543 Approved by: https://github.com/guangyey, https://github.com/d4l3k Co-authored-by: Yu, Guangye <[email protected]>
- Add pytorch_overview.md - Add pytorch_main_components.md - Reorganize top nav to have Get Started, User Guide, Reference API, Community, Tutorials - Move notes under user guide Pull Request resolved: pytorch#159379 Approved by: https://github.com/albanD Co-authored-by: sekyondaMeta <[email protected]> Co-authored-by: Nikita Shulga <[email protected]>
…-> default val of 240 (pytorch#160500) 10 hours is very long Pull Request resolved: pytorch#160500 Approved by: https://github.com/huydhn
Fixed `test_nccl_user_buffer_registration`, broken by pytorch#160145; somehow CI didn't catch it. Pull Request resolved: pytorch#160497 Approved by: https://github.com/ngimel
…tion (pytorch#160357)" This reverts commit cbffde7. Reverted pytorch#160357 on behalf of https://github.com/clee2000 due to broke a bunch of internal builds due to not being able to find the file No such file or directory: torch/_inductor/kernel/flex/templates/flex_decode.py.jinja D80145761, might need a buck targets change? ([comment](pytorch#160357 (comment)))
Fixes builds for old compilers Pull Request resolved: pytorch#160499 Approved by: https://github.com/Skylion007
… contain attributes (pytorch#160436) Summary: Fixes internal test failures of D80037015 Test Plan: CI Rollback Plan: Differential Revision: D80094187 Pull Request resolved: pytorch#160436 Approved by: https://github.com/clee2000
Hi, @malfet Based on the previous discussion: [RISCV CI support · Issue pytorch#141550 · pytorch/pytorch](pytorch#141550) I have cross-compiled PyTorch for the RISC-V architecture on x86_64 Ubuntu 24.04 and created a new PR for it. Could you please help review it? Pull Request resolved: pytorch#143979 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <[email protected]>
…ch#147758) Summary: DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collectives. The trade-off for now is dedupe and re-sharding; support for these will be introduced soon. Differential Revision: D70112642 Pull Request resolved: pytorch#147758 Approved by: https://github.com/meetv18
## Summary - register conv3d with MPS autocast to ensure bias dtypes match under AMP - add regression test chaining two Conv3d layers on MPS autocast Written by Codex, see https://chatgpt.com/codex/tasks/task_e_689b64192df883278648935963d2776d Pull Request resolved: pytorch#160423 Approved by: https://github.com/dcci
…h#160435) remove aten::contiguous for NHWC convolutions on ROCm Tests: - nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32 - nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16 Before: <img width="1255" height="228" alt="image" src="https://github.com/user-attachments/assets/b125ccab-00c2-4d3a-a341-4583e51d8d57" /> After: <img width="874" height="153" alt="image" src="https://github.com/user-attachments/assets/ec200754-3622-488e-8762-bff1c2d22818" /> Pull Request resolved: pytorch#160435 Approved by: https://github.com/jeffdaily
`_transform_cuda_paths` intentionally includes the CUDA stubs folder. However this path must not be added to the rpath as otherwise any CUDA command will fail at runtime with > CUDA_ERROR_STUB_LIBRARY: "CUDA driver is a stub library" This results in e.g. non-descriptive errors like ``` cutlass_library/source/tools/util/include/cutlass/util/device_memory.h:67 cutlass::device_memory::allocate: cudaMalloc failed: bytes=4096 terminate called after throwing an instance of 'cutlass::cuda_exception' what(): std::exception ``` Pull Request resolved: pytorch#160179 Approved by: https://github.com/jansel
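An illustrative sketch of the idea behind the fix (not the actual `_transform_cuda_paths` code): keep stub directories available for link-time symbol resolution but exclude them from the rpath, so the runtime loader never resolves the stub libcuda.
```
import os

def split_link_dirs(lib_dirs):
    # Keep all dirs (including CUDA 'stubs') for -L so linking still succeeds,
    # but drop 'stubs' dirs from the rpath to avoid CUDA_ERROR_STUB_LIBRARY at runtime.
    link_dirs = list(lib_dirs)
    rpath_dirs = [d for d in lib_dirs if os.path.basename(d.rstrip("/")) != "stubs"]
    return link_dirs, rpath_dirs
```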
MPS backend does not support double, so errors should be different Pull Request resolved: pytorch#160378 Approved by: https://github.com/dcci
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128
**Reproducer:**
```
import time
import torch

shapes = [
    (5079670, 128)
]
dims = [
    (1)
]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)

    for _ in range(10):
        w = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    print(w.size())

    start_time = time.time()
    for _ in range(50):
        _ = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    end_time = time.time()

    mean_time = (end_time - start_time) / 50
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")
```
**Before (MI300X):**
Avg time for shape (5079670, 128): 1629.99 us
**After (MI300X)**
Avg time for shape (5079670, 128): 1008.59 us
Pull Request resolved: pytorch#160466
Approved by: https://github.com/petrex, https://github.com/jeffdaily
Fixes maintenance of the triton packaging script when library versions change from one ROCm version to the next. Pull Request resolved: pytorch#158408 Approved by: https://github.com/jeffdaily Co-authored-by: Ethan Wee <[email protected]>
…h#159824) Follow up to pytorch#159580 Pull Request resolved: pytorch#159824 Approved by: https://github.com/williamwen42
…#160356) Differential Revision: [D80035771](https://our.internmc.facebook.com/intern/diff/D80035771/) The original change was motivated by reducing the number of parameters we pass into the kernel, for aesthetic reasons only. But seeing the need to use a different batch stride, we should just pass in the batch stride; that is the good long-term fix. Pull Request resolved: pytorch#160356 Approved by: https://github.com/mlazos
Adds `OperatorEntry::getComputedKernelForDispatchKey` which returns the KernelFunction corresponding to `OperatorEntry.dispatchTable_[dispatch_ix]` for a given dispatch key - Specifically it returns a `SafeKernelFunction` that holds a `KernelToken`. This `KernelToken` is registered to the `KernelFunction` in `OperatorEntry.kernels_` and will be invalidated when the `KernelFunction` is destructed (i.e. when the `AnnotatedKernel` that holds this `KernelFunction` is removed from `kernels_`, which happens when the corresponding impl is deregistered). - `SafeKernelFunction` can be called via `callBoxed`; the validity of the token is checked before this happens - `SafeKernelFunction` is pybinded and `getComputedKernelForDispatchKey` is exposed to the frontend via `torch.library.get_kernel` Related to pytorch#155330 Pull Request resolved: pytorch#158393 Approved by: https://github.com/albanD
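A hypothetical usage sketch of the frontend described above; the argument format of `torch.library.get_kernel` is an assumption here, not taken from the PR:
```
import torch

# Assumed call shape: (qualified op name, dispatch key). Per the PR, the result
# is a SafeKernelFunction whose token is invalidated if the impl is deregistered.
kernel = torch.library.get_kernel("aten::add.Tensor", "CPU")
```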
…_testing_IFU_08132025 # Conflicts: # .ci/docker/requirements-ci.txt # aten/src/ATen/Context.cpp # test/distributed/_tools/test_fsdp2_mem_tracker.py # test/dynamo/test_activation_checkpointing.py # test/dynamo/test_structured_trace.py # test/inductor/test_combo_kernels.py # torch/_higher_order_ops/triton_kernel_wrap.py # torch/_inductor/choices.py # torch/_inductor/codegen/triton.py
pragupta pushed a commit that referenced this pull request on Sep 17, 2025:
) Summary: This diff fixes two things which come up when testing a tgif-published pt2 model remote net: 1) Updates isSameDevice to handle meta device to avoid this error: ``` what(): Unsupported device typemeta and meta Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20 ``` 2. Updates xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, we enforce the device is the same for an old weight and new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on "meta" device and then replace them with their correct weight with real device from xl_weights folder. Therefore, the new weight and old weight will always have different devices and the device check is invalid. I don't think we've run into this so far bc non-TBE xl weights have not been thoroughly tested until now. Test Plan: Run MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get error: ``` Unsupported device typemeta and meta ``` Then after change #1 and before change #2 we get: ``` what(): Mismatched device for merge.user_tower.linear.weight: meta vs cpu Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374 ``` After change run is successful Command: ``` MODEL_ENTITY_ID=921242082 SNAPSHOT_ID=1269 module_name=merge SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0" --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt ``` Rollback Plan: Differential Revision: D80713052 Pull Request resolved: pytorch#162842 Approved by: https://github.com/henryoier
Merged latest changes from upstream/main into pg_rocm7.1_internal_testing on 08132025