forked from ROCm/pytorch
[pg_rocm7.1_internal_testing][autogenerated] Upstream IFU on 08132025 #2
Closed: pragupta wants to merge 344 commits into pg_rocm7.1_internal_testing from pg_rocm7.1_internal_testing_IFU_08132025
Conversation
…tion plus tests from pytorch#125438 (pytorch#157786)" This reverts commit 3a2c3c8. Reverted pytorch#157786 on behalf of https://github.com/albanD due to Breaks lint ([comment](pytorch#157786 (comment)))
…Handle. (pytorch#159989) Summary: Today users outside of pytorch core cannot `#include <torch/nativert/ModelRunner.h>`. It turns out that we should place a header inside `torch/csrc/api/include/`. Placing every single nativert header there would pollute the namespace a lot, which is not what we want in general. Therefore we just create a Handle type which holds a pointer, decoupling the actual type from the header definition. Test Plan: CI Rollback Plan: Differential Revision: D79751098 Pull Request resolved: pytorch#159989 Approved by: https://github.com/dolpm
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#158340 Approved by: https://github.com/seemethere Co-authored-by: xinan.lin <[email protected]>
The test filter was wrong; it should not start with "test/". Test Plan: - wait for CI - tested locally with `python test/run_test.py --einops --verbose` Pull Request resolved: pytorch#159776 Approved by: https://github.com/atalman, https://github.com/StrongerXi
numba currently doesn't build from source due to numba/numba#10073 Pull Request resolved: pytorch#158636 Approved by: https://github.com/malfet
**Summary** Some thoughts on view-op and `_StridedShard` interaction: 1. `_StridedShard` has no impact on sharding (i.e. how tensor is partitioned) compared to `Shard`. It only changes how shards permute across the devices. 2. `view()` op on DTensor strictly forbids shard redistribution which means if `view()` may cause shard permutation across devices, it should be rejected. This is enforced in today's sharding prop for `view()`. 3. Since DTensor `view()` won't introduce any redistribution, it's certain that `placements` won't change except the inner `dim` attribute of `Shard` or `_StridedShard`. Therefore, to support `_StridedShard` in `view()` op, the only change required is to keep `_StridedShard` as `_StridedShard` in the output spec. **Test** `pytest test/distributed/tensor/test_view_ops.py` Pull Request resolved: pytorch#159656 Approved by: https://github.com/wconstab
Disables the job on PRs completely, so that we don't litter people's CI signals and use machines unnecessarily. If you want to run these xla tests, add the ciflow/unstable label to your PR Pull Request resolved: pytorch#159272 Approved by: https://github.com/atalman, https://github.com/malfet
…)" This reverts commit 4604f04. Reverted pytorch#155200 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](pytorch#138222 (comment)))
This reverts commit 15f1173. Reverted pytorch#152932 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](pytorch#138222 (comment)))
)" This reverts commit f7a66da. Reverted pytorch#138222 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](pytorch#138222 (comment)))
Fixed `test_dynamo_timed`: <img width="1030" height="389" alt="image" src="https://github.com/user-attachments/assets/02d84dd8-6a65-4f91-8d4c-48ba0a81fac1" /> Pull Request resolved: pytorch#159981 Approved by: https://github.com/angelayi
Unify the inductor debug build, following @desertfire's suggestion: pytorch#159938 (review) Pull Request resolved: pytorch#159998 Approved by: https://github.com/angelayi
Summary: Updated README with code structure and explanation of core features within profiler Test Plan: N/A Rollback Plan: Differential Revision: D79604189 Pull Request resolved: pytorch#159816 Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
…on (pytorch#155360) (pytorch#158983) Pull Request resolved: pytorch#158983 Approved by: https://github.com/eellison ghstack dependencies: pytorch#158758
Update HF components to not inherit from fsspec components and instead use the filesystem writer/reader. There doesn't seem to be much of a need for fsspec, since users are using mounted storage. Using local storage allows for performance improvements because we can take advantage of the safe_open API provided by HF safetensors (30s vs 4s to load an 8B model), a significant win over reading bytes and converting them to tensors, which is what we are doing now. Also, we can use the official methods provided by HF instead of relying on reading the metadata by bytes and loading it. Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/) Pull Request resolved: pytorch#159405 Approved by: https://github.com/saumishr
…9406) Reading the bytes and converting to tensors is much slower than using safe_open. For a 8B model across 8 ranks, took ~30s to load before this change and ~4s after. Differential Revision: [D78994259](https://our.internmc.facebook.com/intern/diff/D78994259/) Pull Request resolved: pytorch#159406 Approved by: https://github.com/saumishr ghstack dependencies: pytorch#159405
Get rid of the logic to read the metadata from the header of the safetensors file manually and use the functions provided by safe_open() to get the metadata. This is much cleaner and allows us to not rely on our own custom methods to get metadata, but to use the safetensors-provided APIs. Differential Revision: [D79460272](https://our.internmc.facebook.com/intern/diff/D79460272/) Pull Request resolved: pytorch#159681 Approved by: https://github.com/saumishr ghstack dependencies: pytorch#159405, pytorch#159406
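As a rough illustration of the safe_open-based flow described in these three changes, a minimal sketch follows (the file name and tensor keys are illustrative, not taken from the PRs):
```
from safetensors import safe_open

path = "model-00001-of-00002.safetensors"  # hypothetical shard name

with safe_open(path, framework="pt", device="cpu") as f:
    metadata = f.metadata()           # header metadata via the official API
    tensors = {}
    for key in f.keys():              # load tensors one at a time instead of raw bytes
        tensors[key] = f.get_tensor(key)
```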
…160070) This is a follow up on pytorch#159800 as other tests are still failing. Pull Request resolved: pytorch#160070 Approved by: https://github.com/aorenste
Since switching from wheel 0.34.2 to wheel 0.45.1, Python symlinks are no longer correctly created. Migrate to the packaging package for symlink creation. Pull Request resolved: pytorch#158634 Approved by: https://github.com/malfet
As discussed with @jianan-gu and @Valentine233, disable flex decoding on Windows. Pull Request resolved: pytorch#160072 Approved by: https://github.com/angelayi
) If the user provides a generator kwarg to a random op (e.g. nn.init.uniform_(..., generator=my_generator)), we can still advance that generator's state in a SPMD-global way so that each local-tensor gets appropriate values and the generator advances to the same state as if it had operated on the full tensor. Pull Request resolved: pytorch#159933 Approved by: https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/wanchaol
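A minimal sketch of the user-facing pattern this change supports (the shapes and seed are illustrative; the SPMD-global state advancement itself happens inside local-tensor mode):
```
import torch
import torch.nn as nn

g = torch.Generator(device="cpu").manual_seed(42)

w = torch.empty(4, 8)
# Passing an explicit generator threads RNG state through the init op; per the
# PR, local-tensor mode advances that state as if the op ran on the full tensor.
nn.init.uniform_(w, a=-0.1, b=0.1, generator=g)
```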
Previously we only applied this move_to_device_pass to the top-level graph. However, if we have HOOs, this pass will not be applied to the HOO submodules. This PR modifies the pass to run on all submodules. Pull Request resolved: pytorch#159992 Approved by: https://github.com/yiming0416
Summary: In qembeddingbag_byte_prepack_meta, weight.sizes() would return a concrete int. We should use .sym_size() to return a SymInt instead. Test Plan: CI Rollback Plan: Reviewed By: kqfu, henryoier Differential Revision: D79744512 Pull Request resolved: pytorch#159985 Approved by: https://github.com/jerryzh168, https://github.com/henryoier
…59691) partially generated with ``` for TESTCASE in $(ls | cut -f1 -d'.' | grep -v CPython | uniq); do if grep "$TESTCASE" -m 1 .. -r; then echo; else sl rm "$TESTCASE"* ; fi; done ``` Pull Request resolved: pytorch#159691 Approved by: https://github.com/xmfan
…mismatches in tracing and take a preferred device. (pytorch#159931) Summary: Device mismatches in tracing can most often be ignored; they are logical mismatches, not physical ones. Take any intermediate computation: it will not actually materialize in the compiled binary execution, so a device mismatch in the middle of the program is not real. The runtime never materializes those tensors on the CPU device during execution, as they are temporary allocations. If users know their tensors at graph input are all on the correct device, they can ignore all tracing errors, so users who know what they are doing should have an escape hatch to ignore any device mismatch in tracing. Users can set ``` torch._functorch.config.fake_tensor_prefer_device_type = 'mtia' ``` to forcefully override any mismatch and prefer the non-CPU device. This unblocks vLLM graph mode for MTIA. Test Plan: Added two unit tests. Rollback Plan: Differential Revision: D79698438 Pull Request resolved: pytorch#159931 Approved by: https://github.com/jansel
…ytorch#159691)" This reverts commit 36f46d0. Reverted pytorch#159691 on behalf of https://github.com/izaitsevfb due to breaking dynamo tests ([comment](pytorch#159691 (comment)))
…(but not insides of list) (pytorch#145089) Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#145089 Approved by: https://github.com/albanD, https://github.com/zou3519
Summary:
Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures.
Currently, exceptions are dumped to the console in the following format:
```
[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning:
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help..
[0/0] Ignoring this choice.
```
The exception tracebacks:
```
# inner exception
traceback:
File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers
launchers.append(result.make_launcher())
^^^^^^^^^^^^^^^^^^^^^^
File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher
self.kernel.load_kernel(device)
File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel
(self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel(
# wrapped exception
traceback:
File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout
choice.precompile()
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile
self.bmreq.precompile()
File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile
getattr(mod, self.kernel_name).precompile()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile
self._make_launchers()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers
raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
```
With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event.
The format:
```
{
"exceptions": [
{
"choice_type": "triton",
"choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
"exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
"exception": "OutOfMemoryError",
"required_memory": "262144",
"hardware_limit": "232448"
}
]
}
```
Test Plan:
buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
Rollback Plan:
Differential Revision: D79420953
Pull Request resolved: pytorch#159688
Approved by: https://github.com/stashuk-olek
pytorch#158533) For pytorch#114850, we will port distributed tests to Intel GPU. We enable Intel GPU with the following methods, trying our best to keep the original code style: - instantiate_device_type_tests() - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - enable XPU for some test paths - unify some common code under torch/testing/_internal for multiple backends, for example: - requires_nccl_version - _dynamo_dist_per_rank_init - DynamoDistributedSingleProcTestCase - DistTestCases - FSDPTestMultiThread Pull Request resolved: pytorch#158533 Approved by: https://github.com/guangyey, https://github.com/d4l3k Co-authored-by: Yu, Guangye <[email protected]>
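A minimal sketch of the backend-agnostic device selection described above, assuming a PyTorch build where the torch.accelerator API is available:
```
import torch

# Pick the active accelerator backend (cuda, xpu, ...) without hard-coding it,
# falling back to CPU when no accelerator is present.
if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator().type
else:
    device_type = "cpu"

x = torch.randn(2, 2, device=device_type)
```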
…gitsLoss` (pytorch#150282)" This reverts commit f990490. Reverted pytorch#150282 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#150282 (comment)))
…ubmodules (pytorch#157979) Fixes issue with HF Gen AI models where we mark a param as static and a get_attr node gets put in the region. The effect of this is lifting get_attr nodes to be inputs. Pull Request resolved: pytorch#157979 Approved by: https://github.com/williamwen42
Before, we would topologically sort each region individually. This works well except that if some nodes have no arguments, their order may change. To rectify this, we treat the first region as the reference region and use its sort order to sort the remaining regions. Pull Request resolved: pytorch#158814 Approved by: https://github.com/williamwen42
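A self-contained sketch of the reference-region idea (names and structure are illustrative, not the actual torch._dynamo code): sort only the first region topologically, then reuse that permutation for every other region so corresponding nodes stay aligned even when their relative order is otherwise unconstrained.
```
def sort_regions(regions, topo_index):
    """regions: list of node lists, where regions[r][i] corresponds across r.
    topo_index: maps a node to its position in the graph's topological order."""
    if not regions:
        return regions
    # Sort the first (reference) region topologically and remember the permutation.
    order = sorted(range(len(regions[0])), key=lambda i: topo_index[regions[0][i]])
    # Apply the same permutation to every region so corresponding nodes line up.
    return [[region[i] for i in order] for region in regions]
```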
get_free_symbol_uses is used to know which unbacked symbols are used by a given node. Not having get_free_symbol_uses defined correctly leads to: - elimination of some nodes because no users are detected (see the added unit test) - incorrect topological sort. Fix get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel and external kernels. ComputedBuffer with NonOwningLayout is an interesting case: when the layout is NonOwningLayout we need to access the actual view op's base layout and detect the symbols in it, because those symbols are used when we codegen the ComputedBuffer. Pull Request resolved: pytorch#160314 Approved by: https://github.com/eellison
The XCCL backend does not have the pytorch#62300 issue; add the xccl path here. Pull Request resolved: pytorch#159240 Approved by: https://github.com/guangyey, https://github.com/Skylion007, https://github.com/EikanWang
For [pytorch#114850](pytorch#114850), we will port distributed tests to Intel GPU. We enable Intel GPU with the following methods, trying our best to keep the original code style: 1. use "torch.accelerator.current_accelerator()" to determine the accelerator backend 2. enable XPU for some test paths 3. skip some test cases which Intel GPU does not support Pull Request resolved: pytorch#159543 Approved by: https://github.com/guangyey, https://github.com/d4l3k Co-authored-by: Yu, Guangye <[email protected]>
- Add pytorch_overview.md - Add pytorch_main_components.md - Reorganize top nav to have Get Started, User Guide, Reference API, Community, Tutorials - Move notes under user guide Pull Request resolved: pytorch#159379 Approved by: https://github.com/albanD Co-authored-by: sekyondaMeta <[email protected]> Co-authored-by: Nikita Shulga <[email protected]>
…-> default val of 240 (pytorch#160500) 10 hours is very long Pull Request resolved: pytorch#160500 Approved by: https://github.com/huydhn
Fixed `test_nccl_user_buffer_registration`, broken by pytorch#160145; somehow CI didn't catch it. Pull Request resolved: pytorch#160497 Approved by: https://github.com/ngimel
…tion (pytorch#160357)" This reverts commit cbffde7. Reverted pytorch#160357 on behalf of https://github.com/clee2000 due to broke a bunch of internal builds due to not being able to find the file No such file or directory: torch/_inductor/kernel/flex/templates/flex_decode.py.jinja D80145761, might need a buck targets change? ([comment](pytorch#160357 (comment)))
Fixes builds for old compilers Pull Request resolved: pytorch#160499 Approved by: https://github.com/Skylion007
… contain attributes (pytorch#160436) Summary: Fixes internal test failures of D80037015 Test Plan: CI Rollback Plan: Differential Revision: D80094187 Pull Request resolved: pytorch#160436 Approved by: https://github.com/clee2000
Hi, @malfet Based on the previous discussion: [RISCV CI support · Issue pytorch#141550 · pytorch/pytorch](pytorch#141550) I have cross-compiled PyTorch for the RISC-V architecture on x86_64 Ubuntu 24.04 and created a new PR for it. Could you please help review it? Pull Request resolved: pytorch#143979 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <[email protected]>
…ch#147758) Summary: DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collectives. The trade-off for now is dedupe and re-sharding; support for these will be introduced soon. Differential Revision: D70112642 Pull Request resolved: pytorch#147758 Approved by: https://github.com/meetv18
## Summary - register conv3d with MPS autocast to ensure bias dtypes match under AMP - add regression test chaining two Conv3d layers on MPS autocast Written by Codex, see https://chatgpt.com/codex/tasks/task_e_689b64192df883278648935963d2776d Pull Request resolved: pytorch#160423 Approved by: https://github.com/dcci
…h#160435) remove aten::contiguous for NHWC convolutions on ROCm Tests: - nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32 - nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16 Before: <img width="1255" height="228" alt="image" src="https://github.com/user-attachments/assets/b125ccab-00c2-4d3a-a341-4583e51d8d57" /> After: <img width="874" height="153" alt="image" src="https://github.com/user-attachments/assets/ec200754-3622-488e-8762-bff1c2d22818" /> Pull Request resolved: pytorch#160435 Approved by: https://github.com/jeffdaily
`_transform_cuda_paths` intentionally includes the CUDA stubs folder. However this path must not be added to the rpath as otherwise any CUDA command will fail at runtime with > CUDA_ERROR_STUB_LIBRARY: "CUDA driver is a stub library" This results in e.g. non-descriptive errors like ``` cutlass_library/source/tools/util/include/cutlass/util/device_memory.h:67 cutlass::device_memory::allocate: cudaMalloc failed: bytes=4096 terminate called after throwing an instance of 'cutlass::cuda_exception' what(): std::exception ``` Pull Request resolved: pytorch#160179 Approved by: https://github.com/jansel
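An illustrative sketch of the idea behind the fix (not the actual `_transform_cuda_paths` code): keep stub directories available for link-time symbol resolution but exclude them from the rpath, so the runtime loader never resolves the stub libcuda.
```
import os

def split_link_dirs(lib_dirs):
    # Keep all dirs (including CUDA 'stubs') for -L so linking still succeeds,
    # but drop 'stubs' dirs from the rpath to avoid CUDA_ERROR_STUB_LIBRARY at runtime.
    link_dirs = list(lib_dirs)
    rpath_dirs = [d for d in lib_dirs if os.path.basename(d.rstrip("/")) != "stubs"]
    return link_dirs, rpath_dirs
```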
MPS backend does not support double, so errors should be different Pull Request resolved: pytorch#160378 Approved by: https://github.com/dcci
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128
**Reproducer:**
```
import time
import torch

shapes = [
    (5079670, 128)
]
dims = [
    (1)
]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)

    for _ in range(10):
        w = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    print(w.size())

    start_time = time.time()
    for _ in range(50):
        _ = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    end_time = time.time()

    mean_time = (end_time - start_time) / 50
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")
```
**Before (MI300X):**
Avg time for shape (5079670, 128): 1629.99 us
**After (MI300X)**
Avg time for shape (5079670, 128): 1008.59 us
Pull Request resolved: pytorch#160466
Approved by: https://github.com/petrex, https://github.com/jeffdaily
Fixes maintenance of the triton packaging script when library versions change from one ROCm version to the next. Pull Request resolved: pytorch#158408 Approved by: https://github.com/jeffdaily Co-authored-by: Ethan Wee <[email protected]>
…h#159824) Follow up to pytorch#159580 Pull Request resolved: pytorch#159824 Approved by: https://github.com/williamwen42
…#160356) Differential Revision: [D80035771](https://our.internmc.facebook.com/intern/diff/D80035771/) The original change was motivated by reducing the number of parameters we pass into the kernel, for aesthetic reasons only. But seeing the need to use a different batch stride, we should just pass in the batch stride; that is the good long-term fix. Pull Request resolved: pytorch#160356 Approved by: https://github.com/mlazos
Adds `OperatorEntry::getComputedKernelForDispatchKey` which returns the KernelFunction corresponding to `OperatorEntry.dispatchTable_[dispatch_ix]` for a given dispatch key - Specifically it returns a `SafeKernelFunction` that holds a `KernelToken`. This `KernelToken` is registered to the `KernelFunction` in `OperatorEntry.kernels_` and will be invalidated when the `KernelFunction` is destructed (i.e. when the `AnnotatedKernel` that holds this `KernelFunction` is removed from `kernels_`, which happens when the corresponding impl is deregistered). - `SafeKernelFunction` can be called via `callBoxed`; the validity of the token is checked before this happens - `SafeKernelFunction` is pybinded and `getComputedKernelForDispatchKey` is exposed to the frontend via `torch.library.get_kernel` Related to pytorch#155330 Pull Request resolved: pytorch#158393 Approved by: https://github.com/albanD
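A hypothetical usage sketch of the frontend described above; the argument format of `torch.library.get_kernel` is an assumption here, not taken from the PR:
```
import torch

# Assumed call shape: (qualified op name, dispatch key). Per the PR, the result
# is a SafeKernelFunction whose token is invalidated if the impl is deregistered.
kernel = torch.library.get_kernel("aten::add.Tensor", "CPU")
```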
…_testing_IFU_08132025 # Conflicts: # .ci/docker/requirements-ci.txt # aten/src/ATen/Context.cpp # test/distributed/_tools/test_fsdp2_mem_tracker.py # test/dynamo/test_activation_checkpointing.py # test/dynamo/test_structured_trace.py # test/inductor/test_combo_kernels.py # torch/_higher_order_ops/triton_kernel_wrap.py # torch/_inductor/choices.py # torch/_inductor/codegen/triton.py
pragupta pushed a commit that referenced this pull request on Sep 17, 2025:
) Summary: This diff fixes two things which come up when testing a tgif-published pt2 model remote net: 1) Updates isSameDevice to handle meta device to avoid this error: ``` what(): Unsupported device typemeta and meta Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20 ``` 2. Updates xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, we enforce the device is the same for an old weight and new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on "meta" device and then replace them with their correct weight with real device from xl_weights folder. Therefore, the new weight and old weight will always have different devices and the device check is invalid. I don't think we've run into this so far bc non-TBE xl weights have not been thoroughly tested until now. Test Plan: Run MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get error: ``` Unsupported device typemeta and meta ``` Then after change #1 and before change #2 we get: ``` what(): Mismatched device for merge.user_tower.linear.weight: meta vs cpu Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374 ``` After change run is successful Command: ``` MODEL_ENTITY_ID=921242082 SNAPSHOT_ID=1269 module_name=merge SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0" --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt ``` Rollback Plan: Differential Revision: D80713052 Pull Request resolved: pytorch#162842 Approved by: https://github.com/henryoier
Merged latest changes from upstream/main into pg_rocm7.1_internal_testing on 08132025