add inductor-cu126 tag #141110

tinglvv · 2024-11-20T05:45:30Z

related to #138440
for testing #140793

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal @mcarilli @ptrblck @leslie-fang-intel @voznesenskym @penguinwu @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn @xmfan @atalman

pytorch-bot · 2024-11-20T05:45:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141110

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[DomainsOnly] Jobs fail with GLIBC version not found

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Fixes pytorch#82159 Add note description about type promotion of `torch.sum`. **Test Result** **Before** ![image](https://github.com/user-attachments/assets/fb952676-f190-4680-9e15-ea8c99d33c67) **After** ![image](https://github.com/user-attachments/assets/ee0d46a6-5053-46d5-b412-5c919a40965a) Pull Request resolved: pytorch#140939 Approved by: https://github.com/zou3519

) Fix pytorch#140462 . Horace found that when we implicitly fallback to eager, some eager kernels may not work correctly if Inductor provide non-contiguous inputs (due to padding etc.). The original issue is found for the backward op of weight_norm. The fix in this PR is a general one: we force inputs to all implicit fallback kernels to be contiguous. I have to refactor the code a bit to make it work. Previously we apply layout constraint in `GraphLowering.run_node`. We looks for implicit fallback in `call_function`. The problem here is, when we setup the implicit fallback in `call_function` with a layout constraint, we don't have a chance to apply the constraints.. The refactor moves the code that applies layout constraints to `call_function`. Pull Request resolved: pytorch#140996 Approved by: https://github.com/jansel

…0084) Summary: The way we've been de/serializing sympy.Exprs is not roundtrippable in all cases (serialize by calling `str(expr)`, and deserialize by calling `sympy.sympify(expr_str)`). This has led to expressions being mathematically equivalent but structurally different, causing issues in ValueRanges. Example issue: pytorch#136797 This starts to deprecate the use of `expr_str` and stores expressions in AST format instead. For BC purposes, `expr_str` deserialization is still supported, but we will always serialize to `expr_ast`. We'll kill this once the serialization upgrader design is finalized and implemented. Test Plan: test_export Differential Revision: D65638757 Pull Request resolved: pytorch#140084 Approved by: https://github.com/angelayi

…41003) Fixes `python test/inductor/test_fused_attention.py SDPAPatternRewriterCpuTests.test_pattern_fails_with_unsupported_mask_cpu` when `specialize_float=False`. You might wonder how it's related, it's because there is a "negative" test that expects us not to match. Previously it would fail on isinstance(param, Tensor), but now that we tensorify the float, it did match and caused a failure. This check ensures the mask has the same shape to ensure this negative test case actually fails. Pull Request resolved: pytorch#141003 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#140983

Implement RNG Generator by falling back to CPUGeneratorImpl. Pull Request resolved: pytorch#138449 Approved by: https://github.com/ezyang

…1022) Detailed description: The codes below will raise an error ```Python import torch from torch.fx.experimental.proxy_tensor import make_fx def func(a): b = a + 1 c = b.view(-1) c.add_(1) return b input = torch.randn(2) out = make_fx(func)(input) ``` The error info are like below: ```Python ... File "/root/Git.d/pytorch/pytorch/torch/_dynamo/codegen.py", line 34, in <module> from .variables.torch_function import TensorWithTFOverrideVariable File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 185, in <module> populate_builtin_to_tensor_fn_map() File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 146, in populate_builtin_to_tensor_fn_map inp0 = torch.ones(1) File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1240, in __torch_function__ return func(*args, **kwargs) File "/root/Git.d/pytorch/pytorch/torch/utils/_stats.py", line 21, in wrapper return fn(*args, **kwargs) File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1342, in __torch_dispatch__ return proxy_call(self, func, self.pre_dispatch, args, kwargs) File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 907, in proxy_call name=proxy_mode.tracer.graph._target_to_str(func.overloadpacket.__name__), AttributeError: 'PythonKeyTracer' object has no attribute 'graph' ... ``` Solutions: Import torch._dynamo before dispatch_trace is called to avoid the context set before dispatch_trace from affecting the torch._dynamo import. Pull Request resolved: pytorch#141022 Approved by: https://github.com/ezyang

Detailed description: As the title stated. Pull Request resolved: pytorch#141023 Approved by: https://github.com/Skylion007, https://github.com/ezyang

Tracking issue: pytorch#138399 This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with its Python ref implementation. The following example shows the inconsistency between the two: ```python def run(device): S = (5,) a = torch.rand(S, device=device, dtype=torch.float32) b = 2 out = torch.empty(S, device=device, dtype=torch.float64) return torch.pow(a, b, out=out) >>> run("cpu") Traceback (most recent call last): File "test.py", line 34, in run return torch.pow(a, b, out=out) RuntimeError: Found dtype Double but expected Float >>> run("meta") tensor(..., device='meta', size=(5,), dtype=torch.float64) ``` **~Update:~** ~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed further 2 modifications:~ - ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching output dtypes~ - ~Create a dispatch for `pow` when `_refs.pow()` is called~ **Update:** Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and, after the discussion below, more correct. The solution was to change the `TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function, setting: - `cast_common_dtype_to_outputs`; and - `enforce_safe_casting_to_output`. Pull Request resolved: pytorch#140287 Approved by: https://github.com/ezyang

Pull Request resolved: pytorch#141092 Approved by: https://github.com/Skylion007, https://github.com/kulinseth ghstack dependencies: pytorch#141089, pytorch#141090

Instead of calling `REGISTER_FUSED_ADAM_OP` macro with 7 parameters 16 times, 4 type parameter macros for each op and then one op to define the quartet of ops: Adam, AdamW and their grad functions Pull Request resolved: pytorch#141103 Approved by: https://github.com/kulinseth ghstack dependencies: pytorch#141089, pytorch#141090, pytorch#141092

tinglvv · 2024-11-20T15:50:07Z

@pytorchbot merge

pytorchmergebot · 2024-11-20T15:51:54Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-11-20T15:52:04Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Lint / quick-checks / linux-job

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Differential Revision: D66008482 Pull Request resolved: pytorch#140822 Approved by: https://github.com/ezyang

* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables. * list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize. * Manually went back and made mypy happy after the change. * Also fixed style lints in files covered by flake8 but not by pyfmt Pull Request resolved: pytorch#140980 Approved by: https://github.com/justinchuby, https://github.com/malfet

Pull Request resolved: pytorch#140250 Approved by: https://github.com/jansel ghstack dependencies: pytorch#140223

Easy fix to missing reserve calls in NT Matmul CUDA kernel to improve perf. Pull Request resolved: pytorch#141130 Approved by: https://github.com/malfet

This uploads the MPS benchmark results to benchmark database. The data can then be queried, for example: ``` select benchmark, model, metric from oss_ci_benchmark_v3 where head_sha = '99a133116fee15aa1467165f2b209b37da53f189' and metric.name in ['eager_peak_mem', 'dynamo_peak_mem', 'speedup'] and model.name = 'BERT_pytorch' ``` I'm documenting the JSON format at https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database ### Testing Locally, ``` PYTHONPATH=/Users/huydo/Storage/mine/benchmark python benchmarks/dynamo/torchbench.py --performance --only resnet152 --backend eager --training --devices mps --output test/test-reports/torchbench_training.csv ``` Workflow dispatch https://github.com/pytorch/pytorch/actions/runs/11927990520 Pull Request resolved: pytorch#141087 Approved by: https://github.com/malfet

I'm trying to make this benchmark results available on OSS benchmark database, so that people can query it from outside. The first step is to also record the results in the JSON format compatible with the database schema defined in pytorch/test-infra#5839. Existing CSV files remain unchanged. ### Testing The JSON results are uploaded as artifacts to S3 https://github.com/pytorch/pytorch/actions/runs/11809725848/job/32901411180#step:26:13, for example https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/11809725848/1/artifact/test-jsons-test-pr_time_benchmarks-1-1-linux.g4dn.metal.nvidia.gpu_32901411180.zip Pull Request resolved: pytorch#140493 Approved by: https://github.com/laithsakka

…#141145) A small fix I just realize after pytorch#141087. Pull Request resolved: pytorch#141145 Approved by: https://github.com/malfet

Fixes pytorch#107302 This is a clone and fix for pytorch#139199. This PR is a small step for the overall NumPy 2 support. It adds a new CI job for testing with NumPy 2 with one test file only. More tests to be fixed and added later in follow-up pull requests. Pull Request resolved: pytorch#140586 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <[email protected]>

tinglvv · 2024-11-21T20:06:12Z

Closing as an accidental wrong rebase happened.

add inductor-cu126 tag

472c15f

tinglvv requested a review from a team as a code owner November 20, 2024 05:45

pytorch-bot bot added the topic: not user facing topic category label Nov 20, 2024

pytorchbot added the open source label Nov 20, 2024

zeshengzong and others added 10 commits November 20, 2024 06:20

Openreg: Add RNG Generator (pytorch#138449)

c9c8370

Implement RNG Generator by falling back to CPUGeneratorImpl. Pull Request resolved: pytorch#138449 Approved by: https://github.com/ezyang

Skylion007 approved these changes Nov 20, 2024

View reviewed changes

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 20, 2024

pytorchmergebot added the merging label Nov 20, 2024

pytorchmergebot removed the merging label Nov 20, 2024

laithsakka and others added 8 commits November 20, 2024 16:48

[dynamo][guards] Introduce a diff_guard_manager (pytorch#140250)

9d229f0

Pull Request resolved: pytorch#140250 Approved by: https://github.com/jansel ghstack dependencies: pytorch#140223

[BE][Ez]: Reserve vector for NT GEMM Matmul (pytorch#141130)

1d8318d

Easy fix to missing reserve calls in NT Matmul CUDA kernel to improve perf. Pull Request resolved: pytorch#141130 Approved by: https://github.com/malfet

Skip uploading benchmark records when there is no model name (pytorch…

b5db3cb

…#141145) A small fix I just realize after pytorch#141087. Pull Request resolved: pytorch#141145 Approved by: https://github.com/malfet

tinglvv requested review from HDCharles, albanD, bdhirsh, jbschlosser, jerryzh168, justinchuby, kshitij12345, mikaylagawarecki, mruberry, shubhambhokare1, soulitzer, titaiwangms, wschin and zou3519 as code owners November 21, 2024 20:03

tinglvv closed this Nov 21, 2024

tinglvv deleted the add-cu126-tag branch November 21, 2024 20:05

atalman mentioned this pull request Dec 12, 2024

Enable CUDA 12.6 CI/CD , Disable CUDA 12.1 #138440

Closed

31 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add inductor-cu126 tag #141110

add inductor-cu126 tag #141110

Uh oh!

tinglvv commented Nov 20, 2024 •

edited by pytorch-bot bot

Loading

Uh oh!

pytorch-bot bot commented Nov 20, 2024 •

edited

Loading

Uh oh!

tinglvv commented Nov 20, 2024

Uh oh!

pytorchmergebot commented Nov 20, 2024

Uh oh!

pytorchmergebot commented Nov 20, 2024

Uh oh!

tinglvv commented Nov 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

add inductor-cu126 tag #141110

add inductor-cu126 tag #141110

Uh oh!

Conversation

tinglvv commented Nov 20, 2024 • edited by pytorch-bot bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Nov 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141110

❗ 1 Active SEVs

Uh oh!

tinglvv commented Nov 20, 2024

Uh oh!

pytorchmergebot commented Nov 20, 2024

Merge started

Uh oh!

pytorchmergebot commented Nov 20, 2024

Merge failed

Uh oh!

tinglvv commented Nov 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

tinglvv commented Nov 20, 2024 •

edited by pytorch-bot bot

Loading

pytorch-bot bot commented Nov 20, 2024 •

edited

Loading