Conversation

@tinglvv tinglvv requested a review from a team as a code owner November 20, 2024 05:45
@pytorch-bot

pytorch-bot bot commented Nov 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141110

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

zeshengzong and others added 10 commits November 20, 2024 06:20
)

Fixes pytorch#140462.

Horace found that when we implicitly fall back to eager, some eager kernels may not work correctly if Inductor provides non-contiguous inputs (due to padding, etc.). The original issue was found for the backward op of weight_norm. The fix in this PR is a general one: we force inputs to all implicit fallback kernels to be contiguous.
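For intuition, a minimal standalone sketch (not Inductor code; the helper name is made up) of why a kernel that assumes contiguous memory breaks on padded, strided inputs, and why forcing contiguity is a safe general fix:

```python
import torch

def sum_rows_via_flat_view(t):
    # Assumes a contiguous layout: view(-1) only works when the tensor's
    # elements are laid out densely in memory, so a strided (e.g. padded)
    # input raises instead of computing over the wrong bytes.
    return t.view(-1).reshape(t.shape).sum(dim=1)

x = torch.randn(4, 6)
padded = torch.randn(4, 8)[:, :6]  # non-contiguous slice, as if padded

print(sum_rows_via_flat_view(x))                    # fine: contiguous
print(sum_rows_via_flat_view(padded.contiguous()))  # the fix: force contiguity
# sum_rows_via_flat_view(padded)  # RuntimeError: view size is not compatible...
```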

I had to refactor the code a bit to make this work. Previously we applied layout constraints in `GraphLowering.run_node`, but we look for implicit fallbacks in `call_function`. The problem is that when we set up an implicit fallback in `call_function` with a layout constraint, we never get a chance to apply that constraint. The refactor moves the code that applies layout constraints into `call_function`.

Pull Request resolved: pytorch#140996
Approved by: https://github.com/jansel
…0084)

Summary: The way we've been de/serializing sympy.Exprs is not roundtrippable in all cases (serialize by calling `str(expr)`, and deserialize by calling `sympy.sympify(expr_str)`). This has led to expressions being mathematically equivalent but structurally different, causing issues in ValueRanges. Example issue: pytorch#136797
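A minimal sympy sketch of the non-roundtrippable case (illustrative only; the export serializer itself is more involved):

```python
import sympy

x = sympy.Symbol("x")

# An unevaluated sum keeps its structure: Add(x, x), not 2*x.
expr = sympy.Add(x, x, evaluate=False)

# str() prints "x + x"; sympify() re-canonicalizes it on the way back in.
round_tripped = sympy.sympify(str(expr))

print(sympy.srepr(expr))           # Add(Symbol('x'), Symbol('x'))
print(sympy.srepr(round_tripped))  # Mul(Integer(2), Symbol('x'))
# Mathematically equal, structurally different: the ValueRanges hazard.
```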

This starts to deprecate the use of `expr_str` and stores expressions in AST format instead. For BC purposes, `expr_str` deserialization is still supported, but we will always serialize to `expr_ast`. We'll kill this once the serialization upgrader design is finalized and implemented.

Test Plan: test_export

Differential Revision: D65638757

Pull Request resolved: pytorch#140084
Approved by: https://github.com/angelayi
…41003)

Fixes `python test/inductor/test_fused_attention.py SDPAPatternRewriterCpuTests.test_pattern_fails_with_unsupported_mask_cpu` when `specialize_float=False`. You might wonder how that is related: there is a "negative" test that expects us *not* to match. Previously the match would fail on `isinstance(param, Tensor)`, but now that we tensorify the float, the pattern did match and caused a failure. The added check ensures the mask has the same shape, so this negative test case actually fails to match as intended.

Pull Request resolved: pytorch#141003
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#140983
Implement RNG Generator by falling back to CPUGeneratorImpl.

Pull Request resolved: pytorch#138449
Approved by: https://github.com/ezyang
…1022)

Detailed description:

The code below raises an error:
```Python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def func(a):
    b = a + 1
    c = b.view(-1)
    c.add_(1)
    return b

input = torch.randn(2)
out = make_fx(func)(input)
```

The error looks like this:
```Python
...
  File "/root/Git.d/pytorch/pytorch/torch/_dynamo/codegen.py", line 34, in <module>
    from .variables.torch_function import TensorWithTFOverrideVariable
  File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 185, in <module>
    populate_builtin_to_tensor_fn_map()
  File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 146, in populate_builtin_to_tensor_fn_map
    inp0 = torch.ones(1)
  File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1240, in __torch_function__
    return func(*args, **kwargs)
  File "/root/Git.d/pytorch/pytorch/torch/utils/_stats.py", line 21, in wrapper
    return fn(*args, **kwargs)
  File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1342, in __torch_dispatch__
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 907, in proxy_call
    name=proxy_mode.tracer.graph._target_to_str(func.overloadpacket.__name__),
AttributeError: 'PythonKeyTracer' object has no attribute 'graph'
...
```

Solution:
Import torch._dynamo before dispatch_trace is called, so that the tracing context set up for dispatch_trace cannot affect the import-time code in torch._dynamo.
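A sketch of the idea behind the fix (illustrative; the actual patch changes where the import happens inside proxy_tensor):

```python
# Make sure torch._dynamo is imported *before* any tracing context is
# active, so its import-time tensor ops (e.g. the torch.ones(1) call in
# populate_builtin_to_tensor_fn_map) are not intercepted by the proxy
# tracer. The PR does this inside proxy_tensor; the same effect can be
# reproduced from user code:
import torch
import torch._dynamo  # eager import, outside any tracing context
from torch.fx.experimental.proxy_tensor import make_fx

def func(a):
    b = a + 1
    c = b.view(-1)
    c.add_(1)
    return b

out = make_fx(func)(torch.randn(2))  # no lazy dynamo import mid-trace
```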

Pull Request resolved: pytorch#141022
Approved by: https://github.com/ezyang
Tracking issue: pytorch#138399

This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with
its Python ref implementation. The following example shows the inconsistency between the
two:

```python
def run(device):
    S = (5,)
    a = torch.rand(S, device=device, dtype=torch.float32)
    b = 2
    out = torch.empty(S, device=device, dtype=torch.float64)
    return torch.pow(a, b, out=out)

>>> run("cpu")
Traceback (most recent call last):
  File "test.py", line 34, in run
    return torch.pow(a, b, out=out)
RuntimeError: Found dtype Double but expected Float

>>> run("meta")
tensor(..., device='meta', size=(5,), dtype=torch.float64)
```

**~Update:~**

~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed
2 further modifications:~

- ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching
output dtypes~
- ~Create a dispatch for `pow` when `_refs.pow()` is called~

**Update:**

Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and,
after the discussion below, more correct. The solution was to change the
`TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function,
setting:

- `cast_common_dtype_to_outputs`; and
- `enforce_safe_casting_to_output`.
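
For reference, the safe-casting rule that `enforce_safe_casting_to_output` relies on can be probed from Python (a hedged illustration; the actual check lives inside TensorIterator, and tying it to `torch.can_cast` is my reading of the config):

```python
import torch

# With the new config, the common compute dtype (float32 here) is cast to
# the out= dtype, which must be a safe (non-narrowing) cast:
print(torch.can_cast(torch.float32, torch.float64))  # True: allowed
print(torch.can_cast(torch.float64, torch.float32))  # False: rejected
```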

Pull Request resolved: pytorch#140287
Approved by: https://github.com/ezyang
Instead of calling the `REGISTER_FUSED_ADAM_OP` macro with 7 parameters 16 times, define 4 type-parameter macros for each op and then one macro that defines the quartet of ops: Adam, AdamW, and their grad functions.
Pull Request resolved: pytorch#141103
Approved by: https://github.com/kulinseth
ghstack dependencies: pytorch#141089, pytorch#141090, pytorch#141092
@tinglvv
Collaborator Author

tinglvv commented Nov 20, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 20, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: Raised by workflow job

Failing merge rule: Core Maintainers

laithsakka and others added 8 commits November 20, 2024 16:48
Summary:
**Wins**
On the torchrec benchmark, for 2K nodes this saves 40 seconds;
with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 seconds (with the max opt on).
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200
```
This diff optimizes construction of expressions of the form
a+b+c... (all unique symbols), which are very common in torchrec models.

**How**
Expressions of the form a+b+c are not optimized by add; the only optimization needed is sorting the terms.
If we have a+b+c and we are adding (d) to it, we can do a binary search to find
the position of (d) and avoid re-optimizing the new expression by passing the new order, as sketched below.
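
A minimal standalone sketch of the binary-search insertion idea (names are illustrative, not the actual implementation in this diff):

```python
import bisect

def add_term_sorted(sorted_terms, new_term):
    # sorted_terms: unique symbol names already in sorted (canonical) order,
    # representing a sum like a + b + c. Rather than re-canonicalizing the
    # whole sum, binary-search for the new term's slot and splice it in:
    # O(log n) comparisons instead of a full re-sort/re-simplify.
    i = bisect.bisect_left(sorted_terms, new_term)
    return sorted_terms[:i] + [new_term] + sorted_terms[i:]

# a + b + c, then add d: the resulting order is known without re-optimizing.
print(add_term_sorted(["a", "b", "c"], "d"))  # ['a', 'b', 'c', 'd']
```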

**Extensions**:
1. Support constant terms.
2. Support 10a+10b+... (this will give even more wins; support will be extended in a second PR).

Differential Revision: D66008482

Pull Request resolved: pytorch#140822
Approved by: https://github.com/ezyang
* Automatically applies ruff rule 401, turning loops into equivalent list comprehensions, which are faster and do not leak the loop variables into the enclosing scope (see the sketch below).
* List comprehensions not only often have better typing, but are 50+% faster than for loops in overhead. They also preserve length information, etc., and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt.
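
A minimal before/after of this kind of rewrite (a made-up example, not code from the PR):

```python
# Before: manual loop accumulating into a list; `item` leaks into the
# enclosing scope after the loop finishes.
squares = []
for item in range(10):
    squares.append(item * item)

# After: equivalent list comprehension; faster (no repeated .append lookup)
# and `item` stays scoped to the comprehension.
squares = [item * item for item in range(10)]
```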

Pull Request resolved: pytorch#140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
Easy fix adding the missing `reserve` calls in the NT Matmul CUDA kernel to improve perf.
Pull Request resolved: pytorch#141130
Approved by: https://github.com/malfet
This uploads the MPS benchmark results to benchmark database.  The data can then be queried, for example:

```
select benchmark, model, metric from oss_ci_benchmark_v3 where head_sha = '99a133116fee15aa1467165f2b209b37da53f189' and metric.name in ['eager_peak_mem', 'dynamo_peak_mem', 'speedup'] and model.name = 'BERT_pytorch'
```

I'm documenting the JSON format at https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database

### Testing

Locally,

```
PYTHONPATH=/Users/huydo/Storage/mine/benchmark python benchmarks/dynamo/torchbench.py --performance --only resnet152 --backend eager --training --devices mps --output test/test-reports/torchbench_training.csv
```

Workflow dispatch https://github.com/pytorch/pytorch/actions/runs/11927990520

Pull Request resolved: pytorch#141087
Approved by: https://github.com/malfet
I'm trying to make these benchmark results available on the OSS benchmark database so that people can query them from outside. The first step is to also record the results in a JSON format compatible with the database schema defined in pytorch/test-infra#5839.

Existing CSV files remain unchanged.

### Testing

The JSON results are uploaded as artifacts to S3 https://github.com/pytorch/pytorch/actions/runs/11809725848/job/32901411180#step:26:13, for example https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/11809725848/1/artifact/test-jsons-test-pr_time_benchmarks-1-1-linux.g4dn.metal.nvidia.gpu_32901411180.zip

Pull Request resolved: pytorch#140493
Approved by: https://github.com/laithsakka
Fixes pytorch#107302

This is a clone and fix for pytorch#139199.

This PR is a small step toward overall NumPy 2 support.
It adds a new CI job that tests with NumPy 2, with one test file only.
More tests will be fixed and added in follow-up pull requests.

Pull Request resolved: pytorch#140586
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <[email protected]>
@pytorch-bot pytorch-bot bot added the ciflow/inductor, ciflow/linux-aarch64, ciflow/mps, module: amp, module: compiled autograd, module: cpu, module: distributed_checkpoint, module: dynamo, module: inductor, module: mkldnn, oncall: distributed, and release notes: quantization labels Nov 21, 2024
@tinglvv tinglvv closed this Nov 21, 2024
@tinglvv tinglvv deleted the add-cu126-tag branch November 21, 2024 20:05
@tinglvv
Collaborator Author

tinglvv commented Nov 21, 2024

Closing, as an accidental wrong rebase happened.
