Commit 415b882
Update on "[inductor] [cpp] use non-temporal tile load for A"
Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.
Verified with AMP on both static and dynamic shapes on a CPU with AMX support: there is no obvious end-to-end performance boost, and no regression either. We expect a performance gain once #129348 (also in this ghstack) is added on top of this PR.
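For readers unfamiliar with the intrinsic, here is a minimal sketch of the idea. It is not the code from this PR: the single 16x16x32 BF16 tile shape, the VNNI-packed B layout, the portable fallback, and every helper name (`TileConfig`, `gemm_amx`, `gemm_ref`, `gemm_tile`, `self_check`) are assumptions made for illustration. It loads A with `_tile_stream_loadd` and B with the regular `_tile_loadd`; the non-temporal variant is only a hint to the hardware.

```cpp
// Sketch only: one 16x16x32 BF16 tile GEMM, C = A * B.
// Build with AMX enabled (e.g. -mamx-tile -mamx-bf16 on GCC/Clang) to get the
// tile path; otherwise only the portable reference path is compiled.
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

#if defined(__AMX_TILE__) && defined(__AMX_BF16__) && defined(__linux__) && defined(__x86_64__)
#include <immintrin.h>
#include <sys/syscall.h>
#include <unistd.h>
#define SKETCH_HAS_AMX 1
#else
#define SKETCH_HAS_AMX 0
#endif

using bf16 = uint16_t;  // raw bfloat16 bits
constexpr int M = 16, N = 16, K = 32;

inline bf16 to_bf16(float f) {  // truncating conversion, enough for a sketch
  uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return static_cast<bf16>(u >> 16);
}

inline float from_bf16(bf16 h) {
  uint32_t u = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// Reference path. A is row-major [M][K]; Bv is B packed in VNNI layout
// [K/2][N][2], i.e. element (k, n) lives at Bv[(k/2)*2*N + 2*n + k%2].
void gemm_ref(const bf16* A, const bf16* Bv, float* C) {
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k)
        acc += from_bf16(A[m * K + k]) * from_bf16(Bv[(k / 2) * 2 * N + 2 * n + k % 2]);
      C[m * N + n] = acc;
    }
}

#if SKETCH_HAS_AMX
// 64-byte tile configuration block consumed by _tile_loadconfig.
struct alignas(64) TileConfig {
  uint8_t palette_id;
  uint8_t start_row;
  uint8_t reserved[14];
  uint16_t colsb[16];  // bytes per row for each tile
  uint8_t rows[16];    // rows for each tile
};

bool gemm_amx(const bf16* A, const bf16* Bv, float* C) {
  // Linux: request permission for tile data (ARCH_REQ_XCOMP_PERM, XTILEDATA).
  if (syscall(SYS_arch_prctl, 0x1023, 18) != 0) return false;
  TileConfig cfg{};
  cfg.palette_id = 1;
  for (int t = 0; t < 3; ++t) {  // tiles 0 (C), 1 (A), 2 (B): 16 rows x 64 bytes
    cfg.rows[t] = 16;
    cfg.colsb[t] = 64;
  }
  _tile_loadconfig(&cfg);
  _tile_zero(0);
  _tile_stream_loadd(1, A, K * sizeof(bf16));  // A: non-temporal hint
  _tile_loadd(2, Bv, 2 * N * sizeof(bf16));    // B: regular load, meant to stay cached
  _tile_dpbf16ps(0, 1, 2);                     // C += A * B
  _tile_stored(0, C, N * sizeof(float));
  _tile_release();
  return true;
}
#endif

// Uses the AMX path when it was compiled in and the OS grants tile
// permission; falls back to the reference loop otherwise.
void gemm_tile(const bf16* A, const bf16* Bv, float* C) {
#if SKETCH_HAS_AMX
  if (gemm_amx(A, Bv, C)) return;
#endif
  gemm_ref(A, Bv, C);
}

// A = all 1.0, B = all 2.0, so every C element should be 1 * 2 * K.
bool self_check() {
  std::vector<bf16> A(M * K, to_bf16(1.0f)), Bv(K * N, to_bf16(2.0f));
  std::vector<float> C(M * N, 0.0f);
  gemm_tile(A.data(), Bv.data(), C.data());
  for (float c : C)
    if (c != 2.0f * K) return false;
  return true;
}
```

In a real blocked GEMM the same B tile would be reused across many A tiles, which is where streaming A is expected to matter; a single-tile call like `gemm_tile` only shows which load goes where.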
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang
[ghstack-poisoned]
1,582 files changed, +38714 -31499 lines changed