
[inductor] show performance for each autotune config for a kernel#96248

Merged
shunting314 merged 2 commits intogh/shunting314/23/basefrom
gh/shunting314/23/head
Mar 9, 2023
Merged

[inductor] show performance for each autotune config for a kernel#96248
shunting314 merged 2 commits intogh/shunting314/23/basefrom
gh/shunting314/23/head

Conversation


@shunting314 shunting314 commented Mar 8, 2023

Stack from ghstack (oldest at bottom):

Be able to benchmark the perf for each config of each kernel.

To use it:

  1. Run the model with TORCHINDUCTOR_BENCHMARK_KERNEL enabled, e.g.:

TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only vgg16 --disable-cudagraphs --training

Get the path to the compiled module from the log, e.g.:

Compiled module path: /tmp/torchinductor_shunting/mj/cmjv5hyt3uq2v7beqkthcl4ul6fh2luwfzmd4tnrquworcmqz4i3.py

  2. Run the compiled module directly with the following options:
  • -k to benchmark each kernel
  • -c to benchmark each config for each kernel

Example command:

TORCHINDUCTOR_BENCHMARK_KERNEL=1 python /tmp/torchinductor_shunting/mj/cmjv5hyt3uq2v7beqkthcl4ul6fh2luwfzmd4tnrquworcmqz4i3.py -kc
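The "Compiled module path" line from step 1 can also be pulled out of the log programmatically. A minimal sketch of the idea (the helper name and the sample path below are hypothetical, not part of inductor):

```python
import re

def find_compiled_module_path(log_text):
    """Return the first 'Compiled module path: ...' target in inductor log output."""
    m = re.search(r"Compiled module path: (\S+\.py)", log_text)
    return m.group(1) if m else None

log = "some output\nCompiled module path: /tmp/torchinductor_user/ab/cabc.py\nmore output\n"
print(find_compiled_module_path(log))  # → /tmp/torchinductor_user/ab/cabc.py
```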

Sample result:
(screenshot, 2023-03-06: https://user-images.githubusercontent.com/52589240/223300934-59a4634b-dfd1-46f5-b964-dc0074535236.png)

cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire


pytorch-bot bot commented Mar 8, 2023

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96248
❌ 1 Failures

As of commit 9ce6ad5:

NEW FAILURES - The following jobs have failed:



ngimel commented Mar 8, 2023

Do we need a step to first output the compiled module? Can we output the results of all autotuning configs as we autotune them for the first time, while running the model?

shunting314 (Contributor, Author) replied:

> Do we need a step to first output the compiled module? Can we output the results of all autotuning configs as we autotune them for the first time, while running the model?

Yeah, currently we split this into the following steps:

  1. generate the compiled modules
  2. benchmark each kernel in the compiled modules

This way, we can do step 1 once and repeat step 2 multiple times as we tune heuristics.

If we want, we can also make step 2 happen on the fly during step 1, as you mentioned. There is one tricky part here, though: we have a cache for autotuning results, and we would want to ignore that cache when showing the perf for each config. But since each model is usually run multiple times in our scripts (for warmup or for more stable perf numbers), we need to

  • either take care to only print the autotuning results for the first run,
  • or, even better, only disable the autotuning cache for the first run; future runs will not even autotune because of the cache hit.

Do we want to go this route?
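The second bullet could be sketched roughly like this (the class and method names are hypothetical, not inductor's actual cache API): the first lookup per key re-tunes even if a cached entry exists, so every config gets timed and can be reported once, while later lookups hit the cache.

```python
class AutotuneCache:
    """Sketch: bypass the cache on the first lookup per key so every config
    is benchmarked (and can be reported) once; later runs hit the cache."""

    def __init__(self):
        self._best = {}       # key -> best config found by tuning
        self._tuned = set()   # keys already re-tuned in this process

    def lookup(self, key, tune_fn):
        if key not in self._tuned:
            # first run: ignore any cached result and re-tune,
            # which benchmarks (and can print) every config
            self._best[key] = tune_fn()
            self._tuned.add(key)
        return self._best[key]

calls = []
def tune():
    calls.append(1)
    return "best_config"

cache = AutotuneCache()
cache.lookup("kernel0", tune)
cache.lookup("kernel0", tune)   # cache hit: tune() is not called again
print(len(calls))  # → 1
```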


if ms > 0.012 and gb_per_s < 650:
    print(colorama.Fore.RED + info_str + colorama.Fore.RESET)

def get_info_str(ms, prefix=""):
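The snippet above colors a kernel's info string red when it looks slow. A self-contained sketch of the idea (the signature, the num_gb parameter, and the formatting are assumptions — the real helper in the generated code may differ; plain ANSI escape codes stand in for colorama):

```python
RED, RESET = "\033[31m", "\033[0m"  # stand-ins for colorama.Fore.RED / RESET

def get_info_str(ms, num_gb, prefix=""):
    # achieved memory bandwidth: GB moved / elapsed seconds
    gb_per_s = num_gb / (ms / 1e3)
    info_str = f"{prefix}{ms:.3f}ms    {num_gb:.3f}GB    {gb_per_s:.2f}GB/s"
    if ms > 0.012 and gb_per_s < 650:
        # non-trivial runtime with low achieved bandwidth: flag in red
        info_str = RED + info_str + RESET
    return info_str

print(get_info_str(0.050, 0.001))   # slow and low-bandwidth: printed in red
print(get_info_str(0.010, 1.000))   # fast: printed plain
```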
Review comment from a collaborator:

btw I'm changing this code a bit to put the kernel name at the end: #96170

shunting314 added a commit that referenced this pull request Mar 8, 2023

ngimel commented Mar 8, 2023

Yeah, I think the two-step process is fine for now, but for the future, disabling the cache for the first run as you described would add nice convenience.
The reason I want regular runs to be able to print stats is that sometimes the produced output_code is non-runnable or gives skewed performance because of the random inputs, and it's nice to be able to run on "real" inputs.

shunting314 (Contributor, Author) replied:

> Yeah, I think the two-step process is fine for now, but for the future, disabling the cache for the first run as you described would add nice convenience. The reason I want regular runs to be able to print stats is that sometimes the produced output_code is non-runnable or gives skewed performance because of the random inputs, and it's nice to be able to run on "real" inputs.

Makes sense. Also, you pointed out earlier that there may be aliasing between inputs that randomly generated inputs cannot capture. We can improve these for sure if they turn out to cause problems.
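The aliasing concern can be illustrated with a tiny pure-Python example (a stand-in for tensors sharing storage): randomly generated benchmark inputs are always fresh buffers, while a real call may pass the same buffer twice, so a kernel that mutates one argument sees the change through the other.

```python
# real call: the same buffer is passed for both arguments (they alias)
real_x = [0.0] * 4
real_y = real_x
real_y[0] = 1.0            # a write through one name...
print(real_x[0])           # → 1.0  ...is visible through the other

# benchmark with randomly generated inputs: two fresh buffers, no aliasing
rand_x = [0.0] * 4
rand_y = [0.0] * 4
rand_y[0] = 1.0
print(rand_x[0])           # → 0.0  the write is not visible
```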

shunting314 (Contributor, Author) commented:

@pytorch merge -f "the test_tensorboard failure is unrelated"

shunting314 (Contributor, Author) commented:

@pytorchbot merge -f "the test_tensorboard failure is unrelated"

@shunting314 shunting314 merged commit bc8f9f2 into gh/shunting314/23/base Mar 9, 2023
@shunting314 shunting314 deleted the gh/shunting314/23/head branch March 9, 2023 19:12
@shunting314 shunting314 restored the gh/shunting314/23/head branch March 9, 2023 19:15
shunting314 added a commit that referenced this pull request Mar 9, 2023
shunting314 (Contributor, Author) commented:

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Mar 9, 2023
pytorchmergebot commented:

Can't merge closed PR #96248


Labels: ciflow/inductor, ciflow/trunk, module: inductor, topic: not user facing
