Conversation

@salexspb
Contributor

Summary:
This is useful for measuring the inference performance of your
models. It is a very basic benchmark for now: we don't support
batching on the benchmark side, and no inter- or intra-op parallelism
is supported yet, just caller-based parallelism.

The main philosophy here is that the user should be able to provide
inputs from Python and just stack them within the benchmark. The API
should be exactly the same as passing inputs to module.forward.
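
A rough sketch of how this could be used (the class name, method names, and import path here are assumptions drawn from where the utility eventually landed, not necessarily the API in this PR):

```python
import torch
from torch.utils import ThroughputBenchmark  # assumed final location

module = torch.jit.load("model.pt")  # placeholder model path
bench = ThroughputBenchmark(module)

# Inputs are added exactly as they would be passed to module.forward.
for _ in range(100):
    bench.add_input(torch.rand(1, 3, 224, 224))

# Caller-based parallelism: several calling threads invoke forward concurrently.
stats = bench.benchmark(num_calling_threads=4, num_warmup_iters=10, num_iters=1000)
print(stats)
```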

Differential Revision: D15435461

@pytorchbot pytorchbot added the caffe2, oncall: jit, module: build, and module: internals labels on May 21, 2019
Contributor

@soumith soumith left a comment

This should be a separate binary/utility, outside of csrc -- from what I see it doesn't need to hook into any of the internals in any way.
It should probably go into the benchmarks folder as a separate utility?

@dzhulgakov
Collaborator

pytorch/benchmarks indeed feels like a better place. You can make it a separate binary/library and link against libtorch.so

The reasoning is that it's more of a scaffolding around pytorch that doesn't require changes in the core (like e.g. the autograd profiler), so putting it separately is cleaner. Even if we were to put it in the same build, csrc is probably the wrong place for it. cc @gchanan for input too.

@salexspb
Contributor Author

@dzhulgakov, @soumith, my motivation is that I would like any researcher to be able to easily see the performance of their model in an inference-like setup. It is not specific to a model; it is a generic benchmarking utility.

So yes, I would like it to be part of the same build. Then one doesn't need to recompile anything after developing a model and can just call this utility. The way I see it, it should work this way both in OSS and internally.

No issues with the pytorch/benchmarks folder specifically, but would it work the way I described above?

@salexspb salexspb requested review from zdevito and zheng-xq May 21, 2019 23:12
Collaborator

@dzhulgakov dzhulgakov left a comment

Hm, if we want to expose it in the main build, I guess it doesn't make sense to build it as a separate python extension.

The question becomes whether it clears the bar for general usefulness (similarly to the autograd profiler) - @soumith @gchanan. But in that case we probably want to put it into some sub-namespace (torch.profiler? torch.benchmark? the autograd.profiler name is a bit unfortunate).

@salexspb
Contributor Author

Putting things into a different Python namespace makes sense to me; putting them into jit was mostly meant to prompt this discussion and figure out what the proper place is :)

@soumith , @zdevito , @gchanan , given the goal I described above (of making this easily available to any researcher / ml practitioner so they can test their model throughput without additional builds), should I just go ahead and create a new python namespace like torch.benchmark? I assume, in this case, the c++ code stays in the same place, but please let me know if there is a better place for it.

autograd.profiler should likely also live there :)

@zdevito
Contributor

zdevito commented May 22, 2019

We should probably provide a torch.profiler package as part of the main build, or in an easily installable package similar to torchvision. First, this is where the autograd profiler should live: it has long since become more generic than just profiling autograd. We can make torch.autograd.profiler alias torch.profiler for now. Second, a generic harness for timing the throughput and latency of a model under appropriate multi-core load is a useful enough tool that it seems to make sense for it to go into torch.profiler.

Note that this will enforce a higher bar of quality on it than if it were for our own use only, but that is probably a good thing, because we want to make it easier to accurately benchmark our models anyway. It can serve as a way to capture best practices for our external users. Example: the JIT requires warm-up (at least one iteration). I don't expect users to know this, so if they write their own benchmark harness they often report slower numbers.
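
To illustrate the warm-up point, here is a minimal hand-rolled timing loop (a sketch, not code from this PR): without the warm-up iterations, one-time JIT compilation/optimization cost gets folded into the reported numbers.

```python
import time
import torch

def naive_throughput(module, example_input, num_warmup=10, num_iters=100):
    with torch.no_grad():
        # Warm-up lets the JIT specialize and optimize the graph before timing starts.
        for _ in range(num_warmup):
            module(example_input)
        start = time.perf_counter()
        for _ in range(num_iters):
            module(example_input)
        elapsed = time.perf_counter() - start
    return num_iters / elapsed  # iterations per second
```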

Contributor

@zdevito zdevito left a comment

Seems like a good first pass. If this is going to go into PyTorch proper and have python bindings, it needs to work with nn.Module not just ScriptModules.

@soumith
Contributor

soumith commented May 23, 2019

There is precedent (and good design) for this. Take a look at torch.utils.bottleneck: https://pytorch.org/docs/stable/bottleneck.html

It was a tool aimed at telling users, in a one-touch way, what part of their code is a bottleneck -- without the user needing to understand how to use the profiler or plug profiler code deeper into their codebase.
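
For reference, the documented invocation is a one-shot wrapper around an existing script (the script path and arguments below are placeholders):

```
python -m torch.utils.bottleneck /path/to/source/script.py [args]
```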

@soumith
Contributor

soumith commented May 23, 2019

Similarly, we can probably aim for torch.utils.benchmark, but I think the API needs to be cleaned up a bit. For example, config can be removed in favor of just using args/kwargs (if you choose, you can still pass a dict in as kwargs with throughput(**config) or something).
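
A minimal sketch of that suggestion (the function and parameter names are illustrative, not the actual API):

```python
import torch

# Illustrative only: keyword arguments replace an explicit config object on the Python side.
def throughput(module, *, num_calling_threads=1, num_warmup_iters=10, num_iters=100):
    # ... run the benchmark here ...
    return {"num_calling_threads": num_calling_threads, "num_iters": num_iters}

# Callers who prefer a config dict can still unpack it as kwargs:
config = {"num_calling_threads": 4, "num_iters": 1000}
stats = throughput(torch.nn.Identity(), **config)
```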

@soumith
Contributor

soumith commented May 23, 2019

The other thing I can think of that the API needs to improve on is going beyond CPU throughput.
It's pretty common to do GPU inference. If so, what exactly are the tuning parameters there (for CPU it's the number of threads, etc.)?

@salexspb salexspb force-pushed the export-D15435461 branch from e10aa56 to 137be71 on May 23, 2019 18:48
@salexspb
Contributor Author

salexspb commented May 23, 2019

So here is a new version. I think I addressed all the API questions except the following:

  1. @zdevito's suggestion to support nn.Module. Do you suggest we should just take an nn.Module and trace it? IMO, running Python code through a throughput inference benchmark doesn't make much sense; people don't really deploy Python. At least this is not a big use case, and I would prefer not to be blocked on it. If the suggestion is just to try tracing the module, it should be a few-line change (see the tracing sketch after this list).

  2. @soumith's comment about supporting GPU inference. I think this is a good goal. API-wise, calling threads should stay: in the majority of cases the calling threads are CPU threads that offload to an accelerator card. Speaking of which, we would actually want to support other hardware as well. I think we could provide an additional parameter to the benchmark specifying which device to run the model on. With future PEP integration we will hopefully allow people to submit remote jobs from an IPython notebook and run this whole thing on a remote host with the hardware they want to test against. How cool would that be? :)
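
A rough sketch of how nn.Module support via tracing could look (assuming an example input is available to trace with; the helper name is made up):

```python
import torch

def to_script_module(module, example_input):
    # Hypothetical helper: if an eager nn.Module is passed in, trace it so the
    # benchmark itself only ever runs a ScriptModule.
    if isinstance(module, torch.jit.ScriptModule):
        return module
    return torch.jit.trace(module, example_input)
```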

I also didn't move the module's location (it is currently under torch.jit for no particular reason); I intend to move it to torch.profiler as soon as we have agreed on that step for sure :)

Update summary:

  1. Nicer API for calling the benchmark from Python. The benchmark() method now takes keyword arguments, which I convert to a BenchmarkConfig struct later in the Python helper (see the sketch after this list). I would like to keep the struct for the C++ API, as C++ doesn't have keyword arguments and I don't want the C++ benchmark() method to become a mess.

  2. Hid the model._c call.

  3. More documentation.

  4. @dzhulgakov's suggestions about getters and setters.

  5. @dzhulgakov's suggestions on ExecutionStats field names.

  6. Fixed an out-of-bounds issue.
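
A minimal sketch of what the Python helper in item 1 could look like, assuming a bound BenchmarkConfig struct with matching field names (the binding names here are assumptions, not necessarily what this PR ships):

```python
import torch

def benchmark(bound_benchmark, **kwargs):
    # Assumed binding names; the real config/benchmark bindings may differ.
    config = torch._C.BenchmarkConfig()
    for name, value in kwargs.items():
        setattr(config, name, value)  # e.g. num_calling_threads, num_iters
    return bound_benchmark.benchmark(config)
```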

Collaborator

@dzhulgakov dzhulgakov left a comment

Looks nice, I'd defer to @gchanan on naming. My guess would be torch.profiler.ThroughputBenchmark

Collaborator

you could probably cast to something like std::chrono::duration<float, std::milli>, but I'm not sure how exactly it works

@salexspb
Contributor Author

Spoke with @zdevito; going to add support for nn.Module in order to be more generic.

@soumith
Contributor

soumith commented May 26, 2019

The final location should be in torch.utils; I'm pretty sure it should be torch.utils.throughput_benchmark, unless there is somehow a natural place for it in torch.autograd.profiler, which I think there isn't.

I also think throughput_benchmark is a long name, so I wonder if throughput has alternatives, but I don't care about bikeshedding it too much.

@salexspb salexspb force-pushed the export-D15435461 branch from 137be71 to 7cacabd on June 3, 2019 23:16
@salexspb
Contributor Author

salexspb commented Jun 5, 2019

@dzhulgakov, sorry for the churn. I addressed XQ's feedback on the internal diff w.r.t. CV usage; the Python module part needs some updates.

@dzhulgakov
Collaborator

@bddppq pointed me to https://pybind11.readthedocs.io/en/stable/faq.html#someclass-declared-with-greater-visibility-than-the-type-of-its-field-someclass-member-wattributes - so it seems that pybind11 has this issue.

You can either declare your classes explicitly as __attribute__((__visibility__("hidden"))), or we should really try to enable -fvisibility=hidden for the Python extension (trying it now) -- it was already enabled for C2's extension.

Contributor

@apaszke apaszke left a comment

There are some things that we should clean up and improve before this is merged. There's one place that might be running without the GIL which is unsafe, cloneInputs doesn't clone inputs, etc.

Contributor

If you plan to add things later, add them later. There's no need to have dead code in the codebase. It's not like adding this field will break BC.

Contributor

nit: prefer using instead of typedef

Contributor

@facebook-github-bot facebook-github-bot left a comment

@salexspb is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@salexspb
Contributor Author

Hey @apaszke, I am going to kindly ask you to follow up offline with me about this PR, to make sure we don't have too much back and forth here, as I consider your last set of comments non-blocking -- with all due respect. Kindly note that this PR has been outstanding for quite some time. I am going to respectfully land it for now, and if you can show me some drawbacks of the current approach, I will iterate on it later.

I think I have already addressed the feedback from four people on this PR who have kindly spent a lot of time iterating on it for several weeks, while there was not a single comment from you here before. In my view, addressing comments from four reviewers who have been highly involved in this PR is sufficient for merging. The only reason I didn't merge it yet is the annoying visibility issues, which I have now fixed thanks to @dzhulgakov's and @bddppq's help.

I am also going to be in NYC next week; not sure where you are based, but if you are there, happy to chat in person! We are also going to have some sessions there specifically designed to address extra back and forth on diffs, which has been a BIG pain point for people in AI Dev Platform. This PR is included in the list! :) Please also note that I have already made quite a few compromises here. In my view, support for nn.Module is not required for merging this, but I compromised on it and added it in the first version, which complicated things quite a bit and slowed this down.

Summary:
This is useful for measuring the inference performance of your
models. It is a very basic benchmark for now: we don't support
batching on the benchmark side, and no inter- or intra-op parallelism
is supported yet, just caller-based parallelism.

The main philosophy here is that the user should be able to provide
inputs from Python and just stack them within the benchmark. The API
should be exactly the same as passing inputs to module.forward.
Pull Request resolved: pytorch#20766

Test Plan: Added a new unit test

Differential Revision: D15435461

Pulled By: salexspb

fbshipit-source-id: e8759c417faf6c52807d5a685eed48e023cd40e9
@facebook-github-bot
Contributor

@salexspb merged this pull request in 9b45237.

@suo
Member

suo commented Jun 24, 2019

@salexspb this PR appears to have broken Windows builds on master :(

@soumith
Contributor

soumith commented Jun 24, 2019

FYI, unlanding because it broke the Windows build.

@salexspb
Contributor Author

@suo, @soumith, do you have logs showing the failures by any chance? There was a single failing job when I landed, and it seemed unrelated to my change. Speaking of which, before some fixes I made there were indeed Windows failures, and I was hoping those are now addressed.

@salexspb
Contributor Author

Apparently I might've missed the error due to it being in the "full logs" only. Will try to resend tonight.

@salexspb
Contributor Author

revert of the revert: #22185
