Conversation

@salexspb
Contributor

Summary:
This is useful for measuring the inference performance of your
models. It is a very basic benchmark for now: we don't support
batching on the benchmark side, and no inter- or intra-op parallelism
is supported yet, just caller-based parallelism.

The main philosophy here is that the user should be able to provide
inputs from Python and just stack them within the benchmark. The API
should be exactly the same as passing inputs to module.forward.
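
A rough sketch of how this could be used (the class name, method names, and import path here are assumptions drawn from where the utility eventually landed, not necessarily the API in this PR):

```python
import torch
from torch.utils import ThroughputBenchmark  # assumed final location

module = torch.jit.load("model.pt")  # placeholder model path
bench = ThroughputBenchmark(module)

# Inputs are added exactly as they would be passed to module.forward.
for _ in range(100):
    bench.add_input(torch.rand(1, 3, 224, 224))

# Caller-based parallelism: several calling threads invoke forward concurrently.
stats = bench.benchmark(num_calling_threads=4, num_warmup_iters=10, num_iters=1000)
print(stats)
```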

Differential Revision: D15435461

@pytorchbot pytorchbot added the caffe2, oncall: jit, module: build, and module: internals labels on May 21, 2019
Contributor

@soumith soumith left a comment

This should be a separate binary/utility, outside of csrc -- from what I see it doesn't need to hook into any of the internals in any way.
It should probably go into the benchmarks folder as a separate utility?

@dzhulgakov
Collaborator

pytorch/benchmarks indeed feels like a better place. You can make it a separate binary/library and link against libtorch.so

The reasoning is that it's more of a scaffolding around pytorch that doesn't require changes in the core (like e.g. the autograd profiler), so putting it separately is cleaner. Even if we were to put it in the same build, csrc is probably the wrong place for it. cc @gchanan for input too.

@salexspb
Contributor Author

@dzhulgakov, @soumith, my motivation is that I would like any researcher to be able to easily see the performance of their model in an inference-like setup. It is not specific to a model; it is a generic benchmarking utility.

So yes, I would like it to be part of the same build. Then one doesn't need to recompile anything after developing a model and can just call this utility. The way I see it, it should work this way both in OSS and internally.

No issues with the pytorch/benchmarks folder specifically, but would it work the way I described above?

@salexspb salexspb requested review from zdevito and zheng-xq May 21, 2019 23:12
Collaborator

@dzhulgakov dzhulgakov left a comment

Hm, if we want to expose it in the main build, I guess it doesn't make sense to build it as a separate python extension.

The question becomes whether it clears the bar for general usefulness (similarly to the autograd profiler) - @soumith @gchanan. But in that case we probably want to put it into some sub-namespace (torch.profiler? torch.benchmark? the autograd.profiler name is a bit unfortunate).

@salexspb
Contributor Author

Putting things into a different Python namespace makes sense to me; putting them into jit was mostly meant to prompt this discussion and figure out what the proper place is :)

@soumith , @zdevito , @gchanan , given the goal I described above (of making this easily available to any researcher / ml practitioner so they can test their model throughput without additional builds), should I just go ahead and create a new python namespace like torch.benchmark? I assume, in this case, the c++ code stays in the same place, but please let me know if there is a better place for it.

autograd.profiler should likely also live there :)

@zdevito
Contributor

zdevito commented May 22, 2019

We should probably provide a torch.profiler package as part of the main build, or in an easily installable package similar to torchvision. First, this is where the autograd profiler should live: it has long since become more generic than just profiling autograd. We can make torch.autograd.profiler alias torch.profiler for now. Second, a generic harness for timing the throughput and latency of a model under appropriate multi-core load is a useful enough tool that it seems to make sense for it to go into torch.profiler.

Note that this will enforce a higher bar of quality on it than if it were for our own use only, but that is probably a good thing, because we want to make it easier to accurately benchmark our models anyway. It can serve as a way to capture best practices for our external users. Example: the JIT requires warm-up (at least one iteration). I don't expect users to know this, so if they write their own benchmark harness they often report slower numbers.
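
To illustrate the warm-up point, here is a minimal hand-rolled timing loop (a sketch, not code from this PR): without the warm-up iterations, one-time JIT compilation/optimization cost gets folded into the reported numbers.

```python
import time
import torch

def naive_throughput(module, example_input, num_warmup=10, num_iters=100):
    with torch.no_grad():
        # Warm-up lets the JIT specialize and optimize the graph before timing starts.
        for _ in range(num_warmup):
            module(example_input)
        start = time.perf_counter()
        for _ in range(num_iters):
            module(example_input)
        elapsed = time.perf_counter() - start
    return num_iters / elapsed  # iterations per second
```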

Contributor

@zdevito zdevito left a comment

Seems like a good first pass. If this is going to go into PyTorch proper and have python bindings, it needs to work with nn.Module not just ScriptModules.

@soumith
Contributor

soumith commented May 23, 2019

There is precedent (and good design) for this. Take a look at torch.utils.bottleneck: https://pytorch.org/docs/stable/bottleneck.html

It was a tool aimed at telling users, in a one-touch way, what part of their code is a bottleneck -- without the user needing to understand how to use the profiler or plug profiler code deeper into their codebase.
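
For reference, the documented invocation is a one-shot wrapper around an existing script (the script path and arguments below are placeholders):

```
python -m torch.utils.bottleneck /path/to/source/script.py [args]
```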

@soumith
Contributor

soumith commented May 23, 2019

Similarly, we can probably aim for torch.utils.benchmark, but I think the API needs to be cleaned up a bit. For example, config can be removed in favor of just using args/kwargs (if you choose, you can still pass a dict in as kwargs with throughput(**config) or something).
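
A minimal sketch of that suggestion (the function and parameter names are illustrative, not the actual API):

```python
import torch

# Illustrative only: keyword arguments replace an explicit config object on the Python side.
def throughput(module, *, num_calling_threads=1, num_warmup_iters=10, num_iters=100):
    # ... run the benchmark here ...
    return {"num_calling_threads": num_calling_threads, "num_iters": num_iters}

# Callers who prefer a config dict can still unpack it as kwargs:
config = {"num_calling_threads": 4, "num_iters": 1000}
stats = throughput(torch.nn.Identity(), **config)
```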

@soumith
Contributor

soumith commented May 23, 2019

The other thing I can think of that the API needs to improve on is going beyond CPU throughput.
It's pretty common to do GPU inference. If so, what exactly are the tuning parameters there (for CPU it's the number of threads, etc.)?

@salexspb salexspb force-pushed the export-D15435461 branch from e10aa56 to 137be71 on May 23, 2019 18:48
@salexspb
Contributor Author

salexspb commented May 23, 2019

So here is a new version. I think I addressed all the API questions except the following:

  1. @zdevito's suggestion to support nn.Module. Do you suggest we should just take an nn.Module and trace it? IMO, running Python code through a throughput inference benchmark doesn't make much sense; people don't really deploy Python. At least this is not a big use case, and I would prefer not to be blocked on it. If the suggestion is just to try tracing the module, it should be a few-line change (see the tracing sketch after this list).

  2. @soumith's comment about supporting GPU inference. I think this is a good goal. API-wise, calling threads should stay: in the majority of cases the calling threads are CPU threads that offload to an accelerator card. Speaking of which, we would actually want to support other hardware as well. I think we could provide an additional parameter to the benchmark specifying which device to run the model on. With future PEP integration we will hopefully allow people to submit remote jobs from an IPython notebook and run this whole thing on a remote host with the hardware they want to test against. How cool would that be? :)
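
A rough sketch of how nn.Module support via tracing could look (assuming an example input is available to trace with; the helper name is made up):

```python
import torch

def to_script_module(module, example_input):
    # Hypothetical helper: if an eager nn.Module is passed in, trace it so the
    # benchmark itself only ever runs a ScriptModule.
    if isinstance(module, torch.jit.ScriptModule):
        return module
    return torch.jit.trace(module, example_input)
```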

I also didn't move the module's location (it is currently under torch.jit for no particular reason); I intend to move it to torch.profiler as soon as we have agreed on that step for sure :)

Update summary:

  1. Nicer API for calling the benchmark from Python. The benchmark() method now takes keyword arguments, which I convert to a BenchmarkConfig struct later in the Python helper (see the sketch after this list). I would like to keep the struct for the C++ API, as C++ doesn't have keyword arguments and I don't want the C++ benchmark() method to become a mess.

  2. Hid the model._c call.

  3. More documentation.

  4. @dzhulgakov's suggestions about getters and setters.

  5. @dzhulgakov's suggestions on ExecutionStats field names.

  6. Fixed an out-of-bounds issue.
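
A minimal sketch of what the Python helper in item 1 could look like, assuming a bound BenchmarkConfig struct with matching field names (the binding names here are assumptions, not necessarily what this PR ships):

```python
import torch

def benchmark(bound_benchmark, **kwargs):
    # Assumed binding names; the real config/benchmark bindings may differ.
    config = torch._C.BenchmarkConfig()
    for name, value in kwargs.items():
        setattr(config, name, value)  # e.g. num_calling_threads, num_iters
    return bound_benchmark.benchmark(config)
```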

Collaborator

@dzhulgakov dzhulgakov left a comment

Looks nice, I'd defer to @gchanan on naming. My guess would be torch.profiler.ThroughputBenchmark

Collaborator

you could probably cast to something like std::chrono::duration<float, std::milli>, but I'm not sure how exactly it works

@salexspb
Contributor Author

Spoke with @zdevito; going to add support for nn.Module in order to be more generic.

@soumith
Contributor

soumith commented May 26, 2019

The final location should be in torch.utils; I'm pretty sure it should be torch.utils.throughput_benchmark, unless there is somehow a natural place for it in torch.autograd.profiler, which I think there isn't.

I also think throughput_benchmark is a long name, so I wonder if throughput has alternatives, but I don't care about bikeshedding it too much.

@salexspb salexspb force-pushed the export-D15435461 branch from 137be71 to 7cacabd on June 3, 2019 23:16
@salexspb
Contributor Author

salexspb commented Jun 5, 2019

@dzhulgakov, sorry for the churn. I addressed XQ's feedback on the internal diff w.r.t. CV usage; the Python module part needs some updates.

@dzhulgakov
Collaborator

@bddppq pointed me to https://pybind11.readthedocs.io/en/stable/faq.html#someclass-declared-with-greater-visibility-than-the-type-of-its-field-someclass-member-wattributes - so it seems that pybind11 has this issue.

You can either declare your classes explicitly as __attribute__((__visibility__("hidden"))), or we should really try to enable -fvisibility=hidden for the Python extension (trying it now) -- it was already enabled for C2's extension.

Contributor

@apaszke apaszke left a comment

There are some things that we should clean up and improve before this is merged. There's one place that might be running without the GIL which is unsafe, cloneInputs doesn't clone inputs, etc.

Contributor

If you plan to add things later, add them later. There's no need to have dead code in the codebase. It's not like adding this field will break BC.

Contributor

nit: prefer using instead of typedef

Contributor

@facebook-github-bot facebook-github-bot left a comment

@salexspb is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@salexspb
Contributor Author

Hey @apaszke, I am going to kindly ask you to follow up offline with me about this PR, to make sure we don't have too much back and forth here, as I consider your last set of comments non-blocking -- with all due respect. Kindly note that this PR has been outstanding for quite some time. I am going to respectfully land it for now, and if you can show me some drawbacks of the current approach, I will iterate on it later.

I think I have already addressed the feedback from four people on this PR who have kindly spent a lot of time iterating on it for several weeks, while there was not a single comment from you here before. In my view, addressing comments from four reviewers who have been highly involved in this PR is sufficient for merging. The only reason I didn't merge it yet is the annoying visibility issues, which I have now fixed thanks to @dzhulgakov's and @bddppq's help.

I am also going to be in NYC next week; not sure where you are based, but if you are there, happy to chat in person! We are also going to have some sessions there specifically designed to address extra back and forth on diffs, which has been a BIG pain point for people in AI Dev Platform. This PR is included in the list! :) Please also note that I have already made quite a few compromises here. In my view, support for nn.Module is not required for merging this, but I compromised on it and added it in the first version, which complicated things quite a bit and slowed this down.

Summary:
This is useful for measuring the inference performance of your
models. It is a very basic benchmark for now: we don't support
batching on the benchmark side, and no inter- or intra-op parallelism
is supported yet, just caller-based parallelism.

The main philosophy here is that the user should be able to provide
inputs from Python and just stack them within the benchmark. The API
should be exactly the same as passing inputs to module.forward.
Pull Request resolved: pytorch#20766

Test Plan: Added a new unit test

Differential Revision: D15435461

Pulled By: salexspb

fbshipit-source-id: e8759c417faf6c52807d5a685eed48e023cd40e9
@facebook-github-bot
Contributor

@salexspb merged this pull request in 9b45237.

@suo
Member

suo commented Jun 24, 2019

@salexspb this PR appears to have broken Windows builds on master :(

@soumith
Contributor

soumith commented Jun 24, 2019

FYI, unlanding because it broke the Windows build.

@salexspb
Contributor Author

@suo, @soumith, do you have logs showing the failures by any chance? There was a single failing job when I landed, and it seemed unrelated to my change. Speaking of which, before some fixes I made there were indeed Windows failures, and I was hoping those are now addressed.

@salexspb
Contributor Author

Apparently I might've missed the error due to it being in the "full logs" only. Will try to resend tonight.

@salexspb
Contributor Author

revert of the revert: #22185
