Skip to content

Conversation

@arktrail
Copy link
Contributor

@arktrail arktrail commented Jul 1, 2020

Summary:
This diff contains the implementation of C++ API for ParameterList from #25883.
Refer to the Python API:

class ParameterList(Module):

Not sure about some naming difference between C++ API and Python API, like append, should it be called push_back

Test Plan: Add unit tests in this diff

Reviewers: @glaringlee

Subscribers:

Tasks: #25883

Tags:

@dr-ci
Copy link

dr-ci bot commented Jul 1, 2020

💊 CI failures summary and remediations

As of commit 24952d5 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 14 times.

zuoxingdong and others added 24 commits June 30, 2020 21:38
Summary:
Since `torch.atanh` is recently implemented in #38388, we should simply use it for `TanhTransform`.
Pull Request resolved: #40160

Differential Revision: D22208039

Pulled By: ezyang

fbshipit-source-id: 34dfbc91eb9383461e16d3452e3ebe295f39df26
…duced (#40825)

Summary:
Continuing #40824

All CIs have been enabled (on a branch that starts with `ci-all/`)
Pull Request resolved: #40825

Differential Revision: D22328732

Pulled By: ezyang

fbshipit-source-id: 3e517d01a9183d95df0687b328fb268947ea5fb0
Summary:
Pull Request resolved: #40830

Fixes #40725

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D22323886

Pulled By: ezyang

fbshipit-source-id: b8a61496923d9f086d4c201024748505ba783238
Summary:
Fixes #33439

This introduces torch._sparse_coo_tensor_unsafe(...) and
torch._validate_sparse_coo_tensor_args(...)
Pull Request resolved: #34059

Differential Revision: D22161254

Pulled By: ezyang

fbshipit-source-id: 994efc9b0e30abbc23ddd7b2ec987e6ba08a8ef0
Summary: Pull Request resolved: #40578

Test Plan: Imported from OSS

Differential Revision: D22239286

Pulled By: IvanKobzarev

fbshipit-source-id: 7a4160b621af8cfcc3b3d9e6da1a75c8afefba27
Summary:
`long long == int64_t != long` in MSVC
Pull Request resolved: #40783

Differential Revision: D22328757

Pulled By: ezyang

fbshipit-source-id: bc7301d6b0e7e00ee6d7ca8637e3fce7810b15e2
#40845)

Summary:
… file

This prevents implementation of those functions(as lambdas) to be embedded as weak symbol into every shared library that includes this header.

Combination of this and #40844 reduces size of `libcaffe2_module_test_dynamic.so` from 500kb to 50Kb.
Pull Request resolved: #40845

Differential Revision: D22334779

Pulled By: malfet

fbshipit-source-id: 64706918fc2947350a58c0877f294b1b8b085455
…40844)

Summary:
If virtual function is implemented in header file, it's implementation will be included as a weak symbol to every shared library that includes this header along with all of it's dependencies.

This was one of the reasons why size of libcaffe2_module_test_dynamic.so  was 500Kb (AddRelatedBlobInfo implementation pulled a quarter of libprotobuf.a with it)

Combination of this and #40845 reduces size of `libcaffe2_module_test_dynamic.so` from 500kb to 50Kb.
Pull Request resolved: #40844

Differential Revision: D22334725

Pulled By: malfet

fbshipit-source-id: 836a4cbb9f344355ddd2512667e77472546616c0
…0708)

Summary:
Pull Request resolved: #40708

Insert NoopObservers for activations and weight tensors for FP16

Test Plan:
python test/test_quantization.py test_prepare_dynamic

Imported from OSS

Differential Revision: D22335976

fbshipit-source-id: b19e8035c7db3b0b065ec09c9ad6d913eb434f3e
Summary:
Pull Request resolved: #40709

Cast to kHalf and back to kFloat before the linear operator to mimic FP16 quant support

Test Plan:
python test/test_quantization.py test_convert_dynamic_fp16

Imported from OSS

Differential Revision: D22335977

fbshipit-source-id: f964128ec733469672a1ed4cb0d757d0a6c22c3a
Summary: Pull Request resolved: #40710

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D22335975

fbshipit-source-id: 5c176bb6b9c300e1beb83df972149dd5a400b854
Summary:
Fix #40701

A new special case is added to let `dim()` save an int instead of self.
Pull Request resolved: #40766

Differential Revision: D22308354

Pulled By: albanD

fbshipit-source-id: 69008230d7398b9e06b8e074a549ae921c2bf603
Summary:
Added missing generator argument in type annotation(#40803)
Pull Request resolved: #40873

Differential Revision: D22344217

Pulled By: malfet

fbshipit-source-id: 9871401b97c96fa20c70e3f66334259ead1f8429
…Destruction (#40585)

Summary:
Pull Request resolved: #40585

This PR aborts incomplete NCCL Communicators in the ProcessGroupNCCL
destructor. This should prevent pending NCCL communicators from blocking other CUDA ops.
ghstack-source-id: 106988073

Test Plan: Sandcastle/ OSS CI

Differential Revision: D22244873

fbshipit-source-id: 4b4fe65e1bd875a50151870f8120498193d7535e
)

Summary:
Moving this to a file that can be source by downstream pytorch/xla.
Pull Request resolved: #40828

Reviewed By: malfet

Differential Revision: D22339513

Pulled By: ailzhang

fbshipit-source-id: c43b18fa2b7e1e8bb6810a6a43bb7dccd4756238
Summary:
Pull Request resolved: #40876

clang format reducer.cpp
ghstack-source-id: 106980050

Test Plan: unit test

Differential Revision: D22321422

fbshipit-source-id: 54afdff206504c7bbdf2e408928cc32068e15cdc
Summary:
Fixes #40158

Description
- docs update: removed incorrect statements
Pull Request resolved: #40617

Reviewed By: ezyang

Differential Revision: D22308802

Pulled By: yns88

fbshipit-source-id: e33084af320f249c0c9ba04bdbe2191d1b954d17
Summary:
Pull Request resolved: #38257

It seems we're doing a runtime type check for custom classes on each operator call if the operator has custom class arguments.
This does not have an effect on operators without custom class arguments, but this is a problem for operators with custom class arguments,
for example operators taking a at::native::xnnpack::Conv2dOpContext argument.

The long term solution would be to move those checks to op registration time instead of doing them at call time,
but as an intermediate fix, we can at least make the check fast by

- Using ska::flat_hash_map instead of std::unordered_map
- Using std::type_index instead of std::string (i.e. avoid calling std::hash on a std::string)
ghstack-source-id: 106805209

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D21507226

fbshipit-source-id: bd120d5574734be843c197673ea4222599fee7cb
Summary:
Pull Request resolved: #40747

-
ghstack-source-id: 106833603

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D22299161

fbshipit-source-id: 6e34999b5f8244d9582e4978754039d340720ca8
Summary:
Right now it is used to check whether `math.remainder` exists, which is the case for both Python-3.7 and 3.8
Pull Request resolved: #40868

Differential Revision: D22343454

Pulled By: malfet

fbshipit-source-id: 6b6d4869705b64c4b952309120f92c04ac7e39fd
…ning-amenable mode (#40546)

Summary:
This PR introduces a warning when user tries to export the model to ONNX in training-amenable mode while constant folding is turned on. We want to warn against any unintentional use because constant folding may fold some parameters that may be intended to be trainable in the exported model.
Pull Request resolved: #40546

Reviewed By: hl475

Differential Revision: D22310917

Pulled By: houseroad

fbshipit-source-id: ba83b8e63af7c458b5ecca8ff2ee1c77e2064f90
Summary:
Pull Request resolved: #40649

The original implementation of RemoveOpsByType is pretty buggy and does not remove all instances of the ops that should be removed. It's also quite complicated and hard to modify. I reimplemented it by first converting the graph to its SSA form. The algorithm is quite simple once the graph is in SSA form. It's very similar to constant propagation with a few modifications. The hardest part is to deal with the case of removing an op with the output being an output of the predict net, because that output has to be preserved.

(Note: this ignores all push blocking failures!)

Reviewed By: yinghai, dzhulgakov

Differential Revision: D22220798

fbshipit-source-id: faf6ed5242f1e2f310125d964738c608c6c55c94
Summary:
Pull Request resolved: #40885

`TracingState::setValue` associates a concrete IValue in the traced
program with a `Value*` symbolic. Previously, the logic for how
GenericDicts worked was special cased to only work for very simple cases
and silently eat other cases.

This PR generalizes the logic to reflect the same behavior as using
dictionaries on input: whenever we encounter a dictionary in the system,
we completely "burn in" all the keys into the graph, and then
recursively call `setValue` on the associated value.

This has the effect of requiring that any dictionary structure you are
creating in a traced program be of fixed structure, similar to how any
dictionary used as input must be static as well.

Test Plan: Imported from OSS

Differential Revision: D22342490

Pulled By: suo

fbshipit-source-id: 93e610a4895d61d9b8b19c8d2aa4e6d57777eaf6
Summary:
Move Storage class from __init__.pyi.in to types.py and make it a protocol, since this is not a real class
Expose `PyTorchFileReader` and `PyTorchFileWriter` native classes

Ignore function attributes, as there are yet no good way to type annotate those, see python/mypy#2087
Pull Request resolved: #40862

Differential Revision: D22344743

Pulled By: malfet

fbshipit-source-id: 95cdb6f980ee79383960f306223e170c63df3232
lw and others added 12 commits July 3, 2020 02:52
…#40814)

Summary:
Pull Request resolved: #40814

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
This prepares the stack by simplifying the TensorPipe fixture. A comment says that the TensorPipe fixture cannot subclass the generic fixture class as that would lead to a diamond class hierarchy which Python doesn't support (whereas in fact it does), and therefore it copies over two properties that are defined on the generic fixture. However, each class that uses the TensorPipe fixture also inherits from the generic fixture, so there's no need to redefine those properties. And, in fact, by not redefining it we save ourselves some trouble when the TensorPipe fixture would end up overriding another override.
ghstack-source-id: 107045914

Test Plan: Sandcastle and CircleCI

Differential Revision: D22287533

fbshipit-source-id: 254c38b36ba51c9d852562b166027abacbbd60ef
Summary:
Pull Request resolved: #40815

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
This prepares the stack by aligning the `ddp_under_dist_autograd` test to the other ones, so that later changes will be more consistent and thus easier to follow. It does so by moving the `skipIf` decorators and the `setUp` methods from the base test suite to the entry point scripts.
ghstack-source-id: 107045911

Test Plan: Sandcastle and CircleCI

Differential Revision: D22287535

fbshipit-source-id: ab0c9eb774b21d81e0ebd3078df958dbb4bfa0c7
Summary:
Pull Request resolved: #40913

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
Once we start merging multiple test suites in a single file (which we'll happen in the next diffs in the stack) the OSX tests on CircleCI start failing due to "too many open files". This indicates a file descriptor leak. I then managed to repro it on Linux too by lowering the limit on open file descriptors (`ulimit -n 500`). Each test method that unittest runs is run on a new instance of the Testcase class. With our multiprocessing wrappers, this instance contains a list of child processes. Even after these processes are terminated, it appears they still hold some open file descriptor (for example a pipe to communicate with the subprocess). It also appears unittest is keeping these Testcase instances alive until the entire suite completes, which I suspect is what leads to this "leak" of file descriptors. Based on that guess, in this diff I am resetting the list of subprocesses during shutdown, and this seems to fix the problem.
ghstack-source-id: 107045908

Test Plan: Sandcastle and CircleCI

Differential Revision: D22356784

fbshipit-source-id: c93bb9db60fde72cae0b0c735a50c17e427580a6
Summary: Pull Request resolved: #40889

Test Plan: Imported from OSS

Differential Revision: D22348266

fbshipit-source-id: eed2ece5c94fcfaf187d6770bed4a7109f0c0b4a
Summary:
Pull Request resolved: #40652

Resolves #40304, but looking for
feedback on whether there is a better approach for this.

In order to profile `rpc_async` calls made within a torchscript function, we
add the profiling logic to `rpcTorchscript` which is the point where the RPC is
dispatched and is called by the jit `rpc_async` operator. We take a somewhat
similar approach to how this is done in the python API. If profiling is
enabled, we call `record_function_enter` which creates a `RecordFunction`
object and runs its starting callbacks. Then, we schedule end callbacks for
this `RecordFunction` to be run when the jit future completes.

One caveat is that `rpcTorchscript` can also be called by rpc_async from a
non-JIT function, in which case the profiling logic lives in Python. We add a
check to ensure that we don't double profile in this case.
ghstack-source-id: 107109485

Test Plan: Added relevant unittests.

Differential Revision: D22270608

fbshipit-source-id: 9f62d1a2a27f9e05772d0bfba47842229f0c24e1
Summary:
Pull Request resolved: #40931

Fix docstrings for dynamic quantized Linear/LSTM and associated classes
ghstack-source-id: 107064446

Test Plan: Docs show up in correctly

Differential Revision: D22360787

fbshipit-source-id: 8e357e081dc59ee42fd7f12ea5079ce5d0cc9df2
Summary:
The doc for `torch.distributed.launch` is missing since v1.2.0 (see issue #36386) because PR #22501 added some imports at the first line.
https://github.com/pytorch/pytorch/blob/542ac74987141c81b6326b9e7d5c6a1d00fc3701/torch/distributed/launch.py#L1-L5
I move it below the docstring to make the autodoc in Sphinx work normally.

Pull Request resolved: #40963

Differential Revision: D22380816

Pulled By: mrshenli

fbshipit-source-id: ee8406785b9a198bbf3fc65e589854379179496f
Summary:
fixes  #40658 #40658

Pull Request resolved: #40879

Reviewed By: pbelevich

Differential Revision: D22339146

Pulled By: ezyang

fbshipit-source-id: 6b4695e102591e7a2c391eb337c154414bacf67c
Summary: Pull Request resolved: #40926

Reviewed By: eellison

Differential Revision: D22359471

Pulled By: Krovatkin

fbshipit-source-id: 823e87674e2d2917f075255ec926e0485972f4e2
Summary: Pull Request resolved: #40866

Reviewed By: pbelevich

Differential Revision: D22358988

Pulled By: Krovatkin

fbshipit-source-id: 7118d7f8d4eaf056cfb71dc0d588d38b1dfb0fc7
@arktrail
Copy link
Contributor Author

arktrail commented Jul 4, 2020

Sorry I made a wrong rebase. I will abandon this diff and create a new one.

@arktrail arktrail marked this pull request as draft July 5, 2020 06:40
@arktrail arktrail mentioned this pull request Jul 5, 2020
@glaringlee
Copy link
Contributor

This one has rebase issue, #40987 is created instead. Close this one now.

@glaringlee glaringlee closed this Jul 5, 2020
@arktrail arktrail deleted the ImplParameterList branch July 9, 2020 18:34
This was referenced Jul 9, 2020
facebook-github-bot pushed a commit that referenced this pull request Jul 13, 2020
Summary:
This is a new PR for #40850, #40987 and #41206 unintentionally closed), as I have some issues for rebates for that one. Very sorry about that. And I have fixed the tests failed in that PR.

This diff contains the implementation of C++ API for ParameterList from #25883.
Refer to the Python API: https://github.com/pytorch/pytorch/blob/bc9e8af21875dafafe9bbd25c8f542b20b2e660f/torch/nn/modules/container.py#L376
Not sure about some naming difference between C++ API and Python API, like `append`, should it be called `push_back`

Pull Request resolved: #41259

Test Plan: Add unit tests in this diff

Differential Revision: D22495780

Pulled By: glaringlee

fbshipit-source-id: 79ea3592db640f35477d445ecdaeafbdad814bec
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.