Conversation

@malfet commented Sep 22, 2021

Reported by @cloudhan in #64733 (comment)

Fixes regression introduced by 047e682

cc @malfet @seemethere

pytorch-probot bot commented Sep 22, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/708d689198ebd2fe0af4f72970bb7f2b808ecb34/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-bionic-py3.8-gcc9-coverage ciflow/all, ciflow/coverage, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
puretorch-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
win-vs2019-cuda10.2-py3 ciflow/all, ciflow/cuda, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

facebook-github-bot commented Sep 22, 2021

💊 CI failures summary and remediations

As of commit 708d689 (more details on the Dr. CI page):


  • 4/4 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-bionic-py3.6-clang9 / test (noarch, 1, 1, linux.2xlarge) (1/3)

Step: "Unknown" (full log | diagnosis details | 🔁 rerun)

2021-09-22T20:44:55.9301000Z test_add_done_ca...arg() takes 0 positional arguments but 1 was given
2021-09-22T20:44:55.9279997Z   /opt/conda/lib/python3.6/unittest/suite.py(122): run
2021-09-22T20:44:55.9280578Z   /opt/conda/lib/python3.6/unittest/suite.py(84): __call__
2021-09-22T20:44:55.9281267Z   /opt/conda/lib/python3.6/site-packages/xmlrunner/runner.py(66): run
2021-09-22T20:44:55.9281851Z   /opt/conda/lib/python3.6/unittest/main.py(256): runTests
2021-09-22T20:44:55.9282357Z   /opt/conda/lib/python3.6/unittest/main.py(95): __init__
2021-09-22T20:44:55.9283120Z   /opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py(605): run_tests
2021-09-22T20:44:55.9283707Z   test_futures.py(329): <module>
2021-09-22T20:44:55.9284151Z 
2021-09-22T20:44:55.9284421Z ok (0.002s)
2021-09-22T20:44:55.9295285Z   test_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.002s)
2021-09-22T20:44:55.9301000Z   test_add_done_callback_no_arg_error_is_ignored (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: TypeError: no_arg() takes 0 positional arguments but 1 was given
2021-09-22T20:44:55.9301858Z ok (0.001s)
2021-09-22T20:44:55.9310337Z   test_add_done_callback_simple (__main__.TestFuture) ... ok (0.001s)
2021-09-22T20:44:55.9336860Z   test_chained_then (__main__.TestFuture) ... ok (0.003s)
2021-09-22T20:44:56.0355961Z   test_collect_all (__main__.TestFuture) ... ok (0.102s)
2021-09-22T20:44:56.0362336Z   test_done (__main__.TestFuture) ... ok (0.001s)
2021-09-22T20:44:56.0373616Z   test_done_exception (__main__.TestFuture) ... ok (0.001s)
2021-09-22T20:44:56.0387337Z   test_interleaving_then_and_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.001s)
2021-09-22T20:44:56.0397146Z   test_interleaving_then_and_add_done_callback_propagates_error (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: ValueError: Expected error
2021-09-22T20:44:56.0397847Z 
2021-09-22T20:44:56.0398111Z At:

See GitHub Actions build linux-xenial-cuda11.3-py3.6-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) (2/3)

Step: "Unknown" (full log | diagnosis details | 🔁 rerun)

2021-09-22T22:11:48.8127647Z RuntimeError: CUDA error: invalid device ordinal
2021-09-22T22:11:48.8118578Z   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py", line 204, in _run_function
2021-09-22T22:11:48.8119531Z     result = python_udf.func(*python_udf.args, **python_udf.kwargs)
2021-09-22T22:11:48.8120681Z   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/nn/api/remote_module.py", line 87, in _create_module
2021-09-22T22:11:48.8121509Z     module.to(device)
2021-09-22T22:11:48.8122383Z   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 899, in to
2021-09-22T22:11:48.8123111Z     return self._apply(convert)
2021-09-22T22:11:48.8124188Z   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 593, in _apply
2021-09-22T22:11:48.8124916Z     param_applied = fn(param)
2021-09-22T22:11:48.8125863Z   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 897, in convert
2021-09-22T22:11:48.8126841Z     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2021-09-22T22:11:48.8127647Z RuntimeError: CUDA error: invalid device ordinal
2021-09-22T22:11:48.8128622Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2021-09-22T22:11:48.8129582Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2021-09-22T22:11:48.8130023Z 
2021-09-22T22:11:48.8130250Z 
2021-09-22T22:11:48.9827804Z ok (1.713s)
2021-09-22T22:11:52.6990446Z   test_valid_device (__main__.TensorPipeCudaRemoteModuleTest) ... ok (3.716s)
2021-09-22T22:12:00.5306197Z   test_profiler_remote_cuda (__main__.TensorPipeCudaRpcTest) ... ok (7.831s)
2021-09-22T22:12:02.0433904Z   test_basic_gloo_ckpt_always (__main__.TensorPipePipeWithDDPTest) ... skip (1.512s)
2021-09-22T22:12:03.5552191Z   test_basic_gloo_ckpt_except_last (__main__.TensorPipePipeWithDDPTest) ... skip (1.512s)
2021-09-22T22:12:05.0668387Z   test_basic_gloo_ckpt_never (__main__.TensorPipePipeWithDDPTest) ... skip (1.511s)

See GitHub Actions build linux-bionic-py3.6-clang9 / test (default, 2, 2, linux.2xlarge) (3/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-09-22T20:50:30.8093853Z AssertionError: RuntimeError not raised
2021-09-22T20:50:30.8088801Z Traceback (most recent call last):
2021-09-22T20:50:30.8089393Z   File "/var/lib/jenkins/workspace/test/jit/test_tracer.py", line 244, in test_canonicalize_tensor_iterator
2021-09-22T20:50:30.8090354Z     self.assertTrue(str(traced.graph_for(x)).count(': int = prim::Constant') == 5)
2021-09-22T20:50:30.8090895Z AssertionError: False is not true
2021-09-22T20:50:30.8091234Z 		
2021-09-22T20:50:30.8091786Z ❌ Failure: jit.test_tracer.TestTracer.test_inplace_check
2021-09-22T20:50:30.8092155Z 
2021-09-22T20:50:30.8092473Z Traceback (most recent call last):
2021-09-22T20:50:30.8093018Z   File "/var/lib/jenkins/workspace/test/jit/test_tracer.py", line 342, in test_inplace_check
2021-09-22T20:50:30.8093471Z     ge(x)
2021-09-22T20:50:30.8093853Z AssertionError: RuntimeError not raised
2021-09-22T20:50:30.8094214Z 		
2021-09-22T20:50:30.8094932Z 🚨 ERROR: jit.test_freezing.TestMKLDNNReinplacing.test_always_alive_values
2021-09-22T20:50:30.8095470Z 
2021-09-22T20:50:30.8095776Z Traceback (most recent call last):
2021-09-22T20:50:30.8096347Z   File "/var/lib/jenkins/workspace/test/jit/test_freezing.py", line 2134, in test_always_alive_values
2021-09-22T20:50:30.8096935Z     self.checkResults(mod_eager, mod)
2021-09-22T20:50:30.8097530Z   File "/var/lib/jenkins/workspace/test/jit/test_freezing.py", line 2091, in checkResults
2021-09-22T20:50:30.8098125Z     self.assertEqual(mod1(inp), mod2(inp))
2021-09-22T20:50:30.8098860Z   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
2021-09-22T20:50:30.8099528Z     return forward_call(*input, **kwargs)

1 failure not recognized by patterns:

Job: GitHub Actions Lint / clang-format | Step: Run clang-format | Action: 🔁 rerun

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@facebook-github-bot

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

malfet referenced this pull request Sep 22, 2021
Summary:
Pull Request resolved: #64733

The previous implementation was wrong when CPU scheduling affinity is
set. In fact, it is still wrong if Ninja is not being used.

When there is CPU scheduling affinity set, the number of processors
available on the system likely exceeds the number of processors that
are usable to the build. We ought to use
`len(os.sched_getaffinity(0))` to determine the effective parallelism.

This change is more minimal and instead just delegates to Ninja (which
handles this correctly) when it is used.

Test Plan:
I verified this worked correctly using Ninja on a 96-core machine
with 24 cores available for scheduling by checking:
 * the cmake command did not specify "-j"
 * the number of top-level jobs in top/pstree never exceeded 26 (24 + 2)

And I verified we get the legacy behavior by specifying USE_NINJA=0 on
the build.

Reviewed By: jbschlosser, driazati

Differential Revision: D30968796

Pulled By: dagitses

fbshipit-source-id: 29547dd378fea793957bcc2f7d52d5def1ecace2
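
The parallelism logic described in that summary can be sketched roughly as follows. This is an illustrative sketch only, not the actual PyTorch build code; the helper names are made up, and os.sched_getaffinity is Linux-only.

import os

def effective_num_jobs():
    # Prefer the CPUs this process is actually allowed to run on, which can
    # be fewer than the machine total when a scheduling affinity mask is set.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # sched_getaffinity is Linux-only; fall back to the machine total.
        return os.cpu_count() or 1

def parallelism_args(max_jobs=None, use_ninja=True):
    # Hypothetical helper: if MAX_JOBS was set explicitly, honor it.
    # Otherwise, when Ninja drives the build, pass no -j at all and let
    # Ninja pick its own parallelism (it already respects the affinity
    # mask); only compute an explicit job count for non-Ninja builds.
    if max_jobs is not None:
        return ["-j", str(max_jobs)]
    if use_ninja:
        return None
    return ["-j", str(effective_num_jobs())]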
@malfet requested review from a team and dagitses September 22, 2021 03:25
@malfet added the module: build (Build system issues), module: regression (It used to work, and now it doesn't), and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Sep 22, 2021
codecov bot commented Sep 22, 2021

Codecov Report

Merging #65444 (fcbe48b) into master (64d3c73) will decrease coverage by 0.00%.
The diff coverage is n/a.

❗ Current head fcbe48b differs from pull request most recent head 708d689. Consider uploading reports for the commit 708d689 to get more accurate results

@@            Coverage Diff             @@
##           master   #65444      +/-   ##
==========================================
- Coverage   66.38%   66.37%   -0.01%     
==========================================
  Files         739      739              
  Lines       94295    94295              
==========================================
- Hits        62594    62593       -1     
- Misses      31701    31702       +1     

@malfet added this to the 1.10.0 milestone Sep 22, 2021
@dagitses left a comment


thanks for the fix!

(( None, False, False), ['-j', '13']), # noqa: E201,E241
(( '6', True, True), ['-j', '6']), # noqa: E201,E241
(( None, True, True), None), # noqa: E201,E241
(( '5', False, True), ['/p:CL_MPCount=5']), # noqa: E201,E241

nit: if you change this to "11" it will line up more nicely with the value below
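
For illustration, the case table above could be exercised against a stand-in helper like the one below. This is a hypothetical sketch, not the actual PyTorch test or implementation; build_args and the fixed cpu_count of 13 are assumptions chosen to match the expected values in the snippet.

def build_args(max_jobs, use_ninja, is_windows, cpu_count=13):
    # Hypothetical stand-in for the helper under review: maps
    # (MAX_JOBS, use_ninja, is_windows) to the extra build-tool arguments.
    if max_jobs is None and use_ninja:
        return None  # let Ninja pick its own parallelism
    jobs = max_jobs if max_jobs is not None else str(cpu_count)
    if is_windows and not use_ninja:
        return ["/p:CL_MPCount={}".format(jobs)]  # MSBuild-style flag
    return ["-j", jobs]

cases = [
    ((None, False, False), ["-j", "13"]),
    (("6", True, True), ["-j", "6"]),
    ((None, True, True), None),
    (("5", False, True), ["/p:CL_MPCount=5"]),
]
for (max_jobs, use_ninja, is_windows), expected in cases:
    assert build_args(max_jobs, use_ninja, is_windows) == expected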

@facebook-github-bot

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@malfet merged this pull request in 923f066.

malfet added a commit to malfet/pytorch that referenced this pull request Oct 5, 2021
Summary:
Reported by cloudhan in pytorch#64733 (comment)

Fixes regression introduced by pytorch@047e682

cc malfet seemethere

Pull Request resolved: pytorch#65444

Reviewed By: dagitses, seemethere

Differential Revision: D31103260

Pulled By: malfet

fbshipit-source-id: 9d5454a64cb8a0b96264119cf16582cc5afed284
malfet added a commit that referenced this pull request Oct 5, 2021
Summary:
Reported by cloudhan in #64733 (comment)

Fixes regression introduced by 047e682

cc malfet seemethere

Pull Request resolved: #65444

Reviewed By: dagitses, seemethere

Differential Revision: D31103260

Pulled By: malfet

fbshipit-source-id: 9d5454a64cb8a0b96264119cf16582cc5afed284
@malfet deleted the malfet-patch-4 branch January 5, 2022 14:42

Labels

cla signed, Merged, module: build, module: regression, triaged
