Conversation

@malfet commented Sep 22, 2021

Reported by @cloudhan in #64733 (comment)

Fixes regression introduced by 047e682

cc @malfet @seemethere

pytorch-probot bot commented Sep 22, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/708d689198ebd2fe0af4f72970bb7f2b808ecb34/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-bionic-py3.8-gcc9-coverage ciflow/all, ciflow/coverage, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
puretorch-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
win-vs2019-cuda10.2-py3 ciflow/all, ciflow/cuda, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

facebook-github-bot commented Sep 22, 2021

💊 CI failures summary and remediations

As of commit 708d689 (more details on the Dr. CI page):


  • 4/4 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-bionic-py3.6-clang9 / test (noarch, 1, 1, linux.2xlarge) (1/3)

Step: "Unknown" (full log | diagnosis details | 🔁 rerun)

2021-09-22T20:44:55.9301000Z test_add_done_ca...arg() takes 0 positional arguments but 1 was given
2021-09-22T20:44:55.9279997Z   /opt/conda/lib/python3.6/unittest/suite.py(122): run
2021-09-22T20:44:55.9280578Z   /opt/conda/lib/python3.6/unittest/suite.py(84): __call__
2021-09-22T20:44:55.9281267Z   /opt/conda/lib/python3.6/site-packages/xmlrunner/runner.py(66): run
2021-09-22T20:44:55.9281851Z   /opt/conda/lib/python3.6/unittest/main.py(256): runTests
2021-09-22T20:44:55.9282357Z   /opt/conda/lib/python3.6/unittest/main.py(95): __init__
2021-09-22T20:44:55.9283120Z   /opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py(605): run_tests
2021-09-22T20:44:55.9283707Z   test_futures.py(329): <module>
2021-09-22T20:44:55.9284151Z 
2021-09-22T20:44:55.9284421Z ok (0.002s)
2021-09-22T20:44:55.9295285Z   test_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.002s)
2021-09-22T20:44:55.9301000Z   test_add_done_callback_no_arg_error_is_ignored (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: TypeError: no_arg() takes 0 positional arguments but 1 was given
2021-09-22T20:44:55.9301858Z ok (0.001s)
2021-09-22T20:44:55.9310337Z   test_add_done_callback_simple (__main__.TestFuture) ... ok (0.001s)
2021-09-22T20:44:55.9336860Z   test_chained_then (__main__.TestFuture) ... ok (0.003s)
2021-09-22T20:44:56.0355961Z   test_collect_all (__main__.TestFuture) ... ok (0.102s)
2021-09-22T20:44:56.0362336Z   test_done (__main__.TestFuture) ... ok (0.001s)
2021-09-22T20:44:56.0373616Z   test_done_exception (__main__.TestFuture) ... ok (0.001s)
2021-09-22T20:44:56.0387337Z   test_interleaving_then_and_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.001s)
2021-09-22T20:44:56.0397146Z   test_interleaving_then_and_add_done_callback_propagates_error (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: ValueError: Expected error
2021-09-22T20:44:56.0397847Z 
2021-09-22T20:44:56.0398111Z At:

See GitHub Actions build linux-xenial-cuda11.3-py3.6-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) (2/3)

Step: "Unknown" (full log | diagnosis details | 🔁 rerun)

2021-09-22T22:11:48.8127647Z RuntimeError: CUDA error: invalid device ordinal
2021-09-22T22:11:48.8118578Z   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py", line 204, in _run_function
2021-09-22T22:11:48.8119531Z     result = python_udf.func(*python_udf.args, **python_udf.kwargs)
2021-09-22T22:11:48.8120681Z   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/nn/api/remote_module.py", line 87, in _create_module
2021-09-22T22:11:48.8121509Z     module.to(device)
2021-09-22T22:11:48.8122383Z   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 899, in to
2021-09-22T22:11:48.8123111Z     return self._apply(convert)
2021-09-22T22:11:48.8124188Z   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 593, in _apply
2021-09-22T22:11:48.8124916Z     param_applied = fn(param)
2021-09-22T22:11:48.8125863Z   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 897, in convert
2021-09-22T22:11:48.8126841Z     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2021-09-22T22:11:48.8127647Z RuntimeError: CUDA error: invalid device ordinal
2021-09-22T22:11:48.8128622Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2021-09-22T22:11:48.8129582Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2021-09-22T22:11:48.8130023Z 
2021-09-22T22:11:48.8130250Z 
2021-09-22T22:11:48.9827804Z ok (1.713s)
2021-09-22T22:11:52.6990446Z   test_valid_device (__main__.TensorPipeCudaRemoteModuleTest) ... ok (3.716s)
2021-09-22T22:12:00.5306197Z   test_profiler_remote_cuda (__main__.TensorPipeCudaRpcTest) ... ok (7.831s)
2021-09-22T22:12:02.0433904Z   test_basic_gloo_ckpt_always (__main__.TensorPipePipeWithDDPTest) ... skip (1.512s)
2021-09-22T22:12:03.5552191Z   test_basic_gloo_ckpt_except_last (__main__.TensorPipePipeWithDDPTest) ... skip (1.512s)
2021-09-22T22:12:05.0668387Z   test_basic_gloo_ckpt_never (__main__.TensorPipePipeWithDDPTest) ... skip (1.511s)

See GitHub Actions build linux-bionic-py3.6-clang9 / test (default, 2, 2, linux.2xlarge) (3/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-09-22T20:50:30.8093853Z AssertionError: RuntimeError not raised
2021-09-22T20:50:30.8088801Z Traceback (most recent call last):
2021-09-22T20:50:30.8089393Z   File "/var/lib/jenkins/workspace/test/jit/test_tracer.py", line 244, in test_canonicalize_tensor_iterator
2021-09-22T20:50:30.8090354Z     self.assertTrue(str(traced.graph_for(x)).count(': int = prim::Constant') == 5)
2021-09-22T20:50:30.8090895Z AssertionError: False is not true
2021-09-22T20:50:30.8091234Z 		
2021-09-22T20:50:30.8091786Z ❌ Failure: jit.test_tracer.TestTracer.test_inplace_check
2021-09-22T20:50:30.8092155Z 
2021-09-22T20:50:30.8092473Z Traceback (most recent call last):
2021-09-22T20:50:30.8093018Z   File "/var/lib/jenkins/workspace/test/jit/test_tracer.py", line 342, in test_inplace_check
2021-09-22T20:50:30.8093471Z     ge(x)
2021-09-22T20:50:30.8093853Z AssertionError: RuntimeError not raised
2021-09-22T20:50:30.8094214Z 		
2021-09-22T20:50:30.8094932Z 🚨 ERROR: jit.test_freezing.TestMKLDNNReinplacing.test_always_alive_values
2021-09-22T20:50:30.8095470Z 
2021-09-22T20:50:30.8095776Z Traceback (most recent call last):
2021-09-22T20:50:30.8096347Z   File "/var/lib/jenkins/workspace/test/jit/test_freezing.py", line 2134, in test_always_alive_values
2021-09-22T20:50:30.8096935Z     self.checkResults(mod_eager, mod)
2021-09-22T20:50:30.8097530Z   File "/var/lib/jenkins/workspace/test/jit/test_freezing.py", line 2091, in checkResults
2021-09-22T20:50:30.8098125Z     self.assertEqual(mod1(inp), mod2(inp))
2021-09-22T20:50:30.8098860Z   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
2021-09-22T20:50:30.8099528Z     return forward_call(*input, **kwargs)

1 failure not recognized by patterns:

Job: GitHub Actions Lint / clang-format | Step: Run clang-format | Action: 🔁 rerun

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@facebook-github-bot

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

malfet referenced this pull request Sep 22, 2021
Summary:
Pull Request resolved: #64733

The previous implementation was wrong when CPU scheduling affinity is
set. In fact, it is still wrong if Ninja is not being used.

When there is CPU scheduling affinity set, the number of processors
available on the system likely exceeds the number of processors that
are usable to the build. We ought to use
`len(os.sched_getaffinity(0))` to determine the effective parallelism.

This change is more minimal and instead just delegates to Ninja (which
handles this correctly) when it is used.

Test Plan:
I verified this worked correctly using Ninja on a 96-core machine
with 24 cores available for scheduling by checking:
 * the cmake command did not specify "-j"
 * the number of top-level jobs in top/pstree never exceeded 26 (24 + 2)

And I verified we get the legacy behavior by specifying USE_NINJA=0 on
the build.

Reviewed By: jbschlosser, driazati

Differential Revision: D30968796

Pulled By: dagitses

fbshipit-source-id: 29547dd378fea793957bcc2f7d52d5def1ecace2
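
The parallelism logic described in that summary can be sketched roughly as follows. This is an illustrative sketch only, not the actual PyTorch build code; the helper names are made up, and os.sched_getaffinity is Linux-only.

import os

def effective_num_jobs():
    # Prefer the CPUs this process is actually allowed to run on, which can
    # be fewer than the machine total when a scheduling affinity mask is set.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # sched_getaffinity is Linux-only; fall back to the machine total.
        return os.cpu_count() or 1

def parallelism_args(max_jobs=None, use_ninja=True):
    # Hypothetical helper: if MAX_JOBS was set explicitly, honor it.
    # Otherwise, when Ninja drives the build, pass no -j at all and let
    # Ninja pick its own parallelism (it already respects the affinity
    # mask); only compute an explicit job count for non-Ninja builds.
    if max_jobs is not None:
        return ["-j", str(max_jobs)]
    if use_ninja:
        return None
    return ["-j", str(effective_num_jobs())]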
@malfet requested review from a team and dagitses September 22, 2021 03:25
@malfet added the module: build (Build system issues), module: regression (It used to work, and now it doesn't), and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Sep 22, 2021
codecov bot commented Sep 22, 2021

Codecov Report

Merging #65444 (fcbe48b) into master (64d3c73) will decrease coverage by 0.00%.
The diff coverage is n/a.

❗ Current head fcbe48b differs from pull request most recent head 708d689. Consider uploading reports for the commit 708d689 to get more accurate results

@@            Coverage Diff             @@
##           master   #65444      +/-   ##
==========================================
- Coverage   66.38%   66.37%   -0.01%     
==========================================
  Files         739      739              
  Lines       94295    94295              
==========================================
- Hits        62594    62593       -1     
- Misses      31701    31702       +1     

@malfet added this to the 1.10.0 milestone Sep 22, 2021
@dagitses left a comment


thanks for the fix!

(( None, False, False), ['-j', '13']), # noqa: E201,E241
(( '6', True, True), ['-j', '6']), # noqa: E201,E241
(( None, True, True), None), # noqa: E201,E241
(( '5', False, True), ['/p:CL_MPCount=5']), # noqa: E201,E241

nit: if you change this to "11" it will line up more nicely with the value below
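
For illustration, the case table above could be exercised against a stand-in helper like the one below. This is a hypothetical sketch, not the actual PyTorch test or implementation; build_args and the fixed cpu_count of 13 are assumptions chosen to match the expected values in the snippet.

def build_args(max_jobs, use_ninja, is_windows, cpu_count=13):
    # Hypothetical stand-in for the helper under review: maps
    # (MAX_JOBS, use_ninja, is_windows) to the extra build-tool arguments.
    if max_jobs is None and use_ninja:
        return None  # let Ninja pick its own parallelism
    jobs = max_jobs if max_jobs is not None else str(cpu_count)
    if is_windows and not use_ninja:
        return ["/p:CL_MPCount={}".format(jobs)]  # MSBuild-style flag
    return ["-j", jobs]

cases = [
    ((None, False, False), ["-j", "13"]),
    (("6", True, True), ["-j", "6"]),
    ((None, True, True), None),
    (("5", False, True), ["/p:CL_MPCount=5"]),
]
for (max_jobs, use_ninja, is_windows), expected in cases:
    assert build_args(max_jobs, use_ninja, is_windows) == expected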

@facebook-github-bot

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@malfet merged this pull request in 923f066.

malfet added a commit to malfet/pytorch that referenced this pull request Oct 5, 2021
Summary:
Reported by cloudhan in pytorch#64733 (comment)

Fixes regression introduced by pytorch@047e682

cc malfet seemethere

Pull Request resolved: pytorch#65444

Reviewed By: dagitses, seemethere

Differential Revision: D31103260

Pulled By: malfet

fbshipit-source-id: 9d5454a64cb8a0b96264119cf16582cc5afed284
malfet added a commit that referenced this pull request Oct 5, 2021
Summary:
Reported by cloudhan in #64733 (comment)

Fixes regression introduced by 047e682

cc malfet seemethere

Pull Request resolved: #65444

Reviewed By: dagitses, seemethere

Differential Revision: D31103260

Pulled By: malfet

fbshipit-source-id: 9d5454a64cb8a0b96264119cf16582cc5afed284
@malfet deleted the malfet-patch-4 branch January 5, 2022 14:42

Labels

cla signed, Merged, module: build, module: regression, triaged
