Skip to content

Commit 98231a0

Browse files
CopilotbaronfelCopilot
authored
Detect and correct worker node over-provisioning (#13220)
## Worker Node Over-Provisioning Detection Implementing feature to detect and correct over-provisioning of worker nodes when multiple MSBuild instances run concurrently. ### Status: Ready for Review ✅ Successfully rebased on the **latest version** of PR #13256 as requested. ### Recent Updates **Latest Rebase (commit 420cfb9):** - Rebased all work on updated PR #13256 - Integration verified with latest improvements: - Enhanced macOS/BSD sysctl implementation with ArrayPool - Updated `TryGetCommandLine` API - `NodeModeHelper.ExtractFromCommandLine` integration - Improved tracing and diagnostics - Build succeeds ✅ - All 8 unit tests pass ✅ ### Implementation Details **Core Implementation:** - `src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs` (+114 lines) - `ShutdownConnectedNodes`: Modified to use per-node reuse decisions - `DetermineNodesForReuse`: Core logic for selective node reuse - `GetNodeReuseThreshold`: Virtual method (default: NUM_PROCS/2) - `CountSystemWideActiveNodes`: Uses improved detection from PR #13256 - Calls `GetPossibleRunningNodes(expectedNodeMode: NodeMode.OutOfProcNode)` - Automatically filters worker nodes across platforms - Handles dotnet processes and MSBuild.dll filtering - Leverages cross-platform process command line retrieval **Tests:** - `src/Build.UnitTests/BackEnd/NodeProviderOutOfProc_Tests.cs` (+169 lines, new file) - 8 comprehensive unit tests covering all scenarios - All tests passing ✅ ### How It Works When a build completes and `ShutdownConnectedNodes(enableReuse=true)` is called: 1. **Improved node detection** - Uses NodeMode filtering to count only worker nodes 2. **Cross-platform support** - Handles MSBuild.exe and dotnet processes correctly 3. **Threshold comparison** - NUM_PROCS/2 for worker nodes 4. **Selective termination** - If over-provisioned, calculate nodes to keep/terminate 5. Send appropriate `NodeBuildComplete` packets per node ### Benefits of PR #13256 Integration The improved node detection from PR #13256 (latest version) provides: - ✅ **Accurate worker node counting** - Filters by NodeMode instead of process name alone - ✅ **Dotnet process handling** - Correctly identifies dotnet processes running MSBuild.dll - ✅ **Cross-platform reliability** - Enhanced process command line retrieval (Windows/macOS/Linux/BSD) - ✅ **Better performance** - Uses ArrayPool and span-based operations for macOS/BSD - ✅ **Improved diagnostics** - Detailed tracing for troubleshooting ### Customer Impact Reduces resource consumption in scenarios with concurrent builds (DevKit, LLM agents, CI/CD pipelines). System will maintain reasonable node count (NUM_PROCS/2) instead of accumulating 25+ lingering nodes. Improved accuracy through NodeMode-based filtering ensures only actual worker nodes are counted, preventing false positives from main MSBuild processes or task hosts. ### Thresholds - ✅ **Worker nodes:** NUM_PROCS/2 (implemented with improved detection) - 🔜 **Server nodes:** > 0 (future work - separate lifecycle) - 🔜 **RAR nodes:** > 0 (future work - separate lifecycle) <!-- START COPILOT ORIGINAL PROMPT --> <details> <summary>Original prompt</summary> > > ---- > > *This section details on the original issue you should resolve* > > <issue_title>Worker Nodes should attempt to detect and correct over-provisioning</issue_title> > <issue_description>### Summary > > When multiple instances of MSBuild are run concurrently from the command line, each instance tries to create and claim a number of worker nodes up to the `/m` limit (which is NUMPROCS for `dotnet` by default). > > This can lead to entirely too many nodes existing on the host machine, lingering for the configured timeout, which is ~15m by default. > > These nodes suck up machine resources and look bad - the engine should work to minimize the number of active nodes past an expected threshold to preserve system resources. > > ### Background and Motivation > > As more tools, like DevKit, LLM Agents, and human users, delegate their build and inner-loop experiences to the `dotnet` CLI, it becomes more and more likely that collisions like the above will occur. I was looking at node trace logs this morning where more than 25 worker nodes were lingering on the machine. Managing these nodes is a pain for users of all kinds - you either have to carefully sequence your build operations, apply `/nodereuse:false` which slows down subsequent builds, or dangerously and likely incompletely terminate processes. > > ### Proposed Feature > > Instead, the nodes themselves should clean themselves up. > > When a build completes and an out of proc node marks itself idle, it should attempt to handshake with its siblings. If there are a number of _active_ nodes of the same type equal to a threshold, the idling node should terminate. > > Proposed thresholds: > * Worker nodes: NUM_PROCS/2 > * Server Nodes: > 0 > * RAR-as-a-service-node: >0 > > ### Alternative Designs > > _No response_</issue_description> > > ## Comments on the Issue (you are @copilot in this section) > > <comments> > </comments> > </details> <!-- START COPILOT CODING AGENT SUFFIX --> - Fixes #13218 <!-- START COPILOT CODING AGENT TIPS --> --- 💬 We'd love your input! Share your thoughts on Copilot coding agent in our [2 minute survey](https://gh.io/copilot-coding-agent-survey). --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: Chet Husk <[email protected]> Co-authored-by: Chet Husk <[email protected]> Co-authored-by: Copilot <[email protected]>
1 parent ac21192 commit 98231a0

5 files changed

Lines changed: 351 additions & 7 deletions

File tree

Lines changed: 150 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,150 @@
1+
// Licensed to the .NET Foundation under one or more agreements.
2+
// The .NET Foundation licenses this file to you under the MIT license.
3+
4+
using System;
5+
using Microsoft.Build.BackEnd;
6+
using Microsoft.Build.Shared;
7+
using Shouldly;
8+
using Xunit;
9+
10+
#nullable disable
11+
12+
namespace Microsoft.Build.UnitTests.BackEnd
13+
{
14+
/// <summary>
15+
/// Tests for NodeProviderOutOfProc, specifically the node over-provisioning detection feature.
16+
/// </summary>
17+
public class NodeProviderOutOfProc_Tests
18+
{
19+
/// <summary>
20+
/// Test helper class to expose protected methods for testing.
21+
/// Uses configurable overrides for testing.
22+
/// </summary>
23+
private sealed class TestableNodeProviderOutOfProcBase : NodeProviderOutOfProcBase
24+
{
25+
private readonly int _systemWideNodeCount;
26+
private readonly int? _thresholdOverride;
27+
28+
public TestableNodeProviderOutOfProcBase(int systemWideNodeCount, int? thresholdOverride = null)
29+
{
30+
_systemWideNodeCount = systemWideNodeCount;
31+
_thresholdOverride = thresholdOverride;
32+
}
33+
34+
protected override int GetNodeReuseThreshold()
35+
{
36+
// If threshold is overridden, use it; otherwise use base implementation
37+
return _thresholdOverride ?? base.GetNodeReuseThreshold();
38+
}
39+
40+
protected override int CountSystemWideActiveNodes()
41+
{
42+
return _systemWideNodeCount;
43+
}
44+
45+
public bool[] TestDetermineNodesForReuse(int nodeCount, bool enableReuse)
46+
{
47+
return DetermineNodesForReuse(nodeCount, enableReuse);
48+
}
49+
50+
public int TestGetNodeReuseThreshold()
51+
{
52+
return GetNodeReuseThreshold();
53+
}
54+
}
55+
56+
[Fact]
57+
public void DetermineNodesForReuse_WhenReuseDisabled_AllNodesShouldTerminate()
58+
{
59+
var provider = new TestableNodeProviderOutOfProcBase(systemWideNodeCount: 10, thresholdOverride: 4);
60+
61+
bool[] result = provider.TestDetermineNodesForReuse(nodeCount: 3, enableReuse: false);
62+
63+
result.Length.ShouldBe(3);
64+
result.ShouldAllBe(shouldReuse => shouldReuse == false);
65+
}
66+
67+
[Fact]
68+
public void DetermineNodesForReuse_WhenThresholdIsZero_AllNodesShouldTerminate()
69+
{
70+
var provider = new TestableNodeProviderOutOfProcBase(systemWideNodeCount: 10, thresholdOverride: 0);
71+
72+
bool[] result = provider.TestDetermineNodesForReuse(nodeCount: 3, enableReuse: true);
73+
74+
result.Length.ShouldBe(3);
75+
result.ShouldAllBe(shouldReuse => shouldReuse == false);
76+
}
77+
78+
[Fact]
79+
public void DetermineNodesForReuse_WhenUnderThreshold_AllNodesShouldBeReused()
80+
{
81+
// System has 3 nodes total, threshold is 4, so we're under the limit
82+
var provider = new TestableNodeProviderOutOfProcBase(systemWideNodeCount: 3, thresholdOverride: 4);
83+
84+
bool[] result = provider.TestDetermineNodesForReuse(nodeCount: 3, enableReuse: true);
85+
86+
result.Length.ShouldBe(3);
87+
result.ShouldAllBe(shouldReuse => shouldReuse == true);
88+
}
89+
90+
[Fact]
91+
public void DetermineNodesForReuse_WhenAtThreshold_AllNodesShouldBeReused()
92+
{
93+
// System has 4 nodes total, threshold is 4, so we're at the limit
94+
var provider = new TestableNodeProviderOutOfProcBase(systemWideNodeCount: 4, thresholdOverride: 4);
95+
96+
bool[] result = provider.TestDetermineNodesForReuse(nodeCount: 4, enableReuse: true);
97+
98+
result.Length.ShouldBe(4);
99+
result.ShouldAllBe(shouldReuse => shouldReuse == true);
100+
}
101+
102+
[Fact]
103+
public void DetermineNodesForReuse_WhenOverThreshold_ExcessNodesShouldTerminate()
104+
{
105+
// System has 10 nodes total, threshold is 4
106+
// This instance has 3 nodes
107+
// We should keep 0 nodes from this instance (since 10 - 3 = 7, which is already > threshold)
108+
var provider = new TestableNodeProviderOutOfProcBase(systemWideNodeCount: 10, thresholdOverride: 4);
109+
110+
bool[] result = provider.TestDetermineNodesForReuse(nodeCount: 3, enableReuse: true);
111+
112+
result.Length.ShouldBe(3);
113+
result.ShouldAllBe(shouldReuse => shouldReuse == false);
114+
}
115+
116+
[Fact]
117+
public void DetermineNodesForReuse_WhenSlightlyOverThreshold_SomeNodesShouldBeReused()
118+
{
119+
// System has 6 nodes total, threshold is 4
120+
// This instance has 3 nodes
121+
// Other instances have 6 - 3 = 3 nodes
122+
// We need to reduce by 2 nodes to reach threshold
123+
// So we should keep 1 node from this instance
124+
var provider = new TestableNodeProviderOutOfProcBase(systemWideNodeCount: 6, thresholdOverride: 4);
125+
126+
bool[] result = provider.TestDetermineNodesForReuse(nodeCount: 3, enableReuse: true);
127+
128+
result.Length.ShouldBe(3);
129+
// First node should be reused, others should terminate
130+
result[0].ShouldBeTrue();
131+
result[1].ShouldBeFalse();
132+
result[2].ShouldBeFalse();
133+
}
134+
135+
[Fact]
136+
public void DetermineNodesForReuse_WithSingleNode_BehavesCorrectly()
137+
{
138+
// System has 5 nodes total, threshold is 4
139+
// This instance has 1 node
140+
// We're over threshold, but only by 1
141+
// We should terminate this node since others already meet threshold
142+
var provider = new TestableNodeProviderOutOfProcBase(systemWideNodeCount: 5, thresholdOverride: 4);
143+
144+
bool[] result = provider.TestDetermineNodesForReuse(nodeCount: 1, enableReuse: true);
145+
146+
result.Length.ShouldBe(1);
147+
result[0].ShouldBeFalse();
148+
}
149+
}
150+
}

src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs

Lines changed: 183 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -116,23 +116,39 @@ protected void ShutdownConnectedNodes(List<NodeContext> contextsToShutDown, bool
116116
!Console.IsInputRedirected &&
117117
Traits.Instance.EscapeHatches.EnsureStdOutForChildNodesIsPrimaryStdout;
118118

119+
// Determine which nodes should actually be reused based on system-wide node count
120+
bool[] shouldReuseNode = DetermineNodesForReuse(contextsToShutDown.Count, enableReuse);
121+
119122
Task[] waitForExitTasks = waitForExit && contextsToShutDown.Count > 0 ? new Task[contextsToShutDown.Count] : null;
120123
int i = 0;
124+
int contextIndex = 0;
121125
var loggingService = _componentHost.LoggingService;
122126
foreach (NodeContext nodeContext in contextsToShutDown)
123127
{
124128
if (nodeContext is null)
125129
{
130+
contextIndex++;
126131
continue;
127132
}
128-
nodeContext.SendData(new NodeBuildComplete(enableReuse));
129-
if (waitForExit)
133+
134+
// Use the per-node reuse decision
135+
bool reuseThisNode = shouldReuseNode[contextIndex++];
136+
nodeContext.SendData(new NodeBuildComplete(reuseThisNode));
137+
138+
if (!reuseThisNode || waitForExit)
130139
{
131-
waitForExitTasks[i++] = nodeContext.WaitForExitAsync(loggingService);
140+
if (i < (waitForExitTasks?.Length ?? 0))
141+
{
142+
waitForExitTasks[i++] = nodeContext.WaitForExitAsync(loggingService);
143+
}
132144
}
133145
}
134-
if (waitForExitTasks != null)
146+
if (waitForExitTasks != null && i > 0)
135147
{
148+
if (i < waitForExitTasks.Length)
149+
{
150+
Array.Resize(ref waitForExitTasks, i);
151+
}
136152
Task.WaitAll(waitForExitTasks);
137153
}
138154
}
@@ -511,6 +527,169 @@ private string GetProcessesToIgnoreKey(Handshake hostHandshake, int nodeProcessI
511527
#endif
512528
}
513529

530+
/// <summary>
531+
/// Determines which nodes should be reused based on system-wide node count to avoid over-provisioning.
532+
/// </summary>
533+
/// <param name="nodeCount">The number of nodes in this MSBuild instance</param>
534+
/// <param name="enableReuse">Whether reuse is enabled at all</param>
535+
/// <returns>Array indicating which nodes should be reused (true) or terminated (false)</returns>
536+
protected virtual bool[] DetermineNodesForReuse(int nodeCount, bool enableReuse)
537+
{
538+
bool[] shouldReuse = new bool[nodeCount];
539+
540+
// If reuse is disabled, no nodes should be reused
541+
if (!enableReuse)
542+
{
543+
return shouldReuse; // All false
544+
}
545+
546+
// Get threshold for this node type
547+
int maxNodesToKeep = GetNodeReuseThreshold();
548+
549+
// If threshold is 0, terminate all nodes in this instance
550+
if (maxNodesToKeep == 0)
551+
{
552+
CommunicationsUtilities.Trace("Node reuse threshold is 0, terminating all {0} nodes", nodeCount);
553+
return shouldReuse; // All false
554+
}
555+
556+
// Count system-wide active nodes of the same type
557+
int systemWideNodeCount = CountSystemWideActiveNodes();
558+
559+
CommunicationsUtilities.Trace("System-wide node count: {0}, threshold: {1}, this instance has: {2} nodes",
560+
systemWideNodeCount, maxNodesToKeep, nodeCount);
561+
562+
// If we're already under the threshold system-wide, keep all our nodes
563+
if (systemWideNodeCount <= maxNodesToKeep)
564+
{
565+
for (int i = 0; i < nodeCount; i++)
566+
{
567+
shouldReuse[i] = true;
568+
}
569+
return shouldReuse;
570+
}
571+
572+
// We're over-provisioned. Determine how many of our nodes to keep.
573+
// Strategy: Keep nodes up to the threshold, terminate the rest.
574+
// This instance's contribution is limited to help reach the threshold.
575+
int nodesToKeepInThisInstance = Math.Max(0, maxNodesToKeep - (systemWideNodeCount - nodeCount));
576+
577+
CommunicationsUtilities.Trace("Keeping {0} of {1} nodes in this instance to help meet threshold of {2}",
578+
nodesToKeepInThisInstance, nodeCount, maxNodesToKeep);
579+
580+
// Mark the first N nodes for reuse
581+
for (int i = 0; i < Math.Min(nodesToKeepInThisInstance, nodeCount); i++)
582+
{
583+
shouldReuse[i] = true;
584+
}
585+
586+
return shouldReuse;
587+
}
588+
589+
/// <summary>
590+
/// Gets the maximum number of nodes of this type that should remain active system-wide.
591+
/// </summary>
592+
/// <returns>The threshold for node reuse</returns>
593+
protected virtual int GetNodeReuseThreshold()
594+
{
595+
// Default for worker nodes: 1.5 * NUM_PROCS - aka if there are more nodes than 1 build would create
596+
return Math.Max(1, (3 * NativeMethodsShared.GetLogicalCoreCount()) / 2);
597+
}
598+
599+
/// <summary>
600+
/// Counts the number of active MSBuild node processes of the same type system-wide.
601+
/// Uses improved node detection logic to filter by NodeMode and handle dotnet processes.
602+
/// </summary>
603+
/// <returns>The count of active node processes</returns>
604+
protected virtual int CountSystemWideActiveNodes()
605+
=> CountActiveNodesWithMode(NodeMode.OutOfProcNode);
606+
607+
/// <summary>
608+
/// Counts the number of active MSBuild processes running with the specified <see cref="NodeMode"/>.
609+
/// Includes the current process in the count if it matches.
610+
/// Used by out-of-proc nodes (e.g., server node) to detect over-provisioning at build completion.
611+
/// </summary>
612+
/// <param name="nodeMode">The node mode to filter for.</param>
613+
/// <returns>The number of matching processes, or 0 if enumeration fails or the feature wave is disabled.</returns>
614+
internal static int CountActiveNodesWithMode(NodeMode nodeMode)
615+
{
616+
try
617+
{
618+
(_, IList<Process> nodeProcesses) = GetPossibleRunningNodes(nodeMode);
619+
int count = nodeProcesses.Count;
620+
foreach (var process in nodeProcesses)
621+
{
622+
process?.Dispose();
623+
}
624+
return count;
625+
}
626+
catch (Exception ex)
627+
{
628+
CommunicationsUtilities.Trace("Error counting system-wide nodes with mode {0}: {1}", nodeMode, ex.Message);
629+
return 0;
630+
}
631+
}
632+
633+
private static (string expectedProcessName, IList<Process> nodeProcesses) GetPossibleRunningNodes(NodeMode? expectedNodeMode)
634+
{
635+
string msbuildLocation = Constants.MSBuildExecutableName;
636+
var expectedProcessName = Path.GetFileNameWithoutExtension(CurrentHost.GetCurrentHost() ?? msbuildLocation);
637+
638+
Process[] processes;
639+
try
640+
{
641+
processes = Process.GetProcessesByName(expectedProcessName);
642+
}
643+
catch
644+
{
645+
return (expectedProcessName, Array.Empty<Process>());
646+
}
647+
648+
if (expectedNodeMode.HasValue && ChangeWaves.AreFeaturesEnabled(ChangeWaves.Wave18_5))
649+
{
650+
List<Process> filteredProcesses = [];
651+
bool isDotnetProcess = expectedProcessName.Equals(Path.GetFileNameWithoutExtension(Constants.DotnetProcessName), StringComparison.OrdinalIgnoreCase);
652+
653+
foreach (var process in processes)
654+
{
655+
try
656+
{
657+
if (!process.TryGetCommandLine(out string commandLine))
658+
{
659+
continue;
660+
}
661+
662+
if (commandLine is null)
663+
{
664+
filteredProcesses.Add(process);
665+
continue;
666+
}
667+
668+
if (isDotnetProcess && !commandLine.Contains("MSBuild.dll", StringComparison.OrdinalIgnoreCase))
669+
{
670+
continue;
671+
}
672+
673+
NodeMode? processNodeMode = NodeModeHelper.ExtractFromCommandLine(commandLine);
674+
if (processNodeMode.HasValue && processNodeMode.Value == expectedNodeMode.Value)
675+
{
676+
filteredProcesses.Add(process);
677+
}
678+
}
679+
catch
680+
{
681+
continue;
682+
}
683+
}
684+
685+
filteredProcesses.Sort((left, right) => left.Id.CompareTo(right.Id));
686+
return (expectedProcessName, filteredProcesses);
687+
}
688+
689+
Array.Sort(processes, (left, right) => left.Id.CompareTo(right.Id));
690+
return (expectedProcessName, processes);
691+
}
692+
514693
#if !FEATURE_PIPEOPTIONS_CURRENTUSERONLY
515694
// This code needs to be in a separate method so that we don't try (and fail) to load the Windows-only APIs when JIT-ing the code
516695
// on non-Windows operating systems

src/Build/BackEnd/Node/OutOfProcServerNode.cs

Lines changed: 16 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -319,7 +319,22 @@ private void HandlePacket(INodePacket packet)
319319
/// <param name="buildComplete"></param>
320320
private void HandleServerShutdownCommand(NodeBuildComplete buildComplete)
321321
{
322-
_shutdownReason = buildComplete.PrepareForReuse ? NodeEngineShutdownReason.BuildCompleteReuse : NodeEngineShutdownReason.BuildComplete;
322+
bool shouldReuse = buildComplete.PrepareForReuse;
323+
324+
if (shouldReuse)
325+
{
326+
// Self-terminate if another server node is already running system-wide.
327+
// Threshold is 1: only one server node should be active per handshake.
328+
// If another is running (count > 1, since we count ourselves), exit to avoid over-provisioning.
329+
int serverNodeCount = NodeProviderOutOfProcBase.CountActiveNodesWithMode(NodeMode.OutOfProcServerNode);
330+
if (serverNodeCount > 1)
331+
{
332+
CommunicationsUtilities.Trace("Terminating server node due to over-provisioning: {0} server nodes found system-wide.", serverNodeCount);
333+
shouldReuse = false;
334+
}
335+
}
336+
337+
_shutdownReason = shouldReuse ? NodeEngineShutdownReason.BuildCompleteReuse : NodeEngineShutdownReason.BuildComplete;
323338
_shutdownEvent.Set();
324339
}
325340

src/Shared/ProcessExtensions.cs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -49,7 +49,7 @@ public static void KillTree(this Process process, int timeoutMilliseconds)
4949
/// </summary>
5050
/// <param name="process">The process to get the command line for.</param>
5151
/// <param name="commandLine">The command line string, or null if it cannot be retrieved.</param>
52-
/// <returns>True if the command line was successfully retrieved or the current platform doesn't support retrieving command lines, false if there was an error retrieving the command line.</returns>
52+
/// <returns>True if the command line was successfully retrieved, false if there was an error or the platform doesn't support command line retrieval.</returns>
5353
public static bool TryGetCommandLine(this Process? process, out string? commandLine)
5454
{
5555
commandLine = null;

0 commit comments

Comments
 (0)