refactor(snmp_zabbix): migrate scheduler to Min-Heap with consistent hashing #1379
Conversation
Pull request overview
This PR refactors the SNMP Zabbix input plugin's scheduler from an interval-based ticker system to a Min-Heap-based event-loop architecture with consistent hashing, for better task distribution and resource efficiency.
Key Changes:
- Replaces multiple interval-based goroutines with a single event loop processing a priority queue (a rough sketch follows this list)
- Implements consistent hashing to scatter initial task execution times across intervals, preventing thundering herd issues
- Adds cycle detection and improved error handling in dependency tree construction
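As a rough, self-contained sketch of how those pieces fit together (illustrative type names, not the plugin's actual code): tasks live in a min-heap keyed by NextRun, a single loop pops the earliest one, and each agent's first run is offset by a deterministic hash of its name.

```go
package scheduler

import (
	"container/heap"
	"hash/fnv"
	"time"
)

// ScheduledTask and taskPQ are illustrative stand-ins for the plugin's types.
type ScheduledTask struct {
	Agent    string
	Interval time.Duration
	NextRun  time.Time
}

// taskPQ implements heap.Interface, ordered by NextRun (earliest first).
type taskPQ []*ScheduledTask

func (q taskPQ) Len() int           { return len(q) }
func (q taskPQ) Less(i, j int) bool { return q[i].NextRun.Before(q[j].NextRun) }
func (q taskPQ) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *taskPQ) Push(x any)        { *q = append(*q, x.(*ScheduledTask)) }
func (q *taskPQ) Pop() any {
	old := *q
	n := len(old)
	t := old[n-1]
	*q = old[:n-1]
	return t
}

// scatteredFirstRun hashes the agent name to a stable offset in [0, interval),
// so the same agent always lands on the same slot within its interval window.
func scatteredFirstRun(agent string, interval time.Duration) time.Time {
	h := fnv.New64a()
	h.Write([]byte(agent))
	offset := time.Duration(h.Sum64() % uint64(interval))
	return time.Now().Add(offset)
}

// runLoop is the single event loop: wait until the earliest task is due,
// execute it, then push it back with NextRun advanced by one interval.
// (Stop handling and locking are omitted for brevity.)
func runLoop(pq *taskPQ, execute func(*ScheduledTask)) {
	for pq.Len() > 0 {
		task := heap.Pop(pq).(*ScheduledTask)
		if d := time.Until(task.NextRun); d > 0 {
			time.Sleep(d)
		}
		execute(task)
		task.NextRun = task.NextRun.Add(task.Interval)
		heap.Push(pq, task)
	}
}
```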
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| inputs/snmp_zabbix/priority_queue.go | New file implementing heap.Interface for a ScheduledTask priority queue with NextRun as the priority key |
| inputs/snmp_zabbix/utils.go | Extracted jitter calculation utilities from scheduler.go into a separate file |
| inputs/snmp_zabbix/scheduler.go | Major refactoring: replaced interval-based scheduling with a heap-based event loop, added consistent hashing, simplified task grouping by (Agent, Interval) |
| inputs/snmp_zabbix/scheduler_test.go | New test file covering consistent hashing determinism, heap ordering, and differential LLD update logic |
| inputs/snmp_zabbix/snmp.go | Enhanced buildTree function with cycle detection and nil pointer checks for dependency resolution |
| inputs/snmp_zabbix/template.go | Replaced Chinese comments with English documentation; no functional changes |
| inputs/snmp_zabbix/template_macro_test.go | New test file for macro expansion robustness with various key formats |
inputs/snmp_zabbix/scheduler.go (Outdated)
```go
// Hash is deterministic.
// Offset is in [0, Interval)
// We cast to int64 (Duration) which is safe as interval is usually < 290 years
offset := time.Duration(h.Sum64() % uint64(interval))

now := time.Now()
// ...
```
Copilot AI · Dec 23, 2025
Incomplete comment. The comment on lines 215-216 ends with an ellipsis ("...") suggesting unfinished documentation. Complete this comment to explain the full logic, particularly around how the offset is calculated and why returning now.Add(offset) achieves the desired scattered scheduling.
Suggested change:
```diff
-// Hash is deterministic.
-// Offset is in [0, Interval)
-// We cast to int64 (Duration) which is safe as interval is usually < 290 years
-offset := time.Duration(h.Sum64() % uint64(interval))
-now := time.Now()
-// ...
+// Hash is deterministic for a given agent string, so the same agent will always
+// produce the same offset value.
+// The modulus with interval keeps the offset in [0, interval), effectively placing
+// the agent at a stable position within the interval window.
+// We cast to time.Duration (backed by int64 nanoseconds), which is safe as
+// interval is expected to be well within the valid duration range.
+offset := time.Duration(h.Sum64() % uint64(interval))
+now := time.Now()
+// By adding the deterministic offset to "now", each agent's first execution is
+// delayed by a consistent amount within the interval, scattering start times
+// across agents and avoiding a thundering herd while still being reproducible
+// across restarts for the same agent and interval.
```
```go
// 2. Scattering check
// Generate 100 agents, check if their offsets are distributed
// agents := make([]string, 100) (unused)
```
Copilot AI · Dec 23, 2025
The test comment references "unused" for the agents variable on line 35, but this is misleading: the variable was never declared in the first place. It appears to be leftover from refactoring. Consider removing the comment entirely, as it adds no value and creates confusion.
Suggested change:
```diff
-// agents := make([]string, 100) (unused)
```
inputs/snmp_zabbix/scheduler.go (Outdated)
```go
k := s.taskKey(newItem.Agent, newItem.Delay)
s.addItemToTask(k, newItem)
```
Copilot AI · Dec 23, 2025
Similar redundancy on line 318: `k := s.taskKey(newItem.Agent, newItem.Delay)` recalculates the same value as `targetKey` from line 279. Use the existing `targetKey` variable instead.
```go
// But we should use >> 1 unique slots.
if len(uniqueOffsets) < 10 {
	t.Errorf("Expected distributed offsets, but got only %d unique seconds slots out of 100 agents", len(uniqueOffsets))
```
Copilot AI · Dec 23, 2025
Magic number without explanation. The test expects at least 10 unique second slots out of 100 agents (line 60), but this threshold appears arbitrary. With consistent hashing over a 60-second interval and 100 agents, the expected distribution should yield approximately 60 unique slots (with some collisions due to birthday paradox). A threshold of 10 is too lenient and might not catch real distribution problems. Consider using a more meaningful threshold (e.g., at least 40 unique slots) or documenting why 10 is appropriate.
Suggested change:
```diff
-// But we should use >> 1 unique slots.
-if len(uniqueOffsets) < 10 {
-	t.Errorf("Expected distributed offsets, but got only %d unique seconds slots out of 100 agents", len(uniqueOffsets))
+// We conservatively require a majority of slots to be used to catch bad distributions without being too strict.
+const minUniqueOffsetSlots = 40
+if len(uniqueOffsets) < minUniqueOffsetSlots {
+	t.Errorf("Expected distributed offsets, but got only %d unique seconds slots out of 100 agents (minimum expected: %d)", len(uniqueOffsets), minUniqueOffsetSlots)
```
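For calibration: hashing 100 agents uniformly into 60 one-second slots occupies about 60 · (1 − (59/60)^100) ≈ 49 slots in expectation, so the suggested threshold of 40 sits safely below the mean while still catching badly clustered distributions.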
```go
now := time.Now()

// Add 3 tasks with different delays (which result in diff NextRun)
// We manually inject for test
t1 := &ScheduledTask{Agent: "A", NextRun: now.Add(10 * time.Second)}
t2 := &ScheduledTask{Agent: "B", NextRun: now.Add(5 * time.Second)}
t3 := &ScheduledTask{Agent: "C", NextRun: now.Add(15 * time.Second)}

// Use heap.Push to maintain invariant
heap.Push(&s.pq, t1)
heap.Push(&s.pq, t2)
heap.Push(&s.pq, t3)

// Pop order should be: B (5s), A (10s), C (15s)

p1 := heap.Pop(&s.pq).(*ScheduledTask)
if p1.Agent != "B" {
	t.Errorf("Expected first pop to be B, got %s", p1.Agent)
}

p2 := heap.Pop(&s.pq).(*ScheduledTask)
if p2.Agent != "A" {
	t.Errorf("Expected second pop to be A, got %s", p2.Agent)
}
```
Copilot AI · Dec 23, 2025
Potential data race in the test. The test accesses s.pq directly (lines 77-79) without acquiring the scheduler's mutex. While this is a test environment, if the scheduler design expects concurrent access protection, the test should either respect that pattern or document why it's safe to skip locking here. This could also serve as documentation for whether the heap operations themselves are thread-safe or require external synchronization.
Suggested change:
```diff
 now := time.Now()
 // Add 3 tasks with different delays (which result in diff NextRun)
 // We manually inject for test
 t1 := &ScheduledTask{Agent: "A", NextRun: now.Add(10 * time.Second)}
 t2 := &ScheduledTask{Agent: "B", NextRun: now.Add(5 * time.Second)}
 t3 := &ScheduledTask{Agent: "C", NextRun: now.Add(15 * time.Second)}
+// All direct access to s.pq is done under the scheduler's mutex to
+// respect the same concurrency guarantees as production code.
+s.mu.Lock()
+defer s.mu.Unlock()
 // Use heap.Push to maintain invariant
 heap.Push(&s.pq, t1)
 heap.Push(&s.pq, t2)
 heap.Push(&s.pq, t3)
 // Pop order should be: B (5s), A (10s), C (15s)
 p1 := heap.Pop(&s.pq).(*ScheduledTask)
 if p1.Agent != "B" {
 	t.Errorf("Expected first pop to be B, got %s", p1.Agent)
 }
 p2 := heap.Pop(&s.pq).(*ScheduledTask)
 if p2.Agent != "A" {
 	t.Errorf("Expected second pop to be A, got %s", p2.Agent)
 }
```
inputs/snmp_zabbix/scheduler.go (Outdated)
```go
}

func NewItemScheduler(collector *SNMPCollector, labelCache *LabelCache) *ItemScheduler {
	rand.Seed(time.Now().UnixNano())
```
Copilot AI · Dec 23, 2025
The `rand.Seed()` call on line 52 uses the deprecated `math/rand` global seed pattern. In Go 1.20+, the global random generator is automatically seeded. If supporting older Go versions, this is fine, but if the minimum version is 1.20 or higher, this line can be removed. Additionally, using `math/rand` in the jitter function (utils.go line 32) without seeding per-goroutine could lead to contention on the global lock. Consider using `rand.New(rand.NewSource(time.Now().UnixNano()))` for per-instance random generators if concurrent performance matters.
Suggested change:
```diff
-rand.Seed(time.Now().UnixNano())
```
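If per-instance generators are adopted, a minimal sketch could look like the following. The `rng` field and `jitter` helper are illustrative names, not the plugin's actual API; note that a `*rand.Rand` is not safe for concurrent use and must be confined to one goroutine or guarded by a lock.

```go
package scheduler

import (
	"math/rand"
	"time"
)

// ItemScheduler sketch: owns a private RNG instead of the global math/rand
// source, avoiding contention on the global generator's lock.
type ItemScheduler struct {
	rng *rand.Rand // confined to the scheduler's event-loop goroutine
}

func NewItemScheduler() *ItemScheduler {
	return &ItemScheduler{
		// Per-instance source; no rand.Seed needed (and the global generator
		// is auto-seeded since Go 1.20 anyway).
		rng: rand.New(rand.NewSource(time.Now().UnixNano())),
	}
}

// jitter returns a random duration in [0, max).
func (s *ItemScheduler) jitter(max time.Duration) time.Duration {
	if max <= 0 {
		return 0
	}
	return time.Duration(s.rng.Int63n(int64(max)))
}
```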
inputs/snmp_zabbix/scheduler.go (Outdated)
```diff
 // Actually it was local in this file.
 func copyTags(src map[string]string) map[string]string {
-	dst := make(map[string]string, len(src)+1)
+	dst := make(map[string]string, len(src))
```
Copilot AI · Dec 23, 2025
The copyTags function allocates a map with len(src) capacity, but the comment on lines 482-483 suggests it might need to accommodate an additional key. If tags are frequently copied and then modified (as seen on lines 467-469 and 474-476, where action is added), consider using len(src)+1 for the capacity to avoid potential reallocation.
Suggested change:
```diff
-dst := make(map[string]string, len(src))
+dst := make(map[string]string, len(src)+1)
```
inputs/snmp_zabbix/scheduler.go (Outdated)
```go
now := time.Now()
// ...
return now.Add(offset)
```
Copilot AI · Dec 23, 2025
The consistent hashing implementation has a logic flaw. On line 217, the function returns now.Add(offset), which means the next run time is always in the future by offset amount. However, for proper scheduling, when a task group is first created, it should check if adding the offset results in a time still in the past (e.g., if offset is 5 seconds but initialization happens at second 50 of a minute with 60s interval). The scheduler should advance to the next interval cycle if needed. This could cause tasks to run immediately instead of at their intended scattered time slots.
Suggested change:
```diff
-now := time.Now()
-// ...
-return now.Add(offset)
+now := time.Now()
+// Align to the start of the current interval, then add the offset.
+// If that time is already in the past, advance to the next interval.
+base := now.Truncate(interval)
+nextRun := base.Add(offset)
+if nextRun.Before(now) {
+	nextRun = nextRun.Add(interval)
+}
+return nextRun
```
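To make the suggested alignment concrete with made-up numbers: for interval = 60s and offset = 5s, initializing at 12:00:50 truncates base to 12:00:00, so base.Add(offset) = 12:00:05 is already in the past and the task is scheduled for 12:01:05, landing on the agent's stable second-5 slot; the original now.Add(offset) would have fired at 12:00:55, drifting off the intended slot on every restart.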
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.
inputs/snmp_zabbix/scheduler.go (Outdated)
```go
}

func NewItemScheduler(collector *SNMPCollector, labelCache *LabelCache) *ItemScheduler {
	rand.Seed(time.Now().UnixNano())
```
Copilot AI · Dec 24, 2025
The rand.Seed() call is deprecated in Go 1.20+ and no longer necessary. The global random number generator is automatically seeded. Additionally, using math/rand without proper synchronization can lead to race conditions when accessed concurrently from multiple goroutines.
inputs/snmp_zabbix/scheduler.go (Outdated)
```go
// Catchup logic: if mostly lagging, don't run immediately multiple times, just skip intervals?
// For SNMP, skipping intervals is better than burst.
// If NextRun is still in past after adding interval, add more until it is in future or close to now.
if task.NextRun.Before(now) {
	// Calculate how many intervals we missed
	// (Now - NextRun) / Interval
	diff := now.Sub(task.NextRun)
	cycles := diff / task.Interval
	if cycles > 0 {
		task.NextRun = task.NextRun.Add(cycles * task.Interval)
	}
	// One more check
	if task.NextRun.Before(now) {
		task.NextRun = task.NextRun.Add(task.Interval)
	}
}
```
Copilot AI · Dec 24, 2025
The `heap.Fix()` call is used after manually updating `task.NextRun`, but the task was not popped from the heap first. This is the correct usage of `heap.Fix()` when updating an element in place. However, there's a subtle issue: if the scheduler is significantly lagging and multiple intervals are skipped (lines 152-164), the catchup logic might not work correctly in all edge cases. Specifically, if the `diff / task.Interval` calculation results in a very large number of cycles, adding `cycles * task.Interval` could overflow or result in incorrect scheduling.
Suggested change:
```diff
-// Catchup logic: if mostly lagging, don't run immediately multiple times, just skip intervals?
-// For SNMP, skipping intervals is better than burst.
-// If NextRun is still in past after adding interval, add more until it is in future or close to now.
-if task.NextRun.Before(now) {
-	// Calculate how many intervals we missed
-	// (Now - NextRun) / Interval
-	diff := now.Sub(task.NextRun)
-	cycles := diff / task.Interval
-	if cycles > 0 {
-		task.NextRun = task.NextRun.Add(cycles * task.Interval)
-	}
-	// One more check
-	if task.NextRun.Before(now) {
-		task.NextRun = task.NextRun.Add(task.Interval)
-	}
+// Catchup logic: if heavily lagging, skip missed intervals rather than executing in a burst.
+// Use a bounded loop to avoid overflowing time.Duration when many intervals were missed.
+if task.Interval > 0 {
+	const maxCatchupCycles = 1000
+	// If NextRun is still in the past after adding one interval, advance in steps.
+	for i := 0; i < maxCatchupCycles && task.NextRun.Before(now); i++ {
+		task.NextRun = task.NextRun.Add(task.Interval)
+	}
+	// If we're still behind after maxCatchupCycles, schedule relative to "now"
+	// to avoid extreme catchup loops and potential overflow.
+	if task.NextRun.Before(now) {
+		task.NextRun = now.Add(task.Interval)
+	}
+} else {
+	// Non-positive interval: fall back to scheduling at current time.
+	task.NextRun = now
```
inputs/snmp_zabbix/scheduler.go (Outdated)
```go
offset := time.Duration(h.Sum64() % uint64(interval))

now := time.Now()
// ...
```
Copilot AI · Dec 24, 2025
The ellipsis "..." at line 216 suggests an incomplete implementation. This appears to be leftover debug or placeholder code that should either be completed or removed.
Suggested change:
```diff
-// ...
```
```go
type MockCollector struct{}

func (m *MockCollector) CollectItems(ctx context.Context, items []MonitorItem, slist *types.SampleList) error {
	return nil
}
```
Copilot AI · Dec 24, 2025
The MockCollector struct and its method are defined at the package level but are only used in tests. This is unconventional - typically test helpers should be defined within the test file or in a separate _test.go file. While Go allows this, it exports test infrastructure unnecessarily. Consider moving this to a test helper function or using an interface type assertion within the test instead.
```go
// Let's verify the OFFSET is consistent. (We can't easily access private method output directly without exposing or copy-paste logic).
// But we can check stability.

if t1.Sub(t2).Abs() > time.Second {
```
Copilot AI · Dec 24, 2025
The test uses t1.Sub(t2).Abs() to check time difference, but time.Duration doesn't have an Abs() method in standard Go (it was added in Go 1.19+). If this code needs to support Go versions before 1.19, this will cause a compilation error. Consider using a manual absolute value check or documenting the minimum Go version requirement.
Suggested change:
```diff
-if t1.Sub(t2).Abs() > time.Second {
+diff := t1.Sub(t2)
+if diff < 0 {
+	diff = -diff
+}
+if diff > time.Second {
```
```go
if s.pq.Len() == 0 {
	s.mu.Unlock()
	time.Sleep(idleWait)
	continue
```
Copilot AI · Dec 24, 2025
The busy-wait loop with time.Sleep(idleWait) when the queue is empty is inefficient. Consider using a condition variable or channel-based signaling to wake up the goroutine when new tasks are added, rather than polling every second. This would reduce CPU usage and improve responsiveness when new tasks are added.
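A minimal sketch of such channel-based signaling, with illustrative names (the `wakeCh` field, `Add` method, and loop structure are assumptions, not the plugin's actual code): the loop sleeps on a timer armed for the head task's due time, and adding a task nudges it awake through a buffered channel.

```go
package scheduler

import (
	"container/heap"
	"context"
	"sync"
	"time"
)

type ScheduledTask struct{ NextRun time.Time }

// taskPQ is a min-heap on NextRun.
type taskPQ []*ScheduledTask

func (q taskPQ) Len() int           { return len(q) }
func (q taskPQ) Less(i, j int) bool { return q[i].NextRun.Before(q[j].NextRun) }
func (q taskPQ) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *taskPQ) Push(x any)        { *q = append(*q, x.(*ScheduledTask)) }
func (q *taskPQ) Pop() any          { old := *q; n := len(old); t := old[n-1]; *q = old[:n-1]; return t }

type ItemScheduler struct {
	mu     sync.Mutex
	pq     taskPQ
	wakeCh chan struct{} // buffered with capacity 1
}

// Add pushes a task and nudges the run loop instead of relying on polling.
func (s *ItemScheduler) Add(t *ScheduledTask) {
	s.mu.Lock()
	heap.Push(&s.pq, t)
	s.mu.Unlock()
	select {
	case s.wakeCh <- struct{}{}: // wake the loop
	default: // a wake-up is already pending; dropping is fine
	}
}

func (s *ItemScheduler) runLoop(ctx context.Context) {
	for {
		s.mu.Lock()
		wait := time.Minute // idle fallback; interrupted by wakeCh on Add
		if s.pq.Len() > 0 {
			wait = time.Until(s.pq[0].NextRun)
		}
		s.mu.Unlock()

		timer := time.NewTimer(wait)
		select {
		case <-ctx.Done():
			timer.Stop()
			return
		case <-s.wakeCh: // a task was added; recompute the wait
			timer.Stop()
		case <-timer.C:
			// head task is due: pop under the lock and dispatch (elided)
		}
	}
}
```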
```go
if len(uniqueOffsets) < 10 {
	t.Errorf("Expected distributed offsets, but got only %d unique seconds slots out of 100 agents", len(uniqueOffsets))
}
```
Copilot AI · Dec 24, 2025
The test assertions for task distribution are weak. The test only checks if at least 10 unique second-slots exist out of 100 agents with a 60-second interval, which is a very low bar. With proper hashing, you should expect much better distribution. Consider strengthening the assertion or adding statistical tests for distribution quality (e.g., checking if the distribution is reasonably uniform).
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
```go
if task.NextRun.Before(now) {
	diff := now.Sub(task.NextRun)
	cycles := diff / task.Interval
	if cycles > 0 {
		task.NextRun = task.NextRun.Add(cycles * task.Interval)
	}
	if task.NextRun.Before(now) {
		task.NextRun = task.NextRun.Add(task.Interval)
	}
```
Copilot AI · Dec 24, 2025
The logic for catching up after a missed execution window has a potential issue. When calculating cycles, if task.NextRun equals now, the diff is 0 and no cycles are added. However, the subsequent check if task.NextRun.Before(now) will still be false, so the task won't be advanced forward at all. This means the task will execute immediately again on the next iteration. Consider handling the equal case explicitly or ensuring NextRun is always advanced by at least one interval after execution.
Suggested change:
```diff
-if task.NextRun.Before(now) {
-	diff := now.Sub(task.NextRun)
-	cycles := diff / task.Interval
-	if cycles > 0 {
-		task.NextRun = task.NextRun.Add(cycles * task.Interval)
-	}
-	if task.NextRun.Before(now) {
-		task.NextRun = task.NextRun.Add(task.Interval)
-	}
+// Ensure NextRun is always moved forward by at least one interval
+// when the task is due or overdue (including the equal-to-now case).
+if !task.NextRun.After(now) {
+	diff := now.Sub(task.NextRun)
+	// Number of full intervals elapsed since NextRun, plus one to
+	// ensure the new NextRun is strictly after now.
+	cycles := diff/task.Interval + 1
+	task.NextRun = task.NextRun.Add(cycles * task.Interval)
```
```go
// Copy visited map to allow shared dependencies (DAG) while preventing cycles in the current path.
childVisited := make(map[string]bool, len(visited))
for k, v := range visited {
	childVisited[k] = v
}

childItem := buildTree(childKey, childVisited)
if childItem.Key != "" {
	item.DependentItems = append(item.DependentItems, childItem)
}
}
```
Copilot AI · Dec 24, 2025
The variable 'childVisited' is created by copying the parent's visited map to allow shared dependencies (DAG structure). However, this copy is made for every child iteration, which is inefficient. Since the visited map is meant to track the current path for cycle detection, a better approach would be to add the key before recursion and remove it after, using a single map. The current approach creates O(n²) map copies in the worst case for a deep tree with n nodes.
Suggested change:
```diff
-// Copy visited map to allow shared dependencies (DAG) while preventing cycles in the current path.
-childVisited := make(map[string]bool, len(visited))
-for k, v := range visited {
-	childVisited[k] = v
-}
-childItem := buildTree(childKey, childVisited)
-if childItem.Key != "" {
-	item.DependentItems = append(item.DependentItems, childItem)
-}
-}
+// Use a single visited map per root; mark/unmark keys to track the current path.
+childItem := buildTree(childKey, visited)
+if childItem.Key != "" {
+	item.DependentItems = append(item.DependentItems, childItem)
+}
+}
+// Remove key from current path to allow shared dependencies (DAG) without false cycles.
+delete(visited, key)
```
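For illustration, here is a self-contained version of that mark/unmark pattern over a toy dependency map (the `deps` variable and the error-returning signature are hypothetical, not the plugin's types):

```go
package deps

import "fmt"

// deps maps an item key to the keys it depends on (toy data for the sketch).
var deps = map[string][]string{
	"a": {"b", "c"},
	"b": {"c"}, // "c" is shared by "a" and "b": a DAG, not a cycle
	"c": {},
}

// buildTree walks dependencies depth-first. visited holds only the keys on
// the current path: mark before recursing, unmark after. Shared dependencies
// are visited more than once without triggering the cycle check, but a key
// that reappears while still on the path is a true cycle.
func buildTree(key string, visited map[string]bool) error {
	if visited[key] {
		return fmt.Errorf("cycle detected at %q", key)
	}
	visited[key] = true
	for _, child := range deps[key] {
		if err := buildTree(child, visited); err != nil {
			return err
		}
	}
	delete(visited, key) // backtrack: key leaves the current path
	return nil
}
```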
```go
log.Printf("E! cycle detected in item dependency: %s", key)
}
return *ptr
```
Copilot AI · Dec 24, 2025
When a cycle is detected, the function returns the pointer value without traversing its children. This means the returned item will have an empty DependentItems slice, which might not reflect the actual structure. If the cycle is part of a larger graph, this could lead to incomplete dependency trees. Consider logging more details about the cycle path or returning a sentinel value to indicate the cycle was detected and handled.
Suggested change:
```diff
-	log.Printf("E! cycle detected in item dependency: %s", key)
-}
-return *ptr
+	// Log detailed cycle path for easier debugging.
+	cyclePath := make([]string, 0, len(visited)+1)
+	for k := range visited {
+		cyclePath = append(cyclePath, k)
+	}
+	sort.Strings(cyclePath)
+	cyclePath = append(cyclePath, key)
+	log.Printf("E! cycle detected in item dependency path: %s", strings.Join(cyclePath, " -> "))
+}
+// Return a sentinel value so callers can detect and skip this branch.
+return MonitorItem{}
```
| "flashcat.cloud/categraf/types" | ||
| ) | ||
|
|
||
| func TestScheduler_CalcScatteredNextRun(t *testing.T) { |
Copilot AI · Dec 24, 2025
The test name 'TestScheduler_CalcScatteredNextRun' doesn't match the standard Go naming convention. Since 'calcScatteredNextRun' is a method on ItemScheduler, the test should be named 'TestItemScheduler_CalcScatteredNextRun' or 'TestItemScheduler_calcScatteredNextRun' for consistency with Go testing conventions.
Suggested change:
```diff
-func TestScheduler_CalcScatteredNextRun(t *testing.T) {
+func TestItemScheduler_CalcScatteredNextRun(t *testing.T) {
```
```go
}
}

func TestScheduler_HeapOrdering(t *testing.T) {
```
Copilot AI · Dec 24, 2025
The test name 'TestScheduler_HeapOrdering' should be 'TestItemScheduler_HeapOrdering' to match Go testing conventions, similar to the previous test naming issue.
Suggested change:
```diff
-func TestScheduler_HeapOrdering(t *testing.T) {
+func TestItemScheduler_HeapOrdering(t *testing.T) {
```
```go
s.labelCache.DeleteLabel(sch.Item.Agent, sch.Item.DiscoveryRuleKey, sch.Item.DiscoveryIndex, sch.Item.LabelKey)
if _, ok := newIdx[id]; !ok {
	// Lost Logic
	if deleteTTL == 0 { // Immediately delete
```
Copilot AI · Dec 24, 2025
The comparison using 'DurationImmediately' has been changed to compare against '0'. This assumes that DurationImmediately was previously defined as 0, but there's no visible definition in the diff. If DurationImmediately was a named constant with semantic meaning (e.g., a special sentinel value), replacing it with the magic number 0 reduces code clarity. Consider keeping the named constant or adding a comment explaining that 0 means immediate action.
```go
}
}

func TestScheduler_UpdateDiscoveredDiff_Logic(t *testing.T) {
```
Copilot AI · Dec 24, 2025
The test name 'TestScheduler_UpdateDiscoveredDiff_Logic' should be 'TestItemScheduler_UpdateDiscoveredDiff' or 'TestItemScheduler_UpdateDiscoveredDiffLogic' to match Go testing conventions.
Suggested change:
```diff
-func TestScheduler_UpdateDiscoveredDiff_Logic(t *testing.T) {
+func TestItemScheduler_UpdateDiscoveredDiffLogic(t *testing.T) {
```
```go
for i, it := range task.Items {
	if s.itemID(it) == s.itemID(sch.Item) {
		task.Items[i] = newItem
		return
```
Copilot AI · Dec 24, 2025
In updateItemInTask, after finding and updating the matching item, the function returns immediately. However, if there could be duplicate items with the same ID in the task.Items slice, only the first one would be updated. Consider adding documentation to clarify that items are unique within a task, or implement logic to handle potential duplicates if they're possible.
Suggested change:
```diff
-		return
+		// Do not return here so that all items with the same ID are updated.
```
```go
// Util function defined again to avoid dependency cycle if moved?
// Actually it was local in this file.
```
Copilot AI · Dec 24, 2025
The comment states this is a "Util function defined again to avoid dependency cycle if moved" but then questions itself with "Actually it was local in this file." This uncertain comment should be clarified or removed. If the function was always local to this file, the comment is confusing. If it's duplicated from elsewhere, that should be stated clearly and ideally the duplication should be resolved by placing the function in a shared utility file.
Suggested change:
```diff
-// Util function defined again to avoid dependency cycle if moved?
-// Actually it was local in this file.
+// copyTags returns a shallow copy of the given tags map.
```
```go
go s.runLoop(ctx)
go s.runMaintainLoop(ctx)
```
Copilot AI · Dec 24, 2025
The Start method launches goroutines (runLoop and runMaintainLoop) after releasing the mutex, but these goroutines immediately try to acquire the mutex. While this works, there's a subtle race: if Stop() is called immediately after Start() but before the goroutines begin, the goroutines might not see the stopCh closure. A safer pattern would be to check s.running within the goroutines after acquiring the lock for the first time, or to pass stopCh as a parameter to the goroutine functions to capture its value at the time of Start().
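One way to close that window, sketched here with illustrative names (the `running` flag and loop bodies are assumptions, not the plugin's actual code), is to capture the stop channel under the lock in Start() and hand that value to the goroutines, so they never observe a stale field:

```go
package scheduler

import "sync"

type ItemScheduler struct {
	mu      sync.Mutex
	running bool
	stopCh  chan struct{}
}

func (s *ItemScheduler) Start() {
	s.mu.Lock()
	if s.running {
		s.mu.Unlock()
		return
	}
	s.running = true
	stopCh := make(chan struct{})
	s.stopCh = stopCh
	s.mu.Unlock()

	// Pass the channel value captured under the lock: even if Stop() runs
	// before these goroutines are scheduled, they watch the right channel.
	go s.runLoop(stopCh)
	go s.runMaintainLoop(stopCh)
}

func (s *ItemScheduler) Stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.running {
		s.running = false
		close(s.stopCh)
	}
}

func (s *ItemScheduler) runLoop(stop <-chan struct{}) {
	// Event loop elided; exits when the captured stop channel closes.
	<-stop
}

func (s *ItemScheduler) runMaintainLoop(stop <-chan struct{}) {
	<-stop
}
```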