fix(snmp_zabbix): stabilize SNMP collection order #1337
Conversation
Pull Request Overview
This pull request stabilizes SNMP collection ordering by removing concurrent operations that could cause Request ID conflicts when reusing SNMP clients.
- Converted multi-OID walk discovery from concurrent to sequential execution to prevent Request ID collisions
- Made batch size configurable via MaxRepetitions with fallback to individual collection when batch size is 1
- Improved logging by making error logs conditional on DebugMode to reduce log spam
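As a concrete illustration of the last bullet, here is a minimal sketch of DebugMode-gated error logging. The `Config` and `Collector` shapes and the `logCollectError` helper are hypothetical stand-ins for this review, not the plugin's actual types.

```go
package sketch

import "log"

// Hypothetical config and collector shapes; the plugin's real types may differ.
type Config struct {
	DebugMode bool
}

type Collector struct {
	config Config
}

// logCollectError emits per-OID error logs only when DebugMode is enabled,
// matching the "error logs conditional on DebugMode" behavior described above.
func (c *Collector) logCollectError(oid string, err error) {
	if !c.config.DebugMode {
		return // keep normal runs quiet to reduce log spam
	}
	log.Printf("snmp_zabbix: collection failed for OID %s: %v", oid, err)
}
```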
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| inputs/snmp_zabbix/discovery.go | Refactored performMultiOidWalkDiscovery to execute SNMP walks sequentially instead of concurrently, preventing Request ID conflicts from client reuse |
| inputs/snmp_zabbix/collector.go | Made batch size configurable via MaxRepetitions, extracted processSingleResult method for code reuse, and made error logging conditional on debug mode to reduce verbosity |
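The collector.go row mentions an extracted `processSingleResult` helper for code reuse. Its real signature is not shown in this review, so the following is only a rough sketch of what such an extraction might look like; the `CollectionResult` fields and the standalone function shape are assumptions.

```go
package sketch

import (
	"fmt"

	"github.com/gosnmp/gosnmp"
)

// CollectionResult here is a hypothetical shape; the plugin's real struct may differ.
type CollectionResult struct {
	OID   string
	Value interface{}
	Error error
}

// processSingleResult converts one PDU into a CollectionResult, so the batched path
// and the batch-size-1 fallback can share the same conversion logic.
func processSingleResult(oid string, pdu gosnmp.SnmpPDU) CollectionResult {
	res := CollectionResult{OID: oid}
	switch pdu.Type {
	case gosnmp.NoSuchObject, gosnmp.NoSuchInstance:
		res.Error = fmt.Errorf("OID not found on device: %s", oid)
	default:
		res.Value = pdu
	}
	return res
}
```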
Comments suppressed due to low confidence (1)
inputs/snmp_zabbix/collector.go:302
- [nitpick] The error message includes the original `oid` parameter, which may have a leading dot, while the error is about a normalized OID lookup. For consistency and clarity in debugging, consider using `oidNoDot` in the error messages on lines 297 and 302, or indicate that the OID has been normalized.
results[i].Error = fmt.Errorf("OID not found on device: %s", oid)
} else {
results[i].Value = pdu
}
} else {
results[i].Error = fmt.Errorf("no response for OID: %s", oid)
	// New function handling the walk[...] syntax
	// performMultiOidWalkDiscovery handles the walk[...] syntax
	// Changed to sequential execution to prevent Request ID confusion caused by reusing the client
	func (d *DiscoveryEngine) performMultiOidWalkDiscovery(ctx context.Context, client *gosnmp.GoSNMP, discoveryOID string) ([][]gosnmp.SnmpPDU, error) {
Copilot AI (Nov 20, 2025)
The context parameter ctx is accepted but never used in this function. The function performs sequential SNMP walks that could potentially be long-running operations, but there's no mechanism to respect context cancellation. Consider checking ctx.Done() between iterations or passing the context to the SNMP operations if supported.
		return nil, fmt.Errorf("no OIDs found in walk[] directive: %s", discoveryOID)
	}

	var allPdus [][]gosnmp.SnmpPDU
Copilot AI (Nov 20, 2025)
The variable allPdus is initialized on line 209 but never used before being reassigned in the loop starting at line 229. Consider removing the initialization on line 209 and directly assigning allPdus in the loop, or clarify the intended logic.
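A hedged sketch of the suggested cleanup: declare the flattened slice where it is actually filled, rather than initializing `allPdus` early and leaving it unused. The `flattenWalkResults` helper is illustrative only; the real function does this in place.

```go
package sketch

import "github.com/gosnmp/gosnmp"

// flattenWalkResults declares the flattened slice right where it is populated,
// instead of leaving an unused allPdus declaration further up. Hypothetical helper.
func flattenWalkResults(results [][][]gosnmp.SnmpPDU) [][]gosnmp.SnmpPDU {
	allPdus := make([][]gosnmp.SnmpPDU, 0, len(results))
	for _, perOID := range results {
		// each perOID holds the PDU groups walked for one discovery OID
		allPdus = append(allPdus, perOID...)
	}
	return allPdus
}
```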
	}
	results[res.index] = append(results[res.index], res.pdus)

	results[i] = append(results[i], pdus)
Copilot AI (Nov 20, 2025)
The logic of wrapping `pdus` in another slice with `append(results[i], pdus)` and then unwrapping it later (lines 229-233) is unnecessarily complex. Since each iteration produces one `[]gosnmp.SnmpPDU`, you could simplify by storing `pdus` directly: `results[i] = pdus` on line 226, and then `allPdus = append(allPdus, pduGroup)` on line 231.
	var wg sync.WaitGroup
	resultChan := make(chan CollectionResult, len(items))
	// Use a buffered channel to prevent blocking
	resultChan := make(chan CollectionResult, len(items)+10)
Copilot AI (Nov 20, 2025)
[nitpick] The buffer size calculation len(items)+10 appears arbitrary. If this is meant to prevent blocking, consider using a more justified value or adding a comment explaining why "+10" is appropriate for this use case. Alternatively, consider making this configurable or calculating based on the number of agents.
Co-authored-by: Copilot <[email protected]>
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
	// Batch size
	batchSize := c.config.MaxRepetitions
	if batchSize <= 0 {
		batchSize = 10 // default value
	}
Copilot AI (Nov 20, 2025)
MaxRepetitions is an SNMP protocol parameter that specifies how many variable bindings should be returned per GetBulk request, not how many OIDs should be batched together. Using it as the batch size for collector batching is a semantic mismatch.
The original hardcoded value of 60 may have been chosen specifically for batching operations. If configurability is desired, consider:
- Adding a separate `BatchSize` configuration parameter for collector batching
- Using a different default value if `MaxRepetitions` was intended for a different purpose
- Documenting why `MaxRepetitions` is being repurposed for batching, if this is intentional
Note: The default value of 10 when MaxRepetitions <= 0 is significantly smaller than the original 60, which could impact collection performance.
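A hedged sketch of the separate-BatchSize option, keeping `MaxRepetitions` strictly as the SNMP GetBulk parameter. The field names and the fallback default of 60 follow the discussion above but are assumptions, not the plugin's actual config.

```go
package sketch

// CollectorConfig is a hypothetical config shape for illustration only.
type CollectorConfig struct {
	MaxRepetitions int // SNMP GetBulk max-repetitions, protocol-level setting
	BatchSize      int // number of OIDs grouped per collector batch
}

// effectiveBatchSize falls back to the previously hardcoded batch size of 60
// when BatchSize is unset, leaving MaxRepetitions untouched.
func (c CollectorConfig) effectiveBatchSize() int {
	if c.BatchSize > 0 {
		return c.BatchSize
	}
	return 60
}
```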
	if res.err != nil {
		// If any single walk fails, the whole discovery fails
		return nil, fmt.Errorf("SNMP walk failed for OID %s: %w", oidStrings[res.index], res.err)
	// 这里不需要 acquire 锁,因为上层 performSNMPDiscovery 已经持有锁了
Copilot AI (Nov 20, 2025)
[nitpick] The comment states "这里不需要 acquire 锁,因为上层 performSNMPDiscovery 已经持有锁了" (no need to acquire lock here because the upper layer performSNMPDiscovery already holds the lock).
While this is factually correct and explains why no lock acquisition is needed here, this comment is documenting the absence of code that was never present in the first place (even in the concurrent version that was removed). The comment might be more confusing than helpful, as it could make readers wonder if a lock was considered or removed.
Consider removing this comment or rephrasing it to focus on the sequential execution model, e.g., "顺序执行,复用上层的锁" (sequential execution, reusing the upper-layer lock).
Suggested change:
-	// 这里不需要 acquire 锁,因为上层 performSNMPDiscovery 已经持有锁了
+	// 顺序执行,复用上层的锁
	for i, oidStr := range oidStrings {
		oid := strings.TrimSpace(oidStr)
		if err := d.validateOID(oid); err != nil {
			return nil, fmt.Errorf("invalid OID '%s' in walk[]: %w", oid, err)
		}

-		wg.Add(1)
-		go func(oidToWalk string, idx int) {
-			defer wg.Done()
-			pdus, err := client.BulkWalkAll(oidToWalk)
-			pduChan <- struct {
-				pdus  []gosnmp.SnmpPDU
-				index int
-				err   error
-			}{pdus, idx, err}
-		}(oid, i)
-	}
-
-	wg.Wait()
-	close(pduChan)
-
-	// Collect results in order to match the preprocessing parameters
-	results := make([][][]gosnmp.SnmpPDU, len(oidStrings))
-	for res := range pduChan {
-		if res.err != nil {
-			// If any single walk fails, the whole discovery fails
-			return nil, fmt.Errorf("SNMP walk failed for OID %s: %w", oidStrings[res.index], res.err)
+		// 这里不需要 acquire 锁,因为上层 performSNMPDiscovery 已经持有锁了
+		pdus, err := client.BulkWalkAll(oid)
+		if err != nil {
+			return nil, fmt.Errorf("SNMP walk failed for OID %s: %w", oid, err)
		}
-		results[res.index] = append(results[res.index], res.pdus)
-
+		results[i] = append(results[i], pdus)
	}
Copilot AI (Nov 20, 2025)
The ctx context.Context parameter is not used in this function. Since the function performs sequential SNMP operations that could take a significant amount of time, it should respect the context for cancellation.
Consider adding context cancellation checks in the loop, e.g.:
for i, oidStr := range oidStrings {
select {
case <-ctx.Done():
return nil, fmt.Errorf("SNMP walk was canceled or timed out: %w", ctx.Err())
default:
}
// ... existing code
}

Note that performZabbixDependentDiscovery already implements proper context handling (lines 260-262), so this function should follow the same pattern for consistency.
	// 尝试多种形式的OID匹配(有没有前导点)
	oidNoDot := strings.TrimPrefix(oid, ".")

	if pdu, ok := valueMap[oidNoDot]; ok {
Copilot AI (Nov 20, 2025)
The comment states "尝试多种形式的OID匹配(有没有前导点)" (try multiple forms of OID matching with/without leading dot), but the implementation only tries one form - the OID without a leading dot.
The code normalizes both the response OIDs (line 280) and request OIDs (line 289) by removing leading dots, then only checks the normalized form. While this approach should work in most cases, it doesn't actually try "multiple forms" as the comment suggests.
Consider either:
- Updating the comment to reflect that both sides are normalized, or
- Implementing a fallback to try the original `oid` if `oidNoDot` doesn't match (though this may not be necessary if normalization is consistent)
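A hedged sketch of the fallback option above: check the normalized key first, then the original spelling. The `lookupPDU` helper and the assumption that `valueMap` may be keyed either way are illustrative only.

```go
package sketch

import (
	"strings"

	"github.com/gosnmp/gosnmp"
)

// lookupPDU tries the dot-stripped OID first and falls back to the original form,
// in case valueMap was keyed with a leading dot. Illustrative helper only.
func lookupPDU(valueMap map[string]gosnmp.SnmpPDU, oid string) (gosnmp.SnmpPDU, bool) {
	oidNoDot := strings.TrimPrefix(oid, ".")
	if pdu, ok := valueMap[oidNoDot]; ok {
		return pdu, true
	}
	pdu, ok := valueMap[oid]
	return pdu, ok
}
```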
+	results := make([][][]gosnmp.SnmpPDU, len(oidStrings))

	for i, oidStr := range oidStrings {
		oid := strings.TrimSpace(oidStr)
		if err := d.validateOID(oid); err != nil {
			return nil, fmt.Errorf("invalid OID '%s' in walk[]: %w", oid, err)
		}

-		wg.Add(1)
-		go func(oidToWalk string, idx int) {
-			defer wg.Done()
-			pdus, err := client.BulkWalkAll(oidToWalk)
-			pduChan <- struct {
-				pdus  []gosnmp.SnmpPDU
-				index int
-				err   error
-			}{pdus, idx, err}
-		}(oid, i)
-	}
-
-	wg.Wait()
-	close(pduChan)
-
-	// Collect results in order to match the preprocessing parameters
-	results := make([][][]gosnmp.SnmpPDU, len(oidStrings))
-	for res := range pduChan {
-		if res.err != nil {
-			// If any single walk fails, the whole discovery fails
-			return nil, fmt.Errorf("SNMP walk failed for OID %s: %w", oidStrings[res.index], res.err)
+		// 这里不需要 acquire 锁,因为上层 performSNMPDiscovery 已经持有锁了
+		pdus, err := client.BulkWalkAll(oid)
+		if err != nil {
+			return nil, fmt.Errorf("SNMP walk failed for OID %s: %w", oid, err)
		}
-		results[res.index] = append(results[res.index], res.pdus)
-
+		results[i] = append(results[i], pdus)
	}
Copilot AI (Nov 20, 2025)
The intermediate array structure `results := make([][][]gosnmp.SnmpPDU, len(oidStrings))` combined with `results[i] = append(results[i], pdus)` creates a three-dimensional array where each element is a slice containing exactly one `[]gosnmp.SnmpPDU`.
This can be simplified to:

	results := make([][]gosnmp.SnmpPDU, len(oidStrings))
	// ...
	results[i] = pdus

Then the final loop becomes:

	allPdus = results
	return allPdus, nil

This simplification makes the code more maintainable and easier to understand, since the extra nesting level serves no purpose in the sequential execution model.
	var wg sync.WaitGroup
	resultChan := make(chan CollectionResult, len(items))
	// Use a buffered channel to prevent blocking
	resultChan := make(chan CollectionResult, len(items)+10)
Copilot AI (Nov 20, 2025)
[nitpick] The buffer size is set to len(items)+10, but the rationale for adding exactly 10 extra slots is unclear.
Since the number of results that will be sent to the channel should equal the number of items (one CollectionResult per item), the buffer size could simply be len(items). The additional 10 slots don't provide a clear benefit and make the buffer sizing logic arbitrary.
Consider either:
- Using exactly `len(items)` as the buffer size, or
- Adding a comment explaining why the extra 10 slots are needed (e.g., if there's a specific edge case being handled)
Suggested change:
-	resultChan := make(chan CollectionResult, len(items)+10)
+	resultChan := make(chan CollectionResult, len(items))
No description provided.