Conversation
…closed to avoid a crash, during netsnmp_session close/abort due to timeout. RCA: The netsnmp_agent_session requests use netsmp_subtree objects that matches the associated variable name. The netsnmp_subtrees created through a netsnmp_session are tied to that session. When a subagent connection is closed/dropped due to timeout, all associated netsnmp_subtree objects are fred. Hence during a netsnmp_session close, all the delegated netsnmp_agent_sessions are scanned for requets that could be using netsnmp_subtree objects associated with this netsnmp_sesssion or its subsession. For each of the found request, they are explicitly marked to fail and a call is made to complete them. But due to the bug in scanning, it leaves behind requests and hence later when the requests get processed, they refer the *now* freed netsnmp_subtree. As often these requests gets completed pretty soon, they escape crashing. But if it so happens that the freed memory happened to complete a memory unit, hence returned to kernel or it got reallocated & changed enough to crash, the snmpd crashes.
|
DO NOT Merge this yet. I still have some debugging to complete. |
| netsnmp_request_info *request; | ||
| - for(request = asp->requests; request; request = request->next) { | ||
| + int i; | ||
| + for(i = 0, request = asp->requests; i < asp->vbcount; ++i, ++request) { |
There was a problem hiding this comment.
Do we need to go to the next request here by
request = request->next
Otherwise I don't know what ++request is doing
There was a problem hiding this comment.
I asked Renuka about this offline. Apparently this is not a true linked list. @renukamanavalan can provide more detail.
There was a problem hiding this comment.
The requests are an array of netsnmp_request_info structs allocated.
Each is initialized for a variable in pdu->variables
Each request is assigned a subtree based on variable->name
Each netsnmp_subtree has its cache_id
The requests that carry same cache-id through their netsnmp_subtree are linked while assigning to tree cache.
In short, the requests are not maintained as link list, but linked while grouping per cache-id.
So when you are scanning across all, walk from 0 to vbcount.
|
can we get more detailed description on this issue? |
|
I had a detailed description in commit. Let me put that in the descriptions above. BTW, the binary I got from the one built for this PR, does not seem to have my patch. May be I ended up using incorrect binary. Checking it out ... Let me clear it and then make sure to add details before merge. |
|
retest this please |
|
@lguohan: I believe the labels you added are incorrect. I think you meant to add the "Request for..." labels instead. |
|
do you have a repro and tested the fix? |
* Enable debug image build for kvm image. * Fix a bug in cleaning up requests referring the netsmp_session being closed to avoid a crash, during netsnmp_session close/abort due to timeout. RCA: The netsnmp_agent_session requests use netsmp_subtree objects that matches the associated variable name. The netsnmp_subtrees created through a netsnmp_session are tied to that session. When a subagent connection is closed/dropped due to timeout, all associated netsnmp_subtree objects are fred. Hence during a netsnmp_session close, all the delegated netsnmp_agent_sessions are scanned for requets that could be using netsnmp_subtree objects associated with this netsnmp_sesssion or its subsession. For each of the found request, they are explicitly marked to fail and a call is made to complete them. But due to the bug in scanning, it leaves behind requests and hence later when the requests get processed, they refer the *now* freed netsnmp_subtree. As often these requests gets completed pretty soon, they escape crashing. But if it so happens that the freed memory happened to complete a memory unit, hence returned to kernel or it got reallocated & changed enough to crash, the snmpd crashes. * Revert the changes * Revert
- What I did
Fix snmpd crash that could happen when netsnmp session with an agent closes.
Background:
Bug:
While walking all requests of a netsnmp_agent_session, it tries to use the next pointer of request.
But the reality is that the requests are maintained as an array of netsnmp_request_info structs, with netsnmp_agent_session::vbcount holding the count of requests.
The next/prev is used for transient chaining of requests that use the same cache-id (through netsnmp_subtree) while adding in tree cache,
Fix:
To scan all requests, don't use next pointer, but instead walk the range 0 .. vbcount.
- How I did it
- How to verify it
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)