Skip to content

Fix snmpd crash#3312

Merged
renukamanavalan merged 4 commits intosonic-net:masterfrom
renukamanavalan:snmpd_crash
Aug 9, 2019
Merged

Fix snmpd crash#3312
renukamanavalan merged 4 commits intosonic-net:masterfrom
renukamanavalan:snmpd_crash

Conversation

@renukamanavalan
Copy link
Copy Markdown
Contributor

@renukamanavalan renukamanavalan commented Aug 8, 2019

- What I did
Fix snmpd crash that could happen when netsnmp session with an agent closes.

Background:

  1. A netsnmp_session could register netsnmp_subtrees with AGENTX_MSG_REGISTER command
  2. These netsnmp_subtree objects gets associated with corresponding netsnmp_session
  3. When an agent explicitly/implicitly/timeout closes the netsnmp_session, all the associated netsnmp_subtree objects get freed.
  4. But these netsnmp_subtree objects could be in use by requests from other netsnmp_sessions
  5. During close, it sweeps across all requests from all delegated netsnmp_agent_sessions, fail them, to get them undelegated & close, before proceeding to release all the associated subtrees.

Bug:
While walking all requests of a netsnmp_agent_session, it tries to use the next pointer of request.
But the reality is that the requests are maintained as an array of netsnmp_request_info structs, with netsnmp_agent_session::vbcount holding the count of requests.
The next/prev is used for transient chaining of requests that use the same cache-id (through netsnmp_subtree) while adding in tree cache,

Fix:
To scan all requests, don't use next pointer, but instead walk the range 0 .. vbcount.

- How I did it

- How to verify it

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

…closed to avoid a crash, during netsnmp_session close/abort due to timeout.

RCA:
The netsnmp_agent_session requests use netsmp_subtree objects that matches the associated variable name.
The netsnmp_subtrees created through a netsnmp_session are tied to that session.
When a subagent connection is closed/dropped due to timeout, all associated netsnmp_subtree objects are fred.
Hence during a netsnmp_session close, all the delegated netsnmp_agent_sessions are scanned for requets that could be using netsnmp_subtree objects associated with this netsnmp_sesssion or its subsession. For each of the found request, they are explicitly marked to fail and a call is made to complete them.

But due to the bug in scanning, it leaves behind requests and hence later when the requests get processed, they refer the *now* freed netsnmp_subtree. As often these requests gets completed pretty soon, they escape crashing. But if it so happens that the freed memory happened to complete a memory unit, hence returned to kernel or it got reallocated & changed enough to crash, the snmpd crashes.
@jleveque jleveque changed the title Snmpd crash Fix snmpd crash Aug 8, 2019
@renukamanavalan
Copy link
Copy Markdown
Contributor Author

DO NOT Merge this yet. I still have some debugging to complete.

netsnmp_request_info *request;
- for(request = asp->requests; request; request = request->next) {
+ int i;
+ for(i = 0, request = asp->requests; i < asp->vbcount; ++i, ++request) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to go to the next request here by
request = request->next
Otherwise I don't know what ++request is doing

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked Renuka about this offline. Apparently this is not a true linked list. @renukamanavalan can provide more detail.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requests are an array of netsnmp_request_info structs allocated.
Each is initialized for a variable in pdu->variables
Each request is assigned a subtree based on variable->name
Each netsnmp_subtree has its cache_id
The requests that carry same cache-id through their netsnmp_subtree are linked while assigning to tree cache.

In short, the requests are not maintained as link list, but linked while grouping per cache-id.

So when you are scanning across all, walk from 0 to vbcount.

@lguohan
Copy link
Copy Markdown
Collaborator

lguohan commented Aug 8, 2019

can we get more detailed description on this issue?

@renukamanavalan
Copy link
Copy Markdown
Contributor Author

I had a detailed description in commit. Let me put that in the descriptions above.

BTW, the binary I got from the one built for this PR, does not seem to have my patch. May be I ended up using incorrect binary.

Checking it out ...

Let me clear it and then make sure to add details before merge.

@renukamanavalan
Copy link
Copy Markdown
Contributor Author

retest this please

@jleveque
Copy link
Copy Markdown
Contributor

jleveque commented Aug 9, 2019

@lguohan: I believe the labels you added are incorrect. I think you meant to add the "Request for..." labels instead.

@lguohan
Copy link
Copy Markdown
Collaborator

lguohan commented Aug 12, 2019

do you have a repro and tested the fix?

@renukamanavalan
Copy link
Copy Markdown
Contributor Author

renukamanavalan commented Aug 12, 2019 via email

wangshengjun pushed a commit to wangshengjun/sonic-buildimage that referenced this pull request Nov 16, 2020
* Enable debug image build for kvm image.

* Fix a bug in cleaning up requests referring the netsmp_session being closed to avoid a crash, during netsnmp_session close/abort due to timeout.
RCA:
The netsnmp_agent_session requests use netsmp_subtree objects that matches the associated variable name.
The netsnmp_subtrees created through a netsnmp_session are tied to that session.
When a subagent connection is closed/dropped due to timeout, all associated netsnmp_subtree objects are fred.
Hence during a netsnmp_session close, all the delegated netsnmp_agent_sessions are scanned for requets that could be using netsnmp_subtree objects associated with this netsnmp_sesssion or its subsession. For each of the found request, they are explicitly marked to fail and a call is made to complete them.

But due to the bug in scanning, it leaves behind requests and hence later when the requests get processed, they refer the *now* freed netsnmp_subtree. As often these requests gets completed pretty soon, they escape crashing. But if it so happens that the freed memory happened to complete a memory unit, hence returned to kernel or it got reallocated & changed enough to crash, the snmpd crashes.

* Revert the changes

* Revert
mssonicbld added a commit that referenced this pull request Jun 8, 2024
…atically (#19243)

#### Why I did it
src/sonic-utilities
```
* 0af03e7c - (HEAD -> 202311, origin/202311) [Mellanox] Update sonic-utilities to support new SKU Mellanox-SN5600-V256 (#3312) (28 hours ago) [DavidZagury]
```
#### How I did it
#### How to verify it
#### Description for the changelog
mssonicbld added a commit that referenced this pull request Oct 12, 2024
…lly (#20485)

#### Why I did it
src/sonic-swss
```
* 55fd3f1 - (HEAD -> master, origin/master, origin/HEAD) [dash]: Wait for routing type config when adding VNET mapping (#3312) (5 hours ago) [Lawrence Lee]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants