PlayerManager functions cause heap corruption when engine->GetPlayerUserId() returns -1

## Blah blah blah

Yes I have the latest SourceMod version etc

## Description of the problem

Lately I've been running my TF2 Linux srcds instances under valgrind to attempt to track down an elusive crash related to heap corruption that's been plaguing my MvM servers for well over a year now.

Generally (but not always) upon level shutdown, e.g. either at a map change or when shutting down the server, there is a random chance that a call to `malloc` or `free` from some random place in the srcds process will fail, yielding various error messages from the glibc malloc/free code. Ultimately what's going on here is heap corruption of some form or fashion.

Error messages that I've seen have included:
- `malloc(): invalid next size (unsorted)`
- `free(): invalid next size (fast)`
- `corrupted size vs. prev_size`
- `corrupted double-linked list`

## How to reproduce the problem (or not)

Unfortunately, it's difficult to present an easily reproducible test case for this problem, as the nature of the problem is inherently pseudorandom. There's no clear-cut trigger for the crash: it happens a highly variable and arbitrary amount of time after the corruption itself occurs, and the stack traces of the actual crash itself are effectively useless as they don't do anything to explain what code originally caused the corruption. If this problem was straightforward to reproduce, I would have figured out the root cause months and months ago.

So, back to valgrind. It makes running srcds hideously slow and painful, and suppressing all of the various benign uninitialized-value-access and uninitialized-value-conditional errors from the game/engine code itself is a reasonably monumental task. But, upon getting this all working, I've already been able to find a highly damning case of heap corruption, and it turns out it's not in my own code at all, it's right in SourceMod.

## What SourceMod is doing horribly wrong

The `PlayerManager` constructor allocates 256 KiB for `m_UserIdLookup` on the heap via `operator new[]` and `memset`s that memory to zero. This buffer is then used by the `PlayerManager` class in several places, including `PlayerManager::GetClientOfUserId`; apparently this is used as a cache of sorts, to avoid having to call `engine->GetPlayerUserId()` in a loop over all client indices every single time someone wants to quickly map from a userid to the corresponding client index. Fair enough.

Well, here's the problem. Take a look at what `PlayerManager::InvalidatePlayer` does, given the context of the documentation for `IVEngineServer::GetPlayerUserId`:
https://github.com/alliedmodders/sourcemod/blob/master/core/PlayerManager.cpp#L1477
https://github.com/ValveSoftware/source-sdk-2013/blob/master/mp/src/public/eiface.h#L133-L135

`IVEngineServer::GetPlayerUserId` explicitly documents that it **may** return -1 in cases where it fails to find the asked-for edict in its player list. But `PlayerManager::InvalidatePlayer` *directly indexes* into `m_UserIdLookup` without checking the return value of `GetPlayerUserId`. So there certainly **can** be cases where `PlayerManager::InvalidatePlayer` writes over `m_UserIdLookup[-1]`, thereby corrupting the heap chunk that was allocated for `m_UserIdLookup` and putting future heap allocation/deallocation operations at risk of failure.

Don't believe me that this would actually happen in practice? Well, have a look at this output from valgrind:
```
Invalid write of size 4
   at 0x37D13661: PlayerManager::InvalidatePlayer(CPlayer*) (PlayerManager.cpp:1477)
   by 0x37D138F9: PlayerManager::OnClientDisconnect_Post(edict_t*) (PlayerManager.cpp:828)
   by 0x37D139F4: PlayerManager::OnSourceModLevelEnd() (PlayerManager.cpp:756)
   by 0x37CFD655: SourceModBase::LevelShutdown() (sourcemod.cpp:383)
   by 0x2787802C: __SourceHook_MFHCls_SGD_LevelShutdown::Func() (metamod.cpp:84)
   by  0x5BAF038: Host_Changelevel(bool, char const*, char const*) (in tf2/bin/engine_srv.so)
   by  0x5BB8EBE: CHostState::State_ChangeLevelMP() (in tf2/bin/engine_srv.so)
   by  0x5BB9394: CHostState::FrameUpdate(float) (in tf2/bin/engine_srv.so)
   by  0x5BB942C: HostState_Frame(float) (in tf2/bin/engine_srv.so)
   by  0x5C3EEF2: CEngine::Frame() (in tf2/bin/engine_srv.so)
   by  0x5C3C2F5: CDedicatedServerAPI::RunFrame() (in tf2/bin/engine_srv.so)
   by  0x4F1E3E9: RunServer() (in tf2/bin/dedicated_srv.so)
   by  0x5C3C3EC: CModAppSystemGroup::Main() (in tf2/bin/engine_srv.so)
   by  0x5C86757: CAppSystemGroup::Run() (in tf2/bin/engine_srv.so)
   by  0x5C3D0DC: CDedicatedServerAPI::ModInit(ModInfo_t&) (in tf2/bin/engine_srv.so)
   by  0x4F1E092: CDedicatedAppSystemGroup::Main() (in tf2/bin/dedicated_srv.so)
   by  0x5000A77: CAppSystemGroup::Run() (in tf2/bin/dedicated_srv.so)
   by  0x5000A77: CAppSystemGroup::Run() (in tf2/bin/dedicated_srv.so)
   by  0x4ED8747: main (in tf2/bin/dedicated_srv.so)
   by  0x80489CA: main (in tf2/srcds_linux)

 Address 0x379a894c is 4 bytes before a block of size 262,144 alloc'd
   at  0x40373BB: operator new[](unsigned int) (vg_replace_malloc.c:417)
   by 0x37D120C2: PlayerManager::PlayerManager() (PlayerManager.cpp:131)
   by 0x37CEAC56: __static_initialization_and_destruction_0 (PlayerManager.cpp:56)
   by 0x37CEAC56: _GLOBAL__sub_I_g_Players (PlayerManager.cpp:2603)
```

So, we have clear proof that this exact sequence of events **is** happening under certain circumstances when `engine->GetPlayerUserId` is returning -1.

If you're interested, take a quick gander at how glibc structures its heap. It's pretty clear that overwriting the 4 bytes just prior to the user-allocated portion of a heap chunk means that the chunk's size, plus three additional important flag bits, get blasted away. At that point, any subsequent heap operations are at risk of failing with a glibc malloc/free error.
https://sourceware.org/glibc/wiki/MallocInternals

It's also worth pointing out that there is one other location in the code where `m_UserIdLookup` is similarly directly indexed by the result of `engine->GetPlayerUserId` without any checking:
https://github.com/alliedmodders/sourcemod/blob/master/core/PlayerManager.cpp#L545

But note that in the following third case, the sanity checks at the beginning of the function are sufficient to ensure that the index being assigned to is valid:
https://github.com/alliedmodders/sourcemod/blob/master/core/PlayerManager.cpp#L1373

## How to fix this brokenness

To fix this problem, it would make sense to, at a minimum, actually check the result of `engine->GetPlayerUserId` for -1 (as that is a documented possible return value); and if -1 is indeed returned, then simply do not write into `m_UserIdLookup`.

Further bulletproofing might be done in the form of factoring out writes to `m_UserIdLookup` into a separate private member function, which would handle the call to `GetPlayerUserId` itself and do a full bounds check to ensure no out-of-range assignments happen. Something roughly like this:
```cpp
void PlayerManager::CacheUserIdClientNumber(edict_t *pEntity, int client)
{
    int userid = engine->GetPlayerUserId(pEntity);
    
    if (userid >= 0 && userid <= USHRT_MAX)
    {
        m_UserIdLookup[userid] = client;
    }
    else
    {
        // log an error or whatever
    }
}
```

## Other issues that may be related to this one

The heap corruption problem that I found here could possibly be the underlying cause of #748. I'm not 100% sure about that, but it's certainly suspicious. I get heap-related crashes with annoying frequency, and I happen to run MvM servers, which happen to contain plenty of bots: 22 TFBots, to be precise.

I did some digging around with `git blame` and it's not clear that any direct change to the code in question was responsible for this becoming a problem within the last couple of years. But it's reasonable to suspect that some other upstream caller of the problematic code did change in the relatively recent past, and indirectly caused this issue to actually manifest itself more often.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PlayerManager functions cause heap corruption when engine->GetPlayerUserId() returns -1 #1107

Blah blah blah

Description of the problem

How to reproduce the problem (or not)

What SourceMod is doing horribly wrong

How to fix this brokenness

Other issues that may be related to this one

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

PlayerManager functions cause heap corruption when engine->GetPlayerUserId() returns -1 #1107

Description

Blah blah blah

Description of the problem

How to reproduce the problem (or not)

What SourceMod is doing horribly wrong

How to fix this brokenness

Other issues that may be related to this one

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions