Skip to content

PlayerManager functions cause heap corruption when engine->GetPlayerUserId() returns -1 #1107

@sigsegv-mvm

Description

@sigsegv-mvm

Blah blah blah

Yes I have the latest SourceMod version etc

Description of the problem

Lately I've been running my TF2 Linux srcds instances under valgrind to attempt to track down an elusive crash related to heap corruption that's been plaguing my MvM servers for well over a year now.

Generally (but not always) upon level shutdown, e.g. either at a map change or when shutting down the server, there is a random chance that a call to malloc or free from some random place in the srcds process will fail, yielding various error messages from the glibc malloc/free code. Ultimately what's going on here is heap corruption of some form or fashion.

Error messages that I've seen have included:

  • malloc(): invalid next size (unsorted)
  • free(): invalid next size (fast)
  • corrupted size vs. prev_size
  • corrupted double-linked list

How to reproduce the problem (or not)

Unfortunately, it's difficult to present an easily reproducible test case for this problem, as the nature of the problem is inherently pseudorandom. There's no clear-cut trigger for the crash: it happens a highly variable and arbitrary amount of time after the corruption itself occurs, and the stack traces of the actual crash itself are effectively useless as they don't do anything to explain what code originally caused the corruption. If this problem was straightforward to reproduce, I would have figured out the root cause months and months ago.

So, back to valgrind. It makes running srcds hideously slow and painful, and suppressing all of the various benign uninitialized-value-access and uninitialized-value-conditional errors from the game/engine code itself is a reasonably monumental task. But, upon getting this all working, I've already been able to find a highly damning case of heap corruption, and it turns out it's not in my own code at all, it's right in SourceMod.

What SourceMod is doing horribly wrong

The PlayerManager constructor allocates 256 KiB for m_UserIdLookup on the heap via operator new[] and memsets that memory to zero. This buffer is then used by the PlayerManager class in several places, including PlayerManager::GetClientOfUserId; apparently this is used as a cache of sorts, to avoid having to call engine->GetPlayerUserId() in a loop over all client indices every single time someone wants to quickly map from a userid to the corresponding client index. Fair enough.

Well, here's the problem. Take a look at what PlayerManager::InvalidatePlayer does, given the context of the documentation for IVEngineServer::GetPlayerUserId:
https://github.com/alliedmodders/sourcemod/blob/master/core/PlayerManager.cpp#L1477
https://github.com/ValveSoftware/source-sdk-2013/blob/master/mp/src/public/eiface.h#L133-L135

IVEngineServer::GetPlayerUserId explicitly documents that it may return -1 in cases where it fails to find the asked-for edict in its player list. But PlayerManager::InvalidatePlayer directly indexes into m_UserIdLookup without checking the return value of GetPlayerUserId. So there certainly can be cases where PlayerManager::InvalidatePlayer writes over m_UserIdLookup[-1], thereby corrupting the heap chunk that was allocated for m_UserIdLookup and putting future heap allocation/deallocation operations at risk of failure.

Don't believe me that this would actually happen in practice? Well, have a look at this output from valgrind:

Invalid write of size 4
   at 0x37D13661: PlayerManager::InvalidatePlayer(CPlayer*) (PlayerManager.cpp:1477)
   by 0x37D138F9: PlayerManager::OnClientDisconnect_Post(edict_t*) (PlayerManager.cpp:828)
   by 0x37D139F4: PlayerManager::OnSourceModLevelEnd() (PlayerManager.cpp:756)
   by 0x37CFD655: SourceModBase::LevelShutdown() (sourcemod.cpp:383)
   by 0x2787802C: __SourceHook_MFHCls_SGD_LevelShutdown::Func() (metamod.cpp:84)
   by  0x5BAF038: Host_Changelevel(bool, char const*, char const*) (in tf2/bin/engine_srv.so)
   by  0x5BB8EBE: CHostState::State_ChangeLevelMP() (in tf2/bin/engine_srv.so)
   by  0x5BB9394: CHostState::FrameUpdate(float) (in tf2/bin/engine_srv.so)
   by  0x5BB942C: HostState_Frame(float) (in tf2/bin/engine_srv.so)
   by  0x5C3EEF2: CEngine::Frame() (in tf2/bin/engine_srv.so)
   by  0x5C3C2F5: CDedicatedServerAPI::RunFrame() (in tf2/bin/engine_srv.so)
   by  0x4F1E3E9: RunServer() (in tf2/bin/dedicated_srv.so)
   by  0x5C3C3EC: CModAppSystemGroup::Main() (in tf2/bin/engine_srv.so)
   by  0x5C86757: CAppSystemGroup::Run() (in tf2/bin/engine_srv.so)
   by  0x5C3D0DC: CDedicatedServerAPI::ModInit(ModInfo_t&) (in tf2/bin/engine_srv.so)
   by  0x4F1E092: CDedicatedAppSystemGroup::Main() (in tf2/bin/dedicated_srv.so)
   by  0x5000A77: CAppSystemGroup::Run() (in tf2/bin/dedicated_srv.so)
   by  0x5000A77: CAppSystemGroup::Run() (in tf2/bin/dedicated_srv.so)
   by  0x4ED8747: main (in tf2/bin/dedicated_srv.so)
   by  0x80489CA: main (in tf2/srcds_linux)

 Address 0x379a894c is 4 bytes before a block of size 262,144 alloc'd
   at  0x40373BB: operator new[](unsigned int) (vg_replace_malloc.c:417)
   by 0x37D120C2: PlayerManager::PlayerManager() (PlayerManager.cpp:131)
   by 0x37CEAC56: __static_initialization_and_destruction_0 (PlayerManager.cpp:56)
   by 0x37CEAC56: _GLOBAL__sub_I_g_Players (PlayerManager.cpp:2603)

So, we have clear proof that this exact sequence of events is happening under certain circumstances when engine->GetPlayerUserId is returning -1.

If you're interested, take a quick gander at how glibc structures its heap. It's pretty clear that overwriting the 4 bytes just prior to the user-allocated portion of a heap chunk means that the chunk's size, plus three additional important flag bits, get blasted away. At that point, any subsequent heap operations are at risk of failing with a glibc malloc/free error.
https://sourceware.org/glibc/wiki/MallocInternals

It's also worth pointing out that there is one other location in the code where m_UserIdLookup is similarly directly indexed by the result of engine->GetPlayerUserId without any checking:
https://github.com/alliedmodders/sourcemod/blob/master/core/PlayerManager.cpp#L545

But note that in the following third case, the sanity checks at the beginning of the function are sufficient to ensure that the index being assigned to is valid:
https://github.com/alliedmodders/sourcemod/blob/master/core/PlayerManager.cpp#L1373

How to fix this brokenness

To fix this problem, it would make sense to, at a minimum, actually check the result of engine->GetPlayerUserId for -1 (as that is a documented possible return value); and if -1 is indeed returned, then simply do not write into m_UserIdLookup.

Further bulletproofing might be done in the form of factoring out writes to m_UserIdLookup into a separate private member function, which would handle the call to GetPlayerUserId itself and do a full bounds check to ensure no out-of-range assignments happen. Something roughly like this:

void PlayerManager::CacheUserIdClientNumber(edict_t *pEntity, int client)
{
    int userid = engine->GetPlayerUserId(pEntity);
    
    if (userid >= 0 && userid <= USHRT_MAX)
    {
        m_UserIdLookup[userid] = client;
    }
    else
    {
        // log an error or whatever
    }
}

Other issues that may be related to this one

The heap corruption problem that I found here could possibly be the underlying cause of #748. I'm not 100% sure about that, but it's certainly suspicious. I get heap-related crashes with annoying frequency, and I happen to run MvM servers, which happen to contain plenty of bots: 22 TFBots, to be precise.

I did some digging around with git blame and it's not clear that any direct change to the code in question was responsible for this becoming a problem within the last couple of years. But it's reasonable to suspect that some other upstream caller of the problematic code did change in the relatively recent past, and indirectly caused this issue to actually manifest itself more often.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Buggeneral bugs; can be anything

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions