Fix for Memory leak #3180

Merged
quaquel merged 13 commits into mesa:main from quaquel:memory_leak
Jan 20, 2026

Conversation

@quaquel
Member

@quaquel quaquel commented Jan 19, 2026

This is a bugfix for the memory leak identified in #3179.

The problem is that Agent._ids is a defaultdict that stores references to model instances. This was done to ensure that unique_id is unique relative to a given model. However, since it is a class attribute, this reference persists for the lifetime of the Python process, preventing the entire model blob from being garbage-collected.
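
A minimal sketch of the leak pattern described above. `LeakyAgent` and `ToyModel` are hypothetical stand-ins, not Mesa's classes, and the exact shape of the real `_ids` default factory is an assumption; only the class-level defaultdict keyed by model comes from the description.

```python
import gc
import weakref
from collections import defaultdict
from itertools import count


class LeakyAgent:
    # Class-level registry keyed by model (assumed shape): each key is a
    # strong reference that lives as long as the class itself.
    _ids = defaultdict(lambda: count(1))

    def __init__(self, model):
        self.model = model
        self.unique_id = next(self._ids[model])


class ToyModel:
    """Hypothetical stand-in for a model; hashable by identity."""


def build_and_drop():
    model = ToyModel()
    LeakyAgent(model)
    return weakref.ref(model)  # lets us observe whether the model is freed


ref = build_and_drop()
gc.collect()
model_still_alive = ref() is not None  # the registry key keeps the model alive

LeakyAgent._ids.clear()  # dropping the key releases the model
gc.collect()
model_freed_after_clear = ref() is None
```

In this toy version the model only becomes collectable once the class-level dictionary no longer holds it as a key.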

There are various solutions

  1. Use a weakref of the model
  2. Use the hash of the model
  3. Move the assignment of unique_id into register_agent
  4. Add some method to Model to clean up (not yet explored), including removing the ref in Agent._ids
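
For comparison, option 1 might look roughly like the sketch below. It uses `weakref.WeakKeyDictionary` from the standard library; `WeakAgent` and `ToyModel` are hypothetical names, and this is not the variant that was benchmarked or merged.

```python
import gc
import weakref
from itertools import count


class WeakAgent:
    # Keys are held weakly, so an entry disappears once its model is collected.
    _ids = weakref.WeakKeyDictionary()

    def __init__(self, model):
        self.model = model
        counter = self._ids.setdefault(model, count(1))
        self.unique_id = next(counter)


class ToyModel:
    """Hypothetical stand-in for a model."""


model = ToyModel()
first, second = WeakAgent(model), WeakAgent(model)
ids = (first.unique_id, second.unique_id)

ref = weakref.ref(model)
del model, first, second
gc.collect()
collected = ref() is None
registry_size = len(WeakAgent._ids)
```

The IDs stay unique per model, and the registry entry vanishes together with the model.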

Here, I implement option 3 because, among the options tested, it was the fastest locally. I also moved away from itertools.count and instead just use an index that is being incremented. The main reason is that itertools.count will not be pickleable in Python 3.14, and count is overkill for the simple integer increments needed here anyway.
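
A rough sketch of the idea behind option 3, assuming toy classes rather than the actual diff: `ToyModel` and `ToyAgent` are hypothetical, while `register_agent`, `agent_id_counter`, and `_agents` follow the names used in this thread.

```python
import gc
import weakref


class ToyModel:
    def __init__(self):
        self._agents = {}
        self.agent_id_counter = 1  # plain int, one per model instance

    def register_agent(self, agent):
        # The model hands out the ID, so no class-level registry is needed.
        agent.unique_id = self.agent_id_counter
        self.agent_id_counter += 1
        self._agents[agent] = None


class ToyAgent:
    def __init__(self, model):
        self.model = model
        self.unique_id = None
        model.register_agent(self)


model_a, model_b = ToyModel(), ToyModel()
ids_a = [ToyAgent(model_a).unique_id for _ in range(3)]
ids_b = [ToyAgent(model_b).unique_id for _ in range(2)]

ref = weakref.ref(model_a)
del model_a
gc.collect()  # the model/agent reference cycle is left to the cycle collector
model_a_collected = ref() is None
```

Each model counts independently, and nothing outside the model keeps it alive in this toy setup.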

For reasons that escape me at present, it is still necessary to remove all agents from the model before it can be garbage-collected, at least in the updated test_examples. But when I try a minimal version of Boltzmann, this seems unnecessary. So, there might be some other memory issue remaining.

@EwoutH EwoutH marked this pull request as draft January 19, 2026 21:13
@quaquel quaquel added bug Release notes label trigger-benchmarks Special label that triggers the benchmarking CI labels Jan 19, 2026
@quaquel quaquel added trigger-benchmarks Special label that triggers the benchmarking CI and removed trigger-benchmarks Special label that triggers the benchmarking CI labels Jan 19, 2026
@mesa mesa deleted a comment from github-actions bot Jan 19, 2026
@mesa mesa deleted a comment from github-actions bot Jan 19, 2026
@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🔵 -1.8% [-2.2%, -1.4%] | 🔵 -1.2% [-1.4%, -1.1%] |
| BoltzmannWealth | large | 🔵 +3.2% [-2.0%, +9.6%] | 🔴 +8.8% [+5.8%, +11.7%] |
| Schelling | small | 🟢 -4.1% [-4.4%, -3.8%] | 🔵 -1.1% [-1.2%, -1.0%] |
| Schelling | large | 🔵 +2.0% [-0.8%, +4.9%] | 🔵 -0.6% [-1.7%, +0.4%] |
| WolfSheep | small | 🔵 -1.6% [-2.8%, -0.0%] | 🔵 -1.2% [-1.5%, -0.9%] |
| WolfSheep | large | 🔴 +24.5% [+13.5%, +35.4%] | 🔵 -0.3% [-1.0%, +0.5%] |
| BoidFlockers | small | 🔵 -2.3% [-2.9%, -1.7%] | 🔵 -0.3% [-0.5%, -0.0%] |
| BoidFlockers | large | 🔵 -3.0% [-3.5%, -2.4%] | 🔵 -0.6% [-0.9%, -0.4%] |

@quaquel quaquel added trigger-benchmarks Special label that triggers the benchmarking CI and removed trigger-benchmarks Special label that triggers the benchmarking CI labels Jan 19, 2026
@quaquel
Member Author

quaquel commented Jan 19, 2026

Ok, these results are better than when testing locally, so this seems a reasonable solution.

Now we just need to add support for pickling the model because itertools.count won't be pickleable anymore (not that difficult to achieve I think), and we need to figure out why the alliance formation model breaks.

commit 4b71cbe
Author: Jan Kwakkel
Date:   Tue Jan 20 07:57:16 2026 +0100

    Update meta_agent.py

commit 702944a
Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date:   Tue Jan 20 06:56:05 2026 +0000

    [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

commit 0f2c81a
Author: Jan Kwakkel
Date:   Tue Jan 20 07:54:25 2026 +0100

    Update meta_agent.py

commit 1820d89
Author: Jan Kwakkel
Date:   Tue Jan 20 07:53:24 2026 +0100

    Update meta_agent.py
@quaquel quaquel added example Changes the examples or adds to them. and removed example Changes the examples or adds to them. labels Jan 20, 2026
@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🟢 -13.9% [-14.4%, -13.4%] | 🔵 -1.5% [-1.8%, -1.3%] |
| BoltzmannWealth | large | 🔵 +0.6% [-4.6%, +6.9%] | 🟢 -7.3% [-9.6%, -4.7%] |
| Schelling | small | 🟢 -4.7% [-5.1%, -4.2%] | 🔵 -1.2% [-1.4%, -0.9%] |
| Schelling | large | 🔵 +3.0% [-0.2%, +6.6%] | 🟢 -8.2% [-9.9%, -6.4%] |
| WolfSheep | small | 🔵 -1.6% [-3.1%, +0.2%] | 🔵 +0.7% [+0.5%, +1.0%] |
| WolfSheep | large | 🔴 +27.2% [+14.0%, +41.3%] | 🔵 +0.7% [-1.3%, +2.4%] |
| BoidFlockers | small | 🔵 -2.6% [-3.1%, -2.1%] | 🔵 +0.2% [+0.0%, +0.4%] |
| BoidFlockers | large | 🟢 -5.3% [-6.1%, -4.3%] | 🔵 -0.1% [-0.3%, +0.1%] |

@codebreaker32
Collaborator

codebreaker32 commented Jan 20, 2026

Hi @quaquel

I was also working on a fix for the example error, and modifying register_agent as follows made all the tests pass:

# In model.py

def register_agent(self, agent: Agent) -> None:
    # Check if the agent already has a valid ID
    if agent.unique_id is not None:
        # It's already registered! Don't touch the ID.
        # Just ensure it's in the internal list if needed.
        self._agents[agent] = None 
        return

    # Only generate a NEW ID if it completely lacks one
    agent.unique_id = next(self.agent_id_counter)
    self._agents[agent] = None

The reason I can think of is this (I might be wrong):

# meta_agent.py
def add_constituting_agents(self, new_agents: set[Agent]):
    for agent in new_agents:
        self._constituting_set.add(agent)
        agent.meta_agent = self
        self.model.register_agent(agent)  # <--- Culprit

Suppose agent A's unique_id, 5, is used as a key to store data in various dictionaries. Suddenly, the agent's unique_id attribute is overwritten with 105. The Python dictionaries are not corrupted, but they are now out of sync: the dictionary still holds the data under the old key (5), while the code now tries to retrieve it using the new key (105). This mismatch leads to a KeyError or logic errors because the system looks up an ID that isn't there.
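
A toy reproduction of that mismatch, with hypothetical classes (`ToyModel`, `ToyAgent`) and a made-up `wealth_by_id` dictionary; it assumes a `register_agent` that always assigns a fresh ID:

```python
class ToyModel:
    def __init__(self):
        self.next_id = 1

    def register_agent(self, agent):
        # Unconditionally assigns a fresh ID, even if the agent already has one.
        agent.unique_id = self.next_id
        self.next_id += 1


class ToyAgent:
    def __init__(self, model):
        self.unique_id = None
        model.register_agent(self)


model = ToyModel()
agent = ToyAgent(model)
original_id = agent.unique_id
wealth_by_id = {original_id: 10}  # data keyed by the original ID

model.register_agent(agent)  # second registration overwrites the ID

lookup_failed = agent.unique_id not in wealth_by_id
```

After the second call, the stored data is still there under `original_id`, but a lookup via `agent.unique_id` no longer finds it.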

@quaquel
Member Author

quaquel commented Jan 20, 2026

@codebreaker32, yes, this is indeed the source of the problem. But in my view, agents should never call register_agent twice. In fact, a Mesa user should never have to call it if they use super properly in their custom agents. So, I see this as a bug in create_meta_agents, not in register_agent (See #3183 for my proposed fix.)

quaquel and others added 5 commits January 20, 2026 10:57
commit e5b3a09
Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date:   Tue Jan 20 08:15:05 2026 +0000

    [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

commit 228b8b5
Author: Jan Kwakkel
Date:   Tue Jan 20 09:14:54 2026 +0100

    Update meta_agent.py
@quaquel quaquel removed the trigger-benchmarks Special label that triggers the benchmarking CI label Jan 20, 2026
@quaquel quaquel added the trigger-benchmarks Special label that triggers the benchmarking CI label Jan 20, 2026
@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🔵 +3.0% [+1.9%, +4.1%] | 🔵 -1.0% [-1.1%, -0.8%] |
| BoltzmannWealth | large | 🔵 +6.1% [+0.3%, +13.3%] | 🔴 +25.5% [+20.7%, +30.4%] |
| Schelling | small | 🔵 -2.1% [-2.6%, -1.7%] | 🔵 +0.9% [+0.6%, +1.2%] |
| Schelling | large | 🔵 +3.3% [+0.5%, +6.5%] | 🔵 +3.8% [+0.9%, +7.1%] |
| WolfSheep | small | 🔵 +2.8% [+1.1%, +4.7%] | 🔵 +0.4% [+0.0%, +0.8%] |
| WolfSheep | large | 🔴 +32.7% [+18.1%, +48.5%] | 🔵 +2.6% [+0.8%, +4.5%] |
| BoidFlockers | small | 🟢 -7.6% [-8.0%, -7.1%] | 🔵 -1.2% [-1.5%, -0.9%] |
| BoidFlockers | large | 🟢 -9.3% [-9.7%, -8.9%] | 🔵 -1.2% [-1.4%, -0.9%] |

@quaquel
Member Author

quaquel commented Jan 20, 2026

The benchmarks remain strange. For example, Boltzmann Wealth has no additional agents during the run, so nothing in this PR would change the run time.

@quaquel quaquel marked this pull request as ready for review January 20, 2026 10:23
@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🟢 -3.7% [-4.0%, -3.3%] | 🔵 -1.1% [-1.2%, -0.9%] |
| BoltzmannWealth | large | 🔵 +4.2% [-1.5%, +10.8%] | 🔵 +6.1% [+2.9%, +9.5%] |
| Schelling | small | 🟢 -4.1% [-4.4%, -3.8%] | 🔵 -1.3% [-1.4%, -1.2%] |
| Schelling | large | 🔵 +0.1% [-1.9%, +2.1%] | 🔵 -1.6% [-2.7%, -0.4%] |
| WolfSheep | small | 🔵 -0.6% [-1.6%, +0.5%] | 🔵 +0.9% [+0.6%, +1.2%] |
| WolfSheep | large | 🔴 +20.5% [+10.0%, +31.6%] | 🔵 +1.1% [+0.2%, +2.0%] |
| BoidFlockers | small | 🟢 -5.5% [-5.9%, -5.1%] | 🔵 -0.4% [-0.6%, -0.2%] |
| BoidFlockers | large | 🟢 -6.8% [-7.2%, -6.4%] | 🔵 +0.1% [-0.3%, +0.5%] |

Member

@EwoutH EwoutH left a comment

I'm going to pre-approve this to unblock you.

Please document the problem, the solution, the rejected alternatives and the reasoning behind all those well. If this ever bites us back we can trace this back.

@quaquel
Member Author

quaquel commented Jan 20, 2026

> Please document the problem, the solution, the rejected alternatives and the reasoning behind all those well. If this ever bites us back we can trace this back.

It was included in the start post. In short, moving the assignment of agent.unique_id into the model ensures it remains unique within a model instance, while being faster than the alternatives I tested.

The main issue encountered next is that, ideally, model.register_agent and model.deregister_agent should never be called by the user directly but always via super on the agent. It might be good to document this (and hard refs in general) for Mesa 4.
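
To illustrate the intended call pattern, here is a sketch with hypothetical classes (`ToyModel`, `BaseAgent`, `Trader`); it is not Mesa's actual implementation, only an example of registration happening through `super` rather than through direct user calls:

```python
class ToyModel:
    def __init__(self):
        self._agents = {}
        self._next_id = 1

    def register_agent(self, agent):
        agent.unique_id = self._next_id
        self._next_id += 1
        self._agents[agent] = None

    def deregister_agent(self, agent):
        del self._agents[agent]


class BaseAgent:
    def __init__(self, model):
        self.model = model
        self.model.register_agent(self)  # the only place registration happens

    def remove(self):
        self.model.deregister_agent(self)  # the only place deregistration happens


class Trader(BaseAgent):
    def __init__(self, model, wealth):
        super().__init__(model)  # registers the agent exactly once
        self.wealth = wealth


model = ToyModel()
trader = Trader(model, wealth=5)
registered = trader in model._agents
trader.remove()
deregistered = trader not in model._agents
```

User code here never touches `register_agent` or `deregister_agent`, so an agent cannot end up with a reassigned ID.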

@quaquel quaquel merged commit b6e96d1 into mesa:main Jan 20, 2026
14 checks passed
@quaquel quaquel deleted the memory_leak branch January 21, 2026 07:04
EwoutH pushed a commit to EwoutH/mesa that referenced this pull request Jan 23, 2026
A memory leak was discovered in Mesa where model instances could never be garbage collected after agents were created. The root cause was the `Agent._ids` class attribute—a `defaultdict` that stored references to model instances to ensure `unique_id` values were unique on a per-model basis. Because `_ids` was a class-level attribute that persisted across the Python process, any model instance used as a key in this dictionary maintained a hard reference indefinitely, preventing the garbage collector from cleaning up the model and all its associated objects (agents, grids, etc.) even after the model went out of scope or was explicitly deleted.

This bug had significant practical consequences for Mesa users, particularly those running multiple simulations or batch experiments. Each time a model was instantiated and run within a function, the model objects would accumulate in RAM rather than being cleaned up when the function exited. This meant that running many model instances—common in parameter sweeps, sensitivity analyses, or optimization workflows—would cause unbounded memory growth, eventually exhausting available RAM. The issue was especially problematic because it was invisible to users: simply letting a model go out of scope or calling `del model` appeared to work but silently retained all the memory, and even explicitly removing agents with `model.remove_all_agents()` only partially addressed the problem depending on the space types used.

The fix moved the `unique_id` assignment logic from the `Agent` class into the `Model.register_agent()` method, eliminating the problematic class-level `_ids` defaultdict entirely. Instead of tracking IDs across all model instances in a shared dictionary, each model now maintains its own `agent_id_counter` instance attribute that starts at 1 and increments with each registered agent. This approach ensures that `unique_id` remains unique within each model instance while allowing the garbage collector to properly clean up model objects when they go out of scope, since there are no longer any persistent class-level references to model instances. The fix also replaced `itertools.count` with simple integer incrementation, which avoids upcoming pickle compatibility issues in Python 3.14.