Skip to content

Fix: Add deepcopy to agent reporters to prevent mutable reference lea…#3038

Merged
quaquel merged 5 commits intomesa:mainfrom
Nithin9585:fix/datacollector-mutable-reference-3035
Dec 30, 2025
Merged

Fix: Add deepcopy to agent reporters to prevent mutable reference lea…#3038
quaquel merged 5 commits intomesa:mainfrom
Nithin9585:fix/datacollector-mutable-reference-3035

Conversation

@Nithin9585
Copy link
Copy Markdown
Contributor

Fixes #3035

The DataCollector was storing references to mutable objects (lists, dicts, numpy arrays, etc.) returned by agent reporters instead of creating deep copies. This caused all historical records to retroactively update when agents modified these objects in subsequent steps, completely invalidating longitudinal data analysis.

The Problem

When an agent reporter returns a mutable object, the DataCollector stores a reference to that object. As the simulation progresses and the agent modifies the object, the historical data changes to reflect the current state rather than the state at the time of collection.

Example:

# Step 1: Agent has list = []
datacollector.collect()  # Should record []

# Step 2: Agent has list = [1]  
datacollector.collect()  # Should record [1]

# Bug: When checking Step 1's data later, it shows [1] instead of []

The Solution

Apply deepcopy() to agent reporter results in both _record_agents() and _record_agenttype() methods, consistent with the existing behavior for model-level reporters.

Changes Made

  • mesa/datacollection.py: Add deepcopy() to agent reporter results (lines 284, 296)
  • tests/test_datacollector_mutable_fix.py: Add test to verify mutable data independence

Testing

  • New test test_mutable_data_independence passes
  • All existing DataCollector tests pass (15/15)
  • No regressions introduced

Impact

  • Data Integrity: Historical agent data now correctly preserves state at each collection time
  • Compatibility: Works with all mutable types (lists, dicts, sets, numpy arrays, custom objects)

@Nithin9585 Nithin9585 force-pushed the fix/datacollector-mutable-reference-3035 branch from 13b3932 to 86ba0da Compare December 29, 2025 16:06
@github-actions
Copy link
Copy Markdown

Performance benchmarks:

Model Size Init time [95% CI] Run time [95% CI]
BoltzmannWealth small 🔵 +0.7% [+0.4%, +1.0%] 🔵 +2.6% [+2.5%, +2.7%]
BoltzmannWealth large 🔵 +1.6% [+1.2%, +2.0%] 🔵 +5.5% [+3.0%, +7.6%]
Schelling small 🔵 +0.6% [-1.0%, +2.4%] 🔴 +5.5% [+4.1%, +7.1%]
Schelling large 🔵 +0.8% [+0.7%, +1.0%] 🔴 +19.3% [+14.4%, +24.6%]
WolfSheep small 🔵 -0.9% [-1.9%, -0.2%] 🔵 +0.8% [+0.5%, +1.0%]
WolfSheep large 🔵 -2.9% [-7.9%, +0.2%] 🔵 -0.1% [-4.0%, +3.4%]
BoidFlockers small 🔵 +1.1% [+0.7%, +1.5%] 🔵 +1.4% [+1.3%, +1.5%]
BoidFlockers large 🔵 +0.9% [+0.2%, +1.6%] 🔵 +1.5% [+1.2%, +1.8%]

mesa#3035)

The DataCollector was storing references to mutable objects returned by agent reporters instead of creating deep copies. This caused all historical records to show the current state rather than the state at each collection time.

Changes:

- Add deepcopy() to _record_agents() method

- Add deepcopy() to _record_agenttype() method

- Add test to verify mutable data independence

Fixes mesa#3035
@Nithin9585 Nithin9585 force-pushed the fix/datacollector-mutable-reference-3035 branch from 86ba0da to c35145e Compare December 29, 2025 16:12
@Nithin9585 Nithin9585 force-pushed the fix/datacollector-mutable-reference-3035 branch from 800ff21 to ff2ff4d Compare December 30, 2025 05:59
@quaquel
Copy link
Copy Markdown
Member

quaquel commented Dec 30, 2025

Thanks for this PR and for addressing my initial comments.

As you can see in the benchmark, deepcopy adds substantial overhead. So, I am not sure whether this is the ideal solution for this bug. It is a wasteful operation in the case of immutables, and expensive in the case of complicated, nested data structures. Moreover, if I am not mistaken, returning collections will break some of the downstream behavior of the data collector, such as returning dataframes. So I am curious if there are alternative solutions that don't have this overhead.

One option is to simply not allow mutables as a return to datacollector. Which, in my experience, should not be a significant problem since people typically collect simple data (i.e., numbers instead of mutable collections).

@codebyNJ
Copy link
Copy Markdown
Contributor

@Nithin9585 @quaquel

I guess we can use type-aware approach.

def get_reports(agent):
        _prefix = (agent.model.steps, agent.unique_id)
        reports = []
        for rep in rep_funcs:
            value = rep(agent)
            # only copy known mutable types
            if isinstance(value, (list, dict, set, bytearray)):
                reports.append(deepcopy(value))
            elif hasattr(value, '__array__'):  # numpy arrays
                reports.append(value.copy())
            else:
                reports.append(value)  # immutable types: int, float, str, tuple
        return _prefix + tuple(reports)

@quaquel quaquel added the bug Release notes label label Dec 30, 2025
@quaquel
Copy link
Copy Markdown
Member

quaquel commented Dec 30, 2025

I have to think about this a bit more. Your code, as given, will solve the problem only for known mutable types, but will still silently work fine for user-defined classes. I am inclined instead to limit data that can be collected to only known built-in immutable types. Which is a much tighter constraint.

Alternatively, you could do something like

python_immutable_types = [str, int, long, bool, float, tuple,]

if insintance(value, *python_immutable_types)
   reports.append(value)
else:
   reports.append(copy.deepcopy(value))

At least, this avoids the expensive deepcopy in 80% of all cases, if not more.

@Nithin9585
Copy link
Copy Markdown
Contributor Author

@quaquel Implemented your suggested approach:

python
python_immutable_types = (str, int, bool, float, tuple)
if isinstance(value, python_immutable_types):
    reports.append(value)
else:
    reports.append(deepcopy(value))

This avoids deepcopy overhead for primitives (80%+ of cases) while still fixing the bug. All 16 tests pass.

@Nithin9585 Nithin9585 force-pushed the fix/datacollector-mutable-reference-3035 branch from 20eee9d to 7fe252c Compare December 30, 2025 19:14
@quaquel quaquel added the trigger-benchmarks Special label that triggers the benchmarking CI label Dec 30, 2025
@github-actions
Copy link
Copy Markdown

Performance benchmarks:

Model Size Init time [95% CI] Run time [95% CI]
BoltzmannWealth small 🔵 +0.7% [-0.2%, +1.6%] 🟢 -4.9% [-5.1%, -4.8%]
BoltzmannWealth large 🔵 +3.1% [+2.6%, +3.5%] 🟢 -13.8% [-15.6%, -12.0%]
Schelling small 🔵 +4.4% [+2.8%, +6.1%] 🟢 -6.8% [-7.8%, -5.8%]
Schelling large 🔵 +3.0% [+2.8%, +3.2%] 🔵 -2.8% [-4.0%, -1.5%]
WolfSheep small 🔵 +2.9% [+1.9%, +3.4%] 🔵 +0.1% [-0.1%, +0.2%]
WolfSheep large 🔵 +4.5% [-3.7%, +13.4%] 🔵 -0.0% [-1.1%, +1.2%]
BoidFlockers small 🔵 +0.4% [+0.1%, +0.7%] 🔵 +0.1% [+0.0%, +0.3%]
BoidFlockers large 🔵 -0.2% [-0.8%, +0.3%] 🔵 +0.1% [+0.0%, +0.2%]

@quaquel quaquel removed the trigger-benchmarks Special label that triggers the benchmarking CI label Dec 30, 2025
@quaquel quaquel merged commit 489545d into mesa:main Dec 30, 2025
15 checks passed
@Nithin9585 Nithin9585 deleted the fix/datacollector-mutable-reference-3035 branch December 30, 2025 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Release notes label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DataCollector Mutable Reference Leak

3 participants