Make benchmarking more robust#3203

Merged
quaquel merged 8 commits into mesa:main from codebreaker32:benchmarks
Jan 25, 2026

Conversation

@codebreaker32
Collaborator

@codebreaker32 codebreaker32 commented Jan 24, 2026

Summary

This PR refactors benchmarks/global_benchmark.py to address high variance and inconsistency in our performance benchmarks (observed up to ±30% locally and on CI). The changes align our methodology with industry standards (e.g., NetLogo, ASV, CPython) by introducing warm-up periods, controlling garbage collection, and improving timer precision.

Motivation

The current benchmarking setup suffers from several critical flaws that cause unstable results:

  1. We currently measure from the very first run, which captures the cold-start penalty. Standard practice (in NetLogo and ASV) is to run the model 3+ times to "warm up" the system before starting the timer.
  2. Leaving the garbage collector enabled allows it to fire unpredictably during timed runs, causing random spikes in execution time.
  3. timeit.default_timer() does not make its timing guarantees explicit; time.perf_counter() is documented as high-resolution and monotonic (on Python 3, default_timer is in fact an alias for perf_counter, so this is a clarity change).
  4. As noted in issue Memory leak in models #3179, there is still a memory issue. It would be good to call model.remove_all_agents() at the end of run_model.
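The GC effect in point 2 can be seen with a minimal sketch; the `workload` function here is a hypothetical stand-in for a model run, not code from this repo:

```python
import gc
import time

def workload():
    # Hypothetical stand-in for a model run: many short-lived allocations,
    # exactly the pattern that can trigger a generational GC pass mid-run.
    return [str(i) for i in range(100_000)]

def time_once(disable_gc):
    if disable_gc:
        gc.collect()   # start each run from a consistent heap state
        gc.disable()   # no collector pauses inside the timed region
    start = time.perf_counter()  # high-resolution, monotonic timer
    workload()
    elapsed = time.perf_counter() - start
    if disable_gc:
        gc.enable()    # always restore normal GC behavior
    return elapsed

with_gc = [time_once(False) for _ in range(10)]
without_gc = [time_once(True) for _ in range(10)]
```

Comparing the spread (max − min) of the two lists typically shows a tighter distribution when GC is disabled during the timed region.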

Implementation

  1. Added a loop that executes the model 3 times before starting the actual timer. These runs are discarded so the system (caches, JIT) is primed.
  2. Controlled GC:
  • gc.disable() is called before the timing loop to prevent random pauses.
  • gc.collect() is called explicitly to ensure a consistent memory state for each run.
  • gc.enable() is called afterwards to restore normal GC behavior.
  3. Replaced timeit.default_timer() with time.perf_counter() to make the high-resolution, monotonic timing explicit.
  4. Added a call to model.remove_all_agents() at the end of run_model to mitigate memory accumulation (addressing @quaquel's feedback).
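Taken together, these pieces amount to a timing harness roughly like the following sketch. `ToyModel`, `run_model`, and the step/repeat counts are hypothetical placeholders, not the actual code in benchmarks/global_benchmark.py:

```python
import gc
import time

WARMUP_RUNS = 3

class ToyModel:
    """Hypothetical stand-in for a Mesa model."""
    def __init__(self):
        self.agents = list(range(1000))
    def step(self):
        self.agents = [a + 1 for a in self.agents]
    def remove_all_agents(self):
        self.agents.clear()

def run_model(model, steps=50):
    for _ in range(steps):
        model.step()
    model.remove_all_agents()  # mitigate memory accumulation between runs

def benchmark(model_factory, repeats=5):
    # Warm-up: discarded runs that prime caches before anything is measured.
    for _ in range(WARMUP_RUNS):
        run_model(model_factory())
    timings = []
    for _ in range(repeats):
        model = model_factory()
        gc.collect()                 # consistent memory state for each run
        gc.disable()                 # no collector pauses in the timed region
        start = time.perf_counter()  # high-resolution, monotonic timer
        run_model(model)
        timings.append(time.perf_counter() - start)
        gc.enable()                  # restore normal GC behavior
    return min(timings)              # the PR settled on the fastest-run metric

best = benchmark(ToyModel)
```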

@codebreaker32
Collaborator Author

codebreaker32 commented Jan 24, 2026

I ran this locally 5 consecutive times.

Timings comparison (using the median; percentages are relative to Run 1)

| Benchmark | Run 1 Init (Ref) | Run 1 Run (Ref) | Run 2 Init | Run 2 Run | Run 3 Init | Run 3 Run | Run 4 Init | Run 4 Run | Run 5 Init | Run 5 Run |
|---|---|---|---|---|---|---|---|---|---|---|
| BoltzmannWealth (small) | 0.00113 s | 0.0271 s | -2.7% | +3.0% | +9.7% | +7.0% | +9.7% | +11.1% | -2.7% | +0.7% |
| BoltzmannWealth (large) | 0.07195 s | 0.2977 s | -8.1% | +0.5% | +2.2% | +18.7% | +28.1% | +38.8% | -6.3% | +1.0% |
| Schelling (small) | 0.01325 s | 0.0411 s | -1.2% | -0.5% | +6.4% | +4.9% | +20.4% | +23.6% | +2.8% | +2.9% |
| Schelling (large) | 0.14845 s | 0.5902 s | -4.3% | +5.3% | +12.1% | +33.6% | +4.5% | +25.2% | -4.7% | -0.7% |
| WolfSheep (small) | 0.00490 s | 0.0738 s | +9.8% | +6.4% | +18.8% | +16.9% | +1.6% | +8.5% | +3.1% | +5.1% |
| WolfSheep (large) | 0.06919 s | 0.8861 s | +7.1% | +22.6% | +2.1% | +22.5% | -4.5% | +7.0% | +7.9% | +3.8% |
| BoidFlockers (small) | 0.00226 s | 0.1561 s | +18.1% | +13.8% | +22.1% | +6.6% | +14.6% | +1.9% | +5.8% | +1.9% |
| BoidFlockers (large) | 0.00464 s | 0.2244 s | -3.0% | +2.6% | +1.3% | +6.4% | -10.3% | +0.8% | -16.4% | +2.5% |

Here are the script to reproduce this locally and the results:
benchmark_output.txt
run_benchmark.sh

@codebreaker32
Collaborator Author

codebreaker32 commented Jan 24, 2026

I can still see considerable regression locally

@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🔴 +6.2% [+4.9%, +7.3%] | 🔵 +0.6% [+0.2%, +0.9%] |
| BoltzmannWealth | large | 🔵 -3.1% [-8.0%, +0.8%] | 🔵 -4.5% [-8.6%, +0.1%] |
| Schelling | small | 🔵 +3.0% [+2.5%, +3.6%] | 🔴 +4.0% [+3.9%, +4.1%] |
| Schelling | large | 🔵 -0.8% [-3.6%, +1.8%] | 🟢 -4.5% [-6.1%, -3.1%] |
| WolfSheep | small | 🔵 -3.6% [-5.2%, -2.3%] | 🔵 +1.2% [+0.8%, +1.4%] |
| WolfSheep | large | 🟢 -20.7% [-26.7%, -14.4%] | 🔵 -0.8% [-2.2%, +0.8%] |
| BoidFlockers | small | 🔵 +3.1% [+2.7%, +3.6%] | 🔵 -1.0% [-1.2%, -0.8%] |
| BoidFlockers | large | 🔵 +1.8% [+1.2%, +2.4%] | 🔵 -0.8% [-1.0%, -0.6%] |

@codebreaker32
Collaborator Author

Timings comparison (using the min; percentages are relative to Run 1)

| Benchmark | Run 1 Init (Ref) | Run 1 Run (Ref) | Run 2 Init | Run 2 Run | Run 3 Init | Run 3 Run | Run 4 Init | Run 4 Run | Run 5 Init | Run 5 Run |
|---|---|---|---|---|---|---|---|---|---|---|
| BoltzmannWealth (small) | 0.00073 s | 0.0221 s | +2.7% | +1.4% | +5.5% | +1.4% | +2.7% | +0.5% | +1.4% | 0.0% |
| BoltzmannWealth (large) | 0.05719 s | 0.2591 s | -0.7% | -2.0% | +1.1% | -3.2% | -0.5% | -4.9% | -0.7% | +0.8% |
| Schelling (small) | 0.01087 s | 0.0346 s | +3.3% | +4.9% | -2.9% | -3.5% | -3.4% | -4.0% | -4.2% | -4.3% |
| Schelling (large) | 0.10909 s | 0.4504 s | +6.0% | +11.4% | +0.1% | +1.0% | +3.0% | +5.7% | +1.0% | +4.8% |
| WolfSheep (small) | 0.00358 s | 0.0613 s | +2.5% | +0.7% | 0.0% | -0.8% | +0.8% | +0.2% | +3.4% | +3.8% |
| WolfSheep (large) | 0.05467 s | 0.7558 s | +10.3% | +12.4% | +1.9% | +4.4% | +1.0% | +2.9% | +3.7% | +8.0% |
| BoidFlockers (small) | 0.00140 s | 0.1189 s | +1.4% | +7.8% | +1.4% | +1.0% | -1.4% | +1.2% | 0.0% | +4.0% |
| BoidFlockers (large) | 0.00264 s | 0.1693 s | +3.8% | +6.7% | -1.5% | -0.4% | -3.0% | +0.4% | +3.4% | +11.0% |

Results:
benchmark_output.txt

@codebreaker32
Collaborator Author

I have reverted the PR to use the original fastest_init (minimum time) approach.

@EwoutH
Member

EwoutH commented Jan 24, 2026

Thanks for the effort.

Can you give me some insights/statistics on how much the warm-up helps?

@codebreaker32
Collaborator Author

Can you give me some insights/statistics on how much the warm-up helps?

| Model | Avg warm-up | Stable run | Improvement |
|---|---|---|---|
| BoltzmannWealth (small) | 0.0342 s | 0.0292 s | +14.7% |
| BoltzmannWealth (large) | 0.4770 s | 0.4341 s | +9.0% |
| Schelling (small) | 0.0648 s | 0.0567 s | +12.6% |
| Schelling (large) | 0.8695 s | 0.8160 s | +6.2% |
| WolfSheep (small) | 0.0905 s | 0.0806 s | +10.9% |
| WolfSheep (large) | 1.1643 s | 1.0926 s | +6.2% |
| BoidFlockers (small) | 0.1691 s | 0.1550 s | +8.3% |
| BoidFlockers (large) | 0.2376 s | 0.2215 s | +6.8% |

improvement = (warm_up - stable_run)/warm_up
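Spelled out in code, using the BoltzmannWealth (small) row as a check (the table's +14.7% presumably comes from unrounded timings):

```python
def improvement(warm_up, stable_run):
    """Relative speed-up of stable runs over warm-up runs, in percent."""
    return (warm_up - stable_run) / warm_up * 100

# BoltzmannWealth (small): 0.0342 s avg warm-up vs 0.0292 s stable run
print(round(improvement(0.0342, 0.0292), 1))  # → 14.6
```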

@EwoutH
Member

EwoutH commented Jan 24, 2026

Sorry, I meant reliability statistics, not speed. So standard deviation, quartile intervals, maybe a histogram or boxplot.

@codebreaker32
Collaborator Author

codebreaker32 commented Jan 25, 2026

Sorry, I meant reliability statistics, not speed. So standard deviation, quartile intervals, maybe a histogram or boxplot.

Without warm-up:

*(boxplot: timings_1_boxplot)*

With warm-up (3 runs):

*(boxplot: timings_2_boxplot)*
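The spread reduction the boxplots show can also be summarized numerically with the stdlib statistics module. A sketch; the timing lists here are synthetic placeholders for the real per-run data:

```python
import random
import statistics

# Synthetic timings standing in for real per-run measurements: the
# no-warm-up list is given roughly 3x the spread of the warmed-up one.
random.seed(42)
without_warmup = [0.030 + random.random() * 0.010 for _ in range(25)]
with_warmup = [0.029 + random.random() * 0.003 for _ in range(25)]

for label, timings in [("without warm-up", without_warmup),
                       ("with warm-up", with_warmup)]:
    q1, median, q3 = statistics.quantiles(timings, n=4)  # quartile cut points
    print(f"{label}: stdev={statistics.stdev(timings):.5f} s, "
          f"median={median:.5f} s, IQR=[{q1:.5f}, {q3:.5f}]")
```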

@quaquel
Member

quaquel commented Jan 25, 2026

I think that is quite conclusive. Warm-up is variance-reducing.

Member

@quaquel quaquel left a comment


Thanks a lot!

@quaquel quaquel added the ignore-for-release and ci labels Jan 25, 2026
@quaquel quaquel merged commit aae552a into mesa:main Jan 25, 2026
16 checks passed
@codebreaker32 codebreaker32 deleted the benchmarks branch February 17, 2026 16:46