Make benchmarking more robust#3203

Merged
quaquel merged 8 commits into mesa:main from codebreaker32:benchmarks
Jan 25, 2026

Conversation

@codebreaker32
Collaborator

@codebreaker32 codebreaker32 commented Jan 24, 2026

Summary

This PR refactors benchmarks/global_benchmark.py to address high variance and inconsistency in our performance benchmarks (observed up to ±30% locally and on CI). The changes align our methodology with industry standards (e.g., NetLogo, ASV, CPython) by introducing warm-up periods, controlling garbage collection, and improving timer precision.

Motivation

The current benchmarking setup suffers from several critical flaws that cause unstable results:

  1. We currently measure from the very first run, which captures the cold-start penalty. Standard practice (in NetLogo and ASV) is to run the model 3+ times to "warm up" the system before starting the timer.
  2. Leaving the garbage collector enabled allows it to fire unpredictably during timed runs, causing random spikes in execution time.
  3. timeit.default_timer() does not make its timing guarantees explicit; time.perf_counter() is documented as high-resolution and monotonic (on Python 3, default_timer is in fact an alias for perf_counter, so this is a clarity change).
  4. As noted in issue Memory leak in models #3179, there is still a memory issue. It would be good to call model.remove_all_agents() at the end of run_model.
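The GC effect in point 2 can be seen with a minimal sketch; the `workload` function here is a hypothetical stand-in for a model run, not code from this repo:

```python
import gc
import time

def workload():
    # Hypothetical stand-in for a model run: many short-lived allocations,
    # exactly the pattern that can trigger a generational GC pass mid-run.
    return [str(i) for i in range(100_000)]

def time_once(disable_gc):
    if disable_gc:
        gc.collect()   # start each run from a consistent heap state
        gc.disable()   # no collector pauses inside the timed region
    start = time.perf_counter()  # high-resolution, monotonic timer
    workload()
    elapsed = time.perf_counter() - start
    if disable_gc:
        gc.enable()    # always restore normal GC behavior
    return elapsed

with_gc = [time_once(False) for _ in range(10)]
without_gc = [time_once(True) for _ in range(10)]
```

Comparing the spread (max − min) of the two lists typically shows a tighter distribution when GC is disabled during the timed region.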

Implementation

  1. Added a loop that executes the model 3 times before starting the actual timer. These runs are discarded so the system (caches, JIT) is primed.
  2. Controlled GC:
  • gc.disable() is called before the timing loop to prevent random pauses.
  • gc.collect() is called explicitly to ensure a consistent memory state for each run.
  • gc.enable() is called afterwards to restore normal GC behavior.
  3. Replaced timeit.default_timer() with time.perf_counter() to make the high-resolution, monotonic timing explicit.
  4. Added a call to model.remove_all_agents() at the end of run_model to mitigate memory accumulation (addressing @quaquel's feedback).
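Taken together, these pieces amount to a timing harness roughly like the following sketch. `ToyModel`, `run_model`, and the step/repeat counts are hypothetical placeholders, not the actual code in benchmarks/global_benchmark.py:

```python
import gc
import time

WARMUP_RUNS = 3

class ToyModel:
    """Hypothetical stand-in for a Mesa model."""
    def __init__(self):
        self.agents = list(range(1000))
    def step(self):
        self.agents = [a + 1 for a in self.agents]
    def remove_all_agents(self):
        self.agents.clear()

def run_model(model, steps=50):
    for _ in range(steps):
        model.step()
    model.remove_all_agents()  # mitigate memory accumulation between runs

def benchmark(model_factory, repeats=5):
    # Warm-up: discarded runs that prime caches before anything is measured.
    for _ in range(WARMUP_RUNS):
        run_model(model_factory())
    timings = []
    for _ in range(repeats):
        model = model_factory()
        gc.collect()                 # consistent memory state for each run
        gc.disable()                 # no collector pauses in the timed region
        start = time.perf_counter()  # high-resolution, monotonic timer
        run_model(model)
        timings.append(time.perf_counter() - start)
        gc.enable()                  # restore normal GC behavior
    return min(timings)              # the PR settled on the fastest-run metric

best = benchmark(ToyModel)
```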

@codebreaker32
Collaborator Author

codebreaker32 commented Jan 24, 2026

I ran this locally 5 consecutive times.

Timings comparison (using the median; percentages are relative to Run 1)

| Benchmark | Run 1 Init (Ref) | Run 1 Run (Ref) | Run 2 Init | Run 2 Run | Run 3 Init | Run 3 Run | Run 4 Init | Run 4 Run | Run 5 Init | Run 5 Run |
|---|---|---|---|---|---|---|---|---|---|---|
| BoltzmannWealth (small) | 0.00113 s | 0.0271 s | -2.7% | +3.0% | +9.7% | +7.0% | +9.7% | +11.1% | -2.7% | +0.7% |
| BoltzmannWealth (large) | 0.07195 s | 0.2977 s | -8.1% | +0.5% | +2.2% | +18.7% | +28.1% | +38.8% | -6.3% | +1.0% |
| Schelling (small) | 0.01325 s | 0.0411 s | -1.2% | -0.5% | +6.4% | +4.9% | +20.4% | +23.6% | +2.8% | +2.9% |
| Schelling (large) | 0.14845 s | 0.5902 s | -4.3% | +5.3% | +12.1% | +33.6% | +4.5% | +25.2% | -4.7% | -0.7% |
| WolfSheep (small) | 0.00490 s | 0.0738 s | +9.8% | +6.4% | +18.8% | +16.9% | +1.6% | +8.5% | +3.1% | +5.1% |
| WolfSheep (large) | 0.06919 s | 0.8861 s | +7.1% | +22.6% | +2.1% | +22.5% | -4.5% | +7.0% | +7.9% | +3.8% |
| BoidFlockers (small) | 0.00226 s | 0.1561 s | +18.1% | +13.8% | +22.1% | +6.6% | +14.6% | +1.9% | +5.8% | +1.9% |
| BoidFlockers (large) | 0.00464 s | 0.2244 s | -3.0% | +2.6% | +1.3% | +6.4% | -10.3% | +0.8% | -16.4% | +2.5% |

Here are the script to reproduce this locally and the results:
benchmark_output.txt
run_benchmark.sh

@codebreaker32
Collaborator Author

codebreaker32 commented Jan 24, 2026

I can still see considerable regression locally

@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🔴 +6.2% [+4.9%, +7.3%] | 🔵 +0.6% [+0.2%, +0.9%] |
| BoltzmannWealth | large | 🔵 -3.1% [-8.0%, +0.8%] | 🔵 -4.5% [-8.6%, +0.1%] |
| Schelling | small | 🔵 +3.0% [+2.5%, +3.6%] | 🔴 +4.0% [+3.9%, +4.1%] |
| Schelling | large | 🔵 -0.8% [-3.6%, +1.8%] | 🟢 -4.5% [-6.1%, -3.1%] |
| WolfSheep | small | 🔵 -3.6% [-5.2%, -2.3%] | 🔵 +1.2% [+0.8%, +1.4%] |
| WolfSheep | large | 🟢 -20.7% [-26.7%, -14.4%] | 🔵 -0.8% [-2.2%, +0.8%] |
| BoidFlockers | small | 🔵 +3.1% [+2.7%, +3.6%] | 🔵 -1.0% [-1.2%, -0.8%] |
| BoidFlockers | large | 🔵 +1.8% [+1.2%, +2.4%] | 🔵 -0.8% [-1.0%, -0.6%] |

@codebreaker32
Collaborator Author

Timings comparison (using the min; percentages are relative to Run 1)

| Benchmark | Run 1 Init (Ref) | Run 1 Run (Ref) | Run 2 Init | Run 2 Run | Run 3 Init | Run 3 Run | Run 4 Init | Run 4 Run | Run 5 Init | Run 5 Run |
|---|---|---|---|---|---|---|---|---|---|---|
| BoltzmannWealth (small) | 0.00073 s | 0.0221 s | +2.7% | +1.4% | +5.5% | +1.4% | +2.7% | +0.5% | +1.4% | 0.0% |
| BoltzmannWealth (large) | 0.05719 s | 0.2591 s | -0.7% | -2.0% | +1.1% | -3.2% | -0.5% | -4.9% | -0.7% | +0.8% |
| Schelling (small) | 0.01087 s | 0.0346 s | +3.3% | +4.9% | -2.9% | -3.5% | -3.4% | -4.0% | -4.2% | -4.3% |
| Schelling (large) | 0.10909 s | 0.4504 s | +6.0% | +11.4% | +0.1% | +1.0% | +3.0% | +5.7% | +1.0% | +4.8% |
| WolfSheep (small) | 0.00358 s | 0.0613 s | +2.5% | +0.7% | 0.0% | -0.8% | +0.8% | +0.2% | +3.4% | +3.8% |
| WolfSheep (large) | 0.05467 s | 0.7558 s | +10.3% | +12.4% | +1.9% | +4.4% | +1.0% | +2.9% | +3.7% | +8.0% |
| BoidFlockers (small) | 0.00140 s | 0.1189 s | +1.4% | +7.8% | +1.4% | +1.0% | -1.4% | +1.2% | 0.0% | +4.0% |
| BoidFlockers (large) | 0.00264 s | 0.1693 s | +3.8% | +6.7% | -1.5% | -0.4% | -3.0% | +0.4% | +3.4% | +11.0% |

Results:
benchmark_output.txt

@codebreaker32
Collaborator Author

I have reverted the PR to use the original fastest_init (minimum time) approach.

@EwoutH
Member

EwoutH commented Jan 24, 2026

Thanks for the effort.

Can you give me some insights/statistics on how much the warm-up helps?

@codebreaker32
Collaborator Author

Can you give me some insights/statistics on how much the warm-up helps?

| Model | Avg warm-up | Stable run | Improvement |
|---|---|---|---|
| BoltzmannWealth (small) | 0.0342 s | 0.0292 s | +14.7% |
| BoltzmannWealth (large) | 0.4770 s | 0.4341 s | +9.0% |
| Schelling (small) | 0.0648 s | 0.0567 s | +12.6% |
| Schelling (large) | 0.8695 s | 0.8160 s | +6.2% |
| WolfSheep (small) | 0.0905 s | 0.0806 s | +10.9% |
| WolfSheep (large) | 1.1643 s | 1.0926 s | +6.2% |
| BoidFlockers (small) | 0.1691 s | 0.1550 s | +8.3% |
| BoidFlockers (large) | 0.2376 s | 0.2215 s | +6.8% |

improvement = (warm_up - stable_run)/warm_up
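Spelled out in code, using the BoltzmannWealth (small) row as a check (the table's +14.7% presumably comes from unrounded timings):

```python
def improvement(warm_up, stable_run):
    """Relative speed-up of stable runs over warm-up runs, in percent."""
    return (warm_up - stable_run) / warm_up * 100

# BoltzmannWealth (small): 0.0342 s avg warm-up vs 0.0292 s stable run
print(round(improvement(0.0342, 0.0292), 1))  # → 14.6
```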

@EwoutH
Member

EwoutH commented Jan 24, 2026

Sorry, I meant reliability statistics, not speed. So standard deviation, quartile intervals, maybe a histogram or boxplot.

@codebreaker32
Collaborator Author

codebreaker32 commented Jan 25, 2026

Sorry, I meant reliability statistics, not speed. So standard deviation, quartile intervals, maybe a histogram or boxplot.

Without warm-up:

*(boxplot: timings_1_boxplot)*

With warm-up (3 runs):

*(boxplot: timings_2_boxplot)*
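The spread reduction the boxplots show can also be summarized numerically with the stdlib statistics module. A sketch; the timing lists here are synthetic placeholders for the real per-run data:

```python
import random
import statistics

# Synthetic timings standing in for real per-run measurements: the
# no-warm-up list is given roughly 3x the spread of the warmed-up one.
random.seed(42)
without_warmup = [0.030 + random.random() * 0.010 for _ in range(25)]
with_warmup = [0.029 + random.random() * 0.003 for _ in range(25)]

for label, timings in [("without warm-up", without_warmup),
                       ("with warm-up", with_warmup)]:
    q1, median, q3 = statistics.quantiles(timings, n=4)  # quartile cut points
    print(f"{label}: stdev={statistics.stdev(timings):.5f} s, "
          f"median={median:.5f} s, IQR=[{q1:.5f}, {q3:.5f}]")
```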

@quaquel
Member

quaquel commented Jan 25, 2026

I think that is quite conclusive. Warm-up is variance-reducing.

Member

@quaquel quaquel left a comment


Thanks a lot!

@quaquel quaquel added the ignore-for-release and ci labels Jan 25, 2026
@quaquel quaquel merged commit aae552a into mesa:main Jan 25, 2026
16 checks passed
@codebreaker32 codebreaker32 deleted the benchmarks branch February 17, 2026 16:46