Add DataSet.record()#3295

Merged
quaquel merged 14 commits into mesa:main from quaquel:dataset_record
Feb 14, 2026

Conversation

@quaquel
Member

@quaquel quaquel commented Feb 13, 2026

This PR adds a new record method to DataSet. Building on #3145 and #3156, and the discussion on data collection, this makes for quite an elegant API:

self.recorder = DataRecorder(self)
self.data_registry.track_agents(self.agents, "agent_data", "wealth").record(self.recorder)
self.data_registry.track_model(self, "model_data", "gini").record(
    self.recorder,
    configuration=DatasetConfig(start_time=4, interval=2),
)
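To make the effect of `start_time=4, interval=2` concrete, here is a minimal sketch of the recording times such a configuration would select. It assumes (this is not confirmed in the PR) that recording happens at `start_time` and then every `interval` time units after it. The helper `recording_times` and its `end_time` parameter are hypothetical and not part of Mesa.

```python
def recording_times(start_time: float, interval: float, end_time: float) -> list[float]:
    """Return the times at which a dataset would be recorded.

    Assumption: recording starts at `start_time` (inclusive) and repeats
    every `interval` time units, up to and including `end_time`.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    times = []
    t = start_time
    while t <= end_time:
        times.append(t)
        t += interval
    return times


print(recording_times(start_time=4, interval=2, end_time=10))  # [4, 6, 8, 10]
```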

Implementation details
I kept this PR as focused as possible. At its core, it adds a new method, DataSet.record(recorder, configuration). Internally, I added BaseDataRecorder.add_dataset(dataset: DataSet, configuration: DataConfig | None = None). I also updated the __init__ of BaseDataRecorder to allow config=None, and I removed the behavior where a recorder automatically records all datasets. This last change is not strictly needed for this PR, but a key design principle is that we want to separate datasets at a given instant from their recording over time. Automatically recording datasets defeats this purpose.
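The delegation described above can be sketched roughly as follows. This is not the Mesa implementation: the classes are simplified stand-ins that reuse the names from this PR, and the `configs` attribute and the default configuration values are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetConfig:
    # Hypothetical stand-in; field names follow the example in this PR.
    start_time: float = 0
    interval: float = 1


@dataclass
class DataSet:
    name: str

    def record(self, recorder: BaseDataRecorder, configuration: DatasetConfig | None = None) -> None:
        # Registration is delegated to the recorder the caller passes in.
        recorder.add_dataset(self, configuration)


@dataclass
class BaseDataRecorder:
    # Maps dataset name -> configuration. It starts empty, so nothing is
    # recorded unless a dataset was registered explicitly.
    configs: dict[str, DatasetConfig] = field(default_factory=dict)

    def add_dataset(self, dataset: DataSet, configuration: DatasetConfig | None = None) -> None:
        # Fall back to a default configuration when none is given.
        self.configs[dataset.name] = configuration or DatasetConfig()


recorder = BaseDataRecorder()
agent_data = DataSet("agent_data")
model_data = DataSet("model_data")
unrecorded = DataSet("other_data")  # exists, but is never registered

agent_data.record(recorder)
model_data.record(recorder, DatasetConfig(start_time=4, interval=2))

print(sorted(recorder.configs))  # ['agent_data', 'model_data']
```

In this sketch, a dataset that never calls `record()` simply does not appear in the recorder, which mirrors the removal of automatic recording.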

I deliberately left out other ideas from the discussion because there is no consensus on those yet.

@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🔴 +4.1% [+3.4%, +4.8%] | 🔴 +26.6% [+26.1%, +26.9%] |
| BoltzmannWealth | large | 🔵 +0.8% [-0.3%, +1.9%] | 🔴 +20.8% [+16.7%, +24.7%] |
| Schelling | small | 🔵 +0.9% [+0.6%, +1.3%] | 🔵 +0.7% [+0.6%, +0.9%] |
| Schelling | large | 🔵 +2.4% [+1.9%, +3.0%] | 🔴 +11.0% [+9.4%, +12.6%] |
| WolfSheep | small | 🔵 +0.2% [-0.1%, +0.5%] | 🔵 -0.7% [-0.8%, -0.6%] |
| WolfSheep | large | 🔵 +1.5% [+0.7%, +2.3%] | 🔴 +5.4% [+4.4%, +6.3%] |
| BoidFlockers | small | 🟢 -4.9% [-5.2%, -4.6%] | 🔵 -0.6% [-0.7%, -0.4%] |
| BoidFlockers | large | 🔵 -3.4% [-3.9%, -3.0%] | 🔵 -0.8% [-0.9%, -0.6%] |

@EwoutH EwoutH mentioned this pull request Feb 14, 2026
42 tasks
@quaquel quaquel marked this pull request as ready for review February 14, 2026 17:53
@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🔵 +1.2% [+0.7%, +1.7%] | 🔴 +26.5% [+26.4%, +26.7%] |
| BoltzmannWealth | large | 🔵 +2.3% [+1.4%, +3.0%] | 🔴 +20.5% [+16.7%, +23.6%] |
| Schelling | small | 🔵 +1.1% [+1.0%, +1.3%] | 🔵 +0.5% [+0.3%, +0.6%] |
| Schelling | large | 🔵 +0.4% [-0.1%, +1.0%] | 🔵 -3.4% [-6.3%, -0.7%] |
| WolfSheep | small | 🔵 +0.4% [+0.3%, +0.6%] | 🔵 +0.2% [+0.0%, +0.4%] |
| WolfSheep | large | 🔵 +0.6% [-0.7%, +1.7%] | 🔵 +1.0% [-1.7%, +3.4%] |
| BoidFlockers | small | 🔵 -1.4% [-1.8%, -1.1%] | 🔵 -1.5% [-1.6%, -1.4%] |
| BoidFlockers | large | 🔵 -1.6% [-2.0%, -1.1%] | 🔵 -1.0% [-1.2%, -0.8%] |

@quaquel quaquel added the enhancement label Feb 14, 2026
@EwoutH
Member

EwoutH commented Feb 14, 2026

I'm concerned we're accumulating some API bloat.

This PR adds .record() to DataSet and add_dataset() to BaseDataRecorder, resulting in this API:

self.recorder = DataRecorder(self)
self.data_registry.track_agents(self.agents, "agent_data", "wealth").record(self.recorder)
self.data_registry.track_model(self, "model_data", "gini").record(
    self.recorder, 
    configuration=DatasetConfig(start_time=4, interval=2)
)

In our recent discussion I thought we were moving towards an API like this:

# No explicit recorder construction - handled internally by model.data
self.data.track_agents(Wolf, "wolf_energy", "energy").record()
self.data.track_model(self, "gini", "gini").record(Schedule(interval=5, start=100))

With as main difference that .record() doesn't take a recorder argument and is internally managed.

With this PR, users now need to understand:

  1. DataRecorder - construct it explicitly
  2. DataSet.record(recorder, configuration) - new method that takes the recorder
  3. BaseDataRecorder.add_dataset(dataset, configuration) - new public method
  4. Configuration can be passed to both DataRecorder.__init__() and .record()

Do you see options to simplify both the mental model for users and the API? I think sensible defaults would also go a long way.

@EwoutH EwoutH added the experimental label Feb 14, 2026
@quaquel
Member Author

quaquel commented Feb 14, 2026

With as main difference that .record() doesn't take a recorder argument and is internally managed.

Users might use different recorders depending on the backend. So, the dataset needs to know the recorder that is to be used.

Also, at the moment, we don't have a default recorder field on the model, nor a configuration via __init__ to set it up. So, datasets cannot rely on any field on the model, nor do they, in general, have a reference to the model, even if a default recorder field did exist.
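As an illustration of why the recorder is passed in explicitly, the sketch below shows two recorders with the same interface but different storage backends. Everything here is hypothetical: `Recorder`, `MemoryRecorder`, `JSONLinesRecorder`, and `record_snapshot` are invented names and do not reflect Mesa's actual recorder classes.

```python
import io
import json
from abc import ABC, abstractmethod


class Recorder(ABC):
    """Hypothetical recorder interface; one subclass per storage backend."""

    @abstractmethod
    def store(self, name: str, time: float, row: dict) -> None: ...


class MemoryRecorder(Recorder):
    def __init__(self) -> None:
        self.rows: list[tuple[str, float, dict]] = []

    def store(self, name: str, time: float, row: dict) -> None:
        # Keep every snapshot in a plain Python list.
        self.rows.append((name, time, row))


class JSONLinesRecorder(Recorder):
    def __init__(self, stream: io.TextIOBase) -> None:
        self.stream = stream

    def store(self, name: str, time: float, row: dict) -> None:
        # Write one JSON object per line to the given text stream.
        self.stream.write(json.dumps({"dataset": name, "time": time, **row}) + "\n")


def record_snapshot(recorder: Recorder, name: str, time: float, row: dict) -> None:
    # The caller chooses the recorder; nothing is looked up on a model.
    recorder.store(name, time, row)


memory = MemoryRecorder()
buffer = io.StringIO()
for recorder in (memory, JSONLinesRecorder(buffer)):
    record_snapshot(recorder, "model_data", 4, {"gini": 0.5})

print(memory.rows)
print(buffer.getvalue())
```

Because the calling code hands over the recorder, the same snapshot logic works with either backend without needing a reference to the model.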

I agree about the end goal of having a clean API. This PR is just a small step toward getting there. I would argue it already simplifies things quite a bit, because there is no longer any need to pass a complex configuration dict to the recorder. Instead, we tie the configuration directly to the dataset via the new record() method.

A next step, in my view, is to see if we can simplify DatasetConfig, potentially via Schedule, or at least by using the same keywords where appropriate.

@EwoutH
Member

EwoutH commented Feb 14, 2026

Thanks for the context, sounds good.

A next step, in my view, is to see if we can simplify DatasetConfig, potentially via Schedule, or at least by using the same keywords where appropriate.

We might create a DataSchedule subclass. Users can then:

  • Pass nothing for the default (collecting every 1 time unit)
  • Pass a Schedule for simple functionality
  • Pass a DataSchedule for more advanced functionality
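A rough sketch of how this proposal could look, assuming a `DataSchedule` that subclasses `Schedule`. Both classes below are simplified stand-ins with assumed fields, and `resolve_schedule` is a hypothetical helper, not an existing Mesa function.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    # Hypothetical stand-in with only the fields needed here.
    start: float = 0
    interval: float = 1


@dataclass(frozen=True)
class DataSchedule(Schedule):
    # Hypothetical subclass adding a recording-only option.
    window_size: int | None = None


def resolve_schedule(schedule: Schedule | None = None) -> DataSchedule:
    """Normalize the three accepted inputs to a DataSchedule."""
    if schedule is None:
        return DataSchedule()  # default: record every 1 time unit
    if isinstance(schedule, DataSchedule):
        return schedule  # advanced options are kept as given
    # Plain Schedule: copy the shared fields, leave the extras at defaults.
    return DataSchedule(start=schedule.start, interval=schedule.interval)


print(resolve_schedule())
print(resolve_schedule(Schedule(start=100, interval=5)))
print(resolve_schedule(DataSchedule(window_size=10)))
```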

@quaquel
Member Author

quaquel commented Feb 14, 2026

It's a bit different. The two share start, end, and interval. For data recording, however, interval takes only a number, not a callable. In addition, DatasetConfig takes window_size, while Schedule takes count. So inheritance does not seem to be the best solution here.

@EwoutH
Member

EwoutH commented Feb 14, 2026

Basically Schedule is a data storage object. It can have some fields that are useful/valid for recurring event scheduling, some that are useful for data recording, and some that are useful for both.

@quaquel quaquel merged commit 42245cd into mesa:main Feb 14, 2026
13 of 14 checks passed
@quaquel quaquel deleted the dataset_record branch February 14, 2026 21:16
@quaquel
Member Author

quaquel commented Feb 14, 2026

Basically Schedule is a data storage object. It can have some fields that are useful/valid for recurring event scheduling, some that are useful for data recording, and some that are useful for both.

Which is why a basic protocol with start, end, and interval might make sense, but a full hierarchy of classes is probably overkill.
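A minimal sketch of what such a protocol might look like, using `typing.Protocol`. The names `HasTiming`, `EventSchedule`, `RecordingConfig`, and `is_due` are hypothetical, the two dataclasses are stand-ins for Schedule and DatasetConfig, and the sketch assumes a numeric `interval` only (no callables).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class HasTiming(Protocol):
    """Hypothetical protocol covering only the shared fields."""

    start: float
    end: float | None
    interval: float


@dataclass
class EventSchedule:
    # Stand-in for Schedule: shared fields plus `count`.
    start: float = 0
    end: float | None = None
    interval: float = 1
    count: int | None = None


@dataclass
class RecordingConfig:
    # Stand-in for DatasetConfig: shared fields plus `window_size`.
    start: float = 0
    end: float | None = None
    interval: float = 1
    window_size: int | None = None


def is_due(timing: HasTiming, time: float) -> bool:
    """Return True if `time` lies on the grid start, start + interval, ..."""
    if time < timing.start or (timing.end is not None and time > timing.end):
        return False
    return (time - timing.start) % timing.interval == 0


print(is_due(RecordingConfig(start=4, interval=2), 6))    # True
print(is_due(EventSchedule(start=100, interval=5), 103))  # False
```

Neither class inherits from the other; both satisfy the protocol structurally, so shared logic such as `is_due` can accept either one.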

Labels

enhancement, experimental

2 participants