Add data registry#3156

Merged
quaquel merged 101 commits into mesa:main from quaquel:data_registry
Feb 10, 2026

Conversation

@quaquel
Member

@quaquel quaquel commented Jan 17, 2026

Summary

Expanding on an idea from #3145 as well as past discussion on data collection, this PR adds a novel data-registry approach to Mesa. This new approach rests on the idea of a DataSet. A DataSet contains part of the state of a model at a given instant.

The key point is that current data collection does too much. With this PR, we separate the getting of the state of part of the model at a given instant from the storage of these states over time. With explicit DataSet classes, it's now trivial to extend this if you need your own custom data collection. Another benefit of DataSet classes is that we get rid of the complex dict-style configuration of what to collect. Everything can be handled through args for attributes and kwargs for callables.

This PR adds

  • DataRegistry: a dict-like collection of datasets. It is always available via model.data_registry.
  • ModelDataSet: a dataset for gathering model-level data.
  • AgentDataSet: a dataset for gathering agent data from an AbstractAgentSet.
  • TableDataSet: a dataset for gathering miscellaneous data, works by adding rows to it.
  • NumpyAgentDataSet: a Numpy array-based dataset containing agent data for a specified Agent class.
  • A DataSet protocol

Datasets gather data from fields. Fields are always strings and assumed to be accessible via attribute access. DataSet does not support lambda functions. If you want to do something like that, use properties or descriptors instead. Data is accessed via DataSet.data. ModelDataSet and AgentDataSet will at that moment gather the data and return it. NumpyAgentDataSet will return a view on the numpy array containing the data. This view is always in sync with the attribute values, so this data is not separately gathered on request. TableDataSet will return the current list of rows.
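Because fields must be plain attribute names, a derived quantity that one might otherwise express as a lambda can instead be exposed as a property. A minimal sketch (the class and field names here are hypothetical, not part of this PR):

```python
class MyAgent:
    """Hypothetical agent exposing a derived value as a property,
    so it can be tracked like any other string field."""

    def __init__(self, wealth, age):
        self.wealth = wealth
        self.age = age

    @property
    def wealth_per_year(self):
        # derived value; trackable via fields=["wealth_per_year"]
        return self.wealth / self.age
```

This keeps the field specification homogeneous (strings only) while still supporting computed values.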

This PR is a first draft, exploring the idea. Feedback is very much welcome. The focus is more on fleshing out the API than on optimizing the code itself.

API

class MyModel(Model):

    def __init__(self, rng=None):
        super().__init__(rng=rng)

        self.model_output = DataRegistry()
        self.model_output.track_agents("agent_data", self.agents, fields=["wealth", "health", "age"])
        self.model_output.track_model("model_data", self, fields="average_age")

        self.data_collector = CollectorListener()

@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🔵 -0.7% [-1.3%, -0.2%] | 🔵 -1.1% [-1.3%, -0.8%] |
| BoltzmannWealth | large | 🔵 -0.2% [-0.6%, +0.3%] | 🔵 -1.2% [-2.7%, +0.3%] |
| Schelling | small | 🔵 +0.1% [-0.1%, +0.3%] | 🔵 -1.1% [-1.2%, -1.0%] |
| Schelling | large | 🔵 -0.1% [-0.5%, +0.3%] | 🔵 -0.4% [-0.9%, +0.3%] |
| WolfSheep | small | 🔵 -0.3% [-0.6%, +0.1%] | 🔵 -1.2% [-1.3%, -1.0%] |
| WolfSheep | large | 🔵 +0.9% [+0.3%, +1.6%] | 🔵 +1.3% [+0.6%, +1.9%] |
| BoidFlockers | small | 🔵 -0.6% [-1.1%, -0.1%] | 🔵 +1.0% [+0.7%, +1.2%] |
| BoidFlockers | large | 🔵 +0.2% [-0.3%, +0.7%] | 🔵 +0.5% [+0.2%, +0.8%] |

@quaquel quaquel changed the title initial commit Add data registry Jan 17, 2026
@codebreaker32
Collaborator

codebreaker32 commented Jan 17, 2026

I did some research for (3) and found that we technically can, but probably shouldn't:

  1. operator.attrgetter does not support default values. Even if one agent is missing the attribute, the entire simulation would crash with an AttributeError. (If we wrap it in a try...except block, we lose the performance gain we were trying to achieve.)
  2. attrgetter has a noticeable setup cost. DataSet pays this cost only once in __init__, but AgentSet would have to pay it on every call to get. (For fewer than ~200 agents, it is slower than a plain "for" loop.)
  3. The biggest overhead preventing AgentSet from being vector-fast is that it stores Python objects, which are scattered across the heap.

However, it can be Python-fast (e.g., supporting both AgentSet and WeakAgentSet as discussed in #3128).

@codebreaker32
Collaborator

One more minor thing I'd like to address: operator.attrgetter returns a scalar when len(self._args) == 1, which would result in a TypeError if we tried to zip it. So, for this special case, we'll have to force the result into a tuple.
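The scalar-vs-tuple behavior can be seen directly; the `get_fields` normalization below is a hypothetical sketch of the workaround, not code from this PR:

```python
from operator import attrgetter


class Agent:
    def __init__(self):
        self.wealth = 5
        self.age = 30


agent = Agent()

# One attribute name yields a bare value; two or more yield a tuple.
single = attrgetter("wealth")(agent)           # 5
multiple = attrgetter("wealth", "age")(agent)  # (5, 30)


def get_fields(obj, fields):
    """Always return a tuple, so downstream zip() works uniformly."""
    values = attrgetter(*fields)(obj)
    return values if len(fields) > 1 else (values,)
```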

@EwoutH
Member

EwoutH commented Jan 17, 2026

Thanks a lot for this. Very useful pathfinding.

The separation of extraction (DataSet) from storage and timing/triggering (future work) is helpful. Always chop problems into smaller problems if you can.

Looking back at our original discussions, a few questions on how flexible the extraction layer can be:

  1. Dynamic agent selection: Right now AgentDataSet takes a fixed AgentSet at initialization. Could it accept a callable instead, so membership is re-evaluated each time .data is accessed? This would enable tracking things like "all starving agents" where the set changes each step:
# Instead of evaluating once at init
registry.track_agents("starving", model.agents.select(lambda a: a.energy < 10))

# Evaluate fresh each time
registry.track_agents("starving", lambda: model.agents.select(lambda a: a.energy < 10))

Of course, this adds some complexity, since you're no longer working with a fixed set of agents.

  2. Aggregation: Can DataSets reference the data from other DataSets? For example, if I'm already collecting wealth from agents, can I create a gini DataSet that operates on that extracted wealth data rather than re-extracting it?
wealth_data = AgentDataSet("wealth", model.agents, "wealth")

# Option A: Pass the wealth_data DataSet as the "model"?
gini_data = ModelDataSet("gini", wealth_data, gini=lambda ds: calculate_gini(ds.data))

# Option B: Keep model, but callable references wealth_data?
gini_data = ModelDataSet("gini", model, gini=lambda m: calculate_gini(wealth_data.data))

# Option C: Something else entirely?
  3. DataSet composition: More generally, is there a pattern for DataSets to depend on other DataSets to avoid redundant extraction and enable data pipelines?

@quaquel
Member Author

quaquel commented Jan 17, 2026

Dynamic agent selection

This indeed currently does not work (nor does it for the existing data collector). I am actually thinking of making this possible in a custom DynamicAgentDataSet class that listens for agent_registered signals for inclusion and agent_deregistered signals for removal. So no need to filter on every call to .data, but make the dynamic AgentSet itself reactive instead.

Aggregation: Can DataSets reference the data from other DataSets?
DataSet composition: More generally, is there a pattern for DataSets to depend on other DataSets to avoid redundant extraction and enable data pipelines

This is related to the _is_dirty idea in #3145. At present, we cannot do this yet, but this design has more potential to make it possible. The most promising idea I have is to take a cue from the property layers and continuous space position approach. Basically, the attributes you want to track in an AgentDataSet, or even a specialized NumpyAgentDataSet, are defined as properties on the agent, which internally provide a view into a numpy array. In this way, the dataset is always in sync with the agent state because they are just two different views on the same data.
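The property-backed-by-array idea can be sketched with a descriptor. Everything here (`ArrayBackedAttribute`, the `index` slot, the module-level store) is a hypothetical illustration of the technique, not Mesa API:

```python
import numpy as np


class ArrayBackedAttribute:
    """Descriptor storing an agent attribute in a shared numpy array,
    so any view on that array is always in sync with agent state."""

    def __init__(self, array):
        self.array = array  # shared storage, one slot per agent

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return self.array[obj.index]

    def __set__(self, obj, value):
        self.array[obj.index] = value


wealth_store = np.zeros(10)


class MyAgent:
    wealth = ArrayBackedAttribute(wealth_store)

    def __init__(self, index):
        self.index = index  # row in the shared array


a = MyAgent(0)
a.wealth = 42.0
# wealth_store[0] is now 42.0 -- the "dataset" needs no separate collection step
```

Because reads and writes go through the same array, a dataset exposing that array never needs a dirty flag.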

@codebreaker32
Collaborator

Just asking out of curiosity: is it a good idea to make DataSet observable and then derive a new class from ModelDataSet, say ComputedDataSet (which depends on some parent DataSet, e.g. gini depends on wealth), while ModelDataSet purely keeps environmental variables (such as tax_rate or time in Boltzmann)? Whenever the parent DataSet changes, it triggers a signal and the ComputedDataSet does its calculation (zero extraction cost).

class ComputedDataSet(DataSet):

    def __init__(
        self,
        name: str,
        parent: DataSet,
        compute_fn: Callable[[dict], Any],
    ):
        self.name = name
        self.parent = parent
        self.compute_fn = compute_fn
        self.parent.observe(parent.name, SignalType.CHANGE, self.collect)

@quaquel
Member Author

quaquel commented Jan 18, 2026

Just asking out of curiosity: is it a good idea to make DataSet observable and then derive a new class from ModelDataSet, say ComputedDataSet

Something like this might be made to work. However, any dataset is, by definition, already computed, because it depends on something else. For example, a wealth dataset depends on the wealth attributes of the agents. To know that _is_dirty is true, the wealth dataset would have to subscribe to all agents, and all agents would have to make self.wealth observable. The net result is an explosion of signals. In the simple case of the Boltzmann model, any given interaction between two agents results in two signals. So in any given step of the model, you might have to process 200 signals (assuming 100 agents). In contrast, a pull-based approach involves a single loop over 100 agents.

@codebreaker32
Collaborator

You are right that a pure reactive approach would cause a massive signal explosion and a performance nightmare, but that is not what I am proposing.
I am proposing a "smart pull" architecture that combines the efficiency of the pull approach with the modularity of reactivity.

The Hybrid Flow:

  1. Silent Updates: Agents update self.wealth normally. They do not emit signals. There is zero overhead here.
  2. Trigger (1 Signal per Step): The AgentDataSet subscribes only to the global model.step signal (or sparse events). It assumes the data is "dirty" at the end of every step.
  3. The "Smart Pull": When the step signal fires, AgentDataSet performs the standard Pull

Result: we have the memory buffer. Only after the pull is complete does AgentDataSet emit a single CHANGE signal.

ComputedDataSet (e.g., Gini) listens to this signal and grabs the already-pulled buffer from AgentDataSet to calculate the metric.
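The hybrid flow above can be sketched as follows. `SmartPullDataSet`, its `observe` mechanism, and the Gini listener are illustrative stand-ins, not Mesa's actual signal API:

```python
class SmartPullDataSet:
    """Pulls agent data once per step, then emits a single CHANGE signal."""

    def __init__(self, agents, field):
        self.agents = agents
        self.field = field
        self.buffer = []
        self._listeners = []

    def observe(self, callback):
        self._listeners.append(callback)

    def on_step(self):
        # the one "smart pull" per step (agents emitted no signals themselves)
        self.buffer = [getattr(a, self.field) for a in self.agents]
        for callback in self._listeners:  # single CHANGE notification
            callback(self.buffer)


class GiniDataSet:
    """Recomputes from the already-pulled buffer; zero extra extraction."""

    def __init__(self, parent):
        self.value = None
        parent.observe(self._recompute)

    def _recompute(self, wealths):
        n, total = len(wealths), sum(wealths)
        if n == 0 or total == 0:
            self.value = 0.0
            return
        cum = sum((i + 1) * w for i, w in enumerate(sorted(wealths)))
        self.value = (2 * cum) / (n * total) - (n + 1) / n
```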

@quaquel
Member Author

quaquel commented Jan 18, 2026

You are right that a pure reactive approach would cause a massive signal explosion and a performance nightmare, but that is not what I am proposing.
I am proposing a "smart pull" architecture that combines the efficiency of the pull approach with the modularity of reactivity.

That is a design that indeed makes more sense. However, it is also fragile. Basically, the dataset becomes dirty on step and is clean again after the first call to data. But there is nothing preventing the user from updating agents in the dataset afterwards:

self.agents.shuffle_do("step_a")
self.update_stats()  # --> calls .data on some of our DataSets
self.agents.shuffle_do("step_b")
self.update_stats()  # assumes the data is clean, but that is no longer true

So, that is why a numpy view style design, as we use for the property layers (i.e., Cell.elevation) and ContinuousAgent.position, is superior. Any call to .data just returns the numpy array. This numpy array, by construction, is always in sync with the relevant attributes in all the agents. No need to bother with a dirty flag or anything.

However, I want to leave the numpy style agent data set for a future PR to avoid complicating this one too much.

@quaquel
Member Author

quaquel commented Feb 8, 2026

I added the unique_ids of agents as a separate numpy array, so this is now included. However, I have left .data the same for now. In many cases, you only want the data and are not interested in the unique_ids of the agents to which this data belongs.

For the data recording, unique_ids do matter. Ideally, you want to store agent data by unique_id so you can trace agents over the simulation. @codebreaker32, I would love your perspective on this and how you think we could include this in the API. There is also a broader point on DataSet.data being heterogeneously typed which is unavoidable but might still benefit from some cleaning.

@codebreaker32
Collaborator

codebreaker32 commented Feb 8, 2026

I recommend we keep .data pure (values only) to preserve the ease of mathematical operations (e.g., np.mean(dataset.data)), and instead expose IDs via a separate interface.

  • The Dataset: We add a public property dataset.ids (or dataset.index) to NumpyAgentDataSet that returns the _agent_ids array corresponding to the active rows.
  • The Listener: We update CollectorListener to handle this side-channel data. Since the listener has access to the registry, it can do something like:
# inside CollectorListener._store_dataset_snapshot
dataset = self.registry.datasets[name]
data = dataset.data

if hasattr(dataset, "ids"):
    ids = dataset.ids
    # stack the IDs with the data (e.g., np.column_stack) for storage

This way, the DataSet remains a fast numerical container, and the Listener handles the complexity of joining IDs for storage.

There is also a broader point on DataSet.data being heterogeneously typed which is unavoidable

Agreed; moreover, I see the Listener as the normalization layer.

@quaquel
Member Author

quaquel commented Feb 8, 2026

I recommend we keep .data pure (values only) to preserve the ease of mathematical operations (e.g., np.mean(dataset.data)), and instead expose IDs via a separate interface.

I agree with this. I'll add a NumpyAgentDataSet.agent_ids property.

Agree and rather I see "Listener" as the normalization layer.

OK, for now let's keep it this way. We might revisit this depending on whether and how we want to use the DataRegistry on the UI side of things.

@EwoutH
Member

EwoutH commented Feb 9, 2026

From the usage example in the PR description:

  self.model_output.track_agents("agent_data", self.model.agents, "wealth", "health", "age")
  self.model_output.track_model("model_data", self, "average_age")

I find it a bit weird that you can just pile on arguments. Can we make this a list or a set?

I think requiring keywords here might also help (including for future API changes).

@quaquel
Member Author

quaquel commented Feb 9, 2026

I find it a bit weird that you can just pile on arguments. Can we make this a list or a set?
I think requiring keywords here might also help (including for future API changes).

I actually thought it was very convenient: you are just passing the different attributes/properties/descriptors you want to collect. But yes, you could also do this via, e.g., a single argument fields: str | Iterable[str].

Member

@EwoutH EwoutH left a comment


Pre-approving since this is almost fully in the experimental space.

@quaquel
Member Author

quaquel commented Feb 10, 2026

So, before merging this, I would like to know if there is a preference regarding fields. We have three options

# current implementations
registry.track_model("model_dataset", "attr1", "attr2", "attr3")

# have a single fields argument
registry.track_model("model_dataset",  ["attr1", "attr2", "attr3"])

# have a single fields keyword argument
registry.track_model("model_dataset",  fields=["attr1", "attr2", "attr3"])

@EwoutH, @codebreaker32 do either of you have a clear preference? I like the convenience of the current implementation, but I see @EwoutH's point of future extendability. We also separately indicated a desire to move towards a keyword-preferred design, which would favor option 3.

@codebreaker32
Collaborator

I am fine with option 3 as well. It is self-documenting, extensible, and aligns with the project's shift toward explicit keyword arguments.

@quaquel
Member Author

quaquel commented Feb 10, 2026

I added NumpyAgentDataSet.agent_ids and shifted to fields as a keyword argument. I am merging this so we can move on to finalizing #3145.

@quaquel quaquel merged commit 48d08c3 into mesa:main Feb 10, 2026
12 of 14 checks passed
@quaquel quaquel deleted the data_registry branch February 10, 2026 14:55
@EwoutH EwoutH added the experimental Release notes label label Feb 11, 2026

Labels: experimental, feature