Add data registry#3156

Merged
quaquel merged 101 commits into mesa:main from quaquel:data_registry
Feb 10, 2026

Conversation

@quaquel
Member

@quaquel quaquel commented Jan 17, 2026

Summary

Expanding on an idea from #3145 as well as past discussion on data collection, this PR adds a novel data-registry approach to Mesa. This new approach rests on the idea of a DataSet. A DataSet contains part of the state of a model at a given instant.

The key point is that current data collection does too much. With this PR, we separate the getting of the state of part of the model at a given instant from the storage of these states over time. With explicit DataSet classes, it's now trivial to extend this if you need your own custom data collection. Another benefit of DataSet classes is that we get rid of the complex dict-style configuration of what to collect. Everything can be handled through args for attributes and kwargs for callables.

This PR adds

  • DataRegistry: a dict-like collection of datasets. It is always available via model.data_registry.
  • ModelDataSet: a dataset for gathering model-level data.
  • AgentDataSet: a dataset for gathering agent data from an AbstractAgentSet.
  • TableDataSet: a dataset for gathering miscellaneous data, works by adding rows to it.
  • NumpyAgentDataSet: a Numpy array-based dataset containing agent data for a specified Agent class.
  • A DataSet protocol

Datasets gather data from fields. Fields are always strings and assumed to be accessible via attribute access. DataSet does not support lambda functions. If you want to do something like that, use properties or descriptors instead. Data is accessed via DataSet.data. ModelDataSet and AgentDataSet will at that moment gather the data and return it. NumpyAgentDataSet will return a view on the numpy array containing the data. This view is always in sync with the attribute values, so this data is not separately gathered on request. TableDataSet will return the current list of rows.
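Because fields must be plain attribute names, a derived quantity that one might otherwise express as a lambda can instead be exposed as a property. A minimal sketch (the class and field names here are hypothetical, not part of this PR):

```python
class MyAgent:
    """Hypothetical agent exposing a derived value as a property,
    so it can be tracked like any other string field."""

    def __init__(self, wealth, age):
        self.wealth = wealth
        self.age = age

    @property
    def wealth_per_year(self):
        # derived value; trackable via fields=["wealth_per_year"]
        return self.wealth / self.age
```

This keeps the field specification homogeneous (strings only) while still supporting computed values.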

This PR is a first draft, exploring the idea. Feedback is very much welcome. The focus is more on fleshing out the API than on optimizing the code itself.

API

class MyModel(Model):

    def __init__(self, rng=None):
        super().__init__(rng=rng)

        self.model_output = DataRegistry()
        self.model_output.track_agents("agent_data", self.agents, fields=["wealth", "health", "age"])
        self.model_output.track_model("model_data", self, fields="average_age")

        self.data_collector = CollectorListener()

@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🔵 -0.7% [-1.3%, -0.2%] | 🔵 -1.1% [-1.3%, -0.8%] |
| BoltzmannWealth | large | 🔵 -0.2% [-0.6%, +0.3%] | 🔵 -1.2% [-2.7%, +0.3%] |
| Schelling | small | 🔵 +0.1% [-0.1%, +0.3%] | 🔵 -1.1% [-1.2%, -1.0%] |
| Schelling | large | 🔵 -0.1% [-0.5%, +0.3%] | 🔵 -0.4% [-0.9%, +0.3%] |
| WolfSheep | small | 🔵 -0.3% [-0.6%, +0.1%] | 🔵 -1.2% [-1.3%, -1.0%] |
| WolfSheep | large | 🔵 +0.9% [+0.3%, +1.6%] | 🔵 +1.3% [+0.6%, +1.9%] |
| BoidFlockers | small | 🔵 -0.6% [-1.1%, -0.1%] | 🔵 +1.0% [+0.7%, +1.2%] |
| BoidFlockers | large | 🔵 +0.2% [-0.3%, +0.7%] | 🔵 +0.5% [+0.2%, +0.8%] |

@quaquel quaquel changed the title initial commit Add data registry Jan 17, 2026
@codebreaker32
Collaborator

codebreaker32 commented Jan 17, 2026

I did some research for (3) and found that we technically can, but probably shouldn't:

  1. operator.attrgetter does not support default values. Even if one agent is missing the attribute, the entire simulation would crash with an AttributeError. (If we wrap it in a try...except block, we lose the performance gain we were trying to achieve.)
  2. attrgetter has a noticeable setup cost. DataSet pays this cost only once in __init__, but AgentSet would have to pay it on every call to get. (For fewer than ~200 agents, it is slower than a plain "for" loop.)
  3. The biggest overhead preventing AgentSet from being vector-fast is that it stores Python objects, which are scattered across the heap.

However, it can be Python-fast (e.g., supporting both AgentSet and WeakAgentSet as discussed in #3128).

@codebreaker32
Collaborator

One more minor thing I'd like to address: operator.attrgetter returns a scalar when len(self._args) == 1, which would result in a TypeError if we tried to zip it. So, for this special case, we'll have to force the result into a tuple.
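The scalar-vs-tuple behavior can be seen directly; the `get_fields` normalization below is a hypothetical sketch of the workaround, not code from this PR:

```python
from operator import attrgetter


class Agent:
    def __init__(self):
        self.wealth = 5
        self.age = 30


agent = Agent()

# One attribute name yields a bare value; two or more yield a tuple.
single = attrgetter("wealth")(agent)           # 5
multiple = attrgetter("wealth", "age")(agent)  # (5, 30)


def get_fields(obj, fields):
    """Always return a tuple, so downstream zip() works uniformly."""
    values = attrgetter(*fields)(obj)
    return values if len(fields) > 1 else (values,)
```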

@EwoutH
Member

EwoutH commented Jan 17, 2026

Thanks a lot for this. Very useful pathfinding.

The separation of extraction (DataSet) from storage and timing/triggering (future work) is helpful. Always chop problems into smaller problems if you can.

Looking back at our original discussions, a few questions on how flexible the extraction layer can be:

  1. Dynamic agent selection: Right now AgentDataSet takes a fixed AgentSet at initialization. Could it accept a callable instead, so membership is re-evaluated each time .data is accessed? This would enable tracking things like "all starving agents" where the set changes each step:
# Instead of evaluating once at init
registry.track_agents("starving", model.agents.select(lambda a: a.energy < 10))

# Evaluate fresh each time
registry.track_agents("starving", lambda: model.agents.select(lambda a: a.energy < 10))

Of course, this adds some complexity, since you're no longer working with a fixed set of agents.

  2. Aggregation: Can DataSets reference the data from other DataSets? For example, if I'm already collecting wealth from agents, can I create a gini DataSet that operates on that extracted wealth data rather than re-extracting it?
wealth_data = AgentDataSet("wealth", model.agents, "wealth")

# Option A: Pass the wealth_data DataSet as the "model"?
gini_data = ModelDataSet("gini", wealth_data, gini=lambda ds: calculate_gini(ds.data))

# Option B: Keep model, but callable references wealth_data?
gini_data = ModelDataSet("gini", model, gini=lambda m: calculate_gini(wealth_data.data))

# Option C: Something else entirely?
  3. DataSet composition: More generally, is there a pattern for DataSets to depend on other DataSets to avoid redundant extraction and enable data pipelines?

@quaquel
Member Author

quaquel commented Jan 17, 2026

Dynamic agent selection

This indeed currently does not work (nor does it for the existing data collector). I am actually thinking of making this possible in a custom DynamicAgentDataSet class that listens for agent_registered signals for inclusion and agent_deregistered signals for removal. So no need to filter on every call to .data, but make the dynamic AgentSet itself reactive instead.

Aggregation: Can DataSets reference the data from other DataSets?
DataSet composition: More generally, is there a pattern for DataSets to depend on other DataSets to avoid redundant extraction and enable data pipelines

This is related to the _is_dirty idea in #3145. At present, we cannot do this yet, but this design has more potential to make it possible. The most promising idea I have is to take a cue from the property layers and continuous space position approach. Basically, the attributes you want to track in an AgentDataSet, or even a specialized NumpyAgentDataSet, are defined as properties on the agent, which internally provide a view into a numpy array. In this way, the dataset is always in sync with the agent state because they are just two different views on the same data.
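The property-backed-by-array idea can be sketched with a descriptor. Everything here (`ArrayBackedAttribute`, the `index` slot, the module-level store) is a hypothetical illustration of the technique, not Mesa API:

```python
import numpy as np


class ArrayBackedAttribute:
    """Descriptor storing an agent attribute in a shared numpy array,
    so any view on that array is always in sync with agent state."""

    def __init__(self, array):
        self.array = array  # shared storage, one slot per agent

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return self.array[obj.index]

    def __set__(self, obj, value):
        self.array[obj.index] = value


wealth_store = np.zeros(10)


class MyAgent:
    wealth = ArrayBackedAttribute(wealth_store)

    def __init__(self, index):
        self.index = index  # row in the shared array


a = MyAgent(0)
a.wealth = 42.0
# wealth_store[0] is now 42.0 -- the "dataset" needs no separate collection step
```

Because reads and writes go through the same array, a dataset exposing that array never needs a dirty flag.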

@codebreaker32
Collaborator

Just asking out of curiosity: is it a good idea to make DataSet observable and then derive a new class from ModelDataSet, say ComputedDataSet (which depends on some parent DataSet, e.g. gini depends on wealth), while ModelDataSet purely keeps environmental variables (such as tax_rate or time in Boltzmann)? Whenever the parent DataSet changes, it triggers a signal and the ComputedDataSet does its calculation (zero extraction cost).

class ComputedDataSet(DataSet):

    def __init__(
        self,
        name: str,
        parent: DataSet,
        compute_fn: Callable[[dict], Any],
    ):
        self.name = name
        self.parent = parent
        self.compute_fn = compute_fn
        self.parent.observe(parent.name, SignalType.CHANGE, self.collect)

@quaquel
Member Author

quaquel commented Jan 18, 2026

Just asking out of curiosity: is it a good idea to make DataSet observable and then derive a new class from ModelDataSet, say ComputedDataSet

Something like this might be made to work. However, any dataset is, by definition, already computed, because it depends on something else. For example, a wealth dataset depends on the wealth attributes of the agents. To know that _is_dirty is true, the wealth dataset would have to subscribe to all agents, and all agents would have to make self.wealth observable. The net result is an explosion of signals. In the simple case of the Boltzmann model, any given interaction between two agents results in two signals. So in any given step of the model, you might have to process 200 signals (assuming 100 agents). In contrast, a pull-based approach involves a single loop over 100 agents.

@codebreaker32
Collaborator

You are right that a pure reactive approach would cause a massive signal explosion and a performance nightmare, but that is not what I am proposing.
I am proposing a "smart pull" architecture that combines the efficiency of the pull approach with the modularity of reactivity.

The Hybrid Flow:

  1. Silent Updates: Agents update self.wealth normally. They do not emit signals. There is zero overhead here.
  2. Trigger (1 Signal per Step): The AgentDataSet subscribes only to the global model.step signal (or sparse events). It assumes the data is "dirty" at the end of every step.
  3. The "Smart Pull": When the step signal fires, AgentDataSet performs the standard Pull

Result: we have the memory buffer. Only after the pull is complete does AgentDataSet emit a single CHANGE signal.

ComputedDataSet (e.g., Gini) listens to this signal and grabs the already-pulled buffer from AgentDataSet to calculate the metric.
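The hybrid flow above can be sketched as follows. `SmartPullDataSet`, its `observe` mechanism, and the Gini listener are illustrative stand-ins, not Mesa's actual signal API:

```python
class SmartPullDataSet:
    """Pulls agent data once per step, then emits a single CHANGE signal."""

    def __init__(self, agents, field):
        self.agents = agents
        self.field = field
        self.buffer = []
        self._listeners = []

    def observe(self, callback):
        self._listeners.append(callback)

    def on_step(self):
        # the one "smart pull" per step (agents emitted no signals themselves)
        self.buffer = [getattr(a, self.field) for a in self.agents]
        for callback in self._listeners:  # single CHANGE notification
            callback(self.buffer)


class GiniDataSet:
    """Recomputes from the already-pulled buffer; zero extra extraction."""

    def __init__(self, parent):
        self.value = None
        parent.observe(self._recompute)

    def _recompute(self, wealths):
        n, total = len(wealths), sum(wealths)
        if n == 0 or total == 0:
            self.value = 0.0
            return
        cum = sum((i + 1) * w for i, w in enumerate(sorted(wealths)))
        self.value = (2 * cum) / (n * total) - (n + 1) / n
```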

@quaquel
Member Author

quaquel commented Jan 18, 2026

You are right that a pure reactive approach would cause a massive signal explosion and a performance nightmare, but that is not what I am proposing.
I am proposing a "smart pull" architecture that combines the efficiency of the pull approach with the modularity of reactivity.

That is a design that indeed makes more sense. However, it is also fragile. Basically, the dataset becomes dirty on step and is clean again after the first call to data. But there is nothing preventing the user from updating agents in the dataset afterwards:

self.agents.shuffle_do("step_a")
self.update_stats()  # --> calls .data on some of our DataSets
self.agents.shuffle_do("step_b")
self.update_stats()  # assumes the data is clean, but that is no longer true

So, that is why a numpy view style design, as we use for the property layers (i.e., Cell.elevation) and ContinuousAgent.position, is superior. Any call to .data just returns the numpy array. This numpy array, by construction, is always in sync with the relevant attributes in all the agents. No need to bother with a dirty flag or anything.

However, I want to leave the numpy style agent data set for a future PR to avoid complicating this one too much.

@quaquel
Member Author

quaquel commented Feb 8, 2026

I added the unique_ids of agents as a separate numpy array, so this is now included. However, I have left .data the same for now. In many cases, you only want the data and are not interested in the unique_ids of the agents to which this data belongs.

For the data recording, unique_ids do matter. Ideally, you want to store agent data by unique_id so you can trace agents over the simulation. @codebreaker32, I would love your perspective on this and how you think we could include this in the API. There is also a broader point on DataSet.data being heterogeneously typed which is unavoidable but might still benefit from some cleaning.

@codebreaker32
Collaborator

codebreaker32 commented Feb 8, 2026

I recommend we keep .data pure (values only) to preserve the ease of mathematical operations (e.g., np.mean(dataset.data)), and instead expose IDs via a separate interface.

  • The Dataset: We add a public property dataset.ids (or dataset.index) to NumpyAgentDataSet that returns the _agent_ids array corresponding to the active rows.
  • The Listener: We update CollectorListener to handle this side-channel data. Since the listener has access to the registry, it can do something like:
# inside CollectorListener._store_dataset_snapshot
dataset = self.registry.datasets[name]
data = dataset.data

if hasattr(dataset, "ids"):
    ids = dataset.ids
    # stack the IDs with the data (e.g., np.column_stack) for storage

This way, the DataSet remains a fast numerical container, and the Listener handles the complexity of joining IDs for storage.

There is also a broader point on DataSet.data being heterogeneously typed which is unavoidable

Agreed; moreover, I see the Listener as the normalization layer.

@quaquel
Member Author

quaquel commented Feb 8, 2026

I recommend we keep .data pure (values only) to preserve the ease of mathematical operations (e.g., np.mean(dataset.data)), and instead expose IDs via a separate interface.

I agree with this. I'll add a NumpyAgentDataSet.agent_ids property.

Agree and rather I see "Listener" as the normalization layer.

OK, for now let's keep it this way. We might revisit this depending on whether and how we want to use the DataRegistry on the UI side of things.

@EwoutH
Member

EwoutH commented Feb 9, 2026

From the usage example in the PR description:

  self.model_output.track_agents("agent_data", self.model.agents, "wealth", "health", "age")
  self.model_output.track_model("model_data", self, "average_age")

I find it a bit weird that you can just pile on arguments. Can we make this a list or a set?

I think requiring keywords here might also help (including for future API changes).

@quaquel
Member Author

quaquel commented Feb 9, 2026

I find it a bit weird that you can just pile on arguments. Can we make this a list or a set?
I think requiring keywords here might also help (including for future API changes).

I actually thought it was very convenient: you are just passing the different attributes/properties/descriptors you want to collect. But yes, you could also do this via, e.g., a single argument fields: str | Iterable[str].

Member

@EwoutH EwoutH left a comment


Pre-approving since this is almost fully in the experimental space.

@quaquel
Member Author

quaquel commented Feb 10, 2026

So, before merging this, I would like to know if there is a preference regarding fields. We have three options

# current implementations
registry.track_model("model_dataset", "attr1", "attr2", "attr3")

# have a single fields argument
registry.track_model("model_dataset",  ["attr1", "attr2", "attr3"])

# have a single fields keyword argument
registry.track_model("model_dataset",  fields=["attr1", "attr2", "attr3"])

@EwoutH, @codebreaker32 do either of you have a clear preference? I like the convenience of the current implementation, but I see @EwoutH's point of future extendability. We also separately indicated a desire to move towards a keyword-preferred design, which would favor option 3.

@codebreaker32
Collaborator

I am fine with option 3 as well. It is self-documenting, extensible, and aligns with the project's shift toward explicit keyword arguments.

@quaquel
Member Author

quaquel commented Feb 10, 2026

I added NumpyAgentDataSet.agent_ids and shifted to fields as a keyword argument. I am merging this so we can move on to finalizing #3145.

@quaquel quaquel merged commit 48d08c3 into mesa:main Feb 10, 2026
12 of 14 checks passed
@quaquel quaquel deleted the data_registry branch February 10, 2026 14:55
@EwoutH EwoutH added the experimental Release notes label label Feb 11, 2026

Labels: experimental, feature