
Add DataRecorder for reactive Data Storage and DatasetConfig for Configuration#3145

Merged
quaquel merged 71 commits into mesa:main from codebreaker32:listener
Feb 13, 2026

Conversation

@codebreaker32
Collaborator

@codebreaker32 codebreaker32 commented Jan 15, 2026

This PR introduces a decoupled, event-driven data collection architecture designed to handle large-scale simulations efficiently. It separates what to collect (DataRegistry from #3156) from how to store it (DataRecorder).

The core design philosophy is the separation of concerns:

  1. What to collect is defined by the DataRegistry (from Add data registry #3156).
  2. When to collect is controlled by DatasetConfig (intervals, windows, start/end times). However, there is an ongoing discussion about replacing it with Schedule (see The future of data collection #1944 (comment), Proposal: Unified Time and Event Scheduling API #2921 (comment), and Add Schedule dataclass and refactor EventGenerator #3250).
  3. How to store it is handled by DataRecorder implementations (Memory, SQL, Parquet).

Core Components & APIs

A. BaseDataRecorder (Abstract Base Class)

Handles the orchestration logic: subscription, interval checks, and lifecycle management.

  • __init__(model, config=...): Attaches to the model.
  • collect(): Manually triggers data collection.
  • get_table_dataframe(name): Retrieves data as a Pandas DataFrame.
  • summary(): Returns stats and summary.

B. DatasetConfig (Configuration)

Fine-grained control over when data is collected for each dataset.

  • interval: Collection frequency (e.g., every 1 time unit, or fractional/multiple intervals such as every 0.5 or 2 time units).
  • start_time / end_time: Define a specific collection window. (start_time=0 explicitly means the initial state is collected.)
  • window_size: Rolling buffer capacity (e.g., keep only the last 1000 snapshots).
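As an illustration of these semantics, here is a minimal, self-contained sketch of the gating logic. Note that `should_collect` and the modulo rule are assumptions for illustration, not the PR's actual implementation:

```python
from collections import deque

# Hypothetical stand-in illustrating the semantics described above;
# the real DatasetConfig in this PR may differ in detail.
class DatasetConfig:
    def __init__(self, interval=1, start_time=0, end_time=None, window_size=None):
        self.interval = interval
        self.start_time = start_time
        self.end_time = end_time
        self.window_size = window_size

    def should_collect(self, time):
        """Return True if a snapshot is due at `time`."""
        if time < self.start_time:
            return False
        if self.end_time is not None and time > self.end_time:
            return False
        # collect on multiples of `interval`, offset from start_time
        return (time - self.start_time) % self.interval == 0

config = DatasetConfig(interval=10, start_time=100, end_time=200, window_size=500)
buffer = deque(maxlen=config.window_size)  # rolling window: old snapshots drop off

for t in range(0, 301):
    if config.should_collect(t):
        buffer.append(t)

print(buffer[0], buffer[-1], len(buffer))  # 100 200 11
```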

Example Usage:

config = {
    "agents_wealth": DatasetConfig(interval=10, window_size=500),
    "model_gini": DatasetConfig(start_time=100)
}
recorder = DataRecorder(model, config=config)
# recorder = JSONDataRecorder(model, config=config)

Included Recorders

This PR includes four implementations catering to different use cases:

i) DataRecorder (default, in-memory)
ii) SQLDataRecorder (SQLite)
iii) ParquetDataRecorder (Parquet)
iv) JSONDataRecorder (JSON)

Extensibility: Writing New Backends

The architecture makes adding new storage backends (e.g., MongoDB, HDF5, CSV) straightforward. Developers only need to inherit from BaseDataRecorder.
All interval logic, validation, and observable subscriptions are handled automatically by the base class.
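As a hedged illustration (not the PR's actual API), a custom backend could look roughly like this. `BaseDataRecorderSketch`, `record`, and the `_store` hook are hypothetical stand-ins for the real base class, which drives storage from signals rather than direct calls:

```python
import csv
import io

# Stand-in base class sketching the described orchestration; the real
# BaseDataRecorder handles subscriptions and interval checks internally.
class BaseDataRecorderSketch:
    def __init__(self):
        self._datasets = {}

    def record(self, name, rows):
        # in the real class this would be driven by signals, not called directly
        self._store(name, rows)

    def _store(self, name, rows):  # hypothetical hook name
        raise NotImplementedError

class CSVDataRecorder(BaseDataRecorderSketch):
    """Writes each dataset to an in-memory CSV buffer."""
    def _store(self, name, rows):
        buf = self._datasets.setdefault(name, io.StringIO())
        writer = csv.writer(buf)
        writer.writerows(rows)

recorder = CSVDataRecorder()
recorder.record("wealth", [(1, 10), (2, 20)])
print(recorder._datasets["wealth"].getvalue().strip().splitlines())  # ['1,10', '2,20']
```

The point is that a subclass only implements the storage hook; all scheduling stays in the base class.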

Migration

Old

class MyModel(Model):
    def __init__(self):
        ...

        self.datacollector = DataCollector(
            model_reporters={"gini": self.compute_gini},
            agent_reporters={"Wealth": "wealth"}
        )

    def step(self):
        ...          # Step Logic
        self.datacollector.collect(self)

New

class MyModel(Model):
    def __init__(self):
        ...

        config = {
            "Gini": DatasetConfig(start_time=5, end_time=100, interval=1),
            "Wealth": DatasetConfig(start_time=0, end_time=100, interval=1)
        }

        self.recorder = DataRecorder(self, config=config)
        # self.recorder = SQLDataRecorder(self, config=config) # Easy backend swap

    def step(self):
        ...           # Step Logic

Note: Please refer to #3156 for details on the registration API.

Known Limitations

  1. With the current recorder, capturing the initial state ($t=0$) is non-trivial. If the recorder is initialized before agents are created, it captures the $t=0$ state immediately, but that state is empty or incomplete. If the recorder relies solely on the time=0 signal emitted by Model.__init__, it misses the event entirely, because the signal fires before the user's __init__ logic even begins.

To ensure the recorder captures the fully populated initial state, the BaseDataRecorder performs an immediate check upon instantiation. This imposes a strict ordering requirement on the user: The DataRecorder must be instantiated as the absolute last step of the Model.__init__ method.

This is being solved in #3284.

  2. We currently lack an explicit lifecycle event to signal that a simulation run has finished. The recorder cannot automatically detect when a simulation run ends, so we might miss the last step.

Workaround: Users must manually invoke recorder.finalize() at the end of their execution block.

Note: Future work is expected to provide standard RUN_STARTED and RUN_ENDED events, which will automate this process (see #2921 (comment)).

@codebreaker32
Collaborator Author

I have modified only the examples that are used for benchmarking, to compare timings.

@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🔵 -2.8% [-3.5%, -1.9%] | 🟢 -12.0% [-12.1%, -11.9%] |
| BoltzmannWealth | large | 🟢 -6.6% [-7.8%, -5.4%] | 🟢 -9.1% [-11.2%, -7.4%] |
| Schelling | small | 🟢 -3.4% [-3.6%, -3.3%] | 🟢 -12.1% [-12.2%, -12.0%] |
| Schelling | large | 🔵 -3.3% [-3.8%, -2.8%] | 🟢 -6.4% [-7.8%, -5.3%] |
| WolfSheep | small | 🟢 -4.2% [-4.6%, -3.9%] | 🔵 +0.4% [+0.2%, +0.7%] |
| WolfSheep | large | 🟢 -7.7% [-8.9%, -6.6%] | 🔵 -0.7% [-2.4%, +0.8%] |
| BoidFlockers | small | 🔵 +1.8% [+1.3%, +2.2%] | 🔵 +0.5% [+0.4%, +0.7%] |
| BoidFlockers | large | 🔵 +0.5% [-0.2%, +1.2%] | 🔵 +0.7% [+0.5%, +0.9%] |

@codebreaker32 codebreaker32 marked this pull request as draft January 15, 2026 19:33
@codebreaker32 codebreaker32 marked this pull request as ready for review January 15, 2026 19:56
@codebreaker32 codebreaker32 changed the title Prototyping the Listener Logic Prototyping the Listener Logic to replace datacollector Jan 15, 2026
@github-actions

Performance benchmarks:

| Model | Size | Init time [95% CI] | Run time [95% CI] |
|---|---|---|---|
| BoltzmannWealth | small | 🔵 -1.3% [-1.9%, -0.6%] | 🟢 -13.3% [-13.5%, -13.2%] |
| BoltzmannWealth | large | 🟢 -8.5% [-9.7%, -6.7%] | 🟢 -13.6% [-14.9%, -12.3%] |
| Schelling | small | 🟢 -4.2% [-4.5%, -3.9%] | 🟢 -12.6% [-12.7%, -12.5%] |
| Schelling | large | 🔵 -2.5% [-2.9%, -2.2%] | 🟢 -4.7% [-5.2%, -4.0%] |
| WolfSheep | small | 🟢 -4.6% [-4.9%, -4.2%] | 🔵 -1.6% [-1.8%, -1.5%] |
| WolfSheep | large | 🟢 -6.8% [-7.4%, -6.2%] | 🔵 -1.3% [-2.2%, -0.4%] |
| BoidFlockers | small | 🔵 +2.5% [+2.0%, +3.1%] | 🔵 -0.6% [-0.7%, -0.4%] |
| BoidFlockers | large | 🔵 +1.6% [+1.0%, +2.1%] | 🔵 -0.6% [-0.9%, -0.4%] |

@EwoutH EwoutH marked this pull request as draft January 15, 2026 20:53
@quaquel
Member

quaquel commented Jan 16, 2026

There is a lot here that I appreciate.

  1. I like the explicit use of signals for triggering stuff. However, I think this can be improved and expanded a bit more, as indicated in my comments on the code.
  2. I like how much faster it is.

However, this PR also highlights a bunch of open questions (which is another reason why I like it as a pathfinding PR).

  1. The current approach, both here and in the data collector, is to just accumulate data using some kind of dict of lists. However, if one can preallocate an empty data structure of the appropriate size, storage will be faster. So, what if the new RunControl idea includes a default end_time? Then you can just preallocate stuff.
  2. The current design is a hybrid between a purely reactive push design and the current pull design. It is reactive to updates of model.time/model.step. It pulls the data to be collected, however, from the agent/model. I would be interested in exploring making this second step more reactive (does not have to be done in this PR). The benefit is that model-level aggregate statistics often require this anyway (e.g., Gini requires a table of agent wealth).
  3. The current data collector has gotten increasingly complicated over time. In part to cover all kinds of weird use cases. It might be helpful to step back and clarify the various use cases for data collection.
  • We want to collect agent-level data over time. This might be split by agent type, or by any set of agents. And this should also account for the addition or removal of agents over time. (Currently split over agent_reporter and agent_by_type reporters)
  • We want to collect agent data at random points in time. For example, agent lifetime data is collected whenever an agent is removed. This use case is handled by the table stuff
  • We want to collect aggregate data at the model level (i.e., model reporters)
  • Are there any more?
  4. What about very explicitly constraining the user on the data collection that is supported by Mesa? For example, what about only allowing "simple" data like numbers?
  5. I still think we need to more explicitly separate two distinct things: the observable state of the overall model at any given point in time, and the tracking of this state over time with (fixed) intervals. Any observable model or agent reporter is part of the observable state of the model. The newly added signals in this PR trigger, in essence, a snapshot of this state at a given point in time. Tables, however, are used for things that are not snapshotted in the same way.

@codebreaker32
Collaborator Author

codebreaker32 commented Jan 16, 2026

  1. Agree, but this depends entirely on the RunControl API.
  2. In Python, making every attribute assignment reactive (using setattr or Traitlets) adds massive overhead. It can slow down the simulation physics by 10x-50x. Push-Trigger / Pull-Data (Current Hybrid) is the sweet spot. We wait for the step to finish (Trigger), then use C-speed vectorization to grab the data (Pull). True "Reactive Data" is likely too slow for Python ABMs.
  3. Before opening this PR, I did extensive research (explored EMAWorkbench, mesa-geo, and some heuristic algorithms like ACO, ABC, etc.) and explored Mesa's `datacollector`.
    For any ABM, there are three major categories (as you mentioned):
    i. Agent Time-Series; State of N agents at time 't'
    ii. Event Logs: Sparse event at random times
    iii. Model Aggregates: Global stats

One more category that I find fits this pattern is
iv. Spatial/Field Data: Recording the state of the environment itself, e.g., a vegetation layer in a grid, a pheromone trail in an ant colony, or raster data in Mesa-Geo.
Why doesn't it fit any of the others?

  • It isn't a single Model Aggregate (it's a matrix).
  • It isn't strictly Agent Data (storing a 1000x1000 grid as 1 million agent rows is incredibly inefficient).
  • It represents a "Dense State" that is best stored as a Tensor or Raster (2D/3D Array) rather than a list of records.
  4. I liked this approach, but one more thing we might do is let the user decide. For most users, "simple" data may be enough, but for sophisticated users (developers, researchers, etc.) we should give an option to override or inherit from our Listener to create their own.
  5. I fully endorse this architectural split. It clarifies the role of the new listener:
  • Signals (STEP/RESET): These act as the "Camera Shutter" for taking consistent, synchronized Snapshots of the Model's Observable State (model_vars / agent_vars).
  • Tables: These remain the dedicated mechanism for Event Logging (asynchronous events like agent death or interactions) that are sparse and don't fit the snapshot paradigm.
  • By formalizing the ModelSignal, we are explicitly decoupling the "Trigger" (the Shutter) from the "Collection" (the Film), which is what allows us to swap in the high-performance columnar engine without changing the model logic. Also, this new approach frees the collector to implement any logic.
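The push-trigger / pull-data hybrid described in point 2 above can be sketched as follows (all class names are hypothetical; this is not the PR's code):

```python
# Minimal pull-on-signal sketch: the model pushes only a lightweight
# "step finished" trigger; the listener then pulls the actual data in
# one pass, so plain attribute writes carry no reactive overhead.
class Signal:
    def __init__(self):
        self._subscribers = []

    def subscribe(self, fn):
        self._subscribers.append(fn)

    def emit(self, time):
        for fn in self._subscribers:
            fn(time)

class Agent:
    def __init__(self, wealth):
        self.wealth = wealth

class Listener:
    def __init__(self, agents, signal):
        self.agents = agents
        self.snapshots = []
        signal.subscribe(self.on_step)   # push: trigger only

    def on_step(self, time):
        # pull: grab all data at once after the step completes
        self.snapshots.append((time, [a.wealth for a in self.agents]))

agents = [Agent(10), Agent(20)]
step_done = Signal()
listener = Listener(agents, step_done)

agents[0].wealth += 5        # plain attribute writes, no per-write cost
step_done.emit(1)            # one trigger per step
print(listener.snapshots)    # [(1, [15, 20])]
```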

Update: Added the implementation checklist

@codebreaker32
Collaborator Author

One more thing that I would like to ask: should I work on making mesa_signals more performant (because this PR relies on it), or is that covered by @EwoutH?

@quaquel
Member

quaquel commented Jan 16, 2026

In Python, making every attribute assignment reactive (using setattr or Traitlets) adds massive overhead

I agree, but that is why I called it the observable state. Clearly, not all attributes should be tracked automatically. It's a conscious choice by the modeller to indicate what she wants to track. Also, if designed well, the overhead is not that big. I ran tests on Mesa a while ago using Boltzmann. I had a table with agent.wealth. This was subscribed to updates of agent.wealth. And then model.gini used this table instead of looping over all agents. Depending on implementation details, this was faster than the current data collector approach.
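A minimal sketch of that experiment's shape (hypothetical names; mesa_signals' actual Observable differs): only attributes the modeller declares observable notify subscribers, so a table stays current without looping over agents.

```python
# Opt-in push design: a descriptor notifies subscribers on assignment.
class Observable:
    def __set_name__(self, owner, name):
        self.name = "_" + name

    def __get__(self, obj, objtype=None):
        return getattr(obj, self.name)

    def __set__(self, obj, value):
        setattr(obj, self.name, value)
        for fn in getattr(obj, "_subscribers", []):
            fn(obj, value)

class Agent:
    wealth = Observable()  # only this attribute is tracked

    def __init__(self, uid, wealth, subscribers=()):
        self._subscribers = list(subscribers)
        self.uid = uid
        self.wealth = wealth  # triggers the initial notification

# a table kept current by subscriptions, so e.g. gini never loops over agents
wealth_table = {}

def update_table(agent, value):
    wealth_table[agent.uid] = value

agents = [Agent(i, 10 * (i + 1), [update_table]) for i in range(3)]
agents[0].wealth = 100
print(sorted(wealth_table.values()))  # [20, 30, 100]
```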

For any ABM, there are three major categories

I agree with the three categories and your additional point on matrices/tensors. The problem in Mesa at the moment is that we lack standalone support for all of this; instead, it is all integrated into the data collector. Conceptually, we might need to have a model.statistics style object. Users can define data tables on this which belong to any of the four types you specified. Next, the data collector can take snapshots of any of the defined tables and store those in some performant backend (pd.DataFrame, polars, database, etc.). We might even have an API giving the user more fine-grained control over which "data tables" are snapshotted when. So some might be snapshotted on every tick, others might be snapshotted only on some kind of run-ended signal. And we might even have a dedicated signal that users can fire if they want to snapshot something.

The open question then is whether these "data tables" have to be reactive or can be reactive. We might just support both and leave it to the user to decide which is appropriate for their use case.

One more thing that I would like to ask: should I work on making mesa_signals more performant (because this PR relies on it), or is that covered by @EwoutH?

I built the original version of mesa_signals and have been looking at improving it over the last couple of days. But input is of course always welcome. For the record: I want to simplify the Computed/Computable stuff (not critical, but might help). The main performance improvement most likely is to figure out how to replace the Observable descriptor with a property with closures (i.e., the property factory pattern you suggested earlier).

@quaquel quaquel added the feature Release notes label label Jan 16, 2026
@quaquel
Member

quaquel commented Jan 16, 2026

I forced myself to articulate this idea of separating model state and data collection. Below is a quick sketch of what I mean by having an object that contains the model state you want to track. Your listener object now would just need to iterate over (a subset of) the defined datatables. A lot of your fast data-getting code could be moved into the respective DataSet objects. Moreover, this design is easily extensible: just define a new DataSet subclass.

What is missing here is datasets for variable sets of agents, and the tensor idea that you had. The last one could just be an ArrayDataSet, and internally you would just have an ndarray of arbitrary dimension. The former is a bit trickier to do (and also very tricky in Mesa at the moment).

The resulting API would be roughly

class MyModel(Model):

    def __init__(self, rng=None):
        super().__init__(rng=rng)

        self.model_output = ModelOutput()
        self.model_output.add_dataset(AgentDataSet("agent_data", self.agents, ["wealth", "health", "age"]))
        self.model_output.add_dataset(ModelDataSet("model_data", self, ["gini", "average_age"]))

        self.data_collector = CollectorListener()
A benefit of this approach is that these tables are now also available to others. For example, in Boltzman, the Gini calculation can now rely on output[agent_data], or you just add a separate self.model_output.wealth dataset. Some of the dirty flag stuff could be used as well. Basically, if we can figure out whether we need to renew the internal data, we can avoid having to collect agent-level data multiple times (as currently happens with gini and agent wealth in the data collector).

@EwoutH, also curious to get your thoughts on this.

The sketch for ModelOutput and DataSet

from .agent import AgentSet
from .model import Model

class DataSet:
    # follows anylogic
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields

        # internal datastructure

    @property
    def data(self):
        raise NotImplementedError

class AgentDataSet(DataSet):
    def __init__(self, name, agents:AgentSet, fields:str|callable|list[str|callable]):
        super().__init__(name, fields)
        self.agents = agents

    @property
    def data(self):
        # gets the data for the fields from the agents
        return ...

class ModelDataSet[M: Model](DataSet):
    def __init__(self, name, model:M, fields:str|callable|list[str|callable]):
        super().__init__(name, fields)
        self.model = model

    @property
    def data(self):
        # gets the data for the fields from the model
        return ...


class TableDataSet(DataSet):
    def __init__(self, name, fields:str|list[str]):
        super().__init__(name, fields)
        self.rows = []  # rows appended at arbitrary points in time

    @property
    def data(self):
        # returns the accumulated rows
        return ...

class ModelOutput:

    def __init__(self):
        self.datasets = {}

    def add_dataset(self, dataset: DataSet):
        self.datasets[dataset.name] = dataset

    def create_dataset(self, dataset_type, name, fields, *args):
        pass

    def __getitem__(self, name:str):
        return self.datasets[name]

@codebreaker32
Collaborator Author

codebreaker32 commented Jan 16, 2026

This is a fantastic architectural pivot. I completely agree, and it also perfectly addresses the "Separation of Concerns" problem.

Fast optimisations (or complex logic) now move to the DataSet.
The DataSet abstraction makes it trivial to add a TensorDataSet (or a RasterDataSet for Mesa-Geo later) without having to hack the core listener logic.

Basically, if we can figure out whether we need to renew the internal data, we can avoid having to collect agent-level data multiple times

By treating the AgentDataSet as a stateful object rather than just a pass-through function, we can implement intra-step caching:

  1. The AgentDataSet holds a cached view of the current step's vectors.
  2. The Gini reporter requests model.data.agents["wealth"]. The dataset sees its cache is empty/dirty, performs the optimized extraction (the "Fast Path"), caches the result, and returns it.
  3. The CollectorListener requests the same data to save it. The dataset returns the cached reference immediately, in $O(1)$.
  4. When the ModelSignal.STEP (or a PRE_STEP signal) fires, it marks the dataset as "dirty", forcing a refresh.
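These four steps can be sketched as follows (hypothetical names; this is not the PR's code):

```python
# Sketch of intra-step caching with a dirty flag.
class CachedAgentDataSet:
    def __init__(self, agents, field):
        self.agents = agents
        self.field = field
        self._cache = None  # None means "dirty"

    @property
    def data(self):
        if self._cache is None:  # cache miss: do the extraction once
            self._cache = [getattr(a, self.field) for a in self.agents]
        return self._cache       # repeat access returns the cached reference

    def mark_dirty(self):
        # would be called when the STEP (or PRE_STEP) signal fires
        self._cache = None

class A:
    def __init__(self, wealth):
        self.wealth = wealth

agents = [A(1), A(2)]
ds = CachedAgentDataSet(agents, "wealth")

first = ds.data              # extraction happens here
agents[0].wealth = 99
same = ds.data               # stale but cached: no re-extraction
ds.mark_dirty()              # the signal fires
fresh = ds.data              # refreshed on next access
print(first, same is first, fresh)  # [1, 2] True [99, 2]
```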

Also, I would suggest a cleaner API:

class MoneyModel(Model):
    def __init__(self):
        super().__init__()
        self.data = DataRegistry(self)

        self.data.track_agents("agents", self.agents, ["wealth", "id"])
        self.data.track_model("model", ["gini", "step_count"])
        
        # (Future) Track Spatial/Tensor
        # self.data.track_grid("terrain", self.grid, "elevation")
        self.datacollector = CollectorListener()

For this, the user only needs DataRegistry:

class DataRegistry:
    def __init__(self, model):
        self._model = model
        self._datasets = {}  # The internal storage

    def track_agents(self, name: str, source: AgentSet, reporters: list[str]) -> AgentDataSet:
        ds = AgentDataSet(name=name, agents=source, fields=reporters)
        self._datasets[name] = ds
        return ds

    def track_model(self, reporters: list[str], name: str = "model") -> ModelDataSet:
        ds = ModelDataSet(name=name, model=self._model, fields=reporters)
        self._datasets[name] = ds
        return ds

    def __getattr__(self, name):
        if name in self._datasets:
            return self._datasets[name]
        raise AttributeError(f"No dataset named '{name}'")

    def __iter__(self):
        return iter(self._datasets.values())

@quaquel
Member

quaquel commented Jan 16, 2026

By treating the AgentDataSet as a stateful object rather than just a pass-through function

Something else occurred to me. At the moment, we treat agent attributes and their collection as two separate things. However, with property layers and the experimental continuous space, we have attributes that are views into an underlying numpy array. If we can capture that design in a property/descriptor style object, the data table and the agent-level attribute become the same thing.
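A minimal sketch of that idea (hypothetical names; a plain list stands in for the underlying numpy array): the agent attribute is a descriptor into a shared backing column, so the column and the agent-level attribute are the same storage.

```python
# Descriptor reading/writing a slot in a shared column.
class ColumnView:
    def __init__(self, column):
        self.column = column

    def __get__(self, obj, objtype=None):
        return self.column[obj.index]

    def __set__(self, obj, value):
        self.column[obj.index] = value

wealth_column = [0, 0, 0]  # shared backing store (an ndarray in practice)

class Agent:
    wealth = ColumnView(wealth_column)

    def __init__(self, index):
        self.index = index

agents = [Agent(i) for i in range(3)]
agents[1].wealth = 42        # writes straight into the column

print(wealth_column)         # [0, 42, 0] -- no collection step needed
print(agents[1].wealth)      # 42
```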

Regardless, I'll try to put in a minimal working PR for the DataRegistry/ModelOutput idea, with a few basic tables that cover many of our use cases.

@EwoutH
Member

EwoutH commented Jan 16, 2026

Glad to see such an engaging conversation!

I just finished #3155. From my perspective it’s production ready.

I will dive into this tomorrow or Sunday, now my brain needs a bit of rest.

@codebreaker32
Collaborator Author

codebreaker32 commented Jan 17, 2026

Given the architectural shifts we discussed, this PR currently needs to be aligned with:

  1. Add unified event scheduling and time progression API #3155 (which is production ready, thanks to @EwoutH )
  2. Add data registry #3156

As soon as these foundational pieces are ready, I will return to this PR to refactor the implementation to match the new design.

@EwoutH
Member

EwoutH commented Jan 17, 2026

Thanks for working on this! I had a quick initial look, there are indeed some interesting ideas here.

If I'm correct this is a hybrid design right? It listens to step events, and once those are observed it triggers a pull process where it gathers the specified data. So you could call this a "pull on signal" pattern?

In general, I think proper user control of A) what gets collected B) when it gets collected and C) how it gets stored is very important.

@quaquel
Member

quaquel commented Jan 17, 2026

If I'm correct this is a hybrid design right? It listens to step events, and once those are observed it triggers a pull process where it gathers the specified data. So you could call this a "pull on signal" pattern?

Yes that is indeed the design.

In general, I think proper user control of A) what gets collected B) when it gets collected and C) how it gets stored is very important.

With #3156 and this, we separate this all more cleanly instead of trying to wrap it all into a single data collector. The DataRegistry covers what gets collected. The signals, as used here, determine when a snapshot of A is taken, and at least conceptually make it easy to store it or write your own backend storage.

@EwoutH
Member

EwoutH commented Feb 12, 2026

@codebreaker32 could you update the PR description based on the accumulated insights in the discussions and review process? Please take some time for this and don’t fully outsource it to an LLM, having clarity on the current direction and all the considerations behind it is extremely important.

@codebreaker32
Collaborator Author

Please take some time for this and don’t fully outsource it to an LLM, having clarity on the current direction and all the considerations behind it is extremely important

Sure, I understand the importance of it.

@codebreaker32
Collaborator Author

codebreaker32 commented Feb 12, 2026

Hi @EwoutH

I have updated the PR description. Please check and let me know if you want me to modify anything.

@EwoutH
Member

EwoutH commented Feb 12, 2026

Thanks, appreciated!

The discussed chaining is excluded from this PR right? What’s the plan for it, if any?

Could you add one minimal example of going from the old data collection to the new one?

@codebreaker32
Collaborator Author

The discussed chaining is excluded from this PR right? What’s the plan for it, if any?

It might make more sense to clarify this in the PR description of #3156

Could you add one minimal example of going from the old data collection to the new one?

Sure

@quaquel
Member

quaquel commented Feb 12, 2026

The discussed chaining is excluded from this PR right? What’s the plan for it, if any?

It might make more sense to clarify this in the PR description of #3156

I need to have #3284, and this one merged. Once both are in, I'll add a PR for the discussed chaining. It requires an update to the dataset protocol and implementations.

Member

@EwoutH EwoutH left a comment


Since everything is in the experimental space and this is in active development, no objections from me

@codebreaker32
Collaborator Author

codebreaker32 commented Feb 12, 2026

Could you add one minimal example of going from the old data collection to the new one?

I have added a migration example in the PR description.

I have one more suggestion: should we move BoltzmannWealth to the experimental section to use these new APIs?

@quaquel
Member

quaquel commented Feb 12, 2026

I have one more suggestion, Should we move BoltzmannWealth to the experimental section to use these new APIs?

For the time being, we need to duplicate the data collection since Solara does not use the new style yet. I'm not sure which examples would be good to move over. For benchmarking, Boltzmann is useful, and it will illustrate the key components. It can also be done in a next PR.

@codebreaker32 codebreaker32 changed the title Prototyping the Listener Logic to replace datacollector Add DataRecorder for reactive Data Storage and DatasetConfig for Configuration Feb 12, 2026
@quaquel
Member

quaquel commented Feb 12, 2026

I have merged #3284. As a last request for this PR, can you tie the initial collection to this signal?

@codebreaker32
Collaborator Author

I have merged #3284. As a last request for this PR, can you tie the initial collection to this signal?

Sure

@codebreaker32
Collaborator Author

One more thing worth noting is:

  1. The first model.step() will now emit two signals, so calling model.step() once will populate the data with entries for both t=0 and t=1.
  2. If the user wants to collect initial data without calling step(), they'll have to call recorder.collect() manually.

@quaquel
Member

quaquel commented Feb 12, 2026

Now first model.step() will emit two signals So calling model.step() initially, will populate data with twice entries

I am not sure I follow. If I track the signals, I nicely see the time incrementing: the first signal has 0 as its new value, then the second signal has 1, etc. So it's not entirely true to state that model.step emits twice. Rather, before model.step is called, model.time emits a change to 0.0. Then model.step is run, after which model.time emits a change to 1.0. Then model.step is run again, and model.time emits a change to 2.0, etc.

@codebreaker32
Collaborator Author

codebreaker32 commented Feb 13, 2026

If I track the signals, I nicely see the time incrementing

You are right; that's why I didn't call it a bug anywhere. But since the time=0.0 signal is now emitted inside the _advance_time method (which is triggered by step()), the very first time you call model.step(), Python executes both the $t=0$ emission and the $t=1$ emission.

I noticed this behavior while modifying the tests for the new behavior. You can check it yourself.

MRE:

from mesa import Model, Agent
from mesa.experimental.data_collection import DataRecorder

class SimpleAgent(Agent):
    def __init__(self, model, wealth):
        super().__init__(model)
        self.wealth = wealth

class SimpleModel(Model):
    def __init__(self):
        super().__init__()
    
        SimpleAgent.create_agents(self, 3, [10, 20, 30])
        self.data_registry.track_agents(self.agents, "agent_data", fields=["wealth"])
        

model = SimpleModel()
recorder = DataRecorder(model)
df = recorder.get_table_dataframe("agent_data")
print(f"No step called,\n {df.to_string()} \n")


model.step()
df = recorder.get_table_dataframe("agent_data")

print(f"Step called once:\n {df.to_string()}\n")

model.step()

df = recorder.get_table_dataframe("agent_data")

print(f"Step called twice:\n {df.to_string()}")

Output:

No step called,
 Empty DataFrame
Columns: []
Index: [] 

Step called once:
    unique_id  wealth  time
0          1      10   0.0
1          2      20   0.0
2          3      30   0.0
3          1      10   1.0
4          2      20   1.0
5          3      30   1.0

Step called twice:
    unique_id  wealth  time
0          1      10   0.0
1          2      20   0.0
2          3      30   0.0
3          1      10   1.0
4          2      20   1.0
5          3      30   1.0
6          1      10   2.0
7          2      20   2.0
8          3      30   2.0

@quaquel
Member

quaquel commented Feb 13, 2026

This is the behaviour I would expect. Ideally, we would trigger the t=0 collect after completing the initialization of the model, but before returning the instantiated object. Unfortunately, there is no __post_init__ on Python classes. There are ways to achieve the same via metaclasses or possibly via __init_subclass__. For now, however, I'd say that this is good enough. Finetuning this will become relevant once we start to use signals and potentially this new data collection in combination with the ui.
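For reference, the __init_subclass__ route could be sketched like this (a simplified, single-inheritance-level illustration, not a proposed implementation):

```python
# Wrap each subclass's __init__ so a hook fires after user
# initialization finishes -- a poor man's __post_init__.
class PostInitModel:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        user_init = cls.__init__

        def wrapped(self, *args, **kw):
            user_init(self, *args, **kw)
            self.__post_init__()       # fires after the subclass __init__

        cls.__init__ = wrapped

    def __post_init__(self):
        pass  # e.g. trigger the t=0 snapshot here

class MyModel(PostInitModel):
    def __init__(self):
        self.agents = [1, 2, 3]

    def __post_init__(self):
        self.snapshot = list(self.agents)  # sees the fully populated state

m = MyModel()
print(m.snapshot)  # [1, 2, 3]
```

Note this naive version would fire the hook once per level in a deeper class hierarchy, which is part of why finetuning was deferred.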

@codebreaker32
Collaborator Author

Finetuning this will become relevant once we start to use signals and potentially this new data collection in combination with the ui.

Agreed; I just felt the need to state it explicitly.

@quaquel quaquel merged commit d6ad80f into mesa:main Feb 13, 2026
14 checks passed
@quaquel quaquel mentioned this pull request Feb 13, 2026
@EwoutH EwoutH added the experimental Release notes label label Feb 15, 2026
@codebreaker32 codebreaker32 deleted the listener branch March 11, 2026 17:51

Labels

experimental Release notes label feature Release notes label
