Add DataRecorder for reactive Data Storage and DatasetConfig for Configuration #3145
|
I have modified only the examples that were used for benchmarking, to compare timings. |
|
Performance benchmarks:
|
|
|
|
There is a lot here that I appreciate.
However, this PR also highlights a bunch of open questions (which is another reason why I like it as a pathfinding PR).
|
One more category that I find fits this is
Update: Added the implementation checklist |
|
One more thing that I would like to ask is should I work on |
I agree, but that is why I called it the observable state. Clearly not all attributes should be tracked automatically. It's a conscious choice by the modeller to indicate what she wants to track. Also, if designed well, the overhead is not that big. I ran tests on Mesa a while ago using Boltzmann. I had a table with agent.wealth. This was subscribed to updates of agent.wealth. And then model.gini used this table instead of looping over all agents. Depending on implementation details, this was faster than the current data collector approach.
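The reactive-table idea described above can be illustrated with a toy, hand-rolled observer. This sketch does not use mesa_signals; `WealthTable` and its methods are hypothetical stand-ins. The point is that agents push wealth changes into the table, so the Gini computation reads the already-maintained column instead of looping over all agents:

```python
class WealthTable:
    """Toy reactive table: agents push wealth changes instead of being polled."""

    def __init__(self):
        self._wealth = {}  # agent id -> current wealth

    def on_change(self, agent_id, new_wealth):
        # Called whenever an agent's wealth attribute changes.
        self._wealth[agent_id] = new_wealth

    def gini(self):
        # Gini coefficient over the maintained column; no agent loop needed.
        values = sorted(self._wealth.values())
        n = len(values)
        total = sum(values)
        if n == 0 or total == 0:
            return 0.0
        cum = sum((i + 1) * v for i, v in enumerate(values))
        return (2 * cum) / (n * total) - (n + 1) / n


table = WealthTable()
for agent_id, wealth in enumerate([1, 1, 1, 1]):
    table.on_change(agent_id, wealth)
print(table.gini())  # equal wealth -> 0.0
```

Whether this beats a pull-based collector depends on how often attributes change relative to how often statistics are read, which matches the "depending on implementation details" caveat above.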
I agree with the three categories and your additional point on matrices/tensors. The problem in Mesa at the moment is that we lack standalone support for all of this; instead it is all integrated into the data collector. Conceptually, we might need to have a model.statistics style object. Users can define data tables on this which belong to any of the four types you specified. Next, the data collector can take snapshots of any of the defined tables and store those in some performant back end (pd.DataFrame, polars, database, etc.). We might even have an API giving the user more fine-grained control over which "data tables" are snapshotted when. So some might be snapshotted on every tick, others might be snapshotted only on some kind of run-ended signal. And we might even have a dedicated signal that users can fire if they want to snapshot something. The open question then is whether these "data tables" have to be reactive or can be reactive. We might just support both and leave it to the user to decide which is appropriate for their use case.
I built the original version of mesa_signals and have been looking at improving it over the last couple of days. But input is of course always welcome. For the record: I want to simplify the Computed/Computable stuff (not critical but might help). The main performance improvement most likely is to figure out how to replace the Observable descriptor with a property with closures (i.e., the property factory pattern you suggested earlier). |
|
I forced myself to articulate this idea of separating model state and data collection. Below is a quick sketch of what I mean with having an object that contains the model state you want to track. Your listener object now would just need to iterate over (a subset of) the defined data tables. A lot of your fast data-getting code could be moved into the respective DataSet objects. Moreover, this design is easily extensible: just define a new DataSet subclass. What is missing here is datasets for variable sets of agents, and the tensor idea that you had. The last one could just be a … The resulting API would be roughly:

```python
class MyModel(Model):
    def __init__(self, rng=None):
        super().__init__(rng=rng)
        self.model_output = ModelOutput()
        self.model_output.add_dataset(AgentDataSet("agent_data", self.agents, ["wealth", "health", "age"]))
        self.model_output.add_dataset(ModelDataSet("model_data", self, ["gini", "average_age"]))
        self.data_collector = CollectorListener()
```

A benefit of this approach is that these tables are now also available to others. For example, in Boltzmann, the Gini calculation can now rely on … @EwoutH, also curious to get your thoughts on this. The sketch for …:

```python
from .agent import AgentSet
from .model import Model


class DataSet:
    # follows AnyLogic
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields
        # internal datastructure

    @property
    def data(self):
        raise NotImplementedError


class AgentDataSet(DataSet):
    def __init__(self, name, agents: AgentSet, fields: str | callable | list[str | callable]):
        super().__init__(name, fields)
        self.agents = agents

    @property
    def data(self):
        # gets the data for the fields from the agents
        return ...


class ModelDataSet[M: Model](DataSet):
    def __init__(self, name, model: M, fields: str | callable | list[str | callable]):
        super().__init__(name, fields)
        self.model = model

    @property
    def data(self):
        # gets the data for the fields from the model
        return ...


class TableDataSet(DataSet):
    def __init__(self, name, fields: str | list[str]):
        super().__init__(name, fields)

    @property
    def data(self):
        # gets the data for the fields from the table
        return ...


class ModelOutput:
    def __init__(self):
        self.datasets = {}

    def add_dataset(self, dataset: DataSet):
        pass

    def create_dataset(self, dataset_type, name, fields, *args):
        pass

    def __getitem__(self, name: str):
        return self.datasets[name]
```
|
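As a runnable fleshing-out of the sketched `AgentDataSet.data` property: one plausible behavior (my assumption, not the PR's code) is that string fields are resolved via `getattr` and callables are applied per agent. `FakeAgent` below is a stand-in for a Mesa agent, for illustration only:

```python
class DataSet:
    def __init__(self, name, fields):
        self.name = name
        # normalize a single field spec into a list
        self.fields = fields if isinstance(fields, list) else [fields]


class AgentDataSet(DataSet):
    def __init__(self, name, agents, fields):
        super().__init__(name, fields)
        self.agents = agents

    @property
    def data(self):
        # One row per agent; string fields use getattr, callables are applied.
        rows = []
        for agent in self.agents:
            row = {}
            for field in self.fields:
                if callable(field):
                    row[field.__name__] = field(agent)
                else:
                    row[field] = getattr(agent, field)
            rows.append(row)
        return rows


class FakeAgent:  # stand-in for a mesa Agent, illustration only
    def __init__(self, wealth, age):
        self.wealth, self.age = wealth, age


agents = [FakeAgent(10, 30), FakeAgent(20, 40)]
ds = AgentDataSet("agent_data", agents, ["wealth", "age"])
print(ds.data)  # [{'wealth': 10, 'age': 30}, {'wealth': 20, 'age': 40}]
```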
|
This is a fantastic architectural pivot. I completely agree, and it also perfectly addresses the "Separation of Concerns" problem. Fast optimisations (or complex logic) now move to the Dataset.
By treating the
Also, I would suggest a cleaner API:

```python
class MoneyModel(Model):
    def __init__(self):
        super().__init__()
        self.data = DataRegistry(self)
        self.data.track_agents("agents", self.agents, ["wealth", "id"])
        self.data.track_model("model", ["gini", "step_count"])
        # (Future) Track Spatial/Tensor
        # self.data.track_grid("terrain", self.grid, "elevation")
        self.datacollector = CollectorListener()
```

For this, the user only needs `DataRegistry`:

```python
class DataRegistry:
    def __init__(self, model):
        self._model = model
        self._datasets = {}  # the internal storage

    def track_agents(self, name: str, source: AgentSet, reporters: list[str]) -> AgentDataSet:
        ds = AgentDataSet(name=name, agents=source, fields=reporters)
        self._datasets[name] = ds
        return ds

    def track_model(self, name: str = "model", reporters: list[str] | None = None) -> ModelDataSet:
        ds = ModelDataSet(name=name, model=self._model, fields=reporters)
        self._datasets[name] = ds
        return ds

    def __getattr__(self, name):
        if name in self._datasets:
            return self._datasets[name]
        raise AttributeError(f"No dataset named '{name}'")

    def __iter__(self):
        return iter(self._datasets.values())
```
|
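The attribute-style lookup suggested above can be exercised standalone. Below is a toy demonstration; `FakeDataSet` is a hypothetical stand-in for `AgentDataSet`/`ModelDataSet`, and the explicit `object.__getattribute__` guard is my addition to avoid recursion if `_datasets` is ever missing:

```python
class FakeDataSet:  # stand-in for AgentDataSet/ModelDataSet, illustration only
    def __init__(self, name, fields):
        self.name, self.fields = name, fields


class DataRegistry:
    def __init__(self, model):
        self._model = model
        self._datasets = {}

    def track_agents(self, name, source, reporters):
        ds = FakeDataSet(name, reporters)
        self._datasets[name] = ds
        return ds

    def __getattr__(self, name):
        # __getattr__ only fires for attributes not found normally,
        # so regular attribute access on _datasets elsewhere is unaffected.
        datasets = object.__getattribute__(self, "_datasets")
        if name in datasets:
            return datasets[name]
        raise AttributeError(f"No dataset named '{name}'")

    def __iter__(self):
        return iter(self._datasets.values())


registry = DataRegistry(model=None)
registry.track_agents("agents", source=[], reporters=["wealth", "id"])
print(registry.agents.fields)  # ['wealth', 'id']
```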
Something else occurred to me. At the moment, we treat agent attributes and their collection as two separate things. However, with property layers and the experimental continuous space, we have attributes that are views into an underlying numpy array. If we can capture that design in a property/descriptor style object, the data table and the agent-level attribute become the same thing. Regardless, I'll try to put in a minimum-working PR for the DataRegistry/ModelOutput idea, with a few basic tables that cover many of our use cases. |
|
Glad to see such an engaging conversation! I just finished #3155. From my perspective it’s production ready. I will dive into this tomorrow or Sunday, now my brain needs a bit of rest. |
|
Given the architectural shifts we discussed, this PR currently needs to be aligned with:
As soon as these foundational pieces are ready, I will return to this PR to refactor the implementation to match the new design. |
|
Thanks for working on this! I had a quick initial look, there are indeed some interesting ideas here. If I'm correct this is a hybrid design right? It listens to step events, and once those are observed it triggers a pull process where it gathers the specified data. So you could call this a "pull on signal" pattern? In general, I think proper user control of A) what gets collected B) when it gets collected and C) how it gets stored is very important. |
Yes that is indeed the design.
With #3156 and this, we separate this all more cleanly instead of trying to wrap it all into a single data collector. The |
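The "pull on signal" pattern confirmed above can be sketched in a few lines. The class names below are hypothetical illustrations, not the PR's code: the model pushes only a step event, and the collector then pulls the configured attributes:

```python
class StepSignalEmitter:
    """Toy model core that emits a signal after every step."""

    def __init__(self):
        self._listeners = []
        self.time = 0

    def subscribe(self, listener):
        self._listeners.append(listener)

    def step(self):
        self.time += 1
        for listener in self._listeners:
            listener(self)  # push: only the event, not the data


class PullOnSignalCollector:
    """On each step signal, pull the configured attributes from the model."""

    def __init__(self, model, fields):
        self.fields = fields
        self.rows = []
        model.subscribe(self.on_step)

    def on_step(self, model):
        # pull phase: gather the specified data at event time
        self.rows.append({f: getattr(model, f) for f in self.fields})


model = StepSignalEmitter()
collector = PullOnSignalCollector(model, ["time"])
model.step()
model.step()
print(collector.rows)  # [{'time': 1}, {'time': 2}]
```

This hybrid keeps the push side cheap (one callback per step) while leaving the what/when/how decisions on the pull side.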
03d2b7c to c964cb0
|
@codebreaker32 could you update the PR description based on the accumulated insights in the discussions and review process? Please take some time for this and don’t fully outsource it to an LLM, having clarity on the current direction and all the considerations behind it is extremely important. |
Sure, I understand the importance of it. |
|
Hi @EwoutH I have updated the PR description. Please check and tell if you want me to modify anything |
|
Thanks, appreciated! The discussed chaining is excluded from this PR right? What’s the plan for it, if any? Could you add one minimal example of going from the old data collection to the new one? |
It might make more sense to clarify this in the PR description of #3156
Sure |
I need to have #3284, and this one merged. Once both are in, I'll add a PR for the discussed chaining. It requires an update to the dataset protocol and implementations. |
EwoutH left a comment
Since everything is in the experimental space and this is in active development, no objections from me
I have added migration APIs in the PR description. I have one more suggestion: should we move BoltzmannWealth to the experimental section to use these new APIs? |
For the time being, we need to duplicate the data collection, since Solara does not use the new style yet. Not sure which examples would be good to move over. For benchmarking, Boltzmann is useful and it will illustrate the key components. It can also be done in a next PR. |
|
I have merged #3284. As a last request for this PR, can you tie the initial collection to this signal? |
Sure |
|
One more thing worth noting is:
|
I am not sure I follow. If I track the signals, I nicely see the time incrementing. So the first signal has as |
You are right, that's why I didn't call it a bug anywhere, but the time=0.0 signal is now emitted inside the … I noticed this behavior while modifying the tests for the new behavior. You can check yourself. MRE:

```python
from mesa import Model, Agent
from mesa.experimental.data_collection import DataRecorder


class SimpleAgent(Agent):
    def __init__(self, model, wealth):
        super().__init__(model)
        self.wealth = wealth


class SimpleModel(Model):
    def __init__(self):
        super().__init__()
        SimpleAgent.create_agents(self, 3, [10, 20, 30])
        self.data_registry.track_agents(self.agents, "agent_data", fields=["wealth"])


model = SimpleModel()
recorder = DataRecorder(model)

df = recorder.get_table_dataframe("agent_data")
print(f"No step called,\n {df.to_string()} \n")

model.step()
df = recorder.get_table_dataframe("agent_data")
print(f"Step called once:\n {df.to_string()}\n")

model.step()
df = recorder.get_table_dataframe("agent_data")
print(f"Step called twice:\n {df.to_string()}")
```

Output: |
|
This is the behaviour I would expect. Ideally, we would trigger the t=0 collect after completing the initialization of the model, but before returning the instantiated object. Unfortunately, there is no |
Agree, just felt the need to state it explicitly |
This PR introduces a decoupled, event-driven data collection architecture designed to handle large-scale simulations efficiently. It separates what to collect (`DataRegistry`, from #3156) from how to store it (`DataRecorder`).

The core design philosophy is the separation of concerns:

`Schedule` (see The future of data collection #1944 (comment), Proposal: Unified Time and Event Scheduling API #2921 (comment), and Add Schedule dataclass and refactor EventGenerator #3250)

Core Components & APIs
A. `BaseDataRecorder` (Abstract Base Class)

Handles the orchestration logic: subscription, interval checks, and lifecycle management.

- `__init__(model, config=...)`: Attaches to the model.
- `collect()`: Manually triggers data collection.
- `get_table_dataframe(name)`: Retrieves data as a pandas DataFrame.
- `summary()`: Returns stats and a summary.

B. `DatasetConfig` (Configuration)

Fine-grained control over when data is collected for each dataset.

- `interval`: Collection frequency (e.g., every 1 time unit, or every 0.5/2 time unit(s)).
- `start_time`/`end_time`: Define a specific collection window. (`start_time=0` explicitly means to collect the initial state.)
- `window_size`: Rolling buffer capacity (e.g., keep only the last 1000 snapshots).

Example Usage:
Included Recorders
This PR includes four implementations catering to different use cases:
i) `DataRecorder` (Default/Memory):
ii) `SQLDataRecorder` (SQLite):
iii) `ParquetDataRecorder` (Parquet):
iv) `JSONDataRecorder` (JSON):

Extensibility: Writing New Backends
The architecture makes adding new storage backends (e.g., MongoDB, HDF5, CSV) straightforward. Developers only need to inherit from BaseDataRecorder.
All interval logic, validation, and observable subscriptions are handled automatically by the base class.
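The inheritance pattern described above can be sketched as follows. The hook name `_write_snapshot` and the `ToyBaseRecorder` class are hypothetical illustrations, not the PR's actual API; the point is that a new backend only implements storage, while orchestration stays in the base class:

```python
import csv
import io


class ToyBaseRecorder:
    """Stand-in for BaseDataRecorder: owns orchestration, delegates storage."""

    def collect(self, table_name, rows):
        # interval checks / validation would live here in the real base class
        self._write_snapshot(table_name, rows)

    def _write_snapshot(self, table_name, rows):  # hypothetical storage hook
        raise NotImplementedError


class CSVDataRecorder(ToyBaseRecorder):
    """Example new backend: only the storage hook needs implementing."""

    def __init__(self):
        self.buffers = {}

    def _write_snapshot(self, table_name, rows):
        buf = self.buffers.setdefault(table_name, io.StringIO())
        writer = csv.writer(buf)
        for row in rows:
            writer.writerow(row)


recorder = CSVDataRecorder()
recorder.collect("agent_data", [[1, 10], [2, 20]])
print(recorder.buffers["agent_data"].getvalue())
```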
Migration
Old
New
Note: Please refer to #3156 for details on the registration API
Known Limitation
`Model.__init__`, it will miss the event entirely, because the signal fires before the user's `__init__` logic even begins.

To ensure the recorder captures the fully populated initial state, the `BaseDataRecorder` performs an immediate check upon instantiation. This imposes a strict ordering requirement on the user: the `DataRecorder` must be instantiated as the absolute last step of the `Model.__init__` method.

This is being solved in #3284.
Workaround: Users must manually invoke `recorder.finalize()` at the end of their execution block.

Note: Future work is expected to provide standard `RUN_STARTED` and `RUN_ENDED` events, which will automate this process (see #2921 (comment)).