The future of data collection #1944
Replies: 18 comments 68 replies
-
Here is an illustration of the API overlap with AgentSet:

```python
# without AgentSet
"gini": collect(model.agents, "wealth", function=calculate_gini)
# with AgentSet
"gini": lambda model: calculate_gini(model.agents.get("wealth"))

# without AgentSet
"n_quiescent": collect(model.get_agents_of_type(Citizen), "condition",
                       func=lambda x: len([entry for entry in x if entry == "Quiescent"]))
# with AgentSet
"n_quiescent": lambda model: len(model.agents.select(agent_type=Citizen,
                                                     filter_func=lambda a: a.condition == "Quiescent"))
```
-
Just for reference, this information is outdated. Python dictionaries used to be unordered. In Python 3.6 insertion order became an implementation detail of CPython (the reference implementation of Python), but since Python 3.7 insertion order is guaranteed, so it is perfectly fine to rely on it. That said, the mental model for dictionaries is still set-oriented (which I think is the right model). So I agree that it would be confusing if this works

```python
DataCollector(model, collectors={
    "wealth": collect(model.agents, "wealth"),
    "gini": collect("wealth", func=calculate_gini),
})
```

but this doesn't:

```python
DataCollector(model, collectors={
    "gini": collect("wealth", func=calculate_gini),
    "wealth": collect(model.agents, "wealth"),
})
```

So we would still have to work around this problem internally, which complicates the code. But I don't think we need tiered data collectors at all. I think they are a bit hard to understand and provide little benefit. As I understand it, they are basically a performance optimization so that you don't need to loop over all agents more than once. For small to medium models I don't think it's a problem at all. For larger models, or if you really do lots of simulation runs, yes, it can matter. But then a better solution would be to calculate your derived variables afterwards. That is, you just collect the wealth attribute, turn your data collection into a pandas DataFrame, and calculate the Gini coefficient from the DataFrame. That is probably even faster, because pandas can vectorize the calculations across all rows. This way you also don't mix any logic into your data collector. I think it is actually bad practice to calculate things in the data collector; it should basically be an observer. If you have the Gini coefficient in your model definition, feel free to collect it. Otherwise, calculate it as part of the data analysis. So for me the callable should only be used to filter your objects (e.g., only a certain type, or based on a condition)
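To illustrate the post-hoc approach: a minimal sketch (the long-format data and the `calculate_gini` helper below are my assumptions, not an existing Mesa API) that collects only the raw wealth values and derives the Gini coefficient per step from the resulting DataFrame afterwards:

```python
import pandas as pd


def calculate_gini(wealths):
    # hypothetical Gini helper based on the sorted-values formula
    s = sorted(wealths)
    n = len(s)
    cum = sum((i + 1) * w for i, w in enumerate(s))
    return (2 * cum) / (n * sum(s)) - (n + 1) / n


# hypothetical collected data: one row per (step, agent)
df = pd.DataFrame({
    "step":   [0, 0, 0, 1, 1, 1],
    "wealth": [1, 2, 3, 0, 3, 3],
})

# derive the Gini coefficient per step from the raw data afterwards
gini_per_step = df.groupby("step")["wealth"].apply(calculate_gini)
```

The data collector stays a pure observer; all derived quantities live in the analysis step.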
-
This is my summary of the problems in the current data collector, written for the rest of @projectmesa/maintainers. It needs your opinion so that this can happen in time, just before the 3.0 release. I think this should not be a GSoC 2024 project. Data collection problems:
-
I suggest we try to contain the discussion on DataCollection here rather than having it spread over multiple locations. I am getting confused trying to find all the useful ideas and discussions. So rather than respond in #1933, I'll respond here. In #1933, @rht wrote:
I am not entirely sure about this. Dataframes, for me, are associated with analyzing the results of a run. So, in my branch, Measures in my understanding are
So is State a single thing, or can it be multiple things? For example, an agent's position is clearly part of the agent (and, by extension, the model) state. However, most of the time, position will be some tuple. So, somewhere, we have to translate the position into its elements. Do we want to do this in Measure, which would imply having multiple "fields" in a measure, or do we handle this downstream wherever Measure is being used? I personally am inclined to handle this further downstream. To continue the position example, in data collection we might want to split position into x, y (and z). For visualization, however, this splitting might not be required. So, I am unsure if we need multiple attributes/functions on a Measure. Instead, in my current thinking, Measure always reflects a single state variable.
-
A quick update from my side. I have been trying to figure out a way to make it possible to access the value of Measure as an attribute. So the basic idea is that the following code works.

```python
class Measure:
    def __init__(self, group, attribute, function):
        self.group = group
        self.attribute = attribute
        self.function = function

    def get_value(self):
        return self.function(self.group.get(self.attribute))


class MyModel(Model):
    def __init__(self, *args, **kwargs):
        # some initialization code goes here
        self.gini = Measure(self.agents, "wealth", calculate_gini)


if __name__ == "__main__":
    model = MyModel()
    print(model.gini)  # should actually do model.gini.get_value()
```

This turns out to be not trivial. Consider this example:

```python
class Measure:
    def __init__(self, model, identifier, *args, **kwargs):
        self.model = model
        self.identifier = identifier

    def get_value(self):
        return getattr(self.model, self.identifier)


class MeasureDescriptor:
    def __set_name__(self, owner, name):
        self.public_name = name
        self.private_name = "_" + name

    def __get__(self, obj, owner):
        return getattr(obj, self.private_name).get_value()

    def __set__(self, obj, value):
        setattr(obj, self.private_name, value)


class Model:
    def __setattr__(self, name, value):
        if isinstance(value, Measure) and not name.startswith("_"):
            klass = type(self)
            descr = MeasureDescriptor()
            descr.__set_name__(klass, name)
            setattr(klass, name, descr)
            descr.__set__(self, value)
        else:
            super().__setattr__(name, value)

    def __init__(self, identifier, *args, **kwargs):
        self.gini = Measure(self, "identifier")
        self.identifier = identifier


if __name__ == "__main__":
    model1 = Model(1)
    model2 = Model(2)
    print(model1.gini)
    print(model2.gini)
```

To make this work, the `__setattr__` override has to install a `MeasureDescriptor` on the class the first time a `Measure` is assigned, because descriptors only function when defined at the class level, not on an instance. I hope this explanation is clear enough. I admit it is a bit convoluted. It is also one of the only ways I have been able to come up with so far that makes it possible for Measures to behave as if they are normal attributes. Please let me know what you think of this direction for implementing Measure, or whether the complexity is not worth it and we forego the idea of having Measure behave as if it is an attribute that returns a simple value (e.g., int, float, string).
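For comparison, a simpler alternative one could imagine (hypothetical `Measure`/`Model` classes, not the Mesa ones): instead of installing class-level descriptors, override `__getattribute__` so that a `Measure` stored on the instance is evaluated on access. The trade-off is a small overhead on every attribute lookup:

```python
class Measure:
    """Wraps a callable that is evaluated lazily against the model."""

    def __init__(self, function):
        self.function = function


class Model:
    """Resolves Measure attributes transparently on access."""

    def __getattribute__(self, name):
        value = object.__getattribute__(self, name)
        if isinstance(value, Measure):
            return value.function(self)  # evaluate per instance, on access
        return value


class MyModel(Model):
    def __init__(self, wealths):
        self.wealths = wealths
        self.total = Measure(lambda m: sum(m.wealths))


m1 = MyModel([1, 2, 3])
m2 = MyModel([10, 20])
print(m1.total)  # 6
print(m2.total)  # 30
```

Because the `Measure` lives on the instance, no class-level machinery is needed and each model keeps its own measures; whether the per-access cost is acceptable is a separate question.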
-
Thanks a lot for this. I think we're on the right track; I would only change the abstraction level on which this operates. There are basically the following problems:

Then there is the complication that you sometimes have an object whose members have attributes (like an AgentSet), and sometimes just an object with attributes directly (like a Model). So basically there are three levels that need to be defined:

You can already see how complicated this can possibly get. I will try to think about some possible abstractions, but feel free to build on this in the meantime.
-
On Group: I can see how groups can be used outside of the measure and data collection use case. They may be reused to organize agent step execution as well, e.g. if I want only the quiescent citizens in the Epstein civil violence model to take certain actions.

```python
def step(self):  # of a model
    # Instead of
    self.agents.select(agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent").do("rest")
    # we do
    self.quiescents.do("rest")
```

What about doing addition on the groups?

```python
# The drawback being this is not cacheable
(self.quiescents + self.injured_cops).do("rest")

# Needs to be
self.needs_rest = Group(self.quiescents + self.injured_cops)
self.needs_rest.do("rest")
```
-
The problem is an extension/detailing of point 6. Let me try to explain in a bit more detail one of the details I am currently stuck on. The basic idea of a Collector is that it retrieves one or more attributes from an object or collection of objects, and optionally applies a callable to it. The issue is that there is no way to specify the return type of this optional callable in the current design. This return type matters because it affects how data is stored in the collector and how it will be turned into a dataframe.

So, for example, we are retrieving … One idea I had after the conversation with @EwoutH is that the entire problem is analogous to, e.g., pandas.DataFrame.apply. In case of collecting data from a collection of objects and next applying a callable to it, the user should specify the "axis" over which this function will operate. If you operate over the "columns", you are aggregating the information across all objects, while if you operate over "rows", the function is applied to the collected data for each object separately. I hope this helps to clarify the issue.
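The analogy can be made concrete with plain pandas (illustrative data only): with agents as columns and steps as rows, `axis=1` aggregates across all agents within a step, while `axis=0` applies the function to each agent's series separately.

```python
import numpy as np
import pandas as pd

# hypothetical collected data: one row per step, one column per agent
df = pd.DataFrame({"agent_1": [3.0, 4.0], "agent_2": [5.0, 6.0]}, index=[0, 1])

# axis=1: aggregate across all agents, yielding one value per step
mean_per_step = df.apply(np.mean, axis=1)

# axis=0: apply the function per agent, aggregating over time
mean_per_agent = df.apply(np.mean, axis=0)
```

The same callable produces structurally different outputs depending on the axis, which is exactly the ambiguity a Collector would need to resolve.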
-
Played around a bit a few days ago. Now that we have our very powerful AgentSet, the API seems to be able to get simpler:

```python
datacollector = DataCollector(
    collectors=[
        c(target=Model, attributes=["n_agents"], methods=calculate_energy),
        c(target=Wolf, attributes=["sheep_eaten"]),
        c(target=Sheep, attributes=["age"], methods=calculate_energy),
        c(target=model.agents, attributes=["energy"], agg={"energy": np.mean}),
    ]
)
```

A few notes:

```python
c(target=Model, attributes=["n_agents"], methods=calculate_energy)
```

gives

```python
{
    f"{Model.__name__}_n_agents": {...},
    f"{Model.__name__}_{calculate_energy.__name__}": {...},
}
```

Just one approach. Don't know if it's the best.
-
I took another stab at working out an API and data storage format:

**Proposed API Design**

The core of the proposal is a unified `collect()` function:

```python
from mesa.datacollection import DataCollector, collect
import numpy as np

class WolfSheepModel(mesa.Model):
    def __init__(self, n_wolves=10, n_sheep=50, grass_regrowth_time=30):
        super().__init__()
        # [...model initialization...]

        # Initialize the data collector with various collectors
        self.datacollector = DataCollector([
            # Model-level attributes
            collect(target=self, attributes=["steps", "living_wolves", "living_sheep"]),

            # Agent type-specific collection
            collect(target=Wolf, attributes=["energy", "sheep_eaten"]),
            collect(target=Sheep, attributes=["energy", "grass_eaten"]),

            # Dynamic agent filtering
            collect(
                target=self.agents.select(lambda a: a.energy < 2),
                attributes=["energy", "pos"],
                name="starving_agents"
            ),

            # Aggregated metrics
            collect(
                target=self.agents,
                attributes=["energy"],
                aggregates={
                    "mean_energy": np.mean,
                    "energy_gini": self.calculate_gini
                }
            ),

            # Custom function
            collect(
                target=self,
                function=lambda m: self.calculate_spatial_density(),
                name="spatial_density"
            )
        ])
```

**Data Access**

```python
# Run the model
model = WolfSheepModel()
for _ in range(100):
    model.step()

# Get all data as a comprehensive DataFrame (long format)
all_data = model.datacollector.get_dataframe()
"""
Step  DataType    Entity  ID  Attribute      Value
0     model       Model   -   steps          0
0     model       Model   -   living_wolves  10
0     agents      Wolf    1   energy         20
0     aggregates  -       -   mean_energy    17.5
...
"""

# Get specific data with multi-index DataFrames
wolf_df = model.datacollector.get_dataframe(target=Wolf)
"""
         energy  sheep_eaten
Step ID
0    1   20      0
     2   18      0
...
"""

# Filter by attribute across all agent types
energy_data = model.datacollector.get_dataframe(attribute="energy")
"""
               energy
Step Type  ID
0    Wolf  1   20
     Sheep 3   15
...
"""

# Get dynamically filtered collections by name
starving_df = model.datacollector.get_dataframe(name="starving_agents")

# Get aggregated metrics
aggregates = model.datacollector.get_dataframe(data_type="aggregates")
"""
      mean_energy  energy_gini
Step
0     17.50        0.11
1     16.25        0.12
...
"""

# Additional filtering options
time_range_df = model.datacollector.get_dataframe(time_range=(10, 20))
long_format_df = model.datacollector.get_dataframe(format="long")
```

The multi-indexed DataFrames enable powerful analysis:

```python
# Average energy by agent type over time
energy_by_type = energy_data.groupby(level=["Step", "Type"]).mean()

# Calculate rate of change in wolf population
wolves_over_time = wolf_df.groupby(level="Step").size()
population_change = wolves_over_time.diff()
```

**Memory-Efficient Internal Structure**

The internal data structure is optimized to avoid string duplication and use arrays for efficient storage:

```python
{
    # Schema defined once - no string duplication
    "schema": {
        "Wolf": ["energy", "sheep_eaten"],
        "Sheep": ["energy", "grass_eaten"],
        "model": ["steps", "living_wolves", "living_sheep"],
        "aggregates": ["mean_energy", "energy_gini"]
    },

    # Data storage uses position-based arrays matching the schema
    "data": {
        1: {  # Timestep
            "model": [1, 8, 42],  # Values match schema positions
            "agents": {
                "Wolf": {
                    "ids": [1, 2, 3],
                    "values": [
                        [10, 2],  # Agent 1: [energy, sheep_eaten]
                        [12, 1],  # Agent 2: [energy, sheep_eaten]
                        [8, 0]    # Agent 3: [energy, sheep_eaten]
                    ]
                },
                "Sheep": {
                    "ids": [4, 5, 6],
                    "values": [
                        [5, 3],  # Agent 4: [energy, grass_eaten]
                        [6, 4],  # Agent 5: [energy, grass_eaten]
                        [4, 2]   # Agent 6: [energy, grass_eaten]
                    ]
                }
            },
            "aggregates": [7.5, 0.18]  # Values match schema positions
        }
    }
}
```

I hope to have found a balance with this API and data storage between flexibility, powerful features, collection and storage efficiency, and ease of use. Curious what everyone thinks.
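As a sanity check of the storage format, a small sketch (hypothetical helper, not part of the proposal) that expands the schema-plus-arrays structure into the long-format DataFrame shown under Data Access:

```python
import pandas as pd

# miniature version of the schema-based store described above
schema = {"Wolf": ["energy", "sheep_eaten"]}
data = {1: {"agents": {"Wolf": {"ids": [1, 2], "values": [[10, 2], [12, 1]]}}}}

# expand positional arrays back into one row per (step, agent, attribute)
rows = []
for step, payload in data.items():
    for agent_type, block in payload["agents"].items():
        for agent_id, values in zip(block["ids"], block["values"]):
            for attribute, value in zip(schema[agent_type], values):
                rows.append((step, agent_type, agent_id, attribute, value))

long_df = pd.DataFrame(rows, columns=["Step", "Type", "ID", "Attribute", "Value"])
```

This shows the attribute names really only need to be stored once in the schema; the long format is recoverable on demand.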
-
```python
class PredatorPreyModel(Model):
    def __init__(self, num_wolves=10, num_sheep=50, seed=None):
        super().__init__(seed=seed)
        # Create agents
        for _ in range(num_wolves):
            Wolf(self, age=self.random.randint(0, 10), energy=self.random.randint(50, 100))
        for _ in range(num_sheep):
            Sheep(self, age=self.random.randint(0, 5), energy=self.random.randint(50, 100))
        # Setup a datacollector with empty model_reporters
        # We'll manually add values to it in our collect_data method
        self.datacollector = DataCollector()

    def collect_data(self):
        """Calculate and collect all model metrics."""
        # Get agent sets we need
        wolves = self.agents.select(agent_type=Wolf)
        sheep = self.agents.select(agent_type=Sheep)
        mature_wolves = wolves.select(lambda a: a.age > 5)

        # Calculate all metrics
        metrics = {
            "Wolves": len(wolves),
            "Sheep": len(sheep),
            "Mature Wolves": len(mature_wolves),
            "Average Wolf Energy": wolves.agg("energy", np.mean) if wolves else 0,
            "Total Sheep Wool": sheep.agg("wool", sum) if sheep else 0
        }

        # Add current step's metrics to datacollector
        for key, value in metrics.items():
            if key not in self.datacollector.model_vars:
                self.datacollector.model_vars[key] = []
            self.datacollector.model_vars[key].append(value)

    def step(self):
        # Model logic
        hungry_wolves = self.agents.select(
            agent_type=Wolf,
            filter_func=lambda a: a.energy < 30
        )
        sheep = self.agents.select(agent_type=Sheep)
        for wolf in hungry_wolves:
            if sheep:
                prey = self.random.choice(list(sheep))
                wolf.energy += prey.energy * 0.5
                wolf.kills += 1
                prey.remove()
        self.agents.shuffle_do("step")
        # Collect data
        self.collect_data()
```

Once you stop defining everything in your Model init, you can do dynamic selection and aggregation so much easier.
-
I know this is not the appropriate place to ask this, but since this thread concerns API development, I wanted to ask: what is the procedure for defining a new API?
-
Quick thought I originally raised in my Mesa-Frames proposal, but wanted to mention here as well: event-based data collection, where data is collected only when defined conditions are met, like:

Although the initial idea was to reduce redundant data storage at mesa-frames scales, I think it would be useful here as well. Example:
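A minimal sketch of what condition-gated collection could look like (all names are hypothetical, purely illustrative):

```python
class ConditionalCollector:
    """Records a value only when a predicate on the model holds."""

    def __init__(self, condition, extractor):
        self.condition = condition  # model -> bool
        self.extractor = extractor  # model -> value
        self.records = []           # list of (time, value)

    def collect(self, model, time):
        if self.condition(model):
            self.records.append((time, self.extractor(model)))


class FakeModel:
    def __init__(self):
        self.n_agents = 10


model = FakeModel()
# only record the population while it is below 9
collector = ConditionalCollector(lambda m: m.n_agents < 9, lambda m: m.n_agents)
for t in range(5):
    collector.collect(model, t)
    model.n_agents -= 1
```

Steps where the condition fails leave no record at all, which is where the storage savings at scale would come from.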
-
So I threw this whole conversation into Gemini 2.5 Pro, and it came up with this. I think this is the best API I've seen so far. Curious what everybody thinks. If it looks good I will try to move forward with an implementation.

**Synthesizing a Declarative API**

Hi all,

This has been a really valuable (and extensive!) discussion on the future of data collection in Mesa. Reading through it, particularly the recent ideas from @EwoutH and the summary of requirements from @rht, I wanted to try and synthesize a potential path forward that aims to capture the best aspects discussed, focusing on a declarative API. While the manual approach shown earlier is powerful, perhaps we can achieve that power (especially regarding dynamic sets and aggregations) within a more structured, declarative API centered around a `collect()` specification.

**Proposed Core Idea:** Initialize the `DataCollector` with a list of `collect()` declarations:

```python
# Illustrative Example API
from mesa.datacollection import DataCollector, collect
import numpy as np

# Assume Wolf, Sheep, Citizen Agent classes and calculate_gini defined elsewhere

class MyModel(mesa.Model):
    def __init__(self, ...):
        super().__init__()
        # ... model setup ...
        self.datacollector = DataCollector([
            # --- Model Level ---
            collect(name="step_count", target=self, attributes=["schedule.steps"]),
            collect(name="total_wealth", target=self,
                    function=lambda m: sum(a.wealth for a in m.agents)),

            # --- Agent Type Level (Resolved at runtime) ---
            collect(name="wolf_data", target=Wolf, attributes=["energy", "kills"]),
            # Example: apply function per agent (output stored per agent ID)
            collect(name="sheep_age_category", target=Sheep,
                    function=lambda agent: "lamb" if agent.age < 2 else "adult",
                    apply_level="agent"),  # Hint needed for per-agent application

            # --- Dynamic AgentSet Level (Lambda evaluated at runtime) ---
            collect(name="starving_agents_pos",
                    target=lambda m: m.agents.select(lambda a: a.energy < 10),
                    attributes=["pos"]),  # Collects pos attribute for matching agents
            collect(name="avg_starving_energy",
                    target=lambda m: m.agents.select(lambda a: a.energy < 10),
                    function=lambda agent_set: agent_set.agg("energy", np.mean)),  # Aggregate over dynamic set

            # --- Aggregation Focused ---
            collect(name="energy_stats", target=self.agents,  # Target can be an AgentSet
                    attributes=["energy"],  # Base attribute(s) to collect first
                    aggregates={  # Dictionary: output_name -> func(list_of_values)
                        "mean_energy": np.mean,
                        "median_energy": np.median,
                        "energy_gini": calculate_gini  # Assumes calculate_gini takes a list
                    }),
            # Direct aggregation on an AgentSet (e.g., count) without collecting individuals first
            collect(name="quiescent_count", target=Citizen,  # Agent Type target
                    function=lambda agentset: agentset.select(lambda a: a.condition == "Quiescent").count),  # agent_set passed to function

            # --- Conditional Collection (Addresses @Ben-geo's point) ---
            collect(name="periodic_wolf_aggression", target=Wolf, attributes=["aggression"],
                    trigger=lambda model: model.schedule.steps % 10 == 0)  # Only collect every 10 steps
        ])

    def step(self):
        # ... step logic ...
        self.datacollector.collect(self)  # Pass model instance to process collectors
```

**Key Components of `collect()`:** …

**How this Addresses Key Problems:** …

**Internal Storage & Retrieval:** @EwoutH's schema-based internal storage idea seems excellent for efficiency.

**Conclusion:** This synthesized …
-
A long time ago I started implementing a database data collector for Mesa as an extension. However, since it did not perform well compared to outputting DataFrames as pickles (and importing these into the DB), I haven't finalised and shared the project (yet). Anyway, it might be useful for (smaller) simulations, and/or there might be ideas on improving performance. Let me know what you think; I'm happy to contribute: https://github.com/UniK-INES/mesa_dbdatacollection
-
Potential GSoC 2026 project: #2927 (comment)
-
The future of data collection has been picked up again with #3156 and #3145. The emerging design is to separate the state of part of the model at a given time instant from the taking of snapshots of this state over time. To this end, #3156 introduces the idea of a … The main advantages of this emerging design are:

There are still some open issues:
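The separation of state from snapshotting can be sketched in a few lines (hypothetical names; the actual classes introduced in #3156 differ):

```python
class Observable:
    """Describes a piece of model state, independent of when it is recorded."""

    def __init__(self, getter):
        self.getter = getter  # model -> current value

    def value(self, model):
        return self.getter(model)


class Recorder:
    """Takes snapshots of registered observables over time."""

    def __init__(self, observables):
        self.observables = observables  # name -> Observable
        self.history = {name: [] for name in observables}

    def snapshot(self, model, time):
        for name, obs in self.observables.items():
            self.history[name].append((time, obs.value(model)))


class ToyModel:
    def __init__(self):
        self.wealth = 0


model = ToyModel()
rec = Recorder({"wealth": Observable(lambda m: m.wealth)})
for t in range(3):
    rec.snapshot(model, t)
    model.wealth += 5
```

The key point is that "what the state is" and "when it gets recorded" are defined in different places, so either can be changed independently.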
-
As you know by now I love a clean API, so that's exactly what I took another stab at, now based on all the ideas and insights of #3145. It's also heavily based on / inspired by our … The proposal below is the result of some brainstorming with Claude 4.6 Opus.

**Building on #3145**

Some very useful discussions and prototyping took place in #3145. This is how this proposal builds on it:

What this proposal keeps:

Where's room for potential improvement:

**Design Principles**

**The Proposed API**

Core idea:
| Arguments provided | Inferred type | What it creates |
|---|---|---|
| `track("wealth", source=AgentType)` | Agent attribute | `AgentDataSet` tracking `wealth` on all agents of that type |
| `track("wealth", source=agent_set)` | Agent attribute | `AgentDataSet` tracking `wealth` on that specific AgentSet |
| `track("gini", fn=callable)` | Model reporter | `ModelDataSet` calling `fn(model)` |
| `track("gini")` (no source, no fn) | Model attribute | `ModelDataSet` reading `model.gini` (property or attribute) |
| `track(["a", "b"], source=X)` | Multi-field agent | `AgentDataSet` tracking multiple fields |
| `track(["a", "b"], fn=[f1, f2])` | Multi-field model | `ModelDataSet` with multiple reporters |
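The inference rules in the table above can be sketched as follows (hypothetical helper, purely illustrative of the dispatch logic, not the proposed implementation):

```python
def infer_kind(fields, source=None, fn=None):
    """Infer which dataset kind a track() call would create."""
    multi = isinstance(fields, list)
    if source is not None:
        # agent type or AgentSet given: track attributes on agents
        return "multi-agent" if multi else "agent"
    if fn is not None:
        # callable(s) given: model reporter(s)
        return "multi-model" if multi else "model-reporter"
    # bare name: read it straight off the model
    return "model-attribute"
```

A usage sketch, mirroring the table rows:

```python
infer_kind("wealth", source=int)        # "agent"
infer_kind("gini", fn=len)              # "model-reporter"
infer_kind("gini")                      # "model-attribute"
infer_kind(["a", "b"], source=int)      # "multi-agent"
```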
**The `name` parameter for disambiguation**

```python
# If you track the same field from different agent types:
self.data.track("energy", source=Wolf, name="wolf_energy")
self.data.track("energy", source=Sheep, name="sheep_energy")
```

If `name` is not provided, it defaults to:

- The field name for single fields: `"wealth"`
- `"{AgentType.__name__}_{field}"` if source is an agent type: `"Wolf_energy"`
- The explicit name for multi-field: must be provided
**4. Numpy fast path (opt-in)**

For performance-critical models with fixed agent populations:

```python
# Explicit opt-in to numpy-backed storage
self.data.track("wealth", source=MoneyAgent, backend="numpy", n=n)
```

This creates a `NumpyAgentDataSet` internally. The user doesn't need to know about different dataset classes; they just say `backend="numpy"`.
**5. Event/table logging (sparse events)**

For data that doesn't fit the snapshot model (agent death, interactions):

```python
class Wolf(Agent):
    def die(self):
        # Log an event at the moment it happens
        self.model.data.log("deaths", agent_id=self.unique_id, cause="starvation")
        self.remove()

# Retrieve later
deaths_df = model.data["deaths"]  # → DataFrame with [agent_id, cause, time]
```

`log()` is for sparse, event-driven data; `track()` is for periodic snapshots. This cleanly replaces the old DataCollector tables.
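The mechanics behind a `log()` call could be as simple as an append-only table with the model time stamped automatically (a sketch with hypothetical names, not the proposed implementation):

```python
class EventLog:
    """Append-only event tables, stamped with the current model time."""

    def __init__(self, clock):
        self.clock = clock  # callable returning the current model time
        self.tables = {}    # table name -> list of row dicts

    def log(self, table_name, **fields):
        row = dict(fields, time=self.clock())
        self.tables.setdefault(table_name, []).append(row)


t = 0
log = EventLog(lambda: t)
log.log("deaths", agent_id=7, cause="starvation")
t = 3
log.log("deaths", agent_id=9, cause="predation")
```

Because rows only exist when events fire, this storage stays proportional to the number of events, not the number of steps.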
**6. Storage backends**

Default: in-memory (no configuration needed)

```python
self.data.track("wealth", source=MoneyAgent)
# Just works, stored in memory
```

Swap backend globally:

```python
from mesa.experimental.data_collection import SQLBackend, ParquetBackend

class MyModel(Model):
    def __init__(self):
        super().__init__()
        self.data.backend = SQLBackend(db_path="results.db")
        # OR
        self.data.backend = ParquetBackend(output_dir="./data")
        # All subsequent track() calls use this backend
        self.data.track("wealth", source=MoneyAgent)
```

Per-dataset backend override:

```python
self.data.track("wealth", source=MoneyAgent)  # uses default (memory)
self.data.track("big_data", source=MoneyAgent, backend=ParquetBackend("./out"))
```

Custom backends implement a simple protocol:
```python
class StorageBackend(Protocol):
    def initialize(self, name: str, dataset_type: str, columns: list[str]) -> None: ...
    def store(self, name: str, time: float, data: Snapshot) -> None: ...
    def retrieve(self, name: str) -> pd.DataFrame: ...
    def clear(self, name: str | None = None) -> None: ...
```

Where `Snapshot` is a tagged union / dataclass:

```python
@dataclass
class Snapshot:
    """Typed snapshot — backend authors don't need to pattern-match on Any."""
    kind: Literal["agent", "model", "numpy", "event"]
    data: np.ndarray | list[dict] | dict
    columns: list[str]
    agent_ids: np.ndarray | None = None
```

**7. Complete API reference for `model.data`**
```python
class DataCollectionManager:
    """The unified interface at model.data"""

    # === Registration (what to collect) ===
    def track(
        self,
        fields: str | list[str],
        *,
        source: type[Agent] | AgentSet | None = None,
        fn: Callable | list[Callable] | None = None,
        name: str | None = None,
        schedule: Schedule | None = None,  # default: every step from t=0
        window: int | None = None,         # sliding window size
        backend: StorageBackend | Literal["numpy"] | None = None,
        n: int | None = None,              # required if backend="numpy"
    ) -> Dataset:
        """Register data for periodic collection. Returns the Dataset for chaining."""
        ...

    def log(
        self,
        table_name: str,
        **fields,
    ) -> None:
        """Log a single event row (sparse, non-periodic data)."""
        ...

    # === Retrieval ===
    def __getitem__(self, name: str) -> pd.DataFrame:
        """model.data["wealth"] → DataFrame"""
        ...

    def to_dataframes(self) -> dict[str, pd.DataFrame]:
        """Get all tracked data as DataFrames."""
        ...

    # === Control ===
    def collect(self) -> None:
        """Force collection of all due datasets now (rarely needed)."""
        ...

    def finalize(self) -> None:
        """Capture final snapshot. Called automatically by model.close()."""
        ...

    def enable(self, name: str) -> None: ...
    def disable(self, name: str) -> None: ...
    def clear(self, name: str | None = None) -> None: ...

    # === Diagnostics ===
    @property
    def summary(self) -> dict[str, Any]: ...

    # === Backend ===
    @property
    def backend(self) -> StorageBackend: ...

    @backend.setter
    def backend(self, value: StorageBackend) -> None: ...
```

**8. Migration from old DataCollector**
```python
# ─── OLD ───
self.datacollector = DataCollector(
    model_reporters={
        "Gini": lambda m: compute_gini(m),
        "Population": lambda m: len(m.agents),
    },
    agent_reporters={"wealth": "wealth"},
    agenttype_reporters={
        Wolf: {"kills": "kills_count"},
        Sheep: {"distance": "total_flight_distance"},
    },
)
# In step():
self.datacollector.collect(self)

# ─── NEW ───
self.data.track("wealth", source=self.agents)              # replaces agent_reporters
self.data.track("gini", fn=lambda m: compute_gini(m))      # replaces model_reporters
self.data.track("population", fn=lambda m: len(m.agents))
self.data.track("kills_count", source=Wolf)                # replaces agenttype_reporters
self.data.track("total_flight_distance", source=Sheep)
# No manual collect() needed!
```

Retrieving data:

```python
# OLD
model.datacollector.get_model_vars_dataframe()
model.datacollector.get_agent_vars_dataframe()

# NEW
model.data["gini"]
model.data["wealth"]
model.data.to_dataframes()
```

**9. Design comparison: Current PR vs this proposal**
| Aspect | Current PR | This proposal |
|---|---|---|
| Objects user must learn | 3 (`DataRegistry`, `DatasetConfig`, `DataRecorder`) | 1 (`model.data`) |
| Lines for basic setup | ~8-12 | ~2-3 |
| String key coupling | Yes (registry keys must match config keys) | No (single `track()` call) |
| Lambda support | No (must use `@property`) | Yes (`fn=lambda m: ...`) |
| Schedule reuse | Separate `DatasetConfig` class | Shared `Schedule` from event system |
| Backend swapping | Change class instantiation | Change `model.data.backend = ...` |
| Event logging | Not addressed (`TableDataSet` skipped) | `model.data.log()` |
| Final snapshot | Not handled | `model.data.finalize()` + auto in `model.close()` |
| Typed snapshots for backends | `Any` with pattern matching | `Snapshot` dataclass |
| Auto-collection | ✅ | ✅ |

**10. Implementation sketch: how `model.data` gets created**
```python
# In Model.__init__():
class Model:
    def __init__(self, ...):
        ...
        self.data = DataCollectionManager(self)
```

`DataCollectionManager` internally holds:

- A dict of `Dataset` objects (the registry part)
- A `StorageBackend` (the recorder part)
- A subscription to `model.time` changes
The existing `DataRegistry` and `BaseDataRecorder` code can be refactored to live behind this facade. The internal architecture stays clean; only the user-facing API changes.
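As an illustration of the backend protocol, a minimal in-memory backend could look like this (a sketch under the assumption that `data` arrives as a plain list of values matching `columns`; the proposal's full `Snapshot` handling is omitted):

```python
import pandas as pd


class MemoryBackend:
    """Hypothetical minimal backend: stores rows in plain Python lists."""

    def __init__(self):
        self._rows = {}     # name -> list of row dicts
        self._columns = {}  # name -> column names

    def initialize(self, name, dataset_type, columns):
        self._rows[name] = []
        self._columns[name] = columns

    def store(self, name, time, data):
        # `data` is assumed here to be a list of values matching `columns`
        row = {"time": time, **dict(zip(self._columns[name], data))}
        self._rows[name].append(row)

    def retrieve(self, name):
        return pd.DataFrame(self._rows[name])

    def clear(self, name=None):
        for key in [name] if name else list(self._rows):
            self._rows[key] = []


backend = MemoryBackend()
backend.initialize("wealth", "model", ["gini", "mean_wealth"])
backend.store("wealth", 0, [0.25, 3.5])
backend.store("wealth", 1, [0.30, 3.2])
df = backend.retrieve("wealth")
```

Because the protocol is this small, swapping in a SQL or Parquet variant only means reimplementing these four methods.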
**Summary of the key design choices and why**

- **Single entry point (`model.data`):** The current PR asks users to understand `DataRegistry`, `DatasetConfig`, and `DataRecorder` as separate concepts, wire them together with matching string keys, and remember not to call `.collect()`. That's a lot of cognitive load for what is fundamentally "I want to save my data." One object that does `track`, `log`, and `data["name"]` is dramatically simpler.
- **`track()` with source inference:** Instead of `track_agents()`, `track_model()`, `track_agents_numpy()` as separate methods, one `track()` method infers the dataset type from its arguments. This follows the principle that users think in terms of what they want to track ("wealth of wolves"), not how the framework should store it internally.
- **Keep lambdas working:** The current PR forces model reporters into `@property` methods. That's fine for complex logic, but for `lambda m: len(m.agents)` it's unnecessary boilerplate. Supporting both `fn=callable` and bare attribute names preserves flexibility.
- **Reuse `Schedule` instead of `DatasetConfig`:** Since #3250 (Add Schedule dataclass and refactor EventGenerator) already merged a `Schedule` dataclass for event scheduling, data collection should use the same object. One scheduling vocabulary across the framework. `DatasetConfig` as a separate class with slightly different parameter names (`start_time` vs `start`) is a source of confusion.
- **Typed `Snapshot` instead of `Any`:** Backend authors currently have to write `match data: case np.ndarray(): ... case list(): ... case dict(): ...` in every method. A typed `Snapshot` dataclass with a `kind` discriminator makes this explicit and documentable.
- **`log()` for event data:** The PR explicitly skips `TableDataSet` in the recorder. But event logging (agent deaths, interactions) is a real use case. A dedicated `log()` method with a simple `**kwargs` interface handles this cleanly and replaces the old DataCollector table feature.
- **`finalize()` tied to `model.close()`:** This directly addresses quaquel's concern about missing the final state. Rather than debating signal design now, a simple `finalize()` method that `model.close()` calls is pragmatic and unblocking.
-
There has been quite some discussion in various places about changing data collection. This is my attempt to think this through in some more detail. It is heavily inspired by a suggestion by @Corvince at some point.
In the general case, data collection is taking an object and extracting from this one or more attributes, and optionally applying a callable to this. It might involve only an object and a callable applied to this object in specific cases. This object can be the model, an agent, an agentset, a space, or some user-defined class.
So, it seems sensible to create a separate Collector class that implements this basic logic. Because the behavior of AgentSet is a bit different from other objects (i.e., `AgentSet.get` instead of relying on `getattr`), I believe it makes sense to have two Collector classes: BaseCollector and AgentSetCollector (PEP 20, flat is better than nested). Rather than burden the user with this distinction, it is possible to use a factory function (e.g., `collect(obj, attrs, func=None)`) to create the appropriate Collector instance.

Ideally, data should only be extracted once. So, in the case of the Boltzmann wealth model, the data collector should be smart enough to extract the `wealth` attribute only once from the agentset. This can relatively easily be realized by maintaining an internal mapping of all objects and the attributes to be retrieved from them. Moreover, it might be possible to extract all relevant attributes from a given object in one go to avoid unnecessary iteration. This would, however, require a minor update to `AgentSet.get` so that `attr_name` takes a string or list of strings.

I believe it is possible to design and implement this new-style DataCollector so that the current one can be implemented on top of it for backward compatibility.
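The internal mapping idea can be sketched like this (a toy `AgentSet`/`Agent`, not Mesa's, assuming the proposed list-of-strings extension to `get`):

```python
from collections import defaultdict


class AgentSet(list):
    def get(self, attr_names):
        # accepts a string or a list of strings, mirroring the proposed
        # extension to AgentSet.get
        if isinstance(attr_names, str):
            return [getattr(a, attr_names) for a in self]
        return [[getattr(a, n) for n in attr_names] for a in self]


class Agent:
    def __init__(self, wealth, age):
        self.wealth = wealth
        self.age = age


agents = AgentSet([Agent(3, 10), Agent(5, 20)])

# collectors registered separately for "wealth" and "age"...
requests = [(agents, "wealth"), (agents, "age")]

# ...but merged per source object, so each object is queried only once
merged = defaultdict(list)
for obj, attr in requests:
    merged[id(obj)].append(attr)

obj_by_id = {id(agents): agents}
results = {oid: obj_by_id[oid].get(attrs) for oid, attrs in merged.items()}
```

One pass over the agentset yields all requested attributes, instead of one pass per registered collector.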
Like with the current DataCollector, data collection should happen whenever `data_collector.collect` is called. However, I believe it is paramount that the data collector also always extracts the current simulation time. Only by having the simulation time for each call to `collect` can you produce a clean and complete time series of the dynamics of the model over time. In fact, these time stamps could become part of the index/column labels of the DataFrames when turning the retrieved data into a DataFrame.

Like with the current DataCollector, it should be easy to turn any retrieved data into a DataFrame. This can easily be done through a `to_dataframe` method on the Collector class.

So, what could the resulting API look like?

So, does the basic idea of object, retrieval of one or more attributes, and/or applying a callable make sense? Have I missed a key concern? Is there something obviously wrong or missing in the sketch of the API?