The future of data collection #1944
Replies: 18 comments 68 replies
-
Here is an illustration of the API overlap with AgentSet:

```python
# without AgentSet
"gini": collect(model.agents, "wealth", function=calculate_gini)
# with AgentSet
"gini": lambda model: calculate_gini(model.agents.get("wealth"))

# without AgentSet
"n_quiescent": collect(model.get_agents_of_type(Citizen), "condition",
                       func=lambda x: len([entry for entry in x if entry == "Quiescent"]))
# with AgentSet
"n_quiescent": lambda model: len(model.agents.select(agent_type=Citizen,
                                                     filter_func=lambda a: a.condition == "Quiescent"))
```
-
Just for reference, this information is outdated. Python dictionaries used to be unordered. In Python 3.6 insertion order became an implementation detail of CPython (the reference implementation of Python), but since Python 3.7 insertion order is guaranteed, so it is perfectly fine to rely on it. That said, the mental model for dictionaries is still set-oriented (which I think is the right model). So I agree that it would be confusing if this works

```python
DataCollector(model, collectors={
    "wealth": collect(model.agents, "wealth"),
    "gini": collect("wealth", func=calculate_gini),
})
```

but this doesn't:

```python
DataCollector(model, collectors={
    "gini": collect("wealth", func=calculate_gini),
    "wealth": collect(model.agents, "wealth"),
})
```

So we would still have to work around this problem internally, which complicates the code. But I don't think we need tiered data collectors at all. I think they are a bit hard to understand and provide little benefit. As I understand it, they are basically a performance optimization so that you don't need to loop over all agents more than once. For small to medium models I don't think it's a problem at all. For larger models, or if you really do lots of simulation runs, yes, it can matter. But then a better solution would be to calculate your derived variables afterwards. That is, you just collect the wealth attribute, turn your data collection into a pandas DataFrame, and calculate the Gini coefficient from the DataFrame. That is probably even faster, because pandas can vectorize the calculations across all rows. This way you also don't mix any logic into your data collector. I think it is actually bad practice to calculate things in the data collector; it should basically be an observer. If you have the Gini coefficient in your model definition, feel free to collect it. Otherwise, calculate it as part of the data analysis. So for me the callable should only be used to filter your objects (e.g., only a certain type, or based on a condition)
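To illustrate the post-hoc approach: a minimal sketch (the long-format data and the `calculate_gini` helper below are my assumptions, not an existing Mesa API) that collects only the raw wealth values and derives the Gini coefficient per step from the resulting DataFrame afterwards:

```python
import pandas as pd


def calculate_gini(wealths):
    # hypothetical Gini helper based on the sorted-values formula
    s = sorted(wealths)
    n = len(s)
    cum = sum((i + 1) * w for i, w in enumerate(s))
    return (2 * cum) / (n * sum(s)) - (n + 1) / n


# hypothetical collected data: one row per (step, agent)
df = pd.DataFrame({
    "step":   [0, 0, 0, 1, 1, 1],
    "wealth": [1, 2, 3, 0, 3, 3],
})

# derive the Gini coefficient per step from the raw data afterwards
gini_per_step = df.groupby("step")["wealth"].apply(calculate_gini)
```

The data collector stays a pure observer; all derived quantities live in the analysis step.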
-
This is my summary of the problems in the current data collector, written for the rest of @projectmesa/maintainers. It needs your opinion so that this can happen in time, just before the 3.0 release. I think this should not be a GSoC 2024 project. Data collection problems:
-
I suggest we try to contain the discussion on DataCollection here rather than having it spread over multiple locations. I am getting confused trying to find all the useful ideas and discussions. So rather than respond in #1933, I'll respond here. In #1933, @rht wrote:
I am not entirely sure about this. Dataframes, for me, are associated with analyzing the results of a run. So, in my branch, Measures in my understanding are
So is State a single thing, or can it be multiple things? For example, an agent's position is clearly part of the agent (and, by extension, the model) state. However, most of the time, position will be some tuple. So, somewhere, we have to translate the position into its elements. Do we want to do this in Measure, which would imply having multiple "fields" in a measure, or do we handle this downstream wherever Measure is being used? I personally am inclined to handle this further downstream. To continue the position example, in data collection we might want to split position into x, y (and z). For visualization, however, this splitting might not be required. So, I am unsure if we need multiple attributes/functions on a Measure. Instead, in my current thinking, Measure always reflects a single state variable.
-
A quick update from my side. I have been trying to figure out a way to make it possible to access the value of Measure as an attribute. So the basic idea is that the following code works.

```python
class Measure:
    def __init__(self, group, attribute, function):
        self.group = group
        self.attribute = attribute
        self.function = function

    def get_value(self):
        return self.function(self.group.get(self.attribute))


class MyModel(Model):
    def __init__(self, *args, **kwargs):
        # some initialization code goes here
        self.gini = Measure(self.agents, "wealth", calculate_gini)


if __name__ == "__main__":
    model = MyModel()
    print(model.gini)  # should actually do model.gini.get_value()
```

This turns out to be not trivial. Consider this example:

```python
class Measure:
    def __init__(self, model, identifier, *args, **kwargs):
        self.model = model
        self.identifier = identifier

    def get_value(self):
        return getattr(self.model, self.identifier)


class MeasureDescriptor:
    def __set_name__(self, owner, name):
        self.public_name = name
        self.private_name = "_" + name

    def __get__(self, obj, owner):
        return getattr(obj, self.private_name).get_value()

    def __set__(self, obj, value):
        setattr(obj, self.private_name, value)


class Model:
    def __setattr__(self, name, value):
        if isinstance(value, Measure) and not name.startswith("_"):
            klass = type(self)
            descr = MeasureDescriptor()
            descr.__set_name__(klass, name)
            setattr(klass, name, descr)
            descr.__set__(self, value)
        else:
            super().__setattr__(name, value)

    def __init__(self, identifier, *args, **kwargs):
        self.gini = Measure(self, "identifier")
        self.identifier = identifier


if __name__ == "__main__":
    model1 = Model(1)
    model2 = Model(2)
    print(model1.gini)
    print(model2.gini)
```

To make this work, the `__setattr__` override has to install a `MeasureDescriptor` on the class the first time a `Measure` is assigned, because descriptors only function when defined at the class level, not on an instance. I hope this explanation is clear enough. I admit it is a bit convoluted. It is also one of the only ways I have been able to come up with so far that makes it possible for Measures to behave as if they are normal attributes. Please let me know what you think of this direction for implementing Measure, or whether the complexity is not worth it and we forego the idea of having Measure behave as if it is an attribute that returns a simple value (e.g., int, float, string).
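For comparison, a simpler alternative one could imagine (hypothetical `Measure`/`Model` classes, not the Mesa ones): instead of installing class-level descriptors, override `__getattribute__` so that a `Measure` stored on the instance is evaluated on access. The trade-off is a small overhead on every attribute lookup:

```python
class Measure:
    """Wraps a callable that is evaluated lazily against the model."""

    def __init__(self, function):
        self.function = function


class Model:
    """Resolves Measure attributes transparently on access."""

    def __getattribute__(self, name):
        value = object.__getattribute__(self, name)
        if isinstance(value, Measure):
            return value.function(self)  # evaluate per instance, on access
        return value


class MyModel(Model):
    def __init__(self, wealths):
        self.wealths = wealths
        self.total = Measure(lambda m: sum(m.wealths))


m1 = MyModel([1, 2, 3])
m2 = MyModel([10, 20])
print(m1.total)  # 6
print(m2.total)  # 30
```

Because the `Measure` lives on the instance, no class-level machinery is needed and each model keeps its own measures; whether the per-access cost is acceptable is a separate question.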
-
Thanks a lot for this. I think we're on the right track; I would only change the abstraction level on which this operates. There are basically the following problems:

Then there is the complication that you sometimes have an object whose members have attributes (like an AgentSet), and sometimes just an object with attributes directly (like a Model). So basically there are three levels that need to be defined:

You can already see how complicated this can possibly get. I will try to think about some possible abstractions, but feel free to build on this in the meantime.
-
On Group: I can see how groups can be used outside of the measure and data collection use case. They may be reused to organize agent step execution as well, e.g. if I want only the quiescent citizens in the Epstein civil violence model to take certain actions.

```python
def step(self):  # of a model
    # Instead of
    self.agents.select(agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent").do("rest")
    # we do
    self.quiescents.do("rest")
```

What about doing addition on the groups?

```python
# The drawback being this is not cacheable
(self.quiescents + self.injured_cops).do("rest")

# Needs to be
self.needs_rest = Group(self.quiescents + self.injured_cops)
self.needs_rest.do("rest")
```
-
The problem is an extension/detailing of point 6. Let me try to explain in a bit more detail one of the details I am currently stuck on. The basic idea of a Collector is that it retrieves one or more attributes from an object or collection of objects, and optionally applies a callable to it. The issue is that there is no way to specify the return type of this optional callable in the current design. This return type matters because it affects how data is stored in the collector and how it will be turned into a dataframe.

So, for example, we are retrieving … One idea I had after the conversation with @EwoutH is that the entire problem is analogous to, e.g., pandas.DataFrame.apply. In case of collecting data from a collection of objects and next applying a callable to it, the user should specify the "axis" over which this function will operate. If you operate over the "columns", you are aggregating the information across all objects, while if you operate over "rows", the function is applied to the collected data for each object separately. I hope this helps to clarify the issue.
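The analogy can be made concrete with plain pandas (illustrative data only): with agents as columns and steps as rows, `axis=1` aggregates across all agents within a step, while `axis=0` applies the function to each agent's series separately.

```python
import numpy as np
import pandas as pd

# hypothetical collected data: one row per step, one column per agent
df = pd.DataFrame({"agent_1": [3.0, 4.0], "agent_2": [5.0, 6.0]}, index=[0, 1])

# axis=1: aggregate across all agents, yielding one value per step
mean_per_step = df.apply(np.mean, axis=1)

# axis=0: apply the function per agent, aggregating over time
mean_per_agent = df.apply(np.mean, axis=0)
```

The same callable produces structurally different outputs depending on the axis, which is exactly the ambiguity a Collector would need to resolve.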
-
Played around a bit a few days ago. Now that we have our very powerful AgentSet, the API seems to be able to get simpler:

```python
datacollector = DataCollector(
    collectors=[
        c(target=Model, attributes=["n_agents"], methods=calculate_energy),
        c(target=Wolf, attributes=["sheep_eaten"]),
        c(target=Sheep, attributes=["age"], methods=calculate_energy),
        c(target=model.agents, attributes=["energy"], agg={"energy": np.mean}),
    ]
)
```

A few notes:

```python
c(target=Model, attributes=["n_agents"], methods=calculate_energy)
```

gives

```python
{
    f"{Model.__name__}_n_agents": {...},
    f"{Model.__name__}_{calculate_energy.__name__}": {...},
}
```

Just one approach. Don't know if it's the best.
-
I took another stab at working out an API and data storage format:

**Proposed API Design**

The core of the proposal is a unified `collect()` function:

```python
from mesa.datacollection import DataCollector, collect
import numpy as np

class WolfSheepModel(mesa.Model):
    def __init__(self, n_wolves=10, n_sheep=50, grass_regrowth_time=30):
        super().__init__()
        # [...model initialization...]

        # Initialize the data collector with various collectors
        self.datacollector = DataCollector([
            # Model-level attributes
            collect(target=self, attributes=["steps", "living_wolves", "living_sheep"]),

            # Agent type-specific collection
            collect(target=Wolf, attributes=["energy", "sheep_eaten"]),
            collect(target=Sheep, attributes=["energy", "grass_eaten"]),

            # Dynamic agent filtering
            collect(
                target=self.agents.select(lambda a: a.energy < 2),
                attributes=["energy", "pos"],
                name="starving_agents"
            ),

            # Aggregated metrics
            collect(
                target=self.agents,
                attributes=["energy"],
                aggregates={
                    "mean_energy": np.mean,
                    "energy_gini": self.calculate_gini
                }
            ),

            # Custom function
            collect(
                target=self,
                function=lambda m: self.calculate_spatial_density(),
                name="spatial_density"
            )
        ])
```

**Data Access**

```python
# Run the model
model = WolfSheepModel()
for _ in range(100):
    model.step()

# Get all data as a comprehensive DataFrame (long format)
all_data = model.datacollector.get_dataframe()
"""
Step  DataType    Entity  ID  Attribute      Value
0     model       Model   -   steps          0
0     model       Model   -   living_wolves  10
0     agents      Wolf    1   energy         20
0     aggregates  -       -   mean_energy    17.5
...
"""

# Get specific data with multi-index DataFrames
wolf_df = model.datacollector.get_dataframe(target=Wolf)
"""
         energy  sheep_eaten
Step ID
0    1   20      0
     2   18      0
...
"""

# Filter by attribute across all agent types
energy_data = model.datacollector.get_dataframe(attribute="energy")
"""
               energy
Step Type  ID
0    Wolf  1   20
     Sheep 3   15
...
"""

# Get dynamically filtered collections by name
starving_df = model.datacollector.get_dataframe(name="starving_agents")

# Get aggregated metrics
aggregates = model.datacollector.get_dataframe(data_type="aggregates")
"""
      mean_energy  energy_gini
Step
0     17.50        0.11
1     16.25        0.12
...
"""

# Additional filtering options
time_range_df = model.datacollector.get_dataframe(time_range=(10, 20))
long_format_df = model.datacollector.get_dataframe(format="long")
```

The multi-indexed DataFrames enable powerful analysis:

```python
# Average energy by agent type over time
energy_by_type = energy_data.groupby(level=["Step", "Type"]).mean()

# Calculate rate of change in wolf population
wolves_over_time = wolf_df.groupby(level="Step").size()
population_change = wolves_over_time.diff()
```

**Memory-Efficient Internal Structure**

The internal data structure is optimized to avoid string duplication and use arrays for efficient storage:

```python
{
    # Schema defined once - no string duplication
    "schema": {
        "Wolf": ["energy", "sheep_eaten"],
        "Sheep": ["energy", "grass_eaten"],
        "model": ["steps", "living_wolves", "living_sheep"],
        "aggregates": ["mean_energy", "energy_gini"]
    },

    # Data storage uses position-based arrays matching the schema
    "data": {
        1: {  # Timestep
            "model": [1, 8, 42],  # Values match schema positions
            "agents": {
                "Wolf": {
                    "ids": [1, 2, 3],
                    "values": [
                        [10, 2],  # Agent 1: [energy, sheep_eaten]
                        [12, 1],  # Agent 2: [energy, sheep_eaten]
                        [8, 0]    # Agent 3: [energy, sheep_eaten]
                    ]
                },
                "Sheep": {
                    "ids": [4, 5, 6],
                    "values": [
                        [5, 3],  # Agent 4: [energy, grass_eaten]
                        [6, 4],  # Agent 5: [energy, grass_eaten]
                        [4, 2]   # Agent 6: [energy, grass_eaten]
                    ]
                }
            },
            "aggregates": [7.5, 0.18]  # Values match schema positions
        }
    }
}
```

I hope to have found a balance with this API and data storage between flexibility, powerful features, collection and storage efficiency, and ease of use. Curious what everyone thinks.
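As a sanity check of the storage format, a small sketch (hypothetical helper, not part of the proposal) that expands the schema-plus-arrays structure into the long-format DataFrame shown under Data Access:

```python
import pandas as pd

# miniature version of the schema-based store described above
schema = {"Wolf": ["energy", "sheep_eaten"]}
data = {1: {"agents": {"Wolf": {"ids": [1, 2], "values": [[10, 2], [12, 1]]}}}}

# expand positional arrays back into one row per (step, agent, attribute)
rows = []
for step, payload in data.items():
    for agent_type, block in payload["agents"].items():
        for agent_id, values in zip(block["ids"], block["values"]):
            for attribute, value in zip(schema[agent_type], values):
                rows.append((step, agent_type, agent_id, attribute, value))

long_df = pd.DataFrame(rows, columns=["Step", "Type", "ID", "Attribute", "Value"])
```

This shows the attribute names really only need to be stored once in the schema; the long format is recoverable on demand.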
-
```python
class PredatorPreyModel(Model):
    def __init__(self, num_wolves=10, num_sheep=50, seed=None):
        super().__init__(seed=seed)
        # Create agents
        for _ in range(num_wolves):
            Wolf(self, age=self.random.randint(0, 10), energy=self.random.randint(50, 100))
        for _ in range(num_sheep):
            Sheep(self, age=self.random.randint(0, 5), energy=self.random.randint(50, 100))
        # Setup a datacollector with empty model_reporters
        # We'll manually add values to it in our collect_data method
        self.datacollector = DataCollector()

    def collect_data(self):
        """Calculate and collect all model metrics."""
        # Get agent sets we need
        wolves = self.agents.select(agent_type=Wolf)
        sheep = self.agents.select(agent_type=Sheep)
        mature_wolves = wolves.select(lambda a: a.age > 5)

        # Calculate all metrics
        metrics = {
            "Wolves": len(wolves),
            "Sheep": len(sheep),
            "Mature Wolves": len(mature_wolves),
            "Average Wolf Energy": wolves.agg("energy", np.mean) if wolves else 0,
            "Total Sheep Wool": sheep.agg("wool", sum) if sheep else 0
        }

        # Add current step's metrics to datacollector
        for key, value in metrics.items():
            if key not in self.datacollector.model_vars:
                self.datacollector.model_vars[key] = []
            self.datacollector.model_vars[key].append(value)

    def step(self):
        # Model logic
        hungry_wolves = self.agents.select(
            agent_type=Wolf,
            filter_func=lambda a: a.energy < 30
        )
        sheep = self.agents.select(agent_type=Sheep)
        for wolf in hungry_wolves:
            if sheep:
                prey = self.random.choice(list(sheep))
                wolf.energy += prey.energy * 0.5
                wolf.kills += 1
                prey.remove()
        self.agents.shuffle_do("step")
        # Collect data
        self.collect_data()
```

Once you stop defining everything in your Model init, you can do dynamic selection and aggregation so much easier.
-
I know this is not the appropriate place to ask this, but since this thread concerns API development, I wanted to ask: what is the procedure for defining a new API?
-
Quick thought I originally raised in my Mesa-Frames proposal, but wanted to mention here as well: event-based data collection, where data is collected only when defined conditions are met, like:

Although the initial idea was to reduce redundant data storage at mesa-frames scales, I think it would be useful here as well. Example:
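A minimal sketch of what condition-gated collection could look like (all names are hypothetical, purely illustrative):

```python
class ConditionalCollector:
    """Records a value only when a predicate on the model holds."""

    def __init__(self, condition, extractor):
        self.condition = condition  # model -> bool
        self.extractor = extractor  # model -> value
        self.records = []           # list of (time, value)

    def collect(self, model, time):
        if self.condition(model):
            self.records.append((time, self.extractor(model)))


class FakeModel:
    def __init__(self):
        self.n_agents = 10


model = FakeModel()
# only record the population while it is below 9
collector = ConditionalCollector(lambda m: m.n_agents < 9, lambda m: m.n_agents)
for t in range(5):
    collector.collect(model, t)
    model.n_agents -= 1
```

Steps where the condition fails leave no record at all, which is where the storage savings at scale would come from.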
-
So I threw this whole conversation into Gemini 2.5 Pro, and it came up with this. I think this is the best API I've seen so far. Curious what everybody thinks. If it looks good I will try to move forward with an implementation.

**Synthesizing a Declarative API**

Hi all,

This has been a really valuable (and extensive!) discussion on the future of data collection in Mesa. Reading through it, particularly the recent ideas from @EwoutH and the summary of requirements from @rht, I wanted to try and synthesize a potential path forward that aims to capture the best aspects discussed, focusing on a declarative API. While the manual approach shown earlier is powerful, perhaps we can achieve that power (especially regarding dynamic sets and aggregations) within a more structured, declarative API centered around a `collect()` specification.

**Proposed Core Idea:** Initialize the `DataCollector` with a list of `collect()` declarations:

```python
# Illustrative Example API
from mesa.datacollection import DataCollector, collect
import numpy as np

# Assume Wolf, Sheep, Citizen Agent classes and calculate_gini defined elsewhere

class MyModel(mesa.Model):
    def __init__(self, ...):
        super().__init__()
        # ... model setup ...
        self.datacollector = DataCollector([
            # --- Model Level ---
            collect(name="step_count", target=self, attributes=["schedule.steps"]),
            collect(name="total_wealth", target=self,
                    function=lambda m: sum(a.wealth for a in m.agents)),

            # --- Agent Type Level (Resolved at runtime) ---
            collect(name="wolf_data", target=Wolf, attributes=["energy", "kills"]),
            # Example: apply function per agent (output stored per agent ID)
            collect(name="sheep_age_category", target=Sheep,
                    function=lambda agent: "lamb" if agent.age < 2 else "adult",
                    apply_level="agent"),  # Hint needed for per-agent application

            # --- Dynamic AgentSet Level (Lambda evaluated at runtime) ---
            collect(name="starving_agents_pos",
                    target=lambda m: m.agents.select(lambda a: a.energy < 10),
                    attributes=["pos"]),  # Collects pos attribute for matching agents
            collect(name="avg_starving_energy",
                    target=lambda m: m.agents.select(lambda a: a.energy < 10),
                    function=lambda agent_set: agent_set.agg("energy", np.mean)),  # Aggregate over dynamic set

            # --- Aggregation Focused ---
            collect(name="energy_stats", target=self.agents,  # Target can be an AgentSet
                    attributes=["energy"],  # Base attribute(s) to collect first
                    aggregates={  # Dictionary: output_name -> func(list_of_values)
                        "mean_energy": np.mean,
                        "median_energy": np.median,
                        "energy_gini": calculate_gini  # Assumes calculate_gini takes a list
                    }),
            # Direct aggregation on an AgentSet (e.g., count) without collecting individuals first
            collect(name="quiescent_count", target=Citizen,  # Agent Type target
                    function=lambda agentset: agentset.select(lambda a: a.condition == "Quiescent").count),  # agent_set passed to function

            # --- Conditional Collection (Addresses @Ben-geo's point) ---
            collect(name="periodic_wolf_aggression", target=Wolf, attributes=["aggression"],
                    trigger=lambda model: model.schedule.steps % 10 == 0)  # Only collect every 10 steps
        ])

    def step(self):
        # ... step logic ...
        self.datacollector.collect(self)  # Pass model instance to process collectors
```

**Key Components of `collect()`:** …

**How this Addresses Key Problems:** …

**Internal Storage & Retrieval:** @EwoutH's schema-based internal storage idea seems excellent for efficiency.

**Conclusion:** This synthesized …
-
A long time ago I started implementing a database data collector for Mesa as an extension. However, since it did not perform well compared to outputting DataFrames as pickles (and importing these into the DB), I haven't finalised and shared the project (yet). Anyway, it might be useful for (smaller) simulations, and/or there might be ideas on improving performance. Let me know what you think; I'm happy to contribute: https://github.com/UniK-INES/mesa_dbdatacollection
-
Potential GSoC 2026 project: #2927 (comment)
-
The future of data collection has been picked up again with #3156 and #3145. The emerging design is to separate the state of part of the model at a given time instant from the taking of snapshots of this state over time. To this end, #3156 introduces the idea of a … The main advantages of this emerging design are:

There are still some open issues:
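The separation of state from snapshotting can be sketched in a few lines (hypothetical names; the actual classes introduced in #3156 differ):

```python
class Observable:
    """Describes a piece of model state, independent of when it is recorded."""

    def __init__(self, getter):
        self.getter = getter  # model -> current value

    def value(self, model):
        return self.getter(model)


class Recorder:
    """Takes snapshots of registered observables over time."""

    def __init__(self, observables):
        self.observables = observables  # name -> Observable
        self.history = {name: [] for name in observables}

    def snapshot(self, model, time):
        for name, obs in self.observables.items():
            self.history[name].append((time, obs.value(model)))


class ToyModel:
    def __init__(self):
        self.wealth = 0


model = ToyModel()
rec = Recorder({"wealth": Observable(lambda m: m.wealth)})
for t in range(3):
    rec.snapshot(model, t)
    model.wealth += 5
```

The key point is that "what the state is" and "when it gets recorded" are defined in different places, so either can be changed independently.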
-
As you know by now I love a clean API, so that's exactly what I took another stab at, now based on all the ideas and insights of #3145. It's also heavily based on / inspired by our … The proposal below is the result of some brainstorming with Claude 4.6 Opus.

**Building on #3145**

Some very useful discussions and prototyping took place in #3145. This is how this proposal builds on it:

What this proposal keeps:

Where's room for potential improvement:

**Design Principles**

**The Proposed API**

Core idea:
| Arguments provided | Inferred type | What it creates |
|---|---|---|
| `track("wealth", source=AgentType)` | Agent attribute | `AgentDataSet` tracking `wealth` on all agents of that type |
| `track("wealth", source=agent_set)` | Agent attribute | `AgentDataSet` tracking `wealth` on that specific AgentSet |
| `track("gini", fn=callable)` | Model reporter | `ModelDataSet` calling `fn(model)` |
| `track("gini")` (no source, no fn) | Model attribute | `ModelDataSet` reading `model.gini` (property or attribute) |
| `track(["a", "b"], source=X)` | Multi-field agent | `AgentDataSet` tracking multiple fields |
| `track(["a", "b"], fn=[f1, f2])` | Multi-field model | `ModelDataSet` with multiple reporters |
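The inference rules in the table above can be sketched as follows (hypothetical helper, purely illustrative of the dispatch logic, not the proposed implementation):

```python
def infer_kind(fields, source=None, fn=None):
    """Infer which dataset kind a track() call would create."""
    multi = isinstance(fields, list)
    if source is not None:
        # agent type or AgentSet given: track attributes on agents
        return "multi-agent" if multi else "agent"
    if fn is not None:
        # callable(s) given: model reporter(s)
        return "multi-model" if multi else "model-reporter"
    # bare name: read it straight off the model
    return "model-attribute"
```

A usage sketch, mirroring the table rows:

```python
infer_kind("wealth", source=int)        # "agent"
infer_kind("gini", fn=len)              # "model-reporter"
infer_kind("gini")                      # "model-attribute"
infer_kind(["a", "b"], source=int)      # "multi-agent"
```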
**The `name` parameter for disambiguation**

```python
# If you track the same field from different agent types:
self.data.track("energy", source=Wolf, name="wolf_energy")
self.data.track("energy", source=Sheep, name="sheep_energy")
```

If `name` is not provided, it defaults to:

- The field name for single fields: `"wealth"`
- `"{AgentType.__name__}_{field}"` if source is an agent type: `"Wolf_energy"`
- The explicit name for multi-field: must be provided
**4. Numpy fast path (opt-in)**

For performance-critical models with fixed agent populations:

```python
# Explicit opt-in to numpy-backed storage
self.data.track("wealth", source=MoneyAgent, backend="numpy", n=n)
```

This creates a `NumpyAgentDataSet` internally. The user doesn't need to know about different dataset classes; they just say `backend="numpy"`.
**5. Event/table logging (sparse events)**

For data that doesn't fit the snapshot model (agent death, interactions):

```python
class Wolf(Agent):
    def die(self):
        # Log an event at the moment it happens
        self.model.data.log("deaths", agent_id=self.unique_id, cause="starvation")
        self.remove()

# Retrieve later
deaths_df = model.data["deaths"]  # → DataFrame with [agent_id, cause, time]
```

`log()` is for sparse, event-driven data; `track()` is for periodic snapshots. This cleanly replaces the old DataCollector tables.
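The mechanics behind a `log()` call could be as simple as an append-only table with the model time stamped automatically (a sketch with hypothetical names, not the proposed implementation):

```python
class EventLog:
    """Append-only event tables, stamped with the current model time."""

    def __init__(self, clock):
        self.clock = clock  # callable returning the current model time
        self.tables = {}    # table name -> list of row dicts

    def log(self, table_name, **fields):
        row = dict(fields, time=self.clock())
        self.tables.setdefault(table_name, []).append(row)


t = 0
log = EventLog(lambda: t)
log.log("deaths", agent_id=7, cause="starvation")
t = 3
log.log("deaths", agent_id=9, cause="predation")
```

Because rows only exist when events fire, this storage stays proportional to the number of events, not the number of steps.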
**6. Storage backends**

Default: in-memory (no configuration needed)

```python
self.data.track("wealth", source=MoneyAgent)
# Just works, stored in memory
```

Swap backend globally:

```python
from mesa.experimental.data_collection import SQLBackend, ParquetBackend

class MyModel(Model):
    def __init__(self):
        super().__init__()
        self.data.backend = SQLBackend(db_path="results.db")
        # OR
        self.data.backend = ParquetBackend(output_dir="./data")
        # All subsequent track() calls use this backend
        self.data.track("wealth", source=MoneyAgent)
```

Per-dataset backend override:

```python
self.data.track("wealth", source=MoneyAgent)  # uses default (memory)
self.data.track("big_data", source=MoneyAgent, backend=ParquetBackend("./out"))
```

Custom backends implement a simple protocol:
```python
class StorageBackend(Protocol):
    def initialize(self, name: str, dataset_type: str, columns: list[str]) -> None: ...
    def store(self, name: str, time: float, data: Snapshot) -> None: ...
    def retrieve(self, name: str) -> pd.DataFrame: ...
    def clear(self, name: str | None = None) -> None: ...
```

Where `Snapshot` is a tagged union / dataclass:

```python
@dataclass
class Snapshot:
    """Typed snapshot — backend authors don't need to pattern-match on Any."""
    kind: Literal["agent", "model", "numpy", "event"]
    data: np.ndarray | list[dict] | dict
    columns: list[str]
    agent_ids: np.ndarray | None = None
```

**7. Complete API reference for `model.data`**
```python
class DataCollectionManager:
    """The unified interface at model.data"""

    # === Registration (what to collect) ===
    def track(
        self,
        fields: str | list[str],
        *,
        source: type[Agent] | AgentSet | None = None,
        fn: Callable | list[Callable] | None = None,
        name: str | None = None,
        schedule: Schedule | None = None,  # default: every step from t=0
        window: int | None = None,         # sliding window size
        backend: StorageBackend | Literal["numpy"] | None = None,
        n: int | None = None,              # required if backend="numpy"
    ) -> Dataset:
        """Register data for periodic collection. Returns the Dataset for chaining."""
        ...

    def log(
        self,
        table_name: str,
        **fields,
    ) -> None:
        """Log a single event row (sparse, non-periodic data)."""
        ...

    # === Retrieval ===
    def __getitem__(self, name: str) -> pd.DataFrame:
        """model.data["wealth"] → DataFrame"""
        ...

    def to_dataframes(self) -> dict[str, pd.DataFrame]:
        """Get all tracked data as DataFrames."""
        ...

    # === Control ===
    def collect(self) -> None:
        """Force collection of all due datasets now (rarely needed)."""
        ...

    def finalize(self) -> None:
        """Capture final snapshot. Called automatically by model.close()."""
        ...

    def enable(self, name: str) -> None: ...
    def disable(self, name: str) -> None: ...
    def clear(self, name: str | None = None) -> None: ...

    # === Diagnostics ===
    @property
    def summary(self) -> dict[str, Any]: ...

    # === Backend ===
    @property
    def backend(self) -> StorageBackend: ...

    @backend.setter
    def backend(self, value: StorageBackend) -> None: ...
```

**8. Migration from old DataCollector**
```python
# ─── OLD ───
self.datacollector = DataCollector(
    model_reporters={
        "Gini": lambda m: compute_gini(m),
        "Population": lambda m: len(m.agents),
    },
    agent_reporters={"wealth": "wealth"},
    agenttype_reporters={
        Wolf: {"kills": "kills_count"},
        Sheep: {"distance": "total_flight_distance"},
    },
)
# In step():
self.datacollector.collect(self)

# ─── NEW ───
self.data.track("wealth", source=self.agents)              # replaces agent_reporters
self.data.track("gini", fn=lambda m: compute_gini(m))      # replaces model_reporters
self.data.track("population", fn=lambda m: len(m.agents))
self.data.track("kills_count", source=Wolf)                # replaces agenttype_reporters
self.data.track("total_flight_distance", source=Sheep)
# No manual collect() needed!
```

Retrieving data:

```python
# OLD
model.datacollector.get_model_vars_dataframe()
model.datacollector.get_agent_vars_dataframe()

# NEW
model.data["gini"]
model.data["wealth"]
model.data.to_dataframes()
```

**9. Design comparison: Current PR vs this proposal**
| Aspect | Current PR | This proposal |
|---|---|---|
| Objects user must learn | 3 (`DataRegistry`, `DatasetConfig`, `DataRecorder`) | 1 (`model.data`) |
| Lines for basic setup | ~8-12 | ~2-3 |
| String key coupling | Yes (registry keys must match config keys) | No (single `track()` call) |
| Lambda support | No (must use `@property`) | Yes (`fn=lambda m: ...`) |
| Schedule reuse | Separate `DatasetConfig` class | Shared `Schedule` from event system |
| Backend swapping | Change class instantiation | Change `model.data.backend = ...` |
| Event logging | Not addressed (`TableDataSet` skipped) | `model.data.log()` |
| Final snapshot | Not handled | `model.data.finalize()` + auto in `model.close()` |
| Typed snapshots for backends | `Any` with pattern matching | `Snapshot` dataclass |
| Auto-collection | ✅ | ✅ |

**10. Implementation sketch: how `model.data` gets created**
```python
# In Model.__init__():
class Model:
    def __init__(self, ...):
        ...
        self.data = DataCollectionManager(self)
```

`DataCollectionManager` internally holds:

- A dict of `Dataset` objects (the registry part)
- A `StorageBackend` (the recorder part)
- A subscription to `model.time` changes
The existing `DataRegistry` and `BaseDataRecorder` code can be refactored to live behind this facade. The internal architecture stays clean; only the user-facing API changes.
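As an illustration of the backend protocol, a minimal in-memory backend could look like this (a sketch under the assumption that `data` arrives as a plain list of values matching `columns`; the proposal's full `Snapshot` handling is omitted):

```python
import pandas as pd


class MemoryBackend:
    """Hypothetical minimal backend: stores rows in plain Python lists."""

    def __init__(self):
        self._rows = {}     # name -> list of row dicts
        self._columns = {}  # name -> column names

    def initialize(self, name, dataset_type, columns):
        self._rows[name] = []
        self._columns[name] = columns

    def store(self, name, time, data):
        # `data` is assumed here to be a list of values matching `columns`
        row = {"time": time, **dict(zip(self._columns[name], data))}
        self._rows[name].append(row)

    def retrieve(self, name):
        return pd.DataFrame(self._rows[name])

    def clear(self, name=None):
        for key in [name] if name else list(self._rows):
            self._rows[key] = []


backend = MemoryBackend()
backend.initialize("wealth", "model", ["gini", "mean_wealth"])
backend.store("wealth", 0, [0.25, 3.5])
backend.store("wealth", 1, [0.30, 3.2])
df = backend.retrieve("wealth")
```

Because the protocol is this small, swapping in a SQL or Parquet variant only means reimplementing these four methods.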
**Summary of the key design choices and why**

- **Single entry point (`model.data`):** The current PR asks users to understand `DataRegistry`, `DatasetConfig`, and `DataRecorder` as separate concepts, wire them together with matching string keys, and remember not to call `.collect()`. That's a lot of cognitive load for what is fundamentally "I want to save my data." One object that does `track`, `log`, and `data["name"]` is dramatically simpler.
- **`track()` with source inference:** Instead of `track_agents()`, `track_model()`, `track_agents_numpy()` as separate methods, one `track()` method infers the dataset type from its arguments. This follows the principle that users think in terms of what they want to track ("wealth of wolves"), not how the framework should store it internally.
- **Keep lambdas working:** The current PR forces model reporters into `@property` methods. That's fine for complex logic, but for `lambda m: len(m.agents)` it's unnecessary boilerplate. Supporting both `fn=callable` and bare attribute names preserves flexibility.
- **Reuse `Schedule` instead of `DatasetConfig`:** Since #3250 (Add Schedule dataclass and refactor EventGenerator) already merged a `Schedule` dataclass for event scheduling, data collection should use the same object. One scheduling vocabulary across the framework. `DatasetConfig` as a separate class with slightly different parameter names (`start_time` vs `start`) is a source of confusion.
- **Typed `Snapshot` instead of `Any`:** Backend authors currently have to write `match data: case np.ndarray(): ... case list(): ... case dict(): ...` in every method. A typed `Snapshot` dataclass with a `kind` discriminator makes this explicit and documentable.
- **`log()` for event data:** The PR explicitly skips `TableDataSet` in the recorder. But event logging (agent deaths, interactions) is a real use case. A dedicated `log()` method with a simple `**kwargs` interface handles this cleanly and replaces the old DataCollector table feature.
- **`finalize()` tied to `model.close()`:** This directly addresses quaquel's concern about missing the final state. Rather than debating signal design now, a simple `finalize()` method that `model.close()` calls is pragmatic and unblocking.
-
There has been quite some discussion in various places about changing data collection. This is my attempt to think this through in some more detail. It is heavily inspired by a suggestion by @Corvince at some point.
In the general case, data collection is taking an object and extracting from this one or more attributes, and optionally applying a callable to this. It might involve only an object and a callable applied to this object in specific cases. This object can be the model, an agent, an agentset, a space, or some user-defined class.
So, it seems sensible to create a separate Collector class that implements this basic logic. Because the behavior of AgentSet is a bit different from other objects (i.e., `AgentSet.get` instead of relying on `getattr`), I believe it makes sense to have two Collector classes: BaseCollector and AgentSetCollector (PEP 20, flat is better than nested). Rather than burden the user with this distinction, it is possible to use a factory function (e.g., `collect(obj, attrs, func=None)`) to create the appropriate Collector instance.

Ideally, data should only be extracted once. So, in the case of the Boltzmann wealth model, the data collector should be smart enough to extract the `wealth` attribute only once from the agentset. This can relatively easily be realized by maintaining an internal mapping of all objects and the attributes to be retrieved from them. Moreover, it might be possible to extract all relevant attributes from a given object in one go to avoid unnecessary iteration. This would, however, require a minor update to `AgentSet.get` so that `attr_name` takes a string or list of strings.

I believe it is possible to design and implement this new-style DataCollector so that the current one can be implemented on top of it for backward compatibility.
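The internal mapping idea can be sketched like this (a toy `AgentSet`/`Agent`, not Mesa's, assuming the proposed list-of-strings extension to `get`):

```python
from collections import defaultdict


class AgentSet(list):
    def get(self, attr_names):
        # accepts a string or a list of strings, mirroring the proposed
        # extension to AgentSet.get
        if isinstance(attr_names, str):
            return [getattr(a, attr_names) for a in self]
        return [[getattr(a, n) for n in attr_names] for a in self]


class Agent:
    def __init__(self, wealth, age):
        self.wealth = wealth
        self.age = age


agents = AgentSet([Agent(3, 10), Agent(5, 20)])

# collectors registered separately for "wealth" and "age"...
requests = [(agents, "wealth"), (agents, "age")]

# ...but merged per source object, so each object is queried only once
merged = defaultdict(list)
for obj, attr in requests:
    merged[id(obj)].append(attr)

obj_by_id = {id(agents): agents}
results = {oid: obj_by_id[oid].get(attrs) for oid, attrs in merged.items()}
```

One pass over the agentset yields all requested attributes, instead of one pass per registered collector.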
Like with the current DataCollector, data collection should happen whenever `data_collector.collect` is called. However, I believe it is paramount that the data collector also always extracts the current simulation time. Only by having the simulation time for each call to `collect` can you produce a clean and complete time series of the dynamics of the model over time. In fact, these time stamps could become part of the index/column labels of the DataFrames when turning the retrieved data into a DataFrame.

Like with the current DataCollector, it should be easy to turn any retrieved data into a DataFrame. This can easily be done through a `to_dataframe` method on the Collector class.

So, what could the resulting API look like?

So, does the basic idea of object, retrieval of one or more attributes, and/or applying a callable make sense? Have I missed a key concern? Is there something obviously wrong or missing in the sketch of the API?