Better DataFrame support for Agent.create_agents() #3186

@EwoutH

I think we can improve the ergonomics of creating agents from pandas DataFrames. Currently, every column has to be converted with a verbose .tolist() call before it can be passed to create_agents().

Background

Users often initialize agents from tabular data (CSV files, Parquet files, database query results), typically loaded into a pandas DataFrame with one row per agent.

import pandas as pd

# Example synthetic population data
df = pd.DataFrame({
    'age': [25, 34, 45, 67, 29],
    'bmi': [22.5, 28.1, 31.2, 24.8, 26.3],
    'condition_status': ['healthy', 'at_risk', 'chronic', 'healthy', 'at_risk'],
    'income': [45000, 62000, 51000, 38000, 71000]
})

The current create_agents() API requires converting each DataFrame column to a list:

from mesa import Agent

class HealthAgent(Agent):
    def __init__(self, model, age, bmi, condition_status, income):
        super().__init__(model)
        self.age = age
        self.bmi = bmi
        self.condition_status = condition_status
        self.income = income

# Current approach - verbose and inefficient
agents = HealthAgent.create_agents(
    model=model,
    n=len(df),
    age=df['age'].tolist(),        # Manual conversion
    bmi=df['bmi'].tolist(),        # Manual conversion
    condition_status=df['condition_status'].tolist(),  # Manual conversion
    income=df['income'].tolist()   # Manual conversion
)
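
Until something like this is supported, the per-column conversions can be collapsed with pandas' own `DataFrame.to_dict(orient="list")`, which returns a mapping from column name to a plain list. The following is only a minimal sketch of that workaround, and it assumes the column names match the `__init__` parameter names exactly. The `create_agents()` call is left as a comment because it needs a Mesa model:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 34, 45, 67, 29],
    "bmi": [22.5, 28.1, 31.2, 24.8, 26.3],
    "condition_status": ["healthy", "at_risk", "chronic", "healthy", "at_risk"],
    "income": [45000, 62000, 51000, 38000, 71000],
})

# One call converts every column to a plain list, keyed by column name.
column_kwargs = df.to_dict(orient="list")
print(column_kwargs)

# With a Mesa model at hand, the lists can be unpacked as keyword arguments:
# agents = HealthAgent.create_agents(model=model, n=len(df), **column_kwargs)
```

This keeps the call short, but it still leaves the conversion to the user and breaks as soon as the DataFrame has a column the agent does not accept.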

Potential solutions

We're considering two approaches (not mutually exclusive):

Option 1: Accept DataFrame columns directly

Allow pandas Series as arguments without manual conversion:

# Proposed - cleaner API
agents = HealthAgent.create_agents(
    model=model,
    n=len(df),
    age=df['age'],           # No .tolist() needed
    bmi=df['bmi'],
    condition_status=df['condition_status'],
    income=df['income']
)
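
One way Option 1 could work internally is a small normalization step applied to each argument before agents are constructed. The sketch below is an assumption about how that might look, not Mesa's implementation; `to_per_agent_list` is a hypothetical helper name:

```python
import pandas as pd

def to_per_agent_list(value, n):
    """Convert a Series of length n to a list; return anything else unchanged.

    Hypothetical helper, not part of Mesa.
    """
    if isinstance(value, pd.Series) and len(value) == n:
        # tolist() ignores the index labels, so values are assigned by position.
        return value.tolist()
    return value

# A Series with a non-default index, e.g. left over from filtering a larger frame.
ages = pd.Series([25, 34, 45], index=[10, 11, 12])

print(to_per_agent_list(ages, 3))  # [25, 34, 45]
print(to_per_agent_list(100, 3))   # 100
```

Converting by position rather than by index label seems like the least surprising choice here, since a filtered DataFrame rarely has a 0..n-1 index.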

Option 2: Add df parameter for direct DataFrame input

Add a dedicated parameter that accepts a DataFrame:

# Most concise - auto-map all columns
agents = HealthAgent.create_agents(
    model=model,
    df=df
)

# To use only certain columns, select them before passing the DataFrame:
agents = HealthAgent.create_agents(
    model=model,
    df=df[['age', 'bmi', 'condition_status', 'income']]
)

# Mix DataFrame with additional parameters
agents = HealthAgent.create_agents(
    model=model,
    df=df,
    initial_energy=100  # Same value for all agents (assumes __init__ accepts initial_energy)
)

# Use DataFrame subset with overrides
agents = HealthAgent.create_agents(
    model=model,
    df=df[['age', 'bmi']],
    condition_status='healthy',  # Override for all
    income=df['adjusted_income']  # Mix with a Series (assumes df has an 'adjusted_income' column)
)
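
To make the intended semantics of Option 2 concrete, here is a rough sketch of how a `df` argument and extra keyword arguments could be merged into per-agent lists. `build_agent_kwargs` is a hypothetical helper, not part of Mesa, and letting explicit keyword arguments silently win over DataFrame columns is just one possible policy:

```python
import pandas as pd

def build_agent_kwargs(df, **overrides):
    """Merge DataFrame columns with explicit keyword arguments.

    Hypothetical helper, not part of Mesa. Returns (n, kwargs), where every
    value in kwargs is a list with one entry per row of df.
    """
    n = len(df)
    kwargs = df.to_dict(orient="list")
    for name, value in overrides.items():
        if isinstance(value, pd.Series):
            if len(value) != n:
                raise ValueError(f"{name!r} has {len(value)} values, expected {n}")
            kwargs[name] = value.tolist()
        elif isinstance(value, (list, tuple)) and len(value) == n:
            kwargs[name] = list(value)
        else:
            # Anything else is treated as a single value shared by all agents.
            kwargs[name] = [value] * n
    return n, kwargs

df = pd.DataFrame({
    "age": [25, 34],
    "bmi": [22.5, 28.1],
    "adjusted_income": [47000, 65000],
})

n, kwargs = build_agent_kwargs(
    df[["age", "bmi"]],
    condition_status="healthy",    # same value for every agent
    income=df["adjusted_income"],  # per-agent values from a Series
)
print(n, kwargs)
```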

Questions for discussion

  1. Which option should we implement? Both? Start with Option 1, add Option 2 later?
  2. For Option 2, how should we handle conflicts if df contains a column 'age' AND the user passes age=... explicitly?
  3. Should we also support other tabular formats like Polars DataFrames or NumPy structured arrays?
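
For question 2, the two obvious policies are to raise an error or to let the explicit argument win. A minimal sketch of both, assuming a hypothetical `on_conflict` switch that is not an existing Mesa parameter:

```python
import pandas as pd

def drop_conflicting_columns(df, explicit_kwargs, on_conflict="error"):
    """Handle names given both as a DataFrame column and as a keyword argument.

    Hypothetical helper, not part of Mesa.
    """
    conflicts = sorted(set(df.columns) & set(explicit_kwargs))
    if not conflicts:
        return df
    if on_conflict == "error":
        raise ValueError(
            f"Passed both as DataFrame column and keyword argument: {conflicts}"
        )
    if on_conflict == "explicit_wins":
        # Drop the clashing columns so only the explicit values remain.
        return df.drop(columns=conflicts)
    raise ValueError(f"Unknown on_conflict policy: {on_conflict!r}")

df = pd.DataFrame({"age": [25, 34], "bmi": [22.5, 28.1]})

remaining = drop_conflicting_columns(df, {"age": 30}, on_conflict="explicit_wins")
print(list(remaining.columns))  # ['bmi']
```

Raising by default is the stricter option and makes accidental overrides visible; the permissive mode matches the "Use DataFrame subset with overrides" example above more closely.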
