Performative Prediction
A Tutorial @ UAI 2024
2nd hour:
Performative optimality
Performative optimality vs performative stability
Model-free and model-based optimization
Coffee break
3rd hour:
Extensions of the framework and connections
Power, incentives, digital activism
Discussion
Different facets of prediction
Prediction in astronomy:
Detect regularities and laws in nature, purely explanatory and descriptive
Forecasts that can impact the predicted event constitute one of the most
central problems in the theory of economic forecasting
[PZMH20]
Performative stability:
θ* = argmin_θ Risk(θ, D(θ*))
Model remains optimal after deployment
A natural equilibrium concept
When does retraining converge?
• ∇_θ ℓ(z; θ) is β-Lipschitz in z
Proof sketch
ℓ_{t−1}(θ): loss over D(θ_{t−1}), minimized by θ_t;  ℓ_t(θ): loss over D(θ_t), minimized by θ_t*
• γ-strong convexity of the loss in θ (and ∇ℓ_t(θ_t*) = 0):
  [∇ℓ_t(θ_t) − ∇ℓ_t(θ_t*)]^⊤ (θ_t − θ_t*) ≥ γ ‖θ_t − θ_t*‖²
• β-smoothness of the loss in the data (and ∇ℓ_{t−1}(θ_t) = 0):
  [∇ℓ_t(θ_t) − ∇ℓ_{t−1}(θ_t)]^⊤ (θ_t − θ_t*) ≤ β ‖θ_t − θ_t*‖ · W(D(θ_{t−1}), D(θ_t))
• Combining the two: ‖θ_t − θ_t*‖ ≤ (β/γ) · W(D(θ_{t−1}), D(θ_t))
• ε-sensitivity of D(·): ≤ (εβ/γ) ‖θ_{t−1} − θ_t‖ = (εβ/γ) ‖θ_{t−1} − θ_{t−1}*‖  →  contraction for ε < γ/β
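To make the contraction concrete, here is a minimal simulation of repeated risk minimization on the Gaussian mean-estimation example used later in the tutorial; all constants and the sample size m are illustrative choices of ours. For the squared loss, γ = β = 1, so retraining contracts toward θ_PS = μ/(1 − ε) whenever ε < 1.

```python
import numpy as np

# Repeated risk minimization (RRM) on mean estimation:
# D(theta) = N(mu + eps*theta, sigma^2), loss l(z; theta) = 0.5*(z - theta)^2.
# For the squared loss gamma = beta = 1, so RRM contracts whenever eps < gamma/beta = 1.
rng = np.random.default_rng(0)
mu, sigma, eps, m = 1.0, 1.0, 0.5, 10_000   # illustrative constants

theta = 0.0                      # initial deployment
theta_ps = mu / (1 - eps)        # performatively stable point: theta = mu + eps*theta
for t in range(15):
    z = rng.normal(mu + eps * theta, sigma, size=m)  # data drawn after deploying theta
    theta = z.mean()             # exact minimizer of the squared loss = sample mean
    print(f"t={t:2d}  theta={theta:.4f}  gap to theta_PS={abs(theta - theta_ps):.4f}")
```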
Fixed-point argument
Historical arguments about the possibility of
public prediction using Brouwer’s fixed point
theorem
• Simon 1954
• Grunberg, Modigliani 1954
• For small ε < γ/β the gradient on problem D(θ_t) is aligned with the gradient on problem D(θ*) and never points against the gradient flow:
  ‖g_t(θ) − g_*(θ)‖ ≤ εβ ‖θ_t − θ_*‖  ⟹  cos ∡(g_t(θ), g_*(θ)) ≥ 1 − (εβ/γ)²
• Choosing the step size such that the gradient variance decreases sufficiently quickly as we approach stability yields the classical O(1/t) convergence rate
Greedy vs lazy
greedy = deploy every step; lazy = deploy only periodically
• Step size for greedy deploy: globally decreasing, and more conservative as ε grows
• Step size for lazy deploy: locally decreasing between deployments, and independent of ε
[MPZH20]
Greedy vs lazy
Setup: mean estimation, z ∼ N(μ + εθ, σ²), using ℓ(z, θ) = ½ (z − θ)²
[Figure: convergence of the two schemes; number of deployments: greedy 50K, lazy 200]
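A minimal sketch of the two deployment schemes on this setup; the step-size schedules are simplified stand-ins rather than the tuned schedules from [MPZH20], and all constants are ours.

```python
import numpy as np

# Greedy vs lazy deploy on mean estimation: z ~ N(mu + eps*theta_deployed, sigma^2),
# loss 0.5*(z - theta)^2, stochastic gradient -(z - theta).
rng = np.random.default_rng(1)
mu, sigma, eps = 1.0, 1.0, 0.5
theta_ps = mu / (1 - eps)

def greedy(T):
    """Deploy after every SGD step; globally decreasing step size."""
    theta = 0.0
    for t in range(1, T + 1):
        z = rng.normal(mu + eps * theta, sigma)   # sample reacts to the current model
        theta += (1.0 / t) * (z - theta)          # theta <- theta - eta_t * grad
    return theta

def lazy(T, k):
    """Deploy only every k steps; step size restarts between deployments."""
    theta = 0.0
    for _ in range(T // k):
        deployed = theta                          # distribution frozen until next deployment
        for s in range(1, k + 1):
            z = rng.normal(mu + eps * deployed, sigma)
            theta += (1.0 / s) * (z - theta)
    return theta

print("theta_PS:", theta_ps)
print("greedy  :", greedy(5000))
print("lazy    :", lazy(5000, 50))
```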
Theorem [LW24]:
Assume D(θ) is ε-sensitive and ℓ(z; θ) is Lipschitz in θ and possibly nonconvex. Then,
• greedy deploy satisfies
  (1/T) ∑_{t=1}^{T} ‖ E_{z∼D(θ_t)} [∇ℓ(z; θ_t)] ‖² = O(1/√T) + O(ε)
Performative optimality
Under performativity, after deploying θ the learner experiences the performative risk
PR(θ) ≔ Risk(θ, D(θ))
Performative optimality: θ_PO = argmin_θ PR(θ) = argmin_θ Risk(θ, D(θ))   (lowest possible risk after deployment)
Performative stability: θ_PS = argmin_θ Risk(θ, D(θ_PS))   (Do stable points have low risk after deployment?)
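A compact way to compare the two solution concepts is through the decoupled risk, which separates the model being evaluated from the model that induced the data; a minimal LaTeX restatement of the definitions above:

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\DeclareMathOperator*{\argmin}{arg\,min}
\begin{document}
% Decoupled risk: phi is the model being evaluated, theta the model that shaped the data.
\[ \operatorname{Risk}(\phi, D(\theta)) = \mathbb{E}_{z \sim D(\theta)}\, \ell(z; \phi) \]
% Performative optimum: minimize along the diagonal, anticipating the shift you cause.
\[ \theta_{\mathrm{PO}} \in \argmin_{\theta}\ \operatorname{Risk}(\theta, D(\theta)) = \argmin_{\theta}\ \mathrm{PR}(\theta) \]
% Performative stability: a fixed point of retraining; no incentive to move after deployment.
\[ \theta_{\mathrm{PS}} \in \argmin_{\phi}\ \operatorname{Risk}(\phi, D(\theta_{\mathrm{PS}})) \]
\end{document}
```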
Stability and performative risk
Not always true! Stable points can even maximize PR(θ).
[Figure: Risk(φ, D(θ_PS)) as a function of φ, with θ_PS and θ_PO marked]
Stability and performative risk
Example: an ε-sensitive distribution map for which PR(θ) is non-convex for ε > 0.5! For ε in an intermediate range, the stable point even maximizes PR(θ).
Optimizing the performative risk
PR(θ) ≔ Risk(θ, D(θ))
Difficulties:
• no gradient access:
  ∇PR(θ) = E_{z∼D(θ)} [∇ℓ(z; θ)] + E_{z∼D(θ)} [ℓ(z; θ) ∇_θ log p_θ(z)]
Model-free:
• Based on bandits and other zeroth-order optimization methods
• Convergence relies on general regularity conditions (e.g. convexity, smooth distribution shifts, etc.)
Model-based:
• Based on economic models (e.g. utility-maximizing agents), other models from domain knowledge, etc.
• Convergence relies on model correctness or degree of model misspecification
Model-free:
• Estimate the performative risk of each deployed model from samples: P̂R(θ_t) = (1/m) ∑_{i=1}^{m} ℓ(z_i^t; θ_t)
• θ̂_PO = argmin_{θ ∈ {θ_1, …, θ_T}} P̂R(θ)   (only queries PR, not its gradient)
• Converges to θ_PO at rate O(√d · t^{−1/3})
Model-based:
We model the data-generating process:
• agents manipulate features to maximize the prediction (utility-cost tradeoff): x = argmax_x γ · x^⊤θ − ½ ‖x − x₀‖²
• agents can manipulate features, not the label
• outcome model: y = y₀ + τ · x^⊤θ
This is a distribution map model D_γ(θ)
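A minimal sketch of the model-free recipe from the left column; the distribution map and all constants below are an illustrative stand-in of ours.

```python
import numpy as np

# Model-free performative risk minimization by direct search:
# deploy candidate models, estimate PR(theta) from post-deployment samples, keep the best one.
rng = np.random.default_rng(2)
mu, sigma, eps, m = 1.0, 1.0, 0.5, 20_000   # illustrative stand-in for D(theta)

def pr_hat(theta):
    """Empirical performative risk: deploy theta, draw m samples from D(theta), average the loss."""
    z = rng.normal(mu + eps * theta, sigma, size=m)
    return np.mean(0.5 * (z - theta) ** 2)

candidates = np.linspace(0.0, 4.0, 41)                  # models we are willing to deploy
estimates = np.array([pr_hat(th) for th in candidates])
theta_po_hat = candidates[np.argmin(estimates)]
print("estimated performative optimum:", theta_po_hat)  # analytic optimum here: mu/(1-eps) = 2
```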
Beyond convexity?
Optimization with no gradients and no convexity = continuum-armed bandit problem?
• "pull arm" θ_t and observe bandit feedback P̂R(θ_t) with E[P̂R(θ_t)] = PR(θ_t)
• assuming only Lipschitzness of PR we can apply Lipschitz bandits [KSU08]
Confidence bound with bandit feedback (Lipschitz PR(θ)) vs. confidence bound with performative feedback (Lipschitz Risk(θ, D(φ)) in φ)
[Figure: upper confidence bounds on PR(θ) in the two settings]
• Use successive elimination to deal with finite-sample uncertainty [EMM06]
• With performative feedback, the algorithm can discard regions of the parameter space that have never been explored!
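For illustration, a minimal successive-elimination loop over a discretized parameter grid using only bandit feedback (plain [EMM06]-style elimination); the performative-feedback refinements from the slide are not implemented, the distribution map is an illustrative stand-in, and the confidence width is a crude placeholder.

```python
import numpy as np

# Successive elimination over a theta grid with noisy PR estimates (bandit feedback only).
rng = np.random.default_rng(3)
mu, sigma, eps, m = 1.0, 1.0, 0.5, 500   # illustrative distribution map and batch size

def pr_hat(theta):
    z = rng.normal(mu + eps * theta, sigma, size=m)   # samples observed after deploying theta
    return np.mean(0.5 * (z - theta) ** 2)

grid = np.linspace(0.0, 4.0, 41)
active = np.ones(len(grid), dtype=bool)
means = np.zeros(len(grid))
counts = np.zeros(len(grid))

for phase in range(1, 11):
    for i in np.flatnonzero(active):                  # deploy every surviving candidate once per phase
        counts[i] += 1
        means[i] += (pr_hat(grid[i]) - means[i]) / counts[i]   # running mean of PR estimates
    width = 2.0 / np.sqrt(m * np.maximum(counts, 1))  # rough confidence width, purely illustrative
    best_ucb = np.min(means[active] + width[active])
    active &= (means - width) <= best_ucb             # drop arms whose lower bound exceeds the best upper bound

print("surviving candidates:", grid[active])
```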
Performative regret bound
Theorem [JZM22]:
Assume the distribution map D(θ) is ε-sensitive and the loss ℓ(z; θ) is L_z-Lipschitz in z. Then,
the performative confidence bounds algorithm after T deployments achieves regret
Reg(T) = ∑_{t=1}^{T} E[PR(θ_t)] − PR(θ_PO) = Õ( √T + T^{(d+1)/(d+2)} (L_z ε)^{d/(d+2)} )
D̂(θ) identified correctly → we can find the optimal solution θ̂_PO offline, for any loss function!
Microfoundations
Modeling 𝐷 𝜃 (“macro level”) in terms of the behavior of individual agents in the population (“micro level”)
Example: strategic classification [HMPW16]
Distribution D(θ) comes from strategic behavior of individuals trying to adapt to the decision rule
Rational-agent model:
x(θ) = argmax_x γ · f_θ(x) − cost(x₀, x)
Theorem [JZM22]:
There exists an algorithm that after T deployments achieves regret
Reg(T) = ∑_{t=1}^{T} E[PR(θ_t)] − PR(θ_PO) = Õ( √(T · max{d, dm}) )
Compare with the bandit approach: Reg(T) = Õ( √T + T^{(d+1)/(d+2)} (L_z ε)^{d/(d+2)} ); here we get a fast rate regardless of the strength of the performative effects
Location families
"Macro level" model: z ∼ D(θ)  ⇔  z = z₀ + μ*^⊤ θ, z₀ ∼ D₀
(unknown "base" distribution D₀, unknown matrix μ*)
Satisfied in the strategic classification model:
x(θ) = argmax_x γ · f_θ(x) − cost(x₀, x), with f_θ(x) = θ^⊤ x and cost(x₀, x) = ½ (x − x₀)^⊤ Λ (x − x₀)
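Why the quadratic-cost best response yields a location family, assuming Λ is positive definite: a short first-order-condition derivation.

```latex
\documentclass{article}
\usepackage{amsmath}
\DeclareMathOperator*{\argmax}{arg\,max}
\begin{document}
% Best response under f_theta(x) = theta^T x and quadratic cost with matrix Lambda:
\[ x(\theta) = \argmax_{x}\ \gamma\,\theta^{\top}x - \tfrac{1}{2}(x - x_0)^{\top}\Lambda(x - x_0) \]
% First-order condition: gamma*theta - Lambda*(x - x_0) = 0, hence
\[ x(\theta) = x_0 + \gamma\,\Lambda^{-1}\theta . \]
% The induced feature distribution is the base distribution shifted linearly in theta,
% i.e. a location family with shift matrix determined by gamma*Lambda^{-1}.
\end{document}
```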
Performative modeling through causal inference
Modeling 𝐷 𝜃 is fundamentally a causal inference problem
Example: if Zillow's housing pricing algorithm is fixed, we can't tell D(θ) and D_static apart
[Figure: performative risk vs. gain of causal identification, for deterministic f_θ, randomized f_θ (additive noise mechanism), and discrete f_θ (rounded predictions)]
Including the prediction can hurt performance
Next: Extensions
Extensions of the framework and connections
Performative prediction framework: recap
Distribution at time t: D_t = (1 − δ) · D_{t−1} + δ · D(θ_t)
Model θ deployed over multiple steps → distribution close to D(θ)
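A one-line unrolling shows why holding the deployment fixed drives the state to D(θ); it assumes the same θ is deployed for k consecutive steps.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Unrolling D_t = (1-delta) D_{t-1} + delta D(theta) for k steps with a fixed deployment theta:
\[ D_{t+k} = (1-\delta)^{k} D_{t} + \bigl(1 - (1-\delta)^{k}\bigr) D(\theta), \]
% so the state distribution approaches D(theta) geometrically at rate (1 - delta).
\end{document}
```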
Multi-agent performative prediction [NFDFR23, PY23, WYW23]
Example: multiple navigation apps predict travel time; people respond by considering multiple predictions
Neglecting performative feedback can amplify unfairness and polarization over time, even if
starting from a fair model [LDRSH18, HSNL18, JXLZ24]
For example, we can choose: ℓ((x, y); θ) = ℓ_pred((x, y); θ) − λ · y
The loss can even depend only on the data! e.g. ℓ((x, y); θ) = y is a valid loss, because PR(θ) = E_{D(θ)}[y]
Shift in perspective:
Focus on those impacted by performativity
Performative risk PR(θ) = Risk(θ, D(θ)): two levers to achieve small risk, the deployed model θ and the induced distribution D(θ)
Performative power
Quantifying the strength of performativity as a notion of power
P ≔ sup_{f ∈ F} (1/|U|) ∑_{u ∈ U} E[ dist(Z_u(f₀), Z_u(f)) ]
(f₀: deployed model; F: models the firm could deploy; U: participants; Z_u: outcome of participant u)
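A toy estimator of this quantity from an experiment with a single alternative model f, which only lower-bounds the supremum over F; all data below is synthetic, and in a real experiment one would randomize users across the two models rather than observe both counterfactual outcomes per user.

```python
import numpy as np

# Toy, synthetic estimate of performative power from comparing outcomes under the
# deployed model f0 and one alternative f; this lower-bounds the sup over all f in F.
rng = np.random.default_rng(4)
n_users = 10_000

# Hypothetical binary outcomes (e.g. click on a result) for each participant u.
z_f0 = rng.binomial(1, 0.30, size=n_users)   # Z_u(f0): outcome under the deployed model
z_f  = rng.binomial(1, 0.45, size=n_users)   # Z_u(f): outcome under the alternative model

# (1/|U|) * sum_u dist(Z_u(f0), Z_u(f)) with dist = absolute difference.
power_lb = np.mean(np.abs(z_f - z_f0))
print("lower bound on performative power:", power_lb)
```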
Performative power
Quantifying the strength of performativity as a notion of power
How much would the average outcome change if the firm were to deploy a
different model?
click on a website/consumption of a service
A causal inference problem
Through performativity we can relate the abstract concept of power to a causal inference problem.
• Experiment as gold standard: change the algorithm and inspect the effect
Power of Google search over the incoming traffic to a website ranked in first position corresponds to more than 44% of base traffic. These numbers speak to the performative power of the platform over its participants.
[CHM24]
[Diagram: the prediction feedback loop between population, data, model, and predictions within algorithmic systems]
Performativity:
Predictions impact people
Let’s zoom out
Data about people feeds back
into the learning system
Finding performative optima:
Firm anticipates
distribution shift
Let’s zoom out
Reversing the order of play:
Individuals anticipate the use of
their data for learning
Performativity is
the reason they care
Algorithmic resistance
Coordinated efforts
Coordination on the side of the users is effective for influencing the learning algorithm.
Data leverage:
• data strikes, conscious data contribution [VH21, VHS19, VLTCH21]
• algorithmic collective action [HMMZ23]
• collective infrastructure, e.g., GigSense [IFS24]
A single datapoint has little influence, but systematic patterns will be picked up on
[HMMZ23]
• Individuals' initial data: (x, y) ∼ P₀
• An α-fraction of the population joins the collective and implements a strategy to change their data, inducing P*
• The platform observes the mixture distribution P = (1 − α) · P₀ + α · P*
• The platform trains an ML model f on P
• Collective goal: a favorable property of f
Anticipate retraining
Theorem [HMMZ23]: For controlling the output of a gradient-based learner, it is sufficient to have a collective of fraction α repeatedly modifying their data, as long as
α ≥ O( E_{z∼P₀} ‖∇ℓ(θ*; z)‖ ),
i.e., α ∝ suboptimality of the targeted solution θ*.
The more accurate the learner, the more effective the strategy
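A toy end-to-end sketch of the setup above; the design choices here are ours, in the spirit of the signal-planting strategies studied in [HMMZ23] but not their exact construction: an α-fraction plants a feature signal and relabels to a target class, the platform retrains a logistic regression on the mixture, and we check whether the planted signal steers the model's predictions.

```python
import numpy as np

# Toy collective-action simulation (assumptions ours, inspired by [HMMZ23]).
rng = np.random.default_rng(5)
n, d, alpha = 20_000, 20, 0.05

# Base population P0: labels depend on the first feature only.
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)

# Collective strategy: set the last feature to a large value ("signal") and relabel to class 1.
idx = rng.choice(n, size=int(alpha * n), replace=False)
X[idx, -1] = 5.0
y[idx] = 1.0

# Platform trains logistic regression on the mixture P = (1-alpha) P0 + alpha P*.
w = np.zeros(d)
for _ in range(2_000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * (X.T @ (p - y)) / n        # gradient step on the average logistic loss

# Evaluate: fresh points from P0, with and without the planted signal.
X_test = rng.normal(size=(1_000, d))
X_sig = X_test.copy()
X_sig[:, -1] = 5.0
pred = lambda A: 1.0 / (1.0 + np.exp(-A @ w))
print("mean P(y=1) without signal:", pred(X_test).mean())
print("mean P(y=1) with signal   :", pred(X_sig).mean())
```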
Incentives to participate
Utility of firm and participants often not aligned
Collective action gives power to participants
• Performativity is pervasive
• We saw that predictions impact people and people impact predictions; how do
we model this jointly?
Thank you!
Questions?
References
[PZMH20] Perdomo, J., Zrnic, T., Mendler-Dünner, C., & Hardt, M. Performative prediction.
International Conference on Machine Learning (ICML), 2020
[MPZH20] Mendler-Dünner, C., Perdomo, J., Zrnic, T., & Hardt, M. Stochastic optimization for performative
prediction. Advances in Neural Information Processing Systems (NeurIPS), 2020
[DX23] Drusvyatskiy, D., & Xiao, L. Stochastic optimization with decision-dependent distributions.
Mathematics of Operations Research, 2023
[WBD21] Wood, K., Bianchin, G., & Dall’Anese, E. Online projected gradient descent for stochastic optimization
with decision-dependent distributions. IEEE Control Systems Letters, 2021
[MMG23] Mofakhami, M., Mitliagkas, I., & Gidel, G. Performative prediction with neural networks.
International Conference on Artificial Intelligence and Statistics (AISTATS), 2023
[LW24] Li, Q., & Wai, H. T. Stochastic optimization schemes for performative prediction with nonconvex loss.
ArXiv preprint arXiv:2405.17922, 2024
[KSU08] Kleinberg, R., Slivkins, A., & Upfal, E. Multi-armed bandits in metric spaces.
ACM Symposium on Theory of Computing (STOC), 2008
[EMM06] Even-Dar, E., Mannor, S., Mansour, Y., & Mahadevan, S. Action elimination and stopping conditions for
the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research (JMLR), 2006
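[JZM22] Jagadeesan, M., Zrnic, T., & Mendler-Dünner, C. Regret minimization with performative feedback.
International Conference on Machine Learning (ICML), 2022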
[GKRSW22] Gopalan, P., Kalai, A. T., Reingold, O., Sharan, V., & Wieder, U. Omnipredictors.
Innovations in Theoretical Computer Science (ITCS), 2022
[KP23] Kim, M. P., & Perdomo, J. C. Making decisions under outcome performativity.
Innovations in Theoretical Computer Science (ITCS), 2023
[HMPW16] Hardt, M., Megiddo, N., Papadimitriou, C., & Wootters, M. Strategic classification.
Innovations in Theoretical Computer Science (ITCS), 2016
[MDW22] Mendler-Dünner, C., Ding, F., & Wang, Y. Anticipating performativity by predicting from predictions.
Advances in Neural Information Processing Systems (NeurIPS), 2022
[BHK22] Brown, G., Hod, S., & Kalemaj, I. Performative prediction in a stateful world.
International Conference on Artificial Intelligence and Statistics (AISTATS), 2022
[RRDF22] Ray, M., Ratliff, L. J., Drusvyatskiy, D., & Fazel, M. Decision-dependent risk minimization in geometrically
decaying dynamic environments. AAAI Conference on Artificial Intelligence (AAAI), 2022
[LW22] Li, Q., & Wai, H. T. State dependent performative prediction with stochastic approximation.
International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.
[IZY22] Izzo, Z., Zou, J., & Ying, L. How to learn when data gradually reacts to your model.
International Conference on Artificial Intelligence and Statistics (AISTATS), 2022
[NFDFR23] Narang, A., Faulkner, E., Drusvyatskiy, D., Fazel, M., & Ratliff, L. J. Multiplayer performative prediction:
Learning in decision-dependent games. Journal of Machine Learning Research (JMLR), 2023
[PY23] Piliouras, G., & Yu, F. Y. Multi-agent performative prediction: From global stability and optimality to chaos.
ACM Conference on Economics and Computation (EC), 2023
[WYW23] Wang, X., Yau, C. Y., & Wai, H. T. Network effects in performative prediction games.
International Conference on Machine Learning (ICML), 2023
[MTR23] Mandal, D., Triantafyllou, S., & Radanovic, G. Performative reinforcement learning.
International Conference on Machine Learning (ICML), 2023
[CHM24] Cheng, G., Hardt, M., Mendler-Dünner, C. Causal Inference out of Control: Identifying the Steerability
of Consumption. International Conference on Machine Learning (ICML), 2024
[RTMR22] Rank, B., Triantafyllou, S., Mandal, D., & Radanovic, G. Performative reinforcement learning in
gradually shifting environments. Conference on Uncertainty in Artificial Intelligence (UAI), 2024
[LDRSH18] Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., & Hardt, M. Delayed impact of fair machine learning.
International Conference on Machine Learning (ICML), 2018
[HSNL18] Hashimoto, T., Srivastava, M., Namkoong, H., & Liang, P. Fairness without demographics in repeated
loss minimization. International Conference on Machine Learning (ICML), 2018
[JXLZ24] Jin, K., Xie, T., Liu, Y., & Zhang, X. Addressing polarization and unfairness in performative prediction.
ArXiv preprint arXiv:2406.16756, 2024
[VHS19] Vincent, N., Hecht, B., & Sen, S. “Data Strikes”: Evaluating the Effectiveness of a New Form of Collective
Action Against Technology Companies. The World Wide Web Conference (WWW), 2019
[VH21] Vincent, N., Hecht, B. Can “Conscious Data Contribution” Help Users to Exert “Data Leverage” Against
Technology Companies? ACM Human-Computer Interaction. 2021
[VLTCH21] Vincent, N., Li, H., Tilly, N., Chancellor, S., Hecht, B. Data Leverage: A Framework for Empowering the
Public in its Relationship with Technology Companies. ACM Conference on Fairness, Accountability, and
Transparency (FAccT). 2021
[HMMZ23] Hardt, M., Mazumdar, E., Mendler-Dünner, C., Zrnic, T. Algorithmic collective action in machine
learning. International Conference on Machine Learning (ICML), 2023
[IFS24] Imteyaz, K., Flores-Saviaga, C., Savage, S. GigSense: An LLM-Infused Tool for Workers’ Collective
Intelligence. ArXiv preprint arXiv:2405.02528, 2024
[TCLY22] Tian, Z., Cui, L., Liang, J., Yu, S.. A comprehensive survey on poisoning attacks and countermeasures in
machine learning. ACM Computing Surveys, 55(8), 2022.
[BM24] Baumann, J., Mendler-Dünner, C. Algorithmic collective action in recommender systems.
European Workshop on Algorithmic Fairness (EWAF), 2024
[HM23] Hardt, M., & Mendler-Dünner, C. Performative prediction: past and future.
ArXiv preprint arXiv:2310.16608, 2023