Papers by Debarghya Mukherjee

Many instances of algorithmic bias are caused by distributional shifts. For example, machine learning (ML) models often perform worse on demographic groups that are underrepresented in the training data. In this paper, we leverage this connection between algorithmic fairness and distribution shifts to show that algorithmic fairness interventions can help ML models overcome distribution shifts, and that domain adaptation methods (for overcoming distribution shifts) can mitigate algorithmic biases. In particular, we show that (i) enforcing suitable notions of individual fairness (IF) can improve the out-of-distribution accuracy of ML models, and that (ii) it is possible to adapt representation alignment methods for domain adaptation to enforce (individual) fairness. The former is unexpected because IF interventions were not developed with distribution shifts in mind. The latter is also unexpected because representation alignment is not a common approach in the IF literature.
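
To make the individual-fairness (IF) penalty concrete, here is a minimal, hypothetical sketch in the spirit of the abstract: a logistic model is trained with an extra term that pulls together the logits of each individual and a "comparable" counterpart (here, the same point with one hypothetical sensitive coordinate flipped). This uses a logit-consistency penalty rather than the representation alignment the paper adapts; the pairing rule, the penalty weight `lam`, and all data are illustrative assumptions, not the authors' construction.

```python
# Hypothetical IF-style training sketch: logistic loss plus a penalty on the
# squared logit gap between each point and a comparable counterpart.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
w_true[0] = 0.0                      # labels ignore the sensitive coordinate
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)
X_cf = X.copy()
X_cf[:, 0] *= -1                     # "comparable" individual: coord 0 flipped

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros(d)
lam, lr = 1.0, 0.5
D = X - X_cf                         # paired feature differences
for _ in range(2000):
    p = sigmoid(X @ w)
    g_fit = X.T @ (p - y) / n        # logistic-loss gradient
    g_if = 2 * D.T @ (D @ w) / n     # gradient of the mean squared logit gap
    w -= lr * (g_fit + lam * g_if)

print("mean logit gap on comparable pairs:", np.abs(D @ w).mean().round(4))
```

Because the true labels here do not depend on the flipped coordinate, the penalty steers the model toward the invariant rule rather than fighting accuracy.
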
Neural Information Processing Systems, Dec 6, 2021 (also arXiv e-prints, May 4, 2021)
Many instances of algorithmic bias are caused by subpopulation shifts. For example, ML models often perform worse on demographic groups that are underrepresented in the training data. In this paper, we study whether enforcing algorithmic fairness during training improves the performance of the trained model in the target domain. On one hand, we conceive scenarios in which enforcing fairness does not improve performance in the target domain. In fact, it may even harm performance. On the other hand, we derive necessary and sufficient conditions under which enforcing algorithmic fairness leads to the Bayes model in the target domain. We also illustrate the practical implications of our theoretical results in simulations and on real data.
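
As a toy illustration of the subpopulation-shift setting (not the paper's analysis), the sketch below compares plain ERM with simple group-balanced reweighting — a crude stand-in for a fairness intervention — when a minority group is underrepresented at training time but equally represented in the target domain. All distributions and weights are invented for illustration.

```python
# Toy subpopulation-shift experiment: each group has its own decision
# boundary; ERM on skewed data tracks the majority, while group-balanced
# reweighting moves the fit toward the balanced target domain.
import numpy as np

rng = np.random.default_rng(1)

def sample(n, p_minority):
    g = rng.random(n) < p_minority                 # group indicator
    mu = np.where(g, 1.0, -1.0)                    # group-specific boundary
    x = rng.normal(loc=mu, scale=1.0)
    y = (rng.random(n) < 1 / (1 + np.exp(-4 * (x - mu)))).astype(float)
    return x[:, None], y, g

Xtr, ytr, gtr = sample(4000, p_minority=0.05)      # skewed training data
Xte, yte, _ = sample(4000, p_minority=0.50)        # balanced target domain

def fit_predict(X, y, w, Xnew, steps=3000, lr=0.5):
    Xb = np.hstack([X, np.ones((len(X), 1))])      # add intercept
    th = np.zeros(2)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ th))
        th -= lr * Xb.T @ (w * (p - y)) / w.sum()  # weighted logistic gradient
    return (np.hstack([Xnew, np.ones((len(Xnew), 1))]) @ th > 0).astype(float)

weights = {
    "plain ERM": np.ones(len(ytr)),
    "group-balanced": np.where(gtr, 0.5 / gtr.mean(), 0.5 / (1 - gtr.mean())),
}
for name, w in weights.items():
    acc = (fit_predict(Xtr, ytr, w, Xte) == yte).mean()
    print(f"{name}: target accuracy = {acc:.3f}")
```

In this toy, ERM fits the majority group's boundary, while balancing group contributions yields a rule that serves both groups in the balanced target domain.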

Electronic Journal of Statistics, 2022
This paper presents a number of new findings about the canonical change point estimation problem. The first part studies the estimation of a change point on the real line in a simple stump model using the robust Huber estimating function which interpolates between the ℓ_1 (absolute deviation) and ℓ_2 (least squares) based criteria. While the ℓ_2 criterion has been studied extensively, its robust counterparts and in particular, the ℓ_1 minimization problem have not. We derive the limit distribution of the estimated change point under the Huber estimating function and compare it to that under the ℓ_2 criterion. Theoretical and empirical studies indicate that it is more profitable to use the Huber estimating function (and in particular, the ℓ_1 criterion) under heavy tailed errors as it leads to smaller asymptotic confidence intervals at the usual levels compared to the ℓ_2 criterion. We also compare the ℓ_1 and ℓ_2 approaches in a parallel setting, where one has m independent single change point problems and the goal is to control the maximal deviation of the estimated change points from the true values, and establish rigorously that the ℓ_1 estimation criterion provides a superior rate of convergence to the ℓ_2, and that this relative advantage is driven by the heaviness of the tail of the error distribution. Finally, we derive minimax optimal rates for the change plane estimation problem in growing dimensions and demonstrate that Huber estimation attains the optimal rate while the ℓ_2 scheme produces a rate sub-optimal estimator for heavy tailed errors. In the process of deriving our results, we establish a number of properties about the minimizers of compound Binomial and compound Poisson processes which are of independent interest.
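
A minimal numerical sketch of the comparison discussed above (illustrative, not the paper's procedure): in a one-dimensional stump model with heavy-tailed t(2) errors, scan candidate split points and score each side with either the ℓ_2 criterion (means / squared error) or the ℓ_1 criterion (medians / absolute error); the Huber criterion interpolates between these two endpoints.

```python
# Illustrative change-point scan in a stump model with heavy-tailed noise,
# comparing the l2 (mean / squared error) and l1 (median / absolute error)
# split criteria. All constants are toy choices.
import numpy as np

rng = np.random.default_rng(2)
n, d0 = 1000, 0.3                     # sample size, true change point
x = np.sort(rng.random(n))
y = np.where(x <= d0, 0.0, 1.0) + rng.standard_t(df=2, size=n)

def split_cost(left, right, crit):
    if crit == "l2":
        return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
    return np.abs(left - np.median(left)).sum() + np.abs(right - np.median(right)).sum()

for crit in ("l2", "l1"):
    costs = [split_cost(y[:k], y[k:], crit) for k in range(10, n - 10)]
    d_hat = x[10 + int(np.argmin(costs))]
    print(f"{crit}: estimated change point = {d_hat:.3f} (true {d0})")
```

Rerunning the toy with Gaussian errors shrinks the gap between the two criteria, consistent with the abstract's message that the ℓ_1 advantage is driven by tail heaviness.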

arXiv, 2021
Post-processing in algorithmic fairness is a versatile approach for correcting bias in ML systems that are already used in production. The main appeal of post-processing is that it avoids expensive retraining. In this work, we propose general post-processing algorithms for individual fairness (IF). We consider a setting where the learner only has access to the predictions of the original model and a similarity graph between individuals, guiding the desired fairness constraints. We cast the IF post-processing problem as a graph smoothing problem corresponding to graph Laplacian regularization that preserves the desired “treat similar individuals similarly” interpretation. Our theoretical results demonstrate the connection of the new objective function to a local relaxation of the original individual fairness. Empirically, our post-processing algorithms correct individual biases in large-scale NLP models such as BERT, while preserving accuracy.
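
The graph-smoothing objective described above has a simple closed form. Below is a minimal sketch (toy data; the graph, weights, and λ are illustrative): given base-model scores ŷ and a similarity matrix W, minimizing ||f − ŷ||² + λ fᵀLf with Laplacian L = D − W gives f = (I + λL)⁻¹ŷ.

```python
# Minimal graph-Laplacian post-processing sketch: smooth base-model scores
# over a similarity graph so that similar individuals get similar scores.
import numpy as np

rng = np.random.default_rng(3)
n = 6
y_hat = rng.random(n)                        # base-model predictions
W = rng.random((n, n))
W = (W + W.T) / 2                            # symmetric similarity weights
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W               # unnormalized graph Laplacian

lam = 2.0                                    # smoothing strength (illustrative)
f = np.linalg.solve(np.eye(n) + lam * L, y_hat)
print("before:", np.round(y_hat, 3))
print("after: ", np.round(f, 3))
```

For graphs too large to factor directly, the same linear system can be solved iteratively, e.g. with conjugate gradients.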

Non-randomized treatment effect models are widely used for the assessment of treatment effects in various fields, in particular social science disciplines such as political science, psychometrics, and psychology. More specifically, these are situations where treatment is assigned to an individual based on some of their characteristics (e.g. a scholarship is allocated based on merit, or antihypertensive treatments are allocated based on blood pressure level) instead of being allocated randomly, as is the case, for example, in randomized clinical trials. Popular methods that have been largely employed to date for estimation of such treatment effects suffer from slow rates of convergence (i.e. slower than $\sqrt{n}$). In this paper, we present a new model coined SCENTS: Score Explained Non-Randomized Treatment Systems, and a corresponding method that allows estimation of the treatment effect at $\sqrt{n}$ rate in the presence of fairly general forms of confoundedness, when the `score' variable on who...

Optimal transport (OT) measures distances between distributions in a way that depends on the geometry of the sample space. In light of recent advances in computational OT, OT distances are widely used as loss functions in machine learning. Despite their prevalence and advantages, OT loss functions can be extremely sensitive to outliers. In fact, a single adversarially-picked outlier can increase the standard W_2-distance arbitrarily. To address this issue, we propose an outlier-robust formulation of OT. Our formulation is convex but challenging to scale at first glance. Our main contribution is deriving an equivalent formulation based on cost truncation that is easy to incorporate into modern algorithms for computational OT. We demonstrate the benefits of our formulation in mean estimation problems under the Huber contamination model in simulations and outlier detection tasks on real data.
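
The cost-truncation idea is easy to try: cap the ground cost at a threshold τ before running entropic OT, so that a single far-away point cannot inflate the distance. The sketch below uses a toy log-domain Sinkhorn solver; τ, ε, and the data are illustrative choices, not values from the paper.

```python
# Robust-OT sketch: compare entropic OT cost with and without truncating the
# squared-Euclidean ground cost, in the presence of one adversarial outlier.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(4)
x = rng.normal(size=(50, 2))
y = rng.normal(size=(50, 2))
y[0] = [100.0, 100.0]                        # one adversarial outlier

def sinkhorn_cost(C, eps=1.0, iters=300):
    n, m = C.shape
    log_a = np.full(n, -np.log(n))           # uniform source weights
    log_b = np.full(m, -np.log(m))           # uniform target weights
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(iters):                   # log-domain updates avoid underflow
        f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
    P = np.exp((f[:, None] + g[None, :] - C) / eps)   # transport plan
    return (P * C).sum()

C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)    # squared-Euclidean cost
print("standard OT cost :", round(sinkhorn_cost(C), 2))
print("truncated OT cost:", round(sinkhorn_cost(np.minimum(C, 10.0)), 2))
```

The untruncated cost is dominated by the mass forced onto the outlier, while the truncated cost stays on the scale of the inlier geometry.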

Regression discontinuity design models are widely used for the assessment of treatment effects in psychology, econometrics and biomedicine, specifically in situations where treatment is assigned to an individual based on their characteristics (e.g. a scholarship is allocated based on merit) instead of being allocated randomly, as is the case, for example, in randomized clinical trials. Popular methods that have been largely employed to date for estimation of such treatment effects suffer from slow rates of convergence (i.e. slower than $\sqrt{n}$). In this paper, we present a new model and method that allows estimation of the treatment effect at $\sqrt{n}$ rate in the presence of fairly general forms of confoundedness. Moreover, we show that our estimator is also semi-parametrically efficient in certain situations. We analyze two real datasets via our method and compare our results with those obtained by using previous approaches. We conclude this paper with a discussion on some possible extensions...

Manski's celebrated maximum score estimator for the discrete choice model, which is an optimal linear discriminator, has been the focus of much investigation in both the econometrics and statistics literatures, but its behavior under growing dimension scenarios largely remains unknown. This paper addresses that gap. Two different cases are considered: p grows with n but at a slow rate, i.e. p/n → 0; and p ≫ n (fast growth). In the binary response model, we recast Manski's score estimation as empirical risk minimization for a classification problem, and derive the ℓ_2 rate of convergence of the score estimator under a transition condition in terms of our margin parameter that calibrates the level of difficulty of the estimation problem. We also establish upper and lower bounds for the minimax ℓ_2 error in the binary choice model that differ by a logarithmic factor, and construct a minimax-optimal estimator in the slow growth regime. Some extensions to the general case – the m...
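
For intuition, here is a toy version of the maximum score estimator in d = 2 (not a practical algorithm for growing dimensions): maximize the score Σᵢ (2yᵢ − 1)·1{xᵢᵀβ ≥ 0} over unit vectors β via a grid on angles. The heavy-tailed Cauchy errors with median zero satisfy the model's median restriction; all constants are illustrative.

```python
# Toy maximum score estimation for the binary choice model y = 1{x'b + u >= 0}
# with median(u | x) = 0, by brute-force search over unit vectors in 2D.
import numpy as np

rng = np.random.default_rng(5)
n = 2000
X = rng.normal(size=(n, 2))
beta0 = np.array([np.cos(0.7), np.sin(0.7)])   # true direction (unit norm)
u = rng.standard_cauchy(n)                     # heavy-tailed, median-zero errors
y = (X @ beta0 + u >= 0).astype(float)

angles = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
B = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # candidate directions
scores = ((2 * y - 1)[None, :] * (X @ B.T >= 0).T).sum(axis=1)
beta_hat = B[np.argmax(scores)]
print("true:", np.round(beta0, 3), "estimated:", np.round(beta_hat, 3))
```

The grid search makes the discontinuous score objective tractable in this toy; the abstract's ERM view treats the same objective as a classification risk.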

There is no trade-off: enforcing fairness can improve accuracy
arXiv, 2020
One of the main barriers to the broader adoption of algorithmic fairness in machine learning is the trade-off between fairness and performance of ML models: many practitioners are unwilling to sacrifice the performance of their ML model for fairness. In this paper, we show that this trade-off may not be necessary. If the algorithmic biases in an ML model are due to sampling biases in the training data, then enforcing algorithmic fairness may improve the performance of the ML model on unbiased test data. We study conditions under which enforcing algorithmic fairness helps practitioners learn the Bayes decision rule for (unbiased) test data from biased training data. We also demonstrate the practical implications of our theoretical results in real-world ML tasks.

arXiv: Applications, 2020
We study and predict the evolution of Covid-19 in six US states over the period May 1 through August 31 using a discrete compartment-based model and prescribe active intervention policies, like lockdowns, on the basis of minimizing a loss function, within the broad framework of partially observed Markov decision processes. For each state, Covid-19 data for 40 days (starting from May 1 for two northern states and June 1 for four southern states) are analyzed to estimate the transition probabilities between compartments and other parameters associated with the evolution of the epidemic. These quantities are then used to predict the course of the epidemic in the given state for the next 50 days (test period) under various policy allocations, leading to different values of the loss function over the training horizon. The optimal policy allocation is the one corresponding to the smallest loss. Our analysis shows that none of the six states need lockdowns over the test period, though the ...
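
As a stylized illustration of the policy-selection loop (a drastically simplified stand-in for the paper's compartment model and POMDP machinery), the sketch below simulates a discrete SIR-type epidemic in which a lockdown scales the transmission rate, and ranks a few candidate policies by a loss trading infections against lockdown days. Every parameter and policy here is invented for illustration.

```python
# Toy policy evaluation: a discrete SIR model with a daily lockdown decision
# that scales transmission; policies are ranked by a simple loss.
import numpy as np

def simulate(policy, days=50, beta=0.25, gamma=0.1, lockdown_factor=0.4):
    s, i, r = 0.99, 0.01, 0.0                 # susceptible / infected / removed
    total_infected = i
    for day in range(days):
        b = beta * (lockdown_factor if policy[day] else 1.0)
        new_inf = b * s * i                   # new infections this day
        s, i, r = s - new_inf, i + new_inf - gamma * i, r + gamma * i
        total_infected += new_inf
    return total_infected

def loss(policy, infection_weight=10.0, lockdown_cost=0.02):
    return infection_weight * simulate(policy) + lockdown_cost * sum(policy)

days = 50
policies = {
    "no lockdown": [0] * days,
    "full lockdown": [1] * days,
    "first 3 weeks": [1] * 21 + [0] * (days - 21),
}
for name, p in policies.items():
    print(f"{name}: loss = {loss(p):.3f}")
```

The selected policy is simply the argmin of the loss over the candidate set, mirroring the "smallest loss" criterion in the abstract.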

arXiv: Statistics Theory, 2020
Linear thresholding models postulate that the conditional distribution of a response variable in terms of covariates differs on the two sides of a (typically unknown) hyperplane in the covariate space. A key goal in such models is to learn about this separating hyperplane. Exact likelihood or least squares methods to estimate the thresholding parameter involve an indicator function, which makes them difficult to optimize; they are therefore often tackled by using a surrogate loss based on a smooth approximation to the indicator. In this note, we demonstrate that the resulting estimator is asymptotically normal with a near optimal rate of convergence: $n^{-1}$ up to a log factor, in a classification thresholding model. This is substantially faster than the currently established convergence rates of smoothed estimators for similar models in the statistics and econometrics literatures.
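
A toy rendering of the smoothing device discussed above: the criterion's indicator 1{xᵀβ > 0} is replaced by a sigmoid with a bandwidth σₙ shrinking in n, giving a differentiable surrogate amenable to gradient descent. The bandwidth schedule, the annealing heuristic, and the data below are illustrative assumptions, not the paper's construction.

```python
# Smoothed least-squares estimation of a classification threshold: replace
# the indicator 1{x'beta > 0} with sigmoid(x'beta / sigma) and run gradient
# descent, annealing sigma toward its final (shrinking-in-n) value.
import numpy as np

rng = np.random.default_rng(6)
n = 5000
X = rng.normal(size=(n, 2))
beta0 = np.array([2.0, -1.0]) / np.sqrt(5.0)     # true hyperplane direction
flip = rng.random(n) < 0.15                      # 15% label noise
y = ((X @ beta0 > 0) ^ flip).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

beta = np.array([1.0, 0.0])
for sigma in (0.5, 0.2, n ** (-1 / 3)):          # anneal bandwidth (heuristic)
    for _ in range(500):
        p = sigmoid(X @ beta / sigma)
        grad = -2 * X.T @ ((y - p) * p * (1 - p)) / (n * sigma)
        beta -= 0.5 * grad
        beta /= np.linalg.norm(beta)             # direction identified up to scale
print("true:", np.round(beta0, 3), "estimated:", np.round(beta, 3))
```

Annealing from a wide sigmoid avoids the vanishing gradients a very small bandwidth would cause at a poor starting point; the final bandwidth is what controls the localization of the estimate.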