This paper aims to better predict highly skewed auto insurance claims by combining candidate predictions. We analyze a version of the Kangaroo Auto Insurance company data and study the effects of combining different methods using five measures of prediction accuracy. The results show the following. First, when there is an outstanding (in terms of Gini Index) prediction among the candidates, the "forecast combination puzzle" phenomenon disappears. The simple average method performs much worse than the more sophisticated model combination methods, indicating that combining different methods could help us avoid performance degradation. Second, the choice of the prediction accuracy measure is crucial in defining the best candidate prediction for "low frequency and high severity" (LFHS) data. For example, mean square error (MSE) does not distinguish well between model combination methods, as the values are close. Third, the performances of different model combination methods can differ d...
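As a hedged illustration of the kind of accuracy measure discussed above, the sketch below computes a normalized Gini index (a common formulation in insurance prediction contests, not necessarily the exact variant used in the paper) alongside MSE for a toy set of highly skewed claims; the function names and simulated data are illustrative assumptions.

```python
import numpy as np

def gini(actual, pred):
    """Gini coefficient of `pred` against `actual`: rank policies by predicted
    risk and compare the cumulative share of actual losses with a uniform baseline."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    n = len(actual)
    order = np.argsort(-pred)                       # most risky (by prediction) first
    cum_share = np.cumsum(actual[order]) / actual.sum()
    return cum_share.sum() / n - (n + 1) / (2 * n)

def normalized_gini(actual, pred):
    """Scale by the Gini of a perfect prediction so the best possible score is 1."""
    return gini(actual, pred) / gini(actual, actual)

# toy "low frequency, high severity" claims: mostly zeros, a few very large losses
rng = np.random.default_rng(0)
claims = np.where(rng.random(2000) < 0.05, rng.lognormal(8, 1, 2000), 0.0)
good_pred = claims + rng.normal(0, 500, 2000)       # informative candidate
poor_pred = rng.permutation(claims)                 # uninformative candidate

for name, p in [("good", good_pred), ("poor", poor_pred)]:
    print(name, "normalized Gini:", round(normalized_gini(claims, p), 3),
          " MSE:", round(float(np.mean((claims - p) ** 2)), 1))
```

On data like these, the normalized Gini separates the two candidates sharply while the MSE values can remain of comparable magnitude, which is the point made in the abstract.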
The multi-armed bandit problem is an important optimization game that requires an exploration-exploitation tradeoff to achieve optimal total reward. Motivated by industrial applications such as online advertising and clinical research, we consider a setting where the rewards of bandit machines are associated with covariates, and the accurate estimation of the corresponding mean reward functions plays an important role in the performance of allocation rules. Under a flexible problem setup, we establish asymptotic strong consistency and perform a finite-time regret analysis for a sequential randomized allocation strategy based on kernel estimation. In addition, since many nonparametric and parametric methods in supervised learning may be applied to estimating the mean reward functions but guidance on how to choose among them is generally unavailable, we propose a model combining allocation strategy for adaptive performance. Simulations and a real data evaluation are conducted to illustrate the performance of the proposed allocation strategy.
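To make the kernel-estimation idea concrete, here is a minimal sketch (not the paper's exact allocation rule) of a covariate bandit in which each arm's mean reward function is estimated by a Nadaraya-Watson smoother and arms are chosen greedily with a small forced-randomization probability; the bandwidth, reward functions, and exploration rate are illustrative assumptions.

```python
import numpy as np

def nw_estimate(x, X, Y, h=0.2):
    """Nadaraya-Watson estimate of the mean reward at covariate x from
    the (covariate, reward) history (X, Y) of one arm."""
    if len(X) == 0:
        return 0.0
    w = np.exp(-0.5 * ((x - np.asarray(X)) / h) ** 2)   # Gaussian kernel weights
    return float(np.dot(w, Y) / w.sum()) if w.sum() > 0 else 0.0

def run(T=2000, K=3, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # illustrative true mean reward functions of the covariate
    truth = [lambda x: 0.3 + 0.4 * x, lambda x: 0.7 - 0.4 * x, lambda x: 0.5]
    hist = [([], []) for _ in range(K)]                  # per-arm covariates and rewards
    total = 0.0
    for _ in range(T):
        x = rng.random()                                 # covariate observed before pulling
        if rng.random() < eps:                           # forced randomization: keep exploring
            a = rng.integers(K)
        else:                                            # exploit the kernel estimates
            a = int(np.argmax([nw_estimate(x, *hist[k]) for k in range(K)]))
        r = truth[a](x) + rng.normal(0, 0.1)
        hist[a][0].append(x)
        hist[a][1].append(r)
        total += r
    return total

print(run())
```

The model-combining strategy in the abstract would replace the single kernel smoother by a weighted mixture of several reward-function estimators; that layer is omitted here.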
Given a dictionary of M_n initial estimates of the unknown true regression function, we aim to construct linearly aggregated estimators that target the best performance among all the linear combinations under a sparse q-norm (0 ≤ q ≤ 1) constraint on the linear coefficients. Besides identifying the optimal rates of aggregation for these ℓ_q-aggregation problems, our multi-directional (or universal) aggregation strategies by model mixing or model selection achieve the optimal rates simultaneously over the full range of 0 ≤ q ≤ 1 for general M_n and upper bound t_n of the q-norm. Both random and fixed designs, with known or unknown error variance, are handled, and the ℓ_q-aggregations examined in this work cover major types of aggregation problems previously studied in the literature. Consequences on minimax-rate adaptive regression under ℓ_q-constrained true coefficients (0 ≤ q ≤ 1) are also provided. Our results show that the minimax rate of ℓ_q-aggregation (0 ≤ q ≤ 1) is basically d...
This assumption is often made to analyze parametric and nonparametric estimators, e.g., in van der Laan, Dudoit and Keles (2004). Recall that for a positive sequence a_n, an estimator sequence {f̂_n}_{n=1}^∞ is said to converge exactly at rate {a_n} in probability under the L2 loss if (i) ‖f − f̂_n‖_2 = O_p(a_n), and (ii) for every 0 < ε < 1, there exists c > 0 such that when n is large enough, P(‖f − f̂_n‖_2 ≥ c a_n) ≥ 1 − ε.
Given a dictionary of M_n predictors, in a random design regression setting with n observations, we construct estimators that target the best performance among all the linear combinations of the predictors under a sparse l_q-norm (0 ≤ q ≤ 1) constraint on the linear coefficients. Besides identifying the optimal rates of convergence, our universal aggregation strategies by model mixing achieve the optimal rates simultaneously over the full range of 0 ≤ q ≤ 1 for any M_n and without knowledge of the l_q-norm of the best linear coefficients to represent the regression function. To allow model misspecification, our upper bound results are obtained in a framework of aggregation of estimates. A striking feature is that no specific relationship among the predictors is needed to achieve the upper rates of convergence (hence permitting basically arbitrary correlations between the predictors). Therefore, whatever the true regression function (assumed to be uniformly bounded), our estimators autom...
It is often reported in the forecast combination literature that a simple average of candidate forecasts is more robust than sophisticated combining methods. This phenomenon is usually referred to as the "forecast combination puzzle". Motivated by this puzzle, we explore its possible explanations, including high variance in estimating the target optimal weights (estimation error), invalid weighting formulas, and model/candidate screening before combination. We show that the existing understanding of the puzzle should be complemented by the distinction of different forecast combination scenarios known as combining for adaptation and combining for improvement. Applying combining methods without considering the underlying scenario can itself cause the puzzle. Based on our new understandings, both simulations and real data evaluations are conducted to illustrate the causes of the puzzle. We further propose a multi-level AFTER strategy that can integrate the strengths of different combin...
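As a rough illustration of the difference between the simple average and a performance-weighted combination (in the spirit of AFTER, though not the paper's multi-level version), the sketch below weights each candidate forecast by exponentiating its negative cumulative squared error; the data-generating process and the scale parameter sigma2 are illustrative assumptions.

```python
import numpy as np

def after_weights(errors, sigma2=1.0):
    """AFTER-style weights: exponential in the cumulative squared forecast
    errors of each candidate (smaller cumulative error -> larger weight)."""
    s = -np.cumsum(errors ** 2, axis=0) / (2 * sigma2)
    w = np.exp(s - s.max(axis=1, keepdims=True))         # stabilize before normalizing
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
T, M = 200, 4
y = rng.normal(0, 1, T)
# candidates of clearly unequal skill, so one forecast stands out
forecasts = y[:, None] + rng.normal(0, [0.3, 0.5, 1.0, 2.0], (T, M))
errors = forecasts - y[:, None]

w = after_weights(errors[:-1])                            # weights for time t use errors up to t-1
combined = (w * forecasts[1:]).sum(axis=1)
simple_avg = forecasts[1:].mean(axis=1)
print("simple average MSE:", np.mean((simple_avg - y[1:]) ** 2))
print("weighted combo MSE:", np.mean((combined - y[1:]) ** 2))
```

When one candidate is clearly better, as simulated here, the performance-weighted combination beats the simple average, matching the "no puzzle" scenario described in the abstract.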
The traditional activity of model selection aims at discovering a single model superior to other candidate models. In the presence of pronounced noise, however, multiple models are often found to explain the same data equally well. To resolve this model selection ambiguity, we introduce the general approach of model selection confidence sets (MSCSs) based on likelihood ratio testing. An MSCS is defined as a list of models statistically indistinguishable from the true model at a user-specified level of confidence, which extends the familiar notion of confidence intervals to the model-selection framework. Our approach guarantees asymptotically correct coverage probability of the true model when both sample size and model dimension increase. We derive conditions under which the MSCS contains all the relevant information about the true model structure. In addition, we propose natural statistics based on the MSCS to measure the importance of variables in a principled way that accounts for the overall model uncertainty. When the space of feasible models is large, the MSCS is implemented by an adaptive stochastic search algorithm which samples MSCS models with high probability. The MSCS methodology is illustrated through numerical experiments on synthetic data and real data examples.
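A minimal sketch of the core idea, under simplifying assumptions (Gaussian linear models, fixed dimension, exhaustive enumeration): a candidate submodel is kept in the confidence set when its likelihood-ratio statistic against the full model does not exceed the chi-square critical value. The paper's actual construction is more general; the data, names, and threshold here are illustrative.

```python
import itertools
import numpy as np
from scipy import stats

def loglik_gaussian(y, X):
    """Profile log-likelihood of a Gaussian linear model (variance profiled out)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * np.log(rss / n)

def mscs(y, X, alpha=0.05):
    """Collect all submodels whose likelihood-ratio statistic against the full
    model stays below the chi-square critical value (a simplified MSCS)."""
    n, p = X.shape
    full = loglik_gaussian(y, X)
    kept = [tuple(range(p))]                              # the full model is trivially included
    for k in range(1, p):
        for S in itertools.combinations(range(p), k):
            lr = 2 * (full - loglik_gaussian(y, X[:, list(S)]))
            if lr <= stats.chi2.ppf(1 - alpha, df=p - k):
                kept.append(S)
    return kept

rng = np.random.default_rng(2)
n, p = 120, 5
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 2.0, n)   # only the first two matter
print(mscs(y, X))
```

With noisy data like these, several supersets of the true variables typically remain in the set, which is exactly the selection ambiguity the MSCS is meant to quantify.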
In the era of "big data", analysts usually explore various statistical models or machine learning methods for observed data in order to facilitate scientific discoveries or gain predictive power. Whatever data and fitting procedures are employed, a crucial step is to select the most appropriate model or method from a set of candidates. Model selection is a key ingredient in data analysis for reliable and reproducible statistical inference or prediction, and thus central to scientific studies in fields such as ecology, economics, engineering, finance, political science, biology, and epidemiology. There has been a long history of model selection techniques that arise from research in statistics, information theory, and signal processing. A considerable number of methods have been proposed, following different philosophies and exhibiting varying performances. The purpose of this article is to give a comprehensive overview of them, in terms of their motivation, large sample performance, and applicability. We provide integrated and practically relevant discussions on the theoretical properties of state-of-the-art model selection approaches. We also share our thoughts on some controversial views on the practice of model selection.
We introduce a new criterion to determine the order of an autoregressive model fitted to time series data. It has the benefits of the two well-known model selection techniques, the Akaike information criterion and the Bayesian information criterion. When the data are generated from a finite-order autoregression, the Bayesian information criterion is known to be consistent, and so is the new criterion. When the true order is infinity or suitably high with respect to the sample size, the Akaike information criterion is known to be efficient in the sense that its prediction performance is asymptotically equivalent to the best offered by the candidate models; in this case, the new criterion behaves in a similar manner. Unlike the two classical criteria, the proposed criterion adaptively achieves either consistency or efficiency depending on the underlying true model. In practice, where the observed time series is given without any prior information about the model specification, the proposed order selection criterion is more flexible and robust than the classical approaches. Numerical results are presented demonstrating the adaptivity of the proposed technique when applied to various datasets.
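The new criterion itself is not reproduced here, but the two classical criteria it bridges are easy to illustrate. The sketch below fits AR(p) models by least squares and selects the order by AIC and by BIC; the data-generating AR(2) process and the maximum order are illustrative assumptions.

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares fit of an AR(p) model with intercept; returns the residual
    sum of squares and the number of effective observations used."""
    if p == 0:
        resid = x - x.mean()
        return float(resid @ resid), len(x)
    X = np.column_stack([x[p - j - 1:len(x) - j - 1] for j in range(p)])  # lag 1..p columns
    y = x[p:]
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid), len(y)

def order_select(x, pmax=10):
    aic, bic = {}, {}
    for p in range(pmax + 1):
        rss, m = fit_ar(x, p)        # effective sample size differs slightly across orders here
        aic[p] = m * np.log(rss / m) + 2 * (p + 1)           # AIC: efficient for high/infinite order
        bic[p] = m * np.log(rss / m) + (p + 1) * np.log(m)   # BIC: consistent for finite order
    return min(aic, key=aic.get), min(bic, key=bic.get)

rng = np.random.default_rng(3)
x = np.zeros(500)
for t in range(2, 500):                                      # a true AR(2) process
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
print(order_select(x))
```

The adaptive criterion described above would automatically behave like BIC on this finite-order example and like AIC when the true order grows with the sample size.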
Motivated by applications in personalized web services and clinical research, we consider a multi-armed bandit problem in a setting where the mean reward of each arm is associated with some covariates. A multi-stage randomized allocation with arm elimination algorithm is proposed to combine flexibility in reward function modeling with a theoretical guarantee of a cumulative regret minimax rate. When the function smoothness parameter is unknown, the algorithm is equipped with a histogram-estimation-based smoothness parameter selector using Lepski's method, and is shown to maintain the regret minimax rate up to a logarithmic factor under a "self-similarity" condition.
Model selection for quantile regression is a challenging problem. In addition to the well-known general difficulty of model selection uncertainty, when quantiles at multiple probability levels are of interest, typically a single candidate does not serve all of them simultaneously. In this paper, we propose methods to combine quantile estimators. Oracle inequalities show that, at each given probability level, the combined estimators automatically perform nearly as well as the best candidate. Simulation and examples show that the proposed model combination approach often leads to a substantial gain in accuracy under global measures of performance.
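As a hedged sketch of level-specific combining (not the paper's exact procedure), the snippet below weights candidate quantile estimators by exponentiating their negative check (pinball) loss on holdout data, separately at each probability level; the candidates, the tuning constant lam, and the data are illustrative.

```python
import numpy as np

def pinball(y, q_pred, tau):
    """Quantile (pinball/check) loss at probability level tau."""
    u = y - q_pred
    return np.mean(np.maximum(tau * u, (tau - 1) * u))

def combine_quantiles(y_holdout, cand_preds, tau, lam=5.0):
    """Weight candidate quantile estimators by exponentiating their negative
    holdout pinball loss, separately at each probability level tau."""
    losses = np.array([pinball(y_holdout, p, tau) for p in cand_preds])
    w = np.exp(-lam * (losses - losses.min()))
    return w / w.sum()

rng = np.random.default_rng(4)
y = rng.lognormal(0, 1, 500)
tau = 0.9
cands = [np.full_like(y, np.quantile(y, tau)),        # a candidate targeting this level
         np.full_like(y, np.quantile(y, 0.5)),        # a median-targeted candidate
         np.full_like(y, y.mean())]                    # a mean-targeted candidate
print(combine_quantiles(y, cands, tau))
```

Because the weights are recomputed at each level tau, a candidate that is best for the median need not dominate at the 0.9 quantile, which is the point the abstract makes about multiple probability levels.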
The field of machine learning has developed a wide array of techniques for improving the effectiveness of performance elements. Ideally, a learning system would adapt its commitments to the demands of a particular learning situation, rather than relying on fixed commitments that impose tradeoffs between the efficiency and utility of a learning technique. This article presents an extension of the COMPOSER learning approach that dynamically adjusts its learning behavior based on the resources available for learning. COMPOSER is a speed-up learning technique that provides a statistical approach to the utility problem. The system identifies a sequence of transformations that, with high probability, increase the Type I utility of an initial planning system. The approach breaks the task into a learning phase and a utilization phase. This extension to COMPOSER adopts a rational policy that dynamically balances the trade-off between efficiency and utility. Implications for learning systems are discussed.
We introduce the notion of variable selection confidence set (VSCS) for linear regression based on F-testing. Our method identifies the most important variables in a principled way that goes beyond simply trusting the single lucky winner based on a model selection criterion. The VSCS extends the usual notion of confidence intervals to the variable selection problem: a VSCS is a set of regression models that contains the true model with a given level of confidence. Although the size of the VSCS properly reflects the model selection uncertainty, without specific assumptions on the true model, the VSCS is typically rather large (unless the number of predictors is small). As a solution, we advocate special attention to the set of lower boundary models (LBMs), which are the most parsimonious models that are not statistically significantly inferior to the full model at a given confidence level. Based on the LBMs, variable importance and measures of co-appearance importance of predictors can be naturally defined.
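A minimal sketch, under illustrative assumptions, of the F-test behind the lower boundary models: a candidate subset of predictors is "not significantly inferior" when the partial F statistic comparing it with the full linear model stays below the critical value. Identifying the LBMs additionally requires searching over parsimonious subsets, which is omitted here.

```python
import numpy as np
from scipy import stats

def not_inferior_to_full(y, X, subset, alpha=0.05):
    """Partial F-test: is the submodel using `subset` of the columns of X not
    statistically significantly inferior to the full model at level alpha?"""
    n, p = X.shape

    def rss(cols):
        Z = np.column_stack([np.ones(n), X[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        return float(r @ r)

    rss_sub, rss_full = rss(subset), rss(range(p))
    df1, df2 = p - len(subset), n - p - 1
    F = ((rss_sub - rss_full) / df1) / (rss_full / df2)
    return F <= stats.f.ppf(1 - alpha, df1, df2)

rng = np.random.default_rng(5)
n, p = 150, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(0, 1, n)
print(not_inferior_to_full(y, X, (0, 2)))   # a candidate lower boundary model: expected True
print(not_inferior_to_full(y, X, (1,)))     # drops the truly relevant variables: expected False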
Risk bounds are derived for regression estimation based on model selection over an unrestricted number of models. While a large list of models provides more flexibility, significant selection bias may occur with model selection criteria like AIC. We incorporate a model complexity penalty term in AIC to handle selection bias. The resulting estimators are shown to achieve a trade-off among approximation error, estimation error, and model complexity without prior knowledge about the true regression function. We demonstrate the adaptability of these estimators over full and sparse approximation function classes with different smoothness. For high-dimensional function estimation by tensor product splines, we show that with the number of knots and spline order adaptively selected, the least squares estimator converges at anticipated rates simultaneously for Sobolev classes with different interaction orders and smoothness parameters.
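The exact complexity penalty is not reproduced here, but the following hedged sketch conveys the idea: an AIC-type criterion augmented with a descriptive-complexity term, taken here (as an illustrative choice) to be the log of the number of competing models of the same size, so that sizes with many candidate models are penalized more.

```python
import itertools
import numpy as np
from math import comb, log

def rss_of(y, X, cols):
    """Residual sum of squares of the least-squares fit using `cols` (intercept included)."""
    Z = np.column_stack([np.ones(len(y)), X[:, list(cols)]]) if cols else np.ones((len(y), 1))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def complexity_penalized_aic(y, X, cols, lam=1.0):
    """AIC plus a descriptive-complexity term: here the log of the number of models
    of the same size, which grows when many models of that size compete."""
    n, p = X.shape
    k = len(cols)
    aic = n * np.log(rss_of(y, X, cols) / n) + 2 * (k + 1)
    return aic + lam * log(comb(p, k))

rng = np.random.default_rng(6)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] + rng.normal(0, 1, n)
scores = {S: complexity_penalized_aic(y, X, S)
          for k in range(0, 3) for S in itertools.combinations(range(p), k)}
print(min(scores, key=scores.get))
```

The extra term counteracts the selection bias of plain AIC when very many models of a given size are compared, which is the motivation stated in the abstract.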
Nonparametric regression techniques are often sensitive to the presence of correlation in the errors. The practical consequences of this sensitivity are explained, including the breakdown of several popular data-driven smoothing parameter selection methods. We review the existing literature in kernel regression, smoothing splines and wavelet regression under correlation, both for short-range and long-range dependence. Extensions to random design, higher dimensional models and adaptive estimation are discussed.
One important goal of regression analysis is prediction. In recent years, the idea of combining different statistical methods has attracted increasing attention. In this work, we propose a method, l1-ARM (adaptive regression by mixing), to robustly combine model selection methods so that the combination performs well adaptively. In numerical work, we consider the LASSO, SCAD, and adaptive LASSO in representative scenarios, as well as in cases of randomly generated models. The l1-ARM automatically performs like the best among them and consequently provides better estimation/prediction in an overall sense, especially when outliers are likely to occur.
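A hedged sketch of the workflow (not the paper's exact algorithm or tuning): fit LASSO and an adaptive LASSO on a training split (SCAD is omitted here since scikit-learn offers no SCAD solver), then weight their held-out predictions by exponentiating the negative absolute (L1) validation error, which reflects the robustness motivation behind the l1 flavor; penalty levels and the weighting scale are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(7)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - X[:, 3] + rng.standard_t(df=3, size=n)      # heavy-tailed noise (outlier-prone)
Xtr, ytr, Xva, yva = X[:200], y[:200], X[200:], y[200:]

# candidate 1: plain LASSO
lasso = Lasso(alpha=0.1).fit(Xtr, ytr)

# candidate 2: adaptive LASSO via feature rescaling by initial OLS coefficients
init = LinearRegression().fit(Xtr, ytr).coef_
w = 1.0 / (np.abs(init) + 1e-6)
ada = Lasso(alpha=0.1).fit(Xtr / w, ytr)

preds = [lasso.predict(Xva), ada.predict(Xva / w)]

# l1-ARM-style weighting: exponential in the absolute (L1) validation error,
# which is less sensitive to outliers than squared error
errs = np.array([np.sum(np.abs(yva - pr)) for pr in preds])
wts = np.exp(-(errs - errs.min()) / len(yva))
wts /= wts.sum()
print(wts)
```

The combined prediction would be the weight-averaged candidate predictions; with heavy-tailed noise, the L1-based weights are less distorted by a few extreme residuals than squared-error weights would be.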
Journal of the American Statistical Association, 2001
Adaptation over different procedures is of practical importance. Different procedures perform well under different conditions. In many practical situations, it is rather hard to assess which conditions are (approximately) satisfied so as to identify the best procedure for the data at hand. Thus automatic adaptation over various scenarios is desirable. A practically feasible method, named adaptive regression by mixing (ARM), is proposed to convexly combine general candidate regression procedures. Under mild conditions, the resulting estimator is theoretically shown to perform optimally in rates of convergence without knowing which of the original procedures work the best. Simulations are conducted in several settings, including comparing a parametric model with nonparametric alternatives, comparing a neural network with projection pursuit in multidimensional regression, and combining bandwidths in kernel regression. The results clearly support the theoretical property of ARM. The ARM algorithm assigns weights to the candidate models/procedures via proper assessment of the performance of the estimators. The data are split into two parts, one for estimation and the other for measuring behavior in prediction. Although there are many plausible ways to assign the weights, ARM has a connection with information theory, which ensures the desired adaptation capability. Indeed, under mild conditions, we show that the squared L2 risk of the estimator based on ARM is basically bounded above by the risk of each candidate procedure plus a small penalty term of order 1/n. Minimizing over the procedures gives the automatically optimal rate of convergence for ARM. Model selection often induces unnecessarily large variability in estimation. Alternatively, a proper weighting of the candidate models can be more stable, resulting in a smaller risk. Simulations suggest that ARM works better than model selection using Akaike or Bayesian information criteria when the error variance is not very small.
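A minimal single-split sketch of the ARM weighting just described, with two illustrative candidate procedures (linear and cubic polynomial fits): each candidate is fit on the first half of the data and weighted by the Gaussian likelihood its predictions assign to the second half. The actual ARM algorithm averages over many random splits and treats variance estimation more carefully.

```python
import numpy as np

def arm_weights(y2, preds2, sigma2):
    """ARM-style weights: each candidate is weighted by the Gaussian likelihood
    of the held-out half of the data under its predictions and variance estimate."""
    log_w = np.array([-np.sum((y2 - p) ** 2) / (2 * s) - 0.5 * len(y2) * np.log(s)
                      for p, s in zip(preds2, sigma2)])
    w = np.exp(log_w - log_w.max())                        # stabilize before normalizing
    return w / w.sum()

rng = np.random.default_rng(8)
n = 200
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
x1, y1, x2, y2 = x[:100], y[:100], x[100:], y[100:]        # split: estimation vs. assessment

# two candidate procedures fit on the first half: a linear fit and a cubic fit
cands = [np.polyfit(x1, y1, 1), np.polyfit(x1, y1, 3)]
preds2 = [np.polyval(c, x2) for c in cands]
sigma2 = [np.mean((y1 - np.polyval(c, x1)) ** 2) for c in cands]  # per-candidate variance estimates
print(arm_weights(y2, preds2, sigma2))
```

Here the cubic fit tracks the sinusoid better and receives nearly all of the weight, while a convex weighting rather than a hard selection keeps the combined estimator stable when the candidates are harder to distinguish.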
Given any countable collection of regression procedures (e.g., kernel, spline, wavelet, local polynomial, neural nets, etc.), we show that a single adaptive procedure can be constructed to share the advantages of them to a great extent in terms of global squared L2 risk. The combined procedure basically pays a price only of order 1/n for adaptation over the collection. An interesting consequence is that for a countable collection of classes of regression functions (possibly of completely different characteristics), a minimax-rate adaptive estimator can be constructed such that it automatically converges at the right rate for each of the classes being considered. A demonstration is given for high dimensional regression, in which case, to overcome the well-known curse of dimensionality in accuracy, it is advantageous to seek different ways of characterizing a high-dimensional function (e.g., using neural nets or additive modelings) to reduce the influence of input dimension in the traditional theory of approximation (e.g., in terms of series expansion). However, in general, it is difficult to assess which characterization works well for the unknown regression function. Thus adaptation over different modelings is desired. For example, we show that by combining various regression procedures, a single estimator can be constructed to be minimax-rate adaptive over Besov classes of unknown smoothness and interaction order, to converge at rate o(n^{-1/2}) when the regression function has a neural net representation, and at the same time to be consistent over all bounded regression functions.
We study minimax-rate adaptive estimation for density classes indexed by continuous hyper-parameters. The classes are assumed to be partially ordered in terms of inclusion relationship. Under a mild condition on the minimax risks, we show that a minimax-rate adaptive estimator can be constructed for the classes. Index Terms: adaptive density estimation, combining procedures, minimax-rate adaptation.