Description
Describe the workflow you want to enable
I'd like to propose native categorical support for linear models, similar to #18394, i.e. ordinal encoded columns can be specified as "categorical". I see 3 main benefits:
- Possibly better user experience, because one-hot encoding becomes obsolete (though ordinal encoding is still required as long as `pandas.Categorical` is unsupported).
- Possible speed-up of fitting, as the orthogonal design of categoricals could be exploited.
- Possibly better memory footprint, in particular in combination with dense numerical features.
Describe your proposed solution
Add a new parameter categorical_features indicating which columns to treat as categoricals (ordinal encoded) as in #18394.
Then, add this functionality via a coordinate descent and/or Newton-Cholesky solver:
Internally, proceed as if the categoricals were one-hot-encoded (as is done for the multiclass targets in HGBT, cf. code here) and exploit the following structure:
For the feature (sub-)matrix X of a single one-hot-encoded feature and a diagonal weight matrix W, the Gram matrix `X.T @ W @ X` is diagonal.
Thus, coordinate descent could update all levels/categories of this feature X in parallel, i.e. a parallelized block update.
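The diagonal structure can be checked numerically. The sketch below (plain NumPy, with hypothetical variable names and sizes) verifies that the Gram matrix of a one-hot block is diagonal, and that a weighted least-squares update for the whole block therefore reduces to k independent scalar divisions, which is what makes the parallel block update possible:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 4
# Ordinal-encoded categorical feature; prepend arange(k) so every level occurs.
codes = np.concatenate([np.arange(k), rng.integers(0, k, size=n - k)])
X = np.eye(k)[codes]          # implicit one-hot design, shape (n, k)
w = 0.1 + rng.random(n)       # strictly positive diagonal of W

# Each sample activates exactly one column of X, so the cross terms
# X[:, j] * X[:, l] vanish for j != l and X.T @ W @ X is diagonal,
# with per-category weight sums on the diagonal.
G = X.T @ np.diag(w) @ X
den = np.bincount(codes, weights=w, minlength=k)       # diag(X.T @ W @ X)
assert np.allclose(G, np.diag(den))

# Consequently, the weighted least-squares update for the whole block
# against a residual r decouples: no k-by-k matrix solve is needed.
r = rng.normal(size=n)
num = np.bincount(codes, weights=w * r, minlength=k)   # X.T @ W @ r
beta_block = num / den        # k independent scalar updates
assert np.allclose(beta_block, np.linalg.solve(G, num))
```

Note that `np.bincount` computes both quantities directly from the ordinal codes, so the one-hot matrix X never needs to be materialized; this is the intended memory saving.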
Estimators
If only the existing coordinate descent solver is modified, then only squared-error-based estimators, i.e. ElasticNet and Lasso, would benefit.
If a new or extended coordinate descent solver is acceptable, then several GLMs would also gain native categorical support, e.g. LogisticRegression, PoissonRegressor, TweedieRegressor, etc. See also #16637.