Add native categorical support for linear models via coordinate descent solver #18893

@lorentzenchr

Description
Describe the workflow you want to enable

I'd like to propose native categorical support for linear models, similar to #18394, i.e. ordinal-encoded columns can be specified as "categorical". I see 3 main benefits:

  • Possibly better user experience, because one-hot encoding becomes obsolete (though ordinal encoding is still required as long as pandas.Categorical is unsupported)
  • Possible speed-up of fitting, as the orthogonal design of categoricals could be exploited.
  • Possibly better memory footprint, in particular in combination with dense numerical features.

Describe your proposed solution

Add a new parameter categorical_features indicating which columns to treat as categorical (ordinal-encoded), as in #18394.
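A hypothetical usage sketch of the proposal (note: `categorical_features` does not exist on any linear model yet; the name and semantics simply mirror the HGBT parameter from #18394):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
# from sklearn.linear_model import Lasso  # would gain categorical_features

# Mixed data: one categorical column, one numerical column.
X = np.array([["a", 1.0], ["b", 2.0], ["a", 3.0]], dtype=object)

# Ordinal-encode the categorical column; "a" -> 0.0, "b" -> 1.0.
enc = OrdinalEncoder()
X[:, [0]] = enc.fit_transform(X[:, [0]])
X = X.astype(float)

# Proposed (hypothetical) API, analogous to HistGradientBoosting:
# model = Lasso(categorical_features=[0]).fit(X, y)
```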

Then, add this functionality via a coordinate descent and/or Newton-Cholesky solver:
Internally, proceed as if categoricals were one-hot encoded (as is done for the multiclass targets in HGBT, cf. the code here) and exploit the following structure:

For the feature (sub-)matrix X of a single one-hot-encoded feature and a diagonal weight matrix W, it holds that
X.T @ W @ X is diagonal.

Thus, coordinate descent could loop in parallel over all levels/categories of this feature X, i.e. perform a parallelized block update.
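A small numpy check of this structure (illustrative only): each row of the one-hot block has exactly one nonzero entry, so the weighted Gram matrix has vanishing off-diagonal terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3                               # samples, categories
codes = rng.integers(0, k, size=n)        # ordinal-encoded feature
X = np.eye(k)[codes]                      # implicit one-hot expansion
W = np.diag(rng.uniform(0.5, 2.0, n))     # diagonal weight matrix

G = X.T @ W @ X
# Off-diagonal entries are sums of w_i * x_ij * x_il with j != l, and each
# row of X has a single 1, so every such product is zero:
assert np.allclose(G, np.diag(np.diag(G)))
```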

Estimators

If only the existing coordinate descent solver is modified, then only squared-error-based estimators, i.e. ElasticNet and Lasso, would benefit.
If a new or extended coordinate descent solver is acceptable, then several GLMs would also gain native categorical support, i.e. LogisticRegression, PoissonRegressor, TweedieRegressor, etc. See also #16637.
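To illustrate how the diagonal Gram matrix decouples the updates, here is a sketch of one simultaneous block update for a Lasso-type objective, (1/(2n)) * ||y - X @ beta||^2 + alpha * ||beta||_1, restricted to the coefficients of a single one-hot block. This is an illustrative toy, not scikit-learn's actual solver; because the block's Gram matrix is diagonal, the k per-category updates are independent and could run in parallel.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator used in Lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(1)
n, k = 100, 4
codes = rng.integers(0, k, size=n)
X = np.eye(k)[codes]                      # one-hot block of one categorical
y = rng.normal(size=n)
alpha = 0.1

beta = np.zeros(k)
resid = y - X @ beta                      # current residual
d = (X * X).sum(axis=0) / n               # diagonal of X.T @ X / n
rho = X.T @ resid / n + d * beta          # per-category partial correlations

# Simultaneous update of the whole block; valid because the updates
# do not interact (off-diagonal Gram entries are zero):
beta_new = soft_threshold(rho, alpha) / d
```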
