CountFeaturizer for categorical data #5853

@papadopc

I have used a data transformation based on count data with some success on Kaggle:
https://www.kaggle.com/c/sf-crime/forums/t/15836/predicting-crime-categories-with-address-featurization-and-neural-nets

This is similar to what Azure does: https://msdn.microsoft.com/en-us/library/azure/dn913056.aspx
I've also found that adding an extra column with the overall frequency of each individual label, taken over all predictive categories, adds useful information.

The implementation would first use contingency_matrix to calculate the frequencies, then add Laplacian noise to avoid overfitting, and finally return the new features.
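
A minimal sketch of those three steps, under a few assumptions: `count_featurize`, `noise_scale`, and `seed` are hypothetical names (not an existing sklearn API), the "extra column" is read as the total count of each label across all target classes, and the noise is drawn per table cell. It only illustrates the idea and is not a proposed final design.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix


def count_featurize(categories, targets, noise_scale=0.0, seed=0):
    """Hypothetical helper: replace one categorical column by count features."""
    categories = np.asarray(categories)
    targets = np.asarray(targets)

    # Row index of each sample's category in the (sorted) contingency table.
    _, cat_idx = np.unique(categories, return_inverse=True)

    # counts[i, j] = number of samples with category i and target class j.
    counts = contingency_matrix(categories, targets).astype(float)

    # Assumed reading of the "extra column": total count per category.
    totals = counts.sum(axis=1, keepdims=True)
    table = np.hstack([counts, totals])

    # Optional Laplacian noise on the table (scale is an assumption).
    if noise_scale > 0:
        rng = np.random.default_rng(seed)
        table = table + rng.laplace(loc=0.0, scale=noise_scale, size=table.shape)

    # Look up the table row for every sample.
    return table[cat_idx]


categories = ["a", "b", "a", "c", "a", "b"]
targets = [0, 1, 0, 1, 1, 1]
features = count_featurize(categories, targets)
noisy_features = count_featurize(categories, targets, noise_scale=1.0)
```

Each sample ends up with one column per target class plus the total, so a high-cardinality column such as an address becomes a small dense block. In practice the counts would be fit on training data only and then applied to held-out rows.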

Is there any interest in including something like this in sklearn?

Best
