Description
I have used a data transformation based on count data with some success on Kaggle:
https://www.kaggle.com/c/sf-crime/forums/t/15836/predicting-crime-categories-with-address-featurization-and-neural-nets
This is similar to what Azure does: https://msdn.microsoft.com/en-us/library/azure/dn913056.aspx
I've also found that adding an extra column giving the frequency of each individual label over all predictive categories adds further information.
The implementation would use a contingency_matrix to first compute the frequencies, then add Laplacian noise to avoid overfitting, and finally return the new features.
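As a rough illustration of the three steps above, here is a minimal NumPy sketch. The function name `count_featurize` and its parameters are hypothetical, not an existing scikit-learn API; the counting step emulates what `sklearn.metrics.cluster.contingency_matrix` computes:

```python
import numpy as np

def count_featurize(categories, targets, noise_scale=1.0, random_state=None):
    """Hypothetical sketch: per-class frequency features for a categorical
    column, with Laplacian noise, plus an overall-frequency column."""
    rng = np.random.default_rng(random_state)
    cats = np.asarray(categories)
    ys = np.asarray(targets)

    cat_vals, cat_idx = np.unique(cats, return_inverse=True)
    y_vals, y_idx = np.unique(ys, return_inverse=True)

    # Step 1: contingency matrix -- rows are categorical values,
    # columns are target classes (what contingency_matrix would return).
    counts = np.zeros((len(cat_vals), len(y_vals)))
    np.add.at(counts, (cat_idx, y_idx), 1.0)

    # Step 2: Laplacian noise to reduce overfitting on rare categories.
    counts += rng.laplace(scale=noise_scale, size=counts.shape)

    # Step 3: normalize rows to per-class frequencies and append the
    # overall frequency of each categorical value as an extra column.
    row_sums = np.clip(counts.sum(axis=1, keepdims=True), 1e-12, None)
    freqs = counts / row_sums
    overall = np.bincount(cat_idx).astype(float) / len(cats)
    features = np.hstack([freqs, overall[:, None]])

    # Map per-category feature rows back to the original samples.
    return features[cat_idx]
```

With `noise_scale=0.0` the per-class columns are exact conditional frequencies; in practice the noise scale would be a tunable parameter of the transformer.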
Is there any interest in including something like this in sklearn?
Best