Description
I have used a data transformation based on count data with some success on Kaggle:
https://www.kaggle.com/c/sf-crime/forums/t/15836/predicting-crime-categories-with-address-featurization-and-neural-nets
This is similar to what Azure does: https://msdn.microsoft.com/en-us/library/azure/dn913056.aspx
I've also found that adding an extra column giving the frequency of each individual label over all predictive categories adds further information.
The implementation would use a contingency_matrix to first compute the frequencies, then add Laplacian noise to avoid overfitting, and finally return the new features.
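As a rough illustration of the three steps above, here is a minimal NumPy sketch. The function name `count_featurize` and its parameters are hypothetical, not an existing scikit-learn API; the counting step emulates what `sklearn.metrics.cluster.contingency_matrix` computes:

```python
import numpy as np

def count_featurize(categories, targets, noise_scale=1.0, random_state=None):
    """Hypothetical sketch: per-class frequency features for a categorical
    column, with Laplacian noise, plus an overall-frequency column."""
    rng = np.random.default_rng(random_state)
    cats = np.asarray(categories)
    ys = np.asarray(targets)

    cat_vals, cat_idx = np.unique(cats, return_inverse=True)
    y_vals, y_idx = np.unique(ys, return_inverse=True)

    # Step 1: contingency matrix -- rows are categorical values,
    # columns are target classes (what contingency_matrix would return).
    counts = np.zeros((len(cat_vals), len(y_vals)))
    np.add.at(counts, (cat_idx, y_idx), 1.0)

    # Step 2: Laplacian noise to reduce overfitting on rare categories.
    counts += rng.laplace(scale=noise_scale, size=counts.shape)

    # Step 3: normalize rows to per-class frequencies and append the
    # overall frequency of each categorical value as an extra column.
    row_sums = np.clip(counts.sum(axis=1, keepdims=True), 1e-12, None)
    freqs = counts / row_sums
    overall = np.bincount(cat_idx).astype(float) / len(cats)
    features = np.hstack([freqs, overall[:, None]])

    # Map per-category feature rows back to the original samples.
    return features[cat_idx]
```

With `noise_scale=0.0` the per-class columns are exact conditional frequencies; in practice the noise scale would be a tunable parameter of the transformer.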
Is there any interest in including something like this in sklearn?
Best