Warn on high-cardinality NA features producing uniform ipw weights#195
Warn on high-cardinality NA features producing uniform ipw weights#195neuralsorcerer wants to merge 7 commits intofacebookresearch:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds detection and warning functionality for high-cardinality categorical features with missing values that can cause uniform IPW weights, addressing issue #65. The implementation detects these problematic columns before model fitting and includes them in the warning message when all weights become identical.
Key changes:
- Added high-cardinality detection logic that identifies categorical columns where ≥80% of non-NA values are unique
- Enhanced the uniform weights warning to list potentially problematic high-cardinality columns
- Added comprehensive test coverage for the new detection functionality
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| balance/weighting_methods/ipw.py | Implements high-cardinality detection logic and enhances warning message to include problematic columns when uniform weights are detected |
| tests/test_ipw.py | Adds three test cases covering high-cardinality detection for object dtype, categorical dtype, and low-cardinality scenarios |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Please fix lint :) |
talgalili
left a comment
There was a problem hiding this comment.
Good progress!
Now that I'm looking at this closer, I think the way the 'issue' was written is not clear enough - so I'd like to clarify further.
Several comments:
- All of this new logic should be in an external _function, so as to not make ipw more complex, and also for easier testing/validating.
- magic numbers (0.8) should be placed as arguments in the function. With maybe a way to set it as part of a global feature in the package. That can be a TODO left as a comment. But just a thought. BTW, the number can probably be much lower, e.g. 50%. But sure, let's start with 80%.
- The check for high cardinality should be done in general, regardless of what the results of the weights are (1 or not). For example, if the user puts the 'id' column as a feature (by accident), or if they add a column with user names or something like this - we want a warning so that the user can see that this column has an issue - and may not be relevant for the model (and either removed or bucketed)
- The point is not NA features, it's features with high cardinality (that might have NA to them).
- It's worth adding the cardinality number itself, per column, to the warning message. And sort the columns (DESC) from high to low cardinality, in the warning message.
- All of these should be tested in the test plan.
THANKS!
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@talgalili has imported this pull request. If you are a Meta employee, you can view this in D88260566. |
|
LGTM. |
|
@neuralsorcerer |
|
Sure, thank you @talgalili |
|
FYI: |
|
@talgalili merged this pull request in 4900fa2. |
Added detection of high-cardinality categorical columns with missing values before IPW fitting and surfaced them when all weights become identical to flag potential causes. Also added tests.