Skip to content

Warn on high-cardinality NA features producing uniform ipw weights#195

Closed
neuralsorcerer wants to merge 7 commits intofacebookresearch:mainfrom
neuralsorcerer:card
Closed

Warn on high-cardinality NA features producing uniform ipw weights#195
neuralsorcerer wants to merge 7 commits intofacebookresearch:mainfrom
neuralsorcerer:card

Conversation

@neuralsorcerer
Copy link
Copy Markdown
Collaborator

Added detection of high-cardinality categorical columns with missing values before IPW fitting and surfaced them when all weights become identical to flag potential causes. Also added tests.

Copilot AI review requested due to automatic review settings December 3, 2025 06:53
@meta-cla meta-cla bot added the cla signed label Dec 3, 2025
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds detection and warning functionality for high-cardinality categorical features with missing values that can cause uniform IPW weights, addressing issue #65. The implementation detects these problematic columns before model fitting and includes them in the warning message when all weights become identical.

Key changes:

  • Added high-cardinality detection logic that identifies categorical columns where ≥80% of non-NA values are unique
  • Enhanced the uniform weights warning to list potentially problematic high-cardinality columns
  • Added comprehensive test coverage for the new detection functionality

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
balance/weighting_methods/ipw.py Implements high-cardinality detection logic and enhances warning message to include problematic columns when uniform weights are detected
tests/test_ipw.py Adds three test cases covering high-cardinality detection for object dtype, categorical dtype, and low-cardinality scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@talgalili
Copy link
Copy Markdown
Contributor

Please fix lint :)

Copy link
Copy Markdown
Contributor

@talgalili talgalili left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress!

Now that I'm looking at this closer, I think the way the 'issue' was written is not clear enough - so I'd like to clarify further.

Several comments:

  1. All of this new logic should be in an external _function, so as to not make ipw more complex, and also for easier testing/validating.
  2. magic numbers (0.8) should be placed as arguments in the function. With maybe a way to set it as part of a global feature in the package. That can be a TODO left as a comment. But just a thought. BTW, the number can probably be much lower, e.g. 50%. But sure, let's start with 80%.
  3. The check for high cardinality should be done in general, regardless of what the results of the weights are (1 or not). For example, if the user puts the 'id' column as a feature (by accident), or if they add a column with user names or something like this - we want a warning so that the user can see that this column has an issue - and may not be relevant for the model (and either removed or bucketed)
  4. The point is not NA features, it's features with high cardinality (that might have NA to them).
  5. It's worth adding the cardinality number itself, per column, to the warning message. And sort the columns (DESC) from high to low cardinality, in the warning message.
  6. All of these should be tested in the test plan.

THANKS!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@talgalili talgalili left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Dec 3, 2025

@talgalili has imported this pull request. If you are a Meta employee, you can view this in D88260566.

@talgalili
Copy link
Copy Markdown
Contributor

LGTM.
I'll send it for internal review.

@talgalili
Copy link
Copy Markdown
Contributor

@neuralsorcerer
FYI: While going over this again carefully, I noticed some things I'd like to change. I'll do it tomorrow, so this might take a day or two to land.

@neuralsorcerer
Copy link
Copy Markdown
Collaborator Author

Sure, thank you @talgalili

@talgalili
Copy link
Copy Markdown
Contributor

FYI:
I finished updating this PR but due to an infra issue, it might take a day (or so?!) to get pushed back to github.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Dec 4, 2025

@talgalili merged this pull request in 4900fa2.

@neuralsorcerer neuralsorcerer deleted the card branch December 4, 2025 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] check (raise warning) if features are provided that lead to all equal weights

4 participants