Skip to content

Conversation

@ArturoAmorQ
Copy link
Member

@ArturoAmorQ ArturoAmorQ commented Mar 23, 2022

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The 20newsgroups classification example is our only document classification
example and it can/should be more didactic. Therefore, this PR reworks it by:

  • adding a loading function to easily remove text metadata when fetching dataset
  • adding a demo on a particular classifier before benchmarking the rest of them:
    • compare feature effects
    • show pollution in the dataset introduced by the metadata / motivate preprocessing dataset
    • use ConfusionMatrixDisplay to visualize results
  • adding text to provide interpretation of intermediate steps
  • removing similarly performing classifiers to avoid distractions and overlap in plots
  • removing the feature selection tools (they appear to have little to none effect)
  • making the benchmark function accept customized names

Any other comments?

The demo uses RidgeClassifier as it has a relatively low training time, but any classifier with a coef_ attribute may work.

This example uses TfidfVectorizer only. A comparison of its performance with respect to HashingVectorizer will be craft in another example.

@jeremiedbb
Copy link
Member

I triggered the CI again because the artifacts were not generated

@jjerphan jjerphan self-requested a review March 29, 2022 13:56
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, here is a first pass of feedback.

I also think you should append your name in the list of authors of this file in the header comment.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice refactoring. Here is a batch of feedback!

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments.

Co-authored-by: Julien Jerphanion <[email protected]>
@ArturoAmorQ
Copy link
Member Author

Thanks for your time and comments @ogrisel and @jjerphan !

@jjerphan
Copy link
Member

Thank you for your contribution, @ArturoAmorQ. 🙂

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a final review and here are a few more nitpicks but otherwise LGTM.

This example is much more interesting to read that it used to be :) Thank you very much for this contribution @ArturoAmorQ !

@ogrisel ogrisel merged commit 7102832 into scikit-learn:main May 23, 2022
@ArturoAmorQ ArturoAmorQ deleted the doc_classification branch June 9, 2022 13:30
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
…it-learn#22928)



Co-authored-by: Jérémie du Boisberranger <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Julien Jerphanion <[email protected]>
glemaitre pushed a commit that referenced this pull request Aug 5, 2022

Co-authored-by: Jérémie du Boisberranger <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Julien Jerphanion <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants