@ArturoAmorQ
Member

Reference Issues/PRs

Related to #22928

What does this implement/fix? Explain your changes.

In #22928 we remove the use of HashingVectorizer from the plot_document_classification_20newsgroups.py example for the sake of simplicity.
The comparison of the performance of hashers and vectorizers can instead be moved to the existing plot_hashing_vs_dict_vectorizer.py example, which this PR reworks.
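
For readers unfamiliar with the two approaches being compared, here is a minimal, standard-library-only sketch of the difference between a dictionary-based vectorizer and a hashing-based one. It is not code from this PR and not how scikit-learn implements either class: the function names `dict_vectorize` and `hash_vectorize`, the whitespace tokenizer, the dense list rows, and the use of `zlib.crc32` as the hash are all assumptions made for illustration.

```python
import zlib
from collections import Counter


def tokenize(doc):
    """Naive whitespace tokenizer (illustrative assumption)."""
    return doc.lower().split()


def dict_vectorize(docs):
    """Learn a token -> column vocabulary, then build dense count rows."""
    vocabulary = {}
    for doc in docs:
        for token in tokenize(doc):
            # Assign the next free column index to each unseen token.
            vocabulary.setdefault(token, len(vocabulary))
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokenize(doc)).items():
            row[vocabulary[token]] = count
        rows.append(row)
    return rows, vocabulary


def hash_vectorize(docs, n_features=16):
    """Map tokens to columns with a hash; no vocabulary is stored."""
    rows = []
    for doc in docs:
        row = [0] * n_features
        for token in tokenize(doc):
            # crc32 is a deterministic stand-in hash (assumption);
            # distinct tokens may collide in the same column.
            row[zlib.crc32(token.encode("utf-8")) % n_features] += 1
        rows.append(row)
    return rows


docs = ["the cat sat on the mat", "the dog chased the cat"]
dict_rows, vocabulary = dict_vectorize(docs)
hash_rows = hash_vectorize(docs)
print(len(vocabulary), len(hash_rows[0]))
```

The dictionary approach needs a pass over the data to build the vocabulary and keeps it in memory, while the hashing approach has a fixed output width and is stateless, at the cost of possible collisions and no way to map columns back to tokens.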

Any other comments?

Side effect: Implements notebook style as intended in #22406

@lesteve lesteve added the "Quick Review" (for PRs that are quick to review) label May 11, 2022
@lesteve lesteve removed the "Quick Review" (for PRs that are quick to review) label May 13, 2022
Member

@ogrisel ogrisel left a comment

Thanks for the PR, here is a batch of feedback.

@ArturoAmorQ ArturoAmorQ changed the title from "[WIP] DOC Rework plot_hashing_vs_dict_vectorizer.py example" to "DOC Rework plot_hashing_vs_dict_vectorizer.py example" May 20, 2022
Member

@ogrisel ogrisel left a comment

Thanks very much @ArturoAmorQ, this notebook is much nicer than the original benchmark script.

Here is a final batch of suggestions for improvement:

Member

@jjerphan jjerphan left a comment

Thank you, @ArturoAmorQ.

I think one should use other terms to make this example more accurate.

This is for instance the case of:

  • "frequency" which can be replaced by "occurrence (counts)" (to respect the definition)
  • "speed" which can be replaced by "data processing rate" (to respect the unit (bytes/sec))
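
To make the two terms concrete, here is a minimal sketch (not code from this PR): it counts token occurrences, which are raw counts rather than normalized frequencies, and reports a data processing rate in MB/s rather than a unitless "speed". The helper names `token_occurrences` and `processing_rate_mb_per_s` and the whitespace tokenizer are assumptions made for illustration.

```python
import time
from collections import Counter


def token_occurrences(doc):
    """Return raw occurrence counts (not normalized frequencies) per token."""
    return Counter(doc.lower().split())


def processing_rate_mb_per_s(docs, func):
    """Apply func to every document and return the throughput in MB/s."""
    n_bytes = sum(len(doc.encode("utf-8")) for doc in docs)
    start = time.perf_counter()
    for doc in docs:
        func(doc)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very coarse clocks.
    return n_bytes / 1e6 / max(elapsed, 1e-9)


docs = ["the cat sat on the mat", "the dog chased the cat"] * 1000
counts = token_occurrences(docs[0])
rate = processing_rate_mb_per_s(docs, token_occurrences)
print(counts["the"], f"{rate:.1f} MB/s")
```

Dividing each count by the document length would turn the occurrences into frequencies, which is why the two words are not interchangeable.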

Here are some comments and formatting fixes.


Edit: not related to this PR, but #23004 might come with new changes for this example then.

Co-authored-by: Julien Jerphanion <[email protected]>
@ArturoAmorQ
Member Author

Thanks @ogrisel and @jjerphan. This notebook is much clearer thanks to your comments.

Member

@jjerphan jjerphan left a comment

Thank you, @ArturoAmorQ.

Edit: I'll let @ogrisel merge if everything LGTM.

Member

@ogrisel ogrisel left a comment

LGTM again, just a final batch of nitpicks + a formatting fix.

@ogrisel ogrisel merged commit 6ff214c into scikit-learn:main May 30, 2022
@ogrisel
Member

ogrisel commented May 30, 2022

Merged, thank you very much for the nice contribution @ArturoAmorQ!

@ArturoAmorQ ArturoAmorQ deleted the compare_vectorizers branch June 9, 2022 13:29
ogrisel added a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
glemaitre pushed a commit that referenced this pull request Aug 5, 2022
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Julien Jerphanion <[email protected]>