
Conversation

@amay1212
Contributor

@amay1212 amay1212 commented May 3, 2023

Reference Issues/PRs

Addresses comment number 2 from @ArturoAmorQ in issue #26220

What does this implement/fix? Explain your changes.

This updates the HGBT documentation in the Random Forests section of the user guide.

Any other comments?

@amay1212 amay1212 force-pushed the amay1212-user_guide_hgbt branch from ae191e1 to 231fffe Compare May 4, 2023 03:51
@amay1212 amay1212 changed the title HGBT Doc added to "user_guide" reference in RF HGBT Doc added to "user_guide" reference in RF #26220 May 4, 2023
@amay1212 amay1212 changed the title HGBT Doc added to "user_guide" reference in RF #26220 HGBT Doc added to "user_guide" reference in RF issue:#26220 May 4, 2023
@amay1212 amay1212 changed the title HGBT Doc added to "user_guide" reference in RF issue:#26220 HGBT Doc added to "user_guide" reference in RF May 4, 2023
@amay1212 amay1212 changed the title HGBT Doc added to "user_guide" reference in RF DOC Add HGDBT to "user_guide" reference in RF May 4, 2023
@amay1212 amay1212 changed the title DOC Add HGDBT to "user_guide" reference in RF DOC Add HGBDT to "user_guide" reference in RF May 4, 2023
@ArturoAmorQ
Member

Thanks for addressing this last point in the issue :) Here are a few comments:

  • I think the paragraph should be added at the end of the Random Forests subsection, as it does not concern the other subsections in Forests of randomized trees.
  • The text is almost an exact copy-paste of the note in the Gradient Tree Boosting section, which is already redundant with the info in the Histogram-based Gradient Boosting section. Repeating said text does not bring any new information regarding the RF estimators themselves.
  • The statement on being faster is not always true, as RFs can be parallelized using n_jobs. Overall, the performance of HGBTs versus parallelized RFs depends on the specific characteristics of the dataset and the modeling task. It's always a good idea to try both models and compare their performance on your specific problem to determine which model is the best fit (see the sketch below).
  • Not a strong opinion about this, but I'd rather not use a note, as notes are meant to be brief and here we want to give deeper insight.
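
For instance, a minimal benchmarking sketch (a hypothetical toy example on synthetic data; timings depend on your machine and scikit-learn version) could look like this:

```python
# Minimal sketch: compare a parallelized RF against HGBT. Which model is
# faster (and more accurate) depends on the dataset, so benchmark both on
# your own problem rather than assuming one always wins.
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=50_000, n_features=30, random_state=0)

for name, model in [
    ("HGBT", HistGradientBoostingClassifier(random_state=0)),
    ("RF (n_jobs=-1)", RandomForestClassifier(n_jobs=-1, random_state=0)),
]:
    tic = perf_counter()
    scores = cross_val_score(model, X, y, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.3f} "
          f"in {perf_counter() - tic:.1f}s")
```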

Please let me know if you need help or more details addressing my comments.

@amay1212
Contributor Author

amay1212 commented May 4, 2023

> Thanks for addressing this last point in the issue :) Here are a few comments:
>
>   • I think the paragraph should be added at the end of the Random Forests subsection, as it does not concern the other subsections in Forests of randomized trees.
>   • The text is almost an exact copy-paste of the note in the Gradient Tree Boosting section, which is already redundant with the info in the Histogram-based Gradient Boosting section. Repeating said text does not bring any new information regarding the RF estimators themselves.
>   • The statement on being faster is not always true, as RFs can be parallelized using n_jobs. Overall, the performance of HGBTs versus parallelized RFs depends on the specific characteristics of the dataset and the modeling task. It's always a good idea to try both models and compare their performance on your specific problem to determine which model is the best fit.
>   • Not a strong opinion about this, but I'd rather not use a note, as notes are meant to be brief and here we want to give deeper insight.
>
> Please let me know if you need help or more details addressing my comments.

@ArturoAmorQ
Thanks for the feedback! That's a great point about how performance can be improved on the RF side as well. I think the first two points are quite straightforward. For the last two, however, I have the following assumptions:

  • I believe you want me to include code examples for both HGBTs and RFs, so that it is clear how the models are implemented and how their performance can be compared.
  • For the last point, could you confirm whether you want me to add the description within the topic section itself, or as a separate entity named "When to Choose Histogram-based Gradient Boosting Trees over Random Forests"?

Please feel free to correct me in case I missed any point.

@ArturoAmorQ
Member

It is not necessary to add code. For that purpose I opened #26320. What I had in mind is comparing the computational cost of the algorithms themselves, for instance, building shallow (HGBT) vs deep (RF) trees. Mention that in RF the trees are independent and can be built separately, whereas in HGBT the corrections are built successively. Mention the efficiency of binning, etc.
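
As a rough illustration (a hypothetical toy example, not something that needs to go into the PR), the estimators' defaults already reflect this contrast between shallow binned trees and deep independent ones:

```python
# Minimal sketch: HGBT grows shallow, histogram-binned trees sequentially,
# while RF grows deep trees independently (hence parallelizable via n_jobs).
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

# HGBT: features are binned into at most max_bins=255 integer-valued bins,
# and each tree is capped at max_leaf_nodes=31 by default, so split finding
# only has to scan the bin histograms of shallow trees.
hgbt = HistGradientBoostingClassifier(
    max_bins=255, max_leaf_nodes=31, random_state=0
).fit(X, y)

# RF: trees are fully grown by default (max_depth=None) and fitted
# independently of one another.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
mean_depth = sum(t.get_depth() for t in rf.estimators_) / len(rf.estimators_)
print(f"mean depth of the {len(rf.estimators_)} RF trees: {mean_depth:.1f}")
```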

And for the last point I meant plain text, just not the note formatting.

Please let me know if it's clearer now.

@amay1212
Contributor Author

amay1212 commented May 5, 2023

> It is not necessary to add code. For that purpose I opened #26320. What I had in mind is comparing the computational cost of the algorithms themselves, for instance, building shallow (HGBT) vs deep (RF) trees. Mention that in RF the trees are independent and can be built separately, whereas in HGBT the corrections are built successively. Mention the efficiency of binning, etc.
>
> And for the last point I meant plain text, just not the note formatting.
>
> Please let me know if it's clearer now.

@ArturoAmorQ
Thanks for the quick feedback!
It's clear now. I have made the changes and addressed all the points from the review comments. Please review and let me know if there are any further concerns.

Member

@ArturoAmorQ ArturoAmorQ left a comment

Another comment: Please try to keep lines below 80 characters to comply with the PEP 8 style convention

implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single class.

Comparison of Histogram-based Gradient Boosting Trees over Random Forests in terms of computational cost:
Member

Instead of a subtitle, here you can introduce the goal of the following bullet-points like this:

Suggested change
Comparison of Histogram-based Gradient Boosting Trees over Random Forests in terms of computational cost:
A competitive alternative to random forests are
:ref:`histogram_based_gradient_boosting` (HGBT) models:

Contributor Author

> Instead of a subtitle, here you can introduce the goal of the following bullet-points like this:

Made the changes as requested.


- Building shallow trees: HGBT can be computationally more efficient than RF when building shallow trees because it only needs to consider a limited number of bins to construct the splits, whereas RF builds deep trees that require more iterations to reach optimal splits.

- Sequential boosting: In HGBT, the trees are built sequentially, with each tree correcting the mistakes of the previous trees. This can be more computationally efficient than RF, where all the trees are built independently and in parallel.
Member

The phrase "This can be more computationally efficient than RF" is a bit vague; I suggest giving a deeper explanation in this regard:

Suggested change
- Sequential boosting: In HGBT, the trees are built sequentially, with each tree correcting the mistakes of the previous trees. This can be more computationally efficient than RF, where all the trees are built independently and in parallel.
- Sequential boosting: In HGBT, the trees are built sequentially, with each tree
correcting the mistakes of the previous trees, which allows them to
iteratively improve the model's performance using fewer trees. In contrast,
RFs use a majority vote to predict the outcome, which can require a larger
number of trees to achieve the same level of accuracy.
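
To make the "fewer trees" point concrete, a minimal sketch (a hypothetical example on synthetic data) can use staged_predict, which yields the ensemble's predictions after each boosting iteration:

```python
# Minimal sketch: watch the sequential corrections improve test accuracy,
# then compare against an RF that aggregates 100 independent trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

hgbt = HistGradientBoostingClassifier(max_iter=100, random_state=0)
hgbt.fit(X_train, y_train)
# staged_predict yields predictions after 1, 2, ... boosting iterations,
# making each tree's correction of its predecessors directly visible.
for i, y_pred in enumerate(hgbt.staged_predict(X_test), start=1):
    if i in (1, 10, 50, 100):
        print(f"HGBT after {i:3d} trees: {accuracy_score(y_test, y_pred):.3f}")

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print(f"RF with 100 trees: {accuracy_score(y_test, rf.predict(X_test)):.3f}")
```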

Contributor Author

> The phrase "This can be more computationally efficient than RF" is a bit vague; I suggest giving a deeper explanation in this regard:

Made the changes as requested.

@amay1212
Contributor Author

amay1212 commented May 6, 2023

> Another comment: Please try to keep lines below 80 characters to comply with the PEP 8 style convention

Thanks for noticing it!
Just made the changes, let me know if there are any concerns.

Member

@adrinjalali adrinjalali left a comment

This is a good addition. I wonder if we want to refer to xgboost here or not, since that's what people know.

@amay1212 amay1212 force-pushed the amay1212-user_guide_hgbt branch from 948f744 to 70349e9 Compare May 8, 2023 14:01
Member

@adrinjalali adrinjalali left a comment

This is certainly an improvement.

@amay1212
Contributor Author

amay1212 commented May 8, 2023

> This is certainly an improvement.

Learning slowly

Member

@ArturoAmorQ ArturoAmorQ left a comment

Please note that you can accept suggestions directly from GitHub; this makes the review process easier.

@amay1212 amay1212 force-pushed the amay1212-user_guide_hgbt branch from 425f2bc to 249b45d Compare May 9, 2023 13:03
@amay1212
Contributor Author

amay1212 commented May 9, 2023

> Please note that you can accept suggestions directly from GitHub; this makes the review process easier.

Sure, I'll keep that in mind.

Member

@ArturoAmorQ ArturoAmorQ left a comment

Thanks for your effort @amay1212! It certainly LGTM :)

@adrinjalali adrinjalali enabled auto-merge (squash) May 9, 2023 13:12
@adrinjalali adrinjalali merged commit b013c6b into scikit-learn:main May 9, 2023