
Conversation

@amay1212
Contributor

@amay1212 amay1212 commented May 3, 2023

Reference Issues/PRs

Addresses comment number 2 from @ArturoAmorQ in issue #26220

What does this implement/fix? Explain your changes.

This updates the HGBT documentation in the Random Forests section of the user guide.

Any other comments?

@amay1212 amay1212 force-pushed the amay1212-user_guide_hgbt branch from ae191e1 to 231fffe Compare May 4, 2023 03:51
@amay1212 amay1212 changed the title HGBT Doc added to "user_guide" reference in RF HGBT Doc added to "user_guide" reference in RF #26220 May 4, 2023
@amay1212 amay1212 changed the title HGBT Doc added to "user_guide" reference in RF #26220 HGBT Doc added to "user_guide" reference in RF issue:#26220 May 4, 2023
@amay1212 amay1212 changed the title HGBT Doc added to "user_guide" reference in RF issue:#26220 HGBT Doc added to "user_guide" reference in RF May 4, 2023
@amay1212 amay1212 changed the title HGBT Doc added to "user_guide" reference in RF DOC Add HGDBT to "user_guide" reference in RF May 4, 2023
@amay1212 amay1212 changed the title DOC Add HGDBT to "user_guide" reference in RF DOC Add HGBDT to "user_guide" reference in RF May 4, 2023
@ArturoAmorQ
Member

Thanks for addressing this last point in the issue :) Here are a few comments:

  • I think the paragraph should be added at the end of the Random Forests subsection, as it does not concern the other subsections in Forests of randomized trees.
  • The text is almost an exact copy-paste of the note in the Gradient Tree Boosting section, which is already redundant with the info in the Histogram-based Gradient Boosting section. Repeating said text does not bring any new information regarding the RF estimators themselves.
  • The statement on being faster is not always true, as RFs can be parallelized using n_jobs. Overall, the performance of HGBTs versus parallelized RFs depends on the specific characteristics of the dataset and the modeling task. It's always a good idea to try both models and compare their performance on your specific problem to determine which model is the best fit (see the sketch below).
  • Not a strong opinion about this, but I'd rather not use a note, as notes are meant to be brief and here we want to give deeper insight.
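
For instance, a minimal benchmarking sketch (a hypothetical toy example on synthetic data; timings depend on your machine and scikit-learn version) could look like this:

```python
# Minimal sketch: compare a parallelized RF against HGBT. Which model is
# faster (and more accurate) depends on the dataset, so benchmark both on
# your own problem rather than assuming one always wins.
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=50_000, n_features=30, random_state=0)

for name, model in [
    ("HGBT", HistGradientBoostingClassifier(random_state=0)),
    ("RF (n_jobs=-1)", RandomForestClassifier(n_jobs=-1, random_state=0)),
]:
    tic = perf_counter()
    scores = cross_val_score(model, X, y, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.3f} "
          f"in {perf_counter() - tic:.1f}s")
```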

Please let me know if you need help or more details addressing my comments.

@amay1212
Contributor Author

amay1212 commented May 4, 2023

> Thanks for addressing this last point in the issue :) Here are a few comments:
>
>   • I think the paragraph should be added at the end of the Random Forests subsection, as it does not concern the other subsections in Forests of randomized trees.
>   • The text is almost an exact copy-paste of the note in the Gradient Tree Boosting section, which is already redundant with the info in the Histogram-based Gradient Boosting section. Repeating said text does not bring any new information regarding the RF estimators themselves.
>   • The statement on being faster is not always true, as RFs can be parallelized using n_jobs. Overall, the performance of HGBTs versus parallelized RFs depends on the specific characteristics of the dataset and the modeling task. It's always a good idea to try both models and compare their performance on your specific problem to determine which model is the best fit.
>   • Not a strong opinion about this, but I'd rather not use a note, as notes are meant to be brief and here we want to give deeper insight.
>
> Please let me know if you need help or more details addressing my comments.

@ArturoAmorQ
Thanks for the feedback! That's a great point about how performance can be improved on the RF side as well. I think the first two points are quite straightforward. For the last two, however, I have the following assumptions:

  • I believe you want me to include code examples for both HGBTs and RFs, so that it is clear how the models are implemented and how their performance can be compared.
  • For the last point, could you confirm whether you want me to add the description within the topic section itself, or as a separate entity named "When to Choose Histogram-based Gradient Boosting Trees over Random Forests"?

Please feel free to correct me in case I missed any point.

@ArturoAmorQ
Member

It is not necessary to add code. For that purpose I opened #26320. What I had in mind is comparing the computational cost of the algorithms themselves, for instance, building shallow (HGBT) vs deep (RF) trees. Mention that in RF the trees are independent and can be built separately, whereas in HGBT the corrections are built successively. Mention the efficiency of binning, etc.
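
As a rough illustration (a hypothetical toy example, not something that needs to go into the PR), the estimators' defaults already reflect this contrast between shallow binned trees and deep independent ones:

```python
# Minimal sketch: HGBT grows shallow, histogram-binned trees sequentially,
# while RF grows deep trees independently (hence parallelizable via n_jobs).
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

# HGBT: features are binned into at most max_bins=255 integer-valued bins,
# and each tree is capped at max_leaf_nodes=31 by default, so split finding
# only has to scan the bin histograms of shallow trees.
hgbt = HistGradientBoostingClassifier(
    max_bins=255, max_leaf_nodes=31, random_state=0
).fit(X, y)

# RF: trees are fully grown by default (max_depth=None) and fitted
# independently of one another.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
mean_depth = sum(t.get_depth() for t in rf.estimators_) / len(rf.estimators_)
print(f"mean depth of the {len(rf.estimators_)} RF trees: {mean_depth:.1f}")
```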

And for the last point I meant plain text, just not the note formatting.

Please let me know if it's clearer now.

@amay1212
Contributor Author

amay1212 commented May 5, 2023

> It is not necessary to add code. For that purpose I opened #26320. What I had in mind is comparing the computational cost of the algorithms themselves, for instance, building shallow (HGBT) vs deep (RF) trees. Mention that in RF the trees are independent and can be built separately, whereas in HGBT the corrections are built successively. Mention the efficiency of binning, etc.
>
> And for the last point I meant plain text, just not the note formatting.
>
> Please let me know if it's clearer now.

@ArturoAmorQ
Thanks for the quick feedback!
It's clear now. I have made the changes and addressed all the points from the review comments. Please review and let me know if there are any further concerns.

Member

@ArturoAmorQ ArturoAmorQ left a comment

Another comment: Please try to keep lines below 80 characters to comply with the PEP 8 style convention

implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single class.

Comparison of Histogram-based Gradient Boosting Trees over Random Forests in terms of computational cost:
Member

Instead of a subtitle, here you can introduce the goal of the following bullet-points like this:

Suggested change
Comparison of Histogram-based Gradient Boosting Trees over Random Forests in terms of computational cost:
A competitive alternative to random forests are
:ref:`histogram_based_gradient_boosting` (HGBT) models:

Contributor Author

> Instead of a subtitle, here you can introduce the goal of the following bullet-points like this:

Made the changes as requested.


- Building shallow trees: HGBT can be computationally more efficient than RF when building shallow trees because it only needs to consider a limited number of bins to construct the splits, whereas RF builds deep trees that require more iterations to reach optimal splits.

- Sequential boosting: In HGBT, the trees are built sequentially, with each tree correcting the mistakes of the previous trees. This can be more computationally efficient than RF, where all the trees are built independently and in parallel.
Member

The phrase "This can be more computationally efficient than RF" is a bit vague; I suggest giving a deeper explanation in this regard:

Suggested change
- Sequential boosting: In HGBT, the trees are built sequentially, with each tree correcting the mistakes of the previous trees. This can be more computationally efficient than RF, where all the trees are built independently and in parallel.
- Sequential boosting: In HGBT, the trees are built sequentially, with each tree
correcting the mistakes of the previous trees, which allows them to
iteratively improve the model's performance using fewer trees. In contrast,
RFs use a majority vote to predict the outcome, which can require a larger
number of trees to achieve the same level of accuracy.
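
To make the "fewer trees" point concrete, a minimal sketch (a hypothetical example on synthetic data) can use staged_predict, which yields the ensemble's predictions after each boosting iteration:

```python
# Minimal sketch: watch the sequential corrections improve test accuracy,
# then compare against an RF that aggregates 100 independent trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

hgbt = HistGradientBoostingClassifier(max_iter=100, random_state=0)
hgbt.fit(X_train, y_train)
# staged_predict yields predictions after 1, 2, ... boosting iterations,
# making each tree's correction of its predecessors directly visible.
for i, y_pred in enumerate(hgbt.staged_predict(X_test), start=1):
    if i in (1, 10, 50, 100):
        print(f"HGBT after {i:3d} trees: {accuracy_score(y_test, y_pred):.3f}")

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print(f"RF with 100 trees: {accuracy_score(y_test, rf.predict(X_test)):.3f}")
```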

Contributor Author

> The phrase "This can be more computationally efficient than RF" is a bit vague; I suggest giving a deeper explanation in this regard:

Made the changes as requested.

@amay1212
Contributor Author

amay1212 commented May 6, 2023

> Another comment: Please try to keep lines below 80 characters to comply with the PEP 8 style convention

Thanks for noticing it!
Just made the changes, let me know if there are any concerns.

Member

@adrinjalali adrinjalali left a comment

This is a good addition. I wonder if we want to refer to xgboost here or not, since that's what people know.

@amay1212 amay1212 force-pushed the amay1212-user_guide_hgbt branch from 948f744 to 70349e9 Compare May 8, 2023 14:01
Member

@adrinjalali adrinjalali left a comment

This is certainly an improvement.

@amay1212
Contributor Author

amay1212 commented May 8, 2023

> This is certainly an improvement.

Learning slowly

Member

@ArturoAmorQ ArturoAmorQ left a comment

Please note that you can accept suggestions directly from GitHub; this makes the review process easier.

@amay1212 amay1212 force-pushed the amay1212-user_guide_hgbt branch from 425f2bc to 249b45d Compare May 9, 2023 13:03
@amay1212
Contributor Author

amay1212 commented May 9, 2023

> Please note that you can accept suggestions directly from GitHub; this makes the review process easier.

Sure, I'll keep that in mind.

Member

@ArturoAmorQ ArturoAmorQ left a comment

Thanks for your effort @amay1212! It certainly LGTM :)

@adrinjalali adrinjalali enabled auto-merge (squash) May 9, 2023 13:12
@adrinjalali adrinjalali merged commit b013c6b into scikit-learn:main May 9, 2023