6 changes: 2 additions & 4 deletions doc/modules/ensemble.rst
@@ -218,7 +218,7 @@ setting ``oob_score=True``.
The size of the model with the default parameters is :math:`O( M * N * log (N) )`,
where :math:`M` is the number of trees and :math:`N` is the number of samples.
In order to reduce the size of the model, you can change these parameters:
-``min_samples_split``, ``min_samples_leaf``, ``max_leaf_nodes`` and ``max_depth``.
+``min_samples_split``, ``max_leaf_nodes`` and ``max_depth``.

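To make the size/parameter trade-off described in this passage concrete, here is a hedged sketch (not part of the PR; the dataset and parameter values are arbitrary) comparing total node count, a rough proxy for model size, with and without the constraining parameters:

```python
# Illustrative sketch: a default forest versus one constrained by the
# parameters named above. Dataset and values are arbitrary.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

default = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
pruned = RandomForestClassifier(n_estimators=10, max_depth=4,
                                min_samples_split=10, max_leaf_nodes=16,
                                random_state=0).fit(X, y)

def total_nodes(forest):
    """Sum of node counts over all trees in the ensemble."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)

print(total_nodes(default), total_nodes(pruned))
```

The constrained forest should be substantially smaller, since ``max_leaf_nodes=16`` alone caps each tree at 31 nodes.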
Parallelization
---------------
@@ -382,9 +382,7 @@ The number of weak learners is controlled by the parameter ``n_estimators``. The
the final combination. By default, weak learners are decision stumps. Different
weak learners can be specified through the ``base_estimator`` parameter.
The main parameters to tune to obtain good results are ``n_estimators`` and
-the complexity of the base estimators (e.g., its depth ``max_depth`` or
-minimum required number of samples at a leaf ``min_samples_leaf`` in case of
-decision trees).
+the complexity of the base estimators (e.g., its depth ``max_depth``).

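A hedged sketch of the two tuning knobs named above (illustrative only, not part of the PR). Note an assumption here: the base-estimator parameter is ``base_estimator`` in the scikit-learn release this PR targets, but newer releases renamed it to ``estimator``, so the sketch tries both:

```python
# Illustrative: tuning AdaBoost via n_estimators and base-estimator depth.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Default weak learner: a decision stump (a tree with max_depth=1).
stumps = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Increase weak-learner complexity by passing a deeper tree. The keyword
# depends on the installed scikit-learn version, so we try both names.
deeper_tree = DecisionTreeClassifier(max_depth=3)
try:
    boosted = AdaBoostClassifier(estimator=deeper_tree,
                                 n_estimators=50, random_state=0)
except TypeError:
    boosted = AdaBoostClassifier(base_estimator=deeper_tree,
                                 n_estimators=50, random_state=0)
boosted.fit(X, y)

print(stumps.score(X, y), boosted.score(X, y))
```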
.. topic:: Examples:

17 changes: 7 additions & 10 deletions doc/modules/tree.rst
@@ -330,15 +330,12 @@ Tips on practical use
for each additional level the tree grows to. Use ``max_depth`` to control
the size of the tree to prevent overfitting.

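The tip above can be sketched as follows (illustrative only; not part of the PR, values arbitrary):

```python
# Illustrative sketch: capping ``max_depth`` keeps the tree small.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so an unconstrained tree must grow deep to fit it.
X, y = make_classification(n_samples=300, flip_y=0.05, random_state=0)

unbounded = DecisionTreeClassifier(random_state=0).fit(X, y)
capped = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# The unconstrained tree grows until leaves are pure; the capped one stops early.
print(unbounded.tree_.max_depth, capped.tree_.max_depth)
```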
-* Use ``min_samples_split`` or ``min_samples_leaf`` to control the number of
-  samples at a leaf node. A very small number will usually mean the tree
-  will overfit, whereas a large number will prevent the tree from learning
-  the data. Try ``min_samples_leaf=5`` as an initial value. If the sample size
-  varies greatly, a float number can be used as percentage in these two parameters.
-  The main difference between the two is that ``min_samples_leaf`` guarantees
-  a minimum number of samples in a leaf, while ``min_samples_split`` can
-  create arbitrary small leaves, though ``min_samples_split`` is more common
-  in the literature.
+* Use ``min_samples_split`` to control the number of samples at a leaf node.
+  A very small number will usually mean the tree will overfit, whereas a
+  large number will prevent the tree from learning the data. If the sample
+  size varies greatly, a float number can be used as a percentage in this
+  parameter. Note that ``min_samples_split`` can create arbitrarily
+  small leaves.

Author: Please check whether my rephrasing of this paragraph is acceptable.

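The integer-versus-float behaviour described in this bullet can be sketched as follows (illustrative only; numbers are arbitrary). An integer ``min_samples_split`` is an absolute count, while a float is interpreted as a fraction of the training set:

```python
# Illustrative: int vs float ``min_samples_split`` on the same data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

absolute = DecisionTreeClassifier(min_samples_split=20, random_state=0).fit(X, y)
fraction = DecisionTreeClassifier(min_samples_split=0.1, random_state=0).fit(X, y)

# 0.1 * 200 samples == 20 samples, so both trees are grown identically.
print(absolute.tree_.node_count, fraction.tree_.node_count)
```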
* Balance your dataset before training to prevent the tree from being biased
toward the classes that are dominant. Class balancing can be done by
@@ -347,7 +344,7 @@ Tips on practical use
class to the same value. Also note that weight-based pre-pruning criteria,
such as ``min_weight_fraction_leaf``, will then be less biased toward
dominant classes than criteria that are not aware of the sample weights,
-like ``min_samples_leaf``.
+like ``min_samples_split``.
Author: My understanding leads me to believe this change is grammatical, but please check?


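The balancing advice above can be sketched as follows (illustrative only; not part of the PR). ``class_weight='balanced'`` reweights samples inversely to class frequencies, and ``min_weight_fraction_leaf`` then prunes on the *weighted* fraction rather than a raw sample count:

```python
# Illustrative: weight-based pre-pruning on an imbalanced problem.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Roughly 9:1 imbalanced toy dataset.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# With balanced class weights, the minority class carries half the total
# weight, so a 5% weighted-leaf threshold does not starve it of splits.
clf = DecisionTreeClassifier(class_weight='balanced',
                             min_weight_fraction_leaf=0.05,
                             random_state=0).fit(X, y)
print(clf.tree_.node_count)
```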
* If the samples are weighted, it will be easier to optimize the tree
structure using weight-based pre-pruning criterion such as
33 changes: 33 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -190,6 +190,22 @@ Classifiers and regressors
efficient when ``algorithm='brute'``. :issue:`11136` by `Joel Nothman`_
and :user:`Aman Dalmia <dalmia>`.

+- The parameter ``min_samples_leaf`` was deprecated in
+  :class:`ensemble.RandomForestClassifier`,
+  :class:`ensemble.RandomForestRegressor`,
+  :class:`ensemble.ExtraTreesClassifier`,
+  :class:`ensemble.ExtraTreesRegressor`,
+  :class:`ensemble.GradientBoostingClassifier`,
+  :class:`ensemble.GradientBoostingRegressor`,
+  :class:`tree.DecisionTreeClassifier`,
+  :class:`tree.DecisionTreeRegressor`,
+  :class:`tree.ExtraTreeClassifier`,
+  :class:`tree.ExtraTreeRegressor`,
+  and will be fixed to a value of 1 in version 0.22. It was not effective
+  for regularization and, empirically, 1 is the best value.
+  :issue:`10773` by :user:`Bob Chen <lasagnaman>`.


Cluster

- :class:`cluster.KMeans`, :class:`cluster.MiniBatchKMeans` and
@@ -545,6 +561,23 @@ Datasets
API changes summary
-------------------

+Classifiers and regressors
+
+- The parameter ``min_samples_leaf`` was deprecated in
+  :class:`ensemble.RandomForestClassifier`,
+  :class:`ensemble.RandomForestRegressor`,
+  :class:`ensemble.ExtraTreesClassifier`,
+  :class:`ensemble.ExtraTreesRegressor`,
+  :class:`ensemble.GradientBoostingClassifier`,
+  :class:`ensemble.GradientBoostingRegressor`,
+  :class:`tree.DecisionTreeClassifier`,
+  :class:`tree.DecisionTreeRegressor`,
+  :class:`tree.ExtraTreeClassifier`,
+  :class:`tree.ExtraTreeRegressor`,
+  and will be fixed to a value of 1 in version 0.22. It was not effective
+  for regularization and, empirically, 1 is the best value.
+  :issue:`10773` by :user:`Bob Chen <lasagnaman>`.
+
Linear, kernelized and related models

- Deprecate ``random_state`` parameter in :class:`svm.OneClassSVM` as the
4 changes: 2 additions & 2 deletions examples/ensemble/plot_adaboost_hastie_10_2.py
@@ -43,11 +43,11 @@
X_test, y_test = X[2000:], y[2000:]
X_train, y_train = X[:2000], y[:2000]

-dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
+dt_stump = DecisionTreeClassifier(max_depth=1)
dt_stump.fit(X_train, y_train)
dt_stump_err = 1.0 - dt_stump.score(X_test, y_test)

-dt = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)
+dt = DecisionTreeClassifier(max_depth=9)
dt.fit(X_train, y_train)
dt_err = 1.0 - dt.score(X_test, y_test)

2 changes: 1 addition & 1 deletion examples/ensemble/plot_gradient_boosting_oob.py
@@ -55,7 +55,7 @@

# Fit classifier with out-of-bag estimates
params = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
-'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
+'learning_rate': 0.01, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)

clf.fit(X_train, y_train)
3 changes: 1 addition & 2 deletions examples/ensemble/plot_gradient_boosting_quantile.py
@@ -41,8 +41,7 @@ def f(x):

clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
n_estimators=250, max_depth=3,
-learning_rate=.1, min_samples_leaf=9,
-min_samples_split=9)
+learning_rate=.1, min_samples_split=9)

clf.fit(X, y)

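As a hedged aside (not part of the PR), the ``loss='quantile'`` setting shown in this example yields prediction intervals by fitting one model per quantile. A minimal sketch on synthetic data, with arbitrary settings:

```python
# Illustrative: 90% prediction interval from two quantile GBMs.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = X.ravel() + rng.normal(scale=1.0, size=200)

# One model per quantile: lower bound, median, upper bound.
models = {}
for q in (0.05, 0.5, 0.95):
    models[q] = GradientBoostingRegressor(loss='quantile', alpha=q,
                                          n_estimators=100,
                                          max_depth=3).fit(X, y)

lo, med, hi = (models[q].predict(X) for q in (0.05, 0.5, 0.95))
coverage = np.mean((y >= lo) & (y <= hi))
print(coverage)
```

On the training data the interval should cover most targets; out-of-sample coverage would need a held-out set.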
2 changes: 0 additions & 2 deletions examples/model_selection/plot_randomized_search.py
@@ -55,7 +55,6 @@ def report(results, n_top=3):
param_dist = {"max_depth": [3, None],
"max_features": sp_randint(1, 11),
"min_samples_split": sp_randint(2, 11),
-"min_samples_leaf": sp_randint(1, 11),
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}

@@ -74,7 +73,6 @@ def report(results, n_top=3):
param_grid = {"max_depth": [3, None],
"max_features": [1, 3, 10],
"min_samples_split": [2, 3, 10],
-"min_samples_leaf": [1, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}

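A hedged sketch of how the trimmed parameter distributions above are consumed (illustrative only; a small synthetic dataset stands in for the example's real data):

```python
# Illustrative: randomized search over the trimmed parameter space.
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

search = RandomizedSearchCV(RandomForestClassifier(n_estimators=10,
                                                   random_state=0),
                            param_distributions=param_dist,
                            n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

Each of the 5 sampled candidates draws one value per key, so ``best_params_`` contains exactly the keys of ``param_dist``.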