[MRG] implement least absolute deviation loss in GBDTs #13896
glemaitre merged 19 commits into scikit-learn:master from
Conversation
I think it's ready, ping @ogrisel @glemaitre @adrinjalali since you guys are the most familiar with the code ;)
    # loss
    # If the hessians are constant, we consider they are equal to 1.
    # - This is correct for the half LS loss
    # - For LAD loss, hessians are actually 0, but they are always
if that's the case, doesn't it make sense to actually set them to 0? I'm kinda worried about maintainability of this.
or even have a parameter in the loss class which is the function which gives you the constant hessians, i.e. for LAD it'd be np.zeros.
IMO it's more maintainable to have the convention that constant hessians are always 1 rather than have custom constant hessians for each loss. Especially when these hessians are never used, like here.
Also, this is always implicit and never explained in the papers, but the hessians need to be treated as 1 (even though they're 0) so that the leaf value computation is an average instead of a sum.
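As a toy illustration of that convention (an assumed example, not scikit-learn's actual code): with constant hessians set to 1, the generic Newton-style leaf value `-sum(gradients) / sum(hessians)` reduces to the negative mean of the gradients in the leaf, i.e. an average rather than a sum.

```python
import numpy as np

# Assumed illustration, not scikit-learn's actual code: with the convention
# that constant hessians equal 1, the generic leaf value
#     value = -sum(gradients) / sum(hessians)
# reduces to the negative *mean* of the gradients in the leaf.
gradients = np.array([0.5, -1.0, 2.0, 0.5])
hessians = np.ones_like(gradients)  # constant hessians treated as 1

leaf_value = -gradients.sum() / hessians.sum()
assert np.isclose(leaf_value, -gradients.mean())  # average, not a sum
```

Setting the hessians to 0 instead would make this ratio ill-defined, which is one practical reason to keep the "constant hessians are 1" convention.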
glemaitre left a comment:
It looks good otherwise. You will need an entry in what's new as well.
I just went to check our docs, and noticed/remembered we still don't have good examples or a user guide for these classes. But other than that, it seems the loss function names are not consistent with the other ensemble methods; for instance, GBR calls this LAD, I think.
    @pytest.mark.skipif(Y_DTYPE != np.float64,
                        reason='Need 64 bits float precision for numerical checks')
    def test_numerical_gradients(loss, n_classes, prediction_dim):

    @pytest.mark.parametrize('seed', range(1))
^^
I can remove it, it's just convenient when testing locally
don't worry that's still on my stack, I'm just waiting for a few more features
right, we broke name consistency with the other implementation already ('ls' vs 'least-squares', 'deviance' vs 'binary-crossentropy', etc.)
Co-Authored-By: Guillaume Lemaitre <[email protected]>
@glemaitre you were OK with this I think, do you wanna have another look so we can merge? Thanks!
Ping @glemaitre ;) Maybe @ogrisel too?
    ----------
    loss : {'least_squares'}, optional (default='least_squares')
    loss : {'least_squares', 'least_absolute_deviation'}, \
        optional (default='least_squares')
should we remove the "optional" and the parentheses right away?
I'd rather keep the current docstring consistent with the other entries, but I'm +1 on updating all of them in another PR.
I am still fine with the PR. But before merging, would it make sense to add a test in
and regarding the test, WDYT?
I agree. It currently fails, I'm investigating why. This has to do with the initial predictions being different between LightGBM and sklearn.
OK, I sorted it out. Added a comment:

    # - We don't check the least_absolute_deviation loss here. This is because
    #   LightGBM's computation of the median (used for the initial value of
    #   raw_prediction) is a bit off (they'll e.g. return midpoints when there
    #   is no need to). Since these tests only run 1 iteration, the
    #   discrepancy between the initial values leads to biggish differences in
    #   the predictions. These differences are much smaller with more
    #   iterations.
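For context, a small numeric check (an assumed example, not taken from the PR) of why two "medians" can both be valid: with an even number of samples, any value between the two middle order statistics minimizes the sum of absolute deviations, so LightGBM and sklearn can pick different yet equally valid initial values.

```python
import numpy as np

# Assumed example, not from the PR: for an even sample count, every value
# between the two middle order statistics (here, anything in [2, 4])
# minimizes the sum of absolute deviations. NumPy returns the midpoint.
y = np.array([1.0, 2.0, 4.0, 8.0])
candidates = [2.0, 3.0, 4.0]  # NumPy's midpoint median of y is 3.0
losses = [np.abs(y - c).sum() for c in candidates]
assert np.isclose(np.median(y), 3.0)
assert np.allclose(losses, losses[0])  # all three minimize the L1 loss
```

This is why a different (but still valid) median convention only shifts the initial raw prediction, and the discrepancy washes out as more iterations correct it.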
OK, makes sense. Merging then. Thanks @NicolasHug
This PR implements the least absolute deviation (or MAE) loss for GBDTs.
It's not as trivial as it seems, since the leaf values have to be updated after the tree is trained. Indeed, for MAE we train the trees to fit the gradients (or more accurately a Newton–Raphson step), but we need them to predict the median of the residuals. See the original paper for more details.
See also the current implementation of this for the old version of GBDTs, in particular `_update_terminal_region()` in `sklearn/ensemble/gb_losses.py`. Will ping when this is ready.
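A minimal sketch of that post-fit update (the helper name and array names below are hypothetical, not scikit-learn's internals): after the tree is grown on the gradients, each leaf's value is replaced by the median of the residuals of the samples that fall into it, so the tree ends up predicting medians as the LAD loss requires.

```python
import numpy as np

# Hedged sketch of the post-fit leaf update for the LAD loss. The function
# name and arguments are illustrative assumptions, not scikit-learn's API:
# a leaf's value becomes the median of y_true - raw_prediction over the
# samples mapped to that leaf.
def update_leaf_value_lad(y_true, raw_predictions, sample_indices):
    residuals = y_true[sample_indices] - raw_predictions[sample_indices]
    return np.median(residuals)

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
raw = np.zeros_like(y)
leaf_samples = np.array([0, 2, 4])  # samples mapped to this leaf
print(update_leaf_value_lad(y, raw, leaf_samples))  # median of [3, 4, 5] -> 4.0
```

In the real implementation this update runs once per leaf after each tree is fit, which is also the part that is not parallelized here (hence the gap with LightGBM mentioned below).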
This is a bit slower than LightGBM, since the leaf values update is not done in parallel here.

    python benchmarks/bench_hist_gradient_boosting.py --lightgbm --problem regression --n-samples-max 5000000 --n-trees 50 --loss least_absolute_deviation