[MRG+2] Tree MAE fix to ensure sample_weights are used during impurity calculation #11464
Conversation
I think this looks good, but I am no expert on the calculation of impurities. Could you give some calculation to show why the values in the test are those we should expect? Please also add an entry to the change log under Bug Fixes.
Thanks @jnothman, I will make those changes as soon as possible. For now, here is an example; I hope it is sufficient. It is perhaps important to note that because we are dealing with sample_weights we cannot find the median by simply choosing/averaging the centre value(s); instead we take the median to be the point at which 50% of the cumulative weight is reached in the sorted data set. Data sorted by y:

| x | y | weight | cumulative weight |
|---|---|--------|-------------------|
| 3 | 3 | 0.1    | 0.1               |
| 5 | 3 | 0.3    | 0.4               |
| 8 | 4 | 1.0    | 1.4               |
| 3 | 6 | 0.6    | 2.0               |
| 5 | 7 | 0.3    | 2.3               |
From the above we can see that the sample with y = 4 is the median, because it is the first sample whose cumulative weight (1.4) exceeds 50% of the total weight (2.3 / 2 = 1.15). Using the y value of 4 as our median we can now calculate the absolute error:

|3 - 4| + |3 - 4| + |4 - 4| + |6 - 4| + |7 - 4| = 1 + 1 + 0 + 2 + 3 = 7
This was the original calculation (before this fix); however, it does not consider the sample weight. For example, even though the sample with x = 3, y = 3 has a weight of only 0.1, it is currently being given an importance of 1, which is wrong (IMO). I propose we should therefore multiply by the sample weight to get the weighted error:

|3 - 4| * 0.1 + |3 - 4| * 0.3 + |4 - 4| * 1.0 + |6 - 4| * 0.6 + |7 - 4| * 0.3 = 0.1 + 0.3 + 0 + 1.2 + 0.9 = 2.5
To calculate the impurity we divide the total error by the total weight: 2.5 / 2.3 ≈ 1.087.
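In case it helps, the weighted-median rule used above can be sketched in a few lines of Python (illustrative only, not scikit-learn's actual WeightedMedianCalculator, and it ignores tie-breaking details):

```python
import numpy as np

def weighted_median(y, w):
    # Sort by y, then take the first value whose cumulative weight
    # exceeds half of the total weight.
    order = np.argsort(y)
    y_sorted = np.asarray(y, dtype=float)[order]
    cum_w = np.cumsum(np.asarray(w, dtype=float)[order])
    return y_sorted[np.searchsorted(cum_w, 0.5 * cum_w[-1], side="right")]

print(weighted_median([6, 7, 3, 4, 3], [0.6, 0.3, 0.1, 1.0, 0.3]))  # -> 4.0
```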
The DecisionTreeRegressor then does its next cycle: it goes through every value of X to find the optimal split. I won't go through each one, just the optimal one. It finds that the optimal split is between the X values of 3 and 5, giving two new nodes as follows. Left node (X = 3):

| x | y | weight | cumulative weight |
|---|---|--------|-------------------|
| 3 | 3 | 0.1    | 0.1               |
| 3 | 6 | 0.6    | 0.7               |
Thus, with a median of 6:

|3 - 6| * 0.1 + |6 - 6| * 0.6 = 0.3 + 0 = 0.3
To calculate the impurity we divide the total error by the total weight: 0.3 / 0.7 ≈ 0.429.
Right node (X = 5 and 8):

| x | y | weight | cumulative weight |
|---|---|--------|-------------------|
| 5 | 3 | 0.3    | 0.3               |
| 8 | 4 | 1.0    | 1.3               |
| 5 | 7 | 0.3    | 1.6               |
Thus, with a median of 4:

|3 - 4| * 0.3 + |4 - 4| * 1.0 + |7 - 4| * 0.3 = 0.3 + 0 + 0.9 = 1.2
To calculate the impurity we divide the total error by the total weight: 1.2 / 1.6 = 0.75.
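These numbers can also be checked end-to-end against the estimator itself; the following sketch mirrors the test added in this PR (assuming a scikit-learn build that includes this fix; criterion="mae" was renamed "absolute_error" in later releases):

```python
from sklearn.tree import DecisionTreeRegressor

dt_mae = DecisionTreeRegressor(criterion="mae", max_leaf_nodes=2,
                               random_state=0)
dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3],
           sample_weight=[0.6, 0.3, 0.1, 1.0, 0.3])

# Root, left child, right child impurities:
print(dt_mae.tree_.impurity)  # expected ~[2.5/2.3, 0.3/0.7, 1.2/1.6]
```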
@jnothman I managed to do some extra checks and I'm confident it is working as expected. I just pushed the Boston data set through a few RandomForestRegressors and obtained very similar results (error values) to the original implementation. The only downside to this fix is that it takes a little longer to run; this is to be expected, as there is now an additional lookup (the sample weight) and an extra multiplication for every impurity calculation!
In case of interest, the results: n_estimators=1000
n_estimators=10000
Interestingly... I wasn't impressed with all the datatype conversions going on, so I trialled a version in which everything shares the same type. The results are much better... I'll do a few more tests and upload the changes/results shortly...
Results of datatype consistency fix:
Much better!
sklearn/tree/_criterion.pyx (outdated):

```cython
impurity += fabs(y_ik - self.node_medians[k]) * w
...
return <double>(impurity / (self.weighted_n_node_samples * self.n_outputs))
```
I don't think that there is a need to cast here. There is an implicit conversion of n_outputs to double since all the other variables are double.
Note that impurity has been changed to type DOUBLE_t so that the calculation within the loop does not require any implicit conversion, since y_ik, self.node_medians and w are all now DOUBLE_t.
I think all the other variables would be converted to DOUBLE_t (not double), as this will likely be the "bigger" datatype...? I'd agree with you on the whole, though: after the end calculation is done with all variables implicitly DOUBLE_t, the compiler will do one final implicit conversion to double, so the explicit cast is not required. I'll remove it.
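For reference, here is a rough Python transliteration of what the fixed node_impurity computes (a sketch only; names follow the Cython code, but the real implementation iterates over the sample indices belonging to the node):

```python
from math import fabs

def mae_node_impurity(ys, sample_weight, node_medians,
                      weighted_n_node_samples):
    # ys[k][i] is output k of sample i; one median per output.
    impurity = 0.0
    for k, median in enumerate(node_medians):
        for i, y_ik in enumerate(ys[k]):
            w = 1.0 if sample_weight is None else sample_weight[i]
            impurity += fabs(y_ik - median) * w  # the fix: weight each error
    return impurity / (weighted_n_node_samples * len(node_medians))

# Root node of the worked example above: 2.5 / 2.3
print(mae_node_impurity([[6, 7, 3, 4, 3]], [0.6, 0.3, 0.1, 1.0, 0.3],
                        [4.0], 2.3))
```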
sklearn/tree/_criterion.pyx (outdated):

```cython
if sample_weight != NULL:
    w = sample_weight[i]
...
imp_left += fabs(y_ik - median) * w
```
Why not keep the previous code here? I don't see why we should not make the operation in place. I think that we can strip the casting, though.
impurity_left and impurity_right point to arrays of type double. It is very expensive to cast within the loop (implicitly) from DOUBLE_t to double on each iteration, hence the introduction of the holders imp_left and imp_right. The cast is done once at the end, outside of the loop, where it is much less expensive and will not suffer from loss of precision.
DOUBLE_t is np.npy_float64, which is actually a double. So there is no casting cost because they are the same type. I am not really sure why we have all this casting around.
The only cast which we could make explicit is in front of n_outputs, which is an integer.
It is confusing why DOUBLE_t is used throughout instead of double, in that case? Maybe for some legacy reason (Python 2? etc.)?
I'm not sure how intelligent the compiler is in determining that DOUBLE_t is np.npy_float64 and thus that an implicit conversion is not required... so I ran some benchmarks to see the difference between updating impurity_left/right[0] directly versus using imp_left/right. The results are interesting (note I removed all explicit casts). I ran the timings 5 times for each:
imp_left/right:
Seconds elapsed: 22.688602447509766
Total MAE: 0.8212764822134341
Seconds elapsed: 22.552382230758667
Total MAE: 0.8212764822134341
Seconds elapsed: 22.5827374458313
Total MAE: 0.8212764822134341
Seconds elapsed: 22.657567262649536
Total MAE: 0.8212764822134341
Seconds elapsed: 22.656784534454346
Total MAE: 0.8212764822134341

impurity_left/right[0]:
Seconds elapsed: 25.91646981239319
Total MAE: 0.8212764822134341
Seconds elapsed: 26.169488668441772
Total MAE: 0.8212764822134341
Seconds elapsed: 25.957823753356934
Total MAE: 0.8212764822134341
Seconds elapsed: 26.088392734527588
Total MAE: 0.8212764822134341
Seconds elapsed: 25.923059225082397
Total MAE: 0.8212764822134341
There is a notable difference. It may not actually be caused by the implicit casts (if any?); it could be the lookup of array item 0 in impurity_left[0] / impurity_right[0] on each iteration.
I think I can test this by changing my imp_left/right solution to declare imp_left/right as double instead of DOUBLE_t. I'll try that and see what timings we get...
Here are the results:
Seconds elapsed: 22.72613835334778
Total MAE: 0.8212764822134341
Seconds elapsed: 22.563627243041992
Total MAE: 0.8212764822134341
Seconds elapsed: 22.58454728126526
Total MAE: 0.8212764822134341
Seconds elapsed: 22.605653047561646
Total MAE: 0.8212764822134341
Seconds elapsed: 22.564542770385742
Total MAE: 0.8212764822134341
From this, I think we can safely conclude that there are no issues with implicit casts; the time wasted is in looking up item 0 of the arrays on each iteration. Therefore I propose we keep imp_left / imp_right.
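The same effect can be sketched in pure Python (obviously not the Cython benchmark itself, just an analogue of "accumulate locally, write once" versus "write through the pointer every iteration"):

```python
import timeit

def via_local(values, out):
    acc = 0.0                # local accumulator
    for v in values:
        acc += v
    out[0] = acc             # single write at the end

def via_deref(values, out):
    out[0] = 0.0
    for v in values:
        out[0] += v          # indexed write on every iteration

values = [0.5] * 1_000_000
out = [0.0]
print(timeit.timeit(lambda: via_local(values, out), number=20))
print(timeit.timeit(lambda: via_deref(values, out), number=20))
```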
(As an aside, I decided to run one last benchmark, explicitly casting self.n_outputs to type double, and got the following:)
Seconds elapsed: 22.798115253448486
Seconds elapsed: 22.661853551864624
Seconds elapsed: 22.832958459854126
Seconds elapsed: 22.838365077972412
Seconds elapsed: 22.705886602401733
(So not overly conclusive, and thus no benefit in introducing an explicit cast here.)
DOUBLE_t and double are the same at the end of the day. However, your point regarding the dereferencing is actually interesting, and it then makes sense to avoid doing this in the loop and just make a single assignment at the end.
If DOUBLE_t and double are exactly the same, should we replace all DOUBLE_t with double (and do the equivalent with SIZE_t, if it is equivalent to another native datatype)? Otherwise it will just cause further confusion (as above) and untrue assumptions for other developers. Unless anyone knows of some other good reason for these?
Basically, in C++ double_t is an alias for double as well: http://www.cplusplus.com/reference/cmath/double_t/.
So I don't think that it is worth making so much change. I don't know exactly why the code was originally written that way, but it could have been imposed by Cython in the past.
sklearn/tree/_criterion.pyx (outdated):

```cython
if sample_weight != NULL:
    w = sample_weight[i]
...
imp_right += fabs(y_ik - median) * w
```
Same comment as before.
see above
sklearn/tree/tests/test_tree.py (outdated):

```diff
            [0.6, 0.3, 0.1, 1.0, 0.3])
-assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
+assert_array_almost_equal(dt_mae.tree_.impurity,
+                          [2.5/2.3, 0.3/0.7, 1.2/1.6])
```
Put spaces around the / sign in this case; it will be more readable.
Did you mean like this: `[2.5 / 2.3, 0.3 / 0.7, 1.2 / 1.6]`?
```cython
cdef DOUBLE_t imp_right = 0.0
...
cdef void** left_child = <void**> self.left_child.data
cdef void** right_child = <void**> self.right_child.data
```
You can remove the two lines below then
Good call 👍
sklearn/tree/tests/test_tree.py (outdated):

```diff
 dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3],
            [0.6, 0.3, 0.1, 1.0, 0.3])
-assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
+assert_array_almost_equal(dt_mae.tree_.impurity,
```
We should probably add the case where all samples have the same sample_weight (an array of ones), which should give the previous results as well.
Good point, added (with additional comments too)
sklearn/tree/_criterion.pyx (outdated):

```cython
cdef DOUBLE_t y_ik
cdef DOUBLE_t median
cdef DOUBLE_t w = 1.0
cdef DOUBLE_t imp_left = 0.0
```
Can you rename imp_left to impurity_left, and impurity_left to p_impurity_left, to differentiate which one is the pointer? That way both names will be explicit.
The same thing should be done for the right.
Done 👍
doc/whats_new/v0.20.rst (outdated):

```diff
   features that appear in later stages. This issue only affected feature
   importances. :issue:`11176` by :user:`Gil Forsyth <gforsyth>`.

+- Fixed a bug in :class:`tree.MAE` to ensure sample_weights are being used
```
use backticks around sample_weights
English-wise, sample_weights (with the plural) didn't quite look right, so I changed it to plain English (no parameter reference).
sklearn/tree/tests/test_tree.py (outdated):

```python
                            max_leaf_nodes=2)

# Test MAE where all sample weights are uniform:
dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3], np.ones(5))
```
Cosmit: could you be explicit on the named kwargs, so that it is easy to see that you are setting sample weights.
Yes, I'll do that later; I simply followed the pre-existing style.
```python
# Test MAE where all sample weights are uniform:
dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3], np.ones(5))
assert_array_equal(dt_mae.tree_.impurity, [1.4, 1.5, 4.0 / 3.0])
assert_array_equal(dt_mae.tree_.value.flat, [4, 4.5, 4.0])
```
How do we know that these are the right values? It would be useful to explain it in a comment.
I explained it in my 2nd post above... Perhaps I could put a link to this pull request in the comments? Otherwise I can try to summarise? (In short, for the uniform-weight case: the weighted median of y = [6, 7, 3, 4, 3] is 4, giving total error 1 + 1 + 0 + 2 + 3 = 7 and impurity 7 / 5 = 1.4; the left child ([3, 6], value (3 + 6) / 2 = 4.5) has impurity 3 / 2 = 1.5 and the right child ([7, 3, 4], median 4) has impurity 4 / 3.)
sklearn/tree/tests/test_tree.py (outdated):

```diff
            [0.6, 0.3, 0.1, 1.0, 0.3])
-assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
+assert_array_almost_equal(dt_mae.tree_.impurity,
+                          [2.5 / 2.3, 0.3 / 0.7, 1.2 / 1.6])
```
This illustrates well why I don't love hard-coded numerical values: without being an expert in this part of the code, it is hard to see why the new version is correct and not the old one.
I agree; unfortunately this file has many other tests with hard-coded values and no explanation, and I was simply following suit. On a side note, I have a fix for another issue in this domain, with respect to the medians not being calculated correctly; see #10725. I will perhaps need more rigorous testing there too, rather than a single test that happens to work. It is tricky for me to know the best way to integrate it, as the values can be verified via code at runtime (not a one-off test) if desired (as explained in that linked issue).
Perhaps you know of unit-test code that I can be redirected to, showing how it should really be done properly, so that I can follow that instead?
Apologies for cross-contaminating threads; I only mention this other issue because in all likelihood I will be extending this exact test function! Please ignore this, I will investigate via the other thread.
> I explained in the 2nd post above... Perhaps I could put a link to this pull in the comments? Otherwise I can try to summarise?

It would be better to summarize here.
sklearn/tree/tests/test_tree.py (outdated):

```python
def test_mae():
    # check MAE criterion produces correct results
    # on small toy dataset
    '''
```
Replace ''' by """ and put the title on the first line.
The example is verbose, but at least it explains the rationale behind the numbers.
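i.e., something along these lines (a sketch of the requested style only):

```python
def test_mae():
    """Check MAE criterion produces correct results on a small toy dataset.

    ...worked example explaining the expected medians and impurities...
    """
```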
sklearn/tree/tests/test_tree.py (outdated):

```python
# Test MAE where sample weights are non-uniform (as illustrated above):
dt_mae.fit(X=[[3], [5], [3], [8], [5]], y=[6, 7, 3, 4, 3],
           sample_weight=[0.6, 0.3, 0.1, 1.0, 0.3])
assert_array_almost_equal(dt_mae.tree_.impurity,
```
use assert_allclose instead of assert_array_almost_equal
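For example (assert_allclose lives in numpy.testing; dt_mae as in the surrounding test):

```python
from numpy.testing import assert_allclose

assert_allclose(dt_mae.tree_.impurity,
                [2.5 / 2.3, 0.3 / 0.7, 1.2 / 1.6])
```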
GaelVaroquaux left a comment:
This is great! Thank you so much!!
Merging when CI is green.
Trying to figure out what I'm doing wrong with this whitespace issue, but I'll keep at it until it's fixed... In the meantime, many thanks @glemaitre, @GaelVaroquaux and @jnothman; I really appreciate your help, time, patience and direction on this. It has been a great learning process for me 👍.
@JohnStott: you're doing well! The trailing-whitespace error on Travis means that you have whitespace at the end of certain lines. I personally work by setting up my editor to show me the whitespace at the end of lines (you can underline or highlight it).
@GaelVaroquaux, thanks, I have just switched that on as you suggested... I can now see several lines with trailing whitespace... hopefully it will now go through. Cheers.
You still have one trailing whitespace. Just one!
@JohnStott: I figured that you were probably sleeping, so I just made this change myself, in the interest of merging this PR.
Yey! PEP8 now passes. Merging! Thank you!
Brilliant, many thanks @GaelVaroquaux, you saved me pulling out my hair :) 👍
Tree MAE is not considering sample_weights when calculating impurity!
In the proposed fix, you will see I have multiplied by the sample weight after applying the absolute value to the difference (not before). This is in line with the consensus/discussion found in #3774, where negative sample weights are considered (and also because, during initialisation, self.weighted_n_node_samples is a summation of the sample weights with no "absolute" applied; this quantity is used in the impurity division calculation).
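A tiny illustration of why the order matters once negative weights are allowed (hypothetical numbers):

```python
y_ik, median, w = 5.0, 3.0, -0.5   # hypothetical sample and weight

print(abs(y_ik - median) * w)      # -1.0: the weight's sign is preserved
print(abs((y_ik - median) * w))    #  1.0: taking abs afterwards loses the sign
```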
Reference Issues/PRs
Fixes: #11460 (Tree MAE is not considering sample_weights when calculating impurity!)