
Conversation

@JohnStott
Contributor

Tree MAE is not considering sample_weights when calculating impurity!

In the proposed fix, you will see that I multiply by the sample weight after taking the absolute value of the difference (not before). This is in line with the consensus/discussion found in #3774, where negative sample weights are considered (and also because, during initialisation, self.weighted_n_node_samples is a summation of the sample weights with no absolute value applied; this is the quantity the impurity is divided by).
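A toy snippet (plain Python, not the tree code itself) showing the difference between the two orderings when a weight is negative; applying the weight after the absolute value lets a negative weight contribute negatively, consistent with `self.weighted_n_node_samples` being a plain signed sum of the weights:

```python
# Illustrative only: d stands for a (y - median) difference, w for a sample weight.
d, w = -2.0, -0.5

print(abs(d) * w)   # -1.0 -> weight applied after the absolute value (as proposed)
print(abs(d * w))   #  1.0 -> weight folded in before; the sign is lost
```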

Reference Issues/PRs

Fixes: #11460 (Tree MAE is not considering sample_weights when calculating impurity!)

@jnothman jnothman added the Bug label Jul 10, 2018
@jnothman jnothman added this to the 0.20 milestone Jul 10, 2018
@jnothman
Member

I think this looks good, but I am no expert on the calculation of impurities. Could you give some calculation to show why the values in the test are those we should expect?

Please add an entry to the change log under Bug Fixes at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@JohnStott
Contributor Author

Thanks @jnothman ,

I will make those changes as soon as possible. For now, here is an example; I hope it is sufficient...

It is perhaps worth a reminder that, because we are dealing with sample_weights, we cannot find the median by simply choosing/averaging the centre value(s). Instead, we take the median to be the point where 50% of the cumulative weight is reached in the sorted data set:

Data sorted by y:

| X | y | wt  | cuml_w |                                  |
|---|---|-----|--------|----------------------------------|
| 3 | 3 | 0.1 | 0.1    |                                  |
| 5 | 3 | 0.3 | 0.4    |                                  |
| 8 | 4 | 1.0 | 1.4    | this is the median (1.4 >= 1.15) |
| 3 | 6 | 0.6 | 2.0    |                                  |
| 5 | 7 | 0.3 | 2.3    |                                  |

Total wt: 2.3
50% of total wt: 1.15

From the above we can see that the sample with a y of 4 is the median, because it is the first sample whose cumulative weight (1.4) is greater than or equal to 50% of the total weight (1.15).
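For reference, a minimal NumPy sketch (not the actual Cython code) of this weighted-median rule:

```python
import numpy as np

def weighted_median(y, w):
    """First y value (in sorted order) whose cumulative weight reaches 50% of the total."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    order = np.argsort(y)
    cum_w = np.cumsum(w[order])
    idx = np.searchsorted(cum_w, 0.5 * cum_w[-1])  # first cumulative weight >= half the total
    return y[order][idx]

print(weighted_median([6, 7, 3, 4, 3], [0.6, 0.3, 0.1, 1.0, 0.3]))  # -> 4.0
```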

Using the y value of 4 as our median we can now calculate the absolute error:

| y | y - median | abs(y - median) |
|---|------------|-----------------|
| 3 | -1         | 1               |
| 3 | -1         | 1               |
| 4 | 0          | 0               |
| 6 | 2          | 2               |
| 7 | 3          | 3               |

Total error: 7

This was the original calculation (before this fix); however, it does not consider the sample weight. Even though the sample with X = 3, y = 3 has a weight of 0.1, it is currently given an importance of 1, which is wrong (imo). I therefore propose we multiply by the sample weight to get the weighted error:

| y | abs(median - y) | wt  | abs(median - y) * wt |
|---|-----------------|-----|----------------------|
| 3 | 1               | 0.1 | 0.1                  |
| 3 | 1               | 0.3 | 0.3                  |
| 4 | 0               | 1.0 | 0                    |
| 6 | 2               | 0.6 | 1.2                  |
| 7 | 3               | 0.3 | 0.9                  |

Total error: 2.5

To calculate the impurity we divide the Total Error by the total wt (i.e., 2.5 / 2.3):

Impurity: 1.08695652173913
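A quick NumPy check of that number, using the median of 4 found above (again, just a sketch, not the tree code):

```python
import numpy as np

y = np.array([3, 3, 4, 6, 7], dtype=float)
w = np.array([0.1, 0.3, 1.0, 0.6, 0.3])

# weighted absolute errors around the weighted median, divided by the total weight
print(np.sum(np.abs(y - 4) * w) / w.sum())  # 1.0869565... (= 2.5 / 2.3)
```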

The DecisionTreeRegressor then does its next cycle: it goes through every value of X to find the optimal split. I won't go through each one, just the optimal one... It finds that the optimal split is between the X values of 3 and 5. Therefore we have two new nodes, as follows:

Left node:

| X | y | wt  | cuml_w |                                  |
|---|---|-----|--------|----------------------------------|
| 3 | 3 | 0.1 | 0.1    |                                  |
| 3 | 6 | 0.6 | 0.7    | this is the median (0.7 >= 0.35) |

Total wt: 0.7
50% of total wt: 0.35

Thus, with a median of 6:

| y | abs(median - y) | wt  | abs(median - y) * wt |
|---|-----------------|-----|----------------------|
| 3 | 3               | 0.1 | 0.3                  |
| 6 | 0               | 0.6 | 0                    |

Total error: 0.3

To calculate the impurity we divide the Total Error by the total wt (i.e., 0.3 / 0.7):

Left Node Impurity: 0.428571428571429

Right node:

| X | y | wt  | cuml_w |                                 |
|---|---|-----|--------|---------------------------------|
| 5 | 3 | 0.3 | 0.3    |                                 |
| 8 | 4 | 1.0 | 1.3    | this is the median (1.3 >= 0.8) |
| 5 | 7 | 0.3 | 1.6    |                                 |

Total wt: 1.6
50% of total wt: 0.8

Thus, with a median of 4:

| y | abs(median - y) | wt  | abs(median - y) * wt |
|---|-----------------|-----|----------------------|
| 3 | 1               | 0.3 | 0.3                  |
| 4 | 0               | 1.0 | 0                    |
| 7 | 3               | 0.3 | 0.9                  |

Total error: 1.2

To calculate the impurity we divide the Total Error by the total wt (i.e., 1.2 / 1.6):

Right Node Impurity: 0.75
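And the same style of check for the two child nodes (weighted medians of 6 and 4, per the tables above):

```python
import numpy as np

def node_impurity(y, w, median):
    """Weighted MAE impurity: sum of |y - median| * w divided by the total weight."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    return np.sum(np.abs(y - median) * w) / w.sum()

print(node_impurity([3, 6], [0.1, 0.6], median=6))          # 0.428571... (= 0.3 / 0.7)
print(node_impurity([3, 4, 7], [0.3, 1.0, 0.3], median=4))  # 0.75       (= 1.2 / 1.6)
```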

@JohnStott
Contributor Author

@jnothman
Could you hold off merging this for now? I think it prudent that I check the information gain is still okay, and perhaps try this out on some sample datasets (comparing to the original method)... I will get back to you within the next few days, thanks.

@JohnStott
Contributor Author

@jnothman I managed to do some extra checks and I'm confident it is working as expected. I pushed the Boston dataset through a few RandomForestRegressors and obtained very similar results (error values) to the original implementation. The only downside to this fix is that it takes a little longer to run; this is to be expected, as there is now an additional lookup (the sample weight) and an extra multiplication for every impurity calculation.

@JohnStott
Contributor Author

In case of interest (?):

import time
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

dataset = load_boston()
X_full, y_full = dataset.data, dataset.target

start = time.time()
noOfTrees = 1000
estimator = RandomForestRegressor(random_state=0, n_estimators=noOfTrees,
                                  criterion="mae")
estimator.fit(X_full, y_full)
prediction = estimator.predict(X_full)

mae = mean_absolute_error(y_full, prediction)

print("\nSeconds elapsed:|" + str(time.time() - start))
print("Total MAE: " + str(mae))

Results:

n_estimators=1000

original:  
Seconds elapsed: 23.1293253898621
MAE: 0.834333695652168
   
fixed impurity:  
Seconds elapsed: 25.9046275615692
MAE: 0.821276482213434

n_estimators=10000

original:  
Seconds elapsed: 231.590049028397
MAE: 0.834080750988118
   
fixed impurity:  
Seconds elapsed: 261.451124191284
MAE: 0.823872351778635

@JohnStott
Contributor Author

Interestingly... I wasn't impressed with all the datatype conversions going on, so I trialled a version in which everything shares the same type. The results are much better... I'll do a few more tests and upload the changes/results shortly...

@JohnStott
Contributor Author

Results of the datatype consistency fix:

| n_estimators | version                  | Seconds elapsed  | MAE               |
|--------------|--------------------------|------------------|-------------------|
| 1000         | original                 | 23.1293253898621 | 0.834333695652168 |
| 1000         | fixed impurity/datatypes | 22.7475745677948 | 0.821276482213434 |
| 10000        | original                 | 231.590049028397 | 0.834080750988118 |
| 10000        | fixed impurity/datatypes | 227.267272949219 | 0.823872351778635 |

Much better!


impurity += fabs(y_ik - self.node_medians[k]) * w

return <double>(impurity / (self.weighted_n_node_samples *
Member

I don't think that there is a need to cast here.

There is an implicit conversion of n_outputs to double, since all the other variables are double.

Contributor Author
@JohnStott Jul 13, 2018

Note that impurity has been changed to type DOUBLE_t so that the calculation within the loop does not require any implicit conversion, since y_ik, self.node_medians and w are all now DOUBLE_t.

I think all the other variables would be converted to DOUBLE_t (not double), as this will likely be the "bigger" datatype...? I'd agree with you on the whole though: after doing this final calculation with all variables implicitly DOUBLE_t, the compiler will do one last implicit conversion to double, so the explicit cast is not required. I'll remove the explicit cast.

if sample_weight != NULL:
w = sample_weight[i]

imp_left += fabs(y_ik - median) * w
Member

Why not keep the previous code here? I don't see why we should not make the operation in place.

I think that we can strip the casting though.

Contributor Author

impurity_left and impurity_right point to arrays of type double. It is very expensive to cast (implicitly) from DOUBLE_t to double within the loop on each iteration, hence the introduction of the holders imp_left and imp_right. The cast is done once at the end, outside the loop, where it is much less expensive and also does not suffer any loss of precision.

Member

DOUBLE_t is np.npy_float64, which is actually a double, so there is no casting cost because they are the same type. I am not really sure why we have all that casting around.

The only cast which we could make explicit is the one in front of n_outputs, which is an integer.

Contributor Author
@JohnStott Jul 15, 2018

It is confusing why DOUBLE_t is used throughout instead of double in that case. Maybe there is some legacy reason (Python 2, etc.)?

I'm not sure how intelligent the compiler is in determining that DOUBLE_t is np.npy_float64 and thus that an implicit conversion is not required... so I ran some benchmarks to compare updating impurity_left/right[0] directly versus accumulating in imp_left/right; the results are interesting (note I removed all explicit casts). I ran the timings 5 times for each:

imp_left/right:
Seconds elapsed:|22.688602447509766
Total MAE: 0.8212764822134341

Seconds elapsed:|22.552382230758667
Total MAE: 0.8212764822134341

Seconds elapsed:|22.5827374458313
Total MAE: 0.8212764822134341

Seconds elapsed:|22.657567262649536
Total MAE: 0.8212764822134341

Seconds elapsed:|22.656784534454346
Total MAE: 0.8212764822134341

impurity_left/right[0]
Seconds elapsed:|25.91646981239319
Total MAE: 0.8212764822134341

Seconds elapsed:|26.169488668441772
Total MAE: 0.8212764822134341

Seconds elapsed:|25.957823753356934
Total MAE: 0.8212764822134341

Seconds elapsed:|26.088392734527588
Total MAE: 0.8212764822134341

Seconds elapsed:|25.923059225082397
Total MAE: 0.8212764822134341

There is a notable difference. This may not actually be caused by the implicit casts (if any); it could be the lookup of array item 0 for impurity_left[0] / impurity_right[0] on each iteration.

I think I can test this by changing my imp_left/right solution so that the imp_left/right variables are declared as double instead of DOUBLE_t. I'll try that and see what timings we get...

Contributor Author

Here are the results:
Seconds elapsed:|22.72613835334778
Total MAE: 0.8212764822134341

Seconds elapsed:|22.563627243041992
Total MAE: 0.8212764822134341

Seconds elapsed:|22.58454728126526
Total MAE: 0.8212764822134341

Seconds elapsed:|22.605653047561646
Total MAE: 0.8212764822134341

Seconds elapsed:|22.564542770385742
Total MAE: 0.8212764822134341

From this, I think we can safely conclude that there are no issues with implicit casts; the time lost is in looking up item 0 of the arrays on each iteration. Therefore I propose we keep imp_left / imp_right.
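For anyone curious, a loose pure-Python analogy of the pattern (the real code is Cython, and the names below are illustrative only, not the actual _criterion.pyx variables): accumulate in a local variable inside the loop and write to the output slot once at the end, instead of updating the slot on every iteration.

```python
from timeit import timeit

def update_slot_each_iteration(values, out):
    # pays the indexing/dereference cost on every iteration
    out[0] = 0.0
    for v in values:
        out[0] += abs(v)

def accumulate_then_assign(values, out):
    # plain local accumulation, single write at the end
    acc = 0.0
    for v in values:
        acc += abs(v)
    out[0] = acc

values = list(range(-500, 500))
out = [0.0]
print(timeit(lambda: update_slot_each_iteration(values, out), number=2000))
print(timeit(lambda: accumulate_then_assign(values, out), number=2000))
```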

Contributor Author

(As an aside, I decided to run one last benchmark, explicitly casting self.n_outputs to type double, and got the following:)
Seconds elapsed:|22.798115253448486
Seconds elapsed:|22.661853551864624
Seconds elapsed:|22.832958459854126
Seconds elapsed:|22.838365077972412
Seconds elapsed:|22.705886602401733
(so not overly conclusive and thus no benefit in introducing an explicit cast here)

Member

DOUBLE_t and double are the same at the end of the day. However, your point regarding the dereferencing is actually interesting, so it makes sense to avoid doing it in the loop and just make a single assignment at the end.

Contributor Author

If DOUBLE_t and double are exactly the same, should we replace all DOUBLE_t with double (and do the equivalent for SIZE_t if it is equivalent to another native datatype)? Otherwise it will just cause further confusion (as above) and untrue assumptions for other developers. Unless anyone knows of some other good reason for these?

Member

Basically, in C++ double_t is an alias for double as well: http://www.cplusplus.com/reference/cmath/double_t/.

So I don't think it is worth making too much change. I don't know exactly why the code was originally written that way, but it could have been imposed by Cython in the past.

if sample_weight != NULL:
w = sample_weight[i]

imp_right += fabs(y_ik - median) * w
Member

Same comment as before.

Contributor Author

see above

[0.6, 0.3, 0.1, 1.0, 0.3])
assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
assert_array_almost_equal(dt_mae.tree_.impurity,
[2.5/2.3, 0.3/0.7, 1.2/1.6])
Member

Put spaces around the / sign in this case; it will be more readable.

Contributor Author

did you mean like this:
[2.5 / 2.3, 0.3 / 0.7, 1.2 / 1.6]

cdef DOUBLE_t imp_right = 0.0

cdef void** left_child = <void**> self.left_child.data
cdef void** right_child = <void**> self.right_child.data
Member

You can remove the two lines below then

Contributor Author

Good call 👍


dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3],
[0.6, 0.3, 0.1, 1.0, 0.3])
assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
assert_array_almost_equal(dt_mae.tree_.impurity,
Member

We should probably add the case where every sample has the same sample_weight (an array of ones), which should give the previous results as well.

Contributor Author

Good point, added (with additional comments too)

cdef DOUBLE_t y_ik
cdef DOUBLE_t median
cdef DOUBLE_t w = 1.0
cdef DOUBLE_t imp_left = 0.0
Member

Can you rename imp_left to impurity_left, and impurity_left to p_impurity_left, to differentiate which one is the pointer? That way both names will be explicit.
The same should be done for the right.

Contributor Author

Done 👍

features that appear in later stages. This issue only affected feature
importances. :issue:`11176` by :user:`Gil Forsyth <gforsyth>`.

- Fixed a bug in :class:`tree.MAE` to ensure sample_weights are being used
Member

use backticks around sample_weights

Contributor Author

English-wise, sample_weights (with the plurality) didn't quite read right, so I changed it to plain English (no parameter reference).

max_leaf_nodes=2)

# Test MAE where all sample weights are uniform:
dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3], np.ones(5))
Member

Cosmit: could you be explicit with the named kwargs, so that it is easy to see that you are setting the sample weights?

Contributor Author

Yes, I'll do that later; I have simply followed the pre-existing style.

# Test MAE where all sample weights are uniform:
dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3], np.ones(5))
assert_array_equal(dt_mae.tree_.impurity, [1.4, 1.5, 4.0 / 3.0])
assert_array_equal(dt_mae.tree_.value.flat, [4, 4.5, 4.0])
Member

How do we know that these are the right values? It would be useful to explain it in a comment.

Contributor Author
@JohnStott Jul 16, 2018

I explained this in my second post above... Perhaps I could put a link to this pull request in the comments? Otherwise I can try to summarise it?

[0.6, 0.3, 0.1, 1.0, 0.3])
assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
assert_array_almost_equal(dt_mae.tree_.impurity,
[2.5 / 2.3, 0.3 / 0.7, 1.2 / 1.6])
Member

This illustrates well why I don't love hard coded numerical values: without being an expert in this part of the code, it is hard to see why the new version is correct and not the old one.

Contributor Author

I agree; unfortunately this file has many other tests with hardcoded values and no explanation, and I was simply following suit. On a side note, I have a fix for another issue in this area, where the medians are not being calculated correctly (see #10725). That will probably need more rigorous testing too, rather than a single test that happens to pass. It is tricky for me to know the best way to integrate it, as the values can be verified via code at runtime (not just a one-off test) if desired (as explained in that linked issue).

Perhaps you can point me to unit-test code that does this properly, so that I can follow it instead?

Contributor Author

Apologies for cross-contaminating threads; I only mention this other issue because in all likelihood I will be extending this exact test function! Please ignore this, I will investigate via the other thread.

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 16, 2018 via email

def test_mae():
# check MAE criterion produces correct results
# on small toy dataset
'''
Member

Replace ''' with """ and put the title on the first line.

The example is verbose, but at least it explains the rationale behind the numbers.

# Test MAE where sample weights are non-uniform (as illustrated above):
dt_mae.fit(X=[[3], [5], [3], [8], [5]], y=[6, 7, 3, 4, 3],
sample_weight=[0.6, 0.3, 0.1, 1.0, 0.3])
assert_array_almost_equal(dt_mae.tree_.impurity,
Member

use assert_allclose instead of assert_array_almost_equal

Member
@GaelVaroquaux left a comment

This is great! Thank you so much!!

Merging when CI is green.

@GaelVaroquaux GaelVaroquaux changed the title Tree MAE fix to ensure sample_weights are used during impurity calculation [MRG+1] Tree MAE fix to ensure sample_weights are used during impurity calculation Jul 17, 2018
@GaelVaroquaux GaelVaroquaux changed the title [MRG+1] Tree MAE fix to ensure sample_weights are used during impurity calculation [MRG+2] Tree MAE fix to ensure sample_weights are used during impurity calculation Jul 17, 2018
@JohnStott
Contributor Author

Trying to figure out what I'm doing wrong with this whitespace issue but I'll keep at it until fixed...

In the meantime, many thanks @glemaitre, @GaelVaroquaux and @jnothman, really appreciate your help, time, patience and direction on this, has been a great learning process for me 👍.

@GaelVaroquaux
Member

@JohnStott : you're doing good!

The trailing whitespace error on travis means that you have whitespace at the end of certain lines.

I personally work by setting up my editor to show me the whitespace at the end of lines (you can underline them, or highlight them).

@JohnStott
Contributor Author

@GaelVaroquaux, thanks, I have just switched that on as you suggest... I can now see several lines with trailing whitespace... hopefully it will now go through, cheers.

@GaelVaroquaux
Member

You still have one trailing whitespace. Just one !

sklearn/tree/tests/test_tree.py:1780:63: W291 trailing whitespace
    dt_mae.fit(X=[[3], [5], [3], [8], [5]], y=[6, 7, 3, 4, 3], 


@GaelVaroquaux
Member

@JohnStott : I figured that you were probably sleeping, so I just did this change, in the interest of merging this PR.

@GaelVaroquaux
Member

Yey! PEP8 now works.

Merging! Thank you!

@GaelVaroquaux GaelVaroquaux reopened this Jul 17, 2018
@GaelVaroquaux GaelVaroquaux merged commit 6ade12a into scikit-learn:master Jul 17, 2018
@JohnStott
Contributor Author

Brilliant, many thanks @GaelVaroquaux you saved me pulling out my hair :) 👍
