Description
It appears the existing `WeightedMedianCalculator.update_median_parameters_post_remove(...)` method is not working correctly. Here is what I am seeing, with an illustrated example:
```
y = [0.14038694, 0.4173048, 0.55868983]
sample_weights = [2, 1, 2]
```

In effect, the above is equivalent to the unweighted sequence:

```
0.14038694, 0.14038694, 0.4173048, 0.55868983, 0.55868983
```
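This equivalence can be checked with a tiny pure-Python snippet, independent of the Cython class (for integer weights, `statistics.median` on the expanded list gives the weighted median):

```python
import statistics

# Integer sample weights are equivalent to repeating each
# sample weight-many times.
y = [0.14038694, 0.4173048, 0.55868983]
sample_weights = [2, 1, 2]

expanded = [v for v, w in zip(y, sample_weights) for _ in range(w)]
print(expanded)
# [0.14038694, 0.14038694, 0.4173048, 0.55868983, 0.55868983]
print(statistics.median(expanded))  # 0.4173048
```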
The process starts with these items being added to the WeightedMedianCalculator...
I have outputted key values after each movement (after update_median_parameters_post_push and after update_median_parameters_post_remove method calls):
> PUSH, data: 0.4173048; orig_median: 0.0; new_median: 0.4173048;....
This tells us that the sample with y = 0.4173048 was added to a new group. Since there is only one item, it is also the median.
> PUSH; data: 0.55868983; orig_median: 0.4173048; new_median: 0.55868983;....
Next the 0.55868983 sample is added to the group (note that it has a weight of 2). We now have, in effect:
0.4173048, 0.55868983, 0.55868983
so the new median becomes 0.55868983, as expected.
> PUSH; data: 0.14038694; orig_median: 0.55868983; new_median: 0.4173048;....
Next the 0.14038694 sample line is added to the group. We now have in effect:
0.14038694, 0.14038694, 0.4173048, 0.55868983, 0.55868983
so the new median becomes 0.4173048 as expected.
> POP/REMOVE; data: 0.4173048; orig_median: 0.4173048; new_median: 0.55868983;....
The data corresponding to the sample 0.4173048 is removed. We now have, in effect:
0.14038694, 0.14038694, 0.55868983, 0.55868983
The reported new median (0.55868983) looks incorrect! Should it not be:
= (0.14038694 + 0.55868983) / 2
= 0.349538385
?
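For reference, here is a minimal pure-Python weighted median, a sketch of the convention I believe the calculator is meant to follow (when the cumulative weight hits exactly half the total, average the two straddling values). It confirms the expected result after the removal:

```python
def weighted_median(values, weights):
    """Weighted median: walk the samples in sorted order, accumulating
    weight; average the two straddling values when the cumulative
    weight splits the total exactly in half."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for i, (v, w) in enumerate(pairs):
        cum += w
        if cum > total / 2.0:
            return v
        if cum == total / 2.0:
            # Exact split: average this value and the next one.
            return (v + pairs[i + 1][0]) / 2.0

# State before the removal: median is 0.4173048, as the calculator reports.
print(weighted_median([0.14038694, 0.4173048, 0.55868983], [2, 1, 2]))
# After removing the weight-1 sample 0.4173048:
print(weighted_median([0.14038694, 0.55868983], [2, 2]))
# 0.349538385, not the 0.55868983 the calculator reports.
```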
Steps to reproduce:
Unfortunately it is difficult to provide a Python script that demonstrates the above directly, as it requires inspecting private values, but it can be achieved this way:
/tree/utils.pyx -> WeightedMedianCalculator class:
```cython
cdef int remove(self, DOUBLE_t data, DOUBLE_t weight) nogil:
    """Remove a value from the MedianHeap, removing it
    from consideration in the median calculation
    """
    cdef int return_value
    cdef DOUBLE_t original_median
    # FOR DEBUG (cdef declarations kept at function scope):
    cdef int original_k
    cdef DOUBLE_t original_sum_w_0_k
    cdef DOUBLE_t original_total_weight
    original_k = self.k
    original_sum_w_0_k = self.sum_w_0_k
    original_total_weight = self.total_weight
    if self.size() != 0:
        original_median = self.get_median()
    return_value = self.samples.remove(data, weight)
    self.update_median_parameters_post_remove(data, weight,
                                              original_median)
    # FOR DEBUG:
    with gil:
        print(" ")
        print("POP/REMOVE, data: " + str(data)
              + "; orig_median: " + str(original_median)
              + "; new_median: " + str(self.get_median())
              + "; size: " + str(self.size())
              + "; weight: " + str(weight)
              + "; orig_total_weight: " + str(original_total_weight)
              + "; self.total_weight: " + str(self.total_weight)
              + "; orig_k: " + str(original_k)
              + "; self.k: " + str(self.k)
              + "; orig_sum_w_0_k: " + str(original_sum_w_0_k)
              + "; self.sum_w_0_k: " + str(self.sum_w_0_k))
    return return_value
```
```cython
cdef int push(self, DOUBLE_t data, DOUBLE_t weight) nogil except -1:
    """Push a value and its associated weight to the WeightedMedianCalculator

    Return -1 in case of failure to allocate memory (and raise MemoryError)
    or 0 otherwise.
    """
    cdef int return_value
    cdef DOUBLE_t original_median
    # FOR DEBUG (cdef declarations kept at function scope):
    cdef int original_k
    cdef DOUBLE_t original_sum_w_0_k
    cdef DOUBLE_t original_total_weight
    original_k = self.k
    original_sum_w_0_k = self.sum_w_0_k
    original_total_weight = self.total_weight
    if self.size() != 0:
        original_median = self.get_median()
    # samples.push (WeightedPQueue.push) uses safe_realloc, hence except -1
    return_value = self.samples.push(data, weight)
    self.update_median_parameters_post_push(data, weight,
                                            original_median)
    # FOR DEBUG:
    with gil:
        print(" ")
        print("PUSH, data: " + str(data)
              + "; orig_median: " + str(original_median)
              + "; new_median: " + str(self.get_median())
              + "; size: " + str(self.size())
              + "; weight: " + str(weight)
              + "; orig_total_weight: " + str(original_total_weight)
              + "; self.total_weight: " + str(self.total_weight)
              + "; orig_k: " + str(original_k)
              + "; self.k: " + str(self.k)
              + "; orig_sum_w_0_k: " + str(original_sum_w_0_k)
              + "; self.sum_w_0_k: " + str(self.sum_w_0_k))
    return return_value
```
After recompiling, run this script:
```python
from sklearn.tree import DecisionTreeRegressor

train_y = [0.4173048, 0.55868983, 0.14038694]
train_X = [[15., 9.],
           [19., 35.],
           [40., 54.]]
sample_weight = [1, 2, 2]
depth = 1

wineTree = DecisionTreeRegressor(max_depth=depth, criterion='mae',
                                 random_state=1)
wineTree.fit(train_X, train_y, sample_weight=sample_weight)
```
I think I know where the problem lies and how to fix it, but I first wanted to make sure this behaviour isn't intended, especially considering that sample_weights can also be floats.