BUG Median not always being calculated correctly for DecisionTrees in the WeightedMedianCalculator

It appears the existing "_WeightedMedianCalculator.update_median_parameters_post_remove_(...)" is not working correctly...  Here is what I am seeing with an illustrated example...

```
y = [0.14038694, 0.4173048, 0.55868983]
sample_weights = [2, 1, 2]
```

in effect, the above is equivalent to:

> 0.14038694, 0.14038694, 0.4173048, 0.55868983, 0.55868983


The process starts with these items being added to the _WeightedMedianCalculator_...  
I have outputted key values after each movement (after _update_median_parameters_post_push_ and after _update_median_parameters_post_remove_ method calls):


`> PUSH, data: 0.4173048;  orig_median: 0.0; new_median: 0.4173048;....`
This tells us that the sample line with y = 0.4173048 was added to a new group.  Since there is only one item this is the same as the median!


`> PUSH;  data: 0.55868983;  orig_median: 0.4173048; new_median: 0.55868983;....`
Next the 0.55868983 sample line is added to the group **(note that this has a weight of 2).**  We now have in effect:
> 0.4174048, **0.55868983**, 0.55868983
 
so the new median becomes 0.55868983 as expected.


`> PUSH;  data: 0.14038694;  orig_median: 0.55868983; new_median: 0.4173048;....`
Next the 0.14038694 sample line is added to the group.  We now have in effect:
> 0.14038694, 0.14038694, **0.4173048**, 0.55868983, 0.55868983

so the new median becomes 0.4173048 as expected.


`> POP/REMOVE;   data: 0.4173048;  orig_median: 0.4173048; new_median: 0.55868983;....`

The data equated to the sample line 0.4173048 is removed.  We now have:
> 0.14038694, **0.14038694, 0.55868983**, 0.55868983

The new median looks incorrect!  Should it not be:
= (0.14038694 + 0.55868983) / 2
= 0.349538385
?

Steps to reproduce:
Unfortunately it is difficult to provide a python script to demonstrate the above, as it requires debugging private values but can be achieved this way:
/tree/utils.pyx -> WeightedMedianCalculator class:

```
    cdef int remove(self, DOUBLE_t data, DOUBLE_t weight) nogil:
        """Remove a value from the MedianHeap, removing it
        from consideration in the median calculation
        """
        cdef int return_value
        cdef DOUBLE_t original_median

        if self.size() != 0:
            original_median = self.get_median()

        #FOR DEBUG:
        cdef int original_k
        cdef DOUBLE_t original_sum_w_0_k
        cdef DOUBLE_t original_total_weight
        original_k = self.k
        original_sum_w_0_k = self.sum_w_0_k
        original_total_weight = self.total_weight

        return_value = self.samples.remove(data, weight)
        self.update_median_parameters_post_remove(data, weight,
                                                  original_median)

        with gil:
            print (" ")
            print ("POP/REMOVE, data: " + str(data)
                + ";  orig_median: " + str(original_median) 
                + "; new_median: " + str(self.get_median()) 
                + ";  size: " + str(self.size())
                + ";  weight: " + str(weight)
                + ";  orig_total_weight: " + str(original_total_weight)
                + ";  self.total_weight: " + str(self.total_weight)
                + ";  orig_k: " + str(original_k)               
                + ";  self.k: " + str(self.k)
                + ";  orig_sum_w_0_k: " + str(original_sum_w_0_k)
                + ";  self.sum_w_0_k: " + str(self.sum_w_0_k)
                )

        return return_value
```
```
        cdef int push(self, DOUBLE_t data, DOUBLE_t weight) nogil except -1:
        """Push a value and its associated weight to the WeightedMedianCalculator

        Return -1 in case of failure to allocate memory (and raise MemoryError)
        or 0 otherwise.
        """
        cdef int return_value
        cdef DOUBLE_t original_median

        #FOR DEBUG:
        cdef int original_k
        cdef DOUBLE_t original_sum_w_0_k
        cdef DOUBLE_t original_total_weight
        original_k = self.k
        original_sum_w_0_k = self.sum_w_0_k
        original_total_weight = self.total_weight

        if self.size() != 0:
            original_median = self.get_median()
        # samples.push (WeightedPQueue.push) uses safe_realloc, hence except -1
        return_value = self.samples.push(data, weight)
        self.update_median_parameters_post_push(data, weight,
                                                original_median)

        with gil:
            print (" ")
            print ("PUSH, data: " + str(data) 
                + ";  orig_median: " + str(original_median) 
                + "; new_median: " + str(self.get_median()) 
                + ";  size: " + str(self.size())
                + ";  weight: " + str(weight)
                + ";  orig_total_weight: " + str(original_total_weight)
                + ";  self.total_weight: " + str(self.total_weight)
                + ";  orig_k: " + str(original_k)               
                + ";  self.k: " + str(self.k)
                + ";  orig_sum_w_0_k: " + str(original_sum_w_0_k)
                + ";  self.sum_w_0_k: " + str(self.sum_w_0_k)
                )

        return return_value
```

After compilation, calling this script:

```
from sklearn.tree import DecisionTreeRegressor

train_y = [0.4173048,0.55868983,0.14038694]
train_X = [[ 15., 9.],
           [ 19., 35.],
           [ 40., 54.]]
sample_weight = [1,2,2]

depth = 1

wineTree = DecisionTreeRegressor(max_depth=depth, criterion='mae', random_state=1)
wineTree.fit(train_X, train_y
    , sample_weight=sample_weight
    )
```

I think I know where the problem lies and how to fix this issue, but first wanted to ensure this behaviour isn't desired especially considering sample_weights could be represented as floats etc?




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

BUG Median not always being calculated correctly for DecisionTrees in the WeightedMedianCalculator #10725

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

BUG Median not always being calculated correctly for DecisionTrees in the WeightedMedianCalculator #10725

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions