Skip to content

BUG Median not always being calculated correctly for DecisionTrees in the WeightedMedianCalculator #10725

@JohnStott

Description

@JohnStott

It appears the existing "WeightedMedianCalculator.update_median_parameters_post_remove(...)" is not working correctly... Here is what I am seeing with an illustrated example...

y = [0.14038694, 0.4173048, 0.55868983]
sample_weights = [2, 1, 2]

in effect, the above is equivalent to:

0.14038694, 0.14038694, 0.4173048, 0.55868983, 0.55868983

The process starts with these items being added to the WeightedMedianCalculator...
I have outputted key values after each movement (after update_median_parameters_post_push and after update_median_parameters_post_remove method calls):

> PUSH, data: 0.4173048; orig_median: 0.0; new_median: 0.4173048;....
This tells us that the sample line with y = 0.4173048 was added to a new group. Since there is only one item this is the same as the median!

> PUSH; data: 0.55868983; orig_median: 0.4173048; new_median: 0.55868983;....
Next the 0.55868983 sample line is added to the group (note that this has a weight of 2). We now have in effect:

0.4174048, 0.55868983, 0.55868983

so the new median becomes 0.55868983 as expected.

> PUSH; data: 0.14038694; orig_median: 0.55868983; new_median: 0.4173048;....
Next the 0.14038694 sample line is added to the group. We now have in effect:

0.14038694, 0.14038694, 0.4173048, 0.55868983, 0.55868983

so the new median becomes 0.4173048 as expected.

> POP/REMOVE; data: 0.4173048; orig_median: 0.4173048; new_median: 0.55868983;....

The data equated to the sample line 0.4173048 is removed. We now have:

0.14038694, 0.14038694, 0.55868983, 0.55868983

The new median looks incorrect! Should it not be:
= (0.14038694 + 0.55868983) / 2
= 0.349538385
?

Steps to reproduce:
Unfortunately it is difficult to provide a python script to demonstrate the above, as it requires debugging private values but can be achieved this way:
/tree/utils.pyx -> WeightedMedianCalculator class:

    cdef int remove(self, DOUBLE_t data, DOUBLE_t weight) nogil:
        """Remove a value from the MedianHeap, removing it
        from consideration in the median calculation
        """
        cdef int return_value
        cdef DOUBLE_t original_median

        if self.size() != 0:
            original_median = self.get_median()

        #FOR DEBUG:
        cdef int original_k
        cdef DOUBLE_t original_sum_w_0_k
        cdef DOUBLE_t original_total_weight
        original_k = self.k
        original_sum_w_0_k = self.sum_w_0_k
        original_total_weight = self.total_weight

        return_value = self.samples.remove(data, weight)
        self.update_median_parameters_post_remove(data, weight,
                                                  original_median)

        with gil:
            print (" ")
            print ("POP/REMOVE, data: " + str(data)
                + ";  orig_median: " + str(original_median) 
                + "; new_median: " + str(self.get_median()) 
                + ";  size: " + str(self.size())
                + ";  weight: " + str(weight)
                + ";  orig_total_weight: " + str(original_total_weight)
                + ";  self.total_weight: " + str(self.total_weight)
                + ";  orig_k: " + str(original_k)               
                + ";  self.k: " + str(self.k)
                + ";  orig_sum_w_0_k: " + str(original_sum_w_0_k)
                + ";  self.sum_w_0_k: " + str(self.sum_w_0_k)
                )

        return return_value
        cdef int push(self, DOUBLE_t data, DOUBLE_t weight) nogil except -1:
        """Push a value and its associated weight to the WeightedMedianCalculator

        Return -1 in case of failure to allocate memory (and raise MemoryError)
        or 0 otherwise.
        """
        cdef int return_value
        cdef DOUBLE_t original_median

        #FOR DEBUG:
        cdef int original_k
        cdef DOUBLE_t original_sum_w_0_k
        cdef DOUBLE_t original_total_weight
        original_k = self.k
        original_sum_w_0_k = self.sum_w_0_k
        original_total_weight = self.total_weight

        if self.size() != 0:
            original_median = self.get_median()
        # samples.push (WeightedPQueue.push) uses safe_realloc, hence except -1
        return_value = self.samples.push(data, weight)
        self.update_median_parameters_post_push(data, weight,
                                                original_median)

        with gil:
            print (" ")
            print ("PUSH, data: " + str(data) 
                + ";  orig_median: " + str(original_median) 
                + "; new_median: " + str(self.get_median()) 
                + ";  size: " + str(self.size())
                + ";  weight: " + str(weight)
                + ";  orig_total_weight: " + str(original_total_weight)
                + ";  self.total_weight: " + str(self.total_weight)
                + ";  orig_k: " + str(original_k)               
                + ";  self.k: " + str(self.k)
                + ";  orig_sum_w_0_k: " + str(original_sum_w_0_k)
                + ";  self.sum_w_0_k: " + str(self.sum_w_0_k)
                )

        return return_value

After compilation, calling this script:

from sklearn.tree import DecisionTreeRegressor

train_y = [0.4173048,0.55868983,0.14038694]
train_X = [[ 15., 9.],
           [ 19., 35.],
           [ 40., 54.]]
sample_weight = [1,2,2]

depth = 1

wineTree = DecisionTreeRegressor(max_depth=depth, criterion='mae', random_state=1)
wineTree.fit(train_X, train_y
    , sample_weight=sample_weight
    )

I think I know where the problem lies and how to fix this issue, but first wanted to ensure this behaviour isn't desired especially considering sample_weights could be represented as floats etc?

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions