This repository was archived by the owner on Nov 19, 2020. It is now read-only.

C4.5Learning: tree depends on the sorting order when the decision variable is continuous #57

@SivanK

For DecisionVariableKind=Continuous, candidates for cut values are determined by this code:
    if (o[j] != o[j + 1])
        candidates.Add((v[j] + v[j + 1]) / 2.0);

The order of the output labels is determined by the sort order of the input values, so labels that belong to equal values can end up in any order.
For example, after sorting:
Values are: 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 24, 24, 26, 26, 28, 29, 30, 30
Labels are: 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1

Candidates for split in this case are: 2, 2, 2, 3, 3, 3

Another option for sorting would be:
Values are: 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 24, 24, 26, 26, 28, 29, 30, 30
Labels are: 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1
(I've just swapped the order of two rows with equal values, the last two 3s.)

In that case, although the values are exactly the same, the candidates for split would be: 2, 2, 2, 3, 3, 3, 3, 13.5.
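
The two orderings above can be reproduced with a small standalone script. This is only a Python sketch of the quoted `o[j] != o[j + 1]` check, not the library's code; the function name `cut_candidates` is made up for illustration.

```python
def cut_candidates(values, labels):
    """Return a midpoint wherever two adjacent labels differ,
    mirroring the quoted `o[j] != o[j + 1]` check."""
    return [
        (values[j] + values[j + 1]) / 2.0
        for j in range(len(values) - 1)
        if labels[j] != labels[j + 1]
    ]

# Same sorted values for both runs; only the labels of the last two 3s are swapped.
values = [2] * 8 + [3] * 8 + [24, 24, 26, 26, 28, 29, 30, 30]
labels_a = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels_b = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]

print(cut_candidates(values, labels_a))  # [2.0, 2.0, 2.0, 3.0, 3.0, 3.0]
print(cut_candidates(values, labels_b))  # [2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 13.5]
```

Note that neither ordering proposes a cut at 2.5, and only the second one proposes a cut between 3 and 24.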

This behavior may result in the same decision tree regardless of whether the values are:
2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 24, 24, 26, 26, 28, 29, 30, 30 OR
2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 240, 240, 260, 260, 280, 290, 300, 300 OR
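2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 2400, 2400, 2600, 2600, 2800, 2900, 3000, 3000 OR ...
This happens with the first ordering because the last 3 and the first 24 are both labeled 1 (most of the other 3s are labeled 0), so no cut point is proposed between 3 and the larger values.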
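
One possible way to remove the dependence on tie order is to group rows by distinct value first and only propose midpoints between adjacent distinct values. The following is a minimal Python sketch of that idea under my own assumptions; it is not what C4.5Learning does, and `stable_cut_candidates` is a hypothetical name.

```python
from itertools import groupby

def stable_cut_candidates(values, labels):
    """Propose midpoints between adjacent *distinct* values.

    A boundary is skipped only when both neighbouring value groups
    contain a single, identical label, so the result does not depend
    on how rows with equal values happen to be ordered.
    """
    pairs = sorted(zip(values, labels))
    # One (value, set of labels seen for that value) entry per distinct value.
    groups = [
        (value, {label for _, label in rows})
        for value, rows in groupby(pairs, key=lambda p: p[0])
    ]
    candidates = []
    for (v1, s1), (v2, s2) in zip(groups, groups[1:]):
        if len(s1) > 1 or len(s2) > 1 or s1 != s2:
            candidates.append((v1 + v2) / 2.0)
    return candidates

head = [2] * 8 + [3] * 8
labels_a = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels_b = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
tail = [24, 24, 26, 26, 28, 29, 30, 30]

print(stable_cut_candidates(head + tail, labels_a))                    # [2.5, 13.5]
print(stable_cut_candidates(head + tail, labels_b))                    # [2.5, 13.5]
print(stable_cut_candidates(head + [10 * t for t in tail], labels_a))  # [2.5, 121.5]
```

With this variant both orderings give the same candidates, and scaling the large values moves the second cut point accordingly.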

Any suggestions on what can be done?

Thanks a lot,
Sivan.
