RBDT does not support xgboost inference #15197
Description
RBDT's handling of xgboost models was never properly tested, neither when writing them to ROOT format nor when using them for inference afterwards. The functionality was never actually exercised because the xgboost package was not installed on the CI nodes, and the only available test silently skipped itself when the package was not found:
root/tmva/tmva/test/CMakeLists.txt, lines 34 to 41 in 05203ae:
```cmake
# Disabled because RBDT doesn't support the imbalanced tree structure of
# XGBoost models.
# if(dataframe)
#   find_python_module(xgboost QUIET)
#   if (PY_XGBOOST_FOUND)
#     ROOT_ADD_PYUNITTEST(rbdt_xgboost rbdt_xgboost.py)
#   endif()
# endif()
```
The test had to be disabled again right after the package was installed on some nodes, since it immediately triggered failures, as documented in PR #15183.
As of today, all platforms are running RBDT tests with xgboost models for the first time. The Alma9 platform highlights a severe problem in the following three tests:
```
TEST FAILURES:
 296:test-tcollbm
1314:tutorial-tmva-tmva101_Training-py
1315:tutorial-tmva-tmva102_Testing-py
```
Running one of them inside a debugger shows that there is an issue at exit time (which does not depend on the version of xgboost):
```
(ROOT-CI) [root@663cf5af2db5 tmva]# gdb --args python tmva102_Testing.py
[...]
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007f1251f78110 in __do_global_dtors_aux () from /lib64/libopenblaso.so.0
(gdb) bt
#0  0x00007f1251f78110 in __do_global_dtors_aux () from /lib64/libopenblaso.so.0
#1  0x00007f1279ebef9e in _dl_fini () at dl-fini.c:142
#2  0x00007f12796574dd in __run_exit_handlers (status=0, listp=0x7f12797fa838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true,
    run_dtors=run_dtors@entry=true) at exit.c:116
#3  0x00007f1279657620 in __GI_exit (status=<optimized out>) at exit.c:146
#4  0x00007f127963feb7 in __libc_start_call_main (main=main@entry=0x557932280160 <main>, argc=argc@entry=2, argv=argv@entry=0x7ffe94d9b838)
    at ../sysdeps/nptl/libc_start_call_main.h:74
#5  0x00007f127963ff60 in __libc_start_main_impl (main=0x557932280160 <main>, argc=2, argv=0x7ffe94d9b838, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7ffe94d9b828) at ../csu/libc-start.c:389
#6  0x0000557932280095 in _start ()
```
The same stack trace is produced for all three tests. This may or may not be linked to wlav/cppyy#227 and the fact that complex Python packages (xgboost in this case, but pytorch is also notorious for this behaviour) seem not to handle C++ symbols well.
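One way to confirm that the crash is confined to process teardown is to terminate before any exit handlers run: if the segfault disappears, the failure lies entirely in the destructor ordering at `_dl_fini` time. This is only a diagnostic sketch, not a fix; the helper name `exit_without_teardown` is made up and is not part of ROOT or xgboost:

```python
import os
import sys

def exit_without_teardown(status=0):
    """Terminate the process immediately after flushing stdio.

    os._exit() skips atexit handlers and shared-library global destructors,
    i.e. exactly the __run_exit_handlers / _dl_fini frames that appear in the
    backtrace above. If a script crashes with exit() but runs cleanly with
    this helper, the bug is in teardown, not in the workload itself.
    """
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(status)  # never returns; no Python or C++ cleanup runs
```

The obvious downside is that skipping teardown also skips legitimate cleanup (profilers, coverage, buffered file writes), so it is only useful to narrow down where the crash happens.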
Our current understanding of the situation with RBDT is the following:
- The implementation currently in master never actually worked, on multiple levels (it was largely untested, and the tutorials did not check the correctness of their results)
- Even if it had worked, it lacked the ability to handle imbalanced trees, which are quite common in xgboost models
- The re-implementation of the feature available at [TMVA] Fix RBDT evaluation of XGBoost models by merging RBDT with FastForest #15173 vastly improves the functionality
- There seems to be a major issue with complex Python packages and C++ that we apparently cannot solve and that will keep at least one CI platform permanently red if we leave things as they are
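To illustrate the imbalanced-tree point above: a layout that assumes a complete binary tree (children of node `i` at `2i+1` and `2i+2`) forces every branch to have the same depth, while XGBoost routinely prunes branches. A FastForest-style flattened layout with explicit child indices handles both. The sketch below is purely illustrative; none of these names correspond to ROOT's actual data structures:

```python
def eval_tree(features, cut_feature, cut_value, left, right, leaf_value):
    """Walk one flattened decision tree.

    A non-negative child index points at another inner node; a negative
    index -i - 1 points at leaf i. Because children are stored explicitly,
    subtrees of different depths can coexist in the same arrays.
    """
    node = 0
    while True:
        branch = left if features[cut_feature[node]] < cut_value[node] else right
        child = branch[node]
        if child < 0:
            return leaf_value[-child - 1]
        node = child

# An imbalanced tree: the root's left branch is a leaf, while its right
# branch descends one more level before reaching a leaf.
cut_feature = [0, 1]       # feature index tested at each inner node
cut_value = [0.5, 0.0]     # threshold at each inner node
left = [-1, -2]            # node 0 -> leaf 0; node 1 -> leaf 1
right = [1, -3]            # node 0 -> node 1; node 1 -> leaf 2
leaf_value = [1.0, 2.0, 3.0]

# eval_tree([0.2, 0.0], ...)  -> 1.0  (leaf reached after one split)
# eval_tree([0.9, -1.0], ...) -> 2.0  (leaf reached after two splits)
```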
Reproducer
At least on Alma9, the following is the smallest reproducer of the aforementioned segfault:
```python
import xgboost
import ROOT
import numpy as np

np.random.seed(1234)


def create_dataset(num_events, num_features, num_outputs, dtype=np.float32):
    x = np.random.normal(0.0, 1.0, (num_events, num_features)).astype(dtype=dtype)
    if num_outputs == 1:
        y = np.random.normal(0.0, 1.0, (num_events)).astype(dtype=dtype)
    else:
        y = np.random.choice(
            a=range(num_outputs), size=(num_events), p=[1.0 / float(num_outputs)] * num_outputs
        ).astype(dtype=dtype)
    return x, y


def _test_XGBBinary(label):
    """
    Compare response of XGB classifier and TMVA tree inference system.
    """
    x, y = create_dataset(1000, 10, 2)
    xgb = xgboost.XGBClassifier(n_estimators=100, max_depth=3)
    xgb.fit(x, y)
    ROOT.TMVA.Experimental.SaveXGBoost(xgb, "myModel", "testXGBBinary{}.root".format(label), num_inputs=10)
    bdt = ROOT.TMVA.Experimental.RBDT("myModel", "testXGBBinary{}.root".format(label))
    y_xgb = xgb.predict_proba(x)[:, 1].squeeze()
    y_bdt = bdt.Compute(x).squeeze()
    np.testing.assert_array_almost_equal(y_xgb, y_bdt)


if __name__ == "__main__":
    _test_XGBBinary("default")
```
ROOT version
Any
Installation method
build from source
Operating system
Alma9 (others could be affected)
Additional context
No response