[BUG][tabular] Memory Miscalculation in AutoGluon When Using CatBoost with Ray Parallelism #4930

@celestinoxp

Describe the bug

AutoGluon Tabular uses the Ray package for parallelism. With a large dataset that has many columns (e.g. 10,000), the Ray task fails with an error raised by CatBoost. Only CatBoost is affected; all other algorithms work well.

The log shows where the problem starts:
Will train 2 folds in parallel instead (Estimated 28.90% memory usage per fold, 57.80%/80.00% total)

In this case, AutoGluon's estimate was wrong: the two models ended up taking >100% of memory instead of the estimated 57.80%, which caused the out-of-memory exception (CatBoostError: bad allocation).
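The arithmetic behind that log line can be sketched as follows. This is only an illustration of the reasoning, not AutoGluon's actual implementation: the function name `max_parallel_folds` and the "actual" per-fold figure of 55% are assumptions made up for the example. The 28.90% estimate and the 80.00% budget come from the log.

```python
import math


def max_parallel_folds(per_fold_ratio: float, budget_ratio: float, num_folds: int) -> int:
    """Largest number of folds whose combined memory estimate fits in the budget.

    Hypothetical helper: always allows at least one fold, never more than num_folds.
    """
    if per_fold_ratio <= 0:
        return num_folds
    return max(1, min(num_folds, math.floor(budget_ratio / per_fold_ratio)))


# Values reported in the log.
estimated_per_fold = 0.2890  # "Estimated 28.90% memory usage per fold"
budget = 0.80                # "57.80%/80.00% total"

workers = max_parallel_folds(estimated_per_fold, budget, num_folds=8)
estimated_total = workers * estimated_per_fold
print(f"workers={workers}, estimated total={estimated_total:.2%}")

# Assumed real usage per fold (NOT from the log): anything above 50% per fold
# means two parallel folds need more than 100% of memory.
actual_per_fold = 0.55
actual_total = workers * actual_per_fold
safe_workers = max_parallel_folds(actual_per_fold, budget, num_folds=8)
print(f"actual total={actual_total:.2%}, workers that would fit={safe_workers}")
```

Under these assumptions the estimate allows 2 workers, while a correct per-fold figure would have allowed only 1, which matches the observed failure when 2 folds run in parallel.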

To Reproduce

from autogluon.tabular import TabularPredictor

# dados_treino_n2_maior_igual_17 is the wide training DataFrame (it includes the label column).
predictor = TabularPredictor(
    label="n2_maior_igual_17",
    eval_metric="log_loss",
    path="modelos/n2_maior_igual_17/"
).fit(
    dados_treino_n2_maior_igual_17,
    presets="best_quality",
    excluded_model_types=["KNN", "XT", "RF"],
    ds_args={"enable_ray_logging": False},
    ag_args_fit={
        "early_stop": None,
        "colsample_bylevel": 1.0,
    },
    time_limit=8 * 3600,  # 8 hours
    refit_full=True,
    calibrate=True
)

Screenshots / Logs

Error:

Fitting model: CatBoost_BAG_L1 ... Training model for up to 9610.46s of the 16674.71s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 28.90% memory usage per fold, 57.80%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=8, gpus=0, memory=28.90%)
Warning: Exception caused CatBoost_BAG_L1 to fail during training... Skipping this model.
ray::_ray_fit() (pid=1932, ip=127.0.0.1)
File "python\ray\_raylet.pyx", line 1883, in ray._raylet.execute_task
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 413, in _ray_fit
fold_model.fit(X=X_fold, y=y_fold, X_val=X_val_fold, y_val=y_val_fold, time_limit=time_limit_fold, **resources, **kwargs_fold)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\abstract\abstract_model.py", line 925, in fit
out = self._fit(**kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\models\catboost\catboost_model.py", line 243, in _fit
self.model.fit(X, **fit_final_kwargs)
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 5245, in fit
self._fit(X, y, cat_features, text_features, embedding_features, None, graph, sample_weight, None, None, None, None, baseline, use_best_model,
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 2410, in _fit
self._train(
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 1790, in _train
self._object._train(train_pool, test_pool, params, allow_clear_pool, init_model._object if init_model else None)
File "_catboost.pyx", line 5017, in _catboost._CatBoost._train
File "_catboost.pyx", line 5066, in _catboost._CatBoost._train
_catboost.CatBoostError: bad allocation
Detailed Traceback:
Traceback (most recent call last):
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\trainer\abstract_trainer.py", line 2160, in _train_and_save
model = self._train_single(**model_fit_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\trainer\abstract_trainer.py", line 2047, in _train_single
model = model.fit(X=X, y=y, X_val=X_val, y_val=y_val, X_test=X_test, y_test=y_test, total_resources=total_resources, **model_fit_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\abstract\abstract_model.py", line 925, in fit
out = self._fit(**kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\stacker_ensemble_model.py", line 270, in _fit
return super()._fit(X=X, y=y, time_limit=time_limit, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\bagged_ensemble_model.py", line 390, in _fit
self._fit_folds(
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\bagged_ensemble_model.py", line 847, in _fit_folds
fold_fitting_strategy.after_all_folds_scheduled()
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 690, in after_all_folds_scheduled
self._run_parallel(X, y, X_pseudo, y_pseudo, model_base_ref, time_limit_fold, head_node_id)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 631, in _run_parallel
self._process_fold_results(finished, unfinished, fold_ctx)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 587, in _process_fold_results
raise processed_exception
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 550, in _process_fold_results
fold_model, pred_proba, time_start_fit, time_end_fit, predict_time, predict_1_time, predict_n_size, fit_num_cpus, fit_num_gpus = self.ray.get(finished)
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray_private\auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray_private\worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\ray_private\worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(CatBoostError): ray::_ray_fit() (pid=1932, ip=127.0.0.1)
File "python\ray\_raylet.pyx", line 1883, in ray._raylet.execute_task
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\ensemble\fold_fitting_strategy.py", line 413, in _ray_fit
fold_model.fit(X=X_fold, y=y_fold, X_val=X_val_fold, y_val=y_val_fold, time_limit=time_limit_fold, **resources, **kwargs_fold)
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\core\models\abstract\abstract_model.py", line 925, in fit
out = self._fit(**kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\celes\anaconda3\Lib\site-packages\autogluon\tabular\models\catboost\catboost_model.py", line 243, in _fit
self.model.fit(X, **fit_final_kwargs)
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 5245, in fit
self._fit(X, y, cat_features, text_features, embedding_features, None, graph, sample_weight, None, None, None, None, baseline, use_best_model,
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 2410, in _fit
self._train(
File "C:\Users\celes\anaconda3\Lib\site-packages\catboost\core.py", line 1790, in _train
self._object._train(train_pool, test_pool, params, allow_clear_pool, init_model._object if init_model else None)
File "_catboost.pyx", line 5017, in _catboost._CatBoost._train
File "_catboost.pyx", line 5066, in _catboost._CatBoost._train
_catboost.CatBoostError: bad allocation

Installed Versions
Latest AutoGluon, CatBoost, and Ray
