roottest running out of threads !? #16552

@pcanal

Check duplicate issues.

  • Checked for duplicates

Description

When running with ctest -j 32 on a node with 127 cores (see below for more details), one of the runs had many failures due to running out of thread resources. The list of affected tests includes:

347:PyMVA-Keras-Classification
348:PyMVA-Keras-Regression 
349:PyMVA-Keras-Multiclass  
985:tutorial-tmva-TMVA_SOFIE_Keras
1238:tutorial-tmva-RBatchGenerator_PyTorch-py  
1239:tutorial-tmva-RBatchGenerator_TensorFlow-py   
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py        
1252:tutorial-tmva-keras-GenerateModel-py       
1253:tutorial-tmva-keras-MulticlassKeras-py       
1584:roottest-root-io-evolution-make              
1641:roottest-root-io-newstl-make

Those tests failed (possibly including tutorial-tmva-keras-MulticlassKeras-py, which did not run because it requires the previous test).

Reproducer

347/2278 Testing: PyMVA-Keras-Classification
347/2278 Test: PyMVA-Keras-Classification
Command: "/usr/bin/cmake" "-DCMD=/home/pcanal/root_working/build/quick-devel/tmva/pymva/test/testPyKerasClassification" "-DSYS=/home/pcanal/root_working/build/quick-devel" "-P" "/home/pcanal/root_working/code/quick-devel/cmake/modules/RootTestDriver.cmake"
Directory: /home/pcanal/root_working/build/quick-devel/tmva/pymva/test
"PyMVA-Keras-Classification" start time: Sep 24 20:01 UTC
Output:
----------------------------------------------------------
Get test data...
Generate keras model...
2024-09-24 20:01:12.572604: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-24 20:01:12.572668: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-24 20:01:12.573914: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-24 20:01:12.581129: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-24 20:01:15.157134: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
2024-09-24 20:01:26.401521: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
[ERROR] Failed to generate model using python
CMake Error at /home/pcanal/root_working/code/quick-devel/cmake/modules/RootTestDriver.cmake:232 (message):
  error code: 1


<end of output>
Test time =  54.61 sec
----------------------------------------------------------
Test Failed.
"PyMVA-Keras-Classification" end time: Sep 24 20:02 UTC
"PyMVA-Keras-Classification" time elapsed: 00:00:54
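
The `(11 vs. 0)` in the failed TensorFlow check is the value returned by `pthread_create()`. On Linux, errno 11 is `EAGAIN`, whose system message is the same "Resource temporarily unavailable" text seen in the other errors. A minimal Python sketch to decode it follows; the helper name `describe_errno` is made up for illustration, and the numeric value 11 is the Linux one (other platforms number `EAGAIN` differently).

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, system message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 11 is the value printed in the log: "Check failed: ret == 0 (11 vs. 0)".
    name, message = describe_errno(11)
    print(f"errno 11 -> {name}: {message}")
```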

Other errors:

14323:    system_error: Resource temporarily unavailable
614356:/bin/sh: fork: retry: Resource temporarily unavailable
614357:/bin/sh: fork: retry: Resource temporarily unavailable
614358:/bin/sh: fork: retry: Resource temporarily unavailable
614359:/bin/sh: fork: retry: Resource temporarily unavailable
614360:/bin/sh: fork: Resource temporarily unavailable
614444:/bin/sh: fork: retry: Resource temporarily unavailable
614445:/bin/sh: fork: retry: Resource temporarily unavailable
614446:/bin/sh: fork: retry: Resource temporarily unavailable
614447:/bin/sh: fork: retry: Resource temporarily unavailable
616571:LLVM ERROR: pthread_create failed: Resource temporarily unavailable
616573:sh: fork: retry: Resource temporarily unavailable
616574:sh: fork: retry: Resource temporarily unavailable
616575:sh: fork: retry: Resource temporarily unavailable
616576:sh: fork: retry: Resource temporarily unavailable
616577:sh: fork: Resource temporarily unavailable
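
`pthread_create()` and `fork()` fail with "Resource temporarily unavailable" when a limit on the number of threads or processes is reached. On Linux the usual candidates are the per-user `RLIMIT_NPROC` (`ulimit -u`), the system-wide `kernel.threads-max` and `kernel.pid_max`, and, inside a container, the cgroup `pids.max`. The report does not say which of these applied on this node, so the sketch below only gathers the values for comparison. The function names are hypothetical, the cgroup path assumes a cgroup v2 hierarchy mounted at `/sys/fs/cgroup` as seen from inside the container, and the thread count is an approximation based on `/proc` ownership.

```python
import os
import resource
from pathlib import Path


def read_first_line(path):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().splitlines()[0].strip()
    except (OSError, IndexError):
        return None


def thread_limits():
    """Collect limits that can make pthread_create()/fork() fail with EAGAIN."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        # resource.RLIM_INFINITY means "unlimited" for these two entries.
        "rlimit_nproc_soft": soft,
        "rlimit_nproc_hard": hard,
        "kernel_threads_max": read_first_line("/proc/sys/kernel/threads-max"),
        "kernel_pid_max": read_first_line("/proc/sys/kernel/pid_max"),
        # Assumed location (cgroup v2, container view); None if absent.
        "cgroup_pids_max": read_first_line("/sys/fs/cgroup/pids.max"),
    }


def count_user_threads(uid=None):
    """Approximate the number of threads owned by a user by scanning /proc."""
    uid = os.getuid() if uid is None else uid
    try:
        entries = list(Path("/proc").iterdir())
    except OSError:
        return 0  # no /proc available (not Linux)
    total = 0
    for entry in entries:
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            if entry.stat().st_uid != uid:
                continue
            for line in (entry / "status").read_text().splitlines():
                if line.startswith("Threads:"):
                    total += int(line.split()[1])
                    break
        except OSError:
            continue  # process exited while we were reading it
    return total


if __name__ == "__main__":
    for key, value in thread_limits().items():
        print(f"{key}: {value}")
    print(f"threads owned by current user: {count_user_threads()}")
```

Running this while `ctest -j 32` is active and comparing the thread count against the smallest of the reported limits would indicate whether one of them is being approached.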

ROOT version

master

Installation method

hand build

Operating system

Alma9

Additional context

The node is a VM with 128 GB of RAM and is accessed via a Jupyter notebook.

jupyter-pcanal-rootdevel:quick-devel pcanal$ uname -a
Linux jupyter-pcanal-rootdevel 6.3.12-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul  6 04:05:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
CPU(s):                  127
  On-line CPU(s) list:   0-126
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7543 32-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  1
    Core(s) per socket:  1
