roottest running out of threads !? #16552
Closed
Check duplicate issues.
- Checked for duplicates
Description
When running with ctest -j 32 on a node with 127 cores (see below for more details), one of the runs had many failures due to running out of thread resources. The list of affected tests includes:
347:PyMVA-Keras-Classification
348:PyMVA-Keras-Regression
349:PyMVA-Keras-Multiclass
985:tutorial-tmva-TMVA_SOFIE_Keras
1238:tutorial-tmva-RBatchGenerator_PyTorch-py
1239:tutorial-tmva-RBatchGenerator_TensorFlow-py
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py
1252:tutorial-tmva-keras-GenerateModel-py
1253:tutorial-tmva-keras-MulticlassKeras-py
1584:roottest-root-io-evolution-make
1641:roottest-root-io-newstl-make
Those tests failed (tutorial-tmva-keras-MulticlassKeras-py possibly did not run at all, because it requires the previous test).
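To put "running out of thread resources" into numbers, here is a minimal Python sketch that prints the limits that commonly cap thread creation on Linux and a rough worst-case thread count. It is not taken from the report. The helper names are hypothetical, the `/proc` and cgroup v2 paths are Linux-specific and may not exist on every system, and the figure of one thread per visible core per test process is only an illustrative assumption. The actual per-process thread counts were not measured here.

```python
import resource


def read_int_file(path):
    """Return the integer stored in `path`, or None if the file is
    missing, unreadable, or does not hold a plain number (e.g. "max")."""
    try:
        with open(path) as f:
            text = f.read().strip()
    except OSError:
        return None
    return int(text) if text.isdigit() else None


def thread_limits():
    """Collect limits that can make pthread_create() or fork() fail."""
    # RLIMIT_NPROC counts processes and threads of the user;
    # a value equal to resource.RLIM_INFINITY means "unlimited".
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "rlimit_nproc_soft": soft,
        "rlimit_nproc_hard": hard,
        # System-wide cap on the number of threads (Linux only).
        "kernel_threads_max": read_int_file("/proc/sys/kernel/threads-max"),
        # cgroup v2 task cap, often set for containers; the path is an
        # assumption and differs when the process is in a nested cgroup.
        "cgroup_pids_max": read_int_file("/sys/fs/cgroup/pids.max"),
    }


def worst_case_threads(jobs, threads_per_process):
    """Upper bound if every parallel job starts the same number of threads."""
    return jobs * threads_per_process


if __name__ == "__main__":
    for name, value in thread_limits().items():
        print(f"{name}: {value}")
    # Assumption: each of the 32 ctest jobs starts one thread per core (127).
    print("worst case threads:", worst_case_threads(32, 127))
```

Comparing the worst-case figure with the smallest of the reported limits gives a first hint of whether a limit of this kind could be involved. It does not prove which limit was actually hit.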
Reproducer
347/2278 Testing: PyMVA-Keras-Classification
347/2278 Test: PyMVA-Keras-Classification
Command: "/usr/bin/cmake" "-DCMD=/home/pcanal/root_working/build/quick-devel/tmva/pymva/test/testPyKerasClassification" "-DSYS=/home/pcanal/root_working/build/quick-devel" "-P" "/home/pcanal/root_working/code/quick-devel/cmake/modules/RootTestDriver.cmake"
Directory: /home/pcanal/root_working/build/quick-devel/tmva/pymva/test
"PyMVA-Keras-Classification" start time: Sep 24 20:01 UTC
Output:
----------------------------------------------------------
Get test data...
Generate keras model...
2024-09-24 20:01:12.572604: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-24 20:01:12.572668: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-24 20:01:12.573914: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-24 20:01:12.581129: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-24 20:01:15.157134: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
return self._float_to_str(self.smallest_subnormal)
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
return self._float_to_str(self.smallest_subnormal)
2024-09-24 20:01:26.401521: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
[ERROR] Failed to generate model using python
CMake Error at /home/pcanal/root_working/code/quick-devel/cmake/modules/RootTestDriver.cmake:232 (message):
error code: 1
<end of output>
Test time = 54.61 sec
----------------------------------------------------------
Test Failed.
"PyMVA-Keras-Classification" end time: Sep 24 20:02 UTC
"PyMVA-Keras-Classification" time elapsed: 00:00:54
Other errors:
14323: system_error: Resource temporarily unavailable
614356:/bin/sh: fork: retry: Resource temporarily unavailable
614357:/bin/sh: fork: retry: Resource temporarily unavailable
614358:/bin/sh: fork: retry: Resource temporarily unavailable
614359:/bin/sh: fork: retry: Resource temporarily unavailable
614360:/bin/sh: fork: Resource temporarily unavailable
614444:/bin/sh: fork: retry: Resource temporarily unavailable
614445:/bin/sh: fork: retry: Resource temporarily unavailable
614446:/bin/sh: fork: retry: Resource temporarily unavailable
614447:/bin/sh: fork: retry: Resource temporarily unavailable
616571:LLVM ERROR: pthread_create failed: Resource temporarily unavailable
616573:sh: fork: retry: Resource temporarily unavailable
616574:sh: fork: retry: Resource temporarily unavailable
616575:sh: fork: retry: Resource temporarily unavailable
616576:sh: fork: retry: Resource temporarily unavailable
616577:sh: fork: Resource temporarily unavailable
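The `11` in the TensorFlow check failure above (`ret == 0 (11 vs. 0)`) is the value returned by `pthread_create()`. On Linux, error number 11 is `EAGAIN`, whose message is the same "Resource temporarily unavailable" string that appears in the fork and LLVM errors. A small sketch for decoding such codes follows. The helper name is hypothetical, and the numeric value of `EAGAIN` differs on other platforms.

```python
import errno
import os


def describe_errno(code):
    """Map a numeric error code to its symbolic name and OS message."""
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 11 is the code printed in the TensorFlow log line of this report.
    print(describe_errno(11))
```

All the failures listed here therefore point at the same condition: the system refused to create another thread or process.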
ROOT version
master
Installation method
hand build
Operating system
Alma9
Additional context
The node is a VM with 128 GB of RAM and is accessed via a Jupyter notebook.
jupyter-pcanal-rootdevel:quick-devel pcanal$ uname -a
Linux jupyter-pcanal-rootdevel 6.3.12-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 6 04:05:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
CPU(s): 127
On-line CPU(s) list: 0-126
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7543 32-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 1