test_autograd_save_memory fails and breaks builds #8211
Description
test_autograd_save_memory fails sporadically and causes builds to fail.
The code that breaks was added in PR #7478
Links to build failures:
https://builds.apache.org/view/Incubator%20Projects/job/incubator-mxnet/view/change-requests/job/PR-8199/3/
https://builds.apache.org/view/Incubator%20Projects/job/incubator-mxnet/job/master/491/
https://builds.apache.org/view/Incubator%20Projects/job/incubator-mxnet/job/master/493/
Error log:
test_operator_gpu.test_autograd_save_memory ... [20:02:42] /workspace/dmlc-core/include/dmlc/logging.h:308: [20:02:42] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory
Stack trace returned 10 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffaa471293c]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x1d8) [0x7ffaa5efd6a8]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x61) [0x7ffaa5f009f1]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNK5mxnet7NDArray13CheckAndAllocEv+0xf6) [0x7ffaa4a4ba06]
[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6common22SetupDefaultBlobsInOutERKSt6vectorINS_7NDArrayESaIS2_EES6_PS1_INS_5TBlobESaIS7_EESA_PS4_SB_SB_SB_PSt13unordered_mapIjjSt4hashIjESt8equal_toIjESaISt4pairIKjjEEERKS1_IjSaIjEE+0x38d) [0x7ffaa5a9143d]
[bt] (5) /workspace/python/mxnet/../../lib/libmxnet.so(_ZZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISA_EERKS9_INS_9OpReqTypeESaISF_EESE_EEPKNS2_2OpES5_RKNS_7ContextERKS9_IPNS_6engine3VarESaISW_EES10_RKS9_INS_8ResourceESaIS11_EERKS9_IPNS_7NDArrayESaIS17_EES1B_RKS9_IjSaIjEESJ_ENKUlNS_10RunContextENSU_18CallbackOnCompleteEE_clES1G_S1H_+0x1e8) [0x7ffaa5a91858]
[bt] (6) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x112) [0x7ffaa5a0f7a2]
[bt] (7) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x103) [0x7ffaa5a138d3]
[bt] (8) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x56) [0x7ffaa5a13ad6]
[bt] (9) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7ffaa5a10d3b]
[20:02:42] /workspace/dmlc-core/include/dmlc/logging.h:308: [20:02:42] src/engine/./threaded_engine.h:347: [20:02:42] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory
Stack trace returned 10 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffaa471293c]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x1d8) [0x7ffaa5efd6a8]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x61) [0x7ffaa5f009f1]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNK5mxnet7NDArray13CheckAndAllocEv+0xf6) [0x7ffaa4a4ba06]
[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6common22SetupDefaultBlobsInOutERKSt6vectorINS_7NDArrayESaIS2_EES6_PS1_INS_5TBlobESaIS7_EESA_PS4_SB_SB_SB_PSt13unordered_mapIjjSt4hashIjESt8equal_toIjESaISt4pairIKjjEEERKS1_IjSaIjEE+0x38d) [0x7ffaa5a9143d]
[bt] (5) /workspace/python/mxnet/../../lib/libmxnet.so(_ZZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISA_EERKS9_INS_9OpReqTypeESaISF_EESE_EEPKNS2_2OpES5_RKNS_7ContextERKS9_IPNS_6engine3VarESaISW_EES10_RKS9_INS_8ResourceESaIS11_EERKS9_IPNS_7NDArrayESaIS17_EES1B_RKS9_IjSaIjEESJ_ENKUlNS_10RunContextENSU_18CallbackOnCompleteEE_clES1G_S1H_+0x1e8) [0x7ffaa5a91858]
[bt] (6) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x112) [0x7ffaa5a0f7a2]
[bt] (7) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x103) [0x7ffaa5a138d3]
[bt] (8) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x56) [0x7ffaa5a13ad6]
[bt] (9) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7ffaa5a10d3b]
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 8 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffaa471293c]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x40f) [0x7ffaa5a0fa9f]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x103) [0x7ffaa5a138d3]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x56) [0x7ffaa5a13ad6]
[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7ffaa5a10d3b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7ffab1265a60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7ffab8b79184]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ffab88a5ffd]
terminate called after throwing an instance of 'dmlc::Error'
what(): [20:02:42] src/engine/./threaded_engine.h:347: [20:02:42] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory
Stack trace returned 10 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffaa471293c]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x1d8) [0x7ffaa5efd6a8]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x61) [0x7ffaa5f009f1]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNK5mxnet7NDArray13CheckAndAllocEv+0xf6) [0x7ffaa4a4ba06]
[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6common22SetupDefaultBlobsInOutERKSt6vectorINS_7NDArrayESaIS2_EES6_PS1_INS_5TBlobESaIS7_EESA_PS4_SB_SB_SB_PSt13unordered_mapIjjSt4hashIjESt8equal_toIjESaISt4pairIKjjEEERKS1_IjSaIjEE+0x38d) [0x7ffaa5a9143d]
[bt] (5) /workspace/python/mxnet/../../lib/libmxnet.so(_ZZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISA_EERKS9_INS_9OpReqTypeESaISF_EESE_EEPKNS2_2OpES5_RKNS_7ContextERKS9_IPNS_6engine3VarESaISW_EES10_RKS9_INS_8ResourceESaIS11_EERKS9_IPNS_7NDArrayESaIS17_EES1B_RKS9_IjSaIjEESJ_ENKUlNS_10RunContextENSU_18CallbackOnCompleteEE_clES1G_S1H_+0x1e8) [0x7ffaa5a91858]
[bt] (6) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x112) [0x7ffaa5a0f7a2]
[bt] (7) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x103) [0x7ffaa5a138d3]
[bt] (8) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x56) [0x7ffaa5a13ad6]
[bt] (9) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7ffaa5a10d3b]
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 8 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffaa471293c]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x40f) [0x7ffaa5a0fa9f]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x103) [0x7ffaa5a138d3]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x56) [0x7ffaa5a13ad6]
[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7ffaa5a10d3b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7ffab1265a60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7ffab8b79184]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ffab88a5ffd]
Environment info
Operating System: Ubuntu
MXNet version:
Commit hashes:
PR 8199 build that failed: 7771ee8
Build 491: 06806f5
Build 493: 63e5ed2
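Minimum reproducible example
Build MXNet and run the unit tests; the failure occurs in test_operator_gpu.test_autograd_save_memory.
Steps to reproduce
Build MXNet and run the unit tests. Because the failure is sporadic, more than one run may be needed to see it.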
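The error log suggests setting MXNET_ENGINE_TYPE to NaiveEngine so that operations run synchronously and the backtrace shows the calls leading to the failure. A minimal sketch of how that could be scripted for this one test follows. The test file path, the use of nose as the test runner, and the helper names build_debug_run and run_repeatedly are assumptions for illustration, not taken from the build configuration. Since the failure is sporadic, it may not reproduce under NaiveEngine at all.

```python
import os
import subprocess
import sys

# Assumed location of the failing test relative to the repository root.
TEST_TARGET = "tests/python/gpu/test_operator_gpu.py:test_autograd_save_memory"


def build_debug_run(base_env=None):
    """Return (command, environment) for one synchronous run of the test.

    The environment is a copy, so the caller's environment is left untouched
    and MXNET_ENGINE_TYPE does not stay set after debugging.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Force synchronous execution, as the error message recommends.
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    # Assumes nose is installed and is the runner used for these tests.
    cmd = [sys.executable, "-m", "nose", "--verbose", TEST_TARGET]
    return cmd, env


def run_repeatedly(attempts=10):
    """Run the test up to `attempts` times; return the first failing attempt.

    Returns None if every attempt passed.
    """
    cmd, env = build_debug_run()
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            return attempt
    return None
```

Calling run_repeatedly() from the repository root on a GPU machine would rerun the test until it fails or the attempt limit is reached. Running the same command under gdb, as the log suggests, is left out of the sketch.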