This repository was archived by the owner on Nov 17, 2023. It is now read-only.

test_autograd_save_memory fails and breaks builds #8211

@indhub

Description

test_autograd_save_memory fails sporadically on GPU with "cudaMalloc failed: out of memory" and causes CI builds to fail.

The code that breaks was added in PR #7478.

Links to build failures:
https://builds.apache.org/view/Incubator%20Projects/job/incubator-mxnet/view/change-requests/job/PR-8199/3/
https://builds.apache.org/view/Incubator%20Projects/job/incubator-mxnet/job/master/491/
https://builds.apache.org/view/Incubator%20Projects/job/incubator-mxnet/job/master/493/

Error log:

test_operator_gpu.test_autograd_save_memory ... [20:02:42] /workspace/dmlc-core/include/dmlc/logging.h:308: [20:02:42] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffaa471293c]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x1d8) [0x7ffaa5efd6a8]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x61) [0x7ffaa5f009f1]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNK5mxnet7NDArray13CheckAndAllocEv+0xf6) [0x7ffaa4a4ba06]
[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6common22SetupDefaultBlobsInOutERKSt6vectorINS_7NDArrayESaIS2_EES6_PS1_INS_5TBlobESaIS7_EESA_PS4_SB_SB_SB_PSt13unordered_mapIjjSt4hashIjESt8equal_toIjESaISt4pairIKjjEEERKS1_IjSaIjEE+0x38d) [0x7ffaa5a9143d]
[bt] (5) /workspace/python/mxnet/../../lib/libmxnet.so(_ZZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISA_EERKS9_INS_9OpReqTypeESaISF_EESE_EEPKNS2_2OpES5_RKNS_7ContextERKS9_IPNS_6engine3VarESaISW_EES10_RKS9_INS_8ResourceESaIS11_EERKS9_IPNS_7NDArrayESaIS17_EES1B_RKS9_IjSaIjEESJ_ENKUlNS_10RunContextENSU_18CallbackOnCompleteEE_clES1G_S1H_+0x1e8) [0x7ffaa5a91858]
[bt] (6) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x112) [0x7ffaa5a0f7a2]
[bt] (7) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x103) [0x7ffaa5a138d3]
[bt] (8) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x56) [0x7ffaa5a13ad6]
[bt] (9) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7ffaa5a10d3b]

[20:02:42] /workspace/dmlc-core/include/dmlc/logging.h:308: [20:02:42] src/engine/./threaded_engine.h:347: [20:02:42] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffaa471293c]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x1d8) [0x7ffaa5efd6a8]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x61) [0x7ffaa5f009f1]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNK5mxnet7NDArray13CheckAndAllocEv+0xf6) [0x7ffaa4a4ba06]
[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6common22SetupDefaultBlobsInOutERKSt6vectorINS_7NDArrayESaIS2_EES6_PS1_INS_5TBlobESaIS7_EESA_PS4_SB_SB_SB_PSt13unordered_mapIjjSt4hashIjESt8equal_toIjESaISt4pairIKjjEEERKS1_IjSaIjEE+0x38d) [0x7ffaa5a9143d]
[bt] (5) /workspace/python/mxnet/../../lib/libmxnet.so(_ZZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISA_EERKS9_INS_9OpReqTypeESaISF_EESE_EEPKNS2_2OpES5_RKNS_7ContextERKS9_IPNS_6engine3VarESaISW_EES10_RKS9_INS_8ResourceESaIS11_EERKS9_IPNS_7NDArrayESaIS17_EES1B_RKS9_IjSaIjEESJ_ENKUlNS_10RunContextENSU_18CallbackOnCompleteEE_clES1G_S1H_+0x1e8) [0x7ffaa5a91858]
[bt] (6) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x112) [0x7ffaa5a0f7a2]
[bt] (7) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x103) [0x7ffaa5a138d3]
[bt] (8) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x56) [0x7ffaa5a13ad6]
[bt] (9) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7ffaa5a10d3b]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffaa471293c]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x40f) [0x7ffaa5a0fa9f]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x103) [0x7ffaa5a138d3]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x56) [0x7ffaa5a13ad6]
[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7ffaa5a10d3b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7ffab1265a60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7ffab8b79184]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ffab88a5ffd]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [20:02:42] src/engine/./threaded_engine.h:347: [20:02:42] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffaa471293c]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x1d8) [0x7ffaa5efd6a8]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x61) [0x7ffaa5f009f1]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNK5mxnet7NDArray13CheckAndAllocEv+0xf6) [0x7ffaa4a4ba06]
[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6common22SetupDefaultBlobsInOutERKSt6vectorINS_7NDArrayESaIS2_EES6_PS1_INS_5TBlobESaIS7_EESA_PS4_SB_SB_SB_PSt13unordered_mapIjjSt4hashIjESt8equal_toIjESaISt4pairIKjjEEERKS1_IjSaIjEE+0x38d) [0x7ffaa5a9143d]
[bt] (5) /workspace/python/mxnet/../../lib/libmxnet.so(_ZZN5mxnet10imperative12PushFComputeERKSt8functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISA_EERKS9_INS_9OpReqTypeESaISF_EESE_EEPKNS2_2OpES5_RKNS_7ContextERKS9_IPNS_6engine3VarESaISW_EES10_RKS9_INS_8ResourceESaIS11_EERKS9_IPNS_7NDArrayESaIS17_EES1B_RKS9_IjSaIjEESJ_ENKUlNS_10RunContextENSU_18CallbackOnCompleteEE_clES1G_S1H_+0x1e8) [0x7ffaa5a91858]
[bt] (6) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x112) [0x7ffaa5a0f7a2]
[bt] (7) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x103) [0x7ffaa5a138d3]
[bt] (8) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x56) [0x7ffaa5a13ad6]
[bt] (9) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7ffaa5a10d3b]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffaa471293c]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x40f) [0x7ffaa5a0fa9f]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0x103) [0x7ffaa5a138d3]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_+0x56) [0x7ffaa5a13ad6]
[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7ffaa5a10d3b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7ffab1265a60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7ffab8b79184]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ffab88a5ffd]
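
The engine message in the log above suggests rerunning with MXNET_ENGINE_TYPE set to NaiveEngine so that the backtrace shows the calls leading to the failure. A minimal sketch of preparing such a run for this one test follows. It assumes the tests are run with nose from the repository root and that the test lives in tests/python/gpu/test_operator_gpu.py; neither the runner nor the path is stated in this report, and the helper name naive_engine_command is hypothetical.

```python
import os
import sys


def naive_engine_command(
    test_file="tests/python/gpu/test_operator_gpu.py",  # assumed path
    test_name="test_autograd_save_memory",
):
    """Return (env, cmd) for a single synchronous run of one test."""
    # Copy the current environment so the override stays local to the child
    # process and MXNET_ENGINE_TYPE is not left set after debugging.
    env = dict(os.environ)
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    # nose selects a single test function with the "file:function" form
    # (assumption: nose is the test runner in use).
    cmd = [sys.executable, "-m", "nose", "--verbose",
           "{}:{}".format(test_file, test_name)]
    return env, cmd
```

Passing the result to `subprocess.call(cmd, env=env)` would run the test with synchronous execution. Because the override lives only in the copied environment, nothing has to be reset afterwards.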

Environment info

Operating System: Ubuntu

MXNet version (commit hashes):
PR 8199 build that failed: 7771ee8
Build 491: 06806f5
Build 493: 63e5ed2

Minimum reproducible example

Build and run unit tests

Steps to reproduce

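Build MXNet and run the GPU unit tests. The failure is sporadic, so test_operator_gpu.test_autograd_save_memory may need several runs before it fails.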
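
Since the failure is sporadic, one run says little. A rough sketch of repeating a test command and counting failing runs is given below. The helper name count_failures is hypothetical, and the command to pass in depends on how the tests are invoked locally, which this report does not specify.

```python
import subprocess


def count_failures(cmd, runs=20, runner=subprocess.call):
    """Run `cmd` `runs` times and return how many runs exited non-zero.

    `runner` takes the command and returns its exit code. It defaults to
    subprocess.call and can be swapped out to try the helper without a GPU.
    """
    failures = 0
    for _ in range(runs):
        if runner(cmd) != 0:
            failures += 1
    return failures
```

For example, `count_failures(cmd, runs=50)` with the single-test command used locally gives a failure count that can be compared before and after a candidate fix.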
