Reading/ignoring corrupt images with Gluon data loader (imdecode error cannot be captured) #12280
Description
Short Version
mxnet.image.imdecode crashes and hangs when loading certain corrupt images using Gluon data loader. One possible workaround is to wrap a try/except block around imdecode, but Python try/except cannot capture MXNetError.
Long Version
I am working with a dataset so large that it is impractical to clean all images beforehand. Currently, when using the Gluon data loader, loading a corrupt image fails in imdecode with an MXNetError exception (see Error Message below) and then the process hangs. Ultimately, I would like the Gluon data loader to ignore corrupt images instead of crashing.
My idea to work around this issue is as follows: wrap imdecode in a try/except block and, whenever an exception occurs, simply return a dummy image (and label). Given the dummy image/label during training, I can skip backpropagation for that sample. I've tried that (see What have you tried to solve it? below) but it does not work because the Python try/except does not capture the MXNetError.
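The "ignore the dummy sample" half of this idea can be sketched independently of the decoding problem. The following is a minimal sketch, not a tested Gluon integration: it assumes the label is still a plain Python int equal to the sentinel when batching happens, and that the wrapper is passed to the data loader through its batchify_fn argument. The names drop_sentinel_samples and make_batchify_fn are hypothetical; only the sentinel value is taken from the code in this post.

```python
# Sentinel value taken from the dataset code in this post; helper names are hypothetical.
MISSING_LABEL_SENTINEL = -1234


def drop_sentinel_samples(samples, sentinel=MISSING_LABEL_SENTINEL):
    """Return only the (data, label) pairs whose label is not the sentinel."""
    return [(data, label) for data, label in samples if label != sentinel]


def make_batchify_fn(inner_batchify, sentinel=MISSING_LABEL_SENTINEL):
    """Wrap a batchify function so that dummy samples never reach the batch."""
    def batchify(samples):
        kept = drop_sentinel_samples(samples, sentinel)
        if not kept:
            # Every sample in this batch was a dummy; the training loop
            # has to skip batches that come back as None.
            return None
        return inner_batchify(kept)
    return batchify
```

With this approach batches can be smaller than the configured batch size, so any loss normalisation that assumes a fixed batch size would need to use the actual number of samples instead.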
I think there should be a mechanism to capture an error from imdecode or imread (both from mxnet.image) rather than crashing, unless I am missing something.
Environment info (Required)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2699.804
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.11
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Python Info----------
Version : 3.7.0
Compiler : GCC 7.2.0
Build : ('default', 'Jun 28 2018 13:15:42')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 18.0
Directory : /home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version : 1.3.0
Directory : /home/ubuntu/incubator-mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-4.4.0-1062-aws-x86_64-with-debian-stretch-sid
system : Linux
node : ip-172-31-35-198
release : 4.4.0-1062-aws
version : #71-Ubuntu SMP Fri Jun 15 10:07:39 UTC 2018
----------Hardware Info----------
machine : x86_64
processor : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0022 sec, LOAD: 0.4228 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0008 sec, LOAD: 0.3465 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0006 sec, LOAD: 0.3446 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0005 sec, LOAD: 0.1155 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0005 sec, LOAD: 0.0623 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0004 sec, LOAD: 0.0202 sec.
Package used (Python/R/Scala/Julia): Python
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio): gcc
MXNet commit hash: a6ecb59
Build config:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#-------------------------------------------------------------------------------
# Template configuration for compiling mxnet
#
# If you want to change the configuration, please use the following
# steps. Assume you are on the root directory of mxnet. First copy the this
# file so that any local changes will be ignored by git
#
# $ cp make/config.mk .
#
# Next modify the according entries, and then compile by
#
# $ make
#
# or build in parallel with 8 threads
#
# $ make -j8
#-------------------------------------------------------------------------------
#---------------------
# choice of compiler
#--------------------
ifndef CC
export CC = gcc
endif
ifndef CXX
export CXX = g++
endif
ifndef NVCC
export NVCC = nvcc
endif
# whether compile with options for MXNet developer
DEV = 0
# whether compile with debug
DEBUG = 0
# whether to turn on segfault signal handler to log the stack trace
USE_SIGNAL_HANDLER =
# the additional link flags you want to add
ADD_LDFLAGS =
# the additional compile flags you want to add
ADD_CFLAGS =
#---------------------------------------------
# matrix computation libraries for CPU/GPU
#---------------------------------------------
# whether use CUDA during compile
USE_CUDA = 0
# add the path to CUDA library to link and compile flag
# if you have already add them to environment variable, leave it as NONE
# USE_CUDA_PATH = /usr/local/cuda
USE_CUDA_PATH = NONE
# whether to enable CUDA runtime compilation
ENABLE_CUDA_RTC = 1
# whether use CuDNN R3 library
USE_CUDNN = 0
#whether to use NCCL library
USE_NCCL = 0
#add the path to NCCL library
USE_NCCL_PATH = NONE
# whether use opencv during compilation
# you can disable it, however, you will not able to use
# imbin iterator
USE_OPENCV = 1
#whether use libjpeg-turbo for image decode without OpenCV wrapper
USE_LIBJPEG_TURBO = 0
#add the path to libjpeg-turbo library
USE_LIBJPEG_TURBO_PATH = NONE
# use openmp for parallelization
USE_OPENMP = 1
# whether use MKL-DNN library
USE_MKLDNN = 0
# whether use NNPACK library
USE_NNPACK = 0
# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
USE_BLAS = apple
else
USE_BLAS = atlas
endif
# whether use lapack during compilation
# only effective when compiled with blas versions openblas/apple/atlas/mkl
USE_LAPACK = 1
# path to lapack library in case of a non-standard installation
USE_LAPACK_PATH =
# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
USE_INTEL_PATH = NONE
# If use MKL only for BLAS, choose static link automatically to allow python wrapper
ifeq ($(USE_BLAS), mkl)
USE_STATIC_MKL = 1
else
USE_STATIC_MKL = NONE
endif
#----------------------------
# Settings for power and arm arch
#----------------------------
ARCH := $(shell uname -a)
ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
USE_SSE=0
USE_F16C=0
else
USE_SSE=1
endif
#----------------------------
# F16C instruction support for faster arithmetic of fp16 on CPU
#----------------------------
# For distributed training with fp16, this helps even if training on GPUs
# If left empty, checks CPU support and turns it on.
# For cross compilation, please check support for F16C on target device and turn off if necessary.
USE_F16C =
#----------------------------
# distributed computing
#----------------------------
# whether or not to enable multi-machine supporting
USE_DIST_KVSTORE = 0
# whether or not allow to read and write HDFS directly. If yes, then hadoop is
# required
USE_HDFS = 0
# path to libjvm.so. required if USE_HDFS=1
LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server
# whether or not allow to read and write AWS S3 directly. If yes, then
# libcurl4-openssl-dev is required, it can be installed on Ubuntu by
# sudo apt-get install -y libcurl4-openssl-dev
USE_S3 = 0
#----------------------------
# performance settings
#----------------------------
# Use operator tuning
USE_OPERATOR_TUNING = 1
# Use gperftools if found
USE_GPERFTOOLS = 1
# Use JEMalloc if found, and not using gperftools
USE_JEMALLOC = 1
#----------------------------
# additional operators
#----------------------------
# path to folders containing projects specific operators that you don't want to put in src/operators
EXTRA_OPERATORS =
#----------------------------
# other features
#----------------------------
# Create C++ interface package
USE_CPP_PACKAGE = 0
#----------------------------
# plugins
#----------------------------
# whether to use caffe integration. This requires installing caffe.
# You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
# CAFFE_PATH = $(HOME)/caffe
# MXNET_PLUGINS += plugin/caffe/caffe.mk
# WARPCTC_PATH = $(HOME)/warp-ctc
# MXNET_PLUGINS += plugin/warpctc/warpctc.mk
# whether to use sframe integration. This requires build sframe
# git@github.com:dato-code/SFrame.git
# SFRAME_PATH = $(HOME)/SFrame
# MXNET_PLUGINS += plugin/sframe/plugin.mk
Error Message:
Process Process-2:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/incubator-mxnet/python/mxnet/gluon/data/dataloader.py", line 170, in worker_loop
data_queue.put((idx, batch))
File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/queues.py", line 358, in put
obj = _ForkingPickler.dumps(obj)
File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/home/ubuntu/incubator-mxnet/python/mxnet/gluon/data/dataloader.py", line 63, in reduce_ndarray
pid, fd, shape, dtype = data._to_shared_mem()
File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 200, in _to_shared_mem
self.handle, ctypes.byref(shared_pid), ctypes.byref(shared_id)))
File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 255, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:31:59] src/io/image_io.cc:162: Check failed: !dst.empty() Decoding failed. Invalid image file.
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7ff3b5cb5a0b]
[bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7ff3b5cb6578]
[bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::ImdecodeImpl(int, bool, void*, unsigned long, mxnet::NDArray*)+0x4c6) [0x7ff3b83e6fd6]
[bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3a279db) [0x7ff3b8a4b9db]
[bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7ff3b8a45e35]
[bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0xe2) [0x7ff3b8a5c642]
[bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7ff3b8a4543a]
[bt] (7) /home/ubuntu/anaconda3/envs/mxnet_latest/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7ff40d4cfc5c]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7ff41b5f66ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ff41b32c41d]
Minimum reproducible example
Run train_imagenet.py from GluonCV (https://github.com/dmlc/gluon-cv/blob/master/scripts/classification/imagenet/train_imagenet.py with commit hash 863f19bc86cda0f785b97c39a360fbd8cbd1b0e1) on a training dataset with corrupted images (e.g., an image with 0 bytes).
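For the 0-byte case in particular, obviously broken files can be rejected before they ever reach imdecode. This is a minimal sketch using only the standard library; looks_like_image is a hypothetical helper, and the check covers only the JPEG and PNG signatures because those are the extensions the dataset below accepts. It does not prove that a file decodes, so truncated or otherwise damaged images with a valid header would still get through.

```python
import os

# Leading bytes of the two formats accepted by the dataset (.jpg/.jpeg/.png).
_JPEG_MAGIC = b"\xff\xd8\xff"
_PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def looks_like_image(path):
    """Cheap sanity check: the file exists and starts with a JPEG or PNG signature.

    Rejects missing, unreadable, and 0-byte files without decoding anything.
    """
    try:
        with open(path, "rb") as f:
            head = f.read(len(_PNG_MAGIC))
    except OSError:
        return False
    return head.startswith(_JPEG_MAGIC) or head.startswith(_PNG_MAGIC)
```

Such a check could run once while the file list is built, so the cost is one small read per file rather than a full decode.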
What have you tried to solve it?
- I modified ImageFolderDataset below so that it could handle corrupt images in theory. The try/except does not capture MXNetError.
import os
import warnings

import mxnet as mx
from mxnet import gluon, image

DEFAULT_IMAGE_SIZE = 224
DEFAULT_MISSING_LABELS_SENTINEL = -1234


class ImageFolderDataset(gluon.data.Dataset):
    """A dataset for loading image files stored in a folder structure like::

        root/car/0001.jpg
        root/car/xxxa.jpg
        root/car/yyyb.jpg
        root/bus/123.jpg
        root/bus/023.jpg
        root/bus/wwww.jpg

    Parameters
    ----------
    root : str
        Path to root directory.
    flag : {0, 1}, default 1
        If 0, always convert loaded images to greyscale (1 channel).
        If 1, always convert loaded images to colored (3 channels).
    transform : callable, default None
        A function that takes data and label and transforms them:
    ::

        transform = lambda data, label: (data.astype(np.float32)/255, label)

    Attributes
    ----------
    synsets : list
        List of class names. `synsets[i]` is the name for the integer label `i`
    items : list of tuples
        List of all images in (filename, label) pairs.
    """
    def __init__(self, root, flag=1, transform=None, missing_sentinel=DEFAULT_MISSING_LABELS_SENTINEL):
        self._root = os.path.expanduser(root)
        self._flag = flag
        self._transform = transform
        self._missing_sentinel = missing_sentinel
        self._exts = tuple(['.jpg', '.jpeg', '.png'])
        self._list_images(self._root)

    def _list_images(self, root):
        # Each sub-directory of root is one class; its index is the label.
        self.synsets = []
        self.items = []
        for folder in sorted(os.listdir(root)):
            path = os.path.join(root, folder)
            if not os.path.isdir(path):
                warnings.warn('Ignoring %s, which is not a directory.'%path, stacklevel=3)
                continue
            label = len(self.synsets)
            self.synsets.append(folder)
            for filename in sorted(os.listdir(path)):
                filename = os.path.join(path, filename)
                ext = os.path.splitext(filename)[1]
                if ext.lower() not in self._exts:
                    warnings.warn('Ignoring %s of type %s. Only support %s'%(
                        filename, ext, ', '.join(self._exts)))
                    continue
                self.items.append((filename, label))

    def __getitem__(self, idx):
        file_name = self.items[idx][0]
        if os.path.exists(file_name) and file_name.endswith(self._exts):
            try:
                img = image.imread(file_name, self._flag)
                label = self.items[idx][1]
            except Exception:
                # Intended fallback for corrupt files: dummy image plus sentinel label.
                img = mx.nd.zeros((3, DEFAULT_IMAGE_SIZE, DEFAULT_IMAGE_SIZE))
                label = self._missing_sentinel
        else:
            # Missing file or unsupported extension: same dummy sample.
            img = mx.nd.zeros((3, DEFAULT_IMAGE_SIZE, DEFAULT_IMAGE_SIZE))
            label = self._missing_sentinel
        if self._transform is not None:
            return self._transform(img, label)
        return img, label

    def __len__(self):
        return len(self.items)
- I replaced image.imread with cv2.imread (directly using OpenCV) in the above code. It seemed to work on some images but still crashed eventually, which may mean mxnet.image.imdecode is running somewhere else too. I have not explored this yet.