1. Introduction

In the background of industrial intelligence, detecting surface defects on products in industrial scenarios is essential to reduce production costs, improve production efficiency, and guarantee product quality. The detection of surface defects is a problem of locating abnormal regions in images, such as scratches and smudges. However, in practical applications, anomaly detection with traditional supervised learning is difficult due to the low probability of abnormal samples and the diverse forms of anomalies. Therefore, methods based on semi-supervised techniques, which require only normal samples in the training phase, have more significant advantages for surface defect detection in practical applications.

Based on semi-supervised techniques, most image surface defect detection models attempt to explore the general patterns of normal samples efficiently. For example, reconstruction models based on the autoencoder (AE) (Bergmann et al., 2018; Tang et al., 2020) or the generative adversarial network (GAN) (Akcay et al., 2018; Schlegl et al., 2017a; Zenati et al., 2018; Schlegl et al., 2017b) aim to reconstruct normal images with minimal error and locate anomalies based on the reconstruction error. However, due to the powerful generalization ability of convolutional neural networks (CNNs) (Wheeler and Karimi, 2021; Chen et al., 2017), abnormal regions may also be reconstructed correctly in the inference phase, which clearly violates the basic assumptions of the reconstruction models. Recently, embedding-based models (Zheng et al., 2021; Roth et al., 2021; Cohen and Hoshen, 2020; Defard et al., 2021) have shown better anomaly detection performance than reconstruction-based models. The fundamental principle of embedding-based models is feature matching between the test and normal samples. Although such models require little training time, they need to perform complex feature-matching operations in the inference phase, which incurs high computational costs for model inference. In addition, such models are not trained on anomaly-specific datasets and directly use pre-trained parameters for feature extraction and anomaly detection, which is not sufficiently adaptable to the anomaly detection task.

Given the shortcomings of existing methods, this paper proposes an end-to-end memory-based segmentation network (MemSeg) to accomplish product surface defect detection. Instead of reconstructing the input images, MemSeg determines the abnormal regions in the images end-to-end. Additionally, our model does not entirely rely on a pre-trained model for feature extraction and defect detection, which alleviates the problem of inconsistent distribution between the source and target domains. The design of MemSeg is based on the observation
Table 1
Main advantages of MemSeg and mainstream methods. ✓ denotes that the method has an advantage in this item.
Method Speed Accuracy Training End-to-end
Reconstruction-based ✓ ✓
Simulation-based ✓ ✓ ✓
Embedding-based ✓
Ours ✓ ✓ ✓ ✓
the potential differences between normal and abnormal samples, some works (Zavrtanik et al., 2021a; Li et al., 2021; Song et al., 2021) attempted to generate artificially simulated abnormal samples during training. Specifically, DRAEM (Zavrtanik et al., 2021a) superimposes additional texture images as noise onto normal images to generate abnormal regions; this type of data augmentation aims to create textural anomalies. CutPaste (Li et al., 2021) and AnoSeg (Song et al., 2021) use an augmentation method similar to copy and paste: a small rectangular area is randomly copied from the input image and randomly pasted back onto the image to simulate abnormal samples, and pasting rectangular patches of different sizes, aspect ratios, and rotation angles creates structural anomalies. As a means of data augmentation, the existing anomaly simulation methods one-sidedly consider only structural anomalies or only textural anomalies. At the same time, for some datasets there is a problem of low simulation efficiency because the target foreground and background in the images cannot be distinguished well. The anomaly simulation strategy used by MemSeg overcomes these shortcomings.

In addition, despite the introduction of simulated abnormal samples, AnoSeg and DRAEM still need to complete the reconstruction process of the input image; CutPaste only completes defect detection at the image level, and defect localization at the pixel level is implemented by GradCAM or Gaussian density estimation. More directly, with the help of a well-designed anomaly simulation strategy, MemSeg does not need the reconstruction of the input image as an auxiliary task for model learning and completes pixel-level defect localization end-to-end.

2.3. Embedding-based methods

Embedding-based methods (Zheng et al., 2021; Roth et al., 2021; Cohen and Hoshen, 2020; Defard et al., 2021) usually use a network pre-trained on ImageNet (Deng et al., 2009) to extract the high-level features of the original images (without training on anomaly-specific datasets), and the anomaly score is obtained by calculating the distance of the features between the test sample and the normal sample to locate the abnormal region. FYD (Zheng et al., 2021) designs a two-stage coarse-to-fine feature alignment network that learns robust feature distributions of normal images; SPADE (Cohen and Hoshen, 2020) extends the KNN anomaly detection method to the pixel level and detects anomalies in the images through the pixel-level correspondence between the test image and normal images; PaDiM (Defard et al., 2021) uses a pre-trained CNN to extract the patch embeddings of the input image and uses a multivariate Gaussian distribution to obtain the probability representation of the normal samples. Due to their simplicity and effectiveness, embedding-based methods are widely used, but they usually require a complex process of feature matching in the inference phase, which greatly limits the inference speed of models. The memory module in MemSeg is still based on the principle of embedding-based methods, but through the design of a more efficient feature-matching algorithm it improves the precision of the model without adding too much computational cost.

3. Method

This section demonstrates our novel framework to detect and localize fine-grained anomalies. An overview of MemSeg is shown in Fig. 2. MemSeg uses U-Net (Ronneberger et al., 2015) as a framework to complete a semantic segmentation task with the help of simulated abnormal samples and memory information in the training phase, and it localizes abnormal regions in images end-to-end in the inference phase. MemSeg consists of several essential parts, which we describe in the following order: generation of abnormal samples by artificial simulation (Section 3.1), generation of memory information and spatial attention maps (Section 3.2), the multi-scale feature fusion module (MSFF Module) for the fusion of memory information and high-level features of images (Section 3.3), and the loss functions (Section 3.4).

3.1. Anomaly simulation strategy

In industrial scenarios, anomalies occur in various forms, and it is impossible to cover all of them when performing data collection, which limits modeling with supervised learning methods. However, in the semi-supervised framework, using only normal samples and no comparisons with abnormal samples is not sufficient for the model to learn what the normal patterns are. In this paper, inspired by DRAEM (Zavrtanik et al., 2021a), we design a more effective strategy to simulate abnormal samples and introduce them during the training of MemSeg to accomplish self-supervised learning. MemSeg summarizes the patterns of normal samples by comparing them with non-normal patterns, mitigating the drawbacks of semi-supervised learning. As shown in Fig. 3, the anomaly simulation strategy proposed in this paper is divided into three main steps.

In the first step, a two-dimensional Perlin noise (Perlin, 1985) P is generated, and P is then binarized by a threshold T to obtain the mask M_P. The Perlin noise has several random peaks, and the mask M_P generated from it can extract contiguous blocks of regions in the image. At the same time, the main body of some industrial components occupies only a small proportion of the acquired image; if anomaly simulation is performed directly without processing, it is easy to generate noise in the background part of the image, which increases the difference between the data distributions of simulated and real abnormal samples and is not conducive to the model learning effective discriminative information. We therefore adopt a foreground enhancement strategy for this type of image: the input image I is binarized to obtain the mask M_I, and the noise generated in the binarization process is removed using the opening or closing operation. After that, the final mask image M is obtained by taking the element-wise product of these two masks.

In the second step, the element-wise product of the mask image M and the noise image I_n is computed to obtain the region of interest (ROI) in I_n defined by M. Consistent with DRAEM (Zavrtanik et al., 2021a), MemSeg introduces a transparency factor δ in this process to balance the fusion of the original image and the noisy image, so that the patterns of simulated anomalies are closer to real anomalies. The noisy foreground image I_n′ is therefore generated using the following equation:

I_n′ = δ(M ⊙ I_n) + (1 − δ)(M ⊙ I)    (1)

For the noisy image I_n, we want its maximum transparency to be higher to increase the difficulty of model learning and thus improve the robustness of the model. Therefore, δ in Eq. (1) is sampled uniformly at random from [0.15, 1].

In the third step, the mask image M is inverted to obtain M̄, the element-wise product of M̄ and the original image I is computed to obtain the image I′, and according to

I_A = M̄ ⊙ I + I_n′    (2)

the data-augmented image I_A, namely the simulated abnormal image, is obtained. I_A takes the original input image I as the background and the ROI of the noise image I_n extracted by the mask image M as the foreground.

The noisy image I_n comes from two sources: one part comes from the DTD texture dataset (Cimpoi et al., 2014) and aims to simulate textural anomalies; the other part comes from the input image itself and aims to simulate structural anomalies. For the simulation of structural anomalies, we first apply random adjustments of mirror symmetry, rotation, brightness, saturation, and hue to the input image I. The preliminarily processed image is then uniformly divided into a 4 × 8 grid whose cells are randomly rearranged to obtain the disordered image I_n.

With the above anomaly simulation strategy, MemSeg obtains simulated abnormal samples from both textural and structural perspectives, and most abnormal regions are generated on the target foreground, which maximizes the similarity between the simulated abnormal samples and the real abnormal samples.
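To make the three steps concrete, the following Python sketch illustrates one possible implementation of the strategy. It is a minimal sketch under stated assumptions, not the authors' released code: the smooth_noise helper is only a stand-in for true Perlin noise, the foreground mask is obtained with a fixed grey-level threshold, and all names (simulate_anomaly, fg_thresh, and so on) are our own.

```python
import numpy as np
from scipy import ndimage

def smooth_noise(h, w, scale=8, seed=None):
    # Stand-in for 2-D Perlin noise: a coarse random grid, upsampled by
    # nearest-neighbour repetition and lightly blurred, gives contiguous blobs.
    # H and W are assumed to be multiples of `scale` (e.g. 256 x 256 inputs).
    rng = np.random.default_rng(seed)
    coarse = rng.random((h // scale, w // scale))
    noise = np.kron(coarse, np.ones((scale, scale)))
    return ndimage.gaussian_filter(noise, sigma=scale / 2)

def simulate_anomaly(image, noise_source, threshold=0.5, fg_thresh=0.2, seed=None):
    """Three-step anomaly simulation: Perlin-style mask with foreground
    enhancement, transparency blending (Eq. (1)), and superposition (Eq. (2)).
    `image` and `noise_source` are float arrays in [0, 1] with shape (H, W, C)."""
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape

    # Step 1: binarize the noise field to get M_P, binarize the image to get M_I,
    # clean M_I with a morphological opening, and intersect the two masks.
    m_p = smooth_noise(h, w, seed=seed) > threshold
    m_i = image.mean(axis=2) > fg_thresh              # rough foreground estimate
    m_i = ndimage.binary_opening(m_i, iterations=2)   # remove binarization noise
    m = (m_p & m_i)[..., None].astype(image.dtype)    # final mask M, shape (H, W, 1)

    # Step 2: extract the ROI of the noise image and blend it with the original
    # image inside the mask, using a random transparency factor delta in [0.15, 1].
    delta = rng.uniform(0.15, 1.0)
    i_n_prime = delta * (m * noise_source) + (1 - delta) * (m * image)   # Eq. (1)

    # Step 3: keep the original image outside the mask and paste the noisy
    # foreground inside it.
    i_a = (1 - m) * image + i_n_prime                                    # Eq. (2)
    return i_a, m.squeeze(-1).astype(np.float32)
```

Under this sketch, the returned mask doubles as the pixel-level ground truth S used by the training constraints in Section 3.4.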
Fig. 2. An overview of MemSeg. MemSeg is based on the U-Net architecture and uses a pre-trained ResNet18 (He et al., 2016) as the encoder. From the perspective of differences and commonalities, MemSeg introduces simulated abnormal samples and a memory module to assist the model learning in a more targeted way and thus accomplishes the semi-supervised surface defect detection task in an end-to-end approach. Meanwhile, to fully fuse the memory information with the high-level features of the input image, MemSeg introduces a multi-scale feature fusion module (MSFF Module) and a novel spatial attention module, which significantly improves the precision of anomaly localization.
Fig. 3. The three steps of our anomaly simulation strategy. In the first step, the mask image 𝑀 is generated using Perlin noise and the target foreground; in the second step, the
ROI defined by 𝑀 in the noise image 𝐼𝑛 is extracted to generate the noise foreground image 𝐼𝑛′ ; in the third step, the noise foreground image is superimposed on the original
image to obtain the simulated abnormal image 𝐼𝐴 .
3.2. Memory module and spatial attention maps

Memory Module. For humans, our recognition of anomalies is predicated on knowing what is normal, and abnormal regions are identified by comparing the test image with the normal images in our memory. Inspired by the human learning process and by embedding-based methods, MemSeg uses a small number of normal samples as memory samples and extracts high-level features of the memory samples as memory information using a pre-trained encoder (ResNet18, He et al., 2016) to assist the model training.

To obtain the memory information, we first randomly select N normal images from the training data as memory samples and input them to the encoder to obtain features of dimensions N × 64 × 64 × 64, N × 128 × 32 × 32, and N × 256 × 16 × 16 from block 1, block 2, and block 3 of ResNet18, respectively. These features with different resolutions together constitute the memory information MI. It needs to be emphasized that, to ensure the unification of the memory information and the high-level features of the input images, we always freeze the model parameters of block 1, block 2, and block 3 in ResNet18, but the rest of the model is still trainable.

Given an input image in the training or inference phase, as shown in Fig. 2, the encoder also extracts high-level features of the input image to obtain features with dimensions of 64 × 64 × 64, 128 × 32 × 32, and 256 × 16 × 16. These features with different resolutions together constitute the information of the input image II. After that, the L2 distance between II and all the memory information MI is calculated, so the N difference information DI between the input image and the memory samples is obtained:

DI = ⋃_{i=1}^{N} ‖MI_i − II‖_2    (3)
where N is the number of memory samples. For the N difference information, the minimum sum of all elements in each DI is taken as the criterion to obtain the best difference information DI* between II and MI; that is,

DI* = argmin_{DI_i ∈ DI} Σ_{x ∈ DI_i} x    (4)

where i ∈ [1, N]. The best difference information DI* contains the differences between the input sample and its most similar memory sample; the larger the difference value at a position, the higher the probability that the corresponding region of the input image is abnormal.

Subsequently, the best difference information DI* is concatenated with the high-level features of the input image II in the channel dimension to obtain the concatenated information CI_1, CI_2 and CI_3. Finally, the concatenated information goes through the multi-scale feature fusion module for feature fusion, and the fused features flow to the decoder through the skip connections of U-Net.
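As a concrete illustration of Eqs. (3) and (4), the following PyTorch sketch computes the difference information at a single resolution level. The tensor layout, the batching, and the name best_difference are our own assumptions rather than the official implementation.

```python
import torch

def best_difference(ii: torch.Tensor, mi: torch.Tensor) -> torch.Tensor:
    """Eqs. (3) and (4) for one resolution level.

    ii: features of the input images, shape (B, C, H, W).
    mi: frozen memory features of N normal samples, shape (N, C, H, W).
    Returns DI*, the element-wise difference to the most similar memory
    sample, shape (B, C, H, W).
    """
    # Eq. (3): element-wise L2 distance (absolute difference) between II
    # and every memory sample.
    diff = torch.abs(ii.unsqueeze(1) - mi.unsqueeze(0))       # (B, N, C, H, W)

    # Eq. (4): for every input image, pick the memory sample whose summed
    # difference is smallest, i.e. its most similar memory sample.
    scores = diff.sum(dim=(2, 3, 4))                          # (B, N)
    best = scores.argmin(dim=1)                               # (B,)
    di_star = diff[torch.arange(ii.size(0)), best]            # (B, C, H, W)
    return di_star

# The concatenated information is then the channel-wise concatenation of
# DI* and II, which the MSFF module consumes:
# ci = torch.cat([di_star, ii], dim=1)                        # (B, 2C, H, W)
```

In MemSeg this computation is carried out at the three resolutions (64 × 64, 32 × 32 and 16 × 16), giving CI_1, CI_2 and CI_3.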
Spatial Attention Maps. It is evident from observations and experiments (Section 4.6) that the best difference information DI* has an important influence on the localization of abnormal regions. To fully use the difference information, MemSeg extracts three spatial attention maps from DI*, which are used to reinforce the guesses of the best difference information on the abnormal regions.

For the features of the three different dimensions in DI*, the mean values are calculated in the channel dimension, and three feature maps of size 16 × 16, 32 × 32, and 64 × 64 are obtained. The 16 × 16 feature map is directly used as the spatial attention map M_3. After M_3 is up-sampled, its element-wise product with the 32 × 32 feature map is computed to obtain M_2; after M_2 is up-sampled, its element-wise product with the 64 × 64 feature map is computed to obtain M_1. As shown in Fig. 2, the spatial attention maps M_1, M_2 and M_3 weight the information obtained after CI_1, CI_2 and CI_3 are processed by the MSFF module, respectively. Mathematically, M_1, M_2 and M_3 are given as follows:

M_3 = (1/C_3) Σ_{i=1}^{C_3} DI*_{3,i}    (5)

M_2 = ((1/C_2) Σ_{i=1}^{C_2} DI*_{2,i}) ⊙ M_3^U    (6)

M_1 = ((1/C_1) Σ_{i=1}^{C_1} DI*_{1,i}) ⊙ M_2^U    (7)

where C_3 denotes the number of channels of DI*_3; DI*_{3,i} denotes the feature map of channel i in DI*_3; and M_3^U and M_2^U denote the feature maps obtained after up-sampling M_3 and M_2, respectively.
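Eqs. (5)–(7) amount to a channel-wise mean followed by a cascade of up-sampling and element-wise products. A short PyTorch sketch of this cascade is given below; the naming is ours and the bilinear up-sampling mode is an assumption.

```python
import torch
import torch.nn.functional as F

def spatial_attention_maps(di3, di2, di1):
    """di3, di2, di1: best difference information DI* at resolutions
    16x16, 32x32 and 64x64, each of shape (B, C, H, W)."""
    m3 = di3.mean(dim=1, keepdim=True)                 # Eq. (5), (B, 1, 16, 16)

    m3_up = F.interpolate(m3, scale_factor=2, mode="bilinear", align_corners=False)
    m2 = di2.mean(dim=1, keepdim=True) * m3_up         # Eq. (6), (B, 1, 32, 32)

    m2_up = F.interpolate(m2, scale_factor=2, mode="bilinear", align_corners=False)
    m1 = di1.mean(dim=1, keepdim=True) * m2_up         # Eq. (7), (B, 1, 64, 64)
    return m1, m2, m3
```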
3.3. Multi-scale feature fusion module

With the help of the memory module, MemSeg obtains the concatenated information CI composed of the input image information II and the best difference information DI*. The direct use of CI suffers from feature redundancy on the one hand; on the other hand, it increases the computational scale of the model and decreases the inference speed. Given the success of multi-scale feature fusion in target detection (Lin et al., 2017a; Chen et al., 2021), an intuitive idea is to fully fuse the visual information and semantic information in the concatenated information CI with the help of a channel attention mechanism and a multi-scale feature fusion strategy.

Our proposed multi-scale feature fusion module is shown in Fig. 4: the concatenated information CI_n (n = 1, 2, 3) is initially fused by a 3 × 3 convolutional layer that maintains the number of channels. Meanwhile, considering that CI_n is a simple concatenation of two kinds of information in the channel dimension, we use coordinate attention (CA) (Hou et al., 2021) to capture the relationship between the channels of CI_n. Then, for the features of different dimensions weighted by coordinate attention, we perform multi-scale information fusion: the feature maps of different dimensions are first aligned in resolution using up-sampling, then aligned in the number of channels using convolution, and finally an element-wise addition is executed to achieve multi-scale feature fusion. The fused features are weighted by the spatial attention maps M_n (n = 1, 2, 3) obtained in Section 3.2 and then fed to the final decoder.
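The sketch below outlines one branch of such a fusion block. It is a structural illustration only: SimpleChannelAttention is a squeeze-and-excitation-style stand-in for the coordinate attention block of Hou et al. (2021), and the layer sizes, names, and exact fusion order are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleChannelAttention(nn.Module):
    """Squeeze-and-excitation-style stand-in for the CA block."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))             # global average pool over H, W
        return x * w[:, :, None, None]

class MSFFBranch(nn.Module):
    """Fuse CI_n with an up-sampled coarser feature, then weight by M_n."""
    def __init__(self, channels, coarser_channels):
        super().__init__()
        self.fuse = nn.Sequential(                   # 3x3 conv keeping channel count
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.attn = SimpleChannelAttention(channels)
        self.align = nn.Conv2d(coarser_channels, channels, 1)   # channel alignment

    def forward(self, ci_n, coarser, m_n):
        x = self.attn(self.fuse(ci_n))
        up = self.align(F.interpolate(coarser, size=ci_n.shape[-2:],
                                      mode="bilinear", align_corners=False))
        return (x + up) * m_n                        # element-wise add, weight by M_n
```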
3.4. Training constraints

To ensure that the prediction of MemSeg is close to its ground truth, we use the L1 loss and the focal loss (Lin et al., 2017b) to guarantee the similarity of all pixels in the image space. Compared with the L2 loss, the segmentation images predicted under the constraint of the L1 loss retain more edge information. Meanwhile, the focal loss alleviates the problem of area imbalance between normal and abnormal regions in images and makes the model focus more on the segmentation of difficult samples, improving the accuracy of abnormal segmentation.

Specifically, MemSeg minimizes the L1 loss L_l1 and the focal loss L_f between the ground truth S of the simulated abnormal image and the prediction Ŝ of the model using (8) and (9), respectively:

L_l1 = ‖S − Ŝ‖_1    (8)

L_f = −α_t (1 − p_t)^γ log(p_t)    (9)

where p_t equals the predicted probability p of the pixel category when the ground truth of the corresponding pixel in S is 1, and p_t = 1 − p when the ground truth of the pixel in S is 0; α_t and γ are hyperparameters that control the degree of weighting.

Finally, combining all constraints, the following objective function is obtained:

L_all = λ_l1 L_l1 + λ_f L_f    (10)

where λ_l1 and λ_f are balancing hyper-parameters. In the training phase, the optimization goal of MemSeg is to minimize the objective function defined by Eq. (10).
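A compact PyTorch sketch of this objective, using the hyper-parameter values reported later in Section 4.2 (γ = 4, λ_l1 = 0.6, λ_f = 0.4) and an assumed α_t = 0.25, is given below; it illustrates Eqs. (8)–(10) and is not the released training code.

```python
import torch

def memseg_loss(pred, target, alpha_t=0.25, gamma=4.0, lambda_l1=0.6, lambda_f=0.4):
    """pred: predicted abnormality probability per pixel, shape (B, 1, H, W), in (0, 1).
    target: pixel-level ground truth S of the simulated abnormal image, same shape."""
    eps = 1e-6

    # Eq. (8): L1 loss between prediction and ground truth, averaged over pixels.
    l1 = torch.mean(torch.abs(target - pred))

    # Eq. (9): focal loss; p_t = p where S = 1 and p_t = 1 - p where S = 0.
    p_t = torch.where(target > 0.5, pred, 1.0 - pred).clamp(min=eps)
    focal = torch.mean(-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t))

    # Eq. (10): weighted combination of the two constraints.
    return lambda_l1 * l1 + lambda_f * focal
```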
4. Experiments

This section evaluates the performance of MemSeg, as well as the functionality of its different components, on the semi-supervised anomaly detection datasets: the MVTec AD dataset (Bergmann et al., 2019), the BeanTech AD dataset (Mishra et al., 2021), and a toy dataset.

4.1. Datasets and evaluation metric

The MVTec AD dataset (Bergmann et al., 2019) is mainly aimed at the task of semi-supervised surface anomaly detection. It comprises 15 categories, including 5 texture categories and 10 object categories; each category includes approximately 60 to 400 normal samples for training and a mixture of normal and abnormal images for testing, and the test set contains a variety of realistic anomalies with different textures and scales. The BeanTech dataset (Mishra et al., 2021) has 3 categories with 2540 images in total and likewise contains only normal images in the training set.

For the evaluation metric of defect detection, following the works in Zheng et al. (2021), Roth et al. (2021), Cohen and Hoshen (2020), Defard et al. (2021), Zavrtanik et al. (2021a), Li et al. (2021) and Song et al. (2021), we leverage image-level and pixel-level ROC-AUC for performance evaluation.
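Both metrics reduce to the standard ROC-AUC computed over different units (whole images versus individual pixels); a minimal scikit-learn sketch with hypothetical variable names is:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# image_scores: (M,) anomaly score per test image; image_labels: (M,) 0/1 labels.
# score_maps:   (M, H, W) pixel anomaly scores; gt_masks: (M, H, W) 0/1 ground truth.
def evaluate(image_scores, image_labels, score_maps, gt_masks):
    image_auc = roc_auc_score(image_labels, image_scores)
    pixel_auc = roc_auc_score(np.asarray(gt_masks).ravel(),
                              np.asarray(score_maps).ravel())
    return image_auc, pixel_auc
```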
Fig. 4. The multi-scale feature fusion module used by MemSeg. Because CI_n is a concatenation of two kinds of information in the channel dimension and comes from different depths of the encoder, carrying different semantic and visual information, MemSeg uses the channel attention CA-Block and a multi-scale strategy for feature fusion.
4.2. Implementation details

MemSeg is based on the AE model and uses ResNet18 (He et al., 2016) as the encoder. For the decoder part, corresponding to Fig. 2, the up-sampling layer contains a bilinear interpolation layer and a basic convolution block consisting of a convolution layer, batch normalization, and a ReLU activation function. The Conv Layer contains two stacked basic convolution blocks; only the last Conv Layer contains one basic convolution block and a 2-channel convolution layer. The training process of MemSeg is carried out for 2700 iterations, the size of the input image is set to 256 × 256, and the batch size is set to 8, containing 4 normal samples and 4 simulated abnormal samples. When performing anomaly simulation, most categories have an equal probability of using textural anomaly simulation and structural anomaly simulation. We use a grid search to set the hyper-parameters: the learning rate is set to 0.04; γ in the focal loss is set to 4; λ_l1 and λ_f in the objective function are set to 0.6 and 0.4, respectively.

For most categories in both datasets, we randomly select 30 memory samples from the training set to generate the memory information. However, since the orientation of the screws in the MVTec AD dataset is randomly arranged, we increase the number of their memory samples for better feature matching; the sample size of toothbrushes in the training set is very small, so we use only 10 memory samples while ensuring adequate training samples. MemSeg obtains the anomaly score of each pixel in the image in an end-to-end approach, and the mean of the scores of the top 100 most abnormal pixels in the image is used as the anomaly score at the image level.
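The image-level score described above is simply the mean of the 100 largest pixel scores; a one-function sketch (our own naming) is:

```python
import torch

def image_level_score(pixel_scores: torch.Tensor, k: int = 100) -> torch.Tensor:
    """pixel_scores: per-pixel anomaly scores of one image, shape (H, W).
    Returns the mean of the k most abnormal pixels."""
    topk = torch.topk(pixel_scores.flatten(), k).values
    return topk.mean()
```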
4.3. Comparison with existing methods

This subsection compares MemSeg with different methods. The AUC scores of the different methods are listed in Tables 2 and 3. Our method outperforms most existing methods, which demonstrates its effectiveness. For the MVTec AD dataset, at image-level anomaly detection, the mean AUC score of MemSeg for the texture categories and the object categories is 99.8% and 99.4%, respectively, which is better than all the compared models, and MemSeg also has outstanding defect localization performance at the pixel level. For the BeanTech AD dataset, although the anomaly localization of MemSeg at the pixel level is not as good as that of PaDiM (97.1% vs 97.3%), the AUC score of MemSeg at the image level is better than that of all the models in the experiment, which indicates that our model still has an accurate anomaly detection capability on other datasets.

Meanwhile, as shown in Fig. 5, the anomaly localization of MemSeg at the pixel level is closer to the ground truth (GT), and the boundary between the normal and abnormal regions is more precise. This benefits from the end-to-end learning approach adopted by MemSeg, in which the training of the model is directly guided by the pixel-level ground truth of the simulated abnormal samples.

4.4. Impact of the anomaly simulation strategy

To evaluate the effectiveness of the proposed anomaly simulation strategy for image defect detection, we remove the textural anomaly simulation, the structural anomaly simulation, and the foreground enhancement strategy in training, respectively, and compare these three cases with our complete strategy. Table 4 reports the AUC scores at the image and pixel level for these four experiments. The AUC scores decrease when any component of the anomaly simulation strategy is removed, which verifies that our anomaly simulation strategy is not only theoretically interpretable but also performs excellently in experimental validation. Additionally, to justify the choice of Perlin noise for generating anomaly samples, we replaced the Perlin noise with random rectangular noise and empirically limited the length and width of the rectangles to between 15 and 100. As shown in Table 4 (Rect. Noise), using rectangular noise for anomaly region generation has only a slight impact on the performance of defect detection. This shows that even when the shape of the defects used in the training process is significantly different from the shape of real-world defects, MemSeg can still adapt well to defect detection in real scenes, which further demonstrates the robustness of the approach of introducing a self-supervised task to solve the semi-supervised task.

To evaluate the role of the simulated abnormal samples more fully, we also want to know the data distribution of the simulated and real abnormal samples after the training of the model, so we visualize the encoder outputs of simulated abnormal samples, real abnormal samples in the test set, and normal samples using t-SNE (Van der Maaten and Hinton, 2008). As shown in Fig. 6, for most categories there is some overlap in the spatial distribution of simulated abnormal samples and real abnormal samples, while abnormal samples are separated from normal samples, which further proves the validity of our anomaly simulation strategy.
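Such a visualization can be produced with off-the-shelf t-SNE; the sketch below assumes the pooled bottleneck features have already been collected into arrays, and the variable names are ours:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels):
    """features: (M, D) pooled encoder features; labels: (M,) in
    {0: normal, 1: simulated abnormal, 2: real abnormal}."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for value, name in [(0, "normal"), (1, "simulated abnormal"), (2, "real abnormal")]:
        pts = emb[np.asarray(labels) == value]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
    plt.legend()
    plt.show()
```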
It is important to emphasize that the features we visualize are generated only at the bottleneck structure of U-Net. Although the separability of features in some categories is not strong in two-dimensional
Table 2
Comparison between MemSeg and different methods on the MVTec AD dataset in terms of ROC-AUC % with the format of (Image-level, Pixel-level).
Category | SPADE (Cohen and Hoshen, 2020) | PaDiM (Defard et al., 2021) | DRAEM (Zavrtanik et al., 2021a) | CutPaste (Li et al., 2021) | P-SVDD (Yi and Yoon, 2020) | P-SVDD-C (Ahn and Kim, 2022) | Ours
Texture
Carpet (–, 97.5) (–, 98.9) (97.0, 95.5) (92.9, 92.6) (93.1, 98.3) (94.4, 92.9) (99.6, 99.2)
Grid (–, 93.7) (–, 94.9) (99.9, 99.7) (94.6, 96.2) (99.9, 97.5) (95.6, 97.2) (100, 99.3)
Leather (–, 97.6) (–, 99.1) (100, 98.6) (90.9, 97.4) (100, 99.5) (96.1, 98.2) (100, 99.7)
Tile (–, 87.4) (–, 91.2) (99.6, 99.2) (97.8, 91.4) (93.4, 90.5) (93.5, 91.9) (100, 99.5)
Wood (–, 85.5) (–, 93.6) (99.1, 96.4) (96.5, 90.8) (98.6, 95.5) (98.0, 92.1) (99.6, 98.0)
Average (–, 92.3) (–, 95.6) (99.1, 97.9) (94.5, 93.7) (97.0, 96.3) (95.5, 94.5) (99.8, 99.1)
Object
Bottle (–, 98.4) (–, 98.1) (99.2, 99.1) (98.6, 98.1) (98.3, 97.6) (99.5, 98.6) (100, 99.3)
Cable (–, 97.2) (–, 95.8) (91.8, 94.7) (90.3, 96.8) (80.6, 90) (97.8, 97.6) (98.2, 97.4)
Capsule (–, 99.0) (–, 98.3) (98.5, 94.3) (76.7, 95.8) (96.2, 97.4) (88.7, 96.3) (100, 99.3)
Hazelnut (–, 99.1) (–, 97.7) (100, 99.7) (92.0, 97.5) (97.3, 97.3) (97.9, 98.2) (100, 98.8)
Metal nut (–, 98.1) (–, 96.7) (98.7, 99.5) (94.0, 98.0) (99.3, 93.1) (96.5, 98.1) (100, 99.3)
Pill (–, 96.5) (–, 94.7) (98.9, 97.6) (86.1, 95.1) (92.4, 95.7) (91.9, 92.4) (99.0, 99.5)
Screw (–, 98.9) (–, 97.4) (93.9, 97.6) (81.3, 95.7) (86.3, 96.7) (83.3, 95.3) (97.8, 98.0)
Toothbrush (–, 97.9) (–, 98.7) (100, 98.1) (100, 98.1) (98.3, 98.1) (95.6, 96.0) (100, 99.4)
Transistor (–, 94.1) (–, 97.2) (93.1, 90.9) (91.5, 97) (95.5, 93.0) (92.1, 93.5) (99.2, 97.3)
Zipper (–, 96.5) (–, 98.2) (100, 98.8) (97.9, 95.1) (99.4, 99.3) (95.9, 96.0) (100, 98.8)
Average (–, 97.57) (–, 97.3) (97.4, 97.0) (90.8, 96.7) (94.3, 95.8) (93.9, 96.2) (99.4, 98.7)
Average (85.5, 96.0) (95.3, 96.7) (98.0, 97.3) (95.2, 96.0) (92.1, 95.7) (94.4, 95.6) (99.56, 98.84)
Table 3
Comparison between our method and different methods on the BeanTech AD dataset in terms of ROC-AUC % with the format of (Image-level, Pixel-level).
Category PatchCore (Roth et al., 2021) SPADE (Cohen and Hoshen, 2020) PaDiM (Defard et al., 2021) P-SVDD (Yi and Yoon, 2020) Ours
01 (90.9, 95.5) (91.4, 97.3) (99.8, 97.0) (95.7, 91.6) (98.7, 98.9)
02 (79.3, 94.7) (71.4, 94.4) (82.0, 96.0) (72.1, 93.6) (87.0, 96.2)
03 (99.8, 99.3) (99.9, 99.1) (99.4, 98.8) (82.1, 91.0) (99.4, 96.3)
Mean (90.0, 96.5) (87.6, 96.9) (93.7, 97.3) (83.3, 92.1) (95.0, 97.1)
Fig. 5. Comparison of MemSeg with PaDiM (Defard et al., 2021) and SPADE (Cohen and Hoshen, 2020) for anomaly localization on the MVTec AD dataset (before thresholding).
Our model has a more precise judgment of the abnormal regions.
Fig. 6. Separability display of normal samples, simulated abnormal samples, and real abnormal samples.
space, our model can still be corrected in the decoder part using the information from the skip connections. Besides, MemSeg does not strictly require the distribution of the simulated abnormal samples to be the same as that of the real abnormal samples. MemSeg is based on the semi-supervised learning framework, and the reason we introduce the simulated abnormal samples during training is simply to make the model explicitly learn the difference between normal and non-normal, so that the model can better summarize the patterns of normal samples.
Table 4
Evaluating the components of our anomaly simulation strategy on the MVTec AD
dataset. The AUC scores are reported for different strategies.
Metric | w/o Texture | w/o Structure | w/o Foreground | Rect. Noise | MemSeg
Image-level | 98.80 | 98.77 | 99.34 | 98.90 | 99.56
Pixel-level | 97.31 | 98.09 | 98.40 | 98.13 | 98.84
Table 5
AUC scores of MemSeg on the MVTec AD dataset when using different loss functions.
L1 loss | Focal loss (Lin et al., 2017b) | Image-level | Pixel-level
✓ |   | 84.82 | 73.38
  | ✓ | 98.92 | 98.64
✓ | ✓ | 99.56 | 98.84

Fig. 8. Effects of the different numbers of memory samples on AUC scores. The vertical coordinates report the mean AUC scores of the 13 categories in the MVTec AD dataset, excluding screw and toothbrush.
Fig. 7. The effect of L1 loss on anomaly localization. When L1 loss and focal loss are
used simultaneously, the edges of the segmentation images obtained by MemSeg are
more precise.
Fig. 9. The generation process of spatial attention maps M_1, M_2, and M_3. This process visually demonstrates the effectiveness of the memory module, multi-scale strategy, and spatial attention for defect localization of images.
Table 6
The AUC scores of MemSeg on the MVTec AD dataset when using different module
components.
Memory | Multi-scale | Spatial attention | Coordinate attention | Image-level | Pixel-level
96.42 96.08
✓ 98.41 98.27
✓ ✓ ✓ 99.08 98.60
✓ ✓ ✓ 99.26 98.67
✓ ✓ ✓ 98.96 98.44
✓ ✓ ✓ ✓ 99.56 98.84
Table 7
AUC scores of PatchCore, SPADE, PaDiM, and MemSeg on the toy dataset.
PatchCore (Roth et al., 2021) SPADE (Cohen and Hoshen, 2020) PaDiM (Defard et al., 2021) Ours
Image-level 99.75 95.75 99.30 99.83
Pixel-level 99.33 98.72 98.70 99.77
4.7. Evaluation with a toy dataset

The noise generated by MemSeg's anomaly simulation strategy is irregular. To verify the ability of MemSeg to generalize to regular noise, we generate a toy dataset using the normal samples in the test set of the MVTec AD dataset. As shown in Fig. 10, the shapes of the generated noise are rectangle, triangle, lightning bolt, star, heart, and circle, and the size, color, position, angle, and aspect ratio of the noise are random. The abnormal samples in the toy dataset are never seen in the training phase of MemSeg. We apply the trained model directly to the toy dataset and compare its performance with three models. The AUC scores of the different models are shown in Table 7. MemSeg achieves precise localization of the abnormal regions with an AUC score close to 100%, which further demonstrates the strong generalization ability of our model in localizing unknown anomalies.

4.8. Inference speed

Compared to reconstruction-based methods (Bergmann et al., 2018; Tang et al., 2020; Akcay et al., 2018; Schlegl et al., 2017a; Zenati et al., 2018; Schlegl et al., 2017b), embedding-based methods (Zheng et al., 2021; Roth et al., 2021; Cohen and Hoshen, 2020; Defard et al., 2021) achieve better performance in semi-supervised image surface defect detection, but this kind of model needs to perform complex feature matching in the inference phase, which makes it difficult to apply in industrial scenarios with high real-time requirements. Therefore, we are also interested in the inference speed of MemSeg. Our experiments are carried out on a PC with an NVIDIA RTX 3090 GPU. Overall, the total number of parameters in MemSeg is 80.12 MB, and the memory requirement is 212.15 MB during inference. Thanks to the fully convolutional network (FCN) structure (Long et al., 2015), MemSeg is friendly to parallel computing and has good scalability. For the inference speed, we calculate the time consumption of PaDiM (Defard et al., 2021) and SPADE (Cohen and Hoshen, 2020) in the inference phase, and the time to process one image is 0.319 s and 0.339 s for these two models, respectively, while the time to process one image

The above experiments demonstrate the effectiveness of MemSeg, as well as the robustness of using a self-supervised task to solve the semi-supervised task. Although MemSeg has good anomaly detection performance on several datasets, there are still some limitations. As shown in Table 2, for the MVTec AD dataset, at the image level, the category with the worst anomaly detection effect is the screw. On the one hand, our model relies more on the spatial alignment of the detection targets, but the orientation of the screws in the dataset is randomly arranged, which makes it difficult to generate effective difference information; on the other hand, some abnormal regions of the screws in the test set are small and difficult to distinguish, so the model is prone to misclassification, which is also reflected in the other models. At the pixel level, the category with the worst anomaly detection effect is the transistor, because when the transistor has global logical anomalies such as misalignment or missing parts, it is difficult for MemSeg to give the accurate location of the anomalies. Meanwhile, although MemSeg uses the foreground enhancement strategy in anomaly simulation, it still makes false positive judgments on background regions during inference, such as the capsule in Fig. 11, which places higher demands on the quality of the dataset. In the future, we can try to address these limitations by increasing the image resolution, adding modules for global relationship modeling, and using better data preprocessing and postprocessing methods.

5. Conclusion

In this paper, we propose an end-to-end memory-based segmentation network to detect surface defects in industrial products. Considering the small intra-class variation of products in the same production line, from the perspective of differences we propose a well-designed anomaly simulation strategy for self-supervised learning of the model, which accounts for the target foreground and for textural and structural anomalies; from the perspective of commonalities we propose a memory module and design an efficient feature-matching algorithm. Through the two points above, combined with the multi-scale feature fusion module and the spatial attention module, we effectively transform
semi-supervised anomaly detection into an end-to-end semantic segmentation task, making semi-supervised image surface defect detection more flexible. Simple but high-performing, MemSeg achieves SOTA performance while meeting the real-time requirements of industrial scenarios. In future work, we are interested in extending this paradigm to address semi-supervised anomaly detection tasks in more scenarios, such as anomaly detection in 3D and medical scenes.

Fig. 11. Analysis of the limitations of MemSeg. Using the screw, transistor, and capsule as examples, we demonstrate the limitations of MemSeg in fine-grained anomaly localization, global judgment, and false positives in the background, respectively.

CRediT authorship contribution statement

Minghui Yang: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization. Peng Wu: Methodology, Writing – review & editing, Supervision. Hui Feng: Methodology, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data is available online.

Acknowledgment

Thanks are due to Prof. Jing Liu for the support of the experiment and valuable discussions.

References

Ahn, J.Y., Kim, G., 2022. Application of optimal clustering and metric learning to patch-based anomaly detection. Pattern Recognit. Lett.
Akcay, S., Atapour-Abarghouei, A., Breckon, T.P., 2018. GANomaly: Semi-supervised anomaly detection via adversarial training. In: Asian Conference on Computer Vision, pp. 622–637.
Bergmann, P., Fauser, M., Sattlegger, D., Steger, C., 2019. MVTec AD - a comprehensive real-world dataset for unsupervised anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9592–9600.
Bergmann, P., Löwe, S., Fauser, M., Sattlegger, D., Steger, C., 2018. Improving unsupervised defect segmentation by applying structural similarity to autoencoders. arXiv preprint arXiv:1807.02011.
Chen, S., Cheng, Z., Zhang, L., Zheng, Y., 2021. SnipeDet: Attention-guided pyramidal prediction kernels for generic object detection. Pattern Recognit. Lett. 152, 302–310.
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2017. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848.
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A., 2014. Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613.
Cohen, N., Hoshen, Y., 2020. Sub-image anomaly detection with deep pyramid correspondences. arXiv preprint arXiv:2005.02357.
Defard, T., Setkov, A., Loesch, A., Audigier, R., 2021. PaDiM: a patch distribution modeling framework for anomaly detection and localization. In: International Conference on Pattern Recognition, pp. 475–489.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L., 2009. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hou, Q., Zhou, D., Feng, J., 2021. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13713–13722.
Li, C.L., Sohn, K., Yoon, J., Pfister, T., 2021. CutPaste: self-supervised learning for anomaly detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9664–9674.
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2017a. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017b. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
Mishra, P., Verk, R., Fornasier, D., Piciarelli, C., Foresti, G.L., 2021. VT-ADL: a vision transformer network for image anomaly detection and localization. arXiv preprint arXiv:2104.10036.
Perlin, K., 1985. An image synthesizer. ACM SIGGRAPH Comput. Graph. 19 (3), 287–296.
Pirnay, J., Chai, K., 2021. Inpainting transformer for anomaly detection. arXiv preprint arXiv:2104.13897.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
Roth, K., Pemula, L., Zepeda, J., Schölkopf, B., Brox, T., Gehler, P., 2021. Towards total recall in industrial anomaly detection. arXiv preprint arXiv:2106.08265.
Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G., 2017a. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International Conference on Information Processing in Medical Imaging, pp. 146–157.
Schlegl, T., Seebock, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G., 2017b. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International Conference on Information Processing in Medical Imaging, pp. 146–157.
Song, J., Kong, K., Park, Y.I., Kim, S.G., Kang, S.J., 2021. AnoSeg: anomaly segmentation network using self-supervised learning. arXiv preprint arXiv:2110.03396.
Tang, T.W., Kuo, W.H., Lan, J.H., Ding, C.F., Hsu, H., Young, H.T., 2020. Anomaly detection neural network with dual auto-encoders GAN and its industrial inspection applications. Sensors 20 (12), 3336.
Van der Maaten, L., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9 (11).
Wheeler, J.B., Karimi, H.A., 2021. A semantically driven self-supervised algorithm for detecting anomalies in image sets. Comput. Vis. Image Underst. 213.
Yi, J., Yoon, S., 2020. Patch SVDD: patch-level SVDD for anomaly detection and segmentation. In: Proceedings of the Asian Conference on Computer Vision.
Zavrtanik, V., Kristan, M., Skočaj, D., 2021a. DRAEM - a discriminatively trained reconstruction embedding for surface anomaly detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8330–8339.
Zavrtanik, V., Kristan, M., Skočaj, D., 2021b. Reconstruction by inpainting for visual anomaly detection. Pattern Recognit. 112.
Zenati, H., Foo, C.S., Lecouat, B., Manek, G., Chandrasekhar, V.R., 2018. Efficient GAN-based anomaly detection. arXiv preprint arXiv:1802.06222.
Zheng, Y., Wang, X., Deng, R., Bao, T., Zhao, R., Wu, L., 2021. Focus your distribution: coarse-to-fine non-contrastive learning for anomaly detection and localization. arXiv preprint arXiv:2110.04538.