Towards Open-Set Object Detection and Discovery

Jiyang Zheng⋆† Weihao Li† Jie Hong⋆† Lars Petersson† Nick Barnes⋆

⋆ The Australian National University   † Data61-CSIRO


Abstract

With the human pursuit of knowledge, open-set object detection (OSOD) has been designed to identify unknown objects in a dynamic world. However, an issue with the current setting is that all the predicted unknown objects share the same category, “unknown”, which requires incremental learning via a human-in-the-loop approach to label novel classes. In order to address this problem, we present a new task, namely Open-Set Object Detection and Discovery (OSODD). This new task aims to extend the ability of open-set object detectors to further discover the categories of unknown objects based on their visual appearance without human effort. We propose a two-stage method that first uses an open-set object detector to predict both known and unknown objects. Then, we study the representation of the predicted objects in an unsupervised manner and discover new categories from the set of unknown objects. With this method, a detector is able to detect objects belonging to known classes and define novel categories for objects of unknown classes with minimal supervision. We show the performance of our model on the MS-COCO dataset under a thorough evaluation protocol. We hope that our work will promote further research towards a more robust real-world detection system.

Figure 1. A visual comparison of object detection tasks: (a) closed-set object detection, (b) open-set object detection (OSOD) and (c) open-set object detection and discovery (OSODD). In closed-set detection, objects from unseen classes are ignored or incorrectly classified into the set of known classes, while in open-set object detection, unknown objects are localised but share the same “unknown” category. Our task aims to detect objects of known classes and discover novel visual categories for the identified objects of unknown classes, which provides better scene understanding and a scalable learning paradigm. (Example labels in the figure: Dog, Bird, Cat; Unknown (Vase & Potted Plant); Novel Class 1 (Potted Plant), Novel Class 2 (Vase).)

1. Introduction

Object detection is the task of localising and classifying objects in an image. In recent years, deep learning approaches have advanced the detection models [3, 4, 15, 20, 37, 38, 45] and achieved remarkable progress. However, these methods work under a strong assumption that all object classes are known at the training phase. As a result of this assumption, object detectors would incorrectly treat objects of unknown classes as background or classify them as belonging to the set of known classes [11] (see Fig. 1(a)).

To relax the above closed-set condition, open-set object detection (OSOD) [11, 24, 32] considers a realistic scenario where test images might contain novel classes that did not appear during training. OSOD aims at jointly detecting objects from the set of known classes and localising objects that belong to an unknown class. Although OSOD has improved the practicality of object detection by enabling detection of instances of unknown classes, there is still the issue that all identified objects of an unknown class share the same category as “unknown” (see Fig. 1(b)). Additional human annotation is required to incrementally learn novel object categories [24].

Consider a child who is visiting a zoo for the first time. The child can recognise some animals that are seen and learned before, for example, ‘rabbit’ or ‘bird’, while the child might not recognise the species of many other rarely seen animals, like ‘zebra’ and ‘giraffe’. After observing, the child’s perception system will learn from these previously unseen animals’ appearances and cluster them into different categories even without being told what species they are.

In this work, we consider a new task, where we aim to localise objects of both known and unknown classes, assign pre-defined category labels to known objects, and discover new categories for objects of unknown classes (see Fig. 1(c)). We term this task Open-Set Object Detection and Discovery (OSODD). We motivate our proposed task, OSODD, by suggesting that it is better suited to extracting information from images: new category discovery provides additional knowledge of data belonging to classes not seen before, helping intelligent vision-based systems to handle more realistic use cases.

We propose a two-stage framework to tackle the problem of OSODD. First, we leverage the ability of an open-set object detector to detect objects of known classes and identify objects of unknown classes; the predicted proposals of objects of known and unknown classes are saved to a memory buffer. Second, we explore the recurring pattern of all objects and discover new categories from objects of unknown classes. Specifically, we develop a self-supervised contrastive learning approach with domain-agnostic data augmentation and semi-supervised k-means clustering for category discovery.

Our contributions:

• We formalise the task Open-Set Object Detection and Discovery (OSODD), which enables a richer understanding within real-world detection systems.

• We propose a two-stage framework to tackle this problem, and we present a comprehensive protocol to evaluate the object detection and category discovery performance.

• We propose a category discovery method in our framework using domain-agnostic augmentation, contrastive learning and semi-supervised clustering. The novel method outperforms other baseline methods in experiments.

Table 1. Comparisons of different Object Detection and Discovery tasks. OSOD: open-set object detection; ODL: object discovery and localisation. Loc means localise the objects of interest; Cat means discover novel categories.

Task           Dataset    Known classes   Unknown classes
ODL            Open-Set   Non-Action      Loc/Cat
OSOD           Open-Set   Detect          Loc
OSODD (Ours)   Open-Set   Detect          Loc/Cat

2. Related Work

Open-Set Recognition. Compared with closed-set learning, which assumes that only previously known classes are present during testing, open-set learning assumes the co-existence of known and unknown classes. Scheirer et al. [40] first introduced the problem of open-set recognition with incomplete knowledge at training time, i.e., unknown classes can appear during testing. They developed a classifier in a one-vs-rest setting, which enables the rejection of unknown samples. [22, 41] extend the framework in [40] to a multi-class classifier using probabilistic models with extreme value theory to mitigate the fading confidence of the classifier. Recently, Liu et al. [31] proposed a deep metric learning method to identify unseen classes for imbalanced datasets. Self-supervised learning approaches [14, 35, 43] have been explored to minimise external supervision.

Miller et al. [32] first investigated the utility of label uncertainty in object detection under open-set conditions using dropout sampling. Dhamija et al. [11] defined the problem of open-set object detection (OSOD) and conducted a study on how well traditional object detectors avoid classifying objects of unknown classes into one of the known classes. An evaluation metric is also provided to assess the performance of the object detector under the open-set condition.

Open-World Recognition. The open-world setting introduces a continual learning paradigm that extends the open-set condition by assuming new semantic classes are introduced gradually at each incremental time step. Bendale et al. [2] first formalise the open-world setting for image recognition and propose an open-set classifier using the nearest non-outlier algorithm. The model evolves when new labels for the unknown are provided by re-calibrating the class probabilities.

Joseph et al. [24] transfer the open-world setting to an object detection system and propose the task of open-world object detection (OWOD). The model uses example replay to make the open-set detector learn new classes incrementally without forgetting the previous ones. Neither the OWOD nor the OSOD model can explore the semantics of the identified unknown objects, and extra human annotation is required to learn novel classes incrementally. In contrast, our OSODD model can discover novel category labels for objects of unknown classes without human effort.

Novel Category Discovery. The novel category discovery task aims to identify similar recurring patterns in an unlabelled dataset. In image recognition, it was earlier viewed as an unsupervised clustering problem. Xie et al. [46] proposed a deep embedding network that can cluster data and at the same time learn a data representation. Han et al. [18] formulated the task of novel class discovery (NCD), which clusters unlabelled images into novel categories using deep transfer clustering. The NCD setting assumes that the training set contains both labelled and unlabelled data; the knowledge learned on labelled data can be transferred to the targeted unlabelled data for category discovery [13, 17, 23, 48, 52].

Object discovery and localisation (ODL) [6, 9, 27–29, 36] aims to jointly discover and localise dominant objects from an image collection with multiple object classes in an unsupervised manner. Lee and Grauman [27] used object-graph and appearance features for unsupervised discovery. Rambhatla et al. [36] assumed partial knowledge of class labels and conducted the discovery leveraging a dual memory module. Compared to ODL, our OSODD both performs detection on previously known classes and discovers novel categories for unknown objects, which provides a more comprehensive scene understanding.

Please refer to Tab. 1 for the summarised differences between our setting and other similar settings in the object detection problem.

3. Task Format

In this section, we formulate the task of Open-Set Object Detection and Discovery (OSODD). We have a set of known object classes C_k = {C_1, C_2, · · · , C_m}, and there exists a set of unknown visual categories C_u = {C_{m+1}, C_{m+2}, · · · , C_{m+n}}, where C_k ∩ C_u = ∅. The training dataset contains objects from C_k, and the testing dataset contains objects from C_k ∪ C_u. An object instance I is represented by I = [c, x, y, w, h], denoting the class label (c ∈ C_k or C_u), the top-left x, y coordinates, and the width and height of the object bounding box, respectively. A model is trained to localise all objects of interest. Then, it classifies objects of a known class as one of C_k and clusters objects of an unknown class into novel visual categories C_u.

4. Our Approach

This section describes our approach for tackling OSODD, beginning with an overview of our framework. We propose a generic framework consisting of two main modules, Object Detection and Retrieval (ODR) and Object Category Discovery (OCD) (see Fig. 2).

The ODR module uses an open-set object detector with a dual memory buffer for object instance detection and retrieval. The detector predicts objects of known classes with their semantic labels from C_k and the location information, whereas the unknown objects are localised but with no semantic information available. We store the predicted objects in the memory buffer [36], which is used to explore novel categories. The buffer is divided into two parts: known memory and working memory. The known memory contains predicted objects of known classes with semantic labels; the working memory stores all currently identified objects of unknown classes without categorical information. The model studies the recurring pattern of the objects from the memory buffer and discovers novel categories in the working memory. We assign the predicted objects of unknown classes from the detector with novel category labels using the discovered categories. The visualisation is shown in Fig. 4.

The OCD module explores the working memory to discover new visual categories. It consists of an encoder component as the feature extractor and a discriminator which clusters the object representations. To train the encoder, we first retrieve the predicted objects of known classes saved in the known memory and the identified objects of unknown classes saved in the working memory. Then, these instance samples are transformed using class-agnostic augmentation to create a generalised view over the data [10, 26, 51]. We use unsupervised contrastive learning where the predicted labels for the objects of known classes are ignored; the pairwise contrastive loss [33] penalises dissimilarity of the same object in different views regardless of the semantic information. The contrastive learning enables the encoder to learn a more discriminative feature representation in the latent space [7, 19]. Lastly, with the learned feature space from the encoder, the discriminator clusters the object embeddings into novel categories using the constrained k-means clustering algorithm [44].

[Figure 2: pipeline diagram — image set → open-set object detection (unknown-aware RPN, ROI head) → memory buffer (known objects / unknown objects) → category discovery (object-wise mix-up augmentation, unsupervised contrastive learning, constrained k-means clustering) → novel categories; the OSOD prediction is updated to the OSODD prediction with novel category labels.]

Figure 2. Illustration of the two-stage method for Open-Set Object Detection and Discovery (OSODD). The first stage includes detecting objects of known classes and identifying objects of unknown classes using an open-set object detector. The instances of unknown classes are saved into the working memory for category discovery. The instances of known classes are saved into the known memory with their predicted semantic categories to assist the representation learning and clustering. The second stage pre-processes the objects from the memory buffer in an unsupervised manner: the representations of these saved objects are first learned in the latent space by contrastive learning, followed by a constrained k-means clustering used to find the novel categories beyond the known classes. Lastly, we update the open-set detection predictions with the novel category labels to generate the final OSODD prediction (see visualisations in Figs. 3 and 4).

4.1. Object Detection and Retrieval

Open-Set Object Detector. An open-set object detector predicts the location of all objects of interest. Then it classifies the objects into semantic classes and identifies the unseen objects as unknown (see ‘OSOD’ in Fig. 3).

We use the Faster R-CNN architecture [38] as the baseline model, following ORE [24]. Leveraging the class-agnostic property of the region proposal network, we utilise an unknown-aware RPN to identify unknown objects. The unknown-aware RPN labels the proposals that have high scores but do not overlap with any ground-truth bounding box as potential unknown objects. To learn a more discriminative representation for each class, we use a prototype-based contrastive loss on the feature vectors f_c generated by an intermediate layer in the ROI pooling head. A class prototype p_i is computed as the moving average of the class instance representations, and the features f_c of objects will keep approaching their class prototype in the latent space. The objective is formulated as:

    ℓ_pcl(f_c) = Σ_{i=0}^{C} ℓ(f_c, p_i)
    ℓ(f_c, p_i) = ‖f_c, p_i‖                 if i = c
                = max(0, Δ − ‖f_c, p_i‖)     otherwise          (1)

where f_c is the feature vector of class c, p_i is the prototype of class i, ‖f, p‖ measures the distance between feature vectors, and Δ is a fixed value that defines the maximum distance for dissimilar pairs.
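
As a concrete illustration, the prototype term can be written as a short PyTorch-style function. This is a minimal sketch rather than the authors' implementation: it assumes a Euclidean distance for ‖·, ·‖, sums the per-prototype terms over all class prototypes, and uses an illustrative momentum value for the moving-average prototype update.

```python
import torch

def prototype_contrastive_loss(f_c: torch.Tensor, prototypes: torch.Tensor,
                               c: int, delta: float = 1.0) -> torch.Tensor:
    """Sketch of Eq. (1): pull the ROI feature f_c towards its own class
    prototype and push it at least `delta` away from every other prototype.

    f_c:        (D,) feature of an object predicted as class c
    prototypes: (C, D) class prototypes p_i (running means of class features)
    """
    dists = torch.norm(prototypes - f_c.unsqueeze(0), dim=1)  # ||f_c, p_i|| for every i
    pull = dists[c]                                           # i == c term
    mask = torch.ones_like(dists, dtype=torch.bool)
    mask[c] = False
    push = torch.clamp(delta - dists[mask], min=0.0).sum()    # max(0, Δ − ||f_c, p_i||), i ≠ c
    return pull + push

def update_prototype(p_i: torch.Tensor, f_c: torch.Tensor,
                     momentum: float = 0.9) -> torch.Tensor:
    """Moving-average prototype update (momentum value is illustrative)."""
    return momentum * p_i + (1.0 - momentum) * f_c.detach()
```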

The total loss for the region-of-interest pooling is defined as:

    ℓ_roi = α_pcl · ℓ_pcl + α_cls · ℓ_cls + α_reg · ℓ_reg          (2)

where α_pcl, α_cls and α_reg are positive adjustment ratios, and ℓ_cls, ℓ_reg are the regular classification and regression losses.

Given the encoded feature f_c, we use an open-set classifier with an energy-based model [25] to distinguish the objects of known and unknown classes. The trained model is able to assign low energy values to known data and thus creates dissimilar representations of the distributions for the objects of known and unknown classes. When new known-class annotations are made available, we utilise example replay to alleviate forgetting the previous classes.

Figure 3. Comparison between OSOD and OSODD predictions. OSODD (right) extends the OSOD (left) prediction by assigning novel category labels to instances of an unknown class.

Memory Module. As described above, we propose to use a dual memory module to store predicted instances for category discovery. The open-set detector detects the objects of interest with their locations and the predicted label. The objects of a known class I_k are saved into the known memory M_k with their semantic labels c ∈ C_k. These objects are treated as a labelled dataset for the following category discovery. The identified objects of an unknown class I_u are stored in the working memory M_w. We perform the category discovery on M_w, which aims to assign every instance in M_w a novel category label c ∈ C_u. We update the open-set object detector's prediction using the novel category labels and produce our final OSODD predictions.
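
A minimal sketch of how such a dual buffer could be organised is given below; the class and attribute names are illustrative and not taken from the paper's code.

```python
from typing import List, Tuple
import torch

class DualMemory:
    """Illustrative dual memory buffer: the known memory M_k keeps detected
    known-class objects with their predicted labels, while the working memory
    M_w keeps identified unknown objects without any category information."""

    def __init__(self) -> None:
        self.known: List[Tuple[torch.Tensor, str]] = []   # M_k: (object crop/feature, label in C_k)
        self.working: List[torch.Tensor] = []             # M_w: objects predicted as "unknown"

    def add(self, obj: torch.Tensor, label: str) -> None:
        if label == "unknown":
            self.working.append(obj)          # candidate for category discovery
        else:
            self.known.append((obj, label))   # labelled set used to constrain the clustering
```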

4.2. Object Category Discovery

Category Number Estimation. Our category discovery approach requires an estimation of the number of potential classes. We use the class estimation method from [18], one of the most commonly used techniques for image-level novel category discovery. The model uses a k-means clustering method to estimate the category number in the target dataset without any parametric learning. The generalisation ability of the method towards our problem is evaluated in Sec. 6.2.1.

Representation Learning. Representation learning aims to learn more discriminative features for the input samples. We adapt contrastive learning [33] and utilise objects from both the known and the working memory to help the network learn an informative embedding space. The learning is conducted in an unsupervised manner. Following [19], we build a dynamic dictionary to store samples. The network is trained to maximise similarity for positive pairs (an object and its augmented version) while minimising similarity for negative pairs (different object instances) in the embedding space. For an object representation, the contrastive loss is formulated as [8]:

    ℓ_{q,{k}} = −log [ exp(q · k⁺/τ) / ( exp(q · k⁺/τ) + Σ_{k⁻} exp(q · k⁻/τ) ) ]          (3)

where q is a query object representation, {k} is the queue of key object samples, k⁺ is an augmented version of q, known as the positive key, and k⁻ are the representations of other samples, known as the negative keys. τ is a temperature parameter. On top of the contrastive learning head, we adopt an unsupervised augmentation strategy [26] which replaces all samples with mixed samples. It minimises the vicinal risk [5], which discriminates classes with very different pattern distributions and creates more training samples [47]. For each sample in the queue {k}, we combine it with the query object representation q via linear interpolation and generate a new view k_{m,i}. Correspondingly, a new virtual label v_i for the i-th mixed sample x_{m,i} is defined as:

    v_i = 1  if q and k⁺ are chosen;
          0  otherwise          (4)

where q and k⁺ are the positive sample pair; the virtual label is assigned to 1 if the mixing pair are from the same object instance.
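
The sketch below illustrates the two ingredients described above: the InfoNCE-style loss of Eq. (3) for a single query, and the linear-interpolation view with its virtual label around Eq. (4). It is a simplified stand-in rather than the authors' implementation; in particular, the mixing coefficient is fixed here, whereas an i-Mix-style strategy [26] would sample it.

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, k_pos: torch.Tensor, queue: torch.Tensor,
             tau: float = 0.07) -> torch.Tensor:
    """Contrastive loss of Eq. (3) for one query.

    q:      (D,) encoded query object
    k_pos:  (D,) encoding of the augmented view of the same object (k+)
    queue:  (K, D) encodings of other objects in the dictionary (k-)
    """
    l_pos = (q * k_pos).sum() / tau                  # q · k+ / τ
    l_neg = queue @ q / tau                          # q · k- / τ for every negative key
    logits = torch.cat([l_pos.unsqueeze(0), l_neg])  # positive key placed first
    return -F.log_softmax(logits, dim=0)[0]          # −log( exp(pos) / Σ exp(·) )

def mix_view(q_input: torch.Tensor, k_input: torch.Tensor,
             same_instance: bool, lam: float = 0.5):
    """Mixed view k_{m,i} and virtual label v_i around Eq. (4): each key sample
    is linearly interpolated with the query, and v_i marks whether the mixing
    pair comes from the same object instance. A fixed lam is used for brevity."""
    x_mix = lam * q_input + (1.0 - lam) * k_input
    v_i = 1.0 if same_instance else 0.0
    return x_mix, v_i
```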

Novel Category Labelling. Using the encoded representation of the objects, we perform the label assignments using constrained k-means clustering [44], a non-parametric semi-supervised clustering method. The constrained k-means clustering takes object encodings from both the known and the working memory as its input. It converts the standard k-means clustering into a constrained algorithm by forcing the labelled object representations to be hard-assigned to their ground-truth class. In particular, we treat the object instances from the known memory M_k as the labelled samples. We manually calculate the centroid for each labelled class. These centroids from M_k serve as the first group of initial centroids for the k-means algorithm. We then randomly initialise the rest of the centroids for novel categories using the k-means++ algorithm [1]. In each iteration, the labelled object instances are assigned to the pre-defined clusters, while the unknown object instances from M_w are assigned to the cluster with the minimal distance between the cluster centroid and the object embedding. By doing this, we effectively avoid falsely predicted objects (i.e. objects that belong to one of the semantic classes being predicted as unknown) from influencing the centroid update. We run the last cluster assignment step using only the novel centroids to ensure that all unknown objects from the working memory are assigned to a discovered visual category in the final prediction. The novel centroids from the algorithm represent the discovered novel categories.
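
A compact NumPy sketch of the constrained clustering step is shown below. It is a simplification of the procedure described above (for brevity it seeds the novel centroids randomly rather than with k-means++ and keeps the known centroids fixed at their labelled class means), so it should be read as an illustration of the constraint logic, not as the exact algorithm.

```python
import numpy as np

def constrained_kmeans(z_known, y_known, z_unknown, n_novel, n_iter=50, seed=0):
    """Semi-supervised constrained k-means (simplified). Labelled embeddings from
    M_k stay hard-assigned to their ground-truth class and fix the known centroids;
    embeddings from M_w may join any cluster, and the final assignment uses only
    the novel centroids, so every unknown object receives a discovered category."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y_known)
    known_centroids = np.stack([z_known[y_known == c].mean(axis=0) for c in classes])
    # Novel centroids seeded from unknown samples (the paper uses k-means++ seeding).
    novel_centroids = z_unknown[rng.choice(len(z_unknown), n_novel, replace=False)]

    for _ in range(n_iter):
        centroids = np.concatenate([known_centroids, novel_centroids])
        d = np.linalg.norm(z_unknown[:, None, :] - centroids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)                      # nearest centroid, known or novel
        for j in range(n_novel):                       # only the novel centroids are updated
            members = z_unknown[assign == len(classes) + j]
            if len(members) > 0:
                novel_centroids[j] = members.mean(axis=0)

    # Final step: assign every unknown object to one of the discovered categories.
    d_novel = np.linalg.norm(z_unknown[:, None, :] - novel_centroids[None, :, :], axis=-1)
    return d_novel.argmin(axis=1)
```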

5. Experimental Setup

We provide a comprehensive evaluation protocol for studying the performance of our model in detecting objects from known classes and discovering novel categories for objects of unknown classes in our target dataset.

5.1. Benchmark Dataset

Pascal VOC 2007 [12] contains 10k images with 20 labelled classes. MS-COCO [30] contains around 80k training and 5k validation images with 80 labelled classes. These two object detection datasets are used to build our benchmark. Following the setting of open-world object detection [49], the classes are separated into known and unknown for three tasks T = {T_1, T_2, T_3}. For task T_t ∈ T, all known classes from {T_i | i < t} are treated as known classes for T_t, while the remaining classes are treated as unknown. For the first task T_1, we consider the 20 VOC classes as known classes, and the remaining non-overlapping 60 classes in MS-COCO are treated as the unknown classes. New classes are added to the known set in the successive tasks, i.e., T_2 and T_3. For evaluation, we use the validation set from MS-COCO except for 48 images that are incompletely labelled [49]. We summarise the benchmark details in Tab. 2.

Table 2. Details of the class splits for the benchmark. Task-1, Task-2 and Task-3 have different splits of known and unknown classes.

                  Task-1        Task-2                Task-3
Semantic Split    VOC Classes   Outdoor, Accessory,   Sports, Wild Animal,
                                Appliance, Truck      Food
Known/Unknown     20/60         40/40                 60/20
Training Set      16551         45520                 39402
Validation Set    1000
Test Set          4952

5.2. Evaluation Metrics

Object Detection Metrics. A qualified open-set object detector needs to accurately distinguish unknown objects [11]. UDR (Unknown Detection Recall) [49] is defined as the localisation rate of unknown objects, and UDP (Unknown Detection Precision) [49] is defined as the rate of correct rejection of objects of an unknown class. Let true-positives (TP_u) be the predicted unknown object proposals that have an intersection over union IoU > 0.5 with ground-truth unknown objects, half false-negatives (FN*_u) be the predicted known object proposals that have IoU > 0.5 with ground-truth unknown objects, and false-negatives (FN_u) be the missed ground-truth unknown objects. UDR and UDP are calculated as follows:

    UDR = (TP_u + FN*_u) / (TP_u + FN_u)
    UDP = TP_u / (TP_u + FN*_u)          (5)
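
Given the counts defined above, UDR and UDP reduce to two ratios; the following is a direct transcription of Eq. (5):

```python
def unknown_detection_metrics(tp_u: int, fn_star_u: int, fn_u: int):
    """Eq. (5). tp_u: unknown proposals with IoU > 0.5 to a ground-truth unknown;
    fn_star_u: proposals predicted as a known class with IoU > 0.5 to a
    ground-truth unknown; fn_u: ground-truth unknown objects that were missed."""
    udr = (tp_u + fn_star_u) / (tp_u + fn_u)   # Unknown Detection Recall
    udp = tp_u / (tp_u + fn_star_u)            # Unknown Detection Precision
    return udr, udp
```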

In our task, the other important aspect is to localise and classify objects of interest from the known classes. We evaluate the closed-set detection performance using the standard mean average precision (mAP) at an IoU threshold of 0.5 [38]. To show the incremental learning ability, we provide the mAP measurement for the newly introduced known classes and the previously known classes separately [24, 34].

Category Discovery Metrics. Category discovery can be evaluated using clustering metrics [18, 21, 27, 36, 44, 50]. We adopt the three most commonly used clustering metrics for our object-based category discovery performance. Suppose a predicted proposal of an object of an unknown class has been matched to a ground-truth unknown object. Let the predicted category label of the object proposal be ŷ_i, and let the ground-truth label for the object be denoted as y_i. We calculate the clustering accuracy (ACC) [18] by:

    ACC = max_{p ∈ P_y} (1/N) Σ_{i=1}^{N} 1{ y_i = p(ŷ_i) }          (6)

where N is the number of clusters, and P_y is the set of all permutations of the unknown class labels.

Mutual information I(X, Y) quantifies the correlation between two random variables X and Y. The range of I(X, Y) is from 0 (independent) to +∞. Normalised mutual information (NMI) [42] is bounded in the range [0, 1]. Let Cl be the set of ground-truth classes, and Ĉl be the set of predicted clusters. The NMI is formulated as:

    NMI = I(Cl, Ĉl) / ( [H(Cl) + H(Ĉl)] / 2 )          (7)

where I(Cl, Ĉl) is the sum of the mutual information between each class-cluster pair, and H(Cl) and H(Ĉl) compute the entropy using maximum likelihood estimation. The Purity of the clusters is defined as:

    Purity = (1/N) Σ_{i=1}^{N} max_k |Cl_k ∩ Ĉl_i|          (8)

Here, N is the number of clusters and the max term is the highest count of objects for a single class within each cluster.
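
For reference, the three metrics can be computed with standard tools. The sketch below uses Hungarian matching as the usual stand-in for the maximum over label permutations in Eq. (6) and sklearn's NMI; following common practice, ACC and purity are normalised here by the number of evaluated objects.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """ACC (Eq. (6)): best one-to-one match between predicted clusters and
    ground-truth classes, found with the Hungarian algorithm."""
    n = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[t, p] += 1
    rows, cols = linear_sum_assignment(count, maximize=True)
    return count[rows, cols].sum() / len(y_true)

def purity(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Purity (Eq. (8)): each predicted cluster is credited with its most
    frequent ground-truth class; normalised by the number of evaluated objects."""
    total = 0
    for c in np.unique(y_pred):
        total += np.bincount(y_true[y_pred == c]).max()
    return total / len(y_true)

# NMI (Eq. (7)) is available directly, e.g.:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```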

6. Results and Analysis

6.1. Baselines

Object Detection Baselines. Our framework uses an open-set object detector for known and unknown instance detection. We compare two recent approaches: Faster-RCNN+ [24] and ORE [24]. Faster-RCNN+ is a popular two-stage object detection method, modified from Faster R-CNN [38] to localise objects of unknown classes by additionally adapting an unknown-aware region proposal. ORE uses contrastive clustering and an energy-based classifier to discriminate the representations of known and unknown data. Our generic framework can cooperate with any open-set object detector, hence it is highly flexible.

Category Discovery Baselines. We compare our novel method with three baseline methods: k-means, FINCH [39] and a modified approach derived from DTC [18].

K-means clustering is a non-parametric clustering method that minimises within-cluster variances. In every iteration, the algorithm first assigns the data points to the cluster with the minimum pairwise squared deviation between samples and centroids; then, it updates the cluster centroids with the current data points belonging to each cluster.

FINCH [39] is a parameter-free clustering method that discovers linking chains in the data by using the first nearest neighbour. The method directly develops the grouping of data without any external parameters. To make a fair comparison, we set the number of clusters to the same as the other baseline methods. We discuss the performance of FINCH in estimating the number of novel classes in Sec. 6.2.1.

DTC+. The DTC method [18] was proposed for NCD problems [16], where the setting assumes the availability of unlabelled data at the training phase. The algorithm modifies deep embedded clustering [46] to learn knowledge from the labelled subset and transfer it to the unlabelled subset. This setting requires the unlabelled data in the training and testing sets to be from the same classes. However, no unknown instances are available in training under the open-set detection setting; hence, NCD-based approaches such as DTC cannot be directly applied to our problem. To facilitate the method in our setting, we modify it by transferring a portion of the classes from the known memory to the working memory during training and treating them as additional unknown classes. We evaluate DTC's generalisation performance on our problem in Sec. 6.2.3.

6.2. Experimental Results

We report the quantitative results of the novel category number estimation, object detection and novel category discovery performance in Secs. 6.2.1 to 6.2.3. We show and discuss the qualitative results in Fig. 4 and in the supplementary material.

6.2.1 Novel Category Number Estimation

We show the results of estimating the number of novel categories in Tab. 3. The middle two columns show the automatically discovered grouping by the FINCH algorithm [39]: the numbers are under-estimated by a large margin of 30%, 32.5% and 40%, respectively. The last two columns show the results using DTC [18]: the estimated numbers are lower than the ground-truth class numbers, with an average error rate of 21%. By exploring the ground-truth labels in the grouping, we found that both methods tend to ignore object classes with a small number of samples. Compared to class estimation in the image recognition task [18, 44], the detection task faces more biased datasets as well as fewer available samples. Hence, object category number estimation remains a challenging task.

Table 3. Results of novel category number estimation.

Task   GT   FINCH [39]   Error   Est. [18]   Error
1      60   42           30%     48          20%
2      40   27           32.5%   31          22.5%
3      20   12           40%     16          20%

Table 4. Baseline model comparison for open-set detectors. The mean average precision (mAP) is recorded for the previously/currently known objects; there are no previously known classes for Task-1.

            Task-1                      Task-2                        Task-3
Method      mAP        UDR     UDP     mAP            UDR     UDP     mAP            UDR     UDP
F-RCNN+     - / 56.16  20.14   -       51.09 / 23.84  21.54   -       35.69 / 11.53  30.01   -
ORE [24]    - / 56.02  20.10   36.74   52.19 / 25.03  22.63   21.51   37.23 / 12.02  31.82   23.55

6.2.2 Open-Set Object Detection

We compare two baseline models for the object detection part of our framework and show the results in Tab. 4. For each task, we record the mAP of all objects to evaluate the closed-world detection result. UDR and UDP reflect the unknown objectness and discrimination performance. ORE outperforms the modified Faster-RCNN on known-class detection by a small margin (−0.14%, +1.14% and +1.01%, respectively). The mAP scores get lower when new semantic classes are introduced. The UDR results show that ORE performs better on unknown object localisation, with a +0.95% average unknown detection rate. As opposed to closed-set detection, the UDR scores improve when more classes are made available to the model. The Faster-RCNN baseline can only localise objects of an unknown class; it does not distinguish them from known classes, hence there is no UDP score.

6.2.3 Novel Category Discovery

Results of the object category discovery are shown in Tab. 5 and Tab. 6. The test condition is the same as for open-set detection. Our discovery method is able to accurately explore novel categories among the objects of unknown classes.

Using the estimated number of classes, the discovery results are reported in Tab. 5. We observe that our method outperforms the other baseline methods in the first two tasks. In Task-3, where there are 60 known classes and 20 unknown classes, our accuracy and purity scores are slightly lower than the FINCH algorithm, by 0.8% and 0.1%. We suggest that Task-3 may contain more biased unknown object classes, making it more challenging for self-supervised learning to learn generalised representations.

We report the results using the ground-truth number of classes in Tab. 6. The results are similar to Tab. 5, where our method has the best aggregated performance over the three tasks. The method achieves respectable quantitative results considering the difficulty of the task.

Table 5. Results of discovery with the estimated class number (48, 31 and 16 for Task-1, Task-2 and Task-3, respectively). The highest score in each column is bold in black, and the second-highest score in each column is bold in grey. Our novel method outperforms the proposed baseline models for all scores in Task-1 and Task-2. The cluster accuracy and purity scores are the second-highest in Task-3, with a marginal difference to the best-performing baseline.

             Task-1                Task-2                Task-3
Method       NMI    ACC   Purity   NMI    ACC   Purity   NMI    ACC   Purity
K-means      8.5    5.3   9.3      5.0    6.2   12.0     5.3    10.9  27.6
FINCH [39]   2.8    6.0   8.2      5.4    6.3   9.9      5.3    17.2  29.4
DTC+ [18]    7.5    4.6   5.2      4.0    4.2   7.5      3.9    5.0   25.4
Ours         11.0   6.3   12.6     5.8    6.9   13.3     6.5    16.4  29.3

Table 6. Results of discovery with the ground-truth class number (60, 40 and 20 for Task-1, Task-2 and Task-3, respectively). The highest score in each column is bold in black, and the second-highest score in each column is bold in grey. With the pre-defined number of classes, our method achieves the highest scores for all three tasks, except for the accuracy in Task-3, which is behind the highest-scoring baseline method by a small margin. The overall performance of our method is the best among all the proposed baselines.

             Task-1                Task-2                Task-3
Method       NMI    ACC   Purity   NMI    ACC   Purity   NMI    ACC   Purity
K-means      11.9   6.0   12.4     5.9    6.1   12.8     6.0    11.6  27.9
FINCH [39]   10.3   6.1   12.5     4.8    7.5   13.4     5.5    13.6  28.3
DTC+ [18]    8.3    4.7   9.2      4.2    5.0   12.1     5.0    7.7   26.1
Ours         13.1   6.5   13.1     7.0    7.5   13.8     6.1    13.2  29.1

6.3. Ablation Study

To study the contribution of each component in our proposed framework, we design ablation experiments and show the results in Tab. 7.

Figure 4. Visualisation of OSODD predictions for Task-1. The tennis racket, stop sign, fire hydrant, clock, giraffe and zebra are novel classes that have not been introduced at this stage. The same bounding box colour indicates objects that belong to the same class or novel category. The last column demonstrates a failure case where a giraffe is not detected and one of the zebras is assigned to the wrong visual category. More visualised results are provided in the supplementary material.

Table 7. Ablation study on the components of our proposed category discovery method (Representation Learning: mix-up augmentation and contrastive learning; Category Discovery: semi-supervised clustering). The complete method with all the proposed modules achieves the best aggregated performance in all tasks, which shows that each component contributes to the method.

Case   Mix-Up   Contrastive   Semi-sup.     Task-1               Task-2               Task-3
       Aug.     Learning      Clustering    NMI   ACC   Purity   NMI   ACC   Purity   NMI   ACC   Purity
I      ✗        ✗             ✓             8.9   5.6   10.5     4.7   5.4   11.9     5.5   14.7  27.7
II     ✗        ✓             ✓             10.5  6.3   12.0     5.6   5.4   13.2     6.1   15.5  28.6
III-1  ✓        ✓             ✗             9.6   5.7   11.7     5.2   6.3   12.9     5.8   15.9  28.8
III-2  ✓        ✓             ✗             7.4   6.3   12.3     5.4   6.4   13.1     6.0   16.8  28.7
IV     ✓        ✓             ✓             11.0  6.3   12.6     5.8   6.9   13.3     6.5   16.4  29.3

Representation Learning. The effects of representation learning on discovering novel classes are shown in Cases I, II and IV. The clustering result without the learned encoding is reported in Case I, and the result with only contrastive learning is reported in Case II. We observe that the performance without encoding is around 10% lower compared to Case IV, which is our full method. Contrastive learning without the mix-up augmentation yields higher scores than Case I, but it is still around 4% lower in the aggregated scores compared to Case IV. This suggests that representation learning is critical for constructing a strong baseline.

Category Discovery. We evaluate the effects of using semi-supervised clustering in Cases III-1, III-2 and IV. In Case III-1, we make the clustering algorithm fully unsupervised by removing the labelled centroids and instances; the results decrease by around 8% in all tasks. Since the FINCH algorithm [39] shows competitive results in Tab. 5 and Tab. 6, in Case III-2 we replace the semi-supervised clustering with the FINCH algorithm. The results show that Case IV outperforms Case III-2 in the task-aggregated scores, which indicates that our model better clusters the samples within the same learned feature space.

Memory Module. To show the effect of the current memory design, we ablate the module by removing the known memory from the representation learning. We report the results in the supplementary material.

7. Conclusion

In this work, we propose a framework to detect known objects and discover novel visual categories for unknown objects. We term this task Open-Set Object Detection and Discovery (OSODD), as a natural extension of open-set object detection tasks. We develop a two-stage framework and a novel method for label assignment, outperforming other popular baselines. Compared to detection and discovery tasks, OSODD can provide more comprehensive information for real-world practice. We hope our work will contribute to the object detection community and motivate further research in this area.

References

[1] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. Technical report, Stanford, 2006.
[2] Abhijit Bendale and Terrance Boult. Towards open world recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1893–1902, 2015.
[3] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[5] Olivier Chapelle, Jason Weston, Léon Bottou, and Vladimir Vapnik. Vicinal risk minimization. Advances in Neural Information Processing Systems, 13, 2000.
[6] Jia Chen, Yasong Chen, Weihao Li, Guoqin Ning, Mingwen Tong, and Adrian Hilton. Channel and spatial attention based deep object co-segmentation. Knowledge-Based Systems, 211:106550, 2021.
[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
[8] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[9] Minsu Cho, Suha Kwak, Cordelia Schmid, and Jean Ponce. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1201–1210, 2015.
[10] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
[11] Akshay Dhamija, Manuel Gunther, Jonathan Ventura, and Terrance Boult. The overlooked elephant of object detection: Open set. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1021–1030, 2020.
[12] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[13] Enrico Fini, Enver Sangineto, Stéphane Lathuilière, Zhun Zhong, Moin Nabi, and Elisa Ricci. A unified objective for novel class discovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9284–9292, 2021.
[14] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[15] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[16] Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. Automatically discovering and learning new visual categories with ranking statistics. arXiv preprint arXiv:2002.05714, 2020.
[17] Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. AutoNovel: Automatically discovering and learning novel visual categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[18] Kai Han, Andrea Vedaldi, and Andrew Zisserman. Learning to discover novel visual categories via deep transfer clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[21] Jie Hong, Weihao Li, Junlin Han, Jiyang Zheng, Pengfei Fang, Mehrtash Harandi, and Lars Petersson. GOSS: Towards generalized open-set semantic segmentation. arXiv preprint arXiv:2203.12116, 2022.
[22] Lalit P Jain, Walter J Scheirer, and Terrance E Boult. Multi-class open set recognition using probability of inclusion. In European Conference on Computer Vision, pages 393–409. Springer, 2014.
[23] Xuhui Jia, Kai Han, Yukun Zhu, and Bradley Green. Joint representation learning and novel category discovery on single- and multi-modal data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 610–619, 2021.
[24] KJ Joseph, Salman Khan, Fahad Shahbaz Khan, and Vineeth N Balasubramanian. Towards open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5830–5840, 2021.
[25] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[26] Kibok Lee, Yian Zhu, Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin, and Honglak Lee. i-Mix: A domain-agnostic strategy for contrastive representation learning. arXiv preprint arXiv:2010.08887, 2020.
[27] Yong Jae Lee and Kristen Grauman. Object-graphs for context-aware category discovery. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2010.
[28] Weihao Li, Omid Hosseini Jafari, and Carsten Rother. Deep object co-segmentation. In Asian Conference on Computer Vision, pages 638–653. Springer, 2018.
[29] Weihao Li, Omid Hosseini Jafari, and Carsten Rother. Localizing common objects using common component activation map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[31] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2537–2546, 2019.
[32] Dimity Miller, Lachlan Nicholson, Feras Dayoub, and Niko Sünderhauf. Dropout sampling for robust object detection in open-set conditions. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3243–3249. IEEE, 2018.
[33] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[34] Can Peng, Kun Zhao, and Brian C Lovell. Faster ILOD: Incremental learning for object detectors based on Faster R-CNN. Pattern Recognition Letters, 140:109–115, 2020.
[35] Pramuditha Perera, Vlad I Morariu, Rajiv Jain, Varun Manjunatha, Curtis Wigington, Vicente Ordonez, and Vishal M Patel. Generative-discriminative feature representations for open-set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11814–11823, 2020.
[36] Sai Saketh Rambhatla, Rama Chellappa, and Abhinav Shrivastava. The pursuit of knowledge: Discovering and localizing novel categories using dual memory. arXiv preprint arXiv:2105.01652, 2021.
[37] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[38] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.
[39] Saquib Sarfraz, Vivek Sharma, and Rainer Stiefelhagen. Efficient parameter-free clustering using first neighbor relations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2019.
[40] Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1757–1772, 2012.
[41] Walter J Scheirer, Lalit P Jain, and Terrance E Boult. Probability models for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2317–2324, 2014.
[42] Alexander Strehl and Joydeep Ghosh. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(Dec):583–617, 2002.
[43] Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. CSI: Novelty detection via contrastive learning on distributionally shifted instances. In NeurIPS, 2020.
[44] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Generalized category discovery. arXiv preprint arXiv:2201.02609, 2022.
[45] Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. arXiv preprint arXiv:2003.06957, 2020.
[46] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487. PMLR, 2016.
[47] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[48] Bingchen Zhao and Kai Han. Novel visual category discovery with dual ranking statistics and mutual knowledge distillation. Advances in Neural Information Processing Systems, 34, 2021.
[49] Xiaowei Zhao, Xianglong Liu, Yifan Shen, Yuqing Ma, Yixuan Qiao, and Duorui Wang. Revisiting open world object detection. arXiv preprint arXiv:2201.00471, 2022.
[50] Zhun Zhong, Enrico Fini, Subhankar Roy, Zhiming Luo, Elisa Ricci, and Nicu Sebe. Neighborhood contrastive learning for novel class discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10867–10875, 2021.
[51] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13001–13008, 2020.
[52] Zhun Zhong, Linchao Zhu, Zhiming Luo, Shaozi Li, Yi Yang, and Nicu Sebe. OpenMix: Reviving known knowledge for discovering novel visual categories in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9462–9470, 2021.