This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3406018


Enhancing Image Annotation with Object Tracking and Image Retrieval: A Systematic Review

Rodrigo Fernandes1,2, Alexandre Pessoa3, Marta Salgado4, Anselmo de Paiva3, Ishak Pacal5, António Cunha1,2
1Institute for Systems and Computer Engineering, Technology and Science (INESC TEC), 4200-465 Porto, Portugal
2School of Sciences and Technology, University of Trás-os-Montes and Alto Douro, 5000-801 Vila Real, Portugal
3Applied Computing Group (NCA), Federal University of Maranhão (UFMA), São Luís, Brazil
4University Hospital Centre of Santo António, 4099-001 Porto, Portugal
5Igdir University, 76000 Iğdır, Turkey
Corresponding author: Rodrigo Fernandes (e-mail: [email protected]).
National funds finance this work through the Portuguese funding agency FCT (Fundação para a Ciência e a Tecnologia), within project PTDC/EEI-EEE/5557/2020. Co-funded by the European Union (grant number 101095359) and supported by UK Research and Innovation (grant number 10058099). However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the Health and Digital Executive Agency (HaDEA).

ABSTRACT Effective image and video annotation is a fundamental pillar of computer vision and artificial intelligence, crucial for the development of accurate machine learning models. Object tracking and image retrieval techniques are essential in this process, significantly improving the efficiency and accuracy of automatic annotation. This paper systematically investigates object tracking and image retrieval techniques, exploring how these technologies can collectively enhance the efficiency and accuracy of annotation processes for image and video datasets. Object tracking is examined for its role in automating annotation by following objects across video sequences, while image retrieval is evaluated for its ability to suggest annotations for new images based on existing data. The review encompasses diverse methodologies, including advanced neural networks and machine learning techniques, highlighting their effectiveness in contexts such as medical analysis and urban monitoring. Despite notable advances, challenges such as algorithm robustness and effective human-AI collaboration remain. This review provides valuable insights into the current state and future potential of these technologies for improving image annotation, surveying existing applications of these techniques and their full potential when combined.

INDEX TERMS Image Annotation, Object Tracking, Image Retrieval, Deep Learning

I. INTRODUCTION

Image annotation is essential in various computer vision and artificial intelligence applications. With the significant increase in the volume of image data available, efficient methods to annotate large data sets are needed. Object tracking and image retrieval techniques are relevant methods for facilitating and potentially automating image annotation in this context.

Artificial intelligence, particularly machine learning and deep learning, has revolutionized how we process and interpret large volumes of visual data [1, 2]. Machine learning, which includes algorithms capable of learning and making predictions or decisions based on data, is the foundation for automatic image annotation. Within this domain, deep learning, especially using deep neural networks, has shown an extraordinary ability to extract complex features and patterns from images, facilitating tasks such as object recognition [3, 4, 5] and image classification [6, 7, 8].

Convolutional neural networks (CNNs) are a fundamental pillar in image processing [9]. They simulate how the human visual cortex interprets images, using layers of neurons to process visual data at various levels of abstraction [10]. This capability makes CNNs particularly suitable for computer vision tasks, including image annotation, where they can identify and label objects in images with high precision.
VOLUME XX, 2017 1


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3406018

More recently, Vision Transformers (ViTs) [11, 12] have emerged as an alternative and robust approach. Inspired by the success of Transformers in natural language processing, ViTs apply attention mechanisms to capture global relationships between different parts of an image. This makes them particularly effective at understanding complex visual contexts, a valuable feature for automatic image annotation.
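To ground the idea, here is a hedged, self-contained sketch (illustrative only; real ViTs add learned patch projection, positional embeddings, multi-head attention and residual blocks) of how patch tokens attend globally to one another:

```python
import torch
import torch.nn.functional as F

def patchify(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """(C, H, W) image -> (num_patches, C*patch*patch) flattened patch tokens."""
    c, _, _ = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)
    return patches.reshape(c, -1, patch * patch).transpose(0, 1).reshape(-1, c * patch * patch)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention over patch tokens."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # global patch-to-patch affinities
    return F.softmax(scores, dim=-1) @ v

img = torch.randn(3, 224, 224)
tokens = patchify(img)                        # (196, 768) patch tokens
d = tokens.shape[-1]
out = self_attention(tokens, *(torch.randn(d, d) for _ in range(3)))
```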
Accurate data annotation is fundamental in machine learning applications, directly impacting the effectiveness of trained models. Traditionally, annotation is carried out manually, a process that can be slow and subject to inconsistencies. Automating this process, partially or entirely, is a relevant objective for increasing the efficiency and consistency of annotations.

Object tracking involves identifying and following objects over time in videos or image sequences. This technique can be used for automatic annotation [13, 14], following the trajectory of moving objects and marking them in each frame. This approach can reduce the time needed for annotation and improve consistency, especially in contexts with dynamic objects.
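In practice, this propagation step can be prototyped with an off-the-shelf single-object tracker. The sketch below is a hedged illustration (the file name is an assumption, and recent OpenCV builds may expose the tracker factory under cv2.legacy instead):

```python
import cv2

def propagate_box(video_path: str, first_box: tuple) -> list:
    """Return one (x, y, w, h) box per frame, seeded by a box drawn on frame 0."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    tracker = cv2.TrackerCSRT_create()   # or cv2.legacy.TrackerCSRT_create()
    tracker.init(frame, first_box)
    boxes = [first_box]
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, box = tracker.update(frame)
        boxes.append(tuple(map(int, box)) if found else None)  # None = needs human review
    cap.release()
    return boxes

# Usage: annotations = propagate_box("sequence.mp4", (120, 80, 60, 40))
```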
On the other hand, image retrieval involves searching for and identifying similar images in large databases. Using algorithms to identify common patterns and characteristics, this technique can suggest annotations for new images [15, 16], based on previously annotated data, providing a starting point for annotation.

The joint application of object tracking and image retrieval to image annotation offers a promising approach to automation in computer vision. This systematic review aims to explore the current state of these techniques, assessing how they can be applied to optimize image annotation. The review focuses on analyzing recent studies and practical applications, aiming to provide a detailed overview of the benefits and challenges of these methodologies.

A. MOTIVATION
The growing demand for annotated image datasets in fields such as medicine, security and pattern recognition highlights the importance of efficient and accurate image annotation methods [17]. The motivation for this systematic review arises from the opportunity to explore how object tracking and image retrieval techniques can contribute to this process, offering solutions to existing challenges in manual annotation and providing a more automated and efficient approach [18, 19].

It is important to emphasize that, despite the growing research and development in image annotation, systematic reviews in this area are remarkably scarce in recent years. This gap in the literature highlights the critical need for a comprehensive review that synthesizes recent advances and contextualizes the current state of the art. Thus, this review stands out by compiling and presenting the most current studies, reflecting the significant advances in image annotation with emerging technologies such as object tracking and image retrieval.

Manual image annotation, although traditional, presents significant challenges. These include high time demands, variability in the accuracy and consistency of annotations due to human intervention, and difficulty scaling to large data volumes. Automating image annotation, or at least offering automated assistance in this process, can speed up the work and increase its accuracy and consistency [20]. The application of techniques designed to maximize learning from limited data, such as transfer learning [21, 22, 23, 24, 25], data augmentation [26, 27, 28, 29] and few-shot learning [30, 31, 32, 33, 34], complements this move towards automation. While these methods are valuable for training robust models with sparse annotated datasets, the ultimate goal remains to minimize their need by improving the automation of the annotation process itself. This approach not only addresses the immediate challenges of data scarcity, but also aligns with the long-term vision of creating self-sustaining deep learning ecosystems that can learn and adapt with minimal human oversight.

The emphasis on developing automated annotation systems is particularly pertinent given the exponential increase in digital data [20]. The ability to automatically annotate and categorize this data becomes not only beneficial but essential for its management and value extraction. Automated annotation systems fueled by object tracking and image retrieval therefore represent a significant advance in this regard, offering scalable, efficient and accurate solutions to meet the growing demands of various industries.

Additionally, human-AI collaboration in image annotation introduces unique challenges that deserve further exploration [35, 36]. While AI can significantly improve efficiency and accuracy in identifying and tracking objects in image sequences, properly integrating human judgement and expertise is crucial to ensuring the relevance and semantic accuracy of annotations. The interaction between human annotators and AI systems needs to be intuitive and flexible, allowing for easy corrections and adjustments, and ensuring that human knowledge is effectively incorporated into the annotation process.

Object tracking and image retrieval techniques have already demonstrated their effectiveness in several practical applications, suggesting significant potential for innovation in image annotation. Object tracking can automate the identification and tracking of objects in image sequences, reducing the human effort required to annotate each frame individually [37, 38, 39]. On the other hand, image retrieval can facilitate annotation by identifying similar images with existing annotations, providing a reliable starting point and speeding up the annotation process [40, 41, 42].

Therefore, this review seeks to evaluate and synthesize current knowledge on object tracking and image retrieval in image annotation, identifying potential advances, challenges and opportunities. The aim is to provide a comprehensive understanding of how these techniques can improve the efficiency and accuracy of image annotation in different domains, contributing to advancing research and practice in computer vision and related areas.


B. OBJECTIVE AND RESEARCH QUESTIONS
This systematic review aims to explore the use of object tracking and image retrieval techniques to automate or assist in image annotation. The focus is to investigate how the integration of these technologies can optimize the annotation process, enhancing its efficiency and accuracy. The review is guided by the following research questions:

Q1: How are object tracking and image retrieval techniques being used to automate or assist in image annotation, and what are the current developments associated with these technologies?

Q2: How can the integration of object tracking and image retrieval be optimized to improve the image annotation process? This question aims to discover innovative approaches to combining object tracking and image retrieval efficiently. It seeks to understand how the synergy between these two technologies can be maximized to speed up and improve image annotation.

Q3: What are the main challenges and limitations faced when applying these techniques to image annotation? Here, the focus is on identifying the technical and practical challenges and limitations that currently prevent the effective implementation of object tracking and image retrieval in image annotation. This question also seeks to explore potential solutions or approaches to overcome these challenges.

An in-depth understanding of these issues will provide valuable insights into the opportunities, challenges and future directions for using advanced computer vision techniques in image annotation, boosting efficiency and accuracy in various fields of application, such as medical diagnosis, surveillance and large-scale pattern recognition.

II. RELATED WORK
In the dynamic field of image annotation with deep learning, several systematic reviews offer valuable insights and explore different aspects of this evolving domain. Recent studies can be summarized as follows.

Adnan et al. [43] devoted themselves to a comprehensive analysis of Automatic Image Annotation (AIA) methods, with a special emphasis on deep learning models. This review is significant in that it categorizes AIA methods into five distinct categories: CNN-based, RNN-based, DNN-based, LSTM-based and SAE-based. The study not only highlights recent advances in these techniques, but also points to persistent challenges, such as the need for more accurate and efficient techniques to improve automatic image annotation.

Ojha et al. [44] focused specifically on the use of convolutional neural networks (ConvNets) for image annotation. This review details how ConvNets are applied to image content annotation, exploring their ability to extract visual features for complex computer vision tasks. The review highlights the crucial role of ConvNets in object identification and localization, underlining their effectiveness in dealing with visual perception challenges in images.

Pande et al. [45] presented a comparative analysis of a variety of image annotation tools for object detection. This study is notable for its comprehensive approach, evaluating different annotation tools in terms of functionalities, effectiveness and applicability in varied object detection contexts. The review highlights the importance of the appropriate choice of annotation tool, emphasizing that the quality of the annotation has a direct and significant impact on the performance of object detection models.

Existing reviews in the field of image annotation with deep learning, including the work of Adnan et al. [43], Ojha et al. [44] and Pande et al. [45], offer a comprehensive overview of current methodologies and applications, focusing on different aspects of this evolving area. They illustrate the technological advances and the challenges that still need to be overcome, providing an overview of the trends and future directions of this emerging field. However, a notable limitation of these reviews is their tendency to focus on specific types of techniques in isolation, which can limit understanding of the full capability of image annotation technologies. Especially when considering the synergistic potential of combining different approaches to tackle complex challenges, this perspective can prove restrictive.

In contrast, our review distinguishes itself by exploring not just one, but two complementary techniques: object tracking and image retrieval. By integrating these two approaches, we propose a more holistic view of image annotation, recognizing that combining these technologies can bring significant benefits to the efficiency and accuracy of the annotation process. This integration represents an evolution in the field of image annotation, leveraging the potential of each technique to complement and enrich the other, and opening up new possibilities for significant advances in the automation and accuracy of image annotation.

III. LITERATURE REVIEW METHODOLOGY

A. ELIGIBILITY CRITERIA
To guarantee a relevant and objective selection of studies, we established strict eligibility criteria, detailed in Table I. The primary purpose of these criteria is to select research that effectively addresses the questions proposed by this study, excluding work that falls outside the scope of our research. These parameters were carefully formulated to capture the most pertinent literature and restrict our analysis to documents strictly related to the research questions at hand. Implementing these criteria before the literature search is a crucial strategy for reducing bias in the study selection process.


TABLE I. Inclusion (IC) and exclusion (EC) criteria.

Criteria | Description
IC0 | Published since 2020
IC1 | The title, abstract, or keywords match the search query
IC2 | Work published in a refereed journal or conference
IC3 | Direct or indirect applicability of object tracking techniques for image annotation
IC4 | Direct or indirect applicability of image retrieval techniques for image annotation
EC0 | Work not published in a refereed journal or conference
EC1 | Literature/systematic review
EC2 | Full text is not available
EC3 | The paper is not written in English
EC4 | Does not consider the use of object tracking or image retrieval
EC5 | Out of scope
EC6 | Technique used cannot be leveraged for image annotation

B. IDENTIFICATION PHASE
In the initial selection phase, three reference databases were chosen: IEEE Xplore, Scopus and SpringerLink. The search centred on combinations of keywords in the titles, abstracts and keywords of the articles searched, with the following query:

("Object Tracking" OR "Image Retrieval") AND ("Image" OR "Dataset" OR "Video") AND ("Annotation")
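Such a query can also be scripted against a database API. The sketch below is a hypothetical example using Elsevier's public Scopus Search API; the API key, field codes and paging are illustrative assumptions and should be checked against the provider's documentation:

```python
import requests

# TITLE-ABS-KEY restricts the match to titles, abstracts and keywords.
QUERY = ('TITLE-ABS-KEY(("Object Tracking" OR "Image Retrieval") '
         'AND ("Image" OR "Dataset" OR "Video") AND ("Annotation"))')

resp = requests.get(
    "https://api.elsevier.com/content/search/scopus",
    headers={"X-ELS-APIKey": "YOUR_API_KEY"},           # placeholder credential
    params={"query": QUERY, "date": "2020-2023", "count": 25},
)
records = resp.json().get("search-results", {}).get("entry", [])
```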
This process was carried out on 14 November 2023, considering publications from 2020 to 2023. This time window was strategic to ensure the inclusion of the most recent and relevant studies in deep learning applied to object tracking and image retrieval techniques for annotating image datasets, resulting in the identification of 5455 documents for the initial screening phase.

Additionally, during the review and selection of studies, we identified some highly relevant works that did not strictly fit the terms of the original search but offered valuable contributions to the topic under discussion. These studies were carefully included to enrich the analysis and discussion, considering their direct or indirect relevance and the potential to provide additional insights into object tracking and image retrieval techniques applied to image annotation.

Figure 1. Study selection flow diagram.

C. SCREENING PHASE AND ELIGIBILITY
After defining the eligibility criteria in Section 3, we began the screening and eligibility determination phase, a crucial stage in the systematic review process to ensure the relevance and quality of the included studies. This phase involves a thorough evaluation of the documents retrieved from the selected databases, based on the criteria previously established. The aim is to refine the initial set of documents to include only those that strictly fulfil the eligibility criteria, thus guaranteeing the integrity and relevance of the subsequent analysis. The screening phase begins with a review of the titles and abstracts of the 5455 documents initially identified, as detailed in Section 3, "Literature Review Methodology". This process allows us to identify and exclude studies that do not fall within the scope of our investigation, focussing on those papers that offer valuable insights into the use of object tracking and image retrieval techniques in the annotation of images and video datasets. Through individual reading of titles and abstracts, we identified that a substantial number of these documents did not meet our inclusion and exclusion criteria and were therefore excluded, leaving 95 unique works for the screening phase. This initial stage was crucial to ensure the relevance and uniqueness of the studies within our research scope. Subsequently, in the eligibility phase, we conducted a detailed evaluation of the full text of these 95 documents, strictly guided by the previously defined exclusion criteria. This meticulous analysis resulted in the exclusion of a significant portion of the documents, based on various factors of incompatibility with the established criteria, such as non-compliance with the research objective. Of the evaluated articles, 15 were selected in the object tracking area and 17 in the image retrieval area, totaling 32 studies for data extraction and qualitative analysis. These studies were chosen not only for their direct relevance to the themes of interest but also for the quality of their methodologies, the data sets used, and their relevance to the proposed research questions. The selection of these studies reflects our commitment to covering a broad and in-depth spectrum of the applications of the techniques mentioned in image annotation.

IV. RESULTS
Based on the criteria established in Section 3, this part of the article explores the results achieved. We focus our analysis on object tracking and image retrieval techniques, considering how these technologies can be adapted for image annotation. The research involves a careful analysis of the algorithmic approaches examined, focusing on the data sets used, the viability of the methods in various annotation contexts and the real-time execution capacity of the models. This review's main findings and observations are summarized and organized in Tables II and III.

Following Figure 1, the next figure presents a bar chart outlining the number of articles selected per year from the initial set of studies. This visual representation allows for a clear and immediate understanding of the distribution and volume of relevant research within the specified time period.


Figure 2. Studies selected, separated by year.

The bar chart illustrates the annual distribution of the articles selected from 2020 to 2023. It is possible to observe a progressive increase in the number of articles, with the highest bar corresponding to the year 2023. This suggests growing interest and progress in the research fields of object tracking and image retrieval, as they become increasingly relevant to the development of more sophisticated image annotation techniques. The graph serves not only as a quantitative analysis of research output over the years but can also reflect the growing importance of these technologies in addressing complex challenges in computer vision and artificial intelligence.

B. OBJECT TRACKING
In this section, we discuss in detail the algorithms that represent significant advances in object tracking. The approaches vary widely, from traditional machine learning techniques to advanced methods employing convolutional neural networks, each offering innovative solutions to specific challenges within diverse application contexts. These algorithms are presented in Table II with their respective research articles.

Single Object Tracking
Tao Yu et al. [46] present a method that integrates the "Instance Tracking Head" (ITH) module into object detection frameworks to detect and track polyps in colonoscopy videos. This method, aligned with the Scaled-YOLOv4 detector, allows sharing of low-level feature extraction and progressive specialization in detection and tracking. The approach stands out for its speed, being around 30% faster than conventional methods, while maintaining exceptional detection accuracy (mAP of 91.70%) and tracking accuracy (MOTA of 92.50%, Rank-1 Acc of 88.31%).

Shaopan Xiong et al. [47] develop a solution based on a tracking module that uses Siamese networks, specifically SiamFC++, to accurately localize objects. The innovation here lies in modelling visual tracking as a similarity learning problem, complemented by the Box2Segmentation module, which efficiently transforms bounding boxes into segmentation masks, trained on the COCO dataset. This method forms the basis of Td-VOS, allowing precise segmentation of objects in videos from the initialization of a bounding box in the first frame.

Weiming Hu et al. [48] (SiamMask) uniquely combine object tracking with real-time video segmentation. Using fully convolutional Siamese networks trained offline with an additional binary segmentation task, SiamMask operates online with the initialization of a bounding box, processing video at 55 fps. This approach employs two- and three-branch variants, integrating similarity, bounding box regression and binary segmentation tasks, with mask refinement to improve accuracy. SiamMask is adaptable to multiple objects and stands out for its efficiency and speed.

Dominik Schörkhuber et al. [49] introduce a technique for semi-automatic annotation of night-time driving videos. The method includes generating trajectory proposals by tracking, extending and verifying these trajectories with single object tracking, and semi-automatic annotation of bounding boxes. Tested on the CVL dataset, focused on European rural roads at night, the method demonstrated a 23% increase in recall with near-constant precision, outperforming traditional detection and tracking approaches. This work addresses the gap of rural and night scenes in driving datasets, proposing significant improvements for efficient annotation in challenging autonomous driving contexts.

Roberto Henschel et al. [50] propose an advanced method for multi-person tracking, combining video and body-worn Inertial Measurement Units (IMUs). This method stands out by addressing the challenge of tracking people in situations where appearance is not discriminating or changes over time, such as changes in clothing. Using a neural network to relate person detections to IMU orientations and a graph labelling problem for global consistency between video and inertial data, the method overcomes the limitations of video-only approaches. On a challenging new dataset that includes both video and IMU recordings, the method achieved an impressive average IDF1 score of 91.2%, demonstrating its effectiveness in situations where it is feasible to equip people with inertial sensors.

Multiple Object Tracking and Segmentation
Zhenbo Xu et al. [51] (PointTrackV2) stand out with an innovative method that converts compact image representations into unordered 2D point clouds, facilitating the rigorous separation of foreground and background areas from instance segments. This process is enriched by a variety of data modalities to enhance point features. PointTrackV2 surpasses existing methods in efficiency and effectiveness, achieving speeds close to real time (20 FPS) on a single 2080Ti GPU. In addition, the study introduces the APOLLO MOTS dataset, more challenging than KITTI MOTS, with a higher density of instances. Extensive evaluations demonstrate the superior performance of PointTrackV2 on various datasets, and the study also discusses the applicability of this method in areas beyond tracking, such as detailed image classification, 2D pose estimation and object segmentation in videos.
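Underlying most multi-object tracking pipelines of this kind is a frame-to-frame association step. The following minimal sketch (our illustration, not PointTrackV2's embedding-based association) shows the common greedy IoU variant:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, detections, thresh=0.3):
    """Greedily match track boxes to detections; return {track_id: det_idx}."""
    matches, used = {}, set()
    for tid, tbox in tracks.items():
        best, best_iou = None, thresh
        for j, dbox in enumerate(detections):
            score = iou(tbox, dbox)
            if j not in used and score > best_iou:
                best, best_iou = j, score
        if best is not None:
            matches[tid] = best
            used.add(best)
    return matches
```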


Liqi Yan et al. [52] (STC-Seg) present a novel framework for instance segmentation in videos under a weakly supervised approach. Using unsupervised depth estimation and optical flow, STC-Seg generates efficient pseudo-labels to train deep networks, focusing on the accurate generation of instance masks. One of the main contributions is the 'puzzle loss', which allows end-to-end training using box-level annotations. In addition, STC-Seg incorporates an advanced tracking module that utilizes diagonal points and spatio-temporal discrepancy, increasing robustness against changes in object appearance. This method demonstrates exceptional performance, outperforming supervised alternatives on the KITTI MOTS and YT-VIS datasets, evidencing the effectiveness of weakly supervised learning in segmenting instances in videos.

Improvements in Annotation and Efficiency
Le et al. [53] propose an interactive and self-supervised annotation framework that significantly improves the efficiency of creating object bounding boxes in videos. Based on two main networks, Automatic Recurrent Annotation (ARA) and Interactive Recurrent Annotation (IRA), the method iterates over the improvement of a pre-existing detector by exposing it to unlabeled videos, generating better pseudo ground truths for self-training. IRA integrates human corrections to guide the detection network, using a Hierarchical Correction module that progressively reduces the distance between annotated frames with each iteration. This system has proven capable of generating accurate, high-quality annotations for objects in videos, substantially reducing annotation time and costs.

Sambaturu et al. [54] present an interactive annotation method called ScribbleNet, designed to improve the annotation of complex urban images for semantic segmentation, crucial in autonomous navigation systems. This technique offers a pre-segmented image, which iteratively improves segmentation using scribbles as input. Based on conditional inference and exploiting correlations learnt by deep neural networks, ScribbleNet significantly reduces annotation time: up to 14.7 times faster than manual annotation and 5.4 times faster than current interactive methods. In addition, it integrates with the LabelMe image annotation tool and will be made available as open-source software, notable for its ability to work with scenes in unknown environments, annotate new classes and correct multiple labels simultaneously.

Zhu et al. [55] present an accurate method for reconstructing 3D bounding boxes of vehicles in order to obtain detailed spatial-temporal information about vehicle loads on bridges. The study uses a deep convolutional neural network (DCNN) and the You Only Look Once (YOLO) detector to detect vehicles and obtain 2D bounding boxes. A model for reconstructing the 3D bounding box is proposed, making it possible to determine the sizes and positions of vehicles. Spatial-temporal information on vehicle loads is obtained using multiple object tracking (MOT). The system developed, the Bridge Vehicle Load Identification System (BVLIS), was tested on an operating cable-stayed bridge, demonstrating the accuracy and reliability of the method. This approach is innovative in that it combines deep learning-based vehicle detection, camera calibration and 3D bounding box reconstruction, providing an effective alternative to conventional methods for assessing the condition of bridges and their behavior under vehicle loads.

Liu et al. [56] propose a novel technique for segmenting and tracking instances in microscopy videos without the need for manual annotation. Using adversarial simulations and pixel-embedding-based learning, the ASIST method is able to simulate variations in the shape of cellular and subcellular objects, overcoming the challenge of consistent annotations required by traditional methods. The study demonstrates that ASIST achieves a significant improvement over supervised approaches, showing superior performance in segmentation, detection and tracking of microvilli and comparable performance in videos of HeLa cells. This method represents a breakthrough in the quantitative analysis of microscopy videos, offering an efficient and automated solution for the quantification of cellular and subcellular dynamics without labor-intensive manual annotation.

Fahad Lateef et al. [57] propose an innovative object identification framework (FOI) for autonomous vehicles, focusing exclusively on camera data to detect and analyze objects in urban driving scenarios. This framework uses image registration algorithms and optical flow estimation to compensate for self-motion and extract accurate motion information from moving objects from a mobile camera. At the heart of this system is a moving object detection (MOD) model, which combines an encoder-decoder network with a semantic segmentation network to perform two crucial tasks: the semantic segmentation of objects into specific classes and the binary classification of pixels to determine whether they belong to moving objects. In addition, the article presents a unique dataset for detecting moving objects, covering a variety of dynamic objects. The experiments demonstrate the effectiveness of the proposed framework in providing detailed semantic information about objects in urban driving environments.

Zeren Chen et al. [58] present Siamese DETR, a new method for self-supervised pre-training of DETR (DEtection TRansformer) models, introduced at the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). This study proposes combining the Siamese network with DETR's cross-attention mechanism, focusing on learning view-invariant and detection-oriented representations. The method achieved state-of-the-art transfer performance on the COCO and PASCAL VOC detection benchmarks. The team highlights the effectiveness and versatility of Siamese DETR, demonstrating significant improvements in localization accuracy and acceleration in convergence. However, Siamese DETR relies on a pre-trained CNN, such as SwAV, and future work aims to integrate the CNN and Transformer into a unified training paradigm.
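The ARA-style self-training loop of [53] can be summarized in Python-like pseudocode. This is a hedged sketch of the general idea only; `detector.predict`, `train` and the confidence threshold are illustrative placeholders, not the authors' implementation:

```python
def self_training_round(detector, unlabeled_frames, conf_thresh=0.8):
    """One round: mine confident detections as pseudo-labels, then retrain."""
    pseudo_labels = []
    for frame in unlabeled_frames:
        detections = detector.predict(frame)          # [(box, score, cls), ...]
        confident = [d for d in detections if d[1] >= conf_thresh]
        if confident:
            pseudo_labels.append((frame, confident))  # treated as ground truth
    detector = train(detector, pseudo_labels)         # fine-tune on pseudo-labels
    return detector

# Human-in-the-loop (IRA-style) variant: between rounds, an annotator corrects
# a subset of pseudo-labels, which are then mixed into the training set.
```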


TABLE II. Selected object tracking articles with their respective authors, year of publication, methodology/algorithms, main area, application in image annotation, and real-time capacity.

Year, Authors | Methodology/Algorithms | Main Area | Application in Image Annotation | Real-time Capacity
2022, Tao Yu et al. [46] | Instance Tracking Head (ITH), Scaled-YOLOv4, learning-based similarity metric | Detecting and tracking polyps | Yes, for annotation and tracking of polyps in colonoscopy videos | Yes
2020, Shaopan Xiong et al. [47] | Object segmentation approach in video based on tracking, combining Box2Segmentation with general object tracking | Object segmentation in videos | Yes, for segmenting individual objects in videos | Not specified
2023, Weiming Hu et al. [48] | SiamMask: object tracking and real-time video segmentation with Siamese networks trained offline | Object tracking and segmentation in videos | Yes, for real-time object tracking and segmentation in videos | Yes, processes around 55 frames per second
2021, Dominik Schörkhuber et al. [49] | Semi-automatic video annotation method using object detection and tracking | Semi-automatic video annotation, with a focus on night driving | Yes, for analysing night-time driving data and developing computer vision algorithms for autonomous driving | Not specified
2021, Zhenbo Xu et al. [51] | PointTrackV2, image conversion to 2D point clouds, SpatialEmbedding for instance segmentation | Multi-object tracking and segmentation (MOTS) | Yes, for automatic annotation in scenes with multiple moving objects | Yes, 20 FPS on a 2080Ti GPU
2020, Trung-Nghia Le et al. [53] | Interactive Self-Annotation (ISA) framework based on recurrent self-supervised learning, with Automatic Recurrent Annotation (ARA) and Interactive Recurrent Annotation (IRA) processes, and a Hierarchical Correction module | Automatic bounding box annotation in videos | Video annotation for moving objects, especially in the context of autonomous driving and intelligent transport systems | Yes, focused on reducing annotation time and human effort
2023, Bhavani Sambaturu et al. [54] | Latent feature perturbation in DNNs for efficient interactive annotation; integration with the LabelMe software | Interactive image annotation for semantic segmentation in urban scenes | Efficient annotation of urban images, reducing time and human effort, with the capacity to correct multiple labels simultaneously | Yes, with a focus on reducing annotation time
2021, Jinsong Zhu et al. [55] | YOLO-v4 for vehicle detection and 3D bounding box reconstruction | Vehicle detection and tracking | Computer vision for monitoring vehicle loads on bridges | Yes
2021, Quan Liu et al. [56] | CycleGAN for image-annotation synthesis and RSHN for pixel embedding; includes annotation deformation strategies for HeLa cells | Computer vision and deep learning in cell biology and microscopy | Facilitates segmentation and tracking in microscopy videos without manual annotation | Not specified
2023, Fahad Lateef et al. [57] | Framework for object identification in autonomous vehicles using cameras; image registration and optical flow for motion analysis; combines moving object detection with semantic segmentation and encoder-decoder techniques, employing the Semi-Global Matching algorithm for depth estimation | Computer vision in autonomous driving | Identification and classification of moving objects in urban scenarios | Not specified
2020, Roberto Henschel et al. [50] | Neural network to associate person detections with IMU orientations, formulation of a graph labelling problem, perspective correction (PC) algorithm, and integration of IMU signals for trajectory reconstruction | Tracking multiple people in videos with wearable IMU sensors | Identification and long-term tracking of people in videos, useful for behavioural and sports analysis | Not specified
2023, Zeren Chen et al. [58] | Siamese self-supervised pre-training approach for the Transformer architecture in DETR, emphasizing view-invariant and detection-oriented representations; implements two self-supervised pre-training tasks: Multi-View Region Detection and Multi-View Semantic Discrimination | Self-supervised learning and object detection using Transformers | Improved object detection and semantic discrimination in images, useful for object recognition and tracking tasks | Not specified
2023, Liqi Yan et al. [52] | STC-Seg, video instance segmentation, unsupervised depth estimation, optical flow learning, spatio-temporal collaboration | Instance segmentation, weakly supervised segmentation | Yes, for video instance segmentation | Not specified
2020, Thiago T. Santos et al. [59] | CNNs, Mask R-CNN, YOLO, and three-dimensional association | Object detection, instance segmentation, object tracking | Public dataset for detecting and segmenting grape bunches | Not specified
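A practical endpoint for the tracking-based pipelines in Table II is exporting propagated boxes as a standard annotation file. The following minimal sketch writes COCO-style records; the category id and file naming are illustrative assumptions:

```python
import json

def boxes_to_coco(boxes_per_frame, category_id=1):
    """Convert per-frame (x, y, w, h) tracker output into COCO-style records."""
    images, annotations = [], []
    ann_id = 1
    for frame_idx, box in enumerate(boxes_per_frame):
        images.append({"id": frame_idx, "file_name": f"frame_{frame_idx:06d}.jpg"})
        if box is None:               # frames the tracker lost still need review
            continue
        x, y, w, h = box
        annotations.append({
            "id": ann_id, "image_id": frame_idx, "category_id": category_id,
            "bbox": [x, y, w, h], "area": w * h, "iscrowd": 0,
        })
        ann_id += 1
    return {"images": images, "annotations": annotations,
            "categories": [{"id": category_id, "name": "object"}]}

# Usage: json.dump(boxes_to_coco(annotations), open("annotations.json", "w"))
```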


C. IMAGE RETRIEVAL
In this section, significant advances in the image retrieval field are discussed, with the intent of presenting a wide range of algorithms and approaches to this technique. These range from methods that solve established problems in image retrieval to new approaches with the potential to innovate the field, each offering solutions to specific challenges within diverse application contexts. These algorithms are presented in Table III with their respective research articles.

Advances in CBIR and Overcoming the Semantic Gap
Faritha Banu et al. [60]: This paper proposes an innovative CBIR system that incorporates both model and content annotations within an ontology framework. Using distinct visual features such as color and texture, the system applies advanced image segmentation techniques and extracts features using grid-based color histograms and texture analysis. This multi-faceted approach not only significantly improves the accuracy and speed of image retrieval, but also addresses the semantic gap, which is especially useful in medical image search and retrieval contexts.
innovative solution to the challenge of the semantic gap in gap reduction.
social image retrieval. Combining multiple visual features
and textual matching, the model employs ontologies and Feature Fusion with Deep Learning
linked open data to perform automatic, semantic annotation Umer Ali Khan and Ali Javed [66]: In this paper, the authors
of images. Using the Stanford Parser, the study extracts develop a hybrid CBIR system that integrates Local Tetra
candidate phrases related to images and maps them to entities Angle Patterns (LTAPs) and color moment features for more
in DBpedia, facilitating the generation of accurate RDF efficient image retrieval. The combination of these textural
annotations. The method was validated on NBA blogs, and chromatic features, together with the application of a
demonstrating significant improvements in the accuracy and genetic algorithm for attribute selection, results in a hybrid
relevance of search results. feature vector that significantly improves image retrieval
Ahmed et al. [62]: The article describes the development of performance. This system effectively addresses the semantic
the Deep-view Linguistic and Inductive Learning (DvLIL) gap and offers a robust solution for image retrieval in large
framework, which stands out for combining visual and databases.
textual modalities to improve image retrieval. Using state- Yikun Yang et al. [67]: The paper proposes an advanced
of-the-art techniques such as ResNet-50 and BERT, the image retrieval algorithm that utilizes a deep content-based
system extracts detailed visual features and generates quality model, integrating DNN-based saliency prediction
semantic and contextual representations of the text. The and image quality assessment (IQA). This method selects
fusion of these features, realized through a sequence of high-quality salient regions and concatenates them in a way
multilayer perceptrons based on inductive learning, that mimics human visual perception, improving image
demonstrates remarkable effectiveness on challenging retrieval in large datasets. The study demonstrates that this
datasets, offering a more adaptable and robust solution for approach outperforms several state-of-the-art algorithms,
image retrieval compared to traditional CBIR systems. offering an effective solution to the semantic gap in large-
P. Das and A. Neelima [63]: This paper introduces a robust scale image retrieval.
methodology for biomedical image retrieval, centered on the "DenseBert4Ret" by Zafran Khan et al [68]: This study
use of a feature vector that combines Zernike moments, develops an image retrieval system based on multimodal
curvlet features and gradient orientation. This holistic content, using the integration of DenseNet for visual feature
approach not only captures texture and shape information extraction and BERT for textual analysis. This bi-modal
effectively, but also minimizes redundant data. The system simultaneously processes images and text as queries,
methodology has been validated on four biomedical improving accuracy in retrieving images that match the
databases, demonstrating a superior retrieval rate, which combination of users' visual and textual desires. The
represents a significant advance in overcoming the semantic approach demonstrates superiority on real-world datasets,
gap in medical images. highlighting the potential of deep learning in creating joint
image and text representations.


TABLE III. Selected image retrieval articles with their respective authors, year of publication, methodology/algorithms, main area, application in image annotation, and datasets.

Year, Authors | Methodology/Algorithms | Main Area | Application in Image Annotation | Dataset
2022, J. Faritha Banu et al. [60] | Image segmentation and feature extraction using grid-based colour histogram and texture techniques | Content-Based Image Retrieval (CBIR) | Yes, for image annotation and retrieval | WANG
2020, Yi-Hui Chen et al. [61] | Automatic semantic annotation of images, natural language analysis, candidate phrase extraction, RDF (Resource Description Framework), SPARQL, LSI (Latent Semantic Indexing) | Social image retrieval | Yes, for automatic semantic annotation and identification of semantic intentions in social images | NBA blogs (January 2015 to November 2015), with manual RDF annotations
2020, Binqiang Wang et al. [64] | Recurrent Topic Retrieval Memory Network (RTRMN), recurrent neural network, memory network, convolutional max-pooling | Remote sensing image processing, caption generation | Yes, to generate automatic semantic descriptions of remote sensing images | UCM-Captions and RSICD (Remote Sensing Image Captioning Dataset)
2023, Myasar Mundher Adnan et al. [65] | ResNet50-SLT, word2vec, principal component analysis (PCA), t-SNE | Automatic image annotation, deep learning | Yes, by improving accuracy in image annotation | Corel-5K, ESP-Game and Flickr8k
2021, Mona Zamiri et al. [70] | Multi-View Robust Spectral Clustering (MVRSC), Maximum Correntropy Criterion, half-quadratic optimisation framework | Image annotation, semantic retrieval | Model for image annotation based on multi-view fusion | Flickr, 500PX and Corel-5K
2022, Jordão Bragantini et al. [71] | Interactive image segmentation annotation guided by feature space projection, metric learning, dimensionality reduction | Interactive image segmentation, interactive machine learning | Method for mass annotation of images through projection in feature space | iCoSeg, DAVIS, Rooftop and Cityscapes
2023, Ikhlaq Ahmed et al. [62] | ResNet-50 and BERT for image and text feature extraction, with inductive learning for feature fusion | Content-Based Image Retrieval (CBIR) | Retrieval of modified images on e-commerce platforms, using deep learning | Fashion-200K and MIT-States
2022, Umer Ali Khan et al. [66] | Local tetra angle patterns (LTAP) and colour moment features to improve image retrieval accuracy, optimised with a genetic algorithm | Content-Based Image Retrieval (CBIR) | Efficient image retrieval on social media platforms, using advanced colour and texture features | Corel-1K, Oxford Flower and CIFAR-10
2020, Yikun Yang et al. [67] | DNN and CNN for saliency prediction and acquisition of deep image representations | Content-Based Image Retrieval (CBIR) | Retrieval of quality images from large databases using deep learning | ImageNet, Caltech256 and CIFAR-10
2021, Jhilik Bhattacharya et al. [72] | Capsule networks and decision fusion with W-DCT and RBC for classification and retrieval of medical images | Content-based medical image retrieval (CBIR) | Capsule architecture for accurate retrieval and classification of medical images in large databases | IRMA (Image Retrieval in Medical Applications) and ImageCLEFMed-2009
2021, Dhupam Bhanu Mahesh et al. [73] | OLWGP descriptor for data retrieval and classification, using a heuristic J-BMO algorithm for optimal feature point selection and an optimised CNN for classification | Content-Based Image Retrieval (CBIR) and medical image classification | Optimised OLWGP and CNN descriptors for accurate retrieval and classification of medical images in large databases | Kaggle datasets: CT (computerised tomography), CT head, fundus iris (DIARETDB1), mammogram breast (MIAS), MRI brain, US (ultrasound), X-ray bone, X-ray chest and X-ray dental
2022, Zafran Khan et al. [68] | DenseNet to generate visual characteristics of images and BERT for text embeddings; deep learning for joint image and text representation | Content-Based Image Retrieval (CBIR) and medical image classification | Multi-modal CBIR that processes image and text queries to retrieve images from a substantial database, adjusting to the wishes expressed in the query | Fashion200k, MIT-States and FashionIQ
2022, Anna Guan et al. [69] | DenseNet-121 model pre-trained with the C2L method; introduction of interpretable saliency maps; fusion of global and local features; definition of three loss functions to optimise hash codes | Hash-based medical image retrieval, with a focus on chest X-rays | Improving accuracy in medical image retrieval, with special attention to injured areas in chest X-rays | ChestX-ray8
2021, P. Das et al. [63] | Robust descriptors using Zernike moments, curvelet features and gradient orientation for biomedical image retrieval | Biomedical image retrieval | Robust descriptors for effective retrieval of biomedical images from large databases | HRCT dataset, Emphysema CT database, OASIS MRI database and NEMA MRI database
2023, Felipe Cadar et al. [74] | Learned keypoint detection method for non-rigid image matching, using an end-to-end convolutional neural network (CNN) | Keypoint detection, non-rigid image matching | Improvement in deformable object matching and object retrieval through learnt keypoint detection, improving accuracy in non-rigid images | HRCT dataset, Emphysema CT database, OASIS MRI database and NEMA MRI database
2021, Seyed Mahdi Roostaiyan et al. [75] | Coupled dictionary learning with marginalised loss function and L1 regularisation | Machine learning and image processing | Improvement of image annotation through coupled dictionary learning, addressing label imbalance | IAPRTC-12, FLICKR-60K and FLICKR-125K
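Several of the deep CBIR systems in Table III reduce, at their core, to embedding images and ranking by similarity. The following is a hedged minimal sketch using a pre-trained torchvision ResNet-50; the weights API and the nearest-neighbour label suggestion are our illustrative assumptions, not any single reviewed method:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

weights = models.ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()        # keep the 2048-d penultimate embedding
backbone.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(pil_image):
    """L2-normalized embedding of one PIL image."""
    return F.normalize(backbone(preprocess(pil_image).unsqueeze(0)), dim=1)

def nearest_annotations(query_emb, index_embs, index_labels, k=5):
    """index_embs: (N, 2048) tensor of stored embeddings; returns the labels
    of the k most similar annotated images as annotation suggestions."""
    sims = query_emb @ index_embs.T      # cosine similarity (unit vectors)
    topk = sims.topk(k, dim=1).indices[0]
    return [index_labels[i] for i in topk.tolist()]
```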

Guan et al. [69]: The article presents an advanced method for retrieving medical images, focusing on feature fusion and information interpretability. Using the DenseNet-121 model to learn relevant medical features without the need for manual annotation, the method applies interpretable saliency maps and integrates global and local networks to extract complete information, resulting in a significant improvement in the accuracy of retrieval results. These advances promise valuable applications in computer-aided diagnosis systems.

Specific Applications and Innovations in Image Annotation
Mona Zamiri and Hadi Sadoghi Yazdi [70]: This study introduces the Multi-View Robust Spectral Clustering (MVRSC) method for image annotation, modelling the relationship between semantic and multi-features of training images. Using the Maximum Correntropy Criterion and semi-quadratic optimization, the method suggests tags based on a new fusion distance at the decision level. Experimental results on real datasets demonstrate the method's effectiveness in generating accurate and meaningful annotations, integrating geographic and visual information.

Bragantini et al. [71]: In this article, the authors propose an innovative approach to interactive image annotation, allowing the simultaneous annotation of segments of multiple images through projection onto the feature space. This technique results in a faster process and avoids redundancies by annotating similar components in different images. The results show a significant improvement in the efficiency of image annotation, suggesting possibilities for integration with other existing image segmentation methodologies.

Jhilik Bhattacharya et al. [72]: The paper proposes an advanced approach to medical image search, utilizing capsule architecture and decision fusion to address challenges such as data imbalance, insufficient labels and obscured images. Tested on the IRMA dataset, the method demonstrates superior retrieval performance, significantly improving diagnostic efficiency by grouping similar images for automatic retrieval and annotation.

Dhupam Bhanu Mahesh et al. [73]: This paper presents a medical image retrieval and classification model based on the Optimized Local Weber Gradient Pattern (OLWGP), using a new heuristic algorithm to improve image retrieval. The study also employs an optimized CNN model for image classification, demonstrating superior performance on several public databases and offering significant advances in medical image retrieval and classification.

Felipe Cadar et al. [74]: The study presents a novel technique for keypoint detection in non-rigid images using a CNN trained with true correspondences. This method improves not only the accuracy of the matches but also the efficiency of object retrieval, representing a significant advance in the detection of keypoints in non-rigid images and in improving the matching performance of existing descriptors.

Seyed Mahdi Roostaiyan et al. [75]: This paper introduces Marginalised Coupled Dictionary Learning (MCDL) as a new approach for real-time image annotation. Focusing on learning a limited number of visual prototypes and their associated semantics, the method overcomes common challenges in image annotation by offering an efficient and fast solution with a publicly available implementation.

V. DISCUSSION
This discussion section aims to further analyze the advances and implications of object tracking and image retrieval technologies, with a special focus on their practical applications in various domains and the significant real-world impact that these techniques have demonstrated. Given the insights provided by the studies analyzed, we have undertaken a comparative assessment of the techniques studied, highlighting their strengths, weaknesses and suitability for different application scenarios. This analysis not only illuminates the unique contributions of each method, model or algorithm, but also sheds light on the synergies and challenges that arise when integrating these technologies to solve complex real-world problems.

We recognize the complexity and depth of the topics covered by this review and have therefore expanded our discussion to provide a more nuanced view of the limitations and challenges faced by current methodologies. In addition, we will discuss the potential interdisciplinary applications of these technologies in more detail, highlighting areas that go beyond those primarily considered in the review. This includes exploring how object tracking and image retrieval can be innovatively applied in fields such as health, public safety and environmental conservation, where they have the potential to promote significant advances.

A. OBJECT TRACKING
The field of object tracking has been marked by significant innovations, especially with the application of advanced neural networks and deep learning techniques. The introduction of the Instance Tracking Head (ITH) by Tao Yu et al. [46] exemplifies this evolution, offering a notable improvement in tracking accuracy in medical contexts, such as colonoscopy videos. This innovation underlines the ability of these new technologies to adapt to specialized applications where precision is crucial.
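To make this kind of tracking-assisted annotation concrete, the sketch below propagates a single human-drawn box through a video with an off-the-shelf tracker, so that the annotator reviews and corrects rather than drawing every frame. It is a minimal illustration, not the pipeline of any reviewed paper; it assumes opencv-contrib-python, where the CSRT tracker is created with cv2.TrackerCSRT_create (cv2.legacy.TrackerCSRT_create in some builds), and the video path is a placeholder.

import cv2

def propagate_box(video_path, first_box):
    """Carry a manual (x, y, w, h) box from frame 0 through the video."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        raise IOError('Cannot read ' + video_path)
    tracker = cv2.TrackerCSRT_create()   # cv2.legacy.TrackerCSRT_create in some builds
    tracker.init(frame, first_box)
    boxes = [first_box]                  # frame 0 keeps the manual annotation
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, box = tracker.update(frame)
        # A lost track is a natural point to ask the annotator to re-draw.
        boxes.append(tuple(int(v) for v in box) if found else None)
    cap.release()
    return boxes                         # one box (or None) per frame

A failed update, rather than silently producing a wrong box, flags the frame for human review, which is the behaviour semi-automatic annotation tools generally aim for.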


Advancing the complexity of applications, Shaopan Xiong et al. [47] explored the use of Siamese networks to improve object segmentation in videos. This approach not only strengthens tracking accuracy, but also highlights the versatility of modern techniques in dealing with dynamic and complex scenes. The convergence of these technologies points to a horizon where object tracking can be adapted to a wider range of scenarios, from controlled environments to busy urban contexts.

The accuracy and adaptability of object tracking in adverse conditions represent ongoing challenges, as demonstrated by Dominik Schörkhuber et al. [49] in their studies of night-time driving videos. This work illustrates the importance of developing systems that can operate efficiently under variations in visibility, a critical factor for security and monitoring applications.

Collaboration between humans and artificial intelligence has emerged as a recurring theme, with studies such as those by Trung-Nghia Le et al. [53] and Bhavani Sambaturu et al. [54] highlighting collaborative approaches to image annotation and analysis. This human-machine interaction suggests a future in which the precision and efficiency of AI can be combined with human sensitivity and discernment to create more robust and accurate solutions in a variety of applications.

The expansion to multiple-object tracking and segmentation, as demonstrated by Zhenbo Xu et al. [51] and Liqi Yan et al. [52], opens up new possibilities for real-time monitoring and analysis of complex scenes. These techniques, which transform images into more malleable representations such as point clouds, highlight the potential of deep learning to extract and analyze information in an efficient and innovative way.

The challenge of detecting and analyzing movements in dynamic scenarios is addressed by Fahad Lateef et al. [57] and Roberto Henschel et al. [50], who apply object tracking to contexts of urban mobility and human interactions, respectively. These studies illustrate how technology can be adapted to improve public safety and understand complex behavior in crowded environments.

Finally, the diversity of applications and continuous innovation in object tracking, as reflected by the works of Zeren Chen et al. [58] and Thiago T. Santos et al. [59], highlight the importance of ethical approaches, especially in public contexts where privacy and consent are paramount concerns. The evolution of this technology not only promises improvements in a variety of fields, but also imposes the need for careful reflection on its responsible use.

B. IMAGE RETRIEVAL
The evolution of Content-Based Image Retrieval (CBIR) has been driven by significant technological advances, as demonstrated by J. Faritha Banu et al. [60], who developed an innovative CBIR system using ontologies to integrate model and content annotations. This system not only improves the accuracy and speed of image retrieval, but also paves the way for practical applications, especially in medical fields where precision in image search and retrieval is vital.

This quest for accuracy and efficiency is complemented by the efforts of Yi-Hui Chen et al. [61] to address the “semantic gap” in social image retrieval by combining multiple visual features and textual matching. This breakthrough highlights a growing trend in CBIR: the integration of multiple data modalities to enrich the retrieval process, making the results more aligned with the users' intentions.

In this context of multimodal enrichment, Binqiang Wang et al. [64] made progress with the Recurrent Topic Retrieval Memory Network (RTRMN), which generates accurate captions for remotely sensed images. This development highlights the importance of contextualization and detail in the generation of captions, which are crucial aspects for the interpretation and use of images in areas such as environmental and geographical research.

The integration of multimodalities and technological innovation, as seen in the work of Ikhlaq Ahmed et al. [62] and Zafran Khan et al. [68], exemplifies how the combination of in-depth visual features and semantic textual representations can refine image retrieval. This approach not only improves accuracy, but also personalizes the image retrieval experience, adapting to the specific needs of users in a variety of contexts, from e-commerce to multimedia.

As we explore specific applications and advances in segmentation and classification, studies such as those by Mona Zamiri et al. [70] and Jhilik Bhattacharya et al. [72] bring to light innovative methods that improve image annotation and retrieval in urban and medical contexts. They use advanced clustering techniques and capsule networks to model semantic relationships and multiple features, demonstrating the adaptability of these technologies to specific annotation and retrieval needs.

However, beyond specific applications, CBIR faces the ongoing challenge of detecting and analyzing complex patterns in images. Guan et al. [69], for example, focus on medical image retrieval, using hashing techniques based on feature fusion and interpretability to better represent injured areas on X-rays. This approach not only advances computer-aided diagnosis, but also emphasizes the importance of interpretable and transparent systems.

Looking to the future, CBIR should continue to explore data fusion and deep contextualization. Deep learning, exemplified by Umer Ali Khan and Ali Javed [66] and Yikun Yang et al. [67], promises to transform image retrieval by dynamically adapting to a variety of contexts and user requirements. Furthermore, the emphasis on interpretation and user interaction, as evidenced by Seyed Mahdi Roostaiyan et al. [75], highlights the need for methods that can effectively deal with unbalanced labels and sparse data.

Thus, the trajectory of CBIR is marked by an intersection of technological innovation, practical applicability and integration challenges. Seyed Mahdi Roostaiyan et al.'s approach [75], which introduces marginalized coupled dictionary learning for real-time image annotation, illustrates the need for adaptive approaches capable of dealing with the diversity and complexity of image datasets, while maintaining computational efficiency and the relevance of retrieval results.


Continued innovation in CBIR, particularly in the integration of advanced deep learning techniques and multimodal analysis, as demonstrated by Felipe Cadar et al. [74] in their research on keypoint detection in non-rigid images, highlights the potential for significant advances in image retrieval accuracy and capacity. The application of these techniques in a variety of contexts, from medical analyses to pattern recognition in remote sensing images, suggests a broad spectrum of possibilities for improving both the granularity and applicability of CBIR.

However, as technologies advance and their applications expand, ethical and privacy considerations emerge, especially in contexts involving sensitive or identifiable data. The need for responsible and transparent approaches to image retrieval is becoming increasingly pressing, emphasizing the importance of incorporating ethical principles into the development and deployment of CBIR systems.

C. POTENTIAL OF COMBINING TECHNIQUES FOR IMAGE ANNOTATION
The fusion of object tracking and image retrieval techniques promises to revolutionize the field of image annotation, offering more sophisticated and efficient methods for identifying and cataloguing visual content. This convergence has the potential to significantly automate the annotation process, improving accuracy and reducing the manual effort required, particularly in large datasets.

In the context of object tracking, the ability to continuously follow an entity through a sequence of images or videos provides a solid basis for dynamic and contextually rich annotations. When integrated with image retrieval systems, this continuous tracking can be enriched with historical or semantic information extracted from extensive databases, allowing for annotations that capture not only the identity of the object, but also its behavior, interactions and evolution over time.

For example, in surveillance video analysis, the combination of these techniques can automate the annotation of activities, identifying and cataloguing specific actions by individuals or vehicles. This not only saves time manually reviewing hours of footage, but also improves search and retrieval capabilities, allowing users to quickly find moments or events of interest based on detailed annotations.

In scientific and environmental research, the combined application of these technologies can facilitate the cataloguing of species or natural phenomena by integrating movement information captured by object tracking with taxonomic or behavioral knowledge derived from image retrieval systems. This can significantly speed up the annotation of large sets of images captured in field studies, allowing researchers to concentrate their efforts on analyzing and interpreting the data.

In the medical field, this integrated approach could transform the way diagnostic images are annotated and stored. By combining the precise tracking of injuries or medical conditions in sequential images with the ability to link these observations to similar or relevant cases in medical literature, annotation systems can provide a wealth of clinical context, potentially revealing previously hidden patterns or correlations.

However, the successful implementation of this integrated approach requires overcoming significant challenges, including managing large volumes of data, the need for high-performance processing algorithms and ensuring accuracy and relevance in the annotations generated. In addition, ethical and privacy issues remain paramount, especially in sensitive applications such as surveillance and medicine.

VI. CONCLUSION
This systematic review investigated the current use of object tracking and image retrieval techniques in automating or assisting image annotation. From the studies analyzed, the answers to the proposed questions are as follows:

Q1: How are object tracking and image retrieval techniques being used to automate or assist in image annotation, and what are the current developments associated with these technologies?
A: Automatic image annotation is an area that continues to evolve with the development of object tracking and image retrieval techniques. These techniques are essential for improving the accuracy and efficiency of annotation, which is important in various applications such as medical diagnosis, urban surveillance and the management of large image databases.

Object tracking and image retrieval techniques, when used for this purpose, can play key roles in automating and assisting image annotation, greatly helping annotators who would otherwise have to work manually. These technologies are valuable not only for improving the efficiency of annotation processes, but also for increasing the accuracy of the annotations generated, which is essential in fields such as medical diagnosis, urban monitoring and automatic multimedia content management.

Object tracking has benefited from the advancement of deep neural networks and sophisticated machine learning methods, which have contributed to automation and accuracy in image annotation. The integration of advanced technologies not only improves the identification and tracking of objects in image sequences, but also facilitates the automatic and continuous annotation of these objects.

An example of this evolution is the work of Tao Yu et al. [46], who developed the “Instance Tracking Head” (ITH), integrated into the Scaled-YOLOv4 detector. This innovation offers improvements in the detection and tracking of objects in medical videos, such as colonoscopies. Improved detection accuracy and continuous tracking enable automatic and accurate annotation of polyps over time, facilitating medical monitoring and analysis by reducing the need for manual annotation, which is often prone to errors and inconsistencies.
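As a toy illustration of the artefact such tracking-driven annotation can produce, and of how retrieved reference material (Section C) might be attached to it, the sketch below defines a per-instance annotation record; every field name here is ours, not taken from [46] or any other reviewed work.

from dataclasses import dataclass, field

@dataclass
class TrackAnnotation:
    track_id: int
    label: str                                         # e.g. 'polyp', 'vehicle'
    boxes: dict = field(default_factory=dict)          # frame index -> (x, y, w, h)
    similar_cases: list = field(default_factory=list)  # ids of retrieved references
    reviewed: bool = False                             # flipped after human check

One record per tracked instance keeps the identity stable across frames, which is exactly what frame-by-frame manual annotation struggles to guarantee.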


Another significant development is SiamMask, created by Weiming Hu et al. [48], which combines object tracking and segmentation in real time. This tool processes video at a rate of 55 frames per second, enabling continuous and automated annotation of fast-moving objects. SiamMask is particularly useful in scenarios that require real-time responses, such as urban surveillance and traffic monitoring, where the precise identification and tracking of objects is essential for security and incident management.

The study by Roberto Henschel et al. [50] offers an advanced method for tracking multiple people using both video and inertial measurement units (IMUs). This method is especially effective in environments where the appearance of individuals changes frequently, such as at sporting events or concerts, allowing accurate annotation of movements and positions without loss of subject identity, even in challenging conditions.

Additionally, Zhenbo Xu et al. [51] with PointTrackV2 transform images into 2D point clouds, which facilitates the segmentation of instances and the tracking of multiple objects in crowded and dynamic environments. This technique enables effective annotation in congested urban areas or at public events, where accurate tracking and annotation of multiple moving objects is crucial for subsequent analyses and decision-making.

These are just a few examples of the advances that demonstrate how object tracking can transform the task of image annotation, making it more efficient and reducing the manual workload. The application of these technologies in a variety of fields, from medicine to public safety, highlights the significant potential for future innovations that can lead to an even deeper understanding and better practices in analyzing visual data.

Continuing the discussion on the automation of image annotation, we have also seen significant advances in image retrieval that complement the improvements brought about by object tracking. Image retrieval techniques have benefited greatly from deep learning and semantic analysis, which increase the accuracy and speed of retrieval and enrich the quality of automatic annotations.

For example, Faritha Banu et al. [60] have developed a system that employs grid-based color histograms and texture analysis within an ontological framework, which not only speeds up the retrieval of medical images but also improves the accuracy of automatic annotations. This advance is particularly important for clinical applications, where accurate annotations can mean better diagnosis and treatment.

Yi-Hui Chen et al. [61] implemented a method that combines visual and textual analysis for semantic retrieval of social images. This method not only improves the relevance and accuracy of annotations, but also facilitates the categorization and retrieval of social content based on clear semantic intent, thus improving the management of large image databases.

Binqiang Wang et al. [64] advanced the automatic generation of captions for remote sensing images through recurrent memory networks, which use common keywords to generate accurate and contextual descriptions. This process is vital for the effective interpretation and use of images in environmental monitoring and urban planning.

Furthermore, Umer Ali Khan and Ali Javed [66] have created a hybrid CBIR system that combines local tetra angle patterns with color moment features to improve image retrieval. This system addresses the challenge of the “semantic gap” found in large image databases by enriching the automatic annotation process with more detailed and accurate features, which is essential for better categorization and use of the retrieved images.

These developments in image retrieval, together with advances in object tracking, are broadening the possibilities for using images in a variety of practical applications, ensuring that visual information is maximized to its full potential. With these advanced technologies, it is possible to automate the annotation of large image datasets, reducing manual labor and increasing the reliability of the information.

Q2: How can the integration of object tracking and image retrieval be optimized to improve the image annotation process?
A: The efficient integration of object tracking and image retrieval techniques represents a significant advance in the field of automatic image annotation. Both techniques have complementary capabilities which, when aligned correctly, can substantially improve the accuracy, efficiency and applicability of image annotation in a variety of contexts.

Existing object tracking techniques usually struggle in certain situations: low light or poor visibility, following an individual through a crowd, abrupt changes of scenery, and occlusion, where overlapping objects momentarily block the target being followed. Although image retrieval cannot solve these problems directly, it can support the annotation of such complicated situations by drawing on previously studied information, providing the annotator not only with different perspectives but also with comparisons that would not otherwise have been observed. A paper that experimented with this approach is that of H. Wei and Y. Huang [76], which shows how the two techniques can be combined for the purpose of autonomous driving; their approach also suggests that the same combination could be used to improve image annotation, a step towards interdisciplinary collaboration in this field. In the following, we present a detailed approach to how this integration can be optimized, exploring connections between the techniques discussed in the selected articles.
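Before going through those connections in detail, a minimal sketch of the overall loop may help: the tracker proposes a box per frame, and when its confidence drops (occlusion, lighting change, crowding) a retrieval index over previously annotated material is queried so that the annotator reviews the frame with similar past cases at hand. Here tracker, embed and index are placeholders for any tracking model, feature extractor and annotated-image index; none of these interfaces come from the reviewed papers.

def annotate_stream(frames, tracker, embed, index, conf_threshold=0.5):
    annotations = []
    for t, frame in enumerate(frames):
        box, conf = tracker.step(frame)      # hypothetical tracker interface
        if conf >= conf_threshold:
            annotations.append((t, box, []))   # tracker output accepted
        else:
            # Low confidence: fetch similar annotated situations so the
            # annotator decides with reference material in front of them.
            refs = index.search(embed(frame), k=3)
            annotations.append((t, None, refs))  # frame flagged for review
    return annotations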


As previously mentioned, object tracking often faces challenges in conditions of poor lighting and visibility. Image retrieval techniques, such as the one presented by Faritha Banu et al. [60], which employ ontologies to improve the accuracy and speed of image retrieval, can be used to complement and enrich the training datasets of tracking models. In addition, the integration of advanced image attributes and semantic annotations extracted through retrieval methods can help tracking models to better adapt to varying conditions, using historical or similar data to adjust their predictions in real time and improve accuracy in challenging environments.

Multiple-object tracking, as explored in works such as that by Zhenbo Xu et al. [51] (PointTrackV2), can benefit significantly from image retrieval. For example, techniques that use textual and visual analysis for semantic annotation of images (Yi-Hui Chen et al. [61]) can be integrated to provide additional context that makes it easier to distinguish between similar objects in crowded scenes. This approach can enable tracking systems to assign more accurate identities and maintain tracking consistency over time, even when objects interact or hide from each other.

Even well-established tracking systems, such as Weiming Hu et al.'s SiamMask [48], can be improved with image retrieval techniques that process contextual and appearance variations. Using advanced image retrieval algorithms that integrate deep and semantic features (such as the DvLIL system by Ikhlaq Ahmed et al. [62]), it is possible to develop an adaptive layer that adjusts the parameters of the tracking model in real time, based on features previously observed in similar situations. This not only improves the robustness of the tracking, but also reduces errors caused by abrupt changes in the scenario or the appearance of objects.

To maximize the benefits of this integration, it is crucial to implement effective synchronization between object tracking and image retrieval systems. This can be achieved by developing integrated frameworks that combine real-time data streams with dynamic access to annotated image databases, allowing for fluid and complementary interaction between the tracking and retrieval processes.

Q3: What are the main challenges and limitations faced when applying these techniques to image annotation?
A: Despite advances in automatic image annotation through object tracking and image retrieval techniques, significant challenges still persist in both areas, impacting the effectiveness of these technologies.

In object tracking, one of the main challenges faced is the management of occlusions, where objects of interest are temporarily blocked by other elements in the scene, complicating their detection and continuous tracking. In addition, rapid variations in the scene, such as sudden changes in lighting or rapid movements of objects, can challenge current algorithms, reducing tracking accuracy. The need for accurate tracking in low-visibility conditions also remains a technical obstacle, especially in applications such as night surveillance or in adverse weather conditions.

In image retrieval, the “semantic gap” (the discrepancy between the visual attributes of retrieved images and the semantic meaning that users attribute to those images) remains a prominent challenge. This gap often results in annotations that do not match user expectations or specific application needs, limiting the practical usefulness of image retrieval systems. Finding more robust and adaptive methods to bridge this semantic gap is crucial to improving the relevance and accuracy of automatically generated annotations.

These challenges highlight the continued need for research and development in the areas of object tracking and image retrieval. Innovative solutions are needed to address these limitations, potentially through the development of more sophisticated algorithms that can better cope with adverse conditions and complex contexts, and more effective semantic processing techniques that better align image retrieval results with user needs. Improvements in these areas will not only advance the state of the art in automatic image annotation, but also expand its practical applications in fields such as medical diagnosis, urban monitoring and multimedia content management.

VII. FUTURE RESEARCH
The evolution of image annotation through object tracking and image retrieval technologies, as systematically analyzed, shows a promising trajectory towards more efficient and accurate machine learning models. Despite notable advances, the field faces challenges that require a research agenda geared towards promoting innovation and addressing the complexities of real-world applications.

A critical area for future exploration lies in improving algorithmic robustness and generalization. Current methodologies demonstrate varying degrees of effectiveness across different datasets and conditions, often struggling with low-visibility scenarios and rapid object movements. Solving these problems requires a concerted effort to develop algorithms that are not only adaptable to diverse environmental conditions, but also capable of learning from limited and unstructured data; a good example of this effort is shown by Dominik Schörkhuber et al. [49]. The integration of unsupervised and semi-supervised learning paradigms could offer a way to reduce dependence on extensively annotated datasets, thus expanding the applicability of these technologies in domains where such data is scarce or difficult to obtain. Each round of automated annotation also increases the amount of annotated data available, creating a cycle in which such algorithms become progressively less constrained by annotation scarcity.

At the same time, the synergy between human expertise and automated systems represents fertile ground for research. The current landscape of image annotation tools reflects a growing recognition of the invaluable role of human intuition and understanding in improving AI-generated annotations. Future research should endeavor to improve this symbiosis by developing more intuitive interfaces and feedback mechanisms that are open source and easy to use. These systems should not only facilitate the incorporation of human corrections, but also learn from these interactions, thus continuously improving the accuracy and relevance of the annotations.

In addition, the imperative need for real-time annotation capabilities cannot be overemphasized, especially in domains that require instant decision-making, such as surveillance and live medical diagnosis. The search for real-time processing solutions requires innovations in terms of computational efficiency and algorithmic speed.


This may involve taking advantage of the growing adoption of edge computing and developing models adapted for use in resource-limited environments, ensuring that the benefits of automated annotation can be realized across a broader spectrum of applications.

The ethical considerations and privacy concerns surrounding the use of these technologies, particularly in sensitive areas such as personal surveillance and healthcare, require rigorous attention. Future research should prioritize the development of ethical frameworks and privacy-preserving mechanisms. This includes exploring advanced data anonymization techniques and secure data-sharing protocols to protect individual privacy, while enabling the beneficial applications of image annotation technologies.

ACKNOWLEDGMENT
National Funds finance this work through the Portuguese funding agency, FCT—Fundacão para a Ciência e a Tecnologia, within project PTDC/EEI-EEE/5557/2020. Co-funded by the European Union (grant number 101095359) and supported by the UK Research and Innovation (grant number 10058099). However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the Health and Digital Executive Agency (HaDEA).

REFERENCES
[1] ‘Machine Learning With Big Data: Challenges and Approaches’, IEEE Access, 2017, doi: 10.1109/ACCESS.2017.2696365.
[2] ‘Machine Learning With Big Data: Challenges and Approaches’, IEEE Access, 2017, doi: 10.1109/ACCESS.2017.2696365.
[3] L. Ren, J. Lu, J. Feng, and J. Zhou, ‘Uniform and Variational Deep Learning for RGB-D Object Recognition and Person Re-Identification’, IEEE Transactions on Image Processing, vol. 28, no. 10, pp. 4970–4983, Oct. 2019, doi: 10.1109/TIP.2019.2915655.
[4] J. Seo and H. Park, ‘Object Recognition in Very Low Resolution Images Using Deep Collaborative Learning’, IEEE Access, vol. 7, pp. 134071–134082, 2019, doi: 10.1109/ACCESS.2019.2941005.
[5] S. H. Kasaei, ‘OrthographicNet: A Deep Transfer Learning Approach for 3-D Object Recognition in Open-Ended Domains’, IEEE/ASME Transactions on Mechatronics, vol. 26, no. 6, pp. 2910–2921, Dec. 2021, doi: 10.1109/TMECH.2020.3048433.
[6] S.-J. Liu, H. Luo, and Q. Shi, ‘Active Ensemble Deep Learning for Polarimetric Synthetic Aperture Radar Image Classification’, IEEE Geoscience and Remote Sensing Letters, vol. 18, no. 9, pp. 1580–1584, Sep. 2021, doi: 10.1109/LGRS.2020.3005076.
[7] K. Muhammad, S. Khan, J. D. Ser, and V. H. C. de Albuquerque, ‘Deep Learning for Multigrade Brain Tumor Classification in Smart Healthcare Systems: A Prospective Survey’, IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 507–522, Feb. 2021, doi: 10.1109/TNNLS.2020.2995800.
[8] W. Teng, N. Wang, H. Shi, Y. Liu, and J. Wang, ‘Classifier-Constrained Deep Adversarial Domain Adaptation for Cross-Domain Semisupervised Classification in Remote Sensing Images’, IEEE Geoscience and Remote Sensing Letters, vol. 17, no. 5, pp. 789–793, May 2020, doi: 10.1109/LGRS.2019.2931305.
[9] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, ‘A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects’, IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 12, pp. 6999–7019, Dec. 2022, doi: 10.1109/TNNLS.2021.3084827.
[10] R. Chauhan, K. K. Ghanshala, and R. C. Joshi, ‘Convolutional Neural Network (CNN) for Image Detection and Recognition’, in 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC), Dec. 2018, pp. 278–282, doi: 10.1109/ICSCCC.2018.8703316.
[11] Y. Xu et al., ‘Transformers in computational visual media: A survey’, Computational Visual Media, vol. 8, no. 1, pp. 33–62, Mar. 2022, doi: 10.1007/s41095-021-0247-3.
[12] Y. Liu et al., ‘A Survey of Visual Transformers’, IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21, 2023, doi: 10.1109/TNNLS.2022.3227717.
[13] K. G. Ince, A. Koksal, A. Fazla, and A. A. Alatan, ‘Semi-Automatic Annotation for Visual Object Tracking’, in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2021, pp. 1233–1239, doi: 10.1109/ICCVW54120.2021.00143.
[14] L. Porzi, M. Hofinger, I. Ruiz, J. Serrat, S. R. Bulo, and P. Kontschieder, ‘Learning Multi-Object Tracking and Segmentation From Automatic Annotations’, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6846–6855, doi: 10.48550/arXiv.1912.02096.
[15] X. Li, L. Chen, L. Zhang, F. Lin, and W.-Y. Ma, ‘Image annotation by large-scale content-based image retrieval’, in Proceedings of the 14th ACM International Conference on Multimedia (MM ’06), New York, NY, USA: Association for Computing Machinery, Oct. 2006, pp. 607–610, doi: 10.1145/1180639.1180764.


[16] D. D. Burdescu, C. G. Mihai, L. Stanescu, and M. Brezovan, ‘Automatic image annotation and semantic based image retrieval for medical domain’, Neurocomputing, vol. 109, pp. 33–48, Jun. 2013, doi: 10.1016/j.neucom.2012.07.030.
[17] O. Pelka, F. Nensa, and C. M. Friedrich, ‘Annotation of enhanced radiographs for medical image retrieval with deep convolutional neural networks’, PLOS ONE, vol. 13, no. 11, 2018, doi: 10.1371/journal.pone.0206229.
[18] L. Porzi, M. Hofinger, I. Ruiz, J. Serrat, S. R. Bulo, and P. Kontschieder, ‘Learning Multi-Object Tracking and Segmentation From Automatic Annotations’, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6846–6855, doi: 10.48550/arXiv.1912.02096.
[19] X. Li, L. Chen, L. Zhang, F. Lin, and W.-Y. Ma, ‘Image annotation by large-scale content-based image retrieval’, in Proceedings of the 14th ACM International Conference on Multimedia (MM ’06), New York, NY, USA: Association for Computing Machinery, Oct. 2006, pp. 607–610, doi: 10.1145/1180639.1180764.
[20] M. M. Adnan et al., ‘Automated Image Annotation With Novel Features Based on Deep ResNet50-SLT’, IEEE Access, vol. 11, pp. 40258–40277, 2023, doi: 10.1109/ACCESS.2023.3266296.
[21] S. A. H. Minoofam, A. Bastanfard, and M. R. Keyvanpour, ‘TRCLA: A Transfer Learning Approach to Reduce Negative Transfer for Cellular Learning Automata’, IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 5, pp. 2480–2489, May 2023, doi: 10.1109/TNNLS.2021.3106705.
[22] Z. Zhu, K. Lin, A. K. Jain, and J. Zhou, ‘Transfer Learning in Deep Reinforcement Learning: A Survey’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 13344–13362, Nov. 2023, doi: 10.1109/TPAMI.2023.3292075.
[23] H. Han, H. Liu, C. Yang, and J. Qiao, ‘Transfer Learning Algorithm With Knowledge Division Level’, IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 11, pp. 8602–8616, Nov. 2023, doi: 10.1109/TNNLS.2022.3151646.
[24] Z. Fan, L. Shi, Q. Liu, Z. Li, and Z. Zhang, ‘Discriminative Fisher Embedding Dictionary Transfer Learning for Object Recognition’, IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 1, pp. 64–78, Jan. 2023, doi: 10.1109/TNNLS.2021.3089566.
[25] H. Shi, J. Li, J. Mao, and K.-S. Hwang, ‘Lateral Transfer Learning for Multiagent Reinforcement Learning’, IEEE Transactions on Cybernetics, vol. 53, no. 3, pp. 1699–1711, Mar. 2023, doi: 10.1109/TCYB.2021.3108237.
[26] W. Zhang, Y. Zhang, and L. Zhang, ‘Multiplanar Data Augmentation and Lightweight Skip Connection Design for Deep-Learning-Based Abdominal CT Image Segmentation’, IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–11, 2023, doi: 10.1109/TIM.2023.3328707.
[27] Y. Ma, M. Liu, Y. Tang, X. Wang, and Y. Wang, ‘Image-Level Automatic Data Augmentation for Pedestrian Detection’, IEEE Transactions on Instrumentation and Measurement, vol. 73, pp. 1–12, 2024, doi: 10.1109/TIM.2023.3336760.
[28] J. Cao, M. Luo, J. Yu, M.-H. Yang, and R. He, ‘ScoreMix: A Scalable Augmentation Strategy for Training GANs With Limited Data’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 7, pp. 8920–8935, Jul. 2023, doi: 10.1109/TPAMI.2022.3231649.
[29] L. Zhang and K. Ma, ‘A Good Data Augmentation Policy is not All You Need: A Multi-Task Learning Perspective’, IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 5, pp. 2190–2201, May 2023, doi: 10.1109/TCSVT.2022.3219339.
[30] X. Wang, X. Wang, B. Jiang, and B. Luo, ‘Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification’, IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 12, pp. 7789–7802, Dec. 2023, doi: 10.1109/TCSVT.2023.3282777.
[31] P. Tian and S. Xie, ‘An Adversarial Meta-Training Framework for Cross-Domain Few-Shot Learning’, IEEE Transactions on Multimedia, vol. 25, pp. 6881–6891, 2023, doi: 10.1109/TMM.2022.3215310.
[32] Y. Cui et al., ‘Uncertainty-Guided Semi-Supervised Few-Shot Class-Incremental Learning With Knowledge Distillation’, IEEE Transactions on Multimedia, vol. 25, pp. 6422–6435, 2023, doi: 10.1109/TMM.2022.3208743.
[33] J. Li, M. Gong, H. Liu, Y. Zhang, M. Zhang, and Y. Wu, ‘Multiform Ensemble Self-Supervised Learning for Few-Shot Remote Sensing Scene Classification’, IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023, doi: 10.1109/TGRS.2023.3234252.
[34] H.-J. Ye, L. Han, and D.-C. Zhan, ‘Revisiting Unsupervised Meta-Learning via the Characteristics of Few-Shot Tasks’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3721–3737, Mar. 2023, doi: 10.1109/TPAMI.2022.3179368.
[35] G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos, ‘Supervised Learning of Semantic Classes for Image Annotation and Retrieval’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 394–410, Mar. 2007, doi: 10.1109/TPAMI.2007.61.
Vasconcelos, ‘Supervised Learning of Semantic


[36] A. Ulges, M. Worring, and T. Breuel, ‘Learning Visual Contexts for Image Annotation From Flickr Groups’, IEEE Transactions on Multimedia, vol. 13, no. 2, pp. 330–341, Apr. 2011, doi: 10.1109/TMM.2010.2101051.
[37] S. Li et al., ‘A Multitask Benchmark Dataset for Satellite Video: Object Detection, Tracking, and Segmentation’, IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–21, 2023, doi: 10.1109/TGRS.2023.3278075.
[38] ‘Multi-Drone-Based Single Object Tracking With Agent Sharing Network’, IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2020.3045747.
[39] X. Zheng, H. Cui, and X. Lu, ‘Multiple Source Domain Adaptation for Multiple Object Tracking in Satellite Video’, IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–11, 2023, doi: 10.1109/TGRS.2023.3336665.
[40] S. Zhang, J. Huang, H. Li, and D. N. Metaxas, ‘Automatic Image Annotation and Retrieval Using Group Sparsity’, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 3, pp. 838–849, Jun. 2012, doi: 10.1109/TSMCB.2011.2179533.
[41] D. Wang, S. C. H. Hoi, Y. He, J. Zhu, T. Mei, and J. Luo, ‘Retrieval-Based Face Annotation by Weak Label Regularized Local Coordinate Coding’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 550–563, Mar. 2014, doi: 10.1109/TPAMI.2013.145.
[42] ‘Supervised Learning of Semantic Classes for Image Annotation and Retrieval’, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2007.61.
[43] M. M. Adnan, M. S. M. Rahim, A. Rehman, Z. Mehmood, T. Saba, and R. A. Naqvi, ‘Automatic Image Annotation Based on Deep Learning Models: A Systematic Review and Future Challenges’, IEEE Access, vol. 9, pp. 50253–50264, 2021, doi: 10.1109/ACCESS.2021.3068897.
[44] U. Ojha, U. Adhikari, and D. K. Singh, ‘Image annotation using deep learning: A review’, in 2017 International Conference on Intelligent Computing and Control (I2C2), Jun. 2017, pp. 1–5, doi: 10.1109/I2C2.2017.8321819.
[45] B. Pande, K. Padamwar, S. Bhattacharya, S. Roshan, and M. Bhamare, ‘A Review of Image Annotation Tools for Object Detection’, in 2022 International Conference on Applied Artificial Intelligence and Computing (ICAAIC), May 2022, pp. 976–982, doi: 10.1109/ICAAIC53929.2022.9792665.
[46] T. Yu et al., ‘An end-to-end tracking method for polyp detectors in colonoscopy videos’, Artificial Intelligence in Medicine, vol. 131, p. 102363, Sep. 2022, doi: 10.1016/j.artmed.2022.102363.
[47] S. Xiong, S. Li, L. Kou, W. Guo, Z. Zhou, and Z. Zhao, ‘Td-VOS: Tracking-Driven Single-Object Video Object Segmentation’, in 2020 IEEE 5th International Conference on Image, Vision and Computing (ICIVC), Jul. 2020, pp. 102–107, doi: 10.1109/ICIVC50857.2020.9177471.
[48] ‘SiamMask: A Framework for Fast Online Object Tracking and Segmentation’, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2022.3172932.
[49] D. Schörkhuber, F. Groh, and M. Gelautz, ‘Bounding Box Propagation for Semi-automatic Video Annotation of Nighttime Driving Scenes’, in 2021 12th International Symposium on Image and Signal Processing and Analysis (ISPA), Sep. 2021, pp. 131–137, doi: 10.1109/ISPA52656.2021.9552141.
[50] R. Henschel, T. Von Marcard, and B. Rosenhahn, ‘Accurate Long-Term Multiple People Tracking Using Video and Body-Worn IMUs’, IEEE Transactions on Image Processing, vol. 29, pp. 8476–8489, 2020, doi: 10.1109/TIP.2020.3013801.
[51] Z. Xu, W. Yang, W. Zhang, X. Tan, H. Huang, and L. Huang, ‘Segment as Points for Efficient and Effective Online Multi-Object Tracking and Segmentation’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6424–6437, Oct. 2022, doi: 10.1109/TPAMI.2021.3087898.
[52] L. Yan, Q. Wang, S. Ma, J. Wang, and C. Yu, ‘Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework With Spatio-Temporal Collaboration’, IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 1, pp. 393–406, Jan. 2023, doi: 10.1109/TCSVT.2022.3202574.
[53] T.-N. Le, S. Akihiro, S. Ono, and H. Kawasaki, ‘Toward Interactive Self-Annotation For Video Object Bounding Box: Recurrent Self-Learning And Hierarchical Annotation Based Framework’, in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Mar. 2020, pp. 3220–3229, doi: 10.1109/WACV45572.2020.9093398.
[54] B. Sambaturu, A. Gupta, C. V. Jawahar, and C. Arora, ‘ScribbleNet: Efficient interactive annotation of urban city scenes for semantic segmentation’, Pattern Recognition, vol. 133, p. 109011, Jan. 2023, doi: 10.1016/j.patcog.2022.109011.


[55] J. Zhu, X. Li, C. Zhang, and T. Shi, ‘An accurate approach for obtaining spatiotemporal information of vehicle loads on bridges based on 3D bounding box reconstruction with computer vision’, Measurement, vol. 181, p. 109657, Aug. 2021, doi: 10.1016/j.measurement.2021.109657.
[56] Q. Liu et al., ‘ASIST: Annotation-free synthetic instance segmentation and tracking by adversarial simulations’, Computers in Biology and Medicine, vol. 134, p. 104501, Jul. 2021, doi: 10.1016/j.compbiomed.2021.104501.
[57] F. Lateef, M. Kas, and Y. Ruichek, ‘Motion and geometry-related information fusion through a framework for object identification from a moving camera in urban driving scenarios’, Transportation Research Part C: Emerging Technologies, vol. 155, p. 104271, Oct. 2023, doi: 10.1016/j.trc.2023.104271.
[58] Z. Chen et al., ‘Siamese DETR’, in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2023, pp. 15722–15731, doi: 10.1109/CVPR52729.2023.01509.
[59] T. T. Santos, L. L. de Souza, A. A. dos Santos, and S. Avila, ‘Grape detection, segmentation, and tracking using deep neural networks and three-dimensional association’, Computers and Electronics in Agriculture, vol. 170, p. 105247, Mar. 2020, doi: 10.1016/j.compag.2020.105247.
[60] J. Faritha Banu, P. Muneeshwari, K. Raja, S. Suresh, T. P. Latchoumi, and S. Deepan, ‘Ontology Based Image Retrieval by Utilizing Model Annotations and Content’, in 2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Jan. 2022, pp. 300–305, doi: 10.1109/Confluence52989.2022.9734194.
[61] Y.-H. Chen, E. J.-L. Lu, and S.-C. Lin, ‘Ontology-based Dynamic Semantic Annotation for Social Image Retrieval’, in 2020 21st IEEE International Conference on Mobile Data Management (MDM), Jun. 2020, pp. 337–341, doi: 10.1109/MDM48529.2020.00074.
[62] I. Ahmed, N. Iltaf, Z. Khan, and U. Zia, ‘Deep-view linguistic and inductive learning (DvLIL) based framework for Image Retrieval’, Information Sciences, vol. 649, p. 119641, Nov. 2023, doi: 10.1016/j.ins.2023.119641.
[63] P. Das and A. Neelima, ‘A Robust Feature Descriptor for Biomedical Image Retrieval’, IRBM, vol. 42, no. 4, pp. 245–257, Aug. 2021, doi: 10.1016/j.irbm.2020.06.007.
[64] B. Wang, X. Zheng, B. Qu, and X. Lu, ‘Retrieval Topic Recurrent Memory Network for Remote Sensing Image Captioning’, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 256–270, 2020, doi: 10.1109/JSTARS.2019.2959208.
[65] M. M. Adnan et al., ‘Automated Image Annotation With Novel Features Based on Deep ResNet50-SLT’, IEEE Access, vol. 11, pp. 40258–40277, 2023, doi: 10.1109/ACCESS.2023.3266296.
[66] U. A. Khan and A. Javed, ‘A hybrid CBIR system using novel local tetra angle patterns and color moment features’, Journal of King Saud University - Computer and Information Sciences, vol. 34, no. 10, Part A, pp. 7856–7873, Nov. 2022, doi: 10.1016/j.jksuci.2022.07.005.
[67] Y. Yang, S. Jiao, J. He, B. Xia, J. Li, and R. Xiao, ‘Image retrieval via learning content-based deep quality model towards big data’, Future Generation Computer Systems, vol. 112, pp. 243–249, Nov. 2020, doi: 10.1016/j.future.2020.05.016.
[68] Z. Khan, B. Latif, J. Kim, H. K. Kim, and M. Jeon, ‘DenseBert4Ret: Deep bi-modal for image retrieval’, Information Sciences, vol. 612, pp. 1171–1186, Oct. 2022, doi: 10.1016/j.ins.2022.08.119.
[69] A. Guan, L. Liu, X. Fu, and L. Liu, ‘Precision medical image hash retrieval by interpretability and feature fusion’, Computer Methods and Programs in Biomedicine, vol. 222, p. 106945, Jul. 2022, doi: 10.1016/j.cmpb.2022.106945.
[70] M. Zamiri and H. Sadoghi Yazdi, ‘Image annotation based on multi-view robust spectral clustering’, Journal of Visual Communication and Image Representation, vol. 74, p. 103003, Jan. 2021, doi: 10.1016/j.jvcir.2020.103003.
[71] J. Bragantini, A. X. Falcão, and L. Najman, ‘Rethinking interactive image segmentation: Feature space annotation’, Pattern Recognition, vol. 131, p. 108882, Nov. 2022, doi: 10.1016/j.patcog.2022.108882.
[72] J. Bhattacharya, T. Bhatia, and H. S. Pannu, ‘Improved search space shrinking for medical image retrieval using capsule architecture and decision fusion’, Expert Systems with Applications, vol. 171, p. 114543, Jun. 2021, doi: 10.1016/j.eswa.2020.114543.
[73] D. Bhanu Mahesh, G. Satyanarayana Murty, and D. Rajya Lakshmi, ‘Optimized Local Weber and Gradient Pattern-based medical image retrieval and optimized Convolutional Neural Network-based classification’, Biomedical Signal Processing and Control, vol. 70, p. 102971, Sep. 2021, doi: 10.1016/j.bspc.2021.102971.
[74] F. Cadar, W. Melo, V. Kanagasabapathi, G. Potje, R. Martins, and E. R. Nascimento, ‘Improving the matching of deformable objects by learning to detect keypoints’, Pattern Recognition Letters, vol. 175, pp. 83–89, Nov. 2023, doi: 10.1016/j.patrec.2023.08.012.


[75] S. M. Roostaiyan, M. M. Hosseini, M. M. Kashani, and S. H. Amiri, ‘Toward real-time image annotation using marginalized coupled dictionary learning’, Journal of Real-Time Image Processing, vol. 19, no. 3, pp. 623–638, Jun. 2022, doi: 10.1007/s11554-022-01210-6.
[76] H. Wei and Y. Huang, ‘Online Multiple Object Tracking Using Spatial Pyramid Pooling Hashing and Image Retrieval for Autonomous Driving’, Machines, vol. 10, no. 8, Art. no. 8, Aug. 2022, doi: 10.3390/machines10080668.

Rodrigo Fernandes completed a bachelor's degree in Biomedical Engineering and is currently finishing a master's degree in biomedical engineering at the University of Trás-os-Montes and Alto Douro. Currently a researcher for Computer Assisted Annotation methods for Capsule Endoscopy Datasets, a project at INESC TEC and the University of Trás-os-Montes e Alto Douro, his work focuses mainly on Object Tracking and Image Retrieval deep learning methods for detecting and annotating gastric lesions.

Alexandre Pessoa graduated in Computer Science from the Federal University of Maranhão (UFMA), holds a Master's degree in Computer Science from the Institute of Mathematics and Statistics of the University of São Paulo (IME-USP), in the area of Artificial Intelligence, and is currently studying for a PhD in Computer Science in the UFMA/UFPI Association Doctoral Programme in Computer Science. He is interested in Artificial Intelligence, Machine Learning, Computer Theory, Digital Image Processing and Computer Vision. He has experience in Digital Image Processing, Computer Vision and Convolutional Neural Networks.

Marta Salgado began her degree in Medicine at the University of Porto in 1997 and completed her specialty in Gastroenterology in 2005. She is a Graduate Hospital Assistant in the Gastroenterology Department of the University Hospital Centre of Porto. She is also a guest lecturer on the Master's Degree in Medicine at the Abel Salazar Biomedical Sciences Institute and the author of dozens of papers presented at scientific meetings and published in scientific journals.

Ishak Pacal completed his bachelor's degree in Computer Engineering at Harran University and obtained his master's degree in Electronic Communications and Computer Engineering from The University of Nottingham. He successfully finished his doctoral studies in real-time polyp detection using deep learning at Erciyes University in 2022. Currently, he serves as an Assistant Professor at Igdir University. With over 15 publications in SCI-indexed journals, his research interests span medical image processing, artificial intelligence in healthcare, and artificial intelligence in agriculture.

Anselmo de Paiva holds a degree in Civil Engineering from the State University of Maranhão (1990), a master's degree in Civil Engineering - Structures from the Pontifical Catholic University of Rio de Janeiro (1993) and a doctorate in Informatics from the Pontifical Catholic University of Rio de Janeiro (2001). He is currently a Full Professor at the Federal University of Maranhão. He is the coordinator of the NCA-UFMA Applied Computing Centre. He has experience in Computer Science, with an emphasis on Graphics Processing, working mainly on the following subjects: Virtual and Augmented Reality, Computer Graphics, GIS, Medical Image Processing and Volumetric Visualisation. He is a member of the SBC (Brazilian Computer Society) and the ACM (Association for Computing Machinery).

Prof. António Cunha is a PhD senior researcher and an Auxiliary Professor at the Engineering department of the University of Trás-os-Montes and Alto Douro (UTAD). Prof. António participated as a member in 7 funded research projects. His research interests are medical image analysis, bio-image analysis, computer vision, machine learning, and artificial intelligence, particularly in Computer-aided Diagnosis applied in several imaging modalities, e.g. computed tomography of the lung and endoscopic videos. He is part of the organization committee of HCIST - International Conference on Health and Social Care Information Systems and Technologies (2013-2015, 2020-2023), and the organization chair (2012) and Advisory Board (2016-2023).
