Enhancing Image Annotation With Object Tracking and Image Retrieval
ABSTRACT Effective image and video annotation is a fundamental pillar in computer vision and artificial intelligence, crucial for the development of accurate machine learning models. Object tracking and image retrieval techniques are essential in this process, significantly improving the efficiency and accuracy of automatic annotation. This paper systematically investigates object tracking and image retrieval techniques, exploring how these technologies can collectively enhance the efficiency and accuracy of the annotation process for image and video datasets. Object tracking is examined for its role in automating annotation by following objects across video sequences, while image retrieval is evaluated for its ability to suggest annotations for new images based on existing data. The review encompasses diverse methodologies, including advanced neural networks and machine learning techniques, and highlights their effectiveness in contexts ranging from medical analysis to urban monitoring. Despite notable advances, challenges such as algorithm robustness and effective human-AI collaboration remain. This review provides insight into the current state and future potential of these technologies for improving image annotation processes, surveying their existing applications and the possibilities they open when combined.
INDEX TERMS Image Annotation, Object Tracking, Image Retrieval, Deep Learning
vision tasks, including image annotation, where they can identify and label objects in images with high precision. More recently, Vision Transformers (ViTs) [11, 12] have emerged as an alternative and robust approach. Inspired by the success of Transformers in natural language processing, ViTs apply attention mechanisms to capture global relationships between different parts of an image. This makes them particularly effective at understanding complex visual contexts, a valuable feature for automatic image annotation.

Accurate data annotation is fundamental in machine learning applications, directly impacting the effectiveness of trained models. Traditionally, annotation is carried out manually, a process that can be slow and subject to inconsistencies. Automating this process, partially or entirely, is therefore a relevant objective for increasing the efficiency and consistency of annotations.

Object tracking involves identifying and following objects over time in videos or image sequences. This technique can be used for automatic annotation [13, 14], following the trajectory of moving objects and marking them in each frame. This approach can reduce the time needed for annotation and improve consistency, especially in contexts with dynamic objects.

Image retrieval, on the other hand, involves searching for and identifying similar images in large databases. Using algorithms that identify common patterns and characteristics, this technique can suggest annotations for new images [15, 16] based on previously annotated data, providing a starting point for annotation.

The joint application of object tracking and image retrieval to image annotation offers a promising approach to automation in computer vision. This systematic review aims to explore the current state of these techniques, assessing how they can be applied to optimize image annotation. The review focuses on analyzing recent studies and practical applications, aiming to provide a detailed overview of the benefits and challenges of these methodologies.

A. MOTIVATION
The growing demand for annotated image datasets in fields such as medicine, security and pattern recognition highlights the importance of efficient and accurate image annotation methods [17]. The motivation for this systematic review arises from the opportunity to explore how object tracking and image retrieval techniques can contribute to this process, offering solutions to existing challenges in manual annotation and providing a more automated and efficient approach [18, 19].

It is important to emphasize that, despite the growing research and development in image annotation, systematic reviews in this area have been remarkably scarce in recent years. This gap in the literature highlights the critical need for a comprehensive review that synthesizes recent advances and contextualizes the current state of the art. This review therefore stands out by compiling and presenting the most current studies, reflecting the significant advances in image annotation with emerging technologies such as object tracking and image retrieval.

Manual image annotation, although traditional, presents significant challenges. These include high time demands, variability in the accuracy and consistency of annotations due to human intervention, and difficulty scaling to large data volumes. Automating image annotation, or at least offering automated assistance in this process, can speed up the work and increase its accuracy and consistency [20]. The application of techniques designed to maximize learning from limited data, such as transfer learning [21, 22, 23, 24, 25], data augmentation [26, 27, 28, 29] and few-shot learning [30, 31, 32, 33, 34], complements this move towards automation. While these methods are valuable for training robust models with sparse annotated datasets, the ultimate goal remains to minimize their need by improving the automation of the annotation process itself. This approach not only addresses the immediate challenges of data scarcity, but also aligns with the long-term vision of creating self-sustaining deep learning ecosystems that can learn and adapt with minimal human oversight.

The emphasis on developing automated annotation systems is particularly pertinent given the exponential increase in digital data [20]. The ability to automatically annotate and categorize this data becomes not only beneficial but essential for its management and value extraction. Automated annotation systems fueled by object tracking and image retrieval therefore represent a significant advance in this regard, offering scalable, efficient and accurate solutions to meet the growing demands of various industries.

Additionally, human-AI collaboration in image annotation introduces unique challenges that deserve further exploration [35, 36]. While AI can significantly improve efficiency and accuracy in identifying and tracking objects in image sequences, properly integrating human judgement and expertise is crucial to ensuring the relevance and semantic accuracy of annotations. The interaction between human annotators and AI systems needs to be intuitive and flexible, allowing for easy corrections and adjustments, and ensuring that human knowledge is effectively incorporated into the annotation process.

Object tracking and image retrieval techniques have already demonstrated their effectiveness in several practical applications, suggesting significant potential for innovation in image annotation. Object tracking can automate the identification and tracking of objects in image sequences, reducing the human effort required to annotate each frame individually [37, 38, 39]. Image retrieval, in turn, can facilitate annotation by identifying similar images with existing annotations, providing a reliable starting point and speeding up the annotation process [40, 41, 42].

Therefore, this review seeks to evaluate and synthesize current knowledge on object tracking and image retrieval in image annotation, identifying potential advances, challenges and opportunities. The aim is to provide a comprehensive understanding of how these techniques can improve the efficiency and accuracy of image annotation in different domains, contributing to advancing research and practice in computer vision and related areas.
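To make the tracking-based annotation workflow concrete, the sketch below propagates a single manually drawn bounding box through a video with OpenCV's off-the-shelf CSRT tracker. It is a minimal illustration of the idea rather than a method from the reviewed studies; the video path and initial box are placeholders, and TrackerCSRT_create requires the opencv-contrib-python package (exposed as cv2.legacy.TrackerCSRT_create in some builds).

# Sketch: propagating one manual bounding-box annotation across a video
# with a generic OpenCV tracker. Illustrative only; a production pipeline
# would add re-detection, drift checks and human review of the output.
import cv2

video = cv2.VideoCapture("clip.mp4")            # placeholder path
ok, frame = video.read()
assert ok, "could not read the first frame"

init_box = (50, 80, 120, 90)                    # (x, y, w, h) drawn by the annotator
tracker = cv2.TrackerCSRT_create()              # cv2.legacy.TrackerCSRT_create in some builds
tracker.init(frame, init_box)

annotations = [{"frame": 0, "bbox": init_box}]  # the single manual annotation
frame_idx = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    frame_idx += 1
    ok, box = tracker.update(frame)             # tracker proposes the next box
    if not ok:
        break                                   # drift or occlusion: hand back to a human
    annotations.append({"frame": frame_idx, "bbox": tuple(int(v) for v in box)})

print(f"auto-annotated {len(annotations)} frames from one manual box")

A single manual box can thus seed annotations for an entire clip, with the human annotator only intervening when the tracker reports failure.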
B. OBJECTIVE AND RESEARCH QUESTIONS
This systematic review aims to explore the use of object tracking and image retrieval techniques to automate or assist in image annotation. The focus is to investigate how the integration of these technologies can optimize the annotation process, enhancing its efficiency and accuracy. The review is guided by the following research questions:

Q1: How are object tracking and image retrieval techniques being used to automate or assist in image annotation, and what are the current developments associated with these technologies?

Q2: How can the integration of object tracking and image retrieval be optimized to improve the image annotation process? This question aims to discover innovative approaches to combining object tracking and image retrieval efficiently. It seeks to understand how the synergy between these two technologies can be maximized to speed up and improve image annotation.

Q3: What are the main challenges and limitations faced when applying these techniques to image annotation? Here, the focus is on identifying the technical and practical challenges and limitations that currently prevent the effective implementation of object tracking and image retrieval in image annotation. This question also seeks to explore potential solutions or approaches to overcome these challenges.

An in-depth understanding of these issues will provide valuable insights into the opportunities, challenges and future directions for using advanced computer vision techniques in image annotation, boosting efficiency and accuracy in various fields of application, such as medical diagnosis, surveillance and large-scale pattern recognition.

II. RELATED WORK
In the dynamic field of image annotation with deep learning, several systematic reviews offer valuable insights and explore different aspects of this evolving domain. Recent studies in this field can be summarized as follows.

Adnan et al. [43] devoted themselves to a comprehensive analysis of Automatic Image Annotation (AIA) methods, with a special emphasis on deep learning models. This review is significant in that it categorizes AIA methods into five distinct categories: CNN-based, RNN-based, DNN-based, LSTM-based and SAE-based. The study not only highlights recent advances in these techniques, but also points to persistent challenges, such as the need for more accurate and efficient techniques to improve automatic image annotation.

Ojha et al. [44] focused specifically on the use of convolutional neural networks (ConvNets) for image annotation. This review details how ConvNets are applied to image content annotation, exploring their ability to extract visual features for complex computer vision tasks. The review highlights the crucial role of ConvNets in object identification and localization, underlining their effectiveness in dealing with visual perception challenges in images.

Pande et al. [45] presented a comparative analysis of a variety of image annotation tools for object detection. This study is notable for its comprehensive approach, evaluating different annotation tools in terms of functionalities, effectiveness and applicability in varied object detection contexts. The review highlights the importance of the appropriate choice of annotation tool, emphasizing that the quality of the annotation has a direct and significant impact on the performance of object detection models.

Existing reviews in the field of image annotation with deep learning, including the work of Adnan et al. [43], Ojha et al. [44] and Pande et al. [45], offer a comprehensive overview of current methodologies and applications, focusing on different aspects of this evolving area. They illustrate the technological advances and the challenges that still need to be overcome, providing an overview of the trends and future directions of this emerging field. However, a notable limitation of these reviews is their tendency to focus on specific types of techniques in isolation, which can limit understanding of the full capability of image annotation technologies. This perspective can prove restrictive, especially when considering the synergistic potential of combining different approaches to tackle complex challenges.

In contrast, our review distinguishes itself by exploring not just one but two complementary techniques: object tracking and image retrieval. By integrating these two approaches, we propose a more holistic view of image annotation, recognizing that combining these technologies can bring significant benefits to the efficiency and accuracy of the annotation process. This integration represents an evolution in the field of image annotation, leveraging the potential of each technique to complement and enrich the other, and opening up new possibilities for significant advances in the automation and accuracy of image annotation.

III. LITERATURE REVIEW METHODOLOGY

A. ELIGIBILITY CRITERIA
To guarantee a relevant and objective selection of studies, we established strict eligibility criteria, detailed in Table I. The primary purpose of these criteria is to select research that effectively addresses the questions proposed by this study, excluding work that falls outside the scope of our research. These parameters were carefully formulated to capture the most pertinent literature and restrict our analysis to documents strictly related to the research questions at hand. Implementing these criteria before the literature search is a crucial strategy for reducing bias in the study selection process.

TABLE I Inclusion (IC) and exclusion (EC) criteria.
IC0: Published since 2020
IC1: The title, abstract, or keywords match the search query
IC2: Work published in a refereed journal or conference
IC3: Direct or indirect applicability of object tracking techniques for image annotation
IC4: Direct or indirect applicability of image retrieval techniques for image annotation
EC0: Work not published in a refereed journal or conference
EC1: Literature/systematic review
EC2: Full text is not available
EC3: The paper is not written in English
EC4: Does not consider the use of object tracking or image retrieval
EC5: Out of scope
EC6: Technique used cannot be leveraged for image annotation

B. IDENTIFICATION PHASE
In the initial selection phase, three reference databases were chosen: IEEE Xplore, Scopus and SpringerLink. The search centred on combinations of keywords in the titles, abstracts and keywords of the articles, using the following query:

("Object Tracking" OR "Image Retrieval") AND ("Image" OR "Dataset" OR "Video") AND ("Annotation")

This process was carried out on 14 November 2023, considering publications from 2020 to 2023. This time window was strategic, ensuring the inclusion of the most recent and relevant studies on deep learning applied to object tracking and image retrieval techniques for annotating image datasets, and it resulted in the identification of 5455 documents for the initial screening phase.

Additionally, during the review and selection of studies, we identified some highly relevant works that did not strictly fit the terms of the original search but offered valuable contributions to the topic under discussion. These studies were carefully included to enrich the analysis and discussion, considering their direct or indirect relevance and their potential to provide additional insights into object tracking and image retrieval techniques applied to image annotation.

C. SCREENING PHASE AND ELIGIBILITY
After defining the eligibility criteria, we began the screening and eligibility determination phase, a crucial stage in the systematic review process to ensure the relevance and quality of the included studies. This phase involves a thorough evaluation of the documents retrieved from the selected databases, based on the criteria previously established. The aim is to refine the initial set of documents to include only those that strictly fulfil the eligibility criteria, thus guaranteeing the integrity and relevance of the subsequent analysis. The screening phase begins with a review of the titles and abstracts of the 5455 documents initially identified. This process allows us to identify and exclude studies that do not fall within the scope of our investigation, focusing on those papers that offer valuable insights into the use of object tracking and image retrieval techniques in the annotation of image and video datasets.

Through individual reading of titles and abstracts, we determined that a substantial number of these documents did not meet our inclusion and exclusion criteria; they were therefore excluded, leaving 95 unique works. This initial stage was crucial to ensure the relevance and uniqueness of the studies within our research scope. Subsequently, in the eligibility phase, we conducted a detailed evaluation of the full text of these 95 documents, strictly guided by the previously defined exclusion criteria. This analysis resulted in the exclusion of a significant portion of the documents, based on various factors of incompatibility with the established criteria, such as non-compliance with the research objective. Of the evaluated articles, 15 were selected in the Object Tracking area and 17 in the Image Retrieval area, totaling 32 studies for data extraction and qualitative analysis. These studies were chosen not only for their direct relevance to the themes of interest but also for the quality of their methodologies, the datasets used, and their relevance to the proposed research questions. The selection of these studies reflects our commitment to covering a broad and in-depth spectrum of the applications of these techniques in image annotation.

Figure 1 Study Selection Flow Diagram.
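As an illustration of how the search string above translates into a screening filter, the sketch below applies the same boolean logic to a bibliographic record. The record fields are hypothetical stand-ins for exported metadata, not an API of IEEE Xplore, Scopus or SpringerLink.

# Sketch: the review's search string expressed as a predicate over a
# bibliographic record. Field names are illustrative, not a database API.
def matches_query(record: dict) -> bool:
    text = " ".join([record.get("title", ""),
                     record.get("abstract", ""),
                     " ".join(record.get("keywords", []))]).lower()
    group1 = "object tracking" in text or "image retrieval" in text
    group2 = any(term in text for term in ("image", "dataset", "video"))
    group3 = "annotation" in text
    in_window = 2020 <= record.get("year", 0) <= 2023   # IC0 plus the search window
    return group1 and group2 and group3 and in_window

example = {"title": "Semi-automatic video annotation via object tracking",
           "abstract": "...", "keywords": ["dataset"], "year": 2022}
print(matches_query(example))  # True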
IV. RESULTS
Based on the criteria established in Section 3, this part of the article explores the results achieved. We focus our analysis on object tracking and image retrieval techniques, considering how these technologies can be adapted for image annotation. The research involves a careful analysis of the algorithmic approaches examined, focusing on the datasets used, the viability of the methods in various annotation contexts, and the real-time execution capacity of the models. This review's main findings and observations are summarized and organized in Tables II and III.

Following Figure 1, Figure 2 presents a bar chart outlining the number of articles selected per year from the initial set of studies. This visual representation allows for a clear and immediate understanding of the distribution and volume of relevant research within the specified time period.
Figure 2 Studies selected, separated by year.

The bar chart illustrates the annual distribution of the articles selected from 2020 to 2023. A progressive increase in the number of articles can be observed, with the highest bar corresponding to 2023. This suggests growing interest and progress in the research fields of object tracking and image retrieval as they become increasingly relevant to the development of more sophisticated image annotation techniques. The graph serves not only as a quantitative analysis of research output over the years, but also reflects the growing importance of these technologies in addressing complex challenges in computer vision and artificial intelligence.

B. OBJECT TRACKING
In this section, we discuss in detail the algorithms that represent significant advances in object tracking. The approaches vary widely, from traditional machine learning techniques to advanced methods employing convolutional neural networks, each offering innovative solutions to specific challenges within diverse application contexts. These algorithms are presented in Table II alongside their respective research articles.

Single Object Tracking
Tao Yu et al. [46] present a method that integrates an "Instance Tracking Head" (ITH) module into object detection frameworks to detect and track polyps in colonoscopy videos. This method, aligned with the Scaled-YOLOv4 detector, allows sharing of low-level feature extraction and progressive specialization in detection and tracking. The approach stands out for its speed, being around 30 % faster than conventional methods, while maintaining exceptional detection accuracy (mAP of 91.70 %) and tracking accuracy (MOTA of 92.50 %, Rank-1 Acc of 88.31 %).

Shaopan Xiong et al. [47] develop a solution based on a tracking module that uses Siamese networks, specifically SiamFC++, to accurately localize objects. The innovation here lies in modelling visual tracking as a similarity learning problem, complemented by the Box2Segmentation module, which efficiently transforms bounding boxes into segmentation masks and is trained on the COCO dataset. This method forms the basis of Td-VOS, allowing precise segmentation of objects in videos from the initialization of a bounding box in the first frame.

Weiming Hu et al. [48] (SiamMask) uniquely combine object tracking with real-time video segmentation. Using fully convolutional Siamese networks trained offline with an additional binary segmentation task, SiamMask operates online from the initialization of a bounding box, processing video at 55 fps. The approach employs two- and three-branch variants, integrating similarity, bounding-box regression and binary segmentation tasks, with mask refinement to improve accuracy. SiamMask is adaptable to multiple objects and stands out for its efficiency and speed.
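The "tracking as similarity learning" formulation behind these Siamese trackers can be illustrated with a toy sketch: a template patch is scored against every position of a search region, and the peak of the resulting response map gives the object's new location. Real trackers such as SiamFC++ and SiamMask correlate learned convolutional embeddings rather than raw pixels; this NumPy version only conveys the mechanism.

# Sketch: the core of Siamese tracking. Score a template against every
# position of a search region and take the peak of the response map.
# Toy data; published trackers use learned convolutional embeddings.
import numpy as np

def response_map(template: np.ndarray, search: np.ndarray) -> np.ndarray:
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    t = template - template.mean()
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = search[i:i + th, j:j + tw]
            out[i, j] = np.sum(t * (window - window.mean()))  # correlation score
    return out

rng = np.random.default_rng(0)
search = rng.normal(size=(64, 64))
template = search[20:28, 30:38].copy()    # the tracked object's patch
scores = response_map(template, search)
peak = np.unravel_index(scores.argmax(), scores.shape)
print(peak)                               # should recover the crop location, (20, 30)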
Dominik Schörkhuber et al. [49] introduce a technique for semi-automatic annotation of night-time driving videos. The method includes generating trajectory proposals by tracking, extending and verifying these trajectories with single object tracking, and semi-automatic annotation of bounding boxes. Tested on the CVL dataset, focused on European rural roads at night, the method demonstrated a 23 % increase in recall with near-constant precision, outperforming traditional detection and tracking approaches. This work addresses the gap of rural and night scenes in driving datasets, proposing significant improvements for efficient annotation in challenging autonomous driving contexts.

Roberto Henschel et al. [50] propose an advanced method for multi-person tracking, combining video and body-worn Inertial Measurement Units (IMUs). This method stands out by addressing the challenge of tracking people in situations where appearance is not discriminative or changes over time, such as changes in clothing. Using a neural network to relate person detections to IMU orientations, and a graph labelling problem for global consistency between video and inertial data, the method overcomes the limitations of video-only approaches. On a challenging new dataset that includes both video and IMU recordings, the method achieved an impressive average IDF1 score of 91.2 %, demonstrating its effectiveness in situations where it is feasible to equip people with inertial sensors.

Multiple Object Tracking and Segmentation
Zhenbo Xu et al. [51] (PointTrackV2) stand out with an innovative method that converts compact image representations into unordered 2D point clouds, facilitating the rigorous separation of foreground and background areas from instance segments. This process is enriched by a variety of data modalities to enhance point features. PointTrackV2 surpasses existing methods in efficiency and effectiveness, achieving speeds close to real time (20 FPS) on a single 2080Ti GPU. In addition, the study introduces the APOLLO MOTS dataset, more challenging than KITTI MOTS, with a higher density of instances. Extensive evaluations demonstrate the superior performance of PointTrackV2 on various datasets, and the study also discusses the applicability of this method in areas beyond tracking, such as detailed image classification, 2D pose estimation and object segmentation in videos.
Liqi Yan et al. [52] (STC-Seg) present a novel framework for instance segmentation in videos under a weakly supervised approach. Using unsupervised depth estimation and optical flow, STC-Seg generates efficient pseudo-labels to train deep networks, focusing on the accurate generation of instance masks. One of the main contributions is the 'puzzle loss', which allows end-to-end training using box-level annotations. In addition, STC-Seg incorporates an advanced tracking module that utilizes diagonal points and spatio-temporal discrepancy, increasing robustness against changes in object appearance. This method demonstrates exceptional performance, outperforming supervised alternatives on the KITTI MOTS and YT-VIS datasets, evidencing the effectiveness of weakly supervised learning for segmenting instances in videos.

Improvements in Annotation and Efficiency
Le et al. [53] propose an interactive and self-supervised annotation framework that significantly improves the efficiency of creating object bounding boxes in videos. Based on two main networks, Automatic Recurrent Annotation (ARA) and Interactive Recurrent Annotation (IRA), the method iteratively improves a pre-existing detector by exposing it to unlabeled videos, generating better pseudo-ground-truths for self-training. IRA integrates human corrections to guide the detection network, using a Hierarchical Correction module that progressively reduces the distance between annotated frames with each iteration. This system has proven capable of generating accurate, high-quality annotations for objects in videos, substantially reducing annotation time and costs.
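The iterate-and-correct pattern behind this kind of interactive self-supervised annotation can be summarized in a short skeleton. Every component below (detector, human_review, the video objects) is a placeholder standing in for the networks and interfaces the paper describes, not its actual code.

# Sketch: the iterate-and-correct pattern behind interactive
# self-supervised annotation. All components are placeholders.
def annotate_corpus(detector, videos, human_review, rounds: int = 3):
    pseudo_labels = {}
    for _ in range(rounds):
        pseudo_labels = {}
        for vid in videos:
            # Automatic pass: the current detector proposes boxes per frame.
            pseudo_labels[vid.name] = [detector.detect(frame) for frame in vid]
        # Sparse human fixes on the frames the reviewer chooses to correct.
        corrections = human_review(pseudo_labels)
        pseudo_labels.update(corrections)
        # Self-training step: refine the detector on the corrected labels.
        detector.finetune(pseudo_labels)
    return pseudo_labels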
Sambaturu et al. [54] present an interactive annotation method called ScribbleNet, designed to improve the annotation of complex urban images for semantic segmentation, crucial in autonomous navigation systems. This technique offers a pre-segmented image whose segmentation is iteratively improved using scribbles as input. Based on conditional inference and exploiting correlations learnt by deep neural networks, ScribbleNet significantly reduces annotation time: up to 14.7 times faster than manual annotation and 5.4 times faster than current interactive methods. In addition, it integrates with the LabelMe image annotation tool and will be made available as open-source software; it is notable for its ability to work with scenes in unknown environments, annotate new classes and correct multiple labels simultaneously.

Zhu et al. [55] present an accurate method for reconstructing 3D bounding boxes of vehicles in order to obtain detailed spatial-temporal information about vehicle loads on bridges. The study uses a deep convolutional neural network (DCNN) and the You Only Look Once (YOLO) detector to detect vehicles and obtain 2D bounding boxes. A model for reconstructing the 3D bounding box is proposed, making it possible to determine the sizes and positions of vehicles. Spatial-temporal information on vehicle loads is obtained using multiple object tracking (MOT). The system developed, the Bridge Vehicle Load Identification System (BVLIS), was tested on an operating cable-stayed bridge, demonstrating the accuracy and reliability of the method. This approach is innovative in that it combines deep learning-based vehicle detection, camera calibration and 3D bounding box reconstruction, providing an effective alternative to conventional methods for assessing the condition of bridges and their behavior under vehicle loads.

Liu et al. [56] propose a novel technique for segmenting and tracking instances in microscopy videos without the need for manual annotation. Using adversarial simulations and pixel-embedding-based learning, the ASIST method is able to simulate variations in the shape of cellular and subcellular objects, overcoming the challenge of consistent annotations required by traditional methods. The study demonstrates that ASIST achieves a significant improvement over supervised approaches, showing superior performance in the segmentation, detection and tracking of microvilli and comparable performance on videos of HeLa cells. This method represents a breakthrough in the quantitative analysis of microscopy videos, offering an efficient and automated solution for quantifying cellular and subcellular dynamics without labor-intensive manual annotation.

Fahad Lateef et al. [57] propose an innovative object identification framework (FOI) for autonomous vehicles, focusing exclusively on camera data to detect and analyze objects in urban driving scenarios. This framework uses image registration algorithms and optical flow estimation to compensate for self-motion and extract accurate motion information for moving objects from a mobile camera. At the heart of this system is a moving object detection (MOD) model, which combines an encoder-decoder network with a semantic segmentation network to perform two crucial tasks: the semantic segmentation of objects into specific classes and the binary classification of pixels by motion status. In addition, the article presents a unique dataset for detecting moving objects, covering a variety of dynamic objects. The experiments demonstrate the effectiveness of the proposed framework in providing detailed semantic information about objects in urban driving environments.

Zeren Chen et al. [58] present "Siamese DETR", a new method for self-supervised training of DETR (DEtection TRansformer) models, introduced at the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). This study proposes combining a Siamese network with DETR's cross-attention mechanism, focusing on learning view-invariant and detection-oriented representations. The method achieved state-of-the-art transfer performance on the COCO and PASCAL VOC detection benchmarks. The team highlights the effectiveness and versatility of Siamese DETR, demonstrating significant improvements in localization accuracy and acceleration in convergence. However, Siamese DETR relies on a pre-trained CNN, such as SwAV, and future work aims to integrate the CNN and Transformer into a unified training paradigm.
TABLE II Selected object tracking reviewed articles with their respective authors, year of publication, methodology/algorithms, main area, application and real-time capacity.
TABLE III Selected image retrieval reviewed articles with their respective authors, year of publication, methodology/algorithms, main area, application and datasets.

2022, J. Faritha Banu et al. [60]. Methodology/algorithms: image segmentation and feature extraction using grid-based colour histogram and texture techniques. Main area: Content-Based Image Retrieval (CBIR). Application in image annotation: yes, for image annotation and retrieval. Dataset: WANG.

2020, Yi-Hui Chen et al. [61]. Methodology/algorithms: automatic semantic annotation of images, natural language analysis, candidate phrase extraction, RDF (Resource Description Framework), SPARQL, LSI (Latent Semantic Indexing). Main area: social image retrieval. Application in image annotation: yes, for automatic semantic annotation and identification of semantic intentions in social images. Dataset: NBA blogs (January 2015 to November 2015), with manual RDF annotations.

2020, Binqiang Wang et al. [64]. Methodology/algorithms: Recurrent Topic Memory Network (RTRMN), recurrent neural network, memory network, convolutional max-pooling. Main area: remote sensing image processing, caption generation. Application in image annotation: yes, to generate automatic semantic descriptions of remote sensing images. Datasets: UCM-Captions and RSICD (Remote Sensing Image Captioning Dataset).

2023, Myasar Mundher Adnan et al. [65]. Methodology/algorithms: ResNet50-SLT, word2vec, principal component analysis (PCA), t-SNE. Main area: automatic image annotation, deep learning. Application in image annotation: yes, by improving accuracy in image annotation. Datasets: Corel-5K, ESP-Game and Flickr8k.

2021, Mona Zamiri et al. [70]. Methodology/algorithms: Multi-View Robust Spectral Clustering (MVRSC), Maximum Correntropy Criterion, half-quadratic optimisation framework. Main area: image annotation, semantic retrieval. Application in image annotation: model for image annotation based on multi-view fusion. Datasets: Flickr, 500PX and Corel-5K.

2022, Jordão Bragantini et al. [71]. Methodology/algorithms: interactive image segmentation annotation guided by feature space projection, metric learning, dimensionality reduction. Main area: interactive image segmentation, interactive machine learning. Application in image annotation: method for mass annotation of images through projection in feature space. Datasets: iCoSeg, DAVIS, Rooftop and Cityscapes.

2023, Ikhlaq Ahmed et al. [62]. Methodology/algorithms: use of ResNet-50 and BERT for image and text feature extraction, with inductive learning for feature fusion. Main area: Content-Based Image Retrieval (CBIR). Application in image annotation: retrieval of modified images on e-commerce platforms, using deep learning. Datasets: Fashion-200K and MIT-States.

2022, Umer Ali Khan et al. [66]. Methodology/algorithms: use of local tetra angle patterns (LTAP) and colour moment features to improve image retrieval accuracy, optimised with a genetic algorithm. Main area: Content-Based Image Retrieval (CBIR). Application in image annotation: efficient image retrieval on social media platforms, using advanced colour and texture features. Datasets: Corel-1K, Oxford Flower and CIFAR-10.

2020, Yikun Yang et al. [67]. Methodology/algorithms: use of DNNs and CNNs for saliency prediction and acquisition of deep image representations. Main area: Content-Based Image Retrieval (CBIR). Application in image annotation: retrieval of quality images from large databases using deep learning. Datasets: ImageNet, Caltech256 and CIFAR-10.

2021, Jhilik Bhattacharya et al. [72]. Methodology/algorithms: use of capsule networks and decision fusion with W-DCT and RBC for classification and retrieval of medical images. Main area: content-based medical image retrieval. Application in image annotation: use of a capsule architecture for accurate retrieval and classification of medical images in large databases. Datasets: IRMA (Image Retrieval in Medical Applications) and ImageCLEFMed-2009.

2021, Dhupam Bhanu Mahesh et al. [73]. Methodology/algorithms: development of an OLWGP descriptor for data retrieval and classification, using a heuristic J-BMO algorithm for optimal feature point selection and an optimised CNN for medical data classification. Main area: Content-Based Image Retrieval (CBIR) and medical image classification. Application in image annotation: use of optimised OLWGP and CNN descriptors for accurate retrieval and classification of medical images in large databases. Datasets: Kaggle datasets named CT (computerised tomography), CT head, Fundus Iris (DIARETDB1), Mammogram breast (MIAS), MRI brain, US (ultrasound), X-ray bone, X-ray chest and X-ray dental.

2022, Zafran Khan et al. [68]. Methodology/algorithms: use of DenseNet to generate visual characteristics of images and BERT for text embeddings; deep learning for joint image and text representation. Main area: Content-Based Image Retrieval (CBIR). Application in image annotation: multi-modal CBIR that processes image and text queries to retrieve images from a substantial database, adjusting to the wishes expressed in the query. Datasets: Fashion200k, MIT-States and FashionIQ.

2022, Anna Guan et al. [69]. Methodology/algorithms: use of the DenseNet-121 model pre-trained with the C2L method; introduction of interpretable saliency maps; fusion of global and local features; definition of three loss functions to optimise hash codes. Main area: hash-based medical image retrieval, with a focus on chest X-rays. Application in image annotation: improving accuracy in medical image retrieval, with special attention to injured areas in chest X-rays. Dataset: ChestX-ray8.

2021, P. Das et al. [63]. Methodology/algorithms: method based on robust descriptors using Zernike moments, curvelet features and gradient orientation for biomedical image retrieval. Main area: biomedical image retrieval. Application in image annotation: use of robust descriptors for effective retrieval of biomedical images from large databases. Datasets: HRCT dataset, Emphysema CT database, OASIS MRI database and NEMA MRI database.

2023, Felipe Cadar et al. [74]. Methodology/algorithms: learned keypoint detection method for non-rigid image matching, using an end-to-end convolutional neural network (CNN). Main area: keypoint detection, non-rigid matching. Application in image annotation: improvement in deformable object matching and object retrieval through learnt keypoints.
Guan et al. [69]: The article presents an advanced method for retrieving medical images, focusing on feature fusion and information interpretability. Using the DenseNet-121 model to learn relevant medical features without the need for manual annotation, the method applies interpretable saliency maps and integrates global and local networks to extract complete information, resulting in a significant improvement in the accuracy of retrieval results. These advances promise valuable applications in computer-aided diagnosis systems.

Specific Applications and Innovations in Image Annotation
Mona Zamiri and Hadi Sadoghi Yazdi [70]: This study introduces the Multi-View Robust Spectral Clustering (MVRSC) method for image annotation, modelling the relationship between the semantics and multiple features of training images. Using the Maximum Correntropy Criterion and half-quadratic optimization, the method suggests tags based on a new fusion distance at the decision level. Experimental results on real datasets demonstrate the method's effectiveness in generating accurate and meaningful annotations, integrating geographic and visual information.

Bragantini et al. [71]: In this article, the authors propose an innovative approach to interactive image annotation, allowing the simultaneous annotation of segments of multiple images through projection onto the feature space. This technique results in a faster process and avoids redundancy by annotating similar components in different images. The results show a significant improvement in the efficiency of image annotation, suggesting possibilities for integration with other existing image segmentation methodologies.

Jhilik Bhattacharya et al. [72]: The paper proposes an advanced approach to medical image search, utilizing a capsule architecture and decision fusion to address challenges such as data imbalance, insufficient labels and obscured images. Tested on the IRMA dataset, the method significantly improves diagnostic efficiency by grouping similar images for automatic retrieval and annotation.

Dhupam Bhanu Mahesh et al. [73]: This paper presents a medical image retrieval and classification model based on the Optimized Local Weber Gradient Pattern (OLWGP), using a new heuristic algorithm to improve image retrieval. The study also employs an optimized CNN model for image classification, demonstrating superior performance on several public databases and offering significant advances in medical image retrieval and classification.

Felipe Cadar et al. [74]: The study presents a novel technique for keypoint detection in non-rigid images using a CNN trained with true correspondences. This method improves not only the accuracy of the matches but also the efficiency of object retrieval, representing a significant advance in the detection of keypoints in non-rigid images and in improving the matching performance of existing descriptors.

Seyed Mahdi Roostaiyan et al. [75]: This paper introduces Marginalized Coupled Dictionary Learning (MCDL) as a new approach for real-time image annotation. Focusing on learning a limited number of visual prototypes and their associated semantics, the method overcomes common challenges in image annotation by offering an efficient and fast solution with a publicly available implementation.
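The common mechanism behind these retrieval-assisted annotation methods, suggesting labels for a new image from its most similar annotated neighbours, can be sketched in a few lines. The vectors and tags below are toy data; a real system would use learned CNN or ViT embeddings and an approximate nearest-neighbour index.

# Sketch: suggest tags for a new image by voting over its nearest
# annotated neighbours in an embedding space. Toy data only.
import numpy as np
from collections import Counter

db_feats = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
db_tags = [{"dog"}, {"dog", "grass"}, {"cat"}, {"cat", "sofa"}]

def suggest_tags(query: np.ndarray, k: int = 2, min_votes: int = 2) -> set:
    # Cosine similarity between the query and every annotated image.
    sims = db_feats @ query / (np.linalg.norm(db_feats, axis=1) * np.linalg.norm(query))
    nearest = np.argsort(-sims)[:k]                 # top-k most similar images
    votes = Counter(tag for i in nearest for tag in db_tags[i])
    return {tag for tag, n in votes.items() if n >= min_votes}

print(suggest_tags(np.array([0.85, 0.15])))        # {'dog'}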
V. DISCUSSION
This discussion section aims to further analyze the advances and implications of object tracking and image retrieval technologies, with a special focus on their practical applications in various domains and the significant real-world impact that these techniques have demonstrated. Given the insights provided by the studies analyzed, we have undertaken a comparative assessment of the techniques studied, highlighting their strengths, weaknesses and suitability for different application scenarios. This analysis not only illuminates the unique contributions of each method, model or algorithm, but also sheds light on the synergies and challenges that arise when integrating these technologies to solve complex real-world problems.

We recognize the complexity and depth of the topics covered by this review and have therefore expanded our discussion to provide a more nuanced view of the limitations and challenges faced by current methodologies. In addition, we discuss the potential interdisciplinary applications of these technologies in more detail, highlighting areas beyond those primarily considered in the review. This includes exploring how object tracking and image retrieval can be innovatively applied in fields such as health, public safety and environmental conservation, where they have the potential to promote significant advances.

A. OBJECT TRACKING
The field of object tracking has been marked by significant innovations, especially with the application of advanced neural networks and deep learning techniques. The introduction of the Instance Tracking Head (ITH) by Tao Yu et al. [46] exemplifies this evolution, offering a notable improvement in tracking accuracy in medical contexts, such as colonoscopy videos. This innovation underlines the ability of these new technologies to adapt to specialized applications where precision is crucial.

Advancing the complexity of applications, Shaopan Xiong et al. [47] explored the use of Siamese networks to improve object segmentation in videos.
This approach not only strengthens tracking accuracy, but also highlights the versatility of modern techniques in dealing with dynamic and complex scenes. The convergence of these technologies points to a horizon where object tracking can be adapted to a wider range of scenarios, from controlled environments to busy urban contexts.

The accuracy and adaptability of object tracking in adverse conditions represent ongoing challenges, as demonstrated by Dominik Schörkhuber et al. [49] in their studies of night-time driving videos. This work illustrates the importance of developing systems that can operate efficiently under variations in visibility, a critical factor for security and monitoring applications.

Collaboration between humans and artificial intelligence has emerged as a recurring theme, with studies such as those by Trung-Nghia Le et al. [53] and Bhavani Sambaturu et al. [54] highlighting collaborative approaches to image annotation and analysis. This human-machine interaction suggests a future in which the precision and efficiency of AI can be combined with human sensitivity and discernment to create more robust and accurate solutions in a variety of applications.

The expansion to multiple-object tracking and segmentation, as demonstrated by Zhenbo Xu et al. [51] and Liqi Yan et al. [52], opens up new possibilities for real-time monitoring and analysis of complex scenes. These techniques, which transform images into more malleable representations such as point clouds, highlight the potential of deep learning to extract and analyze information in an efficient and innovative way.

The challenge of detecting and analyzing movement in dynamic scenarios is addressed by Fahad Lateef et al. [57] and Roberto Henschel et al. [50], who apply object tracking to contexts of urban mobility and human interactions, respectively. These studies illustrate how the technology can be adapted to improve public safety and understand complex behavior in crowded environments.

Finally, the diversity of applications and continuous innovation in object tracking, as reflected in the works of Zeren Chen et al. [58] and Thiago T. Santos et al. [59], highlights the importance of ethical approaches, especially in public contexts where privacy and consent are paramount concerns. The evolution of this technology not only promises improvements in a variety of fields, but also imposes the need for careful reflection on its responsible use.

B. IMAGE RETRIEVAL
The evolution of Content-Based Image Retrieval (CBIR) has been driven by significant technological advances, as demonstrated by J. Faritha Banu et al. [60], who developed an innovative CBIR system using ontologies to integrate model and content annotations. This system not only improves the accuracy and speed of image retrieval, but also paves the way for practical applications, especially in medical fields where precision in image search and retrieval is vital.

This quest for accuracy and efficiency is complemented by the efforts of Yi-Hui Chen et al. [61] to address the "semantic gap" in social image retrieval by combining multiple visual features and textual matching. This breakthrough highlights a growing trend in CBIR: the integration of multiple data modalities to enrich the retrieval process, making the results more aligned with the users' intentions.

In this context of multimodal enrichment, Binqiang Wang et al. [64] made progress with the Recurrent Topic Retrieval Memory Network (RTRMN), which generates accurate captions for remotely sensed images. This development highlights the importance of contextualization and detail in the generation of captions, crucial aspects for the interpretation and use of images in areas such as environmental and geographical research.

The integration of multiple modalities and technological innovation, as seen in the work of Ikhlaq Ahmed et al. [62] and Zafran Khan et al. [68], exemplifies how the combination of in-depth visual features and semantic textual representations can refine image retrieval. This approach not only improves accuracy, but also personalizes the image retrieval experience, adapting to the specific needs of users in a variety of contexts, from e-commerce to multimedia.

As we explore specific applications and advances in segmentation and classification, studies such as those by Mona Zamiri et al. [70] and Jhilik Bhattacharya et al. [72] bring to light innovative methods that improve image annotation and retrieval in urban and medical contexts. They use advanced clustering techniques and capsule networks to model semantic relationships and multiple features, demonstrating the adaptability of these technologies to specific annotation and retrieval needs.

However, beyond specific applications, CBIR faces the ongoing challenge of detecting and analyzing complex patterns in images. Anna Guan et al. [69], for example, focus on medical image retrieval, using hashing techniques based on feature fusion and interpretability to better represent injured areas on X-rays. This approach not only advances computer-aided diagnosis, but also emphasizes the importance of interpretable and transparent systems.

Looking to the future, CBIR should continue to explore data fusion and deep contextualization. Deep learning, exemplified by Umer Ali Khan et al. [66] and Yikun Yang et al. [67], promises to transform image retrieval by dynamically adapting to a variety of contexts and user requirements. Furthermore, the emphasis on interpretation and user interaction, as evidenced by Seyed Mahdi Roostaiyan et al. [75], highlights the need for methods that can effectively deal with unbalanced labels and data sparsity.

Thus, the trajectory of CBIR is marked by an intersection of technological innovation, practical applicability and integration challenges. The approach of Seyed Mahdi Roostaiyan et al. [75], which introduces marginalized coupled dictionary learning for real-time image annotation, illustrates the need for adaptive approaches capable of dealing with the diversity and complexity of image datasets, while maintaining computational efficiency and the relevance of retrieval results.
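As a concrete reference point for the feature-based CBIR systems discussed in this section, the sketch below implements grid-based colour-histogram matching of the kind that systems such as that of J. Faritha Banu et al. [60] build on, returning the database image whose existing annotations could seed the annotation of a new image. The paths, grid size and bin counts are illustrative choices, not parameters from the cited work.

# Sketch: grid-based colour-histogram matching, a classic CBIR building
# block. The best-matching database image's annotations can then be
# offered to the annotator as a starting point.
import cv2
import numpy as np

def grid_histogram(img: np.ndarray, grid: int = 2, bins: int = 8) -> np.ndarray:
    h, w = img.shape[:2]
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            cell = img[gy * h // grid:(gy + 1) * h // grid,
                       gx * w // grid:(gx + 1) * w // grid]
            hist = cv2.calcHist([cell], [0, 1, 2], None,
                                [bins] * 3, [0, 256] * 3)
            feats.append(cv2.normalize(hist, hist).flatten())
    return np.concatenate(feats)

def most_similar(query_path: str, db_paths: list[str]) -> str:
    q = grid_histogram(cv2.imread(query_path))      # assumes readable image files
    scores = [(np.minimum(q, grid_histogram(cv2.imread(p))).sum(), p)
              for p in db_paths]                    # histogram intersection
    return max(scores)[1]                           # best match: reuse its annotations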
Continued innovation in CBIR, particularly the integration of advanced deep learning techniques and multimodal analysis, as demonstrated by Felipe Cadar et al. [74] in their research on keypoint detection in non-rigid images, highlights the potential for significant advances in image retrieval accuracy and capacity. The application of these techniques in a variety of contexts, from medical analyses to pattern recognition in remote sensing images, suggests a broad spectrum of possibilities for improving both the granularity and the applicability of CBIR.

However, as these technologies advance and their applications expand, ethical and privacy considerations emerge, especially in contexts involving sensitive or identifiable data. The need for responsible and transparent approaches to image retrieval is becoming increasingly pressing, emphasizing the importance of incorporating ethical principles into the development and deployment of CBIR systems.

C. POTENTIAL OF COMBINING TECHNIQUES FOR IMAGE ANNOTATION
The fusion of object tracking and image retrieval techniques promises to revolutionize the field of image annotation, offering more sophisticated and efficient methods for identifying and cataloguing visual content. This convergence has the potential to significantly automate the annotation process, improving accuracy and reducing the manual effort required, particularly for large datasets.

In the context of object tracking, the ability to continuously follow an entity through a sequence of images or videos provides a solid basis for dynamic and contextually rich annotations. When integrated with image retrieval systems, this continuous tracking can be enriched with historical or semantic information extracted from extensive databases, allowing for annotations that capture not only the identity of the object, but also its behavior, interactions and evolution over time.

For example, in surveillance video analysis, the combination of these techniques can automate the annotation of activities, identifying and cataloguing specific actions by individuals or vehicles. This not only saves the time spent manually reviewing hours of footage, but also improves search and retrieval capabilities, allowing users to quickly find moments or events of interest based on detailed annotations.

In scientific and environmental research, the combined application of these technologies can facilitate the cataloguing of species or natural phenomena by integrating movement information captured by object tracking with taxonomic or behavioral knowledge derived from image retrieval systems. This can significantly speed up the annotation of large sets of images captured in field studies, allowing researchers to concentrate their efforts on analyzing and interpreting the data.

In the medical field, this integrated approach could transform the way diagnostic images are annotated and stored. By combining the precise tracking of injuries or medical conditions in sequential images with the ability to link these observations to similar or relevant cases in the medical literature, annotation systems can provide a wealth of clinical context, potentially revealing previously hidden patterns or correlations.

However, the successful implementation of this integrated approach requires overcoming significant challenges, including managing large volumes of data, the need for high-performance processing algorithms, and ensuring accuracy and relevance in the annotations generated. In addition, ethical and privacy issues remain paramount, especially in sensitive applications such as surveillance and medicine.
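The combined workflow this section describes can be expressed as a short pipeline sketch: retrieval proposes an initial label for the first frame, and tracking propagates it through the clip, handing control back to a human when tracking fails. All components below are hypothetical placeholders, not an interface from any of the reviewed systems.

# Sketch: retrieval proposes a label, tracking propagates it.
# retrieve_similar, tracker and frames are placeholders only.
def crop(frame, box):
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

def annotate_clip(frames, first_box, retrieve_similar, tracker):
    # Retrieval step: label the first crop from an annotated archive.
    label = retrieve_similar(crop(frames[0], first_box))
    tracker.init(frames[0], first_box)
    annotations = [(0, first_box, label)]
    for idx, frame in enumerate(frames[1:], start=1):
        ok, box = tracker.update(frame)            # tracking propagates the box
        if not ok:
            break                                  # hand back to a human annotator
        annotations.append((idx, box, label))      # retrieved label reused per frame
    return annotations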
VI. CONCLUSION
This systematic review investigated the current use of object tracking and image retrieval techniques in automating or assisting image annotation. From the studies analyzed, the answers to the proposed questions are as follows:

Q1: How are object tracking and image retrieval techniques being used to automate or assist in image annotation, and what are the current developments associated with these technologies?
A: Automatic image annotation is an area that continues to evolve with the development of object tracking and image retrieval techniques. These techniques are essential for improving the accuracy and efficiency of annotation, which is important in various applications such as medical diagnosis, urban surveillance and the management of large image databases.

Object tracking and image retrieval techniques, when used for this purpose, can play key roles in automating and assisting image annotation, greatly helping annotators who would otherwise have to do it manually. These technologies are useful not only for improving the efficiency of annotation processes, but also for increasing the accuracy of the annotations generated, which is essential in fields such as medical diagnosis, urban monitoring and automatic multimedia content management.

Object tracking has benefited from the advancement of deep neural networks and sophisticated machine learning methods, which have contributed to automation and accuracy in image annotation. The integration of advanced technologies not only improves the identification and tracking of objects in image sequences, but also facilitates the automatic and continuous annotation of these objects.

An example of this evolution is the work of Tao Yu et al. [46], who developed the "Instance Tracking Head" (ITH), integrated into the Scaled-YOLOv4 detector. This innovation offers improvements in the detection and tracking of objects in medical videos, such as colonoscopies. Improved detection accuracy and continuous tracking enable the automatic and accurate annotation of polyps over time, facilitating medical monitoring and analysis by reducing the need for manual annotation, which is often prone to errors and inconsistencies.

Another significant development is SiamMask, created by Weiming Hu et al. [48], which combines object tracking and segmentation in real time.
Another significant development is SiamMask, created by Weiming Hu et al. [48], which combines object tracking and segmentation in real time. This tool processes video at a rate of 55 frames per second, enabling continuous and automated annotation of fast-moving objects. SiamMask is particularly useful in scenarios that require real-time responses, such as urban surveillance and traffic monitoring, where the precise identification and tracking of objects is essential for security and incident management.
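SiamMask itself is not reproduced here, but the step it enables, turning per-frame segmentation output into annotation records with no human in the loop, can be sketched generically. In the sketch below, get_mask is a hypothetical stand-in for any tracker that returns a binary mask per frame; the output is a simple JSON list of polygon annotations.

    # Sketch: turn per-frame segmentation masks (e.g. from a SiamMask-style
    # tracker) into polygon annotation records. get_mask is hypothetical.
    import json
    import cv2
    import numpy as np

    def mask_to_polygon(mask):
        # Largest contour of a binary mask as a flat [x1, y1, x2, y2, ...] list.
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return []
        largest = max(contours, key=cv2.contourArea)
        return largest.reshape(-1, 2).flatten().tolist()

    def annotate_stream(frames, get_mask, label, out_path="annotations.json"):
        records = []
        for i, frame in enumerate(frames):
            mask = get_mask(frame)               # hypothetical tracker call
            poly = mask_to_polygon(mask)
            if poly:                             # one record per tracked frame
                records.append({"frame": i, "label": label, "polygon": poly})
        with open(out_path, "w") as f:
            json.dump(records, f)
        return records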
The study by Roberto Henschel et al. offers an advanced method for tracking multiple people using both video and inertial measurement units (IMUs). This method is especially effective in environments where the appearance of individuals changes frequently, such as at sporting events or concerts, allowing accurate annotation of movements and positions without loss of subject identity, even in challenging conditions.
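Henschel et al.'s full method is far more sophisticated, but the underlying intuition, lean on inertial estimates when appearance cues degrade, can be conveyed with a toy confidence-weighted blend; every name below is illustrative and not their method.

    # Toy sketch (not the reviewed method): blend a visual position estimate
    # with an IMU dead-reckoning estimate, trusting the IMU more whenever
    # visual confidence drops, e.g. under occlusion or appearance change.
    import numpy as np

    def fuse_position(visual_pos, visual_conf, imu_pos, floor=0.1):
        # Clamp so neither source is ever fully silenced.
        w = min(max(float(visual_conf), floor), 1.0 - floor)
        return w * np.asarray(visual_pos) + (1.0 - w) * np.asarray(imu_pos)

    # With low visual confidence (0.2) the fused point sits near the IMU guess.
    print(fuse_position((10.0, 5.0), 0.2, (12.0, 5.5)))  # -> [11.6  5.4]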
Additionally, Zhenbo Xu et al. [51], with PointTrackV2, transform images into 2D point clouds, which facilitates the segmentation of instances and the tracking of multiple objects in crowded and dynamic environments. This technique enables effective annotation in congested urban areas or at public events, where accurate tracking and annotation of multiple moving objects is crucial for subsequent analyses and decision-making.
These are just a few examples of the advances that demonstrate how object tracking can transform the task of image annotation, making it more efficient and reducing the manual workload. The application of these technologies in a variety of fields, from medicine to public safety, highlights the significant potential for future innovations that can lead to an even deeper understanding and better practices in analyzing visual data.
Continuing the discussion on the automation of image annotation, we have also seen significant advances in image retrieval that complement the improvements brought about by object tracking. Image retrieval techniques have benefited greatly from deep learning and semantic analysis, which increase the accuracy and speed of retrieval and enrich the quality of automatic annotations.
For example, Faritha Banu et al. have developed a system that employs grid-based color histograms and texture analysis within an ontological framework, which not only speeds up the retrieval of medical images but also improves the accuracy of automatic annotations. This advance is particularly important for clinical applications, where accurate annotations can mean better diagnosis and treatment.
Yi-Hui Chen et al. implemented a method that combines visual and textual analysis for semantic retrieval of social images. This method not only improves the relevance and accuracy of annotations, but also facilitates the categorization and retrieval of social content based on clear semantic intent, thus improving the management of large image databases.
Binqiang Wang et al. advanced the automatic generation of captions for remote sensing images through recurrent memory networks, which use common keywords to generate accurate and contextual descriptions. This process is vital for the effective interpretation and use of images in environmental monitoring and urban planning.
Furthermore, Umer Ali Khan and Ali Javed have created a hybrid CBIR system that combines local tetra angle patterns with color moment features to improve image retrieval. This system addresses the challenge of the "semantic gap" found in large image databases by enriching the automatic annotation process with more detailed and accurate features, which is essential for better categorization and use of the retrieved images.
These developments in image retrieval, together with advances in object tracking, are broadening the possibilities for using images in a variety of practical applications, ensuring that visual information is maximized to its full potential. With these advanced technologies, it is possible to automate the annotation of large image datasets, reducing manual labor and increasing the reliability of the information.
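On the retrieval side, the simplest form of annotation suggestion is a vote among the most similar already-annotated images, with a deferral to the human annotator when the neighbours disagree. The sketch below is an assumption-laden illustration reusing the hypothetical embedding index from the earlier sketch.

    # Sketch: suggest a label for a new image by majority vote over its k
    # nearest annotated neighbours; abstain when consensus is weak so the
    # case is routed to a human annotator instead.
    from collections import Counter

    def suggest_label(query_vec, index, db_labels, k=5, min_agreement=0.6):
        _, idx = index.kneighbors([query_vec], n_neighbors=k)
        votes = Counter(db_labels[i] for i in idx[0])
        label, count = votes.most_common(1)[0]
        return label if count / k >= min_agreement else None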
Q2: How can the integration of object tracking and image retrieval be optimized to improve the image annotation process?
A: The efficient integration of object tracking and image retrieval techniques represents a significant advance in the field of automatic image annotation. Both techniques have complementary capabilities which, when aligned correctly, can substantially improve the accuracy, efficiency and applicability of image annotation in a variety of contexts.
Existing object tracking techniques usually struggle in certain conditions: low light or poor visibility, following an individual through a crowd, abrupt changes of scenery, and occlusions in which overlapping objects momentarily block the target. Although image retrieval cannot solve these problems directly, it can support the annotation of such difficult situations by drawing on previously studied information, providing the annotator not only with different perspectives but also with comparisons that would not otherwise have been observed. A paper that experimented with this approach is that of H. Wei and Y. Huang [76], which combines both techniques for autonomous driving; their approach also shows that the combination could be used to improve image annotation, a step towards an interdisciplinary collaboration for image annotation. In the following, we present a detailed approach to how this integration can be optimized, exploring connections between the techniques discussed in the selected articles.
As previously mentioned, object tracking often faces challenges in conditions of poor lighting and visibility. Image retrieval techniques, such as the one presented by Faritha Banu et al., which employ ontologies to improve the accuracy and speed of image retrieval, can be used to complement and enrich the training datasets of tracking models. In addition, the integration of advanced image attributes and semantic annotations extracted through retrieval methods can help tracking models to better adapt to varying conditions, using historical or similar data to adjust their predictions in real time and improve accuracy in challenging environments.
Multiple object tracking, as explored in works such as that by Zhenbo Xu et al. [51] (PointTrackV2), can benefit significantly from image retrieval. For example, techniques that use textual and visual analysis for semantic annotation of images (Yi-Hui Chen et al.) can be integrated to provide additional context that makes it easier to distinguish between similar objects in crowded scenes. This approach can enable tracking systems to assign more accurate identities and maintain tracking consistency over time, even when objects interact with or occlude each other.
Even well-established tracking systems, such as Weiming Hu et al.'s SiamMask [48], can be improved with image retrieval techniques that process contextual and appearance variations. Using advanced image retrieval algorithms that integrate deep and semantic features (such as the DvLIL system by Ikhlaq Ahmed et al. [62]), it is possible to develop an adaptive layer that adjusts the parameters of the tracking model in real time, based on features previously observed in similar situations. This not only improves the robustness of the tracking, but also reduces errors caused by abrupt changes in the scenario or the appearance of objects.
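None of the reviewed systems exposes exactly such an interface, but the adaptive layer described above could look roughly like the following: before each update, the current frame's embedding retrieves historically similar scenes, and their recorded difficulty relaxes the tracker's acceptance threshold. All names and the stored difficulty score are assumptions for illustration.

    # Illustrative "adaptive layer": retrieve scenes similar to the current
    # frame and relax the tracker's acceptance threshold when history says
    # such scenes (low light, crowds) were hard, so tracks survive longer.
    import numpy as np

    def adaptive_threshold(frame_vec, index, db_difficulty, base=0.5, gain=0.3):
        _, idx = index.kneighbors([frame_vec], n_neighbors=5)
        difficulty = float(np.mean([db_difficulty[i] for i in idx[0]]))  # in [0, 1]
        return base - gain * difficulty   # harder context -> more permissive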
To maximize the benefits of this integration, it is crucial to implement effective synchronization between object tracking and image retrieval systems. This can be achieved by developing integrated frameworks that combine real-time data streams with dynamic access to annotated image databases, allowing for fluid and complementary interaction between the tracking and retrieval processes.
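A minimal sketch of such synchronization, under the assumption that retrieval may be slower than tracking, is a producer-consumer hand-off: the tracking loop stays real-time and never blocks, while a background worker enriches the records with retrieved labels. Here, track_frame and retrieve_labels are hypothetical stand-ins for the two subsystems.

    # Sketch: keep tracking real-time while retrieval enriches annotations
    # asynchronously. track_frame and retrieve_labels are hypothetical.
    import queue
    import threading

    crops = queue.Queue(maxsize=128)   # tracker -> retrieval hand-off
    annotations = []                   # one record per processed frame

    def tracker_loop(frames, track_frame):
        for i, frame in enumerate(frames):
            box, crop = track_frame(frame)                  # real-time path
            annotations.append({"frame": i, "box": box, "labels": None})
            try:
                crops.put_nowait((i, crop))                 # drop if backlogged
            except queue.Full:
                pass                                        # tracking never blocks

    def retrieval_loop(retrieve_labels):
        while True:
            i, crop = crops.get()
            annotations[i]["labels"] = retrieve_labels(crop)  # enrich later
            crops.task_done()

    # Typical wiring: start retrieval_loop in a daemon thread via
    # threading.Thread(target=retrieval_loop, args=(retrieve_labels,),
    #                  daemon=True).start(), then run tracker_loop(...).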
Q3: What are the main challenges and limitations faced when applying these techniques to image annotation?
A: Despite advances in automatic image annotation through object tracking and image retrieval techniques, significant challenges still persist in both areas, impacting the effectiveness of these technologies.
In object tracking, one of the main challenges is the management of occlusions, where objects of interest are temporarily blocked by other elements in the scene, complicating their detection and continuous tracking. In addition, rapid variations in the scene, such as sudden changes in lighting or fast object movements, can challenge current algorithms, reducing tracking accuracy. The need for accurate tracking in low-visibility conditions also remains a technical obstacle, especially in applications such as night surveillance or in adverse weather conditions.
In image retrieval, the "semantic gap", the discrepancy between the visual attributes of retrieved images and the semantic meaning that users attribute to those images, remains a prominent challenge. This gap often results in annotations that do not match user expectations or specific application needs, limiting the practical usefulness of image retrieval systems. Finding more robust and adaptive methods to bridge this semantic gap is crucial to improving the relevance and accuracy of automatically generated annotations.
These challenges highlight the continued need for research and development in the areas of object tracking and image retrieval. Innovative solutions are needed to address these limitations, potentially through the development of more sophisticated algorithms that can better cope with adverse conditions and complex contexts, and more effective semantic processing techniques that better align image retrieval results with user needs. Improvements in these areas will not only advance the state of the art in automatic image annotation, but also expand its practical applications in fields such as medical diagnosis, urban monitoring and multimedia content management.
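As a small illustration of the "more effective semantic processing techniques" called for above, one common pattern is to rank retrieval results by a blend of visual similarity and the similarity between a textual query and each image's stored annotations. The weighting below is an assumption for illustration, not a result from the reviewed papers, and the embeddings are assumed to come from whatever image and text encoders a pipeline already uses.

    # Sketch: narrow the semantic gap by fusing visual and textual similarity.
    # db is a list of {"id", "img_vec", "txt_vec"} entries.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def rank_results(query_img_vec, query_txt_vec, db, alpha=0.5):
        scored = [(alpha * cosine(query_img_vec, d["img_vec"])
                   + (1.0 - alpha) * cosine(query_txt_vec, d["txt_vec"]), d["id"])
                  for d in db]
        return sorted(scored, reverse=True)   # best-fused matches first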
VII. FUTURE RESEARCH
The evolution of image annotation through object tracking and image retrieval technologies, as systematically analyzed, shows a promising trajectory towards more efficient and accurate machine learning models. Despite notable advances, the field faces challenges that require a research agenda geared towards promoting innovation and addressing the complexities of real-world applications.
A critical area for future exploration lies in improving algorithmic robustness and generalization. Current methodologies demonstrate varying degrees of effectiveness across different datasets and conditions, often struggling with low-visibility scenarios and rapid object movements. Solving these problems requires a concerted effort to develop algorithms that are not only adaptable to diverse environmental conditions, but also capable of learning from limited and unstructured data; a good example of this effort is shown by Dominik Schörkhuber et al. [49]. The integration of unsupervised and semi-supervised learning paradigms could offer a way to reduce dependence on extensively annotated datasets, expanding the applicability of these technologies in domains where such data is scarce or difficult to obtain, and creating a virtuous cycle in which newly generated annotations progressively reduce the need for such data-scarcity workarounds.
At the same time, the synergy between human expertise and automated systems represents fertile ground for research. The current landscape of image annotation tools reflects a growing recognition of the invaluable role of human intuition and understanding in improving AI-generated annotations. Future research should endeavor to improve this symbiosis by developing more intuitive interfaces and feedback mechanisms that are open source and easy to use. These systems should not only facilitate the incorporation of human corrections, but also learn from these interactions, thus continuously improving the accuracy and relevance of the annotations; a sketch of such a loop follows.
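A minimal sketch of that feedback loop, assuming the nearest-neighbour retrieval index from the earlier sketches: accepted and corrected annotations are folded back into the annotated pool, and the index is refit periodically so future suggestions reflect the human input.

    # Sketch: fold human corrections back into the retrieval index so the
    # system keeps learning from the annotator. Illustrative only.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    class FeedbackIndex:
        def __init__(self, vecs, labels, refit_every=50):
            self.vecs, self.labels = list(vecs), list(labels)
            self.refit_every, self.pending = refit_every, 0
            self.nn = NearestNeighbors().fit(np.array(self.vecs))

        def record(self, vec, human_label):
            # Store the human-verified (or corrected) annotation.
            self.vecs.append(vec)
            self.labels.append(human_label)
            self.pending += 1
            if self.pending >= self.refit_every:    # amortize index rebuilds
                self.nn = NearestNeighbors().fit(np.array(self.vecs))
                self.pending = 0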
In addition, the imperative need for real-time annotation capabilities cannot be overemphasized, especially in domains that require instant decision-making, such as surveillance and live medical diagnosis. The search for real-time processing solutions requires innovations in terms of computational efficiency and algorithmic speed. This may involve taking advantage of the growing adoption of edge computing and developing models adapted for use in resource-limited environments, ensuring that the benefits of automated annotation can be realized across a broader spectrum of applications.
The ethical considerations and privacy concerns surrounding the use of these technologies, particularly in sensitive areas such as personal surveillance and healthcare, require rigorous attention. Future research should prioritize the development of ethical frameworks and privacy-preserving mechanisms. This includes exploring advanced data anonymization techniques and secure data sharing protocols to protect individual privacy, while enabling the beneficial applications of image annotation technologies.

ACKNOWLEDGMENT
National Funds finance this work through the Portuguese funding agency, FCT (Fundação para a Ciência e a Tecnologia), within project PTDC/EEI-EEE/5557/2020. Co-funded by the European Union (grant number 101095359) and supported by the UK Research and Innovation (grant number 10058099). However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the Health and Digital Executive Agency (HaDEA).

REFERENCES
[1] ‘Machine Learning With Big Data: Challenges and Approaches’, IEEE Access, doi: 10.1109/ACCESS.2017.2696365.
[2] ‘Machine Learning With Big Data: Challenges and Approaches’, IEEE Access, doi: 10.1109/ACCESS.2017.2696365.
[3] L. Ren, J. Lu, J. Feng, and J. Zhou, ‘Uniform and Variational Deep Learning for RGB-D Object Recognition and Person Re-Identification’, IEEE Transactions on Image Processing, vol. 28, no. 10, pp. 4970–4983, Oct. 2019, doi: 10.1109/TIP.2019.2915655.
[4] J. Seo and H. Park, ‘Object Recognition in Very Low Resolution Images Using Deep Collaborative Learning’, IEEE Access, vol. 7, pp. 134071–134082, 2019, doi: 10.1109/ACCESS.2019.2941005.
[5] S. H. Kasaei, ‘OrthographicNet: A Deep Transfer Learning Approach for 3-D Object Recognition in Open-Ended Domains’, IEEE/ASME Transactions on Mechatronics, vol. 26, no. 6, pp. 2910–2921, Dec. 2021, doi: 10.1109/TMECH.2020.3048433.
[6] S.-J. Liu, H. Luo, and Q. Shi, ‘Active Ensemble Deep Learning for Polarimetric Synthetic Aperture Radar Image Classification’, IEEE Geoscience and Remote Sensing Letters, vol. 18, no. 9, pp. 1580–1584, Sep. 2021, doi: 10.1109/LGRS.2020.3005076.
[7] K. Muhammad, S. Khan, J. D. Ser, and V. H. C. de Albuquerque, ‘Deep Learning for Multigrade Brain Tumor Classification in Smart Healthcare Systems: A Prospective Survey’, IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 507–522, Feb. 2021, doi: 10.1109/TNNLS.2020.2995800.
[8] W. Teng, N. Wang, H. Shi, Y. Liu, and J. Wang, ‘Classifier-Constrained Deep Adversarial Domain Adaptation for Cross-Domain Semisupervised Classification in Remote Sensing Images’, IEEE Geoscience and Remote Sensing Letters, vol. 17, no. 5, pp. 789–793, May 2020, doi: 10.1109/LGRS.2019.2931305.
[9] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, ‘A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects’, IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 12, pp. 6999–7019, Dec. 2022, doi: 10.1109/TNNLS.2021.3084827.
[10] R. Chauhan, K. K. Ghanshala, and R. C. Joshi, ‘Convolutional Neural Network (CNN) for Image Detection and Recognition’, in 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC), Dec. 2018, pp. 278–282, doi: 10.1109/ICSCCC.2018.8703316.
[11] Y. Xu et al., ‘Transformers in computational visual media: A survey’, Comp. Visual Media, vol. 8, no. 1, pp. 33–62, Mar. 2022, doi: 10.1007/s41095-021-0247-3.
[12] Y. Liu et al., ‘A Survey of Visual Transformers’, IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21, 2023, doi: 10.1109/TNNLS.2022.3227717.
[13] K. G. Ince, A. Koksal, A. Fazla, and A. A. Alatan, ‘Semi-Automatic Annotation for Visual Object Tracking’, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1233–1239, doi: 10.1109/ICCVW54120.2021.00143.
[14] L. Porzi, M. Hofinger, I. Ruiz, J. Serrat, S. R. Bulo, and P. Kontschieder, ‘Learning Multi-Object Tracking and Segmentation From Automatic Annotations’, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6846–6855, doi: 10.48550/arXiv.1912.02096.
[15] X. Li, L. Chen, L. Zhang, F. Lin, and W.-Y. Ma, ‘Image annotation by large-scale content-based image retrieval’, in Proceedings of the 14th ACM International Conference on Multimedia (MM ’06), New York, NY, USA: Association for Computing Machinery, Oct. 2006, pp. 607–610, doi: 10.1145/1180639.1180764.
[16] D. D. Burdescu, C. G. Mihai, L. Stanescu, and M. Brezovan, ‘Automatic image annotation and semantic based image retrieval for medical domain’, Neurocomputing, vol. 109, pp. 33–48, Jun. 2013, doi: 10.1016/j.neucom.2012.07.030.
[17] O. Pelka, F. Nensa, and C. M. Friedrich, ‘Annotation of enhanced radiographs for medical image retrieval with deep convolutional neural networks’, PLOS ONE, vol. 13, no. 11, 2018, doi: 10.1371/journal.pone.0206229.
[18] L. Porzi, M. Hofinger, I. Ruiz, J. Serrat, S. R. Bulo, and P. Kontschieder, ‘Learning Multi-Object Tracking and Segmentation From Automatic Annotations’, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6846–6855, doi: 10.48550/arXiv.1912.02096.
[19] X. Li, L. Chen, L. Zhang, F. Lin, and W.-Y. Ma, ‘Image annotation by large-scale content-based image retrieval’, in Proceedings of the 14th ACM International Conference on Multimedia (MM ’06), New York, NY, USA: Association for Computing Machinery, Oct. 2006, pp. 607–610, doi: 10.1145/1180639.1180764.
[20] M. M. Adnan et al., ‘Automated Image Annotation With Novel Features Based on Deep ResNet50-SLT’, IEEE Access, vol. 11, pp. 40258–40277, 2023, doi: 10.1109/ACCESS.2023.3266296.
[21] S. A. H. Minoofam, A. Bastanfard, and M. R. Keyvanpour, ‘TRCLA: A Transfer Learning Approach to Reduce Negative Transfer for Cellular Learning Automata’, IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 5, pp. 2480–2489, May 2023, doi: 10.1109/TNNLS.2021.3106705.
[22] Z. Zhu, K. Lin, A. K. Jain, and J. Zhou, ‘Transfer Learning in Deep Reinforcement Learning: A Survey’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 13344–13362, Nov. 2023, doi: 10.1109/TPAMI.2023.3292075.
[23] H. Han, H. Liu, C. Yang, and J. Qiao, ‘Transfer Learning Algorithm With Knowledge Division Level’, IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 11, pp. 8602–8616, Nov. 2023, doi: 10.1109/TNNLS.2022.3151646.
[24] Z. Fan, L. Shi, Q. Liu, Z. Li, and Z. Zhang, ‘Discriminative Fisher Embedding Dictionary Transfer Learning for Object Recognition’, IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 1, pp. 64–78, Jan. 2023, doi: 10.1109/TNNLS.2021.3089566.
[25] H. Shi, J. Li, J. Mao, and K.-S. Hwang, ‘Lateral Transfer Learning for Multiagent Reinforcement Learning’, IEEE Transactions on Cybernetics, vol. 53, no. 3, pp. 1699–1711, Mar. 2023, doi: 10.1109/TCYB.2021.3108237.
[26] W. Zhang, Y. Zhang, and L. Zhang, ‘Multiplanar Data Augmentation and Lightweight Skip Connection Design for Deep-Learning-Based Abdominal CT Image Segmentation’, IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–11, 2023, doi: 10.1109/TIM.2023.3328707.
[27] Y. Ma, M. Liu, Y. Tang, X. Wang, and Y. Wang, ‘Image-Level Automatic Data Augmentation for Pedestrian Detection’, IEEE Transactions on Instrumentation and Measurement, vol. 73, pp. 1–12, 2024, doi: 10.1109/TIM.2023.3336760.
[28] J. Cao, M. Luo, J. Yu, M.-H. Yang, and R. He, ‘ScoreMix: A Scalable Augmentation Strategy for Training GANs With Limited Data’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 7, pp. 8920–8935, Jul. 2023, doi: 10.1109/TPAMI.2022.3231649.
[29] L. Zhang and K. Ma, ‘A Good Data Augmentation Policy is not All You Need: A Multi-Task Learning Perspective’, IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 5, pp. 2190–2201, May 2023, doi: 10.1109/TCSVT.2022.3219339.
[30] X. Wang, X. Wang, B. Jiang, and B. Luo, ‘Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification’, IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 12, pp. 7789–7802, Dec. 2023, doi: 10.1109/TCSVT.2023.3282777.
[31] P. Tian and S. Xie, ‘An Adversarial Meta-Training Framework for Cross-Domain Few-Shot Learning’, IEEE Transactions on Multimedia, vol. 25, pp. 6881–6891, 2023, doi: 10.1109/TMM.2022.3215310.
[32] Y. Cui et al., ‘Uncertainty-Guided Semi-Supervised Few-Shot Class-Incremental Learning With Knowledge Distillation’, IEEE Transactions on Multimedia, vol. 25, pp. 6422–6435, 2023, doi: 10.1109/TMM.2022.3208743.
[33] J. Li, M. Gong, H. Liu, Y. Zhang, M. Zhang, and Y. Wu, ‘Multiform Ensemble Self-Supervised Learning for Few-Shot Remote Sensing Scene Classification’, IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023, doi: 10.1109/TGRS.2023.3234252.
[34] H.-J. Ye, L. Han, and D.-C. Zhan, ‘Revisiting Unsupervised Meta-Learning via the Characteristics of Few-Shot Tasks’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3721–3737, Mar. 2023, doi: 10.1109/TPAMI.2022.3179368.
[35] G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos, ‘Supervised Learning of Semantic […]
[…] 109011, Jan. 2023, doi: 10.1016/j.patcog.2022.109011.
[55] J. Zhu, X. Li, C. Zhang, and T. Shi, ‘An accurate approach for obtaining spatiotemporal information of vehicle loads on bridges based on 3D bounding box reconstruction with computer vision’, Measurement, vol. 181, p. 109657, Aug. 2021, doi: 10.1016/j.measurement.2021.109657.
[56] Q. Liu et al., ‘ASIST: Annotation-free synthetic instance segmentation and tracking by adversarial simulations’, Computers in Biology and Medicine, vol. 134, p. 104501, Jul. 2021, doi: 10.1016/j.compbiomed.2021.104501.
[57] F. Lateef, M. Kas, and Y. Ruichek, ‘Motion and geometry-related information fusion through a framework for object identification from a moving camera in urban driving scenarios’, Transportation Research Part C: Emerging Technologies, vol. 155, p. 104271, Oct. 2023, doi: 10.1016/j.trc.2023.104271.
[58] Z. Chen et al., ‘Siamese DETR’, in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2023, pp. 15722–15731, doi: 10.1109/CVPR52729.2023.01509.
[59] T. T. Santos, L. L. de Souza, A. A. dos Santos, and S. Avila, ‘Grape detection, segmentation, and tracking using deep neural networks and three-dimensional association’, Computers and Electronics in Agriculture, vol. 170, p. 105247, Mar. 2020, doi: 10.1016/j.compag.2020.105247.
[60] J. Faritha Banu, P. Muneeshwari, K. Raja, S. Suresh, T. P. Latchoumi, and S. Deepan, ‘Ontology Based Image Retrieval by Utilizing Model Annotations and Content’, in 2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Jan. 2022, pp. 300–305, doi: 10.1109/Confluence52989.2022.9734194.
[61] Y.-H. Chen, E. J.-L. Lu, and S.-C. Lin, ‘Ontology-based Dynamic Semantic Annotation for Social Image Retrieval’, in 2020 21st IEEE International Conference on Mobile Data Management (MDM), Jun. 2020, pp. 337–341, doi: 10.1109/MDM48529.2020.00074.
[62] I. Ahmed, N. Iltaf, Z. Khan, and U. Zia, ‘Deep-view linguistic and inductive learning (DvLIL) based framework for Image Retrieval’, Information Sciences, vol. 649, p. 119641, Nov. 2023, doi: 10.1016/j.ins.2023.119641.
[63] P. Das and A. Neelima, ‘A Robust Feature Descriptor for Biomedical Image Retrieval’, IRBM, vol. 42, no. 4, pp. 245–257, Aug. 2021, doi: 10.1016/j.irbm.2020.06.007.
[64] B. Wang, X. Zheng, B. Qu, and X. Lu, ‘Retrieval Topic Recurrent Memory Network for Remote Sensing Image Captioning’, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 256–270, 2020, doi: 10.1109/JSTARS.2019.2959208.
[65] M. M. Adnan et al., ‘Automated Image Annotation With Novel Features Based on Deep ResNet50-SLT’, IEEE Access, vol. 11, pp. 40258–40277, 2023, doi: 10.1109/ACCESS.2023.3266296.
[66] U. A. Khan and A. Javed, ‘A hybrid CBIR system using novel local tetra angle patterns and color moment features’, Journal of King Saud University - Computer and Information Sciences, vol. 34, no. 10, Part A, pp. 7856–7873, Nov. 2022, doi: 10.1016/j.jksuci.2022.07.005.
[67] Y. Yang, S. Jiao, J. He, B. Xia, J. Li, and R. Xiao, ‘Image retrieval via learning content-based deep quality model towards big data’, Future Generation Computer Systems, vol. 112, pp. 243–249, Nov. 2020, doi: 10.1016/j.future.2020.05.016.
[68] Z. Khan, B. Latif, J. Kim, H. K. Kim, and M. Jeon, ‘DenseBert4Ret: Deep bi-modal for image retrieval’, Information Sciences, vol. 612, pp. 1171–1186, Oct. 2022, doi: 10.1016/j.ins.2022.08.119.
[69] A. Guan, L. Liu, X. Fu, and L. Liu, ‘Precision medical image hash retrieval by interpretability and feature fusion’, Computer Methods and Programs in Biomedicine, vol. 222, p. 106945, Jul. 2022, doi: 10.1016/j.cmpb.2022.106945.
[70] M. Zamiri and H. Sadoghi Yazdi, ‘Image annotation based on multi-view robust spectral clustering’, Journal of Visual Communication and Image Representation, vol. 74, p. 103003, Jan. 2021, doi: 10.1016/j.jvcir.2020.103003.
[71] J. Bragantini, A. X. Falcão, and L. Najman, ‘Rethinking interactive image segmentation: Feature space annotation’, Pattern Recognition, vol. 131, p. 108882, Nov. 2022, doi: 10.1016/j.patcog.2022.108882.
[72] J. Bhattacharya, T. Bhatia, and H. S. Pannu, ‘Improved search space shrinking for medical image retrieval using capsule architecture and decision fusion’, Expert Systems with Applications, vol. 171, p. 114543, Jun. 2021, doi: 10.1016/j.eswa.2020.114543.
[73] D. Bhanu Mahesh, G. Satyanarayana Murty, and D. Rajya Lakshmi, ‘Optimized Local Weber and Gradient Pattern-based medical image retrieval and optimized Convolutional Neural Network-based classification’, Biomedical Signal Processing and Control, vol. 70, p. 102971, Sep. 2021, doi: 10.1016/j.bspc.2021.102971.
[74] F. Cadar, W. Melo, V. Kanagasabapathi, G. Potje, R. Martins, and E. R. Nascimento, ‘Improving the matching of deformable objects by learning to detect keypoints’, Pattern Recognition Letters, vol. 175, pp. 83–89, Nov. 2023, doi: 10.1016/j.patrec.2023.08.012.
[75] S. M. Roostaiyan, M. M. Hosseini, M. M. Kashani, and S. H. Amiri, ‘Toward real-time image annotation using marginalized coupled dictionary learning’, J Real-Time Image Proc, vol. 19, no. 3, pp. 623–638, Jun. 2022, doi: 10.1007/s11554-022-01210-6.
[76] H. Wei and Y. Huang, ‘Online Multiple Object Tracking Using Spatial Pyramid Pooling Hashing and Image Retrieval for Autonomous Driving’, Machines, vol. 10, no. 8, Art. no. 8, Aug. 2022, doi: 10.3390/machines10080668.

[…] University of Trás-os-Montes and Alto Douro (UTAD). Prof. António participated as a member in 7 funded research projects. His research interests are medical image analysis, bio-image analysis, computer vision, machine learning, and artificial intelligence, particularly in Computer-aided Diagnosis applied in several imaging modalities, e.g. computed tomography of the lung and endoscopic videos. He is part of the organization committee of HCIST - International Conference on Health and Social Care Information Systems and Technologies (2013-2015, 2020-2023), and the organization chair (2012) and Advisory Board (2016-2023).