
Smart Innovation, Systems and Technologies 224

Suresh Chandra Satapathy
Vikrant Bhateja
Margarita N. Favorskaya
T. Adilakshmi
Editors

Smart Computing
Techniques and
Applications
Proceedings of the Fourth International
Conference on Smart Computing and
Informatics, Volume 2
Smart Innovation, Systems and Technologies

Volume 224

Series Editors
Robert J. Howlett, Bournemouth University and KES International,
Shoreham-by-Sea, UK
Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK
The Smart Innovation, Systems and Technologies book series encompasses the
topics of knowledge, intelligence, innovation and sustainability. The aim of the
series is to make available a platform for the publication of books on all aspects of
single and multi-disciplinary research on these themes in order to make the latest
results available in a readily accessible form. Volumes on interdisciplinary research
combining two or more of these areas are particularly sought.
The series covers systems and paradigms that employ knowledge and intelligence
in a broad sense. Its scope is systems having embedded knowledge and intelligence,
which may be applied to the solution of world problems in industry, the environment
and the community. It also focusses on the knowledge-transfer methodologies and
innovation strategies employed to make this happen effectively. The combination of
intelligent systems tools and a broad range of applications introduces a need for a
synergy of disciplines from science, technology, business and the humanities. The
series will include conference proceedings, edited collections, monographs, hand-
books, reference books, and other relevant types of book in areas of science and
technology where smart systems and technologies can offer innovative solutions.
High quality content is an essential feature for all book proposals accepted for the
series. It is expected that editors of all accepted volumes will ensure that
contributions are subjected to an appropriate level of reviewing process and adhere
to KES quality principles.
Indexed by SCOPUS, EI Compendex, INSPEC, WTI Frankfurt eG, zbMATH,
Japanese Science and Technology Agency (JST), SCImago, DBLP.
All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/8767


Suresh Chandra Satapathy · Vikrant Bhateja ·
Margarita N. Favorskaya · T. Adilakshmi
Editors

Smart Computing
Techniques and Applications
Proceedings of the Fourth International
Conference on Smart Computing
and Informatics, Volume 2
Editors

Suresh Chandra Satapathy
School of Computer Engineering
KIIT University
Bhubaneswar, Odisha, India

Vikrant Bhateja
Department of Electronics and Communication Engineering
Shri Ramswaroop Memorial Group of Professional Colleges (SRMGPC)
Lucknow, Uttar Pradesh, India
Dr. A.P.J. Abdul Kalam Technical University
Lucknow, Uttar Pradesh, India

Margarita N. Favorskaya
Informatics and Computer Techniques
Reshetnev Siberian State University of Science and Technologies
Krasnoyarsk, Russia

T. Adilakshmi
Department of Computer Science and Engineering
Vasavi College of Engineering
Hyderabad, India

ISSN 2190-3018 ISSN 2190-3026 (electronic)


Smart Innovation, Systems and Technologies
ISBN 978-981-16-1501-6 ISBN 978-981-16-1502-3 (eBook)
https://doi.org/10.1007/978-981-16-1502-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Conference Committees

Chief Patrons

Sri. M. Krishna Murthy, Secretary, VAE


Sri. P. Balaji, CEO, VCE

Patron

Dr. S. V. Ramana, Principal, VCE

Honorary Chair

Dr. Lakhmi Jain, Australia

General Chair

Dr. Suresh Chandra Satapathy, KIIT DU, Bhubaneswar

Organizing Chair

Dr. T. Adilakshmi, Professor and HOD, CSE, VCE


Publication Chairs

Dr. Nagaratna P. Hegde, Professor, CSE, VCE


Dr. Vikrant Bhateja, SRMGPC, Lucknow, UP, India

Program Committee

Dr. S. Ramachandram, Former Vice Chancellor, OU


Dr. Banshidhar Majhi, Director, IITDM, Kancheepuram
Dr. Siba K. Udgata, Professor, HCU
Dr. Sourav Mukhopadhyay, Associate Professor, IIT Kharagpur
Dr. P. Radha Krishna, Professor, CSE, NIT Warangal
Dr. M. M. Gore, Professor, MNNIT, Allahabad
Dr. S. M. Hegde, Professor, NIT Surathkal
Dr. Bapi Raju S., Professor, IIIT Hyderabad
Dr. Rajendra Hegadi, Associate Professor, IIIT Dharwad
Dr. S. Sameen Fatima, Former Principal, OU
Dr. K. Shyamala, Professor, OU
Dr. Naveen Sivadasan, TCS Innovation Labs, Hyderabad
Dr. Badrinath G. Srinivas, Research Scientist—III, Amazon Development Center,
Hyderabad
Dr. Ravindra S. Hegadi, Professor, PAH Solapur University
Dr. S. P. Algur, Professor and Chairman, CSE, Rani Channamma University, Belgavi
Dr. R. Sridevi, HOD, Department of CSE, JNTUH

International Advisory Committee/Technical Program Committee

Dr. Rammohan, South Korea


Dr. Kailash C. Patidar, South Africa
Dr. Naeem Hanoon, Malaysia
Dr. Vimal Kumar, the University of Waikato, New Zealand
Dr. Akshay Sadananda Uppinakudru Pai, University of Copenhagen, Denmark
Dr. K. C. Santosh, the University of South Dakota
Dr. Ayush Goyal, Texas A&M University, Kingsville
Dr. Sobhan Babu, Associate Professor, IIT Hyderabad
Dr. D. V. L. N. Somayajulu, Director, IIIT, Kurnool
Dr. Siba Udgata, Professor, HCU
Dr. R. B. V. Subramaanyam, Professor, NITW
Dr. S. G. Sanjeevi, Professor, NITW

Dr. Sanjay Sengupta, CSIR, New Delhi


Dr. A. Govardhan, Rector, JNTU Hyderabad
Prof. Chintan Bhatt, Chandubhai Patel Institute of Technology, Gujarat
Dr. Munesh Chandra Trivedi, ABES Engineering College, Ghaziabad
Dr. Alok Aggarwal, Professor
Dr. Anuja Arora, Jaypee Institute of Information Technology, Noida, India
Dr. Divakar Yadav, Associate Professor, MMMUT, Gorakhpur, India
Dr. Kuda Nageswar Rao, Andhra University, Visakhapatnam
Dr. M. Ramakrishna Murthy, ANITS, Visakhapatnam
Dr. Suberna Kumar, MVGR, Vizianagaram
Dr. J. V. R. Murthy, Director Incubation and IPR, JNTU Kakinada
Dr. D. Ravi, IDRBT, Hyderabad
Dr. Badrinath G. Srinivas, Research Scientist–III, Amazon Development Center,
Hyderabad
Dr. K. Shyamala, Professor, OU
Dr. P. V. Sudha, Professor, OU
Dr. M. A. Hameed, Assistant Professor, OU
Dr. B. Sujatha, Assistant Professor, OU
Dr. T. Adilakshmi, Professor and HOD, CSE, VCE
Dr. Nagaratna P. Hegde, Professor, CSE, VCE
Dr. V. Sireesha, Assistant Professor, CSE, VCE

Organizing Committee

Dr. D. Baswaraj, Professor, CSE, VCE


Dr. K. Srinivas, Associate Professor, CSE, VCE
Dr. V. Sireesha, Assistant Professor, CSE, VCE
Mr. S. Vinay Kumar, Assistant Professor, CSE, VCE
Mr. M. Sashi Kumar, Assistant Professor, CSE, VCE
M. Sunitha Reddy, Assistant Professor, CSE, VCE
R. Sateesh Kumar, Assistant Professor, CSE, VCE
Mr. T. Nishitha, Assistant Professor, CSE, VCE

Publicity Committee

Dr. M. Shanmukhi, Professor, CSE, VCE


Mr. C. Gireesh, Assistant Professor, CSE, VCE
Ms. T. Jalaja, Assistant Professor, CSE, VCE
Mr. I. Navakanth, Assistant Professor, CSE, VCE
Ms. S. Komal Kaur, Assistant Professor, CSE, VCE
Mr. T. Saikanth, Assistant Professor, CSE, VCE

Ms. K. Mamatha, Assistant Professor, CSE, VCE


Mr. P. Narasiah, Assistant Professor, CSE, VCE

Website Committee

Mr. S. Vinay Kumar, Assistant Professor, CSE, VCE


Mr. M. S. V. Sashi Kumar, Assistant Professor, CSE, VCE
Preface

This volume contains the selected papers presented at the 4th International Confer-
ence on Smart Computing and Informatics (SCI 2020) organized by the Depart-
ment of Computer Science and Engineering, Vasavi College of Engineering
(Autonomous), Ibrahimbagh, Hyderabad, Telangana, during October 9–10, 2020.
It provided a great platform for researchers from across the world to report, delib-
erate, and review the latest progress in the cutting-edge research pertaining to smart
computing and its applications to various engineering fields. The response to SCI
2020 was overwhelming with a good number of submissions from different areas
relating to artificial intelligence, machine learning, cognitive computing, computa-
tional intelligence, and its applications in main tracks. After a rigorous peer review
with the help of technical program committee members and external reviewers, only
quality papers were accepted for presentation and subsequent publication in this
volume of SIST series of Springer.
Several special sessions were floated by eminent professors in cutting-edge
technologies such as blockchain, AI, ML, data engineering, computational intel-
ligence, big data analytics and business analytics, and intelligent systems. Eminent
researchers and academicians delivered talks addressing the participants in their
respective fields of proficiency. Our thanks are due to Prof. Roman Senkerik, Head
of AI Lab, Tomas Bata University in Zlin, Czech Republic; Shri. Shankarnarayan
Bhat, Director Engineering, Intel Technologies India Pvt. Ltd.; Ms. Krupa Rajendran,
Assoc. VP, HCL Technologies; and Mr. Aninda Bose, Springer, India, for delivering
keynote addresses for the benefit of the participants. We would like to express our
appreciation to the members of the technical program committee for their support
and cooperation in this publication. We are also thankful to the team from Springer
for providing a meticulous service for the timely production of this volume.
Our heartfelt thanks to Shri. M. Krishna Murthy, Secretary, VAE; Sri. P. Balaji,
CEO, VCE; and Dr. S. V. Ramana, Principal, VCE, for extending support to conduct
this conference in Vasavi College of Engineering. Profound thanks to Prof. Lakhmi
C. Jain, Australia, for his continuous guidance and support from the beginning of the
conference. Without his support, we could never have executed such a mega event. We
are grateful to all the eminent guests, special chairs, track managers, and reviewers


for their excellent support. A special vote of thanks to numerous authors across the
country as well as abroad for their valued submissions and to all the delegates for
their fruitful discussions that made this conference a great success.
Editorial Board of SCI 2020

Bhubaneswar, India Suresh Chandra Satapathy


Lucknow, India Vikrant Bhateja
Krasnoyarsk, Russia Margarita N. Favorskaya
Hyderabad, India T. Adilakshmi
List of Special Sessions Collocated with SCI-2020

SS_01: Next-Generation Data Engineering and Communication Technology

Dr. Suresh Limkar, AISSMS Institute of Information Technology, Pune

SS_02: Artificial Intelligence and Machine Learning Applications (AIML)

Dr. Sowmya V., CEN, Amrita Vishwa Vidyapeetham, Coimbatore


Dr. Anand Kumar M., NIT Karnataka
Dr. M. Venkatesan, NIT Karnataka
Prof. Soman K. P., Amrita Vishwa Vidyapeetham, Coimbatore

SS_03: Advances in Computational Intelligence and Its Applications

Dr. C. Kishor Kumar Reddy, Stanley College of Engineering and Technology for
Women, Hyderabad
P. R. Anisha, Stanley College of Engineering and Technology for Women, Hyderabad


SS_04: Blockchain Technology: Foundations, Challenges, and Applications

Prof. Sandeep Kumar Panda, Faculty of Science and Technology, ICFAI Foundation
for Higher Education, Hyderabad
Prof. Santosh Kumar Swain, School of Computer Engineering, KIIT (Deemed to be)
University, Bhubaneswar

SS_05: Application of Machine Learning for Intelligent System Design

Dr. Minakhi Rout, KIIT (Deemed to be) University, Bhubaneswar

SS_06: Advances in Big Data Analytics and Business Intelligence

Dr. Vijay B. Gadicha, G. H. Raisoni Academy of Engineering and Technology, Nagpur
Dr. Ajay B. Gadicha, P. R. Pote College of Engineering and Management, Amravati

SS_07: Recent Advances in Artificial Intelligence: Applications, Challenges, and Future Trends

Dr. S. Velliangiri, CMR Institute of Technology, Hyderabad


Dr. P. Karthikeyan, Presidency University, Bengaluru
Dr. Iwin Thanakumar Joseph, KITS, Coimbatore
Contents

An Intelligent Tracking Application for Post-pandemic . . . . . . . . . . . . . . . 1


V. Roopa, R. Vasikaran, M. Sriram Karthik, S. Sindhu, and N. Vaishnavi
Investigation on the Influence of English Expertise on Non-native
English-Speaking Students’ Scholastic Performance Using Data
Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Subhashini Sailesh Bhaskaran and Mansoor Al Aali
Machine Learning Algorithms for Modelling Agro-climatic
Indices: A Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
G. Edwin Prem Kumar and M. Lydia
Design of Metal-Insulator-Metal Based Stepped Impedance
Square Ring Resonator Dual-Band Band Pass Filter . . . . . . . . . . . . . . . . . . 25
Surendra Kumar Bitra and M. Sridhar
Covid-19 Spread Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Srinivas Kanakala and Vempaty Prashanthi
Social Media Anatomy of Text and Emoji in Expressions . . . . . . . . . . . . . . 41
Shelley Gupta, Ojas Garg, Radhika Mehrotra, and Archana Singh
Development of Machine Learning Model Using Least
Square-Support Vector Machine, Differential Evolution and Back
Propagation Neural Network to Detect Breast Cancer . . . . . . . . . . . . . . . . . 51
Madhura D. Vankar and G. A. Patil
Distributed and Decentralized Attribute Based Access Control
for Smart Health Care Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
B. Ravinder Reddy and T. Adilakshmi
Dynamic Node Identification Management in Hadoop Cluster
Using DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
J. Balaraju and P. V. R. D. Prasada Rao


A Scientometric Inspection of Research Based on WordNet Lexical
During 1995–2019 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Minni Jain, Gaurav Sharma, and Amita Jain
Sentiment Analysis of an Online Sentiment with Text and Slang
Using Lexicon Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Shelley Gupta, Shubhangi Bisht, and Shirin Gupta
Fuzzy Logic Technique for Evaluation of Performance of Load
Balancing Algorithms in MCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Divya, Harish Mittal, Niyati Jain, Bijender Bansal, and Deepak Kr. Goyal
Impact of Bio-inspired Algorithms to Predict Heart Diseases . . . . . . . . . . 121
N. Sree Sandhya and G. N. Beena Bethel
Structured Data Extraction Using Machine Learning from Image
of Unstructured Bills/Invoices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
K. M. Yindumathi, Shilpa Shashikant Chaudhari, and R. Aparna
Parallel Enhanced Chaotic Model-Based Integrity to Improve
Security and Privacy on HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
B. Madhuravani, N. Chandra Sekhar Reddy, and Boggula Lakshmi
Exploring the Fog Computing Technology in Development of IoT
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Chaitanya Nukala, Varagiri Shailaja, A. V. Lakshmi Prasuna, and B. Swetha
NavRobotVac: A Navigational Robotic Vacuum Cleaner Using
Raspberry Pi and Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Shaik Abdul Nabi and Mettu Krishna Vardhan
A Hybrid Clinical Data Predication Approach Using Modified PSO . . . . 169
P. S. V. Srinivasa Rao, Mekala Srinivasa Rao, and Ranga Swamy Sirisati
Software Defect Prediction Using Optimized Cuckoo Search Based
Nature-Inspired Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
C. Srinivasa Kumar, Ranga Swamy Sirisati, and Srinivasulu Thonukunuri
Human Facial Expression Recognition Using Fusion of DRLDP
and DCT Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
M. Avanthi and P. Chandra Sekhar Reddy
Brain Tumor Classification and Segmentation Using Deep Learning . . . . 201
Manohar Madgi, Shantala Giraddi, Geeta Bharamagoudar, and M. S. Madhur
A Hybrid Approach Using ACO-GA for Task Scheduling in Cloud . . . . . 209
Simran Shrivas, Sonika Shrivastava, and Lalit Purohit
K-Means Algorithm-Based Text Extraction from Complex Video
Images Using 2D Wavelet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Divya Saxena and Anubhav Kumar

Opinion Mining-Based Conjoint Analysis of Consumer Brands . . . . . . . . 227


Kumar Ravi, Aishwarya Priyadarshini, and Vadlamani Ravi
Task Scheduling in Cloud Using Improved Genetic Algorithm . . . . . . . . . 241
Shyam Sunder Pabboju and T. Adilakshmi
Sentiment Analysis for Telugu Text Using Cuckoo Search Algorithm . . . 253
G. Janardana Naidu and M. Seshashayee
Automation of Change Impact Analysis for Python Applications . . . . . . . 259
T. Jalaja, T. Adilakshmi, and P. S. R. Abhishek
Enhancing Item-Based Collaborative Filtering for Music
Recommendation System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
M. Sunitha, T. Adilakshmi, and Mir Zahed Ali
Deep Learning-Based Enhanced Classification Model
for Pneumonia Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
S. Jeba Priya, S. Joshua Jaistein, G. Naveen Sundar,
and T. Raja Sundrapandiyanleebanon
Automatic Fake News Detector in Social Media Using Machine
Learning and Natural Language Processing Approaches . . . . . . . . . . . . . . 295
J. Srinivas, K. Venkata Subba Reddy, G. J. Sunny Deol,
and P. VaraPrasada Rao
A Novel Method for Optimizing Data Consumption by Enabling
a Custom Plug-In . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Vijay A. Kanade
An Effective Mechanism for the Secure Transmission of Medical
Images Using Compression and Public Key Encryption Mechanism . . . . 317
T. K. Ratheesh and Varghese Paul
A Systematic Survey on Radar Target Detection Techniques in Sea
Clutter Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
R. Navya and R. Devaraju
An Ensemble Model for Predicting Chronic Diseases Using
Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
B. Manjulatha and Suresh Pabboju
COVID-19 Face Mask Live Detection Using OpenCV . . . . . . . . . . . . . . . . . 347
Anveshini Dumala, Anusha Papasani, and Sireesha Vikkurty
Chest X-Ray Image Analysis of Convolutional Neural Network
Models with Transfer Learning for Prediction of COVID Patients . . . . . . 353
M. Shyamala Devi, P. Swathi, N. Pavan Kumar, Ravi Varma Tungala,
Saranya Vivekanandan, and Priyanka Moorthy

Predicting Customer Loyalty in Banking Sector with Mixed
Ensemble Model and Hybrid Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
Jesmi Latheef and S. Vineetha
Design Patterns and Microservices for Reengineering of Legacy
Web Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
V. Dattatreya, K. V. Chalapati Rao, and M. Raghava
A Comparative Study on Single Image Dehazing Using
Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
Poornima Shrivastava, Roopam Gupta, Asmita A. Moghe, and Rakesh Arya
Plasmodium falciparum Detection in Cell Images Using
Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
Smaranjit Ghose, Suhrid Datta, C. Malathy, and M. Gayathri
An Online Path Planning with Modified Autonomous Parallel
Parking Controller for Collision Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Naitik M. Nakrani and Maulin M. Joshi
Real-Time Proximity Sensing Module for Social Distancing
and Disease Spread Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Sreeja Rajesh, Varghese Paul, Abdul Adil Basheer, and Jibin Lukose
Automatic Depression Level Analysis Using Audiovisual Modality . . . . . 425
Aishwarya Chordia, Mihir Kale, Mukta Mayee, Preksha Yadav,
and Suhasini Itkar
A Notification Alert System with Heartbeat and Temperature
Sensors for Abnormal Health Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
V. Sireesha, M. S. V. Sashi Kumar, S. Vinay Kumar, and R. M. Shiva Krishna
Recommender System for Resolving the Cold Start Challenges
Using Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
Chandrima Roy, Siddharth Swarup Rautray, and Manjusha Pandey
A Skyline Based Technique for Web Service Selection . . . . . . . . . . . . . . . . . 461
Yamini Barge, Lalit Purohit, and Soma Saha
Novel Trust Model to Enhance Availability in Private Cloud . . . . . . . . . . . 473
Vijay Kumar Damera, A. Nagesh, and M. Nagaratna
Feature Impact on Sentiment Extraction of TEnglish Code-Mixed
Movie Tweets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
S. Padmaja, M. Nikitha, Sasidhar Bandu, and S. Sameen Fatima
Linear and Ensembling Regression Based Health Cost Insurance
Prediction Using Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
M. Shyamala Devi, P. Swathi, M. Purushotham Reddy,
V. Deepak Varma, A. Praveen Kumar Reddy, Saranya Vivekanandan,
and Priyanka Moorthy

An Adaptive Correlation Clustering-Based Recommender System
for the Long-Tail Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Soanpet Sree Lakshmi, T. Adilakshmi, and Bakshi Abhinith
Plant Leaf Identification Using HOG and Random Forest Regressor . . . 515
Jyotisagar Bal, Manas Kumar Rath, and Prasanta Kumar Swain
Deep Learning Based Facial Feature Detection for Ethnicity
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Sujitha Juliet Devaraj, R. Catherine Joy, I. Santhosh, and I. C. Kevin
Scanning Array Antenna Radiation Pattern Design Containing
Asymmetric Null Steering Based on L-ASBO . . . . . . . . . . . . . . . . . . . . . . . . 535
Anitha Suresh, C. Puttamadappa, and Manoj Kumar Singh
A Modified Novel Signal Flow Graph and Memory-Based Radix-8
FFT Processor Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
A. Anitha, B. Triveni, Pinninti Kishore, and Makkena Madhavi Latha
Vouch augmented Program Courses Recommendation System
for E-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
K. B. V. Rama Narasimham, C. V. P. R. Prasad, J. Jyothirmai, and M. Raghava
Heart Disease Prediction Using Extended KNN (E-KNN) . . . . . . . . . . . . . . 565
R. Sateesh Kumar and S. Sameen Fatima
Prediction Analysis of Diabetes Using Machine Learning . . . . . . . . . . . . . . 573
Srikanth Bethu, G. Charles Babu, B. Sankara Babu, and V. Anusha
Enhanced Goodput and Energy-Efficient Geo-Opportunistic
Routing Protocol for Underwater Wireless Sensor Networks . . . . . . . . . . . 585
V. Baranidharan, B. Moulieshwaran, V. Karthik, R. Sanjay,
and V. Thangabalaji
Early Detection of Pneumonia from Chest X-Ray Images Using
Deep Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
Prateek Sarangi, Pradosh Priyadarshan, Swagatika Mishra,
Adyasha Rath, and Ganapati Panda
Detection of Network Anomaly Sequences Using Deep Recurrent
Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
R. Ravinder Reddy, K. Ayyappa Reddy, C. Madan Kumar, and Y. Ramadevi
Driver Drowsiness Detection Using Convolution Neural Networks . . . . . . 617
P. Ravi Teja, G. Anjana Gowri, G. Preethi Lalithya, R. Ajay,
T. Anuradha, and C. S. Pavan Kumar
Glaucoma Detection Using Morphological Filters and GLCM
Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
Babita Pal, Vikrant Bhateja, Archita Johri, Deepika Pal,
and Suresh Chandra Satapathy

Analysis of Encryption Algorithm for Data Security in Cloud
Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
Arijit Dutta, Akash Bhattacharyya, Chinmaya Misra,
and Sudhangshu Sekhar Patra
A Machine Learning Approach in Data Perturbation
for Privacy-Preserving Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
Jayanti Dansana and Adarsh Singh
IoT Service-Based Crowdsourcing Ecosystem in Smart Cities . . . . . . . . . . 655
Arijit Dutta, Ruben Roy, Chinmaya Misra, and Kamakhya Singh
On Interior, Exterior, and Boundary of Fuzzy Soft Multi-Set
Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
S. A. Naisal and K. Reji Kumar
Early Prediction of Pneumonia Using Convolutional Neural
Network and X-Ray Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
C. Kishor Kumar Reddy, P. R. Anisha, and K. Apoorva
Predicting the Energy Output of Wind Turbine Based on Weather
Condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
P. R. Anisha, C. Kishor Kumar Reddy, and Nuzhat Yasmeen
A Study and Early Identificatıon of Leaf Diseases in Plants Using
Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
R. Madana Mohana, C. Kishor Kumar Reddy, and P. R. Anisha
Distributed and Energy Balanced Routing for Heterogeneous
Wireless Sensor Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
Shivani S. Bhasgi and Sujatha Terdal
ESRRAK-Efficient Self-Route Recovery in Wireless Sensor
Networks Using ACO Aggregation and K-Means Algorithm . . . . . . . . . . . 719
Abhijit Halkai and Sujatha Terdal
An Explanation of Personal Variations on the Basis of Model
Theory or RKT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
K. Reji Kumar
Fingerprint Enhancement Using Fuzzy Logic and Deep Neural
Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735
Sridevi Sarraju and Franklin Bein
Gaussian Filter-Based Speech Segmentation Algorithm
for Gujarati Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747
Priyanka Vishwas Gujarathi and Sandip Raosaheb Patil
Smart Farming Technology with AI & Block Chain: A Review . . . . . . . . . 757
Deepali Jawale and Sandeep Malik

Design and Development of Electronic System for Predicting
Nutrient Deficiency in Plants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
Amruta Chore and Dolly Thankachan
Classification of Hyperspectral Images with Various Spatial
Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 773
Sandhya Shinde and Hemant Patidar
Detecting and Classifying Various Diseases in Plants . . . . . . . . . . . . . . . . . . 781
Rashmi Deshpande and Hemant Patidar
Offline Handwritten Dogra Script Recognition Using
Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 789
Reya Sharma, Baijnath Kaushik, and Naveen Kumar Gondhi
Device Design of 30 and 10 nm Triple Gate Single Finger Fin-FET
for On Current (I_ON) and Off Current (I_OFF) Measurement . . . . . . . . . . 799
Sarika M. Jagtap and Vitthal J. Gond
Fact Check Using Multinomial Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . 813
Madhavi Ajay Pradhan, Ankita Shinde, Rohan Dhiman,
Shreyas Ghorpade, and Swapnil Jawale

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825


About the Editors

Suresh Chandra Satapathy is currently working as Professor, KIIT Deemed to


be University, Odisha, India. He obtained his Ph.D. in Computer Science Engi-
neering from JNTUH, Hyderabad, and master’s degree in Computer Science and
Engineering from National Institute of Technology (NIT), Rourkela, Odisha. He has
more than 27 years of teaching and research experience. His research interests include
machine learning, data mining, swarm intelligence studies and their applications to
engineering. He has more than 98 publications to his credit in various reputed inter-
national journals and conference proceedings. He has edited many volumes from
Springer AISC, LNEE, SIST and LNCS in the past, and he is also an editorial
board member of a few international journals. He is a senior member of IEEE and a
life member of Computer Society of India. Currently, he is National Chairman of
Division-V (Education and Research) of Computer Society of India.

Vikrant Bhateja is Associate Professor, Department of ECE in SRMGPC,


Lucknow. His areas of research include digital image and video processing, computer
vision, medical imaging, machine learning, pattern analysis and recognition. He has
around 160 quality publications in various international journals and conference
proceedings. He is associate editor of IJSE and IJACI. He has edited more than 30
volumes of conference proceedings with Springer Nature and is presently EiC of IGI
Global: IJNCR journal.

Dr. Margarita N. Favorskaya is Professor and Head of the Department of Infor-


matics and Computer Techniques at Reshetnev Siberian State University of Science
and Technology, Russian Federation. Professor Favorskaya is a member of KES
organization since 2010, the IPC member and Chair of invited sessions of over 30
international conferences. She serves as Reviewer in international journals (Neuro-
computing, Knowledge Engineering and Soft Data Paradigms, Pattern Recogni-
tion Letters, Engineering Applications of Artificial Intelligence), Associate Editor
of Intelligent Decision Technologies Journal, International Journal of Knowledge-
Based and Intelligent Engineering Systems and International Journal of Reasoning-
based Intelligent Systems, Honorary Editor of the International Journal of Knowl-
edge Engineering and Soft Data Paradigms, Reviewer, Guest Editor and Book Editor

(Springer). She is the author or the co-author of 200 publications and 20 educational
manuals in computer science. She co-authored/co-edited seven books for Springer
recently. She supervised nine Ph.D. candidates and is presently supervising four
Ph.D. students. Her main research interests are digital image and video processing,
remote sensing, pattern recognition, fractal image processing, artificial intelligence
and information technologies.

Dr. T. Adilakshmi is currently working as Professor and Head of the Department,


Vasavi College of Engineering. She completed her Bachelor of Engineering from
Vasavi College of Engineering, Osmania University, in the year 1986, and did her
Master of Technology in CSE from Manipal Institute of Technology, Mangalore,
in 1993. She received Ph.D. from Hyderabad Central University (HCU) in 2006 in
the area of Artificial Intelligence. Her research interests include data mining, image
processing, artificial intelligence, machine learning, computer networks and cloud
computing. She has 23 journal publications to her credit and presented 28 papers at
international and national conferences. She has been recognized as a research super-
visor by Osmania University (OU) and Jawaharlal Nehru Technological University
(JNTU). Two research scholars were awarded Ph.D. under her supervision, and she
is currently supervising 11 Ph.D. students.
An Intelligent Tracking Application
for Post-pandemic

V. Roopa, R. Vasikaran, M. Sriram Karthik, S. Sindhu, and N. Vaishnavi

Abstract The global pandemic has had a huge impact on the country, affecting 3,794,314 people at a tremendous rate and claiming the lives of 66,678. It has severely affected the country's economy and the socioeconomic and psychological status of its citizens, and this health crisis has also altered people's lifestyles. Owing to the prevailing conditions in India, the government has been easing the lockdown and implementing unlock provisions; as a result, it has become difficult to track people's movements and migrations, which is the most practical means of controlling the spread of the disease and identifying suspected positive cases. A mobile-based tracking system can be implemented that records application users' travel history and daily movements in society, together with physiological parameters (temperature), so that changes and suspected cases can be easily tracked.

1 Introduction

On September 17, 2019, a man was admitted with an unknown disease in Wuhan, Hubei province, China; the disease later proved to be the cause of the global pandemic called COVID-19, which drastically changed and affected human lifestyles worldwide. It had a direct impact on people's health, making them fall sick, increasing suffering and ultimately adding to the casualty count. According to WHO reports, COVID-19 affects different people in different ways; dry cough, fever and tiredness are its most common symptoms.

V. Roopa (B) · R. Vasikaran · M. Sriram Karthik · S. Sindhu · N. Vaishnavi


Department of Information Technology, Sri Krishna College of Technology, Coimbatore, India
e-mail: [email protected]
R. Vasikaran
e-mail: [email protected]
M. Sriram Karthik
e-mail: [email protected]


Depending on the strength of a person's immune system, they recover from the disease with or without hospitalization. The pandemic has not only affected people's health but also devastated every country's economy, made socialization a forbidden word and had an indirect effect on people's mental health too. Unfortunately, the world has to keep running amid this pandemic. India went into universal lockdown in April 2020 to reduce the spread of the coronavirus, yet people would suffer even more if kept in lockdown for a long time because, as noted, the world must keep running to meet people's basic needs and sustain the economy. So, month by month, the grip of the lockdown has been loosened, and since then COVID-19 cases have been climbing toward their peak, which is inevitable.
COVID-19 case counts rose exponentially after the staged unlock programs undertaken by the Government of India. The government has made huge efforts to reduce the spread of COVID-19, educating people about social distancing and asking them to maintain it. At the beginning of the country's lockdown, it closed every economic activity other than grocery shops for basic needs and hospitals for health care; it also closed the borders of states, districts and cities, allowing transportation only for people's daily needs and keeping them confined to particular areas. But once all of this was unlocked and people could again move from one place to another, the outbreak became unstoppable, and the healthcare sector has been struggling to track patients' movements. Such tracking helps to identify and predict the people who are most likely to catch the disease, because COVID-19 spreads through contact. Since tracking people is hard, it is difficult to identify a person who might be an active spreader of COVID-19; this results in more patients without any known contact history, which ultimately leads to community spread, and that is even more dangerous.
Currently, 3.77 million people in India are affected by COVID-19. At the same time, we cannot keep people confined, so there must be measures to ensure social distancing; that is a promise that should be kept, but daily we see news that it is not followed properly. Every shop, restaurant, supermarket and place where people gather is now asked to use an infrared thermometer to check customers, and their temperatures are written down along with their names and contact numbers. This manual practice extends from ordinary grocers and shopkeepers to big trading companies and other industries, but it is only useful for detecting whether a person's temperature is high or low; it cannot say whether the cause is coronavirus, nor does it seem useful for tracing more than a few people, and only in rare cases. There is therefore no good solution for tracking the people who visit these places and tracing them afterwards, so we came up with a solution that will be tremendously useful for the post-lockdown scenario, where people have returned to their new normal with face masks and social distancing: the project we undertook can track users, intimate both them and the shop owner, and maintain statistics on who visits each shop. For example, suppose a person goes to a coffee shop, then to a mall, then to a restaurant during this post-lockdown period, reaches home, and this continues for about two weeks, after which he is diagnosed with COVID-19. When the health department asks where he has been during all this time, his memory cannot retain all the details of the places he visited, and he can barely remember; if he misses anything, it will end up as community spread, many people will be affected by the disease, and we will not be able to track the people most likely to catch it. Here is where the game changer comes in: our product, which can be used for tracking during the post-pandemic period.
The system works on the basis of the Global Positioning System (GPS) to locate and track its users. First and foremost, we are creating a mobile app, so it will reach people easily and achieve good engagement. The system provides two provisions: one for ordinary users and another for commercial users, i.e., shop owners and businesses. First comes the registration phase, where you give details such as name, age and address as basic information, after which you are logged in to the application. Whenever you enter any commercial center, you show the device a QR code, created using the RSA algorithm, which appears in the app. An IR thermometer is interfaced with the application via Bluetooth, so the reading is automatically stored in the database through the commercial provision of the application; it also shows up in your own application for personal tracking. For example, if you go to a coffee shop, you show your QR code, the IR thermometer detects your body temperature and sends it to both you and the shopkeeper, and the record shows your name, the current place (the coffee shop) and your body temperature at that time. These data keep accumulating in the application, so if you are unfortunately affected by COVID-19, we will have a record of your visits and be able to warn those who visited the same coffee shop. It also becomes easier to decide where to track people and where not to, and the system can highlight areas more prone to the disease, thus helping other people stay safe and lead a healthy post-pandemic life.

2 Literature Survey

Due to the COVID-19 pandemic, all countries are looking toward mitigation plans to control the spread with the help of modeling techniques [1]. These measures include wearing N95 masks with valves, sanitizing properly and maintaining distance between individuals. COVID-19 can spread through respiratory droplets or through close contact with infected patients. SARS-CoV-2 was isolated from fecal samples of infected patients, which supports the significance of the fecal-oral route in the transmission of SARS-CoV-2, although a WHO-China joint commission report has denied this route of transmission [2]. When a person affected by coronavirus knowingly or unknowingly makes contact with an unaffected person of any age group, there is every chance for the virus to proliferate. In comparison with traditional physical or hard sensors, mobile crowd sensing (MCS) is inexpensive, since there is no need for network deployment, and its spatio-temporal coverage is outstanding. Two different approaches to MCS have been distinguished: (i) mobile sensing, which leverages raw data generated by the hardware sensors embedded in mobile devices (e.g., accelerometer, GPS, network connectivity, camera or microphone, among others); and (ii) social sensing (or social networking), which leverages user-contributed data from online social networks (OSN). The latter considers participants as 'social sensors', i.e., agents that provide information about their environment through social media services after interacting with other agents [3]. Information about the current pandemic situation should be updated properly through social media, because the majority of the population accesses social media sites.
Social lockdown and distancing measures are the only tools available to fight the COVID-19 outbreak [4]. Prevention is always better than cure: instead of suffering from the virus, it is better to stay away from it, which can be done by distancing ourselves from others. For now, social distancing is the only way to control the spread of the virus between individuals until a vaccine arrives. Early identification is needed of non-compliance with the measures decreed in law RD 463/2020 and its subsequent extensions, such as (i) limitation of the freedom of movement of persons, (ii) opening to the public of unauthorized premises, establishments, areas of worship, etc., and (iii) agglomerations, among others. Social networks are an increasingly common way of reporting such events, and their identification can be used by the authorities for resource planning [5]. People who do not follow the rules and regulations issued by the government to bring the spread under control should be penalized, by way of a warning, so that people understand the seriousness behind every rule.
For instance, people should be made aware of the seriousness of this virus attack, educated about the death and recovery rates, and prepared in case they themselves are exposed to coronavirus. One such tool's mission is to help citizens self-assess the probability of suffering from the infectious disease COVID-19, to reduce the volume of calls to the health emergency number, to inform the population, and to allow an initial triage of possible cases with subsequent follow-up by the health authorities [6]. If a person is not feeling well and has symptoms such as fever, dry cough, tiredness or difficulty in breathing, it is always advisable to consult the nearby clinic and take the necessary steps to control the spread of coronavirus. Smartphone-based contact-tracing applications have been shown to be a promising technology for ending or reducing lockdown and quarantine measures. The technology of these mobile apps is based on the results of several years of research on mobile computing, and particularly on opportunistic networking (OppNet) and MCS [7]. We are placed under certain restrictions, such as not being allowed to roam around and having to self-quarantine, yet we cannot remain ignorant of the outside situation and the state of the virus spread. In those situations, smartphone technology comes in handy: for instance, several applications give a clear, instant picture of the nation's pandemic scenario and exhibit the data generation rate, which can be calculated in the time or frequency domain.
Data such as death and recovery rates provided by health care should be updated properly to keep people aware. Developing a novel vaccine is crucial to stemming the rapidly growing global burden of the COVID-19 pandemic. Big data can yield insights for vaccine and drug discovery against COVID-19, and a few attempts have already been made to develop a suitable vaccine using big data within this short period of time [8]. We do not yet have a suitable vaccine for this virus, so big data plays a vital role in collecting information and controlling the spread as much as possible until a vaccine arrives. In conventional medicine, alternatively called allopathic medicine, biomedicine, mainstream medicine, orthodox medicine or Western medicine, medical doctors and other professional healthcare providers such as nurses, therapists and pharmacists use drugs, surgery or radiation to treat illnesses and eliminate symptoms [9]. Presently, there is no licensed medication for COVID-19, so people turn to home remedies such as drinking hot water, eating nutritious food to improve immunity and pursuing a healthy lifestyle.

3 Proposed System

The system provides two user logins: one for the common user and the other for the business/commercial center; both are connected to the backend database, and each side has an established connection between the database and the application. The user-side application constantly tracks the location of the user via GPS (Global Positioning System) and updates the database, making continuous data tracking possible. Communication between the infrared (IR) thermometer and the commercial/business-side application is made possible by interfacing the IR thermometer with the application via Bluetooth. When a user first arrives at any business/commercial center, the user's unique ID is scanned and detected by the commercial-side application, which can access the user's data from the database through a QR code. The user's body temperature is then detected with the IR thermometer, sent to the business-side application via Bluetooth, and automatically communicated to the database, which is updated for both the common user and the business/commercial user. Thus, user activities and movements in such a critical situation can be easily tracked and detected.
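As an illustration of the backend record keeping described above, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names and sample values are assumptions made for this example; the paper does not specify an actual schema, and a real deployment would use a shared server-side database rather than a local file.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("tracking.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS visits (
        user_id     TEXT NOT NULL,   -- unique ID encoded in the user's QR code
        center_id   TEXT NOT NULL,   -- business/commercial center that scanned it
        temperature REAL,            -- body temperature from the IR thermometer
        visited_at  TEXT NOT NULL    -- ISO-8601 timestamp of the scan
    )
""")

def record_visit(user_id: str, center_id: str, temperature: float) -> None:
    """Store one scan event; called by the commercial-side application."""
    conn.execute(
        "INSERT INTO visits (user_id, center_id, temperature, visited_at) "
        "VALUES (?, ?, ?, ?)",
        (user_id, center_id, temperature,
         datetime.now(timezone.utc).isoformat(timespec="seconds")),
    )
    conn.commit()

# Illustrative call with invented identifiers.
record_visit("user-1042", "coffee-shop-17", 36.8)
```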
People have needs and duties to fulfill, which creates a necessity for them to move around in society. Because of these indispensable needs, the government issues social unlocks in the country, which makes social distancing and tracking difficult; yet tracking is the key to blocking the spread of the disease and the only way to strategize the testing of susceptible cases. The proposed system focuses on tracking the movement of the user through a GPS-enabled, mobile-based application using unique IDs and QR codes on one end, and on maintaining user records at business centers (restaurants, malls, supermarkets, garment shops, salons, etc.) on the other, so that the user and the backend database hold the complete record of the user's activity, while the business centers hold the full data and a physiological parameter (temperature) of their customers, which can be further used for tracking.
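The user-side location updates could, for example, be posted periodically to such a backend. The sketch below assumes a hypothetical HTTP endpoint and a five-minute interval, and uses fixed coordinates where a real app would read the device's GPS fix; none of these details come from the paper.

```python
import time
import requests  # third-party HTTP client, assumed available

BACKEND_URL = "https://example.org/api/location"  # hypothetical endpoint

def send_location_ping(user_id: str, lat: float, lon: float) -> None:
    # One location update; the backend appends it to the user's movement history.
    requests.post(BACKEND_URL, json={
        "user_id": user_id,
        "lat": lat,
        "lon": lon,
        "timestamp": time.time(),
    }, timeout=10)

# Illustrative loop: a real mobile app would run continuously and obtain the
# fix from the device's location service instead of these fixed coordinates.
for _ in range(3):
    send_location_ping("user-1042", 11.0168, 76.9558)  # Coimbatore, for example
    time.sleep(300)  # assumed five-minute reporting interval
```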
The system works by tracking users' activity through GPS (Global Positioning System); each user is provided with a unique ID (a QR code in this case). The business centers are given separate login provisions to store customer data, including physiological data. When a user arrives at a spot such as a business center, the QR code is scanned by the business center using the application for correct identification, and the user's body temperature is checked with an infrared (IR) thermometer that is interfaced with the mobile app over Bluetooth, so the data are automatically sent to the database through the business center's application. The dataset is safely stored in the database and can be used for further tracking, which can play a vital role during the unlocking process. The system has two separate provisions: one for common users and another for business and commercial centers. The provision for common users features a QR code (which can be generated using the RSA algorithm), tracking via GPS and data storage (Fig. 1).

Fig. 1 Working of intelligent AI tracking application
The provision provided for the commercial centers features Bluetooth interfacing between the IR thermometer and the application, QR detection and data storage. Together, these features make tracking users at social centers easy and efficient.
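A hedged sketch of the thermometer interfacing follows, using the PyBluez library and assuming the IR thermometer exposes a serial (RFCOMM/SPP) Bluetooth profile that streams readings as plain text. Real devices use vendor-specific protocols, and the device address and channel here are hypothetical.

```python
import bluetooth  # PyBluez; assumes a serial (SPP/RFCOMM) profile on the device

THERMOMETER_ADDR = "00:11:22:33:44:55"  # hypothetical device MAC address

def read_temperature() -> float:
    """Read one temperature value from the paired IR thermometer.

    Assumes the device streams readings as newline-terminated text such as
    "36.8"; actual thermometers use vendor-specific formats.
    """
    sock = bluetooth.BluetoothSocket(bluetooth.RFCOMM)
    sock.connect((THERMOMETER_ADDR, 1))  # RFCOMM channel 1 (assumed)
    try:
        raw = sock.recv(1024)
        return float(raw.decode().strip())
    finally:
        sock.close()

# The commercial-side app would pair a scanned user ID with this reading,
# e.g. record_visit(user_id, center_id, read_temperature()) from the
# earlier database sketch.
```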
RSA (Rivest-Shamir-Adleman) is an asymmetric cryptographic algorithm used to encrypt and decrypt messages. It uses two different keys, an approach called public-key cryptography: one of the keys can be given to anyone, while the other must be kept private. Its security rests on the fact that finding the factors of a large composite number is difficult when those factors are prime numbers; this is the prime factorization problem, and the key mechanism is also called a key-pair (public and private key) generator.
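To make the QR provisioning step concrete, the sketch below generates an RSA key pair, signs the user's unique ID with the private key and renders the result as a QR image, so a commercial-side scanner holding the public key can verify that the code was issued by the system. The paper states only that the QR code is created using the RSA algorithm; the signing choice, payload layout, user ID and the third-party cryptography and qrcode packages are assumptions.

```python
import base64
import qrcode                                            # pip install qrcode[pil]
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Key pair generated once on the server; the public key is distributed to
# commercial-side apps so they can verify scanned codes.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

user_id = b"user-1042"                                   # hypothetical unique ID
signature = private_key.sign(
    user_id,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# QR payload: the ID plus its signature, so a scanner holding the public key
# can confirm the code was issued by the system and not forged.
payload = user_id + b"." + base64.b64encode(signature)
qrcode.make(payload.decode()).save("user_qr.png")
```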
The flow of the application is as follows: when a customer enters a shop, he shows the QR code generated with the RSA algorithm; the data are thereby transferred from the user side to the commercial side, including the individual's body temperature as detected by the infrared thermometer, and all of these data are stored in the database. This application will therefore be enormously helpful for tracking an individual's movements if he is unfortunately affected by the pandemic virus, and it will also play a major role in warning users who visited the same place as the patient. The application is simple, efficient and easy to deploy; no new system or hardware is required, since it is available as a mobile application and most people already have a mobile phone in their hands.
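Given visit records like those in the earlier sqlite3 sketch, warning the users who visited the same place as a patient reduces to a self-join on the visits table. The two-hour exposure window below is an assumed parameter; the paper does not define one.

```python
def find_exposed_users(conn, patient_id: str, window_hours: int = 2):
    """Return user IDs who visited any center the patient visited,
    within +/- window_hours of the patient's scan (assumed window).
    `conn` is the sqlite3 connection from the earlier sketch."""
    rows = conn.execute(
        """
        SELECT DISTINCT v2.user_id
        FROM visits AS v1
        JOIN visits AS v2
          ON v2.center_id = v1.center_id
         AND v2.user_id  != v1.user_id
         AND ABS(julianday(v2.visited_at) - julianday(v1.visited_at)) * 24 <= ?
        WHERE v1.user_id = ?
        """,
        (window_hours, patient_id),
    ).fetchall()
    return [r[0] for r in rows]

# Example: everyone who overlapped with a newly diagnosed user.
# exposed = find_exposed_users(conn, "user-1042")
```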

4 Results and Discussion

As coronavirus is a contagious disease that spreads through social interaction between humans, Bluetooth tracing is vital for containing its spread. Devices such as mobile phones and tablets present an ideal platform for introducing Bluetooth tracing software owing to their ease of use and personalized usage. Therefore, several smartphone apps have been developed by governments, international agencies and other parties to mitigate the spread of the virus. In this paper, we analyze a large set of Bluetooth tracing apps with respect to different security and privacy metrics: their permissions, their privacy properties, the security of the apps and reviews by their users. This approach has the benefit of not requiring network effects, because single individuals can track their locations without needing their contacts to have the app. Logging location history is less private than direct tracing, but that may possibly be resolved with appropriate safeguards and redactions. Further, hybrid approaches involving both GPS data and Bluetooth proximity networks may prove useful to public health officials in modeling disease spread beyond mere tracing. In Fig. 2, Images 1 and 2 (QR code) show the QR code in the user-side application being scanned by the commercial user application to access the user's data and to record the body's vital signs in the database, keeping up-to-date records of the user's health condition; this step is made automatic using the IR thermometer interfaced via Bluetooth with the application and the QR code that grants access to the user's database record. In Fig. 2, Images 1 and 2 (map) show the person being tracked at each location they travel to and every place they have visited; this helps the government track people in the post-pandemic situation, flagging suspected infectious places and infected persons so that the spread of the disease can be avoided.
Overall, the application helps the government keep track of people who have tested positive for the virus. It is also an excellent way to alert people to the number of infected cases identified in their area, or to warn them if they accidentally came into contact with a person suffering from COVID-19. The application must remain running at all times to continue tracing individuals actively. It can be configured so that the smartphone exchanges tracing keys periodically, which allows the unique IDs of the people who have come into contact with the user to be stored locally. If a user later tests positive for coronavirus, this cryptographic method also ensures the privacy and safety of the data. In addition to showing the number of users identified as positive, a map can be displayed of the nearby areas where people have tested positive for COVID-19.
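A minimal sketch of what such periodic tracing-key exchange could look like, loosely modeled on published exposure-notification designs rather than on a protocol defined in this paper: each device derives short-lived anonymous tokens from a secret daily key, broadcasts them over Bluetooth, and stores the tokens it hears. The interval and token sizes are assumptions.

```python
import hashlib
import hmac
import os
import time

# A fresh daily key never leaves the device; only short-lived tokens derived
# from it are broadcast, so passive observers cannot link tokens to a user.
daily_key = os.urandom(16)

def rolling_token(key: bytes, interval_minutes: int = 15) -> bytes:
    # The token changes every `interval_minutes`, limiting how long a device
    # can be followed by a passive listener (interval is an assumption).
    window = int(time.time() // (interval_minutes * 60))
    return hmac.new(key, window.to_bytes(8, "big"), hashlib.sha256).digest()[:16]

# Each phone broadcasts its current token and records the tokens it hears.
# If a user later tests positive, their daily keys are published and other
# devices re-derive the tokens locally to check for a match, preserving privacy.
print(rolling_token(daily_key).hex())
```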

Fig. 2 Implementation of tracker on smart devices—cluster analysis

References

1. Mahalle, P.N., Sable, N.P., Mahalle, N.P., Shinde, G.R.: Predictive analytics of COVID-19 using
information, communication and technologies (2020)
2. Ying, S., et al.: Spread and control of COVID-19 in China and their associations with population
movement, public health emergency measures, and medical resources, p. 24 (2020) [Online].
Available: https://doi.org/10.1101/2020.02.24.20027623
3. Li, R., Rivers, C., Tan, Q., et al.: The demand for inpatient and ICU beds for COVID-19 in
the US: lessons from Chinese cities. medRxiv, 1–12 (2020). https://doi.org/10.1101/2020.03.09.20033241
4. Du, J., Vong, C.-M., Chen, C.L.P.: Novel efficient RNN and LSTM like architectures: recurrent
and gated broad learning systems and their applications for text classification. IEEE Trans.
Cybern. (2020). https://doi.org/10.1109/TCYB.2020.2969705
5. World Health Organization: Critical preparedness, readiness and response actions for COVID-
19: interim guidance (2020)
6. Guo, B., Wang, Z., Yu, Z., et al.: Mobile crowd sensing and computing: the review of an emerging
human-powered sensing paradigm. ACM Comput. Surv. (CSUR) 48(1), 1–31 (2015)
7. International Labour Organization: The socioeconomic impact of COVID-19 in fragile settings:
peace and social cohesion at risk. https://www.ilo.org/global/topics/employment-promotion/rec
overy-and-reconstruction/WCMS_741158/langen/index.htm. Accessed 30 Apr 2020
8. Doran, D., Severin, K., Gokhale, S., et al.: Social media enabled human sensing for smart cities.
AI Commun. (2015)
9. Adolph, C., Amano, K., Bang Jensen, B., et al.: Pandemic politics: timing state-level social
distancing responses to COVID-19. medRxiv (2020)
Investigation on the Influence of English
Expertise on Non-native
English-Speaking Students’ Scholastic
Performance Using Data Mining

Subhashini Sailesh Bhaskaran and Mansoor Al Aali

Abstract This investigation reports on the connection
between English proficiency and the scholastic accomplishment of science students in Bahrain.
Data from the student information system were investigated by applying data mining
techniques, mainly the decision tree algorithm. The results demonstrated a signifi-
cant effect of English expertise on students’ final cumulative grade point average
(CGPA). These discoveries demonstrate that the English expertise of graduate
students in a non-western polyglot scholastic background is significant for their
scholarly accomplishment. Results from this investigation affirm the need for
colleges in polyglot backgrounds to invest in non-native English-speaking
(L2) graduate students' English expertise at the beginning of their scholastic
programs. Instructional proposals are made, alongside recommendations for further
investigation.

1 Related Literature

1.1 Language Expertise and Scholastic Accomplishment

The acknowledgment that language helps in science and mathematics learning has
prompted an increased interest in content-area literacy instruction [1–4]. Some
research has established a solid link between literacy and scholastic performance in
science and arithmetic education (CCAAL 2010, p. 22). Disciplinary literacy, or
content-area-specific expertise, comprises the literacy skills and knowledge that help
graduate students' comprehension of ideas related to a specific field of study,
for example, science and arithmetic. Special concern for disciplinary literacy
in science and arithmetic instruction is significant for various reasons. Science texts
are frequently challenging to graduate students and require additional effort to

S. S. Bhaskaran (B) · M. Al Aali


Ahlia University, Manama, Bahrain
e-mail: [email protected]


process. They are expository in nature and normally present dense and abstract ideas,
using new terminology and language that graduate students are not likely to encounter
in their day-to-day language use [5]. The expository and technical nature of science
texts places high demands on graduate students' language skills. The latter
include knowing specialized vocabulary; interpreting scientific symbols and graphs;
recognizing and understanding organizational patterns common to science texts;
deriving main ideas; using inductive and deductive reasoning abilities; and
recognizing cause-and-effect relationships [6]. Significant mastery of language
skills and reading expertise is therefore imperative for graduate
students who are studying science and arithmetic, whether at
primary, secondary or tertiary level. This is even more so for graduate
students in polyglot or bilingual backgrounds, who are not educated in their primary
language but rather in a second language (L2). Fang (2006) contends that the
particular linguistic features that make science texts dense and abstract can cause
reading comprehension problems, particularly for English L2 students. Because of
globally increasing migration and movement of people, bilingual and polyglot
backgrounds are growing. It is estimated that half of the world's population uses
more than one language or dialect in their everyday life [7]. This expansion of
polyglot settings justifies more attention to the literacy
practices of L2 students in science and arithmetic training (cf. [8], Rhodes and Feder
2014). Most research into literacy practices in science and mathematics education
comes from the richer western world. By contrast, L2 research into rich
polyglot backgrounds is scarce [9, 10]. The current paper is motivated by the scarcity
of research in the rest of the Middle East regarding the role of language in science.
It is an endeavor to fill this research gap, and specifically to better compre-
hend the role of graduate students' English expertise in scholarly accomplishment
in a non-western, polyglot educational background. It reports on a conceptual model
that gives insight into the relationship between English expertise and scholastic
performance of science college graduate students in Bahrain, and an empirical
test of that model.
This study analyzes the following questions:
(a) Is there a direct relationship between non-native English-speaking students’
scholastic English expertise and their scholastic accomplishment in science
and mathematics instruction in Bahrain?
(b) With respect to the level of English language expertise, is there a noteworthy
difference in the scholastic accomplishment of students in Bahrain?

2 Data Mining Process

The data mining steps of data preparation, data selection, transformation, modeling
and evaluation are discussed below; they were used to unearth hidden information in
the dataset relating students' English scores to their CGPA.

3 Data Preparation

A student dataset from a higher education institution offering different undergraduate
degrees was extracted from the student information system. The dataset pertains to
graduated students belonging to 12 programs between 2003 and 2014. The size of
the dataset is 646 records.

4 Data Selection

Data selection, the second step, encompasses gathering the necessary data, supported
by data transformation, which converts the data into the layout required for modeling
(Fayyad 1996). Data was obtained from the database using SQL queries related to
English GPA and overall CGPA. Following the extraction, several tables were combined
into a single table. Cleaning was done by tackling missing values, and variables were
coded properly to allow the use of classification algorithms. A few variables were
obtained directly from the database; certain features were computed or inferred from
other items present in various tables. Variables were then extracted using feature
selection, and these features were utilized in the following stage.
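As a concrete illustration of this stage, the sketch below combines hypothetical grade records into a single table and derives the two study variables, English GPA and overall CGPA; the table and column names are assumptions, not the institution's actual schema.

import pandas as pd

# Hypothetical extracted tables (the real extraction used SQL queries
# against the student information system).
students = pd.DataFrame({"student_id": [1, 2, 3],
                         "program": ["CS", "IT", "MIS"]})
grades = pd.DataFrame({
    "student_id": [1, 1, 2, 2, 3, 3],
    "course": ["ENGL101", "MATH101", "ENGL101", "PHYS101", "ENGL102", "MATH201"],
    "grade_pts": [3.5, 3.0, 2.0, 2.5, 4.0, 3.8]})

# Combine the tables into a single one, then derive the two study variables.
merged = grades.merge(students, on="student_id")
english = merged[merged["course"].str.startswith("ENGL")]

features = pd.DataFrame({
    "english_gpa": english.groupby("student_id")["grade_pts"].mean(),
    "cgpa": merged.groupby("student_id")["grade_pts"].mean()})

# Simple cleaning: impute any missing values with the column mean.
features = features.fillna(features.mean())
print(features)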

5 Modeling

Modeling algorithms are used for the discovery of knowledge. Classification is used
for categorizing data into different groups according to some conditions. A decision
tree uses a tree of choices and their probable outcomes, involving chance event
outcomes, resource costs and utility. In order to find the relationship between the
English results and the final GPA of science students, a decision tree was applied.

The main aim of such a tree is to find the performance achieved by a student
in terms of CGPA based on the performance in English courses. This tree helps in
finding whether English grades affect the overall GPA of the students. Such
knowledge could help students and advisors to know if the English courses have an
impact on the overall GPA and, if so, to coach the students on English courses. It
can be seen from the decision tree output that when the GPA of the English courses
is lower, the final GPA over all courses is also lower; likewise, when the English
GPA is higher, so is the final GPA. When the English GPA is between 0 and 2.5,
the final GPA is between 2 and 2.5. When the English GPA is more than 2.5, the
final GPA falls between 3.5 and 4, showing a direct relationship between the
performance in English courses and the final GPA.
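A minimal sketch of such a tree follows. Since the 646-record dataset is not public, it trains scikit-learn's decision tree on synthetic (English GPA, CGPA band) pairs whose labelling mirrors the pattern reported above, and prints the learned thresholds.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
english_gpa = rng.uniform(0.0, 4.0, size=300).reshape(-1, 1)

# Assumed labelling that mirrors the reported pattern: a low English GPA
# goes with the 2.0-2.5 CGPA band, a high one with the 3.5-4.0 band.
cgpa_band = np.where(english_gpa.ravel() <= 2.5, "CGPA 2.0-2.5", "CGPA 3.5-4.0")

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(english_gpa, cgpa_band)
print(export_text(tree, feature_names=["english_gpa"]))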

6 Discussion and Conclusion

This examination explored the role of English expertise in the scholastic accomplish-
ment of a sample of science and mathematics graduate students. Using deci-
sion tree analyses, the investigation showed that students' scholastic English
expertise is important for their scholarly accomplishment in a polyglot scholastic
background. Combining two internationally recognized reading comprehension tests
allowed a broad estimation of graduate students' reading abilities. A strong
relationship was found between graduate students' scholastic accomplishment and their
English results.
The added value of the research is threefold. Our research broadens the focus
of L2 reading research in science and arithmetic instruction in a Bahraini polyglot
background. As argued in the paper, basic research into reading in developing
nations is rare. Moreover, our investigation added considerably to our comprehension
of the connection between L2 graduate students' English expertise and their science
and arithmetic accomplishment at tertiary level. The examination substantiates past
findings on the role of language in the science and arithmetic curriculum
(presented in the introduction), in light of data from a Middle Eastern back-
ground. Finally, by adopting a longitudinal strategy, it shows that even after 4 years
of study, there is an indirect mediation connection between expertise and scholastic
accomplishment. These findings have significant instructional implications. Our
data affirm the need for colleges to invest in L2 graduate
students' expertise at the beginning of their scholastic degrees. A few universities
in Bahrain have launched a mandatory communicative English orientation program
in which students are trained in language and interaction skills.
This consideration ought to go beyond building vocabulary and mastery of word-
reading abilities, since these are no assurance that graduate students can under-
stand science texts (CCAAL 2010). Explicit attention to processing
science content is therefore required. A second instructional implication
has to do with the essence of developing reading skills: the more individ-
uals read, the better they become at it [11–13]. Notwithstanding their CS courses,
graduate students ought to take part in reading widely as a feature of their scholarly
courses. Despite the fact that college graduate students in Bahrain claim to value
their reading material, they prefer to take in course content from other resources, for
example, lectures and lecture notes [14, 15]. This behavior, combined with restricted
access to reading resources, confines graduate students' reading improvement. It is
therefore fundamental that college courses incorporate reading assignments and give

adequate access to reading materials. The finding that 10.3% of the graduate students
reported using English in their home environment is in accordance with Bahrain being
viewed as an ESL nation [16, 17]. The highly heterogeneous population, in terms
of age (from 17 to 35 years old) and use of home language (representing
20 distinct languages), demonstrates the need for institutions to cater to
large disparities in age and language skills inside their classrooms.

References

1. Carnegie Council on Advancing Adolescent Literacy. Time to act: An agenda for advancing
adolescent literacy for college and career success. New York, NY, Carnegie Corporation of
New York (2010)
2. Fang, Z., Lamme, L., Pringle, R., Patrick, J., Sanders, J., Zmach, C., Henkel, M.: Integrating
reading into middle school science: What we did, found and learned. Int. J. Sci. Edu. 30(15),
2067–2089 (2008). https://doi.org/10.1080/09500690701644266
3. Hand, B.M., Alvermann, D.E., Gee, J., Guzzetti, B.J., Norris, S.P., Phillips, L.M., . . . Yore,
L.D.: Guest editorial. Message from the Bisland group: what is literacy in science literacy? J.
Res. Sci. Teach. 40(7), 607–615 (2003)
4. Norris, S.P., Phillips, L.M.: How literacy in its fundamental sense is central to scientific literacy.
Sci. Edu. 87, 224–240 (2003)
5. Palinscar, A.: The next generation science standards and the common core state standards:
Proposing a happy marriage. Sci. Children 51(1), 10–15 (2013)
6. Barton, M.L., Jordan, D.L.: Teaching reading in science. A supplement to teaching reading in
the content areas: If not me, then who? McREL, Aurora, CO (2001)
7. Ansaldo, A.I., Marcotte, K., Scherer, L., Raboyeau, G.: Language therapy and bilingual
aphasia: clinical implications of psycholinguistic and neuroimaging research. J. Neurolinguis-
tics 21, 539–557 (2008)
8. Barwell, R., Barton, B., Setati, M.: Multilingual issues in mathematics education: introduction.
Educ. Stud. Math. 64(2), 113–119 (2007)
9. Paran, A., Williams, E.: Editorial: reading and literacy in developing countries. J. Res. Read.
30(1), 1–6 (2007)
10. Pretorius, E.J., Mampuru, D.M.: Playing football without a ball: language, reading and
academic performance in a high-poverty school. J. Res. Read. 30(1), 38–58 (2007)
11. Cox, K.E., Guthrie, J.T.: Motivational and cognitive contributions to students’ amount of
reading. Contemp. Edu. Psychol. 26, 116–131 (2001)
12. Mullis, I. V. S., Martin, M. O., Kennedy, A. M., Foy, P.: PIRLS 2006 international report; IEA’s
progress in international reading literacy study in primary schools in 40 countries. Boston, MA:
International Association for the Evaluation of Educational Achievement (IEA) (2007)
13. Organisation for Economic Co-operation and Development. PISA 2009 assessment framework,
key competencies in reading, mathematics and science. Paris, France, Author (2009)
14. Owusu-Acheaw, M., Larson, A.G.: Reading habits among students and its effect on academic
performance: A study of students of Koforidua polytechnic. Lib. Phil. Practice, Paper 1130,
1–22 (2014)
15. Stoffelsma, L.: Short-term gains, long-term losses? A diary study on literacy practices in Ghana.
J. Res. Read. (2018). https://doi.org/10.1111/1467-9817.12136
16. Ahulu, S.: Hybridized English in Ghana. Engl. Today: Int. Rev. Engl. Lang. 11(4), 31–36
(1995). https://doi.org/10.1017/S0266078400008609
17. Kachru, B.B.: Standards, codification and sociolinguistic realism: The English language in the
outer circle. In: Quirk R., Widdowson H.G. (eds.) English in the world: teaching and learning the
language and literatures pp. 11–30. Cambridge, England, Cambridge University Press (1985)
Machine Learning Algorithms
for Modelling Agro-climatic Indices:
A Review

G. Edwin Prem Kumar and M. Lydia

Abstract Modelling lays a solid platform to assess the effects of climate variability
on agricultural crop yield and management. It also aids in measuring the effectiveness
of control measures planned and to design optimal strategies to enhance agricultural
productivity and crop intensity. Models that aid in predicting drought, soil quality,
crop yield, etc. in the light of climate variabilities can go a long way in enhancing
global food security. Efficient modelling of agro-climatic indices will simplify the
upscaling of experimental observations and aid in the implementation of climate-
smart agriculture. This paper aims to present a comprehensive review of the use
of machine learning algorithms for modelling agro-climatic indices. Such models
find effective application in crop yield forecasting, crop monitoring and manage-
ment, soil quality prediction, modelling of evapotranspiration, rainfall, drought, and
pest outbreaks. The research challenges and future research directions in this area
have also been outlined.

1 Introduction

The impact of weather and climate on agricultural yield has been significant. It has
been proven that variability in climate-related parameters can have worse effects on
food security on a global scale. The El Niño-Southern Oscillation (ENSO), the lead
driver of inter-annual climate mode changes, results in droughts and a
significant reduction in crop yield. Variability in climatic conditions has significantly
affected cropping area and intensity as well [1]. Hence, modelling of agro-climatic
indices will be of great advantage to farmers to plan well in advance. The challenges
involved in developing models for agriculture incorporating climate change involve
the development of biophysical and economic models, discrete-event models, and
models for dynamic change, interactions, management, and uncertainties [2].
Agro-climatic indices are defined based on the relationships between crop
yield/management etc. to variation in climate. They are used to measure the optimal

G. Edwin Prem Kumar (B) · M. Lydia


Sri Krishna College of Engineering and Technology, Coimbatore, India


climatic conditions required for desired agricultural performance in terms of yield,


intensity, etc. The agro-climatic indices include parameters related to growing season,
frost conditions, and multicriteria climate classification (MCC) system [3]. Parame-
ters related to the growing season for viticulture include growing season temperature,
precipitation, length, the number of dry and wet days. Frost conditions are depicted
by mean data of first frost fall and last spring frost, number of days with frost and
days with minimum temperature less than −15 °C. Furthermore, the MCC system
includes parameters like heliothermal index, dryness index and cool night index.
Crop simulation models (CSMs) are process-based models that describe crop
growth and development as a function of weather and soil conditions [4]. CSMs
played a critical role in bioeconomic modelling, integrating the land use and agri-
cultural productivity, both at regional and global scales. However, statistical models
are known to scrutinize the relationship between agricultural crop yield and climate
variables much better than CSMs as they capture the impact of many variables both
directly and indirectly. These models outperformed CSMs by discovering relation-
ships and modelling mechanisms of stressed biotic environments better. Though
statistical models are severely limited by the availability of sufficient, quality data
pertaining to weather, yield, agricultural management, etc., it has been found that
hybrid statistical and CSMs can enhance the performance of statistical models [4].
In recent years, machine learning (ML) has emerged in a big way as a technique
to revolutionize several areas of research including agriculture (Fig. 1). ML tech-
niques along with big data, the Internet of Things (IoT), and several other advanced
hybrid algorithms can handle complicated non-linear models and produce accurate

results.

Fig. 1 Machine learning applications in agriculture: crop yield forecasting; crop monitoring and management; pest outbreak modelling; evaluating soil quality; modelling rainfall and drought; evapotranspiration modelling

ML models like regression, clustering, artificial neural networks (ANN),


Bayesian models, support vector machines (SVM), ensemble learning, etc. find appli-
cation in several agricultural processes like crop yield prediction, rainfall and drought
prediction, crop management, pest outbreaks, etc. [5, 6].
This paper aims to provide an exhaustive review of the ML techniques used to
model various agro-climatic indices, which are in turn useful in the estimation of key
parameters in crop management. The ML techniques-based agro-climatic models
used for several agricultural applications have been described in Sects. 2 and 3. The
further research directions and challenges in this area have been suggested in Sect. 4
and the conclusions drawn from this survey are presented in Sect. 5.

2 Machine Learning for Crop Yield Forecasting

Machine learning-based agro-climatic models have been used for forecasting crop
yield, monitoring, and management of crop. This section outlines the recent research
work carried out in these areas. Annual crop yield is significantly determined by
the climatic conditions. Regression-based models can be used to assess crop yield
based on agro-climatic indices like Standardized Precipitation-Evapotranspiration
Index (SPEI). Mathieu and Aires developed models to make annual estimation and
seasonal prediction of crop yield using more than fifty agro-climatic indices [7].
Elavarasan and Vincent proposed a deep reinforcement learning model using agro-
climatic indices for crop yield prediction and compared it with other ML models like
ANN, Long Short-Term Memory (LSTM), gradient boosting, random forest (RF),
etc. The accuracy of the model was evaluated using suitable evaluation metrics [8].
Mkhabela et al. predicted the crop yield of barley, canola, field peas and spring
wheat using regression models in different agro-climatic zones, which included sub-
humid, semi-arid and arid regions [10]. Johnson et al. used ML techniques and
vegetation indices for crop yield forecasting. The vegetation indices used as predic-
tors for crop yield include, the Normalized Difference Vegetation Index (NDVI) and
Enhanced Vegetation Index (EVI) derived from the Moderate-resolution Imaging
Spectro-radiometer (MODIS), and NDVI derived from the Advanced Very High-
Resolution Radiometer (AVHRR) [11]. ANN is the most common ML algorithm used
and convolutional NN (CNN) is the most widely used DL algorithm for predicting
crop yield. The most common features used for prediction of crop yield include
temperature, rainfall and soil type [17]. Table 1 outlines the recent research carried
out in crop yield forecasting.

Table 1 Models for crop yield forecasting

Authors | Agro-climatic indices | ML technique
Mathieu and Aires [7] | SPEI and 49 other parameters | Regression model
Elavarasan and Vincent [8] | 38 parameters | Deep Recurrent Q-Network (DRQN)
Mishra et al. [9] | Rainfall, temperature, humidity | Linear regression (LR), Ridge regression, Lasso regression, Support Vector Regression (SVR) with linear, polynomial and Radial Basis Function (RBF) kernels
Johnson et al. [11] | Vegetation indices (MODIS-NDVI, MODIS-EVI, AVHRR-NDVI) | Multiple linear regression (MLR), Bayesian neural networks, Model-based recursive partitioning
Chen et al. [12] | Photosynthetically Active Radiation (PAR), Radiation Use Efficiency (RUE), vegetation and meteorological indices | MLR
Folberth et al. [13] | Annual climate indices, growing season climate indices, monthly climate indices | Extreme gradient boosting, RF
Mathieu and Aires [14] | SPEI, average temperature | NN Classifier
Mupangwa et al. [15] | Big data from simulated cropping systems | Logistic regression, Linear discriminant analysis, K-nearest neighbour (K-NN), Classification and Regression trees, Gaussian naïve Bayes, SVM
Bai et al. [16] | Landsat 8 vegetation index and phenological length | SVM, NN, Mahalanobis Distance, Maximum Likelihood
Feng et al. [18] | Daily climate data | Hybrid modelling approach: Agricultural Production Systems sIMulator (APSIM)-RF, APSIM-MLR
Feng et al. [19] | In-situ climate data, remote sensing data, wheat trial data, soil hydraulic properties | APSIM-RF, APSIM-MLR
Kamir et al. [20] | NDVI time series, climate time series, yield maps, crop yield statistics | Regression, SVR with RBF
Cai et al. [21] | Satellite data, climate data | Least Absolute Shrinkage and Selection Operator (LASSO), NN, RF, SVM
Zarei et al. [22] | United Nations Environment Programme (UNEP) aridity index, Modified De-Martonne index | Simple and multiple Generalized Estimation Equation
Xu et al. [23] | Meteorological data | RF, SVM
Gumuscu et al. [24] | Air temperature, daily minimum air temperature, daily precipitation | k-NN, SVM, decision trees
Huy et al. [25] | 12 climate indices | D-vine quantile regression model
Wang et al. [26] | Southern Oscillation Index (SOI), SOI phase, NINO3.4, Multivariate ENSO Index (MEI) | RF
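To make the pipeline shared by many of the regression-based studies in Table 1 concrete, the following is a minimal sketch that fits a random forest to synthetic agro-climatic features (rainfall, temperature and an SPEI-like drought index); the data and the assumed yield response are illustrative, not drawn from any cited study.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
rainfall = rng.uniform(200, 900, n)   # growing-season rainfall, mm (synthetic)
temp = rng.uniform(12, 30, n)         # mean growing-season temperature, deg C
spei = rng.normal(0, 1, n)            # SPEI-like drought index

# Assumed yield response: rain and mild drought conditions help, heat hurts.
yield_t_ha = (2.0 + 0.004 * rainfall - 0.05 * (temp - 20) ** 2
              + 0.3 * spei + rng.normal(0, 0.3, n))

X = np.column_stack([rainfall, temp, spei])
X_tr, X_te, y_tr, y_te = train_test_split(X, yield_t_ha, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", round(r2_score(y_te, model.predict(X_te)), 3))
print("feature importances:", model.feature_importances_.round(3))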

3 Machine Learning for Crop Monitoring and Management

Machine learning-based agro-climatic models prove to be of great advantage in


crop monitoring and management (Table 2). Agro-climatic indices along with high-
resolution remote sensing data can be an effective tool for crop monitoring. Balles-
teros et al. employed two agro-climatic indices, namely reference evapotranspiration
(ETo) and growing degree days (GDD), along with a few other vegetation indices, for
crop monitoring [27]. ML algorithms and vegetation indices were used to characterize and map
cropping patterns [28]. The robustness of Landsat-based fraction of absorbed photo-
synthetically active radiation (fAPAR) models was assessed using ML algorithms
[29].
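Of the two agro-climatic indices employed in [27], growing degree days (GDD) has a particularly simple definition: daily mean temperature in excess of a crop-specific base, accumulated over the season. A minimal sketch, assuming a base temperature of 10 °C and made-up daily extremes, is given below.

def daily_gdd(t_min, t_max, t_base=10.0):
    # GDD for one day: mean temperature above the base, floored at zero.
    return max(0.0, (t_min + t_max) / 2.0 - t_base)

t_min_series = [8, 10, 12, 14, 13, 11]    # daily minima, deg C (made up)
t_max_series = [18, 22, 25, 28, 26, 21]   # daily maxima, deg C (made up)

season_gdd = sum(daily_gdd(lo, hi) for lo, hi in zip(t_min_series, t_max_series))
print("accumulated GDD over %d days: %.1f" % (len(t_min_series), season_gdd))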

Table 2 Models for crop monitoring and management

Authors | Objective | ML technique
Ballesteros et al. [27] | Crop monitoring | MLR
Feyisa et al. [28] | Map cropping patterns | SVM, RF, C5.0
Muller et al. [29] | Assess the fidelity of models | MLR, Decision tree, RF
Vindya and Vedamurthy [30] | Crop selection | Naïve Bayes Classification
Kale and Patil [31] | Decision support system | Data mining, Fuzzy logic
Shi et al. [32] | Ascertain change in NDVI | RF Regression with residual analysis
Lee et al. [33] | Projection of life-cycle environmental impacts | Boosted regression tree
Macedo et al. [34] | Estimation of crop area | Convolutional LSTM models
Young et al. [35] | Seasonal forecasting of daily mean air temperatures | Regularized Extreme Learning Machine
Sharma et al. [36] | Sustainable agriculture supply chain performance | ANN, Bayesian network, clustering, DL, Ensemble learning, regression, SVM
Jakariya et al. [37] | Assessment of vulnerability index | LR, Bayesian ridge regression, XGB regression, RF regression, Extremely randomized trees regression

4 Research Directions and Challenges

Accuracy in agricultural modelling is a very critical requirement. Effective models
need a large amount of quality data, which is indeed a big challenge in this research.
The performance of predictive modelling improves significantly if there is a
substantial increase in the sample size in both spatial and temporal distributions [21].
Climate variability directly and indirectly affects several agricultural processes and
the yield, making the modelling process a challenging one. Since cropping area and
intensity keep changing, the use of static or historic data is likely to introduce
errors; annual mappings of crop area obtained from satellites will aid in improved
accuracy. The use of passive or active remote sensing data improves the accuracy of
models as it incorporates parameters like canopy biomass and water content. ML
algorithms that model the non-linearity better and capture well the relationship
between the input and output variables are the need of the hour. The algorithms need
to respond to variabilities at all levels and should ensure quicker convergence with
higher computational speed.

5 Conclusion

Climate is both a resource and a restraint for agriculture. Early and consistent fore-
casting models play a critical role in farmers' decision-making pertaining to crop
selection, yield, pest occurrence, irrigation needs, etc. Agro-climatic indices are
indicators of climate characteristics that have definite agricultural significance. The
spatial characteristics and temporal distribution of agro-climatic indices can be inves-
tigated and modelled to understand the growing-season parameters of different
crops like wheat, maize, etc. This paper presented an exhaustive review of ML
algorithms for crop yield forecasting and crop monitoring and management based on
agro-climatic indices. Research challenges and directions have also been presented.

References

1. Iizumi, T., Ramankutty, N.: How do weather and climate influence cropping area and
intensity? Glob. Food Sec. 4, 46–50 (2015)
2. Kipling, R.P., Topp, C.F.E., Bannink, A., Bartley, D.J., Penedo, I.B., Cortignani, R., del Prado,
A., Dono, G., Faverdin, P., Graux, A.I., Hutchings, N.J., Lauwers, L., Gulzari, S.O., Reidsma, P.,
Rolinski, S., Ramos, M. R., Sandars, D.L., Sandor, R., Schonhart, M., Seddaiu, G., Middelkoop,
J.V., Shrestha, S., Weindl, I., Eory, V.: To what extent is climate change adaptation a novel
challenge for agricultural modellers? Environ. Model. Softw. 120, 104492 (2019)
3. Ruml, M., Vukovic, A., Vujadinovic, M., Djurdjevic, V., Vasic, Z.R., Atanackovic, Z., Sivcev,
B., Markovic, N., Matijasevic, S., Petrovic, N.: On the use of regional climate models: impli-
cations of climate change for viticulture in Serbia. Agric. For. Meteorol. 158–159, 53–62
(2012)
4. Rotter, R.P., Hoffman, M.P., Koch, M., Muller, C.: Progress in modelling agricultural impacts
of and adaptations to climate change. Curr. Opin. Plant Biol. 45(B), 255–261 (2018)
5. Liakos, K.G., Busato, P., Moshou, D., Pearson, S., Bochtis, D.: Machine learning in
agriculture: a review. Sensors 18, 2674 (2018)
6. Priya, R., Ramesh, D.: ML based sustainable precision agriculture: a future generation
perspective. Sustain. Comput. Inf. Syst. 28, 100439 (2020)
7. Mathieu, J.A., Aires, F.: Assessment of the agro-climatic indices to improve crop yield
forecasting. Agric. For. Meteorol. 253–254, 15–30 (2018)
8. Elavarasan, D., Vincent, D.: Crop yield prediction using deep reinforcement learning model
for sustainable agrarian applications. IEEE Access 8, 86886–86901 (2020)
9. Mishra, S., Mishra, D., Santra, G.H.: Adaptive boosting of weak regressors for forecasting of
crop production considering climatic variability: an empirical assessment. J. King Saud Univ.
Comput. Inf. Sci. (2017)
10. Mkhabela, M.S., Bullock, P., Raj, S., Wang, S., Yang, Y.: Crop yield forecasting on the Canadian
Prairies using MODIS NDVI data. Agric. For. Meteorol. 151, 385–393 (2011)
11. Johnson, M.D., Hsieh, W.W., Cannon, A.J., Davidson, A., Bedard, F.: Crop yield forecasting
on the Canadian Prairies by remotely sensed vegetation indices and machine learning methods.
Agric. For. Meteorol. 218–219, 74–84 (2016)
12. Chen, Y., Donohue, R.J., McVicar, T.R., Waldner, F., Mata, G., Ota, N., Houshmandfar,
A., Dayal, K., Lawes, R.A.: Nationwide crop yield estimation based on photosynthesis and
meteorological stress indices. Agric. For. Meteorol. 284, 107872 (2020)
13. Folberth, C., Baklanov, A., Balkovic, J., Skalsky, R., Khabarov, N., Obersteiner, M.: Spatio-
temporal downscaling of gridded crop model yield estimates based on machine learning. Agric.
For. Meteorol. 264, 1–15 (2019)
14. Mathieu, J.A., Aires, F.: Using neural network classifier approach for statistically forecasting
extreme corn yield losses in Eastern United States. Earth Space Sci. 5, 622–639 (2018)
15. Mupangwa, W., Chipindu, L., Nyagumbo, I., Mkuhlani, S., Sisito, G.: Evaluating machine
learning algorithms for predicting maize yield under conservation agriculture in Eastern and
Southern Africa. SN Appl. Sci. 2, 952 (2020)
16. Bai, T., Zhang, N., Mercatoris, B., Chen, Y.: Jujube yield prediction method combining Landsat
8 Vegetation Index and the phenological length. Comput. Electron. Agric. 162, 1011–1027
(2019)
17. Klompenburg, T.V., Kassahun, A., Catal, C.: Crop yield prediction using machine learning: a
systematic literature review. Comput. Electron. Agric. 177, 105709 (2020)
18. Feng, P., Wang, B., Liu, D.L., Waters, C., Yu, Q.: Incorporating machine learning with
biophysical model can improve the evaluation of climate extremes impacts on wheat yield
in south-eastern Australia. Agric. For. Meteorol. 275, 100–113 (2019)
19. Feng, P., Wang, B., Liu, D.L., Waters, C., Xiao, D., Shi, L., Yu, Q.: Dynamic wheat yield
forecasts are improved by a hybrid approach using a biophysical model and machine learning
technique. Agric. For. Meteorol. 285–286, 107922 (2020)

20. Kamir, E., Waldner, F., Hochman, Z.: Estimating wheat yields in Australia using climate
records, satellite image time series and machine learning methods. ISPRS J. Photogram. Remote
Sens. 160, 124–135 (2020)
21. Cai, Y., Guan, K., Lobell, D., Potgieter, A.B., Wang, S., Peng, J., Xu, T., Asseng, S., Zhang,
Y., You, L., Peng, B.: Integrating satellite and climate data to predict wheat yield in Australia
using machine learning approaches. Agric. For. Meteorol. 274, 144–159 (2019)
22. Zarei, A.R., Shabani, A., Mahmoudi, M.R.: Comparison of the climate indices based on the
relationship between yield loss of rain-fed winter wheat and changes of climate indices using
GEE model. Sci. Total Environ. 661, 711–722 (2019)
23. Xu, X., Gao, P., Zhu, X., Guo, W., Ding, J., Li, C., Zhu, M., Wu, X.: Design of an integrated
climatic assessment indicator (ICAI) for wheat production: a case study in Jiangsu Province,
China. Ecol. Indic. 101, 943–953 (2019)
24. Gumuscu, A., Tenekeci, M.E., Bilgili, A.V.: Estimation of wheat planting date using machine
learning algorithms based on available climate data. Sustain. Comput. Inf. Syst. 100308 (2019)
25. Huy, T.H., Deo, R.C., Mushtaq, S., An-Vo, D.A., Khan, S.: Modeling the joint influence of
multiple synoptic-scale, climate mode indices on Australian wheat yield using a vine copula-
based approach. Eur. J. Agron. 98, 65–81 (2018)
26. Wang, B., Feng, P., Waters, C., Cleverly, J., Liu, D.L., Yu, Q.: Quantifying the impacts of pre-
occurred ENSO signals on wheat yield variation using machine learning in Australia. Agric.
For. Meteorol. 291, 108043 (2020)
27. Ballesteros, R., Ortega, J.F., Hernandez, D., Campo, A.D., Moreno, M.A.: Combined use of
agro-climatic and very high-resolution remote sensing information for crop monitoring. Int. J.
Appl. Earth Obs. Geoinf. 72, 66–75 (2018)
28. Feyisa, G.L., Palao, L.K., Nelson, A., Gumma, M.K., Paliwal, A., Win, K.T., Nge, K.H.,
Johnson, D.E.: Characterizing and mapping cropping patterns in a complex agro-ecosystem:
an iterative participatory mapping procedure using machine learning algorithms and MODIS
vegetation indices. Comput. Electron. Agric. 175, 105595 (2020)
29. Muller, S.J., Sithole, P., Singels, A., Niekerk, A.V.: Assessing the fidelity of Landsat-based
fAPAR models in two diverse sugarcane growing regions. Comput. Electron. Agric. 170,
105248 (2020)
30. Vindya N.D., Vedamurthy H.K.: Machine learning algorithm in smart farming for crop iden-
tification. In: Smys, S., Tavares, J., Balas, V., Iliyasu A. (eds.) Computational vision and bio-
inspired computing, ICCVBIC 2019. Advances in Intelligent Systems and Computing, vol.
1108. Springer, Cham (2020)
31. Kale, S.S., Patil, P.S.: Data mining technology with Fuzzy Logic, neural networks and machine
learning for agriculture. In: Balas, V., Sharma, N., Chakrabarti, A. (eds.) Data management,
analytics and innovation. Advances in Intelligent Systems and Computing, vol. 839. Springer,
Singapore (2019)
32. Shi, Y., Jin, N., Ma, X., Wu, B., He, Q., Yue, C., Yu, Q.: Attribution of climate and human activ-
ities to vegetation change in China using machine learning techniques. Agric. For. Meteorol.
294, 108146 (2020)
33. Lee, E.K., Zhang, W.J., Zhang, X., Adler, P.R., Lin, S., Feingold, B.J., Khwaja, H.A., Romeiko,
X.X.: Projecting life-cycle environmental impacts of corn production in the U.S. Midwest under
future climate scenarios using a machine learning approach. Sci. Total Environ. 714, 136697
(2020)
34. Macedo, M.M.G., Mattos, A.B., Oliveira, D.A.B.: Generalization of convolutional LSTM
models for crop area estimation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 13,
1134–1142 (2020)
35. Young, S.J., Rang, K.K., Chul, H.J.: Seasonal forecasting of daily mean air temperatures using
a coupled global climate model and machine learning algorithm for field-scale agricultural
management. Agric. For. Meteorol. 281, 107858 (2020)
36. Sharma, R., Kamble, S.S., Gunasekaran, A., Kumar, V., Kumar, A.: A systematic literature
review on machine learning applications for sustainable agriculture supply chain performance.
Comput. Oper. Res. 119, 104926 (2020)

37. Jakariya, Md., Alam, Md.S., Rahman, Md.A., Ahmed, S., Elahi, M.M.L., Khan, A.M.S., Saad,
S., Tamim, H.M., Ishtiak, T., Sayem, S.M., Ali, M.S., Akter, D.: Assessing climate-induced
agricultural vulnerable coastal communities of Bangladesh using machine learning techniques.
Sci. Total Environ. 742, 140255 (2020)
Design of Metal-Insulator-Metal Based
Stepped Impedance Square Ring
Resonator Dual-Band Band Pass Filter

Surendra Kumar Bitra and M. Sridhar

Abstract In this paper, a metal–insulator metal (MIM) based plasmonic stepped


impedance square ring resonator (SI-SRR) band-pass filter (BPF) is designed and
analyzed for dual-band applications. The MIM-based SI-SRR is investigated using
the commercially available CST Studio Suite. The proposed SI-SRR is compact, has low
power requirements, and is suitable for Photonic Integrated Circuits (PICs). The SI-SRR
operates at the wavelengths of 1317 nm (227.6 THz) and 1640 nm (182.8 THz) with
appropriate reflection and transmission parameters. The stepped impedance stubs
are used in the ring resonator for tunable operating bands. The proposed SI-SRR has
wide applications in PICs.

1 Introduction

The waves that are produced at the metal–insulator interface when light interacts
with it are Surface Plasmon Polaritons (SPPs) [1]. These are high-speed electromagnetic
waves traveling along the metal–insulator regions. MIM is one of the popular waveguides
used for designing most optical devices [2], and MIM-based components are
suitable for PICs. Recently, MIM-based components like stub ring resonators [3, 4],
triangular resonators [5], rectangular ring resonators [6], square ring resonators [7],
and circular ring resonators [8] have been reported. Different excitation schemes of
ring resonators are briefly analyzed in [9], and ring resonators providing dual-band
operating wavelengths are discussed in [8, 9].
In this work, we first investigate the MIM-based plasmonic SI-SRR dual-band BPF for
the O and U optical bands. The transmission performance of the filter is analyzed using
the CST Studio Suite. The SI-SRR improves the bandwidth and produces stronger field
confinement. Concurrent plasmonic ring resonators and filters operate in
two or more frequency bands simultaneously; several multi-band components are
proposed in [9].

S. K. Bitra (B) · M. Sridhar


Department of ECE, KLEF, Guntur, Andhra Pradesh, India


Table 1 Design parameters of SI-SRR

S. No. | Parameter | Value in nm
1 | L1 | 1000
2 | L2 | 900
3 | L3 | 200
4 | W3 | 50
5 | L4 | 200
6 | W4 | 60
7 | d | 50
8 | g | 10

The paper organization is as follows: Sect. 2 describes the SI-SRR dual-band BPF
design procedure and optimized parameters using MIM waveguide. The simulation
results and field distributions are included in Sect. 3. Finally, the paper ends with the
conclusion.

2 Stepped Impedance SRR Design

The proposed filter consists of a coupled line feed and a square ring resonator with a
stub loaded on one side of the ring, forming the stepped impedance square ring resonator
(SI-SRR) shown in Fig. 1. The basic MIM characteristics and SRR dual-band char-
acteristics were investigated in [10]; the present work enhances that design with a
stub to give better notching characteristics. The dimen-
sions of the proposed SI-SRR dual-band BPF are represented in Table 1. The filter is
designed and simulated in commercially available CST studio suite. The mesh sizes
were taken as 5 nm × 5 nm. The dimensions of the SRR are calculated using [11]. The
SI-SRR is suitable for optical O band (1260–1360 nm) and U band (1625–1675 nm).
The filter is easily fabricated using lithographic techniques. For design purposes,
the metal is taken as silver and the insulator as silica.
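A first-pass check of the ring dimensions can be made with the standard travelling-wave resonance condition for a ring resonator, n_eff · L = m · λ, where L is the ring perimeter, n_eff the effective index of the MIM mode and m the mode order. The sketch below inverts this for the resonant wavelengths; the effective-index value is an assumption for illustration only, since the actual design procedure [11] accounts for the dispersive MIM mode index and the stub loading.

def resonant_wavelengths(perimeter_nm, n_eff, orders=(2, 3, 4)):
    # lambda_m = n_eff * L / m, from the round-trip condition n_eff * L = m * lambda.
    return {m: n_eff * perimeter_nm / m for m in orders}

L1 = 1000.0              # ring side length from Table 1, nm
perimeter = 4 * L1       # square ring perimeter, nm
for m, lam in resonant_wavelengths(perimeter, n_eff=1.4).items():
    print("mode m=%d: lambda = %.0f nm" % (m, lam))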

3 Simulation Results

The SI-SRR is designed in the CST Studio Suite with the above-optimized dimensions.
By applying the coupled mode excitations, the reflection and transmission
coefficients are observed and represented in Fig. 2. The reflection coefficient,
highlighted in red, shows dual operating bands at 1317 and 1640 nm with a level of
approximately −15 dB in both bands. The transmission coefficient is represented in
green, with a level of approximately −3 dB. Due to the stub in the ring resonator,
improved bandwidth at the O and U bands is observed. PML

Fig. 1 Proposed SI-SRR dual-band BPF

Fig. 2 Reflection and transmission coefficient of SI-SRR dual-band BPF
boundary conditions are used to simulate the SI-SRR filter.
Figure 3 represents the field distributions of the SI-SRR filter at 1317 and 1640 nm.
The field distributions show the confinement of the applied power at the metal–insulator
region.

4 Conclusion

The MIM-based plasmonic SI-SRR is suitable for optical O (1260–1360 nm) and
U band (1625–1675 nm) dual-band applications. The reflection and transmission
coefficients of the SI-SRR with resonant behavior are numerically analyzed.

Fig. 3 Field distributions at a 1317 nm (227.6 THz) and b 1640 nm (182.8 THz)

The center operating wavelengths of the SI-SRR are 1317 and 1640 nm with a −15 dB
reflection coefficient. The SI-SRR is easily fabricated using standard semiconductor
fabrication techniques. The SI-SRR filter is best suited for photonic integrated circuit
applications.

References

1. Barnes, W.L., Dereux, A., Ebbesen, T.W.: Surface plasmon subwavelength optics. Nature 424,
824–830 (2003)
2. Ozbay, E.: Plasmonics: merging photonics and electronics at nanoscale dimensions. Science
(80) 311, 189–193 (2006)
3. Taylor, P., Li, C., Qi, D., Xin, J., Hao, F.: Metal insulator metal plasmonic waveguide for low
distortion slow light at telecom frequencies. J. Mod. Opt. 61, 37–41 (2014)

4. Zafar, R., Salim, M.: Analysis of asymmetry of Fano resonance in plasmonic metal insulator
metal waveguide. J. Photonics 23, 1–6 (2016)
5. Oh, G., Kim, D., Kim, S.H., Ki, H.C., Kim, T.U., Choi, T.: Integrated refractometric sensor
utilizing a triangular ring resonator combined with SPR. IEEE Photonics Tech. Lett. 26, 2189–
2192 (2014)
6. Yun, B., Hu, G., Cui, Y.: Theoretical analysis of a nanoscale plasmonic filter based on
rectangular metal insulator metal waveguide. J. Phys. 43, 385102 (1–8) (2010)
7. Liu, J., Fang, G., Zhao, H., Zhang, Y.: Plasmon flow control at gap waveguide junctions using
square ring resonators. J. Phys. 43, 055103 (1–6) (2009)
8. Setayesh, A., Miranaziry, S.R., Abrishamian, M.S.: Numerical investigation of tunable band
pass\band stop plasmonic filters with hollow core circular ring resonators. J. Opt. Soc. Korea
43, 82–89 (2011)
9. Vishwanath, M., Khan, H.: Excitation schemes of plasmonic angular ring resonator based band
pass filters using MIM waveguide. Photonics 6, 1–9 (2019)
10. Bitra, S.K., Sridhar, M.: Design of nanoscale square ring resonator band pass filter using metal
insulator metal. In: Chowdary, P., Charkravarthy, V. (eds.) ICMEET Conference 2020, vol.
655. Springer, Singapore (2020)
11. Yun, B., Hu, G., Cui, Y.: Theoretical analysis of nanoscale plasmonic filter based on a
rectangular metal insulator metal waveguide. J. Phys. D Appl. Phys. 43, 385120 (2010)
Covid-19 Spread Analysis

Srinivas Kanakala and Vempaty Prashanthi

Abstract Based on the public datasets provided by Johns Hopkins University and
Canadian health authorities, we developed a forecasting model of Covid-19 after
analyzing the spread. One dataset contains the cumulative number of confirmed cases,
per day, in each country, and another dataset consists of various life factors scored
by the people living in each country around the globe. We merge these
two datasets to see if there is any relationship between them and the spread of the
virus in a country; by preprocessing, merging and finding correlations between the
datasets, we calculate the needed measures and prepare them for analysis, and then
try to predict the spread of cases using various methods. Time series data track the
number of people affected by the coronavirus globally, including confirmed cases
of the coronavirus, the number of people who have died due to the coronavirus and
the number of people who have recovered from the deadly infection. Data science
can give accurate pictures of coronavirus outcomes and also helps in tracking the
spread. Secondly, using Covid-19 data, we can make supply chain logistics decisions
about supplies of personal protective equipment and ventilators to hospi-
tals and clinics across the world. An analysis of the country, by state and region,
identifies locations of highest need for supplies and ventilators according to the
dataset collected; this is called a supply plan. Finally, we create a set of visualizations
and add them to a presentation to report on findings.

1 Introduction

The novel coronavirus that began in Wuhan, China, has spread to practically all nations
and was declared a pandemic [1]. The outbreak is spreading fast. It is difficult to
precisely assess the lethality of this infection, and it appears to be

S. Kanakala (B)
VNR Vignana Jyothi Institute of Engineering and Technology, Hyderabad, India
V. Prashanthi
Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, India


clearly more deadly than the coronaviruses that caused SARS and MERS. Researchers
[2] have recognized two new strains of the virus, indicating it has already
mutated at least once. The greatest challenge is that an unknown number of individuals
have been infected by the virus without becoming symptomatic. These individuals
are carriers of the virus without themselves showing any signs. At first,
individuals who showed no signs of disease were not isolated, and this led to
the spread of the infection at an enormous rate. The virus also appears to
affect its hosts unevenly. Children seem less likely to be infected, while
middle-aged and older adults are disproportionately infected.
Men are more likely to die from the infection compared with women,
as are individuals with a weaker immune system, Type 2 diabetes and
hypertension. However, recently many otherwise healthy young people
have died from the infection, making it much harder to understand
the impact of Covid-19 [3, 4].

2 Literature Survey

Coronavirus disease (Fig. 1) is the infectious illness caused by the novel coronavirus
(Covid-19). This new virus was unknown before it emerged in Wuhan, China, in December
2019. Covid-19 is a pandemic affecting many countries.

Fig. 1 Coronavirus

2.1 Symptoms

The symptoms of Covid-19 [5] are generally influenza-like, and a few patients develop
a severe form of pneumonia. Patients have fever, muscle pain and body aches, cough
and sore throat about six days after getting the infection. Most
people feel very unwell and weak and improve on their own, yet a minority of patients
will deteriorate after 5–7 days of illness, with breathlessness and a
worsening cough. The cough is dry, not wet. It is even observed that patients have
strong headaches. Furthermore, the symptoms are somewhat different from
influenza. People infected with the virus may show no symptoms at all; a few
people do not become ill while being infected and yet spread the infection
to new hosts. These individuals ought not to be out and about spreading the illness.
Individuals who were infected and have been successfully cured have also been
infected by Covid-19 again, making it much harder to contain the
outbreak. There is no approved medicine to treat Covid-19, and one may not be
available until the spring of 2021. This makes it even more imperative to take
preventive actions.

2.2 No Symptoms

Coronavirus is mainly spread through respiratory droplets released by somebody
who is coughing or has other symptoms, like fever or tiredness [6]. Many people who
have coronavirus experience only mild symptoms, especially in the
beginning. Most recently, it has been shown that elevated levels of the virus
are present in respiratory secretions during the presymptomatic period, which can last
from a few days to over a week, before the fever and cough typical of Covid-19. This
capacity of the virus to be transmitted by individuals without symptoms is a
significant reason for the pandemic. Some reports show that persons who do not
have any manifestations can transmit the virus.

2.3 Coronavirus Modes of Spread

The coronavirus spreads principally from one person to another [7]. This occurs among
people who are close to each other. Droplets created when an infected
individual coughs or sneezes may land in the mouths or noses of people who are
close by, or potentially be breathed into their lungs. An individual infected
with the coronavirus, even one without any symptoms, may emit aerosols when they
talk or breathe. Aerosols are infectious viral particles that can float or drift
around in the air for as long as three hours. One can be infected with Covid-
19 when one touches an item which carries the virus and then touches one's own mouth,
nose, or possibly eyes.

2.4 Importance of Social Distance and Self-isolation

Every person should maintain social distancing of about 6 ft or more from others.
Schools, gatherings, events, malls, etc. do not allow such social distancing;
therefore, these are closed during Covid. This protects society from the virus, as
the spread is controlled. Self-isolation is an important measure that should be
taken by people who are affected by the coronavirus. He or she should be isolated in a
separate room, even from family members. These people should not go to crowded
places like schools, etc., and should seek clinical assistance when needed.

3 Existing System

Aarogya Setu: It is an Indian Covid-19 app, a contact tracing, syndromic mapping
and self-assessment digital service, fundamentally a mobile
application, created by the National Informatics Centre under the Ministry of Elec-
tronics and Information Technology (MeitY). The purpose of this application is to
make people aware of Covid-19 for their own well-being. It is an app which
needs the mobile phone's GPS and Bluetooth to track Covid-19 infection.
The app can be accessed on Android and iOS mobile platforms. With the help
of Bluetooth, it indicates danger when someone with coronavirus comes close (within
six feet) to you, by checking the database of cases across India. With the
help of GPS, it can detect whether the region is a contaminated zone.
Drawbacks of Existing System:
• It is enforced through executive order with no justification.
• Recently, Robert Baptiste tweeted that security weaknesses in Aarogya Setu
allowed hackers to learn who is infected or unwell in a region of their choice.
He also gave details of how many individuals were unwell
and infected at the PM's Office, the Indian Parliament and the Home Office.
• The application's Terms of Service (TOS) give limited liability to the govern-
ment. Consequently, there is no government accountability in the event of
theft of users' data.

4 Proposed System

We developed a new model of Covid-19 after analyzing the spread. One dataset relates
to the cumulative number of confirmed cases, per day, in each country, and another
dataset consists of various life factors scored by the people living in each country
around the globe. We merge these two datasets to see if there is any relationship
with the spread of the virus in a country; by preprocessing [8, 9], merging and
finding correlations between the datasets, we calculate the needed measures and
prepare them for analysis, and then try to predict the spread of cases using various
methods. Time series data track the number of people affected by the coronavirus
globally, including confirmed cases of the coronavirus, the number of people who
have died due to the coronavirus and the number of people who have recovered from
the deadly infection. Data preprocessing [10, 11] is a data mining technique which
is used to transform raw data into a useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning: The data may have many irrelevant and missing
parts. To deal with these, data cleaning is performed. This includes the
treatment of missing data, noisy data and so on.
(a) Missing Data: This condition arises when some data is not
present. It can be handled with different techniques such as:
(i) Ignore the Tuples: This method is appropriate when the dataset is
huge and multiple values are missing within a tuple.
(ii) Fill the Missing Values: There are many methods to do this.
One can fill the values manually, with the attribute mean or with
the most likely value.
(b) Noisy Data: Noisy data is meaningless data that
cannot be interpreted by machines. It can be produced by
faulty data collection, data entry mistakes and so on.
It can be dealt with in the following ways.
(i) Binning Method: This strategy works on sorted data
in order to smooth it. The whole data is divided into segments of
equal size and then different strategies are applied to
complete the task. Each segment is handled independently. We
can replace the data of a segment by its mean, or boundary values
can be used to complete the task.
(ii) Regression: Here data is smoothed by fitting it to a regression
function. The regression used may be linear or multiple.
(iii) Clustering: This approach groups similar data into
a cluster. Outliers may go undetected, or they
fall outside the clusters.

2. Data Transformation: this step transforms the data into forms appropriate for the mining process. It includes Normalization, Attribute Selection, Discretization, and Concept Hierarchy Generation.
3. Data Reduction: data mining handles colossal amounts of data, and analysis becomes harder with huge volumes, so data reduction is used. It aims to increase storage efficiency and reduce data storage and analysis costs. The steps of data reduction are Data Cube Aggregation, Attribute Subset Selection, Numerosity Reduction, and Dimensionality Reduction. A minimal preprocessing sketch is given below.
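As a concrete illustration of the cleaning, binning, and transformation steps above, here is a minimal pandas sketch; the file name and the "Confirmed" column are assumptions for illustration, not the exact schema used in this work.

```python
# Minimal preprocessing sketch (assumed CSV with a "Confirmed" column of
# cumulative case counts; illustrative only).
import pandas as pd

df = pd.read_csv("covid_cases.csv")                  # hypothetical file name

# Data cleaning: drop tuples whose key field is missing, then fill the
# remaining gaps with the attribute mean.
df = df.dropna(subset=["Confirmed"])
df = df.fillna(df.mean(numeric_only=True))

# Binning: smooth the noisy counts by replacing each value with the
# midpoint of its bin (5 equal-width bins).
df["Confirmed_binned"] = pd.cut(df["Confirmed"], bins=5).apply(lambda b: b.mid)

# Data transformation: min-max normalization to [0, 1].
col = df["Confirmed"]
df["Confirmed_norm"] = (col - col.min()) / (col.max() - col.min())
print(df.head())
```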

5 Implementation and Analysis

The supply logistics data from the USA is considered. This set is cleaned by removing the unnecessary columns, which narrows the data down to the information we need most; this prevents distraction and gives a clear idea of what needs to be analyzed. We calculate the ventilator requirements by creating a pivot table in Python (a sketch follows). Since the values are extremely clumped, we can sort by a required filter such as a region like the Midwest, using a legend to depict it. We show a pie chart representation in Fig. 2 to give a clear picture of the distribution, since there are many states and matplotlib gives only a rudimentary view of the analysis. We then calculate the ventilator requirements for each state in the US.
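A hedged sketch of the pivot-table step follows; the column names ("state", "region", "hospitalized", "ventilators") and the file name are assumptions, since the exact schema of the supply-logistics dataset is not given.

```python
# Pivot-table sketch for the ventilator analysis (assumed column names).
import pandas as pd
import matplotlib.pyplot as plt

supply = pd.read_csv("us_supply_logistics.csv")      # hypothetical file name

# Pivot: total hospitalized patients and cumulative ventilators per state.
pivot = supply.pivot_table(index="state",
                           values=["hospitalized", "ventilators"],
                           aggfunc="sum")

# One plausible reading of "required number": hospitalized minus available.
pivot["required"] = pivot["hospitalized"] - pivot["ventilators"]

# Narrow to one region (e.g., MidWest) before plotting, since the values
# are clumped, and show the distribution as a pie chart with a legend.
midwest = supply.loc[supply["region"] == "MidWest", "state"].unique()
pivot.loc[midwest, "required"].plot.pie(autopct="%1.0f%%", legend=True)
plt.show()
```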

Fig. 2 Covid crisis pie chart



We take the number of patients hospitalized and the number of cumulative ventilators available to derive the required number.
A specific case study observing the number of cases in China, India, and Japan is shown in Fig. 3. The origin country of the virus has the maximum number of cases, and since the other two countries are in close proximity, the cases have spread there and are at a similar level in India and Japan.
The spread of the virus in India can be seen by the 21st day, where a sudden spike is evident, as shown in Fig. 4. The government issued a lockdown to flatten this spike, as shown in Fig. 5.

Fig. 3 Specific case study to observe the number of cases in China, India and Japan

Fig. 4 Spread of the virus in India can be seen by the 21st day

Fig. 5 Coronavirus spread for the first 21 days

Corona cases versus GDP: does the wealth of a country mean fewer corona cases? It does not seem so. In fact, developing countries have a lower risk than developed countries, as per our studies shown in Fig. 7, where we have tried to correlate the number of cases with the development of a country. This can be due to various reasons, such as climatic conditions; however, contrary to expectation, it is not due to a lack of testing kits in developing countries (Fig. 6).

Fig. 6 Corona cases versus GDP



Fig. 7 Future spread based on cases and GDP per capita

6 Conclusion

In this digital world, new data and information on the coronavirus and the progress of the outbreak have become accessible at an exceptional pace. Even so, tough questions remain unanswered, and exact answers predicting the dynamics of the situation cannot be obtained at this stage. Analyzing and predicting the spread of the virus with the existing data helps us better understand how to prevent the spread and take preventive measures. To fight the coronavirus, we have to take care of ourselves and follow all the safety measures and rules given by the government. Everyone can play a part in helping scientists fight the coronavirus.

References

1. Novel Coronavirus Pneumonia Emergency Response Epidemiology Team: The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19) in China. Zhonghua Liu Xing Bing Xue Za Zhi 41(2), 145 (2020)
2. Perlman, S.: Another decade, another coronavirus. N. Engl. J. Med. 382(8), 760–762 (2020)
3. Abroug, F., et al.: Family cluster of Middle East respiratory syndrome coronavirus infections, Tunisia, 2013. Emerg. Infect. Dis. 20(9), 1527 (2014)
4. Van Der Hoek, L., et al.: Identification of a new human coronavirus. Nat. Med. 10(4), 368–373 (2004)
5. Guan, W.-j., et al.: Clinical characteristics of coronavirus disease 2019 in China. N. Engl. J. Med. 382(18), 1708–1720 (2020)

6. Schoeman, D., Fielding, B.C.: Coronavirus envelope protein: current knowledge. Virol. J.
16(1), 1–22 (2019)
7. Song, F., et al.: Emerging 2019 novel coronavirus (2019-nCoV) pneumonia. Radiology 295(1), 210–217 (2020)
8. Kamiran, F., Calders, T.: Data preprocessing techniques for classification without discrimina-
tion. Knowl. Inf. Syst. 33(1), 1–33 (2012)
9. Prashanthi, V., Kanakala, S.: Plant disease detection using Convolutional neural networks. Int.
J. Adv. Trends Comput. Sci. Eng. 9(3), 2632–2637
10. García, S., Luengo, J., Herrera, F.: Data preprocessing in data mining. Springer International
Publishing, Cham, Switzerland (2015)
11. Prashanthi, V., Kanakala, S.: Generating analytics from web log. Int. J. Engi. Adv. Technol.
9(4), 161–165
Social Media Anatomy of Text and Emoji
in Expressions

Shelley Gupta, Ojas Garg, Radhika Mehrotra, and Archana Singh

Abstract Social media is a burgeoning platform where a person or a group can interact to collaborate and share ideas through online mediums. Nowadays, the social media content of Twitter, Facebook, Instagram, WhatsApp, online blogs, forums, etc. consists of text and emoji. Emoji are pictorial representations used in electronic media, which a user may or may not combine with text to express sentiments. However, there is still too little work related to emoji in sentiment analysis. Emoji assist in extensive analysis of the opinions and sentiments of the public by means of sentiment analysis, and due to the availability of a large quantity of opinion-rich digital data, much current research focuses on this area. In this paper, the anatomy of social online media content is analyzed: nearly 1.7 lakh (170,000) comments are examined, and the sentiments expressed by end users on various popular products, delight events, etc. are analyzed to determine the usage of text and emoji on social online media such as Facebook.

1 Introduction

In recent years, the number of people using social online media to share their opinions and emotions has increased enormously [1, 2]. Social online media such as Facebook, Twitter, and Instagram play an important role in day-to-day life, allowing users to communicate and share opinions without restraint [3]. Facebook is one of the most popular social online media sites. It encloses the posts uploaded by users, which
S. Gupta (B) · O. Garg · R. Mehrotra


ABES Engineering College, Ghaziabad, Uttar Pradesh, India
e-mail: [email protected]
O. Garg
e-mail: [email protected]
A. Singh
Department of Information Technology, Amity School of Engineering and Technology, Noida,
India
e-mail: [email protected]


consist of comments having both text and emoji; by this means, the analysis can be done more extensively. The challenge is to frame a technology to identify and outline the sentiments.
Sentiment analysis is the automated process of visualizing and categorizing the opinions or sentiments expressed about a given subject or in spoken language. In a world where 2.5 quintillion bytes of data [4] are generated every day, sentiment analysis has become a vital tool for making the data legitimate. Sentiment analysis can be useful in several ways; for example, in marketing it helps in determining the popularity and success rate of a product and helps in improving its quality [5, 6]. The combination of text and emoji has given the world a new way to express emotions in a colorful manner.
The main objectives of the paper are:
• To evaluate online social media pages of multiple popular categories involving
events (sport and movie), automobiles, brands, OTT platform, ecom platforms,
etc.
• To evaluate and determine the total count of comments having both text and emoji
along with the total count of emoji used.
• Comparative study among categories to determine among them the usage of
Emoji.
• Comparative study among categories to determine among them the usage of
Comments with Text and Emoji both.
• Study the aggregated anatomy of categories.

We use Facebook in particular for the following reasons:

• Facebook is used by different people to express their opinions about different subjects; thus, it is a prominent source of people's sentiments.
• Facebook contains an enormous number of comments having both text and emoji, and it grows every day.

Facebook's audience varies from regular users to celebrities, company representatives, and political leaders, and even top brands have profiles on Facebook. Therefore, it is possible to collect sentiments and opinions from different categories.

2 Literature Review

In recent years, sentiment analysis has become a prominent research topic in the field of Natural Language Processing [2]. With the rapid development of online networking sites and digital technologies, sentiment analysis is becoming more reputable than before. Sentiment analysis has two approaches: the machine learning approach [6, 7] and the lexicon-based approach [5].
The machine learning approach is a technique for categorizing contextual data into predetermined categories without being explicitly programmed [7]. It can be divided into two methods [8, 9]. The first, supervised learning, uses two sets of documents: a training set and a test set. The training set is used by the classifier to study the data, and the test set is used for validation. The second, unsupervised learning, involves unlabeled data, for which the algorithm recognizes learning patterns in the input and gives specific output values; it also does not require prior training to mine data [10].
The lexicon-based approach labels the polarity (positive, negative, or neutral) of textual content by aggregating the polarity of each word or phrase [11]. It is divided into two methods. The dictionary-based approach uses WordNet [12, 13], SentiWordNet [14], or any other dictionary to find words related to a sentiment word and determine the polarity. The corpus-based approach helps in finding context-specific oriented sentiment words from a huge corpus [15].
Emoji have become a competent aid to sentiment analysis, as they are widely used to express feelings and emotions [2]. The meticulousness of recognizing emotions can increase and improve with the analysis of emoji. Emoji first originated on Japanese mobile phones in 1997 [16] and became progressively popular worldwide in the 2010s after being added to several mobile operating systems.
Most research has been done with text [6, 7] only or with emoji [16] only, but not both. The objective of this paper is to determine the utilization of both text and emoji on social platforms to express sentiments.

3 Proposed Approach

Users' sentiments and opinions are the major criterion for the improvement of quality and services. Overall, 171,621 online sentiments across 8 popular categories from around the world have been downloaded, as shown in Table 1. The sentiments or opinions of a person cannot be determined by investigating the text alone; for definite and meticulous results, both text and emoji should be evaluated. The steps of the approach adopted are shown in Fig. 1.
Step 1: Identify some famous categories, such as Events (Sport and Movie), Ecommerce Platforms, Automobiles, etc. These categories are further divided into segments, e.g., Events (Sport and Movie) into IPL, FIFA, Filmfare, etc.
Step 2: The posts and comments of the most popular categories on Facebook are downloaded using Facepager [17].
Step 3: The dataset is pre-processed to remove videos and images.
Step 4: Each segment category is analyzed to determine:
• the total count of comments having both text and emoji, along with the emoji count;
• a comparative study among the 8 categories of the usage of emoji, and of comments with both text and emoji;
• the aggregated anatomy of the categories.

Table 1 Statistics of posts and comments under each category

S. No.  Category                  Total segments  Total posts  Total posts  Total comments
                                  in category     per segment  in category
1       Events (Sport and Movie)  7               200          1400         17,510
2       Ecommerce Platforms       7               200          1400         22,420
3       Automobiles               7               200          1400         21,198
4       Online Platforms          7               200          1400         23,775
5       OTT Platforms             7               200          1400         18,029
6       Gadgets                   7               200          1400         18,701
7       Broadcasters              7               200          1400         29,319
8       Skin Care Brands          7               200          1400         20,669
        TOTAL                     56              1600         11,200       171,621

Fig. 1 Proposed approach

Step 5: The result of each segment is exhibited using charts.

Based on this analysis, we determine the popularity of emoji among online users. Table 1 shows the statistics of posts and comments under each category.

4 Implementation and Results

The dataset has been collected from Facebook and downloaded using Facepager. The Python module emoji is used to analyze emoji in the dataset. Other Python libraries used include openpyxl to read and write Excel files, pathlib to work with file paths, re, Workbook, etc.; a counting sketch under these assumptions is given below. The results of (1) category-segment analysis, (2) category comparative study, and (3) aggregated anatomy of categories are listed below:
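The sketch below illustrates the counting step; it assumes the Facepager export places one comment per row in column A of an .xlsx sheet and that the emoji package (version 2.x API) is installed, neither of which is stated in the paper.

```python
# Count emoji and "text + emoji" comments in a Facepager export (assumed
# layout: one comment per row in column A, header in row 1).
import emoji
from openpyxl import load_workbook

wb = load_workbook("facebook_comments.xlsx")         # hypothetical file name
ws = wb.active

emoji_count = 0
text_and_emoji = 0
for row in ws.iter_rows(min_row=2, max_col=1, values_only=True):
    comment = str(row[0] or "")
    n = emoji.emoji_count(comment)                   # emoji 2.x API
    emoji_count += n
    has_text = emoji.replace_emoji(comment, "").strip() != ""
    if n > 0 and has_text:                           # both kinds of content
        text_and_emoji += 1

print("Total emoji:", emoji_count)
print("Comments with text and emoji:", text_and_emoji)
```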

4.1 Category-Segment Analysis

Figure 2 shows the graphical representation of the comment count, emoji count, and comments with text and emoji for the various segments analyzed under the 8 different categories.

4.2 Category Comparative Analysis

Table 2 shows that 6 categories out of 8, i.e., 75% of the categories, have a total number of emoji (the emoji count) greater than 40% of the total number of their respective comments (the comment count). Therefore, the total number of comments with text and emoji cannot be ignored, clearly reflecting the greater use of emoji by online users to express their sentiments.
Figure 3 shows that the automobile category incorporates the maximum use of comments with text and emoji. Figure 4 shows that the usage of emoji is highest in skin care brands and events (sports and movies).

4.3 Aggregated Anatomy of Categories

Figure 5 shows the following results:

• The emoji count is more than 50% of the total number of comments considered.
• The count of comments with both text and emoji is nearly 25% of the total number of comments considered.
• The count of comments with both text and emoji is nearly 47% of the total number of emoji used.

Fig. 2 a Events (Sport and Movie), b Ecommerce Platform, c Automobile, d Software Application, e OTT Platform, f Gadget, g Broadcaster, h Skin Care Brand

5 Conclusion

Our findings are based on popular Facebook pages, analyzing the total count of emoji and the total count of comments with both text and emoji. From the analysis, we observe that, these days, people are using more emoji as a means of expressing their sentiments and opinions on social media. Thus, sentiment analysis can be

Table 2 Category comment count, emoji count, and comments with text and emoji

S. No.  Category                  Comment count  Emoji count  Comments with text and emoji
1       Event (Sport and Movie)   17,510         15,456       4604
2       Ecommerce Platform        22,420         12,126       4511
3       Automobile                21,198         12,070       15,833
4       Online Platform           23,775         8835         2709
5       OTT Platform              18,029         8645         3277
6       Gadget                    18,701         5573         2212
7       Broadcaster               29,319         12,663       4185
8       Skin Care Brand           20,669         15,772       5362
        Total                     171,621        91,140       42,693

Fig. 3 Categories comparison of comments with text and emoji

done more extensively when both emoji and text are considered. This provides the direction for future work.

Fig. 4 Categories emoji count

Fig. 5 Aggregated anatomy of categories

References

1. Suman, C., Saha, S., Bhattacharyya, P., Chaudhari, R. S.: Emoji helps! A multi-modal Siamese
architecture for Tweet user verification. Cogn. Comput., 1–16 (2020)
2. Gupta, S., Singh, A., Ranjan, J.: Sentiment analysis: usage of text and emoji for expressing
sentiments. In: Advances in Data and Information Sciences, pp. 477–486. Springer, Singapore
(2020)
3. Anjaria, M., Guddeti, R.M.R.: A novel sentiment analysis of social networks using supervised
learning. Soc. Netw. Anal. Min. 4(1), 181 (2014)
4. Marr, B.: How much data do we create every day? The mind blowing stats everyone should read
(2020). Retrieved 21 June 2020, from https://www.forbes.com/sites/bernardmarr/2018/05/21/
how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/
5. Dey, A., Jenamani, M., Thakkar, J.J.: Senti-N-Gram: an n-gram lexicon for sentiment analysis.
Expert Syst. Appl. 103, 92–105 (2018)
6. Tripathy, A., Agrawal, A., Rath, S.K.: Classification of sentiment reviews using n-gram machine
learning approach. Expert Syst. Appl. 57, 117–126 (2016)

7. Tripathy, A., Anand, A., Rath, S.K.: Document-level sentiment classification using hybrid
machine learning approach. Knowl. Inf. Syst. 53(3), 805–831 (2017). https://doi.org/10.1007/
s10115-017-1055-z
8. Ayyadevara, V. K.: Word2vec. In: Pro Machine Learning Algorithms, pp. 167–178. Apress,
Berkeley, CA (2018)
9. Asghar, M.Z., Khan, A., Bibi, A., Kundi, F.M., Ahmad, H.: Sentence-level emotion detection
framework using rule-based classification. Cogn. Comput. 9(6), 868–894 (2017). https://doi.
org/10.1007/s12559-017-9503-3
10. Schrauwen, S.: Machine learning approaches to sentiment analysis using the Dutch Netlog
Corpus. Computational Linguistics and Psycholinguistics Research Center, pp. 30–34 (2010)
11. Da Silva, N.F.F., Hruschka, E.R., Hruschka, E.R.: Tweet sentiment analysis with classifier
ensembles. Decis. Support Syst. 66, 170–179 (2014). https://doi.org/10.1016/j.dss.2014.07.003
12. Miller, G.A.: Nouns in WordNet. WordNet: an electronic lexical database, pp. 23–46 (1998)
13. Lee, Y., Ke, H., Yen, T., Huang, H., Chen, H.: Combining and learning word embedding with
WordNet for semantic relatedness and similarity measurement. J. Am. Soc. Inf. Sci. 71(6),
657–670 (2019). https://doi.org/10.1002/asi.24289
14. Denecke, K.: Using SentiWordNet for multilingual sentiment analysis. In: 2008 IEEE 24th
International Conference on Data Engineering Workshop (2008). https://doi.org/10.1109/
icdew.2008.4498370
15. Hardeniya, T., Borikar, D.A.: Dictionary based approach to sentiment analysis—a review. Int.
J. Adv. Eng. Manage. Sci. 2(5) (2016)
16. Fernández-Gavilanes, M., Juncal-Martínez, J., García-Méndez, S., Costa-Montenegro, E.,
González-Castaño, F.J.: Creating emoji lexica from unsupervised sentiment analysis of their
descriptions. Expert Syst. Appl. 103, 74–91 (2018)
17. Facepager 3.6 Download (Free)—Facepager.exe (2020). https://facepager.software.informer.
com/3.6/
Development of Machine Learning
Model Using Least Square-Support
Vector Machine, Differential Evolution
and Back Propagation Neural Network
to Detect Breast Cancer

Madhura D. Vankar and G. A. Patil

Abstract Recently, the death rate among women has increased due to breast cancer. Breast cancer diagnosis systems have been implemented with the help of different machine learning algorithms. In our present work, we use the DE, LS-SVM, and BPNN algorithms to develop machine learning models, with the training dataset taken from the UCI repository. In this system, we compare the results of LS-SVM and backpropagation neural networks to generate accurate results at an early stage. Early detection increases the chances of treatment, which helps to save women's lives. The system is also helpful to the medical field because it avoids loss of data.

1 Introduction

Machine learning gives computers the capability to learn without being explicitly programmed. Machine learning algorithms are used to develop models with the help of training data, in two phases: a training phase and a testing phase. The medical field makes extensive use of machine learning tools, particularly classification and recognition systems.
This paper focuses on breast cancer diagnosis at an early stage, conducted using the least squares support vector machine (LS-SVM) classifier algorithm and the backpropagation neural network (BPNN) algorithm. Early diagnosis is necessary to increase the survivability of women. We demonstrate on the Wisconsin Breast Cancer Dataset (WBCD) taken from the UCI repository. The system uses a Differential Evolution algorithm to optimize the parameters of the LS-SVM classifier, which helps to enhance accuracy. We use the supervised learning mode of the neural network, training it with a training dataset; the backpropagation algorithm then minimizes the errors by adjusting the interconnection weights. The performance of the system is evaluated using a tenfold cross-validation method.

M. D. Vankar (B) · G. A. Patil


Computer Science and Engineering Department, D. Y. Patil College of Engineering and
Technology, Shivaji University, Kasaba Bawada, Kolhapur, Maharashtra, India


2 Literature Survey

Singh and Parveen [1] have proposed a hybrid methodology for classification using a support vector machine (SVM) and fuzzy c-means clustering. It provides accurate results for detecting brain tumors but requires a large amount of memory.
Lin et al. [2] have developed a CAD system for characterizing breast nodules as either benign or malignant using ultrasonic images. The paper uses a fuzzy cerebellar model neural network (FCMNN) classifier, and follows trial-and-error criteria to decide all its analysis parameters.
Vijay and Saini [3] have proposed a system for detection of breast cancer using image registration techniques. The system describes the features of a feed-forward backpropagation Artificial Neural Network (ANN) model, with Mean Square Error (MSE) as the criterion to evaluate system performance. It is time-consuming because it follows a trial-and-error method to choose the number of neurons.
Utomo et al. [4] have used Extreme Learning Machine neural networks (ELM ANN) for developing decision support systems. The ELM algorithm helps to generate accurate results and a high sensitivity rate, but the system requires more time to train the network than other methods and provides a low specificity rate.
Saini et al. [5] have developed a model using an adaptive neural network to detect cancer. It requires more training time but provides better accuracy than a fuzzy system, and its clustering features allow creating a number of clusters depending on the problem size.
There was a need to develop a machine learning model that optimizes training time and enhances performance using Least Square-Support Vector Machine, Differential Evolution, and a Back Propagation Neural Network, so as to help detect breast cancer in patients more accurately. The main objectives of this research work are:
• To develop a machine learning model.
• To replace missing values in the dataset using a suitable algorithm.
• To optimize training time by merging Differential Evolution with LS-SVM.
• To obtain accurate detection results by comparing LS-SVM and BPN.
• To evaluate the performance of the system with the help of a tenfold cross-validation method.

3 System Architecture

Our proposed system develops a machine learning model with the help of big data. The system processes the dataset using a missing-value replacement method to improve performance. Feature scaling is then used to convert the data into binary format. After conversion, the dataset is passed to the machine learning model to train the neural network using two classifiers, LS-SVM and BPN. Finally, the results of the two classifiers are compared to provide accurate results to the end user with the help of analysis parameters (Fig. 1).

Fig. 1 System architecture

4 Methodology

4.1 Dataset Attributes

The given algorithms are implemented on the Wisconsin Breast Cancer Dataset (WBCD) taken from the UCI repository of machine learning databases [13, 14]. The dataset consists of 10 attributes; nine serve as input variables and the remaining one as the output class (2 or 4), where the value 2 indicates a benign and 4 a malignant type of breast cancer. The dataset also contains another attribute, the sample code number, which is discarded because it is not required for the classification process.
Table 1 provides information about the attributes along with their ranges.
The functioning of the system has been divided into the following modules.

Table 1 Attributes to be considered from the WBCD

S. no.  Attribute                     Domain
1       Clump thickness               1 … 10
2       Cell size uniformity          1 … 10
3       Cell shape uniformity         1 … 10
4       Marginal adhesion             1 … 10
5       Single epithelial cell size   1 … 10
6       Bare nuclei                   1 … 10
7       Bland chromatin               1 … 10
8       Normal nucleoli               1 … 10
9       Mitoses                       1 … 10
10      Class                         2 = benign, 4 = malignant

5 Module 1: Missing Value Replacement and Feature Scaling of Attributes

This module makes the dataset ready for training the neural network using the following methods. Here, data owner authentication, user authentication, and missing-value replacement mechanisms have been implemented.
(A) Data Owner Authentication:
In data owner authentication, the owner signs in to the system by providing a username and password. If a new user wants to upload a testing database file into the system, he has to register, giving his username, password, mobile number, address, and other required details. Once the data owner is authenticated, he can browse and load the Wisconsin Breast Cancer training dataset into the system for processing.
(B) Listwise Deletion Algorithm to Ignore Tuples Containing Missing Data:
This method removes missing values from the dataset, which consists of the attributes listed in Table 1. The simplest way of handling missing data is to delete the subjects that have missing values: if any row contains a missing value for an attribute, represented by the '?' symbol, that entire row is deleted.
(C) Feature Scaling of Attributes:
After the listwise deletion step, the dataset is forwarded for feature scaling. Feature scaling converts the dataset values into binary format, except for the last attribute, i.e., class (2 = benign, 4 = malignant). The following min-max formula is used:

X' = [x − min(x)] / [max(x) − min(x)]

The scaled values are then thresholded as follows (a code sketch is given below):
(i) values 0–5 are converted to 0;
(ii) values 6–10 are converted to 1.
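A minimal sketch of these two steps on the WBCD file follows, assuming the standard UCI column layout; the column names are illustrative, not taken from the paper.

```python
# Listwise deletion and binary feature scaling of the WBCD (assumed columns).
import pandas as pd

cols = ["code", "clump", "size_unif", "shape_unif", "adhesion", "epithelial",
        "bare_nuclei", "chromatin", "nucleoli", "mitoses", "class"]
df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?")

df = df.drop(columns="code")   # sample code number is not needed
df = df.dropna()               # listwise deletion of rows containing '?'

features = df.columns.drop("class")
# Thresholding 1..5 -> 0 and 6..10 -> 1, equivalent to min-max scaling
# followed by the 0-5 / 6-10 rule described above.
df[features] = (df[features] >= 6).astype(int)
print(df.head())
```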

6 Module 2: Training Neural Network Using LS-SVM and DE

This module focuses on training the neural network using the Least Squares Support Vector Machine, with Differential Evolution merged into LS-SVM to improve the system's accuracy.
(A) Least Squares Support Vector Machine (LS-SVM):
Figure 2 shows the role of LS-SVM: it classifies all training vectors into two classes with the help of a hyperplane, selecting the hyperplane that leaves the maximum margin from both classes.
LS-SVM Concept:
The LS-SVM classifier finds a suitable hyperplane that separates the classes. It is an extended version of SVM; compared to other techniques, LS-SVM is very simple and fast.
LS-SVM Algorithm:

Step 1: Load the training set of n data points, where X_i represents the input vector and Y_i the corresponding ith target with values {2, 4}.
Step 2: For each input data point, compute the decision function with random initial weights:

g(x) = w · x + w_0

Fig. 2 Concept of LS-SVM

Step 3: Obtain the value of the bias term b and initialize the error e for each point randomly, subject to the constraints

g(x) ≥ 1 for all x in Class 1
g(x) ≤ −1 for all x in Class 2

Step 4: Minimize the objective function using the e, w, and b values and calculate the total margin; minimizing ||w|| maximizes the separability between the two classes.
Step 5: Satisfy the Karush-Kuhn-Tucker (KKT) conditions by constructing the Lagrangian function, whose solution is

w = Σ_{i=0}^{N} λ_i y_i x_i, with Σ_{i=0}^{N} λ_i y_i = 0

Step 6: Find the number of support vectors.
Step 7: Use an RBF kernel function for classification of the training data; a NumPy sketch of this formulation follows.
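The LS-SVM training step reduces to solving a single linear system. Below is a hedged NumPy sketch of this formulation; labels must first be mapped from {2, 4} to {−1, +1}, and the values of gamma and sigma are placeholders (the paper tunes them with DE).

```python
# LS-SVM with an RBF kernel: training = one linear solve (KKT system).
import numpy as np

def rbf(X, Z, sigma=1.0):
    d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d / (2 * sigma ** 2))

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """X: (n, d) inputs; y: (n,) labels in {-1, +1}."""
    n = len(y)
    omega = np.outer(y, y) * rbf(X, X, sigma)
    # KKT system: [[0, y^T], [y, omega + I/gamma]] @ [b, alpha] = [0, 1]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(n))))
    return sol[0], sol[1:]                    # bias b, multipliers alpha

def lssvm_predict(X_train, y, b, alpha, X_new, sigma=1.0):
    K = rbf(X_new, X_train, sigma)
    return np.sign(K @ (alpha * y) + b)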

(B) Differential Evolution (DE):

DE is a simple optimization technique.

DE-Based Optimization Technique:

Figure 3 shows the idea behind the DE-LSSVM algorithm, in which the developed model is used to predict test data. DE has the standard phases of initialization, mutation, crossover, and selection. Merging DE with the LS-SVM classifier improves accuracy: DE is used to optimize the kernel parameters of LS-SVM, with a fitness function built from the classifier evaluation criteria, i.e., accuracy, mean absolute error, and root mean squared error. A SciPy-based sketch of this tuning loop is given below.
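One way to realize this tuning loop is with SciPy's built-in differential_evolution; the sketch below reuses the LS-SVM functions from the previous sketch, and the bounds on gamma and sigma are assumptions.

```python
# DE tuning of the LS-SVM kernel parameters (illustrative fitness: the
# misclassification rate on a held-out validation split).
from scipy.optimize import differential_evolution

def fitness(params, X_tr, y_tr, X_va, y_va):
    gamma, sigma = params
    b, alpha = lssvm_train(X_tr, y_tr, gamma, sigma)
    pred = lssvm_predict(X_tr, y_tr, b, alpha, X_va, sigma)
    return (pred != y_va).mean()

# result = differential_evolution(fitness,
#                                 bounds=[(0.1, 1000.0), (0.01, 10.0)],
#                                 args=(X_tr, y_tr, X_va, y_va))
# best_gamma, best_sigma = result.x
```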

7 Module 3: Training Neural Network Using Back Propagation Neural Network (BPNN)

Fig. 3 The flowchart of the DE-LSSVM algorithm

The Back Propagation algorithm is quite simple and easy to program, and is used for many tasks such as classification and clustering. Here we use a supervised learning model to train the neural network. Training includes two phases: feed-forward computation and backpropagation. The feed-forward phase calculates the output, which is then checked against the actual output; if the calculated outputs are not satisfactory, the connections between layers are modified using the backpropagation algorithm. This procedure is repeated to minimize the error calculated in the feed-forward phase (Fig. 4).
Four Phases of the Training Process:
1. Initialization of weights: assign small random values.
2. Feed-forward approach: in this phase, the network has fixed synaptic weights and calculates the error using the following formula:

Error Signal = Calculated Output − Desired Output

Fig. 4 Working of BPNN

3. Calculating the propagated error: this phase minimizes the error by adjusting the synaptic weights of the network. Forward and backward passes are repeated until the error is minimized, with the weight update

W_AB(new) = W_AB(old) − η (∂E² / ∂W_AB)

where η is the learning-rate parameter and ∂E²/∂W_AB is the sensitivity.
4. Update the weights in the neural network. An illustrative sketch using a standard library implementation follows.
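As an illustrative stand-in for the BPNN described above, the sketch below uses scikit-learn's MLPClassifier, which trains a feed-forward network by backpropagation; the layer size, activation, and learning rate are assumptions, not the authors' settings.

```python
# Feed-forward network trained by backpropagation (illustrative parameters).
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# X, y: the scaled WBCD features and class labels from the earlier sketch.
X = df[features].values
y = df["class"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
bpnn = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                     solver="sgd", learning_rate_init=0.1, max_iter=2000)
bpnn.fit(X_train, y_train)
print("Test accuracy:", bpnn.score(X_test, y_test))
```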

8 Module 4: Tenfold Cross-Validation Method

This module focuses on evaluating system performance for LS-SVM and BPN. The steps of the tenfold cross-validation performance evaluator are (a sketch follows):
• First, divide the given dataset into 10 sets.
• Train on 9 of the 10 sets and test on the remaining set.
• Repeat this procedure 10 times and take the mean accuracy.
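A hedged sketch of this evaluator using scikit-learn follows; cross_val_score performs exactly the split-train-test-average cycle listed above (the estimator here is the BPNN stand-in from the previous sketch).

```python
# Tenfold cross validation: train on 9 folds, test on 1, repeat, average.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(bpnn, X, y, cv=10)
print("Tenfold mean accuracy: %.2f%%" % (100 * scores.mean()))
```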

9 Module 5: Experimental Results and Analysis

For experimentation, the NetBeans IDE has been installed. Figure 5 shows the data owner authentication page, where the owner signs in to the system by providing a username and password.

Fig. 5 Data owner authentication

After the data owner authentication process, the data owner can browse and load the Wisconsin Breast Cancer training dataset into the system for processing. Here we demonstrate on 600 records of the WBCD.
Figure 6 shows the Data Owner screen used to load the training dataset for development of the machine learning model.
Figure 7 shows the dataset after replacing missing values using the listwise deletion method.
Figure 8 shows the dataset values in binary format, except the last attribute, obtained after performing feature scaling.
Figure 9 shows a Data Owner screen with DE-LSSVM training results that include correctly and incorrectly classified instances, which are used to calculate the evaluation parameters of the system.
1. Accuracy: the percentage of correctly classified data.

Fig. 6 Load dataset for training

Fig. 7 Listwise deletion method


c
= (No. of correctly classified data in class m)/ (Total No. of data in class m)
m=1 m=1
= (386/600) ∗ 100 = 64.3333%.

2. Mean Absolute Error (MAE): measures the error between observations using the following formula.

Fig. 8 Feature scaling of attributes

Fig. 9 Train dataset using DE and LS-SVM

MAE = (1/n) Σ_{j=1}^{n} |y_j − ŷ_j|
    = (1/n) {prediction − actual observation}
    = (1/600) {600 − 386} = 0.3566

where n is the total number of instances in the dataset.



Fig. 10 Train dataset using BPNN

3. Root Mean Squared Error (RMSE): the square root of the MAE (for 0/1 classification errors, the squared and absolute errors coincide):

RMSE = √[(1/n) Σ_{j=1}^{n} |y_j − ŷ_j|] = √0.3566 = 0.5971

Figure 10 shows a Data Owner screen with BPNN training results that include correctly and incorrectly classified instances.
1. Accuracy = (573/600) × 100 = 95.5%.
2. Mean Absolute Error (MAE) = (1/600) {600 − 573} = 0.0522.
3. Root Mean Squared Error (RMSE) = √0.0522 = 0.1998.
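A small helper reproducing the three measures as they are used above for 0/1 classification errors is sketched below (illustrative; not the authors' evaluation code).

```python
# Accuracy, MAE, and RMSE for 0/1 classification errors, as used above.
import numpy as np

def evaluate(y_true, y_pred):
    err = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    accuracy = 100.0 * (1.0 - err.mean())
    mae = err.mean()            # |y - yhat| is 0 or 1, so MAE = error rate
    rmse = np.sqrt(mae)         # squares of 0/1 errors equal themselves
    return accuracy, mae, rmse

print(evaluate([2, 4, 4, 2], [2, 4, 2, 2]))   # (75.0, 0.25, 0.5)
```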
Tables 2, 4 and 5 contain the calculated values of accuracy, mean absolute error, and root mean squared error of the DE-LSSVM and BPNN algorithms with respect to the number of records supplied by the end user.
From Table 2, we can compute the average accuracy of the DE-LSSVM and BPNN algorithms with respect to the number of records in the dataset. Table 3 makes clear that the BPNN algorithm provides more accurate results than the DE-LSSVM algorithm.
Figure 11 shows that the accuracy of BPNN is higher than that of the DE-LSSVM algorithm; it is observed that as the number of records in the dataset increases, the accuracy of the DE-LSSVM algorithm decreases.
Figure 12 shows all the analysis parameters of the DE-LSSVM algorithm with respect to the number of records in the dataset.
Figure 13 shows all the analysis parameters of the BPNN algorithm with respect to the number of records in the dataset. From Fig. 11, it is observed that as the

Table 2 Accuracy of the DE-LSSVM and BPNN algorithms

Number of records   DE-LSSVM (%)   BPNN (%)
10                  90             90
20                  75             95
30                  70             90
40                  67             97.5
50                  62             94
100                 56             94
200                 58             94.5
300                 56.33          94.667
400                 58.25          93.75
500                 61.8           95
600                 64.33          95.5

Table 3 Average accuracy of the DE-LSSVM and BPNN algorithms

Algorithm              DE-LSSVM   BPNN
Average accuracy (%)   65.34      93.99

Table 4 Mean absolute error and root mean squared error values of the DE-LSSVM algorithm

Number of records   Mean absolute error   Root mean squared error
10                  0.1                   0.3162
20                  0.25                  0.5
30                  0.3                   0.5477
40                  0.325                 0.5701
50                  0.38                  0.6164
100                 0.44                  0.6633
200                 0.42                  0.6481
300                 0.4367                0.6608
400                 0.4175                0.6461
500                 0.382                 0.6181
600                 0.3566                0.5971

number of records in the dataset increases, the accuracy of the BPNN algorithm also increases; i.e., the number of records in the dataset is directly proportional to the accuracy of the BPNN algorithm.

Table 5 Mean absolute error and root mean squared error values of the BPNN algorithm

Number of records   Mean absolute error   Root mean squared error
10                  0.1503                0.3222
20                  0.1419                0.265
30                  0.1079                0.2416
40                  0.0668                0.1738
50                  0.0944                0.2401
100                 0.0728                0.2159
200                 0.0609                0.2104
300                 0.0651                0.2156
400                 0.0679                0.2295
500                 0.0603                0.2152
600                 0.0522                0.1998

Fig. 11 Accuracy of the DE-LSSVM and BPNN algorithms

Fig. 12 Evaluation parameters of DE-LSSVM algorithm



Fig. 13 Evaluation parameters of BPNN algorithm

10 Conclusion

In this paper, a machine learning model has been developed for breast cancer patients with the help of the DE-LSSVM and BPNN classification algorithms. Missing values are handled using the listwise deletion method in the dataset preprocessing module, which enhances system performance. DE is combined with LS-SVM, which helps to improve the system's accuracy and minimize training time; generating results at an early stage increases the chance of survival for women. As shown in Table 3, BPNN gives 93.99% average accuracy and LS-SVM gives 65.34% average accuracy with respect to the available records in the given dataset. From the results of all the analysis parameters, it is therefore observed that the BPNN algorithm provides more accurate results than the DE-LSSVM algorithm, and this accuracy will probably keep increasing for larger datasets. There is further scope for implementation with other classification techniques to improve efficiency, and adding other features would help the system handle and test more than one dataset at a time.

References

1. Parveen and Singh, A.: Detection of brain tumor in MRI images, using combination of Fuzzy
C-Means and SVM. In: IEEE Paper on Signal Processing and Integrated Networks (SPIN)
(2015)
2. Hou, Y.-L., Lin, C.-M., Chen, K.-H., Chen, T.-Y.: Breast nodules computer-aided diagnostic
system design using Fuzzy Cerebellar model neural networks. IEEE Trans. Fuzzy Syst. 22(3)
(2014)
3. Saini, S., Vijay, R.: Optimization of artificial neural network breast cancer detection system
based on image registration techniques. Int. J. Comput. Appl. 105(14) (2014)
4. Utomo, C.P., Kardiana, A., Yuliwulandari, R.: Breast cancer diagnosis using artificial neural
networks with extreme learning techniques. Int. J. Adv. Res. Artif. Intel. 3(7) (2014)
5. Singh, S., Saini, S., Singh, M.: Cancer detection using adaptive neural network. Int. J. Adv.
Res. Technol. 1(4) (2012)

6. Al-Anezi, M.M., Mohammed, M.J., Hammadi, D.S.: Artificial immunity and features reduction
for effective breast cancer diagnosis and prognosis. IJCSI Int. J. Comput. Sci. Issues 10(3)
(2013)
7. Jouni, H., Issa, M., Harb, A., Jacquemod, G., Leduc, Y.: Neural network architecture for breast
cancer detection and classification. In: IEEE International Multidisciplinary Conference on
Engineering Technology (IMCET) (2016)
8. Menaka, K., Karpagavalli, S.: Breast cancer classification using support vector machine and
genetic programming. Int. J. Innovative Res. Comput. Commun. Eng. 1(7) (2013)
9. Fiuji, H.H., Almasi, B.N., Mehdikhan, Z., Bibak, B., Pilevar, M., Almasi, O.N.: Automated
diagnostic system for breast cancer using least square support vector machine. Am. J. Biomed.
Eng. (2013)
10. Usha Rani, K.: Parallel approach for diagnosis of breast cancer using neural network technique.
Int. J. Comput. Appl. 10 (2010)
11. Sharma, A., Mehta, N., Sharma, I.: Reasoning with missing values in multi attribute datasets.
Int. J. Adv. Res. Comput. Sci. Softw. Eng. 3(5) (2013)
12. Kaushik, K., Arora, A.: Breast cancer diagnosis using artificial neural network. Int. J. Latest
Trends Eng. Technol. (IJLTET) 7(2) (2016)

Web Links for Accessing Information About the Breast Cancer Dataset

13. https://archive.ics.uci.edu/ml/index.php
14. https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)
Distributed and Decentralized Attribute
Based Access Control for Smart Health
Care Data

B. Ravinder Reddy and T. Adilakshmi

Abstract Most access control solutions today allow centralized authorities, whether governments, manufacturers, or service providers, to gain unauthorized access to and control of devices by collecting and analyzing users' data. Data owners (DOs) need to control and manage their own medical data. As an alternative, blockchain supports health care, including health care data that might otherwise raise legal and privacy issues, and solves the problem of sharing with third parties. Blockchain technology provides data owners with comprehensive, immutable records and access to Smart Health Care (SHC) data independent of service providers and websites. This paper presents a scheme in which the data owner endorses a message based on attributes without leaking any data other than the appended proof. The proposed model is a distributed and decentralized scheme for access control.

1 Introduction

Smartwatches are the ultimate medical sensors, well suited to bringing essential medical monitoring into the home: they are easy to use, always running, and continuously in contact with our bodies. A major issue arises when someone updates or modifies the data without the owner's knowledge, which can lead to major damage. Access control of Smart Health Care (SHC) data can be an alternative solution for securing the data and providing privacy for data owners. SHC data stores the health-related personal data gathered from wearable devices such as Fitbit. An incorrect alteration of any such sensitive data

B. Ravinder Reddy (B)


Department of CSE, UCE (A), OU, Hyderabad, India
e-mail: [email protected]
Department of CSE, Anurag Group of Institutions (A), Hyderabad 500088, India
T. Adilakshmi
Department of CSE, Vasavi College of Engineering (A), Hyderabad 500089, India
e-mail: [email protected]


can be harmful. Thus, privacy becomes an important aspect of an SHC system. The proposed model enables patients to generate, share, and manage data with e-healthcare providers by storing the access rights of the collected data on a blockchain.

2 Literature Survey

2.1 Access Control

According to Zyskind et al. [1], blockchain can handle many real-world needs at scale, such as currency and transaction support, cloud storage, supply chains, and public charity.
Generally, in an organization, the evaluation of an access control policy, to determine whether a requested access to a warehouse can be performed, is done by a party that cannot always be trusted. Attribute-Based Access Control (ABAC) policies are one of the common approaches to expressing access policy.
ABAC combines a set of rules with conditions over a set of attributes assigned to subjects or resources. For instance, consider a resource owner A, an object that controls the policy defined over a number of subjects P_i and a number of resources X_j. P_i holds the rights to perform actions on X_j, such as the transfer of rights specified by the defined policies. The approach requires A and P_i to perform their actions independently of one another; the policy issuer and the subject play no role during the exchange itself. As coined by Di Francesco Maesa et al. [2], a policy creation transaction (PCT) is performed by A to transfer access rights to a new subject in P_i. Finally, after the policy is created, it can be updated or revoked by the resource owner A.
Any access control policy is defined in terms of access rights. An access right is a permission to access a resource that is granted and that allows a program or person to reach the intended data. Digital access rights play a crucial role in data security and compliance: these permissions control the ability of users to view, read, write, change, navigate, and execute the contents of the file system. Table 1 represents the access rights, and a minimal ABAC-style check is sketched after the table.

Table 1 Representation of access rights for different user types

User    User type        Type of access                        Access control        File
Demo    Admin            Create different users                Read, write, execute  All files
Demo 1  Data owner (DO)  Create users and share access rights  Read, write, execute  Uploaded file
Demo 2  User             As allowed by DO                      As allowed by DO      As allowed by DO
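The sketch below gives a minimal ABAC-style check in the spirit of the discussion above; the rule structure and attribute names are illustrative assumptions, not a standard API.

```python
# Toy ABAC evaluation: a rule grants an action when a predicate over the
# subject and resource attributes holds.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    action: str
    condition: Callable[[dict, dict], bool]

policy = [
    Rule("read", lambda s, r: s["role"] in ("admin", "data_owner")
                 or r.get("shared_with") == s["name"]),
]

def allowed(action, subject, resource):
    return any(rule.action == action and rule.condition(subject, resource)
               for rule in policy)

print(allowed("read", {"name": "Demo 2", "role": "user"},
              {"shared_with": "Demo 2"}))     # True
```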

2.2 Blockchain

On the other side, the invention of Bitcoin [3] in 2008 by a person (or group of people) using the name Satoshi Nakamoto has inspired blockchain and many other applications. A blockchain contains three basic elements. The first is a distributed ledger that is not centralized. The second element, the consensus algorithm, is a way to make sure that all copies of the ledger are the same for everyone; the process of reaching agreement is called mining. The third is the encryption and distribution of blocks: each block contains the stored data, and the result is a hash code that indicates the location of the block and its contents and links it with the rest of the chain [4].
Blockchain-based access control is beneficial because access rights can easily be transferred from one user to another, any user can validate at any time who currently holds access, and proof of existence of access rights is possible.
Certain properties and capabilities of blockchain ensure the highest security for transactions. Some of the terminology and core components of blockchain are: the Peer-to-Peer (P2P) network [5], in which users are connected to one another with no central control; the hash pointer [6], which points to the next block of transactions; and digital signatures [7], used for authenticating users. Two digital signature schemes, which work on the elliptic and Edwards curves respectively, are the Elliptic Curve Digital Signature Algorithm (ECDSA) and the Edwards-curve Digital Signature Algorithm (EdDSA) [8]. Consensus [9] avoids intermediaries; a Merkle tree [10] is a technique that splits data and applies hash functions until a single hash value is achieved (see the sketch below); Proof of Work (PoW) [11] is a consensus algorithm used in the network to provide security and decision making; in Proof of Stake [12], as used in Bitcoin-like currencies, the collection of coins held determines who wins; and Byzantine Fault Tolerance [1] is used to identify issues related to the various components in the network.
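A hedged sketch of the Merkle-tree idea follows: data items are hashed pairwise until a single root hash remains (illustrative, not a full blockchain implementation).

```python
# Merkle root: repeatedly hash pairs of nodes with SHA-256 until one remains.
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:             # duplicate the last node if odd
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()

print(merkle_root([b"tx1", b"tx2", b"tx3"]))
```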
For example, suppose Bob needs to upload a file and share it with another user, Spark.
1. Bob uploads the file to a cloud server.
2. Bob gives Spark access rights to the file.
3. Spark can initially see the file but cannot read or download it.
4. Spark uses the private key generated by Bob when the attributes were assigned.
5. Spark finally downloads the file.
In this way, using blockchain together with encryption creates two levels of security for access control. Figure 1 represents the communication among patients, hospitals, and doctors.

2.3 Smart Health Care on Cloud

Deploying the SHC system on a cloud ecosystem provides several opportunities, such as accessibility, computational elasticity, greater fault tolerance, and interoperability [13, 14]. As per HIPAA [15], cloud providers are non-covered entities. Thus,

Fig. 1 Communication among patients, doctors, and hospital

the service providers have no obligation to ensure integrity and confidentiality or to provide proper access to consumers [16]. Consequently, privacy becomes an important concern in adopting the medical system.
Health data is managed and controlled by the data owner or patient, unlike other digital health records [17, 18]. Owners can selectively share their health-related data with third parties while keeping some data private. The cloud allows the medical data to be accessed anytime from anywhere; it can also support the system in preparing for appointments and maintaining a more sophisticated view of personal health to share.
The cloud, on the other hand, may have a business interest in analyzing the health care data, may have malicious employees, or may even be hacked. The medical system must communicate with different users, so the access control policy should support accountability and revocation. Thus, the smart health care system must offer a tamper-proof feature and protect the data owner's privacy. In our proposed model, the underlying cloud infrastructure of the SHC system is considered untrusted, and blockchain is used to provide security.

2.4 Limitations of Existing System

The existing system comes with various limitations:

• ABAC is based on a central authority and is not distributed.
• Creating or configuring ABAC is not simple.
• A blockchain implementation of access control may be slow.
• Memory utilization, transaction time, and energy consumption are high for blockchain.

3 Proposed Methodology

The proposed model in Fig. 2 works by collecting data from wearable devices, storing it on the cloud, and assigning the data access rights based on Attribute-Based Access Control (ABAC) policies. The generated attributes and access rights (logs) are stored on blockchain nodes, which reduces the overload and storage burden on the cloud and improves performance, making the blockchain lightweight. This system provides high-quality preventive mechanisms for accessing smart health records. The proposed model can also improve efficiency in terms of power consumption, processing, and execution time.
The benefits of this approach are secure sharing of patient information through decentralized access, improved interaction between patients and service providers, and safeguarding of privacy. Both the general ABAC and blockchain-based ABAC schemes use user, resource, object, and environment as attributes. Table 2 depicts a comparison of attribute-based access control (ABAC) and blockchain-based ABAC; a toy sketch of the access-rights ledger follows.
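The sketch below is a toy model of the idea: only the access-rights records (not the health data itself) are kept as hash-linked entries; the record fields are illustrative assumptions.

```python
# Hash-linked ledger of access-rights grants (toy model, no consensus).
import hashlib
import json
import time

class AccessLedger:
    def __init__(self):
        self.chain = [{"prev": "0" * 64, "record": "genesis"}]

    def _hash(self, block):
        return hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()).hexdigest()

    def grant(self, owner, user, resource, rights):
        record = {"owner": owner, "user": user, "resource": resource,
                  "rights": rights, "ts": time.time()}
        self.chain.append({"prev": self._hash(self.chain[-1]),
                           "record": record})

ledger = AccessLedger()
ledger.grant("Demo 1", "Demo 2", "report.pdf", ["read"])
print(ledger.chain[-1]["prev"])
```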

Fig. 2 Architecture of proposed model



Table 2 Comparison of ABAC and blockchain-based ABAC

S. no.  Parameter              ABAC                                   Blockchain-based ABAC
1       Access rights          Granted based on policies framed       Granted by re-encryption using a hash
                               using attributes                       key, based on attributes
2       Access control         Created by direct encryption           Created by specified attributes
3       Computational process  Expensive, based on file size          Only access rights are stored
4       User tracing           Anonymous                              Can be traced

Table 3 Results of the proposed model

S. no.  Parameter           On-premise                          Cloud       Cloud and blockchain
1       Energy consumption  40% less efficient than cloud and   10% higher  40% more efficient than without cloud
                            blockchain, 50% less efficient                  and 10% less efficient than cloud
                            than cloud
2       Execution time      3 s                                 2 s         40% more efficient than without cloud
                                                                            and 10% less efficient than cloud
3       File size           16 MB                               16 MB       16 MB
4       Memory utilization  90%                                 96%         Improved
5       CPU load            93%                                 95%         Improved

4 Results

The proposed methodology can be implemented considering minimum hardware requirements of 4 GB RAM, 100 GB of storage, and an i7 processor. If tested in different environments, such as the cloud, cloud with blockchain, and on-premise, the expected results shown in Table 3 can be achieved.

5 Conclusion

The proposed model gives data owners the control to generate, share, and manage patients' data with healthcare providers by storing the access rights of cloud-stored data on a blockchain. Storing the access policies on the blockchain increases security and prevents unauthorized access. The paper emphasized the shortfalls of the existing system that can be avoided by implementing the proposed model. The proposed architecture for SHC data can improve efficiency and execution time by lowering storage requirements, thereby improving overall data access.

6 Future Scope

In the future, we intend to implement the model using a private blockchain. As an alternative, Hyperledger Fabric, which reduces the overhead involved in mining and supports 10,000 transactions per second, can be considered for the implementation. We feel that a comprehensive study is still needed to define fine-grained access control using a lightweight blockchain.

References

1. Zyskind, G., Nathan, O., Pentland, A.: Decentralizing privacy: using blockchain to protect
personal data. In: 2015 IEEE Security and Privacy Workshops, San Jose, CA, pp. 180–184
(2015). https://doi.org/10.1109/SPW.2015.27
2. Di Francesco Maesa, D., Mori, P., Ricci, L.: Blockchain based access control. Lecture Notes
in Computer Science, pp. 206–220 (2017). https://doi.org/10.1007/978-3-319-59665-5_15
3. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system (2008)
4. Nguyen, Q.K.: Blockchain—a financial technology for future sustainable development. In:
Green Technology and Sustainable Development (GTSD), pp. 51–54 (2016)
5. Pappalardo, G., Di Matteo, T., Caldarelli, G., et al.: Blockchain inefficiency in the Bitcoin peers
network. EPJ Data Sci. 7, 30 (2018). https://doi.org/10.1140/epjds/s13688-018-0159-3
6. Ajao, L.A., Agajo, J., Adedokun, E.A., Karngong, L.: Crypto Hash Algorithm-based blockchain
technology for managing decentralized ledger database in oil and gas industry. J. 2(3), 300–325
(2019). https://doi.org/10.3390/j2030021
7. Martínez, V.G., Hernández-Álvarez, L., Hernandez Encinas, L.: Analysis of the cryptographic
tools for Blockchain and Bitcoin. Mathematics 8, 131 (2020). https://doi.org/10.3390/math80
10131
8. Wang, L., Shen, X., Li, J., Shao, J., Yang, Y.: Cryptographic primitives in blockchains. J. Netw.
Comput. Appl. (2018). https://doi.org/10.1016/j.jnca.2018.11.003
9. Ismail, L., Materwala, H.: A review of blockchain architecture and consensus protocols: use
cases, challenges, and solutions. Symmetry 11, 1198 (2019). https://doi.org/10.3390/sym111
01198
10. Lee, D., Park, N.: Blockchain based privacy preserving multimedia intelligent video surveil-
lance using secure Merkle tree. Multimedia Tools Appl. (2020). https://doi.org/10.1007/s11
042-020-08776-y
11. Gervais, A., Karame, G.O., Wüst, K., Glykantzis, V., Ritzdorf, H., Capkun, S.: On the security
and performance of proof of work blockchains. In: Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security—CCS’16 (2016). https://doi.org/10.
1145/2976749.2978341
12. King, S., Nadal, S.: Ppcoin: peer-to-peer crypto-currency with proof-of-stake, self-published
paper, pp. 1–6 (2012)
13. Kaufman, L.M.: Data security in the world of cloud computing. IEEE Secur. Priv. 7(4), 61–64
(2009)
14. Zhang, Y., He, D., Choo, K.R.: BaDS: blockchain-based architecture for data sharing with ABS
and CP-ABE in IoT. Wirel. Commun. Mob. Comput. 2018, 9 (2018)
15. What is HIPAA. https://www.dhcs.ca.gov/formsandpubs/laws/hipaa/Pages/1.00WhatisHIPAA.aspx (2018)
16. Dwyer III, S.J., Weaver, A.C., Knight Hughes, K.: Health insurance portability and account-
ability act. Security Issues in the Digital Medical Enterprise, Society for Computer Applications
in Radiology, 2nd edn (2004)

17. Baird, A., North, F., Raghu, T.S.: Personal Health Records (PHR) and the future of the
physician-patient relationship. In Proceedings of the 2011 iConference, New York, NY, USA,
pp. 281–288 (2011)
18. Wangthammang, M., Vasupongayya, S.: Distributed storage design for encrypted personal
health record data. In: Proceedings of the 8th International Conference on Knowledge and
Smart Technology, KST 2016, Thailand, pp. 184–189 (2016)
Dynamic Node Identification
Management in Hadoop Cluster Using
DNA

J. Balaraju and P. V. R. D. Prasada Rao

Abstract Distributed systems play a vital role in storing and processing big data, and the volume of data generated from various sources is increasing rapidly every minute. Hadoop provides a scalable and efficient distributed system that supports commodity hardware by combining different networks within a topographical locality. Node support in Hadoop clusters is growing rapidly across different versions, which makes clusters difficult to manage. Hadoop does not offer node-management features such as adding and deleting nodes. Node identification in a cluster depends entirely on the DHCP server that hands out IP addresses; the hostname is based on the physical (MAC) address of each node. There is scope for a hacker to steal information using the IP address or hostname and to create a disturbance in the distributed system by adding a malicious node with a duplicate IP or hostname. This paper proposes novel node management for distributed systems using DNA hiding, generating a unique key by combining the unique physical (MAC) address of a node and its hostname. This mechanism provides better node management for the Hadoop cluster, supporting node addition and deletion with limited computation and providing better protection of nodes from hackers. The main objective of this paper is to design an algorithm that hides node-sensitive information using DNA sequences, thereby providing security to the node and its data.

1 Introduction

Distributed Computing (DC) [1] provides a practical framework for the efficient execution of a solution on a number of PCs connected through a network. In DC, large tasks are divided into smaller problems that can then be executed on several PCs concurrently, independently of one another. Over recent years, the intermixing of software engineering and

J. Balaraju (B) · P. V. R. D. Prasada Rao


Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation,
Vaddeswaram, Andhra Pradesh, India


the complexity of science has led to the rich field of bioinformatics. Bioinformatics [2] is one of the newer areas and has made us aware of an entirely different universe of science. Deoxyribonucleic Acid (DNA) [3] sequence analysis can be a long process, taking from a few hours to several days. This paper builds a distributed framework that offers a solution for several bioinformatics-related applications. The overall goal of this paper is to build a Distributed Bioinformatics Computing System for genetic sequence analysis of DNA. For processing, we stored a large number of DNA sequences; DNA sequences of different lengths were used for sequential and non-consecutive pattern searches to study the framework's response time obtained using single and multiple PCs. Furthermore, various lengths of DNA sequences were likewise used for pattern identification to compare the response times observed using a single PC and several PCs. The specific goals of the proposed distributed algorithm for the analysis of DNA sequences are to (1) build practical distributed DNA sequence analysis algorithms for pattern matching of DNA gene sequences and identification of sub-sequences, and (2) execute them on a loosely coupled distributed system, for example, a general local- and wide-area network, using a standard programming language. The principal focus of this paper is to propose an algorithm that implements information hiding in DNA sequences to increase complexity and create confusion for hackers.

2 Hadoop Cluster

Hadoop [4] is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across groups of computers. Hadoop is designed to scale up from a single machine to a great many machines, each offering local computation and storage. A small Hadoop Cluster (HC) [5] has a single master node server and a number of client nodes. In a larger HC, HDFS nodes are managed through a dedicated NameNode server that holds the file-system metadata, and an optional secondary NameNode that can produce snapshots of the NameNode's memory structures, thus preventing file-system corruption and loss of data. The size of HCs has grown quickly since 2005, when the inventors were limited to just 20 to 40 nodes per cluster. At that point, they realized two issues, namely that Hadoop would not achieve its potential until it ran reliably on larger clusters. In the second stage, Yahoo successfully tested Hadoop on a 1000-node cluster and started using it; later, Yahoo and the Apache Software Foundation successfully tested a 4000-node cluster with Hadoop. HC sizes have since grown from 4000 to more than 10,000 nodes across a number of releases.

3 DHCP

DHCP [6] is a protocol that provides quick, automatic, and central management for the distribution of IP addresses [7] within a network. DHCP is also used to configure the subnet mask, default gateway, and DNS server information on a device. The Media Access Control (MAC) address is a unique identifier of the Network Interface Controller (NIC). A network node can have several NICs, but each one has a unique MAC. A network administrator reserves a range of IP addresses for DHCP, and each DHCP client on the LAN is configured to request an IP address from the DHCP server during network initialization. The request-and-grant process uses a lease concept with a controllable timeframe, allowing the DHCP server to reclaim and later reallocate IP addresses that are not renewed. The DHCP server always assigns an IP address to a requesting client from the range defined by the administrator. The DHCP server can also assign a particular IP address depending on each client's identity, based on a predefined mapping set by the administrator.

4 DNA

Deoxyribonucleic Acid (DNA) cryptography [8], which exploits the characteristics of DNA, offers new possibilities and directions for data hiding. This work uses the natural traits of DNA sequences. Properties of DNA, for example complementary pairing and the DNA file, provide another layer of security to the proposed technique. In order to secure sensitive data sent over insecure channels like the Internet, employing some form of data protection is essential. One of the popular strategies for securing data over the Internet is data hiding. Given the increasing number of Internet users, the use of data hiding or steganographic techniques [9] is unavoidable. Before the biological properties of DNA sequences were exploited, embedding a secret message into a host image was the traditional technique of data hiding. The most significant concerns there were the distortions of the image when the host image was modified beyond certain degrees. The key aspect of this work is the use of the biological attributes of DNA sequences.

5 Literature Survey

Alabady [10] implemented a network security model for a cooperative network. The author examined weaknesses, threats, attacks, configuration shortcomings, and protection methods for network security.
Balaraju et al. [11] examined big data technologies and their advances for improved big data handling. Data security is a major concern in the administrative, scientific, research, and business sectors. They also examined the data storage, processing, and security areas and identified the challenges of applying conventional security tools to Hadoop. They recommended a single DNA-based secure node for authentication and metadata management for Hadoop, which is a good solution for strengthening data security and removes NNSE blocks for protecting metadata in the Namenode in Hadoop.
Bin Ali et al. [12] designed and configured a secure campus network. The developed hierarchical structure of the campus network considers different types of security problems that ensure the quality of service.
Pandya et al. [13] described network structure and discussed five primary network topologies: bus, ring, star, tree, and mesh.
Balaraju et al. [14] built an algorithm, Built-in Authentication Based on Access (BABA), as a security instance integrated into a Hadoop node for securing data in HDFS and for immediate metadata protection to prevent leakage of user data in Hadoop. The mechanism contributes a secured Hadoop cluster without applying other security schemes, which also reduces operational cost and computation, increases data security, and gives consistent security solutions for the Hadoop cluster. The enhancement of this work is to reduce the computational burden of the proposed algorithm.
Kennedy et al. [15] implemented a structured network for a small system. The authors simulated the network design using the Cisco Packet Tracer software and the Wireshark protocol analyzer.
Balaraju et al. [16] achieved an entirely new security component, a Secure Authentication Interface (SAI) layer over the Hadoop cluster. As a single security protocol, this interface offers user verification, metadata security, and access control. Compared with current mechanisms, SAI can provide security with a lower computational burden. The authors focused on the security challenges to be addressed for securing big data in a Hadoop cluster through a single, exclusive security mechanism called the Secure Authentication Interface. SAI created a trusted environment inside the HC by verifying both users and their processes.
All the papers surveyed above have proposed various aspects of distributed system structure, topologies, and execution, but they have not examined the issues faced in practical usage. Many authors concentrating on securing the distribution framework in Hadoop, which is ultimately valuable for the data in the Hadoop cluster, have not examined DHCP and MAC concerns in detail.

6 Problem Statement

In any distribution system, including Hadoop distribution structures, nodes are linked through a centralized network switch. Each node in the network has a unique IP address that is dynamically assigned by the DHCP server after collecting the physical address of every connected node; assigning a static IP to each node is hard for the administrator in larger networks. The IP address and hostname are visible to any user working in the distributed system, and such a user may access any node with suitable permissions. The problem with exposed IPs and hostnames is that a hacker may plant a duplicate hostname and duplicate IP address to disturb or malfunction the network by inserting a malicious node. Since there is scope for losing essential data from a distributed system through the IP address, this can be a security hazard.

7 Related Work

This work relates to distributed computing systems and is more particularly directed at the architecture and implementation of a scalable distributed computing environment which facilitates communication between independently operating nodes on a single network.
The primary objective of the research work is to create a layer on top of the distributed system, especially the Hadoop distributed system, so that every node appears in the layer. For setting up the environment, we configured a centralized DHCP server and 250 nodes in the network with different configurations. A multinode HC [17] was configured; the master node was configured within the DHCP server as the Namenode, and the remaining nodes are data nodes. Figure 1 shows the default distribution environment. The developed security layer is also configured within the DHCP server for collecting the hostnames and IP addresses, which are stored there.

Fig. 1 Existing Hadoop cluster by configuring DHCP server

The DNA algorithm within the proposed layer converts the IP address and hostname at different levels and produces a unique key, which is what the user sees, incorporating the physical address (MAC) of the NIC. The key is generated using a highly secured DNA hiding [10] methodology that creates confusion for hackers attempting to access any node of the network, so that the result is a highly secured distributed system.
Key-generation process:
Start:
1. Gather Host, MAC, and IP.
2. Merge Host, MAC, and IP into UNIQ_Key.
3. Translate UNIQ_Key into binary form.
4. Translate the binary form into DNA.
5. Assign a digit to each DNA base to form a number // a = 0, c = 1, g = 2, t = 3.
6. Translate the number into hexadecimal.
7. Form the final UNIQ_Key from the hexadecimal string.
Stop.
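The following Python sketch mirrors this process end to end. It is only an illustration under assumptions: the paper does not fix the bit-to-base mapping or how the DNA digits become hexadecimal, so this sketch maps two bits per base (a, c, g, t) and interprets the digit string as a base-4 number rendered in hex; the function and variable names are ours.

import socket
import uuid

BIN_TO_DNA = {"00": "a", "01": "c", "10": "g", "11": "t"}
DNA_TO_DIGIT = {"a": "0", "c": "1", "g": "2", "t": "3"}

def generate_unique_key(hostname, mac, ip):
    merged = f"{mac}-{ip}-{hostname}"                      # Step 2: merge Host, MAC, IP
    binary = "".join(f"{ord(ch):08b}" for ch in merged)    # Step 3: 8 bits per character
    dna = "".join(BIN_TO_DNA[binary[i:i + 2]]              # Step 4: 2 bits per DNA base
                  for i in range(0, len(binary), 2))
    digits = "".join(DNA_TO_DIGIT[b] for b in dna)         # Step 5: a=0, c=1, g=2, t=3
    return format(int(digits, 4), "X")                     # Step 6: base-4 value as hex

if __name__ == "__main__":
    host = socket.gethostname()
    mac = ":".join(f"{(uuid.getnode() >> s) & 0xFF:02X}" for s in range(40, -1, -8))
    print(generate_unique_key(host, mac, "10.50.4.8"))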
Table 1 displays the key-generation process at its different stages, designed to create confusion for hackers. The hostname of a node is eight characters; it is dynamically assigned from the generated unique key, changed every 7 days, and updated automatically in the central table; a sketch of this weekly rotation follows. Table 2 displays the hostname and status information of the nodes. Thus, hackers are confused when trying to access a particular node by hostname in the distributed system. The essential advantage of the proposed technique is that there is no permanent hostname with which to access a node and its shared data.
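As a concrete illustration of the 7-day rotation, the hedged sketch below slides a fixed-width window over the generated key, indexed by the ISO week number. The window width and indexing scheme are assumptions (the hostnames in Table 2 are nine characters, while the text above says eight, so the width is left as a parameter).

import datetime

def current_hostname(unique_key, width=9):
    week = datetime.date.today().isocalendar()[1]      # ISO week number, changes weekly
    start = (week * width) % (len(unique_key) - width)
    return unique_key[start:start + width]             # hostname slice for this week

key = ("5CC3148053C4906F901A54B4DE8225FD8071B87BA02E51C8E7B2C2BDE75CC314"
       "7A2CCCD09B85CEA1AB7E9E03DF1AEC6D5947237BCFA358FB2DED")
print(current_hostname(key))   # a 9-character hostname drawn from the key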
The secured layer additionally records node information by maintaining a central table along with the hostname. Internally, every node is linked with the other nodes through its apparent hostname, managed by the security layer. All these hostnames in the network are maintained by the secure layer, together with the storage status and processing configuration, which are periodically updated in the central table. Figure 2 shows the secure layer within the DHCP/Namenode server, along with the unique hostnames of every node. The secure layer updates the central table whenever a new node is added to, or an existing node is removed from, the distributed system.

8 Result Analysis

The overall performance is analyzed on a small distributed system with the proposed secure layer; the existing security techniques do not concentrate on hiding
Table 1 Procedure for hiding system information using DNA

| Step | Description | Value |
| 01 | Node sensitive data | MAC 10:78:D2:55:95:A8; IP 10.50.4.8; Hostname HD_DN01 |
| 02 | Merged data | 10:78:D2:55:95:A8-10.50.4.8-HD_DN01 |
| 03 | Binary form | 00110001 00110000 00111010 00110111 00111000 00111010 01000100 00110010 00111010 00110101 00110101 00111010 00111001 00110101 00111010 01000001 00111000 00101101 00110001 00110000 00101110 00110101 00110000 00101110 00110100 00101110 00111000 00101101 00100000 01001000 01000100 01011111 01000100 01001110 00110000 00110001 |
| 04 | DNA form | ATACATAAATGCATGTATGAATGGCACAATAGATGGATCCATCCATGGATGCATCCATGGCAACATGAAGTCATACATAAAGAGATCCATAAAGAGATCAAGAGATGAAGACAGAACAGACACACCTTCACACATGATAAATAC |
| 05 | A = 0, C = 1, G = 2, T = 3 | 0301 0300 0321 0323 0320 0322 1010 0302 0322 0311 0311 0322 0321 0311 0322 1001 0320 0231 0301 0300 0202 0311 0300 0202 0310 0202 0320 0201 0200 1020 1010 1133 1010 1030 0300 0301 |
| 06 | Decimal to hexadecimal | 5CC3148053C4906F901A54B4DE8225FD8071B87BA02E51C8E7B2C2BDE75CC3147A2CCCD09B85CEA1AB7E9E03DF1AEC6D5947237BCFA358FB2DED |
| 07 | Unique key | 5CC3148053C4906F901A54B4DE8225FD8071B87BA02E51C8E7B2C2BDE75CC3147A2CCCD09B85CEA1AB7E9E03DF1AEC6D5947237BCFA358FB2DED |

Table 2 Nodes status information in central server


MAC Node_Hostname Status Joined/removed
A4-1F-72-58-BB-01 A1AB7E9E0 Active 14-MAR-2020
F5-10-72-58-BB-01 3DF1AEC6D Active 14-MAR-2020
C4-1F-B2-58-BB-01 5947237BC Active 14-MAR-2020
A4-1F-72-58-BB-01 FA358FB2D Removed 23-JUN-2020
A4-1F-72-58-BB-01 47237BCFA Active 14-MAR-2020
A4-1F-72-58-BB-01 CC3147A2C Active 14-MAR-2020
A4-1F-72-58-BB-01 BDE75CC31 Active 14-MAR-2020
A4-1F-72-58-BB-01 2E51C8E7B Removed 05-JUL-2020

Fig. 2 Proposed Hadoop cluster by dynamic hostnames

node data. Table 3 shows the node and user access information in the configured environment. Figure 3 compares the node and user data in the existing and the developed system (Figs. 3 and 4).

Table 3 Nodes and users information

| Information | Existing | Proposed (security layer) |
| Nodes | 250 | 250 |
| Users | 207 | 252 |
| Nodes accessed | 225 | 9 |
| Results (data access nodes) | 180 | 0 |

[Bar chart "Nodes and Users Information": nodes, users, nodes accessed by users, and results (data access nodes) for the existing and proposed systems]

Fig. 3 Nodes and user data existing and proposed

[Bar chart "Security Performance, Existing and Proposed": nodes, users, nodes accessed by users, and results (data access nodes)]

Fig. 4 Performance evolution of existing and proposed

This security layer offers 24 × 7 security for the HC, which is very beneficial for a small distributed system to keep its data secure. Figure 4 shows the improved security performance of the above-configured environment. This increases data security, streamlines operations, and reduces the maintenance burden.

9 Conclusion

In this research work, we have implemented a secured distributed system with the help of a DHCP server and a combination of IP, host, and MAC. The distributed system network allots a dedicated unique ID to each node for securing the network along with its data. Only the administrator has complete privileges to access all the nodes including the server; others cannot access any node in the network without knowing the IP or host. The performance of network management and of the distributed system is improved, and ultimately data security is enhanced. The complete network is planned to be automated, involving minimal user intervention. The future scope of this work is to extend the same security layer to WAN networks located in different places.

References

1. Yang, H., Lee, J.: Secure distributed computing with straggling servers using polynomial codes.
IEEE Trans. Inf. Forensics Secur. 14(1), 141–150 (2019). https://doi.org/10.1109/TIFS.2018.
2846601
2. Khanan, A., Abdullah, S., Mohamed, A.H.H.M., Mehmood, A., Ariffin, K.A.Z.: Big data
security and privacy concerns: a review. In: Al-Masri, A., Curran, K. (eds.) Smart technologies
and innovation for a sustainable future. Advances in Science, Technology & Innovation (IEREK
Interdisciplinary Series for Sustainable Development). Springer, Cham (2019). https://doi.org/
10.1007/978-3-030-01659-3_8
3. Roy, M., et al.: Data security techniques based on DNA encryption. In: Chakraborty, M.,
Chakrabarti, S., Balas, V. (eds.) Proceedings of International Ethical Hacking Conference
2019. eHaCON 2019. Advances in Intelligent Systems and Computing, vol. 1065. Springer,
Singapore (2020). https://doi.org/10.1007/978-981-15-0361-0_19
4. Samet, R., Aydın, A., Toy, F.: Big data security problem based on Hadoop framework. In:
2019 4th International Conference on Computer Science and Engineering (UBMK), Samsun,
Turkey, pp. 1–6 (2019). https://doi.org/10.1109/UBMK.2019.8907074
5. Akhgarnush, E., Broeckers, L., Jakoby, T.: Hadoop: A standard framework for computer cluster.
In: Liermann, V., Stegmann, C. (eds.) The Impact of Digital Transformation and FinTech on
the Finance Professional. Palgrave Macmillan, Cham (2019). https://doi.org/10.1007/978-3-
030-23719-6_18
6. Rajput, A.K., Tewani, R., Dubey, A.: The helping protocol “DHCP”. In: 2016 3rd International
Conference on Computing for Sustainable Global Development (INDIACom), New Delhi,
pp. 634–637 (2016)
7. Mohsin, M., Prakash, R.: IP address assignment in a mobile ad hoc network. In: MILCOM
2002. Proceedings, Anaheim, CA, USA, vol. 2, pp. 856–861 (2002). https://doi.org/10.1109/
MILCOM.2002.1179586
8. Sajisha, K.S., Mathew, S.: An encryption based on DNA cryptography and steganography.
In: 2017 International Conference of Electronics, Communication and Aerospace Tech-
nology (ICECA), Coimbatore, pp. 162–167 (2017). https://doi.org/10.1109/ICECA.2017.8212786
9. Kar, N., Mandal, K., Bhattacharya, B.: Improved Chaos-based video steganography using
DNA alphabets. ICT Express 4(1), 6–13 (2018). ISSN 2405-9595. https://doi.org/10.1016/j.
icte.2018.01.003
10. Alabady, S.: Design and implementation of a network security model for cooperative network.
Int. Arab J. Technol. 1(2) (2009)

11. Balaraju, J., Rao, P.V.V.P.: Recent advances in big data storage and security schemas of HDFS:
a survey (2018)
12. Bin Ali, M.N., Hossain, M.E., Masud Parvez, Md.: Design and implementation of a secure campus network. Int. J. Emerg. Technol. Adv. Eng. 5(7) (2015). www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal)
13. Pandya, K.: Network structure or topology. Int. J. Adv. Res. Comput. Sci. Manage. Stud. 1(2),
6 (2013)
14. Balaraju, J., Prasada Rao, P.V.R.D.: Designing authentication for Hadoop cluster using DNA
algorithm. Int. J. Recent. Technol. Eng. (IJRTE) 8(3) (2019). ISSN: 2277-3878. https://doi.org/
10.35940/ijrte.C5895.0983
15. Offor, K.J., Obi, P.I., Nwadike, K.T., Okonkwo, I.I.: Int. J. Eng. Res. Technol. (IJERT) 2(8) (2013)
16. Balaraju, J., Rao, P.: Innovative secure authentication interface for Hadoop cluster using DNA
cryptography: a practical study (2020). https://doi.org/10.1007/978-981-15-2475-2_3
17. Gugnani, S., Khanolkar, D., Bihany, T., Khadilkar, N.: Rule based classification on a multi node
scalable Hadoop cluster. In: Fortino, G., Di Fatta, G., Li, W., Ochoa, S., Cuzzocrea, A., Pathan,
M. (eds.) Internet and distributed computing systems. IDCS 2014. Lecture Notes in Computer
Science, vol. 8729. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11692-1_15
18. Demidov, V.V.: Hiding and storing messages and data in DNA. In: DNA beyond Genes.
Springer, Cham (2020). https://doi.org/10.1007/978-3-030-36434-2_2
A Scientometric Inspection of Research
Based on WordNet Lexical During
1995–2019

Minni Jain, Gaurav Sharma, and Amita Jain

Abstract The purpose of this study is to conduct a scientometric inspection for the
research based on WordNet lexicon. WordNet is a lexical database for the English
Language developed by George A. Miller and his team at Princeton University in
1985. This study reviews WordNet-based research mainly published in the Web of
Science (WoS) database from 1995 to 2019. The publication data has been analyzed
computationally to present year-wise publications, publications growth rate, country-
wise research publications, authors who published a large number of papers in this
field. The results have been shown in the form of figures. The present study will be
useful for a better understanding of the trajectory of the research work done in the
field of WordNet during the last 25 years and shall highlight the prominent research
areas and categories in the field of WordNet.

1 Introduction

WordNet is a lexical resource for the English language. It groups English words into sets of synonyms known as synsets. WordNet has been widely used in the fields of Natural Language Processing (NLP), Artificial Intelligence (AI), text analysis, machine translation, information retrieval, etc. in the recent past [1]. A common use of WordNet is to determine the similarity between words. It was developed at Princeton University under the direction of Prof. George A. Miller in 1985. Since then, a lot of research has been done to strengthen WordNet using semantic relations between its different components. It is a combination of a dictionary and a thesaurus,
M. Jain · G. Sharma (B)


Department of Computer Science and Engineering, Delhi Technological University, Delhi
110042, India
M. Jain
e-mail: [email protected]
A. Jain
Department of Computer Science and Engineering, Ambedkar Institute of Advanced
Communication Technologies and Research, Delhi 110031, India


because unlike dictionary it also found the meaningful relations between words.
WordNet allows the different lexical categories such as noun, verbs, adverbs, and
adjectives however it ignores the propositions and determiners [2]. It will generate
the valid meaningful sentence for the use of a set of synsets to link with the semantic
relations. WordNet includes mainly two terms word forms and sense pairs. Word
forms is a string over the finite alphabets and sense pair is the set of meanings. This
word forms with a particular sense is known as “word” in the English language [1].
So more than 166,000 word forms and sense pairs are used in the WordNet [3].
Semantic relations are relations between meanings, between two sentences, or between two words [4]. Some semantic relations are given below; a short code illustration follows the list:
• Synonymy is the set of synonyms (a synset) used to represent word senses.
• Antonymy relates word forms with opposite meanings.
• Hyponymy (sub-name) is the inverse of hypernymy (super-name). It is a transitive relation.
• Meronymy (part-name) is the inverse of holonymy (whole-name). It expresses member parts.
• Troponymy (manner-name) is used for verbs. Entailment expresses relations between verbs.
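These relations can be explored directly; the short example below (ours, not from the surveyed papers) queries them through the NLTK interface to the Princeton WordNet, assuming the nltk package and its wordnet corpus are installed.

from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]                 # first noun synset, dog.n.01
print(dog.lemma_names())                   # synonymy: members of the synset
print(dog.hypernyms())                     # hypernymy (super-name)
print(dog.hyponyms()[:3])                  # hyponymy (sub-name)
print(dog.part_meronyms())                 # meronymy (part-name)
print(wn.synsets("good")[0].lemmas()[0].antonyms())   # antonymy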
These semantic relations are represented by pointers between word forms or synsets; more than 116,000 pointers are used in WordNet. WordNet is open source and freely available to everyone on the Internet, and WordNet 3.1 (see footnote 1) is the latest version. The growing volume of research work on WordNet since its inception inspires us to analyze the progress of research work on WordNet. In this paper, a scientometric analysis has been performed computationally to highlight the role of WordNet in the development of language technology. The present study shall be helpful in understanding the trajectory of the research work done on WordNet over a large span, from 1995 to 2019. The source data on WordNet research publications has been taken from the Web of Science (WoS). We analyzed the data obtained from the Web of Science (WoS) computationally to identify the year-wise research publications on WordNet and represented them graphically for better understanding. The WordNet research publication data has also been analyzed to identify author-wise publications.
These findings are represented in the form of graphs for better understanding. Thus, the aim of the present study is to provide an analytical account of the progress of the research work, for example, the major areas of research work, major concepts, and major approaches to research on WordNet during the period 1995–2019 [5–19]. This study shall be useful to understand how and why the research work on WordNet has grown over time and which countries, organizations/institutions, and authors have contributed most substantially. It will also help to identify the major potential research areas and applications of WordNet. Section 2
1 https://wordnetweb.princeton.edu/perl/webwn.

describes the data and methodology; Sect. 3 highlights the results of the study with
corresponding visualizations, and conclusions are given in Sect. 4.

2 Data and Methodology

For the scientometric analysis of WordNet, the data was retrieved from the Web of Science (WoS). The total number of papers retrieved is 947, of which 104 papers were filtered out. The collected data covers the time span 1995–2019 [5–19]. Web of Science (WoS) is a large database of documents, articles, reviews, proceedings papers, editorial materials, etc. [5]. So, this paper analyzes about 25 years of work on WordNet using tables and graphical representations.
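As a hypothetical illustration of how the year-wise counts in the next section can be derived from such a WoS export, the pandas sketch below assumes a CSV file and the common WoS field tags PY (publication year) and TC (times cited); the file name is made up, and plotting requires matplotlib.

import pandas as pd

df = pd.read_csv("wos_wordnet_1995_2019.csv")            # assumed WoS export
per_year = df.groupby("PY").agg(NOP=("PY", "size"),      # papers per year
                                TC=("TC", "sum"))        # citations per year
per_year["ACPP"] = per_year["TC"] / per_year["NOP"]
per_year.plot(y="NOP", kind="bar", legend=False, title="Year-wise publications")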

3 Analytical Study

This section describes the details of various important indicators computed through
the analysis of data.

3.1 Year-Wise Publications

First of all, we measured the number of papers published on WordNet in each of the years 1995–2019. Figure 1 shows the number of published papers on WordNet on a year-wise plot, alongside the number of publications (NOP), total citations, and average citations per year. We can observe that the graph is more or less flat till 2002, after which there is a steep rise. From 2006 to 2007 there is a drastic fall in the number of papers. After that, the graph increases consistently, with slight ups and downs from year to year.

Fig. 1 Year-wise publications graph

3.2 Country-Wise Publications

We analyzed the country-wise contributions to WordNet research during the 25-year period from 1995 to 2019. The research paper publications can be seen in terms of record count in the Web of Science (WoS). The topmost countries by research paper record count are illustrated in the graph in Fig. 2; in the graph, the colors indicate the countries and the values are the record counts. The highest number of published papers (record count) on WordNet belongs to the USA. Many countries follow: China is the second topmost country with a high number of published papers on WordNet, and Spain is only slightly below China. After Spain, there is a large gap to the other countries. The other major countries from which WordNet work is reported include Italy, South Korea, England, Canada, Germany, France, India, etc. These are the top 10 countries included in this paper.

Fig. 2 Country-wise publication graph



3.3 Top Organization Publications

After analyzing WordNet research publications to extract the country-level results, we tried to understand the organization-level research output. First, we identify the top organizations contributing significantly to WordNet work during the 1995–2019 period. We list the top 10 organizations in descending order of the number of publications originating from them. The data covers four different indicators, namely NOP (Number of Papers), TC (Total Citations), ACPP (Average Citations Per Paper), and h-index, for the WordNet research output originating from various organizations. The values of NOP, TC, ACPP, and h-index are computed from the Web of Science (WoS). The ACPP value is calculated as:

ACPP = TC/NOP    (1)
The h-index is an author-level metric that measures both the productivity and the citation impact of the published work of a scientist or scholar. The h-index can also be calculated for institutions, journals, etc. We can observe that the Universitat D Alacant has the highest number of research papers on the topic of WordNet, followed by the Chinese Academy of Science in the second position by a small margin (Fig. 3).
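For clarity, the snippet below computes the four indicators for one organization from a hypothetical list of per-paper citation counts; the values are made up and only illustrate the definitions of ACPP in Eq. (1) and the h-index.

def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

papers = [52, 30, 18, 11, 7, 4, 2, 0]    # hypothetical citation counts
nop, tc = len(papers), sum(papers)
acpp = tc / nop                          # Eq. (1)
print(nop, tc, round(acpp, 2), h_index(papers))   # -> 8 124 15.5 5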
We selected only the top 10 organizations. Universitat D Alacant has the largest number of papers (18), followed by the Chinese Academy of Science with the second-highest count (13). Four organizations (Chosun University, Consiglio Nazionale Delle Ricerche CNR, Universitat Politecnica De Valencia, and Universitat Rovira I Virgili) have the same number of papers (12), and the last four organizations (Centre National De La Recherche Scientifique, Instituto Politecnico Nacional Mexico, Princeton University, and University of Basque Country) also have the same number of papers (11).

Fig. 3 Area graph for organization-wise publications



Fig. 4 Treemap of author-wise publications

3.4 Author-Wise Publications

We also analyzed the WordNet research publication data to identify the most productive and most cited authors. We define highly productive authors as those who published a large number of research papers during the period 1995–2019. We present the list of the top 15 most productive authors, again using the four indicators NOP (Number of Papers), TC (Total Citations), ACPP (Average Citations Per Paper), and h-index. We can observe that Kim P. is the most productive author on WordNet research during 1995–2019: he published 12 papers, his total citations are 88, his ACPP is 7.33, and his h-index is 6. He is followed by Rosso P., who published 11 papers with 77 total citations, an ACPP of 7, and an h-index of 4. However, the most productive author is not necessarily the most cited author. Sanchez D. published 10 papers with 488 total citations, an ACPP of 48.8, and an h-index of 10. According to the Web of Science (WoS), the most productive author is Kim P. with 12 papers published on WordNet, the author with the highest total citations is Weikum G. with 521, the author with the highest average citations per paper is Weikum G. with 74.43, and the author with the highest h-index is Sanchez D. with 10. For better understanding, the treemap chart in Fig. 4 shows the top 15 most productive authors with their NOP.

4 Conclusions

This paper presents an analytical study of the field of WordNet research. We analyzed the research publications on WordNet, with data extracted from the Web of Science (WoS) for the time span 1995–2019. The analytical study helped to identify the year-wise publications, country-wise contributions, top organization publications, and top author-wise publications. In this paper, we used figures to present the results for a better understanding of the research work done in the field of WordNet. From the figures, we conclude that the USA, China, and Spain are the most productive countries, each with a record count of more than 100 in the field of WordNet research, while India is placed at number 10 with a record count of 32. The year-wise analysis shows that the highest number of research papers was published in the year 2016, while remarkable growth in research publications has been observed since 2003. It appears that the extensive use of WordNet as a tool has attracted computer science researchers and software developers. The author-wise treemap shows that Kim P., Rosso P., and Sanchez D. are the only authors who published research papers in double digits. Universitat D Alacant is the topmost organization in terms of the number of research papers published in the field of WordNet, while the Chinese Academy of Science and Chosun University are the second and third topmost organizations. Princeton University, the home organization of WordNet, is at the 9th place in the list of the top 10 organizations researching WordNet. The analysis is expected to provide an analytical guide for researchers working in this domain as they explore the discipline. This paper will also be helpful for beginners who are new to this field and want to do research on WordNet. In time, the country-wise contributions, year-wise publications, and author-wise publications will change as research in this field continues, because this paper analyzed only a 25-year (1995–2019) time span.

References

1. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38, 39–41 (1995).
https://doi.org/10.1145/219717.219748
2. Miller, G.A., Hristea, F.: WordNet nouns: classes and instances. J. Comput. Linguist. 32, 1–3
(2006). https://doi.org/10.1162/coli.2006.32.1.1
3. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: WordNet: an on-line lexical
database. Int. J. Lexicogr. 3, 235–312 (1993)
4. Wei, T., Lu, Y., Chang, H., Zhou, Q., Bao, X.: A Semantic approach for text clustering using
WordNet and lexical chains. Expert Syst. Appl. Int. J. 42, 2264–2275 (2015). https://doi.org/
10.1016/j.eswa.2014.10.023
5. Hadj Taieb, M.A., Ben Aouicha, M., Ben Hamadou, A.: A new semantic relatedness measurement using WordNet features. Knowl. Inf. Syst. 41, 467–497 (2014). https://doi.org/10.1007/s10115-013-0672-4
6. Montoyo, A., Palomar, M., Rigau, G.: Interface for WordNet enrichment with classification
systems. Database Expert Syst. Appl. 2113, 122–130 (2001)
7. Gomes, P., Pereira, F.C., Paiva, P., Seco, N., Carreiro, P., Ferreira, J.L., Bento, C.: Advances in
Artificial Intelligence, vol. 2671, pp. 537–543 (2003)
8. Miller, G.A., Fellbaum, C.: WordNet then and now. Lang. Resour. Eval. 41, 209–2214 (2007).
https://doi.org/10.1007/s10579-007-9044-6
9. Fellbaum, C., Vossen, P.: Challenges for multilingual WordNet. Lang. Resour. Eval. 46,
313–326 (2012). https://doi.org/10.1007/s10579-012-9186-z
10. Otegi, A., Arregi, X., Ansa, O., Agirre, E.: Using knowledge based relatedness for information
retrieval. Knowl. Inf. Syst. 44, 689–718 (2015). https://doi.org/10.1007/s10115-014-0785-4

11. Gomes, P., Pereira, F.C., Paiva, P., Seco, N., Carriero, P., Ferreira, J.L., Bento, C.: Noun sense
disambiguation with WordNet for software design retrieval. Advances in Artificial Intelligence,
vol. 2671, pp. 537–543 (2003)
12. Vij, S., Tayal, D., Jain, A.: A machine learning approach for automated evaluation of short answers using text similarity based on WordNet graphs. Wirel. Pers. Commun. (2019). https://doi.org/10.1007/s11277-019-06913-x
13. Rudnicka, E., Piasecki, M., Bond, F., Grabowski, L., Piotrowski, T.: Sense equivalence in
PLWordNet to Princeton WordNet mapping. Int. J. Lexicogr. 32, 296–325 (2019). https://doi.
org/10.1093/ijl/ecz004
14. Jiang, Y.C., Yang, M.X., Qu, R.: Semantic similarity measures for formal concept analysis
using linked data and WordNet. Multimedia Tools Appl. 78, 19807–19837 (2019). https://doi.
org/10.1007/s11042-019-7150-2
15. Zhu, N.F., Wang, S.Y., He, J.S., Tang, D., He, P., Zhang, Y.Q.: On the suitability of WordNet
to privacy management. Wirel. Pers. Commun. 103, 359–378 (2018). https://doi.org/10.1007/
s11277-018-5447-5
16. Cai, Y.Y., Zhang, Q.C., Lu, W., Che, X.P.: A hybrid approach for measuring semantic similarity
based on IC-weighted path distance in WordNet. J. Intel. Inf. Syst. 51, 23–47 (2018). https://
doi.org/10.1007/s10844-017-0479-y
17. Ehsani, R., Solak, E., Yildiz, O.T.: Constructing a WordNet for Turkish using manual and
automatic annotation. ACM Translations on Asian and Low-Resource Language Information
Processing, vol. 17 (2018). https://doi.org/10.1145/3185664
18. Guinovert, X.G., Portela, M.A.S.: Building the Galician WordNet: methods and applications.
Lang. Resour. Eval. 52, 317–339 (2018). https://doi.org/10.1007/s10579-017-9408-5
19. Vij, S., Jain, A., Tayal, D., Castillo, O.: Fuzzy logic for inculcating significance of semantic
relations in word sense disambiguation using a WordNet graph. Int. J. Fuzzy Syst. 20, 444–459
(2018). https://doi.org/10.1007/s40815-017-0433-8
Sentiment Analysis of an Online
Sentiment with Text and Slang Using
Lexicon Approach

Shelley Gupta, Shubhangi Bisht, and Shirin Gupta

Abstract Sentiment analysis is a technique that helps data analysts to analyze the
various online users’ opinions about a particular product or service. There are various
approaches for performing sentiment analysis. Lexicon-based approach uses senti-
ment lexicons to calculate the polarity of the sentences. It is observed that nowadays
online lexicons consist of text and online slang as well. The proposed approach
calculates the polarity of the sentence by evaluating a polarity score of sentiments
with text and slang. The sentiment polarity of tweets is also evaluated using the
machine learning classification techniques like SVM, Random Forest, and Linear
regression with accuracy, recall, precision, and F-score parameters. The accuracy
of the proposed approach has been evaluated to 96% for Twitter dataset containing
17,452 tweets and 97% for other social media sentiments.

1 Introduction

Sentiment analysis is an excellent way to get to know the reaction and feelings of
people (particularly consumers) about any product, topic or idea [1, 4] expressed on
review sites, e-commerce sites, online opinion sites, or social media like Facebook,
Twitter, etc. Sentiment analysis is employed using three broad techniques: Lexical
based [3, 9, 10, 17], machine learning based [9, 10, 17] and hybrid/combined [2].
Lexicon-based analysis is governed by matching the new tokens with pre-defined
dictionaries that have existing tagged lexicons as positive, negative, or neutral [11]
polarity and score. Machine Learning approach as stated in [2, 9, 17] is the most
popular approach for sentiment analysis due to its accuracy and ease of adaptability.
Generally labeled and sizable datasets are used which requires human annotators
that are expensive and time-consuming. Hybrid/combined [2] approach combines

S. Gupta (B) · S. Bisht · S. Gupta


ABES Engineering College, Ghaziabad, Uttar Pradesh, India
e-mail: [email protected]
S. Bisht
e-mail: [email protected]


the accuracy of the machine learning approach with the speed of the lexicon-based approach, with the aim of making it more accurate.
The use of slang and abbreviations for expressing one's opinion saves time and space within a word limit. Most of the existing sentiment analysis approaches remove slang during pre-processing of the dataset, thus removing the impact of online slang on the sentiment score evaluation. Our belief is that including slang when determining the polarity and score of a sentiment, at both the sentence and document level, can enhance the sentiment scores of different approaches.
Our distinct contributions in this paper can be summarized as follows: (i) Our approach is a rule-based approach that deals with slang by replacing it with its meaning in an online sentiment and then calculating the sentiment score at the sentence level. (ii) The approach determines the average score of the total number of positive and negative sentences to evaluate the sentiment score and polarity of an online document. (iii) Our approach outperforms the existing approaches as it calculates the sentiment score using an online slang dictionary at both the sentence and document level.
The paper is organized as follows: Sect. 2 describes the literature review of sentiment analysis and the background of this work. Section 3 demonstrates the proposed approach. Section 4 demonstrates the score calculation and result evaluation. Sections 5 and 6 present the conclusion and the limitations with future work.

2 Literature Review

A large proportion of the population uses the quick medium of English slang and abbreviations while expressing their emotions on social media platforms. Very few existing sentiment analysis techniques have considered commonly used slang while calculating the polarity of sentences.
General Inquirer (GI) [12] is a text mining approach that uses one of the oldest manually constructed lexicons. This approach classified a large number of words (11 k) into a smaller set of categories (183 or more). GI has been widely used to automatically determine the properties of text.
LIWC analyzes features of spoken and written language. It works by revealing patterns in speech, calculating the frequency with which words of different categories are used relative to the total number of words in the sentence [13].
SentiWordnet 3.0 [5] is an enriched version of SentiWordNet 1.0 [8] which
supports sentiment classification. The whole process of SentiWordNet 3.0 [5] uses
semi-supervised and random walk processes instead of manual glosses.
Vader [6] successfully calculates the overall polarity of sentences by taking into consideration the score of each word of the sentence and performing a qualitative analysis to identify those properties of text which affect the intensity of the sentence. It indicates how positive or negative the sentiment is. However, a shortcoming of Vader [6] is its use of a very limited number of slang terms.
Senti-N-Gram [11] extracts the n-gram scores from a customer-generated data
using a ratio-based approach which depends on the number of positive and negative
sentences in a given dataset. It worked on n-gram lexicons with successfully handling
the cases of intensifiers and negation.
However, the above approaches say little about slang. Thus, this paper suggests an approach to sentiment analysis that combines the abilities of Vader [6] with slang handling at both the sentence and document level.

3 Proposed Approach

The steps of the proposed approach are elaborated below and shown in Fig. 1, and a condensed Python sketch follows the steps. Figure 2 defines the symbols used in the algorithm of Fig. 3.
1. Initialization: A slang dictionary is initialized considering near about 300 most
commonly used slangs. The keys correspond to the slang and the values corre-
spond to the meaning of the respective slang. Every slang is considered in
lowercase and uppercase letters as well, e.g. lol: laugh out loud, LOL: laugh
out loud. The existing Vader [6] dictionary of common lexicons created by
using well-established word-banks, LIWC [13], ANEW [18] and GI [12] is
also initialized.
2. Pre-processing of Dataset: The links of various images and emojis are removed
from the dataset. The tweets with only image links were replaced by a numeric
data type NaN (Not a Number).
3. Slang Replacement: In the pre-processed dataset, whenever a slang term is detected it is replaced with its meaning. For example, the sentence "The joke you cracked was funny!!! hahaha lol." becomes "The joke you cracked was funny!!! hahaha laugh out loud."
4. Score Calculation (Sentence Level): The score of each sentence is calculated
on the basis of the lexicons [6]. This provided a polarity to each sentence as
positive, negative, or neutral considering the following five cases of punctuation,
capitalization, degree modifiers, contrastive conjunction, and negation [6].
5. Final Score Calculation (Normalization): The sentiment score of each sentence lies between −4 and +4. A normalization function is then applied to map each score to a value between −1 and +1. The normalization function used is the Hutto normalization function [6], norm_score = score/√(score² + α), where α is a normalization constant (15 in Vader's implementation).
6. Score Calculation (Document Level): Suppose an online document consists of
‘n’ number of tweets. The score and polarity of the document are determined
on the basis of the average score of positive sentences or negative sentences.
If the average score of positive sentences is greater than the average score of
negative sentences, the polarity of the document is positive or vice versa.
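The sketch below condenses steps 1–6 into runnable Python, assuming the vaderSentiment package cited above [6, 15]; the three-entry slang dictionary is a stand-in for the ~300-entry one described, and a full implementation would also strip punctuation before the dictionary lookup.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

SLANG = {"lol": "laugh out loud", "tbh": "to be honest", "gw": "good work"}
analyzer = SentimentIntensityAnalyzer()

def replace_slang(sentence):
    # Step 3: substitute each slang token with its dictionary meaning
    return " ".join(SLANG.get(w.lower(), w) for w in sentence.split())

def sentence_score(sentence):
    # Steps 4-5: VADER's compound score is already normalized to [-1, +1]
    return analyzer.polarity_scores(replace_slang(sentence))["compound"]

def document_polarity(sentences):
    # Step 6: compare average positive vs. average negative sentence scores
    scores = [sentence_score(s) for s in sentences]
    pos = [s for s in scores if s > 0]
    neg = [abs(s) for s in scores if s < 0]
    asps = sum(pos) / len(pos) if pos else 0.0
    asns = sum(neg) / len(neg) if neg else 0.0
    return ("positive", asps) if asps >= asns else ("negative", asns)

print(document_polarity(["LOVED your work gw", "I do not like your tone"]))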

Fig. 1 Proposed approach

4 Demonstration for Score Calculation

Table 1 contains online sentiments covering all five cases of quantitative analysis, i.e., punctuation, negation, capitalization, intensifiers, and contrastive conjunction. The polarity scores of these comments are the same for our proposed approach and Vader [6] for online sentiments without slang. However, the polarity scores of the proposed approach and Vader [6] differ considerably for online sentiments with slang.

Fig. 2 Symbols used in algorithm

# Step 1: Initialization
dict := {slang_0: meaning_0, slang_1: meaning_1, ..., slang_l: meaning_l}
df := sheet["Tweets"]
# Step 2: Eliminating images and emojis
for i := 0 to n-1 do:
    text := df.iloc[i]
    df["Tweets without image"][i] := strip_image(text)
end
for i := 0 to n-1 do:
    text := df.iloc[i, "Tweets without image"]
    df["Tweets without emoji"][i] := strip_emoji(text)
end
# Step 3: Replacing slangs
for i := 0 to n-1 do:
    sentence := df.iloc[i, "Tweets without emoji"]
    for each word in sentence do:
        if word is a key of dict then:
            replace word with dict[word]
    end
    df["Tweets with replaced slangs"][i] := sentence
end
# Step 4: Score calculation (sentence level)
for i := 0 to n-1 do:
    score := average_score(df.iloc[i, "Tweets with replaced slangs"])
end
# Step 5: Applying the normalization function
for i := 0 to n-1 do:
    norm_score := score / sqrt(score * score + alpha)
end
# Step 6: Score calculation (document level)
ASNS := avg_neg(); ASPS := avg_pos()
if ASPS > ASNS then: document_score := ASPS
else: document_score := ASNS
end
Fig. 3 Proposed algorithm



Table 1 Demonstration of score calculation

| Without slangs | Proposed/Vader approach score | With slangs | Proposed approach score | Vader approach score |
| LOVED your work | 0.6841 | TBH LOVED your work gw | 0.8964 | 0.6841 |
| The joke you cracked was funny!!! | 0.7163 | The joke you cracked was funny!!! hahaha lol | 0.9077 | 0.8617 |
| I love you!! | 0.6988 | I Love you!! xoxo | 0.9059 | 0.8684 |
| You were my friend but after the stunt you pulled i do not like you | −0.1444 | You were my friend but after the stunt you pulled I do not like you gth | −0.8387 | 0.6542 |
| I hope both of you are having best time | 0.7964 | I hope both of you are having best time ILYF | 0.9062 | 0.7964 |
| The service here is marginally good | 0.3832 | SRSLY The service here is marginally good | 0.2280 | 0.3832 |
| I do not like your tone | −0.2755 | I do not like your tone, it is vb | −0.7098 | 0.2755 |
| Kobe Bryant was a GREAT basketball player | 0.729 | Kobe Bryant was a GREAT and awsm basketball player | 0.8730 | 0.7034 |
| Food was horrible!!! | −0.6571 | The food was ABS horrible !!! | −0.6877 | −0.6571 |
| Your outfit was good! | 0.4926 | Your outfit was good amz ! | 0.6239 | 0.4926 |

4.1 Experimental Setup

To implement the proposed approach, we use Python and Vader, a lexicon- and rule-based sentiment analysis tool [6, 15]. (i) Tweets of some of the most-followed personalities on Twitter are downloaded to conduct various experiments [14, 16]. The dataset consists of 16,000 tweets of 80 personalities across the world, 40 males and 40 females. The tweets of each personality are stored as a document. (ii) The Facebook dataset consists of 1452 online sentiments (comments) downloaded using Facepager [7].

4.2 Results Evaluation

The tweets of each personality are stored as a document in a single Excel file. The scores of each tweet are evaluated at the sentence level. At the document level, the average score of all the tweets of a personality is taken to determine the polarity and sentiment score of that personality's tweets. The experimental results at the sentence level are illustrated in Tables 2 and 3. Table 2 clearly shows that our proposed approach outperforms the existing approach [6] on the Twitter dataset for SVM, Random Forest, and Linear Regression. Table 3 demonstrates that our approach achieves an accuracy of 97% for SVM and 98% for Random Forest on the Facebook dataset. Thus, it is observed that our proposed approach gives better accuracy when text datasets containing different slangs are evaluated.
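The metrics in Tables 2 and 3 can be computed with scikit-learn as in the hypothetical snippet below; the labels and predictions here are made up and serve only to show how P, R, F, and A are obtained.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["pos", "neg", "pos", "neu", "neg", "pos"]   # gold sentiment labels
y_pred = ["pos", "neg", "pos", "pos", "neg", "neu"]   # classifier output

p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"P={p:.2f} R={r:.2f} F={f:.2f} A={accuracy_score(y_true, y_pred):.2f}")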

5 Conclusion

The proposed approach attempts to calculate the polarity of online sentiments by considering the slangs used by the users in their online sentiments. It evaluates each sentence as negative, positive, or neutral by successfully replacing the slangs with their meanings to produce the correct score. The proposed approach is a further extension of the existing approach Vader [6], overcoming its limitation, i.e., the usage of only common slangs.

6 Limitations and Future Work

Limitations of this approach are: (1) Exaggerated slangs such as "looooool", "rofl!!!!", etc. are not captured. (2) The use of emoticons in sentiments is very common, but our approach does not consider them. (3) We have considered only 300 slangs; however, more could be considered. This paves the way for future work on a framework for sentiment analysis that considers exaggerated slangs, emojis, and many more slangs than 300 for better and more accurate results.
Table 2 Comparative results on Twitter dataset (sentence level); P = precision, R = recall, F = F-score, A = accuracy

| Approach | SVM (P/R/F/A) | Random forest (P/R/F/A) | Linear regression (P/R/F/A) |
| Without slang (Vader) | 0.94/0.94/0.94/0.94 | 0.94/0.94/0.94/0.99 | 0.93/0.923/0.93/0.93 |
| With slang (proposed approach) | 0.95/0.965/0.96/0.95 | 0.94/0.95/0.96/0.95 | 0.95/0.95/0.95/0.95 |

Table 3 Comparative results on Facebook dataset (sentence level); P = precision, R = recall, F = F-score, A = accuracy

| Approach | SVM (P/R/F/A) | Random forest (P/R/F/A) | Linear regression (P/R/F/A) |
| Without slang (Vader) | 0.96/0.96/0.969/0.96 | 0.95/0.95/0.95/0.97 | 0.96/0.96/0.96/0.95 |
| With slang (proposed approach) | 0.98/0.98/0.979/0.975 | 0.96/0.96/0.96/0.98 | 0.98/0.97/0.97/0.97 |

References

1. Liu, B., Zhang, L.: A survey of opinion mining and sentiment analysis. In: Mining Text Data,
pp. 415–463. Springer, Boston, MA (2012)
2. Nagamanjula, R., Pethalakshmi, A.: A novel framework based on bi-objective optimization
and LAN2FIS for Twitter sentiment analysis. Soc. Netw. Anal. Min. 10(34), 34 (2020)
3. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., Stede, M.: Lexicon-based methods for
sentiment analysis. Comput. Linguist. 37, 267–307 (2011). https://doi.org/10.1162/COLI_a_
00049
4. Anjaria, M., Guddeti, R.M.R.: A novel sentiment analysis of social networks using supervised
learning. Soc. Netw. Anal. Min. 4(1), 181 (2014)
5. Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: an enhanced lexical resource for
sentiment analysis and opinion mining. In: Proceedings of LREC, 10 (2010)
6. Hutto, C.J., Gilbert, E.: VADER: a parsimonious rule-based model for sentiment analysis of social media text. In: Eighth International Conference on Weblogs and Social Media (ICWSM-14) (2014). https://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
7. Facepager. Get the software safe and easy (2020). Retrieved 8 June 2020, from https://facepager.software.informer.com/3.6/
8. Esuli, A., Sebastiani, F.: Sentiwordnet: a publicly available lexical resource for opinion mining.
In: LREC, vol. 6, pp. 417–422 (2006)
9. Korayem, M., Aljadda, K., Crandall, D.: Sentiment/subjectivity analysis survey for languages
other than English. Soc. Netw. Anal. Min. 6(1), 75 (2016)
10. Wang, Z., Lin, Z.: Optimal feature selection for learning-based algorithms for sentiment
classification. Cogn. Comput. 12(1), 238–248 (2020)
11. Dey, A., Jenamani, M., Thakkar, J.J.: Senti-N-Gram: an n-gram lexicon for sentiment analysis.
Expert Syst. Appl. 103, 92–105 (2018)
12. Stone, P.J., Bales, R.F., Namenwirth, J.Z., Ogilvie, D.M.: The general inquirer: a computer
system for content analysis and retrieval based on the sentence as a unit of information. Behav.
Sci. 7(4), 484–498 (1962)
13. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic inquiry and word count: LIWC 2001,
71(2001) (2001). Lawrence Erlbaum Associates, Mahway
14. Gupta, S., Singh, A., Ranjan, J.: Sentiment analysis: usage of text and emoji for expressing
sentiments. In: Advances in Data and Information Sciences (2020)
15. VaderSentiment (2020). Retrieved 7 June 2020, from https://pypi.org/project/vaderSentiment/
16. Find out who’s not following you back on Twitter, Tumblr, & Pinterest (2020). Retrieved 7
June 2020, from https://friendorfollow.com/twitter/most-followers/
17. Han, H., Zhang, J., Yang, J., Shen, Y., Zhang, Y.: Generate domain-specific sentiment lexicon
for review sentiment analysis. Multimedia Tools Appl. 77(16), 21265–21280 (2018)
18. Nielsen, F.Å.: A new ANEW: evaluation of a word list for sentiment analysis in
microblogs. arXiv preprint arXiv:1103.2903 (2011)
Fuzzy Logic Technique for Evaluation
of Performance of Load Balancing
Algorithms in MCC

Divya, Harish Mittal, Niyati Jain, Bijender Bansal, and Deepak Kr. Goyal

Abstract Mobile Cloud Computing (MCC) is considered the platform for next-generation applications. One of the most important issues in MCC is load balancing: uniform distribution of load among all the virtual servers can provide a better response time, and users demand more services with better results. Many algorithms have been proposed to address this issue, and prevalent load balancing algorithms in cloud computing are analyzed using various parameters like throughput, resource utilization, and response time. To compare the existing algorithms and their performance in a better way, a fuzzy logic technique is developed in this paper. The technique is illustrated using a suitable numerical example: the performance of eight algorithms is evaluated on the basis of four metrics, namely throughput, response time, migration time, and overhead. Using this technique, prevalent algorithms can easily be graded in terms of their performance on the desired metrics.

1 Introduction

Cloud computing is a pay-per-use computing model, and Mobile Cloud Computing is considered the platform for next-generation applications. One of the most important issues in Mobile Cloud Computing is load balancing: uniform distribution of load among all the virtual servers can provide a better response time. The topic has gained much attention in recent years, as users demand more services with better results. Many algorithms have been proposed to address this issue, and prevalent load balancing algorithms in cloud computing are analyzed using various parameters like throughput, resource utilization, and response time. To compare the existing algorithms and their efficiency in a better way, a fuzzy logic technique is developed in this paper. In future, an experimental study will be done so that prevalent algorithms can be graded in terms of their efficiency.

Divya · N. Jain · B. Bansal · D. Kr. Goyal


Department of CSE, Vaish College of Engineering, Rohtak, India
H. Mittal (B)
BM Institute of Engineering and Technology, Sonepat, India


Literature review and analysis of some prevalent load balancing algorithms are described in Sect. 2. The proposed model is described in Sect. 3, followed by concluding remarks and future work in Sect. 4.

2 Literature Review

2.1 Software Quality Assessment Based on Fuzzy Logic

Mittal et al., in “Software Quality Assessment Based on Fuzzy Logic Technique” [5], provide a precise fuzzy-logic-based approach to quantify the quality of software modules on the basis of inspection rate and error density. Triangular Fuzzy Numbers (TFNs) were used to represent the inspection rate and error density of the software, and software modules are assigned quality grades using fuzzy logic.

2.2 Deriving Quality Metrics for Cloud Computing Systems

Becker, Lehrig, and Becker proposed “Systematically Deriving Quality Metrics for Cloud Computing Systems” [1] in 2015. They derived and classified metrics according to a goal–question scheme in a top-down fashion, by defining the goal of analyzing cloud systems and the questions that help achieve that goal.

2.3 Dynamic Round Robin (DRR)

Lin et al. [3] proposed Dynamic Round Robin (DRR) for energy-aware virtual machine scheduling and consolidation. They analyzed power consumption problems in data centers and showed that DRR reduces a significant amount of power consumption. The performance parameters considered are throughput, overhead, fault tolerance, migration time, and resource utilization.

2.4 Load Balancing in Cloud Computing

Power-Aware Load Balancing for Cloud Computing (PALB): Galloway et al. [2] presented a load balancing approach for IaaS cloud architecture. The approach maintains the state of all compute nodes and decides how many compute nodes should be operating. They considered throughput, overhead, fault tolerance, migration time, response time, and resource utilization.
Max–Min Task Scheduling Algorithm for Load Balance: Mao et al. [4] note that load balancing in cloud computing is required to distribute the dynamic local workload evenly across all the nodes to achieve high user satisfaction and resource utilization. The performance parameters throughput, overhead, response time, and resource utilization were studied.
Mishra et al. [19] described various load balancing techniques in homogeneous and heterogeneous cloud computing environments. A system architecture with distinct models for the host and the VM is described, and a taxonomy for load balancing algorithms in the cloud environment is proposed. To analyze the performance of heuristic-based algorithms, simulation is carried out in the CloudSim simulator.
Afzal and Kavitha [20] present a comparative study on load balancing approaches. They framed a set of problem-related questions and discussed them in the work; the data collected for this study had been gathered from five potential databases. A multilevel taxonomy-based classification was proposed by considering five criteria. The study revealed that task scheduling is important in both proactive and reactive approaches.

2.5 Quality Metrics

Table 1 illustrates Scalability Metrics, Elasticity Metrics, and Efficiency Metrics.

Table 1 Quality metrics

| Category | Metric | Unit | Example |
| Scalability metrics | Scalability Range (ScR) | Max | The system scales up to 100 req./min |
| Scalability metrics | Scalability Speed (ScS) | Max, rate | The system scales up to 100 req./min, with linear increase rate 1 req./month |
| Elasticity metrics | Number of Service Level Objective (SLO) violations (NSLOV) | 1/time unit | 40 SLO violations/hour |
| Elasticity metrics | Mean Time To Quality Repair (MTTQR) | Time unit | 30 s for an additional 10 requests/hour |
| Efficiency metrics | Resource Provisioning Efficiency (RPE) | [0, ∞] | 5 more resources than actual resource demand |
| Efficiency metrics | Marginal Cost (MC) | Monetary unit | $2 for an additional 100 requests/hour |

3 Proposed Model for the Evaluation and Comparison of Load Balancing Algorithms

Load balancing is reassigning the total load to the individual nodes of the collective system, which enables networks to provide maximum throughput with minimum response time. Load balancing algorithms are of two types, static and dynamic. A static algorithm divides the load equally among servers and is called the round-robin algorithm, while a dynamic algorithm uses weights on servers; a static algorithm can create imbalanced traffic, while a dynamic algorithm tries to balance the traffic.
In 2019, Afzal and Kavitha [20] (Journal of Cloud Computing: Advances, Systems and Applications) investigated that there are 16 major load balancing metrics in the existing literature; these are given in Table 2.
In order to evaluate the performance of load balancing algorithms, one must first identify the performance parameters/metrics that strongly influence the performance. Such metrics are:
• Throughput—High throughput is necessary for overall system performance. Generally, it is measured in MB/s.
• Overhead—It should be low.
• Fault tolerance—It should be high.
• Response time—The time interval between sending a request and receiving its response. It should be low and is measured in ms.

Table 2 Metrics in load balancing

| S. no. | Metric | Contribution (%) |
| 1 | Throughput | 7.87 |
| 2 | Overhead | 7.09 |
| 3 | Migration time | 3.93 |
| 4 | Response time | 13.39 |
| 5 | Execution time | 11.81 |
| 6 | Resource utilization | 11.02 |
| 7 | Scalability | 9.45 |
| 8 | Waiting time | 2.36 |
| 9 | Execution cost | 8.66 |
| 10 | Makespan | 9.45 |
| 11 | Degree of balance | 4.72 |
| 12 | Power consumption | 3.14 |
| 13 | Service level violation | 0.78 |
| 14 | Task rejection ratio | 1.50 |
| 15 | Fault tolerance | 4.72 |
| 16 | Migration cost | 0.11 |

Fig. 1 Metrics that strongly influence the performance of load balancing algorithms (response time, execution time, migration time, fault tolerance, overhead, resource utilization, throughput, scalability, makespan, execution cost, and others)

• Resource utilization—The proper utilization of the resources. It should be high; it is usually expressed as a percentage.
• Scalability—The ability to perform uniform load balancing. It should be high.
• Migration time—The time required to migrate jobs or resources from one node to another. It should be low (Fig. 1).

Assumption: the number of concurrent users is the same for each algorithm.
The metrics are of two types: metrics for which performance increases with an increase in the value of the metric (e.g., throughput, fault tolerance, resource utilization, scalability), and metrics for which performance decreases with an increase in the value of the metric (e.g., response time, migration time, overhead). We take each metric as a Triangular Fuzzy Number (a, m, b) and suppose that m divides the line joining the points a and b in the ratio 1:1, so that

$$m = \frac{a+b}{2}, \qquad F = \frac{b-a}{2m} = \frac{b-a}{b+a}$$

For b > a,

$$\mu(x) = \begin{cases} 0, & x \le a \\ \frac{x-a}{m-a}, & a \le x \le m \\ \frac{b-x}{b-m}, & m \le x \le b \\ 0, & x \ge b \end{cases} \qquad (1)$$

For b < a,

$$\mu(x) = \begin{cases} 0, & x \ge a \\ \frac{a-x}{a-m}, & m \le x \le a \\ \frac{x-b}{m-b}, & b \le x \le m \\ 0, & x \le b \end{cases} \qquad (2)$$

Fuzzification
(a) For a metric whose performance is directly proportional to the value of the metric (Table 3), where a, b, c, d, w1, w2, w3, and w4 are real constants with w1 < w2 < w3 < w4. A value of x may have two membership functions: for (a+b)/2 ≤ x ≤ b, it has membership function µ as low and 1 − µ for medium; for b ≤ x ≤ (b+c)/2, it has membership function µ as medium and 1 − µ for low (Fig. 2).

(b) For a metric whose performance is indirectly proportional to the value of the metric (Table 4), where a, b, c, d, w1, w2, w3, and w4 are real constants with w1 < w2 < w3 < w4 (Fig. 3).

Table 3 Linguistic variables for a metric whose performance is directly proportional to the value of the metric

| Information interval | Linguistic variable | Weight |
| [a, b] | Low | w1 |
| [b, c] | Medium | w2 |
| [c, d] | High | w3 |
| [d, ∞] | Very high | w4 |

Fig. 2 TFNs when performance is directly proportional to the value of the metric (triangles low/medium/high with weights w1–w4, peaking at (a+b)/2, (b+c)/2, and (c+d)/2)

Table 4 Linguistic variables for a metric whose performance is indirectly proportional to the value of the metric

| Information interval | Linguistic variable | Weight |
| [0, d] | Very high | w4 |
| [d, c] | High | w3 |
| [c, b] | Medium | w2 |
| [b, a] | Low | w1 |

Fig. 3 TFNs when performance is indirectly proportional to the value of the metric (triangles very high/high/medium/low with weights w4–w1)

A value of x may have two membership functions: for (a+b)/2 ≥ x ≥ b, it has membership function µ as high and 1 − µ for medium; for b ≥ x ≥ (b+c)/2, it has membership function µ as medium and 1 − µ for high.
Defuzzification
After evaluating the membership for the value of a metric, find its contribution to the performance as marks for this metric, using the following rules.
Equation (3), for a metric whose performance increases with an increase in the value:

$$m_i = \begin{cases} \mu_i w_1, & a \le x \le \frac{a+b}{2} \\ \mu_i w_1 + (1-\mu_i) w_2, & \frac{a+b}{2} \le x \le b \\ \mu_i w_2 + (1-\mu_i) w_1, & b \le x \le \frac{b+c}{2} \\ \mu_i w_2 + (1-\mu_i) w_3, & \frac{b+c}{2} \le x \le c \\ \mu_i w_3 + (1-\mu_i) w_2, & c \le x \le \frac{c+d}{2} \\ \mu_i w_3 + (1-\mu_i) w_4, & \frac{c+d}{2} \le x \le d \\ \mu_i w_4, & x > d \end{cases} \qquad (3)$$

Equation (4), for a metric whose performance decreases with an increase in the value:

$$m_i = \begin{cases} \mu_i w_1, & a \ge x \ge \frac{a+b}{2} \\ \mu_i w_1 + (1-\mu_i) w_2, & \frac{a+b}{2} \ge x \ge b \\ \mu_i w_2 + (1-\mu_i) w_1, & b \ge x \ge \frac{b+c}{2} \\ \mu_i w_2 + (1-\mu_i) w_3, & \frac{b+c}{2} \ge x \ge c \\ \mu_i w_3 + (1-\mu_i) w_2, & c \ge x \ge \frac{c+d}{2} \\ \mu_i w_3 + (1-\mu_i) w_4, & \frac{c+d}{2} \ge x \ge d \\ \mu_i w_4, & x < d \end{cases} \qquad (4)$$

If we take w1 = 10, w2 = 20, w3 = 40, and w4 = 50, each Mi is the marks out of 50. Let the total marks be the sum of the marks of all the metrics, scaled so that the maximum value is 100. The grade of an algorithm is then assigned as in Table 5.

Table 5 Classification of grades

| Total marks (out of 100) | Grade |
| 0 ≤ Mtotal < 5 | 1 |
| 5 ≤ Mtotal < 10 | 2 |
| 10 ≤ Mtotal < 20 | 3 |
| 20 ≤ Mtotal < 30 | 4 |
| 30 ≤ Mtotal < 40 | 5 |
| 40 ≤ Mtotal < 50 | 6 |
| 50 ≤ Mtotal < 60 | 7 |
| 60 ≤ Mtotal < 70 | 8 |
| 70 ≤ Mtotal < 80 | 9 |
| Mtotal ≥ 80 | 10 |
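Table 5's thresholds amount to a simple lookup; a minimal Python sketch (ours, not the paper's) is:

```python
def grade(total_marks: float) -> int:
    """Map total marks (out of 100) to a grade of 1-10 per Table 5."""
    cuts = [5, 10, 20, 30, 40, 50, 60, 70, 80]
    return 1 + sum(total_marks >= c for c in cuts)

print(grade(24.5), grade(55))  # -> 4 7, matching Table 12 for algorithms A and B
```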

Table 6 Values of throughput and response time of various algorithms

| Algorithm | Response time (ms) | Throughput (MB/s) |
| A | 2600 | 17 |
| B | 1700 | 26 |
| C | 800 | 37 |
| D | 2700 | 38 |
| E | 2650 | 27 |
| F | 1800 | 44 |
| G | 850 | 18 |
| H | 3400 | 17 |

Table 7 Weights for throughput

| Throughput (MB/s) | Linguistic variable | Weight |
| [15, 25] | Low | 10 |
| [25, 35] | Medium | 20 |
| [35, 45] | High | 40 |
| [45, ∞] | Very high | 50 |

Illustrative example: We take a simple case by considering only four metrics.
Inputs: throughput, response time, migration time, and overhead (the result can be extended to any desired number of metrics).
Output: grade of the algorithm.
The values of throughput and response time of the various algorithms are given in Table 6.

Fuzzification
Taking n = number of concurrent users, the fuzzification of throughput uses Table 7 and Fig. 4:

$$M_{thr} = \begin{cases} \mu_{thr} \cdot 10, & 15 \le x \le 20 \\ \mu_{thr} \cdot 10 + (1-\mu_{thr}) \cdot 20, & 20 \le x \le 25 \\ \mu_{thr} \cdot 20 + (1-\mu_{thr}) \cdot 10, & 25 \le x \le 30 \\ \mu_{thr} \cdot 20 + (1-\mu_{thr}) \cdot 40, & 30 \le x \le 35 \\ \mu_{thr} \cdot 40 + (1-\mu_{thr}) \cdot 20, & 35 \le x \le 40 \\ \mu_{thr} \cdot 40 + (1-\mu_{thr}) \cdot 50, & 40 \le x \le 45 \\ \mu_{thr} \cdot 50, & x > 45 \end{cases} \qquad (5)$$

The weights for low, medium, high, and very high are 10, 20, 40, and 50, respectively. The calculated values of µthr are given in Table 8.

Fig. 4 TFNs for throughput (low/medium/high/very high with weights 10/20/40/50 over the intervals of Table 7)

Table 8 Calculated values of marks for throughput

| Algorithm | Throughput (MB/s) | µthr | 1 − µthr | Marks (out of 50) |
| A | 17 | 0.4 | 0.6 | 4 |
| B | 26 | 0.2 | 0.8 | 12 |
| C | 37 | 0.4 | 0.6 | 28 |
| D | 38 | 0.6 | 0.4 | 32 |
| E | 27 | 0.4 | 0.6 | 14 |
| F | 44 | 0.2 | 0.8 | 48 |
| G | 18 | 0.6 | 0.4 | 6 |
| H | 17 | 0.4 | 0.6 | 4 |
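The marks in Table 8 can be reproduced by a short Python sketch of Eqs. (1) and (3). The triangular memberships below peak at the interval midpoints, an assumption consistent with Fig. 4 and the tabulated µthr values; the handling of x > d is also an assumption, since no sample value falls in that range.

```python
def metric_marks(x, a, b, c, d, w=(10, 20, 40, 50)):
    """Marks out of 50 for a metric whose performance grows with its value
    (Eq. 3); each linguistic TFN peaks at its interval midpoint,
    e.g. low = (a, (a+b)/2, b)."""
    w1, w2, w3, w4 = w
    m_ab, m_bc, m_cd = (a + b) / 2, (b + c) / 2, (c + d) / 2

    def tri(x, lo, pk, hi):  # triangular membership of Eq. (1)
        return (x - lo) / (pk - lo) if x <= pk else (hi - x) / (hi - pk)

    if x <= m_ab:
        return tri(x, a, m_ab, b) * w1
    if x <= b:
        mu = tri(x, a, m_ab, b)
        return mu * w1 + (1 - mu) * w2
    if x <= m_bc:
        mu = tri(x, b, m_bc, c)
        return mu * w2 + (1 - mu) * w1
    if x <= c:
        mu = tri(x, b, m_bc, c)
        return mu * w2 + (1 - mu) * w3
    if x <= m_cd:
        mu = tri(x, c, m_cd, d)
        return mu * w3 + (1 - mu) * w2
    if x <= d:
        mu = tri(x, c, m_cd, d)
        return mu * w3 + (1 - mu) * w4
    return w4  # x > d: assumed full "very high" membership

# Reproduces Table 8 with the throughput intervals of Table 7:
for thr in [17, 26, 37, 38, 27, 44, 18]:
    print(thr, metric_marks(thr, 15, 25, 35, 45))
```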

Response Time

$$M_{res} = \begin{cases} \mu_{res} \cdot 10, & 3500 \ge x \ge 3000 \\ \mu_{res} \cdot 10 + (1-\mu_{res}) \cdot 20, & 3000 \ge x \ge 2500 \\ \mu_{res} \cdot 20 + (1-\mu_{res}) \cdot 10, & 2500 \ge x \ge 2000 \\ \mu_{res} \cdot 20 + (1-\mu_{res}) \cdot 40, & 2000 \ge x \ge 1500 \\ \mu_{res} \cdot 40 + (1-\mu_{res}) \cdot 20, & 1500 \ge x \ge 1000 \\ \mu_{res} \cdot 40 + (1-\mu_{res}) \cdot 50, & 1000 \ge x \ge 500 \\ \mu_{res} \cdot 50, & x < 500 \end{cases} \qquad (6)$$

The weights for high, medium, low, and very low response time are 10, 20, 40, and 50, respectively (Fig. 5). The calculated values of µres are given in Table 9.

Table 9 Calculated values of marks of response time

| Algorithm | Response time (ms) | µres | 1 − µres | Mres (out of 50) |
| A | 2600 | 0.2 | 0.8 | 18 |
| B | 1700 | 0.4 | 0.6 | 32 |
| C | 800 | 0.6 | 0.4 | 44 |
| D | 2700 | 0.4 | 0.6 | 16 |
| E | 2650 | 0.3 | 0.7 | 17 |
| F | 1800 | 0.6 | 0.4 | 28 |
| G | 850 | 0.7 | 0.3 | 43 |
| H | 3400 | 0.2 | 0.8 | 2 |

Table 10 Calculated values of marks of migration time

| Algorithm | Migration time (ms) | µmgr | 1 − µmgr | Mmgr (out of 50) |
| A | 2400 | 0.2 | 0.8 | 12 |
| B | 1600 | 0.2 | 0.8 | 36 |
| C | 900 | 0.2 | 0.8 | 48 |
| D | 2900 | 0.2 | 0.8 | 18 |
| E | 2850 | 0.7 | 0.3 | 13 |
| F | 3400 | 0.2 | 0.8 | 2 |
| G | 550 | 0.1 | 0.9 | 49 |
| H | 3300 | 0.4 | 0.6 | 4 |

Fig. 5 TFNs representing response time using Fig. 3 (very low/low/medium/high with weights 50/40/20/10 over 500–3500 ms)



$$M_{mgr} = \begin{cases} \mu_{mig} \cdot 10, & 3500 \ge x \ge 3000 \\ \mu_{mig} \cdot 10 + (1-\mu_{mig}) \cdot 20, & 3000 \ge x \ge 2500 \\ \mu_{mig} \cdot 20 + (1-\mu_{mig}) \cdot 10, & 2500 \ge x \ge 2000 \\ \mu_{mig} \cdot 20 + (1-\mu_{mig}) \cdot 40, & 2000 \ge x \ge 1500 \\ \mu_{mig} \cdot 40 + (1-\mu_{mig}) \cdot 20, & 1500 \ge x \ge 1000 \\ \mu_{mig} \cdot 40 + (1-\mu_{mig}) \cdot 50, & 1000 \ge x \ge 500 \\ \mu_{mig} \cdot 50, & x < 500 \end{cases} \qquad (7)$$

$$M_{ovr} = \begin{cases} \mu_{ovr} \cdot 10, & 10 \ge x \ge 9 \\ \mu_{ovr} \cdot 10 + (1-\mu_{ovr}) \cdot 20, & 9 \ge x \ge 8 \\ \mu_{ovr} \cdot 20 + (1-\mu_{ovr}) \cdot 10, & 8 \ge x \ge 7 \\ \mu_{ovr} \cdot 20 + (1-\mu_{ovr}) \cdot 40, & 7 \ge x \ge 6 \\ \mu_{ovr} \cdot 40 + (1-\mu_{ovr}) \cdot 20, & 6 \ge x \ge 5 \\ \mu_{ovr} \cdot 40 + (1-\mu_{ovr}) \cdot 50, & 5 \ge x \ge 4 \\ \mu_{ovr} \cdot 50, & x < 4 \end{cases} \qquad (8)$$

The weights for high, medium, low, and very low overhead are taken as 10, 20, 40, and 50, respectively. The calculated values of µovr are given in Table 11, and the grades of the algorithms are calculated in Table 12 using Tables 8, 9, 10 and 11 (Fig. 6).

4 Conclusion and Future Work

Analysis of prevalent load balancing algorithms in cloud computing is done using various parameters like throughput, resource utilization, and response time. To compare the existing algorithms and their performance in a better way, a fuzzy-logic-based model is proposed. The model is illustrated with a suitable numerical example using four metrics: throughput, response time, migration

Table 11 Calculations for overhead

| Algorithm | Overhead (million Rs) | µovr | 1 − µovr | Movr (out of 50) |
| A | 7.5 | 0.5 | 0.5 | 15 |
| B | 6.5 | 0.5 | 0.5 | 30 |
| C | 4.5 | 0.5 | 0.5 | 45 |
| D | 8.5 | 0.5 | 0.5 | 15 |
| E | 8.50 | 0.5 | 0.5 | 15 |
| F | 9.5 | 0.5 | 0.5 | 5 |
| G | 4.5 | 0.5 | 0.5 | 45 |
| H | 9.5 | 0.5 | 0.5 | 5 |

Table 12 Grades of algorithms (Mres, Mthr, Mmgr, Movr are each out of 50)

| Algorithm | Mres | Mthr | Mmgr | Movr | Total marks (100) | Grade |
| A | 18 | 4 | 12 | 15 | 24.5 | 4 |
| B | 32 | 12 | 36 | 30 | 55 | 7 |
| C | 44 | 28 | 48 | 45 | 82.5 | 9 |
| D | 16 | 32 | 18 | 15 | 41 | 5 |
| E | 17 | 14 | 13 | 15 | 29.5 | 4 |
| F | 28 | 48 | 2 | 5 | 41.5 | 6 |
| G | 43 | 6 | 49 | 45 | 71.5 | 9 |
| H | 2 | 4 | 4 | 5 | 7.5 | 2 |

Fig. 6 Comparison of performance of various algorithms

time, and overhead. In future, an experimental study of the existing realistic algorithms can be done using this model, so that prevalent algorithms can be graded in terms of their performance on the desired metrics. Present MCC architectures are not up to the mark for current requirements, and there is an immense need to tackle the issues of the MCC environment. There is also vast scope for this study in the fast-emerging field of healthcare monitoring. Fuzzy logic has vast capabilities to address these challenges; hence the effort is to evolve fuzzy-based models.

References

1. Becker, M., Lehrig, S., Becker, S.: Systematically deriving quality metrics for cloud computing
systems. In: Proceedings of the 6th ACM/SPEC international conference on performance
engineering, pp. 169–174 (2015)
2. Galloway, J.M., Smith, K.L., Vrbsky, S.S.: Power aware load balancing for cloud computing.
In: Proceedings of the World Congress on Engineering and Computer Science, vol. 1, pp. 19–21
(2011)
3. Lin, C.C., Liu, P., Wu, J.J.: Energy aware virtual machine dynamic provision and scheduling for
cloud computing. In: 2011 IEEE International Conference on Cloud Computing, pp. 736–737.
IEEE (2011)
4. Mao, Y., Chen, X., Li, X.: Max–Min task scheduling algorithm for load balance in cloud
computing, pp. 457–465
5. Harish, M., Pardeep, B., Puneet, G.: Software quality assessment based on fuzzy logic
technique. Int. J. Soft Comput. Appl. (3), 105-112 (2008). ISSN: 1453-2277
6. Fernando, N., Loke, S.W., Rahayu, W.: Mobile cloud computing: a survey. Future Generat.
Comput. Syst. 29(1), 84–106 (2013)
7. Jia, W., Zhu, H., Cao, Z., Wei, L., Lin, X.: SDSM: a secure data service mechanism in
mobile cloud computing. In: Proceedings of IEEE Conference on Computer Communications
Workshops (INFOCOM), pp. 1060–1065 (2011)
8. Kosta, S., Aucinas, A., Hui, P., Mortier, R., Zhang, X.: Unleashing the power of mobile cloud
computing using Thinkair, pp. 1–17. CoRR abs/1105.3232 (2011)
9. Liang, H., Huang, D., Cai, L.X., Shen, X., Peng, D.: Resource allocation for security services in
mobile cloud computing. In: Proceedings of IEEE Conference on Computer Communications
Workshops (INFOCOM), pp. 191–195 (2011)
10. Naz, S.N., Abbas, S., et al.: Efficient load balancing in cloud computing using multi-layered
Mamdani fuzzy inference expert system. (IJACSA) Int. J. Adv. Comput. Sci. Appl. 10(3) (2019)
11. Ragmani, A, et al.: An improved Hybrid Fuzzy-Ant Colony Algorithm applied to load balancing
in cloud computing environment. In: The 10th International Conference on Ambient Systems,
Networks and Technologies (ANT) April 29–May 2, LEUVEN, Belgium (2019)
12. Shiraz, M., Gani, A., Khokhar, R., Buyya, R.: A review on distributed application processing
frameworks in smart mobile devices for mobile cloud computing. IEEE Commun. Surv.
Tutorials 15(3), 1294–1313 (2013)
13. Sanaei, Z., Abolfazli, S., Gani, A., Buyya, R.: Heterogeneity in mobile cloud computing:
taxonomy and open challenges. IEEE Commun. Surv. Tutorials 16(1), 369–392 (2014)
14. Sethi, S, et al.: Efficient load Balancing in Cloud Computing using Fuzzy Logic. IOSR J. Eng.
(IOSRJEN) 2(7), pp. 65–71 (2012). ISSN: 2250-3021
15. Zadeh, L.A.: From computing with numbers to computing with words-from manipulation of
measurements to manipulation of perceptions. Int. J. Appl. Math. Comput. Sci. 12(3), 307–324
(2002)
16. Zadeh, L.A.: Fuzzy sets. Inf. Control 8, 338–353 (1965)

17. Zadeh., L.A.: The concept of a linguistic variable and its applications to approximate
reasoning—part I. Inf. Sci. 8, 199–249 (1975)
18. Zhou, B., Dastjerdi, A.V., Calheiros, R.N., Srirama, S.N., Buyya, R.: A context sensitive
offloading scheme for mobile cloud computing service. In: Proceedings of the IEEE 8th
International Conference on Cloud Computing (CLOUD), pp. 869–876 (2015)
19. Mishra, S.K., Sahoo, B., Parida, P.P.: Load balancing in cloud computing: a big picture. J. King
Saud Univ. Comput. Inf. Sci. 32.2, 149–158 (2020)
20. Afzal, S., Kavitha, G.: Load balancing in cloud computing—a hierarchical taxonomical
classification. J. Cloud Comput. 8.1, 22 (2019)
Impact of Bio-inspired Algorithms
to Predict Heart Diseases

N. Sree Sandhya and G. N. Beena Bethel

Abstract Optimization techniques are employed to deal with dynamic, difficult, and robust problems. Many machine learning algorithms have been applied to predict heart diseases; classification techniques are among the methods most used for prediction, and while some classification methods predict with acceptable accuracy, others may not. In this paper, we study two different bio-inspired algorithms, the Ant and Bat algorithms, for heart disease prediction. We extract the key features from heart disease attributes using these two bio-inspired algorithms, and the extracted features are then fed to different classifiers. We examine the bio-inspired algorithms combined with Random Forest and SVM classifiers and compare the results: Ant colony optimization and Bat optimization give better results with the SVM classifier than with the Random Forest classifier, and in this comparison the Bat algorithm is the better optimization algorithm.

1 Introduction

Many people in the world suffer from heart diseases, which frequently lead to death. According to mortality statistics, heart disease is the leading cause of death in the world; as per the WHO (World Health Organization), four out of five cardiovascular disease (CVD) deaths are due to heart strokes and heart attacks [1]. The heart is a very important organ that pumps blood throughout the body, and whenever heart disease occurs its functions are impaired. It is therefore important to predict heart diseases with an intelligent approach for better results. In this research, we implement smart optimization methods, namely bio-inspired algorithms, for heart disease prediction.
Bio-inspired algorithms are nature-inspired algorithms developed to resolve complex problems. With huge data, it has become more challenging to provide an optimum solution [2].

N. Sree Sandhya (B) · G. N. Beena Bethel


CSE Department, GRIET, Hyderabad, Telangana, India


At such times, bio-inspired algorithms are recognized for solving complex problems with novel approaches. The objective of this paper is to enhance the attributes of the dataset for better prediction. After extracting features with the bio-inspired algorithms, the attribute class labels are analyzed for further processing; these class-label values are generated by considering the remaining attribute values together with the extracted features. The respective classifier is then applied to this dataset to obtain the prediction accuracy.

2 Related Work

In health care, patients' data grows day by day with additional medical information. To predict any disease, it is important to collect related information from various scenarios of patient data; different techniques and tools depend on this data for disease prediction. The key aim of this paper is to analyze patients' records, extract the main features, and implement bio-inspired algorithms to predict heart disease. Kora and Ramakrishna [3] presented a method for predicting myocardial infarction based on changes in ECG signals, using a proposed improved Bat algorithm. However, the implementation used only 13 patient records: 7 myocardial infarction records and 6 records of normal individuals. Four methods, Support Vector Machine, K-Nearest Neighbors, Levenberg–Marquardt Neural Network, and Conjugate Gradient Neural Network, were implemented with both the normal Bat and improved Bat algorithms, and they concluded that the improved Bat algorithm gives better results than the normal Bat algorithm.
Dubey et al. [4] diagnosed heart diseases at an early stage by combining Ant colony optimization with data mining techniques (DMACO). They considered the pheromone value of the ants and recognized the risk level: whenever the pheromone value increases, the risk also increases. The Ant algorithm with data mining techniques was applied to generate the detection rate of heart disease, and they concluded that DMACO develops the pheromone intensity and improves the detection rate of the disease.
The techniques discussed above for heart disease prediction are hybrid classification methods. The work in this research differs in that we implement two different optimization techniques with two different classifiers and analyze the efficiency of combining optimization techniques with classifiers.

3 Methodology

The system architecture used in this research for predicting heart disease is shown in Fig. 1. Here, heart data means a dataset containing the records of heart disease patients. The heart disease dataset used in this research is taken from the UCI Machine Learning Repository, which contains datasets from different domains and was developed by David Aha in 1987 [5]. We used the Cleveland (Cleveland Clinic Foundation) data, which contains 303 instances with 76 raw attributes; of these, only 14 attribute values are used, the last of which is the attribute predicted from the other 13. We chose the Cleveland dataset because it has a small amount of missing data.
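For reference, the Cleveland data can be loaded directly with pandas; the URL and column names below follow the commonly used processed.cleveland.data file in the UCI repository and should be verified against the current listing.

```python
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.cleveland.data")
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
           "exang", "oldpeak", "slope", "ca", "thal", "target"]

df = pd.read_csv(URL, names=COLUMNS, na_values="?")  # 303 rows, 14 columns
print(df.shape, df.isna().sum().sum())               # only a few missing values
```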

3.1 Optimization Techniques

3.1.1 Ant Colony Optimization

Marco Dorigo presented the Ant algorithm in 1992. The algorithm was designed based on the inspiration of real ants and is widely used to resolve complex problems; it finds an optimal solution by performing iterations [6]. The main goal here is to select the finest features with the lowest redundancy.
Let us assume m ants develop solutions from a finite set of components C. Consider an empty solution sx = ∅; at each iteration it is extended by adding a new feasible component from the set of neighbors N(sx) ⊆ C. The path on the graph can be designed by constructing these partial solutions [7]. At each construction step, the solution components are chosen using a probabilistic approach:

$$P(c_{ij} \mid s_x) = \frac{T_{ij}^{\alpha} \cdot \eta_{ij}^{\beta}}{\sum_{c_{ij} \in N(s_x)} T_{ij}^{\alpha} \cdot \eta_{ij}^{\beta}} \qquad (1)$$

Fig. 1 Architecture

where C = {cij}, i = 1, 2, 3, …, m and j = 1, 2, 3, …, n, and ∀ cij ∈ N(sx). Tij is the pheromone value and ηij is the heuristic value, both associated with the component cij; α and β are real positive parameters determining the influence of the pheromone and heuristic information. The solutions ants construct using this probability rule are called tours [8]. To find the best solutions, we have to update the pheromone values, gathering good solutions and letting bad ones fade. This is achieved through pheromone evaporation:

$$T_{ij} \leftarrow (1-\rho) \cdot T_{ij} + \rho \sum_{s \in S \,:\, c_{ij} \in s} F(s) \qquad (2)$$

where S is the set of solutions used for the update, ρ ∈ (0, 1] is the evaporation rate, and F: S → R+ is a fitness function such that f(s) < f(s') ⇒ F(s) ≥ F(s') for all s ≠ s' ∈ S.
The algorithm is as follows:

Ant Algorithm
Step 1 Create the initial parameters, N nodes and M arcs, and assign a constant amount of pheromone to all arcs.
Step 2 Ant k uses the pheromone trail at node i to choose the next node j with the probabilistic rule of (1).
Step 3 When ant k traverses the arc (i, j), the pheromone value is updated (local trail): Tij ← Tij + ΔTk, where ΔTk is the pheromone deposited by ant k.
Step 4 When ant k has moved to the next node, pheromone evaporation is applied: Tij ← (1 − ρ) · Tij, ∀ (i, j) ∈ A.
Step 5 Repeat Step 2 until the ant reaches the target point. This iteration cycle involves the ants' movements, pheromone evaporation, and deposits.
Step 6 Whenever the ants reach the target, they update the pheromone using global trails per (2) and find the optimal solution.
Step 7 Repeat this process until the termination condition is satisfied; then generate the output, otherwise repeat the above steps.
Step 8 End.
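A minimal Python sketch of Steps 2 and 4, assuming tau and eta are dictionaries of pheromone and heuristic values indexed by arc (our illustration, not the authors' code):

```python
import random

def choose_next(i, neighbors, tau, eta, alpha=1.0, beta=2.0):
    """Roulette-wheel selection of the next node j according to Eq. (1)."""
    weights = [tau[i][j] ** alpha * eta[i][j] ** beta for j in neighbors]
    r, acc = random.uniform(0, sum(weights)), 0.0
    for j, w in zip(neighbors, weights):
        acc += w
        if acc >= r:
            return j
    return neighbors[-1]

def evaporate(tau, rho=0.1):
    """Pheromone evaporation applied to every arc (Step 4)."""
    for i in tau:
        for j in tau[i]:
            tau[i][j] *= (1.0 - rho)
```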

3.1.2 Bat Colony Optimization Algorithm

In 2010, Xin-She Yang developed the Bat algorithm, based on the echolocation behavior of microbats [9]. Bats use echolocation to search for food and to travel from one place to another, changing their velocity, frequency, and loudness accordingly during the search. The Bat algorithm is presented as follows.

Bat Algorithm
Step 1 Define the objective (fitness) function of a bat.
Step 2 Initialize the bat parameters: frequency (fi), loudness (A), and pulse emission rate (ri).
Step 3 Randomly generate the bat population.
Step 4 Sort the current population values (preferably in descending order).
Step 5 Generate new frequency, velocity, and position values using (3), (4), and (5), respectively, until the maximum-iteration criterion is met.
Step 6 Take the best solution generated in Step 5 as the local solution.
Step 7 Store new solutions in a resource log.
Step 8 Update the values of loudness and pulse emission rate using (6) and (7).
Step 9 Test the fitness of the new solution with respect to A and r, then rank the bats and find the current best position.
Step 10 Repeat the process until the termination criterion is satisfied.
Step 11 End.
Bats fly randomly with velocity vi at position xi, with frequency in the range [fmin, fmax], varying wavelength λ, and loudness A0, to search for prey. The wavelength is adjusted automatically, the pulse emission rate varies in [0, 1] depending on the proximity of the target, and the loudness lies between [A0, Amin]. Each bat is randomly assigned a frequency in [fmin, fmax]; hence the method is called a frequency-tuning algorithm. Each bat is associated with a velocity vi^t and position xi^t in the search space at iteration t, with respect to its assigned frequency fi. Hence at each iteration we need to update fi, vi, and xi, and along with these parameters the loudness and pulse emission rate are also updated. To do this, we use the following equations:

$$f_i = f_{min} + (f_{max} - f_{min})\,\beta \qquad (3)$$

$$v_i^t = v_i^{t-1} + \left(x_i^{t-1} - x^*\right) f_i \qquad (4)$$

$$x_i^t = x_i^{t-1} + v_i^t \qquad (5)$$

$$A_i(t+1) = \alpha A_i(t) \qquad (6)$$

 
$$r_i(t+1) = r_i^0 \left(1 - e^{-\gamma t}\right) \qquad (7)$$

where fi is the current frequency, vi^t the current velocity, and xi^t the current position; Ai is the loudness and ri the pulse emission rate; vi^{t−1} and xi^{t−1} are the previous velocity and position; α and β are random values in [0, 1]; γ is a constant with γ > 0; and x* is the current best position.
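A compact NumPy sketch of one iteration of Eqs. (3)–(7), under the assumption that positions and velocities are stored as (n_bats × n_dims) arrays (illustrative only, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def bat_move(x, v, x_best, f_min=0.0, f_max=2.0):
    """Frequency, velocity and position updates per Eqs. (3)-(5)."""
    beta = rng.random(len(x))                 # beta ~ U[0, 1]
    f = f_min + (f_max - f_min) * beta        # Eq. (3)
    v = v + (x - x_best) * f[:, None]         # Eq. (4)
    return x + v, v                           # Eq. (5)

def update_loudness_and_rate(A, r0, t, alpha=0.9, gamma=0.9):
    """Loudness and pulse emission rate updates per Eqs. (6) and (7)."""
    return alpha * A, r0 * (1.0 - np.exp(-gamma * t))
```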

3.2 Classification

Heart disease prediction can be perceived as a clustering problem or a classification problem [10]. Here, we build a model on a large set of presence and absence data, so the task becomes classification once we extract the features with the respective bio-inspired algorithm. In this research, we compare the updated dataset using two classifiers, namely Random Forest and Support Vector Machines.
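A minimal scikit-learn sketch of this comparison, where X_sel denotes the feature matrix selected by a bio-inspired algorithm and y the presence/absence labels (both placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def compare(X_sel, y):
    """Tenfold cross-validated accuracy for SVM and Random Forest."""
    for name, clf in [("SVM", SVC()), ("Random Forest", RandomForestClassifier())]:
        scores = cross_val_score(clf, X_sel, y, cv=10)
        print(f"{name}: mean accuracy {scores.mean():.3f}")
```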

4 Experiments and Results

Here, we present the heart disease dataset experiments and evaluations in a Python environment. The objective of this project was to test which optimization technique organizes the heart disease data best with a specific classifier. We use tenfold cross-validation to evaluate the performance of the classification methods for predicting heart disease. To avoid uneven run-to-run effects, each trial was run 10 times, and the best classification accuracy was chosen for assessment. The accuracy comparison of the Ant and Bat algorithms with both classifiers is shown in Fig. 2.
From Fig. 2, it is observed that irrespective of the classifier, the Bat algorithm gives better accuracy than the Ant algorithm. Comparing the two algorithms from the viewpoint of feature extraction,
Fig. 2 Accuracy comparison graph of the Ant and Bat algorithms using Random Forest (RF) and SVM classifiers

the Ant algorithm ranks lowest, and the Bat algorithm gives better results in both cases. In this research, with the Random Forest classifier, Ant gives 70.96% and the Bat algorithm gives 80.64% accuracy; with the SVM classifier, Ant gives 74.2% and the Bat algorithm gives 83.87%. Both algorithms give better results with the SVM classifier.

5 Conclusion and Future Scope

In this paper, we focused on feature extraction with two bio-inspired algorithms evaluated using two classifiers, Random Forest and SVM. The optimized SVM classifier gives better results in each case than the optimized Random Forest classifier, and the Bat algorithm gives better accuracy with both classifiers. We therefore conclude that, in this research, the Bat algorithm is better than the Ant algorithm. Similarly, SVM gives better prediction results than Random Forest: both classifiers were applied to the same bio-inspired algorithm, yet their classification results varied. In future work, there is large scope for more bio-inspired algorithms with multiple classifiers; implementing multiple classifiers on a single bio-inspired algorithm gives better clarity about the quality of its feature extraction and increases the scope for better results.

References

1. Khourdifi, Y., Bahaj, M.: Heart disease prediction and classification using Machine Learning
Algorithms Optimized by particle swarm optimization and Ant Colony Optimization. Int. J.
Intell. Eng. Syst. (2019)
2. Darwish, A.: Bio-inspired computing: algorithms review, deep analysis, and the scope of
applications. Future Comput. Inf. J. (2018)
3. Kora, P., Ramakrishna, K.S.: Improved Bat Algorithm for the detection of myocardial
infarction. Springer Plus 4(1), 666 (2015)
4. Dubey, A., Patel, R., Choure, K.: An efficient data mining and Ant Colony Optimization
Technique (DMACO) for heart disease prediction. Int. J. Adv. Technol. Eng. Explor. (IJATEE)
1(1), 1–6 (2014)
5. Dataset, UCI Machine learning Repository [online]; https://archive.ics.uci.edu/ml/datasets/
Heart+Disease
6. Rao, P., et al.: An efficient approach for detection of heart attack using Noble Ant Colony
Optimization concept of data. IJESRT (2018)
7. Dorigo, M.: Ant Colony optimization. Scholarpedia 2(3), 1461 (2007)
8. Nasiruddin, I., Ansari, A.Q., Katiyar, S.: Ant Colony Optimization: a tutorial review,
Conference Paper (2015)
9. Hinduja, R., Mettildha Mary, I., Ilakkiya, M., Kavya, S.: CAD diagnosis using PSO, BAT,
MLR and SVM. Int. J. Adv. Res. Ideas Innovations Technol. (2017)
10. Kotsiantis, S., Pintelas, P.E., Zaharakis, I.D.: Machine learning: a review of classification and
combining techniques. Artif. Intell. Rev. 26(3), 159–190 (2006)
Structured Data Extraction Using
Machine Learning from Image
of Unstructured Bills/Invoices

K. M. Yindumathi, Shilpa Shashikant Chaudhari, and R. Aparna

Abstract The identification and extraction of unstructured data has always been one of the most difficult challenges of computer vision. Parsing this sort of data at scale is very challenging; however, recent advancements in computer vision technology help make it feasible. Receipts, ubiquitous items that many consumers receive, are difficult to transform into raw data, yet they contain a dense amount of information that can be useful for later analysis, and there exists no widely available solution for transforming receipts into structured data: existing solutions are either very costly or inaccurate. This paper introduces a data pipeline for the identification, cropping, and extraction of unstructured data within healthcare bill/invoice images. This pipeline outperforms existing solutions by a large margin and offers the ability to automatically pull out semantic data such as description and unit price from an image of a bill. It achieves this success by using Logistic Regression and KNeighbors classifiers, with OpenCV and scikit-image to crop the image; Optical Character Recognition (OCR) is applied to detect chunks of text and process the image. The accuracy observed is approximately 93% for Logistic Regression and 81% for KNeighbors.

1 Introduction

Recent years have seen growing interest in harnessing advances in Machine Learning (ML) and Optical Character Recognition (OCR) to translate physical and handwritten documents into digital copies. Creating digital documents from scratch is difficult; ultimately, a solution that offers the simplicity of document generation

K. M. Yindumathi (B) · S. S. Chaudhari · R. Aparna


Department of Computer Science and Engineering, M.S. Ramaiah Institute of Technology,
Bengaluru 560054, India
e-mail: [email protected]
S. S. Chaudhari
e-mail: [email protected]
R. Aparna
e-mail: [email protected]


while ensuring the ease of digital document usage is today's need. Identification of text characters generally deals with optically formed characters and is often called OCR. OCR's basic idea is to turn any handwritten or typed text into data files that can be interpreted and read by a computer; any document or book can be scanned, and an editable text file can then be produced. The OCR approach has two main benefits: it improves efficiency by reducing staff involvement, and it speeds up data processing. The fields where such systems are most broadly applied include postal offices, banks, the publishing sector, government agencies, education, and health care.
The universal OCR system consists of three main steps: image acquisition and preprocessing, feature extraction, and classification [1]. The image preprocessing phase cleans up and enhances the image by noise removal, correction, binarization, dilation, color adjustment, text segmentation, etc. Feature extraction is a technique for extracting and capturing certain pieces of information from the data. In the classification phase, the segmented text portions of the document image are mapped to their equivalent textual representation. The original invoice image is first pre-processed by secondary rotation and edge cutting to eliminate unnecessary background information; the region of critical information in the resulting normalized image is then derived by template matching, which is the focal point of content recognition. OCR is used to translate image information into text and enable easy use of the interpreted content. The main point is to analyze certain ML algorithms and process text using categorization and classification.
In particular, this paper is interested in processing unstructured bill image data and converting it into a simple-to-use, analyzable structured data format, resolving some of the fundamental difficulties and inconsistencies associated with parsing this sort of unstructured data. We use a combination of bill image datasets and custom receipt data collected over the past few months by a small group of people to train and evaluate the effectiveness of this system. Our aim is to process and parse receipt data from most standard use cases, which involves steps such as correcting the input image orientation, cropping the receipt to remove the background, running OCR to pull text from the image, and using an algorithm to determine the relevant data from the OCR result. A satisfactory system handles the above characteristics and further edge cases by pulling the description and unit price from image inputs. The paper demonstrates how this result was achieved and provides both a quantitative and a qualitative evaluation of the system.

2 Related Works

This section discusses related work concerning the specific task of analyzing and classifying receipt data using machine learning. Researchers have mostly focused on developing techniques to improve the recognition and extraction of text from unstructured data, whereas industry has focused on creating commercial systems to reduce

manual labor costs for inputting receipt image data for analysis or reporting. However,
neither produces an optimal system due to degradations in either accuracy or cost.
One commonly used extraction mechanism for text detection is a Convolutional
Neural Network (CNN) [1, 2]. This class of OCR utilizes a Long Short Term Memory
(LSTM) to propose regions of interest where text may exist as well as a CNN to
determine the likelihood of text appearing at that location. These systems all provide
end-to-end for text identification—from localizing and recognizing text in images to
retrieving data within such text. Deep learning based method for bank serial number
recognition system consistently achieved state-of-the-art performance on bank serial
number text detection benchmarks [1]. Some research work has also been done
on graph neural networks to define the table structure for text extraction. Canny
edge detector for image data identification [3] is proposed for utilizing all kinds of
images, failed for handwritten bills. Table 1 presents summary of the text extraction
techniques in various fields along with experimentation.

3 Proposed Structured Data Extraction

This section describes the process of classifying and extracting information from receipts (typed and handwritten bills) using machine learning.
A text segmentation technique finds the first character in a line and then tries to interpret the whole line according to the position of that first character; in handwritten text the flow is not uniform, and such characters are not recognized well. If the picture includes an object that is not a bill or another rectangular piece of paper, the task is to classify the entities and recognize which is the bill from which content must be retrieved. Further text analysis can then determine whether the collected information is from a bill.
The process of structured data extraction from unstructured bill images involves a data pipeline that starts with collecting receipts as unstructured image data, converting them into image format, correcting the image orientation, cropping the receipt to remove the background, running OCR to pull text from the image, training machine learning algorithms to determine the relevant data from the OCR result, writing a custom algorithm to extract the specific fields of interest, and finally evaluating classifier performance. To eliminate useless context information, the original bill image is preprocessed into a grayscale image to accelerate processing, and the image is calibrated to a better angle and view for easier retrieval of the data. Next, the region of necessary information in the normalized image is extracted by matching the prototype, which is the center of information identification. Recognized optical characters are then used to translate the image information into text. Initially, OpenCV is used to detect bills in the image and to remove unnecessary noise; the Tesseract OCR engine then processes the intermediate images. Tesseract applies text segmentation to catch written text

Table 1 Comparison of text extraction from bills

| Paper | Metrics | Algorithm | Results |
| [3] | Convert binary text object into ASCII | OCR | Extracts all kinds of blurred images |
| [2] | Canny edge detector for image data identification | OpenCV, Tesseract OCR | Extracts all kinds of images; fails for handwritten bills |
| [1] | Number and letter segmentation | Tanimoto measure theory | Rate of numeral recognition 99% and letter recognition 92% |
| [4] | Detecting and filtering images with low quality | Definition evaluation algorithm based on re-blur | Detects low-quality images and screens out undesirable images |
| [5] | Secondary rotation, rotated image, degree of rotation | OCR data exported in Excel format | The accuracy of this method reaches up to 95.92% |
| [6] | Various document formats | Combination of heuristic filtering and OCR | Best result with an average accuracy of 92% |
| [7] | Block-level image into individual characters | Softmax CNN classifier | Best result with an accuracy of 99.92% |
| [8] | Image accuracy for a selected text region | OCR-based CNN | Accuracy rate of 95% at a distance of 10 km |
| [9] | CNN to remove the visual portrayals, entity-aware network to decrypt EoIs | Entity-aware mechanism for feature extraction and LSTM as input | Accuracy of the model trained over real data is about 95.8% |
| [10] | Graph neural network to define the table structure | Graphlet discovery algorithm | Accuracy of table detection is about 97% |
| [11] | Denomination-based K-means clustering | Tesseract framework for text extraction | Accuracy rate is 93.3% |
| [12] | Bounding box designed for specific elements | OCR | Accuracy after pre-processing is about 87.5% |
| [13] | FCNN and LSTM for text classification | Text detection with Tesseract 4.0 and EAST detector | Best result with an accuracy of 83.3% |
| [14] | DenseNets and CTC for text conversion | Area extraction with YOLOv3 | Average recognition accuracy is about 0.96 |
| [15] | Entity extraction by labelling Regexp patterns | OCR | Arabic text is cursive; difficult to determine accuracy |
| [16] | BPN predicts text or non-text areas block-wise | OCR | Recognition accuracy 99.24% |
| [17] | Column-based character segmentation | Text recognition with OCR | Character recognition accuracy 89% and table recognition 98% |

Bill/invoice identification
Data Image pre- Bill/invoice region Identification of
collection processing identification corners for region

Structured Data preparation using Data extraction Image


Machine learning algorithm using OCR cropping

Fig. 1 Data pipeline phases for extracting structured data from unstructured image data

in various fonts and languages. Figure 1 shows the sequential phases of the data pipeline for extracting structured data from unstructured image data. The steps can be executed as a pipeline over multiple images, since each phase is independent.
Phase 0: Data collection: The dataset used in this project consists of 40 healthcare bills, divided into two major categories: normal healthcare bills and COVID healthcare bills. All the receipts are less than a year old. Normal healthcare bills form the larger category with a large amount of variance, while COVID-related healthcare bills carry the highest amounts, so the dataset contains mixed types of healthcare bills. We made sure that the backdrop was not too noisy and that the contrast with the receipt was good. The receipts were oriented more or less vertically, without more correction than required. Nevertheless, most of the receipts were crumpled during transportation, so they had folds and creases, and some also had washed-out letters, making the job not so straightforward.
Phase 1: Pre-processing the image: We convert the images to grayscale, which reduces the data by two-thirds and accelerates subsequent processing. We normalize global illumination using an old image-processing trick to remove slow illumination gradients: slow gradients modulate image intensity at low frequencies, so filtering the image with a high-pass filter removes the global illumination. The filter is implemented efficiently using the discrete cosine transform (DCT): convert the image into frequency space, cancel the low-frequency components, and convert the image back into the spatial domain. Using the DCT instead of the discrete Fourier transform avoids dealing with complex numbers.
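A possible implementation of this DCT high-pass filter with SciPy (a sketch; the cutoff of three coefficients is an assumed value, not from the paper):

```python
import numpy as np
from scipy.fft import dctn, idctn

def normalize_illumination(gray: np.ndarray, cutoff: int = 3) -> np.ndarray:
    """Remove slow illumination gradients by zeroing low-frequency DCT terms."""
    coeffs = dctn(gray.astype(np.float64), norm="ortho")
    coeffs[:cutoff, :cutoff] = 0.0            # cancel the low-frequency components
    flat = idctn(coeffs, norm="ortho")
    flat = (flat - flat.min()) / (flat.max() - flat.min() + 1e-9)
    return (flat * 255).astype(np.uint8)      # rescale for later thresholding
```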
Phase 2: Healthcare bill identification: The healthcare bill is usually a rectangular piece of paper, so its area is defined by a quadrilateral with four vertices {p1, p2, p3, p4}, selected so that the polygon covers as much of the receipt and as little of the background as possible. The preprocessed image is segmented into pixels representing the receipt and those that do not: the image is blurred to eliminate noise and thresholded at a 60% level for an initial segmentation; binary closing then removes small false detections and fills flaws along the contour; finally, all but the largest blob in the image is discarded. The bill identification process has six steps: (1) identify the bill receipt as a four-vertex polygon; (2) convert the image to grayscale with Gaussian blurring; (3) remove noise and set the threshold at 60% resolution; (4) extract features from the image using edge detection; (5) measure the receipt end-points in the image; (6) outline the end-points with green blobs.
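An OpenCV sketch of the segmentation just described, assuming a BGR input image (parameter values such as the kernel size are illustrative):

```python
import cv2
import numpy as np

def receipt_mask(image: np.ndarray) -> np.ndarray:
    """Segment receipt pixels: blur, threshold at ~60%, close, keep largest blob."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 0)
    _, mask = cv2.threshold(blur, int(0.6 * 255), 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # binary closing
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)             # discard small blobs
    clean = np.zeros_like(mask)
    cv2.drawContours(clean, [largest], -1, 255, thickness=cv2.FILLED)
    return clean
```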
Phase 3: Corner identification: Simply putting vertices at the corners can make the polygon clip. The lower-left corner may have a very obtuse angle, which shows that accurate corner detection cannot be as simple as computing internal angles; in other cases, the receipt may be distorted or rounded, or an irregular tear may place corners in odd locations. For a more robust solution we rely on the receipt edges: we extract the receipt outline from the foreground mask (using binary morphology) and apply a probabilistic Hough transform to get the start and end points of line segments in the image. We then calculate the intersection of each pair of horizontal and vertical segments (green blobs) to produce a list of corner candidates, which we reduce by averaging (red crosses) to a more reasonable number. We tried different methods of finding the receipts, each with varying success rates.
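The segment-and-intersect idea can be sketched as follows with OpenCV's probabilistic Hough transform (the Hough parameters are assumed values):

```python
import cv2
import numpy as np

def corner_candidates(mask: np.ndarray):
    """Hough line segments on the receipt outline; intersections of roughly
    horizontal and roughly vertical segments become corner candidates."""
    edges = cv2.Canny(mask, 50, 150)
    segs = cv2.HoughLinesP(edges, 1, np.pi / 180, 60,
                           minLineLength=60, maxLineGap=10)
    if segs is None:
        return []
    horiz, vert = [], []
    for x1, y1, x2, y2 in segs[:, 0]:
        (horiz if abs(x2 - x1) > abs(y2 - y1) else vert).append((x1, y1, x2, y2))
    return [p for h in horiz for v in vert
            if (p := line_intersection(h, v)) is not None]

def line_intersection(h, v):
    """Intersection point of the two infinite lines through segments h and v."""
    x1, y1, x2, y2 = h
    x3, y3, x4, y4 = v
    den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(den) < 1e-9:
        return None
    a, b = x1 * y2 - y1 * x2, x3 * y4 - y3 * x4
    return ((a * (x3 - x4) - (x1 - x2) * b) / den,
            (a * (y3 - y4) - (y1 - y2) * b) / den)
```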
Phase 4: Cropping, de-skewing, and enhancing: The next steps are to strip the receipt out of the
picture and boost the contrast of what is written on it. Both are standard
image processing operations, and scikit-image provides a transform warp that uses a
mapping between input and output pixel positions to do them in one step. We simply
calculate the quadrilateral edge lengths for the output shape and create a rectangle
with width and height equal to the full length of the top/bottom and left/right segments,
respectively. Gray thresholding and a corresponding pixel filter remove all blobs that
cross the boundary of the receipt. Thresholding effectively suppresses many of the
luminance variations while keeping the text intact. Nevertheless, several characters,
particularly the small ones, are hard to distinguish in the binary image. Therefore, we first
feather the mask out by Gaussian blurring and blend it with the original image (receipt).
This essentially preserves all masked areas but with soft rather than sharp edges. The
process of cropping, de-skewing, and enhancing has five steps. (1) Load the necessary XML
classifiers and load the input images. (2) Resize the image to 500 pixels, then crop, blur, and de-skew it.
(3) Convert the image into grayscale using Gaussian blurring. (4) Segment the image
into words and characters. (5) Extract the features from the image using edge
detection.
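A minimal sketch of the one-step crop/de-skew with scikit-image's projective warp, under the assumption that the corners arrive as (x, y) pairs in top-left, top-right, bottom-right, bottom-left order:

import numpy as np
from skimage import transform

def crop_and_deskew(image, corners):
    tl, tr, br, bl = [np.asarray(c, dtype=float) for c in corners]
    width = int(max(np.linalg.norm(tr - tl), np.linalg.norm(br - bl)))   # top/bottom lengths
    height = int(max(np.linalg.norm(bl - tl), np.linalg.norm(br - tr)))  # left/right lengths
    rect = np.array([[0, 0], [width - 1, 0],
                     [width - 1, height - 1], [0, height - 1]])
    tform = transform.ProjectiveTransform()
    tform.estimate(rect, np.array([tl, tr, br, bl]))   # maps output coords -> input coords
    return transform.warp(image, tform, output_shape=(height, width))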
Phase 5: Extraction of data: The most capable free alternative is the Tesseract OCR engine.
We used pytesseract, which is a simple wrapper around Tesseract: call
image_to_string to transform the image into a single formatted string, or call image_to_data
for recognition confidences and other useful information for individual fragments
of the text. The underlying OCR engine uses a Long Short-Term Memory (LSTM)
network. The Tesseract architecture includes adaptive thresholding to convert the image into
a binary image, component analysis to extract an OCR image with a black background
and white text, line/word finding using contours and blobs, and a two-pass machine
learning classifier to recognize words and extract the text from the image. For performance,
we must strip out low-confidence detections and those that are simply too poor to be text.
The default output is in tab-separated values (TSV) format, and
pytesseract can automatically convert the TSV into a data frame using pandas. The
data extraction process has five steps. (1) Apply Tesseract OCR to the
image. (2) Differentiate the word contours associated with the image. (3) Differentiate the letter
contours associated with each word contour. (4) Preprocess the letter images and generate the TSV
data. (5) Consolidate the associated data into text and then into an Excel file, as shown
in Fig. 2.
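A sketch of these five extraction steps with pytesseract follows; the confidence cut-off of 40 and the output file name are our assumptions.

import pandas as pd
import pytesseract
from PIL import Image

def extract_bill_text(path):
    image = Image.open(path)
    text = pytesseract.image_to_string(image)            # one formatted string
    tsv = pytesseract.image_to_data(image,
                                    output_type=pytesseract.Output.DATAFRAME)
    tsv = tsv[tsv.conf.astype(float) > 40]               # strip low-trust detections
    tsv.to_excel("bill.xlsx", index=False)               # consolidate to an Excel file
    return text, tsv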
Phase 6: Data preparation: Before feeding the data into any algorithm, several
measures need to be taken to prepare the data by transforming it into a structured form;
the sample data set used in this method is shown in Fig. 3. Tokenization and
vectorization are the principal steps of this process. We separated these modules
from the other preparation phases to keep the testing and optimization as close as possible.

Fig. 2 OCR processing of proposed system (the input image is converted to a .txt file and then to an Excel file; sample line items include PPE KIT 45000, BED CHARGES 72500, DIETICIAN 300, DOCTOR FEES 45600, DRESSING 360, DRUG ADMINISTRATION 7384.5, ECG 330, LABORATORY 5384)

Fig. 3 Dataset of healthcare bills
We also studied whether we could use machine learning to retrieve different data points
from the receipt. In this sense, a data point is assumed to be a particular form of
token which contains a value that varies between transactions, in our case the unit
price and description. The token may have different formatting or style in different
documents; in our solution, the only constraint required is that it is a distinctive
token, which in most situations makes it ideal for extracting data points.
We used the tools offered by the scikit-learn libraries and followed the classifiers' specific
implementation advice. The process of extracting specific data points is very different
from classifying the receipts, so we could not use the same algorithms or classifiers for
this task. Based on the extracted text, we opted to evaluate the following classification
models: (1) Logistic Regression and (2) KNeighbors Classifier. We then
continued with the best of those models for tuning and maximizing hyper-parameters
before progressing to a second round of benchmarking. To achieve comparable outcomes
over different runs, the same random seed was used for
the data shuffling process when comparing different algorithms. The benchmarking
software was a basic Python script that supplied the pipeline with a new classifier for
a round of checking against the data collection. The classifier was then replaced by
the next classifier, and the process was repeated.
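The benchmarking loop can be sketched as below; the TF–IDF vectorizer and the seed value are our assumptions about the pipeline rather than specifics given in this work.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def benchmark(texts, labels):
    cv = KFold(n_splits=3, shuffle=True, random_state=42)    # same seed across runs
    for clf in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
        pipe = make_pipeline(TfidfVectorizer(), clf)          # vectorization + classifier
        scores = cross_val_score(pipe, texts, labels, cv=cv)  # three folds
        print(type(clf).__name__, round(scores.mean(), 3))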
To parse a given OCR output, words are first tokenized using the sklearn libraries and
then keywords for each given category are identified. Once each keyword is identified,
we run a spatial search over all text inputs nearest to the text input containing the
keyword to find the insightful information. For example, for parsing pricing data, keywords
such as “price”, “total”, “description” and “amount” are first identified from the
OCR output. For each positive keyword match, a nearest-neighbor search
is conducted to look for text bounding boxes containing pricing information. The
keyword–price pair is selected for the bounding boxes that are furthest down on the
page. Once the unit-price and description information are found, this information is
returned as output, together with a plot combining the individual images generated during
the pipeline with the parsed information.
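A hedged sketch of this keyword-driven spatial search; the (text, x, y) box format and the squared-Euclidean metric are assumptions on our part.

KEYWORDS = {"price", "total", "description", "amount"}

def pair_keywords_with_values(boxes):
    # boxes: iterable of (text, x, y) taken from the OCR bounding boxes.
    pairs = []
    values = [b for b in boxes if b[0].lower() not in KEYWORDS]
    for text, x, y in boxes:
        if text.lower() in KEYWORDS and values:
            nearest = min(values,
                          key=lambda b: (b[1] - x) ** 2 + (b[2] - y) ** 2)
            pairs.append((text, nearest[0]))   # keyword-value pair
    return pairs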
The extracted text data set is divided into a training set (80%) and a test set (20%).
Both sets are divided into two parts, one part containing the OCR-extracted receipt
text and the other containing the value for the data point, referred to as the label.
The model has access to the labels during training to know what it is looking for,
while the prediction function only has access to the document text. The labels for
the test data are instead used to compare against the predictions after they have been
made by the model. Fig. 3 is a graphical representation of the actual dataset.

4 Results Analysis

The development environment used for this project includes Python 3 as the
language to develop the software in order to meet the project requirements. For the
initial development and exploration we used Spyder as the IDE, and later
we shifted to PyCharm, which supports Python and eases task completion.
Other libraries and pre-trained models used include pytesseract,
numpy, imutils, cv2, pandas, sklearn, etc. These algorithms have well-defined
models that are trained with samples to increase the accuracy of the
result; the result depends on the amount of training given to the model. The receipt
accuracy statistic is measured as the number of correct tokens divided by the overall
token count, where the total number of tokens in a record is the number generated by
the OCR algorithm. This is because noise appears as incorrect tokens
generated by the OCR algorithm. The classification efficiency metric is the mean accuracy
of a three-fold cross-validation test on the data collection; when comparing
algorithms, it is this mean cross-validation value that is reported. The accuracy
increases by a small margin after running the optimization test. The default values
were found to be strongly efficient, and accurately identified the gap in precision
between the optimized algorithms. In the case of KNeighbors, the variance corresponds
to the difference in execution time shown in Fig. 4 against Logistic Regression, while
Fig. 5 represents the accuracy. The final results for accuracy and time are presented
in Table 2.
With approximately 93% precision, we find this to be very fair, bearing
in mind the relatively limited data collection analyzed. With the assessed model,
subcategories, which are much smaller subsets with less variation between categories,
also have very high accuracy. Several classifiers managed to score 81% or
more in the cross-validation test, which may mean that classifying receipts is a task
to which machine learning is very well suited, particularly because the optimization of the
hyper-parameters did not yield any major improvements, and compared to KNeighbors
we outperformed it by a large margin. We consider these findings very positive, but as
they derive only from a proof-of-concept solution, a wider analysis will be required to
validate them.

Fig. 4 Algorithm validation time

Fig. 5 Algorithms accuracy

Table 2 Algorithm accuracy and time

Algorithm             Accuracy (%)   Time (min)
Logistic regression   93             24.3
KNeighbours           81             37.5

This experiment requires the estimation of a numerical value. The Python script first
describes the data set structure, then fits the model and tests
it on the test dataset. Finally, we obtained 93% accuracy with Logistic Regression and
81% with KNeighbours. This was achieved with a basic custom-made approach
that has tremendous room for development, and it is predicted to do even better with
more complex solutions. As such, we find these findings very positive, though
determining a more sophisticated approach would require additional time and work.

5 Conclusion

The paper describes the analysis of different text extraction methodologies based on
machine learning algorithms. The research separates the content from a printed
invoice. The downside of the inspection is that, even if the picture contains a
document that is not a bill, such as a rectangular slip of paper or another object, the
inspection will still recognize the object, treat it as the required bill, and extract its
contents. The applied algorithms are Logistic Regression and KNeighbours, used on the
extracted information to predict the accuracy. For the sample images taken for
implementation, the observed accuracy is approximately 93% for Logistic Regression
and 81% for KNeighbours. The effort is to make people understand their bills and to
bring about some level of transparency in the market. We think these results are very
promising, but as these results only stem from a proof-of-concept solution, a larger
study might be needed to verify them. The findings of this work indicate that there
are many ways to enhance the efficiency of automated processing of receipts. The
program can even succeed in detecting bills that are mutilated, because experimental
findings suggest reasonably strong outcomes from local thresholding
techniques. The binarization methods require adaptation to the brightness; this step
is particularly necessary for the identification of characters. Speeding up the character
recognition remains to be analyzed.

Acknowledgements We are grateful to Vadivel Karuppannan and Shailesh Prabhu of VB Ideas
Private Limited, Bangalore for their assistance in problem formulation and periodic reviews.

References

1. Umam, A., Chuang, J.-H., Li, D.-L.: A Light Deep Learning Based Method for Bank Serial
Number Recognition (2018)
2. Rahmat, R.F., Gunawan, D., Faza, S., Haloho, N., Nababan, E.B.: Android-Based Text Recog-
nition on Receipt Bill for Tax Sampling System. Department of Information Technology,
Universitas Sumatera Utara, Medan, Indonesia (2018)
3. Sidhwa, H., Kulshrestha, S., Malhotra, S., Virmani, S.: Text extraction from bills and
invoices. In: International Conference on Advances in Computing, Communication Control
and Networking, pp. 564–568 (2018)
4. Mizan, C.M., Chakraborty, T., Karmakar, S.: Text recognition using image processing. Int. J.
Adv. Res. Comput. Sci. 8(5), 765–768 (2017)
5. Song, W., Deng, S.: Bank bill recognition based on an image processing. In: 2009 Third
International Conference on Genetic and Evolutionary Computing, pp. 569–573 (2009)
6. Jiang, F., Zhang, L.-J., Chen, H.: Automated image quality assessment for certificates and bills.
In: 1st International Conference on Cognitive Computing, pp. 1–5 (2017)
7. Sun, Y., Mao, X., Hong, S., Xu, W., Gui, G.: Template matching-based method for intelligent
invoice information identification. In: AI-Driven Big Data Processing: Theory, Methodology,
and Applications, vol. 7, pp. 28392–28401 (2019)
8. Holt, X., Chisholm, A.: Extracting structured data from invoices. In: Proceedings of
Australasian Language. Technology Association Workshop, pp. 53−59 (2018)
9. Guo, H., Qin, X., Liu, J., Han, J., Liu, J., Ding, E.: EATEN: Entity-aware Attention for Single
Shot Visual Text Extraction. Department of Computer Vision Technology (2019)
10. Riba, P., Dutta, A., Goldmann, L., Fornes, A., Ramos, O., Llados, J.: Table Detection in Invoice
Documents by Graph Neural Networks. Computer Vision Center (2018)
11. Abburu, V., Gupta, S., Rimitha, S.R., Mulimani, M., Koolagudi, S.G.: Currency recogni-
tion system using image processing. In: Tenth International Conference on Contemporary
Computing, pp. 10–12 (2017)
12. Raka, P., Agrwal, S., Kolhe, K., Karad, A., Pujeri, R.V., Thengade, A., Pujeri, U.: OCR to read
credit/debit card details to autofill forms on payment portals. Int. J. Res. Eng. Sci. Manag. 2(4),
478–481 (2019)
13. Kopeykina, L., Savchenko, A.V.: Automatic privacy detection in scanned document images
based on deep neural networks. In: International Russian Automation Conference (2019)
14. Meng, Y., Wang, R., Wang, J., Yang, J., Gui, G.: IRIS: smart phone aided intelligent reim-
bursement system using deep learning. In: College of Telecommunication and Information
Engineering, vol. 7, pp. 165635–165645 (2019)
15. Rahal, N., Tounsi, M., Benjlaiel, M., Alimi, A.M.: Information extraction from Arabic and
latin scanned invoices. In: IEEE 2nd International Workshop on Arabic and Derived Script
Analysis and Recognition, pp. 145–150. University of Sfax (2018)

16. Umam, A., Chuang, J.-H., Li, D.-L.: A Light Deep Learning Based Method for Bank Serial
Number Recognition. Department of Computer Science (2018)
17. Xu, L., Fan, W., Sun, J., Li, X.: A Knowledge-Based Table Recognition for Chinese Bank
Statement Images, pp. 3279–3283. Santoshi Naoi Fujitsu Research and Development Center,
Beijing (2016)
Parallel Enhanced Chaotic Model-Based
Integrity to Improve Security
and Privacy on HDFS

B. Madhuravani, N. Chandra Sekhar Reddy, and Boggula Lakshmi

Abstract This article presents how to provide security to large chunks of data. Data,
being crucial in all areas, must be protected during storage and retrieval. Large
volumes of data, called Big Data, are stored in Hadoop on HDFS file
systems. This research helps in understanding Big Data and provides a model
which improves security and efficient storage of data on HDFS. This paper uses
an enhanced quadratic chaotic map in the process of generating keys to improve
integrity. The model is implemented in a parallel manner to reduce the time taken in
verifying integrity. The model is implemented and tested on a Hadoop cluster,
which improved security and minimized time.

1 Introduction

Data, being fundamental in all areas, must be protected and manageable. As data is
increasing day by day with emerging trends such as the Internet of Things (IoT), it should be
processed and managed in an efficient manner. Big Data, an emerging technology in the
computer science field, manages data sets whose size makes conventional processing
impractical. The data can be in various forms: structured, semi-structured, and unstructured
(Fig. 1) [1].

1.1 Data-Information-Knowledge-Wisdom-Decision
(DIKWD) Pyramid

Computer systems are data processing machines which accept input, store
data on disks, process it, and produce results in the form of output. The data can have
different representations, which are given as the DIKWD pyramid (Fig. 2) [2].

B. Madhuravani (B) · N. Chandra Sekhar Reddy · B. Lakshmi


Department of Computer Science and Engineering, MLR Institute of Technology, Dundigal,
Hyderabad, Telangana, India


Fig. 1 Types of data

Fig. 2 DIKWD pyramid (levels: Data, Information, Knowledge, Wisdom, Decision)

Data is un-processed, un-organized, and un-structured; it is available in raw form.
Information is processed and structured data. Knowledge is information gained
through study and analysis, and experience together with knowledge helps to
make wise decisions.

1.2 Big Data

Big Data—large volumes of data—is an emerging trend in the field of computer science.
Real-world wireless sensor networks connect large numbers of devices
that store and produce large volumes of data. Hence there is a large demand for
storing and processing large data, and one solution to this is Big Data. Hence there is
a need for parallel processing of data, and the security issues for Big Data are also
heightened.

Fig. 3 Three V’s

The Big Data characteristics are defined as the three V's (Fig. 3) [3]: Volume,
Variety, and Velocity.

1.3 Hadoop

A solution to the Big Data problem is Hadoop (Fig. 4) [4]. It is the combination of
MapReduce and HDFS, where MapReduce does the processing and HDFS is used for storage.

Fig. 4 HADOOP

2 Enhanced Quadratic Map

The general equation for the quadratic map [5, 6] is given as

x_{n+1} = p x_n^2 + q t

Let p = 1 and q = 1:

x_{n+1} = x_n^2 + t

The fixed points satisfy x^(1) = (x^(1))^2 + t, i.e., (x^(1))^2 − x^(1) + t = 0, giving

x^(1) = 0.5 (1 ± √(1 − 4t))

For period-2 points,

x_{n+2} = x_{n+1}^2 + t = (x_n^2 + t)^2 + t = x_n^4 + 2t x_n^2 + t^2 + t

and setting x_{n+2} = x_n gives

x^4 + 2t x^2 − x + t^2 + t = (x^2 − x + t)(x^2 + x + t + 1) = 0

so that

x^(2) = 0.5 (−1 ± √(−3 − 4t))

Now let p = −1, q = 1, and t = s:

x_{n+1} = p x_n^2 + q t = s − x_n^2

with s ∈ (0, 2) and t ∈ (0, 1).
Algorithm Enhanced_Quadratic(upper_bound, lower_bound)
{
    for i = 1 to n do
    {
        point = (Random_Value * (upper_bound - lower_bound)) + lower_bound;
        s = point;
        t = Random_Value;
        x = Random_Value;
        plotting = s - power(x, 2);
    }
}
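For reference, a direct Python transcription of this pseudocode could be written as follows; n = 100 matches Table 1, and the bounds are caller-supplied.

import random

def enhanced_quadratic(upper_bound, lower_bound, n=100):
    points = []
    for _ in range(n):
        s = random.random() * (upper_bound - lower_bound) + lower_bound
        t = random.random()        # drawn as in the pseudocode, unused in the plotted value
        x = random.random()
        points.append(s - x ** 2)  # plotting = s - power(x, 2)
    return points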

The plotting data is presented in Table 1, and the corresponding plot is given in Fig. 5.

3 Proposed Model

3.1 Big Data Security System

The Big Data security system [7] is depicted in Fig. 6. The model takes input from
different Big Data sources and processes the input data while storing it in and retrieving
it from HDFS. HDFS uses privacy and security services to overcome several challenging
security issues.

Table 1 Plotting data


X Y X Y X Y X Y X Y
1 0.232875 21 0.058198 41 0.195583 61 0.024572 81 0.072284
2 0.166929 22 0.080503 42 0.068599 62 −0.06414 82 0.339254
3 0.025129 23 0.31357 43 0.096259 63 0.312608 83 0.414752
4 0.344497 24 0.294245 44 0.278975 64 0.23461 84 0.072936
5 0.215562 25 0.001356 45 −0.10992 65 0.021318 85 0.236616
6 0.162311 26 0.217931 46 0.361283 66 −0.22555 86 0.345707
7 0.047713 27 0.410416 47 0.057775 67 0.136593 87 0.177242
8 0.209773 28 0.040958 48 0.494532 68 0.362841 88 0.152431
9 0.26003 29 0.289463 49 −0.06218 69 0.124408 89 0.322687
10 0.099767 30 0.123042 50 0.291799 70 0.153504 90 0.298568
11 −0.18155 31 0.301667 51 −0.00484 71 0.108776 91 0.216872
12 0.290262 32 0.004594 52 0.13529 72 0.45248 92 0.110915
13 0.268044 33 −0.10173 53 0.275352 73 −0.17845 93 0.165477
14 −0.16444 34 −0.03964 54 0.162848 74 0.063284 94 0.335702
15 0.002755 35 0.470395 55 0.139831 75 0.301267 95 0.056919
16 0.389281 36 −0.08801 56 −0.05347 76 0.178351 96 0.275423
17 −0.02248 37 −0.03638 57 0.117242 77 0.36466 97 0.285088
18 0.259248 38 0.247029 58 −0.07157 78 −0.08031 98 0.065591
19 0.282307 39 0.301248 59 −0.13844 79 0.116092 99 0.102762
20 −0.05414 40 0.257316 60 0.367234 80 −0.08216 100 0.295698

Fig. 5 Enhanced quadratic map plotting

3.2 Proposed Parallel Processing Authentication Model

The proposed authentication model uses the enhanced quadratic chaotic map to generate
pseudorandom numbers based on the system/node identity. The data from Big
Data sources is stored in and retrieved from HDFS using the proposed authentication
model, where the storage and processing can be performed in parallel, which in turn
improves the security and storage efficiency [8] (Fig. 7).
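A minimal sketch of deriving a node-identity-seeded keystream from the enhanced quadratic map is given below; the SHA-256 seed derivation, the control value s = 1.7, and the byte extraction are our assumptions, not the exact scheme of the proposed model.

import hashlib

def chaotic_keystream(node_id, length, s=1.7):
    digest = hashlib.sha256(node_id.encode()).digest()
    x = int.from_bytes(digest[:8], "big") / 2 ** 64   # seed x0 in (0, 1) from node identity
    out = bytearray()
    for _ in range(length):
        x = s - x * x                                 # enhanced quadratic map iteration
        out.append(int(abs(x) * 255) & 0xFF)          # fold the orbit into a byte
    return bytes(out)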

Fig. 6 Big data security system

Fig. 7 Proposed parallel processing authentication model

4 Experimental Results

The system is implemented and analyzed on the Java platform for a comparative analysis of
the time taken to compute and verify integrity with traditional hash functions
versus the proposed quadratic chaotic hash function. The results (Table 2, Fig. 8) show
that the proposed model takes less time for integrity verification, which in turn improves
the lifetime and energy of sensor nodes.

Table 2 Comparative analysis

Integrity algorithm                                          Execution time (ms)
MD5                                                          8963
SHA-256                                                      7924
SHA-384                                                      7612
SHA-512                                                      7419
Proposed parallel enhanced quadratic chaotic hash function   6854

Fig. 8 Comparative analysis in terms of execution time

5 Conclusion

This paper presented a model which improves security and lifetime through a parallel
enhanced quadratic chaotic hash model. The execution time of the system is clearly
improved with the parallel processing approach. Files are encoded before their data is
stored into HDFS. The experimental results proved that the system is efficient in terms
of execution time.

References

1. https://www.datamation.com/big-data/structured-vs-unstructured-data.html
2. https://en.wikipedia.org/wiki/DIKW_pyramid
3. https://www.zdnet.com/article/volume-velocity-and-variety-understanding-the-three-vs-of-
big-data/

4. White, T.: Hadoop: The Definitive Guide. O'Reilly Media (2009); Borthakur, D.: HDFS
Architecture Guide. Hadoop Apache Project. https://hadoop.Apach
5. Kanso, A., Yahyaoui, H., Almulla, M.: Keyed hash function based on a chaotic map. Inf. Sci.
186, 249–264 (2012)
6. Madhuravani, B., Murthy, D.S.R.: A hybrid parallel hash model based on multi-chaotic maps
for mobile data security. J. Theor. Appl. Inf. Technol. 94(2), (2017). ISSN: 1992-8645
7. Sai Prasad, K., Chandra Sekhar Reddy, N., Rama, B., Soujanya, A., Ganesh, D.: Analyzing and
predicting academic performance of students using data mining techniques. J. Adv. Res. Dyn.
Control Syst. 10(7), 259–266 (2018)
8. Chandra Sekhar Reddy, N., Chandra Rao Vemuri, P., Govardhan, A., Navya, K.: An implemen-
tation of novel feature subset selection algorithm for IDS in mobile networks. Int. J. Adv. Trends
Comput. Sci. Eng. 8(5), 2132–2141 (2019). ISSN 2278-3091
Exploring the Fog Computing
Technology in Development of IoT
Applications

Chaitanya Nukala, Varagiri Shailaja, A. V. Lakshmi Prasuna, and B. Swetha

Abstract In the 21st century, IoT is playing a significant part in creating smart
cities. With the growth of IoT, data is growing at increasing
speed, and as the data grows, the need to store it also
increases. The more data there is, the higher the latency to store and retrieve
data from the cloud. The idea of fog computing was initiated to reduce the
latency of accessing data to and from the cloud. Fog computing provides
storage, computing, and networking services at the edge of the network.
Fog nodes, however, have limited computational abilities.
Because of certain shortcomings, fog computing and cloud computing cannot
stand alone, so both of these technologies are integrated to build a smart IoT
foundation for a smart city. Fog computing has a significant role and a prime
responsibility in the development of a smart city. This paper examines various applications
of fog computing and their usage in smart cities. It also proposes a model for a
waste management framework in a city; fog computing can help to manage
the waste collection of the city in a smart manner. Based on our survey, a few open
concerns and challenges of fog computing are examined, and directions for future
researchers are also discussed.

C. Nukala (B)
Department of CSE, RGM College of Engineering and Technology (Autonomous), Nandyal,
Andra Pradesh 518501, India
V. Shailaja
Department of IT, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad,
Telangana 500090, India
A. V. Lakshmi Prasuna · B. Swetha
Department of IT, Mahatma Gandhi Institute of Technology, Gandipet, Hyderabad, Telangana
500075, India


1 Introduction

A growing number of physical objects are being connected to the IoT [1]. It is the
interconnection of various physical entities that communicate and exchange data:
sensors, smart meters, telephones and vehicles, radio-frequency identification
(RFID) tags, and actuators [2]. The interconnection of these devices enables
smart IoT applications like tracking trucked merchandise online, environment
monitoring, healthcare, smart homes, and smart grids.
IoT devices generate a great deal of data, which requires tremendous computing facilities,
storage space, and communication bandwidth. Cisco said 50 billion
devices could be connected to the Internet by 2020 [3], and this will grow to 500 billion by
2025 [4].
The idea of fog computing was presented by Cisco in the year 2012 [5]. The initial
objective of fog computing is to improve productivity and to reduce the
volume of data which is moved to the cloud for processing. For easy management
of all resources, the fog layer acts as an intermediary between devices and cloud data centers.
Fog computing can offer services in various areas, i.e., surveillance,
the transportation sector, smart cities, healthcare, and smart buildings.
Fog computing can be used in different kinds of IoT services [5–7]. First,
a Smart E-Health Gateway can be used for patients to monitor their health
status [8]; an emergency alarm can be activated and alerts sent to the owner [9]. A
fog-based Electronic Data Interchange (EDI) is a comprehensive virtualized device
equipped with storage, transmission, and computing capacity [5].
Further sections of the article are organized as follows: Sect. 2 contains the motivation
of the study, Sect. 3 explains the layered architecture of IoT and fog
computing and how it can work for smart IoT applications, Sect. 4 examines
applications of IoT and fog computing, Sect. 5 explains load balancing in a fog computing
environment, Sect. 6 discusses the proposed methodology, Sect. 7 covers open
issues and encouragements, and finally Sect. 8 concludes the paper and explains the
scope for future work.

2 Motivation

This manuscript gives a short discussion of the application territories of
IoT and FC. The fundamental ideas of IoT and FC are discussed, including the
advantages, disadvantages, and architecture of FC.
The fog layer requires load balancing to achieve resource efficiency, avoid
overload in the system, improve system performance, and also
protect the system against failures. The motivation of this paper is
to investigate a smart waste management framework which incorporates load
balancing on the fog layer. This paper also discusses various open issues and
challenges faced in fog environments.

Fig. 1 Layers of IoT architecture [10]

3 Layered Framework of IoT and FC

3.1 Layered Framework of IoT

IoT is an arrangement of devices that send, share, and use data from the physical
environment to offer services to individuals, enterprises, and society. The
basic three-layer architecture is shown in Fig. 1 [10].
(a) Sensor layer: This layer senses and collects information from the environment.
Sensors, barcode labels, RFID tags, GPS, cameras, and actuators are present in
this layer.
(b) Network layer: This layer is used to gather the information from the sensors and
sends it to the Internet. The network layer is responsible for interfacing with other smart things,
network devices, and servers. Its features are also used for communicating
and handling sensor data. Using various technologies, different
kinds of protocols and heterogeneous networks are brought together.
(c) Middleware layer: This layer receives information from the network layer. Its purpose
is service management, data flow management, and security
control. It also performs data handling and takes decisions
automatically based on the results.
(d) Application layer: It receives information from the middleware layer and provides
overall management of the application.

3.2 Layered Architecture of FC

FC, a class of CC, has a three-layer architecture. The lower layer
contains IoT devices; the fog layer is the middle layer; and IoT devices are coupled to the
cloud layer through the fog layer. All devices store their data on the cloud. The
fog layer filters data, and the data which is not immediately needed is
diverted to the cloud, while frequently accessed data is stored on the
fog layer. The layered design of FC is explained below; Fig. 2 shows the architecture
of FC.

Fig. 2 Architecture of Fog computing

• Smart IoT devices: Thousands of sensor nodes and embedded systems
having low bandwidth and low latency are used at this layer. Smart
devices like smart buildings, smartphones, laptops, smart power bulbs,
smart vehicles, and so forth can be considered IoT devices which gather the
data and send it to the fog layer [14].
• Fog layer: The fog network layer is further divided into two parts:
the fog network and the core network.
Fog network: It incorporates 3G/4G/5G/LTE/Wi-Fi and other multi-edge
services that are used to connect different sensing devices with the fog nodes.
Fog nodes are used to filter the data gathered by IoT devices,
and data that is not regularly used is diverted to the upper layer, i.e., the cloud
layer.
Core network: QoS, Multiprotocol Label Switching (MPLS), multicast,
and security are considered at this stage [11].
• Cloud layer: It incorporates a great number of data centers and cloud-hosted
IoT analytics. The colossal data gathered through various IoT
devices is stored in huge data centers situated at various locations around the
world.

4 Applications of FC and IoT

Fog works in a distributed environment. It offers services to the end user at the
edge device. FC has various application areas where it can be integrated with
IoT, discussed as follows:
a. Smart healthcare: Because of the polluted environment, various kinds of micro-
organisms spread through the air and cause various illnesses.
Everyone is busy today because of fast lifestyles. Smart
healthcare uses smart IoT to monitor people's activities and
measure various parameters of their bodies, continuously uploading the data
to the fog nodes, where it is observed by specialists. The
data stored on the fog nodes is used by doctors
to treat the patients in time. People wear intelligent
devices, and these are also connected to fog nodes, which
continuously receive the body-parameter measurements (temperature,
pulse, and so on). These intelligent wearable devices help
to monitor individuals' wellbeing [7].
b. Smart parking: Because of the large increase in transportation in cities,
more parking spaces are required, and people have to
wander around to locate a fitting parking place. FC introduced
the novel idea of smart parking. With the use of FC, parking
spaces can be fitted with sensors which keep track of whether a
parking slot is vacant or full [12].
c. Smart agriculture: Agriculture is a wide area deserving attention,
since it is where all the cities get their food. The smart agriculture concept has been
introduced in the last few years, with smart sensing and computing playing a critical
role in it. A few smart agriculture approaches
have been devised in this area: smart irrigation systems have been
provided, and smart sensors have been installed in the fields [13]. FC gives a
platform for operating the sensors in the fields, which continuously watch
the crops. The sensors identify the needs of the crops and persistently store
data in the fog nodes, which send alerts to the farmers about the
requirements of the crops. Smart agriculture plays an essential function
in building a smart city.
d. Smart waste management: The environment is degrading step by step, and
to preserve the planet we require sharp attention to natural resources.
Nowadays, daily growing waste and water depletion from the earth
demand more attention. A smart garbage management framework can be named
among the solutions for the improvement of the environment in this era. Such smart frameworks
can be developed with the assistance of cloud as well as FC to collect and manage
the waste more proficiently [8].

Fig. 3 Balancing of load in network

5 Load Balancing in FC Environment

In a system, scarcely any frameworks stay under-stacked sooner or later stretch, while
the others convey the whole heap of the system. To keep up the heap in a reasonable
plan, “Burden Balancing” gets vital. “Burden Balancing endeavors to disperse the
heap in indistinguishable extents all through assets relying upon response capacity
all together that each valuable asset isn’t over-burden or underutilized in a cloud
device” [14]. Burden adjusting additionally needed to be done to dodge halt and
diminish the worker flood issue. Figure 3 shows the heap adjusting.
A portion of the objectives of mist based burden adjusting are talked about as
underneath:
• In instance of failure of system, the balancing of load offers plans for backup.
• Consequently, performance could be enhanced.
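As a minimal dispatch sketch (our illustration, not a scheme prescribed by the cited works), a fog broker could always hand a task to the least-loaded node:

def dispatch(task_cost, loads):
    # loads: dict mapping fog node id -> current load.
    node = min(loads, key=loads.get)   # pick the least-loaded node
    loads[node] += task_cost           # so no node is overloaded while others idle
    return node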

6 Projected Methodology

A novel waste management network is projected in this manuscript by
balancing the load at the fog layer. The projected method comprises three
levels: sensor-equipped devices, the fog layer, and cloud data centers. The smart bins
contain sensors and are tied to the fog layer, which filters the data
that has to be transferred to the cloud. The load balancer adapts the load of the fog
nodes and distributes a similar load across all nodes.
These fog nodes generate messages to advise the trash carriers about waste
collection. Security sensors are installed in the canisters to report on the
receptacles; in case anybody tries to tamper with a canister, these sensors sound
an alarm, which safeguards the receptacles. Further, the smart containers are linked
by 3G or 4G to the fog layer. An application is to be built to present the overall
situation. The following Fig. 4 shows the proposed load balancing model.

Fig. 4 Load balancing model
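A hedged sketch of this flow follows; the bin identifiers, the 80% fill threshold, and the notification hooks are illustrative assumptions rather than values fixed by the proposal.

FILL_THRESHOLD = 0.8   # assumed trigger level

def notify_carrier(bin_id):
    print(f"alert trash carrier: bin {bin_id} needs collection")

def notify_security(bin_id):
    print(f"tamper alarm raised for bin {bin_id}")

def fog_node_ingest(bin_id, fill_level, tampered, cloud_batch):
    if tampered:
        notify_security(bin_id)                  # canister tampering alarm
    if fill_level >= FILL_THRESHOLD:
        notify_carrier(bin_id)                   # message the trash carrier
    cloud_batch.append((bin_id, fill_level))     # filtered data forwarded to the cloud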

7 Open Issues and Encouragements

There are divergent open issues which could be worked on further; the
following can be examined:
• Interaction among nodes: the architecture by which fog nodes
interact with one another can be explored further.
• Recognition devices: smart devices are more expensive than other
basic market components. They should be obtainable without
any issue.
• Security: it is a significant issue in this contemporary world. FC requires more
security computations for real-time practice.
• Load balancing: it remains a concern in fog environments, as it causes challenges when a pair of servers
are loaded very heavily while other servers are loaded lightly.
• Energy effectiveness: energy effectiveness is a prominent challenge in FC;
the power used by the fog has to be lessened.

8 Conclusion and Future Scope

The objective of this manuscript is to portray the joining of IoT and FC for assisting divergent
implementations. In addition, load balancing has been proposed, and a couple
of applications, including smart farming, smart healthcare, smart parking, and
smart waste management, were showcased. The basic purpose behind this survey is to give
a significant perception of the advantages of IoT and its fusion with fog/edge computing,
and of how load balancing can be done when FC is integrated with IoT. There
are still more open areas for future analysts, for example, fog organization, resource
provisioning, and greater advancement in transportation.

References

1. Al-Fuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., Ayyash, M.: Internet of things:
a survey on enabling technologies, protocols, and applications. IEEE Commun. Surv. Tutor.
17(4), 2347–2376 (2015)
2. Gubbia, J., Buyyab, R., Marusica, S., Palaniswamia, M.: Internet of things (IoT): a vision,
architectural elements, and future directions. Futur. Gener. Comp. Syst. 29(7), 1645–1660
(2013)
3. Evans, D.: The Internet of Things: How the Next Evolution of the Internet is Changing
Everything. Cisco White Paper (2011)
4. Camhi, J.: Former Cisco CEO John Chambers Predicts 500 Billion Connected Devices by
2025. Business Insider (2015)
5. Luan, T.H., Gao, L., Xiang, Y., Li, Z., Sun, L.: Fog Computing: Focusing on Mobile Users at
the Edge. arXiv preprint arXiv:1502.01815 (2015)
6. Emilia, R., Naranjo, P.G.V., Shojafar, M., Vaca-cardenas, L.: Big Data Over SmartGrid—A
Fog Computing Perspective Big Data Over SmartGrid—A Fog Computing Perspective (2016)
7. Khan, S., Parkinson, S., Qin, Y.: Fog computing security: a review of current applications and
security solutions. J. Cloud Comput. 6(1) (2017)
8. Mahmud, R., Kotagiri, R., Buyya, R.: Fog Computing: A Taxonomy, Survey and Future
Directions, pp. 1–28 (2016)
9. Desikan, K.E.S., Srinivasan, M., Murthy, C.S.R.: A novel distributed latency-aware data
processing in fog computing—enabled IoT networks. In: Proceedings of the ACM Workshop
on Distributed Information Processing in Wireless Networks—DIPWN’17, pp. 1–6 (2017)
10. Chi, Q., Yan, H., Zhang, C., Pang, Z., Xu, L.D.: A reconfigurable smart sensor interface for
industrial WSN in IoT environment. IEEE Trans. Ind. Inform. 10(2) (2014)
11. Chiang, M., Zhang, T.: Fog an IoT: an overview of research opportunities. IEEE Internet Things
J. 3(6), 854–864 (2016)
12. Perera, C., Qin, Y., Estrella, J.C., Reiff-Marganiec, S., Vasilakos, A.V.: Fog Computing for
Sustainable Smart Cities: A Survey (2017)

13. Varshney, P., Simmhan, Y.: Demystifying Fog Computing: Characterizing Architectures, Appli-
cations and Abstractions. In: Proceedings—2017 IEEE 1st International Conference of Fog
Edge Computing ICFEC, pp. 115–124 (2017)
14. Dastjerdi, A.V., Gupta, H., Calheiros, R.N., Ghosh, S.K.: Fog Computing: Principles,
Architectures, and Applications, pp. 1–26 (2016)
NavRobotVac: A Navigational Robotic
Vacuum Cleaner Using Raspberry Pi
and Python

Shaik Abdul Nabi and Mettu Krishna Vardhan

Abstract Human life is becoming more advanced day by day and the use of tech-
nology is making our lives easier. In this light, one of our daily tasks is
to keep our surroundings clean, traditionally by using a broomstick; as technology
increased day by day we moved to manual vacuum cleaners and then to robotic
vacuum cleaners. Among robotic vacuum cleaners there exist many different
methods used to clean homes or hotels. The NavRobotVac (Navigational
Robotic Vacuum cleaner) model considers the science of architecture (vastu) and
is implemented using Python programming with a Raspberry Pi. It automatically scans
the area around it and starts cleaning. In this model, IoT sensors are used to read real-time
data, which gives both technical efficiency and economic efficiency.

1 Introduction

Vacuum cleaners were developed in the mid-1920s, and since then different designs
and features have appeared [1]. In 1996 the first robotic vacuum cleaner was developed using
programming. NavRobotVac cleans your home even when you are out of your home,
reduces work, and saves more time for the household [2]. There are many companies
that develop vacuum cleaners using different methods at very high cost [3]. For
developing a robotic vacuum cleaner there are mainly two different tasks. First,
scanning the area which needs to be cleaned: we scan the area at ground level
using the sensors surrounding the robot and process the readings on a Raspberry Pi using the Python
language, following a grid mechanism where each cube in the grid is sized according to
the robot's size and the distance it moves; the vacuum fan and two sweeping motors are turned off
while scanning, using a relay module. Second, using the scanned data, the robot starts the
cleaning process from its initial position, remembers the area it has already cleaned so
that no redundant cleaning of the same location takes place, and stores the dust in a
removable bin.

S. A. Nabi (B)
Department of CSE, Sreyas Institute of Engineering and Technology, Hyderabad, Telangana, India
e-mail: [email protected]
M. K. Vardhan
Sreyas Institute of Engineering and Technology, Hyderabad, Telangana, India


that no redundant cleaning takes place of same location, stores the dust in a removable
bin. Scanning and cleaning both follow different algorithms depending upon input
given by sensors. Developed a better algorithm for an efficient cleaning process as
we are making it cheap by using ultrasonic sensor, raspberry pi, micro gear motors,
line tracking sensor, compass sensor, relay module, and DC motor.
Robotic vacuum cleaners have reached a good amount of success within the
domestic market. The iRobot Corporation (one of the most popular players) claims
to sold 6 million units of their products within one decade [4].
According to the statistics of the International Federation of Robotics [5], about 2.5
million personal and service robots were sold in 2011, an increase of 15% in numbers
(19% in value) compared to 2010 [6]. This statistical figures clearly highlights the
increasing demand of domestic robots at our homes, which encourages to creates new
interaction patterns. In addition to this, the demand for the operation of many new
cleaning robots will follow the same tendency. Because of that, we have designed
our robot with effectively in terms of price and time.

2 Related Work

An existing robotic [7] vacuum cleaner had some drawbacks, like colliding with
obstacles and stopping at a short distance from walls and other objects. It was not
able to reach all corners and edges of the room and left those areas unclean [8].
One design and implementation of a smart floor cleaning robot uses a microcontroller
[9], where data cannot be stored; it is difficult to determine which places have and have
not been cleaned, and action is performed only based on the current data read
by the sensors.
In the existing vacuum cleaner, the robot does not know its current position
or the direction in which it is moving; if any obstacle occurs, the robot turns in the direction
with more free space even if it has already cleaned that location.

3 NavRobotVac Design

NavRobotVac uses a grid mechanism for the robot's virtual view; the size of each
grid cell depends upon the dimensions of the body and wheels. The robot's length and width are
30 × 30 cm and the wheel radius is 1.5 cm, from which each grid cell is taken as
10 × 10 cm. The grid mechanism of NavRobotVac is shown in Fig. 1. The robot uses a
compass sensor to know its direction for easy scanning, stores the area, and performs
the next operation based upon the direction and the data given by the ultrasonic sensors.

Fig. 1 Grid mechanism of NavRobotVac

The proposed system is designed in such a way that it meets user
needs and also reaches a large number of users. It consists of two wheels of radius
1.5 cm which are placed exactly in the center; for one complete rotation of the wheels
the robot moves 10 cm, which is taken as the size of a cell in the grid. A line

detecting sensor is used to identify one complete rotation of a wheel by sensing a


white line on black wheel and based on the movement the current location of the
robot is updated continuously. The robot rotates very smoothly for the turn in the
same place. A 360° rotatable wheel is fixed at the back for balancing the robot. Speed
of the wheel varies based on the function it performs. If the robot decides to turn
left or right then wheels are rotated in the opposite direction to each other so that the
robot stays in the same position and only direction gets changed. Robot stops only
if it completes its task or if the pause button is pressed. The design of NavRobotVac
is shown in Fig. 2.
The robot has seven ultrasonic sensors on each side except the back; the compass
sensor is fixed facing the north direction and the line detecting sensor is fixed beside a wheel.
Along with the relay module, all the sensors are connected to the Raspberry Pi for power
supply and data transfer; the processed data gives instructions to
the wheels connected to micro gear motors, to the fan connected to a DC motor to
suck the dust, and to the rotational brushes fixed to micro gear motors. A 12 V power supply
is given to the relay and 5 V is supplied to the motor driver by reducing the voltage using
a voltage regulator. The detailed circuit diagram of NavRobotVac is shown in Fig. 3.
The robot has start and pause buttons and a switch for the main power supply. When
the switch is on, all the sensors and the processor are turned on. When the start button is
pressed with the robot placed at the center of the room for the first time, the robot starts
scanning in a spiral pattern in an anti-clockwise direction. Once the scanning is
done, the scanned data is stored in the Raspberry Pi memory. Immediately after the scanning,
the robot starts the cleaning process; cleaning is done every 24 h (fixed in the program).
The pause button is used to stop the cleaning or scanning process until it is pressed again.

Fig. 2 Robot design

Fig. 3 Circuit diagram of NavRobotVac



4 Functionality

It mainly consists of two functionalities: the first is scanning and the second is
cleaning. If any obstacle appears in the process of scanning or cleaning, the robot
performs an operation based on the current input and previous data if present (the NPY file)
and continues its operation. Detailed descriptions of these two functionalities are
given as follows.

4.1 Scanning Algorithm

A matrix is created and saved in an NPY file before the scanning process begins, to
record the area of the location. The robot then turns towards the north direction and
moves until it reaches a wall, turns towards the left, i.e., west, and begins
scanning using the ultrasonic sensors, processing the data on the Raspberry Pi and updating the
file. The robot first finishes scanning the borders of the location and then scans inside the
scanned area; we followed a spiral pattern for the scanning process. The algorithm
checks the file for any area which has not been scanned and moves towards that location
if it finds one; if not, the scanning process stops. Then all the extra rows and columns
in the matrix which are useless are removed from the file, storing only the scanned
area.
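The grid bookkeeping can be sketched in Python with numpy as below, using the 0/1/2 encoding reported in Sect. 5; the scratch-matrix size is a placeholder, not the robot's actual value.

import numpy as np

OUT, FREE, OBSTACLE = 0, 1, 2                # encoding used in the saved NPY file

grid = np.zeros((60, 60), dtype=np.uint8)    # over-sized scratch matrix

def mark(row, col, value):
    grid[row, col] = value                   # updated as the robot scans

def save_scan(path="area.npy"):
    rows = np.any(grid != OUT, axis=1)       # drop the useless extra rows ...
    cols = np.any(grid != OUT, axis=0)       # ... and columns
    np.save(path, grid[rows][:, cols])       # store only the scanned area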

4.2 Cleaning Algorithm

The cleaning algorithm uses the NPY file saved by the scanning algorithm for the location
that needs to be cleaned and makes a copy of the file to note the places already
cleaned while cleaning. To start the cleaning process the robot needs to go to its
initial position, which is the south-west corner; a method is executed to find the
initial position and move to that location. The robot then cleans along the west wall
from south to north; after reaching the north wall, the robot turns east, moves forward, and
turns towards the south. Using the same method, the cleaning process proceeds from west to
east, so the cleaning algorithm follows a parallel pattern. The process begins after the
scheduled time has elapsed.
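A sketch of this west-to-east sweep over the saved scan is shown below; move_to is a hypothetical motor-control hook standing in for the actual navigation routine.

import numpy as np

def move_to(row, col):
    pass   # hypothetical hook: drive the wheels to grid cell (row, col)

def clean(path="area.npy"):
    area = np.load(path)
    done = np.zeros_like(area)               # copy used to note cleaned cells
    rows, cols = area.shape
    for c in range(cols):                    # columns swept west -> east
        order = range(rows - 1, -1, -1) if c % 2 == 0 else range(rows)
        for r in order:                      # alternate south <-> north passes
            if area[r, c] == 1 and not done[r, c]:
                move_to(r, c)
                done[r, c] = 1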
NavRobotVac is implemented using the Python programming language; execution
is done on a Raspberry Pi containing the processor and the Raspbian operating system. The
program is kept in the bash startup location so that execution starts automatically as soon as
the power turns on. The basic functionality of the product is shown in Fig. 4.
Fig. 4 Functionality of NavRobotVac

4.3 Obstacles Avoidance

During scanning, when any obstacle appears, the robot marks it as a non-cleaning
area. As the scanning pattern is a spiral in the anti-clockwise direction, the
robot moves towards its left and continues scanning.
While the cleaning process is going on, if any object appears in front of the
robot, it passes the obstacle on its right side and marks that area as a non-cleaning
area. On the next pass, when the same area is free, the robot moves
towards that area and updates the NPY file to mark the space as empty and needing to be
cleaned.

5 Results and Discussions

The aim of the NavRobotVac model is to advance homemade robotic vacuum cleaners
for domestic needs with reference to numerous features: usability, apparent convenience,
and style. These are the basic factors for the adoption of advanced technology
to meet domestic needs. By studying users' needs over time and
their methods of employing a robotic vacuum cleaner, we can
reach their expectations, meet their true needs, and take them into consideration
while evolving these types of systems with advanced technologies.
The result of our proposed system is shown in Fig. 5.
The proposed system considered a sample area with five corners to test the robot,
as shown in Fig. 6. The robot starts facing north and finishes after scanning the
entire enclosed area. The result can be seen in an NPY file which contains 0s, 1s, and
2s: 0 says that the area is outside the border, 1 represents an empty area to be cleaned, and 2
represents an obstacle in the field. An example of the output of the proposed model is shown
in Fig. 7.
In the existing system, the robot works on a microcontroller as the central unit, which only
acts on the present input; this leads to redundant cleaning of the same location. NavRobotVac
instead uses a Raspberry Pi as the central unit, which gives instructions
by taking both the present input and the previous data given by the sensors.

Fig. 5 NavRobotVac

Fig. 6 Sample area with 5 corners

Fig. 7 Sample area output

Compared with the existing system, in which robots store the mapped area in a file
requiring additional memory, the proposed system stores the mapped area
in NPY format, which consumes less memory and is among the fastest methods of loading
data. The robot is designed in an octagonal shape to minimize its size and allow easy
movement, and it is built at minimal cost since there is no extra
equipment such as a camera.

6 Conclusion and Future Scope

Due to the revolution in technology, robots used for domestic purposes
have been transformed from the straightforward “random-walk” methodology to more
advanced navigation systems, making the technology domestic at a reasonable
cost.
With this user-friendly approach, the developed product efficiently ensures scanning
of a new area which needs to be cleaned and stores the scanned area as an NPY
file in the memory of the Raspberry Pi. The cleaning process executes after every time
period has elapsed, simultaneously updating the area file if any obstacles are found.
By implementing this system, a user is freed from keeping the home clean, as that
work is done by the robotic vacuum cleaner. The user only needs to set a time gap to activate
the cleaning process. After cleaning is done, the user takes the dustbin from the robot,
cleans it, and fixes it back for the next run.
In future, the vacuum cleaner NavRobotVac can be extended with many features like
mopping and carpet cleaning, and an application can be designed for the robot to keep
track of it.

Acknowledgements We would like to thank everyone, who has motivated and supported us for
preparing this Manuscript.

References

1. Mitchel, N.: The Lazy Person's Guide to a Happy Home: Tips for People Who (Really) Hate
Cleaning, 9 Jan 2016 [Online]. Available: https://www.apartmenttherapy.com/thelazy-personsguide-to-a-happy-home-cleaning-tips-forpeople-who-reallyhate-cleaning-197266. Accessed 22
June 2016
2. Prassler, E., Munich, M.E., Pirjanian, P., Kosuge, K.: Domestic robotics. In: Springer Handbook
of Robotics, pp. 1729–1758. Springer International Publishing (2016)
3. Hong, Y., Sun, R., Lin, R., Shumei, Y., Sun, L.: Mopping module design and experiments of
a multifunction floor cleaning robot. In: 2014 11th World Congress on Intelligent Control and
Automation (WCICA), pp. 5097–5102. IEEE (2014)
4. C.o. iRobot: iRobot: Our History, 29 May 2015. [Online]. Available: https://www.irobot.com/
About-iRobot/CompanyInformation/History.aspx
5. Adkins, J.: ‘History of Robotic Vaccum Cleaner, 21 June 2014 [Online]. Available: https://prezi.
com/pmn30uytu74g/history-ofthe-robotic-vacuum-cleaner/. Accessed 14. 04. 2015
6. Kapila, V.: Introduction to Robotics (2012). https://engineering.nyu.edu/mechatronics/smart/
pdf/Intro2Robotics.pdf
7. Shashank, R., et al.: Shuddhi—A Cleaning Agent. Int. J. Innov. Technol. Explor. Eng. (IJITEE)
9(2S) (2019). ISSN: 2278-3075
8. Frolizzi, J., Disalvo, C.: Service robots in the domestic environment: A study of Roomba vacuum
in the home. In: International Conference on Human Robot Interaction HRI, pp. 258–265 (2006)
9. Appliance, C.: How to Choose the Best Vacuum Cleaner, 23 Feb 2016 [Online]. Available:
https://learn.compactappliance.com/vacuum-cleaner-guide/. Accessed 22 June 2016
A Hybrid Clinical Data Predication
Approach Using Modified PSO

P. S. V. Srinivasa Rao, Mekala Srinivasa Rao, and Ranga Swamy Sirisati

Abstract Enhancement of diagnostic predictive mechanisms in Clinical Decision
Support Systems (CDSS) is performed through improving disease staging predic-
tions, predicting illness progression, and reducing the number of features in high-
dimensional clinical data sets. All the work in this paper uses benchmark datasets
from the University of California Irvine machine learning repository for verification.
Clinical datasets are obtained through various diagnostic procedures, and the data
from different sources is transformed into a single format before the start of the
analysis. Since there are many diagnostic procedures, the transformed data usually
has a large number of features. Preliminary data cleaning procedures like duplicate
removal, noise dismissal, and filling in of missing data are performed initially. Data
reduction techniques are applied over this cleaned data. A dataset can be large mainly
for two reasons: too many instances or too many attributes. Once the noisy and
duplicate instances are removed, having many instances usually generates an
accurate classifier. However, having too many features is a negative aspect of data
analytics, because the unimportant features of the dataset tend to bring down the
accuracy of the classifier produced from it. Reducing the number of features
to an optimal level to enhance the classifier's accuracy is the aim of feature reduction
algorithms. Every disease has various stages of severity. In this paper, an effective
decision support system for the clinical data prediction approach is proposed based
on a hybrid technique combining the C4.5 decision tree with Particle Swarm
Optimization (PSO), evaluated on various diseases.

P. S. V. Srinivasa Rao (B)


Vignan’s Institute of Management and Technology for Women, Kondapur, Ghatkesar Mandal,
Telangana, India
M. S. Rao
Department of CSE, Lakireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh,
India
R. S. Sirisati
Department of CSE, Vignan’s Institute of Management and Technology for Women, Kondapur,
Ghatkesar Mandal, Telangana, India


1 Introduction

Every country in the world needs healthy individuals in order to have an efficient human resource. Improving the healthcare field requires more than personnel support: with the active support of software tools, hardware equipment, and rich information, healthcare domains could see unprecedented change. In recent years, data prediction approaches designed primarily to help during the treatment phase have gained momentum, but data prediction approaches that support the diagnosis phase are still in rudimentary form. As machine learning and other data analytics algorithms mature in the marketplace, research on data prediction in clinical diagnosis is gaining ground. With that in mind, the problem statement objectives are formulated [1, 2]. The Clinical Data Prediction Approach (CDPA) assists physicians and other healthcare professionals in their decision-making process. According to Robert Hayward, clinical decision support systems link health observations with health knowledge to influence health choices. A CDPA uses Computational Intelligence (CI) techniques as a means to make decisions and draws on various types of patient data during decision making. CDPA prominently finds its usage in the treatment phase, where it serves as an alerting/alarming system. Integrated with Internet of Things (IoT) technologies, CDPA supports treating personnel during hospitalization through rule-based drug delivery, emergency vital alert systems, and continuous sensing and monitoring systems. The CDPA gives correct information to the necessary personnel at the right time through technology-supported automation. Thus, as of now, the role of CDPA mostly pertains to sensing and alerting. The time is now ripe to apply CDPA extensively during the diagnosis phase, given the many advancements in CI. Some of the prominent applications of CDPAs are listed below [3–5] (Fig. 1):
i. Tracking the progression of diseases.
ii. Suggesting suitable diet needs for patients.
iii. Suggesting an effective treatment procedure.
iv. Staging of diseases.
v. Predicting the causality of diseases.
vi. Pharmacological decision support.
vii. Estimating the outcome of a treatment procedure.
viii. Estimating the outcome of a surgery.
ix. Early warning systems in the intensive care unit (ICU).
x. Diagnosis of diseases.
The effectiveness of a CDPA is decided by how it handles two main challenges: data handling and data analytics.
Clinical datasets are obtained through various diagnostic procedures, and data from different sources are transformed into a single format before analysis begins. Since there are many diagnostic procedures, the transformed data usually has a large number of features. Preliminary data cleaning procedures such as duplicate removal, noise dismissal, and filling in of missing data are performed first, and data reduction techniques are then applied to the cleaned data. A dataset grows large for two main reasons: too many instances or too many attributes. Once noisy and duplicate instances are removed, having many instances usually yields an accurate classifier. However, having too many features is a negative aspect of data analytics, because a dataset's unimportant features tend to bring down the accuracy of the classifier produced from it. The aim of feature reduction algorithms is to reduce the number of features to an optimal level that enhances classifier accuracy [4–6].

Fig. 1 DSS support system for the clinical data prediction approach

2 Related Work

Clinical data preparation follows the pipeline outlined in the introduction: data from many diagnostic procedures are merged into a single high-dimensional format, cleaned, and then reduced to an optimal feature subset so that unimportant features do not degrade classifier accuracy. The main challenge is that the transition sequence must be predicted by a trained entity that has never seen transition data: continuous time-series clinical data for individual patients is far scarcer than data recording current severity levels. So, in the proposed work, we use a clinical dataset containing current severity levels to detect the transition sequence [7–9].
All previous research on patient-centric disease sequence prediction uses time-series data. No research work addresses predicting a patient's possible disease transition stage without the availability of that patient's time-series data (Fig. 2). The availability of time-series data, such as ICU data, is comparatively low, so previous works using time-series data have few benchmark datasets against which to verify, and the reliability of those algorithms cannot be verified to the fullest. Previous works that focus on time-series data also assume that disease progression takes a sequential route, which is not necessarily the case: a disease can jump stages, which is not the same as merely passing through the stages quickly. There is also a deficiency in research that can redefine the mapping of clinical test data to new disease stages. All the research gaps mentioned above are addressed here. The proposed solution uses non-time-series data and predicts the patient-centric disease transition stage; it can even redefine the staging pattern of the disease [2, 10] (Fig. 3).

Fig. 2 Data preprocessing phases

Fig. 3 Attributes considered for DSS in clinical data

3 Proposed DSS Support System for the Clinical Data Prediction Approach Using C4.5 with PSO

After feature reduction, the equations formulated as part of this proposed work, using the C4.5 decision tree and PSO technique for Algorithm 1, are given below [9, 11, 12]. The density of each cluster is found using these equations, where cluster_i denotes the i-th particle cluster and p denotes its particles (Fig. 4).

p_k^t = (p_{k1}^t, p_{k2}^t, ..., p_{kD}^t)    (1)

p_1^t = (p_{11}^t, p_{12}^t, ..., p_{1D}^t)    (2)

v_{id}^t = w · v_{id}^{t−1} + c_1 r_1 (p_{id}^t − x_{id}^t) + c_2 r_2 (p_{gd}^t − x_{id}^t),  i = 1, 2, ..., M    (3)

x_{id}^{t+1} = x_{id}^t + v_{id}^t,  d = 1, 2, ..., D    (4)

Fig. 4 DSS phases of hybrid C4.5 decision tree with PSO

Here, each decision tree is treated as a particle for swarm optimization. Equation (1) represents the best-solution tree particle, Eq. (2) a particle of the population, and Eqs. (3) and (4) give the velocity and the updated dynamic position, where w is the inertia weight, c_1 and c_2 are acceleration coefficients, and r_1, r_2 are uniform random numbers.
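As a concrete reading of Eqs. (3) and (4), the sketch below (in Python) performs one velocity and position update for a single particle; the inertia weight w and the coefficients c1, c2 are assumed values, since the paper does not report its settings.

import numpy as np

# One PSO update per Eqs. (3) and (4) for a single particle.
# w, c1, c2 are assumed values; the paper does not report its settings.
rng = np.random.default_rng(0)
D = 5                      # dimensions (one per decision parameter)
w, c1, c2 = 0.7, 1.5, 1.5  # inertia weight and acceleration coefficients (assumed)

x = rng.random(D)          # current position x_id^t
v = np.zeros(D)            # previous velocity v_id^{t-1}
p_best = rng.random(D)     # personal best p_id^t
g_best = rng.random(D)     # global best p_gd^t

r1, r2 = rng.random(D), rng.random(D)                        # uniform random factors
v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # Eq. (3)
x = x + v                                                    # Eq. (4)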

Algorithm 1: DSS support system for the Clinical Data Prediction Approach

i. Compute, for each of the L clusters, the cluster density and fitness F_cluster
ii. Assign the medical disease dataset (M × N) to the clusters
iii. Form the training set from the cluster nodes of diseases
iv. Select the test samples
v. Apply the C4.5 decision tree using Eqs. (1) and (2)
vi. Apply the current decision level using PSO with Eqs. (3) and (4)
vii. Assign the final particle as the decision parameter
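A compact sketch of steps v–vii is given below. Scikit-learn's DecisionTreeClassifier with the entropy criterion is used here only as a stand-in for C4.5 (scikit-learn implements CART rather than C4.5), and the benchmark dataset and cross-validation settings are illustrative assumptions rather than the paper's exact setup.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Wrapper fitness for one PSO particle: a binary mask over features is
# scored by the accuracy of a decision tree trained on the masked data.
# DecisionTreeClassifier(criterion="entropy") stands in for C4.5.
X, y = load_breast_cancer(return_X_y=True)   # a UCI benchmark (illustrative)

def fitness(mask: np.ndarray) -> float:
    if not mask.any():
        return 0.0                           # empty feature subsets score zero
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    return cross_val_score(tree, X[:, mask.astype(bool)], y, cv=5).mean()

rng = np.random.default_rng(0)
particle = rng.integers(0, 2, X.shape[1])    # one binary particle (step vi input)
print(f"subset size={particle.sum()}, fitness={fitness(particle):.3f}")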

A unified model for feature selection is also proposed to significantly reduce features in massive datasets. It performs the reduction by selecting the features that appear both in the feature subsets selected through MBPSO and in those selected by existing Binary Genetic Algorithm based methods. An evaluation parameter for wrapper-based feature selection methods, called the trade-off factor (TF), is also introduced. The highlights of the proposed work are as follows: wrapper-based methods are used, with feature selection performed at every step of the algorithm through the creation of an SVM classifier; the existing Binary Genetic Algorithm (BGA) is used alongside MBPSO in the unified approach; and a comparison of the various feature selection models is performed using a neural network classifier.
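To make the unified selection rule concrete, here is a tiny sketch; the feature-index subsets are hypothetical, standing in for the actual outputs of the MBPSO and BGA runs.

# Hypothetical feature-index subsets returned by MBPSO and BGA runs.
mbpso_subset = {0, 3, 5, 7, 12}
bga_subset   = {0, 2, 5, 7, 9}

# Unified model: keep only the features selected by both methods.
unified_subset = sorted(mbpso_subset & bga_subset)   # -> [0, 5, 7]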
The Particle Swarm Optimization (PSO) technique is an evolutionary, heuristic optimization algorithm. Having proved its mettle in optimization, PSO is used to provide a solution in this work. Binary Particle Swarm Optimization (BPSO) is a version of PSO whose particles consist of binary digits, and an updated version called Mutated Binary Particle Swarm Optimization (MBPSO) is proposed here. BPSO is used in feature selection problems: every digit in a BPSO particle is either 0 or 1, so for a dataset with n features, each particle carries n binary digits, one per feature. The features corresponding to 1's are congregated into a derived dataset. Once every particle has its fitness value computed, the particle with the highest fitness becomes the global best particle, and every other particle updates its velocity and position towards it, updating its binary digits during the movement. The process repeats until the global best particle's fitness reaches a preset threshold or a fixed number of iterations completes; the features corresponding to the final global best particle are then taken as the reduced feature subset.

The movement of particles through the search space is the exploration phase of PSO, and the movement of particles towards the global best particle is the exploitation phase. The PSO algorithm is known to have acceptable exploitation, i.e., it tends to converge fast, but it has a problem with exploration. The proposed MBPSO method aims to alleviate this by introducing a mutation mechanism, usually associated with genetic algorithms; the workflow diagram is shown in Fig. 2. The idea is to selectively apply the mutation process, defined as a random update of particles' digits: the k particles with the least fitness values are subjected to mutation, so the exploitation process is not disturbed while exploration is enhanced by scattering the k least-fit particles through the search space. Thus, the proposed MBPSO method enhances the exploration phase of BPSO.

The evaluation of the finished selection uses a neural network with random sampling without replacement during both training and testing. The evaluating neural network has a single hidden layer with ten neurons. For every method, the count of input neurons varies with the outcome of feature selection and equals the count of selected features. All datasets used for evaluation have a single class label, so the neural network's output layer has one neuron.
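The following sketch illustrates one MBPSO iteration under the description above: a standard sigmoid-based BPSO bit update followed by mutation of the k least-fit particles. The population size, k, mutation rate, and PSO coefficients are assumptions for illustration, not values reported in the paper.

import numpy as np

rng = np.random.default_rng(1)

def mbpso_step(particles, velocities, p_best, g_best, fitnesses, k=3, mut_rate=0.1):
    """One MBPSO iteration: BPSO bit update, then mutate the k least-fit particles."""
    w, c1, c2 = 0.7, 1.5, 1.5                            # assumed coefficients
    n, d = particles.shape
    r1, r2 = rng.random((n, d)), rng.random((n, d))
    velocities = (w * velocities + c1 * r1 * (p_best - particles)
                  + c2 * r2 * (g_best - particles))
    prob = 1.0 / (1.0 + np.exp(-velocities))             # sigmoid transfer function
    particles = (rng.random((n, d)) < prob).astype(int)  # standard BPSO bit update
    # Mutation step of MBPSO: scramble the k least-fit particles to aid exploration.
    worst = np.argsort(fitnesses)[:k]
    flip = rng.random((k, d)) < mut_rate
    particles[worst] = np.where(flip, 1 - particles[worst], particles[worst])
    return particles, velocities

# Minimal usage with random data (10 particles, 8 features).
n, d = 10, 8
parts = rng.integers(0, 2, (n, d))
vels = np.zeros((n, d))
fits = rng.random(n)
parts, vels = mbpso_step(parts, vels, parts.copy(), parts[0], fits)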
Genetic Algorithms (GA) are evolutionary, heuristic optimization algorithms. The GA components are population generation, selection of fit members from the population, cross-over, and mutation. Members are called genetic strings; a string can be a sequence of integers or real numbers, and a string composed of binary numbers is called a binary string. A GA that uses binary strings is called a Binary Genetic Algorithm (BGA). In cross-over, each of the two participating strings loses certain parts, and the lost part is filled with the corresponding part of the other string; the resulting modified strings are called offspring. Mutation is a process in which randomly chosen digits of a string are assigned random values (0 or 1 for a binary string).

BGA finds use in feature subset selection. The process starts with population generation, meaning the creation of N strings, each with m binary digits, equal to the total feature count of the high-dimensional dataset. Once the strings are generated, the fitness of every string is computed: the features corresponding to 1's in a string are congregated into a derived dataset, a classifier is created from it and evaluated, and the classifier's accuracy is set as the string's fitness value. The higher the classifier's accuracy, the higher the string's fitness. This methodology, in which a classifier is generated to fix the fitness value, is called a wrapper-based methodology. Once every string has its fitness, the first k strings are chosen for the reproductive pool, where pairs of strings are chosen randomly and subjected to cross-over and mutation. The BGA used in this work uses a single-point cross-over.
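A minimal sketch of the BGA operators just described follows; the string length and mutation rate are illustrative assumptions, while the single-point cross-over matches the description above.

import numpy as np

rng = np.random.default_rng(2)

def single_point_crossover(a: np.ndarray, b: np.ndarray):
    """Swap the tails of two binary strings at a random cut point."""
    point = rng.integers(1, len(a))               # cut point in [1, m-1]
    child1 = np.concatenate([a[:point], b[point:]])
    child2 = np.concatenate([b[:point], a[point:]])
    return child1, child2

def mutate(s: np.ndarray, rate: float = 0.05) -> np.ndarray:
    """Flip randomly chosen bits of a binary string."""
    flip = rng.random(len(s)) < rate
    return np.where(flip, 1 - s, s)

parents = rng.integers(0, 2, (2, 12))             # two binary strings (m = 12)
c1, c2 = single_point_crossover(parents[0], parents[1])
offspring = mutate(c1), mutate(c2)                # offspring after mutation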

4 Experiments and Results

There are many dark areas in clinical data analytics, such as genomic data analytics, which could solve genetic and hereditary diseases. The disease datasets considered here are taken from the UCI ML standard repository. We can now identify the occurrence of a genetic disease, but a cure cannot yet be given. However, once genomic data research and bioinformatics unfold the secrets of the genetic code, wonders can be done in curing incurable diseases; for many unknown areas, such as the randomness in disease progression mentioned in our research, the key to the riddle might lie in the genetics field. Nowadays, the computational infrastructure is ready: fast processing systems such as supercomputing and cluster computing are available, and platforms to share enormous data are taking shape through the cloud and sample data storage. Big data analytics is still in its budding phase; once it matures, high-powered computing machines together with efficient big data analytics algorithms and enormous supporting data could solve most healthcare problems and help people lead happier lives.

Every symptom detected during a disease is due to some parameter going wrong, and the flaw in a specific parameter is caused by various reasons. An entity that brings a flaw to a specific parameter is said to be a causal entity, and all the entities that are causal for a specific parameter are called concepts. A disease is characterized by many flawed parameters, each with its own set of concepts, and the concept sets of vital parameters may intersect. If we represent the concept dependencies, we end up with a map data structure depicting the interdependencies of concepts, called a cognitive concept map. In order to set the parameters right, the concepts have to be provoked. The weight of involvement of concepts in deciding the parameters is patient-centric. This can be taken as a problem statement, and a solution can be provided for accurately predicting the percentage of involvement of concepts in tweaking the features of vital parameters. For example, if the red blood cell count is a parameter and this count can be altered by tweaking one or more concepts, how much to tweak is patient-centric; if this can be done, patient-centric targeted treatment can be provided.

In today's scenario, time-series data is available in abundant quantities for both in- and outpatients. Every hospital keeps electronic medical records for all of its patients, and these data are now being shared on cloud platforms. Even though no standardized mechanism to share the data is currently available, due to various data privacy laws, the data can be shared with patients' consent. However, time-series data corresponding to fetal development is very scarce. Not much research has been done on capturing vitals or parameters during the fetal development phase; the data currently captured during this phase is minimal and of a non-invasive type. If a safe and commonly usable minimally invasive technology were developed for fetal analysis, many genetic disorders could be found out early. Presently, the fetus's morphological structure is used as the benchmark to analyze genetic disorders, which has a very high error rate. If extensive data were collected during fetal development, through safe minimally invasive and non-invasive methods, a good repository of sequential data would be obtained for analysis using Computational Intelligence techniques. Many genetic disorders could be prevented or acted upon if detected at the correct time, or the decision to abort the fetus could be taken based on the severity of the genetic disorder. Diseases such as Down syndrome, Huntington's disease, Fragile X syndrome, hearing loss, sickle cell disease, Turner's syndrome, and many more could be detected more accurately.
Figures 5, 6, 7, 8, 9 and 10 describe the various parameters predicted using the proposed C4.5 and PSO approach. Behavioral and social communication pattern data can be correlated with fetal development data when analyzing autism and Asperger syndrome. Whereas the previous idea identifies genetic diseases with morphological disorders by analyzing fetal data, here intellectual ability could be predicted from fetal time-series data if current intellectual disability is successfully correlated with fetal clinical data.

Fig. 5 Disease support with frequency
Fig. 6 Various diseases with level parameter-1
Fig. 7 Various diseases with level parameter-2
Fig. 8 Various diseases with level parameter-3

Intelligent prostheses, or learning limbs, are another prospect. Much research has already produced sophisticated lightweight artificial limbs, but much more can be done to make a prosthetic limb intelligent. For example, the patterns of ordinary people's movements and activities can be captured through parameters such as joint pressure, angular velocity, stress-induced angle changes, vibration, and flexibility. The relations between these patterns can then be learned so that a prosthesis adjusts its parameters just like a normal limb: if the parameter changes during running are captured well, the training can come out fair, and a prosthetic leg can be designed to run in the same manner as an average person runs. This knowledge can be utilized in developing humanoid robots too.
Sensory organ functionality depends on neurological structure. With the advancement of neural and deep learning methodologies, a sensory organ's function can be emulated artificially, and artificial, efficient sensory organs are a theme of the future: imagine an artificial eye capable of seeing more than a normal eye, or an artificial ear hearing more than its natural counterpart. Artificial sensory organs using neural networks still have a long way to go, but proper funding and well-coordinated research in this domain would revive many special people's lives. Drug suggestion is another area where computational intelligence algorithms can be applied. Drug sensitivity varies from patient to patient, and so does the effect of a drug. Using cognitive maps and weight optimization algorithms, patient-centric drug-related decisions, such as the type of drug, quantity, duration, and mode of delivery, can be made.

Fig. 9 Various diseases with level parameter-4
Fig. 10 Various diseases with level parameter-5

The proposed objectives of improving diagnostic prediction in clinical data prediction approaches are achieved through more accurate staging, progression prediction, and efficient preprocessing of data through feature selection. The research work of this paper is summarized as follows. This work used coronary heart disease as a reference to verify the algorithm. The highlight of the proposed method is that it arrives at a patient-centric possible disease stage transition sequence using non-time-series data. Unlike most methods, which predict the next stage for a patient with a specific disease using clinical time-series data, the proposed algorithm uses non-time-series data corresponding to different persons. The proposed method uses a novel approach rooted in vector algebra, gravitational force, and the nearest-neighbor property.

5 Conclusions

The proposed algorithm addresses the problem statement's third objective: feature subset selection using a wrapper-based method with heuristic optimization algorithms, namely the modified particle swarm optimization algorithm and the Genetic Algorithm. Previous works in this field used heuristic algorithms but did not address the exploration and exploitation behavior of heuristic algorithms together. The proposed hybrid methodology, based on the C4.5 decision tree with PSO, enhances both the exploration and exploitation processes through the Genetic Algorithm and the modified particle swarm optimization algorithm. An evaluation parameter called the trade-off factor is proposed and used to evaluate the selected features.

Computational intelligence algorithms have proved indispensable in healthcare applications, and their use will only grow in the coming years given their efficient functioning mechanisms and the research energy flowing into these domains. Research on clinical data analytics holds the key to future diagnostics and treatment; it is at the starting stage from which it will revolutionize the healthcare industry. Technology support in the healthcare industry has so far pertained mainly to clinical tests, and while the use of technology in testing has grown over the last five decades, the application of technology in the diagnostics phase is still not reliable. Extensive research is needed to produce concrete results before the technology can be trusted in decision-making. Some future research directions that can enhance clinical data prediction approaches were discussed in the previous section.

References

1. Herland, M., Khoshgoftaar, T.M., Wald, R.: Survey of clinical data mining applications on big data in health informatics. In: Proceedings of ICMLA ’13, vol. 02, pp. 465–472 (2013). PhysioNet MIMIC-III [n.d.]: https://archive.physionet.org/physiobank/database/mimic3cdb/
2. Cai, X., Perez-Concha, O., Coiera, E., Martin-Sanchez, F., Day, R., Roffe, D., Gallego, B.: Real-
time prediction of mortality, readmission, and length of stay using electronic health record data.
J. Am. Med. Inf. Assoc. 23(3), 553–561 (2016). https://doi.org/10.1093/jamia/ocv110

3. Li, L., Cheng, W.-Y., Glicksberg, B.S., Gottesman, O., Tamler, R., Chen, R., Bottinger, E.P.,
Dudley, J.T.: Identification of type 2 diabetes subgroups through topological analysis of patient
similarity. Sci. Trans. Med. 7, 311 (2015)
4. Sirisati, R.S.: Machine learning based diagnosis of diabetic retinopathy using digital fundus images with CLAHE along FPGA methodology. Int. J. Adv. Sci. Technol. 29(3), 9497–9508 (2020)
5. Purushotham, S., Meng, C., Che, Z., Liu, Y.: Benchmark of deep learning models on large
healthcare MIMIC Datasets. CoRR abs/1710.08531 (2017). https://arxiv.org/abs/1710.08531
6. Sirisati, R.S.: Dimensionality reduction using machine learning and big data technologies. Int. J. Innov. Technol. Explor. Eng. 9(2), 1740–1745 (2019)
7. Pai, S., Bader, G.D.: Patient similarity networks for precision medicine. J. Mol. Biol. (2018)
8. Pai, S., Hui, S., Isserlin, R., Shah, M.A., Kaka, H., Bader, G.D.: netDx: Interpretable patient
classification using integrated patient similarity networks. bioRxiv (2018)
9. Kanungo, T., Mount, D., Netanyahu, N., Piatko, C., Silverman, R., Wu, A.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 881–892 (2002)
10. Mao, Y., Chen, W., Chen, Y., Lu, C., Kollef, M., Bailey, T.: An integrated data mining approach to real-time clinical monitoring and deterioration warning. In: Proceedings of SIGKDD ’12, pp. 1140–1148 (2012)
11. Ma, T., Zhang, A.: Integrate multi-omic data using affinity network fusion (ANF) for cancer
patient clustering. In 2017 IEEE International Conference on Bioinformatics and Biomedicine
(BIBM), pp. 7–10. IEEE (2017)
12. Yu, S.X., Shi, J.: Multiclass spectral clustering. In: Proceedings Ninth IEEE International Conference on Computer Vision, vol. 1, pp. 313–319 (2003). https://doi.org/10.1109/ICCV.2003.1238361
Software Defect Prediction Using
Optimized Cuckoo Search Based
Nature-Inspired Technique

C. Srinivasa Kumar, Ranga Swamy Sirisati, and Srinivasulu Thonukunuri

Abstract Software systems today are complex and versatile, so it is essential to identify and fix software errors. Software error assessment is one of the most active areas of research in software engineering. In this research, we introduce soft computing methods to assess software errors; the proposed technique detects software errors and gives accurate results. In the proposed method, the error database is first extracted and acts as the input. The collected input data is then clustered; for this purpose, we use a modified C-means algorithm. An efficient classification algorithm then groups the clustered data; for this purpose, we use a hybrid neural network. The detected software errors are then optimized using the Modified Cuckoo Search (MCS) algorithm. The proposed method for software error assessment is implemented on the Java platform, and performance is measured with various parameters such as execution rate and execution time. The proposed Cuckoo-search-based strategy is compared with many existing strategies, and the graphical comparison of results shows that it identifies software defects effectively, with reasonable prediction rates.

C. Srinivasa Kumar (B) · R. S. Sirisati
Department of CSE, Vignan’s Institute of Management and Technology for Women, Kondapur,
Ghatkesar Mandal, Telangana, India

S. Thonukunuri
Department of Mathematics, Vignan’s Institute of Management and Technology for Women,
Kondapur, Ghatkesar Mandal, Telangana, India

1 Introduction

Software Defect Prediction (SDP) plays an essential part in reducing software development costs and maintaining high software quality (Achilles et al., 2017). When there is a recurring software failure in a system, it automatically indicates a software error. A software error is a bug introduced by software developers and stakeholders. A software vulnerability assessment's primary purpose is to improve the quality, marginal cost, and time of software products.

For high-performance error assessment, researchers have been working since the 1990s on selecting consistent features and experimental learning algorithms (2014) to assess error-proneness and thereby support software testing activities. Software errors are a significant cause of failure in large engineering projects, causing significant financial damage in the twenty-first century. Various error assessment methods have been proposed for software quality; these strategies rest on various assumptions, including source code metrics (e.g., cohesion, coupling, size). Software vulnerability assessment helps direct testing resources to the software modules most likely to contain vulnerabilities. Current software error assessment models estimate the number of errors in a module, but they may fail to give an appropriate ranking because accurately estimating the number of errors in a software module from noisy data is challenging. Each software error arises in different conditions and environments and therefore differs in its specific characteristics. Software errors can have a substantial negative impact on software quality, so error assessment is exceptionally important for software quality and reliability. Comparative software quality engineering is a new area of research in defect assessment (2014). Current error assessment functions (1) calculate the number of errors remaining in the software system, (2) identify error associations, and (3) classify the error-characteristic features of software components as generally error-prone or error-free. The prediction result can be used as a necessary step by software developers to control the software process (2009).

Empirical SDP studies are strongly biased by data quality and suffer from widely limited generalizations (1986). Defect predictions can help improve software quality and reduce the distribution cost of software systems. SDP research has grown rapidly since 2005, as researchers can now collect error assessment datasets from real-world projects for public use and create repeatable and comparable models throughout a study. To date, many SDP works have done extensive research on the metrics and learning algorithms that describe code modules for designing prediction models (2014). Software vulnerability assessment therefore plays an essential role in improving software quality: it can reduce testing time and cost, and it is used in many organizations to save time, improve software quality, and manage software testing and project resources. Software vulnerability assessment draws on the errors recorded in historical databases. As software projects grow in size, defect assessment technology plays an essential role in supporting developers and speeding up the time to market of more reliable software products [2] (Fig. 1).

2 Related Works

The general software error assessment procedure follows machine learning methods. The first step is to extract instances from the software; an instance can be a source file, function, class, or method. These instances come from various problem tracking systems, version control systems, or e-mail archives, and each carries different software metrics. Instances can be classified as buggy (B) or clean (C). After instances are identified using metrics and labels, machine learning pre-processing methods are applied as a first step to create new instances. Pre-processing is applied to extract features, scale data, and reduce noise (2014) [3, 4]; it is not mandatory for all types of error assessment models. After pre-processing, the created instances are used to train the error assessment model, which then classifies instances into defective and clean cases. Predicting the number of errors in an instance is a regression task; predicting only whether an instance is defective or clean is binary classification. There are many applications of software bug prediction; its primary purpose is to allocate testing resources across software products effectively. The error assessment model identifies error-prone software artifacts and their scope.

Fig. 1 Software defect prediction model block diagram

Once the prediction model is built, its performance should be evaluated. Performance is usually measured in two ways: predictive power and descriptive power. Predictive power measures a model's accuracy in identifying defective software artifacts. Accuracy, recall, F-measure, and the area under the ROC curve (AUC, which plots the false-positive rate on the x-axis against the true-positive rate on the y-axis) are commonly used classification measures in error assessment studies. Descriptive power is used in defect studies in addition to predictive power; it measures how well a model explains the variance in the data, and measures such as R² or the standard deviation are commonly used to determine descriptive strength [5–9].
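As an illustration of the predictive-power measures listed above, the following snippet computes precision, recall, F-measure, and AUC with scikit-learn; the labels and scores are hypothetical.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Hypothetical labels/scores: 1 = buggy (B), 0 = clean (C).
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                    # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]    # classifier confidence

print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))   # ROC: FPR (x) vs TPR (y)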

3 Proposed Modified Cuckoo Search Technique

Yang and Deb introduced the Cuckoo Search optimization method. The Cuckoo has an aggressive breeding strategy: the female lays her fertilized eggs in the nests of other species, so surrogate parents unwittingly raise her brood (Atul Bisht et al., 2012) [10]. If Cuckoo eggs are discovered in the nest, the surrogate parents either throw them out or abandon the nest and start their brood elsewhere. The Cuckoo Search optimization algorithm considers various design parameters and controls based on three main idealized rules (Azim et al., 2011): (1) each Cuckoo lays one egg at a time and deposits it in a randomly selected nest; (2) the best nests with high-quality eggs are passed on to the next generation; (3) the number of available host nests is fixed, and the host bird discovers a Cuckoo egg with a given probability, in which case it may discard the egg or build a new nest. For simplicity, this last assumption can be approximated by replacing a fraction of the n nests with new nests (new random solutions) in the next cycle. This method has been successfully demonstrated on several benchmark tasks and has been reported to outperform other optimization methods (2007).

The Cuckoo Search algorithm is a metaheuristic algorithm inspired by the Cuckoo's reproductive behavior and is easy to implement. Each Cuckoo has many candidate nests; each egg represents a solution, and a Cuckoo egg represents a new solution. Brood parasitism in some Cuckoo species is remarkably sophisticated: these birds can lay eggs in a host nest and mimic external features such as the color and spots of the host's eggs. If this strategy is unsuccessful, the host may throw the egg out of the nest or abandon the nest altogether (Azim et al., 2011) [11–13]. Based on this behavior, researchers developed an evolutionary optimization algorithm called Cuckoo Search (CS), formulated with the following rules:
The Cuckoo Search optimization cost function is

Min CS(S) = Min[cs_1(S), cs_2(S), ..., cs_M(S)]    (1)

subject to the inequality and equality constraints

g_b(S) ≥ 0,  b = 1, 2, ..., c    (2)

h_l(S) = 0,  l = 1, 2, ..., a    (3)

Here, Eq. (1) represents the Cuckoo Search cost optimization function over the software predictive parameters, and Eqs. (2) and (3) represent the constraints that bound the search. For feature selection, a new solution lies on an N-dimensional Boolean lattice, and solutions are updated at the corners of the hypercube. To indicate whether a given attribute is selected, a solution uses a binary vector in which 1 means the attribute is selected to compose the new dataset and 0 means it is not. In our particular approach, the solution encodes the weight of each attribute, and weight optimization is performed with the modified Cuckoo search algorithm. Each egg in a nest signals a solution, and a Cuckoo egg corresponds to a new solution; novel and better solutions replace the worse solutions in the nests. In the modified version, we adapt the standard Cuckoo search algorithm so that the update phase uses the Lévy flight equation

S_b^{t+1} = S_b^t + 0.01 × α × L(β) × (P_1(t) − P_2(t)) × m

Optimization with Lévy flights performs better than the standard update. The steps of the optimization process are as follows (Fig. 2):

Fig. 2 Flow of modified Cuckoo search algorithm

Algorithm-1: Modified Cuckoo search algorithm (Phase-I)

1. Each Cuckoo randomly selects one of the N host nests to lay its egg X_i
2. Generate a new solution X_i (a Cuckoo) randomly, e.g., by a Lévy-flight hop
3. Evaluate its fitness F_i; the nests containing the best-quality eggs are passed on to the next generation
4. Select a nest j at random
5. If F_i ≤ F_j, keep the existing solution in nest j (the host has found the Cuckoo egg and may discard it or abandon the nest); otherwise, replace the solution in nest j with the new one
6. Abandon a fraction of the worst nests and build new ones with random solutions
7. Modify solutions using the Lévy flight equation
8. Rank the solutions and keep the best result
9. Stop when the termination criterion is met

Algorithm-2: Modified Cuckoo search algorithm (Phase-II)

Step 1: Initialization
The host nest population (M_i, where i = 1, 2, ..., n) is initialized randomly
Step 2: New Cuckoo generation
With the help of Lévy flights, a Cuckoo is randomly selected to create new solutions, which are evaluated to determine the dominance of the created Cuckoo solutions
Step 3: Fitness assessment
Fitness is assessed from the selected population (SP) relative to the total population (TP)
Step 4: Replacement (update) step
Initially, Lévy flights are applied to generate new candidate solutions. The quality of the new solutions is assessed, and a nest is selected at random. If the new solution is better than the previous solution in the selected nest, it replaces that solution (the Cuckoo); otherwise, the previous solution is retained as the best solution.
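A minimal sketch of this loop under the modified Lévy flight equation above follows. The Lévy steps are drawn with Mantegna's algorithm, m is assumed to be a standard normal draw (the paper does not define m), and β, α, the population size, and the sphere objective are illustrative assumptions rather than the paper's settings.

import numpy as np
from math import gamma, pi, sin

rng = np.random.default_rng(3)

def levy(beta: float, size: int) -> np.ndarray:
    """Draw Levy-distributed steps via Mantegna's algorithm."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0, sigma, size)
    v = rng.normal(0, 1, size)
    return u / np.abs(v) ** (1 / beta)

def objective(s):                       # illustrative fitness (lower is better)
    return np.sum(s ** 2)

n, d, beta, alpha = 15, 4, 1.5, 1.0     # nests, dimensions, assumed parameters
nests = rng.uniform(-5, 5, (n, d))
for _ in range(100):
    i, p1, p2 = rng.integers(0, n, 3)
    # Update per S_b(t+1) = S_b(t) + 0.01 * alpha * L(beta) * (P1(t) - P2(t)) * m,
    # with m ~ N(0, 1) as an assumption.
    new = nests[i] + 0.01 * alpha * levy(beta, d) * (nests[p1] - nests[p2]) * rng.normal()
    j = rng.integers(0, n)              # compare against a randomly chosen nest
    if objective(new) < objective(nests[j]):
        nests[j] = new                  # replace the worse solution
best = nests[np.argmin([objective(s) for s in nests])]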

4 Results and Discussion

Software error assessment is a recent research topic; many researchers have focused on providing efficient techniques for improving software quality.