
Advances in Intelligent Systems and Computing 1087

Ashish Khanna · Deepak Gupta · Siddhartha Bhattacharyya · Vaclav Snasel · Jan Platos · Aboul Ella Hassanien, Editors

International Conference on Innovative Computing and Communications

Proceedings of ICICC 2019, Volume 1
Advances in Intelligent Systems and Computing

Volume 1087

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing,
Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering,
University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University,
Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas
at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao
Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology,
University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute
of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro,
Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management,
Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering,
The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, perception and vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia.
The publications within “Advances in Intelligent Systems and Computing” are
primarily proceedings of important conferences, symposia and congresses. They
cover significant recent developments in the field, both of a foundational and
applicable character. An important characteristic feature of the series is the short
publication time and world-wide distribution. This permits a rapid and broad
dissemination of research results.
** Indexing: The books of this series are submitted to ISI Proceedings,
EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **

More information about this series at [Link]


Ashish Khanna · Deepak Gupta · Siddhartha Bhattacharyya · Vaclav Snasel · Jan Platos · Aboul Ella Hassanien
Editors

International Conference on Innovative Computing and Communications
Proceedings of ICICC 2019, Volume 1
Editors

Ashish Khanna
Department of Computer Science and Engineering
Maharaja Agrasen Institute of Technology
New Delhi, Delhi, India

Deepak Gupta
Department of Computer Science and Engineering
Maharaja Agrasen Institute of Technology
New Delhi, Delhi, India

Siddhartha Bhattacharyya
Department of Computer Science and Engineering
Christ University
Bangalore, India

Vaclav Snasel
Department of Computer Science
VŠB—Technical University of Ostrava
Ostrava, Czech Republic

Jan Platos
Department of Computer Science
VŠB—Technical University of Ostrava
Ostrava, Czech Republic

Aboul Ella Hassanien
Faculty of Computers and Information
Cairo University
Giza, Egypt
ISSN 2194-5357    ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-981-15-1285-8    ISBN 978-981-15-1286-5 (eBook)
[Link]
© Springer Nature Singapore Pte Ltd. 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Dr. Ashish Khanna would like to dedicate this book to his mentors Dr. A. K. Singh and Dr. Abhishek Swaroop for their constant encouragement and guidance, and to his family members, including his mother, wife and kids. He would also like to dedicate this work to his (late) father Sh. R. C. Khanna with folded hands for his constant blessings.

Dr. Deepak Gupta would like to dedicate this book to his father Sh. R. K. Gupta, his mother Smt. Geeta Gupta, his mentors Dr. Anil Kumar Ahlawat and Dr. Arun Sharma for their constant encouragement, and his family members, including his wife, brothers, sisters, kids and his students close to his heart.

Professor (Dr.) Siddhartha Bhattacharyya would like to dedicate this book to his father late Ajit Kumar Bhattacharyya, his mother late Hashi Bhattacharyya, his beloved wife Rashni and his colleagues Anirban, Hrishikesh, Indrajit, Abhijit, Biswanath and Hiranmoy, who have been beside him through thick and thin.

Professor (Dr.) Jan Platos would like to dedicate this book to his wife Daniela and his daughters Emma and Margaret.

Professor (Dr.) Aboul Ella Hassanien would like to dedicate this book to his beloved wife Azza Hassan El-Saman.
Organizing Committee

ICICC 2019 Steering Committee Members

General Chairs
Prof. Dr. Vaclav Snasel, VŠB—Technical University of Ostrava, Czech Republic.
Prof. Dr. Siddhartha Bhattacharyya, Principal, RCC Institute of Information
Technology, Kolkata.

Honorary Chair
Prof. Dr. Janusz Kacprzyk, FIEEE, Polish Academy of Sciences, Poland.

Conference/Symposium Chair
Prof. Dr. Maninder Kaur, Director, Guru Nanak Institute of Management, Delhi,
India.

Technical Program Chairs
Dr. Pavel Kromer, VŠB—Technical University of Ostrava, Czech Republic.
Dr. Jan Platos, VŠB—Technical University of Ostrava, Czech Republic.
Prof. Dr. Joel J. P. C. Rodrigues, National Institute of Telecommunications (Inatel),
Brazil.
Prof. Dr. Aboul Ella Hassanien, Cairo University, Egypt.
Dr. Victor Hugo C. de Albuquerque, Universidade de Fortaleza, Brazil.

Conveners
Dr. Ashish Khanna, Maharaja Agrasen Institute of Technology, India.
Dr. Deepak Gupta, Maharaja Agrasen Institute of Technology, India.


Publicity Chairs
Dr. Hussain Mahdi, National University of Malaysia.
Dr. Rajdeep Chowdhury, Academician, Author and Editor, India.
Prof. Dr. Med Salim Bouhlel, University of Sfax, Tunisia.
Dr. Mohamed Elhoseny, Mansoura University, Egypt.
Dr. Anand Nayyar, Duy Tan University, Vietnam.
Dr. Andino Maseleno, STMIK Pringsewu, Lampung, Indonesia.

Publication Chairs
Dr. D. Jude Hemanth, Associate Professor, Karunya University, Coimbatore.
Dr. Nilanjan Dey, Techno India College of Technology, Kolkata, India.
Gulshan Shrivastava, National Institute of Technology, Patna, India.

Co-conveners
Dr. Avinash Sharma, Maharishi Markandeshwar University (Deemed to be
University), India.
P. S. Bedi, Guru Tegh Bahadur Institute of Technology, Delhi, India.
Mr. Moolchand Sharma, Maharaja Agrasen Institute of Technology, India.

ICICC 2019 Advisory Committee

Prof. Dr. Vincenzo Piuri, University of Milan, Italy.
Prof. Dr. Valentina Emilia Balas, Aurel Vlaicu University of Arad, Romania.
Prof. Dr. Marius Balas, Aurel Vlaicu University of Arad, Romania.
Prof. Dr. Mohamed Salim Bouhlel, University of Sfax, Tunisia.
Prof. Dr. Aboul Ella Hassanien, Cairo University, Egypt.
Prof. Dr. Cenap Ozel, King Abdulaziz University, Saudi Arabia.
Prof. Dr. Ashiq Anjum, University of Derby, Bristol, UK.
Prof. Dr. Mischa Dohler, King’s College London, UK.
Prof. Dr. Sanjeevikumar Padmanaban, University of Johannesburg, South Africa.
Prof. Dr. Siddhartha Bhattacharyya, Principal, RCC Institute of Information
Technology, Kolkata, India.
Prof. Dr. David Camacho, Associate Professor, Universidad Autonoma de Madrid,
Spain.
Prof. Dr. Parmanand, Dean, Galgotias University, UP, India.
Dr. Abu Yousuf, Assistant Professor, University Malaysia Pahang, Gambang,
Malaysia.
Prof. Dr. Salah-ddine, Krit University Ibn Zohr, Agadir, Morocco.
Dr. Sanjay Kumar Biswash, Research Scientist, INFOCOMM Lab, Russia.
Prof. Dr. Maryna Yena S., Senior Lecturer, Medical University of Kiev, Ukraine.
Prof. Dr. Giorgos Karagiannidis, Aristotle University of Thessaloniki, Greece.
Prof. Dr. Tanuja Srivastava, Department of Mathematics, IIT Roorkee.
Dr. D. Jude Hemanth, Associate Professor, Karunya University, Coimbatore.
Prof. Dr. Tiziana Catarci, Sapienza University of Rome, Italy.
Prof. Dr. Salvatore Gaglio, University degli Studi di Palermo, Italy.
Prof. Dr. Bozidar Klicek, University of Zagreb, Croatia.
Dr. Marcin Paprzycki, Associate Professor, Polish Academy of Sciences, Poland.
Prof. Dr. A. K. Singh, NIT Kurukshetra, India.
Prof. Dr. Anil Kumar Ahlawat, KIET Group of Institutions, India.
Prof. Dr. Chang-Shing Lee, National University of Tainan, Taiwan.
Dr. Paolo Bellavista, Associate Professor, Alma Mater Studiorum - Università di
Bologna.
Prof. Dr. Sanjay Misra, Covenant University, Nigeria.
Prof. Dr. Benatiallah Ali, Associate Professor, University of Adrar, Algeria.
Prof. Dr. Suresh Chandra Satapathy, PVPSIT, Vijayawada, India.
Prof. Dr. Marylene Saldon-Eder, Mindanao University of Science and Technology.
Prof. Dr. Özlem ONAY, Anadolu University, Eskisehir, Turkey.
Prof. Dr. Kei Eguchi, Department of Information Electronics, Fukuoka Institute of
Technology.
Prof. Dr. Zoltan Horvath, Professor, Kasetsart University.
Dr. AKM Matiul Alam, Canada.
Prof. Dr. Joong Hoon Jay Kim, Korea University.
Prof. Dr. Sheng-Lung Peng, National Dong Hwa University, Taiwan.
Dr. Daniela Lopez De Luise, CI2S Lab, Argentina.
Dr. Dac-Nhuong Le, Hai Phong University, Vietnam.
Dr. Dusanka Boskovic, University of Sarajevo, Sarajevo.
Dr. Periklis Chatzimisios, Alexander TEI of Thessaloniki, Greece.
Dr. Nhu Gia Nguyen, Duy Tan University, Vietnam.
Prof. Dr. Huynh Thanh Binh, Hanoi University of Science and Technology,
Vietnam.
Dr. Ahmed Faheem Zobaa, Brunel University London.
Dr. Kirti Tyagi, Inha University in Tashkent.
Prof. Dr. Ladjel Bellatreche, University of Poitiers, France.
Prof. Dr. Victor C. M. Leung, The University of British Columbia, Canada.
Prof. Dr. Huseyin Irmak, Cankiri Karatekin University, Turkey.
Dr. Alex Norta, Associate Professor, Tallinn University of Technology, Estonia.
Prof. Dr. Amit Prakash Singh, GGSIPU, Delhi, India.
Prof. Dr. Abhishek Swaroop, Bhagwan Parshuram Institute of Technology, Delhi.
Prof Christos Douligeris, University of Piraeus, Greece.
Dr. Brett Edward Trusko, President and CEO (IAOIP) and Assistant Professor,
Texas A&M University, Texas.
Prof. Dr. R. K. Datta, Director, MERIT.
Prof. Dr. Joel J. P. C. Rodrigues, National Institute of Telecommunications (Inatel),
Brazil; Instituto de Telecomunicações, Portugal.
Prof. Dr. Victor Hugo C. de Albuquerque, University of Fortaleza (UNIFOR),
Brazil.
Dr. Atta ur Rehman Khan, King Saud University, Riyadh.
Dr. João Manuel R. S. Tavares, Professor, Associado com Agregação FEUP—
DEMec.
Prof. Dr. Ravish Saggar, Director, Banarsidas Chandiwala Institute of Information
Technology, Delhi.
Prof. Dr. Ku Ruhana Ku Mahamud, School of Computing, College of Arts and
Sciences, Universiti Utara Malaysia, Malaysia.
Prof. Ghasem D. Najafpour, Babol Noshirvani University of Technology, Iran.
Prof. Dr. Sanjeevikumar Padmanaban, University of Aalborg, Denmark.
Prof. Dr. Frede Blaabjerg, President (IEEE Power Electronics Society), University
of Aalborg, Denmark.
Prof. Dr. Jens Bo Holm Nielson, Aalborg University, Denmark.
Prof. Dr. Venkatadri Marriboyina, Amity University, Gwalior, India.
Dr. Pradeep Malik, Vignana Bharathi Institute of Technology (VBIT), Hyderabad,
India.
Dr. Abu Yousuf, Assistant Professor, University Malaysia Pahang, Gambang,
Malaysia.
Dr. Ahmed A. Elngar, Assistant Professor, Faculty of Computers and Information,
Beni Suef University, Beni Suef, Salah Salem Str., 62511, Egypt.
Prof. Dr. Dijana Oreski, Faculty of Organization and Informatics, University of
Zagreb, Varazdin, Croatia.
Prof. Dr. Dhananjay Kalbande, Professor and Head, Sardar Patel Institute of
Technology, Mumbai, India.
Prof. Dr. Avinash Sharma, Maharishi Markandeshwar Engineering College,
MMDU Campus, India.
Dr. Sahil Verma, Lovely Professional University, Phagwara, India.
Dr. Kavita, Lovely Professional University, Phagwara, India.
Prof. Prasad K. Bhaskaran, Professor and Head of Department, Ocean Engineering
and Naval Architecture, IIT Kharagpur.
Preface

We are delighted to announce that VŠB—Technical University of Ostrava, Czech Republic, Europe, hosted the eagerly awaited and much-coveted International Conference on Innovative Computing and Communication (ICICC 2019). The second edition of the conference attracted a diverse range of engineering practitioners, academicians, scholars and industry delegates, receiving abstracts from more than 2200 authors from different parts of the world. The committee of professionals dedicated to the conference strove to achieve a high-quality technical program with tracks on innovative computing, innovative communication networks and security, and the Internet of things. All the tracks chosen for the conference are interrelated and are highly active areas in the present-day research community; a great deal of research is therefore happening in these tracks and their related sub-areas. As the name of the conference begins with the word 'innovation,' it targeted out-of-the-box ideas, methodologies, applications, expositions, surveys and presentations that help to advance the current state of research. More than 550 full-length papers were received, with contributions focused on theoretical work, computer simulation-based research and laboratory-scale experiments. Among these manuscripts, 129 papers have been included in the Springer proceedings after a thorough two-stage review and editing process. All the manuscripts submitted to ICICC 2019 were peer-reviewed by at least two independent reviewers, who were provided with a detailed review pro forma. The comments from the reviewers were communicated to the authors, who incorporated the suggestions in their revised manuscripts. The recommendations from two reviewers were taken into consideration while selecting a manuscript for inclusion in the proceedings. The exhaustiveness of the review process is evident given the large number of articles received addressing a wide range of research areas. The stringent review process ensured that each published manuscript met rigorous academic and scientific standards. It is an exalting experience to finally see these elite contributions materialize into two book volumes as the ICICC 2019 proceedings by Springer, entitled International Conference on
Innovative Computing and Communications. The articles are organized into two volumes under some broad categories covering subject matter on machine learning, data mining, big data, networks, soft computing and cloud computing, although, given the diverse areas of research reported, this might not always have been possible.
ICICC 2019 invited five keynote speakers, who are eminent researchers in the field of computer science and engineering, from different parts of the world. In addition to the plenary sessions on each day of the conference, five concurrent technical sessions were held every day to accommodate the oral presentation of around 129 accepted papers. Keynote speakers and session chair(s) for each of the concurrent sessions were leading researchers from the thematic area of the session. A technical exhibition was held during both days of the conference, putting on display the latest technologies, expositions, ideas and presentations. The delegates were provided with a book of extended abstracts to quickly browse through the contents, participate in the presentations and make the work accessible to a broad audience. The research part of the conference was organized in a total of 35 special sessions. These special sessions provided the opportunity for researchers conducting research in specific areas to present their results in a more focused environment.
An international conference of such magnitude and release of the ICICC 2019
proceedings by Springer has been the remarkable outcome of the untiring efforts
of the entire organizing team. The success of an event undoubtedly involves the
painstaking efforts of several contributors at different stages, dictated by their
devotion and sincerity. Fortunately, since the beginning of its journey, ICICC 2019
has received support and contributions from every corner. We thank all those who wished ICICC 2019 the best and contributed in any way toward its success. The edited proceedings volumes by Springer would not have been possible
without the perseverance of all the steering, advisory and technical program
committee members.
The organizers of ICICC 2019 thank all the contributing authors for their interest and exceptional articles. We would also like to thank the authors for adhering to the time schedule and for incorporating the review comments. We extend our heartfelt acknowledgment to the authors, peer reviewers, committee members and production staff whose diligent work shaped the ICICC 2019 proceedings. We especially want to thank our dedicated team of peer reviewers who volunteered for the arduous and tedious step of quality checking and critiquing the submitted manuscripts. We wish to thank our faculty colleagues Mr. Moolchand Sharma and Ms. Prerna Sharma for their enormous assistance during the conference. The time they spent and the midnight oil they burnt are greatly appreciated, and we will remain ever indebted. The management, faculty and administrative and support staff of the college have always extended their services whenever needed, for which we remain thankful.

Lastly, we would like to thank Springer for accepting our proposal to publish the ICICC 2019 proceedings. The help received from Mr. Aninda Bose, Senior Editor, Acquisitions, throughout the process has been very useful.

New Delhi, India
Ashish Khanna
Deepak Gupta
Organizers, ICICC 2019
About This Book

The International Conference on Innovative Computing and Communications (ICICC 2019) was held on 21–22 March at VŠB—Technical University of Ostrava, Czech Republic, Europe. The conference attracted a diverse range of engineering practitioners, academicians, scholars and industry delegates, receiving papers from more than 2200 authors from different parts of the world. Only 129 papers were accepted and registered, an acceptance ratio of 23%, to be published in two volumes of the prestigious Springer Advances in Intelligent Systems and Computing (AISC) series. Volume 1 includes the accepted papers of the machine learning, data mining and big data tracks, a total of 77 papers.
Contents

Improving the Accuracy of Collaborative Filtering-Based Recommendations by Considering the Temporal Variance of Top-N Neighbors
Pradeep Kumar Singh, Showmik Setta, Pijush Kanti Dutta Pramanik and Prasenjit Choudhury

Exploring the Effect of Tasks Difficulty on Usability Scores of Academic Websites Computed Using SUS
Kalpna Sagar and Anju Saha

Prediction and Estimation of Dominant Factors Contributing to Lesion Malignancy Using Neural Network
Kumud Tiwari, Sachin Kumar and R. K. Tiwari

Prediction of Tuberculosis Using Supervised Learning Techniques Under Pakistani Patients
Muhammad Ali and Waqas Arshad

Automatic Retail Invoicing and Recommendations
Neeraj Garg and S. K. Dhurandher

Modeling Open Data Usage: Decision Tree Approach
Barbara Šlibar

Technology-Driven Smart Support System for Tourist Destination Management Organizations
Leo Mrsic, Gorazd Surla and Mislav Balkovic

Ortho-Expert: A Fuzzy Rule-Based Medical Expert System for Diagnosing Inflammatory Diseases of the Knee
Anshu Vashisth, Gagandeep Kaur and Aditya Bakshi

GA with k-Medoid Approach for Optimal Seed Selection to Maximize Social Influence
Sakshi Agarwal and Shikha Mehta

Sentiment Analysis on Kerala Floods
Anmol Dudani, V. Srividya, B. Sneha and B. K. Tripathy

Recommendation System Using Community Identification
Suman Venkata Sai Voggu, Yuvraj Singh Champawat, Swaraj Kothari and B. K. Tripathy

Comparison of Deep Learning and Random Forest for Rumor Identification in Social Networks
T. Manjunath Kumar, R. Murugeswari, D. Devaraj and J. Hemalatha

Ontological Approach to Analyze Traveler's Interest Towards Adventures Club
Toseef Aslam, Maria Latif, Palwashay Sehar, Hira Shaheen, Tehseen Kousar and Nazifa Nazir

Performance Analysis of Off-Line Signature Verification
Sanjeev Rana, Avinash Sharma and Kamlesh Kumari

Fibroid Detection in Ultrasound Uterus Images Using Image Processing
K. T. Dilna and D. Jude Hemanth

Progressive Generative Adversarial Binary Networks for Music Generation
Manan Oza, Himanshu Vaghela and Kriti Srivastava

Machine Learning Approach for Diagnosis of Autism Spectrum Disorders
Sai Yerramreddy, Samriddha Basu, Ananya D. Ojha and Dhananjay Kalbande

Methodologies for Epilepsy Detection: Survey and Review
Ananya D. Ojha, Ananya Navelkar, Madhura Gore and Dhananjay Kalbande

Scene Understanding Using Deep Neural Networks—Objects, Actions, and Events: A Review
Ranjini Surendran and D. Jude Hemanth

Scene Text Recognition: A Preliminary Investigation on Various Techniques and Implementation Using Deep Learning Classifiers
N. Bhavesh Shri Kumar, Dasi Naga Brahma Krishna Sumanth Reddy, K. Sairam and J. Naren

Computer-Aided Diagnosis System for Investigation and Detection of Epilepsy Using Machine Learning Techniques
J. Naren, A. B. Sarada Pyngas and S. Subhiksha

Moments-Based Feature Vector Extraction for Iris Recognition
J. Jenkin Winston and D. Jude Hemanth

A Novel Approach to Improve Website Ranking Using Digital Marketing
Khyati Verma, Sanjay Kumar Malik and Ashish Khanna

A Comparative Study on Different Skull Stripping Techniques from Brain Magnetic Resonance Imaging
Ruhul Amin Hazarika, Khrawnam Kharkongor, Sugata Sanyal and Arnab Kumar Maji

Predicting Academic Performance of International Students Using Machine Learning Techniques and Human Interpretable Explanations Using LIME—Case Study of an Indian University
Pawan Kumar and Manmohan Sharma

Improved Feature Matching Approach for Detecting Copy-Move Forgery and Localization of Digital Images
Vanita Mane and Subhash Shinde

Sentiment Analysis Using Gini Index Feature Selection, N-Gram and Ensemble Learners
Furqan Iqbal

Text Summarization by Hybridization of Hypergraphs and Hill Climbing Technique
Hemamalini Siranjeevi, Swaminathan Venkatraman and Kannan Krithivasan

Comparing Machine Learning Algorithms to Predict Diabetes in Women and Visualize Factors Affecting It the Most—A Step Toward Better Health Care for Women
Arushi Agarwal and Ankur Saxena

Evolution of mHealth Eco-System: A Step Towards Personalized Medicine
Mohit Saxena and Ankur Saxena

Genetic Variance Study in Human on the Basis of Skin/Eye/Hair Pigmentation Using Apache Spark
Ankur Saxena, Shivani Chandra, Alka Grover, Lakshay Anand and Shalini Jauhari

An Improved Technique on Existing Neck and Head Support Systems for Cervical Dystonia
Rahul Dubey, Rahul Vishwakarma and Ashish Mishra

Automatic Text Summarization Using Fuzzy Extraction
Bharti Sharma, Nitika Katyal, Vishant Kumar, Shivani and Amit Lathwal

A Comparison of Machine Learning Approaches for Classifying Flood-Hit Areas in Aerial Images
J. Akshya and P. L. K. Priyadarsini

Cognitive Services Applied as Student Support Service Chatbot for Educational Institution
Leo Mrsic, Tomislav Mesic and Mislav Balkovic

Feature Extraction and Detection of Obstructive Sleep Apnea from Raw EEG Signal
Ch. Usha Kumari, Padmavathi Kora, K. Meenakshi, K. Swaraja, T. Padma, Asisa Kumar Panigrahy and N. Arun Vignesh

AIRUYA-A Personal Shopping Assistant
Yashi Rai, Aishwarya Raj, Kumari Suruchi Sah and Akash Sinha

Adapting Machine Learning Techniques for Credit Card Fraud Detection
Bright Keswani, Prity Vijay, Narayan Nayak, Poonam Keswani, Saumyaranjan Dash, Laxman Sahoo, Tarini Ch. Mishra and Ambarish G. Mohapatra

Ensemble Feature Selection Method Based on Recently Developed Nature-Inspired Algorithms
Jatin Arora, Utkarsh Agrawal, Prayag Tiwari, Deepak Gupta and Ashish Khanna

Diagnosis of Parkinson's Disease Using a Neural Network Based on QPSO
Srishti Sahni, Vaibhav Aggarwal, Ashish Khanna, Deepak Gupta and Siddhartha Bhattacharyya

Transfer Learning Model for Detecting Early Stage of Prurigo Nodularis
Dhananjay Kalbande, Rithvika Iyer, Tejas Chheda, Uday Khopkar and Avinash Sharma

Optimization of External Stimulus Features for Hybrid Visual Brain–Computer Interface
Deepak Kapgate, Dhananjay Kalbande and Urmila Shrawankar

Predicting the Outcome of an Election Results Using Sentiment Analysis of Machine Learning
Ankur Saxena, Neeraj Kushik, Ankur Chaurasia and Nidhi Kaushik

A Wide ResNet-Based Approach for Age and Gender Estimation in Face Images
Rajdeep Debgupta, Bidyut B. Chaudhuri and B. K. Tripathy

Bio-inspired Algorithms for Diagnosis of Heart Disease
Moolchand Sharma, Ananya Bansal, Shubbham Gupta, Chirag Asija and Suman Deswal

Emotion Detection Through EEG Signals Using FFT and Machine Learning Techniques
Anvita Saxena, Kaustubh Tripathi, Ashish Khanna, Deepak Gupta and Shirsh Sundaram

Text Summarization with Different Encoders for Pointer Generator Network
Minakshi Tomer and Manoj Kumar

Performance Evaluation of Meta-Heuristic Algorithms in Social Media Using Twitter
P. Silambarasi and Kiran L. N. Eranki

Heuristic Coordination for Multi-agent Motion Planning
Buddhadeb Pradhan, Nirmal Baran Hui and Diptendu Sinha Roy

Optimization of Click-Through Rate Prediction of an Advertisement
N. Madhu Sudana Rao, Kiran L. N. Eranki, D. L. Harika, H. Kavya Sree, M. M. Sai Prudhvi and M. Rajasekar Reddy

Prediction of Cervical Cancer Using Chicken Swarm Optimization
Ayush Kumar Tripathi, Priyam Garg, Alok Tripathy, Navender Vats, Deepak Gupta and Ashish Khanna

Describing Image Using Neural Networks
Atul Kumar, Ratnesh Kumar and Shailesh Kumar Shrivastava

An Efficient Expert System for Proactive Fire Detection
Venus Singla and Harkiran Kaur

Computational Intelligence for Technology and Services Computing

Analysis of Sentiment Analysis Techniques
Puja Bharti, Amit Kant Verma, Ravi Raj and Gopal Krishna

Detection of Parkinson's Disease Using Machine Learning Techniques for Voice and Handwriting Features
Nikita Goel, Ashish Khanna, Deepak Gupta and Naman Gupta

Audio–Video Aid Generator for Multisensory Learning
Reshabh Kumar Sharma, Aman Alam Bora, Sachin Bhaskar and Prabhat Kumar

E-Labharthi—Information Management for Sustainable Rural Development
Amar Nath Pandey, Shakti Pandey and Sangeeta Sinha

Prediction of Celiac Disease Using Machine-Learning Techniques
Agrima Mehandiratta, Neha Vij, Ashish Khanna, Pooja Gupta, Deepak Gupta and Ayush Kumar Gupta

Detection of Devanagari Text from Wild Images Through Image Processing Techniques
Manas Bhardwaj, Savitoj Singh, Ashish Khanna, Ankita Gupta, Deepak Gupta and Naman Gupta

Dynamic Web with Automatic Code Generation Using Deep Learning
Prerna Sharma, Vikas Chaudhary, Nakul Malhotra, Nikita Gupta and Mohit Mittal

Detection of Garbage Disposal from a Mobile Vehicle Using Image Processing
Shubhi Jain, Naman Gupta, Ashish Khanna, Ankita Gupta and Deepak Gupta

Using Neural Network to Identify Forgery in Offline Signatures
Piyush Agrawal, Rahul Bhalsodia and Yash Garg

Analyzing the Impact of Age and Gender on User Interaction in Gaming Environment
Abid Jamil, Ch. M. Nadeem Faisal, Muhammad Asif Habib, Sohail Jabbar and Haseeb Ahmad

Analysis of Prediction Techniques for Temporal Data Based on Nonlinear Regression Model
Pinki Sagar, Prinima Gupta and Indu Kashyap

Robust Denoising Technique for Ultrasound Images Based on Weighted Nuclear Norm Minimization
Shaik Mahaboob Basha and B. C. Jinaga

Comparison of Machine Learning Models for Airfoil Sound Pressure Prediction and Denoising for Airbots
R. Kavitha and C. R. Srivatsan

Local Texture Features for Content-Based Image Retrieval of Interstitial Lung Disease Patterns on HRCT Lung Images
Jatindra Kumar Dash, Manisha Patro, Snehasish Majhi, Gandham Girish and P. Nancy Anurag

Multilevel Quantum Sperm Whale Metaheuristic for Gray-Level Image Thresholding
Siddhartha Bhattacharyya, Sandip Dey, Jan Platos, Vaclav Snasel and Tulika Dutta

Malaria Detection on Giemsa-Stained Blood Smears Using Deep Learning and Feature Extraction
Nobel Dang, Varun Saraf, Ashish Khanna, Deepak Gupta and Tariq Hussain Sheikh

An Effective Instruction Execution and Processing Model in Multiuser Machine Environment
Abraham Ayegba Alfa, Sanjay Misra, Francisca N. Ogwueleka, Ravin Ahuja, Adewole Adewumi, Robertas Damasevicius and Rytis Maskeliunas

Automatic 3D Reconstruction Detection System for Knee Osteoarthritis Based on K-Means Algorithm
Kadry Ali Ezzat, Lamia Nabil Mahdy, Aboul Ella Hassanien, Ashraf Darwish, Snasel Vaclav and Deepak Gupta

Optimized Twin Support Vector Clustering in Transmission Electron Microscope of Cobalt Nanoparticles
Atrab A. Abd El-Aziz, Heba Al Shater, A. Dakhlaoui, Aboul Ella Hassanien and Deepak Gupta

Transfer Learning with a Fine-Tuned CNN Model for Classifying Augmented Natural Images
Dalia Ezzat, Aboul Ella Hassanien, Mohamed Hamed N. Taha, Siddhartha Bhattacharyya and Snasel Vaclav

Classification of Human Sperm Head in Microscopic Images Using Twin Support Vector Machine and Neural Network
Kamel K. Mohammed, Heba M. Afify, Fayez Fouda, Aboul Ella Hassanien, Siddhartha Bhattacharyya and Snasel Vaclav

Challenges of Big Data Visualization in Internet-of-Things Environments
Doaa Mohey Eldin, Aboul Ella Hassanien and Ehab E. Hassanien

Multiple Cyclic Swarming Optimization for Uni- and Multi-modal Functions
I. Fares, Rizk M. Rizk-Allah, Aboul Ella Hassanien and Snasel Vaclav

Author Index


About the Editors

Ashish Khanna received his Ph.D. from NIT, Kurukshetra in March 2017, his [Link]. in 2009 and his [Link]. from GGSIPU, Delhi in 2004. He completed his postdoc at Inatel, Brazil. He has published 72 research papers and book chapters in reputed journals and conferences, and has authored/edited 14 books.

Deepak Gupta received his Ph.D. in CSE from Dr. APJ Abdul Kalam Technical University, his M.E. from Delhi University, and his [Link]. in 2017, 2010 and 2005, respectively. He completed his postdoc at Inatel, Brazil. He is currently working at Maharaja Agrasen Institute of Technology, GGSIPU, India. He has published 82 papers in international journals and conferences and has authored/edited 36 books with international publishers.

Siddhartha Bhattacharyya completed his bachelor's degrees in Physics and in Optics & Optoelectronics, and his master's in Optics & Optoelectronics, at the University of Calcutta in 1995, 1998 and 2000, respectively, and his Ph.D. in CSE at Jadavpur University in 2008. He is currently the Principal of the RCC Institute of Information Technology, Kolkata, India. He is a co-author/editor of 24 books and has published more than 200 papers in international journals and conferences.

Vaclav Snasel is Dean of the Faculty of Electrical Engineering and Computer Science at VŠB—Technical University of Ostrava. He has almost 30 years' experience in academia and research, including industrial cooperation, and works in a multi-disciplinary environment on various real-world problems.

Jan Platos received his Bachelor of Computer Science in 2005, his Master of Computer Science in 2006 and his Ph.D. in Computer Science in 2010. He is currently the Head of the Department of Computer Science at the Faculty of Electrical Engineering and Computer Science, VŠB—Technical University of Ostrava. He is the author of more than 178 papers in international journals and conferences and has organized 11 international conferences.

Aboul Ella Hassanien (Abo) received his [Link]. with honours in 1986 and his [Link]. degree in 1993 from the Pure Mathematics and Computer Science Department at the Faculty of Science, Ain Shams University, Cairo, Egypt. He received his doctoral degree from the Tokyo Institute of Technology, Japan, in 1998. He is a Full Professor at the IT Department, Faculty of Computer and Information, Cairo University. He has authored over 380 research publications in leading peer-reviewed journals, book chapters and conference proceedings.
Improving the Accuracy of Collaborative Filtering-Based Recommendations by Considering the Temporal Variance of Top-N Neighbors

Pradeep Kumar Singh, Showmik Setta, Pijush Kanti Dutta Pramanik and Prasenjit Choudhury

Abstract The accuracy of the recommendation process based on neighborhood-based collaborative filtering tends to degrade over time because the interests/preferences of the neighbors are likely to change. Traditional recommendation methods do not consider the shifted likings of the neighbors; hence, the calculated set of neighbors does not always reflect the optimal neighborhood at any given point of time. In this paper, we propose a novel approach to calculate the similarity between users and find the similar neighbors of the target user in different time periods to improve the accuracy of personalized recommendation. The performance of the proposed algorithm is tested on the MovieLens dataset using different performance metrics, viz. MAE, RMSE, precision, recall, F-score, and accuracy.

Keywords Recommender systems · Collaborative filtering · Similarity metrics · Prediction approach · Rating · Top-n neighbor · Time period · Cluster · MAE · RMSE · Precision · Recall · F-score · Accuracy

P. K. Singh (B) · P. K. D. Pramanik · P. Choudhury
Department of Computer Science and Engineering, National Institute of Technology Durgapur, Durgapur, India
e-mail: pksingh300689se@[Link]

P. K. D. Pramanik
e-mail: pijushjld@[Link]

P. Choudhury
e-mail: prasenjit0007@[Link]

S. Setta
Department of Computer Application, Techno India Hooghly, Chinsurah, India
e-mail: showmiklovesport@[Link]

1 Introduction

Collaborative filtering (CF) is one of the most popular filtering approaches used in e-commerce applications for recommending online items to users [1]. To predict recommendable items, CF-based recommendation systems consider the alikeness
of the ratings (rating similarity) given by similar users (neighbors) for an item. Similarity measures play an important role in the accuracy of CF [1]. An inaccurately computed set of top-n similar neighbors of the target user leads to low prediction accuracy in CF-based recommendation systems [2]. However, traditional similarity measures have certain limitations in finding the similar neighbors of the target user across different time periods.

1.1 Problem Definition

Similar rating patterns of two users suggest that they might have similar likings. But human preferences change over time, and, as a result, the list of neighbors of a particular user also changes. For example, Table 1 shows a list of four users and six movies with their rating information. All users rated the first three movies in 1996 and the remaining three movies in 1997.
Table 1 illustrates the changing interest of User 1. The most similar user to User 1 is User 2 in 1996, but User 1's rating pattern is more similar to User 3's in 1997. Traditional similarity measures use the complete table, and the accuracy of a recommendation based on the old set of similar users tends to decrease over time. Hence, there is a need for a novel neighborhood calculation approach that considers the changing preferences of the neighbors to enhance the accuracy of personalized recommendations.

1.2 Proposed Solution Approach

To include temporal information in calculating the neighborhood, we consider user ratings on a per-year basis. Optimized k-means clustering is applied to the yearly data to find clusters of similar users across years. The philosophy of clustering on years is that two users are considered more similar if their yearly rating behavior on co-rated items and their yearly numbers of ratings are similar.

Table 1 Rating information of users in different years

Year      1996              1997
Movie     A     B     C     D     E     F
User 1    3.5   4.5   1     2     4     3
User 2    3     4     1.5   1.5   2     1
User 3    0.5   1.5   0.5   1.5   4     3.5
User 4    0.5   1     1.5   2     5     3

The advantage of this approach is that we can find the top-n neighbors of the target users in different time intervals. This improves the accuracy of personalized recommendation compared with traditional similarity metrics.

1.3 Advantage of the Proposed Approach

The work of this paper offers twofold advantages:

– It mitigates the limitation of traditional similarity measures in finding similar users in different time periods.
– It leads to more personalized and accurate recommendations.

1.4 Contribution of This Paper

The aim of the proposed approach is to extract the similar users of the target user in different time intervals for more personalized recommendations and to improve the accuracy of traditional neighborhood-based CF algorithms. The main contributions of this paper are as follows:

– Calculate the total number of ratings provided by each user in different time intervals.
– On the dataset containing each user's total number of ratings in different years, apply the optimized k-means clustering algorithm (Elbow method) to find the optimal number of clusters.
– Compare the traditional CF algorithm and the proposed approach on the basis of performance metrics, i.e., MAE, RMSE, precision, recall, and accuracy.

1.5 Organization of the Paper

Section 2 briefly discusses the associated background; not many works have addressed temporal information for calculating the neighborhood, and a few of them are mentioned in that section. Section 3 presents the solution details, which include two algorithms: (a) finding the optimal number of clusters of similar users and (b) finding the top-n neighbors of a target user. Section 4 presents a comparative analysis of the proposed approach on the MovieLens dataset using performance metrics such as MAE, RMSE, precision, recall, F-score, and accuracy. Finally, Sect. 5 concludes the paper.

2 Background and Related Work

A recommender system (RS) gives probabilistic suggestions of products or items to users based on their explicit and implicit preferences, using previously collected data about users and items. Content-based filtering and CF are the two main filtering approaches used in RSs [3]. In content-based recommendation [4], the RS recommends items that are similar to those the user has liked in the past, whereas CF identifies the top-n users whose taste is similar to that of the target user. Breese et al. [5] introduced the two classes of CF, i.e., model-based CF and memory-based CF. In model-based CF, a model is first trained using previous information about users and items and is then used for prediction. Campos et al. [6] proposed a Bayesian network that combines the features of content-based filtering and collaborative filtering, and reported effective experimental results on the MovieLens dataset. Computing the similarity between top-n users/items and computing predictions using these top-n similar users/items are the two basic steps in memory-based CF. Based on user similarity and item similarity, memory-based CF can be categorized into user-based CF and item-based CF, respectively. Scalability and sparsity are the two major issues in CF. Li et al. [7] minimized the scalability problem with an optimized MapReduce implementation of the item-based collaborative filtering recommendation algorithm. To mitigate the sparsity issue, Kanti et al. [8] introduced a new CF approach that combines both user and item similarities for rating prediction. Lee et al. [9] elaborated a time-based recommender system for more accurate recommendation; their system utilizes temporal information, i.e., user purchase time and item launch time, along with rating information to find more personalized neighbors in different time intervals. Koohi et al. [10] proposed a method to find similar neighbors of the target user to improve the performance of user-based CF. Najafabadi et al. [11] used k-means clustering and association rule mining to improve the accuracy of collaborative filtering recommendations. However, k-means clustering has a limitation in determining the optimal number of clusters that will provide high accuracy. Hence, our work mainly focuses on minimizing this limitation of k-means clustering while providing more personalized recommendations.

3 Proposed Recommendation Approach

Ratings capture users' feedback, i.e., their inherent interest in particular items. Users' interests are dynamic in nature, so their sets of similar neighbors also change across time periods. The philosophy of the proposed approach is that two users are similar if their rating patterns are similar in different time periods. Hence, the proposed work includes a matrix Yuc (users' yearly contribution) that records the total number of ratings provided by a user in a particular year.

Table 2 A matrix of users' yearly contribution

User      Y_1           ...   Y_j           ...   Y_k
User 1    Y_uc(1, 1)    ...   Y_uc(1, j)    ...   Y_uc(1, k)
...       ...           ...   ...           ...   ...
User i    Y_uc(i, 1)    ...   Y_uc(i, j)    ...   Y_uc(i, k)
...       ...           ...   ...           ...   ...
User m    Y_uc(m, 1)    ...   Y_uc(m, j)    ...   Y_uc(m, k)

Table 2 shows a matrix of size m × k, where m and k represent the number of users and the number of years, respectively. If user i has rated n items in the jth year, then the value of Y_uc(i, j) is n. The procedure for finding the optimal number of clusters of similar users is shown in Algorithm 1.
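For illustration, the Yuc matrix can be built directly from the raw ratings. The following is a minimal pandas sketch assuming the ml-100k u.data file, which is tab-separated with user id, item id, rating, and Unix timestamp columns; the variable names are ours, not the authors':

```python
import pandas as pd

# Load MovieLens ml-100k ratings: user id, item id, rating, Unix timestamp.
ratings = pd.read_csv("u.data", sep="\t",
                      names=["user", "item", "rating", "timestamp"])

# Map each rating to the year in which it was given.
ratings["year"] = pd.to_datetime(ratings["timestamp"], unit="s").dt.year

# Y_uc(i, j): total number of ratings given by user i in year j.
Y_uc = (ratings.groupby(["user", "year"]).size()
               .unstack(fill_value=0))
print(Y_uc.head())
```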

Algorithm 1: Finding the optimal number of clusters of similar users

1: Input: Yuc dataset.
2: Output: Optimal number of clusters of similar users.
3: Procedure:
4: For c = 1 to k, run the k-means clustering algorithm on Yuc, where k represents the total number of years.
5: For each c, calculate the SSE (sum of squared errors) of the resulting clustering using $\mathrm{SSE} = \sum_{c=1}^{k} \sum_{u \in C_c} \mathrm{dist}(u, C_c)^2$, where C = (C_1, C_2, ..., C_c, ..., C_k) is the set of clusters and dist is a function that calculates the distance between user u and the cluster centroid.
6: Plot the SSE curve for each cluster count c = 1 to k.
7: The location of a bend (knee) in the plot, where both c and the SSE value are low, is taken as the optimal number of clusters C_o.
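Steps 4–6 of Algorithm 1 can be sketched with scikit-learn, whose KMeans estimator exposes the SSE of a fitted model as inertia_; this is an illustrative sketch under our naming, not the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_sse(Y_uc, k_max):
    """SSE of k-means for c = 1..k_max clusters on the Y_uc matrix."""
    sse = []
    for c in range(1, k_max + 1):
        km = KMeans(n_clusters=c, n_init=10, random_state=0).fit(Y_uc)
        # inertia_ is the sum of squared distances of samples to their
        # closest cluster centroid, i.e., the SSE of step 5.
        sse.append(km.inertia_)
    return np.array(sse)

# The optimal cluster count C_o is read off the bend (knee) of a plot
# of elbow_sse(Y_uc, k) against the cluster count c.
```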

Furthermore, in the proposed approach, the following equation is computed to recommend the top-n items to a user:

$$\hat{r}_{ui} = \frac{\sum_{c=1}^{k} C(u,c)\left(\bar{r}_u + \frac{\sum_{v \in N_i(u)} \mathrm{sim}(u,v)\,(r_{vi} - \bar{r}_v)}{\sum_{v \in N_i(u)} |\mathrm{sim}(u,v)|}\right)}{\sum_{c=1}^{k} |C(u,c)|}$$

Here, $\hat{r}_{ui}$ denotes the predicted rating of the target user u on item i, and C(u, c) is a binary matrix indicating whether user u belongs to cluster c: C(u, c) is 1 if user u belongs to cluster c and 0 otherwise. $\bar{r}_u$ and $\bar{r}_v$ are the average ratings of users u and v, respectively, $r_{vi}$ is the rating of user v on item i, and sim(u, v) is the similarity between users u and v. Algorithm 2 lists the complete steps of the proposed recommendation approach.

Algorithm 2: Recommendation of a top-n list to the target user

1: Input: User–item rating dataset; a set of users (U), items (I) and times (Y) at which the users gave their ratings.
2: Output: A list of top-n items for the target user u based on the ratings predicted by the proposed algorithm.
3: Procedure:
4: For ∀i ∈ U, ∀j ∈ Y, calculate Y_uc(i, j).
5: Apply the optimized k-means clustering algorithm (Elbow method) on the Y_uc matrix and compute the optimal set of clusters C = (C_1, C_2, ..., C_c, ..., C_k).
6: For ∀u ∈ U, ∀i ∈ I, if R_ui == 0, then
7: compute $\hat{r}_{ui} = \frac{\sum_{c=1}^{k} C(u,c)\left(\bar{r}_u + \frac{\sum_{v \in N_i(u)} \mathrm{sim}(u,v)\,(r_{vi} - \bar{r}_v)}{\sum_{v \in N_i(u)} |\mathrm{sim}(u,v)|}\right)}{\sum_{c=1}^{k} |C(u,c)|}$, where R_ui is the rating of user u on item i.
8: Generate a list of top-n items based on the predicted ratings.
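A minimal NumPy sketch of the prediction step follows, under our reading of the equation (the similarity-weighted, mean-centered deviations are averaged over the clusters to which u belongs, with neighbors drawn from each cluster); the dense-matrix representation and all names are assumptions for illustration:

```python
import numpy as np

def predict_rating(u, i, R, sim, C):
    """Predict r_ui for target user u on item i (step 7 of Algorithm 2).

    R   : (m x n) rating matrix, 0 where unrated.
    sim : (m x m) user-user similarity matrix (e.g., Pearson).
    C   : (m x k) binary cluster-membership matrix from the elbow step.
    """
    # Mean rating of every user over their rated items.
    r_bar = np.array([row[row > 0].mean() if (row > 0).any() else 0.0
                      for row in R])
    preds = []
    for c in range(C.shape[1]):
        if C[u, c] == 0:
            continue  # u does not belong to cluster c
        # Neighbors of u inside cluster c that have rated item i.
        neigh = [v for v in np.nonzero(C[:, c])[0] if v != u and R[v, i] > 0]
        num = sum(sim[u, v] * (R[v, i] - r_bar[v]) for v in neigh)
        den = sum(abs(sim[u, v]) for v in neigh)
        # Mean-centered prediction within this cluster.
        preds.append(r_bar[u] + (num / den if den else 0.0))
    # Average over the clusters that u belongs to (the outer sums over c).
    return float(np.mean(preds)) if preds else float(r_bar[u])
```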

Table 3 Used similarity metric and prediction approach

Pearson correlation: $\mathrm{sim}(u,v) = \dfrac{\sum_{i \in I} (r_{i,u} - \bar{r}_u)(r_{i,v} - \bar{r}_v)}{\sqrt{\sum_{i \in I} (r_{i,u} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I} (r_{i,v} - \bar{r}_v)^2}}$

Mean centering: $\hat{r}_{ui} = \bar{r}_u + \dfrac{\sum_{v \in N_i(u)} \mathrm{sim}(u,v)\,(r_{vi} - \bar{r}_v)}{\sum_{v \in N_i(u)} |\mathrm{sim}(u,v)|}$

4 Comparative Analysis

The MovieLens ml-100k dataset has been used for the experimental analysis [12]. This dataset contains 100,000 ratings from 943 users on 1682 movies. Ratings range from 1 to 5 in increments of 1, where 1 denotes the lowest rating and 5 the highest; the dataset has 93.695% sparsity. Based on different training and testing splits, the collected data are divided into Dataset 1 and Dataset 2: Dataset 1 uses a 45% training set and a 55% test set, whereas Dataset 2 uses a 35% training set and a 65% test set. We use the Pearson correlation as the similarity measure and the mean-centering prediction approach for rating prediction, as shown in Table 3.
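For concreteness, the Pearson correlation of Table 3 over the co-rated items of two users can be sketched as follows (a minimal illustration in which the user means are taken over the co-rated set; the function and variable names are ours):

```python
import numpy as np

def pearson_sim(ru, rv):
    """Pearson correlation between two users over co-rated items.

    ru, rv: rating vectors of users u and v (0 means unrated).
    """
    co = (ru > 0) & (rv > 0)          # co-rated items I
    if co.sum() < 2:
        return 0.0
    du = ru[co] - ru[co].mean()       # mean-centered ratings of u
    dv = rv[co] - rv[co].mean()       # mean-centered ratings of v
    den = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return float((du * dv).sum() / den) if den else 0.0
```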
In Table 3, $r_{i,u}$ and $r_{i,v}$ denote the ratings of users u and v on item i. Six different metrics, i.e., MAE, RMSE, precision, recall, F-score, and accuracy, have been used for the evaluation of the proposed approach [1]. MAE and RMSE are computed as follows:

$$\mathrm{MAE} = \frac{\sum_{i=1}^{N} |p_i - \hat{q}_i|}{N} \quad (1)$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (p_i - \hat{q}_i)^2}{N}} \quad (2)$$

Here, $p_i$ and $\hat{q}_i$ denote the predicted and actual rating of item i, respectively, and N is the total number of predicted items. We consider ratings above 3 as high ratings (recommended items) and ratings below 3 as low ratings (not recommended items). The classification of the possible results is shown in Table 4.

Table 4 Classification of the possible results of a recommendation of an item to a user

                        Prediction
Type of rating          Recommended                  Not recommended
                        (predicted high rating)      (predicted low rating)
Actual high rating      True-Positive (t_p)          False-Negative (f_n)
Actual low rating       False-Positive (f_p)         True-Negative (t_n)
Hence, using Table 4, the equations for precision, recall, F-score, and accuracy become:

$$\mathrm{Precision} = \frac{\#t_p}{\#t_p + \#f_p} \quad (3)$$

$$\mathrm{Recall} = \frac{\#t_p}{\#t_p + \#f_n} \quad (4)$$

$$F\text{-}\mathrm{Score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (5)$$

$$\mathrm{Accuracy} = \frac{\#t_p + \#t_n}{\#t_p + \#t_n + \#f_p + \#f_n} \quad (6)$$

Here, # denotes 'the number of'.
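A compact sketch that computes all six evaluation metrics, Eqs. (1)–(6), from arrays of predicted and actual ratings, using the high/low rating threshold of 3 described above (function and variable names are ours):

```python
import numpy as np

def evaluate(pred, actual, threshold=3):
    """Compute MAE, RMSE, precision, recall, F-score, and accuracy."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    err = pred - actual
    mae = np.abs(err).mean()                        # Eq. (1)
    rmse = np.sqrt((err ** 2).mean())               # Eq. (2)

    rec, high = pred > threshold, actual > threshold  # Table 4 cells
    tp = np.sum(rec & high)
    fp = np.sum(rec & ~high)
    fn = np.sum(~rec & high)
    tn = np.sum(~rec & ~high)
    precision = tp / (tp + fp) if tp + fp else 0.0  # Eq. (3)
    recall = tp / (tp + fn) if tp + fn else 0.0     # Eq. (4)
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)      # Eq. (5)
    accuracy = (tp + tn) / (tp + tn + fp + fn)      # Eq. (6)
    return dict(MAE=mae, RMSE=rmse, precision=precision,
                recall=recall, f_score=f_score, accuracy=accuracy)
```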


Figures 1 and 2 show that, for different numbers of similar users, the proposed CF outperforms the traditional CF algorithm.

Fig. 1 Comparison between the traditional CF and the proposed approach based on MAE, RMSE, and precision values

Fig. 2 Comparison between the traditional CF and the proposed approach based on recall, F-score, and accuracy values

5 Conclusion

Collaborative filtering has been the most popular technique used in recommenda-
tion systems to suggest online items to the users. The recommendation to a user is
done based on the preferences of the similar users (neighbors) who rated the same
items likewise. Hence, for personalized and accurate recommendation, it is crucial
to find the set of neighbors correctly. With the explosive growth of the number of
users, the traditional CF faces difficulty in findings of top-n neighbors of the target
user. Especially, traditional similarity measures have issues in computing the top-n
similar users due to the changing interest of the target user over time. This affects
the accuracy of the recommendation. This paper addresses this problem by calcu-
8 P. K. Singh et al.

Comparison based on MAE values

Comparison based on RMSE values

Comparison based on Precision

Fig. 1 Comparison between traditional CF and proposed approach based on MAE, RMSE, and
precision values
Improving the Accuracy of Collaborative Filtering … 9

Comparison based on Recall values

Comparison based on F-Score

Comparison based on Accuracy

Fig. 2 Comparison between traditional CF and proposed approach based on recall, F-Score, and
accuracy
10 P. K. Singh et al.

This approach guarantees that the list of top-n neighbors always remains up to date with the changing preferences of the target user's neighbors. As a result, the proposed CF provides a significant improvement in prediction accuracy over the existing traditional CF algorithm.

References

1. S.K. Singh, P.K.D. Pramanik, P. Choudhury, A comparative study of different similarity metrics in highly sparse rating dataset, in Data Management, Analytics and Innovation: Proceedings of ICDMAI, vol. 2 (Springer, Berlin, 2018), pp. 45–60
2. A.M. Jorge, J. Vinagre, M. Domingues, J. Gama, C. Soares, P. Matuszyk, M. Spiliopoulou,
Scalable Online Top-N Recommender Systems (Springer International Publishing, Berlin, 2017)
3. M. Balabanović, Y. Shoham, Fab: content-based, collaborative recommendation. Commun.
ACM 40(3), 66–72 (1997)
4. B. Mobasher, Data mining for web personalization, in The Adaptive Web (Springer, Heidelberg,
2007), pp. 90–135
5. J. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative
filtering, in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (1998),
pp. 43–52
6. L.M. Campos, J.M. de Fernández-Luna, J.F. Huete, M.A. Rueda-Morales, Combining content-
based and collaborative recommendations: A hybrid approach based on Bayesian networks.
Int. J. Approximate Reasoning 51, 785–799 (2010)
7. C. Li, K. He, CBMR: An optimized map reduce for item-based collaborative filtering recom-
mendation algorithm with empirical analysis. Concurrency Comput. Pract. Experience 29(10),
e4092 (2017)
8. S. Kanti, T. Mahara, Merging user and item based collaborative filtering to alleviate data
sparsity. Int. J. Syst. Assur. Eng. Manag. 1–7 (2016)
9. T.Q. Lee, Y. Park, Y.T. Park, A time-based approach to effective recommender systems using
implicit feedback. Expert Syst. Appl. 34(4), 3055–3062 (2008)
10. H. Koohi, K. Kiani, A new method to find neighbor users that improves the performance of
collaborative filtering. Expert Syst. Appl. 83, 30–39 (2017)
11. M.K. Najafabadi, M.N. Mahrin, S. Chuprat, H.M. Sarkan, Improving the accuracy of collab-
orative filtering recommendations using clustering and association rules mining on implicit
data. Comput. Hum. Behav. 67, 113–128 (2017)
12. MovieLens | GroupLens: [Link] Last Accessed 18 Aug
2018
Exploring the Effect of Tasks Difficulty
on Usability Scores of Academic Websites
Computed Using SUS

Kalpna Sagar and Anju Saha

Abstract The prime objective of this study is to empirically determine the effect of task difficulty on usability scores computed using the System Usability Scale (SUS). A usability dataset is created by involving twelve end-users who evaluate the usability of 15 academic websites in a laboratory. Each end-user performs three subsets of six tasks, whose difficulty varies from easy to impossible, under six different categories. Results are obtained after applying two statistical techniques: one is ANOVA and the other is correlation with regression. Results show that the SUS scores vary from higher to lower values when end-users conduct usability assessment with lists of easy, moderate, and impossible tasks on academic websites. The results also indicate the effect of task difficulty on the correlation between SUS scores and task success rate: although the correlation is strong for each subset of tasks, its strength varies depending on the nature of the tasks.

Keywords Usability evaluation · System usability scale · Quantitative assessment · Academic websites · Usability metric · Tasks difficulty

1 Introduction

In human–computer interaction, during the last decade, researchers have shown strong interest in the quantitative assessment of usability, which is conducted using various surveys. SUS, a survey containing ten questions [1], is widely used and has been applied to a variety of interactive systems [2–4] and under various task scenarios [5]. SUS is considered a powerful tool for computing and comparing usability scores for any interactive system. In the usability engineering research domain,

K. Sagar (B) · A. Saha


University School of Information and Communication Technology, GGSIPU, Sector 16-C, Delhi
110078, India
e-mail: sagarkalpna87@[Link]
A. Saha
e-mail: anju_kochhar@[Link]


the effect of task difficulty on SUS scores is mentioned as an open issue, which motivated us to perform the experiment for this study [6]. So, the aim of this paper is to determine the effect of task difficulty, ranging from easy to impossible tasks, on usability scores computed using SUS. However, SUS does not provide information about why one interactive system is more usable than another. For this reason, conventional ISO metrics [7] require further experiments to give a detailed picture of system usability [5]. A recent study showed a strong positive correlation between subjective usability measures and an ISO metric, i.e., the task success rate, for both laboratory and field studies at the individual and system levels [5]. This motivated us to explore the effect of task difficulty on the correlation between SUS scores and the success rate of tasks ranging from easy to impossible. The remaining sections of this paper are organized as follows: Section 2 describes the research methodology; Section 3 explains the experimental results with their analysis; Section 4 contains conclusions with future directions; and the last section contains the appendix.

2 Research Methodology

2.1 End-Users Details

Twelve end-users are involved in the usability evaluation process of 15 academic websites. The end-users are of varying age groups, frequently access academic websites, and have diverse educational and professional backgrounds.

2.2 Three Subsets of Tasks Considered for Usability Evaluation

15 academic websites¹ are evaluated with three subsets of tasks whose difficulty level varies from easy to impossible under six different categories, as presented in Table 1. Only tasks frequently used by all stakeholders are considered; the purpose of defining these tasks is to identify whether a website satisfies the usability goals required by all stakeholders.

1 Academic websites listed in NIRF are considered [8–10].



Table 1 List of categories considered for usability evaluation with three subsets of tasks

1. Content — Impossible: find online user feedback for the professor. Medium: find the concerned person responsible for uploading upcoming events; if required, then contact. Easy: find the vision and mission statement.
2. University program — Impossible: find Ph.D. supervisors' details, like available slots, etc. Medium: determine how to apply for the program. Easy: find out about various programs conducted in the university.
3. Navigation — Impossible: find the sitemap. Medium: show the users where they are. Easy: navigate to the homepage from any navigational level.
4. Search — Impossible: find a screen reader on the website for visually impaired students. Medium: find image- and video-based search results. Easy: find the simple and advanced search facility.
5. Mobile — Impossible: search for the last semester result on a mobile device. Medium: determine whether the website has its own mobile app. Easy: determine whether the university website's text is readable on mobile devices.
6. Social media — Impossible: find any digital storytelling media to help students. Medium: find the blog of the university. Easy: find the Facebook page of the university.

2.3 Procedure

Each end-user performs all the tasks listed in Table 1 on the assigned academic website. After executing these tasks in the laboratory, each end-user fills in the 10-item SUS survey² on the basis of their task experiences during the experiments. A Google form is generated for collecting the SUS survey responses. The procedure for computing the SUS scores, whose values range from 0 to 100, is adopted from [1]; a higher SUS score indicates a higher usability rating for an interactive system. Further, an ISO metric of system effectiveness, measured as the task success rate on each academic website, is also calculated [5, 11].

2 SUS Survey is mentioned in Appendix Section as Fig. 4.
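Since the SUS scoring rule from [1] is compact, a small sketch is given below; it assumes responses coded 1–5 in questionnaire order, with odd items positively worded and even items negatively worded, as in the standard SUS instrument.

def sus_score(responses):
    """SUS score (0-100) from ten 1-5 Likert responses, following [1].

    Odd-numbered items contribute (response - 1); even-numbered items
    contribute (5 - response); the sum of contributions is scaled by 2.5.
    """
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5

# Example: a mid-range response pattern scores 75.0
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))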



3 Experimental Results

This section describes the experimental results obtained with two statistical techniques: one-way ANOVA for the comparative analysis, and correlation with regression for investigating the relationship between SUS scores and task success rate.

3.1 Comparative Analysis

The comparative analysis shows how the SUS scores vary when usability assessment is performed with lists of easy-, moderate-, and impossible-level tasks. Once the usability dataset is collected from the end-users for the 15 academic websites, one-way ANOVA is applied using Minitab 17. Table 2 contains all the experimental and computed SUS mean values for the 15 academic websites evaluated with the three subsets of tasks. Results show an overall difference in the usability ratings of the 15 academic websites between the three subsets of tasks, with p < 0.001. Further, from Table 2 it is observed that the end-users executing easy-level tasks rated 13 of the websites with higher SUS scores than the other two groups. On the contrary, the end-users executing impossible-level tasks rated the 15 academic websites with the lowest SUS scores. Research practitioners must carefully consider the tasks for usability evaluation, as their difficulty level affects the SUS scores.

3.2 Correlation and Regression

Correlation and regression are also applied to the collected usability dataset. Correlation is a statistical measure that determines the association between two variables, whereas regression explains how an independent variable is numerically related to the dependent variable. The standard Pearson correlation was executed in Minitab 17 to investigate the relationship between the computed SUS scores and the task success rates of the three subcategories of tasks, ranging from easy to moderate to impossible. Table 3 contains all the experimental and computed values. The results show a positive correlation between SUS scores and task success rate for the academic websites across all three subcategories of tasks.
As seen in Fig. 1 (easy tasks), Fig. 2 (moderate tasks), and Fig. 3 (impossible tasks), higher SUS scores are associated with higher success rates and vice versa. In other words, easy-level tasks have higher success rates and higher SUS scores. For easy-level tasks, r(15) = 0.951, p < 0.001, r² = 0.905; for moderate-level tasks, r(15) = 0.914, p < 0.001, r² = 0.836; and for impossible-level tasks, r(15) = 0.971, p < 0.001, r² = 0.943.
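The correlation and regression figures above can be reproduced outside Minitab with SciPy; the sketch below assumes two equal-length sequences of per-website SUS means and success rates (such as a column pair from Table 3), and the three-website example call is for illustration only.

from scipy import stats

def correlate(sus_means, success_rates):
    """Pearson correlation and least-squares regression fit."""
    r, p = stats.pearsonr(sus_means, success_rates)
    fit = stats.linregress(sus_means, success_rates)
    print(f"r = {r:.3f}, p = {p:.3g}, r^2 = {r ** 2:.3f}")
    print(f"Success Rate = {fit.slope:.3f} * (SUS Mean) + {fit.intercept:.2f}")

# Illustrative call with the easy-task values of the first three websites in Table 3
correlate([59.79, 50.00, 53.75], [98.61, 75.00, 83.33])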

Table 2 SUS scores mean, standard deviation, F-value, and P-value for 15 academic websites broken down by task difficulty level

Universities | Easy tasks           | Moderate tasks       | Impossible tasks     | F-value | P-value
1. IISC      | M = 59.79, SD = 0.72 | M = 37.50, SD = 2.61 | M = 23.75, SD = 5.86 | 283.58  | 0.000
2. [Link]    | M = 50.00, SD = 1.06 | M = 29.58, SD = 3.34 | M = 12.50, SD = 2.61 | 663.38  | 0.000
3. [Link]    | M = 53.75, SD = 1.30 | M = 25.00, SD = 5.22 | M = 54.79, SD = 3.27 | 259.06  | 0.000
4. [Link]    | M = 57.50, SD = 1.85 | M = 27.50, SD = 2.61 | M = 25.00, SD = 5.22 | 314.00  | 0.000
5. [Link]    | M = 51.04, SD = 4.19 | M = 40.42, SD = 2.79 | M = 29.58, SD = 3.34 | 113.52  | 0.000
6. [Link]    | M = 58.13, SD = 2.17 | M = 29.58, SD = 3.34 | M = 31.25, SD = 3.92 | 296.07  | 0.000
7. [Link]    | M = 42.50, SD = 5.44 | M = 29.38, SD = 3.39 | M = 27.50, SD = 3.69 | 43.97   | 0.000
8. [Link]    | M = 36.25, SD = 2.50 | M = 28.96, SD = 3.76 | M = 26.04, SD = 4.19 | 26.21   | 0.000
9. [Link]    | M = 53.96, SD = 4.19 | M = 37.50, SD = 2.61 | M = 27.50, SD = 3.69 | 169.00  | 0.000
10. [Link]   | M = 54.79, SD = 3.28 | M = 41.67, SD = 1.62 | M = 30.63, SD = 6.92 | 86.00   | 0.000
11. [Link]   | M = 41.25, SD = 6.08 | M = 25.83, SD = 3.74 | M = 32.08, SD = 3.34 | 34.84   | 0.000
12. [Link]   | M = 33.75, SD = 2.50 | M = 35.41, SD = 1.44 | M = 27.08, SD = 3.51 | 33.91   | 0.000
13. [Link]   | M = 53.54, SD = 4.19 | M = 32.71, SD = 2.91 | M = 30.83, SD = 2.89 | 166.38  | 0.000
14. b-u-ac   | M = 53.96, SD = 1.29 | M = 33.12, SD = 2.41 | M = 28.33, SD = 2.68 | 455.22  | 0.000
15. [Link]   | M = 23.96, SD = 2.91 | M = 24.17, SD = 4.44 | M = 13.13, SD = 2.17 | 43.70   | 0.000

These results indicate a strong relationship between the SUS scores and the task success rate, although the strength of this relationship differs: it depends on the nature of the tasks considered to establish the correlation. Our findings echo those of [5, 12]. This research work encourages novice researchers to carefully consider the tasks and their difficulty levels for usability assessment as a determinant in the estimation of usability through SUS scores. A significant question is why it is important for researchers to understand the effect of task difficulty on usability scores computed using SUS. It is simply because the intuitive nature of researchers can lead them to choose inappropriate tasks for the usability evaluation of an interactive system, which can affect its usability as computed using SUS.

Table 3 Mean SUS scores computed with the execution of easy-, moderate-, and impossible-level tasks, and task success rates, for 15 academic websites

Universities | SUS mean (easy) | Success rate (easy) | SUS mean (moderate) | Success rate (moderate) | SUS mean (impossible) | Success rate (impossible)
1. IISC    | 59.79 | 98.61 | 37.50 | 65.28 | 23.75 | 16.67
2. [Link]  | 50.00 | 75.00 | 29.58 | 33.33 | 12.50 | 1.39
3. [Link]  | 53.75 | 83.33 | 25.00 | 20.83 | 54.79 | 84.72
4. [Link]  | 57.50 | 97.22 | 27.50 | 22.22 | 25.00 | 18.05
5. [Link]  | 51.04 | 80.55 | 40.41 | 50.00 | 29.58 | 33.33
6. [Link]  | 58.12 | 97.22 | 29.58 | 33.33 | 31.25 | 37.50
7. [Link]  | 42.50 | 69.44 | 29.38 | 33.33 | 27.50 | 20.83
8. [Link]  | 36.25 | 65.27 | 28.96 | 31.94 | 26.04 | 19.44
9. [Link]  | 53.95 | 80.55 | 37.50 | 47.22 | 27.50 | 20.83
10. [Link] | 54.79 | 84.72 | 41.66 | 51.39 | 30.63 | 34.72
11. [Link] | 41.25 | 68.05 | 25.83 | 19.44 | 32.08 | 34.72
12. [Link] | 33.75 | 63.88 | 35.41 | 45.83 | 27.08 | 20.83
13. [Link] | 53.54 | 83.33 | 32.71 | 36.11 | 30.83 | 23.61
14. b-u-ac | 53.95 | 81.94 | 33.13 | 36.11 | 28.33 | 20.83
15. [Link] | 23.95 | 51.38 | 24.17 | 19.44 | 13.13 | 1.39

For easy-level tasks, the relation was significant at r(15) = 0.951, p < .001, r² = 0.905. Regression fit: Success Rate = 1.229 × (SUS Mean) + 19.36 (1)

Fig. 1 Relationship between SUS and success rate mean when end-users execute a list of easy tasks

For moderate-level tasks, the relation was significant at r(15) = 0.914, p < .001, r² = 0.836. Regression fit: Success Rate = 2.200 × (SUS Mean) − 33.77 (2)

Fig. 2 Relationship between SUS and success rate mean when end-users execute a list of moderate tasks

For impossible-level tasks, the relation was significant at r(15) = 0.971, p < .001, r² = 0.943.

Fig. 3 Relationship between SUS and success rate mean when end-users execute a list of impossible tasks


4 Conclusions and Future Directions

The aim of this paper was to examine the effect of task difficulty on usability scores computed using SUS. Six different categories (i.e., content, university program, navigation, search, mobile, and social media) were created, containing tasks whose difficulty varies from easy to impossible. In the laboratory, a usability dataset was collected by involving twelve end-users who evaluated the usability of 15 academic websites with the list of defined tasks of varying difficulty. Minitab 17 was employed for the statistical techniques, i.e., one-way ANOVA, correlation, and regression. The obtained results show that the SUS scores vary from higher to lower values when end-users conduct usability assessment with lists of easy to impossible tasks on academic websites. From this,

the conclusion can be made that a higher difficulty level of tasks results in lower SUS scores and vice versa. The results also indicate a strong relationship between the SUS scores and system effectiveness, an ISO metric measured as task success rate, although the strength of this relationship differs across task types, depending on the nature of the tasks considered. From this, the conclusion can be made that the difficulty level of tasks does not remove the correlated relationship between the SUS score and success rate, although the strength of this relationship varies with the nature of the tasks. This research work encourages novice researchers to carefully consider the tasks for usability assessment as a determinant in the estimation of usability through SUS scores. In this study, a contemporary application of SUS is presented for academic websites in which the comparative results of usability assessments with tasks of varying difficulty levels are considered. A number of further studies will be required to understand this behavior. A field study of the same research work will also be needed so that the strength of the correlation in the field can be determined. Further, a generalized usability model can be implemented and optimized using evolutionary algorithms [13], and a novel automated tool can be implemented for usability evaluation in the near future [14].

5 Appendix

See Fig. 4.

Fig. 4 System usability scale survey



References

1. J. Brooke, SUS-A quick and dirty usability scale. Usability Eval. Ind. 189(194), 4–7 (1996)
2. J. Kirakowski, M. Corbett, SUMI: The software usability measurement inventory. Br. J. Edu.
Technol. 24(3), 210–212 (1993)
3. J.R. Lewis, IBM computer usability satisfaction questionnaires: psychometric evaluation and
instructions for use. Int. J. Hum.-Comput. Interact. 7(1), 57–78 (1995)
4. J. Sauro, SUPR-Q: A comprehensive measure of the quality of the website user experience. J.
Usability Stud. 10(2), 68–86 (2015)
5. P. Kortum, S.C. Peres, The relationship between system effectiveness and subjective usability
scores using the System Usability Scale. Int. J. Hum.-Comput. Interact. 30(7), 575–584 (2014)
6. J.R. Lewis, The system usability scale: past, present, and future. Int. J. Hum.–Comput. Interact.
1–14 (2018)
7. ISO, Ergonomic requirements for office work with visual display terminals (VDTs)—Part 11: Guidance on usability. ISO Standard 9241-11:1998. International Organization for Standardization (1998)
8. [Link]
9. K. Sagar, A. Saha, Qualitative usability feature selection with ranking: a novel approach for
ranking the identified usability problematic attributes for academic websites using data-mining
techniques. Hum.-centric Comput. Inf. Sci. 7(1), 29 (2017)
10. K. Sagar, D. Gupta, A.K. Sangaiah, Manual versus automated qualitative usability assessment
of interactive systems. Concurrency Comput. Pract. Experience e5091
11. [Link]
12. J. Sauro, J.R. Lewis, Correlations among prototypical usability metrics: evidence for the con-
struct of usability. in Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems (ACM, 2009, April), pp. 1609–1618
13. R. Jain, D. Gupta, A. Khanna, Usability feature optimization using MWOA, in International Conference on Innovative Computing and Communications, ed. by S. Bhattacharyya, A. Hassanien, D. Gupta, A. Khanna, I. Pan. Lecture Notes in Networks and Systems, vol. 56 (Springer, Singapore, 2019)
14. S. Kapoor, K. Sagar, B.V.R. Reddy, Speedroid: a novel automation testing tool for mobile apps, in International Conference on Innovative Computing and Communications, ed. by S. Bhattacharyya, A. Hassanien, D. Gupta, A. Khanna, I. Pan. Lecture Notes in Networks and Systems, vol. 56 (Springer, Singapore, 2019)
Prediction and Estimation of Dominant
Factors Contributing to Lesion
Malignancy Using Neural Network

Kumud Tiwari, Sachin Kumar and R. K. Tiwari

Abstract Cancer is among the major causes of death worldwide. An infected region experiences uncontrollable growth of cells, resulting in unstoppable growth of protrusions or lesions. Lesions are categorized as benign or malignant. Imaging techniques have gained prominence over the last two decades in the diagnosis and detection of cancer cells. Automated classifiers could substantially upgrade the diagnosis process, in terms of both time consumption and accuracy, by automatically distinguishing benign and malignant patterns. This paper presents a statistical analysis combined with an artificial neural network (ANN) tool for early detection of disease; the problem addressed is that of breast lesions, however, the same approach can be applied to any category of lesion appearing in any region of the body. A statistical analysis of a sample of 699 records was carried out to establish the dependence on selected microscopic attributes having a higher percentage contribution to cell malignancy. Further, a technique for classifying breast lesions into benign and malignant categories using an ANN is presented, which achieved sensitivity, specificity, and classification accuracy of 96.94%, 98.75%, and 97.70% for the complete set of nine microscopic attributes, and 96.70%, 96.72%, and 96.68%, respectively, for the selected microscopic attributes having a higher percentage contribution to cell malignancy. The reduction in features resulted in a smaller number of epochs and hence a reduction in the processing time for identifying the infection, enabling earlier detection.

Keywords Breast cancer · ANN · Lesion · Dominant attributes · Epoch

K. Tiwari (B) · S. Kumar


Amity School of Engineering and Technology, Amity University, Lucknow, India
e-mail: kumud.1992@[Link]
S. Kumar
e-mail: skumar3@[Link]
R. K. Tiwari
Department of Physics and Electronics, Dr. RML Avadh University, Faizabad, India


1 Introduction

Cancer as predicted by World Health Organization has become one of dominant


worldwide killer disease, proving fatal if detected in later stages. Breast lesion is
one of the most common lesion among females and if malignant, is deadly cause of
death among women [1]. When the cell tissues of the breast becomes abnormal and
start to uncontrollably divide then breast lesion occurs. These abnormal cells form
large lump of tissues, which consequently becomes a lesion and hence tumor. In fact,
breast lesion affects approximately 10% of all women at some period of their life.
In India, every 4 min one woman is diagnosed with breast lesion and every 8 min
one woman dies because of breast lesion. In India, we have been witnessing more
and more numbers of patients that are being diagnosed with breast cancer at a very
young age groups (in their thirties and forties) as shown in Fig. 1 [2].
Early detection plays a vital part in the treatment of breast lesions, as it helps to save patients' lives; screening mammography is considered the most effective tool. Nevertheless, mammogram images are at times very difficult to read due to the large variation in shape and size of lesions and because of their low image contrast [3]. Automatic prediction of breast lesions has been proposed by many researchers. Artificial neural networks (ANNs), which are automated classifiers, can be used by radiologists for pattern recognition and a variety of data classification tasks, which makes them a promising classification tool in breast lesion detection [4]. In an ANN, a set of tested images is first given to the neurons for training. The backpropagation algorithm is employed to train the units: the network output is matched against the ideal output, and the error signal so produced propagates in the reverse direction. The error is minimized by adjusting the weights [5, 6], and this process continues until the error approaches zero. An ANN is structured in layers: it comprises an input layer, through which information is given to the topology, which further connects to at least one hidden layer and an output layer, as shown in Fig. 2.
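To make this training procedure concrete, the following is a minimal single-hidden-layer network trained by backpropagation on a toy binary task; the layer sizes, sigmoid activation, learning rate, and random data are illustrative choices, not the configuration used later in this paper.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 4))                                # toy inputs: 8 samples, 4 features
y = (X.sum(axis=1) > 2).astype(float).reshape(-1, 1)  # toy binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assign approximate (random) starting values to the weights
W1 = rng.normal(0.0, 0.5, (4, 5))   # input -> hidden
W2 = rng.normal(0.0, 0.5, (5, 1))   # hidden -> output

for epoch in range(2000):
    # Feed-forward propagation
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    # Backpropagate the error signal in the reverse direction
    err_out = (out - y) * out * (1 - out)        # output-layer delta
    err_h = (err_out @ W2.T) * h * (1 - h)       # hidden-layer delta
    # Update the weight values to minimize the error
    W2 -= 0.5 * (h.T @ err_out)
    W1 -= 0.5 * (X.T @ err_h)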
ANNs find many application areas in MRI, mammography, IR imaging, and ultrasound for early detection of breast lesions. Image features can be differentiated in various aspects, such as color, texture, spatial relations, and shape; selecting different data features from images leads to diversified classification decisions. Classification methods primarily consist of three types: first, rule-based methods, such as decision trees and rough sets; second, statistics-based methods, such as support vector machines; and third, artificial neural networks [7].

Fig. 1 Trend of breast cancer in India based on age (cases across age groups 20–30 to 60+, 25 years back versus presently)

Fig. 2 Structure of artificial neural network
The paper presents a statistical analysis to establish the dependence of benign or malignant status on nine microscopic attributes and to identify the dominant or significant ones: uniformity of cell shape, uniformity of cell size, marginal adhesion, clump thickness, bland chromatin, bare nuclei, single epithelial cell size, normal nucleoli, and mitoses. It also presents an artificial neural network technique for the classification of lesions. The ANN algorithm was examined on a dataset of 699 samples, both for all nine microscopic attributes and for the significant or dominant attributes only. The paper is arranged as follows: Sect. 1 presents the introduction, Sect. 2 presents related work, Sect. 3 explains the proposed methodology for the estimation of dominant factors and the implementation of the ANN for lesion detection, Sect. 4 presents the results of the statistical analysis and the sensitivity, specificity, and classification accuracy achieved using the ANN tool, and finally the conclusion is presented.

2 Literature Review

In India, rural population have been devoid of basic, proper and timely medical care.
The author proposed a system which provides instant medical attention to remote
patients. The system analyzes the symptoms of the well-known diseases and pre-
scribes precise medicine within milliseconds and can improve the current medical
situation [8]. Breast cancer represents 25% of all cancer cases in world [9]. For past
few decades, many new imaging techniques have developed for diagnosis of breast
cancer which assist radiologist in highlighting suspicious areas as it helps them to
find a lesion that cannot be spotted by naked eye. As technology is improving at a
very rapid rate, many researchers are concerned about the development of intelligent

techniques that can be used for the detection of lesions with improved classification accuracy. The decision tree is a significant approach used for the prediction of lesions; it is an easy and efficient classification method, whose important paradigm is the ability to diversely classify the primary attributes of a given case study [10]. On the basis of research outcomes, the artificial neural network has been demonstrated to be a good classifier for malignant lesions in mammography, where the implementation of a three-layer neural network using the backpropagation algorithm was a pioneering effort [11]. Different ANNs have been developed based on the concept of decreasing false-negative and false-positive detections and increasing the true-positive detection rate for an optimum outcome. Wavelet-based ANN implementations, such as the biorthogonal spline wavelet ANN, the particle-swarm-optimized wavelet NN, and Gabor wavelet ANNs, have improved the specificity and sensitivity acquired in microcalcification and mass detection. For biomedical applications such as determining breast cancer malignancy, computer-aided detection (CAD) frameworks utilize biomedical images, mostly because they have non-radiation properties, low cost, high availability, and fast results with high accuracy. For the detection of breast cancer, an improved version using ultrasound images has been introduced lately, which works on 3-D ultrasound imaging that provides more insight into the lesion relative to orthodox 2-D imaging [12]. A new hybrid method [13] has been developed for the detection and classification of breast lesions by combining two ANN techniques; it has been found that a two-phase hierarchical NN gives better results.
There are many different techniques for the detection and classification of breast lesions that utilize breast cancer imaging, depending on the input attributes. The ANN technique has been used widely for cancer cell investigation in medical examinations. Many NN models have been utilized to enhance the detection and classification of breast lesions; they are trained with previous cases that were diagnosed correctly by clinicians [13, 14] or with mass characteristics such as size, shape, margins, granularity, or signal intensity. In 2012, multistate cellular neural networks (CNNs) were used in biomedical image segmentation to evaluate fat content by estimating the density of breast regions. A hybrid model consisting mainly of an SVM and a PCNN was introduced for the identification of breast lesions from MR images [14]. Another hybrid algorithm, combining the perceptron with SIFT, was presented for the detection of breast lesions [15]. Different clustering algorithms [16] were used for segmenting nuclei in fine-needle biopsy microscopic images; topological, texture, and morphological features were mostly used for classifier training on 500 images from 50 patients, achieving an accuracy between 84 and 93%.

3 Proposed Methodology

The methodology employed processes the labeled sample data simultaneously for identifying dominant features and detecting infected cells. In one leg, the data is statistically evaluated to establish the dominant factors responsible for lesion malignancy. The samples obtained are processed on a scale of 1–10, with 10 representing the highest weight. The attributes analyzed for dominance include uniformity of cell shape, uniformity of cell size, marginal adhesion, clump thickness, bland chromatin, bare nuclei, single epithelial cell size, normal nucleoli, and mitoses. The percentage contribution of each attribute is evaluated to assess the dominant features impacting malignancy and the features that may indicate a benign condition of the cell. Further, the correlation coefficient is estimated for factors that are dominant for malignant and benign cells and for factors that are not dominant, and the significance value is obtained to establish the potential of the dominant attributes.
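As a sketch of this statistical leg, the following pandas snippet, under the assumption that the samples sit in a DataFrame with a "class" column and attribute columns already on the 1–10 scale, computes the per-attribute percentage contribution as the attribute sum relative to the maximum possible total of 10 per sample in each class; this normalization reproduces the reported figures (e.g., 1354/(458 × 10) ≈ 29.56% for clump thickness in benign cases).

import pandas as pd

def percent_contribution(df, attributes, class_col="class"):
    """Percentage contribution of each 1-10 scaled attribute, split by class.

    The attribute values are summed per class and expressed relative to the
    maximum possible total (10 points per sample), as described in the text.
    """
    result = {}
    for label, group in df.groupby(class_col):
        max_total = 10 * len(group)
        result[label] = {a: 100 * group[a].sum() / max_total for a in attributes}
    return pd.DataFrame(result).round(2)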
Simultaneously, identification of infected cells is achieved through the ANN tool: the data is processed with reference to the standard acceptable ranges of the microscopic attributes to classify samples as malignant or benign. The technique is depicted as a flow graph in Fig. 3. The data is first subjected to pre-processing and cleaning for data restoration; the processed data is then fed to the ANN tool in two layers, the first consisting of all nine microscopic attributes and the second consisting of the data for the significant or dominant attributes only. The next phase involves the selection of hidden layers, followed by the conjugate gradient backpropagation algorithm for classification.

Fig. 3 Generalized flow graph for identification of dominant factors (labeled sample data → cleaning to remove noise and preprocessing to neglect false contouring → in parallel: statistical evaluation of percentage contributions and correlation coefficients of probable dominant parameters, and feeding of the nine-attribute and significant-attribute data to the ANN tool with selection of hidden layers and conjugate gradient backpropagation training → estimation of classification rate, sensitivity, and specificity → identification of significant parameters for early detection of malignancy and classification as malignant or benign cells)



The ANN tool is calibrated to the acceptable ranges of the microscopic features to classify data as malignant or benign. The estimated values obtained from the confusion matrix are processed to obtain the sensitivity, specificity, and classification accuracy for both layers. The results are further compared with the number of epochs required for processing; the processing time reduces as the number of epochs reduces.

4 Results

Validation of the results was performed on the database obtained from the University of California at Irvine (UCI). A total of 699 samples were available, each with nine microscopic attributes: uniformity of cell shape, uniformity of cell size, marginal adhesion, clump thickness, bland chromatin, bare nuclei, single epithelial cell size, normal nucleoli, and mitoses. These attributes were first normalized on a scale of 1–10, with 10 representing the highest weight. The individual normalized attributes were summed, and the percentage contribution of each attribute was evaluated against the total population of the sample, segregated separately into benign and malignant cases; the data is represented in terms of percentages in Table 1. From the table, it can be seen that the percentage contributions of uniformity of cell size, single epithelial cell size, and bland chromatin were higher for benign cases than those of the other attributes, so they can be regarded as the more dominant attributes there, and a faster analysis can result from close monitoring of them. Similarly, the percentage contributions of uniformity of cell size, uniformity of cell shape, and bare nuclei were higher for malignant cases than those of the other attributes, so they can be regarded as the more dominant attributes for malignancy. The process has, however, been tested only on a limited sample of 699 records, and percentage differences of more than 5% have been accepted as distinguishable; considering that early detection can save lives, identifying the dominant factors can go a long way, and further work on a larger sample is proposed as future work. The individual attributes were further processed to estimate the correlation coefficient and hence the significance value. The attributes that do not play a significant role in characterizing a cell as benign or malignant are marginal adhesion, normal nucleoli, and mitoses; the attributes playing a significant role in characterizing cells as benign are uniformity of cell size, single epithelial cell size, and bland chromatin; and the attributes responsible for a sample being malignant are uniformity of cell size, uniformity of cell shape, and bare nuclei. The degrees of freedom (df), sum of squares (SS), mean square (MS), F value, and significance F values have been estimated and are depicted in Table 2. The regression analysis showed that the significance value (F) decreases from non-dominant to dominant parameters as follows: 1.176E−146 for the microscopic attributes not playing a significant role is greater than 1.2E−191 for the microscopic attributes significant for a sample being benign, which in turn is greater than 3.5E−244 for the microscopic attributes significant for a sample being malignant.
Table 1 Percentage contribution of individual microscopic attributes

Attribute                   | Benign (458 cases): total | Benign: % | Malignant (241 cases): total | Malignant: %
Clump thickness             | 1354 | 29.56 | 1734 | 71.95
Uniformity of cell size     | 607  | 26.46 | 1584 | 65.72
Uniformity of cell shape    | 661  | 14.43 | 1581 | 65.33
Marginal adhesion           | 625  | 13.65 | 1337 | 55.48
Single epithelial cell size | 971  | 21.20 | 1277 | 52.99
Bare nuclei                 | 624  | 13.62 | 1834 | 76.09
Bland chromatin             | 962  | 21.00 | 1441 | 59.79
Normal nucleoli             | 591  | 12.90 | 1413 | 58.63
Mitoses                     | 487  | 10.63 | 624  | 25.89

Total = total numeric standing of the attribute

Table 2 Regression estimate

Cases                                                              | Source     | df  | SS     | MS     | F value | Significance F
Non-significant microscopic attributes                            | Regression | 3   | 393.39 | 131.13 | 382.54  | 1.176E−146
                                                                   | Residual   | 695 | 238.24 |        |         |
Significant microscopic attributes for the sample to be benign    | Regression | 3   | 454.83 | 151.61 | 595.98  | 1.2E−191
                                                                   | Residual   | 695 | 176.79 | 0.2543 |         |
Significant microscopic attributes for the sample to be malignant | Regression | 3   | 506.81 | 168.93 | 940.64  | 3.5E−244
                                                                   | Residual   | 695 | 124.82 | 0.1795 |         |

Though these significance values are not very close to 0.5, a general trend indicates that infected cases depend more strongly on a certain set of microscopic attributes than benign cases do; if the investigation is focused on these significant attributes, early detection becomes possible.
Two layers of data are obtained, the first being the normalized set of all nine microscopic attributes and the second the data for the dominant attributes; both are fed to the artificial neural network tool. The ANN works on the principle of backpropagation: as a first step, approximate values are assigned to the weights; the second step involves feed-forward propagation; the third step is backpropagation of the errors; and the fourth step updates the weight values.
The ANN analysis was first performed on the 699 entries obtained from the University of California at Irvine (UCI) using all nine microscopic attributes. To optimize training performance, the ANN divides the data into training, validation, and testing sets in the ratio 7:1.5:1.5. The number of hidden neurons was set to ten, as a further increase in neurons resulted in increased latency; training of the samples was achieved through the conjugate gradient backpropagation algorithm. The best validation performance achieved was 0.031 at epoch 16, as depicted in Fig. 4.
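A rough scikit-learn equivalent of this setup is sketched below; note that MLPClassifier does not offer conjugate-gradient training, so the L-BFGS solver is used as a stand-in, the 7:1.5:1.5 split is approximated with two calls to train_test_split, and the randomly generated X and y are placeholders for the UCI data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(699, 9)).astype(float)  # placeholder for the 699 UCI samples
y = rng.integers(0, 2, size=699)                      # placeholder benign/malignant labels

# Approximate the 7:1.5:1.5 train/validation/test split
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.70, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=1)

# Ten hidden neurons as in the paper; lbfgs stands in for conjugate gradient
clf = MLPClassifier(hidden_layer_sizes=(10,), solver="lbfgs", max_iter=500)
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
print("test accuracy:", clf.score(X_test, y_test))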
Figure 5 displays the confusion matrix from which the predictive attributes of the model have been derived. The sensitivity, specificity, and classification accuracy of the proposed model on the nine microscopic attributes are 96.94%, 98.75%, and 97.70%, respectively, as evaluated from the values obtained from the confusion matrix. The significant factors identified by the statistical analysis for malignant cases were then taken as input and fed to the ANN tool; the best validation performance of 0.0529 was achieved at epoch 6, as depicted in Fig. 6, and the corresponding confusion matrix is depicted in Fig. 7. The sensitivity, specificity, and classification accuracy of the proposed model on only the significant microscopic attributes are 96.70%, 96.72%, and 96.68%, respectively, as depicted in Table 3.

Fig. 4 Variation in MSE with respect to number of iteration for nine microscopic attributes

Fig. 5 Confusion matrix for nine microscopic attributes



Fig. 6 Variation of MSE with respect to number of iterations for significant attributes

Fig. 7 Confusion matrix for identified dominant microscopic attributes



Table 3 Performance measure

                                                               | Classification rate           | Sensitivity  | Specificity
                                                               | (TP + TN)/(TP + TN + FP + FN) | TP/(TP + FN) | TN/(TN + FP)
Performance measure for nine microscopic parameters (%)        | 97.70                         | 96.94        | 98.75
Performance measure for significant microscopic parameters (%) | 96.70                         | 96.72        | 96.68

The statistical analysis of all nine microscopic attributes versus the significant attributes showed that the segregated significant attributes have a higher percentage contribution than the other attributes, which are termed non-significant. The ANN tool was run on both types of data; the results showed only a minor loss of sensitivity, specificity, and classification accuracy, while the number of epochs reduced from 16 to 6, suggesting that earlier detection is possible because the number of attributes to be evaluated has been reduced. In Table 3, true positive (TP), true negative (TN), false negative (FN), and false positive (FP) are used for the performance measures.
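The Table 3 measures follow directly from the confusion-matrix counts of Figs. 5 and 7; a small helper, assuming binary labels with 1 denoting malignant, is sketched below.

from sklearn.metrics import confusion_matrix

def performance_measures(y_true, y_pred):
    """Classification rate, sensitivity and specificity from TP, TN, FP, FN."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    classification_rate = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    return classification_rate, sensitivity, specificity

# Toy example: 1 = malignant, 0 = benign
print(performance_measures([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))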

5 Conclusion

Cancer is potentially one of the major contributors to death as a non-communicable disease. Although best practices such as eating healthily and daily workouts can help in maintaining health and preventing many deadly conditions, including cancer, regular checkups are still recommended, and early detection can play a vital role in preventing a number of casualties. This paper presents an ANN-tool-based detection of carcinogenic conditions in the breast, supported by a statistical analysis to identify the dominant attributes contributing to malignancy, thereby reducing the number of attributes and the processing time for identification. The test samples used were obtained from the Wisconsin Breast Cancer dataset of the University of California at Irvine (UCI), which has 699 entries divided into benign and malignant cases. The statistical analysis showed that the percentage contributions of the attributes uniformity of cell size, uniformity of cell shape, and bare nuclei were higher for malignant cases than those of the other attributes, which was confirmed by the significance value (F). The sensitivity, specificity, and classification accuracy of the ANN model on the nine microscopic attributes are 96.94%, 98.75%, and 97.70%, respectively. The significant factors identified by the statistical analysis for malignant cases were then taken as input attributes, and the estimated sensitivity, specificity, and classification accuracy achieved were 96.70%, 96.72%, and 96.68%, respectively. The results showed a minor loss of sensitivity, specificity, and classification accuracy; however, since the number of attributes can be reduced,

early detection becomes possible, as the number of attributes to be evaluated is reduced. This was confirmed by the reduction in the number of epochs from 16 to 6 when moving the ANN analysis from all nine microscopic attributes to the dominant attributes, suggesting that earlier detection is achievable.

References

1. R.L. Siegel, K.D. Miller, A. Jemal, Cancer statistics: A cancer journal for clinicians (2015)
2. Breast Cancer India, [Link]
3. L. Hadjiiski, B. Sahiner, M.A. Helvie, et al., Breast masses: Computer-aided diagnosis with
serial mammograms. Radiology. 240(2), 343–356 (2006)
4. F.A. Cardillo, A. Starita, D. Caramella, A. Cillotti, A neural tool for breast cancer detection
and classification in MRI. in Proceedings of the 23rd Annual International Conference of the
IEEE Engineering in Medicine and Biology Society, vol. 3 (2011) pp. 2733–2736
5. T. Jayaraj, V. Sanjana, V.P. Darshini, A review on neural network and its implementation on
Breast cancer detection. IEEE (2016)
6. P.S. Pawar, D.R. Patil, Breast Cancer detection using neural network model. IEEE (2013)
7. A.A. Tzacheva, K. Najarian, J.P. Brockway, Breast cancer detection in gadolinium-enhanced
MRI images by static region descriptors and neural networks. J. Magn. Reson. Imaging 17(3),
337–342 (2003)
8. V.D. Khairnar, et al., Primary healthcare using artificial intelligence. in International conference
on innovative computing and communication (2018) pp. 243–251
9. L.N. Shulman, W. Willett, A. Sievers, F.M. Knaul, Breast cancer in developing countries:
Opportunities for improved survival. J. Oncol. (2010)
10. N.M. Lutimath, et al., Regression analysis for liver disease using R: A case study. in
International conference on innovative computing and communication (2018) pp. 421–429
11. A.E. Hassanien, N. El-Bendary, Breast cancer detection and classification using support vector
machines and pulse coupled neural network. in Advances in Intelligent Systems and Computing,
(Springer, Berlin, Germany, 2013), pp. 269–279
12. G. Ertas, D. Demirgunes, O. Erogul, Conventional and multi-state cellular neural networks in
segmenting breast region from MR images: Performance comparison. in Proceedings of the
International Symposium on Innovations in Intelligent Systems and Applications (INISTA ’12)
(2014) pp. 1–4
13. J. Dheeba, N.A. Singh, S.T. Selvi, Computer-aided detection of breast cancer on mammograms:
A swarm intelligence optimized wavelet neural network approach. J. Biomed. Inform. 49, 45–52
(2014)
14. T. Balakumaran, I.L.A. Vennila, C.G. Shankar, Detection of micro-calcification in mammo-
grams using wavelet transform and fuzzy shell clustering. Int. J. Comput. Sci. Info. Technol.
7(1), 121–125 (2010)
15. A.M. ElNawasany, A.F. Ali, M.E. Waheed, A novel hybrid perceptron neural network algorithm
for classifying breast MRI tumors. in Proceedings of the International Conference on Advanced
Machine Learning Technologies and Applications (Cairo, Egypt, 2014) pp. 357–366
16. M. Kowal, P. Filipczuk, A. Obuchowicz, J. Korbicz, R. Monczak, Computer-aided diagnosis
of breast cancer based on fine needle biopsy microscopic images. Comput. Biol. Med. 43(10),
1563–1572 (2013)
Prediction of Tuberculosis Using
Supervised Learning Techniques Under
Pakistani Patients

Muhammad Ali and Waqas Arshad

Abstract Tuberculosis (TB) is a long-lasting malady transmitted by people, cows, birds, etc.; it can impact practically all parts of the human body, yet the vast majority of cases are found as lung disease. Tuberculosis is caused by a bacterium, namely Mycobacterium. To identify a tuberculosis patient, the mycobacterium must be found in the sputum. For this purpose, a special culture is prepared in which the Mycobacterium tuberculosis (MTB) microbes are multiplied, and this entire procedure requires a period of around one and a half months. Additionally, the success of tuberculosis treatment is critical for patients because of its lengthy treatment period. It was anticipated that the malady would be brought under control, but in developing nations it is still a major issue. There is a need for mechanized, timely forecasting of tuberculosis, which in turn needs additional datasets of the disease and more comparative investigation on these datasets. In our investigation, we propose a new dataset from a low-socioeconomic-status country, Pakistan, and show how this dataset can be utilized for the timely forecasting of tuberculosis (TB).

Keywords Data mining (DM) · Mycobacterium tuberculosis (MTB) · Decision tree (DT) · Weka · Supervised learning

1 Introduction

Data mining (DM) is a huge field that associates with other areas such as artificial intelligence (AI), machine learning (ML), statistics, and databases for the exploration of enormous quantities of data using specific procedures. Data mining is a procedure for finding valuable material/patterns from known, bulky data records.

M. Ali (B)
Department of Computer Science, The Superior College, Lahore, Pakistan
e-mail: muhammedbwn@[Link]
W. Arshad
Department of Computer Science, University of Lahore, Lahore, Pakistan
e-mail: [Link]@[Link]


Transformation of, and recovery of valuable information from, enormous quantities of data is essential, as it is expected that the quantity of data doubles roughly every three years; this is why the field of data mining has become so important [1]. Data mining is used in diverse areas such as education, market basket analysis (MBA), weather forecasting, customer relationship management (CRM), detection of fraud in different transactions, prediction of air pollution [2], forest fire detection [3], and customer churning behavior [4]; it also assists medical consultants in reaching precise conclusions for diverse maladies, e.g., breast tumor [5] and cardiovascular [6] diseases. Tuberculosis (TB), once expected to be under control, is turning into a genuine problem for the world once more. Tuberculosis is highly transmittable through cows, fowl, and people (who are already patients of tuberculosis). Tuberculosis can disturb almost all human body organs but generally affects the lungs; in some cases, TB is co-infected with HIV/AIDS [1]. Generally, formerly colonized and low-financial-status nations such as Pakistan are under its genuine threat. As indicated by the World Health Organization, Pakistan has the fifth most TB patients worldwide, and this is the inspiration for this research.
Different methods such as GeneXpert, skin tests, sputum-smear microscopy, blood tests, culture, and radiography exist for the detection of TB, but all of the above-mentioned tests have some restrictions. For example, GeneXpert is as accurate as the culture method, but it incurs high expenses (about ten times the cost of the culture method) and is, therefore, problematic for individuals (specifically low-income people) in Pakistan [7]. The culture method needs phlegm (a thick, viscous substance), a type of mucus that contains the disease germs, and the outcomes of the culture test can take 4–6 weeks [8, 9]. All these concerns point straight to the necessity of mechanized methods for the timely identification of tuberculosis. Mechanized/automated methods also assist health care specialists in choosing whether to begin the cure of a suspected patient or to wait for the results of the patient's medical tests. This study aims to contribute a new dataset, from one of the nations most affected by the corresponding malady, for the forecasting of tuberculosis in Pakistan, and we show the utilization of the proposed dataset and how it can be helpful for the classification of TB.

2 Literature Review

In this section, we present a technical analysis of existing studies on proficient methods for the forecasting of TB. The authors of [1] used an adaptive network-based fuzzy inference system (ANFIS), a multilayer perceptron, and PART for predicting TB before the medical examination; the root-mean-square errors of these models were 18%, 19%, and 20%, respectively. The author of [7] sketched a cross-sectional investigation into the detection of TB, i.e., pulmonary and extra-pulmonary TB, conducted at Jinnah Hospital Lahore, Pakistan. He selected 2200 samples of patients with suspected pulmonary TB and tested them using three methods, i.e., the culture method, the GeneXpert test, and Ziehl-Neelsen staining. GeneXpert provides genuinely high sensitivity with respect to rifampicin, and the results of GeneXpert were fairly equivalent to those of the

culture test, with a shorter time span. Notwithstanding, the expense of the GeneXpert test is ten times higher than the culture test and likewise twenty times higher than smear microscopy, making it unaffordable for the people of Pakistan, a populated and low-financial-status country. The authors of [9] used six algorithms, i.e., SVM, decision tree (DT), logistic regression, artificial neural network (ANN), radial basis function, and Bayesian network, for building models, and the investigation showed that the decision tree C4.5 provides the best outcomes, with an accuracy of 74.21%. 6450 patient records were used in the research, with the features: age, nationality, sex, area of residence, weight, current stay in prison, TB type, length, case type, diabetes, treatment category, recent TB infection, low body weight, HIV, imprisonment, and drug resistance. Hussainy et al. [10] used a decision tree algorithm for the forecasting of TB; antigens were used for quantifying the microbes in the blood by means of a multiplex microbeads immunoassay (MMIA), which records different levels of antigens, and the MFI is used to measure the value/quantity of antigens against the bacteria. Gerg and Rupal [11] projected a new framework: in their research, they initially made two groups using the K-means method and thereafter applied principal component analysis (PCA) to both groups for feature extraction. After feature extraction, feature optimization of these groups was done using a genetic algorithm (GA), and the classifier was built using a neural network (NN) algorithm. They proposed that better optimized features could be expected if independent component analysis (ICA) were applied in place of PCA. Asha et al. [12] anticipated a hybrid method for the recognition of TB. They utilized 700 patient records gathered from a state hospital and treated the raw data in a hybrid manner by clustering them into two groups, i.e., pulmonary tuberculosis and retroviral pulmonary tuberculosis. Their projected model achieved 98.7% accuracy using a support vector machine, compared among C4.5, K-NN, AdaBoost, Naïve Bayes, Bagging, and Random Forest. The authors of [13] used the ANFIS algorithm for constructing a classifier with 503 instances having 30 features and obtained 97% accuracy. Rusdah et al. [14] utilized an ensemble technique with two algorithms, i.e., SVM and Bagging, and obtained 70.41% accuracy using the mentioned attributes; compared with a single classifier/model, the ensemble technique produces better results. Tracey et al. [15] and Moedomo et al. [16] used laboratory-free approaches, i.e., sound and images, as the key attributes for the identification of TB. Kalhori and Zheng [17] showed that logistic regression and SVM yield the best results among six classifiers. The author of [18] made a survey covering 2007 to 2013, including different papers, and proposed that SVM is one of the best algorithms among them.

3 Methodology

In this research, data has been gathered from a local hospital, i.e., Gulab Davi Hospital [19], after obtaining the necessary approvals from the authorities of the medical clinic. We transferred the raw data (medical records) from hard copy to MS Excel. The dataset contains nine attributes, chosen in view of their importance and significance in detecting tuberculosis. Descriptions and details of the selected features and their types are stated in Table 1.

Table 1 List of attributes with description

S. No. | Attribute       | Description                                                | Type
1      | Gender          | Male → M, Female → F                                       | N
2      | Age             | 1–18: 1, 18–24: 2, 24–35: 3, 35–40: 4, 40–48: 5, 48–60+: 6 | N
3      | Weight loss     | Yes, No                                                    | N
4      | Marital status  | Single → S, Married → M, Other → O                         | N
5      | TB in family    | No: 0, Yes: 1                                              | N
6      | TB history      | No: 0, Yes: 1                                              | N
7      | Smear-results   | 1 −ve: possibly not infection; 0: weakly positive; 1 +ve: moderate positive; 2 +ve: moderate positive; 3 +ve: strongly positive; 4 +ve: strongly positive | N
8      | X-Ray-Results   | Positive: 1, Negative: 0                                   | N
9      | Class attribute | Tuberculosis status: Yes; tuberculosis status not true: No | C

N = nominal attribute, C = class attribute

3.1 Data Cleaning and Transformation

Data pre-processing is one of the core and fundamental procedures in the learning process. Different types of pre-processing methods are available, i.e., data cleaning, missing-value handling, data transformation, data reduction/resizing, and outlier detection. Clustering, generalization, data smoothing, binning, and discretization of data are types of data transformation. Binning is used to diminish the size of the data by transforming numerical data into interval/categorical form. In this research, the age feature is binned into six bins, as detailed in Table 1. Cleaning of information incorporates dealing with missing values and finding and eliminating noise and outliers from the data. A few causes which may lead to missing values in a dataset are as follows:
• The obtained dataset was not gathered for mining.
• Errors or lapses.
• Typing mistakes.
For taking care of the missing values, diverse techniques are available, such as ignoring the records, manually entering the missing values one by one, or filling in the value with the most repeated value, the mean/median, or a global constant. In the proposed dataset, two attributes, named XRay-results and Smear-results, have some missing values. For dealing with these missing values, we chose two strategies: at first we replaced the missing values with the mean/mode, and in the second strategy we eliminated the instances having missing values (both strategies are sketched below).
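A pandas sketch of the age binning from Table 1 and of both missing-value strategies follows; the file name is a placeholder for the transcribed hospital records, while the column labels mirror Table 1.

import pandas as pd

df = pd.read_csv("tb_patients.csv")   # placeholder path for the transcribed records

# Bin age into the six intervals of Table 1
bins = [0, 18, 24, 35, 40, 48, float("inf")]
df["Age"] = pd.cut(df["Age"], bins=bins, labels=[1, 2, 3, 4, 5, 6])

# Strategy 1: replace missing values with the mode of each affected attribute
filled = df.copy()
for col in ["Smear-results", "X-Ray-Results"]:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

# Strategy 2: eliminate the instances that contain missing values
dropped = df.dropna(subset=["Smear-results", "X-Ray-Results"])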

3.2 Environmental Setup

The data mining tool WEKA was developed at the University of Waikato [20] in New Zealand using the Java language. It is a state-of-the-art tool for performing different data mining tasks, i.e., data pre-processing, regression, clustering, classification, and association rule mining, applied directly to a dataset. WEKA also provides a variety of visualization tools [21]. In Weka, mostly the ARFF and CSV file formats are used.

4 Results

Table 2 reports the outcomes when the missing values are handled by replacing them with the mean/mode, while Table 3 reports the outcomes after disqualifying the records having missing values in the dataset. The outcomes show that replacing the missing values with the mean achieved no improvement in results compared with simply eliminating the records, except for the Naïve Bayes classifier, which produced slightly higher results. The likely reason is that the proportion of affected instances is very small (5.3%, just 32 records out of 597, with missing values in two attributes), or that the mean method (replacing the missing values with the mean) would presumably not produce values very close to the real values in the dataset. Figure 1 shows the decision tree (DT) of the data when the dataset contains no missing values.

Table 2 Accuracy of models (597 instances) with the missing values handled (using mean/mode)

| Algorithm name | Correctly classified (%) | Incorrectly classified (%) |
|----------------|--------------------------|----------------------------|
| Support vector machine | 79.4 | 20.6 |
| Logistic regression | 79.1 | 20.9 |
| Naive Bayes | 79.1 | 20.9 |
| C5.0 | 81.9 | 18.1 |

Table 3 Accuracy of models with no missing values (instances with missing values ignored)

| Algorithm name | Correctly classified (%) | Incorrectly classified (%) |
|----------------|--------------------------|----------------------------|
| Support vector machine | 81.10 | 19.90 |
| Logistic regression | 80.85 | 19.15 |
| Naive Bayes | 77.18 | 22.83 |
| C5.0 | 84.07 | 15.93 |
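The paper runs these models in WEKA; for readers who prefer scripting, a rough
scikit-learn analogue is sketched below. The data here is a random placeholder, and a
plain decision tree stands in for C5.0, which scikit-learn does not ship:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: 597 instances, 8 encoded nominal attributes, binary class.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(597, 8))
y = rng.integers(0, 2, size=597)

models = {
    "Support vector machine": SVC(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision tree (C5.0 stand-in)": DecisionTreeClassifier(),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: {acc:.1%} correctly classified")
```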

Fig. 1 The decision tree built on the dataset with no missing values

5 Conclusion

The fundamental aim of this study is to contribute a new dataset to the community
for one of the most alarming diseases, particularly in a society where the infection is
a genuine threat. It also shows how this dataset can be utilized for forecasting
tuberculosis (TB), which might support healthcare specialists in choosing whether to
begin the treatment of a suspected person or to wait for the medical reports of the TB
patients. The dataset consists of 565 patients, each having nine features. In general,
C5.0 achieved higher accuracy than the other three algorithms, i.e., SVM, logistic
regression, and Naïve Bayes. We believe these results will add a more significant
perception of forecasting tuberculosis status. Our proposed research work can be
extended by utilizing other data mining techniques, and results may be improved
utilizing other distinctive data cleaning strategies.

Acknowledgements Since the investigation utilized secondary data, we did not have individual
contact with the patients. The study is purely dedicated to academic purposes. Ethical clearance
was acquired from the hospital, and this investigation is solely for public benefit.

References

1. K.R. Lakshmi, M. Veera Krishna, S.P. Kumar, Utilization of data mining techniques for pre-
diction and diagnosis of tuberculosis disease survivability. I. J. Mod. Educ. Comput. Sci. 8,
8–17 (2013)
2. M. Yadav, S. Jain, K.R. Seeja, Prediction of air quality using time series data mining. in S.
Bhattacharyya, A. Hassanien, D. Gupta, A. Khanna, I Pan (eds) International Conference on
Innovative Computing and Communications. Lecture Notes in Networks and Systems, vol. 56.
(Springer, Singapore, 2019)
3. V. Dubey, P. Kumar, N. Chauhan, Forest fire detection system using IoT and artificial neural
network. in S. Bhattacharyya, A. Hassanien, D. Gupta, A. Khanna, I. Pan (eds) International
Conference on Innovative Computing and Communications. Lecture Notes in Networks and
Systems, vol. 55. (Springer, Singapore, 2019)
4. M. Ali, A.U. Rehman, H. Shamaz, Prediction of churning behavior of customers in tele-
com sector using supervised learning techniques, pp. 143–147. [Link]
5. M.F. bin Othman, T.M.S. Yau, Comparison of different classification techniques using WEKA
for breast cancer. IFMBE Proc. 15, 520–523 (2007)
6. A.U. Rehman, A. Aziz, Detection of cardiac disease using data mining classification techniques.
(IJASCA) 8(07) (2017)
7. M. Saeed, S. Iram, S. Hussain, A. Ahmed, M. Akbar, M. Aslam, GeneXpert: A new tool for
the rapid detection of rifampicin resistance in mycobacterium tuberculosis. J. Pak. Med. Assoc.
67(2) (2017)
8. D. Nagabushanam, N. Naresh, A. Raghunath, K. Parveen Kumar, Prediction of tuberculosis
using data mining techniques on Indian Patient’s Data. IJCST 4(4) (2013)
9. B.C. Lakshmanan, V. Srinivasan, C. Ponnuraja, Data mining with decision tree to evaluate the
pattern on effectiveness of treatment for pulmonary tuberculosis: A clustering and classification
techniques. (SCIRJ) (2015)
10. S.F. Hussainy, S. Ahmad Raza, A. Khaliq, M.A. Zafar, Decision tree-inspired classification
algorithm for early detection of tuberculosis (TB). in Thirty Seventh International Conference
on Information Systems (Dublin 2016)
11. S. Gerg, N. Rupal, A data mining approach to detect tuberculosis using clustering and GA-NN
techniques. IJSR (2015)
12. T. Asha, S. Natarajan, K.N.B. Murthy, A data mining approach to the diagnosis of tuberculosis
by cascading clustering and classification. CoRR. abs/1108.1045 (2011)
13. S. Kalhori, X. Zeng, Improvement the accuracy of six applied classification algorithms through
integrated supervised and unsupervised learning approach. J. Comput. Commun. 2, 201–209
(2014). [Link]
14. Rusdah, E. Winarko, R. Wardoyo, Preliminary diagnosis of pulmonary tuberculosis using
ensemble method. ICDSE. 175–180 (2015)
15. B.H. Tracey, G. Comina, S. Larson, M. Bravard, J.W. López, R.H. Gilman, Cough detection
algorithm for monitoring patient recovery from pulmonary tuberculosis. IEMBS (2011)
16. R. Moedomo, M. Ahmad, B. Alisjahbana, T. Djatmiko, The lung diseases diagnosis software:
Influenza and tuberculosis case studies in the cloud computing environment. ICCCSN (2012)
17. S.R.N. Kalhori, X.-J. Zeng, Improvement the accuracy of six applied classification algorithms
through integrated supervised and unsupervised learning approach. J. Comput. Commun. 201–
209 (2014)
18. E. W. Rusdah, Review on data mining methods for tuberculosis diagnosis. ISICO (2013)
19. Gulab Devi Hospital: [Link]
20. WEKA: [Link]
21. J. Han, M. Kamber, J. Pei, Data Mining Concepts and Techniques. 3rd edn (2012)
Automatic Retail Invoicing
and Recommendations

Neeraj Garg and S. K. Dhurandher

Abstract The goal of this research paper is to make today's computing systems
sense the presence of users at a place and their current states. This research may
further exploit the present context information of users and help people get current
context services according to their preferences and current needs, such as details of
currently discounted products (that they wish to buy) and personal recommendations
such as a movie, clothes trending in the market, or significant particulars about
product purchases in a warehouse. The suggestion of particular services by the
warehouse server is realized by studying the user profile and analyzing user behavior
and purchase history. Using Zadeh's fuzzification equations in the process, it was
found that extra-large clothes should be recommended for a person, with a probability
found to be 78% true in all cases, except for persons weighing less than 76.5 kg
(middle clothes preferred) or younger than 11 years (small clothes preferred).

Keywords Retail · Users · Presence · Contexts · Activity · Recommendations · Person · etc.

1 Introduction

Pervasive computing deals with ubiquitous environments populated by intelligent
objects. These environments provide information as well as context services to the
users, when and where needed. Context-aware applications usually depend on an
information repository called the context model and also on the location the current
user is in. That means that if a person gets registered with the location server of some
clothing warehouse, then the user is authorized to get context web services from the
warehouse server, for example about suitable clothing.

N. Garg (B)
Department of Computer Science Engineering, MAIT, Delhi, India
e-mail: neeraj_garg20032003@[Link]
S. K. Dhurandher
Department of Information Technology, NSUT, Delhi, India
e-mail: dhurandher@[Link]


The server may calculate the current location and current section of the user in the
warehouse by using distance triangulation methods (GPS) or network-centric methods
(Bluetooth, Zigbee, WLAN). The user context depends on many different attributes,
such as location, movement, availability, schedule plans, activity, and currently used
services. Location is an essential part of the context, since the user's preferences and
skills, his physical environment (e.g., weather and temperature), and social
relationship details (e.g., who, when, and with whom) all affect his next movement or
behavior. Researchers today are also analyzing different context situations or
locations of users based on users' behavior. Here, the selling company may provide
customers with offers and discounts based on shopping behavior and actual
transaction data (Amazon, Google, etc.). Forgetting, i.e., being supposed to do some
task but failing to do it because of some complex new situation, can be dealt with by a
pervasive memory-reminder application. The list of all things each user individually
expects will be taken care of by robust pervasive applications. Obviously, remembering
something at some time takes a person's concentration, time, and energy, resulting in
wasted office hours. In short, location prediction, movement prediction, action
prediction, and daily routine prediction all belong to context prediction techniques,
and all of these must be analyzed perfectly to give recommendations to a moving user.

1.1 Challenges of This Research

The issue in context-aware, resource-constrained applications is to identify where,
what, and when ubiquitous environments are created. Some of the questions that
need to be addressed:
1. What minute context information can be modeled in LBS, given that this
depends on the application and methods in LBS?
2. How is information acquired and inferred from sources such as sensors,
web-based applications, and users' explicit inputs?
3. How can context information be structured and modeled to allow efficient extraction?
If the application can also find out where the user will be, then it will be able to
provide the services the user may need in the future, as well as the services needed
at present. Section 2 discusses the existing work done by researchers in the same
direction. Section 3 describes different activity recognition concepts. Section 4 covers
a case study of recommendation for a visitor in a warehousing mall according to body
dimensions. Section 5 concludes with a challenge of what can be done next.

2 Related Work

Many researchers are working in the direction of predicting human location and
then suggesting recommendations according to users' present locations and their
historical buying behavior patterns. Mobile guides, i.e., digital guides to the user's
surroundings, help in mobile search, 'you-are-here' maps, and tour guides for tourism
and recreational purposes. Navigation systems (e.g., car and pedestrian navigation
systems) assist in way-finding tasks in unfamiliar environments. The authors of [1]
utilized user-generated content and made recommendations for a particular location.
Similarly, some papers [2] discussed healthcare (exercise, fitness monitoring)
applications. Apart from location-based services (LBS), there is also good research
on analyzing user habits and patterns and then making recommendations. Yu-Cheng
et al. analyzed user habits to propose community recommendations [3] and
categorized users with similar habits. The Personalized DTV Program Recommendation
(PDPR) system analyzed and used the viewing patterns of consumers to personalize
program recommendations and to use computing resources [4] efficiently. Efstratiou
et al. detected social interactions using Bluetooth data [5] through the real deployment
of a system that involves location tracking, conversation monitoring, and interaction
with physical objects. Adams et al. [6] used both technologies, Bluetooth and GPS, to
model user behavior through proximity with respect to visited places. Singla et al. [7]
used a hidden Markov model approach [8] for recognizing activities in a single smart
home environment through an observed sequence of sensor readings. Lum et al. [9]
performed adaptation based on a user's context, where Bayesian networks are used
to predict the prominent activities of users. Rao et al. [10] introduced view-invariant
dynamic time warping for analyzing activities with trajectories. Zelniker et al. [11]
created global trajectories by tracking people across different cameras and detected
abnormal activities when the current global trajectory deviates from the normal paths.
Huýnh et al. [12] adopted a topic model to predict users' daily routines (such as office
work, commuting, or a lunch routine) from activity patterns, and the result was broken
into domains of users' likes and dislikes. Alireza et al. [13] recorded the type and
arrival timestamp of each activity relating to the mobile user, observing notification
behavior, and also used the application to collect subjective feedback and gather
information about notifications; related work showed that personal behavior under
given conditions can be predicted in advance [14]. User monitoring [12] determined
the user position and can send warnings to the user in case of a problem.

3 Activity Recognition and Adaptation Conceptualization

The task of activity recognition is to understand movement behavior from trajectories,
taking the customer's points of interest (POI) into account. Activities may be classified
as stationary or non-stationary. For example, office working is a stationary activity,
whereas marketing, shopping, and moving for sales promotion are non-stationary
activities. Mathematical techniques can identify different activities as C1, C2, C3, etc.
Once activities are identified as stationary or moving, there comes a requirement for
adaptation. To give services and relevant content as per location, LBS determines
where the user is right now. Ubiquitous positioning therefore provides an accurate
estimate of a user's or an object's location at all times. One example of activity
recognition is social context inference [15], i.e., deriving social relations based on the
user's daily communication with other people. Here, if a person passes by some place
daily and does some routine work there, then the system recommends the same work
to him the next day at the same time if he forgets or is slightly late in reaching the
same place of work. This raises some questions, addressed below together with
possible solutions.

3.1 How Do Indoor Environments Detect an Object's Position in Adverse GNSS Conditions?

Global navigation satellite systems (GNSS), like GPS, help to determine a location
outdoors, but indoor, underground, and dense urban environments still need better
techniques despite advances in positioning. So, activity recognition needs adaptation
by system services to tune to the user's needs. Existing researchers have explained
that spatial context (location information) improves the mobile user experience by
giving personally relevant mobile services. An activity monitor [15], a motion analysis
system, combines activity-specific information with geographical location data to
provide a rich source for extracting contextual information (e.g., detecting shock or a
fall from a roof, together with the geographical location of the accident).

3.2 What Policy Will Ensure Positioning Accuracy in LBS Applications?

Different LBS applications have location accuracy requirements, position-wise
(vertical and horizontal), and their update frequency and latency also need
improvement. Adaptation of context services to the situation of the individual is
important: whether he is moving, static, in the office, watching movies, or running in
the morning. Accordingly, services are read out, spoken by machine, displayed
through a camera, or stored for later retrieval. Figure 1 shows the movement
prediction framework: systems try to find patterns of users' movement, map location
coordinates onto observed patterns, and then map symbolic locations onto
coordinates. The contour profile is one of the techniques used for object identification
in the field of pattern recognition, where outer vertical and horizontal edges are
detected as black image pixels on a white background. In [3], social information, such
as community and friendship, was exploited to find user interests.

Fig. 1 Courtesy’ movement prediction framework [16] (different types of datasets, patterns, and
trajectories)

While understanding the location information shown in Fig. 1, we need to analyze
the data with various machine learning feature techniques that help in extracting the
main points: missing-value filters, low-variance/high-correlation filters, the random
forests/ensemble trees technique, principal component analysis (PCA), and
backward/forward feature elimination (a minimal sketch of two of these filters is
given below). These techniques give different patterns for different groups, such as
students, travelers, business people, etc., and thus affect the result. Similarly,
detecting activity through a hidden Markov model, noting down the purchase behavior
of the consumer, or tracking the location of the user in some closed circuit: all these
observations, together with analysis of the gathered data, are processed through
multiple learning techniques, and intelligent computations are performed on the
collected data. This becomes possible when all applications running on mobile devices
take care of nearby context data.
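A minimal scikit-learn sketch of two of the filters named above, run on a hypothetical
matrix of trajectory features (the data here is a random placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Hypothetical trajectory features: one row per observed trip.
X = np.random.default_rng(0).random((500, 20))

# Low-variance filter: drop near-constant features.
X_filtered = VarianceThreshold(threshold=0.01).fit_transform(X)

# PCA: project the remaining features onto a few principal components.
X_reduced = PCA(n_components=5).fit_transform(X_filtered)
print(X.shape, "->", X_reduced.shape)
```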

3.3 How LBS-Generated Data Analysis Works (Tracking Social Media Data)

LBS applications and their ubiquitous services, thanks to the location-sensing devices
attached to them, have dealt with location-based tracking data and even social data,
so as to gather geographic information and give insights into the common and
distinguishing movement behavior of people in different environments. To support
efficient data analysis and further knowledge discovery, there is a continuous need to
model and store LBS-generated data. Normally, geospatial data representation models
cannot capture all the important features of large LBS-generated data and the
relationships among them. There is a strong need for the system to be adaptable to
current technology, which is possible through a similarity matching technique that
returns an exactly matched resource at the right time. Through its data models, the
system should find out whether the user is walking, running, or sitting still by
monitoring different sensors (gyroscope, accelerometer, proximity, GPS). The user's
body dimensions, such as height, weight, and facial appearance, should be noted and
measured carefully so that personal characteristics (parameters) can be diagnosed by
investigating authorities, moving toward a complete information retrieval system that
helps psychiatrists and human behavior experts.
For a shopping scenario, there could be certain clothes brands related to fashion,
color, size, etc., for the consumer, since the qualities of good clothes are preferred all
over the world. Information such as age, preference, height, and weight is fed into the
personal profile of the user and kept as an information document in a file folder on a
smartphone. The objective here is to identify some features that allow for the reliable
inference of motion-related activities, independent of the person; a dataset is used to
train the classifier. If there is a set of information as input, such as demographic data
or user information, then, as an example, an adaptive music system plays according
to age, gender, region (country), music information (timestamp, artist, song, title),
and spatial context (time, date, location, weather, city); however, tuning to the perfect
song as per the input parameters is a challenging task.

4 Product Recommendation as Per Body Parameters

Some of the existing frameworks which the authors have worked on in previous years
can be used to implement this product recommendation work; an earlier defined
model is the efficient mobility prediction model (MPM) [17]. These models are
required to locate a moving user in the warehouse; user mobility models are applied
in the warehouse through these frameworks [17].
Broadly, recommender systems can be categorized as collaborative-based, where
items liked in the past by other users with similar tastes are recommended;
content-based, which recommends items that resemble the ones the user preferred in
the past; and hybrid-based, which combines the characteristics of both content-based
and collaborative techniques. Personalized recommender systems, in our case for
determining clothes sizes for people of different ages and weights (male, female, kid),
offer a feasible solution when users want to ensure that proper content is delivered to
them as the number of choices increases. A simple prediction function computes
P_a,j by aggregating terms of the form (R_i,j − R̄_i), where P_a,j denotes the
prediction for user 'a' on item 'j', R_i,j denotes the rating of user 'i' on item 'j', and
R̄_i denotes the mean rating of user 'i' (not specific to item j). Here, item j is a
clothing item, etc.
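The paper does not spell out the aggregation, so the sketch below assumes the
standard user-based collaborative filtering form (mean-centered ratings weighted by
Pearson similarity); the function and weighting are illustrative, not the authors' exact
method:

```python
import numpy as np

def predict(ratings, a, j):
    """Predict user a's rating of item j from mean-centered ratings of
    the other users, weighted by Pearson similarity (assumed form)."""
    means = np.nanmean(ratings, axis=1)
    num, den = 0.0, 0.0
    for i in range(ratings.shape[0]):
        if i == a or np.isnan(ratings[i, j]):
            continue
        # Pearson similarity over the items both users have rated.
        both = ~np.isnan(ratings[a]) & ~np.isnan(ratings[i])
        if both.sum() < 2:
            continue
        w = np.corrcoef(ratings[a, both], ratings[i, both])[0, 1]
        if np.isnan(w):
            continue
        num += w * (ratings[i, j] - means[i])
        den += abs(w)
    return means[a] + num / den if den else means[a]

# Toy ratings matrix: rows = users, columns = clothing items; NaN = unrated.
R = np.array([[5, 3, np.nan], [4, 2, 4], [2, 5, 1]], float)
print(predict(R, 0, 2))
```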
The system here takes body dimensions (age, height, weight, and physical
appearance) from the user profile of a moving person, and these are considered in
the decision taken by the server in the warehouse regarding the selection of clothes.
For this, some membership functions are taken. These membership functions are
built on probability values based on height, age, and weight, and they vary with the
type of person considered, such as teenager, kid, or adult.

4.1 Age and Height Probability Determination

Recommendation systems [3] require some body parameters and some records in
advance; feeding in those values makes a closed system. The main attributes of the
human body are represented/defined as linguistic variables such as height, age, facial
expression, weight, etc. The premise of predicting clothes by age is that if a person
(kid or adult) is tall or heavy, then the system should be able to recommend
extra-large garments to that person via fuzziness values. For that, fuzzy membership
functions of the age/weight/height type need to be designed. For example,
parameters may be measured as follows:
1. Height (150–180 cm) can take different labeled values: small, average, tall, very
tall, extremely tall.
2. Weight (50–90 kg) can take values: lean thin, slim, fatty, obese, etc.
3. Age (03–24) can take different person values: kid, teenager, young adult,
complete adult.
A person's exact age may not be known, yet the age still needs to be predicted. If a
person's age seems to be approximately 8–10 years, he is called a kid with
membership degree 0.8 (probability), a teenager with a degree of 0.3, etc., as seen in
Table 1. A fuzzy membership function is defined for the sets kid, teenager, young
adult, and complete adult: Set A (03–09) kids; Set B (10–13) teenager; Set C (14–18)
young adult; Set D (19–24) complete adult. This means that if a person seems, by
looking and estimating, to be 10 years old, people will call that person a kid with a
probability of 0.8 and a complete adult with a probability of 0.001. Similarly,
membership functions for different weights and heights (kids and adults) are defined
and kept in the separate categories kids and adults (Tables 2 and 3) from a medical
point of view. The first column of each table denotes age groups, the second column
denotes the body structure type, and the remaining four columns denote the
probability of having a particular weight/age in kg/years. The experiments are not
conducted for kids in this work.

Table 1 Set A—age group and type of person with corresponding probability values

| Age group | Type of person | Age/prob. | Age/prob. | Age/prob. | Age/prob. |
|-----------|----------------|-----------|-----------|-----------|-----------|
| 03–09 | Kids | 10/0.8 | 13/0.3 | 18/0.1 | 24/0.01 |
| 10–13 | Teenager | 10/0.3 | 13/0.75 | 18/0.5 | 24/0.1 |
| 14–18 | Young adult | 10/0.1 | 13/0.2 | 18/0.5 | 24/0.4 |
| 19–24 | Complete adult | 10/0.001 | 13/0.8 | 18/0.6 | 24/0.95 |

Table 2 Set B—age and weight (kids and adults)—12–80 kg

| Age group | Body type | Weight/prob. | Weight/prob. | Weight/prob. | Weight/prob. |
|-----------|-----------|--------------|--------------|--------------|--------------|
| 3–5 | Lean thin | 12/0.7 | 15/0.3 | 25/0.03 | 30/0.01 |
| 6–13 | Slim | 25/0.5 | 30/0.9 | 40/0.1 | 50/0.001 |
| 14–20 | Fatty | 52/0.1 | 63/0.4 | 74/0.8 | 80/0.2 |
| 21–25 | Heavy | 52/0.01 | 63/0.002 | 74/0.4 | 80/0.9 |

Table 3 Set C—weight membership function (adults)—52–80 kg

| Weight group (kg) | Body structure | Weight/prob. | Weight/prob. | Weight/prob. | Weight/prob. |
|-------------------|----------------|--------------|--------------|--------------|--------------|
| 45–53 | Lean thin | 52/0.7 | 63/0.25 | 74/0.05 | 80/0.003 |
| 54–63 | Slim | 52/0.5 | 63/0.4 | 74/0.1 | 80/0.001 |
| 64–74 | Fatty | 55/0.1 | 63/0.4 | 74/0.8 | 80/0.2 |
| Above 75 | Heavy | 52/0.001 | 63/0.008 | 74/0.4 | 80/0.9 |

Recommender systems have the ability to provide personalized and meaningful
recommendations by taking into account individual users' interests and information
needs, such as age, height, and weight, given sufficient data to deal with such cases.
A context-aware recommendation process incorporates contextual information into
the recommendation process by applying contextual pre-filtering or post-filtering
techniques to deal with contextual preferences, such as whether a person's body is
lean thin or too fatty.
Now, membership functions are taken for people of different weights (not only
adults), because depending upon body structure, the system will recommend clothes
to them. Table 4 shows people of different ages with varying heights, with their body
dimensions and probabilities.
Similarly, membership functions are defined for adults of different heights when their
heights are not known with certainty. Sets A: 150–170 cm—short; B: 160–180—average;
C: 170–190—taller; D: 180–200—very tall (height in centimeters). Table 5 shows
people of different heights, with their corresponding body dimensions and probabilities.

Table 4 Set D—age-height membership function (80–180 cm)

| Age group | Height type | Height/prob. | Height/prob. | Height/prob. | Height/prob. |
|-----------|-------------|--------------|--------------|--------------|--------------|
| 03–09 | Small | 80/0.8 | 90/0.4 | 100/0.1 | 110/0.01 |
| 10–13 | Average | 110/0.4 | 120/0.7 | 130/0.4 | 150/0.2 |
| 10–13 | Ok | 130/0.4 | 140/0.5 | 150/0.1 | 160/0.05 |
| 14–20 | Taller | 135/0.2 | 145/0.5 | 160/0.6 | 180/0.8 |

Table 5 Set E—body height membership function (adults, age 19–24)—150–200 cm

| Group | Height structure | Height/prob. | Height/prob. | Height/prob. | Height/prob. |
|-------|------------------|--------------|--------------|--------------|--------------|
| 1 | Short | 150/0.1 | 160/0.4 | 165/0.1 | 170/0.01 |
| 2 | Average | 160/0.4 | 165/0.7 | 170/0.15 | 180/0.1 |
| 3 | Taller | 170/0.1 | 175/0.6 | 180/0.8 | 190/0.1 |
| 4 | Very tall | 180/0.1 | 185/0.3 | 190/0.6 | 200/0.8 |

Some real data is sampled randomly and stored as data files in CSV format. Once the
data is in Excel or CSV format, it can be used by machine learning techniques in
Python, or by the GUI-based WEKA software, to extract and use it according to the
fields mentioned.

Rule 1: If X is a teenager, then X will have average height as a recommendation.

Looking at the age group in Table 1 and doing estimations, it is found that the
recommended clothing is average [size, type of cloth (T-shirt, sweater, jeans,
trousers, etc.)], and this is recommended to the user. Let us take a measured
distribution of the teenager set by age (from Set A): teenager's age = (9/0.3, 13/0.75,
18/0.5, 24/0.1), where (a/b) in the subset teenager represents age/membership value
of having that age. Let us consider the subset average of height for kids, taking the
values of the Average row of Table 4.
Vector A: Average = (110/0.4, 120/0.7, 130/0.4, 150/0.2). A Cartesian product is
taken between age (Set A) and height, giving µR(age, height) for teenagers. There is
a fuzzy system relating µTeenager(age = 10–13) to µAverage(height = 110–150), and
the minimum-maximum composition theorem is applied with the condition:

A_i * A_j = A_i,  if A_i < A_j    (1)

A_i * A_j = A_j,  if A_j ≤ A_i    (2)

A_i * A_j = min(1, 1 − A_i + A_j)    (3)

Table 6 Set F—a Cartesian product of age and height (teenager)

| | 03–09 (A) | 10–13 (B) | 14–18 (C) | 19–24 (D) |
|---|-----------|-----------|-----------|-----------|
| Small (A) | 0.3*0.4 | 0.75*0.4 | 0.5*0.4 | 0.1*0.4 |
| Average (B) | 0.3*0.7 | 0.75*0.7 | 0.5*0.7 | 0.1*0.7 |
| Ok (C) | 0.3*0.4 | 0.75*0.7 | 0.5*0.4 | 0.1*0.2 |
| Taller (D) | 0.3*0.2 | 0.75*0.2 | 0.5*0.2 | 0.1*0.2 |

The star (*) acts as an operator resolving the value between its left and right
operands in the fuzzy relationship shown in Table 6. Now, comparing this against the
measured distribution of teenager from the standard membership distribution curves,
the fuzzy membership distribution of a person being of average height can be
estimated by post-multiplying the relational matrix by the teenager vector (Table 6),

i.e., µAverage(height = 110–150)

= µR(age, height) ∞ µTeenager(age = 10–13)    (4)

where ∞ denotes the fuzzy AND and OR composition operator, computing the matrix
product with addition and multiplication replaced by fuzzy OR (maximum) and fuzzy
AND (minimum) operations, respectively.
µAverage(height = 110–150) comes out to be (110/0.3, 120/0.75, 130/0.5, 140/0.1).
That shows that if X is a teenager, then the probability of having a height around
120 cm is 0.75, which was made possible by measuring and consulting the
membership distribution function. Doing the calculations for taller people, but for a
teenager, after consulting the membership distribution curve, µtaller comes out:
µtaller(height = 135–180) was found to be (135/0.2, 145/0.6, 160/0.8, 180/0.1); cf.
the Taller row of Table 4.
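A minimal numpy sketch of the max-min composition used in Eq. (4), built from the
two membership vectors quoted above. Since the paper does not fully specify how
the relation's indices are paired, this reproduces the mechanics of the composition
rather than the paper's exact numbers:

```python
import numpy as np

# Membership vectors from the text: ages 10-24 for "teenager",
# heights 110-150 cm for "average"; values as printed in the paper.
teenager = np.array([0.3, 0.75, 0.5, 0.1])   # mu_Teenager over age points
average  = np.array([0.4, 0.7, 0.4, 0.2])    # mu_Average over height points

# Fuzzy relation R(height, age): Cartesian product with min (cf. Table 6).
R = np.minimum.outer(average, teenager)

# Max-min composition: mu(height) = max over ages of min(R, mu_Teenager).
mu_height = np.max(np.minimum(R, teenager), axis=1)
print(mu_height)
```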

4.2 Age and Weight Probability Assessment

Similarly, it can be proved using age and weight as parameters, predicting a person's
weight according to the membership functions.

Rule 2: If X is a young adult, then X will have a slim weight.

Let the young adult and slim memberships (from Set A and Set D) be given below.

Young adult = (10/0.1, 13/0.2, 18/0.5, 24/0.4)

Slim = (52/0.5, 63/0.4, 74/0.1, 80/0.001)

Taking a Cartesian product between age and weight gives µR(age, weight), with
a = 8–10, b = 10–13, c = 14–18, d = 19–24 (years) and A = 30–52, B = 53–63,
C = 64–74, D = above 75 (Table 7).
Measuring the distribution of young adult from the given membership distribution
curves, the fuzzy membership distribution of a person being slim in weight can be
estimated by post-multiplying the relational matrix by the young adult vector, shown
below.
That is, µslim(weight = 54–63) = µR(age, weight) ∞ µyoung adult(age = 14–18),
using ∞ (the fuzzy AND and OR composition operator);
µslim(weight = 54–63) = (A/0.1, B/0.2, C/0.4, D/0.4) comes out.

Table 7 Set G—Cartesian product of age and weight

| Weight \ Age | 06–09 (a) | 10–13 (b) | 14–18 (c) | 19–24 (d) |
|--------------|-----------|-----------|-----------|-----------|
| Lean thin (45–53) (A) | 0.1*0.5 | 0.2*0.5 | 0.5*0.5 | 0.4*0.5 |
| Slim (54–63) (B) | 0.1*0.4 | 0.2*0.4 | 0.5*0.4 | 0.4*0.4 |
| Fair (64–74) (C) | 0.1*0.1 | 0.2*0.1 | 0.5*0.1 | 0.4*0.1 |
| Heavy (above 75) (D) | 0.1*0.001 | 0.2*0.001 | 0.5*0.001 | 0.4*0.001 |

There are two types of collaborative filtering (CF) algorithms: memory-based CF and
model-based CF. Memory-based approaches can be used to identify similar users in
the entire database; by identifying similar users, a prediction of preferences can be
calculated. Memory-based CF uses neighborhood methods, whereas model-based
approaches use a previously constructed model to predict and issue recommendations;
examples are Bayesian classifiers and cluster-based CF.

Rule 3: If X is a young adult, taller in height, and heavy, then X is recommended
extra-large clothes.

The system should recommend extra-large-size garments to that person via fuzziness
values. The design of the age/weight/height parameters with real data is given in
Fig. 2. Here, height and weight are combined using the ˆ operator and compared,
i.e., µAGE(young adult) ˆ µtall(height) ˆ µfatty(weight) → XL clothes (X). A fuzzy
membership function is taken over all clothes sizes, giving the probability for different
sizes, i.e., clothes size/probability = {38/0.1, 40/0.5, 42/0.4, 44/0.2} in general for all
types of persons. Here, size 38 is small, 40 is middle, 42 is large, and 44 is extra
large for men's clothes (upper chest). Similarly, data are taken for height and weight:
height (taller) from Table 5 and weight (fatty) from Table 3.

Taller = (170/0.1, 175/0.6, 180/0.8, 190/0.2)

Fig. 2 Weka data (age, height) versus clothes referral

Table 8 Set H—Cartesian product (age, height, weight)—adult

| Size \ Height | 150 (A) | 160 (B) | 170 (C) | 180 (D) |
|---------------|---------|---------|---------|---------|
| 38 (A) | 0.1*0.1 | 0.6*0.1 | 0.8*0.1 | 0.2*0.1 |
| 40 (B) | 0.1*0.2 | 0.6*0.2 | 0.8*0.2 | 0.2*0.2 |
| 42 (C) | 0.1*0.5 | 0.6*0.5 | 0.8*0.5 | 0.2*0.5 |
| 44 (D) | 0.1*0.8 | 0.6*0.8 | 0.8*0.8 | 0.2*0.8 |

Fatty = (55/0.1, 63/0.4, 74/0.8, 80/0.2)

Then, µtall(height) ˆ µfatty(weight) = (0.1, 0.4, 0.8, 0.2).

The calculation for a young adult taller in height against different-sized clothes is
given by the clothes size/probability matrix = {38/0.1, 40/0.5, 42/0.4, 44/0.2}. That
leads to the table given in Set H, Table 8. If a person (young adult) is taller, then he
should be recommended extra-large clothes according to a measurement distribution
like (38/0.1, 40/0.5, 42/0.4, 44/0.2), i.e., only 20% of young adult people will be
asked to take extra-large (XL) clothes. If data are measured for a fatty adult person
of age (55–63), the dimensions noted from Table 3 are (55/0.1, 63/0.4, 74/0.1,
80/0.001), but for an average height it is found that µAverage = (160/0.4, 165/0.7,
170/0.15, 180/0.1); then, after comparing and multiplying, µtall(average height) ˆ
µfatty(weight) = (0.1, 0.4, 0.15, 0.001). Comparing with clothes size/probability =
{38/0.1, 40/0.5, 42/0.4, 44/0.2}, the function gets probability values µXL = (38/0.1,
40/0.4, 42/0.15, 44/0.1). That explains that if a person has an average height and is
fatty in weight, he will be asked to wear size 40 (middle) clothes, which carries the
maximum probability (≈0.4). If a person is taller and heavy, then there is a 0.2
probability that he will be recommended XL clothes, as given in Fig. 3, which shows a
chart of height labels with increasing heights in centimeters. So, it is suggested that
if a person is in the shopping area, he can be suggested clothes, food, etc., according
to his height or weight or the preference file stored in his mobile database.
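A small numeric sketch of Rule 3's pointwise fuzzy AND (the ˆ operator is the
elementwise minimum), using the vectors quoted above:

```python
import numpy as np

taller = np.array([0.1, 0.6, 0.8, 0.2])   # mu_tall over heights 170-190 cm
fatty  = np.array([0.1, 0.4, 0.8, 0.2])   # mu_fatty over weights 55-80 kg
sizes  = {38: 0.1, 40: 0.5, 42: 0.4, 44: 0.2}  # clothes size/probability

# Fuzzy AND (^) is the pointwise minimum of the membership vectors.
combined = np.minimum(taller, fatty)
print(combined)  # [0.1 0.4 0.8 0.2], matching the paper's computation
```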
Figure 4a, b gives recommendations for different-sized clothes for a small dataset
(14 tuples having seven features) and shows a 63% probability of predicting correctly
by applying the RIDOR model using the WEKA data mining tool.
Figure 4c explains that if a person is called a kid, the random forest technique gives
the possibility of the person being fatty in shape with 100% probability, whereas if the
height of a person is less than 173.5 cm, the recommended clothes are middle with
75% probability and large with 25% probability.

Fig. 3 Height versus tagged labels on the chart

5 Summary and Future Challenges

Pervasive computing applications work best when combining both software and
hardware. As a contribution, in the future, a warehouse server, enabled through GPS
(or positioning that works better indoors), will be able to calculate the current location
and current section of the user in the warehouse by using distance triangulation
methods or network-centric methods. There is a strong need to build a user profile,
which involves (i) inferring the profile from user actions (implicit, like buying history,
clicking on the web, etc.) and (ii) inferring the profile from explicit user ratings, which
includes feedback techniques such as filling out forms. Both implicit actions and
explicit ratings are processed to build the content profile. Most content-based
recommender systems merge the ratings and user actions into profile information
about the user's actions and preferences, to infer keywords and attributes in order to
build the user profile. Sensing the user's activity context using software sensors in
the context-aware environment will be the main objective of this research in the
future. Even after getting the required output from the proposed system according to
the set aims, there are still open directions and questions on how long the context
data is needed; the proposed system will follow future trends and standards and
provide new ways of thinking about activity and context sensing.

Fig. 4 a RIDOR method to recommend middle clothes. b RIDOR method to recommend extra-large
clothes. c Random forest method to recommend clothes for differently aged kids

References

1. J. Raper, G. Gartner, H. Karimi, C. Rizos, Applications of location-based services: A selected
review. J. Location Based Serv. 1(2), 89–111 (2007)
2. H. Huang, Context-aware location recommendation using geo-tagged photos in social media.
ISPRS Int. J. Geo-Info. 5(12), 195 (2016)
3. Y.-C. Chen, H.-C. Huang, Y.-M. Huang, Community-based program recommendation for the
next generation electronic program guide. IEEE Trans. Consum. Electron. 55(2), 707–712
(1995)
4. S. Lee, D. Lee, S. Lee, Personalized DTV program recommendation system in a cloud
computing environment. IEEE Trans. Consum. Electron. 56(2), 1034–1042 (2011)
5. C. Efstratiou, I. Leontiadis, M. Picone, K.K. Rachuri, C. Mascolo, J. Crowcroft, Sense and
sensibility in a pervasive world. in Pervasive (2012) pp. 406–42
6. B. Adams, D.Q. Phung, S. Venkatesh, Sensing and using social context. TOMCCAP. 5(2),
1–27 (2008)
7. G. Singla, D. Cook, M. Schmitter-Edgecombe, Recognizing independent and joint activities
among multiple resident in smart environments. Ambient Intell. Humanized Comput. J. 1(1),
57–63 (2010)
8. T. Gao, D. Greenspan, M. Welsh, R.R. Juang, A. Alm, Vital signs monitoring and patient
tracking over a wireless network. in 27th IEEE EMBS Annual International Conference (2005)
pp. 102–105
9. W.Y. Lum, F.C. Lau. A context-aware decision engine for content adaptation. IEEE Pervasive
Comput. 1(3), 41–49 (2002)
10. C. Rao, M. Shah, T. Syeda-Mahmood, Action recognition based on view-invariant spatiotem-
poral analysis. in ACM Multimedia (2003) pp. 518–527
11. E.E. Zelniker, S. Gong, T. Xiang, Global abnormal behavior detection using a network of CCTV
cameras. in Proceeding of International on Workshop Visual Surveillance (2008) pp. 303–311
12. T. Huýnh, M. Fritz, B. Schiele, Discovery of activity patterns using topic models. in
International Conference on Ubiquitous Computing (2008) pp. 10–19
13. A.S. Shirazi, N. Henze, T. Dingler, A. Schmidt, Large-scale assessment of mobile notifications.
in CHI’14, April 26–May 01 2014, (Toronto, ON, 2014) pp. 3055–3064
14. B.D. Davison, H. Hirsh, Predicting sequences of user actions. in AAAI/ICML Workshop on
Predicting the Future: AI Approaches to Time–Series Analysis (2018) pp. 5–12
15. M.G. Michael et al., A research note on ethics in the emerging age of Uberveillance. Comput.
Commun. 31(6), 1191–1199 (2008)
16. J. Gaber, New paradigms for ubiquitous and pervasive applications. In Workshop on Software
Engineering Challenges for Ubiquitous Computing (2006)
17. N. Garg, S.K. Dhurandher, P. Nicopolitidis, J.S. Lather, Efficient mobility prediction scheme
for pervasive networks. Int. J. Commun. Syst. 31(6) (2017)
Modeling Open Data Usage: Decision
Tree Approach

Barbara Šlibar

Abstract Predicting the quality of datasets is important for individuals and private or
public organizations regardless of their intent, especially in today's fast-moving and
competitive environment. If data are considered valuable, then the data need a
stronghold in their quality, or rather in the indicators of which quality consists.
Although there are several data mining methods that can be used for this issue, the
regression tree method is used here because of its advantages in comparison with
other methods. Some of the pros are that a tree is very easy to understand and
explain, it is a collection of if-then rules, and it can be used for investigating
relationships between predictors and the target attribute without knowing the form of
those relationships. In this paper, an open data usage prediction task was performed
on data collected from the open data portal [Link]. The results show that it is
possible to build a model of high accuracy.

Keywords Open data · Regression tree · Metadata

1 Introduction

Measuring and comparing the quality of open data is not an easy task, since multiple
quality dimensions should be taken into account. The poor quality of such data can
negatively influence the decisions made by its users. Accordingly, the influence of
metadata on usage, and thereby on quality, was investigated within this research.
The main objective of this research is to investigate the performance of an open
data usage prediction model using a regression tree-based approach. In order to
achieve it, the defined objective is decomposed into the following two objectives:
(1) to determine important metadata of open datasets, and (2) to build a model for
predicting the number of downloads of the open datasets based on the identified
metadata.

B. Šlibar (B)
Faculty of Organization and Informatics, University of Zagreb, Pavlinska 2, 42000 Varaždin,
Croatia
e-mail: [Link]@[Link]


Therefore, the paper is structured as follows: first, previous research related to the
quality of open datasets is reviewed; second, the research methodology is presented,
together with a detailed description of the data used and the applied classification
mining method; third, the results of the conducted research are shown. Finally, the
findings of the paper are given, as well as recommendations for further research.

2 Literature Review

Nowadays, data is published through open data portals, which are considered to be
catalogs. These catalogs are comparable to digital libraries, where metadata have
a key role along with their quality [1, 2]. As metadata quality is directly related
to the value of digitized libraries, it has a direct impact on the objects contained in
these libraries [1]. Open data portals aggregate datasets, resources, and metadata
about these resources [2]. Since the number of such resources is increasing, there is
growing concern about the quality of the resources and the accompanying metadata.
Hence, Neumaier et al. (2016) introduced automated quality assessment of metadata
based on mapping certain metadata of CKAN, Socrata, and OpenDataSoft to the
metadata standard Data Catalog Vocabulary (DCAT) and defining quality dimensions
and metrics regarding metadata keys in the DCAT specification [2]. Reiche and Höfig
(2013) proposed, implemented, and applied metadata quality metrics (completeness,
weighted completeness, accuracy, richness of information, and accessibility) to three
open data portals [3]. They pointed out that resource metadata such as name, URL,
description, format, etc. represent core metadata [3]. In order to empower end users
to assess open data portals, Kubler et al. (2018) developed the Open Data Portal
Quality (ODPQ) framework. This framework included some metadata, since metadata
can be helpful for evaluating various aspects of quality dimensions [4].
The aforementioned studies are focused more on quality assessment of portals
than on evaluation of the quality of the datasets published on them. Therefore,
studies directed more at the dataset level are presented below.
Kučera et al. (2013) investigated the quality of catalog records that stand for
datasets published on an open data portal through metadata, and proposed
techniques for their improvement [5]. Metadata, interaction mechanisms, and data
quality indicators were used by Zuiderwijk et al. (2016), owing to the authors'
intention to improve the usage of Open Government Data (OGD). Some of the
metadata embodied in the prototype were: title, description, URL, publisher, views,
date of publishing, etc. [6]. Since software such as Socrata or CKAN is not capable of
automatically covering the quality metrics because of its technical possibilities,
Matamoros et al. (2018) made a proposal for measuring the quality of open datasets
[7]. Altogether, 17 quality metrics were proposed, covering metadata, content, and
structure [7]. Previous research into the assessment of the quality of open data
portals or datasets includes either subjective metrics or methods that require the
involvement of real people. For those reasons, the research conducted within this
paper excludes this subjectivity by evaluating the metadata using the regression tree
method.

3 Research Methodology

Since the regression tree method is widely used for similar issues, in which
researchers have been investigating the prediction of certain factors with respect to
other important factors, it was applied within this research [8–10]. The design of the
research can be described through four phases: the first phase is about gathering the
data; the second is directed at data preprocessing, which includes data cleaning and
data transformation; the third phase applies the chosen classification mining method;
and the fourth phase covers evaluation and interpretation of the obtained results [11].
Therefore, this section of the paper describes in detail the data used as well as the
applied method.

3.1 Data Description

The data were gathered from the open data portal [Link] by a Java Web
application. The reason why this portal was chosen among the vast number of similar
portals lies in the fact that it is ranked as the best, or one of the best, open data
portals [12–14]. The gathered data contain metadata for the datasets published on
[Link]. Based on statistics of dataset usage found on the portal, the Web
application could collect all the needed data [15]. The application loaded these
statistics as a .csv file, and then it requested the API call [Link]
{id} for every dataset identifier found in this .csv file. Almost half, more precisely
5 attributes out of a total of 11, are converted according to the following rules:
• Title: if the data is retrieved it is labeled TRUE, otherwise FALSE;
• Description/Note: if the data is retrieved it is labeled TRUE, otherwise FALSE;
• License: if the license information contains the keyword "open", then the license
openness is labeled TRUE, otherwise FALSE;
• Dataset URL: if the data is retrieved it is labeled TRUE, otherwise FALSE;
• Machine-readable format score: the original data is expressed numerically,
so it is converted into textual values as follows: 0 == "BAD", 1 == "NOT OK",
2 == "OK", 3 == "GOOD", 4 == "VERY GOOD", 5 == "EXCELLENT".
The prepared dataset contains 1049 observations, and it was last updated on 28
August 2018.

Altogether, the dataset contains 9 categorical attributes (Name/Id, Title,
Description/Note, Publisher, Update frequency, License, Dataset URL, Domain,
Machine-readable format score) and 2 numerical attributes (Number of views, Number
of downloads). The distribution form is checked for the two numerical attributes,
Number of views and Number of downloads. The distribution of both variables is
skewed right, so, in order to reduce the variability of the data, a log transformation is
done for both attributes. A sketch of the conversions described above follows.
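A minimal pandas sketch of the conversion rules and the log transformation (the file
and column names are assumptions; np.log1p is used here to guard against zero
counts, though the paper does not state which logarithm was applied):

```python
import numpy as np
import pandas as pd

# Hypothetical file name; the gathered metadata file may differ.
df = pd.read_csv("data_gov_metadata.csv")

# Presence flags: TRUE if the field was retrieved, otherwise FALSE.
for col in ["Title", "Description", "Dataset URL"]:
    df[col] = df[col].notna()

# License openness: TRUE if the license text contains the keyword "open".
df["License"] = df["License"].str.contains("open", case=False, na=False)

# Machine-readable format score: numeric 0-5 mapped to textual labels.
labels = {0: "BAD", 1: "NOT OK", 2: "OK", 3: "GOOD",
          4: "VERY GOOD", 5: "EXCELLENT"}
df["Machine-readable format score"] = df["Machine-readable format score"].map(labels)

# Log transformation to reduce the right skew of the usage counts.
df["Number of views"] = np.log1p(df["Number of views"])
df["Number of downloads"] = np.log1p(df["Number of downloads"])
```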

3.2 Regression Tree

By Shmueli et al. (2017), classification and regression trees are adduced as the most
transparent and easy-to-interpret methods among data-driven methods [16–18].
These methods can be used for constructing prediction models from data. The models
are obtained by dividing observations, or rather splitting them into subgroups, on the
predictors. Homogeneity of the subgroups is very important: in order to create useful
prediction or classification rules, the terminal subgroups should be as homogeneous
as possible in terms of the target attribute [9–16].
The biggest difference between a regression and a classification tree is in the type
of target attribute. While the type of target attribute for a classification tree is
categorical, the type of target attribute for a regression tree is continuous. Other
divergences between them are present in the details of prediction, impurity measures,
and performance evaluation [16–19]. Considering prediction in a regression tree, the
value of a terminal subgroup is defined by the average outcome value of the training
observations contained in that subgroup. A common impurity measure in regression
trees is the sum of the squared deviations from the mean of the terminal subgroups.
The usual measures for evaluating the predictive performance of a regression tree are
measures such as the root mean square error (RMSE) [16–19]. Apart from that, the
two tree types operate in pretty much the same way.
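In symbols (standard textbook definitions, not spelled out in the paper), the impurity
of a terminal subgroup $t$ and the RMSE over $n$ observations are:

$$\mathrm{SSE}(t) = \sum_{i \in t} \left(y_i - \bar{y}_t\right)^2, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

where $\bar{y}_t$ is the mean outcome in subgroup $t$ and $\hat{y}_i$ is the tree's
predicted value for observation $i$.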

4 Research Results

In order to recursively partition the collected data based on the relationship between
the predictors and the target attribute, the Partition platform from the JMP tool is
employed. Since a unique identifier is given to every observation during the dataset
import into the JMP tool, the attribute Name/Id is irrelevant for the final model. For
this reason, this attribute was removed before the modeling process started. The
regression model was built on a total of ten attributes:
• Nine predictors: Title, Description/Note, Publisher, Update frequency, License,
Dataset URL, Domain, Machine-readable format score, Number of views;
• One target attribute: Number of downloads.

Fig. 1 R-square values for training set (upper curve) and validation set (bottom curve) according
to the number of splits

The prepared dataset is divided into a training set (80% of the observations) and a
validation set (20% of the observations). Even though the training set is indispensable
for building a regression tree, the validation set is also very important, since it
validates the predictive ability of the model.
Automatic splitting was used for growing the tree, and it resulted in six splits. It
works in such a manner that the process of splitting continues until the validation
R-square is better than what the following ten splits would gain [5]. In order to
visualize this process, Fig. 1 is displayed; a scripted sketch of the procedure follows.
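JMP's Partition platform is interactive; the following scikit-learn sketch is a rough
scripted analogue (placeholder data, and a fixed seven-leaf tree standing in for JMP's
six automatic splits), not the authors' actual tool:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Placeholder data: 1049 observations, nine encoded predictors,
# log-transformed Number of downloads as the target.
rng = np.random.default_rng(0)
X = rng.random((1049, 9))
y = 3 * X[:, 0] + rng.normal(0, 0.3, 1049)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Six splits give seven leaves, as in the paper's final tree.
tree = DecisionTreeRegressor(max_leaf_nodes=7).fit(X_train, y_train)

for name, Xs, ys in [("Training", X_train, y_train),
                     ("Validation", X_val, y_val)]:
    pred = tree.predict(Xs)
    rmse = mean_squared_error(ys, pred) ** 0.5
    print(f"{name}: R-square={r2_score(ys, pred):.3f}, RMSE={rmse:.3f}")
```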
There are graphs for goodness of fit, and they depend on the type of target attribute.
Since the target attribute Number of downloads is continuous, the graph used for
displaying how well the model fits the data is the Actual by Predicted plot. The
predicted means of the leaves are located on the x-axis, and the actual values
scattered around the means are located on the y-axis. In regression trees, where the
target variable is continuous, the mean represents the average response for all
observations in an observed branch. The place where the predicted and actual values
are the same is shown with a vertical line [19]. Since there are six splits, the built
regression tree has seven leaves for the training and for the validation set (Fig. 2);
therefore, there are seven distinct predicted values.
The fit statistics for the prepared dataset summarize the ability of the model to
predict Number of downloads (Table 1). They show the R-square, the RMSE, the
number of rows, and the number of splits for the training as well as for the validation
set. It is common for a model to predict the dataset used to create it better than the
validation set; indeed, the RMSE indicates that the built model has better performance
on the training data than on the validation set.
The predictors that contribute the most to the model are Publisher,
Machine-readable format score, and Number of views (Table 2).
If the leaves, or terminal subgroups, are observed, then the leaf with the highest
mean is the one where the observation contains any value of the Machine-readable
format score and Number of views is equal to or greater than 49,524.

Fig. 2 Actual by predicted plot for training set (left graph) and for validation set (right graph)

Table 1 Fit statistics of the model for predicting number of downloads of the open datasets based on
identified metadata

| | R-square | RMSE | No. of rows | No. of splits |
|---|----------|------|-------------|---------------|
| Training | 0.878 | 0.2799181 | 842 | 6 |
| Validation | 0.811 | 0.3653379 | 207 | |

Table 2 Contribution of predictors within the model

| Metadata | Number of splits | Sum of squares | Portion |
|----------|------------------|----------------|---------|
| Machine-readable format | 1 | 301.272832 | 0.6331 |
| Number of views | 4 | 144.518134 | 0.3037 |
| Publisher | 1 | 30.1123613 | 0.0633 |
| Title | 0 | 0 | 0.0000 |
| Description | 0 | 0 | 0.0000 |
| Update frequency | 0 | 0 | 0.0000 |
| Licence | 0 | 0 | 0.0000 |
| Dataset URL | 0 | 0 | 0.0000 |
| Domain | 0 | 0 | 0.0000 |

5 Conclusion

The modeling of open data usage by the data-driven regression tree method was the
focus of this research. Before building the predictive model, the indicators of open
data quality should be known; therefore, the metadata were chosen based on the
literature review. The results of the research show that the accuracy of the model
based on core metadata is high according to the method's measures. Also, the built
model predicts well on the validation set. Only three predictors out of nine contribute
to the model. The reason is that the applied method was a regression tree, which
operates in such a way. If the number of splits were larger, other predictors would
surely have had an impact on the model, but it would be very small.
There are a few recommendations for future research. One recommendation is to
apply another classification mining method, such as Boosted Tree or Bootstrap Forest,
over the existing data in order to more deeply examine the impact of the other
predictors, which did not show relevance to the target value Number of downloads.
Second, more metadata should be included in the model.

References

1. A. Tani, L. Candela, D. Castelli, Dealing with metadata quality: the legacy of digital library
efforts. Inf. Process. Manage. 49(6), 1194–1205 (2013)
2. S. Neumaier, J. Umbrich, A. Polleres, Automated quality assessment of metadata across open
data portals. J. Data Inf Qual 8(1), 1–29 (2016)
3. K.J. Reiche, E. Höfig, Implementation of metadata quality metrics and application on public
government data, in 2013 IEEE 37th Annual Computer Software and Applications Conference
Workshops, 2013, pp. 236–241
4. S. Kubler, J. Robert, S. Neumaier, J. Umbrich, Y. Le Traon, Comparison of metadata quality
in open data portals using the analytic hierarchy process. Gov Inf Q 35(1), 13–29 (2018)
5. J. Kučera, D. Chlapek, M. Nečaský, Open government data catalogs: current approaches
and quality perspective, in Technology-Enabled Innovation for Democracy, Government and
Governance, 2013, pp. 152–166
6. A. Zuiderwijk, M. Janssen, I. Susha, Improving the speed and ease of open data use through
metadata, interaction mechanisms, and quality indicators. J. Organ Comput Electr Commer
26(1–2), 116–146 (2016)
7. J.H.M. Matamoros, L.A.R. Rojas, G.M.T. Bermúdez, Proposal to measure the quality of open
data sets. Knowl. Manage. Organ. 701–709 (2018)
8. H. Li, J. Sun, J. Wu, Predicting business failure using classification and regression tree: an
empirical comparison with popular classical statistical methods and top classification mining
methods. Expert Syst. Appl. 37(8), 5895–5904 (2010)
9. M. Ließ, B. Glaser, B. Huwe, Uncertainty in the spatial prediction of soil texture: comparison
of regression tree and random forest models. Geoderma 170, 70–79 (2012)
10. C. Zheng, V. Malbasa, M. Kezunovic, Regression tree for stability margin prediction using
synchrophasor measurements. IEEE Trans. Power Syst. 28(2), 1978–1987 (2013)
11. R. Kovač, D. Oreški, Educational data driven decision making: early identification of students
at risk by means of machine learning. p. 7 (2018)
12. B. Marr, Big data: 33 brilliant and free data sources anyone can use, in Forbes. [Online].
Available: [Link]. Accessed 29 Aug 2018
13. M. Lnenicka, An in-depth analysis of open data portals as an emerging public e-service 9(2),
11 (2015)
14. Open Data Barometer. [Online]. Available [Link]
indicator=ODB. Accessed 29 Aug 2018
15. Usage by dataset—[Link]. [Online]. Available [Link]
Accessed 29 Aug 2018
16. G. Shmueli, P.C. Bruce, I. Yahav, N.R. Patel, K.C. Lichtendahl, Data Mining for Business
Analytics: Concepts, Techniques, and Applications in R, 1st edn. (Wiley, 2017)
17. A.B. Shaik, S. Srinivasan, A brief survey on random forest ensembles in classification model, in
International Conference on Innovative Computing and Communications, 2019, pp. 253–260
64 B. Šlibar

18. N.M. Lutimath, D.R. Arun Kumar, C. Chetan, Regression analysis for liver disease using r: a
case study, in International Conference on Innovative Computing and Communications, 2019,
pp. 421–429
19. SAS, JMP 12 Specialized Models. (SAS Institute, Cary, NC, 2015)
Technology-Driven Smart Support
System for Tourist Destination
Management Organizations

Leo Mrsic, Gorazd Surla and Mislav Balkovic

Abstract Hospitality and tourism are major economic drivers and among the largest sectors in the world. However, that growth does not come without problems, and overcrowding in tourist destinations is becoming a big one, affecting all stakeholders: government, residents and tourists. The overtourism problem is not going to be solved overnight, but it cannot be solved without a system that measures, examines and predicts tourism at the destination. Even though there is growing demand for such systems, no globally adopted concept yet exists. Advancements in information and communications technology (ICT), especially Big Data and the Internet of Things, are enabling innovations in decision-making management that could also be introduced in tourism and used as a management tool. The scope of this research was to examine options for using technological advancements and available data to build our version of a data-driven Destination Management System (DMS), which we called eDestination, as part of a Destination Management Organization (DMO) strategy.

Keywords Overtourism · Destination management system · Big data · IoT · Smart tourism · Data visualization · Decision support

1 Introduction

In recent decades, the hospitality and tourism industry has become one of the most important economic sectors globally. Each year more and more people travel to almost every corner of the world, despite worldwide economic crises, searching

L. Mrsic (B) · M. Balkovic
Algebra University College, Ilica 242, Zagreb, Croatia
e-mail: [Link]@[Link]
M. Balkovic
e-mail: [Link]@[Link]
G. Surla
Valamar Riviera Inc, Dubrovnik, Croatia
e-mail: [Link]@[Link]

for unique and new places and experiences. Some authors describe emerging tourism as the most obvious form of globalization [1]. Looking back over more than 50 years, tourism has experienced continued diversification and expansion and has become one of the fastest-growing and largest industry sectors globally [2]. In addition to the traditional favorite destinations of North America and Europe, many new destinations have emerged, and a large number of destinations worldwide report significantly growing interest in tourism investment. This has turned tourism into a key driver of economic progress through jobs, enterprise activity and infrastructure development, but also through increased export revenues.

1.1 Overtourism

Overtourism is quite a new term that first appeared on Twitter as #overtourism in August 2012. The first definition of overtourism was given in an article about the impact of tourism in Iceland [3]. Overtourism was framed negatively, from the perspective of potential hazards to popular destinations worldwide. Due to industry dynamics and intensity, the forces that power growth often inflict unavoidable negative consequences if not managed well. This can undermine tourism as a sustainable framework, especially since the overall impact on local residents has to be managed and measured very carefully. The phenomenon is also known as "overcrowding" or "tourismphobia". UNWTO has defined overtourism as the impact of tourism on a destination that excessively influences the perceived quality of life of citizens and/or the quality of visitors' experiences in a negative way (UNWTO 2018). Overtourism is a complex problem for tourist destinations. The problem varies from destination to destination, and there is no easy and fast solution to it. The solution requires long-term planning with all stakeholders involved, with good management built on a comprehensive fact base.

1.2 Digital Technologies in Hospitality and Tourism

Digital technologies, applications, and tools have allowed tourism companies to be more responsive and to improve their competitiveness and performance by (hyper-)automating, transforming, and digitizing their business models, processes, and functions such as procurement and supply chain, customer service and management, human resources, or marketing [4]. Tourism is changing with the advance of information technology (IT). One of its first uses in tourism came in the early mainframe era of the 1950s, when flight booking moved online from the traditional manual booking system. Today, however, advancements in areas like social media, mobile technologies, Big Data, and robotics/IoT are being used to transform the tourism industry. Modern technologies also drive the development of numerous new tools that can be used in the tourism industry. Today,
tourism destinations face a set of new challenges, from the influence of shared-economy business models to new technologies affecting both consumers and the environment [5]. Technology plays a critical role in the competitiveness of tourism destinations and organizations as well as the entire industry [6]. In its policy recommendations for the long-term sustainability of urban tourism, UNWTO suggests investment in innovation, technology, and partnership to promote smart cities, allowing technology to address not only innovation but also accessibility and sustainability (UNWTO 2018).

2 Destination Management Systems and Tools

Even though there is a great need for digital technology in destination management, no concept has yet emerged that can be described as universally adopted [7]. A DMS needs to be the main tool through which a tourist destination reaches sustainability. No real system shared by multiple DMOs has been implemented; at the moment, individual DMOs mostly work through their own websites. Although a DMS has often been conceived as an advanced DMO web platform since its inception somewhere in the mid-1990s, the evidence clearly shows that not many destinations were able to develop and implement such systems successfully [8]. Increasingly, DMOs use digital technology to facilitate the tourist experience before, during and after the destination visit (all parts of the traveling process), as well as to coordinate all stakeholders involved in the experience and service delivery of tourism [9]. Nowadays DMOs attempt to provide information and accept reservations for different local enterprises and coordinate their facilities, but also utilize digital technology to promote their policies, harmonize their operational processes, increase tourist expenditure, raise the overall experience level and boost the multiplier effects in the local economy [10]. To do that, a DMS needs to administer a wide range of requests and provide efficient and appropriate information on an increasing supply of tourism products. National and regional governments are employing DMSs to facilitate DMO management, as well as to support the local ecosystem at the destination level (UNWTO 2008) [11]. In recent studies like the "Åre Case" [12], the authors researched a Swedish mountain tourism destination. They explored different customer-centric knowledge sources, such as tourists' searches (Web navigation) and booking and feedback behavior (review platforms, surveys), with the goal of creating a business-intelligence-based destination management information system built on data collected from pre-trip and post-trip experiences and facts. They set up a knowledge destination framework architecture organized around a knowledge creation layer and a knowledge application layer. The knowledge generation layer includes various sources of customer-based data (e.g., bookings, weblogs, and customer feedback), the technical components for data extraction, transformation and loading (ETL processes), a centralized data analytics platform and an analytics modeling part. Overall, the system provides decentralized presentation and advanced visualization of data models, with data resting on a knowledge-based transactional layer [13], generally called
a DMS. Besides sophisticated technology, the effective use of a DMS demands the development and implementation of learning processes as part of organizational behavior. In the Åre Case, the authors found it crucial to integrate both public and private stakeholders and to define their user-specific, location-specific and individual knowledge requirements. Following the input from stakeholders and the results of a previously conducted literature review, they defined the set of indicators shown in Fig. 1. Economic indicators consisted of measures such as confirmed visitors, pricing policies, the overnight register, overall sales, and total occupancy. For customer behavior indicators they used measures from website logs (e.g., search terms or page visits), booking and expenditure behavior (e.g., booking channel, length of stay, conversion rates, guest lifecycle management, cancelations), and customer profiles (e.g., country of origin, gender, age, travel behavior, customer value score, preferred type of transportation and accommodation, purpose of visit). Finally, customer experience and perception indicators measured destination brand awareness, such as brand visibility, information sources, interest in the destination, destination value areas (e.g., skiing and non-skiing winter activities, summer activities and attractions, atmosphere, social interaction, services, and features), value-for-money score, and customer loyalty and satisfaction [12]. Moreover, they assigned each indicator to a business process and placed it in the phase of the trip in which it occurs, defining which processes happen during the pre-trip, on-site, and post-trip phases. For their research, they focused on the pre-trip phase, using booking as an economic performance indicator and web navigation as a customer behavior indicator; for the post-trip phase, they used feedback as a customer perceptions and experience indicator [14].

Fig. 1 Åre case indicators concept

3 Technology-Driven Smart Support System for DMO

From the previous chapters, we can see that tourism has its problems, and one of them is the lack of a management system. There are also various available data that are currently not used for decision making in tourist destinations. Technology advancements, especially Big Data and IoT, enable innovations in decision-making management that could also be introduced in tourism. Following the research of the Swedish group in their Åre Case [12], described in the previous chapter, we build our version of a data-driven DMS that we call eDestination. Their research on a Swedish mountain tourism destination focused on activities before and after the trip, using customer-based knowledge sources such as tourists' searches (weblogs) and booking and feedback behavior (surveys, reviews). However, for a complete destination management system, this leaves a big gap in the during-trip phase. The goal of this research is to examine the possibilities of collecting data from the during-trip phase and to build and test a few scenarios that could be used for a data-driven management system for a tourist destination. Using the same approach, we create an architecture for our DMS, but we focus on indicators arising during the "On-Site" phase [15].

3.1 eDestination

eDestination was developed for the Croatian seaside tourist destination Šibenik. Our objective is to create different scenarios to test what data would be available in a real tourist destination. The main goal is to find what open data can be used and what additional data is available to the DMO, the Šibenik Tourist Board (TZ Šibenik). Through different scenarios, the idea is to see how different data can be correlated to understand tourist behavior during their stay at the destination, exploiting a few open and available technologies such as Big Data, IoT and advanced data visualization. With this research and the eDestination
pilot, we want to lay a cornerstone for future destination management systems that can be built in any Croatian tourist destination and later transferred to any other destination worldwide.

4 Šibenik Smart Destination Case

eDestination Šibenik is composed of various data available to TZ Šibenik. We used the case described in Sect. 2 as a starting point for our research, but with two major differences: first, the system is built for Šibenik, and second, it uses only destination performance indicators arising during the "On-Site" phase. eDestination Šibenik uses open data available to everyone (social networks), data available only to DMOs in Croatia (tourist statistics) and new data created for Šibenik (IoT sensors). Data from these different sources are collected in a cloud-based SQL database. Figure 2 presents the high-level architecture of eDestination: data collection is set up in the "Backend", while the "Client" side shows what the DMO would have as part of destination management. Mapping the data sources used in this research onto the Swedish case described previously, we can define the destination performance indicators as follows: (1) Customer Behavior Indicator: movement data from IoT sensors; (2) Economic Performance Indicator: overnights data from eVisitor; (3) Customer Perception & Experience Indicator: eDestination questionnaire data. Through the different scenarios in this research, the idea is to show how different data can be correlated to gain the needed insight into what is happening in Šibenik. Collected data in some scenarios will be analyzed and visualized using Tableau Software, which allows us to create interactive data visualizations focused on business intelligence [16, 17].

Fig. 2 eDestination high-level architecture

4.1 Scenario 1: Visitor Satisfaction Survey eDestination

The focus of this scenario is to compare during-trip and post-trip tourist satisfaction. Two POIs in Šibenik were selected for this part of the research: St. Michael's Fortress and the Šibenik City Museum. To determine post-trip tourist satisfaction, we used data from TripAdvisor ("TripAdvisor Šibenik," n.d.). Using a simple web crawler for data mining, we gathered each POI's average score, total number of reviews and the distribution of scores on a scale from 1 to 5 (terrible, poor, average, very good, excellent). Data was collected on 15 September 2018, and the score is based on all reviews made on TripAdvisor for that POI. St. Michael's Fortress is ranked as the 6th attraction (things to do) in Šibenik; it has a total of 489 reviews with an average score of 4.0, and the first review dates from 30 July 2012. The other selected POI, the Šibenik City Museum, has only 22 reviews, is ranked as the 24th attraction in Šibenik, has the same average score of 4.0, and received its first review on 15 September 2015. To gather the same kind of data during the visitors' stay at the POI, we created a web questionnaire application. The application is designed for tablets and poses one question, answered by touch-screen selection. We set the question "How do you like our attraction?" and offered five emoticons representing excellent, very good, average, poor and terrible scores. A tablet with the questionnaire was placed in both selected POIs, in a visible spot next to the exit, connected to Wi-Fi and sending answers to the eDestination database in real time. The research covered the period from 25 July to 25 August 2018. The results show that far more visitors are willing to give a review or score during their stay than after their trip. In just 30 days we gathered more scores than the POIs had collected in years on TripAdvisor: a total of 511 scores on TripAdvisor since the POIs were listed, versus 1914 scores in one month of our research. Interestingly, a terrible score rarely appears on TripAdvisor (2% for St. Michael's Fortress and 0% for the Šibenik City Museum), but visitors do not mind giving a terrible score during their visit to the POI if they dislike their experience (8% for the Fortress and 20% for the Museum). On the other hand, TripAdvisor scores are more detailed, as visitors also leave a written review. However, the data collected through the eDestination questionnaire is real-time data, and there are additional possibilities for working with it.
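As a rough sketch of how such a tablet questionnaire could push scores into the cloud-based SQL database in real time (our assumption of a possible implementation, not the actual eDestination code; the table name, the endpoint, and the SQLite stand-in for the cloud database are all hypothetical):

```python
import sqlite3
from datetime import datetime, timezone

from flask import Flask, jsonify, request

app = Flask(__name__)
DB = "edestination.db"  # local stand-in for the cloud-based SQL database

def init_db():
    with sqlite3.connect(DB) as con:
        con.execute(
            """CREATE TABLE IF NOT EXISTS poi_scores (
                   poi TEXT NOT NULL,       -- e.g. a fortress or museum name
                   score INTEGER NOT NULL,  -- 1 = terrible ... 5 = excellent
                   ts TEXT NOT NULL         -- UTC timestamp of the answer
               )"""
        )

@app.route("/score", methods=["POST"])
def record_score():
    # The tablet posts JSON such as {"poi": "...", "score": 4}.
    payload = request.get_json(force=True)
    score = int(payload["score"])
    if not 1 <= score <= 5:
        return jsonify(error="score must be 1-5"), 400
    with sqlite3.connect(DB) as con:
        con.execute(
            "INSERT INTO poi_scores (poi, score, ts) VALUES (?, ?, ?)",
            (payload["poi"], score, datetime.now(timezone.utc).isoformat()),
        )
    return jsonify(status="ok"), 201

if __name__ == "__main__":
    init_db()
    app.run(host="0.0.0.0", port=8080)
```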
4.2 Scenario 2: Movement Behavior

Using people-counting sensors is nothing new. In retail, for example, such sensors are placed in stores to measure how many people come in, how much time they spend there, what the peak hours are, and similar. What we want to explore, however, is tourist movement behavior. Our scenario is a simple solution tested in two POIs, but including more, or all, POIs in a destination would create a grid able to detect movement behavior in and between POIs. According to the study by Lew and McKercher [18], destination and tourist characteristics pattern the paths that tourists follow. The authors used a circular area around the point of accommodation to represent the relative distance of activity/movement, ranging from extremely restricted to completely unrestricted movement. Unrestricted movement behavior is what every destination is looking for. Could ICT influence this kind of tourist behavior? If tourists had information about different POIs, their locations, working hours, peak hours and similar in one place, they would change their movement behavior and plan their trips freely around the destination. To start, we installed an HPC005 sensor at the entrances of both selected POIs, the Šibenik City Museum and St. Michael's Fortress. The sensor records the movement of people: for each entrance or exit, it sends a timestamp and a +1 (entrance) or −1 (exit) value. The collected data was imported into Tableau Software, analyzed, and turned into the dashboard shown in Fig. 3. Data was collected for the period from 1 July to 30 August 2018, the peak of the tourist season in Šibenik.

Fig. 3 City of Šibenik, Croatia, data visualization


The dashboard is interactive and contains a map with both POI locations and graphs with total visitor numbers per month, date and hour for each POI separately. The results show that the two POIs have different patterns and numbers, which is logical in this case, as the POIs have different carrying-capacity levels. However, if a full grid of all Šibenik POIs were created and made available to tourists, they would be able to plan their movements in the destination accordingly.
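A minimal sketch of how the sensor's timestamped +1/−1 events could be aggregated into the hourly series and running occupancy behind such a dashboard (our illustration; the event format follows the description above, and the sample values are invented):

```python
from collections import defaultdict
from datetime import datetime
from itertools import accumulate

# Assumed event format from the people-counting sensor:
# (timestamp, +1) for an entrance, (timestamp, -1) for an exit.
events = [
    (datetime(2018, 7, 1, 9, 15), +1),
    (datetime(2018, 7, 1, 9, 40), +1),
    (datetime(2018, 7, 1, 10, 5), -1),
    (datetime(2018, 7, 1, 10, 30), +1),
]

# Entrances per hour: the kind of series shown in the per-hour graphs.
entrances_per_hour = defaultdict(int)
for ts, delta in events:
    if delta > 0:
        entrances_per_hour[ts.replace(minute=0, second=0)] += 1

# Running occupancy: cumulative sum of the +1/-1 values in time order.
ordered = sorted(events)
occupancy = list(accumulate(delta for _, delta in ordered))
for (ts, _), occ in zip(ordered, occupancy):
    print(ts, "occupancy:", occ)
```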

4.3 Scenario 3: Advanced Visualization of Overnights

With this scenario, we want to get a wide picture of what is happening in the destination: where tourists stay overnight, what the guest structure is, and what their habits are regarding accommodation type, category, and length of stay. In Croatia, the overnight registration system is called eVisitor. It contains data about all tourists who arrive in Croatia and stay overnight with one of the legal accommodation providers. Registration of tourist details is state-regulated and mandatory for all accommodation providers. Each accommodation provider is registered in the system with its address, number of units, number of beds, type of accommodation and official category. Each guest staying overnight in Croatia must be registered with first and last name, sex, country of origin, date of birth, ID or passport number, and check-in and check-out dates. This information is then shared with different state stakeholders, among them the DMOs. For this research, TZ Šibenik gave us a partial eVisitor data set with all accommodation units and tourists from five countries of origin (Croatia, Slovenia, Bosnia and Herzegovina, Germany and Austria) for the period from 1 July to 2 September 2018. The dataset extracted from eVisitor was uploaded to Tableau Software, where we made a detailed analysis. The result of that analysis is the interactive dashboard shown in Fig. 4, built from several portlets. The first window holds a Google map layer on which we geolocated all used accommodation properties in Šibenik. The map allows the user to zoom in and out; each accommodation property is marked with a dot, and each dot carries details about the owner and maximum capacity. There is an option to look at all properties, select a single property or group properties within a neighborhood. Next to the map are details about specific accommodation categories and types (hotel, campsite, non-commercial and other private accommodation), with additional details on length of stay and total overnights for each category and type. The interactive dashboard lets the user explore all categories and types, or just one or more of them. Filters on the left side give multiple options for configuring the dashboard. The bottom part of the dashboard holds different graphs for additional filtering: arrivals and departures per week, and length of stay. There are also tourist demographic details: one graph for sex, another for age groups and the last one for country of origin. Overall, any combination can be inspected with a simple click. After we showed this result to TZ Šibenik, their comment was that it looked like science fiction. With the dashboard, they now see details that they did not think they could get, and all of it was made with data they already have.

Fig. 4 eVisitor dataset, Šibenik, Croatia
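As an illustration of the kind of aggregation behind such a dashboard, the following sketch (ours; the column names are hypothetical, not the real eVisitor schema) derives length of stay and total overnights per accommodation type from a handful of synthetic guest records:

```python
import pandas as pd

# Hypothetical eVisitor-style extract; real field names may differ.
guests = pd.DataFrame({
    "accommodation_type": ["hotel", "campsite", "private", "hotel"],
    "country": ["Germany", "Austria", "Slovenia", "Croatia"],
    "check_in": pd.to_datetime(["2018-07-02", "2018-07-05", "2018-07-10", "2018-08-01"]),
    "check_out": pd.to_datetime(["2018-07-06", "2018-07-08", "2018-07-17", "2018-08-03"]),
})

# Length of stay in nights for each guest record.
guests["nights"] = (guests["check_out"] - guests["check_in"]).dt.days

# Totals per accommodation type: overnights and average length of stay,
# i.e. the aggregates shown next to the map in the dashboard.
summary = guests.groupby("accommodation_type")["nights"].agg(
    total_overnights="sum", avg_length_of_stay="mean"
)
print(summary)
```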

5 Future Research

At the end of this research, we decided to create a setup for future work. Destinations will need time to start using the scenarios created in the previous chapters in the right way, and time will be needed to gather data from these scenarios before additional analyses and prediction models can be built using advanced Big Data analytics and, later, artificial intelligence. For that reason, this scenario cannot be realized at this early stage of the research. What would it look like? It would be a diverse mix of all the data sources used in the previous scenarios, put in correlation. Created for the Šibenik City Museum, it would contain questionnaire data from Scenario 1, sensor data from Scenario 2, and eVisitor data from Scenario 3. With these data, we would get information about visitor satisfaction and the number of visitors to the Museum, and we would add the total number of tourists staying overnight in Šibenik from eVisitor. We also want to add one more external factor: weather. Weather data could be collected through a web service from one of the web weather pages such as AccuWeather. Such a diverse mix would aim to examine customers' behavior and movement under different weather conditions and different occupancy levels of the destination. Future development of ICT will also bring new technologies and new datasets into play (e.g., wearables or self-driving cars), so
further research in this field should keep an open mind about the possibilities new technologies will bring. The described scenarios, together with additional scenarios that could be created, set up at multiple locations in a destination and combined in a smart grid, would create the destination's "Hawkeye." This kind of DMS would have multiple parts used by different stakeholders. From eDestination Šibenik, we can see that ICT technology and tools can easily be used in tourism. Destinations already have a large amount of unused data; in addition, there are new datasets available to be gathered, and with the use of technology, getting new data is no longer an issue. If we compare eDestination Šibenik with the Big Data case, we can see that using only pre-trip and post-trip data does not give enough information for a complete DMS. The case showed it can help in understanding tourist behavior from the expectation side, but such a system does not reveal how tourists use the destination or what happens in the destination on a daily basis. Moreover, as shown in Šibenik, this can be done with simple IoT technology. ICT innovations are making life much easier in other industries, so there is no reason this cannot happen in the tourism industry. Even though the tourism sector is known for not adopting innovations easily, this will change. As the world enters the fourth industrial revolution, tourists are changing their behavior, and with that change, the tourism industry will need to change as well. It is hard to expect that one globally unified DMS will be created, but this is not necessary. Every destination is unique, every destination has unique problems, so there is no issue with every destination having its own DMS. However, each destination should have one.

Acknowledgements The ethical approval committee for the study provided in this chapter included official representatives from the City of Šibenik Tourist Board, the Šibenik City Government, and University College Algebra. Appropriate permissions from the responsible authorities (including the City of Šibenik Tourist Board and the Šibenik City Government) were issued prior to the installation of the HPC005 sensor at the entrances of both selected POIs, the Šibenik City Museum and St. Michael's Fortress, used in the study. Data retrieved from the eVisitor national register was anonymized and used upon agreement and approval between the Šibenik Tourist Board, the Croatian National Tourist Board, and University College Algebra. All data and data samples were anonymized at the source (device and/or relevant register), and primary and secondary anonymization was additionally conducted before and after analysis to assure full anonymization of the data. No individual data was collected, stored or processed at any time as part of this study.

References

1. M. Mowforth, I. Munt, Tourism and Sustainability: Development, Globalisation and New Tourism in the Third World, 4th edn. [Link]
2. UNWTO Tourism Highlights, 2017 Edition, in UNWTO Tourism Highlights. [Link]
3. R. Ali, Foreword: The Coming Perils of Overtourism (2016). Retrieved 15 Aug 2018 from [Link]
4. M. Sigala, D. Marinidis, Web map services in tourism: a framework exploring the organisational transformations and implications on business operations and models. Int. J. Bus. Inf. Syst. 9(4), 415 (2012). [Link]
5. D. Buhalis, A. Amaranggana, Smart tourism destinations, in Information and Communication Technologies in Tourism 2014 (Springer International Publishing, Cham, 2013), pp. 553–564. [Link]
6. WTO, E-Business for Tourism—Practical Guidelines for Destinations and Businesses (2001)
7. R. Egger, D. Buhalis (eds.), eTourism Case Studies, 1st edn. (Butterworth-Heinemann, 2008)
8. P. Alford, S. Clarke, Information technology and tourism: a theoretical critique. Technovation (2009). [Link]
9. D. Buhalis, Information technology as a strategic tool for economic, social, cultural and environmental benefits enhancement of tourism at destination regions. Prog. Tourism Hosp. Res. 3(1), 71–93 (1997). [Link]
10. D. Buhalis, A. Spada, Destination management systems: criteria for success—an exploratory research, in Information and Communication Technologies in Tourism 2000 (Springer, Vienna, 2000), pp. 473–484. [Link]
11. D. Buhalis, D. Leung, R. Law, eTourism: critical information and communication technologies for tourism destinations, in Destination Marketing and Management: Theories and Applications (CABI, Wallingford), pp. 205–224. [Link]
12. M. Fuchs, W. Höpken, M. Lexhagen, Big data analytics for knowledge generation in tourism destinations—a case from Sweden. J. Destination Mark. Manage. 3(4), 198–209 (2014). [Link]
13. W. Höpken, M. Fuchs, D. Keil, M. Lexhagen, The knowledge destination—a customer information-based destination management information system, in Information and Communication Technologies in Tourism 2011 (Springer, Vienna, 2011), pp. 417–429. [Link]
14. 'Overtourism'?—Understanding and Managing Urban Tourism Growth beyond Perceptions (World Tourism Organization (UNWTO), 2018). [Link]
15. UNWTO, Handbook on E-marketing for Tourism Destinations (UNWTO, 2008). [Link]
16. V.D. Ambeth Kumar et al., IoT-based smart museum using wearable device, in International Conference on Innovative Computing and Communications, Proceedings of ICICC 2018, vol. 1 (2018)
17. M.N. Shafique et al., The role of big data predictive analytics acceptance and radio frequency identification acceptance in supply chain performance, in International Conference on Innovative Computing and Communications, Proceedings of ICICC 2018, vol. 1 (2018)
18. A. Lew, B. McKercher, Modeling tourist movements. Ann. Tourism Res. 33(2), 403–423 (2006). [Link]
Ortho-Expert: A Fuzzy Rule-Based
Medical Expert System for Diagnosing
Inflammatory Diseases of the Knee

Anshu Vashisth, Gagandeep Kaur and Aditya Bakshi

Abstract The proposed work addresses the diagnosis of inflammatory diseases of the knee joint. The main diseases discussed in this research under the inflammatory category are osteoarthritis, rheumatoid arthritis and osteonecrosis of the knee joint. The system is implemented in MATLAB using the fuzzy logic method with a Mamdani inference engine. All required input parameters were established in consultation with an orthopaedic expert during the knowledge acquisition phase. The survey method was used for data collection, and various defuzzification methods were applied to check the accuracy of the proposed system.

Keywords Fuzzy logic · Inference engine · Osteoarthritis · Osteonecrosis · Rheumatoid · Defuzzification

1 Introduction

The proposed system is designed to diagnose orthopaedic diseases of the knee joint in the inflammatory category. The word orthopaedic comes from two Greek words, ortho + paedion, where ortho means straight and paedion means child. Originally, it was the art of straightening the deformities of children; the term was coined by a French physician named Nicolas Andry in 1741, who is known as the father of orthopaedics [1]. The system is designed to diagnose three diseases, osteoarthritis (OA), rheumatoid arthritis (RA) and osteonecrosis (ON), using rule-based fuzzy set theory.

A. Vashisth · G. Kaur (B) · A. Bakshi
School of Computer Science Engineering, Lovely Professional University, Phagwara, Punjab, India
e-mail: gagandeep.23625@[Link]
A. Vashisth
e-mail: anshu.23500@[Link]
A. Bakshi
e-mail: aditya.17433@[Link]


1.1 Orthopaedic Diseases

With the advancement of science, the discovery of Roentgen rays (X-rays) and the discovery of bacteria by Louis Pasteur, a new era in diseases of the skeletal system emerged. From this, various bone and joint diseases came to be grouped under orthopaedics as orthopaedic diseases [1].
The human skeleton is made up of 206 bones, and each bone of the skeletal system can be affected by inflammatory, infective and neoplastic disorders. Moreover, the human body comprises various joints, such as synovial fibrous joints, ball-and-socket joints and hinge joints, whose diseases are covered by the specialized bone and joint branch called orthopaedics [2]. Orthopaedics covers the spine, elbow, knee, ankle, wrist and shoulder. The knee joint is the main weight-bearing joint, so we focus special attention on this large hinge joint of the body.

1.2 Knee and Knee Category

The knee is a hinge joint that plays a vital role in the human skeletal system, as the whole body weight rests on it. The knee comprises three bones: the lower end of the thigh bone, called the femur; the upper part of the leg bone, called the tibia; and the kneecap bone, known as the patella, as shown in Fig. 1. The articulation between the lower femur and upper tibia is divided by the ACL (anterior cruciate ligament), the PCL (posterior cruciate ligament) and two cushion-like menisci, the medial and lateral menisci. The inner part is called the medial compartment, identified from the medial joint line, and the outer part is called the lateral compartment, assessed by the lateral joint line [2]. The articulation between the patella and the lower part of the femur is known as the patellofemoral joint. The cruciate ligaments and menisci make the knee joint stable and act as its cushions; through this cushioning, the menisci prevent degeneration of the articular surfaces of the femur and tibia during weight-bearing activities.

Fig. 1 Parts of knee [15]

Table 1 Knee category

Category name | Disease 1 | Disease 2 | Disease 3
Inflammatory | Osteoarthritis | Rheumatoid arthritis | Osteonecrosis
Infective | Acute septic | Chronic septic | Tuberculosis
The main areas of interest in the knee, as described in Table 1, are inflammatory pathology, that is, osteoarthritis, rheumatoid arthritis and osteonecrosis of the knee, and infective pathology such as septic arthritis, which can be acute septic arthritis, chronic septic arthritis or tuberculosis of the knee. The symptoms within the inflammatory group resemble each other, but a diagnosis can be made from slight variations in symptoms; moreover, investigations such as X-rays, MRI and blood tests can perfect the diagnosis. Similarly, infective pathologies differ in symptoms, and various investigations are also needed to diagnose the disease.
(1) Inflammatory Diseases: inflammatory disease means inflammation of the bony and soft tissue structures of the knee joint that causes inflammatory arthritis. Various diseases fall under this heading, such as osteoarthritis, rheumatoid arthritis and osteonecrosis of the knee [3].
(a) Osteoarthritis: osteoarthritis is a common joint disorder [4]. It is the progressive softening and disintegration of the articular cartilage, accompanied by new growth of cartilage and bone at the joint margin. The parameters used to assess osteoarthritis are pain, age, swelling, deformity and restriction.
(b) Rheumatoid arthritis: rheumatoid arthritis is an autoimmune disease and the commonest cause of chronic inflammatory arthritis. Its most characteristic features are elevated ESR, symmetrical polyarthritis and morning stiffness. The pathology of RA in the knee proceeds in stages: stage 1 is synovitis and swelling in the joint, stage 2 is early joint destruction in the particular region, and stage 3 is advanced joint destruction and deformity. The disease occurs in all age groups (children, young and old) and can be seropositive or seronegative rheumatoid arthritis.
(c) Osteonecrosis: avascular necrosis of the medial condyle of the femur is very common in the knee joint. It is often associated with alcoholism and drug addiction, and it is three times more common in females above the age of 60 years.

2 Literature Review

A fuzzy-based expert system has been implemented successfully in numerous areas of the medical field, diagnosing diseases such as asthma, dengue, ENT diseases, cardiovascular diseases, cancer, diabetes, tumours and infectious diseases, as well as determining risk and drug doses. Several expert systems are available. MYCIN, the first expert system, was developed by Dr. Edward Shortliffe, Feigenbaum and Buchanan in the 1970s and is used to diagnose infectious blood diseases. It is implemented in the LISP language with 450 rules for diagnosing the diseases; it calculates dosages based on the patient's weight and handles interactions between various drugs [5].
Another expert system, Dendral, uses a computer program named Heuristic Dendral on data generated by a mass spectrometer. It is implemented in LISP and is used to identify the molecular structure of organic molecules by analysing their mass spectra together with knowledge of chemistry [5]. Many fuzzy expert systems have been developed and implemented, but less attention has been given to the orthopaedic field. The proposed fuzzy-based expert system helps in the diagnosis of inflammatory knee disorders.
In 2002, using fuzzy relational theory, the authors designed a hierarchical fuzzy inference system to diagnose arthritis in various joints. It was a two-level process in which the first-level diagnosis reduced the scope of the diagnosis at the second level [6]. In 2010, a fuzzy inference system was designed to diagnose arthritis and its severity level; ten parameters were used in the fuzzy logic controller to determine the severity of the disease [3, 7].

3 Proposed Methodology

The proposed expert system is designed using MATLAB software. Fuzzy logic is being applied to an increasing variety of problems, such as washing machines, portfolio selection, medical diagnosis and industrial process control. The fuzzy logic concept starts from fuzzy set theory, in which sets do not have clear boundaries. A fuzzy set is not a crisp set; a crisp set takes values only in yes/no, true/false or good/bad form. Fuzzy sets are used when there is confusion about whether to accept or reject a value, that is, when the value lies on the boundary. For such input parameters, fuzzy sets are used. All input parameters, which take crisp inputs, are fuzzified using the standard min-max Mamdani inference system, and the centroid method is used for defuzzification. The whole process of the proposed system is outlined in Fig. 2.

Fig. 2 Methodology for proposed system

3.1 Knowledge Engineering

The first step in designing the fuzzy system is to collect data and identify all input and output parameters, which is done in consultation with an expert orthopaedic doctor. First, data regarding the category is collected and then refined. The data was collected from various sources such as research papers, books and expert doctors. Under orthopaedic diseases of the knee joint, the major categories are inflammatory, infective and neoplastic diseases. The data shown in Fig. 3 was collected for the inflammatory category, which covers the OA, RA and ON diseases.
From the collected data, refinement is performed: the symptoms are categorized into primary and secondary symptoms. The disease names and corresponding symptoms are described in Table 2.

Fig. 3 Knee categories and diseases



Table 2 Inflammatory diseases and refinement of symptoms

Inflammatory | Symptom 1 | Symptom 2 | Symptom 3 | Symptom 4 | Symptom 5
Osteoarthritis | Pain | Age | Swelling | Deformity | Restriction
Rheumatoid arthritis | Pain | ESR | Age | Bilateral swelling | Bilateral tenderness
Osteonecrosis | Pain | Swelling | Medial joint tenderness | Restriction |

3.2 Fuzzification of Inputs and Outputs

The next step is fuzzification, the mapping of crisp input values to membership functions. It converts real numbers to fuzzy sets expressed through linguistic variables [8]; triangular and trapezoidal membership functions are used for the input and output parameters [9]. A range is set for every input and output parameter. The fuzzified input symptoms for the OA, RA and ON diseases are given in Tables 3, 4 and 5, respectively. Figures 4 and 5 show the input and output membership functions for the symptoms.
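For concreteness, a minimal sketch of fuzzifying one crisp input against triangular membership functions (our illustration; the peak points are assumed, since the paper reports only the interval endpoints from Table 3):

```python
def tri(x, a, b, c):
    """Triangular membership: 0 at a, peak of 1 at b, back to 0 at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Fuzzify a crisp pain score of 3.5 against the ranges in Table 3
# (peak points are our assumption; only the interval endpoints are given).
pain = 3.5
memberships = {
    "low": tri(pain, 0, 2, 4),      # Low (0-4)
    "medium": tri(pain, 3, 5, 7),   # Medium (3-7)
    "high": tri(pain, 6, 8, 10),    # High (6-10)
}
print(memberships)  # {'low': 0.25, 'medium': 0.25, 'high': 0.0}
```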

Table 3 Fuzzified input symptoms and output for OA disease

Pain | Age | Swelling | Deformity | Restriction || Output: OA
Low (0–4) | Low (0–30) | Low (0–4) | Low (0–4) | Low (0–4) || Early OA (0–4)
Medium (3–7) | Medium (30–45) | Medium (3–7) | Medium (3–7) | Medium (3–7) || Moderate OA (3–7)
High (6–10) | High (40–90) | High (6–10) | High (6–10) | High (6–7) || Severe OA (6–10)

Table 4 Fuzzified input symptoms and output for RA disease

Pain | Age | ESR | Bilateral swelling | Bilateral tenderness || Output: RA
Low (0–4) | Juvenile (0–17) | Low (0–29) | Low (0–4) | Low (0–4) || Sero −ve (0–4)
Medium (3–7) | Adult (16–45) | Medium (29–60) | Medium (3–7) | Medium (3–7) || Sero +ve (3–7)
High (6–10) | High (44–90) | High (60–100) | High (6–10) | High (6–7) ||
Table 5 Fuzzified input symptoms and output for ON disease

Pain | Swelling | Medial joint line tenderness | Restriction || Output: ON
Low (0–4) | Low (0–30) | Low (0–4) | Low (0–4) || Stage 1 (0–4)
Medium (3–7) | Medium (30–45) | Medium (3–7) | Medium (3–7) || Stage 2 (3–10)
High (6–10) | High (40–90) | High (6–10) | High (6–10) ||

Fig. 4 Membership function for input parameter pain

Fig. 5 Membership function for output parameter OA

3.3 Fuzzy Rule Base Model

All the rules are formed in the fuzzy system and written in IF-THEN rule format [10]: IF (Condition1 … AND/OR Condition2) THEN (Action). With the help of the expert, 143 rules were made for the OA disease, 152 for RA and 68 for ON. Figure 6 shows the rules and the rule viewer; sample rules for the three diseases are listed below.

Fig. 6 Rule viewer

Pain | ESR | Age | Swelling | Bilateral tenderness | OA
Low | Medium | Low | Medium | Medium | Early
Low | Medium | Medium | Low | High | Moderate
High | High | High | Medium | Low | Severe
Medium | High | Low | Medium | High | Moderate
High | Medium | Medium | Low | Low | Early

Pain | ESR | Age | Swelling | Bilateral tenderness | RA
Low | Medium | Low | Medium | Medium | Early
Low | Medium | Medium | Low | High | Moderate
High | High | High | Medium | Low | Severe
Medium | High | Low | Medium | High | Moderate
High | Medium | Medium | Low | Low | Early

Pain | Swelling | Medial joint tenderness | Restriction | ON
Low | Medium | Low | Medium | Stage 1
Low | Medium | High | High | Stage 2
High | Low | Medium | Low | Stage 1
Medium | High | Medium | High | Stage 2
High | Medium | High | High | Stage 2

3.4 Fuzzy Inference

Fuzzy inference systems come in Mamdani and Sugeno types; the proposed system uses the Mamdani type. Fuzzy inference involves rule generation, implication and aggregation. For the AND operator in the rules the MIN function is used, while for the OR operator the MAX function is used. Figure 7 shows the five input parameters and one output parameter.

Fig. 7 Fuzzy inference



3.4.1 Fuzzy Logic Operators

Two fuzzy logic operators are used for rule formation, AND and OR, applied between the input values. The AND operator acts as fuzzy intersection and the OR operator as fuzzy union [11]. Applied to two fuzzy sets A and B, they aggregate the two membership functions as described below:

μ_A∩B(x) = min(μ_A(x), μ_B(x))  (1)

μ_A∪B(x) = max(μ_A(x), μ_B(x))  (2)

3.4.2 Implication

For the same output membership function, the inference engine may have more than one activated rule, but there should be a single output. If there is an AND operator between the input variables, the minimum input value is taken; similarly, if the OR operator is used, the maximum input value is selected as the output of the rule. This is known as the implication method.

3.4.3 Aggregation

After implication, the next step is to combine the outputs of each rule into a single fuzzy set. This is called the aggregation process.
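A toy sketch of how the min/max operators, implication and aggregation fit together for two hypothetical rules (our illustration with invented membership values, not the system's actual rule base):

```python
# Fuzzified input memberships (values are made up for the example).
inputs = {
    "pain": {"low": 0.25, "medium": 0.25},
    "swelling": {"medium": 0.6, "high": 0.2},
}

# Rule 1: IF pain is medium AND swelling is medium THEN OA is moderate
r1 = min(inputs["pain"]["medium"], inputs["swelling"]["medium"])  # AND -> min

# Rule 2: IF pain is low OR swelling is high THEN OA is early
r2 = max(inputs["pain"]["low"], inputs["swelling"]["high"])       # OR -> max

# Implication clips each rule's output set at its firing strength;
# aggregation then combines rules firing for the same output term with max.
aggregated = {"moderate": r1, "early": r2}
print(aggregated)  # {'moderate': 0.25, 'early': 0.25}
```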

3.5 Defuzzification

In defuzzification, the membership function value is converted back to a crisp value, which is the output [12]. Different methods exist, such as centre of area (COA), bisector of area (BOA), largest of maximum (LOM), smallest of maximum (SOM) and mean of maximum (MOM). Defuzzified values and ranks for different inputs for the OA, RA juvenile, RA adult and ON diseases are given in Tables 6, 7, 8 and 9, respectively.
1. Centre of area (COA): this method calculates the centroid of the area under the aggregated membership function.

Z_COA = ∫ μ_A(z) z dz / ∫ μ_A(z) dz  (3)
Table 6 Defuzzified values and ranks for different inputs for OA disease

Pain | Age | Swelling | Deformity | Restriction || SOM (rank) | LOM (rank) | MOM (rank) | Centroid (rank) | Bisector (rank)
5 | 45 | 5 | 5 | 5 || 4 (2) | 6 (2) | 5 (2) | 5 (2) | 5 (2)
8 | 40 | 6 | 6 | 8 || 4 (2) | 4 (3) | 4 (3) | 4 (3) | 4 (3)
8 | 35 | 5 | 2 | 2 || 0 (3) | 2.5 (5) | 1.2 (5) | 1.6 (5) | 1.6 (5)
8 | 65 | 8.5 | 9 | 10 || 6.7 (1) | 10 (1) | 8.3 (1) | 8.1 (1) | 8.1 (1)
2.5 | 37 | 8.5 | 4 | 8 || 0 (3) | 3.2 (4) | 1.6 (4) | 1.79 (4) | 1.8 (4)

Table 7 Defuzzified values and ranks for different inputs for RA disease (Juvenile)

Pain | ESR | Age | Swelling | Bilateral tenderness || SOM (rank) | LOM (rank) | MOM (rank) | Centroid (rank) | Bisector (rank)
2 | 20 | 12 | 8 | 9 || 1 (4) | 3.1 (5) | 2.05 (5) | 2.28 (4) | 2.26 (4)
3 | 35 | 16 | 2 | 6 || 3.16 (3) | 6.88 (1) | 5.02 (1) | 5 (1) | 5.02 (1)
3 | 31 | 10 | 8 | 8.5 || 3.2 (2) | 6.7 (2) | 4.99 (2) | 5 (1) | 5.02 (1)
3 | 60 | 10 | 8 | 5 || 4 (1) | 4 (3) | 4 (3) | 4 (2) | 4 (2)
6.5 | 27 | 13 | 8 | 9 || 1 (4) | 3.76 (4) | 2.38 (4) | 2.43 (3) | 2.44 (3)
Table 8 Defuzzified values and ranks for different inputs for RA disease (Adults)

Pain | ESR | Age | Swelling | Bilateral tenderness || SOM (rank) | LOM (rank) | MOM (rank) | Centroid (rank) | Bisector (rank)
5 | 50 | 45 | 5 | 5 || 3.1 (4) | 6.94 (1) | 2.05 (5) | 4.99 (2) | 5.02 (1)
2 | 20 | 22 | 8 | 9 || 1 (2) | 3.04 (5) | 5.02 (1) | 2.27 (5) | 2.26 (4)
9 | 25 | 67 | 4 | 5 || 1 (2) | 3.58 (4) | 4.99 (2) | 2.38 (4) | 2.38 (3)
3 | 35 | 45 | 2 | 6 || 4 (1) | 4 (3) | 4 (3) | 4 (3) | 4 (2)
4 | 31 | 55 | 7 | 8 || 3.28 (3) | 6.7 (2) | 2.38 (4) | 5 (1) | 5.02 (1)

Table 9 Defuzzified values and ranks for different inputs for osteonecrosis disease

Pain | Swelling | Medial joint tenderness | Restriction || SOM (rank) | LOM (rank) | MOM (rank) | Centroid (rank) | Bisector (rank)
2 | 5 | 8 | 9 || 4.5 (3) | 8 (2) | 6.25 (2) | 6.39 (2) | 6.4 (1)
5 | 5 | 5 | 5 || 6 (1) | 6 (3) | 6 (3) | 6.33 (3) | 6.3 (2)
6 | 7 | 5 | 2.5 || 3.8 (4) | 9 (1) | 6.4 (1) | 6.44 (1) | 6.4 (1)
5 | 8 | 2 | 9 || 0 (5) | 2 (5) | 1 (5) | 1.53 (5) | 1.5 (4)
8 | 5 | 3 | 2 || 5 (2) | 5 (4) | 5 (4) | 5 (4) | 5 (3)

2. Bisector of area (BOA): this method divides the whole region into two parts of equal area; the centre of area and the bisector of area sometimes lie on the same line, but not always.

∫_a^BOA μ_A(z) dz = ∫_BOA^β μ_A(z) dz  (4)

3. Largest of maximum (LOM): returns the largest value at which the aggregated output membership function reaches its maximum.
4. Smallest of maximum (SOM): returns the smallest value at which the aggregated output membership function reaches its maximum.
5. Mean of maximum (MOM): returns the arithmetic mean of all values at which the aggregated membership function reaches its maximum.

Z_MOM = ∫ z dz / ∫ dz, taken over the maximizing region  (5)
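The following sketch (ours; it reuses the assumed triangular terms and firing strengths from the earlier sketches) approximates the centre-of-area and mean-of-maximum defuzzifiers numerically on a discretized output universe:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership over a numpy array x."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Discretized output universe for OA (0-10), as in Table 3.
z = np.linspace(0.0, 10.0, 1001)

# Aggregated output set: 'early OA' and 'moderate OA' terms clipped at the
# firing strengths from the inference sketch above (both 0.25 here).
agg = np.maximum(np.minimum(tri(z, 0, 2, 4), 0.25),
                 np.minimum(tri(z, 3, 5, 7), 0.25))

# Centre of area (Eq. 3), approximated as a weighted mean on the grid.
centroid = float((agg * z).sum() / agg.sum())

# Mean of maximum (Eq. 5): average of all z where the aggregate peaks.
mom = float(z[np.isclose(agg, agg.max())].mean())

print(round(centroid, 2), round(mom, 2))
```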

4 Graphical User Interface

The graphical user interface lets a layman user or a doctor provide input in a user-friendly manner and get easily understandable results. Figure 8 shows the interface used for this model, and Fig. 9 shows the treatment plan for the disease.

5 Testing

In the last step, testing is performed. The system output is compared with the observed output given by the expert; in these tests, the expected output matched the system output [7, 13, 14].

5.1 Predictive Values

False positives and false negatives underlie the predictive values. A false positive means the system reports a disease when there is none; a false negative means there is a disease but the system reports none. Sensitivity is the percentage of patients with the disease who have a positive test, whereas specificity is the percentage of patients without the disease who have a negative test.
Fig. 8 Graphical user interface for inflammatory diseases of knee

Fig. 9 Graphical user interface for treatment plan for inflammatory diseases of knee

Mathematically:
Positive predictive value: PPV = TP/(TP + FP)
Negative predictive value: NPV = TN/(TN + FN)
Sensitivity: TP/(TP + FN)
Specificity: TN/(TN + FP)

Table 10 Sample test results

Test | Disease present | Disease absent
Test positive | True positives (97) | False positives (1)
Test negative | False negatives (3) | True negatives (49)
Total | 100 | 50

PPV = TP/(TP + FP) = 97/(97 + 1) = 0.99
NPV = TN/(TN + FN) = 49/(49 + 3) = 0.94
Sensitivity = TP/(TP + FN) = 97/(97 + 3) = 0.97
Specificity = TN/(TN + FP) = 49/(49 + 1) = 0.98
The system is accurate only if it gives correct results: positive when the disease is present and negative when it is not. There is always the possibility of error, however. To quantify these errors, tests were conducted; referring to Table 10, the PPV of the test is 99%, the NPV is 94%, the sensitivity is 97% and the specificity is 98%. Table 11 shows sample testing of the fuzzy inference system, and Table 12 shows system testing on patients' data, describing the stage of disease.
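For reference, the four measures can be computed directly from the Table 10 counts; a minimal sketch:

```python
# Counts from Table 10: 97 true positives, 1 false positive,
# 3 false negatives, 49 true negatives.
tp, fp, fn, tn = 97, 1, 3, 49

ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"PPV={ppv:.3f} NPV={npv:.3f} "
      f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
# -> PPV=0.990 NPV=0.942 sensitivity=0.970 specificity=0.980
```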

6 Discussion and Conclusion

The proposed fuzzy-based expert system is developed to diagnose inflammatory orthopaedic diseases of the knee and the stage of disease, based on symptoms. This fuzzy rule-based technique is very helpful and accurate for diagnosing inflammatory diseases. The proposed expert system can therefore be used as a tool to assist orthopaedic doctors, and as a learning system for orthopaedic medical students and practitioners. Even an ordinary person without any knowledge of artificial intelligence can use the system through the interface to identify the stage of an inflammatory disease of the knee joint.
This research can be further extended to other categories such as infective and neoplastic diseases, and the system could be redesigned using a neuro-fuzzy technique.
Table 11 Sample test results for fuzzy inference system

Pain | Age | Swelling | Deformity | Restriction | ESR | Bilateral tenderness | Medial joint line tenderness || Output by system | Output by expert | Result
8 | 25 | 8 | 0 | 0 | 50 | 5 | 2 || 4.9 (RA) | Sero +ve adult RA | True
8 | 60 | 9 | 8 | 6 | 0 | 7 | 0 || 8.14 (OA) | Severe OA | True
6 | 45 | 0 | 5 | 7 | 29 | 9 | 0 || 4 (RA) | Sero −ve adult RA | True
7 | 15 | 3 | 0 | 0 | 47 | 8.5 | 4 || 5 (RA) | Sero +ve juvenile RA | True
9 | 35 | 8 | 5 | 6 | 0 | 2 | 9 || 6.3 (ON) | Stage 2 ON | True
4 | 48 | 8 | 7 | 0 | 0 | 0 | 0 || 6.64 (OA) | Moderate OA | True
9 | 32 | 7 | 0 | 8 | 0 | 0 | 5 || 3.8 (ON) | Stage 2 | False positive
7 | 45 | 8 | 0 | 0 | 56 | 9 | 0 || 4 (RA) | Sero +ve juvenile RA | True
3 | 13 | 7 | 0 | 0 | 28 | 8 | 0 || 2.45 (RA) | Sero −ve juvenile RA | True
Table 12 System testing on patient's data

Pain | Age | Swelling | Deformity | Restriction | ESR | Bilateral tenderness | Medial joint tenderness || Expected output (disease)
Medium | Old | Low | High | High | Nil | Nil | Nil || OA: Severe
High | Old | High | Nil | Nil | High | Medium | Nil || RA: Adult Sero +ve
Low | Child | High | Nil | Nil | Low | High | Nil || RA: Juvenile Sero −ve
High | Adult | High | Nil | High | Nil | Nil | Nil || ON: Stage 2
High | Adult | Medium | Low | Low | Nil | Low | Nil || OA: Early
High | Old | High | High | High | Nil | Low | Nil || OA: Severe
High | Adult | Medium | Medium | High | Nil | Nil | Nil || OA: Moderate
Low | Adult | Low | Nil | Low | Nil | Nil | Medium || ON: Stage 1

Acknowledgements The authors would like to acknowledge the expert Dr. Surjit Singh Dardi, Orthopaedic Medical Officer at Sant Sarwan Dass Charitable Hospital, Kathar, Hoshiarpur, Punjab, India, for his continuous support and constant effort in data collection throughout the development, testing and validation of the proposed system.

References

1. S. Pandey, A. Pandey, Clinical Orthopaedic Diagnosis (Jaypee Brothers Medical Publishers, New Delhi, 2009)
2. R. Moskowitz, R. Altman, J. Buckwalter, Osteoarthritis, 4th edn. (2006)
3. S. Singh, A. Kumar, K. Panneerselvam, J.J. Vennila, Diagnosis of arthritis through fuzzy
inference system. J. Med. Syst. 36(3), 1459–1468 (2012)
4. E.O. Justice, K.F. Taiwo, Fuzzy-based system for determining the severity level of knee
osteoarthritis. Int. J. Intell. Syst. Appl. 4(9), 46–53 (2012)
5. M. Daniel, P. Hájek, P.H. Nguyen, CADIAG-2 and MYCIN-like systems. Artif. Intell. Med.
9(3), 241–259 (1997)
6. C.K. Lim, K.M. Yew, K.H. Ng, B.J.J. Abdullah, A proposed hierarchical fuzzy inference system
for the diagnosis of arthritic diseases. Australas. Phys. Eng. Sci. Med. 25(3), 144–150 (2002)
7. R.H. Taylor, B.D. Mittelstadt, H.A. Paul, W. Hanson, P. Kazanzides, J.F. Zuhars, W.L. Bargar,
An image-directed robotic system for precise orthopaedic surgery. IEEE Trans. Robot. Autom.
10(3), 261–275 (1994)
8. P. McCauley-Bell, A.B. Badiru, Fuzzy modeling and analytic hierarchy processing to quantify
risk levels associated with occupational injuries. I. The development of fuzzy-linguistic risk
levels. IEEE Trans. Fuzzy Syst. 4(2), 124–131 (1996)
9. J.A. Mendez, A. Leon, A. Marrero, J.M. Gonzalez-Cava, J.A. Reboso, J.I. Estevez, J.F. Gomez-
Gonzalez, Improving the anesthetic process by a fuzzy rule based medical decision system.
Artif. Intell. Med. 84, 159–170 (2018)
10. N.H. Phuong, V. Kreinovich, Fuzzy logic and its applications in medicine. Int. J. Med. Inform.
62(2–3), 165–173 (2001)
11. U.M. Rao, Y.R. Sood, R.K. Jarial, Subtractive clustering fuzzy expert system for engineering
applications. Proc. Comput. Sci. 48, 77–83 (2015)
12. M. Elkano et al., Fuzzy rule-based classification systems for multi-class problems using binary
decomposition strategies: on the influence of n-dimensional overlap functions in the fuzzy
reasoning method. Inf. Sci. 332, 94–114 (2016)
13. Y. Chen, C.Y. Hsu, L. Liu, S. Yang, Constructing a nutrition diagnosis expert system. Expert
Syst. Appl. 39(2), 2132–2156 (2012)
14. J. Pan, G.N. DeSouza, A.C. Kak, FuzzyShell: a large-scale expert system shell using fuzzy
logic for uncertainty reasoning. IEEE Trans. Fuzzy Syst. 6(4), 563–581 (1998)
15. [Link]
16. D.H. Mantzaris, G.C. Anastassopoulos, D.K. Lymberopoulos, Medical disease prediction using
artificial neural networks, in BioInformatics and BioEngineering 2008. BIBE 2008. 8th IEEE
International Conference on IEEE (2008)
17. F. Başçiftçi, E. Avuçlu, An expert system design to diagnose cancer by using a new method
reduced rule base. Comput. Methods Programs Biomed. 157, 113–120 (2018)
GA with k-Medoid Approach
for Optimal Seed Selection to Maximize
Social Influence

Sakshi Agarwal and Shikha Mehta

Abstract In the rapidly rising field of the Web, the volume of online social networks has increased exponentially. This inspires researchers to work in the area of information diffusion, i.e., the spread of information through the "word of mouth" effect. Influence maximization is an important research problem in information diffusion: the selection of the k most influential nodes in the network such that they maximize the information spread. In this paper, we propose an influence maximization model that identifies optimal seeds to maximize the influence spread in the network. The proposed algorithm is a hybrid approach, i.e., a GA with a k-medoid approach using dynamic edge strength. To analyze the efficiency of the proposed algorithm, experiments were performed on two large-scale datasets using a fitness score measure. The experimental outcome showed an 8–16% increase in influence propagation by the proposed algorithm compared to existing seed selection methods, i.e., general greedy, random, discounted degree, and high degree.

Keywords k-Medoid · Genetic algorithm · Social influence · Seed selection · Topical affinity propagation

1 Introduction

The Web market share of online social networks is increasing exponentially. Social networks have become tremendously important for various applications, e.g., education, matrimony Web sites, government divisions, job search portals, recommendation systems [1], health care, viral marketing [2], and numerous other businesses. In all these applications, the social network is commonly considered a platform for information propagation in the network, i.e., social influence. In online social networks, activities of

S. Agarwal (B) · S. Mehta


Computer Science & Information Technology, Jaypee Institute of Information Technogy, Noida,
India
e-mail: [Link]@[Link]
S. Mehta
e-mail: mehtshikha@[Link]


a person can lead to a change in another person's behavior, i.e., social influence. This change in behavior depends on the other user's influence strength, and information spreads from one user to another. The spread of information depends on a user's position in the network. Selecting appropriate nodes in order to gain maximum influence spread in the system, i.e., influence maximization, therefore becomes a challenge. The influence maximization problem was proved to be NP-hard for numerous propagation models [3]. This paper therefore targets maximizing the influence extent by determining k optimal nodes in the system using dynamic edge strength, so that information propagation through these nodes maximizes the effect of influence in the network.
In the past studies, various models and algorithms have been introduced with
respect to social influence [4]. Aslay et al. [5] presented the current state of the art of
the influence maximization (IM) in the field of social network analysis (SNA), i.e.,
existing algorithms and theoretical developments in the field of IM. Anagnostopou-
los et al. [6] described social influence identification problem methodically. They
explained various social correlation models and proposed two methods that can
identify influence in network using time-dependent user action information. Chen
et al. [7] extended discounted degree approach to improve influence propagation in
the system. Mittal et al. [8] identified that the centralities are the major elements to
discover the important authors in collaboration networks. Chen et al. [9] proposed
an algorithm which is scalable with respect to size of networks, i.e., social networks.
In their algorithm, they used one tunable parameter which provides balance between
the running time and influence spread in the network. Similarly, Khomami et al. [10]
proposed learning automaton-based solution to identify minimum positive influence
dominating set (MPIDS) to maximize influence propagation. Goyal et al. [11] solved
influence maximization problem by their proposed model, i.e., credit distribution.
Credit distribution method uses the propagation traces of each action in the time
interval t and estimates the expected influence flow in the network. Chen et al. [12]
introduced a directed acyclic graph-based scalable influence maximization approach
that is modified with respect to the linear threshold model. Thus, influence maximization is a rising field for which various theories and models have been introduced in recent years. To optimize the scope of influence in the network, we present a hybrid approach that selects optimal seeds from the network. Kumar et al. [13] applied the concept of influence propagation to detect rumors on social media.
The remainder of this paper is organized as follows: In Sect. 1, we explained the importance of influence in social networks, the role of good seeds in information propagation, and the related work in this field. In Sect. 2, we explain the proposed algorithm, i.e., GA with a k-medoid approach for optimal seed selection to maximize social influence. In Sect. 3, we present a performance analysis of the proposed methodology with respect to other methods on two datasets. Lastly, we present the conclusion and future work.

2 Proposed Methodology

In this paper, we propose an influence maximization technique that discovers an appropriate set of seeds to maximize the influence spread in the network. Our proposed algorithm is a hybrid approach divided into three steps, as shown in Fig. 1. In the first step, the edge strength score is updated using the topical affinity propagation (TAP) technique. In the second step, an initial population set is generated using the k-medoid algorithm. In the third step, this initial population is given as input to a genetic algorithm (GA) to trace the optimal set of seeds that can increase the impact of influence to the maximum extent.
Given a directed network G = (E, V, S, k), where E denotes the set of links/edges (u, v) with u, v ∈ V, the set of nodes is symbolized by V, and k is a variable that denotes the number of seeds. S is the set of edge strength scores w.r.t. the edge set E, where the edge strength score $s_{uv}$ of an edge (u, v) is

$$s_{uv} = \frac{1}{\text{out\_degree}(u)}$$

The objective of our algorithm is to find k optimal nodes as a seed set $V_1$, where $V_1 \subseteq V$, such that the influence spread in network G through the set $V_1$ is maximized.
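To make the static score concrete, a minimal Python sketch follows; the edge-list representation and the function name are our own illustration, not the paper's.

```python
# A minimal sketch of the static edge-strength score s_uv = 1/out_degree(u);
# the edge-list representation and function name are our own illustration.
from collections import defaultdict

def static_edge_strengths(edges):
    """edges: iterable of directed (u, v) pairs. Returns {(u, v): s_uv}."""
    out_degree = defaultdict(int)
    for u, _ in edges:
        out_degree[u] += 1
    # Each outgoing edge of u receives an equal share of u's influence.
    return {(u, v): 1.0 / out_degree[u] for (u, v) in edges}

edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
print(static_edge_strengths(edges))
# {(1, 2): 0.5, (1, 3): 0.5, (2, 3): 1.0, (3, 1): 1.0}
```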

2.1 Dynamic Edge Strength Score

For a given network G(V, E, S), the dynamic strength score $S_d$ of every edge is calculated using the topical affinity propagation (TAP) algorithm. TAP computes the topic-wise influence likelihood of each node with respect to different node attributes, as defined in Algorithm 1. In this paper, we consider one topic per node; therefore, no node attributes are required as input to TAP. Table 1 defines the variables involved in the estimation of the dynamic influence likelihood.
The node_score function is the base component of the TAP algorithm, as defined by Eq. 1:

$$\text{node}(v_i, r_i) = \begin{cases} \dfrac{s_{i r_i}}{\sum_{j \in N(i)} (s_{ij} + s_{ji})}, & r_i \neq i \\[1ex] \dfrac{\sum_{j \in N(i)} s_{ji}}{\sum_{j \in N(i)} (s_{ij} + s_{ji})}, & r_i = i \end{cases} \qquad (1)$$

Fig. 1 Flowchart of the proposed algorithm



Table 1 List of symbols

| Symbol | Description |
|---|---|
| $v_i$ | A specific node in G |
| $r_i$ | Representative node of node $v_i$ with highest edge_sum for the node set $\{N(v_i) \cup v_i\}$ |
| $e_{ij}$ | A link between nodes $v_i$ and $v_j$ |
| $s_{ij}$ | Strength score of an edge $e_{ij}$ |
| $I_{ij}$ | Influence score of node $v_i$ on node $v_j$ |
| $T_{ij}$ | Influence likelihood assumed by node $v_j$ to node $v_i$, initial value = 0 |
| $A_{ij}$ | Influence likelihood node $v_j$ approves to take on itself from node $v_i$ |
| $N(i)$ | Set of all nodes having an incoming edge from node $v_i$ |
| $Ln_{ij}$ | Logarithm of the edge_score of edge (i, j) |

where $r_i$ is the node with the highest edge_sum value in the set $\{N(i) \cup i\}$ for node $i$, identified using Eq. 2:

$$\text{Edge\_sum}(v_i) = \sum_{k \in N(i)} s_{ik} \qquad (2)$$

Similarly, the logarithm of the normalized edge_score $Ln_{ij}$ for node $i$, and the quantities $A_{ij}$, $T_{jj}$, $T_{ij}$ and $I_{ij}$, are calculated using Eqs. 3, 4, 5, 6, and 7, respectively:

$$Ln_{ij} = \log \frac{\text{node}(v_i, r_i)\big|_{r_i = j}}{\sum_{k \in N(i) \cup \{i\}} \text{node}(v_i, r_i)\big|_{r_i = k}} \qquad (3)$$

$$A_{ij} = Ln_{ij} - \max_{k \in N(j)} \{Ln_{ik} + T_{ik}\} \qquad (4)$$

$$T_{jj} = \max_{k \in N(j)} \min\{A_{kj}, 0\} \qquad (5)$$

$$T_{ij} = \min\left[\max\{A_{jj}, 0\} - \min\{A_{jj}, 0\} - \max_{k \in N(j) \setminus \{i\}} \min\{A_{kj}, 0\}\right], \quad i \in N(j) \qquad (6)$$

$$I_{ij} = \frac{1}{1 + e^{-(A_{ji} + T_{ji})}} \qquad (7)$$

Algorithm 1: Dynamic likelihood computation G(E, V, S)

1. Compute the influence value of each node $v_i$ using Eq. 1
2. Compute $Ln_{ij}$ for each edge e(i, j) using Eq. 3
3. For each $e_{ij}$, initialize $T_{ij}$ = 0
4. Repeat till convergence:
5.   For all $e_{ij}$ ∈ G:
6.     Compute $A_{ij}$ using Eq. 4
7.   For all $v_j$ ∈ G:
8.     Compute $T_{jj}$ using Eq. 5
9.   For all $e_{ij}$ ∈ G:
10.    Compute $T_{ij}$ using Eq. 6
11. For all $v_i$ ∈ G:
12.   For every node k:
13.     Compute $I_{ki}$ using Eq. 7
14. Create G1(E, V, $S_d$), where $S_d$ = I, i.e., the set of influence scores of each edge

Therefore, in the initial phase of our proposed algorithm, we update the edge strength score from S to the dynamic edge strength score $S_d$.
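Since the paper gives no implementation, the following Python sketch covers only the unambiguous building blocks of Algorithm 1 — the node score of Eq. (1), the log-normalization of Eq. (3) and the logistic conversion of Eq. (7) — leaving the A/T message-passing updates (Eqs. 4–6) to the listing above; all names and the data layout are our own assumptions.

```python
# Building blocks of Algorithm 1 in Python; the A/T message-passing updates
# (Eqs. 4-6) are deliberately left to the algorithm listing above.
import math
from collections import defaultdict

def build_neighbours(s):
    """s: {(i, j): s_ij}. Returns N(i): nodes with an incoming edge from i."""
    N = defaultdict(set)
    for (i, j) in s:
        N[i].add(j)
    return N

def node_score(i, r, s, N):                        # Eq. (1)
    den = sum(s.get((i, j), 0.0) + s.get((j, i), 0.0) for j in N[i]) or 1.0
    if r != i:                                     # r is a candidate representative
        return s.get((i, r), 0.0) / den
    return sum(s.get((j, i), 0.0) for j in N[i]) / den   # r = i: self-representative

def log_norm_scores(s):                            # Eq. (3)
    N = build_neighbours(s)
    Ln = {}
    for i in list(N):
        total = sum(node_score(i, k, s, N) for k in N[i] | {i}) or 1.0
        for j in N[i]:
            Ln[(i, j)] = math.log(node_score(i, j, s, N) / total + 1e-12)
    return Ln

def influence_score(A_ji, T_ji):                   # Eq. (7): dynamic strength I_ij
    return 1.0 / (1.0 + math.exp(-(A_ji + T_ji)))
```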

2.2 Generate Initial Population Using k-Medoid Algorithm

In this step, a population set P of k nodes is generated such that this population converges in optimal time and yields optimal seeds to maximize influence propagation in the network. In past studies, various selection approaches have been introduced, such as graph-based heuristics and mathematical models for various domains, e.g., shortest path optimization [14] or influence maximization [15]. In the random model, the selection of k nodes is arbitrary and does not depend on any node or edge attribute; it selects the k nodes in O(1) time. The general greedy model makes the locally best choice at each step so that the objective function is optimized; therefore, it chooses the best solution at every step. The next method, high degree, rests on the hypothesis that the node with the largest neighborhood is the strongest node of the network; it uses the out-degree for directed graphs and the degree for undirected graphs. An extension of high degree is known as discounted degree or the single discount heuristic, which accounts for the fact that the neighborhoods of neighboring nodes are not mutually exclusive [16]. In this paper, we apply a clustering algorithm, i.e., the k-medoid algorithm, to generate k cluster centers as the population set P, as described in Algorithm 2.

Algorithm 2: k-medoid G(V, E, k)

Medoid set P = ({}_1, {}_2, …, {}_{k-1}, {}_k): k node sets

1. Randomly select k nodes as medoids from V and assign them to P, one medoid per node set, i.e., ({N_1}, {N_2}, …, {N_{k-1}}, {N_k})
2. Calculate the shortest distance of each non-medoid node i w.r.t. each medoid N_i and allocate i to the closest medoid set
3. For all medoids {}_i:
4.   For all non-medoid points b:
5.     Update medoid N_i with non-medoid node b
6.     Compute the total distance Cl_{i,v}, i.e., the shortest distance from b to all nodes of G
7.   Select the medoid N_i with the lowest cost
8. Repeat steps 2 to 7 till there is no change in the medoid set P
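As an illustration of Algorithm 2, a Python sketch of the k-medoid loop follows; using BFS hop counts as the shortest distance and the cluster-local total distance as the medoid cost are our simplifying assumptions, not details fixed by the paper.

```python
# A sketch of Algorithm 2: k-medoid seeding of the initial population P,
# with BFS hop counts standing in for shortest distances (our assumption).
import random
from collections import deque, defaultdict

def bfs_distances(adj, src):
    """Hop distance from src to every node reachable in the directed graph."""
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def k_medoids(adj, nodes, k, iters=20):
    nodes = list(nodes)
    dist = {u: bfs_distances(adj, u) for u in nodes}   # all-source distances
    d = lambda u, v: dist[u].get(v, len(nodes))        # unreachable -> large
    medoids = random.sample(nodes, k)                  # step 1: random medoids
    for _ in range(iters):
        clusters = defaultdict(list)                   # step 2: assign each node
        for u in nodes:                                # to its closest medoid
            clusters[min(medoids, key=lambda m: d(m, u))].append(u)
        new = [min(c, key=lambda b: sum(d(b, u) for u in c))  # steps 3-7:
               for c in clusters.values()]             # lowest-cost medoid swap
        if set(new) == set(medoids):                   # step 8: converged
            break
        medoids = new
    return medoids
```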

2.3 Optimal Seed Selection Using Genetic Algorithm

The genetic algorithm is a well-known and widely accepted optimization technique, and in the literature various studies have addressed and proposed solutions for optimization problems using genetic algorithms [16]. In this paper, we use a genetic algorithm to solve the influence maximization problem. For a given network G(E, $S_d$, V, P), the genetic algorithm searches for the k optimal nodes that can improve the influence spread to the maximum extent, using the initial population set P, i.e., the k nodes selected by the k-medoid clustering algorithm. In this step, after each iteration, the population set P is modified by replacing the least efficient candidate node i, having the lowest fitness score σ(i), with a node j generated using one random mutation and 1-point crossover, each with probability one [16]. Therefore, the objective of the genetic algorithm, given the dynamic edge scores $S_d$ and the initial population set P, is to maximize the spread of influence by generating the final population $P_d$ as the set of k optimal seeds of the network.
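A simplified Python sketch of this GA step is shown below; it compresses the paper's node-replacement rule into a keep-if-better generational loop, so it should be read as an illustration of the crossover-plus-mutation search, not as the authors' exact procedure.

```python
# A simplified sketch of the GA step: one-point crossover with a random donor
# chromosome plus a single random mutation per generation, keeping the child
# only if its fitness improves. All names here are our own.
import random

def ga_optimise(population, all_nodes, fitness, generations=100):
    """population: list of k seed nodes; all_nodes: list of candidate nodes;
    fitness: callable scoring a seed list (e.g. the cascade fitness sigma)."""
    best, best_fit = list(population), fitness(population)
    k = len(best)
    for _ in range(generations):
        child = list(best)
        cut = random.randrange(1, k) if k > 1 else 0   # 1-point crossover
        donor = random.sample(all_nodes, k)            # random donor chromosome
        child[cut:] = donor[cut:]
        child[random.randrange(k)] = random.choice(all_nodes)  # one mutation
        child = list(dict.fromkeys(child))             # seeds must stay distinct
        while len(child) < k:                          # refill to k distinct seeds
            c = random.choice(all_nodes)
            if c not in child:
                child.append(c)
        f = fitness(child)
        if f > best_fit:                               # keep only improving children
            best, best_fit = child, f
    return best, best_fit
```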

3 Experiments and Results

We have performed detailed experiments on two datasets [17], i.e., the Amazon co-purchasing network and the wiki vote network; details of these datasets are given in Table 2. We analyzed the performance of the proposed approach against GA embedded with various other seed selection methods, i.e., general greedy, random, discounted degree, and high degree.

Table 2 Datasets

| Dataset | Network definition | Node count | Edge count | Degree statistics |
|---|---|---|---|---|
| Wiki vote network | If person i voted for person j, an edge i → j is created | 7115 | 103,689 | Lowest 0, highest 893 |
| Amazon product co-purchased | If an item i is purchased along with item j, an edge i → j is created | 5122 | 11,321 | Lowest 0, highest 5 |

3.1 Evaluation Parameter

In this paper, we apply the fitness score σ($P_d$) as the performance parameter to compare the effectiveness of our proposed algorithm with the existing algorithms. The fitness score σ($P_d$) of a node set $P_d$ is the total number of nodes converted from the non-influenced to the influenced state by the node set $P_d$. A detailed description of the cascade model used to compute the fitness score is given in our previous work [15].
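Since the exact cascade variant is described in [15], the following Python sketch shows only a generic Monte Carlo independent-cascade estimate of the fitness score σ($P_d$); treating the dynamic strengths $S_d$ as edge activation probabilities is our assumption.

```python
# A generic Monte Carlo independent-cascade sketch of the fitness score
# sigma(P_d); the authors' exact cascade variant is given in [15], so the
# activation rule below is our assumption.
import random
from collections import deque

def fitness(seeds, adj, s_d, trials=100):
    """Average number of nodes newly converted to the influenced state.
    adj: {u: iterable of out-neighbours}; s_d: {(u, v): dynamic strength}."""
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), deque(seeds)
        while frontier:
            u = frontier.popleft()
            for v in adj.get(u, ()):
                # Each live edge fires once, with its dynamic strength
                # acting as the activation probability.
                if v not in active and random.random() < s_d.get((u, v), 0.0):
                    active.add(v)
                    frontier.append(v)
        total += len(active) - len(set(seeds))   # exclude the seeds themselves
    return total / trials
```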

3.2 Analysis of Results

In this paper, we analyzed the efficacy of the proposed approach against various existing seed selection methods, i.e., general greedy, random, discounted degree, and high degree, each embedded with GA using dynamic probabilities (GADP). We performed experiments on two datasets, i.e., the Amazon product co-purchased dataset and wiki vote.
In our experiments, we used seed set sizes ranging from 10 to 50. Figure 2a, b shows the experimental results for wiki vote and Amazon product co-purchased,
Fig. 2 Statistical results: a wiki vote, b Amazon product co-purchased



respectively. It is noticed from Fig. 2a that the proposed algorithm, i.e., k-medoid with GADP, indicated by the sky blue line, shows better results than all other algorithms used in this paper. For the wiki vote dataset, discounted degree shows the second best results because of its high out-degree ratio. The experimental outcomes make clear that the proposed algorithm increased influence propagation by converting more nodes from the non-influenced to the influenced state, i.e., by up to 11%. Similar results are observed for the Amazon product co-purchased dataset, as shown in Fig. 2b: the proposed approach improved the influence propagation by up to 16% with respect to the other approaches. For this dataset, the greedy approach shows the second highest fitness score because of the low maximum out-degree ratio.
Overall, the results show that high degree and discounted degree perform well for datasets with a small out-degree ratio, and the greedy approach performs well for datasets with a high out-degree ratio, whereas the proposed approach gives improved results for both low and high out-degree ratios.
We also performed a comparative analysis with respect to the propagation value, i.e., the fitness score of the results generated by the different approaches, as shown in Fig. 3. It can be seen from Fig. 3 that our proposed algorithm shows a significant improvement in fitness score with respect to all other approaches, and it shows this improvement for both datasets.
Overall, from the experimental results we also observed an interesting behavior of our proposed algorithm w.r.t. the other methods, namely consistency. Our proposed algorithm showed improved results for both low and high out-degree ratios, whereas the random, discounted degree, and high degree results are out-degree ratio dependent. Therefore, our proposed algorithm improves influence propagation by 8–16% with respect to the other approaches.

Fig. 3 Comparison of fitness score: a wiki vote, b Amazon product co-purchased



4 Conclusion and Future Work

In this paper, we proposed an influence maximization model that identifies optimal seeds to maximize the influence spread in the network. Our proposed algorithm is a hybrid approach, i.e., GA with a k-medoid approach using dynamic edge strength. Through experiments, we analyzed the efficacy of the proposed approach against various existing selection methods, i.e., general greedy, random, discounted degree, and high degree methods embedded with GA and dynamic probabilities (GADP). In our experiments, we used two datasets, i.e., wiki vote and the Amazon product co-purchasing network, with high and low out-degree ratios, respectively. Experimental results demonstrate that the proposed approach achieves up to a 16% improvement in influence spread with respect to the other approaches. The results also show that the performance of the proposed algorithm is consistent, i.e., it does not depend on the out-degree ratio. Therefore, our proposed algorithm maximizes the influence spread by finding optimal seeds, as compared to the other approaches.

References

1. X. Song, Y. Chi, K. Hino, B.L. Tseng, Information flow modeling based on diffusion rate for
prediction and ranking, in WWW (2007), pp. 191–200
2. P. Domingos, M. Richardson, Mining the network value of customers, in KDD (2001), pp. 57–66
3. D. Kempe, J. Kleinberg, É. Tardos, Maximizing the spread of influence through a social net-
work, in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. ACM (2003)
4. Y. Li, et al., Influence maximization on social graphs: a survey. IEEE Trans. Knowl. Data Eng.
(2018)
5. C. Aslay et al., Influence maximization in online social networks, in Proceedings of the Eleventh
ACM International Conference on Web Search and Data Mining. ACM (2018)
6. A. Anagnostopoulos, R. Kumar, M. Mahdian, Influence and correlation in social networks,
in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. ACM (2008)
7. W. Chen, Y. Wang, S. Yang, Efficient influence maximization in social networks, in Proceedings
of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
ACM (2009)
8. R. Mittal, M.P.S. Bhatia, Identifying prominent authors from scientific collaboration multiplex
social networks, in International Conference on Innovative Computing and Communications
(Springer, Singapore, 2019), pp. 289–296
9. W. Chen, C. Wang, Y. Wang, Scalable influence maximization for prevalent viral marketing in
large-scale social networks, in Proceedings of the 16th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining. ACM (2010)
10. M.M.D. Khomami et al., Minimum positive influence dominating set and its application in
influence maximization: a learning automata approach. Appl. Intell. 48(3), 570–593 (2018)
11. A. Goyal, F. Bonchi, L.V.S. Lakshmanan, A data-based approach to social influence
maximization. Proc. VLDB Endowment 5(1), 73–84 (2011)
12. W. Chen, Y. Yuan, L. Zhang, Scalable influence maximization in social networks under the
linear threshold model, in Data Mining (ICDM), 2010 IEEE 10th International Conference on.
IEEE, pp. 88–97 (2010)

13. A. Kumar, S.R. Sangwan, Rumor detection using machine learning techniques on social
media, in International Conference on Innovative Computing and Communications (Springer,
Singapore, 2019), pp. 213–221
14. S. Agarwal, S. Mehta, Approximate shortest distance computing using k-medoids clustering.
Ann Data Sci 4(4), 547–564 (2017)
15. S. Agarwal and S. Mehta, Social influence maximization using genetic algorithm with dynamic
probabilities, in 2018 Eleventh International Conference on Contemporary Computing (IC3)
(Noida, India, 2018), pp. 1–6
16. D. Bucur, G. Iacca, Influence maximization in social networks with genetic algorithms, in
European Conference on the Applications of Evolutionary Computation (Springer, Cham,
2016)
17. J. Leskovec, A. Krevl, Large Network Dataset Collection (2015)
Sentiment Analysis on Kerala Floods

Anmol Dudani, V. Srividya, B. Sneha and B. K. Tripathy

Abstract In the twenty-first century, Twitter has been a very influential social media platform, be it in elections or in gathering aid for a disaster. In this work, we propose to study the effect of natural calamities like the recent Kerala floods, their impact on the people, and the reactions of people from different strata of society. The direction of our research is sentiment analysis using RStudio and the Twitter app. This research highlights the reactions of people on a public platform during the calamity. A word cloud and other data visualization techniques are used in our study. The research also highlights how the government reacts and how aid is provided in the dire time of need. We expect our research to be useful to human society as it showcases a lot about human behaviour, its goodness and its shortcomings. The paper also throws light on the Hurricane Michael calamity and gives a comparative study of relations like the differences of opinion among people as well as their similarities during and after a calamity.

Keywords Sentiment analysis · Twitter · # tags · Flood

A. Dudani · V. Srividya · B. Sneha


School of Computer Science and Engineering (SCOPE), Vellore Institute of Technology, Vellore, Tamil Nadu 632014, India
e-mail: anmol.dudani2018@[Link]
V. Srividya
e-mail: srividya.v2018@[Link]
B. Sneha
e-mail: sneha.b2018@[Link]
B. K. Tripathy (B)
School of Information Technology and Engineering (SITE), Vellore Institute of Technology,
Vellore, Tamil Nadu 632014, India
e-mail: tripathybk@[Link]


1 Introduction

On August 15, 2018, severe floods caused by heavy rainfall affected Kerala, known as God's own country. The devastation was such that it caused severe damage to property and lives in the state. A huge volume of information was available on numerous websites related to the calamity, where users shared and exchanged their thoughts and opinions. Social networking sites like Twitter are expansively employed by individuals to enunciate their feedback on everyday events. Information is often publicized on Twitter through the retweet feature: once a Twitter handle shares a tweet, the tweet can be perceived by all others who follow that handle on Twitter, therefore increasing the reach of the tweet posted formerly [1]. In recent years, hash tags engendered by big data environments like Twitter have flooded the Internet. Hash tags became a vital characteristic of content spawned by nearly all of the social media platforms. Noticeably, entities have used hash tags to postulate their interests, frame of mind and proficiencies regarding any and all topics. To name a couple of them, handles use hash tags to proclaim the cars they drive (e.g. #honda), the motion pictures they watch (e.g. #Hobbit), the diversions they get indulgence from (e.g. #skating) and also the grub they cook (e.g. #fish, #grilling). As a corollary of such societal engagement, hash tags have fascinated many scholars in abundant realms of research.
Machine learning practices and knowledge-based techniques have been extensively used in sentiment analysis; the lexicon-based technique is another name for the knowledge-based technique. Techniques based on lexicons concentrate on producing opinion lexicons from the data and then finding the polarity of these lexicons [2]. The foremost objective of machine learning techniques [3], however, is to develop a formula that heightens the performance of the system using training data. Machine learning provides an answer to the sentiment classification problem in two ordered steps: (1) train and develop the model using the training set data, and (2) categorize the unclassified or unlabelled data based on the trained model [4].
In this paper, a sentiment analysis of tweets posted during the Kerala floods is performed and a word cloud is constructed from them. The sentiments of individuals are compared based on their tweets, and an analysis is performed to visualize the different sentiments exhibited. The paper additionally provides a comparative study of the differences of opinion among people, as well as their similarities, during calamities in different parts of the world by taking into account Hurricane Michael, which affected many lives in Florida during the same period.

2 Literature Survey

This section discusses the various literary works collected through various sources. In "Sentiment Analysis of Twitter Data: Emotions Revealed Regarding Trump during the 2015–2016 Primary Debates", Malak Abdullah et al. considered data obtained from the Twitter social media platform, mostly concerning the debates of Donald Trump when he was a candidate in the elections. The paper also looks into the emotions of people around the globe regarding him in the debates, to test whether the tweets obtained support him or not. One strong feature of this study is that positive or negative tweets do not directly indicate whether there is support for the candidate or not [5].
Pulkit Garg et al. performed a massive study on tweets obtained during the disturbing Uri terror attack of 18 September 2016, which shook the world. The tweets were collected with the help of the Twitter social media platform. The study analysed emotions like pain, sorrow and fear of people at the time of the attack, thereby conducting the sentiment analysis [6].
In "Does Social Media Big Data Make the World Smaller? An EDA of Keyword-Hash tag Networks", Ahmed Abdeen Hamed et al. conducted a study on bulk-measure networks called K–H networks, collected from Twitter. In the study, the networks are examined on the basis of the connecting vertices between any two keywords present in the data set, with attention also paid to the eccentricity of each keyword. The results obtained were that the number of vertices connecting any two keywords in their chosen K–H network was three, and the eccentricity of each and every word present was four [7].
In "Sentiment Analysis of Twitter Data: Case Study on Digital India", Prerna Mishra et al. analysed data from social media, the means of information exchange in the present world. The chosen platform for their study was Twitter, and the data that interested them concerned India's Prime Minister Modi. They considered the tweets obtained during the Prime Minister's Digital India campaign. Sentiment analysis was performed, and the data was classified based on a polarity index, using a dictionary-based approach. The results obtained were that 500 opinions were positive, 200 were negative, and the rest were neutral. They also concluded that the people of India supported the Prime Minister in the campaign [8].
In "Prediction of Indian Election Using Sentiment Analysis on Tweets in Hindi", Parul Sharma et al. performed their study on tweets obtained in Hindi, the national language of India, using the Twitter archiver tool. For their study, they collected tweets for a month regarding the five national parties present in India during the 2016 campaigning period. Naive Bayes and the SVM algorithm were used to build the classifier, and the data was classified into positive, negative and neutral categories. This was helpful in finding out the sentiments of the users towards their support or rejection of their desired political party. The results of the analysis with Naive Bayes supported the BJP, while the results obtained from the dictionary-based approach supported the INC. SVM predicted that the BJP would win many seats. The results of the elections turned out to be in favour of the BJP, which won 60 out of 126 constituencies in the 2016 elections [9].
In [10], two main approaches to sentiment analysis were used. The former used Naïve Bayes, decision tree and K-nearest neighbour, and the latter used a deep neural network, an RNN using long short-term memory (LSTM). The experimentation was done on three Twitter data sets, namely IMDB, Amazon and Airline, and a comparison between these approaches was illustrated. The experimental results showed that the recurrent neural network using LSTM scored the highest accuracies of 88%, 87% and 93% [10].
In [11], the authors propose a technique for sentiment analysis of customer feedback, delimiting the polarity caused by various scenarios. This is followed by a clustering process for the positive and negative feedback previously obtained, in order to identify the wide spectrum of topics interesting the organization that the feedback targets. This entire summarization process helps in improving the current feedback process employed in restaurants.
The main emphasis of the research in [12] lies on the classification of tweets based on emotions for the data gathered from Twitter. Previously, machine learning employed in the field of sentiment analysis failed to give better results, so ensemble techniques were used to increase efficiency and reliability. The ensembling is done by merging the concepts of SVM and decision tree. The results were found to give better classification performance, verified using accuracy and the F-measure.
Another study presents sentiment analysis of Indian movie reviews using machine learning techniques. A Bayesian classifier is used along with feature selection, trained on the Chi-square, Info-gain, Gain-Ratio, One-R and Relief-F attribute selectors, and a comparative study of their results was performed, with evaluation by F-value and false-positive rate also carried out. The results showed that the Relief-F feature selection approach gave a better F-value and a low FP rate for most of the selected features. For a smaller number of features, the One-R method was superior to Relief-F [13].
The paper [14] considers a government policy that caused a revolution in India: the demonetization of currency by the Indian government that took effect on November 8, 2016, under Prime Minister Narendra Modi. In this study, the analysis is done from the common man's perspective with the aid of sentiment analysis on Twitter data. The analysis is done state-wise using geo-location in order to study the reaction across the nation.

3 Methodology

3.1 Twitter Data Set

The information is gathered from Twitter using the Twitter API and an R package called twitteR, a Twitter client based on the R programming language that provides an interface to the Twitter Web API. We used a Windows 10 PC with a 64-bit OS for our entire study.
In order to extract tweets, an app should be created using the consumer key, consumer secret key, access token and access secret, and direct Twitter authentication should be set up by the owner of a Twitter developer account.
In our paper, the tweets related to the Kerala floods that were trending during the 2018 tragedy are the main data used for carrying out sentiment analysis. Trending data is generally described using hash tags; some of the hash tags used for our analysis are #keralaflood, #kerala, #keralafloods, #standwithKerala, #doforkerala, #keralafloodrelief, #keralafloods2018 and #savekerala. For the purpose of comparing the reactions of people across the globe to a similar tragedy, we collected tweets on Hurricane Michael, which struck Florida in 2018. The hash tags used for this analysis include #hurricanemichael2018, #floridaStrong, #HurricaneRelief, #ThereNoMatterWhere, #panamabeach and many more.

3.2 Sentiment Analysis

Twitter data sentiment analysis is a field that deserves much more recognition due to the enormous amount of information obtained. Figure 1 displays the steps of the Twitter data sentiment analysis procedure.
In this procedure, the collected Twitter data is first pre-processed in order to perform data cleaning. The vital features are then extracted from the cleaned text using one of the feature strategies, and a portion of the data is tagged as negative or positive tweets in order to organize it into a training set. Lastly, the training set is

Fig. 1 Twitter information sentiment analysis procedure



provided, along with the extracted opinions, as input to the classifier, which is designed to classify the remainder of the data set, i.e., the test set.
Data Collection. The choice of data sources for sentiment analysis plays a major role in the entire procedure. Twitter, which is essentially a micro-blogging site, has gained popularity because of its high participation rate among users. Tweets, the messages posted on Twitter by users, are restricted to 140 characters. Twitter provides two types of APIs: the stream API and the search API. The search API is required to assemble Twitter information using hash tags, while the stream API is required to collect data at the same time it is being produced. In this paper, the search API is used to save the related tweets into a csv file for further analysis.
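The study performs this step in R with the twitteR client; purely as an illustration of the same search-API-to-csv flow, a Python sketch using the tweepy library (v4 interface assumed) might look like this:

```python
# Illustrative only: the paper uses R's twitteR package, but the same
# search-API-to-csv flow with Python's tweepy (v4) would look roughly like
# this. All four credentials come from a Twitter developer account.
import csv
import tweepy

auth = tweepy.OAuth1UserHandler("CONSUMER_KEY", "CONSUMER_SECRET",
                                "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# Collect recent tweets for one of the hash tags used in the study.
tweets = api.search_tweets(q="#keralafloods", count=100)

with open("kerala_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["screen_name", "created_at", "retweet_count", "text"])
    for t in tweets:
        writer.writerow([t.user.screen_name, t.created_at,
                         t.retweet_count, t.text])
```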
Data Preprocessing. Mining Twitter data is a difficult chore as it is raw information; in order to use a classifier on this raw data, it is vital to clean it. The steps below show the cleaning procedure, performed by removing the following tokens:
• Twitter terminologies like hash tags (#), retweets (RT) and account Ids (@).
• URLs, links and emoticons, non-letter data and symbols.
• Stop words such as are, is, am and so on.
• Compression of elongated words like enjjoyyy into enjoy.
• Decompression of words like g8, f9 and so on.
In this paper, for cleaning the collected tweets, the text is first converted into lowercase, and whitespaces and punctuation marks are removed, along with some unnecessary words like single characters [a–z] and [A–Z] [15].
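A minimal Python sketch of these cleaning steps is given below; the stop-word list and the elongation rule are deliberately tiny stand-ins for the fuller R pipeline the paper uses.

```python
# A minimal sketch of the cleaning steps above using plain regular expressions.
import re

STOPWORDS = {"is", "am", "are", "the", "a", "an", "and", "rt"}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs and links
    text = re.sub(r"[@#]\w+", " ", text)            # account Ids and hash tags
    text = re.sub(r"(\w)\1+", r"\1", text)          # enjjoyyy -> enjoy (crudely:
                                                    # legitimate doubles collapse too)
    text = re.sub(r"[^a-z\s]", " ", text)           # symbols, emoticons, digits
    return " ".join(w for w in text.split()
                    if w not in STOPWORDS and len(w) > 1)  # stop words, 1-char words

print(clean_tweet("RT @user Enjjoyyy the rescue work!! #keralafloods http://t.co/x"))
# -> "enjoy rescue work"
```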
Extraction of Features. The cleaned data set has various properties. In this stage, we extract parts of speech like nouns, verbs and adjectives, and later these words are recognized as negative or positive to find the polarity of the entire sentence.
The feature extraction methods are:
• Specific words and their occurrence counts are calculated.
• Negative words are considered, as they affect the polarity of the sentence.
In this paper, term frequency and term presence are used to make a word cloud using the tm (Text Mining in R) library, by importing the data, handling the corpus, managing the data, creating term-document matrices and applying preprocessing methods. Negative phrases are analysed using the tidyverse, tidytext, dplyr and tidyr packages in R [16].
Preparing a Training Data Set. As noted in Sect. 1, both machine learning and knowledge-based (lexicon-based) techniques are widely used in sentiment analysis: lexicon-based techniques produce opinion lexicons from the data and find their polarity [2], while machine learning techniques improve the system's performance using training data, first training the model on the training set and then categorizing the unlabelled data based on the trained model [4].

In this paper, the knowledge-based technique is used in conjunction with the machine learning technique to prepare the training data set.
Classifier. For classification, once the training data is ready, after being tokenized, normalized and collected as a bag of words, a lexicon dictionary on the training data is used to classify each word, based on its emotion, as neutral, negative or positive.
There are different classifier techniques available, namely:
• Naive Bayes classifiers, in which the classification is done statistically [17].
• Maximum entropy, which follows the principle of maximizing the entropy while satisfying the constraints.
• The SVM algorithm [18], which uses regression [19] techniques for classification.
In this paper, Naive Bayes classifiers [20] are used to classify the test data into positive, negative or neutral emotional lexicons by using the bing and afinn dictionaries in R. The lexicon score is also calculated using these dictionaries.
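The paper scores words with the bing and afinn dictionaries in R; the following Python sketch shows the same idea with a toy hand-made lexicon standing in for those dictionaries.

```python
# A sketch of lexicon scoring: each cleaned tweet is scored against a small
# opinion dictionary (a toy stand-in for the Bing/AFINN lexicons used in R)
# and classified by the sign of the score.
LEXICON = {"relief": 1, "help": 1, "safe": 2, "rescue": 1,      # toy entries only
           "flood": -1, "fear": -2, "death": -3, "warning": -1}

def classify(tweet_tokens):
    score = sum(LEXICON.get(w, 0) for w in tweet_tokens)
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score

print(classify("rescue teams bring relief".split()))   # ('positive', 2)
print(classify("flood warning causes fear".split()))   # ('negative', -4)
```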
Polarity-wise Classified Data. The classified data is now analysed based on the emotions, and a word cloud is constructed. A word cloud is an arrangement of words on a specific subject, in which the size of every word indicates its frequency or importance [18].
Polarity-wise classified data can be represented using a comparison word cloud, a plot of the frequencies of words across documents, comparing them according to their occurrence. A comparison word cloud of the words mined from the tweets is drawn and divided into positive and negative thoughts or emotions.
In this paper, the word clouds are generated with the wordcloud and wordcloud2 libraries. The comparison word cloud is created by plotting the test data against the sentiment obtained from the dictionary.
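The clouds themselves are drawn in R with wordcloud/wordcloud2; a rough Python equivalent using the wordcloud and matplotlib packages (our substitution, not the authors' tooling) is:

```python
# Our Python stand-in for the R wordcloud/wordcloud2 step used by the paper.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

cleaned_tweets = ["rescue teams bring relief", "flood warning causes fear"]
wc = WordCloud(width=800, height=400, background_color="white",
               colormap="Dark2").generate(" ".join(cleaned_tweets))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```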

4 Results and Discussion

A. Observations from the Data Set

The images below are the results obtained when the data collected from Twitter was analysed using the Tableau software. The following outcomes were obtained: a text table view of the data set considering only the screen names of the users and whether the tweets posted by them were retweeted further; another text table displaying the screen name along with whether the tweet was marked as favourite; a bar graph comparing the number of tweets that were retweeted or not; a tree map visualizing the screen names of the users and the number of retweets in the high range; a text table displaying the Reply to UID and Reply to SID along with the F1 value based on the screen name; and a data set summary showing the tweets of specific users identified by their screen names.
A range of retweets from 50 to the 15,000s for some users is shown in Fig. 2. From the figure, we observe that the minimum number of retweets corresponds to FlorentMento and the maximum number of retweets is obtained for the user misspete64.

Fig. 2 Text table displaying a range of retweets for some users

The bar graph shown below (Fig. 3) enables us to understand that, out of the 500 tweets considered, around 400 tweets have been retweeted and have become popular among other users, while 100 tweets were posted once and not retweeted by others. The major insight from this graph is that the topic of the Kerala floods trended for a long time on the Twitter platform.
The table in Fig. 4 was generated based on the count of tweets marked as favourite. Out of 500 tweets, only 11 tweets have been marked as favourite. From the table, we infer that the tweets posted by weatherdak were marked favourite 4 times, whereas the tweet by Rep Jayapal has the maximum count of 16.
A tree map is a data representation that visualizes the selected data using size and shape, where the parameter controlling the size of the blocks is a continuous value. In Fig. 5, the selected data is the screen names of the users whose tweets are being analysed, and the size parameter is the sum of the retweets for each screen name. From the figure, we observe that misspete64 has the highest number of retweets, about 15,736; the next highest, mollyexcelsior, has 14,153 retweets; and the lowest value is for the user phriick, with 1098 retweets. The legend in the top right corner maps the colour of each block in the tree map to the corresponding screen name.
The table in Fig. 6 gives an overview of some tweets that have been logged for future analysis. The data obtained consists of the attributes Reply to SID, Reply to UID and screen name; the values in the last column correspond to the F1 attribute of

Fig. 3 A bar graph showing the number of tweets that are retweeted or not

Fig. 4 A text table displaying the range for the tweets being marked as favourite

Fig. 5 A tree map showing the users who have retweets in the range of 1000–16,000

the data set. The UID refers to the User Id of each Twitter user, and the SID refers to the session under which the user had logged in while posting a particular tweet.
The summary of the data set (Fig. 7) in the form of a text table provides many insights, such as the tweet posted by the user, the status of the source (the web page link to the tweet), the reply-to SN (where SN refers to the sender), and whether the tweet has been retweeted, based on the retweet attribute. From this information, the analysis obtained is that none of the above-mentioned tweets were considered important and hence have not been retweeted.
A word cloud is basically a medium to visualize text, with size variation informing the observer about the most frequent or most valuable terms in the set of words based on some numeric parameter. For Fig. 8 shown above, the parameter used is the sum of the retweets for the tweets posted by the users with these screen names.
B. Observations from the Sentiment Analysis

The word cloud, emotional index and comparative word cloud are obtained using the Text Mining, wordcloud, syuzhet and RColorBrewer packages, and are shown below. The Twitter data discussed in this section is processed using RStudio.

4.1 Word Cloud

For Kerala, Fig. 9 shows the intensity of the words used based on the frequency of their repetition. The darker and larger a word, the more commonly it is

Fig. 6 A text table showing the number of records that have a reply to SID and UID for every
screen name

used. Lighter and smaller words denote less commonly used words. In Fig. 9, the word cloud uses Dark2 shades to denote the intensity, with black being the darkest, and thus the most frequently used word, and green being the lightest and least frequently used. In this figure, floods is the most commonly used word, followed by [Link].

Fig. 7 A summary of the data set with selected columns

Fig. 8 A word cloud showing the screen names of users based on the retweet parameter

4.2 Intensity Index

For the intensity index, Fig. 10 shows the intensity of different emotions in the collected tweets. The higher the peak, the more frequently the corresponding words are used, and the higher the index,

Fig. 9 Word cloud of Kerala

the more strongly the words express that emotion. In Fig. 10, the histograms use the NRC sentiment dictionary to find the emotional index of each word, and the height of a bar shows the intensity and index of the emotion. In this figure, fear is the most commonly felt emotion, while negative, anger and surprise are the more strongly expressed emotions.

4.3 Comparative Word Cloud

For the comparative word cloud, Fig. 11 shows the intensity and polarity affiliation of different words in the collected tweets, based on the frequency of their repetition. Darker and larger words denote the most commonly used words, while lighter and smaller words denote less commonly used ones. The word cloud uses blue or red shades to denote positive or negative emotions, respectively. In Fig. 11, the word cloud uses the Bing lexicon to determine the word affiliation, and intensity is denoted by shade: the darkest (dark blue/dark red) marks the most commonly used words and the lightest (light blue/light red) the less frequently used ones. In this figure, dynamic is the most commonly used word and also the most common positive word, while warning is the most commonly used negative word.

Fig. 10 Intensity index for Kerala

4.4 Comparison Study

In order to study the differences of opinion among people and their similarities in emotional gauge during a crisis, we compare our study of the Kerala floods to that of Hurricane Michael.
Figure 12 shows the intensity of the different emotions in the collected tweets; as before, the higher the peak, the more frequently the corresponding words are used, and the higher the index, the more strongly the words express that emotion. The histograms use the NRC sentiment dictionary to find the emotional index of each word and the height of a bar to show the intensity and index of the emotion. Figure 12 shows the comparison between the intensity indices of the different emotions expressed towards the Kerala floods and Hurricane Michael.
Table 1 gives a comparative study of the frequency of words for each emotion between the Kerala floods and Hurricane Michael. The results obtained are:
• The anger shown for the Kerala floods is greater than that for Hurricane Michael; stronger words are also used to express that emotion.
• The anticipation for relief is stronger for Hurricane Michael than for the Kerala floods.

Fig. 11 Comparative word cloud for Kerala

Fig. 12 Comparison of intensity index for Kerala flood and Hurricane Michael

• The intensity of the stronger words used for fear is higher for Hurricane Michael than for the Kerala floods.
• Joy and trust are higher for Hurricane Michael than for the Kerala floods, showing a positive attitude towards the crisis and indicating better management from the government.

Table 1 Comparison of intensity index for Kerala flood and Hurricane Michael

| Emotion | Kerala flood | Hurricane Michael |
|---|---|---|
| Anger | 25 | 11 |
| Anticipation | 9 | 20 |
| Disgust | 16 | 6 |
| Fear | 39 | 31 |
| Joy | 9 | 11 |
| Negative | 35 | 40 |
| Positive | 28 | 25 |
| Sadness | 21 | 10 |
| Surprise | 15 | 11 |
| Trust | 20 | 18 |

• Positive and negative emotions are both higher for Hurricane Michael than for the Kerala floods, indicating more vocal and engaged people.
• Disgust, surprise and sadness are virtually the same in both cases.

5 Result Analysis

This section summarizes the insights obtained from the results discussed in Sect. 4. From the data set analysis, a number of observations were obtained:
• Out of the around 500 tweets considered, 400 tweets were retweeted, which shows that the chosen topic is very trendy and much talked about.
• The maximum number of retweets is for a tweet posted by the user with screen name misspete64.
• The user Rep Jayapal has the maximum number of tweets marked as favourite, with a count of 16.
The emotional indices created for the Kerala floods and Hurricane Michael in Florida were compared. The results obtained from the emotional index are:
• For the Kerala floods, the emotions of fear and negative thoughts were the highest and strongest.
• There is a major variance in the expression of fear and negative thoughts between Hurricane Michael and the Kerala floods.
• The analysis also showed that there is not much variance in the emotions of disgust, anticipation, sadness, trust and surprise between the two calamities.

6 Conclusion

Sentiment analysis is the process of discerning and categorizing the thoughts articulated in a particular text, based on some situation, as positive, negative or neutral. In this paper, a thorough sentiment analysis of recent tweets gathered over the tragedies that occurred in Kerala and Florida is used to dissect human emotions when tragedies occur, as well as people's responses to governmental bodies such as political parties, the police or the army. The most widespread emotions were derived from the tweets and presented in the form of word clouds. Finally, they are categorized into an emotional comparative word cloud that differentiates between human emotions relative to their thoughts. This study has helped us in understanding the emotional similarities and dissimilarities among people across different parts of the world. In the future, the concept of sentiment analysis can be extended to study various happenings across the globe, and the detailed study of the calamities discussed in our paper can also be extended to various other domains.

References

1. S. Ranjan, S. Sood, V. Verma, Twitter sentiment analysis of real-time customer experience feedback for predicting growth of Indian telecom companies, in 4th International Conference on Computing Sciences (ICCS) (2018)
2. B. Schuller, T. Knaup, Learning and Knowledge-Based Sentiment Analysis in Movie Review
Key Excerpts (Springer-Verlag, Berlin Heidelberg, 2011)
3. R.K. Dwivedi, M. Aggarwal, S.K. Keshari, A. Kumar, Sentiment analysis and feature extraction
using rule-based model (RBM), in International Conference on Innovative Computing and
Communications (Springer, Singapore, 2018), pp. 57–63
4. O.R. Llombart, using machine learning techniques for sentiment analysis (School of Engineer-
ing (UAB), 2017) (unpublished)
5. M. Abdullah, M. Hadzikadic, Sentiment analysis of twitter data: emotions revealed regarding
donald trump during the 2015–2016 primary debates, in International Conference on Tools
with Artificial Intelligence (2017)
6. P. Garg, H. Garg, V. Ranga, Sentiment analysis of the Uri terror attack using twitter, in
International Conference on Computing, Communication and Automation (ICCCA 2017)
7. A.A. Hamed, X. Wu, Does social media big data make the world smaller? An exploratory
analysis of keyword-hash tag networks, in 2014 IEEE International Congress on Big Data
8. P. Mishra, R. Rajnish, P. Kumar, Sentiment analysis of twitter data: case study on digital india,
in InCITe—The Next Generation IT Summit (2016)
9. P. Sharma, T.S. Moh, Prediction of indian election using sentiment analysis on Hindi twitter,
in 2016 IEEE Conference
10. Y.M. Wazery, H.S. Mohammed, E.H. Houssein, Twitter sentiment analysis using deep neural
network, in 2018 14th International Computer Engineering Conference (ICENCO), IEEE (Dec
2018), pp. 177–182
11. A. Patil, K. Bheda, N.S. Upadhyay, R. Sawant, Restaurant’s feedback analysis system using
sentimental analysis and data mining techniques. In 2018 International Conference on Current
Trends towards Converging Technologies (ICCTCT ), IEEE (Mar 2018), pp. 1–4
12. M. Rathi, A. Malik, D. Varshney, R. Sharma, S. Mendiratta, Sentiment analysis of tweets
using machine learning approach, in 2018 Eleventh International Conference on Contemporary
Computing (IC3), IEEE (Aug 2018), pp. 1–3

13. A. Tripathi, S.K. Trivedi, Sentiment analysis of Indian movie review with various feature selec-
tion techniques, in 2016 IEEE International Conference on Advances in Computer Applications
(ICACA), IEEE (Oct 2016), pp. 181–185
14. P. Singh, R.S. Sawhney, K.S. Kahlon, Sentiment analysis of demonetization of 500 & 1000
rupee banknotes by Indian government. ICT Expr. 4(3), 124–129 (2018)
15. D. Bholane Savita, D. Gore, Sentiment analysis on twitter data using support vector machine. IJCST 4(3) (2016)
16. H. Kang, S.J. Yoo, D. Han, Senti-lexon and improved Naïve Bayes algorithms for sentiment
analysis of restaurant reviews. Elsevier, Expert Syst. Appl. (2012)
17. E. Haddi, X. Liu, Y. Shi, The role of text preprocessing in sentiment analysis. Elsevier Procedia
Comput. Sci. 17 (2013)
18. S. Naz, A. Sharan, N. Malik, Sentiment classification on twitter data using support vector
machine, in IEEE/WIC/ACM International Conference on Web Intelligence (WI) (2018)
19. M. Nafees, H. Dar, I.U. Lali, S. Tiwana, Sentiment analysis of polarity in product reviews in
social media, in 14th International Conference on Emerging Technologies (ICET ) (2018)
20. A. Dhini, D.A. Kusumaningrum, Sentiment analysis of airport customer reviews, in IEEE
International Conference on Industrial Engineering and Engineering Management (IEEM)
(2018)
Recommendation System Using
Community Identification

Suman Venkata Sai Voggu, Yuvraj Singh Champawat, Swaraj Kothari and B. K. Tripathy

Abstract Community detection has garnered a lot of attention in the years following the introduction of social media applications like Facebook, Twitter, Instagram, WhatsApp, etc. A community refers to a group of closely-knit people who share common ideas, thoughts and likes; they may bond over topics ranging from politics to religion, sports to music and movies, or education to holidaying. Researchers have proposed various algorithms for identifying people who may fit into a particular community, and these algorithms are being used by social media giants in the form of 'suggestions'. In this paper, we propose an algorithm that can be used to identify people who share common interests on a social network, thereby forming communities of shared interest. Detailed analysis of the results shows that a person can be recommended to a community if more than 50% of his interests match with those of the other members belonging to that community.

Keywords Community detection · Social media · Recommendation system · Data frame

S. V. S. Voggu · Y. S. Champawat · S. Kothari


School of Computer Science and Engineering (SCOPE), Vellore Institute of Technology, Vellore,
Tamil Nadu 632014, India
e-mail: voggusuman@[Link]
Y. S. Champawat
e-mail: yschampawat1995@[Link]
S. Kothari
e-mail: kothariswaraj@[Link]
B. K. Tripathy (B)
School of Information Technology and Engineering (SITE), Vellore Institute of Technology,
Vellore, Tamil Nadu 632014, India
e-mail: tripathybk@[Link]


1 Introduction

The human race has seen a lot in its lifetime, from the very early Stone Age, Bronze Age and Iron Age to the Industrial Revolution that changed everything for us, when people manufactured textiles, weapons, cars, edible items, etc. From the 1950s we are said to be living in the digital age, and the time starting from 2001 is specifically known as the Big Data Revolution. With the advent of the Internet came the development of products that made our lives a lot easier, such as Google Chrome, Mozilla Firefox, Maps and Microsoft Office. We were then introduced to social networking through a number of social media applications like Orkut, Facebook, Instagram, WhatsApp, Snapchat, etc. Today most of us are surrounded by these applications and cannot imagine our lives without them; more than 2.34 billion people are connected to each other using social media.
Today people live two lives simultaneously: one where they interact with the real world, and another on social media, which we refer to hereafter as digital life. Social media has become an inseparable part of our lives, and because of this it has garnered a lot of attention from the research community. Quite a few studies have been conducted in the field of community identification, but most of them deal with multidisciplinary topics. To the best of our knowledge, very few scholars have tried to look into the semantics of the way people interact on social media. In this paper we have tried to identify people who are most likely to interact with each other based on their shared interests and thoughts; we call such a group a community.
Community identification refers to the process of identifying people based on their common interests [1, 2]. In this paper, we propose an algorithm that can be used to identify communities. For this, we collected data using Google Forms, posing a number of multiple-choice questions to our targeted group of people to learn about their interests, views and likes. With the help of our friends and family, we got responses from a whopping 540 people belonging to 17 states of India. Each question formed an attribute of our dataset, and each person became a tuple in it. We applied our algorithm on this collected dataset, which consists of 15 attributes and 530 tuples.
In this algorithm, we use a weighted community identification technique wherein the attributes are ordered based on their effect on identifying a community, and each attribute is then assigned a weight: the higher the weight of an attribute, the greater its effect on identifying the community, as illustrated in the sketch below.

2 Literature Review

There are quite a few research papers that we found to be related to our work. These papers helped us get a better understanding of communities and the methods that can be used for their identification.
In [3], the authors have emphasised the need for identifying communities in research fields. They analysed the data and identified that 'Review of Modern Physics' has the greatest number of citations. They tried to identify the most active and influential nodes by applying sociometric analysis.
In [4], the authors explain that Twitter is one of the most important tools for the dissemination of information related to scholarly articles. They establish that Twitter is mostly used by non-academic users to discover information and develop connections with scholars in order to gain access to their scholarly materials.

3 Methodology

In the proposed algorithm, we intend to provide an alternative approach to finding a group of people that share common interests, views, opinions, likes and, most importantly, thoughts. In short, we can use this algorithm to find a community. But to identify individuals that have a common thought process, we need data regarding their interests and views, as well as information about their favourite activities and likes.
For this purpose, we first thought of collecting data from Facebook. But the views, likes and opinions of any Facebook user are personal information, and Facebook does not publicise this data even for research purposes, as doing so may lead to privacy infringement issues. So, we tried Twitter. There are three ways of collecting data from Twitter:
(i) Use the datasets that are already available: None of these datasets matched
the content we wanted for our algorithm. They contained tweets regarding a
particular topic like #Ferguson, #Brexit, etc.
(ii) Purchasing data from Twitter: Twitter allows users and advertising agencies
to collect information based on a particular #Tweet or tweets over a specific
duration of time provided they pay exorbitant fees. This method is rarely used
as the costs are inherently very high even for research purposes.
(iii) Free data access using Twitter API: Twitter allows us to access just 1% of data
collected from tweets over a specified time duration. Twitter does not provide
any information regarding the methods it is using to sample this 1% of data. To
make things worse, Twitter bots generate a significant number of tweets and it
is very difficult to segregate human-generated tweets from the ones generated
by Twitter bots. After an intense discussion, we decided that we cannot use
Twitter data as we doubt its reliability.

This drove us to use Google Forms and collect data directly from people. After running many pilot tests, we prepared a questionnaire consisting of 16 questions, where every question was provided with adequate choices. An email field was included among these questions to improve the credibility and reliability of the data. We made the form live on 06/10/2018 and sent it to our friends. We personally called them and requested them to fill in the form, forward it to their friends and ensure that their friends also filled it in. They readily obliged. Within a short period of time, we received responses from 540 people. We gathered diversified data from people belonging to 17 different states of India. The data was exported in CSV format, as shown in Table 1.
The collected data was organised into 15 columns (attributes), where each attribute corresponds to a question from the questionnaire, and 540 tuples, each corresponding to an individual response recorded by a person. It was then converted into a CSV file, which was used as the final dataset on which the proposed algorithm was tested.
The algorithm depends on the similarity index between two individuals X and Y to determine whether they can be part of the same community. The similarity index in turn depends on the match of the corresponding attribute values of X and Y. Weights are assigned to each of these attributes to improve the results based on their priority: the higher the priority of an attribute, the higher the weight assigned to it, and vice versa.

Table 1 The dataset collected using the Google form

Name | Live | Occupation | Career | Network | Veg
Suman Venkata Sai Voggu | Andhra Pradesh | Student | Entrepreneur | BSNL | Vegetarian
Swaraj | Madhya Pradesh | Student | Engineer | Airtel | Vegetarian
Yuvraj Singh Champawat | Rajasthan | Student | Civil Servant | Vodafone | Non-vegetarian
Rashmi R | Karnataka | Student | Engineer | Jio | Vegetarian
Mayank | Haryana | Student | Engineer | Jio | Vegetarian
Sneha B | Karnataka | Student | Engineer | Airtel | Non-vegetarian
Shivani Champawat | Gujarat | Student | Entrepreneur | Vodafone | Non-vegetarian
Manideep Ganji | Telangana | Student | Entrepreneur | Jio | Non-vegetarian
K. Thejeswar Reddy | Andhra Pradesh | Student | Engineer | Jio | Non-vegetarian
Panyam Gangadhar | Andhra Pradesh | Student | Engineer | Jio | Non-vegetarian

We propose the following algorithm.

The Algorithm
This algorithm consists of the following steps:

1) Import the packages necessary for execution of the proposed algorithm
   // here we import numpy and pandas
2) Load the data from the .csv file into a dataframe
3) W[i] <- stores the weight of every attribute i (the weights sum to 100)
4) L <- empty list that stores the indexes of the people who are part of the
   community X is most likely to be part of
5) X <- index value of the individual whose community is to be found
6) For every index value Y not equal to X:
   {
       Initialize the similarity index, SI = 0
       For every attribute a(i) of the dataframe:
       {
           If the attribute value of X matches that of Y on a(i)   // a match
               Set SI = SI + W[i]
       }
       If SI > 50
           Append Y to L
   }
7) Print the "Name" attribute for every index stored in L
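As a concrete illustration of these steps, the following is a minimal Python sketch of the weighted similarity search; the CSV file name, the attribute weights and the usage shown in the comments are placeholders, not the exact values used for the reported results.

import pandas as pd

def recommend_community(df, weights, x, threshold=50.0):
    """Return (name, similarity index) pairs for every person whose weighted
    attribute match with the person at row index `x` exceeds `threshold`.
    `weights` maps attribute (column) names to weights that sum to 100."""
    target = df.loc[x]
    members = []
    for y, row in df.iterrows():
        if y == x:
            continue
        # Add an attribute's weight to the similarity index on every match.
        si = sum(w for attr, w in weights.items() if row[attr] == target[attr])
        if si > threshold:
            members.append((row["Name"], si))
    return members

# Illustrative usage (hypothetical file and weights):
# df = pd.read_csv("survey_responses.csv")
# weights = {"Live": 20, "Occupation": 10, "Career": 30, "Network": 15, "Veg": 25}
# for name, si in recommend_community(df, weights, x=3):
#     print(name, f"{si:.1f}%")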

4 Experimental Setup

As previously discussed, we have used Google form to collect information about the
interests, views and opinions of people. Some of the questions asked were:
1. Where do you live?
2. Which genre of music do you like?
3. Which landscape would you prefer for holidaying?
4. Which is the biggest problem India is facing?

The entire list of questions posed to the respondents in the survey can be accessed through the following Google link: [Link]
Whenever an algorithm is proposed, it is very important that its validity be checked. For this purpose, we developed a Python program based on the proposed algorithm and executed it.
The setup we used was a platform with an Intel i5 processor and 4 GB RAM, running Python IDEs such as Jupyter Notebook and PyCharm, Java JDK 1.8.0_101 and Anaconda Navigator.
The Python program was developed from the proposed algorithm in a Jupyter Notebook. Given the input Id of the person for whom we want to recommend friends, the program generates the list of people whose interests match more than 50% with those of the given input. This list forms a community, and each person in it can then be recommended to the concerned person as a friend suggestion.
A sample output for a person with ID 3 is presented below.

5 Results

In this study, we discuss an alternative method for the identification of communities in social networks. The Python program was executed on the dataset of 540 people in the prescribed working environment, generating the output shown in Fig. 1. It lists the names of people whose interests match with a person X (userId: 3). The similarity index indicates the percentage of similarity between these people and the concerned person. For example, a person called 'Parth' from Gujarat has a similarity index of 61.1%, meaning that his views, interests and opinions match 61.1% with those of X.
We then studied the dataset collected from the questionnaire using Google Forms. Google provides useful summary statistics about the data collected through these forms.
Since this Google form was sent to our family and friends, most of the respondents were from our home states. As shown in Fig. 2, out of 540 people surveyed, 46.7% of the responses came from Andhra Pradesh. The results were heavily tilted in favour of AP because of the aggressive publicity from our friends in that state. Gujarat followed AP, contributing 13.8% of the total responses.

Fig. 1 Output generated after execution of the program for user id '3'

Fig. 2 Domicile of the respondents
Figure 3 shows a pie chart of the responses recorded about respondents' favourite indoor games. Since the responses were collected from a relatively young population, it was unsurprising that the respondents were mostly interested in playing computer and mobile games. Of the total people surveyed, 29.8% chose computer games as their favourite indoor sport; carrom was a close second, garnering 24.8% of the total votes.
Figure 4 reveals the political views of the respondents. Four options were given to the people: BJP, Congress, regional parties and NOTA. As clearly seen in the pie chart, most of the people surveyed chose NOTA. There can be two reasons for that: they are either not interested in politics or did not feel connected to the political ideology of these parties. As more than half of the respondents were from AP, Telangana and Tamil Nadu, it was not shocking that they prefer regional parties over the two major national parties, i.e. BJP and Congress. This question was by far the most important one for our survey, as it gives a glimpse of the opinions and thoughts of the people.

Fig. 3 Favourite indoor games

Fig. 4 Political views
In this study, we developed an algorithm which, for a given individual X, generates the list of names forming that person's community.

6 Conclusion

In this paper, we studied the problem of detecting communities based on common views, opinions, interests and thoughts. The aim is to find a community based on the similarity index of an individual with every other person in the dataset. This algorithm can be integrated with social media applications for friend recommendations, similarity experiments, crime detection, fraud detection, etc. The proposed algorithm works on static data, i.e. data at one moment in time. In future, we plan to work on an algorithm that can be used to identify communities in dynamic social networks.

Acknowledgements We, the authors of the research paper 'Recommendation System Using Community Identification', hereby declare that the questions contained in the questionnaire were prepared by us solely for the purpose of collecting the data. It was necessary to collect the data since none of the open and free datasets available on the Internet fulfilled our requirements. We also declare that the data was collected through a Google form while keeping the identity of each person anonymous. The data does not contain any personal information of the 545 people who filled in the Google form.
Statement of Consent We, the authors of the research paper 'Recommendation System Using Community Identification', hereby give our consent to the ICICC Conference to publish our research paper in their publication.

References

1. S.A. Moosavi, M. Jalali, N. Misaghian et al., Community detection in social networks using user frequent pattern mining. Knowl. Inf. Syst. 51, 159 (2017). [Link]016-0970-8
2. S. Sobolevsky, R. Campari, A. Belyi, C. Ratti, General optimization technique for high-quality community detection in complex networks. Phys. Rev. E 90, 012811 (2014)
3. B.S. Khan, M.A. Niazi, Network community detection. arXiv preprint (2017)
4. I.R. Fischhoff, S.R. Sundaresan, J. Cordingley, H.M. Larkin, M.-J. Sellier, D.I. Rubenstein, Social relationships and reproductive state influence leadership roles in movements of plains zebra (Equus burchellii). Anim. Behav. (2006) (Submitted)
Comparison of Deep Learning
and Random Forest for Rumor
Identification in Social Networks

T. Manjunath Kumar, R. Murugeswari, D. Devaraj and J. Hemalatha

Abstract The social life of individuals has evolved with online social media, and these sites have made dramatic changes to the social environment. The world's largest and most popular Online Social Network (OSN) is Facebook, with more than a billion users. It is home to many kinds of malicious actors who misuse the site by posting harmful or false messages. In recent years, Twitter and other microblogging sites have grown to multimillion active users and have become a novel rumor-spreading platform. The problem of detecting rumors has become more important, especially in OSNs. In this paper, we apply different machine learning approaches, namely Naïve Bayes, Decision Tree, Deep Learning and the Random Forest algorithm, for identifying rumors. The experiments are carried out with the RapidMiner tool on real-world data from Facebook. The rumor identification schemes are verified by applying fifteen features based on users' behaviour in the Facebook data set to forecast whether a microblog post is a rumor or not. From the experiments, precision, recall and F-score values are calculated for all four machine learning algorithms, and these values are compared to find the accuracy (%) of each algorithm. Our experimental results show that Random Forest provides an overall average precision of 97%, higher than the other comparative methods.

T. Manjunath Kumar (B) · R. Murugeswari · J. Hemalatha


Department of CSE, Kalasalingam Academy of Research and Education, Krishnankoil,
Tamilnadu 626126, India
e-mail: manjunathkumar.t@[Link]
R. Murugeswari
e-mail: [Link]@[Link]
J. Hemalatha
e-mail: jhemalathakumar@[Link]
D. Devaraj
Department of EEE, Kalasalingam Academy of Research and Education, Krishnankoil,
Tamilnadu 626126, India
e-mail: deva230@[Link]


Keywords Rumor identification · Machine learning · Online social network · Microblog

1 Introduction

Nowadays, people all over the world depend on online social networks (OSNs) to exchange knowledge, thoughts, information, resources, experiences and private communications. Social network analysis is the precise investigation and recording of social interaction in small groups, work groups and especially classrooms. In this innovative world, social networking sites have helped a great deal to bring the world closer, but along with this they have generated a lot of new issues; for instance, users on these platforms are targeted by spam, and trust in the platform may make those targets more liable to scams. As per a survey reported by CNN in 2012, there were 83 million fake Facebook accounts.
With Online Social Media (OSM), Twitter and other microblogging sites have become more widespread [1]. These systems have grown rapidly, and the public uses them as the main means to spread and obtain data in everyday activities. Microblogging lets users exchange small elements of content, such as messages or short sentences, individual images, video links or email links, which is the major reason for its popularity. Researchers have therefore investigated the spread of rumors, scams and fake information on Facebook and on microblogging sites such as Twitter [2], carried in forms such as text, images and video links.
Rumors mostly refer to information whose veracity and origin are uncertain, and which can arise under emergency conditions, attracting public attention and damaging the public order [3]. We approach the rumor problem in this paper by observing that a rumor publisher's activity differs from that of normal users, and that a rumor post may attract different responses than a normal post [4].
We investigate the problem of rumor identification in a Facebook data set. A user-behavior-based rumor identification scheme is proposed, in which users' activities are treated as hidden signals that indicate which posts are likely to be rumors. Our rumor identification method contains two phases: (1) from the collected Facebook data and users' profiles, we extract features of user behavior from every Facebook post; overall, fourteen attributes of user behavior are considered in this paper. (2) We apply several machine learning algorithms to train classifiers for rumor detection. In general, rumor detection models fall into three categories: (1) network-topology-based models, (2) information-diffusion-based models and (3) user-behavior-based models. Our proposed work focuses on the user behavior model.
Before getting into the problem, we briefly introduce the machine learning algorithms used. The paper centres on the Random Forest algorithm, which belongs to ensemble learning. Deep learning is an emerging area and is considered a particularly promising alternative for classification problems, depending on the features on which the classifiers are trained.
The rest of this paper is organized as follows. In Sect. 2, we review the previous related work on rumor identification. In Sect. 3, we propose our methodology for rumor identification. Section 4 presents the experimental results, and Sect. 5 gives the conclusion.

2 Related Work

In this section, we examine different existing rumor detection strategies.
Cao Xiao et al. proposed a scalable strategy for finding malicious users, rumors and spam on OSNs. They presented a supervised machine learning pipeline to classify clusters of accounts as abusive or genuine. The key features used are statistics on fields of user-generated content, such as name, email address, and company or university; these include both the frequencies of patterns within the cluster and comparisons of content frequencies over the entire user base. Evaluation on both in-sample and out-of-sample data demonstrated the strong performance of this approach [5].
Akshay J. Sarode et al. proposed an experimental framework for recognizing fake profiles and fake rumors within the friends list, posts, links, pictures and videos. The framework is confined to a particular online social networking site, namely Facebook, from which it extracts data and uses it to classify profiles as genuine or fake using unsupervised and supervised machine learning algorithms. The proposed method achieved an accuracy of around 98%, which is useful for recognizing fake profiles among the available profiles [6].
Sushila Shelke et al. studied and analyzed approaches for detecting the source of rumors or misinformation in a social media network. They present a pictorial taxonomy of the factors to be considered for source detection and a classification of current source detection approaches in social media networks. The focus is on various state-of-the-art source detection approaches for rumor or misinformation and a comparison between these approaches [7].
Akshi Kumar et al. presented an introduction to rumor detection on social media, covering the fundamental terminology and types of rumors and the conventional process of rumor detection. A state-of-the-art survey describing the use of supervised machine learning (ML) algorithms for rumor detection on social media is presented. The key purpose is to assess the amount and kind of work done in the area of ML-based rumor detection on social media, and to identify the research gaps within the domain [8].
Prateek Dewan et al. attempted to detect malicious Facebook pages by exercising various supervised learning algorithms on their dataset. Artificial neural networks trained on a fixed-size bag of words achieved an accuracy of 84.13%. Their bag-of-words model depends on a limited history of page activity; it is impractical to gather and analyze the whole history of all pages, and pages can change behavior over time. To accommodate such variations in behavior, they recommended a self-adaptive model which depends on the most recent activity of the page [9].
Gang Liang et al. observed that microblog systems such as Twitter and Facebook have become a new rumor-spreading platform with multimillion active users. They investigated rumor identification schemes by applying five new features based on users' behaviors and combining them with existing, well-proven user-behavior-based features, such as comments and reposting, to predict whether a microblog post is a rumor. A rumor publisher's behavior may deviate from that of normal users, and rumor posts may receive different replies than typical posts. The proposed new features enlarge the rumor identification feature database and benefit the design of automatic rumor detection systems [10].
Qiao Zhang et al. focused on an automatic rumor identification strategy based on a collection of newly proposed implicit features and shallow features of short messages. Shallow features are those that cannot, by themselves, distinguish between normal messages and rumors. A collection of supervised models is used, for example Random Forest and Support Vector Machine [11].
Jiaojiao Jiang et al. studied the problem of rumor source detection in time-varying social networks, which can be reduced to a series of static networks by introducing a time-integrating window. To address the challenges posed by time-varying social networks, two different techniques were adapted: (1) a novel reverse diffusion strategy, which can sharply narrow down the set of suspected sources, and (2) an analytical model for rumor spreading in time-varying social networks. The analysis results demonstrate that these techniques are effective in discovering rumor sources in several kinds of time-varying social networks, and the method can reduce the source-searching region by 60–90% in various time-varying social networks [12].
Raveena Dayani et al. focused on working at the content level and attempted to answer relevant questions at this level. Their detection approach is based on a Rumor-Knowledge-Base (RKB), a repository of tweets related to different rumor topics, manually pre-detected and pre-verified, along with sentiment polarities that suggest whether a tweet spreads the rumor or refutes it. In this way, they aim to cover all kinds of tweets, i.e., tweets not related to the rumor and tweets in favor of or against the rumor. To address a key limitation of this approach, namely the manual maintenance of the RKB, they propose to extract tweets from the Twitter accounts of popular news agencies to ascertain the trustworthiness of tweet contents and automatically maintain the RKB [13].
Sardar Hamidian et al. focused on the problem of detecting rumors in Twitter data. They proposed a completely label-independent method for attribute grouping that depends on the tweet content. Their experiment is based on two settings: single-step rumor detection and classification (SRDC) and two-step (TRDC). In both SRDC and TRDC, features are divided into classes, and tweets are classified accordingly. In the experiments, they used the WEKA platform for training and testing their proposed method with the J48 classifier [14].
Zhao et al. worked on early rumor detection using terms such as "false" or "unconfirmed" to find questioning and denying tweets. An RNN learns representations that are significantly more complex than these explicit signals; such representations can capture hidden meanings and dependencies over time [15].
From a generic point of view, ensemble classifiers provide better results irrespective of the research area. For example, an ensemble classifier has provided more than 95% classification rate, the best result in the classification of stego images: Hemalatha et al. implemented a system based on an efficient classifier, a multi-surface proximal support vector machine ensemble (oblique random rotation forest), which provided a detection rate superior to other existing classifiers [16].
Our proposed work is based on rumor detection on a Facebook or Twitter dataset using machine learning techniques, namely Naïve Bayes, Decision Tree, Deep Learning and Random Forest. All these supervised learning techniques work only on labelled datasets. The techniques are implemented in the RapidMiner tool, a data science platform that provides an integrated environment and supports the whole machine learning process.

3 Rumor Detection Algorithm

This section presents the rumor detection algorithm for the Facebook dataset. The overall framework of the proposed model is given in Fig. 1.

3.1 Data Set

A data set is a collection of records. Most often, a data set corresponds to the contents of a single database table or a single statistical data matrix, where every column of the table represents a particular variable and each row corresponds to a record of the data set. We gathered OSN data from Facebook, the world's biggest and leading platform, to test the behavior of the algorithms. These algorithms are machine learning techniques that are widely adopted nowadays. All of them (Naïve Bayes, Decision Tree, Deep Learning, Random Forest) are supervised learning methods applicable to labelled datasets. Our Facebook dataset contains features that are helpful in detecting rumors.

Fig. 1 Overall framework of the proposed work

3.2 Rumor Detection Features

Detection performance depends on the features that are adopted. We extracted eleven user behavior features from Facebook posts. There are large differences in usage patterns: readers respond differently when reading normal posts and rumor posts. The dataset features for the rumor detection algorithms are given in Table 1.

3.3 Pre-processing

After the collection of the data set, we must pre-process it: the data is read, the input analyzed, and missing values handled, for example by removing records having null values, replacing missing numeric values with the mean value, and removing nominal attributes having null values. The preprocessing is done using the Discretize and Set Role operators, described below.
1. Discretize by Binning
This operator discretizes the selected numerical attributes into a user-specified number of bins. Bins of equal range are automatically produced; the number of values in different bins may vary. The discretization by binning is done on the values that lie within the stated boundaries.

Table 1 Features in Facebook dataset


S. No. Features extracted Description
1. Page total like | When the public likes a page, they are showing support for the page and want to see content from it. People who like a page will automatically follow it. The total count is calculated from the number of likes made on the page
2. Post month | When someone posts a subject (text, images, video, links) on social media, the month is recorded in the database. When people visit that post, the post month appears, and the count of visitors helps distinguish it from other posts
3. Post weekday | When a subject is posted, the weekday is recorded. Based on the study report, the best time to post on Facebook is between 12 p.m. and 3 p.m. on Monday, Wednesday, Thursday and Friday, and on weekends from 12 p.m. to 1 p.m.
4. Post hour | When a subject is posted, the time is recorded in the data set and will be visible to the people who visit that post in the future. It also shows the hours elapsed since the post and the count of visitors. According to the survey, the best period to post is 12 p.m. to 1 p.m., since most people scroll their news feed during lunch hours
5. Overall post total reach | The number of unique people who saw the post, whether in their news feeds on the page or through shares by other users. Total reach drives every further metric that can be tracked: negative feedback, engagement, likes, comments and clicks. Lifetime post total impressions are the number of times a post from the page has been displayed; for instance, if someone sees a post update in their timeline and sees the same update when a friend shares it, that counts as two impressions. Also counted is the number of unique people who shared copies of the page post
6. Overall post total impressions | The number of times a post from the page is displayed. For example, if anybody sees a page update in their post feed and sees the same update when a friend or relative shares it, that counts as two impressions. Also counted is the number of unique users who responded to the page post
7. Overall engaged user | The number of unique users who engaged with the post or created a story about the page. Engagement counts stories created that were not the result of a click within the content
8. Overall post consumptions | It measures any click on content on the page, whether or not it creates a story. The following are counted as consumptions of a page post: link clicks, photo views, video plays, post comments, likes and shares
9. Overall post consumers | The number of unique users who clicked anywhere within the post or page content. Clicks creating stories are included in "other clicks"
10. Liked page and engaged with post | This includes anyone who visited the page, regardless of the actions they took, such as liking the page or post, commenting on the page or post, or sharing the page or post among other users, and other actions like subscribing (for updates)
11. Total interactions | The total number of custom and standard events that are triggered when a user interacts. It captures all of the feedback pages receive from users. The goal of the metric is to provide an updated snapshot of how users are engaging with post contents

2. Set Role
This operator is used to change the role of one or more attributes in the data set, such as the id role, label role, batch role, weight role, or cluster role. The role of an attribute describes how other operators handle that attribute.
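As a rough illustration of what these two operators do, the following pandas sketch reproduces equal-width binning and the separation of the label from the regular attributes; the file name and column names are assumptions made for the example, not the exact data used here.

import pandas as pd

# Load the data set and drop records with null values (pre-processing).
df = pd.read_csv("facebook_posts.csv").dropna()

# "Discretize by Binning": equal-width bins over a numeric attribute.
df["page_total_like_bin"] = pd.cut(
    df["page_total_like"], bins=5,
    labels=["very_low", "low", "mid", "high", "very_high"])

# "Set Role": in plain code this amounts to marking one column as the label
# and treating the remaining columns as regular attributes.
y = df.pop("rumor")   # label role (assumed binary label column)
X = df                # regular attributes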

3.4 Random Forest Algorithm

This algorithm uses random subsets of attributes; it works exactly like the Decision Tree operator, with one exception: at each split, only a random subset of features is available. It learns decision trees from both numerical and nominal data. The objective is to generate a classification model that predicts the value of the label based on the input features.
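For illustration, a minimal scikit-learn sketch of such a classifier follows; the CSV file name, the binary "rumor" label column and the parameter values are assumptions made for the example, not the RapidMiner configuration used for the reported experiments.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

df = pd.read_csv("facebook_posts.csv")   # assumed file with Table 1 features
X = df.drop(columns=["rumor"])           # user-behavior features
y = df["rumor"]                          # 1 = rumor, 0 = non-rumor

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# At every split, each tree considers only a random subset of the features.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                             random_state=42)
clf.fit(X_train, y_train)

p, r, f, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test), average="binary")
print(f"precision={p:.2f} recall={r:.2f} f-score={f:.2f}")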

3.5 Rumor Identification Algorithm

The steps for identifying a rumor are given in Table 2, and the corresponding flow diagram is given in Fig. 2.

Table 2 Rumor identification using the random forest algorithm

Algorithm: Rumor identification
Step 1: Load the Facebook dataset and apply the random forest algorithm
Step 2: Among the eleven features, consider the page total like feature. If this value or count is less than the threshold value, the post is identified as non-rumor
Step 3: If the value or count is greater than the threshold value, then consider the category of the post (photo, link, image, video) to distinguish further
Step 4: Again, if the value exceeds the threshold, other features such as the lifetime post impressions and lifetime engaged users can be considered. Based on these counts, rumor or non-rumor is detected
Step 5: The performance of the random forest algorithm is compared with the existing algorithms Naïve Bayes, Decision Tree and Deep Learning based on evaluation metrics such as precision, recall and F-score

Fig. 2 Flow diagram of rumor detection (Facebook dataset → pre-processing → feature extraction as in Table 1 → random forest algorithm → threshold test on the count: below the threshold, non-rumor; otherwise, rumor)
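As an illustration of the cascade in Table 2 and Fig. 2, the following sketch encodes the threshold tests in Python; the threshold values and field names are assumptions made for the example, since the exact thresholds are not reported here.

def classify_post(post, like_threshold=1000, impression_threshold=5000):
    # Step 2: posts on pages with few likes are treated as non-rumor.
    if post["page_total_like"] < like_threshold:
        return "non-rumor"
    # Step 3: otherwise examine the post category.
    if post["category"] in {"Photo", "Link", "Image", "Video"}:
        # Step 4: check a further behavior count against its threshold.
        if post["lifetime_post_impressions"] > impression_threshold:
            return "rumor"
    return "non-rumor"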



4 Experimental Results

4.1 Rumor Identification Evaluation

The microblog is represented by user activity features, and the task is to identify whether the microblog is a rumor or not. Based on the different attributes, we apply four classifier algorithms: Naïve Bayes, Decision Tree, Random Forest and Deep Learning. To check the efficiency and general applicability of the eleven user features, we examine the feature values in Fig. 3. It can be seen from Fig. 3 that there is a notable difference between rumor and normal posts for features like link and status, while there is no significant difference for features like video and photo.

Fig. 3 Classification of rumor and non-rumor based on user behavior features

Fig. 4 Comparison between random forest and Naïve Bayes

4.2 Performance Metrics

To appraise the performance of the method used in this paper, we use precision, recall, and F-score as evaluation metrics. The precision P is the fraction of correctly determined rumors among all the microblogs detected as rumors. Recall R is the fraction of correctly determined rumors among all the rumor microblogs. The F-score is the harmonic mean of recall and precision. Precision, recall, and F-score are defined as follows:

P = |correctly determined rumors| / |microblogs detected as rumors|

R = |correctly determined rumors| / |rumor microblogs|

F = (2 · P · R) / (P + R) × 100%

where |·| denotes the number of elements in a set.
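For example, given hypothetical confusion counts, the metrics can be computed directly from these formulas:

# Hypothetical counts of true positives, false positives and false negatives.
tp, fp, fn = 96, 3, 4
P = tp / (tp + fp)              # precision
R = tp / (tp + fn)              # recall
F = 2 * P * R / (P + R) * 100   # F-score, expressed as a percentage
print(f"P={P:.2f} R={R:.2f} F={F:.1f}%")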


Figure 4 depicts the test results of the Random Forest algorithm. The recall, precision, and F-score of rumor detection based on users' behavior are 0.96, 0.97 and 0.9723, respectively. The precision, recall, and F-score are higher than those of the Naïve Bayes approach, which implies that our method and attribute set not only improve the prediction precision but also recognize a greater number of rumors than Naïve Bayes.
The comparison of Random Forest with the Decision Tree algorithm is shown in Fig. 5. The recall, precision and F-score of rumor detection based on the Decision Tree algorithm are 0.75, 0.86 and 0.80, respectively. Compared with the Decision Tree, Random Forest improves recall, precision, and F-score by 0.21, 0.11 and 0.17, respectively.
Figure 6 describes the experimental result of Random Forest versus Deep Learning. The recall, precision, and F-score of rumor detection using Deep Learning on the eleven features are 0.52, 0.78 and 0.63. It is evident that the Random Forest algorithm is better at detecting rumors than the Deep Learning algorithm.
Figure 7 compares the accuracy of the different machine learning algorithms. From this figure, the Random Forest algorithm gives better accuracy than Naïve Bayes, Decision Tree, and Deep Learning.

Fig. 5 Comparison between random forest and decision tree

Fig. 6 Comparison between random forest and deep learning

Fig. 7 Overall accuracy performance of algorithm



5 Conclusion

Online social media and microblogging are great platforms to broadcast information among active users, but some users misuse them by spreading fake information (rumors) on these sites. In this paper, we inspected the rumor detection problem in the microblog framework. To identify rumors and differentiate them from other posts, we applied four machine learning algorithms: Decision Tree, Naïve Bayes, Deep Learning, and Random Forest. The algorithms were examined on a Facebook data set, with bar-chart comparisons across various attributes such as page total like, type, category, post month, etc. Users' behaviors differ on rumors and original posts. The performance metrics for all four algorithms were recorded, and we conclude that Random Forest produced higher values for precision, recall and F-score and higher accuracy compared to the other algorithms.

References

1. A. Friggeri, L.A. Adamic, D. Eckles, J. Cheng, Rumor cascades, in Proceeding of 8th


International AAAI Conference on Weblogs Social Media (2014), pp. 101–110
2. M. Mendoza, B. Poblete, C. Castillo, Twitter under crisis: can we trust what we RT? in 1st
Workshop Social Media Analytics (2010), pp. 71–79
3. F. Chierichetti, S. Lattanzi, A. Panconesi, Rumor spreading in social networks, in Automata,
Languages and Programming (2009), pp. 375–386
4. J. Kostka, Y.A. Oswald, R. Wattenhofer, Word of mouth: rumor dissemination in social
networks, in Structural Information and Communication Complexity (2008), pp. 185–196
5. C. Xiao, D.M. Freeman, T. Hwa, Detecting clusters of fake accounts in online social networks, in
Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security (2015), pp. 91–101
6. A.J. Sarode, A. Mishra, Audit and analysis of imposters: an experimental approach to detect
fake profile in online social network, in Proceedings of the Sixth International Conference on
Computer and Communication Technology (2015), pp. 1–8
7. S. Shelke, V. Attar, Source detection of rumor in social network—a review. ELSEVIER-Online
Soc. Netw. Media 9, 30–42 (2019)
8. A. Kumar, S.R. Sangwan, Rumor detection using machine learning techniques on social media,
in Proceedings of the International Conference on Innovative Computing and Communications
(ICICC 2018), Lecture Notes in Networks and Systems 56 (2018), pp. 213–221
9. P. Dewan, S. Bagroy, P. Kumaraguru, Hiding in plain sight: characterizing and detecting mali-
cious Facebook pages, in IEEE/ACM International Conference on Advances in Social Networks
Analysis and Mining (2016), pp. 20–26
10. G. Liang, W. He, C. Xu, L. Chen, J. Zeng, Rumor identification in microblogging systems
based on users behaviors. IEEE Trans. Comput. Soc. Syst. 99–108 (2016)
11. Q. Zhang, S. Zhang, J. Dong, J. Xiong, X. Cheng, Automatic detection of rumor on social
network, in Natural Language Processing and Chinese Computing (2015), pp. 113–122
12. J. Jiang, S. Wen, S. Yu, Y. Xiang, W. Zhou, Rumor source identification in social networks with time-varying topology. IEEE Trans. Dependable Secure Comput. 15(1), 166–179 (2018)
13. R. Dayani, N. Chhabra, T. Kadian, R. Kaushal, Rumor: detecting misinformation in twitter
(2015)
14. S. Hamidian, M. Diab, Rumor detection and classification for twitter data, in The Fifth Inter-
national Conference on Social Media Technologies, Communication, and Informatics (2015),
pp. 71–77

15. Z. Zhao, P. Resnick, Q. Mei, Enquiring minds: early detection of rumors in social media from
enquiry posts, in Proceedings of WWW (2015)
16. H. Jeyaprakash, M.K. Kavitha Devi, S. Geetha, A comparative review of various machine learning approaches for improving the performance of stego anomaly detection, in Handbook of Research on Network Forensics and Analysis Techniques (IGI Global), pp. 351–371
Ontological Approach to Analyze
Traveler’s Interest Towards Adventures
Club
Toseef Aslam, Maria Latif, Palwashay Sehar, Hira Shaheen, Tehseen Kousar
and Nazifa Nazir

Abstract Tourism has now become an industry, and many adventure clubs offer different adventure trips for travelers. There is therefore a need for a system that prioritizes travelers' choices and gives them all the needed information in one place. We propose a new travel recommender system for adventurers, which provides a tailored solution to user demands. It is a knowledge-base system, a smart recommender with a domain-specific ontology. Queries can be constructed in natural language, and a related query management strategy is developed. The solution space is searched from two perspectives: user demand and offer relevance. Many systems work from this perspective, but in Pakistan no such system exists that prioritizes user choices. Pakistan is a country rich in adventure tourism and a good choice for international adventurers, so a system based on user interests and their desired information is needed. In the proposed system, we deal with the main problems that adventurers face in terms of information search and decision-making processes, according to the domain ontology.

Keywords System for travel recommendation · Knowledge base system · Ontology · Similarity computation · Queries matching

T. Aslam (B) · M. Latif · P. Sehar · H. Shaheen · T. Kousar · N. Nazir


University of Lahore, Gujarat Campus, Lahore, Pakistan
e-mail: [Link]@[Link]
M. Latif
e-mail: [Link]@[Link]
P. Sehar
e-mail: [Link]@[Link]
H. Shaheen
e-mail: hirash79@[Link]
T. Kousar
e-mail: tehseenbashir75@[Link]
N. Nazir
e-mail: naziifa008@[Link]


1 Introduction

Nowadays, users can find a lot of information on the Internet. That is why sometimes
it becomes a hard and complex task to select the information a user is interested in.
The user is often unable to look through all the available information. Therefore,
highly interesting information can get lost in the middle of a sea of data. Users can
have access to a huge deal of data and information related to a specific place and
adventure activities but surely they will prefer to filter that information and get those
elements or activities that match their particular interests.
Travelers these days are getting more used to turning to new technologies when planning a trip. This reality can be explained by the fact that the Internet is part of our daily life. For this purpose, several adventure clubs and companies that offer varied touristic and adventure information about destinations have been set up. But most of the time, adventurers fail to find their desired destination; they also fail to find a unique location for their favorite adventure. They want to discover new locations and more fun in adventure, but most of the information is repetitive and they cannot find what they want.
For the purpose of improving traveling experiences, we propose a new travel recommender system for travelers, which provides a tailored solution to user demands. It is a knowledge-base system, a smart recommender with a domain-specific ontology. Ontological systems supply personalized information to users. We use the ontological approach because an ontology is basically a set of relations between different instances, and we use these relations to enhance our results. In other words, the system selects the most suitable options from a large list of offers by taking the user's profile and interests into account.

2 Existing Systems

The already existing systems and research on traveler ontologies are mainly for destinations or for search optimization. The papers discussed below are explained briefly, along with the problems they identify and the solutions they give. Lemnaru et al. [1] have done work on a case-based-reasoning ontology system. They propose an ontology-based system for user demands; it is a hybrid system with a domain-based ontology. Queries are constructed in natural language or are template-based, and the system considers two dimensions, the travel description and the travel solution. Hawalah et al. [2] have done work on a semantic knowledge-base ontology system, which uses a user ontology file for the recommendation of the trip. Tomai et al. [3] draw
on previous work on trip planning in the context of web services; their system is an ontology-based Web portal for tourism. Choi et al. [4] recommend a travel ontology based on the Semantic Web; they use OWL to build their ontology, and the system gives users the opportunity to choose their preferences. Hua-li et al. [5] note that while a trip is in progress, sudden events may force travelers to completely reschedule and to search for alternatives. In order to facilitate semantic matching between alternative touristic sites and the user context, a specific vocabulary for the tourism domain, user type, time and location is required; the paper shows that existing tourism ontologies can hardly fulfil this goal, as they basically concentrate on domain concepts. Bahramian et al. [6] observe that tourists have time and budget limitations and problems in selecting points of interest. The available information is overwhelming, and it is difficult for a tourist to select the most appropriate options considering their preferences. A content-based recommendation framework is proposed which utilizes information about the user's choices, computes a degree of matching, and selects the items with the greatest closeness to the user's choices. The proposed content-based recommender framework is improved using ontological information about tourism spots to model both the user profile and the recommendable items. de
Lange et al. [7] note that many people use the Internet to book their trips, and the evolution of the Internet has made trip booking much easier. In this work, they identify the common sets of concepts appearing in online travel-related websites and, based on the findings, develop an ontology which represents this common set of concepts and their relations. Maedche et al. [8] discuss how the gap between the two boundaries might be narrowed; the objective is to semantically interlink currently disconnected pieces of information so as to lessen the user's burden of finding and understanding it. Missikoff et al. [9] portray three essentials of the harmonization effort: interoperability, ontologies, and mediators, and draw a vision of a future electronic tourism market based on these essentials. Mar et al. [10] propose a strategy for extracting semantic content from textual web documents to automatically instantiate a domain ontology. Karoui et al. [11] deal with the automation of the ontology building process from HTML pages; the suggested methodology is based on the complementary use of two approaches. Ananthapadmanaban et al. [12] investigate building a refined user-profile ontology which can improve the process of searching for the ideal tourism package by analyzing the user's interests with the help of a user ontology for the tourism industry; through the user-profile ontology, they can deduce a tourist's specific area of interest. Moreno et al. [13] present a numerical assessment of relationships for an
ontology framework; the recommended framework has been fully designed and implemented in the science and technology park of tourism and leisure. Jakkilinki et al. [14] present the underlying structure and operation of a semantic web-based intelligent tour planning tool. Prantner et al. [15] present a tourism ontology and semantic management framework, reporting some preliminary results of the OnTourism project; the results presented in the paper identify publicly available tourism ontologies and existing freely available ontology management tools for the tourism domain. Cardoso [16] discusses one important kind of e-tourism application that has surfaced lately: dynamic packaging systems. Knublauch [17] presents some initial considerations on software architecture and an improved approach for Web services and agents for the Semantic Web; this architecture is driven by formal domain models. Barta et al. [18] present a semantic space arrangement for tourism based on a method with modularized ontologies.

3 Our System

We have discussed 18 such systems that are purely related to tourism and adventure-trip ontologies. All the systems are mainly related to search optimization and ontology approaches; some are for online booking and some for gathering desired information about destinations and adventure activities. Our proposed system is based on an ontology. In a recommender system, the ontological domain permits the classification of the objects to be suggested. In our recommender system, we consider that each object is an instance of one (or several) of the lowest-level classes of the ontology, and we use the ontological domain to represent the user's preferences. In this sense, concepts are represented as subsets of the domain in which users can be interested, considering that the degree of interest can be different for each concept. Our suggested system is based on adventurers' interests and needs. We got feedback from travelers and created a suggested interface for adventure clubs. Our suggested interface gets 20% better results than other existing systems.

4 Methodology

Unlike other systems developed using the ontology approach, which do not focus on the user's interests and likes, this research recommends a system to overcome this problem by focusing on user interests and desires. A feature common to the vast majority of recommender systems, including ours, is the use of profiles that represent the information needs and interests of the users. In this way, user profiles become a key piece of recommender systems in order to obtain efficient filtering; inadequate profile modeling can lead to poor-quality and barely relevant recommendations for the users. As described in the previous section, the ontological domain classifies the objects to be suggested and represents the user's preferences, with a possibly different degree of interest in each concept. We propose a solution to this problem. For this, we conducted a questionnaire among users and obtained different results. We used these results in making an ontology-based user-interface suggestion system for travelers, as well as for adventure clubs. The main factors we used for our system are the user, interest in different activities, offers, location, and free services provided by adventure clubs and tourism organizations. Our system gives a proper result after checking these parameters. For making the ontology, we used the Protégé tool and created relations between different entities to get the required results. Our system is smart enough to produce optimized results; it helps users and suggests an appropriate trip plan and offers according to their interests.
We conducted an online survey and got the results after evaluating choices from an online form created on Google. Our participants knew the purpose of our survey and fully agreed to share the results of their answers. Forms were filled in by users from different parts of society. The survey was conducted on different persons, and we then converted the numerical results into graphical form. We created a questionnaire for the survey because we wanted to know about user preferences and desired interests; with the help of the questionnaire, we obtained information about user interests. This is the basis of our research and of the recommended online web system. By analyzing the questionnaire, we obtained the latest traveler trends towards adventure clubs and the parameters needed to develop our algorithm and recommender system for enhancing users' interesting information and facilities. We also obtained the preference parameters which users like most. These parameters are used in the following algorithm.

5 Algorithm

We created an algorithm based on the data collected from our survey. In order to understand the algorithm, the following parameters need to be known. Table 1 contains the names of the parameters used in the algorithm and the values they hold; their data types are also given below.

Table 1 Parameters description table

Name | Type | Description (in case of traveler) | Description (in case of adventure club)
Age | Integer | Contains the age of the traveler | An age-wise list will be provided
Gender | String | Contains the gender of the traveler | A gender-wise list will be provided
Profession | String | Contains the profession of the traveler | Contains possible professions of travelers
Tourists main interests | String | Contains the main interests of the traveler | Contains interests that the adventure club is offering
Event must be organized | String | Contains the event that must be organized for the traveler | Contains events that the adventure club is offering
Mode of communication | String | Contains the mode of communication used to inform the traveler about events | Contains the software used to send messages from the adventure club
Tourists guide services | Boolean | Contains the tourist guide services for the traveler | Tourist guide from the adventure club
Preference of travel agency | Boolean | Contains the traveler's preference for a travel agency | Contains the option of availing a traveling agency or not
Kind of accommodation | String | Contains the kind of accommodation for the traveler | Contains the location of the adventure club

This table consists of the names of the parameters and the data types of the values stored in them. A description is provided for each parameter in the context of the traveler as well as the adventure club. Separate parameters are not used, as that would consume extra space and would not be efficient. The name of the proposed algorithm is "Optimized adventure club", and it contains different parameters on which functions are performed to obtain the required results. The following abbreviations are used in the algorithm.

6 Declaration Algorithm

• Declaration:
Int A = ɸ
Boolean TGS = ɸ
Boolean PTA = ɸ
String G = NULL, P = NULL, TMI = NULL, EMO = NULL, MC = NULL, KA = NULL;

• Input (T):
Retrieve (G, A, P, TMI, MC);

• Data from (AC):
Retrieve (KA, PTA, EMO, TGS);

• Processing:
Search String = G ∧ A ∧ P ∧ TMI ∧ MC
IF (Search String == True)
    Names = Function Suggestion ();
    Print (Names);

• Suggestion Function:
String Function Suggestion ()
{
IF (T.G=AC.G ∧ T.A=AC.A ∧ T.TMI=AC.TMI ∧ T.P=AC.P ∧ T.MC=AC.MC
    ∨ T.TGS=AC.TGS ∨ T.EMO=AC.EMO ∨ T.PTA=AC.PTA ∨ T.KA=AC.KA)
    Return Suggestions;

ELSE IF (T.G=AC.G ∧ T.A=AC.A ∧ T.TMI=AC.TMI ∧ T.P=AC.P ∧ T.MC=AC.MC
    ∧ T.TGS=AC.TGS ∨ T.EMO=AC.EMO ∨ T.PTA=AC.PTA ∨ T.KA=AC.KA)
    Return Suggestions;

ELSE IF (T.G=AC.G ∧ T.A=AC.A ∧ T.TMI=AC.TMI ∧ T.P=AC.P ∧ T.MC=AC.MC
    ∧ T.TGS=AC.TGS ∧ T.EMO=AC.EMO ∨ T.PTA=AC.PTA ∨ T.KA=AC.KA)
    Return Suggestions;

ELSE IF (T.G=AC.G ∧ T.A=AC.A ∧ T.TMI=AC.TMI ∧ T.P=AC.P ∧ T.MC=AC.MC
    ∧ T.TGS=AC.TGS ∧ T.EMO=AC.EMO ∧ T.PTA=AC.PTA ∨ T.KA=AC.KA)
    Return Suggestions;

ELSE IF (T.G=AC.G ∧ T.A=AC.A ∧ T.TMI=AC.TMI ∧ T.P=AC.P ∧ T.MC=AC.MC
    ∧ T.TGS=AC.TGS ∧ T.EMO=AC.EMO ∧ T.PTA=AC.PTA ∧ T.KA=AC.KA)
    Return Suggestions;
}

The algorithm describes how the search is optimized to give the best results to the traveler. In the declaration phase, the parameters are declared and nullified so that they do not contain any garbage value. In the second and third steps, data is retrieved from the traveler and from the adventure club database, and values are assigned to the variables according to Table 2.
In the processing phase, a search string is declared which shows that gender, age, profession, tourist main interests and mode of communication must be given by the traveler in order to continue the search; these values are later compared with the values taken from the adventure club database against the same variables.
Table 2 Abbreviations used in the algorithm

| Name | Abbreviation |
| Age | A |
| Profession | P |
| Gender | G |
| Tourists main interests | TMI |
| Event must be organized | EMO |
| Mode of communication | MC |
| Tourists guide services | TGS |
| Preference of travel agency | PTA |
| Kind of accommodation | KA |

event to be organized. The function "Suggestion" keeps comparing the values given by the traveler for the parameters described in the table with the values taken from the adventure club database for the same parameters, until the results improve and satisfy the requirements of the tourist.

7 Model Building of Personalized Recommendation Based on Ontology

We designed an ontology using the Protégé 4.3 tool on the basis of the algorithm described above. Different classes, subclasses, data properties, and object properties were created in this ontology. The first tab we start with is the Classes tab: classes are the major building blocks (the "nouns") of the ontology, and classes and subclasses are created to structure the designed algorithm. The subclasses of the adventure club class are events, guide services, kind of accommodation, and travel agency. The subclass "event" has three further subclasses: social events, adventurous events, and cultural events. The subclass "kind of accommodation" also has three further subclasses: camping, hotels, and resorts. The user class has the subclasses age, gender, interests of the user, preferred mode of communication, and profession of the user. The subclass "gender" has two further subclasses: male and female. The subclass "mode of communication" has three further subclasses: via e-mail, via text message, and via phone call. The subclass "profession" has three further subclasses related to the user's profession: employee, businessman, and student. Object properties define the relations (predicates) between two objects (also called individuals) in an OWL ontology; they are used to construct links between classes and subclasses. Domain and range contain the names of classes and subclasses, and properties are further subcategorized.
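As a minimal illustration (not the authors' Protégé project), the same class hierarchy can be sketched in Python with the owlready2 library; the IRI and file name below are hypothetical:

```python
from owlready2 import Thing, get_ontology

onto = get_ontology("http://example.org/adventure_club.owl")  # hypothetical IRI

with onto:
    class AdventureClub(Thing): pass
    class Event(AdventureClub): pass
    class SocialEvent(Event): pass
    class AdventurousEvent(Event): pass
    class CulturalEvent(Event): pass
    class KindOfAccommodation(AdventureClub): pass
    class Camping(KindOfAccommodation): pass
    class Hotel(KindOfAccommodation): pass
    class Resort(KindOfAccommodation): pass

    class User(Thing): pass
    class Gender(User): pass
    class Male(Gender): pass
    class Female(Gender): pass

onto.save(file="adventure_club.owl")  # serializes the hierarchy to OWL
```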

8 Ontology Graph

The Protégé tool is used to automatically create the classes and subclasses and link them with each other. Figure 1 displays the detailed model of the user class and the adventure club class; the connecting lines show the links between classes and subclasses.

9 Database Architecture

For a proper understanding of the ontology database, we present a design that illustrates how the ontology data works.

Fig. 1 Ontology graph

Figure 2 shows the system architecture. The names of the desired destinations and the offers appropriate for users according to their preferences are shown to the users on the output device.

10 Prototype Interface

Content-based recommendation methods generate recommendations of domain objects matched to the user's preferences. All users of the knowledge-base system indicate, at the beginning of the session, their interest in a number of general motivations. The system then analyzes the users' actions and shapes the initial preferences with the help of machine learning algorithms. Recommendations are established on a direct correspondence between the characteristics of the suggested activities and the user's interest in each of these characteristics. The knowledge-base system contains a database of all the tourist activities available in the region. The main objective pursued by the recommender system is to assign a degree of liking to each concept of the ontology, which makes it possible to calculate the interest that the user has in each activity.

Fig. 2 System architecture

The system also stores the degree of confidence in each concept. This degree of
confidence will depend on the evidence received from the user.
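The paper gives no formulas for these scores; as a hedged sketch, a confidence-weighted liking score per activity could look like this (all names are illustrative):

```python
# liking and confidence map ontology concept names to values in [0, 1].
def activity_interest(activity_concepts, liking, confidence):
    weights = [confidence.get(c, 0.0) for c in activity_concepts]
    scores = [liking.get(c, 0.0) * w for c, w in zip(activity_concepts, weights)]
    return sum(scores) / max(sum(weights), 1e-9)  # confidence-weighted mean

print(activity_interest(["Camping", "AdventurousEvent"],
                        {"Camping": 0.9, "AdventurousEvent": 0.6},
                        {"Camping": 1.0, "AdventurousEvent": 0.5}))
```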

11 Results

To evaluate the designed prototype, we performed a controlled experiment in which participants were required to answer a set of questions based on their experience, responding on a 1–5 scale.
Figure 3 shows the combined results for all questions, where P1–P20 denote the persons who participated in the controlled experiment. All participants took part willingly, knew about our study plan, and appreciated our efforts and the project. The first question concerned the coverage of elements, accuracy, and usability of the proposed interface, and more than 85% agreed that we improve efficiency and results. Participants also strongly agreed with the second question, that we improve the results for their desired destination. They also agreed that our interface is easy

Fig. 3 Result graph

to use and understand. For the last question, "What do you think that our interface will provide an exact match to an individual's query?", most participants chose neutral. Overall, the results show that we improve results from the perspective of the user's interest, and the basic purpose of this research, focusing on the user's interest, is achieved.

12 Conclusion

In the modern era, people's attraction towards adventure tourism (discovering places) has increased widely. In existing studies, one of the problems associated with tourism is finding the places best suited to an individual's interests. For this purpose, many research efforts have been carried out in which recommendation-based search systems are proposed. In these systems, input from the user about their traveling preferences is extracted, and the most suitable destinations are suggested. These solutions still have certain limitations, such as considering few parameters of the user's traveling interest, and they do not directly fit the context of Pakistan. In this research, we have therefore proposed a recommendation system using an ontology-based approach. First, we gathered information about travelers' interests in selecting their destinations using questionnaires. Based on the questionnaire results, the most important parameters were extracted, from which we designed an algorithm using the concept of an ontological approach. We then designed an interface prototype as per the proposed algorithm. To evaluate the usability, accuracy, and efficiency of the interface, we performed a controlled experiment in which a questionnaire was filled in by the users to capture their experience with our

designed prototype. The results show that all evaluation parameters are satisfied and that the recommended destinations suit the individual's interests.

Acknowledgements Pakistan is a beautiful country with great scenery, and the government of Pakistan wants to enhance the tourism industry in the country, so there is a special need for a system which fulfills users' requirements. We conducted a survey of 120 persons, collected their choices, created the system, and then conducted a second survey whose detailed results are given. In this study, Khizar Hameed and Junaid Haseeb guided us and checked our results. All participants knew the purpose of our study and were happy to participate. We present our study with the help of the Computer Science Department, University of Lahore.

Performance Analysis of Off-Line
Signature Verification

Sanjeev Rana, Avinash Sharma and Kamlesh Kumari

Abstract To reduce fraud in financial transactions, signature verification is important for security purposes. In this paper, an attempt has been made to analyze the performance of off-line handwritten signature verification using image-based features. Photocopies and scanned documents are considered the best possible evidence in situations when the original documents are either lost or damaged, although photocopies are filtered images of the original information and do not reproduce details as in the original documents. In this paper, a combination of four features, i.e., average object area, mean, Euler number, and area of the signature image, is used to verify the signature. The publicly available database BHSig260 is used, which contains two types of signatures, i.e., Bengali and Hindi. The proposed work shows that the accuracy of Hindi off-line signature verification is 78.5% with a sample size of 15, and the accuracy of Bengali off-line signature verification is 69.1% with a sample size of 20.

Keywords K-nearest neighbor (KNN) · Support vector machine (SVM) · Graphics processing unit (GPU) · Forensic handwriting expert (FHE) · Neural network (NN)

1 Introduction

A reliable scheme that can uniquely identify individuals is only possible through the use of biometrics. Biometric traits are mainly categorized into two categories: physiological and behavioral.
An FHE uses the signature's graphic form, measurable elements of the signature such as the distance between letters, angles of strokes, and size of loops, and a microscopic stereoscope to analyze the signature. In recent years, deep neural networks have been used for image recognition. These networks require big data, more storage, and a GPU for computation, and for off-line signature verification no database with a large number of samples is available. In this paper, an attempt has been made to analyze the performance of Bangla and Hindi signature verification using only four features (i.e., average object
S. Rana · A. Sharma · K. Kumari (B)


Department of Computer Science Engineering, M.M.D.U, Mullana, Ambala, India
e-mail: savyakamlesh@[Link]


area, mean, Euler number, and area of the signature image). Signature verification is affected by the number of signers and the number of samples available for each signature, as shown in the experimental section of the paper.
The paper is organized as follows: Sect. 1 gives the introduction, Sect. 2 presents related work, Sect. 3 presents the methodology, Sect. 4 describes the experimentation, and Sect. 5 presents the conclusion.

2 Related Work

The performance of an off-line signature authentication system depends on various factors, i.e., the feature extraction method, the classifier used, and the image quality of the scanned signature [1, 2]. The features used by researchers in off-line signature verification are shown in Table 1.

Table 1 Features used in offline signature verification

| Feature extraction | Offline database | Classifier | Refs. |
| Average object area and entropy | CEDAR | NN, KNN, SVM | [3] |
| Euler no., mean, average object area, entropy, standard deviation and area | CEDAR | KNN, SVM and Boosted tree | [4] |
| Undersampled bitmap, end point, direction chain code | Bangla signature | NN | [5] |
| Gaussian grid feature | Bangla signature | SVM | [6] |
| Gradient feature, Zernike moment | Hindi signature | SVM | [7] |
| ULBP | BHSig260 | NN | [8] |
| CNN based | BHSig260, CEDAR, GPDS | CNN | [9] |
| SIFT and LBP | — | NN | [10] |
| CNN | GPDS960 | CNN | [11] |
| CNN | Brazilian, GPDS | SVM | [12] |

3 Methodology

3.1 Gathering Database

In our research work, Bangla and Hindi signatures from BHSig260¹ are used. The sizes of the images in the database are not the same, so we resized all images to 96 × 96. Samples from three users of the BHSig260 dataset are shown in Fig. 1.

3.2 Feature Extraction

Combinations of four features (average object area, mean, area, and Euler number) are extracted from the image difference of each genuine–genuine signature pair and each genuine–forged signature pair in Matlab 2015 [13]. The variation of the feature values of genuine signature pairs of one user across different samples is shown in Table 2, and Table 3 shows the corresponding values for genuine–forged pairs.
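The paper extracts these features in MATLAB; a rough Python equivalent using scikit-image is sketched below (the Otsu binarization and the absolute-difference pairing are our assumptions, since the exact MATLAB routines are not listed):

```python
import numpy as np
from skimage import filters, io, measure
from skimage.transform import resize

def signature_features(path):
    img = resize(io.imread(path, as_gray=True), (96, 96))   # resize to 96 x 96
    binary = img < filters.threshold_otsu(img)               # ink pixels = True
    labels = measure.label(binary, connectivity=2)
    n_objects = max(labels.max(), 1)
    area = binary.sum()                                      # total ink area
    avg_obj_area = area / n_objects                          # mean component area
    euler = measure.euler_number(binary, connectivity=2)     # scikit-image >= 0.19
    return np.array([avg_obj_area, img.mean(), euler, area], dtype=float)

# Pair feature vector approximated from the difference between two signatures:
# pair = np.abs(signature_features("sig_a.png") - signature_features("sig_b.png"))
```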

Fig. 1 Sample signatures from BHSig260: genuine and forged examples for Bangla (top) and Hindi (bottom)

1 [Link]
164 S. Rana et al.

Table 2 Feature values for genuine signature pair (Bangla)


Average object area Mean Euler No. Area
13.57778 0.066298 −4 650.25
10.07018 0.062283 0 614.75
9.923077 0.05599 −5 554.625
13.2 0.064453 −4 632.25
18.08333 0.070638 −14 693.875
12.43902 0.055339 3 543.125
12.19149 0.062174 −10 612.5
13.19512 0.058702 −4 577.875
15.02703 0.06033 −3 590.25
11.63265 0.061849 −8 608.625
8.440678 0.054036 6 533.125
14.88235 0.054905 −9 540
12 0.05599 1 549.75
11.96154 0.067491 −10 662.5
15.77143 0.059896 −8 588
10.68 0.057943 −2 570.75
12.24444 0.059787 −14 593.375
14.69048 0.066949 −9 660
16.18421 0.066732 −19 661.375

Table 3 Feature values for genuine–forged signature pairs (Bangla)

Average object area Mean Euler No. Area
15.35849 0.088325 −9 870
14.55172 0.09158 4 897.375
20 0.088976 −10 869.25
13.10714 0.079644 1 781.5
19.85106 0.101237 −3 988.5
13.63636 0.097656 −5 958.25
14.27419 0.096029 −7 942.75
11.02817 0.084961 15 833.75
11.87324 0.091471 3 895.5
13.95 0.09082 −6 895.75
15.05263 0.093099 0 910.875
14.28571 0.086806 10 848.75
13.27778 0.103733 6 1014.125
14.21311 0.094076 −8 924.375
13.62319 0.101997 2 997.25
15.20968 0.102322 −4 1006
15.44828 0.097222 −8 952.625
13.66102 0.087457 8 857
12.76563 0.08865 1 873


3.3 Classifier

Three classifiers are used, i.e., SVM, KNN, and Boosted Tree.

4 Experimental Setup

Five-fold cross-validation is used for all reported results. K-nearest neighbor (KNN), SVM, and Boosted Tree classifiers are used for classification.
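A minimal scikit-learn sketch of this setup (not the authors' MATLAB code; X and y stand for the 4-dimensional pair features and their genuine/forged labels, shown here as placeholders):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # placeholder for the real pair features
y = rng.integers(0, 2, size=200)       # 1 = genuine-genuine, 0 = genuine-forged

classifiers = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Boosted tree": GradientBoostingClassifier(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)           # five-fold CV
    print(f"{name}: {100 * scores.mean():.1f}% accuracy")
```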

4.1 Experiment Using Combinations of Four Features (Average Object Area, Mean, Euler No. and Area)

Figures 2 and 3 show that the accuracy of SVM is high compared to Boosted Tree and KNN, and that the accuracy for Hindi signatures is greater than for Bengali signatures. Signature verification is affected by the number of signers: with more signers, the accuracy decreases for both Bengali and Hindi signatures.

4.2 The Effect of Varying Sample Size

Signature verification is affected by the number of samples available for each signature. Figure 4 shows that the accuracy of SVM is 69.1% for Bengali signatures with a sample size of 20, and Fig. 5 shows that the accuracy of SVM is 78.5% for Hindi signatures with a sample size of 15. The results for Hindi signature verification are again better than for Bangla signature verification.

Fig. 2 Accuracy of Bengali signature verification

Fig. 3 Accuracy of Hindi signature verification

Fig. 4 Accuracy of Bengali signature verification



Fig. 5 Accuracy of Hindi signature verification

4.3 The Effect of User Dependent Model

As the accuracy of signature verification depends on the number of signers, as shown in Figs. 2 and 3, an experiment was also performed per signer: we trained a model for each user and observed increased accuracy compared to a single model for multiple users. In some cases, we obtained accuracies of more than 90%, as shown in Tables 4 and 5.

Table 4 Accuracy (%) for each user model (Bengali)

User SVM KNN Boosted tree
1 95.7 100 45.7
2 82.6 91.3 89.1
3 71.7 80.4 89.1
4 60.9 67.4 71.7
5 47.8 52.2 63
6 80.4 84.8 80.4
7 80.4 87 91.3
8 95.7 100 97.8
9 84.8 97.8 45.7
10 84.8 95.7 93.5
11 78.3 82.6 84.8
12 76.1 80.4 80.4
13 69.6 76.1 80.4
14 58.7 63 67.4
15 91.3 93.5 95.7
16 67.4 71.7 76.1
17 84.8 78.3 78.3
18 60.9 65.2 78.3
19 84.8 87 76.1
20 89.1 100 45.7
21 82.6 84.8 89.1
22 93.5 97.8 89.1
23 54.3 65.2 71.7
24 87 87 91.3
25 97.8 97.8 45.7
26 84.8 84.8 84.8
27 82.6 87 84.8
28 69.6 67.4 80.7
29 80.4 87 78.3
30 80.4 76.1 82.6
31 91.3 97.8 45.7
32 82.6 84.8 87
33 89.1 91.3 67.4
34 54.3 63 54.3
35 76.1 80.4 87
36 93.5 97.8 89.1
37 45.7 37 52
38 63 56.5 65.2
39 73.9 78.3 80.4
40 65.2 80.4 76.1
41 91.3 95.7 95.7
42 73.9 73.9 78.3
43 87 84.8 73.9
44 82.6 80.4 82.6
45 82.6 78.3 80.4
46 100 100 45.7
47 73.9 78.3 78.3
48 76.1 80.4 82.6
49 78.3 89 93.5
50 82.6 80.4 73.9

Table 5 Accuracy (%) for each user model (Hindi)

User SVM KNN Boosted tree
1 84.8 93.5 91.3
2 63 67.4 76
3 56.5 65.2 67.4
4 78.3 78.3 80.4
5 69.6 73.9 87
6 80.4 84.8 91.3
7 97.8 100 45.7
8 50 37 30.4
9 95.7 95.7 45.7
10 58.7 56.5 65.2
11 97.8 97.8 45.7
12 67.4 63 65.2
13 78.3 78.3 87
14 82.6 84.8 91.3
15 63 69.6 76.1
16 95.7 100 45.7
17 52.2 45.7 67.4
18 93.5 95.7 82.6
19 89 93.5 91.3
20 93.5 91.3 78.3
21 45.7 41.3 45.7
22 78.3 87 87
23 58.7 56.5 54.3
24 87 89.1 45.1
25 65.2 60 71.7
26 87 91.3 84.8
27 89.1 84.8 80.4
28 73.9 80.4 84.8
29 84.8 87 84.8
30 76 78.3 80.4
31 71.7 71.7 69.6
32 60.9 60.9 60.9
33 87 78.3 84.8
34 76 87 84.8
35 82.6 93.5 84.8
36 82.6 87 78.3
37 95.7 100 45.7
38 78.3 76.1 78.3
39 78.3 84.8 80.4
40 84.8 91.3 45.7
41 76 91.3 45.7
42 52.2 60.9 63
43 73.9 76.1 71.7
44 71.7 71.7 80.4
45 69.6 71.7 71.7
46 43.5 50 66.9
47 41.3 41.3 41.3
48 65.2 52.2 63
49 84.8 84.8 82.6
50 84.8 80.4 84.8

5 Conclusion

In this paper, a performance analysis of off-line signature verification using image-based features, i.e., a combination of four features, has been presented. The results for Hindi off-line signature verification are superior to those for Bangla off-line signature verification under the writer-independent model. The proposed work shows that the accuracy of Hindi off-line signature verification is 78.5% with a sample size of 15, and the accuracy of Bengali off-line signature verification is 69.1% with a sample size of 20. For the per-user models, we obtained accuracies above 90% in some cases. As far as storage is concerned, only four features are used, which keeps the storage complexity low. A drawback of image-based feature extraction is that the dynamic characteristics of a handwritten signature (such as writing speed) cannot be extracted from a static image. In the near future, we plan to extend our work to Persian off-line signature verification and other language databases.

Acknowledgements The authors are grateful to the anonymous reviewers for their constructive
comments which helped to improve this paper.

References

1. K. Kumari, V.K. Shrivastava, Factors affecting the accuracy of automatic signature verification,
IEEE (2016)
2. K. Kumari, V.K. Shrivastava, A review of automatic signature verification, in ICTCS (2016)
3. K. Kumari, S. Rana, Writer-independent off-line signature verification. Int. J. Comput. Eng.
Technol. (IJCET). 9(4), 85–89 (2018)

4. K. Kumari, S. Rana, Offline signature verification using intelligent algorithm. Int. J. Eng.
Technol. [S.l.]. 7(4.12), 69–72 (2018). ISSN 2227-524X
5. S. Pal, A. Alaei, U. Pal, M. Blumenstein, Off-line Bangla signature verification: an empirical
study, in The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE (2013),
pp. 1–7
6. S. Pal, V. Nguyen, M. Blumenstein, U. Pal, Off-line Bangla signature verification, in 2012 10th
IAPR International Workshop on Document Analysis Systems (DAS), IEEE (2012), pp. 282–286
7. S. Pal, M. Blumenstein, U. Pal, Hindi off-line signature verification, in 2012 International
Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE (2012), pp. 373–378
8. S. Pal, A. Alaei, U. Pal, M. Blumenstein, Performance of an off-line signature verification
method based on texture features on a large indic-script signature dataset, in 2016 12th IAPR
Workshop on Document Analysis Systems (DAS), IEEE (2016, April), pp. 72–77
9. S. Dey, A. Dutta, J.I. Toledo, S.K. Ghosh, J. Lladós, U. Pal, SigNet: convolutional siamese
network for writer independent offline signature verification. arXiv preprint arXiv:1707.02131
(2017)
10. B.S. Thakare, H.R. Deshmukh, A combined feature extraction model using SIFT and LBP for
offline signature verification system, in 2018 3rd International Conference for Convergence in
Technology (I2CT ), IEEE (2018), pp. 1–7
11. M.B. Yilmaz, O.Z.T. Kagan, Hybrid user-independent and user-dependent offline signature
verification with a two-channel CNN, in 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW ), IEEE (2018), pp. 639–6398
12. V.L. Souza, A.L. Oliveira, R. Sabourin, A writer-independent approach for offline signature ver-
ification using deep convolutional neural networks features, in 2018 7th Brazilian Conference
on Intelligent Systems (BRACIS), IEEE (2018), pp. 212–217
13. [Link]
Fibroid Detection in Ultrasound Uterus
Images Using Image Processing

K. T. Dilna and D. Jude Hemanth

Abstract Uterine fibroids are abnormal growths in the wall of the uterus, and their presence can lead to infertility. Ultrasound imaging is a significant tool for detecting uterine disorders. Fibroid extraction from ultrasound-scanned images is a challenging task, considering the variation in fibroid size and position and the weakly detectable boundaries; segmentation of ultrasound images is also made difficult by speckle noise. This paper presents a method to segment uterine fibroids from ultrasound-scanned images. The method utilizes several mathematical morphology concepts to detect the fibroid region, segments the fibroid, and extracts some shape-based features.

Keywords Fibroid · Uterus · Ultrasonic imaging · Segmentation · Morphological operations

1 Introduction

The uterus is part of the human female reproductive system. A normal uterus is about 7.5 cm (3 in.) long, 5 cm (2 in.) wide and 2.5 cm (1 in.) deep; inside, it is hollow with thick muscular walls. Abnormalities seen as benign tumours that grow in the wall of the uterus are called fibroids; they are also known as uterine myomas, leiomyomas or fibromas. On average, between 20 and 50% of women of reproductive age have fibroids, although not all are diagnosed.
Types of fibroid: fibroids can be classified according to their position in the uterus or womb:

K. T. Dilna
Department of ECE, College of Engineering and Technology, Payyanur, India
D. Jude Hemanth (B)
Department of ECE, Karunya University, Coimbatore, India
e-mail: judehemanth@[Link]


a. Subserosal—towards the outside of the womb/uterus. They can cause compression of the surrounding tissues, such as the bladder and bowel.
b. Intramural—in the wall of the womb/uterus. These can cause pressure on the bladder and/or uterus, and infertility or miscarriage with heavy bleeding.
The steps for detecting abnormalities using image processing are as follows:
• Extract image data for processing
• Preprocessing to remove noise in image data
• Identify ROI
• Segment the desired area of image for analysis
• Extract the feature for classification.
Section 2 discusses previous work on the detection of uterine fibroids, Sect. 3 focuses on the methodology used in this paper, Sect. 4 presents the results and discussion, and Sect. 5 gives the conclusion.

2 Related Works

Shivakumar et al. used the GVF snake method for the segmentation of fibroids in uterus images [1]. Sriraam et al. describe automated detection of uterine fibroids using wavelet features and a neural network classifier [2]; a feedforward backpropagation neural network (BPNN) classifier is used for segmentation. Yixuan Yuan et al. proposed a novel weighted locality-constrained linear coding (LLC) method for uterus image analysis [3]. Leonardo Rundo et al. developed a semi-automatic approach that depends on a region-growing segmentation technique [4]. Bo Ni et al. used a dynamic statistical shape model (SSM)-based segmentation method [5], whose focal areas are efficiency and stability. Alireza Fallahi et al. used a two-step method for image analysis [6]: in the first step, the uterus is segmented using FCM followed by morphological operations, and in the second step a fuzzy algorithm is used to refine the result. Divya used a generalized multiple-kernel fuzzy C-means (FCM) framework for image segmentation problems [7], in which a linear combination of multiple kernels is proposed and the updating rules for the linear coefficients of the composite kernel are derived. Ratha Jeyalakshmi et al. provide mathematical morphology-based methods for automated segmentation [8]. Similar ultrasound uterus image analysis methods are available in [9–11].

3 Methodology

The proposed method has the following steps: preprocessing, segmentation and feature extraction (Fig. 1).

3.1 Fibroid Image

The input image used is an ultrasound-scanned fibroid image. Uterus ultrasound scanning is carried out either abdominally or transvaginally. Evaluation of ultrasound images can differentiate cysts from fibroids (solid tumors), but it cannot accurately determine the number, size or position of fibroids. Manual evaluation of ultrasound images is very difficult because of their high resolution and the huge number of image slices (Fig. 2).

Fig. 1 Block diagram

Fig. 2 Ultrasound fibroid image

3.2 Preprocessing

Noise reduction in image processing is known as preprocessing. Ultrasound image quality is limited by granular speckle noise. The steps of the preprocessing stage are as follows:
• Read the image
• Crop unnecessary portions
• Resize the image
• Filter the image
• Morphological operations
Speckle noise lowers the resolution and contrast of an ultrasound image, which leads to inferior accuracy. Boundary edges in the image are usually imperfect, misplaced or faint in some places, which makes ultrasound images difficult to segment. Reduction of speckle noise is therefore essential to increase ultrasound image quality, and the median filter, a nonlinear filtering method, can be applied to ultrasound images to remove the speckle noise while producing less blurred images. In a median filter, each pixel is replaced with the median value of its neighbourhood pixels.
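A minimal despeckling sketch in Python (the 3 × 3 kernel size and file name are assumptions; the paper does not state them):

```python
from scipy import ndimage
from skimage import io

img = io.imread("uterus_us.png", as_gray=True)   # hypothetical filename
denoised = ndimage.median_filter(img, size=3)    # pixel -> median of 3x3 neighbourhood
```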

3.3 Proposed Algorithm

The filtered ultrasound image, now free from speckle noise, is used as the input for fibroid detection. The steps followed in the proposed system are as follows (a code sketch is given after the list).
Step 1: The input image is transformed to a binary image based on a threshold, which is calculated from the mean value m of the pixel values in the image.
Step 2: Take the complement of the binary image obtained in step 1.
Step 3: Carry out morphological operations on the image.
Step 4: Calculate the region areas and detect the maximum-area region by measuring the properties of the image regions.
Step 5: Take the product of the maximum-area region and the image obtained in step 3.
Step 6: Extract the image area from the binary image by specifying the size.
Step 7: Carry out morphological erosion on the extracted image area.
Step 8: Multiply the input image with the output image from step 7, which yields the fibroid region.
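A sketch of steps 1–8 with scikit-image (structuring-element sizes and the minimum object size are illustrative guesses, not values from the paper):

```python
import numpy as np
from skimage import measure, morphology

def detect_fibroid(img):
    """img: 2-D grayscale array scaled to [0, 1]."""
    binary = img > img.mean()                                    # step 1
    comp = ~binary                                               # step 2
    comp = morphology.binary_opening(comp, morphology.disk(3))   # step 3
    labels = measure.label(comp)
    regions = measure.regionprops(labels)
    largest = max(regions, key=lambda r: r.area)                 # step 4
    mask = labels == largest.label                               # step 5
    mask = morphology.remove_small_objects(mask, min_size=200)   # step 6
    mask = morphology.binary_erosion(mask, morphology.disk(2))   # step 7
    return img * mask                                            # step 8
```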

3.4 Feature Extraction

The features used in this work are shape based, namely the area, perimeter, eccentricity, centroid, and major- and minor-axis lengths of the segmented region in the ultrasound fibroid image. The extracted features help in identifying the size of the fibroids: typically, smaller fibroids can be treated with medicines, whereas larger ones require surgery. A detailed explanation of these features is available in the literature.
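These shape features map directly onto scikit-image region properties; a sketch, continuing from the mask produced by the previous sketch:

```python
from skimage import measure

# `mask` is the binary fibroid segmentation from detect_fibroid above.
props = measure.regionprops(mask.astype(int))[0]
features = {
    "area": props.area,
    "perimeter": props.perimeter,
    "eccentricity": props.eccentricity,
    "centroid": props.centroid,
    "major_axis_length": props.major_axis_length,
    "minor_axis_length": props.minor_axis_length,
}
```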

4 Results and Discussion

The proposed algorithm was applied to many uterus images with fibroids on the inner wall of the uterus, and it performs well. The filtered image is shown in Fig. 3, and the result of each step of the algorithm on two different images is shown as sample outputs in Fig. 4; the original images are shown in Fig. 2.
Table 1 displays the results of the feature extraction. It can be noted that the sizes of the fibroids differ between patients, and treatment planning is based on the size of the fibroids.

Fig. 3 Filtered image using median filtering

Fig. 4 Result for each algorithm step (panels show the outputs of steps 1–8)

Table 1 Feature extraction

| Features | Image 1 | Image 2 |
| Area | 8020 | 5060 |
| Perimeter | 713.7950 | 431.3340 |
| Eccentricity | 0.5982 | 0.9475 |
| Centroid | [596.2300, 385.5382] | [423.7276, 1994.2747] |
| Minor-axis length | 114.5316 | 50.2193 |
| Major-axis length | 142.9259 | 157.0628 |

5 Conclusion

In this work, a morphology-based image processing technique is used for ultrasound uterus image analysis. Several features are extracted from these images and used for the segmentation process, and the fibroids are extracted with high efficiency, as is evident from the experimental results. As future scope, more features can be extracted and machine learning techniques can be used for the segmentation process.

References

1. S.K. Harlapur, R.S. Hegadi, Segmentation and analysis of fibroid from ultrasound images. Int.
J. Comput. Appl. 975, 8887 (2015)
2. N. Sriraam, D. Nithyashri, L. Vinodashri, P. Manoj Niranjan, Detection of uterine fibroids
using wavelet packet features with BPNN classifier, in IEEE EMBS Conference on Biomedical
Engineering & Sciences (2010)
3. Y. Yuan, A. Hoogi, C.F. Beaulieu, M.Q.-H. Meng, D.L. Rubin, Weighted locality-constrained linear coding for lesion classification in CT images, in Proceedings of 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (2015)
4. L. Rundo, C. Militello, S. Vitabile, C. Casarino, Combining split-and-merge and multi-seed region growing algorithms for uterine fibroid segmentation in MRgFUS treatments. Med. Biol. Eng. Comput. 54(7), 1071–1084 (2016)
5. B. Ni, F. He, Z. Yuan, Segmentation of uterine fibroid ultrasound images using a dynamic statistical shape model in HIFU therapy. Comput. Med. Imaging Graph. 46, 302–314 (2015)
6. A. Fallahi, M. Pooyan, H. Khotanlou, H. Hashemi, K. Firouznia, M.A. Oghabian, Uterine
fibroid segmentation on multiplan MRI using FCM, MPFCM and morphological operations.
IEEE (2010)
7. S. Divya, Detection of fibroid using image processing technique. Int. J. Emerg. Technol. Adv.
Eng. 5(3), 167–171 (2010)
8. T. Ratha Jeyalakshmi, K. Ramar Kadarkarai, Segmentation and feature extraction of fluid-filled
uterine fibroid—a knowledge-based approach. Int. J. Sci. Technol. 4, 405–416 (2010)
9. J. Yao, D. Chen, W. Lu, A. Premkumar, Uterine fibroid segmentation and volume measurement
on MRI, in Proceedings of SPIE, vol. 6143 (2006)
10. A. Alush, H. Greenspan, J. Goldberger, Automated and interactive lesion detection and
segmentation in uterine cervix images. IEEE Trans. Med. Imaging 29(2) (2010)
11. M.J. Padghamod, J.P. Gawande, Classification of ultrasonic uterine images. Adv. Res. Electr.
Electron. Eng. 1(3), 89–92 (2014)
Progressive Generative Adversarial
Binary Networks for Music Generation

Manan Oza, Himanshu Vaghela and Kriti Srivastava

Abstract Recent improvements in generative adversarial network (GAN) training techniques prove that progressively training a GAN drastically stabilizes the training and improves the quality of the outputs produced. Adding layers after the previous ones have converged has proven to help in better overall convergence and stability of the model, as well as reducing the training time by a sufficient amount. Thus, we use this training technique to train the model progressively in the time and pitch domains, i.e., starting from a very small time value and pitch range, we gradually expand the matrix sizes until the end result is a completely trained model giving outputs having tensor sizes [4 (bars) × 96 (time steps) × 84 (pitch values) × 8 (tracks)]. As proven in previously proposed models, deterministic binary neurons also help in improving the results. Thus, we make use of a layer of deterministic binary neurons at the end of the generator to get binary-valued outputs instead of fractional values lying between 0 and 1.

Keywords Generative adversarial networks · Progressive GAN · Music generation · Binary neurons

1 Introduction

Generating and composing music in the symbolic domain using neural networks has been an active research field in recent years. Efforts have been made to generate music in the form of monophonic notes [1] or lead sheets [2]; other music generation attempts use the symbolic domain and music transcription [3, 4]. In [5], the number of instruments in the synthesized music is increased. In order to increase polyphony,
M. Oza (B) · H. Vaghela · K. Srivastava


Department of Computer Engineering, D. J. Sanghvi College of Engineering, Mumbai, India
e-mail: manan.oza0001@[Link]
H. Vaghela
e-mail: himanshuvaghela1998@[Link]
K. Srivastava
e-mail: [Link]@[Link]

Fig. 1 Two 8-track piano-roll samples created by our proposed model

music representation in the form of piano-rolls is used. A piano-roll encodes the musical patterns of N tracks as a binary matrix of pitch values against time steps (Fig. 1).
The presence of more than one instrument makes piano-rolls difficult to generate because many notes can be active within a single time step. Recurrent neural networks and convolutional neural networks have both been used recently in music generation: CNNs are better suited to learning local musical patterns, whereas RNNs have been used to learn the temporal dependencies of music. In [5], five-track piano-rolls were generated using convolutional generative adversarial networks (GANs) [6]. While the music generated was not comparable to that of human musicians, this was the first model to generate polyphonic and multi-track music.
Music synthesis is further improved by using binary neurons (BNs) [7, 8]. Hard thresholding (HT) and Bernoulli sampling (BS) can be applied to the floating-point predictions in BNs: HT binarizes a prediction against a fixed threshold, whereas BS treats it as a probability and samples a binary value. Either can fail: with HT, problems arise when many floating-point predictions lie close to the threshold; with BS, the binary approximation of 0 and 1 may be uneven.
In [2], since binarization of the generator output is done only at test time, the discriminator D distinguishes generated from real piano-rolls more easily. With binary neurons, binarization is performed during both training and testing, which helps the discriminator extract features relevant to music, since its input is binary instead of floating point. The model space R^M is reduced to 2^M, where M is the product of the number of possible pitches and the number of time steps; a relatively smaller model space makes training easier.
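A hedged PyTorch sketch of a deterministic binary neuron with a straight-through gradient (one common way to keep binarization differentiable; the paper's exact formulation may differ):

```python
import torch

class DBN(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return (torch.sigmoid(x) > 0.5).float()   # hard threshold at 0.5

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                        # straight-through estimator

binary_out = DBN.apply(torch.randn(4, 96, 84, 8))  # e.g. generator logits
```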

In [7], a refiner network R that uses binary neurons is placed between the generator G and the discriminator D, where R binarizes the floating-point predictions made by G. Two-stage training is conducted: first, G and D are pretrained and G is fixed; second, R is trained, followed by fine-tuning of D. Compared to MuseGAN [5], this model is more effective because of the use of deterministic binary neurons (DBNs). In our proposed model, we use progressive generative adversarial networks [9] with DBNs. Our model consists of a total of 12 layers in the shared generator and discriminator network and 8 layers in the refiner network at the end of all phases. Pitch and time-step values are increased progressively, layer by layer. Experimental results indicate that the final output is more efficient due to the progressive training of the GANs.

2 Background

2.1 Generative Adversarial Networks

In a generative adversarial network (GAN), the generator draws from a latent space derived from the original dataset and tries to fool the discriminator by generating realistic data, as mentioned earlier. The generator function G and the discriminator function D are the two main components of a GAN and are locked in a minimax game. The discriminator takes as input either the output of the generator or real data x from the original dataset; during the training phase, it learns to discern between fake and real samples. The generator takes as input a noise vector z sampled from the prior distribution p_z, and fools the discriminator with its counterfeit sample G(z). The generator and discriminator are trained as deep neural networks. The Wasserstein GAN (WGAN) [10], an alternative form of GAN, measures the Wasserstein distance between the real distribution and the distribution of the generator; this distance acts as a critic to the generator function. The WGAN objective function is given as:

\min_G \max_D \; \mathbb{E}_{x \sim p_d}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] \qquad (1)

where p_d denotes the distribution of real data.


A new gradient penalty (GP) term was added by Gulrajani et al. [11]. The GP term enforces the Lipschitz constraint on the discriminator and is an important factor in the training of the discriminator. Thus, the gradient penalty term added to the objective function of the GAN is:

\mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\left\| \nabla_{\hat{x}} D(\hat{x}) \right\|_2 - 1\right)^2\right] \qquad (2)

where p_x̂ is defined by sampling uniformly along straight lines between pairs of points sampled from p_d and the model distribution p_g. It was observed that WGAN-GP [11] stabilized the training and attenuated the mode-collapse issue in comparison with the weight-clipping methodology used in the original WGAN. Hence, we use WGAN-GP in our proposed framework.
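A PyTorch sketch of the penalty in Eq. (2) (a generic WGAN-GP implementation, not code from the paper):

```python
import torch

def gradient_penalty(D, real, fake):
    # Sample x_hat uniformly on straight lines between real and fake samples.
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    norms = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((norms - 1) ** 2).mean()   # penalize gradient norms away from 1
```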

2.2 Progressive Growing of GANs

In progressive growing of GANs [9], we train the GAN network in multiple phases. In phase 1, the generator takes in the noise vector z and uses n convolution layers to generate a low-resolution music sample; we then train the discriminator with the generated music and the real low-resolution dataset. Once the training stabilizes, we add n more convolution layers to up-sample the music to a slightly higher resolution, and n more convolution layers to down-sample the music in the discriminator. Here, by the resolution of music, we mean the number of time steps and pitch values, and we have taken n = 1; a larger number of time steps and pitch values corresponds to a higher resolution and vice versa.
Progressive training speeds up and stabilizes regular GAN training methods. Most of the iterations are done at lower resolutions, and training is significantly faster for comparable music quality; in short, it produces higher-resolution outputs with better music quality. The progressive GAN technique uses a simplified minibatch discrimination to improve the diversity of results: it computes the standard deviation for each feature at each spatial location over the minibatch, averages these to yield a single scalar value, and concatenates this value to all spatial locations over the minibatch at one of the last layers of the discriminator. If the generated music samples do not have the same diversity as the real music samples, this value will differ and will therefore be penalized by the discriminator.
by the discriminator.
Progressive GAN initializes the filter weights from N(0, 1) and then scales the weights at runtime for each layer as \hat{w}_i = w_i / c, where

c = \left(\frac{2}{\text{number of inputs}}\right)^{-1} \qquad (3)
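In common implementations of this equalized learning rate (e.g. the official progressive GAN code), the runtime scale applied to the N(0, 1) weights is the He constant sqrt(2 / fan-in); a sketch:

```python
import torch

class EqualizedConv2d(torch.nn.Module):
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_ch, in_ch, k, k))  # N(0, 1)
        self.bias = torch.nn.Parameter(torch.zeros(out_ch))
        self.scale = (2.0 / (in_ch * k * k)) ** 0.5   # He constant, applied at runtime
        self.pad = k // 2

    def forward(self, x):
        return torch.nn.functional.conv2d(x, self.weight * self.scale,
                                          self.bias, padding=self.pad)
```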

For the generator, the features at every convolution layer are normalized, given by

b_{x,y} = a_{x,y} /