Advances in Intelligent Systems and Computing 731

M. N. Hoda
Naresh Chauhan
S. M. K. Quadri
Praveen Ranjan Srivastava Editors

Software
Engineering
Proceedings of CSI 2015
Advances in Intelligent Systems and Computing

Volume 731

Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: kacprzyk@[Link]
The series “Advances in Intelligent Systems and Computing” contains publications on theory,
applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all
disciplines such as engineering, natural sciences, computer and information science, ICT, economics,
business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the
areas of modern intelligent systems and computing such as: computational intelligence, soft computing
including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms,
social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and
society, cognitive science and systems, Perception and Vision, DNA and immune based systems,
self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric
computing, recommender systems, intelligent control, robotics and mechatronics including
human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent
data analysis, knowledge management, intelligent agents, intelligent decision making and support,
intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia.
The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings
of important conferences, symposia and congresses. They cover significant recent developments in the
field, both of a foundational and applicable character. An important characteristic feature of the series is
the short publication time and world-wide distribution. This permits a rapid and broad dissemination of
research results.

Advisory Board
Chairman
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
e-mail: nikhil@[Link]
Members
Rafael Bello Perez, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, Cuba
e-mail: rbellop@[Link]
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
e-mail: escorchado@[Link]
Hani Hagras, University of Essex, Colchester, UK
e-mail: hani@[Link]
László T. Kóczy, Széchenyi István University, Győr, Hungary
e-mail: koczy@[Link]
Vladik Kreinovich, University of Texas at El Paso, El Paso, USA
e-mail: vladik@[Link]
Chin-Teng Lin, National Chiao Tung University, Hsinchu, Taiwan
e-mail: ctlin@[Link]
Jie Lu, University of Technology, Sydney, Australia
e-mail: [Link]@[Link]
Patricia Melin, Tijuana Institute of Technology, Tijuana, Mexico
e-mail: epmelin@[Link]
Nadia Nedjah, State University of Rio de Janeiro, Rio de Janeiro, Brazil
e-mail: nadia@[Link]
Ngoc Thanh Nguyen, Wroclaw University of Technology, Wroclaw, Poland
e-mail: [Link]@[Link]
Jun Wang, The Chinese University of Hong Kong, Shatin, Hong Kong
e-mail: jwang@[Link]

More information about this series at [Link]


M. N. Hoda · Naresh Chauhan · S. M. K. Quadri · Praveen Ranjan Srivastava
Editors

Software Engineering
Proceedings of CSI 2015
Editors

M. N. Hoda
Bharati Vidyapeeth’s Institute of Computer Applications and Management (BVICAM)
New Delhi, Delhi, India

Naresh Chauhan
Department of Computer Engineering
YMCAUST
Faridabad, Haryana, India

S. M. K. Quadri
Department of Computer Science
University of Kashmir
Srinagar, Jammu and Kashmir, India

Praveen Ranjan Srivastava
Department of Information Technology and Systems
Indian Institute of Management Rohtak
Rohtak, Haryana, India

ISSN 2194-5357 ISSN 2194-5365 (electronic)


Advances in Intelligent Systems and Computing
ISBN 978-981-10-8847-6 ISBN 978-981-10-8848-3 (eBook)
[Link]
Library of Congress Control Number: 2018940631

© Springer Nature Singapore Pte Ltd. 2019


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
part of Springer Nature
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

The last decade has witnessed remarkable changes in the IT industry, in virtually all domains. The 50th Annual Convention, CSI-2015, on the theme “Digital Life,” was organized as a part of CSI@50 by CSI at Delhi, the national capital of the country, during December 2–5, 2015. It was conceived with the objective of keeping the ICT community abreast of emerging paradigms in computing technologies and, more importantly, of examining their impact on society.
Information and Communication Technology (ICT) comprises three main components: infrastructure, services, and products. These components include the Internet, infrastructure-based and infrastructure-less wireless networks, mobile terminals, and other communication media. ICT is gaining popularity owing to the rapid growth in communication capabilities for real-time applications. New user requirements and services call for mechanisms that enable systems to intelligently process speech- and language-based input from human users. CSI-2015 attracted over 1500 papers from researchers and practitioners in academia, industry, and government agencies from all over the world, thereby making the job of the Programme Committee extremely difficult. After a series of tough review exercises by a team of over 700 experts, 565 papers were accepted for presentation at CSI-2015 during the 3 days of the convention, under ten parallel tracks. The Programme Committee, in consultation with Springer, the world’s largest publisher of scientific documents, decided to publish the proceedings of the presented papers, after the convention, in ten topical volumes under the AISC series of Springer, as detailed hereunder:
1. Volume # 1: ICT Based Innovations
2. Volume # 2: Next Generation Networks
3. Volume # 3: Nature Inspired Computing
4. Volume # 4: Speech and Language Processing for Human-Machine Communications
5. Volume # 5: Sensors and Image Processing
6. Volume # 6: Big Data Analytics
7. Volume # 7: Systems and Architecture
8. Volume # 8: Cyber Security
9. Volume # 9: Software Engineering
10. Volume # 10: Silicon Photonics & High Performance Computing
We are pleased to present before you the proceedings of Volume # 9 on “Software Engineering.” This volume aims at informing readers about the state of the art in software engineering by publishing high-quality papers that present the results of consolidated research and innovations in software engineering and related areas. It helps practitioners, researchers, and academicians understand the issues involved in designing, developing, evolving, and validating complex software systems, and it provides excellent coverage for developing professional careers in software engineering.
The volume also provides insights into various research issues such as software reliability, verification and validation, security, extensibility, model- and component-based development, software process models and process-driven systems, and human–computer collaborative systems. With this point of view, the volume has been published and should be beneficial for researchers in this domain.
The volume includes scientific, original, and high-quality papers presenting novel research, ideas, and explorations of new vistas in domains such as component-based software engineering, knowledge-based software engineering, agile methodologies, and CASE tools. The aim of this volume is to provide a stimulating forum for sharing knowledge and results on models, methodologies, and implementations in these domains. Its authors are researchers and experts in these domains. The volume is designed to bring together researchers and practitioners from academia and industry to focus on extending understanding and establishing new collaborations in these areas. It is the outcome of the hard work of the editorial team, who have relentlessly worked with the authors to compile this volume. It will be a useful source of reference for future researchers in this domain. Under the CSI-2015 umbrella, we received over 300 papers for this volume, out of which 70 papers are being published after a rigorous review process carried out in multiple cycles.
On behalf of the organizing team, it is a matter of great pleasure that CSI-2015 received an overwhelming response from professionals from across the country. The organizers of CSI-2015 are thankful to the members of the Advisory Committee, Programme Committee, and Organizing Committee for their all-round guidance, encouragement, and continuous support. We express our sincere gratitude to the learned Keynote Speakers for the support and help extended to make this event a grand success. Our sincere thanks are also due to our Review Committee Members and the Editorial Board for their untiring efforts in reviewing the manuscripts and giving suggestions and valuable inputs for shaping this volume. We hope that all the participating delegates have benefited academically, and we wish them success in their future endeavors.

We also take this opportunity to thank the entire team from Springer, who have
worked tirelessly and made the publication of the volume a reality. Last but not
least, we thank the team from Bharati Vidyapeeth’s Institute of Computer
Applications and Management (BVICAM), New Delhi, for their untiring support,
without which the compilation of this huge volume would not have been possible.

New Delhi, India M. N. Hoda
Faridabad, India Naresh Chauhan
Srinagar, India S. M. K. Quadri
Rohtak, India Praveen Ranjan Srivastava
December 2017
The Organization of CSI-2015

Chief Patron

Padmashree Dr. R. Chidambaram, Principal Scientific Advisor, Government of India

Patrons

Prof. S. V. Raghavan, Department of Computer Science, IIT Madras, Chennai
Prof. Ashutosh Sharma, Secretary, Department of Science and Technology,
Ministry of Science and Technology, Government of India

Chair, Programme Committee

Prof. K. K. Aggarwal, Founder Vice Chancellor, GGSIP University, New Delhi

Secretary, Programme Committee

Prof. M. N. Hoda, Director, Bharati Vidyapeeth’s Institute of Computer Applications
and Management (BVICAM), New Delhi

Advisory Committee

Padma Bhushan Dr. F. C. Kohli, Co-Founder, TCS
Mr. Ravindra Nath, CMD, National Small Industries Corporation, New Delhi
Dr. Omkar Rai, Director General, Software Technology Parks of India (STPI),
New Delhi


Adv. Pavan Duggal, Noted Cyber Law Advocate, Supreme Court of India
Prof. Bipin Mehta, President, CSI
Prof. Anirban Basu, Vice President-cum-President Elect, CSI
Shri Sanjay Mohapatra, Secretary, CSI
Prof. Yogesh Singh, Vice Chancellor, Delhi Technological University, Delhi
Prof. S. K. Gupta, Department of Computer Science and Engineering, IIT Delhi
Prof. P. B. Sharma, Founder Vice Chancellor, Delhi Technological University,
Delhi
Mr. Prakash Kumar, IAS, Chief Executive Officer, Goods and Services Tax
Network (GSTN)
Mr. R. S. Mani, Group Head, National Knowledge Networks (NKN), NIC,
Government of India, New Delhi

Editorial Board

A. K. Nayak, CSI
A. K. Saini, GGSIPU, New Delhi
R. K. Vyas, University of Delhi, Delhi
Shiv Kumar, CSI
Anukiran Jain, BVICAM, New Delhi
Parul Arora, BVICAM, New Delhi
Vishal Jain, BVICAM, New Delhi
Ritika Wason, BVICAM, New Delhi
Anupam Baliyan, BVICAM, New Delhi
Nitish Pathak, BVICAM, New Delhi
Shivendra Goel, BVICAM, New Delhi
Shalini Singh Jaspal, BVICAM, New Delhi
Vaishali Joshi, BVICAM, New Delhi
Contents

A Multi-agent Framework for Context-Aware Dynamic User
Profiling for Web Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Aarti Singh and Anu Sharma
Implementation of Equivalence of Deterministic Finite-State
Automation and Non-deterministic Finite-State Automaton in
Acceptance of Type 3 Languages Using Programming Code . . . . . . . . . 17
Rinku, Chetan Sharma, Ajay Kumar Rangra, Manish Kumar
and Aman Madaan
A Multi-factored Cost- and Code Coverage-Based Test Case
Prioritization Technique for Object-Oriented Software . . . . . . . . . . . . . 27
Vedpal and Naresh Chauhan
A Novel Page Ranking Mechanism Based on User
Browsing Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Shilpa Sethi and Ashutosh Dixit
Indexing of Semantic Web for Efficient Question
Answering System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Rosy Madaan, A. K. Sharma, Ashutosh Dixit and Poonam Bhatia
A Sprint Point Based Tool for Agile Estimation . . . . . . . . . . . . . . . . . . . 63
Rashmi Popli and Naresh Chauhan
Improving Search Results Based on Users’ Browsing Behavior
Using Apriori Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Deepika, Shilpa Juneja and Ashutosh Dixit
Performance Efficiency Assessment for Software Systems . . . . . . . . . . . 83
Amandeep Kaur, P. S. Grover and Ashutosh Dixit
Impact of Programming Languages on Energy Consumption
for Sorting Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Tej Bahadur Chandra, Pushpak Verma and Anuj Kumar Dwivedi

Crawling Social Web with Cluster Coverage Sampling . . . . . . . . . . . . . 103
Atul Srivastava, Anuradha and Dimple Juneja Gupta
Efficient Management of Web Data by Applying Web Mining
Pre-processing Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Jaswinder Kaur and Kanwal Garg
A Soft Computing Approach to Identify Multiple Paths
in a Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Shalini Aggarwal, Pardeep Kumar and Shuchita Upadhyaya
An Efficient Focused Web Crawling Approach . . . . . . . . . . . . . . . . . . . 131
Kompal Aggarwal
A New Log Kernel-Based Possibilistic Clustering . . . . . . . . . . . . . . . . . . 139
Meena Tushir and Jyotsna Nigam
Fuzzy c-Means Clustering Strategies: A Review
of Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Jyoti Arora, Kiran Khatter and Meena Tushir
Noise Reduction from ECG Signal Using Error Normalized Step Size
Least Mean Square Algorithm (ENSS) with Wavelet Transform . . . . . . 163
Rachana Nagal, Pradeep Kumar and Poonam Bansal
A Novel Approach for Extracting Pertinent Keywords for Web Image
Annotation Using Semantic Distance and Euclidean Distance . . . . . . . . 173
Payal Gulati and Manisha Yadav
Classification of Breast Tissue Density Patterns Using SVM-Based
Hierarchical Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Jitendra Virmani, Kriti and Shruti Thakur
Advances in EDM: A State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Manu Anand
Proposing Pattern Growth Methods for Frequent Pattern Mining
on Account of Its Comparison Made with the Candidate Generation
and Test Approach for a Given Data Set . . . . . . . . . . . . . . . . . . . . . . . . 203
Vaibhav Kant Singh
A Study on Initial Centroids Selection for Partitional Clustering
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Mahesh Motwani, Neeti Arora and Amit Gupta
A Novel Rare Itemset Mining Algorithm Based on Recursive
Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Mohak Kataria, C. Oswald and B. Sivaselvan
Computation of Various Entropy Measures for Anticipating
Bugs in Open-Source Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
H. D. Arora and Talat Parveen
Design of Cyber Warfare Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Yogesh Chandra and Pallaw Kumar Mishra
Performance Evaluation of Features Extracted
from DWT Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Manisha Saini and Rita Chhikara
Control Flow Graph Matching for Detecting
Obfuscated Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Chandan Kumar Behera, Genius Sanjog and D. Lalitha Bhaskari
A Novel Framework for Predicting Performance of Keyword
Queries Over Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Mujaffar Husain and Udai Shanker
Predicting and Accessing Security Features into Component-Based
Software Development: A Critical Survey . . . . . . . . . . . . . . . . . . . . . . . 287
Shambhu Kr. Jha and R. K. Mishra
Traditional Software Reliability Models Are Based on Brute Force
and Reject Hoare’s Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Ritika Wason, A. K. Soni and M. Qasim Rafiq
Object-Oriented Metrics for Defect Prediction . . . . . . . . . . . . . . . . . . . . 305
Satwinder Singh and Rozy Singla
A Comparative Analysis of Static and Dynamic Java Bytecode
Watermarking Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Krishan Kumar and Prabhpreet Kaur
Software Architecture Evaluation in Agile Environment . . . . . . . . . . . . 335
Chandni Ahuja, Parminder Kaur and Hardeep Singh
Mutation Testing-Based Test Suite Reduction Inspired
from Warshall’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Nishtha Jatana, Bharti Suri and Prateek Kumar
Software Component Retrieval Using Rough Sets . . . . . . . . . . . . . . . . . 365
Salman Abdul Moiz
Search-Based Secure Software Testing: A Survey . . . . . . . . . . . . . . . . . 375
Manju Khari, Vaishali and Manoj Kumar
Limitations of Function Point Analysis in Multimedia
Software/Application Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
Sushil Kumar, Ravi Rastogi and Rajiv Nag
Maintainability Analysis of Component-Based
Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
Nitin Upadhyay
An Assessment of Vulnerable Detection Source Code Tools . . . . . . . . . . 403
Anoop Kumar Verma and Aman Kumar Sharma
Devising a New Method for Economic Dispatch Solution and Making
Use of Soft Computing Techniques to Calculate Loss Function . . . . . . . 413
Ravindra Kumar Chahar and Aasha Chuahan
Trusted Operating System-Based Model-Driven Development
of Secure Web Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
Nitish Pathak, Girish Sharma and B. M. Singh
Navigational Complexity Metrics of a Website . . . . . . . . . . . . . . . . . . . . 433
Divyam Pandey, Renuka Nagpal and Deepti Mehrotra
Evaluation and Comparison of Security Mechanisms In-Place
in Various Web Server Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
Syed Mutahar Aaqib and Lalitsen Sharma
Component-Based Quality Prediction via Component Reliability
Using Optimal Fuzzy Classifier and Evolutionary Algorithm . . . . . . . . . 449
Kavita Sheoran and Om Prakash Sangwan
Applying Statistical Usage Testing Along with White Box
Testing Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
Sunil Kumar Khatri, Kamaldeep Kaur and Rattan Datta
A Review on Application Security Management Using Web
Application Security Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
A. Rakesh Phanindra, V. B. Narasimha and Ch. V. PhaniKrishna
A Review of Software Testing Approaches in Object-Oriented
and Aspect-Oriented Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
Vasundhara Bhatia, Abhishek Singhal, Abhay Bansal and Neha Prabhakar
A Literature Survey of Applications of Meta-heuristic Techniques
in Software Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
Neha Prabhakar, Abhishek Singhal, Abhay Bansal
and Vasundhara Bhatia
A Review of Test Case Prioritization and
Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
Pavi Saraswat, Abhishek Singhal and Abhay Bansal
Software Development Activities Metric to Improve Maintainability
of Application Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
Adesh Kumar Pandey and C. P. Agrawal
Label Count Algorithm for Web Crawler-Label Count . . . . . . . . . . . . . 525
Laxmi Ahuja
Vulnerability Discovery in Open- and Closed-Source Software:
A New Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
Ruchi Sharma and R. K. Singh
Complexity Assessment for Autonomic Systems by Using
Neuro-Fuzzy Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
Pooja Dehraj and Arun Sharma
Proposal for Measurement of Agent-Based Systems . . . . . . . . . . . . . . . . 551
Sangeeta Arora and P. Sasikala
Optimal Software Warranty Under Fuzzy Environment . . . . . . . . . . . . 563
A. K. Shrivastava and Ruchi Sharma
Automation Framework for Test Script Generation for Android
Mobile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
R. Anbunathan and Anirban Basu
Optimizing the Defect Prioritization in Enterprise Application
Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
Viral Gupta, Deepak Kumar and P. K. Kapur
Desktop Virtualization—Desktop as a Service and Formulation
of TCO with Return on Investment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
Nitin Chawla and Deepak Kumar
An Assessment of Some Entropy Measures in Predicting Bugs
of Open-Source Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
Vijay Kumar, H. D. Arora and Ramita Sahni
A Path Coverage-Based Reduction of Test Cases and Execution
Time Using Parallel Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Leena Singh and Shailendra Narayan Singh
iCop: A System for Mitigation of Felonious Encroachment
Using GCM Push Notification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
Saurabh Mishra, Madhurima Hooda, Saru Dhir and Alisha Sharma
Clustering the Patent Data Using K-Means Approach . . . . . . . . . . . . . . 639
Anuranjana, Nisha Mittas and Deepti Mehrotra
Success and Failure Factors that Impact on Project Implementation
Using Agile Software Development Methodology . . . . . . . . . . . . . . . . . . 647
Saru Dhir, Deepak Kumar and V. B. Singh
Fuzzy Software Release Problem with Learning Functions
for Fault Detection and Correction Processes . . . . . . . . . . . . . . . . . . . . . 655
Deepak Kumar and Pankaj Gupta
Reliability Assessment of Component-Based Software System
Using Fuzzy-AHP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
Bhat Jasra and Sanjay Kumar Dubey
Ranking Usability Metrics Using Intuitionistic Preference
Relations and Group Decision Making . . . . . . . . . . . . . . . . . . . . . . . . . . 671
Ritu Shrivastava
Research Challenges of Web Service Composition . . . . . . . . . . . . . . . . . 681
Ali A. Alwasouf and Deepak Kumar
Automation Software Testing on Web-Based Application . . . . . . . . . . . 691
Saru Dhir and Deepak Kumar
Automated Test Data Generation Applying Heuristic
Approaches—A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 699
Neetu Jain and Rabins Porwal
Comparison of Optimization Strategies
for Numerical Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 709
Gopal Narayanam, Kartikay Ranjan and Sumit Kumar
Sentiment Analysis on Tweets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
Mehjabin Khatoon, W. Aisha Banu, A. Ayesha Zohra and S. Chinthamani
About the Editors

Prof. M. N. Hoda is a Professor of Computer Science and Director of BVICAM,
New Delhi. He has over 22 years of academic experience in different capacities.
Prior to joining the academic world, he initially worked in the corporate sector as a
Software Engineer. He is an expert member of several board-level committees of
DST, CSIR and MHRD, and various universities. His current areas of research are
Information System Audits, Software Engineering, Computer Networks, Artificial
Intelligence, ICTs and Innovative Pedagogies for 21st Century Teaching Learning
Systems. He is a Senior Member of IEEE, CSI, IE(I), ISTE, ISCA and ACM, and a
fellow of IETE.

Prof. Naresh Chauhan is a Professor and Chairman in the Department of
Computer Engineering at YMCA University of Science and Technology, Faridabad
(India). He received his Ph.D. (Computer Engineering) from MD University,
Rohtak; his [Link]. (Information Technology) from GGSIP University, New
Delhi; and his [Link]. (Computer Engineering) from NIT, Kurukshetra. He has 23
years of experience in education and the industry. Previously, he served at Bharat
Electronics Ltd and Motorola India Ltd. His research interests include Internet
Technologies, Software Engineering, Software Testing, and Real-Time Systems.
He has published two books on Software Testing and Operating Systems from
Oxford University Press, India.

Prof. S. M. K. Quadri is a Professor and Head of the Department of Computer
Science, University of Kashmir, Srinagar. He received his master's degree from
Aligarh Muslim University (AMU) and his [Link]. in Computer Applications
from the Indian School of Mines (ISM), Dhanbad. He completed his doctorate in
Computer Science at the University of Kashmir, Srinagar. He is a widely travelled
researcher and a member of many prestigious national-level committees of India,
e.g., NAAC, AICTE, and CSI. He has published many research papers in Elsevier
and Emerald journals and for IEEE conferences. His main research interests are
Software Testing and Reliability Engineering.


Dr. Praveen Ranjan Srivastava is an Assistant Professor and Head of the
Department (Area Chair) in the Information Technology and Systems Group at the
Indian Institute of Management (IIM), Rohtak, India. He is currently pursuing
research in the area of Software Engineering and Management using novel
nature-inspired techniques. His primary research areas are Software Testing, Data
Analytics, Quality Management, Expert Systems, and Decision Science. He has
published more than 120 research papers in various leading international journals.
A Multi-agent Framework
for Context-Aware Dynamic User
Profiling for Web Personalization

Aarti Singh and Anu Sharma

Abstract The growing volume of information on the World Wide Web has made relevant information retrieval a difficult task. Customizing information according to user interests has become the need of the hour. Personalization aims to solve many of the associated problems in the current Web. However, keeping an eye on a user's behavior manually is difficult, and user interests change with the passage of time. It is therefore necessary to create user profiles accurately and dynamically for better personalization solutions. Further, the automation of the various tasks in user profiling is highly desirable considering the large number of users involved and the intensity of their activity. This work presents an agent-based framework for dynamic user profiling for a personalized Web experience. Our contribution in this work is the development of a novel agent-based technique for maintaining long-term and short-term user interests along with context identification. A novel agent-based approach for dynamic user profiling for Web personalization has also been proposed. The proposed work is expected to provide an automated solution for dynamic user profile creation.

Keywords Context aware · Dynamic · Multi-agents · User profiling · Web personalization

1 Introduction

Recent years have seen an exponential increase in the size of the World Wide Web, which has led to many bottlenecks in accessing required and relevant material from the pool of available information. Web personalization (WP) offers a solution to the information overload problem in the current Web by providing users with a personalized experience that considers their interests, behavior, and context [1].

A. Singh · A. Sharma (&)
MMICT&BM, MMU, Ambala, India
e-mail: anu@[Link]
A. Singh
e-mail: Singh2208@[Link]

Thus, there is an increasing demand to devise adaptive techniques of Web personalization which may analyze user interests from his/her Web Surfing Behavior (WSB). However, considering the size of the WWW and the number of users accessing the Internet, it is very hard to keep an eye on every user's behavior manually. Moreover, an individual's interests and requirements are major contributing factors toward his/her WSB, and both keep changing over time.
An individual's interests may be classified as long-term interests (LTIs) and short-term interests (STIs). LTIs reflect the user's personality or form part of his/her professional profile and do not change rapidly, e.g., a person's interest in a particular sport, a particular type of art, or a research interest in a particular field. STIs arise from dynamic requirements or from the changing moods of a person, such as watching a movie, planning a tour, or shopping for goods or accessories. Thus, any WP technique must rely on a dynamic mechanism to capture user behavior so as to generate a digital user profile accurately. Other important issues related to user profiling are accuracy, inference/reasoning of user interest, scalability, privacy, and user model interoperability [2].
User modeling (UM) refers to gathering and exploiting information about the individual user. UM consists of various tasks, described in detail in Sect. 2. Further, the automation of these tasks is highly desirable considering the large number of users involved and the intensity of their activity. Intelligent agents may be deployed for this purpose, since they can work autonomously, monitor their environment continuously without tiring, and learn user behavior. A multi-agent-based approach to Web content mining based on semantic technology has been applied by Singh [3]. Further, a study identifying the scope of agents in Web usage mining has been undertaken by Singh and Mishra [4]. This provided the motivation for the present work, which presents an agent-based framework for dynamic user profiling for a personalized Web experience. This paper is structured as follows: Sect. 2 elaborates user modeling techniques along with their inherent challenges. Section 3 presents the literature review and problem identification. Section 4 presents the proposed framework, and Sect. 5 elaborates the detailed working of the framework along with the algorithmic procedure for each agent. Section 6 describes the flow diagram, and Sect. 7 concludes the work and presents future research directions.

2 User Modeling Techniques

UM refers to gathering and exploiting information about the individual user. Information for creating a user profile may be collected explicitly, by asking the user to provide it, or implicitly, by observing user behavior [5]. User profiles that adapt themselves to changes in user interests and preferences are called dynamic user profiles, whereas a static profile contains information that does not change with time. The three phases of user profile construction are given below.

2.1 Information Gathering

This phase mainly consists of gathering information about the Web user. The two methods for collecting information about the user are explicit and implicit.
Explicit Information Collection
This method asks for user input and feedback to gather knowledge about the user. It is an accurate way of identifying the items the user is interested in, but it requires explicit user intervention to specify complete interests and preferences. Another shortcoming of an explicit user profile is that after some time the profile becomes outdated. The Yahoo Personalized Portal, for example, explicitly asks the user to specify preferences and then customizes the home page layout according to the user's interests.
Implicit Information Collection
This method observes user activities to identify user information without explicitly asking for it, thereby removing the burden on the user of entering the information. Various implicit information collection techniques are the browsing cache, proxy servers, browser agents, desktop agents, Web logs, and search logs. Most of these techniques are client-side based, except Web logs and search logs, which are server-side based [6].
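To make the implicit approach concrete, the following minimal sketch (an illustration only; the Common Log Format layout and the success-status filter are assumptions, not part of any specific technique cited above) parses Web server log lines and counts the pages each client host has viewed as implicit evidence of interest.

```python
import re
from collections import Counter

# Common Log Format: host ident user [timestamp] "METHOD /path HTTP/1.1" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

def implicit_interest_counts(log_lines):
    """Count successfully served pages per client host as implicit interest evidence."""
    counts = {}
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if not m or m.group("status") != "200":
            continue  # skip malformed lines and failed requests
        host, path = m.group("host"), m.group("path")
        counts.setdefault(host, Counter())[path] += 1
    return counts

sample = ['10.0.0.7 - - [12/Mar/2015:10:21:05 +0530] "GET /sports/cricket.html HTTP/1.1" 200 5120']
print(implicit_interest_counts(sample))
```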

2.2 User Profile Representation

Some of the important techniques for representing user profiles are sets of weighted keywords, semantic networks, weighted concepts, and association rules:
Keyword-Based Profile
User profiles are most commonly represented as sets of keywords. These can be automatically extracted from Web documents or provided directly by the user. A numeric weight is assigned to each keyword, indicating the degree of the user's interest in that topic [7, 8].
Semantic Network Profiles
Keyword-based profiles suffer from the polysemy problem. This problem is solved by using a weighted semantic network in which each node represents a concept [9].
Concept Profiles
Concept-based profiles are similar to semantic network-based profiles. However, in concept-based profiles the nodes represent abstract topics considered interesting to the user, rather than specific words or sets of related words. Concept profiles are also similar to keyword profiles in that they are often represented as vectors of weighted features, but the features represent concepts rather than words or sets of words. Mechanisms have been developed to express the user's interest in each topic, e.g., by assigning a numerical value, or weight, to each topic [10].
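As a minimal illustration of these representations (the class and method names here are assumptions made for the sketch, not constructs from the cited works), a keyword-based profile can be kept as a weighted term vector and a concept-based profile as accumulated weights over named topics:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class KeywordProfile:
    # keyword -> weight expressing the degree of interest in that term
    weights: Dict[str, float] = field(default_factory=dict)

    def bump(self, keyword: str, delta: float = 1.0) -> None:
        self.weights[keyword] = self.weights.get(keyword, 0.0) + delta

@dataclass
class ConceptProfile:
    # abstract topic (e.g., "Sports/Cricket") -> accumulated weight
    weights: Dict[str, float] = field(default_factory=dict)

    def observe(self, concept: str) -> None:
        # if the same concept is observed again, its weight is increased by one
        self.weights[concept] = self.weights.get(concept, 0.0) + 1.0

kp = KeywordProfile()
for term in ["cricket", "score", "cricket"]:
    kp.bump(term)
print(kp.weights)  # {'cricket': 2.0, 'score': 1.0}
```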

2.3 User Profile Construction

Various machine learning techniques are applied to the collected information to build the user profile [11]. Keyword-based profiles are constructed by extracting keywords from the Web pages in the search history. Each extracted word is assigned a weight that indicates its relative importance with respect to an interest, and a vector of extracted keywords thus represents a user interest. Moukas [8] extracted the top few keywords implicitly from WSB and used a single keyword for each user interest, but this approach could not represent true user behavior, since multiple user interests were represented as a single vector. In another approach, a user profile is therefore represented as multiple vectors, each corresponding to a separate user interest category. The number of user interests to be included in the profile is either fixed in advance or determined using a similarity threshold. One keyword vector per interest, instead of a single keyword, is used by Chen and Sycara [7]; this is a more accurate method of user profile construction.
Semantic net user profiles are created by adding the keywords to a network of nodes. Such a profile is more accurate than a keyword-based one, as it eliminates ambiguity and synonymy in natural language. Concept-based profiles use available taxonomies to represent weighted concept hierarchies. These profiles are constructed by mapping the extracted keywords to existing concepts in the taxonomy. If the same concept is observed again, its weight is increased by one.
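The multiple-vector construction with a similarity threshold can be sketched as follows; the cosine measure and the threshold value are illustrative assumptions in the spirit of [7], not the exact procedure used there.

```python
import math
from collections import Counter
from typing import List

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def add_page_to_profile(interests: List[Counter], page_keywords: List[str],
                        threshold: float = 0.3) -> None:
    """Merge a page's keyword vector into the closest interest vector,
    or start a new interest category if none is similar enough."""
    vec = Counter(page_keywords)
    best, best_sim = None, 0.0
    for interest in interests:
        sim = cosine(interest, vec)
        if sim > best_sim:
            best, best_sim = interest, sim
    if best is not None and best_sim >= threshold:
        best.update(vec)          # reinforce an existing interest
    else:
        interests.append(vec)     # a new, separate interest category

profile: List[Counter] = []
add_page_to_profile(profile, ["cricket", "batting", "score"])
add_page_to_profile(profile, ["cricket", "bowling"])
add_page_to_profile(profile, ["recipe", "pasta"])
print(len(profile))  # 2 interest categories
```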

3 Related Work

An extensive study identifying the scope, applicability, and state of the art in applying semantics to WP was undertaken by Singh and Anand [12]. Semantically enhanced and ontological profiles have been found [13–15] to be more appropriate for representing the user profile. An ontological profile provides a powerful representational medium along with an associated inference mechanism for user profile creation. A brief review of work on semantically enhanced and ontological dynamic user profiles is given below.
A generic ontology-based user modeling architecture (OntobUM) for knowledge management was proposed by Razmerita et al. [16]. The architecture uses three different ontologies, namely user, domain, and log ontologies. The user ontology structures the different characteristics of users and their relationships. The domain ontology defines the domain- and application-specific concepts and their relationships. The log ontology defines the semantics of the user's interaction with the system.
An ontology-based approach to user profile creation by unobtrusively monitoring user behavior and relevance feedback was proposed by Middleton et al. [14]. The cold-start problem is addressed by using an external ontology to bootstrap the recommendation system. This work considers only the is-a type of relationship in the ontology and may be further extended to include other types of relationships. Further, there is scope for using external software agents and for adding contextual information in task modeling. Techniques for improving ontological user profiles were investigated by Trajkova and Gauch [17]. The authors found that ranking the concepts in the profiles by the number of documents assigned to them, rather than by accumulated weights, provides better profile accuracy. The ordering of the concepts was also considered, along with detecting non-relevant concepts and pruning them from the profile. However, this work does not include the user's contextual and temporal information in the user profile. In another approach, user context is modeled by using ontological profiles in Web search: the existing concepts in a domain ontology are assigned implicitly derived interest scores [18]. A spreading activation algorithm is used for maintaining and incrementally updating the weights based on the user's ongoing behavior. This approach may be further improved by observing the ontological profile over time to verify the true requirements of the user as reflected in the interest scores. A dynamic model for user modeling that applies clustering techniques to Web usage data off-line was proposed by Aghabozorgi and Wah [19]. The model is adjusted periodically with new transactions and thus addresses the problem of maintaining the accuracy of the user model, but it does not use software agents or ontologies.
A hybrid approach was adopted by Bhowmick et al. [13], combining the features of static and dynamic user profiling techniques. A static user profile specifies the user's interests in a focused manner, while dynamic user profiling adds adaptability. However, this approach is limited to school curriculum-related topics and does not consider context and location while generating recommendations. Sosnovsky and Dicheva [15] combined research from two different fields, user modeling and Web ontology, to demonstrate how recent semantic trends in Web development can be coupled with modern user modeling technologies; they also highlighted the use of software agents and ontologies. Users' behavior and the characterization of users' needs for context-aware applications were analyzed by Skillen et al. [20], who described the creation of a user profile ontology for context-aware application personalization within mobile environments. An approach for personalizing the Web using user profile ontologies was proposed by Vigneshwari and Aramudhan [21]. These ontologies are created dynamically from the server log file through classification and clustering techniques, and a global ontology is also created by identifying word pairs that occur in most of the documents. An automatic mechanism for ontology creation found useful in WP was proposed by Singh and Anand [12]. A multi-agent-based, knowledge-oriented personalized search engine framework was suggested by Middleton [14].
Some other works on generating user profiles dynamically are given by Hawalah and Fasli [18], Li et al. [22], and Vigneshwari and Aramudhan [21]. A strategy for maintaining the LTIs and STIs of the user through a user topic tree and a Page-History Buffer (PHB) was proposed by Li et al. [22]. Although that study considers the dynamic adaptation of the user profile, it does not use software agents to automate the tasks in user modeling. A multi-agent approach to this problem was proposed by Hawalah and Fasli [18] for building a dynamic user profile that is capable of learning and adapting to user behavior, but it does not consider contextual information in recommendations. An agent-based Web search personalization approach using a dynamic user profile was proposed by Moawad et al. [25]. The user query is optimized using the user's profile preferences and query-related synonyms from the WordNet ontology, and the search results obtained from a set of syntactic search engines are combined to produce the final personalized results. However, this approach is limited to Web search personalization. In another work, Woerndl and Groh [23] proposed a framework for multi-agent-based, personalized, context-aware information retrieval from distributed sources, considering privacy and access control. An agent-based interface is proposed that optimizes the user query using WordNet and user profile preferences and fetches personalized search results from various search engines such as Google and Yahoo. Although it includes contextual information, it does not describe the detailed implementation of the various agents and does not include temporal aspects of context.
A critical analysis of the literature reveals that many research attempts have been made at using software agents and ontologies for dynamic profile creation considering contextual information. Some of these studies are oriented toward search-result personalization, while generic UM approaches do not consider contextual information. There is thus a need to address issues such as adding contextual information for creating dynamic user profiles efficiently. This work proposes a framework for agent-oriented, context-aware dynamic user profiling for personalized recommendations. Our contribution in this work may be summarized as follows:
1. A novel agent-based technique has been developed for maintaining LTIs and STIs at the client-side layer. User activities at the desktop are monitored by an agent and then incorporated into the STIs.
2. A novel agent-based approach has been developed for user context identification at the client-side layer.
3. A novel agent-based approach has been developed for dynamic user profiling for WP.
The next section describes the proposed framework in detail.

4 Proposed Framework for Multi-agent-Based Dynamic User Profiling

The core idea of user profiling is based on the assumption that certain characteristics of the individual user affect the usefulness of recommendations. This framework considers three main dimensions of user modeling, as given below:
• Background: This includes basic information about the user.
• Interest-directed user information: Information about the sites browsed and the search queries issued by the user helps in determining the user's interests dynamically.
• Context: Spatial and temporal context plays an important role in generating useful recommendations. A user may prefer different recommendations at different locations and at different times of the day.
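Purely as an illustration (the record layout and field names below are assumptions, not definitions given by the framework), a profile entry combining these three dimensions might be organized like this:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Context:
    location: str          # e.g., derived from Wi-Fi/IP location databases
    timestamp: str         # e.g., taken from system date and time settings

@dataclass
class UserProfile:
    # Background: basic information about the user
    user_id: str
    demographics: Dict[str, str] = field(default_factory=dict)
    # Interest-directed information: weighted long-term and short-term interests
    long_term_interests: Dict[str, float] = field(default_factory=dict)
    short_term_interests: Dict[str, float] = field(default_factory=dict)
    # Context: spatial and temporal context of recent activity
    recent_contexts: List[Context] = field(default_factory=list)

profile = UserProfile(user_id="u42", demographics={"language": "en"})
profile.long_term_interests["cricket"] = 0.9
profile.recent_contexts.append(Context(location="Ambala, IN", timestamp="2015-12-02T10:30"))
```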
The various components of this approach are distributed and autonomous, which is the motivation for using intelligent agents. These agents can communicate asynchronously and work simultaneously without interfering with one another. The framework comprises two layers, namely a client-side layer and a server-side layer, for collecting the various types of data useful for profile creation. These layers work on the client side and the server side, respectively, and help automate the complex and tedious processes involved in user profiling and execute the various tasks efficiently with reduced computation time. The high-level architecture of the proposed work is given in Fig. 1.
Three agents on the client side gather information about the user, namely the user context identification agent (UCIA), the user behavior tracking agent (UBTA), and the user desktop agent (UDA). The detailed working of these agents is given in Sect. 5. These agents collect the data and send this information to the agents working on the server side.

Fig. 1 High-level architecture of proposed framework



There are three agents working on the server side, namely the user query analyzer agent (UQAA), the profile manager agent (PMA), and the cluster agent (CA). A detailed description of the working of these agents is given in Sect. 5. These agents process the explicit and implicit information about the user and apply text processing and clustering techniques for generating and storing the user profile. They also apply semantic technologies for user profiling and for generating better recommendations.
The next section explains in detail the flow of information, the purpose of each agent, and their algorithms.

5 Detailed Description of Framework

The proposed multi-agent framework comprises three agents on the client side and three agents on the server side, as shown in Fig. 2.

Fig. 2 Detailed view of architecture of proposed framework



5.1 Client-Side Agents

The agents operating on the client side are as follows:

– UCIA: Temporal information about the user is collected from the system date and time settings. The agent obtains the user's location data from the current Wi-Fi/IP address using available location databases. Whenever the user enters a query in the browser and selects a context, this agent copies the query for later analysis. It maintains a table containing the keywords searched and the context of each search, along with the date and time of the search.
– UDA: This agent is responsible for monitoring the user's activities on the desktop. It monitors the applications and files opened and used by the user and maintains a list of this usage, along with the time spent on each item, in a specified folder so as to maintain user privacy. This helps in observing the potential interest areas of the user.
– UBTA: This agent is responsible for summarizing user behavior and creating the user profile based on the short-term and long-term interests of the user. For this, it analyzes and processes the tables maintained by UCIA and UDA. The agent also estimates the user's interest in Web pages by considering five parameters, namely text selection, text tracking, link clicking, link pointing, and scrolling; the detailed procedure can be found in [24]. Filtered Web pages are then classified into STIs and LTIs based on whether they are used occasionally or daily. This agent is also responsible for communicating the user's STIs and LTIs to the server-side agents for incorporation into the user profile.
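A minimal sketch of how UBTA might combine the five behavioral parameters into a degree-of-interest score and split pages into STIs and LTIs is given below; the weights, the threshold, and the days-of-use rule are illustrative assumptions (the actual estimation procedure is the one described in [24]).

```python
from typing import Dict, List, Set

# Illustrative weights for the five behavioral parameters (assumed, not taken from [24]).
WEIGHTS = {"text_selection": 0.3, "text_tracking": 0.2,
           "link_clicking": 0.2, "link_pointing": 0.1, "scrolling": 0.2}

def is_interesting(signals: Dict[str, float], threshold: float = 0.5) -> bool:
    """Return True if the weighted behavioral score of a page crosses the threshold."""
    score = sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
    return score >= threshold

def classify_interests(usage_days: Dict[str, Set[str]], min_days_for_lti: int = 5):
    """Pages visited on many distinct days become LTIs; the rest are STIs."""
    lti: List[str] = []
    sti: List[str] = []
    for page, days in usage_days.items():
        (lti if len(days) >= min_days_for_lti else sti).append(page)
    return sti, lti

print(is_interesting({"text_selection": 1.0, "scrolling": 1.0}))   # True
print(classify_interests({"/cricket": {"d1", "d2", "d3", "d4", "d5"}, "/hotels": {"d1"}}))
```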

5.2 Server-Side Agents

– PMA: This agent is responsible for managing the dynamic profile of the user by collaborating with the UBTA of that particular user. It receives the long-term and short-term interests of the user from UBTA and maintains them in its database. Further, it analyzes the online WSB of the user, recorded on the server side by UQAA, and keeps the user profile updated.
– UQAA: This agent analyzes the keywords of queries received from a particular user or IP address. It also records the Web pages browsed by the user, whether reached through a search engine or by requesting a particular URL, along with the time spent on each page, and it records the hyperlinks followed from one page to another.
– CA: This agent is responsible for clustering users based on similar interest areas so that similar recommendations may be provided to them. It works on the user profile database maintained by the PMA.
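One simple way the CA could group users with similar interest areas is sketched below, using Jaccard similarity over interest topics with a greedy assignment; the similarity measure and threshold are assumptions made for illustration rather than the clustering method prescribed by the framework.

```python
from typing import Dict, List, Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity between two users' sets of interest topics."""
    return len(a & b) / len(a | b) if a or b else 0.0

def cluster_users(interests: Dict[str, Set[str]], threshold: float = 0.5) -> List[List[str]]:
    """Greedily place each user in the first cluster whose seed user shares enough interests."""
    clusters: List[List[str]] = []
    for user, topics in interests.items():
        for members in clusters:
            if jaccard(interests[members[0]], topics) >= threshold:
                members.append(user)
                break
        else:
            clusters.append([user])
    return clusters

interests = {"u1": {"cricket", "scores"}, "u2": {"cricket"}, "u3": {"recipes"}}
print(cluster_users(interests))  # [['u1', 'u2'], ['u3']]
```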
The algorithms for the various agents involved in the framework are given in Figs. 3, 4, 5, 6, 7 and 8.

Fig. 3 User context identification agent

Fig. 4 User desktop agent



Fig. 5 User behavior tracking agent



Fig. 6 Algorithm for checking user’s degree of interest on Web page

Fig. 7 User query analyzer agent

Fig. 8 Profile manager agent



Fig. 9 Flow diagram of the proposed framework

6 Flow Diagram of the Proposed Work

The flow diagram given in Fig. 9 illustrates the working of the proposed framework. UCIA, UDA, and UBTA work simultaneously on the client side to gather information about the user's contexts and interests periodically.
1. UCIA accesses the client machine and is responsible for performing two tasks:
   1.1 It extracts the time, location, and searched keywords and stores them in a table.
   1.2 It passes this information to PMA, which stores the contextual information in the user profile database.
2. UBTA accesses the client machine to collect the browsing history of the user. It performs the following tasks:
   2.1 It extracts the various parameters from the browser cache and stores them in the USER_INTEREST table.
   2.2 It accesses the USER_INTEREST table and identifies the user's degree of interest in a Web page/file by using the isInteresting() function. It also identifies the STIs and LTIs from the USER_INTEREST table and prepares a database, WSB, for storing short-term and long-term interests.
   2.3 It sends the database WSB to PMA after receiving a request from PMA to access WSB.
3. UDA accesses the files in a specified folder on the client machine and extracts various parameters.
   3.1 It stores this information in the USER_INTEREST table.
4. PMA sends a request to UBTA to access WSB.
   4.1 PMA sends a request to UQAA to access the information from the USER_QUERY table.
   4.2 PMA sends the USER_PROFILE database to CA after receiving a request from CA.
   4.3 PMA creates and updates a database named USER_PROFILE for storing the user profile.
5. UQAA accesses the server-side machine to access the Web server logs. It parses the information contained in the log files.
   5.1 It stores this information in a table named USER_QUERY.
   5.2 It sends the USER_QUERY table to PMA.
6. CA sends a request to PMA for accessing the USER_PROFILE database.
   6.1 After its request is authenticated, it is given access to the USER_PROFILE database.
   6.2 Using the USER_PROFILE database, CA creates clusters of users on various parameters such as time, location, and interests.
7. Recommendations are given from the server side to the client considering the STIs, LTIs, and other contextual parameters.
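To make the request/response pattern of steps 4 and 5 concrete, here is a minimal sketch of PMA pulling the WSB database from UBTA and the USER_QUERY table from UQAA to refresh USER_PROFILE; the class and method names and the in-memory stand-ins for the tables are assumptions made for illustration.

```python
from typing import Dict, List

class UBTA:
    def __init__(self) -> None:
        self.wsb: Dict[str, List[str]] = {"STI": ["/hotels"], "LTI": ["/cricket"]}

    def get_wsb(self, requester: str) -> Dict[str, List[str]]:
        # Step 2.3: hand over the WSB database on request from PMA
        assert requester == "PMA"
        return self.wsb

class UQAA:
    def __init__(self) -> None:
        self.user_query: List[str] = ["cricket world cup schedule"]

    def get_user_query(self, requester: str) -> List[str]:
        # Step 5.2: send the USER_QUERY table to PMA
        assert requester == "PMA"
        return self.user_query

class PMA:
    def __init__(self) -> None:
        self.user_profile: Dict[str, object] = {}

    def refresh_profile(self, ubta: UBTA, uqaa: UQAA) -> None:
        # Steps 4 and 4.1-4.3: pull client-side interests and server-side queries,
        # then create/update the USER_PROFILE database
        self.user_profile["interests"] = ubta.get_wsb("PMA")
        self.user_profile["queries"] = uqaa.get_user_query("PMA")

pma = PMA()
pma.refresh_profile(UBTA(), UQAA())
print(pma.user_profile)
```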

7 Conclusion and Future Work

This study proposes a framework for multi-agent-based dynamic user profile generation that takes contextual information into account. The proposed framework uses a number of autonomous software agents to automate the various tasks in profile generation. Detailed algorithms for the working of each agent have been described, along with the flow of information between system components. The use of automated agents makes the proposed system more efficient than other available methods. The current work may be extended by comparing and specifying various mechanisms for the detailed representation of the user profile. Further, the actual implementation of the framework can be undertaken in future research.

References

1. Singh, A.: Wisdom web: the WWW generation next. Int. J. Advancements Technol. 3(3),
123–126 (2012)
2. Jammalamadaka, K., Srinivas, I.V.: A survey on ontology based web personalization. Int.
J. Res. Eng. Technol. 2(10), 163–167 (2013)
3. Singh, A.: Agent based framework for semantic web content mining. Int. J. Advancements
Technol. 3(2), 108–113 (2012)
4. Singh, A., Mishra, R.: Exploring web usage mining with scope of agent technology. Int.
J. Eng. Sci. Technol. 4(10), 4283–4289 (2012)
5. Carmagnola, F., Cena, F., Gena, C.: User model interoperability: a survey. User Model.
User-Adap. Inter. 21(3), 285–331 (2011)
6. Kelly, D., Teevan, J.: Implicit feedback for inferring user preference: a bibliography.
ACM SIGIR Forum 37(2), 18–28 (2003)
7. Chen, L., Sycara, K.: WebMate: a personal agent for browsing and searching. In: Proceedings
of the 2nd International Conference on Autonomous Agents, Minneapolis/St. Paul, 9–13
May, pp. 132–139. ACM Press, New York (1998)
8. Moukas, A.: Amalthaea: information discovery and filtering using a multiagent evolving
ecosystem. Appl. Artif. Intel. 11(5), 437–457 (1997)
9. Minio, M., Tasso, C.: User modeling for information filtering on INTERNET services:
exploiting an extended version of the UMT shell. In: UM96 Workshop on User Modeling for
Information Filtering on the WWW; Kailua-Kona, Hawaii, 2–5 Jan 1996
10. Bloedorn, E., Mani, I., MacMillan, T.R.: Machine learning of user profiles: representational
issues. In: Proceedings of AAAI 96 from Portland, 4–8 Aug, Oregon, vol. 1, pp. 433–438
(1996)
11. Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A.: User profiles for personalized
information access. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) The Adaptive Web.
LNCS 4321, pp. 54–89. Springer, Berlin, Heidelberg (2007)
12. Singh, A., Anand, P.: Automatic domain ontology construction mechanism. In: Proceedings
of IEEE International Conference on Recent Advances in Intelligent Computing Systems
(RAICS) from 19–21 Dec, pp. 304–309. IEEE Press, Trivandrum, Kerala, India (2013)
13. Bhowmick, P.K., Sarkar, S., Basu, A.: Ontology based user modeling for personalized
information access. Int. J. Comput. Sci. Appl. 7(1), 1–22 (2010)
14. Middleton, S.E., Shadbolt, N.R., Roure, D.C.D.: Ontological user profiling in recommender
systems. ACM Trans. Inf. Syst. 22(1), 54–88 (2004)
15. Sosnovsky, S., Dicheva, D.: Ontological technologies for user modelling. Int. J. Metadata
Semant. Ontol. 5(1), 32–71 (2010)
16. Razmerita, L., Angehrn, A., Maedche, A.: Ontology based user modeling for knowledge
management systems. In: Brusilovsky, P., Corbett, A., Rosis, F.D. (eds.) User Modeling
2003. LNCS, vol. 2702, pp. 213–217. Springer, Berlin, Heidelberg (2003)
17. Trajkova, J., Gauch, S.: Improving ontology-based user profiles. In: Proceedings of RIAO
2004 on 26–28 Apr, pp. 380–389, France (2004)
18. Hawalah, A., Fasli, M.: A multi-agent system using ontological user profiles for dynamic user
modelling, In: Proceedings of International Conferences on Web Intelligence and Intelligent
Agent Technology, pp. 430–437, IEEE Press, Washington (2011)
19. Aghabozorgi, S.R., Wah, T.Y.: Dynamic modeling by usage data for personalization systems,
In: Proceedings of 13th International Conference on Information Visualization, pp. 450–455.
IEEE Press, Barcelona (2009)
20. Skillen, K.L., Chen, L., Nugent, C.D., Donnelly, M.P., Burns, W., Solheim, I.: Ontological
user profile modeling for context-aware application personalization. In: Bravo, J.,
López-de-Ipiña, D., Moya, F. (eds.) Ubiquitous Computing and Ambient Intelligence.
LNCS, vol. 7656, pp. 261–268. Springer, Berlin, Heidelberg (2012)
16 A. Singh and A. Sharma

21. Vigneshwari, S., Aramudhan, M.: A novel approach for personalizing the web using user
profiling ontologies. In: IEEE Fourth International Conference on Advanced Computing
ICoAC, pp. 1–4. IEEE Press, Chennai (2012)
22. Li, L., Yang, Z., Wang, B., Kitsuregawa, M.: Dynamic adaptation strategies for long-term and
short-term user profile to personalize search. In: Dong, G., Lin, X., Wang, W., Yang, Y., Yu,
J.X. (eds.) Advances in Data and Web Management. LNCS, vol. 4505, pp. 228–240. Springer
Berlin, Heidelberg (2007)
23. Woerndl, W., Groh, G.: A proposal for an agent-based architecture for context-aware
personalization in the semantic web. In: Proceeding of IJCAI Workshop Multi-agent
information retrieval and recommender systems, Edinburg, UK-IJCAI (2005)
24. Hijikata, Y.: Estimating a user’s degree of interest in a page during web browsing. In:
Proceedings of IEEE SMC ‘99 Conference, vol. 4, pp. 105–110. IEEE Press, Tokyo (1999)
25. Moawad, I.F., Talha, H., Hosny, E., Hashim, M.: Agent-based web search personalization
approach using dynamic user profile. Egypt. Inf. J. 13, 191–198 (2012)
26. Sieg, A., Mobasher, B., Burke, R.: Learning ontology-based user profiles: a semantic
approach to personalized web search. IEEE Intel. Inf. Bull. 8(1), 7–18 (2007)
27. Singh, A., Alhadidi, B.: Knowledge oriented personalized search engine: a step towards
wisdom web. Int. J. Comput. Appl. 76(8), 1–9 (2013)
28. Singh, A., Sharma, A., Dey, N.: Semantics and agents oriented web personalization: state of
the art. Int. J. Serv. Sci. Manag. Eng. Technol. 6(2), 35–49 (2015)
29. Yahoo Personalized Portal. [Link]
Implementation of Equivalence
of Deterministic Finite-State Automation
and Non-deterministic Finite-State
Automaton in Acceptance of Type 3
Languages Using Programming Code

Rinku, Chetan Sharma, Ajay Kumar Rangra, Manish Kumar


and Aman Madaan

Abstract An automaton is used where information and materials are transformed, transmitted, and utilized to perform processes without direct human involvement. A finite automaton (both deterministic and non-deterministic) is used to recognize a formal regular language. In this paper, we represent the acceptance of a formal regular language (Type 3 according to Noam Chomsky) by both deterministic and non-deterministic automata. We use a simple algorithm to implement finite-state automata in a middle-level language and demonstrate that a deterministic finite-state automaton provides a unique solution in comparison to a non-deterministic finite-state automaton. More importantly, any problem solved by a non-deterministic finite-state automaton can also be solved using an equivalent deterministic finite-state automaton.

Keywords Finite-state automaton · Formal regular language · Deterministic · Non-deterministic · Equivalence

Rinku (&) · C. Sharma · A. K. Rangra · M. Kumar · A. Madaan
Chitkara University, Kalujhinda, Himachal Pradesh, India
e-mail: [Link]@[Link]
C. Sharma
e-mail: [Link]@[Link]
A. K. Rangra
e-mail: [Link]@[Link]
M. Kumar
e-mail: [Link]@[Link]
A. Madaan
e-mail: amanmadaan90@[Link]


1 Introduction

The word ‘formal’ means that all the rules for the language are explicitly stated in
terms of what string of symbols can occur, and a formal language can be viewed as
a set of all strings permitted by the rules of formation. Finite automatons are the
simplest model of an automatic machine. If we see the history of designing of the
automatic machines, the first calculating device was the abacus first time used in
China. Abacus was used to perform some arithmetic operations like addition,
multiplication on positive integers. This was the first initiation toward the designing
of calculating devices. Further, we found several enhancements, and presently, we
have numerous calculating devices. Today, all automatic machines are designed based on some kind of model. A prominent example of such a machine is the computer system, and the finite automaton is its abstract model.
A finite automaton is mainly used for modeling reactive systems. A system which changes its actions, outputs, and conditions/status in reply to stimuli from within or outside it is known as a reactive system. A reactive system is a situation-driven/control-driven system that continuously has to react to external and/or internal stimuli. In general, finite automata are useful models to explain the dynamic behavior of reactive systems.
A finite automaton consists of a finite memory called the input tape, a read-only head, and a finite control. The input is written on the tape; the head reads one symbol at a time, the automaton moves into the next state, and this continues until the last symbol of the input string. The movement of the head and the transition to the next state are decided by the finite control. Once the input has been read, the finite automaton decides the validity or acceptability of the input by acceptance or rejection. It does not write its output on the tape, which is a limitation of this model.
The input tape is divided into compartments, and each compartment contains one symbol from the input alphabet. The symbol '$' is used at the leftmost cell, and the symbol 'W' is used at the rightmost cell to indicate the beginning and end of the input tape. It is similar to a read-only file in a computer system, which has both a beginning and an end.

2 Finite Acceptors

Inputs are transformed into outputs with the help of an information processing device. Two alphabets are associated with such a device: alphabet A is taken as input for communicating with it, and alphabet B is taken as output for receiving answers. Let us consider a device that accepts an English sentence as input and outputs the corresponding sentence in the French language. Once the whole input is read, each input symbol is processed step by step; if after reading the last symbol of the input string we reach a final state, then the output is 'yes' and the input string is accepted, otherwise it is rejected. By this procedure, A* is divided into two subsets: the 'yes' subset is called the language accepted by the machine, and the 'no' subset is called the language rejected by the machine. A device that operates in this way is called an acceptor. The mathematical model is described below:
Finite automaton is of two types:
• DFA
• NDFA.
A deterministic finite acceptor A is specified by five tuples of information, A = (Σ, Q, s, δ, F), where Σ is a finite input alphabet, Q is a finite set of states, s ∈ Q is the initial state, δ: Q × Σ → Q is the transition function, and F ⊆ Q is the set of acceptance states.

3 Algorithm to Implement DFA
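As an illustration of the idea behind this section, a minimal table-driven DFA simulation is sketched below in C++; the two-state automaton over the alphabet {a, b} and all identifier names are illustrative assumptions rather than the original listing.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// A DFA is a five-tuple (Sigma, Q, s, delta, F); here states are numbered
// 0..n-1, the alphabet is {a, b}, and delta is stored as a transition table.
struct DFA {
    std::vector<std::vector<int>> delta;   // delta[state][symbol index]
    int start;
    std::set<int> accepting;

    bool accepts(const std::string& input) const {
        int state = start;
        for (char c : input) {
            int sym = c - 'a';                          // map 'a' -> 0, 'b' -> 1
            if (sym < 0 || sym >= (int)delta[state].size()) return false;
            state = delta[state][sym];                  // deterministic: exactly one move
        }
        return accepting.count(state) > 0;              // accept iff we end in a final state
    }
};

int main() {
    // Example DFA accepting strings over {a, b} that end with 'b'.
    DFA d;
    d.delta = {{0, 1}, {0, 1}};
    d.start = 0;
    d.accepting = {1};

    std::cout << d.accepts("aaabbaab") << "\n";  // 1 (ends with 'b')
    std::cout << d.accepts("baaba") << "\n";     // 0
    return 0;
}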



4 Programming Code to Implement DFA and NFA



5 Regular Languages or Type 3 Grammars

Type 3 languages are described by Type 3 grammars. The LHS of a Type 3 grammar production is a single non-terminal. The RHS is empty, consists of a single terminal, or consists of a single terminal followed by a non-terminal. Type 3 grammars are therefore represented by productions of the form A → aB or A → a. Type 3 languages are recognized using finite-state automata (FA). The following is an example of a regular grammar:

A → aA   A → ε   A → bB   B → bB   B → ε   A → c



6 Equivalence of NFA to DFA on Acceptance of Regular


Language

The code given above shows the programming implementation of the deterministic as well as the non-deterministic automaton along with their equivalence. Some examples executed on the basis of the above algorithm and code, together with test cases, are presented below.

7 Example of NFA to DFA with Test Cases

Consider an NFA with a few states along with some inputs. An equivalent DFA for the above NFA can be created easily, as shown in the figure, per the algorithm.

8 Algorithm for Conversion of NFA to DFA

Let Mn = (Q, Σ, δ, q0, F) be an NFA which recognizes a language L; then an equivalent DFA Md = (Q′, Σ, δ′, q′0, F′) which satisfies the following conditions recognizes L, where Q′ = 2^Q, i.e. the set of all subsets of Q.
To obtain the equivalent DFA, we follow this procedure.
Step 1: Initially Q′ = Ø.
Step 2: {q0} is the initial state of the DFA Md; put {q0} into Q′.
Step 3: For each state q′ in Q′ and each input symbol a, add δ′(q′, a) = ⋃_{p ∈ q′} δ(p, a) to δ′, where δ on the right-hand side (R.H.S.) is that of the NFA Mn; if this subset is not already in Q′, add it to Q′ as a new state.
Step 4: Repeat Step 3 as long as there are new states to add to Q′; when no new state is found to add to Q′, the process terminates. All the states of Q′ which contain an accepting state of Mn are accepting states of Md.
Important: Unreachable states are not included in Q′ (unreachable means states which cannot be reached from the initial state).
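A compact sketch of this subset construction in C++ is given below; the NFA representation (a map from (state, symbol) to a set of successor states), the example automaton, and all names are illustrative assumptions. Only subsets reachable from {q0} are generated, so unreachable states never enter Q′, matching the note above.

#include <iostream>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

using StateSet = std::set<int>;
// NFA transitions: (state, symbol) -> set of successor states.
using NfaDelta = std::map<std::pair<int, char>, StateSet>;

// Subset construction: each DFA state is a set of NFA states.
std::map<StateSet, std::map<char, StateSet>>
nfaToDfa(const NfaDelta& delta, int q0, const std::string& alphabet) {
    std::map<StateSet, std::map<char, StateSet>> dfa;
    std::queue<StateSet> work;
    StateSet start = {q0};
    dfa[start];                                  // register the start state
    work.push(start);
    while (!work.empty()) {
        StateSet cur = work.front(); work.pop();
        for (char a : alphabet) {
            StateSet next;                       // union of delta(p, a) for p in cur
            for (int p : cur) {
                auto it = delta.find({p, a});
                if (it != delta.end()) next.insert(it->second.begin(), it->second.end());
            }
            dfa[cur][a] = next;
            if (!dfa.count(next)) {              // new subset -> new DFA state
                dfa[next];
                work.push(next);
            }
        }
    }
    return dfa;
}

int main() {
    // NFA over {a, b} accepting strings containing "ab" (states 0, 1, 2; 2 final).
    NfaDelta delta = {
        {{0, 'a'}, {0, 1}}, {{0, 'b'}, {0}},
        {{1, 'b'}, {2}},
        {{2, 'a'}, {2}},    {{2, 'b'}, {2}},
    };
    auto dfa = nfaToDfa(delta, 0, "ab");
    std::cout << "DFA states: " << dfa.size() << "\n";   // 4 reachable subsets
    return 0;
}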

8.1 Test Case 1 for NFA and Its Equivalent DFA

Test case 1:
NFA — Input: aaabbaab; Standard output: final state (i.e. q2, q3); Output: q2, q3
DFA — Input: aaabbaab; Standard output: final state (i.e. C, D, E); Output: D

Test case 2:
NFA — Input: baabab; Standard output: final state (i.e. q2, q3); Output: q3
DFA — Input: baabab; Standard output: final state (i.e. C, D, E); Output: D

9 Result and Conclusion

In this paper, we have discussed the distinction between deterministic and non-deterministic finite-state automata. Using a simple algorithm, we can implement a deterministic as well as a non-deterministic finite-state automaton in an executable form to find the solution of various problems. We have also discussed the equivalence of deterministic and non-deterministic finite-state automata with the use of various examples. The example takes a non-deterministic finite-state automaton and gives the equivalent deterministic finite-state automaton along with suitable test cases. We conclude that a non-deterministic finite-state automaton may have different solutions for a given problem, whereas a deterministic finite-state automaton provides a single, unique solution to the problem.

A Multi-factored Cost- and Code
Coverage-Based Test Case Prioritization
Technique for Object-Oriented Software

Vedpal and Naresh Chauhan

Abstract Test case prioritization is a process to order the test cases in such a way that maximum faults are detected as early as possible. It is very expensive to execute the test cases in an unordered manner. In the present work, a multi-factored cost- and
code coverage-based test case prioritization technique is presented that prioritizes
the test cases based on the percentage coverage of considered factors and code
covered by the test cases. For validation and analysis, the proposed approach has
been applied on three object-oriented programs and efficiency of the prioritized
suite is analyzed by comparing the APFD of the prioritized and non-prioritized test
cases.

Keywords Object-oriented testing · Test case prioritization · Cost- and code coverage-based testing · Multi-factors-based test case prioritization

1 Introduction

Testing is one of the important phases of the software development life cycle. The
testing of the software consumes a lot of time and efforts. Presently, the cost of
testing the software is increasing very rapidly. According to the findings of the sixth quality report [1], the share of the testing budget is expected to reach 29% by 2017. Every software industry is mainly concerned with reducing the testing cost and detecting bugs in the minimum time. For delivering quality soft-
ware, it is necessary to detect all the possible errors and fix them. There are various
factors that indicate the quality of the software. These factors are functionality,
correctness, completeness, efficiency, portability, usability, reliability, integrity,

Vedpal (&) · N. Chauhan
Department of Computer Engineering, YMCA University of Science and Technology,
Faridabad, India
e-mail: ved_ymca@[Link]
N. Chauhan
e-mail: nareshchauhan19@[Link]


maintainability, etc. Due to the advancement of technology and expectation of the


customer, it is a very challenging task to deliver quality software within the allocated budget.
The quality of the software also depends on the skills and experience of the
developer. Nowadays, various languages are used by the developer. Every language
has various features which provide the different functionalities. Many packages are
also available in the markets which are directly imported in the software. If the
different features of the language and packages are not used efficiently, then
error-prone software may be produced. If the developers are not aware of the
conditions of the software developed, then it will surely produce the bug.
Sometimes, the implemented functionality is error-free, but it may consume more
memory and execution time. High consumption of the memory and execution time
affects the quality of the software. So there should be a technique to identify the
factors which are most capable of introducing the error, increasing the execution
time of the software, and consuming more memory. The test case prioritization should be done in a way that the maximum number of errors can be detected as early as possible.
In this paper, a test case prioritization technique is presented. The proposed
technique prioritizes the test cases on the basis of the cost and code covered by the
test cases. To calculate the cost of the path, various factors are considered. For
experimental validation and analysis, the proposed approach has been applied on
the various examples that were implemented in the C++ language. The results show
effectiveness of the proposed approach.

2 Related Work

The size of the test suites increases as the software evolves. Test cases are used to
test the existing program. If the software gets modified due to the addition of new
functionality, new test cases may be added to the existing test cases. There are many
constraints on the industry like resource, time, and cost. So, it is important to
prioritize the test cases in a way that probability of error detection is higher and
earlier. In this section, an overview of various researchers is discussed.
Shahid and Ibrahim [2] proposed a new code-based test case prioritization
technique. The presented approach prioritizes the test cases on the basis of the code
covered by the test case. The test cases that covered the maximum methods have the
highest probability of detecting the errors earlier. Abdullah et al. [3] presented the findings of a systematic review conducted to collect evidence on testability estimation of object-oriented design. They concluded that testability is a factor that predicts how much effort will be required for testing the software.
Huda et al. [4] proposed an effective quantification model of object-oriented
design. The proposed model uses the technique of multiple linear regressions
between the effective factors and metrics. Structural and functional information of
object-oriented software has been used to validate the assessment of the effec-
tiveness of the factors. The model has been proposed by establishing the correlation

between effectiveness and object-oriented design constructs. The quantifying ability


of model is empirically validated.
Chhikara and Chhillar [5] presented a set of metrics. The presented metrics are
used to order the programs based on their complexity values. They concluded that
there should be compromise among internal attributes of software to maintain the
higher degree of reusability. Patwa et al. [6] presented the factors of coding phase
that affect the testing of object-oriented software. These factors are programmer and
tester skills, programmer and tester organization, development team size, program
workload (stress), domain knowledge, and human nature (mistake or work omis-
sion). Analysis of factors and place of these factors according to their impact on the
software are identified by using the relative weight method and ANOVA test.
Bruntink et al. [7] analyzed the relation between classes and their JUnit test
cases. They demonstrated a significant correlation between the class-level metrics
and test-level metrics. They also discussed how various metrics can contribute to
testability. They conducted the experiments using the GQM and MEDEA frame-
work. The results are evaluated using Spearman’s rank-order correlation coefficient.
Malviya and Singh [8] presented some observation on maintainability estimation
model for object-oriented software in requirement, design, coding and testing
phases. The presented work is about increasing the maintainability factors of the
MOOD metrics.
Hao et al. [9] presented a unified test case prioritization approach. The presented
approach includes two models. They showed that there is a spectrum of test case
prioritization techniques. The spectrum is generated by the model that resides
between the techniques using purely total or purely additional strategies. They
proposed extensions to enable the use of probabilities that test cases can detect
errors for methods and use the dynamic coverage information in place of static
coverage information.

3 Proposed Work

The presented approach prioritizes the test cases on the basis of the cost and the
coverage of the code covered by the test case. For accurately finding out the cost of
the test case, some factors are considered in Table 1. The proposed approach works
at two levels. At the first level, all the considered factors existing in the source code are identified. After identifying and counting the factors, all independent paths of the source code are determined; then, the cost of each path is computed on the basis of the coverage of the identified factors. Test cases are selected corresponding to the independent paths. The cost of a test case can be calculated by
using Formula 1. The code coverage of test case is determined by counting lines of
code executed by the test case. At the second level, pairs of cost and code value of
each test case are created. In this way by using the value of the cost and code
coverage, the test cases are prioritized. The following scenario is used for the
prioritization of the test cases:

Table 1 Proposed factors

S. No.  Proposed factor
1 Operators
2 Variables
3 External system call
4 Predicate statement
5 Assignment statement
6 Use of libraries/packages
7 Virtual function
8 Exception handling

(1) Highest code coverage and cost will have highest priority.
(2) Second priority should be given to test case that has highest cost value.
(3) Third priority should be given to test case that has highest code coverage.
(4) Test cases with the equal code coverage and cost should be ordered.
The overview of the proposed approach is shown in Fig. 1.
The cost covered by the test case can be calculated by applying the Formula 1 as
given below:

Cost(Ti) = SF(Ti) / TF    (1)

where SF is the sum of the factors covered by the ith test case, and TF is the sum of all the factors existing in the source code.

Fig. 1 Overview of the proposed approach: non-prioritized test cases → determine all factors existing in the source program and the factors covered by individual test cases → calculate the cost and code covered by each test case → prioritize the test cases using the calculated values of cost and code coverage → execute the test cases in prioritized order



3.1 Considered Factors for Prioritized Test Cases

Table 1 shows the considered factors that are used to prioritize the test cases. The
factors are considered by the structural analysis of the program. The considered
factors may affect the testing process in terms of consumption of memory, exe-
cution time, and the possibility of introducing the error in program.
These are factors related to object-oriented program affecting the prioritization of
test cases. There are a total of eight object-oriented software-related factors included
in this work. The algorithm of the proposed approach is given in Fig. 2.

4 Result and Analysis

For the experimental validation and evaluation, the proposed approach has been
applied on the three programs. The programs are implemented in the C++ lan-
guage. For the experimental analysis, faults are intentionally introduced in the programs. Program one has 170 lines of code, program two [10] has 361 lines of code, and program three has 48 lines of code.
Table 2 shows the various factors covered by the test cases, Table 3 shows the
line of code covered by the test cases, Table 4 shows the calculated cost of all test
cases that are used to test the software, and Table 5 shows the various pairs of cost
and code covered by the test cases.
The prioritizing order of test cases as determined by the proposed approach is
TC6, TC2, TC1, TC4, TC8, TC5, TC3, TC7.

4.1 Faults Detected by Test Cases in Non-Prioritized Order

The faults are identified in the non-prioritized order as shown in Table 6.

Let T be the list of non-prioritized test cases and T′ be the list of the prioritized test cases.
While (T not empty)
Begin
Step 1. Identify and count all the considered factors that are used in the source code.
Step 2. Determine the factors and lines of code being covered by the test cases.
Step 3. Calculate the cost of each test case by applying the formula
        Cost(Ti) = SF(Ti) / TF
        where SF is the sum of the factors covered by the test case and TF is the sum of the factors in the source code.
End
Step 4. Determine all possible pairs of the code coverage value and cost value of each test case:
        Pair = (Cost, Code Coverage)
Step 5. Prioritize the test cases according to the following scenarios:
(1) The test case with the highest value of both cost and code coverage has the highest priority.
(2) Second priority should be given to the test case that has the highest cost value.
(3) Third priority should be given to the test case that has the highest code coverage.
(4) Test cases with equal values of code coverage and cost should be prioritized in random order.
Create T′, the list of prioritized test cases.

Fig. 2 Algorithm of the proposed approach
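A brief sketch of these two levels in C++ is given below; the TestCase structure is an illustrative assumption, while the factor-coverage, line-coverage, and total-factor values are taken from Tables 2, 3 and 4. With these values, the comparator reproduces the prioritized order TC6, TC2, TC1, TC4, TC8, TC5, TC3, TC7 reported in Sect. 4.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct TestCase {
    std::string id;
    int factorsCovered;   // SF(Ti): factors of Table 1 covered by the test case
    int linesCovered;     // code coverage in lines of code
    double cost;          // Cost(Ti) = SF(Ti) / TF, filled in below (as a percentage)
};

int main() {
    int totalFactors = 35;  // TF: all factors present in the source code
    std::vector<TestCase> suite = {
        {"TC1", 10, 36}, {"TC2", 10, 42}, {"TC3", 1, 31}, {"TC4", 4, 36},
        {"TC5", 4, 34},  {"TC6", 12, 48}, {"TC7", 1, 31}, {"TC8", 4, 36},
    };

    // Level 1: compute the cost of each test case (Formula 1).
    for (auto& t : suite) t.cost = 100.0 * t.factorsCovered / totalFactors;

    // Level 2: order by cost first, then by code coverage; equal pairs keep
    // their relative order, following scenarios (1)-(4) of the algorithm.
    std::stable_sort(suite.begin(), suite.end(),
                     [](const TestCase& a, const TestCase& b) {
                         if (a.cost != b.cost) return a.cost > b.cost;
                         return a.linesCovered > b.linesCovered;
                     });

    for (const auto& t : suite)
        std::cout << t.id << " (" << t.linesCovered << ", " << t.cost << ")\n";
    return 0;
}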



Table 2 Factors covered by test cases


Factors TC1 TC2 TC3 TC4 TC5 TC6 TC7 TC8
Operators 4 4 0 0 1 7 0 0
Variable 3 3 1 4 2 3 1 4
Native method 0 0 0 0 0 0 0 0
Control statement 0 1 0 0 0 2 0 0
Assignment 3 2 0 0 1 2 0 0
SF 10 10 1 4 4 12 1 4

Table 3 Line of code covered by test cases


TC1 TC2 TC3 TC4 TC5 TC6 TC7 TC8
Line of code 36 42 31 36 34 48 31 36

Table 4 Calculated cost of test cases


TC1 TC2 TC3 TC4 TC5 TC6 TC7 TC8
Factor coverage 10 10 1 4 4 12 1 4
(TFC)
Total factors (TF) 35 35 35 35 35 35 35 35
Cost 28.57 28.57 2.85 11.42 11.42 34.28 2.85 11.42

Table 5 Pairs of the cost and code coverage by test cases

S. No.  Test case  Pairs
1 TC1 (36, 28.57)
2 TC2 (42, 28.57)
3 TC3 (31, 2.85)
4 TC4 (36, 11.42)
5 TC5 (34, 11.42)
6 TC6 (48, 34.28)
7 TC7 (31, 2.85)
8 TC8 (36, 11.42)

Table 6 Faults detected by test cases in non-prioritized order


TC1 TC2 TC3 TC4 TC5 TC6 TC7 TC8
F1 * * * * * * * *
F2 * *
F3 *
F4 *
F5 *
F6 *
F7 *
F8 *
F9 *
F10 *

Table 7 Faults detected by test cases in prioritized order


TC6 TC2 TC1 TC4 TC8 TC5 TC7 TC3
F1 * * * * * * * *
F2 * *
F3 *
F4 *
F5 *
F6 *
F7 *
F8 *
F9 *
F10 *

4.2 Faults Detected by Test Cases in Prioritized Order

Table 7 shows the faults detected by the test cases when they are executed in prioritized order.
For simplicity, the fault detection data are shown for only one program.

4.3 Comparison of APFD Graphs Prioritized


and Non-prioritized Order of Test Cases for Three
Programs

Figure 3 shows the comparison of APFD graphs for program one, Fig. 4 shows the
comparison of APFD graphs for program two, and Fig. 5 shows the comparison of
APFD graphs for program three. The test cases are executed in prioritized order
obtained after applying the proposed approach and non-prioritized approach [11, 12].

Fig. 3 Comparison of APFD detected by random and prioritized test cases for program one

Fig. 4 Comparison of APFD detected by random and prioritized test cases for program two

4.4 Effectiveness of the Proposed Approach

The effectiveness of the proposed approach is measured through the APFD metric, and its values are shown in Table 8 [13]. The APFD value of the prioritized order of test cases obtained by applying the proposed approach is better than that of the random ordering of test cases. Therefore, it can be observed from Table 8 that the prioritized test cases have a higher fault-exposing rate than the non-prioritized test cases.
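The APFD metric itself can be computed as sketched below; this is a generic implementation of the standard APFD formula, and the small fault matrix in main() is a hypothetical example, not the data of Table 7.

#include <iostream>
#include <vector>

// APFD = 1 - (TF1 + TF2 + ... + TFm) / (n * m) + 1 / (2n),
// where TFi is the position (1-based) of the first test case that detects
// fault i, n is the number of test cases and m is the number of faults.
double apfd(const std::vector<std::vector<bool>>& detects /* [test][fault] */) {
    int n = detects.size();
    int m = detects.empty() ? 0 : detects[0].size();
    double sumTF = 0.0;
    for (int f = 0; f < m; ++f) {
        for (int t = 0; t < n; ++t) {
            if (detects[t][f]) { sumTF += t + 1; break; }  // first detecting test
        }
    }
    return 1.0 - sumTF / (n * m) + 1.0 / (2.0 * n);
}

int main() {
    // Small illustrative fault matrix (hypothetical values): rows are test
    // cases in execution order, columns are faults; true = fault detected.
    std::vector<std::vector<bool>> detects = {
        {true,  true,  false, false},
        {false, false, true,  false},
        {false, false, false, true},
    };
    std::cout << "APFD = " << apfd(detects) << "\n";  // 1 - (1+1+2+3)/(3*4) + 1/6
    return 0;
}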

Fig. 5 Comparison of APFD detected by random and prioritized test cases for program three

Table 8 Compared result of test cases for prioritized and non-prioritized order
Case study Non-prioritized test cases (APFD) (%) Prioritized test cases (APFD) (%)
Program one 57 70
Program two 55 72
Program three 37 62

5 Conclusion

In the proposed approach, a test case prioritization technique for object-oriented


software is presented. The approach prioritizes the test cases based on coverage of
determined cost and lines of code. It works at two levels. At the first level, cost and
code covered by the test cases are determined. Costs are calculated by using the
proposed factors that are covered by the test case. At the second level, test cases are
prioritized on the basis of determined cost and code covered by the test case. For
experimental validation and evaluation, it has been applied on three object-oriented
programs. The experimented results show that the proposed approach leads to
improve the rate of fault detection early in comparison with non-prioritized order of
test cases. The significant characteristic of the proposed approach is that it helps to
reduce the testing cost by detecting the faults in short time.

References

1. Decouz, H.: Capgemini World Quality Report (2014). [Link]


2. Shahid, M., Ibrahim, S.: A new code based test case prioritization technique. Int. J. Softw.
Eng. Appl. 8(6), 31–38 (2014)
3. Abdullah, D., Srivastava, R., Khan, M.H.: Testability estimation of object oriented design: a
revisit. Int. J. Adv. Res. Comput. Commun. Eng. 2(8) (2013)
4. Huda, M., Arya, Y.D.S., Khan, M.H.: Evaluating effectiveness factor of object oriented
design: a testability perspective. Int. J. Softw. Eng. Appl. (IJSEA) 6(1), 41 (2015)
5. Chhikara, A., Chhillar, R.S.: Analyzing complexity of java program using object oriented
software metrics. IJCSI Int. J. Comput. Sci. 9(1), 3 (2012). ISSN:1694-0814
6. Patwa, S., Malviya A.K.: Impact of coding phase on object oriented software testing.
Covenanat J. Inf. Commun. Technol. (CJICT) 2(1) (2014)
7. Bruntink, M., Van Deursen A.: Predicting class testability using object oriented metrics. In:
4th IEEE International Workshop on Source Code Analysis and Manipulation, pp. 136–145.
Chicago, IL, US (2004)
8. Malviya, A.K., Singh V.: Some observation on maintainability estimation model for object
oriented software in requirement, design, coding and testing phases. Int. J. Adv. Res. Comput.
Sci. Softw. Eng. 5 (3) (2015). ISSN: 2277 128X
9. Hao, D., Zhang, L., Zhang, L., Rothermel, G., Mei, H.: A unified test case prioritization
approach. In: ACM Trans. Softw. Eng. Methodol. 9(4) (2010)
10. [Link]
11. Meyers S.: More effective C++. Addison-Wesley, Professional Computing Series
12. Jain R., Aggarwal N.: Additional fault detection test case prioritization. BIJIT-BVICAM’s Int.
J. Inf. Technol. 5(2) (2013)
13. Chauhan, N.: Software testing principles and practices. Oxford University Press, Oxford
(2010)
A Novel Page Ranking Mechanism
Based on User Browsing Patterns

Shilpa Sethi and Ashutosh Dixit

Abstract Primary goal of every search engine is to provide the sorted information
according to user’s need. To achieve this goal, it employs ranking techniques to sort
the Web pages based on their importance and relevance to user query. Most of the
ranking techniques till now are either based upon Web content mining or link
structure mining or both. However, they do not consider the user browsing patterns
and interest while sorting the search results. As a result of which, ranked list fails to
cater the user’s information need efficiently. In this paper, a novel page ranking
mechanism based on user browsing patterns and link visits is being proposed. The
simulated results show that the proposed ranking mechanism performs better than
the conventional PageRank mechanism in terms of providing satisfactory results to
the user.

1 Introduction

The World Wide Web is an interlinked collection of trillions of Web documents


accessed via Internet. To retrieve the desired information from such a huge col-
lection, Web search engines can be used [1]. The basic components of every search
engine are: crawler, indexer, user interface, and query processor. The Crawler
browses the Web automatically at a specified interval, downloads the Web page
from different Web servers, and provides the information to indexer. The indexer
further indexes Web documents in search engine database. The query processor
executes the user query on search engine database and finds the documents which
matches with the user query and ranks them according to some relevance factors.
Many ranking algorithms have been proposed in past. Among them, PageRank

S. Sethi (&) · A. Dixit
Department of Computer Science, YMCA University of Science and Technology,
Faridabad, India
e-mail: [Link]@[Link]
A. Dixit
e-mail: dixit_ashutosh@[Link]


Table 1 Overview of web mining techniques

Features | WCM | WSM | WUM
Mining focus | Within the document | Within as well as between the documents | User navigational patterns
Input data | Text, HTML documents | Hyperlinks | Query logs, server log files
Representation of data | Bags of words for unstructured data, HTML tags for semistructured data, ontology | Structured summary of web pages in the form of a web graph | Relational tables
View of data | Semi- and unstructured | Structured | Event triggers
Method | NLP and machine learning algorithms | Proprietary algorithms | Personalization algorithms, association rules, NLP

(PR) [2, 3], Weighted PageRank (WPR) [4], and Hypertext Induced Topic Search
(HITS) [5], PageRank based on link visit (PRLV) [6] are some of the popular
ranking algorithms. All these algorithms are based on Web mining concept.
Web mining is a branch of data mining techniques that discover the useful
patterns from Web documents. Web content mining (WCM), Web structure mining
(WSM), and Web usages mining (WUM) are three main categories of Web mining.
[7, 8] PR, WPR, and HITS are purely based on Web structure mining, whereas
PRLV is based on Web structure as well as Web usages mining. Table 1 gives the
overview of these Web mining techniques. The proposed mechanism captures the
user interest in an efficient way by applying the concept of Web usages and Web
structure mining.
The rest of the paper is organized as follows: In next section, Web structure
mining and popular algorithms of this area have been discussed. Section 3
describes the proposed ranking mechanism in detail with example illustrations.
Section 4 depicts the complete working of proposed system. Concluding remarks
are given in Sect. 5.

2 Related Work

This section describes an overview of Web structure mining and some popular
PageRank algorithms with examples illustration. Web structure mining means
generating the link summary of Web pages and Web server in the form of Web
graph by using the concept of hyperlink topology between different documents as
well as within the same document [8]. A Web graph is a directed labeled graph where nodes represent the Web pages and edges represent the hyperlinks between the

pages. There are many algorithms based on Web structure mining. Some of them
which form the basis of proposed work are discussed in following sections.

2.1 PageRank

The PageRank algorithm was developed by Google and named after Larry Page [3]. The link structure of the Web is used to find out the importance of a Web
page. The importance of a page P can be obtained by evaluating the importance of
pages from which the page P can be accessed. Links from these pages are called as
inbound links. According to this algorithm, if the inbound links of a page are
important, then its outbound links also become important. The PageRank of a page
P is equally divided among its outbound links which further propagated to pages
corresponding to these outbound links. The PageRank of a page X can be calculated
by Eq. (1) as follows:

PR(X) = (1 − d) + d [PR(P1)/O(P1) + PR(P2)/O(P2) + ⋯ + PR(Pn)/O(Pn)]    (1)

where:
• P1, P2, … Pn represent the inbound links of page X
• O(P1), O(P2) … O(Pn) are no. of outbound links of page P1, P2 … Pn,
respectively
• d is the damping factor which is a measure of probability of user following
direct link. Its value is usually set to 0.85.
To explain the working of the PR method, let us take a small Web structure as shown in Fig. 1a consisting of four pages, namely P1, P2, P3, and P4, where page P1 is an inbound link of pages P2 and P4, page P2 is an inbound link of pages P4 and P3, P3 is an inbound link of P1, P2, and P4, and P4 is an inbound link of P1. According to Eq. (1), the PageRank of pages P1, P2, P3, and P4 can be computed as follows:


Fig. 1 a Sample web structure, b sample web structure with link visits and link weight

PR(P1) = [(1 − d) + d (PR(P3)/3 + PR(P4)/1)]    (1a)

PR(P2) = [(1 − d) + d (PR(P1)/2 + PR(P3)/3)]    (1b)

PR(P3) = [(1 − d) + d (PR(P2)/2)]    (1c)

PR(P4) = [(1 − d) + d (PR(P1)/2 + PR(P2)/2 + PR(P3)/3)]    (1d)

Initially, the PageRank of each page is taken to be 1; taking d = 0.85, the PageRank of each page is calculated iteratively until the values become stable, as shown in Table 2.
From Table 2, it may be noted that PR(P1) > PR(P4) > PR(P2) > PR(P3). These PR values are extracted by the crawler while downloading a page from the Web server, and they remain constant as long as the Web link structure does not change. In order to obtain the overall page score of a page, the query processor adds the precomputed PageRank (PR) value associated with the page to the text-matching score of the page against the user query before presenting the results to the user.
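The iterative computation behind Table 2 can be sketched as follows in C++; the adjacency lists encode the four-page example of Fig. 1a, while the fixed iteration count is an illustrative assumption (the exact values depend on the update order and rounding, so they may differ slightly from Table 2).

#include <iostream>
#include <vector>

// Iterative PageRank (Eq. 1): PR(X) = (1 - d) + d * sum over inbound pages Pi
// of PR(Pi) / O(Pi), starting from PR = 1 for every page.
int main() {
    const double d = 0.85;
    // inbound[x] lists the pages linking to x; outdeg[p] = O(p).
    // Fig. 1a example: P1 <- {P3, P4}, P2 <- {P1, P3}, P3 <- {P2}, P4 <- {P1, P2, P3}.
    std::vector<std::vector<int>> inbound = {{2, 3}, {0, 2}, {1}, {0, 1, 2}};
    std::vector<int> outdeg = {2, 2, 3, 1};

    std::vector<double> pr(4, 1.0);
    for (int step = 0; step < 5; ++step) {
        std::vector<double> next(4);
        for (int x = 0; x < 4; ++x) {
            double sum = 0.0;
            for (int p : inbound[x]) sum += pr[p] / outdeg[p];
            next[x] = (1.0 - d) + d * sum;
        }
        pr = next;
        std::cout << "step " << step + 1 << ":";
        for (double v : pr) std::cout << " " << v;
        std::cout << "\n";
    }
    return 0;
}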

2.2 Page Ranking Algorithm Based on Link Visits (PRLV)

Duhan et al. [6] proposed the extension of PageRank method in which the
PageRank of a page is computed on the basis of no. of visits to Web page. They
pointed out that traditional PR method evenly distributes the PageRank of page
among its outgoing links, whereas it may not be always the case that all the
outgoing links of a page hold equal importance. So, they proposed a method which
assigns more rank to an outgoing link that is more visited by the user. For this
purpose, a client side agent is used to send the page visit information to server side
agent. A database of log files is maintained on the server side which store the URLs
of the visited pages, its hyperlinks, and IP addresses of users visiting these
hyperlinks. The visit weight of a hyperlink is calculated by counting the distinct IP
addresses clicking the corresponding page. The PageRank of page ‘X’ based upon
visit of link is computed by the Eq. (2)

Table 2 Calculation of PageRank by PR method

Steps  PR(P1)  PR(P2)  PR(P3)  PR(P4)
1 1.2805 2.1205 0.575 1.2805
2 1.4863 0.9422 1.052 1.8434
3 2.0147 1.0795 0.5504 1.4799
4 1.5639 1.1622 0.6088 1.6209
5 1.7000 0.9871 0.6439 1.4725

PR(X) = (1 − d) + d ∑_{Pi ∈ I(X)} [PR(Pi) × L(Pi, X) / TL(Pi, O(Pi))]    (2)

Where:
• PR(X) is PageRank of page X calculated by Eq. (1).
• I(X) is set of incoming links of page X.
• L(Pi, X) is no. of link visits from Pi to X.
• TL(Pi, O(Pi)) is total no. of user visits on all the outgoing links of page Pi.
Let us consider the same hyperlinked structure as shown in Fig. 1a, with the number of visits and the visit weights (written in brackets) shown in Fig. 1b. By taking the value of d = 0.85, the PageRank based on visit of link can easily be obtained by using Eq. (2) and the iteration method, as shown in Table 3.
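One update of Eq. (2) can be sketched as follows; since the per-link visit counts of Fig. 1b are not listed numerically in the text, the values used here are illustrative assumptions.

#include <iostream>
#include <map>
#include <vector>

// One update of PRLV (Eq. 2): each inbound page Pi passes on a share of its
// rank proportional to L(Pi, X) / TL(Pi, O(Pi)), the fraction of user visits
// on Pi's outgoing links that went to X.
int main() {
    const double d = 0.85;
    // visits[p][x] = L(p, x): user visits on the link p -> x (illustrative values).
    std::map<int, std::map<int, double>> visits = {
        {0, {{1, 30}, {3, 10}}},          // P1 -> P2, P1 -> P4
        {1, {{2, 5}, {3, 20}}},           // P2 -> P3, P2 -> P4
        {2, {{0, 25}, {1, 15}, {3, 10}}}, // P3 -> P1, P3 -> P2, P3 -> P4
        {3, {{0, 40}}},                   // P4 -> P1
    };
    std::vector<double> pr = {1.0, 1.0, 1.0, 1.0};
    std::vector<double> next(4, 1.0 - d);
    for (const auto& [p, out] : visits) {
        double total = 0.0;                       // TL(p, O(p))
        for (const auto& [x, l] : out) total += l;
        for (const auto& [x, l] : out) next[x] += d * pr[p] * l / total;
    }
    for (int x = 0; x < 4; ++x)
        std::cout << "PRLV(P" << x + 1 << ") = " << next[x] << "\n";
    return 0;
}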
By comparing the results of PR with PRLV, it is found that rank order of pages
has been changed. By using PRVOL, PRVOL(P1) > PRVOL(P2) > PRVOL
(P4) > PRVOL(P3). A critical look at the available literature indicates that although
dividing the PageRank of a page among its outgoing links based on link visit solved
the problem of finding the importance of a page within the Web, it has been
observed that a user who visits a particular page may not necessarily find the page useful. Therefore, the time spent and the actions performed, such as print and save, may be considered vital parameters while determining the relevance of a page with
respect to the user. The proposed ranking mechanism discussed in the next section
overcomes the above shortcomings by incorporating the user page access infor-
mation to the link structure information of a Web page.

Fig. 2 Proposed system



Fig. 3 DB builder: the crawler picks the URL to fetch from the URL frontier, downloads the corresponding pages from the WWW, and passes the fetched pages to the page extractor; the parsed page information is handed to the indexer, which stores the page information in the database

Table 3 Calculation of PageRank by PRLV method

Steps  PRVOL(P1)  PRVOL(P2)  PRVOL(P3)  PRVOL(P4)
1 1.141 0.560 0.256 0.569
2 0.669 0.944 0.209 0.622
3 0.708 0.392 0.250 0.822
4 0.884 0.408 0.191 0.352
5 0.476 0.464 0.193 0.333

3 Proposed Page Ranking Mechanism

An efficient search model based on user page access information is being proposed
here as shown in Fig. 2. It consists of four main components: search engine
interface, PPF calculator, query processor, and DB builder. The detailed description
of each component is given in subsequent sections.

3.1 Search Engine Interface

The user enters the query at search engine interface. It passes these query words to
query processing module and sends a signal ‘something to record’ to PPF calcu-
lator. At the end of search operation, it receives the sorted list of documents from
query processor to present back to the user. When the user clicks a page in the result
list, it sends the hit(click) information to its corresponding Web server which in turn
stores this information in server log files.

3.2 PPF Calculator

After receiving the signal ‘something to record’ from search engine interfaces, it
observes and records certain facts about the activity of user on a particular page. For
this, it assigns the page probability factor (PPF) to each page clicked by the user.
The page probability factor PPF(Pi) can be computed as per the equation given in
(3)

PPF(Pi) = CLICKwt(Pi) + TIMEwt(Pi) + ACTIONwt(Pi)    (3)

where:
• CLICKwt(Pi) denotes the importance of page Pi with respect to all the pages
clicked by user ui for query qi in the current search session.
• TIMEwt(Pi) denotes the time weight based on the time spent by user 'u' on the page Pi
• ACTIONwt(Pi) denotes the action performed on the page Pi
The computation of each of these factor used in Eq. (3) is given below.
Calculation of click weight on page Pi: When a user clicks a page, the click
weight of page P increases as if the user votes for this page [9]. For any more
clicking by the same user, the click weight of page will not be affected. To find the
importance of page P with respect to query q, the click weight is defined by Eq. (4)
given below.

CLICKwt(Pi) = C / |CLICK(q, *, u)|    (4)

where
• click(q, *, u) denotes the total no. of clicks made by user u on all the pages for
query q in current session.
• C is no. of vote for a page. It is set to 1 for clicked pages and 0 otherwise.
Let us consider a user who clicked three pages, P1, P2, and P5, out of the result pages P1, P2 … P10. The click weight of each clicked page can be computed by Eq. (4) as shown below. The CLICKwt of all other pages is set to zero as they did not get any click from the user.

CLICKwt(P1) = 1 / (1 + 1 + 1) = 0.33    (4a)

CLICKwt(P2) = 1 / (1 + 1 + 1) = 0.33    (4b)

CLICKwt(P5) = 1 / (1 + 1 + 1) = 0.33    (4c)

Calculation of time weight on page Pi: Time is also an important factor, as the more time the user spends on something, the more he is interested in it [10]. The time weight of document Pi is computed by analyzing its relevancy with respect to the document P whose view time is the maximum for query q in the current session, as given in Eq. (5). Let us consider that the user spent 3 min, 2 min, and 9 min on pages P1, P2, and P5, respectively. The time weight of pages P1, P2, and P5 can be computed by Eq. (5).

TIMEwt(Pi) = Time spent(Pi) / Highest time spent(P)    (5)

TIMEwt(P1) = 3 / 9 = 0.33    (5a)

TIMEwt(P2) = 2 / 9 = 0.22    (5b)

TIMEwt(P5) = 9 / 9 = 1    (5c)

Calculation of action weight on page Pi: The actions that a user may carry out on any Web document are listed in Table 4 along with their weights. The weight is assigned according to the relevancy of the action, where relevancy is determined based on user feedback in response to a survey. It was observed in the survey that if someone is printing the page, it has higher utility at present; saving is scored lower as the user will require it later on; bookmarking comes next; and sending comes last in the priority list as the page is used by some other user. If a user performs more than one action, then only the higher weight value is considered. For example, if the user performs printing as well as saving, then only the printing weight is assigned to the page.
Let us consider that the user takes no action on pages P1 and P2 but performs the save action on page P5. So, ACTIONwt(P5) = 0.3, ACTIONwt(P1) = 0, and ACTIONwt(P2) = 0. This PPF information related to each clicked page is updated in the search engine database. Initially, the PPF of all pages is set to zero. It is computed and updated every time a user selects the page in the result list.

Table 4 Action weight

Action  Weight
Print 0.4
Save 0.3
Bookmark 0.2
Send 0.1
No action 0
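Putting Eqs. (3)–(5) and Table 4 together, the PPF of each clicked page can be computed as sketched below; the ClickedPage structure and the session values (reusing the P1, P2, P5 example) are illustrative assumptions, not the actual implementation.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Action weights of Table 4; only the highest-weighted action counts.
double actionWeight(const std::vector<std::string>& actions) {
    static const std::map<std::string, double> w = {
        {"print", 0.4}, {"save", 0.3}, {"bookmark", 0.2}, {"send", 0.1}};
    double best = 0.0;
    for (const auto& a : actions) {
        auto it = w.find(a);
        if (it != w.end()) best = std::max(best, it->second);
    }
    return best;
}

struct ClickedPage {
    std::string url;
    double minutesSpent;
    std::vector<std::string> actions;
};

int main() {
    // Example session: three clicked pages for one query (cf. Eqs. 4a-5c).
    std::vector<ClickedPage> session = {
        {"P1", 3, {}}, {"P2", 2, {}}, {"P5", 9, {"save"}}};

    double maxTime = 0.0;
    for (const auto& p : session) maxTime = std::max(maxTime, p.minutesSpent);

    for (const auto& p : session) {
        double clickWt = 1.0 / session.size();                     // Eq. (4): C / |CLICK(q,*,u)|
        double timeWt = p.minutesSpent / maxTime;                  // Eq. (5)
        double ppf = clickWt + timeWt + actionWeight(p.actions);   // Eq. (3)
        std::cout << p.url << ": PPF = " << ppf << "\n";
    }
    return 0;
}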

3.3 DB Builder

This component is responsible for extracting the pages from www and storing their
information into search engine database. The main subcomponents are as follows:
page extractor, crawler, and indexer as shown in Fig. 3. The working of each
component is discussed in following subsections.
Crawler: It extracts the URL from the URL frontier and downloads the pages at
specified interval [11] from the different Web server. URL frontier is a queue that
contains URLs of the pages that need to be downloaded. The structure of the URL frontier is shown in Fig. 4. The downloaded pages are passed to the page extractor. Table 5 gives a description of the different fields of the URL frontier.
Page extractor: It parses the fetched page and divides it into no. of terms. All
the nonfunctional terms such as at, on, the are removed. It stores the term infor-
mation related to parsed page in Term_info table. It also extracts the link infor-
mation of page and stores it into Link_info. The structure of Term_info and
Link_info is shown in Fig. 4. The different fields of Term_info and Link_info are
described in Table 6.
Indexer: It first calculates the link weight of a page using Eq. (2) by taking the link information from Link_info and then indexes every term of the parsed page in the search engine database. The structure of the database is shown in Fig. 4. The page probability

Table 5 URL frontier description


Field Description
URL URL of the page yet to be downloaded by the crawler, e.g., [Link]
com/ domain name system
Priority The numeric value based on revisit interval of a page assigned by the crawler
Depth The numeric value indicates how deep the crawler visits the hyperlinks of a page
and downloads them
Server The web server name of the URL. For example, in the above-mentioned URL,
name the server name is [Link]

URL FRONTIER: URL | priority | depth | Server name
PAGE REPOSITORY
Link_info: URL | Doc_ID | depth | In_lnk | Hit_count | Out_lnk | Page address
Term_info: term | Doc_ID | frequency | Term position
DATABASE SCHEMA: Term | Doc_ID | frequency | Term position | Link weight | PPF

Fig. 4 Data structures used by DB builder



Table 6 Description of page repository


Field Description
Term The word extracted from page
Doc_ID A unique number or alphanumeric no. is assigned to each page such as D1, D2,
D3
Frequency No. of occurrences of term in page
Position It is a numeric value which tells the position of term in the page; e.g., graphics at
53rd position indicates that it is 53rd term of the parsed page Di
URL The URL of the page being parsed
Depth The depth of URL
In_lnks The URL of incoming link
Hit_count The no. of link visit from incoming URL to parsed page URL
Out_lnks URL of outgoing links
Page It specifies the memory location of page on the server site
address

Table 7 Organization of data in search engine database


Term Doc ID Frequency Position Link_weight PPF
Prime D3 5 4, 12, 23, 56 7 4.5
Prime D8 3 7, 45, 89 9 7
– – – – – –
Factors D3 6 5, 13, 24, 57, 79, 99 7 4.5
Minister D8 3 8, 46, 90 9 7

factor, PPF, of each new page is initially set to zero and updated by the PPF calculator whenever the page is clicked by the user in the result list. Let us consider the sample data
shown in Table 7 for understanding the organization of information in search
engine database. The information is stored as vocabulary (terms list) and postings as
shown in Fig. 5.

Fig. 5 Organization of information in search engine database



Fig. 6 Algorithm for query processor

As shown in Table 7, the term ‘Prime’ occurs at four different places: 4, 12, 23,
and 56 in document D3 in row1. The link_weight of D3 is 7, and PPF score is 4.5.
Likewise, the information about the other terms is also stored.

3.4 Query Processor

It executes the user query on search engine database and fetches the pages whose
text matches with the query terms. It calculates the overall_page_score of each
selected page by adding the precomputed Link_weight to PPF and returns the
sorted list of pages to search engine interface. The algorithm of query processor is
shown in Fig. 6.
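The scoring step can be sketched as follows; the Posting structure mirrors the database schema of Fig. 4 and Table 7, while the function names and the toy index contents are illustrative assumptions rather than the actual implementation.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One posting of the index: a page containing the term, with its precomputed
// link weight (Eq. 2) and page probability factor (Eq. 3).
struct Posting {
    std::string docId;
    int frequency;
    double linkWeight;
    double ppf;
};

int main() {
    // Toy index resembling Table 7 (term -> postings).
    std::map<std::string, std::vector<Posting>> index = {
        {"prime",   {{"D3", 5, 7.0, 4.5}, {"D8", 3, 9.0, 7.0}}},
        {"factors", {{"D3", 6, 7.0, 4.5}}},
    };

    std::vector<std::string> query = {"prime", "factors"};

    // overall_page_score = precomputed link weight + PPF of the matching page.
    std::map<std::string, double> score;
    for (const auto& term : query)
        for (const auto& p : index[term]) score[p.docId] = p.linkWeight + p.ppf;

    std::vector<std::pair<std::string, double>> ranked(score.begin(), score.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    for (const auto& r : ranked) std::cout << r.first << " -> " << r.second << "\n";
    return 0;
}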

4 Example

Let us consider the scenario depicted in Fig. 7 to have a complete understanding of


proposed page ranking mechanism.

5 Conclusion

In this paper, an efficient page ranking mechanism based on user browsing pattern
has been presented. These patterns are used to deduce the importance of a page with
respect to other candidate pages for a query. The technique is automatic in nature,

Fig. 7 Example depicting the working of the proposed mechanism: the user query entered at the search engine interface is passed to the query processor, and the sorted search results are returned to the user. The query processor fetches the matched pages (e.g., [Link]/wiki/Prime_triplet and similar pages on prime triplets) together with their PPF scores and link weights from the database. The clicked URL is taken from the query log, the time spent and the action performed on it are recorded, and the PPF calculator updates the PPF of the clicked URLs; the hit information is also sent to the corresponding Web server. The DB builder downloads pages along with their link and access information; the link log at the server site stores, for each URL, its in-link URLs, out-links, and hit counts, and the database stores, for each term, the document ID/URL, frequency, positions, link weight, and PPF

and no overhead is involved on the part of the user. The technique does not create individual user profiles; instead, collective interests are used to find the relevancy of a page. So,
optimized approach is adopted. The technique proves to provide more relevant
results as compared to regular search engine.

References

1. Sethi, S., Dixit, A.: Design of personalised search system based on user interest and query
structuring. In: Proceedings of the 9th INDIACom; INDIACom 2nd International Conference
on Computing for Sustainable Global Development, 11–13 March 2015
2. Brin, S., Page, L.: The anatomy of a large scale hypertextual web search engine. Comput.
Netw. ISDN Syst. 30(1–7), 107–117 (1998)
3. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order
to the web. In: Technical report, Stanford Digital Libraries SIDL-WP-1999-0120 (1999)

4. Xing W., Ghorbani A.: Weighted pagerank algorithm. In: Proceedings of the Second Annual
Conference on Communication Networks and Services Research (CNSR’04)0-7695-2096-0/
04 $20.00 © 2004
5. Pal, S., Talwar, V., Mitra, P.: Web mining in soft computing framework: relevance, state of
the art and future directions. IEEE Trans. Neural Netw. 13(5), 1163–1177 (2002)
6. Gyanendra, K., Duhan, N., Sharma, A.K.: Page ranking based on number of visits of web
pages. In: International Conference on Computer and Communication Technology (ICCCT)-
2011, 978-1-4577-1385-9
7. Markov, Z., Larose, D.T.: Mining the web: uncovering patterns in web content, structure, and
usage data. Wiley, New York (2007)
8. Sethi, S., Dixit, A.: A comparative study of link based pge ranking algorithm. Int. J. Adv.
Technol. Eng. Sci. (IJATES) 3(01) (2015). ISSN: 2348-7550
9. Tyagi, N. et al. Weighted page rank algorithm based on number of visits of web page. Int.
J. Soft Comput. Eng. (IJSCE) 2(3) (2012). ISSN: 2231-2307
10. Mittal, A., Sethi, S.: A novel approach to page ranking mechanism based on user interest. Int.
J. Adv. Technol. Eng. Sci. (IJATES) 3(01) (2015). ISSN: 2348-7550
11. Dixit, A., Sharma, A.: A mathematical model for crawler revisit frequency. In: IEEE 2nd
International Advanced Computing Conference (IACC), pp. 316–319 (2010)
Indexing of Semantic Web for Efficient
Question Answering System

Rosy Madaan, A. K. Sharma, Ashutosh Dixit and Poonam Bhatia

Abstract Search engine is a program that performs a search in the documents for
finding out the response to the user’s query in form of keywords. It then provides a
list of web pages comprising those keywords. Search engines cannot differentiate between valuable documents and spam. Some search engine crawlers retrieve only the document title and not the entire text of the document. The major objective of
Question Answering system is to develop techniques that not only retrieve docu-
ments, but also provide exact answers to natural language questions. Many
Question Answering systems developed are able to carry out the processing needed
for attaining higher accuracy levels. However, there is no major progress on
techniques for quickly finding exact answers. Existing Question Answering systems are unable to handle a variety of questions and reasoning-based questions. In the absence of data sources, a QA system fails to answer the query. This paper investigates a novel technique for indexing the semantic Web for an efficient Question Answering system. The proposed techniques include a manually constructed question classifier based on <Subject, Predicate, Object>, retrieval of documents specifically for Question Answering, semantic answer-type extraction, and answer extraction via a manually constructed index for every category of question.

Keywords Information retrieval · Question Answering · Semantic web

R. Madaan (&)
Department of Computer Science and Engineering, GD Goenka Univeristy, Sohna, Gurgaon,
India
e-mail: [Link]@[Link]
A. K. Sharma
Department of Computer Science and Engineering, BSAITM, Alampur, Faridabad, India
e-mail: ashokkale2@[Link]
A. Dixit · P. Bhatia
Department of Computer Engineering, YMCA University of Science and Technology,
Faridabad, India
e-mail: dixit_ashutosh@[Link]
P. Bhatia
e-mail: poonambhatia0401@[Link]


1 Introduction

Question Answering (QA) is a technology that aims at retrieving the answer to a


question written in natural language from the large collections of documents.
Indexing is a technique of formation of indexes for the fast retrieval of the infor-
mation needed by the Question Answering system to answer the query of the user.
In the question categorization module, the user selects the category of the question, which tells the system that it has to search in that particular category index; a different index is maintained for each category of question.

2 Related Work

QA systems first evolved in closed domains because of their lower complexity. Early QA systems were BASEBALL and LUNAR. The BASEBALL [1] QA system gives information about the US baseball league for one year. The LUNAR QA system gives information about the geological analysis of rocks returned by the Apollo moon missions. Both QA systems were very powerful in their own domains. LUNAR was demonstrated at a lunar science convention, and it was able to answer approximately 90% of the questions in its domain posed by people not trained on the system. The
common feature of all these systems is that they had knowledge database or
knowledge systems that were implemented by experts of the chosen domain.
SHRDLU [2] was a Question Answering system developed by Terry Winograd. It was basically developed to allow the user to ask a robot questions. It was implemented using rules of physics encoded in a computer program. The Question Answering systems developed to interface
with these expert systems produced more repeatable and valid responses to ques-
tions within an area of knowledge. The system answered questions related to the
Unix OS. It had a knowledge base of its domain, and its target is to phrase the
answer to accommodate various types of users. LILOG [2] is a closed-domain
Question Answering system and is basically a text understanding system. This
system gives tourism information in a German city. Other system also helps the
system in linguistic and computational processing.
QUALM (a story understanding system) [3] works by asking questions about simple, paragraph-length stories. The QUALM [3] system includes a question analysis module that links each question with a question type. This question type guides all further processing and retrieval of information (see Table 1).
The Kupiec (a simple WH-question model) [3] Question Answering system performs a similar function but solves simpler WH-question models to build a QA system. This QA system used the interrogative words to infer the kind of information required by the system. Table 2 lists the Question categories and Answer types.

Table 1 QUALM question categories

Question type     | Example question
Causal antecedent | How did the cloud form?
Goal orientation  | Rohan left for what reason?
Causal consequent | What happened after John left?
Disjunctive       | Was Rohan or Mary here?
Verification      | Did Rohan leave?

Table 2 Kupiec question categories

Question type | Answer type
Who/whose     | Person
What/which    | Thing, Person, Location
Where         | Location
When          | Time
How many      | Number

AnswerBus [4] is an open-domain Question Answering system based on sentence-level Web information retrieval. It accepts users' natural language questions in English, German, French, Spanish, Italian and Portuguese and provides answers in English. It can respond to users' questions within several seconds. Five search engines and directories (Google, Yahoo, WiseNut, AltaVista and Yahoo News) are used to retrieve web pages that contain answers. AnswerBus takes a user question in natural language. A simple language recognition module determines whether the question is in English or in one of the other five languages. If the question is not in English, AnswerBus sends the original question and its language information to AltaVista's translation tool, obtains the question translated into English, and the answer is then provided to the user accordingly.
The AskMSR Question Answering system [3] takes the query as input and rewrites it into a form the system can support. The rewrites generated by the system are simple string-based manipulations. AskMSR does not use a parser or part-of-speech tagger for query reformulation, but it uses a lexicon for a small percentage of rewrites in order to determine the possible parts of speech of a word as well as its semantic variations. The rewrite rules and associated weights were created manually for the current system, and it may be possible to learn query-to-answer reformulations and their weights [4–6].
Query reformulation in AskMSR is done as follows: given a question, the system generates a number of weighted rewrite strings which are likely substrings of declarative answers to the question. For example, "When was C discovered?" is rewritten as "C was discovered". The collection of documents is then searched for such patterns. Since many of these string rewrites will result in no matching documents, more general rewrites are also produced that have a much greater chance of finding matches. For each query, a rewrite is also generated that does not contain any stop word [7, 8].

3 Proposed Architecture

This Question Answering system is developed with the objective of answering questions posed by a user in natural language [9, 10]. The architecture of the system is shown in Fig. 1. The following are the functional modules of the proposed system:
• Query interface,
• Query processing,
• Document retrieval processing,
• Result processing.
The query interface is a dialogue with the Question Answering system into which the user's natural language query can be entered and through which the output is given to the user. The knowledge base of this system consists of a stop-word list, a synonym knowledge base, a Question Answer database (in the form of documents), a question term index (index words), and a technology information knowledge database.

3.1 Query Interface Module

The query interface in the Question Answering system is the module responsible for executing the query given by the user. It receives input queries in the form of plain text, parses and optimizes them, and completes the execution of the query by applying different techniques to retrieve the answer posed by the user [11, 12]. The user query is therefore the query posed by the user in plain text, which is treated as a string of words.

3.2 Query Processing Module

Query processing is done in the manner as shown in Fig. 2.

Fig. 1 Proposed architecture (query interface, question processing, question categorization, indexing module, answer reformulation, documents)



Fig. 2 Query processing framework (query identification, tokenization, stop-word removal, synonym matching, semantic type mapping, and category indexes: who, what, when, how)

a. Identification of the query: This component identifies the type of the query entered by the user.
b. Tokenization of the query: The query is tokenized into a bag (string) of words.
c. Stop-word removal: Removing stop words is necessary to improve the efficiency of the query; adjusting the stop-word list to the given task can significantly improve the results. A minimal sketch of these steps is given below.
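To make the flow concrete, the following minimal Python sketch illustrates steps (a)-(c). The category list and stop-word set are illustrative assumptions, not the actual resources used by the proposed system.

```python
# Minimal sketch of steps (a)-(c). CATEGORIES and STOP_WORDS are
# illustrative assumptions, not the system's actual resources.
CATEGORIES = ("who", "what", "when", "where", "why", "how")
STOP_WORDS = {"is", "the", "of", "a", "an", "to", "in"}

def process_query(query: str):
    tokens = query.lower().rstrip("?").split()                      # (b) tokenize into a bag of words
    category = next((t for t in tokens if t in CATEGORIES), None)   # (a) identify the query type
    terms = [t for t in tokens if t not in STOP_WORDS]              # (c) remove stop words
    return category, terms

print(process_query("Who is the founder of C?"))
# -> ('who', ['who', 'founder', 'c'])
```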

3.3 Question Categorization Module

In this module, the user has to select the category of the Question, i.e. who, what, when, so that the system can identify the expected answer to the user query [13].

3.4 Question Analysis Module

Question analysis (see Fig. 3) is also referred to as finding the "Question Focus". Classifying the question and knowing its type are not enough for finding answers to all questions. The "what" questions in particular can be quite ambiguous in terms of the information asked for. Reducing this ambiguity by analysing the Question and identifying its focus is therefore necessary [14, 15]. For example, if the user enters the query "Who is the Founder of C", the focus must be on "founder" and "C".

3.5 Question Categorization

Question categorization improves the performance of answer retrieval by the Question Answering system. If the category selected by the user is "who", then the expected answer can be the name of a person or an organization. Table 3 lists sample questions along with their type.

Fig. 3 Question analysis

Table 3 Question categorization

Question                 | Type
Who discover C           | who
Who is the founder of C  | who
Who found C              | who
Who develop C            | who
Who invented C           | who
What is the history of C | what
History of C             | –

3.6 Semantic Mapping in Question Answering System

Dealing with difficult questions involves identifying the semantic mapping of the query. Firstly, we need to recognize the expected answer type, which is a rich semantic structure. Secondly, we need to identify the Question class and patterns. Thirdly, we need to model the scenario, i.e. what sort of information is needed by the query.
Three concepts are needed to identify the answer type:
a. Question class.
b. Relationship between the subject and object, i.e. <Subject, Predicate, Object>.
c. Semantic mapping of the Question.
Let us take a question: What is the description of C? For this question, knowing only the category of the question makes it difficult to judge the answer to the query; we have to create a semantic relation between the query and the answer type. Query formulation is done on the basis of <Subject, Predicate, Object>: the "Subject" tells what the paragraph about the "Object" (here, C) contains, and the predicate ("description") judges the relationship between the subject and the object [16].
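As an illustration only, the snippet below shows one way a question could be mapped to a <Subject, Predicate, Object> triple; the extraction rule is a simplified assumption tailored to the example question, not the paper's actual mapping procedure.

```python
# Illustration only: mapping a question to a <Subject, Predicate, Object>
# triple. The extraction rule below is a simplified assumption that works
# for the example question, not the system's actual mapping procedure.
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

def to_triple(question: str) -> Triple:
    words = question.rstrip("?").split()
    obj = words[-1]          # the entity asked about, e.g. "C"
    predicate = words[-3]    # the relation word, e.g. "description"
    return Triple(subject="?answer_paragraph", predicate=predicate, object=obj)

print(to_triple("What is the description of C?"))
# -> Triple(subject='?answer_paragraph', predicate='description', object='C')
```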

Table 4 General index

Term            | Document ID
Founder         | Doc1
C               | Doc1, Doc2
Dennis Ritchie  | Doc1
General purpose | Doc2

3.7 Indexing Module

Indexing is a technique of building indexes for fast retrieval of the information needed by the Question Answering system to answer the user's query [17]. In the question categorization module, the user selects the category of the Question, which tells the system to search in that particular category index; a different index is maintained for each category of Question. The indexing module helps the QA system locate the terms together with the document id for fast processing. The indexing technique used in the QA system manually associates each term with its document id. With the indexing module, the QA system identifies the documents that match the query terms in order to find the candidate answers. After the matched documents are identified, the result processing module returns the result to the user interface. Table 4 shows a general index; a small sketch of such an index follows.
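A minimal sketch of such a manually constructed index is given below. The sample entries follow the general index of Table 4; the data structure and helper function are illustrative assumptions, and a category-specific index (who-index, what-index, ...) would be built in exactly the same way.

```python
# Minimal sketch of a term -> document-id index, following Table 4.
from collections import defaultdict

def build_index(entries):
    # term -> set of document ids
    index = defaultdict(set)
    for term, doc_id in entries:
        index[term.lower()].add(doc_id)
    return index

# The general index of Table 4; category-specific indexes would be built
# the same way from category-specific documents.
general_index = build_index([
    ("Founder", "Doc1"),
    ("C", "Doc1"), ("C", "Doc2"),
    ("Dennis Ritchie", "Doc1"),
    ("General purpose", "Doc2"),
])

print(sorted(general_index["c"]))   # -> ['Doc1', 'Doc2']
```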

3.8 Result Processing Module

After the documents are indexed, the documents containing the matched query answers are passed to the query processor, and finally the output of the query is given to the user. Figure 4 shows how the result processing module works.
The proposed algorithm is shown in Fig. 5.
Algorithm for query processing is shown in Fig. 6.

Fig. 4 Result processing module (indexed documents, finding matched documents, result processing, processed data)

Fig. 5 Proposed algorithm of result processing module

Step 1: Take the query as input as entered by the user.
Step 2: Identify the category of the question.
Step 3: a) If category = "Who" then check with the who-index.
        b) If category = "What" then check with the what-index.
        c) If category = "When" then check with the when-index.
        d) If category = "Why" then check with the why-index.
        e) If category = "How" then check with the how-index.
        f) If category = "Where" then check with the where-index.
Step 4: If category = "not found" then check with the main-index.
Step 5: Return the candidate answers generated by matching the documents.
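The dispatch logic of Fig. 5 can be sketched as follows; the index contents are placeholders, and the function name is an assumption used only for illustration.

```python
# Sketch of the Fig. 5 dispatch: choose the index matching the question
# category, fall back to the main index when no category is found, and
# return the candidate documents. Index contents are placeholders.
category_indexes = {
    "who": {"founder": {"Doc1"}, "c": {"Doc1"}},
    "what": {"c": {"Doc2"}, "general purpose": {"Doc2"}},
    "when": {}, "why": {}, "how": {}, "where": {},
}
main_index = {"c": {"Doc1", "Doc2"}}

def candidate_documents(category, terms):
    index = category_indexes.get(category, main_index)   # Steps 3-4
    docs = set()
    for term in terms:                                    # Step 5
        docs |= index.get(term, set())
    return docs

print(candidate_documents("who", ["founder", "c"]))   # -> {'Doc1'}
```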

Step 1: Query = Input.
Step 2: Split the query into tokens.
Step 3: Check for stop words; a) if found, remove them.
Step 4: Set the terms as tokens and return.
Step 5: Identify the Subject, Predicate, Object for the query to find the relationship between the subject and object.
Step 6: If a term in the query does not match an index term, then find a synonym of the query term and use the synonym in place of the query term for the index lookup.
Step 7: Identify the number of documents that match the query terms.
Step 8: Process the result for the matched documents.
Step 9: The processed result is given to the query interface.

Fig. 6 Algorithm: query processing

4 Experimental Evaluation

The approaches taken have given satisfactory results to a great extent. There are many types of questions for which the answers are relevant and accurate. There is scope for improvement in many areas, but the goals of the proposed work have been achieved. For a question such as "Who is the founder of C?", the name of a person or an organization is expected because the category of the Question is who; the answer will contain the name of the person and some description about the person. A snapshot of the proposed system is shown in Fig. 7.
This Question Answering system is able to answer a variety of questions accurately. As can be seen in the snapshots, the answers are formatted according to the Question requirements. This section calculates the relevancy of various answers. It is worth noting that the answers follow a true positive orientation, i.e. all the relevant results are returned; in some cases, other relevant information that might be useful is also returned. Performance is calculated on the basis of the relevant results given by the system. The formula for calculating the performance is:

Fig. 7 Snapshot of the proposed system



Performance % = (Answers processed by the system) / (Expected relevant answers) × 100

Let us take the example questions in Table 5.
Apart from these, consider questions like:
a. Question: How to become self-educated?
Answers expected are:
1. A stepwise method to perform the task.
The accuracy of answers for this kind of question depends entirely on the user's experience: if the results meet the user's expectation, then it is 100%, else it cannot be computed.
b. Question: Convert metre to centimetre?
Answers expected are:
1. A conversion technique; this comes under the category of "how".
QA System Accuracy = 1/1*100 = 100%
This method relies on the current conversion data, so accuracy is 100%. See Fig. 8 for the plot of system accuracy.

Table 5 Example questions, expected answer type(s) and QA system accuracy

Question                            | Expected answer type(s)                                                                                           | QA system accuracy
Who is the founder of C?            | 1. Name of the person  2. Name of the organization                                                               | 2/2*100 = 100%
What is the description of C?       | 1. Description about the topic                                                                                    | 1/1*100 = 100%
Where is Qutab Minar situated?      | 1. Location (i.e. city, state, country)  2. Location compared to other locations                                  | 2/2*100 = 100%
When did India get independence?    | 1. Date and the place  2. Some description about the past  3. Information about the conditions when the event took place | 2/3*100 = 66.6%
Which holidays fall on December 31? | 1. All the holidays which fall on that date                                                                       | 1/1*100 = 100%
When was Mahatma Gandhi born?       | 1. Birth date                                                                                                     | 1/1*100 = 100%
What events happened on 26 January? | 1. All the events that happened on that date                                                                      | 1/1*100 = 100%

Fig. 8 Bar graph showing system accuracy of various QA categories (who, what, when, why, where, how)

5 Conclusions and Future Scope

This Question Answering system surpasses its counterparts in giving relevant answers to many types of questions. By selecting the category of Question for the query, index retrieval is done at a fast rate. The results and the views of the users bring out these facts as its strong features. Information about a person, word, location, birth date, death date, holidays, events, etc., is accurate and relevant to the need of the particular user, and according to the survey, many users are satisfied with it. The algorithm designed for retrieval is fast in most cases and is getting faster day by day. The system's main aim was fast index retrieval so as to improve the complexity of answering the Question. Mapping the query into <S,P,O> and creating the relationship between the subject and object improve the efficiency of the Question Answering system. By first selecting the category of the question, the QA system is able to know what type of answer is expected by the user. These observations lead to the conclusion that the proposed Question Answering system is fast, efficient, and relevant to a great extent. Further work can be done to improve the quality of the results and to handle reasoning-type questions.

References

1. Question Answering System Wikipedia


2. Hirschman, L., Gaizauskas, R.: Natural Language Question Answering System (2001)
3. Allam, A.M.N., Haggag, M.H.: The question answering systems: a survey. Int. J. Res. Rev.
Inf. Sci. (IJRRIS) (2012)
4. Zheng, Z.: AnswerBus question answering system. School of Information, University of
Michigan (2001)
5. Greengrass, E.: Information Retrieval: A Survey, DOD Technical Report TR-R52-008-001
(2000)
6. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30, 107–117 (1998)
7. John, B., Claire, C., Vinay, C., Robert, G., Sanda, H., David, I.: Issues, Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A) (2003)
8. Narayana, S., Harabagiu, S.: Question Answering based on Semantic Structure (2005)
9. Andrenucci, A., Sneiders, E.: Automated Question Answering: Review of the Main
Approaches (2005)
10. Soergel, D.: Indexing and retrieval performance. In: Srihari, R., Li, W. (eds.) A Question
Answering System Supported by Information Extraction (1994)

11. Hammo, B., Abu-Salem, H., Lytinen, S.: QARAB: A Question Answering System to Support
the Arabic Language (2002)
12. Mudgal, R., Madaan, R., Sharma, A.K., Dixit, A.: A Novel Architecture for Question Classification and Indexing Scheme for Efficient Question Answering (2013)
13. Moldovan, D., Paşca, M., Harabagiu, S., Surdeanu, M.: Performance Issues and Error
Analysis in an Open-Domain Question Answering System (April 2003)
14. Lim, N.R., Saint-Dizier, P.: Some Challenges in the Design of Comparative and Evaluative
Question Answering Systems (2010)
15. Suresh kumar, G., Zayaraz, G.: Concept Relation Extraction Using Naïve Bayes Classifier for
Ontology-Based Question Answering Systems (13 Mar 2014)
16. Kapri, D., Madaan, R., Sharma, A.K., Dixit, A.: A Novel Architecture for Relevant Blog Page Identification (2013)
17. Balahur, A., Boldrini, E., Montoyo, A., Martínez-Barco, P.: A Comparative Study of Open
Domain and Opinion Question Answering Systems for Factual and Opinionated Queries
(2009)
A Sprint Point Based Tool for Agile
Estimation

Rashmi Popli and Naresh Chauhan

Abstract In an agile environment, the software is developed by self-organizing and cross-functional teams. Agile promotes ad hoc programming, iterative and incremental development, and a time-boxed delivery approach. Agile is always flexible to changes. Estimation approaches in agile are very different from traditional ones. However, research involving estimation in agile methodologies is considerably less advanced. This paper focuses on the estimation phase of agile software development, which probably ranks as the crucial first step. Poor decisions related to the estimation activity can cause software failures. To support estimation in an agile environment, some delay-related factors that can delay the release date of the project are proposed, and a new Sprint-point based estimation tool (SPBE) is designed and developed in Excel. The proposed tool is based on the Sprint-point based Estimation Framework and places major emphasis on accurate estimates of effort, cost, and release date by constructing detailed requirements as accurately as possible.

Keywords Agile software development · Agile alliance · Sprint-point · Planning poker · Agile estimation

1 Introduction

Agile software development (ASD) methodology is presently considered the main methodology for software development in organizations [1]. The word "agile" means flexible and responsive. ASD has the ability to survive in a rapidly changing environment.

R. Popli (&) · N. Chauhan


Department of Computer Engineering, YMCA University of Science and Technology,
Faridabad, India
e-mail: rashmipopli@[Link]; rashmimukhija@[Link]
N. Chauhan
e-mail: nareshchauhan19@[Link]


ASD is a lightweight approach to most aspects of software development, in contrast to traditional methods of producing applications.

1.1 Estimation

Estimation of the size, effort, cost, and duration of an agile software project is a complicated task. Accurate estimates are critical for software developers and management. Ignoring proper estimation approaches causes dangerous effects such as exceeding the project budget, late delivery, poor quality, and not building the right product [2, 3]. Popular agile estimation methods are expert opinion, planning poker, analogy-based estimation, the top-down approach, and the bottom-up approach [1]. However, these methods are not very efficient: in expert opinion the result is always based on the observation and experience of the expert, and analogy-based methods depend on historical data from previous projects. In the absence of historical data, or when the project is new, it becomes difficult to estimate with the existing agile estimation methods. Therefore, there is a need for an estimation tool that calculates estimates effectively in an agile environment.

2 Sprint-point based Estimation Framework in SPBE Tool

2.1 Sprint Estimation

When planning the first sprint, at least 80% of the backlog items are estimated in order to build a reasonable project map. These backlog items consist of user stories grouped into sprints, and estimation of user stories is done using story points. When a software developer estimates that a given piece of work can be done within 10 h, it never means that the work will be completed in 10 h, because no one can sit in one place for the whole day and there are a number of factors that can affect story points and hence decrease the velocity. Estimating cost and time is therefore a big challenge [4].
To resolve this problem, the concept of the Sprint-point is proposed. A Sprint-point basically captures the effective story points: it is an evaluation or estimation unit of the user story used instead of the story point. By using Sprint-points, more accurate estimates can be achieved. Thus, the unit of effort is the Sprint Point (SP), which is the amount of effort completed in a unit of time.
In the proposed Sprint-point based Estimation Framework, requirements are first gathered from the client in the form of user stories. After requirement gathering, a user story-based prioritization algorithm is applied to prioritize the user stories. Consequently, the story points in each user story are calculated, and the uncertainty in story points is removed with the help of the three types of story points proposed. Then, these story points are converted to sprint-points based on the proposed agile estimation factors. Afterwards, the sprint-point based estimation algorithm is applied to calculate the cost, effort, and time of a software project.
If regression testing is required in agile, then defect data is gathered from similar kinds of projects and is used to calculate the rework effort and rework cost of the project. Finally, the sprint-point based estimation algorithm using regression testing is applied to calculate the total cost, effort, and duration of the project.
This Sprint-point based Estimation Framework, shown in Fig. 1, performs estimation in scrum using the steps below:
Step 1: User stories are prioritized by using the User story-Based Prioritization Algorithm (discussed in Sect. 2.2).
Step 2: Uncertainty in story points is removed (discussed in Sect. 2.3).
Step 3: Story points are converted into sprint-points by considering agile delay factors. Delay factors are proposed that affect the user stories and thus the cost, effort, and duration of a software project. Sprint-point based estimation is then done using the proposed delay-related factors (discussed in Sect. 2.4).

2.2 User Story-Based Prioritization Algorithm

In the agile software development method, the requirements from the customer are taken in the form of user stories. The proposed prioritization rule is: "Prioritize the user stories such that the user stories with the highest ratio of importance to actual effort are prioritized first, skipping user stories that are 'too big' for the current release" [5].

Fig. 1 User stories
Consider the ratio of the importance desired by the client to the actual effort required from the project team (I/E), as in Formula (1):

Prioritization of user story = Importance of user story / Effort per user story    (1)
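A small sketch of this prioritization rule is shown below. The sample stories reuse values from the case study in Sect. 2.6, and the "too big" threshold is an assumed parameter.

```python
# Sketch of Formula (1): sort user stories by importance-to-effort ratio,
# skipping stories that are "too big"; max_effort is an assumed threshold.
def prioritize(stories, max_effort=15):
    feasible = [s for s in stories if s["effort"] <= max_effort]   # skip "too big" stories
    return sorted(feasible, key=lambda s: s["importance"] / s["effort"], reverse=True)

stories = [
    {"id": 1, "importance": 10, "effort": 7},   # I/E = 1.42
    {"id": 7, "importance": 6,  "effort": 8},   # I/E = 0.75
    {"id": 3, "importance": 1,  "effort": 16},  # too big, skipped
]
print([s["id"] for s in prioritize(stories)])   # -> [1, 7]
```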

2.3 Managing Uncertainty in Estimating User Stories

The technique of managing uncertainty in estimation reduces uncertainty by following the approach for reducing uncertainty in estimating user stories described in [6].

2.4 Sprint-Point Based Estimation Using Delay-Related Factors

Looking toward the various unaddressed problems of estimation, a new Sprint-point based Estimation Framework in agile has been proposed that helps to estimate cost, time, and effort accurately [7, 8].
In this work, some delay-related factors have been proposed that can affect the estimation of the cost and effort of the project. These factors are as below:
(1) Complexity
(2) Security
(3) Technical Ability
(4) Expected Ambiguity in Detail
(5) Expected Changes in Environment
(6) Team Members Responsibilities Outside the Project

Algorithm 1 Sprint-point based Estimation Framework Algorithm

• Identify the delay-related factors which affect the effort in the scrum environment, where
  P = {p1, p2, ..., pi, ..., pn}, 1 < i <= n
• Identify the unadjusted value of story points (UVSP) related to each level Li (1 <= i <= 3) in the scrum environment, where
  UVSP = p1 + p2 + p3 + ... + pn
• Compute the estimated story points (ESP) as
  ESP = BSP + 0.1 (UVSP)
  where BSP is the baseline story points
• Compute the velocity from the first iteration as
  Velocity = ESP / story points completed in one iteration
• Compute the new velocity by considering the delay-related factors:
  V = velocity × d
• Estimated Development Time (EDT) = ESP / Velocity (in days)
• Release Date = Start date + EDT
In the proposed algorithm, firstly the delay-related factors in agile scrum envi-
ronment are calculated on the basis of which the UVSP is determined. With the help
of UVSP, the story points are estimated. These estimated story points are then used
in the calculation of the development time and the release date of the project.
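Under the assumptions stated in the algorithm (six delay-related factors, an average velocity factor AvgVF, and an 8-hour working day), the core computation can be sketched as follows; the sample call reuses the values of the case study in Sect. 2.7.

```python
# A sketch of Algorithm 1 under its stated assumptions: UVSP is the sum of
# the delay-factor values, ESP = BSP + 0.1 * UVSP, and the velocity is
# decelerated by the average velocity factor (AvgVF). The 8-hour working
# day and the sample values reuse the case study of Sect. 2.7; the release
# date then follows as Start date + EDT.
def sprint_point_estimate(bsp, factor_values, avg_vf, initial_velocity, hours_per_day=8):
    uvsp = sum(factor_values)             # unadjusted value of story points (UVSP)
    esp = bsp + 0.1 * uvsp                # estimated story points (ESP)
    dv = initial_velocity * avg_vf        # decelerated velocity (SP/day)
    edt_hours = esp / dv * hours_per_day  # estimated development time (EDT)
    return esp, dv, edt_hours

print(sprint_point_estimate(bsp=300, factor_values=[6] * 6,
                            avg_vf=0.95667, initial_velocity=5))
# -> approximately (303.6, 4.783, 507.8), consistent with the ESP, DV and EDT rows of Table 3
```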

2.5 Sprint-Point Based Estimation Tool

As agile projects are of small duration, the team does not have much time to apply mathematical algorithms by hand. To resolve this issue, a new sprint-point based estimation tool (SPBE) is designed and developed in Excel to automate the Sprint-point based Estimation Framework. The proposed SPBE tool places major emphasis on accurate estimates of effort, cost, and release date by constructing detailed requirements as accurately as possible. This tool is used as a vehicle to validate the feasibility of the project. The proposed tool is a set of individual spreadsheets with data calculated for each team separately. The estimation tool is created to provide more accuracy in velocity calculations, as well as better visibility through burn-down charts at all stages, including planning, tracking, and forecasting. The proposed estimation tool first decides the priority sequence of user stories, which dictates the implementation order. The priority of a user story is decided based on the importance of the user story to the client and the effort of the scrum team. After prioritization, the product backlog is prepared, which is the most important artifact for gathering the data. After selecting a subset of the user stories, the sprint backlog is prepared and the period for the next iteration is decided.

Table 1 Contents of SPBE tool

S. No. | Spreadsheet name    | Description
1.     | Release summary     | This spreadsheet contains the information about the overall planned and realized size of each release
2.     | Product backlog     | This spreadsheet lists all the user stories in prioritized order
3.     | ESP-product backlog | This spreadsheet lists all the user stories and the story points in each user story
4.     | TSP-product backlog | This spreadsheet lists all the user stories and the sprint-points in each user story
5.     | Estimation summary  | This spreadsheet calculates the total estimated effort, cost, and time for the release
6.     | Sprint backlog      | This spreadsheet is a list of tasks identified by the scrum team to be completed during the particular sprint
7.     | Sprint summary      | This spreadsheet contains information like the start date and end date of the sprint
8.     | Defect              | This spreadsheet describes the summary of defects, bug status, bug assignee, bug reporter, and the date of bug creation
9.     | Worklog             | This spreadsheet lists the various resources involved in the project and the type of work allocated to them
10.    | Metric analysis     | This spreadsheet shows the various metrics like rework effort, defect density ratio, and effort variance

2.5.1 Contents of SPBE Tool

The SPBE tool contains the following components, with a separate spreadsheet for each, as listed in Table 1: release summary, capacity management, product backlog, sprint backlog, sprint summary, defect, worklog, and metric analysis. The backlog sheet contains all the user stories. The sprint summary sheet contains information about the sprint such as the release date and start date.

2.6 Case Study

All the proposed approaches have been numerically analyzed on a case study named Enable Quiz. The user stories of the case study are listed in Table 2. As agile projects are of small duration, the team does not have much time to apply mathematical algorithms for the estimation of cost, effort, and time. To resolve this problem, a new Sprint-point based estimation tool (SPBE) has been designed and developed to automate the Sprint-point based Estimation Framework.

Table 2 User stories of case study

Story                                                                                                                   | Importance (I) | Effort (E) | I/E
1. As a manager, I want to browse my existing quizzes                                                                  | 10 | 7  | 10/7 = 1.42
2. As a manager, I can make sure that I am subscribed to all the necessary topics for my skills audit                  | 4  | 3  | 4/3 = 1.33
3. As a manager, I can add additional technical topics to my quizzes                                                   | 1  | 16 | 1/16 = 0.0625
4. As a manager, I want to create a custom quiz bank                                                                   | 8  | 8  | 8/8 = 1
5. As a manager, I want to create a quiz so I can use it with my staff                                                 | 5  | 4  | 5/4 = 1.25
6. As a manager, I want to create a list of students from an Excel file so I can invite them to take the quiz          | 7  | 5  | 7/5 = 1.4
7. As a manager, I want to create a list of students online                                                            | 6  | 8  | 6/8 = 0.75
8. As a manager, I want to invite a set of students                                                                    | 3  | 8  | 3/8 = 0.375
9. As a manager, I want to see which students have completed the quiz                                                  | 9  | 6  | 9/6 = 1.5
10. As a manager, I want to see how the students scored on the test so I can put in place a skills improvement program | 2  | 16 | 2/16 = 0.125

The proposed SPBE tool places major emphasis on accurate estimates of effort, cost, and release date by constructing detailed requirements as accurately as possible [9–11]. This tool may be used as a vehicle to validate the feasibility of the project.

2.7 Results

By using the delay-related factors, the decelerated velocity is calculated, on the basis of which effort and cost are calculated. For simplicity, only 10 user stories are taken. The project start date is assumed as January 1, 2014, and the initial velocity is 5 SP/day. The delay-related factors are taken at the medium level. Using the Sprint-point based estimation tool, the results shown in Table 3 are calculated.
Snapshots of the results in the SPBE tool are shown in Figs. 1, 2 and 3.

Table 3 Results

Unadjusted value (UV) = 6                       | All six factors are at medium level, so UVSP = 6*6 = 36
Total user stories                              | 10
BSP                                             | 300
Project start date                              | 1st January, 2014
Estimated story points (ESP) = BSP + 0.1 (UVSP) | 300 + 0.1 (36) = 303.6
Initial velocity                                | 5 SP/Day
AvgVF = average of the VF of all 6 factors      | 0.95667
Decelerated velocity (DV) = V * AvgVF           | 5 * 0.95667 = 4.78330 SP/Day
Estimated development time (EDT) = ESP/DV       | 303.6 * 8/4.78330 = 507.76 h
Project end date                                | 14 Jan 2014

Fig. 2 Capacity planning for release



Fig. 3 Prioritized product backlog

3 Conclusion

The main focus of this paper is to propose a new sprint-point based estimation tool which improves the accuracy of release planning and monitoring. The estimation tool is created to provide more accuracy in velocity calculations, as well as better visibility through burn-down charts at all stages, including planning, tracking, and forecasting. This tool is used as a vehicle to validate the feasibility of the project. The approach developed is simple and easy to understand and can be effectively used for release date calculation in an agile environment. With this method, the release date of small and medium-sized projects can be calculated efficiently.

References

1. Cockburn, A.: Agile Software Development, Pearson Education. Asia Low Price Edition
(2007)
2. Stober, T., Hansmann, U.: Agile Software Development Best Practices for Large Software Development Projects. Springer Publishing, New York (2009)
3. Awad, M.A.: A comparison between agile and traditional software development methodolo-
gies. Unpublished doctoral dissertation, The University of Western Australia, Australia (2005)
4. Maurer, F., Martel, S.: Extreme programming: rapid development for web-based applications. IEEE Internet Comput. 6(1), 86–91 (2002)
5. Popli, R., Chauhan, N.: Prioritizing user stories in agile environment. In: International
Conference on Issues and Challenges in Intelligent Computing Techniques, Ghaziabad, India
(2014)

6. Popli, R., Chauhan, N.: Managing uncertainty of story-points in agile software. In: International Conference on Computing for Sustainable Global Development, BVICAM, Delhi (2015)
7. Popli, R., Chauhan, N.: Sprint-point based estimation in scrum. In: International Conference
on Information Systems and Computer Networks, GLA University Mathura (2013)
8. Popli, R., Chauhan, N.: Impact of key factors on agile estimation. In: International Conference
on Research and Development Prospects on Engineering and Technology (2013)
9. Cohn, M.: Agile Estimating and Planning. Copyright Addison-Wesley (2005)
10. Popli, R., Chauhan, N.: An agile software estimation technique based on regression testing
efforts. In: 13th Annual International Software Testing Conference in India, Bangalore, India
(2013)
11. Popli, R., Chauhan, N.: Management of time uncertainty in agile environment. Int. J. Softw.
Eng. and Applications 4(4) (2014)
Improving Search Results Based
on Users’ Browsing Behavior Using
Apriori Algorithm

Deepika, Shilpa Juneja and Ashutosh Dixit

Abstract The World Wide Web (WWW) is decentralized, dynamic, and diverse, and it is growing exponentially in size. To improve search results, various ranking methods are being used. Due to the vast information on the Web, there is a need to build an intelligent technique that automatically evaluates Web pages that are of user interest. In this paper, the interest of a user in a particular Web page is estimated from his browsing behavior without incurring additional time and effort from the user. The approach can also adapt to changes in the user's interests over time. A page ranking mechanism is proposed which takes the user's actions into account. For this, a Web browser has been developed to store the user's behavior. The Apriori algorithm is applied to the data collected by the Web browser, which yields the most frequent actions out of all actions. A calculated confidence value is then used to compute the weight of the Web page. The higher the weight, the higher the rank.


Keywords World Wide Web · Apriori algorithm · Browsing behavior · Actions · Web browser · PageRank

1 Introduction

With the rapid increase in information over the Web, people are now more interested in and inclined toward the Internet to get their data. Each user has his own interests, and accordingly his expectations from a search engine vary. Search engines play an important role in getting relevant information.
Deepika (&) · S. Juneja · A. Dixit


Department of Computer Engineering, YMCA University of Science and Technology,
Faridabad, India
e-mail: deepikapunj@[Link]
S. Juneja
e-mail: shilpajuneja1189@[Link]
A. Dixit
e-mail: dixit_ashutosh@[Link]


Search engines use various ranking methods such as HITS and PageRank, but these ranking methods do not consider users' browsing behavior on the Web. In this paper, a PageRank mechanism is devised which considers the user's browsing behavior to provide relevant pages on the Web. Users perform various actions while browsing; these actions include clicking, scrolling, opening a URL, searching text, refreshing, etc., which can be used to perform automatic evaluation of a Web page and hence to improve search results. The actions are stored in a database, and the Apriori algorithm is applied to these stored actions to calculate the weight of a particular Web page. The higher the weight, the higher the rank of that page.

1.1 Introduction to Apriori Algorithm

The Apriori algorithm [1] is used for mining frequent itemsets to learn association rules. The algorithm is designed to operate on large databases containing transactions, e.g., the collections of items purchased by customers. The whole point of the algorithm is to extract useful information from a large amount of data. This is achieved by finding rules which satisfy both a minimum support threshold and a minimum confidence threshold.
Support and confidence can be defined as below:
– The support count of an itemset is the number of transactions that contain that itemset.
– The confidence value is the measure of certainty associated with a discovered pattern.
Formally, the working of the Apriori algorithm can be described by the following two steps (a small sketch is given after this list):
i. Join Step
– Find the frequent itemsets, i.e., items whose occurrence in the database is greater than or equal to the minimum support threshold;
– Iteratively find frequent itemsets from 1 to k for k-itemsets.
ii. Prune Step
– The results are pruned to find the frequent itemsets.
– Generate association rules from these frequent itemsets which satisfy the minimum support and minimum confidence thresholds.
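A compact, illustrative Apriori sketch is given below; it is not the authors' implementation, and the toy browsing-action transactions are assumed data.

```python
# A compact, illustrative Apriori sketch (not the authors' implementation):
# candidate itemsets are generated level by level (join step) and only those
# meeting the minimum support count are kept (prune step).
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [set(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}   # prune step
        frequent.update(level)
        candidates = {a | b for a, b in combinations(level, 2)          # join step
                      if len(a | b) == len(a) + 1}
    return frequent

# Toy browsing-action "transactions" (assumed data)
freq = apriori([{"print", "save"}, {"print", "scroll"}, {"print", "save", "scroll"}],
               min_support=2)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(set(itemset), count)
```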

2 Related Work

In order to get relevant results, a search engine has to adapt its searching pattern. Crawlers are the modules of a search engine responsible for gathering Web pages from the WWW. There are many design issues related to the design of a crawler [2]. Crawlers search on the basis of the content or the structure of a Web page [3]. After getting Web pages from the crawler, it is the responsibility of the search engine to show them to the user according to his interest. To show relevant results, the user's interest should be taken into consideration, and the user's browsing behavior should be involved in producing the results. Much work has been done in this area; some of it is discussed below:
Ying [4] classified the users’ browsing behavior into three categories:
(a) Physical Behavior like eye rotation, heart rate changes.
(b) Significant behavior like save page, print page, open.
(c) Indirect behaviors like browsing time, mouse–keyboard operation.
According to this paper, indirect behaviors are most important in analyzing the
user interest.
Morita and Shinoda [5], Konstan [6], and Claypool et al. [7] conclude that users spend more time on pages that are of interest to them: the longer the user spends time on a page, the more interested he is in that page.
On the contrary, Weinreich et al. [8] found that in nearly 50% of the cases users spend more time deciding whether to move to the next page than reading the content. This showed that time spent alone cannot be the decisive action when considering users' interest.
Goecks and Shavlik [9] proposed an approach for an intelligent Web browser that is able to learn a user's interests without the need for explicitly rating pages. They measured mouse movement and scrolling activity in addition to user browsing activity (e.g., navigation history). It showed somewhat better results compared to the previous work.
Xing et al. [10] suggested that some actions like scrolling and mouse clicking should be included in the total browsing time. They worked on user browsing history and analyzed the documents that users had visited. On analysis, they concluded that browsing time and printing are important actions for showing users' interest.
Tan et al. [11] focused on user behavior to optimize the structure of the network and the Web site server configuration. By analyzing user browsing behavior, they found what type of information should be broadcast instead of all of it, thus improving Internet usage and reducing Internet overhead. In this work, users' browsing behavior was used for utilizing network resources in an efficient manner.
Yang et al. [12] proposed personalized teaching software based on students' interests. Students' interests were calculated by analyzing their browsing behavior. By knowing their interests, they developed software that caters to the needs of students in an efficient way.
From the above literature survey, it may be concluded that existing systems do not give relevant results in terms of users' interest. Much work has been done in this area, and many researchers have worked on users' interest by analyzing their browsing behavior on Web pages [13, 14]. Some concluded that browsing time has a great impact in showing users' interest, while others used particular patterns that indicated users' interest; but most of them do not consider enough behavior patterns.

3 Working Methodology

The main purpose of the proposed approach is to find relevant pages by estimating the user's interest implicitly. The proposed technique starts by developing a Web browser to record the user's actions [2]. Actions include the duration on a Web page, number of mouse clicks, number of key ups and downs, save, print, number of scrollbar clicks, reload, open URL, stop loading, add to favorites, back, forward, copy, search text, hyperlink, active time duration, etc. The proposed browser also pops up a window at the time of closing the Web page that asks the user to rate that Web page. Whenever a user performs the above-mentioned actions, their details are stored in the database. The Apriori algorithm is then applied to the collected data. This algorithm yields the most frequent actions and the confidence values of the subsets of frequent actions. The values satisfying the minimum confidence threshold are used to calculate the weight of the Web page. The flow of the proposed work is shown in Fig. 1.
The detailed description of each step is discussed below:
Step 1: Developing the User Interface: A Web browser was developed in the first step. Normally, a Web browser does not store the actions performed by a user, whereas the proposed Web browser automatically stores the various actions performed by different users. When a user opens this browser, a unique id is provided to him, and his actions get stored in accordance with his id.
Step 2: Storing actions performed by users in the database: All actions performed by a user get stored in the database. A user can view a summary of all actions by clicking on "User Stats" on the interface. User stats show the frequency of every action performed by different users.

Fig. 1 Structure of proposed approach (developing user interface (Web browser), storing actions performed by users in database, applying Apriori algorithm, calculating page weight, applying rank)

Step 3: Applying Apriori: Apriori is applied to the stored actions to get the most frequent actions. For each page, the most frequent patterns are generated. These patterns are then used for calculating the page weight.
Step 4: Calculate Page Weight and PageRank: After applying Apriori to the actions, frequent patterns are obtained. The confidence values of each pattern are taken to calculate the page weight. The confidence value is calculated as per Eq. (1):

Confidence value = support count (most frequent actions) / support count (subset of most frequent actions)    (1)

where the support count of an itemset is the number of transactions that contain that itemset.
Based on the confidence value of each subset, the weight can be calculated by the following formula:

Pwt = C1 + C2 + C3 + ... + Ci = Σ Ci    (2)

where C1, C2, C3, ... are the confidence values of the subsets of frequently occurring actions which satisfy the minimum confidence threshold.
Step 5: Apply Rank: The higher the page weight, the higher its rank. That is, the page weight calculated in the above step is used for PageRank, and the page with the highest weight is assigned the highest rank. A minimal sketch of these two steps follows.
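The sketch below keeps only the confidence values that meet the minimum threshold (Eq. 1) and sums them into the page weight (Eq. 2). The 60% threshold and the confidence values reuse the worked example of the next section.

```python
# Minimal sketch of Steps 4 and 5: confidence values below the minimum
# threshold are discarded (Eq. 1) and the remaining values are summed into
# the page weight (Eq. 2); the higher-weight page gets the higher rank.
MIN_CONFIDENCE = 0.60

def page_weight(confidences):
    return sum(c for c in confidences if c >= MIN_CONFIDENCE)

weights = {
    "P1": page_weight([1.0, 0.33, 0.5, 0.67, 1.0, 0.67]),   # C1..C6   -> about 3.34
    "P2": page_weight([0.67, 0.33, 0.5, 1.0, 0.5, 0.67]),   # C1'..C6' -> about 2.34
}
ranking = sorted(weights, key=weights.get, reverse=True)
print(ranking)   # -> ['P1', 'P2']
```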

4 Example

In this section, an example is presented to show the working of the proposed approach. For this, the actions performed by 40 users on two pages, P1 and P2, were stored in the database. The database also stored the number of times users visited those pages at different times. Users performed actions on those pages according to their needs, and these actions were stored in the database. Apriori was applied to those actions with a minimum support of 20% and a minimum confidence threshold of 60%. The result of Apriori shows that the most frequent actions on page P1 were save as, add to favorites, and number of scrollbar clicks, and the most frequent actions on page P2 were number of mouse clicks, print, and add to favorites. With the help of these, the confidence values of the pages were calculated.
Step 1: A Web browser is developed to store the user actions. The interface of the proposed Web browser is shown in Fig. 2.
Step 2: All the actions performed by the 40 users get stored in the database; a screenshot is shown in Fig. 3.

Fig. 2 Proposed browser

Fig. 3 Storing actions in database



Step 3: After applying Apriori to the stored actions, the frequency of the actions is calculated and the frequent patterns are found. A snapshot of this step is shown in Fig. 4.
Step 4: The confidence value of a page is calculated by putting the values in Eq. (1). First, the confidence values of the most frequent actions of page P1 are calculated:
C1 = sc {Save As, Add to Favorites, Number of Scrollbar Clicks}/sc {Save As} = 2/2 = 100%
C2 = sc {Save As, Add to Favorites, Number of Scrollbar Clicks}/sc {Add to Favorites} = 2/6 = 33%
C3 = sc {Save As, Add to Favorites, Number of Scrollbar Clicks}/sc {Number of Scrollbar Clicks} = 2/4 = 50%
C4 = sc {Save As, Add to Favorites, Number of Scrollbar Clicks}/sc {Save As, Add to Favorites} = 2/3 = 67%
C5 = sc {Save As, Add to Favorites, Number of Scrollbar Clicks}/sc {Save As, Number of Scrollbar Clicks} = 2/2 = 100%
C6 = sc {Save As, Add to Favorites, Number of Scrollbar Clicks}/sc {Add to Favorites, Number of Scrollbar Clicks} = 2/3 = 67%
Secondly, the confidence values of the most frequent actions of page P2 are calculated:
C1′ = sc {Number of Mouseclicks, Print, Add to Favorites}/sc {Number of Mouseclicks} = 2/3 = 67%
C2′ = sc {Number of Mouseclicks, Print, Add to Favorites}/sc {Print} = 2/6 = 33%
C3′ = sc {Number of Mouseclicks, Print, Add to Favorites}/sc {Add to Favorites} = 2/4 = 50%
C4′ = sc {Number of Mouseclicks, Print, Add to Favorites}/sc {Number of Mouseclicks, Add to Favorites} = 2/2 = 100%
C5′ = sc {Number of Mouseclicks, Print, Add to Favorites}/sc {Number of Mouseclicks, Print} = 2/4 = 50%
C6′ = sc {Number of Mouseclicks, Print, Add to Favorites}/sc {Print, Add to Favorites} = 2/3 = 67% (see Table 1)

Fig. 4 Frequency of actions performed by users

Table 1 Confidence values of subsets of most frequent actions

Confidence value | Selection
C1 = 1           | ✓
C2 = 0.33        | ✕
C3 = 0.5         | ✕
C4 = 0.67        | ✓
C5 = 1           | ✓
C6 = 0.67        | ✓
C1′ = 0.67       | ✓
C2′ = 0.33       | ✕
C3′ = 0.5        | ✕
C4′ = 1          | ✓
C5′ = 0.5        | ✕
C6′ = 0.67       | ✓

Since the minimum confidence threshold is 60%, the confidence values C2, C3, C2′, C3′, and C5′ are rejected in the page weight calculation.
Therefore, the weight of page P1 is calculated as per Eq. (2):
Pwt = C1 + C4 + C5 + C6 = 1 + 0.67 + 1 + 0.67 = 3.34
The weight of page P2 is:
Pwt′ = C1′ + C4′ + C6′ = 0.67 + 1 + 0.67 = 2.34
Step 5: Since page P1 has a higher weight than page P2, the rank of page P1 will be higher.
The page weight by Hit Count is calculated by taking the average frequency of visits to a particular page. Table 2 shows the page weights of P1 and P2 by Apriori and by Hit Count.
The star rating is given by the user, who is explicitly asked to rate the page while closing it.
Using Table 2, a graphical representation is given in Fig. 5, which shows the comparison between the page weights calculated by Apriori and by Hit Count.
It is observed that Apriori performs better than Hit Count, as the proposed work calculates a higher relevance score. Explicitly asking for users' interest also shows that the proposed method predicts interest more accurately than the Hit Count method.

Table 2 Comparison between page weight by Apriori, Hit Count, and star rating (by user)

Pages   | Pwt (Apriori) | Pwt (Hit Count) | Star rating (by user)
Page P1 | 3.34          | 1.34            | 3
Page P2 | 2.6           | 2.2             | 2

Fig. 5 Comparison between page weight by Apriori and by Hit Count (likings of Page 1 and Page 2 in terms of page weight and star rating: proposed, Hit Count, and star rating by user)

5 Conclusion

In the proposed approach, users' browsing behavior is used to find relevant pages. Past research focused on user visit history and the time spent on a Web page, whereas the above-mentioned mechanism shows that other aspects of the user's browsing behavior can also be considered to find the user's interest. The proposed mechanism identifies several implicit indicators that can be used to determine a user's interest in a Web page. In addition to previously studied implicit indicators, several new implicit indicators are also taken into consideration. The indicators examined are the complete duration on a Web page, active time duration, search text, copy, print, save as, reload, number of key ups and downs, number of mouse clicks, number of scrollbar clicks, add to favorites, hyperlink, back, and forward. These implicit indicators together prove to be more accurate than any single indicator alone. To show that the proposed results are more accurate, an explicit indicator is also used to find more relevant pages: the user is asked to rate the page according to his interest, and the result is then compared with the proposed work. The comparison shows that the proposed approach is much closer to the explicit indicator and is more accurate.

References

1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. IBM Almaden Research Center (1994)
2. Deepika, Dixit, A.: Capturing User Browsing Behaviour Indicators. Electr. Comput. Eng. Int.
J. (ECIJ) 4(2), 23–30 (2015)

3. Deepika, Dixit, A.: Web crawler design issues: a review. Int. J. Multidiscip. res. Acad.
(IJMRA) (2012)
4. Ying, X.: The research on user modeling for internet personalized Services. Nat. Univ. Def.
Technol. (2003)
5. Morita, M., Shinoda Y.: Information filtering based on user behavior analysis and best match
text retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR), pp. 272–281 (1994)
6. Miller, B.N., Ried, J.T., Konstan, J.A.: GroupLens: applying collaborative filtering to usenet
news. Commun. ACM, March (1997)
7. Claypool, M., Le, P., Waseda, M., Brown, D.: Implicit interest indicators. In: Proceedings 6th
International Conference on Intelligent User Interfaces, ACM Press (2001)
8. Weinreich, H., Obendort, H., Herder, E., Mayer, M.: Off the beaten tracks: exploring three
aspects of web navigation. In: WWW Conference 2006, ACM Press (2006)
9. Goecks, J. Shavlik, J.: Learning users’ interests by unobtrusively observing their normal
behavior. In Proceedings 5th International Conference on Intelligent User Interfaces,
pp. 129–132 (2000)
10. Xing, K., Zhang, B., Zhou, B., Liu, Y.: Behavior based user extraction algorithm. In: IEEE
International Conferences on Internet of Things, and Cyber, Physical and Social Computing
(2011)
11. Tan, S.H., Chen, M., Yang, G.H.: User Behavior Mining on Large Scale Web Log Data. In:
Apperceiving Computing and Intelligence Analysis (ICACIA) International Conference
(2010)
12. Yang, Q., Hao, H., Neng, X: The research on user interest model based on quantization
browsing behavior. In: The 7th International Conference on Computer Science and Education
(ICCSE) Melbourne, Australia (2012)
13. Agrawal, R., Faloutsos, C., Swami, A.: Efficient similarity search in sequence databases. In:
Proceedings of the Fourth International Conference on Foundations of Data Organization and
Algorithms, Chicago, October (1993)
14. Kim, H., Chan, K.: Implicit indicators for interesting web pages. In: Web Information System
and Technologies WEBIST 2005 (2005)
Performance Efficiency Assessment
for Software Systems

Amandeep Kaur, P. S. Grover and Ashutosh Dixit

Abstract Software quality is a complex term, and various researchers have different views on how to define it. One common point in all of them is that quality is required and is indispensable. Software should not only meet customer requirements but exceed them. Customers here mean not only the external customers but the internal ones too. The performance efficiency characteristic is one of the vital software quality characteristics: if we improve performance efficiency, it will definitely have a positive effect on software quality. In this paper, we identify various sub-characteristics that can affect the performance efficiency of software and propose a performance efficiency model. We assess the performance efficiency of software systems by using one of the multi-criteria decision making (MCDM) methods, namely the analytical hierarchy process (AHP). Results suggest that the proposed model is consistent and may be used for comparing software systems.


Keywords Software quality · Software quality models · Performance efficiency · Optimized code · Analytical hierarchy process

A. Kaur (&)
GTBIT, New Delhi, India
e-mail: [Link]@[Link]
P. S. Grover
KIIT Group of Colleges, Gurgaon, India
e-mail: drpsgrover@[Link]
A. Dixit
YMCA University of Science and Technology, Faridabad, India
e-mail: dixit_ashutosh@[Link]


1 Introduction

As per IEEE Std 610.12-1990, IEEE Standard Glossary of Software Engineering Terminology, "Software Quality" is defined as the degree to which a system, component, or process meets specified requirements and customer needs [1]. This definition was modified in IEEE Std 1633-2008, IEEE Recommended Practice on Software Reliability, which defines "Software Quality" as
1. The totality of features and characteristics of a software product that bears on its
ability to satisfy given needs, such as conforming to specifications.
2. The degree to which software possesses a desired combination of attributes.
3. The degree to which a customer or user perceives that software meets his or her
composite expectations.
4. The composite characteristics of software that determine the degree to which the
software in use will meet the expectations of the customer.
This definition of software quality covers not only the objective but also the subjective part of quality. One thought-provoking question that arises is: "Who is the customer?" The first thing that comes to mind is the external customers, that is, those people who are external to the organization and who receive our product (software) and services. Another category of customers, who are often forgotten or taken for granted, are the internal customers. These are the people in the next phase of the software development life cycle, who are the internal customers of our work.
Quality cannot be added to the system later as an afterthought; rather, it needs to be built into the system from the very beginning. For building software quality, we need to build an efficient system in terms of not only time and resources but also the efficiency of the code. That is, we need to consider the expectations of not only external customers but also our internal customers.

2 Related Work

Inadequate quality of software systems may lead to many problems like difficult maintenance, low performance efficiency, low reusability, or frequent program changes. From time to time, researchers have proposed various software quality models in order to measure the quality of software products. The latest software quality standard is ISO/IEC 25010, which was prepared by ISO/IEC JTC 1 after technically revising the earlier software quality model ISO/IEC 9126-1:2001. Various amendments were made in order to address the weaknesses of ISO/IEC 9126 in the newly revised quality model division ISO/IEC 2501n [2].
As per this latest International Standard, ISO 25010, the software product quality model enumerates eight characteristics. These characteristics are further subdivided into sub-characteristics which can be measured internally or externally [3]. Briefly, these quality characteristics are as follows:
1. Functional suitability: “the degree to which the software product provides functions that meet stated and implied needs when the software is used under specified conditions.” It consists of functional completeness, functional correctness, and functional appropriateness as its sub-characteristics.
2. Reliability: “the degree to which software product performs specified functions under specified conditions for a specified period of time.” It consists of maturity, fault-tolerance, recoverability, and availability as its sub-characteristics.
3. Usability: “the degree to which software product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use.” It consists of appropriateness recognizability, learnability, operability, user error protection, user interface aesthetics, and accessibility as its sub-characteristics.
4. Security: “degree to which a software product protects information and data so that persons or other products or systems have the degree of data access appropriate to their types and levels of authorization.” It consists of confidentiality, integrity, non-repudiation, accountability, and authenticity as its sub-characteristics.
5. Performance Efficiency: “the capability of the software product to provide appropriate performance, relative to the amount of resources used, under stated conditions.” It consists of time behavior, resource utilization, and capacity as its sub-characteristics.
6. Maintainability: “the degree of effectiveness and efficiency with which a software product can be modified by the intended modifiers.” It consists of modularity, reusability, analyzability, modifiability, and testability as its sub-characteristics.
7. Portability: “the degree of effectiveness and efficiency with which a software product can be transferred from one environment to another.” It consists of adaptability, installability, and replaceability as its sub-characteristics.
8. Compatibility: “the degree to which the software product can exchange information with other software products and/or perform its required functions, while sharing the same hardware or software environment.” It consists of co-existence and interoperability as its sub-characteristics [2].
One major amendment made by the ISO 25010 software quality model is in relation to the efficiency characteristic of software quality.
Earlier, in the ISO 9126 model, efficiency was one of the six software quality characteristics and was defined as “the capability of the software product to provide appropriate performance, relative to the amount of resources used, under stated conditions.” It consisted of time behavior, resource utilization, and efficiency compliance as its sub-characteristics.
Later, in the ISO 25010 model, efficiency was renamed to performance efficiency, and the capacity sub-characteristic was added to it along with time behavior and resource utilization. Capacity, as per ISO 25010, is the degree to which the maximum limits of a product or system parameter meet requirements.

3 Proposed Performance Efficiency Model

Performance is an indication of the responsiveness of a system to execute specified actions in a given time interval. Performance efficiency can be defined as the performance relative to the amount of resources used under the stated conditions.
David Parnas quoted, “For much of my life, I have been a software voyeur, peeking furtively at other people’s dirty code. Occasionally, I find a real jewel, a well-structured program written in a consistent style, free of kludges, developed so that each component is simple and organized, and designed so that the product is easy to change.”
Writing optimized code has a positive effect on performance in terms of lower response time, increased throughput, reduced memory consumption, and reduced network bandwidth consumption.
According to Pressman, “More books have been written about programming (coding) and the principle and concepts that guide it than about any other topic in the software process.”
Sommerville also identified efficiency as one of the four generalized attributes which are concerned not with what a program does, but with how well the program does it [4].

3.1 Performance Efficiency in Quality Models

Various software quality models have been reviewed in order to understand the perspective for taking performance efficiency as a characteristic for defining quality.
In 1977, Jim McCall identified three main perspectives (product revision, product transition, and product operations) for characterizing the quality attributes of a software product and considered efficiency as one of the quality factors under product operations. It defined one or more quality criteria for each quality factor, in order to assess the overall quality of a software product. According to McCall’s quality model, the quality criteria for efficiency are execution efficiency and storage efficiency [5].
In 1978, Barry Boehm proposed a software quality model with seven quality attributes according to the three fundamental uses (As-is utility, maintainability, and portability) of the software which may affect the quality of the software product. It identified efficiency as a quality attribute under As-is utility. According to Boehm’s quality model, the factors that affect efficiency are accountability, device efficiency, and accessibility [3].

Fig. 1 Proposed performance efficiency model

In 1993, the ISO 9126 software quality model was proposed, composed of six quality characteristics in relation to internal and external quality. It identified efficiency as one of the quality characteristics and specified three quality attributes that affect the efficiency of software: time behavior, resource behavior, and efficiency compliance [6].
In 2009, Kumar extended the ISO/IEC 9126 quality model and proposed an aspect-oriented programming-based software quality model, viz. the aspect-oriented software quality (AOSQUAMO) model. It added code reducibility as a sub-characteristic under the efficiency quality characteristic. Hence, the quality attributes that affect efficiency according to the AOSQUAMO model are time behavior, resource behavior, and code reducibility [7].
In 2011, although ISO/IEC 25010 addressed the problems related to efficiency, one area was still left untouched [2].
Use of solid coding techniques and good programming practices while developing high-quality optimized code plays an important role in software quality and performance. Code written while consistently applying good coding standards and proper coding techniques is not only optimized in terms of time, effort, and cost (resources) but is also easier to comprehend and maintain. Hence, it will serve the expectations of our internal customers. The missing point in the ISO 25010 model is that, while estimating the efficiency of software, no weightage is given to how efficiently the code is written and how optimized it is.
In this section, we propose a performance efficiency model, as performance efficiency is a vital part of improving software quality (Fig. 1).

4 Assessment of the Proposed Performance Efficiency Model

The analytic hierarchy process (AHP) is one of the multi-criteria decision-making (MCDM) methods. It was developed by Dr. Thomas L. Saaty to support decision making in an organized and structured manner. Hence, the AHP method works as a decision support tool that is used to solve various complex decision problems [8, 9].

Fig. 2 Problem decomposition

Table 1 Comparison matrix

        C1       C2       ...      Cn
C1      w1/w1    w1/w2    ...      w1/wn
C2      w2/w1    w2/w2    ...      w2/wn
...     ...      ...      ...      ...
Cn      wn/w1    wn/w2    ...      wn/wn

where C1, C2, ..., Cn denote the criteria and w1, w2, ..., wn their calculated weights

AHP is a four-step process:

Step 1: Define the problem and state the goal or objective.
Step 2: Define the criteria/factors that affect the goal or objective, and structure them into levels and sub-levels (Fig. 2).
Step 3: Use paired comparisons of each criterion with respect to each other, and find the matrix with calculated weights, eigenvectors, and consistency measures (Table 1).

Saaty advised the consistency index (CI) and the consistency ratio (CR) as two measures for verifying the consistency of the comparison matrix [10]:

CI (Consistency Index) = (λmax − n) / (n − 1)   and   CR (Consistency Ratio) = CI / RI

As per Saaty, for a 3 × 3 matrix, if the consistency ratio is greater than 0.05, for a 4 × 4 matrix, if the consistency ratio is greater than 0.08, and for all larger matrices, if the consistency ratio is greater than 0.1, then our set of judgments is inconsistent.

Step 4: Synthesize the rank of alternatives until the final choice is made.

5 Case Study

Step 1: Structuring the hierarchy for performance efficiency (Fig. 3).
Step 2: Calculating weights for each criterion.
The next step after decomposing the performance efficiency characteristic into a hierarchy of sub-characteristics, namely time behavior, optimized code, resource utilization, and capacity, is to calculate the weights corresponding to these sub-characteristics. For this, a survey was conducted in which ten participants from a software industry background participated [11, 12]. A survey form composed of six comparisons was provided to each participant. It consisted of pairwise comparisons of the sub-characteristics. The participants were asked to assign a relative value in the range of 1–9 for every pairwise comparison. After collecting the inputs from the participants, the mean values were calculated as shown in Fig. 4.
Step 3: Eigenvector and eigenvalue calculation.
The eigenvector is calculated by squaring the comparison matrix and then calculating the row sums, which are then normalized.
I Iteration
The iteration is repeated till the difference between the current eigenvector and the previous eigenvector becomes negligible (Figs. 5 and 6).
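The squaring-and-normalizing step described above can be sketched in a few lines of code. The following Java fragment is only an illustration of the procedure, not the authors' implementation; the 4 × 4 matrix values are placeholders standing in for the survey-derived comparison matrix of Fig. 4.

public class AhpEigenvector {

    // Multiply two square matrices.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    // Normalized row sums of a matrix give one eigenvector estimate.
    static double[] normalizedRowSums(double[][] m) {
        int n = m.length;
        double[] rowSum = new double[n];
        double total = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) rowSum[i] += m[i][j];
            total += rowSum[i];
        }
        for (int i = 0; i < n; i++) rowSum[i] /= total;
        return rowSum;
    }

    public static void main(String[] args) {
        // Placeholder pairwise comparison matrix (the survey means of Fig. 4 would go here).
        double[][] a = {
            {1.0, 2.0, 3.0, 5.0},
            {0.5, 1.0, 2.0, 3.0},
            {1.0 / 3, 0.5, 1.0, 2.0},
            {0.2, 1.0 / 3, 0.5, 1.0}
        };
        double[] prev = normalizedRowSums(a);
        // Square the matrix repeatedly until the eigenvector estimate stabilizes.
        for (int iter = 0; iter < 8; iter++) {
            a = multiply(a, a);
            double[] current = normalizedRowSums(a);
            double diff = 0;
            for (int i = 0; i < current.length; i++) diff += Math.abs(current[i] - prev[i]);
            prev = current;
            if (diff < 1e-6) break;   // difference is negligible
        }
        System.out.println(java.util.Arrays.toString(prev));  // weight of each sub-characteristic
    }
}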

Fig. 3 Hierarchy for performance efficiency

Fig. 4 Comparison matrix with calculated weights



Fig. 5 Squaring the A matrix

Fig. 6 Row sum matrix and eigenvector

II Iteration
See Figs. 7 and 8.
III Iteration
After the third iteration, the difference between the current and the previous eigenvector approaches zero. Hence, these values can be accepted as our final values (Figs. 9 and 10).

Fig. 7 Squaring the A2 matrix

Fig. 8 Row sum matrix, eigenvector, and difference

Fig. 9 Squaring the A4 matrix



Fig. 10 Row sum matrix, eigenvector, and difference

Fig. 11 Eigenvalues, λ, and λmax

Step 4: Consistency measures calculation

According to Saaty, matrix A is a consistent matrix if λmax ≥ n and the consistency ratio (CR), the ratio of the consistency index to the random index, is significantly small, i.e., less than 10%. If not, then there is still scope for improvement in consistency. As for our case study, n is equal to four; hence, to be consistent, λmax should be greater than or equal to four (Fig. 11).

Consistency Index (CI) = (λmax − n) / (n − 1) = (4.2125 − 4) / (4 − 1) = 0.0708

and

Consistency Ratio (CR) = CI / RI = 0.070833 / 0.9 = 0.0787

Now, as λmax is 4.2125, which is greater than 4, and the consistency ratio is 0.0787, which is less than 0.1, matrix A is consistent.
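As a quick cross-check of the consistency figures above, the following Java fragment is an illustrative sketch (not the authors' tooling) that plugs the λmax value reported in Fig. 11 into Saaty's CI and CR formulas, using Saaty's standard random index table.

public class AhpConsistency {

    // Saaty's random index values for matrix sizes 1..10 (index 0 unused).
    static final double[] RI = {0, 0, 0, 0.58, 0.90, 1.12, 1.24, 1.32, 1.41, 1.45, 1.49};

    public static void main(String[] args) {
        int n = 4;                 // four sub-characteristics
        double lambdaMax = 4.2125; // principal eigenvalue reported in Fig. 11

        double ci = (lambdaMax - n) / (n - 1.0);  // consistency index
        double cr = ci / RI[n];                   // consistency ratio

        System.out.printf("CI = %.4f, CR = %.4f%n", ci, cr);
        System.out.println(cr < 0.10 ? "Judgments are consistent" : "Judgments are inconsistent");
    }
}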

6 Conclusion and Future Scope

In this paper, we assess the performance efficiency characteristic of software systems. Firstly, we identified the criteria/factors that could affect performance efficiency, structured them into levels, and proposed the performance efficiency model. After this, in order to assess the proposed model, we conducted a survey. Ten participants from a software industry background participated in the survey. We applied the AHP method for ensuring the consistency of the proposed performance efficiency model. The results showed that the chosen quality sub-characteristics are consistent and that the relative ranking of the quality attributes for performance efficiency is in the order of time behavior, optimized code, resource utilization, and then capacity. In future, this model may be used for comparing the performance efficiency of different software systems.

References

1. Pressman, R.S.: Software Engineering: A Practitioner’s Approach, 5th edn. Mc Graw Hill,
New York (2005)
2. ISO/IEC 2010; ISO/IEC 25010: Systems & software engineering—system and software
quality requirements and evaluation (SQuaRE)—system and software quality models (2010)
3. Boehm, B., et al.: Quantitative evaluation of software quality. In: IEEE International
Conference on Software Engineering, pp. 592–605 (1976)
4. Sommerville, I.: Software Engineering, 7th edn., pp. 12–13. Pearson (2004). ISBN 0-321-21026-3
5. McCall, J.A., Richards, P.K., Walters, G.F.: Factors in Software Quality, vols. I, II, and III.
US Rome Air Development Center Reports. US Department of Commerce; USA (1977)
6. ISO 2001; ISO/IEC 9126-1: Software Engineering—Product Quality—Part 1: Quality Model.
International Organisation for Standardisation, Geneva Switzerland (2001)
7. Kumar, A.: Analysis & design of matrices for aspect oriented systems. Ph.D. Thesis, Thapar
University (2010)
8. Saaty, T.L.: Analytic Hierarchy Process. Mc Graw Hill (1980)
9. Kaur, A., Grover, P.S., Dixit, A.: Quantitative evaluation of proposed maintainability model
using AHP method. In: 2nd International Conference on computing for sustainable global
development, pp. 8.159–8.163 (2015)
10. Kaur, A., Grover, P.S., Dixit, A.: An improved model to estimate quality of the software
product. YMCAUST Int. J. Res. 1(2), 01–06 (2013)
11. Kaur, A., Grover, P.S., Dixit, A.: Analysis of quality attribute and metrics of various software
development methodologies. In: International Conference on Advancements in Computer
Applications and Software Engineering, pp. 05–10 (2012)
12. Grady, R., et al.: Software Metrics: Establishing a Company-Wide Program, p. 159. Prentice
Hall (1987)
Impact of Programming Languages
on Energy Consumption for Sorting
Algorithms

Tej Bahadur Chandra, Pushpak Verma and Anuj Kumar Dwivedi

Abstract In today’s scenario, the world is moving rapidly toward global warming. Various experiments are being performed to concentrate more on energy efficiency. One way to achieve this is by implementing sorting algorithms in a programming language that consumes the least amount of energy, which is our current area of research in this paper. In this study, our main goal is to find a programming language that consumes the least amount of energy and contributes to green computing. In our experiment, we implemented different sorting algorithms in different programming languages in order to find the most power-efficient language.

Keywords Programming language · Sorting algorithms · Power consumption · Joulemeter

1 Introduction

Energy efficiency is a critical aspect in battery-operated devices like sensor nodes, Internet of Things (IoT) devices, and many other devices engaged in space research operations. As data size grows in battery-operated devices, so does the power consumption, and it ultimately reduces the device’s uptime. Not only is the hardware to blame for energy consumption; the software is equally responsible. We, in this paper, focus on the energy consumption of three standard programming languages, Visual Basic 6.0, Java, and C#.Net, by implementing four

T. B. Chandra (✉) · P. Verma


School of Information Technology, MATS University, Raipur, Chhattisgarh, India
e-mail: tejbahadur1990@[Link]
P. Verma
e-mail: [Link]@[Link]
A. K. Dwivedi
Govt. Vijay Bhushan Singh Deo Girls Degree College, Jashpur 496331, Chhattisgarh, India
e-mail: [Link]@[Link]


sorting algorithms: bubble sort, insertion sort, selection sort, and Quick sort. We simulate the energy consumption of the sorting algorithms when implemented in different programming languages in order to come up with energy-efficient programming.
Our interest in this paper is to promote green computing by bringing into the limelight programming languages and sorting algorithms that require the least energy.

2 Related Works

The IT industries have been focusing on the energy efficiency of hardware and have evolved their devices for better efficiency [1]. Green computing is the environment-friendly use of available computing resources without sacrificing performance [2]. Very little research has been performed in this field due to constrained hardware resources and extensive cost. Researchers have concluded that the most time- and energy-efficient sorting algorithm is Quick sort [3]. It is also found that the energy consumption greatly depends on the time and space complexity of the algorithms [4]. A programmer can develop application-level energy-efficient solutions if an energy-efficient language is used [5]. Algorithms also have a great impact on energy consumption and ultimately on green computing [6]. Several studies have already been performed on hardware and concluded that home server hardware, together with well-tuned, parallelized sorting algorithms, can sort bulk amounts of data and is noticeably more energy-efficient than older systems [7]. Bunse, C. concluded that different software has different energy payloads; his studies also show that different algorithms have different energy requirements [8].

3 Role of Programming Language in Energy Efficiency

A lot of research has already been performed to achieve energy efficiency, mostly focusing on algorithm design, hardware architectures (VLSI designs), operating systems, compilers, etc., but investigations show that programming language design and good programming practice may be another perspective from which to reduce power consumption [9]. Distinct programming languages handle the same situation differently and require a different number of operations to accomplish the same task. In general, a compiled language is harder to code but runs faster; on the other hand, interpreted languages are easier to program but take longer to run. There is some difference in energy consumption between different loops (such as a For loop and a While loop) when the number of operations needed for incrementing the loop counter variable and checking the termination condition differs significantly between the two alternatives. Energy consumption by an application can be further cut down by using ‘vector operations’ in a vector register where possible. Code execution time can also be reduced by taking advantage of multiple threads and cores, resulting in increased idle time that in turn leads to power conservation. Inefficient code forces the CPU to draw more power from the processor and consume more electricity [10]. The performance-to-power relationship can be improved by loop unrolling. Use of idle-power-friendly programming language implementations and libraries may improve power saving [11]. Based on several studies, researchers estimated that between 30 and 50% energy savings can be achieved by selecting energy-aware software solutions, and even more could be achieved by a proper combination of both software and hardware [12].
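To make the point about exploiting multiple threads and cores concrete, the following Java fragment is a minimal illustrative sketch (it is not one of the study's test programs): it sorts the same data once with the single-threaded Arrays.sort and once with the multi-core Arrays.parallelSort, so the reduction in wall-clock time, and hence the earlier return to an idle state, can be observed.

import java.util.Arrays;
import java.util.Random;

public class ParallelSortDemo {
    public static void main(String[] args) {
        int[] data = new Random(42).ints(60_000).toArray();   // same size as the study's data sets
        int[] copy = data.clone();

        long t1 = System.nanoTime();
        Arrays.sort(data);                 // single-threaded dual-pivot quicksort
        long t2 = System.nanoTime();
        Arrays.parallelSort(copy);         // fork/join sort across available cores
        long t3 = System.nanoTime();

        System.out.printf("Arrays.sort:         %.2f ms%n", (t2 - t1) / 1e6);
        System.out.printf("Arrays.parallelSort: %.2f ms%n", (t3 - t2) / 1e6);
    }
}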

4 Sorting Algorithms

Sorting is an efficient way of performing the important task of putting a list of elements in order. For example, sorting will arrange the elements in ascending or descending order [13]. When it comes to battery-operated devices, use of energy-efficient sorting is a prime requirement. The text that follows describes the sorting algorithms that were used in our research. Table 1 shows the complexities of various sorting algorithms.

4.1 Bubble Sort

It is a simple algorithm which begins at the start of the data set, bubbles the largest element to the end of the data set on each pass, and in the next pass repeats the same with the range reduced by one [14].
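For reference, a straightforward bubble sort of an integer array can be written as in the Java sketch below (an illustrative version, not the exact code measured in the experiment).

public class BubbleSort {
    // Repeatedly bubble the largest remaining element to the end of the array.
    static void bubbleSort(int[] a) {
        for (int pass = a.length - 1; pass > 0; pass--) {
            for (int i = 0; i < pass; i++) {
                if (a[i] > a[i + 1]) {        // swap adjacent elements that are out of order
                    int tmp = a[i];
                    a[i] = a[i + 1];
                    a[i + 1] = tmp;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 4, 2, 8};
        bubbleSort(a);
        System.out.println(java.util.Arrays.toString(a));  // [1, 2, 4, 5, 8]
    }
}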

Table 1 Sorting algorithm complexity


Sorting name Average case Worst case Best case Stability
Bubble O(n^2) O(n^2) O(n) Yes
Insertion O(n^2) O(n^2) O(n) Yes
Quick sort O(n log n) O(n^2) O(n log n) No
Selection O(n^2) O(n^2) O(n^2) No

4.2 Selection Sort

It is also referred to as a comparison sort, is best known for its simplicity, and has reasonably good performance compared with more complicated algorithms. It is not as good as insertion sort, which works in a similar way to selection sort [15].

4.3 Insertion Sort

Insertion sort separates the data list into two parts: one part which is the sorted section and the other which is the unsorted section. When a new element is inserted, the existing larger elements are required to be shifted to the right [16].

4.4 Quick Sort

The Quick sort used in this study is an advanced form of the classic Quick sort. Some modifications are made in the internal loops of Quick sort to make it well optimized and short. It is also known as Median Hybrid Quick sort.

5 Programming Language

A programming language is a specially designed instruction set that is used to represent algorithms in a machine-understandable format and control a machine’s behavior. There exist many different programming languages that suit different environments. As a programmer, it is very essential to focus on the latest programming trends that contribute to power conservation. Following are some computer languages that are used in our study [17].

5.1 Java [18]

Java is a platform-independent, open-source programming language mainly designed to operate in the distributed environment of the Internet. It takes the best parts of C and C++ while introducing many new features. Microsoft’s version of Java is known as Visual J++.

5.2 Visual Basic 6.0

Visual Basic 6.0 provides a complete GUI-based, high-level integrated programming environment for developing Windows applications. It is based on event-driven programming and is derived from the language called BASIC. Visual Basic provides easy application development by using a structured programming approach and can be easily debugged and modified [19].

5.3 C#.Net

C# is a modern, type-safe, and completely object-oriented programming language from Microsoft. C# embodies both the power of C++ and the comfort of Visual Basic. C# uses C++ as its base while incorporating features that make it familiar to Java programmers. C# is designed to work on the Microsoft platform [20].

6 Experimental Setup

Here in this study, we used a simulator tool named “Joulemeter” [21] from Microsoft Corporation to simulate the power consumption of various sorting algorithms implemented in three languages: Java, C#.NET, and Visual Basic 6.0.
We used an Intel Core i5 4th-generation CPU with the Windows 8.1 (Version 6.3.9600) operating system to perform our experiment. Figure 1 shows the model of the experimental setup.

Fig. 1 Experimental setup for calculation of power consumption



– Joulemeter
It is a software tool which was introduced by Microsoft. It is used to view the overall power consumption of the whole system and also of the key components which are to be monitored. The user has to calibrate Joulemeter in order to estimate the power consumption of an idle system. After the calibration, the total power consumption can be monitored and obtained by adding up the values (in watts) for the duration of interest. It can also be converted to other units like kWh/Wh by using the following conversion:

1 kWh = 1000 W × 1 h = 1000 W × 3600 s = 3,600,000 J
i.e., Watt = J/s
Thus, 1 Joule = 1/3,600,000 kWh. [3]

– Sorting Algorithm Programs
In our study, we implemented four sorting algorithms in three different languages: Visual Basic 6.0, C#.Net, and Java. All these sorting programs provide sorting on both integer and double data sets with sixty thousand elements.

7 Experimental Run

In our experiment, we compared the four sorting algorithms implemented in three different languages on the basis of their power consumption over sixty thousand integer as well as double data elements. We performed four test runs on the same data set and took their average to find the average power consumption per second in each programming language for all the sorting algorithms discussed here. To verify which programming language requires the least amount of energy for which sorting algorithm, we plotted bar graphs.

8 Experimental Result and Analysis

We compared the above sorting algorithms implemented in different programming languages over sixty thousand data elements for both integer and double data types. The values are calculated on the basis of power consumption represented in watts per second.

Fig. 2 a Average power consumed for integer data set (in W/s). b Average power consumed for
integer data set (in W/s)

In Fig. 2a, b, we note that different programming languages consume different amounts of power for each sorting algorithm and also that the power consumption depends on the selection of the sorting algorithm.
In Fig. 3a, b, we observe that, in the case of sorting elements of the double data type as well, Java is the most power efficient and Visual Basic consumes the most power.

Fig. 3 a Average power consumed for double data set (in W/s). b Average power consumed for
double data set (in W/s)

9 Conclusion

Our research focuses on measurements and calculations that result in the selection of a programming language which consumes the least power, leads to increased uptime of battery-operated devices, and ultimately contributes to green computing. In this paper, we mainly focus on the power efficiency of the programming languages used to implement sorting algorithms. Based on our experiment, we found that Java is the most power-efficient programming language, while Visual Basic 6.0 consumes more power than the other two languages used in our study. Also, sorting data elements of type double consumes more power than sorting a data set of integer type.

References

1. Raza, K., Patle, V.K., Arya, S.: A review on green computing for eco-friendly and sustainable
IT. J. Comput. Intell. Electron. Syst. 1(1), 3–16 (2012)
2. Saha, B: Green computing. Int. J. Comput. Trends Technol. 14(2) (2014)
3. Chandra, T.B., Patle, V.K., Kumar, S.: New horizon of energy efficiency in sorting
algorithms: green computing. In: Proceedings of National Conference on Recent Trends in
Green Computing. School of Studies in Computer in Computer Science & IT, Pt. Ravishankar
Shukla University, Raipur, India, 24–26 Oct 2013
4. Bunse, C., Höpfner, H., Roychoudhury, S., Mansour, E.: Choosing the “best” sorting
algorithm for optimal energy consumption. In: ICSOFT (2), pp. 199–206 (2009)
5. Liu, Y. D.: Energy-aware programming in pervasive computing. In: NSF Workshop on
Pervasive Computing at Scale (PeCS) (2011)
6. Narain, B., Kumar, S.: Impact of algorithms on green computing. Int. J. Comput. Appl.
(2013). ISSN No. 0975-8887
7. Beckmann, A., Meyer, U., Sanders, P., Singler, J.: Energy-efficient sorting using solid state
disks. In: Proceedings of IEEE Green Computing Conference (2010)
8. Bunse, C., Hopfner, H., Mansour, E., Roychoudhury, S.: Exploring the energy consumption
of data sorting algorithms in embedded and mobile environments. In: Tenth International
Conference on Mobile Data Management: Systems, Services and Middleware, 2009.
MDM’09, pp. 600–607. IEEE (2009)
9. Liu, Y.D.: Energy-aware programming in pervasive computing. In: NSF Workshop on
Pervasive Computing at Scale (PeCS) (2011)
10. Francis, K., Richardson, P.: Green maturity model for virtualization. Archit. J. 18(1), 9–15
(2009)
11. Energy-Efficient Software Guidelines. [Link]
energy-efficient-software-guidelines
12. Code green: Energy-efficient programming to curb computers power use, [Link]
[Link]/news/2011/05/31/code-green-energy-efficient-programming-to-curb-
computers-power-use/
13. Sareen, P.: Comparison of sorting algorithms (on the basis of average case). Int. J. Adv. Res.
Comput. Sci. Softw. Eng. 3(3), 522–532 (2013)
14. Research Paper on Sorting Algorithms. [Link]
mac-286-data-structure_22946/. Accessed on 26 Oct 2009
15. Nagpal, H.: Hit sort: a new generic sorting algorithm
16. Khairullah, M.: Enhancing worst sorting algorithms. Int. J. Adv. Sci. Technol. 56 (2013)
17. Singh, T.: New software development methodology for student of Java programming
language. Int. J. Comput. Commun. Eng. 2(2), 194–196 (2013)
18. Gosling, J.: The Java language specification. Addison-Wesley Professional (2000)
19. Hassan, A.B., Abolarin, M.S., Jimoh, O.H.: The application of Visual Basic computer
programming language to simulate numerical iterations. Leonardo J. Sci. 5(9), 125–136
(2006)
20. Benton, N., Cardelli, L., Fournet, C.: Modern concurrency abstractions for C#. ACM Trans.
Program. Lang. Syst. (TOPLAS) 26(5), 769–804 (2004)
21. Joulemeter. [Link]
daf55565f794
Crawling Social Web with Cluster
Coverage Sampling

Atul Srivastava, Anuradha and Dimple Juneja Gupta

Abstract A social network can be viewed as a huge container of nodes and relationship edges between the nodes. Covering every node of a social network in the analysis process is practically infeasible due to the gigantic size of social networks. The solution to this is to take a sample by collecting a few nodes and their relationship status from the huge network. This sample can be considered a representative of the complete network, and analysis is carried out on this sample. How closely the results derived by analysis resemble reality depends majorly on the extent to which a sample resembles its actual network. Sampling, hence, appears to be one of the major challenges for social network analysis. Most social networks are scale-free networks and can be seen as having overlapping clusters. This paper develops a robust social Web crawler that uses a sampling algorithm which considers the clustered view of the social graph. A sample will be a good representative of the network if it has a similar clustered view as the actual graph.

Keywords Social graph · Sampling · Social web crawler · OSNs

1 Introduction

Social networks provide an open platform to analyse and understand the behaviour of users, their interaction patterns and the propagation of information. The explosive growth of Online Social Networks (OSNs) has assured the possibility of prominent

A. Srivastava (✉) · Anuradha


Department of Computer Engineering, Y.M.C.A. University of Science and Technology,
Faridabad, India
e-mail: atul.nd2@[Link]
Anuradha
e-mail: anuangra@[Link]
D. J. Gupta
Department of Computer Science and Engineering, D. I. M. T, Ambala, India
e-mail: dimplejunejagupta@[Link]


outcomes of social network analysis. Facebook, a social networking site, has crossed 1.3 billion monthly active users, and Twitter, also called the SMS of the Internet, has 284 million monthly active users putting out 500 million tweets per day [1, 2].
Although OSNs provide easily available data for analysis, the size of OSNs is so gigantic that it hinders researchers from understanding the structure of the graphs. The huge size of OSNs brings two major challenges. First, it is difficult to collect the complete graph, the reasons being that the administrator of the network is not willing to give out data, or the users on the site have different restrictions on the visibility of their data; the time required to acquire the complete graph also makes it impossible. Secondly, even if the data of the complete graph is somehow gathered in one place, it requires expensive and well-equipped computers and large overhead in terms of time, storage and computation [3]. Alternatively, sampling of the graph suggests a prominent and inexpensive solution. A subset of the graph is considered representative of the original graph. A good sampling cuts short the scale of the original graph, yet maintains its characteristics.
While sampling can, in principle, draw accurate results from comparatively very small observations, the accuracy majorly depends on the representativeness of the sample. Recent work in the area of sampling focuses on the resemblance of the statistical properties a sample holds. Sampling algorithms are based on the fact that nodes are connected to one another via some relationship in the social graph. Hence, a sample of a social graph can be collected easily by crawling. Earlier efforts in sampling were made with techniques like BFS/Snowball-type approaches [4]. These algorithms are fed with a seed node (starting point), and then connected nodes are explored recursively. But these techniques came up with a well-known anomaly called biasing. A sample collected through these techniques may be biased towards high-degree nodes, due to which the sample exhibits poor statistical properties [5]. Recent work focuses on unbiased sampling techniques mostly based on random walks [6–11] in the social network.

2 Related Work

Several sampling algorithms have been proposed for social graph sampling. These algorithms can be put into two categories: the first focuses on nodes and the second focuses on edges. In algorithms of the former category, the sampling decision-making process is executed on nodes, e.g., BFS [4, 12], MHRW [4, 9, 10] and UGDSG [13]. The latter class of algorithms acquires edges in the sample, and nodes are selected as the end points of the edges, e.g., FS [13].
BFS has been used widely to study the user behaviour of OSNs [4] and for analysing the topological characteristics of social graphs [14]. But BFS suffers from biasing [4, 14]. It visits nodes with higher degree more frequently, due to which BFS obtains a higher local clustering coefficient than the original one [paper].
MHRW is based on the Markov Chain Monte Carlo (MCMC) model that selects random node samples according to the degree probability distribution of the nodes [9, 10]. MHRW designs a proposal function based on a probability distribution which is randomly accepted or rejected. Probabilities are modified during transition, by which the sample achieves convergence at the probability distribution. Selection of the next node from the current node is made on the basis of another independent event, i.e., generation of a random number, p, from the uniform distribution U(0, 1). If p < the probability ratio of the current node and the proposed node, then the proposed node is selected as the next node. So, if the degree of the proposed node is small (small chance of being selected), there will be a high probability that the proposal will be accepted. MHRW was originally designed for undirected graphs. Another method, USDSG [ ], is developed that is based on MHRW and works with directed graphs. USDSG considers all the unidirectional edges as bidirectional edges. To apply UGDSG, a directed graph is changed to a symmetric graph. This methodology is also used in Frontier Sampling (FS) [1]. FS is an edge sampling algorithm based on Random Walk. FS selects one node v as a seed node from the set of seed nodes S, with the probability defined as:

P(v) = kv / Σu∈S ku

An edge (v, w) is selected uniformly from node v’s outgoing edges, and v will be replaced with w in the set of seed nodes. Edge (v, w) is added to the sample. FS does not perform well if the clustering coefficient is small [13].
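As an aside, the degree-proportional seed selection defined above amounts to a simple weighted random choice; a minimal Java sketch of that step, using a hypothetical degree map, could look like this.

import java.util.List;
import java.util.Map;
import java.util.Random;

public class FrontierSeedSelection {
    // Pick a seed v from S with probability degree(v) / sum of degrees of all seeds.
    static String pickSeed(List<String> seeds, Map<String, Integer> degree, Random rnd) {
        int total = seeds.stream().mapToInt(degree::get).sum();
        int r = rnd.nextInt(total);          // uniform in [0, total)
        for (String v : seeds) {
            r -= degree.get(v);
            if (r < 0) return v;             // v chosen with probability degree(v)/total
        }
        return seeds.get(seeds.size() - 1);  // fallback, never reached in practice
    }

    public static void main(String[] args) {
        List<String> seeds = List.of("a", "b", "c");
        Map<String, Integer> degree = Map.of("a", 5, "b", 1, "c", 4);
        System.out.println(pickSeed(seeds, degree, new Random()));
    }
}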
Corlette et al. [15] proposed event-driven sampling that focuses on the active part of the network. Event-driven sampling is similar to the process used by various search engines to refresh their repository by including new Web pages and eliminating expired ones. Multigraph sampling [16], however, considers the social network as distributed in clusters. The sampling is carried out over multiple relations among users in the social network.

3 Problem Description

Generally, social Web crawling has distinct steps: the crawler starts with a seed node, explores and collects its directly connected nodes, selects a few of the explored nodes as a sample for further exploration, and this process is repeated. After fetching the information of one node completely, the crawler needs the next node to crawl, and that is selected by the sampling algorithm.
Almost every social network is scale-free and can be seen as unevenly formed; i.e., the network is denser at some places and sparse at others. These denser portions can be considered as clusters, where the people/actors in these denser portions exhibit some kind of similar characteristics, e.g., same workplace, same hometown, same study place, same country or same continent. Hence, a social network is not uniform but is a collection of overlapping clusters (a few clusters can be stand-alone also).
We consider a social graph G = (V, E), where V is the collection of nodes and E is the collection of edges representing associations among nodes. Due to the scale-free persona of social graphs, we can consider that the graph has several overlapping clusters CL1, CL2, CL3, …, CLk such that G = ∪(1 ≤ i ≤ k) CLi. There is a possibility that a few of these clusters may be non-overlapping or completely disjoint. Here, clusters can be said to be a former form of communities, or less-restricted communities. There is a greater possibility that each community that is excavated from a sample of the graph definitely has a parent cluster. Let graph Gs = (Vs, Es) be the sample of graph G and Co1, Co2, Co3, …, Com be the communities detected in Gs, such that Gs = ∪(1 ≤ j ≤ m) Coj. Then, the following predicate is always true:

∀j [Coj → ∃i [CLi is the parent of Coj]], where 1 ≤ j ≤ m and 1 ≤ i ≤ k

Here, we propose an algorithm that focuses on the above-stated fact with the assumption that there is no disjoint cluster in the social graph. The sample reflects an almost exact overlapping clustered view of the original network only if the above predicate holds.

4 Cluster Coverage Sampling (CCS) Methodology

A crawling framework proposed in this paper uses adaptive sampling which aims at the potential limitations or challenges evident in several existing sampling algorithms, which will be discussed along with the explanation of the proposed framework. The crawling framework is briefly shown in Fig. 1. The framework assumes that the social graph is undirected, the crawler is interested in only publicly available information, and the graph is well connected (the graph can be disconnected if it has stand-alone clusters, which will be ignored by the crawler). The social Web is apparently huge and is in the form of overlapping clusters, which is demonstrated by intersecting bubbles in the social Web cloud.
The social Web is clustered, and the clusters overlap by having common actors. The crawler, while digging into the network and collecting the sample, must ensure that it touches every cluster. The crawler starts with a seed node and proceeds further by hopping to its friends (directly connected nodes). In Fig. 1, the social Web is shown as having clusters which overlap. The crawler can start at any cluster to which the seed node belongs. Crawling proceeds further with the following algorithms:

Algorithm 1
Crawler(Sn)
Input: Seed_Node Sn.
Start
  Perform login with credentials of Sn;
  Initialize Sample_Data_Set and Sparse_Data_Set;
  Cn[] ← Sn;
  Crawling(Cn[]);
End

Fig. 1 Crawler framework based on CCS

As shown in Algorithm 1, the crawler enters the network via the seed node Sn. Most of the OSNs require a username and password, provided to the crawler to log in. Two data sets are maintained by the framework. The first, Sparse_Data_Set, contains every node explored by the crawler and the identified links between them. Out of many explored nodes, only a few are selected for further exploration. Connectivity among nodes is known only for those nodes which are visited by the crawler. Thus, Sparse_Data_Set contains many nodes among which the relationship status is not known and which are hence considered unrelated nodes (there may exist connectivity in the real network). Therefore, another data set, Sample_Data_Set, is also maintained that contains only those nodes among which the relationship is known. Cn[] is the list of nodes which will be crawled by the crawler next.

Algorithm 2
Crawling(Cn[])
Input: Nodes_To_Crawl_List Cn[].
Start
  Update(Sample_Data_Set, Cn[], Sparse_Data_Set);
  While (Cn[] != empty)
  {
    Pn ← Select_&_Remove(Cn[]);                 // Pn is the current node being processed.
    E_Nodes ← Find_Friends(Pn);
    Competing_Nodes ← Find_Redundency(E_Nodes, id_Table);
    Update(Sparse_Data_Set, Competing_Nodes);   // nodes explored and their connections are saved.
    Cond_Table ← Fetch_Attr_Values(Pn);
  }
  EXP ← Expression_Generator(Cond_Table);
  Cn[] ← Sample_Nodes(Competing_Nodes, EXP);
  Crawling(Cn[]);                               // Recursive call.
End

Algorithm 2 describes the actual crawling process. The list of sampled nodes Cn[] is fed to the crawling process. Nodes contained in Cn[] are visited by the crawler one by one, so these nodes are stored in Sample_Data_Set. E_Nodes contains the direct friends of the node currently being visited by the crawler. E_Nodes may contain redundancy if sampled nodes have common friends or sampled nodes have friends which have already been sampled. id_Table contains every node sampled by the crawler so far. Such redundancies are eliminated in Competing_Nodes. Sparse_Data_Set is updated with every new node and the connections explored through the current node being processed, Pn.
Cond_Table contains the attributes of a user (node) and their possible values, which are refined along with the crawling. For instance, suppose Facebook is being crawled. Any user in the network has attributes like ‘Lives in’, ‘Work’, ‘Study Places’. The possible values of such attributes have a huge range, as it depends on which cluster of the gigantic social graph the crawler is in. As the crawling process proceeds and the crawler visits nodes, attributes (if any new are found) and their possible values (if any new are found) are updated. The selection expression EXP is generated on the attributes and their values in Cond_Table. Nodes which satisfy EXP are sampled next.
EXP is a Boolean expression. It contains several attributes, each with one possible value, ANDed or ORed, thereby forming a nested Boolean expression. One combination of an attribute and its value is considered as an ‘Atom’ (Atomk can be defined as ATTRk == Valuej(ATTRk); 1 ≤ j ≤ total number of possible values of ATTRk). AND and OR (Between_Atom_Op_List) are binary operators used between atoms, and NOT (On_Atom_Op_List) is a unary operator used on any Atom optionally. The values of the attributes Between_Atom_Operator and On_Atom_Operator are selected randomly. Algorithm 3 is used to generate (the first time) and update EXP.

Algorithm 3
Expression_Generator(Cond_Table)
Start
  k = 0;
  while (k < Num_Of_Attributes(Cond_Table))
  {
    Atomk ← Select_Value(ATTRk);
    On_Atom_Opk ← Select_Op(On_Atom_Op_List);
    Between_Atom_Opk ← Select_Op(Between_Atom_Op_List);
    Update(EXP, On_Atom_Opk, Atomk, Between_Atom_Opk);
    k = k + 1;   // move to the next attribute
  }
  Return EXP;
End

Algorithm 4 represents how the next nodes are selected from Competing_Nodes. The number of nodes to be sampled is limited by a threshold value which is decided by the size of Competing_Nodes. Nodes are selected randomly from Competing_Nodes, but only those nodes which satisfy EXP are sampled. If the algorithm fails to find the desired number of nodes which satisfy EXP, it completes the requirement by randomly selecting the remaining number of nodes from Competing_Nodes.

Algorithm 4
Sample_Nodes(Competing_Nodes, EXP)
Start
  s_count = 0; r_count = 0;   // To count the number of sampled nodes and randomly selected nodes
  while (s_count < Threshold(Competing_Nodes) & r_count < SizeOf(Competing_Nodes))
  {
    r_count++;
    Pn ← RandomSelect_&_Remove(Competing_Nodes);   // Pn is the current node being processed
    If (Pn holds EXP)
    {
      s_count++;
      Add(Node_List, Pn);
    }
  }
  while (s_count < Threshold(Competing_Nodes))
  {
    Add(Node_List, RandomSelect_&_Remove(Competing_Nodes));
    s_count++;
  }
  Append(id_Table, Node_List);
  Return Node_List;
End

Let us assume that the crawler has sampled nodes of cluster 2 of the social graph shown in Fig. 1. Nodes lying in the area of cluster 2 that is not overlapped by any other cluster have similar characteristics and can probably satisfy the current EXP. If EXP is altered at a few Atoms, there is a possibility that nodes lying in the area of cluster 2 that is overlapped by another cluster (1, 3, 4, 7 or 8) will satisfy it. By sampling nodes lying in the overlapping area of clusters, the crawler migrates from one cluster to another cluster. Hence, the sample will have nodes covering almost every cluster of the social network rather than having nodes of just a few adjacent clusters.
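To illustrate the ‘Pn holds EXP’ test used in Algorithm 4, the following Java fragment is a simplified, illustrative sketch (the attribute names and values are hypothetical): it evaluates a left-to-right chain of atoms of the form ATTR == value, each optionally negated and joined to the previous atom by AND or OR.

import java.util.List;
import java.util.Map;

public class ExpEvaluator {

    // One atom: ATTR == value, optionally negated, joined to the previous atom by AND/OR.
    record Atom(String attr, String value, boolean negated, String joinOp) {}

    // Evaluate the atoms left to right against a node's attribute map.
    static boolean holds(Map<String, String> node, List<Atom> exp) {
        boolean result = true;
        boolean first = true;
        for (Atom a : exp) {
            boolean match = a.value().equals(node.get(a.attr()));
            if (a.negated()) match = !match;
            if (first) { result = match; first = false; }
            else if ("AND".equals(a.joinOp())) result = result && match;
            else result = result || match;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical node attributes and expression.
        Map<String, String> node = Map.of("Lives in", "Faridabad", "Work", "YMCA UST");
        List<Atom> exp = List.of(
            new Atom("Lives in", "Faridabad", false, null),
            new Atom("Work", "Delhi", true, "AND"));   // NOT (Work == Delhi)
        System.out.println(holds(node, exp));          // true
    }
}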

5 Experiments and Result Analysis

The model proposed in this paper has been tested on a synthetic social network created on a local server. The synthetic network has characteristics identical to any other social network in terms of the power law of degree distribution, clustering coefficient, etc. The synthetic social network is created using logs of e-mail conversations of students and faculty members at YMCA UST, Faridabad, India. The nodes in the synthetic social network exhibit some inherent attributes such as place of birth, place of work, education, political orientation, and birthday. These attributes of each user and their values are used for creating the Boolean expression EXP. The description of the data set is shown in Table 1.
The attributes of each node in the network are created anonymously. The algorithm is executed on the network in two phases. In the first phase, the algorithm is executed for some time and paused. The first sample is gathered and is called sample 1. Then the algorithm is resumed and another sample is gathered as sample 2. Table 2 shows the statistics of sample 1 and sample 2.

Table 1 Data set statistics
Nodes   1539
Edges   20,234

Table 2 Sample statistics
Sample     Nodes   Edges
Sample 1   257     1332
Sample 2   513     4442

The collected samples are analysed against the degree distribution and compared with the original network to check whether the collected samples represent the original network. Degrees of nodes in the networks are normalized by the median of the degrees; therefore, the complete degree distribution falls in the range of 0–2.

Normalized Degreej = Degreej / M

where M is the median of the degrees in the network and 1 ≤ j ≤ total number of nodes.
The degree distribution is shown in Fig. 2. The x-axis denotes the degree distribution, and the y-axis denotes the fraction of nodes. The graph clearly shows that the original network as well as the two samples follows the power law of degree distribution, which is the first evidence of the correctness of the algorithm.
Figure 3 shows the original network. Figures 4 and 5 represent sample 1 and sample 2, respectively. It is clearly evident that the original network is clustered with overlapping clusters. A similar overlapping clustered view can be seen in sample 1 and sample 2.
In Fig. 6, the Normalized Mean Square Error (NMSE) of the node degree has been taken as a metric to demonstrate the behaviour of the sampling algorithms on a bar graph. For the estimation of the node degree distribution, we first need to calculate the NMSE of node degree k, using the formula given below [13]:

Fig. 2 Degree distribution of samples and original network



Fig. 3 Original social network

Fig. 4 Sample 1

NMSE(k) = √(E[(θ̂k − θk)²]) / θk

where θ̂k is the estimation of θk based on the sampled graph. The NMSE(k) metric is defined in order to show the difference between the degree distribution of the

Fig. 5 Sample 2

Fig. 6 NMSE comparison graph

sampled graphs and the original one. The lower the NMSE value of an algorithm,
the better is the performance of the sampling algorithm in the social network graph.
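For concreteness, a minimal sketch of how this metric could be computed from repeated sampling runs is given below in Java (an illustration of the formula only, not the authors' evaluation code); theta[k] holds the true fraction of nodes of degree k and each row of estimates holds the fractions observed in one sampling run.

public class NmseMetric {
    // NMSE(k) = sqrt(E[(est_k - theta_k)^2]) / theta_k, averaged over sampling runs.
    static double nmse(double[][] estimates, double[] theta, int k) {
        double meanSquaredError = 0;
        for (double[] run : estimates) {
            double diff = run[k] - theta[k];
            meanSquaredError += diff * diff;
        }
        meanSquaredError /= estimates.length;
        return Math.sqrt(meanSquaredError) / theta[k];
    }

    public static void main(String[] args) {
        double[] theta = {0.40, 0.35, 0.25};               // true degree distribution (toy values)
        double[][] estimates = {{0.42, 0.33, 0.25},        // one row per sampling run
                                {0.38, 0.36, 0.26}};
        for (int k = 0; k < theta.length; k++)
            System.out.printf("NMSE(%d) = %.4f%n", k, nmse(estimates, theta, k));
    }
}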
The above analysis clearly shows that the algorithm proposed in this paper tends to migrate from one cluster to another. Therefore, we took two samples at different intervals to make sure the algorithm is leaving one cluster and getting into another. It also implies that the extent to which the clusters are represented is directly proportional to the time for which the algorithm is run if the size of the graph is finite. But evidently, we have gigantic online social networks whose size is near to infinite. In that case, we can set a threshold on the number of nodes crawled. When a sufficient number of nodes and their relationship edges have been gathered, the algorithm can be stopped. The collected sample will have a clustered view similar to the original network, thereby giving the maximum possible representation of the network.

6 Conclusion and Future Scope

This paper presents a crawler that works on a cluster coverage sampling algorithm for the selection of the next node. The algorithm benefits from the clustered structure of most social networks. The algorithm is tested on a synthetic network that has a clustered structure like many social networks and a scale-free distribution of nodes. The results are promising. As a future prospect, the algorithm is to be tested on some online social network like Facebook. Social graph dynamism is also a burning area in the field of social network analysis. The cluster coverage sampling algorithm also shows promising possibilities for its application in tackling social graph dynamism.

References

1. [Link]
2. [Link]
3. Srivastava, A., Anuradha, Gupta, D.J.: Social network analysis: hardly easy. IEEE Int. Conf.
Reliab. Optim. Inf. Technol. 6, 128–135 (2014)
4. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Practical recommendations on crawling
online social networks. IEEE J. Sel. Areas Commun. 29(9), 1872–1892 (2011)
5. Lee, S.H., Kim, P.-J., Jeong, H.: Statistical properties of sampled networks. Phy. Rev. E 73,
016102 (2006)
6. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in facebook: a case study of
unbiased sampling of OSNs. In: Proceedings of IEEE INFOCOM (2010)
7. Ribeiro, B., Towsley, D.: Estimating and sampling graphs with multidimensional random
walks. In: Proceedings of ACM IMC (2010)
8. Cho, M., Lee, J., Lee, K.M.: Reweighted random walks for graph matching. ECCV 2010,
Part V, LNCS 6315, pp. 492–505 (2010)
9. Lee, C.H., Xu, X., Eun, D.Y.: Beyond random walk and metropolis-hastings samplers: why
you should not backtrack for unbiased graph sampling. ACM (2013)
10. Li, R.H., Yu, J.X., Qin, L., Mao, R., Jin, T.: On random walk based graph sampling.
IEEE ICDE Conference (2015)
11. Ribeiro, B., Wang, P., Murai, F., Towsley, D.: Sampling directed graphs with random walks.
In: Ribeiro, B., Wang, P., Murai, F., Towsley, D. (eds.) UMass CMPSCI Technical
Report UMCS (2011)
12. Wilson, C., Boe, B., Sala, A., Puttaswamy, K.P.N., Zhao, B.Y.: User interactions in social
networks and their implications. In: Proceedings of ACM EuroSys (2009)
13. Wang, T., Chen, Y., Zhang, Z., Xu, T., Jin, L., Hui, P., Deng, B., Li, X.: Understanding graph
sampling algorithms for social network analysis. In: 2011 31st International Conference on
Distributed Computing Systems Workshops (ICDCSW), pp. 123, 128, 20–24 June 2011
14. Ahn, Y., Han, S., Kwak, H., Moon, S., Jeong, H.: Analysis of topological characteristics of
huge online social networking services. In: Proceedings of WWW (2007)
15. Corlette, D., Shipman, F.: Capturing on-line social network link dynamics using event-driven
sampling. In: IEEE International Conference on Computational Science and Engineering
(2009)
16. Gjoka, M., Butts, C.T., Kurant, M., Markopoulou, A.: Multigraph sampling of online social
networks. IEEE J. Sel. Areas Commun. 29(9), 1893–1905 (2011)
Efficient Management of Web Data
by Applying Web Mining Pre-processing
Methodologies

Jaswinder Kaur and Kanwal Garg

Abstract Web usage mining is defined as the application of data mining techniques to extract interesting usage patterns from Web data. Web data provides information about Web users’ behavior. Pre-processing of Web data is an essential process in Web usage mining. It is used to convert the raw data into processed data, which is necessary for the Web mining task. In this research paper, the author proposes an effective pre-processing methodology which involves field extraction, significant attribute selection, data selection, and data cleaning. The proposed methodology improves the quality of Web data by managing the missing values, noise, inconsistency, and incompleteness which are usually found attached to the data. Moreover, the results obtained from pre-processing will be further used in frequent pattern discovery.

Keywords Web log file · Web server · Web usage mining · Pre-processing · Data cleaning

1 Introduction

Data mining is a process of discovering interesting and useful patterns from raw data; it is also known as knowledge discovery in databases. This technology analyzes the data and helps to extract useful information. It is used in various areas such as the retail industry, intrusion detection, and financial data analysis. Generally, Web mining is applied to the World Wide Web, which continues to grow both in the huge volume of traffic and in the size and complexity of Web sites, where it is difficult to identify the relevant information present on the Web [1]. In the proposed model, the researcher

J. Kaur (✉) · K. Garg


Department of Computer Science & Applications, Kurukshetra University,
Kurukshetra, Haryana, India
e-mail: kaurjazz84@[Link]
K. Garg
e-mail: gargkanwal@[Link]


used Microsoft Excel and SQL Developer software to perform the pre-processing. Through this process, raw Web log data is transformed into processed data. The main objective of this model is to remove irrelevant data and improve data quality.
The author has organized this paper in the following sections: In Sect. 2, the researcher reviews the pre-processing-related work. Section 3 gives a description of the proposed model of the pre-processing technique. Section 4 shows experimental results of the effective proposed system. Section 5 concludes the discussion.

2 Literature Review

Web usage mining provides useful information in abundance, which makes this area interesting for research organizations and researchers in academics. Rathod [2] focuses on the pre-processing tasks of data extraction and data filtering, which were performed on a combined log file; in this, one of the popular platforms, [Link] 2010, is used to implement the above tasks. Priyanka and Ujwala [3] proposed two algorithms for field extraction and data cleansing; the latter is used to remove errors and inconsistencies and improves the quality of the data. Ramya et al. [4] explored the concept of merging log files from different servers for use in the data pre-processing process, after which data cleaning removes requests concerning non-analyzed resources. Prince Mary et al. [5] clarified that people are more interested in analyzing log files, which can offer more useful insight into Web site usage; thus the data cleaning process is used in data analysis and mining. Data cleaning involved steps such as robot cleaning, elimination of failed status codes, and elimination of graphics and video records. Chintan and Kirit [6] proposed a novel technique that was more effective as compared to existing techniques; data extraction, data storage, and data cleaning techniques were included in data pre-processing to convert the Web log file into a database. During data cleaning, only noise is removed from the log file. Ramesh et al. [7] stated that data pre-processing is a significant process of Terror Tracking using Web Usage Mining (TTUM), which was effectively done by field extraction, data cleaning, and data storage tasks. Pooja et al. [8] explored the records which are not suitable for identifying users’ navigational behavior, such as accesses from Web robots, e.g., Googlebot, accesses for images, and accesses for audio files. These records are removed during the data cleaning module.

3 Data Pre-processing on Web Log File

Web usage mining is a significant part of Web mining. It is used to extract user’s
access patterns from Web data over WWW in order to understand the user’s
browsing behavior, and it helps in building of Web-based applications. Web usage
mining depends on different source of data such as Web Server Side, Proxy Side,
Efficient Management of Web Data by Applying Web Mining … 117

and Client Side. Generally, raw Web log contains irrelevant, noisy, incomplete,
duplicate, and missing data. Data pre-processing is the most important phase of
Web usage mining in order to clean Web log data for discovering patterns.
The log files are stored on the Web server and may suffer from various data anomalies: the data may be irrelevant, inconsistent, and noisy in nature. In the present research work, the researcher removes the said anomalies through the data pre-processing steps shown in Fig. 1.

3.1 Field Extraction

The server’s log entry contains different fields like IP address, date and time,
request and user agent that should be separated out before applying data cleaning
process. Field extraction is the process to separate the fields from single entry of the
log file.

Fig. 1 Proposed system for pre-processing



The researcher performed the field extraction process on the log file using Microsoft Excel, as explained below.
Open Microsoft Excel and, using the Open option, select the log file; use characters such as space or comma as the delimiter to separate each field. The separated attributes are stored under different headings according to the input Web log file, such as IP address, username, date, request, status code, and referrer. The file can now be saved with the .xls extension, which concludes the field extraction process.
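For readers who prefer to script this step rather than use Excel, a minimal Python sketch is given below; it assumes a space-delimited Apache combined log, and the field names under which the attributes are stored are hypothetical.

    import csv
    import shlex

    # Hypothetical sketch: split each space-delimited Apache log entry into fields
    # and save them under separate headings, mirroring the Excel-based extraction.
    FIELDS = ["ip", "identity", "username", "datetime", "request",
              "status", "bytes", "referrer", "user_agent"]

    def extract_fields(log_path, out_path):
        with open(log_path) as log, open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(FIELDS)
            for line in log:
                # shlex keeps quoted values (request, referrer, user agent) together;
                # the brackets around the timestamp are turned into quotes first.
                parts = shlex.split(line.replace("[", '"').replace("]", '"'))
                if len(parts) >= len(FIELDS):
                    writer.writerow(parts[:len(FIELDS)])

    extract_fields("access.log", "access_fields.csv")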

3.2 Significant Attributes Selection

The researcher analyzed the Apache Web server log file and identified the important attributes of Web log files, which are IP address, date, time, request line or URI
stem, status code, referrer, and user agent [9]. These attributes provide valuable
information about visitor’s behavior, traffic pattern, navigation paths, Web browser,
errors, etc.
Based on this analysis, only the significant attributes identified above are selected for further processing.

3.3 Data Selection

When a Web server receives an HTTP request, it returns an HTTP response code to the client. These HTTP status codes are three-digit numbers such as 1xx, 2xx, and 3xx. Records with a status code below 200 or above 299 are not used for analysis.
Filtering of records is done in the Microsoft Excel sheet obtained after significant attribute selection, by keeping records with status code >= 200 and status code <= 299.
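The same selection rule can also be expressed as a small script; the sketch below is illustrative only and assumes the CSV file and column names produced by the hypothetical field extraction script above.

    import csv

    # Keep only successful responses (status code between 200 and 299), as in the
    # data selection step; all other records are dropped from further analysis.
    def select_successful(in_path, out_path):
        with open(in_path) as src, open(out_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                if row["status"].isdigit() and 200 <= int(row["status"]) <= 299:
                    writer.writerow(row)

    select_successful("access_fields.csv", "access_selected.csv")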

3.4 Data Cleaning

Data cleaning is the fourth step of pre-processing to remove irrelevant records.


The following three kinds of records are required to be removed:
a. Records having duplication.
b. Records with missing values.
c. Accesses for images, audio files, video files, and CSS files; these are identified by checking extensions in the request field such as .gif, .mp3, and .avi. In addition, records with irrelevant information created by spiders, crawlers, and Web robots like Googlebot can be found in the request or user agent field.

3.4.1 Remove Duplicates

In the Microsoft Excel file, select the range of cells, then select the Data tab and use 'Remove Duplicates' in the Data Tools group to remove the duplicate records.

3.4.2 Missing Value

Find the blank cells with the help of the Find and Replace dialog box and remove these tuples.

3.4.3 Noise

After removing missing values, the obtained Microsoft Excel file is imported into Oracle SQL Developer to remove noise from the log file.
Proposed procedure:
Step 1: Create new database connection in Oracle SQL Developer.
Step 2: Under this new database connection, using SQL command create a database
table.
Step 3: Import obtained Microsoft Excel file after removing missing value into this
newly created database table.
Step 4: Using SQL command, remove noise present in the request field and user
agent field such as images, audio files, video files, page style files, and robots.
Completion of this will create a database table having processed data.
Step 5: Save and close the SQL Developer.
After performing all the above steps, processed data will be used for further
analysis and interpretation of frequent pattern.
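The exact SQL in Step 4 depends on the table layout chosen in SQL Developer; purely as an illustration, the sketch below shows the kind of DELETE statements intended, using a hypothetical web_log table with request and user_agent columns (sqlite3 is used here only for demonstration; the paper itself uses Oracle SQL Developer).

    import sqlite3

    # Illustrative only: remove image/audio/video/style requests and robot traffic
    # from a hypothetical web_log table, mirroring Step 4 of the procedure.
    NOISE = ["%.gif%", "%.jpg%", "%.mp3%", "%.avi%", "%.mp4%", "%.rm%", "%.css%"]
    ROBOTS = ["%Googlebot%", "%spider%", "%crawler%", "%robot%"]

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE web_log (ip TEXT, request TEXT, user_agent TEXT)")
    con.executemany("INSERT INTO web_log VALUES (?, ?, ?)", [
        ("10.0.0.1", "GET /index.html HTTP/1.1", "Mozilla/5.0"),
        ("10.0.0.2", "GET /icons/logo.gif HTTP/1.1", "Mozilla/5.0"),
        ("10.0.0.3", "GET /robots.txt HTTP/1.1", "Googlebot/2.1"),
    ])
    for pattern in NOISE:
        con.execute("DELETE FROM web_log WHERE request LIKE ?", (pattern,))
    for pattern in ROBOTS:
        con.execute("DELETE FROM web_log WHERE request LIKE ? OR user_agent LIKE ?",
                    (pattern, pattern))
    con.commit()
    print(con.execute("SELECT COUNT(*) FROM web_log").fetchone()[0])  # 1 row remains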

4 Results and Interpretation

In this section, the effectiveness and efficiency of the proposed model are demonstrated through several experiments on a Web log file. An Apache Web log file covering April 08, 2012, to July 08, 2012, with an initial file size of 958 kB, is used.
The experiments were performed on a 2.40 GHz Intel Core processor with 4.00 GB RAM, 64-bit Windows 7 Home Premium, SQL Developer 4.1.0, and Java SE Development Kit (JDK) 8. Oracle SQL Developer 4.1.0 is a free integrated development environment (IDE) for Oracle databases. SQL Developer provides a GUI that helps DBAs and database users save time while performing database tasks. Oracle Database 10g, 11g, and 12c are supported by SQL Developer, which requires a Java-supported platform (OS) for operation.

Figure 2 shows the Web log file imported into Microsoft Excel. Each field is separated using the space character as the delimiter. As explained, the different attributes are stored under separate headings. The initial Web log file contained 6000 entries.
According to the analysis, only six attributes out of nine are significant, namely IP address, date and time, request, status code, referrer, and user agent. These six attributes help in identifying the user's navigational behavior and are of utmost importance for Web usage mining. Outcomes from Web usage mining are used to enhance the characteristics of Web-based applications.
After performing the data selection step, only 3674 rows were left.
The first step of data cleaning removes duplicate records from the log file; thus, 11 duplicate values were identified and 3663 unique values remain. In the second step, records with missing values were deleted, as these do not provide any useful information for Web mining. In total, 23 cells containing missing values were found in this log file and deleted, leaving 3640 tuples on completion of this step.
In addition to the above, the researcher found noise in the log file such as .gif files, robots, and Web spiders, an example of which is 'GET/icons/[Link].'
In Fig. 3, the Microsoft Excel file is imported into SQL Developer for further processing; an SQL query is executed to remove noise such as .gif, .rm, and .mp4 files, robots, and spiders from the Web log file. The result obtained contains only 2458 rows.
Finally, the pre-processed Web log file is obtained. This processed file will be used in the analysis and interpretation of frequent patterns of the log file.
Table 1 clearly indicates that applying pre-processing to the Web log file helps in removing irrelevant data and improves data quality.

Fig. 2 Imported Web log file in Microsoft Excel



Fig. 3 Processed database table after pre-processing

Table 1 Resultant data set size after data pre-processing


                       Raw Web log file   Significant attribute selection   Data selection   Data cleaning   Result
    No. of tuples      6000               NA                                3674             2458            2458
    No. of attributes  09                 06                                NA               NA              06

5 Conclusion

In the present research work, the researcher has taken a raw Web log file for empirical analysis and concludes that data pre-processing is an essential step to remove the anomalies that lie in the data. The researcher has proposed a data pre-processing methodology which improves the quality of data. After applying the said methodology, it is shown that only six attributes out of the nine available are significant and, similarly, that out of 6000 original instances only 2458 are meaningful. The resultant itemset will be useful for the discovery of frequent patterns.

References

1. Jayalatchumy, D., Thambidurai, P.: Web mining research issue and future directions—a
survey. IOSR J. Comput. Eng. (IOSR-JCE) 14, 20–27 (2013)
2. Rathod, D.B.: Customizable web log mining from web server log. Int. J. Eng. Dev. Res.
(IJEDR), 96–100 (2011)
3. Priyanka, P., Ujwala, P.: Preprocessing of web server log file for web mining. World J. Sci.
Technol. 2, 14–18 (2012)
4. Ramya, C., Shreedhara, K.S., Kavitha, G.: Preprocessing: a prerequisite for discovering
patterns in web usage mining process. Int. J. Inf. Electron. Eng. 3, 196–199 (2013)

5. Prince Mary, S., Baburaj, E.: An efficient approach to perform pre-processing. Indian J. Comput. Sci. Eng. (IJCSE) 4, 404–410 (2013)
6. Chintan, H.M., Kirit, R.R.: An efficient technique for web log preprocessing using microsoft
excel. Int. J. Comput. Appl. 90, 25–28 (2014)
7. Ramesh, Y., Mayuri, D., Tejali, N., Trupti, K.: Unauthorized terror attack tracking using web
usage mining. Int. J. Comput. Sci. Inf. Technol. (IJCSIT) 5, 1210–1212 (2014)
8. Pooja, K., Jyotsna, N.: Data preprocessing: a milestone of web usage mining. Int. J. Eng. Sci.
Innovative Technol. (IJESIT) 4 (2015)
9. Kaur, J., Garg, K.: Analyzing the different attributes of web log files to have an effective web
mining. Int. J. Adv. Sci. Tech. Res. 3, 127–134 (2015)
A Soft Computing Approach to Identify
Multiple Paths in a Network

Shalini Aggarwal, Pardeep Kumar and Shuchita Upadhyaya

Abstract This paper presents a modified technique to generate the probability table
(routing table) for the selection of path for an ant in AntNet algorithm. This paper
also uses the concept of probe ant along with clone ant. The probe ant identifies the
multiple paths which can be stored at destination. Overall purpose of this paper is to
get an insight into the ant-based algorithms and identifying multiple optimal paths.

Keywords AntNet · ACO · ABC · Probe ant · Clone ant · Link goodness

1 Introduction

Routing is an important aspect of computer networks as it controls the traffic between one node and another depending on various parameters. Routing is the basic process for determining the performance of a network in terms of quality and quantity. For performing routing, routers are required. The basic purpose of routers is to update the routing table via an algorithm. Most traditional routing algorithms transmit the data by minimizing a cost function which may be distance, delay, jitter, etc. They find the shortest path, which may become congested over time and/or may become expensive later on. Sometimes, a path which is better on the basis of more than one metric, such as distance, delay, bandwidth, and jitter, is required, and sometimes there is a trade-off between those metrics. Therefore, it is necessary to obtain an optimal solution based on more than one metric.

S. Aggarwal (&)  P. Kumar  S. Upadhyaya


Department of Computer Science and Applications, Kurukshetra University,
Kurukshetra, India
e-mail: shaliniagg07@[Link]
P. Kumar
e-mail: mittalkuk@[Link]
S. Upadhyaya
e-mail: shuchita_bhasin@[Link]


For obtaining such an optimal solution, a number of studies are currently being carried out to identify routing alternatives using heuristic techniques. One such technique is Ant Colony Optimization (ACO). In ant-based routing, artificial ants, while moving from source to destination, deposit some artificial pheromone in terms of a numeric value and store some information in the form of a routing table at the intermediate nodes. This information is used by newly generated ants for accepting or rejecting a path. Another important aspect of routing is the increasing need for multi-path routing. The main reason for this need is data transmission over the Internet. When video or audio streaming is required, very high bandwidth is needed; with single-path routing this might not be possible, hence the need for multi-path routing arises. When multi-path routing is combined with multi-metric routing, it becomes a perfect candidate for the use of heuristic techniques. Therefore, ACO can prove to be a very effective technique for this type of routing.

2 Related Work

According to [1], an ant not only finds the shortest path for searching its food
source but also conveys this shortest path to other ants. In the ACO, this intelli-
gence of ants is intermixed with various optimization techniques to identify opti-
mum routes in a computer network. In [1], various mechanisms to solve the problem of identifying the optimum path using ACO are explained and compared with different traditional routing algorithms.
The review covers four different groups of approaches for applying ACO in routing: (i) Ant-Based Control (ABC) Systems, (ii) AntNet Systems, (iii) Ant Systems with other variations, and (iv) Multiple Ant Colony Systems.
Schoonderwoerd et al. [2, 3] applied ACO to routing in telecommunication
networks based on circuit-switching. The algorithm was termed as Ant-Based
Control (ABC). In the ABC algorithm, all the nodes in a network follow various
features [2] such as capacity, probability of being a destination, pheromone table,
and routing table, on the basis of which criteria for choosing next node is decided.
But the main problem of Schoonderwoerd et al. approach is that it can only be
applied when the network is symmetric in nature.
For routing in packet-switched networks, Caro and Dorigo [4–6] designed AntNet routing. Although it is inspired by the ACO meta-heuristic, it incorporates additional changes as required for a network.
In [7], a new version of AntNet was generated and was named as AntNet-FA or
AntNet-CO. In this version, backward ants performed a number of tasks such as,
(a) estimating the trip time using various metrics, (b) updating local traffic statistics,
and (c) determining and depositing the pheromone for estimating the probability of
reinforcement. As backward ants are using real-time statistics for determining the
amount of reinforcement, the information for routing was found to be more correct

and up-to-date. The results of this version are found to be better than
AntNet algorithm which are proved by experiment performed in this paper.
Oida and Sekido [8] proposed the Agent-based Routing System (ARS), in which, to support various types of bandwidth requirements, the forward ants move through the network in a bandwidth-constrained manner. The probability of selecting an outgoing link depends on the routing table as well as on bandwidth constraints.
Although adaptation has proved to be one of the better techniques for identifying optimum paths, one of the major problems associated with AntNet is stagnation [9]. Due to this problem, a local optimum solution might be obtained and the diversity of the population might be lost. In [9], the concept of multiple ant colonies was applied to packet-switched networks. Upon comparison with the AntNet algorithm with evaporation, it was found that throughput can be increased by using multiple ant colonies, although no improvement was found in delay. The basic problem, however, is the need for large resources for multiple ant colonies.

3 AntNet Routing Algorithm

In an AntNet algorithm [1], an ant explores the path and updates the routing and
probability tables, so that other ants can use the tables to know which path is better
than others. Some statistical traffic model is also used to help the ants to identify the
better path.
A routing table is maintained which is a local database. The routing table
contains information about all possible destinations along with probabilities to
reach these destinations via each of the neighbours of the node.
Another data structure that each node carries is termed as local traffic statistics.
This structure follows the traffic fluctuations as viewed by the local node.
The AntNet algorithm as proposed by Di Caro and Dorigo is dependent on two
types of ants named as forward ant and backward ant. The forward ant collects the
information regarding the network, while the backward ant uses this information to
update the routing tables on their path.
Working of the algorithm is given as follows:
(i) Initially to generate a routing table, a forward ant is initiated from every node
towards the destination node after a fixed time interval to find low-cost path
to that node and load status of the network is also explored, and accordingly,
priorities of the paths are set. These priorities are used by the forward ants to
transmit the data.
(ii) The forward ants store the information about their paths in a stack.
(iii) At each node, a decision is made to select a node for moving towards the destination with the help of probabilities based on pheromone values. Only unvisited nodes are considered for selection, or all the neighbours in case all of them have been previously visited.
(iv) When the forward ant moves towards the destination, if at any time a cycle is detected, all the nodes on that cycle's path are popped from the stack and all information about them is deleted.
(v) When the forward ant reaches its destination, it generates another ant named the backward ant, transfers all of its memory to it, and dies.
(vi) The backward ant, as its name indicates, travels in the opposite direction to that of the forward ant. It uses the stack formed by the forward ant and pops the elements of the stack to reach the source node. The backward ant uses high-priority queues so that the information can be quickly transmitted to the source node. The information collected by the forward ant is stored in the routing tables by the backward ant.
(vii) The backward ant basically updates two data structures, i.e. the routing table and the local traffic model, for all the nodes in its path, for all entries starting from the destination node.
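As an illustration of step (iii) above, the following sketch (hypothetical data structures, not code from the original AntNet work) selects the next hop for a forward ant by roulette-wheel sampling over the probabilities stored in the node's routing table, restricted to unvisited neighbours.

    import random

    # Hypothetical sketch of AntNet step (iii): pick the next hop for a forward ant
    # using the probabilities in the node's routing table, restricted to unvisited
    # neighbours (all neighbours are used if every one has already been visited).
    def select_next_hop(routing_table, destination, neighbours, visited):
        candidates = [n for n in neighbours if n not in visited] or list(neighbours)
        weights = [routing_table[destination][n] for n in candidates]
        threshold = random.uniform(0, sum(weights))
        running = 0.0
        for node, w in zip(candidates, weights):
            running += w
            if running >= threshold:
                return node
        return candidates[-1]

    table = {"D": {"B": 0.6, "C": 0.4}}   # probabilities towards destination D
    print(select_next_hop(table, "D", ["B", "C"], visited={"A"}))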

Limitations of AntNet Routing


AntNet has performed very well compared to traditional routing algorithms, as is visible from various experiments performed by different researchers. Still, there are problems associated with the algorithm. The algorithm uses probabilities to identify the path to the destination. As more and more pheromone is deposited on that path, most or all of the traffic will travel through the same path, making it congested. Also, as the probability of that path is very high, other paths may never be explored, and hence the diversity of solutions may be lost. This is basically the problem of stagnation [1]. Although some work has been done on stagnation, a lot of effort is still required to make this algorithm better.

4 Proposed AntNet Modifications

In this work, a new type of forward ant named as probe ant [10] along with its clone
is proposed. The probe ants are generated to replace the forward ants from source
node to explore the path. The probe and clone probe ants are generated depending
on the paths selected at a particular node. Only a little additional overhead is
required for generating clone ants. The multiple paths are selected according to the
probabilities in the pheromone table and a threshold value of bandwidth. These
probe and clone ants [11] will reach the destination according to the proposed strategy. The advantage is that, instead of one optimal path, more than one optimal path is identified. The paths are identified with the help of only a single ant, the other ants being clone ants. The major advantage of this technique is the saving of the overhead incurred by the large number of forward ants in the AntNet algorithm.

The proposed modifications are briefly explained below:


The structure of probe ant is defined as follows:

S D I Aid PT MB TD HC
Structure of Probe Ant

Here,
S: Source node
D: Destination node
I: Intermediate node from which the ant has arrived most recently
Aid: Ant Identification Number
PT: Path Traversed so far
MB: Minimum Bandwidth of the path traversed
TD: Total Delay of the path
HC: Hop Count
In this work, a threshold bandwidth is considered. Any link having bandwidth
less than that threshold value is discarded. The remaining links are considered for
the purpose of identifying the multiple paths. The process works in a similar
manner to AntNet algorithm, except that queue length is taken into account for
selecting a path in case of AntNet algorithm, while in the proposed work, two
metrics, i.e. delay and bandwidth, are considered for selecting a path. Another
important difference in this case is that instead of selecting a single path, multiple
(quite similar) paths are being selected for data transmission. Various paths are
identified at the destination and stored in an array and from these paths some
optimal paths are chosen using some intelligent approach. For selecting the paths
based on two metrics, a new metric named as link goodness is proposed and is
denoted by GL. The value of GL can be calculated with the help of following
formula:
G_L = \frac{\alpha}{D_L + 1} + (1 - \alpha) B_L

where D_L is the delay of a link, B_L is the bandwidth of a link, and α is a constant having a value between 0 and 1, i.e. 0 < α < 1.
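A direct transcription of the link-goodness formula is sketched below; the function and parameter names are illustrative.

    def link_goodness(delay, bandwidth, alpha=0.5):
        # G_L = alpha / (D_L + 1) + (1 - alpha) * B_L, with 0 < alpha < 1:
        # smaller delay and larger bandwidth both increase the goodness of a link.
        return alpha / (delay + 1.0) + (1.0 - alpha) * bandwidth

    # Example: a link with delay 4 units and normalized bandwidth 0.8
    print(link_goodness(delay=4, bandwidth=0.8, alpha=0.6))

In practice, B_L would presumably be normalized so that the delay and bandwidth terms are on comparable scales.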
At each node, two tables named as pheromone table and routing table will be
maintained. The pheromone table will be initialized with equal pheromone value
distributed to each link from a node except the discarded link. The second table is
the routing table consisting of probabilities for selecting the next node in the path.
The structure of both the tables is as shown below:

Pheromone table (rows: neighbouring nodes; columns: destination nodes)

    τ11  τ12  …  τ1n
    τ21  τ22  …  τ2n
    …    …    …  …
    τl1  τl2  …  τln

In the above table, τnd is the pheromone value of the link from node n when the destination node is d. The τnd values lie in the interval [0, 1] and sum up to 1 (as the probabilities cannot exceed 1) along each destination column:

\sum_{n \in N_k} \tau_{nd} = 1, \quad d \in [1, N], \quad N_k = \{\text{neighbours}(k)\}

The routing table can be represented as follows:

Routing table (rows: neighbour nodes 1, …, l; columns: destination nodes 1, …, n)

    p11   -     …   p1n
    p21   p22   …   -
    …     …     …   …
    pl1   -     …   pln

In the above table, blank cells (shown as -) indicate that the particular neighbour node is discarded for that destination.
Psd = probability of creating a probe ant at node s with node d as the destination.
At each node i, a probe ant computes the probability P_ijd of selecting neighbour node j for moving towards its destination d by adding τ_ijd * G_ij to the pheromone value τ_ijd and then subtracting this quantity, divided equally, from each of the remaining valid neighbours. This process must be followed for each of the neighbours in parallel.
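A minimal sketch of this update rule is shown below (hypothetical data structures, assuming the pheromone values for destination d already sum to one): the reinforcement τ_ijd * G_ij is added to neighbour j and subtracted in equal shares from the remaining valid neighbours.

    # Hypothetical sketch of the proposed probability update at node i for
    # destination d: the column of probabilities still sums to one afterwards.
    def update_probabilities(tau, goodness, j, valid_neighbours):
        reinforcement = tau[j] * goodness[j]
        others = [k for k in valid_neighbours if k != j]
        prob = dict(tau)
        prob[j] = tau[j] + reinforcement
        for k in others:
            prob[k] = tau[k] - reinforcement / len(others)
        return prob

    tau = {"B": 0.5, "C": 0.3, "D": 0.2}        # pheromone values for destination d
    goodness = {"B": 0.4, "C": 0.6, "D": 0.5}   # link goodness G_L per neighbour
    print(update_probabilities(tau, goodness, "B", ["B", "C", "D"]))

A practical implementation would presumably also clamp the updated values to keep them within [0, 1].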
Using these probabilities, multiple paths from source to destination can be
identified, and the information about paths will be stored in probe ants. At desti-
nation, paths can be stored in an array of paths. From this array, some of the better
paths on the basis of various metrics can be found using some intelligent approach.

5 Conclusion

In this paper, a modified approach for generating the probabilities for routing table
has been proposed. In this approach, the probabilities have been calculated on the basis of two metrics instead of a single metric. Another important feature that has
been used in this paper is the use of a threshold value of bandwidth. Only one ant
will be generated at the source, while other ants will be the clone ants with a little
additional overhead. With the help of this technique, multiple paths can be gen-
erated which can be stored at destination via probe ants. These paths can be further
refined using some intelligent approach, and multiple paths for a better transmission
of data packets can be identified.

References

1. Sim, K.M., Sun, W.H.: Ant colony optimization for routing and load-balancing: survey and
new directions. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 33(5) (2003)
2. Schoonderwoerd, R., Holland, O., Bruten, J., Rothkrantz, L.: Ants for Load Balancing in
Telecommunication Networks. Hewlett Packard Lab., Bristol, U.K., Tech., pp. 96–35 (1996)
3. Schoonderwoerd, R., Holland, O., Bruten, J., Rothkrantz, L.: Ant-based load balancing in
telecommunications networks. Adapt. Behav. 5(2) (1996)
4. Caro, G.D., Dorigo, M.: AntNet: distributed stigmergetic control for communications
networks. J. Artif. Intell. Res. 9, 317–365 (1998)
5. Caro, G.D., Dorigo, M.: Ant colonies for adaptive routing in packet-switched communica-
tions networks. In: Proceedings of 5th International Conference on Parallel Problem Solving
from Nature, Amsterdam, The Netherlands, 27–30 Sept 1998
6. Caro, G.D., Dorigo, M.: An adaptive multi-agent routing algorithm inspired by ants behavior.
In: Proceedings of 5th Annual Australasian Conference on Parallel Real-Time Systems,
pp. 261–272 (1998)
7. Caro, G.D., Dorigo, M.: Two ant colony algorithms for best-effort routing in datagram
networks. In: Proceedings of 10th IASTED International Conference on Parallel Distributed
Computing Systems, pp. 541–546 (1998)
8. Oida, K., Sekido, M.: An agent-based routing system for QoS guarantees. In: IEEE
International Conference on Systems, Man, and Cybernetics, pp. 833–838 (1999)
9. Tekiner, F., Ghassemlooy, Z., Al-khayatt, S.: Investigation of Antnet Routing Algorithm by
Employing Multiple Ant Colonies for Packet Switched Networks to Overcome the Stagnation
Problem
10. Devi, G., Upadhyaya S.: Path identification in multipath routing. JGRCS J. Glob. Res.
Comput. Sci. 2(9) (2011)
11. Richa, S., Upadhyaya, S.: Identifying multiple optimal paths in Antnet routing algorithm with
negligible overhead. Int. J. Comput. Sci. Netw. Secur. 9(2), 314–320 (2009)
An Efficient Focused Web Crawling
Approach

Kompal Aggarwal

Abstract The amount of data and its dynamicity makes it very difficult to crawl the
World Wide Web (WWW) completely. It is a challenge in front of researchers to
crawl only the relevant pages from this huge Web. Thus, a focused crawler resolves
this issue of relevancy to a certain level, by focusing on Web pages for some given
topic or a set of topics. This paper deals with survey of various focused crawling
techniques which are based on different parameters to find the advantages and
drawbacks for relevance prediction of URLs. This paper formulates the problem
after analysing the existing work on focused crawlers and proposes a solution to
improve the existing focused crawler.

Keywords Focused crawler · Page relevance · Web crawler

1 Introduction

The World Wide Web (WWW) contains a large amount of information, and new information is added every second, such that the Web size is of the order of tens of billions of pages. The Web crawler is the most important component of a search engine. It continuously downloads pages, and these pages are indexed and stored in a database. So, it becomes very difficult for a crawler to crawl the entire Web and keep its index fresh. Because of the limitations of various computing resources and time constraints, focused crawlers were developed. A focused crawler is a Web crawler that downloads only those Web pages which are considered to be relevant to a specific topic or set of topics.
A focused crawler can be implemented in various ways [1]. Some of the
approaches are shown below.

K. Aggarwal (&)
Department of Computer Science, Government College, Chhachhrauli, India
e-mail: kompalagg@[Link]


Priority-Based Focused Crawler In a priority-based focused crawler, a priority queue is used for storing retrieved pages instead of a normal queue. The priority is assigned to each page based on a function which uses various factors to score a page. Thus, in every iteration, a more relevant page is returned.
Structure-Based Focused Crawler In structure-based focused crawlers, to evaluate page relevance, the Web page structure has to be taken into account. The strategy used is a predefined formula for computing the page relevance score. After that, the link relevance score can be predicted, and then, from the queue of URLs to be crawled, the authority score is determined. The priority is determined from the relevance score and the authority score.
Context-Based Focused Crawler To avoid filtering out unwanted data,
context-based focused crawler tries to understand the context of the user’s needs by
interacting with user and comprehending the user profile. The crawler then gets
adapted to such contexts and uses them for future search requests.
Learning-Based Focused Crawler The learning-based focused crawler is a newer approach based on learning, used to improve relevance prediction in focused Web crawlers. First, a training set is built; its purpose is to train the system, and it contains four relevance attribute values: URL, parent page, anchor text, and surrounding text relevancy. Secondly, a Naive Bayes (NB) classifier is trained using the training set. The trained classifier is then used for predicting the relevancy of URLs which are still unvisited [2, 3].
The architecture of the focused Web crawler is shown in Fig. 1. The URL queue is initialized with a list of unvisited URLs maintained by the crawler and starts off with the seed URLs. The Web page downloader module extracts URLs from the URL queue and retrieves the corresponding pages from the Internet [4]. The parser module retrieves information such as the text, terms, and the outlink URLs from the page being downloaded.

Fig. 1 Architecture of the focused Web crawler (seed URLs, URL queue, Web page downloader, parser and extractor, relevance calculator with topic-specific weight table, topic filter, relevant page DB, and irrelevant table)
The relevance calculator module calculates relevance of the parsed page with
respect to topic, and relevance score is assigned to URLs. Topic filter module
determines whether the page content which is being parsed is related to topic
specified or not. If the parsed page is found to be relevant, the URLs retrieved from
it will then be updated in the queue which is used for storing URLs; otherwise, page
will be discarded and added to the irrelevant matrix.

2 Related Work

De Bra et al. [5] proposed "fish search" for predicting page relevance, in which the system takes as input a set of seed pages and a search query, and considers only those Web pages whose content matches the given search query, together with their neighbourhoods (pages pointed to by these matched pages). The fish-search algorithm treats the Web as a directed graph, where a Web page is a node and a hyperlink is an edge, so the way of searching is similar to a search over a directed graph. For each node, it must be judged whether the node is relevant or irrelevant (1 represents relevant, and 0 represents irrelevant). In the fish-search algorithm, a list is maintained for storing the URLs of pages to be searched. If the fish finds a relevant page based on the query, it continues to look for more links from the same page. If the page is not found to be relevant, then its children receive a low score. The various URLs have different priorities; a higher-priority URL is stored at the front of the list and is searched earlier than the others. The algorithm assigns relevance scores in a discrete manner (1 stands for relevant; 0 or 0.5 stands for irrelevant). The drawback of the fish-search algorithm is that the differentiation in priority among the pages in the list is very low.
Michael et al. [6] proposed shark search, which is a modification of fish search. In this algorithm, a child derives a discounted score from its parent.
inherited discounted score is combined with a value which is based on anchor text
that exists around the link in the Web page. One improvement is that it replaces
binary with fuzzy score. Binary value of document relevance can be relevant or
irrelevant. “Fuzzy” score is a score lying between 0 (no similarity) and 1 (perfect
“conceptual” match). It uses a vector space model [7] for calculating the relevance
score in which the similarity between a search string and a document is a function
of the distance existing among the vectors. The cosine similarity measure is commonly used for this purpose. However, this algorithm neglects link-structure information.
Batra [1] proposes a method based on link scoring in which the crawler also checks and crawls, up to a certain threshold, the unvisited links contained in irrelevant pages. Various approaches based on link analysis have been suggested by many researchers; for example, an effective focused crawling method based on both content and link structure has been proposed. This crawling method is based on URL scores, link scores, and anchor scores. The crawler calculates the link scores using the following equation and decides whether or not to fetch the pages specified by the links. The link score can be calculated as [1, 8]:

LinkScore(i) = URLScore(i) + AnchorScore(i) + LinksFromRelevantPageDB(i) + [Relevance(p1) + Relevance(p2) + … + Relevance(pn)]

LinkScore(i) represents the score of link i, URLScore(i) is the relevance between the HREF information of i and the topic keywords, and AnchorScore(i) is the relevance between the anchor text of i and the topic keywords. LinksFromRelevantPageDB(i) is the number of links from relevant crawled pages to i, and p_i is the ith parent page of the URL; a parent page is a page from which the link was extracted. Thus the crawler spends some time crawling relevant pages and also crawls relevant pages that are children of an irrelevant page.
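For illustration, the link score above can be computed directly once its components are available; the sketch below uses hypothetical values.

    # Illustrative computation of the link score defined above; the component
    # scores are assumed to have been computed elsewhere by the crawler.
    def link_score(url_score, anchor_score, links_from_relevant_db, parent_relevances):
        return (url_score + anchor_score + links_from_relevant_db
                + sum(parent_relevances))

    # Example: moderate URL/anchor relevance, 3 links from relevant crawled pages,
    # and two parent pages with relevance 0.7 and 0.4.
    print(link_score(0.6, 0.5, 3, [0.7, 0.4]))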
In a crawling process, the effectiveness of the focused crawler [8] can be
determined from the speed of the crawling process and not only from the maximum
amount of fetched relevant pages. The speed of a crawler is determined from the
number of inserted URLs which are considered to be relevant.
Therefore, a URL optimization mechanism is used by the crawler to select relevant links, which helps remove links whose link scores lie below a predefined threshold. For URL optimization, two methods are used for evaluating the link scores of relevant pages:
• the same equation as above, which is also used for irrelevant pages;
• a Naive Bayes (NB) classifier, which is used to compute link scores [2, 8].
Diligenti et al. [9] propose a method based on link analysis of the forward links and backward links of a page. In this method, a priority value is assigned to each Web page based on Pvalue, which is the difference between FwdLnk and BkLnk; the page with the highest Pvalue receives the highest priority.

Pvalue = FwdLnk − BkLnk

The URLs are classified into the above categories using the parameters backward link count (BkLnk) and forward link count (FwdLnk). The value of FwdLnk can be calculated by the server by returning the number of links in the page after parsing it, without downloading it. The value of BkLnk can be calculated by the server by counting how many pages refer to this page, using the existing database built by the various crawler machines. After that, the crawler sorts the list of URLs received from the repository in descending order of Pvalue. This sorted list of URLs is then used to maintain the quality of crawling. This method gives weightage to the current database and helps in building a database of higher-quality indexed Web pages even when the focus is on crawling the entire Web. The method can also be used for broken links and for the growing and mature stages of databases. A URL having a zero value of FwdLnk and a nonzero value of BkLnk is assigned low priority, as its Pvalue will always be negative [9].
Kaur and Gupta [10] proposed the weighted page rank method, in which the weight of a Web page is calculated on the basis of the number of incoming and outgoing links, and the importance of the page depends on this weight. The relevancy obtained using this technique is lower, as the page rank is calculated on the basis of the weight of the Web page computed at indexing time [2].
This algorithm is an updated version of PageRank, as important pages are assigned more value instead of dividing the page rank value equally among all outgoing links of a page [11].
The weighted page rank is thus given by:

PR(u) = (1 − d)/|D| + d (PR(V1) Win(V1, u) Wout(V1, u) + … + PR(Vn) Win(Vn, u) Wout(Vn, u))

where Win(v, u) is the weight of link (v, u), calculated from the number of inlinks of page u and the number of inlinks of all reference pages of page v; Wout(v, u) is the weight of link (v, u), calculated from the number of outlinks of page u and the number of outlinks of all reference pages of page v; and PR(u) is the page rank of page u.
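A minimal sketch of the weighted PageRank iteration, under the inlink/outlink-count reading of Win and Wout given above, is shown below; the graph representation, damping factor, and iteration count are illustrative choices.

    # graph maps each page to the list of pages it links to; d is the damping factor.
    def weighted_pagerank(graph, d=0.85, iterations=50):
        pages = list(graph)
        inlinks = {p: [q for q in pages if p in graph[q]] for p in pages}
        pr = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_pr = {}
            for u in pages:
                rank = (1.0 - d) / len(pages)
                for v in inlinks[u]:
                    ref = graph[v]                              # reference pages of v
                    in_sum = sum(len(inlinks[p]) for p in ref) or 1
                    out_sum = sum(len(graph[p]) for p in ref) or 1
                    w_in = len(inlinks[u]) / in_sum
                    w_out = len(graph[u]) / out_sum
                    rank += d * pr[v] * w_in * w_out
                new_pr[u] = rank
            pr = new_pr
        return pr

    web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(weighted_pagerank(web))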
Pal et al. [4] proposed a method for the construction of a topic-specific weight table. The topic name is given to the Google Web search engine and, based on the topic, the first few results are retrieved. Each word is assigned a weight w_i = tf * df, where tf is the term frequency and df is the document frequency. After assigning weights to each word, the weights are normalized using the following equation:

W_{i+1} = W_i / W_{max}

where W_max stands for the maximum weight that can be assigned to any keyword and W_{i+1} stands for the new weight assigned to every keyword. The page relevancy is calculated with the help of the topic-specific weight table and by using the cosine similarity measure.
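The weight-table construction and cosine-similarity relevance described above can be sketched as follows; the tokenized result pages and function names are hypothetical.

    import math
    from collections import Counter

    # Sketch: each keyword gets w_i = tf * df over the retrieved result pages,
    # weights are normalized by the maximum weight, and page relevance is then
    # a cosine similarity against this topic-specific weight table.
    def build_weight_table(result_pages):
        tf = Counter(w for page in result_pages for w in page)
        df = Counter(w for page in result_pages for w in set(page))
        weights = {w: tf[w] * df[w] for w in tf}
        w_max = max(weights.values())
        return {w: weights[w] / w_max for w in weights}

    def cosine_relevance(page_words, weight_table):
        page_tf = Counter(page_words)
        dot = sum(page_tf[w] * weight_table.get(w, 0.0) for w in page_tf)
        norm_p = math.sqrt(sum(v * v for v in page_tf.values()))
        norm_t = math.sqrt(sum(v * v for v in weight_table.values()))
        return dot / (norm_p * norm_t) if norm_p and norm_t else 0.0

    pages = [["web", "mining", "crawler"], ["web", "crawler", "focused", "crawler"]]
    table = build_weight_table(pages)
    print(cosine_relevance(["focused", "web", "crawler", "crawler"], table))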
Kleinberg [12] proposes a method called Hyperlink-Induced Topic Search (HITS). It calculates the hub as well as authority values of the relevant pages; the result given is a page that is both relevant and important.
HITS is a link analysis algorithm for rating Web pages. A page which points to many other pages represents a good hub, and a page that is linked to by many different hubs represents a good authority.
Selection of the seed URL set is based on the authority of Web pages. The rank is calculated by computing hub and authority scores. The authority score determines the value of the page content, and the hub score determines the value of the page's links to other pages.

When comparing two pages having approximately the same number of citations,
if one of these has received more citations from P1 and P2, which are counted as
important or prestigious pages, then that page relevance becomes higher. In other
words, having citations from an important page is better than from an unimportant
one [13, 14].
Fujimura et al. [15] propose the EigenRumor algorithm to address the problem of the increasing number of blogs on the Web, since it is a challenge for service providers to display quality blogs to users [13].
• The PageRank algorithm assigns very low page rank scores to blog entries, so rank scores cannot be assigned to blog entries according to their importance.
• A rank score can be provided to every blog by weighting the hub scores, and the authority of the bloggers depends on an eigenvector calculation.
• It is mostly used for blog ranking, not for ranking Web pages.

3 Proposed Method

The proposed focused crawler starts with a seed URL. It does not download the page immediately; instead, it parses the page to extract the URLs and the words of interest in that page [16].
To determine the importance of the page being parsed, the weight of the keywords/search string in the whole Web page is calculated against every keyword in the topic-specific weight table. Occurrences of the same words at various locations of a page have different importance and represent different information [17], so the relevance of the page being parsed can be decided by considering each of its components. For example, the title text is more informative for expressing the topic covered in a page than the common text.
Here, the weight of outlinks/hyperlinks is computed by the same procedure, i.e. in order to crawl a relevant page which is a child of an irrelevant page [18, 19]. Only if the level of relevance crosses a predefined threshold will the page be downloaded, its content extracted, and the repository updated with the page; the page is discarded otherwise.
In this way, bandwidth is saved by discarding irrelevant pages, and the network load is reduced.
To obtain the overall weight (wkp) of keyword k in page p, the weights of the keyword in different locations of page p are added as shown below:

wkp = wkurl + wkt + wkb

wkp = overall weight of keyword k in page p
wkurl = weight assigned to keyword k based on the page URL
wkt = weight assigned to keyword k based on the page title
wkb = weight assigned to keyword k based on the page body
wko = weight assigned to keyword k based on the page outlinks
furl = frequency of the search string in the URL
ft = frequency of the search string in the title
fb = frequency of the search string in the body
Wurl = predefined weight of the URL
Wt = predefined weight of the title text
Wb = predefined weight of the body text
Proposed Algorithm
1. Start
2. Add seed URL to the queue
3. Pick a URL from the queue
4. Fetch the page
5. Parse the page
6. while (queue is not empty)
   {
     If (search string or keyword present in URL)
         wkurl = furl * Wurl
     else
         Return 0
     If (search string or keyword present in title)
         wkt = ft * Wt
     else
         Return 0
     If (search string or keyword present in the body)
         wkb = fb * Wb
     else
         Return 0
     wkp = wkurl + wkt + wkb
     If (wkp > threshold_value)
     {
         Add to output.
     }
     Insert all outlinks into the queue
   }
7. Stop
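For illustration, one runnable reading of the proposed algorithm is sketched below; the page fetch/parse routine, the location weights, and the threshold are all placeholders, and a keyword missing from a location is treated here as contributing zero weight so that wkp = wkurl + wkt + wkb as defined above.

    from collections import deque

    # Illustrative reading of the proposed crawler; fetch_and_parse is a user-supplied
    # stub returning (title, body, outlinks) for a URL.
    W_URL, W_TITLE, W_BODY = 3.0, 2.0, 1.0      # placeholder values for Wurl, Wt, Wb
    THRESHOLD = 2.0                              # placeholder threshold_value

    def page_weight(keyword, url, title, body):
        wkurl = url.lower().count(keyword) * W_URL      # furl * Wurl
        wkt = title.lower().count(keyword) * W_TITLE    # ft * Wt
        wkb = body.lower().count(keyword) * W_BODY      # fb * Wb
        return wkurl + wkt + wkb

    def crawl(seed_urls, keyword, fetch_and_parse):
        queue, seen, output = deque(seed_urls), set(seed_urls), []
        while queue:
            url = queue.popleft()
            title, body, outlinks = fetch_and_parse(url)
            if page_weight(keyword.lower(), url, title, body) > THRESHOLD:
                output.append(url)                  # "Add to output"
            for link in outlinks:                   # the pseudocode enqueues all
                if link not in seen:                # outlinks; the prose expands
                    seen.add(link)                  # only relevant pages
                    queue.append(link)
        return output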

4 Conclusion

Hence, by using the concept of page weight, Web pages are completely scanned and their page weights computed. In this way, the efficiency of the Web crawler can be improved, as the URL list produced as output by this method is of greater importance than that of the traditional Web crawling method.

References

1. Batra, M.R.: A review of focused crawler approaches. Int. J. 4(7), (2014)


2. Pant, G., Srinivasan, P.: Link contexts in classifier-guided topical crawlers. IEEE Trans.
Knowl. Data Eng. 18(1), 107–122 (2006)
3. [Link]/web_crawler. Accessed 12 May 2010
4. Pal, A., Tomar, D.S., Shrivastava, S.C.: Effective focused crawling based on content and link
structure analysis. 2(1), (2009)
5. De Bra, P., Houben, G.-J., Kornatzky, Y., Post, R.: Information retrieval in distributed
hypertexts. In: Proceedings of RIAO-94, Intelligent Multimedia, Information Retrieval
Systems and Management, pp. 481–492 (1994)
6. Michael, H., Michal, J., Yoelle, M., Dan, P., Menachem, S., Sigalit, U.: The Shark-search
algorithm—an application: tailored web site mapping. Comput. Netw. ISDN Syst. 30(1–7),
317–326 (1998)
7. Hao, H.W., Mu, C.X., Yin, X.C., Lim, S., Wang, Z.B.: An improved topic relevance
algorithm for focused crawling. In: IEEE (2011)
8. Mali, S., Meshram, B.B.: Focused web crawler with page change detection policy. In: 2nd
International Conference and workshop on Emerging Trends in Technology (ICWET)
Proceedings, International Journal of Computer Applications® (IJCA) 2011
9. Diligenti, M., Coetzee, F., Lawrence, S., Giles, C., Gori, M.: Focused crawling using context
graphs. In: Proceedings of 26th International Conference on Very Large Databases (2000)
10. Kaur, S., Gupta, A.: A survey on web focused Information extraction Algorithms. Int. J. Res.
Comput. Appl. Robot. 3(4), 19–23 (2015)
11. Xing, W., Ghorbani, A.: Weighted pagerank algorithm, In: Proceeding of the Second Annual
Conference on Communication Networks and Services Research (CSNR’ 04), IEEE (2004)
12. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM-SIAM
Symp. Discrete Algorithm 46(5), 604–632 (1999)
13. Sharma, D.K., Sharma, A.K.: A comparative analysis of web page ranking algorithms. Int.
J. Comput. Sci. Eng. 02(08), 2670–2676 (2010)
14. [Link]
15. Fujimura, K., Inoue, T., Sugisaki, M.: The EigenRumor algorithm for ranking blogs. In:
WWW 2005 2nd Annual Workshop on the Web logging Ecosystem (2005)
16. Chain, X., Zhang, X.: HAWK: a focused crawler with content and link analysis. In: IEEE
International Conference on e-Business Engineering (2008)
17. Avraam, I., Anagnostopoulos, I.: A comparison over focused web crawling strategies. In:
2011 IEEE Conference on Panhellenic Conference on Informatics, IEEE (2011)
18. Porter, M. F.: An algorithm for suffix stripping. Readings in information retrieval, pp. 313–
316. Morgan Kaufmann Publishers Inc, San Francisco (1997)
19. Avraam, I., Anagnostopoulos, I.: A comparison over focused web crawling strategies. In:
IEEE 15th Panhellenic Conferewnce Informatics, pp. 245–249 (2011)
A New Log Kernel-Based Possibilistic
Clustering

Meena Tushir and Jyotsna Nigam

Abstract An unsupervised possibilistic (UPC) algorithm with the use of validity


indexes has already been proposed. Although UPC works well, it does not show a
good accuracy for non-convex cluster structure. To overcome this limitation, we
have proposed a kernel version of UPC with a conditionally positive-definite kernel
function. It has the ability to detect clusters with different shapes and non-convex structures because it transforms the data into a high-dimensional space. Our proposed algorithm, the kernelized UPC-Log (UKPC-L), is an extension of UPC obtained by introducing a log kernel function, which is a conditionally positive-definite function.
This makes the performance of our proposed algorithm better than UPC in case of
non-convex cluster structures. This has been demonstrated by the results obtained
on several real and synthetic datasets. We have compared the performance of UPC
and our proposed algorithm using the concept of misclassification, accuracy and
error rate to show its efficiency and accuracy.


Keywords Fuzzy c-means · Kernel functions · Possibilistic c-means · Conditionally positive-definite function

1 Introduction

Clustering [1, 2] is commonly used as one of the analytical techniques in the fields
of pattern recognition, modelling of system, image segmentation and analysis,
communication, data mining and so on. It sorts out the group of N observations or

M. Tushir (&)
Department of Electrical and Electronics Engineering, Maharaja Surajmal
Institute of Technology, C-4, Janakpuri, New Delhi, India
e-mail: meenatushir@[Link]
J. Nigam
Department of Information Technology, Maharaja Surajmal Institute
of Technology, C-4, Janakpuri, New Delhi, India
e-mail: jyotsnaroy81@[Link]


data points into C groups or clusters, such that the objects in each group are more similar to one another than to those in other groups. Data
clusters on which clustering is to be performed can be well separated, continuous or
overlapping. On the basis of cluster type, there are predominantly two types of
clustering: hard and fuzzy. Traditional (hard) clustering has a constraint to it that
each point of the dataset is restricted to only one cluster. The conception of fuzzy
clusters was proposed by Zadeh [3]. It included the notion of partial membership
where a data point in a dataset can simultaneously belong to more than one cluster.
It gave rise to the thought of probabilistic theory in clustering. Clusters, which are
well separated or continuous, give suitable results when processed with hard
clustering techniques but sometimes objects in a cluster overlap each other and
belong simultaneously to several clusters. Compared with hard clustering, fuzzy clustering has superior clustering performance and capabilities. The membership
degree of a vector xk to the ith cluster (uik) is a value in the interval [0, 1] in fuzzy
clustering. This idea was first introduced by Ruspini [4] and used by Dunn [5] to
construct a fuzzy clustering method based on the criterion function minimization.
This approach was generalized by Bezdek to fuzzy c-means algorithm by using a
weighted exponent on the fuzzy memberships. Although FCM is a very competent
clustering method, its membership does not always correspond well to the degrees
of belonging of the data, and it carries certain constraints. The results of FCM may
tend to be inaccurate in a noisy environment. Many spurious minima are generated
due to the presence of noise in the datasets, which further aggravate the situation.
FCM does not provide any method to handle this complication. To reduce this
weakness to some extent, Krishnapuram and Keller [6] proposed the possibilistic c-means (PCM) algorithm. PCM uses a possibilistic approach, in which a possibilistic membership function describes the degree of belonging. It determines a possibilistic partition matrix, in which the possibilistic membership function measures the absolute degree of typicality of a point in a cluster. In contrast to FCM, the possibilistic
approach appeared to be more resilient to noise and outliers. PCM has a necessity to
predefine the number of clusters and is sensitive to initializations, which sometimes
generates coincident clusters. The condition to define the optimal c is termed as
cluster validity. To overcome the limitations imposed by FCM and PCM, many new algorithms have been proposed which have the ability to generate both membership and typicality for unlabelled data [7, 8]. However, it is inferred from observations that these algorithms tend to give poor results for unequal-sized clusters. In 2006, a novel fuzzy clustering algorithm called the unsupervised possibilistic clustering algorithm (UPC) was put forward by Yang and Wu [9]. Unsupervised clustering can determine the number of clusters with the help of their proposed validity indexes. The objective function of UPC integrates the FCM objective function with two cluster validity indexes, and the parameters used in UPC are easy to manoeuvre. Although UPC has many merits, like most enhanced versions it only works for convex cluster structures; UPC fails to give desirable results on non-convex dataset structures. The kernel-based unsupervised algorithm proposed in this paper is an extension of UPC. The data points in the proposed method (UKPC-L) are mapped to a higher dimensional space

by kernel function, and the culminating clustering partition is obtained by opti-


mizing the objective function of UKPC-L. The experimental results exhibit better
results when compiled with the proposed kernel-based unsupervised algorithm.
The remainder of this paper is organized as follows. Section 2 provides back-
ground information on the fuzzy c-means, possibilistic c-means and unsupervised
possibilistic clustering. In Sect. 3, the kernel-based approach and kernel trick are
highlighted. In Sect. 4, the proposed kernelized version of UPC clustering model is
formulated. Section 5 highlights the potential of the proposed approach through
various synthetic and real datasets. Concluding remarks are presented in Sect. 6.

2 Literature Work

2.1 Fuzzy C-Means Clustering (FCM)

Most methodical clustering algorithms for cluster analysis are based on the optimization of the basic c-means function. In dataset clustering, the similarity criterion used to group objects that are identical in nature is defined by a distance norm. A large family of clustering algorithms and techniques complies with the fuzzy c-means functional formulation by Bezdek and Dunn:

J(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \|x_k - v_i\|^{2}    (1)

It is applicable to a wide variety of data analysis problems and works well with them. The FCM algorithm assigns membership values to the data points, and the membership is inversely related to the relative distance of a point to the cluster centres in the FCM model. Each data point in FCM is represented by x_k, and v_i represents the centre of the ith cluster. The closeness of each data point to the centre of a cluster is defined by the membership value u_ik. m is the weighting exponent which determines the fuzziness of the resulting clusters, and it ranges over [1, ∞).
The cluster centres and membership values are computed as
v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}}    (2)

u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{\|x_k - v_i\|}{\|x_k - v_j\|} \right)^{2/(m-1)}}    (3)

For a given dataset X = {x_1, …, x_n}, the fuzzy c-means algorithm partitions X into
C fuzzy subsets by minimizing the objective function given above. It satisfies the
condition:
\sum_{i=1}^{c} u_{ik} = 1

There are some limitations to the fuzzy c-means clustering method. Firstly, since it is based on a fixed distance norm, this norm forces the objective function to prefer clusters of a certain shape even if they are not present in the data. Secondly, it is sensitive to noise, and the number of clusters must be predefined.

2.2 Possibilistic C-Means Clustering (PCM)

PCM works from the viewpoint of possibility theory. According to possibility theory, the membership values can be interpreted as the degree of compatibility or possibility of the points belonging to the classes. In contrast to PCM, fuzzy possibilistic c-means clustering simultaneously produces both typicality and membership values. Due to low typicality, outliers are eliminated by the algorithm.
The objective function of PCM is [9]
P_m(T, V, X, \gamma) = \sum_{i=1}^{c} \sum_{k=1}^{n} t_{ik}^{m} d_{ik}^{2} + \sum_{i=1}^{c} \gamma_i \sum_{k=1}^{n} (1 - t_{ik})^{m}    (4)

where t_ik is the typicality value for data point x_k with respect to cluster centre v_i, d_ik is the distance between x_k and v_i, and γ_i denotes a user-defined constant that is greater than zero, for i lying in the range 1 to c. PCM uses
approximate optimization for additional conditions. The typicality of data point xk
and the centre of cluster vi can be obtained by the following equations

t_{ik} = \frac{1}{1 + \left( \frac{d_{ik}^{2}}{\gamma_i} \right)^{1/(m-1)}}, \quad \forall i    (5)

v_i = \frac{\sum_{k=1}^{n} t_{ik}^{m} x_k}{\sum_{k=1}^{n} t_{ik}^{m}}    (6)

PCM does not hold the probabilistic constraint that the membership values of a data point across classes sum to one. It overcomes sensitivity to noise and the need to specify the number of clusters. However, there are still some disadvantages in PCM, i.e. it depends highly on a good initialization and it tends to produce coincident clusters, which are undesirable. KPCM uses KFCM to initialize the membership and avoids the above-mentioned weakness of PCM.

2.3 Unsupervised Possibilistic C-Means Clustering (UPC)

In 2006, Yang and Wu proposed a new possibilistic clustering algorithm called


unsupervised possibilistic clustering (UPC) [9]. The objective function of UPC is an extension of the FCM objective function with the partition coefficient (PC) and partition entropy (PE) validity indexes, to make it robust to noise and outliers.
The objective function of UPC is
J_{upc}(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2} + \frac{\beta}{m^{2}\sqrt{c}} \sum_{i=1}^{c} \sum_{j=1}^{n} \left( u_{ij}^{m} \log u_{ij}^{m} - u_{ij}^{m} \right)    (7)

where m is the fuzzy factor and c is the number of clusters. The first term is identical to the FCM objective function, and the second term is constructed by analogy with the PE and PC validity indexes.
Parameter β is the sample covariance, defined as:

\beta = \frac{\sum_{j=1}^{n} \|x_j - \bar{x}\|^{2}}{n}, \quad \text{where } \bar{x} = \frac{\sum_{j=1}^{n} x_j}{n}    (8)

Minimizing the objective function with respect to u_ij and setting the derivative to zero, we get the equation for the membership value, i.e.

u_{ij} = \exp\left( -\frac{m\sqrt{c}\,\|x_j - v_i\|^{2}}{\beta} \right), \quad i = 1, \ldots, c; \; j = 1, \ldots, n    (9)

Minimizing the objective function with respect to v_i, we get the equation for the cluster centres, i.e.

v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}    (10)

3 Kernel-Based Algorithm

3.1 Kernel-Based Approach

A kernel function is a generalization of the distance metric that measures the distance between two data points as they are mapped into a high-dimensional space in which they are more clearly separable [10, 11]. The accuracy of an algorithm can be increased by exploiting a kernel function in calculating the distance of data points from the prototypes.

Kernel trick:
The kernel trick is a very interesting and powerful tool. It provides a bridge from linearity to nonlinearity for any algorithm that can be expressed solely in terms of dot products between two vectors. The kernel trick is interesting because the mapping never needs to be computed explicitly: wherever a dot product is used, it is replaced with a kernel function. The kernel function denotes an inner product in feature space and is usually written as:

K(x, y) = \langle \Phi(x), \Phi(y) \rangle

Given an unlabelled dataset X = \{x_1, \ldots, x_n\} in the p-dimensional space R^p, let \Phi be a nonlinear mapping from this input space to a high-dimensional feature space H:

\Phi : R^p \to H, \quad x \mapsto \Phi(x)

The dot product in the high-dimensional feature space can be calculated through the kernel function K(x_i, x_j) in the input space R^p:

K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)    (11)

Some examples of kernel functions are as follows:

Example 1 (Linear kernel): K(x, y) = x^T y + c
Example 2 (Polynomial kernel): K(x_i, x_j) = (x_i \cdot x_j + c)^d, where c ≥ 0, d ∈ N
Example 3 (Gaussian radial basis kernel): K(x_i, x_j) = \exp\left( - \| x_i - x_j \|^2 / (2\sigma^2) \right), where σ > 0
Example 4 (Hyperbolic tangent (sigmoid) kernel): K(x, y) = \tanh(a x^T y + c)
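For illustration, these kernels can be written directly as one-line functions; the NumPy sketch below uses hypothetical default parameter values (c, d, sigma, a) chosen purely for demonstration and not prescribed by the paper:

import numpy as np

# Minimal NumPy sketches of the kernel examples above (illustrative defaults).
def linear_kernel(x, y, c=1.0):
    return np.dot(x, y) + c

def polynomial_kernel(x, y, c=1.0, d=2):
    return (np.dot(x, y) + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, a=1.0, c=0.0):
    return np.tanh(a * np.dot(x, y) + c)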

4 Proposed kernelized UPC-Log Algorithm

To further improve UPC, we use a log kernel function. The log kernel is a conditionally positive-definite function for all values of the parameter a greater than 0. We name our proposed algorithm UKPC-L.

Conditionally positive definite function:


Let X be a non-empty set. A kernel K is called conditionally positive definite if
and only if it is symmetric and

\sum_{j,k=1}^{n} c_j c_k K(x_j, x_k) \ge 0    (12)

for n ≥ 1, c_1, ..., c_n ∈ R with \sum_{j=1}^{n} c_j = 0 and x_1, ..., x_n ∈ X.

The log kernel function that we have proposed for our algorithm is:

K(x, y) = \log\left( 1 + a \| x_j - v_i \|^2 \right)    (13)

Using our log kernel function, the objective function and the distance for UKPC-L clustering are as follows:

J_{UKPC-L} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m D_{ij}^2 + \frac{\beta}{m^2 \sqrt{c}} \sum_{i=1}^{c} \sum_{j=1}^{n} \left( u_{ij}^m \log u_{ij}^m - u_{ij}^m \right)    (14)

where

D_{ij}^2 = \| \Phi(x_j) - \Phi(v_i) \|^2 = 2 \log\left( 1 + a \| x_j - v_i \|^2 \right)    (15)

\beta = \frac{2 \sum_{j=1}^{n} \log\left( 1 + a \| x_j - v_i \|^2 \right)}{n}    (16)

Minimizing the objective function with respect to vi , we get equation for cluster
centres

v_i = \frac{\sum_{j=1}^{n} u_{ij}^m x_j \left( \frac{1}{1 + a \| x_j - v_i \|^2} \right)}{\sum_{j=1}^{n} u_{ij}^m \left( \frac{1}{1 + a \| x_j - v_i \|^2} \right)}    (17)

Minimizing the objective function with respect to uij and setting it to zero, we get
equation for membership value
u_{ij} = \exp\left( - \frac{2 m \sqrt{c}}{\beta} \log\left( 1 + a \| x_j - v_i \|^2 \right) \right)    (18)

The general form of our proposed algorithm is as follows:

* Fix the number of clusters c; fix m > 1; set the learning rate a;
* Execute the FCM clustering algorithm to find the initial U and V;
* Set the iteration count k = 1;
Repeat
  Calculate the objective function J(U, V) using (14).
  Compute β using (16).
  Update v_i using (17).
  Update u_ij using (18).
Until a given stopping criterion, i.e. the convergence precision ε, is satisfied.
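As a concrete illustration of this loop, the following is a minimal NumPy sketch of the UKPC-L iteration, not the authors' implementation. It assumes a simple random (or externally supplied, e.g. FCM) initialization of the centres, and it reads the per-centre β of Eq. (16) as an average of the log term over all centre-point pairs, which is one plausible interpretation. The defaults mirror the experimental settings reported below (m = 2, a = 2, ε = 1e-5, max_iter = 100).

import numpy as np

def ukpc_l(X, n_clusters, m=2.0, a=2.0, eps=1e-5, max_iter=100, V_init=None):
    """Minimal sketch of the UKPC-L iteration (Eqs. 16-18); illustrative only."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    # The paper initializes from an FCM run; random data points are a stand-in here.
    V = X[rng.choice(n, n_clusters, replace=False)] if V_init is None else V_init.copy()

    for _ in range(max_iter):
        sq = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # ||x_j - v_i||^2, shape (c, n)
        log_term = np.log(1.0 + a * sq)                           # log(1 + a ||x_j - v_i||^2)
        beta = 2.0 * log_term.mean()                              # Eq. (16), averaged over all pairs
        U = np.exp(-(2.0 * m * np.sqrt(n_clusters) / beta) * log_term)   # Eq. (18)

        w = (U ** m) / (1.0 + a * sq)                             # weights appearing in Eq. (17)
        V_new = (w @ X) / w.sum(axis=1, keepdims=True)            # Eq. (17)

        if np.linalg.norm(V_new - V) < eps:                       # convergence precision
            V = V_new
            break
        V = V_new
    return U, V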

5 Experimental Results

To demonstrate the efficiency and performance of our proposed algorithm, we applied UKPC-L and UPC to a number of widely used datasets. We used datasets with varying shapes, orientations and numbers of data points. The real-life datasets used in the experiments are the Iris dataset and the Seed dataset. For our experiments, we have chosen m = 2, a = 2, ε = 0.00001 and max_iter = 100.

5.1 Synthetic Datasets

1. Gaussian Random Data with Noise: The dataset generated is spherical with unequal radii. There are four clusters, and the data points in each cluster are normally distributed over a two-dimensional space. Analysing Fig. 1 closely, our proposed algorithm gives the desired results, with cluster centres located at their prototypical locations, while in the case of UPC the cluster centres are slightly shifted from the ideal locations.
The respective covariance matrices are:

\begin{bmatrix} 0.4 & 0 \\ 0 & 0.8 \end{bmatrix}, \begin{bmatrix} 0.4 & 0 \\ 0 & 0.8 \end{bmatrix}, \begin{bmatrix} 0.4 & 0 \\ 0 & 0.8 \end{bmatrix} and \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
The respective means for each cluster using the UPC clustering technique and
our proposed algorithm are given as:

Fig. 1 Clustering by a UPC, b UKPC-L (scatter plots over the x1 and x2 axes)

V_{IDEAL} = \begin{bmatrix} 2 & 5 \\ 2 & 14 \\ 7 & 8 \\ 10 & 14 \end{bmatrix}

V_{UPC} = \begin{bmatrix} 1.9947 & 5.0186 \\ 2.0138 & 13.9765 \\ 6.7874 & 8.1653 \\ 9.9408 & 13.8123 \end{bmatrix} \qquad V_{UKPC-L} = \begin{bmatrix} 2.0146 & 4.9913 \\ 2.0007 & 13.9779 \\ 6.9814 & 8.0660 \\ 9.9735 & 14.0319 \end{bmatrix}

To show the effectiveness of our proposed algorithm, we also compute the error using:

E^* = \| V_{ideal} - V_* \|^2

where * is UPC or UKPC-L. E_{UPC} = 0.00126 and E_{UKPC-L} = 0.00029. Clearly, we can see from the error results that our proposed algorithm is better than UPC.
2. R15 spherical dataset: The next dataset consists of 15 identically sized spherical clusters. UPC misplaces all the inner cluster prototypes, while our proposed algorithm clusters the data points in the best way, with all cluster centres at the required positions. The shift in the centres is clearly visible in Fig. 2.
3. Random Gaussian dataset without noise: The next dataset is a typical convex cluster dataset. On this dataset, the UPC algorithm detects only four clusters, taking all of them as convex shapes, whereas our proposed algorithm detects five clusters and hence proves to work well for all shapes as well (Fig. 3).

5.2 High-Dimensional Real Datasets

We now study the performance quality and efficiency of our proposed clustering algorithm on a few real datasets, namely the Iris dataset and the Seed dataset. The clustering results were evaluated using Huang's accuracy measure [12] with the following formula:

r = \frac{\sum_{i=1}^{k} n_i}{n}    (19)

Fig. 2 Clustering by a UPC, b UKPC-L using the R15 dataset (scatter plots over the x1 and x2 axes)

where ni is the number of data occurring in both the ith cluster and its corre-
sponding true cluster, and n is the number of data points in the dataset. According to
this measure, a higher value of r indicates a better clustering result with perfect
clustering yielding a value r = 1. Error has been calculated by using the measure:
E=1−r
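A minimal sketch of this accuracy computation is given below; it assumes integer-coded true labels and cluster labels, and matches each found cluster to the true class it overlaps most, which is one common reading of "corresponding true cluster":

import numpy as np

def huang_accuracy(true_labels, cluster_labels):
    """r = (sum of n_i) / n, with n_i the overlap of cluster i and its matched true class."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()   # n_i: largest overlap with any true class
    return total / len(true_labels)

# Error as used in the paper: E = 1 - r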

Fig. 3 Clustering by a UPC, b UKPC-L (scatter plots over the x1 and x2 axes)

Table 1 Number of misclassified data, elapsed time, accuracy and error using UPC and UKPC-L (Iris dataset)
Clustering algorithm Misclassifications Elapsed time Accuracy (%) Error (%)
UPC 12 1.50 92.0 8.0
UKPC-L 10 0.71 93.3 6.6

Table 2 Number of misclassified data, elapsed time, accuracy and error using UPC and UKPC-L (Seed dataset)
Clustering algorithm Misclassifications Elapsed time Accuracy (%) Error (%)
UPC 25 0.28 88.57 11.42
UKPC-L 23 0.29 89.04 10.95

1. Iris dataset: This is a four-dimensional dataset containing 50 samples each of


three types of Iris flowers. One of the three clusters (class 1) is well separated
from the other two, while classes 2 and 3 have some overlap. We made several
runs of UPC and our proposed method when these algorithms are initialized
with FCM terminal prototypes. Table 1 compares the algorithms based on the
optimization efficiency as judged by the above equation.
As can be seen in Table 1, UPC has the higher error value, with 12 samples assigned incorrect labels. In comparison, our proposed algorithm shows a better accuracy of 93.33%, making only ten errors.
2. Seed dataset: This is a multivariate dataset with 7 attributes (dimensions) and
210 instances. The examined dataset comprised kernels belonging to three
different varieties of wheat with 70 elements each.
As seen clearly from Table 2, our proposed algorithm misclassifies only 23 elements, with 89.04% accuracy, which is slightly more efficient compared to UPC, which misclassifies 25 elements.

6 Conclusions

In this paper, we have proposed a conditionally positive-definite log kernel-based possibilistic clustering algorithm. It performs well for datasets of different shapes, thereby overcoming the limitation of the previously proposed unsupervised algorithm in terms of efficiency. Compared to UPC, our proposed method selects more desirable cluster centroids, thereby increasing the clustering accuracy. To check the accuracy of our algorithm, we have applied it to various real and synthetic datasets, and it gives better results on several performance criteria such as misclassification, accuracy and error rate.

References

1. Anderberg, M.R.: Cluster Analysis for Application. Academic Press, New York (1973)
2. Backer, E., Jain, A.K.: A clustering performance measure based on fuzzy set decomposition.
IEEE Trans. Pattern Anal. Mach. Intell. 3(1), 66–74 (1981)
3. Zadeh, L.A.: Fuzzy sets. Inf. Control 8, 338–353 (1965)
4. Ruspini, E.H.: A new approach to clustering. Inform. Control 15(1), 22–32 (1969)
5. Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting compact
well-separated cluster. J. Cybern. 3(3), 32–57 (1973)
6. Krishnapuram, R., Keller, J.M.: A possibilistic approach to clustering. IEEE Trans. Fuzzy
Syst. 1, 98–110 (1993)
7. Dave, R.: Characterization and detection of noise in clustering. Pattern Rec. Lett. 12(11),
657–664 (1991)
8. Pal, N.R., Pal, K., Bezdek, J.C.: A mixed c-means clustering model. In: Proceedings of the
IEEE International Conference on Fuzzy Systems, Spain, pp. 11–21 (1997)
9. Yang, M.S., Wu, K.L.: Unsupervised possibilistic clustering. Pattern Recogn. 39(1), 5–21
(2006)
10. Christianini, N., Taylor, J.S.: An Introduction to SVMs and Other Kernel-Based Learning
Methods. Cambridge University Press, Cambridge (2000)
11. Girolami, M.: Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. 13
(3), 780–784 (2002)
12. Huang, Z., Ng, M.K.: A fuzzy k-modes algorithm for clustering categorical data. IEEE Trans.
Fuzzy Syst. 7(4), 446–452 (1999)
Fuzzy c-Means Clustering Strategies:
A Review of Distance Measures

Jyoti Arora, Kiran Khatter and Meena Tushir

Abstract In the process of clustering, our aim is to find basic procedures that measure the degree of association between variables. Many clustering methods use distance measures to find the similarity or dissimilarity between any pair of objects. The fuzzy c-means clustering algorithm is one of the most widely used clustering techniques and uses the Euclidean distance metric as its similarity measurement. The choice of distance metric should depend on the data and on how their comparison is to be made. The main objective of this paper is to present a mathematical description of the different distance metrics that can be used with the clustering algorithm and to compare their performance using the number of iterations needed to compute the objective function, the misclassification of data points across clusters, and the error between the ideal and observed cluster centre locations.

 
Keywords FCM clustering · Euclidean distance · Standard Euclidean distance · Mahalanobis distance · Minkowski distance · Chebyshev distance

J. Arora (&)
Department of Information Technology, Maharaja Surajmal Institute of Technology,
C-4, Janakpuri, New Delhi, India
e-mail: [Link]@[Link]
K. Khatter
Department of Computer Science, Ansal University, Gurgaon, India
e-mail: kirankhatter@[Link]
M. Tushir
Department of Electrical & Electronics Engineering, Maharaja Surajmal Institute
of Technology, C-4, Janakpuri, New Delhi, India
e-mail: meenatushir@[Link]


1 Introduction

Clustering is a technique for finding data with similar characteristics in a given dataset through association and classification rules, resulting in the separation of classes and the recognition of frequent patterns. Clustering is basically a knowledge discovery process whose results can be reused in various applications. A good clustering involves low inter-class similarity and high intra-class similarity. In order to categorize the data, we have to apply different similarity measures to establish a relation between the patterns, which groups the data into different clusters with a degree of membership. In clustering, we have to choose a good distance metric in order to obtain high intra-class similarity. Several clustering algorithms with different distance metrics have been developed in the past: some detect clusters of different shapes such as spherical [1] or elliptical [2], some detect straight lines [3, 4], and some focus on the compactness of the clusters [2, 5]. Clustering is a challenging field of research as it can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristic features of each cluster, and to focus on a particular set of clusters for further analysis. Regarding proximity measures, some existing work compares sets of distance metrics and can therefore be used as a guideline. However, most of this work covers only basic distance metrics: Grabusts [6] compared Euclidean, Manhattan, and correlation distances with k-means on the Iris dataset; similarly, Hathaway [7] compared distances with different values of p, and Liu et al. [8] proposed a new algorithm by replacing the Euclidean distance with the Standard Mahalanobis distance.
In this paper, we present a survey of different distance metrics in order to identify a proximity measure, to be followed by the clustering criterion, that results in a good clustering scheme for a dataset. We have included the Euclidean distance, Standard Euclidean distance, Mahalanobis distance, Standard Mahalanobis distance, Minkowski distance, and Chebyshev distance, compared on the criteria of accuracy, misclassification, and location of centres, which to our knowledge have not been discussed in such detail in any survey till now.
The remainder of this paper is sectioned as follows. Section 2 provides related
work which includes detail of fuzzy c-means algorithm and overview of different
distance metrics, and Sect. 3 includes experimental results on different data types
including synthetic and real datasets. Section 4 concludes the review.

2 Related Work

2.1 Fuzzy c-Means Algorithm

The notion of fuzzy sets, developed by Zadeh [9], is an attempt to modify exclusive clustering on the basis of the probability of the parameter on which the clusters have been developed; it was further extended by Bezdek et al. [3] into the fuzzy c-means (FCM) algorithm, the most widely used algorithm. This approach partitions a set of data {x_1, ..., x_n} ⊂ R^s into c clusters based on a similarity computed by the least-squares function of the Euclidean distance metric. The objective function of FCM is

J_{FCM}(X; U, V) = \sum_{j=1}^{c} \sum_{i=1}^{N} (u_{ij})^m \| x_i - v_j \|^2, \quad 1 < m < \infty    (1)

where m > 1 is a fuzzification parameter, v_j ∈ R^s is a cluster centre, and u_{ij} ∈ [0, 1] is the degree to which datum x_i belongs to cluster j; together these values define the partition matrix. \| x_i - v_j \|^2 is the Euclidean distance metric. The partition matrix u_{ij} is a convenient tool for representing the cluster structure in the data, where the fuzzy partition has the constraint \sum_{j=1}^{c} u_{ij} = 1. The most effective method of optimizing Eq. (1) is the alternating optimization method used in conventional FCM. The equations for v_j and u_{ij} can be referred from [3].
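For reference in the comparisons that follow, a minimal alternating-optimization FCM sketch with a pluggable distance function could look as follows. This is illustrative only: the weighted-mean centre update is exact for the squared Euclidean distance and is used here as an approximation for the other metrics, and the defaults mirror the experimental settings (m = 2, ε = 0.001, max_iter = 200).

import numpy as np

def fcm(X, n_clusters, dist_fn, m=2.0, eps=1e-3, max_iter=200, seed=0):
    """Alternating-optimization FCM; dist_fn(X, V) returns an (n_points, n_clusters)
    matrix of squared distances."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    V = X[rng.choice(n, n_clusters, replace=False)]       # initial centres

    for _ in range(max_iter):
        d2 = np.maximum(dist_fn(X, V), 1e-12)             # avoid division by zero
        # Membership update: u_ij = 1 / sum_k (d2_ij / d2_ik)^(1/(m-1))
        ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)
        # Centre update: membership-weighted mean of the data
        Um = U ** m
        V_new = (Um.T @ X) / Um.sum(axis=0)[:, None]
        if np.linalg.norm(V_new - V) < eps:
            V = V_new
            break
        V = V_new
    return U, V

def sq_euclidean(X, V):
    return ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)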

2.2 Distance Measures

An important component of a clustering algorithm is the similarity measurement between data points. Different measures have been used with different clustering algorithms in order to obtain clusters with the desired properties. The distance measures we discuss are the following.

2.2.1 Euclidean Distance

The Euclidean distance is the most commonly used similarity measure in FCM. The formula takes two objects and compares each attribute of one item with the other to determine the strength of their relation. The smaller the distance, the greater the similarity. The equation for the Euclidean distance is

d_{x,v} = \sqrt{ (x_1 - v_i)^2 + (x_2 - v_i)^2 + \cdots + (x_n - v_i)^2 }    (2)

With the Euclidean distance as a measure of similarity, hyperspherical clusters of equal size are usually detected [10]. However, the Euclidean distance degrades in performance in the presence of noise in the dataset. This is because the object-to-centre dissimilarity term in (1) can assign considerable membership to outlying data, owing to the membership constraint.

2.2.2 Standard Euclidean Distance

The Standard Euclidean distance can be squared in order to place progressively greater weight on objects that are farther apart. In this case, the equation becomes

d_{x,v}^2 = (x_1 - v_i)^2 + (x_2 - v_i)^2 + \cdots + (x_n - v_i)^2    (3)

This distance metric is frequently used in optimization problems in which distances only have to be compared. Clustering with the squared Euclidean distance is faster than clustering with the regular Euclidean distance.

2.2.3 Mahalanobis Distance

The Mahalanobis distance is a measure of the distance between a given point and a distribution. It is a multi-dimensional generalization of determining the number of standard deviations that a point x lies away from the mean of the distribution. The equation of the Mahalanobis distance is given by

d_{x,v} = (x_k - v_i)^T A^{-1} (x_k - v_i)    (4)

In (4), A is the covariance matrix of the data. The Mahalanobis distance is better suited where the required clusters are nonspherical in shape. In [2], Cai discussed a clustering algorithm in which a modified Mahalanobis distance with preserved volume was used. It is particularly useful when multinormal distributions are involved.

2.2.4 Standard Mahalanobis Distance

In the Standard Mahalanobis distance, the covariance matrix is replaced by the correlation matrix. The equation of the Standard Mahalanobis distance is represented by

d_{x,v} = (x_k - v_i)^T R^{-1} (x_k - v_i)    (5)

In (5), R is the correlation matrix. In [8], Liu et al. proposed a new algorithm, FCM-SM, normalizing each feature in the objective function so that every covariance matrix becomes the corresponding correlation matrix.

2.2.5 Minkowski Distance

The Minkowski distance, termed L_p, is a generalized metric that includes special cases of p, giving three distance metrics: p = 1 (Manhattan distance), p = 2 (Euclidean distance), and p = ∞ (Chebyshev distance). The Minkowski distance of order p between data points and centre points is represented by

d_{x,v} = \left( \sum_{i=1}^{n} \| x_i - v_i \|^p \right)^{1/p}    (6)

In [1, 2], different functions of the Minkowski distance with different values of p have been implemented with FCM, showing results on relational and object data types.

2.2.6 Chebyshev Distance

The Chebyshev distance (L_∞) is a distance metric defined on a vector space where the distance between two given vectors is the largest of their differences measured along any coordinate dimension. Essentially, it is the maximum distance between two points in any single dimension. The Chebyshev distance between any two points is given by (7)

d_{x,v} = \max_i | x_i - v_i |    (7)
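For concreteness, all of the metrics above are available through SciPy's cdist and can be plugged into the FCM sketch of Sect. 2.1. The helper below is a rough illustration only; note that SciPy's mahalanobis returns the square-rooted form of Eqs. (4)-(5), and the data argument is assumed to be the full dataset used to estimate the covariance or correlation matrix.

import numpy as np
from scipy.spatial.distance import cdist

def distance_matrix(X, V, metric, data=None):
    """Distance matrices for the metrics reviewed above (illustrative helper)."""
    if metric == "euclidean":
        return cdist(X, V, "euclidean")
    if metric == "sqeuclidean":                    # squared Euclidean of Eq. (3)
        return cdist(X, V, "sqeuclidean")
    if metric == "mahalanobis":                    # inverse covariance matrix, Eq. (4)
        return cdist(X, V, "mahalanobis", VI=np.linalg.inv(np.cov(data.T)))
    if metric == "std_mahalanobis":                # covariance replaced by correlation, Eq. (5)
        return cdist(X, V, "mahalanobis", VI=np.linalg.inv(np.corrcoef(data.T)))
    if metric == "minkowski":                      # Eq. (6); p = 1, 2 give Manhattan, Euclidean
        return cdist(X, V, "minkowski", p=3)
    if metric == "chebyshev":                      # Eq. (7)
        return cdist(X, V, "chebyshev")
    raise ValueError(metric)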

3 Experiments’ Results

The purpose of the experimental part was to test the operation of the FCM algorithm by applying different distance metrics to synthetic and real datasets. We used datasets with a wide variety of cluster shapes, numbers of clusters, and numbers of features per data point. FCM is an unsupervised clustering method, so the number of clusters to group the data into was given by us. We chose m = 2, which is a good choice for fuzzy clustering. For all parameters, we use ε = 0.001 and max_iter = 200.

3.1 Synthetic Datasets

3.1.1 X12 Dataset

The first example involves the X12 dataset, shown in Fig. 1, which contains two identical clusters with one outlier that is equidistant from both clusters. We know FCM is very sensitive to noise, so we do not get the desired results. To show the effectiveness of the different distance metrics, we have calculated the error E*, as the sum of the squares of the differences between the calculated centre and the ideal centre, for every distance metric as given in Eq. (8).

Fig. 1 Representation of the X12 dataset (scatter plot over the x1 and x2 axes)

Different Volume Rectangle Dataset


By changing the volume of the clusters in a pattern, we observe the effectiveness of the different distance metrics. Figure 2 shows the representation of a dataset with two identical clusters of different volume; the ideal cluster centres are defined by V_RD.
Table 1 shows that for the X12 dataset the best results are obtained with the Euclidean, Minkowski, and Chebyshev distances, although the number of iterations used by the Minkowski distance is higher. The Standard Mahalanobis distance and

Fig. 2 Representation of the rectangle dataset (scatter plot over the x1 and x2 axes)
Table 1 Centres produced by FCM with different distance metrics, effectiveness (E*) and number of iterations for the X12 dataset and the different volume rectangle dataset

                 Euclidean     Std. Euclidean   Mahalanobis    Std. Mahalanobis   Minkowski     Chebyshev
Centre (V12)     2.98 0.54     2.67 2.15        2.97 0.55      −2.98 0.54         −2.98 0.54    −2.98 0.54
                 −2.98 0.54    −1.51 0.08       −2.97 0.55     2.98 0.54          2.98 0.54     2.98 0.54
E*               0.412         4.213            0.438          0.412              0.412         0.412
No. of Iter.     10            10               10             9                  17            10
Centre (VRD)     4.22 −0.02    7.42 0.30        5.90 −0.10     4.86 0.54          4.92 −0.01    4.76 −0.02
                 16.30 −1.01   16.26 −2.02      16.42 −1.06    16.33 −0.94        16.33 −0.94   16.26 −1.00
E*               0.349         3.53             0.5            0.211              0.059         0.062
No. of Iter.     10            23               27             10                 14            13

Mahalanobis distance also show optimal results, but the Standard Euclidean distance shows poor results. Similarly, for the different volume rectangle dataset, the Minkowski and Chebyshev distances perform best compared to the other distance metrics, the highest number of iterations is used by the Mahalanobis distance, and the Standard Euclidean distance again shows the worst result. Both of the above datasets comprise clusters forming compact clouds that are well separated from one another; thus the sum-of-squared-error distance outperforms the other distances, while the Standard Euclidean distance shows poor results. The Mahalanobis distance, owing to the calculation of the covariance matrix of the data, does not show accurate results.

E^* = \| V_{12}(ideal) - V_{12}(dist) \|^2    (8)


   
V_{12}(ideal) = \begin{bmatrix} -3.34 & 0 \\ 3.34 & 0 \end{bmatrix} \qquad V_{RD}(ideal) = \begin{bmatrix} 5 & 0 \\ 16 & -1 \end{bmatrix}

3.2 High-Dimensional DataSets

We now examine the defined evaluation criteria on some well-known real datasets, namely the Iris, Wine and Wisconsin datasets. We analyze the clustering results using Huang's accuracy measure (r) [11]:

r = \frac{\sum_{i=1}^{k} n_i}{n}    (9)

where ni is the number of data occurring in both the ith cluster and its corre-
sponding true cluster, and n is the number of data points in the dataset. According to
this measure, a higher value of r indicates a better clustering result with perfect
clustering yielding a value of r = 1.
We made several runs of FCM with different distance metrics and calculated the misclassification, accuracy and number of iterations for all three high-dimensional datasets. Table 2 shows how the misclassification of the algorithm varies over the three datasets as the distance metric changes. We can see that the Chebyshev distance gives good results on the Iris and Breast Cancer datasets, with accuracies of 90 and 96% respectively, while the Standard Euclidean distance gives the best result on the Wine dataset with an accuracy of 91%; however, the number of iterations it uses is very high.
Table 2 FCM with different distance metrics showing misclassification, accuracy and number of iterations for the Iris, Wine and Breast Cancer (BC) datasets

                          Euclidean   Std. Euclidean   Mahalanobis   Std. Mahalanobis   Minkowski   Chebyshev
Misclassification (Iris)  17          27               43            27                 17          15
Accuracy (Iris)           0.88        0.82             0.71          0.82               0.88        0.9
No. of Iter. (Iris)       18          41               67            22                 25          20
Misclassification (Wine)  56          16               79            55                 55          56
Accuracy (Wine)           0.68        0.91             0.55          0.69               0.69        0.68
No. of Iter. (Wine)       47          113              9             46                 86          61
Misclassification (BC)    30          31               22            53                 38          22
Accuracy (BC)             0.95        0.95             0.96          0.92               0.94        0.96
No. of Iter. (BC)         14          22               100           25                 21          18

4 Conclusions

We described various distance metrics for FCM and examined the behavior of the
algorithm with different approaches. It has been concluded from results on various
synthetic and real datasets that Euclidean distance works well for most of the
datasets. Chebyshev and Minkowski distances are equally suitable for clustering.
Further exhaustive exploration on distance metrics needs to be done on various
datasets.

References

1. Patrick, J.F., Groenen, U., Kaymak, J.V., Rosmalen: Fuzzy clustering with Minkowski
distance functions. Econometric Institute Report, EI(24), (2006)
2. Cai, J.Y., Xie, F.D., Zhang, Y.: Fuzzy c-means algorithm based on adaptive Mahalanobis
distance. Comput. Eng. Appl. 174–176(2010)
3. Bezdek, J.C., Coray, C., Gunderson, R, Watson, J.: Detection and characterization of cluster
substructure. SIAM J. Appl. Math. 339–372 (1981)
4. Dave, R.N.: Use of the adaptive fuzzy clustering algorithm to detect lines in digital images.
Intell Robots Comput. Vision VIII 1192, pp. 600–661 (1982)
5. Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting compact well
separated clusters. J Cybern, 32–57(1973)
6. Grabusts, P.: The choice of metrics for clustering algorithms. In: International Scientific and
Practical Conference Vol 2(8), pp. 70–76 (2011)
7. Hathaway, R.J., Bezdek, J.C., Hu, Y.: Generalised fuzzy c-means clustering strategies using
LP norm distance. IEEE Trans. Fuzzy Syst. 8(5), (2000)
8. Liu, H.C., Jeng, B.C., Yih, J.M., Yu, Y.K.: Fuzzy c-means clustering algorithm based on
standard mahalanobis distance. In: International Symposium on Information Processing,
pp. 422–427 (2009)
9. Zadeh, L.A.: Fuzzy sets. Inf. Control 8, 338–353 (1965)
10. Su, M.C., Chou, C.H.: A Modified means of K-Means algorithm with a distance based on
cluster symmetry. IEEE Trans. Pattern Anal. Mach. Intell. 23, 674–680 (2001)
11. Tushir, M., Srivastava, S.: A new kernelized hybrid c-mean clustering with optimized
parameters. Appl. Soft Comp. 10, 381–389 (2010)
Noise Reduction from ECG Signal Using
Error Normalized Step Size Least Mean
Square Algorithm (ENSS) with Wavelet
Transform

Rachana Nagal, Pradeep Kumar and Poonam Bansal

Abstract This paper presents the reduction of baseline wander noise found in ECG signals. The reduction has been done using a wavelet-transform-inspired error normalized step size least mean square (ENSS-LMS) algorithm. We present a wavelet decomposition-based filtering technique to minimize the computational complexity while retaining good output signal quality. The MATLAB simulation results validate the good noise rejection in the output signal by analyzing two parameters, the excess mean square error (EMSE) and the misadjustment.

Keywords ECG signal · ENSS algorithm · LMS algorithm · Wavelet transform · EMSE · Misadjustment

1 Introduction

Electrocardiographic (ECG) signals are very low amplitude signals, around 1 mV. An ECG signal can be corrupted by noises known as baseline wander (BW) or power line interference (PLI). These two noises badly degrade the ECG signal quality, generate waveforms resembling the PQRST complex, and remove some tiny features which are important for diagnosis. ECG is most commonly affected by baseline wander (BW), which is caused by varying electrode-skin impedance and the patient's breathing and movement. This noise is a kind of sinusoidal signal with random frequency and phase. Baseline wander reduction is a very important step in processing the

R. Nagal (&)  P. Kumar


Department of ECE, ASET, Amity University, Noida, Uttar Pradesh, India
e-mail: rnagal@[Link]
P. Kumar
e-mail: pkumar4@[Link]
P. Bansal
Department of CSE, MSIT, IP University, New Delhi, India
e-mail: pbansal89@[Link]


ECG signal. Finally, the aim of the work presented here is to recover the clean signal from the undesired noisy signal so that the resulting signal can be used for easy diagnosis.
Many techniques for enhancing the quality of the signal have been reported in the literature [1–9], using both adaptive and non-adaptive models, and many adaptive filters have been proposed for cancelling noise from ECG signals. These adaptive filters minimize the error between the noisy ECG signal (considered as the primary input) and a reference signal which is somehow correlated with the noise present in the primary ECG signal. To track the dynamic variations of ECG signals, Thakor et al. [1] introduced an adaptive recurrent filter structure which acquires the impulse response of the normal QRS complex. To track the QRS complexes in ECG signals with few parameters, an adaptive system based on Hermite functions has been proposed [2]. To update the coefficients of the filter, an adaptive algorithm is always needed; its task is to minimize the error obtained by subtracting the output of the filter from the sum of the main signal and noise. Researchers have also analyzed the error convergence when the reference input is a deterministic signal: one such work was published by Olmos et al. [3], who derived the expression for the steady-state misadjustment taking an ECG as the input signal. Costa et al. [4] proposed a noise-resilient variable step size LMS algorithm which is specially designed for biomedical signals. A software approach has also been developed by Brouse et al. [5] for detecting noise in ECG signals using wavelet decomposition, but hardware implementation of this approach becomes costly for biomedical applications. Several further adaptive algorithms and their modifications have been published [10–13].
This paper contributes by implementing the error normalized step size least mean square (ENSS-LMS) algorithm with the help of the wavelet transform. The adaptive structure of ENSS is presented to eliminate baseline wander noise from ECG signals. The MIT-BIH database is used to implement and analyze the performance of the weight-updating algorithm (ENSS) in eliminating baseline wander noise from ECG signals. MATLAB simulations indicate substantial improvement in the quality of the ECG signals, with good values of EMSE and misadjustment.
This paper is organized as follows: Sect. 2 gives the proposed implementation of the ENSS algorithm along with the wavelet transform, Sect. 3 shows the simulation results, and Sect. 4 concludes the work.

2 Proposed Implementation

2.1 ENSS Algorithm-Based ANC System

Adaptive noise cancellation is a technique used to remove the noise from the input
signal. Figure 1 shows the primary input signal x(n) which is basically the addition
of original signal with some noise. If an original signal is say s(n) and noise signal

Fig. 1 Block diagram of the proposed ENSS-based ANC system: the primary signal x(n) and the reference noise signal N(n) each pass through a 1-D wavelet transform, the filter weights w(n) are updated by the ENSS algorithm, and an inverse 1-D wavelet transform produces the output

say N1(n), then the primary input becomes x(n) = s(n) + N1(n). A reference signal is also taken which is correlated with the noise added to the original signal. The reference noise signal N(n) is passed through a filter whose coefficients are updated by the ENSS-LMS algorithm.

2.2 1-D Wavelet Transform (Haar Wavelet Transform)

A signal is usually decomposed in terms of sinusoids for spectral analysis, but decomposing the signal into components covering different frequency bands, as the wavelet transform does, is very effective in reducing the complexity of the transform [14]. We therefore use the discrete wavelet transform for spectral analysis of the signal, with a set of basis functions defined in both the frequency and time domains. Here we use the 'Haar' wavelet [15]. The Haar wavelet is basically a sequence of rescaled square-shaped functions on the unit interval that form an orthonormal family. It is generally used for the analysis of signals which change suddenly [16]. The mother wavelet function for the Haar wavelet is defined as:

\begin{bmatrix} X_W(0,0) \\ X_\phi(0,0) \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \end{bmatrix} \quad \text{or} \quad X = H_{n,0}\, x    (1)

Its inverse function is defined as

x = H^T X    (2)

2.3 ENSS-LMS Algorithm

Error normalized step size least mean square (ENSS) is an algorithm in which the step size is normalized by the error vector rather than by the filter input; the step size parameter varies as a nonlinear function of the error vector. In the error normalized step size algorithm, the variable step size is inversely proportional to the squared norm of the error vector. Also, in this algorithm the number of iterations and the length of the error vector are equal. The equation for updating the weights of ENSS is:
W(n+1) = W(n) + \frac{\mu}{1 + \mu \| e_n \|^2} e(n) x(n)    (3)

where

\| e_n \|^2 = \sum_{i=0}^{n-1} e^2(n - i)    (4)

is the squared norm of the error e(n), which is used for its estimation over the entire update. The weight update can then equivalently be written as:

W(n+1) = W(n) + \frac{1}{\frac{1}{\mu} + \| e_n \|^2} e(n) x(n)    (5)

As the length of the error vector e(n) is equal to the number of iterations, say n, the proposed variable step size µ(n) becomes a nonlinear decreasing function of n, since \| e_n \|^2 is an increasing function of n. Also, µ(n) is an increasing function of the parameter µ [17].
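As an illustration of Eqs. (3)–(5), one ENSS-LMS weight update can be sketched in a few lines; the snippet below is a minimal NumPy rendering, not the authors' code, with the running error energy of Eq. (4) simply accumulated across iterations and illustrative variable names.

import numpy as np

def enss_update(w, x_tap, d, mu, err_energy):
    """One ENSS-LMS iteration: returns updated weights, error and error energy."""
    y = np.dot(w, x_tap)                  # filter output for the current taps
    e = d - y                             # error between desired and filter output
    err_energy += e ** 2                  # running ||e_n||^2 of Eq. (4)
    step = mu / (1.0 + mu * err_energy)   # error-normalized step size, Eq. (3)
    w = w + step * e * x_tap              # weight update
    return w, e, err_energy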

2.4 Implementation Steps

1. Import the ECG signal in the program to workspace which is x(n); we consider
this ECG signal with noise.
2. Following that baseline wander noise file is also imported from workspace, we
call it as reference noise signal and denoted by N(n).
3. A correlated noise is generated by passing the noise signals through a
parameterized filter of some order.
4. The wavelet transform of both the primary signal and the baseline wander noise is taken. The output of the 1-D wavelet block shown in Fig. 1 is therefore x(n)Ψ(t) for the primary signal and N1(n)Ψ(t) for the reference signal.
5. Weights, error, and other variables are initialized.

6. Now the wavelet coefficients obtained from the signals are processed through the FIR Wiener filter.
7. The step size is determined using its limit in Eq. (3), where the error is calculated by Eq. (5).
8. The output of the filter is calculated by multiplying the baseline noise signal by the computed tap weights, so Eq. (3) becomes

   W(n+1) = W(n) + \frac{\mu}{1 + \mu \| e_n \|^2} e(n) N_1(n) \Psi(n)    (6)

9. The output of the filter acts as a complement to the noise present in the signal; when added to the corrupted signal, it cancels a part of the noise. This negative addition gives the desired response of the ANC system.
10. The desired signal is a cleaned signal. Subtracting this cleaned signal from the principal signal gives us an error signal.
11. The error signal is fed back into the tap-weight computation via the ENSS-LMS algorithm using Eq. (3).
12. Finally, the inverse 1-D wavelet transform is taken to get the final signal using Eq. (2).
The resulting signal is the clean ECG signal.
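Putting the steps together, the whole loop can be sketched compactly as below. This is only an illustrative Python rendering, not the authors' MATLAB code; it assumes the PyWavelets package for the level-1 Haar analysis and synthesis and a simple tapped-delay FIR structure, with hypothetical defaults for the filter length and step size.

import numpy as np
import pywt

def wavelet_enss_anc(primary, ref_noise, filter_len=8, mu=0.05):
    """Sketch of the proposed ANC: Haar DWT -> ENSS-LMS filtering -> inverse Haar DWT."""
    pa, pd = pywt.dwt(primary, 'haar')        # approximation / detail of the primary signal
    na, nd = pywt.dwt(ref_noise, 'haar')      # same for the reference noise (steps 1-4)

    def enss_anc(d_sig, x_sig):
        w = np.zeros(filter_len)
        err_energy = 0.0                      # ||e_n||^2 of Eq. (4)
        clean = np.zeros_like(d_sig)
        for n in range(filter_len, len(d_sig)):
            x_tap = x_sig[n - filter_len:n][::-1]                # most recent reference taps
            e = d_sig[n] - np.dot(w, x_tap)                      # error = noise-cancelled sample
            err_energy += e ** 2
            w = w + (mu / (1.0 + mu * err_energy)) * e * x_tap   # Eqs. (3)/(6)
            clean[n] = e
        return clean

    clean_a = enss_anc(pa, na)                # ENSS-LMS in each wavelet subband (steps 5-11)
    clean_d = enss_anc(pd, nd)
    return pywt.idwt(clean_a, clean_d, 'haar')   # inverse Haar gives the cleaned ECG (step 12)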

3 Simulation Results

The work presented here used a database of 3600 samples of the ECG signal, collected from the benchmark MIT-BIH arrhythmia database [18] (recordings 101, 102, 103). The non-stationary real baseline wander noise is obtained from the MIT-BIH Normal Sinus Rhythm Database (NSTDB). The arrhythmia database consists of ECG recordings obtained from 47 subjects (men and women of different age groups). The recordings have 11-bit resolution over a 10 mV range with 360 samples per second per channel. The performance of the algorithm is evaluated in terms of EMSE and misadjustment, as shown in Tables 1 and 2.
After obtaining the database, baseline wander noise is added to the ECG recordings (records 101, 102, 103) to generate the primary signals. As shown in Fig. 1, the wavelet transform of the primary signal (the sum of the normal ECG and the baseline wander) is taken. Figure 2 shows the wavelet transform of the primary signal using the 'Haar' wavelet with level 1 decomposition. After this, the wavelet coefficients are exported to the MATLAB workspace.
Next, as per Fig. 1, the wavelet transform of the reference noise is taken using the 'Haar' wavelet with level 1 decomposition. Figure 3 shows the wavelet decomposition of the reference noise signal using the 1-D wavelet transform. The coefficients of the wavelet transform are then exported to the MATLAB workspace for further processing.

Fig. 2 Wavelet decomposition of the primary signal (ECG + baseline wander noise)

Fig. 3 Wavelet decomposition of the reference noise signal (baseline wander noise)

The wavelet decomposition allows the temporal addition of the ECG signal and leaves the ECG signal with only one-fourth of the input signal size. The reference signal is then passed through the designed FIR filter. The output of the filter is subtracted from the primary signal x(n) [after wavelet decomposition it becomes sig(n)], generating an error signal, which is again fed back to the filter via the ENSS-LMS algorithm [using Eq. (3)] to update the weights of the filter. Finally, the inverse wavelet transform is taken to get the cleaned signal shown in Fig. 4.
To validate the clean ECG signal shown in Fig. 4, the parameters EMSE and misadjustment have been analyzed (refer to Tables 1 and 2). It is clear from the tables that when recording 101 is taken as the noisy input signal, the EMSE reduces to −30.665 dB, which is a great reduction as compared to the LMS algorithm, whose value is

Fig. 4 Final ECG signal output after taking inverse wavelet transform

−7.788 dB for the same recording at step size parameter µ = 0.025. Increasing the step size parameter to µ = 0.05, the EMSE reduces to −34.257 dB, against −18.752 dB for the LMS algorithm, and similarly for the misadjustment. For step size parameter µ = 0.1, the EMSE is −39.710 dB (ENSS) against −30.407 dB for the LMS algorithm. To validate the model, the experiment was repeated for other ECG database recordings as well, but because of space constraints only the results of the three recordings are shown in Tables 1 and 2.

Table 1 EMSE and misadjustment (M) values for different recorded ECG signals for the ENSS-LMS algorithm

ENSS        Record 101          Record 102          Record 103          Average
            EMSE (dB)  M        EMSE (dB)  M        EMSE (dB)  M        EMSE (dB)  M
µ = 0.025   −30.665  0.5142     −31.543  0.5150     −31.658  0.6265     −30.666  0.5519
µ = 0.05    −34.257  0.9337     −35.252  0.9299     −34.743  0.9501     −34.746  0.9379
µ = 0.1     −39.710  0.7555     −41.324  0.8258     −42.101  0.7563     −41.045  0.7792

Table 2 EMSE and misadjustment (M) values for different recorded ECG signals for the LMS algorithm

LMS         Record 101          Record 102          Record 103          Average
            EMSE (dB)  M        EMSE (dB)  M        EMSE (dB)  M        EMSE (dB)  M
µ = 0.025   −7.788   0.908      −6.952   0.992      −7.129   0.912      −7.289   0.937
µ = 0.05    −18.752  0.846      −17.842  0.878      −18.521  0.855      −18.371  0.859
µ = 0.1     −30.407  −8.919     −29.107  −8.990     −30.203  −8.23      −29.905  −8.890

4 Conclusion

A wavelet decomposition-based method for improving the quality of the ECG signal by removing baseline wander noise has been proposed in this paper. The simulation results show the effectiveness of this technique in decreasing the EMSE and misadjustment. Analyzing the results shown in Tables 1 and 2, we can say that the wavelet-based adaptive noise cancellation system established with the ENSS algorithm, taking the ECG signal as input, gives better results than the LMS-based ANC system. The average EMSE over recordings 101, 102, 103 is −30.66 dB compared with −7.289 dB (LMS) for µ = 0.025. Similarly for the misadjustment, the average value is 0.5519 (ENSS) compared with 0.937 (LMS) for µ = 0.025. Overall, for the other values of µ as well, the proposed method behaves better and can be used as a performance optimization tool in biomedical signal processing.

References

1. Thakor, N.V., Zhu, Y.S.: Applications of adaptive filtering to ECG analysis: noise
cancellation and arrhythmia detection. IEEE Trans. Biomed. Eng. 38(8), 785–794 (1991)
2. Lagnna, P., Jan, R., Olmos, S., Thakor, N.V., Rix, H., Caminal, P.: Adaptive estimation of
QRS complex by the Hermite model for classification and ectopic beat detection. Med. BioI.
Eng. Comput. 34(1), 58–68 (1996)
3. Olmos, S., Laguna, P.: Steady-state MSE convergence analysis in LMS adaptive filters with
deterministic reference inputs for biomedical signals. IEEE Trans. Signal Process. 48, 2229–
2241 (2000)
4. Costa, M.H., Bermudez, C.M.: A noise resilient variable step-size LMS algorithm. Sig.
Process. 88, 733–748 (2008)
5. Brouse, C., Bumont, G.A., Herrmann, F.J., Ansermino, J.M.: A wavelet approach to detecting
electrocautery noise in the ECG. IEEE Eng. Med. BioI. Mag. 25(4), 76–82 (2006)
6. Leski, J.M., Henzel, N.: ECG baseline wander and power line interference reduction using
nonlinear filter bank. Sig. Process. 85, 781–793 (2005)
7. Meyer, C., Gavela, J.F., Harris, M.: Combining algorithms in automatic detection of QRS
complexes in ECG signals. IEEE Trans. Inf. Technol Biomed. 10(3), 468–475 (2006)
8. Kotas, M.: Application of projection pursuit based robust principal component analysis to
ECG enhancement. Biomed. Signal Process. Control 1, 289–298 (2007)
9. Mihov, G., Dotsinsky, I.: Power-line interference elimination from ECG in case of
non-multiplicity between the sampling rate and the powerline frequency. Biomed. Signal
Process. Control 3, 334–340 (2008)
10. Floris, E., Schlaefer, A., Dieterich, S., Schweikard, A.: A fast lane approach to LMS
prediction of respiratory motion signals. Biomed. Signal Process. Control 3, 291–299 (2008)
11. Li, N., Zhang, Y., Yanling, H., Chambers, J.A.: A new variable step size NLMS algorithm
designed for applications with exponential decay impulse responses. Sig. Process. 88, 2346–
2349 (2008)
12. Xiao, Y.: A new efficient narrowband active noise control system and its performance
analysis. IEEE Trans. Audio Speech Lang. Process. 19(7) (2011)
13. Leigh, G.M.: Fast FIR algorithms for the continuous wavelet transform from constrained least
squares. IEEE Trans. Signal Process. 61(1) (2013)

14. Kozacky, W.J., Ogunfunmi, T.: Convergence analysis of an adaptive algorithm with output
power constraints. IEEE Trans. Circ. Syst. II Express Briefs 61(5) (2014)
15. Das, R.L., Chakraborty, M.: On convergence of proportionate-type normalized least mean
square algorithms. IEEE Trans. Circ. Syst. II 62(5) (2015)
16. Sheetal, Mittal, M.: A Haar wavelet based approach for state analysis of disk drive read
system. Appl. Mech. Mater. 592–594, 2267–2271 (2014)
17. Narula, V. et al.: Assessment of variants of LMS algorithms for noise cancellation in low and
medium frequency signals. In: IEEE Conference on Recent Advancements in Electrical,
Electronics and Control Engineering, pp. 432–437 (2011)
18. [Link]
BIHdatabase)
A Novel Approach for Extracting
Pertinent Keywords for Web Image
Annotation Using Semantic Distance
and Euclidean Distance

Payal Gulati and Manisha Yadav

Abstract The World Wide Web today comprises of billions of Web documents
with information on varied topics presented by different types of media such as text,
images, audio, and video. Therefore along with textual information, the number of
images over WWW is exponentially growing. As compared to text, the annotation
of images by its semantics is more complicated as there is a lack of correlation
between user’s semantics and computer system’s low-level features. Moreover, the
Web pages are generally composed of contents containing multiple topics and the
context relevant to the image on the Web page makes only a small portion of the
full text, leading to the challenge for image search engines to annotate and index
Web images. Existing image annotation systems use contextual information from
page title, image src tag, alt tag, meta tag, image surrounding text for annotating
Web image. Nowadays, some intelligent approaches perform a page segmentation
as a preprocessing step. This paper proposes a novel approach for annotating Web
images. In this work, Web pages are divided into Web content blocks based on the
visual structure of page and thereafter the textual data of Web content blocks which
are semantically closer to the blocks containing Web images are extracted. The
relevant keywords from textual information along with contextual information of
images are used for annotation.

Keywords Image search engine · Annotation · Semantic correlation · Semantic distance measure

P. Gulati
YMCA UST, Faridabad, Haryana, India
e-mail: gulatipayal@[Link]
M. Yadav (&)
RPSGOI, Mahendergarh, Haryana, India
e-mail: manishayadav17@[Link]


1 Introduction

WWW is the largest repository of digital images in the world. The number of
images available over the Web is exponentially growing and will continue to
increase in future. However, as compared to text, the annotation of images by
means of the semantics they depict is much more complicated. Humans can rec-
ognize objects depicted in images, but in computer vision, the automatic under-
standing the semantics of the images is still the perplexing task. Image annotation
can be done either through content-based or text-based approaches. In text-based
approach, different parts of a Web page are considered as possible sources for
contextual information of images, namely image file names (ImgSrc), page title,
anchor texts, alternative text (ALT attribute), image surrounding text. In the
content-based approach, image processing techniques such as texture, shape, and
color are considered to describe the content of a Web image.
Most image search engines index images using the text information associated with them, i.e., on the basis of alt tags and image captions. The alternative (alt) tag provides a textual alternative to non-textual content in Web pages such as images, video, and other media; it basically provides a semantic meaning and description for the embedded images. However, the Web is still replete with images that have missing, incorrect, or poor alternative text. In fact, in many cases images are given only an empty or null alt attribute (alt = " "), and thereby such images remain inaccessible. Image search engines that annotate Web images using content-based annotation have a problem of scalability.
In this work, a novel approach for extracting pertinent keywords for Web image
annotation using semantic distance and Euclidean distance is proposed. Further, this
work proposes an algorithm that automatically crawls the Web pages and extracts
the contextual information from the pages containing valid images. The Web pages
are segmented into Web content blocks and thereafter semantic correlation is cal-
culated between Web image and Web content block using semantic distance
measure. The pertinent keywords from contextual information along with semantic
similar content are then used for annotating Web images. Thereafter, the images are
indexed with the associated text it refers to.
This paper is organized as follows: Sect. 2 discusses the related work done in
this domain. Section 3 presents the architecture of the proposed system. Section 4
describes the algorithm for this approach. Finally, Sect. 5 comprises of the
conclusion.

2 Related Work

A number of text-based approaches for Web image annotation have been proposed
in recent years [1]. There are numerous systems [2–6] that use contextual infor-
mation for annotating Web images. Methods for extracting contextual information
are (i) window-based extraction [7, 8], (ii) structure-based wrappers [9, 10],
(iii) Web page segmentation [11–13].
Window-based extraction is a heuristic approach which extracts image sur-
rounding text; it yields poor results as at times irrelevant data is extracted and
relevant data is discarded. Structure-based wrappers use the structural informa-
tion of Web page to decide the borders of the image context but these are not
adaptive as they are designed for specific design patterns of Web page. Web page
segmentation method is adaptable to different Web page styles and divides the
Web page into segments of common topics, and then each image is associated with
the textual contents of the segment which it belongs to. Moreover, it is difficult to
determine the semantics of text with the image.
In this work, the Web page is segmented into Web content blocks using the vision-based page segmentation algorithm [12]. Thereafter, semantic similarity is calculated between the Web image and each Web content block using a semantic distance measure. Semantic distance is the inverse of semantic similarity [14]; that is, the smaller the distance between two concepts, the more similar they are. So, semantic similarity and semantic distance are used interchangeably in this work.
Semantic distance between Web content blocks is calculated by determining a
common representation among them. Generally, text is used for common repre-
sentation. As per the literature review, there are various similarity metrics for texts
[13, 15, 16]. Some simple metrics are based on lexical matching. These prevailing approaches are successful only to some extent, as they do not capture the semantic similarity of texts. For instance, the terms Plant and Tree have a high semantic correlation which remains unnoticed without background knowledge. To overcome this, the WordNet taxonomy is used as background knowledge [17, 18].
In this work, the word-to-word similarity metric [19] is used to calculate the similarity between words, and text-to-text similarity is calculated using the metric introduced by Corley [20].

3 Proposed Architecture

The architecture of the proposed system is given in Fig. 1. The components of the proposed system are discussed in the following subsections.

3.1 Crawl Manager

Crawl manager is a computer program that takes the seed URL from the URL
queue and fetches the Web page from WWW.

Fig. 1 Proposed architecture

3.2 URL Queue

URL queue is a type of repository which stores the list of URLs that are discovered
and extracted by crawler.

3.3 Parser

Parser is used to extract the information present on Web pages. The parser downloads the Web page and extracts its XML representation. Thereafter, it converts the XML file into DOM object models. It then checks whether valid images are present on the
Web page or not. If a valid image is present on the Web page, then the page is segmented using the visual Web page segmenter; otherwise, the next URL is crawled. The DOM object models which contain the page title of the Web page and the image source and alternative text of the valid images present on the Web page are extracted from the set of object models of the Web page.

3.4 Visual Web Page Segmenter

Visual Web page segmenter is used for the segmentation of Web pages into Web content blocks. Segmentation of a Web page means dividing the page by certain rules or procedures to obtain multiple semantically different Web content blocks whose content can be investigated further.
In the proposed approach, VIPS algorithm [12] is used for the segmentation of
Web page into Web content blocks. It extracts the semantic structure of a Web page
based on its visual representation. The segmentation process has basically three
steps: block extraction, separator detection, and content structure construction.
Blocks are extracted from DOM tree structure of the Web page by using the page
layout structure, and then separators are located among these blocks. The
vision-based content structure of a page is obtained by combining the DOM
structure and the visual cues. Therefore, a Web page is a collection of Web content
blocks that have similar DOC. With the permitted DOC (pDOC) set to its maximum
value, a set of Web content blocks that consist of visually indivisible contents is
obtained. This algorithm also provides the two-dimensional Cartesian coordinates
of each visual block present on the Web page based on their locations on the Web
page.

3.5 Block Analyzer

Block analyzer analyses the Web content blocks obtained from segmentation.
Further, it divides the Web content blocks into two categories: image blocks and
text blocks. Web blocks which contain images are considered as image blocks and
rest are considered as text blocks.

3.6 Nearest Text Block Detector

Nearest text block detector detects the nearest text blocks to an image block. For
checking closeness, Euclidean distance between closest edges of two blocks is
calculated. Distance between two line segments is obtained by using Eq. (1):

Euclidean Distance = \sqrt{ (x_2 - x_1)^2 + (y_2 - y_1)^2 }    (1)

After the distance is calculated between each image block and text block pair, the text blocks whose distance from the image block is below a threshold are assigned to that image block. In this way, each image block is assigned a group of text blocks which are close in distance to that image block.
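As an illustration of this step, the closest-edge distance between two axis-aligned blocks can be computed from the Cartesian coordinates returned by VIPS roughly as follows; the (x, y, width, height) rectangle representation is an assumption made only for this sketch:

import math

def closest_edge_distance(block_a, block_b):
    """Euclidean distance between the closest edges of two axis-aligned blocks.

    Each block is assumed to be a tuple (x, y, w, h) in page coordinates;
    the distance is 0 when the blocks overlap.
    """
    ax, ay, aw, ah = block_a
    bx, by, bw, bh = block_b
    dx = max(bx - (ax + aw), ax - (bx + bw), 0)   # horizontal gap (0 if overlapping)
    dy = max(by - (ay + ah), ay - (by + bh), 0)   # vertical gap (0 if overlapping)
    return math.hypot(dx, dy)                     # Eq. (1) applied to the closest points

def nearest_text_blocks(image_block, text_blocks, threshold):
    """Indices of text blocks whose closest-edge distance to the image block is below threshold."""
    return [i for i, tb in enumerate(text_blocks)
            if closest_edge_distance(image_block, tb) < threshold]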

3.7 Tag Text Extractor

In the proposed approach, the tag text extractor is used for extracting text from the HTML tags. The parser provides the DOM object models by parsing a Web page. If the image present on this Web page is valid, i.e., it is not a button or an icon (which is checked by the valid image checker), then its metadata, such as the image source (ImgSrc) and alternative text (Alt), is extracted. The page title of the Web page which contains this image is also extracted.

3.8 Keyword Extractor

In this work, keyword extractor is used to extract keywords from the metadata of
images and page title. Keywords are stored into a text file which is further used for
obtaining semantically close text blocks by calculating semantic distance.

3.9 Semantic Distance Calculator

Semantic distance calculator is used to determine the semantic correlation among


the Web content blocks. As lexical matching between words does not provide good results, the words are matched with concepts in a knowledge base and concept-to-concept matching is computed using WordNet.
Before computing text similarity between image metadata and text blocks,
preprocessing of the text blocks is done. After preprocessing process, sentence
detection is done. Then tokenization is done, and finally, a part of speech tagging is
done for all the words using NLP. At last, stemming of the terms is done and
thereafter, terms are mapped to WordNet concepts.
The similarity of text is calculated using Corley’s approach. In this method, for
every noun (verb) that belongs to image metadata, the noun (verb) in the text of text
blocks with maximum semantic similarity is identified according to Eq. 2.

sim_{Lin} = \frac{2 \cdot IC(LCS)}{IC(Concept_1) + IC(Concept_2)}    (2)

Here LCS is the least common subsumer of the two concepts in the WordNet
taxonomy, and IC is the information content that measures the specificity for a
concept as follows:

IC(concept) = -\log P(concept)    (3)

In Eq. 3, P(concept) is the probability of occurrence of an instance of concept in


a large corpus. For the classes other than noun (verb), a lexical matching is per-
formed. The similarity between two texts T1 (text of image metadata), T2 (text of
text blocks) is calculated as:
sim(T_1, T_2)_{T_1} = \frac{\sum_{w_i \in T_1} maxSim(w_i, T_2) \cdot idf(w_i)}{\sum_{w_i \in T_1} idf(w_i)}    (4)

where idf(w_i) is the inverse document frequency [19] of the word w_i in a large corpus. A directional similarity score is similarly calculated with respect to T_2. The scores from both directions are combined into a bidirectional similarity as given in Eq. 5:

sim(T_1, T_2) = sim(T_1, T_2)_{T_1} \cdot sim(T_1, T_2)_{T_2}    (5)

This similarity score has a value between 0 and 1. From this similarity score,
semantic distance is calculated as follows:

dist_{sem}(T_1, T_2) = 1 - sim(T_1, T_2)    (6)

In this way, the semantic distance is calculated between the image block and each of its nearest text blocks. The text block with the smallest semantic distance is the one semantically correlated with that image block.
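A rough sketch of Eqs. (4)–(6) is shown below. The word-level similarity is left as a stub (in the paper it is the WordNet-based Lin measure of Eq. (2)), and the idf values are assumed to come from some reference corpus:

def text_similarity(T1, T2, word_sim, idf):
    """Directional similarity of Eq. (4); T1 and T2 are lists of words."""
    if not T1 or not T2:
        return 0.0
    num = sum(max(word_sim(w, w2) for w2 in T2) * idf.get(w, 1.0) for w in T1)
    den = sum(idf.get(w, 1.0) for w in T1)
    return num / den if den else 0.0

def semantic_distance(T1, T2, word_sim, idf):
    """Bidirectional similarity (Eq. 5) turned into a distance (Eq. 6)."""
    sim = text_similarity(T1, T2, word_sim, idf) * text_similarity(T2, T1, word_sim, idf)
    return 1.0 - sim

# Stub word similarity: exact lexical match; the paper uses the Lin measure over WordNet.
lexical_sim = lambda w1, w2: 1.0 if w1 == w2 else 0.0

Under the lexical stub, semantic_distance(['plant'], ['tree'], lexical_sim, {}) evaluates to 1.0, whereas a WordNet-based word similarity would yield a much smaller distance, which is exactly the gap the semantic approach is meant to close.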

3.10 Text Extractor

Text extractor is used to extract text from text blocks present on the Web page. Text
of semantically close text block obtained in the previous step is extracted and
buffered. This text along with the text extracted from image metadata and page title
of Web page is used to extract frequent keywords.

3.11 Keyword Determiner

In this work, the keyword determiner extracts keywords from the buffered text. Frequent keywords are determined by applying a threshold on the frequency count of the keywords: keywords whose frequency is above the threshold are extracted and used for annotating the images.
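A minimal sketch of this step, assuming whitespace tokenization, a small illustrative stop-word list, and a hypothetical frequency threshold:

from collections import Counter

def frequent_keywords(buffered_text, threshold=2, stopwords=frozenset({'the', 'a', 'an', 'of'})):
    # count keyword frequencies in the buffered text and keep those at or above the threshold
    tokens = [t.lower() for t in buffered_text.split() if t.isalpha()]
    counts = Counter(t for t in tokens if t not in stopwords)
    return [word for word, freq in counts.items() if freq >= threshold]

print(frequent_keywords("solar panel installation solar energy panel guide"))   # ['solar', 'panel']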

3.12 Image Annotation

The page title of the Web page, the image source, the alternative text of the image, and the frequent keywords extracted in the previous step together form the annotation that best describes the image.

4 Algorithm

The algorithm for the proposed system is Automatic_Image_Annotation. It takes the URL of a Web page as input and provides descriptions of the images on that page as output.
First, the page is parsed to extract the page title, Img_Src, and Alt text of each image. Second, Web page segmentation is performed using the VIPS algorithm. The validity of each image is then checked, and for valid images the nearest text blocks are found using the algorithm given below. For this list of closer text blocks, the semantic distance is calculated using the bidirectional similarity between blocks. Finally, keywords are extracted from the semantically close text block and used in the image annotation process.

Automatic_Image_Annotation (Description of images)

Begin
  Parse the Web page (URL)
  If the page contains a valid image
    Text1 = Extract Page_Title, Img_Src(valid image), Alt(valid image)
    Web_Page_Segmentation (URL)
    For each valid image_block
      Text_block_list = Find_Nearest_Text_Block (image block Cartesian
                          coordinates, text block Cartesian coordinates)
      least_distance = some_big_number
      For each text_block in Text_block_list
      {
        distance = Find_semantic_distance (text_block, Text1)
        If (distance < least_distance)
        {
          least_distance = distance
          id_text = id of text_block
        }
      }
      Extract keywords from text block (id_text)
    End For
  End If
End

The algorithm for obtaining the nearest text blocks is Find_Nearest_Text_Block. It takes an image block and the text blocks as input and provides a list of the nearest text blocks as output. The algorithm collects the text blocks nearest to an image block on the Web page using the closest-edge Euclidean distance between Web content blocks, computed from their Cartesian coordinates.

Find_Nearest_Text_Block (List of Nearest Text Blocks)

Begin
  For each Text Block
  {
    distance = calculate Euclidean distance
               between image block and text block
    If (distance < threshold)
    {
      Put the id of text block in a list
    }
  }
End
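The sketch below is an assumed rendering of the same idea in Python, not the authors' code: Web content blocks are taken as axis-aligned rectangles given by their Cartesian coordinates, and the closest-edge Euclidean distance between the image block and each text block is compared against a threshold; the coordinates and the threshold value are illustrative.

import math

def closest_edge_distance(a, b):
    # a and b are rectangles (x1, y1, x2, y2); the gap is zero when they overlap
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    dx = max(bx1 - ax2, ax1 - bx2, 0)
    dy = max(by1 - ay2, ay1 - by2, 0)
    return math.hypot(dx, dy)

def find_nearest_text_blocks(image_block, text_blocks, threshold=150):
    # return the ids of text blocks whose closest-edge distance is below the threshold
    return [bid for bid, rect in text_blocks.items()
            if closest_edge_distance(image_block, rect) < threshold]

image = (100, 100, 300, 250)                                  # hypothetical image block
blocks = {'t1': (100, 260, 300, 400), 't2': (800, 100, 1000, 300)}
print(find_nearest_text_blocks(image, blocks))                # ['t1']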

5 Conclusion

This paper presents an algorithm for a novel approach to extracting pertinent keywords for Web image annotation using semantics. In this work, Web images are automatically annotated by determining pertinent keywords from the contextual information of the Web page and from semantically similar content in the Web content blocks. The approach provides better results than the method of image indexing using Web page segmentation and clustering [21], because in the existing method the context of the image is not coordinated with the context of the surrounding text. The proposed approach gives good results as the closeness between an image and the Web content blocks is computed using both Euclidean distance and semantic distance.

References

1. Sumathi, T., Devasena, C.L., Hemalatha, M.: An overview of automated image annotation
approaches. Int. J. Res. Rev. Inf. Sci. 1(1) (2011) (Copyright © Science Academy Publisher,
United Kingdom)
2. Swain, M., Frankel, C., Athitsos, V.: Webseer: an image search engine for the World Wide
Web. In: CVPR (1997)
3. Smith, J., Chang, S.: An image and video search engine for the world-wide web. In: Storage and Retrieval for Image and Video Databases, pp. 84–95 (1997)
4. Ortega-Binderberger, M., Mehrotra, V., Chakrabarti, K., Porkaew, K.: WebMARS: a multimedia search engine. In: SPIE Annual Symposium on Electronic Imaging, San Jose, California (2000)
5. Alexandre, L., Pereira, M., Madeira, S., Cordeiro, J., Dias, G.: Web image indexing: combining image analysis with text processing. In: Proceedings of the 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2004) (2004)
6. Yadav, M., Gulati, P.: A novel approach for extracting relevant keywords for web image
annotation using semantics. In: 9th International Conference on ASEICT (2015)
7. Coelho, T.A.S., Calado, P.P., Souza, L.V., Ribeiro-Neto, B., Muntz, R.: Image retrieval using
multiple evidence ranking. IEEE Trans. Knowl. Data Eng. 16(4), 408–417 (2004)
8. Pan, L.: Image 8: an image search engine for the internet. Honours Year Project Report,
School of Computing, National University of Singapore, April, 2003
9. Liu, B.: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Data-Centric Systems and Applications. Springer (2007)
10. Fauzi, F., Hong, J., Belkhatir, M.: Webpage segmentation for extracting images and their
surrounding contextual information. In: ACM Multimedia, pp. 649–652 (2009)
11. Chakrabarti, D., Kumar, R., Punera, K.: A graphtheoretic approach to webpage segmentation.
In: Proceeding of the 17th International Conference on World Wide Web, WWW’08,
pp. 377–386, New York, USA (2008)
12. Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: VIPS: a vision based page segmentation algorithm.
Technical Report, Microsoft Research (MSR-TR-2003-79) (2003)
13. Hattori, G., Hoashi, K., Matsumoto, K., Sugaya, F.: Robust web page segmentation for
mobile terminal using content distances and page layout information. In: Proceedings of the
16th International Conference on World Wide Web, WWW’07, pp. 361–370, New York, NY,
USA. ACM (2007)
14. Nguyen, H.A., Eng, B.: New semantic similarity techniques of concepts applied in the
Biomedical domain and wordnet. Master thesis, The University of Houston-Clear Lake
(2006)
15. Voorhees, E.: Using WordNet to disambiguate word senses for text retrieval. In: Proceedings
of the 16th Annual International ACM SIGIR Conference (1993)
16. Landauer, T.K., Foltz, P., Laham, D.: Introduction to latent semantic analysis. Discourse
Processes 25 (1998)
17. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: WordNet: An on-line lexical
database. Int. J. Lexicogr. 3, 235–244 (1990)
18. Patwardhan, S., Banerjee, S., Pedersen, T.: Using measures of semantic relatedness for word
sense disambiguation. In: Proceedings of the 4th International Conference on Computational
Linguistics and Intelligent Text Processing, CICLing’03, pp. 241–257. Springer, Berlin,
Heidelberg (2003)
19. Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 36th
Annual Meeting of the Association for Computational Linguistics and 17th International
Conference on Computational Linguistics, vol. 2, ACL-36, pp. 768–774, Morristown, NJ,
USA. Association for Computational Linguistics (1998); Sparck Jones, K.: A Statistical

Interpretation of Term Specificity and Its Application in Retrieval, pp. 132–142. Taylor
Graham Publishing, London, UK (1988)
20. Corley, C., Mihalcea, R.: Measuring the semantic similarity of texts. In: Proceedings of the
ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment,
EMSEE’05, pp. 13–18, Morristown, NJ, USA. Association for Computational Linguistics (2005)
21. Tryfou, G., Tsapatsoulis, N.: Image Indexing Based on Web Page Segmentation and
Clustering (2014)
Classification of Breast Tissue Density
Patterns Using SVM-Based Hierarchical
Classifier

Jitendra Virmani, Kriti and Shruti Thakur

Abstract In the present work, three-class breast tissue density classification has been carried out using an SVM-based hierarchical classifier. The performance of Laws' texture descriptors of various resolutions has been investigated for differentiating between fatty and dense tissues as well as between fatty-glandular and dense-glandular tissues. An overall classification accuracy of 88.2% has been achieved using the proposed SVM-based hierarchical classifier.

Keywords Breast tissue density classification · Texture feature extraction · Hierarchical classifier

1 Introduction

Breast cancer is the most commonly diagnosed disease among women today [1]. It has been shown that high breast tissue density is associated with a high risk of developing breast cancer [2–10]. The mortality rate for breast cancer can be reduced if detection is made at an early stage. Based on its density, breast tissue is broadly classified as fatty, fatty-glandular, or dense-glandular.
Various computer-aided diagnostic (CAD) systems have been developed by
researchers in the past to discriminate between different density patterns, thus
providing the radiologists with a system that can act as a second opinion tool to

J. Virmani (&)
Thapar University, Patiala, Punjab, India
e-mail: [Link]@[Link]
Kriti
Jaypee University of Information Technology, Waknaghat, Solan, India
e-mail: kriti.23gm@[Link]
S. Thakur
Department of Radiology, IGMC, Shimla, Himachal Pradesh, India
e-mail: tshruti878@[Link]


Fig. 1 Sample of mammographic images from MIAS database, a typical fatty tissue ‘mdb132,’
b typical fatty-glandular tissue ‘mdb016,’ c typical dense-glandular tissue ‘mdb216,’ d atypical
fatty tissue ‘mdb096,’ e atypical fatty-glandular tissue ‘mdb090,’ f atypical dense-glandular tissue
‘mdb100’

validate their diagnosis. A few studies have been carried out on the Mammographic Image Analysis Society (MIAS) dataset for classification of breast tissue density patterns into fatty, fatty-glandular, and dense-glandular tissue types [3–10]. Among these, most studies have been carried out on the segmented breast tissue (SBT) and only rarely on fixed-size ROIs [3–10]. Of these studies, Subashini et al. [6] report a maximum accuracy of 95.4% using the SBT approach, and Mustra et al. [9] report a maximum accuracy of 82.0% using the ROI extraction approach.
The experienced participating radiologist (one of the authors of this paper)
graded the fatty, fatty-glandular, and dense-glandular images as belonging to typical
or atypical categories. The sample images of typical and atypical cases depicting
different density patterns are shown in Fig. 1.
In the present work, a hierarchical classifier with two stages for binary classi-
fication has been proposed. This classifier is designed using support vector machine
(SVM) classifier in each stage to differentiate between fatty and dense breast tissues
and then between fatty-glandular and dense-glandular breast tissues using Laws’
texture features.

2 Methodology

2.1 Description of Dataset

The MIAS database consists of a total of 322 mammographic images, of which 106 are fatty, 104 are fatty-glandular, and 112 are dense-glandular [11]. From each image, a fixed-size ROI has been extracted for further processing.

2.2 Selecting Regions of Interest

After conducting repeated experiments, it has been asserted that for classification of
breast density, the center area of the tissue is the optimal choice [12]. Accordingly,
fixed-size ROIs of size 200 × 200 pixels have been extracted from each mammogram as depicted in Fig. 2.

Fig. 2 ROI extraction protocol for mammographic image ‘mdb216’ with ROI marked

2.3 Proposed Method

Computer-aided diagnostic systems involve the analysis of mammograms by computer and can be used by radiologists as a second opinion tool for validating their diagnosis, as these systems tend to improve diagnostic accuracy by detecting lesions that might be missed during subjective analysis [1, 3–10, 13, 14]. The block diagram of the proposed system is shown in Fig. 3.
Feature Extraction Module The texture descriptor vectors (TDVs) derived from
Laws’ texture analysis using Laws’ masks of resolutions 3, 5, 7, and 9 have been
used in the present work for design of SVM-based hierarchical classifier.
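A rough illustration of this feature extraction step is sketched below: the resolution-5 Laws' masks are built as outer products of the standard 1-D kernels, and the mean absolute filter response over the ROI is used as a texture energy feature. The exact statistics, normalisation, and the way the TDVs of length 30 and 75 in Table 1 are assembled are not reproduced here, so this is only a hedged sketch.

import numpy as np
from scipy.ndimage import convolve

KERNELS_5 = {
    'L5': np.array([ 1,  4, 6,  4,  1]),   # level
    'E5': np.array([-1, -2, 0,  2,  1]),   # edge
    'S5': np.array([-1,  0, 2,  0, -1]),   # spot
    'W5': np.array([-1,  2, 0, -2,  1]),   # wave
    'R5': np.array([ 1, -4, 6, -4,  1]),   # ripple
}

def laws_energy_features(roi):
    # convolve the ROI with every 5x5 Laws' mask and keep one energy value per mask
    roi = roi.astype(float) - roi.mean()
    feats = []
    for k1 in KERNELS_5.values():
        for k2 in KERNELS_5.values():
            response = convolve(roi, np.outer(k1, k2), mode='reflect')
            feats.append(np.mean(np.abs(response)))
    return np.array(feats)                 # 25 energy features for the length-5 masks

print(laws_energy_features(np.random.rand(200, 200)).shape)   # (25,)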
Feature Classification Module Support vector machine classifier has been
extensively used for classification of texture patterns in medical images [1, 14–19].
In the present work, two binary SVM classifiers arranged in a hierarchical frame-
work have been used for three-class breast tissue density classification. The SVM
classifier is implemented using LibSVM library [20].
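A minimal sketch of the two-stage hierarchical arrangement is given below; it assumes scikit-learn's SVC (which wraps libsvm) in place of the authors' LibSVM setup, class labels abbreviated as in Tables 1 and 2 ('F', 'FG', 'DG'), and random stand-in feature vectors.

import numpy as np
from sklearn.svm import SVC

class HierarchicalDensityClassifier:
    def __init__(self):
        self.svm1 = SVC(kernel='rbf')      # stage 1: fatty (F) vs dense (FG or DG)
        self.svm2 = SVC(kernel='rbf')      # stage 2: fatty-glandular vs dense-glandular

    def fit(self, X, y):
        y = np.asarray(y)
        self.svm1.fit(X, np.where(y == 'F', 'F', 'D'))
        dense = y != 'F'
        self.svm2.fit(X[dense], y[dense])
        return self

    def predict(self, X):
        stage1 = self.svm1.predict(X)
        out = stage1.astype(object)
        dense = stage1 == 'D'
        if dense.any():
            out[dense] = self.svm2.predict(X[dense])
        return out

X = np.random.rand(12, 25)                 # stand-in texture descriptor vectors
y = ['F', 'FG', 'DG'] * 4
print(HierarchicalDensityClassifier().fit(X, y).predict(X))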

Fig. 3 Proposed classification system

3 Results

Various experiments were conducted to obtain the classification performance of Laws' texture features using the hierarchical classifier built using two stages of binary SVM classifiers.

3.1 Classification Performance of Laws’ Texture Features Using Hierarchical Classifier

In this work, the performance of TDVs derived using Laws’ masks of length 3, 5, 7,
and 9 is evaluated using SVM-based hierarchical classifier. The results obtained are
shown in Table 1.
From Table 1, it can be observed that OCA of 91.3, 93.2, 91.9, and 92.5% is
achieved for TDV1, TDV2, TDV3, and TDV4, respectively, using SVM-1
sub-classifier, and OCA of 92.5, 84.2, 87.0, and 90.7% is obtained for TDV1,
TDV2, TDV3, and TDV4, respectively, using SVM-2 sub-classifier.
The results from Table 1 show that for differentiating between the fatty and
dense breast tissues, SVM-1 sub-classifier gives best performance for features
extracted using Laws’ mask of length 5 (TDV2), and for further classification of
dense tissues into fatty-glandular and dense-glandular classes, SVM-2 sub-classifier
gives best performance using features derived from Laws’ mask of length 3
(TDV1). This analysis of the hierarchical classifier is shown in Table 2. The OCA
for hierarchical classifier is calculated by adding the misclassified cases at each
classification stage.

Table 1 Performance of TDVs derived from Laws’ texture features using hierarchical classifier

TDV (l)      Classifier   CM                  OCA (%)
TDV1 (30)    SVM-1            F    D          91.3
                          F   43   10
                          D    4  104
             SVM-2           FG   DG          92.5
                          FG  48    4
                          DG   4   52
TDV2 (75)    SVM-1            F    D          93.2
                          F   43   10
                          D    1  107
             SVM-2           FG   DG          84.2
                          FG  39   13
                          DG   4   52
TDV3 (30)    SVM-1            F    D          91.9
                          F   43   10
                          D    3  105
             SVM-2           FG   DG          87.0
                          FG  41   11
                          DG   3   53
TDV4 (75)    SVM-1            F    D          92.5
                          F   44    9
                          D    3  105
             SVM-2           FG   DG          90.7
                          FG  46    6
                          DG   4   52
Note TDV: texture descriptor vector, l: length of TDV, CM: confusion matrix, F: fatty class, D: dense class, FG: fatty-glandular class, DG: dense-glandular class, OCA: overall classification accuracy

Table 2 Performance analysis of hierarchical classifier

Classifier  TDV (l)      CM                  CA (%)           Misclassified cases   OCA (%)
SVM-1       TDV2 (75)        F    D          93.2 (11/161)    19/161                88.2
                         F   43   10
                         D    1  107
SVM-2       TDV1 (30)       FG   DG          92.5 (8/108)
                         FG  48    4
                         DG   4   52
Note TDV: texture descriptor vector, l: length of TDV, CM: confusion matrix, F: fatty class, D: dense class, FG: fatty-glandular class, DG: dense-glandular class, CA: classification accuracy, OCA: overall classification accuracy

4 Conclusion

From the exhaustive experiments carried out in the present work, it can be concluded that Laws' masks of length 5 yield the maximum classification accuracy of 93.2% for differential diagnosis between the fatty and dense classes, and Laws' masks of length 3 yield the maximum classification accuracy of 92.5% for differential diagnosis between the fatty-glandular and dense-glandular classes. Further, for the three-class problem, a single multi-class SVM classifier would construct three different binary SVM sub-classifiers, each trained to separate a pair of classes, with the decision made by a majority voting technique. In the hierarchical framework, the classification can be done using only two binary SVM sub-classifiers.

References

1. Kriti, Virmani, J., Dey, N., Kumar, V.: PCA-PNN and PCA-SVM based CAD systems for
breast density classification. In: Hassanien, A.E., et al. (eds.) Applications of Intelligent
Optimization in Biology and Medicine, vol. 96, pp. 159–180. Springer (2015)
2. Wolfe, J.N.: Breast patterns as an index of risk for developing breast cancer. Am.
J. Roentgenol. 126(6), 1130–1137 (1976)
3. Blot, L., Zwiggelaar, R.: Background texture extraction for the classification of mammo-
graphic parenchymal patterns. In: Proceedings of Conference on Medical Image
Understanding and Analysis, pp. 145–148 (2001)
4. Bosch, A., Munoz, X., Oliver, A., Marti, J.: Modeling and classifying breast tissue density in
mammograms. In: Computer Vision and Pattern Recognition, IEEE Computer Society
Conference, 2, pp. 1552–1558. IEEE Press, New York (2006)
5. Muhimmah, I., Zwiggelaar, R.: Mammographic density classification using multiresolution
histogram information. In: Proceedings of 5th International IEEE Special Topic Conference
on Information Technology in Biomedicine (ITAB), pp. 1–6. IEEE Press, New York (2006)
6. Subashini, T.S., Ramalingam, V., Palanivel, S.: Automated assessment of breast tissue density
in digital mammograms. Comput. Vis. Image Underst. 114(1), 33–43 (2010)
7. Tzikopoulos, S.D., Mavroforakis, M.E., Georgiou, H.V., Dimitropoulos, N., Theodoridis, S.:
A fully automated scheme for mammographic segmentation and classification based on breast
density and asymmetry. Comput. Methods Programs Biomed. 102(1), 47–63 (2011)
8. Li, J.B.: Mammographic image based breast tissue classification with kernel self-optimized
fisher discriminant for breast cancer diagnosis. J. Med. Syst. 36(4), 2235–2244 (2012)
9. Mustra, M., Grgic, M., Delac, K.: Breast density classification using multiple feature selection. Automatika 53(4), 362–372 (2012)
10. Silva, W.R., Menotti, D.: Classification of mammograms by the breast composition. In:
Proceedings of the 2012 International Conference on Image Processing, Computer Vision,
and Pattern Recognition, pp. 1–6 (2012)
11. Suckling, J., Parker, J., Dance, D.R., Astley, S., Hutt, I., Boggis, C.R.M., Ricketts, I.,
Stamatakis, E., Cerneaz, N., Kok, S.L., Taylor, P., Betal, D., Savage, J.: The mammographic
image analysis society digital mammogram database. In: Gale, A.G., et al. (eds.) Digital
Mammography. LNCS, vol. 1069, pp. 375–378. Springer, Heidelberg (1994)
12. Li, H., Giger, M.L., Huo, Z., Olopade, O.I., Lan, L., Weber, B.L., Bonta, I.: Computerized
analysis of mammographic parenchymal patterns for assessing breast cancer risk: effect of
ROI size and location. Med. Phys. 31(3), 549–555 (2004)

13. Kumar, I., Virmani, J., Bhadauria, H.S.: A review of breast density classification methods. In:
Proceedings of 2nd IEEE International Conference on Computing for Sustainable Global
Development (IndiaCom-2015), pp. 1960–1967. IEEE Press, New York (2015)
14. Virmani, J., Kriti.: Breast tissue density classification using wavelet-based texture descriptors.
In: Proceedings of the Second International Conference on Computer and Communication
Technologies (IC3T-2015), vol. 3, pp. 539–546 (2015)
15. Virmani, J., Kumar, V., Kalra, N., Khandelwal, N.: A comparative study of computer-aided
classification systems for focal hepatic lesions from B-mode ultrasound. J. Med. Eng.
Technol. 37(44), 292–306 (2013)
16. Virmani, J., Kumar, V., Kalra, N., Khandelwal, N.: SVM-based characterization of liver
ultrasound images using wavelet packet texture descriptors. J. Digit. Imaging 26(3), 530–543
(2013)
17. Virmani, J., Kumar, V., Kalra, N., Khandelwal, N.: Characterization of primary and
secondary malignant liver lesions from B-mode ultrasound. J. Digit. Imaging 26(6),
1058–1070 (2013)
18. Virmani, J., Kumar, V., Kalra, N., Khandelwal, N.: SVM-based characterization of liver
cirrhosis by singular value decomposition of GLCM matrix. Int. J. Artif. Intel. Soft Comput. 3
(3), 276–296 (2013)
19. Virmani, J., Kumar, V., Kalra, N., Khandelwal, N.: PCA-SVM based CAD system for focal
liver lesions from B-Mode ultrasound. Defence Sci. J. 63(5), 478–486 (2013)
20. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27:1–27:27 (2011)
Advances in EDM: A State of the Art

Manu Anand

Abstract The potential of data mining in academics is discussed in this paper. Various fields of data mining can be used to enhance educational institutional services and improve student performance: increasing grades and retention rates, maintaining attendance, giving prior information on examination eligibility based on attendance, evaluating results from marks, and predicting how many students have enrolled in which course. This paper discusses one such aspect, in which distinction has been predicted from the marks scored by MCA students of Bharati Vidyapeeth Institute of Computer Applications and Management, affiliated to GGSIPU, using various machine learning algorithms; it has been observed that the Boost algorithm outperforms the other machine learning models in the prediction of distinction.

Keywords Distinction prediction · Machine learning · Boost algorithm

1 Introduction

Every sector and every organization maintains large amounts of data in its databases, and powerful tools are designed to analyze these data; as a result, mining the data yields golden chunks of “knowledge.” Data mining is also known by other names such as knowledge mining, knowledge discovery from databases, knowledge extraction, and pattern analysis, and it is carried out using various data mining algorithms like classification, association, clustering, and prediction. The data mining approach plays a pivotal role in decision support systems (DSS).
Educational data mining [1] is promising as a research area with a collection of
computational and psychological methods to understand how students learn. EDM

M. Anand (&)
Bharati Vidyapeeth Institute of Computers Application and Management (BVICAM),
New Delhi, India
e-mail: [Link]@[Link]


develops methods and applies techniques from statistics, machine learning, and data
mining to analyze data collected during teaching and learning [2–4]. In this paper,
machine learning classification models have been used to predict the distinction of students from the marks of nine subjects scored by MCA third-semester students in their end-term exams of GGSIPU. The subjects whose marks are considered in the student dataset used for predicting distinction are theory of computation, computer graphics, Java programming, data communication and networking, C# programming, computer graphics laboratory, Java programming laboratory, C# programming laboratory, and general proficiency. There are 112 records in total, with 15 input variables. Some of the considered features, such as the marks of all subjects, percentage, and distinction, have higher importance in predicting distinction than others such as name and roll number. The student dataset is used by various machine learning models, namely the decision tree model, AdaBoost model, SVM model, linear model, and neural network model [5].

2 Materials and Methods

2.1 Dataset and Its Features

In total, 112 student records having 15 input variables have been considered for
data analysis. These student records are the original results of MCA students of the
third semester in 2014, at Bharati Vidyapeeth Institute of Computer Applications
and Management from GGSIPU. Table 1 describes the subjects associated with
each code in the student dataset. In Table 2, student dataset sample is shown.
Table 3 shows the derived features (total marks, percentage, and distinction) for sample records. The total marks scored by every student are calculated, the percentage is then evaluated, and depending upon the percentage the distinction is set to 0 or 1.

Table 1 Features description
Feature   Subject title
201       Theory of computation
203       Computer graphics
205       Java programming
207       Data communication and networking
209       C# programming
251       Computer graphics laboratory
253       Java programming laboratory
255       C# programming laboratory
261       General proficiency

Table 2 Sample dataset


Roll_No. Name 201 203 205 207 209 251 253 255 261
311604413 SANGEETA SINHA 55 60 67 53 66 78 84 80 81
411604413 PAWAN 50 66 62 73 60 80 83 80 78
511604413 GIRISH SHANKAR 33 63 62 72 52 78 86 82 78
611604413 TANU 55 76 75 82 73 78 89 84 70

Table 3 Derived features for sample records
Roll_No.     Total   Percentage   Distinction
311604413    624     69.33333     0
411604413    632     70.22222     0
511604413    606     67.33333     0
611604413    682     75.77778     1

2.2 Feature Measurement

In this analysis of distinction prediction, some basic calculations have been performed on the student dataset using a simple method. First, the total marks are evaluated by summing the marks scored by each student, followed by the calculation of the percentage for every record. The distinction is then marked as 0 or 1 based on the percentage the student has scored: if the percentage is greater than or equal to 75%, the distinction is marked as 1; otherwise it is marked as 0.
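A small sketch of this feature measurement step, assuming the nine subject columns of Table 1 and marks out of 100 per subject (the column names and the two sample rows are taken from Table 2):

import pandas as pd

subjects = ['201', '203', '205', '207', '209', '251', '253', '255', '261']
df = pd.DataFrame([[55, 60, 67, 53, 66, 78, 84, 80, 81],      # roll no. 311604413
                   [55, 76, 75, 82, 73, 78, 89, 84, 70]],     # roll no. 611604413
                  columns=subjects)

df['Total'] = df[subjects].sum(axis=1)                         # summation of all marks
df['Percentage'] = df['Total'] / (len(subjects) * 100) * 100   # nine papers of 100 marks each
df['Distinction'] = (df['Percentage'] >= 75).astype(int)       # 1 if percentage >= 75%, else 0
print(df[['Total', 'Percentage', 'Distinction']])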

3 Methodology

The methodology is described in Fig. 1. In the first step, the results of third-semester MCA students are taken as the primary data for the prediction of distinction. Next, features such as total marks, percentage, and distinction are evaluated, with distinction taken as the target value. Extra fields not required for distinction prediction are then removed. Of the 15 input variables, 9 are passed as input for evaluation, and different algorithms are applied to the student dataset. In the fifth step, the different models are trained and tested on the dataset with their default parameters. Finally, the models are evaluated on accuracy and sensitivity.

Fig. 1 Methodology used: student data collection → feature measurement → data cleansing → distinction prediction using various algorithms → training and testing of the predictive models → result analysis

3.1 AdaBoost Algorithm

Adaptive boosting is a machine learning algorithm used in combination with other learning algorithms to improve their performance; the final output of the boosted classifier is a weighted sum of the outputs of the other learning algorithms (the “weak learners”).
Algorithm: Adaptive Boost pseudocode
Set uniform example weights.
for each base learner do
  Train base learner with weighted sample.
  Test base learner on all data.
  Set learner weight with weighted error.
  Set example weights based on ensemble predictions.
end for
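A hedged sketch of training such a boosted classifier follows, using scikit-learn's AdaBoostClassifier (with decision stumps as the default weak learners) on synthetic stand-in marks rather than the authors' Rattle setup:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(50, 100, size=(112, 9))               # stand-in for the nine subject marks
y = (X.sum(axis=1) / 900 * 100 >= 75).astype(int)      # distinction flag as defined in the paper

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = AdaBoostClassifier(n_estimators=50, random_state=1)
model.fit(X_tr, y_tr)
print('test accuracy:', model.score(X_te, y_te))
print('feature importances:', model.feature_importances_)

The feature_importances_ attribute is one way of reading a per-feature importance from the trained ensemble, in the spirit of the next subsection.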

3.2 Feature Importance Using AdaBoost Algorithm

AdaBoost is used both to train the boosted classifier and to find the importance of each feature. The basic representation of the Boost classifier is as follows:

F_T(x) = \sum_{t=1}^{T} f_t(x)

where each f_t is a weak learner that takes an object x as input and returns a real-valued result indicating the class of the object. The predicted object class is identified by the sign of the weak learner's output.

For each sample in the training set, an output or hypothesis is produced by the weak learner. While executing this algorithm, the main focus is on minimizing the error of the resulting t-stage Boost classifier, which is achieved by the selection of the weak learner and by assigning it a coefficient \alpha_t:

E_t = \sum_i E[F_{t-1}(x_i) + \alpha_t h(x_i)]

Here, F_{t-1}(x) is the boosted classifier that has been built up to the previous stage of training, E(F) is some error function, and f_t(x) = \alpha_t h(x) is the weak learner that is being considered for addition to the final classifier.

3.3 Machine Learning Methods

Four machine learning models [5, 6] have been used for distinction prediction. Rattle, in which all these models are available, has been used. The idea behind each model is discussed below:
(a) Decision tree: It uses a recursive partitioning approach [7].
(b) Support vector machine: SVM searches for so-called support vectors, i.e., data points that lie at the edge of an area in space forming the boundary between one class of points and another.
(c) Linear model: Covariance analysis, single-stratum analysis of variance, and regression are evaluated using this model.
(d) Neural net: It is based on the idea of multiple layers of neurons connected to each other, feeding the numeric data through the network and combining the numbers to produce a final answer (Table 4).

Table 4 Parameters used for different algorithms


Model Parameter setting
Decision tree Min split = 30, max depth = 40, min bucket = 7
SVM Kernel radial basis
LM Multinomial
Neural net Hidden layer nodes
Boost Max depth = 30, min split = 20, no. of trees = 50

4 Model Evaluation

Classifier performance has been measured using various parameters such as accuracy, sensitivity, and ROC. Sensitivity S_i for class i is defined as the number of patterns correctly predicted to be in class i with respect to the total number of patterns in class i. Consider p classes and a confusion matrix C of size p × p, where C_{ij} represents the number of patterns of class i predicted as class j; then accuracy and sensitivity can be calculated as follows:
Accuracy = \frac{\sum_{i=1}^{p} C_{ii}}{\sum_{i=1}^{p} \sum_{j=1}^{p} C_{ij}}

Sensitivity_i = \frac{C_{ii}}{\sum_{j=1}^{p} C_{ij}}
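A short sketch of these two formulas applied to a p × p confusion matrix (the example matrix is illustrative and not taken from the results below):

import numpy as np

def accuracy(C):
    return np.trace(C) / C.sum()

def sensitivity(C):
    return np.diag(C) / C.sum(axis=1)      # one sensitivity value per class

C = np.array([[50, 5],
              [8, 49]])                    # rows: true class, columns: predicted class
print(accuracy(C), sensitivity(C))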

5 Experimental Result

This section presents the analysis of the prediction results of all four machine learning classification models on the testing dataset. In Figs. 2 and 4, the precision parameter has been plotted for the ada and SVM models and verified for all other models.

Fig. 2 Precision versus recall using ada algorithm

Fig. 3 Sensitivity versus specificity using ada algorithm

In Figs. 3 and 5, sensitivity has been observed. Another parameter, predicted versus observed, has also been plotted (Fig. 6).

Fig. 4 Precision versus recall using svm algorithm

Fig. 5 Sensitivity versus specificity using svm algorithm

Fig. 6 Predicted versus observed using ada as well as svm

6 Conclusion and Future Scope

In this work, various machine learning classification models are explored with input
variables to predict the distinction of students. The result indicates that Boost
Algorithm outperforms the other classification models.

This work can be extended by including more parameters, for example by combining the distinction and attendance records of students to test whether a good attendance record actually enhances academic performance; different categories of students based on these parameters can then be formed and used with classification models. In this paper, individual models are applied to the dataset to predict distinction; an ensemble approach could also be used.

References

1. Ayesha, S., Mustafa, T., Sattar, A., Khan, I.: Data mining model for higher education system.
Eur. J. Sci. Res. 43(1), 24–29 (2010)
2. Shovon, Md.H.I.: Prediction of student academic performance by an application of K-means
clustering algorithm. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 2(7) (2012)
3. Sharma, T.C.: WEKA approach for comparative study of classification algorithm (IJARCCE).
Int. J. Adv. Res. Comput. Commun. Eng. 2(4) (2013)
4. Kumar, V., Chadha, A.: An empirical study of the applications of data mining techniques in
higher education. Int. J. Adv. Comput. Sci. Appl. 2(3), 80–84 (2011)
5. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
6. Keerthi, S.S., Gilbert, E.G.: Convergence of a generalized SMO algorithm for SVM classifier
design. Mach. Learn. 46(1), 351–360 (2002)
7. Fernandez Caballero, J.C., Martinez, F.J., Hervas, C., Gutierrez, P.A.: Sensitivity versus
accuracy in multiclass problem using Memetic Pareto evolutionary neural networks. IEEE
Trans. Neural Netw. 21, 750–770 (2010)
Proposing Pattern Growth Methods
for Frequent Pattern Mining on Account
of Its Comparison Made
with the Candidate Generation and Test
Approach for a Given Data Set

Vaibhav Kant Singh

Abstract Frequent pattern mining is a very important field for the mining of association rules. Association rule mining is an important data mining technique meant to extract meaningful information from the large data sets accumulated as a result of various data processing activities. Several algorithms have been proposed to solve the frequent pattern mining problem. In this paper, we compare the two most widely used approaches, the candidate generation and test approach and the pattern growth approach, to determine the better approach for a given data set. We conclude that pattern growth methods are more efficient in most cases for frequent pattern mining on account of their cache-conscious behavior. We have taken a data set and implemented both algorithms on it; the experimental results for the given data set show that the pattern growth approach is more efficient than the candidate generation and test approach.


Keywords Association rule mining · Candidate generation and test · Data mining · Frequent pattern mining · Pattern growth methods

This research work is a small part of supplementary work done by the author alongside his base work on RSTDB, an AI-supported candidate generation and test algorithm.

V. K. Singh (&)
Department of Computer Science and Engineering, Institute of Technology,
Guru Ghasidas Vishwavidyalaya, Bilaspur, Chhattisgarh, India
e-mail: vibhu200427@[Link]


1 Introduction

With the increase in the use of computers, we are accumulating huge amounts of data every day as a result of various data processing activities. These data can be used to gain competitive advantage in the current scenario, where the time available to take important decisions has gone down. Today there is a growing need for systems that can help decision makers make valuable decisions on the basis of patterns extracted from historical data in the form of reports, graphs, etc. The branch of computer science concerned with this subject is data mining. Data mining combines the human power of detecting patterns with the computational power of computers to generate patterns. It finds its utility in the design of decision support systems that are efficient enough to help decision makers make valuable decisions on time.
Data mining is used for a wide range of applications such as finance, database marketing, health insurance, medicine, bioinformatics, text mining, and biodefense. Several data mining tools, such as Intelligent Miner, PRW, Enterprise Miner, Darwin, and Clementine, help the designer build such systems. Some of the basic techniques used for data mining are association rule mining, clustering, classification, frequent episodes, deviation detection, neural networks, genetic algorithms, rough set techniques, support vector machines, etc. Each of these techniques has associated algorithms used to implement the paradigm. In this paper, we are concerned with association rule mining.

1.1 Association Rule Mining

An association rule is an expression of the form X → Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to also contain Y. An association rule is basically characterized by two measures:
• Confidence.
• Support.
Some of the algorithms used for finding association rules from large data sets include the Apriori algorithm, the partition algorithm, the Pincer-Search algorithm, the dynamic item set counting algorithm, and the FP-tree growth algorithm. In this paper, we take into account two types of approaches for mining association rules: the candidate generation and test approach and the pattern growth approach. Both approaches are concerned with the extraction of frequent patterns from large data sets, that is, patterns that occur frequently in the transactions of a given data set. Frequent pattern mining is an important field of association rule mining.
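As a minimal sketch of the two measures (the helper names and the toy transactions are illustrative only): the support of an item set is the fraction of transactions containing it, and the confidence of X → Y is support(X ∪ Y) divided by support(X).

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    return support(set(X) | set(Y), transactions) / support(X, transactions)

T = [{'B', 'D', 'F', 'G'}, {'E'}, {'E', 'F', 'G'}, {'C', 'E', 'G'}]
print(support({'E', 'G'}, T))              # 0.5
print(confidence({'G'}, {'E'}, T))         # 0.666...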

2 Literature Survey

In [1], Han et al. proposed the FP-tree as a pattern growth approach. In [2], Agarwal
et al. showed the power of using transaction projection in conjunction with lexi-
cographic tree structure in order to generate frequent item sets required for asso-
ciation rules. In [3], Zaki et al. proposed an algorithm that utilizes the structural properties of frequent item sets to facilitate fast discovery. In [4], Agarwal and Srikant proposed two fundamentally different algorithms for frequent pattern mining; the empirical evaluation in the paper showed that the proposed algorithms proved to be handy compared to previously proposed approaches. In [5], Toivonen proposed a new algorithm that reduces the database activity in mining and is efficient at finding association rules in a single database. In [6], Savasere et al. proposed an algorithm which showed improvement in the input–output overhead associated with previous algorithms; this feature proved to be handy for many real-life database mining scenarios. In [7], Burdick et al. proposed an algorithm whose component-wise breakdown showed that parent equivalence pruning and dynamic reordering were quite beneficial in reducing the search space, while the relative compression of vertical bitmaps increases the vertical scalability of the algorithm and reduces the cost of counting supports. In [8], Zaki and Gouda proposed a novel vertical representation, Diffset, which drastically reduces the amount of space required to store intermediate results. In [9, 10], the author proposed a new algorithm, RSTDB, which also works on the candidate generation and test mechanism; it has a new module that makes it more efficient than the previous approach. In [11], a study of some of the pattern growth methods along with a description of the new RSTDB algorithm for frequent pattern mining is given. In [12], the RSTDB algorithm is compared with the FP-tree growth algorithm. In [13], a cache-conscious approach for frequent pattern mining is given. In [14, 15], a candidate generation and test approach for frequent pattern mining is given. In [16], RSTDB is proposed as an application.

3 Experiment

To compare the two approaches, we have taken two algorithms: Apriori as a representative of the candidate generation and test mechanism and FP-tree as a representative of the pattern growth approach, since these two algorithms best explain the two approaches. In this section, we use Table 1 to explain the difference between the two algorithms. Table 1 contains 15 records over 9 distinct items listed in lexicographic order.

Table 1 Transaction database consisting of 15 records and 9 distinct items

Transaction ID   Item set
T100             BDFG
T101             E
T102             EFG
T103             CEG
T104             BDH
T105             BC
T106             H
T107             DEG
T108             ACEG
T109             CEG
T110             BCI
T111             AEFH
T112             BCD
T113             EFH
T114             BFGI

In this section, we consider the transaction database of Table 1, which holds 15 records over the 9 distinct items A, B, C, D, E, F, G, H, I. We examine the working of both algorithms in order to evaluate FP-tree, on the side of the pattern growth methods, as the more cache-conscious effort. The minimum support for the transaction database is fixed at 20%, i.e., an item set must occur in at least 3 transactions to be frequent. Figure 1 shows the working of the two approaches on the given data set, following Han and Kamber [17].
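For readers who want to reproduce the candidate generation and test side of the comparison, the sketch below (a simplified Apriori, not the exact implementation used here) counts the frequent item sets of Table 1 with the minimum support count of 3:

from itertools import combinations

transactions = [frozenset(t) for t in
                ['BDFG', 'E', 'EFG', 'CEG', 'BDH', 'BC', 'H', 'DEG',
                 'ACEG', 'CEG', 'BCI', 'AEFH', 'BCD', 'EFH', 'BFGI']]
MIN_SUP = 3                                            # 20% of 15 transactions

def apriori(transactions, min_sup):
    # level-wise candidate generation and test
    candidates = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent = {}
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        candidates = []
        for a, b in combinations(level, 2):            # join step
            c = a | b
            if len(c) == len(a) + 1 and c not in candidates and \
               all(frozenset(s) in level for s in combinations(c, len(c) - 1)):
                candidates.append(c)                   # prune: every subset must be frequent
    return frequent

for itemset, count in sorted(apriori(transactions, MIN_SUP).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(''.join(sorted(itemset)), count)

Under the 20% threshold, this should yield the frequent single items B, C, D, E, F, G, H, the pairs BC, BD, CE, CG, EF, EG, FG, and the triple CEG.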

4 Result and Discussion

The total number of counters required for finding the frequent item sets in the Apriori algorithm, a candidate generation and test approach, is larger. More counters imply that the total amount of space required is greater in the case of the Apriori approach. Also, to generate the candidate set at each step of the Apriori algorithm, the whole transaction database has to be scanned, whereas in the case of FP-tree, once the f-list is generated, each record has to be scanned only once. Besides the above advantages, the FP-tree structure can be accommodated in memory and fast referencing is possible. The comparison made here is for a small data set and already shows a difference in efficiency; in a real-world scenario where the size is very large, the efficiency of FP-tree would be far better than that of Apriori.

Fig. 1 Comparison of the working of the two algorithms in terms of space consumed for evaluation of frequent patterns: Apriori algorithm (a candidate generation and test algorithm) versus FP-tree algorithm (a pattern growth approach)

5 Conclusion

Today's data mining algorithms prefer to keep the data set required for computation in main memory in order to achieve better performance. From Table 2, it is clear that the amount of space required by the Apriori algorithm goes against this need of the current scenario, where time is of prime importance. From the above example of the working of Apriori and FP-tree, it is clear that the execution of Apriori requires a considerably larger amount of space compared to FP-tree. Thus, FP-tree, employing a tree structure that easily fits into the cache, can be the better option despite its added complexity. We therefore propose FP-tree as a cache-conscious effort, since the size of the tree is small enough to reside in memory for the complete computation. As already discussed, we have taken FP-tree as a representative of the pattern growth algorithms and Apriori as a representative of the candidate generation and test algorithms; the result obtained is

Table 2 Comparison of Apriori and FP-tree

Differentiating property         Apriori                                   FP-tree
Space required for processing    More space required for processing        Less space required for processing
Time required for operation      More time required for processing         Less time required for processing
TDB scanning required            More scanning of TDB required for         Less scanning of TDB required for
                                 generating final result                   generating final result
Complexity of programming        Less complex                              More complex
Understandability                Simple to understand                      Compared to Apriori, difficult to understand

meant to give a general idea of the two approaches. The result guides us to choose the pattern growth approach when selecting an algorithm for frequent pattern mining, which is also clear from its working, being more cache conscious than the other. Besides the facts derived from the working of the two approaches, some positives of the Apriori algorithm are its simplicity of implementation, which is very important, and the fact that approaches similar to candidate generation and test are easy to understand.

References

1. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In:
Proceeding of the 2000 ACM SIGMOID International Conference on Management of Data,
pp. 1–12. ACM Press (2000)
2. Agarwal, R.C., Aggarwal, C.C., Prasad, V.V.V.: A tree projection algorithm for generation of
frequent item sets. J. Parallel Distrib. Comput. 61(3), 350–371 (2001)
3. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: New algorithms for fast discovery of
association rules. In: Proceeding of the 3rd International Conference on Knowledge
Discovery and Data Mining, pp. 283–296. AAAI Press (1997)
4. Agarwal, R., Srikant, R.: Fast algorithms for mining association rules. In: The International
Conference on Very Large Databases, pp. 207–216 (1993)
5. Toivonen, H.: Sampling large databases for association rules. In: Proceeding of 22nd Very
Large Database Conference (1996)
6. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining association rules
in large databases. In: Proceeding of the 21st Very Large Database Conference (1995)
7. Burdick, D., Calimlim, M., Gehrke, J.: MAFIA: a maximal frequent itemset algorithm for
transaction databases. In: Proceeding of ICDE’01, pp. 443–452 (2001)
8. Zaki, M., Gouda, K.: Fast vertical mining using diffsets. In: Journal of ACM SIGKDD’03,
Washington, D.C. (2003)
9. Singh, V.K., Singh, V.K.: Minimizing space time complexity by RSTDB a new method for
frequent pattern mining. In: Proceeding of the First International Conference on Human
Computer Interaction IHCI’09, Indian Institute of Information Technology, Allahabad,
pp. 361–371. Springer, India (2009). ISBN 978-81-8489-404-2

10. Singh, V.K., Shah, V., Jain, Y.K., Shukla, A., Thoke, A.S., Singh, V.K., Dule, C., Parganiha,
V.: Proposing an efficient method for frequent pattern mining. In: Proceeding of International
Conference on Computational and Statistical Sciences, Bangkok, WASET, vol. 36, pp. 1184–
1189 (2009). ISSN 2070-3740
11. Singh, V.K., Shah, V.: Minimizing space time complexity in frequent pattern mining by
reducing database scanning and using pattern growth methods. Chhattisgarh J. Sci. Technol.
ISSN 0973-7219
12. Singh, V.K.: Comparing proposed test algorithm RSTDB with FP-tree growth method for
frequent pattern mining. Aryabhatt J. Math. Inf. 5(1), 137–140 (2013). ISSN 0975-7139
13. Singh, V.K.: RSTDB and cache conscious techniques for frequent pattern mining. In:
CERA-09, Proceeding of Fourth International Conference on Computer applications in
Electrical Engineering, Indian Institute of Technology, Roorkee, India, pp. 433–436, 19–21
Feb 2010
14. Singh, V.K., Singh, V.K.: RSTDB a new candidate generation and test algorithm for frequent
pattern mining. In: CNC-2010, ACEEE and IEEE, IEEE Communication Society,
Washington, D.C., Proceeding of International Conference on Advances in Communication
Network and Computing, Published by ACM DL, Calicut, Kerala, India, pp. 416–418, 4–5
Oct 2010
15. Singh, V.K., Singh, V.K.: RSTDB a candidate generation and test approach. Int. J. Res.
Digest, India 5(4), 41–44 (2010)
16. Singh, V.K.: Solving management problems using RSTDB a frequent pattern mining
technique. In: Int. J. Adv. Res. Comput. Commun. Eng., India 4(8), 285–288 (2015)
17. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publisher,
San Francisco, CA (2001)
A Study on Initial Centroids Selection
for Partitional Clustering Algorithms

Mahesh Motwani, Neeti Arora and Amit Gupta

Abstract Data mining tools and techniques allow an organization to make creative
decisions and subsequently do proper planning. Clustering is used to determine the
objects that are similar in characteristics and group them together. The K-means clustering method chooses random cluster centres (initial centroids), one for each cluster, and this is the major weakness of K-means: its performance and the quality of its clusters strongly depend on the initial guess of centres (centroids). By augmenting
K-means with a technique of selecting centroids, several modifications have been
suggested in research on clustering. The first two main authors of this paper have
also developed three algorithms that unlike K-means do not perform random
generation of the initial centres and actually produce same set of initial centroids for
the same input data. These developed algorithms are sum of distance clustering
(SODC), distance-based clustering algorithm (DBCA) and farthest distributed
centroid clustering (FDCC). We present a brief survey of the algorithms available in
the research on modification of initial centroids for K-means clustering algorithm
and further describe the developed algorithm farthest distributed centroid clustering
in this paper. The experimental results carried out show that farthest distributed
centroid clustering algorithm produces better quality clusters than the partitional
clustering algorithm, agglomerative hierarchical clustering algorithm and the hier-
archical partitioning clustering algorithm.

Keywords Partitional clustering · Initial centroids · Recall · Precision

M. Motwani (&)
Department of Computer Science and Engineering, RGPV, Bhopal, India
e-mail: [Link].7@[Link]
N. Arora  A. Gupta
RGPV, Bhopal, India


1 Introduction

K-means [1–4] is a famous partitional algorithm which clusters n data points into k groups. It defines k centroids, one for each cluster. For this, k data points are selected at random from the dataset D as initial centroids.
The K-means algorithm has a drawback that it produces different clusters for
every different set of initial centroids. Thus, the quality of clusters formed depends
on the randomly chosen set of initial k centroids [5]. This drawback of K-means is
removed by augmenting K-means with some technique of selecting initial cen-
troids. In Sect. 2, we discuss the different such modifications published in the literature on the modified K-means algorithm; these proposed algorithms do not perform random generation of the initial centres and do not produce different results for the same input data. In Sect. 3, we discuss the farthest distributed centroid clustering algorithm, followed by its experimental results in Sect. 4. Section 5 contains the conclusion, and finally, Sect. 6 contains the bibliography.

2 Study on Initial Centroid Selection for Clustering Algorithms

An algorithm proposed in [6] places the given set of points to be clustered in a cell. This cell is divided into two cells with the help of a cutting plane. The cells are
divided one at a time till they are equal to the number of clusters to be formed. The
k initial centroids for the K-means algorithm are the centres of these k cells. The
method proposed in [7] analyses the distribution and probability of data density to
find the initial centroids. The given input dataset is divided into smaller parts in
order to minimize the memory requirement needed for storing the given dataset.
This results in superior clustering results from the refined initial centroids in
comparison to randomly chosen initial centroids in K-means algorithm.
The global K-means clustering algorithm in [8] adds cluster centres incrementally. It also performs a number of executions of the K-means algorithm equal to the size of the data set. An improved version of K-means proposed in [9]
evaluates the distance between every pair of data points in order to determine the
similar data points. Based on this determination of data points, initial centroids are
finally chosen. This method creates better clusters in comparison to K-means
algorithm.
The method in [10] distributes the initial centroids in order to widen the dis-
tances among them. An efficiently enhanced K-means method proposed in [11] uses
a heuristic function. This function is chosen in a manner to perform less number of
distance calculations. This helps in improving the execution time of the K-means
algorithm. In this method, the determination of initial centroids is similar to that of
the K-means algorithm. Thus, after every execution of the algorithm, it produces
different clusters just like in the K-means algorithm. A smaller time complexity

results in this method in comparison to K-means algorithm. By embedding hierarchical clustering into K-means clustering, a new algorithm [12] is proposed for finding the initial cluster centres.
The algorithm proposed in [13] selects the position of initial centroids calculated
from the farthest accumulated distance metric between each data point and all
previous centroids. A data point which has the maximum distance is thus selected
as initial centroid. In the paper [14], the method of selection of the first initial
centroid is from the given set of points which are going to be clustered. The
remaining initial centroids are selected from rest of the points. The probability of
selecting each of this point is proportional to square of the distance between this
point and its nearest initial centroid.
The initialization method proposed in [15] depends on reverse nearest neighbour
(RNN) search and coupling degree. RNN takes a query point p and fetches all
points in a given dataset whose nearest neighbour is p. The coupling degree
between neighbourhoods of nodes is defined as the amount of similarity between
the objects. This method deals with three main sets: candidate set (CS), represen-
tative set (RS) and centroids set. The centroid initialization method proposed in [16]
is based on the graph theory of Kruskal algorithm [17]. This algorithm uses the
Kruskal algorithm to generate the minimum spanning tree of the all clustering
objects at first and then deletes (k − 1) edges according to the order of their
weights. The K-mean values of objects in each connected sub-graph that is obtained
are then taken as the initial clustering centres. Simulation experiment shows that
this method gives better accuracy and clustering quality in comparison with
K-means algorithm.
The method in [18] partitions the dataset into blocks so as to compress the
dataset. The K-means is applied on this compressed dataset, to get the k points that
are used as the initial centroids for the K-means on the full dataset. The method in
[19] constructs a Voronoi diagram from the given data points. For a set of n points
in a m-dimensional space, the Voronoi diagram [20] of these set of points is the
division of the space into n cells in such a way that each point belongs to only one
cell. The initial centroids are chosen from the points which lie on the Voronoi
circles.
Teaching learning-based Optimization algorithm proposed in [21] has two steps.
The first step explores the search space and finds the near-optimal cluster centres
which are then evaluated by a function. The cluster centres with the minimum
function values are taken in the second step as initial centroids for the K-means
algorithm. The algorithm proposed in [22] defines two measures for refining initial
clustering centres, to measure a point’s local density. The clustering centre with
local maximal density for each cluster is produced with the help of these mea-
surements. After refinement, the K-means clustering algorithm converges to a better
local minimum. Experiments demonstrate the efficiency of the proposed algorithm.
The method in [23] takes the data and finds the high-density neighbourhoods from
it. The central points of the neighbourhoods are now chosen as initial centroids for
clustering. The high-density neighbourhoods are found by neighbourhood-based

clustering [24]. A filtering method is used to avoid too many clusters from being
formed.
A heuristic method is used in [25] to find a good set of initial centroids. The
method uses a weighted average score of dataset. The rank score is found by
averaging the attribute of each data point. This generates initial centroids that
follow the data distribution of the given set. A sorting algorithm is applied to the
score of each data point and divided into k subsets. The nearest value of mean from
each subset is taken as initial centroid. This algorithm produces the clusters in less
time as compared to K-means. A genetic algorithm for the K-means initialization
(GAKMI) is used in [26] for the selection of initial centroids. The set of initial
centroids is represented by a binary string of length n. Here n is the number of
feature vectors. The GAKMI algorithm uses binary encoding, in which bits set to
one select elements of the learning set as initial centroids. A chromosome repair
algorithm is used before fitness evaluation to convert infeasible chromosomes into
feasible chromosomes. The GAKMI algorithm results in better clusters as compared
to the standard K-means algorithm.
Sum of distance clustering (SODC) [27] algorithm for clustering selects initial
centroids using criteria of finding sum of distances of data objects to all other data
objects. The algorithm uses the concept that good clusters are formed when the
choice of initial k centroids is such that they are as far as possible from each other.
The proposed algorithm results in better clustering on synthetic as well as real
datasets when compared to the K-means technique. Distance-based clustering
algorithm (DBCA) [28] is based on computing the total distance of a node from all
other nodes. The clustering algorithm uses the concept that good clusters are formed
when the choice of initial k centroids is such that they are as far as possible from
each other. Once some point d is selected as initial centroid, the proposed algorithm
computes average of data points to avoid the points near to d from being selected as
next initial centroids.
The farthest distributed centroid clustering (FDCC) algorithm [29] uses the
concept that good clusters are formed when the choice of initial k centroids is such
that they are as far as possible from each other. FDCC algorithm proposed here uses
criteria of sum of distances of data objects to all other data objects. Unlike
K-means, FDCC algorithm does not perform random generation of the initial
centres and produces same results for the same input data. DBCA and FDCC
clustering algorithms produce better quality clusters than the partitional clustering
algorithm, agglomerative hierarchical clustering algorithm and the hierarchical
partitioning clustering algorithm.
3 Developed Farthest Distributed Centroid Clustering Algorithm

3.1 Basic Concept

The algorithm selects a good set of initial centroids such that the selected initial
centroids are spread out within the data space as far as possible from each other.
Figure 1 illustrates the selection of four initial centroids C1, C2, C3 and C4. As is
evident, there are four clusters in the data space. The proposed technique selects a
point d as the first initial centroid using a distance criterion explained in Sect. 3.2.
Once this point d is selected as initial centroid, the proposed technique avoids the
points near to d from being selected as next initial centroids. This is how C1, C2,
C3 and C4 are distributed as far as possible from each other.

3.2 Farthest Distributed Centroid Clustering Algorithm

Suppose the n data points in the given dataset D are to be clustered into k clusters.
In the farthest distributed centroids clustering (FDCC) algorithm, the distance of each data point di (i = 1 to n) in the given dataset D from all other data points is calculated, and these distances are stored in a distance matrix DM. The total distance of each data point di (i = 1 to n) with respect to all other data points is calculated. The total distance for a
point di is sum of all elements in the row of the DM corresponding to di. These
sums are stored in a sum of distances vector SD. The vector SD is sorted in
decreasing order of total distance values. Let P-SD be the vector of data points
corresponding to the sorted vector SD; i.e., P-SD [1] will be the data point whose
sum of distances (available in SD [1]) from all other data points is maximum, and

Fig. 1 Illustration of selecting a good set of initial centroids

P-SD [2] will be the data point whose sum of distances (available in SD [2]) from
all other data points is second highest. In general, P-SD [i] will be the data point
whose sum of distances from all other data points is in SD [i].
The first point d of the vector P-SD is the first initial centroid. Put this initial
centroid point in the set S of initial centroids. To avoid the points near to d from
being selected as next initial centroids, variable x is defined as follows:

x = floor(n/k)    (1)

Here, the floor (n/k) function maps the real number (n/k) to the largest previous
integer; i.e., it returns the largest integer not greater than (n/k).
Now discard the next x points of the vector P-SD and take the first point remaining after them as the second initial centroid. Again discard the next x points and take the first remaining point as the third initial centroid. This process is repeated until k initial centroids are defined. These k initial centroids are then used in place of the k random initial centroids: K-means is invoked to cluster the dataset D into k clusters using the initial centroids available in set S.
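To make the selection procedure concrete, a minimal Python sketch of the steps above is given below. This is our own illustration: the function name, the use of Euclidean distance and the NumPy-based implementation are assumptions of ours, not the authors' MATLAB code.

import numpy as np

def fdcc_initial_centroids(D, k):
    # D: (n, d) array of data points; returns up to k well-separated initial centroids
    n = len(D)
    DM = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)   # distance matrix DM
    SD = DM.sum(axis=1)                                          # sum-of-distances vector SD
    P_SD = np.argsort(-SD)               # point indices sorted by decreasing total distance
    x = n // k                           # x = floor(n/k), Eq. (1)
    # first centroid is the farthest point; then skip x points after each pick
    picks = [P_SD[i * (x + 1)] for i in range(k) if i * (x + 1) < n]
    return D[picks]

# usage sketch
D = np.random.rand(200, 2)
centroids = fdcc_initial_centroids(D, k=4)

The returned points can then be supplied as user-defined seeds to a K-means implementation that accepts an initial centroid matrix (for example, scikit-learn's KMeans with init set to this array and n_init=1), in place of random initialization.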

4 Experimental Study

The experiments are performed on core i5 processor with a speed of 2.5 GHz and
4 GB RAM using MATLAB. The comparison of the quality of the clustering
achieved with FDCC algorithm [29] is made with the quality of the clustering
achieved with
1. Partitional clustering technique. The K-means is the partitional technique
available as a built-in function in MATLAB [30].
2. Hierarchical clustering technique. ClusterData is an agglomerative hierarchical clustering technique available as a built-in function in MATLAB [30]. Single linkage is the default option used by ClusterData to create the hierarchical cluster tree.
3. Hierarchical partitioning technique. CLUTO [31] is a software package for
clustering datasets. CLUTO contains both partitional and hierarchical clustering
algorithms. The repeated bisections method available in CLUTO is a hierar-
chical partitioning algorithm that initiates a series of k − 1 repeated bisections to
produce the required k clusters. This is effectively the bisecting K-means divisive clustering algorithm and is the default option in CLUTO, referred to here as Cluto-rb.
Recall and precision [32] are used to evaluate the quality of the clustering achieved. Recall is the percentage of data points that have been correctly put into a cluster among all the relevant points that should have been in that cluster. Precision is the percentage of data points that have been correctly put into a cluster among all the points put into that cluster.

Table 1 Average recall and average precision for real datasets

Algorithm      Corel5K                  Corel (Wang)             Wine
               Recall(%)  Precision(%)  Recall(%)  Precision(%)  Recall(%)  Precision(%)
FDCC           30.02      38.2          100        100           41.67      72
K-means        20.02      33.3          60.7       86.71         33.67      61
ClusterData    10.08      41            100        100           25.37      47
Cluto-rb       23.02      30.32         33.29      27.91         39.67      69

Fig. 2 Average recall for real datasets (bar chart comparing FDCC, K-means, ClusterData and Cluto-rb)

Fig. 3 Average precision for real datasets (bar chart comparing FDCC, K-means, ClusterData and Cluto-rb)
Three real datasets, namely Corel5K, Corel (Wang) and Wine, are used in the
experiments [33]. Corel5K is a collection of 5000 images downloaded from
Website [34]. We have formed ten clusters of these images using K-means,
ClusterData, Cluto-rb and FDCC algorithm. Corel (Wang) database consists of 700
images of the Corel stock photo and is downloaded from Website [35] and contains
1025 features per image. We have formed ten clusters of these images using
K-means, ClusterData, Cluto-rb and FDCC algorithm. The wine recognition dataset
[36] consists of quantities of 13 constituents found in three types of classes of wines
of Italy. The dataset consists of 178 instances. The number of instances in class 1,
class 2 and class 3 are 59, 71 and 48, respectively. We have formed three clusters of
this dataset using K-means, ClusterData, Cluto-rb and the FDCC algorithm.
The average recall and precision obtained with these algorithms are shown in Table 1. The graphs of average recall and average precision (in percentage) for these techniques are shown in Figs. 2 and 3, respectively. The results show that the recall and precision of FDCC are better than those of K-means, ClusterData and Cluto-rb. Hence, the FDCC algorithm produces better quality clusters.

5 Conclusion

The initialization of the k centroids determines the quality of the clustering produced by K-means, and the random selection of the k initial centroids is a weakness of K-means. Several algorithms have been developed to overcome this weakness. These
algorithms select a good set of initial centroids to be used by K-means. We have
discussed several such algorithms available in the literature. We have also discussed
farthest distributed centroid clustering algorithm developed by the first two authors
of this paper. The experimental results show that farthest distributed centroid
clustering algorithm produces better quality clusters than the partitional clustering
algorithm, agglomerative hierarchical clustering algorithm and the hierarchical
partitioning clustering algorithm.

References

1. Han, J., Kamber, H.: Data Mining Concepts and Techniques. Morgan Kaufmann Publishers,
Burlington (2002)
2. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations.
In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability,
Berkeley, University of California Press, pp. 281–297 (1967)
3. Dunham, M.: Data Mining: Introductory and Advanced Concepts. Pearson Education,
London (2006)
4. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137
(1982)
5. Khan, S.S., Ahmed, A.: Cluster center initialization algorithm for k-means algorithm. Pattern
Recogn. Lett. 25(11), 1293–1302 (2004)
6. Deelers, S., Auwatanamongkol, S.: Engineering k-means algorithm with initial cluster centers
derived from data partitioning along the data axis with the highest variance. In: Proceedings of
World Academy of Science, Engineering and Technology, vol. 26, pp. 323–328 (2007)
7. Bradley, P.S., Fayyad, U.M.: Refining initial points for K-Means clustering. In: Proceedings
of the 15th International Conference on Machine Learning, Morgan Kaufmann, San
Francisco, CA, pp. 91–99 (1998)
8. Likas, A., Vlassis, N., Verbeek, J.J.: The global k-means clustering algorithm. Pattern
Recogn. 36, 451–461 (2003)
9. Yuan, F., Meng, Z.H., Zhang, H.X., Dong, C.R.: A new algorithm to get the initial centroids.
In: Proceedings of the 3rd International Conference on Machine Learning and Cybernetics,
Shanghai, pp. 26–29 (2004)
10. Barakbah, A.R., Helen, A.: Optimized K-means: an algorithm of initial centroids optimization
for K-means. In: Proceedings of Soft Computing, Intelligent Systems and Information
Technology (SIIT), pp. 63–66 (2005)
11. Fahim, A.M., Salem, A.M., Torkey, F.A., Ramadan, M.A.: An efficient enhanced k-means
clustering algorithm. J. Zhejiang Univ. Sci. 7(10), 1626–1633 (2006)
12. Barakbah, A.R., Arai, K.: Hierarchical K-means: an algorithm for centroids initialization for
K-means. Rep. Fac. Sci. Eng. 36(1) (2007) (Saga University, Japan)
13. Barakbah, A.R, Kiyoki, Y.: A pillar algorithm for K-means optimization by distance
maximization for initial centroid designation. In: IEEE Symposium on Computational
Intelligence and Data Mining (IDM), Nashville-Tennessee, pp. 61–68 (2009)
14. Arthur, D., Vassilvitskii, S.: K-means++: the advantages of careful seeding. In: Proceedings
of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA,
Society for Industrial and Applied Mathematics, pp. 1027–1035 (2007)
15. Ahmed, A.H., Ashour, W.: An initialization method for the K-means algorithm using RNN
and coupling degree. Int. J. Comput. Appl. (0975–8887) 25(1) (2011)
16. Huang, L., Du, S., Zhang, Y., Ju, Y., Li, Z.: K-means initial clustering center optimal
algorithm based on Kruskal. J. Inf. Comput. Sci. 9(9), 2387–2392 (2012)
17. Kruskal, J.: On the shortest spanning subtree and the travelling salesman problem. Proc. Am.
Math. Soc, 48–50 (1956)
18. Fahim, A.M., Salem, A.M., Torkey, F.A., Ramadan, M.A, Saake, G.: An efficient K-means with
good initial starting points. Georgian Electron. Sci. J. Comput. Sci. Telecommun. 19(2) (2009)
19. Reddy, D., Jana, P.K.: Initialization for K-mean clustering Voronoi diagram. In: International
Conference on C3IT-2012, Hooghly, Procedia Technology (Elsevier) vol. 4, pp. 395–400,
Feb 2012
20. Preparata, F.P., Shamos, M.I.: Computational Geometry—An Introduction. Springer, Berlin,
Heidelberg, Tokyo (1985)
21. Naik, A., Satapathy, S.C., Parvathi, K.: Improvement of initial cluster center of c-means using
teaching learning based optimization. Procedia Technol. 6, 428–435 (2012)
22. Yang, S.Z., Luo, S.W.: A novel algorithm for initializing clustering centers. In: Proceedings
of International Conference on IEEE Machine Learning and Cybernetics, China, vol. 9,
pp. 5579–5583 (2005)
23. Ye, Y., Huang, J., Chen, X., Zhou, S., Williams, G., Xu, X.: Neighborhood density method
for selecting initial cluster centers in K-means clustering. In: Advances in Knowledge
Discovery and Data Mining. Lecture Notes in Computer Science, vol. 3918, pp. 189–198
(2006)
24. Zhou, S., Zhao, J.: A neighborhood-based clustering algorithm. In: PAKDD 2005, LNAI 3518, pp. 361–371 (2005)
25. Mahmud, M.S., Rahman, M., Akhtar, N.: Improvement of K-means clustering algorithm with
better initial centroids based on weighted average. In: 7th International Conference on
Electrical and Computer Engineering, Dhaka, Bangladesh, Dec 2012
26. Kwedlo, W., Iwanowicz, P.: Using genetic algorithm for selection of initial cluster centers for
the K-means method. In: International Conference on Artificial Intelligence and Soft
Computing. Springer Notes on Artificial Intelligence, pp. 165–172 (2010)
27. Arora, N., Motwani, M.: Sum of distance based algorithm for clustering web data. Int.
J. Comput. Appl. 87(7), 26–30 (2014)
28. Arora, N., Motwani, M.: A distance based clustering algorithm. Int. J. Comput. Eng. Technol.
5(5), 109–119 (2014)
29. Arora, N., Motwani, M.: Optimizing K-Means by fixing initial cluster centers. Int. J. Curr.
Eng. Technol. 4(3), 2101–2107 (2014)
30. MathWorks MatLab: The Language of Technical Computing (2009)
31. Karypis, G.: CLUTO: A Clustering Toolkit. Release 2.1.1, Tech. Rep. No. 02-017. University
of Minnesota, Department of Computer Science, Minneapolis, MN 55455 (2003)
32. Kowalski, G.: Information Retrieval Systems—Theory and Implementation. Kluwer
Academic Publishers (1997)
33. Veenman, C.J., Reinders, M.J.T., Backer, E.: A maximum variance cluster algorithm. IEEE
Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)
34. Duygulu, P., et al.: Object recognition as machine translation: learning a lexicon for a fixed
image vocabulary. In: Proceedings of the 7th European Conference on Computer Vision,
pp. 97–112 (2002)
35. Wang, J.Z., Li, J., Wiederhold, G.: SIMPLIcity: semantics-sensitive integrated matching for
picture libraries. IEEE Trans. Pattern Anal. Mach. Intell. 23(9), 947–963 (2001)
36. Forina, M., Aeberhard, S.: UCI Machine Learning Repository (1991)
A Novel Rare Itemset Mining Algorithm
Based on Recursive Elimination

Mohak Kataria, C. Oswald and B. Sivaselvan

Abstract Pattern mining in large databases is a fundamental and non-trivial task in data mining. Most of the current research focuses on frequently occurring patterns, even though less frequently/rarely occurring patterns provide useful information in many real-time applications (e.g., in medical diagnosis and genetics). In this paper, we propose a novel algorithm for mining rare itemsets using a recursive elimination (RELIM)-based method. Simulation results indicate that our approach performs more efficiently than the existing solution in terms of the time taken to mine the rare itemsets.

Keywords Apriori · Frequent pattern mining · Rare itemset mining · RELIM · Maximum support

1 Introduction

With huge influx of data in every real-world application, data analysis becomes
important and data mining helps in effective, efficient, and scalable analysis by
uncovering many hidden associations among data which otherwise cannot be
interpreted and is useful at the same time. Data mining is the non-trivial process of
extraction of hidden, previously unknown and potentially useful information from
large databases [1, 2]. It differs from retrieval tasks in that knowledge (patterns) can be discovered through data mining. Pattern mining, a basic data mining task, enables the extraction of hidden patterns from sets of data records called

transactions. The various data mining techniques involve association rule mining
(ARM), classification, clustering, and outlier analysis. ARM is the process of
finding frequent itemsets/patterns, associations, correlations among sets of items or
objects in transactional databases, relational databases, and other information
repositories [1].
The motivation for ARM emerged from market basket analysis, where a basket is the collection of items purchased by a customer in an individual transaction [2], for example during a customer's visit to a grocery store or an online purchase from [Link]. Huge collections of such transactions are gathered, and the transaction database is analysed to find frequently occurring sets of items, or itemsets, that appear together. Frequent pattern (itemset) mining (FPM) is an
important and non-trivial phase in ARM, followed by rule generation [1]. Let I = {i1, i2, i3, ..., im} be a set of items and TD = ⟨T1, T2, T3, ..., Tn⟩ a transaction database, where Ti (i ∈ [1..n]) is a transaction containing a set of items in I. The support of a pattern X, where X is a set of items, is the number of transactions in TD containing X. A pattern (itemset) X is frequent if its support is not less than a user-defined minimum support (min supp = α). FPM algorithms concentrate on mining all possible frequent patterns in the transactional database.
The second phase is relatively straightforward compared to the first phase.
Algorithms for ARM have primarily focused on the first phase as a result of the
potential number of frequent itemsets being exponential in the number of different
items, although the actual number of frequent itemsets can be much smaller. Thus,
there is a need for algorithms that are scalable. Many efficient algorithms have been
designed to address these criteria, the first of which was Apriori [2]. It uses the prior knowledge that "all non-empty subsets of a frequent itemset must also be frequent." A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The itemsets X and Y are called the antecedent and the consequent of the rule, respectively. A rule which satisfies a minimum confidence threshold is said to be an interesting rule. The rule X ⇒ Y has confidence c, where

c = \frac{\text{Number of transactions containing both } X \text{ and } Y}{\text{Number of transactions containing } X}

The paper is organized as follows. Related studies are presented in Sect. 2.


Section 3 presents the proposed algorithm. Section 4 explains the proof of cor-
rectness of our algorithm. Details of the datasets, results, and performance dis-
cussion are given in Sect. 5. Section 6 concludes with further research directions.

1.1 Rare Itemset Mining

The current literature on mining is focused primarily on frequent itemsets only. But
rare itemsets too find their place of high interest sometimes, especially in cases of
biomedical databases. The above description applies to a vast range of mining


situations where biomedical data are involved [3]. For instance, in pharmacovigi-
lance, one is interested in associating the drugs taken by patients to the adverse
effects the latter may present as a result (safety signals or drug interactions in the
technical language of the field). A nice motivation given by Szathmary et al. is
given below [4]. Given a database of drug effects on patients, rare itemset mining
helps us in decoding the adverse effects of drugs on patients which may be rare but
drastic. Mining on such data for frequent itemsets provides us with associations
such as “{drug} [ {A},” where {A} is an itemset describing a kind of desirable
effect. These frequent associations mean that a drug is working in the expected way, whereas a rare association represents an abnormal, possibly undesirable, effect of the drug. Discovering rare itemsets and rare association
rules deriving from rare itemsets is particularly useful in the field of biology and
medicine. Rare events deserve special attention because they have their own
importance in a database full of hidden knowledge. In these examples, there lies the
motivation of extracting the hidden rare itemsets within a database.
An itemset X = {x1, ..., xn} is a rare itemset (RI) if 0 < support(X) ≤ maximum support (max supp). An itemset X is a non-present itemset (NPI) if support(X) = 0. Moreover, all rare itemsets are infrequent, but not all infrequent itemsets (IFIs) are rare, i.e., RI ⊆ IFI. min supp differentiates between frequent itemsets (FIs) and infrequent itemsets (IFIs). So the max supp used for finding RIs will be less than the min supp used for finding FIs; in fact 0 < max supp < min supp. If max supp = 0, the set of RIs will be ∅, and if max supp = min supp − 1, then RI = IFI. If we take a relatively higher value for max supp, then the cardinality of the set of RIs will be larger compared to a smaller value of max supp. Mining all possible rare itemsets is termed rare itemset mining.
If X is an RI, then any superset X′ ⊇ X is in {RI, NPI}, while the subsets of X may or may not be RIs. For example, if X = {a, b, c} is an RI with a support of 1, and {a, b, c, d} is present in the database while {a, b, c, e} is not, then {a, b, c, d} ⊇ X is an RI and its support will also be 1, whereas {a, b, c, e} ⊇ X is not present in the database, so its support will be 0. {a, b} is a subset of X, but it may or may not be an RI, i.e., support({a, b}) ≥ support(X). The proposed work employs efficient strategies to mine rare itemsets in a time-efficient manner using the RELIM algorithm, designed for FIM by Borgelt [5]. The paper focuses on exhaustive listing of all the RIs present in the database subject to a value of max supp.
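The definition can be illustrated by a brute-force sketch that enumerates every itemset occurring in a toy database of our own and keeps those with 0 < support(X) ≤ max supp (taken here as an absolute count). The proposed algorithm of Sect. 3 avoids such exhaustive enumeration; this is only a reference implementation of the definition.

from itertools import combinations

def rare_itemsets(transactions, max_supp):
    # all itemsets X with 0 < support(X) <= max_supp (absolute count)
    items = sorted(set().union(*transactions))
    rare = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            s = sum(1 for t in transactions if set(cand) <= t)
            if 0 < s <= max_supp:          # present, but at most max_supp times
                rare[cand] = s
    return rare

TD = [{'a', 'b', 'c'}, {'a', 'c'}, {'a', 'd'}, {'b', 'c', 'e'}]   # toy database (ours)
print(rare_itemsets(TD, max_supp=1))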

2 Related Work

The scarce literature on the subject of rare itemset mining exclusively adapts the
general levelwise framework of pattern mining around the seminal Apriori algo-
rithm to various forms of the frequent pattern algorithms like FP-growth, ECLAT,
H-mine, Counting Inference, RELIM [2, 5–8]. A detailed survey of seminal
FP-based algorithms can be seen in [9]. These methods provide a large itemset search space and associations which are not frequent, but such associations will be incomplete, either due to restrictive definitions or due to high computational cost. Hence, as argued in [10], these methods are not able to collect a huge number of potentially interesting rare patterns. As a remedy, we put forward a novel and simple approach toward efficient and complete extraction of RIs. Mining for RIs has received increased attention from researchers, and in the recent past a few algorithms have been proposed.
Apriori-inverse algorithm by Y. S. Koh et al. uses the basic Apriori approach for
mining sporadic rules (rules having itemsets with low support value and high
confidence) [11]. This algorithm was the seminal work for RI mining. Laszlo
Szathmary et al. proposed two algorithms, namely MRG-Exp, which finds minimal rare itemsets (mRIs) with a naive Apriori approach, and a rare itemset miner algorithm (ARIMA), which retrieves all RIs from the mRIs [4]. The Rarity algorithm by
Luigi Troiano et al. implements a variant Apriori strategy [12]. It first identifies the
longest rare itemsets on the top of the power set lattice and moves downward the
lattice pruning frequent itemsets and tracks only those that are confirmed to be rare.
Laszlo Szathmary et al. have proposed Walky-G algorithm which finds mRIs using
a vertical mining approach [3]. The major limitation of the existing algorithms is
that they work with the assumption that all itemsets having support <min supp form
the set of RIs. This notion states that all IFIs are RIs, while this is not the case. All
IFIs may or may not be RIs, but all RIs are IFIs. RIs are those IFIs which are
relatively rare.
The other limitation of the present algorithms is that they propose to find mRIs only; none of the algorithms except ARIMA takes into consideration the exhaustive generation of the set of RIs. As these algorithms use the Apriori levelwise
approach, so generating the exhaustive set of RIs would be a very space and time
expensive approach as is evident from the paper by Luigi Troiano et al., in which
the authors took subsets of the benchmark datasets to analyze results of ARIMA
and rarity algorithms [12]. Since the number of RIs produced is very large, the
memory usage and execution time are high for such exhaustive RI generation. The
proposed algorithm overcomes the repeated scan limitation of Apriori-based
approaches based on the recursive elimination (RELIM) algorithm.

3 Proposed Rare Itemset Mining Algorithm

We introduce the RELIM algorithm for exhaustive generation of rare itemsets from
given database. Like RELIM for frequent itemsets, this algorithm follows the same
representation for transactions. In the first scan of the database, the frequencies of
each unique item present in each transaction of database are calculated. The items
are sorted in ascending order according to their frequencies. Items having same
frequencies can be in any order. The relim prefix list for each item is initialized with
their support value taken to be 0 and suffix-set list to ∅. The next step is to reorder
the items in each transaction according to the ascending order got in the previous
step. This step takes one more pass over the database. After sorting the transaction,
P = prefix[T′]; i.e., the first item of the sorted transaction T′ is taken as its prefix and the remaining items are taken as its suffix-set. Then, for each transaction, the suffix-set is inserted into the suffix-set list of prefix P in the relim prefix list.
After populating the relim prefix list with all of the sorted transactions, iteration
through the relim prefix list is done in ascending order of the frequencies of each unique item found in step 1. For each prefix P in the relim prefix list, and for
each suffix-set S(i) in the suffix-set list S of P, generate all of the possible subsets
except empty set and store the subsets prefixed with P in a hashmap along with its
support value. If a generated subset already exists in hashmap, its associated value
is incremented. Now let P′ = prefix(S(i)) be the prefix item of the suffix-set S(i). The suffix-set S(i) is removed from the suffix-set list of prefix P. Let S′(i) = suffix(S(i)); S′(i) is added to the suffix-set list of prefix P′ in the relim prefix list, and the support of P′ is incremented by 1. Once the iteration through the suffix-set list of prefix P is over, all the candidate rare itemsets with prefix P are available in the hashmap, and on pruning the hashmap, the rare itemsets prefixed with P are generated. Subset generation is done based on
itemsets to optimize on implementation.

3.1 Illustration

After scanning database and counting frequencies of unique items, seven unique
items in sorted order remain, as shown in Fig. 1. Then, a prefix list of seven items is
generated with an initial support of 0. In the next pass, the items in each transaction are sorted in the order of their frequencies, and for each sorted transaction T, suffix-set[T] is inserted into the suffix-set list of prefix[T] in the relim prefix list. In this case, the first transaction {a, d, f} is converted to {f, a, d}, and {a, d}, being the suffix-set of the transaction, is stored in the suffix-set list of prefix f. The relim prefix list is populated with all the transactions, and the recursive elimination of each prefix is then carried out.

Fig. 1 Illustration for RELIM-based RI mining

Algorithm 1 RELIM-based RI: efficient pruning-based RELIM-based mining of rare patterns

Input: D, max supp
Output: RIs with support ≤ max supp
Method:
Scan D once and retrieve all 1-length itemsets with their frequencies in Fi.
Sort Fi in ascending order of frequency to get F′.
for f ∈ F′ do
    P ← prefix(f); P.support ← 0
    P.suffixsetlist ← ∅
end for
call sort transaction and add to prefixlist()
for each prefix P ∈ PREFIXLIST do
    candidate[P] ← P.support
    for each suffix-set SS ∈ P.suffixsetlist do
        call calc support candidate subsets(SS, P)
        P′ ← prefix(SS)
        P.suffixsetlist ← P.suffixsetlist − SS
        P.support−−; SS′ ← suffix(SS)
        if SS′ ≠ ∅ then
            P′.suffixsetlist ← P′.suffixsetlist ∪ SS′
            P′.support++
        end if
    end for
    RP ← prune candidatelist()   // set of RIs with prefix P
end for

Procedure sort transaction and add to prefixlist(): find the sorted transaction and add it to the prefix list

Method:
for each transaction T ∈ D do
    T′ ← ∅
    for each item I ∈ F′ do
        if I ∈ T then
            T′ ← T′ ∪ {I}
        end if
    end for
    P′ ← prefix(T′); SS′ ← suffix(T′)
    P′.suffixsetlist ← P′.suffixsetlist ∪ SS′
    P′.support++
end for

The first prefix is g with a support
(number of suffix-sets in the suffix-set list) of 1 and the only suffix-set being {e, c,
b}. Now all possible subsets of this suffix-set are generated except empty set which
are {b}, {c}, {e}, {e, b}, {e, c}, {b, c}, {e, c, b}, and their supports are calculated
and are inserted in the candidate list. Candidate list is pruned, and RIs are retained.
In this case, all the subsets generated of prefix g are rare.
After generating the RIs for g, an extension for prefix g is created: all the suffix-sets of g are treated as transactions and inserted into an extended relim prefix list for prefix g. The only suffix-set of g is {e, c, b}.

Procedure calc support candidate subsets(SS, P): find the support of candidate subsets

Method:
sub ← ∅
sub ← generate subsets(SS)   // generate all non-empty subsets of the items in the suffix-set SS
for each set S ∈ sub do
    candidate[P ∪ S]++
end for

Procedure prune candidatelist(): prune the candidate list of rare itemsets

Method:
R ← ∅
m ← count of RIs
for each itemset c in the candidate list do
    if candidate[c] ≤ max supp × number of transactions then
        m++; R ← R ∪ {c}
    end if
end for
Return R

So prefix[{e, c, b}] = e, and suffix-set[{e, c, b}] is {c, b}. Hence, suffix-set {c,
b} is added to prefix e suffix-set list, and support for e is incremented to 1 in the
extended relim prefix list for prefix g. After populating this extended relim prefix
list, the suffix-sets in the extended prefix list are added to the corresponding relim
prefix list and the support values are added from the extended prefix list to the relim
prefix list. So {c, b} is added to the suffix-set list of e, and support for e is
incremented. This process happens for every prefix in the main relim prefix list.
4 Proof of Correctness

Lemma 1 The ascending-order sorting of transactions leads to the generation of the RIs with prefix P collectively.
Proof After sorting transactions in an ascending order of frequencies of unique
itemsets, we ensure in relim data structure that all transactions with prefix P are in
the suffix-set list of P. So generating all subsets of each suffix-set of P will generate
candidate itemsets with prefix P only, and the sorting ensures that P will not be a part of any suffix-set of subsequent prefixes.
Lemma 2 Let X = {x1, ..., xn} be a rare itemset; then any superset X′ ⊇ X is also a rare itemset or a non-present itemset.
Proof If support(X) = s and Y is a superset of X, then support(Y) ≤ s; the range of support(Y) is 0 ≤ support(Y) ≤ s. So a superset of an RI is either rare, or its support is 0 and it is non-present. This follows from the monotonicity property of the Apriori lattice.

5 Simulation Results and Discussion

Simulation experiments are performed on an Intel Core i5-3230 M CPU 2.26 GHz
with 4 GB main memory and 500 GB hard disk on Ubuntu 13.04 OS Platform and
are implemented using C++ 11 standard. The standard frequent itemset mining
(FIMI) dataset is taken for simulation [13]. Results presented in this section are over
Mushroom dataset which contains 8124 transactions with varying number of items
per transaction (maximum 119). The detailed results are tabulated in Table 1. In the simulation, various combinations of the number of items per transaction and the number of transactions of the mushroom dataset were used; the portion of data taken from the mushroom dataset was truncated using the first item and first transaction as pivot. The general trend is that as the dimension of the portion used for mining RIs increases, the time taken to mine the RIs increases.
Figure 2 highlights the comparison of natural logarithm of time taken to mine
RIs with varying number of items per transaction by keeping number of transac-
tions constant with RareRELIM and rarity algorithm approach. It can be clearly
seen that time taken to mine RIs increases when we increase items per transaction
keeping number of transactions constant. This is because of the fact that a larger
data set leads to more candidate itemsets and hence more time taken to prune RIs from the candidate sets. For example, for 60 items per transaction, the time taken increases from 39.64 to 227.81 as we increase the number of transactions from 1000 to 8124 for our approach.
Figure 3 shows the comparison of the time taken to mine RIs with a varying number of transactions, keeping the number of items per transaction constant, for the RareRELIM and rarity algorithm approaches. It can be clearly
Table 1 Detailed results for mushroom dataset for various max supp

max supp (%)   Time taken, RELIM-based RI (s)   Time taken, Rarity (s)   # Candidate RI   # RI

#Items per transaction: 40 and no. of transactions: 5000
20             7.02                             9.5472                   23,549           23,286
10             6.957                            9.46152                  23,549           22,578
5              6.91                             9.3976                   23,549           20,382
2.5            6.864                            9.33504                  23,549           17,128
1              6.848                            9.31328                  23,549           14,162

#Items per transaction: 40 and no. of transactions: 8124
20             27.699                           38.22462                 785,887          785,847
10             27.409                           37.82442                 785,887          785,847
5              27.393                           37.80234                 785,887          785,575
2.5            30.701                           42.36738                 785,887          785,073
1              31.87                            43.9806                  785,887          781,829

#Items per transaction: 50 and no. of transactions: 5000
20             17.004                           23.46552                 101,283          101,016
10             17.05                            23.529                   101,283          100,176
5              17.035                           23.5083                  101,283          97,258
2.5            19.422                           26.80236                 101,283          90,828
1              16.77                            23.1426                  101,283          78,158

#Items per transaction: 50 and no. of transactions: 8124
20             28.938                           40.80258                 138,079          137,875
10             27.627                           38.95407                 138,079          137,109
5              28.065                           39.57165                 138,079          134,891
2.5            29.031                           40.93371                 138,079          128,895
1              28.314                           39.92274                 138,079          113,663

#Items per transaction: 60 and no. of transactions: 5000
20             150.4                            212.064                  851,839          850,662
10             147.108                          207.42228                851,839          845,550
5              147.862                          208.48542                851,839          826,958
2.5            147.489                          207.95949                851,839          782,490
1              145.018                          204.47538                851,839          692,178

#Items per transaction: 60 and no. of transactions: 8124
20             228.072                          328.42368                1,203,575        1,202,659
10             227.807                          328.04208                1,203,575        1,198,659
5              232.191                          334.35504                1,203,575        1,185,813
2.5            230.1                            331.344                  1,203,575        1,148,930
1              227.854                          328.10976                1,203,575        1,050,463

seen that in this case also, time taken to mine RIs increases when we increase
number of transactions keeping items per transaction constant because of the same
reason that as data set increases, more candidates are generated and hence more
Fig. 2 Number of items per transaction versus log time taken for mushroom (log10 time taken (s), for 3000, 5000 and 8124 transactions, RELIM and Rarity)

Fig. 3 Number of transactions versus log time taken for mushroom (log10 time taken (s), for 40, 50 and 60 items per transaction, RELIM and Rarity)

time is taken to prune RIs. For example, for 8124 transactions, the time taken increases from 0.7 to 230.1 as we increase the number of items per transaction from 25 to 60 for our approach. In general, the RareRELIM algorithm takes less time than the rarity algorithm because the rarity approach maintains both a candidate set and a veto list, and it generates and uses both of them to classify an itemset as rare or not; generating both candidate itemsets and veto lists takes more time.
Figure 4 highlights the comparison of the number of RIs with varying support for a fixed portion of the mushroom dataset, i.e., 60/8124 (items per transaction/number of transactions). The graph shows a generally rising trend because, as the maximum support value used to filter the RIs from the database increases, more itemsets qualify as rare. As can be seen from the graph, as we increase the max supp value from 1 to
Fig. 4 Maximum support versus number of rare itemsets for mushroom dataset (60 × 8124 mushroom data)
20%, the number of RIs increases from 1,050,463 to 1,202,659, an increase of about 14%, leading to the conclusion that more RIs are obtained at higher max supp values.

6 Conclusion

The paper has presented a novel approach to mine rare itemsets, employing RELIM
strategy. Efficient pruning strategies along with a simple data structure have been
employed, and results indicate the significant decline in time taken to mine the rare
itemsets. Future work shall focus on reducing the storage space taken for generating
the RIs. Moreover, efforts shall be put in using the mined rare itemsets in other data
mining strategies.

References

1. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, Los Altos
(2000)
2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In:
Bocca, J.B., Jarke, M., Zaniolo, C. (eds.) VLDB’94, Proceedings of 20th International
Conference on Very Large Data Bases, pp. 487–499. 12–15 September 1994, Morgan
Kaufmann, Santiago de Chile, Chile (1994)
3. Szathmary, L., Valtchev, P., Napoli, A., Godin, R.: Efficient vertical mining of minimal rare
itemsets. In: CLA, pp. 269–280. Citeseer (2012)
4. Szathmary, L., Napoli, A., Valtchev, P.: Towards rare itemset mining. In: 19th IEEE
International Conference on Tools with Artificial Intelligence, 2007, ICTAI 2007, vol. 1,
pp. 305–312. IEEE (2007)
5. Borgelt, C.: Keeping things simple: finding frequent item sets by recursive elimination. In:
Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent
Pattern Mining Implementations, pp. 66–70. ACM (2005)
6. Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation: a
frequent-pattern tree approach. Data Min. Knowl. Discov. 8(1), 53–87 (2004)
7. Goethals, B.: Survey on frequent pattern mining (2003)
8. Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., Lakhal, L.: Mining frequent patterns with
counting inference. ACM SIGKDD Explor. Newsl 2(2), 66–75 (2000)
9. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future
directions. Data Min. Knowl. Disc. 15(1), 55–86 (2007)
10. Weiss, G.M.: Mining with rarity: a unifying framework. ACM SIGKDD Explor. Newsl 6(1),
7–19 (2004)
11. Koh, Y.S., Rountree, N.: Finding sporadic rules using apriori-inverse. In: Advances in
Knowledge Discovery and Data Mining, pp. 97–106. Springer (2005)
12. Troiano, L., Scibelli, G., Birtolo, C.: A fast algorithm for mining rare itemsets. In: 2009 Ninth
International Conference on Intelligent Systems Design and Applications, pp. 1149–1155.
IEEE (2009)
13. Dataset, F.: Frequent itemset mining implementation (fimi) dataset
Computation of Various Entropy
Measures for Anticipating Bugs
in Open-Source Software

H. D. Arora and Talat Parveen

Abstract Bugs can be introduced at any phase of the software development process. Bugs, which occur due to frequent changes made in the source code of software to meet the requirements of organizations or users, are recorded in repositories. Open-source software is frequently updated and its source code is changed continuously, due to which the source code becomes complicated and bugs appear frequently. The bug repair process includes the addition of new features, enhancement of existing features, fixing of faults, and other maintenance tasks. Entropy measures uncertainty and is thus helpful in studying the code change process. In this paper, bugs reported in various subcomponents of the Bugzilla open-source software are considered; the changes in each component are quantified in terms of entropy using the Renyi, Havrda–Charvat, and Arimoto entropy measures. A linear regression model using SPSS is applied to detect the expected bugs in the Bugzilla subcomponents. Performance has been measured using the goodness-of-fit curve, R-square and other residual statistics.

Keywords Entropy measure · Bug detection · Bugzilla · Open-source software

1 Introduction

Software undergoes regular maintenance and updating with the introduction of new features, enhancement of existing features, and bug repair. Software use and dependency have increased over time, and software is now required almost everywhere. Ever-increasing user demand leads to tremendous changes in software, making it complex over time. The process of maintenance is a crucial phase in the software
development cycle. When software undergoes maintenance, the main changes in the source code are due to the introduction of new features, bug repair, and feature enhancement. It is necessary to record all the changes in the source code in a repository so that they can be accessed in future for further study and understanding of the code change process. Frequent source code changes make the code complex and thus faulty, and sometimes cause a failure. Software reliability and quality are affected by bugs lying dormant or introduced at any phase in the software development cycle. Bugs are introduced into the system by the increased complexity of the source code, release pressure, lack of communication or miscommunication among active developers, and bugs introduced during the software development process. Bugzilla [1] is the most widely used bug-tracking and bug-reporting system and assists in the bug-fix process. It was released by Netscape Communications in 1998. Entropy is an important concept of information theory, defined as a measure of the randomness/uncertainty/complexity of code change. It tells us how much information is present in an event.
In this paper, we have developed an approach for predicting bugs in an open-source software system using entropy measures. We have applied simple linear regression to predict future bugs using entropy values calculated from bug-tracking data of Bugzilla [1]. We have predicted the bugs using three different generalizations of Shannon's entropy [16], namely Renyi [3], Havrda–Charvat [4], and Arimoto [5]. Further, we analysed our results with the R-square measure calculated for each entropy measure. To do this work, we took the data from the Bugzilla bug-tracking system, considered the bug reports of the Bugzilla subcomponents, arranged the data on a yearly basis, and then applied simple linear regression using SPSS. The remainder of the paper is organized into the sections that follow.

2 Literature Review

Hassan [6] applied information theoretic approach of measuring the amount of


entropy as a measure of complexity of code change in a system, and he proposed to
quantify the code change process in terms of entropy. Six open-source software
systems FreeBSD, NetBSD, OpenBSD, KDE, KOffice, and PostgreSQL were
considered for evaluating entropy-based metrics. Entropy-based measures proved to
be better in prediction than the methods using number of changes and previous
bugs. Singh et al. [7] used Hassan’s [6] approach and applied entropy-based metrics
to predict the bugs using simple linear regression (SLR) and support vector
regression (SVR). Nagappan et al. [8] studied the effect of number of modified lines
on the defect density in windows vista. It is concluded that code churn is better
predictor than absolute churn. Ihara et al. [9] proposed a bug prediction model which studies base and state metrics using the Eclipse project and gives improved performance over the standard model. Maszczyk et al. [10] compared Shannon entropy in standard top-down decision trees with Tsallis and Renyi entropies in modified C4.5 decision trees for several data sets. They showed that this approach may be used in
any decision tree and information selection algorithm. Menzies et al. [11] used
NASA Metrics Data Program (MDP) for his study and concluded that learning
algorithm used is more relevant than static source code metric. The effect of using
metrics such as McCabe, LOC, and Halstead is compared with algorithms such as
J48, OneR, and Naïve Bayes. Chaturvedi et al. [12] proposed a method of pre-
dicting the potential complexity of code changes in subcomponents of Bugzilla
project using bass model. This approach helps in predicting the source code
changes yet to be diffused in a system. D’Ambros et al. [13] have done extensive
bug prediction along with releasing data set on their website consisting of many
software systems. They have set a benchmark by comparing the approach devel-
oped by them with the performance of several existing bug prediction approaches.
Giger et al. [14] introduced in a bug prediction a metric with number of modified
lines using fine-grained source code changes (SCCs). Shihab et al. [15] proposed a
statistical regression model to study the eclipse open-source software. This
approach can predict post-release defects using the number of defects in previous
version, and the proposed model is better over existing PCA-based models.

3 Information Theoretic Approach

Information theory has become a keyword in the current era and is characterized by a quantitative approach to the notion of information. It is a collection of statistics- and probability-based mathematical theories: a probabilistic approach dealing with assessing and defining the amount of information contained
in a message. The concept of entropy in information theory was developed by
Shannon [16], who defined the measure now known as Shannon's entropy. In his research paper entitled 'A mathematical theory of communication', Shannon [16] proposed and investigated a mathematical model for communication systems and suggested a quantitative measure of the average amount of information supplied by a probabilistic experiment. A systematic attempt to develop a generalization of Shannon's [16] entropy was carried out by Renyi [3], who characterized a nonnegative and additive entropy of order α defined as follows:
H_\alpha(P) = \frac{1}{1-\alpha}\log\left(\sum_{i=1}^{n} p_i^{\alpha}\right)    (1)

where α ≠ 1, α > 0, p_i ≥ 0 and the p_i sum to 1; α is a real parameter, n is the number of files, and i varies from 1 to n. p_i is the probability of change in a file, and the entropy is maximum when all files have the same probability of change, i.e. when p_i = 1/n for all i ∈ {1, 2, ..., n}. The entropy would be minimum when some element k has probability p_k = 1 and p_i = 0 for all i ≠ k. Havrda and Charvat [4] gave the first
non-additive measure of entropy, called structural β-entropy. This quantity permits
simpler axiomatic characterizations too. It reduces to Shannon's entropy [16] when β → 1. It lacks the additive property but satisfies the sum property and the generalized recursive property.

H_\beta(P) = \frac{1}{2^{1-\beta}-1}\left[\sum_{k=1}^{n} p_k^{\beta} - 1\right]    (2)

where β ≠ 1, β > 0; β is a real parameter.


Arimoto [5] defined a generalized parametric entropy referred to as γ-entropy. It is neither additive, nor does it satisfy the recursive or sum property.

H_\gamma(P) = \frac{1}{2^{\gamma-1}-1}\left[\left(\sum_{k=1}^{n} p_k^{1/\gamma}\right)^{\gamma} - 1\right]    (3)

where γ ≠ 1, γ > 0; γ is a real parameter. Many generalizations of Shannon's entropy [16] have been developed so far by researchers such as Renyi [3], Aczel and Daroczy [17], Havrda and Charvat [4], Kapur [18], Arimoto [5], Sharma and Taneja [19], and Tsallis [2]. We have limited ourselves to three generalizations of Shannon's entropy measure for our study, namely the Renyi [3], Havrda–Charvat [4], and Arimoto [5] entropy measures.
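A small Python sketch of Eqs. (1)–(3) is given below; the helper functions and the example probability vector are our own, for illustration only. Each function takes a probability vector and a parameter value.

import numpy as np

def renyi(p, alpha):
    # Renyi entropy, Eq. (1): (1/(1-alpha)) * log(sum p_i^alpha), alpha > 0, alpha != 1
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def havrda_charvat(p, beta):
    # Havrda-Charvat entropy, Eq. (2): (2^(1-beta) - 1)^(-1) * (sum p_k^beta - 1)
    p = np.asarray(p, dtype=float)
    return (np.sum(p ** beta) - 1.0) / (2.0 ** (1.0 - beta) - 1.0)

def arimoto(p, gamma):
    # Arimoto entropy, Eq. (3): (2^(gamma-1) - 1)^(-1) * ((sum p_k^(1/gamma))^gamma - 1)
    p = np.asarray(p, dtype=float)
    return (np.sum(p ** (1.0 / gamma)) ** gamma - 1.0) / (2.0 ** (gamma - 1.0) - 1.0)

# example: uniform change probabilities over four files, alpha = beta = gamma = 0.5
p = [0.25, 0.25, 0.25, 0.25]
print(renyi(p, 0.5), havrda_charvat(p, 0.5), arimoto(p, 0.5))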

4 Basic Model for Entropy Measurement

In open-source software, the source code changes frequently with the increasing demand for new feature additions and bug repairs. Changes are recorded in a repository, and the complexity of the source code increases with the code change process. Shannon's entropy is used for studying the code change process: the probability of change for each file is calculated, and from these probabilities the entropy is calculated. The time period could be considered in terms of a year, half year, month, or week [6].
For example, suppose 14 changes occurred in three different files over different time periods, and that in time period t1 File1 has two changes while File2 and File3 have one change each. Table 1 represents the total changes in each file for the respective time periods t1, t2, and t3. For time period t1, the total number of changes is 4. Thus, the probability of File1 for

Table 1 Changes in File1, File2, and File3 with respect to time periods t1, t2, and t3 (rows: File1, File2, File3; columns: t1, t2, t3)
t1 is 2/4 = 0.5, the probability of File2 for t1 is 1/4 = 0.25, and the probability of File3 for t1 is 1/4 = 0.25. Similarly, probabilities for time periods t2 and t3 can be calcu-
lated. Using the probabilities found by above-mentioned method, entropies for all
time periods could be calculated. In this study, Renyi [3], Havrda–Charvat [4], and
Arimoto [5] entropies are calculated using Eqs. (1)–(3). The entropy is maximum when the changes are spread across all the files and minimum when most of the changes occur in a single file.

5 Data Collection and Preprocessing

Bugzilla renders a bug-tracking service to a large number of organizations and projects. It is open-source software. Apache Project, Code Magus, Eclipse, GCC, KDE, NetBeans, Red Hat, W3C, and many other organizations use public Bugzilla installations. In this paper, components from the Mozilla subsystem are selected.
installation. In this paper, components from the Mozilla subsystem are selected.
Bugzilla [1] is a bug-tracking system, and bug reports of few subcomponents are
considered for the study. Bugs are extracted from the CVS repository: http://
[Link]. We have considered subcomponents of Bugzilla project for
our study. Data extracted is arranged considering changes in files of subsystem on
yearly basis. In this paper, a fixed period as one year is taken from 2007 to July
2015. The number of files in each of these subsystems varied in the range of 21–157
files. Following subcomponents are selected, and a number of files in each com-
ponent [1] are mentioned along with them (Table 2):
The data is available in the repository, and the data is collected from the
repository and processed using following steps:
Step1 Select the project
Step2 Select the subsystem
Step3 Browse Bugzilla repository for tracking the bug reports
Step4 Collect the reported bug data from the repository for selected
subcomponents
Step5 Arrange the data year-wise and calculate entropy using Eqs. (1)–(3)
Step6 Apply the simple linear regression model with the observed bugs m(t) as the dependent variable and the calculated entropy H(t) as the independent variable.
Step7 Predict the future bugs using regression coefficients
Step8 Performance of the model is assessed through R2 and other statistical
performance measure
All changes in the subcomponents have been extracted by browsing the con-
current version repository of Bugzilla. Simple linear regression model is fitted for
predicting the future bugs using the entropy measure of the historical changes.
Table 2 Bugzilla components and changes in files

Component             Number of files
XSLT                  131
Reporter              27
XForms                87
General               19
Telemetry Dashboard   31
Verbatim              57
Geolocation           66
MFBT                  126
MathML                91
X-Remote              21
Widget Android        58
XBL                   157
6 Calculation of Arimoto, Havrda–Charvat, and Renyi Entropy

In this paper, subcomponents of the Bugzilla software are considered, and the entropy variation in the system over a period of time ranging from 2007 to 2015, based on the number of bugs, is studied. Renyi [3], Havrda–Charvat [4], and Arimoto [5] entropies are calculated from the data extracted for the different subcomponents for α, β, γ ranging from 0.1 to 0.9, respectively. After collecting data for the time period from 2007 to 2015 for different files as discussed above, the Renyi, Havrda–Charvat, and Arimoto entropies are calculated from the calculated probabilities using Eqs. (1)–(3), respectively, following the method described in Sect. 4. For Renyi, Havrda–Charvat, and Arimoto, α, β, and γ range from 0.1 to 0.9, respectively, since by the definition of these entropy measures the parameter in Eqs. (1)–(3) must be greater than 0 and not equal to 1.
Figures 1, 2 and 3 depict the entropy change in the various subsystems over the time period from 2007 to 2015, calculated using the Renyi, Havrda–Charvat, and Arimoto entropy measures, respectively. The Renyi, Havrda–Charvat, and Arimoto entropies decrease as we increase the value of α, β, γ from 0.1 to 0.9.

Fig. 1 Renyi entropy for α ranging from 0.1 to 0.9 (entropy versus year, 2007–2015)

Fig. 2 Havrda–Charvat entropy for β ranging from 0.1 to 0.9 (entropy versus year, 2007–2015)

Fig. 3 Arimoto entropy for γ ranging from 0.1 to 0.9 (entropy versus year, 2007–2015)

7 Analysis and Bug Prediction Approach

The simple linear regression [19] is a statistical approach which enables predicting
the dependent variable on the basis of independent variable which has been used to
fit the entropy measure as independent variable HðtÞ and bugs observed as
dependent variable mðtÞ using Eq. (4)

mðtÞ ¼ r0 þ r1  HðtÞ ð4Þ

where r0 and r1 are the regression coefficients. Once the values of r0 and r1 are obtained, future bugs are predicted by substituting them into Eq. (4). The simple linear regression (SLR) model is used to study the code change process [20]. SLR is fitted for the independent variable H(t) and the dependent variable m(t) using Eq. (4). Here, m(t) represents the number of bugs recorded for the specified Bugzilla [1] subcomponents, namely XSLT, Reporter, XForms, General, Telemetry Dashboard, Verbatim, Geolocation, MFBT, MathML, X-Remote, Widget Android, and XBL. The regression coefficients r0 and r1 are calculated by varying the parameters α, β, and γ in Eqs. (1)–(3), taking the entropy as the independent variable and the number of recorded bugs as the dependent variable. We varied α, β, and γ from 0.1 to 0.9 for each entropy measure and studied the effect of the different entropy measures on the bug prediction technique [21]. The values of R, R², adjusted R², the standard error of estimate, and the regression coefficients for the entropy measures, namely Renyi [3], Havrda–Charvat [4], and Arimoto [5], have been calculated by varying the parameter value in each entropy measure from 0.1 to 0.9 over the years 2007 to 2015 and are given in Table 3.
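A minimal sketch of this fitting step is given below. It assumes hypothetical yearly entropy and bug-count values (the actual values come from the Bugzilla data and the SPSS output reported in Table 3) and uses scikit-learn rather than SPSS.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical yearly entropy H(t) and observed bug counts m(t) for 2007-2015.
H = np.array([0.62, 0.71, 0.80, 0.93, 1.05, 1.21, 1.38, 1.52, 1.67]).reshape(-1, 1)
m = np.array([18, 35, 52, 80, 110, 160, 230, 310, 420], dtype=float)

model = LinearRegression().fit(H, m)          # fits m(t) = r0 + r1 * H(t), cf. Eq. (4)
r0, r1 = model.intercept_, model.coef_[0]
print(f"r0 = {r0:.3f}, r1 = {r1:.3f}, R^2 = {r2_score(m, model.predict(H)):.3f}")

# Future bugs are predicted by substituting a forecast entropy value into Eq. (4).
print("predicted bugs for H = 1.8:", model.predict([[1.8]])[0])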
The R value measures the correlation between the sets of observed and predicted data, and it lies between −1 and 1. R², also known as the coefficient of determination, measures the closeness of the data to the fitted regression line and lies between 0 and 1; it estimates the percentage of total variation about the mean, and a value close to 1 indicates that the data fit the model well. Adjusted R² represents the percentage of variation explained by only those independent variables that actually affect the dependent variable. The results of the simple linear regression are given in Table 3, which lists the values of R, R², adjusted R², the standard error of estimate, and the regression coefficients. The value of the parameter α in Renyi's entropy measure [3] in Eq. (1) is varied from 0.1 to 0.9, and the entropy is calculated for the

Table 3 Statistics and regression coefficients


Model  Parameter  R  R²  Adjusted R²  Std. error of the estimate  r0  r1
Equation 1 (α)  0.1 0.996 0.993 0.992 26.50367 −65.497 114.985
0.2 0.996 0.992 0.991 26.91679 −62.751 117.442
0.3 0.996 0.992 0.991 27.49703 −59.943 119.870
0.4 0.996 0.992 0.991 28.21549 −57.126 122.240
0.5 0.996 0.991 0.99 29.03881 −54.346 124.528
0.6 0.995 0.991 0.989 29.93830 −51.639 126.705
0.7 0.995 0.99 0.989 30.90207 −49.015 128.734
0.8 0.995 0.989 0.988 31.96744 −46.437 130.525
0.9 0.994 0.988 0.987 33.43824 −43.611 131.648
Equation 2 (β)  0.1 0.995 0.991 0.989 30.09957 −8.091 15.512
0.2 0.995 0.99 0.988 31.54890 −10.639 17.875
0.3 0.994 0.989 0.987 32.63041 −13.668 20.392
0.4 0.994 0.988 0.987 33.38173 −17.109 23.041
0.5 0.994 0.988 0.986 33.85169 −20.884 25.800
0.6 0.994 0.988 0.986 34.09773 −24.905 28.642
0.7 0.994 0.988 0.986 34.18984 −29.076 31.535
0.8 0.994 0.988 0.986 34.23875 −33.263 34.429
0.9 0.994 0.988 0.986 34.61405 −37.092 37.174
Equation 3 (γ)  0.1 0.994 0.989 0.987 32.80168 −58.873 72.152
0.2 0.994 0.989 0.987 33.00622 −59.372 67.935
0.3 0.994 0.989 0.987 32.90916 −59.629 64.082
0.4 0.994 0.989 0.987 32.66821 −59.262 60.425
0.5 0.995 0.989 0.987 32.42818 −58.201 56.908
0.6 0.995 0.989 0.988 32.24788 −56.447 53.514
0.7 0.995 0.989 0.988 32.12785 −54.053 50.238
0.8 0.995 0.989 0.988 32.01330 −51.131 47.099
0.9 0.995 0.99 0.988 31.61755 −48.045 44.227

time period 2007 to 2015. The calculated entropy is used to predict the bugs using a simple linear regression model in SPSS, which yields the regression coefficients and hence the predicted values of future bugs. It is observed that as α progresses from 0.1 to 0.9, the value of R² declines, from 0.993 at α = 0.1 to 0.988 at α = 0.9. The value of the parameter β in the Havrda–Charvat [4] entropy measure in Eq. (2) is likewise varied from 0.1 to 0.9, and the entropy for the period 2007 to 2015 is calculated. The regression coefficients evaluated through SLR are then used to predict the bugs expected to appear in the system. A decline in R² is again observed as β progresses from 0.1 to 0.9: the value of R² is 0.991 for β = 0.1 and degrades continuously to 0.988 at β = 0.9.

Fig. 4 Performance of the simple linear regression model in bug prediction: R-square for the Renyi, Havrda–Charvat, and Arimoto entropies against the parameter value (0.1–0.9) in Eqs. (1), (2) and (3)

Similarly, for the Arimoto [5] entropy in Eq. (3), the same SLR approach as described above is applied, the regression coefficients are calculated, the predicted values are obtained, and the R² value is analysed. It is found that, unlike the Renyi [3] and Havrda–Charvat [4] analyses with SLR, for Arimoto [5] the value of R² improves as γ is increased from 0.1 to 0.9: R² is similar for γ = 0.1 to 0.8 and increases noticeably at γ = 0.9.
A graph of R-square against the parameter value ranging from 0.1 to 0.9 in Eqs. (1)–(3) is plotted in Fig. 4, and it is observed that the R-square value for the bugs predicted by the Renyi [3] entropy measure is closest to 1. The simple linear regression approach thus predicts the future bugs with good precision, and the R-square values indicate that the Renyi [3] entropy measure, used as the independent variable in the simple linear regression, provides the best prediction of future bugs; the best R-square values for the bugs predicted by the Havrda–Charvat [4] and Arimoto [5] entropies are 99 and 99.1%, respectively (Fig. 4).

8 Result and Discussion

Goodness-of-fit curves have been plotted between the observed and predicted values. Goodness of fit for simple linear regression depicts how closely the observed bug values match the predicted bug values. The simple linear regression method anticipates future bugs effectively using the number of bugs recorded for the subsystems of the open-source software and the entropy measures in Eqs. (1)–(3), by determining the regression coefficients. Figures 5, 6, and 7 show this goodness-of-fit curve for each entropy measure considered in our study, i.e. Renyi [3], Havrda–Charvat [4],

Fig. 5 Goodness-of-fit curve for Renyi entropy

Fig. 6 Goodness-of-fit curve for Havrda–Charvat entropy

Fig. 7 Goodness-of-fit curve for Arimoto entropy

and Arimoto [5]. The predicted values are calculated and the R-square values are analysed. It is found that in the Renyi [3] and Havrda–Charvat [4] analyses the value of R-square degrades from 99.3 to 98.8% and from 99.1 to 98.8%, respectively, as the parameters α and β increase from 0.1 to 0.9.
In the case of Arimoto [5], as γ is increased from 0.1 to 0.9, the value of R-square improves at γ = 0.9 from 98.7 to 98.8%: R-square is similar for γ = 0.1 to 0.8, explaining 98.7% of the variability, and it increases at γ = 0.9 to 98.8%.

9 Conclusion

Software is affected by the code change process, which leads to increased code complexity. Based on the data set statistics, the Renyi [3], Havrda–Charvat [4], and Arimoto [5] entropies are calculated for values of the parameters α, β, and γ, respectively, ranging from 0.1 to 0.9 in Eqs. (1)–(3) for 12 subcomponents of Mozilla open-source software (OSS). It is observed that the Renyi [3], Havrda–Charvat [4], and Arimoto [5] entropies lie between 0.4 and 1, 1.8 and 9, and 0.7 and 3, respectively. Figures 1, 2 and 3 show that all entropy values decrease as the respective parameter increases from 0.1 to 0.9. We have applied a bug prediction approach to predict the appearance of bugs in the OSS subsystems in the coming years; a comparative analysis of the R² value for each entropy measure shows that the Renyi [3] entropy measure provides the best R² value of 99.3%, for α = 0.1. The goodness-of-fit curves are plotted in Figs. 5, 6, and 7. This study based on entropy measures is useful in learning the code change process of a system. In the future, entropy measures could be used to
code change process in a system. In the future, entropy measures could be used to

predict bug occurrence in the software, and our study could be extended to predict the potential complexity/entropy of the code changes that the software will undergo during a long maintenance period.

References

1. The Bugzilla. [Link]


2. Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 52, 479–487
(1988)
3. Renyi A.: On Measures of entropy and information. In: Proceedings of the Fourth Berkeley
Symposium on Mathematical Statistics and Probability, pp. 547–561. University of California
Press, Berkeley-Los Angeles (1961)
4. Havrda, J.H., Charvat, F.: Quantification method of classification process: concept of
structural α-entropy. Kybernetika 3, 30–35 (1967)
5. Arimoto, S.: Information-theoretical considerations on estimation problems. Inf. Control 19,
181–194 (1971)
6. Hassan A.E.: Predicting faults based on complexity of code change. In: The proceedings of
31st International Conference on Software Engineering, pp 78–88 (2009)
7. Singh, V.B., Chaturvedi, K.K.: Improving the quality of software by quantifying the code
change metric and predicting the bugs. In: Murgante, B., et al. (eds.) ICCSA 2013, Part II,
LNCS 7972, pp. 408–426. Springer, Berlin (2013)
8. Nagappan, N., Ball T.: Using software dependencies and churn metrics to predict field
failures: an empirical case study. In: ESEM ’07: Proceedings of the First International
Symposium on Empirical Software Engineering and Measurement, pp. 364–373. Washington,
DC, USA. IEEE Computer Society (2007)
9. Ihara, A., Kamei, Y., Monden, A., Ohira, M., Keung, J.W., Ubayashi, N., Matsumoto, K.: An
investigation on software bug-fix prediction for open source software projects—a case study
on the eclipse project. In: Software Engineering Conference (APSEC), 2012 19th
Asia-Pacific, pp. 112–119. vol. 2 (2012)
10. Maszczyk, T., Duch, W.: Comparison of Shannon, Renyi and Tsallis entropy used in decision
trees. Lect. Notes Comput. Sci. 5097, 643–651 (2008)
11. Menzies, Tim, Greenwald, Jeremy, Frank, Art: Data mining static code attributes to learn
defect predictors. IEEE Trans. Softw. Eng. 33(1), 2–13 (2007)
12. Chaturvedi, K.K., Kapur, P.K., Anand, S., Singh, V.B.: Predicting the complexity of
code changes using entropy based measures. Int. J. Syst. Assur. Eng. Manag. 5(2), 155–164
(2014)
13. D’Ambros M., Robbes R.: An extensive comparison of bug prediction approaches. In:
MSR’10: Proceedings of the 7th International Working Conference on Mining Software
Repositories, pp. 31–41 (2010)
14. Giger, E., Pinzger, M., Gall H.C.: Comparing fine-grained source code changes and code
churn for bug prediction. In: Proceedings of the 8th Working Conference on Mining Software
Repositories, pp. 83–92. MSR ’11, New York, NY, USA. ACM (2011)
15. Shihab, E., Jiang, Z.M., Ibrahim, W.M., Adams, B., Hassan A.E.: Understanding the impact
of code and process metrics on post-release defects: a case study on the eclipse project. In:
ESEM ’10: Proceedings of the 2010 ACM-IEEE International Symposium on Empirical
Software Engineering and Measurement, pp. 1–10. New York, NY, USA. ACM (2010)
16. Shannon, C.E.: A mathematical theory of communication. Bell’s Syst. Tech. J. 27, 379–423
(1948)
Computation of Various Entropy Measures for Anticipating Bugs … 247

17. Aczel, J., Daroczy, Z.: Charakterisierung der Entropien positiver Ordnung und der Shannonschen
Entropie. Acta Math. Acad. Sci. Hungar. 14, 95–121 (1963)
18. Kapur, J.N.: Generalized entropy of order α and type β. Math. Semin. 4, 78–84 (1967)
19. Weisberg, S.: Applied linear regression. Wiley, New York (1980)
20. Sharma, B.D., Taneja, I.J.: Three generalized additive measures of entropy. ETK 13, 419–433
(1977)
21. IBM SPSS statistics for windows, Version 20.0. Armonk, NY: IBM Corp
Design of Cyber Warfare Testbed

Yogesh Chandra and Pallaw Kumar Mishra

Abstract Innovations that do fine in a predictable, controlled environment may be much less effective, dependable or manageable in a production environment, more so in cyber systems, where new malware detection technology appears every day and zero-day vulnerabilities keep coming up. Considering NCSP-2013, the authors propose a realistic cyber warfare testbed using the XenServer hypervisor, commodity servers and open-source tools. The testbed supports cyber-attack and defence scenarios, malware containment, and exercise logs and analysis to develop tactics and strategies. Further, the authors provide ways and means to train cyber warriors, honing their skills as well as maturing attack and defence technologies on this testbed.

Keywords Virtualization · Hypervisor · Metasploit · Malware · Container

1 Introduction

National Cyber Security Policy (NCSP) was announced in July 2013 with a mission
to protect information and IT infrastructure in cyberspace, build capabilities to
prevent and respond to cyber threats, reduce vulnerabilities and hence damage from
cyber incidents through an arrangement of institutional structures, people, pro-
cesses, technology and their collaboration. One of the objectives is to generate
workforce of five lakhs cyber security professionals in coming five years through
capacity building, skill development, training and use of open standards for cyber
security [1]. National Institute of Standards and Technology (NIST) developed the
framework for improving critical IT infrastructure cyber security in Feb 2014.

Y. Chandra (&)  P. K. Mishra


Institute for Systems Studies and Analyses, Defence Research and Development
Organisation, Metcalfe House, Civil Lines, New Delhi, Delhi 110054, India
e-mail: [Link]@[Link]
P. K. Mishra
e-mail: [Link]@[Link]


While NCSP addresses the high-level perspective, the NIST framework helps
organizations addressing actual cyber security risks.
The gap between threat and defence continues to expand as opponents use
increasingly sophisticated attack technologies. Cyber Defense Technology
Experimental Research (DETER) is an attempt to fill that gap by hosting an
advanced testbed facility where top researchers and academicians conduct critical
cyber security experiments and educational exercises. It emulates real-world
complexity and scale essential to evolve advance solutions to help protect against
sophisticated cyber-attacks and network design vulnerabilities [2] using Emulab
software [3].
Cyber Exercise and Research Platform (KYPO) project aims to create an
environment for R&D of new methods to protect critical IT infrastructure against
cyber-attacks. A virtualized environment is used for emulation of complex
cyber-attacks and analyse their behaviour and impact on the IT infrastructure [4].
A powerful, free and independent malware analysis service is also available to the security community [5]. DARPA is setting up a virtual testbed named National
Cyber Range to carry out virtual cyber warfare games for testing different scenarios
and technologies in response to cyber-attacks.

2 Testbed Requirements

Cyber security experiments involve two or more contestants, due to the adversarial
nature of the problem. Safe and isolated virtual network environment is needed to
be attacked and penetrated as a means of learning and improving penetration testing
skills. In conventional environment, investigator constructing the experiment is
fully aware of both sides of the scenario and all components of the experiment are
equally visible. This affects scenario realism and may foster experimental bias and
may not be able to handle present-day attacks.
Testbed should support on-demand creation of experimental scenarios and
advanced attack emulation. Scenarios generated are to be based on real-world
exploitations. The requirements are network-related, hosts-related, monitoring,
testbed control and deployment. The host monitoring set-up needs to get infor-
mation about the node performance. It is to monitor processor and memory uti-
lization, open connections, interface statistics and other characteristics of hosts.

3 Testbed Design

The testbed consists of four major systems, i.e. cluster of nodes and networks,
attack system, defence system and data logger and report generation.

Fig. 1 Cyber warfare testbed

3.1 Design of Cluster

The testbed cluster comprises many nodes in a laboratory connected through a programmable switch, so that any arbitrary topology can be constructed. Some nodes may be used to emulate required link characteristics using the network emulator (Netem).
Citrix XenServer serves the objective of setting up the underlying infrastructure for cyber warfare. Though a console is provided by XenCenter, it is limited in use and functionality. More sophistication and higher fidelity in creating and controlling nodes can be achieved through commercially supported tools such as CloudStack. Using these tools, a number of virtual nodes are created for both the attack and defence systems and tested in the requisite environment [6] (Fig. 1).
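For instance, link characteristics can be pushed to the emulation nodes programmatically. The sketch below assumes Linux nodes with the iproute2 tc tool installed and root privileges; the interface names and netem parameters are only illustrative, not values from the paper.

import subprocess

# Illustrative link profiles for two emulated links of the cluster topology.
LINKS = {
    "eth1": {"delay": "100ms", "jitter": "10ms", "loss": "1%"},
    "eth2": {"delay": "30ms", "jitter": "5ms", "loss": "0.1%"},
}

def apply_netem(dev, delay, jitter, loss):
    # Runs: tc qdisc add dev <dev> root netem delay <delay> <jitter> loss <loss>
    subprocess.run(
        ["tc", "qdisc", "add", "dev", dev, "root", "netem",
         "delay", delay, jitter, "loss", loss],
        check=True,
    )

for dev, p in LINKS.items():
    apply_netem(dev, p["delay"], p["jitter"], p["loss"])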

3.2 Composing Attack System

Based on the attacker objective, a number of sequences of attack in the form of


attack graphs may be generated. Each attack graph will be evaluated based on the
extent of objective achieved, time and resources used, the deception achieved and
ease of launch. Though the open-source software OpenVAS can be explored, the authors have used the Nessus framework for this and have found it a quite useful aid in generating the attack graph. The best attack sequence may be executed by a specifically built

actuator which will pick next node in the graph and launch particular attack on the
node. Individual attack tools such as LOIC for DDOS or hping for scanning can be
utilized by starters. For the attack system, Metasploit Community Edition [7] in
conjugation with Nmap and Wireshark may be used. Kali Linux (formerly
Backtrack) by Offensive Security contains several tools aimed at numerous infor-
mation security tasks, such as penetration testing, forensics and reverse engineering
[8]. Using these tools, a number of attack techniques may be tried and matured on
the cyber warfare testbed. Malware and malware generation tools can be sourced
from Websites like [Link] [9].

3.3 Composing Defence System

Intrusion detection systems (IDSs) and system logs may be used to detect several
types of malicious behaviours that can compromise the security and trust of a
computer system. This includes network attacks against vulnerable services,
data-driven attacks on applications, host-based attacks such as privilege escalation,
unauthorized logins and access to sensitive files, and malware (viruses, Trojan
horses and worms). Provisions are given in IDS to configure rules according to the
experiment. Authors used Snort for network-based intrusion detection and OSSEC
for host-based intrusion detection. These will report various security-related system
activities and logs to central console, e.g. insertion/removal of external media,
changes in specified software. These alerts then can be correlated to have better
awareness of cyber situation.
Formulation of proper response in case of such attack based on situation
response technologies includes response recommenders that evaluate alternatives
and recommend responses in light of the mission and response managers and
actuators that implement the responses [10].

3.4 Data Logger and Report Generation

The data logger collects data in passive mode. There are no external linkages from
the data logger during a live experiment. Interactive access is allowed to this only
from the report generation and post-analysis modules. Depending on the experiment's risk rating, data collected by this module may have to be cleaned of malware before it can be used by others.
The data logger retrieves information about performance. Several tools are available under various names, the most popular being security information and event management (SIEM) systems. The authors have used the system information gatherer and reporter (SIGAR) for data logging and reporting. The SIGAR API provides portable code for collecting system information such as processor and memory utilization, open connections, interface statistics and other characteristics of hosts.
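A minimal sketch of such a host monitor is shown below. It uses the psutil library purely as a stand-in for the SIGAR API mentioned above; the metric names and log file are illustrative assumptions.

import json
import time
import psutil

def collect_host_metrics():
    # Processor and memory utilization, open connections and per-interface traffic.
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "open_connections": len(psutil.net_connections()),
        "interface_bytes": {nic: io.bytes_sent + io.bytes_recv
                            for nic, io in psutil.net_io_counters(pernic=True).items()},
    }

# Passive logging: one JSON record per sample appended to a local file.
with open("host_metrics.log", "a") as log:
    log.write(json.dumps(collect_host_metrics()) + "\n")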

4 Operational Configuration

4.1 Scenarios

The testbed will facilitate a scenario-driven exercise designed to emulate expected


player actions required to meet certain objectives. The exercise scenario may be
built using adversaries with credible capabilities. The scenario will be built around a
series of cyber-attack vectors that fall within the technical capabilities of the
adversary (Fig. 2).

4.2 Attack Emulation

Bandwidth amplification attack, phishing, malware propagation and infection sce-


narios may be enacted to understand emerging mobile worms, viruses and other
malware. Application layer DoS attacks such as Slowloris and Slow HTTP POST
rely on the fact that the HTTP protocol requires requests to be completely received
by the server before processing them. If an HTTP request is not complete, or if there
is a very low transfer rate, the server keeps its resources busy waiting for the rest of
the data. If the server keeps too many resources busy, this leads to denial of service.
These attacks can easily be emulated in the proposed testbed using SlowHTTPTest
tool. Malware containment may be achieved through virtualization, encryption and
zone-based security segmentations (VLANs). Initially, malware analysis could be
done in sandbox environment, before deploying it to the testbed.

Fig. 2 Operational diagram



5 Analysis

A large number of exercises need to be conducted for various scenarios and their objectives to generate a volume of data. These data are then analysed for specific objectives afterwards.

5.1 Identification of Measures of Effectiveness (MOEs)

Firstly, MOEs are identified, and the MOEs determine the attack strength, particular
to a specific node on specific network configuration. MOEs for attack may be
number of computers affected, data losses in terms of time and volume, target
identification, number of targets engaged, number of attempts/mechanisms used to
breach, targets missed, DOS induced in terms of time frame, number of routers
attacked/compromised, number of antivirus/antimalware defeated, number of OS
breached, number of Websites attacked, number of applications breached.
Similarly, MOEs for defence may be time delay in detection of attack, value of
asset before and after attack, data losses in terms of time and volume.

5.2 Vulnerability Analysis

Vulnerability analysis may be carried out using Common Vulnerability Scoring


Scheme (CVSS). CVSS provides open framework for communicating the charac-
teristics and severity of software vulnerabilities [11]. Authors have used this model
to define slightly modified base, temporal and environmental metric groups.
Cyber security prediction models may be developed using history and nature of
attack and models for assessment of system vulnerability, most probable attack,
most probable target system, attack strength and evaluation of defence techniques
for such attack [10].

5.3 Development of Strategy and Tactics

Statistical methods or data mining tools may be applied on data collected during
exercises to evolve better defence techniques and attack-resilient network topolo-
gies. Added to it is behaviour analysis, which will help prevention against advanced
persistent threat (APT), polymorphic malware, zero-day attacks, which otherwise
are almost impossible to detect. It includes behaviour of both, malware as well as
attacker. A good analysis of polymorphic malware from behavioural perspective

has been carried out by Vinod et al. [12]. These analyses will help produce tactical
rule book for senior IT managers and chief information security officers (CISOs).

6 Conclusion

The main tools of cyber defence are not the switches, routers or operating systems,
but rather the cyber defenders themselves. The key benefits derived from the testbed
would be enhancement of skills on cyber warfare, formulation of cyber warfare
strategies and tactics, a training platform for the cyber warfare and analysis of
vulnerability of the existing network system.
Innovations doing fine in a predictable and controlled environment may be less
effective, dependable or manageable in a production environment. Without realistic,
large-scale resources and research environments, results are unpredictable. The
testbed as a mini-Internet emulates real-world complexity and scale.
The testbed can also host capture the flag (CTF) and attack–defence competi-
tions, where every team has own network with vulnerable services. Every team has
time for patching services and developing exploits. Each team protects own services
for scoring defence points and hack opponents for scoring attack points. Exercise
can take place on WAN and can be extended to multiple team configuration.

7 Future Scope

US military has indicated that it considers cyber-attack as an “Act of War” and will
include cyber warfare for future conflicts engaging its Cyber Command. In future, a
move from device-based to cloud-based botnets, hijacking distributed processing
power may be witnessed. Highly distributed denial-of-service attacks using cloud
processing may be launched. To counter such massive attacks, the testbed needs to
be realized at much larger scale. Hence, the testbed also needs to be migrated to
cloud platform [13]. Hybrid cloud solutions, e.g. OpenStack may be used as
infrastructure-as-a-service (IaaS). Further efficiency may be achieved by harnessing
operating system container technology. OS designed to run containers, such as Red
Hat Atomic Host, Ubuntu Snappy, CoreOS and Windows Nano, can complement
hypervisors, where containers (VMs without hypervisor) are lightweight virtual
machines which are realized by OS kernel.
There is scope for simulating human behaviour related to cyber security using
agent-based simulation toolkit. It will support for agent modelling of human
security behaviour to capture perceived differences from standard decision making.

References

1. National Cyber Security Policy (NCSP-2013). [Link]


security-policy-2013-1
2. The DETER Project. [Link]
3. Emulab Software. [Link]
4. KYPO Project. [Link]
5. Malware Analysis Service. [Link]
6. Citrix Xenserver. [Link]
[Link]
7. Metasploit Community Edition. [Link]
8. Kali Linux. [Link]
9. Source of Malware and Malware Generation Tools. [Link]
10. Mishra, P.K.: Cyber defence: an approach to defend cyber assets. In: Proceedings of 1st
National Conference on Cyber Security 2012, DIAT Pune (2012)
11. Common Vulnerabilities Scoring Scheme. [Link]
12. Vinod P., Laxmi, V., Gaur, M.S., Chauhan, G.: Malware analysis using non-signature based
method. In: Proceedings of IEEE International Conference on Network Communication and
Computer, ICNCC (2011)
13. Kouril, D., et al.: Cloud-based testbed for simulation of cyber attacks. In: Proceedings of the
2014 IEEE Network Operations and Management Symposium (2014)
Performance Evaluation of Features
Extracted from DWT Domain

Manisha Saini and Rita Chhikara

Abstract The key task of Steganalyzer is to identify if a carrier is carrying hidden


information or not. Blind Steganalysis can be tackled as two-class pattern recog-
nition problem. In this paper, we have extracted two sets of feature vectors from
discrete wavelet transformation domain of images to improve performance of a
Steganalyzer. The features extracted are histogram features with three bins 5, 10,
and 15 and Markov features with five threshold values 2, 3, 4, 5, 6, respectively.
The performance of two feature sets is compared among themselves and with
existing Farid discrete wavelet transformation features based on parameter classi-
fication accuracy using neural network back-propagation classifier. In this paper, we
are using three Steganography algorithms outguess, nsF5 and PQ with various
embedding capacities.

Keywords Discrete wavelet transformation · Neural network · Steganography · Steganalysis · Histogram · Markov

1 Introduction

Steganography is a technique of information hiding in digital medium in such a way


that it is imperceptible to human eyes. According to the survey [1], it is found that
terrorists have been using Steganography techniques to pass information over Web
sites and auction sites over a period of time. The most widely used digital medium over the Internet is the image file, in comparison with other digital objects. Steganography hides information in the spatial domain and the transform domain. The main difference between the spatial and transform domains is that in the spatial domain data embedding in pixels is

M. Saini (&)
G D Goenka University, Gurgaon, India
e-mail: [Link]@[Link]
R. Chhikara
The NorthCap University, Gurgaon, India
e-mail: ritachhikara@[Link]


done directly, whereas in transform domain images are first transformed [2, 3] to
discrete cosine transform (DCT) or discrete wavelet transform (DWT) domain and
then the message is embedded inside the image. So advantage of transform method
over spatial method is that it is more protected against statistical attacks.
The misuse of Steganography has led to development of countermeasures known
as Steganalysis [4]. The aim of forensic Steganalysis is to recognize the existence of
embedded message and to finally retrieve the secret message. Based on the way
methods are detected from a carrier, Steganalysis is classified into two categories:
target and blind Steganalysis. (i) Target/specific Steganalysis where particular kind
of Steganography algorithm which is used to hide the message is known and
(ii) blind/universal Steganalysis where the particular kind of Steganography tool
used to hide the message is not known.
In this paper, two sets of feature vectors are extracted from discrete wavelet
domain of images to enhance performance of a steganalyzer. The features extracted
are histogram features with three bins 5, 10, and 15 and Markov features with five
different threshold values 2, 3, 4, 5, 6, respectively. The performance of two feature
sets is compared with existing DWT features given by Farid [5] and with different
parameters among themselves.
The rest of the paper is divided in the following sections. Section 2 discusses
three Steganography algorithms outguess, nsF5, and PQ used in the experiment. In
Sect. 3, proposed DWT feature extraction method is explained. Section 4 discusses
neural network classifier used for classification. In Sects. 5 and 6, experimental
results and analysis is described in detail. Finally, the paper is concluded in Sect. 7.

2 Steganography Algorithms

Our experiment employs three steganography algorithms: Outguess, nsF5, and PQ.
Outguess [6] is an enhanced variant of Jsteg. It uses pseudorandom number gen-
erator (PRNG)-based scattering to obscure Steganalysis. Seed is required as addi-
tional parameter to initialize the pseudorandom number generator. nsF5 (no
shrinkage F5 [7]) is enhanced version of F5, proposed in 2007. This algorithm was
developed to improve the problem of shrinkage which exists in F5 algorithm by
combining F5 algorithm with wet paper codes (WPCs). In perturbed quantization
(PQ) algorithm [8], quantization is perturbed according to a random key for data
embedding. In this procedure, prior to embedding the data, on the cover–medium,
an information-reducing process is applied that includes quantization such as lossy
compression, resizing, or A/D conversion. PQ does not leave any traces in the form
that the existing Steganalysis method can grab.

3 Proposed DWT Feature Extraction

On applying the discrete wavelet transformation (DWT) [9] to each image in the image dataset, we obtain four different subbands at each level. Decomposition is carried out up to level 3, giving 12 subbands (cA1, cH1, cV1, cD1 at Level 1; cA2, cH2, cV2, cD2 at Level 2; and cA3, cH3, cV3, cD3 at Level 3), where cA holds the approximation coefficients and the remaining three subbands (cH, cV, cD) hold the detail coefficients. Histogram and Markov features are extracted from the complete image and from the 12 subbands obtained after applying the DWT.
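A minimal sketch of this decomposition step is shown below; it assumes the PyWavelets library and the Haar wavelet, since the paper does not state which wavelet or implementation was used.

import numpy as np
import pywt

def dwt_subbands(image, wavelet="haar", levels=3):
    # Returns [cA1, cH1, cV1, cD1, cA2, ..., cD3]: 4 subbands per level, 12 in total.
    subbands = []
    approx = np.asarray(image, dtype=float)
    for _ in range(levels):
        approx, (cH, cV, cD) = pywt.dwt2(approx, wavelet)
        subbands.extend([approx, cH, cV, cD])
    return subbands

image = np.random.rand(480, 640)      # stands in for a 640 x 480 grayscale image
bands = dwt_subbands(image)
print(len(bands))                     # 12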

3.1 Histogram

A histogram [10] is an approach to summarizing data measured on an interval scale (either discrete or continuous). It partitions the range of feasible values in a dataset into groups. The histogram features are extracted by binning the elements of each subband into evenly spaced containers [11] and returning the number of elements in each container. The values are thereafter normalized by dividing each element by the sum of all the elements:

H_O = H_O / sum(H_O)                                        (1)

We have extracted features from the original image plus the 12 subbands (13 subbands in total) using three different bin counts: 5, 10, and 15.
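A minimal sketch of this histogram feature extraction (our own illustration, not the authors' code) is given below; with B = 5 bins and the 13 arrays it yields the 65-feature set.

import numpy as np

def histogram_features(arrays, bins=5):
    feats = []
    for arr in arrays:
        counts, _ = np.histogram(np.ravel(arr), bins=bins)   # evenly spaced containers
        counts = counts.astype(float)
        feats.extend(counts / counts.sum())                  # H_O = H_O / sum(H_O), Eq. (1)
    return np.array(feats)

# e.g. feats = histogram_features([image] + bands, bins=5)   # 13 * 5 = 65 features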

3.2 Markov

Markov features show the difference between absolute values of neighboring


coefficients. These features have been extracted in DCT domain by Pevný [12], but
we have extracted Markov features in DWT domain [11]. We have calculated four
difference arrays along horizontal, vertical, minor diagonal, and major diagonal
direction using formula:

Nh(u, v) = N(u, v) − N(u + 1, v)                            (2)

Nv(u, v) = N(u, v) − N(u, v + 1)                            (3)

Nd(u, v) = N(u, v) − N(u + 1, v + 1)                        (4)

Nm(u, v) = N(u + 1, v) − N(u, v + 1)                        (5)

From the difference arrays, we have calculated four transition probability matrices (TPMs) along the horizontal, vertical, minor diagonal, and major diagonal directions. We have extracted features from the HL and LH subbands and averaged the four transition probability matrices (TPMs) to obtain the features. We have applied the different threshold values 2, 3, 4, 5, and 6, respectively.
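A minimal sketch of the thresholded transition probability matrix for the difference array of Eq. (2) is given below (our own illustration; the rounding step and the averaging over the four directions and the HL/LH subbands follow the description above). With T = 5, one TPM contributes (2T + 1)² = 121 features.

import numpy as np

def horizontal_tpm(subband, T=5):
    D = subband[:, :-1] - subband[:, 1:]            # horizontal difference array, cf. Eq. (2)
    D = np.clip(np.rint(D), -T, T).astype(int)      # threshold the differences to [-T, T]
    cur = D[:, :-1].ravel() + T                     # shift values to indices 0 .. 2T
    nxt = D[:, 1:].ravel() + T
    tpm = np.zeros((2 * T + 1, 2 * T + 1))
    np.add.at(tpm, (cur, nxt), 1)                   # count transitions current -> next
    rows = tpm.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1                             # avoid division by zero
    return tpm / rows                               # P(next | current)

# e.g. feats = horizontal_tpm(bands[1], T=5).ravel()    # 121 features from one TPM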

3.3 Existing DWT Farid Features

Farid proposed a 72-feature set [5]; the four features extracted from the subbands are the mean, variance, skewness, and kurtosis, whose formulas are shown below.

E(x) = (1/n) Σ_{k=1}^{n} x_k                                (6)

Var(x) = (1/(n − 1)) Σ_{i=1}^{n} (x_i − E(x))²              (7)

S(x) = E[((x − E(x)) / √Var(x))³]                           (8)

K(x) = E[((x − E(x)) / √Var(x))⁴]                           (9)

The image decomposition used in Farid paper is based on separable quadrature


mirror filters (QMFs) which contain vertical, horizontal, and diagonal subbands.
A total of 72 features are extracted from level 3 decomposition.

4 Neural Network Classifier

We have used neural network back-propagation classifier to classify images into


stego images and cover images [13]. Features extracted are given as input to input
layer, and then, input propagates through network to output layer. Output is cal-
culated and compared with target output. If actual output does not match with the
target output, then error is generated and propagated back. Weights are adjusted till
neural network classifier correctly classifies images.
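A minimal sketch of such a classifier, using scikit-learn's MLPClassifier as a stand-in for the back-propagation network used by the authors (with 20 hidden neurons, as in the experiments reported below), could look as follows; the feature matrix and labels here are placeholders, not the actual dataset.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X = np.random.rand(4000, 65)                  # placeholder features (e.g. 65 histogram features)
y = np.random.randint(0, 2, size=4000)        # placeholder labels: 0 = cover, 1 = stego

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))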

Table 1 Embedding capacities
Images      Outguess   nsF5 (%)   PQ (%)
1–500       16 × 16    10         10
501–1000    32 × 32    25         25
1001–2000   48 × 48    50         50

Table 2 Proposed histogram features and Markov features
Histogram               Markov
Bins [B]  Features      Threshold [T]  Features
5 65 T = 2 25
10 130 T = 3 49
15 195 T = 4 81
– – T = 5 121
– – T = 6 169

5 Experimental Results

We have created a dataset of images consisting of 2,000 cover and 2,000 stego images of various sizes, which are resized to 640 × 480. Stego images are obtained by applying the three Steganography tools Outguess, nsF5, and PQ. These
4000 images have been divided into training (2,800 samples), validation (600
samples), and testing (600 samples) sets. Features are extracted from image using
DWT transform domain. Then, features are fed to neural network back-propagation
classifier for detecting whether image is stego or clean image. The performance of
classifier is compared on the basis of parameter classification accuracy. Various
embedding capacities have been used to generate the dataset as shown in Table 1.
Accuracy of classifier is the fraction of test samples that are correctly classified.
Firstly, we have compared the performance of proposed features extracted from
wavelet domain among themselves and with existing DWT features. Next, exper-
iment evaluates histogram features with different bins and finally Markov features
with different threshold values.
Histogram features are extracted by binning the elements of all subbands into 5,
10, and 15 evenly spaced containers. Markov features are calculated after applying
different threshold values 2, 3, 4, 5, and 6, respectively, as shown in Table 2.

6 Analysis

A comparison between the various histogram features, on the basis of classification accuracy with 20 hidden neurons, for the Outguess, nsF5, and PQ Steganography algorithms is listed in Table 3. We can observe the following from Table 3. (i) In the Outguess Steganography algorithm, the accuracy obtained with 5 bins

Table 3 Comparison of accuracy in % of histogram features with 20 hidden neurons


Outguess nsF5 PQ
5 bin 10 bin 15 bin 5 bin 10 bin 15 bin 5 bin 10 bin 15 bin
Training 99.78 98.89 98.28 51.32 52.5 51.17 87.53 91.03 90.8
Validation 99.66 98.66 97.16 48.33 52 52.33 89 90 89.6
Testing 99.3 97.8 97.83 50.33 49.6 49.33 89.66 89.16 89.5

Table 4 Comparison of accuracy in % of various threshold values with 20 hidden neurons
            T = 2    T = 3    T = 4    T = 5    T = 6
Outguess
Training    61.142   64.214   72.107   71.75    71.35
Validation  63       62       66.667   70.166   71.66
Testing     59.166   60.666   69.334   70.667   70
nsF5
Training 87.5 92.142 93.214 95.785 96.714
Validation 87.166 90.166 92 92.833 93.667
Testing 87.166 89.166 90.667 93 94.667
PQ
Training 86.892 92.464 95.5 99 98.5
Validation 86.166 91.667 95 98 98.166
Testing 88.833 91.334 94.666 97 97.5

(65 features) is much higher than obtained with 10 bins (130 features) and 15 bins
(195 features). (ii) In nsF5 and PQ Steganography algorithm, we obtain equivalent
results with 5 bin, 10 bin, and 15 bin, respectively.
From the experimental results of Table 4, we found that accuracy obtained after
applying threshold value 5 is best in comparison with other threshold values in case
of Outguess, but in nsF5 and PQ Steganography algorithm threshold value 6 gives
more accuracy in comparison with others. Next, we evaluate the performance of the
proposed features with existing DWT features based on parameter classification
accuracy.
As shown in Fig. 1, we conclude that proposed histogram 65 features in DWT
domain give best accuracy in comparison with Markov 121 features and existing
Farid 72 DWT features for Outguess. This proves that this algorithm is more
sensitive to histogram features and with reduced number of features we are able to
obtain better accuracy.
We conclude from Figs. 2 and 3, that Markov 121 features are able to detect
nsF5 and PQ Steganography algorithm more efficiently. However, the limitation is
that histogram features (65) do not improve the accuracy much for these two
algorithms.

Fig. 1 Comparison of accuracy in % and number of features of the proposed Markov(DWT) and Histogram(DWT) features with the existing Farid(DWT) features for Outguess

Fig. 2 Comparison of accuracy in % and number of features of the proposed features with the existing features for nsF5

Fig. 3 Comparison of accuracy in % and number of features of the proposed features with the existing features for PQ

7 Conclusion

We have created image dataset consisting of 2000 cover images downloaded from
various Web sites and digital photographs. Three Steganography algorithms
(1) Outguess (2) nsF5 (3) PQ have been applied to obtain three sets of 2,000 stego

images. Various embedding capacities have been used to generate the datasets.
Three kinds of images with sizes of 16  16, 32  32, and 48  48 were
embedded in image using Outguess Steganography algorithm. Different embedding
ratio taken in nsF5 and PQ methods are 10, 25, and 50%. The reason for embedding
diverse % of information is to test the performance with different percentage of
information embedded. So total 4,000 images (cover plus stego) have been gen-
erated in each case. These 4000 images have been divided into training (2,800
samples), validation (600 samples), and testing (600 samples) by neural network
classifier. After dividing the dataset into training, validation, and testing, the pro-
posed DWT features (i) histogram, (ii) Markov are compared with existing
Farid DWT features. We have compared the histogram features with different bins
5, 10, and 15, respectively, and the Markov features with the different threshold values 2, 3, 4, 5, and 6, respectively. All the experiments are performed with the neural network
classifier. The results show 5 bins histogram features give best performance for
Outguess, and in nsF5 and PQ Steganography algorithm, we obtain equivalent
results with 5 bin, 10 bin, and 15 bin. Markov features with threshold value 5 gives
best performance for Outguess; in nsF5 and PQ Steganography algorithm, we get
better performance with Markov features with threshold value 6. Experimental
results indicate that outguess algorithm is more sensitive to histogram statistical
features and nsF5 and PQ can be detected with Markov statistical features. From the
experimental result, we can also conclude that proposed set of features gives better
performance in comparison with existing DWT features. Future work involves
(a) applying proposed features on different embedding algorithms. (b) Finding the
size of information hidden in a stego image.

References

1. Choudhary, K.: Image steganography and global terrorism. Glob. Secur. Stud. 3(4), 115–135
(2012)
2. Cheddad, A., Condell, J., Curran, K., Mc Kevitt, P.: Digital image steganography: survey and
analysis of current methods. Signal Processing 90, pp. 727–752 (2010)
3. Johnson, N.F., Jajodia, S.: Steganalysis the investigation of hidden information. In:
Proceedings of the IEEE Information Technology Conference, Syracuse, NY, pp. 113–116
(1998)
4. Nissar, A., Mir, A.H.: Classification of steganalysis techniques: a study. In: Digital Signal
Processing, vol. 20, pp. 1758–1770 (2010)
5. Farid, H.: Detecting hidden messages using higher-order statistical models. In: Proceedings of
the IEEE International Conference on Image Processing (ICIP), vol. 2, pp. 905–908 (2002)
6. Niels Provos. [Link]
7. Fridrich, J., Pevný, T., Kodovský, J.: Statistically undetectable JPEG steganography: dead
ends, challenges, and opportunities. In: Proceedings of the ACM Workshop on Multimedia &
Security, pp. 3–14 (2007)
8. Fridrich, J., Goljan, M., Soukal, D.: Perturbed quantization steganography. J. Multimedia Syst.
11, 98–107 (2005)

9. Ali, S.K., Beijie, Z.: Analysis and classification of remote sensing by using wavelet transform
and neural network. In: IEEE 2008 International Conference on Computer Science and
Software Engineering, pp. 963–966 (2008)
10. Shimazaki, H., Shinomoto, S.: A method for selecting the bin size of a time histogram. Neural
Comput. 19(6), 1503–1527 (2007)
11. Saini, M., Chhikara, R.: DWT feature based blind image steganalysis using neural network
classifier. Int. J. Eng. Res. Technol. 4(04), 776–782 (2015)
12. Pevný, T., Fridrich, J.: Merging Markov and DCT features for multi-class JPEG steganalysis.
In: Proceedings of the SPIE, pp. 03–04 (2007)
13. Bakhshandeh, S., Bakhshande, F., Aliyar, M.: Steganalysis algorithm based on cellular
automata transform and neural network. In: Proceedings of the IEEE International Conference
on Information Security and Cryptology (ISCISC), pp. 1–5 (2013)
Control Flow Graph Matching
for Detecting Obfuscated Programs

Chandan Kumar Behera, Genius Sanjog and D. Lalitha Bhaskari

Abstract Malicious programs like viruses, worms, Trojan horses, and backdoors infect host computers by taking advantage of flaws in the software and thereby introduce some kind of secret functionality. The authors of these malicious programs attempt to find new methods to avoid detection engines. They use different obfuscation techniques, such as dead code insertion and instruction substitution, to make the malicious programs more complex. Obfuscation techniques originally used by software developers to protect their software from piracy are now misused by these malware authors. This paper intends to detect such obfuscated programs or malware using a control flow graph (CFG) matching technique based on the VF2 algorithm. If the original CFG of the executable is found to be isomorphic to a subgraph of the obfuscated CFG (under examination), then the latter can be classified as an obfuscated variant.

Keywords Obfuscation · Decompilation · Optimization · Graph isomorphism · Control flow graph

1 Introduction

Obfuscation is a technique for converting a program into a new, different version while keeping the functionality of the two equal [1]. Using this technique, malware writers make their code harder to understand. A lot of malware is being designed every day, and no foolproof technique exists that can combat all of it.

C. K. Behera (&)  G. Sanjog  D. Lalitha Bhaskari


Department of Computer Science & Systems Engineering, Andhra University,
Visakhapatnam, India
e-mail: [Link]@[Link]
G. Sanjog
e-mail: [Link]@[Link]
D. Lalitha Bhaskari
e-mail: lalithabhaskari@[Link]


These malware authors keep trying to find innovative techniques to prevent their code from being caught by malware detection engines or to delay the analysis process, by transforming their malicious programs into another form that is much harder to reverse engineer [2], i.e., by changing the syntax and making the code more complex using different obfuscation techniques.
Earlier, malware detectors were designed using simple pattern-matching techniques, and some malware was then easily detected and removed by such antivirus software. These antivirus tools maintain a repository of virus signatures, i.e., binary patterns characteristic of the malicious codes, which is used to scan a program for the presence of these fixed patterns. If a match is found, the detector raises a malware alert. Eventually, however, the malware writers learnt the weakness of these detectors, namely their reliance on the presence of a particular pattern (the virus signature), and started changing the syntax or semantics of their malware in such a way that it would not get detected easily, while the basic behavior and functionality of the malware remained the same [3, 4].
Section 2 discusses disassembling. In Sect. 3, some obfuscation techniques are
listed out. Section 4 explains about control flow graphs. Finally, Sect. 5 proposes a
framework, followed by the explanation with an example, which includes disas-
sembly of suspected malware, its CFG generation, and matching. The last section is
about the conclusion of the paper with future targets.

2 Disassembly of Executables

The disassembling process aims at achieving a program as identical as to the


original program [5, 6]. Compilation process consists of translation process which
converts any high-level language (HLL) code to machine code [7]. Then, with the
help of linkers and loaders, executable code will be created. As we have the
executable of any malware, the next process will be disassembling the executable.
During this process, all the details regarding registers, individual instructions will
be trimmed away. Finally, the intermediate representation will be prepared by
replacing the jump instructions with loops and conditional statements, if available
[8, 9]. The automatically generated assembly code from the executable will be
lacking from significant comments, variable names, and function names used by the
code writer. Because, after a series of transformations, the machine-specific
metadata attached with the program semantics will be removed.

3 Obfuscation Techniques

Obfuscation was originally aimed at protecting programs against reverse engi-


neering, illegal modifications, and protecting software from illegal usage at the
stage of distribution (with the help of watermarks), but it has been broadly used by

malware authors to defeat detection [2]. Some of the simple techniques are men-
tioned below, which can be used for obfuscating the code generally.
(a) Replacement of instructions with equivalent instructions: A technique in which
instructions in the original code are replaced by some other equivalent
instructions.
(b) By introducing dead code instructions: A technique in which a dead code is
inserted into the original code.
(c) Insert semantically NOP code: By this technique, a NO operation code will be
inserted in the original code to make the code lengthier as well as complex.
(d) Insert unreachable code: Some code will be inserted in the program, which is
never going to be executed. In some cases, jumping is introduced to bypass it
directly.
(e) Reorder the instructions: Technique which uses reordering of mutually
exclusive statements such that the syntax changes from the original code but the
output remains same.
(f) Reorder the loops: This technique reorders mutually exclusive loop statements
in the program.
(g) Loop unrolling: The loop is removed, and the statements written inside the loop body are rewritten as many times as the loop counter value.
(h) Cloning methods: Several clones of the same method are created and used in different parts of the program in different ways, but for the same task.
(i) Inline methods: The body of a method is placed into the body of its callers, and the method itself is removed.
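As a small illustration of techniques (a)–(d) above (our own example, written in Python for brevity rather than the C used later in the paper), the two functions below are syntactically different but semantically equivalent.

def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

def factorial_obfuscated(n):
    dead = (n << 1) ^ n              # (b) dead code: the value is never used
    result = 1
    i = 2
    while i <= n:
        result = result * i          # (a) equivalent instruction for "result *= i"
        result = result + 0          # (c) semantically NOP code
        i += 1
    if False:
        return -1                    # (d) unreachable code
    return result

assert all(factorial(k) == factorial_obfuscated(k) for k in range(10))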

4 Control Flow Graph

The control flow of a program is driven by the execution of its instructions: sequential instructions, conditional/unconditional jumps, function calls, return statements, etc. The control flow graph (CFG) represents all the possible paths that may be covered during execution of the program [10, 11].

4.1 Control Flow Graph Reduction

If the program is an obfuscated one, then various optimization techniques will be


used on the CFG to reduce the number of basic blocks (if possible) [11]. The
reduction of CFG can be done by concatenating consecutive instruction nodes, by
removing the loops and jumps to handle possible permutation of the code or by
merging consecutive conditional jumps.

4.2 Control Flow Graph Matching

The aim is to develop a detection strategy based on control flow graphs. In this paper, the VF2 algorithm is used to test for subgraph isomorphism [12]. For two graphs G and G′, the algorithm proceeds as a traversal of a tree data structure: at each step, we try to match the next node of G with one of the nodes of G′ and stop after all the nodes of G have been traversed (i.e., after finding a leaf). The matching is attempted for all possible combinations, and if no complete matching is found, the graphs are not (subgraph) isomorphic [13, 14].
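A minimal sketch of this check, using the networkx library (whose GraphMatcher classes implement VF2) on two small hand-made CFGs rather than the paper's actual graphs, is shown below.

import networkx as nx
from networkx.algorithms import isomorphism

# CFG G of the original program: basic blocks A-E with control-flow edges.
G = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "C"), ("D", "E")])

# CFG G2 of the suspected program: same structure plus extra blocks that an
# obfuscator might introduce (hypothetical dead/unreachable code blocks X, Y).
G2 = nx.DiGraph(list(G.edges()))
G2.add_edges_from([("A", "X"), ("X", "Y"), ("Y", "E")])

# subgraph_is_isomorphic() asks whether some subgraph of G2 is isomorphic to G.
matcher = isomorphism.DiGraphMatcher(G2, G)
print("obfuscated variant of the original:", matcher.subgraph_is_isomorphic())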

5 Proposed Framework

To prove this framework, we perform the following steps:


Step 1. Get code of a simple program in high-level language, (in our case C
language). Obfuscate it several times in that particular high-level language
only, using one or more obfuscating transformations.
Step 2. Generate executables for both of these programs using C language
compiler.
Step 3. Decompile both of these executables to produce assembly language pro-
grams using any decompiler.
Step 4. Make basic block of codes from both of the assembly language programs.
Step 5. Construct control flow graph from the basic blocks for both of these
programs.
Step 6. Compare both of the CFGs for subgraph isomorphism using VF2
algorithm.
Since the obfuscated program is derived from the original program, the CFGs of the original and obfuscated programs should match. To justify the correctness of the proposed framework, a simple C program that finds the factorial of a number has been taken for the experiment.

5.1 Explanation

A simple factorial calculation program in C, shown in Fig. 1, is taken into consideration.
The program is then obfuscated using various obfuscation techniques, after which the original code is transformed as shown in Fig. 2.

Fig. 1 Snapshot of the original HLL program

Since, in practice, only the executable file of a malicious program is available, the executable code of both programs is taken for this test. Next, with the help of disassemblers, the assembly codes are generated from the machine codes of both the original and obfuscated programs (Fig. 3).
The codes are then optimized by eliminating dead code, removing global common sub-expressions, removing loop-invariant code, eliminating induction variables, etc. The control flow graphs are then generated from the assembly codes. The structures of the CFGs are shown below, to illustrate the difference between the two CFGs pictorially (Fig. 4).
By looking to all basic blocks and the jump, conditions, loops, and return
statements, the graphs are prepared from both CFGs (Fig. 5).
Both of these CFGs are matched for subgraph isomorphism using the VF2
algorithm. As a result, the following steps are performed while comparing both of
the graphs (Fig. 6).

Fig. 2 Snapshot of obfuscated HLL program



Fig. 3 Assembly code after disassembling the executable of ‘fact.c’ program

Fig. 4 Assembly code after disassembling the executable of ‘obfuscated fact.c’ program

Fig. 5 Snapshots of the structures of the CFGs of the original and obfuscated program

Fig. 6 Optimized CFGs G and G′ of the original and the obfuscated program

(a) Both of the empty graphs G and G′ match


(b) Node A of G matches with A′ of G′
(c) Node B of G matches with B′ of G′
(d) Node C of G matches with C′ of G′
(e) Node D of G matches with D′ of G′
(f) Node E of G matches with E′ of G′
After reaching the leaf node of graph G, no more nodes are left in it, and so it can be declared that graph G is isomorphic to a subgraph of graph G′. Thus, the CFG of the original program turns out to be isomorphic to a subgraph of the CFG of the obfuscated program, which is the expected result.
The above exercise establishes that the proposed framework works as designed and can be used for detection of advanced malware, or of any other software over which obfuscation techniques have been applied. In contrast, existing systems only compare the binary signatures of the two programs. Because several obfuscation techniques have been applied to the original program, the binary pattern of the obfuscated program differs hugely from that of the original program, so the obfuscated program will not be flagged as malware by existing antivirus engines.

6 Conclusion

In this paper, a framework has been proposed to detect obfuscated programs or malware using a control flow graph matching technique. The framework has been verified, and its functionality demonstrated, through an example. After generating executables from both the original and obfuscated programs, disassembly was performed on both executables (since for malware only the executables, not the source code, are available). Next, both the assembly codes and the constructed CFGs of both executables were optimized. Matching was then carried out between these CFGs for subgraph isomorphism using the VF2 algorithm, and they were found to match. This means that even after obfuscation, the proposed framework is able to detect the program. Future work is to maintain a database of CFGs of known malware and to study the related matching issues (such as matching the CFG of a new suspected program against a huge database of CFGs). For this, better graph storage and graph comparison techniques are to be implemented as a continuation of this paper.

References

1. You, I., Yim, K.: Malware obfuscation techniques: a brief survey. In: International
Conference on Broadband, Wireless Computing, Communication and Applications, IEEE
Computer Society, pp. 297–300 (2010)
2. Sharif, M., et al.: Impeding malware analysis using conditional code obfuscation. In: Network
and Distributed System Security Symposium (2008)
3. Walenstein, A., Lakhotia, A.: A transformation-based model of malware derivation. In: 7th
IEEE International Conference on Malicious and Unwanted Software, pp. 17–25 (2012)
4. Durfina, L., Kroustek, J., Zemek, P.: Psyb0t malware: a step-by-step decompilation—case
study. In: Working Conference on Reverse Engineering (WCRE), pp. 449–456. IEEE
Computer Society (2013)
5. Ernst, M., et al.: Quickly detecting relevant program invariants. In: 22nd International
Conference on Software Engineering, pp. 449–458 (2000)
6. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: Evaluating performance of the VF graph
matching algorithm. In: Proceedings of the 10th International Conference on Image Analysis
and Processing, pp. 1172–1177. IEEE Computer Society Press (1999)
7. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: An improved algorithm for matching large
graphs. In: 3rd International Workshop on Graph-based Representations, Italy (2001)
8. McKay, B.D.: Practical graph isomorphism. Congressus Numerantium 30, 45–87 (1981)
9. Messmer, B.T., Bunke, H.: A decision tree approach to graph and subgraph isomorphism
detection. J. Pattern Recog. 32, 1979–1998 (1999)
10. Gold, R.: Reductions of control flow graphs. Int. J. Comput., Electr. Autom. Control Inf. Eng.
8(3), (2014)
11. Sadiq, W., Orlowska, M.E.: Analyzing process models using graph reduction techniques. Inf.
Syst. 25(2), 117–134 (2000)
12. Bondy, J.A., Murty. U.S.R.: Graph Theory. Springer, Berlin (2008)
13. Abadi, M., Budiu, M., Erlingsson, U’., Ligatti, J.: Control flow integrity principles,
implementations, and applications. ACM Trans. Inf. Syst. Secur. 13(1), 4:1–4:40 (2009)
14. Brunel, J., Doligez, D., Hansen, R.R., Lawall, J.L., Muller, G.: A foundation for flow-based
program matching, using temporal logic and model checking POPL. ACM (2009)
A Novel Framework for Predicting
Performance of Keyword Queries Over
Database

Mujaffar Husain and Udai Shanker

Abstract In the fast-growing information era, business and commercial RDBMSs provide access to huge amounts of data in the form of distributed databases, and this data can be accessed very easily through keyword queries, because there is no need to know a structured query language. This matters for almost every end user: they do not know SQL but still need efficient results. A keyword query interface (KQI) makes this possible, but it often suffers from low ranking quality, i.e., low precision or recall values, as shown in recent benchmarks. Because of the ambiguity of keyword-based search, query results need improvement, and the effectiveness of a keyword query should be judged from its result. The issue deserves attention because huge data creates more complications, and for efficient results the query should be appropriate. A commercial database must support an efficient approach to deal with issues such as the low precision of results; with existing methods, precision is low because of the ambiguous interpretation of queries. In this paper, we therefore rank the results according to a similarity score based on a mathematical model, so as to obtain a proper ranking and deliver efficient results for keyword-based queries.

Keywords Keyword query interface · Query effectiveness · Keyword query · Similarity score · Precision

1 Introduction

Keyword-based search has long been useful for unstructured data and is now becoming popular day by day for RDBMSs because of end-user requirements, so this field has attracted much work on efficient ways to tackle problems related to keyword-based search. Many researchers have suggested different approaches; papers [1–5]

M. Husain (&) · U. Shanker


Department of Computer Science, MMMUT, Gorakhpur, India
e-mail: Mujaffarhn786@[Link]
U. Shanker
e-mail: udaigkp@[Link]


present different methods for dealing with keyword-based search. A keyword query interface tries to find the most appropriate and efficient result for each query. Each keyword query may have different results, which makes exact prediction impossible in this scenario. Day by day, more users want to access data in all the forms available on the network, and they want their desired results by simply typing a keyword-based search, because most users are not aware of database languages. Different papers suggest different approaches to deal with this; for example, papers [6–9] show how to obtain efficient results using keyword queries, such as selecting top-k results, SemSearch, and index-search-based results, which efficiently filter the results for a given query. Some introductory information on the terminology used in this paper follows:
Structured data: when data is stored in the form of a relational database and can easily be retrieved by a structured query language (SQL), it is called structured data. Unstructured data is data stored without any pre-implemented logic, in the form of files, the Web, etc., and cannot be retrieved easily [10]. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
• huge data collections, with millions of pages added every day;
• powerful supercomputers with great computational power;
• efficient data mining algorithms for efficient computation.
Business databases are growing at a huge rate; according to Google, millions of pages are added daily. Huge data creates complexity problems, so more efficient strategies are needed to manage and access it. According to one study, the amount of data added to the Internet in a single day is now comparable to all the information that was available two thousand years ago. The additional need for enhanced computational systems can now be met in a cost-effective way with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least ten years but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical techniques.
How effective is a query? For any query, we can calculate how effective it is, and if its precision value is low we can suggest another query in place of the hard one.

2 Related Work

Today, many applications work on both structured and unstructured data, and many users try to find answers to their queries with keyword search only. Commercial relational database management systems (RDBMSs) nowadays provide a way to execute keyword queries. For this type of query, getting an appropriate answer is not as simple as in SQL execution, and so

different approaches such as [1–9, 11, 12] have been suggested by researchers to obtain appropriate answers by simply executing keyword queries over a structured database. As we know, keyword queries cannot yield exact answers; ambiguity creates multiple answers for the same query, so researchers have suggested different approaches to rank results according to their relevance and also to rank the performance of the queries executed on particular databases.

2.1 Post-retrieval Methods

These methods work after each keyword query has been executed and help to obtain efficient results.

2.2 Clarity Score Based

The strategies based on the idea of a clarity score assume that users are interested in very few topics, so they consider a query easy if its results fit very few topic(s) and can thus be effectively distinguished from the other documents in the collection [7].

2.3 Popularity Based

The basic idea of popularity ranking is to find the pages that are popular and rank them higher than other pages that contain the specified keywords. This mostly applies to unstructured, Web-based keyword search; most search engines, such as Google and Bing, use this strategy.

2.4 Challenges in Information Retrieval from a Structured Database with Keyword Queries

Information retrieval is as important as the information itself, because if we cannot get the required information from the database, there is no point in storing the data. From our survey, we have found the following big challenges:
• Proper interpretation of what a particular keyword query means is hard.
• Ambiguity of a keyword query: it is difficult to determine exactly which type of information the user wants to search for.
• The information need behind each keyword query may differ, even though the query itself is the same.

3 Proposed System

3.1 Objective

We have studied many methods used to obtain efficient results with the help of keyword queries, but each method has certain problems; in our work, we try to improve the performance of keyword queries as follows:
• Improve the performance of the keyword query on the basis of the information need behind it.
• Find the similarity score between a document and a keyword query, and rank the query according to its performance, which we calculate from precision and recall values.
• Suggest to a user which keyword query will be useful for the particular information he or she wants.
In our proposed method, we create vectors to map the effectiveness of a query on the database or structured data, based on the terms in each particular document. The vector space model is very useful for calculating the distance between a query and the data. The model was first used on unstructured data to obtain appropriate results when executing keyword queries; it relates the query to the document set through the terms present in the query (Table 1). With the help of this model, it is easy and effective to rank the results according to their relevance for the particular query, and it provides the mathematical machinery to obtain efficient results in the presence of ambiguity in the interpretation of the query.
In practice, it is easier to calculate the cosine of the angle between the vectors than the angle itself (Fig. 1):

$$\cos\theta = \frac{d_2 \cdot q}{\lVert d_2 \rVert\, \lVert q \rVert} \qquad (1)$$

Table 1 Hierarchy of different model



Fig. 1 Vector representation


of query and documents

In the above equation, d2 · q is the dot product of the document vector d2 and the query vector q, ‖d2‖ is the norm of vector d2, and ‖q‖ is the norm of vector q. The norm of a vector is calculated as:

$$\lVert q \rVert = \sqrt{\sum_{i=1}^{n} q_i^{2}} \qquad (2)$$

In this model, documents and queries are represented as vectors:

$$d_j = (w_{1,j},\, w_{2,j},\, \ldots,\, w_{t,j}), \qquad q = (w_{1,q},\, w_{2,q},\, \ldots,\, w_{n,q}) \qquad (3)$$

In this mathematical model, we calculate the weight of each matching term or word; each document is represented as a vector of algebraic terms, and the query is represented in the same manner. The model is known as the term frequency–inverse document frequency (tf–idf) model. The weight vector for a document d is

$$v_d = \left( w_{1,d},\, w_{2,d},\, \ldots,\, w_{N,d} \right)^{T}, \qquad w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{\, d' \in D : t \in d' \,\}|}$$

where tf_{t,d} is the frequency of term t in document d.

The logarithmic factor gives the inverse document frequency: |D| is the total number of documents in the document set, and |{d' ∈ D : t ∈ d'}| is the number of documents in which term t is present.
Using the cosine model, the similarity score between document d_j and query q can be calculated as:

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{\lVert d_j \rVert\, \lVert q \rVert} = \frac{\sum_{i=1}^{N} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^{2}}\; \sqrt{\sum_{i=1}^{N} w_{i,q}^{2}}} \qquad (4)$$

This is the method used to implement keyword query ranking for the database. Compared with methods that have already been implemented, it performs very well and its precision value is good. The method is based on a similarity score to obtain efficient results (Fig. 2).
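To make the weighting and ranking of Eqs. (1)–(4) concrete, the following minimal Python sketch computes tf–idf vectors (with the term frequency normalised by the maximum term count, as in the worked example below) and ranks documents by cosine similarity. The function and variable names are our own illustrative choices, not taken from the paper's implementation.

# Minimal sketch of the tf-idf weighting and cosine similarity of Eqs. (1)-(4).
import math
from collections import Counter

def tfidf_vector(tokens, vocab, idf):
    """Weights w_{t,d} = tf_{t,d} * idf_t, with tf normalised by the maximum count."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return [(counts[t] / max_count) * idf[t] for t in vocab]

def cosine(u, v):
    """sim(d, q) = (d . q) / (||d|| ||q||), as in Eq. (4)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(documents, query):
    """Return (score, document) pairs ordered by descending cosine similarity."""
    docs = [d.split() for d in documents]
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    idf = {t: math.log2(n / sum(1 for d in docs if t in d)) for t in vocab}
    q_vec = tfidf_vector(query.split(), vocab, idf)
    scored = [(cosine(tfidf_vector(d, vocab, idf), q_vec), documents[i])
              for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)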
Steps of the ranking module, with an example of the method worked out by algebraic calculation: here is a simplified case of the vector space retrieval model. Consider a small collection C consisting of the following three documents:
d1: "tom cruise MIB"
d2: "tom hunk future"
d3: "cameroon diaz MIB"

Fig. 2 Ranking schema



A few terms appear in two documents; some appear in only one document. The total number of documents is N = 3. Consequently, the idf values for the terms are:
“cameroon log2 (3/1) = 1.584”
“diaz log2 (3/1) = 1.584”
“tom log2 (3/2) = 0.584”
“future log2 (3/1) = 1.584”
“MIB log2 (3/2) = 0.584”
“cruise log2 (3/1) = 1.584”
“hunk log2 (3/1) = 1.584”
For the above documents, we calculate the document weights according to the term frequency in each document, based on log2 of the ratio of the total number of documents to the number of documents containing the term. Next, we multiply the tf scores by the idf values of every term, obtaining the following matrix of documents by terms. (All the terms appear just once in every document in our small collection, so the maximum count used for normalization is 1.)
For the above document set, we execute the query "tom cruise MIB", compute the tf–idf vector for the query, and store the values for further calculation of the final ranking list of the result (Tables 2 and 3).
Given the following query, "tom tom MIB", we compute the tf–idf vector for the query and calculate the term frequencies to obtain the similarity score that is used to rank the results. The query terms are then matched against the document set, and the distance between the query and each document is calculated.
For q: tom (2/2) × 0.584 = 0.584, cruise 0, MIB (1/2) × 0.584 = 0.292, and 0 for all other terms.
We compute the length of every document and of the query (Table 4). The similarity values are then obtained; according to these values, the final order in which the documents are presented as the result of the query is d1, d2, d3 (Table 5).

Table 2 Value for keyword in documents


Tom Cruise MIB Hunk Future Cameroon Diaz
d1 1 1 1 0 0 0 0
d2 1 0 0 1 1 0 0
d3 0 0 1 0 0 1 1

Table 3 Value for further calculation


Tom Cruise MIB Hunk Future Cameroon Diaz
d1 0.584 1.584 0.584 0 0 0 0
d2 0.584 0 0 1.584 1.584 0 0
d3 0 0 0.584 0 0 1.584 1.584

Table 4 Calculation of length of documents and query


Length of d1 = sqrt((0.584)² + (1.584)² + (0.584)²) = 1.786
Length of d2 = sqrt((0.584)² + (1.584)² + (1.584)²) = 2.316
Length of d3 = sqrt((1.584)² + (1.584)² + (0.584)²) = 2.316
Length of q = sqrt((0.584)² + (0.292)²) = 0.652

Table 5 Final ranking


cosSim(d1, q) = (0 × 0 + 0 × 0 + 0.584 × 0.584 + 0 × 0 + 0.584 × 0.292 + 0.584 × 0 + 0 × 0)/(1.011 × 0.652) = 0.776
cosSim(d2, q) = (0 × 0 + 0 × 0 + 0.584 × 0.584 + 1.584 × 0 + 0 × 0.292 + 0.84 × 0 + 0 × 0)/(1.786 × 0.652) = 0.292
cosSim(d3, q) = (1.584 × 0 + 1.584 × 0 + 0 × 0.584 + 0 × 0 + 0.584 × 0.292 + 0 × 0 + 0 × 0)/(2.316 × 0.652) = 0.112
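As a usage note, the sketch given earlier can be exercised on this small collection. The absolute scores it prints differ somewhat from the hand-worked values in Table 5, but the resulting order of the documents, d1, d2, d3, is the same.

# Exercising the earlier rank() sketch on the toy corpus with the query "tom tom MIB".
docs = ["tom cruise MIB", "tom hunk future", "cameroon diaz MIB"]
for score, doc in rank(docs, "tom tom MIB"):
    print(f"{score:.3f}  {doc}")   # prints d1 first, then d2, then d3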

4 Implementation and Discussion

The framework is implemented with the help of Java, and query ranking is efficiently calculated by this implementation. We use a MySQL database to store the data, with two tables named movie and actor, and access the data through the Java programming language. For example, we store data in the actor table as follows:
insert into actor values ('tom cruise', 'titanic', '2000');
insert into actor values ('tom cruise', 'mission impossible', '2002');
After executing the queries, we can obtain a ranking of the queries based on the result (Fig. 3). The method performs very well, and the precision is considerable (Fig. 4).
From this figure, we can conclude that the proposed approach produces higher precision values than the existing one. Using our method, the proposed system achieves higher precision values than the base-paper method used in the existing system, and it does so across the range of data values considered. Thus, we achieve higher precision for the proposed system than for the existing system. If the keyword query suggestions are followed, efficient results with precision values near 0.8–0.9 can be obtained.

Fig. 3 Software for


predicting performance of
keyword queries over
databases

Fig. 4 Similarity score


calculation based on model

5 Conclusion

In our work, we try to find efficient results for a given keyword query based on a ranking of the results calculated with our suggested mathematical model, which yields high-precision results according to user needs. Existing frameworks have drawbacks: their search

quality is lower than the other framework, and dependability rate of the framework
is most minimal. Keeping in mind the end goal to defeat these downsides, we are
proposing the enhanced ranking calculation which is utilized to improve the
exactness rate of the framework. Moreover, we proposed the idea of
philosophy-based algebraic model apparatus for enhancing the top k efficient results
productivity. It is utilized for giving comparable words as well as semantic
importance for the given keyword query. From the efficient result, we are acquiring
the proposed framework is well powerful than the current framework by method for
exactness rated, nature of result, and effectiveness of our model. This model tries to
predict the effectiveness of the query on the database.

References

1. Kurland, O., Shtok, A., Carmel, D., Hummel, S.: A unified framework for post-retrieval
query-performance prediction. In: Proceedings of 3rd International ICTIR, Bertinoro, Italy,
pp. 15–26 (2011)
2. Lam, K.-Y., Ulusoy, O., Lee, T.S.H., Chan, E., Li, G.: An efficient method for generating
location updates for processing of location-dependent continuous queries. 0-7695-0996-7/01,
IEEE (2001)
3. Hauff, C., Azzopardi, L., Hiemstra, D., Jong, F.: Query performance prediction: evaluation
contrasted with effectiveness. In: Proceedings of 32nd ECIR, Milton Keynes, U.K.,
pp. 204–216 (2010)
4. Kurland, O., Shtok, A., Hummel, S., Raiber, F., Carmel, D., Rom, O.: Back to the roots: a
probabilistic framework for query performance prediction. In: Proceedings of 21st
International CIKM, Maui, HI, USA, pp. 823–832 (2012)
5. Hristidis, V., Gravano, L., Papakonstantinou, Y.: Efficient IR style keyword search over
relational databases. In: Proceedings of 29th VLDB Conference, Berlin, Germany,
pp. 850–861 (2003)
6. Luo, Y., Lin, X., Wang, W., Zhou, X.: SPARK: top-k keyword query in relational databases.
In: Proceedings of 2007 ACM SIGMOD, Beijing, China, pp. 115–126 (2007)
7. Trotman, A., Wang, Q.: Overview of the INEX 2010 data centric track. In: 9th International
Workshop INEX 2010, Vugh, The Netherlands, pp. 1–32 (2010)
8. Tran, T., Mika, P., Wang, H., Grobelnik, M.: Semsearch ‘S10. In: Proceedings of 3rd
International WWW Conference, Raleigh, NC, USA (2010)
9. Demidova, E., Fankhauser, P., Zhou, X., Nejdl, W.: DivQ: diversification for keyword search
over structured databases. In: Proceedings of SIGIR’ 10, Geneva, Switzerland, pp. 331–338
(2010)
10. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun.
ACM 18(11), 613–620 (1975). [Link]
papers/[Link]
11. Finin, T., Mayfield, J., Joshi, A., Cost, R.S., Fink, C.: Information retrieval and the semantic
web. In: Proceedings of 11th International Conference on Information and Knowledge
Management, pp. 461–468, ACM (2002)
12. Ganti, V., He, Y., Xin, D.: Keyword++: a framework to improve keyword search over entity
databases. Proc. VLDB Endow. 3(1), 711–722 (2010)
13. Manning, C., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval.
Cambridge University Press, New York (2008)
Predicting and Accessing Security
Features into Component-Based
Software Development: A Critical
Survey

Shambhu Kr. Jha and R. K. Mishra

Abstract Software development communities have made a true advance in software development through the concept of the component-based software approach and commercial off-the-shelf (COTS) components. The majority of present software applications have software components as their basic elements, and component-based software development (CBSD) has been successful in building applications and systems. However, the security of CBS at the software component level is still not properly addressed. Developing secure software is the responsibility of all the stakeholders involved in the component-based software development model. The main challenge is to assess how much security we have achieved, or how we can predict and evaluate security at an early stage of software development. Software security has a very constructive impact on software productivity, maintainability, cost, and quality. Therefore, the earlier we introduce and evaluate security in component-based software, the more productivity and quality we can achieve. In this paper, efforts are made to provide suitable guiding principles to software engineers in the development of secure component-based software products. The paper also gives an overview of requirement specification and analyzes software architectures and design for developing secure component-based software products.


Keywords Commercial off-the-shelf (COTS) · Component-based software development (CBSD) · Component integrator · Intrusion detection · System resilience

S. Kr. Jha (&)


Mewar University, Chittorgarh, India
e-mail: skjha2@[Link]
R. K. Mishra
National Informatics Centre, New Delhi, India
e-mail: [Link]@[Link]


1 Introduction

Component-based software is considered a significant answer to many of the challenges facing the software community [1]. It is credited with improving efficiency and maintainability and with reducing development time and cost. The majority of software development organizations have claimed significant benefits from it. Some software developing organizations have implemented systematic reuse programs which have resulted in in-house libraries of reusable components off-the-shelf (COTS) [2, 3]. Other software developing organizations have supported their reuse with component-based technologies and tools. The explosive popularity of the Internet has created a new situation for reusable software components. Consequently, many organizations spend much time on reusable component selection and retrieval, since the choice of the appropriate components has a major impact on the software project and, finally, on the resulting products. Despite its enormous potential, both in terms of cost and in terms of product quality, the technology of software reuse has not become as popular as expected. Regular surveys have been conducted on component-based software development, and it was found that more than 50% of software companies around the world use the component-based approach in their development. The current literature on the CBSD approach mainly concerns integration and reusability, acquiring existing components from a repository with unknown security properties [3–5]. In most component-based development approaches, security is given the last priority, whereas functionality is given top priority. Therefore, the current component-based development process is failing miserably to develop secure software systems. The main reason behind security lapses is the lack of trusted software components and of producing software in a secure manner. This is very important because individuals and organizations depend heavily on software. Despite the huge popularity of CBSD in industry and the large number of research publications about it in academia, CBSD still lacks fundamental formal foundations for the specification, composition, and verification of security requirements [6]. Current CBSD practices do not provide the essential requirements for developing secure systems [7, 8]. Various surveys have been conducted, reporting various hurdles involved in the use of CBSD. Since the component-based development approach is mainly based on interdependencies among reusable software components, which create trouble at integration time, the security characteristics of software components must be measured and evaluated early in the CBSD life cycle.

2 Security Specification for Software Component

Software security requirements must be considered a fundamental step in the development of a secure system [9]. They must be integrated into the development process at the very early requirements phase. Unfortunately, in most software

Table 1 General software security requirements


I Security goals are totally ignored due to various other project constraints such as development time, budget, and complexity of the business process
II Vague security goals are specified at the early development phase of the software, as stakeholders are unclear about them
III The functional manager gives least priority to security issues, being more concerned with functional goals
IV Commonly used security mechanisms are specified as architectural constraints
V Many of the users who play a major role with the software are not consulted about security goals

development processes, including the component-based approach, security features are always treated as a second choice. The functional manager of the software is always concerned with meeting the functional goals and objectives given the complexity, time, and budget constraints of the software project. People from the software community, including academicians, researchers, software developers (specifically component integrators), and other practitioners working in the domain of component-based development, were contacted to express their views on the security requirements of component-based software [10]. Most of them said that security goals specified at the early development phase are unclear and given least priority. During the early requirements specification of software development, only a very few selected component users are contacted and the rest are simply ignored. These observations are summarized in Table 1.
Early requirement plans are likely to change as the development process proceeds, but early planning helps to ensure that major security requirements are not ignored. If we can estimate and quantify security requirements at an early stage of development, we have better control in managing and implementing them at the highest level. The difficulty of specifying security requirements during the development of component-based software stems from the specific security needs of different component-based applications and from the complex nature of the different operating environments under which the same software component operates. In the component-based approach, where the same component has different levels of security requirements depending on the operating environment in which it is used, component integrators or users have to implement security requirements while designing the application program interface (API) used to integrate the different software components, because a component, over its lifetime, may be used by different applications running in many types of environments and performing different tasks (Fig. 1).
Software component vendors also play a major role in fulfilling the security requirements of component-based software development. Component vendors should apply various security mechanisms such as component certification, user verification, access control, component encryption, and intrusion detection systems. Each software component should be thoroughly documented before its actual use.

Fig. 1 Requirement-centric security attributes (requirement definition, component search, high-level design, selection, evaluation, integration and operation, with common criteria for security assessment and evaluation)

A software component might satisfy various security measures for a particular application in a specific operating environment, but the same component might not provide the same security in a completely different operating environment. The ability to produce highly secure, quality software applications within a fixed time and budget remains a big challenge for most software development companies. Any fault in the software, or delay in its delivery, will cause problems for the many stakeholders involved.
for many stakeholders involved.

3 Architecture and Design for Secure Software


Component

Embedding security attributes in component-based development remains a huge challenge at the architectural design level. In the process of exploring the architecture and early design aspects of software components, to gain insight into the embedding of security features in the CBSD process, we met with various experts from industry and academia. An online survey was also conducted through e-mail and social media such as LinkedIn, Twitter, WhatsApp, and Facebook, in the form of questionnaires sent to all potential candidates. The candidates involved in the survey mainly included experts from industry, such as software component vendors, component integrators, and project managers of component-based projects, academics from computer science and applications, and Ph.D. research scholars conducting research on software quality assessment. The questionnaires contained questions on component-based software design processes and their software security features. Participants were asked to share their professional view on the specified statements in the form of questions. The questionnaires were mainly of the following types, as presented in Table 2.

Table 2 Category of questionnaires


Category Description
A Life cycle models used during the development of the software component repository and of the component-based software
B Selection criteria used by the CBS developer for selecting individual software components
C Attributes of functional and non-functional security requirements desired by the various stakeholders at the application level
D Security features incorporated by the component integrator during design of the application program interface required to integrate the various software components

After considerable time spent compiling the data, it was observed that security assessment for component-based software development must occur at three different levels, as shown in Fig. 2. Once all the functional and non-functional security requirements are precisely identified and estimated, the available COTS components must be evaluated to determine their suitability for use in the application being developed. Access to a component repository helps the component integrator select the most suitable component for the development of a specific application. When interpreting the respondents' feedback for category A questionnaires, it was observed that the majority of component integrators do not use a proper CBSD process for developing software applications. For category B, component documentation and vendor reputation play a more important role than any other factors. For category C, the majority of participants share the view that security activities should be embedded at the component level rather than the system level; high-risk factors were associated with the use of external components while developing in-house applications. For category D, access control and the application of encryption techniques were preferred by the majority of respondents.

Fig. 2 Levels of security needed in CBSD (component level, interface level, application level)

4 Proposed Framework for Security of Component-Based


Software

The development of secure and trustworthy component-based systems needs to address the problem of security evaluation and prediction at the individual component level and at the application level too. This paper proposes a component security framework for the development of secure and trustworthy component-based applications. In the proposed framework, CBSD security criteria are broadly classified into two categories: (i) functional security criteria and (ii) non-functional security criteria.
Functional security criteria evaluate and predict security mechanisms by confirming all valid users of the component before its actual use. Controlled navigation of the component by the component user plays a very important role, while encryption of data during intercomponent communication adds extra security at the application level. It also ensures data integrity at the component level as well as the application level (Fig. 3).
The goal of these mechanisms is to help estimate and predict the security level of the proposed component-based software application so as to fulfill the high-level system's non-functional security requirements. Similarly, non-functional security criteria identify a system's resilience and its level of immunity to attack from outside

Fig. 3 Security framework for CBSD (functional security criteria: user authentication, access control, encryption and data integrity, intrusion detection; non-functional security criteria: system resilience, level of immunity to attack, product documentation)



agencies. An outside agency may be a hacker, cracker, or any so-called cyber terrorist intending to perform fraud or manipulation. We consider the majority of components to be black box in nature; however, software components may also be white box or gray box. The encryption mechanism of the proposed framework is better suited to components that are white box in nature at the component level, whereas the same encryption mechanism can be applied at the level of intercomponent communication for components that are black box in nature at the application level. Access control methods provide guidelines to component integrators for selecting the most suitable software components, which can produce high-quality component-based applications. Component certification and documentation also play a major role in addressing various security issues. While proposing this framework, a survey was also conducted among various component users working at different levels about component certification and documentation. It was observed that very few component vendors work properly on component documentation, and the component certification issue is even less explored. Out of 150 components selected as a sample, fewer than 5% were found to be thoroughly documented.

5 Conclusion and Future Scope

This paper mainly discusses the challenges faced by software component integrators when implementing various functional and non-functional security measures during component-based software development. It also discusses the impact of early security requirements specification on the development process of a component-based application. The paper provides guidelines to component-based software developers on incorporating various security mechanisms through the prescribed security framework, and it discusses security at three different levels: component-level security, application-level security, and security applied at the time of intercomponent communication. Improvement of component documentation and standardization of component certification are areas of concern that can be considered future scope for predicting and assessing security in component-based software development. More thorough validation of the proposed framework can also improve both the functional and non-functional security of component-based software development.

References

1. Agrawal, J.R.D.V.K.: Study of perfective maintainability for component-based software


systems using aspect-oriented-programming techniques. In: Proceedings of the International
Conference on Intelligent Computational Systems, Dubai, 7–8 Jan 2012
2. Jha, S.K., Mishra, R.K.: Accessing software quality for component based software through
trustworthiness and dependability analysis. Int. J. Develop. Res. 5(4) (2015)

3. Gill, N.S.: Importance of software component characterization for better software reusability.
ACM SIGSOFT SEN 31 (2000)
4. Sharma, A., Grover, P.S., Kumar, R.: Investigation of reusability, complexity and
customizability for component based systems. ICFAI J. IT 2 (2006)
5. Mir, I.A., Quadri, S.M.: Analysis and evaluating security of component-based software
development: a security metrics framework. Int. J. Comput. Netw. Inf. Secur. (2012)
6. Kahtan, H., Bakar, N.A., Nordin, R.: Dependability attributes for increased security in CBSD.
J. Comput. Sci. 8(10) (2014)
7. Alberts, C., Allenm, J., Stoddard, R.: Risk-based measurement and analysis: application to
software security. Software Engineering Institute, Carnegie Mellon University (2012)
8. Soni, N., Jha, S.K.: Component based software development: a new paradigm. Int. J. Sci. Res.
Educ. 2(6) (2014)
9. Zurich, E.T.H.: A common criteria based approach for COTS component selection. Chair of
software engineering ©JOT, 2005 special issue: 6th GPCE Young Researchers Workshop
(2004)
10. Bertoa, M.F., Vallecillo, A.: Quality attributes for COTS components. In: Proceeding of the
6th ECOOP Workshop on Quantitative Approaches in Object-Oriented Software Engineering
(QAOOSE 2002). Malaga, Spain (2002)
Traditional Software Reliability Models
Are Based on Brute Force and Reject
Hoare’s Rule

Ritika Wason, A. K. Soni and M. Qasim Rafiq

Abstract This research analyses the causes for the inaccurate estimations and
ineffective assessments of the varied traditional software reliability growth models.
It attempts to expand the logical foundations of software reliability by amalga-
mating techniques first applied in geometry, other branches of mathematics and
later in computer programming. The paper further proposes a framework for a
generic reliability growth model that can be applied during all phases of software
development for accurate runtime control of self-learning, intelligent, service-
oriented software systems. We propose a new technique for employing runtime code specifications for software reliability. The paper aims at establishing the fact that
traditional models fail to ensure reliable software operation as they employ brute
force mechanisms. Instead, we should work on embedding reliability into software
operation by using a mechanism based on formal models like Hoare’s rule.

Keywords Learning automata · Software reliability · Formal methods · Automata-based software reliability model · Finite state automata

R. Wason (&)
Department of MCA, Bharati Vidyapeeth’s Institute of Computer Applications
and Management, New Delhi, India
e-mail: rit_2282@[Link]
A. K. Soni
School of Engineering and Technology, Sharda University, Greater Noida, India
e-mail: [Link]@[Link]
M. Qasim Rafiq
Department of Computer Science and Engineering, Aligarh Muslim University,
Aligarh, India
e-mail: mqrafiq@[Link]


1 Introduction

Ensuring correct operation by implementing dependable software is the most


important challenge of software production today [1]. Accurate reliability assess-
ment not only optimizes the overall software life cycle costs but also allows for
efficient scheduling of resources for software [2]. It also paves the way for sound performance appraisal of operational software. Reliability assessments form the basis of various activities such as software project management, software feasibility analysis and software modification decisions. Hecht [3] divides software reliability functions into three major categories: measurement, estimation and prediction. Software relia-
bility “measurement” relies upon failure interval data obtained by executing the
software in its real operational environment [4]. Software reliability “estimation”
refers to determining software reliability based upon operation in some test envi-
ronment. Reliability “prediction” implies the practice of computing software reli-
ability measures from program data which does not include failure intervals and
takes into account factors like size and program complexity [5]. Diverse applica-
tions and importance of software reliability metrics can be strongly coupled to the
functions described above. For example, system engineering relies upon prediction;
project management upon estimation; and operational software management upon
measurement [6].
To ensure the above characteristics, many varied software reliability models have been proposed since the early 1970s [7]. A careful analysis of these models reveals that they actually try to provide a mechanism for describing a random process called failure, either as a function of the number of failures experienced or of the next time to failure [8]. For the past four decades, all these models have estimated software reliability by combining failure history data with their individual assumptions, as depicted in Table 1.
An analysis of the different classes of software reliability models as depicted in
Table 1 reveals that all the traditional reliability models suffer from what we may
call the “conditional applicability syndrome”. None of the existing software relia-
bility models are capable of maintaining and controlling software reliability
throughout the software life cycle. Though these models may be successful for a
particular software system under one operational environment, they may fail or
yield inaccurate estimates for the same software under different settings.
A major lacuna of all software dependability and reliability research is the
absence of a generic, intelligent, self-learning reliability estimation model. Despite
huge effort and volumes of work, we are still looking for a formal reliability
assessment model that can identify possible states the software may assume during
its life cycle to ensure accurate reliability “prediction”, precise reliability “estima-
tion” and correct reliability “measurement” once the software is operational [9, 10].
All the conventional models treat software as a black box and estimate
software reliability using complex statistical or mathematical functions which
have been borrowed from hardware reliability models [11]. Detailed analysis of the
history of software reliability estimation parameters reveals that all the common

Table 1 Applicability of traditional reliability models in different software development phases


S. No. Software Features Usable reliability Drawbacks
development models
phase
1 Design phase Faults are detected by None Absence of test cases
other formal and and failure data makes
informal methods all existing software
reliability models
unsuitable
2 Unit testing Test cases generated Fault seeding Test cases are not a
phase module-wise from models or input representative sample
input domain in domain-based of the actual
deterministic fashion. models operational usage;
In most cases, hence, estimates
programmer himself is obtained may not be
the tester of much use
3 Integration Test cases generated Time between Different reliability
testing phase to verify the failures, fault estimation models
correctness of the seeding, failure applicable under
partial or whole count and input diverse assumptions.
system in either a domain-based Choice of model
random or models should be based on
deterministic fashion operational
environment
4 Acceptance Inputs for operational Fault count and Reliability model
testing phase usage are generated to input based on operational
verify software domain-based environment
acceptability models
5 Operational Allied, deterministic Fault count models Reliability model
phase user inputs instead of based on operational
random environment

software reliability estimation parameters, such as failure rate (λ), hazard function (λ(t)), MTBF, MTTR and MTTF, were initially suggested for the reliability prediction of hardware products [12] and were simply fitted to the software process, overlooking the different nature of software versus hardware as well as of their reliability. Hence, it has been observed that the shortcomings of the existing software reliability models have been a major cause of the unreliability of modern software.
Major challenges of software reliability can be defined as the ability to achieve
the following [13]:
(i) Encompass heterogeneous runtime execution and operation mechanisms for
software components.
(ii) Provide abstractions that can identify, isolate and control runtime software
errors.
(iii) Ensure robustness of software systems.

Hence, it is time that the software engineering community learns from mistakes
of conventional reliability models and utilizes post-testing failure data to control
software operation at run-time [14, 15].
The above discussion implies that what is required is a more holistic approach,
integrating formal methods like Hoare’s logic and automata [16]. The key feature of
this approach shall be a shift of emphasis from probabilistic assumptions about the
reliable execution of software to actual runtime operation control [8, 12, 17, 18].
This paper proposes a general framework for a reliability model based on finite state
machines.
The remainder of this paper is organized as follows. Section 2 reviews some popular software reliability models. Section 3 establishes the brute force nature of all traditional SRGMs as the major cause of their inaccurate estimates. Section 4 discusses the viability of modelling reliability metrics using formal methods like Hoare's logic. Section 5 discusses the basic framework for a generic, intelligent, self-learning reliability estimation model. Section 6 presents the conclusion and future enhancements of this study.

2 Software Reliability Measurement

Chow [7] was the first to suggest a software reliability estimation model. Since then, hundreds of models have been applied for calculating the reliability of different software systems. However, with every model, the search for another model that overcomes the limitations of its ancestors intensifies. In fact, none of the traditional reliability models offers a silver bullet for accurate reliability measurement of software. Traditional reliability models merely try to represent mathematically the interaction between the software and its operational environment. As such, most of these models are brute force: they probabilistically try to estimate software reliability by fitting some mathematical model [7] onto the software reliability estimation process. The limitations of traditional models, along with the growing size and complexity of modern software, make accurate reliability prediction all the more hazardous.
As observed from Table 1, most software reliability models calculate reliability based on failure history data obtained from either the testing or the debugging phase of software development [12]. However, the difficulty lies in quantifying this measure. Most of the existing reliability models assess reliability growth based on the failure modes surfaced during testing and combine them with different assumptions to estimate reliability [1, 2, 10]. Some of the well-known traditional software reliability models can be classified according to their main features as detailed in Table 2.
Tables 1 and 2 establish that all the existing reliability models have limited
applicability during software testing, validation or operational phases of the soft-
ware life cycle. However, the basic foundation of most software reliability growth

Table 2 Feature classification of traditional reliability models


1. Time between failures/deterministic models. Examples: Jelinski–Moranda de-eutrophication model; Goel–Okumoto imperfect debugging model; Littlewood–Verrall Bayesian model. Assumptions: (i) independent times between failures; (ii) equal probability of exposure for each fault; (iii) embedded faults independent of each other; (iv) debugging with perfect fault removal after each occurrence.
2. Fault count models. Examples: Goel–Okumoto NHPP model; Musa's execution time model; Shooman exponential model. Assumptions: (i) independent testing intervals; (ii) homogeneous testing during intervals; (iii) numbers of faults detected per interval independent of each other.
3. Fault seeding models. Example: Mills hypergeometric model. Assumptions: (i) seeded faults are randomly distributed in the software, with the same probability of detection as native faults.
4. Input domain-based models. Examples: Nelson model; Ramamoorthy and Bastani model. Assumptions: (i) known input profile distribution; (ii) random testing used; (iii) the input domain can also be partitioned into equivalence classes.

models is often viewed doubtfully due to the unrealistic assumptions they are based
on, hence being the underlying cause of the reliability challenge of this century.

3 Brute Force Nature of Traditional Models

Despite decades of continual research and progress, software reliability remains a


term of aspiration. As the human society comes to rely heavily on powerful systems
that keep shrinking in size and growing in complexity, software systems have come
under harsh tectonic stresses due to the growing demand for reliable, distributed
systems. Despite more than half-a-century of advancement, the software industry
continues to strive for a mature engineering discipline to meet the reliability
demands of an information-age society [13]. The fundamental cause of this failure
is the fact that our basic approach to software construction is flawed. All traditional software reliability estimation models demand more information about the software than is actually known, such as the failure rate. As a result, they produce dubious assumptions about the software, which lead to misleading results. Having understood the
root cause of all our software quality problems, we now need to identify and debug
the mistakes of traditional reliability models in order to enable a befitting solution to
our software reliability crisis.

At this point, it would be logical to first understand the nature of computing.


A computer program in reality can be termed as a behaving machine (BM) or an
automaton that can detect modifications in its environment and effect alterations in
it just like our biological nervous system [13]. Such a universal behaving machine
(UBM) may be composed of a couple of elementary behaving entities called nodes
and an environment that can bring a change in the state of these nodes. The above
description maps to what we know mathematically as a finite state machine (FSM) [8, 17–19]. Hence, from the above discussion, we may conclude that, at its core, all software behaves as a finite state machine.
Unfortunately for the software world, traditional reliability models do not treat
the software as a tightly integrated collection of communicating entities that can
switch states upon environmental stimuli [20, 21]. The cause for this can be
attributed either to sheer ignorance or vested interest of ensuring oneself a lifetime
career as reliability engineer. All traditional models present a time-variant reliability
analysis of existing software by calculating reliability using brute force efforts for
generating assumptions and parameter values to be fitted onto a mathematical
model [1, 7]. However, the accuracy of the model assumptions and parameter
values is questionable.
From the long legacy of software failures, the software community has culti-
vated a belief that large, complex software is less reliable [9]. This belief is strikingly contradicted by the human brain, the most dependable and complex system in the world, whose reliability is orders of magnitude greater than that of any composite existing
software [13]. Hence, the human brain is a living proof of the fact that the reliability
of a behaving system does not have to be inversely proportional to its complexity.
The problem with the traditional reliability models lies in their way of quantifying
reliability. Most existing reliability models use one of two alternative ways to quantify reliability: failure intensity (the expected number of failures per unit time) or
mean time to failure (MTTF, expected value of the failure interval) [1]. However,
both the estimates involve estimation of software reliability from failure data
obtained during testing and validation phases of software which may significantly
differ from the actual operational environment where the system may be actually
implemented.
The primary purpose of any software reliability model is to anticipate failure
behaviour that will be observed from the program at run-time [22]. This runtime
behaviour, however, changes rapidly with the implementation environment under
which a program operates and cannot be completely tracked during program test-
ing. Developer alliance can further diminish secure coding practices.

4 Perceiving Reliability with Hoare’s Logic

Uncertainty will continue to disrupt reliability of all our software systems, unless
we transition to some metric rooted firmly in science and mathematics [4, 9, 23].
Engineers have always relied on mathematical analysis to predict how their designs
will behave in the real world [2]. Unfortunately, the mathematics that describes
physical hardware systems does not apply within the artificial binary universe of
software systems [2]. However, computer scientists have contrived ways to trans-
late software specifications and programs into equivalent mathematical expressions
which can be analysed using theoretical tools called formal methods [5, 15]. In the
current decade when reliable software is no longer just a negotiable option but a
firm requirement for many real-time, critical software systems, we need to become
smarter about how we design, analyse and test software systems.
Hoare's logic (Floyd–Hoare logic) is an axiomatic means of proving program correctness and has long been the formal basis for verifying the correctness of software [16]. However, it is rarely applied in actual practice. We propose the development of a reliability model that controls the traversed program path using an underlying automata-based software representation and applies Hoare's logic as the basis for ensuring correct runtime program execution [11]. Our approach performs dynamic checking of software transitions; instead of halting the program upon a failure-state transition, we allow the program to continue execution through an alternate path. Formal, Hoare's-logic-based software verification offers a strong possibility of establishing program correctness [16]; however, its practical use is limited by difficulties such as determining invariants for iterations and handling the side effects of invoking subroutines (methods/functions/procedures) in different programming languages.
Hoare's logic, based on predicate logic [16], defines a set of axioms that capture the semantics of programming languages, with an axiom for each program construct. These axioms are then applied to establish the correctness of programs written in different programming languages.
We propose applying the axioms of Hoare's logic to ensure reliable software operation [16]. To clarify this, we discuss the Hoare axiom for assignment. Let x := E be an assignment; then

$$\{Q(E|x)\}\; x := E\; \{Q\} \qquad (1)$$

Equation 1 states that x: = E is acceptable with respect to the particular


post-assertion Q and the derived pre-assertion Q(E|x).
Taking, Hoare’s axiom of assignment as basis, we propose an axiom of relia-
bility to be followed by each software transition.
P
x :¼ is acceptable
P with respect to the given post-state qi 2 Q and the previous
pre-state qj 2 Qð jdÞ. This implies that a given user input x is correct only if it
transforms the state of software under study to an acceptable post-state.
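To make the intended run-time check concrete, the following short Python sketch accepts a user input x only when the transition function delta maps the current pre-state to an acceptable post-state. It is an illustration only; the state names, the transition table and the accept_transition helper are hypothetical and are not taken from this paper.

# A minimal illustrative sketch (not the authors' implementation): a transition
# triggered by user input x is accepted only if the transition function delta
# maps the current pre-state q_j to an acceptable post-state q_i, mirroring the
# proposed axiom of reliability. States and the delta table are hypothetical.

ACCEPTABLE = {"q0", "q1", "q2"}            # acceptable states, a subset of Q
DELTA = {                                  # transition function delta: (state, input) -> state
    ("q0", "login"): "q1",
    ("q1", "query"): "q2",
    ("q1", "bad_input"): "q_fail",         # a known failure node
}

def accept_transition(pre_state: str, x: str) -> bool:
    """Accept input x only if it moves the software to an acceptable post-state."""
    post_state = DELTA.get((pre_state, x), "q_fail")
    return post_state in ACCEPTABLE

assert accept_transition("q0", "login")          # q0 -> q1: acceptable
assert not accept_transition("q1", "bad_input")  # q1 -> q_fail: rejected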
5 Proposed Model Framework

Systems crash when they fail to expect the unexpected [2]. The unreliability of
software stems from the difficulty of listing and identifying all the possible states
of a program [13]. To overcome this limitation, we propose the development of an
automata-based software reliability model [11, 24]. The proposed model builds on
the conventional idea of a Markov model [25, 26]: it divides the possible software
configurations into distinct states, each connected to the others through a transition
rate. In this framework, the software at run-time is controlled by an underlying
automata representation of the software. Each software state change or transition is
allowed by the compiler and/or interpreter only after validating that it takes the
software to an acceptable automata node. The basic flow of this automata-based
reliability model is depicted in Fig. 1.
As depicted in Fig. 1, the proposed framework uses the equivalent assembly code
of the software under study to extract the underlying opcodes. From these opcode
instructions, a next-state transition table is obtained, which serves as the basis for
designing the underlying finite state machine (FSM). Each node of the obtained
FSM is assigned a probability value, initially the same for every node. With each
software execution, the node probabilities are incremented or decremented
depending on the path of execution. The framework helps ensure reliable software
execution by monitoring each software transition: if it detects that the next
transition would take the software to a failure node, it may halt the system
immediately and work on finding an alternate route. A sketch of this monitoring
loop is given after Fig. 1.

Fig. 1 Basic flow diagram of proposed automata-based software reliability model
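As a rough illustration of this monitoring loop, the Python sketch below models FSM nodes that carry probabilities and a step function that refuses to enter a failure node, rerouting instead. The class and function names (State, execute_step, find_alternate_route), the probability update rule and the tiny example FSM are all assumptions made for illustration and are not part of the published framework.

# A minimal sketch, under the assumptions stated above, of the runtime monitor:
# each FSM node derived from the opcode-level transition table carries a
# probability, updated after every execution, and a transition into a failure
# node is blocked in favour of an alternate route.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class State:
    name: str
    prob: float = 0.5                      # initially every node gets the same probability
    failure: bool = False
    transitions: Dict[str, "State"] = field(default_factory=dict)   # opcode -> next State

def find_alternate_route(current: State) -> State:
    """Pick the most probable non-failure successor as the alternate route."""
    candidates = [s for s in current.transitions.values() if not s.failure]
    if not candidates:
        raise RuntimeError(f"no safe successor from state {current.name}")
    return max(candidates, key=lambda s: s.prob)

def execute_step(current: State, opcode: str) -> State:
    """Advance one transition, refusing to enter a failure node."""
    nxt = current.transitions.get(opcode)
    if nxt is None or nxt.failure:
        nxt = find_alternate_route(current)           # halt the risky transition, reroute
        current.prob = max(0.0, current.prob - 0.05)  # penalise the node that led here
    nxt.prob = min(1.0, nxt.prob + 0.05)              # reward nodes on the executed path
    return nxt

# Tiny illustrative FSM: q0 --ADD--> q1, q0 --DIV--> q_fail (e.g. divide by zero)
q_fail = State("q_fail", failure=True)
q1 = State("q1")
q0 = State("q0", transitions={"ADD": q1, "DIV": q_fail})
assert execute_step(q0, "DIV") is q1                  # rerouted away from the failure node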
6 Conclusion

The ultimate objective of this study is to convince the software community that
existing software reliability models are essentially brute force, since they do not
capture the actual nature of software execution [27]. With this work, we have also
outlined a novel software reliability model that uses the formal model of an
automaton to control run-time software operation. The key strength of this
framework is that it can be built into the software execution engine (compiler
and/or interpreter) to ensure uniform automata-based execution for all software
sharing similar execution semantics; achieving this in the near future, however,
remains an ambitious objective.
This research may be considered a generic model for ensuring reliable operation
of modern agent-based, cloud-based, distributed or service-oriented software, and
its generic nature means it could instantiate numerous further studies. Possible
extensions include training the FSM framework into a learning automaton that
identifies and registers faulty nodes, along with the incorrect user input types that
lead to them. The model can also be applied to improve the run-time performance
of software. This study is a step towards overcoming software reliability issues
and ensuring robust software systems.

References

1. Vyatkin, V., Hanisch, H.M., Pang, C., Yang, C.H.: Closed-loop modeling in future
automation system engineering and validation. IEEE Trans. Syst. Man Cybern. Part C: Appl.
Rev. 39(1), 17–28 (2009)
2. Yindun, S., Shiyi, X.: A new method for measurement and reduction of software complexity.
Tsinghua Sci. Technol. 12(S1), 212–216 (2007)
3. Hecht, H.: Measurement, estimation and prediction of software reliability. In: Software
Engineering Technology, vol. 2, pp. 209–224. Infotech International, Maidenhead, Berkshire,
England (1977)
4. Samimi, H., Darli, A.E., Millstein, T.: Falling back on executable specifications. In:
Proceedings of the 24th European Conference on Object-Oriented Programming
(ECOOP’10), pp. 552–576 (2010)
5. Wahls, T., Leavens, G.T., Baker, A.L.: Executing formal specifications with concurrent
constraint programming. Autom. Softw. Eng. 7(4), 315–343 (2000)
6. Leuschen, M.L., Walker, I.D., Cavallaro, J.R.: Robot reliability through fuzzy Markov
models. In: Proceedings of Reliability and Maintainability Symposium, pp. 209–214 (1998)
7. Beckert, B., Hoare, T., Hähnle, R., Smith, D.R., et al.: Intelligent systems and formal methods
in software engineering. IEEE Intell. Syst. 21(6), 71–81 (2006)
8. Musa, J.D., Okumoto, K.: A logarithmic Poisson execution time model for software reliability
measurement. In: Proceedings of the 7th International Conference on Software Engineering,
pp. 230–237. Orlando (1984)
9. Meyer, B., Woodcock, J.: Verified software: theories, tools, experiments. In: IFIP TC 2/WG
2.3 Conference VSTTE 2005, Springer-Verlag (2005)
10. Hsu, C.J., Huang, C. Y.: A study on the applicability of modified genetic algorithms for the
parameter estimation of software reliability modeling. In: Proceedings of the 34th
Annual IEEE Computer Software and Applications Conference, pp. 531–540 (2010)
11. Jelinski, Z., Moranda, P.B.: Software reliability research. In: Statistical Computer Performance
Evaluation, pp. 465–484. Academic Press, New York (1972)
12. Benett, A.A.: Theory of probability. Electr. Eng. 52, 752–757 (1933)
13. Hoare, C.A.R.: How did software get so reliable without proof? In: Proceedings of the Third
International Symposium of Formal Methods Europe on Industrial Benefits and Advances in
Formal Methods, pp. 1–17 (1996)
14. Zee, K., Kuncak, V., Rinard, M.C.: Full functional verification of linked data structures. In:
Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and
Implementation PLDI ’08, pp. 349–361. ACM, New York (2008)
15. Flanagan, C., Leino, K.R.M., Lillibridge, M., Nelson, G., Saxe, J.B., Stata, R.: Extended static
checking for Java. In: Proceedings of the ACM SIGPLAN 2002 Conference on Programming
Language Design and Implementation PLDI ’02, pp. 234–245. ACM, New York (2002)
16. Zee, K., Kuncak, V., Rinard, M.C.: An integrated proof language for imperative programs. In:
Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and
Implementation PLDI ’09, pp. 338–351. ACM, New York (2009)
17. Wason, R., Ahmed, P., Rafiq, M.Q.: Automata-based software reliability model: the key to
reliable software. Int. J. Softw. Eng. Its Appl. 7(6), 111–126 (2013)
18. Benes, N., Buhnova, B., Cerna, I., Oslejsek, R.: Reliability analysis in component-based
development via probabilistic model checking. In: Proceedings of the 15th ACM SIGSOFT
Symposium on Component-Based Software Engineering (CBSE’12), pp. 83–92 (2012)
19. Sharygina, N., Browne, J.C., Kurshan, R.P.: A formal object-oriented analysis for software
reliability: design for verification. In: Proceedings of the 4th International Conference on
Fundamental Approaches to Software Engineering (FASE’01), pp. 318–322. Springer-Verlag
(2001)
20. Afshar, H.P., Shojai, H., Navabi, Z.: A new method for checking FSM correctness
(Simulation Replacement). In: Proceedings of the International Symposium on
Telecommunications, pp. 219–222. Isfahan, Iran (2003)
21. Alur, R., Dill, D.L.: A theory of timed automata. Theoret. Comput. Sci. 126(2), 183–235
(1994)
22. Chow, T.S.: Testing software design modeled by finite state machines. IEEE Trans. Softw.
Eng. SE-4(3), 178–187 (1978)
23. Hoare, C.A.R.: An axiomatic basis for computer programming. Commun. ACM 12(10),
576–580 (1969)
24. Crochemore, M., Gabbay, D.M.: Reactive automata. Inf. Comput. 209(4), 692–704 (2011)
25. Goseva-Popstojanova, K., Kamavaram, S.: Assessing uncertainty in reliability of
component-based software systems. In: 14th International Symposium on Software
Reliability Engineering (ISSRE 2003), pp. 307–320 (2003)
26. Goel, A.L.: Software reliability models: assumptions, limitations and applicability. IEEE
Trans. Softw. Eng. 11(12), 1411–1414 (1983)
27. Ramamoorthy, C.V., Bastani, F.B.: Software reliability—status and perspectives. IEEE Trans.
Softw. Eng. SE-8(4), 354–371 (1982)
Object-Oriented Metrics for Defect Prediction

Satwinder Singh and Rozy Singla

Abstract Today, defect prediction is an important activity in the software industry
for meeting product deadlines. Defect prediction techniques help organizations use
their resources effectively, resulting in lower cost and time requirements. Various
metrics are used for defect prediction in within-company (WC) and cross-company
(CC) projects. In this paper, object-oriented metrics are used to build a defect
prediction model based on a feed-forward neural network (FFNN) for both
within-company and cross-company projects. The proposed model was tested over
four datasets for within-company defect prediction (WCDP) and cross-company
defect prediction (CCDP) and gives good results for both compared to previous
studies.

Keywords Object-oriented metrics · Defect prediction · Artificial neural network (ANN)

1 Introduction

Defect prediction is a technique for finding defects in software before its delivery.
It is an important part of software engineering for controlling the schedule and cost
of a software system. Various models can be used for defect prediction; they help
the engineer assess the reliability of the software against time and cost factors.
These models are developed using various metrics, and today object-oriented
metrics are the most popular. In this paper, therefore, object-oriented metrics are
used for defect prediction.

S. Singh · R. Singla (✉)
BBSBEC, Fatehgarh Sahib, Punjab, India
e-mail: rozysingla92@[Link]
S. Singh
e-mail: [Link]@[Link]


Software metrics are collected with the help of automated tools and are used by
defect prediction models to predict the defects in a system. Any fault prediction
model generally has one dependent variable and several independent variables.
The dependent variable indicates whether a software module is faulty or not, while
various metrics, such as process metrics and product metrics, can be used as
independent variables; cyclomatic complexity and lines of code, for example, are
method-level product metrics [1].
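As a rough sketch of how such dependent and independent variables feed a prediction model (here a feed-forward neural network, in line with the FFNN mentioned in the abstract), the following Python example trains scikit-learn's MLPClassifier on synthetic object-oriented metric values. The metric columns, the toy labelling rule and the dataset itself are purely illustrative and are not the datasets used in this study.

# A minimal sketch, not the authors' experimental setup: a feed-forward neural
# network trained on object-oriented metrics (independent variables) to predict
# a binary defect label (dependent variable). The metric columns and the
# synthetic labelling rule are illustrative only.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
# Columns: wmc, cbo, rfc, lcom, loc (typical object-oriented metrics)
X = rng.integers(1, 100, size=(200, 5)).astype(float)
y = (X[:, 0] + X[:, 1] > 110).astype(int)   # toy rule standing in for real defect labels

# For WCDP the training and test sets come from the same project;
# for CCDP the test set would instead come from another company's project.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=1)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))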
Cross-company defect prediction (CCDP) is a mechanism that builds defect
predictors w