M. N. Hoda
Naresh Chauhan
S. M. K. Quadri
Praveen Ranjan Srivastava Editors
Software
Engineering
Proceedings of CSI 2015
Advances in Intelligent Systems and Computing
Volume 731
Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: kacprzyk@[Link]
The series “Advances in Intelligent Systems and Computing” contains publications on theory,
applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all
disciplines such as engineering, natural sciences, computer and information science, ICT, economics,
business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the
areas of modern intelligent systems and computing such as: computational intelligence, soft computing
including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms,
social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and
society, cognitive science and systems, Perception and Vision, DNA and immune based systems,
self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric
computing, recommender systems, intelligent control, robotics and mechatronics including
human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent
data analysis, knowledge management, intelligent agents, intelligent decision making and support,
intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia.
The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings
of important conferences, symposia and congresses. They cover significant recent developments in the
field, both of a foundational and applicable character. An important characteristic feature of the series is
the short publication time and world-wide distribution. This permits a rapid and broad dissemination of
research results.
Advisory Board
Chairman
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
e-mail: nikhil@[Link]
Members
Rafael Bello Perez, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, Cuba
e-mail: rbellop@[Link]
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
e-mail: escorchado@[Link]
Hani Hagras, University of Essex, Colchester, UK
e-mail: hani@[Link]
László T. Kóczy, Széchenyi István University, Győr, Hungary
e-mail: koczy@[Link]
Vladik Kreinovich, University of Texas at El Paso, El Paso, USA
e-mail: vladik@[Link]
Chin-Teng Lin, National Chiao Tung University, Hsinchu, Taiwan
e-mail: ctlin@[Link]
Jie Lu, University of Technology, Sydney, Australia
e-mail: [Link]@[Link]
Patricia Melin, Tijuana Institute of Technology, Tijuana, Mexico
e-mail: epmelin@[Link]
Nadia Nedjah, State University of Rio de Janeiro, Rio de Janeiro, Brazil
e-mail: nadia@[Link]
Ngoc Thanh Nguyen, Wroclaw University of Technology, Wroclaw, Poland
e-mail: [Link]@[Link]
Jun Wang, The Chinese University of Hong Kong, Shatin, Hong Kong
e-mail: jwang@[Link]
Editors
Software Engineering
Proceedings of CSI 2015
Editors
M. N. Hoda
Bharati Vidyapeeth’s Institute of Computer Applications and Management (BVICAM)
New Delhi, Delhi, India

Naresh Chauhan
Department of Computer Engineering, YMCAUST
Faridabad, Haryana, India

S. M. K. Quadri
Department of Computer Science, University of Kashmir
Srinagar, Jammu and Kashmir, India

Praveen Ranjan Srivastava
Department of Information Technology and Systems, Indian Institute of Management Rohtak
Rohtak, Haryana, India
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
part of Springer Nature
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
The last decade has witnessed remarkable changes in the IT industry, virtually in all
domains. The 50th Annual Convention, CSI-2015, on the theme “Digital Life,” was
organized as a part of CSI@50, by CSI at Delhi, the national capital of the country,
during December 2–5, 2015. It was conceived with the objective of keeping the ICT
community abreast of emerging paradigms in the areas of computing technologies
and, more importantly, of examining their impact on society.
Information and Communication Technology (ICT) comprises three main
components: infrastructure, services, and products. These components include the
Internet, infrastructure-based/infrastructure-less wireless networks, mobile termi-
nals, and other communication mediums. ICT is gaining popularity due to rapid
growth in communication capabilities for real-time-based applications. New user
requirements and services entail mechanisms for enabling systems to intelligently
process speech- and language-based input from human users. CSI-2015 attracted
over 1500 papers from researchers and practitioners from academia, industry, and
government agencies, from all over the world, thereby making the job of the
Programme Committee extremely difficult. After a series of tough review exercises
by a team of over 700 experts, 565 papers were accepted for presentation in
CSI-2015 during the 3 days of the convention under ten parallel tracks. The
Programme Committee, in consultation with Springer, the world’s largest publisher
of scientific documents, decided to publish the proceedings of the presented papers,
after the convention, in ten topical volumes, under the AISC series of Springer, as
detailed hereunder:
1. Volume # 1: ICT Based Innovations
2. Volume # 2: Next Generation Networks
3. Volume # 3: Nature Inspired Computing
4. Volume # 4: Speech and Language Processing for Human-Machine Communications
5. Volume # 5: Sensors and Image Processing
6. Volume # 6: Big Data Analytics
We also take the opportunity to thank the entire team from Springer, who have
worked tirelessly and made the publication of the volume a reality. Last but not
least, we thank the team from Bharati Vidyapeeth’s Institute of Computer
Applications and Management (BVICAM), New Delhi, for their untiring support,
without which the compilation of this huge volume would not have been possible.
Chief Patron
Patrons
Advisory Committee
Adv. Pavan Duggal, Noted Cyber Law Advocate, Supreme Court of India
Prof. Bipin Mehta, President, CSI
Prof. Anirban Basu, Vice President-cum-President Elect, CSI
Shri Sanjay Mohapatra, Secretary, CSI
Prof. Yogesh Singh, Vice Chancellor, Delhi Technological University, Delhi
Prof. S. K. Gupta, Department of Computer Science and Engineering, IIT Delhi
Prof. P. B. Sharma, Founder Vice Chancellor, Delhi Technological University,
Delhi
Mr. Prakash Kumar, IAS, Chief Executive Officer, Goods and Services Tax
Network (GSTN)
Mr. R. S. Mani, Group Head, National Knowledge Networks (NKN), NIC,
Government of India, New Delhi
Editorial Board
A. K. Nayak, CSI
A. K. Saini, GGSIPU, New Delhi
R. K. Vyas, University of Delhi, Delhi
Shiv Kumar, CSI
Anukiran Jain, BVICAM, New Delhi
Parul Arora, BVICAM, New Delhi
Vishal Jain, BVICAM, New Delhi
Ritika Wason, BVICAM, New Delhi
Anupam Baliyan, BVICAM, New Delhi
Nitish Pathak, BVICAM, New Delhi
Shivendra Goel, BVICAM, New Delhi
Shalini Singh Jaspal, BVICAM, New Delhi
Vaishali Joshi, BVICAM, New Delhi
Abstract The growing volume of information on the World Wide Web has made relevant
information retrieval a difficult task. Customizing information according to the
user’s interest has become the need of the hour. Personalization aims to solve many of the
associated problems in the current Web. However, keeping an eye on a user’s behavior
manually is a difficult task. Moreover, user interests change with the passage of
time. So, it is necessary to create user profiles accurately and dynamically for better
personalization solutions. Further, the automation of various tasks in user profiling
is highly desirable considering the large number of users involved. This
work presents an agent-based framework for dynamic user profiling for personal-
ized Web experience. Our contribution in this work is the development of a novel
agent-based technique for maintaining long-term and short-term user interests along
with context identification. A novel agent-based approach for dynamic user pro-
filing for Web personalization has also been proposed. The proposed work is
expected to provide an automated solution for dynamic user profile creation.
1 Introduction
Recent years have seen an exponential increase in the size of the World Wide Web, which
has led to many bottlenecks in accessing the required and relevant material from
the pool of available information. Web personalization (WP) offers a solution to the
information overload problem in current Web by providing the users with a person-
alized experience considering their interest, behavior, and context [1]. Thus, there is
This phase consists of mainly gathering the information about the Web user. Two
methods for collecting information about the user are explicit and implicit.
Explicit Information Collection
This deals with asking for user input and feedback for gathering knowledge about
the user. This is an accurate method for identifying the user interest items. This
method requires explicit user intervention for specifying complete interests
and preferences. Another shortcoming of explicit user profiles is that after some time
the profile becomes outdated. The Yahoo Personalized Portal, for example, explicitly asks the user to
specify preferences and then customizes the home page layout as per the
interests of the user.
Implicit Information Collection
This deals with observing user activities to identify user information without
explicitly asking for it. Thus, it removes from the user the burden of entering the
information. Various implicit information collection techniques include browsing cache,
proxy server, browser agents, desktop agents, Web logs, and search logs. Most of
these techniques are client-side based, except Web and search logs which are based
on server side [6].
Some of the important techniques for representing user profiles are sets of weighted
keywords, semantic networks, weighted concepts, or association rules:
Keyword-Based Profile
User profiles are most commonly represented as sets of keywords. These can be
automatically extracted from Web documents or directly provided by the user.
A numeric weight is assigned to each keyword to show the degree of the user’s
interest in that topic [7, 8].
Semantic Network Profiles
Keyword-based profiles suffer from polysemy problem. This problem is solved by
using weighted semantic network in which each node represents a concept [9].
Concept Profiles
Concept-based profiles are similar to semantic network-based profile. However, in
concept-based profiles, the nodes represent abstract topics considered interesting to
the user, rather than specific words or sets of related words. Concept profiles are
also similar to keyword profiles in that often they are represented as vectors of
weighted features, but the features represent concepts rather than words or sets of
words. Mechanisms have been developed to express user’s interest in each topic,
e.g., assigning a numerical value, or weight, associated with each topic [10].
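To make these profile representations concrete, the following minimal Python sketch (class, method, and parameter names are illustrative assumptions, not taken from the works cited above) shows a user profile stored as weighted keywords, with a simple decay so that older interests fade as new documents are observed.

from collections import defaultdict

class KeywordProfile:
    """A user profile represented as weighted keywords (illustrative sketch only)."""

    def __init__(self, decay=0.9):
        self.weights = defaultdict(float)  # keyword -> interest weight
        self.decay = decay                 # lets older interests fade over time

    def observe(self, keywords):
        """Update the profile from keywords extracted from a visited document."""
        for kw in list(self.weights):
            self.weights[kw] *= self.decay   # age the existing interests
        for kw in keywords:
            self.weights[kw] += 1.0          # reinforce the observed keywords

    def top_interests(self, n=5):
        return sorted(self.weights.items(), key=lambda kv: -kv[1])[:n]

# Example: two browsing observations
profile = KeywordProfile()
profile.observe(["cricket", "score", "india"])
profile.observe(["cricket", "world", "cup"])
print(profile.top_interests(3))

A concept- or semantic-network-based profile would replace the raw keywords with identifiers of concepts or ontology nodes, but the weighting idea stays the same.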
3 Related Work
An extensive study for identifying scope, applicability, and state of the art in
applying semantics to WP is undertaken by Singh and Anand [12]. Semantically
enhanced and ontological profiles are found [13–15] more appropriate for repre-
senting the user profile. Ontological profile allows powerful representational
medium along with associated inference mechanism for user profile creation.
A brief review of work on semantically enhanced and ontological dynamic user
profiles is given below.
A generic ontology-based user modeling architecture (OntobUM) for knowledge
management had been proposed by Razmerita et al. [16]. Ontology-based user
modeling architecture has three different ontologies namely user, domain, and log
ontologies. User ontology structures the different characteristics of users and their
relationships. Domain ontology defines the domain and application-specific con-
cepts and their relationships. Log ontology defines the semantics of the user
interaction with the system.
An ontology-based user profile creation by unobtrusively monitoring user
behavior and relevance feedback has been proposed by Middleton et al. [14]. The
cold start problem is addressed by using external ontology to bootstrap the rec-
ommendation system. This work considers only the is-a type relationship in the ontology.
A multi-agent system using ontological user profiles has been proposed by
Hawalah and Fasli [18] for building a dynamic user profile that is capable of
learning and adapting to user behavior. But, it does not consider the contextual
information in recommendations. An agent-based Web search personalization
approach using dynamic user profile has been proposed by Li et al. [22]. The user
query is optimized using user’s profile preferences and the query-related synonyms
from the WordNet ontology. The search results obtained from a set of syntactic
search engines are combined to produce the final personalized results. However,
this approach is limited to Web search personalization. In another work, Woerndl
and Groh [23] have proposed a framework for multi-agent-based personalized
context-aware information retrieval from distributed sources considering the pri-
vacy and access control. An agent-based interface is proposed which optimizes the
user query using WordNet and user profile preferences and fetches the personalized
search results from various search engines like Google, Yahoo. Although it includes
the contextual information, it does not describe the detailed implementation of
various agents and also does not include temporal aspects in context.
A critical analysis of the literature reveals that many research attempts have been
made in using software agents and ontology for dynamic profile creation consid-
ering contextual information. Some of these studies are oriented toward search
results personalization, while generic UM approaches do not consider contextual
information. So, there is a need to consider some issues like adding the contextual
information for creating dynamic user profile efficiently. This work aims to propose
a framework for agent-oriented context-aware dynamic user profiling for person-
alized recommendations for the user. Our contribution in this work may be sum-
marized as follows:
1. A novel agent-based technique has been developed for maintaining LTI and STI
at client-side layer. User activities at his desktop are monitored by an agent and
then incorporated in STI.
2. A novel agent-based approach has been developed for user context identification
at client-side layer.
3. A novel agent-based approach has been developed for dynamic user profiling
for WP.
The next section describes the proposed framework in detail.
The core idea of user profiling is based on the assumption that certain characteristics
of an individual user affect the usefulness of recommendations. This framework
considers three main dimensions of user modeling as given below:
There are three agents working at server side namely user query analyzer agent
(UQAA), profile manager agent (PMA), and cluster agent (CA). Detailed working
and description of these agents are given in Sect. 5. These agents process the
explicit and implicit information about the user and apply text processing and
clustering techniques for generating and storing user profile. They also apply the
semantic technologies for user profiling and generating better recommendations.
The next section explains in detail the flow of information, purpose of each
agent, and their algorithms.
The proposed multi-agent framework comprises three agents at the client side and
three agents at the server side, as shown in Fig. 2.
– PMA: This agent is responsible for managing dynamic profile of the user by
collaborating with the UBTA of a particular user. It receives the LT and ST interests of a
user from UBTA and maintains them in its database. Further, it also analyzes the online
WSB of the user, recorded on server side by UQAA, and keeps user profile
updated.
– UQAA: This agent analyzes keywords of queries received from a particular user
or IP address. It also saves the Web pages scrolled by the user using search
engine or by requesting particular URL, along with time spent on each page. It
also records hyperlinks accessed from one page to the other.
– CA: This agent is responsible for clustering the users based on similar interest
areas so that similar recommendations may be provided to them. It works on the
user profile database maintained by the PMA.
The algorithms for various agents involved in the framework are given in
Figs. 3, 4, 5, 6, 7 and 8.
The flow diagram given in Fig. 9 illustrates the working of the proposed frame-
work. UCIA, UDA, and UBTA work simultaneously on client side to gather the
information about user’s contexts and interests periodically.
1. UCIA accesses the client machine and is responsible for performing two tasks:
1.1 It extracts the time, location, and searched keywords and stores them in a table.
1.2 It passes this information to PMA, which stores the contextual information in the user profile database.
2. UBTA accesses the client machine to collect the browsing history of the user. It performs the following tasks:
2.1 It extracts various parameters from the browser cache and stores them into the USER_INTEREST table.
2.2 It accesses the USER_INTEREST table and identifies the user’s degree of interest in a Web page/file by using the isInteresting() function. It also identifies the STI and LTI from the USER_INTEREST table and prepares a database WSB for storing short-term and long-term interests.
2.3 It sends the database WSB to PMA after receiving a request from PMA to access WSB.
3. UDA accesses the files in a specified folder on the client machine and extracts various parameters.
3.1 It stores this information in the USER_INTEREST table.
4. PMA sends a request to UBTA to access WSB.
4.1 PMA sends a request to UQAA to access the information from the USER_QUERY table.
4.2 PMA sends the USER_PROFILE database to CA after receiving a request from CA.
4.3 PMA creates and updates a database named USER_PROFILE for storing the user profile.
5. UQAA accesses the server-side machine to access the Web server logs. It parses the information contained in the log files.
5.1 It stores this information in a table named USER_QUERY.
5.2 It sends the USER_QUERY table to PMA.
6. CA sends a request to PMA for accessing the USER_PROFILE database.
6.1 After its request is authenticated, it is given access to the USER_PROFILE database.
6.2 Using the USER_PROFILE database, CA creates clusters of users on various parameters like time, location, and interests.
7. Recommendations are given from the server side to the client considering STI, LTI, and other contextual parameters.
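As a rough illustration of how UBTA might separate short-term from long-term interests from the information it records, the following Python sketch classifies interesting pages by recency. The table fields, thresholds, and the body of isInteresting() are hypothetical assumptions, since the paper gives these only at the level of figures.

from datetime import datetime, timedelta

# Hypothetical USER_INTEREST records gathered by UBTA from the browser cache.
user_interest = [
    {"url": "http://example.com/cricket", "visits": 6, "time_spent": 420,
     "last_visit": datetime(2015, 11, 30)},
    {"url": "http://example.com/holidays", "visits": 4, "time_spent": 300,
     "last_visit": datetime(2015, 6, 1)},
]

def is_interesting(rec, min_visits=2, min_time=60):
    """Assumed stand-in for the paper's isInteresting() function."""
    return rec["visits"] >= min_visits or rec["time_spent"] >= min_time

def build_wsb(records, now, short_term_window_days=30):
    """Split interesting pages into short-term (STI) and long-term (LTI) interests."""
    sti, lti = [], []
    for rec in records:
        if not is_interesting(rec):
            continue
        if now - rec["last_visit"] <= timedelta(days=short_term_window_days):
            sti.append(rec["url"])
        else:
            lti.append(rec["url"])
    return {"STI": sti, "LTI": lti}   # the WSB database sent to PMA on request

print(build_wsb(user_interest, now=datetime(2015, 12, 5)))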
References
1. Singh, A.: Wisdom web: the WWW generation next. Int J. Advancements Technol. 3(3),
123–126 (2012)
2. Jammalamadaka, K., Srinivas, I.V.: A survey on ontology based web personalization. Int.
J. Res. Eng. Technol. 2(10), 163–167 (2013)
3. Singh, A.: Agent based framework for semantic web content mining. Int. J. Advancements
Technol. 3(2), 108–113 (2012)
4. Singh, A., Mishra, R.: Exploring web usage mining with scope of agent technology. Int.
J. Eng. Sci. Technol. 4(10), 4283–4289 (2012)
5. Carmagnola, F., Cena, F., Gena, C.: User model interoperability: a survey. User Model.
User-Adap. Inter. 21(3), 285–331 (2011)
6. Kelly, D., Teevan, J.: Implicit feedback for inferring user preference: a bibliography.
ACM SIGIR Forum 37(2), 18–28 (2003)
7. Chen, L., Sycara, K.: WebMate: a personal agent for browsing and searching. In: Proceedings
of the 2nd International Conference on Autonomous Agents, Minneapolis/St. Paul, 9–13
May, pp. 132–139. ACM Press, New York (1998)
8. Moukas, A.: Amalthaea: information discovery and filtering using a multiagent evolving
ecosystem. Appl. Artif. Intel. 11(5), 437–457 (1997)
9. Minio, M., Tasso, C.: User modeling for information filtering on INTERNET services:
exploiting an extended version of the UMT shell. In: UM96 Workshop on User Modeling for
Information Filtering on the WWW; Kailua-Kona, Hawaii, 2–5 Jan 1996
10. Bloedorn, E., Mani, I., MacMillan, T.R.: Machine learning of user profiles: representational
issues. In: Proceedings of AAAI 96 from Portland, 4–8 Aug, Oregon, vol. 1, pp. 433–438
(1996)
11. Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A.: User profiles for personalized
information access. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) The Adaptive Web.
LNCS 4321, pp. 54–89. Springer, Berlin, Heidelberg (2007)
12. Singh, A., Anand, P.: Automatic domain ontology construction mechanism. In: Proceedings
of IEEE International Conference on Recent Advances in Intelligent Computing Systems
(RAICS) from 19–21 Dec, pp. 304–309. IEEE Press, Trivandrum, Kerala, India (2013)
13. Bhowmick, P.K., Sarkar, S., Basu, A.: Ontology based user modeling for personalized
information access. Int. J. Comput. Sci. Appl. 7(1), 1–22 (2010)
14. Middleton, S.E., Shadbolt, N.R., Roure, D.C.D.: Ontological user profiling in recommender
systems. ACM Trans. Inf. Syst. 22(1), 54–88 (2004)
15. Sosnovsky, S., Dicheva, D.: Ontological technologies for user modelling. Int. J. Metadata
Semant. Ontol. 5(1), 32–71 (2010)
16. Razmerita, L., Angehrn, A., Maedche, A.: Ontology based user modeling for knowledge
management systems. In: Brusilovsky, P., Corbett, A., Rosis, F.D. (eds.) User Modeling
2003. LNCS, vol. 2702, pp. 213–217. Springer, Berlin, Heidelberg (2003)
17. Trajkova, J., Gauch, S.: Improving ontology-based user profiles. In: Proceedings of RIAO
2004 on 26–28 Apr, pp. 380–389, France (2004)
18. Hawalah, A., Fasli, M.: A multi-agent system using ontological user profiles for dynamic user
modelling, In: Proceedings of International Conferences on Web Intelligence and Intelligent
Agent Technology, pp. 430–437, IEEE Press, Washington (2011)
19. Aghabozorgi, S.R., Wah, T.Y.: Dynamic modeling by usage data for personalization systems,
In: Proceedings of 13th International Conference on Information Visualization, pp. 450–455.
IEEE Press, Barcelona (2009)
20. Skillen, K.L., Chen, L., Nugent, C.D., Donnelly, M.P., Burns, W., Solheim, I.: Ontological
user profile modeling for context-aware application personalization. In: Bravo, J.,
López-de-Ipiña, D., Moya, F. (eds.) Ubiquitous Computing and Ambient Intelligence.
LNCS, vol. 7656, pp. 261–268. Springer, Berlin, Heidelberg (2012)
21. Vigneshwari, S., Aramudhan, M.: A novel approach for personalizing the web using user
profiling ontologies. In: IEEE Fourth International Conference on Advanced Computing
ICoAC, pp. 1–4. IEEE Press, Chennai (2012)
22. Li, L., Yang, Z., Wang, B., Kitsuregawa, M.: Dynamic adaptation strategies for long-term and
short-term user profile to personalize search. In: Dong, G., Lin, X., Wang, W., Yang, Y., Yu,
J.X. (eds.) Advances in Data and Web Management. LNCS, vol. 4505, pp. 228–240. Springer
Berlin, Heidelberg (2007)
23. Woerndl, W., Groh, G.: A proposal for an agent-based architecture for context-aware
personalization in the semantic web. In: Proceeding of IJCAI Workshop Multi-agent
information retrieval and recommender systems, Edinburg, UK-IJCAI (2005)
24. Hijikata, Y.: Estimating a user’s degree of interest in a page during web browsing. In:
Proceedings of IEEE SMC ‘99 Conference, vol. 4, pp. 105–110. IEEE Press, Tokyo (1999)
25. Moawad, I.F., Talha, H., Hosny, E., Hashim, M.: Agent-based web search personalization
approach using dynamic user profile. Egypt. Inf. J. 13, 191–198 (2012)
26. Sieg, A., Mobasher, B., Burke, R.: Learning ontology-based user profiles: a semantic
approach to personalized web search. IEEE Intel. Inf. Bull. 8(1), 7–18 (2007)
27. Singh, A., Alhadidi, B.: Knowledge oriented personalized search engine: a step towards
wisdom web. Int. J. Comput. Appl. 76(8), 1–9 (2013)
28. Singh, A., Sharma, A., Dey, N.: Semantics and agents oriented web personalization: state of
the art. Int. J. Serv. Sci. Manag. Eng. Technol. 6(2), 35–49 (2015)
29. Yahoo Personalized Portal. [Link]
Implementation of Equivalence
of Deterministic Finite-State Automaton
and Non-deterministic Finite-State
Automaton in Acceptance of Type 3
Languages Using Programming Code
1 Introduction
The word ‘formal’ means that all the rules for the language are explicitly stated in
terms of what strings of symbols can occur, and a formal language can be viewed as
the set of all strings permitted by the rules of formation. Finite automata are the
simplest models of automatic machines. Looking at the history of the design of
automatic machines, the first calculating device was the abacus, first used in
China. The abacus was used to perform arithmetic operations such as addition and
multiplication on positive integers. This was the first step toward the design
of calculating devices. Several enhancements followed, and presently we
have numerous calculating devices. Today, all automatic machines are designed
based on some kind of model. One great example of such a machine is the computer
system, and the finite automaton is its abstract model.
The finite automaton is mainly used for modeling reactive systems. A system
which changes its actions, outputs, and conditions/status in response to stimuli from
within or outside it is known as a reactive system. A reactive system is a
situation-driven/control-driven system that continuously has to react to external and/
or internal stimuli. In general, finite automata are useful models for explaining the
dynamic behavior of reactive systems.
A finite automaton consists of a finite memory called the input tape, a read-only
head, and a finite control. The input is written on the tape; the head reads one symbol at
a time, moves forward to the next state, and continues until the last
symbol of the input string. The movement of the head and the transition to the next state are
decided by the finite control. When the input has been read, the finite automaton decides the
validity or acceptability of the input by acceptance or rejection. It does not write its
output on the tape, which is a limitation of this model.
The input tape is divided into compartments, and each compartment contains one
symbol from the input alphabet. The symbol ‘$’ is used at the leftmost cell, and the
symbol ‘W’ is used at the rightmost cell to indicate the beginning and end of the
input tape. It is similar to a read-only file in a computer system, which has both a
beginning and an end.
2 Finite Acceptors
Inputs are transformed into outputs with the help of an information processing device.
With such a device, only two alphabets are associated: alphabet A is taken as input for com-
municating, and alphabet B is taken as output for receiving answers. Let us consider
a device that accepts an English sentence as input and outputs the corresponding
sentence in the French language. The whole input is processed symbol by symbol, and
if after reading the last symbol of the input string we reach a final state, then the output
is ‘yes’ and the string is accepted; otherwise it is rejected. By this procedure, A* is
divided into two subsets: the ‘yes’ subset is called the language accepted by the
machine, and the ‘no’ subset is called the language rejected by the machine. The
device that performs this task is called an acceptor. The mathematical model is
described below:
Finite automaton is of two types:
• DFA
• NDFA.
A deterministic finite acceptor A is specified by a five-tuple of information:
A = (Q, Σ, δ, s, F), where Σ is a finite alphabet, Q is a finite set of states, δ: Q × Σ → Q
is the transition function, s ∈ Q is the start state, and F ⊆ Q is the set of acceptance
states.
Consider an NFA with a few states along with some inputs. An equivalent DFA for the above
NFA can be created easily, as shown in the figure, per the following algorithm.
Let Mn = (Qn, Σ, δ, q0, F) be an NFA which recognizes a language L; then the equivalent
DFA Md = (Qd, Σ, δ′, q′0, F′), which satisfies the conditions below, also recognizes L, where
Qd = 2^Qn, i.e., the set of all subsets of Qn.
To obtain the equivalent DFA, we follow the procedure below.
Step 1: Initially, Qd = Ø.
Step 2: {q0} is the initial state of the DFA Md; put {q0} into Qd.
Step 3: For each new state q in Qd and each input symbol a, add δ′(q, a) = ∪_{p ∈ q} δ(p, a) to δ′,
        where δ on the right-hand side (R.H.S.) is that of the NFA Mn, and add the resulting
        state to Qd if it is not already present.
Step 4: Repeat Step 3 while there are new states to add to Qd; when no new state is found to
        add to Qd, the process terminates. All the states of Qd which contain an accepting state
        of Mn are accepting states of Md.
Important: Unreachable states are not included in Qd (unreachable means states which cannot
be reached from the initial state).
NFA:
Input: aaabbaab
Standard Output: Final State (i.e., q2, q3)
Output: q2, q3

DFA:
Input: aaabbaab
Standard Output: Final State (i.e., C, D, E)
Output: D
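The subset construction described above is straightforward to express in code. The following Python sketch builds the DFA transition table from an NFA and then runs an input string through it; the particular NFA transition table is an assumption for illustration, since the figure from the paper is not reproduced here.

from itertools import chain

# Hypothetical NFA over {a, b}: states q0..q2, accepting state q2.
nfa_delta = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
nfa_start, nfa_accept = "q0", {"q2"}
alphabet = ["a", "b"]

def subset_construction():
    """Build the DFA whose states are (frozen) sets of NFA states."""
    start = frozenset({nfa_start})
    dfa_delta, todo, seen = {}, [start], {start}
    while todo:
        state = todo.pop()
        for sym in alphabet:
            # delta'(state, sym) = union of NFA moves from every member state
            target = frozenset(chain.from_iterable(
                nfa_delta.get((p, sym), set()) for p in state))
            dfa_delta[(state, sym)] = target
            if target not in seen:          # only reachable states are ever added
                seen.add(target)
                todo.append(target)
    accepting = {s for s in seen if s & nfa_accept}
    return start, dfa_delta, accepting

def dfa_accepts(string):
    start, delta, accepting = subset_construction()
    state = start
    for sym in string:
        state = delta[(state, sym)]
    return state in accepting

print(dfa_accepts("aaabbaab"))   # True: the run ends in a DFA state containing q2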
Abstract Test case prioritization is a process to order the test cases in such a way
that maximum faults are detected as early as possible. It is very expensive to
execute the unordered test cases. In the present work, a multi-factored cost- and
code coverage-based test case prioritization technique is presented that prioritizes
the test cases based on the percentage coverage of considered factors and code
covered by the test cases. For validation and analysis, the proposed approach has
been applied on three object-oriented programs and efficiency of the prioritized
suite is analyzed by comparing the APFD of the prioritized and non-prioritized test
cases.
Keywords Object-oriented testing · Test case prioritization · Cost- and code
coverage-based testing · Multi-factors-based test case prioritization
1 Introduction
Testing is one of the important phases of the software development life cycle. The
testing of the software consumes a lot of time and efforts. Presently, the cost of
testing the software is increasing very rapidly. According to finding of the sixth
quality report [1], the share of the testing budget is expected to reach 29% by 2017.
Every software industry is mainly concerned with reducing the testing cost and
detecting bugs in the minimum time. For delivering quality soft-
ware, it is necessary to detect all possible errors and fix them. There are various
factors that indicate the quality of the software. These factors are functionality,
correctness, completeness, efficiency, portability, usability, reliability, integrity,
2 Related Work
The size of the test suites increases as the software evolves. Test cases are used to
test the existing program. If the software gets modified due to the addition of new
functionality, new test cases may be added to the existing test cases. There are many
constraints on the industry like resource, time, and cost. So, it is important to
prioritize the test cases in a way that probability of error detection is higher and
earlier. In this section, an overview of various researchers is discussed.
Shahid and Ibrahim [2] proposed a new code-based test case prioritization
technique. The presented approach prioritizes the test cases on the basis of the code
covered by the test case. The test cases that covered the maximum methods have the
highest probability to detect the errors earlier. Abdullah et al. [3] presented the
findings of a systematic review conducted to collect evidence on testability estimation
of object-oriented design. They concluded that testability is a factor that predicts
how much effort will be required for testing the software.
Huda et al. [4] proposed an effective quantification model of object-oriented
design. The proposed model uses the technique of multiple linear regressions
between the effective factors and metrics. Structural and functional information of
object-oriented software has been used to validate the assessment of the effec-
tiveness of the factors. The model has been proposed by establishing the correlation
3 Proposed Work
The presented approach prioritizes the test cases on the basis of the cost and the
coverage of the code covered by the test case. For accurately finding out the cost of
the test case, some factors are considered in Table 1. The proposed approach works
at two levels. At the first level, all the considered factors existing in the source code
are identified. After identifying and counting the factors, all independent paths of
the source code are determined; then, the value of the cost of each path is determined
on the basis of the coverage of the identified factors. Test cases are selected cor-
responding to independent paths. The cost of the test case can be calculated by
using Formula 1. The code coverage of test case is determined by counting lines of
code executed by the test case. At the second level, pairs of cost and code value of
each test case are created. In this way by using the value of the cost and code
coverage, the test cases are prioritized. The following scenario is used for the
prioritization of the test cases:
(1) Highest code coverage and cost will have highest priority.
(2) Second priority should be given to test case that has highest cost value.
(3) Third priority should be given to test case that has highest code coverage.
(4) Test cases with equal code coverage and cost should be ordered randomly.
The overview of the proposed approach is shown in Fig. 1.
The cost covered by a test case can be calculated by applying Formula 1 as
given below:

Cost(Ti) = SF(Ti) / TF        (1)

where SF(Ti) is the sum of the factors covered by the ith test case, and TF is the sum of
all the factors existing in the source code.
Table 1 shows the considered factors that are used to prioritize the test cases. The
factors are considered by the structural analysis of the program. The considered
factors may affect the testing process in terms of consumption of memory, exe-
cution time, and the possibility of introducing the error in program.
These factors are related to object-oriented programs and affect the prioritization of
test cases. There are a total of eight object-oriented software-related factors included
in this work. The algorithm of the proposed approach is given in Fig. 2.
For the experimental validation and evaluation, the proposed approach has been
applied to three programs. The programs are implemented in the C++ lan-
guage. For the experimental analysis, faults are intentionally introduced into the
programs. Program one has 170 lines of code, program two [10] has 361 lines
of code, and program three has 48 lines of code.
Table 2 shows the various factors covered by the test cases, Table 3 shows the
line of code covered by the test cases, Table 4 shows the calculated cost of all test
cases that are used to test the software, and Table 5 shows the various pairs of cost
and code covered by the test cases.
The prioritizing order of test cases as determined by the proposed approach is
TC6, TC2, TC1, TC4, TC8, TC5, TC3, TC7.
Let T be the list of non-prioritized test cases and T’ be the list of prioritized test cases.
While (T not empty)
Begin
Step 1. Identify and count all the considered factors that are used in the source code.
Step 2. Determine the factors and lines of code covered by each test case.
Step 3. Calculate the cost of each test case by applying the formula
        Cost(Ti) = SF(Ti) / TF
        where SF(Ti) is the sum of factors covered by the test case and TF is the sum of the factors in the source code.
End
Step 4. Determine the pair of the code coverage value and cost value of each test case:
        Pair = (Cost, Code Coverage)
Step 5. Prioritize the test cases according to the following scenarios:
(1) The test case with the highest cost and code coverage has the highest priority.
(2) Second priority should be given to the test case that has the highest cost value.
(3) Third priority should be given to the test case that has the highest code coverage.
(4) Test cases with equal code coverage and cost should be prioritized in random order.
Create T’, the list of prioritized test cases.
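A minimal Python sketch of this prioritization step is shown below; the factor counts and line-coverage numbers are made-up illustrative values, not the measurements of Tables 2–5, and ties are broken by input order rather than randomly.

# Hypothetical measurements per test case: factors covered and lines of code covered.
total_factors = 20
test_cases = {
    "TC1": {"factors": 12, "loc": 90},
    "TC2": {"factors": 15, "loc": 110},
    "TC3": {"factors": 6,  "loc": 40},
    "TC4": {"factors": 12, "loc": 85},
    "TC5": {"factors": 7,  "loc": 60},
    "TC6": {"factors": 18, "loc": 150},
}

def prioritize(cases, total_factors):
    """Order test cases by (cost, code coverage), highest first."""
    pairs = {}
    for name, m in cases.items():
        cost = m["factors"] / total_factors      # Formula 1: Cost(Ti) = SF(Ti) / TF
        pairs[name] = (cost, m["loc"])           # Step 4: (cost, code coverage) pair
    # Step 5: highest cost first, then highest code coverage.
    return sorted(pairs, key=lambda t: pairs[t], reverse=True)

print(prioritize(test_cases, total_factors))
# ['TC6', 'TC2', 'TC1', 'TC4', 'TC5', 'TC3']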
Table 7 shows the faults detected by the test cases when they are executed in prioritized
order.
For simplicity, the fault detection data are shown for only one program.
Figure 3 shows the comparison of APFD graphs for program one, Fig. 4 shows the
comparison of APFD graphs for program two, and Fig. 5 shows the comparison of
APFD graphs for program three. The test cases are executed in prioritized order
obtained after applying the proposed approach and non-prioritized approach [11, 12].
Fig. 3 Comparison of APFD detected by random and prioritized test cases for program one
Fig. 4 Comparison of APFD detected by random and prioritized test cases for program two
The effectiveness of the proposed approach is measured through the APFD metric, and its
value is shown in Table 8 [13]. The APFD value of the prioritized order of test cases
obtained by applying the proposed approach is better than that of the random ordering of test
cases. Therefore, it can be observed from Table 8 that the prioritized test cases have a higher
fault-exposing rate than the non-prioritized test cases.
Fig. 5 Comparison of APFD detected by random and prioritized test cases for program three
Table 8 Compared result of test cases for prioritized and non-prioritized order
Case study Non-prioritized test cases (APFD) (%) Prioritized test cases (APFD) (%)
Program one 57 70
Program two 55 72
Program three 37 62
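For reference, the APFD metric used in Table 8 is commonly computed with the standard formula below; the fault-detection positions in the example are illustrative, not the data of Table 7.

def apfd(first_detection_positions, num_tests):
    """APFD = 1 - (TF1 + TF2 + ... + TFm) / (n * m) + 1 / (2n),
    where TFi is the 1-based position of the first test case revealing fault i."""
    m = len(first_detection_positions)   # number of faults
    n = num_tests                        # number of test cases
    return 1 - sum(first_detection_positions) / (n * m) + 1 / (2 * n)

# Illustrative example: 8 test cases, 5 faults first detected at these positions.
print(round(apfd([1, 1, 2, 3, 5], num_tests=8), 2))   # 0.76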
5 Conclusion
Abstract The primary goal of every search engine is to provide sorted information
according to the user’s need. To achieve this goal, it employs ranking techniques to sort
the Web pages based on their importance and relevance to the user query. Most of the
ranking techniques till now are based either upon Web content mining or link
structure mining or both. However, they do not consider the user browsing patterns
and interest while sorting the search results. As a result, the ranked list fails to
cater to the user’s information need efficiently. In this paper, a novel page ranking
mechanism based on user browsing patterns and link visits is being proposed. The
simulated results show that the proposed ranking mechanism performs better than
the conventional PageRank mechanism in terms of providing satisfactory results to
the user.
1 Introduction
PageRank (PR) [2, 3], Weighted PageRank (WPR) [4], Hypertext Induced Topic Search
(HITS) [5], and PageRank based on link visits (PRLV) [6] are some of the popular
ranking algorithms. All these algorithms are based on Web mining concept.
Web mining is a branch of data mining techniques that discover the useful
patterns from Web documents. Web content mining (WCM), Web structure mining
(WSM), and Web usage mining (WUM) are the three main categories of Web mining
[7, 8]. PR, WPR, and HITS are purely based on Web structure mining, whereas
PRLV is based on Web structure as well as Web usage mining. Table 1 gives the
overview of these Web mining techniques. The proposed mechanism captures the
user interest in an efficient way by applying the concept of Web usages and Web
structure mining.
The rest of the paper is organized as follows: In next section, Web structure
mining and popular algorithms of this area have been discussed. Section 3
describes the proposed ranking mechanism in detail with example illustrations.
Section 4 depicts the complete working of proposed system. Concluding remarks
are given in Sect. 5.
2 Related Work
This section describes an overview of Web structure mining and some popular
PageRank algorithms with examples illustration. Web structure mining means
generating the link summary of Web pages and Web server in the form of Web
graph by using the concept of hyperlink topology between different documents as
well as within the same document [8]. A Web graph is a directed labeled graph where
nodes represent the Web pages and edges represent the hyperlinks between the
pages. There are many algorithms based on Web structure mining. Some of them
which form the basis of proposed work are discussed in following sections.
2.1 PageRank
The PageRank algorithm was developed by Google and named after Larry Page [3].
The link structure of a Web page is used to find out the importance of the Web
page. The importance of a page P can be obtained by evaluating the importance of
pages from which the page P can be accessed. Links from these pages are called as
inbound links. According to this algorithm, if the inbound links of a page are
important, then its outbound links also become important. The PageRank of a page
P is equally divided among its outbound links which further propagated to pages
corresponding to these outbound links. The PageRank of a page X can be calculated
by Eq. (1) as follows:

PR(X) = (1 - d) + d [ PR(P1)/O(P1) + PR(P2)/O(P2) + ... + PR(Pn)/O(Pn) ]        (1)
where:
• P1, P2, … Pn represent the inbound links of page X
• O(P1), O(P2) … O(Pn) are no. of outbound links of page P1, P2 … Pn,
respectively
• d is the damping factor, which is a measure of the probability of the user following
a direct link. Its value is usually set to 0.85.
To explain the working of the PR method, let us take a small Web structure as
shown in Fig. 1a consisting of four pages, namely P1, P2, P3, and P4, where page P1
is an inbound link of pages P2 and P4, page P2 is an inbound link of pages P4 and P3, P3 is
an inbound link of P1, P2, and P4, and P4 is an inbound link of P1. According to Eq. (1),
PageRank of page P1, P2, P3, and P4 can be computed as follows:
Fig. 1 a Sample web structure, b sample web structure with link visits and link weight
Initially considering the PageRank of each page equal to 1 and taking the value
of d = 0.85, the PageRank of each page is calculated iteratively until the values become
stable, as shown in Table 2.
From Table 2, it may be noted that PR(P1) > PR(P4) > PR(P2) > PR(P3).
These PR values are extracted by the crawler while downloading a page from the Web
server, and these values remain constant as long as the Web link structure does not
change. In order to obtain the overall page score of a page, the query processor adds
the precomputed PageRank (PR) value associated with the page with text matching
score of page with the user query before presenting the results to the user.
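A small Python sketch of this iterative computation for the four-page example (link structure as described above; an illustration rather than a reproduction of Table 2) is given below.

# Outgoing links of the sample Web structure in Fig. 1a:
# P1 -> P2, P4;  P2 -> P3, P4;  P3 -> P1, P2, P4;  P4 -> P1
out_links = {
    "P1": ["P2", "P4"],
    "P2": ["P3", "P4"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P1"],
}

def pagerank(out_links, d=0.85, iterations=50):
    pr = {p: 1.0 for p in out_links}        # initial PageRank of 1 for every page
    for _ in range(iterations):
        new_pr = {}
        for page in out_links:
            # Eq. (1): sum the contributions of all inbound links of `page`
            inbound = sum(pr[q] / len(out_links[q])
                          for q in out_links if page in out_links[q])
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr

for page, score in sorted(pagerank(out_links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))            # P1 and P4 come out on top, as in Table 2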
Duhan et al. [6] proposed the extension of PageRank method in which the
PageRank of a page is computed on the basis of no. of visits to Web page. They
pointed out that traditional PR method evenly distributes the PageRank of page
among its outgoing links, whereas it may not be always the case that all the
outgoing links of a page hold equal importance. So, they proposed a method which
assigns more rank to an outgoing link that is more visited by the user. For this
purpose, a client side agent is used to send the page visit information to server side
agent. A database of log files is maintained on the server side which store the URLs
of the visited pages, its hyperlinks, and IP addresses of users visiting these
hyperlinks. The visit weight of a hyperlink is calculated by counting the distinct IP
addresses clicking the corresponding page. The PageRank of page ‘X’ based upon
visit of link is computed by the Eq. (2)
PR(X) = (1 - d) + d * Σ_{Pi ∈ I(X)} [ PR(Pi) × L(Pi, X) / TL(Pi, O(Pi)) ]        (2)
Where:
• PR(X) is PageRank of page X calculated by Eq. (1).
• I(X) is set of incoming links of page X.
• L(Pi, X) is no. of link visits from Pi to X.
• TL(Pi, O(Pi)) is total no. of user visits on all the outgoing links of page Pi.
Let us consider the same hyperlinked structure with the no. of
visits and visit weights (written in brackets) as shown in Fig. 1b. By taking the value of
d = 0.85, the PageRank based on visit of link can easily be obtained by using
Eq. (2) and the iteration method, as shown in Table 3.
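Continuing the sketch above, PRLV changes only how a page's rank is split among its outgoing links: instead of an even split, each link receives a share proportional to its visit count. The visit counts below are illustrative assumptions, not the values of Fig. 1b.

# Hypothetical visit counts on each hyperlink (source page -> target page).
link_visits = {
    ("P1", "P2"): 4, ("P1", "P4"): 6,
    ("P2", "P3"): 2, ("P2", "P4"): 8,
    ("P3", "P1"): 5, ("P3", "P2"): 3, ("P3", "P4"): 2,
    ("P4", "P1"): 7,
}

def pagerank_vol(link_visits, d=0.85, iterations=50):
    pages = {p for edge in link_visits for p in edge}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            total = 0.0
            for (src, dst), visits in link_visits.items():
                if dst != page:
                    continue
                # TL(src, O(src)): total visits over all outgoing links of src
                tl = sum(v for (s, _), v in link_visits.items() if s == src)
                total += pr[src] * visits / tl     # contribution of src per Eq. (2)
            new_pr[page] = (1 - d) + d * total
        pr = new_pr
    return pr

print({p: round(v, 3) for p, v in pagerank_vol(link_visits).items()})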
By comparing the results of PR with PRLV, it is found that rank order of pages
has been changed. By using PRVOL, PRVOL(P1) > PRVOL(P2) > PRVOL
(P4) > PRVOL(P3). A critical look at the available literature indicates that although
dividing the PageRank of a page among its outgoing links based on link visit solved
the problem of finding the importance of a page within the Web, it has been
observed that the user who visits on a particular page may not necessarily find the
page useful. Therefore, the time spent and action performed such as print, save may
be considered as vital parameter while determining the relevance of a page with
respect to the user. The proposed ranking mechanism discussed in the next section
overcomes the above shortcomings by incorporating the user page access infor-
mation to the link structure information of a Web page.
Fig. 3 DB builder (block diagram: the crawler fetches pages from the WWW using the URL frontier, the page extractor parses them, and the indexer stores the page information in the database)
An efficient search model based on user page access information is being proposed
here as shown in Fig. 2. It consists of four main components: search engine
interface, PPF calculator, query processor, and DB builder. The detailed description
of each component is given in subsequent sections.
The user enters the query at search engine interface. It passes these query words to
query processing module and sends a signal ‘something to record’ to PPF calcu-
lator. At the end of search operation, it receives the sorted list of documents from
query processor to present back to the user. When the user clicks a page in the result
list, it sends the hit(click) information to its corresponding Web server which in turn
stores this information in server log files.
After receiving the signal ‘something to record’ from the search engine interface, the
PPF calculator observes and records certain facts about the activity of the user on a particular
page. For this, it assigns a page probability factor (PPF) to each page clicked by the user.
The page probability factor PPF(Pi) can be computed as per the equation given in
(3)
where:
• CLICKwt(Pi) denotes the importance of page Pi with respect to all the pages
clicked by user ui for query qi in the current search session.
• TIMESCORE(Pi) denotes the time spent by user ‘u’ on the page Pi
• ACTIONwt(Pi) denotes the action performed on the page Pi
The computation of each of these factor used in Eq. (3) is given below.
Calculation of click weight on page Pi: When a user clicks a page, the click
weight of page P increases as if the user votes for this page [9]. For any more
clicking by the same user, the click weight of page will not be affected. To find the
importance of page P with respect to query q, the click weight is defined by Eq. (4)
given below.
CLICKwt(Pi) = C / |click(q, *, u)|        (4)
where
• click(q, *, u) denotes the total no. of clicks made by user u on all the pages for
query q in current session.
• C is no. of vote for a page. It is set to 1 for clicked pages and 0 otherwise.
Let us consider that a user clicked three pages, P1, P2, and P5, out of the result pages
P1, P2 … P10. The click weight of each clicked page can be computed by Eq. (4) as shown
below. The CLICKwt of all other pages is set to zero as they did not get any click from the user.
CLICKwt(P1) = 1 / (1 + 1 + 1) = 0.33        (4a)
CLICKwt(P2) = 1 / (1 + 1 + 1) = 0.33        (4b)
CLICKwt(P5) = 1 / (1 + 1 + 1) = 0.33        (4c)
Calculation of time weight on page Pi: The time weight of a page is the ratio of the
time spent on that page to the highest time spent on any of the clicked pages, as given by Eq. (5).

TIMEwt(Pi) = Time spent(Pi) / Highest Time Spent(P)        (5)

TIMEwt(P1) = 3 / 9 = 0.33        (5a)
TIMEwt(P2) = 2 / 9 = 0.22        (5b)
TIMEwt(P5) = 9 / 9 = 1        (5c)
Calculation of action weight on page Pi: Action that user may carry on any
Web document is listed in Table 4 along with the weights. The weight is assigned
according to the relevancy of the action where relevancy is determined based on
user feedback in response of a survey. It is observed in the survey that if someone is
printing the page means, it has higher utility at present, saving is less scored as the
user will require it later on, bookmark come next, and sending comes at last in
priority list as the page is used by some other user. If a user performs more than one
action, then only the higher weight value is considered. For example, if the user per-
forms printing as well as saving, then only the printing weight is assigned to the
page.
Let us consider the user takes no action on page P1 and P2 but performs the save
action on page P5. So, ACTIONwt(P5) = 0.3, ACTIONwt(P1) = 0, and
ACTIONwt(P2) = 0. This PPF information related to each clicked page is updated
in the search engine database. Initially, the PPF of all pages is set to zero. It is computed
and updated every time a user selects the page in the result list.
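A minimal sketch of the PPF calculator for the worked example above follows. Since Eq. (3) itself is not reproduced in this excerpt, the way the three factors are combined into PPF (here a plain sum) is an assumption, and only the 'save' weight of 0.3 is taken from the text; the other action weights are placeholders for Table 4.

# Assumed action weights; only "save" (0.3) is stated in the text above.
ACTION_WEIGHTS = {"print": 0.4, "save": 0.3, "bookmark": 0.2, "send": 0.1, None: 0.0}

def ppf(clicked_pages, page):
    """Compute CLICKwt, TIMEwt, ACTIONwt for `page` and combine them (assumed sum)."""
    if page not in clicked_pages:
        return 0.0                                     # PPF stays zero for unclicked pages
    click_wt = 1.0 / len(clicked_pages)                # Eq. (4): one vote per clicked page
    max_time = max(p["time"] for p in clicked_pages.values())
    time_wt = clicked_pages[page]["time"] / max_time   # Eq. (5)
    action_wt = ACTION_WEIGHTS[clicked_pages[page]["action"]]
    return click_wt + time_wt + action_wt              # assumed combination for Eq. (3)

# Worked example: the user clicks P1, P2, P5, spends 3, 2, 9 minutes, and saves P5.
clicked = {
    "P1": {"time": 3, "action": None},
    "P2": {"time": 2, "action": None},
    "P5": {"time": 9, "action": "save"},
}
for page in ("P1", "P2", "P5", "P9"):
    print(page, round(ppf(clicked, page), 2))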
3.3 DB Builder
This component is responsible for extracting the pages from www and storing their
information into search engine database. The main subcomponents are as follows:
page extractor, crawler, and indexer as shown in Fig. 3. The working of each
component is discussed in following subsections.
Crawler: It extracts a URL from the URL frontier and downloads the pages at
specified intervals [11] from the different Web servers. The URL frontier is a queue that
contains URLs of the pages that need to be downloaded. The structure of the URL
frontier is shown in Fig. 6. The downloaded pages are passed to page extractor.
Table 5 gives description of different fields of URL frontier.
Page extractor: It parses the fetched page and divides it into no. of terms. All
the nonfunctional terms such as at, on, the are removed. It stores the term infor-
mation related to parsed page in Term_info table. It also extracts the link infor-
mation of page and stores it into Link_info. The structure of Term_info and
Link_info is shown in Fig. 4. The different fields of Term_info and Link_info are
described in Table 6.
Indexer: It first calculates the link weight of a page using Eq. (2) by taking the
link information from Link_info then indexes every term of parsed page in search
engine database. The structure of database is shown in Fig. 6. The page probability
URL FRONTIER: URL | priority | depth | Server name

PAGE REPOSITORY

Link_info: URL | Doc_ID | depth | In_lnk | Hit_count | Out_lnk | Page address

Term_info: term | Doc_ID | frequency | Term position

DATABASE SCHEMA: Term | Doc_ID | frequency | Term position | Link weight | PPF
factor, PPF, of each new page is initially set to zero and updated by the PPF calculator
whenever the page is clicked by the user in the result list. Let us consider the sample data
shown in Table 7 for understanding the organization of information in search
engine database. The information is stored as vocabulary (terms list) and postings as
shown in Fig. 5.
As shown in Table 7, the term ‘Prime’ occurs at four different places: 4, 12, 23,
and 56 in document D3 in row1. The link_weight of D3 is 7, and PPF score is 4.5.
Likewise, the information about the other terms is also stored.
It executes the user query on search engine database and fetches the pages whose
text matches with the query terms. It calculates the overall_page_score of each
selected page by adding the precomputed Link_weight to PPF and returns the
sorted list of pages to search engine interface. The algorithm of query processor is
shown in Fig. 6.
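The sketch below ties together the indexing and query-processing steps just described: an inverted index maps each term to postings carrying the document id, frequency, positions, precomputed link weight, and PPF, and the overall page score is the sum of link weight and PPF. The field names follow the schema above; the sample values mirror the 'Prime' example rather than real data.

# Inverted index: term -> postings, following the DATABASE SCHEMA fields.
index = {
    "prime": [
        {"doc_id": "D3", "frequency": 4, "positions": [4, 12, 23, 56],
         "link_weight": 7.0, "ppf": 4.5},
        {"doc_id": "D7", "frequency": 1, "positions": [15],
         "link_weight": 2.0, "ppf": 0.0},
    ],
    "minister": [
        {"doc_id": "D3", "frequency": 2, "positions": [5, 24],
         "link_weight": 7.0, "ppf": 4.5},
    ],
}

def query_processor(query):
    """Fetch matching pages and rank them by overall_page_score = link_weight + PPF."""
    scores = {}
    for term in query.lower().split():
        for posting in index.get(term, []):
            # the overall page score adds the precomputed link weight to the PPF
            scores[posting["doc_id"]] = posting["link_weight"] + posting["ppf"]
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(query_processor("Prime Minister"))   # [('D3', 11.5), ('D7', 2.0)]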
4 Example
5 Conclusion
In this paper, an efficient page ranking mechanism based on user browsing pattern
has been presented. These patterns are used to deduce the importance of a page with
respect to other candidate pages for a query. The technique is automatic in nature,
and no overhead is involved on the part of the user. The technique does not create a profile for each
user; instead, collective interests are used to find the relevancy of a page, so an
optimized approach is adopted. The technique proves to provide more relevant
results as compared to a regular search engine.
References
1. Sethi S., Dixit A.: Design of personalised search system based on user interest and query
structuring. In: Proceedings of the 9th INDIACom; INDIACom 2nd International Conference
on Computing for Sustainable Global Development, 11–13 March 2015
2. Brin, S., Page, L.: The anatomy of a large scale hypertextual web search engine. Comput.
Netw. ISDN Syst. 30(1–7), 107–117 (1998)
3. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order
to the web. In: Technical report, Stanford Digital Libraries SIDL-WP-1999-0120 (1999)
4. Xing W., Ghorbani A.: Weighted pagerank algorithm. In: Proceedings of the Second Annual
Conference on Communication Networks and Services Research (CNSR’04)0-7695-2096-0/
04 $20.00 © 2004
5. Pal, S., Talwar, V., Mitra, P.: Web mining in soft computing framework: relevance, state of
the art and future directions. IEEE Trans. Neural Netw. 13(5), 1163–1177 (2002)
6. Gyanendra, K., Duhan, N., Sharma A.K.: Page ranking based on number of visits of web
pages. In: International Conference on Computer and Communication Technology (ICCCT)-
2011, 978-1-4577-1385-9
7. Markov, Z., Larose, D.T.: Mining the web: uncovering patterns in web content, structure, and
usage data. Wiley, New York (2007)
8. Sethi, S., Dixit, A.: A comparative study of link based page ranking algorithms. Int. J. Adv.
Technol. Eng. Sci. (IJATES) 3(01) (2015). ISSN: 2348-7550
9. Tyagi, N. et al. Weighted page rank algorithm based on number of visits of web page. Int.
J. Soft Comput. Eng. (IJSCE) 2(3) (2012). ISSN: 2231-2307
10. Mittal, A., Sethi, S.: A novel approach to page ranking mechanism based on user interest. Int.
J. Adv. Technol. Eng. Sci. (IJATES) 3(01) (2015). ISSN: 2348-7550
11. Dixit, A., Sharma, A.: A mathematical model for crawler revisit frequency. In: IEEE 2nd
International Advanced Computing Conference (IACC), pp. 316–319 (2010)
Indexing of Semantic Web for Efficient
Question Answering System
Abstract A search engine is a program that performs a search in documents to find
a response to the user’s query in the form of keywords. It then provides a
list of Web pages comprising those keywords. Search engines cannot differentiate
between valuable documents and spam. Some search engine crawlers retrieve
only the document title, not the entire text of the document. The major objective of a
Question Answering system is to develop techniques that not only retrieve docu-
ments, but also provide exact answers to natural language questions. Many
Question Answering systems developed so far are able to carry out the processing needed
for attaining higher accuracy levels. However, there has been no major progress on
techniques for quickly finding exact answers. Existing Question Answering systems
are unable to handle a variety of questions and reasoning-based questions. In the
absence of data sources, a QA system fails to answer the query. This paper inves-
tigates a novel technique for indexing the semantic Web for an efficient Question
Answering system. The proposed techniques include a manually constructed question
classifier based on <Subject, Predicate, Object>, retrieval of documents specifically
for Question Answering, semantic-type answer extraction, and answer extraction via a
manually constructed index for every category of Question.
1 Introduction
2 Related Work
The evolution of QA systems started with closed domains because of their lower
complexity. Previously used QA systems were BASEBALL and LUNAR. BASEBALL [1]
QA gives information about the US baseball league for one year. LUNAR QA gives
information about the geological analysis of rocks returned by the Apollo moon
missions. Both QA systems were very powerful in their own domains. LUNAR was
demonstrated at a lunar science convention, and it was able to answer approximately 90% of the
questions in its domain posed by people who were not trained on the system. The
common feature of all these systems is that they had a knowledge database or
knowledge system that was implemented by experts of the chosen domain.
SHRDLU [2] was a Question Answering system developed by
Terry Winograd. It was basically developed to allow the user to ask the robot
questions. Its implementation was done using the rules of physics encoded in a
computer program. The Question Answering systems developed to interface
with these expert systems produced more repeatable and valid responses to ques-
tions within an area of knowledge. The system answered questions related to the
Unix OS. It had a knowledge base of its domain, and its target is to phrase the
answer to accommodate various types of users. LILOG [2] is a closed-domain
Question Answering system and is basically a text understanding system. This
system gives tourism information in a German city. Other system also helps the
system in linguistic and computational processing.
QUALM (a story understanding system) [3] works by asking questions about simple, paragraph-length stories. The QUALM [3] system includes a question analysis module that links each question with a question type; this question type guides all further processing and retrieval of information (see Table 1).
Kupiec (a simple WH question model) [3] performs a similar function but solves simpler WH question models to build a QA system. This QA system used the interrogative words to indicate the kind of information required. Table 2 lists the Question categories and Answer types.
3 Proposed Architecture
a. Identification of the query: This component identifies the type of the query entered by the user.
b. Tokenization of the query: In the first step, the query is tokenized into a bag of words or a string of words.
c. Stop word removal: Removing stop words is necessary to improve the efficiency of the query. Adjusting the stop word list to the given task can significantly improve the results of the query.
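As an illustrative sketch of this query pre-processing stage (tokenization followed by stop-word removal), written in Python with a made-up stop-word list rather than the one used by the system:

# Minimal sketch of query pre-processing: tokenization followed by stop-word
# removal. The stop-word list here is illustrative, not the system's actual list.
STOP_WORDS = {"is", "the", "of", "a", "an", "in", "to", "for"}

def preprocess_query(query):
    # Tokenize the query into a bag of words, stripping simple punctuation.
    tokens = [w.strip("?.,!") for w in query.lower().split()]
    # Remove stop words to keep only the significant query terms.
    return [t for t in tokens if t and t not in STOP_WORDS]

print(preprocess_query("Who is the founder of C?"))   # ['who', 'founder', 'c']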
In this module, the user has to select the category of the Question, i.e. who, what, when, so that the system can identify the expected answer to the user query [13]. The expected answer can be the name of a person or an organization. Table 3 lists the questions along with their type.
Indexing is the formation of indexes for fast retrieval of the information needed by the Question Answering system to answer the user's query [17]. In the question categorization module, the user selects the category of the Question, which tells the system to search in that particular category index; we maintain a different index for each category of Question. The indexing module helps the QA system locate terms together with their document ids for fast processing. The indexing technique used in the QA system manually associates each term with the document ids in which it occurs. With the indexing module, the QA system identifies the documents that match the query terms in order to find candidate answers. After the matched documents are identified, the result processing module returns the result to the user interface. Table 4 shows a general index.
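The kind of per-category inverted index described here can be sketched as follows in Python; the categories, documents, and structure are illustrative assumptions, not the system's actual implementation.

from collections import defaultdict

# Build a separate inverted index (term -> set of document ids) for each
# question category. Documents and categories below are placeholders.
def build_category_index(docs_by_category):
    indexes = {}
    for category, docs in docs_by_category.items():
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        indexes[category] = index
    return indexes

indexes = build_category_index({
    "who": {1: "dennis ritchie designed the c language",
            2: "ada lovelace wrote the first program"},
})
print(sorted(indexes["who"]["c"]))   # [1]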
After the documents are indexed, the documents containing the matched query answers are passed to the query processor, and finally the output of the query is given to the user. Figure 4 shows how the result processing module works.
The proposed algorithm is shown in Fig. 5.
The algorithm for query processing is shown in Fig. 6.
Fig. 5 Proposed algorithm of result processing module
Step 1: Take the query as input as entered by the user.
Step 2: Identify the category of the question.
Step 3: a) If category="Who" then check with the who-index.
        b) If category="What" then check with the what-index.
        c) If category="When" then check with the when-index.
        d) If category="Why" then check with the why-index.
        e) If category="How" then check with the how-index.
        f) If category="Where" then check with the where-index.
Step 4: If category="not found" then check with the main-index.
Step 5: Return the candidate answer generated by matching the documents.
Step 1: Query = Input.
Step 2: Split the query into tokens.
Step 3: Check for stop words; a) if found, remove them.
Step 4: Set the terms as tokens and return.
Step 5: Identify the Subject, Predicate and Object of the query to find the relationship between the subject and object.
Step 6: If a term in the query does not match an index term, then find a synonym of the query term and use the synonym in place of the query term when searching the index.
Step 7: Identify the number of documents that match the query terms.
Step 8: Process the result for the matched documents.
Step 9: Return the processed result to the query interface.
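A hedged Python sketch of the dispatch logic of Figs. 5 and 6: select the category index (falling back to a main index) and intersect the posting sets of the query terms to obtain candidate documents. The data structures follow the earlier index sketch and are assumptions, not the paper's implementation.

# Pick the index matching the question category (or the main index if the
# category is not found), then return documents containing all query terms.
def answer_query(query_terms, category, indexes, main_index):
    index = indexes.get(category.lower(), main_index)
    matched = None
    for term in query_terms:
        doc_ids = set(index.get(term, set()))
        matched = doc_ids if matched is None else matched & doc_ids
    return matched or set()

main_index = {"c": {1, 3}, "language": {1}}
print(answer_query(["founder", "c"], "Unknown", {}, main_index))   # set()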
4 Experimental Evaluation
The approach taken has given satisfactory results to a great extent. For many types of questions the answers are relevant and accurate. There is scope for improvement in several areas, but the goals of the proposed work have been achieved. For example, for the question "who is the founder of C?" the expected answer is the name of a person or an organization, since the category of the Question is who; the answer will contain the name of the person and some description about the person. A snapshot of the proposed system is shown in Fig. 7.
This Question Answering system is able to answer a variety of questions accurately. As can be seen in the snapshots, the answers are formatted according to the Question requirements. This section calculates the relevancy of various answers. It is worth noting that the answers follow a true-positive orientation, i.e. all the relevant results are returned. In some cases, other relevant information that might be useful is also returned. Performance is calculated on the basis of the relevant results given by the system. The formula for calculating the performance is:
References
11. Hammo, B., Abu-Salem, H., Lytinen, S.: QARAB: A Question Answering System to Support
the Arabic Language (2002)
12. Mudgal, R., Madaan, R., Sharma, A.K., Dixit. A.: A Novel Architecture for Question
Classification and Indexing Scheme for Efficient Question Answering (2013)
13. Moldovan, D., Paşca, M., Harabagiu, S., Surdeanu, M.: Performance Issues and Error
Analysis in an Open-Domain Question Answering System (April 2003)
14. Lim, N.R., Saint-Dizier, P.: Some Challenges in the Design of Comparative and Evaluative
Question Answering Systems (2010)
15. Suresh kumar, G., Zayaraz, G.: Concept Relation Extraction Using Naïve Bayes Classifier for
Ontology-Based Question Answering Systems (13 Mar 2014)
16. Kapri, D., Madaan, R., Sharma, A.K., Dixit, A.: A Novel Architecture for Relevant Blog Page
Identification (2013)
17. Balahur, A., Boldrini, E., Montoyo, A., Martínez-Barco, P.: A Comparative Study of Open
Domain and Opinion Question Answering Systems for Factual and Opinionated Queries
(2009)
A Sprint Point Based Tool for Agile
Estimation
1 Introduction
1.1 Estimation
When planning the first sprint, at least 80% of the backlog items are estimated in order to build a reasonable project map. These backlog items consist of user stories grouped into sprints, and estimation of user stories is done using story points. When a software developer estimates that a given piece of work can be done within 10 h, it does not mean that the work will be completed in 10 h, because no one can sit in one place for the whole day and a number of factors can affect story points and hence decrease the velocity. Estimating cost and time is therefore a big challenge [4].
To resolve this problem, the concept of the Sprint-point is proposed. A Sprint-point basically captures the effective story points: it is an evaluation or estimation unit for the user story instead of the story point. By using Sprint-points, more accurate estimates can be achieved. Thus, the unit of effort is the Sprint Point (SP), which is the amount of effort completed in a unit of time.
In the proposed Sprint-point based Estimation Framework, requirements are first
gathered from client in the form of user stories. After requirement gathering, a user
story-based prioritization algorithm is applied to prioritize the user stories.
Consequently, story points in each user story are calculated and uncertainty in story
points is removed with the help of three types of story points proposed. Then, these
story points are converted to sprint-points based on the proposed agile estimation
factors. Afterwards, sprint-point based estimation algorithm is applied to calculate
cost, effort, and time in a software project.
If regression testing is required in agile, then defect data is gathered from similar kinds of projects and used to calculate the rework effort and rework cost of the project. Finally, the sprint-point based estimation algorithm using regression testing is applied to calculate the total cost, effort, and duration of the project.
This Sprint-point based Estimation Framework, as shown in Fig. 1, performs estimation in scrum using the following steps:
Step 1: User stories are prioritized by using User story-Based Prioritization
Algorithm (will be discussed in Sect. 2.2).
Step 2: Uncertainty in story point is removed (will be discussed in Sect. 2.3).
Step 3: Story points are converted into sprint-points by considering agile delay factors. A delay factor is proposed that affects the user stories and thus the cost, effort, and duration of a software project. Sprint-point based estimation is performed using the proposed Sprint-point based estimation with delay-related factors (will be discussed in Sect. 2.4).
In the agile software development method, the requirements from the customer are taken in the form of user stories. The proposed prioritization rule is: "Prioritize the user stories such that the user stories with the highest ratio of importance to actual effort are prioritized first, skipping user stories that are 'too big' for the current release" [5].
Consider the ratio of importance as desired by the client to the actual effort done by the project team (I/E), as in Formula 1.
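A small Python sketch of this prioritization rule, assuming each user story carries a client-supplied importance value and a team effort estimate; the "too big" threshold and the story data are illustrative assumptions.

# Prioritize user stories by the ratio of client importance to estimated effort
# (I/E), skipping stories whose effort exceeds what fits in the current release.
def prioritize(stories, max_effort_per_story):
    candidates = [s for s in stories if s["effort"] <= max_effort_per_story]
    return sorted(candidates, key=lambda s: s["importance"] / s["effort"], reverse=True)

stories = [
    {"name": "login", "importance": 8, "effort": 2},
    {"name": "reports", "importance": 9, "effort": 13},   # too big for this release
    {"name": "search", "importance": 6, "effort": 3},
]
print([s["name"] for s in prioritize(stories, max_effort_per_story=10)])
# ['login', 'search']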
As agile projects are of small duration, the team does not have much time to apply mathematical algorithms. To resolve this issue, a new sprint-point based estimation tool (SPBE) is designed and developed in Excel to automate the Sprint-point based Estimation Framework. The proposed SPBE tool places major emphasis on accurate estimates of effort, cost, and release date by constructing detailed requirements as accurately as possible. This tool is used as a vehicle to validate the feasibility of the project. The proposed tool is a set of individual spreadsheets with data calculated for each team separately. The estimation tool is created to provide more accuracy in velocity calculations, as well as better visibility through burn-down charts at all stages, including planning, tracking, and forecasting. The proposed estimation tool first decides the priority sequence of user stories, which dictates the implementation order. The priority of a user story is decided based on its importance to the client and the effort of the scrum team. After prioritization, the product backlog is prepared, which is the most important artifact for gathering the data. After selecting a subset of the user stories, the sprint backlog is prepared and the period for the next iteration is decided.
The SPBE tool contains several components, with a separate spreadsheet for each component as listed in Table 1: release summary, capacity management, product backlog, sprint backlog, sprint summary, defect, work log, and metric analysis. The backlog sheet contains all the user stories. The sprint summary sheet contains information about the sprint, such as the release date and start date.
All the proposed approaches have been numerically analyzed on a case study named Enable Quiz. The user stories of the case study are listed in Table 2. As agile projects are of small duration, the team does not have much time to apply mathematical algorithms for estimation of cost, effort, and time. To resolve this problem, a new Sprint-point based estimation tool (SPBE) has been designed and developed to automate the Sprint-point based Estimation Framework. The proposed SPBE tool places major emphasis on accurate estimates of effort, cost, and release date by constructing detailed requirements as accurately as possible [9–11]. This tool may be used as a vehicle to validate the feasibility of the project.
2.7 Results
Table 3 Results
Unadjusted value (UV): all six factors at medium level, so UVSP = 6 * 6 = 36; UV = 6
Total user stories: 10
BSP: 300
Project start date: 1 January 2014
Estimated story points (ESP) = BSP + 0.1 * UVSP: 300 + 0.1 * 36 = 303.6
Initial velocity: 5 SP/day
AvgVF (average of the VF of all six factors): 0.95667
Decelerated velocity (DV) = V * AvgVF: 5 * 0.95667 = 4.78330 SP/day
Estimated development time (EDT) = ESP/DV: 303.6 * 8 / 4.78330 = 507.76 h
Project end date: 14 January 2014
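The arithmetic of Table 3 can be reproduced with a short script. This is only a sketch of the calculations shown in the table; the factor values and the 8-hour working day are taken from the table itself, and no other parameters are assumed.

# Reproduces the arithmetic of Table 3.
UVSP = 6 * 6                 # six factors, all at medium level
BSP = 300                    # base story points
ESP = BSP + 0.1 * UVSP       # estimated story points = 303.6
V = 5                        # initial velocity (SP/day)
avg_vf = 0.95667             # average velocity factor of the six factors
DV = V * avg_vf              # decelerated velocity, approximately 4.7833 SP/day
EDT_hours = ESP * 8 / DV     # estimated development time, approximately 507.76 h
print(round(ESP, 1), round(DV, 4), round(EDT_hours, 2))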
3 Conclusion
The main focus of this paper is to propose a new sprint-point based estimation tool which improves the accuracy of release planning and monitoring. The estimation tool is created to provide more accuracy in velocity calculations, as well as better visibility through burn-down charts at all stages, including planning, tracking, and forecasting. This tool is used as a vehicle to validate the feasibility of the project. The developed approach is simple, easy to understand, and can be used effectively for release date calculation in an agile environment. With this method, the release date of small and medium-sized projects can be calculated efficiently.
References
1. Cockburn, A.: Agile Software Development, Pearson Education. Asia Low Price Edition
(2007)
2. Stober, T., Hansmann, U.: Agile Software Development Best Practices for Large Software
Development Projects. Springer Publishing, NewYork (2009)
3. Awad, M.A.: A comparison between agile and traditional software development methodolo-
gies. Unpublished doctoral dissertation, The University of Western Australia, Australia (2005)
4. Maurer, F., Martel, S.: Extreme programming: rapid development for web-based applications.
IEEE Internet Comput. 6(1), 86–91 (2002)
5. Popli, R., Chauhan, N.: Prioritizing user stories in agile environment. In: International
Conference on Issues and Challenges in Intelligent Computing Techniques, Ghaziabad, India
(2014)
6. Popli, R., Chauhan, N.: Managing uncertainty of story-points in agile software. In:
International Conference on Computing for Sustainable Global Development, BVICAM,
Delhi (2015)
7. Popli, R., Chauhan, N.: Sprint-point based estimation in scrum. In: International Conference
on Information Systems and Computer Networks, GLA University Mathura (2013)
8. Popli, R., Chauhan, N.: Impact of key factors on agile estimation. In: International Conference
on Research and Development Prospects on Engineering and Technology (2013)
9. Cohn, M.: Agile Estimating and Planning. Copyright Addison-Wesley (2005)
10. Popli, R., Chauhan, N.: An agile software estimation technique based on regression testing
efforts. In: 13th Annual International Software Testing Conference in India, Bangalore, India
(2013)
11. Popli, R., Chauhan, N.: Management of time uncertainty in agile environment. Int. J. Softw.
Eng. and Applications 4(4) (2014)
Improving Search Results Based
on Users’ Browsing Behavior Using
Apriori Algorithm
Keywords World Wide Web · Apriori algorithm · Browsing behavior · Actions · Web browser · PageRank
1 Introduction
With the rapid increase in information on the Web, people are now more interested in and inclined toward the Internet to get their data. Each user has his or her own interests, and accordingly expectations from a search engine vary. Search engines play an important role in retrieving relevant information. Search engines use various ranking methods such as HITS and PageRank, but these ranking methods do not consider the user's browsing behavior on the Web. In this paper, a PageRank mechanism is devised which considers the user's browsing behavior to provide relevant pages on the Web. Users perform various actions while browsing, including clicking, scrolling, opening a URL, searching text, refreshing, etc., which can be used to perform automatic evaluation of a Web page and hence to improve search results. These actions are stored in a database, and the Apriori algorithm is applied to the stored actions to calculate the weight of a particular Web page. The higher the weight, the higher the rank of that page.
The Apriori algorithm [1] is used to mine frequent itemsets for learning association rules. The algorithm is designed to operate on large databases containing transactions, e.g., collections of items purchased by customers. The point of the algorithm is to extract useful information from a large amount of data. This is achieved by finding rules which satisfy both a minimum support threshold and a minimum confidence threshold.
Support and confidence can be defined as follows:
– The support count of an itemset is the number of transactions that contain that itemset.
– The confidence value is a measure of the certainty associated with a discovered pattern.
Formally, the working of the Apriori algorithm can be described by the following two steps (a compact sketch follows the list):
i. Join Step
– Find the frequent itemsets, i.e., itemsets whose occurrence count in the database is greater than or equal to the minimum support threshold;
– Iteratively find frequent itemsets from 1 to k for k-itemsets.
ii. Prune Step
– The results are pruned to retain only the frequent itemsets.
– Generate association rules from these frequent itemsets which satisfy the minimum support and minimum confidence thresholds.
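To illustrate the join and prune steps, a compact, non-optimized Apriori sketch in Python over browsing-action transactions; the action names and the absolute support threshold are illustrative, not the paper's data.

# Compact Apriori sketch: find itemsets whose support meets min_support,
# growing candidates by joining frequent itemsets (join step) and discarding
# those below the threshold (prune step). Transactions are sets of user actions.
def apriori(transactions, min_support):
    frequent = {}
    items = {frozenset([i]) for t in transactions for i in t}
    k_sets = {s for s in items if sum(s <= t for t in transactions) >= min_support}
    while k_sets:
        for s in k_sets:
            frequent[s] = sum(s <= t for t in transactions)
        size = len(next(iter(k_sets))) + 1
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in k_sets for b in k_sets if len(a | b) == size}
        # Prune step: keep only candidates meeting the minimum support.
        k_sets = {c for c in candidates if sum(c <= t for t in transactions) >= min_support}
    return frequent

transactions = [
    {"save_as", "add_to_favorites", "scroll"},
    {"save_as", "add_to_favorites"},
    {"print", "mouse_click"},
    {"save_as", "scroll"},
]
print(apriori(transactions, min_support=2))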
2 Related Work
In order to get relevant results, a search engine has to modify its searching pattern. Crawlers are the modules of a search engine responsible for gathering Web pages from the WWW. There are many design issues related to the design of such crawlers.
3 Working Methodology
Step 3: Applying Apriori: Apriori is applied on the stored actions to obtain the most frequent actions. For each page, the most frequent patterns are generated. These patterns are then used for calculating the page weight.
Step 4: Calculate Page Weight and PageRank: After applying Apriori on the actions, frequent patterns are obtained. The confidence values of each pattern are used to calculate the page weight. The confidence value can be calculated as per Eq. (1):
where C1, C2, C3, … are the confidence values of the subsets of frequently occurring actions which satisfy the minimum confidence threshold.
Step 5: Apply Rank: The higher the page weight, the higher its rank. The weight of the page calculated in the above step is used for PageRank; the page with the highest weight is assigned the highest rank.
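Since Eq. (1) is not reproduced here, the following Python sketch simply assumes, for illustration, that the page weight aggregates the confidence values C1, C2, … of the qualifying patterns (taken here as their sum) and then ranks pages by that weight; the confidence values are made up.

# Illustrative only: aggregate each page's pattern confidence values into a
# page weight (assumed here to be their sum, since Eq. (1) is not reproduced),
# then rank pages by weight in descending order.
def rank_pages(confidences_per_page):
    weights = {page: sum(confs) for page, confs in confidences_per_page.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

print(rank_pages({"P1": [0.9, 0.8], "P2": [0.7, 0.6]}))
# [('P1', 1.7...), ('P2', 1.3...)]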
4 Example
In this section, an example is presented to show the working of the proposed approach. For this, the actions performed by 40 users on two pages, P1 and P2, were stored in a database. The database also stored the number of times users visited those pages at different times. Users perform actions on those pages according to their needs, and their actions are stored in the database. Apriori is then applied to those actions. For applying Apriori, a minimum support of 20% and a minimum confidence threshold of 60% were used. The result of Apriori shows that the most frequent actions on P1 were save as, add to favorites, and number of scrollbar clicks, and the most frequent actions on page P2 were number of mouse clicks and print. With the help of this, the confidence values of the pages were calculated.
Step 1: A Web browser is developed to store the user actions. The interface of the proposed Web browser is shown in Fig. 2.
Step 2: All the actions performed by the 40 users are stored in the database, a screenshot of which is shown in Fig. 3.
Table 2 Comparison between page weight by Apriori, Hit Count, and star rating (by user)
Pages Pwt (Apriori) Pwt (Hit Count) Star rating (by user)
Page P1 3.34 1.34 3
Page P2 2.6 2.2 2
(Figure: comparison of likings, in terms of page weight and star rating, between the proposed approach and Hit Count for Page 1 and Page 2)
5 Conclusion
In the proposed approach, users' browsing behavior is used to find relevant pages. Whereas past research has focused on user visit history and the time spent on a Web page, the above-mentioned mechanism shows that other aspects of user browsing behavior can also be considered to find the user's interest. The proposed mechanism identifies several implicit indicators that can be used to determine a user's interest in a Web page. In addition to previously studied implicit indicators, several new implicit indicators are also taken into consideration. The indicators examined are complete duration on a Web page, active time duration, search text, copy, print, save as, reload, number of key ups and downs, number of mouse clicks, number of scrollbar clicks, add to favorites, hyperlink, back, and forward. These implicit indicators prove to be more accurate than any indicator alone. To show that the proposed results are more accurate, an explicit indicator is also used to find the more relevant pages. The explicit indicator used here is obtained by asking the user to rate the page according to his or her interest, and a comparison is then made with the proposed work. The comparison shows that the proposed approach is much closer to the explicit indicator and is more accurate.
References
1. Agrawal, R., Ramakrishan, S.: Fast Algorithms for mining Association Rules. IBM Almaden
Research Center (1994)
2. Deepika, Dixit, A.: Capturing User Browsing Behaviour Indicators. Electr. Comput. Eng. Int.
J. (ECIJ) 4(2), 23–30 (2015)
3. Deepika, Dixit, A.: Web crawler design issues: a review. Int. J. Multidiscip. res. Acad.
(IJMRA) (2012)
4. Ying, X.: The research on user modeling for internet personalized Services. Nat. Univ. Def.
Technol. (2003)
5. Morita, M., Shinoda Y.: Information filtering based on user behavior analysis and best match
text retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR), pp. 272–281 (1994)
6. Miller, B.N., Riedl, J.T., Konstan, J.A.: GroupLens: applying collaborative filtering to usenet
news. Commun. ACM, March (1997)
7. Claypool, M., Le, P., Waseda, M., Brown, D.: Implicit interest indicators. In: Proceedings 6th
International Conference on Intelligent User Interfaces, ACM Press (2001)
8. Weinreich, H., Obendort, H., Herder, E., Mayer, M.: Off the beaten tracks: exploring three
aspects of web navigation. In: WWW Conference 2006, ACM Press (2006)
9. Goecks, J. Shavlik, J.: Learning users’ interests by unobtrusively observing their normal
behavior. In Proceedings 5th International Conference on Intelligent User Interfaces,
pp. 129–132 (2000)
10. Xing, K., Zhang, B., Zhou, B., Liu, Y.: Behavior based user extraction algorithm. In: IEEE
International Conferences on Internet of Things, and Cyber, Physical and Social Computing
(2011)
11. Tan, S.H., Chen, M., Yang, G.H.: User Behavior Mining on Large Scale Web Log Data. In:
Apperceiving Computing and Intelligence Analysis (ICACIA) International Conference
(2010)
12. Yang, Q., Hao, H., Neng, X: The research on user interest model based on quantization
browsing behavior. In: The 7th International Conference on Computer Science and Education
(ICCSE) Melbourne, Australia (2012)
13. Agrawal, R., Faloutsos, C., Swami, A.: Efficient similarity search in sequence databases. In:
Proceedings of the Fourth International Conference on Foundations of Data Organization and
Algorithms, Chicago, October (1993)
14. Kim, H., Chan, K.: Implicit indicators for interesting web pages. In: Web Information System
and Technologies WEBIST 2005 (2005)
Performance Efficiency Assessment
for Software Systems
Keywords Software quality · Software quality models · Performance efficiency · Optimized code · Analytical hierarchy process
A. Kaur (&)
GTBIT, New Delhi, India
e-mail: [Link]@[Link]
P. S. Grover
KIIT Group of Colleges, Gurgaon, India
e-mail: drpsgrover@[Link]
A. Dixit
YMCA University of Science and Technology, Faridabad, India
e-mail: dixit_ashutosh@[Link]
1 Introduction
2 Related Work
Inadequate quality of software systems may lead to many problems such as difficult maintenance, low performance efficiency, low reusability, or frequent program change. From time to time, researchers have proposed various software quality models in order to measure the quality of software products. The latest software quality standard is ISO/IEC 25010, which was prepared by ISO/IEC JTC 1 after technically revising the earlier software quality model ISO/IEC 9126-1:2001. Various amendments were made in the newly revised quality model division ISO/IEC 2501n in order to address the weaknesses of ISO/IEC 9126 [2].
As per this latest International Standard ISO 25010, the software product quality model enumerates eight characteristics. These characteristics are further subdivided into sub-characteristics; performance efficiency, for example, is subdivided into time behavior, resource utilization, and capacity. Capacity, as per ISO 25010, is the degree to which the maximum limits of a product or system parameter meet requirements.
Various software quality models have been reviewed in order to understand the rationale for taking performance efficiency as a characteristic defining quality.
In 1977, Jim McCall identified three main perspectives (product revision, pro-
duct transition, and product operations) for characterizing the quality attributes of a
software product and considered efficiency as one of the quality factors under
product operations. It defined one or more quality criteria for each quality factor, in
order to assess the overall quality of software product. According to McCall’s
quality model, the quality criteria for efficiency are execution efficiency and storage
efficiency [5].
In 1978, Barry Boehm proposed a software quality model with seven quality
attributes according to the three fundamental uses (As-is utility, maintainability, and
portability) of the software which may affect the quality of the software product. It
identified efficiency as a quality attribute under As-is utility. According to Boehm’s
quality model, the factors that affect efficiency are accountability, device efficiency,
and accessibility [3].
In 1993, the ISO 9126 software quality model was proposed, composed of six quality characteristics relating to internal and external quality. It identified efficiency as one of the quality characteristics and specified three quality attributes that affect the efficiency of software: time behavior, resource behavior, and efficiency compliance [6].
In 2009, Kumar extended the ISO/IEC 9126 quality model and proposed
aspect-oriented programming-based software quality model, viz aspect-oriented
software quality (AOSQUAMO) model. It added code reducibility as a
sub-characteristic under efficiency quality characteristic. Hence, the quality attri-
butes that affect the efficiency according to AOSQUAMO model are time behavior,
resource behavior, and code reducibility [7].
In 2011, ISO/IEC 25010 addressed several of the problems related to efficiency, but one area remains untouched [2].
The use of solid coding techniques and good programming practices while developing high-quality optimized code plays an important role in software quality and performance. Code written while consistently applying a sound coding standard and proper coding techniques is not only optimized in terms of time, effort, and cost (resources) but is also easier to comprehend and maintain. Hence, it serves the expectations of internal customers. The missing point in the ISO 25010 model is that, while estimating the efficiency of software, no weightage is given to how efficiently the code is written and how optimized it is.
In this section, we propose a performance efficiency model, as performance efficiency is a vital part of improving software quality (Fig. 1).
Fig. 2 Problem decomposition

CI (Consistency Index) = (λmax − n) / (n − 1) and CR (Consistency Ratio) = CI / RI
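A short numpy sketch of how λmax, CI and CR can be computed for a pairwise comparison matrix; the matrix below is an arbitrary example rather than the one used in the case study, and RI = 0.90 is Saaty's standard random index for n = 4.

import numpy as np

# AHP consistency check: lambda_max from the principal eigenvalue,
# CI = (lambda_max - n)/(n - 1), CR = CI / RI.
RI = {3: 0.58, 4: 0.90, 5: 1.12}   # Saaty's random index values

def consistency(A):
    n = A.shape[0]
    lam_max = max(np.linalg.eigvals(A).real)
    CI = (lam_max - n) / (n - 1)
    CR = CI / RI[n]
    return lam_max, CI, CR

A = np.array([[1, 3, 5, 7],
              [1/3, 1, 3, 5],
              [1/5, 1/3, 1, 3],
              [1/7, 1/5, 1/3, 1]], dtype=float)   # arbitrary example matrix
print(consistency(A))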
5 Case Study
II Iteration
See Figs. 7 and 8.
III Iteration
After the third iteration, the difference between the current and the previous eigenvector approaches zero; hence, these values can be accepted as our final values (Figs. 9 and 10).
Consistency Index (CI) = (λmax − n) / (n − 1) = (4.2125 − 4) / (4 − 1) = 0.0708

and

Consistency Ratio (CR) = CI / RI = 0.070833 / 0.9 = 0.0787
Now, since λmax = 4.2125 is greater than n = 4 and the consistency ratio 0.0787 is less than 0.1, matrix A is consistent.
The resulting priority order of the sub-characteristics is time behavior, optimized code, resource utilization, and then capacity. In future, this model may be used for comparing the performance efficiency of different software systems.
References
1. Pressman, R.S.: Software Engineering: A Practitioner’s Approach, 5th edn. Mc Graw Hill,
New York (2005)
2. ISO/IEC 2010; ISO/IEC 25010: Systems & software engineering—system and software
quality requirements and evaluation (SQuaRE)—system and software quality models (2010)
3. Boehm, B., et al.: Quantitative evaluation of software quality. In: IEEE International
Conference on Software Engineering, pp. 592–605 (1976)
4. Sommerville, I.: Software Engineering, 7th edn., pp. 12–13. Pearson (2004). ISBN 0-321-21026-3
5. McCall, J.A., Richards, P.K., Walters, G.F.: Factors in Software Quality, vols. I, II, and III.
US Rome Air Development Center Reports. US Department of Commerce; USA (1977)
6. ISO 2001; ISO/IEC 9126-1: Software Engineering—Product Quality—Part 1: Quality Model.
International Organisation for Standardisation, Geneva Switzerland (2001)
7. Kumar, A.: Analysis & design of matrices for aspect oriented systems. Ph.D. Thesis, Thapar
University (2010)
8. Saaty, T.L.: Analytic Hierarchy Process. Mc Graw Hill (1980)
9. Kaur, A., Grover, P.S., Dixit, A.: Quantitative evaluation of proposed maintainability model
using AHP method. In: 2nd International Conference on computing for sustainable global
development, pp. 8.159–8.163 (2015)
10. Kaur, A., Grover, P.S., Dixit, A.: An improved model to estimate quality of the software
product. YMCAUST Int. J. Res. 1(2), 01–06 (2013)
11. Kaur, A., Grover, P.S., Dixit, A.: Analysis of quality attribute and metrics of various software
development methodologies. In: International Conference on Advancements in Computer
Applications and Software Engineering, pp. 05–10 (2012)
12. Grady, R., et al.: Software Metrics: Establishing a Company-Wide Program, p. 159. Prentice
Hall (1987)
Impact of Programming Languages
on Energy Consumption for Sorting
Algorithms
Abstract In today's scenario, the world is moving rapidly toward global warming, and various experiments are being performed to concentrate more on energy efficiency. One way to achieve this is by implementing sorting algorithms in a programming language that consumes the least amount of energy, which is our current area of research in this paper. In this study, our main goal is to find a programming language that consumes the least amount of energy and thereby contributes to green computing. In our experiment, we implemented different sorting algorithms in different programming languages in order to find the most power-efficient language.
1 Introduction
This study considers four sorting algorithms: bubble sort, insertion sort, selection sort, and Quick sort. We simulate the energy consumption of these sorting algorithms when implemented in different programming languages in order to arrive at energy-efficient programming.
Our interest in this paper is to promote green computing by bringing into the limelight programming languages and sorting algorithms that require the least energy.
2 Related Works
The IT industry has been focusing on the energy efficiency of hardware and has evolved its devices for better efficiency [1]. Green computing is the environmentally friendly use of available computing resources without sacrificing performance [2]. Very little research has been performed in this field due to constrained hardware resources and extensive cost. Researchers have concluded that the most time- and energy-efficient sorting algorithm is Quick sort [3]. It has also been found that energy consumption greatly depends on the time and space complexity of the algorithms [4]. A programmer can develop application-level energy-efficient solutions by using an energy-efficient language [5]. Algorithms also have a great impact on energy consumption and ultimately on green computing [6]. Several studies have already been performed on hardware and concluded that home server hardware together with well-tuned, parallelized sorting algorithms can sort bulk amounts of data and is noticeably more energy-efficient than older systems [7]. Bunse, C. concluded that different software has a different energy payload; his studies also show that different algorithms have different energy requirements [8].
A lot of research has already been performed to improve energy efficiency, mostly focusing on algorithm design, hardware architectures (VLSI designs), operating systems, compilers, etc., but investigations show that programming language design and good programming practice may be another way to reduce power consumption [9]. Distinct programming languages handle the same situation differently and require a different number of operations to accomplish the same task. In general, compiled languages are harder to code but run faster; on the other hand, interpreted languages are easier to program but take longer to run. There is some difference in energy consumption between different loops (such as a For loop and a While loop) when the number of operations needed for incrementing the loop counter variables and checking the termination conditions differs significantly between the two alternatives. The energy consumed by an application can be further cut down by using "vector operations" in a vector register where possible. Code execution time can also be reduced by taking advantage of multiple threads and cores, resulting in increased idle time that in turn leads to power conservation. Inefficient code forces the CPU to draw more from the processor and consume more electricity [10].
The performance-to-power relationship can be improved by loop unrolling. The use of idle-power-friendly programming language implementations and libraries may improve power saving [11]. Based on several studies, researchers estimate that between 30 and 50% energy savings can be achieved by selecting energy-aware software solutions, and even more could be achieved by a proper combination of both software and hardware [12].
4 Sorting Algorithms
Sorting is a very efficient way of performing the important task of putting a list of elements in order; for example, sorting will arrange the elements in ascending or descending order [13]. When it comes to battery-operated devices, the use of energy-efficient sorting is a prime requirement. The text that follows describes the sorting algorithms used in our research. Table 1 shows the complexities of the various sorting algorithms.
Bubble sort is a simple algorithm which begins at the start of the data set, bubbles the largest element to the end of the data set on each pass, and in the next cycle repeats the same with one element fewer [14].
Selection sort, also referred to as a comparison sort, is best known for its simplicity and has pretty good performance compared with more complicated algorithms. It is not as good as insertion sort, which works in a way similar to selection sort [15].
Insertion sort separates the data list into two parts: one part that is in the sorted section and the other that is in the unsorted section. When a new element is inserted, the existing larger elements must be shifted to the right [16].
The Quick sort used here is an advanced form of the classic Quick sort. Some modifications are made in the internal loops of Quick sort to make it well optimized and short. It is also referred to as Median Hybrid Quick sort.
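For reference, compact Python versions of two of these algorithms are shown below; the experiments themselves were implemented in Java, C#.NET, and Visual Basic 6.0, so these are only illustrative sketches.

# Compact reference implementations of two of the sorting algorithms studied.
def bubble_sort(a):
    a = list(a)
    for i in range(len(a) - 1, 0, -1):        # each pass bubbles the largest item to the end
        for j in range(i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

def insertion_sort(a):
    a = list(a)
    for i in range(1, len(a)):                # grow the sorted section one element at a time
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:          # shift larger elements to the right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

print(bubble_sort([5, 1, 4, 2]), insertion_sort([5, 1, 4, 2]))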
5 Programming Language
5.3 C#.Net
6 Experimental Setup
In this study, we used a simulation tool named "Joulemeter" [21] from Microsoft Corporation to estimate the power consumption of various sorting algorithms implemented in three languages: Java, C#.NET, and Visual Basic 6.0. We used an Intel Core i5 4th-generation CPU with the Windows 8.1 (Version 6.3.9600) operating system to perform our experiment. Figure 1 shows the model of the experimental setup.
– Joulemeter
It is a software tool introduced by Microsoft. It is used to view the overall power consumption of the whole system as well as of the key components being monitored. The user has to calibrate Joulemeter in order to estimate the power consumption of an idle system. After calibration, the total power consumption can be monitored and obtained by adding up the values (in watts) over the duration of interest. It can also be converted to other units such as kWh/Wh using the following conversion:
1 kWh = 1000 W × 1 h = 1000 W × 3600 s = 3,600,000 J
i.e., 1 W = 1 J/s.
Thus, 1 J = 1/3,600,000 kWh [3].
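The conversion above can be expressed as a small helper, assuming the tool reports an average power value in watts over a measured interval:

# Convert an average power reading (watts) over a measured duration (seconds)
# into energy in joules and kWh, following the conversion above.
def energy(avg_power_watts, duration_seconds):
    joules = avg_power_watts * duration_seconds   # 1 W = 1 J/s
    kwh = joules / 3_600_000                      # 1 kWh = 3,600,000 J
    return joules, kwh

print(energy(25.0, 120))   # (3000.0, 0.000833...)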
7 Experimental Run
Fig. 2 a Average power consumed for integer data set (in W/s). b Average power consumed for
integer data set (in W/s)
Fig. 3 a Average power consumed for double data set (in W/s). b Average power consumed for
double data set (in W/s)
9 Conclusion
References
1. Raza, K., Patle, V.K., Arya, S.: A review on green computing for eco-friendly and sustainable
IT. J. Comput. Intell. Electron. Syst. 1(1), 3–16 (2012)
2. Saha, B: Green computing. Int. J. Comput. Trends Technol. 14(2) (2014)
3. Chandra, T.B., Patle, V.K., Kumar, S.: New horizon of energy efficiency in sorting
algorithms: green computing. In: Proceedings of National Conference on Recent Trends in
Green Computing. School of Studies in Computer in Computer Science & IT, Pt. Ravishankar
Shukla University, Raipur, India, 24–26 Oct 2013
4. Bunse, C., Höpfner, H., Roychoudhury, S., Mansour, E.: Choosing the “best” sorting
algorithm for optimal energy consumption. In: ICSOFT (2), pp. 199–206 (2009)
5. Liu, Y. D.: Energy-aware programming in pervasive computing. In: NSF Workshop on
Pervasive Computing at Scale (PeCS) (2011)
6. Narain, B., Kumar, S.: Impact of algorithms on green computing. Int. J. Comput. Appl.
(2013). ISSN No. 0975-8887
7. Beckmann, A., Meyer, U., Sanders, P., Singler, J.: Energy-efficient sorting using solid state
disks. In: Proceedings of IEEE Green Computing Conference (2010)
8. Bunse, C., Hopfner, H., Mansour, E., Roychoudhury, S.: Exploring the energy consumption
of data sorting algorithms in embedded and mobile environments. In: Tenth International
Conference on Mobile Data Management: Systems, Services and Middleware, 2009.
MDM’09, pp. 600–607. IEEE (2009)
9. Liu, Y.D.: Energy-aware programming in pervasive computing. In: NSF Workshop on
Pervasive Computing at Scale (PeCS) (2011)
10. Francis, K., Richardson, P.: Green maturity model for virtualization. Archit. J. 18(1), 9–15
(2009)
11. Energy-Efficient Software Guidelines. [Link]
energy-efficient-software-guidelines
12. Code green: Energy-efficient programming to curb computers power use, [Link]
[Link]/news/2011/05/31/code-green-energy-efficient-programming-to-curb-
computers-power-use/
13. Sareen, P.: Comparison of sorting algorithms (on the basis of average case). Int. J. Adv. Res.
Comput. Sci. Softw. Eng. 3(3), 522–532 (2013)
14. Research Paper on Sorting Algorithms. [Link]
mac-286-data-structure_22946/. Accessed on 26 Oct 2009
15. Nagpal, H.: Hit sort: a new generic sorting algorithm
16. Khairullah, M.: Enhancing worst sorting algorithms. Int. J. Adv. Sci. Technol. 56 (2013)
17. Singh, T.: New software development methodology for student of Java programming
language. Int. J. Comput. Commun. Eng. 2(2), 194–196 (2013)
18. Gosling, J.: The Java language specification. Addison-Wesley Professional (2000)
19. Hassan, A.B., Abolarin, M.S., Jimoh, O.H.: The application of Visual Basic computer
programming language to simulate numerical iterations. Leonardo J. Sci. 5(9), 125–136
(2006)
20. Benton, N., Cardelli, L., Fournet, C.: Modern concurrency abstractions for C#. ACM Trans.
Program. Lang. Syst. (TOPLAS) 26(5), 769–804 (2004)
21. Joulemeter. [Link]
daf55565f794
Crawling Social Web with Cluster
Coverage Sampling
Abstract A social network can be viewed as a huge container of nodes and relationship edges between the nodes. Covering every node of a social network in the analysis process is practically infeasible due to the gigantic size of social networks. The solution is to take a sample by collecting a few nodes and the relationship status of the huge network. This sample can be considered a representative of the complete network, and the analysis is carried out on this sample. How closely the results derived from the analysis resemble reality depends largely on the extent to which the sample resembles its actual network. Sampling, hence, appears to be one of the major challenges for social network analysis. Most social networks are scale-free networks and can be seen as having overlapping clusters. This paper develops a robust social Web crawler that uses a sampling algorithm which considers the clustered view of the social graph. A sample will be a good representative of the network if it has a clustered view similar to the actual graph.
1 Introduction
Social networks provide an open platform to analyse and understand the behaviour
of users, their interaction patterns and propagation of information. An explosive
growth of Online Social Networks (OSNs) has assured possibility of prominent
2 Related Work
Several sampling algorithms have been proposed for social graph sampling. These algorithms can be put into two categories: those which focus on nodes and those which focus on edges. In algorithms of the former category, the sampling decision-making process is executed on nodes, e.g., BFS [4, 12], MHRW [4, 9, 10] and UGDSG [13]. The latter class of algorithms acquires edges in the sample, and nodes are selected as the end points of those edges, e.g., FS [13].
BFS has been used widely to study the user behaviour of OSNs [4] and to analyse the topological characteristics of social graphs [14]. However, BFS suffers from bias [4, 14]: it visits nodes with higher degree more frequently, due to which BFS obtains a higher local clustering coefficient than the original network.
MHRW is based on the Markov Chain Monte Carlo (MCMC) model, which selects random node samples according to the degree probability distribution of the nodes [9, 10]. MHRW designs a proposal function based on a probability distribution, which is

P(v) = k_v / Σ_{u ∈ S} k_u
An edge (v, w) is selected uniformly from node v's outgoing edges, and v is replaced with w in the set of seed nodes. Edge (v, w) is added to the sample. FS does not perform well if the clustering coefficient is small [13].
Corlette et al. [15] proposed event-driven sampling, which focuses on the active part of the network. Event-driven sampling is similar to the process used by various search engines to refresh their repositories by including new Web pages and eliminating expired ones. Multigraph sampling [16], on the other hand, considers the social network as distributed in clusters; the sampling is carried out over multiple relations among users in the social network.
3 Problem Description
Generally, social Web crawling has distinct steps: the crawler starts with a seed node, explores and collects its directly connected nodes, selects a few of the explored nodes as a sample for further exploration, and repeats this process. After fetching the information of one node completely, the crawler needs the next node to crawl, and that node is selected by the sampling algorithm.
Almost every social network is scale-free and unevenly formed; i.e., the network is denser in some places and sparser in others. These denser portions can be considered clusters, as the people/actors in them exhibit some kind of similar characteristic, e.g., the same workplace, the same hometown, the same place of study, the same country or the same continent. Hence, a social network is not uniform but is a collection of overlapping clusters (a few clusters can also be stand-alone).
We consider a social graph G = (V, E), where V is the collection of nodes and E is the collection of edges representing associations among nodes. Due to the scale-free persona of social graphs, we can consider that the graph has several overlapping clusters CL1, CL2, CL3, …, CLk such that G = ∪_{1 ≤ i ≤ k} CLi. A few of these clusters may not be overlapping or may be completely disjoint. Here, clusters can be seen as an earlier, less-restricted form of communities. There is a strong possibility that each community excavated from a sample of the graph has a parent cluster. Let the graph Gs = (Vs, Es) be the sample of graph G, and let Co1, Co2, Co3, …, Com be the communities detected in Gs, such that Gs = ∪_{1 ≤ j ≤ m} Coj. Then the following predicate is always true
The crawling framework proposed in this paper uses adaptive sampling, which aims to address potential limitations or challenges evident in several existing sampling algorithms; these will be discussed along with the explanation of the proposed framework. The crawling framework is shown briefly in Fig. 1. The framework assumes that the social graph is undirected, that the crawler is interested only in publicly available information, and that the graph is well connected (the graph can be disconnected if it has stand-alone clusters, which will be ignored by the crawler). The social Web is apparently huge and is in the form of overlapping clusters, which is demonstrated by intersecting bubbles in the social Web cloud.
The social Web is clustered, and clusters overlap by having common actors. The crawler, while digging into the network and collecting the sample, must ensure that it touches every cluster. The crawler starts with a seed node and proceeds further by hopping to its friends (directly connected nodes). In Fig. 1, the social Web is shown as having overlapping clusters. The crawler can start at any cluster to which the seed node belongs. Crawling proceeds with the following algorithms:
Algorithm 1
Crawler(Sn)
Input: Seed_Node Sn.
Start
  Perform login with credentials of Sn;
  Initialize Sample_Data_Set and Sparse_Data_Set;
  Cn[ ] ← Sn;
  Crawling(Cn[ ]);
End
As shown in Algorithm 1, the crawler enters the network via the seed node Sn. Most OSNs require a username and password, which are provided to the crawler to log in. Two data sets are maintained by the framework. First, the Sparse_Data_Set contains every node explored by the crawler and the identified links between them. Out of the many explored nodes, a few are selected for further exploration. Connectivity among nodes is known only for those nodes which are visited by the crawler. Thus, the Sparse_Data_Set contains many nodes among which the relationship status is not known and which are hence considered unrelated (connectivity may exist in the real network). Therefore, another data set, the Sample_Data_Set, is also maintained that contains only those nodes among which the relationship is known. Cn[ ] is the list of nodes which the crawler will crawl next.
Algorithm 2
Crawling(Cn[ ])
Input: Nodes_To_Crawl_List Cn[ ].
Start
  Update(Sample_Data_Set, Cn[ ], Sparse_Data_Set);
Algorithm 3
Expression_Generator(Cond_Table)
Start
  k = 0;
  while (k < Num_Of_Attributes(Cond_Table))
  {
    Atom_k ← Select_Value(ATTR_k);
    On_Atom_Op_k ← Select_Op(On_Atom_Op_List);
    Between_Atom_Op_k ← Select_Op(Between_Atom_Op_List);
    Update(EXP, On_Atom_Op_k, Atom_k, Between_Atom_Op_k);
    k = k + 1;
  }
  Return EXP;
End
Algorithm 4
Sample_Nodes(Competing_Nodes, EXP)
Start
  s_count = 0; r_count = 0;   // count of sampled nodes and randomly selected nodes
  while (s_count < Threshold(Competing_Nodes) & r_count < SizeOf(Competing_Nodes))
  {
    r_count++;
    Pn ← RandomSelect_&_Remove(Competing_Nodes);   // Pn is the current node being processed
    If (Pn holds EXP)
    {
      s_count++;
      Add(Node_List, Pn);
    }
  }
  while (s_count < Threshold(Competing_Nodes))
  {
    Add(Node_List, RandomSelect_&_Remove(Competing_Nodes));
    s_count++;
  }
  Append(id_Table, Node_List);
  Return Node_List;
End
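A Python sketch of the node-selection logic of Algorithm 4, assuming each competing node is a dictionary of attributes and that the Boolean expression EXP is represented as a predicate function; the threshold fraction is an illustrative assumption.

import random

# Sketch of Algorithm 4's selection logic: randomly probe competing nodes and
# keep those satisfying EXP (a predicate over the node's attributes); if too
# few qualify, top up with random nodes so the crawl can still proceed.
def sample_nodes(competing_nodes, exp, threshold_fraction=0.3):
    competing = list(competing_nodes)
    random.shuffle(competing)
    threshold = max(1, int(threshold_fraction * len(competing)))
    selected, rest = [], []
    for node in competing:
        if len(selected) < threshold and exp(node):
            selected.append(node)
        else:
            rest.append(node)
    while len(selected) < threshold and rest:
        selected.append(rest.pop())        # random top-up (rest is already shuffled)
    return selected

nodes = [{"id": i, "place_of_work": "A" if i % 2 else "B"} for i in range(10)]
print(sample_nodes(nodes, exp=lambda n: n["place_of_work"] == "A"))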
Let us assume that the crawler has sampled nodes of cluster 2 of the social graph shown in Fig. 1. Nodes lying in the area of cluster 2 that is not overlapped by any other cluster have similar characteristics and can probably satisfy the current EXP. If EXP is altered at a few Atoms, nodes lying in the area of cluster 2 that is overlapped by another cluster (1, 3, 4, 7 or 8) may satisfy it instead. By sampling nodes lying in the overlapping areas of clusters, the crawler migrates from one cluster to another. Hence, the sample will have nodes covering almost every cluster of the social network rather than nodes of just a few adjacent clusters.
The proposed model has been tested on a synthetic social network created on a local server. The synthetic network has characteristics similar to any other social network in terms of the power law of degree distribution, clustering coefficient, etc. The synthetic social network was created using logs of e-mail conversations of students and faculty members at YMCA UST, Faridabad, India. The nodes in the synthetic social network exhibit inherent attributes such as place of birth, place of work, education, political orientation and birthday. These attributes of each user and their values are used for creating the Boolean expression EXP. A description of the data set is given in Table 1.
The attributes of each node in the network are created anonymously. The algorithm was executed on the network in two phases. In the first phase, the algorithm was executed for some time and then paused, and the first sample, called sample 1, was gathered. The algorithm was then resumed and another sample was gathered as sample 2. Table 2 shows the statistics of sample 1 and sample 2.
The collected samples are analysed against degree distribution and compared with the original network to check whether the collected samples represent the original network. Degrees of nodes in the networks are normalized by the median of the degrees; therefore, the complete degree distribution falls in the range 0–2.
Normalized Degree_j = Degree_j / M
where M is the median of the degrees.
Fig. 4 Sample 1
NMSE(k) = sqrt( E[ (θ̂_k − θ_k)² ] ) / θ_k
where θ̂_k is the estimate of θ_k based on the sampled graph. The NMSE(k) metric is defined in order to show the difference between the degree distribution of the sampled graphs and that of the original one. The lower the NMSE value of an algorithm, the better the performance of the sampling algorithm on the social network graph.
Fig. 5 Sample 2
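A sketch of how NMSE(k) can be estimated from binned degree distributions, where θ_k is the fraction of nodes in degree bin k of the original network and the estimates come from several sampled graphs; the numbers are placeholders.

import math

# NMSE(k) = sqrt(E[(theta_hat_k - theta_k)^2]) / theta_k, with the expectation
# estimated over several sampled graphs. theta_k is the original network's
# fraction of nodes in degree bin k; the values below are placeholders.
def nmse(theta_k, sampled_estimates_k):
    mse = sum((est - theta_k) ** 2 for est in sampled_estimates_k) / len(sampled_estimates_k)
    return math.sqrt(mse) / theta_k

print(nmse(0.20, [0.18, 0.22, 0.21]))   # approximately 0.087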
The above analysis clearly shows that the algorithm proposed in this paper tends to migrate from one cluster to another. Therefore, we took two samples at different intervals to verify that the algorithm leaves one cluster and enters another. It also implies that the representation extent of a cluster is directly proportional to the time for which the algorithm is run, if the size of the graph is finite. Evidently, however, gigantic online social networks have a size that is near to infinite. In that case, we can set a threshold on the number of nodes crawled: when a sufficient number of nodes and their relationship edges have been gathered, the algorithm can be stopped. The collected sample will have a clustered view similar to the original network, thereby giving the maximum possible representation of the network.
This paper presents a crawler that works on a cluster coverage sampling algorithm for the selection of the next node. The algorithm benefits from the clustered structure of most social networks. The algorithm was tested on a synthetic network that has a clustered structure like many social networks and a scale-free distribution of nodes, and the results are promising. As a future prospect, the algorithm is to be tested on an online social network such as Facebook. Social graph dynamism is also a burning area in the field of social network analysis, and the cluster coverage sampling algorithm shows promising possibilities for application in tackling social graph dynamism.
References
1. [Link]
2. [Link]
3. Srivastava, A., Anuradha, Gupta, D.J.: Social network analysis: hardly easy. IEEE Int. Conf.
Reliab. Optim. Inf. Technol. 6, 128–135 (2014)
4. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Practical recommendations on crawling
online social networks. IEEE J. Sel. Areas Commun. 29(9), 1872–1892 (2011)
5. Lee, S.H., Kim, P.-J., Jeong, H.: Statistical properties of sampled networks. Phy. Rev. E 73,
016102 (2006)
6. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in facebook: a case study of
unbiased sampling of OSNs. In: Proceedings of IEEE INFOCOM (2010)
7. Ribeiro, B., Towsley, D.: Estimating and sampling graphs with multidimensional random
walks. In: Proceedings of ACM IMC (2010)
8. Cho, M., Lee, J., Lee, K.M.: Reweighted random walks for graph matching. ECCV 2010,
Part V, LNCS 6315, pp. 492–505 (2010)
9. Lee, C.H., Xu, X., Eun, D.Y.: Beyond random walk and metropolis-hastings samplers: why
you should not backtrack for unbiased graph sampling. ACM (2013)
10. Li, R.H., Yu, J.X., Qin, L., Mao, R., Jin, T.: On random walk based graph sampling.
IEEE ICDE Conference (2015)
11. Ribeiro, B., Wang, P., Murai, F., Towsley, D.: Sampling directed graphs with random walks.
In: Ribeiro, B., Wang, P., Murai, F., Towsley, D. (eds.) UMass CMPSCI Technical
Report UMCS (2011)
12. Wilson, C., Boe, B., Sala, A., Puttaswamy, K.P.N., Zhao, B.Y.: User interactions in social
networks and their implications. In: Proceedings of ACM EuroSys (2009)
13. Wang, T., Chen, Y., Zhang, Z., Xu, T., Jin, L., Hui, P., Deng, B., Li, X.: Understanding graph
sampling algorithms for social network analysis. In: 2011 31st International Conference on
Distributed Computing Systems Workshops (ICDCSW), pp. 123, 128, 20–24 June 2011
14. Ahn, Y., Han, S., Kwak, H., Moon, S., Jeong, H.: Analysis of topological characteristics of
huge online social networking services. In: Proceedings of WWW (2007)
15. Corlette, D., Shipman, F.: Capturing on-line social network link dynamics using event-driven
sampling. In: IEEE International Conference on Computational Science and Engineering
(2009)
16. Gjoka, M., Butts, C.T., Kurant, M., Markopoulou, A.: Multigraph sampling of online social
networks. IEEE J. Sel. Areas Commun. 29(9), 1893–1905 (2011)
Efficient Management of Web Data
by Applying Web Mining Pre-processing
Methodologies
Abstract Web usage mining is defined as the application of data mining techniques to extract interesting usage patterns from Web data. Web data provides information about Web users' behavior. Pre-processing of Web data is an essential process in Web usage mining; it is used to convert raw data into the processed data necessary for the Web mining task. In this research paper, the author proposes an effective pre-processing methodology which involves field extraction, significant attribute selection, data selection, and data cleaning. The proposed methodology improves the quality of Web data by managing the missing values, noise, inconsistency, and incompleteness usually found in such data. Moreover, the results of pre-processing will be used further in frequent pattern discovery.
Keywords Web log file · Web server · Web usage mining · Pre-processing · Data cleaning
1 Introduction
Data mining is the process of discovering interesting and useful patterns from raw data; it is also known as knowledge discovery in databases. This technology analyzes data and helps to extract useful information. It is used in various areas such as the retail industry, intrusion detection, and financial data analysis. Web mining is generally applied to the World Wide Web, which continues to grow both in the huge volume of traffic and in the size and complexity of Web sites, making it difficult to identify the relevant information present on the Web [1]. In the proposed model, the researcher used Microsoft Excel and SQL Developer software to perform the pre-processing. Through this process, raw Web log data is transformed into processed data. The main objective of this model is to remove irrelevant data and improve data quality.
The author has organized this paper into the following sections: In Sect. 2, the researcher reviews pre-processing related work. Section 3 gives a description of the proposed pre-processing model. Section 4 shows the experimental results of the proposed system. Section 5 concludes the discussion.
2 Literature Review
Web usage mining provides useful information in abundance, which makes this area interesting for research organizations and researchers in academics. Rathod [2] focuses on the pre-processing tasks of data extraction and data filtering, which were performed on a combined log file; one of the popular platforms, [Link] 2010, was used to implement these tasks. Priyanka and Ujwala [3] proposed two algorithms for field extraction and data cleansing; the latter is used to remove errors and inconsistencies and improves the quality of the data. Ramya et al. [4] explored the concept of merging log files from different servers for use in the data pre-processing process, after which data cleaning removes requests concerning non-analyzed resources. Prince Mary et al. [5] clarified that people are more interested in analyzing log files, which can offer more useful insight into Web site usage; thus the data cleaning process is used in data analysis and mining. Data cleaning involves steps such as robot cleaning, elimination of failed status codes, and elimination of graphics and video records. Chintan and Kirit [6] proposed a novel technique that was more effective than existing techniques; data extraction, data storage, and data cleaning are included in data pre-processing to convert the Web log file into a database, and during data cleaning only noise is removed from the log file. Ramesh et al. [7] stated that data pre-processing is a significant process of Terror Tracking using Web Usage Mining (TTUM), which was effectively carried out by field extraction, data cleaning, and data storage tasks. Pooja et al. [8] explored the records which are not suitable for identifying a user's navigational behavior, such as access from Web robots, e.g., Googlebot, access for images, and access for audio files; these records are removed during the data cleaning module.
Web usage mining is a significant part of Web mining. It is used to extract user’s
access patterns from Web data over WWW in order to understand the user’s
browsing behavior, and it helps in building of Web-based applications. Web usage
mining depends on different source of data such as Web Server Side, Proxy Side,
Efficient Management of Web Data by Applying Web Mining … 117
and Client Side. Generally, raw Web log contains irrelevant, noisy, incomplete,
duplicate, and missing data. Data pre-processing is the most important phase of
Web usage mining in order to clean Web log data for discovering patterns.
The log files stored on the Web server may suffer from various anomalies: the data may be irrelevant, inconsistent, and noisy in nature. In the present work, these anomalies are removed by following the data pre-processing steps shown in Fig. 1.
The server’s log entry contains different fields like IP address, date and time,
request and user agent that should be separated out before applying data cleaning
process. Field extraction is the process to separate the fields from single entry of the
log file.
Researcher performed field extraction process on log file using Microsoft Excel
which is explained as follows.
Open Microsoft Excel, using open option select the log file, use characters such
as space or comma as delimiter to separate each field. These separated attributes are
stored in different heading according to the input Web log file such as IP address,
username, date, request, status code, referrer. Now, this file can be saved with .xls
extension, this concludes the field extraction process.
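For readers who prefer to script this step, the sketch below splits one combined-format Apache log line into the same named fields using a regular expression. It is an illustration only: the pattern, field names, and sample line are assumptions based on the common/combined log format and are not part of the original spreadsheet workflow.

```python
import re

# Assumed pattern for the Apache combined log format; adjust it to the actual log layout.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<username>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def extract_fields(line):
    """Field extraction: split one raw log entry into named attributes."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    sample = ('127.0.0.1 - frank [10/Apr/2012:13:55:36 -0700] '
              '"GET /index.html HTTP/1.1" 200 2326 '
              '"http://example.com/start.html" "Mozilla/5.0"')
    print(extract_fields(sample))
```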
The Apache Web server log file was analyzed and the important attributes of Web log files were identified: IP address, date, time, request line (URI stem), status code, referrer, and user agent [9]. These attributes provide valuable information about visitors' behaviour, traffic patterns, navigation paths, Web browsers, errors, etc. Based on this analysis, only the significant attributes identified are selected for further processing.
When a Web server receives an HTTP request, it returns an HTTP response code to the client. These HTTP status codes are three-digit numbers grouped as 1xx, 2xx, 3xx, and so on. All records with a status code below 200 or above 299 are excluded from the analysis. Filtering is done in the Microsoft Excel sheet obtained after significant attribute selection, by keeping only records with status code >= 200 and <= 299.
In the Microsoft Excel file, select the range of cells, then choose the Data tab and apply 'Remove Duplicates' in the Data Tools group to remove duplicate records. Blank cells are then located with the Find and Replace dialog box, and the corresponding tuples are removed.
3.4.3 Noise
4 Experimental Results
In this section, the effectiveness and efficiency of the proposed model are demonstrated through several experiments on a Web log file. The Apache Web log file used spans April 08, 2012, to July 08, 2012, with an initial file size of 958 kB.
The experiments were performed on a 2.40 GHz Intel Core processor with 4.00 GB RAM running a 64-bit Windows 7 Home Premium operating system, with SQL Developer 4.1.0 and Java SE Development Kit (JDK) 8. Oracle SQL Developer 4.1.0 is a free integrated development environment (IDE) for Oracle databases; it provides a GUI that helps DBAs and database users save time when performing database tasks. SQL Developer supports Oracle Database 10g, 11g, and 12c and requires a Java-supported platform (OS) to run.
Figure 2 shows the Web log file imported into Microsoft Excel. Each field is separated using the space character as a delimiter and, as explained above, the different attributes are stored under separate headings. The initial Web log file contained 6000 entries.
According to the analysis, only six of the nine attributes are significant: IP address, date and time, request, status code, referrer, and user agent. These six attributes help in identifying users' navigational behaviour and are of utmost importance for Web usage mining, whose outcomes are used to enhance Web-based applications. After performing the data selection step, only 3674 rows were left.
The first step in data cleaning removes duplicate records from the log file: 11 duplicate values were identified, leaving 3663 unique records. In the second step, tuples with missing values were deleted, as they provide no useful information for Web mining; 23 cells containing missing values were found, and 3640 tuples remained on completion of this step.
In addition, noise such as .gif requests, robots, and Web spiders was found in the log file, an example being 'GET/icons/[Link].'
In Fig. 3, the Microsoft Excel file is imported into SQL Developer for further processing, and an SQL query is executed to remove noise such as .gif, .rm, .mp4, robot, and spider entries from the Web log file. The result contains only 2458 rows.
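The same cleaning pipeline can be sketched with pandas, as shown below. This is an illustration only: the column names, types, and noise pattern are assumptions chosen to mirror the steps described above, not the SQL statements actually used.

```python
import pandas as pd

# Hypothetical column names mirroring the significant attributes kept above.
COLUMNS = ["ip", "datetime", "request", "status", "referrer", "user_agent"]
NOISE_PATTERN = r"\.gif|\.rm|\.mp4|robot|spider"

def clean_log(df: pd.DataFrame) -> pd.DataFrame:
    """Data cleaning sketch: status filtering, duplicate removal, missing values, noise removal."""
    df = df[(df["status"] >= 200) & (df["status"] <= 299)]   # keep successful requests only
    df = df.drop_duplicates()                                # remove duplicate records
    df = df.dropna(subset=COLUMNS)                           # drop tuples with missing cells
    noisy = (df["request"].str.contains(NOISE_PATTERN, case=False, regex=True)
             | df["user_agent"].str.contains("robot|spider", case=False, regex=True))
    return df[~noisy]                                        # discard noisy entries
```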
Finally, the pre-processed Web log file is obtained; this processed file will further be used in the analysis and interpretation of frequent patterns in the log file. Table 1 clearly indicates that applying pre-processing to the Web log file helps to remove irrelevant data and improves the quality of the data.
5 Conclusion
In the present work, a raw Web log file was taken for empirical analysis, and it is concluded that data pre-processing is an essential step for removing the anomalies that lie in the data. The proposed data pre-processing methodology improved the quality of the data: after applying it, only six of the nine available attributes proved significant, and of the 6000 original instances only 2458 are meaningful. The resulting itemset will be useful for the discovery of frequent patterns.
References
1. Jayalatchumy, D., Thambidurai, P.: Web mining research issue and future directions—a
survey. IOSR J. Comput. Eng. (IOSR-JCE) 14, 20–27 (2013)
2. Rathod, D.B.: Customizable web log mining from web server log. Int. J. Eng. Dev. Res.
(IJEDR), 96–100 (2011)
3. Priyanka, P., Ujwala, P.: Preprocessing of web server log file for web mining. World J. Sci.
Technol. 2, 14–18 (2012)
4. Ramya, C., Shreedhara, K.S., Kavitha, G.: Preprocessing: a prerequisite for discovering
patterns in web usage mining process. Int. J. Inf. Electron. Eng. 3, 196–199 (2013)
Abstract This paper presents a modified technique to generate the probability table (routing table) used for path selection by an ant in the AntNet algorithm. The paper also uses the concept of a probe ant along with clone ants. The probe ant identifies multiple paths, which can be stored at the destination. The overall purpose of this paper is to give an insight into ant-based algorithms and to identify multiple optimal paths.
Keywords AntNet · ACO · ABC · Probe ant · Clone ant · Link goodness
1 Introduction
To obtain such optimal solutions, a number of studies are currently investigating routing alternatives based on heuristic techniques. One such technique is Ant Colony Optimization (ACO). In ant-based routing, artificial ants moving from source to destination deposit artificial pheromone in the form of a numeric value and store information at the intermediate nodes in the form of routing tables. This information is used by newly generated ants to accept or reject a path. Another important aspect of routing is the increasing need for multi-path routing, driven mainly by data transmission over the Internet: video and audio streaming require very high bandwidth, which may not be achievable with single-path routing, hence the need for multi-path routing. When multi-path routing is combined with multi-metric routing, it becomes a perfect candidate for heuristic techniques, and ACO can therefore prove to be a very effective technique for this type of routing.
2 Related Work
According to [1], an ant not only finds the shortest path to its food source but also conveys this shortest path to other ants. In ACO, this intelligence of ants is combined with various optimization techniques to identify optimal routes in a computer network. That survey explained various ACO-based mechanisms for identifying optimal paths and compared them with traditional routing algorithms. It reviewed the application of ACO to routing in four groups: (i) Ant-Based Control (ABC) systems, (ii) the AntNet system, (iii) ant systems with other variations, and (iv) multiple ant colony systems.
Schoonderwoerd et al. [2, 3] applied ACO to routing in circuit-switched telecommunication networks; the algorithm was termed Ant-Based Control (ABC). In ABC, every node in the network maintains various features [2] such as capacity, probability of being a destination, a pheromone table, and a routing table, on the basis of which the criterion for choosing the next node is decided. The main limitation of the Schoonderwoerd et al. approach is that it can only be applied when the network is symmetric in nature.
For routing in packet-switched networks, Di Caro and Dorigo [4–6] designed AntNet routing. Although inspired by the ACO meta-heuristic, it incorporates additional changes required for such networks.
In [7], a new version of AntNet, named AntNet-FA or AntNet-CO, was introduced. In this version, backward ants perform a number of tasks: (a) estimating the trip time using various metrics, (b) updating local traffic statistics, and (c) determining and depositing the pheromone that estimates the probability of reinforcement. Because backward ants use real-time statistics to determine the amount of reinforcement, the routing information is more correct and up-to-date. Experiments reported in that paper show that this version performs better than the original AntNet algorithm.
Oida and Sekido [8] proposed the Agent-based Routing System (ARS), in which, to support various types of bandwidth requirements, the forward ants move through the network under bandwidth constraints. The probability of selecting an outgoing link depends on the routing table as well as on the bandwidth constraints.
Although adaptation has proved to be one of the better techniques for identifying optimal paths, one of the major problems associated with AntNet is stagnation [9]: a local optimum may be reached and the diversity of the population may be lost. In [9], the concept of multiple ant colonies was applied to packet-switched networks. Compared with the AntNet algorithm with evaporation, multiple ant colonies increased throughput, although no improvement in delay was found; the basic drawback was the large amount of resources needed for multiple colonies.
In the AntNet algorithm [1], an ant explores a path and updates the routing and probability tables so that other ants can use these tables to know which paths are better than others. A statistical traffic model is also used to help the ants identify the better path.
Each node maintains a routing table, which is a local database containing information about all possible destinations along with the probabilities of reaching them via each of the node's neighbours. Another data structure carried by each node is the local traffic statistics, which follows the traffic fluctuations as seen by that node.
The AntNet algorithm proposed by Di Caro and Dorigo depends on two types of ants, the forward ant and the backward ant. The forward ant collects information about the network, while the backward ant uses this information to update the routing tables along its path.
The working of the algorithm is as follows:
(i) To build the routing tables, a forward ant is launched from every node towards a destination node at fixed time intervals to find a low-cost path to that node; the load status of the network is also explored, and the priorities of the paths are set accordingly. These priorities are used by the forward ants to transmit the data.
(ii) The forward ants store the information about their paths in a stack.
(iii) At each node, a decision is made about which node to select for moving towards the destination, using probabilities based on the pheromone values. Only unvisited nodes are considered for selection, or all neighbours if every one of them has already been visited.
(iv) While the forward ant moves towards the destination, if a cycle is detected at any time, all the nodes on that cycle's path are popped from the stack and all information about them is deleted.
(v) When the forward ant reaches its destination, it generates another ant, named the backward ant, transfers all of its memory to it, and dies.
(vi) The backward ant, as its name indicates, travels in the direction opposite to that of the forward ant. It uses the stack formed by the forward ant, popping its elements to reach the source node, and travels in high-priority queues so that the information is transmitted quickly to the source node. The information collected by the forward ant is stored in the routing tables by the backward ant.
(vii) The backward ant updates two data structures, the routing table and the local traffic model, for every node on its path, for all entries starting from the destination node.
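The following Python sketch illustrates the forward/backward ant interplay of steps (i)–(vii). It is a simplified illustration only: it omits the trip-time estimation and the local traffic model, the class and function names are our own, and it assumes graph is a dictionary mapping node names to Node objects over a connected network.

```python
import random

class Node:
    def __init__(self, name, neighbours):
        self.name = name
        self.neighbours = list(neighbours)
        self.routing = {}  # destination -> {neighbour: selection probability}

    def probabilities(self, dest):
        # Initialise a uniform routing-table column the first time a destination is seen.
        return self.routing.setdefault(
            dest, {n: 1.0 / len(self.neighbours) for n in self.neighbours})

def forward_ant(graph, source, dest):
    """Walk from source to dest, preferring unvisited neighbours and keeping the path on a stack."""
    stack, current = [source], source
    while current != dest:
        probs = graph[current].probabilities(dest)
        candidates = [n for n in graph[current].neighbours if n not in stack] \
                     or graph[current].neighbours
        nxt = random.choices(candidates, weights=[probs[n] for n in candidates])[0]
        if nxt in stack:
            del stack[stack.index(nxt) + 1:]   # cycle detected: pop the nodes of the cycle
        else:
            stack.append(nxt)
        current = nxt
    return stack

def backward_ant(graph, path, dest, reinforcement=0.1):
    """Retrace the path and reinforce, at every hop, the neighbour actually used towards dest."""
    for hop, nxt in zip(path, path[1:]):
        probs = graph[hop].probabilities(dest)
        probs[nxt] += reinforcement
        total = sum(probs.values())
        for n in probs:                        # renormalise so the column still sums to 1
            probs[n] /= total
```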
In this work, a new type of forward ant, named the probe ant [10], together with its clones is proposed. Probe ants replace the forward ants generated at the source node to explore the paths. The probe and clone probe ants are generated depending on the paths selected at a particular node, and only a little additional overhead is required for generating the clones. The multiple paths are selected according to the probabilities in the pheromone table and a threshold value of bandwidth. These probe and clone ants [11] reach the destination according to the proposed strategy. The advantage is that, instead of one optimal path, more than one optimal path is identified, and the paths are identified with the help of a single ant, the other ants being its clones. The major advantage of this technique is saving the overhead incurred by the large number of forward ants in the AntNet algorithm.
Structure of the probe ant:

S | D | I | Aid | PT | MB | TD | HC

where
S    Source node
D    Destination node
I    Intermediate node from which the ant has arrived most recently
Aid  Ant identification number
PT   Path traversed so far
MB   Minimum bandwidth of the path traversed
TD   Total delay of the path
HC   Hop count
In this work, a threshold bandwidth is considered: any link whose bandwidth is below this threshold is discarded, and the remaining links are considered for identifying the multiple paths. The process works in a manner similar to the AntNet algorithm, except that AntNet takes queue length into account when selecting a path, whereas the proposed work considers two metrics, delay and bandwidth. Another important difference is that, instead of a single path, multiple (quite similar) paths are selected for data transmission. The paths identified at the destination are stored in an array, and some optimal paths are then chosen from this array using an intelligent approach. For selecting paths on the basis of the two metrics, a new metric named link goodness, denoted $G_L$, is proposed and calculated as

$$G_L = \frac{\alpha}{D_L + 1} + (1 - \alpha)\, B_L$$

where $D_L$ is the delay of a link, $B_L$ is the bandwidth of a link, and $\alpha$ is a constant with $0 < \alpha < 1$.
At each node, two tables, a pheromone table and a routing table, are maintained. The pheromone table is initialized with equal pheromone values distributed over each link from the node, except the discarded links. The routing table consists of the probabilities of selecting the next node on the path. The structure of both tables is shown below:
Pheromone table (rows are the neighbouring nodes, columns the destination nodes):

Neighbouring nodes   Destination 1   Destination 2   …   Destination n
1                    s11             s12             …   s1n
2                    s21             s22             …   s2n
…                    …               …               …   …
l                    sl1             sl2             …   sln
In the above table, $s_{nd}$ is the pheromone value of the link from node n when the destination node is d. Each $s_{nd}$ lies in the interval [0, 1], and the values sum to 1 along each destination column (as the probabilities cannot exceed 1):

$$\sum_{n \in N_k} s_{nd} = 1, \qquad d \in [1, N], \quad N_k = \{\mathrm{neighbours}(k)\}$$
Routing table (rows are the neighbour nodes, columns the destination nodes; discarded entries are shown as –):

Neighbour nodes   Destination 1   Destination 2   …   Destination n
1                 p11             –               …   p1n
2                 p21             p22             …   –
…                 …               …               …   …
l                 pl1             –               …   pln
In the above table, a blank cell indicates that the corresponding neighbour node is discarded for that particular destination. $P_{sd}$ is the probability of creating a probe ant at node s with node d as the destination.
At each node i, a probe ant computes the probability $P_{ijd}$ of selecting neighbour node j for moving towards its destination d by adding $s_{ijd} \cdot G_{ij}$ to the pheromone value $s_{ijd}$ and then subtracting this amount, divided equally, from each of the remaining valid neighbours. This process is followed for each of the neighbours in parallel.
Using these probabilities, multiple paths from source to destination can be
identified, and the information about paths will be stored in probe ants. At desti-
nation, paths can be stored in an array of paths. From this array, some of the better
paths on the basis of various metrics can be found using some intelligent approach.
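A minimal sketch of the link-goodness metric and of this probability update is given below. The data structures (per-destination dictionaries of pheromone and goodness values keyed by neighbour) and the function names are assumptions made for illustration; they are not the authors' implementation.

```python
def link_goodness(delay, bandwidth, alpha=0.5):
    """Link goodness G_L = alpha / (D_L + 1) + (1 - alpha) * B_L."""
    return alpha / (delay + 1.0) + (1.0 - alpha) * bandwidth

def update_probabilities(pheromone, goodness, chosen, valid_neighbours):
    """Raise the chosen neighbour's entry by s * G and subtract that amount,
    divided equally, from the remaining valid (non-discarded) neighbours."""
    delta = pheromone[chosen] * goodness[chosen]
    probs = dict(pheromone)
    probs[chosen] += delta
    others = [n for n in valid_neighbours if n != chosen]
    for n in others:
        probs[n] -= delta / len(others)   # clipping at 0 may be needed in practice
    return probs
```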
5 Conclusion
In this paper, a modified approach for generating the probabilities of the routing table has been proposed, in which the probabilities are calculated on the basis of two metrics instead of one. Another important feature used in this paper is a threshold value of bandwidth. Only one ant is generated at the source, while the other ants are clone ants carrying little additional overhead. With the help of this technique, multiple paths can be generated and stored at the destination via probe ants. These paths can be further refined using an intelligent approach, yielding multiple paths for better transmission of data packets.
References
1. Sim, K.M., Sun, W.H.: Ant colony optimization for routing and load-balancing: survey and
new directions. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 33(5) (2003)
2. Schoonderwoerd, R., Holland, O., Bruten, J., Rothkrantz, L.: Ants for Load Balancing in
Telecommunication Networks. Hewlett Packard Lab., Bristol, U.K., Tech., pp. 96–35 (1996)
3. Schoonderwoerd, R., Holland, O., Bruten, J., Rothkrantz, L.: Ant-based load balancing in
telecommunications networks. Adapt. Behav. 5(2) (1996)
4. Caro, G.D., Dorigo, M.: AntNet: distributed stigmergetic control for communications
networks. J. Artif. Intell. Res. 9, 317–365 (1998)
5. Caro, G.D., Dorigo, M.: Ant colonies for adaptive routing in packet-switched communica-
tions networks. In: Proceedings of 5th International Conference on Parallel Problem Solving
from Nature, Amsterdam, The Netherlands, 27–30 Sept 1998
6. Caro, G.D., Dorigo, M.: An adaptive multi-agent routing algorithm inspired by ants behavior.
In: Proceedings of 5th Annual Australasian Conference on Parallel Real-Time Systems,
pp. 261–272 (1998)
7. Caro, G.D., Dorigo, M.: Two ant colony algorithms for best-effort routing in datagram
networks. In: Proceedings of 10th IASTED International Conference on Parallel Distributed
Computing Systems, pp. 541–546 (1998)
8. Oida, K., Sekido, M.: An agent-based routing system for QoS guarantees. In: IEEE
International Conference on Systems, Man, and Cybernetics, pp. 833–838 (1999)
9. Tekiner, F., Ghassemlooy, Z., Al-khayatt, S.: Investigation of Antnet Routing Algorithm by
Employing Multiple Ant Colonies for Packet Switched Networks to Overcome the Stagnation
Problem
10. Devi, G., Upadhyaya S.: Path identification in multipath routing. JGRCS J. Glob. Res.
Comput. Sci. 2(9) (2011)
11. Richa, S., Upadhyaya, S.: Identifying multiple optimal paths in Antnet routing algorithm with
negligible overhead. Int. J. Comput. Sci. Netw. Secur. 9(2), 314–320 (2009)
An Efficient Focused Web Crawling
Approach
Kompal Aggarwal
Abstract The amount of data on the World Wide Web (WWW) and its dynamicity make it very difficult to crawl the Web completely. It is a challenge for researchers to crawl only the relevant pages from this huge Web. A focused crawler resolves this issue of relevancy to a certain level by focusing on Web pages for a given topic or set of topics. This paper surveys various focused crawling techniques, based on different parameters, to identify their advantages and drawbacks for relevance prediction of URLs. It then formulates the problem after analysing the existing work on focused crawlers and proposes a solution to improve the existing focused crawler.
1 Introduction
The World Wide Web (WWW) contains a large amount of information, and new information is added every second, so that the Web is of the order of tens of billions of pages. The Web crawler is the most important component of a search engine: it continuously downloads pages, which are then indexed and stored in a database. It is therefore very difficult for a crawler to crawl the entire Web and keep its index fresh. Because of the limitations of computing resources and time constraints, focused crawlers were developed. A focused crawler is a Web crawler that downloads only those Web pages which are considered relevant to a specific topic or set of topics.
A focused crawler can be implemented in various ways [1]; some of the approaches are shown below.
K. Aggarwal (&)
Department of Computer Science, Government College, Chhachhrauli, India
e-mail: kompalagg@[Link]
[Figure: generic architecture of a focused crawler, comprising a URL queue, a Web page downloader, a parser and extractor, a topic filter, and a table of irrelevant pages.]
The Web page downloader fetches each URL from the Internet [4]. The parser module retrieves information such as the text, the terms, and the outlink URLs from the page being downloaded. The relevance calculator module computes the relevance of the parsed page with respect to the topic, and a relevance score is assigned to the URLs. The topic filter module determines whether the parsed page content is related to the specified topic: if the parsed page is found to be relevant, the URLs retrieved from it are added to the queue used for storing URLs; otherwise, the page is discarded and added to the irrelevant table.
2 Related Work
De Bra et al. [5] proposed "fish search" for predicting page relevance. The system takes as input a set of seed pages and a search query, and it considers only those Web pages whose content matches the query, together with their neighbourhoods (pages pointed to by the matched pages). The fish-search algorithm treats the Web as a directed graph in which a Web page is a node and a hyperlink an edge, so the search resembles a search in a directed graph. Each node is judged to be either relevant or irrelevant (1 represents relevant, 0 irrelevant). The algorithm maintains a list storing the URLs of the pages to be searched. If the fish finds a page relevant to the query, it continues to look for more links from that page; if a page is not relevant, its children receive a low score. URLs have different priorities: higher-priority URLs are stored at the front of the list and are searched earlier than the others. Relevance is assigned in a discrete manner (1 stands for relevant; 0 or 0.5 stands for irrelevant). The drawback of the fish-search algorithm is that it differentiates only weakly among the priorities of the pages in the list.
Michael et al. [6] proposed shark search, a modification of fish search in which a child derives a discounted score from its parent. The inherited discounted score is combined with a value based on the anchor text that surrounds the link in the Web page. One improvement is that the binary score is replaced by a fuzzy score: whereas the binary relevance of a document is either relevant or irrelevant, the fuzzy score lies between 0 (no similarity) and 1 (a perfect "conceptual" match). Shark search uses a vector space model [7] for calculating the relevance score, in which the similarity between a search string and a document is a function of the distance between their vectors; the cosine similarity measure is commonly used for this purpose. However, the algorithm neglects link-structure information.
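As an illustration of this vector-space scoring, the sketch below computes the cosine similarity between a query and a document using raw term-frequency vectors. It is a simplification for illustration only; shark search and real crawlers typically apply additional term weighting.

```python
import math
from collections import Counter

def cosine_similarity(query, document):
    """Cosine of the angle between term-frequency vectors: 0 (no similarity) to 1 (perfect match)."""
    q, d = Counter(query.lower().split()), Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("focused web crawler", "a focused crawler downloads relevant web pages"))
```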
Batra [1] proposes a method based on link scoring in which the crawler also checks and crawls, up to a certain threshold, the unvisited links contained in irrelevant pages. Several link-analysis-based approaches have been suggested by researchers; for example, an effective focused crawling approach based on both content and link structure has been proposed. This crawling
method is based on URL scores, link scores, and anchor scores. The crawler calculates the link scores from the following quantities and then decides whether to fetch the pages pointed to by the links [1, 8]: LinkScore(i) is the score of link i; URLScore(i) is the relevance between the HREF information of i and the topic keywords; AnchorScore(i) is the relevance between the anchor text of i and the topic keywords; LinksFromRelevantPageDB(i) is the number of links from relevant crawled pages to i; and Pn is the ith parent page of URL n, a parent page being a page from which a link was extracted. The crawler thus spends some of its time crawling relevant pages, including relevant pages that are children of an irrelevant page.
In a crawling process, the effectiveness of the focused crawler [8] is determined not only by the maximum number of relevant pages fetched but also by the speed of the crawling process, which depends on the number of inserted URLs that are considered relevant. A URL optimization mechanism is therefore used by the crawler to select relevant links; it removes links whose link scores lie below a predefined threshold. For URL optimization, two methods are used to evaluate the link scores of relevant pages:
• the same equation as is used above for irrelevant pages;
• a Naive Bayes (NB) classifier used to compute the link scores [2, 8].
Diligenti et al. [9] propose a method based on link analysis of the forward and backward links of a page. A priority value is assigned to each Web page based on Pvalue, the difference between the forward link count (FwdLnk) and the backward link count (BkLnk); the page with the highest Pvalue receives the highest priority. The URLs are classified into these categories using the FwdLnk and BkLnk parameters. FwdLnk can be computed by the server by returning the number of links in a page after parsing it, without downloading it; BkLnk can be computed by the server by counting how many pages refer to the page, using the existing database built by the various crawler machines. The crawler then sorts the list of URLs received from the repository in descending order of Pvalue, and this sorted list is used to maintain the quality of crawling. The method gives weight to the current database and helps build a database of higher-quality indexed Web pages even when the focus is on crawling the entire Web; it can also be used for broken links and for the growing and matured stages of databases. A URL with a zero FwdLnk and a nonzero BkLnk is assigned a low priority, as its Pvalue will always be negative [9].
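A minimal sketch of this prioritisation step is given below; the tuple format of url_stats and the function name are assumptions made for illustration.

```python
def prioritise_urls(url_stats):
    """Sort URLs by Pvalue = FwdLnk - BkLnk in descending order.

    url_stats: iterable of (url, fwd_link_count, bk_link_count) tuples (assumed format).
    """
    scored = [(url, fwd - bk) for url, fwd, bk in url_stats]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Example: pages with many outlinks and few backlinks are crawled first.
print(prioritise_urls([("a.html", 12, 3), ("b.html", 0, 5), ("c.html", 7, 1)]))
```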
Kaur and Gupta [10] proposed the weighted page rank method, in which the weight of a Web page is calculated from its numbers of incoming and outgoing links, and the importance of the page depends on this weight. The relevancy obtained with this technique is lower, as the page rank is calculated from the weight of the Web page computed at indexing time [2]. The algorithm is an updated version of PageRank in that important pages are assigned more value instead of the page rank value being divided evenly among all outgoing links of the page [11]. The weighted page rank is given by

$$PR(u) = (1 - d)/|D| + d\big(PR(V_1)\,W^{in}(V_1, u)\,W^{out}(V_1, u) + \cdots + PR(V_n)\,W^{in}(V_n, u)\,W^{out}(V_n, u)\big)$$

$$W_{i+1} = W_i / W_{max}$$

where $W_{max}$ is the maximum weight that can be assigned to any keyword and $W_{i+1}$ is the new weight assigned to every keyword. The page relevancy is calculated with the help of a topic-specific weight table and the cosine similarity measure.
Kleinberg [12] proposed Hyperlink-Induced Topic Search (HITS), which calculates hub and authority values of the relevant pages; the result is pages that are both relevant and important. HITS is a link analysis algorithm for rating Web pages: a page that points to many other pages is a good hub, and a page that is linked to by many different hubs is a good authority. The seed URL set is selected on the basis of the authority of Web pages, and the rank is calculated by computing hub and authority scores; the authority score reflects the value of the page content, while the hub score reflects the value of the page's links to other pages.
When comparing two pages with approximately the same number of citations, if one of them has received citations from P1 and P2, which are counted as important or prestigious pages, then that page's relevance is higher. In other words, a citation from an important page is better than one from an unimportant page [13, 14].
Fujimura et al. [15] proposed the EigenRumor algorithm to address the growing number of blogs on the Web, since displaying quality blogs to users is a challenge for service providers [13].
• The PageRank algorithm assigns very low scores to blog entries, so rank scores cannot be given to blog entries according to their importance.
• A rank score can be provided to every blog by weighting the hub scores, and the authority of the bloggers depends on an eigenvector calculation.
• The algorithm is mostly used for ranking blogs rather than Web pages.
3 Proposed Method
The proposed focused crawler starts with a seed URL. It does not download the page; instead, it parses the page to extract the URLs and the words of interest in that page [16].
To determine the importance of the page being parsed, the weight of the keywords/search string over the whole Web page is calculated against every keyword in the topic-specific weight table. Occurrences of the same words at various locations of a page have different importance and represent different information [17], so the relevance of the parsed page can be decided by considering each of its components; for example, the title text is more informative in expressing the topic covered by a page than the common text.
The weight of the outlinks/hyperlinks is computed by the same procedure, i.e. in order to crawl a relevant page that is the child of an irrelevant page [18, 19]. Only if the level of relevance crosses a predefined threshold is the page downloaded and extracted and the repository updated with it; otherwise the page is discarded. In this way, bandwidth is saved by discarding irrelevant pages, and the network load is reduced.
To obtain the overall weight w_kp of keyword k in page p, the weights of the keyword in the different locations of page p are added as shown below:

w_kp = w_kurl + w_kt + w_kb

where
w_kp    overall weight of keyword k in page p
w_kurl  weight assigned to keyword k based on the page URL
w_kt    weight assigned to keyword k based on the page title
w_kb    weight assigned to keyword k based on the page body
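The sketch below shows one possible reading of this weighting scheme. It is an illustration only: the per-component multipliers, the function name, and the simple substring counting are assumptions, since the paper does not fix them.

```python
def page_weight(keywords, topic_weights, url, title, body):
    """Sum keyword weights over the page components: w_kp = w_kurl + w_kt + w_kb."""
    component_factor = {"url": 2.0, "title": 3.0, "body": 1.0}   # assumed multipliers
    texts = {"url": url.lower(), "title": title.lower(), "body": body.lower()}
    total = 0.0
    for k in keywords:
        for part, text in texts.items():
            total += topic_weights.get(k, 0.0) * component_factor[part] * text.count(k.lower())
    return total
```

A page would then be downloaded only if this weight crosses the predefined relevance threshold, as described above.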
4 Conclusion
Hence, by using the concept of page weight, Web pages are completely scanned and their page weight is computed. In this way, the efficiency of the Web crawler can be improved, as the URL list produced by this method is of greater importance than that of the traditional Web crawling method.
References
Keywords Fuzzy c-means · Kernel functions · Possibilistic c-means · Conditionally positive-definite function
1 Introduction
Clustering [1, 2] is commonly used as an analytical technique in the fields of pattern recognition, system modelling, image segmentation and analysis, communication, data mining and so on. It sorts a group of N observations or
M. Tushir (&)
Department of Electrical and Electronics Engineering, Maharaja Surajmal
Institute of Technology, C-4, Janakpuri, New Delhi, India
e-mail: meenatushir@[Link]
J. Nigam
Department of Information Technology, Maharaja Surajmal Institute
of Technology, C-4, Janakpuri, New Delhi, India
e-mail: jyotsnaroy81@[Link]
data points into C groups or data clusters (number of clusters), and the objects in
each group are more identical to one another than those in other groups. Data
clusters on which clustering is to be performed can be well separated, continuous or
overlapping. On the basis of cluster type, there are predominantly two types of
clustering: hard and fuzzy. Traditional (hard) clustering has a constraint to it that
each point of the dataset is restricted to only one cluster. The conception of fuzzy
clusters was proposed by Zadeh [3]. It included the notion of partial membership
where a data point in a dataset can simultaneously belong to more than one cluster.
It gave rise to the thought of probabilistic theory in clustering. Clusters, which are
well separated or continuous, give suitable results when processed with hard
clustering techniques but sometimes objects in a cluster overlap each other and
belong simultaneously to several clusters. Compared with hard clustering, fuzzy clustering has superior clustering performance and capabilities. The membership
degree of a vector xk to the ith cluster (uik) is a value in the interval [0, 1] in fuzzy
clustering. This idea was first introduced by Ruspini [4] and used by Dunn [5] to
construct a fuzzy clustering method based on the criterion function minimization.
This approach was generalized by Bezdek to fuzzy c-means algorithm by using a
weighted exponent on the fuzzy memberships. Although FCM is a very competent
clustering method, its membership does not always correspond well to the degrees
of belonging of the data, and it carries certain constraints. The results of FCM may
tend to be inaccurate in a noisy environment. Many spurious minima are generated
due to the presence of noise in the datasets, which further aggravate the situation.
FCM does not provide any method to handle this complication. To reduce this
weakness to some extent, Krishnapuram and Keller [6] proposed possibilistic c-
means algorithm. PCM uses a possibilistic approach, which uses a possibilistic
membership function to describe the degree of belonging. It determines a possi-
bilistic partition matrix, in which a possibilistic membership function measures their
absolute degree of typicality of a point in a cluster. In contrast to FCM, possibilistic
approach appeared to be more resilient to noise and outliers. PCM has a necessity to
predefine the number of clusters and is sensitive to initializations, which sometimes
generates coincident clusters. The condition to define the optimal c is termed as
cluster validity. To overcome the limitations imposed by FCM and PCM, many
new algorithms have been proposed which comprise of the ability to generate both
membership and typicality for unlabelled data [7, 8]. However, it is inferred from
the observations that these algorithms tend to give not so good results for
unequal-sized clusters. In 2006, a novel fuzzy clustering algorithm called unsu-
pervised possibilistic clustering algorithm (UPC) was put forward by Yang and Wu
[9]. Unsupervised clustering can determine the number of clusters with the help of
their proposed validity indexes. The objective function of UPC integrates the FCM
objective function with two cluster validity indexes. The parameters used in UPC
are easy to manoeuvre. Although UPC has many merits, like most enhanced versions it works only for convex cluster structures and fails to give desirable results on non-convex datasets. The kernel-based
unsupervised algorithm proposed in this paper is an extension to the UPC. The data
points in the proposed method (UKPC-L) are mapped to a higher-dimensional space using a log kernel function.
2 Literature Work
Most methodical clustering algorithms for cluster analysis are based on the optimization of the basic c-means objective function. In dataset clustering, the similarity norm used to group objects that are identical in nature is defined by a distance norm. A large family of clustering algorithms and techniques complies with the fuzzy c-means functional formulated by Bezdek and Dunn:
$$J(U, V) = \sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^{m}\,\|x_k - v_i\|^{2} \qquad (1)$$
It is applicable to a wide variety of data analysis problems and works well with them. The FCM algorithm assigns membership values to the data points that are inversely related to the relative distance of a point to the cluster centres in the FCM model. Each data point is denoted by x_k and each cluster centre by v_i; the closeness of a data point to a cluster centre is expressed by the membership value u_ik. The weighting exponent m determines the fuzziness of the resulting clusters and ranges over (1, ∞).
The cluster centres and membership values are computed as
$$v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m}\, x_k}{\sum_{k=1}^{n} u_{ik}^{m}} \qquad (2)$$

$$u_{ik} = \frac{1}{\sum_{j=1}^{c}\left(\dfrac{\|x_k - v_i\|}{\|x_k - v_j\|}\right)^{2/(m-1)}} \qquad (3)$$
For a given dataset X = {x_1, …, x_n}, the fuzzy c-means algorithm partitions X into C fuzzy subsets by minimizing the objective function given above, subject to the condition:
$$\sum_{i=1}^{c} u_{ik} = 1$$
There are some limitations to the fuzzy c-means clustering method. First, since it is based on a fixed distance norm, the objective function is forced to prefer clusters of a certain shape even if they are not present in the data. Second, it is sensitive to noise, and the number of clusters has to be predefined.
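For concreteness, the following minimal NumPy sketch (our own illustration, not taken from the paper) alternates the centre update of Eq. (2) and the membership update of Eq. (3) until convergence; the function and parameter names are assumptions.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    """Plain fuzzy c-means: alternate Eq. (2) and Eq. (3) until the memberships stabilise."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                    # memberships of each point sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)      # Eq. (2): cluster centres
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12   # ||x_k - v_i||
        p = 2.0 / (m - 1.0)
        U_new = 1.0 / (d ** p * (d ** -p).sum(axis=0))    # Eq. (3): memberships
        if np.abs(U_new - U).max() < eps:
            return U_new, V
        U = U_new
    return U, V

# Example: U, V = fcm(np.random.rand(200, 2), c=3)
```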
In the PCM objective function, t_ik is the typicality value of data point x_k for the cluster with centre v_i, d_ki is the distance between x_k and v_i, and γ_i denotes a user-defined constant greater than zero, with i lying in the range 1 to c. PCM uses approximate optimization for the additional conditions. The typicality of data point x_k and the centre of cluster v_i can be obtained from the following equations:
$$t_{ik} = \frac{1}{1 + \left(\dfrac{d_{ki}^{2}}{\gamma_i}\right)^{1/(m-1)}}, \quad \forall i \qquad (5)$$

$$v_i = \frac{\sum_{k=1}^{n} t_{ik}^{m}\, x_k}{\sum_{k=1}^{n} t_{ik}^{m}} \qquad (6)$$
PCM does not impose the probabilistic constraint that the membership values of a data point across classes sum to one. It overcomes sensitivity to noise and the need to specify the number of clusters. However, PCM still has some disadvantages: it depends highly on a good initialization, and it tends to produce coincident clusters, which are undesirable. KPCM uses KFCM to initialize the memberships and thus avoids the above-mentioned weakness of PCM.
In the UPC objective function, m is the fuzzy factor and c is the number of clusters. The first term is identical to the FCM objective function; the second term is constructed by analogy with the PE and PC validity indexes.
Parameter β is the sample covariance, defined as:

$$\beta = \frac{\sum_{j=1}^{n} \|x_j - \bar{x}\|^{2}}{n}, \qquad \text{where } \bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} \qquad (8)$$

Minimizing the objective function with respect to u_ij and setting the derivative to zero gives the equation for the membership value:

$$u_{ij} = \exp\!\left(-\frac{m\sqrt{c}\,\|x_j - v_i\|^{2}}{\beta}\right), \quad i = 1, \ldots, c; \; j = 1, \ldots, n \qquad (9)$$
3 Kernel-Based Algorithm
A kernel function is a generalization of the distance measure: it measures the distance between two data points as if they were mapped into a high-dimensional space in which they are more clearly separable [10, 11]. The accuracy of the algorithm can be increased by exploiting a kernel function when calculating the distance of data points from the prototypes.
Kernel trick:
The kernel trick is a very interesting and powerful tool. It provides a bridge from linearity to nonlinearity for any algorithm that can be expressed solely in terms of dot products between two vectors. The trick is interesting because the mapping never needs to be computed explicitly: wherever a dot product is used, it is replaced with a kernel function. The kernel function denotes an inner product in the feature space induced by a mapping

$$\Phi : \mathbb{R}^{p} \to H, \qquad x \mapsto \Phi(x)$$
To further improve UPC, we use a log kernel function, which is a conditionally positive-definite function, i.e. it is positive for all values of a greater than 0. We name the proposed algorithm UKPC-L. A kernel K is conditionally positive definite if
$$\sum_{j,k=1}^{n} c_j c_k K(x_j, x_k) \geq 0 \qquad (12)$$

for $n \geq 1$ and $c_1, \ldots, c_n \in \mathbb{R}$ with $\sum_{j=1}^{n} c_j = 0$ and $x_1, \ldots, x_n \in X$.
The log kernel function that we propose for our algorithm is:

$$K(x_j, v_i) = \log\!\left(1 + a\,\|x_j - v_i\|^{2}\right) \qquad (13)$$
Using our log kernel function, the objective function and the distance for UKPC-L clustering are as follows:

$$J_{UKPC\text{-}L} = \sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\, D_{ij}^{2} + \frac{\beta}{m^{2}\sqrt{c}} \sum_{i=1}^{c}\sum_{j=1}^{n} \left(u_{ij}^{m}\log u_{ij}^{m} - u_{ij}^{m}\right) \qquad (14)$$

where

$$D_{ij}^{2} = \|\Phi(x_j) - \Phi(v_i)\|^{2} = 2\log\!\left(1 + a\,\|x_j - v_i\|^{2}\right) \qquad (15)$$

$$\beta = \frac{2\sum_{j=1}^{n}\log\!\left(1 + a\,\|x_j - v_i\|^{2}\right)}{n} \qquad (16)$$

Minimizing the objective function with respect to $v_i$ gives the equation for the cluster centres:

$$v_i = \frac{\sum_{j=1}^{n} \dfrac{u_{ij}^{m}\, x_j}{1 + a\,\|x_j - v_i\|^{2}}}{\sum_{j=1}^{n} \dfrac{u_{ij}^{m}}{1 + a\,\|x_j - v_i\|^{2}}} \qquad (17)$$

Minimizing the objective function with respect to $u_{ij}$ and setting the derivative to zero gives the equation for the membership value:

$$u_{ij} = \exp\!\left(-\frac{2m\sqrt{c}}{\beta}\,\log\!\left(1 + a\,\|x_j - v_i\|^{2}\right)\right) \qquad (18)$$
* Fix the number of clusters C; fix m > 1; set the learning rate a;
* Execute the FCM clustering algorithm to find the initial U and V;
* Set the iteration count k = 1;
Repeat
  Calculate the objective function J(U, V) using (14).
  Compute β using (16).
  Update v_i using (17).
  Update u_ij using (18).
Until a given stopping criterion (convergence precision ε) is satisfied.
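As a complement to the listing above, the following Python sketch shows one way the UKPC-L updates could be coded from the reconstructed Eqs. (15)–(18). It is an illustration under assumptions, not the authors' implementation: prototypes are initialised randomly instead of by an FCM run, and a single β averaged over all prototypes is used.

```python
import numpy as np

def ukpc_l(X, c, m=2.0, a=1.0, max_iter=100, eps=1e-5, seed=0):
    """Sketch of UKPC-L: log-kernel distances (15), beta (16), centres (17), memberships (18)."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(X.shape[0], size=c, replace=False)]          # initial prototypes (assumption)
    for _ in range(max_iter):
        sq = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # ||x_j - v_i||^2, shape (c, n)
        log_term = np.log1p(a * sq)                               # log(1 + a ||x_j - v_i||^2)
        beta = 2.0 * log_term.mean()                              # Eq. (16), averaged over prototypes
        U = np.exp(-(2.0 * m * np.sqrt(c) / beta) * log_term)     # Eq. (18)
        w = (U ** m) / (1.0 + a * sq)                             # weights appearing in Eq. (17)
        V_new = (w @ X) / w.sum(axis=1, keepdims=True)            # Eq. (17)
        if np.abs(V_new - V).max() < eps:
            return U, V_new
        V = V_new
    return U, V
```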
5 Experimental Results
1. Gaussian Random Data with Noise: The generated dataset is spherical with unequal radii. There are four clusters, and the data points in each cluster are normally distributed over a two-dimensional space. Analysing Fig. 1 closely, the proposed algorithm gives the desired results, with cluster centres located at their prototypical locations, whereas with UPC the cluster centres are slightly shifted from the ideal locations.
The respective covariance matrices are:
[Fig. 1: two scatter plots of the Gaussian random dataset over axes x1 and x2, showing the clustering results.]
$$V_{IDEAL} = \begin{bmatrix} 2 & 5 \\ 2 & 14 \\ 7 & 8 \\ 10 & 14 \end{bmatrix}$$

$$V_{UPC} = \begin{bmatrix} 1.9947 & 5.0186 \\ 2.0138 & 13.9765 \\ 6.7874 & 8.1653 \\ 9.9408 & 13.8123 \end{bmatrix} \qquad V_{UKPC\text{-}L} = \begin{bmatrix} 2.0146 & 4.9913 \\ 2.0007 & 13.9779 \\ 6.9814 & 8.0660 \\ 9.9735 & 14.0319 \end{bmatrix}$$
To show the effectiveness of our proposed algorithm, we also compute the error using:

$$E = \|V_{ideal} - V\|^{2}$$
We now study the performance and efficiency of our proposed clustering algorithm on a few real datasets, namely the Iris and Seed datasets. The clustering results were evaluated using Huang's accuracy measure [12]:

$$r = \frac{\sum_{i=1}^{k} n_i}{n} \qquad (19)$$
[Figure: two scatter plots over axes x1 and x2 showing clustering results on a second dataset.]
where n_i is the number of data points occurring in both the ith cluster and its corresponding true cluster, and n is the number of data points in the dataset. According to this measure, a higher value of r indicates a better clustering result, with perfect clustering yielding r = 1. The error has been calculated using the measure E = 1 − r.
[Figure: two scatter plots over axes x1 and x2 showing clustering results on a further dataset.]
Table 1 Number of misclassified data, elapsed time, accuracy and error using UPC and UKPC-L
Clustering algorithm Misclassifications Elapsed time Accuracy (%) Error (%)
UPC 12 1.50 92.0 8.0
UKPC-L 10 0.71 93.3 6.6
Table 2 Number of misclassified data, elapsed time, accuracy and error using UPC and UKPC-L
Clustering algorithm Misclassifications Elapsed time Accuracy (%) Error (%)
UPC 25 0.28 88.57 11.42
UKPC-L 23 0.29 89.04 10.95
6 Conclusions
References
1. Anderberg, M.R.: Cluster Analysis for Application. Academic Press, New York (1973)
2. Backer, E., Jain, A.K.: A clustering performance measure based on fuzzy set decomposition.
IEEE Trans. Pattern Anal. Mach. Intell. 3(1), 66–74 (1981)
3. Zadeh, L.A.: Fuzzy sets. Inf. Control 8, 338–353 (1965)
4. Ruspini, E.H.: A new approach to clustering. Inform. Control 15(1), 22–32 (1969)
5. Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting compact
well-separated cluster. J. Cybern. 3(3), 32–57 (1973)
6. Krishnapuram, R., Keller, J.M.: A possibilistic approach to clustering. IEEE Trans. Fuzzy
Syst. 1, 98–110 (1993)
7. Dave, R.: Characterization and detection of noise in clustering. Pattern Rec. Lett. 12(11),
657–664 (1991)
8. Pal, N.R., Pal, K., Bezdek, J.C.: A mixed c-means clustering model. In: Proceedings of the
IEEE International Conference on Fuzzy Systems, Spain, pp. 11–21 (1997)
9. Yang, M.S., Wu, K.L.: Unsupervised possibilistic clustering. Pattern Recogn. 39(1), 5–21
(2006)
10. Christianini, N., Taylor, J.S.: An Introduction to SVMs and Other Kernel-Based Learning
Methods. Cambridge University Press, Cambridge (2000)
11. Girolami, M.: Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. 13
(3), 780–784 (2002)
12. Huang, Z., Ng, M.K.: A fuzzy k-modes algorithm for clustering categorical data. IEEE Trans.
Fuzzy Syst. 7(4), 446–452 (1999)
Fuzzy c-Means Clustering Strategies:
A Review of Distance Measures
Abstract In the process of clustering, our attention is on finding basic procedures that measure the degree of association between the variables. Many clustering methods use distance measures to find the similarity or dissimilarity between any pair of objects. The fuzzy c-means clustering algorithm is one of the most widely used clustering techniques and uses the Euclidean distance metric as a similarity measure. The choice of distance metric should depend on the data and on how their comparison is to be made. The main objective of this paper is to present a mathematical description of different distance metrics that can be used with a clustering algorithm and to compare their performance using the number of iterations needed to compute the objective function, the misclassification of data in the clusters, and the error between the ideal and observed cluster centre locations.
Keywords FCM clustering · Euclidean distance · Standard Euclidean distance · Mahalanobis distance · Minkowski distance · Chebyshev distance
J. Arora (&)
Department of Information Technology, Maharaja Surajmal Institute of Technology,
C-4, Janakpuri, New Delhi, India
e-mail: [Link]@[Link]
K. Khatter
Department of Computer Science, Ansal University, Gurgaon, India
e-mail: kirankhatter@[Link]
M. Tushir
Department of Electrical & Electronics Engineering, Maharaja Surajmal Institute
of Technology, C-4, Janakpuri, New Delhi, India
e-mail: meenatushir@[Link]
1 Introduction
Clustering is a technique for finding data with similar characteristics among a given set of data through association rules and classification rules, resulting in the separation of classes and the recognition of frequent patterns. Clustering is basically a knowledge discovery process whose results can be reused in various applications. A good cluster definition involves low inter-class similarity and high intra-class similarity. In order to categorize the data, different similarity measures must be applied to establish a relation between the patterns, which groups the data into different clusters with a degree of membership. In clustering we therefore have to choose a good distance metric in order to obtain high intra-class similarity. Several clustering algorithms with different distance metrics have been developed in the past: some detect clusters of different shapes such as spherical [1] or elliptical [2], some detect straight lines [3, 4], and others focus on the compactness of the clusters [2, 5]. Clustering is a challenging field of research, as it can be used as a stand-alone tool to gain insight into the allocation of data, to observe the characteristic features of each cluster, and to focus on a particular set of clusters for further analysis. Focusing on proximity measures, some existing work compares sets of distance metrics and can therefore be used as a guideline. However, most of this work covers only basic metrics: Grabusts [6] compared the Euclidean, Manhattan, and correlation distances with k-means on the Iris dataset, Hathaway [7] compared distances with different values of p, and Liu et al. [8] proposed a new algorithm by replacing the Euclidean distance with the Standard Mahalanobis distance.
In this paper, we present a survey of different distance metrics in order to acquire a proximity measure, to be followed by the clustering criterion, that results in a good clustering scheme for a dataset. We include the Euclidean, Standard Euclidean, Mahalanobis, Standard Mahalanobis, Minkowski, and Chebyshev distances and compare them on the criteria of accuracy, misclassification, and location of the centres, which to our knowledge have not been discussed in such detail in any survey so far.
The remainder of this paper is organized as follows. Section 2 provides related work, including details of the fuzzy c-means algorithm and an overview of different distance metrics; Sect. 3 presents experimental results on different data types, including synthetic and real datasets; Sect. 4 concludes the review.
2 Related Work
The notion of fuzzy sets developed by Zadeh [9] is an attempt to modify exclusive clustering on the basis of the probability of the parameters on which the clusters are built; it was further extended by Bezdek et al. [3] into the fuzzy c-means (FCM) algorithm, the most widely used algorithm of this family. The approach partitions a set of data $\{x_1, \ldots, x_n\} \subset \mathbb{R}^{s}$ into c clusters based on a similarity computed by the least-squares function of the Euclidean distance metric. The objective function of FCM is

$$J_{FCM}(X; U, V) = \sum_{j=1}^{c}\sum_{i=1}^{N} (u_{ij})^{m}\,\|x_i - v_j\|^{2}, \quad 1 < m < \infty \qquad (1)$$
The Euclidean distance is the most commonly used similarity measure in FCM. The formula compares each attribute of one object with the corresponding attribute of the other to determine the strength of their relationship; the smaller the distance, the greater the similarity. The Euclidean distance and its square are

$$d_{x,v} = \sqrt{(x_1 - v_i)^{2} + (x_2 - v_i)^{2} + \cdots + (x_n - v_i)^{2}} \qquad (2)$$

$$d_{x,v}^{2} = (x_1 - v_i)^{2} + (x_2 - v_i)^{2} + \cdots + (x_n - v_i)^{2} \qquad (3)$$
In (5), R is a correlation matrix. In [8], Liu et al. proposed a new algorithm, FCM-SM, which normalizes each feature in the objective function so that every covariance matrix becomes the corresponding correlation matrix.
In [1, 2], different forms of the Minkowski distance with different values of p have been implemented with FCM, with results reported on relational and object data types. The Chebyshev distance (L∞) is a distance metric defined on a vector space in which the distance between two vectors is the largest of their differences measured along any coordinate dimension; essentially, it is the maximum distance between two points in any single dimension. The Chebyshev distance between any two points is given by (7).
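As an aside, these distance metrics are available off the shelf; the snippet below (illustrative only, with made-up vectors and an example covariance matrix) evaluates several of them with SciPy's scipy.spatial.distance module.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 5.0])                      # example data point
v = np.array([7.0, 8.0])                      # example cluster centre
cov = np.array([[2.0, 0.3], [0.3, 1.5]])      # example covariance matrix

print("Euclidean      :", distance.euclidean(x, v))
print("Std. Euclidean :", distance.seuclidean(x, v, np.diag(cov)))     # per-feature variances
print("Mahalanobis    :", distance.mahalanobis(x, v, np.linalg.inv(cov)))
print("Minkowski p=3  :", distance.minkowski(x, v, p=3))
print("Chebyshev      :", distance.chebyshev(x, v))
```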
3 Experiments’ Results
The purpose of the experimental part is to test the operation of the FCM algorithm with different distance metrics on synthetic and real datasets. We used datasets with a wide variety of cluster shapes, numbers of clusters, and numbers of features per data point. FCM is an unsupervised clustering method, so the number of clusters used to group the data was supplied by us. We chose m = 2, which is a good choice for fuzzy clustering; for all other parameters we used ε = 0.001 and max_iter = 200.
The first example involves the X12 dataset, shown in Fig. 1, which contains two identical clusters and one outlier equidistant from both clusters. FCM is known to be very sensitive to noise, so we do not get the desired results. To show the effectiveness of the different distance metrics, we calculated the error E* as the sum of the squares of the differences between the calculated and ideal centres for every distance metric, as given in Eq. (8).
[Fig. 1: scatter plot of the X12 dataset over axes x1 and x2.]
[Figure: scatter plot of the different-volume rectangle dataset over axes x1 and x2.]
Table 1 Centres produced by FCM with different distance metrics, effectiveness (E*), and number of iterations for the X12 dataset and the different-volume rectangle dataset

Measure        Euclidean       Std. Euclidean   Mahalanobis     Std. Mahalanobis   Minkowski       Chebyshev
Centre (V12)   2.98, 0.54      2.67, 2.15       2.97, 0.55      −2.98, 0.54        −2.98, 0.54     −2.98, 0.54
               −2.98, 0.54     −1.51, 0.08      −2.97, 0.55     2.98, 0.54         2.98, 0.54      2.98, 0.54
E*             0.412           4.213            0.438           0.412              0.412           0.412
No. of iter.   10              10               10              9                  17              10
Centre (VRD)   4.22, −0.02     7.42, 0.30       5.90, −0.10     4.86, 0.54         4.92, −0.01     4.76, −0.02
               16.30, −1.01    16.26, −2.02     16.42, −1.06    16.33, −0.94       16.33, −0.94    16.26, −1.00
E*             0.349           3.53             0.5             0.211              0.059           0.062
No. of iter.   10              23               27              10                 14              13
The Mahalanobis distance also shows near-optimal results, but the Standard Euclidean distance performs poorly. Similarly, on the different-volume rectangle dataset, the Minkowski and Chebyshev distances perform best compared with the other metrics, the Mahalanobis distance uses the highest number of iterations, and the Standard Euclidean distance again shows the worst results. Both datasets consist of clusters forming compact clouds that are well separated from one another; the sum-of-squared-error (Euclidean) distance therefore outperforms the other distances, while the Standard Euclidean distance performs poorly. The Mahalanobis distance, because it requires calculating the covariance matrix of the data, does not give accurate results.
We now examine the defined evaluation criteria with some well-known real datasets, namely the Iris dataset, the Wine dataset, and the Wisconsin dataset. We analyze the clustering results using Huang's accuracy measure (r) [11].
r = \frac{\sum_{i=1}^{k} n_i}{n}    (9)
where ni is the number of data occurring in both the ith cluster and its corre-
sponding true cluster, and n is the number of data points in the dataset. According to
this measure, a higher value of r indicates a better clustering result with perfect
clustering yielding a value of r = 1.
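A small sketch of this measure, assuming integer-coded predicted and true labels and taking each cluster's "corresponding true cluster" as the majority true class within it (an assumption on our part), is given below.

import numpy as np

def huang_accuracy(pred, true, k):
    # r = (sum_i n_i) / n, Eq. (9); n_i is taken here as the count of the majority
    # true class inside cluster i (our interpretation of the corresponding true cluster).
    pred = np.asarray(pred)
    true = np.asarray(true)
    n = true.size
    total = 0
    for i in range(k):
        members = true[pred == i]
        if members.size:
            total += np.bincount(members).max()
    return total / n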
We made several runs of FCM with different distance metrics and calculated the misclassification, accuracy, and number of iterations on all three high-dimensional datasets. Table 2 shows how the algorithm produces different misclassification values over the three datasets as the distance metric changes. We can see that the Chebyshev distance gives good results with the Iris and Breast Cancer datasets, with accuracies of 90 and 96% respectively, while the Standard Euclidean distance gives the best result with the Wine dataset, with an accuracy of 91%; however, the number of iterations used is very high.
Table 2 FCM with different distance metrics showing misclassification, accuracy and number of iterations with the Iris, Wine, Breast Cancer (BC) datasets

Dataset (metric)         | Euclidean | Std. Euclidean | Mahalanobis | Std. Mahalanobis | Minkowski | Chebyshev
Misclassification (IRIS) | 17        | 27             | 43          | 27               | 17        | 15
4 Conclusions
We described various distance metrics for FCM and examined the behavior of the
algorithm with different approaches. It has been concluded from results on various
synthetic and real datasets that Euclidean distance works well for most of the
datasets. Chebyshev and Minkowski distances are equally suitable for clustering.
Further exhaustive exploration on distance metrics needs to be done on various
datasets.
References
1. Groenen, P.J.F., Kaymak, U., van Rosmalen, J.: Fuzzy clustering with Minkowski distance functions. Econometric Institute Report EI 2006-24 (2006)
2. Cai, J.Y., Xie, F.D., Zhang, Y.: Fuzzy c-means algorithm based on adaptive Mahalanobis
distance. Comput. Eng. Appl. 174–176(2010)
3. Bezdek, J.C., Coray, C., Gunderson, R, Watson, J.: Detection and characterization of cluster
substructure. SIAM J. Appl. Math. 339–372 (1981)
4. Dave, R.N.: Use of the adaptive fuzzy clustering algorithm to detect lines in digital images.
Intell Robots Comput. Vision VIII 1192, pp. 600–661 (1982)
5. Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting compact well
separated clusters. J Cybern, 32–57(1973)
6. Grabusts, P.: The choice of metrics for clustering algorithms. In: International Scientific and
Practical Conference Vol 2(8), pp. 70–76 (2011)
7. Hathaway, R.J., Bezdek, J.C., Hu, Y.: Generalised fuzzy c-means clustering strategies using
LP norm distance. IEEE Trans. Fuzzy Syst. 8(5), (2000)
8. Liu, H.C., Jeng, B.C., Yih, J.M., Yu, Y.K.: Fuzzy c-means clustering algorithm based on
standard mahalanobis distance. In: International Symposium on Information Processing,
pp. 422–427 (2009)
9. Zadeh, L.A.: Fuzzy sets. Inf. Control 8, 338–353 (1965)
10. Su, M.C., Chou, C.H.: A Modified means of K-Means algorithm with a distance based on
cluster symmetry. IEEE Trans. Pattern Anal. Mach. Intell. 23, 674–680 (2001)
11. Tushir, M., Srivastava, S.: A new kernelized hybrid c-mean clustering with optimized
parameters. Appl. Soft Comp. 10, 381–389 (2010)
Noise Reduction from ECG Signal Using
Error Normalized Step Size Least Mean
Square Algorithm (ENSS) with Wavelet
Transform
Abstract This paper presents the reduction of baseline wander noise found in ECG signals. The reduction has been done using a wavelet-transform-inspired error normalized step size least mean square (ENSS-LMS) algorithm. We present a wavelet decomposition-based filtering technique to minimize the computational complexity while maintaining good quality of the output signal. The MATLAB simulation results validate the good noise rejection in the output signal by analyzing the parameters excess mean square error (EMSE) and misadjustment.
1 Introduction
Electrocardiographic (ECG) signals are very low amplitude signals around 1 mV.
ECG signal can be corrupted with either noises named as baseline wander (BW) or
power line interface (PLI) noises. These two noises badly degrade the ECG signal
quality and generate a resembling PQRST waveform and remove some tiny features
which are important for diagnosis. ECG is commonly more affected by Baseline
Wander (BW), which is the reason behind varying electrode skin impudence,
patient’s breath, and movement. This noise is a kind of sinusoid signal with random
frequency and phase. Baseline wander reduction is very important step to process
ECG signal. Finally, the work presents here with aim to find the clean signal from the
undesired noisy signals so that the out coming signal can be used for easy diagnosis.
Many techniques have been used to enhance the quality of the signal in the research papers [1–9]; these used both adaptive and non-adaptive models. Many adaptive filters have been proposed for canceling noise from ECG signals. These adaptive filters minimize the error between the noisy ECG signal (considered as the primary input) and a reference signal which is somehow correlated with the noise present in the primary ECG signal. To track the dynamic variations of ECG signals, Thakor et al. [1] gave the concept of an adaptive recurrent filter structure which acquired the impulse response of the normal QRS. To track the QRS complexes in ECG signals with few parameters, an adaptive system based on the Hermite functions has been proposed [2]. To update the coefficients of the filter, there is always a need for an adaptive algorithm. The task of this adaptive algorithm is to minimize the error obtained after subtracting the output of the filter from the addition of the main signal and noise. Researchers have also analyzed the error convergence when the reference input is a deterministic signal. One such work has been published by Olmos et al. [3], who derived the expression for steady-state misadjustment taking ECG as the input signal. Costa et al. [4] proposed a noise resilient variable step size LMS algorithm which is specially designed for biomedical signals. A software approach has also been developed by Brouse et al. [5] for detecting noise in ECG signals using wavelet decomposition, but hardware implementation of this approach becomes costly for biomedical applications. Several other adaptive algorithms and their modifications have also been published [10–13].
This paper contributes by implementing error normalized step size least mean
square (ENSS-LMS) algorithm with the help of wavelet transforms. In this paper,
the adaptive structure of ENSS is presented to eliminate the baseline wander noise
from ECG signals. MIT-BIH database is used to implement and analyze the per-
formance of weight updating algorithm (ENSS) to eliminate the baseline wander
noise from ECG signals. MATLAB simulations have been done to indicate substantial improvements in the quality of ECG signals, as reflected in good values of EMSE and misadjustment.
This paper is organized as follows: Sect. 2 gives the proposed Implementation of
ENSS algorithm along with wavelet transform. Section 3 shows the simulation
results. Section 4 concludes the work.
2 Proposed Implementation
Adaptive noise cancellation is a technique used to remove the noise from the input signal. Figure 1 shows the primary input signal x(n), which is basically the addition of the original signal and some noise. If the original signal is s(n) and the noise signal is N1(n), then the primary input becomes x(n) = s(n) + N1(n). A reference signal related to the noise added to the original signal is also taken. The reference noise signal N(n) is passed through a filter whose coefficients are updated by the ENSS-LMS algorithm.

Fig. 1 Adaptive noise cancellation structure (block diagram): the reference noise signal N(n) is decomposed by the 1-D wavelet transform and passed through the filter with weights w(n), which are updated by the ENSS algorithm
A signal is usually decomposed in terms of sinusoids for spectral analysis, but it has also been found that decomposing the signal into components over different frequency bands, as in the wavelet transform, is very effective in reducing complexity [14]. So we use the discrete wavelet transform for spectral analysis of the signal, using a set of basis functions defined in both the frequency and time domains. Here we are using the 'Haar' wavelet [15]. The Haar wavelet is basically a sequence of rescaled square-shaped functions on the unit interval that together form an orthonormal basis. It is generally used for the analysis of signals which change suddenly [16]. The mother wavelet function for the Haar wavelet is defined as:
" p1ffiffi p1ffiffi #
XW ð0; 0Þ x ð 0Þ
¼ p12ffiffi p12ffiffi or X ¼ Hn;0 x ð1Þ
X/ ð0; 0Þ x ð 1Þ
2 2
x ¼ HT X ð2Þ
Error normalized step size least mean square (ENSS) is an algorithm in which the step size is normalized by the error rather than by the filter input. The step size varies as a nonlinear function of the error vector: in the error normalized step size algorithm, the variable step size is inversely proportional to the squared norm of the error vector. Also, in this algorithm the number of iterations and the length of the error vector are equal. The equation for updating the weights of ENSS is:
W(n+1) = W(n) + \frac{\mu}{1 + \mu \|e(n)\|^2}\, e(n)\, x(n)    (3)
where

\|e(n)\|^2 = \sum_{i=0}^{n-1} e^2(n-i)    (4)
is the squared norm of the error e(n), which is used for its estimation over the entire update process. The update equation can equivalently be written as:
W(n+1) = W(n) + \frac{1}{\frac{1}{\mu} + \|e(n)\|^2}\, e(n)\, x(n)    (5)
As the length of the error vector e(n) is equal to the number of iterations, say n, the proposed variable step size µ(n) becomes a nonlinear decreasing function of n, since ‖e(n)‖² is an increasing function of n. Also, µ(n) is an increasing function of the parameter µ [17].
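As an illustration only, the weight update of Eqs. (3)-(5) can be sketched in Python as follows; the function name and the way the error history is passed in are our own assumptions, not the paper's code.

import numpy as np

def enss_update(w, x_tap, e_hist, mu):
    # One ENSS-LMS weight update following Eqs. (3)-(4):
    # W(n+1) = W(n) + [mu / (1 + mu*||e(n)||^2)] * e(n) * x(n)
    e_n = e_hist[-1]                                       # current error sample e(n)
    e_norm_sq = float(np.sum(np.asarray(e_hist) ** 2))     # ||e(n)||^2, Eq. (4)
    step = mu / (1.0 + mu * e_norm_sq)                     # error-normalized step size
    return w + step * e_n * np.asarray(x_tap)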
The proposed implementation proceeds through the following steps:
1. Import the ECG signal into the program workspace; this signal x(n) is considered the ECG signal with noise.
2. Following that, the baseline wander noise file is also imported from the workspace; we call it the reference noise signal and denote it by N(n).
3. A correlated noise is generated by passing the noise signal through a parameterized filter of some order.
4. The wavelet transform of both the primary signal and the baseline wander noise is taken. So the output of the 1-D wavelet box shown in Fig. 1 is x(n)Ψ(t) for the primary signal and N1(n)Ψ(t) for the reference signal.
5. Weights, error, and other variables are initialized.
6. Now the wavelet coefficient obtained from signals will be processed through
the FIR Wiener filter.
7. The step size is determined using its limit equation (3), where the error will be
calculated by Eq. (5).
8. Output of the filter is calculated by multiplying the baseline noise signal with
the computed tap weight. So Eq. (3) becomes
W(n+1) = W(n) + \frac{\mu}{1 + \mu \|e(n)\|^2}\, e(n)\, N_1(n)\, \Psi(n).    (6)
9. The output of the filter acts as a complement to the noise present in the signal, which, when added to the corrupted signal, cancels a part of the noise. This negative addition gives the desired response from the ANC system.
10. The desired signal is a cleaned signal; when it is subtracted from the principal signal, it gives us an error signal.
11. The error signal is fed back into the tap weight computing equation via the ENSS-LMS algorithm using Eq. (3).
12. Finally, the inverse 1-D wavelet transform is taken to get the final signal using Eq. (2).
The resulting signal is the clean ECG signal.
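A compact sketch of steps 1-12 in Python is given below purely for illustration; it uses the third-party PyWavelets package for the 1-level Haar transform, processes only the approximation coefficients, and assumes a filter order and signal handling of our own choosing, so it approximates the procedure rather than reproducing the authors' MATLAB implementation.

import numpy as np
import pywt  # PyWavelets, used here instead of the MATLAB wavelet toolbox

def wavelet_enss_anc(x, ref_noise, order=8, mu=0.05, wavelet='haar'):
    # 1-level Haar decomposition of the primary signal and the reference noise,
    # ENSS-LMS filtering of the approximation coefficients, inverse transform.
    xa, xd = pywt.dwt(x, wavelet)            # primary signal coefficients
    na, _ = pywt.dwt(ref_noise, wavelet)     # reference noise coefficients
    w = np.zeros(order)
    e_norm_sq = 0.0
    cleaned = np.array(xa, dtype=float)
    for n in range(order, len(xa)):
        taps = na[n - order:n][::-1]         # most recent reference samples
        y = w @ taps                         # filter output (noise estimate)
        e = xa[n] - y                        # error = cleaned coefficient
        e_norm_sq += e * e                   # running ||e(n)||^2, Eq. (4)
        w = w + (mu / (1.0 + mu * e_norm_sq)) * e * taps   # ENSS update, Eq. (6)
        cleaned[n] = e
    return pywt.idwt(cleaned, xd, wavelet)   # inverse transform, Eq. (2)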
3 Simulation Results
The work presented here used a database of 3600 samples of the ECG signal, collected from the benchmark MIT-BIH arrhythmia database [18] (recording nos. 101, 102, 103). The non-stationary real baseline wander noise is obtained from the MIT-BIH Noise Stress Test Database (NSTDB). The arrhythmia database consists of ECG recordings obtained from 47 subjects (men and women of different age groups). The recordings have 11-bit resolution over a 10 mV range with 360 samples per second per channel. The performance of the algorithm is evaluated in terms of EMSE and misadjustment, as shown in Tables 1 and 2.
After obtaining the database, baseline wander noise is added to the ECG recordings (records 101, 102, 103) to generate the primary signals. As shown in Fig. 1, the wavelet transform of the primary signal (the addition of the normal ECG and baseline wander) is then taken. Figure 2 shows the wavelet transform of the primary signal using the 'Haar' wavelet with level 1 decomposition. After this, the wavelet coefficients are exported to the MATLAB workspace.
Now, as per Fig. 1, the wavelet transform of the reference noise is taken using the 'Haar' wavelet with level 1 decomposition. Figure 3 shows the wavelet decomposition of the reference noise signal using the 1-D wavelet transform. The wavelet transform coefficients are then exported to the MATLAB workspace for further processing.
Fig. 2 Wavelet decomposition of primary signal (ECG + baseline wander noise)

Fig. 3 Wavelet decomposition of reference noise signal (baseline wander noise)
The wavelet decomposition reduces the temporal length of the ECG signal, leaving it with only one-fourth of the input signal size. The reference signal is then passed through the designed FIR filter. The output of the filter is subtracted from the primary signal x(n) [after wavelet decomposition it becomes sig(n)], and an error signal is generated, which is again fed back to the filter via the ENSS-LMS algorithm [using Eq. (1)] to update the filter weights. Finally, the inverse wavelet transform is taken to get the cleaned signal shown in Fig. 4.
Fig. 4 Final ECG signal output after taking inverse wavelet transform

To validate the clean ECG signal shown in Fig. 4, the parameters EMSE and misadjustment have been analyzed (refer to Tables 1 and 2). It is clear from these tables that when recording 101 is taken as the noisy input signal, the EMSE is reduced to −30.665 dB with ENSS, a large reduction compared to −7.788 dB for the LMS algorithm at the same step size parameter µ = 0.025. Increasing the step size parameter to µ = 0.05, the EMSE improves to −34.257 dB for ENSS against −18.752 dB for LMS, and so does the misadjustment. For step size parameter µ = 0.1, the EMSE is −39.710 dB (ENSS) against −30.407 dB in the case of the LMS algorithm. To validate the model, the experiment was repeated for other ECG database recordings as well, but because of space constraints only the results of three recordings are shown in Tables 1 and 2.
Table 1 EMSE and misadjustment values for different recorded ECG signals for the ENSS-LMS algorithm

ENSS      | Record 101: EMSE (dB), M | Record 102: EMSE (dB), M | Record 103: EMSE (dB), M | Average: EMSE (dB), M
µ = 0.025 | −30.665, 0.5142          | −31.543, 0.5150          | −31.658, 0.6265          | −30.666, 0.5519
µ = 0.05  | −34.257, 0.9337          | −35.252, 0.9299          | −34.743, 0.9501          | −34.746, 0.9379
µ = 0.1   | −39.710, 0.7555          | −41.324, 0.8258          | −42.101, 0.7563          | −41.045, 0.7792
Table 2 EMSE and misadjustment values for different recorded ECG signals for the LMS algorithm

LMS       | Record 101: EMSE (dB), M | Record 102: EMSE (dB), M | Record 103: EMSE (dB), M | Average: EMSE (dB), M
µ = 0.025 | −7.788, 0.908            | −6.952, 0.992            | −7.129, 0.912            | −7.289, 0.937
µ = 0.05  | −18.752, 0.846           | −17.842, 0.878           | −18.521, 0.855           | −18.371, 0.859
µ = 0.1   | −30.407, −8.919          | −29.107, −8.990          | −30.203, −8.23           | −29.905, −8.890
4 Conclusion
References
1. Thakor, N.V., Zhu, Y.S.: Applications of adaptive filtering to ECG analysis: noise
cancellation and arrhythmia detection. IEEE Trans. Biomed. Eng. 38(8), 785–794 (1991)
2. Laguna, P., Jané, R., Olmos, S., Thakor, N.V., Rix, H., Caminal, P.: Adaptive estimation of QRS complex by the Hermite model for classification and ectopic beat detection. Med. Biol. Eng. Comput. 34(1), 58–68 (1996)
3. Olmos, S., Laguna, P.: Steady-state MSE convergence analysis in LMS adaptive filters with
deterministic reference inputs for biomedical signals. IEEE Trans. Signal Process. 48, 2229–
2241 (2000)
4. Costa, M.H., Bermudez, C.M.: A noise resilient variable step-size LMS algorithm. Sig.
Process. 88, 733–748 (2008)
5. Brouse, C., Dumont, G.A., Herrmann, F.J., Ansermino, J.M.: A wavelet approach to detecting electrocautery noise in the ECG. IEEE Eng. Med. Biol. Mag. 25(4), 76–82 (2006)
6. Leski, J.M., Henzel, N.: ECG baseline wander and power line interference reduction using
nonlinear filter bank. Sig. Process. 85, 781–793 (2005)
7. Meyer, C., Gavela, J.F., Harris, M.: Combining algorithms in automatic detection of QRS
complexes in ECG signals. IEEE Trans. Inf. Technol Biomed. 10(3), 468–475 (2006)
8. Kotas, M.: Application of projection pursuit based robust principal component analysis to
ECG enhancement. Biomed. Signal Process. Control 1, 289–298 (2007)
9. Mihov, G., Dotsinsky, I.: Power-line interference elimination from ECG in case of
non-multiplicity between the sampling rate and the powerline frequency. Biomed. Signal
Process. Control 3, 334–340 (2008)
10. Floris, E., Schlaefer, A., Dieterich, S., Schweikard, A.: A fast lane approach to LMS
prediction of respiratory motion signals. Biomed. Signal Process. Control 3, 291–299 (2008)
11. Li, N., Zhang, Y., Yanling, H., Chambers, J.A.: A new variable step size NLMS algorithm
designed for applications with exponential decay impulse responses. Sig. Process. 88, 2346–
2349 (2008)
12. Xiao, Y.: A new efficient narrowband active noise control system and its performance
analysis. IEEE Trans. Audio Speech Lang. Process. 19(7) (2011)
13. Leigh, G.M.: Fast FIR algorithms for the continuous wavelet transform from constrained least
squares. IEEE Trans. Signal Process. 61(1) (2013)
14. Kozacky, W.J., Ogunfunmi, T.: Convergence analysis of an adaptive algorithm with output
power constraints. IEEE Trans. Circ. Syst. II Express Briefs 61(5) (2014)
15. Das, R.L., Chakraborty, M.: On convergence of proportionate-type normalized least mean
square algorithms. IEEE Trans. Circ. Syst. II 62(5) (2015)
16. Sheetal, Mittal, M.: A Haar wavelet based approach for state analysis of disk drive read
system. Appl. Mech. Mater. 592–594, 2267–2271 (2014)
17. Narula, V. et al.: Assessment of variants of LMS algorithms for noise cancellation in low and
medium frequency signals. In: IEEE Conference on Recent Advancements in Electrical,
Electronics and Control Engineering, pp. 432–437 (2011)
18. [Link] (MIT-BIH database)
A Novel Approach for Extracting
Pertinent Keywords for Web Image
Annotation Using Semantic Distance
and Euclidean Distance
Abstract The World Wide Web today comprises billions of Web documents with information on varied topics presented by different types of media such as text, images, audio, and video. Therefore, along with textual information, the number of images on the WWW is growing exponentially. As compared to text, the annotation
of images by its semantics is more complicated as there is a lack of correlation
between user’s semantics and computer system’s low-level features. Moreover, the
Web pages are generally composed of contents containing multiple topics and the
context relevant to the image on the Web page makes only a small portion of the
full text, leading to the challenge for image search engines to annotate and index
Web images. Existing image annotation systems use contextual information from
page title, image src tag, alt tag, meta tag, image surrounding text for annotating
Web image. Nowadays, some intelligent approaches perform a page segmentation
as a preprocessing step. This paper proposes a novel approach for annotating Web
images. In this work, Web pages are divided into Web content blocks based on the
visual structure of page and thereafter the textual data of Web content blocks which
are semantically closer to the blocks containing Web images are extracted. The
relevant keywords from textual information along with contextual information of
images are used for annotation.
P. Gulati
YMCA UST, Faridabad, Haryana, India
e-mail: gulatipayal@[Link]
M. Yadav (&)
RPSGOI, Mahendergarh, Haryana, India
e-mail: manishayadav17@[Link]
1 Introduction
WWW is the largest repository of digital images in the world. The number of
images available over the Web is exponentially growing and will continue to
increase in future. However, as compared to text, the annotation of images by
means of the semantics they depict is much more complicated. Humans can rec-
ognize objects depicted in images, but in computer vision, automatic understanding of the semantics of images is still a perplexing task. Image annotation
can be done either through content-based or text-based approaches. In text-based
approach, different parts of a Web page are considered as possible sources for
contextual information of images, namely image file names (ImgSrc), page title,
anchor texts, alternative text (ALT attribute), image surrounding text. In the
content-based approach, image processing techniques such as texture, shape, and
color are considered to describe the content of a Web image.
Most of the image search engines index images using text information associated
with images, i.e., on the basis of alt tags, image caption. Alternative tags or alt tag
provides a textual alternative to non-textual content in Web pages such as image,
video, media. It basically provides a semantic meaning and description to the
embedded images. However, the Web is still replete with images that have missing,
incorrect, or poor text. In fact in many cases, images are given only empty or null
alt attribute (alt = “ ”) thereby such images remain inaccessible. Image search
engines that annotate Web images based on content-based annotation have problem
of scalability.
In this work, a novel approach for extracting pertinent keywords for Web image
annotation using semantic distance and Euclidean distance is proposed. Further, this
work proposes an algorithm that automatically crawls the Web pages and extracts
the contextual information from the pages containing valid images. The Web pages
are segmented into Web content blocks and thereafter semantic correlation is cal-
culated between Web image and Web content block using semantic distance
measure. The pertinent keywords from contextual information along with semantic
similar content are then used for annotating Web images. Thereafter, the images are
indexed with the associated text it refers to.
This paper is organized as follows: Sect. 2 discusses the related work done in
this domain. Section 3 presents the architecture of the proposed system. Section 4
describes the algorithm for this approach. Finally, Sect. 5 comprises of the
conclusion.
2 Related Work
A number of text-based approaches for Web image annotation have been proposed
in recent years [1]. There are numerous systems [2–6] that use contextual infor-
mation for annotating Web images. Methods for extracting contextual information
are (i) window-based extraction [7, 8], (ii) structure-based wrappers [9, 10],
(iii) Web page segmentation [11–13].
Window-based extraction is a heuristic approach which extracts image sur-
rounding text; it yields poor results as at times irrelevant data is extracted and
relevant data is discarded. Structure-based wrappers use the structural informa-
tion of Web page to decide the borders of the image context but these are not
adaptive as they are designed for specific design patterns of Web page. Web page
segmentation method is adaptable to different Web page styles and divides the
Web page into segments of common topics, and then each image is associated with
the textual contents of the segment which it belongs to. Moreover, it is difficult to
determine the semantics of text with the image.
In this work, Web page is segmented into Web content blocks using
vision-based page segmentation algorithm [12]. Thereafter, semantic similarity is
calculated between Web image and Web content block using semantic distance
measure. Semantic distance is the inverse of semantic similarity [14]; that is, the smaller the distance between two concepts, the more similar they are. So, semantic similarity and semantic distance are used interchangeably in this work.
Semantic distance between Web content blocks is calculated by determining a
common representation among them. Generally, text is used for common repre-
sentation. As per the literature review, there are various similarity metrics for texts
[13, 15, 16]. Some simple metrics are based on lexical matching. Prevailing approaches are successful only to some extent, as they do not capture the semantic similarity of texts. For instance, the terms Plant and Tree have a high semantic cor-
relation which remains unnoticed without background knowledge. To overcome
this, WordNet taxonomy as background knowledge is discussed [17, 18].
In this work, the word-to-word similarity metric [19] is used to calculate the
similarity between words and text-to-text similarity is calculated using the metric
introduced by Corley [20].
3 Proposed Architecture
Crawl manager is a computer program that takes the seed URL from the URL
queue and fetches the Web page from WWW.
URL queue is a type of repository which stores the list of URLs that are discovered
and extracted by crawler.
3.3 Parser
The parser is used to extract information present on Web pages. The parser downloads the Web page and extracts its XML file. Thereafter, it converts the XML file into DOM object models. It then checks whether valid images are present on the
Web page or not. If valid image is present on the Web page, then the page is
segmented using visual Web page segmenter; otherwise, next URL is crawled.
The DOM object models which contain page title of Web page, image source, and
alternative text of valid images present on the Web page are extracted from the set
of object models of the Web page.
The visual Web page segmenter is used for the segmentation of Web pages into Web content blocks. Segmentation of a Web page means dividing the page by certain rules or procedures to obtain multiple semantically different Web content blocks whose content can be investigated further.
In the proposed approach, VIPS algorithm [12] is used for the segmentation of
Web page into Web content blocks. It extracts the semantic structure of a Web page
based on its visual representation. The segmentation process has basically three
steps: block extraction, separator detection, and content structure construction.
Blocks are extracted from DOM tree structure of the Web page by using the page
layout structure, and then separators are located among these blocks. The
vision-based content structure of a page is obtained by combining the DOM
structure and the visual cues. Therefore, a Web page is a collection of Web content blocks that have a similar degree of coherence (DOC). With the permitted DOC (pDOC) set to its maximum
value, a set of Web content blocks that consist of visually indivisible contents is
obtained. This algorithm also provides the two-dimensional Cartesian coordinates
of each visual block present on the Web page based on their locations on the Web
page.
Block analyzer analyses the Web content blocks obtained from segmentation.
Further, it divides the Web content blocks into two categories: image blocks and
text blocks. Web blocks which contain images are considered as image blocks and
rest are considered as text blocks.
Nearest text block detector detects the nearest text blocks to an image block. For
checking closeness, Euclidean distance between closest edges of two blocks is
calculated. Distance between two line segments is obtained by using Eq. (1):
\text{Euclidean Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}    (1)
After the distance is calculated between each image block pair and text block
pair, the text blocks whose distance from image block is below the threshold are
assigned to that image block. In this way, each image block is assigned with a group
of text blocks which are closer in distance with that image block.
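A small sketch of this closest-edge check, assuming each block is described by the (x, y, width, height) coordinates provided by VIPS and using a hypothetical threshold parameter, could look like this.

import math

def closest_edge_distance(a, b):
    # a, b: rectangles as (x, y, width, height); returns the distance between
    # their closest edges (0 if the rectangles overlap), i.e. Eq. (1) applied
    # to the closest pair of points.
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    dx = max(bx1 - ax2, ax1 - bx2, 0)
    dy = max(by1 - ay2, ay1 - by2, 0)
    return math.sqrt(dx * dx + dy * dy)

def nearest_text_blocks(image_block, text_blocks, threshold):
    # Assign to the image block every text block whose closest-edge distance
    # is below the threshold.
    return [t for t in text_blocks
            if closest_edge_distance(image_block, t) <= threshold]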
In the proposed approach, the tag text extractor is used for extracting text from the HTML tags. The parser provides the DOM object models by parsing a Web page. If the image present on this Web page is valid, i.e., it is not a button or an icon (which is checked by the valid image checker), then text is extracted from the metadata of the image, such as the image source (ImgSrc) and the alternative text (Alt). The page title of the Web page which contains this image is also extracted.
In this work, keyword extractor is used to extract keywords from the metadata of
images and page title. Keywords are stored into a text file which is further used for
obtaining semantically close text blocks by calculating semantic distance.
\text{sim}_{Lin} = \frac{2 \cdot IC(LCS)}{IC(Concept_1) + IC(Concept_2)}    (2)
Here LCS is the least common subsumer of the two concepts in the WordNet
taxonomy, and IC is the information content that measures the specificity for a
concept as follows:
where idf(wi) is the inverse document frequency [19] of the word wi in a large
corpus. A directional similarity score is further calculated with respect to T1. The
score from both directions is combined into a bidirectional similarity as given in
Eq. 5:
This similarity score has a value between 0 and 1. From this similarity score,
semantic distance is calculated as follows:
In this way, semantic distance is calculated among image block and its nearest
text blocks. The text block whose semantic distance is less is the semantically
correlated text block to that image block.
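Since the extracted text omits the intermediate equations, the following Python sketch only approximates the described computation: it takes a word-to-word similarity function (for example the Lin measure of Eq. (2)) and idf weights as inputs, combines the two directional scores by a simple average, and defines the semantic distance as one minus the similarity; these specific choices and the helper names are our assumptions.

def text_similarity(t1, t2, word_sim, idf):
    # Directional score: each word of t1 is matched with its most similar word in t2,
    # weighted by idf(w), in the spirit of the text-to-text metric of Corley [20].
    num = sum(max((word_sim(w, w2) for w2 in t2), default=0.0) * idf.get(w, 1.0)
              for w in t1)
    den = sum(idf.get(w, 1.0) for w in t1)
    return num / den if den else 0.0

def semantic_distance(t1, t2, word_sim, idf):
    # Bidirectional similarity (average of both directions), then distance = 1 - similarity.
    sim = 0.5 * (text_similarity(t1, t2, word_sim, idf)
                 + text_similarity(t2, t1, word_sim, idf))
    return 1.0 - sim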
Text extractor is used to extract text from text blocks present on the Web page. Text
of semantically close text block obtained in the previous step is extracted and
buffered. This text along with the text extracted from image metadata and page title
of Web page is used to extract frequent keywords.
In this work, keyword determiner is used to extract keywords from the text stored in
a buffer. Frequent keywords are determined by applying a threshold on the fre-
quency count of keywords. Keywords whose frequency is above the threshold are
extracted and used for annotating images.
Page title of Web page, image source of image, alternative text of image, and
frequent keywords extracted in the previous step—all of these describe the image
best.
4 Algorithm
The algorithm for proposed system is automatic image annotation. This algorithm
takes the URL of Webpage as input and provides the description of the Web page as
output.
This algorithm is used here for annotating the images present on the Web page. Firstly, parsing is done to extract the page title, Img_Src, and Alt text of the image. Secondly, Web page segmentation is performed using the VIPS algorithm. Then the validity of each image is checked, and for valid images the nearest text blocks are found using the algorithm given below. For the resulting list of closer text blocks, semantic distance is calculated using the bidirectional similarity between blocks. Then keywords are extracted from the semantically close text block. These keywords are used for the image annotation process.
Algorithm for obtaining nearest text blocks is find nearest text blocks. It takes
image blocks and text blocks as input and provides a list of nearest blocks as output.
This algorithm collects the nearest text blocks to an image block present on the Web
page using closest edge Euclidean distance between Web content blocks. It uses the
Cartesian coordinates of Web content blocks to calculate Euclidean distance.
5 Conclusion
This paper presents algorithm for the novel approach for extracting pertinent
keywords for Web image annotation using semantics. In this work, Web images are
automatically annotated by determining pertinent keywords from contextual
information from Web page and semantic similar content from Web content blocks.
This approach provides better results than the method of image indexing using Web page segmentation and clustering [21], since in the existing method the context of the image is not coordinated with the context of the surrounding text. The proposed approach is expected to provide good results because the closeness between an image and the Web content blocks is computed using both Euclidean distance and semantic distance.
References
1. Sumathi, T., Devasena, C.L., Hemalatha, M.: An overview of automated image annotation
approaches. Int. J. Res. Rev. Inf. Sci. 1(1) (2011) (Copyright © Science Academy Publisher,
United Kingdom)
2. Swain, M., Frankel, C., Athitsos, V.: Webseer: an image search engine for the World Wide
Web. In: CVPR (1997)
3. Smith, J., Chang, S.: An image and video search engine for the world-wide web. Storage.
Retr. Im. Vid. Datab. 8495 (1997)
4. Ortega-Binderberger, M., Mehrotra, V., Chakrabarti, K., Porkaew, K.: Webmars: a
multimedia search engine. In: SPIE An. Symposium on Electronic Imaging, San Jose,
California. Academy Publisher, United Kingdom (2000)
5. Alexandre, L., Pereira, M., Madeira, S., Cordeiro, J., Dias, G.: Web image indexing:
combining image analysis with text processing. In: Proceedings of the 5th International
Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS04). Publisher,
United Kingdom (2004)
6. Yadav, M., Gulati, P.: A novel approach for extracting relevant keywords for web image
annotation using semantics. In: 9th International Conference on ASEICT (2015)
7. Coelho, T.A.S., Calado, P.P., Souza, L.V., Ribeiro-Neto, B., Muntz, R.: Image retrieval using
multiple evidence ranking. IEEE Trans. Knowl. Data Eng. 16(4), 408–417 (2004)
8. Pan, L.: Image 8: an image search engine for the internet. Honours Year Project Report,
School of Computing, National University of Singapore, April, 2003
9. Liu, B.: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Data-Centric Systems and Applications. Springer (2007)
10. Fauzi, F., Hong, J., Belkhatir, M.: Webpage segmentation for extracting images and their
surrounding contextual information. In: ACM Multimedia, pp. 649–652 (2009)
11. Chakrabarti, D., Kumar, R., Punera, K.: A graphtheoretic approach to webpage segmentation.
In: Proceeding of the 17th International Conference on World Wide Web, WWW’08,
pp. 377–386, New York, USA (2008)
12. Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: VIPS: a vision based page segmentation algorithm.
Technical Report, Microsoft Research (MSR-TR-2003-79) (2003)
13. Hattori, G., Hoashi, K., Matsumoto, K., Sugaya, F.: Robust web page segmentation for
mobile terminal using content distances and page layout information. In: Proceedings of the
16th International Conference on World Wide Web, WWW’07, pp. 361–370, New York, NY,
USA. ACM (2007)
14. Nguyen, H.A., Eng, B.: New semantic similarity techniques of concepts applied in the
Biomedical domain and wordnet. Master thesis, The University of Houston-Clear Lake
(2006)
15. Voorhees, E.: Using WordNet to disambiguate word senses for text retrieval. In: Proceedings
of the 16th Annual International ACM SIGIR Conference (1993)
16. Landauer, T.K., Foltz, P., Laham, D.: Introduction to latent semantic analysis. Discourse
Processes 25 (1998)
17. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: WordNet: An on-line lexical
database. Int. J. Lexicogr. 3, 235–244 (1990)
18. Patwardhan, S., Banerjee, S., Pedersen, T.: Using measures of semantic relatedness for word
sense disambiguation. In: Proceedings of the 4th International Conference on Computational
Linguistics and Intelligent Text Processing, CICLing’03, pp. 241–257. Springer, Berlin,
Heidelberg (2003)
19. Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 36th
Annual Meeting of the Association for Computational Linguistics and 17th International
Conference on Computational Linguistics, vol. 2, ACL-36, pp. 768–774, Morristown, NJ,
USA. Association for Computational Linguistics (1998); Sparck Jones, K.: A Statistical
Interpretation of Term Specificity and Its Application in Retrieval, pp. 132–142. Taylor
Graham Publishing, London, UK (1988)
20. Corley, C., Mihalcea, R.: Measuring the semantic similarity of texts. In: Proceedings of the
ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment,
EMSEE’05, pp. 13–18, Morristown, NJ, USA, 2005. Association for Computational
Linguistics (1998)
21. Tryfou, G., Tsapatsoulis, N.: Image Indexing Based on Web Page Segmentation and
Clustering (2014)
Classification of Breast Tissue Density
Patterns Using SVM-Based Hierarchical
Classifier
Abstract In the present work, three-class breast tissue density classification has
been carried out using SVM-based hierarchical classifier. The performance of
Laws' texture descriptors of various resolutions has been investigated for differ-
entiating between fatty and dense tissues as well as for differentiation between
fatty-glandular and dense-glandular tissues. The overall classification accuracy of
88.2% has been achieved using the proposed SVM-based hierarchical classifier.
1 Introduction
The most commonly diagnosed disease among women nowadays is breast cancer
[1]. It has been shown that high breast tissue density is associated with high risk of
developing breast cancer [2–10]. The mortality rate for breast cancer can be reduced if detection is made at an early stage. Breast tissue is broadly classified into fatty,
fatty-glandular, or dense-glandular based on its density.
Various computer-aided diagnostic (CAD) systems have been developed by
researchers in the past to discriminate between different density patterns, thus
providing the radiologists with a system that can act as a second opinion tool to
J. Virmani (&)
Thapar University, Patiala, Punjab, India
e-mail: [Link]@[Link]
Kriti
Jaypee University of Information Technology, Waknaghat, Solan, India
e-mail: kriti.23gm@[Link]
S. Thakur
Department of Radiology, IGMC, Shimla, Himachal Pradesh, India
e-mail: tshruti878@[Link]
Fig. 1 Sample of mammographic images from MIAS database, a typical fatty tissue ‘mdb132,’
b typical fatty-glandular tissue ‘mdb016,’ c typical dense-glandular tissue ‘mdb216,’ d atypical
fatty tissue ‘mdb096,’ e atypical fatty-glandular tissue ‘mdb090,’ f atypical dense-glandular tissue
‘mdb100’
validate their diagnosis. Few studies have been carried out on Mammographic
Image Analysis Society (MIAS) dataset for classification of breast tissue density
patterns into fatty, fatty-glandular, and dense-glandular tissue types [3–10]. Among
these, mostly the studies have been carried on the segmented breast tissue
(SBT) and rarely on fixed-size ROIs [3–10]. Out of these studies, Subashini et al.
[6] report a maximum accuracy of 95.4% using the SBT approach, and Mustra et al.
[9] report a maximum accuracy of 82.0% using the ROI extraction approach.
The experienced participating radiologist (one of the authors of this paper)
graded the fatty, fatty-glandular, and dense-glandular images as belonging to typical
or atypical categories. The sample images of typical and atypical cases depicting
different density patterns are shown in Fig. 1.
In the present work, a hierarchical classifier with two stages for binary classi-
fication has been proposed. This classifier is designed using support vector machine
(SVM) classifier in each stage to differentiate between fatty and dense breast tissues
and then between fatty-glandular and dense-glandular breast tissues using Laws’
texture features.
2 Methodology
The MIAS database consists of total 322 mammographic images out of which 106
are fatty, 104 are fatty-glandular, and 112 are dense-glandular [11]. From each
image, a fixed-size ROI has been extracted for further processing.
After conducting repeated experiments, it has been asserted that for classification of
breast density, the center area of the tissue is the optimal choice [12]. Accordingly,
fixed-size ROIs of size 200 × 200 pixels have been extracted from each mammogram, as depicted in Fig. 2.
3 Results
In this work, the performance of TDVs derived using Laws’ masks of length 3, 5, 7,
and 9 is evaluated using SVM-based hierarchical classifier. The results obtained are
shown in Table 1.
From Table 1, it can be observed that OCA of 91.3, 93.2, 91.9, and 92.5% is
achieved for TDV1, TDV2, TDV3, and TDV4, respectively, using SVM-1
sub-classifier, and OCA of 92.5, 84.2, 87.0, and 90.7% is obtained for TDV1,
TDV2, TDV3, and TDV4, respectively, using SVM-2 sub-classifier.
The results from Table 1 show that for differentiating between the fatty and
dense breast tissues, SVM-1 sub-classifier gives best performance for features
extracted using Laws’ mask of length 5 (TDV2), and for further classification of
dense tissues into fatty-glandular and dense-glandular classes, SVM-2 sub-classifier
gives best performance using features derived from Laws’ mask of length 3
(TDV1). This analysis of the hierarchical classifier is shown in Table 2. The OCA
for hierarchical classifier is calculated by adding the misclassified cases at each
classification stage.
Table 1 Performance of TDVs derived from Laws' texture features using hierarchical classifier

TDV (l)   | Classifier | CM                     | OCA (%)
TDV1 (30) | SVM-1      | F: 43, 10;  D: 4, 104  | 91.3
          | SVM-2      | FG: 48, 4;  DG: 4, 52  | 92.5
TDV2 (75) | SVM-1      | F: 43, 10;  D: 1, 107  | 93.2
          | SVM-2      | FG: 39, 13; DG: 4, 52  | 84.2
TDV3 (30) | SVM-1      | F: 43, 10;  D: 3, 105  | 91.9
          | SVM-2      | FG: 41, 11; DG: 3, 53  | 87.0
TDV4 (75) | SVM-1      | F: 44, 9;   D: 3, 105  | 92.5
          | SVM-2      | FG: 46, 6;  DG: 4, 52  | 90.7

Note TDV: texture descriptor vector, l: length of TDV, CM: confusion matrix, F: fatty class, D: dense class, FG: fatty-glandular class, DG: dense-glandular class, OCA: overall classification accuracy
4 Conclusion
From the exhaustive experiments carried out in the present work, it can be con-
cluded that Laws’ masks of length 5 yield the maximum classification accuracy of
93.2% for differential diagnosis between fatty and dense classes and Laws’ masks
of length 3 yield the maximum classification accuracy 92.5% for differential
diagnosis between fatty-glandular and dense-glandular classes. Further, for the
three-class problem, a single multi-class SVM classifier would construct three
different binary SVM sub-classifiers where each binary sub-classifier is trained to
separate a pair of classes and decision is made by using majority voting technique.
In case of hierarchical framework, the classification can be done using only two
binary SVM sub-classifiers.
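As an illustration of this two-stage design (not the authors' code), a minimal Python sketch using scikit-learn might look as follows; the RBF kernel choice and the label encoding 0 = fatty, 1 = fatty-glandular, 2 = dense-glandular are our own assumptions.

import numpy as np
from sklearn.svm import SVC

class HierarchicalSVM:
    # SVM-1 separates fatty (0) vs dense (1 or 2); SVM-2 separates
    # fatty-glandular (1) vs dense-glandular (2) among the dense cases.
    def __init__(self):
        self.svm1 = SVC(kernel='rbf')
        self.svm2 = SVC(kernel='rbf')

    def fit(self, X, y):
        self.svm1.fit(X, (y > 0).astype(int))      # stage 1: fatty vs dense
        dense = y > 0
        self.svm2.fit(X[dense], y[dense])          # stage 2: FG vs DG
        return self

    def predict(self, X):
        out = np.zeros(len(X), dtype=int)          # default prediction: fatty
        dense = self.svm1.predict(X) == 1
        if dense.any():
            out[dense] = self.svm2.predict(X[dense])
        return out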
References
1. Kriti, Virmani, J., Dey, N., Kumar, V.: PCA-PNN and PCA-SVM based CAD systems for
breast density classification. In: Hassanien, A.E., et al. (eds.) Applications of Intelligent
Optimization in Biology and Medicine, vol. 96, pp. 159–180. Springer (2015)
2. Wolfe, J.N.: Breast patterns as an index of risk for developing breast cancer. Am.
J. Roentgenol. 126(6), 1130–1137 (1976)
3. Blot, L., Zwiggelaar, R.: Background texture extraction for the classification of mammo-
graphic parenchymal patterns. In: Proceedings of Conference on Medical Image
Understanding and Analysis, pp. 145–148 (2001)
4. Bosch, A., Munoz, X., Oliver, A., Marti, J.: Modeling and classifying breast tissue density in
mammograms. In: Computer Vision and Pattern Recognition, IEEE Computer Society
Conference, 2, pp. 1552–1558. IEEE Press, New York (2006)
5. Muhimmah, I., Zwiggelaar, R.: Mammographic density classification using multiresolution
histogram information. In: Proceedings of 5th International IEEE Special Topic Conference
on Information Technology in Biomedicine (ITAB), pp. 1–6. IEEE Press, New York (2006)
6. Subashini, T.S., Ramalingam, V., Palanivel, S.: Automated assessment of breast tissue density
in digital mammograms. Comput. Vis. Image Underst. 114(1), 33–43 (2010)
7. Tzikopoulos, S.D., Mavroforakis, M.E., Georgiou, H.V., Dimitropoulos, N., Theodoridis, S.:
A fully automated scheme for mammographic segmentation and classification based on breast
density and asymmetry. Comput. Methods Programs Biomed. 102(1), 47–63 (2011)
8. Li, J.B.: Mammographic image based breast tissue classification with kernel self-optimized
fisher discriminant for breast cancer diagnosis. J. Med. Syst. 36(4), 2235–2244 (2012)
9. Mustra, M., Grgic, M., Delac, K.: Breast density classification using multiple feature
selection. Automatika 53(4), 362–372 (2012)
10. Silva, W.R., Menotti, D.: Classification of mammograms by the breast composition. In:
Proceedings of the 2012 International Conference on Image Processing, Computer Vision,
and Pattern Recognition, pp. 1–6 (2012)
11. Suckling, J., Parker, J., Dance, D.R., Astley, S., Hutt, I., Boggis, C.R.M., Ricketts, I.,
Stamatakis, E., Cerneaz, N., Kok, S.L., Taylor, P., Betal, D., Savage, J.: The mammographic
image analysis society digital mammogram database. In: Gale, A.G., et al. (eds.) Digital
Mammography. LNCS, vol. 1069, pp. 375–378. Springer, Heidelberg (1994)
12. Li, H., Giger, M.L., Huo, Z., Olopade, O.I., Lan, L., Weber, B.L., Bonta, I.: Computerized
analysis of mammographic parenchymal patterns for assessing breast cancer risk: effect of
ROI size and location. Med. Phys. 31(3), 549–555 (2004)
13. Kumar, I., Virmani, J., Bhadauria, H.S.: A review of breast density classification methods. In:
Proceedings of 2nd IEEE International Conference on Computing for Sustainable Global
Development (IndiaCom-2015), pp. 1960–1967. IEEE Press, New York (2015)
14. Virmani, J., Kriti.: Breast tissue density classification using wavelet-based texture descriptors.
In: Proceedings of the Second International Conference on Computer and Communication
Technologies (IC3T-2015), vol. 3, pp. 539–546 (2015)
15. Virmani, J., Kumar, V., Kalra, N., Khandelwal, N.: A comparative study of computer-aided
classification systems for focal hepatic lesions from B-mode ultrasound. J. Med. Eng.
Technol. 37(44), 292–306 (2013)
16. Virmani, J., Kumar, V., Kalra, N., Khandelwal, N.: SVM-based characterization of liver
ultrasound images using wavelet packet texture descriptors. J. Digit. Imaging 26(3), 530–543
(2013)
17. Virmani, J., Kumar, V., Kalra, N., Khandelwal, N.: Characterization of primary and
secondary malignant liver lesions from B-mode ultrasound. J. Digit. Imaging 26(6),
1058–1070 (2013)
18. Virmani, J., Kumar, V., Kalra, N., Khandelwal, N.: SVM-based characterization of liver
cirrhosis by singular value decomposition of GLCM matrix. Int. J. Artif. Intel. Soft Comput. 3
(3), 276–296 (2013)
19. Virmani, J., Kumar, V., Kalra, N., Khandelwal, N.: PCA-SVM based CAD system for focal
liver lesions from B-Mode ultrasound. Defence Sci. J. 63(5), 478–486 (2013)
20. Chang, C.C., Lin, C.J.: LIBSVM, a library of support vector machines. ACM Trans. Intell.
Syst. Technol. 2(3), 27–65 (2011)
Advances in EDM: A State of the Art
Manu Anand
Abstract Potentials of data mining in academics have been discussed in this paper.
Various fields of data mining can be used to enhance educational institutional services and improve students' performance, for example by analyzing grades, retention rate, and attendance, giving prior information about examination eligibility based on attendance, evaluating results using marks, and predicting how many students have enrolled in which course. This paper discusses one such aspect, in which distinction has been predicted based on the marks scored by the MCA students of Bharati Vidyapeeth Institute of Computer Applications and Management, affiliated to GGSIPU, using various machine learning algorithms; it has been observed that the Boost algorithm outperforms other machine learning models in the prediction of distinction.
1 Introduction
Every sector and every organization maintains large amounts of data in its databases, and powerful tools are designed to perform data analysis; as a result, mining this data yields golden chunks of "knowledge." There are many alternative terms for data mining, such as knowledge mining, knowledge discovery from databases, knowledge extraction, and pattern analysis, and it can be carried out using various data mining algorithms such as classification, association, clustering, and prediction. The data mining approach plays a pivotal role in decision support systems (DSS).
Educational data mining [1] is promising as a research area with a collection of
computational and psychological methods to understand how students learn. EDM
M. Anand (&)
Bharati Vidyapeeth Institute of Computers Application and Management (BVICAM),
New Delhi, India
e-mail: [Link]@[Link]
develops methods and applies techniques from statistics, machine learning, and data
mining to analyze data collected during teaching and learning [2–4]. In this paper,
machine learning classification models have been used to predict distinction of
students using marks of nine subjects that a student of MCA third semester scored
in their End term exams of GGSIPU. Various subjects whose marks are considered
in student dataset used for predicting distinction are theory of computation, com-
puter graphics, Java programming, data communication and networking, C# pro-
gramming, computer graphics laboratory, Java programming laboratory, C#
programming laboratory, general proficiency. There are totally 112 records with 15
input variables. Some of the considered features like marks of all subjects, per-
centage, and distinction have higher importance than others like name, roll no. in
predicting distinction. The student dataset is used by various machine learning
models namely decision tree model, AdaBoost model, SVM model, linear model,
neural network model [5].
In total, 112 student records having 15 input variables have been considered for
data analysis. These student records are the original results of MCA students of the
third semester in 2014, at Bharati Vidyapeeth Institute of Computer Applications
and Management from GGSIPU. Table 1 describes the subjects associated with
each code in the student dataset. In Table 2, student dataset sample is shown.
Table 3 shows the correlation between each feature. In this, total marks scored by
every student are calculated and then the percentage has been evaluated. Depending
upon the percentage, the distinction has been set to 0 or 1.
In this analysis of distinction prediction, some basic calculations have been performed on the student dataset using a simple method. First, the total marks have been evaluated by summing the marks scored by each student, followed by the calculation of the percentage for each record. Then, the distinction has been marked as 0 or 1 based on the percentage a student has scored: if the percentage is greater than or equal to 75%, the distinction is marked as 1, else it is marked as 0.
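For illustration, this preprocessing can be sketched in a few lines of Python with pandas; the column names and the maximum marks per subject used below are hypothetical placeholders, not the actual field names of the dataset.

import pandas as pd

# Hypothetical subject-mark column names for the nine subjects described above.
SUBJECTS = ['TOC', 'CG', 'Java', 'DCN', 'CSharp', 'CG_Lab', 'Java_Lab', 'CSharp_Lab', 'GP']

def add_distinction(df, max_marks_per_subject=100):
    # Total marks, percentage, and the binary distinction label (1 if >= 75%, else 0).
    df = df.copy()
    df['Total'] = df[SUBJECTS].sum(axis=1)
    df['Percentage'] = 100.0 * df['Total'] / (max_marks_per_subject * len(SUBJECTS))
    df['Distinction'] = (df['Percentage'] >= 75).astype(int)
    return df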
3 Methodology
The methodology is described in Fig. 1. In the first step, the result of MCA students
of the third semester has been taken as the primary data for the prediction of
distinction. After this, various features like the total marks, percentage, and distinction have been evaluated, considering distinction as the target value. Extra fields that were not required for distinction prediction were removed before the algorithms were applied to the student dataset. Out of the 15 input variables, 9 variables have been passed as input for evaluation. Then, different algorithms have been applied on the student dataset. In the fifth step, different models were trained and tested on the dataset with their default parameters. Finally, the evaluation of the models is done on accuracy and sensitivity.
Fig. 1 Methodology flow: feature measurement, data cleansing, and result analysis
The AdaBoost is used to find the importance of each feature. To train a boosted
classifier, AdaBoost algorithm is used. The basic representation of Boost classifier
is as follows:
F_T(x) = \sum_{t=1}^{T} f_t(x)
where each ft is a weak learner that takes an object x as input and returns a
real-valued result indicating the class of the object. Predicted object class is iden-
tified with the sign of the output obtained with the weak learner.
For each sample in the training set, an output or hypothesis is produced by the weak learner. While executing this algorithm, the main focus is on minimizing the training error of the resulting t-stage boosted classifier, which is achieved by the selection of the weak learner and by assigning it a coefficient \alpha_t:

E_t = \sum_i E[F_{t-1}(x_i) + \alpha_t h(x_i)]

Here, F_{t-1}(x) is the boosted classifier that has been built up to the previous stage of training, E(F) is some error function, and f_t(x) = \alpha_t h(x) is the weak learner that is being considered for addition to the final classifier.
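The paper's models were built in Rattle; purely as an illustration of the same idea, an AdaBoost classifier and its feature importances can be obtained in Python with scikit-learn as sketched below (the parameter values are arbitrary choices of ours).

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

def train_boost(X, y):
    # Boosted ensemble F_T(x) = sum_t f_t(x); scikit-learn's default base learner
    # is a depth-1 decision tree (a decision stump).
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    clf = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
    print('test accuracy:', clf.score(X_te, y_te))
    print('feature importances:', clf.feature_importances_)  # importance of each feature
    return clf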
Four machine learning models [5, 6] for distinction prediction have been used.
Rattle has been used where all these models are available. The idea about this
model is discussed below:
(a) Decision tree: It uses a recursive partitioning approach [7].
(b) Support vector machine: SVM searches for so-called support vectors which are
data points that are found to lie at the edge of an area in space which is a
boundary from one class of points to another.
(c) Linear model: Covariance analysis, single stratum analysis of variance, and
regression are evaluated using this model.
(d) Neural net: It is based on the idea of multiple layers of neurons connected to
each other, feeding the numeric data through the network, combining the
numbers, to produce a final answer (Table 4).
4 Model Evaluation
The classifiers' performance has been measured using various parameters like accuracy, sensitivity, and ROC. Sensitivity S_i for class i can be defined as the number of patterns correctly predicted to be in class i with respect to the total number of patterns in class i. Consider p classes and a confusion matrix C of size p × p, where the value C_ij represents the number of patterns of class i predicted in class j; then accuracy and sensitivity can be calculated as follows:
\text{Accuracy} = \frac{\sum_{i=1}^{p} C_{ii}}{\sum_{i=1}^{p} \sum_{j=1}^{p} C_{ij}}

\text{Sensitivity}_i = \frac{C_{ii}}{\sum_{j=1}^{p} C_{ij}}
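A direct translation of these two formulas into Python (our own helper, using the same row-wise convention for the confusion matrix) is shown below.

import numpy as np

def accuracy_and_sensitivity(C):
    # C is the p x p confusion matrix with C[i, j] = patterns of class i predicted as class j.
    C = np.asarray(C, dtype=float)
    accuracy = np.trace(C) / C.sum()
    sensitivity = np.diag(C) / C.sum(axis=1)   # one sensitivity value per class i
    return accuracy, sensitivity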
5 Experimental Result
This section deals with the analysis of the prediction results of all four machine learning classification models on the testing dataset. In Figs. 2 and 4, the precision parameter has been plotted for AdaBoost and SVM and verified for all other models.
In this work, various machine learning classification models are explored with input
variables to predict the distinction of students. The result indicates that Boost
Algorithm outperforms the other classification models.
References
1. Ayesha, S., Mustafa, T., Sattar, A., Khan, I.: Data mining model for higher education system.
Eur. J. Sci. Res. 43(1), 24–29 (2010)
2. Shovon, Md.H.I.: Prediction of student academic performance by an application of K-means
clustering algorithm. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 2(7) (2012)
3. Sharma, T.C.: WEKA approach for comparative study of classification algorithm (IJARCCE).
Int. J. Adv. Res. Comput. Commun. Eng. 2(4) (2013)
4. Kumar, V., Chadha, A.: An empirical study of the applications of data mining techniques in
higher education. Int. J. Adv. Comput. Sci. Appl. 2(3), 80–84 (2011)
5. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
6. Keerthi, S.S., Gilbert, E.G.: Convergence of a generalized SMO algorithm for SVM classifier
design. Mach. Learn. 46(1), 351–360 (2002)
7. Fernandez Caballero, J.C., Martinez, F.J., Hervas, C., Gutierrez, P.A.: Sensitivity versus
accuracy in multiclass problem using Memetic Pareto evolutionary neural networks. IEEE
Trans. Neural Netw. 21, 750–770 (2010)
Proposing Pattern Growth Methods
for Frequent Pattern Mining on Account
of Its Comparison Made
with the Candidate Generation and Test
Approach for a Given Data Set
Abstract Frequent pattern mining is a very important field for mining of associ-
ation rule. Association rule mining is an important technique of data mining that is
meant to extract meaningful information from large data sets accumulated as a
result of various data processing activities. Several algorithms have been proposed to solve the problem of frequent pattern mining. In this paper, we mathematically compare the two most widely used approaches, the candidate generation and test approach and the pattern growth approach, to determine the better approach for a given data set. We come to the conclusion that the pattern growth methods are more efficient in most cases for the purpose of frequent pattern mining on account of their cache-conscious behavior. We have taken a data set and implemented both algorithms on it; the experimental result of the working of both algorithms for the given data set shows that the pattern growth approach is more efficient than the candidate generation and test approach.
Keywords Association rule mining · Candidate generation and test ·
Data mining · Frequent pattern mining · Pattern growth methods
The research work is a small part of supplementary work done by the author beside his base work
on RSTDB, an AI-supported candidate generation and test algorithm.
V. K. Singh (&)
Department of Computer Science and Engineering, Institute of Technology,
Guru Ghasidas Vishwavidyalaya, Bilaspur, Chhattisgarh, India
e-mail: vibhu200427@[Link]
1 Introduction
With the increase in the use of computers, we are accumulating huge amounts of
data every day as a result of various data processing activities. This data can be
used to gain a competitive advantage in the current scenario, where the time
available to take important decisions has gone down. Today, the need for systems
that can help decision makers make valuable decisions on the basis of patterns
extracted from historical data, in the form of reports, graphs, etc., has increased.
The branch of computer science concerned with this subject is data mining. Data
mining combines the human power of detecting patterns with the computational
power of computers to generate patterns. It finds its utility in the design of decision
support systems that are efficient enough to help decision makers make valuable
decisions on time.
Data mining is used for a wide range of applications such as finance, database
marketing, health insurance, medical purpose, bioinformatics, text mining, biode-
fense. There are several data mining tools that help the designer of the system to
simulate such systems such as Intelligent miner, PRW, Enterprise miner, Darwin,
and Clementine. Some of the basic techniques used for the purpose of data mining
are association rule mining, clustering, classification, frequent episode, deviation
detection, neural network, genetic algorithm, rough sets techniques, support vector
machine, etc. For each of the above-mentioned data mining techniques, there are
associated algorithms used to implement the paradigm. In this paper, we are
concerned with association rule mining.
An association rule is an expression of the form X → Y, where X and Y are sets
of items. The intuitive meaning of such a rule is that transactions of the database
which contain X tend to also contain Y. An association rule basically depends on
two measures (illustrated in the sketch after this list):
• Confidence.
• Support.
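A minimal sketch of these two measures over a made-up transaction list (the paper's own Table 1 is not reproduced here) might look as follows.

# Illustrative transactions; not the data set used in the paper.
transactions = [{"a", "d", "f"}, {"b", "c"}, {"a", "b", "d"}, {"a", "d"}, {"b", "d"}]

def support(itemset, db):
    # Fraction of transactions that contain every item of the itemset.
    itemset = set(itemset)
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(X, Y, db):
    # Fraction of transactions containing X that also contain Y.
    return support(set(X) | set(Y), db) / support(X, db)

print(support({"a", "d"}, transactions))       # 0.6
print(confidence({"a"}, {"d"}, transactions))  # 1.0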
Some of the algorithms used for finding association rules in large data sets
include the Apriori algorithm, the partition algorithm, the Pincer-Search algorithm,
the dynamic itemset counting algorithm, and the FP-tree growth algorithm. In this
paper, we have taken into account two approaches for mining association rules:
the candidate generation and test approach and the pattern growth approach. Both
approaches are concerned with the extraction of frequent patterns from large data
sets. Frequent patterns are patterns that occur frequently in the transactions of a
given data set. Frequent pattern mining is an important field of association rule
mining.
2 Literature Survey
In [1], Han et al. proposed the FP-tree as a pattern growth approach. In [2], Agarwal
et al. showed the power of using transaction projection in conjunction with a lexi-
cographic tree structure in order to generate the frequent item sets required for asso-
ciation rules. In [3], Zaki et al. proposed an algorithm that utilizes the structural
properties of frequent item sets to facilitate fast discovery. In [4], Agarwal and
Srikant proposed two fundamentally different algorithms for frequent pattern
mining; the empirical evaluation in the paper showed that the proposed algorithms
proved to be handy compared to previously proposed approaches. In [5], Toivonen
proposed a new algorithm that reduces the database activity in mining; the proposed
algorithm is efficient enough to find association rules in a single database. In [6],
Savasere et al. proposed an algorithm which showed improvement in the
input–output overhead associated with previous algorithms, a feature that proves
handy in many real-life database mining scenarios. In [7], Burdick et al. proposed an
algorithm whose breakdown into algorithmic components showed that parent
equivalence pruning and dynamic reordering were quite beneficial in reducing the
search space, while the relative compression of vertical bitmaps increases the
vertical scalability of the algorithm and reduces the cost of support counting. In [8],
Zaki and Gouda proposed a novel vertical representation, Diffset. The proposed
approach drastically reduces the amount of space required to store intermediate
results. In [9, 10], the author proposed a new algorithm, RSTDB, which also works
on the candidate generation and test mechanism. The algorithm has a new module
that makes it more efficient than the previous approach. In [11], a study of some of
the pattern growth methods, along with a description of the new algorithm RSTDB
for frequent pattern mining, is given. In [12], the algorithm RSTDB is compared
with the FP-tree growth algorithm. In [13], a cache-conscious approach for frequent
pattern mining is given. In [14, 15], the candidate generation and test approach for
frequent pattern mining is given. In [16], RSTDB is proposed as an application.
3 Experiment
In this paper, to compare the two approaches, we have taken two algorithms:
Apriori as a representative of the candidate generation and test mechanism and
FP-tree as a representative of the pattern growth approach, since these two
algorithms are the base algorithms that best explain the two approaches. In this
section, we use Table 1 to explain the difference between the two algorithms.
Table 1 has 15 records and 9 distinct records having the lexicographic property.
The total number of counters required for finding the frequent item sets in the
Apriori algorithm, which follows the candidate generation and test approach, is larger.
If the number of counters is larger, the total amount of space required will be
greater in the case of the Apriori approach. Also, for the generation of the candidate
set at each step of the Apriori algorithm, the whole transaction database has to be
scanned, whereas in the case of FP-tree, once the f-list is generated, each record has
to be scanned only once. Besides these advantages, the FP-tree structure can be
accommodated in memory, and fast referencing is possible. The comparison we are
making is for a small data set, which already shows a difference in efficiency; in a
real-time scenario where the size is very large, the efficiency of FP-tree would
be far better than that of Apriori.
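To make the counter and scan argument concrete, the following hedged sketch tallies how many candidate counters and full database scans an Apriori-style level-wise pass needs. Its candidate generation is deliberately simplified (all k-combinations of items appearing in frequent (k−1)-itemsets) rather than the exact Apriori join-and-prune step.

from collections import Counter
from itertools import combinations

def apriori_counters_and_scans(transactions, min_sup, max_k=3):
    # Count how many candidate counters a level-wise pass allocates and how many
    # full database scans it needs; each level of candidates requires another scan.
    item_counts = Counter(i for t in transactions for i in t)
    counters, scans = len(item_counts), 1
    frequent = {frozenset([i]) for i, c in item_counts.items() if c >= min_sup}
    k = 2
    while frequent and k <= max_k:
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)]
        counters += len(candidates)              # one counter per candidate itemset
        counts = Counter()
        for t in transactions:                   # another full scan of the database
            ts = set(t)
            for c in candidates:
                if c <= ts:
                    counts[c] += 1
        scans += 1
        frequent = {c for c in candidates if counts[c] >= min_sup}
        k += 1
    return counters, scans

In contrast, FP-tree-style mining scans the database twice (once for the f-list, once to build the tree) and then works on the compact in-memory tree, which is the cache-conscious behaviour discussed above.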
Fig. 1 Comparison of the working of the two algorithms in terms of space consumed for
evaluation of frequent patterns
5 Conclusion
Today's data mining algorithms are keener on having the data set required for
computation in main memory to achieve better performance. From Table 2, it is
clear that the amount of space required by the Apriori algorithm violates this need
of the current scenario, where time takes the peak position. From the above example
of the working of Apriori and FP-tree, it is clear that the execution of Apriori will
require a considerably larger amount of space compared to FP-tree. Thus, FP-tree,
employing a tree structure that easily fits in the cache, can be a better option despite
the complexity involved. We therefore propose FP-tree, which is a cache-conscious
effort, as the size of the tree is small enough to reside in memory for the complete
computation. As already discussed, we have taken FP-tree as a representative of the
pattern growth algorithms and Apriori as a representative of the candidate
generation and test algorithms; the result obtained is
meant to give a general idea of the two approaches. The result also guides us toward
the pattern growth approach when selecting an algorithm for frequent pattern
mining, which is clear from its working, it being more cache conscious than the
other approach. Besides the above facts derived from the working of the two
approaches, some positives of the Apriori algorithm are its simplicity of
implementation, which is very important, and the fact that approaches similar to
candidate generation and test are easy to understand.
References
1. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In:
Proceeding of the 2000 ACM SIGMOD International Conference on Management of Data,
pp. 1–12. ACM Press (2000)
2. Agarwal, R.C., Aggarwal, C.C., Prasad, V.V.V.: A tree projection algorithm for generation of
frequent item sets. J. Parallel Distrib. Comput. 61(3), 350–371 (2001)
3. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: New algorithms for fast discovery of
association rules. In: Proceeding of the 3rd International Conference on Knowledge
Discovery and Data Mining, pp. 283–296. AAAI Press (1997)
4. Agarwal, R., Srikant, R.: Fast algorithms for mining association rules. In: The International
Conference on Very Large Databases, pp. 207–216 (1993)
5. Toivonen, H.: Sampling large databases for association rules. In: Proceeding of 22nd Very
Large Database Conference (1996)
6. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining association rules
in large databases. In: Proceeding of the 21st Very Large Database Conference (1995)
7. Burdick, D., Calimlim, M., Gehrke, J.: MAFIA: a maximal frequent itemset algorithm for
transaction databases. In: Proceeding of ICDE’01, pp. 443–452 (2001)
8. Zaki, M., Gouda, K.: Fast vertical mining using diffsets. In: Journal of ACM SIGKDD’03,
Washington, D.C. (2003)
9. Singh, V.K., Singh, V.K.: Minimizing space time complexity by RSTDB a new method for
frequent pattern mining. In: Proceeding of the First International Conference on Human
Computer Interaction IHCI’09, Indian Institute of Information Technology, Allahabad,
pp. 361–371. Springer, India (2009). ISBN 978-81-8489-404-2
10. Singh, V.K., Shah, V., Jain, Y.K., Shukla, A., Thoke, A.S., Singh, V.K., Dule, C., Parganiha,
V.: Proposing an efficient method for frequent pattern mining. In: Proceeding of International
Conference on Computational and Statistical Sciences, Bangkok, WASET, vol. 36, pp. 1184–
1189 (2009). ISSN 2070-3740
11. Singh, V.K., Shah, V.: Minimizing space time complexity in frequent pattern mining by
reducing database scanning and using pattern growth methods. Chhattisgarh J. Sci. Technol.
ISSN 0973-7219
12. Singh, V.K.: Comparing proposed test algorithm RSTDB with FP-tree growth method for
frequent pattern mining. Aryabhatt J. Math. Inf. 5(1), 137–140 (2013). ISSN 0975-7139
13. Singh, V.K.: RSTDB and cache conscious techniques for frequent pattern mining. In:
CERA-09, Proceeding of Fourth International Conference on Computer applications in
Electrical Engineering, Indian Institute of Technology, Roorkee, India, pp. 433–436, 19–21
Feb 2010
14. Singh, V.K., Singh, V.K.: RSTDB a new candidate generation and test algorithm for frequent
pattern mining. In: CNC-2010, ACEEE and IEEE, IEEE Communication Society,
Washington, D.C., Proceeding of International Conference on Advances in Communication
Network and Computing, Published by ACM DL, Calicut, Kerala, India, pp. 416–418, 4–5
Oct 2010
15. Singh, V.K., Singh, V.K.: RSTDB a candidate generation and test approach. Int. J. Res.
Digest, India 5(4), 41–44 (2010)
16. Singh, V.K.: Solving management problems using RSTDB a frequent pattern mining
technique. In: Int. J. Adv. Res. Comput. Commun. Eng., India 4(8), 285–288 (2015)
17. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publisher,
San Francisco, CA (2001)
A Study on Initial Centroids Selection
for Partitional Clustering Algorithms
Abstract Data mining tools and techniques allow an organization to make creative
decisions and subsequently do proper planning. Clustering is used to determine the
objects that are similar in characteristics and group them together. K-means clus-
tering method chooses random cluster centres (initial centroids), one for each
cluster, and this is the major weakness of K-means. The performance and quality of
K-means strongly depends on the initial guess of centres (centroid). By augmenting
K-means with a technique of selecting centroids, several modifications have been
suggested in research on clustering. The first two main authors of this paper have
also developed three algorithms that unlike K-means do not perform random
generation of the initial centres and actually produce same set of initial centroids for
the same input data. These developed algorithms are sum of distance clustering
(SODC), distance-based clustering algorithm (DBCA) and farthest distributed
centroid clustering (FDCC). We present a brief survey of the algorithms available in
the research on modification of initial centroids for K-means clustering algorithm
and further describe the developed algorithm farthest distributed centroid clustering
in this paper. The experimental results carried out show that farthest distributed
centroid clustering algorithm produces better quality clusters than the partitional
clustering algorithm, agglomerative hierarchical clustering algorithm and the hier-
archical partitioning clustering algorithm.
M. Motwani (&)
Department of Computer Science and Engineering, RGPV, Bhopal, India
e-mail: [Link].7@[Link]
N. Arora · A. Gupta
RGPV, Bhopal, India
1 Introduction
K-means [1–4] is a famous partitional algorithm which clusters the n data points into
k groups. It defines k centroids, one for each cluster. For this, k data points are
selected at random from D as initial centroids.
The K-means algorithm has a drawback that it produces different clusters for
every different set of initial centroids. Thus, the quality of clusters formed depends
on the randomly chosen set of initial k centroids [5]. This drawback of K-means is
removed by augmenting K-means with some technique of selecting initial cen-
troids. We discuss in Sect. 2 the different such modifications published in the
literature on the modified K-means algorithm. These proposed algorithms do not per-
form random generation of the initial centres and do not produce different results for
the same input data. In Sect. 3, we discuss the farthest distributed centroid clus-
tering algorithm, followed by its experimental results in Sect. 4. Section 5 contains
the conclusion, and finally, Sect. 6 has the bibliography.
clustering [24]. A filtering method is used to avoid too many clusters from being
formed.
A heuristic method is used in [25] to find a good set of initial centroids. The
method uses a weighted average score of dataset. The rank score is found by
averaging the attribute of each data point. This generates initial centroids that
follow the data distribution of the given set. A sorting algorithm is applied to the
score of each data point and divided into k subsets. The nearest value of mean from
each subset is taken as initial centroid. This algorithm produces the clusters in less
time as compared to K-means. A genetic algorithm for the K-means initialization
(GAKMI) is used in [26] for the selection of initial centroids. The set of initial
centroids is represented by a binary string of length n. Here n is the number of
feature vectors. The GAKMI algorithm uses binary encoding, in which bits set to
one select elements of the learning set as initial centroids. A chromosome repair
algorithm is used before fitness evaluation to convert infeasible chromosomes into
feasible chromosomes. The GAKMI algorithm results in better clusters as compared
to the standard K-means algorithm.
Sum of distance clustering (SODC) [27] algorithm for clustering selects initial
centroids using criteria of finding sum of distances of data objects to all other data
objects. The algorithm uses the concept that good clusters are formed when the
choice of initial k centroids is such that they are as far as possible from each other.
The proposed algorithm results in better clustering on synthetic as well as real
datasets when compared to the K-means technique. Distance-based clustering
algorithm (DBCA) [28] is based on computing the total distance of a node from all
other nodes. The clustering algorithm uses the concept that good clusters are formed
when the choice of initial k centroids is such that they are as far as possible from
each other. Once some point d is selected as initial centroid, the proposed algorithm
computes average of data points to avoid the points near to d from being selected as
next initial centroids.
The farthest distributed centroid clustering (FDCC) algorithm [29] uses the
concept that good clusters are formed when the choice of initial k centroids is such
that they are as far as possible from each other. FDCC algorithm proposed here uses
criteria of sum of distances of data objects to all other data objects. Unlike
K-means, FDCC algorithm does not perform random generation of the initial
centres and produces same results for the same input data. DBCA and FDCC
clustering algorithms produce better quality clusters than the partitional clustering
algorithm, agglomerative hierarchical clustering algorithm and the hierarchical
partitioning clustering algorithm.
The algorithm selects a good set of initial centroids such that the selected initial
centroids are spread out within the data space as far as possible from each other.
Figure 1 illustrates the selection of four initial centroids C1, C2, C3 and C4. As is
evident, there are four clusters in the data space. The proposed technique selects a
point d as the first initial centroid using a distance criteria explained in Sect. 3.2.
Once this point d is selected as initial centroid, the proposed technique avoids the
points near to d from being selected as next initial centroids. This is how C1, C2,
C3 and C4 are distributed as far as possible from each other.
Let the clustering of the n data points in the given dataset D be done into k clusters.
In the farthest distributed centroid clustering (FDCC) algorithm, the distance of each
data point d_i, i = 1 to n, in the given dataset D is calculated from all other data
points, and these distances are stored in a distance matrix DM. The total distance of
each data point d_i with all other data points is then calculated; the total distance for
a point d_i is the sum of all elements in the row of DM corresponding to d_i. These
sums are stored in a sum-of-distances vector SD. The vector SD is sorted in
decreasing order of total distance values. Let P-SD be the vector of data points
corresponding to the sorted vector SD; i.e., P-SD [1] will be the data point whose
sum of distances (available in SD [1]) from all other data points is maximum, and
P-SD [2] will be the data point whose sum of distances (available in SD [2]) from
all other data points is second highest. In general, P-SD [i] will be the data point
whose sum of distances from all other data points is in SD [i].
The first point d of the vector P-SD is the first initial centroid. Put this initial
centroid point in the set S of initial centroids. To avoid the points near to d from
being selected as the next initial centroids, a variable x is defined as x = floor(n/k).
Here, the floor function maps the real number n/k to the largest previous
integer; i.e., it returns the largest integer not greater than n/k.
Now discard the next x number of points of the vector P-SD and define the next
point left after discarding these x numbers of points from this vector P-SD, as the
second initial centroid. Now discard the next x number of points from this vector
P-SD and define the next point left after discarding these x numbers of points from
this vector P-SD, as the third initial centroid. This process is repeated till k numbers
of initial centroids are defined. These k initial centroids are now used in the
K-means process as substitute for the k random initial centroids. K-means is now
invoked for clustering the dataset D into k number of clusters using the initial
centroids available in set S.
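A minimal sketch of this seeding procedure, assuming Euclidean distances and NumPy, is given below; the function name and the hand-off to K-means are illustrative, not the authors' implementation.

import numpy as np

def fdcc_initial_centroids(D, k):
    # FDCC-style seeding as described above: rank points by total distance to all
    # other points, take the top-ranked point, then skip the next floor(n/k) points
    # before taking the next centroid, and so on until k seeds are chosen.
    D = np.asarray(D, dtype=float)
    n = len(D)
    DM = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)  # distance matrix
    SD = DM.sum(axis=1)                                         # sum-of-distances vector
    p_sd = np.argsort(-SD)                  # P-SD: indices in decreasing order of SD
    x = n // k                              # points discarded after each chosen centroid
    chosen = p_sd[::x + 1][:k]              # may yield fewer than k seeds for some n, k
    return D[chosen]

The returned seeds would then replace the random initial centroids, e.g. with scikit-learn's KMeans(n_clusters=k, init=seeds, n_init=1).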
4 Experimental Study
The experiments are performed on core i5 processor with a speed of 2.5 GHz and
4 GB RAM using MATLAB. The comparison of the quality of the clustering
achieved with FDCC algorithm [29] is made with the quality of the clustering
achieved with
1. Partitional clustering technique. The K-means is the partitional technique
available as a built-in function in MATLAB [30].
2. Hierarchical clustering technique. The ClusterData is agglomerative hierarchical
clustering technique available as a built-in function in MATLAB [30]. The
single linkage is the default option used in ClusterData to create hierarchical
cluster tree.
3. Hierarchical partitioning technique. CLUTO [31] is a software package for
clustering datasets. CLUTO contains both partitional and hierarchical clustering
algorithms. The repeated bisections method available in CLUTO is a hierar-
chical partitioning algorithm that initiates a series of k − 1 repeated bisections to
produce the required k clusters. This effectively is the bisect K-means divisive
clustering algorithm and is the default option in CLUTO named as Cluto-rb.
Recall and precision [32] are used to evaluate the quality of clustering achieved.
Recall is the percentage of data points that have been correctly put into a cluster
among all the relevant points that should have been in that cluster. Precision is the
percentage of data points that have been correctly put into a cluster among all the
points put into that cluster.
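A small sketch of these two measures, assuming integer class labels and that each cluster is matched to its majority true class (a pairing the paper does not spell out), could look as follows.

import numpy as np

def cluster_recall_precision(labels_true, labels_pred):
    # For each cluster: recall = correctly placed points / all points of the matched class,
    # precision = correctly placed points / all points put into the cluster.
    # Matching each cluster to its majority true class is an assumption.
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    results = {}
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        dominant = np.bincount(members).argmax()
        correct = int((members == dominant).sum())
        results[int(c)] = (correct / (labels_true == dominant).sum(),   # recall
                           correct / len(members))                      # precision
    return results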
Three real datasets, namely Corel5K, Corel (Wang) and Wine, are used in the
experiments [33]. Corel5K is a collection of 5000 images downloaded from
Website [34]. We have formed ten clusters of these images using K-means,
ClusterData, Cluto-rb and FDCC algorithm. Corel (Wang) database consists of 700
images of the Corel stock photo and is downloaded from Website [35] and contains
1025 features per image. We have formed ten clusters of these images using
K-means, ClusterData, Cluto-rb and FDCC algorithm. The wine recognition dataset
[36] consists of quantities of 13 constituents found in three types of classes of wines
of Italy. The dataset consists of 178 instances. The number of instances in class 1,
class 2 and class 3 are 59, 71 and 48, respectively. We have formed three clusters of
these instances using K-means, ClusterData, Cluto-rb and FDCC algorithm.
The average recall and precision using these algorithms are shown in Table 1. The
graphs plotted for average recall and average precision in percentage using these
techniques are shown in Figs. 2 and 3, respectively. The results show that the recall
and precision of FDCC are better than that of K-means, ClusterData and Cluto-rb.
Hence, the FDCC algorithm produces better quality clusters.
5 Conclusion
References
1. Han, J., Kamber, H.: Data Mining Concepts and Techniques. Morgan Kaufmann Publishers,
Burlington (2002)
2. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations.
In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability,
Berkeley, University of California Press, pp. 281–297 (1967)
3. Dunham, M.: Data Mining: Introductory and Advanced Concepts. Pearson Education,
London (2006)
4. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137
(1982)
5. Khan, S.S., Ahmed, A.: Cluster center initialization algorithm for k-means algorithm. Pattern
Recogn. Lett. 25(11), 1293–1302 (2004)
6. Deelers, S., Auwatanamongkol, S.: Engineering k-means algorithm with initial cluster centers
derived from data partitioning along the data axis with the highest variance. In: Proceedings of
World Academy of Science, Engineering and Technology, vol. 26, pp. 323–328 (2007)
7. Bradley, P.S., Fayyad, U.M.: Refining initial points for K-Means clustering. In: Proceedings
of the 15th International Conference on Machine Learning, Morgan Kaufmann, San
Francisco, CA, pp. 91–99 (1998)
8. Likas, A., Vlassis, N., Verbeek, J.J.: The global k-means clustering algorithm. Pattern
Recogn. 36, 451–461 (2003)
9. Yuan, F., Meng, Z.H., Zhang, H.X., Dong, C.R.: A new algorithm to get the initial centroids.
In: Proceedings of the 3rd International Conference on Machine Learning and Cybernetics,
Shanghai, pp. 26–29 (2004)
10. Barakbah, A.R., Helen, A.: Optimized K-means: an algorithm of initial centroids optimization
for K-means. In: Proceedings of Soft Computing, Intelligent Systems and Information
Technology (SIIT), pp. 63–66 (2005)
11. Fahim, A.M., Salem, A.M., Torkey, F.A., Ramadan, M.A.: An efficient enhanced k-means
clustering algorithm. J. Zhejiang Univ. Sci. 7(10), 1626–1633 (2006)
12. Barakbah, A.R., Arai, K.: Hierarchical K-means: an algorithm for centroids initialization for
K-means. Rep. Fac. Sci. Eng. 36(1) (2007) (Saga University, Japan)
13. Barakbah, A.R, Kiyoki, Y.: A pillar algorithm for K-means optimization by distance
maximization for initial centroid designation. In: IEEE Symposium on Computational
Intelligence and Data Mining (IDM), Nashville-Tennessee, pp. 61–68 (2009)
14. Arthur, D., Vassilvitskii, S.: K-means++: the advantages of careful seeding. In: Proceedings
of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA,
Society for Industrial and Applied Mathematics, pp. 1027–1035 (2007)
15. Ahmed, A.H., Ashour, W.: An initialization method for the K-means algorithm using RNN
and coupling degree. Int. J. Comput. Appl. (0975–8887) 25(1) (2011)
16. Huang, L., Du, S., Zhang, Y., Ju, Y., Li, Z.: K-means initial clustering center optimal
algorithm based on Kruskal. J. Inf. Comput. Sci. 9(9), 2387–2392 (2012)
17. Kruskal, J.: On the shortest spanning subtree and the travelling salesman problem. Proc. Am.
Math. Soc. 7, 48–50 (1956)
18. Fahim, A.M., Salem, A.M., Torkey, F.A., Ramadan, M.A, Saake, G.: An efficient K-means with
good initial starting points. Georgian Electron. Sci. J. Comput. Sci. Telecommun. 19(2) (2009)
19. Reddy, D., Jana, P.K.: Initialization for K-mean clustering Voronoi diagram. In: International
Conference on C3IT-2012, Hooghly, Procedia Technology (Elsevier) vol. 4, pp. 395–400,
Feb 2012
20. Preparata, F.P., Shamos, M.I.: Computational Geometry—An Introduction. Springer, Berlin,
Heidelberg, Tokyo (1985)
21. Naik, A., Satapathy, S.C., Parvathi, K.: Improvement of initial cluster center of c-means using
teaching learning based optimization. Procedia Technol. 6, 428–435 (2012)
22. Yang, S.Z., Luo, S.W.: A novel algorithm for initializing clustering centers. In: Proceedings
of International Conference on IEEE Machine Learning and Cybernetics, China, vol. 9,
pp. 5579–5583 (2005)
23. Ye, Y., Huang, J., Chen, X., Zhou, S., Williams, G., Xu, X.: Neighborhood density method
for selecting initial cluster centers in K-means clustering. In: Advances in Knowledge
Discovery and Data Mining. Lecture Notes in Computer Science, vol. 3918, pp. 189–198
(2006)
24. Zhou, S., Zhao, J.: A neighborhood-based clustering algorithm. In: PAKDD 2005, LNAI 3518,
pp. 361–371 (2005)
25. Mahmud, M.S., Rahman, M., Akhtar, N.: Improvement of K-means clustering algorithm with
better initial centroids based on weighted average. In: 7th International Conference on
Electrical and Computer Engineering, Dhaka, Bangladesh, Dec 2012
26. Kwedlo, W., Iwanowicz, P.: Using genetic algorithm for selection of initial cluster centers for
the K-means method. In: International Conference on Artificial Intelligence and Soft
Computing. Springer Notes on Artificial Intelligence, pp. 165–172 (2010)
27. Arora, N., Motwani, M.: Sum of distance based algorithm for clustering web data. Int.
J. Comput. Appl. 87(7), 26–30 (2014)
28. Arora, N., Motwani, M.: A distance based clustering algorithm. Int. J. Comput. Eng. Technol.
5(5), 109–119 (2014)
29. Arora, N., Motwani, M.: Optimizing K-Means by fixing initial cluster centers. Int. J. Curr.
Eng. Technol. 4(3), 2101–2107 (2014)
30. MathWorks MatLab: The Language of Technical Computing (2009)
31. Karypis, G.: CLUTO: A Clustering Toolkit. Release 2.1.1, Tech. Rep. No. 02-017. University
of Minnesota, Department of Computer Science, Minneapolis, MN 55455 (2003)
32. Kowalski, G.: Information Retrieval Systems—Theory and Implementation. Kluwer
Academic Publishers (1997)
33. Veenman, C.J., Reinders, M.J.T., Backer, E.: A maximum variance cluster algorithm. IEEE
Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)
34. Duygulu, P., et al.: Object recognition as machine translation: learning a lexicon for a fixed
image vocabulary. In: Proceedings of the 7th European Conference on Computer Vision,
pp. 97–112 (2002)
35. Wang, J.Z., Li, J., Wiederhold, G.: SIMPLIcity: semantics-sensitive integrated matching for
picture libraries. IEEE Trans. Pattern Anal. Mach. Intell. 23(9), 947–963 (2001)
36. Forina, M., Aeberhard, S.: UCI Machine Learning Repository (1991)
A Novel Rare Itemset Mining Algorithm
Based on Recursive Elimination
Keywords Apriori · Frequent pattern mining · Rare itemset mining ·
RELIM · Maximum support
1 Introduction
With the huge influx of data in every real-world application, data analysis becomes
important, and data mining helps in effective, efficient, and scalable analysis by
uncovering many hidden associations in the data which otherwise could not be
interpreted yet are useful at the same time. Data mining is the non-trivial process of
extraction of hidden, previously unknown and potentially useful information from
large databases [1, 2]. It differs from retrieval tasks in the fact that knowledge
(patterns) can be discovered through data mining. Pattern mining, being a basic data
mining task, enables the extraction of hidden patterns from sets of data records called
transactions. The various data mining techniques include association rule mining
(ARM), classification, clustering, and outlier analysis. ARM is the process of
finding frequent itemsets/patterns, associations, and correlations among sets of items or
objects in transactional databases, relational databases, and other information
repositories [1].
M. Kataria (&)
Department of Computer Science and Engineering, Thapar University, Patiala, Punjab, India
e-mail: mohakkataria@[Link]
C. Oswald · B. Sivaselvan
Department of Computer Engineering, Design and Manufacturing Kancheepuram,
Indian Institute of Information Technology, Chennai, India
e-mail: coe13d003@[Link]
B. Sivaselvan
e-mail: sivaselvanb@[Link]
The motivation for ARM emerged from market basket analysis, which is a
collection of items purchased by a customer in an individual customer transaction
[2], for example a customer’s visit to a grocery store or an online purchase from
[Link]. Huge collections of transactions are received from them. An analysis
of the transaction database is done to find frequently occurring sets of items, or
itemsets, that appear together. Frequent pattern (itemset) mining (FPM) is an
important and non-trivial phase in ARM, followed by rule generation [1]. Let
I = {i_1, i_2, i_3, ..., i_m} be a set of items, and let TD = <T_1, T_2, T_3, ..., T_n> be a
transaction database, where T_i (i ∈ [1..n]) is a transaction containing a set of
items in I. The support of a pattern X, where X is a set of items, is the number of
transactions containing X in TD. A pattern (itemset) X is frequent if its support is not
less than a user-defined minimum support (min_supp = α). FPM algorithms con-
centrate on mining all possible frequent patterns in the transactional database.
The second phase is relatively straightforward compared to the first phase.
Algorithms for ARM have primarily focused on the first phase as a result of the
potential number of frequent itemsets being exponential in the number of different
items, although the actual number of frequent itemsets can be much smaller. Thus,
there is a need for algorithms that are scalable. Many efficient algorithms have been
designed to address these criteria, the first of which was Apriori [2]. It uses the prior
knowledge that “all non-empty subsets of a frequent itemset must also be
frequent.” A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I
and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are called the
antecedent and the consequent of the rule, respectively. A rule which satisfies a
minimum confidence threshold is said to be an interesting rule. The rule X ⇒ Y has
confidence c, where c is the fraction of transactions containing X that also contain Y,
i.e. c = support(X ∪ Y)/support(X).
The current literature on mining is focused primarily on frequent itemsets. However,
rare itemsets, too, are of high interest in some applications.
2 Related Work
The scarce literature on the subject of rare itemset mining exclusively adapts the
general levelwise framework of pattern mining around the seminal Apriori algo-
rithm to various forms of the frequent pattern algorithms like FP-growth, ECLAT,
H-mine, Counting Inference, RELIM [2, 5–8]. A detailed survey of seminal
FP-based algorithms can be seen in [9]. These methods provide a large itemset
search space and associations which are not frequent, but these associations will be
incomplete, either due to restrictive definitions or due to high computational cost.
Hence, as argued by [10], these methods will not be able to collect a huge number of
potentially interesting rare patterns. As a remedy, we put forward a novel and
simple approach toward the efficient and complete extraction of RIs. Mining for RIs
has received increased focus from researchers, and in the recent past, a few
algorithms have been proposed.
Apriori-inverse algorithm by Y. S. Koh et al. uses the basic Apriori approach for
mining sporadic rules (rules having itemsets with low support value and high
confidence) [11]. This algorithm was the seminal work for RI mining. Laszlo
Szathmary et al. proposed two algorithms, namely MRG-Exp, which finds minimal
rare itemsets (mRIs) with a naive Apriori approach and a rare itemset miner
algorithm (ARIMA) which retrieves all RIs from mRIs [3]. Rarity algorithm by
Luigi Troiano et al. implements a variant Apriori strategy [12]. It first identifies the
longest rare itemsets at the top of the power set lattice and moves down the lattice,
pruning frequent itemsets and tracking only those that are confirmed to be rare.
Laszlo Szathmary et al. have proposed Walky-G algorithm which finds mRIs using
a vertical mining approach [3]. The major limitation of the existing algorithms is
that they work with the assumption that all itemsets having support <min supp form
the set of RIs. This notion states that all IFIs are RIs, while this is not the case. All
IFIs may or may not be RIs, but all RIs are IFIs. RIs are those IFIs which are
relatively rare.
The other limitation with the present algorithms is that they propose to find mRIs
only; none of the algorithms except ARIMA takes into consideration the
exhaustive generation of the set of RIs. As these algorithms use the Apriori levelwise
approach, so generating the exhaustive set of RIs would be a very space and time
expensive approach as is evident from the paper by Luigi Troiano et al., in which
the authors took subsets of the benchmark datasets to analyze results of ARIMA
and rarity algorithms [12]. Since the number of RIs produced is very large, the
memory usage and execution time are high for such exhaustive RI generation. The
proposed algorithm overcomes the repeated scan limitation of Apriori-based
approaches based on the recursive elimination (RELIM) algorithm.
We introduce a RELIM-based algorithm for the exhaustive generation of rare
itemsets from a given database. Like RELIM for frequent itemsets, this algorithm
follows the same representation for transactions. In the first scan of the database,
the frequency of each unique item present in the transactions of the database is
calculated. The items are sorted in ascending order according to their frequencies;
items having the same frequency can be in any order. The relim prefix list for each
item is initialized with its support value taken to be 0 and its suffix-set list set to ∅.
The next step is to reorder the items in each transaction according to the ascending
order obtained in the previous step; this step takes one more pass over the database.
After sorting a transaction, P = prefix[T′]; i.e., the first item of the sorted transaction
T′ is taken as the prefix, and the remaining items are taken as the suffix-set of the
transaction. Then, for each transaction, the suffix-set is inserted into the suffix-set
list of prefix P in the relim prefix list.
After populating the relim prefix list with all of the sorted transactions, the relim
prefix list is traversed in ascending order of the item frequencies found in step 1.
For each prefix P in the relim prefix list, and for each suffix-set S(i) in the suffix-set
list S of P, all possible subsets except the empty set are generated, and the subsets
prefixed with P are stored in a hashmap along with their support values. If a
generated subset already exists in the hashmap, its associated value is incremented.
Now let P′ = prefix(S(i)) be the first item of the suffix-set S(i). The suffix-set is then
removed from the suffix-set list of prefix P. Let S′(i) = suffix-set(S(i)), i.e. S(i)
without its first item. S′(i) is added to the suffix-set list of prefix P′ in the relim
prefix list, and the corresponding support of P′ is incremented by 1. Once the
iteration through the suffix-set list of prefix P is over, all the candidate rare itemsets
with prefix P are available in the hashmap, and pruning the hashmap yields the rare
itemsets prefixed with P. Subset generation is done with an iterative approach, and a
hashmap data structure is used in the process of pruning itemsets to optimize the
implementation.
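A minimal runnable sketch of this construction is given below. It assumes max_supp is expressed as a fraction of the number of transactions and simplifies the authors' extended prefix lists by pushing reduced suffix-sets directly onto the next prefix; it is not the authors' exact implementation.

from collections import defaultdict
from itertools import combinations

def rare_itemsets(transactions, max_supp):
    # RELIM-style construction as described above: items are ordered by ascending
    # frequency, each sorted transaction is split into a prefix and a suffix-set, and
    # prefixes are eliminated in f-list order while every non-empty subset of each
    # suffix-set is counted together with its prefix.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    f_list = sorted(freq, key=lambda item: freq[item])        # ascending frequency
    rank = {item: r for r, item in enumerate(f_list)}

    prefix_list = defaultdict(list)                           # prefix -> list of suffix-sets
    for t in transactions:
        ts = sorted(set(t), key=lambda item: rank[item])
        if ts:
            prefix_list[ts[0]].append(tuple(ts[1:]))

    candidate = defaultdict(int)
    for p in f_list:                                          # eliminate prefixes in order
        suffix_sets = prefix_list[p]
        candidate[(p,)] += len(suffix_sets)                   # support of the prefix itself
        for ss in suffix_sets:
            for r in range(1, len(ss) + 1):                   # all non-empty subsets of ss
                for sub in combinations(ss, r):
                    candidate[(p,) + sub] += 1
            if ss:                                            # push the reduced suffix-set on
                prefix_list[ss[0]].append(ss[1:])
        prefix_list[p] = []

    threshold = max_supp * len(transactions)
    return {itemset: c for itemset, c in candidate.items() if c <= threshold}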
3.1 Illustration
After scanning the database and counting the frequencies of the unique items, seven
unique items remain in sorted order, as shown in Fig. 1. Then, a prefix list of seven
items is generated with an initial support of 0. In the next pass, the transactions are
sorted in the order of the frequencies of their items, and for each sorted transaction T,
suffix-set[T] is inserted into the suffix-set list of prefix[T] in the relim prefix list. In
this case, the first transaction {a, d, f} is converted to {f, a, d}, and {a, d}, being the
suffix-set of the transaction, is stored in the suffix-set list of prefix f. The relim prefix
list is populated with all the transactions, and then the recursive elimination of each
prefix is carried out, following the pseudocode below.
for f ∈ F′ do
  P ← prefix(f); P.support ← 0
  P.suffixsetlist ← ∅
end for
call sort transaction and add to prefixlist();
for each prefix P ∈ PREFIXLIST do
  candidate[P] ← P.support
  for each suffix-set SS ∈ P.suffixsetlist do
    call calc support candidate subsets(SS, P);
    P.suffixsetlist ← P.suffixsetlist − SS
    P.support−−;
    if SS ≠ ∅ then
      P′ ← prefix(SS); SS′ ← suffix(SS)
      P′.suffixsetlist ← P′.suffixsetlist ∪ SS′
      P′.support++;
    end if
  end for
  R_P ← prune candidatelist() // set of RIs with prefix P
end for
Procedure sort transaction and add to prefixlist(): Find the sorted transaction and add it to the prefix list
Method:
for each transaction T ∈ D do
  T′ ← ∅
  for each item I ∈ F′ do
    if (I ∈ T) then
      T′ ← T′ ∪ {I}
    end if
  end for
  P′ ← prefix(T′); SS′ ← suffix(T′)
  P′.suffixsetlist ← P′.suffixsetlist ∪ SS′
  P′.support++;
end for
The first prefix is g, with a support (number of suffix-sets in its suffix-set
list) of 1 and the only suffix-set being {e, c,
b}. Now all possible subsets of this suffix-set are generated except empty set which
are {b}, {c}, {e}, {e, b}, {e, c}, {b, c}, {e, c, b}, and their supports are calculated
and are inserted in the candidate list. Candidate list is pruned, and RIs are retained.
In this case, all the subsets generated of prefix g are rare.
After generating RIs, an extension for prefix g is created and all the suffix-sets of
g as transactions are listed, and the same is inserted into extended relim prefix list
for prefix g. The only suffix-set of g is {e, c, b}.
Procedure calc support candidate subsets(SS, P): Find the support of candidate subsets
Method:
sub ← ∅
sub ← generate subsets(SS); // generate all non-empty subsets of the items in the suffix-set SS and add them to the set of subsets sub
for each set S ∈ sub do
  candidate[P ∪ S]++;
end for

Procedure prune candidatelist(): Retain only the rare itemsets from the candidate list
Method:
R ← ∅
m ← count of RIs
for each candidate itemset X in the hashmap candidate do
  if candidate[X] ≤ max_supp × #transactions then
    m++; R ← R ∪ X
  end if
end for
return R
So prefix[{e, c, b}] = e, and suffix-set[{e, c, b}] = {c, b}. Hence, the suffix-set {c,
b} is added to the suffix-set list of prefix e, and the support of e is incremented to 1
in the extended relim prefix list for prefix g. After populating this extended relim
prefix list, the suffix-sets in the extended prefix list are added to the corresponding
entries of the relim prefix list, and the support values are added from the extended
prefix list to the relim prefix list. So {c, b} is added to the suffix-set list of e, and the
support of e is incremented. This process is repeated for every prefix in the main
relim prefix list.
4 Proof of Correctness
Simulation experiments are performed on an Intel Core i5-3230M CPU 2.26 GHz
with 4 GB main memory and 500 GB hard disk on Ubuntu 13.04 OS Platform and
are implemented using C++ 11 standard. The standard frequent itemset mining
(FIMI) dataset is taken for simulation [13]. Results presented in this section are over
Mushroom dataset which contains 8124 transactions with varying number of items
per transaction (maximum 119). The detailed results are tabulated in Table 1. In the
simulation for this algorithm, various combinations of number of items per trans-
actions and number of transactions of mushroom data set were used and the portion
of data used from mushroom dataset was truncated using the first item and first
transaction as pivot. There is a general trend, which we see that as we increase the
dimension of the portion we use for getting RIs out of the mushroom dataset, the
time taken to mine the RI increases.
Figure 2 compares the time taken (plotted on a logarithmic scale) to mine RIs with a
varying number of items per transaction, keeping the number of transactions
constant, for the RareRELIM and Rarity approaches. It can be clearly seen that the
time taken to mine RIs increases when we increase the items per transaction while
keeping the number of transactions constant. This is because a larger data set leads
to more candidate itemsets and hence more time taken to prune the RIs from the
candidate sets. For example, for 8124 transactions, the time taken by our approach
increases from 0.7 s to 230.1 s as we increase the number of items per transaction
from 25 to 60.
Figure 3 compares the time taken (again on a logarithmic scale) to mine RIs with a
varying number of transactions, keeping the number of items per transaction
constant, for the RareRELIM and Rarity approaches. It can be clearly
Table 1 Detailed results for mushroom dataset for various max_supp
max_supp (%)  Time taken, RELIM-based RI (s)  Time taken, Rarity (s)  # Candidate RI  # RI
#Items per transaction: 40 and no. of transactions: 5000
20    7.02     9.5472     23,549     23,286
10    6.957    9.46152    23,549     22,578
5     6.91     9.3976     23,549     20,382
2.5   6.864    9.33504    23,549     17,128
1     6.848    9.31328    23,549     14,162
#Items per transaction: 40 and no. of transactions: 8124
20    27.699   38.22462   785,887    785,847
10    27.409   37.82442   785,887    785,847
5     27.393   37.80234   785,887    785,575
2.5   30.701   42.36738   785,887    785,073
1     31.87    43.9806    785,887    781,829
#Items per transaction: 50 and no. of transactions: 5000
20    17.004   23.46552   101,283    101,016
10    17.05    23.529     101,283    100,176
5     17.035   23.5083    101,283    97,258
2.5   19.422   26.80236   101,283    90,828
1     16.77    23.1426    101,283    78,158
#Items per transaction: 50 and no. of transactions: 8124
20    28.938   40.80258   138,079    137,875
10    27.627   38.95407   138,079    137,109
5     28.065   39.57165   138,079    134,891
2.5   29.031   40.93371   138,079    128,895
1     28.314   39.92274   138,079    113,663
#Items per transaction: 60 and no. of transactions: 5000
20    150.4    212.064    851,839    850,662
10    147.108  207.42228  851,839    845,550
5     147.862  208.48542  851,839    826,958
2.5   147.489  207.95949  851,839    782,490
1     145.018  204.47538  851,839    692,178
#Items per transaction: 60 and no. of transactions: 8124
20    228.072  328.42368  1,203,575  1,202,659
10    227.807  328.04208  1,203,575  1,198,659
5     232.191  334.35504  1,203,575  1,185,813
2.5   230.1    331.344    1,203,575  1,148,930
1     227.854  328.10976  1,203,575  1,050,463
seen that in this case also, the time taken to mine RIs increases when we increase the
number of transactions while keeping the items per transaction constant, for the
same reason: as the data set grows, more candidates are generated and hence more
time is taken to prune the RIs. For example, for 60 items per transaction, the time
taken by our approach increases from 39.64 s to 227.81 s as we increase the number
of transactions from 1000 to 8124. In general, however, the RareRELIM algorithm
takes less time than the Rarity algorithm, because the Rarity approach maintains
both a candidate set and a veto list; it generates both of them and uses both of them
to classify an itemset as rare or not, and generating both candidate itemsets and veto
lists takes more time.
Fig. 2 Time taken (s, logarithmic scale) to mine RIs versus number of items per transaction (25–60), with the number of transactions fixed at 8124, for RareRELIM and Rarity
Fig. 3 Time taken (s, logarithmic scale) to mine RIs versus number of transactions (1000–9000), with the number of items per transaction fixed at 60, for RareRELIM and Rarity
Figure 4 shows the number of RIs for varying maximum support for a fixed portion
of the mushroom data set, i.e., 60/8124 (items per transaction/number of
transactions). We see a general rise in the graph because, as we keep increasing the
maximum support value used to filter the RIs from the database, more itemsets fall
below that threshold and are classified as rare. As can be seen from the graph, as we
increase the max_supp value from 1 to 20%, the number of RIs increases from
1,050,463 to 1,202,659, an increase of about 14%, leading to the conclusion that we
obtain more RIs at a higher max_supp value.
Fig. 4 Maximum support (%) versus number of rare itemsets for the 60 × 8124 mushroom dataset
6 Conclusion
The paper has presented a novel approach to mine rare itemsets, employing RELIM
strategy. Efficient pruning strategies along with a simple data structure have been
employed, and results indicate the significant decline in time taken to mine the rare
itemsets. Future work shall focus on reducing the storage space taken for generating
the RIs. Moreover, efforts shall be put in using the mined rare itemsets in other data
mining strategies.
References
1. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, Los Altos
(2000)
2. Agarwal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In:
Bocca, J.B., Jarke, M., Zaniolo, C. (eds.) VLDB’94, Proceedings of 20th International
Conference on Very Large Data Bases, pp. 487–499. 12–15 September 1994, Morgan
Kaufmann, Santiago de Chile, Chile (1994)
3. Szathmary, L., Valtchev, P., Napoli, A., Godin, R.: Efficient vertical mining of minimal rare
itemsets. In: CLA, pp. 269–280. Citeseer (2012)
4. Szathmary, L., Napoli, A., Valtchev, P.: Towards rare itemset mining. In: 19th IEEE
International Conference on Tools with Artificial Intelligence, 2007, ICTAI 2007, vol. 1,
pp. 305–312. IEEE (2007)
5. Borgelt, C.: Keeping things simple: finding frequent item sets by recursive elimination. In:
Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent
Pattern Mining Implementations, pp. 66–70. ACM (2005)
6. Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation: a
frequent-pattern tree approach. Data Min. Knowl. Discov. 8(1), 53–87 (2004)
7. Goethals, B.: Survey on frequent pattern mining (2003)
8. Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., Lakhal, L.: Mining frequent patterns with
counting inference. ACM SIGKDD Explor. Newsl 2(2), 66–75 (2000)
9. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future
directions. Data Min. Knowl. Disc. 15(1), 55–86 (2007)
10. Weiss, G.M.: Mining with rarity: a unifying framework. ACM SIGKDD Explor. Newsl 6(1),
7–19 (2004)
11. Koh, Y.S., Rountree, N.: Finding sporadic rules using apriori-inverse. In: Advances in
Knowledge Discovery and Data Mining, pp. 97–106. Springer (2005)
12. Troiano, L., Scibelli, G., Birtolo, C.: A fast algorithm for mining rare itemsets. In: 2009 Ninth
International Conference on Intelligent Systems Design and Applications, pp. 1149–1155.
IEEE (2009)
13. Frequent Itemset Mining Implementations (FIMI) dataset repository
Computation of Various Entropy
Measures for Anticipating Bugs
in Open-Source Software
1 Introduction
Software undergoes regular maintenance and updating with the introduction of new
features, enhancement of existing features, and bug repair. Software use and
dependency have increased over time, and software is now required almost
everywhere. Ever-increasing user demand leads to tremendous changes in software,
making it more complex over time. The process of maintenance is a crucial phase in
the software life cycle.
2 Literature Review
any decision tree and information selection algorithm. Menzies et al. [11] used
NASA Metrics Data Program (MDP) for his study and concluded that learning
algorithm used is more relevant than static source code metric. The effect of using
metrics such as McCabe, LOC, and Halstead is compared with algorithms such as
J48, OneR, and Naïve Bayes. Chaturvedi et al. [12] proposed a method of pre-
dicting the potential complexity of code changes in subcomponents of Bugzilla
project using bass model. This approach helps in predicting the source code
changes yet to be diffused in a system. D’Ambros et al. [13] have done extensive
bug prediction along with releasing data set on their website consisting of many
software systems. They have set a benchmark by comparing the approach devel-
oped by them with the performance of several existing bug prediction approaches.
Giger et al. [14] introduced into bug prediction a metric based on the number of modified
lines, using fine-grained source code changes (SCCs). Shihab et al. [15] proposed a
statistical regression model to study the eclipse open-source software. This
approach can predict post-release defects using the number of defects in previous
version, and the proposed model is better over existing PCA-based models.
where α ≠ 1, α > 0, P_i ≥ 0, and Σ_{i=1}^{n} P_i = 1; α is a real parameter, n is the number of
files, and i varies from 1 to n. P_i is the probability of change in a file, and the
entropy is maximum when all files have the same probability of change, i.e. when
P_i = 1/n for all i ∈ {1, 2, ..., n}. The entropy is minimum when some element k has
probability P_k = 1 and P_i = 0 for all i ≠ k. Havrda and Charvat [4] gave the first
non-additive measure of entropy, called structural β-entropy. This quantity permits
Table 1 Changes in File1, File2, and File3 with respect to time period t1, t2, and t3
The probability of File1 for t1 = 2/5 = 0.4, the probability of File2 for t1 = 1/5 = 0.2,
and the probability of File3 for t1 = 1/5 = 0.2. Similarly, the probabilities for time
periods t2 and t3 can be calculated. Using the probabilities found by the
above-mentioned method, the entropies for all
time periods could be calculated. In this study, Renyi [3], Havrda–Charvat [4], and
Arimoto [5] entropies are calculated using Eqs. (1)–(3). When changes occur in all
files, the entropy is maximum, while it is minimum when most changes occur in a
single file.
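As a small illustration of how such an entropy value could be computed for one time period, the sketch below normalizes per-file change counts into probabilities and evaluates the Renyi entropy; the base-2 logarithm and the example counts are assumptions, not the paper's data.

import numpy as np

def renyi_entropy(change_counts, alpha):
    # Renyi entropy of the file-change distribution for one time period; change_counts[i]
    # is the number of changes in file i during that period, with alpha > 0 and alpha != 1.
    # Normalizing by the period's total changes and using log base 2 are assumptions.
    p = np.asarray(change_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(np.log2(np.sum(p ** alpha)) / (1.0 - alpha))

# Hypothetical per-file change counts for one period:
print(renyi_entropy([2, 1, 1, 1], alpha=0.5))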
Fig. 1 Renyi entropy over the years 2007–2015 for α = 0.1 to 0.9
Fig. 2 Havrda–Charvat entropy over the same period for β = 0.1 to 0.9
The simple linear regression (SLR) [19] is a statistical approach which enables
predicting the dependent variable on the basis of an independent variable; it has
been used to fit the entropy measure as the independent variable H(t) and the bugs
observed as the dependent variable m(t) using Eq. (4):

$$m(t) = r_0 + r_1 H(t) \qquad (4)$$

where r_0 and r_1 are regression coefficients. Once the values of the regression
coefficients r_0 and r_1 are obtained, we predict the future bugs by substituting them
in Eq. (4). The simple linear regression model is used to study the code change
process [20]. SLR is fitted for the independent variable H(t) and the dependent
variable m(t) using Eq. (4). Here, m(t) represents the number of bugs recorded for
the specified Bugzilla [1] subcomponents, namely XSLT, Reporter, XForms,
General, Telemetry Dashboard, Verbatim, Geolocation, MFBT, MathML,
X-Remote, Widget Android, and XBL. The regression coefficients r_0 and r_1 are
calculated by varying the values of the α, β, γ parameters in Eqs. (1)–(3), taking
entropy as the independent variable and the number of recorded bugs as the
dependent variable. We varied the values of α, β, γ from 0.1 to 0.9, respectively, for
each entropy measure and studied the effect of the different entropy measures on the
bug prediction technique [21].
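The following sketch, with entirely hypothetical yearly values, shows the kind of least-squares fit of Eq. (4) and the resulting R² that the study performs in SPSS; the entropy and bug numbers below are illustrative only.

import numpy as np

# Hypothetical yearly entropy values H(t) and observed bug counts m(t); the actual
# values come from the paper's Bugzilla data and are not reproduced here.
H = np.array([0.62, 0.71, 0.55, 0.80, 0.91, 0.87, 0.76, 0.69, 0.83])     # 2007-2015
m = np.array([40.0, 55.0, 35.0, 90.0, 130.0, 120.0, 95.0, 70.0, 105.0])

r1, r0 = np.polyfit(H, m, 1)           # fit m(t) = r0 + r1 * H(t), as in Eq. (4)
predicted = r0 + r1 * H

ss_res = np.sum((m - predicted) ** 2)
ss_tot = np.sum((m - m.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot      # coefficient of determination R^2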
The values of R, R², adjusted R², the standard error of the estimate, and the
regression coefficients for the entropy measures, namely Renyi [3], Havrda–Charvat
[4], and Arimoto [5], have been calculated by varying the parameter value in each
entropy measure from 0.1 to 0.9 for the years 2007 to 2015 and are illustrated in
Table 3.
R value determines the correlation between the sets of observed and predicted data,
and its value lies between −1 and 1. R², also known as the coefficient of
determination, determines the closeness of the data to the fitted regression line and
lies between 0 and 1. R-square estimates the percentage of total variation about the
mean, and a value close to 1 indicates that the data fit the model appropriately.
Adjusted R² represents the percentage of variation based only on the independent
variables which affect the dependent variable. The results of the simple linear
regression are presented in Table 3; the table gives the values of R, R², adjusted R²,
the standard error of the estimate, and the regression coefficients. The value of the
parameter α in Renyi's entropy [3] measure in Eq. (1) is varied from 0.1 to 0.9, and
the entropy is calculated for the
time period varying from year 2007 to 2015. The entropy calculated is used to
predict the bugs using simple linear regression model in SPSS. The regression
coefficients and hence the predicted value for the future bugs are evaluated. It is
observed that as the value of a progresses from 0.1 to 0.9, a decline in the value of
R2 is observed, which reduces from 0.993 for a = 0.1 to 0.998 for a = 0.9. The
value of parameter b in Havrda–Charvat [4]’s entropy measure in Eq. (2) is varied
from 0.1 to 0.9 and entropy for the period varying from year 2007 to 2008 is
calculated. Then, the regression coefficients evaluated through SLR are used to
predict the bugs to be appeared in system. A decline in the value of R2 is observed
as the b progresses from 0.1 to 0.9. It is seen that for b = 0.1 the value of R2 is
0.991, while following continuous degradation as b value increases at b = 0.9, the
R-square value is found to be 0.988. Similarly, for the Arimoto [5] entropy in Eq. (3), the same SLR approach as mentioned above is applied, and hence the regression coefficients are calculated. The predicted values are calculated, and the R² value is analysed. It is found that, unlike the Renyi [11] and Havrda–Charvat [4] entropy measure analyses with SLR, in Arimoto [5], as we increase the value of γ from 0.1 to 0.9, the value of R² improves at γ = 0.9. The value of R² was similar for γ = 0.1 to 0.8, and it increased significantly at γ = 0.9.
A graph of R-square against the parameter value ranging from 0.1 to 0.9 in Eqs. (1)–(3) is plotted (Fig. 4), and it is observed that the R-square value for bugs predicted by the Renyi [3] entropy measure is close to 1. It is observed that the simple linear regression approach predicts the future bugs with precision. The value of R-square indicates that the Renyi [3] entropy measure used as the independent variable in the simple linear regression approach provides the best predicted value of future bugs, while the best R-square values for bugs predicted by the Havrda–Charvat [4] and Arimoto [5] entropies are 99 and 99.1%, respectively.
Goodness-of-fit curve has been plotted between observed and predicted values.
Goodness of fit for simple linear regression depicts the fitting of the observed bug
value with the predicted bug’s value. Simple linear regression method significantly
anticipates the future bugs using the number of bugs recorded from the subsystem
of open-source software and entropy measure in Eqs. (1)–(3) by determining the
regression coefficients. Figures 5, 6, and 7 represent this goodness-of-fit curve for
each entropy measure considered for our study, i.e. Renyi [3], Havrda–Charvat [4],
and Arimoto [5]. The predicted values are calculated, and hence the R-square value is analysed. It is found that, in the Renyi [3] and Havrda–Charvat [4] entropy measure analyses, the value of R-square degraded from 99.3 to 98.9% and from 99.1 to 98.8% with the increase in the value of the parameters α and β from 0.1 to 0.9, respectively.
In the case of Arimoto [5], as we increased the value of γ from 0.1 to 0.9, the value of R-square improved at γ = 0.9 from 98.7 to 98.8%. The value of R-square was similar for γ = 0.1 to 0.8, i.e. it shows 98.7% variability, and it increased significantly at γ = 0.9 to 98.8% variability.
9 Conclusion
Software is affected by the code change process, which leads to increased code complexity. Based on the data set statistics, the Renyi [3], Havrda–Charvat [4], and Arimoto [5] entropies are calculated for different values of the parameters α, β, and γ, respectively, ranging from 0.1 to 0.9 in Eqs. (1)–(3) for 12 subcomponents of Mozilla open-source software (OSS). It is observed that the Renyi [3], Havrda–Charvat [4], and Arimoto [5] entropies lie between 0.4 and 1, 1.8 and 9, and 0.7 and 3, respectively. It can be analysed through Figs. 1, 2 and 3 that all entropy values decrease as the respective parameter value increases from 0.1 to 0.9. We have considered a bug prediction approach to predict the appearance of bugs in the OSS subsystem in coming years; through comparative analysis of the R² value for each entropy measure, it is observed that the Renyi [3] entropy measure provides the best R² value of 99.3% for α = 0.1. Goodness-of-fit curves have been plotted and represented in Figs. 5, 6 and 7. This study based on entropy measures is useful in learning the code change process in a system. In the future, our study could be extended to predict bug occurrence in the software as well as the potential complexity/entropy of code changes that the software will have during a long maintenance period.
References
17. Aczél, J., Daróczy, Z.: Über die Charakterisierung der Entropien positiver Ordnung und der Shannonschen Entropie. Acta Math. Acad. Sci. Hungar. 14, 95–121 (1963)
18. Kapur, J.N.: Generalized entropy of order α and type β. Math. Semin. 4, 78–84 (1967)
19. Weisberg, S.: Applied Linear Regression. Wiley, New York (1980)
20. Sharma, B.D., Taneja, I.J.: Three generalized additive measures of entropy. ETK 13, 419–433 (1977)
21. IBM Corp.: IBM SPSS Statistics for Windows, Version 20.0. IBM Corp., Armonk, NY
Design of Cyber Warfare Testbed
1 Introduction
The National Cyber Security Policy (NCSP) was announced in July 2013 with a mission to protect information and IT infrastructure in cyberspace, build capabilities to prevent and respond to cyber threats, and reduce vulnerabilities and hence damage from cyber incidents, through an arrangement of institutional structures, people, processes, technology and their collaboration. One of its objectives is to generate a workforce of five lakh (500,000) cyber security professionals in the coming five years through capacity building, skill development, training and the use of open standards for cyber security [1]. The National Institute of Standards and Technology (NIST) developed the framework for improving critical IT infrastructure cyber security in Feb 2014. While the NCSP addresses the high-level perspective, the NIST framework helps organizations address actual cyber security risks.
The gap between threat and defence continues to expand as opponents use
increasingly sophisticated attack technologies. Cyber Defense Technology
Experimental Research (DETER) is an attempt to fill that gap by hosting an
advanced testbed facility where top researchers and academicians conduct critical
cyber security experiments and educational exercises. It emulates real-world
complexity and scale essential to evolve advance solutions to help protect against
sophisticated cyber-attacks and network design vulnerabilities [2] using Emulab
software [3].
Cyber Exercise and Research Platform (KYPO) project aims to create an
environment for R&D of new methods to protect critical IT infrastructure against
cyber-attacks. A virtualized environment is used for emulation of complex
cyber-attacks and analyse their behaviour and impact on the IT infrastructure [4].
Malwr is a powerful, free and independent malware analysis service for the security community [5]. DARPA is setting up a virtual testbed named National
Cyber Range to carry out virtual cyber warfare games for testing different scenarios
and technologies in response to cyber-attacks.
2 Testbed Requirements
Cyber security experiments involve two or more contestants, due to the adversarial nature of the problem. A safe and isolated virtual network environment is needed that can be attacked and penetrated as a means of learning and improving penetration testing skills. In a conventional environment, the investigator constructing the experiment is fully aware of both sides of the scenario, and all components of the experiment are equally visible. This affects scenario realism, may foster experimental bias and may not be able to handle present-day attacks.
The testbed should support on-demand creation of experimental scenarios and advanced attack emulation. The scenarios generated are to be based on real-world exploitations. The requirements are network-related, host-related, monitoring, testbed control and deployment. The host monitoring set-up needs to gather information about node performance. It is to monitor processor and memory utilization, open connections, interface statistics and other characteristics of hosts.
3 Testbed Design
The testbed consists of four major systems, i.e. cluster of nodes and networks,
attack system, defence system and data logger and report generation.
actuator which will pick next node in the graph and launch particular attack on the
node. Individual attack tools such as LOIC for DDOS or hping for scanning can be
utilized by starters. For the attack system, Metasploit Community Edition [7] in
conjugation with Nmap and Wireshark may be used. Kali Linux (formerly
Backtrack) by Offensive Security contains several tools aimed at numerous infor-
mation security tasks, such as penetration testing, forensics and reverse engineering
[8]. Using these tools, a number of attack techniques may be tried and matured on
the cyber warfare testbed. Malware and malware generation tools can be sourced
from Websites like [Link] [9].
Intrusion detection systems (IDSs) and system logs may be used to detect several
types of malicious behaviours that can compromise the security and trust of a
computer system. This includes network attacks against vulnerable services,
data-driven attacks on applications, host-based attacks such as privilege escalation,
unauthorized logins and access to sensitive files, and malware (viruses, Trojan
horses and worms). Provisions are given in the IDS to configure rules according to the experiment. The authors used Snort for network-based intrusion detection and OSSEC
for host-based intrusion detection. These will report various security-related system
activities and logs to central console, e.g. insertion/removal of external media,
changes in specified software. These alerts then can be correlated to have better
awareness of cyber situation.
Formulation of proper response in case of such attack based on situation
response technologies includes response recommenders that evaluate alternatives
and recommend responses in light of the mission and response managers and
actuators that implement the responses [10].
The data logger collects data in passive mode. There are no external linkages from
the data logger during a live experiment. Interactive access is allowed to this only
from report generation and post analyses modules. Depending on the experiment’s
risk rating, data collected by this module may have to be cleaned for malware
before it can be used by others.
The data logger retrieves information about host performance. Several tools are available under various nomenclatures, the most popular being security information and event management (SIEM). The authors have used the system information gatherer and reporter (SIGAR) for data logging and reporting. The SIGAR API provides a portable
code for collecting system information such as processor and memory utilization,
open connections, interface statistics and other characteristics of hosts.
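Purely as an illustration of the kind of data this module collects (the testbed itself uses the SIGAR API, not the library shown here), a comparable host snapshot can be gathered with the cross-platform psutil package; the record layout is our own choice:

```python
# Illustrative host-monitoring snapshot (not the authors' SIGAR-based logger):
# collects processor and memory utilization, open connections and interface
# statistics, the same categories of data the testbed's data logger records.
import json
import time
import psutil

def host_snapshot():
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "open_connections": len(psutil.net_connections(kind="inet")),
        "interfaces": {
            nic: {"bytes_sent": io.bytes_sent, "bytes_recv": io.bytes_recv}
            for nic, io in psutil.net_io_counters(pernic=True).items()
        },
    }

if __name__ == "__main__":
    # One JSON record per line, suitable for forwarding to a central console.
    print(json.dumps(host_snapshot()))
```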
4 Operational Configuration
4.1 Scenarios
5 Analysis
A large number of exercises need to be conducted for various scenarios and their objectives to generate a volume of data. These data are then analysed for specific objectives afterwards.
Firstly, measures of effectiveness (MOEs) are identified, and the MOEs determine the attack strength particular to a specific node on a specific network configuration. MOEs for attack may be the
number of computers affected, data losses in terms of time and volume, target
identification, number of targets engaged, number of attempts/mechanisms used to
breach, targets missed, DOS induced in terms of time frame, number of routers
attacked/compromised, number of antivirus/antimalware defeated, number of OS
breached, number of Websites attacked, number of applications breached.
Similarly, MOEs for defence may be time delay in detection of attack, value of
asset before and after attack, data losses in terms of time and volume.
Statistical methods or data mining tools may be applied on data collected during
exercises to evolve better defence techniques and attack-resilient network topolo-
gies. Added to it is behaviour analysis, which will help prevention against advanced
persistent threat (APT), polymorphic malware, zero-day attacks, which otherwise
are almost impossible to detect. It includes behaviour of both, malware as well as
attacker. A good analysis of polymorphic malware from behavioural perspective
has been carried out by Vinod et al. [12]. These analyses will help produce tactical
rule book for senior IT managers and chief information security officers (CISOs).
6 Conclusion
The main tools of cyber defence are not the switches, routers or operating systems,
but rather the cyber defenders themselves. The key benefits derived from the testbed
would be enhancement of skills on cyber warfare, formulation of cyber warfare
strategies and tactics, a training platform for the cyber warfare and analysis of
vulnerability of the existing network system.
Innovations doing fine in a predictable and controlled environment may be less
effective, dependable or manageable in a production environment. Without realistic,
large-scale resources and research environments, results are unpredictable. The
testbed as a mini-Internet emulates real-world complexity and scale.
The testbed can also host capture-the-flag (CTF) and attack–defence competitions, where every team has its own network with vulnerable services. Every team has time for patching services and developing exploits. Each team protects its own services to score defence points and hacks opponents to score attack points. The exercise can take place over a WAN and can be extended to a multiple-team configuration.
7 Future Scope
The US military has indicated that it considers a cyber-attack an "Act of War" and will include cyber warfare in future conflicts, engaging its Cyber Command. In future, a
move from device-based to cloud-based botnets, hijacking distributed processing
power may be witnessed. Highly distributed denial-of-service attacks using cloud
processing may be launched. To counter such massive attacks, the testbed needs to
be realized at much larger scale. Hence, the testbed also needs to be migrated to
cloud platform [13]. Hybrid cloud solutions, e.g. OpenStack may be used as
infrastructure-as-a-service (IaaS). Further efficiency may be achieved by harnessing
operating system container technology. OSs designed to run containers, such as Red Hat Atomic Host, Ubuntu Snappy, CoreOS and Windows Nano, can complement hypervisors; containers are lightweight, VM-like environments realized by the OS kernel without a hypervisor.
There is scope for simulating human behaviour related to cyber security using an agent-based simulation toolkit. It will support agent modelling of human security behaviour to capture perceived differences from standard decision making.
References
1 Introduction
M. Saini (&)
G D Goenka University, Gurgaon, India
e-mail: [Link]@[Link]
R. Chhikara
The NorthCap University, Gurgaon, India
e-mail: ritachhikara@[Link]
done directly, whereas in transform domain images are first transformed [2, 3] to
discrete cosine transform (DCT) or discrete wavelet transform (DWT) domain and
then the message is embedded inside the image. The advantage of the transform method over the spatial method is that it is more resistant to statistical attacks.
The misuse of Steganography has led to development of countermeasures known
as Steganalysis [4]. The aim of forensic Steganalysis is to recognize the existence of
embedded message and to finally retrieve the secret message. Based on the way
methods are detected from a carrier, Steganalysis is classified into two categories:
target and blind Steganalysis. (i) Target/specific Steganalysis where particular kind
of Steganography algorithm which is used to hide the message is known and
(ii) blind/universal Steganalysis where the particular kind of Steganography tool
used to hide the message is not known.
In this paper, two sets of feature vectors are extracted from discrete wavelet
domain of images to enhance performance of a steganalyzer. The features extracted
are histogram features with three bins 5, 10, and 15 and Markov features with five
different threshold values 2, 3, 4, 5, 6, respectively. The performance of two feature
sets is compared with existing DWT features given by Farid [5] and with different
parameters among themselves.
The rest of the paper is divided in the following sections. Section 2 discusses
three Steganography algorithms outguess, nsF5, and PQ used in the experiment. In
Sect. 3, proposed DWT feature extraction method is explained. Section 4 discusses
neural network classifier used for classification. In Sects. 5 and 6, experimental
results and analysis is described in detail. Finally, the paper is concluded in Sect. 7.
2 Steganography Algorithms
Our experiment employs three steganography algorithms: Outguess, nsF5, and PQ.
Outguess [6] is an enhanced variant of Jsteg. It uses pseudorandom number gen-
erator (PRNG)-based scattering to obscure Steganalysis. Seed is required as addi-
tional parameter to initialize the pseudorandom number generator. nsF5 (no
shrinkage F5 [7]) is enhanced version of F5, proposed in 2007. This algorithm was
developed to improve the problem of shrinkage which exists in F5 algorithm by
combining F5 algorithm with wet paper codes (WPCs). In perturbed quantization
(PQ) algorithm [8], quantization is perturbed according to a random key for data
embedding. In this procedure, prior to embedding the data, on the cover–medium,
an information-reducing process is applied that includes quantization such as lossy
compression, resizing, or A/D conversion. PQ does not leave traces in any form that existing Steganalysis methods can detect.
3 Proposed DWT Feature Extraction Method
On applying the discrete wavelet transformation (DWT) [9] to each image from the image dataset, we obtain four different subbands at each level (decomposition is done up to level 3), giving 12 subbands (cA1, cH1, cV1, cD1 at Level 1; cA2, cH2, cV2, cD2 at Level 2; and cA3, cH3, cV3, cD3 at Level 3), where cA contains the approximation coefficients and the remaining three subbands (cH, cV, cD) contain the detail coefficients. Histogram and Markov features are extracted from the complete image and from the 12 subbands obtained after applying DWT.
3.1 Histogram
$H_O = H_O / \mathrm{sum}(H_O)$  (1)
We have extracted features from the original image plus the 12 subbands (13 subbands in total) using three different bin counts: 5, 10, and 15.
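A minimal sketch of this extraction step is given below, assuming grayscale images and the Haar wavelet (choices not fixed by the text above); it yields the 13 subbands and the 65 normalized 5-bin histogram features described above:

```python
# Sketch of the histogram feature extraction: successive 2-D DWTs give the
# cA, cH, cV, cD subbands at each of 3 levels (12 subbands), and normalized
# histograms (H_O = H_O / sum(H_O), Eq. (1)) with 5 bins are taken over the
# original image plus all subbands: 13 x 5 = 65 features.
import numpy as np
import pywt

def histogram_features(image, bins=5, levels=3, wavelet="haar"):
    subbands = [image]
    approx = image
    for _ in range(levels):
        approx, (cH, cV, cD) = pywt.dwt2(approx, wavelet)
        subbands.extend([approx, cH, cV, cD])   # cA, cH, cV, cD at this level

    features = []
    for band in subbands:
        hist, _ = np.histogram(np.asarray(band).ravel(), bins=bins)
        features.extend((hist / hist.sum()).tolist())   # normalized histogram
    return np.array(features)

# Example: 65 features from a random 640 x 480 "image"
img = np.random.rand(480, 640)
print(histogram_features(img, bins=5).shape)   # (65,)
```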
3.2 Markov
Farid proposed a 72-feature set [5]; the four features extracted from the subbands are the mean, variance, skewness, and kurtosis, whose formulas are shown below.
$E(x) = \frac{1}{n}\sum_{k=1}^{n} x_k$  (6)

$\mathrm{Var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i - E(x)\bigr)^2$  (7)

$S(x) = E\left[\left(\frac{x - E(x)}{\sqrt{\mathrm{Var}(x)}}\right)^{3}\right]$  (8)

$K(x) = E\left[\left(\frac{x - E(x)}{\sqrt{\mathrm{Var}(x)}}\right)^{4}\right]$  (9)
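For illustration only (this is not Farid's original implementation), the four statistics of Eqs. (6)–(9) can be computed per subband as in the following sketch:

```python
# Illustration of Eqs. (6)-(9): the four statistics computed per subband in
# Farid-style feature sets. scipy's skew/kurtosis use the biased sample
# moments; 'fisher=False' keeps kurtosis in the E[((x-E(x))/sqrt(Var))^4] form.
import numpy as np
from scipy.stats import skew, kurtosis

def subband_statistics(band):
    x = np.asarray(band, dtype=float).ravel()
    return {
        "mean": x.mean(),                      # Eq. (6)
        "variance": x.var(ddof=1),             # Eq. (7), 1/(n-1) normalization
        "skewness": skew(x),                   # Eq. (8)
        "kurtosis": kurtosis(x, fisher=False), # Eq. (9), non-excess kurtosis
    }

print(subband_statistics(np.random.randn(64, 64)))
```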
5 Experimental Results
We have created a dataset of images consisting of 2,000 cover and 2,000 stego images of various sizes, which are resized to 640 × 480. The stego images are obtained by applying three Steganography tools: Outguess, nsF5, and PQ. These 4,000 images have been divided into training (2,800 samples), validation (600 samples), and testing (600 samples) sets. Features are extracted from each image in the DWT transform domain. Then, the features are fed to a neural network back-propagation classifier for detecting whether an image is a stego or a clean image. The performance of the classifier is compared on the basis of the classification accuracy parameter. Various embedding capacities have been used to generate the dataset, as shown in Table 1.
Accuracy of classifier is the fraction of test samples that are correctly classified.
Firstly, we have compared the performance of proposed features extracted from
wavelet domain among themselves and with existing DWT features. Next, exper-
iment evaluates histogram features with different bins and finally Markov features
with different threshold values.
Histogram features are extracted by binning the elements of all subbands into 5,
10, and 15 evenly spaced containers. Markov features are calculated after applying
different threshold values 2, 3, 4, 5, and 6, respectively, as shown in Table 2.
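A hedged sketch of the classification stage follows; the feature matrix and labels are random placeholders, and scikit-learn's MLPClassifier merely stands in for the back-propagation neural network used in the experiments, with the 2,800/600/600 split taken from the paper:

```python
# Hedged sketch of the classification stage: a back-propagation neural network
# (here sklearn's MLPClassifier) trained on extracted feature vectors.
# X (4000 x n_features) and y (0 = cover, 1 = stego) are random placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((4000, 65))          # placeholder features (e.g. 65 histogram features)
y = rng.integers(0, 2, size=4000)   # placeholder labels

idx = rng.permutation(4000)
train, val, test = idx[:2800], idx[2800:3400], idx[3400:]

clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
clf.fit(X[train], y[train])

print("validation accuracy:", accuracy_score(y[val], clf.predict(X[val])))
print("test accuracy:", accuracy_score(y[test], clf.predict(X[test])))
```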
6 Analysis
From the experimental results, we found that: (i) in Outguess, the accuracy obtained with 5 bins (65 features) is much higher than that obtained with 10 bins (130 features) and 15 bins (195 features); (ii) in the nsF5 and PQ Steganography algorithms, we obtain equivalent results with 5, 10, and 15 bins.
From the experimental results of Table 4, we found that accuracy obtained after
applying threshold value 5 is best in comparison with other threshold values in case
of Outguess, but in nsF5 and PQ Steganography algorithm threshold value 6 gives
more accuracy in comparison with others. Next, we evaluate the performance of the
proposed features with existing DWT features based on parameter classification
accuracy.
As shown in Fig. 1, we conclude that proposed histogram 65 features in DWT
domain give best accuracy in comparison with Markov 121 features and existing
Farid 72 DWT features for Outguess. This proves that this algorithm is more
sensitive to histogram features and with reduced number of features we are able to
obtain better accuracy.
We conclude from Figs. 2 and 3, that Markov 121 features are able to detect
nsF5 and PQ Steganography algorithm more efficiently. However, the limitation is
that histogram features (65) do not improve the accuracy much for these two
algorithms.
Fig. 1 Comparison of accuracy in % of proposed features with existing features for Outguess
Fig. 2 Comparison of accuracy in % of proposed features with existing features for nsF5
Fig. 3 Comparison of accuracy in % of proposed features with existing features for PQ
7 Conclusion
We have created an image dataset consisting of 2,000 cover images downloaded from various Web sites and digital photographs. Three Steganography algorithms, (1) Outguess, (2) nsF5, and (3) PQ, have been applied to obtain three sets of 2,000 stego images. Various embedding capacities have been used to generate the datasets. Three kinds of images with sizes of 16 × 16, 32 × 32, and 48 × 48 were embedded in the cover images using the Outguess Steganography algorithm. The embedding ratios taken in the nsF5 and PQ methods are 10, 25, and 50%.
diverse % of information is to test the performance with different percentage of
information embedded. So total 4,000 images (cover plus stego) have been gen-
erated in each case. These 4000 images have been divided into training (2,800
samples), validation (600 samples), and testing (600 samples) by neural network
classifier. After dividing the dataset into training, validation, and testing, the pro-
posed DWT features (i) histogram, (ii) Markov are compared with existing
Farid DWT features. We have compared the histogram features with different bin counts of 5, 10, and 15, and the Markov features with different threshold values of 2, 3, 4, 5, and 6. All the experiments are performed with the neural network classifier. The results show that the 5-bin histogram features give the best performance for Outguess, while for the nsF5 and PQ Steganography algorithms we obtain equivalent results with 5, 10, and 15 bins. Markov features with threshold value 5 give the best performance for Outguess; for the nsF5 and PQ Steganography algorithms, we get better performance with Markov features with threshold value 6. Experimental
results indicate that outguess algorithm is more sensitive to histogram statistical
features and nsF5 and PQ can be detected with Markov statistical features. From the
experimental result, we can also conclude that proposed set of features gives better
performance in comparison with existing DWT features. Future work involves
(a) applying proposed features on different embedding algorithms. (b) Finding the
size of information hidden in a stego image.
References
1. Choudhary, K.: Image steganography and global terrorism. Glob. Secur. Stud. 3(4), 115–135
(2012)
2. Cheddad, A., Condell, J., Curran, K., Mc Kevitt, P.: Digital image steganography: survey and
analysis of current methods. Signal Processing 90, pp. 727–752 (2010)
3. Johnson, N.F., Jajodia, S.: Steganalysis the investigation of hidden information. In:
Proceedings of the IEEE Information Technology Conference, Syracuse, NY, pp. 113–116
(1998)
4. Nissar, A., Mir, A.H.: Classification of steganalysis techniques: a study. In: Digital Signal
Processing, vol. 20, pp. 1758–1770 (2010)
5. Farid, H.: Detecting hidden messages using higher-order statistical models. In: Proceedings of
the IEEE International Conference on Image Processing (ICIP), vol. 2, pp. 905–908 (2002)
6. Niels Provos. [Link]
7. Fridrich, J., Pevný, T., Kodovský, J.: Statistically undetectable JPEG steganography: dead
ends, challenges, and opportunities. In: Proceedings of the ACM Workshop on Multimedia &
Security, pp. 3–14 (2007)
8. Fridrich, J., Gojan, M., Soukal, D.: Perturbed quantization steganography. J. Multimedia Syst.
11, 98–107 (2005)
9. Ali, S.K., Beijie, Z.: Analysis and classification of remote sensing by using wavelet transform
and neural network. In: IEEE 2008 International Conference on Computer Science and
Software Engineering, pp. 963–966 (2008)
10. Shimazaki, H., Shinomoto, S.: A method for selecting the bin size of a time histogram. Neural
Comput. 19(6), 1503–1527 (2007)
11. Saini, M., Chhikara, R.: DWT feature based blind image steganalysis using neural network
classifier. Int. J. Eng. Res. Technol. 4(04), 776–782 (2015)
12. Pevný, T., Fridrich, J.: Merging Markov and DCT features for multi-class JPEG steganalysis.
In: Proceedings of the SPIE, pp. 03–04 (2007)
13. Bakhshandeh, S., Bakhshande, F., Aliyar, M.: Steganalysis algorithm based on cellular
automata transform and neural network. In: Proceedings of the IEEE International Conference
on Information Security and Cryptology (ISCISC), pp. 1–5 (2013)
Control Flow Graph Matching
for Detecting Obfuscated Programs
Abstract Malicious programs like viruses, worms, Trojan horses, and backdoors infect host computers by taking advantage of flaws in the software and thereby introducing some kind of secret functionality. The authors of these malicious programs attempt to find new methods to evade detection engines. They use different obfuscation techniques, such as dead code insertion and instruction substitution, to make the malicious programs more complex. Obfuscation techniques, initially used by software developers to protect their software from piracy, are now misused by malware authors. This paper intends to detect such obfuscated programs or malware using a control flow graph (CFG) matching technique based on the VF2 algorithm. If the original CFG of the executable is found to be isomorphic to a subgraph of the obfuscated CFG (under examination), then the latter can be classified as an obfuscated variant.
1 Introduction
Malware authors continually try to find innovative techniques to keep their code from being caught by malware detection engines or to delay the analysis process, by transforming their malicious programs into another format which is much harder to reverse engineer [2], i.e., by changing the syntax and making it more complex using different obfuscation techniques.
Earlier, malware detectors were designed using simple pattern-matching techniques. Some malware was then easily detected and removed by such antivirus software. These antivirus tools maintain a repository of virus signatures, i.e., binary pattern characteristics of the malicious codes, which is used to scan a program for the presence of these fixed patterns. If a match was found, the detector flagged the program as malware. But, eventually, the malware writers learnt the weakness of these detectors, which is the reliance on a particular pattern (the virus signature), and they started changing the syntax or semantics of their malware in such a way that it would not be detected easily, while the basic behaviour and functionality of the malware remained the same [3, 4].
Section 2 discusses disassembling. In Sect. 3, some obfuscation techniques are
listed out. Section 4 explains about control flow graphs. Finally, Sect. 5 proposes a
framework, followed by the explanation with an example, which includes disas-
sembly of suspected malware, its CFG generation, and matching. The last section is
about the conclusion of the paper with future targets.
2 Disassembly of Executables
3 Obfuscation Techniques
Obfuscation techniques are used by malware authors to defeat detection [2]. Some of the simple techniques, which can generally be used for obfuscating code, are mentioned below; a toy illustration follows the list.
(a) Replacement of instructions with equivalent instructions: A technique in which
instructions in the original code are replaced by some other equivalent
instructions.
(b) By introducing dead code instructions: A technique in which a dead code is
inserted into the original code.
(c) Insert semantically NOP code: By this technique, a NO operation code will be
inserted in the original code to make the code lengthier as well as complex.
(d) Insert unreachable code: Some code will be inserted in the program, which is
never going to be executed. In some cases, jumping is introduced to bypass it
directly.
(e) Reorder the instructions: Technique which uses reordering of mutually
exclusive statements such that the syntax changes from the original code but the
output remains same.
(f) Reorder the loops: This technique reorders mutually exclusive loop statements
in the program.
(g) Loop unrolling: In this technique, the loop is removed and the statements written inside the loop body are rewritten as many times as the loop counter value.
(h) Cloning methods: In this technique, cloned versions of a method are used in different parts of the program in different forms, but for the same task.
(i) Inline methods: In this technique, the body of the method is placed into the
body of its callers and the method will be removed.
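As the toy illustration promised above (written in Python rather than machine code purely for readability, and not taken from any real malware), the second function below computes the same result as the first while containing an equivalent instruction, dead code, a semantic NOP and unreachable code:

```python
# Toy illustration of obfuscations (a)-(d) above: the obfuscated variant
# computes the same factorial but with an equivalent instruction, dead code,
# a semantically NOP statement and unreachable code inserted.
def fact(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

def fact_obfuscated(n):
    unused = [0] * 8              # (b) dead code: never read
    result = 1
    for i in range(2, n + 1):
        pass                      # (c) semantically NOP statement
        result = result * i       # (a) equivalent instruction for result *= i
        if False:
            result = 0            # (d) unreachable code
    del unused
    return result

assert fact(6) == fact_obfuscated(6) == 720
```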
4 Control Flow Graph
The control flow is driven by the execution of the program's instructions: sequential instructions, conditional/unconditional jumps, function calls, return statements, etc. The control flow graph (CFG) corresponds to all the possible paths which may be followed during execution of the program [10, 11].
The aim is to develop a detection strategy based on control flow graphs. In this paper, the VF2 algorithm is used to find subgraph isomorphism [12]. For two graphs G and G′, the algorithm proceeds as a traversal of a tree data structure; at each step, we try to match the next node of G with one of the nodes of G′ and stop after traversing all the nodes of G′ (i.e., after finding a leaf). The matching is done for all the possible combinations, and if no complete match is found, then the graphs taken are not isomorphic [13, 14].
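A minimal sketch of this check with the VF2 implementation available in the networkx library is shown below; the two small directed graphs are illustrative stand-ins for the CFGs of the original and the obfuscated program, not graphs produced by the proposed framework:

```python
# Minimal sketch of the subgraph-isomorphism check with the VF2 algorithm,
# using networkx. G_original stands in for the CFG of the known program and
# G_obfuscated for the CFG of the suspected obfuscated executable.
import networkx as nx
from networkx.algorithms import isomorphism

G_original = nx.DiGraph([(1, 2), (2, 3), (2, 4), (3, 2), (4, 5)])
# Obfuscated CFG: same structure plus extra blocks inserted by junk/dead code.
G_obfuscated = nx.DiGraph([("a", "b"), ("b", "c"), ("b", "d"), ("c", "b"),
                           ("d", "e"), ("e", "f"), ("b", "g")])

matcher = isomorphism.DiGraphMatcher(G_obfuscated, G_original)
if matcher.subgraph_is_isomorphic():
    print("Original CFG is isomorphic to a subgraph of the obfuscated CFG")
    print("Node mapping:", matcher.mapping)
else:
    print("No subgraph match: programs are likely unrelated")
```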
5 Proposed Framework
5.1 Explanation
As only the executable file of a malicious program is available, the executable code of both programs is taken for this testing. Next, with the help of disassemblers, the assembly codes are generated from the machine codes of both the original and the obfuscated programs (Fig. 3).
Then, the codes are optimized by simply eliminating dead code, removing global common sub-expressions, removing loop-invariant code, eliminating induction variables, etc. Then, the control flow graph is generated
from the assembly codes. The structure of the CFGs is shown below, to understand
the difference between both the CFGs, pictorially (Fig. 4).
By looking at all basic blocks and the jumps, conditions, loops, and return statements, the graphs are prepared from both CFGs (Fig. 5).
Both of these CFGs are matched for subgraph isomorphism using the VF2
algorithm. As a result, the following steps are performed while comparing both of
the graphs (Fig. 6).
Fig. 4 Assembly code after disassembling the executable of ‘obfuscated fact.c’ program
Fig. 5 Snapshots of the structures of the CFGs of the original and obfuscated program
Fig. 6 Optimized CFGs G and G′ of the original and the obfuscated program
6 Conclusion
In this paper, a framework has been proposed to detect the obfuscated programs or
malware using control flow graph matching technique. This framework is verified,
and its functionality has been proved through an example. After generating exe-
cutables from both original and obfuscated programs, disassembling has been done
on both the executables (as it is known that only malware executables were
available, not their source code). Next, both the assembly codes and constructed
CFGs of both executables have been optimized. Then, matching has been done
between these CFGs for subgraph isomorphism using VF2 algorithm and found to
be matched. This means that even after obfuscation the proposed framework is able to detect the program. The future expectation is to maintain a database of CFGs of
known malware and study matching issues (such as matching CFG of any new
suspected program with a huge database of CFGs). For this, better graph storage
and graph comparison techniques are to be implemented as continuation of this
paper.
References
1. You, I., Yim, K.: Malware obfuscation techniques: a brief survey. In: International
Conference on Broadband, Wireless Computing, Communication and Applications, IEEE
Computer Society, pp. 297–300 (2010)
2. Sharif, M., et al.: Impeding malware analysis using conditional code obfuscation. In: Network
and Distributed System Security Symposium (2008)
3. Walenstein, A., Lakhotia, A.: A transformation-based model of malware derivation. In: 7th
IEEE International Conference on Malicious and Unwanted Software, pp. 17–25 (2012)
4. Durfina, L., Kroustek, J., Zemek, P.: Psyb0t malware: a step-by-step decompilation—case
study. In: Working Conference on Reverse Engineering (WCRE), pp. 449–456. IEEE
Computer Society (2013)
5. Ernst, M., et al.: Quickly detecting relevant program invariants. In: 22nd International
Conference on Software Engineering, pp. 449–458 (2000)
6. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: Evaluating performance of the VF graph
matching algorithm. In: Proceedings of the 10th International Conference on Image Analysis
and Processing, pp. 1172–1177. IEEE Computer Society Press (1999)
7. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: An improved algorithm for matching large
graphs. In: 3rd International Workshop on Graph-based Representations, Italy (2001)
8. McKay, B.D.: Practical graph isomorphism. Congressus Numerantium 30, 45–87 (1981)
9. Messmer, B.T., Bunke, H.: A decision tree approach to graph and subgraph isomorphism
detection. J. Pattern Recog. 32, 1979–1998 (1999)
10. Gold, R.: Reductions of control flow graphs. Int. J. Comput., Electr. Autom. Control Inf. Eng.
8(3), (2014)
11. Sadiq, W., Orlowska, M.E.: Analyzing process models using graph reduction techniques. Inf.
Syst. 25(2), 117–134 (2000)
12. Bondy, J.A., Murty, U.S.R.: Graph Theory. Springer, Berlin (2008)
13. Abadi, M., Budiu, M., Erlingsson, U’., Ligatti, J.: Control flow integrity principles,
implementations, and applications. ACM Trans. Inf. Syst. Secur. 13(1), 4:1–4:40 (2009)
14. Brunel, J., Doligez, D., Hansen, R.R., Lawall, J.L., Muller, G.: A foundation for flow-based
program matching, using temporal logic and model checking POPL. ACM (2009)
A Novel Framework for Predicting
Performance of Keyword Queries Over
Database
1 Introduction
Keyword-based search has long been popular for unstructured data and is now becoming popular day by day for RDBMSs because of end-user requirements, so this field has attracted much work on efficient ways to tackle problems related to keyword-based search.
Many researchers have suggested different approaches; papers [1–5] present different methods to deal with keyword-based search. A keyword query interface tries to find the most appropriate and efficient result for each query. Each keyword query may have different results, because of which exact prediction is not possible in this scenario. Day by day, more users want to access data in all the forms available on the network, and they need their desired result by just typing a simple keyword-based search, because most users are not aware of database languages. Different papers suggest different approaches to deal with this; for example, papers [6–9] suggest how to get efficient results using keyword queries, such as selecting top-k results and semsearch- and index-search-based results, which efficiently filter the results for a given query. Some introductory information on the terminologies used in this paper is given below:
Structured Data: When data is stored in the form of a relational database and can easily be retrieved by the structured query language (SQL), it is called structured data. Unstructured data is data stored without any pre-implemented logic, in the form of files, the Web, etc., and cannot be retrieved easily [10]. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
• huge data collections, to which millions of pages are added every day;
• powerful supercomputers with great computational power;
• efficient data mining algorithms for efficient computation.
Business databases are growing at a huge rate; according to Google, millions of pages are added daily. Such huge data creates complexity problems, so more efficient strategies are needed to manage and access it. According to one study, the amount of data added to the Internet every day is comparable to all the information that was accessible 2,000 years ago. The additional requirement for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable and understandable tools that consistently outperform older statistical methods.
How effective is a query? For any query, we can calculate how effective it is, and we can suggest another query instead of a hard query if its precision value is low.
2 Related Work
Today, many applications work on both structured and unstructured data, where lots of users try to find answers to their queries with keyword search only. Commercial relational database management systems (RDBMSs) nowadays provide ways to execute keyword queries; for this type of query it is not as simple as in SQL execution to get an appropriate answer, and so
different approaches like [1–9, 11, 12] have been suggested by different researchers to get appropriate answers by just executing keyword queries on a structured database. As we know, with keyword queries it is not possible to get an exact answer, so ambiguity creates more answers for the same query; researchers have therefore suggested different approaches to rank the results according to their relevance and also to rank the performance of the queries on particular databases.
These are the methods that work after each keyword query is executed and help to obtain an efficient result.
The strategies based on the idea of a clarity score assume that users are interested in very few topics, so they deem a query easy if its results fit very few topic(s) and are thus effectively distinguishable from the other documents in the collection [7].
The basic idea of popularity ranking is to find the pages that are popular and rank them higher than other pages that contain the specified keywords. This mostly works in unstructured, Web-based keyword search; for example, most search engines, like Google and Bing, use this strategy.
3 Proposed System
3.1 Objective
We studied many methods used to get efficient results with the help of keyword queries, but each method has certain problems; in our work, we try to improve the performance of keyword queries.
• Improve the performance of the keyword query on the basis of the information need behind the keyword query.
• We try to find the similarity score between documents and the keyword query and rank the query according to its performance, which we calculate by precision and recall values.
• We can suggest to a user which keyword query will be useful if he wants some certain information.
In our proposed method, we create a vector to map the effectiveness of the query on the database or structured data. This is based on the terms in a particular document.
The vector space model is very useful in calculating the distance between a query and the data. This model was first used to deal with unstructured data to get appropriate results by executing keyword queries. The model matches the query against the document set with the help of the terms present in the query (Table 1).
With the help of this model, it is easy and effective to rank the results by relevance for a particular query. The model provides some mathematical calculations to get an efficient result in case of ambiguity in the interpretation of the query.
In practice, it is easier to calculate the cosine of the angle between the vectors, rather than the angle itself (Fig. 1):

$\cos\theta = \dfrac{d_2 \cdot q}{\lVert d_2 \rVert \, \lVert q \rVert}$  (1)

The tf–idf weight of a term t in a document d is

$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \dfrac{|D|}{|\{d' \in D : t \in d'\}|}$  (2)

where |D| is the total number of documents in the collection, $|\{d' \in D : t \in d'\}|$ is the number of documents in which term t appears, and the logarithmic factor

$\mathrm{idf}_t = \log \dfrac{|D|}{|\{d' \in D : t \in d'\}|}$  (3)

is the inverse document frequency of t.
Using the cosine model, the similarity score between a document $d_j$ and a query q can be calculated as

$\mathrm{sim}(d_j, q) = \dfrac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert} = \dfrac{\sum_{i=1}^{N} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2}\, \sqrt{\sum_{i=1}^{N} w_{i,q}^2}}$  (4)
This is the method used to implement the keyword query ranking for the database. Compared with other methods already implemented, this method performs very well and its precision value is good. The method is based on a similarity score to obtain the efficient result (Fig. 2).
Steps of the ranking module:
Example of the method using an algebraic calculation: here is a simplified example of the vector space retrieval model. Consider a small collection C that consists of the following three documents:
d1: “tom cruise MIB”
d2: “tom hunk future”
d3: “cameroon diaz MIB”
A few terms appear in two documents, some appear in just one document. The total number of documents is N = 3. Consequently, the idf values for the terms are:
“cameroon log2 (3/1) = 1.584”
“diaz log2 (3/1) = 1.584”
“tom log2 (3/2) = 0.584”
“future log2 (3/1) = 1.584”
“MIB log2 (3/2) = 0.584”
“cruise log2 (3/1) = 1.584”
“hunk log2 (3/1) = 1.584”
For the documents above, we calculate the document weights according to the term frequency in each document and the log2 of the ratio of the total number of documents to the number of documents containing the term.
Now, we multiply the tf scores by the idf values of each term, obtaining the following matrix of documents by terms (all terms appear only once in each document in our small collection, so the maximum value used for normalization is 1).
For the above document set, we execute the query “tom cruise MIB”; we compute the tf–idf vector for the query and store the values for further calculation to compute the final ranking list of the results (Tables 2 and 3).
Given the following query: “tom tom MIB”, we compute the tf–idf vector for the query and calculate the term frequencies to get the similarity scores, which are used to rank the results. The query terms are then matched with the document set, and the distance between the query and each document is calculated.
For the query q, the nonzero tf–idf weights are tom = (2/2) × 0.584 = 0.584 and MIB = (1/2) × 0.584 = 0.292; all other terms have weight 0.
We compute the length of every document and of the query (Table 4). Then the similarity values are calculated. As per the similarity values, the final order in which the documents are presented as the result of the query is: d1, d2, d3 (Table 5).
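Although the framework itself is implemented in Java (see below), the toy calculation above can be reproduced with the following illustrative Python sketch, which uses the same max-normalized tf, log2 idf and cosine similarity:

```python
# Illustrative re-computation of the worked example: tf-idf weights
# (tf normalized by the maximum term count in the document, idf = log2(N/df))
# and cosine similarity between the query "tom tom MIB" and d1, d2, d3.
import math
from collections import Counter

docs = {"d1": "tom cruise MIB", "d2": "tom hunk future", "d3": "cameroon diaz MIB"}
query = "tom tom MIB"
N = len(docs)

vocab = sorted({t for text in docs.values() for t in text.split()})
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
idf = {t: math.log2(N / df[t]) for t in vocab}

def tfidf_vector(text):
    counts = Counter(text.split())
    max_count = max(counts.values())
    return [counts.get(t, 0) / max_count * idf[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q_vec = tfidf_vector(query)
ranking = sorted(docs, key=lambda d: cosine(tfidf_vector(docs[d]), q_vec), reverse=True)
print(ranking)   # expected order for this toy collection: d1, d2, d3
```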
The framework is implemented with the help of Java, and query ranking is efficiently calculated by this implementation. We use the MySQL database to store the data, with two tables named movie and actor, and we access the data with the help of the Java programming language. For example, we store data in the actor table as follows:
insert into actor values (‘tom cruise’, ‘titanic’, ‘2000’);
insert into actor values (‘tom cruise’, ‘mission impossible’, ‘2002’);
After the execution of the queries, we can get a ranking of the queries based on the result (Fig. 3).
The method performs very well, and the precision is considerable (Fig. 4). From this figure, we can conclude that our proposed scenario produces higher precision values than the existing scenario. Using our method, the proposed system achieves higher precision values than the base paper method in the existing system; across the number of data values, it generates higher precision values in the current scenario. Thus, we achieve a higher precision value for the proposed system than for the existing system.
If we follow the keyword query suggestions, then we can get efficient results with precision values near 0.8–0.9.
5 Conclusion
In our work, we try to find the efficient result for a given keyword query based on the ranking of the results calculated with our suggested mathematical model, which gives high-precision results according to user needs. In existing frameworks, the searching quality is lower and the dependability rate of the framework is lowest. In order to overcome these drawbacks, we propose an enhanced ranking calculation which is utilized to improve the accuracy rate of the framework. Moreover, we proposed the idea of an ontology-based algebraic model tool for enhancing the productivity of the top-k efficient results. It is utilized for giving comparable words as well as semantic importance to the given keyword query. From the results, we find that the proposed framework is more powerful than the current framework in terms of accuracy, quality of results, and effectiveness of our model. This model tries to predict the effectiveness of the query on the database.
References
1. Kurland, O., Shtok, A., Carmel, D., Hummel, S.: A unified framework for post-retrieval
query-performance prediction. In: Proceedings of 3rd International ICTIR, Bertinoro, Italy,
pp. 15–26 (2011)
2. Lam, K.-Y., Ulusoy, O., Lee, T.S.H., Chan, E., Li, G.: An efficient method for generating
location updates for processing of location-dependent continuous queries. 0-7695-0996-7/01,
IEEE (2001)
3. Hauff, C., Azzopardi, L., Hiemstra, D., Jong, F.: Query performance prediction: evaluation
contrasted with effectiveness. In: Proceedings of 32nd ECIR, Milton Keynes, U.K.,
pp. 204–216 (2010)
4. Kurland, O., Shtok, A., Hummel, S., Raiber, F., Carmel, D., Rom, O.: Back to the roots: a
probabilistic framework for query performance prediction. In: Proceedings of 21st
International CIKM, Maui, HI, USA, pp. 823–832 (2012)
5. Hristidis, V., Gravano, L., Papakonstantinou, Y.: Efficient IR style keyword search over
relational databases. In: Proceedings of 29th VLDB Conference, Berlin, Germany,
pp. 850–861 (2003)
6. Luo, Y., Lin, X., Wang, W., Zhou, X.: SPARK: top-k keyword query in relational databases.
In: Proceedings of 2007 ACM SIGMOD, Beijing, China, pp. 115–126 (2007)
7. Trotman, A., Wang, Q.: Overview of the INEX 2010 data centric track. In: 9th International
Workshop INEX 2010, Vugh, The Netherlands, pp. 1–32 (2010)
8. Tran, T., Mika, P., Wang, H., Grobelnik, M.: Semsearch ‘S10. In: Proceedings of 3rd
International WWW Conference, Raleigh, NC, USA (2010)
9. Demidova, E., Fankhauser, P., Zhou, X., Nejdl, W.: DivQ: diversification for keyword search
over structured databases. In: Proceedings of SIGIR’ 10, Geneva, Switzerland, pp. 331–338
(2010)
10. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun.
ACM 18(11), 613–620 (1975). [Link]
papers/[Link]
11. Finin, T., Mayfield, J., Joshi, A., Cost, R.S., Fink, C.: Information retrieval and the semantic
web. In: Proceedings of 11th International Conference on Information and Knowledge
Management, pp. 461–468, ACM (2002)
12. Ganti, V., He, Y., Xin, D.: Keyword++: a framework to improve keyword search over entity
databases. Proc. VLDB Endow. 3(1), 711–722 (2010)
13. Manning, C., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval.
Cambridge University Press, New York (2008)
Predicting and Accessing Security
Features into Component-Based
Software Development: A Critical
Survey
Keywords Commercial off-the-shelf (COTS) · Component-based software development (CBSD) · Component integrator · Intrusion detection · System resilience
1 Introduction
[Figure: CBSD process — requirement definition, component search, high-level design, selection, evaluation, integration and operation.]
After spending lots of time for compilation of data, it was observed that security
assessment for component-based software development must occur at the three
different levels as shown in Fig. 2.
Once all the functional or non-functional security requirements are precisely
identified and estimated, available COTS components must be evaluated to deter-
mine their suitability for use in the application being developed. Accessibility of a
component repository is advantageous for the component integrator in selecting the most suitable component for the development of a specific application. While interpreting the respondents' feedback for the category A questionnaires, it was observed that the majority of the component integrators do not use a proper CBSD process for developing software applications. For category B, component documentation and vendor reputation play a more important role than any other factors. For category C, the majority of the participants share the same opinion on the concept of embedding security activities at the component level rather than the system level. High-risk factors were associated with the use of external components while developing in-house applications. For category D, access control and applying encryption techniques were preferred by the majority of the respondents.
[Fig. 2: Security assessment levels — application level, interface level, component level.]
[Figure: Proposed security assessment framework — user authentication, access control, encryption and data integrity, intrusion detection, system resilience, level of immunity to attack, product documentation.]
agency. The outside agency may be hackers, crackers, or any so-called cyber terrorists intending to perform some fraud or manipulation. We consider the majority of components to be black box in nature; however, software components may also be white box or gray box in nature. The encryption mechanism of the proposed framework
is more suitable for components which are white box in nature at the component
level, whereas same encryption mechanism can be applied at the level of inter-
component communication for the component which is black box in nature at the
application level. Access control methods provide guidelines to the component
integrators while selecting most suitable software components which can produce
high-quality component-based application. Component certification and docu-
mentation also play a major role in establishing various security issues. While proposing this framework, a survey was also conducted among various component users working at different levels about component certification and documentation. It was observed that very few component vendors are properly working
on the issue of component documentation. Even the component certification issue is
unexplored. Out of 150 components selected as sample, less than 5% of compo-
nents were found thoroughly documented.
Traditional Software Reliability Models
Are Based on Brute Force and Reject
Hoare’s Rule
Abstract This research analyses the causes for the inaccurate estimations and
ineffective assessments of the varied traditional software reliability growth models.
It attempts to expand the logical foundations of software reliability by amalga-
mating techniques first applied in geometry, other branches of mathematics and
later in computer programming. The paper further proposes a framework for a
generic reliability growth model that can be applied during all phases of software
development for accurate runtime control of self-learning, intelligent, service-
oriented software systems. We propose a new technique to employing runtime code
specifications for software reliability. The paper aims at establishing the fact that
traditional models fail to ensure reliable software operation as they employ brute
force mechanisms. Instead, we should work on embedding reliability into software
operation by using a mechanism based on formal models like Hoare’s rule.
Keywords Learning automata · Software reliability · Formal methods · Automata-based software reliability model · Finite state automata
R. Wason (&)
Department of MCA, Bharati Vidyapeeth’s Institute of Computer Applications
and Management, New Delhi, India
e-mail: rit_2282@[Link]
A. K. Soni
School of Engineering and Technology, Sharda University, Greater Noida, India
e-mail: [Link]@[Link]
M. Qasim Rafiq
Department of Computer Science and Engineering, Aligarh Muslim University,
Aligarh, India
e-mail: mqrafiq@[Link]
1 Introduction
software reliability estimation parameters such as failure rate (λ), hazard function (λ(t)), MTBF, MTTR, and MTTF were initially suggested for the reliability prediction of hardware products [12] and were simply fitted to the software process, overlooking the different nature of software versus hardware as well as of their reliability. Hence, it has been observed that the shortcomings of the existing software reliability models have been the major cause of the unreliability of modern software.
Major challenges of software reliability can be defined as the ability to achieve
the following [13]:
(i) Encompass heterogeneous runtime execution and operation mechanisms for
software components.
(ii) Provide abstractions that can identify, isolate and control runtime software
errors.
(iii) Ensure robustness of software systems.
Hence, it is time that the software engineering community learns from mistakes
of conventional reliability models and utilizes post-testing failure data to control
software operation at run-time [14, 15].
The above discussion implies that what is required is a more holistic approach,
integrating formal methods like Hoare’s logic and automata [16]. The key feature of
this approach shall be a shift of emphasis from probabilistic assumptions about the
reliable execution of software to actual runtime operation control [8, 12, 17, 18].
This paper proposes a general framework for a reliability model based on finite state
machines.
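As a hedged illustration of this general idea only (not of the model proposed by the authors), permitted execution paths can be encoded as a small finite-state automaton whose monitor flags any runtime transition that falls outside the accepted behaviour; all names below are hypothetical:

```python
# Hedged sketch of the general idea only (not the authors' model): permitted
# execution paths are encoded as a finite-state automaton, and the monitor
# flags any runtime transition that leaves the accepted state/event space.
class ExecutionMonitor:
    def __init__(self, transitions, start, accepting):
        self.transitions = transitions      # {(state, event): next_state}
        self.state = start
        self.accepting = accepting
        self.violations = 0

    def observe(self, event):
        key = (self.state, event)
        if key not in self.transitions:
            self.violations += 1            # runtime path not permitted by the FSA
            return False
        self.state = self.transitions[key]
        return True

    def reliable_run(self):
        return self.violations == 0 and self.state in self.accepting

# Toy automaton: init -> process -> (commit | rollback) -> done
fsa = {("init", "load"): "process", ("process", "ok"): "commit",
       ("process", "fail"): "rollback", ("commit", "finish"): "done",
       ("rollback", "finish"): "done"}
monitor = ExecutionMonitor(fsa, start="init", accepting={"done"})
for ev in ["load", "ok", "finish"]:
    monitor.observe(ev)
print("run accepted:", monitor.reliable_run())   # True for this event trace
```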
The remainder of this paper is organized as follows. Section 2 reviews some
popular software reliability models. Section 3 establishes the brute force nature of
all traditional SRGMs as the major cause of their inaccurate estimates. Section 4
discusses the viability of modelling reliability metrics using formal methods like
Hoare’s logic. Section 5 discusses the basic framework for a generic, intelligent,
self-learning reliability estimation model. Section 6 discusses the conclusion and
future enhancements for this study.
Chow [7] was the first to suggest a software reliability estimation model. Since then, hundreds of models have been applied for calculating the reliability of different software systems. However, with every model, the search for another that overcomes the limitations of its predecessors intensifies. In fact, none of the traditional reliability models offers a silver bullet for accurate software reliability measurement. Traditional reliability models merely try to represent mathematically the interaction between the software and its operational environment. As such, most of these models are brute force: they try to estimate software reliability probabilistically by fitting some mathematical model [7] onto the software reliability estimation process. The limitations of traditional models, together with the growing size and complexity of modern software, make accurate reliability prediction all the more precarious.
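To make this curve-fitting character concrete, the following Python sketch fits the well-known Goel–Okumoto NHPP mean value function m(t) = a(1 − e^(−bt)) to cumulative failure counts by least squares. The failure data and starting values are hypothetical, and the snippet illustrates the general fitting approach rather than the reference implementation of any particular published model.

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical cumulative failure counts recorded at the end of each test week
t = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
failures = np.array([8, 14, 19, 23, 26, 28, 29, 30], dtype=float)

def goel_okumoto(t, a, b):
    # Mean value function m(t) = a * (1 - exp(-b t)) of the Goel-Okumoto NHPP model
    return a * (1.0 - np.exp(-b * t))

# Least-squares fit: the mathematical model is simply forced onto the observed failure history
(a_hat, b_hat), _ = curve_fit(goel_okumoto, t, failures, p0=(40.0, 0.3))

remaining = a_hat - goel_okumoto(t[-1], a_hat, b_hat)   # estimated faults still latent
print(f"a = {a_hat:.1f}, b = {b_hat:.3f}, estimated remaining faults = {remaining:.1f}")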
As observed from Table 1, most software reliability models calculate reliability from failure history data obtained during either the testing or the debugging phase of software development [12]. However, the difficulty lies in the quantification of this measure. Most of the existing reliability models assess reliability growth based on the failure modes that surface during testing and combine them with different assumptions to estimate reliability [1, 2, 10]. Some of the well-known traditional software reliability models can be classified according to their main features, as detailed in Table 2.
Tables 1 and 2 establish that all the existing reliability models have limited applicability during the testing, validation or operational phases of the software life cycle. However, the basic foundation of most software reliability growth models is often viewed with doubt because of the unrealistic assumptions on which they are based, and this is the underlying cause of the reliability challenge of this century. Uncertainty will continue to disrupt the reliability of all our software systems unless we transition to some metric rooted firmly in science and mathematics [4, 9, 23].
Engineers have always relied on mathematical analysis to predict how their designs will behave in the real world [2]. Unfortunately, the mathematics that describes physical hardware systems does not apply within the artificial binary universe of software systems [2]. However, computer scientists have contrived ways to translate software specifications and programs into equivalent mathematical expressions which can be analysed using theoretical tools called formal methods [5, 15]. In the current decade, when reliable software is no longer just a negotiable option but a firm requirement for many real-time, critical software systems, we need to become smarter about how we design, analyse and test software systems.
Hoare's logic (also known as Floyd–Hoare logic) is an axiomatic means of proving program correctness and has long served as the formal basis for verifying software [16]. However, it is rarely applied in actual practice. We propose the development of a reliability model that controls the traversed program path through an underlying automata-based software representation and applies Hoare's logic as the basis for ensuring correct runtime program execution [11]. Our approach performs dynamic checking of software transitions; instead of halting the program upon a transition to a failure state, we allow the program to continue execution through an alternate path. Formal, Hoare-logic-based software verification offers a strong means of establishing program correctness [16]. However, its practical use is limited by difficulties such as determining invariants for iterations and handling the side effects of invoking subroutines (methods/functions/procedures) in different programming languages.
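As a toy illustration of the invariant problem just mentioned, the sketch below checks a manually supplied loop invariant at run time for a trivial summation loop; the function and its invariant are our own example, not taken from the paper, and merely show that the invariant has to be discovered by the programmer or verifier rather than by the code itself.

def summation(n: int) -> int:
    # Sum 0..n-1 while checking a manually supplied loop invariant at run time
    assert n >= 0                       # precondition
    total, i = 0, 0
    while i < n:
        # Loop invariant (supplied by hand): total == i * (i - 1) // 2
        assert total == i * (i - 1) // 2
        total += i
        i += 1
    # Invariant plus negated loop guard (i == n) yields the postcondition
    assert total == n * (n - 1) // 2
    return total

print(summation(10))  # 45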
Hoare's logic, which is based on predicate logic [16], defines a set of axioms that capture the semantics of programming language constructs, one axiom for each construct. These axioms can then be applied to establish the correctness of programs written in different programming languages.
We propose applying the axioms of Hoare's logic to ensure reliable software operation [16]. To illustrate this, consider the Hoare axiom for assignment. Let x := E be an assignment; then the axiom states

{P[E/x]} x := E {P}

that is, if the postcondition P with the expression E substituted for every free occurrence of x holds before the assignment, then P holds after it.
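The minimal sketch below shows one concrete instance of the assignment axiom, with the postcondition P(x): x > 0 and the expression E = y + 1 chosen by us purely for illustration; the two assertions mirror P[E/x] before the assignment and P after it.

def assign_with_hoare_check(y: int) -> int:
    # Instance of {P[E/x]} x := E {P} with E = y + 1 and P(x): x > 0 (toy example)
    assert y + 1 > 0, "precondition P[E/x] violated"   # P with E substituted for x
    x = y + 1                                          # the assignment x := E
    assert x > 0, "postcondition P violated"           # guaranteed by the axiom
    return x

print(assign_with_hoare_check(5))  # 6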
Systems crash when they fail to expect the unexpected [2]. The unreliability of software stems from the problem of listing and/or identifying all the possible states of the program [13]. To overcome this limitation, we propose the development of an automata-based software reliability model [11, 24]. The proposed model is based on the conventional idea of a Markov model [25, 26]: it divides the possible software configurations into distinct states, each connected to the others through a transition rate. In this framework, the software at run-time would be controlled by an underlying automata representation of the software. Each software state change and/or transition shall be allowed by the compiler and/or interpreter only after validating that it takes the software to an acceptable automata node. The basic flow of this automata-based reliability model is depicted in Fig. 1.
As depicted in Fig. 1 above, the proposed framework utilizes the equivalent assembly code of the software under study to extract the underlying opcodes. From these opcode instructions, a next-state transition table is obtained, which serves as the basis for designing the underlying finite state machine (FSM). Each node of the obtained FSM can be assigned a probability value; initially, all nodes can be assigned equal probabilities. With each software execution, a node's probability can be incremented or decremented depending upon the path of execution. The proposed framework shall help ensure reliable software execution by monitoring each software transition. In case the framework detects that the next transition would take the software to a failure node, it may suspend that transition immediately and work on finding an alternate route.
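A minimal sketch of this monitoring idea is given below, under our own simplifying assumptions: the states, events, failure node and probability updates are hypothetical placeholders for what the proposed framework would derive from the program's opcode-level next-state transition table.

class AutomataReliabilityMonitor:
    # Toy FSM monitor: tracks node probabilities and refuses transitions
    # that would enter a known failure node, so the caller can try another path.

    def __init__(self, transitions, failure_nodes, start):
        self.transitions = transitions                # {(state, event): next_state}
        self.failure_nodes = set(failure_nodes)
        self.state = start
        states = {start} | set(transitions.values())
        self.prob = {s: 1.0 / len(states) for s in states}  # equal initial probabilities

    def step(self, event):
        nxt = self.transitions.get((self.state, event))
        if nxt is None or nxt in self.failure_nodes:
            # Transition would reach a failure node (or is undefined): block it
            self.prob[self.state] = max(0.0, self.prob[self.state] - 0.05)
            return False
        self.prob[nxt] += 0.05                        # reinforce successfully visited nodes
        self.state = nxt
        return True

# Hypothetical opcode-level states and events
monitor = AutomataReliabilityMonitor(
    transitions={("load", "add"): "compute", ("compute", "store"): "done",
                 ("compute", "div_zero"): "fault"},
    failure_nodes=["fault"], start="load")

print(monitor.step("add"))       # True  -> now in "compute"
print(monitor.step("div_zero"))  # False -> blocked, stays in "compute"
print(monitor.step("store"))     # True  -> now in "done"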
6 Conclusion
The ultimate objective of this study is to convince the software community that existing software reliability models are all brute force, since they do not capture the actual nature of software execution [27]. In this work, we have also outlined a novel software reliability model that uses the formal model of an automaton to control runtime software operation. The chief advantage of this framework is that it can be built into the software execution engine (compiler and/or interpreter) to ensure similar automata-based execution for all software sharing the same execution semantics. However, this remains an ambitious objective for the near future.
This research may be regarded as a generic model for ensuring reliable operation of modern agent-based, cloud-based, distributed or service-oriented software, and as a generic study from which numerous other studies could be instantiated. Possible extensions include training the FSM framework into a learning automaton that identifies and registers faulty nodes, as well as the incorrect user input types that lead to them. Further, the model can also be applied to improve the runtime performance of software. This study is a step towards overcoming software reliability issues and ensuring robust software systems.
References
1. Vyatkin, V., Hanisch, H.M., Pang, C., Yang, C.H.: Closed-loop modeling in future
automation system engineering and validation. IEEE Trans. Syst. Man Cybern. Part C: Appl.
Rev. 39(1), 17–28 (2009)
2. Yindun, S., Shiyi, Xu: A new method for measurement and reduction of software complexity.
Tsinghua Sci. Technol. 12(S1), 212–216 (2007)
3. Hecht, H.: Measurement, estimation and prediction of software reliability. In: Software Engineering Technology, vol. 2, pp. 209–224. Infotech International, Maidenhead, Berkshire, England (1977)
4. Samimi, H., Darli, A.E., Millstein, T.: Falling back on executable specifications. In:
Proceedings of the 24th European Conference on Object-Oriented Programming
(ECOOP’10), pp. 552–576 (2010)
5. Wahls, T., Leavens, G.T., Baker, A.L.: Executing formal specifications with concurrent
constraint programming. Autom. Softw. Eng. 7(4), 315–343 (2000)
6. Leuschen, M.L., Walker, I.D., Cavallaro, J.R.: Robot reliability through fuzzy Markov
models. In: Proceedings of Reliability and Maintainability Symposium, pp. 209–214 (1998)
7. Beckert, B., Hoare, T., Hähnle, R., Smith, D.R., et al.: Intelligent systems and formal methods
in software engineering. IEEE Intell. Syst. 21(6), 71–81 (2006)
8. Musa, J.D., Okumoto, K.: A logarithmic Poisson execution time model for software reliability
measurement. In: Proceedings of the 7th International Conference on Software Engineering,
pp. 230–237. Orlando (1984)
9. Meyer, B., Woodcock, J.: Verified software: theories, tools, experiments. In: IFIP TC 2/WG
2.3 Conference VSTTE 2005, Springer-Verlag (2005)
10. Hsu, C.J., Huang, C. Y.: A study on the applicability of modified genetic algorithms for the
parameter estimation of software reliability modeling. In: Proceedings of the 34th
Annual IEEE Computer Software and Applications Conference, pp. 531–540 (2010)
11. Jelinski, Z., Moranda, P.B.: Software reliability research. In: Statistical Computer Performance Evaluation, pp. 465–484. Academic Press, New York (1972)
12. Benett, A.A.: Theory of probability. Electr. Eng. 52, 752–757 (1933)
13. Hoare, C.A.R.: How did software get so reliable without proof? In: Proceedings of the Third
International Symposium of Formal Methods Europe on Industrial Benefits and Advances in
Formal Methods, pp. 1–17 (1996)
14. Zee, K., Kuncak, V., Rinard, M.C.: Full functional verification of linked data structures. In:
Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and
Implementation PLDI ’08, pp. 349–361. ACM, New York (2008)
15. Flanagan, C., Leino, K.R.M., Lillibridge, M., Nelson, G., Saxe, J.B., Stata, R.: Extended static checking for Java. In: Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation PLDI '02, pp. 234–245. ACM, New York (2002)
16. Zee, K., Kuncak, V., Rinard, M.C.: An integrated proof language for imperative programs. In:
Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and
Implementation PLDI ’09, pp. 338–351. ACM, New York (2009)
17. Wason, R., Ahmed, P., Rafiq, M.Q.: Automata-based software reliability model: the key to
reliable software. Int. J. Softw. Eng. Its Appl. 7(6), 111–126 (2013)
18. Benes, N., Buhnova, B., Cerna, I., Oslejsek, R.: Reliability analysis in component-based
development via probabilistic model checking. In: Proceedings of the 15th ACM SIGSOFT
Symposium on Component-Based Software Engineering (CBSE’12), pp. 83–92 (2012)
19. Sharygina, N., Browne, J.C., Kurshan, R.P.: A formal object-oriented analysis for software
reliability: design for verification. In: Proceedings of the 4th International Conference on
Fundamental Approaches to Software Engineering (FASE’01), pp. 318–322. Springer-Verlag
(2001)
20. Afshar, H.P., Shojai, H., Navabi, Z.: A new method for checking FSM correctness
(Simulation Replacement). In: Proceedings of International Symposium on
Telecommunications, pp. 219–222. Isfahan–Iran (2003)
21. Alur, R., Dill, D.L.: A theory of timed automata. Theoret. Comput. Sci. 126(2), 183–235
(1994)
22. Chow, T.S.: Testing software design modeled by finite state machines. IEEE Trans. Softw.
Eng. SE-4(3), 178–187 (1978)
23. Hoare, C.A.R.: An axiomatic basis for computer programming. Commun. ACM 12(10), 576–
580 (1969)
24. Crochemore, M., Gabbay, D.M.: Reactive automata. Inf. Comput. 209(4), 692–704 (2011)
25. Goseva-Popstojanova, K., Kamavaram, S.: Assessing uncertainty in reliability of
component-based software systems. In: 14th International Symposium on Software
Reliability Engineering (ISSRE 2003), pp. 307–320 (2003)
26. Goel, A.L.: Software reliability models: assumptions, limitations and applicability. IEEE
Trans. Softw. Eng. 11(12), 1411–1414 (1983)
27. Ramamoorthy, C.V., Bastani, F.B.: Software reliability—status and perspectives. IEEE Trans.
Softw. Eng. SE-8(4), 354–371 (1982)
Object-Oriented Metrics for Defect
Prediction
1 Introduction
Software metrics are collected with the help of automated tools and are used by defect prediction models to predict the defects in a system. Any fault prediction model generally has one dependent variable and several independent variables. The dependent variable indicates whether a software module is faulty or not. Various metrics, such as process metrics and product metrics, can be used as independent variables; for example, cyclomatic complexity and lines of code are method-level product metrics [1].
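As an illustration of this dependent/independent variable setup, the sketch below trains a simple fault classifier on two hypothetical method-level product metrics (cyclomatic complexity and lines of code); the data and the choice of logistic regression are our own assumptions and not a model prescribed in this paper.

from sklearn.linear_model import LogisticRegression

# Hypothetical independent variables: [cyclomatic complexity, lines of code] per module
X = [[2, 35], [15, 420], [4, 60], [22, 510], [3, 48], [18, 390], [5, 75], [25, 600]]
# Dependent variable: 1 = faulty module, 0 = non-faulty (hypothetical labels)
y = [0, 1, 0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)       # the defect prediction model
print(model.predict([[6, 90], [20, 450]]))   # predict fault-proneness of unseen modules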
Cross-company defect prediction (CCDP) is a mechanism that builds defect
predictors w