International Conference on Innovative Computing and Communications
Proceedings of ICICC 2019, Volume 1
Advances in Intelligent Systems and Computing
Volume 1087
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland
Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing,
Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering,
University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University,
Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas
at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao
Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology,
University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute
of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro,
Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management,
Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering,
The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications
on theory, applications, and design methods of Intelligent Systems and Intelligent
Computing. Virtually all disciplines such as engineering, natural sciences, computer
and information science, ICT, economics, business, e-commerce, environment,
healthcare, life science are covered. The list of topics spans all the areas of modern
intelligent systems and computing such as: computational intelligence, soft comput-
ing including neural networks, fuzzy systems, evolutionary computing and the fusion
of these paradigms, social intelligence, ambient intelligence, computational neuro-
science, artificial life, virtual worlds and society, cognitive science and systems,
Perception and Vision, DNA and immune based systems, self-organizing and
adaptive systems, e-Learning and teaching, human-centered and human-centric
computing, recommender systems, intelligent control, robotics and mechatronics
including human-machine teaming, knowledge-based paradigms, learning para-
digms, machine ethics, intelligent data analysis, knowledge management, intelligent
agents, intelligent decision making and support, intelligent network security, trust
management, interactive entertainment, Web intelligence and multimedia.
The publications within “Advances in Intelligent Systems and Computing” are
primarily proceedings of important conferences, symposia and congresses. They
cover significant recent developments in the field, both of a foundational and
applicable character. An important characteristic feature of the series is the short
publication time and world-wide distribution. This permits a rapid and broad
dissemination of research results.
** Indexing: The books of this series are submitted to ISI Proceedings,
EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **
Editors
International Conference on Innovative Computing and Communications
Proceedings of ICICC 2019, Volume 1
Editors
Ashish Khanna
Department of Computer Science and Engineering
Maharaja Agrasen Institute of Technology
New Delhi, Delhi, India

Deepak Gupta
Department of Computer Science and Engineering
Maharaja Agrasen Institute of Technology
New Delhi, Delhi, India
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Dr. Ashish Khanna would like to dedicate this
book to his mentors Dr. A. K. Singh and
Dr. Abhishek Swaroop for their constant
encouragement and guidance and his family
members including his mother, wife and kids.
He would also like to dedicate this work to his
(late) father Sh. R. C. Khanna with folded
hands for his constant blessings.
General Chairs
Prof. Dr. Vaclav Snasel, VŠB—Technical University of Ostrava, Czech Republic.
Prof. Dr. Siddhartha Bhattacharyya, Principal, RCC Institute of Information
Technology, Kolkata.
Honorary Chair
Prof. Dr. Janusz Kacprzyk, FIEEE, Polish Academy of Sciences, Poland.
Conference/Symposium Chair
Prof. Dr. Maninder Kaur, Director, Guru Nanak Institute of Management, Delhi,
India.
Conveners
Dr. Ashish Khanna, Maharaja Agrasen Institute of Technology, India.
Dr. Deepak Gupta, Maharaja Agrasen Institute of Technology, India.
Publicity Chairs
Dr. Hussain Mahdi, National University of Malaysia.
Dr. Rajdeep Chowdhury, Academician, Author and Editor, India.
Prof. Dr. Med Salim Bouhlel, University of Sfax, Tunisia.
Dr. Mohamed Elhoseny, Mansoura University, Egypt.
Dr. Anand Nayyar, Duy Tan University, Vietnam.
Dr. Andino Maseleno, STMIK Pringsewu, Lampung, Indonesia.
Publication Chairs
Dr. D. Jude Hemanth, Associate Professor, Karunya University, Coimbatore.
Dr. Nilanjan Dey, Techno India College of Technology, Kolkata, India.
Gulshan Shrivastava, National Institute of Technology, Patna, India.
Co-conveners
Dr. Avinash Sharma, Maharishi Markandeshwar University (Deemed to be
University), India.
P. S. Bedi, Guru Tegh Bahadur Institute of Technology, Delhi, India.
Mr. Moolchand Sharma, Maharaja Agrasen Institute of Technology, India.
Innovative Computing and Communications. The articles are organized into two
volumes under some broad categories covering subject matters on machine learning,
data mining, big data, networks, soft computing and cloud computing, although,
given the diverse areas of research reported, a strict categorization was not always possible.
ICICC 2019 invited five keynote speakers, eminent researchers in the field of computer
science and engineering, from different parts of the world. In addition to the plenary
sessions on each day of the conference, five concurrent technical sessions were held
every day to accommodate the oral presentation of around 129 accepted papers. Keynote
speakers and session chair(s) for each of the concurrent sessions were leading researchers
from the thematic area of the session. A technical exhibition was held during both days
of the conference, which put on display the latest technologies, expositions, ideas and
presentations. The delegates were provided with a book of extended abstracts to quickly
browse through the contents, participate in the presentations and reach a broad audience.
The research part of the conference was organized in a total of 35 special sessions. These
special sessions provided the opportunity for researchers conducting research in specific
areas to present their results in a more focused environment.
An international conference of such magnitude and release of the ICICC 2019
proceedings by Springer has been the remarkable outcome of the untiring efforts
of the entire organizing team. The success of an event undoubtedly involves the
painstaking efforts of several contributors at different stages, dictated by their
devotion and sincerity. Fortunately, since the beginning of its journey, ICICC 2019
has received support and contributions from every corner. We thank them all who
have wished the best for ICICC 2019 and contributed by any means toward its
success. The edited proceedings volumes by Springer would not have been possible
without the perseverance of all the steering, advisory and technical program
committee members.
The organizers of ICICC 2019 thank all the contributing authors for
their interest and exceptional articles. We would also like to thank the authors of the
papers for adhering to the time schedule and for incorporating the review com-
ments. We wish to extend our heartfelt acknowledgment to the authors, peer
reviewers, committee members and production staff whose diligent work put shape
to the ICICC 2019 proceedings. We especially want to thank our dedicated team of
peer reviewers who volunteered for the arduous and tedious step of quality
checking and critique on the submitted manuscripts. We wish to thank our faculty
colleagues Mr. Moolchand Sharma and Ms. Prerna Sharma for extending their
enormous assistance during the conference. The time spent by them and the mid-
night oil burnt are greatly appreciated, for which we will ever remain indebted. The
management, faculties and administrative and support staff of the college have
always been extending their services whenever needed, for which we remain
thankful to them.
Lastly, we would like to thank Springer for accepting our proposal for
publishing the ICICC 2019 proceedings. Help received from Mr. Aninda Bose,
Senior Editor, Acquisition, in the process has been very useful.
Ashish Khanna received his Ph.D. from NIT, Kurukshetra, in March 2017, his
[Link]. in 2009 and his [Link]. from GGSIPU, Delhi, in 2004. He completed his
postdoc at Inatel, Brazil. He has published 72 research papers as well as book
chapters in reputed journals and conferences. He has also authored/edited 14 books.
Deepak Gupta received his Ph.D. in CSE from Dr. APJ Abdul Kalam Technical
University, M.E. from Delhi University, & [Link]. in 2017, 2010 and 2005
respectively. He completed his postdoc at Inatel, Brazil. He is currently working at
Maharaja Agrasen Institute of Technology, GGSIPU, India. He has published 82 papers
in international journals and conferences. He has authored/edited 36 books with
international publishers.
Aboul Ella Hassanien (Abo) received his [Link]. with honours in 1986 and [Link].
degree in 1993, from the Pure Mathematics and Computer Science Department at
the Faculty of Science, Ain Shams University, Cairo, Egypt. He received his
doctoral degree from Tokyo Institute of Technology, Japan in 1998. He is a Full
Professor at the IT Department, Faculty of Computer and Information, Cairo
University. He has authored over 380 research publications in leading
peer-reviewed journals, book chapters and conference proceedings.
Improving the Accuracy of Collaborative
Filtering-Based Recommendations
by Considering the Temporal Variance
of Top-N Neighbors
1 Introduction
Collaborative filtering (CF) is one of the most popular filtering approaches used in
e-commerce applications for recommending online items to the users [1]. To predict
the recommendable items, CF-based recommendation systems consider the alikeness
of the ratings (rating similarity) given by the similar users (neighbors) for an item.
Similarity measures play an important role in the accuracy of CF [1]. Inaccurately
chosen top-n similar neighbors of the target user lead to low prediction accuracy in
CF-based recommendation systems [2]. However, traditional similarity measures have
certain limitations in finding the similar neighbors of the target user in different time
periods. A similar rating pattern between two users suggests that they might have
similar likings, but human preferences change over time and, as a result, the list of neighbors
of a particular user also changes. For example, Table 1 shows a list of four users and
six movies with the rating information. All users provide the ratings of the first three
movies in 1996 and the remaining three movies in 1997.
The above table illustrates the changing interest of User 1: the most similar user to
User 1 is User 2 in 1996, but his rating pattern is more similar to User 3 in 1997.
Traditional similarity measures are computed over the complete table, and the accuracy
of recommendations based on the old set of similar users tends to decrease over time.
Hence, there is a need for a novel neighborhood calculation approach that considers
the changing preferences of the neighbors to enhance the accuracy of personalized
recommendations.
The advantage of this approach is that we can find the top-n neighbors of the
target users in different time intervals. This will improve the accuracy of personalized
recommendation compared to using traditional similarity metrics.
The aim of the proposed approach is to extract the similar users of the target user in
different time intervals for more personalized recommendations and to improve the
accuracy of the traditional neighborhood-based CF algorithms. The main contribu-
tions of this paper are as follows:
– Calculate the total number of ratings provided by each user in different time intervals.
– On the dataset containing the users' total number of ratings in different years, apply
the optimized K-means clustering algorithm (elbow method) to find the optimal number
of clusters (a minimal sketch of these steps is given after this list).
– Compare the traditional CF algorithm and the proposed approach on the basis of
performance metrics, i.e., MAE, RMSE, precision, recall, and accuracy.
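A minimal sketch of the first two steps, building the users' yearly contribution matrix Yuc and choosing the number of clusters with the elbow method, is given below. This is not the authors' code; the file name "ratings.csv" and its column names are illustrative assumptions for a MovieLens-style ratings file.

```python
# Minimal sketch (not the authors' code) of the first two contribution steps:
# build the users' yearly contribution matrix Yuc and choose the number of
# clusters with the elbow method.
import pandas as pd
from sklearn.cluster import KMeans

ratings = pd.read_csv("ratings.csv")  # assumed columns: user_id, item_id, rating, timestamp
ratings["year"] = pd.to_datetime(ratings["timestamp"], unit="s").dt.year

# Yuc: rows = users, columns = years, values = number of ratings given that year
yuc = ratings.pivot_table(index="user_id", columns="year",
                          values="rating", aggfunc="count", fill_value=0)

# Elbow method: within-cluster sum of squares (inertia) for k = 1..10;
# pick k where the decrease flattens out.
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(yuc)
    print(k, km.inertia_)
```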
Section 2 discusses the associated background very briefly. Not many works have
addressed the temporal information for calculating the neighborhood. Few of them
are mentioned in this section. Section 3 presents the solution details which also
includes two algorithms: (a) finding optimal number of clusters of similar users
and (b) finding the top-n neighbors of a target user. Section 4 does a comparative
analysis of the proposed approach on the MovieLens dataset using the performance
metrics such as MAE, RMSE, precision, recall, F-score, and accuracy. Finally, Sect. 5
concludes the paper.
A recommender system (RS) gives probabilistic suggestions of products or items to users
based on their explicit and implicit preferences, with the help of previously collected data
about users and items. Content-based filtering and CF are the two main filtering approaches
used in RS [3]. In content-based recommendation [4], the RS recommends items similar to
those the user has liked in the past, whereas CF identifies the top-n users that have a
similar taste to the target user. Breese et al. [5] introduced the two classes of CF, i.e.,
model-based CF and memory-based CF. In model-based CF, a model is first trained using
previous information about users and items and is then used for prediction. Campos et al. [6]
proposed a Bayesian network that combines the features of content-based filtering and
collaborative filtering.
They reported effective experimental results on the MovieLens dataset. Computing the
similarity between top-n users/items and computing predictions using these top-n similar
users/items are the two basic steps in memory-based CF. Based on user similarity and item
similarity, memory-based CF can be categorized into user-based CF and item-based CF,
respectively. Scalability and sparsity are the two major issues in CF. Li et al. [7] reduced
the scalability problem with an optimized MapReduce implementation of an item-based
collaborative filtering recommendation algorithm. To mitigate the sparsity issue, Kanti
et al. [8] introduced a new CF approach that combines both user and item similarities for
rating prediction. Lee et al. [9] elaborated a time-based recommender system for more
accurate recommendation; their system utilizes temporal information, i.e., user purchase
time and item launch time, together with the rating information to find more personalized
neighbors in different time intervals. Koohi et al. [10] proposed a method to find similar
neighbors of the target user to improve the performance of user-based CF. Najafabadi
et al. [11] used k-means clustering and association rule mining to improve the accuracy of
collaborative filtering recommendations. However, k-means clustering has the limitation
that the optimal number of clusters providing high accuracy must still be determined.
Hence, our work mainly focuses on minimizing this limitation of k-means clustering while
providing more personalized recommendations.
A rating captures a user's feedback, or inherent interest, in a particular item. Users'
interests are dynamic in nature, so their sets of similar neighbors also change across
time periods. The philosophy of the proposed approach is that two users are said to be
similar if their rating patterns are similar in the same time periods. Hence, the proposed
work includes a matrix Yuc (users' yearly contribution) that records the total number of
ratings provided by each user in a particular year.
Here, r̂ui denotes the predicted rating of target user u on item i, and C(u, c) is a binary
matrix indicating whether user u belongs to cluster c: if user u belongs to cluster c, the
value of C(u, c) is 1, otherwise 0. r̄u and r̄v denote the average ratings of users u and v,
respectively, rvi is the rating of user v on item i, and sim(u, v) is the similarity between
users u and v. Algorithm 2 lists the complete steps of the proposed recommendation approach.
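The prediction equation itself is not reproduced in this extract. A plausible reconstruction consistent with the symbols just described, assuming the standard mean-centering neighborhood form with the sum restricted to neighbors v in the target user's cluster c via the indicator C(v, c), is:

\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v} C(v, c)\,\mathrm{sim}(u, v)\,(r_{vi} - \bar{r}_v)}{\sum_{v} C(v, c)\,\lvert \mathrm{sim}(u, v) \rvert}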
4 Comparative Analysis
The MovieLens ml-100k dataset has been used for the experimental analysis [12]. This
dataset contains 100,000 ratings given by 943 users to 1682 movies. Ratings lie in the
range 1 to 5 in increments of 1, where 1 denotes the lowest rating and 5 the highest. The
dataset has 93.695% sparsity. Based on different training and testing splits, the dataset
is divided into Dataset 1 and Dataset 2: Dataset 1 uses 45% of the data for training and
55% for testing, whereas Dataset 2 uses 35% for training and 65% for testing. We use the
Pearson correlation as the similarity measure and a mean-centering approach for rating
prediction, as shown in Table 3.
In Table 3, ri,u and ri,v denote the ratings of users u and v on item i. Six different
metrics, i.e., MAE, RMSE, precision, recall, F-score, and accuracy, have been used to
evaluate the proposed approach [1]. The equations for computing the MAE and RMSE values
are as follows:
MAE = \frac{1}{N} \sum_{i=1}^{N} \lvert p_i - \hat{q}_i \rvert    (1)

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (p_i - \hat{q}_i)^2}    (2)
Here, pi and q̂i denote the predicted and actual ratings of item i, respectively, and N
represents the total number of predicted items. We consider ratings above 3 as high
(recommended items) and ratings below 3 as low (not recommended items). The classification
of the possible results is shown in Table 4.
Hence, using Table 4, the equations of precision, recall, F-score, and accuracy become:
\text{Precision} = \frac{\#tp}{\#tp + \#fp}    (3)

\text{Recall} = \frac{\#tp}{\#tp + \#fn}    (4)

\text{F-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}    (5)

\text{Accuracy} = \frac{\#tp + \#tn}{\#tp + \#tn + \#fp + \#fn}    (6)
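A compact, illustrative implementation of Eqs. (1)-(6) is sketched below; the variable names are placeholders (not from the paper), and the threshold of 3 follows the rule stated in the text.

```python
import numpy as np

def mae(pred, actual):
    pred, actual = np.asarray(pred), np.asarray(actual)
    return np.mean(np.abs(pred - actual))                      # Eq. (1)

def rmse(pred, actual):
    pred, actual = np.asarray(pred), np.asarray(actual)
    return np.sqrt(np.mean((pred - actual) ** 2))              # Eq. (2)

def classification_metrics(pred, actual, threshold=3):
    pred_pos = np.asarray(pred) > threshold                    # predicted "recommended"
    true_pos = np.asarray(actual) > threshold                  # actually "recommended"
    tp = np.sum(pred_pos & true_pos)
    fp = np.sum(pred_pos & ~true_pos)
    fn = np.sum(~pred_pos & true_pos)
    tn = np.sum(~pred_pos & ~true_pos)
    precision = tp / (tp + fp)                                 # Eq. (3)
    recall = tp / (tp + fn)                                    # Eq. (4)
    f_score = 2 * precision * recall / (precision + recall)    # Eq. (5)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (6)
    return precision, recall, f_score, accuracy
```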
5 Conclusion
Collaborative filtering has been the most popular technique used in recommenda-
tion systems to suggest online items to the users. The recommendation to a user is
done based on the preferences of the similar users (neighbors) who rated the same
items likewise. Hence, for personalized and accurate recommendation, it is crucial
to find the set of neighbors correctly. With the explosive growth of the number of
users, traditional CF faces difficulty in finding the top-n neighbors of the target
user. In particular, traditional similarity measures have issues in computing the top-n
similar users because the interests of the target user change over time. This affects
the accuracy of the recommendation. This paper addresses this problem by calculating
the top-n neighbors on a per-year basis. This approach ensures that the list of top-n
neighbors always remains up to date with the changed preferences of the target user's
neighbors. As a result, the proposed CF provides a significant improvement in prediction
accuracy over the existing traditional CF algorithm.

Fig. 1 Comparison between traditional CF and proposed approach based on MAE, RMSE, and
precision values

Fig. 2 Comparison between traditional CF and proposed approach based on recall, F-Score, and
accuracy
References
1. S.K. Singh, P.K.D. Pramanik, P. Choudhury, A comparative study of different similarity metrics
in highly sparse rating dataset, in Data Management, Analytics and Innovation, Proceedings of
ICDMAI, vol. 2 (Springer, Berlin, 2018), pp. 45–60
2. A.M. Jorge, J. Vinagre, M. Domingues, J. Gama, C. Soares, P. Matuszyk, M. Spiliopoulou,
Scalable Online Top-N Recommender Systems (Springer International Publishing, Berlin, 2017)
3. M. Balabanović, Y. Shoham, Fab: content-based, collaborative recommendation. Commun.
ACM 40(3), 66–72 (1997)
4. B. Mobasher, Data mining for web personalization, in The Adaptive Web (Springer, Heidelberg,
2007), pp. 90–135
5. J. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative
filtering, in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (1998),
pp. 43–52
6. L.M. Campos, J.M. de Fernández-Luna, J.F. Huete, M.A. Rueda-Morales, Combining content-
based and collaborative recommendations: A hybrid approach based on Bayesian networks.
Int. J. Approximate Reasoning 51, 785–799 (2010)
7. C. Li, K. He, CBMR: An optimized map reduce for item-based collaborative filtering recom-
mendation algorithm with empirical analysis. Concurrency Comput. Pract. Experience 29(10),
e4092 (2017)
8. S. Kanti, T. Mahara, Merging user and item based collaborative filtering to alleviate data
sparsity. Int. J. Syst. Assur. Eng. Manag. 1–7 (2016)
9. T.Q. Lee, Y. Park, Y.T. Park, A time-based approach to effective recommender systems using
implicit feedback. Expert Syst. Appl. 34(4), 3055–3062 (2008)
10. H. Koohi, K. Kiani, A new method to find neighbor users that improves the performance of
collaborative filtering. Expert Syst. Appl. 83, 30–39 (2017)
11. M.K. Najafabadi, M.N. Mahrin, S. Chuprat, H.M. Sarkan, Improving the accuracy of collab-
orative filtering recommendations using clustering and association rules mining on implicit
data. Comput. Hum. Behav. 67, 113–128 (2017)
12. MovieLens | GroupLens: [Link] Last Accessed 18 Aug
2018
Exploring the Effect of Tasks Difficulty
on Usability Scores of Academic Websites
Computed Using SUS
Abstract The prime objective of this study is to empirically determine the effect
of task difficulty on usability scores computed using the System Usability Scale
(SUS). A usability dataset is created by involving twelve end-users who evaluate the
usability of 15 academic websites in a laboratory. Each end-user performs three
subsets of six tasks whose difficulty varies from easy to impossible across six
different categories. Results are obtained after applying two statistical techniques,
one-way ANOVA and correlation with regression. The results show that the SUS scores
vary from higher to lower values as end-users conduct the usability assessment with
lists of easy, moderate, and impossible tasks on the academic websites. The results
also indicate an effect of task difficulty on the correlation between the SUS scores
and task success rate: although the correlation is strong for each subset of tasks,
its strength varies depending on the nature of the tasks.
1 Introduction
the effect of task difficulty on SUS scores is mentioned as an open issue, which
motivated us to perform the experiment reported in this study [6]. The aim of this
paper is therefore to determine the effect of task difficulty, ranging from easy to
impossible, on usability scores computed using SUS. However, SUS does not provide
information about in what respects one interactive system is more usable than another.
For this reason, conventional ISO metrics [7] need to be measured as well to obtain a
more detailed picture of system usability [5]. A recent study also showed a strong
positive correlation between subjective usability measures and the ISO metric of task
success rate, for both laboratory and field studies at the individual and system
level [5]. This further motivates us to explore the effect of task difficulty on the
correlation between the SUS scores and the success rate of tasks ranging from easy to
impossible. The remaining sections of this paper are organized as follows: Section 2
describes the research methodology. Section 3 explains the experimental results with
the analysis. Section 4 contains conclusions with future directions, and the last
section contains the appendix.
2 Research Methodology
15 academic websites1 are evaluated with three subsets of tasks in which the difficulty
level varies from easy to impossible under six different categories, as presented in
Table 1.
Only tasks frequently used by all stakeholders are considered; the purpose of defining
these tasks is to identify whether a website satisfies the usability goals required by
all its stakeholders.
Table 1 List of categories considered for usability evaluation with three subsets of tasks

1. Content: Impossible task: find online user feedback for the professor. Medium-difficulty task: find the concerned person responsible for uploading upcoming events; if required, then contact. Easy task: find the vision and mission statement.
2. University program: Impossible task: find out Ph.D. supervisors' details like available slots, etc. Medium-difficulty task: determine how to apply for the program. Easy task: find out about the various programs conducted in the university.
3. Navigation: Impossible task: find the sitemap. Medium-difficulty task: show the users where they are. Easy task: navigate to the homepage from any navigational level.
4. Search: Impossible task: find a screen reader on the website for visually impaired students. Medium-difficulty task: find image and video-based search results. Easy task: find the simple and advanced search facility.
5. Mobile: Impossible task: search for the last semester result on a mobile device. Medium-difficulty task: determine whether the website has its own mobile app. Easy task: determine whether the university website's text is readable on mobile devices.
6. Social media: Impossible task: find any digital storytelling media to help students. Medium-difficulty task: find the blog of the university. Easy task: find the Facebook page of the university.
2.3 Procedure
Each end-user performs all the tasks mentioned in Table 1 on the assigned academic
website. After executing these tasks in the laboratory, each end-user fills in the
10-item SUS survey2 on the basis of their task experience during the experiment. A
Google form is generated for collecting the SUS survey responses. The procedure for
computing the SUS scores is adopted from [1]; the score ranges from 0 to 100, and a
higher SUS score indicates a higher usability rating for the interactive system.
Further, the ISO metric of system effectiveness, measured as the task success rate on
each academic website, is also calculated [5, 11].
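For reference, Brooke's SUS scoring rule [1] can be sketched as follows; this is an illustrative helper, not code from this study.

```python
def sus_score(responses):
    """responses: list of ten item ratings, each 1-5. Odd items contribute
    (response - 1), even items contribute (5 - response); the sum is
    multiplied by 2.5 to map onto the 0-100 range."""
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5

# Example: one end-user's responses to the ten SUS items
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))  # -> 80.0
```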
3 Experimental Results
This section describes the experimental results which are obtained after performing
two statistical techniques, i.e. one way ANOVA for comparative analysis, and cor-
relation with regression for investigating the relationship between SUS scores and
task success rate.
A comparative analysis is performed to show how the SUS scores vary when the usability
assessment is done with lists of easy, moderate, and impossible-level tasks. Once the
usability dataset is collected from the end-users for the 15 academic websites, one-way
ANOVA is applied using the statistical software Minitab 17. Table 2 contains all the
experimental and computed SUS mean values for the 15 academic websites evaluated with
the three subsets of tasks. The results show an overall difference in the usability
ratings of the 15 academic websites between the three subsets of tasks, with p < 0.001.
Further, from Table 2 it is observed that the end-users executing easy-level tasks rated
13 of the websites with higher SUS scores compared to the other two groups. On the
contrary, the end-users executing impossible-type tasks rated the 15 academic websites
with the lowest SUS scores. Research practitioners must therefore carefully consider the
tasks used for usability evaluation, as their difficulty level affects the SUS scores.
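The paper applies one-way ANOVA in Minitab 17; an equivalent check can be sketched in Python with scipy.stats.f_oneway. The per-end-user SUS scores below are illustrative placeholders for a single website, not the study's raw data.

```python
from scipy import stats

# Illustrative per-end-user SUS scores for one website, grouped by task difficulty
easy       = [60.0, 59.5, 60.5, 59.0, 60.0, 59.5, 60.5, 59.0, 60.0, 59.5, 60.0, 60.0]
moderate   = [37.5, 35.0, 40.0, 37.5, 35.0, 40.0, 37.5, 35.0, 40.0, 37.5, 37.5, 37.5]
impossible = [22.5, 25.0, 30.0, 17.5, 22.5, 25.0, 30.0, 17.5, 22.5, 25.0, 22.5, 25.0]

f_value, p_value = stats.f_oneway(easy, moderate, impossible)
print(f"F = {f_value:.2f}, p = {p_value:.4g}")
```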
Another pair of statistical techniques, correlation and regression, is also applied to
the collected usability dataset. Correlation is a statistical measure that determines
the association between two variables, whereas regression explains how an independent
variable is numerically related to the dependent variable. The standard Pearson
correlation was computed in Minitab 17 to investigate the relationship between the
computed SUS scores and the task success rate for the three subcategories of tasks
ranging from easy and moderate to impossible. Table 3 contains all the experimental and
computed values. The results show a positive correlation between SUS scores and task
success rate for the academic websites for all three subcategories of tasks.
As seen in Fig. 1 (easy tasks), Fig. 2 (moderate tasks), and Fig. 3 (impossible tasks),
higher SUS scores are associated with higher success rates and vice versa; in other
words, easy-level tasks have both higher success rates and higher SUS scores. For
easy-level tasks, r(15) = 0.951, p < 0.001, r² = 0.905; for moderate-level tasks,
r(15) = 0.914, p < 0.001, r² = 0.836; and for impossible-level tasks, r(15) = 0.971,
p < 0.001, r² = 0.943.
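A sketch of the correlation and regression step using SciPy, fed with the easy-task columns of Table 3, is shown below. Treating the 15 website-level means as the unit of analysis is an assumption about how the reported r(15) values were obtained.

```python
import numpy as np
from scipy import stats

# Easy-task columns taken from Table 3 (15 academic websites)
sus_mean = np.array([59.79, 50.00, 53.75, 57.50, 51.04, 58.12, 42.50, 36.25,
                     53.95, 54.79, 41.25, 33.75, 53.54, 53.95, 23.95])
success_rate = np.array([98.61, 75.00, 83.33, 97.22, 80.55, 97.22, 69.44, 65.27,
                         80.55, 84.72, 68.05, 63.88, 83.33, 81.94, 51.38])

r, p = stats.pearsonr(sus_mean, success_rate)
fit = stats.linregress(success_rate, sus_mean)
print(f"r = {r:.3f}, p = {p:.3g}, r^2 = {r * r:.3f}")
print(f"regression fit: SUS = {fit.slope:.2f} * success_rate + {fit.intercept:.2f}")
```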
Table 2 SUS score mean (M), standard deviation (SD), F-value, and P-value for 15 academic websites, broken down by task difficulty level (all P-values are 0.000)

1. IISC: Easy M = 59.79, SD = 0.72; Moderate M = 37.50, SD = 2.61; Impossible M = 23.75, SD = 5.86; F = 283.58
2. [Link]: Easy M = 50.00, SD = 1.06; Moderate M = 29.58, SD = 3.34; Impossible M = 12.50, SD = 2.61; F = 663.38
3. [Link]: Easy M = 53.75, SD = 1.30; Moderate M = 25.00, SD = 5.22; Impossible M = 54.79, SD = 3.27; F = 259.06
4. [Link]: Easy M = 57.50, SD = 1.85; Moderate M = 27.50, SD = 2.61; Impossible M = 25.00, SD = 5.22; F = 314.00
5. [Link]: Easy M = 51.04, SD = 4.19; Moderate M = 40.42, SD = 2.79; Impossible M = 29.58, SD = 3.34; F = 113.52
6. [Link]: Easy M = 58.13, SD = 2.17; Moderate M = 29.58, SD = 3.34; Impossible M = 31.25, SD = 3.92; F = 296.07
7. [Link]: Easy M = 42.50, SD = 5.44; Moderate M = 29.38, SD = 3.39; Impossible M = 27.50, SD = 3.69; F = 43.97
8. [Link]: Easy M = 36.25, SD = 2.5; Moderate M = 28.96, SD = 3.76; Impossible M = 26.04, SD = 4.19; F = 26.21
9. [Link]: Easy M = 53.96, SD = 4.19; Moderate M = 37.50, SD = 2.61; Impossible M = 27.50, SD = 3.69; F = 169.00
10. [Link]: Easy M = 54.79, SD = 3.28; Moderate M = 41.67, SD = 1.62; Impossible M = 30.63, SD = 6.92; F = 86.00
11. [Link]: Easy M = 41.25, SD = 6.08; Moderate M = 25.83, SD = 3.74; Impossible M = 32.08, SD = 3.34; F = 34.84
12. [Link]: Easy M = 33.75, SD = 2.5; Moderate M = 35.41, SD = 1.44; Impossible M = 27.08, SD = 3.51; F = 33.91
13. [Link]: Easy M = 53.54, SD = 4.19; Moderate M = 32.71, SD = 2.91; Impossible M = 30.83, SD = 2.89; F = 166.38
14. b-u-ac: Easy M = 53.96, SD = 1.29; Moderate M = 33.12, SD = 2.41; Impossible M = 28.33, SD = 2.68; F = 455.22
15. [Link]: Easy M = 23.96, SD = 2.91; Moderate M = 24.17, SD = 4.44; Impossible M = 13.13, SD = 2.17; F = 43.70
These results indicate a strong relationship between the SUS scores and task success
rate, although the strength of this relationship differs; the strength of the correlation
depends on the nature of the tasks used to establish it. Our findings echo those of
[5, 12]. This research work encourages novice researchers to carefully consider the tasks
and their difficulty levels for usability assessment as a determinant in the estimation
of usability through SUS scores. A significant question is why it is important for
researchers to understand the effect of task difficulty on usability scores computed
using SUS. It is simply because the intuitive nature of researchers can lead them to
select inappropriate
Table 3 Mean SUS scores computed with the execution of easy, moderate, and impossible-level tasks, and the corresponding task success rates (%), for 15 academic websites

1. IISC: Easy SUS 59.79, success 98.61; Moderate SUS 37.50, success 65.28; Impossible SUS 23.75, success 16.67
2. [Link]: Easy SUS 50.00, success 75.00; Moderate SUS 29.58, success 33.33; Impossible SUS 12.50, success 1.39
3. [Link]: Easy SUS 53.75, success 83.33; Moderate SUS 25.00, success 20.83; Impossible SUS 54.79, success 84.72
4. [Link]: Easy SUS 57.50, success 97.22; Moderate SUS 27.50, success 22.22; Impossible SUS 25.00, success 18.05
5. [Link]: Easy SUS 51.04, success 80.55; Moderate SUS 40.41, success 50.00; Impossible SUS 29.58, success 33.33
6. [Link]: Easy SUS 58.12, success 97.22; Moderate SUS 29.58, success 33.33; Impossible SUS 31.25, success 37.50
7. [Link]: Easy SUS 42.50, success 69.44; Moderate SUS 29.38, success 33.33; Impossible SUS 27.50, success 20.83
8. [Link]: Easy SUS 36.25, success 65.27; Moderate SUS 28.96, success 31.94; Impossible SUS 26.04, success 19.44
9. [Link]: Easy SUS 53.95, success 80.55; Moderate SUS 37.50, success 47.22; Impossible SUS 27.50, success 20.83
10. [Link]: Easy SUS 54.79, success 84.72; Moderate SUS 41.66, success 51.39; Impossible SUS 30.63, success 34.72
11. [Link]: Easy SUS 41.25, success 68.05; Moderate SUS 25.83, success 19.44; Impossible SUS 32.08, success 34.72
12. [Link]: Easy SUS 33.75, success 63.88; Moderate SUS 35.41, success 45.83; Impossible SUS 27.08, success 20.83
13. [Link]: Easy SUS 53.54, success 83.33; Moderate SUS 32.71, success 36.11; Impossible SUS 30.83, success 23.61
14. b-u-ac: Easy SUS 53.95, success 81.94; Moderate SUS 33.13, success 36.11; Impossible SUS 28.33, success 20.83
15. [Link]: Easy SUS 23.95, success 51.38; Moderate SUS 24.17, success 19.44; Impossible SUS 13.13, success 1.39
Fig. 1 Relationship between SUS and success rate mean when end-users execute a list of easy
tasks
Fig. 2 Relationship between SUS and success rate mean when end-users execute a list of moderate
tasks
Fig. 3 Relationship between SUS and success rate mean when end-users execute a list of impossible
tasks
tasks for the usability evaluation of an interactive system, which can affect its
usability score computed using SUS.
4 Conclusions and Future Directions

The aim of this paper was to examine the effect of task difficulty on usability scores
computed using SUS. Six different categories (i.e. content, university program,
navigation, search, mobile, and social media) were created containing tasks whose
difficulty level varies from easy to impossible. In the laboratory, a usability dataset
was collected by involving twelve end-users who evaluated the usability of 15 academic
websites with the list of defined tasks whose difficulty varies from easy to impossible.
Minitab 17 was employed for applying the statistical techniques, i.e. one-way ANOVA,
correlation, and regression. The obtained results show that the SUS scores vary from
higher to lower values when end-users conduct the usability assessment with lists of
easy to impossible tasks on academic websites. From this,
the conclusion can be made that a higher difficulty level of tasks results in lower SUS
scores and vice versa. The results also indicate a strong relationship between the SUS
scores and system effectiveness, an ISO metric measured as task success rate, although
the strength of this relationship differs for different types of tasks, depending on the
nature of the tasks considered. From this, the conclusion can be made that the difficulty
level of tasks does not remove the correlated relationship between the SUS score and
success rate, although the strength of this relationship varies depending on the nature
of the tasks. This research work encourages novice researchers to carefully consider the
tasks for usability assessment as a determinant in the estimation of usability through
SUS scores. In this study, a contemporary application of SUS is presented for academic
websites, where the comparative results of usability assessments with tasks of varying
difficulty levels are considered. However, further studies will be required to understand
this behavior. A field study of the same research work will also be needed so that the
strength of the correlation in the field can be determined. Further, a generalized
usability model can be implemented and optimized using evolutionary algorithms [13], and
a novel automated tool can also be implemented for usability evaluation in the near
future [14].
5 Appendix
See Fig. 4.
References
1. J. Brooke, SUS-A quick and dirty usability scale. Usability Eval. Ind. 189(194), 4–7 (1996)
2. J. Kirakowski, M. Corbett, SUMI: The software usability measurement inventory. Br. J. Edu.
Technol. 24(3), 210–212 (1993)
3. J.R. Lewis, IBM computer usability satisfaction questionnaires: psychometric evaluation and
instructions for use. Int. J. Hum.-Comput. Interact. 7(1), 57–78 (1995)
4. J. Sauro, SUPR-Q: A comprehensive measure of the quality of the website user experience. J.
Usability Stud. 10(2), 68–86 (2015)
5. P. Kortum, S.C. Peres, The relationship between system effectiveness and subjective usability
scores using the System Usability Scale. Int. J. Hum.-Comput. Interact. 30(7), 575–584 (2014)
6. J.R. Lewis, The system usability scale: past, present, and future. Int. J. Hum.–Comput. Interact.
1–14 (2018)
7. I. Standard, Ergonomic requirements for office work with visual display terminals (vdts)–part
11: Guidance on usability. ISO Standard 9241-11: 1998. Int. Organ. Stand (2018)
8. [Link]
9. K. Sagar, A. Saha, Qualitative usability feature selection with ranking: a novel approach for
ranking the identified usability problematic attributes for academic websites using data-mining
techniques. Hum.-centric Comput. Inf. Sci. 7(1), 29 (2017)
10. K. Sagar, D. Gupta, A.K. Sangaiah, Manual versus automated qualitative usability assessment
of interactive systems. Concurrency Comput. Pract. Experience e5091
11. [Link]
12. J. Sauro, J.R. Lewis, Correlations among prototypical usability metrics: evidence for the con-
struct of usability. in Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems (ACM, 2009, April), pp. 1609–1618
13. R. Jain, D. Gupta, A. Khanna, Usability feature optimization using MWOA. ed. by S. Bhat-
tacharyya, A. Hassanien, D. Gupta, A. Khanna, I. Pan International Conference on Innovative
Computing and Communications. Lecture Notes in Networks and Systems, vol 56 (Springer,
Singapore, 2019)
14. S. Kapoor, K. Sagar, B.V.R. Reddy, Speedroid: a novel automation testing tool for mobile apps.
ed. by S. Bhattacharyya, A. Hassanien, D. Gupta, A. Khanna, I Pan International Conference
on Innovative Computing and Communications. Lecture Notes in Networks and Systems, vol
56 (Springer, Singapore, 2019)
Prediction and Estimation of Dominant
Factors Contributing to Lesion
Malignancy Using Neural Network
Abstract Cancer is one of the major causes of death worldwide. An infected region
experiences uncontrollable growth of cells, resulting in unstoppable growth of
protrusions or lesions. Lesions are categorized as benign or malignant. Imaging
techniques have gained prominence over the last two decades in the diagnosis and
detection of cancer cells. Automated classifiers could upgrade the diagnosis process
substantially, in terms of both time consumption and accuracy, by automatically
distinguishing benign and malignant patterns. The paper presents a statistical analysis
combined with an artificial neural network (ANN) tool for early detection of disease;
the problem addressed is breast lesions, but the same approach can be applied to any
category of lesion appearing in any region of the body. The statistical analysis of a
sample of 699 records was carried out to establish the dependence of cell malignancy on
the selected microscopic attributes with a higher percentage contribution. Further, a
technique for classifying breast lesions into benign and malignant categories using ANN
was applied, which achieved sensitivity, specificity and classification accuracy of
96.94%, 98.75% and 97.70%, respectively, for the complete set of nine microscopic
attributes, and 96.70%, 96.72% and 96.68%, respectively, for the selected microscopic
attributes having a higher percentage contribution to cell malignancy. The reduction in
features resulted in a smaller number of epochs and hence reduced processing time for
identifying the infection, supporting early detection.
1 Introduction
like decision tree and rough sets; second, the method based upon statistics, like
support vector machine; and third, artificial neural network [7].
The paper presents a statistical analysis for establishing the dependence of benign or
malignant cells on nine microscopic attributes and identifying the dominant or
significant attributes: uniformity in shape of the cell, uniformity in size of the cell,
marginal adhesion, clump thickness, bland chromatin, bare nuclei, single epithelial cell
size, normal nucleoli and mitoses, together with an artificial neural network technique
for classification of lesions. The ANN algorithm was examined on a dataset of 699
samples, both for all nine microscopic attributes and for the significant or dominant
attributes only. The paper is arranged as follows: Sect. 1 presents the introduction,
Sect. 2 presents related work, Sect. 3 explains the proposed methodology for estimation
of the dominant factors and implementation of ANN for lesion detection, Sect. 4 presents
the results of the statistical analysis and the sensitivity, specificity and
classification accuracy achieved using the ANN tool, and finally the conclusion is
presented.
2 Literature Review
In India, the rural population has been devoid of basic, proper and timely medical care.
The author of [8] proposed a system which provides instant medical attention to remote
patients; the system analyzes the symptoms of well-known diseases, prescribes precise
medicine within milliseconds and can improve the current medical situation. Breast
cancer represents 25% of all cancer cases in the world [9]. Over the past few decades,
many new imaging techniques have been developed for the diagnosis of breast cancer; they
assist radiologists in highlighting suspicious areas and help them find lesions that
cannot be spotted by the naked eye. As technology improves at a very rapid rate, many
researchers are working on the development of intelligent
techniques that can be used in the detection of lesions with improved classification
accuracy. The decision tree is a significant approach used for the prediction of
lesions; it is an easy and efficient classification method, and its ability to classify
the primary attributes of a given case study in diverse ways is an important paradigm
[10]. On the basis of research outcomes, the artificial neural network has been
demonstrated to be a good classifier for malignant lesions in mammography; the
implementation of a three-layer neural network trained with the backpropagation
algorithm has become a pioneering approach in ANN-based mammography [11]. Different
ANNs have been developed around the goal of decreasing the false negative and false
positive detections and increasing the true positive detection rate for an optimum
outcome. The use of wavelets in ANNs, such as the biorthogonal spline wavelet ANN, the
particle swarm optimized wavelet NN and Gabor wavelet ANNs, has improved the specificity
and sensitivity obtained in microcalcification and mass detection. For biomedical
applications such as determining breast cancer malignancy, computer-aided detection
(CAD) frameworks utilize biomedical images, mostly because they have non-radiation
properties, low cost, high availability and fast results with high accuracy. For the
detection of breast cancer, an improved approach using ultrasound images has been
introduced lately, which works on 3-D ultrasound imaging and provides more insight into
the lesion compared with orthodox 2-D imaging [12]. A new hybrid method [13] has been
developed for the detection and classification of breast lesions by combining two ANN
techniques; it has been found that a two-phase hierarchical NN gives better results.
There are many different techniques for the detection and classification of breast
lesions that utilize breast cancer imaging, depending on the input attributes. The ANN
technique has been used widely for cancer cell investigation in medical examinations.
Many NN models have been utilized to enhance the detection and classification of breast
lesions; they are trained with previous cases that were diagnosed correctly by clinicians
[13, 14] or with mass characteristics such as size, shape, margins, granularity or signal
intensity. In 2012, multistate cellular neural networks (CNN) were used in biomedical
image segmentation to evaluate the fat content by estimating the density of breast
regions. A hybrid model was introduced for the identification of breast lesions from MR
images which mainly consists of SVM and PCNN [14]. Another hybrid algorithm was presented
by combining the perceptron with SIFT for the detection of breast lesions [15]. Different
clustering algorithms [16] were used for segmenting nuclei based on fine needle biopsy
microscopic images; topological, texture and morphological features were mostly used for
classifier training on 500 images from 50 patients, achieving an accuracy between 84
and 93%.
3 Proposed Methodology
The methodology employed processes the labeled sample data along two parallel legs, one
for identifying the dominant features and one for detecting infected cells. In the first
leg, the data is statistically evaluated to establish the dominant factors responsible
for lesion malignancy. The samples are processed on a scale of 1–10, with 10 representing
a higher weight; the attributes analyzed for dominance are uniformity in shape of the
cell, uniformity in size of the cell, marginal adhesion, clump thickness, bland
chromatin, bare nuclei, single epithelial cell size, normal nucleoli and mitoses. The
percentage contribution of each individual attribute is evaluated to assess the dominant
features impacting malignancy and the features that may indicate a benign condition of
the cell. Further, the correlation coefficient is estimated for the factors that are
dominant for malignant and benign cells and for the factors that are not dominant, and
the significance value is obtained to establish the potential of the dominant attributes.
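A hypothetical sketch (not the authors' code) of this statistical leg is given below, assuming the 699 samples sit in a pandas DataFrame with the nine microscopic attributes scaled 1-10 and a 'class' column; the column names are placeholders, and the percentage formula (column sum divided by the number of cases times the maximum score of 10) is inferred from the totals and percentages reported in Table 1.

```python
import pandas as pd

ATTRIBUTES = ["clump_thickness", "cell_size", "cell_shape", "adhesion",
              "epithelial_size", "bare_nuclei", "chromatin", "nucleoli", "mitoses"]

def percentage_contribution(df, attributes=ATTRIBUTES, label_col="class"):
    """Per-class percentage contribution of each attribute: the column sum divided
    by the maximum attainable sum (number of cases x 10), a formula inferred from
    the totals and percentages reported in Table 1."""
    out = {}
    for label, group in df.groupby(label_col):
        out[label] = (group[attributes].sum() / (len(group) * 10) * 100).round(2)
    return pd.DataFrame(out)

def attribute_correlations(df, attributes=ATTRIBUTES, label_col="class"):
    """Correlation of each attribute with malignancy (0 = benign, 1 = malignant),
    used to judge which attributes are dominant."""
    y = (df[label_col] == "malignant").astype(int)
    return df[attributes].corrwith(y)
```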
In the second leg, identification of infected cells is achieved through the ANN tool:
the data is processed with reference to the standard acceptable range of the microscopic
attributes to classify cells as malignant or benign. The technique is depicted as a flow
graph in Fig. 3. The data is first subjected to a pre-processing and cleaning step for
data restoration; the processed data is then fed to the ANN tool in two layers, the first
layer consisting of all nine microscopic attributes and the second layer consisting of
the data for the significant or dominant attributes only. The next phase involves
selection of the number of hidden layers and training through the conjugate gradient
backpropagation algorithm, followed by estimation of the classification rate, sensitivity
and specificity.

Fig. 3 Flow graph of the proposed methodology (percentage contribution of microscopic
features evaluated statistically; correlation coefficients estimated for probable dominant
parameters; selection of the number of hidden layers and training through the conjugate
gradient backpropagation algorithm; estimation of classification rate, sensitivity and
specificity)
4 Results
Validation of the results was carried out on the database obtained from the University
of California at Irvine (UCI). A total of 699 samples were available, each with nine
microscopic attributes: uniformity in shape of the cell, uniformity in size of the cell,
marginal adhesion, clump thickness, bland chromatin, bare nuclei, single epithelial cell
size, normal nucleoli and mitoses. These attributes are first normalized on a scale of
1–10, with 10 representing a higher weight. The individual normalized attributes were
summed, and the percentage contribution of each attribute was evaluated against the
total population of the sample, segregated separately for benign and malignant cases;
the data is represented as percentages in Table 1. From the table, it can be seen that
the percentage contributions of the attributes uniformity of cell size, single epithelial
cell size and bland chromatin for benign cases were higher compared to the other
attributes, so they can be regarded as the more dominant attributes, and a faster
analysis can result from close monitoring of these dominant attributes. Similarly, the
percentage contributions of the attributes uniformity of cell size, uniformity of cell
shape and bare nuclei for malignant cases were higher compared to the other attributes,
so they can be regarded as the more dominant attributes for malignancy. However, the
process has been tested only on a limited sample of 699 records, and percentage
differences of more than 5% have been accepted as a distinguishable difference;
considering that early detection can save lives, identifying the dominant factors can go
a long way, and further work on a larger sample is proposed as future work. The
individual attributes were further processed to estimate the correlation coefficient and
hence the significance value. The attributes that do not play a significant role in
characterizing a cell as benign or malignant are marginal adhesion, normal nucleoli and
mitoses, while the attributes playing a significant role in characterizing cells as
benign are uniformity of cell size, single epithelial cell size and bland chromatin. The
attributes responsible for a sample being malignant are uniformity of cell size,
uniformity of cell shape and bare nuclei. The degrees of freedom (df), sum of squares
(SS), mean square (MS), F value and significance F values have been estimated and are
shown in Table 2. The regression analysis showed that the significance value (F)
decreases from the non-dominant parameters to the dominant parameters as follows:
1.176E-146 for the microscopic attributes not playing a significant role, greater than
1.2E-191 for the microscopic attributes playing a significant role for the sample to be
benign, greater than 3.5E-244 for the microscopic attributes significant for the sample
to be malignant. Though these values are not very close to 0.5, a general trend
indicates that infected cases depend more on a certain set of microscopic attributes
than benign cases do; if a focused investigation is directed toward the significant
attributes, then early detection is possible.
Table 1 Percentage contribution of individual microscopic attributes (attribute order: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses)

Benign (458 cases): total numeric standing 1354, 607, 661, 625, 971, 624, 962, 591, 487; percentage 29.56, 26.46, 14.43, 13.65, 21.20, 13.62, 21.00, 12.90, 10.63

Malignant (241 cases): total numeric standing 1734, 1584, 1581, 1337, 1277, 1834, 1441, 1413, 624; percentage 71.95, 65.72, 65.33, 55.48, 52.99, 76.09, 59.79, 58.63, 25.89
The two layers obtained, the first being the normalized data for all nine microscopic
attributes and the second being the data for the dominant attributes only, are fed to
the artificial neural network tool. The ANN works on the principle of backpropagation:
as a first step it assigns approximate values to the weights, the second step is
feed-forward propagation, the third step is backpropagation of the errors, and the
fourth step updates the weight values.
The ANN analysis was first performed on the 699 records obtained from the University of
California at Irvine (UCI) using all nine microscopic attributes. To optimize training
performance, the ANN divides the data into training, validation and testing sets in the
ratio 7:1.5:1.5. The number of hidden neurons was set to ten, as a further increase in
neurons resulted in increased latency; training of the samples was achieved through the
conjugate gradient backpropagation algorithm. The best validation performance achieved
was 0.031 at epoch 16, as depicted in Fig. 4.
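The training setup can be approximated as in the sketch below. This is illustrative only, not the authors' configuration: scikit-learn has no conjugate-gradient solver, so lbfgs is used as a stand-in, and the file and column names for the UCI breast-cancer-wisconsin data are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Assumed local copy of the UCI breast-cancer-wisconsin data; '?' marks missing values
cols = ["id", "clump_thickness", "cell_size", "cell_shape", "adhesion",
        "epithelial_size", "bare_nuclei", "chromatin", "nucleoli", "mitoses", "class"]
df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?").dropna()
X = df[cols[1:-1]].values
y = (df["class"] == 4).astype(int)          # UCI coding: 2 = benign, 4 = malignant

# 70% training, 15% validation, 15% testing
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

# Ten hidden neurons; lbfgs approximates the paper's conjugate gradient backpropagation
clf = MLPClassifier(hidden_layer_sizes=(10,), solver="lbfgs", max_iter=500,
                    random_state=0).fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
print("test accuracy:", clf.score(X_test, y_test))
```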
Figure 5 displays the confusion matrix from which the predictive performance of the
model has been derived. The sensitivity, specificity and classification accuracy of the
proposed model on the nine microscopic attributes are 96.94%, 98.75% and 97.70%,
respectively, as evaluated from the values obtained from the confusion matrix. The
significant factors identified from the statistical analysis for malignant cases were
then taken as input and fed to the ANN tool; the best validation performance of 0.0529
was achieved at epoch 6, as depicted in Fig. 6, and the confusion matrix for this case
is shown in Fig. 7. The sensitivity, specificity and classification accuracy of the
proposed model on only the significant microscopic attributes are 96.70%, 96.72% and
96.68%, respectively, as shown in Table 3.
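Continuing the sketch above, sensitivity, specificity and accuracy can be derived from the confusion matrix counts in the same way as the values reported here (illustrative only; the paper's figures come from its own trained network).

```python
from sklearn.metrics import confusion_matrix

# Binary case: ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
sensitivity = tp / (tp + fn)                     # true positive rate
specificity = tn / (tn + fp)                     # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"sensitivity={sensitivity:.4f}, specificity={specificity:.4f}, accuracy={accuracy:.4f}")
```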
Fig. 4 Variation in MSE with respect to number of iteration for nine microscopic attributes
Fig. 6 Variation of MSE with respect to number of iterations for significant attributes
5 Conclusion
References
1. R.L. Siegel, K.D. Miller, A. Jemal, Cancer statistics: A cancer journal for clinicians (2015)
2. Breast Cancer India, [Link]
3. L. Hadjiiski, B. Sahiner, M.A. Helvie, et al., Breast masses: Computer-aided diagnosis with
serial mammograms. Radiology. 240(2), 343–356 (2006)
4. F.A. Cardillo, A. Starita, D. Caramella, A. Cillotti, A neural tool for breast cancer detection
and classification in MRI. in Proceedings of the 23rd Annual International Conference of the
IEEE Engineering in Medicine and Biology Society, vol. 3 (2011) pp. 2733–2736
5. T. Jayaraj, V. Sanjana, V.P. Darshini, A review on neural network and its implementation on
Breast cancer detection. IEEE (2016)
6. P.S. Pawar, D.R. Patil, Breast Cancer detection using neural network model. IEEE (2013)
7. A.A. Tzacheva, K. Najarian, J.P. Brockway, Breast cancer detection in gadolinium-enhanced
MRI images by static region descriptors and neural networks. J. Magn. Reson. Imaging 17(3),
337–342 (2003)
8. V.D. Khairnar, et al., Primary healthcare using artificial intelligence. in International conference
on innovative computing and communication (2018) pp. 243–251
9. L.N. Shulman, W. Willett, A. Sievers, F.M. Knaul, Breast cancer in developing countries:
Opportunities for improved survival. J. Oncol. (2010)
10. N.M. Lutimath, et al., Regression analysis for liver disease using R: A case study. in
International conference on innovative computing and communication (2018) pp. 421–429
11. A.E. Hassanien, N. El-Bendary, Breast cancer detection and classification using support vector
machines and pulse coupled neural network. in Advances in Intelligent Systems and Computing,
(Springer, Berlin, Germany, 2013), pp. 269–279
12. G. Ertas, D. Demirgunes, O. Erogul Conventional and multi-state cellular neural networks in
segmenting breast region from MR images: Performance comparison. in Proceedings of the
International Symposium on Innovations in Intelligent Systems and Applications (INISTA ’12)
(2014) pp. 1–4
13. J. Dheeba, N.A. Singh, S.T. Selvi, Computer-aided detection of breast cancer on mammograms:
A swarm intelligence optimized wavelet neural network approach. J. Biomed. Inform. 49, 45–52
(2014)
14. T. Balakumaran, I.L.A. Vennila, C.G. Shankar, Detection of micro-calcification in mammo-
grams using wavelet transform and fuzzy shell clustering. Int. J. Comput. Sci. Info. Technol.
7(1), 121–125 (2010)
15. A.M. ElNawasany, A.F. Ali, M.E. Waheed, A novel hybrid perceptron neural network algorithm
for classifying breast MRI tumors. in Proceedings of the International Conference on Advanced
Machine Learning Technologies and Applications (Cairo, Egypt, 2014) pp. 357–366
16. M. Kowal, P. Filipczuk, A. Obuchowicz, J. Korbicz, R. Monczak, Computer-aided diagnosis
of breast cancer based on fine needle biopsy microscopic images. Comput. Biol. Med. 43(10),
1563–1572 (2013)
Prediction of Tuberculosis Using
Supervised Learning Techniques Under
Pakistani Patients
1 Introduction
Data mining (DM) is a broad field that draws on other areas such as artificial
intelligence (AI), machine learning (ML), statistics, and databases for the exploration
of enormous quantities of data using well-defined procedures. DM is the process of
finding valuable material, data or patterns in large existing records.
M. Ali (B)
Department of Computer Science, The Superior College, Lahore, Pakistan
e-mail: muhammedbwn@[Link]
W. Arshad
Department of Computer Science, University of Lahore, Lahore, Pakistan
e-mail: [Link]@[Link]
2 Literature Review
way of the culture test within a short time span. Notwithstanding, the expense of the
GeneXpert test is ten times higher than the culture test and likewise twenty times
higher than smear microscopy, making it unaffordable for the people of Pakistan, given
the large population and low financial status. The authors of [9] used six algorithms,
i.e., SVM, decision tree (DT), logistic regression, artificial neural network (ANN),
radial basis function and Bayesian network, for building models, and their investigation
shows that the decision tree C4.5 provides the best outcomes with an accuracy level of
74.21%. 6450 patient records were used in that research, with the features: age,
nationality, sex, area of residence, weight, current stay in prison, TB type, length,
case type, diabetes, treatment category, recent TB infection, low body weight, HIV,
imprisonment, and drug resistance. Hussainy et al. [10] used a decision tree algorithm
for forecasting TB; antigens were used for quantifying the microbes in the blood using a
multiplex microbead immunoassay (MMIA), which stores different levels of antigens, and
MFI is used to measure the quantity of antigens against the bacteria. Gerg and Rupal
[11] proposed a new framework: in their research, they initially made two groups using
the K-means method and thereafter applied principal component analysis (PCA) to both
groups for feature extraction. After feature extraction, feature optimization of these
groups was done using a genetic algorithm (GA), and the classifier was built using a
neural network (NN) algorithm. They suggested that better optimized features could be
expected if independent component analysis (ICA) were applied in place of PCA. Asha et
al. [12] proposed a hybrid method for recognition of TB. They utilized 700 patients'
records gathered from a state hospital and treated the raw data in a hybrid manner by
clustering it into two groups, i.e., pulmonary tuberculosis and retroviral pulmonary
tuberculosis. The proposed model provides 98.7% accuracy using the support vector
machine among C4.5, K-NN, AdaBoost, Naïve Bayes, Bagging, and Random Forest [13]. The
ANFIS algorithm was used for constructing a classifier with 503 instances having 30
features, and achieved 97% accuracy. Rusdah et al. [14] utilized an ensemble technique
with two algorithms, i.e., SVM and Bagging, and obtained 70.41% accuracy using the
mentioned attributes; compared with a single classifier, the ensemble technique produced
better results. Tracey et al. [15] and Moedomo et al. [16] used laboratory-free
approaches, i.e., sound and images, as the key attributes for the identification of TB.
Kalhori and Zheng [17] showed that logistic regression and SVM yield the best results
among six classifiers. The authors of [18] surveyed papers from 2007 to 2013 and
concluded that SVM is one of the best algorithms among them.
3 Methodology
In this research, data was obtained from a local hospital, Gulab Davi Hospital [19],
after getting the necessary approvals from the experts of the medical clinic. We
converted the raw medical records from hard copy into MS Excel. The dataset contains
nine attributes, selected for their importance and significance in detecting tuberculosis.
Descriptions and details of the selected features and their types are stated in Table 1.
The data mining tool WEKA was developed at the University of Waikato [20] in New Zealand
using the Java language. It is a state-of-the-art tool for performing different data
mining tasks, i.e., data pre-processing, regression, clustering, classification, and
association rule mining, which can be applied directly to a dataset. WEKA also provides
a variety of visualization tools [21]. In WEKA, the ARFF and CSV file formats are most commonly used.
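The classifiers discussed in this paper can equally be trained outside WEKA. The following is a minimal sketch in Python (scikit-learn) using hypothetical file and column names ("tb_dataset.csv", "tb_status"), with a decision tree standing in for C4.5/C5.0; it is illustrative only, not the authors' WEKA workflow.

```python
# Hedged sketch: train the four classifier families named in the paper on a CSV
# export of the TB records and report 10-fold cross-validated accuracy.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier          # stands in for C4.5/C5.0
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("tb_dataset.csv")                       # hypothetical file name
X = pd.get_dummies(df.drop(columns=["tb_status"]))       # encode categorical attributes
y = df["tb_status"]                                      # hypothetical class label

models = {
    "Decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "SVM": SVC(kernel="rbf"),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```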
4 Results
Table 2 reports the outcomes when the missing values are handled by replacing them
with the mean/mode, while Table 3 reports the outcomes after discarding the records
containing missing values. The results show that replacing the missing values with
the mean yielded no improvement over simply eliminating the affected records, except
for the Naïve Bayes classifier, which produced slightly better results with imputation.
The likely reason is that the proportion of affected instances is very small, only 32
records out of 597 (5.3%) with missing values in two attributes, and the mean method
(replacing a missing value with the attribute mean) would presumably not be very close
to the real values in the dataset. Figure 1 shows the decision tree (DT) learned from
the data when the dataset contains no missing values.
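A minimal sketch, with placeholder file and column names, of the two missing-value strategies compared here: imputing with the mean/mode versus dropping the affected records.

```python
# (a) impute missing values with the column mean (numeric) or mode (categorical)
# (b) eliminate every record that contains any missing value
import pandas as pd

df = pd.read_csv("tb_dataset.csv")        # hypothetical file

imputed = df.copy()
for col in imputed.columns:
    if imputed[col].dtype.kind in "if":   # int/float column -> mean
        imputed[col] = imputed[col].fillna(imputed[col].mean())
    else:                                 # categorical column -> mode
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

dropped = df.dropna()

print(len(df), len(dropped))              # e.g. 597 -> 565 if 32 records had gaps
```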
5 Conclusion
The fundamental aim of this study is to contribute a new dataset to the community
working on one of the most alarming diseases, particularly in a society where the
infection poses a genuine risk. It also shows how this dataset can be used to predict
tuberculosis (TB), which may support healthcare specialists in deciding whether to
begin treatment of a suspected patient or wait for the medical reports. The dataset
consists of 565 patients, each described by nine features. In general, C5.0 outperformed
the other three algorithms, i.e., SVM, logistic regression, and Naïve Bayes. We believe
the results will add further insight into forecasting tuberculosis status. Our proposed
work can be extended by applying other data mining techniques, and the results may be
improved using other data cleaning strategies.
Acknowledgements Since the investigation used secondary data, we did not have direct
contact with the patients. The study is dedicated purely to academic purposes. Ethical
clearance was obtained from the hospital, and this investigation is solely for the public benefit.
References
1. K.R. Lakshmi, M. Veera Krishna, S.P. Kumar, Utilization of data mining techniques for pre-
diction and diagnosis of tuberculosis disease survivability. I. J. Mod. Educ. Comput. Sci. 8,
8–17 (2013)
2. M. Yadav, S. Jain, K.R. Seeja, Prediction of air quality using time series data mining. in S.
Bhattacharyya, A. Hassanien, D. Gupta, A. Khanna, I Pan (eds) International Conference on
Innovative Computing and Communications. Lecture Notes in Networks and Systems, vol. 56.
(Springer, Singapore, 2019)
3. V. Dubey, P. Kumar, N. Chauhan, Forest fire detection system using IoT and artificial neural
network. in S. Bhattacharyya, A. Hassanien, D. Gupta, A. Khanna, I. Pan (eds) International
Conference on Innovative Computing and Communications. Lecture Notes in Networks and
Systems, vol. 55. (Springer, Singapore, 2019)
4. M. Ali, A.U. Rehman, H. Shamaz, Prediction of churning behavior of customers in tele-
com sector using supervised learning techniques. 143–147. [Link]
8586836
5. M.F. bin Othman, T.M.S. Yau, Comparison of different classification techniques using WEKA
for breast cancer. IFMBE Proc. 15, 520–523 (2007)
6. A.U. Rehman, A. Aziz, Detection of cardiac disease using data mining classification techniques.
(IJASCA) 8(07) (2017)
7. M. Saeed, S. Iram, S. Hussain, A. Ahmed, M. Akbar, M. Aslam, GeneXpert: A new tool for
the rapid detection of rifampicin resistance in Mycobacterium tuberculosis. J. Pak. Med. Asso.
67(2) (2017)
8. D. Nagabushanam, N. Naresh, A. Raghunath, K. Parveen Kumar, Prediction of tuberculosis
using data mining techniques on Indian Patient’s Data. IJCST 4(4) (2013)
9. B.C. Lakshmanan, V. Srinivasan, C. Ponnuraja, Data mining with decision tree to evaluate the
pattern on effectiveness of treatment for pulmonary tuberculosis: A clustering and classification
techniques. (SCIRJ) (2015)
10. S.F. Hussainy, S. Ahmad Raza, A. Khaliq, M.A. Zafar, Decision tree-inspired classification
algorithm for early detection of tuberculosis (TB). in Thirty Seventh International Conference
on Information Systems (Dublin 2016)
11. S. Gerg, N. Rupal, A data mining approach to detect tuberculosis using clustering and GA-NN
techniques. IJSR (2015)
12. T. Asha, S. Natarajan, K.N.B. Murthy, A data mining approach to the diagnosis of tuberculosis
by cascading clustering and classification. CoRR. abs/1108.1045 (2011)
13. S. Kalhori, X. Zeng, Improvement the accuracy of six applied classification algorithms through
integrated supervised and unsupervised learning approach. J. Comput. Commun. 2, 201–209
(2014). [Link]
14. Rusdah, E. Winarko, R. Wardoyo, Preliminary diagnosis of pulmonary tuberculosis using
ensemble method. ICDSE. 175–180 (2015)
15. B.H. Tracey, G. Comina, S. Larson, M. Bravard, J.W. López, R.H. Gilman, Cough detection
algorithm for monitoring patient recovery from pulmonary tuberculosis. IEMBS (2011)
16. R. Moedomo, M. Ahmad, B. Alisjahbana, T. Djatmiko, The lung diseases diagnosis software:
Influenza and tuberculosis case studies in the cloud computing environment. ICCCSN (2012)
17. S.R.N. Kalhori, X.-J. Zeng, Improvement the accuracy of six applied classification algorithms
through integrated supervised and unsupervised learning approach. J. Comput. Commun. 201–
209 (2014)
18. E. W. Rusdah, Review on data mining methods for tuberculosis diagnosis. ISICO (2013)
19. Gulab Davi Hospital: [Link]
20. WEKA: [Link]
21. J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd edn. (2012)
Automatic Retail Invoicing
and Recommendations
Abstract The goal of this research paper is to enable today's computing systems
to sense the presence of users at a place and their current state. The work further
exploits users' current context information to help people obtain context-aware
services matched to their preferences and current needs, such as details of currently
discounted products they wish to buy and personal recommendations such as a movie,
trending clothes in the market, or significant particulars about product purchases in
a warehouse. The warehouse server suggests particular services by studying the user
profile and analyzing user behavior and purchase history. Using Zadeh's fuzzification
equation in the process, it was found that extra large clothes should be recommended
for a person, with a probability of 78% in all cases except when the person weighs
less than 76.5 kg (medium clothes preferred) or is younger than 11 years (small
clothes preferred in this case).
1 Introduction
N. Garg (B)
Department of Computer Science Engineering, MAIT, Delhi, India
e-mail: neeraj_garg20032003@[Link]
S. K. Dhurandher
Department of Information Technology, NSUT, Delhi, India
e-mail: dhurandher@[Link]
clothing, for example, by the warehouse server. The server may calculate the cur-
rent location and current section of the user in the warehouse by using distance
triangulation methods (GPS) or using network-centric (Bluetooth, Zigbee, WLAN)
methods. The user context depends on many different attributes, like location, move-
ment, availability, schedule plans, activity, and currently used services. Location
is an essential part of the context as the user’s preferences and skills, his physical
environment (e.g., weather and temperature), social relationship details (e.g., who,
when, and with whom) all affect his next movement or behavior. Researchers today are
also analyzing different context situations and locations of users based on their
behavior. Here, the selling company may provide customers with offers and discounts
based on their shopping behavior and actual transaction data (Amazon, Google, etc.).
Forgetting means that you were supposed to do some task but did not do it because of
some complex new situation; this can be handled by a pervasive memory reminder
application. The list of everything an individual user expects will be taken care of
by strong pervasive applications. Obviously, remembering to do something at some time
takes a person's concentration, time, and energy, resulting in wasted office hours.
In short, location prediction, movement prediction, action prediction, and daily
routine prediction all belong to context prediction techniques, and all of them must
be analyzed properly to give recommendations to a moving user.
2 Related Work
Many researchers are working on predicting human location and then suggesting
recommendations based on users' present locations and their buying behavior history.
Mobile guides are digital guides to the user's surroundings that help with mobile
search, 'you-are-here' maps, and tour guides for tourism and recreational purposes.
Navigation systems (e.g., car and pedestrian navigation systems) assist in way-finding
tasks in unfamiliar environments. Existing work [1] utilized user-generated content to
make recommendations for a particular location. Similarly, some papers [2] discussed
healthcare (exercise, fitness monitoring) applications. Apart from location-based
services (LBS), there is also good research on analyzing user habits and patterns and
then making recommendations. Yu-Cheng et al. analyzed user habits to
propose community recommendations [3] and categorized users with similar habits.
Personalized DTV Program Recommendation (PDPR) system analyzed and used the
viewing pattern of consumers to personalize the program recommendations and to
use computing resources [4] efficiently. Efstratiou et al. detected social interactions
using Bluetooth data [5] by the real deployment of a system that involves location
tracking, conversation monitoring, and interaction with physical objects. Adams
et al. [6] used both technologies, Bluetooth and GPS to model user behavior through
proximity concerning visited places. Singla et al. [7] used a hidden Markov model
approach [8] for recognizing activities in a single smart home environment through
an observed sequence of sensor readings. Lum et al. [9] made an adaptation based on
a user’s context as Bayesian networks are used to predict the prominent activities of
users. Rao et al. [10] introduced view-invariant dynamic time warping for analyzing
activities with trajectories. Zelniker et al. [11] created global trajectories by tracking
people across different cameras and detected abnormal activities if the current global
trajectory deviates from the normal paths. Huýnh et al. [12] adopted a topic model
to predict users’ daily routines (such as office work, commuting, or lunch routine)
from activity patterns, and the result was broken into domains of likes-dislikes of
users. Alireza et al. [13] recorded the type and arrival timestamp of each activity
relating to the mobile user. The work in [14] observed notification-related behavior
under anticipated conditions and also used the application to collect subjective
feedback to gather information about notifications. User monitoring [12] determined
the user's position and could send warnings to the user in the case of a problem.
This task of activity recognition is to understand movement behavior from trajectories,
taking the customer's points of interest (POIs) into account. The activities may be
classified as stationary or non-stationary. For example, office work is a stationary
activity, whereas marketing, shopping, and moving for sales promotion are non-stationary activities.
Mathematical techniques can identify different activities as C1, C2, C3, etc. Once
activities are identified as stationary or moving, then there comes a requirement of
adaptation. So to give services and relevant content as per location, LBS determines
where the user is right now. So ubiquitous positioning provides an accurate estimate
of a user’s or an object’s location at all times. One example of activity recognition is
social context inference [15], i.e., deriving social relations based on the user’s daily
communication with other people. Here, if a person goes daily by someplace and
does some routine work, then the system recommends him the same work next day
at the same time if he forgets or is slightly late in reaching the same place of work.
This raises some questions that need to be addressed before a solution can be designed.
Fig. 1 Courtesy’ movement prediction framework [16] (different types of datasets, patterns, and
trajectories)
LBS applications and their ubiquitous services, thanks to the location-sensing devices
attached to them, deal with location-based tracking data and even social data in order
to gather geographic information and give insights into the common and distinctive
movement behavior of people in different environments. This means that, to support
efficient data analysis and further knowledge discovery, there is a continuous need to
model and store LBS-generated data. Conventional geospatial data representation models
cannot capture all the important features of large LBS-generated data and the
relationships among them. There is a strong need for the system to be adaptable to
current technology, which will be possible through a similarity matching technique that
returns the most closely matched resource at that time. Through its data models, the
system should determine whether the user is walking, running, or sitting still by
monitoring different sensors (gyroscope, accelerometer, proximity, GPS). The user's body
dimensions such as height, weight, and facial appearance should be noted and measured
carefully so that personal characteristics (parameters) can be assessed, moving toward a
complete information retrieval system that helps psychiatrists and human behavior experts.
In a shopping scenario, the consumer may prefer certain clothing brands by fashion,
color, size, etc., as qualities of good clothes valued all over the world. The age,
preference, height, and weight information is fed into the personal profile of the
user and kept as an information document in a file folder on a smartphone. The
objective here is to identify features that allow reliable, person-independent
inference of motion-related activities; a dataset is used to train the classifier.
If a set of information such as demographic data or user information is available as
input, an adaptive music system can, for example, play music according to age, gender,
region (country), music information (timestamp, artist, song, title), and spatial
context (time, date, location, weather, city), but tuning to the perfect song from
these input parameters is a challenging task.
Some of the existing frameworks on which authors have worked in previous years can be
used to implement this product recommendation work. One earlier defined model is the
efficient mobility prediction model (MPM) [17]. These models are required to locate a
moving user in the warehouse; user mobility models are applied in the warehouse through
these frameworks [17].
Broadly, recommender systems can be categorized as collaborative, which recommend items
liked in the past by other users with similar tastes; content-based, which recommend
items that resemble the ones the user preferred in the past; and hybrid, which combine
the characteristics of both content-based and collaborative techniques. Personalized
recommender systems, in our case for clothes size determination for people of different
ages and weights (male, female, kid), offer a feasible solution when users want to
ensure that proper content is delivered to them as the number of choices increases. A
simple prediction function relates P_{a,j} to (R_{i,j} − R̄_i), where P_{a,j} denotes the
prediction for user 'a' on item 'j', R_{i,j} denotes the rating of user 'i' on item 'j',
and R̄_i denotes the mean rating of user 'i' over all items (not specific to item j);
here j is a clothing item.
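The text only sketches the mean-centred term (R_{i,j} − R̄_i). Below is a minimal, hypothetical completion using cosine-weighted user-based collaborative filtering; it is not the authors' exact prediction function, and the toy rating matrix is invented for illustration.

```python
import numpy as np

# Toy rating matrix: rows = users, columns = items (e.g. clothes); 0 = unrated.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
], dtype=float)

def predict(R, a, j):
    """Predict user a's rating of item j from other users' mean-centred ratings."""
    mask = R > 0
    means = np.where(mask.sum(axis=1) > 0,
                     R.sum(axis=1) / np.maximum(mask.sum(axis=1), 1), 0.0)
    num, den = 0.0, 0.0
    for i in range(R.shape[0]):
        if i == a or not mask[i, j]:
            continue
        common = mask[a] & mask[i]            # items rated by both users
        if not common.any():
            continue
        # cosine similarity acts as the weight w(a, i)
        w = np.dot(R[a, common], R[i, common]) / (
            np.linalg.norm(R[a, common]) * np.linalg.norm(R[i, common]))
        num += w * (R[i, j] - means[i])       # the (R_ij - R̄_i) term from the text
        den += abs(w)
    return means[a] + num / den if den else means[a]

print(round(predict(R, a=1, j=1), 2))
```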
The system takes the body dimensions (age, height, weight, and physical appearance)
from the user profile of a moving person, and these are used in the decision taken by
the server in the warehouse regarding the selection of clothes. For this, some
membership functions are defined. These membership functions are built on probability
values based on height, age, and weight, and they vary according to the type of person
considered, such as teenager, kid, or adult.
Recommendation systems [3] require some body parameters and some records from users in
advance; feeding in those values makes a closed system. The main attributes of the human
body are represented as linguistic variables such as height, age, facial expression,
weight, etc. Predicting clothes by age has the appeal that if a person (kid or adult) is
tall or heavy, the system should be able to recommend extra large garments to that person
based on fuzzy membership values. For that, fuzzy membership functions for age, weight,
and height need to be designed. For example, the following parameters may be measured:
1. Height (150–180 cm) could take labeled values: small, average, tall, very tall,
extremely tall.
2. Weight (50–90 kg) could take values: lean thin, slim, fatty, obese, etc.
3. Age (03–24) could take values: kid, teenager, young adult, complete adult.
Like, if a person’s exact age is not known but still the age of a person needs to be
predicted. As if a person’s age seems to be approximately 8–10 years, he is called a
kid with a membership degree 0.8 (probability), a teenager with a degree of 0.3, etc.,
seen from Table 1. A fuzzy membership function is defined for sets of kids, teenager,
young adult, and complete adult. Set A (03–09) kids; Set B (10–13) teenager; Set C
(14–18) young adult; Set D (19–24) complete adult. That explains that if a person
seems to be of age 10 years by looking and estimating, people will call that person
as a kid with a probability of 0.8 and complete adult with a probability of 0.001.
Similarly, membership functions for different weight and different height (kids and
adults) are defined and kept them in different categories kids and adults (Tables 2
and 3) from a medical point of view. The first column of table denotes age groups,
second column of table denotes body structure type, and the remaining four columns
denote the probability of having particular weights/age in kg/years. The experiments
are not conducted for kids, in this work.
Table 1 Set A—age group and type of person with corresponding probability values
(each cell gives the membership probability that a person of that age belongs to the row's type)

Age group | Type of person  | Age 10 | Age 13 | Age 18 | Age 24
03–09     | Kids            | 0.8    | 0.3    | 0.1    | 0.01
10–13     | Teenager        | 0.3    | 0.75   | 0.5    | 0.1
14–18     | Young adult     | 0.1    | 0.2    | 0.5    | 0.4
19–24     | Complete adult  | 0.001  | 0.8    | 0.6    | 0.95
Some real data is taken randomly and fed in CSV format as data files. Once the data is
in Excel or CSV format, it can be used by a machine learning technique in Python, or by
the GUI-based WEKA software, to extract and use the fields mentioned.
A_i * A_j = A_i                          if A_i < A_j     (1)
A_i * A_j = A_j                          if A_j ≤ A_i     (2)
A_i * A_j = min(1, 1 − A_i + A_j)                         (3)
The star (*) acts as an operator that resolves a value between its left and right
operands in the fuzzy relationship given in Table 6. Comparing this against the measured
distribution of the teenager set from the standard membership distribution curves, the
fuzzy membership distribution of a person being of average height can be estimated by
post-multiplying the relational matrix by the teenager vector (Table 6).
Similarly, using age and weight as parameters, a person's weight can be predicted
according to the membership functions.
Taking a Cartesian product between age and weight gives µ_R(age, weight), with age
groups a = 8–10, b = 10–13, c = 14–18, d = 19–24 (years) and weight groups A = 30–52,
B = 53–63, C = 64–74, D = above 75 kg (Table 7).
Measuring the distribution of the young adult set from the given membership distribution
curves, the fuzzy membership distribution of a person being of slim weight can be
estimated by post-multiplying the relational matrix by the young adult vector, as shown
below. That is,
µ_slim_weight(weight = 54–63) = µ_R(age, weight) ∘ µ_young_adult(age = 14–18),
where ∘ denotes the fuzzy composition operator (combining fuzzy AND and OR), which yields
µ_slim_weight(54–63) = (A/0.1, B/0.2, C/0.4, D/0.4).
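As an illustration of the post-multiplication step described above, the sketch below applies fuzzy max-min composition of a relational matrix µ_R(age, weight) with a young-adult age-membership vector. The matrix values are placeholders, not the entries of Table 7.

```python
import numpy as np

# mu_R(age, weight): rows = age groups a..d, columns = weight groups A..D
# (illustrative placeholder values)
mu_R = np.array([
    [0.8, 0.3, 0.1, 0.0],   # a: 8-10 years
    [0.4, 0.6, 0.2, 0.1],   # b: 10-13 years
    [0.1, 0.4, 0.7, 0.4],   # c: 14-18 years
    [0.0, 0.2, 0.5, 0.9],   # d: 19-24 years
])

# membership of each age group in the fuzzy set "young adult"
mu_young_adult = np.array([0.1, 0.2, 0.5, 0.4])

def max_min_composition(vec, rel):
    """mu_out[k] = max over groups g of min(vec[g], rel[g, k]) (fuzzy AND then OR)."""
    return np.max(np.minimum(vec[:, None], rel), axis=0)

mu_weight = max_min_composition(mu_young_adult, mu_R)
print(dict(zip("ABCD", mu_weight.round(2))))
```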
Pervasive computing applications work best when combining software and hardware. As a
contribution, in the future a warehouse server enabled through GPS (or indoor positioning
that works better inside) will be able to calculate the current location and current
section of the user in the warehouse by using distance triangulation or network-centric
methods. There is a strong need to build a user profile, which involves (i) inferring the
profile from user actions (implicit signals such as buying history, clicking on the web,
etc.), and (ii) inferring the profile from explicit user ratings, which includes feedback
techniques such as filling out forms. Both implicit actions and explicit ratings are
processed to build the content profile. Most content-based recommender systems merge the
ratings and user actions into profile information about the user's actions and
preferences in order to infer keywords and attributes and build the user profile.
Sensing the user's activity context using software sensors in a context-aware environment
will be the main objective of this research in the future. Even after the proposed system
achieves the set aims, open questions remain about how long context data needs to be
retained; the proposed system will follow future trends and standards and provide new
ways of thinking about activity and context sensing.
Fig. 4 a Ridor method to recommend middle clothes. b Ridor method to recommend extra large
clothes. c Random method to recommend differently aged kid’s clothes
References
Modeling Open Data Usage: Decision Tree Approach
Barbara Šlibar
1 Introduction
Measuring and comparing the quality of open data is not an easy task, since multiple
quality dimensions should be taken into account. The poor quality of such data can
negatively influence the decisions made by its users. Accordingly, the influence of
metadata on usage, and thereby on quality, was investigated within this research.
The main objective of this research is to investigate the performance of an open data
usage prediction model using a regression tree-based approach. In order to achieve it,
the objective is decomposed into the following two sub-objectives: (1) to determine the
important metadata of open datasets, and (2) to build a model for predicting the number
of downloads of open datasets based on the identified metadata.
B. Šlibar (B)
Faculty of Organization and Informatics, University of Zagreb, Pavlinska 2, 42000 Varaždin,
Croatia
e-mail: [Link]@[Link]
The paper is therefore structured as follows: firstly, previous research related to the
quality of open datasets is pointed out; secondly, the research methodology is presented
together with a detailed description of the data used and the applied classification
mining method; and thirdly, the results of the conducted research are shown. Finally,
the findings of the paper are given, as well as recommendations for further research.
2 Literature Review
Nowadays, data is published through open data portals, which are considered to be
cataloged. These catalogs are comparable to digital libraries, where metadata and its
quality play a key role [1, 2]. Since metadata quality is directly related to the value
of digital libraries, it has a direct impact on the objects contained in these libraries
[1]. Datasets are part of open data portals, which aggregate resources and the metadata
about these resources [2]. Since the number of such resources is increasing, there is
growing concern about the quality of the resources and the accompanying metadata. Hence,
Neumaier et al. (2016) introduced automated quality assessment of metadata based on
mapping certain metadata of CKAN, Socrata, and OpenDataSoft to the metadata standard
Data Catalog Vocabulary (DCAT) and defining quality dimensions and metrics for the
metadata keys in the DCAT specification [2]. Reiche and Höfig (2013) proposed,
implemented, and applied metadata quality metrics (completeness, weighted completeness,
accuracy, richness of information, and accessibility) to three open data portals [3].
They pointed out that metadata of resources such as name, URL, description, format, etc.
represent core metadata [3]. In order to empower end users to assess open data portals,
Kubler et al. (2018) developed the Open Data Portal Quality (ODPQ) framework. This
framework included some metadata, since metadata can be helpful for evaluating various
aspects of quality dimensions [4].
The aforementioned research is focused more on quality assessment of portals than on
evaluation of the quality of the datasets published on them. Therefore, research
directed more at the level of the dataset is presented below.
Kučera et al. (2013) investigated the quality of catalog records, which represent
datasets published on an open data portal, through metadata, and proposed techniques for
their improvement [5]. Metadata, interaction mechanisms, and data quality indicators
were used by Zuiderwijk et al. (2016) with the intention of improving the usage of Open
Government Data (OGD); some of the metadata embodied in the prototype were title,
description, URL, publisher, views, date of publishing, etc. [6]. Since software such as
Socrata or CKAN is not capable of automatically covering the quality metrics due to its
technical limitations, Matamoros et al. (2018) made a proposal for measuring the quality
of open datasets [7]. Altogether, 17 quality metrics were proposed that covered metadata,
content, and structure [7]. Previous research into the assessment of the quality of open
data portals or datasets has included either subjective metrics or methods that require
the inclusion of real people. For those reasons, the research conducted within this
paper excludes this subjectivity by evaluating the metadata using the regression tree method.
3 Research Methodology
Since the regression tree method is widely used for similar problems, in which
researchers investigate the prediction of certain factors with respect to other important
factors, it was applied within this research [8–10]. The design of the research can be
described through four phases: the first phase concerns gathering the data; the second is
devoted to data preprocessing, which includes data cleaning and data transformation; the
third phase involves applying the chosen classification mining method; and the fourth
phase covers evaluation and interpretation of the obtained results [11]. Therefore, this
section of the paper describes in detail the data used as well as the applied method.
The data were gathered from the open data portal [Link] by a Java Web application.
The reason why this portal was chosen among many similar portals lies in the fact that it
is ranked as the best, or as one of the best, open data portals [12–14]. The gathered
data contain metadata for the datasets published on [Link]. Based on the statistics
of dataset usage found on the portal, the Web application could collect all the needed
data [15]. The application loaded these statistics as a .csv file and then issued an API
call [Link] {id} for every dataset identifier found in this .csv file. Almost half,
more precisely 5 attributes out of a total of 11, were converted according to the
following rules:
• Title—if the data is retrieved, it is labeled TRUE; otherwise it is labeled FALSE;
• Description/Note—if the data is retrieved, it is labeled TRUE; otherwise it is labeled FALSE;
• License—if the license information contains the keyword “open”, then the license
openness is labeled TRUE, otherwise FALSE;
• Dataset URL—if the data is retrieved, it is labeled TRUE; otherwise it is labeled FALSE;
• Machine-readable format score—the original data is expressed numerically, so it is
converted into textual values as follows: 0 == “BAD”, 1 == “NOT OK”, 2 == “OK”,
3 == “GOOD”, 4 == “VERY GOOD”, 5 == “EXCELLENT”.
The prepared dataset contains 1049 observations and was last updated on 28th August 2018.
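A minimal sketch of the conversion rules listed above, assuming hypothetical field names for the raw API response; the actual [Link] API keys may differ.

```python
# Map one raw metadata record to the boolean/label attributes used for modeling.
FORMAT_SCORE_LABELS = {0: "BAD", 1: "NOT OK", 2: "OK",
                       3: "GOOD", 4: "VERY GOOD", 5: "EXCELLENT"}

def convert_record(raw: dict) -> dict:
    license_text = (raw.get("license") or "").lower()
    return {
        "title": raw.get("title") is not None,            # TRUE if retrieved
        "description": raw.get("notes") is not None,
        "license_open": "open" in license_text,           # keyword rule
        "dataset_url": raw.get("url") is not None,
        "format_score": FORMAT_SCORE_LABELS.get(raw.get("format_score"), "BAD"),
    }

example = {"title": "Consumer Complaint Database", "notes": None,
           "license": "Open Data Commons", "url": "https://example.org/ds/1",
           "format_score": 3}
print(convert_record(example))
```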
According to Shmueli et al. (2017), classification and regression trees are among the
most transparent and easiest-to-interpret data-driven methods [16–18]. These methods can
be used for constructing prediction models from data. The models are obtained by
dividing, or rather splitting, the observations into subgroups based on the predictors.
Homogeneity of the subgroups is very important: in order to create useful prediction or
classification rules, the terminal subgroups should be as homogeneous as possible in
terms of the target attribute [9–16].
The biggest difference between a regression and a classification tree is the type of the
target attribute. While the target attribute of a classification tree is categorical, the
target attribute of a regression tree is continuous. Other differences between them lie
in the details of the prediction, the impurity measures, and the evaluation of
performance [16–19]. Considering prediction in a regression tree, the value of a terminal
subgroup is defined by the average outcome value of the training observations contained
in that subgroup. A common impurity measure in regression trees is the sum of squared
deviations from the mean of the terminal subgroups. The usual measures for evaluating the
predictive performance of a regression tree include the root mean square error (RMSE)
[16–19]. Apart from that, the two tree types operate in much the same way.
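For illustration, a minimal regression-tree sketch in Python (scikit-learn) that mirrors the setup described below: an 80/20 train/validation split, a tree limited to seven leaves (six splits), and R-square/RMSE as fit statistics. The paper's own analysis was done in JMP, and the file and column names here are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv("datagov_metadata.csv")                 # hypothetical export
X = pd.get_dummies(data.drop(columns=["number_of_downloads"]))
y = data["number_of_downloads"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# limiting the number of leaves roughly mirrors stopping after six splits
tree = DecisionTreeRegressor(max_leaf_nodes=7, random_state=0).fit(X_train, y_train)

for name, Xs, ys in [("training", X_train, y_train), ("validation", X_val, y_val)]:
    pred = tree.predict(Xs)
    rmse = mean_squared_error(ys, pred) ** 0.5
    print(f"{name}: R2={r2_score(ys, pred):.3f}, RMSE={rmse:.3f}")
```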
4 Research Results
Fig. 1 R-square values for training set (upper curve) and validation set (bottom curve) according
to the number of splits
The prepared dataset is divided into a training set (80% of observations) and a
validation set (20% of observations). Even though the training set is indispensable for
building a regression tree, the validation set is also very important, since it validates
the predictive ability of the model.
Automatic splitting was used for growing the tree, and it resulted in six splits. It
works in such a manner that the splitting process continues until the R-square of the
validation set is better than what the following ten splits would gain [5]. Figure 1
visualizes this process.
There are different graphs for goodness of fit, and they depend on the type of the target
attribute. Since the target attribute Number of downloads is continuous, the graph used
for displaying how well the model fits the data is the Actual by Predicted plot. The
predicted means of the leaves are located on the x-axis, and the actual values scattered
around the means are located on the y-axis. For regression trees, where the target
variable is continuous, the mean represents the average response of all observations in
the observed branch. The place where predicted and actual values are the same is shown
with a vertical line [19]. Since there are six splits, the built regression tree has
seven leaves for the training and for the validation set (Fig. 2), so there are seven
distinct predicted values.
The fit statistics for the prepared dataset summarize the ability of the model to predict
Number of downloads (Table 1). They show the R-square, RMSE, number of rows, and number
of splits for the training as well as the validation set. It is common for a model to
predict the dataset used to create it better than the validation set. The RMSE likewise
indicates that the built model performs better on the training data than on the
validation set.
The predictors that contribute the most to the model are Publisher, Machine-readable
format score, and Number of views (Table 2).
If the leaves, or terminal subgroups, are observed, the leaf with the highest mean is the
one where an observation has any value of the Machine-readable format score and the
Number of views is equal to or greater than 49,524.
Fig. 2 Actual by predicted plot for training set (left graph) and for validation set (right graph)
Table 1 Fit statistics of the model for predicting the number of downloads of the open datasets based on identified metadata

           | R-square | RMSE      | No. of rows | No. of splits
Training   | 0.878    | 0.2799181 | 842         | 6
Validation | 0.811    | 0.3653379 | 207         |
5 Conclusion
The modeling of open data usage with the data-driven regression tree method was the focus
of this research. Before building the predictive model, the indicators of open data
quality had to be known; therefore, the metadata were chosen based on the literature
review. The results show that the accuracy of the model based on core metadata is high
according to the method's measures, and the built model also predicts well on the
validation set. Only three predictors out of nine contribute to the model. The reason is
that the applied method was a regression tree, which operates in such a way. If the
number of splits had been larger, other predictors would surely have had an impact on the
model, but it would have been very small.
There are a few recommendations for future research. One recommendation is to apply
another classification mining method, such as Boosted Tree or Bootstrap Forest, to the
existing data in order to examine in depth the impact of the other predictors, which did
not show relevance to the target value Number of downloads. Second, more metadata should
be included in the model.
References
1. A. Tani, L. Candela, D. Castelli, Dealing with metadata quality: the legacy of digital library
efforts. Inf. Process. Manage. 49(6), 1194–1205 (2013)
2. S. Neumaier, J. Umbrich, A. Polleres, Automated quality assessment of metadata across open
data portals. J. Data Inf Qual 8(1), 1–29 (2016)
3. K.J. Reiche, E. Höfig, Implementation of metadata quality metrics and application on public
government data, in 2013 IEEE 37th Annual Computer Software and Applications Conference
Workshops, 2013, pp. 236–241
4. S. Kubler, J. Robert, S. Neumaier, J. Umbrich, Y. Le Traon, Comparison of metadata quality
in open data portals using the analytic hierarchy process. Gov Inf Q 35(1), 13–29 (2018)
5. J. Kučera, D. Chlapek, M. Nečaský, Open government data catalogs: current approaches
and quality perspective, in Technology-Enabled Innovation for Democracy, Government and
Governance, 2013, pp. 152–166
6. A. Zuiderwijk, M. Janssen, I. Susha, Improving the speed and ease of open data use through
metadata, interaction mechanisms, and quality indicators. J. Organ Comput Electr Commer
26(1–2), 116–146 (2016)
7. J.H.M. Matamoros, L.A.R. Rojas, G.M.T. Bermúdez, Proposal to measure the quality of open
data sets. Knowl. Manage. Organ. 701–709 (2018)
8. H. Li, J. Sun, J. Wu, Predicting business failure using classification and regression tree: an
empirical comparison with popular classical statistical methods and top classification mining
methods. Expert Syst. Appl. 37(8), 5895–5904 (2010)
9. M. Ließ, B. Glaser, B. Huwe, Uncertainty in the spatial prediction of soil texture: comparison
of regression tree and random forest models. Geoderma 170, 70–79 (2012)
10. C. Zheng, V. Malbasa, M. Kezunovic, Regression tree for stability margin prediction using
synchrophasor measurements. IEEE Trans. Power Syst. 28(2), 1978–1987 (2013)
11. R. Kovač, D. Oreški, Educational data driven decision making: early identification of students
at risk by means of machine learning. p. 7 (2018)
12. B. Marr, Big data: 33 brilliant and free data sources anyone can use, in Forbes. [Online].
Available [Link]
free-data-sources-for-2016/. Accessed 29 Aug 2018
13. M. Lnenicka, An in-depth analysis of open data portals as an emerging public e-service 9(2),
11 (2015)
14. Open Data Barometer. [Online]. Available [Link]
indicator=ODB. Accessed 29 Aug 2018
15. Usage by dataset—[Link]. [Online]. Available [Link]
Accessed 29 Aug 2018
16. G. Shmueli, P.C. Bruce, I. Yahav, N.R. Patel, K.C. Lichtendahl, Data Mining for Business
Analytics: Concepts, Techniques, and Applications in R, 1st edn. (Wiley, 2017)
17. A.B. Shaik, S. Srinivasan, A brief survey on random forest ensembles in classification model, in
International Conference on Innovative Computing and Communications, 2019, pp. 253–260
18. N.M. Lutimath, D.R. Arun Kumar, C. Chetan, Regression analysis for liver disease using r: a
case study, in International Conference on Innovative Computing and Communications, 2019,
pp. 421–429
19. SAS, JMP 12 Specialized Models. (SAS Institute, Cary, NC, 2015)
Technology-Driven Smart Support
System for Tourist Destination
Management Organizations
Abstract Hospitality and tourism are among the major economic drivers and among the
largest sectors in the world. However, that growth does not come without problems, and
overcrowding in tourist destinations is starting to be a big one, affecting all
stakeholders: government, residents, and tourists. The overtourism problem is not going
to be solved overnight, but it cannot be solved without a system that measures, examines,
and predicts tourism at the destination. Even though there is growing demand for such
systems, there is still no globally adopted concept. Information and communications
technology (ICT) advancements, especially Big Data and the Internet of Things, are
enabling innovations in decision-making management that could also be introduced in
tourism and used as a management tool. The scope of this research was to examine some
options for using technological advancements and available data to build our version of
a data-driven Destination Management System (DMS), which we called eDestination, as part
of a Destination Management Organization (DMO) strategy.
1 Introduction
In recent decades, the hospitality and tourism industry has become one of the most
important economic sectors globally. Each year more and more people travel to almost
every corner of the world, despite worldwide economic crises, searching for new and
unique places and experiences. Some authors describe emerging tourism as the most obvious
form of globalization [1]. Looking back over more than 50 years, tourism has experienced
continued diversification and expansion and has become one of the fastest-growing and
largest industry sectors globally [2]. In addition to the traditional favorite
destinations of North America and Europe, many new destinations have emerged. A large
number of destinations worldwide report significantly growing interest in investment in
tourism. This has turned tourism into a key driver of economic progress through jobs and
enterprise activity and infrastructure development, but also through increased export
revenues.
1.1 Overtourism
Tourism destinations face a set of new challenges, from the influence of shared economy
business models and new technologies, to both consumers and the environment (Dimitrios
et al. [5]). Technology is playing a critical role in the competitiveness of tourism
destinations and organizations as well as the entire industry [6]. In its policy
recommendation for the long-term sustainability of urban tourism, researchers suggest
investment in innovation, technology, and partnership to promote smart cities, allowing
technology to address not only innovation but also accessibility and sustainability
(UNWTO 2018).
Even though there is a great need for digital technology in destination management, a
universally adopted concept still cannot be found [7]. A DMS needs to be the main tool
for a tourist destination to reach sustainability. There is no real system implemented
that is used by multiple DMOs; at the moment, DMOs are trying to do something through
their websites. Although a DMS has often been considered an advanced DMO web platform
since their inception somewhere in the middle of the '90s, the evidence clearly shows
that not many destinations were able to develop and implement such systems successfully
[8]. More and more, DMOs use digital technology in order to facilitate the tourist
experience before, during, and after the destination visit (all parts of the traveling
process), as well as for the coordination of all stakeholders involved in the experience
and service delivery of tourism [9]. Nowadays DMOs attempt to provide information and
accept reservations for different local enterprises and coordinate their facilities, but
also utilize digital technology to promote their policies, harmonize their operational
processes, increase tourist expenditure, raise the overall experience level, and boost
the multiplier effects in the local economy [10]. To do that, a DMS needs to administer a
wide range of requests and provide efficient and appropriate information on an increasing
supply of tourism products. National and regional governments are employing DMSs to
facilitate DMO management, as well as to support the local ecosystem at the destination
level (UNWTO 2008) [11].
In recent studies like the "Åre Case" [12], the authors carried out research on Swedish
mountain tourism destinations. They explored different customer-centric knowledge
sources, such as tourists' searches (Web navigation), booking, and feedback behavior
(review platforms, surveys), with the goal of creating a business intelligence-based
destination management information system built on data collected from pre-trip and
post-trip experiences and facts. They set up a knowledge destination framework
architecture with a knowledge creation and a knowledge application layer. The knowledge
generation layer includes various sources of customer-based data (e.g., booking, weblogs,
and customer feedback), the technical components for data extraction, transformation, and
loading (ETL processes), a centralized data analytics platform, and an analytics modeling
part. Overall, the system provides decentralized presentation and advanced visualization
of data models, with data resting on the knowledge-based transactional layer [13]. The
tracked indicators cover awareness aspects such as brand visibility, information sources,
interest about the destination, destination value areas (e.g., skiing or non-skiing
winter activities, summer activities and attractions, atmosphere, social interaction,
services, and features), value-for-money score, and customer loyalty and satisfaction
[12]. Moreover, they map each indicator to a business process and assign it to the trip
phase in which it occurs, defining which processes happen during the pre-trip, on-site,
and post-trip phases. For their research, they focused on the pre-trip phase, using
booking as an economic performance indicator and web navigation as a customer behavior
indicator; for the post-trip phase, they used feedback as an indicator of customer
perceptions and experience [14].
From the previous chapters, we can see that tourism has its problems, and one of them is
the lack of a management system. Also, there are different available data that are
currently not used for decision making in tourist destinations. Technology advancements,
especially Big Data and IoT, are enabling innovations in decision-making management that
could also be introduced in tourism. Following the research by the so-called "Swedish
collective" in their Åre Case [12], described in the previous chapter, we will build our
version of a data-driven DMS that we are going to call eDestination. During their
research on a Swedish mountain tourism destination, the focus was on activities before
and after the trip; customer-based knowledge sources, such as tourists' searches
(weblogs), booking, and feedback behavior (surveys, reviews), were used. However, to
consider this a complete destination management system, there is a big gap in the
during-trip phase. The goal of this research is to examine possibilities to collect data
from the during-trip phase and to build and test a few scenarios that could be used for a
data-driven management system for a tourist destination. Using the same approach, we will
create an architecture for our DMS, but we will focus on indicators from the "on-site"
phase [15].
3.1 eDestination
eDestination was developed for the Croatian seaside tourist destination Šibenik. The
authors' objective is to create different scenarios that will be used to test what data
would be available in a real tourist destination. The main goal is to find what open data
can be used and what additional data is available for the DMO – Šibenik Tourist Board
(TZ Šibenik). With different scenarios, the idea is to see how different data can be
correlated to gain an understanding of tourist behavior during their stay at the
destination. The idea is to exploit a few open and available technologies like Big Data,
IoT, and advanced data visualization. Within this research, the eDestination scenarios
will be analyzed and visualized using Tableau Software, which allows us to create
interactive data visualizations focused on business intelligence [16, 17].
The focus of this scenario is to compare during- and after-trip tourist satisfaction. Two
POIs in Šibenik were selected for this part of the research: St. Michael's Fortress and
the Šibenik City Museum. To determine post-trip tourist satisfaction, we used data from
TripAdvisor ("TripAdvisor Šibenik," n.d.). Using a simple web crawler for data mining, we
gathered each POI's average score, the total number of reviews, and their scores on a
scale from 1 to 5 (terrible, poor, average, very good, excellent). Data was collected on
15th September 2018, and the score is based on all reviews made on TripAdvisor for each
POI. St. Michael's Fortress is ranked overall as the 6th attraction (things to do) in
Šibenik; it has a total of 489 reviews with an average score of 4.0, and the first score
dates from 30th July 2012. The other selected POI, the Šibenik City Museum, has only 22
reviews; it is ranked overall as the 24th attraction in Šibenik, with the same average
score of 4.0 and the first review posted on 15th September 2015. To gather the same kind
of data during visitors' stay at the POI, we created a web questionnaire application. The
application is designed for tablets and has one question that is answered by touch-screen
selection of an answer. We set up the question "How do you like our attraction?" and gave
the possibility to answer by selecting one of five emoticons representing an excellent,
very good, average, poor, or terrible score. A tablet with the questionnaire was placed
in both selected POIs, in a visible place next to the exit. It is connected to Wi-Fi and
sends answers to the eDestination database in real time. The period from 25th July to
25th August 2018 was used for this research. The results show that many more visitors are
willing to give a review or score during their stay than after their trip. In just 30
days we gathered more scores than the POIs had accumulated in several years on
TripAdvisor: a total of 511 scores had been collected on TripAdvisor since the POIs were
listed, while in one month of research we gathered 1914 scores. It is also interesting
that a terrible score rarely appears on TripAdvisor (2% for St. Michael's Fortress and 0%
for the Šibenik City Museum), but visitors do not mind giving a terrible score during
their visit to the POI if they do not like their experience (8% for the Fortress and 20%
for the Museum). On the other hand, TripAdvisor scores are more detailed, as visitors
also leave a written review. However, the data collected through the eDestination
questionnaire is real-time data, and there are additional possibilities for what could be
done with it.
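A small illustrative sketch of the comparison made in this scenario: the share of each score level from the on-site questionnaire versus TripAdvisor for one POI. The counts below are placeholders, not the collected figures.

```python
import pandas as pd

levels = ["terrible", "poor", "average", "very good", "excellent"]
onsite      = pd.Series([120, 90, 310, 520, 460], index=levels)   # questionnaire taps
tripadvisor = pd.Series([10, 25, 80, 180, 194], index=levels)     # crawled reviews

summary = pd.DataFrame({
    "on-site %": (onsite / onsite.sum() * 100).round(1),
    "TripAdvisor %": (tripadvisor / tripadvisor.sum() * 100).round(1),
})
print(summary)
```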
The use of people-counting sensors is nothing new. They are used, for example, in the
retail business, where such sensors are placed in stores to see how many people enter,
how much time they spend inside, what the peak hours are, and so on. However, what we
want to explore is tourist movement behavior. Our scenario is a simple solution tested in
two POIs, but including more, or all, POIs in a destination would create a grid able to
detect movement behavior in and between the POIs of the destination. According to the
study by Lew and McKercher [18], the destination and tourist characteristics shape the
pattern of the path that tourists follow. The authors used a circular area near the point
of accommodation to represent the relative distance of activity/movement; this range
varies from extremely restricted movement to completely unrestricted movement.
Unrestricted movement behavior is something every destination is looking for. Could ICT
have an impact on this kind of tourist behavior? If tourists had information about
different POIs, their locations, working hours, peak hours, and similar information in
one place, they would change their movement behavior and plan their trip freely around
the destination. To start with, we installed an HPC005 sensor at the entrances of both
selected POIs, the Šibenik City Museum and St. Michael's Fortress. The sensor records the
movement of people: for each entrance or exit, it sends a timestamp and a value of +1
(entrance) or −1 (exit). The collected data was imported into Tableau Software and
analyzed, and a dashboard was created, as shown in Fig. 3. Data was collected for the
period from 1st July to 30th August 2018, the peak of the tourist season in Šibenik.
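A small sketch, under assumed CSV column names, of how the raw +1/−1 event stream from such a sensor could be aggregated into hourly entrance counts and a running occupancy estimate; this is illustrative and not the Tableau workflow used in the study.

```python
import pandas as pd

# assumed layout: one row per event with columns "timestamp" and "value" (+1 or -1)
events = pd.read_csv("hpc005_events.csv", parse_dates=["timestamp"])

events = events.sort_values("timestamp").set_index("timestamp")
entrances_per_hour = events["value"].clip(lower=0).resample("1H").sum()  # +1 events only
occupancy = events["value"].cumsum()                                      # people inside over time
peak_hour = entrances_per_hour.idxmax()

print(entrances_per_hour.head())
print("peak hour:", peak_hour, "max occupancy:", int(occupancy.max()))
```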
The dashboard is interactive and contains a map with both POI locations and graphs with
total visitor numbers per month, date, and hour for each POI separately. The results show
that these two POIs have different patterns and numbers. In this case this is logical, as
the POIs have different carrying capacity levels. However, if a full grid of all Šibenik
POIs were created, and a grid with this kind of data were available to tourists, they
would be able to plan their movements in the destination accordingly.
With this scenario, we want to get a broad picture of what is happening in the
destination. The scenario shows where tourists stay overnight in the destination, what
the guest structure is, and what their habits are regarding accommodation type, category,
and length of stay. In Croatia the overnight registration system is called eVisitor. It
is a system that contains data about all tourists who arrive in Croatia and stay
overnight with one of the legal accommodation providers. The entry of tourist details is
state-regulated and mandatory for all accommodation providers. Each accommodation
provider is registered in the system with its address, number of units, number of beds,
type of accommodation, and official category. Each guest who stays overnight in Croatia
must be registered with first and last name, sex, country of origin, date of birth, ID or
passport number, and check-in and check-out dates. This information is then shared with
different state stakeholders, among them DMOs. For this research, TZ Šibenik gave us a
partial dataset from eVisitor with all accommodation units and tourists from five
countries of origin (Croatia, Slovenia, Bosnia and Herzegovina, Germany, and Austria) for
the period from 1st July to 2nd September 2018. The dataset extracted from eVisitor was
uploaded to Tableau Software, and a detailed analysis was made with it.
The result of that analysis is an interactive dashboard, shown in Fig. 4. The dashboard
was created using several portlets. In the first window we placed a Google map layer on
which all used accommodation properties in Šibenik are geo-located. The map allows the
user to zoom in and out; each accommodation property is marked with a dot, and each dot
carries more details about the owner and maximum capacity. There is an option to look at
all the properties, select a single property, or group properties within a neighborhood.
Next to the map are details about specific accommodation categories and types of
accommodation (hotel, campsite, non-commercial, and other/private accommodation), with
additional details about the length of stay and total overnights for each category and
type. The interactive dashboard lets the user explore all, only one, or several
categories or types. Filters on the left side give multiple options to set up the
dashboard. In the bottom part of the dashboard are different graphs that allow additional
filtering: arrivals and departures per week, and length of stay. Moreover, there are
tourist demographic details, with one graph for sex, another for age groups, and the last
one for country of origin. Overall, any combination can be inspected with a simple click.
After showing this result to TZ Šibenik, their comment was simply that this was science
fiction. With the dashboard, they now see details that they did not think they could get,
and all of this was made with data they already have.
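As an illustration of the kind of aggregation behind the dashboard, the sketch below computes guests, average length of stay, and total overnights per accommodation type from check-in/check-out dates; the column names are assumptions about an anonymized export, not the official eVisitor schema.

```python
import pandas as pd

stays = pd.read_csv("evisitor_sibenik.csv",
                    parse_dates=["check_in", "check_out"])   # assumed columns

stays["nights"] = (stays["check_out"] - stays["check_in"]).dt.days

by_type = stays.groupby("accommodation_type").agg(
    guests=("nights", "size"),
    avg_length_of_stay=("nights", "mean"),
    total_overnights=("nights", "sum"),
)
print(by_type.round(2))
```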
5 Future Research
To conclude this research, we decided to create a setup for future research. Destinations
will need time to start using the scenarios created in the previous chapters in the right
way. Time will also be needed to gather data from these scenarios so that we can build
additional analyses and prediction models using advanced Big Data analytics and, later
on, artificial intelligence. For that reason, this scenario cannot be realized at this
early stage of the research. What would this scenario look like? It would be a diverse
mix of all the data sources used in the previous scenarios, put in correlation with one
another. If this scenario were created for the Šibenik City Museum, it would contain
questionnaire data from Scenario 1, sensor data from Scenario 2, and eVisitor data from
Scenario 3. With these data, we would get information about visitor satisfaction and the
number of visitors visiting the Museum, and we would add the total number of tourists who
stayed overnight in Šibenik from eVisitor. We also want to add one more external factor:
weather. Weather data could be collected with a web service through one of the web
weather pages such as AccuWeather. Such a diverse mix would have the goal of examining
customers' behavior and movement under different weather conditions and different
occupancy levels of the destination. Also, future developments in ICT will bring new
technologies and new datasets into play (e.g., wearables or self-driving cars), so
further research in this field should keep an open mind about the possibilities that
new technologies will bring. The described scenarios, together with additional
scenarios that could be created, set up in multiple locations in a destination and
combined in a smart grid, would create the destination's "Hawkeye." This kind of
DMS would have multiple parts used by different stakeholders. From eDestination
Šibenik, we could see that ICT technology and tools can easily be used in tourism.
Destinations already have a large amount of unused data. In addition, there are
different new datasets available to be gathered. Moreover, with the use of
technology, getting new data is not an issue anymore. If we compare eDestination
Šibenik with the Big Data case, we can see that using pre- and post-trip data does
not give enough information to make a complete DMS. The case showed it can
help in understanding tourist behavior from the expectation point of view, but with
that kind of system we do not know how tourists use the destination and what is
happening in the destination on a daily basis. Moreover, as shown in Šibenik, this
is something that can be done with the use of simple IoT technology. ICT
innovations are making life much easier in other industries, so there is no reason
this cannot happen in the tourism industry as well. Even though the tourism sector
is known for not adopting innovations well, this is something that will change. As
the world enters the fourth industrial revolution, tourists are changing their
behavior, and with that change, the tourism industry will need to change as well. It
is hard to expect that one globally unified DMS will be created, but this is not
necessary. Every destination is unique and has its own unique problems, so there
should be no issue with every destination having its own DMS. However, each
destination should have one.
Acknowledgements The ethical approval committee for the study provided in this chapter included
official representatives from the City of Šibenik Tourist Board, the Šibenik City Government, and
University College Algebra. Appropriate permissions from the responsible authorities (including the
City of Šibenik Tourist Board and the Šibenik City Government) were issued prior to installation of
the HPC005 sensor at the entrances of both selected POIs, Šibenik City Museum and St. Michael's
Fortress, used in the study. Data retrieved from the eVisitor national register was anonymized and
used upon agreement and approval between the Šibenik Tourist Board, the Croatian National Tourist
Board, and University College Algebra. All data and data samples were anonymized at source (device
and/or relevant register); primary and secondary anonymization was additionally conducted before
and after analysis to assure full anonymization of the data. No individual data was collected, stored
or processed at any time as part of this study.
Ortho-Expert: A Fuzzy Rule-Based
Medical Expert System for Diagnosing
Inflammatory Diseases of the Knee
Abstract The proposed work is for the diagnosis of the inflammatory diseases of
the knee joint. The main diseases discussed in this research under inflammatory
diseases are osteoarthritis, rheumatoid arthritis and osteonecrosis of the knee joint.
The software used for this research is MATLAB, in which the fuzzy logic method
is employed with a Mamdani inference engine. All the required input parameters
were established in consultation with an orthopaedic expert during the knowledge
acquisition phase. The survey method is used for data collection, and various
defuzzification methods are used to check the accuracy of the proposed system.
1 Introduction
The proposed system is designed to diagnose orthopaedic diseases of the knee joint
in the inflammatory category. The word orthopaedic is derived from two Greek
words, ortho and paedion, where ortho means straight and paedion means child.
Originally, it was the art of straightening deformities in children. The term was
coined by a French physician named Nicolas Andry in the year 1741, who is
known as the father of orthopaedics [1]. The system is designed to diagnose three
diseases, osteoarthritis (OA), rheumatoid arthritis (RA) and osteonecrosis (ON),
using rule-based fuzzy set theory.
With the advancement of science, the discovery of Roentgen rays (X-rays) and the
discovery of bacteria by Louis Pasteur, a new era of diseases of the skeletal system
emerged. From this, various bone and joint diseases came under orthopaedics and
are known as orthopaedic diseases [1].
The human skeleton is made up of 206 bones, where each bone of the skeletal
system can be affected by inflammatory, infective and neoplastic disorders.
Moreover, the human body comprises various joints, such as synovial, fibrous,
ball-and-socket and hinge joints, whose diseases are covered by the specialized
bone and joint branch called orthopaedics [2]. Under orthopaedics, there are spine,
elbow, knee, ankle, wrist and shoulder subspecialties. The knee joint is the main
weight-bearing joint, so we have focused special attention on this large hinge joint
of the body.
The knee is a hinge joint that plays a vital role in the human skeletal system; the
whole body weight rests on the knee joint. The knee comprises three bones: the
lower end of the thigh bone, called the femur; the upper part of the leg bone, called
the tibia; and the knee cap bone, known as the patella, as shown in Fig. 1. The
articulation between the lower femur and the upper tibia is divided by the ACL
(anterior cruciate ligament), the PCL (posterior cruciate ligament) and two
cushion-like menisci, the medial and lateral menisci. The inner part is called the
medial compartment, which can be identified from the medial joint line, and the
outer part is called the lateral compartment, which is assessed by the lateral joint
line [2]. The articulation between the patella and the lower part of the femur is
known as the patellofemoral joint. The cruciates and menisci make the knee joint
stable and act as its cushions. The menisci prevent degeneration of the articular
surfaces of the femur and tibia through their cushioning action during
weight-bearing activities.
The main area of interest in the knee, as described below in Table 1, is
inflammatory pathology, that is, osteoarthritis, rheumatoid arthritis and
osteonecrosis of the knee, and infective pathology such as septic arthritis, which
can be acute septic arthritis, chronic septic arthritis or tuberculosis of the knee. The
symptoms of the inflammatory group closely resemble one another, but the
diagnosis can be made from slight variations in the symptoms. Moreover,
investigations such as X-rays, MRI and blood tests can refine the diagnosis.
Similarly, infective pathologies differ in their symptoms, and various investigations
are also needed to diagnose the disease.
(1) Inflammatory diseases: inflammatory diseases involve inflammation of the bony
and soft tissue structures of the knee joint that causes inflammatory arthritis.
Various diseases come under this category, such as osteoarthritis of the knee
joint, rheumatoid arthritis of the knee joint and osteonecrosis of the knee [3].
(a) Osteoarthritis: osteoarthritis is a common joint disorder [4]. It is the
progressive softening and disintegration of the articular cartilage,
accompanied by new growth of cartilage and bone at the joint margin. The
symptoms considered for osteoarthritis are pain, age, swelling, deformity
and restriction.
(b) Rheumatoid arthritis: rheumatoid arthritis is an autoimmune disease and
the commonest cause of chronic inflammatory arthritis. Its most
characteristic features are elevated ESR, symmetrical polyarthritis and
morning stiffness. The pathology of RA in the knee progresses in three
stages: stage 1 is synovitis and swelling of the joint, stage 2 is early joint
destruction of the articular region, and stage 3 is advanced joint destruction
and deformity. The disease occurs in all age groups (children, young and
old) and can be seropositive or seronegative rheumatoid arthritis.
(c) Osteonecrosis: avascular necrosis of the medial condyle of the femur is
very common in the knee joint. It is often associated with alcoholism and
drug addiction, and it is three times more common in females above the
age of 60 years.
2 Literature Review
There are numerous areas of the medical field where fuzzy-based expert systems
have been implemented successfully to diagnose various diseases such as asthma,
dengue, ENT diseases, cardiovascular diseases, cancer, diabetes, tumours and
infectious diseases, and to determine risk and drug doses. Several expert systems
are available. MYCIN was the first expert system, developed by Dr. Edward
Shortliffe, Feigenbaum and Buchanan in the 1970s, and is used for the diagnosis of
infectious blood diseases. The system was implemented in the LISP language, with
450 rules written to diagnose the diseases. It calculates dosages based on the
patient's weight and handles interactions between various drugs [5].
Another expert system, Dendral, uses a computer program named Heuristic
Dendral for the data generated by a mass spectrometer. It is implemented in LISP
and is used to identify the molecular structure of organic molecules by analysing
their mass spectra together with knowledge of chemistry [5]. Many fuzzy expert
systems have been developed and implemented, but little attention has been given
to the orthopaedic field. The proposed fuzzy-based expert system helps in the
diagnosis of inflammatory knee disorders.
In 2002, using fuzzy relational theory, the author designed a hierarchical fuzzy
inference system to diagnose arthritis in various joints. It was a two-level process
where the first-level diagnosis reduced the scope of diagnosis at the second
level [6]. In 2010, a fuzzy inference system was designed to diagnose arthritis and
its severity level. Ten parameters were used in the fuzzy logic controller to
determine the severity level of the disease [3, 7].
3 Proposed Methodology
The proposed expert system is designed using MATLAB software. There has been
a significant increase in the variety of fuzzy logic applications, such as washing
machines, portfolio selection, medical diagnosis and the control of industrial
processes. The fuzzy logic concept starts from fuzzy set theory, in which sets do
not have clear boundaries. A fuzzy set is not a crisp set; a crisp set takes values in
yes/no, true/false or good/bad form. A fuzzy set is used when there is confusion
about whether to accept or reject a value, that is, when the value lies on the
boundary. For these types of input parameters, fuzzy sets are used. All the input
parameters, which take crisp inputs, are fuzzified using the standard min-max
Mamdani inference system. For defuzzification, the centroid method is used. The
whole process for the proposed system is shown in Fig. 2.
The foremost step in designing the fuzzy system is to collect data and identify all
input and output parameters. This is done by consulting an expert orthopaedic
doctor. Firstly, data regarding the category is collected and then refined. The data is
collected from various sources such as research papers, books and expert doctors.
Under orthopaedic diseases, there are various categories for the knee joint; the
major ones are inflammatory, infective and neoplastic diseases. The data collected
for the inflammatory category, which covers the OA, RA and ON diseases, is
shown in Fig. 3.
From the collected data, refinement is performed. The symptoms are categorized as
primary symptoms and secondary symptoms. The disease names and the
corresponding symptoms are described below in Table 2.
The next step is fuzzification, the mapping of crisp input values to membership
functions. It converts real numbers to fuzzy sets described by linguistic variables
[8]; triangular and trapezoidal membership functions are used for the input and
output parameters [9]. A range is set for all input and output parameters. The
fuzzified input symptoms for the OA, RA and ON diseases are given in Tables 3, 4
and 5, respectively. Figures 4 and 5 show the input and output membership
functions for the symptoms. All the rules are formed in the fuzzy system and
written in IF-THEN format [10].
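The exact term ranges come from Tables 3-5 and are not reproduced here; the following is a minimal pure-Python sketch of triangular and trapezoidal membership functions with illustrative, assumed breakpoints for a hypothetical pain input on a 0-10 scale.

```python
import numpy as np

def trimf(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    x = np.asarray(x, dtype=float)
    left = (x - a) / (b - a) if b != a else np.ones_like(x)
    right = (c - x) / (c - b) if c != b else np.ones_like(x)
    return np.clip(np.minimum(left, right), 0.0, 1.0)

def trapmf(x, a, b, c, d):
    """Trapezoidal membership function with a flat shoulder between b and c."""
    x = np.asarray(x, dtype=float)
    rise = (x - a) / (b - a) if b != a else np.ones_like(x)
    fall = (d - x) / (d - c) if d != c else np.ones_like(x)
    return np.clip(np.minimum(np.minimum(rise, 1.0), fall), 0.0, 1.0)

# Illustrative fuzzification of a "pain" input into three assumed linguistic terms.
pain = np.linspace(0, 10, 101)
pain_low = trapmf(pain, 0, 0, 2, 4)
pain_moderate = trimf(pain, 3, 5, 7)
pain_severe = trapmf(pain, 6, 8, 10, 10)

crisp_pain = 6.5
idx = np.argmin(np.abs(pain - crisp_pain))
print("mu_low:", pain_low[idx], "mu_moderate:", pain_moderate[idx], "mu_severe:", pain_severe[idx])
```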
In a fuzzy inference system, Mamdani and Sugeno types of inference are used; in
this proposed system, the Mamdani inference system is used. In fuzzy inference
generation, implication and aggregation methods are applied. For the AND
operator in the rules, the MIN function is used, whereas for the OR operator the
MAX function is used. Figure 7 shows that there are five input parameters and one
output parameter. Two fuzzy logic operators are used for rule formation, AND and
OR, applied between the input values. The AND operator acts as fuzzy intersection,
and the OR operator acts as fuzzy union [11]. Applied to two fuzzy sets, fuzzy set
A and fuzzy set B, these operators aggregate the two membership functions.
3.4.2 Implication
For the same output membership function, more than one rule may be activated in
the inference engine, but there should be a single output. There are two operators,
AND and OR. If there is an AND operator between the input variables, then the
minimum input value is picked, and similarly, if the OR operator is used, then the
maximum input value is selected for the output of the rule. This firing strength
truncates the output membership function of the rule, which is known as the
implication method.
3.4.3 Aggregation
After implication, the next step is to combine the outputs of all rules into a single
fuzzy set. This is called the aggregation process.
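As an illustration of the MIN/MAX machinery described above, the sketch below fuzzy-ANDs and fuzzy-ORs two assumed membership degrees, clips (implies) two illustrative output terms, and aggregates them with MAX. The rule wording and all numeric values are assumptions for illustration, not the paper's actual rule base.

```python
import numpy as np

# Fuzzified degrees of two antecedents for one rule (illustrative values only).
mu_pain_severe = 0.25   # degree to which "pain is severe" holds
mu_age_old = 0.80       # degree to which "age is old" holds

# Fuzzy AND (intersection) uses MIN; fuzzy OR (union) uses MAX.
firing_and = min(mu_pain_severe, mu_age_old)
firing_or = max(mu_pain_severe, mu_age_old)

# Implication: clip (MIN) the output membership function at the rule firing strength.
z = np.linspace(0, 10, 101)                   # output universe, e.g. disease likelihood
out_high = np.clip((z - 5) / 5, 0, 1)         # illustrative "high" output term
out_low = np.clip((5 - z) / 5, 0, 1)          # illustrative "low" output term

rule1_out = np.minimum(out_high, firing_and)   # IF pain severe AND age old THEN likelihood high
rule2_out = np.minimum(out_low, 1 - firing_or) # a second illustrative rule

# Aggregation: combine the clipped outputs of all activated rules with MAX.
aggregated = np.maximum(rule1_out, rule2_out)
print("firing strengths:", firing_and, firing_or, "aggregated max:", aggregated.max())
```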
3.5 Defuzzification
In defuzzification, the aggregated membership function is converted back to a crisp
output value [12]. Different methods can be used, such as centre of area (COA),
bisector of area (BOA), largest of maximum (LOM), smallest of maximum (SOM)
and mean of maximum (MOM). Defuzzification values and ranks for different
input combinations for the OA, RA juvenile, RA adult and ON diseases are given
below in Tables 6, 7, 8 and 9, respectively.
1. Centre of area (COA): under the aggregated membership function, it calculates the
centroid of that area:

Z_{COA} = \frac{\int \mu_A(z)\, z\, dz}{\int \mu_A(z)\, dz} \quad (3)
Table 6 Defuzzified values and ranks for different inputs for OA disease

Input symptoms (Pain, Age, Swelling, Deformity, Restriction) | SOM (rank) | LOM (rank) | MOM (rank) | Centroid (rank) | Bisector (rank)
5, 45, 5, 5, 5 | 4 (2) | 6 (2) | 5 (2) | 5 (2) | 5 (2)
8, 40, 6, 6, 8 | 4 (2) | 4 (3) | 4 (3) | 4 (3) | 4 (3)
8, 35, 5, 2, 2 | 0 (3) | 2.5 (5) | 1.2 (5) | 1.6 (5) | 1.6 (5)
8, 65, 8.5, 9, 10 | 6.7 (1) | 10 (1) | 8.3 (1) | 8.1 (1) | 8.1 (1)
Table 7 Defuzzified values and ranks for different inputs for RA disease (juvenile)

Input symptoms (Pain, ESR, Age, Swelling, Bilateral tenderness) | SOM (rank) | LOM (rank) | MOM (rank) | Centroid (rank) | Bisector (rank)
2, 20, 12, 8, 9 | 1 (4) | 3.1 (5) | 2.05 (5) | 2.28 (4) | 2.26 (4)
3, 35, 16, 2, 6 | 3.16 (3) | 6.88 (1) | 5.02 (1) | 5 (1) | 5.02 (1)
3, 31, 10, 8, 8.5 | 3.2 (2) | 6.7 (2) | 4.99 (2) | 5 (1) | 5.02 (1)
3, 60, 10, 8, 5 | 4 (1) | 4 (3) | 4 (3) | 4 (2) | 4 (2)
6.5, 27, 13, 8, 9 | 1 (4) | 3.76 (4) | 2.38 (4) | 2.43 (3) | 2.44 (3)
Table 8 Defuzzified values and ranks for different inputs for RA disease (adults)

Input symptoms (Pain, ESR, Age, Swelling, Bilateral tenderness) | SOM (rank) | LOM (rank) | MOM (rank) | Centroid (rank) | Bisector (rank)
5, 50, 45, 5, 5 | 3.1 (4) | 6.94 (1) | 2.05 (5) | 4.99 (2) | 5.02 (1)
2, 20, 22, 8, 9 | 1 (2) | 3.04 (5) | 5.02 (1) | 2.27 (5) | 2.26 (4)
9, 25, 67, 4, 5 | 1 (2) | 3.58 (4) | 4.99 (2) | 2.38 (4) | 2.38 (3)
3, 35, 45, 2, 6 | 4 (1) | 4 (3) | 4 (3) | 4 (3) | 4 (2)
Table 9 Defuzzified values and ranks for different inputs for osteonecrosis disease

Input symptoms (Pain, Swelling, Medial joint tenderness, Restriction) | SOM (rank) | LOM (rank) | MOM (rank) | Centroid (rank) | Bisector (rank)
2, 5, 8, 9 | 4.5 (3) | 8 (2) | 6.25 (2) | 6.39 (2) | 6.4 (1)
5, 5, 5, 5 | 6 (1) | 6 (3) | 6 (3) | 6.33 (3) | 6.3 (2)
6, 7, 5, 2.5 | 3.8 (4) | 9 (1) | 6.4 (1) | 6.44 (1) | 6.4 (1)
5, 8, 2, 9 | 0 (5) | 2 (5) | 1 (5) | 1.53 (5) | 1.5 (4)
8, 5, 3, 2 | 5 (2) | 5 (4) | 5 (4) | 5 (4) | 5 (3)
2. Bisector of area (BOA): this method divides the whole region into two equal parts;
sometimes the centre of area and the bisector of area lie on the same line, but not
always:

\int_{\alpha}^{BOA} \mu_A(z)\, dz = \int_{BOA}^{\beta} \mu_A(z)\, dz \quad (4)

3. Mean of maximum (MOM): it takes the mean of the output values z at which the
aggregated membership function attains its maximum (the set M):

Z_{MOM} = \frac{\int_{z \in M} z\, dz}{\int_{z \in M} dz} \quad (5)
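A minimal numerical sketch of the five defuzzification methods listed above (Eqs. 3-5), computed on a sampled aggregated membership function; the universe and membership values are illustrative assumptions, not the system's actual output terms.

```python
import numpy as np

def defuzzify(z, mu):
    """Return COA, BOA, MOM, SOM and LOM defuzzified values of an aggregated
    membership function mu sampled on the output universe z."""
    z, mu = np.asarray(z, float), np.asarray(mu, float)
    coa = np.sum(mu * z) / np.sum(mu)                       # centre of area (Eq. 3)
    half = np.sum(mu) / 2.0                                 # bisector splits the area in two (Eq. 4)
    boa = z[np.searchsorted(np.cumsum(mu), half)]
    z_max = z[np.isclose(mu, mu.max())]                     # points where mu is maximal
    som, lom, mom = z_max.min(), z_max.max(), z_max.mean()  # smallest/largest/mean of maximum (Eq. 5)
    return coa, boa, mom, som, lom

# Illustrative aggregated output membership function on a 0-10 universe.
z = np.linspace(0, 10, 201)
mu = np.maximum(np.clip((z - 2) / 3, 0, 1) * 0.4, np.clip((z - 5) / 2, 0, 0.8))
print(dict(zip(["COA", "BOA", "MOM", "SOM", "LOM"], defuzzify(z, mu))))
```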
The graphical user interface is the interface through which a lay user or a doctor
can give input in a user-friendly manner and obtain results that are easily
understandable. Figure 8 shows the interface used for this model, and Fig. 9 shows
the treatment plan for the disease.
5 Testing
In the last step, testing is performed. The system output is compared with the
observed output given by the expert; in testing, the expected output matched the
system output [7, 13, 14].
False positive and false negative counts determine the predictive values. A false
positive means that the system reports a disease when there is no disease. A false
negative means that there is a disease but the system reports that there is none.
Sensitivity is the percentage of patients with the disease who have a positive test,
whereas specificity is the percentage of patients without the disease who have a
negative test.
Fig. 9 Graphical user interface for treatment plan for inflammatory diseases of knee
Mathematically,
Positive predictive value (PPV): PPV = True Positive/(True Positive + False Positive)
Negative predictive value (NPV): NPV = True Negative/(True Negative + False Negative)
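A small helper that computes the predictive values and test characteristics defined above; the confusion-matrix counts in the example are illustrative assumptions, not the study's actual test results.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Predictive values and test characteristics from confusion-matrix counts."""
    ppv = tp / (tp + fp)            # positive predictive value
    npv = tn / (tn + fn)            # negative predictive value
    sensitivity = tp / (tp + fn)    # patients with disease testing positive
    specificity = tn / (tn + fp)    # patients without disease testing negative
    return ppv, npv, sensitivity, specificity

# Illustrative counts only; the paper's own test cases are summarized in Tables 6-9.
print(diagnostic_metrics(tp=18, fp=2, tn=15, fn=1))
```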
Acknowledgements The authors would like to acknowledge the expert Dr. Surjit Singh Dardi,
who is working as an Orthopaedic Medical Officer at Sant Sarwan Dass Charitable Hospital, Kathar,
Hoshiarpur, Punjab, India, for his continuous support and constant effort in data collection
throughout the development, testing and validation of the proposed system.
References
Abstract In the rapidly rising field of the Web, the volume of online social networks
has increased exponentially. This inspires researchers to work in the area of
information diffusion, i.e., the spread of information through the "word of mouth"
effect. Influence maximization is an important research problem in information
diffusion, i.e., the selection of the k most influential nodes in the network such that
they maximize the information spread. In this paper, we propose an influence
maximization model that identifies optimal seeds to maximize the influence spread
in the network. Our proposed algorithm is a hybrid approach, i.e., GA with a
k-medoid approach using dynamic edge strength. To analyze the efficiency of the
proposed algorithm, experiments are performed on two large-scale datasets using
the fitness score measure. The experimental outcomes illustrate an 8-16% increase
in influence propagation by the proposed algorithm as compared to existing seed
selection methods, i.e., general greedy, random, discounted degree, and high degree.
1 Introduction
The Web market share of online social networks is increasing exponentially. Social
networks are becoming tremendously important for various applications, i.e.,
education, matrimony Web sites, government divisions, job search portals,
recommendation systems [1], health care, viral marketing [2], and numerous other
businesses. In all these applications, the social network is commonly considered a
platform for information propagation, i.e., social influence. In online social
networks, the activities of one person can lead to a change in another person's
behavior, i.e., social influence. This change in a user's behavior depends on the
other user's influence strength, and information spreads from one user to another.
The spread of information depends on the user's position in the network. The
selection of appropriate nodes becomes a challenge in order to gain maximum
influence spread in the system, i.e., influence maximization. The influence
maximization problem was proved to be NP-hard for numerous propagation
models [3]. Therefore, this paper targets maximizing the influence extent by
determining k optimal nodes in the system using dynamic edge strength, so that
information propagation through these nodes maximizes the effect of influence in
the network.
In past studies, various models and algorithms have been introduced with respect
to social influence [4]. Aslay et al. [5] presented the current state of the art of
influence maximization (IM) in the field of social network analysis (SNA), i.e., the
existing algorithms and theoretical developments in the field of IM.
Anagnostopoulos et al. [6] described the social influence identification problem
methodically; they explained various social correlation models and proposed two
methods that can identify influence in a network using time-dependent user action
information. Chen et al. [7] extended the discounted degree approach to improve
influence propagation in the system. Mittal et al. [8] identified that centralities are
the major elements for discovering important authors in collaboration networks.
Chen et al. [9] proposed an algorithm that is scalable with respect to the size of
networks, i.e., social networks; in their algorithm, they used one tunable parameter
which provides a balance between running time and influence spread in the
network. Similarly, Khomami et al. [10] proposed a learning automaton-based
solution to identify the minimum positive influence dominating set (MPIDS) to
maximize influence propagation. Goyal et al. [11] solved the influence
maximization problem with their proposed model, i.e., credit distribution. The
credit distribution method uses the propagation traces of each action in the time
interval t and estimates the expected influence flow in the network. Chen et al. [12]
introduced a directed acyclic graph-based scalable influence maximization
approach that is modified with respect to the linear threshold model. Thus,
influence maximization is a rising field for which various theories and models have
been introduced in recent years. Therefore, to optimize the scope of influence in the
network, we present a hybrid approach that selects optimal seeds from the network.
Kumar et al. [13] applied the concept of influence propagation to detect rumors on
social media.
The remainder of this paper is arranged as follows. In Sect. 1, we explained the
importance of influence in social networks, described the role of good seeds in
information propagation, and reviewed the related work done in this field. In
Sect. 2, we explain the proposed algorithm, i.e., GA with a k-medoid approach for
optimal seed selection to maximize social influence. In Sect. 3, we illustrate the
performance analysis of the proposed methodology with respect to other methods
on two datasets. Lastly, we present the conclusion and future work.
2 Proposed Methodology
The static edge strength score of an edge (u, v) is defined as

s_{uv} = \frac{1}{\mathrm{out\_degree}(u)}
For a given network G(V, E, S), the dynamic strength score S_d of every edge is
calculated using the topical affinity propagation (TAP) algorithm. TAP computes
the topic-wise influence likelihood of each node with respect to different node
attributes, as defined in Algorithm 1. In this paper, we consider one topic per node;
therefore, no node attributes are required as input to TAP. Table 1 defines the
variables involved in the estimation of the dynamic influence likelihood.
The node_score function is the base component of the TAP algorithm, as defined
by Eq. 1:

node(v_i, r_i) =
\begin{cases}
\dfrac{s_{i r_i}}{\sum_{j \in N(i)} (s_{ij} + s_{ji})}, & r_i \neq i \\[1ex]
\dfrac{\sum_{j \in N(i)} s_{ji}}{\sum_{j \in N(i)} (s_{ij} + s_{ji})}, & r_i = i
\end{cases} \quad (1)
where r_i is the node with the highest edge_sum value in the set N(i) ∪ {i} for node i,
identified using Eq. 2:

\mathrm{Edge\_sum}(v_i) = \sum_{k \in N(i)} s_{ik} \quad (2)
T_{jj} = \max_{k \in N(j)} \min(A_{kj}, 0) \quad (5)

T_{ij} = \min\Big[\max(A_{jj}, 0),\; -\min(A_{jj}, 0) - \max_{k \in N(j)\setminus\{i\}} \min(A_{kj}, 0)\Big], \quad i \in N(j) \quad (6)

I_{ij} = \frac{1}{1 + e^{-(A_{ji} + T_{ji})}} \quad (7)
Algorithm 1
Dynamic likelihood computation G(E, V, S)
Therefore, in the initial phase of our proposed algorithm, we update the edge
strength score from S to the dynamic edge strength score S_d.
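Since Algorithm 1's listing is not reproduced in the text, the following is only a rough sketch of the final stage of the dynamic likelihood computation, implementing Eqs. (5)-(7) for a toy graph. The message values A are assumed to have been initialized or iterated elsewhere, as Algorithm 1 prescribes, and all node and edge values below are arbitrary assumptions.

```python
import math
from collections import defaultdict

def dynamic_likelihood(nodes, neighbours, A):
    """Compute T (Eqs. 5-6) and the dynamic influence likelihood I (Eq. 7)
    from a message map A[(i, j)]; missing entries of A are treated as 0."""
    T = defaultdict(float)
    for j in nodes:
        N_j = neighbours[j]
        T[(j, j)] = max(min(A[(k, j)], 0.0) for k in N_j) if N_j else 0.0   # Eq. 5
        for i in N_j:                                                        # Eq. 6
            others = [min(A[(k, j)], 0.0) for k in N_j if k != i]
            T[(i, j)] = min(max(A[(j, j)], 0.0),
                            -min(A[(j, j)], 0.0) - (max(others) if others else 0.0))
    # Eq. 7: dynamic influence likelihood of i on j.
    return {(i, j): 1.0 / (1.0 + math.exp(-(A[(j, i)] + T[(j, i)])))
            for j in nodes for i in neighbours[j]}

# Toy directed graph with arbitrary message values.
nodes = [0, 1, 2]
neighbours = {0: [1, 2], 1: [0], 2: [0]}
A = defaultdict(float, {(0, 1): 0.3, (1, 0): -0.2, (0, 2): 0.1, (2, 0): 0.4})
print(dynamic_likelihood(nodes, neighbours, A))
```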
In this step, a population set P of k nodes is generated such that the population
converges in optimal time and produces optimal seeds that maximize influence
propagation in the network. In past studies, various selection approaches have been
introduced, such as graph-based heuristics and mathematical models, for various
domains, e.g., shortest path optimization [14] or influence maximization [15]. In
the random model, the selection of k nodes is arbitrary and does not depend on any
node property or edge attribute; it selects the k nodes in O(1) time. The general
greedy model makes the best choice at each moment such that the objective
function is optimized, i.e., it chooses the best solution at every step. The next
method, high degree, is based on the hypothesis that the node with the largest
neighborhood is the strongest node of the network; it uses the out-degree for
directed graphs and the degree for undirected graphs. An extension of high degree
is known as discounted degree or the single discount heuristic; this heuristic
assumes that the neighborhoods of neighboring nodes are not mutually exclusive
[16]. In this paper, we apply a clustering algorithm, i.e., the k-medoid algorithm, to
generate k cluster centers as the population set P, as described in Algorithm 2.
Algorithm 2
k-medoid G(V, E, k)
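Algorithm 2's listing is likewise not reproduced in the text, so the sketch below shows a generic PAM-style k-medoid on a precomputed node-distance matrix; the distance measure and update details of the paper's actual Algorithm 2 may differ, and the toy data is an assumption.

```python
import numpy as np

def k_medoid(dist, k, max_iter=100, seed=0):
    """PAM-style k-medoid clustering on a precomputed distance matrix.
    Returns the indices of the k medoids, used here as the initial population P."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoids], axis=1)   # assign nodes to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            # New medoid = member minimizing the total distance to its cluster.
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return medoids

# Toy example: 6 nodes with a symmetric distance matrix (e.g., hop distances).
rng = np.random.default_rng(1)
pts = rng.random((6, 2))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print("initial population P (medoid node ids):", k_medoid(dist, k=2))
```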
We performed detailed experiments on two datasets [17], i.e., the Amazon
co-purchasing network and the Wiki vote network. Details of these datasets are
given in Table 2. We analyzed the performance of the proposed approach against
GA embedded with various other seed selection methods, i.e., general greedy,
random, discounted degree, and high degree.
Table 2 Datasets

Dataset | Network definition | Node count | Edge count | Degree statistics
Wiki vote network | If person i voted for person j, an edge i → j is created | 7115 | 103,689 | Lowest 0, highest 893
Amazon product co-purchased | If an item i is purchased along with item j, an edge i → j is created | 5122 | 11,321 | Lowest 0, highest 5
In this paper, we apply the fitness score of a node set P_d as the performance
parameter to compare the effectiveness of our proposed algorithm with other
existing algorithms. The fitness score of a node set P_d is the total number of nodes
converted from the non-influenced to the influenced state by that node set. A
detailed description of the cascade model used to compute the fitness score is given
in our previous work [15].
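The exact cascade model behind the fitness score is defined in [15] and not restated here; the following is a generic Monte-Carlo independent-cascade sketch that counts newly influenced nodes for a seed set, with a toy graph and assumed activation probabilities, shown only to illustrate how a fitness score of this kind can be estimated.

```python
import random

def fitness_score(seeds, successors, prob, runs=100, seed=0):
    """Monte-Carlo estimate of how many extra nodes a seed set activates under an
    independent cascade with edge activation probabilities prob[(u, v)]."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in successors.get(u, []):
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    frontier.append(v)
        total += len(active) - len(seeds)   # newly influenced nodes only
    return total / runs

# Toy graph: the edge probabilities could be the dynamic strengths S_d.
successors = {0: [1, 2], 1: [3], 2: [3], 3: []}
prob = {(0, 1): 0.5, (0, 2): 0.5, (1, 3): 0.3, (2, 3): 0.3}
print("fitness of seed set {0}:", fitness_score([0], successors, prob))
```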
In this paper, we analyze the efficacy of the proposed approach against various
existing seed selection methods, i.e., general greedy, random, discounted degree,
and high degree, embedded with GA using dynamic probabilities (GADP). We
performed experiments on two datasets, i.e., the Amazon product co-purchased
dataset and Wiki vote.
In our experiments, we used different seed set sizes ranging from 10 to 50. Figure 2a
and b shows the experimental results for the Amazon product co-purchased and
Wiki vote datasets, respectively. It is noticed from Fig. 2a that the proposed
algorithm, i.e., k-medoid along with GADP, indicated by the sky blue line, shows
better results than all other algorithms used in this paper. For the Wiki vote dataset,
discounted degree shows the second-best results because of the high out-degree
ratio. It is clear from the outcomes of the experiments that the proposed algorithm
increased influence propagation by converting more nodes from the non-influenced
to the influenced state, i.e., by up to 11%. Similar results have been observed for
the Amazon product co-purchased dataset, as shown in Fig. 2b. The proposed
approach improved influence propagation by up to 16% for the Amazon product
co-purchased dataset with respect to the other approaches. For this dataset, the
greedy approach shows the second-highest fitness score because of the low
maximum out-degree ratio.
Overall, from the results, it can be seen that high degree and discounted degree
show good results for datasets with a small out-degree ratio and the greedy
approach shows good results for datasets with a high out-degree ratio, whereas the
proposed approach presents improved results for both types of degree ratio, i.e.,
low and high.
We have also performed a comparative analysis with respect to the propagation
value, i.e., the fitness score of the results generated by the different approaches, as
shown in Fig. 3. It can easily be seen from Fig. 3 that our proposed algorithm
shows a significant improvement in fitness score with respect to all other
approaches, and it shows this improvement for both datasets.
Overall, from the experimental results we also observed an interesting behavior of
our proposed algorithm w.r.t. other methods, i.e., consistency. Our proposed
algorithm showed improved results for different out-degree ratios, both low and
high, whereas the random, discounted degree, and high degree results are
out-degree ratio dependent. Therefore, our proposed algorithm improves influence
propagation by 8-16% with respect to the other approaches.
References
1. X. Song, Y. Chi, K. Hino, B.L. Tseng, Information flow modeling based on diffusion rate for
prediction and ranking, in WWW (2007), pp. 191–200
2. P. Domingos, M. Richardson, Mining the network value of customers, in KDD (2001), pp. 57–66
3. D. Kempe, J. Kleinberg, É. Tardos, Maximizing the spread of influence through a social net-
work, in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. ACM (2003)
4. Y. Li, et al., Influence maximization on social graphs: a survey. IEEE Trans. Knowl. Data Eng.
(2018)
5. C. Aslay et al., Influence maximization in online social networks, in Proceedings of the Eleventh
ACM International Conference on Web Search and Data Mining. ACM (2018)
6. A. Anagnostopoulos, R. Kumar, M. Mahdian, Influence and correlation in social networks,
in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. ACM (2008)
7. W. Chen, Y. Wang, S. Yang, Efficient influence maximization in social networks, in Proceedings
of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
ACM (2009)
8. R. Mittal, M.P.S. Bhatia, Identifying prominent authors from scientific collaboration multiplex
social networks, in International Conference on Innovative Computing and Communications
(Springer, Singapore, 2019), pp. 289–296
9. W. Chen, C. Wang, Y. Wang, Scalable influence maximization for prevalent viral marketing in
large-scale social networks, in Proceedings of the 16th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining. ACM (2010)
10. M.M.D. Khomami et al., Minimum positive influence dominating set and its application in
influence maximization: a learning automata approach. Appl. Intell. 48(3), 570–593 (2018)
11. A. Goyal, F. Bonchi, L.V.S. Lakshmanan, A data-based approach to social influence
maximization. Proc. VLDB Endowment 5(1), 73–84 (2011)
12. W. Chen, Y. Yuan, L. Zhang, Scalable influence maximization in social networks under the
linear threshold model, in 2010 IEEE 10th International Conference on Data Mining (ICDM)
(IEEE, 2010), pp. 88–97
13. A. Kumar, S.R. Sangwan, Rumor detection using machine learning techniques on social
media, in International Conference on Innovative Computing and Communications (Springer,
Singapore, 2019), pp. 213–221
14. S. Agarwal, S. Mehta, Approximate shortest distance computing using k-medoids clustering.
Ann Data Sci 4(4), 547–564 (2017)
15. S. Agarwal, S. Mehta, Social influence maximization using genetic algorithm with dynamic
probabilities, in 2018 Eleventh International Conference on Contemporary Computing (IC3)
(Noida, India, 2018), pp. 1–6
16. D. Bucur, G. Iacca, Influence maximization in social networks with genetic algorithms, in
European Conference on the Applications of Evolutionary Computation (Springer, Cham,
2016)
17. J. Leskovec, A. Krevl, Large Network Dataset Collection (2015)
Sentiment Analysis on Kerala Floods
Abstract In the twenty-first century, Twitter has been a very influential social media
platform, be it in elections or in gathering aid for a disaster. In this work, we
propose to study the effect of natural calamities like the recent Kerala flood, their
effect on the people, and the reactions of people from different strata of society.
The direction of our research is towards sentiment analysis using RStudio and the
Twitter app. This research highlights the reactions of people on a public platform
during the calamity. Word clouds and other data visualization techniques are used
for our study. The research also highlights how the government reacts and how aid
is provided in the dire time of need. We predict that our research will be useful to
human society as it showcases a lot about human behaviour, its goodness and its
shortcomings. The paper also throws light on the Hurricane Michael calamity and
gives a comparative study of relations like the differences of opinion among people
as well as their similarities during and after a calamity.
1 Introduction
On August 15, 2018, severe floods affected Kerala, God's own country, due to
heavy rainfall. The devastation was such that it caused severe damage to property
and lives in the state. A huge volume of information was offered on numerous
websites related to the calamity, where users were sharing and exchanging their
thoughts and opinions. Social networking sites like Twitter are expansively
employed by individuals to enunciate their feedback on everyday events.
Information is often publicized on Twitter through the retweet feature, i.e. once a
Twitter handle shares a tweet, the tweet will be perceived by all others who follow
that handle on Twitter, therefore increasing the reach of the tweet posted
formerly [1]. In recent years, hash tags engendered by big data environments like
Twitter have flooded the Internet. Hash tags have become a vital characteristic of
the content spawned by nearly all social media platforms. Noticeably, entities have
used hash tags to postulate their interests, frame of mind and proficiencies regarding
any and all topics. To name a couple of them, handles use hash tags to proclaim the
cars they drive (e.g. #honda), the motion pictures they watch (e.g. #Hobbit), the
diversions they get indulgence from (e.g. #skating) and also the food they cook
(e.g. #fish, #grilling). As a corollary of such societal engagement, hash tags have
fascinated many scholars in abundant realms of research. Machine learning
practices and knowledge-based techniques have been extensively used in sentiment
analysis. The lexicon-based technique is another name for the knowledge-based
technique. Techniques based on lexicons concentrate on producing lexicons based
on opinions from the data and then finding the divergence of these lexicons [2].
However, the foremost objective of machine learning techniques [3] is to develop a
formula that heightens the performance of the system using training data. Machine
learning provides an answer to the sentiment classification problem in two ordered
steps: (1) train and develop the model using the training set data and (2) categorize
the unclassified or unlabelled data based on the trained data set model [4].
In this paper, a sentiment analysis of people affected during the Kerala flood is
performed, and a word cloud is constructed from the tweets. The sentiments of
individuals are compared based on the tweets, and an analysis is performed to
visualize the different sentiments exhibited by the chosen people. The paper
additionally provides a comparative study of relations like the differences of
opinion among people as well as their similarities during calamities in different
parts of the world by taking into account Hurricane Michael, which affected many
lives in Florida during the same period.
2 Literature Survey
This section discusses the various literary works collected through various sources.
In "Sentiment Analysis of Twitter Data: Emotions on Trump during the 2015-2016
Primary Debates", Malak Abdullah et al. considered data obtained from the Twitter
social media platform, mostly concerning the debates of Donald Trump when he
was a candidate for the elections. The paper also tries to look into the emotions of
people around the globe regarding him in the debates, to test whether the tweets
obtained support him or not. One strong feature of this study is that positive or
negative tweets do not directly indicate whether there is support for the candidate
or not [5].
Pulkit Garg et al. performed a massive study on tweets obtained during the
disturbing Uri terror attack that happened on 18 Sep 2016 and shook the world. The
tweets were collected with the help of the Twitter social media platform. The study
analysed the emotions like pain, sorrow and fear of people at the time of the attack,
thereby conducting the sentiment analysis [6].
In "Does Social Media Big Data Make the World Smaller? An EDA of
Keyword-Hash tag Networks", Ahmed Abdeen Hamed et al. conducted a study on
bulk-measure networks called K-H networks, collected from Twitter. In the study,
the networks are studied on the basis of the connecting vertices between any two
keywords present in the data set, with attention also paid to the eccentricity of each
keyword present. The results obtained from this study were that the number of
vertices between any two keywords in the chosen K-H network was three, and the
eccentricity of each and every word present was four [7].
In "Sentiment Analysis of Twitter Data: Case Study on Digital India", Prerna
Mishra et al. analysed data from the social media platform that is the means of
information exchange in the present world. The chosen platform for their study was
Twitter. The data of interest for the analysis concerned India's Prime Minister
Modi; they considered the tweets obtained during the Prime Minister's Digital India
campaign. The sentiment analysis was done, and the data was classified based on
the polarity index, using a dictionary-based approach. The results obtained from
this study were that 500 opinions were positive, 200 were negative, and the rest
were neutral. They also concluded that the people of India supported the Prime
Minister in the campaign [8].
In "Prediction of Indian Election Using Sentiment Analysis on Tweets in Hindi",
Parul Sharma et al. performed their study on tweets obtained in Hindi, the national
language of India. They used the Twitter archiver tool to obtain tweets. For their
study, they collected tweets for a month regarding the five national parties in India
during the 2016 campaigning period. Naive Bayes and SVM algorithms were used
to build the classifier. The data was classified into positive, negative and neutral
categories, which was helpful in finding out the sentiments of the users towards
their support or rejection of their desired political party. The results of the analysis
for Naive Bayes supported the BJP, the results obtained from the dictionary-based
approach supported the INC, and SVM
predicted that the BJP would win many seats. The results of the elections turned out
to be in favour of the BJP, which won 60 out of 126 constituencies in the 2016
elections [9].
In [10], two main approaches to sentiment analysis were used. The former used
Naïve Bayes, decision tree and K-nearest neighbour classifiers, and the latter used a
deep neural network, an RNN using long short-term memory (LSTM). The
experimentation was done on three Twitter data sets, namely IMDB, Amazon and
Airline, and a comparison between these approaches was illustrated. The
experimental results showed that the recurrent neural network using LSTM scored
the highest accuracies of 88%, 87% and 93% [10].
In [11], the authors propose a technique for sentiment analysis on feedback and
delimit the polarity caused by various scenarios. This is followed by a clustering
process for the positive and negative feedback obtained previously, in order to
identify the wide spectrum of topics interesting the organization that targets the
feedback. This entire summarization process helps in improving the current
feedback process employed in restaurants.
The main emphasis of the research in [12] lies on the classification of tweets based
on emotions for data gathered from Twitter. Previously, machine learning was
employed in the field of sentiment analysis but failed to give better results.
Ensemble techniques were used in order to increase the efficiency and reliability;
the ensembling is done by merging the concepts of SVM and decision tree. The
results were found to give better classification performance, which was verified
using accuracy and F-measure.
The research in [13] presents sentiment analysis on Indian movie reviews using
machine learning techniques. A Bayesian classifier is used along with feature
selection. The classifier built is trained on Chi-square, Info-gain, Gain-Ratio, One-R
and Relief attribute selection, and a comparative study has been performed in order
to study their results. Evaluation by F-value and false positive rate was also carried
out. The results of this study showed that the Relief-F feature selection approach
was good, with a better F-value and a low FP rate for most of the selected features.
For a smaller number of features, the One-R method was superior to Relief-F [13].
The paper [14] considers a government policy that caused a revolution in India, the
demonetization of currency by the Indian government that took effect on November
8, 2016, under Prime Minister Narendra Modi. In this study, the analysis is done
from the common man's perspective with the aid of sentiment analysis using
Twitter data. The analysis is done state-wise using geo-location in order to study
the reaction across the nation.
3 Methodology
The information is gathered from Twitter using the Twitter API and an R package
called twitteR, a Twitter client based on the R programming language that provides
an interface to the Twitter Web API. We used a Windows 10 PC with a 64-bit OS
for our entire study.
In order to extract tweets, an app should be created, and direct Twitter
authentication should be set up using the consumer key, consumer secret key,
access token and access secret by the owner of a Twitter developer account.
In our paper, the tweets related to the Kerala floods that were trending during the
tragedy that occurred in 2018 are the main data used for carrying out sentiment
analysis. Trending data is generally described using hash tags. Some of the hash
tags used for our analysis are #keralaflood, #kerala, #keralafloods,
#standwithKerala, #doforkerala, #keralafloodrelief, #keralafloods2018 and
#savekerala. For the purpose of comparing the reactions of people across the globe
to a similar tragedy, we collected the tweets on Hurricane Michael, which took
place in Florida in 2018. The hash tags used for this analysis are
#hurricanemichael2018, #floridaStrong, #HurricaneRelief, #ThereNoMatterWhere,
#panamabeach and many more.
Twitter data sentiment analysis is a field that deserves much more recognition due
to the enormous amount of information obtained. Figure 1 displays the steps of the
Twitter data sentiment analysis procedure.
In the procedure, the collected Twitter data is first rectified in order to execute the
data cleaning. The vital features are extracted from the cleaned text using one of the
feature strategies. A portion of the data is tagged as negative or positive tweets in
order to organize it into a training set. Lastly, the training set and the extracted
opinions are provided as input to the classifier, which is designed to classify the
remainder of the data set, i.e. the test set.
Data Collection. The sources of data selected for the sentiment analysis play a
major role in the entire procedure. Twitter, which is essentially a micro-blogging
site, has gained high popularity because of its high participation rate among users.
Tweets, which are the messages posted on Twitter by the users, are restricted to 140
characters. Twitter provides two types of APIs, the stream API and the search API.
The API required to assemble Twitter information using hash tags is the search
API, while the stream API is required to propagate data at the same time it is being
produced. In this paper, the search API is used to save the related tweets into a CSV
file for further analysis.
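The authors collected tweets with the R twitteR client; a rough Python equivalent using the tweepy library is sketched below. The credential strings are placeholders, the hash tag is one of those used in the study, and the search method name assumes tweepy 4.x.

```python
import csv
import tweepy  # assumes tweepy 4.x

# Placeholder credentials from a Twitter developer account.
auth = tweepy.OAuth1UserHandler("CONSUMER_KEY", "CONSUMER_SECRET",
                                "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Collect tweets for one of the study's hash tags via the search API and save to CSV.
rows = [(t.id_str, t.user.screen_name, t.retweet_count, t.text)
        for t in tweepy.Cursor(api.search_tweets, q="#keralafloods", lang="en").items(500)]

with open("kerala_floods_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "screen_name", "retweet_count", "text"])
    writer.writerows(rows)
```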
Data Preprocessing. Twitter data mining is a difficult chore as the data is raw
information. In order to use a classifier on this raw data, it is vital to clean the data.
The steps below show the cleaning procedure by removal of the following tokens:
• Twitter terminologies like hash tags (#), retweets (RT) and account Ids (@).
• URLs, links, emoticons, non-letter data and symbols.
• Stop words such as are, is, am and so on.
• Compression of elongated words like enjjoyyy into enjoy.
• Decompression of words like g8, f9 and so on.
In this paper, for cleaning the collected tweets, the text is first converted into
lowercase, and whitespaces and punctuation marks are removed, including some
unnecessary words like single characters [a–z] and [A–Z] [15].
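A minimal regex-based Python sketch of the cleaning steps listed above; the stop-word list is abridged and the g8/f9 decompression step is omitted.

```python
import re

STOP_WORDS = {"is", "am", "are", "the", "a", "an", "and", "to", "of", "in"}  # abridged list

def clean_tweet(text):
    """Apply the cleaning steps listed above to a single raw tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs and links
    text = re.sub(r"\brt\b|[@#]\w+", " ", text)           # RT markers, @handles, #hashtags
    text = re.sub(r"[^a-z\s]", " ", text)                 # symbols, emoticons, non-letters
    text = re.sub(r"(.)\1{2,}", r"\1", text)              # collapse elongations: enjjjoyyy -> enjoy
    words = [w for w in text.split() if w not in STOP_WORDS and len(w) > 1]
    return " ".join(words)

print(clean_tweet("RT @user: Enjjjoyyy!! Stay safe #KeralaFloods https://t.co/xyz"))
```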
Extraction of Features. The cleaned data set has various properties. In this stage,
we extract parts of speech like nouns, verbs and adjectives, and later these words
are recognized as negative or positive to find the divergence of the entire sentence.
The feature extraction ways are:
• Specific words and their existence counts are calculated.
• Adverse words are considered, as they affect the polarity of the sentence.
In this paper, term frequency and term presence are used to make a word cloud
using the tm (Text Mining in R) library, by importing information, handling the
corpus, managing information, creating term document matrices and applying
preprocessing methods. Negative phrases are analysed using the tidyverse, tidytext,
dplyr and tidyr packages in R [16].
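The word cloud itself was built with R's tm and wordcloud packages; the sketch below shows the same term-frequency and term-presence idea in Python, assuming the `wordcloud` package is available and using made-up cleaned tweets.

```python
from collections import Counter
from wordcloud import WordCloud  # assumes the `wordcloud` package is installed

cleaned_tweets = ["kerala floods rescue teams", "floods relief camps kerala", "rescue boats deployed"]

term_frequency = Counter(w for t in cleaned_tweets for w in t.split())
term_presence = Counter(w for t in cleaned_tweets for w in set(t.split()))  # per-tweet presence
print(term_frequency.most_common(5))

# Build and save a frequency-weighted word cloud image.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(term_frequency)
wc.to_file("word_cloud.png")
```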
Preparing a Training Data Set. Machine learning techniques and knowledge-based
techniques have been extensively used in sentiment analysis. The lexicon-based
technique is another name for the knowledge-based technique. Techniques based
on lexicons concentrate on producing lexicons based on opinions from the data and
then finding the divergence of these lexicons [2]. However, the key objective of
machine learning techniques is to advance the method that augments the
performance of the system using training data. Machine learning masks the
weaknesses of sentiment classification using two steps: (1) train and develop the
model using the training set data and (2) categorize the unlabelled data based on
the trained model [4].
The images below are the results obtained when the data collected from Twitter
was analysed using the Tableau software. The following outcomes were obtained: a
text table view of the data set considering only the screen names of the users and
whether the tweets posted by them were retweeted further; another text table
displaying the screen name along with whether the tweet was marked as favourite;
a bar graph comparing the number of tweets that were retweeted or not; a tree map
visualizing the screen names of the users and the number of retweets in the high
range; a text table displaying the Reply to UID and Reply to SID along with the F1
value based on the screen name; and a data set summary showing the tweets of
specific users identified by their screen names.
A range of retweets from 50 to about 15,000 for some users is shown in Fig. 2.
From the figure, we observe that the minimum number of retweets corresponds to
FlorentMento and the maximum number of retweets is obtained for the user
misspete64.
The bar graph shown below (Fig. 3) enables us to understand that, out of the 500
tweets considered, around 400 tweets have been retweeted and have become
popular among the other users. It also shows that 100 tweets were just posted once
and have not been retweeted by others. The major insight from this graph is that the
topic related to the Kerala flood has been trending for a long time on the Twitter
platform.
The table in Fig. 4 was generated based on the count of tweets being marked as
favourite. Out of 500 tweets, only 11 tweets have been marked as favourite. From
the table, we infer that the tweets posted by weatherdak were marked as favourite
4 times, whereas the tweet by Rep Jayapal has the maximum count of 16.
A tree map is generally a data representation that visualizes the selected data based
on size and shape; the parameter used for the size and shape of the blocks in the
tree map is a continuous value. In Fig. 5, the selected data is the screen names of
the users whose tweets are being analysed, and the parameter for size is the sum of
the retweets for each screen name. From the figure, we observe that misspete64 has
the highest number of retweets, about 15,736. The next highest, mollyexcelsior,
has 14,153 retweets, and the lowest value is for the user phriick with 1098
retweets. The legend shown in the top right corner depicts the colour of each block
in the tree map with the corresponding screen name.
The table in Fig. 6 gives us an overview of some tweets that have been logged for
future analysis. The data obtained consists of the attributes Reply to SID, Reply to
UID and screen name; the values in the last column correspond to the F1 attribute
of the data set. The UID refers to the User Id of each user who uses Twitter, and the
SID refers to the session under which the user had logged in while posting a
particular tweet.
Fig. 3 A bar graph showing the number of tweets that are retweeted or not
Fig. 5 A tree map showing the users who have retweets in the range of 1000–16,000
The summary of the data set (Fig. 7), in the form of a text table, provides many
insights such as the tweet posted by the user, the status of the source (basically the
web page link to the tweet), the reply-to SN (where SN refers to the sender), and
whether the tweet has been retweeted or not based on the retweet attribute. From
this information, the analysis obtained is that none of the above-mentioned tweets
were considered important and hence have not been retweeted.
A word cloud is basically a medium to visualize text, with size variation informing
the observer about the most frequent or most valuable term in the set of words
based on some numeric parameter. For Fig. 8 shown above, the parameter used is
the sum of the retweets for the tweets posted by the users with these screen names.
B. Observations from the Sentiment Analysis
The word cloud, emotional index and comparative word cloud shown below are
obtained using the Text Mining, wordcloud, syuzhet and RColorBrewer packages.
The Twitter data discussed in this section is processed using RStudio.
For Kerala, Fig. 9 shows the intensity of the words used based on the frequency of
repetition of each word. The darker and larger a word, the more commonly it is
used; lighter and smaller words denote less commonly used words. In Graph 4.1,
the word cloud uses Dark2 shades to denote the intensity, with black being the
darkest, and thus the most frequently used word, and green being the lightest and
least frequently used. In this graph, floods is the most commonly used word,
followed by [Link].
Fig. 8 A word cloud showing the screen names of users based on the retweet parameter
For the intensity index, Fig. 10 shows the intensity of different emotions in the
collected tweets. The higher the peak, the more frequently the corresponding words
are used, and the higher the index, the more strongly the words indicate that
emotion. In Graph 4.2, the histograms use the NRC sentiment dictionary to find the
emotional index of each word, and the height of a bar shows the intensity and index
of the emotion. In this graph, fear is the most commonly felt emotion, while
negative, anger and surprise are the more strongly expressed emotions.
For the comparative word cloud, Fig. 11 shows the intensity and affiliation of
different words in the collected tweets, based on the frequency of repetition of each
word. The darker and larger a word, the more commonly it is used; lighter and smaller
words are used less often. The word cloud also uses blue or red shades to denote
positive or negative emotions, respectively. In Graph 4.3 the word cloud uses the Bing
lexicon to denote word affiliation. The intensity is denoted by shade: the darkest
(dark blue/dark red) words are the most commonly used and the lightest (light blue/
light red) the least. In this graph, dynamic is the most commonly used word and the
most common positive word, while warning is the most common negative word.
In order to study the differences of opinion among people and the similarities in their
emotional response during a crisis, we compare our study of the Kerala floods with that
of Hurricane Michael.
Figure 12 shows the intensity of different emotions in the collected tweets. The higher
the peak, the more frequently the associated words are used; the higher the index, the
more strongly the words express that emotion. The histograms use the NRC sentiment
dictionary to find the emotional index of each word, and the height of a bar shows the
intensity and index of the emotion. Graph 4.11 compares the intensity index of the
different emotions expressed towards the Kerala floods and Hurricane Michael.
Table 1 gives a comparative study of the frequency of words for each emotion for the
Kerala floods and Hurricane Michael. The results obtained are:
• The Anger shown for the Kerala floods is greater than that for Hurricane Michael,
and stronger words are used to express that emotion.
• The Anticipation for relief is stronger for Hurricane Michael than for the Kerala
floods.
Fig. 12 Comparison of intensity index for Kerala flood and Hurricane Michael
• The intensity of the strong words used for Fear is greater in the case of Hurricane
Michael than for the Kerala floods.
• Joy and Trust are higher in the case of Hurricane Michael than for the Kerala floods,
showing a more positive attitude towards the crisis and suggesting better management
by the government.
Table 1 Comparison of intensity index for Kerala flood and Hurricane Michael
Emotion         Kerala flood    Hurricane Michael
Anger           25              11
Anticipation    9               20
Disgust         16              6
Fear            39              31
Joy             9               11
Negative        35              40
Positive        28              25
Sadness         21              10
Surprise        15              11
Trust           20              18
• Positive and Negative emotions are both higher in the case of Hurricane Michael than
for the Kerala floods, indicating more vocal and interested people.
• Disgust, Surprise and Sadness are virtually the same in both cases.
5 Result Analysis
This section consists of the understandings obtained from the results discussed in
Sect. 4. From the data set analysis, a number of observations were obtained.
• Out of around 500 tweets considered, 400 were retweeted. This shows that the topic
considered is very trendy and much talked about.
• The maximum number of retweets is for a tweet posted by the user with screen
name misspete64.
• The user Rep Jayapal has the maximum number of tweets marked as favourite, with a
count of 16.
Emotional indices created for the Kerala floods and Hurricane Michael (Florida) were
compared. The results obtained from the emotional index are:
• For the Kerala floods, the emotions of fear and negative thoughts were the highest
and the strongest.
• There is a major variance in the expression of fear and negative thoughts between
Hurricane Michael and the floods in Kerala.
• The comparison also showed that there is not much variance in the emotions of
disgust, anticipation, sadness, trust and surprise between the two calamities.
6 Conclusion
Recommendation System Using
Community Identification
Abstract Community Detection has garnered a lot of attention in the years follow-
ing the introduction of social media applications like Facebook, Twitter, Instagram,
WhatsApp, etc. A community refers to a group of closely-knit people who share common
ideas, thoughts and likes. They may bond over topics ranging from politics to religion,
from sports to music and movies, or from education to holidaying. Researchers have
proposed various algorithms for identifying people who may fit into a particular
community. These algorithms are being used by social media giants in the form of
'suggestions'. In this paper, we propose an algorithm that can be used to identify
people who share common interests on a social network, therefore forming communities
of shared interest. Detailed analysis of the result shows that a person can be
recommended to a community if more than 50% of their interests match with the other
members belonging to that community.
1 Introduction
The human race has seen a lot in its lifetime: the very early stone age, the bronze
age and the iron age, and then the industrial revolution that changed everything for
us. People were manufacturing textiles, weapons, cars, edible items, etc. From the
1950s we are said to be living in the digital age, and the time starting from 2001 is
specifically known as the Big Data Revolution. With the advent of the Internet came
the development of products that made our life a lot easier, such as Google Chrome,
Mozilla Firefox, Maps and Microsoft Office. We were then introduced to social networks
through a number of social media applications like Orkut, Facebook, Instagram,
WhatsApp, Snapchat, etc. Today most of us are surrounded by these applications and
cannot imagine our life without them. More than 2.34 billion people are connected to
each other using social media.
Today people live two lives simultaneously: one in which they interact with the real
world, and the other on social media. From here on we refer to the latter as digital
life. Social media has become an inseparable and increasingly important part of our
life, because of which it has garnered a lot of attention from the research community.
Quite a few studies have been conducted in the field of community identification, but
most of them deal with multidisciplinary topics. To the best of our knowledge, very
few scholars have tried to look into the semantics of the way people interact on
social media. In this paper we have tried to identify people who are most likely to
interact with each other based on their shared interests and thoughts. We call such a
group a community.
Community identification refers to the process of identifying people based on their
common interests [1, 2]. In this paper, we have proposed an algorithm that can be used
to identify communities. For this, we collected data using Google Forms. In this form
we posed a number of multiple-choice questions to our targeted group of people to
learn about their interests, views and likes. With the help of our friends and family,
we got responses from a whopping 540 people belonging to 17 states of India. Each
question formed an attribute of our dataset and each person became a tuple in it. We
have applied our algorithm on this collected dataset, which consists of 15 attributes
and 530 tuples.
In this algorithm, we have used a weighted community identification technique
wherein we ordered the attributes based on their effect on the identification of a
community and then assigned a weight to each of these attributes. The higher the
weight of an attribute, the higher its effect on identifying the community.
2 Literature Review
There are quite a few research papers that we found to be related to our work. These
papers have helped us to get a better understanding about communities and methods
that can be used for their identification.
In [3], the authors have emphasised the need for identifying communities in research
fields. They analysed the data and identified that 'Review of Modern Physics' has the
greatest number of citations. Here they tried to identify the most active and
influential nodes, for which they applied sociometric analysis.
In [4], the authors explain that Twitter is one of the most important tools for the
dissemination of information related to scholarly articles. In this paper, they
establish that Twitter is mostly used by non-academic users to discover information
and develop connections with scholars in order to gain access to their scholarly
materials.
3 Methodology
This drove us to use Google Forms and collect data directly from the people.
After running many pilot tests, we prepared a questionnaire consisting of 16 questions,
where every question was provided with adequate choices. An email field was included
among these questions to improve the credibility and reliability of the data. We made
the form live on 06/10/2018 and sent it to our friends. We personally called them and
requested them to fill the form, forward it to their friends and ensure that their
friends also fill the form. They readily obliged. Within a short period of time, we
received responses from 540 people. We gathered diverse data from people belonging to
17 different states of India. This data was exported in CSV format as shown in
Table 1.
The collected data was organised into 15 columns, i.e. attributes, where each
attribute corresponds to a question from the questionnaire, and 540 tuples, each
corresponding to the individual response recorded by a person. It was then converted
into a CSV file which was used as the final dataset on which the proposed algorithm
was tested.
The algorithm depends on the similarity index between two individuals X and Y to
determine whether they can be part of the same community or not. The similarity index
in turn depends on the match of the corresponding attribute values of X and Y. Weights
are assigned to each of these attributes to improve the results based on their
priority: the higher the priority of an attribute, the higher the weight assigned to
it, and vice versa.
The Algorithm
This Algorithm consists of the following steps:
4 Experimental Setup
As previously discussed, we have used Google form to collect information about the
interests, views and opinions of people. Some of the questions asked were:
1. Where do you live?
2. Which genre of music do you like?
3. Which landscape would you prefer for holidaying?
4. Which is the biggest problem India is facing?
The entire list of the questions posed to the respondents in the survey can be accessed
through the following google link: [Link]
Whenever an algorithm is proposed, it is very important that its validity is checked.
For this purpose, we developed a Python program based on the proposed algorithm and
executed it.
The setup we used is a platform having an Intel i5 processor with 4 GB RAM, Python
IDEs such as Jupyter Notebook and PyCharm, Java JDK 1.8.0_101 and Anaconda
Navigator.
The Python program is developed based on the proposed algorithm in Jupyter Notebook.
For a given input Id of the person for whom we want to recommend friends, this program
generates a list of people whose interests match more than 50% with those of the given
input. This list forms a community. Each person from this community can then be
recommended to the concerned person as a friend suggestion.
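The authors' program is not reproduced in this excerpt; the following is a minimal sketch of the idea it describes (weighted attribute matching with a 50% threshold). The column names, weights and data below are invented for illustration and are not the survey attributes themselves.

# Minimal sketch of the weighted-similarity recommendation described above.
# Not the authors' program: column names, weights and data are illustrative.
import pandas as pd

# Hypothetical weights: higher weight = stronger influence on community identity.
WEIGHTS = {"music_genre": 3, "holiday_landscape": 2, "indoor_game": 2, "state": 1}

def similarity(person_a, person_b):
    """Weighted fraction of attributes on which two respondents agree."""
    total = sum(WEIGHTS.values())
    matched = sum(w for attr, w in WEIGHTS.items()
                  if person_a[attr] == person_b[attr])
    return matched / total

def recommend(df, person_id, threshold=0.5):
    """Return ids of respondents whose weighted similarity exceeds the threshold."""
    target = df.loc[person_id]
    return [other_id for other_id, row in df.iterrows()
            if other_id != person_id and similarity(target, row) > threshold]

# Tiny illustrative dataset in place of the 540-response CSV used in the paper.
df = pd.DataFrame({
    "music_genre":       ["rock", "pop", "rock", "pop"],
    "holiday_landscape": ["beach", "hills", "beach", "beach"],
    "indoor_game":       ["carrom", "computer", "carrom", "computer"],
    "state":             ["AP", "AP", "AP", "Gujarat"],
}, index=[1, 2, 3, 4])

print(recommend(df, person_id=3))   # -> [1]: respondents recommended to user 3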
A sample output for a person with ID 3 is presented below.
5 Results
Fig. 1 Output generated after execution of the code given in Fig. 2 for a user id ‘3’
in favour of AP because of the aggressive publicity from our friends in that state.
Gujarat followed AP, contributing 13.8% of the total responses.
Figure 3 shows a pie chart of the responses recorded about the respondents' favourite
indoor game. Since the responses were collected from a relatively younger population,
it was unsurprising that the respondents were mostly interested in playing computer
and mobile games. Of the total people surveyed, 29.8% chose computer games as their
favourite indoor sport. Carrom was a close second, garnering 24.8% of the total
votes.
Figure 4 reveals the political views of the respondents. As shown, four options were
given to the people: BJP, Congress, regional parties and NOTA. As clearly seen in the
pie chart, most of the people surveyed chose NOTA. There can be two reasons for this:
they are either not interested in politics or did not feel connected to the political
ideology of these political parties. As more than half of the respondents were from
AP, Telangana and Tamil Nadu, it wasn't shocking that they prefer regional parties
over the two major national parties, i.e. BJP and Congress. This question was by far
the most important one for our survey, as it gives a glimpse of the opinions and
thoughts of the people.
In this study, we developed an algorithm which, for a given individual X, generates a
list of names forming a community.
6 Conclusion
Acknowledgements We, the authors of the research paper ‘Recommendation System using Com-
munity Identification’ hereby declare that the questions contained in the questionnaire were prepared
by us solely for the purpose of collecting the data. It was necessary to collect the data since none of
the open and free datasets available on the internet fulfilled our requirements. We also declare that
the data was collected through a google form by keeping the identity of the person anonymous. The
data does not contain any personal information of the 545 people who had filled the google form.
Statement of Consent We, the authors of the research paper ‘Recommendation System using
Community Identification’ hereby give our consent to ICICC Conference to publish our research
paper in their publication.
References
1. S.A. Moosavi, M. Jalali, N. Misaghian et al., Community detection in social networks using
user frequent pattern mining. Knowl. Inf. Syst. 51, 159 (2017). [Link]
016-0970-8
2. S. Sobolevsky, R. Campari, A. Belyi, C. Ratti, General optimization technique for high-quality
community detection in complex networks. Phys. Rev. E 90, 012811 (2014)
3. B.S. Khan, M.A. Niazi, Network Community Detection. Published in Arxiv (2017)
4. I.R. Fischhoff, S.R. Sundaresan, J. Cordingley, H.M. Larkin, M.-J. Sellier, D.I. Rubenstein,
Social relationships and reproductive state influence leadership roles in movements of plains
zebra (equus burchellii). Anim Behav. (2006) (Submitted)
Comparison of Deep Learning
and Random Forest for Rumor
Identification in Social Networks
Abstract The social life of individuals has changed with online social media. These
sites have brought extensive changes to the way people socialize. Facebook is the
world's largest and most popular Online Social Network (OSN), with more than a billion
users. It is also home to many kinds of hostile actors who misuse the site by posting
harmful or false messages. In recent years, Twitter and other microblogging sites have
gathered many millions of active users and have become a new stage for rumor
spreading. The problem of detecting rumors has therefore become more important,
especially in OSNs. In this paper, we apply different machine learning approaches,
namely Naïve Bayes, Decision tree, Deep learning and Random forest, for identifying
rumors. The experiments are carried out with the RapidMiner tool on everyday data from
Facebook. The rumor identification schemes are verified by applying fifteen
user-behaviour features to the Facebook data set to forecast whether a microblog post
is a rumor or not. From the experiments, precision, recall and F-score values are
calculated for all four machine learning algorithms, and these values are compared to
find the accuracy (%) of each. Our experimental results show that Random forest
provides an overall average precision of 97%, higher than the other comparative
methods.
1 Introduction
Nowadays, all over the world, people depend on online social networks (OSNs) to
exchange knowledge, thoughts, information, resources, experiences and private
communications. Social network analysis is the precise investigation and recording of
social interaction in small groups, work groups and particularly classrooms. In this
world of innovation, social networking sites have helped a great deal to bring the
world closer; however, alongside this they have produced a lot of new issues. For
instance, people on these platforms are targeted for spam, and the trust placed in the
platform may make the targets more liable to scams. As per a survey carried out by
CNN in 2012, there were 83 million fake Facebook accounts.
Along with OSM (Online Social Media), Twitter and other microblogging sites have
become more widespread [1]. These systems have developed rapidly, and the public uses
them as the main source to spread and obtain information in everyday activities.
Microblogging lets users exchange small pieces of content such as short messages or
sentences, individual images, video links or email links, which is a major reason for
their popularity. But researchers have investigated and discussed the spread of
rumors, scams or fake information on Facebook and on microblogging sites such as
Twitter [2], in the form of main sources such as text, images and video links.
Rumors mostly refer to information whose truth and origin are uncertain, which can
arise under emergency conditions, attracting public attention and damaging public
order [3]. In this paper we address the rumor problem by looking at how a rumor
publisher's activity differs from that of the rumor post itself and at the several
responses a post may receive [4].
We investigated the problem of rumor identification in a Facebook data set. A
user-behaviour-based rumor identification scheme is proposed, in which users'
activities are treated as hidden features that indicate which posts are likely to be
rumors. Our method for rumor identification contains two phases: (1) based on the
collected Facebook data and users' profiles, we extract the features of users' actions
from every Facebook post; overall, fourteen attributes of user behaviour are taken in
this paper; and (2) we apply several machine-learning algorithms to build the
classifiers for rumor detection. Generally, rumor detection consists of three models
using machine learning approaches, namely: (1) the Network Topology Based Model,
(2) the Information Diffusion Based Model and (3) the User Behavior-Based Model.
Among these, our proposed work is highly focused on the User Behavior Model.
Before getting into the problem, we give an introduction to the machine learning
algorithms. The entire paper deals with the Random forest algorithm, which comes under
ensemble learning. Deep learning is the emerging
2 Related Work
In this section, we examine the aspects of the different existing rumor detection
strategies.
Cao Xiao et al. proposed a scalable strategy for finding malicious users, rumors and
spam on OSNs. They presented supervised machine learning pipeline strategies to
classify clusters of accounts as malicious or genuine. The key features utilized in
this work are statistics on fields of user-generated content, for example name, email
address, company or college/university. These include both the frequencies of patterns
within the cluster and a comparison of text frequencies across the entire user base.
They evaluated both in-sample and out-of-sample data, demonstrating the strong
performance of this methodology [5].
Akshay J. Sarode et al. proposed an experimental framework which recognizes fake
profiles and fake rumors within the friend list, posts, links, pictures and videos.
This framework is confined to a particular online social networking site, namely
Facebook, from which it extracts information and uses it to classify profiles as
genuine or fake using unsupervised and supervised machine learning algorithms. The
proposed strategy has given an accuracy of around 98%, which can be useful for
recognizing fake profiles among the available profiles [6].
Sushila Shelke et al. studied and examined the source detection methodologies for
rumors or misinformation in social media networks. As a result, they present a
pictorial taxonomy of the factors to be considered in a source detection approach and
a classification of current source detection approaches in social media networks. The
focus is on different state-of-the-art source identification methodologies for rumors
or deception and on a comparison between methodologies in social media networks [7].
Akshi Kumar et al. presented an introduction to rumor detection on social media
networks which covers the fundamental terminology and types of rumors and the
conventional procedure of rumor detection. A state-of-the-art account of the use of
supervised machine learning (ML) algorithms for rumor detection on social media is
given. The key purpose is to take stock of the amount and type of work conducted in
the area of ML-based rumor detection on social media, and to recognize the research
gaps within the space [8].
Prateek Dewan et al. attempted to address the task of detecting malicious Facebook
pages by exercising various supervised learning algorithms on their dataset.
both SRDC and TRDC, features are divided into classes, and tweets are classified
according to those classes. In the experiment they used the WEKA platform for training
and testing their proposed method with the J48 classifier [14].
Zhao et al. worked on early rumor detection using terms such as 'false' or
'unconfirmed' to find questioning and denying tweets. RNN-based approaches learn
representations that are significantly more complex than these explicit signals; such
representations can capture the hidden meanings and dependencies over time [15].
From a generic point of view, ensemble classifiers provide better results irrespective
of the research field. For example, an ensemble classifier has provided a
classification rate of more than 95 percent, which was the best result in the
classification of stego images. Hemalatha et al. implemented a system based on an
efficient classifier, a multi-surface proximal support vector machine ensemble oblique
random rotation forest, which provided a detection rate superior to other existing
classifiers [16].
Our proposed work is based on rumor detection on a Facebook or Twitter dataset using
machine learning techniques, namely Naïve Bayes, Decision Tree, Deep Learning and
Random Forest. All supervised learning techniques work only on labelled datasets.
These techniques are implemented in the RapidMiner tool, a data science platform that
provides an integrated environment and supports the complete machine learning process.
This section presents the rumor detection algorithm for the Facebook dataset. The
overall framework of the proposed model is given in Fig. 1.
The detection performance depends on the features that are adopted. We extracted
eleven user-behaviour features from Facebook posts. There are large differences in the
usage patterns: readers respond differently when reading normal posts and rumor posts.
The dataset features for the rumor detection algorithms are given in Table 1.
3.3 Pre-processing
After the collection of the data set, we must pre-process it: the data are read, the
input is analyzed, and missing values in the data set are handled, for example by
removing records having null values, replacing missing numeric values by the mean
value, and removing nominal attributes having null values. The preprocessing is done
using the Discretization and Set Role operators, which are described below.
1. Discretize by Binning
This operator discretizes the selected numerical attributes into a user-specified
number of bins. Bins of equal range are automatically produced; the number of values
in different bins may differ. The discretization by binning is done on the values that
are within the stated boundaries.
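The paper performs this step with RapidMiner's Discretize by Binning operator; as an illustration of the same equal-width binning idea outside RapidMiner, a small pandas sketch follows (the column values are invented).

# Equal-width binning sketch (the paper uses RapidMiner's Discretize by Binning
# operator; this pandas equivalent is for illustration only).
import pandas as pd

shares = pd.Series([3, 15, 27, 40, 52, 118, 260])     # hypothetical share counts

# Five bins of equal value range; the counts per bin may differ, as noted above.
binned = pd.cut(shares, bins=5,
                labels=["range1", "range2", "range3", "range4", "range5"])
print(binned.value_counts())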
Table 1 (continued)
S. No. Features extracted Description
7. Overall engaged user: the number of distinct users who engaged with the post or
created a story about the page. Engagement counts stories created that were not the
result of a click within a post
8. Overall post consumptions: measures any click on content on the page, regardless of
whether it creates a story or not. The following are counted as consumption of a page
post: link clicks, photo views, video plays, post comments, likes, and shares
9. Overall post consumers: the unique users who clicked anywhere within the post or
page content. Clicks creating stories are included in other clicks
10. Liked page and engaged with post: includes anyone who visited the page, regardless
of the actions they took, such as liking the page or post, commenting on the page or
post, sharing the page or post with other users, or other actions such as subscribing
(for updates)
11. Total interactions: the total number of custom and standard events that are
triggered when a user interacts. It captures all of the feedback pages receive from
users. The goal of this metric is to provide an updated snapshot of how users are
engaging with post contents
2. Set Role
This operator is used to change the role of one or more attributes in the data set,
such as the id role, label role, batch role, weight role and cluster role. The role of
an attribute describes how other operators handle this attribute.
This algorithm uses random subsets of attributes. Its mechanism is exactly like that of
the Decision Tree operator, with one exception: for each split, only a random subset of
features is available. It learns decision trees from both numerical and nominal data.
The objective is to generate a classification model that predicts the value of the
label based on numerous input features.
The steps for finding the rumor are given in Table 2, and the corresponding flow
diagram is given in Fig. 2.
Fig. 2 Flow diagram of the proposed approach: the Facebook dataset is pre-processed
and a count is compared with the rumor threshold value; if the count is below the
threshold the post is labelled non-rumor, otherwise rumor
4 Experimental Results
Each microblog is represented by user-activity features, which are used to identify
whether the microblog is a rumor or not. Based on the different attributes, we apply
four classifier algorithms: Naïve Bayes, Decision Tree, Random Forest and Deep
Learning. To check the efficiency and general suitability of the eleven user features,
we examine the different feature values in Fig. 3. It can be seen from Fig. 4 that
there is a notable difference between rumor and normal posts for features such as link
and status, while there is no significant difference for features such as video and
photo.
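The experiments in the paper are run with RapidMiner operators; purely as an illustration of the same four-way comparison, a scikit-learn sketch is given below, with a small multilayer perceptron standing in for the Deep Learning operator and placeholder data in place of the real feature matrix and rumor labels.

# Illustrative scikit-learn sketch of the four-classifier comparison (the paper
# itself uses RapidMiner). X holds user-behaviour features, y the rumor labels;
# both are random placeholders here.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier   # stand-in for the Deep Learning operator
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.random((500, 11))                  # placeholder for 11 user-behaviour features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # placeholder rumor / non-rumor labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Deep Learning (MLP)": MLPClassifier(hidden_layer_sizes=(32, 16),
                                         max_iter=500, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    p, r, f, _ = precision_recall_fscore_support(y_te, model.predict(X_te),
                                                 average="binary")
    print(f"{name}: precision={p:.3f} recall={r:.3f} f-score={f:.3f}")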
To appraise the performance of the method used in this paper, precision, recall and
F-score are used as evaluation metrics. The precision 'P' is the fraction of correctly
determined rumor microblogs among all the microblogs detected as rumors. Recall 'R' is
the fraction of correctly determined rumor microblogs among all the actual rumor
microblogs. The F-score is defined as the harmonic mean of recall and precision. The
calculation of precision, recall and F-score is defined as follows:
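The formulas themselves did not survive the page layout; the standard definitions consistent with the description above are:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2 \times P \times R}{P + R}

where TP is the number of rumor microblogs correctly detected as rumors, FP the number of non-rumor microblogs wrongly detected as rumors, and FN the number of rumor microblogs missed by the detector.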
5 Conclusion
Online social media and microblogging are great platforms to broadcast information
among active users, but some bloggers use them in the wrong way by spreading fake
information (rumors) on these sites. In this paper, we inspect the rumor detection
issues in the microblog framework. To identify rumors and differentiate them from
other posts, we applied machine learning algorithms, namely Decision Tree, Naïve
Bayes, Deep Learning and Random Forest. The algorithms are examined on a Facebook data
set, in which there is a bar comparison between various attributes such as page total
likes, type, category, post month, etc. Users' behaviour differs between rumors and
original posts. The performance of the evaluation metrics for all four algorithms was
recorded, and we conclude that Random Forest produced higher values for precision,
recall and F-score and higher accuracy as compared to the other algorithms.
References
15. Z. Zhao, P. Resnick, Q. Mei, Enquiring minds: early detection of rumors in social media from
enquiry posts, in Proceedings of WWW (2015)
16. H. Jeyaprakash, M.K. Kavitha Devi, S. Geetha, A comparative review of various machine
learning approaches for improving the performance of stego anomaly detection, in Handbook
of Research on Network Forensics and Analysis Techniques (IGI Global), pp. 351–371
Ontological Approach to Analyze
Traveler’s Interest Towards Adventures
Club
Toseef Aslam, Maria Latif, Palwashay Sehar, Hira Shaheen, Tehseen Kousar
and Nazifa Nazir
Abstract Tourism has now become an industry, and many adventure clubs offer different
adventure trips for travelers. So there is a need for a system for travelers that
respects their preferences and gives them all the needed information in one place. We
propose a new travel recommender system for adventurers, which provides a well-defined
solution to user demands. It is a "knowledge-base system", a smart recommender system
with domain-specific ontology. Queries can be constructed in natural language and a
related query management strategy is developed. The solution space is searched from
two perspectives: user demand and offer relevance. Many systems work from this
perspective, but in Pakistan no such system exists that takes user choices into
account. Pakistan is a country rich in adventure tourism and a good choice for
international adventurers, so there is a need for such a system based on user interests
and their desired information. In the proposed system, we deal with the main problems
that adventurers face in terms of information search and decision-making processes,
according to the domain ontology.
1 Introduction
Nowadays, users can find a lot of information on the Internet. That is why sometimes
it becomes a hard and complex task to select the information a user is interested in.
The user is often unable to look through all the available information. Therefore,
highly interesting information can get lost in the middle of a sea of data. Users can
have access to a huge deal of data and information related to a specific place and
adventure activities but surely they will prefer to filter that information and get those
elements or activities that match their particular interests.
Traveler’s these days are getting more used to turn to new technologies when
planning a trip. This reality can be explained by the fact that the Internet is part of
our daily life. For this purpose, several adventure clubs and companies that offer
varied touristic and adventure information about the destination have been set up.
But most of the times adventurers are failed to search their desired destination; they
also failed to search the unique location for their favorite adventure. They want to
discover new locations and more fun in adventure but most of the information repeat
and they cannot find their desired information.
For the purpose of improving traveling experiences, we propose a new travel recommender
system for travelers, which provides a well-defined solution to user demands. It is a
"knowledge-base system", a smart recommender system with domain-specific ontology.
Ontological systems supply personalized information to users. We use an ontological
approach because an ontology is basically a set of relations between different
instances, and we use this approach to enhance our results. In other words, the system
selects the most suitable options from a large list of offers by taking the user's
profile and interests into account.
2 Existing Systems
The already existing systems and research on traveler ontologies are mainly for
destinations or for search optimization. The papers we discuss are explained briefly,
along with the problems identified and the solutions they give. Lemnaru et al. [1]
have done work on a case-based reasoning ontology system. They propose an
ontology-based system for users' demands. It is a hybrid system with a domain-based
ontology. Queries are constructed in natural language or from templates. The system
considers two dimensions: travel description and travel solution. Hawalah et al. [2]
have done work on a semantic knowledge-base ontology system. The system uses a user
ontology file for the recommendation of the trip. Tomai et al. [3] draw from previous
work on trip planning in the context of web services; the system is an ontology-based
Web portal for tourism. Choi et al. [4] recommend a travel ontology based on the
Semantic Web. They use OWL to build their ontology, and the system gives users an
opportunity to choose their preferences. Hua-li et al. [5] note that while a trip is in
progress, sudden events may force travelers to completely reschedule, so they decide to
plan and search for options. In order to facilitate semantic matching between
alternative touristic sites and the user's context, a particular vocabulary for the
tourism domain, user type, time and location is required. The paper shows that existing
tourism ontologies can scarcely satisfy this objective, as they basically concentrate
on domain concepts. Bahramian et al. [6] observe that tourists have time and budget
limitations and problems in selecting points of interest. The available information is
overloaded, and it is difficult for a tourist to select the most appropriate options
considering their preferences. A content-based recommendation framework is proposed,
which utilizes the information about the user's choices and computes a degree of
matching between them, returning the items with the most closeness to the user's
choices. The proposed content-based recommender framework is improved using
ontological information about tourism spots to match both the user profile and the
recommendable items. de Lange et al. [7] note that many people use the Internet to book
their trips, and the evolution of the Internet has made trip booking much easier. In
this work they identify the common sets of concepts that appear in online
traveling-related websites. Based on the findings, they developed an ontology which
represents this common set of concepts and their relations. Maedche et al. [8] discuss
how the difference between the two boundaries might be limited; the objective is to
semantically interconnect currently disconnected snippets of information so as to
lessen the burden on the user of finding and understanding them. Missikoff et al. [9]
portray three essentials of the harmonization effort: interoperability, ontologies and
mediators, and also draw a vision of a future electronic tourism market based on these
essentials. Mar et al. [10] propose a strategy for extracting semantic content from
textual web documents to automatically instantiate a domain ontology. Karoui et al.
[11] deal with the automation of the ontology building process from HTML pages; the
suggested methodology is based on the complementary use of two approaches.
Ananthapadmanaban et al. [12] investigate building a refined user profile ontology
which can improve the process of searching for the ideal tourism package by analysing
the user's interest with the help of a user ontology for tourism; they have created a
user profile ontology through which the tourist's specific area of interest can be
deduced. Moreno et al. [13] present a numerical assessment of relationships for an
ontology framework. The recommended framework has been fully designed and implemented
in the Science and Technology Park of Tourism and Leisure. Jakkilinki et al. [14]
present the underlying structure and operation of a semantic web-based intelligent
tour planning tool. Prantner et al. [15] present a tourism ontology and semantic
management framework; they report some preliminary results of the OnTourism project.
The results presented in their paper identify publicly available tourism ontologies
and existing freely available ontology management tools for the tourism domain.
Cardoso [16] discusses one important kind of e-tourism application that has surfaced
lately: dynamic packaging systems. Knublauch [17] works on some initial considerations
on software architecture and an improved approach for Web services and agents for the
Semantic Web; this architecture is driven by formal domain models. Barta et al. [18]
present an approach to covering the semantic space of tourism based on modularized
ontologies.
3 Our System
We have discussed 18 such systems that are purely related to tourism and adventure
trip ontologies. All the systems were mainly related to search optimization and
ontology approaches. Some were for online booking and some for gathering desired
information about destinations and adventure activities. Our proposed system is based
on an ontology. In a recommender system, the ontological domain permits the
classification of the objects to be suggested. In our recommender system, we consider
that each object is an instance of one (or several) of the lowest level classes of the
ontology and we use the ontological domain to represent the user's preferences. In
this sense, concepts are represented as subsets of the domain in which users can be
interested, considering that the degree of interest can be different for each concept.
Our suggested system is based on adventurers' interests and needs. We get feedback
from the travelers and create a suggested interface for adventure clubs. Our suggested
interface gets 20% better results than other existing systems.
4 Methodology
Unlike other systems developed using the ontology approach, which do not focus on the
user's interests and likes, this research recommends a system that overcomes this
problem. We focus on user interest and desire. A feature common to the vast majority
of recommender systems, and ours is no exception, is the use of profiles that represent
the needs of information and the interests of the users. In this way, user profiles
become a key piece of recommender systems in order to obtain efficient filtering.
Inadequate profile modeling can lead to poor quality and barely relevant
recommendations for the users. As described above, in our recommender system the
ontological domain permits the classification of the objects to be suggested: each
object is an instance of one (or several) of the lowest level classes of the ontology,
the ontological domain represents the user's preferences, and concepts are represented
as subsets of the domain in which users can be interested, with a possibly different
degree of interest in each concept. We propose a solution to this problem. For this,
we conducted a questionnaire with users and obtained different results. We used these
results to build an ontology-based user interface suggestion system for travelers as
well as for adventure clubs. The main factors we used for our system are the user,
interest in different activities, offers, location, and free services provided by
adventure clubs and tourism organizations. Our system gives a proper result after
checking these parameters. For making the ontology we used the Protégé tool and made
relations between different entities to get the required results. Our system is smart
enough to get optimized results. It will help users and refer an appropriate trip plan
and offers according to their interests.
We conducted an online survey and got the results after evaluating choices from an
online form created on Google. Our participants knew the purpose of our survey and
fully agreed to share the results of their answers. Forms were filled in by users from
different parts of society. The survey was conducted on different persons, and we then
converted the numerical results into graphical form. We first created a questionnaire
for the survey, since we wanted to know about users' preferences and desired interests.
With the help of the questionnaire, we got information about user interest. This is
the basis of our research and of the recommended online web system. By analyzing the
questionnaire, we obtained the latest traveler trends towards adventure clubs and the
parameters needed to develop our algorithm and recommender system for enhancing users'
interesting information and facilities. We also obtained the preference parameters
which users like most. We also use the following.
5 Algorithm
We created an algorithm on the basis of the data we collected from our survey. In
order to understand the algorithm, the following parameters need to be known. Table 1
contains the names of the parameters used in the algorithm and the values they will
hold. Their data types are also given below.
This table consists of the names of the parameters and the data types in which their
values are stored. The description is provided for the parameters in the context of
the traveler as well as the adventure club. Separate parameters are not used, as that
would consume extra space and would not be efficient. The name of the proposed
algorithm is "Optimized adventure club" and it contains different parameters on which
the function is performed to obtain the required results. The following abbreviations
are used in the algorithm.
6 Declaration Algorithm
• Declaration:
Int A = ɸ
Boolean TGS = ɸ
Boolean PTA = ɸ
String G = NULL, P = NULL, TMI = NULL, EMO = NULL, MC = NULL, KA = NULL;
• Input (T):
Retrieve (G, A, P, TMI, MC);
• Processing:
Search String = G ∧ A ∧ P ∧ TMI ∧ MC
• Suggestion Function:
String Function Suggestion ()
{
IF (T.G=AC.G ∧ T.A=AC.A ∧ T.TMI=AC.TMI ∧ T.P=AC.P ∧ T.MC=AC.MC
∨ T.TGS=AC.TGS ∨ T.EMO=AC.EMO ∨ T.PTA=AC.PTA ∨ T.KA=AC.KA)
Return Suggestions;
ELSE IF (T.G=AC.G ∧ T.A=AC.A ∧ T.TMI=AC.TMI ∧ T.P=AC.P ∧ T.MC=AC.MC
∧ T.TGS=AC.TGS ∧ T.EMO=AC.EMO ∨ T.PTA=AC.PTA ∨ T.KA=AC.KA)
Return Suggestions;
ELSE IF (T.G=AC.G ∧ T.A=AC.A ∧ T.TMI=AC.TMI ∧ T.P=AC.P ∧ T.MC=AC.MC
∧ T.TGS=AC.TGS ∧ T.EMO=AC.EMO ∧ T.PTA=AC.PTA ∨ T.KA=AC.KA)
Return Suggestions;
ELSE IF (T.G=AC.G ∧ T.A=AC.A ∧ T.TMI=AC.TMI ∧ T.P=AC.P ∧ T.MC=AC.MC
∧ T.TGS=AC.TGS ∧ T.EMO=AC.EMO ∧ T.PTA=AC.PTA ∧ T.KA=AC.KA)
Return Suggestions;
}
The algorithm describes how the search is optimized to give the best results to the
traveler. In the declaration phase, parameters are declared and nullified so that they
do not contain any garbage value. In the 2nd, 3rd and 4th steps, data is retrieved from
the user and from the adventure club database, and values are assigned to the variables
according to Table 2.
In the processing phase, a search string is declared which makes clear that the
profession, tourist main interests and event to be organized must be given by the
traveler in order to continue the search; these are later compared with the values
taken from the adventure club for the same variables, i.e. profession, tourist main
interests
Table 2 Description of algorithm
Name                           Type
Age                            A
Profession                     P
Gender                         G
Tourists main interests        TMI
Event must be organized        EMO
Mode of communication          MC
Tourists guide services        TGS
Preference of travel agency    PTA
Kind of accommodation          KA
and event must be organized. The function "Suggestion" keeps comparing the values
given by the traveler for the parameters described in the table with the values taken
from the adventure club database for the same parameters until the results improve and
satisfy the requirements of the tourist.
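One way to read the IF/ELSE IF cascade above is that the search-string parameters must all match while at least one of the remaining parameters should match. A minimal Python sketch of that reading is given below; the record structures and data are invented, and the field names follow the abbreviations in Table 2.

# Sketch of the Suggestion function described above (illustrative only; the
# parameter abbreviations follow Table 2, and the records are invented).
MANDATORY = ["G", "A", "P", "TMI", "MC"]          # the search-string parameters
OPTIONAL  = ["TGS", "EMO", "PTA", "KA"]           # at least one should match

def suggest(traveler, adventure_clubs):
    """Return clubs whose offer matches all mandatory and any optional parameter."""
    suggestions = []
    for club in adventure_clubs:
        if all(traveler[k] == club[k] for k in MANDATORY) and \
           any(traveler[k] == club[k] for k in OPTIONAL):
            suggestions.append(club["name"])
    return suggestions

traveler = {"G": "male", "A": 25, "P": "student", "TMI": "hiking",
            "MC": "email", "TGS": True, "EMO": "cultural", "PTA": True, "KA": "camping"}

clubs = [
    {"name": "Northern Trails", "G": "male", "A": 25, "P": "student", "TMI": "hiking",
     "MC": "email", "TGS": True, "EMO": "social", "PTA": False, "KA": "hotel"},
    {"name": "Desert Caravan", "G": "male", "A": 25, "P": "student", "TMI": "rafting",
     "MC": "email", "TGS": False, "EMO": "cultural", "PTA": True, "KA": "camping"},
]

print(suggest(traveler, clubs))    # ['Northern Trails']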
We designed an ontology using the Protégé 4.3 tool on the basis of the algorithm
that we designed. Different classes, subclasses, Data properties, and object prop-
erties were created in that ontology. The first tab we start with is the Classes tab.
Classes are the major building blocks (“nouns”) within our ontology. Classes and
subclasses are created for structuring the algorithm that we designed. The subclasses
of adventure club names are events, guide services, kind of accommodation and
travel agency. The subclass “event” has further three classes: social events, adven-
turous and cultural events. The subclass “kind of accommodation” has also further
three subclasses: camping, hotels, and resorts. The class user has several subclasses:
Age, Gender, Interests of the user, the mode of communication the user prefers, and
the profession of the user. The subclass "gender" has two further classes: male and
female. The subclass "mode of communication" has three further classes: via e-mail,
via text message, and via phone call. The subclass "profession" has three further
classes related to the user's profession: Employee, Businessman, and Student.
Object properties define the relations (predicates) between two objects (also called
individuals) in an OWL ontology. They are used to construct links between classes and
subclasses. The domain and range contain the names of classes and subclasses.
Properties are also further subcategorized.
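The ontology in the paper is built interactively in Protégé 4.3; purely as an illustration, part of the class hierarchy and one object property described above could be declared programmatically in Python with the owlready2 library (the library choice and the IRI are assumptions, not the authors' workflow).

# Illustrative Python/owlready2 sketch of the class hierarchy described above
# (the paper builds the ontology in Protégé; the IRI is an example).
from owlready2 import get_ontology, Thing, ObjectProperty

onto = get_ontology("http://example.org/adventure_club.owl")

with onto:
    class User(Thing): pass
    class AdventureClub(Thing): pass

    # Subclasses of the adventure club named in the text
    class Event(AdventureClub): pass
    class GuideServices(AdventureClub): pass
    class KindOfAccommodation(AdventureClub): pass
    class TravelAgency(AdventureClub): pass
    class Camping(KindOfAccommodation): pass
    class Hotels(KindOfAccommodation): pass
    class Resorts(KindOfAccommodation): pass

    # An object property linking a user to the events he or she is interested in
    class interestedIn(ObjectProperty):
        domain = [User]
        range = [Event]

onto.save(file="adventure_club.owl")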
8 Ontology Graph
The Protégé tool is used to automatically create classes and subclasses and link them
with each other.
Figure 1 displays the detailed model of the user class and the Adventure club class.
The connecting lines show the links between classes and subclasses.
9 Database Architecture
Figure 2 shows the system architecture. The names of desired destinations and offers
that are appropriate for users according to their preferences are shown to users on
the output device.
10 Prototype Interface
The system also stores the degree of confidence in each concept. This degree of
confidence will depend on the evidence received from the user.
11 Results
to use and understand. For the last question, "Do you think that our interface will
provide an exact match to an individual's query?", most of the persons chose neutral.
All results show that we improve the results from the perspective of the user's
interest. The basic purpose of this research, a focus on user interest, is achieved.
12 Conclusion
In the modern era, people's attraction towards adventure tourism (discovering places)
has widely increased. In existing studies, one of the problems associated with tourism
is defined as finding the most suitable places matching an individual's interests. For
this purpose, many research efforts have been carried out in which recommendation-based
search systems are proposed. In these systems, input from the user about their
preferences while traveling is extracted and the most suitable destinations are
suggested. These solutions still have certain limitations, such as considering fewer
parameters of the user's traveling interest, and they do not fit directly with the
context of Pakistan. So in this research we have proposed a recommendation system
using an ontology-based approach, in which we first gathered information about
travelers' interest in selecting their destinations using questionnaires. Then, based
on the questionnaire results, the most important parameters were extracted, through
which we designed an algorithm using the concept of an ontological approach. We then
also designed an interface prototype as per the proposed algorithm. To evaluate the
usability, accuracy and efficiency of our interface, we performed a controlled
experiment in which a questionnaire was filled in by users to identify their
experience of our designed prototype. Results show that all parameters of evaluation
are satisfied and the recommendations for destinations are found suitable for the
individual's interest.
Acknowledgements Pakistan is a beautiful country with great scenery, and the government of
Pakistan wants to enhance the tourism industry in the country, so there is a special need for a
system which fulfills users' requirements. We conducted a survey of 120 persons, collected their
choices, created a system, and then conducted a further survey to give detailed results. In this
study, Khizar Hameed and Junaid Haseeb guided us and checked our results. All our participants
knew the purpose of our study and were happy to participate in our study project. We present our
study with the help of the Department of Computer Science, University of Lahore.
References
1 Introduction
A reliable scheme that can uniquely identify individuals is only possible by the use
of biometrics. Biometric traits are mainly categorized into two categories:
An FHE uses the signature's graphic form, measurable elements of the signature such as
the distance between letters, the angles of strokes and the size of loops, and a
microscopic stereoscope to analyze the signature. In recent years, deep neural networks
have been used for image recognition. These networks require big data, more storage
and a GPU for computation. For offline signature verification, no database with large
numbers of samples is available. In this paper, an attempt has been made to analyse
the performance of Bangla and Hindi signature verification using only four features
(i.e., average object area, mean, Euler number and area of the signature image).
Signature verification is affected by the number of signers and the number of samples
available for each signature, as shown in the experimental section of the paper.
The paper is organized as follows: Sect. 1 describes the Introduction. Section 2
presents Related Work. Section 3 presents the Methodology. Section 4 describes the
Experimentation. Section 5 presents conclusion.
2 Related Work
3 Methodology
In our research work, Bangla and Hindi signatures from BHsig2601 are used. The sizes
of the images in the database are not the same, so we resize all images to 96*96.
Samples of three users from the BHsig260 dataset are shown in Fig. 1.
Combinations of four features (Average Object Area, Mean, Area and Euler No.) are
extracted from the image difference of Genuine-Genuine signature pairs and of
Genuine-Forged signature pairs in Matlab 2015 [13]. The variation of the feature
values of genuine signatures of one user for different samples is shown in Table 2.
Table 3
Fig. 1 Genuine and forged signature samples in Bangla and Hindi
1 [Link]
Table 3 Genuine-forged signature pair (features) for Bangla
Average object area    Mean        Euler No.    Area
15.35849               0.088325    −9           870
14.55172               0.09158     4            897.375
20                     0.088976    −10          869.25
13.10714               0.079644    1            781.5
19.85106               0.101237    −3           988.5
13.63636               0.097656    −5           958.25
14.27419               0.096029    −7           942.75
11.02817               0.084961    15           833.75
11.87324               0.091471    3            895.5
13.95                  0.09082     −6           895.75
15.05263               0.093099    0            910.875
14.28571               0.086806    10           848.75
13.27778               0.103733    6            1014.125
14.21311               0.094076    −8           924.375
13.62319               0.101997    2            997.25
15.20968               0.102322    −4           1006
15.44828               0.097222    −8           952.625
13.66102               0.087457    8            857
12.76563               0.08865     1            873
shows the variation of the feature values of forged-genuine signature pairs of one
user for different samples.
3.3 Classifier
Three classifiers are used, i.e., SVM, KNN, and Boosted Tree.
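The feature extraction and classification in the paper are done in MATLAB; purely as an illustration, an equivalent sketch in Python (scikit-image and scikit-learn) is shown below, with GradientBoostingClassifier standing in for the Boosted Tree and random placeholder images in place of the prepared BHsig260 difference images.

# Illustrative Python sketch of the four features and three classifiers described
# above (the paper's implementation is in MATLAB; images and labels are placeholders).
import numpy as np
from skimage.measure import label, regionprops
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier   # stand-in for Boosted Tree

def signature_features(binary_img):
    """Average object area, mean, Euler number and area of a binary image."""
    regions = regionprops(label(binary_img))
    area = float(binary_img.sum())
    avg_object_area = area / max(len(regions), 1)
    euler = sum(r.euler_number for r in regions)
    return [avg_object_area, binary_img.mean(), euler, area]

# Placeholder data: random binary "difference images" with genuine/forged labels.
rng = np.random.default_rng(0)
images = rng.random((40, 96, 96)) > 0.9
labels = rng.integers(0, 2, size=40)

X = np.array([signature_features(img) for img in images])

for name, clf in [("SVM", SVC()),
                  ("KNN", KNeighborsClassifier()),
                  ("Boosted Tree", GradientBoostingClassifier())]:
    clf.fit(X[:30], labels[:30])
    print(name, "accuracy:", clf.score(X[30:], labels[30:]))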
4 Experimental Setup
Figures 2 and 3 show that the accuracy of SVM is high compared to Boosted Tree and
KNN, and that the accuracy for Hindi is greater than for the Bengali signatures.
Signature verification is affected by the number of signers: with more signers the
accuracy decreases in both cases, i.e., Bengali signatures and Hindi signatures.
Table 4 (continued)
User SVM KNN Boosted tree
15 91.3 93.5 95.7
16 67.4 71.7 76.1
17 84.8 78.3 78.3
18 60.9 65.2 78.3
19 84.8 87 76.1
20 89.1 100 45.7
21 82.6 84.8 89.1
22 93.5 97.8 89.1
23 54.3 65.2 71.7
24 87 87 91.3
25 97.8 97.8 45.7
26 84.8 84.8 84.8
27 82.6 87 84.8
28 69.6 67.4 80.7
29 80.4 87 78.3
30 80.4 76.1 82.6
31 91.3 97.8 45.7
32 82.6 84.8 87
33 89.1 91.3 67.4
34 54.3 63 54.3
35 76.1 80.4 87
36 93.5 97.8 89.1
37 45.7 37 52
38 63 56.5 65.2
39 73.9 78.3 80.4
40 65.2 80.4 76.1
41 91.3 95.7 95.7
42 73.9 73.9 78.3
43 87 84.8 73.9
44 82.6 80.4 82.6
45 82.6 78.3 80.4
46 100 100 45.7
47 73.9 78.3 78.3
48 76.1 80.4 82.6
49 78.3 89 93.5
50 82.6 80.4 73.9
Table 5 (continued)
User SVM KNN Boosted tree
38 78.3 76.1 78.3
39 78.3 84.8 80.4
40 84.8 91.3 45.7
41 76 91.3 45.7
42 52.2 60.9 63
43 73.9 76.1 71.7
44 71.7 71.7 80.4
45 69.6 71.7 71.7
46 43.5 50 66.9
47 41.3 41.3 41.3
48 65.2 52.2 63
49 84.8 84.8 82.6
50 84.8 80.4 84.8
5 Conclusion
Acknowledgements The authors are grateful to the anonymous reviewers for their constructive
comments which helped to improve this paper.
References
1. K. Kumari, V.K. Shrivastava, Factors affecting the accuracy of automatic signature verification,
IEEE (2016)
2. K. Kumari, V.K. Shrivastava, A review of automatic signature verification, in ICTCS (2016)
3. K. Kumari, S. Rana, Writer-independent off-line signature verification. Int. J. Comput. Eng.
Technol. (IJCET). 9(4), 85–89 (2018)
4. K. Kumari, S. Rana, Offline signature verification using intelligent algorithm. Int. J. Eng.
Technol. [S.l.]. 7(4.12), 69–72 (2018). ISSN 2227-524X
5. S. Pal, A. Alaei, U. Pal, M. Blumenstein, Off-line Bangla signature verification: an empirical
study, in The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE (2013),
pp. 1–7
6. S. Pal, V. Nguyen, M. Blumenstein, U. Pal, Off-line Bangla signature verification, in 2012 10th
IAPR International Workshop on Document Analysis Systems (DAS), IEEE (2012), pp. 282–286
7. S. Pal, M. Blumenstein, U. Pal, Hindi off-line signature verification, in 2012 International
Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE (2012), pp. 373–378
8. S. Pal, A. Alaei, U. Pal, M. Blumenstein, Performance of an off-line signature verification
method based on texture features on a large indic-script signature dataset, in 2016 12th IAPR
Workshop on Document Analysis Systems (DAS), IEEE (2016, April), pp. 72–77
9. S. Dey, A. Dutta, J.I. Toledo, S.K. Ghosh, J. Lladós, U. Pal, SigNet: convolutional siamese
network for writer independent offline signature verification. arXiv preprint arXiv:1707.02131
(2017)
10. B.S. Thakare, H.R. Deshmukh, A combined feature extraction model using SIFT and LBP for
offline signature verification system, in 2018 3rd International Conference for Convergence in
Technology (I2CT ), IEEE (2018), pp. 1–7
11. M.B. Yilmaz, O.Z.T. Kagan, Hybrid user-independent and user-dependent offline signature
verification with a two-channel CNN, in 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW ), IEEE (2018), pp. 639–6398
12. V.L. Souza, A.L. Oliveira, R. Sabourin, A writer-independent approach for offline signature ver-
ification using deep convolutional neural networks features, in 2018 7th Brazilian Conference
on Intelligent Systems (BRACIS), IEEE (2018), pp. 212–217
13. [Link]
Fibroid Detection in Ultrasound Uterus
Images Using Image Processing
Abstract Uterine fibroids are abnormal growths in the wall of the uterus. The presence of fibroids in the uterus can lead to infertility. Ultrasound imaging is a significant tool for detecting uterine disorders. Extracting fibroids from ultrasound scanned images is a challenging task because of their variable size, weakly detectable boundaries and varying positions. Segmentation of ultrasound images is also difficult because of speckle noise. This paper presents a method to segment uterine fibroids from ultrasound scanned images. The method uses several mathematical morphology concepts to detect the fibroid region, segments the fibroid and extracts some shape-based features.
1 Introduction
The uterus is part of the human female reproductive system. A normal uterus is about 7.5 cm (3 in.) long, 5 cm (2 in.) wide and 2.5 cm (1 in.) deep. Inside, it is hollow with thick muscular walls. Benign (non-cancerous) tumours that grow in the wall of the uterus are called 'fibroids'. These are also known as uterine myomas, leiomyomas or fibromas. On average, between 20 and 50% of women of reproductive age have fibroids, although not all are diagnosed.
Types of fibroid: Fibroids can be classified according to their position in the uterus
or womb:
K. T. Dilna
Department of ECE, College of Engineering and Technology, Payyanur, India
D. Jude Hemanth (B)
Department of ECE, Karunya University, Coimbatore, India
e-mail: judehemanth@[Link]
2 Related Works
Shivakumar K. et al. used a GVF snake method for the segmentation of fibroids in uterus images [1]. N. Sriraam et al. describe an automated detection of uterine fibroids using wavelet features and a neural network classifier [2]; a feedforward backpropagation neural network (BPNN) classifier is used for segmentation. Yixuan Yuan et al. proposed a novel weighted locality-constrained linear coding (LLC) method for uterus image analysis [3]. Leonardo Rundo et al. developed a semi-automatic approach based on a region-growing segmentation technique [4]. Bo Ni et al. used a dynamic statistical shape model (SSM)-based segmentation method [5]; efficiency and stability are the focal areas of this method. Alireza Fallahi et al. used a two-step method for image analysis [6]: in the first step, the uterus is segmented using FCM and morphological operations are applied, and in the second step, a fuzzy algorithm is used to refine the result. Divya used a generalized multiple-kernel fuzzy C-means (FCM) framework for image segmentation problems [7], in which a linear combination of multiple kernels is proposed and the updating rules for the linear coefficients of the composite kernel are derived. T. Ratha Jeyalakshmi et al. provide mathematical morphology-based methods for automated segmentation [8]. Similar ultrasound uterus image analysis methods are available in [9–11].
3 Methodology
The proposed method has the following steps: preprocessing, segmentation and feature extraction (Fig. 1).
The input image is an ultrasound scanned fibroid image. Uterus ultrasound scanning is carried out either abdominally or transvaginally. Evaluation of ultrasound images can differentiate cysts from fibroids (solid tumours), but it cannot accurately determine the number, size or position of fibroids. Manual evaluation of ultrasound images is very difficult because of their high resolution and the large number of image slices (Fig. 2).
3.2 Preprocessing
The filtered ultrasound image, which is free from speckle noise, is used as the input for fibroid detection. The steps followed in the proposed system are as follows; an illustrative code sketch of these steps is given after the list.
Step 1: The input image is transformed into a binary image based on a threshold, which is calculated from the mean value m of the pixel values in the image.
Step 2: Take the complement of the binary image obtained in Step 1.
Step 3: Morphological operations are carried out on the image.
Step 4: Calculate the area of each image region and detect the maximum-area region by measuring the properties of the image regions.
Step 5: Find the product of the maximum-area region and the image obtained in Step 3.
Step 6: Extract the image area from the binary image by specifying the size.
Step 7: A morphological erosion is carried out on the extracted image area.
Step 8: Multiply the input image with the output image from Step 7, which yields the fibroid region.
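A minimal Python sketch of Steps 1–8, written with scikit-image, is given below; it is an illustrative reading of the steps, and the structuring-element sizes and the minimum-region size are assumptions rather than values reported in the paper.

import numpy as np
from skimage import io, img_as_float
from skimage.measure import label, regionprops
from skimage.morphology import binary_closing, binary_erosion, disk, remove_small_objects


def segment_fibroid(image_path):
    # The filtered (speckle-free) ultrasound image is the input
    img = img_as_float(io.imread(image_path, as_gray=True))

    # Step 1: binarize using the mean intensity m as the threshold
    m = img.mean()
    binary = img > m

    # Step 2: complement of the binary image
    complement = ~binary

    # Step 3: morphological operations on the complemented image
    cleaned = binary_closing(complement, disk(3))

    # Step 4: measure region properties and keep the maximum-area region
    labels = label(cleaned)
    largest = max(regionprops(labels), key=lambda r: r.area)
    max_area_mask = labels == largest.label

    # Step 5: product of the maximum-area region and the Step-3 image
    candidate = max_area_mask & cleaned

    # Step 6: extract the image area by specifying a size (assumed threshold)
    candidate = remove_small_objects(candidate, min_size=200)

    # Step 7: morphological erosion of the extracted area
    eroded = binary_erosion(candidate, disk(2))

    # Step 8: multiply the input image by the mask, which yields the fibroid region
    fibroid_region = img * eroded
    return fibroid_region, eroded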
The features used in this work are shape based, namely the area, diameter, accuracy, perimeter, eccentricity, major axis and minor axis extracted from the ultrasound fibroid images. The extracted features help in identifying the size of the fibroids: in a typical case, smaller fibroids can be treated with medicines, whereas larger ones require surgery. A detailed explanation of these features is available in the literature.
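As an illustration of how such shape-based features could be computed from the segmented fibroid mask, the following sketch uses scikit-image regionprops; the property names follow the scikit-image API and are an assumption about how the paper's features map onto that library.

from skimage.measure import label, regionprops


def shape_features(mask):
    regions = regionprops(label(mask))
    if not regions:
        return None
    r = max(regions, key=lambda p: p.area)   # assume the largest region is the fibroid
    return {
        'area': r.area,
        'diameter': r.equivalent_diameter,    # diameter of a circle with the same area
        'perimeter': r.perimeter,
        'eccentricity': r.eccentricity,
        'major_axis': r.major_axis_length,
        'minor_axis': r.minor_axis_length,
    }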
The proposed algorithm was applied to many uterus images with fibroids on the inner wall of the uterus. The algorithm works well and gives good results. The filtered image is shown in Fig. 3, and the results of the algorithm on two different images are shown as sample outputs in Fig. 4; the corresponding original images are shown in Fig. 2.
Table 1 displays the results of the feature extraction. It can be noted that the sizes of the fibroids differ from patient to patient, and treatment planning is based on the size of the fibroids.
5 Conclusion
References
1. S.K. Harlapur, R.S. Hegadi, Segmentation and analysis of fibroid from ultrasound images. Int.
J. Comput. Appl. 975, 8887 (2015)
2. N. Sriraam, D. Nithyashri, L. Vinodashri, P. Manoj Niranjan, Detection of uterine fibroids
using wavelet packet features with BPNN classifier, in IEEE EMBS Conference on Biomedical
Engineering & Sciences (2010)
3. Y. Yuan, A. Hoogi, C.F. Beaulieu, M.Q.-H. Meng, D.L. Rubin, Weighted locality-constrained linear coding for lesion classification in CT images, in Proceedings of 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (2015)
4. L. Rundo, C. Militello, S. Vitabile, C. Casarino, Combining split-and-merge and multi-seed region growing algorithms for uterine fibroid segmentation in MRgFUS treatments. Med. Biol. Eng. Comput. 54(7), 1071–1084 (2016)
5. B. Ni, F. He, Z. Yuan, Segmentation of uterine fibroid ultrasound images using a dynamic statistical shape model in HIFU therapy. Comput. Med. Imaging Graph. 46, 302–314 (2015)
6. A. Fallahi, M. Pooyan, H. Khotanlou, H. Hashemi, K. Firouznia, M.A. Oghabian, Uterine
fibroid segmentation on multiplan MRI using FCM, MPFCM and morphological operations.
IEEE (2010)
7. S. Divya, Detection of fibroid using image processing technique. Int. J. Emerg. Technol. Adv.
Eng. 5(3), 167–171 (2010)
8. T. Ratha Jeyalakshmi, K. Ramar Kadarkarai, Segmentation and feature extraction of fluid-filled
uterine fibroid—a knowledge-based approach. Int. J. Sci. Technol. 4, 405–416 (2010)
9. J. Yao, D. Chen, W. Lu, A. Premkumar, Uterine fibroid segmentation and volume measurement
on MRI, in Proceedings of SPIE, vol. 6143 (2006)
10. A. Alush, H. Greenspan, J. Goldberger, Automated and interactive lesion detection and
segmentation in uterine cervix images. IEEE Trans. Med. Imaging 29(2) (2010)
11. M.J. Padghamod, J.P. Gawande, Classification of ultrasonic uterine images. Adv. Res. Electr.
Electron. Eng. 1(3), 89–92 (2014)
Progressive Generative Adversarial
Binary Networks for Music Generation
1 Introduction
In [7], a refiner network R that uses binary neurons is placed between the generator G and the discriminator D, where R binarizes the floating-point predictions made by G. Training is conducted in two stages: first, G and D are pretrained and G is then fixed; second, R is trained and D is fine-tuned. Compared with MuseGAN [5], this model is more effective because of its use of deterministic binary neurons (DBNs). In our proposed model, we use progressive generative adversarial networks [9] with DBNs. Our model consists of a total of 12 layers in the shared generator and discriminator network and 8 layers in the refiner network at the end of all phases. Pitch and time-step values are increased progressively, layer by layer. Experimental results indicate that, owing to the progressive training of the GAN, the final output is generated more efficiently.
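As a rough illustration of the deterministic binary neurons mentioned above, the following PyTorch sketch binarizes a real-valued prediction with a hard threshold in the forward pass and uses a straight-through estimator in the backward pass; this is a common way to implement DBNs and is an assumption, not the paper's exact code.

import torch


class DeterministicBinaryNeuron(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Hard-threshold the real-valued prediction to {0, 1}
        return (x > 0.5).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient through unchanged
        return grad_output


def binarize(x):
    return DeterministicBinaryNeuron.apply(x)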
2 Background
Generative adversarial networks (GANs) have a latent space associated with the original dataset, and, as mentioned earlier, the generator tries to fool the discriminator by attempting to generate realistic data. The generator function G and the discriminator function D are the two main components of a GAN and are locked into a minimax game. The discriminator takes as input either the output of the generator or a sample x from the original dataset; during the training phase, it learns to discern between fake and real samples. The generator takes as input a noise vector z sampled from the prior distribution p_z of the original dataset and fools the discriminator with its counterfeit sample G(z). Both the generator and the discriminator are implemented as deep neural networks. Wasserstein GAN (WGAN) [10], an alternative form of GAN, measures the Wasserstein distance between the real distribution and the distribution of the generator; this distance acts as a critic for the generator. The WGAN objective function, with the gradient penalty of WGAN-GP, is given as:

L = E_{x̃~p_g}[D(x̃)] − E_{x~p_d}[D(x)] + λ E_{x̂~p_x̂}[(||∇_x̂ D(x̂)||_2 − 1)^2]
where px̂ is defined as sampling uniformly along straight lines between pairs of
points sampled from pd and the model distribution pg . It was observed that WGAN-
GP [5] stabilized the training and attenuated the mode collapse issue in comparison
with the weight clipping methodology used in the original WGAN. Hence, we use
WGAN-GP in our proposed framework.
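The gradient penalty that underlies the objective above can be sketched as follows; this follows the standard WGAN-GP recipe (lambda = 10, uniform interpolation between real and generated samples) in PyTorch, and the exact hyperparameters are assumptions rather than values from the paper.

import torch


def gradient_penalty(discriminator, real, fake, lam=10.0):
    # Sample x_hat uniformly along straight lines between real and generated samples
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake.detach()).requires_grad_(True)
    d_hat = discriminator(x_hat)
    grads = torch.autograd.grad(d_hat.sum(), x_hat, create_graph=True)[0]
    # Penalize deviation of the critic's gradient norm from 1
    return lam * ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1.0) ** 2).mean()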
In the progressive growing of GANs [9], we train the GAN network in multiple phases. In phase 1, the generator takes the noise vector z and uses n convolution layers to generate a low-resolution music sample. We then train the discriminator with the generated music and the real low-resolution dataset. Once the training stabilizes, we add n more convolution layers to up-sample the music to a slightly higher resolution in the generator and n more convolution layers to down-sample the music in the discriminator. Here, by the resolution of music, we mean the number of time steps and pitch values, and we have taken n = 1. A larger number of time steps and pitch values corresponds to a higher resolution, and vice versa.
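A minimal sketch of this phase-wise growth for the generator is given below (PyTorch); the channel count, the initial 4 x 12 time-pitch resolution and the omission of the usual smooth fade-in of new layers are simplifying assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn


class ProgressiveGenerator(nn.Module):
    def __init__(self, z_dim=128, channels=64):
        super().__init__()
        self.channels = channels
        # Phase 1: project the noise vector to a low-resolution (time x pitch) feature map
        self.input_block = nn.Sequential(nn.Linear(z_dim, channels * 4 * 12), nn.ReLU())
        self.blocks = nn.ModuleList()                            # grown phase by phase
        self.to_sample = nn.Conv2d(channels, 1, kernel_size=1)   # features -> piano-roll

    def grow(self):
        # Add n = 1 more up-sampling convolution block, doubling time/pitch resolution
        self.blocks.append(nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(self.channels, self.channels, kernel_size=3, padding=1),
            nn.ReLU(),
        ))

    def forward(self, z):
        x = self.input_block(z).view(-1, self.channels, 4, 12)
        for block in self.blocks:
            x = block(x)
        return torch.sigmoid(self.to_sample(x))   # real-valued output, binarized later by R


# Train at the current resolution until it stabilizes, then grow and continue:
gen = ProgressiveGenerator()
z = torch.randn(8, 128)
print(gen(z).shape)    # phase 1: low-resolution sample
gen.grow()
print(gen(z).shape)    # phase 2: doubled time-step and pitch resolution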
Progressive training speeds up and stabilizes regular GAN training. Most of the iterations are done at lower resolutions, and training is significantly faster while yielding music quality comparable to other approaches. In short, it produces higher-resolution samples with better music quality. The progressive GAN technique uses a simplified minibatch discrimination to improve the diversity of the results: it computes the standard deviation of each feature at each spatial location over the minibatch, averages these to yield a single scalar value, and concatenates this value to all spatial locations and over the minibatch at one of the last layers of the discriminator. If the generated music samples do not have the same diversity as the real music samples, this value will differ and will therefore be penalized by the discriminator.
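The simplified minibatch discrimination described above can be sketched as a small layer appended near the end of the discriminator; the PyTorch code below is an illustrative assumption of that computation, not the paper's implementation.

import torch
import torch.nn as nn


class MinibatchStdDev(nn.Module):
    def forward(self, x):                            # x: (batch, channels, time, pitch)
        # Standard deviation of every feature at every spatial location over the minibatch
        std = x.std(dim=0, unbiased=False)
        # Average them to a single scalar value ...
        mean_std = std.mean().view(1, 1, 1, 1)
        # ... and concatenate it to all spatial locations for every sample in the minibatch
        stat = mean_std.expand(x.size(0), 1, x.size(2), x.size(3))
        return torch.cat([x, stat], dim=1)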
Progressive GAN initializes the filter weights with N(0, 1) and then scales the weights at runtime for each layer as ŵ_i = w_i/c, where

c = (2 / number of inputs)^(−1/2)    (3)
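In code, this runtime weight scaling (the equalized learning rate of progressive GANs) amounts to keeping N(0, 1) parameters and rescaling them by the He constant sqrt(2 / fan_in) in every forward pass; the PyTorch sketch below is an assumption consistent with Eq. (3) as given above, not the paper's implementation.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        # Trivial N(0, 1) initialization; the scale is applied at runtime instead
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        fan_in = in_ch * kernel_size * kernel_size      # "number of inputs"
        self.scale = math.sqrt(2.0 / fan_in)            # w_hat = w * sqrt(2 / fan_in)
        self.padding = padding

    def forward(self, x):
        return F.conv2d(x, self.weight * self.scale, self.bias, padding=self.padding)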
For the generator, the features at every convolution layer are normalized, given by:
ax,y
bx,y =