
International Journal of Research ISSN NO:2236-6124

CONCEPTS OF DATA MINING AND INTEGRATING UNCERTAINTY IN DATA MINING

[1]
Apoorva.A
Assistant professor
BCA Department
New Horizon College Marathalli, Bangalore
E-mail : [email protected]

The volume of data stored in databases is growing at a tremendous rate. This growth has given
rise to a research field called Knowledge Discovery in Databases (KDD), or data mining, which
attracts researchers from fields including database design, statistics, pattern recognition,
machine learning, and data visualization. Data mining is applied to many kinds of data, including
scientific, environmental, financial, and mathematical data. Analyzing, classifying, and
summarizing such data by hand is impractical given how quickly data accumulates in this age of
networks and information sharing. This paper reviews the fundamentals of data mining and current
research on integrating uncertainty into data mining, as groundwork for developing new techniques
that incorporate uncertainty management into data mining.

I. INTRODUCTION

Briefly speaking, data mining refers to extracting useful information from vast amounts of data.
Several other terms are used for data mining, such as knowledge mining from databases, knowledge
extraction, data analysis, and data archaeology. It is now commonly agreed that data mining is an
essential step in the process of knowledge discovery in databases, or KDD. This paper takes a
broad view of data mining functionality: data mining is the process of discovering interesting
knowledge from large amounts of data stored in databases, data warehouses, or other information
repositories [2].

Volume VIII, Issue IV, April/2019 Page No:2361



Data mining, an interdisciplinary subfield of computer science, is the computational process
of discovering patterns in large data sets using methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems. The overall goal of the data
mining process is to extract information from a data set and transform it into an
understandable structure for further use. Data mining is widely used in diverse areas. A
number of commercial data mining systems are available today, yet many challenges remain
in this field.

II. LITERATURE SURVEY

Much recent data mining research has addressed specific domains, such as mobile commerce.
Paper [1] proposed Cluster-based Temporal Mobile Sequential Pattern Mine (CTMSP-Mine) to
discover cluster-based temporal mobile sequential patterns (CTMSPs). Two techniques prepare the
data. The Co-Smart-CAST algorithm clusters the mobile transaction sequences, using the proposed
LBS-Alignment measure to evaluate the similarity of mobile transaction sequences, and a GA-based
time segmentation algorithm finds the most suitable time intervals. After clustering and
segmentation, a user cluster table and a time interval table are generated. The CTMSP-Mine
algorithm then mines CTMSPs from the mobile transaction database according to these two tables.
In the online phase, a user's subsequent behavior is predicted from the user's previous mobile
transaction sequences and the current time.

Paper [2] deals with recognizing students' mood during online self-assessment tests. The authors
used exponential logic and its formulas for the computation. The student's previous answers and
slide bar status are the inputs, and the total number of questions in the test, the student's
goal, and the slide bar value are the variables of the exponential logic. The system identifies
the student's current mood and gives appropriate feedback. A limitation of the system is that
students select their mood manually with the slide bar, without any automation.

Paper [3] focused on improving aspect-level opinion mining for online customer reviews. The
authors proposed the Joint Aspect/Sentiment (JAS) model to extract aspects and aspect-dependent
sentiment lexicons from online customer reviews in a unified framework. They used the Gibbs
sampling algorithm.

Paper [4] presented a novel weakly supervised cybercriminal network mining method that can
uncover both explicit and implicit relationships among cybercriminals based on the
conversational messages they post on online social media. Two types of semantics, transactional
and collaborative relationships among cybercriminals, are mined using a context-sensitive Gibbs
sampling algorithm. The authors used a probabilistic generative model to extract multi-word
expressions describing the two types of cybercriminal relationships in unlabeled messages, and
concept-level approaches to better grasp the implicit semantics associated with the text.

Research [5] deals with the Cognitive Internet of Things (CIoT), a new network paradigm in which
(physical/virtual) things or objects are interconnected and behave as agents with minimum human
intervention. The things interact with each other following a context-aware perception-action
cycle, use the methodology of understanding-by-building to learn from both the physical
environment and social networks, store the learned semantics and/or knowledge in databases, and
adapt themselves to changes or uncertainties via resource-efficient decision-making mechanisms.
This research used game models and multiagent learning algorithms, which should be carefully
designed for different large-scale CIoT applications. CIoT needs massive amounts of sensitive
data. The advantages of incorporating CIoT into applications are saving people's time and
effort, increasing resource efficiency, and enhancing service provisioning.

Paper [6] studies the discovery of connections between social emotions and online documents,
called social affective text mining. This includes predicting emotions from online documents and
associating emotions with latent topics for document categorization, which helps online users
select related documents based on their emotional preferences. In this research, emotions are
associated with a specific emotional event or topic instead of only a single term. The authors
proposed a joint emotion-topic model for social affective text mining, which introduces an
additional layer of emotion modeling into Latent Dirichlet Allocation (LDA). The model takes
social affective text as input (e.g., "college student jumps", "affection related problem") and
categorizes the text according to different emotions (e.g., empathy, touched, and surprise).
They developed an approximate inference method based on the Gibbs sampling algorithm.

III. DATA MINING APPLICATIONS

The human genome contains roughly 20,000 protein-coding genes, and each gene is a
sequence of individual nucleotides arranged in a particular order. The number of ways
these nucleotides can be ordered to form distinct genes is practically unlimited. Data
mining technology can be used to analyze sequential patterns, to search for similarity,
and to identify particular gene sequences that are related to various diseases. In
the future, data mining technology will play a vital role in the development of new
pharmaceuticals and advances in cancer therapies.

Financial data collected in the banking and financial industry is often relatively
complete, reliable, and of high quality, which facilitates systematic data analysis and data
mining. Typical cases include classification and clustering of customers for targeted
marketing, detection of money laundering and other financial crimes as well as design
and construction of data warehouses for multidimensional data analysis.

The retail industry is a major application area for data mining since it collects
huge amounts of data on customer shopping history, consumption, and sales and service
records. Data mining on retail data can identify customer buying habits, discover
customer purchasing patterns, and predict customer consumption trends. Data mining
technology also helps design effective goods transportation and distribution policies
and reduce business costs.

Data mining in the telecommunication industry can help understand the business
involved, identify telecommunication patterns, catch fraudulent activities, make better
use of resources, and improve service quality. Typical cases include multidimensional
analysis of telecommunication data, fraudulent pattern analysis and the identification of
unusual patterns, as well as multidimensional association and sequential pattern analysis.

IV. INTEGRATING UNCERTAINTY IN DATA MINING

Many factors cause data uncertainty in real-world applications, including outdated
resources, sampling errors, and imprecise calculation. This is especially true for
applications that require interaction with the physical world, such as location-based
services [6] and sensor monitoring [7]. Recently, research has been done in the area of
data uncertainty management in databases. It is proposed that when data mining is
performed on uncertain data, the data uncertainty has to be considered in order to obtain
high-quality data mining results [4]. This is called "Uncertain Data Mining".

Figure 4. Real-world data
Figure 5. Recorded data
Figure 6. Uncertain data

According to Chau et al. [4], Figure 4 shows real-world data partitioned into three
clusters (a, b, c). Figure 5 shows the recorded locations, in which some objects (shaded)
are not at their true locations, creating clusters a’, b’, c’ and c’’. Note that a’ has
one fewer object than a, and b’ has one more object than b. Also, c is mistakenly split
into c’ and c’’. In Figure 6, line uncertainty is considered, producing clusters a’, b’
and c. This clustering result is closer to that of Figure 4 than the result in Figure 5.

Based on whether data imprecision is considered, Chau et al. [4] propose that data
mining methods can be classified through a taxonomy. Common data mining techniques
such as association rule mining, data classification, and data clustering need to be
modified in order to handle uncertain data. Moreover, there are two types of data
clustering on uncertain data: hard clustering and fuzzy clustering. Hard clustering aims
at improving the accuracy of clustering by considering expected data values once data
uncertainty is taken into account. Fuzzy clustering, on the other hand, presents the
clustering result in a “fuzzy” form [5], in which an object can belong to several clusters
to different degrees.
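The contrast between a hard and a fuzzy result can be illustrated with a small sketch. The text does not say which fuzzy method is meant, so the membership formula below is borrowed from standard fuzzy c-means and is an assumption, as are the function names, the `fuzzifier` parameter, and the toy data.

```python
import numpy as np

def distances(points, means):
    """Euclidean distance from every point to every cluster mean, shape (n, K)."""
    return np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)

def hard_assignment(points, means):
    """Hard clustering result: each point gets exactly one cluster index."""
    return distances(points, means).argmin(axis=1)

def fuzzy_memberships(points, means, fuzzifier=2.0, eps=1e-12):
    """Fuzzy result (fuzzy c-means style, an assumed choice): each row holds
    membership degrees in [0, 1] that sum to 1 over the K clusters."""
    d = np.maximum(distances(points, means), eps)  # guard against zero distance
    weights = d ** (-2.0 / (fuzzifier - 1.0))
    return weights / weights.sum(axis=1, keepdims=True)

# Toy data (hypothetical): two points on a line between two cluster means.
points = np.array([[1.0, 0.0], [2.0, 0.0]])
means = np.array([[0.0, 0.0], [4.0, 0.0]])
hard = hard_assignment(points, means)
fuzzy = fuzzy_memberships(points, means)
print(hard)
print(fuzzy)
```

Here the first point gets memberships 0.9 and 0.1, while the second point, which is equidistant from both means, gets 0.5 and 0.5. The hard result has to place that second point in a single cluster and so hides the ambiguity.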

Figure 7. A taxonomy of data mining on data with uncertainty

K-means clustering for precise data [4]

The classical K-means clustering algorithm aims at finding a set C of K clusters
Cj with cluster means cj that minimizes the sum of squared errors (SSE). The SSE is
usually calculated as follows:

$$\mathrm{SSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \| c_j - x_i \|^2 \tag{1}$$


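where || . || is a distance metric between a data point xi and a cluster mean cj.
For example, the Euclidean distance between two m-dimensional points x and y is defined as:

$$\| x - y \| = \sqrt{\sum_{d=1}^{m} (x_d - y_d)^2} \tag{2}$$

The mean (centroid) of a cluster Cj is defined by the following vector:

$$c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \tag{3}$$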

The K-means algorithm is as follows:

1  Assign initial values for cluster means c1 to cK
2  repeat
3    for i = 1 to n do
4      Assign each data point xi to the cluster Cj for which || cj - xi || is minimum
5    end for
6    for j = 1 to K do
7      Recalculate cluster mean cj of cluster Cj
8    end for
9  until convergence
10 return C
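The K-means steps above can be sketched in Python with NumPy. This is a minimal illustration and not the implementation from [4]: the function names, the convergence test (stop when assignments no longer change), the handling of empty clusters, and the toy data are all assumptions.

```python
import numpy as np

def sse(points, means, labels):
    """Sum of squared Euclidean distances of points to their cluster means."""
    return float(np.sum((points - means[labels]) ** 2))

def k_means(points, initial_means, max_iter=100):
    """Plain K-means on precise data; returns (means, labels)."""
    points = np.asarray(points, dtype=float)
    means = np.asarray(initial_means, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest cluster mean by squared Euclidean distance.
        sq_dists = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        # Assumed convergence criterion: assignments stopped changing.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each mean becomes the centroid of its current members.
        for j in range(len(means)):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous mean
                means[j] = members.mean(axis=0)
    return means, labels

# Toy data (hypothetical): two well-separated pairs of points.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
means, labels = k_means(points, initial_means=[[1.0, 1.0], [9.0, 1.0]])
total_sse = sse(points, means, labels)
print(means, labels, total_sse)
```

On this toy input the means settle at (0, 1) and (10, 1), and every point lies at distance 1 from its mean, so the SSE is 4. Other convergence criteria, such as a threshold on the decrease in SSE, would fit the same loop.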


K-means clustering for uncertain data [4]

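In order to take data uncertainty into account in the clustering process, Chau et al.
[4] propose a clustering algorithm with the goal of minimizing the expected sum of
squared errors E(SSE). Notice that a data object xi is specified by an uncertainty region
with an uncertainty pdf f(xi). Given a set of clusters Cj, the expected SSE can be
calculated as:

$$E(\mathrm{SSE}) = \sum_{j=1}^{K} \sum_{x_i \in C_j} E\left( \| c_j - x_i \|^2 \right) = \sum_{j=1}^{K} \sum_{x_i \in C_j} \int \| c_j - x_i \|^2 f(x_i)\, dx_i \tag{4}$$

Cluster means are then given by:

$$c_j = E\left( \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \right) = \frac{1}{|C_j|} \sum_{x_i \in C_j} E(x_i) = \frac{1}{|C_j|} \sum_{x_i \in C_j} \int x_i f(x_i)\, dx_i \tag{5}$$

They also propose a new K-means algorithm for clustering uncertain data, called UK-means.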

1  Assign initial values for cluster means c1 to cK
2  repeat
3    for i = 1 to n do
4      Assign each data point xi to the cluster Cj for which E(|| cj - xi ||) is minimum
5    end for
6    for j = 1 to K do
7      Recalculate cluster mean cj of cluster Cj using Equation (5)
8    end for
9  until convergence
10 return C
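A sketch of clustering uncertain objects in this way is given below. It is an illustration under stated assumptions, not the algorithm of [4] verbatim: each object's uncertainty pdf is represented by a finite set of samples, so expectations become sample averages, and the expected squared distance is used for the assignment. The names `uk_means` and `expected_sse`, the convergence test, and the toy data are hypothetical.

```python
import numpy as np

def expected_sse(samples, means, labels):
    """Sample-based estimate of E(SSE): per-object mean squared distance, summed."""
    diffs = samples - means[labels][:, None, :]          # shape (n, s, d)
    return float((diffs ** 2).sum(axis=2).mean(axis=1).sum())

def uk_means(samples, initial_means, max_iter=100):
    """samples[i] holds s draws from the uncertainty pdf of object i, shape (n, s, d)."""
    samples = np.asarray(samples, dtype=float)
    means = np.asarray(initial_means, dtype=float).copy()
    expected = samples.mean(axis=1)                      # estimate of E(x_i)
    labels = None
    for _ in range(max_iter):
        # Assignment step: minimize the estimated expected squared distance.
        diffs = samples[:, :, None, :] - means[None, None, :, :]   # (n, s, K, d)
        exp_sq_dist = (diffs ** 2).sum(axis=3).mean(axis=1)        # (n, K)
        new_labels = exp_sq_dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                        # assumed convergence test
        labels = new_labels
        # Update step: cluster mean is the average of its members' expected values.
        for j in range(len(means)):
            members = expected[labels == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
    return means, labels

# Toy data (hypothetical): four objects with line uncertainty along the x axis,
# each represented by 5 evenly spaced samples within +/- 1 of its recorded location.
centers = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
offsets = np.linspace(-1.0, 1.0, 5)
line = np.stack([offsets, np.zeros_like(offsets)], axis=1)         # (5, 2)
samples = centers[:, None, :] + line[None, :, :]                    # (4, 5, 2)
means, labels = uk_means(samples, initial_means=[[1.0, 1.0], [9.0, 1.0]])
total = expected_sse(samples, means, labels)
print(means, labels, total)
```

With this input the cluster means are again (0, 1) and (10, 1), but the estimated E(SSE) is 6 rather than 4, because each object's spread along its uncertainty line adds 0.5 to its expected squared distance.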

Based on [4], the main difference between UK-means clustering and K-means clustering lies in
the computation of distances and cluster centroids. In particular, UK-means computes the expected
distances and cluster centroids based on the data uncertainty model. Again, convergence can be
defined based on different criteria. Note that if convergence is based on the squared error,
E(SSE) as in Equation (4) should be used instead of SSE. For the cluster assignment step, they
point out that it is often difficult to determine E(|| cj - xi ||) algebraically. In particular,
the variety of geometric shapes of uncertainty regions (e.g., line, circle) and the different
uncertainty pdfs imply that numerical integration methods are necessary. In view of this,
E(|| cj - xi ||^2), which is easier to obtain, is used instead. This allows the cluster
assignment to be determined using a simple algebraic expression.
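One way to see why the squared form is easier to handle is the standard decomposition of an expected squared distance, sketched below. It assumes that xi has a finite mean and variance under its uncertainty pdf. It is a textbook identity and is not quoted from [4].

```latex
\begin{aligned}
E\left(\|c_j - x_i\|^2\right)
  &= E\left(\|(c_j - E(x_i)) + (E(x_i) - x_i)\|^2\right) \\
  &= \|c_j - E(x_i)\|^2
     + 2\,(c_j - E(x_i))^{\top} E\left(E(x_i) - x_i\right)
     + E\left(\|x_i - E(x_i)\|^2\right) \\
  &= \|c_j - E(x_i)\|^2 + E\left(\|x_i - E(x_i)\|^2\right)
\end{aligned}
```

The cross term vanishes because E(E(xi) - xi) = 0. The last term is the total variance of xi and does not depend on the cluster index j, so the cluster that minimizes the expected squared distance is the one whose mean is closest to E(xi).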

V. CONCLUSION

Data mining, or knowledge discovery in databases, is the computer-assisted process of digging
through and analyzing enormous sets of data and then extracting the meaning of the data. This
paper discussed the background of data mining and methods for integrating uncertainty into data
mining, such as the K-means algorithm for uncertain data. It also showed that data mining
technology can be used in many real-life areas, including biomedical and DNA data analysis,
financial data analysis, the retail industry, and the telecommunication industry. One of the
biggest challenges for data mining technology is managing uncertain data, which may be caused by
outdated resources, sampling errors, or imprecise calculation. Future research will involve the
development of new techniques for incorporating uncertainty management in data mining.

VI. REFERENCES

[1] Eric Hsueh-Chan Lu, Vincent S. Tseng, and Philip S. Yu, “Mining Cluster-Based Temporal
Mobile Sequential Patterns in Location-Based Service Environments,” IEEE Transactions on
Knowledge and Data Engineering, vol. 23, no. 6, June 2011.


[2] Christos N. Moridis and Anastasios A. Economides, “Mood Recognition During Online
Self-Assessment Tests,” IEEE Transactions on Learning Technologies, vol. 2, no. 1,
January-March 2009.

[3] XU Xueke, CHENG Xueqi, TAN Songbo, LIU Yue, and SHEN Huawei, “Aspect-Level Opinion
Mining of Online Customer Reviews,” Key Laboratory of Web Data Science and Technology,
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
Management Visualization of User and Network Data, China Communications, March 2013.

[4] Raymond Y.K. Lau (Department of Information Systems, City University of Hong Kong,
Hong Kong SAR), Yunqing Xia (Department of Computer Science and Technology, Tsinghua
University, Beijing 100084, China), and Yunming Ye (Shenzhen Key Laboratory of Internet
Information Collaboration, Shenzhen Graduate School, Harbin Institute of Technology,
Shenzhen 518055, China), “A Probabilistic Generative Model for Mining Cybercriminal
Networks from Online Social Media.”

[5] Qihui Wu, Guoru Ding, Yuhua Xu, Shuo Feng, Zhiyong Du, Jinlong Wang, and Keping Long,
“Cognitive Internet of Things: A New Paradigm Beyond Connection,” IEEE Internet of Things
Journal, vol. 1, no. 2, April 2014.

[6] Shenghua Bao, Shengliang Xu, Li Zhang, Rong Yan, Zhong Su, Dingyi Han, and Yong Yu,
“Mining Social Emotions from Affective Text,” IEEE Transactions on Knowledge and Data
Engineering, vol. 24, no. 9, September 2012.

[7] Reynold Cheng, Dmitri V. Kalashnikov, and Sunil Prabhakar, “Evaluating Probabilistic
Queries over Imprecise Data,” UK: Elsevier Science Ltd, 2007.

[8] Sapphire, “Large Scale Data Mining and Pattern Recognition,”
Https://Computation.Llnl.Gov/Casc/Sapphire/Overview/Data_Mining_Steps.Gif, 1999.
