
DATA MINING

UNIT II
Fundamentals of Data Mining:
• Data mining refers to extracting or mining knowledge
from large amounts of data. A more appropriate name
would have been knowledge mining, which emphasizes
that knowledge is mined from large amounts of data.
• It is the computational process of discovering patterns in
large data sets involving methods at the intersection of
artificial intelligence, machine learning, statistics, and
database systems.
• The overall goal of the data mining process is to extract
information from a data set and transform it into an
understandable structure for further use.
The key properties of data mining
are
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases
The Scope of Data Mining
• Data mining processes require sifting through an
immense amount of material, or intelligently probing it
to find exactly where the value resides.
• Given databases of sufficient size and quality, data
mining technology can generate new business
opportunities by providing these capabilities.
Automated prediction of trends and
behaviours
• Data mining automates the process of finding predictive
information in large databases.
• A typical example of a predictive problem is targeted
marketing. Data mining uses data on past promotional
mailings to identify the targets most likely to maximize
return on investment in future mailings. Other
predictive problems include forecasting bankruptcy and
other forms of default, and identifying segments of a
population likely to respond similarly to given events.
Automated discovery of previously
unknown patterns.
• Data mining tools sweep through databases and identify
previously hidden patterns in one step. An example of
pattern discovery is the analysis of retail sales data to
identify seemingly unrelated products that are often
purchased together. Other pattern discovery problems
include detecting fraudulent credit card transactions
and identifying anomalous data that could represent
data-entry keying errors.
Data mining involves six common
classes of tasks
• Anomaly detection (Outlier/change/deviation detection)
• Association rule learning (Dependency modelling)
• Clustering
• Classification
• Regression
• Summarization
Data mining involves six common
classes of tasks
• Anomaly detection (Outlier/change/deviation detection) – The identification of
unusual data records that might be interesting, or data errors that require further
investigation.
• Association rule learning (Dependency modelling) – Searches for relationships
between variables. For example, a supermarket might gather data on customer
purchasing habits. Using association rule learning, the supermarket can determine
which products are frequently bought together and use this information for marketing
purposes. This is sometimes referred to as market basket analysis (a minimal sketch
follows this list).
• Clustering – is the task of discovering groups and structures in the data that are in
some way or another "similar", without using known structures in the data.
• Classification – is the task of generalizing known
structure to apply to new data. For example, an e-mail
program might attempt to classify an e-mail as
"legitimate" or as "spam".
• Regression – attempts to find a function which models
the data with the least error.
• Summarization – providing a more compact
representation of the data set, including visualization
and report generation.
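The market basket idea above can be made concrete with a small, self-contained sketch. The transactions, items, and thresholds below are invented for illustration; real association rule miners (e.g., Apriori) scale this kind of counting to much larger itemsets.

```python
from itertools import combinations
from collections import Counter

# Invented customer baskets; each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

n = len(transactions)
item_counts = Counter(i for t in transactions for i in t)
pair_counts = Counter(p for t in transactions for p in combinations(sorted(t), 2))

min_support, min_confidence = 0.4, 0.6  # illustrative thresholds

# Report rules X -> Y whose support and confidence clear the thresholds.
for (x, y), count in pair_counts.items():
    if count / n < min_support:
        continue
    for lhs, rhs in ((x, y), (y, x)):
        confidence = count / item_counts[lhs]
        if confidence >= min_confidence:
            print(f"{lhs} -> {rhs}: support={count / n:.2f}, "
                  f"confidence={confidence:.2f}")
```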
Architecture of Data Mining
1. Knowledge Base

This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern's interestingness
based on its unexpectedness, may also be included. Other examples of domain knowledge are
additional interestingness constraints or thresholds, and metadata (e.g., describing
data from multiple heterogeneous sources).
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module:
• This component typically employs interestingness
measures and interacts with the data mining modules so as
to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns.
Alternatively, the pattern evaluation module may be
integrated with the mining module, depending on the
implementation of the data mining method used. For
efficient data mining, it is highly recommended to push the
evaluation of pattern interestingness as deep as possible
into the mining process so as to confine the search to only
the interesting patterns (a small filtering sketch follows).
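A toy illustration of threshold-based filtering: the pattern records and threshold values below are invented, standing in for whatever the mining engine actually produces.

```python
# Invented rule records, standing in for the mining engine's output.
patterns = [
    {"rule": "bread -> milk",   "support": 0.60, "confidence": 0.75},
    {"rule": "beer -> diapers", "support": 0.40, "confidence": 0.90},
    {"rule": "milk -> beer",    "support": 0.08, "confidence": 0.30},
]

def is_interesting(p, min_support=0.10, min_confidence=0.50):
    """A pattern passes only if it clears both interestingness thresholds."""
    return p["support"] >= min_support and p["confidence"] >= min_confidence

for p in filter(is_interesting, patterns):
    print(p["rule"])  # the third rule is filtered out
```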
4. User interface:
• This module communicates between users and the data
mining system, allowing the user to interact with the
system by specifying a data mining query or task,
providing information to help focus the search, and
performing exploratory data mining based on the
intermediate data mining results. In addition, this
component allows the user to browse database and
data warehouse schemas or data structures, evaluate
mined patterns, and visualize the patterns in different
forms.
Classification of Data Mining Systems
Data mining systems can be categorized according to
various criteria, as follows:

• Classification according to the kinds of databases
mined: A data mining system can be classified
according to the kinds of databases mined. Database
systems can be classified according to different criteria
(such as data models, or the types of data or
applications involved), each of which may require its
own data mining technique. Data mining systems can
therefore be classified accordingly.

For instance, if classifying according to data models,
we may have a relational, transactional, object-relational,
or data warehouse mining system. If classifying according
to the special types of data handled, we may have a
spatial, time-series, text, stream data, multimedia data
mining system, or a World Wide Web mining system.
Classification according to the kinds of knowledge
mined:
• Data mining systems can be categorized according to the kinds of
knowledge they mine, that is, based on data mining functionalities, such as
characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis.
• A comprehensive data mining system usually provides multiple and/or
integrated data mining functionalities. Data mining systems can also be
distinguished based on the granularity or levels of abstraction of the
knowledge mined, including generalized knowledge (at a high level of
abstraction), primitive-level knowledge (at a raw data level), or knowledge
at multiple levels (considering several levels of abstraction).
An advanced data mining system should facilitate the discovery of
knowledge at multiple levels of abstraction.
• Data mining systems can also be categorized as
those that mine data regularities (commonly occurring
patterns) versus those that mine data irregularities
(such as exceptions, or outliers).
• In general, concept description, association and
correlation analysis, classification, prediction, and
clustering mine data regularities, rejecting outliers as
noise. These methods may also help detect outliers.
Classification according to the kinds
of techniques utilized
• Data mining systems can be categorized according to the
underlying data mining techniques employed. These
techniques can be described according to the degree of user
interaction involved (e.g., autonomous systems, interactive
exploratory systems, query-driven systems) or the methods
of data analysis employed (e.g., database-oriented or data
warehouse–oriented techniques, machine learning, statistics,
visualization, pattern recognition, neural networks, and so
on). A sophisticated data mining system will often adopt
multiple data mining techniques or work out an effective,
integrated technique that combines the merits of a few
individual approaches.
Classification according to the
applications adapted
• Data mining systems can also be categorized according
to the applications they adapt. For example, data
mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail,
and so on. Different applications often require the
integration of application-specific methods.
Therefore, a generic, all-purpose data mining system
may not fit domain-specific mining tasks.
Data Mining Process
• Data Mining is a process of discovering various models,
summaries, and derived values from a given collection
of data.
• The general experimental procedure adapted to data-
mining problems involves the following steps:
1. State the problem and formulate the hypothesis:

Most data-based modeling studies are performed in a particular application domain.
Hence, domain-specific knowledge and experience are usually necessary in order to
come up with a meaningful problem statement.
In this step, a modeler usually specifies a set of variables for the unknown dependency
and, if possible, a general form of this dependency as an initial hypothesis. There may
be several hypotheses formulated for a single problem at this stage.
The first step requires combined expertise in the application domain and in
data-mining modeling. In practice, it usually means a close interaction between the
data-mining expert and the application expert. In successful data-mining applications,
this cooperation does not stop at the initial phase; it continues during the entire
data-mining process.
2. Collect the data
• This step is concerned with how the data are generated
and collected. In general, there are two distinct
possibilities.
• The first is when the data-generation process is under
the control of an expert (modeler): this approach is
known as a designed experiment.
• The second possibility is when the expert cannot
influence the data-generation process: this is known
as the observational approach. An observational setting,
namely, random data
generation, is assumed in most data-mining
applications.
Data Mining Task Primitives
• A data mining task can be specified in the form of a data
mining query, which is input to the data mining system. A
data mining query is defined in terms of data mining task
primitives.
These primitives allow the user to interactively
communicate with the data mining system during
discovery to direct the mining process or examine the
findings from different angles or depths. The data mining
primitives specify the following (a sketch follows the list):
• 1. Task-relevant data to be mined.
• 2. Kind of knowledge to be mined.
• 3. Background knowledge to be used in the discovery
process.
• 4. Interestingness measures and thresholds for pattern
evaluation.
• 5. Representation for visualizing the discovered
patterns.
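One way to picture a data mining query is as a record with one field per primitive. The structure below is purely illustrative (the field names and values are invented, not a standard API); it only shows how the five primitives jointly specify a mining task.

```python
# Illustrative only: a mining query expressed via the five task primitives.
mining_query = {
    # 1. Task-relevant data to be mined
    "data": {"database": "sales_db", "table": "transactions",
             "attributes": ["customer_id", "item", "amount"]},
    # 2. Kind of knowledge to be mined
    "knowledge": "association_rules",
    # 3. Background knowledge (here, a concept hierarchy over items)
    "background": {"item": ["product", "category", "department"]},
    # 4. Interestingness measures and thresholds
    "thresholds": {"min_support": 0.05, "min_confidence": 0.70},
    # 5. Representation for visualizing the discovered patterns
    "presentation": ["rules", "tables", "charts"],
}
print(mining_query["knowledge"], mining_query["thresholds"])
```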
Integration of a data mining system
with a database system
• The data mining system is integrated with a database or data
warehouse system so that it can perform its tasks effectively.
A data mining system operates in an environment that requires it
to communicate with other data systems, such as a database system.
• The possible integration schemes are as follows:
• No coupling − No coupling means that a data mining system will
not use any function of a database or data warehouse system. It can
retrieve data from a particular source (such as a file system), process
the data using some data mining algorithms, and then store the
mining results in another file.
• Loose Coupling − In loose coupling, the data mining system
uses some services of a database or data warehouse system.
The data is fetched from a data repository handled by
these systems. Data mining approaches are used to
process the data, and the processed data is then saved
either in a file or in a designated area in a database or
data warehouse. Loose coupling is better than no coupling
because it can fetch portions of the data stored in
databases by using query processing or other system
facilities (a minimal sketch follows).
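A minimal sketch of loose coupling, using Python's built-in sqlite3 module as the database system. The table, columns, and data are invented; the point is that the DBMS only serves the data, while the mining step runs outside it.

```python
import sqlite3
from collections import Counter

# An in-memory database stands in for a real data repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(1, "bread"), (1, "milk"), (2, "bread"),
                  (2, "beer"), (3, "bread")])

# Loose coupling: use the database's query processing to fetch the data ...
rows = conn.execute("SELECT customer_id, item FROM transactions").fetchall()
conn.close()

# ... then mine outside the DBMS; here, just the most frequent items.
print(Counter(item for _, item in rows).most_common(2))
```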
• Semitight Coupling − In semitight coupling, efficient
execution of a few essential data mining primitives can be
supported in the database/data warehouse system. These
primitives can include sorting, indexing, aggregation,
histogram analysis, multi-way join, and pre-computation of
some important statistical measures, such as sum, count,
max, min, standard deviation, etc.
• Tight coupling − Tight coupling means that the data mining
system is smoothly integrated into the database/data
warehouse system. The data mining subsystem is treated
as one functional component of an information system.
Major Issues in Data Mining

• Mining different kinds of knowledge in databases. - The needs of different users
are not the same, and different users may be interested in different kinds of
knowledge. Therefore, data mining should cover a broad range of knowledge
discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction. - The data
mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
• Data mining query languages and ad hoc data mining. - A data mining query
language that allows the user to describe ad hoc mining tasks should be integrated
with a data warehouse query language and optimized for efficient and flexible
data mining.
• Presentation and visualization of data mining results. -
Once the patterns are discovered, they need to be expressed
in high-level languages and visual representations. These
representations should be easily understandable by the users.
• Handling noisy or incomplete data. - Data cleaning
methods are required that can handle noise and incomplete
objects while mining the data regularities. Without such
methods, the accuracy of the discovered patterns will be poor.
• Pattern evaluation. - This refers to the interestingness of the
discovered patterns. Patterns that merely represent common
knowledge or lack novelty are of little value.
• Efficiency and scalability of data mining algorithms. - In
order to effectively extract the information from huge amount of
data in databases, data mining algorithm must be efficient and
scalable.
• Parallel, distributed, and incremental mining algorithms.
- Factors such as the huge size of databases, the wide
distribution of data, and the complexity of data mining
methods motivate the development of parallel and distributed
data mining algorithms. These algorithms divide the data into
partitions, which are processed in parallel, and the results
from the partitions are then merged. Incremental algorithms
update existing mining results when the database changes,
without having to mine the entire data again from scratch.
Data preprocessing:
• Data preprocessing is converting raw data into legible
and well-defined sets that allow businesses to conduct data
mining, analyze the data, and process it for business
activities. It is important for businesses to preprocess
their data correctly, as they use various forms of input to
collect raw data, which can affect its quality.
Preprocessing data is an important step, as raw data can
be inconsistent or incomplete in its formatting. Effectively
preprocessing raw data can increase its accuracy, which
can increase the quality of projects and improve their
reliability.
Importance of data preprocessing
• The following are some benefits of preprocessing data:
1. It improves accuracy and reliability. Preprocessing data
removes missing or inconsistent data values resulting from
human or computer error, which can improve the accuracy
and quality of a dataset, making it more reliable.
2. It makes data consistent. When collecting data, it is
possible to have duplicates, and discarding them
during preprocessing can ensure the data values for
analysis are consistent, which helps produce accurate
results.
3. It increases the data's algorithm readability.
Preprocessing enhances the data's quality and makes it
easier for machine learning algorithms to read, use, and
interpret it.
Data Cleaning
Data cleaning is an essential step in the data mining process and is crucial to the
construction of a model. Although required, it is a step frequently overlooked.
The major problem with quality information management is data quality. Problems with
data quality can happen at any place in an information system; data cleansing offers a
solution to these issues.
Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly
formatted, duplicated, or insufficient data from a dataset. Even if results and
algorithms appear to be correct, they are unreliable if the data is inaccurate. There
are numerous ways for data to be duplicated or incorrectly labeled when merging
multiple data sources.
• Data cleaning lowers errors and raises the caliber of the
data. Although it might be a time-consuming and
laborious operation, fixing data mistakes and removing
incorrect information must be done (a small pandas
sketch follows).
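A small pandas sketch of the cleaning operations described above (duplicate removal, dropping records missing key fields, imputing missing values). The DataFrame is made up.

```python
import pandas as pd

# Made-up raw data exhibiting the defects described above.
raw = pd.DataFrame({
    "customer": ["ann", "bob", "bob", None, "eve"],
    "amount":   [10.0, 25.0, 25.0, 7.5, None],
})

clean = (
    raw.drop_duplicates()                 # remove the duplicated "bob" record
       .dropna(subset=["customer"])       # drop rows missing the key field
       .fillna({"amount": raw["amount"].median()})  # impute missing amounts
)
print(clean)
```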
Data Integration
• Data integration is the process of combining data from
multiple sources into a cohesive and consistent view.
This process involves identifying and accessing the
different data sources, mapping the data to a common
format, and reconciling any inconsistencies or
discrepancies between the sources.
• The goal of data integration is to make it easier to
access and analyze data that is spread across multiple
systems or platforms, in order to gain a more complete
and accurate understanding of the data.
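A minimal pandas sketch of that process: two invented sources name the same key differently; mapping to a common format and merging yields one consistent view.

```python
import pandas as pd

# Two invented sources that name the same key differently.
crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Ann", "Bob"]})
sales = pd.DataFrame({"customer": [1, 1, 2], "amount": [10.0, 4.0, 7.0]})

# Map to a common format, then reconcile into a single consistent view.
sales = sales.rename(columns={"customer": "cust_id"})
integrated = crm.merge(sales, on="cust_id", how="left")
print(integrated)
```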
Data Transformation
• Data transformation can involve the following:
• 1. Smoothing, which works to remove noise from the
data. Such techniques include binning, regression, and
clustering.
• 2. Aggregation, where summary or aggregation
operations are applied to the data. For example, the
daily sales data may be aggregated so as to compute
monthly and annual total amounts. This step is typically
used in constructing a data cube for analysis of the data
at multiple granularities.
3. Generalization of the data, where low-level or
"primitive" (raw) data are replaced by higher-level
concepts through the use of concept hierarchies. For
example, categorical attributes, like street, can be
generalized to higher-level concepts, like city or country.
4. Normalization, where the attribute data are scaled so as
to fall within a small specified range, such as -1.0 to 1.0,
or 0.0 to 1.0 (see the sketch after this list).
5. Attribute construction (or feature construction), where
new attributes are constructed and added from the given
set of attributes to help the mining process.
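As a worked example of point 4, min-max normalization rescales an attribute linearly into a chosen target range. A plain-Python sketch, with invented sales figures:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

daily_sales = [120.0, 80.0, 200.0, 150.0]        # invented figures
print(min_max_normalize(daily_sales))             # into [0.0, 1.0]
print(min_max_normalize(daily_sales, -1.0, 1.0))  # into [-1.0, 1.0]
```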
Data Reduction

• Data reduction techniques can be applied to obtain a
reduced representation of the data set that is much
smaller in volume, yet closely maintains the integrity of
the original data. That is, mining on the reduced data
set should be more efficient yet produce the same (or
almost the same) analytical results.
Strategies for data reduction include the following:
• Data cube aggregation: where aggregation operations
are applied to the data in the construction of a data cube.
• Attribute subset selection: where irrelevant, weakly
relevant, or redundant attributes or dimensions may be
detected and removed.
• Dimensionality reduction: where encoding
mechanisms are used to reduce the dataset size.
• Numerosity reduction: where the data are replaced
or estimated by alternative, smaller data
representations such as parametric models (which
need store only the model parameters instead of the
actual data) or nonparametric methods such as
clustering, sampling, and the use of histograms.
• Discretization and concept hierarchy generation:
where raw data values for attributes are replaced by
ranges or higher conceptual levels. Data discretization
is a form of numerosity reduction that is very useful for
the automatic generation of concept hierarchies.
• Discretization and concept hierarchy generation are
powerful tools for data mining, in that they allow the
mining of data at multiple levels of abstraction (a short
binning sketch follows).
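A short binning sketch of discretization using pandas: raw ages (invented) are replaced by higher-level concept labels, with illustrative bin edges and labels.

```python
import pandas as pd

ages = pd.Series([13, 25, 31, 47, 52, 68, 74])  # invented raw values

# Replace raw values with ranges / higher-level concepts (discretization).
bins = [0, 18, 35, 60, 120]
labels = ["youth", "young_adult", "middle_aged", "senior"]
print(pd.cut(ages, bins=bins, labels=labels))
```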
Knowledge Discovery in Databases (KDD)
• The list of steps involved in the knowledge discovery process:
• Data Cleaning - In this step, noise and inconsistent data are removed.
• Data Integration - In this step, multiple data sources are combined.
• Data Selection - In this step, data relevant to the analysis task are retrieved
from the database.
• Data Transformation - In this step, data are transformed or consolidated into
forms appropriate for mining by performing summary or aggregation operations.
• Data Mining - In this step, intelligent methods are applied in order to extract
data patterns.
• Pattern Evaluation - In this step, data patterns are evaluated.
• Knowledge Presentation - In this step, knowledge is represented.
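The steps can be traced end to end on a toy example. Everything below (the data, the trivial "pattern", the threshold) is invented; it only shows how the stages hand off to one another.

```python
import pandas as pd

# Invented raw data with a missing value to clean.
raw = pd.DataFrame({"item": ["bread", "bread", None, "milk", "milk", "milk"],
                    "qty":  [1, 2, 3, 1, 1, 2]})

cleaned = raw.dropna()                            # data cleaning
# (data integration would merge further sources here)
selected = cleaned[["item", "qty"]]               # data selection
summary = selected.groupby("item")["qty"].sum()   # transformation: aggregation
patterns = summary[summary >= 3]                  # "mining": a frequency pattern
print(patterns.sort_values(ascending=False))      # evaluation + presentation
```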
KDD Process Outcome
