FDS Unit 1
DATA MINING
DEFINITIONS:
Data Mining is a multidisciplinary field that involves discovering patterns, associations and
knowledge from large datasets. It combines techniques from statistics, machine learning, artificial
intelligence and database management to extract meaningful insights and information from raw
data.
The goal of data mining is to turn raw data into actionable knowledge, helping organisations make
informed decisions and gain a competitive advantage.
1. Define the Problem
It involves understanding the problem that needs to be solved and defining the objectives of
the data mining project.
It includes identifying the business problem, understanding the goals and objectives of the
project, and defining the KPIs that will be used to measure success.
4. Build the Model
This step involves building a predictive model using machine learning algorithms.
This includes selecting an appropriate algorithm, training the model on the data, and
evaluating its performance.
This step is important because it is the heart of the data mining process and involves
developing a model that can accurately predict outcomes on new data.
5. Deploy the Model
This step involves deploying the model into the production environment.
This includes integrating the model into existing systems and processes to make predictions
in real-time.
This step is important because it allows the model to be used in a practical setting and to
generate value for the organization.
The KDD process is iterative, and multiple iterations of the above steps are usually required to
extract accurate knowledge from the data.
3. Data Selection:
Data selection is defined as the process where data relevant to the analysis is
decided and retrieved from the data collection.
For this we can use Neural Networks, Decision Trees, Naive Bayes, Clustering, and
Regression methods.
4. Data Transformation:
Data Transformation is defined as the process of converting data into the appropriate form
required by the mining procedure.
Data Transformation is a two-step process:
1. Data Mapping: Assigning elements from source base to destination to capture
transformations.
2. Code generation: Creation of the actual transformation program.
5. Data Mining:
Data mining is defined as the set of techniques that are applied to extract potentially
useful patterns.
It transforms task-relevant data into patterns and decides the purpose of the model,
using classification or characterization.
6. Pattern Evaluation:
The interpreted knowledge is evaluated in the context of domain and business goals.
Stakeholders assess the practical implications of the discovered knowledge and
consider how it can be applied to improve decision-making or solve specific
problems.
9. Decision Making:
Knowledge and insights derived from data mining are implemented into the
operational environment.
It involves integrating the findings into existing systems and developing new applications
based on the discovered knowledge.
11. Monitoring and Maintenance:
The final step involves monitoring the performance of the deployed models and
maintaining them over time.
It ensures that the knowledge remains relevant and effective as new data becomes
available.
COMPONENTS OF DATA MINING SYSTEM / DATA MINING ARCHITECTURE
1. Data Source:
The actual source of data is the database, data warehouse, world wide web,
text files and other documents.
A huge amount of historical data is required for data mining to be successful.
A data warehouse may comprise one or more databases, text files, spreadsheets,
etc.
Data cleaning, Integration and Selection:
Before storing the data to the database or data warehouse server, data must
be cleaned, integrated and selected.
Since the data comes from different sources, data may not be complete and
accurate, so data should be cleaned and unified.
Only the data of interest will be selected and passed to the server.
2. Database or data warehouse server:
The database or data warehouse server consists of the original data that is
ready to be processed.
It is responsible for retrieving the relevant data that is based on user’s data
mining request.
Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns.
It includes concept hierarchies, which are used to organise attributes or attribute values
into different levels of abstraction.
3. Data Mining Engine:
It consists of a set of functional modules for tasks such as characterisation, association,
classification, cluster analysis, prediction, time series analysis and deviation analysis.
4. Pattern Evaluation Module:
It employs interestingness measures and interacts with data mining modules
so as to focus the search towards interesting patterns.
It uses interestingness thresholds to filter out discovered patterns.
It may be integrated with the mining modules, depending on the data mining method
used.
5. Graphical User Interface:
It communicates between the users and the data mining system.
It allows the user to interact with the system by specifying a data mining query.
It allows users to browse database and data warehouse schemas or data
structures, evaluate mined patterns and visualise the patterns in different
forms.
DATA MINING TECHNIQUES
1. Classification
2. Clustering
3. Regression
4. Association Rule Mining
5. Text Mining
6. Decision Trees
7. Anomaly Detection
8. Neural Networks
9. Time Series Analysis
10. Collaborative Filtering
1. Classification:
This technique is used to obtain important and relevant information about data and
metadata.
It helps to classify data into different classes.
It involves identifying patterns in the data and labelling the data with predefined classes.
Classification algorithms are then applied to unseen data, assigning features or unknown
objects to a class.
Classification problems are of two types, Binary classification and Multiclass classification.
o Binary classification: It involves categorising instances into two classes such as spam
or non-spam emails, fraud or non-fraud transactions
o Multiclass Classification: It involves categorising instances into more than two
classes such as classifying emails into multiple categories such as spam, promotions,
updates.
The data set used is of two types:
o Training Set: Used to train the classification model, consisting of instances with
known class labels.
o Test Set: Used to evaluate the performance of the trained model, consisting of
instances whose class labels are withheld from the model during prediction.
Data Preparation: Data is cleaned and pre-processed; missing values and noisy data are
handled.
Feature Selection: Relevant features that contribute to the model are selected.
Model Training: A classification algorithm is used to learn the mapping between features and
class labels from the training set.
Applications: Classification is widely used in various domains including Email, spam detection, credit
scoring, medical diagnosis, fraud detection, sentiment analysis and more.
Algorithms Used: Popular classification algorithms are KNN (K-Nearest Neighbours), Decision Tree,
Random Forest, etc.
2. Clustering:
Clustering groups data points so that points within the same cluster are more similar to
each other than to points in other clusters.
Unlike classification, it is unsupervised: no predefined class labels are required.
Algorithms Used: Popular clustering algorithms include K-Means and hierarchical clustering.
3. Regression
Regression models the relationship between a dependent (target) variable and one or more
independent variables, and is typically used to predict numeric values.
Types:
o Logistic Regression: the dependent variable is categorical.
o Non-Linear Regression: the relationship between the variables is modelled with a
non-linear function.
Applications: Used in finance, healthcare, engineering, sales forecasting, price prediction, risk
assessment, estimating sales revenue, forecasting sales prices and more.
4. Association Rule Mining
This technique is used to identify patterns or associations among variables in a large dataset.
The goal of association rule mining is to discover interesting and meaningful relationships
between variables that can be used to make informed decisions.
It examines the frequency of co-occurrence of variables in a dataset, and then identifies the
patterns or rules that occur most frequently.
Algorithms Used:
Apriori Algorithm: It is specially designed for association rule mining in transactional
databases. It is used to discover frequent item sets and generate association rules
based on their occurrence in a dataset.
FP-growth algorithm: Frequent Pattern Growth is an alternative data mining
algorithm for discovering frequent item sets in transactional databases designed
especially for association rule mining. It uses tree-based structure to efficiently mine
frequent patterns and generate association rules.
Frequent Item Sets: Subsets of items that frequently appear together in the dataset.
Association Rules: Logical relationships or patterns discovered in data sets using association rule
mining techniques. It shows the interesting connections between different items, events or variables
based on their co-occurrence in transactions or observations.
Confidence: It is a measure of the reliability of an association rule. It indicates the likelihood that the
presence of one item implies the presence of another item in the transaction.
5. Text Mining
It involves analysing and extracting useful information from unstructured textual data such as
documents, articles, emails and social media posts.
It is also known as text analytics and draws heavily on Natural Language Processing (NLP).
It involves processing, analysing and transforming textual data to discover patterns,
relationships and knowledge.
Text Mining Process involves following steps:
o Text Preprocessing: It involves transforming raw text data into a suitable format for
analysis. It involves tasks such as removing punctuation, converting to lowercase,
tokenization, and removing stop words (‘the’, ‘if’); see the sketch after this list.
o Feature Extraction: It extracts relevant features or attributes from the text data to
represent it in a numerical or structured format that can be used for analysis
o Text Mining Techniques: It involves tasks such as text classification, sentiment
analysis, topic modelling, and named entity recognition, depending on the desired
insights and knowledge to be extracted.
o Model Training and Evaluation: Text data is divided into training and evaluation sets.
Machine learning or statistical models are trained on the training data and their
performance is evaluated using appropriate metrics.
o Interpretation and Visualisation of Results: The results obtained from text mining
techniques applied are interpreted and analysed. The findings are visualised using
charts or graphs to communicate the insights effectively.
o Iteration and Refinement: The results are analysed and the text mining process is
iterated as required. To improve the quality and relevance of the extracted insights,
parameters are fine-tuned and preprocessing steps are adjusted.
o Reporting and Deployment: A comprehensive report summarizing the findings,
insights and recommendations derived from the text mining process is prepared. If
applicable, the results are communicated to stakeholders.
Applications: Used in sentiment analysis, topic modelling, content classification, text
summarisation, customer feedback analysis, Healthcare informatics, Legal Document
Analysis
6. Decision Trees
A decision tree is a flowchart-like model in which each internal node tests an attribute,
each branch represents an outcome of the test, and each leaf node holds a class label or
predicted value.
Decision trees are popular because the learned rules are easy to interpret and explain.
7. Anomaly Detection:
Anomaly detection, also known as outlier detection, is a technique used in data mining and
machine learning to identify patterns or instances that deviate significantly from the norm
or expected behaviour within a data set.
Anomalies represent data points that are rare, unusual, or indicative of potential issues,
errors or interesting events.
Types of Anomalies:
o Point Anomalies: Individual data points that deviate from normal behaviour.
o Contextual Anomalies: Instances that are anomalous in a specific context but not in
others.
o Collective Anomalies: A group of related data points that collectively exhibit
anomalous behaviour.
Approaches to Anomaly Detection:
o Statistical Methods: Use statistical measures like mean, median, standard deviations
to identify anomalies
o Machine Learning Algorithms: Supervised and unsupervised learning methods,
including clustering, classification can be employed for anomaly detection
o Deep Learning: Neural networks can capture complex patterns and identify
anomalies in high-dimensional data.
Algorithms: Isolation Forest, One-Class SVM (Support Vector Machine)
Applications: Cybersecurity, Fraud detection, Health Monitoring, Industrial Systems.
8. Neural Networks:
A neural network is a specific type of machine learning model that is often used in AI and
deep learning.
It is a set of connected input and output units where each connection has a weight
associated with it.
Consists of interconnected nodes or “neurons” that process information.
These neurons are organized into layers, with
each layer responsible for a specific aspect of the
computation.
ADVANTAGES:
Neural networks can model complex, non-linear relationships and can automatically learn
useful features from raw data.
Example:
Image recognition
Speech recognition
Natural language processing
9. Time Series Analysis
It is a technique used for analysing and forecasting data points collected over time.
It involves analysing data points that are measured at regular intervals of time to
identify patterns, trends, and seasonality.
Goal is to make predictions about future values of the time series by modelling the
underlying patterns in the data.
Time series can be either univariate, where only one variable is measured over time,
or multivariate, where multiple variables are measured over time.
Used in predicting stock prices, forecasting weather patterns, and predicting demand
for products.
ADVANTAGES
1. Ability to capture trends and seasonality in the data, and flexibility in modelling
different types of time series.
2. Ability to provide forecasts and confidence intervals.
10. Collaborative Filtering
Collaborative filtering recommends items to a user based on the preferences of other
users with similar tastes.
It is the core technique behind many recommender systems, such as product, movie and
music recommendations.
Diverse data types are a significant issue in data mining, highlighting the complexity of
dealing with heterogeneous data sources.
Data mining often involves integrating data from various formats, such as text,
images, and structured databases.
Each data type presents unique challenges in terms of preprocessing, feature
extraction, and modelling, requiring specialized approaches and tools to tackle these
complexities effectively.
CHALLENGES OF DATA MINING
1]Data Quality
The quality of data used in data mining is one of the most significant challenges.
The accuracy, completeness, and consistency of the data affect the accuracy of the
results obtained.
The data may contain errors, omissions, duplications, or inconsistencies, which may
lead to inaccurate results.
Moreover, the data may be incomplete, meaning that some attributes or values are
missing, making it challenging to obtain a complete understanding of the data.
Data quality issues can arise due to a variety of reasons, including data entry errors,
data storage issues, data integration problems, and data transmission errors.
To address these challenges, data mining practitioners must apply data cleaning and
data preprocessing techniques to improve the quality of the data
2]Data Complexity
Data complexity refers to the vast amounts of data generated by various sources,
such as sensors, social media, and the internet of things (IoT).
The complexity of the data may make it challenging to process, analyze, and
understand.
In addition, the data may be in different formats, making it challenging to integrate
into a single dataset.
To address this challenge, data mining practitioners use advanced techniques such as
clustering, classification, and association rule mining.
These techniques help to identify patterns and relationships in the data, which can
then be used to gain insights and make predictions.
3]Data Privacy and Security
Data mining often relies on large volumes of personal or sensitive data.
Organisations must protect this data against breaches and misuse while still making it
available for analysis, and must comply with privacy regulations governing its collection
and use.
5]Ethics
Data mining raises ethical concerns related to the collection, use, and dissemination
of data. The data may be used to discriminate against certain groups, violate privacy
rights, or perpetuate existing biases. Moreover, data mining algorithms may not be
transparent, making it challenging to detect biases or discrimination.
APPLICATIONS OF DATA MINING
RETAIL INDUSTRY:
Design of data warehouses: Retail data covers a wide spectrum, including sales,
customers, employees, goods transportation, consumption and services. The outcomes of
data analysis and data mining can help guide the design and development of
data warehouse structures.
Multidimensional analysis of Sales, Customers, Products, time and region: Retail
industry requires timely information regarding customer needs, product needs and
fashions as well as the quality, cost, profit and services of commodities.
Customer Segmentation: Data mining divides customers into distinct segments
based on their behaviour, preferences and purchase history. This can be used for
targeted marketing campaigns, personalised promotions and customized product
recommendations for each segment.
Analysis of the effectiveness of sales campaigns: Retail industry conducts sales
campaigns using advertisements, coupons and various kinds of discounts and
bonuses to promote products and attract customers. Multidimensional analysis can
be used to analyse the effectiveness of sales campaign that help improve company
profits.
Inventory Management: It involves analysing historical sales data to predict future
demand and optimize inventory levels.
Customer Retention: Customer loyalty and purchase trends can be analysed in a
systematic way. Goods purchased at different periods by the same customers can be
grouped into sequences. Sequence pattern mining can be used to investigate
changes in customer consumption and suggest adjustments to the pricing and variety
of goods in order to help retain customers and attract new ones.
Purchase recommendations and cross-reference of items: Market Basket Analysis
identifies associations and relationships between products that customers tend to
purchase together. Such information can be used to form purchase
recommendations.
Price Optimization: Data Mining helps determine the optimal pricing strategy by
analysing customer behaviour, competitor pricing and market trends.
Fraud Detection: Data Mining helps detect and prevent fraudulent activities such as
payment fraud or returns fraud. It is used to enhance security measures, reduce
financial losses and maintain customer trust.
HEALTH CARE:
Clinical data analysis involves the examination and interpretation of information derived
from medical records, patient data and other healthcare sources.
Electronic Health Records Analysis: The objective is to analyse patient data stored in
electronic health records for clinical decision support and quality improvement.
Disease Prediction and Risk Assessment: The objective is to predict the likelihood of
disease occurrence and assess the risk factors by machine learning algorithms,
logistic regressions etc.,
Clinical Trials and Research: The objective is to analyse data from clinical trials to
assess the efficacy and safety of interventions. Statistical analysis and regression
analysis are used to evaluate treatment outcomes and draw conclusions from research
data.
Patient Stratification: The objective is to group patients based on similar
characteristics, enabling personalised treatment plan by clustering techniques such
as K-means, hierarchical clustering etc.,
Outcome and Performance Measurement: The objective is to evaluate the
effectiveness of healthcare interventions and assess the performance of Healthcare
providers.
Patient Reported Outcomes Analysis: The objective is to incorporate patient
perspectives and reported outcomes into clinical research and decision-making.
Disease Diagnostics and Biomarker Discovery: The objective is to identify
biomarkers for diseases and understand disease mechanisms. These are crucial aspects
of clinical data analysis that aim to identify reliable indicators of disease and
contribute to early detection, accurate diagnosis and personalised treatment
strategies.
Clinical Data Integration: The objective is to integrate diverse clinical data sources,
including Electronic Health Records (EHR), laboratory results and medical imaging to
build a comprehensive patient profile.
FINANCIAL SECTOR
Data mining plays a crucial role in financial data analysis, helping financial institutions,
investment firms and other organisations extract valuable insights from vast amounts of data.
EDUCATION:
For analysing the education sector, data mining uses the Educational Data Mining (EDM)
method.
This method generates patterns that can be used by both learners and educators.
Using EDM, we can perform several educational tasks:
1. Predicting student admission in higher education
2. Student profiling
3. Predicting student performance
4. Evaluating teachers' teaching performance
5. Curriculum development
6. Predicting student placement opportunities
RESEARCH: