FDS Unit 1
DATA MINING
DEFINITIONS:
Data Mining is a multidisciplinary field that involves discovering patterns, associations and
knowledge from large datasets. It combines techniques from statistics, machine learning, artificial
intelligence and database management to extract meaningful insights and information from raw
data.
The goal of data mining is to turn raw data into actionable knowledge, helping organisations make
informed decisions and gain a competitive advantage.
1. Define the Problem
It involves understanding the problem that needs to be solved and defining the objectives of
the data mining project.
It includes identifying the business problem, understanding the goals and objectives of the
project, and defining the KPIs that will be used to measure success.
4. Build the Model
This step involves building a predictive model using machine learning algorithms.
This includes selecting an appropriate algorithm, training the model on the data, and
evaluating its performance.
This step is important because it is the heart of the data mining process and involves
developing a model that can accurately predict outcomes on new data.
5. Deploy the Model
This step involves deploying the model into the production environment.
This includes integrating the model into existing systems and processes to make predictions
in real-time.
This step is important because it allows the model to be used in a practical setting and to
generate value for the organization.
The KDD process is iterative, and multiple iterations of the above steps are usually required to
extract accurate knowledge from the data.
3. Data Selection:
Data selection is defined as the process where data relevant to the analysis is
decided and retrieved from the data collection.
For this we can use Neural Networks, Decision Trees, Naive Bayes, Clustering, and
Regression methods.
4. Data Transformation:
Data Transformation is defined as the process of converting data into the appropriate form
required by the mining procedure.
Data Transformation is a two-step process:
1. Data Mapping: Assigning elements from source base to destination to capture
transformations.
2. Code generation: Creation of the actual transformation program.
5. Data Mining:
Data mining is defined as the set of techniques that are applied to extract potentially
useful patterns.
It transforms task-relevant data into patterns and decides the purpose of the model,
using classification or characterization.
6. Pattern Evaluation:
The interpreted knowledge is evaluated in the context of domain and business goals.
Stakeholders assess the practical implications of the discovered knowledge and
consider how it can be applied to improve decision-making or solve specific
problems.
9. Decision Making:
Knowledge and insights derived from data mining are implemented into the
operational environment.
It involves integrating the findings into existing systems and developing new applications
based on the discovered knowledge.
11. Monitoring and Maintenance:
The final step involves monitoring the performance of the deployed models and
maintaining them over time.
It ensures that the knowledge remains relevant and effective as new data becomes
available.
COMPONENTS OF DATA MINING SYSTEM / DATA MINING ARCHITECTURE
1. Data Source:
The actual source of data is the database, data warehouse, world wide web,
text files and other documents.
A huge amount of historical data is required for data mining to be successful.
A data warehouse may comprise one or more databases, text files, spreadsheets,
etc.
Data cleaning, Integration and Selection:
Before storing the data to the database or data warehouse server, data must
be cleaned, integrated and selected.
Since the data comes from different sources, data may not be complete and
accurate, so data should be cleaned and unified.
Only the data of interest will be selected and passed to the server.
2. Database or data warehouse server:
The database or data warehouse server consists of the original data that is
ready to be processed.
It is responsible for retrieving the relevant data that is based on user’s data
mining request.
Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns.
It includes concept hierarchies, which are used to organise attributes or attribute values
into different levels of abstraction.
3. Data Mining Engine:
It consists of a set of functional modules for tasks such as characterisation, association,
classification, cluster analysis, prediction, time series analysis and deviation analysis.
4. Pattern Evaluation Module:
It employs interestingness measures and interacts with data mining modules
so as to focus the search towards interesting patterns.
It uses interestingness thresholds to filter out discovered patterns.
It may be integrated with the mining modules, depending on the data mining method
used.
5. Graphical User Interface:
It communicates between the users and the data mining system.
It allows the user to interact with the system by specifying a data mining query.
It allows users to browse database and data warehouse schemas or data
structures, evaluate mined patterns and visualise the patterns in different
forms.
DATA MINING TECHNIQUES
1. Classification
2. Clustering
3. Regression
4. Association Rule Mining
5. Text Mining
6. Decision Trees
7. Anomaly Detection
8. Neural Networks
9. Time Series Analysis
10. Collaborative Filtering
1. Classification:
This technique is used to obtain important and relevant information about data and
metadata.
It helps to classify data into different classes.
It involves identifying patterns in the data and labelling the data with predefined classes.
Classification algorithms are then applied to unseen data, assigning features or unknown
objects to a class.
Classification problems are of two types, Binary classification and Multiclass classification.
o Binary classification: It involves categorising instances into two classes such as spam
or non-spam emails, fraud or non-fraud transactions
o Multiclass Classification: It involves categorising instances into more than two
classes such as classifying emails into multiple categories such as spam, promotions,
updates.
The data set used is of two types:
o Training Set: Used to train the classification model, consisting of instances with
known class labels.
o Test Set: Used to evaluate the performance of the trained model, consisting of
instances whose class labels are withheld from the model during prediction.
Data Preparation: Data is cleaned and pre-processed; missing values and noisy data are
handled.
Feature Selection: Relevant features that contribute to the model are selected.
Model Training: A classification algorithm is used to learn the mapping between features and
class labels from the training set.
Applications: Classification is widely used in various domains including Email, spam detection, credit
scoring, medical diagnosis, fraud detection, sentiment analysis and more.
Algorithms Used: Popular classification algorithms are KNN (K-Nearest Neighbours), Decision Tree,
Random Forest, etc.
2. Clustering:
Clustering groups data points so that points within the same cluster are more similar to
each other than to points in other clusters.
Unlike classification, it is unsupervised: no predefined class labels are required.
Algorithms Used: Popular clustering algorithms include K-Means and hierarchical clustering.
3. Regression
Regression models the relationship between a dependent (target) variable and one or more
independent variables, and is typically used to predict numeric values.
Types:
o Logistic Regression: the dependent variable is categorical.
o Non-Linear Regression: the relationship between the variables is modelled with a
non-linear function.
Applications: Used in finance, healthcare, engineering, sales forecasting, price prediction, risk
assessment, estimating sales revenue, forecasting sales prices and more.
4. Association Rule Mining
This technique is used to identify patterns or associations among variables in a large dataset.
The goal of association rule mining is to discover interesting and meaningful relationships
between variables that can be used to make informed decisions.
It examines the frequency of co-occurrence of variables in a dataset, and then identifies the
patterns or rules that occur most frequently.
Algorithms Used:
Apriori Algorithm: It is specially designed for association rule mining in transactional
databases. It is used to discover frequent item sets and generate association rules
based on their occurrence in a dataset.
FP-growth algorithm: Frequent Pattern Growth is an alternative data mining
algorithm for discovering frequent item sets in transactional databases designed
especially for association rule mining. It uses tree-based structure to efficiently mine
frequent patterns and generate association rules.
Frequent Item Sets: Subsets of items that frequently appear together in the dataset.
Association Rules: Logical relationships or patterns discovered in data sets using association rule
mining techniques. It shows the interesting connections between different items, events or variables
based on their co-occurrence in transactions or observations.
Confidence: It is a measure of the reliability of an association rule. It indicates the likelihood that the
presence of one item implies the presence of another item in the transaction.
5. Text Mining
It involves analysing and extracting useful information from unstructured textual data such as
documents, articles, emails and social media posts.
It is also known as text analytics and draws heavily on Natural Language Processing (NLP).
It involves processing, analysing and transforming textual data to discover patterns,
relationships and knowledge.
Text Mining Process involves following steps:
o Text Preprocessing: It involves transforming raw text data into a suitable format for
analysis. It involves tasks such as removing punctuation, converting to lowercase,
tokenization, and removing stop words (‘the’, ‘if’); see the sketch after this list.
o Feature Extraction: It extracts relevant features or attributes from the text data to
represent it in a numerical or structured format that can be used for analysis
o Text Mining Techniques: It involves tasks such as text classification, sentiment
analysis, topic modelling, and named entity recognition, depending on the desired
insights and knowledge to be extracted.
o Model Training and Evaluation: Text data is divided into training and evaluation sets.
Machine learning or statistical models are trained on the training data and their
performance is evaluated using appropriate metrics.
o Interpretation and Visualisation of Results: The results obtained from text mining
techniques applied are interpreted and analysed. The findings are visualised using
charts or graphs to communicate the insights effectively.
o Iteration and Refinement: The results are analysed and the text mining process is
iterated as required. To improve the quality and relevance of the extracted insights,
parameters are fine-tuned and preprocessing steps are adjusted.
o Reporting and Deployment: A comprehensive report summarizing the findings,
insights and recommendations derived from the text mining process is prepared. If
applicable, the results are communicated to stakeholders.
Applications: Used in sentiment analysis, topic modelling, content classification, text
summarisation, customer feedback analysis, Healthcare informatics, Legal Document
Analysis
6. Decision Trees
A decision tree is a flowchart-like model in which each internal node tests an attribute,
each branch represents an outcome of the test, and each leaf node holds a class label or
predicted value.
Decision trees are popular because the learned rules are easy to interpret and explain.
7. Anomaly Detection:
Anomaly detection, also known as outlier detection, is a technique used in data mining and
machine learning to identify patterns or instances that deviate significantly from the norm
or expected behaviour within a data set.
Anomalies represent data points that are rare, unusual, or indicative of potential issues,
errors or interesting events.
Types of Anomalies:
o Point Anomalies: Individual data points that deviate from normal behaviour.
o Contextual Anomalies: Instances that are anomalous in a specific context but not in
others.
o Collective Anomalies: A group of related data points that collectively exhibit
anomalous behaviour.
Approaches to Anomaly Detection:
o Statistical Methods: Use statistical measures like mean, median, standard deviations
to identify anomalies
o Machine Learning Algorithms: Supervised and unsupervised learning methods,
including clustering, classification can be employed for anomaly detection
o Deep Learning: Neural networks can capture complex patterns and identify
anomalies in high-dimensional data.
Algorithms: Isolation Forest, One-Class SVM (Support Vector Machine)
Applications: Cybersecurity, Fraud detection, Health Monitoring, Industrial Systems.
8. Neural Networks:
A neural network is a specific type of machine learning model that is often used in AI and
deep learning.
It is a set of connected input and output units where each connection has a weight
associated with it.
Consists of interconnected nodes or “neurons” that process information.
These neurons are organized into layers, with
each layer responsible for a specific aspect of the
computation.
ADVANTAGES:
Neural networks can model complex, non-linear relationships and can automatically learn
useful features from raw data.
Example:
Image recognition
Speech recognition
Natural language processing
9. Time Series Analysis
It is a technique used for analysing and forecasting data points collected over time.
It involves analysing data points that are measured at regular intervals of time to
identify patterns, trends, and seasonality.
Goal is to make predictions about future values of the time series by modelling the
underlying patterns in the data.
Time series can be either univariate, where only one variable is measured over time,
or multivariate, where multiple variables are measured over time.
Used in predicting stock prices, forecasting weather patterns, and predicting demand
for products.
ADVANTAGES
1. Ability to capture trends and seasonality in the data, and flexibility in modelling
different types of time series.
2. Ability to provide forecasts and confidence intervals.
10. Collaborative Filtering
Collaborative filtering recommends items to a user based on the preferences of other
users with similar tastes.
It is the core technique behind many recommender systems, such as product, movie and
music recommendations.
Diverse data types are a significant issue in data mining, highlighting the complexity of
dealing with heterogeneous data sources.
Data mining often involves integrating data from various formats, such as text,
images, and structured databases.
Each data type presents unique challenges in terms of preprocessing, feature
extraction, and modelling, requiring specialized approaches and tools to tackle these
complexities effectively.
CHALLENGES OF DATA MINING
1]Data Quality
The quality of data used in data mining is one of the most significant challenges.
The accuracy, completeness, and consistency of the data affect the accuracy of the
results obtained.
The data may contain errors, omissions, duplications, or inconsistencies, which may
lead to inaccurate results.
Moreover, the data may be incomplete, meaning that some attributes or values are
missing, making it challenging to obtain a complete understanding of the data.
Data quality issues can arise due to a variety of reasons, including data entry errors,
data storage issues, data integration problems, and data transmission errors.
To address these challenges, data mining practitioners must apply data cleaning and
data preprocessing techniques to improve the quality of the data
2]Data Complexity
Data complexity refers to the vast amounts of data generated by various sources,
such as sensors, social media, and the internet of things (IoT).
The complexity of the data may make it challenging to process, analyze, and
understand.
In addition, the data may be in different formats, making it challenging to integrate
into a single dataset.
To address this challenge, data mining practitioners use advanced techniques such as
clustering, classification, and association rule mining.
These techniques help to identify patterns and relationships in the data, which can
then be used to gain insights and make predictions.
3]Data Privacy and Security
Data mining often relies on large volumes of personal or sensitive data.
Organisations must protect this data against breaches and misuse while still making it
available for analysis, and must comply with privacy regulations governing its collection
and use.
5]Ethics
Data mining raises ethical concerns related to the collection, use, and dissemination
of data. The data may be used to discriminate against certain groups, violate privacy
rights, or perpetuate existing biases. Moreover, data mining algorithms may not be
transparent, making it challenging to detect biases or discrimination.
APPLICATIONS OF DATA MINING
RETAIL INDUSTRY:
Design of data warehouses: Retail data covers a wide spectrum, including sales,
customers, employees, goods transportation, consumption and services. The outcomes of
data analysis and data mining can help guide the design and development of
data warehouse structures.
Multidimensional analysis of Sales, Customers, Products, time and region: Retail
industry requires timely information regarding customer needs, product needs and
fashions as well as the quality, cost, profit and services of commodities.
Customer Segmentation: Data mining divides customers into distinct segments
based on their behaviour, preferences and purchase history. This can be used for
targeted marketing campaigns, personalised promotions and customized product
recommendations for each segment.
Analysis of the effectiveness of sales campaigns: Retail industry conducts sales
campaigns using advertisements, coupons and various kinds of discounts and
bonuses to promote products and attract customers. Multidimensional analysis can
be used to analyse the effectiveness of sales campaign that help improve company
profits.
Inventory Management: It involves analysing historical sales data to predict future
demand and optimize inventory levels.
Customer Retention: Customer loyalty and purchase trends can be analysed in a
systematic way. Goods purchased at different periods by the same customers can be
grouped into sequences. Sequence pattern mining can be used to investigate
changes in customer consumption and suggest adjustments to the pricing and variety
of goods in order to help retain customers and attract new ones.
Purchase recommendations and cross-reference of items: Market Basket Analysis
identifies associations and relationships between products that customers tend to
purchase together. Such information can be used to form purchase
recommendations.
Price Optimization: Data Mining helps determine the optimal pricing strategy by
analysing customer behaviour, competitor pricing and market trends.
Fraud Detection: Data Mining helps detect and prevent fraudulent activities such as
payment fraud or returns fraud. It is used to enhance security measures, reduce
financial losses and maintain customer trust.
HEALTH CARE:
Clinical data analysis involves the examination and interpretation of information derived
from medical records, patient data and other healthcare sources.
Electronic Health Records Analysis: The objective is to analyse patient data stored in
electronic health records for clinical decision support and quality improvement.
Disease Prediction and Risk Assessment: The objective is to predict the likelihood of
disease occurrence and assess the risk factors by machine learning algorithms,
logistic regressions etc.,
Clinical Trials and Research: The objective is to analyse data from clinical trials to
assess the efficacy and safety of interventions. Statistical analysis and regression
analysis are used to evaluate treatment outcomes and draw conclusions from research
data.
Patient Stratification: The objective is to group patients based on similar
characteristics, enabling personalised treatment plan by clustering techniques such
as K-means, hierarchical clustering etc.,
Outcome and Performance Measurement: The objective is to evaluate the
effectiveness of healthcare interventions and assess the performance of Healthcare
providers.
Patient Reported Outcomes Analysis: The objective is to incorporate patient
perspectives and reported outcomes into clinical research and decision-making.
Disease Diagnostics and Biomarker Discovery: The objective is to identify
biomarkers for diseases and understand disease mechanisms. These are crucial aspects
of clinical data analysis that aim to identify reliable indicators of disease and
contribute to early detection, accurate diagnosis and personalised treatment
strategies.
Clinical Data Integration: The objective is to integrate diverse clinical data sources,
including Electronic Health Records (EHR), laboratory results and medical imaging to
build a comprehensive patient profile.
FINANCIAL SECTOR
Data mining plays a crucial role in financial data analysis, helping financial institutions,
investment firms and other organisations extract valuable insights from vast amounts of data.
EDUCATION:
For analysing the education sector, data mining uses the Educational Data Mining (EDM)
method.
This method generates patterns that can be used by both learners and educators.
Using EDM, we can perform several educational tasks:
1. Predicting student admission in higher education
2. Student profiling
3. Predicting student performance
4. Evaluating teachers' teaching performance
5. Curriculum development
6. Predicting student placement opportunities
RESEARCH: