
Introduction To Data Mining


6th Sem BCA Data Mining - KVN

Fundamentals of Data Science 6th Sem. BCA - NEP


Unit-I
Data Mining
Topics:
Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) Vs Data
Mining, DBMS Vs Data Mining, Data Mining Techniques, Problems, Issues and Challenges in
Data Mining, Data Mining Applications.
What is Data?
Data is a word we hear everywhere nowadays. In general, data is a collection of facts,
information, and statistics, and it can take various forms such as numbers, text, sound,
images, or any other format.
Data can be generated by:
• Humans
• Machines
• Human-machine combinations

It can be generated wherever information is produced and stored, in structured or
unstructured formats.
What is Information?
Information is data that has been processed, organized, or structured in a way that makes
it meaningful, valuable and useful. It is data that has been given context, relevance and
purpose. It gives knowledge, understanding and insights that can be used for
decision-making, problem-solving, communication and various other purposes.
Why is data important?
• Data helps in making better decisions.
• Data helps in solving problems by finding the reason for underperformance.
• Data helps one evaluate performance.
• Data helps one improve processes.
• Data helps one understand consumers and the market.

Categories of Data
Data can be categorized into two main types:
• Structured Data: This type of data is organized into a specific format, making it
easy to search, analyze and process. Structured data is found in relational
databases and includes information like numbers, dates and categories.
• Unstructured Data: Unstructured data does not conform to a specific structure or
format. It may include text documents, images, videos, and other data that is
not easily organized or analyzed without additional processing.

Data becomes valuable when it is processed, analyzed, and interpreted to extract
meaningful insights or information. This process involves various techniques and tools,
such as data mining, data analytics, and machine learning.
What is the Data Processing Cycle?
The data processing cycle refers to the iterative sequence of transformations applied to raw
data to generate meaningful insights. It can be viewed as a pipeline with distinct stages:
1. Data Acquisition: This stage encompasses the methods used to collect raw data
from various sources. This could involve sensor readings, scraping web data, or
gathering information through surveys and application logs.
2. Data Preparation: Raw data is inherently messy and requires cleaning and pre-
processing before analysis. This stage involves tasks like identifying and handling
missing values, correcting inconsistencies, formatting data into a consistent
structure, and potentially removing outliers.
3. Data Input: The pre-processed data is loaded into a system suitable for further
processing and analysis. This often involves converting the data into a machine-
readable format and storing it in a database or data warehouse.
4. Data Processing: Here, the data undergoes various manipulations and
transformations to extract valuable information. This may include aggregation,
filtering, sorting, feature engineering (creating new features from existing ones), and
applying machine learning algorithms to uncover patterns and relationships.
5. Data Output: The transformed data is then analyzed using various techniques to
generate insights and knowledge. This could involve statistical analysis, visualization
techniques, or building predictive models.
6. Data Storage: The processed data and the generated outputs are stored in a secure
and accessible format for future use, reference, or feeding into further analysis
cycles.
The data processing cycle is iterative, meaning the output from one stage can become the
input for another. This allows for continuous refinement, deeper analysis, and the creation
of increasingly sophisticated insights from the raw data.
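The stages above can be sketched as a chain of small functions. This is only a minimal illustration: the sensor readings, the function names, and the choice of a mean as the "processing" step are all assumptions made for the example, not part of the notes.

```python
import statistics

# Hypothetical raw sensor readings; None marks a missing value,
# and the repeated ("s1", 21.5) record is a duplicate.
RAW_READINGS = [("s1", 21.5), ("s2", None), ("s1", 21.5), ("s3", 23.0), ("s2", 22.0)]

def acquire():
    """Stage 1: collect raw data (here, a hard-coded sample)."""
    return list(RAW_READINGS)

def prepare(records):
    """Stage 2: drop duplicates and records with missing values."""
    seen, clean = set(), []
    for record in records:
        if record[1] is None or record in seen:
            continue
        seen.add(record)
        clean.append(record)
    return clean

def load(records):
    """Stage 3: load into a machine-readable structure keyed by sensor."""
    table = {}
    for sensor, value in records:
        table.setdefault(sensor, []).append(value)
    return table

def process(table):
    """Stage 4: aggregate each sensor's readings into a mean."""
    return {sensor: statistics.mean(values) for sensor, values in table.items()}

def output(summary):
    """Stage 5: present the result as readable lines."""
    return [f"{sensor}: {mean:.1f}" for sensor, mean in sorted(summary.items())]

# Stage 6 (storage) would persist `summary`; it is kept in memory here.
summary = process(load(prepare(acquire())))
report = output(summary)
print(report)
```

Because each stage takes the previous stage's output as its input, any stage can be rerun or replaced on its own, which is what makes the cycle iterative.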


Introduction of DBMS (Database Management System)


A Database Management System (DBMS) is a software solution designed to
efficiently manage, organize, and retrieve data in a structured manner. It serves as a
critical component in modern computing, enabling organizations to store, manipulate, and
secure their data effectively. From small applications to enterprise systems, DBMS plays a
vital role in supporting data-driven decision-making and operational efficiency.
What is a DBMS?
A DBMS is a system that allows users to create, modify, and query databases while
ensuring data integrity, security, and efficient data access. Unlike traditional file systems,
DBMS minimizes data redundancy, prevents inconsistencies, and simplifies data
management with features like concurrent access and backup mechanisms. It organizes
data into tables, views, schemas, and reports, providing a structured approach to data
management.
Example:
A university database can store and manage student information, faculty records, and
administrative data, allowing seamless retrieval, insertion, and deletion of information as
required.
Key Features of DBMS
1. Data Modelling: Tools to create and modify data models, defining the structure and
relationships within the database.
2. Data Storage and Retrieval: Efficient mechanisms for storing data and executing
queries to retrieve it quickly.
3. Concurrency Control: Ensures multiple users can access the database
simultaneously without conflicts.
4. Data Integrity and Security: Enforces rules to maintain accurate and secure data,
including access controls and encryption.
5. Backup and Recovery: Protects data with regular backups and enables recovery in
case of system failures.
Types of DBMS
There are several types of Database Management Systems (DBMS), each tailored to
different data structures, scalability requirements, and application needs.
The most common types are as follows:
1. Relational Database Management System (RDBMS)
RDBMS organizes data into tables (relations) composed of rows and columns. It uses
primary keys to uniquely identify rows and foreign keys to establish relationships between
tables. Queries are written in SQL (Structured Query Language), which allows for efficient
data manipulation and retrieval.
Examples: MySQL, Oracle, Microsoft SQL Server and PostgreSQL.
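As a small illustration of primary and foreign keys, the sketch below uses SQLite through Python's built-in sqlite3 module. SQLite is chosen only because it needs no server; the table and column names (department, student, dept_id) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        dept_id    INTEGER NOT NULL REFERENCES department(dept_id)
    )
""")
conn.execute("INSERT INTO department VALUES (1, 'Computer Applications')")
conn.execute("INSERT INTO student VALUES (101, 'Asha', 1)")

# Join the two tables through the foreign key.
rows = conn.execute("""
    SELECT s.name, d.name
    FROM student AS s JOIN department AS d ON s.dept_id = d.dept_id
""").fetchall()
print(rows)

# A row that points at a non-existent department is rejected.
try:
    conn.execute("INSERT INTO student VALUES (102, 'Ravi', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

The primary key identifies each row, and the foreign key lets the database itself refuse a student record whose department does not exist.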


2. NoSQL DBMS
NoSQL systems are designed to handle large-scale data and provide high performance for
scenarios where relational models might be restrictive. They store data in various non-
relational formats, such as key-value pairs, documents, graphs, or columns. These flexible
data models enable rapid scaling and are well-suited for unstructured or semi-structured
data.
Examples: MongoDB, Cassandra, DynamoDB and Redis.
3. Object-Oriented DBMS (OODBMS)
OODBMS integrates object-oriented programming concepts into the database
environment, allowing data to be stored as objects. This approach supports complex data
types and relationships, making it ideal for applications requiring advanced data modelling
and real-world simulations.
Examples: ObjectDB, db4o.
Database Languages
Database languages are specialized sets of commands and instructions used to define,
manipulate, and control data within a database. Each language type plays a distinct role in
database management, ensuring efficient storage, retrieval, and security of data. The
primary database languages include:
1. Data Definition Language (DDL)
DDL (Data Definition Language) deals with database schemas and descriptions of how the
data should reside in the database.
• CREATE: creates a database and its objects (tables, indexes, views, stored
procedures, functions, and triggers)
• ALTER: alters the structure of the existing database
• DROP: deletes objects from the database
• TRUNCATE: removes all records from a table and releases the space allocated for
those records
• COMMENT: adds comments to the data dictionary
• RENAME: renames an object
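A few of these commands can be tried with Python's built-in sqlite3 module, as in the sketch below. SQLite is an assumption made for convenience: it supports CREATE, ALTER and DROP, and performs a rename through ALTER TABLE ... RENAME TO, but it has no TRUNCATE or COMMENT statement, so those are not shown. The table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def table_names():
    """List the tables currently defined in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [r[0] for r in rows]

# CREATE: define a new table.
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")

# ALTER: change the structure of the existing table by adding a column.
conn.execute("ALTER TABLE student ADD COLUMN phone TEXT")
columns = [row[1] for row in conn.execute("PRAGMA table_info(student)")]

# RENAME: in SQLite a rename is written as a form of ALTER TABLE.
conn.execute("ALTER TABLE student RENAME TO learner")
after_rename = table_names()

# DROP: remove the table and its definition.
conn.execute("DROP TABLE learner")
after_drop = table_names()

print(columns, after_rename, after_drop)
```

Note that every statement here changes the schema, not the rows, which is what separates DDL from DML.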
2. Data Manipulation Language (DML)
DML focuses on manipulating the data stored in the database, enabling users to retrieve,
add, update, and delete data.
• SELECT: retrieves data from a database
• INSERT: inserts data into a table
• UPDATE: updates existing data within a table
• DELETE: deletes records from a table (all of them, or only those matching a condition)
• MERGE: performs an UPSERT operation (insert or update)
• CALL: calls a PL/SQL or Java subprogram
• EXPLAIN PLAN: shows the data access path chosen for a statement
• LOCK TABLE: controls concurrency
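The four core DML commands can be sketched with Python's built-in sqlite3 module. SQLite is again an assumption made for convenience (MERGE, CALL, EXPLAIN PLAN and LOCK TABLE are Oracle-style commands and are not shown), and the student table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")

# INSERT: add rows (the ? placeholders keep values separate from the SQL text).
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", "111"), (2, "Ravi", "222"), (3, "Meena", "333")],
)

# UPDATE: change existing data in the matching rows only.
conn.execute("UPDATE student SET phone = ? WHERE id = ?", ("999", 2))

# DELETE: the WHERE clause limits which rows are removed;
# without it, every row in the table would be deleted.
conn.execute("DELETE FROM student WHERE id = ?", (3,))

# SELECT: retrieve what is left.
rows = conn.execute("SELECT id, name, phone FROM student ORDER BY id").fetchall()
print(rows)
```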


3. Data Control Language (DCL)


DCL commands manage access permissions, ensuring data security by controlling who can
perform certain actions on the database.
• GRANT: Provides specific privileges to a user (e.g., SELECT, INSERT).
• REVOKE: Removes previously granted permissions from a user.
4. Transaction Control Language (TCL)
TCL commands oversee transactional data to maintain consistency, reliability, and
atomicity.
• ROLLBACK: Undoes changes made during a transaction.
• COMMIT: Saves all changes made during a transaction.
• SAVEPOINT: Sets a point within a transaction to which one can later roll back.
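The three TCL commands can be seen working together in the following sketch, which assumes SQLite via Python's built-in sqlite3 module. The account table, the amounts, and the savepoint name before_fee are hypothetical.

```python
import sqlite3

# isolation_level=None turns off the module's implicit transactions,
# so the BEGIN / COMMIT / ROLLBACK below are exactly what the database sees.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100)")

def balance():
    """Read the current balance of account 1."""
    return conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]

conn.execute("BEGIN")
conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
conn.execute("SAVEPOINT before_fee")      # SAVEPOINT: mark a point to return to
conn.execute("UPDATE account SET balance = balance - 5 WHERE id = 1")
conn.execute("ROLLBACK TO before_fee")    # undo only the work done after the savepoint
conn.execute("COMMIT")                    # COMMIT: the 30 withdrawal becomes permanent
after_commit = balance()

conn.execute("BEGIN")
conn.execute("UPDATE account SET balance = 0 WHERE id = 1")
conn.execute("ROLLBACK")                  # ROLLBACK: the whole transaction is discarded
after_rollback = balance()

print(after_commit, after_rollback)
```

Both reads give 70: the fee was undone by the partial rollback, and the second transaction left no trace at all.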
5. Data Query Language (DQL)
DQL is a subset of DML, specifically focused on data retrieval.
• SELECT: The primary DQL command, used to query data from the database without
altering its structure or contents.
Paradigm Shift from File System to DBMS
Before the advent of modern Database Management Systems (DBMS), data was managed
using basic file systems on hard drives. While this approach allowed users
to store, retrieve, and update files as needed, it came with numerous challenges.
A typical example can be seen in a file-based university management system, where data
was stored in separate sections such as Departments, Academics, Results, Accounts, and
Hostels. Certain information like student names and phone numbers was repeated
across multiple files, leading to the following issues:
1. Redundancy of Data
When the same data exists in multiple places, any update must be manually repeated
everywhere. For instance, if a student changes their phone number, it must be updated
across all sections. This duplication also wastes storage.
2. Inconsistency of Data
Data is said to be inconsistent if multiple copies of the same data do not match each other.
If a student’s phone number differs between the Accounts section and the Academics section,
the data is inconsistent. Inconsistency may result from typing errors or from not updating
all copies of the same data.
3. Complex Data Access
A user must know the exact location of a file to access its data, so the process is
cumbersome and tedious. Imagine, for example, searching for one student’s hostel allotment
number among 10,000 unsorted student records.


4. Lack of Security
File systems provided limited control over who could access certain data. A student who
gained access to a file with grades might easily alter it without proper authorization,
compromising data integrity.
5. No Concurrent Access
File systems were not designed for multiple users working at the same time. If one user
was editing a file, others had to wait, which hindered collaboration and slowed down
workflows.
6. No Backup and Recovery
File systems lacked built-in mechanisms for creating backups or recovering data after a loss.
If a file was accidentally deleted or corrupted, there was no easy way to restore it,
potentially causing permanent data loss.
Advantages of DBMS
1. Data organization: A DBMS allows for the organization and storage of data in a
structured manner, making it easy to retrieve and query the data as needed.
2. Data integrity: A DBMS provides mechanisms for enforcing data integrity
constraints, such as constraints on the values of data and access controls that
restrict who can access the data.
3. Concurrent access: A DBMS provides mechanisms for controlling concurrent access
to the database, to ensure that multiple users can access the data without
conflicting with each other.
4. Data security: A DBMS provides tools for managing the security of the data, such as
controlling access to the data and encrypting sensitive data.
5. Backup and recovery: A DBMS provides mechanisms for backing up and recovering
the data in the event of a system failure.
6. Data sharing: A DBMS allows multiple users to access and share the same data,
which can be useful in a collaborative work environment.
Disadvantages of DBMS
1. Complexity: DBMS can be complex to set up and maintain, requiring specialized
knowledge and skills.
2. Performance overhead: The use of a DBMS can add overhead to the performance of
an application, especially in cases where high levels of concurrency are required.
3. Scalability: The use of a DBMS can limit the scalability of an application, since it
requires the use of locking and other synchronization mechanisms to ensure data
consistency.
4. Cost: The cost of purchasing, maintaining and upgrading a DBMS can be high,
especially for large or complex systems.
5. Limited Use Cases: Not all use cases are suitable for a DBMS; some solutions don’t
need high reliability, consistency or security and may be better served by other
types of data storage.


Applications of DBMS
1. Enterprise Information: Sales, accounting, human resources, manufacturing, and online
retail.
2. Banking and Finance: Banks maintain customer details, accounts, loans, banking
transactions, and credit card transactions. Finance firms store information about
sales and holdings, and about purchases of financial stocks and bonds.
3. University: Maintaining information about student course enrollment,
student grades, and staff roles.
4. Airlines: Reservations and schedules.
5. Telecommunications: Maintaining prepaid and postpaid billing records.

A Database Management System (DBMS) is an essential tool for efficiently managing,
organizing, and retrieving large volumes of data across various industries. Its ability to
handle data securely, ensure integrity, support concurrent access, and provide backup and
recovery options makes it indispensable for modern data-driven applications. While
DBMSs come with complexities and costs, their benefits in terms of data management and
security far outweigh the challenges, making them a crucial component in any data-centric
environment.


Data Mining
Content
• What is Data Mining?
• Importance of Data Mining
• Process of Data Mining
• Applications of Data Mining
• Advantages of Data Mining
• Disadvantages of Data Mining

What is Data Mining?


Data mining is the process of analysing large data sets or data warehouses to extract
useful information with the help of computers, automation tools, and a wide range of
techniques. The term is often used interchangeably with KDD (Knowledge Discovery in
Databases), although strictly speaking data mining is one step within the KDD process.
In simple words, data mining is a process in which the available raw data is examined
to extract useful information from it.
As a branch of data science, data mining is a broad concept that includes establishing
relationships, finding problems, and applying artificial intelligence and machine learning
(developing algorithms to predict behaviour or outcomes).
Importance of Data Mining
• With the help of data mining, a large stack of data from multiple sources can be
analyzed easily for patterns and relationships.
• Data mining helps in making predictions and smart decisions.
• Most importantly with the help of data collected from various users the data mining.
Process of Data Mining
Data mining is an interactive process consisting of six steps.
1. Understanding the Business Objectives – This is the first and a very important step.
Data scientists and the various stakeholders work together to understand the objectives
and scope of the business. Once a clear understanding is reached, we move to the next
phase.
2. Collection of Data – In this phase, data mining experts are engaged to gather the
relevant data from various sources (e.g. social sites, service data), and the data is
stored in a data warehouse.
3. Preparation of Data – This is a time-consuming step in which the data gathered in the
previous step is cleaned. It consists of three phases:
• Extraction – The data is extracted from various sources and stored in
warehouses.
• Transforming – The data is cleaned, i.e. duplicate data is removed, missing
values are filled in, etc.
• Loading – The data produced by the previous two phases is fed to the
database.
4. Model Building – An appropriate model (e.g. clustering, regression analysis) is
selected based on the earlier analysis of the data. In this phase various tools,
algorithms, and statistical and mathematical approaches are applied.
5. Evaluation – Once the model is ready and all the values of data are aggregated, the
results of the model are evaluated; they must meet the objectives set in phase 1.
6. Deployment – After the model has been evaluated, it is deployed, and its results are
presented in the form of graphs or spreadsheets.
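The three phases of data preparation in step 3 (extraction, transforming, loading) can be sketched as follows. Everything specific here is an assumption made for illustration: the two in-memory sources, the customer rows, the "UNKNOWN" placeholder for missing values, and the use of an in-memory SQLite database as the target.

```python
import sqlite3

# Hypothetical rows as they might arrive from two sources: (customer, city).
SOURCE_A = [("Asha", "Mysuru"), ("Ravi", None)]
SOURCE_B = [("Asha", "Mysuru"), ("Meena", "Hubballi")]

def extract():
    """Extraction: pull the rows from every source into one list."""
    return SOURCE_A + SOURCE_B

def transform(rows):
    """Transforming: fill missing cities with a placeholder and remove duplicates."""
    cleaned = []
    for name, city in rows:
        row = (name, city if city is not None else "UNKNOWN")
        if row not in cleaned:
            cleaned.append(row)
    return cleaned

def load(rows):
    """Loading: feed the cleaned rows into a database table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (name TEXT, city TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    return conn

conn = load(transform(extract()))
stored = conn.execute("SELECT name, city FROM customer ORDER BY name").fetchall()
print(stored)
```

Only after this preparation does model building (step 4) see the data, which is why the notes describe this step as time-consuming.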
Applications of Data Mining
There are several applications of data mining that help businesses or organizations
gain an advantage over their competitors.
1. Insurance Companies – With the help of data analytics, insurance companies
can address fraud, customer attrition, etc. Data mining lets these
companies find new ways to target customers and offer competitive prices to
existing customers.
2. Education – The education sector is a newer field in which data mining is used
to predict learning outcomes. Data mining helps institutes get the best
results and identify the groups of students that need extra attention.
3. Banking – Data mining in banking helps to understand customer behavior with
automated algorithms. It also helps banks detect fraud and keep track of a
customer’s purchasing history in order to offer them various
banking facilities, e.g. credit cards, loans, etc.
4. Marketing – Marketing is the field that benefits most from data mining. It helps
organizations or institutes better understand customer behavior.
Data mining helps group customers by several criteria, e.g. age, purchase
history, income level, location, etc. With these benefits, companies
can retain their customers for a long time by targeting their specific
needs and requirements.
5. Retail – With the help of marketing data insights, retail businesses such as
grocery shops, dairy shops, etc. can learn about customer behavior in
their locality and stock items as needed.
6. Manufacturing – For manufacturers, data mining can help in the early
detection of problems, maintenance of products, etc. It can also help to
design products according to customer needs at the system design level.
7. Healthcare – With the help of data mining, doctors are able to know their
patients a lot better and diagnose problems more easily. Data mining helps to
enhance health care facilities at a reduced cost. It also helps organizations
detect fraud and waste, and manage cost-effective
relationships between patients and hospitals.
Advantages of Data Mining
The advantages of data mining are as follows:
• It helps organizations gather authentic and correct information.
• It can be easily integrated into both new and existing platforms.
• With the help of data mining, an organization can make improved plans and
decisions.
• It is cost effective.
• It helps in reducing the cost of products by fostering competition among
companies.
• Data mining helps in early detection of fraud, defects in complex designs, etc.
Disadvantages of Data Mining
• Data mining is not always accurate and in certain cases can lead to
unintended consequences.
• A large database is required for mining, which makes the process hard.
• Selecting the right tool for a given business is a cumbersome task, as each
tool uses a different algorithm.
• Data mining is hard and complex, so proper training on the various tools is
required.


DBMS Vs Data Mining

1) DBMS: A DBMS is a software system used for storing, manipulating, and managing data
efficiently, ensuring data is organized and accessible.
Data Mining: Data mining is the process of discovering patterns, relationships, and
insights within large datasets.

2) DBMS: A DBMS is crucial as it forms the foundation for managing data effectively
within an organization.
Data Mining: Data mining skills are valuable for uncovering insights from vast amounts
of data to support decision-making processes.

3) DBMS: Focuses on the storage, retrieval, and management of data in a structured manner.
Data Mining: Focuses on analyzing data to extract meaningful patterns and relationships
that may not be readily apparent.

4) DBMS: Studying DBMS builds knowledge of database design, normalization, transactions,
and query optimization, which are fundamental for working with data in various applications.
Data Mining: Studying data mining covers techniques such as clustering, classification,
regression, and anomaly detection to extract valuable knowledge from data.

5) DBMS: Enables users to store, update, and retrieve data efficiently through the use of
relational databases.
Data Mining: Involves using algorithms and statistical techniques to uncover patterns and
trends in data that can be used for predictive analytics and decision making.

6) DBMS: Essential for creating and managing databases effectively, ensuring data
integrity, security, and consistency.
Data Mining: Focuses on exploring and analyzing data to discover hidden patterns and
trends that can be valuable for businesses and organizations.

7) DBMS: Provides tools and features for managing structured data in a systematic way.
Data Mining: Requires advanced analytics skills to explore unstructured or
semi-structured data to derive insights and make informed decisions.

8) DBMS: Training builds practical skills in designing and implementing databases using
DBMS tools like MySQL, Oracle, or SQL Server.
Data Mining: Training may involve hands-on experience with tools like Python and R for
analyzing and extracting patterns from complex datasets.

9) DBMS: The two play complementary roles in the field of data management and analysis,
with DBMS focusing on the organization and storage of data.
Data Mining: Data mining focuses on the extraction of knowledge and insights from data.

10) DBMS: A DBMS can work on its own, without data mining.
Data Mining: Data mining may not work without a DBMS.


Knowledge Discovery in Databases (KDD) / KDD Process in Databases


KDD (Knowledge Discovery in Databases) in data mining refers to the process of extracting
valuable insights, patterns, and knowledge from large datasets. It involves various stages
such as data selection, preprocessing, mining, pattern evaluation, and knowledge
presentation.
KDD Process
KDD is an iterative method and extracts valuable data after numerous repetitions of the
processes. KDD involves several steps, each advancing the goal of extracting useful
information from data.
These steps are as follows:
1. Data Selection
2. Data Cleaning and Preprocessing
3. Data Transformation and Reduction
4. Data Mining
5. Evaluation and Interpretation of Results
6. Knowledge presentation


1. Data Selection
Data Selection is the initial step in the Knowledge Discovery in Databases (KDD) process,
where relevant data is identified and chosen for analysis. It involves selecting a dataset or
focusing on specific variables, samples, or subsets of data that will be used to extract
meaningful insights.
• It ensures that only the most relevant data is used for analysis, improving efficiency
and accuracy.
• It involves selecting the entire dataset or narrowing it down to particular features or
subsets based on the task’s goals.
• Data is selected after thoroughly understanding the application domain.
By carefully selecting data, we ensure that the KDD process delivers accurate, relevant, and
actionable insights.
2. Data Cleaning
In the KDD process, Data Cleaning is essential for ensuring that the dataset is accurate and
reliable by correcting errors, handling missing values, removing duplicates, and addressing
noisy or outlier data.
• Missing Values: Gaps in data are filled with the mean or most probable value to
maintain dataset completeness.
• Noisy Data: Noise is reduced using techniques like binning, regression, or clustering
to smooth or group the data.
• Removing Duplicates: Duplicate records are removed to maintain consistency and
avoid errors in analysis.
Data cleaning is crucial in KDD to enhance the quality of the data and improve the
effectiveness of data mining.
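The three cleaning techniques listed above can be sketched with the Python standard library. The function names and the sample values are hypothetical, and smoothing by bin means is only one of several binning variants; it is picked here because it is the simplest to show.

```python
import statistics

def fill_missing_with_mean(values):
    """Missing values: replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = statistics.mean(known)
    return [mean if v is None else v for v in values]

def remove_duplicates(records):
    """Duplicates: drop repeated records, keeping the first occurrence of each."""
    return list(dict.fromkeys(records))

def smooth_by_bin_means(values, bin_size):
    """Noisy data (binning): sort the values, split them into bins of bin_size,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([statistics.mean(bin_values)] * len(bin_values))
    return smoothed

ages = fill_missing_with_mean([20, None, 30, 40])
records = remove_duplicates([("Asha", 20), ("Ravi", 30), ("Asha", 20)])
prices = smooth_by_bin_means([4, 8, 15, 21, 21, 24], bin_size=3)
print(ages, records, prices)
```

In the last call the sorted prices fall into the bins (4, 8, 15) and (21, 21, 24), so each value is replaced by 9 or 22 respectively.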
3. Data Transformation and Reduction
Data Transformation in KDD involves converting data into a format that is more suitable for
analysis.
• Normalization: Scaling data to a common range for consistency across variables.
• Discretization: Converting continuous data into discrete categories for simpler
analysis.
• Data Aggregation: Summarizing multiple data points (e.g., averages or totals) to
simplify analysis.
• Concept Hierarchy Generation: Organizing data into hierarchies for a clearer,
higher-level view.
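Three of the transformations above can be sketched in a few lines of Python. Min-max scaling is assumed as the normalization method (the notes only say "a common range"), and the age boundaries, labels, and monthly amounts are made up for the example.

```python
def min_max_normalize(values):
    """Normalization: rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def discretize(value, boundaries, labels):
    """Discretization: map a continuous value to the label of its interval.
    Each boundary is the exclusive upper limit of the label at the same
    position; the last label covers everything above the final boundary."""
    for upper, label in zip(boundaries, labels):
        if value < upper:
            return label
    return labels[-1]

def aggregate_totals(pairs):
    """Aggregation: sum the amounts recorded for each key."""
    totals = {}
    for key, amount in pairs:
        totals[key] = totals.get(key, 0) + amount
    return totals

scaled = min_max_normalize([10, 20, 40])
age_groups = [discretize(a, [18, 60], ["minor", "adult", "senior"]) for a in (12, 35, 70)]
monthly = aggregate_totals([("Jan", 100), ("Feb", 50), ("Jan", 25)])
print(scaled, age_groups, monthly)
```

The age labels also hint at concept hierarchy generation: raw ages roll up into broader groups that are easier to reason about.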


Data Reduction helps simplify the dataset while preserving key information.
• Dimensionality Reduction: Reducing the number of variables while keeping
essential data.
• Numerosity Reduction: Reducing data points using methods like sampling to
maintain critical patterns.
• Data Compression: Compacting data for easier storage and processing.
Together, these techniques ensure that the data is ready for deeper analysis and mining.
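Two of the reduction techniques can be illustrated with the standard library. Dropping constant (zero-variance) columns is used here as a deliberately simple stand-in for dimensionality reduction; it is an assumption of this sketch, and real projects often use methods such as principal component analysis instead. The rows and the fixed seed are hypothetical.

```python
import random
import statistics

def drop_low_variance_columns(rows, threshold):
    """Dimensionality reduction (feature selection): keep only the columns
    whose population variance exceeds the threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if statistics.pvariance(col) > threshold]
    return keep, [tuple(row[i] for i in keep) for row in rows]

def sample_rows(rows, k, seed=0):
    """Numerosity reduction: a simple random sample of k rows without replacement.
    The seed only makes the example repeatable."""
    return random.Random(seed).sample(rows, k)

# Column 0 is constant, so it carries no information for telling rows apart.
rows = [(1, 5, 10), (1, 7, 20), (1, 9, 30), (1, 11, 40)]
kept_columns, reduced = drop_low_variance_columns(rows, threshold=0.0)
subset = sample_rows(rows, k=2)
print(kept_columns, reduced, subset)
```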
4. Data Mining
Data Mining is the process of discovering valuable, previously unknown patterns from large
datasets through automatic or semi-automatic means. It involves exploring vast amounts of
data to extract useful information that can drive decision-making.
Key characteristics of data mining patterns include:
• Validity: Patterns that hold true even with new data.
• Novelty: Insights that are non-obvious and surprising.
• Usefulness: Information that can be acted upon for practical outcomes.
• Understandability: Patterns that are interpretable and meaningful to humans.

In the KDD process, choosing the data mining task is critical. Depending on the objective,
the task could involve classification, regression, clustering, or association rule mining. After
determining the task, selecting the appropriate data mining algorithms is essential. These
algorithms are chosen based on their ability to efficiently and accurately identify patterns
that align with the goals of the analysis.
5. Evaluation and Interpretation of Results
Evaluation in KDD involves assessing the patterns identified during data mining to
determine their relevance and usefulness. It includes calculating the “interestingness
score” for each pattern, which helps to identify valuable insights. Visualization and
summarization techniques are then applied to make the data more understandable and
accessible for the user.
Interpretation of Results focuses on presenting these insights in a way that is meaningful
and actionable. By effectively communicating the findings, decision-makers can use the
results to drive informed actions and strategies.
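The notes do not define the interestingness score. For association rules, two widely used measures are support (how often the items occur together) and confidence (how often the rule holds when its left-hand side occurs); the sketch below computes both, as one possible choice rather than the method the notes have in mind. The baskets are made-up data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also
    contain consequent: support(A and C) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market-basket data, made up for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule under evaluation: bread -> milk
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Here bread and milk appear together in 2 of 4 baskets (support 0.5), and milk appears in 2 of the 3 baskets that contain bread (confidence of about 0.67). A pattern would normally be kept only if both values clear thresholds chosen by the analyst.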
6. Knowledge Presentation
It is the final step of the KDD process. When knowledge is presented to a user visually
through tables, graphs, charts, trees, matrices, etc., it is known as knowledge
presentation (also called knowledge representation).
It is used to facilitate well-informed decision-making and problem-solving. The main
objective of knowledge presentation is to explain the insights and conclusions produced
through data mining clearly.


Advantages of KDD in Data Mining


Some of the advantages of KDD are as follows:
• KDD helps in data-driven decision-making.
• It is also used for pattern recognition and fraud detection systems.
• It improves the performance of firms and organisations.
• One of the critical features of KDD is to uncover hidden patterns in the datasets.
• It helps in the detection of anomalies in databases.
• It plays a vital role in research and discovery fields.
• It provides feedback after analysing datasets, helping companies to modify their
processes.
Disadvantages of KDD in Data Mining
Some of the disadvantages of KDD are as follows:
• KDD is a complex process.
• It heavily depends on the quality of the data. So, data quality maintenance is
required for the KDD process.
• Analysing large amounts of data can raise security and privacy issues.
• Overfitting data in the KDD process can decrease the system's performance.
• To select algorithms and analyse patterns, great human expertise is required.

KDD Vs Data Mining

Definition
KDD: Process of extracting useful patterns from data.
Data Mining: Subset of KDD, specifically focused on pattern extraction.

Scope
KDD: Broader, encompasses the entire process of knowledge discovery.
Data Mining: Specific, concentrating on the algorithmic extraction of patterns.

Stages
KDD: Involves stages like data selection, preprocessing, transformation, data mining,
interpretation, and evaluation.
Data Mining: Primarily associated with the data mining phase within KDD.

Goal
KDD: Aims at uncovering patterns, trends, and knowledge from data.
Data Mining: Focuses on applying algorithms to identify patterns within data.

Components
KDD: Encompasses data cleaning, integration, selection, transformation, data mining,
pattern evaluation, and knowledge representation.
Data Mining: Primarily involves applying algorithms for pattern discovery.

Interdisciplinary
KDD: Multidisciplinary approach, involving database management, statistics, machine
learning, and domain knowledge.
Data Mining: Often seen as a subfield of machine learning and statistics, with a
narrower focus.

Application
KDD: Used in various domains for decision support and strategic planning.
Data Mining: Applied in diverse fields for predictive modelling, classification,
clustering, and anomaly detection.


Practical Example of KDD – A Case Study


Consider a scenario in which a fitness center wants to improve member retention by
analyzing usage patterns.
Data Selection: The fitness center gathers data from its membership system, focusing on
the past six months of activity. They filter out inactive members and focus on those with
regular usage.
Data Cleaning and Preprocessing: The fitness center cleans the data by eliminating
duplicates and correcting missing information, such as incomplete workout records or
member details. They also handle any gaps in data by filling in missing values based on
previous patterns.
Data Transformation and Reduction: The data is transformed to highlight important
metrics, such as the average number of visits per week per member and their most
frequently chosen workout types. Dimensionality reduction is applied to focus on the most
significant factors like membership duration and gym attendance frequency.
Data Mining: By applying clustering algorithms, the fitness center segments members into
groups based on their usage patterns. These segments include frequent visitors, occasional
users, and those with minimal attendance.
Evaluation and Interpretation of Results: The fitness center evaluates the groups by
examining their retention rates. They find that occasional users are more likely to cancel
their memberships. The interpretation reveals that members who visit the gym less than
once a week are at a higher risk of discontinuing their membership.
This analysis helps the fitness center implement effective retention strategies, such as
offering tailored incentives and creating engagement programs aimed at boosting the
activity of occasional users.
Knowledge Presentation: The findings are presented in a form that supports well-informed
decision-making and problem-solving. The main objective of knowledge presentation is to
explain the insights and conclusions produced through data mining clearly.
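The stages of the case study can be sketched in a few lines of Python. This is only an illustrative sketch: the records, the 26-week window, and the segment thresholds are invented assumptions, rows with missing counts are dropped instead of being filled in, and simple thresholds stand in for the clustering algorithm mentioned in the text.

```python
from collections import defaultdict

WEEKS = 26  # assumed length of the six-month window

# Hypothetical records: (member_id, visits_in_window, cancelled)
raw = [
    ("m1", 80, False),
    ("m2", 12, True),
    ("m2", 12, True),     # duplicate row
    ("m3", None, False),  # missing visit count
    ("m4", 40, False),
    ("m5", 10, True),
    ("m6", 0, True),      # inactive member
    ("m7", 20, False),
    ("m8", 3, True),
]

def run_kdd(records, weeks=WEEKS):
    # Data selection: filter out inactive members (zero visits)
    selected = [r for r in records if r[1] != 0]
    # Cleaning: remove duplicates and rows with missing visit counts
    cleaned = [r for r in dict.fromkeys(selected) if r[1] is not None]
    segments = defaultdict(list)
    for member_id, visits, cancelled in cleaned:
        # Transformation: average visits per week
        rate = visits / weeks
        # Mining (simplified): assign a usage segment by threshold
        if rate >= 1.0:
            label = "frequent"
        elif rate >= 0.25:
            label = "occasional"
        else:
            label = "minimal"
        segments[label].append(cancelled)
    # Evaluation: cancellation rate per segment
    return {label: sum(flags) / len(flags) for label, flags in segments.items()}

cancel_rates = run_kdd(raw)
print(cancel_rates)
```

With this invented data, the members who visit less than once a week show higher cancellation rates than the frequent visitors, which mirrors the interpretation step of the case study.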
FAQs
Where is KDD used?
KDD is used across diverse domains like business, healthcare, finance, and science to extract
valuable patterns and insights from large volumes of data.
What is the role of KDD?
KDD plays a vital role in uncovering hidden patterns, trends, and knowledge from data, providing
valuable insights for decision-making and strategic planning in various fields.

What are some issues in the KDD process?


Since KDD requires gathering and analysing vast volumes of data, privacy issues may arise. KDD can
be a challenging procedure that calls for specialised training and understanding to implement and
fully grasp the outcomes.
What is KDD in Data Mining?
KDD (Knowledge Discovery in Databases) in data mining refers to the process of extracting valuable
insights, patterns, and knowledge from large datasets. It involves various stages such as data
selection, preprocessing, mining, pattern evaluation, and knowledge presentation.
What is the data preparation stage of knowledge discovery process?
The data preparation stage involves cleaning, transforming, and organizing selected data for
analysis. Tasks include removing noise, handling missing values, and converting data into a suitable
format. This stage ensures that the data is ready for analysis in subsequent stages of the
knowledge discovery process.
What are the main steps of KDD?
The main steps of KDD (Knowledge Discovery in Databases) are data selection, preprocessing,
data mining, pattern evaluation, and knowledge presentation. These steps collectively involve
identifying relevant data sources, preparing the data for analysis, applying data mining
techniques to uncover patterns and trends, and evaluating the discovered patterns for
significance and usefulness.
Data Mining Techniques:
A wide array of data mining techniques is used in data science and data analytics. The
choice of technique depends on the nature of the problem, the available data, and the desired
outcomes. Predictive modeling is a fundamental component of data mining and is widely used to
make predictions or forecasts based on historical data patterns.
Commonly used data mining techniques are:
1. Classification
Classification is a technique used to categorize data into predefined classes or categories based on
the features or attributes of the data instances. It involves training a model on labeled data and
using it to predict the class labels of new, unseen data instances.
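As a rough illustration of training on labeled data and predicting labels for unseen instances, the sketch below uses a 1-nearest-neighbour rule. The choice of classifier, the function name, and the data points are assumptions made for this example only.

```python
import math

def nearest_neighbor_predict(train, point):
    """Return the label of the training example closest to `point`."""
    _features, label = min(train, key=lambda ex: math.dist(ex[0], point))
    return label

# Hypothetical labeled data: (feature vector, class label)
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 9.0), "high"),
    ((9.0, 8.5), "high"),
]

print(nearest_neighbor_predict(train, (2.0, 1.5)))
print(nearest_neighbor_predict(train, (8.5, 8.0)))
```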

2. Regression


Regression is employed to predict numeric or continuous values based on the relationship
between input variables and a target variable. It aims to find a mathematical function or
model that best fits the data to make accurate predictions.
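One simple instance of "a function that best fits the data" is a straight line fitted by least squares. The sketch below assumes a single input variable; the function name and the sample data are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Predict the target for a new input value
prediction = slope * 5 + intercept
print(slope, intercept, prediction)
```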

3. Clustering
Clustering is a technique used to group similar data instances together based on their
intrinsic characteristics or similarities. It aims to discover natural patterns or structures in
the data without any predefined classes or labels.
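A minimal sketch of grouping unlabeled values is given below, using the k-means idea on one-dimensional data. The choice of k-means, the starting centers, and the values are assumptions for illustration; the text does not prescribe a particular clustering algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

values = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, clusters)
```

No class labels are supplied; the two groups emerge only from the similarity of the values.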

4. Association Rule


Association rule mining focuses on discovering interesting relationships or patterns among
a set of items in transactional or market basket data. It helps identify frequently co-
occurring items and generates rules such as "if X, then Y" to reveal associations between
items. This simple Venn diagram shows the associations between itemsets X and Y of a
dataset.
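Rules of the form "if X, then Y" are usually judged by their support and confidence. The sketch below computes both for a small, invented set of transactions; the item names are hypothetical.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of X and Y together, divided by the support of X alone."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: if bread, then milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```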

5. Anomaly Detection
Anomaly detection, sometimes called outlier analysis, aims to identify rare or unusual data
instances that deviate significantly from the expected patterns. It is useful in detecting
fraudulent transactions, network intrusions, manufacturing defects, or any other abnormal
behavior.
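One common and simple way to flag values that deviate strongly from the rest is a z-score test. The sketch below assumes a threshold of 2 standard deviations and uses invented transaction amounts; real systems typically use more robust methods.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical transaction amounts with one unusual value
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```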

6. Time Series Analysis


Time series analysis focuses on analyzing and predicting data points collected over time. It
involves techniques such as forecasting, trend analysis, seasonality detection, and anomaly
detection in time-dependent datasets.
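Trend analysis can be illustrated with a moving average, which smooths a series by averaging each window of consecutive points. The window size and the sales figures below are assumptions for this sketch.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive observations."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical monthly sales figures
sales = [10, 12, 14, 16, 18]
trend = moving_average(sales, window=3)

# A naive forecast for the next period: the latest smoothed value
forecast = trend[-1]
print(trend, forecast)
```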

7. Neural Networks
Neural networks are a type of machine learning or AI model inspired by the human brain's
structure and function. They are composed of interconnected nodes (neurons) and layers
that can learn from data to recognize patterns, perform classification, regression, or other
tasks.
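The smallest building block of such a network is a single neuron. The sketch below trains one neuron (a perceptron) to reproduce the logical AND function; it is a deliberately tiny stand-in for a full multi-layer network, and the learning rate and epoch count are assumed values.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of inputs plus bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=10, lr=1):
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Nudge weights and bias in the direction that reduces the error
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

# Training data for logical AND
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
print([predict(weights, bias, inputs) for inputs, _ in samples])
```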

8. Decision Trees


Decision trees are graphical models that use a tree-like structure to represent decisions
and their possible consequences. They recursively split the data based on different
attribute values to form a hierarchical decision-making process.
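The core step of building a decision tree is choosing the attribute value on which to split. The sketch below finds the best threshold for a single numeric attribute using Gini impurity, which is one common splitting criterion; the data is invented, and a full tree would repeat this step recursively on each side of the split.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split `value <= threshold`."""
    best = None
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: weekly visits and whether the member stayed
visits = [0.2, 0.5, 0.8, 1.5, 2.0, 3.0]
outcome = ["cancel", "cancel", "cancel", "stay", "stay", "stay"]
split = best_split(visits, outcome)
print(split)
```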

9. Ensemble Methods
Ensemble methods combine multiple models to improve prediction accuracy and
generalization. Techniques like Random Forests and Gradient Boosting utilize a combination
of weak learners to create a stronger, more accurate model.
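The idea of combining weak learners can be shown with plain majority voting. This sketch is not Random Forests or Gradient Boosting; it only illustrates how several simple rules, each unreliable on its own, can be combined into one prediction. The rules and field names are hypothetical.

```python
from collections import Counter

# Three hypothetical weak learners for labelling a message
weak_learners = [
    lambda msg: "spam" if msg["links"] > 3 else "ham",
    lambda msg: "spam" if msg["caps_ratio"] > 0.5 else "ham",
    lambda msg: "spam" if msg["length"] < 20 else "ham",
]

def majority_vote(models, x):
    """Return the label predicted by the largest number of models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

msg_a = {"links": 5, "caps_ratio": 0.2, "length": 10}
msg_b = {"links": 1, "caps_ratio": 0.7, "length": 200}
print(majority_vote(weak_learners, msg_a))
print(majority_vote(weak_learners, msg_b))
```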


Problems, Issues and Challenges in Data Mining


Data Mining - Problems
Major problems/issues in data mining include data quality, privacy and security,
scalability, handling diverse data types, integration with heterogeneous sources,
interpretation of results, and dynamic data, along with legal and ethical concerns.
Data Mining - Issues
Data mining is not an easy task, as the algorithms used can get very complex and data is
not always available at one place. It needs to be integrated from various heterogeneous
data sources. These factors also create some issues. This section discusses the
major issues regarding:
1. Mining Methodology and User Interaction
2. Performance Issues
3. Diverse Data Types Issues

1. Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore it is necessary for data mining
to cover a broad range of knowledge discovery tasks.


 Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive so that users can focus the search for
patterns, providing and refining data mining requests based on the returned results.
 Incorporation of background knowledge − Background knowledge can be used to guide
the discovery process and to express the discovered patterns, not only in concise terms
but also at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − A data mining query language
that allows the user to describe ad hoc mining tasks should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining data regularities. Without
such methods, the accuracy of the discovered patterns
will be poor.
 Pattern evaluation − Many discovered patterns may be uninteresting because
they represent common knowledge or lack novelty, so the interestingness of
discovered patterns needs to be evaluated.
2. Performance Issues
Performance-related issues include the following −
 Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must
be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel.
The results from the partitions are then merged. Incremental algorithms
incorporate database updates without mining the data again from scratch.
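The partition-and-merge idea, and the incremental update, can be sketched with item counting. In this illustration the partitions are processed one after another in a single process; in a real parallel system each partition would go to a separate worker. The function names and transactions are assumptions.

```python
from collections import Counter

def count_items(transactions):
    """Count in how many transactions each item appears."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return counts

def mine_partitioned(transactions, n_parts):
    # Divide the data into partitions
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    # Process each partition independently (could run in parallel workers)
    partial = [count_items(p) for p in parts]
    # Merge the partial results
    return sum(partial, Counter())

transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["milk"],
    ["bread", "milk", "butter"],
]
totals = mine_partitioned(transactions, n_parts=2)
merged_matches_full_scan = totals == count_items(transactions)

# Incremental step: fold in a new batch without rescanning the old data
new_batch = [["milk", "jam"]]
totals.update(count_items(new_batch))
print(merged_matches_full_scan, totals)
```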
3. Diverse Data Types Issues
 Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data, etc. It is
not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on a LAN or WAN. These data
sources may be structured, semi-structured, or unstructured. Therefore, mining
knowledge from them adds challenges to data mining.


Challenges of Data Mining


The main challenges of data mining are:
 Data Quality - Data mining heavily relies on the quality of the input data. Inaccurate,
incomplete, or noisy data can lead to misleading results and hinder the discovery of
meaningful patterns.
 Data Complexity - Complex datasets with diverse structures, including unstructured
data like text and images, pose significant challenges in terms of preprocessing,
integration, and analysis.
 Data Privacy and Security - Safeguarding sensitive information is paramount. Data
mining can potentially compromise privacy if not conducted with stringent privacy-
preserving techniques and compliance with data protection regulations.
 Scalability - As data volumes continue to grow, ensuring that data mining algorithms
and infrastructure can handle large-scale datasets efficiently becomes a pressing
issue.
 Interpretability - Understanding and explaining the outcomes of data mining models
is crucial for informed decision-making. Black-box models can raise concerns when
interpretability is required.
 Ethics - Ethical considerations in data mining, such as fairness, bias, and the
responsible use of data, are gaining prominence. Ensuring ethical practices
throughout the data mining process is a critical challenge.
Data Mining Applications:
Some important applications of data mining are:
o Market Basket Analysis: Retailers use data mining to identify the products
frequently purchased in combination. This supports targeted marketing, product
placement, and store design.
o Customer segmentation: Organizations use data mining to classify customers based
on shared traits or behaviours. This makes it possible to create individualized
marketing plans and product recommendations.
o Recommendation Systems: Data mining is frequently applied in recommendation
systems, including those used by social networks, e-commerce websites, and
streaming platforms. It examines user behaviour and preferences to recommend
personalized products, content, or friends.
o Financial Market Forecasting: Data mining is used in finance to forecast future stock
prices, currency exchange rates, and market trends by analyzing historical market
data, news sentiment, and economic indicators. This is helpful for trading and
investment strategies.


o Healthcare Fraud Detection: Data mining is used in the healthcare industry to
identify fraudulent billing practices, insurance claims, and unnecessary medical
procedures. It aids in spotting unusual patterns that might point to fraudulent
activity.
o Churn Prediction: Data mining is used by businesses in sectors like
telecommunications and subscription services to forecast which customers are most
likely to discontinue their subscriptions. This supports customer retention campaigns.
o Credit Scoring: Data mining is used by financial institutions to evaluate a person's or
company's creditworthiness. It aids in deciding whether to approve loans and at what
interest rates.
o Agriculture: To maximize crop yields and reduce resource waste, farmers use data
mining to analyze crop data, weather patterns, and soil conditions.
These applications demonstrate the adaptability and significance of data mining across
various industries and domains for making knowledgeable decisions, streamlining
procedures, and gaining insightful knowledge from data.

******
Unit-I Assignment: Data Mining

1. What is Data Mining?
2. Define KDD.
3. Explain Clustering.
4. Explain Association Rule.
5. What is an outlier?
6. Explain the process of knowledge discovery in Data Mining.
7. Differentiate between DBMS and Data Mining.
8. Differentiate between KDD and Data Mining.
9. Explain Data Mining Techniques.
10. Explain in detail Data Mining Issues and Challenges.
11. Explain the Applications of Data Mining.
