
PRACTICAL FILE

BE (CSE) 6th Semester

DATA MINING AND ANALYSIS


January 2024 – June 2024

Submitted By
Varinda
Roll Number: UE213123

Submitted To
MS. JASKIRAN
Computer Science and Engineering

University Institute of Engineering and Technology
INDEX

S. No.   Name of Practical/Program
1        Introduction to Data Mining and its extraction tools
2        Building a Database Design using ER Model
3        To understand triggers in SQL and to write a PL/SQL program
4        Preprocessing and loading of data
5        To perform Data Preprocessing in Weka and applying filters
6        Implementing the Apriori algorithm
7        To implement and explore FPGrowth algorithm in Weka
PRACTICAL-1
AIM: Introduction to Data Mining and its extraction tools

INTRODUCTION:
Data mining is the process of extracting valuable information from large datasets. It involves
using techniques from statistics, machine learning, and database systems to identify patterns,
relationships, and trends within the data.

Definition: Data mining is the computational process of analyzing data from different
perspectives, dimensions, and angles, and categorizing or summarizing it into meaningful
information.
Applications: Data mining finds applications in various domains, including finance,
healthcare, retail, and telecommunications. Some common uses include customer profiling,
market basket analysis, anomaly detection, and predictive modeling.
Knowledge Discovery: Data mining is a step in the process of knowledge discovery or
knowledge extraction. It helps extract patterns and knowledge from large volumes of data.
Data Sources: Data mining can be applied to different types of data, such as data warehouses,
transactional databases, relational databases, multimedia databases, spatial databases, time-
series databases, and the World Wide Web.

Purpose of Data Mining:


The main purpose of data mining is to:

 Analyze Data: Data miners use various tools and techniques to extract, transform, and
analyze data. They uncover hidden patterns, future trends, and behaviors.
 Make Informed Decisions: By understanding the data, businesses can make data-
driven decisions. For example, banks analyze transaction details and customer profiles
to predict which customers might be interested in credit cards, personal loans, or
insurance.
Types of Data Mining:
Data mining can be performed on the following types of data:
1. Relational Database: a collection of multiple data sets formally organized into tables,
records, and columns.
2. Data Warehouse: the technology that collects data from various sources within
the organization to provide meaningful business insights.
3. Transactional Database: a database management system (DBMS) that can undo (roll
back) a database transaction if it is not completed properly.
4. Data Repository: generally, a destination for data storage; however, many IT
professionals use the term more specifically to refer to a particular kind of setup within an
IT structure.

Data Mining Process


1. Data gathering: Identify and assemble relevant data for an analytics application. The
data might be located in different source systems, a data warehouse or a data lake.
External data sources can also be used. Wherever the data comes from, a data scientist
often moves it to a data lake for the remaining steps in the process.
2. Data preparation: This stage includes a set of steps to get the data ready to be mined.
Data preparation starts with data exploration, profiling and pre-processing, followed by
data cleansing work to fix errors and other data quality issues, such as duplicate or
missing values. Data transformation is also done to make data sets consistent, unless a
data scientist wants to analyze unfiltered raw data for a particular application. (A small
pandas sketch of this preparation step is shown after this list.)
3. Data mining: Once the data is prepared, a data scientist chooses the appropriate data
mining technique and then implements one or more algorithms to do the mining.
These techniques, for example, could analyze data relationships and detect patterns,
associations and correlations. In machine learning applications, the algorithms
typically must be trained on sample data sets to look for the information being sought
before they're run against the full set of data.
4. Data analysis and interpretation: The data mining results are used to create
analytical models that can help drive decision-making and other business actions. The
data scientist or another member of a data science team must also communicate the
findings to business executives and users, often through data visualization and the use
of data storytelling techniques.
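
For illustration, the cleansing and transformation work described in the data preparation step
can be scripted in a few lines of pandas. The sketch below is an added example, not part of the
original practical; the small table and its column names (age, income, city) are assumed values:

import pandas as pd
import numpy as np

# A small assumed dataset with typical quality issues:
# a duplicate row, a missing value, and inconsistent labels.
df = pd.DataFrame({
    "age":    [25, 32, 32, np.nan, 41],
    "income": [48000, 52000, 52000, 61000, 75000],
    "city":   ["Delhi", "delhi", "delhi", "Mumbai", "Mumbai"],
})

# Data cleansing: drop exact duplicate rows and fill missing numeric values.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# Data transformation: make the categorical labels consistent.
df["city"] = df["city"].str.title()

print(df)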

Techniques used in data mining


 Classification: This technique involves categorizing data into predefined classes or
labels based on the attributes of the data instances. It's widely used in tasks such as
spam email detection, customer churn prediction, and medical diagnosis (a short
scikit-learn sketch of classification and clustering is given after this list).
 Clustering: Clustering groups similar data instances together based on their
characteristics, without predefined classes. It's utilized in customer segmentation,
anomaly detection, and recommendation systems, helping to uncover natural groupings
within the data.
 Association Rule Mining: Association rule mining identifies interesting relationships
or patterns between variables in large datasets. It's commonly applied in market basket
analysis, where it reveals associations between items frequently purchased together,
aiding in targeted marketing strategies.
 Regression Analysis: Regression analysis predicts a continuous outcome variable
based on one or more predictor variables. It's extensively used in forecasting sales,
estimating house prices, and analyzing the relationship between variables in scientific
research, providing valuable insights into numerical trends and relationships within the
data.
 Anomaly Detection: Anomaly detection identifies outliers or unusual patterns in data
that deviate significantly from the norm. It's crucial in fraud detection, network security
monitoring, and equipment maintenance, helping to detect and mitigate unusual or
potentially harmful behavior.
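
As a concrete illustration of the first two techniques, the short scikit-learn sketch below trains
a classifier and a clustering model on the bundled iris dataset. It is an added example and is not
part of the original practical:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification: learn predefined class labels from labelled training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))

# Clustering: group the instances without using the class labels at all.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster sizes:", [list(kmeans.labels_).count(c) for c in range(3)])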

Data mining Applications


 Finance and Banking:
   o Detects fraud and money laundering patterns.
   o Enhances customer trust through safety measures.
   o Develops credit scoring models for risk assessment.
 Retail and E-commerce:
   o Analyzes customer shopping habits for targeted marketing.
   o Optimizes inventory management based on purchasing patterns.
   o Identifies fraud or credit card misuse in real time.
 Healthcare:
   o Improves patient outcomes by analyzing large datasets.
   o Identifies health risks and aids in personalized treatment.
   o Detects drug interactions and improves diagnostic accuracy.
 Manufacturing:
   o Analyzes production data to enhance efficiency.
   o Monitors product quality and identifies areas for improvement.
   o Reduces costs through process optimization.
 Telecom:
   o Understands customer behavior and preferences.
   o Identifies calling patterns and potential fraudulent activity.
   o Improves network utilization and customer service.
Data Mining Tools:
Several tools that facilitate data mining are:
 Weka: Weka is an open-source tool that provides a wide range of machine learning
algorithms and data preprocessing capabilities.
 KNIME: KNIME allows users to create data workflows, integrate various data sources,
and perform analytics.
 Orange: Orange is a visual programming tool for data visualization, exploration, and
analysis.
 Python Libraries: Python libraries like scikit-learn, pandas, and numpy offer powerful
data mining capabilities.
 SQL and NoSQL Databases: SQL databases (e.g., MySQL, PostgreSQL) and NoSQL
databases (e.g., MongoDB, Cassandra) are essential for storing and querying large
datasets.
PRACTICAL-2
AIM: Building a Database Design using ER Model

INTRODUCTION:
A database is a collection of information that is organized so that it can be easily accessed,
managed and updated. Data is organized into rows, columns and tables. A database
management system (DBMS) is a computer software application that interacts with the user,
other applications, and the database itself to capture and analyze data.

An Entity-Relationship Model represents the structure of the database with the help of a
diagram. ER modelling is a systematic process for designing a database, since it requires you to
analyse all data requirements before implementing the database.

Symbols Used in ER Model:


ER Model is used to model the logical view of the system from a data perspective which
consists of these symbols:
 Rectangles: Rectangles represent Entities in the ER Model.
 Ellipses: Ellipses represent Attributes in the ER Model.
 Diamond: Diamonds represent Relationships among Entities.
 Lines: Lines link attributes to their entities and connect entity sets to
relationship types.
 Double Ellipse: Double Ellipses represent Multi-Valued Attributes.
 Double Rectangle: Double Rectangle represents a Weak Entity.

Components of ER Diagram:
ER Model consists of Entities, Attributes, and Relationships among Entities in a Database
System.
1. Entities: Entities represent real-world objects or concepts within the system.
Examples include students, books, employees, or products. Each entity is depicted as
a rectangle in the diagram.
Entities have attributes (properties) associated with them. For instance, a student
entity might have attributes like student ID, name, and date of birth.

2. Attributes:
Attributes describe the properties of an entity. They are represented as ovals
connected to the corresponding entity rectangle. For example, a book entity might
have attributes such as title, author, and ISBN.

3. Relationships: Relationships define how entities are connected or associated. They
are represented as diamonds connecting two or more entities. Examples of
relationships include “works for” (between employees and departments), “enrolls in”
(between students and courses), or “buys” (between customers and products).

4. Cardinality: Cardinality specifies the number of instances of one entity that can be
related to another entity. Common cardinality notations include one-to-one (1:1), one-
to-many (1:N), and many-to-many (N:M). For instance, a student can enroll in
multiple courses (1:N relationship).

EXAMPLE 1:
Hospital Management System:
Entities: Patients, Doctors, and Tests.
Relationships:
 A patient can undergo multiple tests.
 Each test is associated with a specific patient.
 Doctors can perform tests on patients.
Attributes:

EXAMPLE 2:
Student Enrollment System
Entities: Student, Course, Enrollment, and Faculty
Relationships:
 Student enrolls in Course
 Enrollment is overseen by Faculty
Attributes: Student's ID, Course's code, Enrollment's date, and Faculty's department.

EXAMPLE 3:
Banking Transaction System
Entities: Customer, Account, Transaction, and Bank Branch
Relationships:
 Customer owns Account
 Transaction debits/credits Account
 Account managed by Bank Branch
Attributes: The Customer's SSN, Account's number, Transaction's date, and Bank Branch's
location.
PRACTICAL-3
AIM: To implement triggers, functions, procedure and cursors.

INTRODUCTION:
TRIGGERS
A trigger is a stored procedure in a database that is automatically invoked whenever a special
event occurs in the database. For example, a trigger can be invoked when a row is inserted
into a specified table or when specific table columns are updated. In simple words, a trigger is
a collection of SQL statements with a particular name that is stored in system memory. It
belongs to a specific class of stored procedures that are automatically invoked in response to
database server events. Every trigger has a table attached to it.

Different Trigger Types in SQL Server:


1. DDL Trigger
2. DML Trigger
3. Logon Triggers
1. DDL Triggers: The Data Definition Language (DDL) command events such as
Create_table, Create_view, drop_table, Drop_view, and Alter_table cause the DDL
triggers to be activated.

2. DML Triggers: The Data Manipulation Language (DML) command events that
begin with Insert, Update, and Delete set off the DML triggers, corresponding to
insert_table, update_view, and delete_table.

3. Logon Triggers: Logon triggers fire in response to a LOGON event. The LOGON
event takes place when a user session is created with a SQL Server instance, after
the authentication phase of logging in finishes but before the user session is actually
established. As a result, the PRINT statement messages and any errors generated by
the trigger will be visible in the SQL Server error log. Logon triggers do not fire if
authentication fails.
How does SQL Server show triggers?
Listing the available triggers is useful when we have many databases with many tables,
especially when table names are the same across multiple databases. We can view a list of
every trigger available in SQL Server by using the query below:
Syntax:
SELECT name, is_instead_of_trigger
FROM sys.triggers
WHERE type = 'TR';

The SQL Server Management Studio makes it very simple to display or list all triggers that
are available for any given table. The following steps will help us accomplish this:
 Go to the Databases menu, select the desired database, and then expand it.
 Select the Tables menu and expand it.
 Select any specific table and expand it.
We will get various options here. When we choose the Triggers option, it displays all the
triggers available in this table.

WHY DO WE NEED TRIGGERS?

Triggers are helpful when we need to execute some events automatically in certain
desirable scenarios. For example, we may have a constantly changing table and need to know
when and how often changes occur. In such scenarios, we can create a trigger that inserts the
desired data into a separate table whenever the primary table changes.

 FOR Triggers can be defined on tables or views. They fire only when all operations
specified in the triggering SQL statement have executed successfully.
 AFTER Triggers fire only after the specified triggering SQL statement has completed
successfully. AFTER triggers cannot be defined on views.
 INSTEAD OF Triggers allow you to override the INSERT, UPDATE, or DELETE
operations on a table or view. The actual DML operations do not occur at all.
 LOGON Triggers are fired automatically on a LOGON event. They are DDL
triggers and are created at the server level. We can define more than one LOGON
trigger on a server.

FUNCTIONS
A stored function (also called a user function or user-defined function) is a set of PL/SQL
statements you can call by name. Stored functions are very similar to procedures, except
that a function returns a value to the environment in which it is called.
-- Example: a recursive function that computes the factorial of a number.
create or replace function RfactorialFunction(x in int)
return int
is
  y int;
begin
  if (x > 1) then
    y := x - 1;
    -- recursive call: n! = n * (n - 1)!
    return x * RfactorialFunction(y);
  end if;
  return 1;
end RfactorialFunction;
/
-- Anonymous block that calls the function with a number supplied at the prompt.
declare
  num int;
  ans int;
begin
  num := :Enter_Number;
  ans := RfactorialFunction(num);
  dbms_output.put_line('Factorial : ' || ans);
end;
/
Output (for input 6):

Factorial : 720
CURSORS
A cursor is a temporary work area created in the system memory when a SQL statement is
executed. A cursor contains information on a select statement and the rows of data
accessed by it. This temporary work area is used to store the data retrieved from the
database, and manipulate this data. A cursor can hold more than one row, but can process
only one row at a time. There are mainly two types of cursors:
1. Implicit cursors
2. Explicit cursors
EXAMPLE:
declare
  -- explicit cursor over the Students table
  cursor c1 is select * from Students;
  rec c1%rowtype;
begin
  open c1;
  loop
    fetch c1 into rec;
    exit when c1%notfound;   -- stop when there are no more rows
    dbms_output.put_line('Roll No: ' || rec.RollNo || ' ' || 'Name: ' || rec.SName || ' ' ||
                         'Std: ' || rec.Std || ' ' || 'Marks: ' || rec.Marks);
    dbms_output.put_line('*******************************************************');
  end loop;
  close c1;                  -- release the cursor
end;
/
Output:
Roll No: 1 Name: Akarsh Std: 1 Marks: 30
*******************************************************
Roll No: 2 Name: Babita Std: 2 Marks: 15
*******************************************************
Roll No: 3 Name: Chandu Std: 3 Marks: 20
*******************************************************
Roll No: 4 Name: Danish Std: 4 Marks: 30
*******************************************************
Roll No: 5 Name: Eren Std: 5 Marks: 40
*******************************************************

PROCEDURES
In data mining, procedures are systematic methods or algorithms used to extract useful
information and patterns from large datasets. These procedures help in uncovering hidden
insights, trends, and relationships that can be valuable for decision-making and prediction.
Here's an explanation along with syntax and an example for implementing a commonly
used data mining procedure: association rule mining. Association rule mining is a
technique used to discover interesting relationships, or associations, between variables in
large datasets. It identifies patterns where one set of items tends to co-occur with another
set of items within transactions or events.
The most common algorithm for association rule mining is the Apriori algorithm

EXAMPLE:
DATASET:
Transaction 1: milk, bread, butter
Transaction 2: milk, bread
Transaction 3: milk, butter
Transaction 4: bread, butter
Transaction 5: milk
Transaction 6: bread

OUTPUT: the frequent itemsets found in these transactions and the association rules derived
from them.
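The exact rules reported depend on the chosen minimum support and confidence. As an added
cross-check (not part of the original practical), the itemset supports that Apriori counts on this
dataset can be reproduced with the short, self-contained Python sketch below; the minimum
support of 2 transactions is an assumption:

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"bread"},
]

MIN_SUPPORT = 2  # assumed: an itemset must appear in at least 2 transactions

def support(itemset):
    # Number of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))

# Apriori idea: an itemset can only be frequent if all of its subsets are frequent,
# so candidates are grown level by level from the frequent single items.
frequent = [frozenset([i]) for i in items if support({i}) >= MIN_SUPPORT]
level = frequent
while level:
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= MIN_SUPPORT]
    frequent.extend(level)

for itemset in frequent:
    print(sorted(itemset), "support =", support(itemset))

On this data the frequent single items are milk (4), bread (4) and butter (3), every pair of them
appears in exactly 2 transactions, and {milk, bread, butter} appears only once and is pruned.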
PL/SQL PROGRAM
PL/SQL extends SQL by adding constructs found in procedural languages, resulting in a
structural language that is more powerful than SQL. The basic unit in PL/SQL is a block. All
PL/SQL programs are made up of blocks, which can be nested within each other.

Typically, each block performs a logical action in the program. A block has the following
structure:
DECLARE
    declaration statements;

BEGIN
    executable statements;

EXCEPTION
    exception handling statements;

END;
 Declare section starts with the DECLARE keyword, in which variables, constants,
records and cursors can be declared to store data temporarily. It basically
consists of definitions of PL/SQL identifiers. This part of the code is optional.
 Execution section starts with BEGIN and ends with the END keyword. This is a
mandatory section, and here the program logic is written to perform any task like
loops and conditional statements. It supports
all DML commands, DDL commands and SQL*PLUS built-in functions as well.
 Exception section starts with the EXCEPTION keyword. This section is optional
and contains statements that are executed when a run-time error occurs. Any
exceptions can be handled in this section.
PRACTICAL-4

AIM: Preprocessing and loading of data

INTRODUCTION:
Weka prefers to load data in the ARFF format. It is an acronym that stands
for Attribute-Relation File Format. It is an extension of the CSV file format
where a header is used that provides metadata about the data types in the
columns.
For example, the first few lines of the classic iris flowers dataset in CSV
format look roughly as follows (reconstructed sample; the column names are illustrative):
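
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa

The same data in ARFF format adds a header that declares each attribute and its type,
roughly like this:

@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa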

Step 1: Open Weka


Launch Weka: Open the Weka application on your computer. You'll see the
Weka GUI Chooser window.

Step 2: Choose an Interface


Select Explorer: Click on the "Explorer" button to open the Weka Explorer
interface.
Step 3: Load a Dataset
1. Open File: In the Explorer interface, click on the "Open file..." button
located in the top-left corner.
This will open a file dialog window.
2. Select the Dataset: Navigate to the location of your dataset file.
Weka primarily supports ARFF (Attribute-Relation File Format) files, but
it can also load CSV files and some other formats.
Select your dataset file and click "Open".

Step 4: View the Dataset


Inspect the Data: After loading the dataset, you will see information about it in
the "Preprocess" tab.
The "Attributes" panel on the left lists all the attributes (features) in your dataset.
The "Current relation" section shows the name of the dataset, the number of
instances, and the number of attributes.
The "Selected attribute" section displays details about any attribute you select,
including basic statistics.

Step 5: Optional Preprocessing


Preprocess the Data: You can perform various preprocessing tasks such as
filtering attributes, normalizing data, or handling missing values using the
options available in the "Preprocess" tab.
Additional Tips
Loading CSV Files: If your dataset is in CSV format, Weka will prompt you to
configure how it reads the CSV file. You might need to specify if there is a
header row and how to handle missing values.
ARFF Files: If your dataset is in ARFF format, it will be loaded directly
without additional configuration.
PRACTICAL-5

AIM: To perform Data Preprocessing in Weka and applying filters.

INTRODUCTION:
The preprocessing of data is a crucial task in data mining. Because most of the data is raw,
there are chances that it may contain empty or duplicate values, have garbage values, outliers,
extra columns, or have a different naming convention. All these things degrade the results.

To make the data cleaner, more consistent and more complete, WEKA provides a
comprehensive set of options under the filter category, including both supervised and
unsupervised operations. Some common preprocessing filters are listed below; a rough
pandas analogue of these filters is sketched after the list:

 ReplaceMissingWithUserConstant: to fix empty or null value issues.


 NumericToNominal: to convert the data from numeric to nominal.
 Remove: to remove a given attribute from data.
 Discretize: discretizes a range of numeric attributes in the dataset into nominal
attributes.

1. ReplaceMissingWithUserConstant: Replaces all missing values for nominal, string,
numeric and date attributes in the dataset with user-supplied constant values. If the
dataset contains missing values, we can apply this filter to replace them with the
user-given values after setting the required options for the filter.
2. Numeric to Nominal: A filter for turning numeric attributes into nominal ones.

3. REMOVE: A filter that removes a range of attributes from the dataset. Will re-order the
remaining attributes if invert matching sense is turned on and the attribute column indices
are not specified in ascending order.

4. DISCRETIZE: An instance filter that discretizes a range of numeric attributes in the


dataset into nominal attributes.
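
A rough pandas analogue of these four filters, shown as an added comparison (the small table
and its column names are assumptions, not a Weka dataset):

import pandas as pd

df = pd.DataFrame({
    "age":    [23, None, 35, 29, 41],
    "grade":  [1, 2, 2, 3, 1],
    "remark": ["ok", "ok", None, "good", "ok"],
})

# ReplaceMissingWithUserConstant: fill missing values with user-supplied constants.
df = df.fillna({"age": 0, "remark": "unknown"})

# NumericToNominal: treat a numeric column as a nominal (categorical) one.
df["grade"] = df["grade"].astype("category")

# Remove: drop an attribute from the dataset.
df = df.drop(columns=["remark"])

# Discretize: bin a numeric attribute into nominal ranges.
df["age_band"] = pd.cut(df["age"], bins=3, labels=["low", "mid", "high"])

print(df)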
PRACTICAL-6

AIM: Implementing The Apriori algorithm

INTRODUCTION:
An association may be found between peanut butter and bread. Finding such associations
becomes vital for supermarkets, as they would stock bread next to peanut butter so that
customers can locate both items easily, resulting in increased sales for the supermarket.
The Apriori algorithm is one such algorithm in ML that finds out the probable associations
and creates association rules. WEKA provides the implementation of the Apriori algorithm.
You can define the minimum support and an acceptable confidence level while computing
these rules. We will apply the Apriori algorithm to the weather data provided in the WEKA
installation.

 Loading Data
In the WEKA explorer, open the Preprocess tab, click on the Open file ... button and
select weather.nominal.arff database from the installation folder. After the data is
loaded you will see the following screen –

The database contains 5 attributes. Even with this small number of attributes, you can
imagine how tedious it would be to detect such associations by manual inspection.
Fortunately, this task is automated with the help of the Apriori algorithm.

 Associator
Click on the Associate TAB and click on the Choose button. Select the Apriori
association as shown in the screenshot –
 To set the parameters for the Apriori algorithm, click on its name, a window will pop
up as shown below that allows you to set the parameters –

 After you set the parameters, click the Start button. After a while you will see the results
as shown in the screenshot below −
 At the bottom, you will find the best association rules detected. These will help in
understanding associations in the weather data.
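
For comparison, the same style of rule mining can also be scripted outside Weka. The sketch
below uses Python's mlxtend library as an assumed alternative toolkit (it is not part of the
Weka workflow above, and the transactions are made up):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Made-up market-basket transactions.
transactions = [
    ["bread", "peanut butter", "milk"],
    ["bread", "peanut butter"],
    ["bread", "milk"],
    ["peanut butter", "milk"],
    ["bread", "peanut butter", "jam"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets at minimum support 0.4, then rules at minimum confidence 0.7.
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])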
PRACTICAL-7

AIM: To implement and explore FPGrowth algorithm in Weka

INTRODUCTION:
FPGrowth is an algorithm for finding patterns in data and it’s much more efficient than its
predecessor, Apriori. FP-Growth is a data mining algorithm used to discover frequent patterns
in large datasets.
It works by building an FP-Tree, a compact representation of transactions, and then recursively
mining frequent itemsets directly from this tree.
One of the key advantages of FP-Growth over traditional methods like Apriori is its efficiency
in handling large datasets. By compressing the dataset into an FP-Tree and mining frequent
itemsets directly from it, FP-Growth can be significantly faster than other approaches,
especially when dealing with sparse datasets.
FP-Growth is efficient, especially for sparse datasets, and is commonly used in applications
like market basket analysis and association rule mining.
It's primarily used for data mining and machine learning tasks such as association rule mining.
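
Outside the Weka GUI, the algorithm is also available programmatically. The following
minimal sketch uses Python's mlxtend library as an assumed alternative (not part of this
Weka-based practical; the transactions are made up):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Made-up transactions; each row is one shopping basket.
transactions = [
    ["beer", "butter", "bread"],
    ["beer", "butter"],
    ["butter", "bread"],
    ["beer", "bread"],
    ["beer", "butter", "milk"],
]

# One-hot encode, then mine frequent itemsets directly from the FP-Tree
# (no Apriori-style candidate generation).
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
itemsets = fpgrowth(onehot, min_support=0.4, use_colnames=True)
print(itemsets.sort_values("support", ascending=False))

The frequent itemsets returned here can be passed to the same association_rules step as with
Apriori.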

Performing FPGrowth algorithm in Weka


1. Create and open the shopping dataset in Weka

2. Remove the No Attribute from the Dataset

3. Go to the Associate tab. The FPGrowth rules can be mined from here.
4. After hitting the Start button, the following information is obtained:

● The scheme used is FPGrowth.
● Instances and Attributes: the dataset has 9 instances and 5 attributes.
● Minimum support and minimum confidence are 0.2 and 0.9 respectively; with this
minimum support, an itemset must occur in at least 2 instances.
● The number of cycles performed for mining the association rules is 12.
● Three levels of large itemsets are generated, L(1), L(2), and L(3), containing 7, 11, and 5
itemsets respectively; the itemsets themselves are not ranked.
● The rules found are ranked. The interpretation of these rules is as follows:
● Butter=T 4 ==> Beer=F 4: means that 4 instances show that when Butter is true, Beer is
false. This gives a strong association; the confidence level is 1.
