EVEREST COLLEGE OF ENGINEERING AND TECHNOLOGY
[Link] CSE
ADVANCED DATABASE MANAGEMENT SYSTEM LAB
Syllabus
Module No.   Detailed Contents                                                     Hrs
1            Distributed Database:                                                  2
             Implementation of Partitions: Range, List.
             Self-Learning Topics: Hash Partition, Composite Partition
2            ORDBMS:
             Implementation of:
             • Abstract Data Type
             • Reference
             Self-Learning Topics: Nested ADT, Inheritance
3            Pre-processing in R:                                                   2
             Data pre-processing techniques in R
             Self-Learning Topics: Sorting, Date Conversion
4            Data Mining - Classification using R-Programming:                      2
             Implementation and Analysis of Regression, Classification Models
5            Data Mining - Clustering and Association using R-Programming:
             Implementation of Market Basket Analysis and Clustering.
6
Experiment 1
DISTRIBUTED DATABASE
RANGE Partitioning in MySQL
Problem Statement:
1a. Implementation of Data partitioning through Range.
Concept:
A table that is partitioned by range is partitioned in such a way that
each partition contains rows for which the partitioning expression value
lies within a given range.
Ranges should be contiguous but not overlapping, and are defined
using the VALUES LESS THAN operator.
The following example demonstrates range partitioning in a MySQL database.
Program:
mysql> CREATE DATABASE mca;
mysql> USE mca;
mysql> CREATE TABLE tr (id INT, name VARCHAR(50), purchased DATE)
       PARTITION BY RANGE (YEAR(purchased)) (
           PARTITION p0 VALUES LESS THAN (1990),
           PARTITION p1 VALUES LESS THAN (1995),
           PARTITION p2 VALUES LESS THAN (2000),
           PARTITION p3 VALUES LESS THAN (2005),
           PARTITION p4 VALUES LESS THAN (2010),
           PARTITION p5 VALUES LESS THAN (2015)
       );
The above code creates a table named tr with three columns and six partitions, whose upper boundary values are given inside the parentheses after VALUES LESS THAN.
Now insert rows into the tr table with the following statement.
mysql> INSERT INTO tr VALUES
(1, 'desk organiser', '2003-10-15'),
(2, 'alarm clock', '1997-11-05'),
(3, 'chair', '2009-03-10'),
(4, 'bookcase', '1989-01-10'),
(5, 'exercise bike', '2014-05-09'),
(6, 'sofa', '1987-06-05'),
(7, 'espresso maker', '2011-11-22'),
(8, 'aquarium', '1992-08-04'),
(9, 'study desk', '2006-09-16'),
(10, 'lava lamp', '1998-12-25');
Now check the inserted data with a SELECT statement.
mysql> SELECT * FROM tr;
We can also verify which partition each row was stored in by querying individual partitions of table tr:
mysql> SELECT * FROM tr PARTITION (p2);
mysql> SELECT * FROM tr PARTITION (p5);
mysql> SELECT * FROM tr PARTITION (p4);
We can also check the number of rows stored in each partition with the following statement:
mysql> SELECT PARTITION_NAME, TABLE_ROWS
       FROM INFORMATION_SCHEMA.PARTITIONS
       WHERE TABLE_NAME = 'tr';
Drop a MySQL partition
If some data in a partitioned table is no longer needed, you can empty or drop one or more partitions. To delete all rows from partition p0 of tr while keeping the partition definition itself, you can use the following statement:
mysql> ALTER TABLE tr TRUNCATE PARTITION p0;
mysql> SELECT PARTITION_NAME, TABLE_ROWS
       FROM INFORMATION_SCHEMA.PARTITIONS
       WHERE TABLE_NAME = 'tr';
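To remove a partition definition itself, rather than only its rows, MySQL also provides ALTER TABLE ... DROP PARTITION. A minimal sketch against the same tr table (not part of the original hand-out):
-- Remove partition p0 and any rows stored in it; unlike TRUNCATE PARTITION,
-- the partition definition is deleted as well.
mysql> ALTER TABLE tr DROP PARTITION p0;
-- Verify the remaining partitions and their row counts.
mysql> SELECT PARTITION_NAME, TABLE_ROWS
       FROM INFORMATION_SCHEMA.PARTITIONS
       WHERE TABLE_NAME = 'tr';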
1b. Implementation of Data partitioning through List partition.
MySQL LIST Partitioning
It is similar to range partitioning. Here, each partition is defined and selected based on the column value matching one of a set of discrete values rather than falling within a contiguous range of values. It is created with the PARTITION BY LIST(expr) clause, where expr is a column value or an expression based on a column value that returns an integer. The VALUES IN (value_list) clause is used to define each partition.
In the below example, suppose we have 12 stores distributed
among four franchises based on their region. The table explains it more
clearly:
Region Store ID Number
East 101, 103, 105
West 102, 104, 106
North 107, 109, 111
South 108, 110, 112
We can partition the above table so that rows for stores belonging to the same region are stored in the same partition. The following statement arranges the stores by region using LIST partitioning, as shown below:
mysql> CREATE TABLE Stores (
cust_name VARCHAR(40),
bill_no VARCHAR(20) NOT NULL,
store_id INT PRIMARY KEY NOT NULL,
bill_date DATE NOT NULL,
amount DECIMAL(8,2) NOT NULL
)
PARTITION BY LIST(store_id) (
PARTITION pEast VALUES IN (101, 103, 105),
PARTITION pWest VALUES IN (102, 104, 106),
PARTITION pNorth VALUES IN (107, 109, 111),
PARTITION pSouth VALUES IN (108, 110, 112));
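To see how rows are routed to the list partitions, a few sample rows can be inserted and individual partitions queried. The customer names, bill numbers, dates and amounts below are illustrative values, not part of the original exercise:
mysql> INSERT INTO Stores VALUES
       ('Mike', 'B-101', 101, '2023-01-10', 1250.50),   -- store_id 101 -> pEast
       ('Alex', 'B-108', 108, '2023-01-12', 980.00),    -- store_id 108 -> pSouth
       ('Ravi', 'B-104', 104, '2023-01-15', 2310.75);   -- store_id 104 -> pWest
mysql> SELECT * FROM Stores PARTITION (pEast);
mysql> SELECT PARTITION_NAME, TABLE_ROWS
       FROM INFORMATION_SCHEMA.PARTITIONS
       WHERE TABLE_NAME = 'Stores';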
Experiment No. 2
ORDBMS
Aim: Implementation of Abstract Data Type & Reference
Creating Tables Under the Relational Model
The relational approach normalizes everything into tables. The
table names are Customer_reltab, PurchaseOrder_reltab, and Stock_reltab.
Each part of an address becomes a column in the Customer_reltab
table. Structuring telephone numbers as columns sets an arbitrary limit on
the number of telephone numbers a customer can have.
The relational approach separates line items from their purchase orders and puts each into its own table, named PurchaseOrder_reltab and LineItems_reltab.
The relational approach results in the tables described in the following sections.
Customer_reltab
The Customer_reltab table has the following definition:
CREATE TABLE Customer_reltab (
CustNo NUMBER NOT NULL,
CustName VARCHAR2(200) NOT NULL,
Street VARCHAR2(200) NOT NULL,
City VARCHAR2(200) NOT NULL,
State CHAR(2) NOT NULL,
Zip VARCHAR2(20) NOT NULL,
Phone1 VARCHAR2(20),
Phone2 VARCHAR2(20),
Phone3 VARCHAR2(20),
PRIMARY KEY (CustNo));
This table, Customer_reltab, stores all the information about customers: both information that is intrinsic to the customer (defined with the NOT NULL constraint) and information that is not as essential. According to this definition of the table, the application requires that every customer have a shipping address.
PurchaseOrder_reltab
The PurchaseOrder_reltab table has the following definition:
CREATE TABLE PurchaseOrder_reltab (
  PONo      NUMBER,                              /* purchase order no */
  CustNo    NUMBER REFERENCES Customer_reltab,   /* foreign key referencing customer */
  OrderDate DATE,                                /* date of order */
  ShipDate  DATE,                                /* date to be shipped */
  ToStreet  VARCHAR2(200),                       /* ship-to address */
  ToCity    VARCHAR2(200),
  ToState   CHAR(2),
  ToZip     VARCHAR2(20),
  PRIMARY KEY (PONo));
PurchaseOrder_reltab manages the relationship between the customer and the purchase order by means of the foreign key (FK) column CustNo, which references the CustNo key of Customer_reltab. The PurchaseOrder_reltab table contains no information about related line items. The line items table (next section) uses the purchase order number to relate a line item to its parent purchase order.
Stock_reltab
The Stock_reltab table has the following definition:
CREATE TABLE Stock_reltab (
StockNo NUMBER PRIMARY KEY,
Price NUMBER,
TaxRate NUMBER);
LineItems_reltab
The LineItems_reltab table has the following definition:
CREATE TABLE LineItems_reltab (
LineItemNo NUMBER,
PONo NUMBER REFERENCES PurchaseOrder_reltab,
StockNo NUMBER REFERENCES Stock_reltab,
Quantity NUMBER,
Discount NUMBER,
PRIMARY KEY (PONo, LineItemNo));
The table name is in the plural form LineItems_reltab to emphasize to someone reading the code that the table holds a collection of line items. The table has two foreign keys: PONo, which references the PONo column in PurchaseOrder_reltab, and StockNo, which references the StockNo column in Stock_reltab.
Inserting Values Under the Relational Model
In our application, statements like these insert data into the tables:
INSERT INTO Stock_reltab VALUES(1004, 6750.00, 2);
INSERT INTO Stock_reltab VALUES(1011, 4500.23, 2);
INSERT INTO Stock_reltab VALUES(1534, 2234.00, 2);
INSERT INTO Stock_reltab VALUES(1535, 3456.23, 2);
INSERT INTO Customer_reltab
VALUES (1, 'Jean Nance', '2 Avocet Drive',
'Redwood Shores', 'CA', '95054',
'415-555-1212', NULL, NULL);
INSERT INTO Customer_reltab
VALUES (2, 'John Nike', '323 College Drive',
'Edison', 'NJ', '08820',
'609-555-1212', '201-555-1212', NULL);
INSERT INTO PurchaseOrder_reltab
VALUES (1001, 1, SYSDATE, '10-MAY-1997',
NULL, NULL, NULL, NULL);
INSERT INTO PurchaseOrder_reltab
VALUES (2001, 2, SYSDATE, '20-MAY-1997',
'55 Madison Ave', 'Madison', 'WI', '53715');
INSERT INTO LineItems_reltab VALUES(01, 1001, 1534, 12, 0);
INSERT INTO LineItems_reltab VALUES(02, 1001, 1535, 10, 10);
INSERT INTO LineItems_reltab VALUES(01, 2001, 1004, 1, 0);
INSERT INTO LineItems_reltab VALUES(02, 2001, 1011, 2, 1);
Querying Data Under the Relational Model
The application can execute queries like these:
SELECT C.CustNo, C.CustName, C.Street, C.City, C.State,
       C.Zip, C.Phone1, C.Phone2, C.Phone3,
       P.PONo, P.OrderDate,
       L.StockNo, L.LineItemNo, L.Quantity, L.Discount
FROM   Customer_reltab C,
       PurchaseOrder_reltab P,
       LineItems_reltab L
WHERE  C.CustNo = P.CustNo
AND    P.PONo = L.PONo
AND    P.PONo = 1001;
Get the Total Value of Purchase Orders
SELECT   P.PONo, SUM(S.Price * L.Quantity)
FROM     PurchaseOrder_reltab P,
         LineItems_reltab L,
         Stock_reltab S
WHERE    P.PONo = L.PONo
AND      L.StockNo = S.StockNo
GROUP BY P.PONo;
Get the Purchase Order and Line Item Data for Stock Item 1004
SELECT P.PONo, P.ShipDate,
       L.StockNo, L.LineItemNo, L.Quantity, L.Discount
FROM   PurchaseOrder_reltab P,
       LineItems_reltab L
WHERE  P.PONo = L.PONo
AND    L.StockNo = 1004;
Updating Data Under the Relational Model
The application can execute statements like these to update the data:
UPDATE LineItems_reltab
SET Quantity = 20
WHERE PONo = 1001
AND StockNo = 1534;
Deleting Data Under the Relational Model
DELETE
FROM LineItems_reltab
WHERE PONo = 1001;
DELETE
FROM PurchaseOrder_reltab
WHERE PONo = 1001;
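The aim of this experiment also covers an abstract data type (ADT) and a reference. The following is a minimal Oracle object-relational sketch of the same purchase-order scenario; the type and table names (Address_objtyp, Customer_objtyp, Customer_objtab, PurchaseOrder_objtab) are illustrative assumptions rather than names taken from the lab hand-out.
-- Abstract data type (object type) for an address.
CREATE TYPE Address_objtyp AS OBJECT (
  Street VARCHAR2(200),
  City   VARCHAR2(200),
  State  CHAR(2),
  Zip    VARCHAR2(20));
/
-- Object type for a customer that embeds the address ADT.
CREATE TYPE Customer_objtyp AS OBJECT (
  CustNo      NUMBER,
  CustName    VARCHAR2(200),
  Address_obj Address_objtyp);
/
-- Object table of customers; each row is an instance of Customer_objtyp.
CREATE TABLE Customer_objtab OF Customer_objtyp (CustNo PRIMARY KEY)
  OBJECT IDENTIFIER IS PRIMARY KEY;

-- Purchase-order table that stores a REF (reference) to the customer object
-- instead of repeating the customer's data.
CREATE TABLE PurchaseOrder_objtab (
  PONo      NUMBER PRIMARY KEY,
  Cust_ref  REF Customer_objtyp SCOPE IS Customer_objtab,
  OrderDate DATE);

-- Populate the tables: REF(c) captures a reference to the customer row object.
INSERT INTO Customer_objtab
  VALUES (1, 'Jean Nance',
          Address_objtyp('2 Avocet Drive', 'Redwood Shores', 'CA', '95054'));

INSERT INTO PurchaseOrder_objtab
  SELECT 1001, REF(c), SYSDATE FROM Customer_objtab c WHERE c.CustNo = 1;

-- DEREF follows the reference back to the customer object.
SELECT p.PONo, DEREF(p.Cust_ref) AS customer FROM PurchaseOrder_objtab p;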
EXPERIMENT NO. 3
Aim: Data preprocessing in R.
Objective: To understand what data preprocessing is and the different steps involved in data preprocessing.
Theory:
Data Preprocessing
Data preprocessing is the process of preparing raw data to make it suitable for a machine learning model. It is the first and most crucial step in building any machine learning model.
When creating a machine learning model, we do not always come across clean and well-formatted data. The data must be cleaned and put into a usable format before being used for any model, and this is what data preprocessing does.
Why do we need Data Preprocessing?
Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be used directly for machine learning models. Data preprocessing cleans the data and makes it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.
Data Preprocessing in R
[Link] file
Importing the Dataset
Here, we first change the working directory to E:/datasets (where the [Link] file is stored), read the CSV file, and display all 7 rows; a minimal import sketch is given below.
This dataset consists of four features. The dependent factor is the ‘Purchased’ column.
If the above dataset is to be used for a machine learning model, the idea will be to predict whether an item was purchased or not depending on the Country, Age and Salary of a person. The highlighted cells with the value ‘NA’ denote missing values in the dataset.
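A minimal sketch of the import step, assuming the dataset is a comma-separated file named data.csv (the file name is an assumption, since the original name is not reproduced here):
# Set the working directory to the folder containing the dataset.
setwd("E:/datasets")
# Read the CSV file; "data.csv" is an assumed file name.
dataset = read.csv("data.csv")
# Display all rows of the dataset.
print(dataset)
# Inspect column names, types and missing (NA) values.
str(dataset)
summary(dataset)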
Dealing with Missing Values
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
The above code checks for missing values in the Age and Salary columns and updates the missing cells with the column-wise average.
dataset$column_header:
Selects the column in the dataset specified after $ (Age and Salary).
is.na(dataset$column_header):
This method returns TRUE for all the cells in the specified column that have no value (NA).
ave(dataset$column_header, FUN = function(x) mean(x, na.rm = TRUE)):
This method calculates the average of the column passed as an argument, ignoring NA values.
Output:
Since we don’t want decimal places for Age, we will round it using the following code.
dataset$Age = as.numeric(format(round(dataset$Age, 0)))
The argument 0 in the round function means no decimal places.
After executing the above code block, the dataset would look like what is shown below:
Dealing with Categorical Data
Categorical variables represent types of data which may be divided
into groups. Examples of categorical variables are race, sex, age group,
educational level etc.
In our dataset, ‘Purchased’ is a categorical feature. In R, we can use the factor method to convert text categories into numerical codes.
dataset$Purchased = factor(dataset$Purchased, levels = c('No','Yes'), labels = c(0,1))
factor(dataset$column_header, levels = c(), labels = c()):
The factor method converts the categorical values in the specified column to factors or numerical codes.
levels:
The categories in the column passed as a vector. Example
c(‘No’,’Yes’)
labels:
The numerical codes for the specified categories in the same order.
Example c(0,1))
Output:
Questions:
1. Incorrect or invalid data is known as ______.
[Link] data [Link] [Link] data [Link] data
2. What will be the output of the following R code?
> x <- c(2, 6, NaN, NA, 4)
> is.na(x)
a) FALSE FALSE TRUE TRUE FALSE
b) FALSE TRUE TRUE TRUE FALSE
c) TRUE FALSE TRUE TRUE FALSE
d) TRUE FALSE TRUE FALSE FALSE
3. ______ is used for cleaning the data and making it suitable for a machine learning model.
a. Data preprocessing
b. Saving the data
c. Data Repairing
d. Data removing
EXPERIMENT NO. 4
Aim: To implement and analyse linear regression
Objective:- To understand linear regression, a statistical model used to study the relationship that could exist between two variable quantities: one of the variables is called the independent variable (x) and the other is known as the dependent variable (y).
Theory:
LINEAR REGRESSION
In linear regression, the independent and dependent variables are assumed to have a linear relationship. This implies that the relationship between the two can be represented by a straight line.
Simple linear regression is used to estimate the relationship between two quantitative variables. You can apply simple linear regression when you need to know:
1) The extent of the association between the two variables.
2) The value of the dependent variable at a given value of the independent variable.
Program:
In this program, the ages and heights of people are recorded in an Excel file "[Link]", and the relationship between ages (independent variable) and heights (dependent variable) is studied.
This relationship between heights and ages can be expressed as a linear equation:
heights = m*ages + c,
where m is the slope of the line and c is the intercept.
([Link] file )
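A minimal R sketch of this analysis, assuming the Excel file is named ageandheight.xls and contains columns named ages and heights (the readxl package, the file name and the column names are assumptions, not taken from the original program):
# readxl is assumed for reading the Excel file; install.packages("readxl") if needed.
library(readxl)
# Hypothetical file name with columns "ages" and "heights".
data = read_excel("ageandheight.xls")
# Fit a simple linear regression: heights = m*ages + c.
model = lm(heights ~ ages, data = data)
# Show residuals, coefficients (intercept c and slope m), p-values and R-squared.
summary(model)
# Predict the height of a person aged 18 from the fitted line.
predict(model, data.frame(ages = 18))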
Output :
Residuals: The intention is for the sum of the residuals to be close to zero
or as low as possible. Most cases will not follow a completely straight line
in real life, hence residuals are always to be expected.
Coefficients: The values of the intercept (the “c” value) and the slope (the “m” value) for age can be seen. These “c” and “m” values are used to draw the fitted line through the data points.
So in this case, if there is a person whose age is 18, then the height can be calculated as (18*2.0500 + 131.0778).
The p-value for age is 0.000001155. The smaller this value, the stronger the evidence that ‘ages’ is a good determinant of ‘heights’.
The R² value is close to 1 for models that fit well, while models that fit the data poorly have an R² value close to 0. In this output, the R-squared value is 0.9715, which means the model explains almost 97% of the variability.
EXPERIMENT NO. 5
Aim: Implementation of market basket analysis.
Objective:- To implement market basket analysis in R
Theory:
Market Basket Analysis is a form of frequent itemset mining that
examines consumer purchasing patterns by identifying relationships
between the many goods in their "shopping baskets." By getting insight
into which goods are commonly purchased together by customers,
businesses may build marketing strategies based on the finding of these
relationships.
In short, Market Basket Analysis is a method of determining which items tend to be purchased together in a market basket.
MBA is most often used to help in cross-selling and up-selling. If you know that customers who buy trousers also buy belts, for example, you may advertise the belts on the same page or offer them as part of a bundle to try to boost sales. Advertising one product may also lead to an increase in sales of the other.
Customers' purchase patterns are depicted using "Association
Rules" in Market Basket Analysis. A rule's interestingness is determined
by two metrics: support and confidence.
Example:
Tea_powder => sugar [support = 4%, confidence = 70%]
a. A support of 4% for the above rule states that 4% of all the transactions under analysis show that tea powder and sugar are purchased together:
support(B => C) = P(B U C)
b. A confidence of 70% means that 70% of the customers who purchased tea powder also bought sugar:
confidence(B => C) = P(C | B) = support(B U C) / support(B)
c. Lift is a metric that helps us figure out whether buying one product increases the chance of the other being sold:
lift(B => C) = confidence(B => C) / support(C)
A lift greater than 1 indicates that the two items are bought together more often than would be expected by chance.
Packages/functions used:
1) arules
It is used for displaying, manipulating, and analysing transaction data and
patterns (frequent item sets and association rules)
2) arulesViz
Extends package 'arules' with various visualization techniques for
association rules and item sets.
3) inspect()
It displays transactions, frequent itemsets and association rules in a readable form.
4) is.redundant()
It finds redundant rules.
5) apriori()
From a given collection of transaction data, apriori() generates the most relevant set of rules. It also reports each rule's support, confidence, and lift. The relative strength of the rules can be judged using these three measures.
6) plot()
It is used to visualize association rules and itemsets. It implements several popular visualization methods, such as scatter plots and graph-based plots.
Algorithm:
The Apriori algorithm seeks out frequently recurring itemsets. An itemset is a collection of related items (such as products in a basket); whether it counts as frequent is determined by a user-defined minimum "support" level.
1) Read through all the transactions.
2) Calculate the support for each item.
3) If an item's support is less than the minimum, discard it. Otherwise, add it to the frequent itemset.
4) Determine the confidence for each non-empty subset. If the confidence is less than the minimum, the subset is discarded.
Program:
1) library(arules)
2) library(arulesViz)
3) data("Groceries")   # load the built-in Groceries transactions dataset
inspect(Groceries)
4) grules <- apriori(Groceries)
5) grules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8))
6) inspect(grules[1:5])
7) grules <- sort(grules, by = "support", decreasing = TRUE)
inspect(grules[1:10])
8) redundant_grules <- is.redundant(grules)
9) summary(redundant_grules)
10) grules <- grules[!redundant_grules]
11) inspect(grules)
12) grules_whole_milk <- apriori(Groceries, parameter = list(supp = 0.0001,
conf = 0.08), appearance = list(default = "rhs", lhs = "whole milk"))
13) inspect(grules_whole_milk[1:7])
14) plot(grules_whole_milk, method = "graph", engine = "interactive")
Graph: