NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
NAME PLACENTE, JOHN CABRERA
RAPIDMINER: HOW TO ACCESS YOUR DATA
EXERCISE 1: Read data. Prepare CDSA_EX01.xlsx
Objectives:
1. Read CDSA_EX01.xlsx
2. Import in Rapidminer
3. Read and show DATA RESULT
4. Show statistics report.
OUTPUT NO. 01: SHOW YOUR DATA RESULT SCREENSHOT.
OUTPUT NO. 02: SHOW YOUR STATISTICS REPORT.
EXERCISE #2: MODIFYING YOUR FIRST CHART
Use your CDSA_EX01 to show a LINE CHART.
X: TRANSACTION
Y: REPAIRTIME
OUTPUT 1: SHOW RESULT
OUTPUT 2: CUSTOMIZING THE CHART
Title: REPAIR TIME – SERVICES
Subtitle: PHILIPPINES
SHOW MARKERS (CIRCLE)
CHANGE THE COLOR TO RED
SHOW LABEL
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
EXERCISE #3: SAVING YOUR PROCESS.
OBJECTIVE:
1. Create a repository CDSA-CMDI.
2. Save your process.
3. Name the process CDSA-EX01PROCESS
ACTIVITY OUTPUT: Write the step-by-step procedure on how to create repository and save your process.
Include screenshots.
STEP ACTIVITY OWNER DURATION
EXERCISE #4. LET’S DO MORE ANALYSIS
A. Load the data file CDSA_EX02.xlsx
B. Answer the following questions:
1. There are more male customers than female. True or false? Show evidence.
FEMALE
2. There are more customers aged 50 and above. True or false? Show evidence. TRUE
3. The average gross sales is 69883.38. True or false? Show evidence.
4. The maximum cost of goods sold (COGS) is PHP 39,134.69. True or false? Show evidence.
5. What is the average customer satisfaction?
6. How many are the customers in Luzon, Visayas and Mindanao?
EXERCISE #5. LET’S DO MORE CHARTING
A. Using CDSA_EX02.xlsx, perform the following charts.
1. Create a bar chart and line chart.
Bar Chart: Gross Sales
Line Chart: COGS
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
By: AGE
Result:
2. Create Boxplot of Gross Sales
Group By: Location
Result
3. Create donut chart of NETSALES
Group by Age
Result
4. Create SCATTER/Bubble plot
X-axis: Age
Value Column: Gross Sales
Color by: Location
Result:
Questions:
1. What are the ages of customers from Mindanao?
5. Create a heat map of the average satisfaction of customers
Group by AGE.
Result:
CASE STUDY NO. 01
AYO-AYO BANK OF ASIA is studying its business performance in the Philippines. A group of data scientists
now studies the data sets given to them by the corporation. Open the data set and provide the company
with an insight. What do you see and what can you share as a data scientist?
FILE: CDSA_EX03.xlsx
CASE STUDY NO. 02
What is causing the high reject in the factory?
FILE: CDSA_EX03B.xlsx
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
EXERCISE #6: FILTERING OF DATA
A. LOAD DATA CDSA_EX03.xls
OBJECTIVE: GET ALL LOAN DATA ONLY
OPERATOR: Use FILTER EXAMPLES
RESULT:
EXERCISE #7: REMOVING OF DUPLICATE DATA
A. LOAD DATA CDSA_EX04.xls
OBJECTIVE:
1. Remove duplicate data from the file.
RESULT:
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
SHOW NEW DATA WITHOUT THE DUPLICATE
SHOW THE REMOVED DATA SET
EXERCISE #8: WRITE THE NEW DATA TO THE DATA SET
A. From exercise #7, dump the duplicate data to an EXCEL FILE
OPERATOR: WRITE EXCEL
1. CREATE A BLANK EXCEL FILE, “[Link]”
2. USE THE OPERATOR
RESULT:
EXERCISE #10: LOOK FOR MISSING DATA
A. FROM THE FILE CDSA_EX05.XLSX
OBJECTIVE: USE THE OPERATOR FILTER EXAMPLES TO LOOK FOR MISSING DATA SETS.
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
RESULTS:
EXERCISE #11: GENERATING ATTRIBUTES 1
A. Given the file CDSA_EX02.xlsx:
CREATE COLUMN “RESULT”
If the CSAT is less than 7, then state “FAILED”
If the CSAT is 7 or more, then state “PASS”
Use the operator GENERATE ATTRIBUTE.
RESULT:
EXERCISE #12: GENERATING ATTRIBUTES 1
A. Using CDSA_EX06.xlsx
Compute for CSAT if
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
CSAT = 0.3 X CONTENT + 0.6 X TRAINER + 0.1 X VENUE
Use GENERATE ATTRIBUTE to produce results.
RESULT:
EXERCISE #13: GENERATE COPY OF COLUMNS
A. Using CDSA_EX06.xlsx
Generate a copy of the column CONTENT and name it as CONTENT2.
RESULT:
EXERCISE #14: GENERATE CONCATENATION
A. USING CDSA_EX02, CONCATENATE CUSTOMER AND AGE into CUSTOMER_AGE
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
DATA SCIENCE MAJOR TOOLS
DECISION TREES
Gives you a TREE based partition of data sets
Normally this is used to describe the data set thru trees
Used in root cause analysis
Example:
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
EXERCISE #15: GENERATE DECISION TREES (EXAMPLE 1)
A. Given CDSA_EX07, determine who are giving failing grade to transactions.
USE DECISION TREES.
STUDY RESULTS:
Level 2:
Level 3:
Level 4:
EXERCISE #16: DETERMINE FACTORS THAT AFFECTS HIGH SUGAR LEVEL (EXAMPLE 2)
A. Given CDSA_EX08_DIABETES, determine what factors affects SUGAR LEVEL
USE DECISION TREES.
EXERCISE #17: DETERMINE CHURN FACTORS (EXAMPLE 3)
A. Given CDSA_EX09, determine who what factors affects CHURN (Uumaalis)
USE DECISION TREES.
EXERCISE #18: INSURANCE CHARGES (EXAMPLE 4)
A. Given CDSA_EX10, determine who what factors affects INSURANCE CHARGES
USE DECISION TREES.
Determine outliers in the data sets
METHOD 1: Use boxplot to detect outliers.
METHOD 2: Use operator DETECT OUTLIER (Distances)
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
EXERCISE #19: OUTLIER DETECTION
A. Given CDSA_EX11, detect the outliers.
Use DETECT OUTLIER (Distances)
METHOD 3: Use statistical process control
UCL = Mean + 3 x stdev
LCL = Mean – 3 x stdev
UCL = Upper control limit
LCL = Lower control limit
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
EXERCISE #20: OUTLIER DETECTION USING STATISTICAL PROCESS CONTROL
A. Given CDSA_EX11, create an SPC chart to monitor outlier
*** NOTE: Use the first 30 CLEAN dataset as baseline for UCL and LCL
RESULT
NELSON RULES FOR OUTLIER DETECTION / ANOMALY DETECTION
Instructions: Draw an example per each rule.
1 | One point is beyond Zone A.
2 | Nine points in a row in Zone C or beyond.
3 | Six points in a row steadily increasing or decreasing
4 | Fourteen points in a row alternating up and down
5 | Two out of three points in a row in Zone A or beyond
6 | Four out of five points in a row in Zone B or beyond
7 | Fifteen points in a row are in Zone C
8 | Eight points in a row on both sides of centerline with none in Zone C.
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
LINEAR REGRESSION
USE LINEAR REGRESSION to determine relationship between a response and factor, such as SALES AND
MARKETING.
EXAMPLE: For the dataset CDSA_EX18.XLSX, can you determine whether marketing is effective in
increasing sales. Also predict sales, if for example, I spend PHP 1,000,000 for marketing– how much is
the predicted sales.
SHOW RESULTS:
Also show accuracy of the performance model.
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
POLYNOMIAL REGRESSION
USE POLYNOMIAL REGRESSION to determine relationship between a response and factor, such as SALES
AND MARKETING.
EXAMPLE: For the dataset [Link], determine the relationship between YEARS OF
SERVICE AND SALARY
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
SHOW RESULTS:
Also show accuracy of the performance model.
GENERALIZED LINEAR MODELLING
LINEAR REGRESSION AND MULTIPLE LINEAR REGRESSION ANALYSIS
Helps determine factors that affects outcomes
Helps in forecasting values based on factors
Determines what factors are important to response
EXAMPLE DATA:
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
EXERCISE #21: GENERAL LINEAR MODELLING (EXCEL DATA ANALYSIS)
Use general linear modelling to determine the relationship of factors to response MSAT (Marriage
Satisfaction).
DATA: CDSA_EX12.xlsx
USE EXCEL DATA ANALYSIS
RESULTS:
EXERCISE #21: GENERAL LINEAR MODELLING (USING RAPIDMINER)
Use general linear modelling to determine the relationship of factors to response MSAT (Marriage
Satisfaction).
DATA: CDSA_EX12.xlsx
USE RAPIDMINER
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
RESULTS:
MODEL PERFORMANCE TESTING AND VALIDATION
Analysis models are measured via ABSOLUTE ERROR, CORRELATION AND SQUARED
CORRELATION including all other possible measures.
HOW DO WE MEASURE THE PERFORMANCE OF A CERTAIN MODEL?
EXERCISE #22: MEASURE PERFORMANCE OF THE GENERAL LINEAR MODEL (USING PEFORMANCE)
Use general linear modelling to determine the relationship of factors to response MSAT (Marriage
Satisfaction). Then determine the model performance using OPERATOR: PERFORMANCE (REGRESSION)
DATA: CDSA_EX12.xlsx
USE RAPIDMINER
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
SHOW RESULTS:
Is the model accurate?
NEURAL NETWORK MODELLING
EXERCISE #23: MEASURE PERFORMANCE OF THE NEURAL NETWORK (USING PEFORMANCE)
Use NEURAL NETWORK to determine the relationship of factors to response MSAT (Marriage
Satisfaction). Then determine the model performance using OPERATOR: PERFORMANCE (REGRESSION)
DATA: CDSA_EX12.xlsx
USE RAPIDMINER
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
RESULTS:
Show the accuracy of the performance model NEURAL NET.
Compare to General Linear Model.
DEEP LEARNING MODELLING
EXERCISE #24: MEASURE PERFORMANCE OF THE NEURAL NETWORK (USING PEFORMANCE)
Use DEEP LEARNING to determine the relationship of factors to response MSAT (Marriage Satisfaction).
Then determine the model performance using OPERATOR: PERFORMANCE (REGRESSION)
DATA: CDSA_EX12.xlsx
USE RAPIDMINER
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
RANDOM FOREST MODELLING
EXERCISE #25: MEASURE PERFORMANCE OF THE RANDOM FOREST (USING PEFORMANCE)
Use RANDOM FOREST to determine the relationship of factors to response MSAT (Marriage Satisfaction).
Then determine the model performance using OPERATOR: PERFORMANCE (REGRESSION)
DATA: CDSA_EX12.xlsx
USE RAPIDMINER
DECISION TREE MODELLING
EXERCISE #25: MEASURE PERFORMANCE OF THE DECISION TREE FOREST (USING PEFORMANCE)
Use RANDOM FOREST to determine the relationship of factors to response MSAT (Marriage Satisfaction).
Then determine the model performance using OPERATOR: PERFORMANCE (REGRESSION)
DATA: CDSA_EX12.xlsx
USE RAPIDMINER
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
CASE STUDY #3: LOAN SCORE PREDICTION
EXERCISE #25: MEASURE PERFORMANCE USING DIFFERENT MODELS
GIVEN the file CDSA_EX14, determine the LOAN SCORE PREDICTION model and its performance if the
following models are used:
A. DECISION TREE
B. GLM
C. NEURAL NET
D. RANDOM FOREST
E. DEEP LEARNING
F. GLM VIA EXCEL
Create a table and show absolute error, relative error, correlation and squared correlation.
Determine which factors affect the loan score.
CDAA
DBMA
CDRA
CDSA
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
[Link]
MODEL ABSOLUTE RELATIVE CORRELATION SQUARED
ERROR ERROR CORRELATION
DECISION T.
GLM
NEURAL
R. FOREST
DEEP LEARNING
HYPOTHESIS TESTING: COMPARISON OF MEANS
In data science modelling, we statistically and probabilistically compare the performance
against a target or against another model.
ONE SAMPLE T
Given the file CDSA_E16.xlsx, determine if the Absolute Percentage Error meets the 3.0% error
criterion/requirement of the company.
Use SPCforExcel in the process.
SHOW RESULT:
Do they meet the requirement in forecast model performance?
TWO SAMPLE T
Given the file CDSA_E17.xlsx, compare the model performance between NEURAL NET and DEEP
LEARNING. Which one is better?
Use SPCforExcel in the process.
SHOW RESULT:
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA
PAIRED T
Given the file CDSA_E19.xlsx, determine if the employee performance survey improved?
Use SPCforExcel in the process.
SHOW RESULT:
ONE WAY ANOVA
Given the file CDSA_E20.xlsx, determine if the MODELS ARE EQUAL OR NOT
SHOW RESULT:
ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE