MachineLearningBigR Tutorial

Uploaded by

ahmed

Machine Learning with Big R Tutorial

Learn how to use machine learning with IBM® InfoSphere® BigInsights™ Big R to perform
statistical analysis and modeling on big data. You must download, license, and install the
appropriate R software before using Big R.

About
In this scenario, you use the machine learning capabilities of Big R to perform statistical
analysis and modeling on a sample airline data set. You predict the arrival delay of flights
by using other columns as predictors.

The airline data set, which is provided in the Big R package, contains a small sample of US
flight information from 1987 to 2008.

Procedure
Run commands from an R environment.

1. Access the airline data set on HDFS.
2. Perform data transformations required for the machine learning algorithms.
3. (Optional) Calculate descriptive statistics.
4. Create training and testing sets to use for the following models:
   • Create a linear regression model for the arrival delay of flights, and use it to generate
     predictions.
   • Create a support vector machine classifier for the arrival delay of flights.

Accessing data on HDFS

# Connect to BigInsights.
> bigr.connect(host="myhost.ibm.com", port=7052, user="my_user",
password="my_password")

# Access the airline data set on HDFS. The useMapReduce parameter defaults to TRUE.
# The sample data set is small, so set the parameter to FALSE to run faster.
# To run the example on a large data set, set useMapReduce to TRUE.
> airline <- bigr.frame(bigr.env$TEXT_FILE, "/user/airline_lab.csv", ",",
coltypes=ifelse(1:29 %in% c(9,11,17,18,23), "character", "integer"),
header=TRUE, na.string = "NA", useMapReduce=FALSE)

# Display the structure of the data set, which has 29 columns.
> str(airline)
'bigr.frame': 29 variables:
$ Year : int 2004 2004 2004 2004 2004 2004
$ Month : int 2 2 2 2 2 2
$ DayofMonth : int 12 16 18 19 21 24
$ DayOfWeek : int 4 1 3 4 6 2
$ DepTime : int 633 2115 700 1140 936 1117
$ CRSDepTime : int 635 2120 700 1145 935 1120
$ ArrTime : int 935 2340 817 1427 1036 1922
$ CRSArrTime : int 930 2350 820 1420 1035 1930
$ UniqueCarrier : chr "B6" "B6" "B6" "B6" "B6" "B6"
$ FlightNum : int 165 199 2 67 68 206
$ TailNum : chr "N553JB" "N570JB" "N544JB" "N570JB" "N544JB" "N548JB"
$ ActualElapsedTime: int 182 325 77 167 60 305
$ CRSElapsedTime : int 175 330 80 155 60 310
$ AirTime : int 162 114 49 141 41 468
$ ArrDelay : int 5 -10 -3 7 1 -8
$ DepDelay : int -2 -5 0 -5 1 -3
$ Origin : chr "JFK" "JFK" "JFK" "RSW" "JFK" "LGB"
$ Dest : chr "TPA" "LAS" "BUF" "JFK" "SYR" "JFK"
$ Distance : int 1005 2248 301 1074 209 2465
$ TaxiIn : int 3 8 2 7 3 7
$ TaxiOut : int 17 23 26 19 16 10
$ Cancelled : int 0 0 0 0 0 0
$ CancellationCode : chr "NA" "NA" "NA" "NA" "NA" "NA"
$ Diverted : int 0 0 0 0 0 0
$ CarrierDelay : int 0 0 0 0 0 0
$ WeatherDelay : int 0 0 0 0 0 0
$ NASDelay : int 0 0 0 0 0 0
$ SecurityDelay : int 0 0 0 0 0 0
$ LateAircraftDelay: int 0 0 0 0 0 0
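
The coltypes argument in the bigr.frame call is built with plain R, so it can be inspected without a BigInsights connection. The following base R sketch shows which columns the expression marks as character. The col_names vector is copied from the str(airline) output above, and char_cols and char_names are illustrative names that do not appear in the tutorial.

```r
# Base R only; no Big R connection is needed to inspect the type vector.
# Positions of the character columns, as passed to coltypes in the tutorial.
char_cols <- c(9, 11, 17, 18, 23)

# Same expression as in the bigr.frame() call: one type name per column.
coltypes <- ifelse(1:29 %in% char_cols, "character", "integer")

# Column names in file order, copied from the str(airline) output.
col_names <- c("Year", "Month", "DayofMonth", "DayOfWeek", "DepTime",
               "CRSDepTime", "ArrTime", "CRSArrTime", "UniqueCarrier",
               "FlightNum", "TailNum", "ActualElapsedTime", "CRSElapsedTime",
               "AirTime", "ArrDelay", "DepDelay", "Origin", "Dest", "Distance",
               "TaxiIn", "TaxiOut", "Cancelled", "CancellationCode", "Diverted",
               "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
               "LateAircraftDelay")

# Names of the columns that will be read as character.
char_names <- col_names[coltypes == "character"]
print(char_names)
```

The five names printed are the same columns that str(airline) reports as chr.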

Perform data transformations

# Filter relevant columns for modeling and statistical analysis.
> airlineFiltered <- airline[, c("Month", "DayofMonth", "DayOfWeek", "CRSDepTime",
"Distance", "ArrDelay")]

# Discretize the ArrDelay column into three categories: High (more than 15),
# Low (less than 5), and Medium (5 to 15). The categories are used to make predictions.
> airlineFiltered$Delay <- ifelse(airlineFiltered$ArrDelay > 15, "High",
ifelse(airlineFiltered$ArrDelay < 5, "Low",
"Medium"))

# Machine learning algorithms take objects of class bigr.matrix as input.
# A bigr.matrix object is a numeric data set on HDFS. Use the bigr.transform
# function to recode non-numeric columns.
> airlineMatrix <- bigr.transform(airlineFiltered,
outData="/user/airlinef.sample.matrix",
transformPath="/user/airline.sample.transform")

# Display the recoded data. Notice that the “Delay” column was recoded into
# numeric values {1, 2, 3} corresponding to {“Low”, “Medium”, “High”}.
> str(airlineMatrix)
'bigr.matrix': 7 variables:
$ Month : scale 2 2 2 2 2 2
$ DayofMonth: scale 12 16 18 19 21 24
$ DayOfWeek : scale 4 1 3 4 6 2
$ CRSDepTime: scale 635 2120 700 1145 935 1120
$ Distance : scale 1005 2248 301 1074 209 2465
$ ArrDelay : scale 5 -10 -3 7 1 -8
$ Delay : nominal 2 1 1 2 1 1
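
The discretization and recoding steps can be tried on an ordinary R vector. The sketch below is a base R stand-in, not the Big R implementation: it applies the same nested ifelse to the six ArrDelay values shown above, then uses factor() with an explicit level order to get integer codes. The level order is an assumption chosen to mirror the mapping the tutorial describes; bigr.transform assigns its own codes.

```r
# Base R stand-in for the Big R steps; arr_delay holds the six ArrDelay
# values shown in the str() output.
arr_delay <- c(5, -10, -3, 7, 1, -8)

# Same nested ifelse as the tutorial: > 15 is High, < 5 is Low, else Medium.
delay <- ifelse(arr_delay > 15, "High",
                ifelse(arr_delay < 5, "Low", "Medium"))

# Recode the labels to integers with an explicit level order, so that
# Low = 1, Medium = 2, High = 3 (an assumption that mirrors the tutorial's
# stated mapping; bigr.transform chooses its own codes).
delay_code <- as.integer(factor(delay, levels = c("Low", "Medium", "High")))

print(data.frame(ArrDelay = arr_delay, Delay = delay, Code = delay_code))
```

For these six rows the codes agree with the Delay column in the str(airlineMatrix) output. Note that a delay of exactly 5 or exactly 15 falls into Medium, because both comparisons are strict.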

Calculate descriptive statistics

# Compute the following descriptive statistics for each column: minimum, maximum,
# range, mean, variance, standard deviation, standard error of the mean,
# coefficient of variation, skewness, kurtosis, standard error of skewness,
# standard error of kurtosis, median, interquartile mean, number of categories,
# and number of modes.
> bigr.univariateStats(airlineMatrix)
                Month   DayofMonth    DayOfWeek    CRSDepTime      Distance      ArrDelay Delay
Min.      1.000000000  1.000000000  1.000000000  0.000000e+00  0.000000e+00 -6.800000e+01    NA
Max.     12.000000000 31.000000000  7.000000000  2.359000e+03  4.983000e+03  1.016000e+03    NA
Range    11.000000000 30.000000000  6.000000000  2.359000e+03  4.983000e+03  1.084000e+03    NA
Mean      6.556821182 15.682723814  3.947690038  1.334173e+03  7.009630e+02  6.957357e+00    NA
Var      11.855746085 77.360205146  3.948726893  2.279813e+05  3.040063e+05  9.433493e+02    NA
SD        3.443217403  8.795465033  1.987140381  4.774738e+02  5.513676e+02  3.071399e+01    NA
SEM       0.009594523  0.024508557  0.005537165  1.330480e+00  1.536385e+00  8.558452e-02    NA
CoV       0.525135170  0.560837845  0.503367884  3.578800e-01  7.865859e-01  4.414606e+00    NA
Skewness -0.020943817  0.014407088  0.045535081 -3.318663e-02  1.651622e+00  5.628932e+00    NA
Kurtosis -1.204467468 -1.189566184 -1.221353204 -8.037885e-01  3.448230e+00  7.194442e+01    NA
SES       0.006825422  0.006825422  0.006825422  6.825422e-03  6.825422e-03  6.825422e-03    NA
SEK       0.013650738  0.013650738  0.013650738  1.365074e-02  1.365074e-02  1.365074e-02    NA
Median    7.000000000 16.000000000  4.000000000  1.326000e+03  5.450000e+02  0.000000e+00    NA
IQM       6.575758987 15.646758289  3.903672645  1.330305e+03  5.625964e+02  4.798043e-01    NA
# cat.             NA           NA           NA            NA            NA            NA     3
# modes            NA           NA           NA            NA            NA            NA     1

# Compute Pearson's correlation between each predictor and the response
# variable, which in this example is ArrDelay:
> bigr.bivariateStats(airlineMatrix, cols1=c("Month", "DayofMonth", "DayOfWeek",
"CRSDepTime", "Distance"), cols2=c("ArrDelay"))
           X        Y          Cor
1      Month ArrDelay -0.008673369
2 DayofMonth ArrDelay  0.005967325
3  DayOfWeek ArrDelay  0.004634345
4 CRSDepTime ArrDelay  0.105184454
5   Distance ArrDelay  0.009198216
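
Several of the reported statistics are tied together by textbook formulas, which gives a quick way to sanity-check the output. The base R sketch below uses the Month values copied from the bigr.univariateStats output and assumes the usual definitions (Range = Max - Min, Var = SD^2, CoV = SD / Mean, SEM = SD / sqrt(n)); whether Big R computes them in exactly this way is an assumption. The second half computes Pearson's correlation from its definition on a small made-up sample that is not part of the airline data.

```r
# Values for the Month column, copied from the bigr.univariateStats output.
month <- c(min = 1, max = 12, range = 11, mean = 6.556821182,
           var = 11.855746085, sd = 3.443217403, sem = 0.009594523,
           cov = 0.525135170)

# Textbook relationships between the reported statistics.
range_check <- month[["max"]] - month[["min"]]   # Range = Max - Min
var_check   <- month[["sd"]]^2                   # Var   = SD^2
cov_check   <- month[["sd"]] / month[["mean"]]   # CoV   = SD / Mean

# SEM = SD / sqrt(n), so the row count can be estimated as (SD / SEM)^2.
n_estimate <- round((month[["sd"]] / month[["sem"]])^2)

print(c(range = range_check, var = var_check, cov = cov_check, n = n_estimate))

# Pearson's correlation on a tiny made-up sample (not the airline data),
# computed from its definition and compared with base R's cor().
x <- c(600, 900, 1200, 1500, 2100)
y <- c(-4, 2, 0, 9, 15)
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
stopifnot(isTRUE(all.equal(r_manual, cor(x, y))))
```

The recomputed range, variance, and coefficient of variation agree with the table to the printed precision, and n_estimate gives a rough idea of how many rows the sample contains.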

Create training and testing sets

# Split the data into 70% for training and 30% for testing.
> samples <- bigr.sample(airlineMatrix, perc=c(0.7, 0.3))
> train <- samples[[1]]
> test <- samples[[2]]

# Check that the training and testing sets are split correctly.
> nrow(train) / nrow(airlineMatrix)
[1] 0.6994487
> nrow(test) / nrow(airlineMatrix)
[1] 0.3005513
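
The observed proportions are close to, but not exactly, 0.7 and 0.3. One sampling scheme that behaves this way is independent per-row assignment, sketched below in base R. This is an illustration under that assumption, not a description of how bigr.sample works internally; the data frame, the seed, and the variable names are made up for the example.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Made-up stand-in for the airline matrix; only the row count matters here.
n <- 10000
data <- data.frame(id = seq_len(n), ArrDelay = rnorm(n, mean = 7, sd = 30))

# Assign each row independently to group 1 (train) with probability 0.7
# or group 2 (test) with probability 0.3.
group <- sample(1:2, size = n, replace = TRUE, prob = c(0.7, 0.3))
train <- data[group == 1, ]
test  <- data[group == 2, ]

# The proportions are close to, but usually not exactly, 0.7 and 0.3.
print(c(train = nrow(train) / n, test = nrow(test) / n))
```

Every row lands in exactly one of the two sets, so the two proportions always sum to 1, as they do in the tutorial output.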

Create a linear regression model

# Build a linear regression model on the training set for the arrival delay using
# all other columns in the training set as predictors. The model will be stored
# on the specified HDFS location.
> lm <- bigr.lm(ArrDelay ~ ., data=train, directory="/user/lm.airline")

# Display the coefficients of the linear regression model for each predictor
# column.
> coef(lm)
(Intercept)     Month DayofMonth DayOfWeek   CRSDepTime     Distance    Delay
         NA -0.895564 -0.2924151 -1.442185 -0.006449712 -0.004677833 23.05393

# Evaluate the model against the testing set and store the output on HDFS/GPFS.
> pred <- predict(lm, test, "/user/lm.airline.preds")

# Display the results of the evaluation including predictions and statistics that
# assess the quality of the model.
> pred
$preds
preds
1 26.386420
2 -5.413295
3 49.194540
4 -1.665189
5 32.698154
6 26.066165
7 46.975960
8 -3.830847
9 -7.096804
10 -11.654695
... showing first 10 rows only.

$statistics
Name Y-column Scaled Value
1 LOGLHOOD_Z NA FALSE NaN
2 LOGLHOOD_Z_PVAL NA FALSE NaN
3 PEARSON_X2 NA FALSE 2.159397e+07
4 PEARSON_X2_BY_DF NA FALSE 5.581278e+02
5 PEARSON_X2_PVAL NA FALSE 0.000000e+00
6 DEVIANCE_G2 NA FALSE 2.159397e+07
7 DEVIANCE_G2_BY_DF NA FALSE 5.581278e+02
8 DEVIANCE_G2_PVAL NA FALSE 0.000000e+00
9 LOGLHOOD_Z NA TRUE NaN
10 LOGLHOOD_Z_PVAL NA TRUE NaN
11 PEARSON_X2 NA TRUE 2.159397e+07
12 PEARSON_X2_BY_DF NA TRUE 5.581278e+02
13 PEARSON_X2_PVAL NA TRUE 0.000000e+00
14 DEVIANCE_G2 NA TRUE 2.159397e+07
15 DEVIANCE_G2_BY_DF NA TRUE 5.581278e+02
16 DEVIANCE_G2_PVAL NA TRUE 0.000000e+00
17 AVG_TOT_Y 1 NA 7.163350e+00
18 STDEV_TOT_Y 1 NA 3.088156e+01
19 AVG_RES_Y 1 NA -1.211733e+00
20 STDEV_RES_Y 1 NA 2.359393e+01
21 PRED_STDEV_RES 1 TRUE 1.000000e+00
22 PLAIN_R2 1 NA 4.148338e-01
23 ADJUSTED_R2 1 NA 4.147582e-01
24 PLAIN_R2_NOBIAS 1 NA 4.163735e-01
25 ADJUSTED_R2_NOBIAS 1 NA 4.162830e-01
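
To relate the R-squared rows to the other statistics, the base R sketch below computes 1 - (STDEV_RES_Y / STDEV_TOT_Y)^2 from the two values in the table above. The result lands close to the two NOBIAS R-squared values; reading those rows as "variance explained once the mean residual is ignored" is an interpretation, not something the tutorial states. The second half shows the standard definition, R-squared = 1 - SS_res / SS_tot, with base R's lm() on made-up data that is unrelated to the airline set.

```r
# Statistics copied from the $statistics output above.
stdev_tot_y <- 30.88156   # STDEV_TOT_Y: spread of the observed ArrDelay
stdev_res_y <- 23.59393   # STDEV_RES_Y: spread of the residuals

# Share of variance explained once the mean residual is ignored.
r2_from_stdev <- 1 - (stdev_res_y / stdev_tot_y)^2
print(r2_from_stdev)

# The same idea with base R's lm() on made-up data (not the airline set).
set.seed(1)
x <- runif(200, min = 0, max = 2400)
y <- 0.01 * x + rnorm(200, sd = 10)
fit <- stats::lm(y ~ x)

ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# Matches the R-squared that summary() reports for the fitted model.
stopifnot(isTRUE(all.equal(r2_manual, summary(fit)$r.squared)))
```
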

Create a support vector machine classifier

# Build an SVM model.
> svmModel <- bigr.svm(formula=Delay ~ ., data=train,
directory="/user/svm.airline")

# Display the coefficients of the model.
> coef(svmModel)
Low Medium High
Month 9.545485e-04 -3.040180e-06 -0.0013980276
DayofMonth 1.722937e-03 -7.107980e-06 -0.0028906842
DayOfWeek 4.674919e-04 -1.848519e-06 -0.0007403553
CRSDepTime 2.010074e-04 -3.315430e-04 -0.0004699866
Distance 2.923583e-05 -1.521188e-04 -0.0001873042
ArrDelay -3.635988e-02 9.555523e-07 0.0355064272

# Evaluate the model against the testing set and store the output on HDFS/GPFS.
> predSVM <- predict(svmModel, test, "/user/svm.preds.airline", returnScores=TRUE)

# Display the results of the evaluation, including overall accuracy,
# confusion matrix, and the scores for each example and class.
> predSVM
$accuracy
[1] 79.1826

$ctable
         Low Medium High
Low    23826   6615  558
Medium     0      0    0
High       4    881 6824

$scores
Low Medium High
1 -0.007668927 -0.3597726 -0.3548827
2 0.622849678 -0.6164869 -1.1750341
3 -1.626988731 -0.5585051 0.9624977
4 1.398264835 -0.5139710 -1.7943678
5 -4.645066794 -0.9445551 3.3536020
6 -0.239887320 -0.3716973 -0.1592355
7 -3.436321972 -0.6938454 2.5247219
8 0.667641212 -0.4536836 -1.0925997
9 0.975276859 -0.9322711 -1.7988017
10 0.994386065 -0.6001759 -1.5456743
... showing first 10 rows only.
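
The reported accuracy can be reproduced from the confusion matrix, because correctly classified cases lie on its diagonal. The base R sketch below copies $ctable and the first three rows of $scores from the output above. Picking the class with the largest score as the prediction is an assumption about how the scores are used; the tutorial does not state it.

```r
# Confusion matrix copied from $ctable (row and column meaning as printed).
labels <- c("Low", "Medium", "High")
ctable <- matrix(c(23826, 6615,  558,
                       0,    0,    0,
                       4,  881, 6824),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(labels, labels))

# Accuracy in percent: correctly classified cases lie on the diagonal.
accuracy <- 100 * sum(diag(ctable)) / sum(ctable)
print(round(accuracy, 4))

# First three score rows copied from $scores; the predicted class is
# assumed to be the one with the largest score in each row.
scores <- matrix(c(-0.007668927, -0.3597726, -0.3548827,
                    0.622849678, -0.6164869, -1.1750341,
                   -1.626988731, -0.5585051,  0.9624977),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, labels))
predicted <- labels[apply(scores, 1, which.max)]
print(predicted)
```

The diagonal holds 30,650 of the 38,708 test rows, which rounds to the 79.1826 shown in $accuracy. The all-zero Medium row is also worth noting when judging the classifier, since overall accuracy alone hides it.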
