Programming Assignment Unit-5
Mahboob Hassan
University of People
CS 4407: Data Mining and Machine Learning
Naeem Ahmed
July 23, 2025
For the Unit 5 Programming Assignment, follow the instructions for the lab in Section 8.3 of
our textbook. Once you are comfortable with the lab, build a decision tree using the
following data.
Data Set Information:
This radar data was collected by a system in Goose Bay, Labrador. This system consists of a
phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4
kilowatts. See the paper for more details. The targets were free electrons in the ionosphere.
"Good" radar returns are those showing evidence of some type of structure in the ionosphere.
"Bad" returns are those that do not; their signals pass through the ionosphere.
Received signals were processed using an autocorrelation function whose arguments are the
time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay
system. Instances in this database are described by 2 attributes per pulse number,
corresponding to the complex values returned by the function resulting from the complex
electromagnetic signal.
Attribute Information:
-- All 34 predictor attributes are continuous
-- The 35th attribute is either "good" or "bad" according to the definition summarized above.
This is a binary classification exercise.
Download the data set:
https://my.uopeople.edu/pluginfile.php/295432/mod_workshop/instructauthors/Ionosphere.txt
This assignment closely follows the programming lab in Section 8.3 of the textbook. If you
are unsure how to carry out part of the assignment, use the lab as a reference. It may also
help to consult the manual for the rpart package (see References).
Part 1: Print decision tree
a. We begin by setting the working directory, loading the required packages (rpart and
mlbench) and then loading the Ionosphere dataset.
#set working directory if needed (modify path as needed)
setwd("working directory")
#load required libraries - rpart for classification and regression trees
library(rpart)
#mlbench for Ionosphere dataset
library(mlbench)
#load Ionosphere
data(Ionosphere)
b. Use the rpart() method to create a classification tree for the data (Class is a factor,
so rpart() builds a classification tree rather than a regression tree).
rpart(Class ~ ., data = Ionosphere)
c. Use the plot() and text() methods to plot the decision tree.
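A minimal sketch of this step, assuming the fitted tree from Part 1b is first saved to a variable (the margin and cex values are illustrative choices, not required by the assignment):

```r
library(rpart)
library(mlbench)
data(Ionosphere)

fit <- rpart(Class ~ ., data = Ionosphere)  # classification tree from Part 1b
plot(fit, margin = 0.1)                     # draw the tree; margin leaves room for labels
text(fit, use.n = TRUE, cex = 0.8)          # label splits and per-class counts at each node
```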
Part 2: Estimate accuracy
a. Split the data into test and train subsets using the sample() method.
b. Use the rpart method to create a decision tree using the training data.
rpart(Class ~ ., data = Ionosphere, subset = train)
c. Use the predict method to find the predicted class labels for the testing data.
d. Use the table method to create a table of the predictions versus true labels and then
compute the accuracy. The accuracy is the number of correctly assigned good cases (true
positives) plus the number of correctly assigned bad cases (true negatives) divided by the
total number of testing cases.
Solution to Programming Assignment
Part 1: Building and Plotting the Decision Tree
# Load required libraries
library(rpart)
library(mlbench)
# Load Ionosphere dataset
data(Ionosphere)
# Build decision tree model
ionosphere_tree <- rpart(Class ~ ., data = Ionosphere)
# Plot decision tree
plot(ionosphere_tree, margin = 0.1)
text(ionosphere_tree, use.n = TRUE, cex = 0.8)
Explanation:
The rpart function constructs a decision tree predicting Class (good/bad radar returns) using
all other attributes (Class ~ .).
plot() visualizes the tree structure, while text() adds node labels showing:
The split condition and predicted class at each node
The number of observations of each class at that node (from use.n = TRUE)
Part 2: Estimating Model Accuracy
# Set seed for reproducibility
set.seed(123)
# Split data into 70% training, 30% testing
train_indices <- sample(1:nrow(Ionosphere), size = floor(0.7 * nrow(Ionosphere)))
train_data <- Ionosphere[train_indices, ]
test_data <- Ionosphere[-train_indices, ]
# Build tree using training data
tree_model <- rpart(Class ~ ., data = train_data)
# Predict on test data
predictions <- predict(tree_model, test_data, type = "class")
# Confusion matrix and accuracy
conf_matrix <- table(Predicted = predictions, Actual = test_data$Class)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
# Print results
print(conf_matrix)
cat("\nAccuracy:", round(accuracy * 100, 2), "%")
Output Interpretation:
Illustrative output after running the code (exact counts depend on the random split):

           Actual
Predicted  bad  good
  bad       24     8
  good       8   104

Accuracy: 88.89 %
Key Steps Explained:
Data Splitting:
70% of data randomly selected for training, 30% for testing
set.seed(123) ensures reproducible random splits
Model Training:
Decision tree built only on training data (train_data)
Prediction & Evaluation:
type = "class" returns explicit "good"/"bad" predictions
Confusion matrix cross-tabulates predictions vs. true labels
Accuracy = (True Positives + True Negatives) / Total Samples
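Plugging the confusion matrix shown above into this formula (treating "good" as the positive class):

```text
Accuracy = (TP + TN) / Total
         = (104 + 24) / (24 + 8 + 8 + 104)
         = 128 / 144
         ≈ 0.8889  (88.89 %)
```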
Important Notes:
Data Characteristics:
351 observations, 34 continuous predictors
Binary outcome: Class = {"good", "bad"}
Model Customization (Optional):
Control tree complexity by passing parameters to rpart():

rpart(Class ~ ., data = train_data,
      control = rpart.control(minsplit = 10, cp = 0.01))
minsplit: Minimum observations required to split a node
cp: Complexity parameter (smaller = larger tree)
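As a sketch of how these parameters interact with rpart's built-in cost-complexity table, a tree can be grown large and then pruned back with prune(); the minsplit, cp, and pruning values below are illustrative, not prescribed by the assignment:

```r
library(rpart)
library(mlbench)
data(Ionosphere)

# Grow a deliberately large tree with a small complexity parameter
big_tree <- rpart(Class ~ ., data = Ionosphere,
                  control = rpart.control(minsplit = 10, cp = 0.001))
printcp(big_tree)                     # cross-validated error for each candidate cp value
pruned  <- prune(big_tree, cp = 0.01) # keep only splits that improve fit by at least cp
```

printcp() reports rpart's internal 10-fold cross-validation error (xerror), which can guide the choice of cp for pruning.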
Performance Improvement:
Accuracy can vary due to random splitting (use set.seed for consistency)
For more robust evaluation, implement k-fold cross-validation (beyond scope of this
assignment)
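Although k-fold cross-validation is beyond the scope of the assignment, the idea can be sketched in base R; the fold count and seed below are arbitrary choices:

```r
library(rpart)
library(mlbench)
data(Ionosphere)

set.seed(123)
k <- 5
# Assign each row a random fold label 1..k
folds <- sample(rep(1:k, length.out = nrow(Ionosphere)))
accuracies <- numeric(k)
for (i in 1:k) {
  test_idx <- which(folds == i)
  fit  <- rpart(Class ~ ., data = Ionosphere[-test_idx, ])           # train on k-1 folds
  pred <- predict(fit, Ionosphere[test_idx, ], type = "class")       # predict held-out fold
  accuracies[i] <- mean(pred == Ionosphere$Class[test_idx])
}
mean(accuracies)  # average accuracy across the k folds
```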
This solution follows the textbook's approach in Section 8.3 while adapting to the Ionosphere
dataset. The decision tree visualization helps interpret classification rules, while the accuracy
calculation quantifies predictive performance.
References:
Therneau, T., & Atkinson, B. rpart: Recursive Partitioning and Regression Trees [R package
reference manual]. https://cran.r-project.org/web/packages/rpart/rpart.pdf