Esami - R UNIPD

The document outlines exam questions for a Machine Learning for Bioengineering course, covering various datasets related to medical conditions such as breast cancer, liver disease, cardiac arrhythmia, and more. Each section includes tasks like data analysis, model evaluation, and performance comparison using techniques like k-NN, PCA, classification trees, and random forests. Students are required to justify their answers, provide code and figures, and submit their work via email.

MACHINE LEARNING FOR BIOENGINEERING

Exam, February 10, 2023: PART 2 (120 min)

In moodle you will find a dataset regarding breast cancer and denoted “cancer.dat”. The dataset consists of
geometrical characteristics of cell nuclei obtained from digitized images of breast mass of benign (B) and malignant
(M) tumors, as given in the response variable class. There are no missing data. The overall aim is to classify
tissue samples into one of the two classes based on their geometrical features.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures. Send code and figures by email to [email protected].

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (13 pts)

- What is the accuracy of the “chance” (also known as “naive” or “majority”) predictor for class?
- Split the data into training and test sets (use set.seed in your code for reproducibility).
- Perform a k-Nearest Neighbor (k-NN) analysis for k = 3 to predict class in the test set from the training
set. Compare your result to the “chance” predictor.
- Tune the k-NN classifier with respect to the test error. Comment on your result.
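As a reference point for the first item above, the majority predictor can be sketched in a few lines. This is only an illustration in Python (the exam itself expects R); the function name `majority_baseline` and the toy labels are hypothetical and are not taken from cancer.dat.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, accuracy) for always predicting the most frequent class."""
    counts = Counter(labels)
    majority_class, n_majority = counts.most_common(1)[0]
    return majority_class, n_majority / len(labels)

# Hypothetical toy labels, NOT the exam data
toy_labels = ["B", "B", "M", "B", "M", "B", "B", "M"]
majority_class, baseline_acc = majority_baseline(toy_labels)
print(majority_class, baseline_acc)
```

Any classifier, such as the k-NN model of the later items, is only useful if its test accuracy clearly exceeds this baseline.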

2. (10 pts)

- Perform a Principal Components Analysis (PCA) on the numerical part of the training data, i.e., assuming
that the true type of the class is unknown. How many Principal Components (PCs) would you use?
- Which of the original features mainly determine the first PC?

3. (12 pts)

- Perform a classification tree analysis of the dataset.
- Which are the most important features according to this analysis? Compare to the results from PCA.
- Does the performance improve if you use the first few principal components for the tree analysis?

4. (15 pts)

- Perform hierarchical cluster analysis with “complete-link clustering”, assuming that the true class is
unknown, and visualize the result.
- Inspecting the resulting dendrogram, how many clusters would you choose? Why?
- Find the suggested number of clusters according to the silhouette method (do not use the function
fviz_nbclust but write your own explicit code with a for loop).
- Do the clusters obtained in the two previous questions correspond to the class variable? Comment on
the results.
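The explicit silhouette loop asked for above could look roughly like the following sketch. It is written in Python rather than R, uses a naive complete-linkage implementation, and runs on hypothetical toy points; the helper names `complete_link` and `mean_silhouette` are assumptions made for illustration, not functions of any library.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_link(points, k):
    """Naive agglomerative clustering with complete linkage, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: distance between the two farthest members
                d = max(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for m in members:
            labels[m] = c
    return labels

def mean_silhouette(points, labels):
    """Average silhouette width; assumes at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: three well-separated groups of three points
toy = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
       (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
       (10.0, 0.0), (10.2, 0.1), (9.9, 0.3)]
sil = {}
for k in range(2, 6):  # the explicit for loop over candidate cluster numbers
    sil[k] = mean_silhouette(toy, complete_link(toy, k))
best_k = max(sil, key=sil.get)
print(sil, best_k)
```

The suggested number of clusters is the k with the largest average silhouette width. On real data the features would normally be standardized before computing distances.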

5. Bonus question – do only in case you have finished all the other ones!
Perform a support vector machine analysis predicting class and evaluate the final model on the test set.

MACHINE LEARNING FOR BIOENGINEERING

Exam, June 15, 2022: PART 2

In moodle you will find a dataset regarding the liver disease hepatitis and denoted “hepatitis.dat”. The dataset
consists of 155 patients with a response variable denoted “class” that indicates whether the individual died of the
disease, and 19 features. The aim is to predict “class” from the other features. There are missing data.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures. Send code and figures by email to [email protected].

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (5 pts)
What is the accuracy of the naive (“chance” or “majority”) predictor?

2. (10 pts)
Perform a k-medoids analysis with two clusters using the 19 features on the complete data, i.e., after omitting
patients with NAs. Do the clusters correspond to the “class” variable? Comment on the result.

3. (10 pts)

- Perform a tree analysis on the complete data, i.e., after omitting patients with NAs.
- Evaluate the (training) accuracy of the obtained tree on the same dataset using a confusion matrix.
- Based on your tree analysis, do the features with many missing datapoints seem to be important for
classification?

4. (10 pts)

- Split the entire dataset (with NAs) into training and test sets (use set.seed in your code for
reproducibility).
- Perform a tree analysis using surrogate splits on the training set.
- Choose a patient in your test set with NA in the variable of the primary split. Explain in detail how this
individual is classified by the tree.
- Evaluate the final tree on the test set and comment on the results.

5. (15 pts)

- Perform a gradient boosting analysis (with trees) on the training and test sets generated for the previous
question (gbm uses surrogate splits).
- Evaluate the classifier with a ROC curve and comment on the result.
- Which threshold would you choose if you aim to predict correctly at least 75% of the cases that die?
- Which threshold would you choose to maximize the prediction accuracy?
- Which are the two most important features according to this analysis?
- Describe qualitatively how these two most important features influence the probability of dying according
to your analysis.
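The threshold questions above amount to scanning candidate cut-offs on the predicted probabilities. A minimal sketch in Python, assuming labels coded as 1 for the positive class (here, death) and hypothetical toy scores that are not output of any fitted gbm model:

```python
def threshold_for_sensitivity(scores, labels, target):
    """Largest threshold t such that classifying 'positive' when score >= t
    reaches a sensitivity (true positive rate) of at least `target`.
    Assumes labels are 0/1 with at least one positive case."""
    n_pos = sum(labels)
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        if tp / n_pos >= target:
            return t
    return None  # only reached if the target cannot be met (target > 1)

# Hypothetical predicted probabilities and true labels
toy_scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
t = threshold_for_sensitivity(toy_scores, toy_labels, 0.75)
print(t)
```

Choosing the largest threshold that still meets the sensitivity target keeps the number of false positives as low as that constraint allows. The accuracy-maximizing threshold can be found by the same kind of scan, scoring each candidate by accuracy instead.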

6. Bonus question – do only in case you have finished all the other ones!
Does the prediction based on gradient boosting improve if you use multiple imputation instead of surrogate
splits?

MACHINE LEARNING FOR BIOENGINEERING

Exam, June 30, 2022: PART 2

In moodle you will find a dataset regarding cardiac arrhythmia and denoted “arrhythmia.dat”. The dataset
consists of 452 patients with a response variable denoted “class” that indicates whether the individual had
arrhythmia according to the doctors, and 199 features (4 basic features: age, sex, height, weight; the feature
heart.rate; the remaining 194 features describe various features obtained from an ECG). There are missing data.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures. Send code and figures by email to [email protected].

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (10 pts)

- What is the accuracy of the “chance” (also known as “naive” or “majority”) predictor for class?
- Evaluate Naive Bayes prediction using a training set and a test set. Compare to chance classification and
comment on the results.

2. (10 pts)

- Perform a linear regression analysis with heart.rate as a function of the 4 basic features (see above).
Find the optimal model according to AIC and BIC.
- Repeat the analysis of the previous point considering only adults (age 18 or older).
- Comment on your results, possibly using plots of heart.rate versus the basic features to help your
reasoning.

3. (15 pts)

- Perform a principal component analysis (PCA) using the numerical features that do not have missing
data. What would be a “good” number of principal components to consider?
- Perform a k-nearest neighbor analysis using all the numerical features without NAs to predict the class.
Evaluate with LOOCV.
- Perform a k-nearest neighbor analysis using the first few principal components to predict the class.
Evaluate with LOOCV. Comment on your results.
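Leave-one-out cross-validation for k-NN, as requested in the two items above, can be sketched as follows. The example is in Python with hypothetical toy data; `knn_predict` and `loocv_accuracy` are illustrative names, and real features would usually be standardized first.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loocv_accuracy(xs, ys, k):
    """Predict each observation from all the others and return the hit rate."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]  # leave observation i out
        rest_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, xs[i], k) == ys[i]
    return hits / len(xs)

# Hypothetical toy data: two well-separated classes
toy_x = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]
for k in (1, 3, 5):
    print(k, loocv_accuracy(toy_x, toy_y, k))
```

The same function can be applied to the matrix of the first few principal component scores instead of the original features, which allows a direct comparison of the two LOOCV accuracies.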

4. (15 pts)

- Split the dataset (excluding features with NAs) into training and test sets (use set.seed in your code for
reproducibility).
- Perform a support vector machine analysis predicting class and evaluate the final model on the test set.
Comment on the results regarding accuracy, sensitivity and specificity.
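Accuracy, sensitivity and specificity all follow from the four cells of the confusion matrix. A short Python sketch with hypothetical predictions (not produced by any SVM fit) shows the arithmetic; the function name `binary_metrics` is an assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts and derived metrics for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return {
        "tp": tp, "fn": fn, "tn": tn, "fp": fp,
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true positives detected
        "specificity": tn / (tn + fp),  # share of true negatives detected
    }

# Hypothetical true classes and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
m = binary_metrics(y_true, y_pred)
print(m)
```

With unbalanced classes, a high accuracy can hide a poor sensitivity or specificity, which is why the three numbers are reported together.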

5. Bonus question – do only in case you have finished all the other ones!
Does the prediction based on SVM improve if you use multiple imputation to handle the features with
missing data (rather than excluding them)?

MACHINE LEARNING FOR BIOENGINEERING

Exam, August 30, 2023: PART 2

In moodle you will find a dataset regarding breast cancer, denoted “nki5y.dat”.
The dataset consists of 280 cases with a response variable denoted “death” that indicates if the patients survived
beyond 5 years after treatment was initiated (0: survived >5 years; 1: died before 5 years) and 12 features.
“patnr” is a patient ID number.

The aim is to analyze how the 5-year mortality (“death”) depends on the other features and, conversely, to use
these features to predict “death”. There are no missing data.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on the
computer to be sent with code and figures.

>>> Send code and figures by email to [email protected]. <<<

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (5 pts)
What is the accuracy of the naive (“chance” or “majority”) predictor? What does it predict?

2. (20 pts)

- Split the dataset into training and test sets (use set.seed in your code for reproducibility).
- Perform a thorough logistic regression analysis using the training data.
- Evaluate the test accuracy of the final logistic model on the test set using a confusion matrix. Comment
on the results.
- Evaluate the classifier with a ROC curve and comment on the result. Which threshold would you choose
in order to classify at least 80% of positive cases correctly?
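The reproducible split in the first item above relies on fixing the random seed before sampling. The sketch below shows the same idea in Python, where a seeded `random.Random` generator plays the role of R's set.seed; the function name, the 30% test fraction and the seed value are arbitrary assumptions.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=1):
    """Shuffle row indices with a seeded generator and cut off a test set."""
    rng = random.Random(seed)  # fixed seed: the same split on every run
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test

# Hypothetical stand-in for the rows of a dataset
rows = list(range(10))
train, test = train_test_split(rows)
print(train, test)
```

Because the same training and test sets are reused in the later questions, fixing the seed once also keeps the comparison between the classifiers fair.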

3. (10 pts)

- Perform a thorough tree classification analysis using the training data generated for the previous question.
- Comment on the correspondence between important variables as found by the tree analysis and the relevant
variables found from logistic modelling.
- Evaluate the test accuracy of the obtained tree on the test set using a confusion matrix. Comment on the
results.

4. (20 pts)

- Perform a gradient boosting analysis (with trees) on the training set (generated for question 2), and
evaluate the accuracy on the test set using a confusion matrix.
- Evaluate the classifier with a ROC curve and comment on the result.
- Which are the two most important features according to this analysis?
- Describe qualitatively how these two most important features influence the 5-year mortality according to
your analysis, and compare to the results found in questions 2 and 3.

MACHINE LEARNING FOR BIOENGINEERING

Exam, September 19, 2022: PART 2 (120 min)

In moodle you will find a dataset regarding raisins and denoted “raisin.dat”. The dataset consists of geometrical
characteristics of 900 raisins of two types, “Besni” and “Kecimen”, given in the response variable “Class”. There
are no missing data. The overall aim is to classify raisins into one of the two classes based on their geometrical
features.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures. Send code and figures by email to [email protected].

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (10 pts)

- Perform a k-means analysis with two clusters, assuming that the true type of the raisins is unknown, and
visualize the result.
- Do the clusters correspond to the “Class” variable? Comment on the result.
- Find the optimal number of clusters.

2. (10 pts)

- Split the data into training and test sets (use set.seed in your code for reproducibility).
- Perform a logistic regression analysis on the training data, finding the optimal model according to AIC
and BIC.
- Evaluate predictions of the optimal model on the test set, and comment on your results.
- Show a ROC curve and calculate the AUC for the optimal model and comment on the results.
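The AUC mentioned in the last item equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. A small Python sketch of this pairwise definition, using hypothetical scores that do not come from any fitted logistic model:

```python
def auc_pairwise(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.
    Labels are 0/1; ties in the scores count as half a correct pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities and true labels
toy_scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
auc = auc_pairwise(toy_scores, toy_labels)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, which gives a scale for commenting on the result.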

3. (15 pts)

- Perform a Principal Components Analysis (PCA) on the numerical part of the training data, i.e., assuming
that the true type of the raisins is unknown. How many Principal Components (PCs) would you use?
- Which of the original features mainly determine the first PC?
- Train a logistic regression model on the training data to predict the Class from the first two PCs. Evaluate
predictions of the model on the test set, and comment on your results.

4. (15 pts)

- Perform a Random Forest (RF) analysis of the dataset.
- Which are the most important parameters according to this analysis? Compare to the results from logistic
regression and PCA.
- Does the performance improve if you use the first few principal components for the RF analysis?

5. Bonus question – do only in case you have finished all the other ones!
Perform a support vector machine analysis predicting class and evaluate the final model on the test set.

MACHINE LEARNING FOR BIOENGINEERING

Exam, June 27, 2023: PART 2

In moodle you will find a dataset regarding fetal health, denoted “fetalhealth.dat”. The dataset consists of
2126 cases with a response variable denoted “fetal health” that indicates the health status of the foetus (1: Normal,
2: Suspect, 3: Pathological), and 21 features mainly related to cardiotocograms (CTGs). The aim is to
predict “fetal health” from the other features. There are no missing data.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures.

>>> Send code and figures by email to [email protected]. <<<

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (5 pts)
What is the accuracy of the naive (“chance” or “majority”) predictor? What does it predict?

2. (10 pts)

- Perform a k-means analysis with three clusters using the 21 features. Do the clusters correspond to the
“fetal health” variable? Comment on the result.
- Find the optimal number of clusters for k-means using silhouettes. Perform the corresponding clustering
and comment on the results.

3. (10 pts)

- Split the dataset into training and test sets (use set.seed in your code for reproducibility).
- Perform a thorough tree analysis using the training data.
- Evaluate the test accuracy of the obtained tree on the test dataset using a confusion matrix. Comment
on the results.

4. (25 pts)

- Perform a thorough random forest analysis using the training data and evaluate the test accuracy using a
confusion matrix. Comment on the results.
- Perform a thorough random forest analysis aiming to predict “Suspect or Pathological” vs. “Normal”,
using the training data (i.e., unite classes 2 and 3).
- Which are the two most important features according to this analysis?
- Describe qualitatively how these two most important features influence the probability of the case being
“Suspect or Pathological”.
- Evaluate the classifier with a ROC curve and comment on the result.
- Which threshold would you choose if you aim to predict correctly at least 95% of the cases that are
“Suspect or Pathological”?

MACHINE LEARNING FOR BIOENGINEERING

Exam, July 12, 2023: PART 2

In moodle you will find a dataset regarding migraine, denoted “migraine.dat”. The dataset consists of 400
cases with a response variable denoted “Type” that indicates the type of migraine that each of the subjects suffers
from. There are 7 different types of migraine in the data set. The aim is to predict “Type” from the other 18
features. There are no missing data.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures.

>>> Send code and figures by email to [email protected]. <<<

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (5 pts)

- What is the accuracy of the naive (“chance” or “majority”) predictor? What does it predict?

2. (10 pts)

- Perform a hierarchical clustering analysis using the 18 features.
- Do the clusters that you have found correspond to the “Type” variable? Comment on the result.
- Perform a principal components analysis (PCA). Does this help you to see clusters? Compare to the
results for hierarchical clustering and comment.

3. (10 pts)

- Split the dataset into training and test sets (use set.seed in your code for reproducibility).
- Perform a thorough k nearest neighbor (k-NN) analysis using the training data.
- Evaluate the test accuracy of the obtained model on the test dataset using a confusion matrix. Comment
on the results.

4. (25 pts)

(a) Perform a thorough support vector machine (SVM) analysis using the training data and evaluate the
test accuracy using a confusion matrix. Comment on the results.
(b) Perform a thorough SVM analysis aiming to predict “Type = 6” vs. “Type ≠ 6”, using the training
data (i.e., unite classes 1-5 and 7). Compare to the SVM analysis performed in the previous question.
(c) Investigate (e.g., graphically) and describe qualitatively how the features Visual and Tinnitus are
involved in predicting whether a subject has a Type 6 migraine or not, according to this analysis. It may
be useful to set all other variables to their average value. Do you see any evidence for an interaction
between these two features (Visual and Tinnitus) in the SVM model? Investigate also how (some of)
the other variables influence the predictions of the model.
(d) Repeat the previous question for the analysis done for question (a).

MACHINE LEARNING FOR BIOENGINEERING

Exam, January 31, 2024: PART 2

In moodle you will find a dataset regarding cardiovascular disease, denoted “cvd.dat”. The dataset consists of
1316 cases with a response variable denoted “class” that indicates whether the patient had a heart attack (0: no,
1: yes) and 8 features. The aim is to predict “class” from the other features. There are no missing data.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures.

>>> Send code and figures by email to [email protected]. <<<

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (5 pts)
What is the accuracy of the naive (“chance” or “majority”) predictor? What does it predict?

2. (10 pts)

- Split the dataset into training and test sets (use set.seed in your code for reproducibility).
- Perform a thorough k-nearest neighbor analysis using the training data.
- Evaluate the test accuracy of the obtained model on the test set using a confusion matrix. Comment on
the results.

3. (15 pts)

- Perform a thorough logistic regression analysis using the training data.
- Describe qualitatively how the most important features influence the risk of having a heart attack
(“if feature X increases, then the risk increases/decreases”).
- Evaluate the test accuracy of the final logistic model on the test set using a confusion matrix. Comment
on the results.
- Evaluate the classifier with a ROC curve and comment on the result.

4. (13 pts)

- Perform a thorough tree analysis using the training data.
- Describe qualitatively how the most important features influence the risk of having a heart attack. Does
it correspond to the logistic regression analysis?
- Evaluate the test accuracy of the obtained tree on the test dataset using a confusion matrix. Comment
on the results.

5. (12 pts)

- Which of the different classifiers studied above would you recommend? Why?
- Plot the two most important features against each other using different colors according to the
value of “class”. Does the plot correspond to your analyses? Why/why not?
- Based on the plot, do you think a linear support vector machine would be a good classifier for this problem?
Why/why not?

MACHINE LEARNING FOR BIOENGINEERING

Exam, June 19, 2024: PART 2

In moodle you will find a dataset regarding thyroid disease, denoted “thyroid.dat”. The dataset consists of
383 cases with a response variable denoted “Recurred” that indicates whether the disease recurred during a
certain period, and 15 features. The aim is to predict “Recurred” from the other features. There are no missing data.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures.

>>> Send code and figures by email to [email protected]. <<<

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (5 pts)

- Which class does the majority (also known as “chance” or “naive”) predictor predict for this dataset?
How accurate is it?

2. (20 pts)

- Split the dataset into training and test sets (use set.seed in your code for reproducibility).
- Perform a thorough random forest analysis using the training data and evaluate the test accuracy using a
confusion matrix. Comment on the results.
- Which are the most important features according to this analysis?
- Plot the predicted probabilities against each of the two most important features for the subjects in the training
set. Comment on the plots to describe qualitatively how these two features influence the probability of
recurrence of the disease according to the model.
- Evaluate the classifier with a ROC curve and comment on the result.

3. (15 pts)

- Perform a thorough GBM analysis using the training data and evaluate the test accuracy using a confusion
matrix. Comment on the results.
- Evaluate the classifier with a ROC curve and comment on the result.
- Which are the most important features according to this analysis? Compare to your Random Forest
analysis.
- Make plots that show how the most important features influence your GBM model, and discuss whether
the overall conclusions concerning the qualitative effects of these features agree with the Random Forest
analysis.

4. (15 pts)

- Construct a logistic regression model using the training data and evaluate the test accuracy using a
confusion matrix. Comment on the results.
- Evaluate the logistic regression model with a ROC curve and comment on the result.
- Compare your logistic regression model to your conclusions concerning the role of the most important
features in the Random Forest and GBM classifiers.

MACHINE LEARNING FOR BIOENGINEERING

Exam, July 17, 2024: PART 2

In moodle you will find a dataset regarding breast cancer and denoted “breastcancer.dat”. The dataset consists
of geometrical features obtained from images of benign (B) and malignant (M) breast tumors, as given in the
response variable class. An identification number of each instance is also given in the column ID. There are no
missing data. The overall aim is to classify the tumors into one of the two classes based on their geometrical features.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures.

>>> Send code and figures by email to [email protected]. <<<

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (5 pts)

- Which class does the majority (also known as “chance” or “naive”) predictor predict for this dataset?
How accurate is it?

2. (20 pts)

- Perform a Principal Components Analysis (PCA), assuming that the true type of the class is unknown.
How many Principal Components (PCs) would you use?
- Investigate and illustrate (also graphically) how the first two PCs depend on the original features, and
comment on your results.
- Perform a k-means analysis with two clusters, assuming that the true class is unknown, and visualize the
result in the plane spanned by the first two PCs.
- Find the suggested number of clusters according to the WSS and the silhouette methods, and perform
k-means analysis accordingly.
- Do the clusters obtained in the previous questions correspond to the class variable? Comment on the
results.

3. (15 pts)

- Split the dataset into training and test sets (use set.seed in your code for reproducibility).
- Perform a thorough support vector machine (SVM) analysis using the training data and evaluate the test
accuracy using a confusion matrix. Comment on the results.
- Illustrate your final model by creating a figure showing the prediction boundary in the plane spanned
by the features concave points and area, and with the other features set to their average values
(calculated over the training set). Comment on the figure.
- Repeat the previous point but now for the features fractal dimension and area.

4. (15 pts)

- Perform a thorough naive Bayes (NB) classification analysis using the training data and evaluate the test
accuracy using a confusion matrix. Comment on the results.
- Evaluate the classifier with a ROC curve and comment on the result.
- Discuss how the features concave points, area and fractal dimension distinguish between the two
classes according to the NB model. Compare to your PCA and SVM analyses.

MACHINE LEARNING FOR BIOENGINEERING

Exam, September 17, 2024: PART 2

In moodle you will find a dataset regarding ECG data and denoted “ecg.dat”. The dataset consists of 442
patients with a response variable denoted “class” that indicates whether the individual had arrhythmia according
to the doctors, and 197 features (5 “named” features: age, sex, height, weight, heart.rate; the remaining
192 features describe various features obtained from an ECG). There are no missing data.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures. Send code and figures by email to [email protected].

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (5 pts)

- Which class does the majority (also known as “chance” or “naive”) classifier predict for this dataset? How
accurate is it?

2. (12 pts)

- Perform hierarchical clustering, excluding class and sex from the analysis. Aim at getting two clusters of
comparable size.
- Evaluate the goodness of the clustering with silhouettes and comment on the result.
- Do the clusters obtained correspond to the class variable? Comment on the result.

3. (12 pts)

- Split the dataset into training and test sets (use set.seed in your code for reproducibility).
- Perform a thorough classification tree analysis predicting class from the other features using the training
data. Evaluate the obtained tree model on the test data with a confusion matrix. Comment on the
results, in particular if any “named” features (see above) appear in your tree.
- Comment on your results, possibly using plots of versus the basic features to help your reasoning.

4. (8 pts)

- Perform a thorough k-nearest neighbor analysis, excluding the feature sex, to predict the class. Evaluate
the classifier with LOOCV.

5. (18 pts)

- Perform a thorough random forest analysis predicting class from the other features using the training
data and evaluate on the test data with a confusion matrix. Comment on the results.
- Which are the most important features according to this analysis?
- To understand how heart.rate influences the model, predict the probability of arrhythmia using your
final random forest model on the training set, and plot these predicted probabilities against heart.rate
for the subjects in the training data. Comment on the result.
- Evaluate the classifier with a ROC curve and comment on the result.

MACHINE LEARNING FOR BIOENGINEERING

Exam, January 30, 2025: PART 2

In moodle you will find a dataset regarding heart disease, denoted “heart.dat”. The dataset consists of 1316
cases with a response variable denoted “cl” that indicates whether the patient had a heart attack and 8 features.
The aim is to predict “cl” from the other features. There are no missing data.

Answers must be justified and explained. You can write your answers on sheets of paper, or in a document on
the computer to be sent with code and figures.

>>> Send code and figures by email to [email protected]. <<<

WRITE YOUR NAME ON ALL SHEETS OF PAPER!

1. (5 pts)

- Which class does the majority (also known as “chance” or “naive”) predictor predict for this dataset?
How accurate is it?

2. (15 pts)

- Split the dataset into training and test sets (use set.seed in your code for reproducibility).
- Perform a thorough GBM analysis using the training data and evaluate the test accuracy using a confusion
matrix. Comment on the results.
- Evaluate the classifier with a ROC curve and comment on the result.
- Which are the most important features according to this analysis?
- Make plots that show how the two most important features influence your GBM model, both separately
and in combination, and comment on the figures.

3. (12 pts)

- Perform a thorough naive Bayes (NB) classification analysis using the training data and evaluate the test
accuracy using a confusion matrix. Comment on the results.
- Evaluate the classifier with a ROC curve and comment on the result.
- Discuss which of the features distinguish best between the two classes according to the NB model, e.g.,
using appropriate plots. Compare to your GBM analyses.

4. (18 pts)

- Perform a thorough support vector machine (SVM) analysis using the training data and evaluate the test
accuracy using a confusion matrix. Comment on the results.
- Illustrate your final model by creating a figure showing the prediction boundary in the plane spanned by
the features logtrop and logkcm, and with the other features set to their average values (calculated
over the training set).
- Repeat the previous point but now for the best linear SVM.
- Comment on these two SVM figures and compare to the results obtained for the GBM and NB analyses.
