ML Assignment
Abstract: This paper presents a comprehensive comparative analysis of ten classical machine learning algorithms for the multi-class classification of cerebrovascular lesions using non-contrast computed tomography (CT) images. The study employs a recently published 2025 dataset comprising approximately 10,000 CT slices from 733 patients labeled across ten lesion categories. The proposed framework leverages radiomics-based feature extraction techniques including Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gray Level Co-occurrence Matrix (GLCM) with Haralick texture statistics, and Gabor filters. Extracted features are standardized, reduced using Principal Component Analysis (PCA), and evaluated using 5×3 Repeated Stratified K-Fold cross-validation. Ten algorithms are benchmarked for performance, interpretability, and efficiency: Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Naive Bayes, Bayesian Network, Decision Tree, Random Forest, AdaBoost, XGBoost, and Multi-Layer Perceptron (MLP). SVM achieved the highest accuracy (97.79%) and macro-F1 score (0.9783). The framework demonstrates that radiomics-driven ML can yield high accuracy on medical imaging datasets while maintaining interpretability and CPU efficiency. The study provides a transparent and reproducible baseline suitable for academic research and educational use.

Index Terms: Cerebrovascular CT, Radiomics, Explainable Machine Learning, Multi-class Classification, Classical ML, Cross-validation

Code and notebook available at: https://[Link]/file/d/15VJ 53 DFHzSBM95UGkaVY06Gs3IPA8/view?usp=sharing

I. INTRODUCTION

Cerebrovascular diseases such as ischemic stroke, intracranial hemorrhage, and mixed vascular pathologies represent a significant global health burden, accounting for millions of deaths and disabilities each year. Rapid and accurate detection of these conditions is critical for effective medical intervention. Non-contrast computed tomography (CT) is the most widely used imaging modality for cerebrovascular diagnosis due to its speed, accessibility, and reliability in emergency settings. However, the manual interpretation of CT images is often subjective, time-consuming, and dependent on radiologist expertise, leading to variability in clinical decision-making. These challenges have motivated the development of automated diagnostic methods that can assist clinicians by providing objective and reproducible predictions.

Machine learning (ML) has emerged as a powerful tool in medical imaging, enabling systems to identify complex data patterns and perform classification tasks based on extracted features. In cerebrovascular imaging, ML-based systems can assist in differentiating between ischemic and hemorrhagic lesions, quantifying lesion severity, and predicting disease outcomes. Although deep learning models such as convolutional neural networks (CNNs) have shown strong performance in medical imaging, they are often data-intensive, computationally demanding, and less interpretable. In contrast, classical ML algorithms combined with radiomics feature extraction offer a transparent and resource-efficient approach, allowing for clinical interpretability without requiring high-end hardware or large-scale datasets.

Radiomics is a methodology that transforms medical images into a high-dimensional feature space through the extraction of quantitative descriptors. These features capture the shape, texture, and intensity variations within lesions, providing meaningful representations for classification tasks. Radiomic techniques such as Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gray Level Co-occurrence Matrix (GLCM), Haralick features, and Gabor filters have been successfully applied in cancer detection, lung disease characterization, and brain lesion segmentation. Yet, their systematic application in cerebrovascular CT classification,
particularly for multi-class problems, remains limited. Most studies focus on binary classification (e.g., stroke vs. control), leaving a research gap in multi-category diagnosis using handcrafted features and classical algorithms.

Addressing this gap, this study performs a comprehensive comparison of ten classical ML algorithms for multi-class cerebrovascular CT image classification: Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Naive Bayes, Bayesian Network, Decision Tree, Random Forest, AdaBoost, XGBoost, and Multi-Layer Perceptron (MLP). Instead of relying on deep neural networks, this framework leverages radiomics-derived features extracted from 2D CT slices. Feature preprocessing includes normalization and dimensionality reduction through Principal Component Analysis (PCA), ensuring robust model training while minimizing overfitting. To ensure fairness and reproducibility, model performance is evaluated using Repeated Stratified K-Fold Cross-Validation (5 folds × 3 repeats) and assessed via multiple metrics including accuracy, macro-F1, balanced accuracy, and standard deviation.

The outcomes demonstrate that classical ML algorithms, when applied to radiomics features, can achieve high diagnostic performance while remaining interpretable and computationally lightweight. The Support Vector Machine achieved the highest performance (97.79% accuracy), followed closely by KNN and Random Forest. The findings highlight the effectiveness of feature-based learning approaches for CT image analysis, suggesting their potential as sustainable and reproducible alternatives to deep learning in data-limited or resource-constrained environments.

II. RELATED WORK

Feature engineering and radiomics have been fundamental in advancing machine learning (ML) applications for medical image analysis. Early foundational works established the theoretical basis for texture and gradient-based representations. Haralick et al. [1] introduced texture feature descriptors derived from the Gray Level Co-occurrence Matrix (GLCM), defining statistical measures such as contrast, energy, and entropy that remain widely used in medical imaging. Ojala et al. [2] developed Local Binary Patterns (LBP) as an efficient method for encoding local texture micro-patterns. Dalal and Triggs [3] proposed the Histogram of Oriented Gradients (HOG) descriptor, which captures shape and boundary details through gradient orientation histograms. Daugman [4] formalized Gabor filters for spatial-frequency decomposition, enabling robust texture and edge analysis in two dimensions. These classical feature extraction techniques collectively underpin the radiomics framework adopted in the present study.

In neuroimaging, radiomics has become an essential tool for quantitative feature extraction from CT and MRI scans. Shi et al. (2024) reviewed ML-based radiomics approaches for neurodegenerative and cerebrovascular diseases, emphasizing their diagnostic potential but noting that most applications remain binary classification tasks or outcome prediction rather than multi-class lesion identification [5]. Zhang et al. (2024) combined thrombus and perithrombus radiomic features with deep-learning representations to predict malignant cerebral edema after reperfusion therapy, evaluating eleven algorithms and achieving strong AUC performance [6]. Lyu et al. (2023) developed a CT-based radiomics model to differentiate between primary and secondary intracranial hemorrhage in 238 patients, extracting over 1,700 features and using support vector machines (SVM) for classification [7]. Their model outperformed radiologist assessment in hemorrhage discrimination, highlighting the value of radiomics in cerebrovascular diagnostics.

Recent work by Sun et al. (2024) proposed a radiomics-clinical ML model for predicting futile recanalization following endovascular treatment in anterior circulation stroke using non-contrast CT (NCCT) scans [8]. The study used 2,016 radiomic features and logistic regression to achieve AUC values above 0.85, demonstrating radiomics' predictive potential in clinical decision-making. Other studies, including those by Yao et al. (2024) and Wang et al. (2025), further established that handcrafted radiomic features can complement or outperform deep learning models when datasets are small or lack pixel-level annotations.

Despite these advances, two limitations persist in current research: (i) most radiomics-based ML models in cerebrovascular imaging address binary or prognostic tasks, and (ii) few studies systematically benchmark multiple classical ML algorithms on the same dataset using handcrafted features. The present work addresses both gaps by conducting a ten-class classification of cerebrovascular CT images using multiple feature families (HOG, LBP, GLCM, Haralick, and Gabor) and ten ML algorithms, evaluated under a standardized cross-validation protocol. This study therefore bridges classical radiomics with comprehensive model benchmarking to establish a reproducible and interpretable baseline for future deep learning and explainable AI research.

III. DATASET DESCRIPTION

The dataset used in this study is the Cerebrovascular Lesions CT Image Dataset, published by Macin et al. (2025) and hosted on Kaggle under the title “Deep learning-based classification of cerebrovascular lesions on CT images using a novel SwinNeXt architecture.” It can be accessed at: https://[Link]/datasets/buraktaci/cerebrovascular-lesions/data?select=cerebrovascular+lesions.

This dataset was compiled retrospectively from patients admitted to the Department of Neurology at Malatya Turgut Özal University Medical Faculty between 2018 and 2022. CT images exhibiting motion artifacts were excluded. Each
patient's scans were reviewed and verified independently by radiologists and neurologists to ensure diagnostic accuracy. The study cohort includes 733 individuals (405 male and 328 female), and the data are categorized into ten clinically distinct classes representing different cerebrovascular lesion types as well as healthy controls.

The dataset contains approximately 10,000 non-contrast axial CT slices, each labeled according to one of ten diagnostic categories. The number of images per class ranges from 1,008 to 1,191, providing a relatively balanced multi-class distribution. Detailed demographic and class-wise statistics are provided in Table I.

TABLE I
DETAILS OF THE CEREBROVASCULAR LESIONS CT IMAGE DATASET (MACIN et al., 2025).

Class                          Male (n)  Female (n)  Total (n)  Age, years (Mean ± SD)  Images (n)
Acute ischemic infarction      45        38          83         70.78 ± 15.37           1119
Epidural hemorrhage            48        21          69         55.94 ± 17.56           1011
Chronic ischemic infarction    41        33          74         62.25 ± 12.32           1008
Subdural effusion              36        30          66         71.30 ± 15.88           1080
Parenchymal hemorrhage         54        42          96         65.25 ± 19.64           1077
Subarachnoid hemorrhage        32        45          77         52.12 ± 18.47           1166
Subdural hemorrhage            42        38          80         48.44 ± 17.54           1177
Ventricular hemorrhage         40        32          72         62.78 ± 18.88           1020
Mixed cerebrovascular disease  42        27          69         68.75 ± 14.63           1191
Healthy Control                25        22          47         38.24 ± 16.24           1016

The dataset has a reported usability rating of 5.00 on Kaggle and is publicly available for academic and research purposes. The authors note that patient identifiers were removed to protect privacy, preventing patient-wise stratification in downstream ML experiments. Consequently, this study adopts image-level repeated stratified cross-validation rather than patient-wise splitting, with explicit acknowledgment of this limitation in later sections. Key characteristics:
• 733 unique patients (405 male / 328 female) and 10 diagnostic categories.
• Approximately 10,000 axial CT slices (1.35 GB dataset size).
• Balanced class distribution (approximately 1,000 images per class).
• Images are non-contrast axial brain CT scans in PNG format.
• Public academic license; no identifiable patient metadata released.

This dataset provides an ideal benchmark for evaluating classical ML and radiomics-based methods for cerebrovascular lesion classification. It is recent (published 2025), high-quality, clinically verified, and sufficiently large for statistically reliable cross-validation experiments, while remaining manageable for CPU-based computation and educational use.

IV. METHODOLOGY

The proposed workflow consists of a fully reproducible and explainable pipeline that transforms raw CT slices into radiomics-based feature vectors, followed by the training and comparison of ten classical machine learning algorithms. The entire pipeline is designed for CPU-only execution in Google Colab or Jupyter environments without reliance on GPU acceleration. The process includes (1) data preprocessing, (2) radiomics-based feature extraction, (3) feature scaling and dimensionality reduction, (4) model training and hyperparameter setup, (5) repeated stratified cross-validation, and (6) evaluation using multiple metrics. Each stage is described in detail below.

A. Preprocessing

All CT images from the ten classes were downloaded from the Kaggle repository and organized into class-specific directories. Since the dataset comprises axial CT slices stored as PNG files, no DICOM parsing was required. Each image was first loaded using the OpenCV and NumPy libraries for array-based manipulation.

To ensure uniformity across the dataset, all images were converted to grayscale (if not already), as color information is not relevant for CT-based intensity analysis. Each image was then resized to a fixed spatial resolution of 128×128 pixels to standardize feature extraction dimensions. Pixel intensities were normalized to the range [0,1] using min-max normalization to prevent scale bias during feature computation.

Optional preprocessing techniques such as Gaussian blurring and histogram equalization were tested for denoising and contrast enhancement but were not included in the final model training to maintain reproducibility. Images were verified visually for consistency after resizing and normalization. The final preprocessed dataset was stored as NumPy arrays for efficient iteration during feature extraction.

(10,000 images × 8,000 features). This feature set was used for all subsequent model training and analysis.
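
The preprocessing steps of Section IV-A (grayscale conversion, resizing to 128×128, and min-max normalization to [0,1]) can be illustrated with the minimal sketch below. It is not the notebook's code. To stay dependency-free it uses NumPy only, with an equal-weight channel average and nearest-neighbour sampling standing in for the OpenCV calls mentioned in the text. The function names (`to_grayscale`, `resize_nearest`, `min_max_normalize`, `preprocess`) are hypothetical.

```python
import numpy as np

TARGET_SIZE = (128, 128)  # spatial resolution stated in Section IV-A


def to_grayscale(image: np.ndarray) -> np.ndarray:
    """Return an H x W float array; 2D inputs are passed through unchanged."""
    if image.ndim == 2:
        return image.astype(np.float32)
    # Assumption: equal channel weights (OpenCV would use luminance weights).
    return image[..., :3].mean(axis=-1).astype(np.float32)


def resize_nearest(image: np.ndarray, size=TARGET_SIZE) -> np.ndarray:
    """Nearest-neighbour resize, a simple stand-in for an OpenCV resize."""
    rows = np.arange(size[0]) * image.shape[0] // size[0]
    cols = np.arange(size[1]) * image.shape[1] // size[1]
    return image[np.ix_(rows, cols)]


def min_max_normalize(image: np.ndarray) -> np.ndarray:
    """Scale intensities to [0, 1]; constant images map to all zeros."""
    lo, hi = float(image.min()), float(image.max())
    if hi == lo:
        return np.zeros_like(image, dtype=np.float32)
    return ((image - lo) / (hi - lo)).astype(np.float32)


def preprocess(image: np.ndarray) -> np.ndarray:
    """Grayscale -> 128x128 -> [0, 1], in the order described in the text."""
    return min_max_normalize(resize_nearest(to_grayscale(image)))


# Usage on a synthetic RGB array standing in for a loaded PNG slice.
rng = np.random.default_rng(0)
fake_slice = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
processed = preprocess(fake_slice)
print(processed.shape, processed.dtype)
```

Normalizing each image by its own minimum and maximum is one reading of "min-max normalization". A fixed divisor such as 255 would be an equally plausible choice and would preserve intensity differences between slices.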
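
Steps (3) to (6) of the workflow listed in Section IV (feature scaling, PCA, repeated stratified cross-validation, and multi-metric evaluation) could be wired together with scikit-learn roughly as follows. This is a sketch under stated assumptions rather than the configuration used in the study: the feature matrix is synthetic, and the PCA variance threshold (0.95), the SVM hyperparameters, and the random seeds are illustrative values that do not come from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the radiomic feature matrix (sizes are hypothetical).
X, y = make_classification(
    n_samples=600,
    n_features=200,
    n_informative=40,
    n_classes=10,
    n_clusters_per_class=1,
    random_state=0,
)

# Scaling and PCA sit inside the pipeline so they are re-fitted on the
# training portion of every fold, which avoids leaking test-fold statistics.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, random_state=0),  # assumed variance threshold
    SVC(kernel="rbf", C=10.0, gamma="scale"),  # assumed hyperparameters
)

# 5 folds x 3 repeats, as stated in the paper.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
metrics = ["accuracy", "f1_macro", "balanced_accuracy"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

# Mean and standard deviation of each metric over the 15 evaluations.
summary = {
    name: (scores[f"test_{name}"].mean(), scores[f"test_{name}"].std())
    for name in metrics
}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```

Any of the other nine classifiers can be substituted for `SVC` in the last pipeline step without changing the evaluation protocol, which keeps the comparison between algorithms like-for-like.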
ACKNOWLEDGMENT
The author thanks the faculty of the Manipal Institute of
Technology, Bengaluru, for academic guidance, and the dataset