DAL Assignment 6
Abstract—In this study, Support Vector Machines (SVMs) were employed to classify stars as pulsars based on features extracted from the moments of the integrated profile and the DM-SNR curve. Our SVM model demonstrated outstanding performance, achieving high accuracy and F1 scores in pulsar identification. Beyond classification prowess, we conducted a comprehensive investigation into the impact of various hyperparameters on the SVM classifier. This exploration delves into the tuning of parameters, shedding light on optimal configurations for enhanced classification accuracy. The study not only contributes to the field of pulsar detection but also provides valuable insights into the nuanced interplay between hyperparameters and SVM performance, paving the way for more robust and efficient classification methodologies in pulsar astronomy.

Index Terms—Support Vector Machines, bootstrapping, Pulsars, Classification, cross-validation

I. INTRODUCTION

The exploration of celestial phenomena has long captivated the scientific community, prompting a dedicated inquiry into the classification of pulsars—unique astral entities emitting rhythmic signals akin to cosmic lighthouses. This paper delves into the application of Support Vector Machines (SVMs) as a classification tool for distinguishing pulsars based on pertinent features derived from the moments of the integrated profile and the DM-SNR curve. These features serve as distinctive signatures, analogous to cosmic fingerprints and heartbeats.

The SVM model, functioning as a discerning agent in this cosmic pursuit, exhibits noteworthy accuracy in identifying pulsars while concurrently considering classification balance through the F1 score. Beyond empirical success, this study delves into the nuanced realm of hyperparameters, scrutinizing their effects on SVM classifier performance. The examination of hyperparameter configurations enhances our understanding of optimal settings, thereby refining the efficiency of the classification methodology.

The significance of this research extends beyond the celestial domain, encapsulating the endeavor to refine and fortify our cosmic classifier. By unraveling the intricacies of the SVM's decision-making process and its sensitivity to hyperparameters, this study contributes to the broader discourse on classification methodologies and holds implications for the advancement of astronomical inquiry.

In essence, this paper serves as a formal exposition into the application of SVMs for pulsar classification, combining empirical analysis with a systematic exploration of hyperparameters to elucidate the mechanics underpinning accurate and balanced astral categorization.

II. DATA AND CLEANING

A. The Datasets

Two datasets were provided to build the support vector machine model: a train set and a test set. Given that the test set did not contain the target labels (pulsar or not), we could not verify the goodness of our model's predictions on it. Other than the target label ('target class') being included only in the train set, all other features were common to both datasets. The data collected and their types are summarized in Table I.

TABLE I
FEATURES IN THE GIVEN DATASETS ALONG WITH THEIR TYPES. NOTE THAT THE 'TARGET CLASS' VARIABLE IS ABSENT IN THE TEST SET. WE OBSERVE THAT ALL PREDICTOR VARIABLES ARE NUMERICAL.

Feature                                        Type
Mean of the integrated profile                 Numerical
Standard deviation of the integrated profile   Numerical
Excess kurtosis of the integrated profile      Numerical
Skewness of the integrated profile             Numerical
Mean of the DM-SNR curve                       Numerical
Standard deviation of the DM-SNR curve         Numerical
Excess kurtosis of the DM-SNR curve            Numerical
Skewness of the DM-SNR curve                   Numerical
target class                                   Categorical (0, 1)

B. Data Cleaning

No confounding symbols are present in the train or test data; we only find missing values. A KNN Imputer is used on the dataset to impute these missing values, which largely preserves the distribution of each variable. Finally, the variables are converted to their appropriate types and the cleaned dataset is returned.

There are multiple imputation techniques available. One can impute missing values with 0, with the mean, median, or mode of the variable, or by randomly sampling from the distribution of the variable. The Expectation Imputers (mean, median, mode) distort the distribution of the imputed data about the expectation estimator used, when compared to the Random Sampling Imputer (RSI).

Fig. 2. The probability and cumulative distributions of the Standard deviation of the integrated profile of the various samples are plotted. The left image contains the KDE of the data before and after KNN Imputation. The right image shows the ECDFs of the data before and after KNN Imputation. In both distribution functions, no significant differences between the two distributions can be observed, indicating that the imputer does not change the underlying distribution.
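A minimal sketch of this imputation step is given below, assuming the scikit-learn and pandas libraries; the file name train.csv, the column name target_class, and the choice of n_neighbors=5 are illustrative assumptions rather than values taken from this report.

import pandas as pd
from sklearn.impute import KNNImputer

# Load the training data (file name is an assumption).
train_df = pd.read_csv("train.csv")

# KNNImputer fills each missing entry using the values of the
# k nearest neighbours, measured on the non-missing features.
imputer = KNNImputer(n_neighbors=5)

feature_cols = [c for c in train_df.columns if c != "target_class"]
train_df[feature_cols] = imputer.fit_transform(train_df[feature_cols])

# Cast the target label back to its categorical (0/1) type.
train_df["target_class"] = train_df["target_class"].astype(int)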
III. METHODS

A. Support Vector Machine Classifier

Support Vector Machines (SVMs) stand as powerful entities within the realm of machine learning, specifically tailored for classification tasks. The core principle guiding SVMs involves the identification of an optimal hyperplane that maximally separates data points belonging to distinct classes. This hyperplane acts as a decisive boundary, facilitating the accurate classification of new instances. In the scenario of a binary classification dataset, the SVM strives to discover a hyperplane within the feature space characterized by the equation w · x + b = 0, where w is the weight vector normal to the hyperplane and b is the bias term. The margin is the interval between the hyperplane and the closest data point from either class. Maximizing this margin is pivotal, as it enhances the model's capacity for generalization to novel data.
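For concreteness, a soft-margin SVM of this kind can be fit as sketched below; scikit-learn is assumed, X_train and y_train denote the imputed training features and labels, and X_val is a held-out validation split (all names are illustrative).

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardising the features matters because the RBF kernel is
# distance-based; C and gamma are the hyperparameters explored later.
svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_val)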
Fig. 8. The probability and cumulative distributions of the Skewness of the DM-SNR curve of the various samples are plotted. The left image contains the KDE of the data before and after KNN Imputation. The right image shows the ECDFs of the data before and after KNN Imputation. In both distribution functions, no significant differences between the two distributions can be observed, indicating that the imputer does not change the underlying distribution.

B. Classification Metrics

There are various metrics that can evaluate the goodness-of-fit of a given classifier; some of these metrics are presented in this section. In classification tasks, it is essential to choose appropriate evaluation metrics based on the problem's context and objectives.

1) Accuracy: Accuracy is one of the most straightforward classification metrics and is defined as:

Accuracy = Number of Correct Predictions / Total Number of Predictions    (1)

It measures the proportion of correct predictions made by the model. While accuracy provides an overall sense of model performance, it may not be suitable for imbalanced datasets, where one class dominates the other.
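The accuracy above, along with the precision, recall, and F1 score used elsewhere in this report, can be computed directly from the validation predictions; a short sketch assuming scikit-learn, with y_val and y_pred denoting the true and predicted labels of the validation split:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Accuracy: fraction of correct predictions, as in Eq. (1).
acc = accuracy_score(y_val, y_pred)

# Precision, recall, and F1 treat the pulsar class (label 1) as the
# positive class, which matters for this imbalanced dataset.
prec = precision_score(y_val, y_pred, pos_label=1)
rec = recall_score(y_val, y_pred, pos_label=1)
f1 = f1_score(y_val, y_pred, pos_label=1)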
Fig. 12. The left plot contains the histogram of the recall obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the recall obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 13. The left plot contains the histogram of the precision obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the precision obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 14. The left plot contains the histogram of the F1 score obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the F1 score obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 15. Variation of F1 Score (Left) and Accuracy (Right) for various values of the regularization parameter C of the Support Vector Machine. We find that larger values of C lead to better performance on the validation set; however, the performance saturates after C = 10^-1.
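The distributions in Figs. 12-14 can be generated by bootstrapping the validation split; a sketch of such a resampling loop is shown below, assuming numpy and scikit-learn, with svm_clf, X_val, and y_val as in the earlier sketches and n_boot = 1000 chosen arbitrarily.

import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(seed=0)
n_boot = 1000

y_true = np.asarray(y_val)
y_hat = svm_clf.predict(X_val)   # predict once on the validation split
f1_samples = []

for _ in range(n_boot):
    # Resample validation indices with replacement.
    idx = rng.integers(0, len(y_true), size=len(y_true))
    f1_samples.append(f1_score(y_true[idx], y_hat[idx]))

# The histogram and ECDF of f1_samples correspond to Fig. 14; repeating
# the loop with recall_score and precision_score gives Figs. 12 and 13.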
Fig. 16. Variation of F1 Score (Left) and Accuracy (Right) for various values of "γ" of the RBF kernel Support Vector Machine. We find that larger values lead to worse performance on the validation set.

Fig. 17. Variation of F1 Score (Left) and Accuracy (Right) for various values of "Kernel" of the Support Vector Machine. We find that this does not largely affect the performance on the validation set. The sigmoid kernel gives worse performance compared to the other three.
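The hyperparameter curves in Figs. 15-18 correspond to one-dimensional sweeps of a single SVM hyperparameter while the others are held fixed; a sketch of the sweep over C is shown below (scikit-learn assumed, grid values illustrative), and the sweeps over gamma, the kernel, and the polynomial degree follow the same pattern.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

c_grid = np.logspace(-3, 3, 7)   # candidate values 10^-3 ... 10^3
results = []

for c in c_grid:
    # Refit the scaler + RBF SVM for each candidate value of C.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=c))
    model.fit(X_train, y_train)
    y_hat = model.predict(X_val)
    results.append((c, f1_score(y_val, y_hat), accuracy_score(y_val, y_hat)))

# Plotting F1 and accuracy against c_grid reproduces curves of the
# kind shown in Fig. 15.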
Fig. 18. Variation of F1 Score (Left) and Accuracy (Right) for various values of "Polynomial Degree" of the Polynomial Kernel. We find no effect on the performance.

V. DISCUSSION

The results obtained from the application of Support Vector Machines (SVMs) to classify pulsars in our study are notably promising, with the model achieving a commendable accuracy of 97%, an F1 score of 88%, and a precision of 97%. These metrics underscore the robust performance of SVMs in accurately identifying pulsars, showcasing their efficacy in the realm of astronomical classification tasks.

The high accuracy indicates that the SVM successfully differentiated between pulsars and non-pulsars with a remarkable level of precision. The F1 score, considering both precision and recall, provides a balanced evaluation of the model's performance, affirming its capability to reliably identify pulsars while minimizing false positives and false negatives.

The precision of 97% indicates a low rate of false positives, highlighting the SVM's ability to avoid misclassifying non-pulsar instances as pulsars. This is particularly crucial in astronomical studies, where the identification of pulsars is a significant task.

VI. CONCLUSIONS AND FUTURE WORK

It is essential to acknowledge potential considerations and avenues for further exploration. The nature of astronomical data, characterized by noise and variability, may introduce challenges that impact model performance. Future work could involve refining feature engineering techniques or exploring advanced kernel functions to enhance the SVM's ability to discern subtle patterns in pulsar signals.

Additionally, an analysis of misclassifications, particularly instances where the SVM failed to identify pulsars, may provide insights into the limitations of the model. Understanding the specific characteristics of misclassified instances can guide future improvements and refinements in the classification methodology.

Furthermore, the computational efficiency of SVMs, while commendable, could be a consideration for large-scale datasets. Exploring optimization techniques or parallelization could improve the scalability of this approach to such datasets.