
A Support Vector Machine for Pulsar Classification

1st Melpakkam Pradeep (CH20B064)


Department of Chemical Engineering
IIT Madras
Chennai, India
[email protected]

Abstract—In this study, Support Vector Machines (SVMs) were employed to classify stars as pulsars based on features extracted from the moments of the integrated profile and the DM-SNR curve. Our SVM model demonstrated outstanding performance, achieving high accuracy and F1 scores in pulsar identification. Beyond classification prowess, we conducted a comprehensive investigation into the impact of various hyperparameters on the SVM classifier. This exploration delves into the tuning of parameters, shedding light on optimal configurations for enhanced classification accuracy. The study not only contributes to the field of pulsar detection but also provides valuable insights into the nuanced interplay between hyperparameters and SVM performance, paving the way for more robust and efficient classification methodologies in pulsar astronomy.

Index Terms—Support Vector Machines, bootstrapping, Pulsars, Classification, cross-validation

I. INTRODUCTION

The exploration of celestial phenomena has long captivated the scientific community, prompting a dedicated inquiry into the classification of pulsars—unique astral entities emitting rhythmic signals akin to cosmic lighthouses. This paper delves into the application of Support Vector Machines (SVMs) as a classification tool for distinguishing pulsars based on pertinent features derived from the moments of the integrated profile and the DM-SNR curve. These features serve as distinctive signatures, analogous to cosmic fingerprints and heartbeats.

The SVM model, functioning as a discerning agent in this cosmic pursuit, exhibits noteworthy accuracy in identifying pulsars while concurrently considering classification balance through the F1 score. Beyond empirical success, this study delves into the nuanced realm of hyperparameters, scrutinizing their effects on SVM classifier performance. The examination of hyperparameter configurations enhances our understanding of optimal settings, thereby refining the efficiency of the classification methodology.

The significance of this research extends beyond the celestial domain, encapsulating the endeavor to refine and fortify our cosmic classifier. By unraveling the intricacies of the SVM's decision-making process and its sensitivity to hyperparameters, this study contributes to the broader discourse on classification methodologies and holds implications for the advancement of astronomical inquiry.

In essence, this paper serves as a formal exposition into the application of SVMs for pulsar classification, combining empirical analysis with a systematic exploration of hyperparameters to elucidate the mechanics underpinning accurate and balanced astral categorization.

II. DATA AND CLEANING

A. The Datasets

Two datasets were provided to build the support vector machine model. The first was a train set and the second was a test set. Given that the test set did not contain the target labels (pulsar or not), we could not verify the goodness of our model's predictions.

Other than the target label ('target class') being included in the train set, all other labels were common to both datasets. The data collected and their types are summarized in Table I.

TABLE I
Table of the features in the given datasets along with their types. Note that the 'target class' variable is absent in the test set. We observe that all variables are numerical.

Feature                                         Type
Mean of the integrated profile                  Numerical
Standard deviation of the integrated profile    Numerical
Excess kurtosis of the integrated profile       Numerical
Skewness of the integrated profile              Numerical
Mean of the DM-SNR curve                        Numerical
Standard deviation of the DM-SNR curve          Numerical
Excess kurtosis of the DM-SNR curve             Numerical
Skewness of the DM-SNR curve                    Numerical
target class                                    Categorical (0, 1)

B. Data Cleaning

A KNN Imputer is used on the dataset to impute missing values. This largely preserves each variable's distribution. Finally, the variables are converted to their appropriate types and the cleaned dataset is returned. No confounding symbols are present in the train or test data; we only find missing values.

There are multiple imputation techniques available. One can impute missing values with 0, with the mean of the variable, with the median or mode of the variable, or by randomly sampling from the distribution of the variable. The expectation imputers (mean, median, mode) distort the distribution of the imputed data about the expectation estimator used, when compared to the Random Sampling Imputer (RSI).

Unfortunately, the RSI is a slow imputation technique. Either a prior distribution must be assumed and its parameters estimated from the data, or a non-parametric method such as a Kernel Density Estimate (KDE) can be used.
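To make the trade-off concrete, a random-sampling imputer can be sketched in a few lines. This is a hypothetical helper for illustration, not the imputer used in this study:

```python
import numpy as np
import pandas as pd

def random_sampling_impute(series: pd.Series, rng=None) -> pd.Series:
    """Fill NaNs by sampling (with replacement) from the observed values.

    This preserves the empirical distribution of the variable, but the
    resampling must be repeated per column (and again on every refit),
    which is what makes the approach slow on wide datasets.
    """
    rng = np.random.default_rng(rng)
    out = series.copy()
    missing = out.isna()
    observed = out.dropna().to_numpy()
    out.loc[missing] = rng.choice(observed, size=missing.sum(), replace=True)
    return out
```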

Since a distribution can be considered as a data cluster, we use a K-Nearest Neighbours (KNN) Imputer. This Imputer finds the k nearest neighbours of each data point with a missing value, and imputes the missing value based on them. Since all points used in the imputation are very "close", we expect this method to preserve each variable's distribution quite well.

We can also observe this empirically. In Figs. 1-8, we present the Kernel Density Estimate (KDE) and Empirical Cumulative Distribution Function (ECDF) of the numerical variables in the train dataset, before and after imputation. It is evident that the underlying distributions do not undergo drastic changes due to the KNN Imputation.
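A minimal sketch of this cleaning step with scikit-learn's KNNImputer follows. The file name and k = 5 (the library default) are assumptions, as the assignment's exact settings are not stated:

```python
import pandas as pd
from sklearn.impute import KNNImputer

train = pd.read_csv("train.csv")  # hypothetical file name
features = train.drop(columns=["target_class"])

# Impute each missing entry from its 5 nearest neighbours, using
# distances computed over the non-missing coordinates.
imputer = KNNImputer(n_neighbors=5)
imputed = pd.DataFrame(imputer.fit_transform(features),
                       columns=features.columns, index=features.index)

# Distribution preservation can then be checked as in Figs. 1-8,
# e.g. by comparing KDEs and ECDFs before and after imputation.
```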

Fig. 1. The probability and cumulative distributions of the Mean of the integrated profile of the samples are plotted. The left image contains the KDE of the data before and after KNN Imputation. The right image shows the ECDFs of the data before and after KNN Imputation. In both distribution functions, no significant differences between the two distributions can be observed, indicating that the imputer does not change the underlying distribution.

Fig. 2. Same as Fig. 1, but for the Standard deviation of the integrated profile.

Fig. 3. Same as Fig. 1, but for the Excess kurtosis of the integrated profile.

Fig. 4. Same as Fig. 1, but for the Skewness of the integrated profile.

Fig. 5. Same as Fig. 1, but for the Mean of the DM-SNR curve.

III. METHODS

A. Support Vector Machine Classifier

Support Vector Machines (SVMs) stand as powerful entities within the realm of machine learning, specifically tailored for classification tasks. The core principle guiding SVMs involves the identification of an optimal hyperplane that maximally separates data points belonging to distinct classes. This hyperplane acts as a decisive boundary, facilitating the accurate classification of new instances.

In the scenario of a binary classification dataset, the SVM strives to discover a hyperplane within the feature space characterized by the equation

$$\mathbf{w} \cdot \mathbf{x} + b = 0$$

Here, $\mathbf{w}$ represents the weight vector, $\mathbf{x}$ denotes the input vector, and $b$ stands for the bias term.

Central to SVM is the concept of the margin, the spatial interval between the hyperplane and the closest data point from either class. Maximizing this margin is pivotal, as it enhances the model's capacity for generalization to novel data.

Support vectors emerge as pivotal elements within SVM, representing data points in close proximity to the decision boundary. These vectors play a crucial role in determining the optimal hyperplane and, consequently, the efficacy of the SVM model.
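For concreteness, the sketch below shows how a fitted scikit-learn SVC exposes its support vectors; synthetic data is used purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

# The points lying closest to the decision boundary define the model:
print(clf.support_vectors_.shape)  # (n_support_vectors, n_features)
print(clf.n_support_)              # number of support vectors per class
```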
Fig. 6. Same as Fig. 1, but for the Standard deviation of the DM-SNR curve.

In situations where linear separation proves impractical, SVM harnesses the kernel trick. Kernels facilitate the transformation of the input space into a higher-dimensional realm, where linear separation becomes feasible. Notable kernels include the linear, polynomial, and radial basis function (RBF) kernels.
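For reference, these kernels take the following standard forms, where $\gamma$, $d$, and $r$ correspond to scikit-learn's gamma, degree, and coef0 parameters:

$$K_{\mathrm{linear}}(\mathbf{x}, \mathbf{x}') = \mathbf{x} \cdot \mathbf{x}', \qquad K_{\mathrm{poly}}(\mathbf{x}, \mathbf{x}') = (\gamma\,\mathbf{x} \cdot \mathbf{x}' + r)^{d}, \qquad K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right)$$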

The optimization objective intrinsic to SVM revolves around the identification of the optimal hyperplane. This process entails the minimization of the norm of the weight vector, subject to the constraint that each data point is correctly classified (up to slack, in the soft-margin case). The resultant convex optimization problem is typically addressed through quadratic programming methods such as Sequential Minimal Optimization (SMO), or through (sub)gradient descent on the equivalent hinge-loss formulation.
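Concretely, the soft-margin form of this problem, with slack variables $\xi_i$ and the regularization parameter $C$ examined later in Section IV-C, reads:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_{i} \quad \text{subject to} \quad y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i} + b\right) \ge 1 - \xi_{i}, \quad \xi_{i} \ge 0, \quad i = 1, \dots, n$$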
Fig. 7. Same as Fig. 1, but for the Excess kurtosis of the DM-SNR curve.

In summary, SVMs emerge as versatile and potent classifiers, proficient in scenarios involving both linear and non-linear separations. Their adeptness in delineating an optimal hyperplane with a maximized margin contributes to their efficacy in diverse classification tasks. A comprehensive grasp of the foundational concepts and optimization principles underlying SVMs establishes a robust framework for their application across a spectrum of machine learning endeavors.

B. Classification Metrics

There are various metrics that can evaluate the goodness-of-fit of a given classifier. Some of these metrics are presented in this section. In classification tasks, it is essential to choose appropriate evaluation metrics based on the problem's context and objectives.

Fig. 8. Same as Fig. 1, but for the Skewness of the DM-SNR curve.

1) Accuracy: Accuracy is one of the most straightforward classification metrics and is defined as:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \quad (1)$$

It measures the proportion of correct predictions made by the model. While accuracy provides an overall sense of model performance, it may not be suitable for imbalanced datasets, where one class dominates the other.

2) Recall: Recall, also known as sensitivity or true positive rate, quantifies a model's ability to correctly identify positive instances:

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \quad (2)$$
Recall is essential when the cost of missing positive cases
(false negatives) is high, such as in medical diagnoses.

3) Precision: Precision measures the accuracy of positive predictions made by the model:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \quad (3)$$
Precision is valuable when minimizing false positive
predictions is critical, like in spam email detection.

4) F1-score: The F1 score is the harmonic mean of precision and recall, providing a balance between the two:

$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4)$$
It is particularly useful when there is an uneven class
distribution or when both precision and recall need to be
considered simultaneously.
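All of the above metrics are available in scikit-learn; a minimal sketch with toy labels follows (the arrays shown are illustrative, not the study's data):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_val  = [0, 0, 1, 1, 0, 1]   # true labels (toy example)
y_pred = [0, 0, 1, 0, 0, 1]   # classifier predictions

print("Accuracy :", accuracy_score(y_val, y_pred))   # Eq. (1)
print("Recall   :", recall_score(y_val, y_pred))     # Eq. (2)
print("Precision:", precision_score(y_val, y_pred))  # Eq. (3)
print("F1 score :", f1_score(y_val, y_pred))         # Eq. (4)
```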

5) Receiver Operating Characteristic Curve (ROC Curve): The ROC curve is a graphical representation of a model's performance across different classification thresholds. It plots the true positive rate (recall) against the false positive rate (1 - specificity) at various threshold values.
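As an illustration, an ROC curve of this kind can be traced with scikit-learn. The labels and scores below are toy values; in this study, the scores would come from the SVM's decision function:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_val  = [0, 0, 1, 1, 0, 1]               # true labels (toy example)
scores = [0.1, 0.4, 0.8, 0.35, 0.2, 0.9]  # e.g. clf.decision_function(X_val)

fpr, tpr, thresholds = roc_curve(y_val, scores)
auc = roc_auc_score(y_val, scores)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], "k--")  # chance-level diagonal
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (recall)")
plt.legend()
plt.show()
```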
The area under the ROC curve (AUC-ROC) quantifies the model's overall performance. A higher AUC-ROC indicates a better model at distinguishing between positive and negative instances.

Fig. 9. A sample ROC curve from a classifier. Note the trade-off between sensitivity and specificity. Based on the problem, we may be required to optimize for only one.

IV. RESULTS

A. Existence of Linear Relationships among the Features

The correlation heatmap for the numerical independent variables is shown in Fig. 10. We observe that no variables are strongly correlated with each other. No strong linear relationships exist among the independent variables, and we can proceed with our classifier.

Fig. 10. The correlation heatmap between all independent variables. This was obtained by finding the pairwise correlation coefficient between each independent variable. The color gradient indicates the magnitude of the correlation between the variables.

B. Support Vector Machines are accurate classifiers

To train and evaluate our Support Vector Machine model, we split our train data into train and validation splits. This is done using a fixed random seed for replicability, with 20% of our given data in the validation split.

The Support Vector Machine model is first trained on the train split. We then bootstrap the validation set (1000 bootstrap samples) and compute the evaluation metrics presented in Section III-B. We provide the 95% CIs for our evaluation metrics in Table II. The probability distributions and ECDFs of our evaluation metrics are presented in Figs. 11-14.

TABLE II
Evaluation metrics of the Support Vector Machine classifier. We find that accuracy and precision are reasonably high. The variance in these estimates is also acceptable.

Metric      Value   95% CI
Accuracy    0.98    (0.97, 0.99)
Precision   0.95    (0.91, 0.99)
Recall      0.82    (0.75, 0.89)
F1 Score    0.88    (0.83, 0.93)
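A minimal reconstruction of this procedure is sketched below. The synthetic data and percentile-based CIs are assumptions for illustration; only the F1 score is shown, but the other metrics follow identically:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for the pulsar data: an imbalanced binary problem.
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)  # fixed seed for replicability

clf = SVC(kernel="rbf").fit(X_tr, y_tr)
y_pred = clf.predict(X_val)

# Bootstrap the validation set: resample (prediction, label) pairs.
rng = np.random.default_rng(42)
f1_samples = []
for _ in range(1000):                      # 1000 bootstrap resamples
    idx = rng.integers(0, len(y_val), len(y_val))
    f1_samples.append(f1_score(y_val[idx], y_pred[idx]))

lo, hi = np.percentile(f1_samples, [2.5, 97.5])
print(f"F1 = {np.mean(f1_samples):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```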
C. Support Vector Machines are sensitive to hyperparameters

Support Vector Machines are known to have multiple hyperparameters, unlike other methods such as Logistic Regression and Naive Bayes. This leads to radically different performance when these hyperparameters are tuned. The SVM's kernel and margins, in particular, are very important.

Our classification model yields a train accuracy of 100% and a validation accuracy of 97%. While the chances of overfitting are slim, we would like to investigate the performance of different Support Vector Machines. To do so, we examine how accuracy and the F1 score vary with different hyperparameters.

Fig. 11. The left plot contains the histogram of the accuracy obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the accuracy obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

In Figs. 15-18 we plot the variation of accuracy and the F1 score with various hyperparameters. To get a true picture, we perform 100 bootstraps of the validation set and evaluate the accuracy and F1 score. We then plot the mean metrics, along with error bars of 2 standard deviations assuming a t-distribution.
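The C sweep of Fig. 15 could be produced along the following lines. This is a sketch reusing X_tr, X_val, y_tr, and y_val from the previous snippet; the grid of C values is an assumption, as the assignment's exact grid is not stated:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
for C in [1e-3, 1e-2, 1e-1, 1, 10, 100]:
    clf = SVC(kernel="rbf", C=C).fit(X_tr, y_tr)
    y_pred = clf.predict(X_val)
    accs, f1s = [], []
    for _ in range(100):  # 100 bootstraps per hyperparameter setting
        idx = rng.integers(0, len(y_val), len(y_val))
        accs.append(accuracy_score(y_val[idx], y_pred[idx]))
        f1s.append(f1_score(y_val[idx], y_pred[idx]))
    # Report mean +/- 2 standard deviations, as in Figs. 15-18.
    print(f"C={C:>7}: acc={np.mean(accs):.3f}+/-{2*np.std(accs, ddof=1):.3f} "
          f"f1={np.mean(f1s):.3f}+/-{2*np.std(f1s, ddof=1):.3f}")
```

The same loop, varied over gamma, the kernel, or the polynomial degree, yields the remaining sweeps.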

Fig. 12. Same as Fig. 11, but for the recall.

Fig. 13. Same as Fig. 11, but for the precision.

Fig. 14. Same as Fig. 11, but for the F1 score.

Fig. 15. Variation of F1 Score (Left) and Accuracy (Right) for various values of the "Regularization Parameter (C)" of the Support Vector Machine. We find that larger values of C lead to better performance on the validation set. However, the performance saturates after C = 10^{-1}.
Fig. 16. Variation of F1 Score (Left) and Accuracy (Right) for various values of "γ" of the RBF kernel Support Vector Machine. We find that larger values lead to worse performance on the validation set.

Fig. 17. Variation of F1 Score (Left) and Accuracy (Right) for various values of "Kernel" of the Support Vector Machine. We find that this does not largely affect the performance on the validation set. The sigmoid kernel gives worse performance compared to the other three.

Fig. 18. Variation of F1 Score (Left) and Accuracy (Right) for various values of "Polynomial Degree" of the Polynomial Kernel. We find no effect on the performance.

V. DISCUSSION

The results obtained from the application of Support Vector Machines (SVMs) to classify pulsars in our study are notably promising, with the model achieving a commendable accuracy of 97%, an F1 score of 88%, and a precision of 95%. These metrics underscore the robust performance of SVMs in accurately identifying pulsars, showcasing their efficacy in the realm of astronomical classification tasks.

The high accuracy indicates that the SVM successfully differentiated between pulsars and non-pulsars with a remarkable level of precision. The F1 score, considering both precision and recall, provides a balanced evaluation of the model's performance, affirming its capability to reliably identify pulsars while minimizing false positives and false negatives.

The precision of 95% indicates a low rate of false positives, highlighting the SVM's ability to avoid misclassifying non-pulsar instances as pulsars. This is particularly crucial in astronomical studies, where the identification of pulsars is a significant task.

VI. CONCLUSIONS AND FUTURE WORK

It is essential to acknowledge potential considerations and avenues for further exploration. The nature of astronomical data, characterized by noise and variability, may introduce challenges that impact model performance. Future work could involve refining feature engineering techniques or exploring advanced kernel functions to enhance the SVM's ability to discern subtle patterns in pulsar signals.

Additionally, an analysis of misclassifications, particularly instances where the SVM failed to identify pulsars, may provide insights into the limitations of the model. Understanding the specific characteristics of misclassified instances can guide future improvements and refinements in the classification methodology.

Furthermore, the computational efficiency of SVMs, while commendable, could be a consideration for large-scale datasets. Exploring optimization techniques or parallelization strategies may be warranted to ensure scalability without compromising performance.

In conclusion, the results of our SVM-based pulsar classification are promising, with high accuracy, F1 score, and precision. While affirming the effectiveness of SVMs in astronomical classification tasks, ongoing research is necessary to address potential challenges and enhance the model's capability to discern pulsars in diverse and complex datasets. The achievements presented in this study lay a foundation for future advancements in pulsar identification and contribute to the broader field of astronomical data analysis.

