Aalto University
CS-C3240 - Machine Learning
Stellar classification
with logistic regression
ML project stage 2
Figure 1: Stars in the Jewel Box cluster by ESO VLT
Contents
1 Introduction
2 Problem formulation
3 Methods
3.1 Preparing data
3.2 Model selection
3.3 Model validation
4 Results
5 Conclusion
6 References
7 Appendix A: source code
1 Introduction
There is an astronomical amount of astronomical data. With the new James Webb
telescope in orbit and aligned [1], we will discover even more of the universe. Automating
data analysis will make astronomers' lives easier. This report tackles one such astronomy
problem: classifying stars based on their temperature and brightness. First the problem is formulated as
a machine learning problem, then the methods, such as data preparation and the logistic
regression models, are discussed. Finally the results are analyzed and a conclusion is
drawn; the polynomial logistic regressor classifies the stars with stellar accuracy.
2 Problem formulation
The data points represent stars, like the ones in figure 1. There are many star
types, but for simplicity, let's focus on dwarf stars, main sequence stars and giant stars.
These three star types will be represented with the corresponding category numbers 0, 1, and
2. These are the labels I'll try to predict.
To classify a star, we need measurements. These measurements could be the size of the
star, its colour, surface temperature or absolute magnitude, i.e. the star's brightness.
Absolute magnitude and temperature will be the features since, as will be seen later, they
are the most relevant ones for classification. Both features have numerical values.
The features and the more specific star type labels can be found in two datasets from Kaggle
[2, 3].
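To make the setup concrete, a single data point can be written as a feature vector paired with a label. The values below are only an illustrative sketch (roughly Sun-like numbers), not taken from the dataset:

# one illustrative data point: features and label
# features = [absolute magnitude (Mv), surface temperature (K)]
# label: 0 = dwarf, 1 = main sequence, 2 = giant
x = [4.83, 5778.0]  # roughly Sun-like values, for illustration only
y = 1               # the Sun is a main sequence star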
3 Methods
Figure 2: A Hertzsprung–Russell diagram [3] explains stellar classification based on tem-
perature and absolute magnitude. The main sequence forms a continuous curve across
the diagram, with dwarf stars below it and giants above it.
3.1 Preparing data
The dataset is from Kaggle [3]. There are 240 data points in total, i.e. 240 stars with their
measurements and classifications, and seven values per star. Out of the seven values,
the 'Star type' column will be transformed into the labels as described below, and the 'Absolute
magnitude (Mv)' and 'Temperature (K)' columns will become the features. Values that were
missing from the original measurements were calculated by the creator of the dataset from
the other known values using equations from astrophysics, so the dataset has no missing values.
In the original data, the star types are red, brown and white dwarfs, main sequence
stars, supergiants and hypergiants. These were combined into three general labels:
0) dwarfs, 1) main sequence stars and 2) giants. Unnecessary columns were removed,
leaving just the label column 'Class' and the feature columns 'Magnitude' and 'Temperature'.
These features were chosen based on their appearance in the Hertzsprung–Russell dia-
gram, depicted in figure 2. The diagram plots stars based on these features and spectral
class, and clusters the stars into their classes. Therefore these features are relevant to
classification. Spectral class is calculated from temperature, so it was left out.
This didn't affect the number of data points, which remained at 240, split as follows:
Star type             Count
Dwarf stars           80
Main sequence stars   120
Giant stars           40
Features were then standardized with Scikit-learn's preprocessing scaler (StandardScaler). This was
necessary for the chosen model, logistic regression. Finally, 20% of the data points were set aside
as test data, leaving 192 data points to work with.
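As a sketch of the label combination step (Appendix A does this with an explicit loop; the mapping below is an equivalent shortcut), the six original type codes can be mapped to the three labels directly with pandas:

import pandas as pd

# original coding: 0-2 = brown/red/white dwarf, 3 = main sequence, 4-5 = super/hyper giant
# combined coding: 0 = dwarf, 1 = main sequence, 2 = giant
star_type = pd.Series([0, 3, 5])                        # a toy example
label = star_type.map({0: 0, 1: 0, 2: 0, 3: 1, 4: 2, 5: 2})
print(label.tolist())                                   # prints [0, 1, 2]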
3.2 Model selection
The first model of choice for classifying stars was logistic regression. It uses a linear
hypothesis space and a logistic loss function.
As can be seen in figure 2, the different classes can be separated quite well with a straight
line, so a linear map is expected to get quite good results. Linear methods are also simpler to
code than e.g. polynomial ones, so it was a good choice for my first machine learning project.
Linear classification works by drawing a line (or, in higher dimensions, a hyperplane) between
two classes.
Logistic loss is a smooth, convex function, so it is quick to optimize. This is important for
the validation method used, k-fold cross validation. Logistic loss is also less sensitive to
outliers than squared error loss.
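As background, for a linear hypothesis h(x) = w^T x and binary labels y in {-1, 1}, the logistic loss can be written as

\[
L\big((\mathbf{x}, y), h\big) = \log\!\left(1 + \exp\!\big(-y\, \mathbf{w}^{T}\mathbf{x}\big)\right),
\]

and Scikit-learn's LogisticRegression minimizes a (regularized) multiclass generalization of this loss over the training set.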
The second model is almost the same, but the hypothesis space consists of polynomials of degree 3
instead of linear maps. Let's call it polynomial logistic regression. It uses logistic loss as well.
The main sequence curve in the Hertzsprung–Russell diagram (figure 2) has the shape of an
S-curve. A polynomial of second degree makes a valley or a hill; a third-degree polynomial has both.
An S-curve is composed of a valley and a hill, and therefore degree three is appropriate.
A larger degree could result in overfitting.
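Concretely, the polynomial model expands the two features into all monomials up to degree 3 and feeds them to an ordinary logistic regression. A minimal sketch of the feature expansion, with made-up values:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# two raw features (temperature, magnitude) expanded to all terms up to degree 3
poly = PolynomialFeatures(3)
X_demo = np.array([[0.5, -1.2]])          # one made-up standardized star
X_demo_poly = poly.fit_transform(X_demo)
print(X_demo_poly.shape)                  # (1, 10): 1, x1, x2, x1^2, x1*x2, ..., x2^3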
3.3 Model validation
Because the dataset is small, K-fold cross validation will be used for both models. K-fold
prevents a misleading result from an unlucky split, which is likely with such a small dataset. K-
fold splits the data into k folds, and uses one fold as the validation set and the rest as the
training set. This happens k times, each time changing which fold is used for validation.
Therefore the model must be trained k times, but luckily logistic regression is quick
to optimize. Using k=5 on the 192 remaining data points gives validation sets of 38 or 39
points and training sets of about 154. The validation set size is very close to 40, the number
of data points in the smallest category, so five folds allow for the biggest possible training
set while the validation set stays at a reasonable size. K-fold's shuffle was on and the random seed was 3.
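A quick sanity check of the fold sizes, using the same k, shuffle setting and seed as in Appendix A (the zero array only stands in for the real feature matrix):

import numpy as np
from sklearn.model_selection import KFold

# train/validation sizes produced by 5-fold CV on the 192 remaining points
kfold = KFold(n_splits=5, shuffle=True, random_state=3)
X_dummy = np.zeros((192, 2))              # placeholder for the real features
for train_idx, val_idx in kfold.split(X_dummy):
    print(len(train_idx), len(val_idx))   # prints 153 39 twice, then 154 38 three times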
4 Results
Since this is a classification problem, accuracy scores are used instead of training and
validation errors. The linear logistic regression model yielded an average validation accu-
racy of 93.2% and the polynomial logistic regressor got 99.5%. The polynomial model
gets a better fit: it has curves, unlike the other model, and therefore follows the curved
class boundaries in figure 2 better. Because of this, the polynomial logistic regression will be the final
chosen method. When the model was trained with all 192 training data points and tested
with the remaining 48, the accuracy score (reported in place of a test error) was
100%. This means that the model classified all the test stars correctly!
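To see the per-class behaviour in addition to the overall accuracy, a confusion matrix could also be printed; a minimal sketch assuming the test predictions y_pred_test from Appendix A:

from sklearn.metrics import confusion_matrix

# rows = true classes (0 dwarf, 1 main sequence, 2 giant), columns = predicted classes
print(confusion_matrix(y_test, y_pred_test))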
5 Conclusion
This report applied two logistic regressors to classify stars into giants, dwarfs and main
sequence stars based on their temperature and brightness. The models and features were
chosen based on the Hertzsprung–Russell diagram, depicted in figure 2. The linear hy-
pothesis space worked for most of the data points, but the clusters' edges were sometimes
misclassified.
The polynomial model worked better and was chosen as the final model. It achieved
100% accuracy in the final test. This is very satisfactory; it is as good as it can get
with a dataset of this size. The accuracy score might be misleading due to the small size
of the dataset. However, it is unlikely that the good score is due to overfitting, because of
the small hypothesis space. It is not due to a lucky split either: the model scores 100% on other
splits too. The data is just very regular; it does obey the laws of the universe. More
data would be necessary to further test the model.
The next step, besides getting more data, would be to further classify the stars, differen-
tiating dwarf stars into red, brown and white dwarfs, and separating giants into hyper-
and supergiants.
6 References
[1] James Webb Space Telescope alignment announcement. [Link]
[2] Star type classification dataset. Kaggle. [Link]
[3] Star dataset. Kaggle. [Link] (the dataset used in this project)
7 Appendix A: source code
[1]: # imports
%config Completer.use_jedi = False # enable code auto-completion
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # data visualization library
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error # evaluation metrics
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures # generates polynomial and interaction features
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
[2]: # data is from [Link]
# data prepping
data = pd.read_csv('./[Link]')
# remove irrelevant columns
data.drop(columns=['Luminosity(L/Lo)', 'Radius(R/Ro)', 'Star color', 'Spectral Class'], inplace=True)
# rename columns
data.columns = ['Temperature', 'Magnitude', 'Class']
# star types in the original data:
# Brown Dwarf -> Star Type = 0
# Red Dwarf -> Star Type = 1
# White Dwarf -> Star Type = 2
# Main Sequence -> Star Type = 3
# Supergiant -> Star Type = 4
# Hypergiant -> Star Type = 5
# combine labels:
# Red Dwarf, Brown Dwarf and White Dwarf -> Dwarf, 0
# Main Sequence -> 1
# SuperGiants and HyperGiants -> Giants, 2
y = []
for label in data['Class']:
    if label == 4 or label == 5:
        y.append(2)
    elif label == 3:
        y.append(1)
    else:
        y.append(0)
y = np.array(y)
# features X are now the magnitude and temperature columns
data.drop(columns=['Class'], inplace=True)
X = data.to_numpy().reshape(240, 2)
# scaling/standardizing features
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
# take out test data
X, X_test, y, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
[3]: # define k-fold
k = 5
shuffle = True
seed = 3
kfold = KFold(n_splits=k, shuffle=shuffle, random_state=seed)

tr_errors1 = [] # k different errors for model 1
val_errors1 = []
average_val_accuracy1 = 0
tr_errors2 = [] # k different errors for model 2
val_errors2 = []
average_val_accuracy2 = 0

# go through all folds
for (train_index, val_index) in kfold.split(X):
    # split into training and validation sets
    X_train, y_train, X_val, y_val = X[train_index], y[train_index], X[val_index], y[val_index]

    # model 1: linear logistic regression
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    y_pred_train = clf.predict(X_train)
    accuracy_train = accuracy_score(y_train, y_pred_train)
    # validation
    y_pred_val = clf.predict(X_val)
    accuracy_val = accuracy_score(y_val, y_pred_val)
    average_val_accuracy1 += accuracy_val
    # calculate the errors
    tr_error = mean_squared_error(y_train, y_pred_train)
    val_error = mean_squared_error(y_val, y_pred_val)
    # save errors
    tr_errors1.append(tr_error)
    val_errors1.append(val_error)

    # model 2: polynomial logistic regression, reuse the same variable names but recompute them
    clf = LogisticRegression()
    poly = PolynomialFeatures(3)
    X_train_poly = poly.fit_transform(X_train)
    scaler = preprocessing.StandardScaler().fit(X_train_poly)
    X_train_poly = scaler.transform(X_train_poly) # scale the transformed features
    clf.fit(X_train_poly, y_train)
    y_pred_train = clf.predict(X_train_poly)
    accuracy_train = accuracy_score(y_train, y_pred_train)
    # validation (the polynomial transform and scaler are refit on the validation fold here)
    X_val_poly = poly.fit_transform(X_val)
    scaler = preprocessing.StandardScaler().fit(X_val_poly)
    X_val_poly = scaler.transform(X_val_poly)
    y_pred_val = clf.predict(X_val_poly)
    accuracy_val = accuracy_score(y_val, y_pred_val)
    average_val_accuracy2 += accuracy_val
    # calculate errors
    tr_error = mean_squared_error(y_train, y_pred_train)
    val_error = mean_squared_error(y_val, y_pred_val)
    # save errors
    tr_errors2.append(tr_error)
    val_errors2.append(val_error)

# scores
average_train_error1 = np.mean(tr_errors1)
average_val_error1 = np.mean(val_errors1)
print("Model 1:")
print("Average training error : ", average_train_error1)
print("Average validation error : ", average_val_error1)
print("Average validation accuracy : ", average_val_accuracy1/5)

average_train_error2 = np.mean(tr_errors2)
average_val_error2 = np.mean(val_errors2)
print("Model 2:")
print("Average training error : ", average_train_error2)
print("Average validation error : ", average_val_error2)
print("Average validation accuracy : ", average_val_accuracy2/5)
Model 1:
Average training error : 0.04040404040404041
Average validation error : 0.06747638326585695
Average validation accuracy : 0.9325236167341432
Model 2:
Average training error : 0.0
Average validation error : 0.005128205128205128
Average validation accuracy : 0.9948717948717949
[4]: clf = LogisticRegression()
poly = PolynomialFeatures(3)
X_poly = poly.fit_transform(X)
scaler = preprocessing.StandardScaler().fit(X_poly)
X_poly = scaler.transform(X_poly) # scale the transformed features
clf.fit(X_poly, y) # train the model with X and y

# transform X_test into polynomials, scale and make a prediction
X_test_poly = poly.transform(X_test)
X_test_poly = scaler.transform(X_test_poly)
y_pred_test = clf.predict(X_test_poly) # predict!
print(accuracy_score(y_test, y_pred_test))
1.0