Aalto University
CS-C3240 - Machine Learning
Stellar classification
with logistic regression
ML project stage 2
Figure 1: Stars in the Jewel Box cluster by ESO VLT
Contents
1 Introduction
2 Problem formulation
3 Methods
3.1 Preparing data
3.2 Model selection
3.3 Model validation
4 Results
5 Conclusion
6 References
7 Appendix A: source code
1 Introduction
There is an astronomical amount of astronomical data. With the new James Webb
telescope in orbit and aligned [1], we will discover even more of the universe. Automating
data analysis will make astronomers' lives easier. This report tackles one such astronomy
problem: classifying stars based on their temperature and brightness. First the problem is formulated as
a machine learning problem, then the methods, such as data preparation and the logistic
regression models, are discussed. Finally the results are analyzed and a conclusion is
drawn; the polynomial logistic regressor classifies the stars with stellar accuracy.
2 Problem formulation
The data points represent stars, like the ones in figure 1. There are many star
types, but for simplicity, let's focus on dwarf stars, main sequence stars and giant stars.
These three star types will be represented with the corresponding category numbers 0, 1, and
2. These are the labels I'll try to predict.
To classify a star, we need measurements. These measurements could be the size of the
star, its colour, surface temperature or absolute magnitude, i.e. the star's brightness.
Absolute magnitude and temperature will be the features since, as will be seen later, they
are the most relevant ones for classification. Both features have numerical values.
The features and the more specific star type labels can be found in two datasets from Kaggle
[2, 3].
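To make the setup concrete, a single data point can be written as a feature vector paired with a label. The values below are only an illustrative sketch (roughly Sun-like numbers), not taken from the dataset:

# one illustrative data point: features and label
# features = [absolute magnitude (Mv), surface temperature (K)]
# label: 0 = dwarf, 1 = main sequence, 2 = giant
x = [4.83, 5778.0]  # roughly Sun-like values, for illustration only
y = 1               # the Sun is a main sequence star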
3 Methods
Figure 2: A Hertzsprung–Russell diagram [3] explains stellar classification based on tem-
perature and absolute magnitude. The main sequence forms a continuous curve across
the diagram, with dwarf stars below it and giants above it.
3.1 Preparing data
The dataset is from Kaggle [3]. There are 240 data points in total, i.e. 240 stars with their
measurements and classifications, and seven values per star. Out of the seven values,
the 'Star type' column will be transformed into the labels as described below, and the 'Absolute
magnitude (Mv)' and 'Temperature (K)' columns will become the features. Values that were
missing from the original measurements were calculated by the creator of the dataset from
the other known values using equations from astrophysics, so the dataset has no missing values.
In the original data, the star types are red, brown and white dwarfs, main sequence
stars, supergiants and hypergiants. These were combined into three general labels:
0) dwarfs, 1) main sequence stars and 2) giants. Unnecessary columns were removed,
leaving just the label column 'Class' and the feature columns 'Magnitude' and 'Temperature'.
These features were chosen based on their appearance in the Hertzsprung–Russell dia-
gram, depicted in figure 2. The diagram plots stars based on these features and spectral
class, and clusters the stars into their classes. Therefore these features are relevant to
classification. Spectral class is calculated from temperature, so it was left out.
This didn't affect the number of data points, which remained at 240, split as follows:
Star type             Count
Dwarf stars           80
Main sequence stars   120
Giant stars           40
Features were then standardized with Scikit-learn's preprocessing scaler (StandardScaler). This was
necessary for the chosen model, logistic regression. Finally, 20% of the data points were set aside
as test data, leaving 192 data points to work with.
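As a sketch of the label combination step (Appendix A does this with an explicit loop; the mapping below is an equivalent shortcut), the six original type codes can be mapped to the three labels directly with pandas:

import pandas as pd

# original coding: 0-2 = brown/red/white dwarf, 3 = main sequence, 4-5 = super/hyper giant
# combined coding: 0 = dwarf, 1 = main sequence, 2 = giant
star_type = pd.Series([0, 3, 5])                        # a toy example
label = star_type.map({0: 0, 1: 0, 2: 0, 3: 1, 4: 2, 5: 2})
print(label.tolist())                                   # prints [0, 1, 2]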
3.2 Model selection
The first model of choice for classifying stars was logistic regression. It uses a linear
hypothesis space and a logistic loss function.
As can be seen in figure 2, the different classes can be separated quite well with a straight
line, so a linear map is expected to get quite good results. Linear methods are also simpler to
code than e.g. polynomial ones, so it was a good choice for my first machine learning project.
Linear classification works by drawing a line (or, in higher dimensions, a hyperplane) between
two classes.
Logistic loss is a smooth, convex function, so it is quick to optimize. This is important for
the validation method used, k-fold cross validation. Logistic loss is also less sensitive to
outliers than squared error loss.
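As background, for a linear hypothesis h(x) = w^T x and binary labels y in {-1, 1}, the logistic loss can be written as

\[
L\big((\mathbf{x}, y), h\big) = \log\!\left(1 + \exp\!\big(-y\, \mathbf{w}^{T}\mathbf{x}\big)\right),
\]

and Scikit-learn's LogisticRegression minimizes a (regularized) multiclass generalization of this loss over the training set.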
The second model is almost the same, but the hypothesis space consists of polynomials of degree 3
instead of linear maps. Let's call it polynomial logistic regression. It uses logistic loss as well.
The main sequence curve in the Hertzsprung–Russell diagram (figure 2) has the shape of an
S-curve. A polynomial of second degree makes a valley or a hill; a third-degree polynomial has both.
An S-curve is composed of a valley and a hill, and therefore degree three is appropriate.
A larger degree could result in overfitting.
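Concretely, the polynomial model expands the two features into all monomials up to degree 3 and feeds them to an ordinary logistic regression. A minimal sketch of the feature expansion, with made-up values:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# two raw features (temperature, magnitude) expanded to all terms up to degree 3
poly = PolynomialFeatures(3)
X_demo = np.array([[0.5, -1.2]])          # one made-up standardized star
X_demo_poly = poly.fit_transform(X_demo)
print(X_demo_poly.shape)                  # (1, 10): 1, x1, x2, x1^2, x1*x2, ..., x2^3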
3.3 Model validation
Because the dataset is small, K-fold cross validation will be used for both models. K-fold
prevents a misleading result from an unlucky split, which is likely with such a small dataset. K-
fold splits the data into k folds, and uses one fold as the validation set and the rest as the
training set. This happens k times, each time changing which fold is used for validation.
Therefore the model must be trained k times, but luckily logistic regression is quick
to optimize. Using k=5 on the 192 remaining data points gives validation sets of 38 or 39
points and training sets of about 154. The validation set size is very close to 40, the number
of data points in the smallest category, so five folds allow for the biggest possible training
set while the validation set stays at a reasonable size. K-fold's shuffle was on and the random seed was 3.
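A quick sanity check of the fold sizes, using the same k, shuffle setting and seed as in Appendix A (the zero array only stands in for the real feature matrix):

import numpy as np
from sklearn.model_selection import KFold

# train/validation sizes produced by 5-fold CV on the 192 remaining points
kfold = KFold(n_splits=5, shuffle=True, random_state=3)
X_dummy = np.zeros((192, 2))              # placeholder for the real features
for train_idx, val_idx in kfold.split(X_dummy):
    print(len(train_idx), len(val_idx))   # prints 153 39 twice, then 154 38 three times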
4 Results
Since this is a classification problem, accuracy scores are used instead of training and
validation errors. The linear logistic regression model yielded an average validation accu-
racy of 93.2% and the polynomial logistic regressor got 99.5%. The polynomial model
gets a better fit: it has curves, unlike the other model, and therefore follows the curved
class boundaries in figure 2 better. Because of this, the polynomial logistic regression will be the final
chosen method. When the model was trained with all 192 training data points and tested
with the remaining 48, the accuracy score (reported in place of a test error) was
100%. This means that the model classified all the test stars correctly!
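To see the per-class behaviour in addition to the overall accuracy, a confusion matrix could also be printed; a minimal sketch assuming the test predictions y_pred_test from Appendix A:

from sklearn.metrics import confusion_matrix

# rows = true classes (0 dwarf, 1 main sequence, 2 giant), columns = predicted classes
print(confusion_matrix(y_test, y_pred_test))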
5 Conclusion
This report applied two logistic regressors to classify stars into giants, dwarfs and main
sequence stars based on their temperature and brightness. The models and features were
chosen based on the Hertzsprung–Russell diagram, depicted in figure 2. The linear hy-
pothesis space worked for most of the data points, but the clusters' edges were sometimes
misclassified.
The polynomial model worked better and was chosen as the final model. It achieved
100% accuracy in the final test. This is very satisfactory; it is as good as it can get
with a dataset of this size. The accuracy score might be misleading due to the small size
of the dataset. However, it is unlikely that the good score is due to overfitting, because of
the small hypothesis space. It is not due to a lucky split either: the model scores 100% on other
splits too. The data is just very regular; it does obey the laws of the universe. More
data would be necessary to further test the model.
The next step, besides getting more data, would be to further classify the stars, differen-
tiating dwarf stars into red, brown and white dwarfs, and separating giants into hyper-
and supergiants.
6 References
[1] James Webb Space Telescope alignment announcement. [Link]
[2] Star type classification dataset. Kaggle. [Link]
[3] Star dataset. Kaggle. [Link] (the dataset used in this project)
7 Appendix A: source code
[1]: # imports
%config Completer.use_jedi = False # enable code auto-completion
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # data visualization library
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error # evaluation metrics
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures # generates polynomial and interaction features
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
[2]: # data is from [Link]
# data prepping
data = pd.read_csv('./[Link]')
# remove irrelevant columns
data.drop(columns=['Luminosity(L/Lo)', 'Radius(R/Ro)', 'Star color', 'Spectral Class'], inplace=True)
# rename columns
data.columns = ['Temperature', 'Magnitude', 'Class']
# star types in the original data:
# Brown Dwarf -> Star Type = 0
# Red Dwarf -> Star Type = 1
# White Dwarf -> Star Type = 2
# Main Sequence -> Star Type = 3
# Supergiant -> Star Type = 4
# Hypergiant -> Star Type = 5
# combine labels:
# Red Dwarf, Brown Dwarf and White Dwarf -> Dwarf, 0
# Main Sequence -> 1
# SuperGiants and HyperGiants -> Giants, 2
y = []
for label in data['Class']:
    if label == 4 or label == 5:
        y.append(2)
    elif label == 3:
        y.append(1)
    else:
        y.append(0)
y = np.array(y)
# features X are now the magnitude and temperature columns
data.drop(columns=['Class'], inplace=True)
X = data.to_numpy().reshape(240, 2)
# scaling/standardizing features
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
# take out test data
X, X_test, y, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
[3]: # define k-fold
k = 5
shuffle = True
seed = 3
kfold = KFold(n_splits=k, shuffle=shuffle, random_state=seed)

tr_errors1 = [] # k different errors for model 1
val_errors1 = []
average_val_accuracy1 = 0
tr_errors2 = [] # k different errors for model 2
val_errors2 = []
average_val_accuracy2 = 0

# go through all folds
for (train_index, val_index) in kfold.split(X):
    # split into training and validation sets
    X_train, y_train, X_val, y_val = X[train_index], y[train_index], X[val_index], y[val_index]

    # model 1: linear logistic regression
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    y_pred_train = clf.predict(X_train)
    accuracy_train = accuracy_score(y_train, y_pred_train)
    # validation
    y_pred_val = clf.predict(X_val)
    accuracy_val = accuracy_score(y_val, y_pred_val)
    average_val_accuracy1 += accuracy_val
    # calculate the errors
    tr_error = mean_squared_error(y_train, y_pred_train)
    val_error = mean_squared_error(y_val, y_pred_val)
    # save errors
    tr_errors1.append(tr_error)
    val_errors1.append(val_error)

    # model 2: polynomial logistic regression, reuse the same variable names but recompute them
    clf = LogisticRegression()
    poly = PolynomialFeatures(3)
    X_train_poly = poly.fit_transform(X_train)
    scaler = preprocessing.StandardScaler().fit(X_train_poly)
    X_train_poly = scaler.transform(X_train_poly) # scale the transformed features
    clf.fit(X_train_poly, y_train)
    y_pred_train = clf.predict(X_train_poly)
    accuracy_train = accuracy_score(y_train, y_pred_train)
    # validation (the polynomial transform and scaler are refit on the validation fold here)
    X_val_poly = poly.fit_transform(X_val)
    scaler = preprocessing.StandardScaler().fit(X_val_poly)
    X_val_poly = scaler.transform(X_val_poly)
    y_pred_val = clf.predict(X_val_poly)
    accuracy_val = accuracy_score(y_val, y_pred_val)
    average_val_accuracy2 += accuracy_val
    # calculate errors
    tr_error = mean_squared_error(y_train, y_pred_train)
    val_error = mean_squared_error(y_val, y_pred_val)
    # save errors
    tr_errors2.append(tr_error)
    val_errors2.append(val_error)

# scores
average_train_error1 = np.mean(tr_errors1)
average_val_error1 = np.mean(val_errors1)
print("Model 1:")
print("Average training error : ", average_train_error1)
print("Average validation error : ", average_val_error1)
print("Average validation accuracy : ", average_val_accuracy1/5)

average_train_error2 = np.mean(tr_errors2)
average_val_error2 = np.mean(val_errors2)
print("Model 2:")
print("Average training error : ", average_train_error2)
print("Average validation error : ", average_val_error2)
print("Average validation accuracy : ", average_val_accuracy2/5)
Model 1:
Average training error : 0.04040404040404041
Average validation error : 0.06747638326585695
Average validation accuracy : 0.9325236167341432
Model 2:
Average training error : 0.0
Average validation error : 0.005128205128205128
Average validation accuracy : 0.9948717948717949
[4]: clf = LogisticRegression()
poly = PolynomialFeatures(3)
X_poly = poly.fit_transform(X)
scaler = preprocessing.StandardScaler().fit(X_poly)
X_poly = scaler.transform(X_poly) # scale the transformed features
clf.fit(X_poly, y) # train the model with X and y

# transform X_test into polynomials, scale and make a prediction
X_test_poly = poly.transform(X_test)
X_test_poly = scaler.transform(X_test_poly)
y_pred_test = clf.predict(X_test_poly) # predict!
print(accuracy_score(y_test, y_pred_test))
1.0