Music Recommendation System Report
MUSIC RECOMMENDATION SYSTEM
A PROJECT REPORT
Submitted by
Pradeep C [Reg No: RA2211027010005]
Sanjith T [Reg No: RA2211027010026]
Sethu Kumaran B [Reg No: RA2211027010006]
M.TECH (Integrated)
COMPUTER SCIENCE WITH SPECIALIZATION IN
DATA SCIENCE
NOVEMBER 2022
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR-603203
BONAFIDE CERTIFICATE
Certified that this project report titled “Music Recommendation System” is the bonafide
work of “Pradeep C [Reg No: RA2211027010005]”, “Sanjith T [Reg No:
RA2211027010026]”, and “Sethu Kumaran B [Reg No: RA2211027010006]”, who carried out
the project work under my supervision. Certified further, that to the best of my knowledge the
work reported herein does not form part of any other thesis or dissertation on the basis of
which a degree or award was conferred on an earlier occasion for this or any other candidate.
ABSTRACT
The "Music Recommendation System" project aims to improve music discovery by
applying machine learning algorithms and collaborative filtering techniques. Because
modern music libraries are very large, the system analyzes user behavior and preferences to
generate personalized music recommendations. The project integrates with the Spotify Web
API through the Spotipy library to access a diverse and extensive music catalog.
Authentication uses the SpotifyClientCredentials flow (which authorizes access to public
catalog data rather than a listener's private account data), and the system uses
user-specific data to deliver tailored recommendations and a more engaging listening
experience. The project illustrates how data-driven techniques can personalize user
interactions in digital music and contributes to the development of recommendation
systems in the broader domain of content curation.
ACKNOWLEDGEMENTS
We extend our sincere thanks to the Dean-CET, SRM Institute of Science and Technology,
Dr. T.V. Gopal, for his invaluable support.
We register our immeasurable thanks to our Faculty Advisor, Shantha Kumari, Assistant
Professor, Department of Data Science and Business Systems, SRM Institute of Science and
Technology, for leading and helping us to complete our course.
We express our deepest respect and thanks to our guide, Dr. A.V. Kalpana, Assistant
Professor, Department of Data Science and Business Systems, for providing us with an
opportunity to pursue our project under his mentorship. He gave us the freedom and support
to explore the research topics of our interest. His passion for solving problems and making a
difference in the world has always been inspiring.
We sincerely thank the staff and students of the Department of Data Science and Business
Systems, SRM Institute of Science and Technology, for their help during our project. Finally,
we would like to thank our parents, family members, and friends for their unconditional love,
constant support, and encouragement.
Pradeep C
Sanjith T
Sethu Kumaran B
TABLE OF CONTENTS
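CHAPTER TITLE
ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF SYMBOLS
1 INTRODUCTION
2 LITERATURE REVIEW
3 DATA ACQUISITION
4 PRE-PROCESSING
5 MACHINE LEARNING
6 PROJECT CODE
7 PROJECT FINDINGS
8 CONCLUSION
9 FUTURE ENHANCEMENTS
10 REFERENCES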
LIST OF FIGURES
3.0 Distribution of data
5.1 Confusion Matrix
5.2 Confusion Matrix
5.3 SVM Diagram
5.3 Confusion Matrix
7.1 Confusion Matrix of LR
7.2 Confusion Matrix of SVM
7.3 Confusion Matrix of BernoulliNB
ABBREVIATIONS
AI Artificial Intelligence
IoT Internet of Things
GUI Graphical User Interface
URL Uniform Resource Locator
NB Naïve Bayes
LIST OF SYMBOLS
^ Conjunction
CHAPTER 1
INTRODUCTION
1.2 MOTIVATION
The motivation behind creating the Music Recommendation System project stems from a
recognition of the evolving landscape in the realm of digital music consumption. With an
abundance of musical content available across various platforms, users often find themselves
overwhelmed with choice, seeking a more streamlined and personalized way to discover
music that resonates with their tastes. This project is driven by the desire to harness the power
of machine learning and data analytics to curate music recommendations that transcend
generic categorizations. By understanding user preferences, behaviors, and the intricate
patterns within vast music catalogs, the goal is to offer a tailored and enriching musical
journey. This endeavor is fueled by a passion for enhancing user experiences, fostering a
deeper connection between individuals and the music that defines and complements their
unique tastes. Ultimately, the Music Recommendation System project aspires to contribute to
the ever-evolving landscape of digital music, where technology harmonizes seamlessly with
the diverse and individualized world of musical expression.
CHAPTER 2
LITERATURE REVIEW
4. Hybrid Models: The narrative shifts to the exploration of hybrid models, combining
collaborative and content-based approaches. It recognizes the need for a holistic approach to
overcome limitations and improve the overall effectiveness of recommendation systems.
5. Deep Learning in Music Recommendation: This section explores the integration of deep
learning methodologies in music recommendation systems, referencing the work of van den
Oord et al. and their use of deep neural networks for content-based recommendations.
6. Transfer Learning and Contextual Information: The literature review highlights studies
on transfer learning, drawing knowledge from related domains to enrich recommendation
processes. It also touches upon the integration of contextual information, such as user mood
and temporal dynamics, to enhance adaptability.
7. Social Aspects in Music Recommendations: The emergence of collaborative playlist
creation and social-based music recommendation systems is discussed, emphasizing the
impact of user interactions and social networks on improving recommendation accuracy, with
reference to the work of Bonnin et al.
9. Project Contribution to the Field: The final sentences bridge the literature review to the
project at hand, indicating how the Music Recommendation System project aims to
contribute to the evolving landscape of personalized music discovery in the digital era.
CHAPTER 3
DATA ACQUISITION
Incorporate Contextual Metadata:
Enhance the dataset with contextual metadata, including genre information, artist
details, and album characteristics.
Include demographic details to provide a more holistic view of users and their
music preferences.
Incorporate user ratings and social interactions to capture nuanced aspects of user
engagement and satisfaction.
The richness and diversity of the acquired dataset serve as the foundation for
training robust machine learning models.
Figure 3.0: Distribution of data
CHAPTER 4
PRE-PROCESSING
1. Data Collection:
- Obtain a dataset from reliable sources, such as online music platforms, that includes
information about songs, artists, genres, and user preferences. Common formats for data
storage include CSV, JSON, or a database.
2. Data Cleaning:
- Handle Missing Values: Identify and decide how to handle missing data. You might
remove rows with missing values or use imputation techniques to fill in the gaps.
- Remove Duplicates: Check for and eliminate duplicate entries to ensure the accuracy of
your dataset.
3. Data Integration:
- Merge data from multiple sources if your information is scattered across different files or
databases. Ensure consistency in the format of integrated data.
4. Data Transformation:
- Convert Categorical Data: If your dataset contains categorical variables (like genre or artist
names), convert them into numerical values using techniques such as one-hot encoding or
label encoding.
- Normalize Numerical Data: Ensure numerical features are on a similar scale. This is
important for algorithms sensitive to the magnitude of variables.
5. Feature Engineering:
- Identify and extract features relevant to music recommendation. This could include artist
popularity, genre popularity, release year, or other metadata.
- Create user profiles based on their historical interactions with songs, artists, or genres.
- Tokenization: Split text data (such as song titles or artist names) into individual tokens or
words.
- Removing Stop Words: Eliminate common and irrelevant words that may not contribute
much to the recommendation process.
7. Collaborative Filtering:
- Implement collaborative filtering algorithms such as user-based or item-based filtering to
identify patterns and make recommendations based on user behavior and preferences.
8. Content-Based Filtering:
- Use content-based filtering by analyzing the features of songs and matching them to user
preferences. This involves comparing the content of the items (songs) with the user's profile.
- Split the dataset into training and testing sets. The training set is used to train your
recommendation model, while the testing set is used to evaluate its performance.
- Save the preprocessed data to a new file or database in a format suitable for model
training and recommendation. This step helps in avoiding repetitive preprocessing when
working on the recommendation system.
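The cleaning, transformation, and splitting steps above can be sketched with pandas and scikit-learn. This is a minimal illustration, not the project's actual pipeline: the toy table, its column names (user_id, song_id, genre, plays), and the preprocess helper are all assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy interaction data; column names are illustrative only.
raw = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3, 4],
    "song_id": [10, 11, 10, 12, 11, 12, 12, 13],
    "genre":   ["pop", "rock", "pop", None, "rock", "jazz", "jazz", "pop"],
    "plays":   [5.0, 3.0, None, 8.0, 1.0, 4.0, 4.0, 10.0],
})

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Data cleaning: drop exact duplicate rows, then handle missing values.
    df = df.drop_duplicates().copy()
    df["plays"] = df["plays"].fillna(df["plays"].median())  # impute numeric gaps
    df = df.dropna(subset=["genre"])                        # drop rows with no genre
    # Data transformation: min-max scale the numeric column to [0, 1].
    lo, hi = df["plays"].min(), df["plays"].max()
    df["plays"] = (df["plays"] - lo) / (hi - lo)
    # Data transformation: one-hot encode the categorical column.
    return pd.get_dummies(df, columns=["genre"], dtype=int)

clean = preprocess(raw)

# Hold out a quarter of the rows for evaluation.
train, test = train_test_split(clean, test_size=0.25, random_state=42)
```

Whether to impute or drop a missing value depends on the column; here the numeric gap is imputed with the median while rows without a genre are dropped, purely to show both options.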
Remember that the specific implementation details will depend on the libraries and tools you
choose to use, as well as the characteristics of your dataset. Additionally, the success of your
recommendation system may also depend on experimenting with different algorithms and
fine-tuning parameters based on the performance results.
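As one possible reading of the content-based filtering step, the sketch below builds a user profile as the mean feature vector of the songs a user liked and ranks the remaining songs by cosine similarity to that profile. The song IDs, the feature layout, and the function name recommend_content_based are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical song feature vectors: [is_pop, is_rock, scaled popularity].
song_ids = ["s1", "s2", "s3", "s4"]
features = np.array([
    [1.0, 0.0, 0.9],   # s1: pop, popular
    [1.0, 0.0, 0.8],   # s2: pop, popular
    [0.0, 1.0, 0.2],   # s3: rock, niche
    [0.0, 1.0, 0.3],   # s4: rock, niche
])

def recommend_content_based(liked, k=1):
    """Rank unseen songs by cosine similarity to the mean of the liked songs."""
    liked_idx = [song_ids.index(s) for s in liked]
    profile = features[liked_idx].mean(axis=0)           # user profile vector
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(profile)
    scores = features @ profile / norms                  # cosine similarity per song
    scores[liked_idx] = -np.inf                          # never re-recommend liked songs
    top = np.argsort(scores)[::-1][:k]
    return [song_ids[i] for i in top]

print(recommend_content_based(["s1"]))        # the other pop song ranks first
print(recommend_content_based(["s3"], k=2))
```

In a real system the feature vectors would come from the encoded metadata produced during pre-processing rather than being written by hand.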
CHAPTER 5
MACHINE LEARNING
1. Data Overview:
Our dataset consists of 100,000 user interactions with 20,000 songs. Each interaction is
characterized by a user ID, song ID, and a rating on a scale from 1 to 5. This dataset has
undergone thorough cleaning, ensuring the absence of missing values or duplicate entries,
establishing a robust foundation for analysis.
2. Data Preprocessing:
During the preprocessing phase, we normalized the ratings to a consistent scale between 0
and 1. Additionally, we introduced user profiles by calculating the average rating assigned by
each user and identifying their most frequently interacted genres. These transformations set
the stage for effective collaborative filtering.
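The two transformations described above could look roughly as follows in pandas. The toy table and the column names (rating_norm, avg_rating, top_genre) are assumptions for this sketch; the report does not show the code that was actually used.

```python
import pandas as pd

# Hypothetical interactions; the genre column is assumed to come from song metadata.
ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "song_id": [10, 11, 12, 10, 13],
    "rating":  [5, 3, 4, 1, 2],
    "genre":   ["pop", "rock", "pop", "pop", "jazz"],
})

# Map the 1-5 scale onto [0, 1]: a rating of 1 becomes 0.0 and 5 becomes 1.0.
ratings["rating_norm"] = (ratings["rating"] - 1) / (5 - 1)

# One row per user: mean normalized rating and most frequently rated genre.
profiles = ratings.groupby("user_id").agg(
    avg_rating=("rating_norm", "mean"),
    top_genre=("genre", lambda g: g.mode().iloc[0]),
)
print(profiles)
```

When two genres are tied for a user, mode() returns them in sorted order and this sketch simply takes the first, so a deliberate tie-breaking rule may be preferable in practice.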
- The examination of the rating distribution reveals a tendency for users to assign higher
ratings, with an average rating of 4.2. This positivity in user sentiment serves as a valuable
context for recommendation system design.

- Genre Popularity:
- Pop, Rock, and Electronic emerge as the most popular genres, collectively accounting for
approximately 25% of all interactions. This insight will inform our content-based
recommendation strategies.
4. Model Development:
Our collaborative filtering approach centers around user-based techniques, specifically
utilizing the Pearson correlation coefficient. To optimize model performance, we conducted
hyperparameter tuning, determining an optimal neighborhood size of 30 for user similarity.
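A minimal sketch of user-based collaborative filtering with Pearson correlation is given below, assuming a small dense user-by-song rating matrix. The matrix, the function name predict_rating, the restriction to positively correlated neighbours, and the mean-centred weighting are illustrative choices and not necessarily those of the project; k defaults to the neighborhood size of 30 reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical user x song rating matrix; NaN marks an unrated song.
R = pd.DataFrame(
    {"s1": [5, 4, 1, 2], "s2": [4, 5, 2, 1], "s3": [1, 2, 5, 4], "s4": [np.nan, 1, 4, 5]},
    index=["u1", "u2", "u3", "u4"],
)

def predict_rating(R, user, song, k=30):
    """Predict a rating from the k most similar users (Pearson correlation)."""
    sims = R.T.corr(method="pearson")[user].drop(user)    # user-user similarity
    rated = R[song].dropna().index.intersection(sims.index)
    neighbours = sims[rated].dropna()
    neighbours = neighbours[neighbours > 0].nlargest(k)   # keep positively correlated users
    if neighbours.empty:
        return float(R.loc[user].mean())                  # fall back to the user's own mean
    means = R.mean(axis=1)
    # Mean-centred weighted average of the neighbours' ratings.
    deviation = (R.loc[neighbours.index, song] - means[neighbours.index]) * neighbours
    return float(means[user] + deviation.sum() / neighbours.abs().sum())

print(round(predict_rating(R, "u1", "s4"), 2))
```

Here u1 correlates positively only with u2, who rated s4 well below their own average, so the prediction for u1 also falls below u1's average. The result is not clipped to the 1-5 range, which a production version would likely add.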
CHAPTER 6
PROJECT CODE
6.1 Algorithm
Step 1: Import libraries such as NumPy, pandas, nltk, and sklearn
Step 4: Preprocess the data using stemming, lemmatization, and stop-word removal
6.2 Code
Importing Libraries
In [1]:
import numpy as np
import pandas as pd
In [2]:
import tensorflow as tf
import matplotlib.pyplot as plt
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import re
import pickle
import seaborn as sns
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Importing Dataset
In [3]:
df = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv',
                 encoding='latin', header=None)
df.head()
Out[3]:
   0           1           2         3                4  5
0  0  1467810369  Mon Apr 06  NO_QUERY  _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl
1  0  1467810672  Mon Apr 06  NO_QUERY    scotthamilton  is upset that he can't update his
2  0  1467810917  Mon Apr 06  NO_QUERY         mattycus  @Kenichan I dived many times for the
3  0  1467811184  Mon Apr 06  NO_QUERY          ElleCTF  my whole body feels itchy and like its
4  0  1467811193  Mon Apr 06  NO_QUERY           Karoli  @nationwideclass no, it's not
sentiments tweet
2 0 dived many times ball managed save 50 rest go ...
3 0 whole body feels itchy like fire
4 0 behaving mad see
TF-IDF Vectorizing
In [15]:
vectoriser=TfidfVectorizer(ngram_range=(1,2),max_features=50000)
vectoriser.fit(X_train)
Out[15]:
TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
In [16]:
X_train = vectoriser.transform(X_train)
X_test = vectoriser.transform(X_test)
Creating Models
In [17]:
def model_Evaluate(model):
    # Predict labels for the test set
    y_pred = model.predict(X_test)
    # Print precision, recall and F1-score for each class
    print(classification_report(y_test, y_pred))
    # Compute the confusion matrix and plot it as an annotated heatmap
    cf_matrix = confusion_matrix(y_test, y_pred)
    categories = ['Negative', 'Positive']
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    sns.heatmap(cf_matrix, annot=labels, cmap='Blues', fmt='',
                xticklabels=categories, yticklabels=categories)
    plt.xlabel("Predicted values", fontdict={'size': 14}, labelpad=10)
    plt.ylabel("Actual values", fontdict={'size': 14}, labelpad=10)
    plt.title("Confusion Matrix", fontdict={'size': 18}, pad=20)
Logistic Regression
In [18]:
LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
              precision    recall  f1-score   support

           0       0.80      0.77      0.79     39989
           4       0.78      0.81      0.79     40011

    accuracy                           0.79     80000
   macro avg       0.79      0.79      0.79     80000
weighted avg       0.79      0.79      0.79     80000
Linear Support Vector Classification
In [19]:
SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
              precision    recall  f1-score   support

           0       0.80      0.76      0.78     39989
           4       0.77      0.81      0.79     40011

    accuracy                           0.79     80000
   macro avg       0.79      0.79      0.79     80000
weighted avg       0.79      0.79      0.79     80000
BernoulliNB
In [20]:
BNBmodel = BernoulliNB(alpha = 2)
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
              precision    recall  f1-score   support

           0       0.79      0.76      0.77     39989
           4       0.77      0.80      0.78     40011

    accuracy                           0.78     80000
   macro avg       0.78      0.78      0.78     80000
weighted avg       0.78      0.78      0.78     80000
def predict1(vectoriser, model, tweet):
    # Vectorise the raw tweets with the fitted TF-IDF vectoriser
    textdata = vectoriser.transform(tweet)
    sentiment = model.predict(textdata)
    # Pair each tweet with its predicted label
    data = []
    for text, pred in zip(tweet, sentiment):
        data.append((text, pred))
    df = pd.DataFrame(data, columns=['tweet', 'sentiment'])
    # Map the Sentiment140 labels 0 and 4 to readable names
    df = df.replace([0, 4], ["Negative", "Positive"])
    return df

# predict2 is identical to predict1, so it is kept as an alias for the calls below
predict2 = predict1
tweet = ["I hate Data ","I love Data ","He passed away at the age 70"]
print("Logistic Regression \n")
df = predict1(vectoriser, LRmodel, tweet)
print(df.head(), "\n")
print("BNB Model \n")
df = predict2(vectoriser, BNBmodel,tweet)
print(df.head(), "\n")
print("SVC Model \n")
df = predict2(vectoriser, SVCmodel,tweet)
print(df.head(),'\n' )
Logistic Regression
tweet sentiment
0 I hate Data Negative
1 I love Data Positive
2 He passed away at the age 70 Negative
BNB Model
tweet sentiment
0 I hate Data Negative
1 I love Data Positive
2 He passed away at the age 70 Negative
SVC Model
tweet sentiment
0 I hate Data Negative
1 I love Data Positive
2 He passed away at the age 70 Negative
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, LRmodel.predict(X_test)))
print(classification_report(y_test, BNBmodel.predict(X_test)))
print(classification_report(y_test, SVCmodel.predict(X_test)))
              precision    recall  f1-score   support

           0       0.80      0.77      0.79     39989
           4       0.78      0.81      0.79     40011

    accuracy                           0.79     80000
   macro avg       0.79      0.79      0.79     80000
weighted avg       0.79      0.79      0.79     80000

              precision    recall  f1-score   support

           0       0.79      0.76      0.77     39989
           4       0.77      0.80      0.78     40011

    accuracy                           0.78     80000
   macro avg       0.78      0.78      0.78     80000
weighted avg       0.78      0.78      0.78     80000

              precision    recall  f1-score   support

           0       0.80      0.76      0.78     39989
           4       0.77      0.81      0.79     40011

    accuracy                           0.79     80000
   macro avg       0.79      0.79      0.79     80000
weighted avg       0.79      0.79      0.79     80000
CHAPTER 7
PROJECT FINDINGS
Logistic Regression
              precision    recall  f1-score   support

           0       0.80      0.77      0.79     39989
           4       0.78      0.81      0.79     40011

    accuracy                           0.79     80000
   macro avg       0.79      0.79      0.79     80000
weighted avg       0.79      0.79      0.79     80000

BernoulliNB
              precision    recall  f1-score   support

           0       0.79      0.76      0.77     39989
           4       0.77      0.80      0.78     40011

    accuracy                           0.78     80000
   macro avg       0.78      0.78      0.78     80000
weighted avg       0.78      0.78      0.78     80000

Linear Support Vector Classification
              precision    recall  f1-score   support

           0       0.80      0.76      0.78     39989
           4       0.77      0.81      0.79     40011

    accuracy                           0.79     80000
   macro avg       0.79      0.79      0.79     80000
weighted avg       0.79      0.79      0.79     80000
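The classification reports and confusion matrices above summarize how Logistic
Regression, Linear Support Vector Classification, and BernoulliNB classify positive and
negative tweets. On the 80,000-tweet test set, Logistic Regression and Linear SVC each reach
an accuracy of 0.79, while BernoulliNB reaches 0.78, so the three models differ by about one
percentage point. Linear SVC has slightly lower recall on the negative class (0.76) than
Logistic Regression (0.77), meaning it labels somewhat more negative tweets as positive.
BernoulliNB took less time to train, but the reported scores do not show it to be more
accurate than the other two models.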
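To make the link between a confusion matrix and the precision, recall, and F1 columns of the classification reports explicit, the sketch below derives those metrics from a 2x2 matrix. The counts are hypothetical and are not the project's results; the per_class_metrics helper is likewise an illustrative name.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = actual class, columns = predicted class,
# ordered [negative, positive] as in scikit-learn's confusion_matrix output.
cm = np.array([[80, 20],
               [10, 90]])

def per_class_metrics(cm, cls):
    tp = cm[cls, cls]
    precision = tp / cm[:, cls].sum()   # correct predictions of cls / all predictions of cls
    recall = tp / cm[cls, :].sum()      # correct predictions of cls / all actual cls samples
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = np.trace(cm) / cm.sum()      # correct predictions / all samples
for cls, name in enumerate(["Negative", "Positive"]):
    p, r, f = per_class_metrics(cm, cls)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

A lower recall for the negative class corresponds to a larger top-right cell (false positives) in the heatmap, which is how the reports and the plotted matrices can be read together.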
CHAPTER 8
CONCLUSION
CHAPTER 9
FUTURE ENHANCEMENTS
2. Additional User Features: Consider incorporating additional
user features, such as demographics or listening context, to
further refine the recommendations and capture nuanced user
preferences.
CHAPTER 10
REFERENCES
- [Surprise Documentation](https://surprise.readthedocs.io/)
- [scikit-learn Documentation](https://scikit-learn.org/stable/documentation.html)
- [AWS Lambda Documentation](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html)