M04 Lecture Notes

The document provides an overview of Python libraries, including standard and third-party libraries, and explains the concepts of modules, packages, and classes. It details the data analytics process, covering data collection, preprocessing, analysis, and sharing insights, while also introducing recommendation systems and their filtering techniques. Additionally, it discusses the use of the Pandas library for data manipulation and analysis in Python.

Uploaded by

Berly Brigith
Copyright
© © All Rights Reserved

Libraries in Python

© Faculty of Management
Libraries in Python
A library is a collection of pre-written code that is aimed at providing a set of
capabilities and functionalities for use in other programs.

Types of Python Libraries


• Standard Libraries:
• Included with Python, these libraries help you perform file I/O, simple
persistence, and data serialization
• Third-Party Libraries:
• Additional libraries that can be installed as needed.
• More specialized functionality, like data analysis, machine learning, or
image processing.
Standard Libraries

The Python Standard Library. (n.d.). Python. https://docs.python.org/3/library/index.html
String Methods. (n.d.). Python. https://docs.python.org/3/library/index.html
Third-Party Libraries

TextBlob: Simplified Text Processing (n.d.). https://textblob.readthedocs.io/en/dev/

TextBlob. (n.d.). PyPI. https://pypi.org/project/textblob/0.9.0/


Documentation

Source: scikit-learn developers (BSD License). (2024). CountVectorizer. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer
Library
A library consists of modules and packages designed to offer a range of capabilities
and functionalities for incorporation into other programs.

Library Package Module


Module
A module is a single Python file that contains definitions and statements.
It can include functions, classes, variables, and even runnable code
that is meant to be imported and used in other Python scripts.

sentiment_analyzer.py
lexicon = {"good": 1, "excellent": 1,
           "bad": -1, "terrible": -1, "not": -0.5}

def analyze_sentiment(lexicon, text):
    ………..
    return sentiment_score
Example 1: Module with functions
Module sentiment_analyzer.py
lexicon = {"good": 1, "excellent": 1, "bad": -1, "terrible": -1, "not": -0.5}

def analyze_sentiment(lexicon, text):
    ………..
    return sentiment_score
Import functions in Module
import sentiment_analyzer
# Example usage
text = input("Enter a text to analyze its sentiment: ")
# analyze_sentiment expects the lexicon as its first argument
result = sentiment_analyzer.analyze_sentiment(sentiment_analyzer.lexicon, text)
print(f"The sentiment of the text is: {result}")
Example 2: Module with class
Module
sentiment_analyzer2.py
class SentimentAnalyzer:
    def __init__(self):
        self.lexicon = {"good": 1, "excellent": 1, "bad": -1, "terrible": -1, "not": -0.5}

    def analyze_sentiment(self, text):
        …………….
        # use self.lexicon for analyzing sentiment
        return score
Class
Class
• Blueprint for creating objects
• Defines a set of attributes and methods
that are common to all objects of that
type
Class
Attributes
• Variable associated with a class or object
• Example: lexicon
Method
• Function that is defined inside a class
• Example: analyze_sentiment()
[Diagram: class SentimentAnalyzer with the attribute lexicon and the method analyze_sentiment()]
Object
• An instance of a class
• When a class is defined, no memory is allocated for instance data until an
object is created from the class
• Each object can hold different data, but they share the functionalities
defined in their respective class
[Diagram: class SentimentAnalyzer (lexicon, analyze_sentiment()) with three objects: Object 1 MovieReview, Object 2 RestaurantReview, Object 3 ProductReview]
Defining a Class
sentiment_analyzer2.py
class SentimentAnalyzer:
    def __init__(self, lexicon):
        self.lexicon = lexicon  # This is an attribute

    def analyze_sentiment(self, text):
        sentiment_score = 0
        words = text.split()
        for word in words:
            sentiment_score += self.lexicon.get(word, 0)
        return sentiment_score
Using the Class
# Import the SentimentAnalyzer class from the sentiment_analyzer2 module
from sentiment_analyzer2 import SentimentAnalyzer

# Create an instance of SentimentAnalyzer for restaurant reviews


restaurant_lexicon = { "good": 1, "bad": -1, "delicious": 5, "tasty": 4, "bland": -4 }
restaurant_review_analyzer = SentimentAnalyzer(restaurant_lexicon)

# Example usage
restaurant_review = "The appetizers were bland but the main course was delicious"
restaurant_result = restaurant_review_analyzer.analyze_sentiment(restaurant_review)
print(f"The sentiment score of the restaurant review is: {restaurant_result}")
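To make the point about objects concrete, here is a minimal, self-contained sketch that defines the class and creates two objects from it. The second lexicon (movie_lexicon) is a hypothetical addition that is not in the slides; it is there only to show that each object holds its own data while sharing the same method.

```python
class SentimentAnalyzer:
    """Lexicon-based scorer mirroring the class defined above."""

    def __init__(self, lexicon):
        self.lexicon = lexicon  # attribute: maps a word to its score

    def analyze_sentiment(self, text):
        # Sum the score of every known word; unknown words count as 0
        return sum(self.lexicon.get(word, 0) for word in text.split())


restaurant_lexicon = {"good": 1, "bad": -1, "delicious": 5, "tasty": 4, "bland": -4}
# Hypothetical second lexicon, used only to show that objects hold different data
movie_lexicon = {"good": 1, "bad": -1, "thrilling": 3, "boring": -3}

restaurant_analyzer = SentimentAnalyzer(restaurant_lexicon)
movie_analyzer = SentimentAnalyzer(movie_lexicon)

review = "The appetizers were bland but the main course was delicious"
restaurant_score = restaurant_analyzer.analyze_sentiment(review)
movie_score = movie_analyzer.analyze_sentiment(review)

print(restaurant_score)  # bland (-4) + delicious (+5) = 1
print(movie_score)       # no word of the review is in the movie lexicon, so 0
```

Note that text.split() only separates on whitespace, so a word followed by punctuation (for example "delicious.") would not match the lexicon.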
Package
A package is a collection of Python modules under a common namespace.
Definition
• A package is created by placing a special file named __init__.py in a directory
• __init__.py can be empty, but it signifies that the directory is a package
• A package can be imported the same way a module can be
[Diagram: package text_analysis containing the modules __init__.py, tokenizer.py, and sentiment_analyzer.py; sentiment_analyzer.py defines the class SentimentAnalyzer]
Module vs Package vs Library
Module: A module is the simplest form of code organization, consisting of a single file.
Package: A package is a collection of modules organized into directories, possibly with sub-packages.
Library: A library can include multiple packages or standalone modules that provide a wide range of functionalities.
How to Import
Using Python libraries is straightforward:
• Import the relevant libraries, or
• Import specific functions based on the tasks at hand.
• Importing the Entire Library or Package
import text_analysis.sentiment_analyzer
text_analyzer = text_analysis.sentiment_analyzer.SentimentAnalyzer(lexicon_1)

• Importing Specific Functions or Classes


from text_analysis.sentiment_analyzer import SentimentAnalyzer
text_analyzer = SentimentAnalyzer(lexicon_1)
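The two import styles can be tried end to end with the following sketch. It writes a throwaway text_analysis package into a temporary directory so that the example runs on its own. The temporary location, the sample sentence, and the contents of lexicon_1 are assumptions made for illustration; in a real project the package directory would simply sit next to your script.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: text_analysis/__init__.py + sentiment_analyzer.py
root = Path(tempfile.mkdtemp())
pkg_dir = root / "text_analysis"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")  # empty file marks the directory as a package
(pkg_dir / "sentiment_analyzer.py").write_text(
    "class SentimentAnalyzer:\n"
    "    def __init__(self, lexicon):\n"
    "        self.lexicon = lexicon\n"
    "    def analyze_sentiment(self, text):\n"
    "        return sum(self.lexicon.get(w, 0) for w in text.split())\n"
)

sys.path.insert(0, str(root))  # make the temporary directory importable
importlib.invalidate_caches()

lexicon_1 = {"good": 1, "bad": -1}  # hypothetical lexicon

# Style 1: import the module through its package and use the full dotted path
import text_analysis.sentiment_analyzer
analyzer_a = text_analysis.sentiment_analyzer.SentimentAnalyzer(lexicon_1)

# Style 2: import only the class
from text_analysis.sentiment_analyzer import SentimentAnalyzer
analyzer_b = SentimentAnalyzer(lexicon_1)

sentence = "good food bad service good price"
score_a = analyzer_a.analyze_sentiment(sentence)
score_b = analyzer_b.analyze_sentiment(sentence)
print(score_a, score_b)  # both styles reach the same class
```

Both styles load the same module; they differ only in how much of the dotted path you type at the call site.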
Python Libraries for Analytics
Data Analytics Process

Define Problem
Data Collection

• Define your data needs


• Clearly outline the goals and objectives of your analysis.
• Identify sources
• Determine where to gather data from.
• Databases, surveys, APIs, or external providers.
• Collect relevant data
• Consider both structured and unstructured data
Data
Data is the raw material used in various fields to develop analysis, interpretation,
and decision-making processes.

Structured Data
• Sales transactions
• Stock price trends
• Employee data
Unstructured Data (textual data, image file, video, audio)
• Social media posts
• Customer reviews
• Product images
Structured Data
Organized in clear tables with rows and columns.

• Well-suited for mathematical and statistical analysis.


• Follow consistent formats and data types.
• Easily integrated into databases and systems.
Unstructured Data
Lacks predefined structure or format.

Textual Data Image File Video Audio

• Diverse formats like text, image, audio, and video.


• Requires specialized techniques like NLP and image recognition.
• Reveals qualitative insights, sentiment, and context.
Structured VS Unstructured Data
Structured Data
• Can be analyzed using standard statistical methods and business
intelligence tools
Unstructured Data
• Often requires specialized analytics techniques tailored to the specific
data types

Combining structured and unstructured data analytics allows organizations to
understand their data comprehensively and make more informed decisions.
Data Preprocessing

• Data Cleaning
• Handle missing values, outliers, and inconsistencies in the collected data.
• Data Transformation
• Standardize formats, normalize values, and create derived features if needed.
• Data Integration
• Combine data from different sources while maintaining data quality.
Python libraries

Structured Data

Unstructured Data
-Text

Unstructured Data
- Image
Data Analysis

• Exploratory Data Analysis


• Descriptive statistics, visualizations, and summaries of data.
• Model Building and Applying Analytics Techniques
• Construct proper models based on analysis goals.
• Evaluation and Validation
• Assess model performance, validate results, etc.
Python libraries

Descriptive Analysis

Predictive Analysis
Insight Sharing

• Interpret Findings
• Derive meaningful insights and patterns from the analysis results.
• Visualization
• Create charts, graphs, and visual representations to communicate insights.
• Sharing and Reporting
• Present findings to stakeholders through reports, presentations, or interactive tools.
Demo Intro:
Recommendation Systems

Recommendation System
An algorithmic tool that suggests items or
content aligned with user preferences,
aiding in their discovery process.

Recommend User

Items
Cold Start Problem
The "cold start problem" in recommendation systems refers to the challenge of
providing accurate suggestions for new users or items with limited data.

• Popularity-based Filtering
• Content-based Filtering
Popularity-based Filtering
Popularity-based filtering is a simple and intuitive method that suggests items to users
based on the items' overall popularity among all users.

Item Characteristics
• Ratings (#, avg.)
• Year of Release
• Genres Recommend

Items Users
Content-based Filtering
Content-based filtering is a technique that suggests items to users based on the
attributes or features of the items themselves and the user's past interactions or
preferences.

Item Characteristics + User Preferences


• Genre
• Director
• Actor
Recommend
Goal of Activities
In these demo activities:

• Understand the data structures of Pandas (DataFrame & Series)
for structured data handling and descriptive analysis
• Handle unstructured data (i.e., text) and convert it to a
structured format.
• Apply machine learning algorithms using the recommendation
system framework.
Popularity-based
Filtering
Recommendation Systems

Popularity-based Filtering
Develop an item-based recommendation system.

Prepare Data : Item-related Data

movieId  year  genres            avgRating  numRatings
A        2023  Adventure|Comedy  7.4        224,000
B        2013  Romance|Sci-Fi    8.0        664,000
C        2023  Adventure         6.1        70,000
D        2010  Adventure|Sci-Fi  8.8        25,000,000
Pandas
An open-source data handling and manipulation library for Python.

Before using the library, you need to import it into your code

import pandas as pd

Note: pd is the conventional abbreviation for pandas and is used throughout this program,
so there is no need to spell out 'pandas' every time.
Demo - Documentation

Lustig, I. (n.d.). Cheatsheet for pandas. Princeton Consultants, inspired by RStudio Data Wrangling Cheatsheet. https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
Data Collection
& Data Pre-Processing

Data Structures in Pandas
The data structures of pandas
1. DataFrame
Two-dimensional table-like structure that can store and manipulate data with rows
and columns
2. Series
One-dimensional labelled array that can hold data of any type

Series           Series            DataFrame
Index  apples    Index  oranges    Index  apples  oranges
0      3         0      0          0      3       0
1      2         1      3          1      2       3
2      0         2      7          2      0       7
3      1         3      2          3      1       2
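The relationship between the two structures can be sketched with the apples and oranges values from the figure: each Series becomes one column of the DataFrame, and selecting a column returns a Series again. The variable names are chosen for illustration.

```python
import pandas as pd

apples = pd.Series([3, 2, 0, 1], name="apples")
oranges = pd.Series([0, 3, 7, 2], name="oranges")

# Each Series becomes one column; the shared index (0..3) labels the rows
fruit_df = pd.DataFrame({"apples": apples, "oranges": oranges})
print(fruit_df)

# Selecting a single column gives back a Series
apples_column = fruit_df["apples"]
print(type(apples_column).__name__)
```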
Creating Series
Let’s create a Series which contains average ratings of movies.

import pandas as pd

avgRating_list = [7.4, 8.0, 6.1, 8.8]
avgRating = pd.Series(avgRating_list)  # pd.Series() creates a pandas Series

Result:
Index  avgRating
0      7.4
1      8.0
2      6.1
3      8.8


While the list uses a zero-based integer index to access the values, the
Pandas Series can have values associated with an explicitly defined index.
Manipulating Series
Use movieId as the index for the Series instead of index number.

import pandas as pd

avgRating_list = [7.4, 8.0, 6.1, 8.8]
movieId = ['A', 'B', 'C', 'D']

avgRating_r = pd.Series(avgRating_list, index=movieId)

Before (default index)    After (movieId as index)
Index  avgRating          Index  avgRating
0      7.4                A      7.4
1      8.0                B      8.0
2      6.1                C      6.1
3      8.8                D      8.8
Accessing Elements of a Series
Access an element
Index  avgRating
A      7.4
B      8.0
C      6.1
D      8.8
# Access and print the value in the 'avgRating_r' Series using the index label 'A'
print(avgRating_r['A'])

# Access and print the value in the 'avgRating_r' Series using the integer position 0
# (.iloc is used because plain [0] on a label-indexed Series is deprecated in recent pandas)
print(avgRating_r.iloc[0])
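A minimal sketch of the two access styles follows. It uses the explicit .loc (label-based) and .iloc (position-based) indexers, which avoid any ambiguity about whether a key is a label or a position; the variable names are chosen for illustration.

```python
import pandas as pd

avgRating_r = pd.Series([7.4, 8.0, 6.1, 8.8], index=["A", "B", "C", "D"])

by_label = avgRating_r.loc["A"]        # label-based lookup
by_position = avgRating_r.iloc[0]      # position-based lookup (first element)
subset = avgRating_r.loc[["B", "D"]]   # several labels at once return a Series

print(by_label, by_position)
print(subset)
```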
Creating DataFrames
Create a DataFrame which consists of movieId, year, and genres.
movies_dict = {
    'movieId': ['A', 'B', 'C', 'D'],
    'year': [2023, 2013, 2023, 2010],
    'genres': ['Adventure|Comedy', 'Romance|Sci-Fi',
               'Adventure', 'Adventure|Sci-Fi']}

movies_df = pd.DataFrame(movies_dict)

The dictionary keys become the column names and the values become the column data.

movies_df
movieId  year  genres
A        2023  Adventure|Comedy
B        2013  Romance|Sci-Fi
C        2023  Adventure
D        2010  Adventure|Sci-Fi
Example: Create a DataFrame
Create a DataFrame with ‘avgRating’ and ‘numRatings’ columns and set the
movieId as the index.

ratings_df
Index avgRating numRatings
A 7.4 224,000
B 8.0 664,000
C 6.1 70,000
D 8.8 25,000,000
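One possible way to build this DataFrame is sketched below: the two columns come from a dictionary, and the movie IDs are passed through the index argument of pd.DataFrame. The name ratings_dict is an assumption made for illustration.

```python
import pandas as pd

ratings_dict = {
    "avgRating": [7.4, 8.0, 6.1, 8.8],
    "numRatings": [224_000, 664_000, 70_000, 25_000_000],
}

# index= supplies the row labels, so movieId is the index rather than a column
ratings_df = pd.DataFrame(ratings_dict, index=["A", "B", "C", "D"])
print(ratings_df)

# Rows can now be looked up by movie ID
print(ratings_df.loc["D", "numRatings"])
```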
Add a Column
Using the index, create a movieId column and add it to the DataFrame.
Then reset the index to be numeric.

ratings_df (before)
Index  avgRating  numRatings
A      7.4        224,000
B      8.0        664,000
C      6.1        70,000
D      8.8        25,000,000

ratings_df (after)
Index  movieId  avgRating  numRatings
0      A        7.4        224,000
1      B        8.0        664,000
2      C        6.1        70,000
3      D        8.8        25,000,000
Process: Add a Column
We add the new column by assigning the index to it.
ratings_df['movieId'] = ratings_df.index

• ratings_df['movieId']: Specifies the new column named 'movieId' within the
DataFrame 'ratings_df'
• =: Assignment operator that assigns a value to the 'movieId' column
• ratings_df.index: Retrieves the index of the DataFrame; these index values
will be assigned to the 'movieId' column

If needed, we can then reset the index to default integers.


ratings_df.reset_index(drop=True, inplace=True)
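The two steps can be run together as in the sketch below. The final column reordering is an optional extra that is not in the slides; it only places movieId first so the result matches the table shown earlier, because a newly assigned column is appended at the end.

```python
import pandas as pd

ratings_df = pd.DataFrame(
    {"avgRating": [7.4, 8.0, 6.1, 8.8],
     "numRatings": [224_000, 664_000, 70_000, 25_000_000]},
    index=["A", "B", "C", "D"],
)

# Step 1: copy the index labels into a regular column
ratings_df["movieId"] = ratings_df.index

# Step 2: replace the labels with the default integers 0..n-1
ratings_df = ratings_df.reset_index(drop=True)

# Optional: move movieId to the front (new columns are appended last)
ratings_df = ratings_df[["movieId", "avgRating", "numRatings"]]
print(ratings_df)
```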
Combine Data Sets (Join)
• What is the Join operation?
• Why do we need to join data?
• How do we join data?
Example: Join
Combine datasets
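movies_df
movieId  year  genres
A        2023  Adventure|Comedy
B        2013  Romance|Sci-Fi
C        2023  Adventure
D        2010  Adventure|Sci-Fi

ratings_df
movieId  avgRating  numRatings
A        7.4        224,000
B        8.0        664,000
C        6.1        70,000
D        8.8        25,000,000

# Option 1: join the movieId column of movies_df against the index of ratings_df
# (without on='movieId', join would align on the 0..3 row index and match nothing)
combined_df = movies_df.join(ratings_df.set_index('movieId'), on='movieId')

# Option 2: merge on the shared movieId column
combined_df = pd.merge(movies_df, ratings_df, on='movieId')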


Join Methods
movies_df is Table 1 and ratings_df is Table 2.

Inner Join
pd.merge(movies_df, ratings_df, on='movieId', how='inner')

Left Join
pd.merge(movies_df, ratings_df, on='movieId', how='left')

Right Join
pd.merge(movies_df, ratings_df, on='movieId', how='right')

Full Outer Join
pd.merge(movies_df, ratings_df, on='movieId', how='outer')
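With the slide data every movieId appears in both tables, so all four join types give the same rows. The sketch below uses slightly different, hypothetical tables (movie C has no rating, and the ratings table contains an extra movie E) so that the difference between the join types becomes visible.

```python
import pandas as pd

movies_df = pd.DataFrame({"movieId": ["A", "B", "C"], "year": [2023, 2013, 2023]})
# Hypothetical ratings table: no rating for C, and an extra movie E
ratings_df = pd.DataFrame({"movieId": ["A", "B", "E"], "avgRating": [7.4, 8.0, 6.5]})

inner = pd.merge(movies_df, ratings_df, on="movieId", how="inner")  # keys in both tables
left = pd.merge(movies_df, ratings_df, on="movieId", how="left")    # all keys of movies_df
right = pd.merge(movies_df, ratings_df, on="movieId", how="right")  # all keys of ratings_df
outer = pd.merge(movies_df, ratings_df, on="movieId", how="outer")  # keys of either table

for name, frame in [("inner", inner), ("left", left), ("right", right), ("outer", outer)]:
    print(name, sorted(frame["movieId"]))
```

Rows without a match on the other side are kept with missing values (NaN) in the columns that come from the other table, for example avgRating for movie C in the left join.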
Export DataFrame -Demo
• Export DataFrame to CSV file

combined_df.to_csv('movie_data.csv', index=False)

• Export DataFrame to Excel file

combined_df.to_excel('movie_data.xlsx', index=False)
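A quick way to check an export is to read the file back, as sketched below for CSV. The temporary directory and the small sample DataFrame are assumptions made so the example runs on its own. Writing .xlsx files with to_excel additionally requires an Excel engine such as openpyxl to be installed, so only the CSV path is shown here.

```python
import tempfile
from pathlib import Path

import pandas as pd

combined_df = pd.DataFrame({
    "movieId": ["A", "B"],
    "year": [2023, 2013],
    "avgRating": [7.4, 8.0],
})

csv_path = Path(tempfile.mkdtemp()) / "movie_data.csv"

# index=False: do not write the 0..n-1 row labels as an extra column
combined_df.to_csv(csv_path, index=False)

reloaded_df = pd.read_csv(csv_path)
print(reloaded_df)
```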
Data Analysis

Popularity-based Filtering
• Sorting
• Filtering
Sorting
• Sort in Ascending Order

sorted_df = df.sort_values(by='avgRating')

• Sort in Descending Order

sorted_df = df.sort_values(by='avgRating', ascending=False)


Sorting: Descending order
• Sort by multiple columns in descending order

sorted_df = df[['movieId', 'year', 'avgRating']].sort_values(by=['year', 'avgRating'], ascending=False)
Sorting: Specify order
• Sort by multiple columns by explicitly specifying the sorting order of each column.

sorted_df = df[['movieId', 'year', 'avgRating']].sort_values(by=['year', 'avgRating'], ascending=[True, False])
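Applied to the four sample movies, the two multi-column sorts can be sketched as follows. The df here is rebuilt from the slide data so the example is self-contained, and the result names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "movieId": ["A", "B", "C", "D"],
    "year": [2023, 2013, 2023, 2010],
    "avgRating": [7.4, 8.0, 6.1, 8.8],
})

# Both keys descending: newest year first, higher rating first within a year
both_desc = df.sort_values(by=["year", "avgRating"], ascending=False)

# Mixed order: oldest year first, higher rating first within a year
mixed = df.sort_values(by=["year", "avgRating"], ascending=[True, False])

print(both_desc["movieId"].tolist())  # ['A', 'C', 'B', 'D']
print(mixed["movieId"].tolist())      # ['D', 'B', 'A', 'C']
```

The second key only matters when the first key ties; here that is the case for the two 2023 movies A and C.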
Filtering DataFrames
query() allows Boolean expressions for filtering rows

df.query('numRatings > 1000000')

df.query('year > 2020 & avgRating >= 8.5')

df.query('genres.str.contains("Sci-Fi") & numRatings >= 200000 & avgRating > 7.5')


Practice : Demo
Blockbusters: "List the movies that are extremely popular, with more than a
million ratings."

blockbuster_movies = df.query('numRatings > 1000000')


Filtering DataFrames
Trending High-Quality Movies: "Find all movies released after 2018 with an
average rating of at least 8.5."

# "after 2018" excludes 2018 itself, so the comparison is strict
top_recent_movies = df.query('year > 2018 & avgRating >= 8.5')


String Methods in Pandas
Pandas provides a variety of string methods that are accessible via the
'.str' accessor

df.genres.str.contains("Sci-Fi")

df.genres.str.contains("Adventure")
Demo
Popular Sci-Fi Adventures: "I love sci-fi! Can you show me sci-fi movies with
at least 200,000 ratings and an average rating higher than 7.5?"

sci_fi_hits = df.query('genres.str.contains("Sci-Fi") & numRatings >= 200000 & avgRating > 7.5')
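The filters from this section can be run against the four sample movies as sketched below. Two details are choices made for this sketch and are not from the slides: engine="python" is passed explicitly so that the .str call is evaluated by pandas itself, and the same filter is repeated as a Boolean mask to show the equivalent non-query form.

```python
import pandas as pd

df = pd.DataFrame({
    "movieId": ["A", "B", "C", "D"],
    "year": [2023, 2013, 2023, 2010],
    "genres": ["Adventure|Comedy", "Romance|Sci-Fi", "Adventure", "Adventure|Sci-Fi"],
    "avgRating": [7.4, 8.0, 6.1, 8.8],
    "numRatings": [224_000, 664_000, 70_000, 25_000_000],
})

# Blockbusters: more than a million ratings
blockbuster_movies = df.query("numRatings > 1000000")

# Popular sci-fi: genre match plus thresholds on popularity and quality
sci_fi_hits = df.query(
    'genres.str.contains("Sci-Fi") and numRatings >= 200000 and avgRating > 7.5',
    engine="python",
)

# The same filter written as a Boolean mask instead of a query string
mask = (
    df["genres"].str.contains("Sci-Fi")
    & (df["numRatings"] >= 200_000)
    & (df["avgRating"] > 7.5)
)
sci_fi_hits_mask = df[mask]

print(blockbuster_movies["movieId"].tolist())  # ['D']
print(sci_fi_hits["movieId"].tolist())         # ['B', 'D']
```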
Insight Sharing

Popularity-based Filtering

Source: Jamian (2021, July 31). Recommendation Engines— Netflix and Amazon product recommendations techniques. Medium.
https://jamian.medium.com/recommendation-engines-netflix-and-amazon-product-recommendations-techniques-3f93896d85b0
Content-Based Filtering:
Textual Data
Recommendation Systems

Content-based Filtering
Develop a content-based filtering system
to integrate personalized
recommendations.

Prepare Data:
• Item-related Data
• Users' preferences for the items
[Diagram: the user's preferences and the items are combined to recommend items to the user]
Content-based Filtering
Item
movieId  year  genres
A        2023  Adventure|Comedy
B        2013  Romance|Sci-Fi
C        2023  Adventure
D        2010  Adventure|Sci-Fi

User
UserId  year  genres
1       2014  Adventure|Sci-Fi
Sources: (A) Barbie movie poster, (B) Her movie poster, (C) Transformers movie poster, (D) Inception movie poster, Interstellar movie poster; IMDB
Content-Based Filtering Process
Content-based filtering relies on analyzing the content or characteristics of items and
building user profiles based on their preferences.

1. Item profile creation
2. User profile creation
3. Similarity calculation
4. Rank & Recommend


Recommendations are made by comparing the similarity between the user profile
and the profiles of the items.
Demo: Import related libraries
• Handling and manipulating the dataset to calculate the similarity score
between users' preferred items and the list of items
• Convert data into a numerical format
• Calculate similarity score
Demo : Import related libraries
Library: sklearn
Package: feature_extraction
Module: text
Class: CountVectorizer


from sklearn.feature_extraction.text import CountVectorizer
Text Vectorization
The process of converting textual data into numerical vectors that
machine learning algorithms can understand and process.

genres            Adventure  Comedy  Romance  Sci-Fi
Adventure|Comedy  1          1       0        0
Romance|Sci-Fi    0          0       1        1
Adventure         1          0       0        0
Adventure|Sci-Fi  1          0       0        1
Demo: Import related libraries
from sklearn.metrics.pairwise import cosine_similarity

Item profiles
movieId  Adventure  Comedy  Romance  Sci-Fi
A        1          1       0        0
B        0          0       1        1
C        1          0       0        0
D        1          0       0        1

User profile
UserId   Adventure  Comedy  Romance  Sci-Fi
1        1          0       0        1

Cosine similarity is calculated between the user profile and each item profile.
Data Collection
& Data Pre-Processing

Item Profile Creation

A B C D
movieId Adventure Comedy Romance Sci-Fi
A 1 1 0 0
B 0 0 1 1
C 1 0 0 0
D 1 0 0 1
Sources: (A) Barbie movie poster, (B) Her movie poster, (C) Transformers movie poster, (D) Inception movie poster; IMDB
User Profile Creation

UserId Adventure Comedy Romance Sci-Fi


1 1 0 0 1

User 1
CountVectorizer
1. Initialize the CountVectorizer

# Create a CountVectorizer instance with a custom tokenizer


vectorizer = CountVectorizer(tokenizer=lambda x: x.split('|'))

Lambda Function
• A small anonymous function defined using the lambda keyword in Python
• It can take any number of arguments but has only one expression
• Here, lambda x: x.split('|') takes an input x (which will be the genre string like
"Adventure|Comedy") and returns a list by splitting x on the '|' character
Apply CountVectorizer
2. Apply Vectorizer to the Genre Data:

# Apply the vectorizer to the 'genres' column of our DataFrame


genre_matrix = vectorizer.fit_transform(combined_data['genres'])

• vectorizer is an object of the class CountVectorizer
• The method fit_transform is called on this object. This method
does two main things:
  • Fit: It learns or creates a vocabulary from the data provided, which
  in this case is the unique genres.
  • Transform: It converts the data into a numerical format based
  on the vocabulary it learned during the fit stage.
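To make the two stages tangible, here is a plain-Python sketch of what fit and transform do conceptually. It is not how scikit-learn is implemented internally, and the names tokenize, vocabulary, column_of, and genre_rows are chosen for illustration. The lowercasing step mirrors CountVectorizer's default behaviour, which is why the vocabulary ends up in lowercase.

```python
genres_column = ["Adventure|Comedy", "Romance|Sci-Fi", "Adventure", "Adventure|Sci-Fi"]


def tokenize(text):
    # Lowercase first (CountVectorizer does this by default), then split on '|'
    return text.lower().split("|")


# Fit: collect the unique tokens and give each one a column position
vocabulary = sorted({token for text in genres_column for token in tokenize(text)})
column_of = {token: i for i, token in enumerate(vocabulary)}

# Transform: one row per entry, counting how often each token occurs
genre_rows = []
for text in genres_column:
    row = [0] * len(vocabulary)
    for token in tokenize(text):
        row[column_of[token]] += 1
    genre_rows.append(row)

print(vocabulary)  # ['adventure', 'comedy', 'romance', 'sci-fi']
for text, row in zip(genres_column, genre_rows):
    print(text, row)
```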
Demo
Show the results at the end.
# Display the genre matrix in array format.
print(genre_matrix.toarray())

# Retrieve and print the unique genres


unique_genres = vectorizer.get_feature_names_out()
print("Unique genres:", unique_genres)

Unique genres: ['adventure' 'comedy' 'romance' 'sci-fi']


Data Analysis

Similarity Calculation
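genre_matrix[-1] (user profile, the last row)
UserId   Adventure  Comedy  Romance  Sci-Fi
1        1          0       0        1

genre_matrix[:-1] (item profiles, all rows except the last)
movieId  Adventure  Comedy  Romance  Sci-Fi
A        1          1       0        0
B        0          0       1        1
C        1          0       0        0
D        1          0       0        1

Similarity is calculated between the user profile and each item profile.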
Cosine Similarity in Content-Based Filtering
Cosine similarity is a metric used to measure the similarity between
two vectors in a multi-dimensional space.

The cosine similarity value ranges from -1 to 1.


• 1 indicating exactly the same direction.
• -1 indicating completely opposite directions.

Source: pyimagesearch.com
Cosine Similarity
Cosine similarity

UserId Adventure Comedy Romance Sci-Fi


1 1 0 0 1

Similarity
movieId Adventure Comedy Romance Sci-Fi
A 1 1 0 0
Item Ranking and Recommendation
Cosine similarity

User_movieId CS_Genre
1_A 0.5
1_B 0.5
1_C 0.7071
1_D 1

User Movie D
Sources: Inception movie poster and Interstellar movie poster; IMDB
Insight Sharing

Insight Sharing
User’s Preferences

User

Recommend

Sources: Inception movie poster and Interstellar movie poster; IMDB


More Data?

To enhance a content-based recommendation system,
what additional factors can be considered?
