Data Science
Chapter 1
1. What is Data Science?
Definition:
Data Science is an interdisciplinary field that uses data to extract insights, build models, and
support decision-making.
It combines three key elements:
o Statistics & Mathematics → to understand patterns in data.
o Computer Science (Programming) → to process and analyze data using algorithms.
o Domain Knowledge → to apply findings to real-world problems.
Examples:
o Healthcare: Using data to predict diseases.
o Netflix: Recommending movies based on your watch history.
o Banks: Detecting fraudulent transactions.
2. Why Data Science?
In the modern world, huge amounts of data are generated every second (social media, online
shopping, healthcare, finance, etc.).
Raw data is useless unless converted into meaningful insights.
Data Science helps in decision-making, predictions, and automation.
Applications of Data Science:
Netflix: Movie/TV show recommendations.
Facebook & Instagram: Friend suggestions and content ranking.
Banks: Fraud detection and credit scoring.
Healthcare: Disease risk prediction and drug discovery.
3. The Data Science Workflow
The typical process followed by a data scientist is:
o Define the Problem → What are we trying to solve?
o Collect Data → Gather information from databases, APIs, sensors, etc.
o Clean & Prepare Data → Handle missing values, remove duplicates, fix errors.
o Explore Data (EDA – Exploratory Data Analysis) → Use statistics and visualizations
to understand data.
o Build Models (Machine Learning/AI) → Train algorithms to make predictions.
o Evaluate & Interpret Results → Check accuracy, performance, and meaning.
o Communicate Insights → Present findings through reports, dashboards, or
visualizations.
Think of it as a cycle – data science is an iterative process.
4. Case Study: DataSciencester
To make these ideas practical, Joel Grus (in Data Science from Scratch) introduced a fictional
social network called DataSciencester.
(a) Representing Users
Each user is stored in Python as a dictionary with an id and name.
users = [
{"id": 0, "name": "Hero"},
{"id": 1, "name": "Dunn"},
{"id": 2, "name": "Sue"},
]
Here, id is unique, making it easy to track users.
(b) Representing Friendships
Friendships are stored as pairs of user IDs.
friendships = [
(0, 1),
(0, 2),
(1, 2)
]
(0, 1) means user 0 (Hero) is friends with user 1 (Dunn).
(c) Building a Friend Network
We can attach a list of friends to each user:
for user in users:
    user["friends"] = []  # start with an empty list
for i, j in friendships:
    # ids match list positions here, so users[i] is the user with id i
    users[i]["friends"].append(users[j])
    users[j]["friends"].append(users[i])
Now:
Hero’s friends → Dunn, Sue
Dunn’s friends → Hero, Sue
Sue’s friends → Hero, Dunn
(d) Analyzing Connections
Number of Friends:
def number_of_friends(user):
    return len(user["friends"])
Average Connections:
total_connections = sum(number_of_friends(user) for user in users)
avg_connections = total_connections / len(users)
This tells us how connected people are on average.
(e) Finding Popular Users
We can sort users by their number of friends:
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]
print(sorted(num_friends_by_id, key=lambda x: x[1], reverse=True))
This identifies influencers in the network.
(f) Friend-of-a-Friend (FOAF)
We can recommend new friends based on mutual friends:
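def friends_of_friend_ids(user):
    return [foaf["id"]
            for friend in user["friends"]    # for each of the user's friends
            for foaf in friend["friends"]]   # collect that friend's friends
Example: If Hero is friends with Dunn, and Dunn is friends with Sue, then Sue is a friend-of-a-friend of Hero.
Note that this simple version also returns the user's own id and the ids of people who are already direct friends, so the result needs filtering before it is used for recommendations.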
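The friend-of-a-friend idea can be turned into a simple recommendation score by counting mutual friends and leaving out the user and their existing friends. The sketch below is one possible way to do this, not necessarily the book's exact code. It assumes a fourth user ("Chi") and two extra friendships, added here only so that a non-friend exists; the `friends` lookup and the function name `mutual_friend_counts` are illustrative.

```python
from collections import Counter

users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
    {"id": 3, "name": "Chi"},   # hypothetical fourth user, added for this example
]
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]  # last two pairs are assumed

# Map each user id to the set of ids of that user's friends.
friends = {user["id"]: set() for user in users}
for i, j in friendships:
    friends[i].add(j)
    friends[j].add(i)

def mutual_friend_counts(user_id):
    """Count, for each non-friend, how many friends they share with user_id."""
    return Counter(
        foaf_id
        for friend_id in friends[user_id]
        for foaf_id in friends[friend_id]
        if foaf_id != user_id                 # skip the user themselves
        and foaf_id not in friends[user_id]   # skip people who are already friends
    )

print(mutual_friend_counts(0))  # Counter({3: 2})
```

Here user 3 is suggested to Hero with a score of 2, because they share two mutual friends (Dunn and Sue).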
5. Why This Case Study is Important
Teaches data representation → users and relationships stored in Python.
Demonstrates basic analysis → counting, averaging, ranking.
Shows real-world relevance → friend suggestions, influencer ranking, community detection.
This is a mini version of Facebook or LinkedIn.
Roles in Data Science
Data science is teamwork where different professionals handle different parts of the process.
Data Scientist
Analyzes data to find patterns, insights, and predictions.
Builds machine learning models.
Communicates results to decision-makers.
Example: Predicting which customers are likely to leave a telecom company.
Data Engineer
Designs and manages data pipelines, databases, and storage systems.
Ensures data is clean, accessible, and reliable for analysis.
Example: Building the system that collects streaming data from YouTube users.
Machine Learning Engineer
Focuses on deploying ML models into production.
Optimizes algorithms for speed and accuracy.
Example: Implementing recommendation models that run live on Netflix.
Business Analyst
Acts as a bridge between technical teams and business teams.
Converts insights into strategies and decisions.
Example: Using sales data to advise a retail store on stocking inventory.
Skills Required for Data Science
Programming Skills
Python, R, SQL for data analysis and automation.
Libraries: NumPy, Pandas, Matplotlib, Seaborn, Scikit-Learn.
Mathematics & Statistics
Probability, hypothesis testing, linear algebra.
Understanding distributions, correlation, regression.
Data Visualization
Tools: Tableau, Power BI, Matplotlib, Seaborn.
Skill: Present results in charts and graphs for better storytelling.
Soft Skills
Problem-Solving: Breaking down complex problems into steps.
Communication: Explaining results to non-technical people.
Storytelling with Data: Turning numbers into actionable insights.
Example: Presenting a fraud detection system to bank managers in simple terms.
Scope of Data Science
Data Science is one of the fastest-growing fields with applications across industries.
o Healthcare: Predicting diseases, drug discovery.
o Finance: Fraud detection, credit scoring.
o Retail & E-commerce: Customer segmentation, product recommendation.
o Social Media: News feed ranking, friend recommendations.
o Manufacturing: Predictive maintenance of machines.
Class Activity
Task: Pick one Pakistani company (for example, Careem, Daraz, Jazz, HBL).
Discuss in groups:
How does it already use data science?
If not, how could it use data science to improve services?
Example: Careem uses data science for estimating ride fares, matching drivers with passengers,
and predicting demand in different areas.
Python Basics – Syntax, Data Types, and Loops
Python Setup
Install Python from python.org or install Anaconda which includes Python, Jupyter Notebook,
and libraries.
IDEs (Integrated Development Environments):
o Jupyter Notebook: Best for data science.
o VS Code: Lightweight and widely used.
o PyCharm: Professional IDE.
Basic Python Syntax
Print Statement
print("Hello, Data Science")
Output:
Hello, Data Science
Variables
Variables are used to store values.
name = "Aleeza"    # string
age = 21           # integer
gpa = 3.7          # float
is_student = True  # boolean
Rules for variable names:
Must start with a letter or underscore.
Cannot start with a number.
Case-sensitive (Name ≠ name).
Data Types in Python
Text Type
str → string (text in quotes).
message = "Hello World"
Numeric Types
int → integers (e.g., 10)
float → decimal numbers (e.g., 3.14)
x = 10    # int
y = 3.14  # float
Boolean Type
bool → True or False
is_active = True
Collections
List → Ordered, changeable.
fruits = ["apple", "banana", "cherry"]
Tuple → Ordered, unchangeable.
coordinates = (4, 5)
Dictionary → Key-value pairs.
student = {"name": "Aleeza", "age": 21}
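To check which type Python has assigned to a value, the built-in type() function can be used. The short sketch below reuses example values from this section; the list name `examples` is introduced only for illustration.

```python
# Example values from this section, gathered in one list (the list name is illustrative).
examples = ["Hello World", 10, 3.14, True, ["apple", "banana"], (4, 5), {"name": "Aleeza"}]

for value in examples:
    # type(value).__name__ gives the short type name, such as "str" or "int"
    print(repr(value), "is of type", type(value).__name__)

type_names = [type(value).__name__ for value in examples]
print(type_names)  # ['str', 'int', 'float', 'bool', 'list', 'tuple', 'dict']
```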
Loops in Python
For Loop
Used when the number of iterations is known.
for i in range(5):
    print(i)
Output:
0
1
2
3
4
While Loop
Used when the number of iterations is not fixed.
i = 0
while i < 5:
    print(i)
    i += 1
Output:
0
1
2
3
4
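The difference between the two loops is easier to see when the while loop's iteration count really is unknown in advance. The following sketch uses made-up values (`marks`, `value`) purely for illustration.

```python
# For loop: the number of iterations is known (one per mark in the list).
marks = [85, 90, 78]
total = 0
for mark in marks:
    total += mark
print("Total marks:", total)      # Total marks: 253

# While loop: we do not know beforehand how many halvings are needed.
value = 100
steps = 0
while value >= 1:
    value = value / 2             # keep halving until the value drops below 1
    steps += 1
print("Halvings needed:", steps)  # Halvings needed: 7
```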
Data Structures in Python
Introduction
Data structures are ways of storing and organizing data so that they can be used efficiently in
programs. Python provides several built-in data structures that are widely used in data science
tasks.
The main data structures are:
o Lists
o Tuples
o Dictionaries
o Sets
Lists
A list is an ordered collection of items. Lists are mutable, which means items can be added,
removed, or changed.
Creating a List
fruits = ["apple", "banana", "cherry"]
numbers = [10, 20, 30, 40]
Accessing Elements
print(fruits[0]) # apple
print(fruits[2]) # cherry
Modifying Elements
fruits[1] = "mango"
print(fruits) # ['apple', 'mango', 'cherry']
List Methods
fruits.append("orange") # add item
fruits.remove("apple") # remove item
len(fruits) # length of list
Tuples
A tuple is similar to a list but immutable (cannot be changed after creation).
Creating a Tuple
coordinates = (10, 20)
Accessing Elements
print(coordinates[0]) # 10
Immutability
coordinates[0] = 50 # raises TypeError: 'tuple' object does not support item assignment
Use tuples when data should not change (for example, fixed locations, constant values).
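The error can be observed safely by wrapping the assignment in try/except. This is a small sketch; it also shows tuple unpacking, a common way to read values out of a tuple without changing it.

```python
coordinates = (10, 20)

try:
    coordinates[0] = 50           # attempt to modify the tuple
except TypeError as error:
    message = str(error)
    print("Error:", message)      # the tuple itself is left unchanged

# Unpacking reads the values without modifying the tuple.
x, y = coordinates
print(x, y)  # 10 20
```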
Dictionaries
A dictionary stores data in key-value pairs. It is mutable, and since Python 3.7 it preserves the order in which keys were inserted.
Creating a Dictionary
student = {"name": "Aleeza", "age": 21, "grade": "A"}
Accessing Values
print(student["name"]) # Aleeza
Adding/Updating Values
student["age"] = 22
student["city"] = "Lahore"
Removing Keys
del student["grade"]
Dictionaries are very useful in data science for structured data like JSON files.
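As a brief illustration of the JSON connection, Python's standard json module converts JSON text into a dictionary and back. The JSON string below is made-up sample data.

```python
import json

# JSON text as it might arrive from a file or an API (made-up sample).
json_text = '{"name": "Aleeza", "age": 21, "grade": "A"}'

student = json.loads(json_text)      # JSON text -> Python dictionary
student["city"] = "Lahore"           # work with it like any other dictionary

back_to_text = json.dumps(student)   # Python dictionary -> JSON text
print(student["name"])               # Aleeza
print(back_to_text)
```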
Sets
A set is an unordered collection of unique items.
Creating a Set
numbers = {1, 2, 3, 4, 4, 5}
print(numbers) # {1, 2, 3, 4, 5}
Set Operations
A = {1, 2, 3}
B = {3, 4, 5}
print(A.union(B)) # {1, 2, 3, 4, 5}
print(A.intersection(B)) # {3}
print(A.difference(B)) # {1, 2}
Sets are useful when uniqueness of items is required.
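A common use is removing duplicates from a list and testing membership. The list of cities below is sample data chosen for illustration.

```python
cities = ["Karachi", "Lahore", "Karachi", "Islamabad", "Lahore"]

unique_cities = set(cities)               # duplicates are dropped
print(len(cities), len(unique_cities))    # 5 3

# Membership tests use the "in" keyword.
print("Lahore" in unique_cities)          # True
print("Multan" in unique_cities)          # False

# Sets are unordered, so sorted() is used to get a predictable list.
print(sorted(unique_cities))              # ['Islamabad', 'Karachi', 'Lahore']
```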
Class Activity
Activity 1
Create a list of five student names. Add two more names and remove one.
students = ["Ali", "Sara", "Hassan", "Fatima", "Omar"]
students.append("Areeba")
students.append("Bilal")
students.remove("Omar")
print(students)
# Output: ['Ali', 'Sara', 'Hassan', 'Fatima', 'Areeba', 'Bilal']
Activity 2
Create a tuple of three cities and try to change one element (observe the error).
cities = ("Karachi", "Lahore", "Islamabad")
# cities[0] = "Multan"  # uncommenting this line raises:
# TypeError: 'tuple' object does not support item assignment
Activity 3
Create a dictionary to store details of a book (title, author, year). Update the year.
book = {"title": "Data Science 101", "author": "John Smith", "year": 2018}
book["year"] = 2023
print(book)
# Output: {'title': 'Data Science 101', 'author': 'John Smith', 'year': 2023}
Activity 4
Create two sets of numbers and find their union and intersection.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
print(A.union(B)) # {1, 2, 3, 4, 5, 6}
print(A.intersection(B)) # {3, 4}
File Handling and Data Input/Output in Python
Introduction
File handling is an important part of programming because it allows reading and writing data
permanently. Unlike variables that are temporary, files store data even after a program stops. In
Python, we use the built-in open() function for working with files.
Opening and Closing Files
file = open("example.txt", "r") # open file in read mode
file.close() # always close after use
Modes:
"r" → Read (default, error if file not found)
"w" → Write (creates new file or overwrites existing)
"a" → Append (adds data at the end of file)
"r+" → Read and Write
Writing to a File
file = open("data.txt", "w")
file.write("Hello, this is my first file.\n")
file.write("Python makes file handling easy.")
file.close()
This will create a file named data.txt with two lines of text.
Reading from a File
Method 1: Read entire file
file = open("data.txt", "r")
content = file.read()
print(content)
file.close()
Method 2: Read line by line
file = open("data.txt", "r")
for line in file:
    print(line.strip()) # strip removes newline character
file.close()
Method 3: readlines() into a list
file = open("data.txt", "r")
lines = file.readlines()
print(lines) # ['Hello, this is my first file.\n', 'Python makes file handling easy.']
file.close()
Using with Statement (Recommended)
The with statement automatically closes the file after use.
with open("data.txt", "r") as file:
    content = file.read()
    print(content)
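That the file really is closed can be checked through the file object's closed attribute. The sketch below writes to a temporary location (an assumption made so that no real file is touched) and also shows that the file is closed when an error occurs inside the block.

```python
import os
import tempfile

# Temporary path for the demo, so the real data.txt is not touched.
path = os.path.join(tempfile.mkdtemp(), "data.txt")

with open(path, "w") as file:
    file.write("Hello, this is my first file.\n")
    closed_inside = file.closed      # False: still inside the with block

closed_after = file.closed           # True: closed automatically on leaving the block
print(closed_inside, closed_after)   # False True

# The file is closed even if an error happens inside the block.
try:
    with open(path, "r") as file:
        raise ValueError("simulated error")
except ValueError:
    pass
closed_after_error = file.closed
print(closed_after_error)            # True
```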
Appending to a File
with open("data.txt", "a") as file:
    file.write("\nThis line is added later.")
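The difference between the "w" and "a" modes listed earlier can be checked directly. This sketch works in a temporary folder with a hypothetical file name, so it does not affect data.txt.

```python
import os
import tempfile

# Work inside a temporary folder so no real files are touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "modes_demo.txt")   # hypothetical file name

with open(path, "w") as f:      # "w" creates the file
    f.write("first\n")
with open(path, "w") as f:      # "w" again: the previous content is overwritten
    f.write("second\n")
with open(path, "a") as f:      # "a" keeps the content and adds to the end
    f.write("third\n")
with open(path, "r") as f:
    content = f.read()
print(content)                  # "second" and "third"; "first" is gone

# "r" on a file that does not exist raises FileNotFoundError.
try:
    open(os.path.join(folder, "missing.txt"), "r")
    missing_file_error = False
except FileNotFoundError:
    missing_file_error = True
print(missing_file_error)       # True
```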
Handling CSV Files
CSV (Comma Separated Values) files are very common in data science.
import csv
# Writing CSV
with open("students.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Name", "Age", "Grade"])
    writer.writerow(["Ali", 20, "A"])
    writer.writerow(["Sara", 22, "B"])
# Reading CSV
with open("students.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # each row is a list of strings, e.g. ['Ali', '20', 'A']
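When a CSV file has a header row, csv.DictReader from the same module can be convenient because each row becomes a dictionary keyed by column name. The sketch below is one option, not the only way to read the file; it writes to a temporary path so the example is self-contained.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "students.csv")  # temporary file for the demo

with open(path, "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Name", "Age", "Grade"])
    writer.writerow(["Ali", 20, "A"])
    writer.writerow(["Sara", 22, "B"])

with open(path, "r", newline="") as file:
    reader = csv.DictReader(file)    # the first row is used as the column names
    rows = list(reader)

for row in rows:
    # every value is read back as text, so convert before doing arithmetic
    print(row["Name"], int(row["Age"]) + 1)
```

Note that the ages come back as the strings "20" and "22", which is why int() is applied before adding.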
Class Activities
Activity 1
Write a program to create a file called notes.txt and write three lines into it.
with open("notes.txt", "w") as f:
    f.write("This is line 1\n")
    f.write("This is line 2\n")
    f.write("This is line 3\n")
Activity 2
Write a program to read the contents of notes.txt and display them.
with open("notes.txt", "r") as f:
    print(f.read())
Activity 3
Append one more line to notes.txt and then display all lines.
with open("notes.txt", "a") as f:
    f.write("This is line 4\n")
with open("notes.txt", "r") as f:
    for line in f:
        print(line.strip())
Activity 4
Create a CSV file of three employees with their names and salaries. Then read and display the
data.
import csv
with open("employees.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Salary"])
    writer.writerow(["Areeba", 50000])
    writer.writerow(["Bilal", 60000])
    writer.writerow(["Omar", 55000])
with open("employees.csv", "r", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
Introduction to NumPy and Pandas
Introduction
In Data Science, handling and analyzing large datasets efficiently is very important. Python
provides two powerful libraries for this purpose: NumPy and Pandas.
NumPy (Numerical Python): Used for numerical operations, arrays, and mathematical functions.
Pandas: Built on NumPy, used for data manipulation and analysis in tabular (row/column)
format.
Part 1: NumPy Basics
Importing NumPy
import numpy as np
Creating Arrays
arr = np.array([1, 2, 3, 4, 5])
print(arr) # [1 2 3 4 5]
print(type(arr)) # <class 'numpy.ndarray'>
1D Array: np.array([1,2,3])
2D Array:
arr2d = np.array([[1,2,3],[4,5,6]])
print(arr2d)
Useful Array Functions
print(np.zeros(5)) # [0. 0. 0. 0. 0.]
print(np.ones((2,3))) # 2x3 array of ones
print(np.arange(1,10,2)) # [1 3 5 7 9]
print(np.linspace(0, 1, 5)) # [0.   0.25 0.5  0.75 1.  ]
Array Operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # [5 7 9]
print(a * b) # [ 4 10 18]
print(a ** 2) # [1 4 9]
print(np.dot(a, b)) # 32 (dot product)
Statistical Functions
arr = np.array([10, 20, 30, 40, 50])
print(np.mean(arr)) # 30.0
print(np.median(arr)) # 30.0
print(np.std(arr)) # about 14.14 (population standard deviation)
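By default np.std divides by n (the population formula). For a sample estimate that divides by n - 1, the ddof parameter can be set to 1, as this short sketch shows.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

population_std = np.std(arr)          # divides by n (default, ddof=0)
sample_std = np.std(arr, ddof=1)      # divides by n - 1

print(round(population_std, 2))       # 14.14
print(round(sample_std, 2))           # 15.81
```

This matters when comparing results across libraries: the pandas std() method uses ddof=1 by default, so it can report a different number for the same data.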
Part 2: Pandas Basics
Importing Pandas
import pandas as pd
Series (1D Data)
s = pd.Series([10, 20, 30, 40], index=["a","b","c","d"])
print(s)
Output:
a    10
b    20
c    30
d    40
dtype: int64
DataFrame (2D Data)
data = {
    "Name": ["Ali", "Sara", "Omar"],
    "Age": [22, 24, 21],
    "Marks": [85, 90, 78]
}
df = pd.DataFrame(data)
print(df)
Output:
   Name  Age  Marks
0   Ali   22     85
1  Sara   24     90
2  Omar   21     78
Accessing Data in DataFrame
print(df["Name"]) # column access
print(df.iloc[0]) # first row, selected by integer position
print(df.loc[1, "Marks"]) # cell at row label 1, column "Marks"
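With the default index, positions and labels are the same numbers, which hides the difference between iloc and loc. The sketch below assumes custom row labels (10, 20, 30, chosen only for illustration) to make the distinction visible, and adds a simple boolean filter.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Ali", "Sara", "Omar"], "Marks": [85, 90, 78]},
    index=[10, 20, 30],            # custom row labels, chosen for illustration
)

print(df.iloc[0]["Name"])          # Ali -> first row by position
print(df.loc[20, "Marks"])         # 90  -> row whose label is 20

# Boolean filtering: keep only rows where the condition is True.
high = df[df["Marks"] > 80]
print(list(high["Name"]))          # ['Ali', 'Sara']
```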
Basic Operations
print(df.describe()) # summary statistics
print(df.head(2)) # first 2 rows
print(df.tail(1)) # last row
Reading and Writing CSV with Pandas
# Save to CSV
df.to_csv("students.csv", index=False)
# Read from CSV
df2 = pd.read_csv("students.csv")
print(df2)
Class Activities
Activity 1: NumPy
Create a NumPy array of numbers from 1 to 10 and calculate their mean and standard deviation.
import numpy as np
arr = np.arange(1,11)
print("Mean:", np.mean(arr))
print("Standard Deviation:", np.std(arr))
Activity 2: Pandas DataFrame
Create a DataFrame of 3 students with columns: Name, Age, GPA. Then display only the GPA
column.
import pandas as pd
data = {
    "Name": ["Hina", "Bilal", "Owais"],
    "Age": [20, 21, 22],
    "GPA": [3.5, 3.8, 3.2]
}
df = pd.DataFrame(data)
print(df["GPA"])
Activity 3: CSV Handling with Pandas
Create a DataFrame for 3 products with Price and Quantity, save it to a CSV file, then read it
back.
data = {
    "Product": ["Pen", "Notebook", "Eraser"],
    "Price": [20, 50, 10],
    "Quantity": [5, 2, 10]
}
df = pd.DataFrame(data)
df.to_csv("products.csv", index=False)
df2 = pd.read_csv("products.csv")
print(df2)
Data Cleaning and Preparation
Introduction
Before analysis or modeling, real-world data usually needs cleaning.
Data is often incomplete, inconsistent, or contains errors.
Data Cleaning and Preparation ensures high-quality, accurate datasets for analysis.
1. Common Problems in Raw Data
Missing Values: Some entries are empty.
Duplicates: Same record appears multiple times.
Incorrect Data Types: Numbers stored as text, dates stored as strings.
Inconsistent Formatting: "Male"/"M", "Female"/"F".
Outliers: Unusual values (e.g., salary = 999999).
2. Handling Missing Data
Checking Missing Data
import pandas as pd
data = {
    "Name": ["Ali", "Sara", "Omar", "Hina"],
    "Age": [22, None, 21, 23],
    "Marks": [85, 90, None, 88]
}
df = pd.DataFrame(data)
print(df.isnull()) # shows True where values are missing
print(df.isnull().sum()) # counts missing values per column
Filling Missing Values
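# Assign the result back to the column; calling fillna(..., inplace=True) on a
# selected column is unreliable in recent pandas versions.
df["Age"] = df["Age"].fillna(df["Age"].mean()) # replace with mean
df["Marks"] = df["Marks"].fillna(0) # replace with 0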
Dropping Missing Values
df.dropna(inplace=True) # removes rows with any missing value
3. Removing Duplicates
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Ali"],
    "Age": [22, 23, 22]
})
df = df.drop_duplicates()
4. Correcting Data Types
df["Age"] = df["Age"].astype(int) # convert to integer (raises an error if the column still has missing values)
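For the two problems named earlier (numbers stored as text, dates stored as strings), pandas provides pd.to_numeric and pd.to_datetime. The sketch below uses made-up data, and the column name "Joined" is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": ["22", "23", "unknown"],                        # numbers stored as text
    "Joined": ["2023-01-15", "2023-02-01", "2023-03-10"],  # dates stored as strings
})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising an error.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["Joined"] = pd.to_datetime(df["Joined"])

print(df.dtypes)
print(df["Age"].isnull().sum())   # 1 (the "unknown" entry became missing)
```

After the conversion, the unparseable entry shows up as a missing value and can be handled with the techniques from the section on missing data.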
5. Handling Inconsistent Data
Example: Different labels for gender.
df["Gender"] = df["Gender"].replace({"M":"Male","F":"Female"})
6. Detecting Outliers
Using statistical methods:
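import numpy as np
# With very few values, one extreme point inflates the standard deviation so much
# that it can hide itself, so this example uses seven values.
arr = np.array([10, 12, 15, 14, 13, 11, 100]) # 100 is an outlier
mean = np.mean(arr)
std = np.std(arr)
for x in arr:
    if abs(x - mean) > 2 * std:
        print("Outlier:", x)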
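In very small samples, an extreme value inflates the mean and standard deviation and may not exceed a 2-standard-deviation threshold at all. A commonly used alternative is the interquartile range (IQR) rule, sketched here as an additional option rather than a replacement for the method above.

```python
import numpy as np

arr = np.array([10, 12, 15, 14, 100])

q1, q3 = np.percentile(arr, [25, 75])   # first and third quartiles
iqr = q3 - q1                           # interquartile range
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers.
outliers = arr[(arr < lower) | (arr > upper)]
print(outliers)   # [100]
```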
7. Renaming Columns
df.rename(columns={"Marks":"Score"}, inplace=True)
8. Feature Scaling (Normalization/Standardization)
Scaling helps when data values have different ranges.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[["Marks"]] = scaler.fit_transform(df[["Marks"]])
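To show what the two kinds of scaling named in the heading do, the sketch below computes them by hand with NumPy on sample marks. The first formula is the one MinMaxScaler applies with its default 0 to 1 range; the second is standardization (z-scores), using the population standard deviation.

```python
import numpy as np

marks = np.array([85.0, 90.0, 78.0])   # sample values for illustration

# Min-max normalization: rescales values to the range 0 to 1.
min_max = (marks - marks.min()) / (marks.max() - marks.min())

# Standardization (z-score): the result has mean 0 and standard deviation 1.
z_scores = (marks - marks.mean()) / marks.std()

print(min_max.round(3))    # approximately 0.583, 1.0, 0.0
print(z_scores.round(3))
```

Min-max scaling keeps the smallest value at 0 and the largest at 1, while standardization centers the data around 0.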
Class Activities and Solutions
Activity 1: Handling Missing Data
Create a DataFrame of 5 students with some missing ages and marks. Replace missing ages with
average age, and missing marks with 0.
data = {
    "Name": ["Ali", "Sara", "Omar", "Hina", "Bilal"],
    "Age": [22, None, 21, None, 24],
    "Marks": [85, 90, None, 88, None]
}
df = pd.DataFrame(data)
# Assign the filled column back instead of using inplace=True on a selected column.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Marks"] = df["Marks"].fillna(0)
print(df)
Activity 2: Removing Duplicates
Create a DataFrame with duplicate rows and remove duplicates.
df = pd.DataFrame({
    "Product": ["Pen", "Pen", "Notebook", "Eraser"],
    "Price": [20, 20, 50, 10]
})
df = df.drop_duplicates()
print(df)
Activity 3: Gender Formatting
A DataFrame has inconsistent gender values. Replace them with standard labels.
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Omar"],
    "Gender": ["M", "Female", "F"]
})
df["Gender"] = df["Gender"].replace({"M":"Male","F":"Female"})
print(df)