Mastering Data Science with Python
PART I: INTRODUCTION TO DATA SCIENCE AND PYTHON
Chapter 1: Introduction to Data Science
Chapter 2: Setting Up Your Environment
Chapter 3: Python Basics for Data Science
Chapter 21: Generative Adversarial Networks (GANs)
Chapter 46: Deployment Cheat Sheet
PART X: APPENDICES
Chapter 47: Appendix A: Mathematical Foundations
Part I: Introduction to Data Science and Python
Chapter 1: Introduction to Data Science
1.4 Overview of the Data Science Process
• Workflow Diagram:
• Steps:
1. Problem Definition: Understanding the business problem.
2. Data Acquisition: Gathering relevant data.
3. Data Cleaning: Preparing data for analysis.
4. Exploratory Data Analysis (EDA): Understanding data patterns.
5. Modeling: Applying machine learning algorithms.
6. Evaluation: Assessing model performance.
7. Deployment: Implementing the model in production.
8. Monitoring: Continuously improving the model.
Chapter 2: Setting Up Your Environment
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Creating a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
print(df)
https://alphatoolz.com
6
2.4 Setting Up Virtual Environments
• Example (a minimal sketch using the standard venv module; the original commands were lost in extraction):
bash
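# Create a virtual environment
python -m venv .venv
# Activate it (Linux/macOS)
source .venv/bin/activate
# Activate it (Windows)
.venv\Scripts\activate
# Install the core data science packages
pip install numpy pandas matplotlib seaborn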
Chapter 3: Python Basics for Data Science
• Example:
python
# Lists
fruits = ["apple", "banana", "cherry"]
print(fruits[1]) # Output: banana
# Dictionaries
person = {"name": "John", "age": 28}
print(person["name"]) # Output: John
# Functions
def square(x):
    return x ** 2
print(square(4))  # Output: 16
3.3 Working with Libraries
import numpy as np
import pandas as pd
# Filtering data (assumes the DataFrame df from Chapter 2)
df_filtered = df[df['Age'] > 30]
print(df_filtered)
Additional Resources
• Cheat Sheets:
o Python Basics Cheat Sheet
o Pandas Cheat Sheet
o NumPy Cheat Sheet
• Search Terms for Diagrams:
o "Data Science Process Workflow"
o "Machine Learning Workflow Diagram"
o "Python Libraries for Data Science Diagram"
• Recommended Books and Courses:
o "Python for Data Analysis" by Wes McKinney
o "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by
Aurélien Géron
o Online courses on platforms like Coursera, edX, and Udacity
This expanded part of the book provides a solid foundation in Data Science and Python,
complete with practical examples, explanations, and additional resources to help readers get
started on their data science journey.
Part II: Data Collection and Preprocessing
Chapter 4: Data Collection
• CSV Files
o Example:
python
import pandas as pd
# Reading a CSV file (file name assumed)
df = pd.read_csv('data.csv')
print(df.head())
• Databases
o Example:
python
import sqlite3
import pandas as pd
# Querying a SQLite database (database and table names assumed)
conn = sqlite3.connect('example.db')
df = pd.read_sql_query('SELECT * FROM my_table', conn)
print(df.head())
• APIs
o Example:
python
import requests
import pandas as pd
# Fetching JSON from an API (URL assumed)
response = requests.get('https://api.example.com/data')
data = response.json()
df = pd.DataFrame(data)
print(df.head())
• BeautifulSoup
o Example:
python
from bs4 import BeautifulSoup
import requests
# Fetching and parsing HTML content
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
titles = soup.find_all('h2')
for title in titles:
    print(title.text)
# scrapy_spider.py
import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']
    def parse(self, response):
        # Extract all h2 titles from the page (a minimal parse method; the original was lost)
        for title in response.css('h2::text').getall():
            yield {'title': title}
• Example:
python
import dask.dataframe as dd
# Read a large CSV lazily (file name assumed)
df = dd.read_csv('large_dataset.csv')
print(df.head())
• Explanation: Dask handles larger-than-memory datasets by parallelizing operations.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://example-ecommerce.com/products'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
products = []
for product in soup.find_all('div', class_='product'):
    name = product.find('h2').text
    price = product.find('span', class_='price').text
    products.append({'Name': name, 'Price': price})
df = pd.DataFrame(products)
print(df.head())
5.1 Handling Missing Values
• fillna
o Example:
python
# Filling missing values with the column mean
df['column_name'].fillna(df['column_name'].mean(), inplace=True)
o Explanation: fillna replaces missing values with the mean of the column.
5.2 Data Transformation
• MinMaxScaler
o Example:
python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['column1', 'column2']] = scaler.fit_transform(df[['column1', 'column2']])
• StandardScaler
o Example:
python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['column1', 'column2']] = scaler.fit_transform(df[['column1', 'column2']])
5.5 Dealing with Outliers
• Identifying Outliers
o Example:
python
# Identifying outliers using IQR
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column_name'] < (Q1 - 1.5 * IQR)) |
              (df['column_name'] > (Q3 + 1.5 * IQR))]
print(outliers)
# Removing outliers
df = df[~((df['column_name'] < (Q1 - 1.5 * IQR)) |
          (df['column_name'] > (Q3 + 1.5 * IQR)))]
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Sample customer data
data = {'CustomerID': [1, 2, 3, 4, 5],
        'Age': [25, np.nan, 35, 45, 55],
        'Income': [50000, 60000, 70000, np.nan, 90000]}
df = pd.DataFrame(data)
# Impute missing values before scaling (StandardScaler does not accept NaN)
df[['Age', 'Income']] = df[['Age', 'Income']].fillna(df[['Age', 'Income']].mean())
# Scaling data
scaler = StandardScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
Chapter 6: Exploratory Data Analysis (EDA)
• Summary Statistics
o Example:
python
# Summary statistics
print(df.describe())
• Histograms
o Example:
python
import matplotlib.pyplot as plt
# Histogram
df['column_name'].hist()
plt.show()
# Box plot
df.boxplot(column='column_name')
plt.show()
• Calculating Correlation
o Example:
python
# Correlation matrix
print(df.corr())
• Visualizing Correlation with Heatmaps
o Example:
python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Correlation heatmap (a sketch; the original plotting lines were lost in extraction)
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()
Additional Resources
• Cheat Sheets:
o Pandas Data Cleaning Cheat Sheet
o Data Visualization Cheat Sheet
• Search Terms for Diagrams:
o "Data Cleaning Workflow Diagram" o
"Data Preprocessing Workflow" o
"Data Transformation Diagram"
• Recommended Books and Courses:
o "Data Science for Business" by Foster Provost and Tom Fawcett
o "Python Data Science Handbook" by Jake VanderPlas
o Online courses on platforms like Coursera, edX, and Udemy
This expanded part of the book provides comprehensive coverage of data collection and
preprocessing techniques, complete with practical examples, explanations, and additional
resources to help readers prepare their data for analysis.
Part III: Data Visualization
Chapter 7: Introduction to Data Visualization
• Charts and Graphs : Bar charts, line graphs, scatter plots, etc.
• Maps : Geospatial visualizations.
• Dashboards: Interactive visual displays of data.
• Workflow Diagram:
• Guidelines:
o Understand the data type (categorical, numerical, temporal).
o Define the purpose (comparison, distribution, relationship).
o Select the appropriate chart type (bar, line, pie, etc.).
Chapter 8: Matplotlib
python
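# The original example was lost in extraction; a minimal line-plot sketch:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Basic Line Plot')
plt.show()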
• Adding Legends
o Example:
python
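# The original example was lost in extraction; a sketch of adding a legend:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [1, 4, 9], label='squares')
plt.plot([1, 2, 3], [1, 2, 3], label='linear')
plt.legend()
plt.show()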
• Bar Charts
o Example:
python
categories = ['A', 'B', 'C', 'D']
values = [3, 7, 5, 8]
plt.bar(categories, values)
plt.show()
• Histograms
o Example:
python
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
plt.hist(data, bins=4)
plt.show()
• Scatter Plots
o Example:
python
x = [1, 2, 3, 4, 5]
y = [5, 7, 4, 6, 8]
plt.scatter(x, y)
plt.show()
• Example:
python
import pandas as pd
import matplotlib.pyplot as plt
# Sample sales data
data = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
'Sales': [200, 300, 250, 400, 450]}
df = pd.DataFrame(data)
# Line plot
plt.plot(df['Month'], df['Sales'])
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthly Sales')
plt.show()
# Bar chart
plt.bar(df['Month'], df['Sales'])
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthly Sales')
plt.show()
Chapter 9: Seaborn
9.2 Customizing Plots
• Bar Plots
o Example: the original snippet was lost in extraction; see the sketch after the histogram example below.
• Histograms
o Example:
python
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(df['Sales'], bins=5)  # assumes the sales DataFrame df from Chapter 8
plt.show()
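A hedged sketch of the missing bar plot, using seaborn's barplot on the same sales data:
python
sns.barplot(x='Month', y='Sales', data=df)
plt.show()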
• Heatmaps
o Example:
python
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True)
plt.show()
# Box plot (assumes a customer DataFrame customer_df with 'Purchased' and 'Income' columns)
sns.boxplot(x='Purchased', y='Income', data=customer_df)
plt.title('Income Distribution by Purchase Status')
plt.show()
Chapter 10: Plotly and Interactive Visualizations
10.1 Introduction to Plotly
import plotly.express as px
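# A minimal interactive scatter sketch using plotly's built-in iris dataset
fig = px.scatter(px.data.iris(), x='sepal_width', y='sepal_length', color='species')
fig.show()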
10.3 Creating Different Types of Interactive Plots
• Bar Plots
o Example:
python
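# The original example was lost in extraction; a sketch using plotly's built-in tips dataset:
import plotly.express as px
df = px.data.tips()
fig = px.bar(df, x='day', y='total_bill', color='sex', barmode='group')
fig.show()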
10.4 Case Study: Interactive Sales Dashboard
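• Example (the original dashboard code was lost; a minimal sketch of one interactive chart, reusing the sales data from Chapter 8):
python
import plotly.express as px
import pandas as pd
data = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
        'Sales': [200, 300, 250, 400, 450]}
df = pd.DataFrame(data)
fig = px.line(df, x='Month', y='Sales', title='Monthly Sales', markers=True)
fig.show()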
Additional Resources
• Cheat Sheets:
o Matplotlib Cheat Sheet
o Seaborn Cheat Sheet
o Plotly Cheat Sheet
• Search Terms for Diagrams:
o "Data Visualization Workflow"
o "Choosing the Right Chart Workflow"
o "Data Visualization Dashboard Design"
• Recommended Books and Courses:
o "Storytelling with Data" by Cole Nussbaumer Knaflic
o "Data Visualisation: A Handbook for Data Driven Design" by Andy Kirk
o Online courses on platforms like Coursera, edX, and Udemy
This expanded part of the book provides a detailed guide to data visualization techniques
using Matplotlib, Seaborn, and Plotly, complete with practical examples, explanations, and
additional resources to help readers effectively visualize their data.
Part IV: Machine Learning
Chapter 11: Introduction to Machine Learning
• Real-life Examples:
o Healthcare: Predicting patient outcomes, diagnosing diseases.
o Finance: Fraud detection, stock market prediction.
o Retail: Customer segmentation, recommendation systems.
• Case Study: Predicting housing prices using historical data.
• Workflow Diagram:
• Steps:
o Define the problem
o Collect and preprocess data
o Select a model
o Train the model
o Evaluate the model
o Tune the model
o Deploy the model
12.1 Feature Engineering
• Definition: The process of creating new features from raw data to improve the
performance of ML models.
• Examples:
o Creating interaction terms between features (see the sketch after the binning example below).
o Binning continuous variables into categorical bins.
o Example:
python
import pandas as pd
data = {'age': [25, 45, 35, 50, 23, 43, 33, 51, 26, 48],
'income': [50000, 60000, 70000, 80000, 55000, 65000,
75000, 85000, 52000, 62000]}
df = pd.DataFrame(data)
# Binning age into categories
bins = [20, 30, 40, 50, 60]
labels = ['20-30', '30-40', '40-50', '50-60']
df['age_bin'] = pd.cut(df['age'], bins=bins, labels=labels)
print(df)
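A quick sketch of the interaction-term idea mentioned above, multiplying two existing columns into a new feature:
python
# Interaction term between age and income (illustrative)
df['age_income'] = df['age'] * df['income']
print(df[['age', 'income', 'age_income']].head())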
12.2 Feature Scaling
• Normalization:
python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['income_scaled'] = scaler.fit_transform(df[['income']])
print(df)
o Standardization:
python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['income_standardized'] = scaler.fit_transform(df[['income']])
print(df)
• Techniques:
o Removing missing values:
python
df.dropna(inplace=True)
o Imputing missing values with the column mean:
python
df.fillna(df.mean(), inplace=True)
13.1 Linear Regression
• Example:
python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Sample data
data = {'experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'salary': [30000, 32000, 34000, 36000, 38000, 40000,
                   42000, 44000, 46000, 48000]}
df = pd.DataFrame(data)
# Split data
X = df[['experience']]
y = df['salary']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
# Plot
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.title('Experience vs Salary')
plt.show()
13.2 Logistic Regression
• Example:
python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Sample data (hours studied vs. pass/fail; the original setup was lost in extraction)
data = {'hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
X = df[['hours']]
y = df['passed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
# Plot
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict_proba(X)[:, 1], color='red')
plt.xlabel('Hours Studied')
plt.ylabel('Probability of Passing')
plt.title('Logistic Regression')
plt.show()
13.3Decision Trees
• Overview: Decision trees are a non-parametric supervised learning method used for classification and regression. They repeatedly split the dataset into smaller subsets, incrementally building the corresponding tree of decision rules.
• Example:
python
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Train model (assumes the train/test split from the previous example)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
# Plot tree
plot_tree(model, filled=True)
plt.show()
• Solution: Use logistic regression to model the probability of churn.
• Example:
python
# Train model (assumes churn features and labels already split into train and test sets)
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
14.1 K-Means Clustering
python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Sample data (income values reused from the feature engineering example)
data = {'age': [25, 45, 35, 50, 23, 43, 33, 51, 26, 48],
        'income': [50000, 60000, 70000, 80000, 55000, 65000,
                   75000, 85000, 52000, 62000]}
df = pd.DataFrame(data)
# Train model
kmeans = KMeans(n_clusters=2)
df['cluster'] = kmeans.fit_predict(df)
# Plot clusters
plt.scatter(df['age'], df['income'], c=df['cluster'])
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('K-Means Clustering')
plt.show()
python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Sample data (reusing the age/income data from the clustering example)
data = {'age': [25, 45, 35, 50, 23, 43, 33, 51, 26, 48],
        'income': [50000, 60000, 70000, 80000, 55000, 65000,
                   75000, 85000, 52000, 62000]}
df = pd.DataFrame(data)
# Apply PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df)
df_pca = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
# Plot PCA
plt.scatter(df_pca['PC1'], df_pca['PC2'])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA')
plt.show()
# Plot segments (assumes df has a 'spending_score' column and a fitted
# clustering model has assigned df['segment'])
plt.scatter(df['age'], df['spending_score'], c=df['segment'])
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.title('Customer Segmentation')
plt.show()
Chapter 15: Model Evaluation and Tuning
15.1 Cross-Validation
python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# Sample data (assumes a DataFrame df with 'age', 'income', and a binary 'buys' column)
X = df[['age', 'income']]
y = df['buys']
# Train model
model = DecisionTreeClassifier()
scores = cross_val_score(model, X, y, cv=5)
print(f'Cross-Validation Scores: {scores}')
15.2 Grid Search
python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
# Sample data
X = df[['age', 'income']]
y = df['buys']
# Parameter grid (an illustrative grid; the original was lost in extraction)
param_grid = {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5, 10]}
# Train model
model = DecisionTreeClassifier()
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X, y)
print(f'Best Parameters: {grid_search.best_params_}')
python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Sample data
X = df[['age', 'income']]
y = df['buys']
# Train and evaluate (a sketch; the rest of the original example was lost in extraction)
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5)
print(f'Random Forest CV Scores: {scores}')
Additional Resources
• Cheat Sheets:
o Scikit-learn Cheat Sheet
o Machine Learning Algorithm Cheat Sheet
• Search Terms for Diagrams:
o "Machine Learning Workflow Diagram"
o "Model Evaluation Workflow"
o "Hyperparameter Tuning Workflow"
• Recommended Books and Courses:
o "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by
Aurélien Géron
o "Machine Learning Yearning" by Andrew Ng
o Online courses on platforms like Coursera, edX, and Udacity
This expanded part of the book provides a comprehensive guide to machine learning
techniques, including supervised and unsupervised learning, with practical examples,
explanations, and additional resources to help readers effectively apply machine learning to
real-world problems.
Part V: Deep Learning
Chapter 16: Introduction to Deep Learning
• Definition: Deep Learning is a subset of machine learning that uses neural networks
with many layers (deep neural networks) to model complex patterns in data.
• Real-life Applications:
o Computer Vision: Image classification, object detection.
o Natural Language Processing: Language translation, sentiment analysis.
o Healthcare: Disease prediction, medical imaging.
• Case Study: Image classification using convolutional neural networks (CNNs).
• Workflow Diagram:
• Steps:
o Data collection and preprocessing
o Model selection
o Model training
o Model evaluation
o Model tuning
o Model deployment
Chapter 17: Neural Networks
• Components:
o Neurons: Basic units of a neural network.
o Layers: Input layer, hidden layers, output layer.
o Activation Functions: Sigmoid, ReLU, Tanh.
• Example:
python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Sample data
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
# Define model
model = Sequential()
model.add(Dense(4, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
# Train model
model.fit(X, y, epochs=100, verbose=0)
# Evaluate model
loss, accuracy = model.evaluate(X, y)
print(f'Accuracy: {accuracy}')
17.2 Cheat Sheet:
Chapter 18: Convolutional Neural Networks (CNNs)
18.1 Understanding CNNs
• Components:
o Convolutional Layers: Extract features from input data.
o Pooling Layers: Reduce the dimensionality of feature maps.
o Fully Connected Layers: Perform classification based on extracted features.
• Example:
python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
# Load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape((X_train.shape[0], 28, 28, 1))
X_test = X_test.reshape((X_test.shape[0], 28, 28, 1))
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# Define model
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(10, activation='softmax'))
# Compile model
model.compile(optimizer='adam', loss='categorical_crossentropy',
metrics=['accuracy'])
# Train model
model.fit(X_train, y_train, epochs=10, verbose=1,
validation_data=(X_test, y_test))
# Evaluate model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Accuracy: {accuracy}')
Chapter 19: Recurrent Neural Networks (RNNs)
19.1Understanding RNNs
• Components:
o Recurrent Layers: Capture sequential dependencies in data.
o LSTM and GRU Units: Handle long-term dependencies and mitigate the vanishing gradient problem.
• Example:
python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
# Define model (a minimal sketch; the original definition was lost in extraction;
# assumes X_train/X_test are padded integer sequences, e.g. from the IMDB dataset)
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy',
metrics=['accuracy'])
# Train model
model.fit(X_train, y_train, epochs=3, verbose=1,
validation_data=(X_test, y_test))
# Evaluate model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Accuracy: {accuracy}')
19.3 Workflow Diagrams:
Chapter 20: Transfer Learning
python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Load a pre-trained base model and attach a new classification head
# (a sketch; the original base model and head were lost in extraction,
# so VGG16 and a 10-class output are assumptions)
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = Flatten()(base_model.output)
x = Dense(256, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)
# Define model
model = Model(inputs=base_model.input, outputs=predictions)
# Freeze base model layers
for layer in base_model.layers:
    layer.trainable = False
# Compile model
model.compile(optimizer='adam', loss='categorical_crossentropy',
metrics=['accuracy'])
# Prepare data
datagen = ImageDataGenerator(rescale=1.0/255.0)
train_it = datagen.flow_from_directory('data/train/',
class_mode='categorical', batch_size=64, target_size=(224, 224))
test_it = datagen.flow_from_directory('data/test/',
class_mode='categorical', batch_size=64, target_size=(224, 224))
# Train model
model.fit(train_it, steps_per_epoch=len(train_it),
validation_data=test_it, validation_steps=len(test_it), epochs=10)
20.3 Workflow Diagrams:
Chapter 21: Generative Adversarial Networks (GANs)
21.1 Understanding GANs
• Components:
o Generator: Creates fake data resembling the real data.
o Discriminator: Distinguishes between real and fake data.
• Example:
python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU, Reshape, Flatten
from tensorflow.keras.datasets import mnist
from tensorflow.keras.optimizers import Adam
# Generator
def build_generator():
    model = Sequential()
    model.add(Dense(256, input_dim=100))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(512))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(1024))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(28 * 28, activation='tanh'))
    model.add(Reshape((28, 28)))
    return model
# Discriminator
def build_discriminator():
    model = Sequential()
    model.add(Flatten(input_shape=(28, 28)))
    model.add(Dense(512))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(256))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(1, activation='sigmoid'))
    return model
# Compile models
discriminator = build_discriminator()
discriminator.compile(loss='binary_crossentropy', optimizer=Adam(),
metrics=['accuracy'])
generator = build_generator()
# Freeze the discriminator while training the combined GAN
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(loss='binary_crossentropy', optimizer=Adam())
# Train GAN
def train_gan(epochs=10000, batch_size=128):
    (X_train, _), (_, _) = mnist.load_data()
    X_train = (X_train - 127.5) / 127.5
    for epoch in range(epochs):
        # Sample real images and generate fakes (these lines were lost in extraction)
        idx = np.random.randint(0, X_train.shape[0], batch_size)
        real_images = X_train[idx]
        noise = np.random.normal(0, 1, (batch_size, 100))
        generated_images = generator.predict(noise, verbose=0)
        # Train the discriminator on real and fake batches
        d_loss_real = discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
        d_loss_fake = discriminator.train_on_batch(generated_images, np.zeros((batch_size, 1)))
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)
        # Train the generator through the combined model
        g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))
train_gan()
21.2 System Design and Workflow Diagrams:
Chapter 22: Reinforcement Learning
• Components:
o Agent: Learns to make decisions.
o Environment: The world the agent interacts with.
o Rewards: Feedback from the environment.
• Example:
python
import gym
import numpy as np
# Create environment
env = gym.make('CartPole-v1')
# Initialize variables
state = env.reset()
done = False
score = 0
# Sample run with random actions (classic gym API returning a 4-tuple)
while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    score += reward
print(f'Score: {score}')
22.3 Workflow Diagrams:
Additional Resources
• Cheat Sheets:
o TensorFlow and Keras Cheat Sheets
o Deep Learning Hyperparameter Cheat Sheet
• Recommended Books and Courses:
o "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
o Online courses on platforms like Coursera (Andrew Ng’s Deep Learning
Specialization), edX, and Udacity
This expanded part of the book provides a comprehensive guide to deep learning techniques,
including neural networks, CNNs, RNNs, transfer learning, GANs, and reinforcement
learning, with practical examples, explanations, and additional resources to help readers
effectively apply deep learning to real-world problems.
Part VI: Natural Language Processing (NLP)
Chapter 23: Introduction to NLP
23.1 What is NLP?
• Definition: Natural Language Processing (NLP) is a field at the intersection of linguistics and machine learning that enables computers to understand, interpret, and generate human language.
23.2 NLP Workflow
• Steps:
o Text acquisition
o Text preprocessing
o Text representation
o Model building
o Model evaluation
o Model deployment
24.1 Tokenization
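• Example (a minimal NLTK sketch; the original example was lost in extraction):
python
import nltk
from nltk.tokenize import word_tokenize
# nltk.download('punkt')  # one-time download of the tokenizer models
text = "Natural language processing with Python."
tokens = word_tokenize(text)
print(tokens)
# Output: ['Natural', 'language', 'processing', 'with', 'Python', '.']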
24.2 Stop Words Removal
• Definition: Removing common words that do not contribute to the meaning of the
text.
• Example:
python
from nltk.corpus import stopwords
# nltk.download('stopwords')  # one-time download
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
# Output: ['Natural', 'language', 'processing', 'Python', '.']
24.3 Stemming and Lemmatization
• Example (Stemming):
python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in filtered_tokens]
print(stems)
# Output: ['natur', 'languag', 'process', 'python', '.']
• Example (Lemmatization):
python
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # one-time download
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmas)
# Output: ['Natural', 'language', 'processing', 'Python', '.']
24.4 Cheat Sheet:
Chapter 25: Text Representation
• Example corpus (a sketch; the original documents were lost in extraction):
python
corpus = [
    'machine learning is fascinating',
    'python is great for data science',
    'data science uses machine learning'
]
25.2 TF-IDF
python
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
print(X_tfidf.toarray())
# Output: a documents x vocabulary matrix of TF-IDF weights
# (exact values depend on the corpus)
25.3 Word Embeddings
python
from gensim.models import Word2Vec
sentences = [
    ['machine', 'learning', 'is', 'fascinating'],
    ['python', 'is', 'great', 'for', 'data', 'science']
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv['python'])
# Output: [vector representation of the word 'python']
26.1 Sentiment Analysis
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
# Sample labeled data (a sketch; the original data loading was lost in extraction)
texts = ["I love this product", "Terrible experience", "Great value", "Not worth it"]
labels = [1, 0, 1, 0]
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)
# Vectorize text
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# Train model
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
26.2 Cheat Sheet:
Chapter 27: Named Entity Recognition (NER)
27.1 Understanding NER
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
# Output: Apple ORG
#         U.K. GPE
#         $1 billion MONEY
27.3 Workflow Diagrams:
Chapter 28: Machine Translation
• Definition: Translating text from one language to another using machine learning
models.
• Example:
python
from transformers import MarianMTModel, MarianTokenizer
model_name = 'Helsinki-NLP/opus-mt-en-es'
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)
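A quick usage sketch (the input sentence is illustrative):
python
text = "Data science is fascinating."
inputs = tokenizer(text, return_tensors="pt")
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))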
28.3 Workflow Diagrams:
Additional Resources
• Cheat Sheets:
o NLP Preprocessing Cheat Sheet
o Text Classification Algorithms Cheat Sheet
• Recommended Books and Courses:
o "Speech and Language Processing" by Daniel Jurafsky and James H. Martin
o Online courses on platforms like Coursera (Deeplearning.ai’s NLP
Specialization), edX, and Udacity
This expanded part of the book provides a comprehensive guide to natural language
processing techniques, including text preprocessing, text representation, text classification,
named entity recognition, and machine translation, with practical examples, explanations, and
additional resources to help readers effectively apply NLP to real-world problems.
Part VII: Deployment and Production
Chapter 29: Introduction to Deployment
29.2 Deployment Workflow
Chapter 30: Model Serialization and Saving
• Example:
python
import pickle
from sklearn.linear_model import LogisticRegression
# Train a model (assumes X_train and y_train from earlier)
model = LogisticRegression()
model.fit(X_train, y_train)
# Save the model to disk (the saving step was lost in extraction)
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
30.2 Saving and Loading Models with Joblib
• Example:
python
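# The original snippet was lost in extraction; a minimal joblib sketch:
from joblib import dump, load
dump(model, 'model.joblib')
loaded_model = load('model.joblib')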
• Example (a minimal Flask sketch; most of the original app was lost in extraction):
python
from flask import Flask, request, jsonify
from joblib import load
app = Flask(__name__)
model = load('model.joblib')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    prediction = model.predict([data['features']])
    return jsonify({'prediction': prediction.tolist()})
if __name__ == '__main__':
    app.run(debug=True)
• Example:
bash
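# The original command was lost; a typical test of the Flask endpoint (payload assumed):
curl -X POST -H "Content-Type: application/json" \
     -d '{"features": [1.0, 2.0, 3.0]}' \
     http://localhost:5000/predict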
31.3 System Design and Workflow Diagrams:
• Example:
python
from fastapi import FastAPI
from pydantic import BaseModel
from joblib import load
class Features(BaseModel):
    features: list
app = FastAPI()
model = load('model.joblib')
@app.post('/predict')
def predict(data: Features):
    prediction = model.predict([data.features])
    return {'prediction': prediction.tolist()}
if __name__ == '__main__':
    import uvicorn
    uvicorn.run(app, host='0.0.0.0', port=8000)
32.2 Testing the FastAPI
• Example:
bash
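# The original command was lost; a typical test of the FastAPI endpoint (payload assumed):
curl -X POST -H "Content-Type: application/json" \
     -d '{"features": [1.0, 2.0, 3.0]}' \
     http://localhost:8000/predict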
• Example:
dockerfile
FROM python:3.8-slim
WORKDIR /app
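# The remaining instructions were lost in extraction; a typical completion (paths assumed):
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]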
33.2 Building and Running the Docker Container
• Example:
bash
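# The original commands were lost; typical build-and-run steps (image tag assumed):
docker build -t my_model_api .
docker run -p 5000:5000 my_model_api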
34.2 Creating a Kubernetes Deployment
• Example:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-model-api
  template:
    metadata:
      labels:
        app: my-model-api
    spec:
      containers:
        - name: my-model-api
          image: my_model_api:latest
          ports:
            - containerPort: 5000
• Example:
yaml
apiVersion: v1
kind: Service
metadata:
  name: my-model-api-service
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 5000
  selector:
    app: my-model-api
Chapter 35: Monitoring and Maintenance
• Prometheus Setup:
yaml
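# The original configuration was lost; a minimal scrape config sketch
# (job name and target assumed, matching the service defined above):
scrape_configs:
  - job_name: 'model-api'
    static_configs:
      - targets: ['my-model-api-service:80']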
• Grafana Setup:
o Connect Grafana to Prometheus data source.
o Create dashboards to monitor API performance and resource usage.
35.3 System Design and Workflow Diagrams:
• Steps:
Chapter 36: Case Study: Deploying a Real-World NLP Model
36.1 Problem Definition
36.3 Step-by-Step Implementation
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Sample data
texts = ["I love this movie", "I hate this movie", "This movie is okay"]
labels = [1, 0, 1]
# Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
# Train model
model = MultinomialNB()
model.fit(X, labels)
# Save model and vectorizer
from joblib import dump
dump(model, 'sentiment_model.joblib')
dump(vectorizer, 'vectorizer.joblib')
from flask import Flask, request, jsonify
from joblib import load
app = Flask(__name__)
model = load('sentiment_model.joblib')
vectorizer = load('vectorizer.joblib')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = vectorizer.transform([data['text']])
    prediction = model.predict(features)
    return jsonify({'prediction': int(prediction[0])})
if __name__ == '__main__':
    app.run(debug=True)
dockerfile
FROM python:3.8-slim
WORKDIR /app
# (remaining instructions assumed; the original Dockerfile was truncated)
COPY . .
RUN pip install flask joblib scikit-learn
CMD ["python", "app.py"]
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-model-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sentiment-model-api
  template:
    metadata:
      labels:
        app: sentiment-model-api
    spec:
      containers:
        - name: sentiment-model-api
          image: sentiment_model_api:latest
          ports:
            - containerPort: 5000
Additional Resources
• Cheat Sheets:
o Docker Commands Cheat Sheet
o Kubernetes Commands Cheat Sheet
• Recommended Books and Courses:
o "Kubernetes Up & Running" by Kelsey Hightower, Brendan Burns, and Joe
Beda
o Online courses on platforms like Coursera, edX, and Udacity
This expanded part of the book provides a comprehensive guide to deploying machine
learning models, including model serialization, creating APIs with Flask and FastAPI,
containerizing with Docker, orchestrating with Kubernetes, and monitoring and maintaining
deployed models, with practical examples, explanations, and additional resources to help
readers effectively deploy models in real-world scenarios.
Part VIII: Case Studies and Real-life Applications
Chapter 37: Predictive Maintenance in Manufacturing
37.1 Problem Definition
37.2 Data Collection
37.3 Data Preprocessing
• Example:
python
import pandas as pd
data = pd.read_csv('sensor_data.csv')
data['timestamp'] = pd.to_datetime(data['timestamp'])
data.set_index('timestamp', inplace=True)
37.4 Feature Engineering
• Example:
python
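# The original example was lost in extraction; a sketch of the lag features used
# below (sensor column names 'temperature' and 'vibration' are assumptions):
data_resampled = data.resample('H').mean()
data_resampled['temp_lag1'] = data_resampled['temperature'].shift(1)
data_resampled['vibration_lag1'] = data_resampled['vibration'].shift(1)
data_resampled.dropna(inplace=True)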
37.5 Model Training
• Example:
python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X = data_resampled[['temp_lag1', 'vibration_lag1']]
y = data_resampled['failure']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
model = RandomForestClassifier()
model.fit(X_train, y_train)
37.6 Model Evaluation
• Example:
python
from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
37.7 Deployment
• Example: Deploy the model using Flask, Docker, and Kubernetes (as detailed in Part
VII).
Chapter 38: Customer Segmentation in Retail
38.1 Problem Definition
38.2 Data Collection
38.3 Data Preprocessing
• Example:
python
import pandas as pd
data = pd.read_csv('transaction_data.csv')
data['transaction_date'] = pd.to_datetime(data['transaction_date'])
# Aggregate data by customer
customer_data = data.groupby('customer_id').agg({
'transaction_date': 'max',
'amount_spent': ['sum', 'mean', 'count']
}).reset_index()
38.4 Feature Engineering
• Example:
python
# Flatten the multi-level columns from agg (names chosen to match the segment analysis below)
customer_data.columns = ['customer_id', 'last_purchase', 'total_spent', 'avg_spent', 'purchase_count']
# Calculate recency
current_date = pd.to_datetime('2023-01-01')
customer_data['recency'] = (current_date - customer_data['last_purchase']).dt.days
# Drop unnecessary columns
customer_data.drop(columns=['last_purchase'], inplace=True)
38.5 Clustering
• Example:
python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Scale features and cluster (a sketch; the original setup was lost in extraction)
features = ['recency', 'total_spent', 'avg_spent', 'purchase_count']
scaler = StandardScaler()
scaled_features = scaler.fit_transform(customer_data[features])
kmeans = KMeans(n_clusters=4, random_state=42)  # number of segments assumed
customer_data['segment'] = kmeans.fit_predict(scaled_features)
38.6 Segment Analysis
• Example:
python
segment_summary = customer_data.groupby('segment').agg({
'recency': 'mean',
'total_spent': 'mean',
'avg_spent': 'mean',
'purchase_count': 'mean'
}).reset_index()
print(segment_summary)
38.7 Deployment
• Example: Deploy the model using Flask, Docker, and Kubernetes (as detailed in Part
VII).
Chapter 39: Fraud Detection in Finance
39.1 Problem Definition
39.2 Data Collection
39.3 Data Preprocessing
• Example:
python
import pandas as pd
data = pd.read_csv('fraud_data.csv')
data['timestamp'] = pd.to_datetime(data['timestamp'])
# Extract features from timestamp
data['hour'] = data['timestamp'].dt.hour
data['day'] = data['timestamp'].dt.dayofweek
39.4 Model Training
• Example:
python
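# The original example was lost in extraction; a sketch using IsolationForest,
# which is implied by the -1/0 label mapping in the evaluation step below
# (feature and label column names are assumptions):
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
X = data[['amount', 'hour', 'day']]
y = data['is_fraud']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(X_train)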
39.5 Model Evaluation
• Example:
python
from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
# IsolationForest labels anomalies as -1; map to 1 (fraud) / 0 (normal)
y_pred = [1 if x == -1 else 0 for x in y_pred]
print(classification_report(y_test, y_pred))
39.6 Deployment
• Example: Deploy the model using Flask, Docker, and Kubernetes (as detailed in Part
VII).
Chapter 40: Image Classification
40.1 Problem Definition
40.2 Data Collection
40.3 Data Preprocessing
• Example:
python
import tensorflow as tf
# Dataset parameters (values assumed; the original definitions were lost in extraction)
data_dir = 'data/images'
image_size = (128, 128)
batch_size = 32
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=image_size,
    batch_size=batch_size
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="validation",
seed=123,
image_size=image_size,
batch_size=batch_size
)
40.4 Model Training
• Example:
python
from tensorflow.keras import layers, models
model = models.Sequential([
layers.Rescaling(1./255, input_shape=(128, 128, 3)),
layers.Conv2D(32, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(128, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
history = model.fit(
train_ds,
validation_data=val_ds,
epochs=10
)
40.5 Model Evaluation
• Example:
python
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
# Loss curves (the original lines after plt.figure() were lost in extraction)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
40.6 Deployment
• Example: Deploy the model using Flask, Docker, and Kubernetes (as detailed in Part
VII).
Chapter 41: Sentiment Analysis in Social Media
41.1 Problem Definition
• Scenario: Analyze social media posts to determine public sentiment about a product
or service.
41.2 Data Collection
41.3 Data Preprocessing
• Example:
python
import tweepy
# Authenticate to Twitter
auth = tweepy.OAuthHandler('API_KEY', 'API_SECRET_KEY')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)
# Fetch tweets
tweets = api.search(q="product", lang="en", count=100)
tweet_data = []
for tweet in tweets:
    tweet_data.append(tweet.text)
41.4 Text Cleaning
• Example:
python
import re
def clean_tweet(tweet):
    tweet = re.sub(r'http\S+', '', tweet)
    tweet = re.sub(r'@\w+', '', tweet)
    tweet = re.sub(r'#\w+', '', tweet)
    tweet = re.sub(r'\W', ' ', tweet)
    tweet = tweet.lower()
    return tweet
41.5 Model Training
• Example:
python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
# Clean the collected tweets
clean_tweets = [clean_tweet(t) for t in tweet_data]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean_tweets)
y = [1 if 'positive' in tweet else 0 for tweet in clean_tweets]  # Dummy labels
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
model = MultinomialNB()
model.fit(X_train, y_train)
41.6 Model Evaluation
• Example:
python
from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
41.7 Deployment
• Example: Deploy the model using Flask, Docker, and Kubernetes (as detailed in Part VII).
Additional Resources
• Cheat Sheets:
o Pandas Cheat Sheet
o Scikit-Learn Cheat Sheet
o TensorFlow Cheat Sheet
• Recommended Books and Courses:
o "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by
Aurélien Géron
o Online courses on platforms like Coursera, edX, and Udacity
This expanded part of the book provides a comprehensive guide to various real-life
applications of data science, including predictive maintenance, customer segmentation, fraud
detection, image classification, and sentiment analysis, with practical examples, explanations,
and additional resources to help readers apply data science techniques to real-world
problems.
Part IX: Cheat Sheets and Resources
Chapter 42: Python for Data Science Cheat Sheet
42.1 Python Basics
• Example:
python
# Lists
fruits = ["apple", "banana", "cherry"]
print(fruits[0]) # Output: apple
# Dictionaries
student = {"name": "John", "age": 21, "courses": ["Math", "CompSci"]}
print(student["name"]) # Output: John
# Loops
for fruit in fruits:
    print(fruit)
# Functions
def greet(name):
    return f"Hello, {name}!"
print(greet("Alice"))  # Output: Hello, Alice!
42.2 Pandas
• Cheat Sheet:
o Download a comprehensive Pandas Cheat Sheet.
• Example:
python
import pandas as pd
# Creating DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)
# Display DataFrame
print(df)
# Selecting columns
print(df['Name'])
# Filtering rows
print(df[df['Age'] > 30])
# Grouping data (selecting the numeric column avoids errors in recent pandas)
grouped = df.groupby('City')['Age'].mean()
print(grouped)
42.3 NumPy
• Cheat Sheet:
o Download a comprehensive NumPy Cheat Sheet.
• Example:
python
import numpy as np
# Creating arrays
arr = np.array([1, 2, 3, 4, 5])
print(arr)
# Array operations
print(arr + 5)
print(arr * 2)
# Matrix operations
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
print(np.dot(mat1, mat2))
# Reshaping arrays
print(arr.reshape(1, 5))
Chapter 43: Scikit-Learn Cheat Sheet
43.1 Model Training
• Cheat Sheet:
o Download a comprehensive Scikit-Learn Cheat Sheet.
• Example:
python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Dummy data (a sketch; the original setup was lost in extraction)
X = np.random.random((100, 4))
y = np.random.randint(2, size=100)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
predictions = model.predict(X_test)
print(predictions)
43.2 Model Evaluation
• Example:
python
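# The original example was lost in extraction; a sketch evaluating the
# predictions from the previous block:
from sklearn.metrics import accuracy_score, classification_report
print(accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))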
Chapter 44: TensorFlow Cheat Sheet
44.1 Basic Operations
• Cheat Sheet:
o Download a comprehensive TensorFlow Cheat Sheet.
• Example:
python
import tensorflow as tf
# Tensors
a = tf.constant(2)
b = tf.constant(3)
# Basic Operations
print(tf.add(a, b))
print(tf.multiply(a, b))
# Creating Variables
W = tf.Variable(tf.random.normal([3, 3]))
print(W)
• Example:
python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Define model
model = Sequential([
    Dense(32, activation='relu', input_shape=(10,)),
Dense(64, activation='relu'),
Dense(1, activation='sigmoid')
])
# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy',
metrics=['accuracy'])
# Dummy data
import numpy as np
X = np.random.random((100, 10))
y = np.random.randint(2, size=(100, 1))
# Train model
model.fit(X, y, epochs=10, batch_size=32)
44.3 Model Evaluation
• Example:
python
# Evaluate model
loss, accuracy = model.evaluate(X, y)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")
Chapter 45: Data Visualization Cheat Sheet
45.1 Matplotlib
• Cheat Sheet:
o Download a comprehensive Matplotlib Cheat Sheet.
• Example:
python
import matplotlib.pyplot as plt
# Line plot
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.title("Line Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
# Bar plot
plt.bar(['A', 'B', 'C'], [5, 7, 3])
plt.title("Bar Plot")
plt.xlabel("Categories")
plt.ylabel("Values")
plt.show()
45.2 Advanced Plotting with Seaborn
• Cheat Sheet:
o Download a comprehensive Seaborn Cheat Sheet.
• Example:
python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Sample data (assumed for illustration)
data = pd.DataFrame({'x': range(10), 'y': np.random.rand(10)})
# Line plot
sns.lineplot(data=data, x='x', y='y')
plt.title("Seaborn Line Plot")
plt.show()
# Heatmap
matrix = np.random.rand(10, 12)
sns.heatmap(matrix, annot=True)
plt.title("Seaborn Heatmap")
plt.show()
Chapter 46: Deployment Cheat Sheet
46.1 Docker Commands
• Cheat Sheet:
o Download a comprehensive Docker Commands Cheat Sheet.
• Example:
bash
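# The original commands were lost; common Docker commands (image name assumed):
docker build -t myapp .
docker images
docker run -d -p 5000:5000 myapp
docker ps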
46.2 Kubernetes Commands
• Cheat Sheet:
o Download a comprehensive Kubernetes Commands Cheat Sheet.
• Example:
bash
# Apply a deployment
kubectl apply -f deployment.yaml
# Get pods
kubectl get pods
# Describe service
kubectl describe service myservice
46.3 Jenkins Pipelines
• Cheat Sheet:
o Download a comprehensive Jenkins Pipeline Syntax Cheat Sheet.
• Example:
groovy
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'echo Building...'
            }
        }
        stage('Test') {
            steps {
                sh 'echo Testing...'
            }
        }
        stage('Deploy') {
            steps {
                sh 'echo Deploying...'
            }
        }
    }
}
Additional Resources
• Cheat Sheets:
o Python Cheat Sheet
o Pandas Cheat Sheet
o Scikit-Learn Cheat Sheet
o TensorFlow Cheat Sheet
o Matplotlib Cheat Sheet
• Recommended Books and Courses:
o "Python Data Science Handbook" by Jake VanderPlas
o "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by
Aurélien Géron
o Online courses on platforms like Coursera, edX, and Udacity
Part X: Appendices
Chapter 47: Appendix A: Mathematical Foundations
47.1 Linear Algebra
• Key Concepts:
o Vectors, Matrices, Dot Product, Matrix Multiplication, Eigenvalues,
Eigenvectors
• Example:
python
import numpy as np
# Vectors
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
# Dot Product
dot_product = np.dot(v1, v2)
print("Dot Product:", dot_product)
# Matrices
m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5, 6], [7, 8]])
# Matrix Multiplication
matrix_product = np.dot(m1, m2)
print("Matrix Product:\n", matrix_product)
# Eigenvalues and Eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(m1)
print("Eigenvalues:", eigenvalues)
47.2 Calculus
• Key Concepts:
o Derivatives, Integrals, Gradient Descent
• Example:
python
import sympy as sp
# Define a symbol
x = sp.symbols('x')
# Define a function
f = x**2 + 3*x + 2
# Derivative
derivative = sp.diff(f, x)
print("Derivative:", derivative)
# Integral
integral = sp.integrate(f, x)
print("Integral:", integral)
47.3 Probability and Statistics
• Key Concepts:
o Probability Distributions, Mean, Median, Standard Deviation, Hypothesis
Testing
• Example:
python
import numpy as np
import scipy.stats as stats
# Mean and Standard Deviation
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean = np.mean(data)
std_dev = np.std(data)
print("Mean:", mean)
print("Standard Deviation:", std_dev)
# Probability Distribution
norm_dist = stats.norm(loc=mean, scale=std_dev)
print("Probability of value 5:", norm_dist.pdf(5))
# Hypothesis Testing
t_stat, p_value = stats.ttest_1samp(data, 5)
print("T-statistic:", t_stat)
print("P-value:", p_value)
47.4 Cheat Sheets
Chapter 48: Appendix B: Python Reference
• Example:
python
48.2 File Handling
• Example:
python
# Writing to a file
with open('example.txt', 'w') as file:
    file.write("Hello, World!")
# Reading it back
with open('example.txt', 'r') as file:
    print(file.read())
48.3 Error Handling
• Example:
python
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Error: Division by zero!")
• Cheat Sheets:
o Pandas Cheat Sheet
o NumPy Cheat Sheet
o Matplotlib Cheat Sheet
• "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by
Aurélien Géron
49.2Online Courses
50.1 Tools
• Jupyter Notebooks
• Google Colab
• Anaconda
50.2 Abbreviations