4-Hour "Full Stack Data Science" Project Cheat Sheet

Step 1: Get Data (The SQL-in-Python Part)

Imports:

import pandas as pd
import numpy as np
import sqlite3                # <-- This is for running SQL
import re                     # For text cleaning
from datetime import datetime

# --- Modeling Imports ---
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier   # (Good default choice)
from sklearn.linear_model import LogisticRegression   # (Simpler choice)
from sklearn.metrics import accuracy_score, classification_report

Connect & Load Data

# 1. Create a connection
conn = sqlite3.connect('the_database_file_name.db')

# 2. (Optional) See what tables are in the database:
tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql_query(tables_query, conn)
print("Tables in the DB:")
print(tables)

# 3. Write your SQL query (as a multi-line string)
# (Use SQL to do heavy joins/filtering *before* Python)
sql_query = """
SELECT
    c.customer_id,
    c.age,
    c.signup_date,
    COUNT(t.transaction_id) AS purchase_count,
    SUM(t.amount) AS total_spend
FROM
    customers c
LEFT JOIN
    transactions t ON c.customer_id = t.customer_id
GROUP BY
    c.customer_id
"""

# 4. Load the SQL query result *directly* into a pandas DataFrame
df = pd.read_sql_query(sql_query, conn)

# 5. Close the connection
conn.close()

# (If they also give a CSV, just use this):
# df_csv = pd.read_csv('another_file.csv')

Step 2: Clean & Explore Data (EDA)

First Look (Find Problems)

df.info()                 # <-- MOST IMPORTANT. Check for nulls & wrong data types
print(df.head())          # See what the data looks like
print(df.describe())      # Get stats (mean, median, etc.)
print(df.isnull().sum())  # See null counts per column

Common Cleaning

# --- Fill Missing Numbers (e.g., 'age') ---
median_val = df['age'].median()
df['age'] = df['age'].fillna(median_val)

# --- Fill Missing Text (e.g., 'category') ---
mode_val = df['category'].mode()[0]
df['category'] = df['category'].fillna(mode_val)

# --- Fix 'object' columns that should be numbers ---
# e.g., "$1,250.75" -> 1250.75
df['price'] = df['price'].str.replace(r'[$,]', '', regex=True).astype(float)

# --- Fix 'object' columns that should be dates ---
df['signup_date'] = pd.to_datetime(df['signup_date'])

Step 3: Feature Engineering (How You Win)

From Dates (after pd.to_datetime)

df['signup_year'] = df['signup_date'].dt.year
df['signup_month'] = df['signup_date'].dt.month
df['signup_day_of_week'] = df['signup_date'].dt.dayofweek  # (Mon=0, Sun=6)
df['is_weekend'] = df['signup_day_of_week'].isin([5, 6]).astype(int)

From Text (using re)

# e.g., Extract area code from '(555) 123-4567'
df['area_code'] = df['phone'].str.extract(r'\((\d{3})\)')

From Numbers (Binning)

# Group ages into categories
def age_group(age):
    if age < 30: return '18-29'
    elif age < 50: return '30-49'
    else: return '50+'

df['age_group'] = df['age'].apply(age_group)

Turn Categories into Dummies (for Modeling)

# Creates new 0/1 columns for each category
df = pd.get_dummies(df, columns=['age_group', 'region'], drop_first=True)

Step 4: Modeling (Train & Evaluate)

1. Define X (Features) and y (Target)

# 'y' is the one column you want to predict
y = df['churn']

# 'X' is all the features. Drop the target AND any ID/non-numeric columns!
X = df.drop(['churn', 'customer_id', 'name', 'signup_date'], axis=1)

2. Split Data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

3. Scale Features (Very Important!)

scaler = StandardScaler()
# Fit on train data
X_train_scaled = scaler.fit_transform(X_train)
# ONLY transform on test data (never re-fit the scaler on it)
X_test_scaled = scaler.transform(X_test)

4. Train Model

# Use a good, all-around model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

5. Evaluate Model (on your test set)

predictions = model.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, predictions):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, predictions))

Step 5: Final Submission (The Deliverable)

You will be given a new "test" file with no target column. You must apply your full pipeline (cleaning, feature engineering, dummies, scaling) to it before predicting, as sketched below.
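The first four steps below are a minimal sketch of that preprocessing half, so the later "# 5" block has a df_final_test and X_final_test_scaled to work with. The file name 'final_test.csv', the exact columns, and the reindex trick for lining up dummy columns are assumptions for illustration; mirror whatever you actually did in Steps 2-3.

# 1. Load the new test file (file name is an assumption)
df_final_test = pd.read_csv('final_test.csv')

# 2. Repeat the SAME cleaning as Step 2, reusing the fill values
#    learned from the training data (repeat any other fixes, e.g. 'price')
df_final_test['age'] = df_final_test['age'].fillna(median_val)
df_final_test['category'] = df_final_test['category'].fillna(mode_val)
df_final_test['signup_date'] = pd.to_datetime(df_final_test['signup_date'])

# 3. Repeat the SAME feature engineering + dummies as Step 3
df_final_test['signup_year'] = df_final_test['signup_date'].dt.year
df_final_test['signup_month'] = df_final_test['signup_date'].dt.month
df_final_test['signup_day_of_week'] = df_final_test['signup_date'].dt.dayofweek
df_final_test['is_weekend'] = df_final_test['signup_day_of_week'].isin([5, 6]).astype(int)
df_final_test['age_group'] = df_final_test['age'].apply(age_group)
df_final_test = pd.get_dummies(df_final_test, columns=['age_group', 'region'], drop_first=True)

# 4. Keep the same feature columns (same order) as training, filling any
#    missing dummy columns with 0, then scale with the ALREADY-FITTED scaler
X_final_test = df_final_test.reindex(columns=X.columns, fill_value=0)
X_final_test_scaled = scaler.transform(X_final_test)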

# 5. Make Final Predictions
final_predictions = model.predict(X_final_test_scaled)

# 6. Create Submission File
submission = pd.DataFrame({
    'customer_id': df_final_test['customer_id'],  # Get ID from the test file
    'prediction': final_predictions
})

# 7. Save to CSV (index=False is VITAL)
submission.to_csv('my_submission.csv', index=False)

print("Submission file created successfully!")
