Comprehensive Data Science Cheat Sheet
(Python)
1. Core Imports
None
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
2. NumPy (np) - Array & Math Operations
Operation Syntax
Create Arrays arr = np.array([1, 2, 3]), zeros = np.zeros((3, 4)), ones = np.ones(5), r_nums
= np.random.rand(3, 3) , r_int = np.random.randint(0, 10, 5), range_arr =
np.arange(0, 10, 2)
Inspect Arrays arr.shape (e.g., (3, 4)), arr.dtype (e.g., dtype('int64')), arr.ndim (e.g., 2 for 2D)
Reshape Arrays arr_flat = arr.flatten() , arr_reshaped = arr.reshape(3, 4)
Math Operations np.mean(arr) , np.median(arr), np.std(arr) (Standard Deviation), np.sum(arr),
np.log(arr) (Natural Log), np.exp(arr) (Exponential), np.sqrt(arr)
Axis-Specific arr.mean(axis=0) (Mean of each column), arr.sum(axis=1) (Sum of each
row)
Indexing arr[0, 3] (Row 0, Col 3) , arr[1:3, :] (Rows 1-2, all columns)
Boolean arr[arr > 5] (Get all elements > 5)
3. Pandas (pd) - DataFrames & Manipulation
I/O & Inspection
● Load Data: df = pd.read_csv('file.csv')
● Create DF: df = pd.DataFrame({'col1': [1,2], 'col2': [3,4]})
● See Data: df.head() (First 5 rows)
● Get Info: df.info() (Column types, non-null counts)
● Get Stats: df.describe() (Mean, min, max, quartiles)
● See Shape: df.shape (Rows, Cols)
● List Columns: df.columns
● Unique Values: df['col'].value_counts()
● Check Nulls: df.isnull().sum()
Selection & Filtering
● Select 1 Col: df['col_name'] (Returns a Series)
● Select 2+ Cols: df[['col1', 'col2']] (Returns a DataFrame)
● Select by Label: df.loc[index_label, 'col_name']
● Select by Position: df.iloc[row_index, col_index]
● Boolean Filter: df[df['age'] > 30]
● Multi-Filter: df[(df['age'] > 30) & (df['dept'] == 'Sales')]
● .isin() Filter: df[df['dept'].isin(['Sales', 'IT'])]
Data Cleaning
● Drop Nulls: df.dropna() (Drops rows with any nulls)
● Fill Nulls (All): df.fillna(value=0)
● Fill Nulls (Mean): mean_val = df['age'].mean()
df['age'] = df['age'].fillna(mean_val)
● Fill Nulls (Mode): mode_val = df['category'].mode()[0]
df['category'] = df['category'].fillna(mode_val)
● Change Type: df['col'] = df['col'].astype(int)
● Rename Cols: df = df.rename(columns={'old_name': 'new_name'})
● Drop Col: df = df.drop('col_name', axis=1)
● Find Duplicates: df.duplicated().sum()
● Drop Duplicates: df = df.drop_duplicates()
Grouping & Aggregating
● Group & Agg: df.groupby('dept')['salary'].mean()
● Multi-Agg: stats = df.groupby('dept').agg({ 'salary': 'mean', 'age': ['min', 'max'],
'employee_id': 'count' })
● Reset Index: stats = stats.reset_index()
Creating Columns & Combining
● New Column: df['new_col'] = df['col1'] * 2
● Apply Function: df['col'].apply(lambda x: x * 10)
● Apply Row-wise: df.apply(my_func, axis=1)
● Map Values: df['col'] = df['col'].map({'A': 1, 'B': 2})
● Join (Merge): merged = pd.merge(df1, df2, on='key', how='left')
● Stack (Concat): stacked = pd.concat([df1, df2], axis=0)
Using with Pandas (Most Common)
None
# --- Convert a column to datetime ---
df['date_col'] = pd.to_datetime(df['date_col'])
df['date_col'] = pd.to_datetime(df['date_col'],
format='%m/%d/%Y')
df['year'] = df['date_col'].dt.year
df['month'] = df['date_col'].dt.month
df['day'] = df['date_col'].dt.day
4. Scikit-learn (sklearn) - Machine Learning
1. Preprocessing & Splitting
● Import:
None
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler,
OneHotEncoder
● Split Data:
None
X = df[['feature1', 'feature2']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
● Scale Features (Fit on Train ONLY):
None
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
2. Models (Common)
● Linear Regression:
None
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_scaled, y_train)
● Logistic Regression (Classification):
None
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
● Random Forest:
None
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
3. Prediction & Evaluation
● Import:
None
from sklearn.metrics import (
accuracy_score, confusion_matrix, classification_report,
mean_squared_error, r2_score
)
● Predict:
None
predictions = model.predict(X_test_scaled)
# Get probabilities (for classification)
probs = model.predict_proba(X_test_scaled)
● Evaluation (Classification):
None
print(f"Accuracy: {accuracy_score(y_test, predictions)}")
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
● Evaluation (Regression):
None
print(f"MSE: {mean_squared_error(y_test, predictions)}")
print(f"R-squared: {r2_score(y_test, predictions)}")
5. SciPy (stats) - Statistical Tests
● Import: import scipy.stats as stats
● T-Test (Independent): t_stat, p_val = stats.ttest_ind(sample1,
sample2)
● Correlation: corr, p_val = stats.pearsonr(x, y)
● Chi-Square: chi2, p_val, dof, expected =
stats.chi2_contingency(contingency_table)
● Z-Score: z_scores = stats.zscore(my_array)
6. Matplotlib (plt) & Seaborn (sns) - Plotting
● Setup: plt.figure(figsize=(10, 6))
● Scatter Plot: sns.scatterplot(x='col_x', y='col_y', data=df)
● Histogram: sns.histplot(data=df, x='col', bins=30, kde=True)
● Box Plot: sns.boxplot(x='category_col', y='value_col', data=df)
● Heatmap: sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
● Final Touches
None
plt.title('My Plot Title')
plt.xlabel('X-Axis Label')
plt.ylabel('Y-Axis Label')
plt.legend()
plt.show() # Display the plot
SQL Syntax Cheat Sheet (for Data
Science)
1. Basic Queries
● Select All Columns:
None
SELECT * FROM my_table;
● Select Specific Columns:
None
SELECT column1, column2 FROM my_table;
● Select with an Alias (Nickname):
None
SELECT column1 AS "New Name", column2
FROM my_table;
● Limit Results:
None
SELECT * FROM my_table
LIMIT 10;
● Select Unique Values:
None
SELECT DISTINCT column1 FROM my_table;
● Order Results:
None
SELECT * FROM my_table
ORDER BY column1 ASC; -- ASC (default) or DESC
2. Filtering (WHERE)
● Basic Conditions:
None
SELECT * FROM my_table
WHERE column1 = 'value';
● Numeric Conditions:
None
SELECT * FROM my_table
WHERE column1 > 100;
● Multiple Conditions:
None
SELECT * FROM my_table
WHERE column1 = 'value' AND column2 > 50;
● OR Condition:
None
SELECT * FROM my_table
WHERE column1 = 'value' OR column2 IS NOT NULL;
● Common Operators:
○ =, != (or <>), >, <, >=, <=
○ AND, OR, NOT
○ BETWEEN 10 AND 20
○ IN ('val1', 'val2')
○ LIKE 'a%' (% = wildcard, _ = single char)
○ IS NULL
○ IS NOT NULL
3. Aggregation (GROUP BY)
● Common Functions:
○ COUNT(*): Total count of rows
○ COUNT(column): Count of non-null values
○ COUNT(DISTINCT column): Count of unique values
○ SUM(column)
○ AVG(column)
○ MIN(column)
○ MAX(column)
● Basic Group By:
None
SELECT department, AVG(salary)
FROM employees
GROUP BY department;
● Group By with Filtering (HAVING):
○ WHERE filters before grouping.
○ HAVING filters after grouping.
None
SELECT
department,
COUNT(*) AS num_employees
FROM
employees
WHERE
salary > 30000 -- Filters individual employees first
GROUP BY
department
HAVING
COUNT(*) > 5; -- Filters departments with > 5 members
4. Joining Tables
● INNER JOIN (Default): Returns only rows that match in both tables.
None
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id;
● LEFT JOIN: Returns all rows from the left table (employees) and matching rows from
the right. If no match, right table columns are NULL.
None
SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id;
● JOIN with Multiple Conditions:
None
SELECT *
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id AND t1.date = t2.date;
5. Common Functions
● CASE Statement (IF/THEN Logic):
None
SELECT
name,
salary,
CASE
WHEN salary > 80000 THEN 'High'
WHEN salary > 50000 THEN 'Medium'
ELSE 'Low'
END AS salary_tier
FROM employees;
● Subqueries (Query within a query):
None
SELECT name
FROM employees
WHERE department_id IN (
SELECT id FROM departments WHERE location = 'New York'
);