Comprehensive Data Science Cheat Sheet (Python)
1. Core Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

2. NumPy (np) - Array & Math Operations


● Create Arrays: arr = np.array([1, 2, 3]), zeros = np.zeros((3, 4)), ones = np.ones(5), r_nums = np.random.rand(3, 3), r_int = np.random.randint(0, 10, 5), range_arr = np.arange(0, 10, 2)
● Inspect Arrays: arr.shape (e.g., (3, 4)), arr.dtype (e.g., dtype('int64')), arr.ndim (e.g., 2 for 2D)
● Reshape Arrays: arr_flat = arr.flatten(), arr_reshaped = arr.reshape(3, 4)
● Math Operations: np.mean(arr), np.median(arr), np.std(arr) (standard deviation), np.sum(arr), np.log(arr) (natural log), np.exp(arr) (exponential), np.sqrt(arr)
● Axis-Specific: arr.mean(axis=0) (mean of each column), arr.sum(axis=1) (sum of each row)
● Indexing: arr[0, 3] (row 0, col 3), arr[1:3, :] (rows 1-2, all columns)
● Boolean Filter: arr[arr > 5] (all elements > 5)
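
A minimal runnable sketch tying these operations together (the array here is illustrative):

import numpy as np

arr = np.random.randint(0, 10, (3, 4))  # 3x4 array of random ints in [0, 10)
print(arr.shape, arr.dtype, arr.ndim)   # (3, 4) int64 2
print(arr.mean(axis=0))                 # mean of each column
print(arr.reshape(2, 6))                # same data, new shape
print(arr[arr > 5])                     # boolean filter: elements > 5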

3. Pandas (pd) - DataFrames & Manipulation


I/O & Inspection
●​ Load Data: df = pd.read_csv('file.csv')
●​ Create DF: df = pd.DataFrame({'col1': [1,2], 'col2': [3,4]})
●​ See Data: df.head() (First 5 rows)
●​ Get Info: df.info() (Column types, non-null counts)
●​ Get Stats: df.describe() (Mean, min, max, quartiles)
●​ See Shape: df.shape (Rows, Cols)
●​ List Columns: df.columns
● Value Counts: df['col'].value_counts() (count of each unique value)
●​ Check Nulls: df.isnull().sum()
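
A typical first look at a freshly loaded file (the file name and columns are hypothetical):

import pandas as pd

df = pd.read_csv('file.csv')   # hypothetical file
print(df.shape)                # (n_rows, n_cols)
df.info()                      # column dtypes and non-null counts
print(df.describe())           # mean, min, max, quartiles for numeric columns
print(df.isnull().sum())       # null count per column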

Selection & Filtering


●​ Select 1 Col: df['col_name'] (Returns a Series)
●​ Select 2+ Cols: df[['col1', 'col2']] (Returns a DataFrame)
●​ Select by Label: df.loc[index_label, 'col_name']
●​ Select by Position: df.iloc[row_index, col_index]
●​ Boolean Filter: df[df['age'] > 30]
●​ Multi-Filter: df[(df['age'] > 30) & (df['dept'] == 'Sales')]
●​ .isin() Filter: df[df['dept'].isin(['Sales', 'IT'])]
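
Combining a few of these (using the age and dept columns from the examples above):

# Boolean filter plus column selection in one .loc call
older_staff = df.loc[df['age'] > 30, ['age', 'dept']]

# Each condition needs its own parentheses when combined with & or |
sales_over_30 = df[(df['age'] > 30) & (df['dept'] == 'Sales')]
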
Data Cleaning
●​ Drop Nulls: df.dropna() (Drops rows with any nulls)
●​ Fill Nulls (All): df.fillna(value=0)
●​ Fill Nulls (Mean): mean_val = df['age'].mean()​
df['age'] = df['age'].fillna(mean_val)
●​ Fill Nulls (Mode): mode_val = df['category'].mode()[0]​
df['category'] = df['category'].fillna(mode_val)
●​ Change Type: df['col'] = df['col'].astype(int)
●​ Rename Cols: df = df.rename(columns={'old_name': 'new_name'})
●​ Drop Col: df = df.drop('col_name', axis=1)
●​ Find Duplicates: df.duplicated().sum()
●​ Drop Duplicates: df = df.drop_duplicates()
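
A short cleaning pass chaining the steps above (column names are illustrative):

df = df.drop_duplicates()                                          # remove exact duplicate rows
df['age'] = df['age'].fillna(df['age'].mean())                     # impute numeric nulls with the mean
df['category'] = df['category'].fillna(df['category'].mode()[0])   # impute categorical nulls with the mode
df['age'] = df['age'].astype(int)                                  # safe to cast once nulls are gone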

Grouping & Aggregating


●​ Group & Agg: df.groupby('dept')['salary'].mean()
● Multi-Agg: stats = df.groupby('dept').agg({'salary': 'mean', 'age': ['min', 'max'], 'employee_id': 'count'})
●​ Reset Index: stats = stats.reset_index()
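
For example, average salary and headcount per department (named aggregation, available in pandas 0.25+):

stats = df.groupby('dept').agg(
    avg_salary=('salary', 'mean'),        # new_col=(source_col, agg_func)
    headcount=('employee_id', 'count'),
).reset_index()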

Creating Columns & Combining


●​ New Column: df['new_col'] = df['col1'] * 2
●​ Apply Function: df['col'].apply(lambda x: x * 10)
●​ Apply Row-wise: df.apply(my_func, axis=1)
●​ Map Values: df['col'] = df['col'].map({'A': 1, 'B': 2})
●​ Join (Merge): merged = pd.merge(df1, df2, on='key', how='left')
●​ Stack (Concat): stacked = pd.concat([df1, df2], axis=0)
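
A minimal merge/concat sketch (df1, df2, and the key column are hypothetical):

df1 = pd.DataFrame({'key': [1, 2], 'a': ['x', 'y']})
df2 = pd.DataFrame({'key': [2, 3], 'b': ['p', 'q']})

merged = pd.merge(df1, df2, on='key', how='left')            # 'b' is NaN where df2 has no match
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)   # rows stacked, index renumbered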

Datetime Handling with Pandas (Most Common)

# --- Convert a column to datetime ---
df['date_col'] = pd.to_datetime(df['date_col'])
df['date_col'] = pd.to_datetime(df['date_col'], format='%m/%d/%Y')  # or with an explicit format

# --- Extract date parts via the .dt accessor ---
df['year'] = df['date_col'].dt.year
df['month'] = df['date_col'].dt.month
df['day'] = df['date_col'].dt.day
4. Scikit-learn (sklearn) - Machine Learning
1. Preprocessing & Splitting
●​ Import:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

●​ Split Data:

X = df[['feature1', 'feature2']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

●​ Scale Features (Fit on Train ONLY):


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

2. Models (Common)

●​ Linear Regression:


from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_scaled, y_train)
●​ Logistic Regression (Classification):


from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

●​ Random Forest:


from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

3. Prediction & Evaluation

●​ Import:


from sklearn.metrics import (
accuracy_score, confusion_matrix, classification_report,
mean_squared_error, r2_score
)

●​ Predict:


predictions = model.predict(X_test_scaled)
# Get probabilities (for classification)
probs = model.predict_proba(X_test_scaled)
●​ Evaluation (Classification):


print(f"Accuracy: {accuracy_score(y_test, predictions)}")
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

●​ Evaluation (Regression):


print(f"MSE: {mean_squared_error(y_test, predictions)}")
print(f"R-squared: {r2_score(y_test, predictions)}")
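
Putting the whole workflow together, a minimal end-to-end classification sketch (synthetic data via make_classification so it runs stand-alone):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on train only
X_test_scaled = scaler.transform(X_test)         # reuse train statistics

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")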

5. SciPy (stats) - Statistical Tests

●​ Import: import scipy.stats as stats


● T-Test (Independent): t_stat, p_val = stats.ttest_ind(sample1, sample2)
● Correlation: corr, p_val = stats.pearsonr(x, y)
● Chi-Square: chi2, p_val, dof, expected = stats.chi2_contingency(contingency_table)
● Z-Score: z_scores = stats.zscore(my_array)
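
A runnable sketch with synthetic samples (the data here is illustrative):

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
sample1 = rng.normal(loc=0.0, scale=1.0, size=100)
sample2 = rng.normal(loc=0.5, scale=1.0, size=100)

t_stat, p_val = stats.ttest_ind(sample1, sample2)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}")   # small p suggests the means differ

corr, p_corr = stats.pearsonr(sample1, sample2)
print(f"Pearson r = {corr:.3f} (p = {p_corr:.4f})")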

6. Matplotlib (plt) & Seaborn (sns) - Plotting

●​ Setup: plt.figure(figsize=(10, 6))


●​ Scatter Plot: sns.scatterplot(x='col_x', y='col_y', data=df)
●​ Histogram: sns.histplot(data=df, x='col', bins=30, kde=True)
●​ Box Plot: sns.boxplot(x='category_col', y='value_col', data=df)
●​ Heatmap: sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
● Final Touches:

plt.title('My Plot Title')
plt.xlabel('X-Axis Label')
plt.ylabel('Y-Axis Label')
plt.legend()

plt.show() # Display the plot
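
A complete sketch combining setup, plot, and final touches (df and its columns are hypothetical; hue is an optional color grouping):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(x='col_x', y='col_y', hue='category_col', data=df)
plt.title('My Plot Title')
plt.xlabel('X-Axis Label')
plt.ylabel('Y-Axis Label')
plt.legend()
plt.show()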


SQL Syntax Cheat Sheet (for Data Science)

1. Basic Queries

●​ Select All Columns:


SELECT * FROM my_table;

●​ Select Specific Columns:


SELECT column1, column2 FROM my_table;

●​ Select with an Alias (Nickname):


SELECT column1 AS "New Name", column2
FROM my_table;

●​ Limit Results:


SELECT * FROM my_table
LIMIT 10;

●​ Select Unique Values:



SELECT DISTINCT column1 FROM my_table;

●​ Order Results:


SELECT * FROM my_table
ORDER BY column1 ASC; -- ASC (default) or DESC

2. Filtering (WHERE)

●​ Basic Conditions:


SELECT * FROM my_table
WHERE column1 = 'value';

●​ Numeric Conditions:


SELECT * FROM my_table
WHERE column1 > 100;

●​ Multiple Conditions:


SELECT * FROM my_table
WHERE column1 = 'value' AND column2 > 50;
●​ OR Condition:


SELECT * FROM my_table
WHERE column1 = 'value' OR column2 IS NOT NULL;

● Common Operators (see the sketch after this list):
○ =, != (or <>), >, <, >=, <=
○ AND, OR, NOT
○ BETWEEN 10 AND 20
○ IN ('val1', 'val2')
○ LIKE 'a%' (% = any sequence of characters, _ = a single character)
○ IS NULL
○ IS NOT NULL
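
A sketch combining several operators (table and column names are hypothetical):

SELECT *
FROM my_table
WHERE column1 BETWEEN 10 AND 20
  AND column2 IN ('val1', 'val2')
  AND column3 LIKE 'a%'       -- values starting with 'a'
  AND column4 IS NOT NULL;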

3. Aggregation (GROUP BY)

●​ Common Functions:
○​ COUNT(*): Total count of rows
○​ COUNT(column): Count of non-null values
○​ COUNT(DISTINCT column): Count of unique values
○​ SUM(column)
○​ AVG(column)
○​ MIN(column)
○​ MAX(column)

●​ Basic Group By:


SELECT department, AVG(salary)
FROM employees
GROUP BY department;

●​ Group By with Filtering (HAVING):


○​ WHERE filters before grouping.
○​ HAVING filters after grouping.


SELECT
department,
COUNT(*) AS num_employees
FROM
employees
WHERE
salary > 30000 -- Filters individual employees first
GROUP BY
department
HAVING
COUNT(*) > 5; -- Filters departments with > 5 members

4. Joining Tables

●​ INNER JOIN (Default): Returns only rows that match in both tables.


SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id;

● LEFT JOIN: Returns all rows from the left table (employees) and matching rows from the right. If no match, the right table's columns are NULL.

SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id;

●​ JOIN with Multiple Conditions:


SELECT *
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id AND t1.date = t2.date;

5. Common Functions

●​ CASE Statement (IF/THEN Logic):


SELECT
name,
salary,
CASE
WHEN salary > 80000 THEN 'High'
WHEN salary > 50000 THEN 'Medium'
ELSE 'Low'
END AS salary_tier
FROM employees;

●​ Subqueries (Query within a query):



SELECT name
FROM employees
WHERE department_id IN (
SELECT id FROM departments WHERE location = 'New York'
);
