Comprehensive Data Science Cheat Sheet (Python)
1. Core Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

2. NumPy (np) - Array & Math Operations


● Create Arrays: arr = np.array([1, 2, 3]), zeros = np.zeros((3, 4)), ones = np.ones(5), r_nums = np.random.rand(3, 3), r_int = np.random.randint(0, 10, 5), range_arr = np.arange(0, 10, 2)
● Inspect Arrays: arr.shape (e.g., (3, 4)), arr.dtype (e.g., dtype('int64')), arr.ndim (e.g., 2 for 2D)
● Reshape Arrays: arr_flat = arr.flatten(), arr_reshaped = arr.reshape(3, 4)
● Math Operations: np.mean(arr), np.median(arr), np.std(arr) (standard deviation), np.sum(arr), np.log(arr) (natural log), np.exp(arr) (exponential), np.sqrt(arr)
● Axis-Specific: arr.mean(axis=0) (mean of each column), arr.sum(axis=1) (sum of each row)
● Indexing: arr[0, 3] (row 0, col 3), arr[1:3, :] (rows 1-2, all columns)
● Boolean Filter: arr[arr > 5] (all elements > 5)
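
A minimal runnable sketch tying these operations together (the array here is illustrative):

import numpy as np

arr = np.random.randint(0, 10, (3, 4))  # 3x4 array of random ints in [0, 10)
print(arr.shape, arr.dtype, arr.ndim)   # (3, 4) int64 2
print(arr.mean(axis=0))                 # mean of each column
print(arr.reshape(2, 6))                # same data, new shape
print(arr[arr > 5])                     # boolean filter: elements > 5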

3. Pandas (pd) - DataFrames & Manipulation


I/O & Inspection
●​ Load Data: df = pd.read_csv('file.csv')
●​ Create DF: df = pd.DataFrame({'col1': [1,2], 'col2': [3,4]})
●​ See Data: df.head() (First 5 rows)
●​ Get Info: df.info() (Column types, non-null counts)
●​ Get Stats: df.describe() (Mean, min, max, quartiles)
●​ See Shape: df.shape (Rows, Cols)
●​ List Columns: df.columns
● Value Counts: df['col'].value_counts() (count of each unique value)
●​ Check Nulls: df.isnull().sum()
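
A typical first look at a freshly loaded file (the file name and columns are hypothetical):

import pandas as pd

df = pd.read_csv('file.csv')   # hypothetical file
print(df.shape)                # (n_rows, n_cols)
df.info()                      # column dtypes and non-null counts
print(df.describe())           # mean, min, max, quartiles for numeric columns
print(df.isnull().sum())       # null count per column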

Selection & Filtering


●​ Select 1 Col: df['col_name'] (Returns a Series)
●​ Select 2+ Cols: df[['col1', 'col2']] (Returns a DataFrame)
●​ Select by Label: df.loc[index_label, 'col_name']
●​ Select by Position: df.iloc[row_index, col_index]
●​ Boolean Filter: df[df['age'] > 30]
●​ Multi-Filter: df[(df['age'] > 30) & (df['dept'] == 'Sales')]
●​ .isin() Filter: df[df['dept'].isin(['Sales', 'IT'])]
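
Combining a few of these (using the age and dept columns from the examples above):

# Boolean filter plus column selection in one .loc call
older_staff = df.loc[df['age'] > 30, ['age', 'dept']]

# Each condition needs its own parentheses when combined with & or |
sales_over_30 = df[(df['age'] > 30) & (df['dept'] == 'Sales')]
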
Data Cleaning
●​ Drop Nulls: df.dropna() (Drops rows with any nulls)
●​ Fill Nulls (All): df.fillna(value=0)
●​ Fill Nulls (Mean): mean_val = df['age'].mean()​
df['age'] = df['age'].fillna(mean_val)
●​ Fill Nulls (Mode): mode_val = df['category'].mode()[0]​
df['category'] = df['category'].fillna(mode_val)
●​ Change Type: df['col'] = df['col'].astype(int)
●​ Rename Cols: df = df.rename(columns={'old_name': 'new_name'})
●​ Drop Col: df = df.drop('col_name', axis=1)
●​ Find Duplicates: df.duplicated().sum()
●​ Drop Duplicates: df = df.drop_duplicates()
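
A short cleaning pass chaining the steps above (column names are illustrative):

df = df.drop_duplicates()                                          # remove exact duplicate rows
df['age'] = df['age'].fillna(df['age'].mean())                     # impute numeric nulls with the mean
df['category'] = df['category'].fillna(df['category'].mode()[0])   # impute categorical nulls with the mode
df['age'] = df['age'].astype(int)                                  # safe to cast once nulls are gone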

Grouping & Aggregating


●​ Group & Agg: df.groupby('dept')['salary'].mean()
● Multi-Agg: stats = df.groupby('dept').agg({'salary': 'mean', 'age': ['min', 'max'], 'employee_id': 'count'})
●​ Reset Index: stats = stats.reset_index()
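
For example, average salary and headcount per department (named aggregation, available in pandas 0.25+):

stats = df.groupby('dept').agg(
    avg_salary=('salary', 'mean'),        # new_col=(source_col, agg_func)
    headcount=('employee_id', 'count'),
).reset_index()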

Creating Columns & Combining


●​ New Column: df['new_col'] = df['col1'] * 2
●​ Apply Function: df['col'].apply(lambda x: x * 10)
●​ Apply Row-wise: df.apply(my_func, axis=1)
●​ Map Values: df['col'] = df['col'].map({'A': 1, 'B': 2})
●​ Join (Merge): merged = pd.merge(df1, df2, on='key', how='left')
●​ Stack (Concat): stacked = pd.concat([df1, df2], axis=0)
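
A minimal merge/concat sketch (df1, df2, and the key column are hypothetical):

df1 = pd.DataFrame({'key': [1, 2], 'a': ['x', 'y']})
df2 = pd.DataFrame({'key': [2, 3], 'b': ['p', 'q']})

merged = pd.merge(df1, df2, on='key', how='left')            # 'b' is NaN where df2 has no match
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)   # rows stacked, index renumbered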

Datetime Handling with Pandas (Most Common)

# --- Convert a column to datetime ---
df['date_col'] = pd.to_datetime(df['date_col'])
df['date_col'] = pd.to_datetime(df['date_col'], format='%m/%d/%Y')  # or with an explicit format

# --- Extract date parts via the .dt accessor ---
df['year'] = df['date_col'].dt.year
df['month'] = df['date_col'].dt.month
df['day'] = df['date_col'].dt.day
4. Scikit-learn (sklearn) - Machine Learning
1. Preprocessing & Splitting
●​ Import:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

●​ Split Data:

X = df[['feature1', 'feature2']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

●​ Scale Features (Fit on Train ONLY):


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

2. Models (Common)

●​ Linear Regression:


from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_scaled, y_train)
●​ Logistic Regression (Classification):


from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

●​ Random Forest:


from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

3. Prediction & Evaluation

●​ Import:


from sklearn.metrics import (
accuracy_score, confusion_matrix, classification_report,
mean_squared_error, r2_score
)

●​ Predict:


predictions = model.predict(X_test_scaled)
# Get probabilities (for classification)
probs = model.predict_proba(X_test_scaled)
●​ Evaluation (Classification):


print(f"Accuracy: {accuracy_score(y_test, predictions)}")
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

●​ Evaluation (Regression):


print(f"MSE: {mean_squared_error(y_test, predictions)}")
print(f"R-squared: {r2_score(y_test, predictions)}")
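
Putting the whole workflow together, a minimal end-to-end classification sketch (synthetic data via make_classification so it runs stand-alone):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on train only
X_test_scaled = scaler.transform(X_test)         # reuse train statistics

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")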

5. SciPy (stats) - Statistical Tests

●​ Import: import scipy.stats as stats


● T-Test (Independent): t_stat, p_val = stats.ttest_ind(sample1, sample2)
● Correlation: corr, p_val = stats.pearsonr(x, y)
● Chi-Square: chi2, p_val, dof, expected = stats.chi2_contingency(contingency_table)
● Z-Score: z_scores = stats.zscore(my_array)
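
A runnable sketch with synthetic samples (the data here is illustrative):

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
sample1 = rng.normal(loc=0.0, scale=1.0, size=100)
sample2 = rng.normal(loc=0.5, scale=1.0, size=100)

t_stat, p_val = stats.ttest_ind(sample1, sample2)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}")   # small p suggests the means differ

corr, p_corr = stats.pearsonr(sample1, sample2)
print(f"Pearson r = {corr:.3f} (p = {p_corr:.4f})")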

6. Matplotlib (plt) & Seaborn (sns) - Plotting

●​ Setup: plt.figure(figsize=(10, 6))


●​ Scatter Plot: sns.scatterplot(x='col_x', y='col_y', data=df)
●​ Histogram: sns.histplot(data=df, x='col', bins=30, kde=True)
●​ Box Plot: sns.boxplot(x='category_col', y='value_col', data=df)
●​ Heatmap: sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
● Final Touches:

plt.title('My Plot Title')
plt.xlabel('X-Axis Label')
plt.ylabel('Y-Axis Label')
plt.legend()

plt.show() # Display the plot
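
A complete sketch combining setup, plot, and final touches (df and its columns are hypothetical; hue is an optional color grouping):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(x='col_x', y='col_y', hue='category_col', data=df)
plt.title('My Plot Title')
plt.xlabel('X-Axis Label')
plt.ylabel('Y-Axis Label')
plt.legend()
plt.show()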


SQL Syntax Cheat Sheet (for Data Science)

1. Basic Queries

●​ Select All Columns:


SELECT * FROM my_table;

●​ Select Specific Columns:


SELECT column1, column2 FROM my_table;

●​ Select with an Alias (Nickname):


SELECT column1 AS "New Name", column2
FROM my_table;

●​ Limit Results:


SELECT * FROM my_table
LIMIT 10;

●​ Select Unique Values:



SELECT DISTINCT column1 FROM my_table;

●​ Order Results:


SELECT * FROM my_table
ORDER BY column1 ASC; -- ASC (default) or DESC

2. Filtering (WHERE)

●​ Basic Conditions:


SELECT * FROM my_table
WHERE column1 = 'value';

●​ Numeric Conditions:


SELECT * FROM my_table
WHERE column1 > 100;

●​ Multiple Conditions:


SELECT * FROM my_table
WHERE column1 = 'value' AND column2 > 50;
●​ OR Condition:


SELECT * FROM my_table
WHERE column1 = 'value' OR column2 IS NOT NULL;

● Common Operators (see the sketch after this list):
○ =, != (or <>), >, <, >=, <=
○ AND, OR, NOT
○ BETWEEN 10 AND 20
○ IN ('val1', 'val2')
○ LIKE 'a%' (% = any sequence of characters, _ = a single character)
○ IS NULL
○ IS NOT NULL
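
A sketch combining several operators (table and column names are hypothetical):

SELECT *
FROM my_table
WHERE column1 BETWEEN 10 AND 20
  AND column2 IN ('val1', 'val2')
  AND column3 LIKE 'a%'       -- values starting with 'a'
  AND column4 IS NOT NULL;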

3. Aggregation (GROUP BY)

●​ Common Functions:
○​ COUNT(*): Total count of rows
○​ COUNT(column): Count of non-null values
○​ COUNT(DISTINCT column): Count of unique values
○​ SUM(column)
○​ AVG(column)
○​ MIN(column)
○​ MAX(column)

●​ Basic Group By:


SELECT department, AVG(salary)
FROM employees
GROUP BY department;

●​ Group By with Filtering (HAVING):


○​ WHERE filters before grouping.
○​ HAVING filters after grouping.


SELECT
department,
COUNT(*) AS num_employees
FROM
employees
WHERE
salary > 30000 -- Filters individual employees first
GROUP BY
department
HAVING
COUNT(*) > 5; -- Filters departments with > 5 members

4. Joining Tables

●​ INNER JOIN (Default): Returns only rows that match in both tables.


SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id;

● LEFT JOIN: Returns all rows from the left table (employees) and matching rows from the right. If no match, the right table's columns are NULL.

SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id;

●​ JOIN with Multiple Conditions:


SELECT *
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id AND t1.date = t2.date;

5. Common Functions

●​ CASE Statement (IF/THEN Logic):


SELECT
name,
salary,
CASE
WHEN salary > 80000 THEN 'High'
WHEN salary > 50000 THEN 'Medium'
ELSE 'Low'
END AS salary_tier
FROM employees;

●​ Subqueries (Query within a query):



SELECT name
FROM employees
WHERE department_id IN (
SELECT id FROM departments WHERE location = 'New York'
);
