CHEATSHEET: PANDAS VS PYSPARK
Vanessa Afolabi
Import Libraries and Set System Options:
PANDAS | PYSPARK
import pandas as pd | from pyspark.sql.types import *
pd.options.display.max_colwidth = 1000 | from pyspark.sql.functions import *
 | from pyspark.sql import SQLContext
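Example (a minimal sketch of the setup the PySpark snippets on this sheet assume; the names sc and sqlContext and the local master are illustrative, not part of the original sheet):
    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    pd.options.display.max_colwidth = 1000        # show full cell contents when printing pandas frames
    sc = SparkContext('local[*]', 'cheatsheet')   # local Spark driver for experimenting
    sqlContext = SQLContext(sc)                   # the SQLContext(sc) entry point used below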
Define and create a dataset:
PANDAS | PYSPARK
data = {'col1' : [ , , ], 'col2' : [ , , ]} | StructField('Col1', IntegerType())
df = pd.DataFrame(data, columns = ['col1', 'col2']) | StructField('Col2', StringType())
 | schema = StructType([list of StructFields])
 | df = SQLContext(sc).createDataFrame(data, schema)
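Example (a sketch; the sample values and the pdf/sdf variable names are assumptions, and sc comes from the setup above):
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
    pdf = pd.DataFrame(data, columns=['col1', 'col2'])       # pandas frame from a dict

    schema = StructType([StructField('Col1', IntegerType()),
                         StructField('Col2', StringType())])
    rows = list(zip(data['col1'], data['col2']))              # [(1, 'a'), (2, 'b'), (3, 'c')]
    sdf = SQLContext(sc).createDataFrame(rows, schema)        # Spark frame with an explicit schema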
Read and Write to CSV:
PANDAS | PYSPARK
pd.read_csv() | SQLContext(sc).read.csv()
df.to_csv() | df.write.csv()
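Example (file paths are placeholders; assumes sc from the setup above):
    pdf = pd.read_csv('data.csv')
    pdf.to_csv('out.csv', index=False)        # don't write the pandas index as a column

    sdf = SQLContext(sc).read.csv('data.csv', header=True, inferSchema=True)
    sdf.write.csv('out_dir', header=True, mode='overwrite')   # Spark writes a directory of part files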
Indexing and Splitting:
PANDAS | PYSPARK
df.loc[ ] | df.randomSplit(weights=[ ], seed=n)
df.iloc[ ] |
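Example (assumes the pdf/sdf frames from the dataset example above):
    first_two = pdf.iloc[0:2]              # positional indexing
    big_vals = pdf.loc[pdf['col1'] > 1]    # label/boolean indexing
    train, test = sdf.randomSplit([0.8, 0.2], seed=42)   # random 80/20 split of a Spark frame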
Inspect Data:
PANDAS | PYSPARK
df.head() | df.show()
df.head(n) |
df.shape | df.count()
df.columns |
df.dtypes | df.printSchema()
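Example (assumes pdf/sdf from above):
    print(pdf.head(2))     # first rows
    print(pdf.shape)       # (rows, columns)
    print(pdf.dtypes)      # column types

    sdf.show(2)            # first rows
    print(sdf.count())     # row count
    sdf.printSchema()      # column names and types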
Handling Duplicate Data:
PANDAS | PYSPARK
df.duplicated() | df.distinct().count()
df.duplicated().sum() |
df.drop_duplicates() | df.dropDuplicates()
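Example (assumes pdf/sdf from above):
    n_dupes = pdf.duplicated().sum()       # number of duplicated rows
    deduped_pd = pdf.drop_duplicates()

    n_distinct = sdf.distinct().count()    # number of unique rows
    deduped_sp = sdf.dropDuplicates()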
Rename Columns:
PANDAS | PYSPARK
df.rename(columns={"old col":"new col"}) | df.withColumnRenamed("old col","new col")
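Example (assumes pdf/sdf from above; the new name 'id' is illustrative):
    renamed_pd = pdf.rename(columns={'col1': 'id'})
    renamed_sp = sdf.withColumnRenamed('Col1', 'id')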
Handling Missing Data:
PANDAS | PYSPARK
df.dropna() | df.na.drop()
df.fillna() | df.na.fill()
df.replace() | df.na.replace()
df['col'].isna() | df.filter(df['col'].isNull())
df['col'].isnull() |
df['col'].notna() | df.filter(df['col'].isNotNull())
df['col'].notnull() |
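Example (assumes pdf/sdf from above; the fill value 0 is illustrative):
    filled_pd = pdf.fillna(0)                        # replace NaN with 0
    missing_pd = pdf[pdf['col2'].isna()]             # rows where col2 is missing

    filled_sp = sdf.na.fill(0)                       # replace nulls with 0
    missing_sp = sdf.filter(sdf['Col2'].isNull())    # rows where Col2 is null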
Common Column Functions:
PANDAS | PYSPARK
df["col"] = df["col"].str.lower() | df = df.withColumn('col', lower(df['col']))
df["col"] = df["col"].str.replace() | df = df.select('*', regexp_replace().alias())
 | df = df.select('*', regexp_extract().alias())
df["col"] = df["col"].str.split() | df = df.withColumn('col', split('col'))
df["col"] = df["col"].str.join(' ') | df = df.withColumn('col', UDF_JOIN(df['col'], lit(' ')))
df["col"] = df["col"].str.strip() | df = df.withColumn('col', trim(df['col']))
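Example (assumes pdf/sdf from above; the '-' to ' ' substitution is illustrative):
    from pyspark.sql.functions import lower, trim, regexp_replace

    pdf['col2'] = pdf['col2'].str.lower().str.strip()
    pdf['col2'] = pdf['col2'].str.replace('-', ' ')

    sdf = sdf.withColumn('Col2', trim(lower(sdf['Col2'])))
    sdf = sdf.select('*', regexp_replace('Col2', '-', ' ').alias('col2_clean'))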
Apply User Defined Functions:
PANDAS | PYSPARK
df['col'] = df['col'].map(UDF) | df = df.withColumn('col', UDF(df['col']))
df.apply(f) | df = df.withColumn('col', when(cond, UDF(df['col'])).otherwise())
df.applymap(f) |
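Example (assumes pdf/sdf from above; shout is a hypothetical UDF):
    from pyspark.sql.functions import udf, when
    from pyspark.sql.types import StringType

    def shout(s):
        return s.upper() if s is not None else s

    pdf['col2'] = pdf['col2'].map(shout)                   # element-wise over one pandas column

    shout_udf = udf(shout, StringType())                   # wrap the Python function for Spark
    sdf = sdf.withColumn('Col2', shout_udf(sdf['Col2']))
    sdf = sdf.withColumn('Col2', when(sdf['Col1'] > 1, shout_udf(sdf['Col2'])).otherwise(sdf['Col2']))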
Join two dataset columns:
PANDAS | PYSPARK
df['new col'] = df['col1'] + df['col2'] | df = df.withColumn('new col', concat_ws(' ', df.col1, df.col2))
 | df.select('*', concat(df.col1, df.col2).alias('new col'))
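Example (assumes pdf/sdf from above; the cast to string mirrors astype(str) on the pandas side):
    from pyspark.sql.functions import concat_ws

    pdf['new col'] = pdf['col1'].astype(str) + ' ' + pdf['col2']
    sdf = sdf.withColumn('new col', concat_ws(' ', sdf.Col1.cast('string'), sdf.Col2))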
Convert dataset column to a list:
PANDAS | PYSPARK
list(df['col']) | df.select("col").rdd.flatMap(lambda x: x).collect()
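Example (assumes pdf/sdf from above):
    values = list(pdf['col2'])                                       # ['a', 'b', 'c']
    values = sdf.select('Col2').rdd.flatMap(lambda x: x).collect()   # same values from Spark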
Filter Dataset:
PANDAS | PYSPARK
df = df[df['col'] != " "] | df = df[df['col'] == val]
 | df = df.filter(df['col'] == val)
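Example (assumes pdf/sdf from above; the comparison values are illustrative):
    nonempty_pd = pdf[pdf['col2'] != '']           # keep rows with a non-empty col2
    matching_sp = sdf.filter(sdf['Col2'] == 'a')   # keep rows where Col2 equals 'a'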
Select Columns:
PANDAS | PYSPARK
df = df[['col1','col2','col3']] | df = df.select('col1','col2','col3')
Drop Columns:
PANDAS | PYSPARK
df.drop(['B','C'], axis=1) | df.drop('col1','col2')
df.drop(columns = ['B','C']) |
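Example covering both selecting and dropping (assumes pdf/sdf from above):
    kept_pd = pdf[['col1', 'col2']]          # keep only these columns
    slim_pd = pdf.drop(columns=['col2'])     # or drop the ones you don't need

    kept_sp = sdf.select('Col1', 'Col2')
    slim_sp = sdf.drop('Col2')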
Grouping Data:
PANDAS | PYSPARK
df.groupby(by=['col1','col2']).count() | df.groupBy('col').count().show()
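Example (assumes pdf/sdf from above):
    counts_pd = pdf.groupby(by=['col2']).count()   # rows per group, as a pandas frame
    counts_sp = sdf.groupBy('Col2').count()        # rows per group, as a Spark frame
    counts_sp.show()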
Combining Data:
PANDAS | PYSPARK
pd.concat([df1,df2]) | df1.union(df2)
df1.append(df2) |
df1.merge(df2) | df1.join(df2)
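Example (assumes pdf/sdf and sc from above; extra is an illustrative second frame):
    extra = pd.DataFrame({'col1': [1, 2], 'col3': ['x', 'y']})
    stacked_pd = pd.concat([pdf, pdf])                 # stack rows
    merged_pd = pdf.merge(extra, on='col1')            # key-based join

    extra_sp = SQLContext(sc).createDataFrame(extra)   # Spark frame built from the pandas frame
    stacked_sp = sdf.union(sdf)                        # stack rows (schemas must match)
    merged_sp = sdf.join(extra_sp, sdf.Col1 == extra_sp.col1)   # key-based join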
Cartesian Product:
PANDAS | PYSPARK
df1['key'] = 1 | df1.crossJoin(df2)
df2['key'] = 1 |
df1.merge(df2, how='outer', on='key') |
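Example (assumes pdf/sdf from above):
    pdf1 = pdf.copy()
    pdf1['key'] = 1
    pdf2 = pdf.copy()
    pdf2['key'] = 1
    cross_pd = pdf1.merge(pdf2, how='outer', on='key')   # every row paired with every row
    cross_sp = sdf.crossJoin(sdf)                        # built-in cartesian product in Spark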
Sorting Data:
PANDAS | PYSPARK
df.sort_values() | df.sort()
df.sort_index() | df.orderBy()
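Example (assumes pdf/sdf from above):
    sorted_pd = pdf.sort_values(by='col1', ascending=False)
    sorted_pd = sorted_pd.sort_index()              # back to the original row order
    sorted_sp = sdf.sort('Col1')
    sorted_sp = sdf.orderBy(sdf['Col1'].desc())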