Pandas vs PySpark Cheat Sheet

This document provides a cheat sheet comparing common data analysis tasks in pandas and PySpark. It shows how to import libraries, define datasets, read and write data, inspect data, handle missing and duplicate values, rename and select columns, join datasets, and group and sort data in each framework. It serves as a quick reference to help users choose the appropriate tool for different data processing and manipulation operations.

CHEATSHEET: PANDAS VS PYSPARK

Vanessa Afolabi

Import Libraries and Set System Options:

PANDAS:
import pandas as pd
pd.options.display.max_colwidth = 1000

PYSPARK:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SQLContext

Define and create a dataset:

PANDAS:
data = {'col1': [...], 'col2': [...]}
df = pd.DataFrame(data, columns=['col1', 'col2'])

PYSPARK:
schema = StructType([StructField('Col1', IntegerType()),
                     StructField('Col2', StringType())])
df = SQLContext(sc).createDataFrame(data, schema)
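For a concrete illustration of the pandas side, here is a minimal sketch; the column names and values are made up for the example:

```python
import pandas as pd

# Build a small DataFrame from a dict of lists; the values here are
# illustrative placeholders, not from the cheat sheet.
data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
df = pd.DataFrame(data, columns=['col1', 'col2'])
```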

Read and Write to CSV:

PANDAS:
pd.read_csv()
df.to_csv()

PYSPARK:
SQLContext(sc).read.csv()
df.write.csv()
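A quick pandas round trip can be sketched with an in-memory buffer instead of a file on disk (the DataFrame contents are invented for the example):

```python
import io
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': ['x', 'y']})

# Write to an in-memory text buffer rather than a real file.
buf = io.StringIO()
df.to_csv(buf, index=False)

# Rewind the buffer and read the CSV back into a new DataFrame.
buf.seek(0)
df2 = pd.read_csv(buf)
```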

Indexing and Splitting:

PANDAS:
df.loc[...]
df.iloc[...]

PYSPARK:
df.randomSplit(weights=[...], seed=n)

Inspect Data:

PANDAS:
df.head()
df.head(n)
df.shape
df.columns
df.dtypes

PYSPARK:
df.show()
df.count()
df.printSchema()
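The pandas inspection calls can be exercised on a toy DataFrame (values invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

first_two = df.head(2)   # first n rows
shape = df.shape         # (rows, columns) tuple
dtypes = df.dtypes       # per-column data types
```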
Handling Duplicate Data:

PANDAS:
df.duplicated()
df.drop_duplicates()

PYSPARK:
df.distinct().count()
df.dropDuplicates()
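A minimal pandas sketch of duplicate handling (the repeated row is invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 1, 2]})

dup_mask = df.duplicated()      # True for rows already seen above
deduped = df.drop_duplicates()  # keeps only the first occurrence
```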

Rename Columns:

PANDAS:
df.rename(columns={"old col": "new col"})

PYSPARK:
df.withColumnRenamed("old col", "new col")

Handling Missing Data:

PANDAS:
df.dropna()
df.fillna()
df['col'].isna()
df['col'].isnull()
df['col'].notna()
df['col'].notnull()

PYSPARK:
df.na.drop()
df.na.fill()
df.filter(df['col'].isNull())
df.filter(df['col'].isNotNull())
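The pandas missing-data calls can be sketched on a column containing a NaN (values invented for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, np.nan, 3.0]})

filled = df.fillna(0)    # replace NaN with a constant
dropped = df.dropna()    # drop rows containing NaN
mask = df['col'].isna()  # boolean mask of missing entries
```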

Common Column Functions:

PANDAS:
df["col"] = df["col"].str.lower()
df["col"] = df["col"].str.replace()
df["col"] = df["col"].str.split()
df["col"] = df["col"].str.join()
df["col"] = df["col"].str.strip()

PYSPARK:
df = df.withColumn('col', lower(df['col']))
df = df.select('*', regexp_replace().alias())
df = df.select('*', regexp_extract().alias())
df = df.withColumn('col', split(df['col'], ' '))
df = df.withColumn('col', UDF_JOIN(df['col'], lit(' ')))
df = df.withColumn('col', trim(df['col']))
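Two of the pandas string accessors can be sketched together (the padded strings are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'col': ['  Hello ', ' World  ']})

df['col'] = df['col'].str.strip()  # trim surrounding whitespace
df['col'] = df['col'].str.lower()  # lowercase each value
```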

Apply User Defined Functions:

PANDAS:
df['col'] = df['col'].map(UDF)
df.apply(f)
df.applymap(f)

PYSPARK:
df = df.withColumn('col', UDF(df['col']))
df = df.withColumn('col', when(cond, UDF(df['col'])).otherwise())

Join two dataset columns:

PANDAS:
df['new col'] = df['col1'] + df['col2']

PYSPARK:
df = df.withColumn('new col', concat_ws(' ', df.col1, df.col2))
df = df.select('*', concat(df.col1, df.col2).alias('new col'))
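On the pandas side, string columns concatenate element-wise with `+` (values invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b'], 'col2': ['x', 'y']})

# Element-wise string concatenation of two columns.
df['new col'] = df['col1'] + df['col2']
```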
Convert dataset column to a list:

PANDAS:
list(df['col'])

PYSPARK:
df.select("col").rdd.flatMap(lambda x: x).collect()

Filter Dataset:

PANDAS:
df = df[df['col'] != " "]
df = df[df['col'] == val]

PYSPARK:
df = df.filter(df['col'] == val)
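A quick pandas filtering sketch using boolean indexing (data and `val` are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 2, 3]})
val = 2

# Keep only the rows where the condition holds.
filtered = df[df['col'] == val]
```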

Select Columns:

PANDAS:
df = df[['col1', 'col2', 'col3']]

PYSPARK:
df = df.select('col1', 'col2', 'col3')

Drop Columns:

PANDAS:
df.drop(['B', 'C'], axis=1)
df.drop(columns=['B', 'C'])

PYSPARK:
df.drop('col1', 'col2')

Grouping Data:

PANDAS:
df.groupby(by=['col1', 'col2']).count()

PYSPARK:
df.groupBy('col').count().show()
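A minimal pandas grouping sketch (column names and values invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1, 2, 3]})

# Count rows per group; the result is indexed by the group key.
counts = df.groupby(by=['col1']).count()
```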

Combining Data:

PANDAS:
pd.concat([df1, df2])
df1.append(df2)   (removed in pandas 2.0; use pd.concat)
df1.merge(df2)

PYSPARK:
df1.union(df2)
df1.join(df2)
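A minimal sketch of row-wise combination in pandas (DataFrames invented for the example):

```python
import pandas as pd

df1 = pd.DataFrame({'col': [1, 2]})
df2 = pd.DataFrame({'col': [3]})

# Stack the rows of df2 under df1, discarding the old row labels.
stacked = pd.concat([df1, df2], ignore_index=True)
```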

Cartesian Product:

PANDAS:
df1['key'] = 1
df2['key'] = 1
df1.merge(df2, how='outer', on='key')

PYSPARK:
df1.crossJoin(df2)

Sorting Data:

PANDAS:
df.sort_values()
df.sort_index()

PYSPARK:
df.sort()
df.orderBy()
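Both pandas sorts can be sketched on a deliberately shuffled frame (values and index invented for the example):

```python
import pandas as pd

# Values and index are out of order on purpose.
df = pd.DataFrame({'col': [3, 1, 2]}, index=[2, 0, 1])

by_value = df.sort_values('col')  # order rows by column values
by_index = df.sort_index()        # order rows by row labels
```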
