CHEATSHEET: PANDAS VS PYSPARK
Vanessa Afolabi
Import Libraries and Set System Options:
PANDAS | PYSPARK
import pandas as pd | from pyspark.sql.types import *
pd.options.display.max_colwidth = 1000 | from pyspark.sql.functions import *
 | from pyspark.sql import SQLContext
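Example (a minimal sketch of the setup the PySpark snippets on this sheet assume; the names sc and sqlContext and the local master are illustrative, not part of the original sheet):
    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    pd.options.display.max_colwidth = 1000        # show full cell contents when printing pandas frames
    sc = SparkContext('local[*]', 'cheatsheet')   # local Spark driver for experimenting
    sqlContext = SQLContext(sc)                   # the SQLContext(sc) entry point used below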
Define and create a dataset:
PANDAS | PYSPARK
data = {'col1' : [ , , ], 'col2' : [ , , ]} | StructField('Col1', IntegerType())
df = pd.DataFrame(data, columns = ['col1', 'col2']) | StructField('Col2', StringType())
 | schema = StructType([list of StructFields])
 | df = SQLContext(sc).createDataFrame(data, schema)
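Example (a sketch; the sample values and the pdf/sdf variable names are assumptions, and sc comes from the setup above):
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
    pdf = pd.DataFrame(data, columns=['col1', 'col2'])       # pandas frame from a dict

    schema = StructType([StructField('Col1', IntegerType()),
                         StructField('Col2', StringType())])
    rows = list(zip(data['col1'], data['col2']))              # [(1, 'a'), (2, 'b'), (3, 'c')]
    sdf = SQLContext(sc).createDataFrame(rows, schema)        # Spark frame with an explicit schema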
Read and Write to CSV:
PANDAS | PYSPARK
pd.read_csv() | SQLContext(sc).read.csv()
df.to_csv() | df.write.csv()
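Example (file paths are placeholders; assumes sc from the setup above):
    pdf = pd.read_csv('data.csv')
    pdf.to_csv('out.csv', index=False)        # don't write the pandas index as a column

    sdf = SQLContext(sc).read.csv('data.csv', header=True, inferSchema=True)
    sdf.write.csv('out_dir', header=True, mode='overwrite')   # Spark writes a directory of part files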
Indexing and Splitting:
PANDAS | PYSPARK
df.loc[ ] | df.randomSplit(weights=[ ], seed=n)
df.iloc[ ] |
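Example (assumes the pdf/sdf frames from the dataset example above):
    first_two = pdf.iloc[0:2]              # positional indexing
    big_vals = pdf.loc[pdf['col1'] > 1]    # label/boolean indexing
    train, test = sdf.randomSplit([0.8, 0.2], seed=42)   # random 80/20 split of a Spark frame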
Inspect Data:
PANDAS | PYSPARK
df.head() | df.show()
df.head(n) |
df.shape | df.count()
df.columns |
df.dtypes | df.printSchema()
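Example (assumes pdf/sdf from above):
    print(pdf.head(2))     # first rows
    print(pdf.shape)       # (rows, columns)
    print(pdf.dtypes)      # column types

    sdf.show(2)            # first rows
    print(sdf.count())     # row count
    sdf.printSchema()      # column names and types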
Handling Duplicate Data:
PANDAS | PYSPARK
df.duplicated() | df.distinct().count()
df.duplicated().sum() |
df.drop_duplicates() | df.dropDuplicates()
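Example (assumes pdf/sdf from above):
    n_dupes = pdf.duplicated().sum()       # number of duplicated rows
    deduped_pd = pdf.drop_duplicates()

    n_distinct = sdf.distinct().count()    # number of unique rows
    deduped_sp = sdf.dropDuplicates()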
Rename Columns:
PANDAS | PYSPARK
df.rename(columns={"old col":"new col"}) | df.withColumnRenamed("old col","new col")
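Example (assumes pdf/sdf from above; the new name 'id' is illustrative):
    renamed_pd = pdf.rename(columns={'col1': 'id'})
    renamed_sp = sdf.withColumnRenamed('Col1', 'id')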
Handling Missing Data:
PANDAS | PYSPARK
df.dropna() | df.na.drop()
df.fillna() | df.na.fill()
df.replace() | df.na.replace()
df['col'].isna() | df.filter(df['col'].isNull())
df['col'].isnull() |
df['col'].notna() | df.filter(df['col'].isNotNull())
df['col'].notnull() |
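Example (assumes pdf/sdf from above; the fill value 0 is illustrative):
    filled_pd = pdf.fillna(0)                        # replace NaN with 0
    missing_pd = pdf[pdf['col2'].isna()]             # rows where col2 is missing

    filled_sp = sdf.na.fill(0)                       # replace nulls with 0
    missing_sp = sdf.filter(sdf['Col2'].isNull())    # rows where Col2 is null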
Common Column Functions:
PANDAS | PYSPARK
df["col"] = df["col"].str.lower() | df = df.withColumn('col', lower(df['col']))
df["col"] = df["col"].str.replace() | df = df.select('*', regexp_replace().alias())
 | df = df.select('*', regexp_extract().alias())
df["col"] = df["col"].str.split() | df = df.withColumn('col', split('col'))
df["col"] = df["col"].str.join(' ') | df = df.withColumn('col', UDF_JOIN(df['col'], lit(' ')))
df["col"] = df["col"].str.strip() | df = df.withColumn('col', trim(df['col']))
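Example (assumes pdf/sdf from above; the '-' to ' ' substitution is illustrative):
    from pyspark.sql.functions import lower, trim, regexp_replace

    pdf['col2'] = pdf['col2'].str.lower().str.strip()
    pdf['col2'] = pdf['col2'].str.replace('-', ' ')

    sdf = sdf.withColumn('Col2', trim(lower(sdf['Col2'])))
    sdf = sdf.select('*', regexp_replace('Col2', '-', ' ').alias('col2_clean'))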
Apply User Defined Functions:
PANDAS | PYSPARK
df['col'] = df['col'].map(UDF) | df = df.withColumn('col', UDF(df['col']))
df.apply(f) | df = df.withColumn('col', when(cond, UDF(df['col'])).otherwise())
df.applymap(f) |
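Example (assumes pdf/sdf from above; shout is a hypothetical UDF):
    from pyspark.sql.functions import udf, when
    from pyspark.sql.types import StringType

    def shout(s):
        return s.upper() if s is not None else s

    pdf['col2'] = pdf['col2'].map(shout)                   # element-wise over one pandas column

    shout_udf = udf(shout, StringType())                   # wrap the Python function for Spark
    sdf = sdf.withColumn('Col2', shout_udf(sdf['Col2']))
    sdf = sdf.withColumn('Col2', when(sdf['Col1'] > 1, shout_udf(sdf['Col2'])).otherwise(sdf['Col2']))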
Join two dataset columns:
PANDAS | PYSPARK
df['new col'] = df['col1'] + df['col2'] | df = df.withColumn('new col', concat_ws(' ', df.col1, df.col2))
 | df.select('*', concat(df.col1, df.col2).alias('new col'))
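Example (assumes pdf/sdf from above; the cast to string mirrors astype(str) on the pandas side):
    from pyspark.sql.functions import concat_ws

    pdf['new col'] = pdf['col1'].astype(str) + ' ' + pdf['col2']
    sdf = sdf.withColumn('new col', concat_ws(' ', sdf.Col1.cast('string'), sdf.Col2))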
Convert dataset column to a list:
PANDAS | PYSPARK
list(df['col']) | df.select("col").rdd.flatMap(lambda x: x).collect()
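Example (assumes pdf/sdf from above):
    values = list(pdf['col2'])                                       # ['a', 'b', 'c']
    values = sdf.select('Col2').rdd.flatMap(lambda x: x).collect()   # same values from Spark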
Filter Dataset:
PANDAS | PYSPARK
df = df[df['col'] != " "] | df = df[df['col'] == val]
 | df = df.filter(df['col'] == val)
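Example (assumes pdf/sdf from above; the comparison values are illustrative):
    nonempty_pd = pdf[pdf['col2'] != '']           # keep rows with a non-empty col2
    matching_sp = sdf.filter(sdf['Col2'] == 'a')   # keep rows where Col2 equals 'a'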
Select Columns:
PANDAS | PYSPARK
df = df[['col1','col2','col3']] | df = df.select('col1','col2','col3')
Drop Columns:
PANDAS | PYSPARK
df.drop(['B','C'], axis=1) | df.drop('col1','col2')
df.drop(columns = ['B','C']) |
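Example covering both selecting and dropping (assumes pdf/sdf from above):
    kept_pd = pdf[['col1', 'col2']]          # keep only these columns
    slim_pd = pdf.drop(columns=['col2'])     # or drop the ones you don't need

    kept_sp = sdf.select('Col1', 'Col2')
    slim_sp = sdf.drop('Col2')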
Grouping Data:
PANDAS | PYSPARK
df.groupby(by=['col1','col2']).count() | df.groupBy('col').count().show()
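Example (assumes pdf/sdf from above):
    counts_pd = pdf.groupby(by=['col2']).count()   # rows per group, as a pandas frame
    counts_sp = sdf.groupBy('Col2').count()        # rows per group, as a Spark frame
    counts_sp.show()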
Combining Data:
PANDAS | PYSPARK
pd.concat([df1,df2]) | df1.union(df2)
df1.append(df2) |
df1.merge(df2) | df1.join(df2)
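Example (assumes pdf/sdf and sc from above; extra is an illustrative second frame):
    extra = pd.DataFrame({'col1': [1, 2], 'col3': ['x', 'y']})
    stacked_pd = pd.concat([pdf, pdf])                 # stack rows
    merged_pd = pdf.merge(extra, on='col1')            # key-based join

    extra_sp = SQLContext(sc).createDataFrame(extra)   # Spark frame built from the pandas frame
    stacked_sp = sdf.union(sdf)                        # stack rows (schemas must match)
    merged_sp = sdf.join(extra_sp, sdf.Col1 == extra_sp.col1)   # key-based join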
Cartesian Product:
PANDAS | PYSPARK
df1['key'] = 1 | df1.crossJoin(df2)
df2['key'] = 1 |
df1.merge(df2, how='outer', on='key') |
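Example (assumes pdf/sdf from above):
    pdf1 = pdf.copy()
    pdf1['key'] = 1
    pdf2 = pdf.copy()
    pdf2['key'] = 1
    cross_pd = pdf1.merge(pdf2, how='outer', on='key')   # every row paired with every row
    cross_sp = sdf.crossJoin(sdf)                        # built-in cartesian product in Spark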
Sorting Data:
PANDAS | PYSPARK
df.sort_values() | df.sort()
df.sort_index() | df.orderBy()
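Example (assumes pdf/sdf from above):
    sorted_pd = pdf.sort_values(by='col1', ascending=False)
    sorted_pd = sorted_pd.sort_index()              # back to the original row order
    sorted_sp = sdf.sort('Col1')
    sorted_sp = sdf.orderBy(sdf['Col1'].desc())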