RDD, pandas dataframe, numpy
Instructor: Li Yang
RDD (resilient distributed dataset) or Dataframe?
• Dataframe
• The new trend in Spark
• MLlib is shifting to a dataframe-based API
• Spark Streaming is moving towards Structured Streaming, which is heavily based on the dataframe API
• RDD
• Out of date? No!
• RDD is the basic building block of Spark
• Dataframes are built on top of RDDs
When to use RDD or Dataframe?
• RDD
• Low-level transformations, actions, and control over datasets
• Unstructured data, e.g., media streams or streams of text
• When you need to manipulate data with complicated functional programming
• Dataframe
• Simple processing, e.g., sums, averages, SQL queries
• High-level optimizations, e.g., MLlib
RDD (resilient distributed datasets)
• RDD examples: {'a','b','c','d'}, {0,-1.5,2.5,3}, {('a',1), ('c',-2), ('d',10)}
• An RDD can contain any type of object, including user-defined ones
• An RDD is an encapsulation of a large dataset. In Spark, all work is expressed
as manipulations of RDDs (creating a new RDD, transforming an existing RDD,
or calling operations on an RDD)
• A special kind of RDD is the pair RDD, whose elements are key-value pairs: (key, value)
• For example: {('Lily', (18,1.65,100)), ('John', (16,1.85,150)), ('Ann',
(22,1.55,90))}. Here the name is the key and (age, height, weight) is the value.
Remark: Spark distributes the data contained in an RDD across the cluster, as sketched below
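A minimal sketch of building this pair RDD, assuming a SparkContext named sc is already available (Databricks provides one automatically):

(python)
students = sc.parallelize([('Lily', (18, 1.65, 100)),
                           ('John', (16, 1.85, 150)),
                           ('Ann',  (22, 1.55, 90))])
# Spark partitions these records across the worker nodes of the cluster.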
Operations on RDD
• Transformations
• Apply a function to the data in an RDD to create a new RDD
• Actions
• Compute a result from an RDD (see the sketch below)
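Transformations are lazy and actions trigger the computation; a minimal sketch, assuming a SparkContext named sc:

(python)
nums = sc.parallelize([1, 2, 3, 4])
squared = nums.map(lambda x: x * x)         # transformation: builds a new RDD, no work runs yet
total = squared.reduce(lambda a, b: a + b)  # action: triggers the computation, returns 30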
Create RDD
• Create an RDD directly from an external dataset file
• (python code): sc = SparkContext("local", "textfile")
rdd_new = sc.textFile("/mnt/S3data/sample.text")
• Create an RDD from an existing Dataframe
• (python code): rdd_new = df.rdd
• Conversion from RDD to dataframe: rdd_new.toDF()
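A minimal round-trip sketch of these conversions, assuming an active SparkSession named spark and a small hypothetical table:

(python)
df = spark.createDataFrame([('Lily', 18), ('John', 16)], ['name', 'age'])
rdd_new = df.rdd           # each row of the dataframe becomes a Row object
df_again = rdd_new.toDF()  # back to a dataframe, schema inferred from the Rows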
Transformations on RDD
• Transformations
• Apply a function to the data in an RDD to create a new RDD
• Two common transformations on a single RDD (for example, our RDD = {('Lily', (18,1.65,100)), ('John', (16,1.85,150)), ('Ann', (22,1.55,90))}):
• Filter: select some of the data in an RDD to create a new RDD (similar to the WHERE clause in SQL)
• rdd.filter(lambda x: x[0] == 'John') : choose John's data
• rdd.filter(lambda x: x[1][0] <= 20) : choose the data of students whose age is <= 20
• Map: apply a function to each element of the RDD (very useful)
• rdd.map(lambda x: (x[0], x[1][1]*39.37)) : a new RDD ≈ {('Lily', 64.9), ('John', 72.8), ('Ann', 61.0)}, name plus height in inches (1 meter = 39.37 inches)
• Transformations on two RDDs:
• union, intersection
• Transformations on two pair RDDs:
• join: merge two RDDs by key to generate a new RDD
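A minimal sketch of filter, map, and join on the student RDD above; the grades RDD is a hypothetical second pair RDD added to illustrate the join:

(python)
students = sc.parallelize([('Lily', (18, 1.65, 100)),
                           ('John', (16, 1.85, 150)),
                           ('Ann',  (22, 1.55, 90))])
young = students.filter(lambda x: x[1][0] <= 20)          # keeps Lily and John
inches = students.map(lambda x: (x[0], x[1][1] * 39.37))  # (name, height in inches)
grades = sc.parallelize([('Lily', 'A'), ('Ann', 'B')])
joined = students.join(grades)  # ('Lily', ((18, 1.65, 100), 'A')), ('Ann', ...)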
Actions on RDD
• Action
• return a final value computed from an RDD
• Several common actions on a single RDD:
• collect: return a list of all the data in the RDD (shouldn't be used for large datasets)
• take(n): return the first n elements of the RDD
• count: return the number of elements in the RDD
• reduce: repeatedly apply a function to two elements of the RDD to produce a new
element, until only one element remains
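A minimal sketch of these actions, assuming a SparkContext named sc:

(python)
nums = sc.parallelize([0, -1.5, 2.5, 3])
nums.collect()                   # [0, -1.5, 2.5, 3]; only safe for small data
nums.take(2)                     # [0, -1.5]
nums.count()                     # 4
nums.reduce(lambda a, b: a + b)  # 4.0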
Python Function
• Definition 1:
• (python): def fun_name(input):
return output
• Example: def addition(x):
return x + 2
• Definition 2:
• (python): lambda input: output
• Example: lambda x: x + 2
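Both forms can be passed to RDD operations; a minimal sketch, assuming a SparkContext named sc:

(python)
def addition(x):
    return x + 2

rdd = sc.parallelize([1, 2, 3])
rdd.map(addition).collect()         # [3, 4, 5]
rdd.map(lambda x: x + 2).collect()  # [3, 4, 5], same result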
Pandas Dataframe
• Pandas dataframe is a 2-dimensional labeled data structure with columns
of potentially different types
• Difference between a dataframe and a pandas dataframe?
• A Spark dataframe is distributed across the cluster; a pandas dataframe lives in memory on a single machine (the driver)
• A pandas dataframe is easier to manipulate
• Dataframe → pandas dataframe: df.toPandas()
• Pandas dataframe → dataframe: spark.createDataFrame(df)
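A minimal round-trip sketch, assuming an active SparkSession named spark and a small hypothetical table:

(python)
import pandas as pd
df = spark.createDataFrame([('Lily', 18), ('John', 16)], ['name', 'age'])
pd_new = df.toPandas()                   # collect the data to the driver as a pandas dataframe
df_back = spark.createDataFrame(pd_new)  # distribute it again as a Spark dataframe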
Pandas Dataframe
• Call pandas module (python): import pandas as pd
• Construct pandas dataframe (python): pd_new = df.toPandas()
• Display a pandas dataframe (python): pd_new.head(2), pd_new.tail(2)
• Basic manipulations of a pandas dataframe (python): pd_new.columns, pd_new.describe()
• Slicing a dataframe:
• Columns (python): pd_new[['column1', 'column2']]
• Rows (python): pd_new[0:3]
• Accessing a dataframe:
• (python): pd_new.loc[0, 'column']
• (python): pd_new.iloc[0, 1]
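A minimal sketch of slicing and accessing, using a small hypothetical table:

(python)
import pandas as pd
pd_new = pd.DataFrame({'name': ['Lily', 'John', 'Ann'],
                       'age': [18, 16, 22]})
pd_new[['name', 'age']]  # slice columns
pd_new[0:2]              # slice the first two rows
pd_new.loc[0, 'name']    # label-based access -> 'Lily'
pd_new.iloc[0, 1]        # position-based access -> 18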
Pandas Dataframe
• Apply a function to columns (python): pd_new[['column1', 'column2']].apply(fun)
• Merging two pandas dataframes (python): pd.merge(pd1, pd2, on='column')
• Concatenating two pandas dataframes (python): pd.concat([pd1, pd2])
• Other basic manipulations of a pandas dataframe (python), sketched together below:
• pd_new.dropna(axis=0, inplace=True)
• pd_new.drop_duplicates(subset='column', inplace=True)
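A minimal sketch of these calls, with two small hypothetical frames pd1 and pd2:

(python)
import pandas as pd
pd1 = pd.DataFrame({'name': ['Lily', 'John'], 'age': [18, 16]})
pd2 = pd.DataFrame({'name': ['Lily', 'Ann'], 'grade': ['A', 'B']})
pd1[['age']].apply(lambda col: col * 2)           # apply a function column by column
merged = pd.merge(pd1, pd2, on='name')            # SQL-style join on 'name'
stacked = pd.concat([pd1, pd2])                   # stack rows; missing cells become NaN
stacked.dropna(axis=0, inplace=True)              # drop rows that contain NaN
pd1.drop_duplicates(subset='name', inplace=True)  # keep the first row per name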
Numpy
• A numpy array is a grid of values, all of the same type
• A simpler data format than a dataframe; operations on numpy arrays are similar to MATLAB matrix operations
• Import numpy (python): import numpy as np
• Construct a numpy array (python):
• np.array([[1,2,3],[4,5,6]])
• np1 = pd_new.to_numpy() (conversion from a pandas dataframe)
• np.zeros((2,2)), np.ones((2,2))
• np.arange(0,3,0.1)
• Access data (python): np1[0,1], np1[0,:], np1[:,0]
• Matrix operations (elementwise) (python): np1+np2, np1-np2, np1*np2, np1/np2
• True matrix operations (python): np1.dot(np2), np1.T
• Conversion between pandas dataframe and numpy (python):
• Pandas → numpy: pd_new.to_numpy()
• numpy → pandas: pd.DataFrame(N1) (here N1 is a numpy array)
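A minimal sketch of the numpy operations above:

(python)
import numpy as np
np1 = np.array([[1, 2, 3], [4, 5, 6]])
np2 = np.ones((2, 3))
np1[0, 1]       # single element -> 2
np1[:, 0]       # first column -> array([1, 4])
np1 + np2       # elementwise addition
np1 * np2       # elementwise (NOT matrix) multiplication
np1.dot(np2.T)  # true matrix product, shape (2, 2)
np1.T           # transpose, shape (3, 2)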
Comparison of data formats
• Dataframe. Advantages: easy I/O; easy to transfer to other data formats; easy for SQL queries. Disadvantage: hard to access and manipulate the data.
• RDD. Advantage: very powerful for manipulating the data. Disadvantage: hard to master, since you need to be very familiar with it.
• Pandas Dataframe. Advantage: easy to access and manipulate data. Disadvantage: I/O problems on Databricks.
• Numpy. Advantage: easy to work on matrix manipulations. Disadvantage: not fit for string-type data.
• My suggestion:
1. Construct a dataframe from the dataset; transform it to a pandas dataframe for further operations
2. If you need simple matrix operations, transfer it to numpy; if you need complicated data operations, transfer it to an RDD (workflow sketched below)
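A minimal sketch of the suggested workflow, assuming an active SparkSession named spark and a hypothetical CSV path:

(python)
df = spark.read.csv('/mnt/S3data/sample.csv', header=True)  # 1. dataframe from the dataset
pd_new = df.toPandas()   # 1. pandas dataframe for further operations
np1 = pd_new.to_numpy()  # 2. numpy for simple matrix operations
rdd = df.rdd             # 2. RDD for complicated data operations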
Other modules
• re: module (package) for working with strings via regular expressions
• Call re (python): import re
• Substitute a string in a text (python): re.sub('<br>', ' ', text) (the second argument is the replacement string)
• Search for a string in a text (python): re.search('<br>', text)
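A minimal sketch, using a hypothetical string containing HTML tags:

(python)
import re
text = 'line one<br>line two'
cleaned = re.sub('<br>', ' ', text)  # 'line one line two'
match = re.search('<br>', text)      # a Match object, or None if not found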
• matplotlib: plotting module similar to MATLAB's plot
• Call it (python): import matplotlib.pyplot as plt
• Plot a function with x as the horizontal axis and y as the vertical axis: plt.plot(x, y)
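A minimal plotting sketch:

(python)
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 3, 0.1)
y = x ** 2
plt.plot(x, y)  # x on the horizontal axis, y on the vertical axis
plt.show()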
• imageio: basic operations on images
• Call it (python): from imageio import imread, imsave
• Read an image (python): img = imread('input image file.jpeg')
• Show an image (python): plt.imshow(img)
• Save an array as an image (python): imsave('output image file.jpeg', img)
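A minimal sketch, assuming hypothetical image file names:

(python)
import matplotlib.pyplot as plt
from imageio import imread, imsave
img = imread('input.jpeg')  # load the image as a numpy array
plt.imshow(img)             # display it
imsave('output.jpeg', img)  # write the array back to disk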