RDD, pandas dataframe, numpy
Instructor: Li Yang
RDD (resilient distributed dataset) or Dataframe?
• Dataframe
• The new trend in Spark
• MLlib is shifting to a dataframe-based API
• Spark Streaming is moving towards Structured Streaming, which is heavily based on the dataframe API
• RDD
• Out of date? No!
• RDD is the basic building block of Spark
• Dataframes are built on top of RDDs
When to use RDD or Dataframe?
• RDD
• Low-level transformations, actions, and control over datasets
• Unstructured data, e.g., media streams or streams of text
• When you need to manipulate data with complicated functional programming
• Dataframe
• Simple processing, e.g., sums, averages, SQL queries
• High-level optimizations, e.g., MLlib
RDD (resilient distributed datasets)
• RDD examples: {'a','b','c','d'}, {0,-1.5,2.5,3}, {('a',1), ('c',-2), ('d',10)}
• An RDD can contain any type of object, including user-defined ones
• An RDD is an encapsulation of a large dataset. In Spark, all work is expressed
as manipulations of RDDs (creating a new RDD, transforming an existing RDD,
or calling operations on an RDD)
• A special kind of RDD is the pair RDD, whose elements are key-value pairs: (key, value)
• For example: {('Lily', (18,1.65,100)), ('John', (16,1.85,150)), ('Ann',
(22,1.55,90))}. Here the name is the key and (age, height, weight) is the value.
Remark: Spark distributes the data contained in an RDD across the cluster, as sketched below
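A minimal sketch of building this pair RDD, assuming a SparkContext named sc is already available (Databricks provides one automatically):

(python)
students = sc.parallelize([('Lily', (18, 1.65, 100)),
                           ('John', (16, 1.85, 150)),
                           ('Ann',  (22, 1.55, 90))])
# Spark partitions these records across the worker nodes of the cluster.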
Operations on RDD
• Transformations
• Apply a function to the data in an RDD to create a new RDD
• Actions
• Compute a result from an RDD (see the sketch below)
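Transformations are lazy and actions trigger the computation; a minimal sketch, assuming a SparkContext named sc:

(python)
nums = sc.parallelize([1, 2, 3, 4])
squared = nums.map(lambda x: x * x)         # transformation: builds a new RDD, no work runs yet
total = squared.reduce(lambda a, b: a + b)  # action: triggers the computation, returns 30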
Create RDD
• Create an RDD directly from an external dataset file
• (python code): sc = SparkContext("local", "textfile")
rdd_new = sc.textFile("/mnt/S3data/sample.text")
• Create an RDD from an existing Dataframe
• (python code): rdd_new = df.rdd
• Conversion from RDD to dataframe: rdd_new.toDF()
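A minimal round-trip sketch of these conversions, assuming an active SparkSession named spark and a small hypothetical table:

(python)
df = spark.createDataFrame([('Lily', 18), ('John', 16)], ['name', 'age'])
rdd_new = df.rdd           # each row of the dataframe becomes a Row object
df_again = rdd_new.toDF()  # back to a dataframe, schema inferred from the Rows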
Transformations on RDD
• Transformations
• Apply a function to the data in an RDD to create a new RDD
• Two common transformations on a single RDD (for example, our RDD = {('Lily', (18,1.65,100)), ('John', (16,1.85,150)), ('Ann', (22,1.55,90))}):
• Filter: select some of the data in an RDD to create a new RDD (similar to the WHERE clause in SQL)
• rdd.filter(lambda x: x[0] == 'John') : choose John's data
• rdd.filter(lambda x: x[1][0] <= 20) : choose the data of students whose age is <= 20
• Map: apply a function to each element of the RDD (very useful)
• rdd.map(lambda x: (x[0], x[1][1]*39.37)) : a new RDD ≈ {('Lily', 64.9), ('John', 72.8), ('Ann', 61.0)}, name plus height in inches (1 meter = 39.37 inches)
• Transformations on two RDDs:
• union, intersection
• Transformations on two pair RDDs:
• join: merge two RDDs by key to generate a new RDD
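A minimal sketch of filter, map, and join on the student RDD above; the grades RDD is a hypothetical second pair RDD added to illustrate the join:

(python)
students = sc.parallelize([('Lily', (18, 1.65, 100)),
                           ('John', (16, 1.85, 150)),
                           ('Ann',  (22, 1.55, 90))])
young = students.filter(lambda x: x[1][0] <= 20)          # keeps Lily and John
inches = students.map(lambda x: (x[0], x[1][1] * 39.37))  # (name, height in inches)
grades = sc.parallelize([('Lily', 'A'), ('Ann', 'B')])
joined = students.join(grades)  # ('Lily', ((18, 1.65, 100), 'A')), ('Ann', ...)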
Actions on RDD
• Action
• return a final value computed from an RDD
• Several common actions on a single RDD:
• collect: return a list of all the data in the RDD (shouldn't be used for large datasets)
• take(n): return the first n elements of the RDD
• count: return the number of elements in the RDD
• reduce: repeatedly apply a function to two elements of the RDD to produce a new
element, until only one element remains
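A minimal sketch of these actions, assuming a SparkContext named sc:

(python)
nums = sc.parallelize([0, -1.5, 2.5, 3])
nums.collect()                   # [0, -1.5, 2.5, 3]; only safe for small data
nums.take(2)                     # [0, -1.5]
nums.count()                     # 4
nums.reduce(lambda a, b: a + b)  # 4.0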
Python Function
• Definition 1:
• (python): def fun_name(input):
return output
• Example: def addition(x):
return x + 2
• Definition 2:
• (python): lambda input: output
• Example: lambda x: x + 2
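Both forms can be passed to RDD operations; a minimal sketch, assuming a SparkContext named sc:

(python)
def addition(x):
    return x + 2

rdd = sc.parallelize([1, 2, 3])
rdd.map(addition).collect()         # [3, 4, 5]
rdd.map(lambda x: x + 2).collect()  # [3, 4, 5], same result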
Pandas Dataframe
• Pandas dataframe is a 2-dimensional labeled data structure with columns
of potentially different types
• Difference between a dataframe and a pandas dataframe?
• A Spark dataframe is distributed across the cluster; a pandas dataframe lives in memory on a single machine (the driver)
• A pandas dataframe is easier to manipulate
• Dataframe → pandas dataframe: df.toPandas()
• Pandas dataframe → dataframe: spark.createDataFrame(df)
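A minimal round-trip sketch, assuming an active SparkSession named spark and a small hypothetical table:

(python)
import pandas as pd
df = spark.createDataFrame([('Lily', 18), ('John', 16)], ['name', 'age'])
pd_new = df.toPandas()                   # collect the data to the driver as a pandas dataframe
df_back = spark.createDataFrame(pd_new)  # distribute it again as a Spark dataframe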
Pandas Dataframe
• Call pandas module (python): import pandas as pd
• Construct pandas dataframe (python): pd_new = df.toPandas()
• Display a pandas dataframe (python): pd_new.head(2), pd_new.tail(2)
• Basic manipulations of a pandas dataframe (python): pd_new.columns, pd_new.describe()
• Slicing a dataframe:
• Columns (python): pd_new[['column1', 'column2']]
• Rows (python): pd_new[0:3]
• Accessing a dataframe:
• (python): pd_new.loc[0, 'column']
• (python): pd_new.iloc[0, 1]
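A minimal sketch of slicing and accessing, using a small hypothetical table:

(python)
import pandas as pd
pd_new = pd.DataFrame({'name': ['Lily', 'John', 'Ann'],
                       'age': [18, 16, 22]})
pd_new[['name', 'age']]  # slice columns
pd_new[0:2]              # slice the first two rows
pd_new.loc[0, 'name']    # label-based access -> 'Lily'
pd_new.iloc[0, 1]        # position-based access -> 18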
Pandas Dataframe
• Apply a function to columns (python): pd_new[['column1', 'column2']].apply(fun)
• Merging two pandas dataframes (python): pd.merge(pd1, pd2, on='column')
• Concatenating two pandas dataframes (python): pd.concat([pd1, pd2])
• Other basic manipulations of a pandas dataframe (python), sketched together below:
• pd_new.dropna(axis=0, inplace=True)
• pd_new.drop_duplicates(subset='column', inplace=True)
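A minimal sketch of these calls, with two small hypothetical frames pd1 and pd2:

(python)
import pandas as pd
pd1 = pd.DataFrame({'name': ['Lily', 'John'], 'age': [18, 16]})
pd2 = pd.DataFrame({'name': ['Lily', 'Ann'], 'grade': ['A', 'B']})
pd1[['age']].apply(lambda col: col * 2)           # apply a function column by column
merged = pd.merge(pd1, pd2, on='name')            # SQL-style join on 'name'
stacked = pd.concat([pd1, pd2])                   # stack rows; missing cells become NaN
stacked.dropna(axis=0, inplace=True)              # drop rows that contain NaN
pd1.drop_duplicates(subset='name', inplace=True)  # keep the first row per name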
Numpy
• A numpy array is a grid of values, all of the same type
• A simpler data format than a dataframe; operations on numpy arrays are similar to MATLAB matrix operations
• Import numpy (python): import numpy as np
• Construct a numpy array (python):
• np.array([[1,2,3],[4,5,6]])
• np1 = pd_new.to_numpy() (conversion from a pandas dataframe)
• np.zeros((2,2)), np.ones((2,2))
• np.arange(0,3,0.1)
• Access data (python): np1[0,1], np1[0,:], np1[:,0]
• Matrix operations (elementwise) (python): np1+np2, np1-np2, np1*np2, np1/np2
• True matrix operations (python): np1.dot(np2), np1.T
• Conversion between pandas dataframe and numpy (python):
• Pandas → numpy: pd_new.to_numpy()
• numpy → pandas: pd.DataFrame(N1) (here N1 is a numpy array)
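A minimal sketch of the numpy operations above:

(python)
import numpy as np
np1 = np.array([[1, 2, 3], [4, 5, 6]])
np2 = np.ones((2, 3))
np1[0, 1]       # single element -> 2
np1[:, 0]       # first column -> array([1, 4])
np1 + np2       # elementwise addition
np1 * np2       # elementwise (NOT matrix) multiplication
np1.dot(np2.T)  # true matrix product, shape (2, 2)
np1.T           # transpose, shape (3, 2)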
Comparison of data formats
• Dataframe. Advantages: easy I/O; easy to transfer to other data formats; easy for SQL queries. Disadvantage: hard to access and manipulate the data.
• RDD. Advantage: very powerful for manipulating the data. Disadvantage: hard to master, since you need to be very familiar with it.
• Pandas Dataframe. Advantage: easy to access and manipulate data. Disadvantage: I/O problems on Databricks.
• Numpy. Advantage: easy to work on matrix manipulations. Disadvantage: not fit for string-type data.
• My suggestion:
1. Construct a dataframe from the dataset; transform it to a pandas dataframe for further operations
2. If you need simple matrix operations, transfer it to numpy; if you need complicated data operations, transfer it to an RDD (workflow sketched below)
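A minimal sketch of the suggested workflow, assuming an active SparkSession named spark and a hypothetical CSV path:

(python)
df = spark.read.csv('/mnt/S3data/sample.csv', header=True)  # 1. dataframe from the dataset
pd_new = df.toPandas()   # 1. pandas dataframe for further operations
np1 = pd_new.to_numpy()  # 2. numpy for simple matrix operations
rdd = df.rdd             # 2. RDD for complicated data operations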
Other modules
• re: module (package) for working with strings via regular expressions
• Call re (python): import re
• Substitute a string in a text (python): re.sub('<br>', ' ', text) (the second argument is the replacement string)
• Search for a string in a text (python): re.search('<br>', text)
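A minimal sketch, using a hypothetical string containing HTML tags:

(python)
import re
text = 'line one<br>line two'
cleaned = re.sub('<br>', ' ', text)  # 'line one line two'
match = re.search('<br>', text)      # a Match object, or None if not found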
• matplotlib: plotting module similar to MATLAB's plot
• Call it (python): import matplotlib.pyplot as plt
• Plot a function with x as the horizontal axis and y as the vertical axis: plt.plot(x, y)
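A minimal plotting sketch:

(python)
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 3, 0.1)
y = x ** 2
plt.plot(x, y)  # x on the horizontal axis, y on the vertical axis
plt.show()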
• imageio: basic operations on images
• Call it (python): from imageio import imread, imsave
• Read an image (python): img = imread('input image file.jpeg')
• Show an image (python): plt.imshow(img)
• Save an array as an image (python): imsave('output image file.jpeg', img)
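A minimal sketch, assuming hypothetical image file names:

(python)
import matplotlib.pyplot as plt
from imageio import imread, imsave
img = imread('input.jpeg')  # load the image as a numpy array
plt.imshow(img)             # display it
imsave('output.jpeg', img)  # write the array back to disk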