h"p://training.databricks.com/sparkcamp.
zip
Intro to DataFrames and
Spark SQL
Professor Anthony D. Joseph, UC Berkeley
Strata NYC September 2015
DataFrames
• The preferred abstraction in Spark (introduced in 1.3)
• Strongly typed collection of distributed elements
– Built on Resilient Distributed Datasets
• Immutable once constructed
• Track lineage information to efficiently recompute lost data
• Enable operations on collection of elements in parallel
• You construct DataFrames
• by parallelizing existing collections (e.g., Pandas DataFrames)
• by transforming an existing DataFrame
• from files in HDFS or any other storage system (e.g., Parquet)
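For example, a minimal sketch of the three construction routes (assuming a running sqlContext; the column names and sample values are purely illustrative):

# 1. Parallelize an existing local collection (e.g., a Pandas DataFrame)
import pandas as pd
pandas_df = pd.DataFrame({"name": ["Amy", "Ravi"], "age": [29, 33]})
df_a = sqlContext.createDataFrame(pandas_df)

# 2. Transform an existing DataFrame (each transformation yields a new, immutable DataFrame)
df_b = df_a.filter(df_a["age"] > 30)

# 3. Read from files in HDFS or another storage system (e.g., Parquet)
df_c = sqlContext.read.parquet("/path/to/data.parquet")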
DataFrames
[Diagram: a DataFrame of 25 items (item-1 … item-25) split into 5 partitions, distributed across Spark executors running on several Workers. More partitions = more parallelism.]
DataFrames API
• is intended to enable wider audiences beyond “Big
Data” engineers to leverage the power of distributed
processing
• is inspired by data frames in R and Python (Pandas)
• is designed from the ground up to support modern big
data and data science applications
• is an extension of the existing RDD API
See
https://spark.apache.org/docs/latest/sql-programming-guide.html
databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
DataFrames Features
• Ability to scale from kilobytes of data on a single
laptop to petabytes on a large cluster
• Support for a wide array of data formats and
storage systems
• State-of-the-art optimization and code generation
through the Spark SQL Catalyst optimizer
• Seamless integration with all big data tooling and
infrastructure via Spark
• APIs for Python, Java, Scala, and R
DataFrames versus RDDs
• For new users familiar with data frames in other
programming languages, this API should make
them feel at home
• For existing Spark users, the API will make Spark
easier to program than using RDDs
• For both sets of users, DataFrames will improve
performance through intelligent optimizations and
code-generation
DataFrames and Spark SQL
DataFrames are fundamentally tied to Spark SQL
• The DataFrames API provides a programmatic
interface—really, a domain-specific language (DSL)
—for interacting with your data.
• Spark SQL provides a SQL-like interface.
• Anything you can do in Spark SQL, you can do in
DataFrames
• … and vice versa
DataFrames and Spark SQL
Like Spark SQL, the DataFrames API assumes that
the data has a table-like structure
Formally, a DataFrame is a size-mutable, potentially
heterogeneous tabular data structure with labeled
axes (i.e., rows and columns)
That’s a mouthful – just think of it as a table in a
distributed database: a distributed collection of
data organized into named, typed columns
DataFrames and RDDs
• DataFrames are built on top of the Spark RDD API
• You can use normal RDD operations on DataFrames
• However, use the DataFrame API wherever possible
• Using RDD operations will usually yield an RDD, not a DataFrame
• The DataFrame API is likely to be more efficient, because it can
optimize the underlying operations with Catalyst
• DataFrames can be significantly faster than RDDs
• Performance is language-independent
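A hedged sketch of the difference (assuming a DataFrame df with an age column; illustrative only): dropping down to RDD operations hands back plain RDDs that Catalyst can no longer optimize.

ages_rdd = df.rdd.map(lambda row: row.age * 2)   # an RDD of integers, not a DataFrame
ages_df  = df.select(df['age'] * 2)              # still a DataFrame; Catalyst can optimize it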
[Chart: runtime of aggregating 10 million integer pairs, in seconds – Spark DataFrames (Python and Scala) compared with RDDs (Python and Scala)]
Plan Optimization & Execution
[Diagram: Analysis → Logical Optimization → Physical Planning → Code Generation. A SQL AST or a DataFrame becomes an Unresolved Logical Plan, is resolved against the Catalog into a Logical Plan, rewritten into an Optimized Logical Plan, expanded into candidate Physical Plans, narrowed by a Cost Model to a Selected Physical Plan, and finally compiled down to RDDs.]
DataFrames and SQL share the same optimization/execution pipeline
Transformations, Actions, Laziness
• DataFrames are lazy
• Transformations contribute to the query plan, but
they do not execute anything
• Actions cause the execution of the query
Transformation examples: filter, select, drop, intersect, join
Action examples: count, collect, show, head, take
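A small sketch of this laziness (assuming a DataFrame df; the column names are illustrative):

# Transformations: these only add steps to the query plan
adults = df.filter(df['age'] > 17)
names  = adults.select('first_name')

# Action: only now does Spark read the source and execute the plan
names.show()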
Creating a DataFrame
• You create a DataFrame with a SQLContext object
• In the Spark Scala shell (spark-shell) or pyspark, you
have a SQLContext available automatically, as
sqlContext
• The DataFrame data source API is consistent across
data formats
• “Opening” a data source works pretty much the same way, no
matter the format
Creating a DataFrame (Python)
# The imports aren't necessary in the Spark Shell or Databricks
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# The following three lines are not necessary
# in the pyspark shell
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet("/path/to/data.parquet")
df2 = sqlContext.read.json("/path/to/data.json")
SQLContext and Hive
Our previous example created a default Spark SQLContext
object
If you are using a version of Spark that has Hive support, you
can also create a HiveContext, which provides additional
features, including:
• the ability to write queries using the more complete HiveQL parser
• access to Hive user-defined functions
• the ability to read data from Hive tables
HiveContext
• To use a HiveContext, you do not need to have an
existing Hive installation, and all of the data sources
available to a SQLContext are still available
• You do, however, need to have a version of Spark that
was built with Hive support – that is not the default
• Hive is packaged separately to avoid including all of Hive’s
dependencies in the default Spark build
• If these dependencies are not a problem for your application,
then using HiveContext is currently recommended
• It is not difficult to build Spark with Hive support
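Creating one looks like this (a sketch, assuming a Hive-enabled build and an existing SparkContext, sc):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
df = sqlContext.read.parquet("/path/to/data.parquet")  # the usual sources still work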
DataFrames Have Schemas
In the previous example, we created DataFrames from
Parquet and JSON data
• A Parquet table has a schema (column names and
types) that Spark can use
• Parquet also allows Spark to be efficient about how it
pares down the data it reads
• Spark can infer a schema from a JSON file
Data Sources supported by
DataFrames
[Diagram: built-in data sources (e.g., JSON, JDBC) plus external Spark packages, and more …]
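Reading a built-in source such as JDBC looks much like any other source (a sketch; the connection URL and table name are placeholders, and the JDBC driver must be on the classpath):

df_jdbc = sqlContext.read.jdbc(url="jdbc:postgresql://dbserver:5432/mydb",
                               table="people")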
A Brief Look at spark-csv
Suppose our data file has a header:
first_name,last_name,gender,age
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
…
From Spark Packages: http://spark-packages.org/package/databricks/spark-csv
A Brief Look at spark-csv
Using spark-csv, we can simply create a DataFrame directly
from our CSV file
# Python
df = sqlContext.read.format("com.databricks.spark.csv").\
         load("people.csv", header="true")
A Brief Look at spark-csv
You can also declare the schema programmatically, which
allows you to specify the column types
from pyspark.sql.types import *

schema = StructType([StructField("firstName", StringType(), False),
                     StructField("gender", StringType(), False),
                     StructField("age", IntegerType(), False)])

df = sqlContext.read.format("com.databricks.spark.csv").\
         schema(schema).\
         load("people.csv")
What Can I Do with a DataFrame?
Once you have a DataFrame, there are a number of
operations you can perform
Let us look at a few of them
But, first, let us talk about columns
Columns
When we say “column” here, what do we mean?
A DataFrame column is an abstraction
It provides a common column-oriented view of the
underlying data, regardless of how the data is really
organized
Columns
Let us see how DataFrame columns map onto some common data sources:

• JSON (dataFrame1):
  [
    {"first": "Amy",  "last": "Bello",   "age": 29},
    {"first": "Ravi", "last": "Agarwal", "age": 33},
    …
  ]

• CSV (dataFrame2):
  first,last,age
  Fred,Hoover,91
  Joaquin,Hernandez,24
  …

• SQL Table (dataFrame3):
  first | last  | age
  Joe   | Smith | 42
  Jill  | Jones | 33
Columns
Each source exposes the same "first" column, regardless of how the data is stored:
• dataFrame1 (JSON) → column "first"
• dataFrame2 (CSV) → column "first"
• dataFrame3 (SQL Table) → column "first"
Columns
When we say “column” here, what do we mean?
Several things:
• A place (a cell) for a data value to reside, within a row of
data. This cell can have several states:
• empty (null)
• missing (not there at all)
• contains a (typed) value (non-null)
• A collection of those cells, from multiple rows
• A syntactic construct we can use to specify or target a cell
(or collections of cells) in a DataFrame query
How do you refer to a column in the DataFrame API?
Columns
Suppose we have a DataFrame, df, that reads a data
source that has "first", "last", and "age" columns
Python:  df['first'] or df.first†
Java:    df.col("first")
Scala:   df("first") or $("first")‡
R:       df$first
†In Python, it’s possible to access a DataFrame’s columns either by attribute
(df.age) or by indexing (df['age']). While the former is convenient for
interactive data exploration, you should use the index form. It's future-proof and
won’t break with column names that are also attributes on the DataFrame class.
‡The $ syntax can be ambiguous, if there are multiple DataFrames in the lineage
cache()
• Spark can cache a DataFrame, using an in-memory
columnar format, by calling df.cache()
• Internally, it calls df.persist(MEMORY_ONLY)
• Spark will scan only those columns used by the
DataFrame and will automatically tune
compression to minimize memory usage and GC
pressure
• You can call the unpersist() method to remove
the cached data from memory
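In code (a sketch, assuming an existing DataFrame df):

df.cache()      # marks df for caching in the in-memory columnar format
df.count()      # the first action materializes the cache
df.count()      # later actions read from the cached columns
df.unpersist()  # releases the cached data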
show()
You can look at the first n elements in a DataFrame with
the show() method (n defaults to 20)
This method is an action that:
• reads (or re-reads) the input source
• executes the RDD DAG across the cluster
• pulls the n elements back to the driver JVM
• displays those elements in a tabular form
show()
In [1]: df.show()
+---------+--------+------+---+
|firstName|lastName|gender|age|
+---------+--------+------+---+
|     Erin| Shannon|     F| 42|
|   Claire| McBride|     F| 23|
|   Norman|Lockwood|     M| 81|
|   Miguel|    Ruiz|     M| 64|
| Rosalita| Ramirez|     F| 14|
|     Ally|  Garcia|     F| 39|
|  Abigail|Cottrell|     F| 75|
|     José|  Rivera|     M| 59|
+---------+--------+------+---+
printSchema()
You can have Spark tell you what it thinks the data
schema is, by calling the printSchema() method
(This is mostly useful in the shell)
In [1]: df.printSchema()
root
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: integer (nullable = false)
select()
select() is like a SQL SELECT, allowing you to
limit the results to specific columns
In [1]: df.select("firstName", "age").show(5)
+---------+---+
|firstName|age|
+---------+---+
|     Erin| 42|
|   Claire| 23|
|   Norman| 81|
|   Miguel| 64|
| Rosalita| 14|
+---------+---+
select()
The DSL also allows you to create derived columns on the fly
In [1]: df.select(df['firstName'], df['age'],
                  df['age'] > 49, df['age'] + 10).show(5)
+---------+---+----------+----------+
|firstName|age|(age > 49)|(age + 10)|
+---------+---+----------+----------+
|     Erin| 42|     false|        52|
|   Claire| 23|     false|        33|
|   Norman| 81|      true|        91|
|   Miguel| 64|      true|        74|
| Rosalita| 14|     false|        24|
+---------+---+----------+----------+
filter()
The filter() method is similar to the RDD filter()
method, except that it supports the DataFrame DSL
In [1]: df.filter(df['age'] > 49).\
           select(df['first_name'], df['age']).\
           show()
+----------+---+
|first_name|age|
+----------+---+
|    Norman| 81|
|    Miguel| 64|
|   Abigail| 75|
+----------+---+
orderBy()
The orderBy() method allows you to sort the results
In [1]: df.filter(df['age'] > 49).\
           select(df['first_name'], df['age']).\
           orderBy(df['age'], df['first_name']).show()
+----------+---+
|first_name|age|
+----------+---+
|    Miguel| 64|
|   Abigail| 75|
|    Norman| 81|
+----------+---+
orderBy()
It’s easy to reverse the sort order
In [1]: df.filter(df['age'] > 49).\
           select(df['first_name'], df['age']).\
           orderBy(df['age'].desc(), df['first_name']).show()
+----------+---+
|first_name|age|
+----------+---+
|    Norman| 81|
|   Abigail| 75|
|    Miguel| 64|
+----------+---+
groupBy()
Often used with count(), groupBy() groups data
items by a specific column value
In [1]: df.groupBy("age").count().show()
+---+-----+
|age|count|
+---+-----+
| 39|    1|
| 42|    2|
| 64|    1|
| 75|    1|
| 81|    1|
| 14|    1|
| 23|    2|
+---+-----+
as() or alias()
as() or alias() allows you to rename a column
It is especially useful with generated columns
In [1]: df.select(df['first_name'],
                  df['age'],
                  (df['age'] < 30).alias('young')).show(5)
+----------+---+-----+
|first_name|age|young|
+----------+---+-----+
|      Erin| 42|false|
|    Claire| 23| true|
|    Norman| 81|false|
|    Miguel| 64|false|
|  Rosalita| 14| true|
+----------+---+-----+
Note: In Python, you must use alias, because as is a keyword
Other Useful Transformations
• limit(n) – Limit the results to n rows. limit() is not an action, like show() or the RDD take() method; it returns another DataFrame.
• distinct() – Returns a new DataFrame containing only the unique rows from the current DataFrame.
• drop(column) – Returns a new DataFrame with a column dropped. column is a name or a Column object.
• intersect(dataframe) – Intersect one DataFrame with another.
• join(dataframe) – Join one DataFrame with another, like a SQL join. We'll discuss this one more in a minute.
There are many more transformations
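A few of these in use (a sketch on the same illustrative df):

df.limit(3).show()                     # a DataFrame with at most 3 rows; show() is the action
df.select('gender').distinct().show()  # the unique gender values
df.drop('gender').printSchema()        # the same data, without the gender column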
Joins
Let’s assume we have a second file, a JSON file that
contains records like this:
[
  {
    "firstName": "Erin",
    "lastName": "Shannon",
    "medium": "oil on canvas"
  },
  {
    "firstName": "Norman",
    "lastName": "Lockwood",
    "medium": "metal (sculpture)"
  },
  …
]
Joins
We can load that into a second DataFrame and join
it with our first one.
In [1]: df2 = sqlContext.read.json("artists.json")
# Schema inferred as DataFrame[firstName: string, lastName: string, medium: string]

In [2]: df.join(df2,
                (df.first_name == df2.firstName) &
                (df.last_name == df2.lastName)).show()
+----------+---------+------+---+---------+--------+-----------------+
|first_name|last_name|gender|age|firstName|lastName|           medium|
+----------+---------+------+---+---------+--------+-----------------+
|    Norman| Lockwood|     M| 81|   Norman|Lockwood|metal (sculpture)|
|      Erin|  Shannon|     F| 42|     Erin| Shannon|    oil on canvas|
|  Rosalita|  Ramirez|     F| 14| Rosalita| Ramirez|         charcoal|
|    Miguel|     Ruiz|     M| 64|   Miguel|    Ruiz|    oil on canvas|
+----------+---------+------+---+---------+--------+-----------------+
Joins
Let’s make that a little more readable by only
selecting some of the columns.
In [3]: df3 = df.join(df2,
                      (df.first_name == df2.firstName) &
                      (df.last_name == df2.lastName))

In [4]: df3.select("first_name", "last_name", "age", "medium").show()
+----------+---------+---+-----------------+
|first_name|last_name|age|           medium|
+----------+---------+---+-----------------+
|    Norman| Lockwood| 81|metal (sculpture)|
|      Erin|  Shannon| 42|    oil on canvas|
|  Rosalita|  Ramirez| 14|         charcoal|
|    Miguel|     Ruiz| 64|    oil on canvas|
+----------+---------+---+-----------------+
What is Spark SQL?
Spark SQL allows you to manipulate distributed
data with SQL queries – currently, two SQL dialects
are supported
• If you are using a Spark SQLContext, the only
supported dialect is "sql", a rich subset of SQL 92
• If you're using a HiveContext, the default dialect
is "hiveql", corresponding to Hive's SQL dialect
• "sql" is also available, but "hiveql" is a richer dialect
What is Spark SQL?
Spark SQL is intimately tied to the DataFrames API
• You issue SQL queries through a SQLContext or
HiveContext, using the sql() method
• The sql() method returns a DataFrame.
• You can mix DataFrame methods and SQL queries
in the same code
• We will see examples in the Labs
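A quick preview of that mixing (a sketch; the temporary table name is arbitrary and df is the earlier DataFrame):

# Register the DataFrame so SQL queries can see it
df.registerTempTable("people")

# sql() returns a DataFrame, so DataFrame methods chain onto the result
older = sqlContext.sql("SELECT first_name, age FROM people WHERE age > 49")
older.orderBy(older['age'].desc()).show()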
End of DataFrames and
Spark SQL Module