SPARK SQL
DataFrames and Datasets
Working with structured data
■ Extends RDD to a "DataFrame" object
■ DataFrames:
– Contain Row objects
– Can run SQL queries
– Have a schema (leading to more efficient storage; see the sketch below)
– Read and write to JSON, Hive, and Parquet
– Communicate with JDBC/ODBC, Tableau
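A minimal sketch of the schema and storage points above, assuming a hypothetical input file people.json and a local SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SchemaDemo").getOrCreate()

df = spark.read.json("people.json")    # hypothetical file; each line becomes a Row
df.printSchema()                       # the schema Spark inferred from the JSON
df.write.parquet("people.parquet")     # the schema enables efficient formats like Parquet

spark.stop()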
Using Spark SQL in Python
■ from pyspark.sql import HiveContext, Row
■ hiveContext = HiveContext(sc)
■ inputData = hiveContext.read.json(dataFile)
■ inputData.createOrReplaceTempView("myStructuredStuff")
■ myResultDataFrame = hiveContext.sql("""SELECT foo FROM myStructuredStuff ORDER BY foobar""")
Other stuff you can do with DataFrames
■ myResultDataFrame.show()
■ myResultDataFrame.select("someFieldName")
■ myResultDataFrame.filter(myResultDataFrame["someFieldName"] > 200)
■ myResultDataFrame.groupBy(myResultDataFrame["someFieldName"]).mean()
■ myResultDataFrame.rdd.map(mapperFunction)
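Stringing those calls together into something runnable, with a made-up session and toy data just for illustration:
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("DataFrameOps").getOrCreate()

# Toy data, purely to exercise the calls above.
df = spark.createDataFrame([Row(someFieldName=150, score=1.0),
                            Row(someFieldName=250, score=2.0),
                            Row(someFieldName=250, score=4.0)])

df.show()
df.select("someFieldName").show()
df.filter(df["someFieldName"] > 200).show()
df.groupBy(df["someFieldName"]).mean("score").show()

# .rdd exposes the underlying RDD of Row objects.
print(df.rdd.map(lambda row: row.someFieldName).collect())

spark.stop()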
Datasets
■ In Spark 2.0, a DataFrame is really a Dataset of Row objects
■ Datasets can wrap known, typed data too. But this is mostly transparent to
you in Python, since Python is dynamically typed.
■ So don't sweat this too much with Python. But the Spark 2.0 way is to use
Datasets instead of DataFrames when you can.
Shell access
■ Spark SQL exposes a JDBC/ODBC server (if you built Spark with Hive support)
■ Start it with sbin/start-thriftserver.sh
■ Listens on port 10000 by default
■ Connect using bin/beeline -u jdbc:hive2://localhost:10000
■ Voilà, you have a SQL shell to Spark SQL
■ You can create new tables, or query existing ones that were cached using
hiveCtx.cacheTable("tableName")
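For reference, a minimal sketch of the caching side from Python; in Spark 2.0 cacheTable is also reachable via spark.catalog, and the file and table names here are made up:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.read.json("someData.json")      # hypothetical input file
df.createOrReplaceTempView("tableName")
spark.catalog.cacheTable("tableName")      # keep the table in memory for queries

spark.sql("SELECT COUNT(*) FROM tableName").show()

spark.stop()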
User-defined functions (UDFs)
from pyspark.sql.types import IntegerType
hiveCtx.registerFunction("square", lambda x: x*x, IntegerType())
df = hiveCtx.sql("SELECT square(someNumericField) FROM tableName")
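The same idea as a self-contained sketch with the Spark 2.0 API, where spark.udf.register replaces registerFunction; the table and column names are invented for the example:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("UDFDemo").getOrCreate()

# Register a Python lambda as a SQL function named "square".
spark.udf.register("square", lambda x: x * x, IntegerType())

# Invented table, just to exercise the UDF.
spark.createDataFrame([Row(someNumericField=3), Row(someNumericField=4)]) \
    .createOrReplaceTempView("tableName")

spark.sql("SELECT square(someNumericField) FROM tableName").show()

spark.stop()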
YOUR CHALLENGE
Filter out movies with very few ratings
The problem
■ Our examples of finding the lowest-rated movies were polluted with movies
only rated by one or two people.
■ Modify one or both of these scripts to only consider movies with at least ten
ratings.
Hints
■ RDDs have a filter() function you can use
– It takes a function as a parameter, which accepts the entire key/value
pair
■ So if you’re calling filter() on an RDD that contains (movieID, (sumOfRatings,
totalRatings)) – a lambda function that takes in “x” would refer to
totalRatings as x[1][1]. x[1] gives us the “value” (sumOfRatings, totalRatings)
and x[1][1] pulls out totalRatings.
– This function should be an expression that returns True if the row should
be kept, or False if it should be discarded
■ DataFrames also have a filter() function
– It's easier: you just pass in a string expression for what you want to
filter on.
– For example: df.filter("count > 10") would only pass through rows where
the "count" column is greater than 10.
– A sketch of both filter() styles follows these hints.
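Here is a minimal sketch of both styles, not a full solution; the toy data and variable names are invented to match the hints:
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("FilterHint").getOrCreate()
sc = spark.sparkContext

# Toy (movieID, (sumOfRatings, totalRatings)) pairs.
ratingTotals = sc.parallelize([(1, (9.0, 2)), (2, (450.0, 120))])

# RDD style: x[1] is the value tuple, x[1][1] is totalRatings.
popularMovies = ratingTotals.filter(lambda x: x[1][1] >= 10)
print(popularMovies.collect())

# DataFrame style: pass a string expression to filter().
df = spark.createDataFrame([Row(movieID=1, count=2),
                            Row(movieID=2, count=120)])
df.filter("count >= 10").show()

spark.stop()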
GOOD LUCK