
Table of Contents
Introduction 1.1
Spark SQL — Structured Data Processing with Relational Queries on Massive Scale 1.2
Datasets vs DataFrames vs RDDs 1.3
Dataset API vs SQL 1.4

(WIP) Vectorized Parquet Decoding


VectorizedParquetRecordReader 2.1
VectorizedColumnReader 2.2
SpecificParquetRecordReaderBase 2.3
ColumnVector Contract — In-Memory Columnar Data 2.4
WritableColumnVector Contract 2.4.1
OnHeapColumnVector 2.4.2
OffHeapColumnVector 2.4.3

Notable Features
Vectorized Parquet Decoding (Reader) 3.1
Dynamic Partition Inserts 3.2
Bucketing 3.3
Whole-Stage Java Code Generation (Whole-Stage CodeGen) 3.4
CodegenContext 3.4.1
CodeGenerator 3.4.2
GenerateColumnAccessor 3.4.2.1
GenerateOrdering 3.4.2.2
GeneratePredicate 3.4.2.3
GenerateSafeProjection 3.4.2.4
BytesToBytesMap Append-Only Hash Map 3.4.3
Vectorized Query Execution (Batch Decoding) 3.5
ColumnarBatch — ColumnVectors as Row-Wise Table 3.5.1

Data Source API V2 3.6
Subqueries 3.7
Hint Framework 3.8
Adaptive Query Execution 3.9
ExchangeCoordinator 3.9.1
Subexpression Elimination For Code-Generated Expression Evaluation (Common Expression Reuse) 3.10
EquivalentExpressions 3.10.1
Cost-Based Optimization (CBO) 3.11
CatalogStatistics — Table Statistics in Metastore (External Catalog) 3.11.1
ColumnStat — Column Statistics 3.11.2
EstimationUtils 3.11.3
CommandUtils — Utilities for Table Statistics 3.11.4
Catalyst DSL — Implicit Conversions for Catalyst Data Structures 3.12

Developing Spark SQL Applications


Fundamentals of Spark SQL Application Development 4.1
SparkSession — The Entry Point to Spark SQL 4.2
Builder — Building SparkSession using Fluent API 4.2.1
implicits Object — Implicits Conversions 4.2.2
SparkSessionExtensions 4.2.3
Dataset — Structured Query with Data Encoder 4.3
DataFrame — Dataset of Rows with RowEncoder 4.3.1
Row 4.3.2
DataSource API — Managing Datasets in External Data Sources 4.4
DataFrameReader — Loading Data From External Data Sources 4.4.1
DataFrameWriter — Saving Data To External Data Sources 4.4.2
Dataset API — Dataset Operators 4.5
Typed Transformations 4.5.1
Untyped Transformations 4.5.2
Basic Actions 4.5.3
Actions 4.5.4
DataFrameNaFunctions — Working With Missing Data 4.5.5

DataFrameStatFunctions — Working With Statistic Functions 4.5.6
Column 4.6
Column API — Column Operators 4.6.1
TypedColumn 4.6.2
Basic Aggregation — Typed and Untyped Grouping Operators 4.7
RelationalGroupedDataset — Untyped Row-based Grouping 4.7.1
KeyValueGroupedDataset — Typed Grouping 4.7.2
Dataset Join Operators 4.8
Broadcast Joins (aka Map-Side Joins) 4.8.1
Window Aggregation 4.9
WindowSpec — Window Specification 4.9.1
Window Utility Object — Defining Window Specification 4.9.2
Standard Functions — functions Object 4.10
Aggregate Functions 4.10.1
Collection Functions 4.10.2
Date and Time Functions 4.10.3
Regular Functions (Non-Aggregate Functions) 4.10.4
Window Aggregation Functions 4.10.5
User-Defined Functions (UDFs) 4.11
UDFs are Blackbox — Don’t Use Them Unless You’ve Got No Choice 4.11.1
UserDefinedFunction 4.11.2
Schema — Structure of Data 4.12
StructType 4.12.1
StructField — Single Field in StructType 4.12.2
Data Types 4.12.3
Multi-Dimensional Aggregation 4.13
Dataset Caching and Persistence 4.14
User-Friendly Names Of Cached Queries in web UI’s Storage Tab 4.14.1
Dataset Checkpointing 4.15
UserDefinedAggregateFunction — Contract for User-Defined Untyped Aggregate
Functions (UDAFs) 4.16
Aggregator — Contract for User-Defined Typed Aggregate Functions (UDAFs) 4.17
Configuration Properties 4.18

SparkSession Registries
Catalog — Metastore Management Interface 5.1
CatalogImpl 5.1.1
ExecutionListenerManager — Management Interface of QueryExecutionListeners 5.2
ExperimentalMethods 5.3
ExternalCatalog Contract — External Catalog (Metastore) of Permanent Relational Entities 5.4
InMemoryCatalog 5.4.1
HiveExternalCatalog — Hive-Aware Metastore of Permanent Relational Entities 5.4.2
FunctionRegistry — Contract for Function Registries (Catalogs) 5.5
GlobalTempViewManager — Management Interface of Global Temporary Views 5.6
SessionCatalog — Session-Scoped Catalog of Relational Entities 5.7
CatalogTable — Table Specification (Native Table Metadata) 5.7.1
CatalogStorageFormat — Storage Specification of Table or Partition 5.7.1.1
CatalogTablePartition — Partition Specification of Table 5.7.1.2
BucketSpec — Bucketing Specification of Table 5.7.1.3
HiveSessionCatalog — Hive-Specific Catalog of Relational Entities 5.7.2
HiveMetastoreCatalog — Legacy SessionCatalog for Converting Hive Metastore Relations to Data Source Relations 5.7.3
SessionState 5.8
BaseSessionStateBuilder — Generic Builder of SessionState 5.8.1
SessionStateBuilder 5.8.2
HiveSessionStateBuilder — Builder of Hive-Specific SessionState 5.8.3
SharedState — State Shared Across SparkSessions 5.9
CacheManager — In-Memory Cache for Tables and Views 5.10
CachedRDDBuilder 5.10.1
RuntimeConfig — Management Interface of Runtime Configuration 5.11
SQLConf — Internal Configuration Store 5.12
StaticSQLConf — Cross-Session, Immutable and Static SQL Configuration 5.12.1
CatalystConf 5.12.2
UDFRegistration — Session-Scoped FunctionRegistry 5.13

File-Based Data Sources


FileFormat 6.1
OrcFileFormat 6.1.1
ParquetFileFormat 6.1.2
TextBasedFileFormat 6.2
CSVFileFormat 6.2.1
JsonFileFormat 6.2.2
TextFileFormat 6.2.3
JsonDataSource 6.3
FileCommitProtocol 6.4
SQLHadoopMapReduceCommitProtocol 6.4.1
PartitionedFile — File Block in FileFormat Data Source 6.5
FileScanRDD — Input RDD of FileSourceScanExec Physical Operator 6.6
ParquetReadSupport — Non-Vectorized ReadSupport in Parquet Data Source 6.7
RecordReaderIterator — Scala Iterator over Hadoop RecordReader’s Values 6.8

Kafka Data Source


Kafka Data Source 7.1
Kafka Data Source Options 7.2
KafkaSourceProvider 7.3
KafkaRelation 7.4
KafkaSourceRDD 7.5
KafkaSourceRDDOffsetRange 7.5.1
KafkaSourceRDDPartition 7.5.2
ConsumerStrategy Contract — Kafka Consumer Providers 7.6
KafkaOffsetReader 7.7
KafkaOffsetRangeLimit 7.8
KafkaDataConsumer Contract 7.9
InternalKafkaConsumer 7.9.1
KafkaWriter Helper Object — Writing Structured Queries to Kafka 7.10
KafkaWriteTask 7.10.1
JsonUtils Helper Object 7.11

Avro Data Source
Avro Data Source 8.1
AvroFileFormat — FileFormat For Avro-Encoded Files 8.2
AvroOptions — Avro Data Source Options 8.3
CatalystDataToAvro Unary Expression 8.4
AvroDataToCatalyst Unary Expression 8.5

JDBC Data Source


JDBC Data Source 9.1
JDBCOptions — JDBC Data Source Options 9.2
JdbcRelationProvider 9.3
JDBCRelation 9.4
JDBCRDD 9.5
JdbcDialect 9.6
JdbcUtils Helper Object 9.7

Hive Data Source


Hive Integration 10.1
Hive Metastore 10.1.1
Spark SQL CLI — spark-sql 10.1.2
DataSinks Strategy 10.1.3
HiveFileFormat 10.2
HiveClient 10.3
HiveClientImpl — The One and Only HiveClient 10.4
HiveUtils 10.5

Extending Spark SQL / Data Source API V2


DataSourceV2 — Data Sources in Data Source API V2 11.1
ReadSupport Contract — "Readable" Data Sources 11.2
WriteSupport Contract — "Writable" Data Sources 11.3

DataSourceReader 11.4
SupportsPushDownFilters 11.4.1
SupportsPushDownRequiredColumns 11.4.2
SupportsReportPartitioning 11.4.3
SupportsReportStatistics 11.4.4
SupportsScanColumnarBatch 11.4.5
DataSourceWriter 11.5
SessionConfigSupport 11.6
InputPartition 11.7
InputPartitionReader 11.8
DataWriter 11.9
DataWriterFactory 11.10
InternalRowDataWriterFactory 11.10.1
DataSourceV2StringFormat 11.11
DataSourceRDD — Input RDD Of DataSourceV2ScanExec Physical Operator 11.12
DataSourceRDDPartition 11.12.1
DataWritingSparkTask Partition Processing Function 11.13
DataSourceV2Utils Helper Object 11.14

Extending Spark SQL / Data Source API V1


DataSource — Pluggable Data Provider Framework 12.1
Custom Data Source Formats 12.2

Data Source Providers


CreatableRelationProvider Contract — Data Sources That Write Rows Per Save Mode 13.1
DataSourceRegister Contract — Registering Data Source Format 13.2
RelationProvider Contract — Relation Providers With Schema Inference 13.3
SchemaRelationProvider Contract — Relation Providers With Mandatory User-Defined Schema 13.4

Data Source Relations / Extension Contracts

BaseRelation — Collection of Tuples with Schema 14.1
HadoopFsRelation — Relation for File-Based Data Source 14.1.1
CatalystScan Contract 14.2
InsertableRelation Contract — Non-File-Based Relations with Inserting or Overwriting Data Support 14.3
PrunedFilteredScan Contract — Relations with Column Pruning and Filter Pushdown 14.4
PrunedScan Contract 14.5
TableScan Contract — Relations with Column Pruning 14.6

Others
FileFormatWriter Helper Object 15.1
Data Source Filter Predicate (For Filter Pushdown) 15.2
FileRelation Contract 15.3

Structured Query Execution


QueryExecution — Structured Query Execution Pipeline 16.1
UnsupportedOperationChecker 16.1.1
Analyzer — Logical Query Plan Analyzer 16.2
CheckAnalysis — Analysis Validation 16.2.1
SparkOptimizer — Logical Query Plan Optimizer 16.3
Catalyst Optimizer — Generic Logical Query Plan Optimizer 16.3.1
SparkPlanner — Spark Query Planner 16.4
SparkStrategy — Base for Execution Planning Strategies 16.4.1
SparkStrategies — Container of Execution Planning Strategies 16.4.2
LogicalPlanStats — Statistics Estimates and Query Hints of Logical Operator 16.5
Statistics — Estimates of Plan Statistics and Query Hints 16.5.1
HintInfo 16.5.2
LogicalPlanVisitor — Base Visitor for Computing Statistics of Logical Plan 16.5.3
SizeInBytesOnlyStatsPlanVisitor — LogicalPlanVisitor for Total Size (in Bytes) Statistic Only 16.5.4
BasicStatsPlanVisitor — Computing Statistics for Cost-Based Optimization 16.5.5
AggregateEstimation 16.5.5.1

FilterEstimation 16.5.5.2
JoinEstimation 16.5.5.3
ProjectEstimation 16.5.5.4
Partitioning — Specification of Physical Operator’s Output Partitions 16.6
Distribution Contract — Data Distribution Across Partitions 16.7
AllTuples 16.7.1
BroadcastDistribution 16.7.2
ClusteredDistribution 16.7.3
HashClusteredDistribution 16.7.4
OrderedDistribution 16.7.5
UnspecifiedDistribution 16.7.6

Catalyst Expressions
Catalyst Expression — Executable Node in Catalyst Tree 17.1
AggregateExpression 17.2
AggregateFunction Contract — Aggregate Function Expressions 17.3
AggregateWindowFunction Contract — Declarative Window Aggregate Function Expressions 17.4
AttributeReference 17.5
Alias 17.6
Attribute 17.7
BoundReference 17.8
CallMethodViaReflection 17.9
Coalesce 17.10
CodegenFallback 17.11
CollectionGenerator 17.12
ComplexTypedAggregateExpression 17.13
CreateArray 17.14
CreateNamedStruct 17.15
CreateNamedStructLike Contract 17.16
CreateNamedStructUnsafe 17.17
CumeDist 17.18
DeclarativeAggregate Contract — Unevaluable Aggregate Function Expressions 17.19

ExecSubqueryExpression 17.20
Exists 17.21
ExpectsInputTypes Contract 17.22
ExplodeBase Contract 17.23
First 17.24
Generator 17.25
GetArrayStructFields 17.26
GetArrayItem 17.27
GetMapValue 17.28
GetStructField 17.29
ImperativeAggregate 17.30
In 17.31
Inline 17.32
InSet 17.33
InSubquery 17.34
JsonToStructs 17.35
JsonTuple 17.36
ListQuery 17.37
Literal 17.38
MonotonicallyIncreasingID 17.39
Murmur3Hash 17.40
NamedExpression Contract 17.41
Nondeterministic Contract 17.42
OffsetWindowFunction Contract — Unevaluable Window Function Expressions 17.43
ParseToDate 17.44
ParseToTimestamp 17.45
PlanExpression 17.46
PrettyAttribute 17.47
RankLike Contract 17.48
ResolvedStar 17.49
RowNumberLike Contract 17.50
RuntimeReplaceable Contract 17.51
ScalarSubquery SubqueryExpression 17.52
ScalarSubquery ExecSubqueryExpression 17.53

ScalaUDF 17.54
ScalaUDAF 17.55
SimpleTypedAggregateExpression 17.56
SizeBasedWindowFunction Contract — Declarative Window Aggregate Functions with Window Size 17.57
SortOrder 17.58
Stack 17.59
Star 17.60
StaticInvoke 17.61
SubqueryExpression 17.62
TimeWindow 17.63
TypedAggregateExpression 17.64
TypedImperativeAggregate 17.65
UnaryExpression Contract 17.66
UnixTimestamp 17.67
UnresolvedAttribute 17.68
UnresolvedFunction 17.69
UnresolvedGenerator 17.70
UnresolvedOrdinal 17.71
UnresolvedRegex 17.72
UnresolvedStar 17.73
UnresolvedWindowExpression 17.74
WindowExpression 17.75
WindowFunction Contract — Window Function Expressions With WindowFrame 17.76
WindowSpecDefinition 17.77

Logical Operators

Base Logical Operators (Contracts)


LogicalPlan Contract — Logical Operator with Children and Expressions / Logical Query Plan 19.1
Command Contract — Eagerly-Executed Logical Operator 19.2

RunnableCommand Contract — Generic Logical Command with Side Effects 19.3
DataWritingCommand Contract — Logical Commands That Write Query Data 19.4
SaveAsHiveFile Contract — DataWritingCommands That Write Query Result As Hive Files 19.5

Concrete Logical Operators


Aggregate 20.1
AlterViewAsCommand 20.2
AnalysisBarrier 20.3
AnalyzeColumnCommand 20.4
AnalyzePartitionCommand 20.5
AnalyzeTableCommand 20.6
AppendData 20.7
ClearCacheCommand 20.8
CreateDataSourceTableAsSelectCommand 20.9
CreateDataSourceTableCommand 20.10
CreateHiveTableAsSelectCommand 20.11
CreateTable 20.12
CreateTableCommand 20.13
CreateTempViewUsing 20.14
CreateViewCommand 20.15
DataSourceV2Relation 20.16
DescribeColumnCommand 20.17
DescribeTableCommand 20.18
DeserializeToObject 20.19
DropTableCommand 20.20
Except 20.21
Expand 20.22
ExplainCommand 20.23
ExternalRDD 20.24
Filter 20.25
Generate 20.26
GroupingSets 20.27

Hint 20.28
HiveTableRelation 20.29
InMemoryRelation 20.30
InsertIntoDataSourceCommand 20.31
InsertIntoDataSourceDirCommand 20.32
InsertIntoDir 20.33
InsertIntoHadoopFsRelationCommand 20.34
InsertIntoHiveDirCommand 20.35
InsertIntoHiveTable 20.36
InsertIntoTable 20.37
Intersect 20.38
Join 20.39
LeafNode 20.40
LocalRelation 20.41
LogicalRDD 20.42
LogicalRelation 20.43
OneRowRelation 20.44
Pivot 20.45
Project 20.46
Range 20.47
Repartition and RepartitionByExpression 20.48
ResolvedHint 20.49
SaveIntoDataSourceCommand 20.50
ShowCreateTableCommand 20.51
ShowTablesCommand 20.52
Sort 20.53
SubqueryAlias 20.54
TypedFilter 20.55
Union 20.56
UnresolvedCatalogRelation 20.57
UnresolvedHint 20.58
UnresolvedInlineTable 20.59
UnresolvedRelation 20.60
UnresolvedTableValuedFunction 20.61

Window 20.62
WithWindowDefinition 20.63
WriteToDataSourceV2 20.64
View 20.65

Physical Operators
SparkPlan Contract — Physical Operators in Physical Query Plan of Structured Query 21.1
CodegenSupport Contract — Physical Operators with Java Code Generation 21.2
DataSourceScanExec Contract — Leaf Physical Operators to Scan Over BaseRelation 21.3
ColumnarBatchScan Contract — Physical Operators With Vectorized Reader 21.4
ObjectConsumerExec Contract — Unary Physical Operators with Child Physical Operator with One-Attribute Output Schema 21.5
BaseLimitExec Contract 21.6
Exchange Contract 21.7
Projection Contract — Functions to Produce InternalRow for InternalRow 21.8
UnsafeProjection — Generic Function to Project InternalRows to UnsafeRows 21.8.1
GenerateUnsafeProjection 21.8.2
GenerateMutableProjection 21.8.3
InterpretedProjection 21.8.4
CodeGeneratorWithInterpretedFallback 21.8.5
SQLMetric — SQL Execution Metric of Physical Operator 21.9

Concrete Physical Operators


BroadcastExchangeExec 22.1
BroadcastHashJoinExec 22.2
BroadcastNestedLoopJoinExec 22.3
CartesianProductExec 22.4
CoalesceExec 22.5
DataSourceV2ScanExec 22.6
DataWritingCommandExec 22.7
DebugExec 22.8

DeserializeToObjectExec 22.9
ExecutedCommandExec 22.10
ExpandExec 22.11
ExternalRDDScanExec 22.12
FileSourceScanExec 22.13
FilterExec 22.14
GenerateExec 22.15
HashAggregateExec 22.16
HiveTableScanExec 22.17
InMemoryTableScanExec 22.18
LocalTableScanExec 22.19
MapElementsExec 22.20
ObjectHashAggregateExec 22.21
ObjectProducerExec 22.22
ProjectExec 22.23
RangeExec 22.24
RDDScanExec 22.25
ReusedExchangeExec 22.26
RowDataSourceScanExec 22.27
SampleExec 22.28
ShuffleExchangeExec 22.29
ShuffledHashJoinExec 22.30
SerializeFromObjectExec 22.31
SortAggregateExec 22.32
SortMergeJoinExec 22.33
SortExec 22.34
SubqueryExec 22.35
InputAdapter 22.36
WindowExec 22.37
AggregateProcessor 22.37.1
WindowFunctionFrame 22.37.2
WholeStageCodegenExec 22.38
WriteToDataSourceV2Exec 22.39

Logical Analysis Rules (Check, Evaluation, Conversion and Resolution)
AliasViewChild 23.1
CleanupAliases 23.2
DataSourceAnalysis 23.3
DetermineTableStats 23.4
ExtractWindowExpressions 23.5
FindDataSourceTable 23.6
HandleNullInputsForUDF 23.7
HiveAnalysis 23.8
InConversion 23.9
LookupFunctions 23.10
PreprocessTableCreation 23.11
PreWriteCheck 23.12
RelationConversions 23.13
ResolveAliases 23.14
ResolveBroadcastHints 23.15
ResolveCoalesceHints 23.16
ResolveCreateNamedStruct 23.17
ResolveFunctions 23.18
ResolveHiveSerdeTable 23.19
ResolveInlineTables 23.20
ResolveMissingReferences 23.21
ResolveOrdinalInOrderByAndGroupBy 23.22
ResolveOutputRelation 23.23
ResolveReferences 23.24
ResolveRelations 23.25
ResolveSQLOnFile 23.26
ResolveSubquery 23.27
ResolveWindowFrame 23.28
ResolveWindowOrder 23.29
TimeWindowing 23.30

UpdateOuterReferences 23.31
WindowFrameCoercion 23.32
WindowsSubstitution 23.33

Base Logical Optimizations (Optimizer)


CollapseWindow 24.1
ColumnPruning 24.2
CombineTypedFilters 24.3
CombineUnions 24.4
ComputeCurrentTime 24.5
ConstantFolding 24.6
CostBasedJoinReorder 24.7
DecimalAggregates 24.8
EliminateSerialization 24.9
EliminateSubqueryAliases 24.10
EliminateView 24.11
GetCurrentDatabase 24.12
LimitPushDown 24.13
NullPropagation 24.14
OptimizeIn 24.15
OptimizeSubqueries 24.16
PropagateEmptyRelation 24.17
PullupCorrelatedPredicates 24.18
PushDownPredicate 24.19
PushPredicateThroughJoin 24.20
ReorderJoin 24.21
ReplaceExpressions 24.22
RewriteCorrelatedScalarSubquery 24.23
RewritePredicateSubquery 24.24
SimplifyCasts 24.25

Extended Logical Optimizations (SparkOptimizer)
ExtractPythonUDFFromAggregate 25.1
OptimizeMetadataOnlyQuery 25.2
PruneFileSourcePartitions 25.3
PushDownOperatorsToDataSource 25.4

Execution Planning Strategies


Aggregation 26.1
BasicOperators 26.2
DataSourceStrategy 26.3
DataSourceV2Strategy 26.4
FileSourceStrategy 26.5
HiveTableScans 26.6
InMemoryScans 26.7
JoinSelection 26.8
SpecialLimits 26.9

Physical Query Optimizations


CollapseCodegenStages 27.1
EnsureRequirements 27.2
ExtractPythonUDFs 27.3
PlanSubqueries 27.4
ReuseExchange 27.5
ReuseSubquery 27.6

Encoders
Encoder — Internal Row Converter 28.1
Encoders Factory Object 28.1.1
ExpressionEncoder — Expression-Based Encoder 28.1.2

RowEncoder — Encoder for DataFrames 28.1.3
LocalDateTimeEncoder — Custom ExpressionEncoder for java.time.LocalDateTime 28.1.4

RDDs
ShuffledRowRDD 29.1

Monitoring
SQL Tab — Monitoring Structured Queries in web UI 30.1
SQLListener Spark Listener 30.1.1
QueryExecutionListener 30.2
SQLAppStatusListener Spark Listener 30.3
SQLAppStatusPlugin 30.4
SQLAppStatusStore 30.5
WriteTaskStats 30.6
BasicWriteTaskStats 30.6.1
WriteTaskStatsTracker 30.7
BasicWriteTaskStatsTracker 30.7.1
WriteJobStatsTracker 30.8
BasicWriteJobStatsTracker 30.8.1
Logging 30.9

Performance Tuning and Debugging


Spark SQL’s Performance Tuning Tips and Tricks (aka Case Studies) 31.1
Number of Partitions for groupBy Aggregation 31.1.1
Debugging Query Execution 31.2

Catalyst — Tree Manipulation Framework


Catalyst — Tree Manipulation Framework 32.1
TreeNode — Node in Catalyst Tree 32.2

QueryPlan — Structured Query Plan 32.2.1
RuleExecutor Contract — Tree Transformation Rule Executor 32.3
Catalyst Rule — Named Transformation of TreeNodes 32.3.1
QueryPlanner — Converting Logical Plan to Physical Trees 32.4
GenericStrategy 32.5

Tungsten Execution Backend


Tungsten Execution Backend (Project Tungsten) 33.1
InternalRow — Abstract Binary Row Format 33.2
UnsafeRow — Mutable Raw-Memory Unsafe Binary Row Format 33.2.1
AggregationIterator — Generic Iterator of UnsafeRows for Aggregate Physical Operators 33.3
ObjectAggregationIterator 33.3.1
SortBasedAggregationIterator 33.3.2
TungstenAggregationIterator — Iterator of UnsafeRows for HashAggregateExec Physical Operator 33.3.3
CatalystSerde 33.4
ExternalAppendOnlyUnsafeRowArray — Append-Only Array for UnsafeRows (with Disk Spill Threshold) 33.5
UnsafeFixedWidthAggregationMap 33.6

SQL Support
SQL Parsing Framework 34.1
AbstractSqlParser — Base SQL Parsing Infrastructure 34.2
AstBuilder — ANTLR-based SQL Parser 34.3
CatalystSqlParser — DataTypes and StructTypes Parser 34.4
ParserInterface — SQL Parser Contract 34.5
SparkSqlAstBuilder 34.6
SparkSqlParser — Default SQL Parser 34.7

Spark Thrift Server

Thrift JDBC/ODBC Server — Spark Thrift Server (STS) 35.1
SparkSQLEnv 35.2

Varia / Uncategorized
SQLExecution Helper Object 36.1
RDDConversions Helper Object 36.2
CatalystTypeConverters Helper Object 36.3
StatFunctions Helper Object 36.4
SubExprUtils Helper Object 36.5
PredicateHelper Scala Trait 36.6
SchemaUtils Helper Object 36.7
AggUtils Helper Object 36.8
ScalaReflection 36.9
CreateStruct Function Builder 36.10
MultiInstanceRelation 36.11
TypeCoercion Object 36.12
TypeCoercionRule — Contract For Type Coercion Rules 36.13
ExtractEquiJoinKeys — Scala Extractor for Destructuring Join Logical Operators 36.14
PhysicalAggregation — Scala Extractor for Destructuring Aggregate Logical Operators 36.15
PhysicalOperation — Scala Extractor for Destructuring Logical Query Plans 36.16
HashJoin — Contract for Hash-based Join Physical Operators 36.17


HashedRelation 36.18
LongHashedRelation 36.18.1
UnsafeHashedRelation 36.18.2
KnownSizeEstimation 36.19
SizeEstimator 36.20
BroadcastMode 36.21
HashedRelationBroadcastMode 36.21.1
IdentityBroadcastMode 36.21.2
PartitioningUtils 36.22
HadoopFileLinesReader 36.23
CatalogUtils Helper Object 36.24
ExternalCatalogUtils 36.25

PartitioningAwareFileIndex 36.26
BufferedRowIterator 36.27
CompressionCodecs 36.28
(obsolete) SQLContext 36.29

Introduction

The Internals of Spark SQL (Apache Spark 2.4.4)
Welcome to The Internals of Spark SQL gitbook! I’m very excited to have you here and
hope you will enjoy exploring the internals of Spark SQL as much as I have.

I write to discover what I know.

— Flannery O'Connor
I’m Jacek Laskowski, a freelance IT consultant, software engineer and technical instructor
specializing in Apache Spark, Apache Kafka and Kafka Streams (with Scala and sbt).

I offer software development and consultancy services with hands-on in-depth workshops
and mentoring. Reach out to me at [email protected] or @jaceklaskowski to discuss
opportunities.

Consider joining me at Warsaw Scala Enthusiasts and Warsaw Spark meetups in Warsaw,
Poland.

Tip: I’m also writing other books in the "The Internals of" series about Apache Spark, Spark Structured Streaming, Apache Kafka, and Kafka Streams.

Expect text and code snippets from a variety of public sources. Attribution follows.

Now, let me introduce you to Spark SQL and Structured Queries.


Spark SQL — Structured Data Processing with Relational Queries on Massive Scale
Like Apache Spark in general, Spark SQL in particular is all about distributed in-memory
computations on massive scale.

Quoting the Spark SQL: Relational Data Processing in Spark paper on Spark SQL:

Spark SQL is a new module in Apache Spark that integrates relational processing with
Spark’s functional programming API.

Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g.,
declarative queries and optimized storage), and lets SQL users call complex analytics
libraries in Spark (e.g., machine learning).

The primary difference between the computation models of Spark SQL and Spark Core is
the relational framework for ingesting, querying and persisting (semi)structured data using
relational queries (aka structured queries) that can be expressed in good ol' SQL (with
many features of HiveQL) and the high-level SQL-like functional declarative Dataset API
(aka Structured Query DSL).

Note: Semi-structured and structured data are collections of records that can be described using a schema with column names, their types, and whether a column can be null or not (aka nullability).

Whichever query interface you use to describe a structured query, i.e. SQL or Query DSL,
the query becomes a Dataset (with a mandatory Encoder).

From Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark:

For SQL users, Spark SQL provides state-of-the-art SQL performance and maintains
compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing
Hive data formats, user-defined functions (UDF), and the Hive metastore.

For Spark users, Spark SQL becomes the narrow-waist for manipulating (semi-)
structured data as well as ingesting data from sources that provide schema, such as
JSON, Parquet, Hive, or EDWs. It truly unifies SQL and sophisticated analysis, allowing
users to mix and match SQL and more imperative programming APIs for advanced
analytics.

For open source hackers, Spark SQL proposes a novel, elegant way of building query
planners. It is incredibly easy to add new optimizations under this framework.


A Dataset is a programming interface to the structured query execution pipeline with


transformations and actions (as in the good old days of RDD API in Spark Core).

Internally, a structured query is a Catalyst tree of (logical and physical) relational operators
and expressions.

When an action is executed on a Dataset (directly, e.g. show or count, or indirectly, e.g.
save or saveAsTable) the structured query (behind Dataset ) goes through the execution
stages:

1. Logical Analysis

2. Caching Replacement

3. Logical Query Optimization (using rule-based and cost-based optimizations)

4. Physical Planning

5. Physical Optimization (e.g. Whole-Stage Java Code Generation or Adaptive Query Execution)

6. Constructing the RDD of Internal Binary Rows (that represents the structured query in
terms of Spark Core’s RDD API)

As of Spark 2.0, Spark SQL is now de facto the primary and feature-rich interface to Spark’s
underlying in-memory distributed platform (hiding Spark Core’s RDDs behind higher-level
abstractions that allow for logical and physical query optimization strategies even without
your consent).

Note: You can find out more on the core of Apache Spark (aka Spark Core) in the Mastering Apache Spark 2 gitbook.

In other words, Spark SQL’s Dataset API describes a distributed computation that will
eventually be converted to a DAG of RDDs for execution.

Note: Under the covers, structured queries are automatically compiled into corresponding RDD operations.

Spark SQL supports structured queries in batch and streaming modes (with the latter as a
separate module of Spark SQL called Spark Structured Streaming).

Note: You can find out more on Spark Structured Streaming in the Spark Structured Streaming (Apache Spark 2.2+) gitbook.


// Define the schema using a case class


case class Person(name: String, age: Int)

// you could read people from a CSV file


// It's been a while since you saw RDDs, hasn't it?
// Excuse me for bringing you the old past.
import org.apache.spark.rdd.RDD
val peopleRDD: RDD[Person] = sc.parallelize(Seq(Person("Jacek", 10)))

// Convert RDD[Person] to Dataset[Person] and run a query

// Automatic schema inference from existing RDDs


scala> val people = peopleRDD.toDS
people: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

// Query for teenagers using Scala Query DSL


scala> val teenagers = people.where('age >= 10).where('age <= 19).select('name).as[String]
teenagers: org.apache.spark.sql.Dataset[String] = [name: string]

scala> teenagers.show
+-----+
| name|
+-----+
|Jacek|
+-----+

// You could however want to use good ol' SQL, couldn't you?

// 1. Register people Dataset as a temporary view in Catalog


people.createOrReplaceTempView("people")

// 2. Run SQL query


val teenagers = sql("SELECT * FROM people WHERE age >= 10 AND age <= 19")
scala> teenagers.show
+-----+---+
| name|age|
+-----+---+
|Jacek| 10|
+-----+---+

Spark SQL supports loading datasets from various data sources including tables in Apache
Hive. With Hive support enabled, you can load datasets from existing Apache Hive
deployments and save them back to Hive tables if needed.


sql("CREATE OR REPLACE TEMPORARY VIEW v1 (key INT, value STRING) USING csv OPTIONS ('p
ath'='people.csv', 'header'='true')")

// Queries are expressed in HiveQL


sql("FROM v1").show

scala> sql("desc EXTENDED v1").show(false)


+----------+---------+-------+
|col_name |data_type|comment|
+----------+---------+-------+
|# col_name|data_type|comment|
|key |int |null |
|value |string |null |
+----------+---------+-------+

Like SQL and NoSQL databases, Spark SQL offers query optimizations using a rule-based query optimizer (aka Catalyst Optimizer), whole-stage Java code generation (aka Whole-Stage Codegen, which can often be better than your own custom hand-written code!) and the Tungsten execution engine with its own internal binary row format.

As of Spark SQL 2.2, structured queries can be further optimized using Hint Framework.

Spark SQL introduces a tabular data abstraction called Dataset (that was previously DataFrame). The Dataset data abstraction is designed to make processing large amounts of structured tabular data on Spark infrastructure simpler and faster.

Quoting Apache Drill which applies to Spark SQL perfectly:

Note: A SQL query engine for relational and NoSQL databases with direct queries on self-describing and semi-structured data in files, e.g. JSON or Parquet, and HBase tables without needing to specify metadata definitions in a centralized store.

The following snippet shows a batch ETL pipeline that processes JSON files and saves their subset as CSVs.

spark.read
.format("json")
.load("input-json")
.select("name", "score")
.where($"score" > 15)
.write
.format("csv")
.save("output-csv")

With Structured Streaming feature however, the above static batch query becomes dynamic
and continuous paving the way for continuous applications.


import org.apache.spark.sql.types._
val schema = StructType(
StructField("id", LongType, nullable = false) ::
StructField("name", StringType, nullable = false) ::
StructField("score", DoubleType, nullable = false) :: Nil)

spark.readStream
.format("json")
.schema(schema)
.load("input-json")
.select("name", "score")
.where('score > 15)
.writeStream
.format("console")
.start

// -------------------------------------------
// Batch: 1
// -------------------------------------------
// +-----+-----+
// | name|score|
// +-----+-----+
// |Jacek| 20.5|
// +-----+-----+

As of Spark 2.0, the main data abstraction of Spark SQL is Dataset. It represents structured data, i.e. records with a known schema. This structured data representation enables a compact binary representation using a compressed columnar format that is stored in managed objects outside the JVM’s heap. It is supposed to speed computations up by reducing memory usage and GC pressure.

Spark SQL supports predicate pushdown to optimize performance of Dataset queries and
can also generate optimized code at runtime.
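For example (a hedged sketch: the Parquet path and the age column below are made up), you can see a pushed-down predicate in the plan of a simple query; with the Parquet data source, the filter shows up in the FileScan node of the physical plan.

// Hypothetical Parquet dataset with an age column
val people = spark.read.parquet("people.parquet")

// The comparison on a plain column can be pushed down to the Parquet reader.
// Look for something like "PushedFilters: [IsNotNull(age), GreaterThan(age,21)]"
// in the physical plan.
people.where($"age" > 21).explain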

Spark SQL comes with the different APIs to work with:

1. Dataset API (formerly DataFrame API) with a strongly-typed LINQ-like Query DSL that
Scala programmers will likely find very appealing to use.

2. Structured Streaming API (aka Streaming Datasets) for continuous incremental execution of structured queries.

3. Non-programmers will likely use SQL as their query language through direct integration
with Hive

4. JDBC/ODBC fans can use JDBC interface (through Thrift JDBC/ODBC Server) and
connect their tools to Spark’s distributed query engine.


Spark SQL comes with a uniform interface for data access in distributed storage systems
like Cassandra or HDFS (Hive, Parquet, JSON) using specialized DataFrameReader and
DataFrameWriter objects.

Spark SQL allows you to execute SQL-like queries on large volume of data that can live in
Hadoop HDFS or Hadoop-compatible file systems like S3. It can access data from different
data sources - files or tables.

Spark SQL defines the following types of functions:

standard functions or User-Defined Functions (UDFs) that take values from a single row
as input to generate a single return value for every input row.

basic aggregate functions that operate on a group of rows and calculate a single return
value per group.

window aggregate functions that operate on a group of rows and calculate a single
return value for each row in a group.
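A short sketch showing one function of each kind (the tiny sales dataset and its column names are made up for the example):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val sales = Seq(("A", 10), ("A", 20), ("B", 5)).toDF("shop", "amount")

// Standard function: one value in, one value out, for every input row
sales.select(upper($"shop") as "shop", $"amount").show

// Basic aggregate function: a single return value per group
sales.groupBy($"shop").agg(sum($"amount") as "total").show

// Window aggregate function: a single return value for each row in a group
val byShop = Window.partitionBy($"shop")
sales.withColumn("shop_total", sum($"amount") over byShop).show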

There are two supported catalog implementations —  in-memory (default) and hive  — that
you can set using spark.sql.catalogImplementation property.
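spark.sql.catalogImplementation is a static configuration property, so it has to be set before the SparkSession is created; enableHiveSupport() is the builder shortcut for the hive value (and requires Hive classes on the classpath). A minimal sketch:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("catalog-demo")
  .config("spark.sql.catalogImplementation", "hive")  // or simply .enableHiveSupport()
  .getOrCreate()

// Should print "hive" here, and "in-memory" for a plain SparkSession
println(spark.conf.get("spark.sql.catalogImplementation"))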

From user@spark:

If you already loaded csv data into a dataframe, why not register it as a table, and use
Spark SQL to find max/min or any other aggregates? SELECT MAX(column_name)
FROM dftable_name …​ seems natural.

If you’re more comfortable with SQL, it might be worth registering this DataFrame as a table and generating a SQL query against it (generate a string with a series of min-max calls).

You can parse data from external data sources and let the schema inferencer deduce the schema.
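For example, with the CSV data source you can ask for schema inference explicitly (the people.csv file and its columns are just for illustration):

val people = spark.read
  .option("header", "true")
  .option("inferSchema", "true")  // sample the data to deduce the column types
  .csv("people.csv")

people.printSchema
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = true)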


// Example 1
val df = Seq(1 -> 2).toDF("i", "j")
val query = df.groupBy('i)
.agg(max('j).as("aggOrdering"))
.orderBy(sum('j))
.as[(Int, Int)]
query.collect contains (1, 2) // true

// Example 2
val df = Seq((1, 1), (-1, 1)).toDF("key", "value")
df.createOrReplaceTempView("src")
scala> sql("SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp").show
+-------------------+
|(IF((a > 0), a, 0))|
+-------------------+
| 1|
| 0|
+-------------------+

Further Reading and Watching


1. Spark SQL home page

2. (video) Spark’s Role in the Big Data Ecosystem - Matei Zaharia

3. Introducing Apache Spark 2.0


Datasets vs DataFrames vs RDDs


Many of you may have been asking yourselves why you should be using Datasets rather than the foundation of all Spark, i.e. RDDs with case classes.

This document collects the advantages of Dataset over RDD[CaseClass] to answer the question Dan asked on Twitter:

"In #Spark, what is the advantage of a DataSet over an RDD[CaseClass]?"

Saving to or Writing from Data Sources


With the Dataset API, loading data from a data source or saving it to one is as simple as using the SparkSession.read or Dataset.write methods, respectively.
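A minimal sketch (the parquet paths are made up):

// Loading a Dataset from a data source...
val people = spark.read.format("parquet").load("people.parquet")

// ...and saving it back to one
people.write.format("parquet").mode("overwrite").save("people-copy.parquet")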

Accessing Fields / Columns


You select columns in a Dataset without worrying about the positions of the columns.

With an RDD, you have to do an additional hop over the case class and access fields by name.
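A side-by-side sketch (with a Person case class like the one used earlier in the book):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

case class Person(name: String, age: Int)

// Dataset: refer to columns by name, no extra hop
val people: Dataset[Person] = Seq(Person("Jacek", 10)).toDS
people.select('name).show

// RDD[Person]: go through the case class and access fields explicitly
val peopleRDD: RDD[Person] = sc.parallelize(Seq(Person("Jacek", 10)))
peopleRDD.map(_.name).collect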


Dataset API vs SQL


Spark SQL supports two "modes" to write structured queries: Dataset API and SQL.

It turns out that some structured queries can be expressed more easily using the Dataset API, but there are some that are only possible in SQL. In other words, you may find mixing Dataset API and SQL modes challenging yet rewarding.

You could at some point consider writing structured queries using Catalyst data structures
directly hoping to avoid the differences and focus on what is supported in Spark SQL, but
that could quickly become unwieldy for maintenance (i.e. finding Spark SQL developers who
could be comfortable with it as well as being fairly low-level and therefore possibly too
dependent on a specific Spark SQL version).

This section describes the differences between Spark SQL features to develop Spark
applications using Dataset API and SQL mode.

1. RuntimeReplaceable Expressions are only available using SQL mode by means of SQL functions like nvl, nvl2, ifnull, nullif, etc. (see the example right after this list)

2. Column.isin and SQL IN predicate with a subquery (and In Predicate Expression)
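As an illustration of the first point, nvl can be called in SQL mode, while the Dataset API has no nvl function in the functions object (coalesce is its closest counterpart):

sql("SELECT nvl(NULL, 'fallback') AS value").show
// +--------+
// |   value|
// +--------+
// |fallback|
// +--------+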


VectorizedParquetRecordReader
VectorizedParquetRecordReader is a concrete SpecificParquetRecordReaderBase for the parquet file format, used for Vectorized Parquet Decoding.

VectorizedParquetRecordReader is created exclusively when ParquetFileFormat is requested

for a data reader (with spark.sql.parquet.enableVectorizedReader property enabled and the


read schema with AtomicType data types only).

Note: spark.sql.parquet.enableVectorizedReader configuration property is enabled ( true ) by default.

VectorizedParquetRecordReader takes the following to be created:

TimeZone ( null when no timezone conversion is expected)

useOffHeap flag (per spark.sql.columnVector.offheap.enabled property)

Capacity (per spark.sql.parquet.columnarReaderBatchSize property)
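A quick way to inspect the two properties above in a Spark shell (the values shown are the defaults, assuming neither property has been overridden):

scala> spark.conf.get("spark.sql.columnVector.offheap.enabled")
res0: String = false

scala> spark.conf.get("spark.sql.parquet.columnarReaderBatchSize")
res1: String = 4096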

VectorizedParquetRecordReader uses the capacity attribute for the following:

Creating WritableColumnVectors when initializing a columnar batch

Controlling number of rows when nextBatch

VectorizedParquetRecordReader uses OFF_HEAP memory mode when

spark.sql.columnVector.offheap.enabled internal configuration property is enabled ( true ).

Note: spark.sql.columnVector.offheap.enabled configuration property is disabled ( false ) by default.

Table 1. VectorizedParquetRecordReader’s Internal Properties (e.g. Registries, Counters and Flags)

batchIdx: Current batch index, i.e. the index of an InternalRow in the ColumnarBatch. Used when VectorizedParquetRecordReader is requested to getCurrentValue with the returnColumnarBatch flag disabled. Starts at 0, increments every nextKeyValue, and is reset to 0 when reading next rows into a columnar batch.

columnarBatch: ColumnarBatch

columnReaders: VectorizedColumnReaders (one reader per column) to read rows as batches. Initialized when checkEndOfRowGroup (when requested to read next rows into a columnar batch).

columnVectors: Allocated WritableColumnVectors

MEMORY_MODE: Memory mode of the ColumnarBatch, either OFF_HEAP (when useOffHeap is on as per spark.sql.columnVector.offheap.enabled configuration property) or ON_HEAP. Used exclusively when VectorizedParquetRecordReader is requested to initBatch.

missingColumns: Bitmap of columns (per index) that are missing (or simply the ones that the reader should not read)

numBatched

returnColumnarBatch: Optimization flag to control whether VectorizedParquetRecordReader offers rows as the ColumnarBatch or one row at a time only. Default: false. Enabled ( true ) when VectorizedParquetRecordReader is requested to enable returning batches. Used in nextKeyValue (to read next rows into a columnar batch) and getCurrentValue (to return the internal ColumnarBatch, not a single InternalRow ).

rowsReturned: Number of rows read already

totalCountLoadedSoFar

totalRowCount: Total number of rows to be read

nextKeyValue Method

boolean nextKeyValue() throws IOException

Note: nextKeyValue is part of Hadoop’s RecordReader to read (key, value) pairs from a Hadoop InputSplit to present a record-oriented view.


nextKeyValue …​FIXME

Note: nextKeyValue is used when:

NewHadoopRDD is requested to compute a partition ( compute )

RecordReaderIterator is requested to check whether or not there are more internal rows

resultBatch Method

ColumnarBatch resultBatch()

resultBatch gives columnarBatch if available or does initBatch.

Note: resultBatch is used exclusively when VectorizedParquetRecordReader is requested to nextKeyValue.

Initializing —  initialize Method

void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)

Note initialize is part of SpecificParquetRecordReaderBase Contract to…​FIXME.

initialize …​FIXME

enableReturningBatches Method

void enableReturningBatches()

enableReturningBatches simply turns returnColumnarBatch internal flag on.

Note: enableReturningBatches is used exclusively when ParquetFileFormat is requested for a data reader (for vectorized parquet decoding in whole-stage codegen).

Initializing Columnar Batch —  initBatch Method


void initBatch(StructType partitionColumns, InternalRow partitionValues) (1)


// private
private void initBatch() (2)
private void initBatch(
MemoryMode memMode,
StructType partitionColumns,
InternalRow partitionValues)

1. Uses MEMORY_MODE

2. Uses MEMORY_MODE and no partitionColumns and no partitionValues

initBatch creates the batch schema that is sparkSchema and the input partitionColumns

schema.

initBatch requests OffHeapColumnVector or OnHeapColumnVector to allocate column

vectors per the input memMode , i.e. OFF_HEAP or ON_HEAP memory modes, respectively.
initBatch records the allocated column vectors as the internal WritableColumnVectors.

Note: spark.sql.columnVector.offheap.enabled configuration property controls OFF_HEAP or ON_HEAP memory modes, i.e. true or false , respectively. spark.sql.columnVector.offheap.enabled is disabled by default, which means that OnHeapColumnVector is used.

initBatch creates a ColumnarBatch (with the allocated WritableColumnVectors) and

records it as the internal ColumnarBatch.

initBatch creates new slots in the allocated WritableColumnVectors for the input

partitionColumns and sets the input partitionValues as constants.

initBatch initializes missing columns with nulls .

Note: initBatch is used when:

VectorizedParquetRecordReader is requested for resultBatch

ParquetFileFormat is requested to build a data reader with partition column values appended

Reading Next Rows Into Columnar Batch —  nextBatch Method

boolean nextBatch() throws IOException


nextBatch reads at least capacity rows and returns true when there are rows available.

Otherwise, nextBatch returns false (to "announce" there are no rows available).

Internally, nextBatch firstly requests every WritableColumnVector (in the columnVectors


internal registry) to reset itself.

nextBatch requests the ColumnarBatch to specify the number of rows (in batch) as 0

(effectively resetting the batch and making it available for reuse).

When the rowsReturned is greater than the totalRowCount, nextBatch finishes with
(returns) false (to "announce" there are no rows available).

nextBatch checkEndOfRowGroup.

nextBatch calculates the number of rows left to be returned as a minimum of the capacity

and the totalCountLoadedSoFar reduced by the rowsReturned.

nextBatch requests every VectorizedColumnReader to readBatch (with the number of rows

left to be returned and associated WritableColumnVector).

Note: VectorizedColumnReaders use their own WritableColumnVectors for storing values read. The numbers of VectorizedColumnReaders and WritableColumnVectors are equal.

Note: The number of rows in the internal ColumnarBatch matches the number of rows that VectorizedColumnReaders decoded and stored in corresponding WritableColumnVectors.

In the end, nextBatch registers the progress as follows:

The number of rows read is added to the rowsReturned counter

Requests the internal ColumnarBatch to set the number of rows (in batch) to be the
number of rows read

The numBatched registry is exactly the number of rows read

The batchIdx registry becomes 0

nextBatch finishes with (returns) true (to "announce" there are rows available).

Note: nextBatch is used exclusively when VectorizedParquetRecordReader is requested to nextKeyValue.

checkEndOfRowGroup Internal Method

void checkEndOfRowGroup() throws IOException


checkEndOfRowGroup …​FIXME

Note: checkEndOfRowGroup is used exclusively when VectorizedParquetRecordReader is requested to read next rows into a columnar batch.

Getting Current Value (as Columnar Batch or Single InternalRow) —  getCurrentValue Method

Object getCurrentValue()

Note: getCurrentValue is part of the Hadoop RecordReader Contract to break the data into key/value pairs for input to a Hadoop Mapper .

getCurrentValue returns the entire ColumnarBatch with the returnColumnarBatch flag

enabled ( true ) or requests it for a single row instead.

Note: getCurrentValue is used when:

NewHadoopRDD is requested to compute a partition ( compute )

RecordReaderIterator is requested for the next internal row


VectorizedColumnReader
VectorizedColumnReader is a vectorized column reader that

VectorizedParquetRecordReader uses for Vectorized Parquet Decoding.

VectorizedColumnReader is created exclusively when VectorizedParquetRecordReader is

requested to checkEndOfRowGroup (when requested to read next rows into a columnar


batch).

Once created, VectorizedColumnReader is requested to read rows as a batch (when


VectorizedParquetRecordReader is requested to read next rows into a columnar batch).

VectorizedColumnReader is given a WritableColumnVector to store rows read as a batch.

VectorizedColumnReader takes the following to be created:

Parquet ColumnDescriptor

Parquet OriginalType

Parquet PageReader

TimeZone (for timezone conversion to apply to int96 timestamps. null for no


conversion)

Reading Rows As Batch —  readBatch Method

void readBatch(
int total,
WritableColumnVector column) throws IOException

readBatch …​FIXME

Note: readBatch is used exclusively when VectorizedParquetRecordReader is requested to read next rows into a columnar batch.


SpecificParquetRecordReaderBase — Hadoop RecordReader
SpecificParquetRecordReaderBase is the base Hadoop RecordReader for parquet format

readers that directly materialize to T .

Note RecordReader reads <key, value> pairs from an Hadoop InputSplit .

Note: VectorizedParquetRecordReader is the one and only SpecificParquetRecordReaderBase that directly materializes to Java Objects .

Table 1. SpecificParquetRecordReaderBase’s Internal Properties (e.g. Registries, Counters and Flags)

sparkSchema: Spark schema. Initialized when SpecificParquetRecordReaderBase is requested to initialize (from the value of the org.apache.spark.sql.parquet.row.requested_schema configuration as set when ParquetFileFormat is requested to build a data reader with partition column values appended)

initialize Method

void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)

Note initialize is part of RecordReader Contract to initialize a RecordReader .

initialize …​FIXME


ColumnVector Contract — In-Memory Columnar Data
ColumnVector is the contract of in-memory columnar data (of a DataType).

Table 1. ColumnVector Contract (Abstract Methods Only)


Method Description

ColumnarArray getArray(int rowId)


getArray

Used when…​FIXME

byte[] getBinary(int rowId)


getBinary

Used when…​FIXME

boolean getBoolean(int rowId)


getBoolean

Used when…​FIXME

byte getByte(int rowId)


getByte

Used when…​FIXME

ColumnVector getChild(int ordinal)


getChild

Used when…​FIXME

Decimal getDecimal(
int rowId,
int precision,
getDecimal int scale)

Used when…​FIXME

double getDouble(int rowId)


getDouble

Used when…​FIXME


float getFloat(int rowId)


getFloat

Used when…​FIXME

int getInt(int rowId)


getInt

Used when…​FIXME

long getLong(int rowId)


getLong

Used when…​FIXME

ColumnarMap getMap(int ordinal)


getMap

Used when…​FIXME

short getShort(int rowId)


getShort

Used when…​FIXME

UTF8String getUTF8String(int rowId)


getUTF8String

Used when…​FIXME

boolean hasNull()

hasNull

Used when OffHeapColumnVector and OnHeapColumnVector


are requested to putNotNulls

boolean isNullAt(int rowId)


isNullAt

Used in many places

int numNulls()
numNulls

Used for testing purposes only


Table 2. ColumnVectors (Direct Implementations and Extensions)

ArrowColumnVector

OrcColumnVector

WritableColumnVector: Writable column vectors with off-heap and on-heap memory variants

ColumnVector takes a DataType of the column to be created.

Note: ColumnVector is a Java abstract class and cannot be created directly. It is created indirectly for the concrete ColumnVectors.

Note: ColumnVector is an Evolving contract that is evolving towards becoming a stable API, but is not a stable API yet and can change from one feature release to another release. In other words, using the contract is like treading on thin ice.
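A small sketch of the read side of the contract, using OnHeapColumnVector (a concrete WritableColumnVector covered later) both directly and wrapped in a ColumnarBatch; this only demonstrates the getters in action and assumes the public ColumnarBatch(ColumnVector[]) constructor available in this Spark version.

import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// A writable ColumnVector with room for 3 int values
val col = new OnHeapColumnVector(3, IntegerType)
col.putInt(0, 10)
col.putNull(1)
col.putInt(2, 30)

// Column-wise access through the ColumnVector getters
assert(col.getInt(0) == 10)
assert(col.isNullAt(1))

// Row-wise access through a ColumnarBatch
val batch = new ColumnarBatch(Array[ColumnVector](col))
batch.setNumRows(3)
val rows = batch.rowIterator()
while (rows.hasNext) {
  val row = rows.next()
  println(if (row.isNullAt(0)) null else row.getInt(0))
}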

getInterval Final Method

CalendarInterval getInterval(int rowId)

getInterval …​FIXME

Note getInterval is used when…​FIXME

getStruct Final Method

ColumnarRow getStruct(int rowId)

getStruct …​FIXME

Note getStruct is used when…​FIXME


WritableColumnVector Contract
WritableColumnVector is the extension of the ColumnVector contract for writable column

vectors that FIXME.

Table 1. WritableColumnVector Contract (Abstract Methods Only)


Method Description

int getArrayLength(int rowId)


getArrayLength

Used when…​FIXME

int getArrayOffset(int rowId)


getArrayOffset

Used when…​FIXME

UTF8String getBytesAsUTF8String(
int rowId,
getBytesAsUTF8String int count)

Used when…​FIXME

int getDictId(int rowId)


getDictId

Used when…​FIXME

void putArray(
int rowId,
int offset,
putArray int length)

Used when…​FIXME

void putBoolean(
int rowId,
putBoolean boolean value)

Used when…​FIXME


putBooleans void putBooleans(


int rowId,
int count,
boolean value)

Used when…​FIXME

void putByte(
int rowId,
putByte byte value)

Used when…​FIXME

int putByteArray(
int rowId,
byte[] value,
putByteArray int offset,
int count)

Used when…​FIXME

void putBytes(
int rowId,
int count,
byte value)
void putBytes(
putBytes int rowId,
int count,
byte[] src,
int srcIndex)

Used when…​FIXME

void putDouble(
int rowId,
putDouble double value)

Used when…​FIXME

void putDoubles(
int rowId,
int count,
byte[] src,
int srcIndex)
void putDoubles(
int rowId,
putDoubles int count,
double value)
void putDoubles(
int rowId,
int count,
double[] src,
int srcIndex)


Used when…​FIXME

void putFloat(
int rowId,
putFloat float value)

Used when…​FIXME

void putFloats(
int rowId,
int count,
byte[] src,
int srcIndex)
void putFloats(
int rowId,
int count,
putFloats float value)
void putFloats(
int rowId,
int count,
float[] src,
int srcIndex)

Used when…​FIXME

void putInt(
int rowId,
putInt int value)

Used when…​FIXME

void putInts(
int rowId,
int count,
byte[] src,
int srcIndex)
void putInts(
int rowId,
int count,
putInts int value)
void putInts(
int rowId,
int count,
int[] src,
int srcIndex)

Used when…​FIXME

void putIntsLittleEndian(
int rowId,
int count,
byte[] src,
putIntsLittleEndian int srcIndex)


Used when…​FIXME

void putLong(
int rowId,
putLong long value)

Used when…​FIXME

void putLongs(
int rowId,
int count,
byte[] src,
int srcIndex)
void putLongs(
int rowId,
int count,
putLongs long value)
void putLongs(
int rowId,
int count,
long[] src,
int srcIndex)

Used when…​FIXME

void putLongsLittleEndian(
int rowId,
int count,
putLongsLittleEndian byte[] src,
int srcIndex)

Used when…​FIXME

void putNotNull(int rowId)

putNotNull

Used when WritableColumnVector is requested to reset and


appendNotNulls

void putNotNulls(
int rowId,
putNotNulls int count)

Used when…​FIXME

void putNull(int rowId)


putNull

Used when…​FIXME


void putNulls(
putNulls int rowId,
int count)

Used when…​FIXME

void putShort(
int rowId,
putShort short value)

Used when…​FIXME

void putShorts(
int rowId,
int count,
byte[] src,
int srcIndex)
void putShorts(
int rowId,
int count,
putShorts short value)
void putShorts(
int rowId,
int count,
short[] src,
int srcIndex)

Used when…​FIXME

void reserveInternal(int capacity)

Used when:
reserveInternal
OffHeapColumnVector and OnHeapColumnVector are
created
WritableColumnVector is requested to reserve memory of
a given required capacity

WritableColumnVector reserveNewColumn(
int capacity,
reserveNewColumn DataType type)

Used when…​FIXME


Table 2. WritableColumnVectors
WritableColumnVector Description

OffHeapColumnVector

OnHeapColumnVector

WritableColumnVector takes the following to be created:

Number of rows to hold in a vector (aka capacity )

Data type of the rows stored

Note: WritableColumnVector is a Java abstract class and cannot be created directly. It is created indirectly for the concrete WritableColumnVectors.

reset Method

void reset()

reset …​FIXME

Note: reset is used when:

OrcColumnarBatchReader is requested to nextBatch

VectorizedParquetRecordReader is requested to read next rows into a columnar batch

OffHeapColumnVector and OnHeapColumnVector are created

WritableColumnVector is requested to reserveDictionaryIds

Reserving Memory Of Required Capacity —  reserve Method

void reserve(int requiredCapacity)

reserve …​FIXME


Note: reserve is used when:

OrcColumnarBatchReader is requested to putRepeatingValues , putNonNullValues , putValues , and putDecimalWritables

WritableColumnVector is requested to append values

reserveDictionaryIds Method

WritableColumnVector reserveDictionaryIds(int capacity)

reserveDictionaryIds …​FIXME

Note reserveDictionaryIds is used when…​FIXME

appendNotNulls Final Method

int appendNotNulls(int count)

appendNotNulls …​FIXME

Note appendNotNulls is used for testing purposes only.


OnHeapColumnVector
OnHeapColumnVector is a concrete WritableColumnVector that…​FIXME

OnHeapColumnVector is created when:

OnHeapColumnVector is requested to allocate column vectors and reserveNewColumn

OrcColumnarBatchReader is requested to initBatch

Allocating Column Vectors —  allocateColumns Static Method

OnHeapColumnVector[] allocateColumns(int capacity, StructType schema) (1)


OnHeapColumnVector[] allocateColumns(int capacity, StructField[] fields)

1. Simply converts StructType to StructField[] and calls the other allocateColumns

allocateColumns creates an array of OnHeapColumnVector for every field (to hold capacity

number of elements of the data type per field).

Note: allocateColumns is used when:

AggregateHashMap is created

InMemoryTableScanExec is requested to createAndDecompressColumn

VectorizedParquetRecordReader is requested to initBatch (with ON_HEAP memory mode)

OrcColumnarBatchReader is requested to initBatch (with ON_HEAP memory mode)

ColumnVectorUtils is requested to convert an iterator of rows into a single ColumnarBatch (aka toBatch )

Creating OnHeapColumnVector Instance


OnHeapColumnVector takes the following when created:

Number of elements to hold in a vector (aka capacity )

Data type of the elements stored

When created, OnHeapColumnVector reserveInternal (for the given capacity) and reset.
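A hedged sketch of allocateColumns and the put/get round trip (the schema, field names and values below are made up):

import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("id", LongType) ::
  StructField("name", StringType) :: Nil)

// One OnHeapColumnVector per field, each with capacity for 4 rows
val vectors = OnHeapColumnVector.allocateColumns(4, schema)
assert(vectors.length == 2)

vectors(0).putLong(0, 1L)
vectors(1).putByteArray(0, "Jacek".getBytes(java.nio.charset.StandardCharsets.UTF_8))

assert(vectors(0).getLong(0) == 1L)
assert(vectors(1).getUTF8String(0).toString == "Jacek")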


reserveInternal Method

void reserveInternal(int newCapacity)

Note reserveInternal is part of WritableColumnVector Contract to…​FIXME.

reserveInternal …​FIXME

reserveNewColumn Method

OnHeapColumnVector reserveNewColumn(int capacity, DataType type)

Note reserveNewColumn is part of WritableColumnVector Contract to…​FIXME.

reserveNewColumn …​FIXME


OffHeapColumnVector
OffHeapColumnVector is a concrete WritableColumnVector that…​FIXME

Allocating Column Vectors —  allocateColumns Static Method

OffHeapColumnVector[] allocateColumns(int capacity, StructType schema) (1)


OffHeapColumnVector[] allocateColumns(int capacity, StructField[] fields)

1. Simply converts StructType to StructField[] and calls the other allocateColumns

allocateColumns creates an array of OffHeapColumnVector for every field (to hold capacity

number of elements of the data type per field).

Note allocateColumns is used when…​FIXME


Vectorized Parquet Decoding (Reader)


Vectorized Parquet Decoding (aka Vectorized Parquet Reader) allows for reading
datasets in parquet format in batches, i.e. rows are decoded in batches. That aims at
improving memory locality and cache utilization.

Quoting SPARK-12854 Vectorize Parquet reader:

The parquet encodings are largely designed to decode faster in batches, column by
column. This can speed up the decoding considerably.

Vectorized Parquet Decoding is used exclusively when ParquetFileFormat is requested for a data reader when spark.sql.parquet.enableVectorizedReader property is enabled ( true ) and the read schema uses AtomicTypes data types only.

Vectorized Parquet Decoding uses VectorizedParquetRecordReader for vectorized decoding.
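
A minimal sketch (assuming a writable /tmp/demo_parquet location): with the default configuration and an atomic-only schema, reading the dataset back goes through the vectorized decoding path.

// Write a simple parquet dataset with atomic types only
spark.range(1000).write.mode("overwrite").parquet("/tmp/demo_parquet")

// Reading it back uses the vectorized reader
// as long as spark.sql.parquet.enableVectorizedReader stays enabled
val parquetDF = spark.read.parquet("/tmp/demo_parquet")
assert(parquetDF.count == 1000)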

spark.sql.parquet.enableVectorizedReader Configuration Property

spark.sql.parquet.enableVectorizedReader configuration property is on by default.

val isParquetVectorizedReaderEnabled = spark.conf.get("spark.sql.parquet.enableVectorizedReader").toBoolean
assert(isParquetVectorizedReaderEnabled, "spark.sql.parquet.enableVectorizedReader should be enabled by default")


Dynamic Partition Inserts


Partitioning uses partitioning columns to divide a dataset into smaller chunks (based on
the values of certain columns) that will be written into separate directories.

With a partitioned dataset, Spark SQL can load only the parts (partitions) that are really needed (and avoid filtering out unnecessary data on the JVM). That leads to faster load time and more efficient memory consumption, which gives better performance overall.

With a partitioned dataset, Spark SQL can also be executed over different subsets
(directories) in parallel at the same time.

Partitioned table (with single partition p1)

spark.range(10)
.withColumn("p1", 'id % 2)
.write
.mode("overwrite")
.partitionBy("p1")
.saveAsTable("partitioned_table")

Dynamic Partition Inserts is a feature of Spark SQL that allows for executing INSERT
OVERWRITE TABLE SQL statements over partitioned HadoopFsRelations that limits what

partitions are deleted to overwrite the partitioned table (and its partitions) with new data.

Dynamic partitions are the partition columns that have no values defined explicitly in the
PARTITION clause of INSERT OVERWRITE TABLE SQL statements (in the partitionSpec
part).

Static partitions are the partition columns that have values defined explicitly in the
PARTITION clause of INSERT OVERWRITE TABLE SQL statements (in the partitionSpec
part).

// Borrowed from https://medium.com/@anuvrat/writing-into-dynamic-partitions-using-spark-2e2b818a007a
// Note day dynamic partition
INSERT OVERWRITE TABLE stats
PARTITION(country = 'US', year = 2017, month = 3, day)
SELECT ad, SUM(impressions), SUM(clicks), log_day
FROM impression_logs
GROUP BY ad;

Note INSERT OVERWRITE TABLE SQL statement is translated into InsertIntoTable logical operator.


Dynamic Partition Inserts is only supported in SQL mode (for INSERT OVERWRITE TABLE
SQL statements).

Dynamic Partition Inserts is not supported for non-file-based data sources, i.e.
InsertableRelations.

With Dynamic Partition Inserts, the behaviour of the OVERWRITE keyword is controlled by spark.sql.sources.partitionOverwriteMode configuration property (default: static). The property controls whether Spark should delete all the partitions that match the partition specification, regardless of whether there is data to be written or not (static), or delete only the partitions that will have data written into them (dynamic).

When the dynamic overwrite mode is enabled, Spark only deletes the partitions for which it has data to be written. All the other partitions remain intact.
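
A minimal sketch (reusing the partitioned_table created above): with the dynamic mode, an INSERT OVERWRITE TABLE statement only rewrites the partitions that receive new rows.

// Switch to dynamic partition overwrite for the current session
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Only partition p1=1 gets rewritten; partition p1=0 keeps its existing rows
sql("""
  INSERT OVERWRITE TABLE partitioned_table PARTITION (p1)
  SELECT id, 1 AS p1 FROM range(3)
""")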

From the Writing Into Dynamic Partitions Using Spark:

Spark now writes data partitioned just as Hive would — which means only the partitions
that are touched by the INSERT query get overwritten and the others are not touched.


Bucketing
Bucketing is an optimization technique that uses buckets (and bucketing columns) to
determine data partitioning and avoid data shuffle.

The motivation is to optimize performance of a join query by avoiding shuffles (aka exchanges) of tables participating in the join. Bucketing results in fewer exchanges (and so stages).

Note Bucketing can show the biggest benefit when pre-shuffled bucketed tables are used more than once, as bucketing itself takes time (which you will offset by executing multiple join queries later).

Bucketing is enabled by default. Spark SQL uses spark.sql.sources.bucketing.enabled configuration property to control whether bucketing should be enabled and used for query optimization or not.

Bucketing is used exclusively in FileSourceScanExec physical operator (when it is requested for the input RDD and to determine the partitioning and ordering of the output).

Example: SortMergeJoin of two FileScans


import org.apache.spark.sql.SaveMode
spark.range(10e4.toLong).write.mode(SaveMode.Overwrite).saveAsTable("t10e4")
spark.range(10e6.toLong).write.mode(SaveMode.Overwrite).saveAsTable("t10e6")

// Bucketing is enabled by default


// Let's check it out anyway
assert(spark.sessionState.conf.bucketingEnabled, "Bucketing disabled?!")

// Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange
// For that, let's disable auto broadcasting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val tables = spark.catalog.listTables.where($"name" startsWith "t10e")


scala> tables.show
+-----+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+-----+--------+-----------+---------+-----------+
|t10e4| default| null| MANAGED| false|
|t10e6| default| null| MANAGED| false|
+-----+--------+-----------+---------+-----------+

val t4 = spark.table("t10e4")
val t6 = spark.table("t10e6")

assert(t4.count == 10e4)
assert(t6.count == 10e6)

// trigger execution of the join query


t4.join(t6, "id").foreach(_ => ())

The above join query is a fine example of a SortMergeJoinExec (aka SortMergeJoin) of two
FileSourceScanExecs (aka Scan). The join query uses ShuffleExchangeExec physical
operators (aka Exchange) to shuffle the table datasets for the SortMergeJoin.


Figure 1. SortMergeJoin of FileScans (Details for Query)


One way to avoid the exchanges (and so optimize the join query) is to use table bucketing
that is applicable for all file-based data sources, e.g. Parquet, ORC, JSON, CSV, that are
saved as a table using DataFrameWrite.saveAsTable or simply available in a catalog by
SparkSession.table.

Note Bucketing is not supported for DataFrameWriter.save, DataFrameWriter.insertInto and DataFrameWriter.jdbc methods.

You use DataFrameWriter.bucketBy method to specify the number of buckets and the
bucketing columns.

You can optionally sort the output rows in buckets using DataFrameWriter.sortBy method.

people.write
.bucketBy(42, "name")
.sortBy("age")
.saveAsTable("people_bucketed")

Note DataFrameWriter.bucketBy and DataFrameWriter.sortBy simply set respective internal properties that eventually become a bucketing specification.

Unlike bucketing in Apache Hive, Spark SQL creates the bucket files per the number of
buckets and partitions. In other words, the number of bucketing files is the number of
buckets multiplied by the number of task writers (one per partition).

val large = spark.range(10e6.toLong)


import org.apache.spark.sql.SaveMode
large.write
.bucketBy(4, "id")
.sortBy("id")
.mode(SaveMode.Overwrite)
.saveAsTable("bucketed_4_id")

scala> println(large.queryExecution.toRdd.getNumPartitions)
8

// That gives 8 (partitions/task writers) x 4 (buckets) = 32 files


// With _SUCCESS extra file and the ls -l header "total 794624" that gives 34 files
$ ls -tlr spark-warehouse/bucketed_4_id | wc -l
34

With bucketing, the Exchanges are no longer needed (as the tables are already pre-shuffled).


// Create bucketed tables


import org.apache.spark.sql.SaveMode
spark.range(10e4.toLong)
.write
.bucketBy(4, "id")
.sortBy("id")
.mode(SaveMode.Overwrite)
.saveAsTable("bucketed_4_10e4")
spark.range(10e6.toLong)
.write
.bucketBy(4, "id")
.sortBy("id")
.mode(SaveMode.Overwrite)
.saveAsTable("bucketed_4_10e6")

val bucketed_4_10e4 = spark.table("bucketed_4_10e4")


val bucketed_4_10e6 = spark.table("bucketed_4_10e6")

// trigger execution of the join query


bucketed_4_10e4.join(bucketed_4_10e6, "id").foreach(_ => ())

The above join query of the bucketed tables shows no ShuffleExchangeExec physical
operators (aka Exchange) as the shuffling has already been executed (before the query was
run).


Figure 2. SortMergeJoin of Bucketed Tables (Details for Query)


The number of partitions of a bucketed table is exactly the number of buckets.

val bucketed_4_10e4 = spark.table("bucketed_4_10e4")


val numPartitions = bucketed_4_10e4.queryExecution.toRdd.getNumPartitions
assert(numPartitions == 4)

Use SessionCatalog or DESCRIBE EXTENDED SQL command to find the bucketing information.

val bucketed_tables = spark.catalog.listTables.where($"name" startsWith "bucketed_")


scala> bucketed_tables.show
+---------------+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+---------------+--------+-----------+---------+-----------+
|bucketed_4_10e4| default| null| MANAGED| false|
|bucketed_4_10e6| default| null| MANAGED| false|
+---------------+--------+-----------+---------+-----------+

val demoTable = "bucketed_4_10e4"

// DESC EXTENDED or DESC FORMATTED would also work


val describeSQL = sql(s"DESCRIBE EXTENDED $demoTable")
scala> describeSQL.show(numRows = 21, truncate = false)
+----------------------------+--------------------------------------------------------
-------+-------+
|col_name |data_type
|comment|
+----------------------------+--------------------------------------------------------
-------+-------+
|id |bigint
|null |
| |
| |
|# Detailed Table Information|
| |
|Database |default
| |
|Table |bucketed_4_10e4
| |
|Owner |jacek
| |
|Created Time |Tue Oct 02 10:50:50 CEST 2018
| |
|Last Access |Thu Jan 01 01:00:00 CET 1970
| |
|Created By |Spark 2.3.2
| |
|Type |MANAGED
| |
|Provider |parquet
| |
|Num Buckets |4


| |
|Bucket Columns |[`id`]
| |
|Sort Columns |[`id`]
| |
|Table Properties |[transient_lastDdlTime=1538470250]
| |
|Statistics |413954 bytes
| |
|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed
_4_10e4| |
|Serde Library |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
| |
|InputFormat |org.apache.hadoop.mapred.SequenceFileInputFormat
| |
|OutputFormat |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
| |
|Storage Properties |[serialization.format=1]
| |
+----------------------------+--------------------------------------------------------
-------+-------+

import org.apache.spark.sql.catalyst.TableIdentifier
val metadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(demoTable))
scala> metadata.bucketSpec.foreach(println)
4 buckets, bucket columns: [id], sort columns: [id]

The number of buckets has to be between 0 and 100000 exclusive or Spark SQL throws
an AnalysisException :

Number of buckets should be greater than 0 but less than 100000. Got `[numBuckets]`
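
A minimal sketch (with a hypothetical table name) that triggers the validation; the exact point at which the exception surfaces is an internal detail:

import org.apache.spark.sql.AnalysisException

// numBuckets = 0 is outside the allowed range and is rejected
val message = try {
  spark.range(10).write.bucketBy(0, "id").saveAsTable("rejected_buckets")
  None
} catch {
  case e: AnalysisException => Some(e.getMessage)
}
assert(message.exists(_.contains("Number of buckets should be greater than 0")))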

There are however requirements that have to be met before Spark Optimizer gives a no-Exchange query plan:

1. The number of partitions on both sides of a join has to be exactly the same.

2. Both join operators have to use the HashPartitioning partitioning scheme.

It is acceptable to use bucketing for one side of a join.


// Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange
// For this, let's disable auto broadcasting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val bucketedTableName = "bucketed_4_id"


val large = spark.range(10e5.toLong)
import org.apache.spark.sql.SaveMode
large.write
.bucketBy(4, "id")
.sortBy("id")
.mode(SaveMode.Overwrite)
.saveAsTable(bucketedTableName)
val bucketedTable = spark.table(bucketedTableName)

val t1 = spark
.range(4)
.repartition(4, $"id") // Make sure that the number of partitions matches the other
side

val q = t1.join(bucketedTable, "id")


scala> q.explain
== Physical Plan ==
*(4) Project [id#169L]
+- *(4) SortMergeJoin [id#169L], [id#167L], Inner
:- *(2) Sort [id#169L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#169L, 4)
: +- *(1) Range (0, 4, step=1, splits=8)
+- *(3) Sort [id#167L ASC NULLS FIRST], false, 0
+- *(3) Project [id#167L]
+- *(3) Filter isnotnull(id#167L)
+- *(3) FileScan parquet default.bucketed_4_id[id#167L] Batched: true, For
mat: Parquet, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-wareho
use/bucketed_4_id], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema:
struct<id:bigint>

q.foreach(_ => ())


Figure 3. SortMergeJoin of One Bucketed Table (Details for Query)

Bucket Pruning — Optimizing Filtering on Bucketed Column (Reducing Bucket Files to Scan)

As of Spark 2.4, Spark SQL supports bucket pruning to optimize filtering on a bucketed column (by reducing the number of bucket files to scan).

Bucket pruning supports the following predicate expressions:

EqualTo ( = )

EqualNullSafe ( <=> )

In

InSet

And and Or of the above

FileSourceStrategy execution planning strategy is responsible for selecting only LogicalRelations over HadoopFsRelation with the bucketing specification with the following:

1. There is exactly one bucketing column

2. The number of buckets is greater than 1

Example: Bucket Pruning


// Enable INFO logging level of FileSourceStrategy logger to see the details of the strategy
import org.apache.spark.sql.execution.datasources.FileSourceStrategy
val logger = FileSourceStrategy.getClass.getName.replace("$", "")
import org.apache.log4j.{Level, Logger}
Logger.getLogger(logger).setLevel(Level.INFO)

val q57 = q.where($"id" isin (50, 70))


scala> val sparkPlan57 = q57.queryExecution.executedPlan
18/11/17 23:18:04 INFO FileSourceStrategy: Pruning directories with:
18/11/17 23:18:04 INFO FileSourceStrategy: Pruned 2 out of 4 buckets.
18/11/17 23:18:04 INFO FileSourceStrategy: Post-Scan Filters: id#0L IN (50,70)
18/11/17 23:18:04 INFO FileSourceStrategy: Output Data Schema: struct<id: bigint>
18/11/17 23:18:04 INFO FileSourceScanExec: Pushed Filters: In(id, [50,70])
...

scala> println(sparkPlan57.numberedTreeString)
00 *(1) Filter id#0L IN (50,70)
01 +- *(1) FileScan parquet default.bucketed_4_id[id#0L,part#1L] Batched: true, Format
: Parquet, Location: CatalogFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/
bucketed_4_id], PartitionCount: 2, PartitionFilters: [], PushedFilters: [In(id, [50,70
])], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 2 out of 4

import org.apache.spark.sql.execution.FileSourceScanExec
val scan57 = sparkPlan57.collectFirst { case exec: FileSourceScanExec => exec }.get

import org.apache.spark.sql.execution.datasources.FileScanRDD
val rdd57 = scan57.inputRDDs.head.asInstanceOf[FileScanRDD]

import org.apache.spark.sql.execution.datasources.FilePartition
val bucketFiles57 = for {
FilePartition(bucketId, files) <- rdd57.filePartitions
f <- files
} yield s"Bucket $bucketId => $f"

scala> println(bucketFiles57.size)
24

Sorting


// Make sure that you don't end up with a BroadcastHashJoin and a BroadcastExchange
// Disable auto broadcasting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val bucketedTableName = "bucketed_4_id"


val large = spark.range(10e5.toLong)
import org.apache.spark.sql.SaveMode
large.write
.bucketBy(4, "id")
.sortBy("id")
.mode(SaveMode.Overwrite)
.saveAsTable(bucketedTableName)

// Describe the table and include bucketing spec only


val descSQL = sql(s"DESC FORMATTED $bucketedTableName")
.filter($"col_name".contains("Bucket") || $"col_name" === "Sort Columns")
scala> descSQL.show
+--------------+---------+-------+
| col_name|data_type|comment|
+--------------+---------+-------+
| Num Buckets| 4| |
|Bucket Columns| [`id`]| |
| Sort Columns| [`id`]| |
+--------------+---------+-------+

val bucketedTable = spark.table(bucketedTableName)

val t1 = spark.range(4)
.repartition(2, $"id") // Use just 2 partitions
.sortWithinPartitions("id") // sort partitions

val q = t1.join(bucketedTable, "id")


// Note two exchanges and sorts
scala> q.explain
== Physical Plan ==
*(5) Project [id#205L]
+- *(5) SortMergeJoin [id#205L], [id#203L], Inner
:- *(3) Sort [id#205L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#205L, 4)
: +- *(2) Sort [id#205L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#205L, 2)
: +- *(1) Range (0, 4, step=1, splits=8)
+- *(4) Sort [id#203L ASC NULLS FIRST], false, 0
+- *(4) Project [id#203L]
+- *(4) Filter isnotnull(id#203L)
+- *(4) FileScan parquet default.bucketed_4_id[id#203L] Batched: true, For
mat: Parquet, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-wareho
use/bucketed_4_id], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema:
struct<id:bigint>

q.foreach(_ => ())


Warning There are two exchanges and sorts which makes the above use case almost unusable. I filed an issue at SPARK-24025 Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side.


Figure 4. SortMergeJoin of Sorted Dataset and Bucketed Table (Details for Query)

spark.sql.sources.bucketing.enabled Spark SQL Configuration Property

Bucketing is enabled when spark.sql.sources.bucketing.enabled configuration property is turned on ( true ), which it is by default.

Tip Use SQLConf.bucketingEnabled to access the current value of spark.sql.sources.bucketing.enabled property.

// Bucketing is on by default
assert(spark.sessionState.conf.bucketingEnabled, "Bucketing disabled?!")


Whole-Stage Java Code Generation (Whole-Stage CodeGen)

Whole-Stage Java Code Generation (aka Whole-Stage CodeGen) is a physical query optimization in Spark SQL that fuses multiple physical operators (as a subtree of plans that support code generation) together into a single Java function.

Whole-Stage Java Code Generation improves the execution performance of a query by collapsing a query tree into a single optimized function that eliminates virtual function calls and leverages CPU registers for intermediate data.

Whole-Stage Code Generation is controlled by spark.sql.codegen.wholeStage Spark internal property.

Note
Whole-Stage Code Generation is enabled by default.

import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED
scala> spark.conf.get(WHOLESTAGE_CODEGEN_ENABLED)
res0: String = true

Use SQLConf.wholeStageEnabled method to access the current value.

scala> spark.sessionState.conf.wholeStageEnabled
res1: Boolean = true

Note
Whole-Stage Code Generation is used by some modern massively parallel processing (MPP) databases to achieve a better query execution performance.
See Efficiently Compiling Efficient Query Plans for Modern Hardware (PDF).

Note Janino is used to compile a Java source code into a Java class at runtime.

Before a query is executed, CollapseCodegenStages physical preparation rule finds the physical query plans that support codegen and collapses them together as WholeStageCodegen (possibly with InputAdapter in-between for physical operators with no support for Java code generation).
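
As a minimal sketch, the operators collapsed into a single WholeStageCodegen subtree show up in explain output with the same asterisked prefix, e.g. *(1) (as in the physical plans shown in the Bucketing examples).

// Range, Filter and Project support codegen and get fused together
val q = spark.range(10).where('id > 5).select(('id * 2) as "doubled")
q.explain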

Note CollapseCodegenStages is part of the sequence of physical preparation rules QueryExecution.preparations that will be applied in order to the physical plan before execution.

There are the following code generation paths (as coined in this commit):


1. Non-whole-stage-codegen path

2. Whole-stage-codegen "produce" path

3. Whole-stage-codegen "consume" path

Tip Review SPARK-12795 Whole stage codegen to learn about the work to support it.

BenchmarkWholeStageCodegen — Performance Benchmark

BenchmarkWholeStageCodegen class provides a benchmark to measure whole stage codegen performance.

You can execute it using the command:

build/sbt 'sql/testOnly *BenchmarkWholeStageCodegen'

Note You need to un-ignore tests in BenchmarkWholeStageCodegen by replacing ignore with test .

$ build/sbt 'sql/testOnly *BenchmarkWholeStageCodegen'
...
Running benchmark: range/limit/sum
Running case: range/limit/sum codegen=false
22:55:23.028 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Running case: range/limit/sum codegen=true

Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz

range/limit/sum:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------
range/limit/sum codegen=false         376 /  433       1394.5          0.7        1.0X
range/limit/sum codegen=true          332 /  388       1581.3          0.6        1.1X

[info] - range/limit/sum (10 seconds, 74 milliseconds)


CodegenContext
CodegenContext is…​FIXME

CodegenContext takes no input parameters.

import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
val ctx = new CodegenContext

CodegenContext is created when:

WholeStageCodegenExec physical operator is requested to generate a Java source code

for the child operator (when WholeStageCodegenExec is executed)

CodeGenerator is requested for a new CodegenContext

GenerateUnsafeRowJoiner is requested for a UnsafeRowJoiner

CodegenContext stores expressions that don’t support codegen.

Example of CodegenContext.subexpressionElimination (through CodegenContext.generateExpressions)

import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
val ctx = new CodegenContext

// Use Catalyst DSL


import org.apache.spark.sql.catalyst.dsl.expressions._
val expressions = "hello".expr.as("world") :: "hello".expr.as("world") :: Nil

// FIXME Use a real-life query to extract the expressions

// CodegenContext.subexpressionElimination (where the elimination all happens) is a private method
// It is used exclusively in CodegenContext.generateExpressions which is public
// and does the elimination when it is enabled

// Note the doSubexpressionElimination flag is on


// Triggers the subexpressionElimination private method
ctx.generateExpressions(expressions, doSubexpressionElimination = true)

// subexpressionElimination private method uses ctx.equivalentExpressions


val commonExprs = ctx.equivalentExpressions.getAllEquivalentExprs

assert(commonExprs.length > 0, "No common expressions found")

Table 1. CodegenContext’s Internal Properties (e.g. Registries, Counters and Flags)


classFunctions
Mutable Scala Map with function names, their Java source code and a class name. New entries are added when CodegenContext is requested to addClass and addNewFunctionToClass. Used when CodegenContext is requested to declareAddedFunctions.

equivalentExpressions
EquivalentExpressions. Expressions are added and then fetched as equivalent sets when CodegenContext is requested to subexpressionElimination (for generateExpressions with subexpression elimination enabled).

currentVars
The list of generated columns as input of current operator.

INPUT_ROW
The variable name of the input row of the current operator.

placeHolderToComments
Placeholders and their comments. Used when…​FIXME

references
References that are used to generate classes in the following code generators: GenerateMutableProjection, GenerateOrdering, GeneratePredicate, GenerateSafeProjection, GenerateUnsafeProjection, WholeStageCodegenExec. Elements are added when CodegenContext is requested to addReferenceObj and when CodegenFallback is requested to doGenCode.

subExprEliminationExprs
SubExprEliminationStates by Expression. Used when…​FIXME

subexprFunctions
Names of the functions that…​FIXME


Generating Java Source Code For Code-Generated Evaluation of Multiple Expressions (With Optional Subexpression Elimination) —  generateExpressions Method

generateExpressions(
expressions: Seq[Expression],
doSubexpressionElimination: Boolean = false): Seq[ExprCode]

(only with subexpression elimination enabled) generateExpressions does subexpressionElimination of the input expressions .

In the end, generateExpressions requests every expression to generate the Java source code for code-generated (non-interpreted) expression evaluation.

Note

generateExpressions is used when:

GenerateMutableProjection is requested to create a MutableProjection

GenerateUnsafeProjection is requested to create an ExprCode for Catalyst expressions

HashAggregateExec is requested to generate the Java source code for whole-stage consume path with grouping keys

addReferenceObj Method

addReferenceObj(objName: String, obj: Any, className: String = null): String

addReferenceObj …​FIXME

Note addReferenceObj is used when…​FIXME

subexpressionEliminationForWholeStageCodegen
Method

subexpressionEliminationForWholeStageCodegen(expressions: Seq[Expression]): SubExprCodes

subexpressionEliminationForWholeStageCodegen …​FIXME

78
CodegenContext

Note subexpressionEliminationForWholeStageCodegen is used exclusively when HashAggregateExec is requested to generate a Java source code for whole-stage consume path (with grouping keys or not).

Adding Function to Generated Class —  addNewFunction Method

addNewFunction(
funcName: String,
funcCode: String,
inlineToOuterClass: Boolean = false): String

addNewFunction …​FIXME

Note addNewFunction is used when…​FIXME

subexpressionElimination Internal Method

subexpressionElimination(expressions: Seq[Expression]): Unit

subexpressionElimination requests EquivalentExpressions to addExprTree for every

expression (in the input expressions ).

subexpressionElimination requests EquivalentExpressions for the equivalent sets of

expressions with at least two equivalent expressions (aka common expressions).

For every equivalent expression set, subexpressionElimination does the following:

1. Takes the first expression and requests it to generate a Java source code for the
expression tree

2. addNewFunction and adds it to subexprFunctions

3. Creates a SubExprEliminationState and adds it with every common expression in the equivalent expression set to subExprEliminationExprs

Note subexpressionElimination is used exclusively when CodegenContext is requested to generateExpressions (with subexpression elimination enabled).

Adding Mutable State —  addMutableState Method


addMutableState(
javaType: String,
variableName: String,
initFunc: String => String = _ => "",
forceInline: Boolean = false,
useFreshName: Boolean = true): String

addMutableState …​FIXME

val input = ctx.addMutableState("scala.collection.Iterator", "input", v => s"$v = inputs[0];")

Note addMutableState is used when…​FIXME

Adding Immutable State (Unless Exists Already) —  addImmutableStateIfNotExists Method

addImmutableStateIfNotExists(
javaType: String,
variableName: String,
initFunc: String => String = _ => ""): Unit

addImmutableStateIfNotExists …​FIXME

val ctx: CodegenContext = ???


val partitionMaskTerm = "partitionMask"
ctx.addImmutableStateIfNotExists(ctx.JAVA_LONG, partitionMaskTerm)

Note addImmutableStateIfNotExists is used when…​FIXME

freshName Method

freshName(name: String): String

freshName …​FIXME

Note freshName is used when…​FIXME
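
A minimal sketch: freshName gives a unique (suffixed) identifier per CodegenContext so that the generated Java code avoids variable-name clashes; the exact suffixing scheme is an internal detail.

import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext

val ctx = new CodegenContext
val name1 = ctx.freshName("value")
val name2 = ctx.freshName("value")

// Two requests for the same base name give two distinct identifiers
assert(name1 != name2)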

addNewFunctionToClass Internal Method


addNewFunctionToClass(
funcName: String,
funcCode: String,
className: String): mutable.Map[String, mutable.Map[String, String]]

addNewFunctionToClass …​FIXME

Note addNewFunctionToClass is used when…​FIXME

addClass Internal Method

addClass(className: String, classInstance: String): Unit

addClass …​FIXME

Note addClass is used when…​FIXME

declareAddedFunctions Method

declareAddedFunctions(): String

declareAddedFunctions …​FIXME

Note declareAddedFunctions is used when…​FIXME

declareMutableStates Method

declareMutableStates(): String

declareMutableStates …​FIXME

Note declareMutableStates is used when…​FIXME

initMutableStates Method

initMutableStates(): String

initMutableStates …​FIXME


Note initMutableStates is used when…​FIXME

initPartition Method

initPartition(): String

initPartition …​FIXME

Note initPartition is used when…​FIXME

emitExtraCode Method

emitExtraCode(): String

emitExtraCode …​FIXME

Note emitExtraCode is used when…​FIXME

addPartitionInitializationStatement Method

addPartitionInitializationStatement(statement: String): Unit

addPartitionInitializationStatement …​FIXME

Note addPartitionInitializationStatement is used when…​FIXME


CodeGenerator
CodeGenerator is a base class for generators of JVM bytecode for expression evaluation.

Table 1. CodeGenerator’s Internal Properties

cache
Guava’s LoadingCache with at most 100 pairs of CodeAndComment and GeneratedClass .

genericMutableRowType

Tip

Enable INFO or DEBUG logging level for org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator logger to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator=DEBUG

Refer to Logging.

CodeGenerator Contract

package org.apache.spark.sql.catalyst.expressions.codegen

abstract class CodeGenerator[InType, OutType] {


def create(in: InType): OutType
def canonicalize(in: InType): InType
def bind(in: InType, inputSchema: Seq[Attribute]): InType
def generate(expressions: InType, inputSchema: Seq[Attribute]): OutType
def generate(expressions: InType): OutType
}

Table 2. CodeGenerator Contract

generate
Generates an evaluator for expression(s) that may (optionally) have expression(s) bound to a schema (i.e. a collection of Attribute). Used in ExpressionEncoder for UnsafeProjection (for serialization).


Compiling Java Source Code using Janino —  doCompile Internal Method

Caution FIXME

Finding or Compiling Java Source Code —  compile Method
Caution FIXME

create Method

create(references: Seq[Expression]): UnsafeProjection

Caution FIXME

Note

create is used when:

CodeGenerator generates an expression evaluator

GenerateOrdering creates a code gen ordering for SortOrder expressions

Creating CodegenContext —  newCodeGenContext Method

newCodeGenContext(): CodegenContext

newCodeGenContext simply creates a new CodegenContext.

Note

newCodeGenContext is used when:

GenerateMutableProjection is requested to create a MutableProjection

GenerateOrdering is requested to create a BaseOrdering

GeneratePredicate is requested to create a Predicate

GenerateSafeProjection is requested to create a Projection

GenerateUnsafeProjection is requested to create a UnsafeProjection

GenerateColumnAccessor is requested to create a ColumnarIterator


GenerateColumnAccessor
GenerateColumnAccessor is a CodeGenerator for…​FIXME

Creating ColumnarIterator —  create Method

create(columnTypes: Seq[DataType]): ColumnarIterator

Note create is part of CodeGenerator Contract to…​FIXME.

create …​FIXME


GenerateOrdering
GenerateOrdering is…​FIXME

Creating BaseOrdering —  create Method

create(ordering: Seq[SortOrder]): BaseOrdering


create(schema: StructType): BaseOrdering

Note create is part of CodeGenerator Contract to…​FIXME.

create …​FIXME
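
A minimal sketch (in spark-shell) of the StructType variant, which generates an ascending ordering over all fields of the given schema:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering
import org.apache.spark.sql.types._

val schema = new StructType().add("id", IntegerType)
val ordering = GenerateOrdering.create(schema)

// Compare two single-column rows with the generated (code-gen) ordering
assert(ordering.compare(InternalRow(1), InternalRow(2)) < 0)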

genComparisons Method

genComparisons(ctx: CodegenContext, schema: StructType): String

genComparisons …​FIXME

Note genComparisons is used when…​FIXME


GeneratePredicate
GeneratePredicate is…​FIXME

Creating Predicate —  create Method

create(predicate: Expression): Predicate

Note create is part of CodeGenerator Contract to…​FIXME.

create …​FIXME


GenerateSafeProjection
GenerateSafeProjection is…​FIXME

Creating Projection —  create Method

create(expressions: Seq[Expression]): Projection

Note create is part of CodeGenerator Contract to…​FIXME.

create …​FIXME


BytesToBytesMap Append-Only Hash Map


BytesToBytesMap is…​FIXME

Low space overhead,

Good memory locality, esp. for scans.

lookup Method

Location lookup(Object keyBase, long keyOffset, int keyLength)


Location lookup(Object keyBase, long keyOffset, int keyLength, int hash)

Caution FIXME

safeLookup Method

void safeLookup(Object keyBase, long keyOffset, int keyLength, Location loc, int hash)

safeLookup …​FIXME

Note safeLookup is used when BytesToBytesMap does lookup and UnsafeHashedRelation for looking up a single value or values by key.


Vectorized Query Execution (Batch Decoding)


Vectorized Query Execution (aka Vectorized Decoding or Batch Decoding) is…​FIXME


ColumnarBatch — ColumnVectors as Row-Wise Table

ColumnarBatch allows working with multiple ColumnVectors as a row-wise table.

import org.apache.spark.sql.types._
val schema = new StructType()
.add("intCol", IntegerType)
.add("doubleCol", DoubleType)
.add("intCol2", IntegerType)
.add("string", BinaryType)

val capacity = 4 * 1024 // 4k


import org.apache.spark.memory.MemoryMode
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
val columns = schema.fields.map { field =>
new OnHeapColumnVector(capacity, field.dataType)
}

import org.apache.spark.sql.vectorized.ColumnarBatch
val batch = new ColumnarBatch(columns.toArray)

// Add a row [1, 1.1, NULL, "Hello"]
columns(0).putInt(0, 1)
columns(1).putDouble(0, 1.1)
columns(2).putNull(0)
columns(3).putByteArray(0, "Hello".getBytes(java.nio.charset.StandardCharsets.UTF_8))
batch.setNumRows(1)

assert(batch.getRow(0).numFields == 4)

ColumnarBatch is created when:

InMemoryTableScanExec physical operator is requested to createAndDecompressColumn

VectorizedParquetRecordReader is requested to initBatch

OrcColumnarBatchReader is requested to initBatch

ColumnVectorUtils is requested to toBatch

ArrowPythonRunner is requested for an Iterator[ColumnarBatch] (i.e. newReaderIterator )

ArrowConverters is requested for a ArrowRowIterator (i.e. fromPayloadIterator )


ColumnarBatch takes an array of ColumnVectors to be created. ColumnarBatch immediately

initializes the internal MutableColumnarRow.

The number of columns in a ColumnarBatch is the number of ColumnVectors (this batch was
created with).

Note ColumnarBatch is an Evolving contract that is evolving towards becoming a stable API, but is not a stable API yet and can change from one feature release to another release. In other words, using the contract is like treading on thin ice.

Table 1. ColumnarBatch’s Internal Properties (e.g. Registries, Counters and Flags)

numRows
Number of rows

row
MutableColumnarRow over the ColumnVectors

Iterator Over InternalRows (in Batch) —  rowIterator Method

Iterator<InternalRow> rowIterator()

rowIterator …​FIXME

Note

rowIterator is used when:

ArrowConverters is requested to fromBatchIterator

AggregateInPandasExec , WindowInPandasExec , and FlatMapGroupsInPandasExec physical operators are requested to execute ( doExecute )

ArrowEvalPythonExec physical operator is requested to evaluate
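
A minimal sketch that reuses the batch built in the example at the top of this page:

import scala.collection.JavaConverters._

// rowIterator gives a Java iterator of InternalRows backed by the ColumnVectors
val rows = batch.rowIterator.asScala.toSeq
assert(rows.size == 1)
assert(rows.head.getInt(0) == 1)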

Specifying Number of Rows (in Batch) —  setNumRows Method

void setNumRows(int numRows)

In essence, setNumRows resets the batch and makes it available for reuse.

Internally, setNumRows simply sets the numRows to the given numRows .


Note

setNumRows is used when:

OrcColumnarBatchReader is requested to nextBatch

VectorizedParquetRecordReader is requested to nextBatch (when VectorizedParquetRecordReader is requested to nextKeyValue)

ColumnVectorUtils is requested to toBatch (for testing only)

ArrowConverters is requested to fromBatchIterator

InMemoryTableScanExec physical operator is requested to createAndDecompressColumn

ArrowPythonRunner is requested for a ReaderIterator ( newReaderIterator )


Data Source API V2


Data Source API V2 (DataSource API V2 or DataSource V2) is a new API for data sources
in Spark SQL with the following abstractions (contracts):

DataSourceV2 marker interface

ReadSupport

DataSourceReader

WriteSupport

DataSourceWriter

SessionConfigSupport

DataSourceV2StringFormat

InputPartition

Note The work on Data Source API V2 was tracked under SPARK-15689 Data source API v2 that was fixed in Apache Spark 2.3.0.

Note Data Source API V2 is already heavily used in Spark Structured Streaming.

Query Planning and Execution


Data Source API V2 relies on the DataSourceV2Strategy execution planning strategy for
query planning.

Data Reading
Data Source API V2 uses DataSourceV2Relation logical operator to represent data reading
(aka data scan).

DataSourceV2Relation is planned (translated) to a ProjectExec with a

DataSourceV2ScanExec physical operator (possibly under the FilterExec operator) when


DataSourceV2Strategy execution planning strategy is requested to plan a logical plan.

At execution, DataSourceV2ScanExec physical operator creates a DataSourceRDD (or a ContinuousReader for Spark Structured Streaming).

DataSourceRDD uses InputPartitions for partitions, preferred locations, and computing

partitions.


Data Writing
Data Source API V2 uses WriteToDataSourceV2 and AppendData logical operators to
represent data writing (over a DataSourceV2Relation logical operator). As of Spark SQL
2.4.0, WriteToDataSourceV2 operator was deprecated for the more specific AppendData
operator (compare "data writing" to "data append" which is certainly more specific).

Note One of the differences between WriteToDataSourceV2 and AppendData logical operators is that the former ( WriteToDataSourceV2 ) uses DataSourceWriter directly while the latter ( AppendData ) uses DataSourceV2Relation to get the DataSourceWriter from.

WriteToDataSourceV2 and AppendData (with DataSourceV2Relation) logical operators are planned as (translated to) a WriteToDataSourceV2Exec physical operator.

At execution, WriteToDataSourceV2Exec physical operator…​FIXME

Filter Pushdown Performance Optimization


Data Source API V2 supports filter pushdown performance optimization for
DataSourceReaders with SupportsPushDownFilters (that is applied when
DataSourceV2Strategy execution planning strategy is requested to plan a
DataSourceV2Relation logical operator).

(From Parquet Filter Pushdown in Apache Drill’s documentation) Filter pushdown is a performance optimization that prunes extraneous data while reading from a data source to reduce the amount of data to scan and read for queries with supported filter expressions. Pruning data reduces the I/O, CPU, and network overhead to optimize query performance.

Tip Enable INFO logging level for the DataSourceV2Strategy logger to be told what the pushed filters are.

Further Reading and Watching


1. (video) Apache Spark Data Source V2 by Wenchen Fan and Gengliang Wang


Subqueries (Subquery Expressions)


As of Spark 2.0, Spark SQL supports subqueries.

A subquery (aka subquery expression) is a query that is nested inside of another query.

There are the following kinds of subqueries:

1. A subquery as a source (inside a SQL FROM clause)

2. A scalar subquery or a predicate subquery (as a column)

Every subquery can also be correlated or uncorrelated.

A scalar subquery is a structured query that returns a single row and a single column only.
Spark SQL uses ScalarSubquery (SubqueryExpression) expression to represent scalar
subqueries (while parsing a SQL statement).

// FIXME: ScalarSubquery in a logical plan

A ScalarSubquery expression appears as scalar-subquery#[exprId] [conditionString] in a logical plan.

// FIXME: Name of a ScalarSubquery in a logical plan

It is said that scalar subqueries should be used very rarely if at all and you should join
instead.
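
A minimal sketch of an uncorrelated scalar subquery used as a column (per the above, the parsed logical plan should carry a scalar-subquery expression):

val q = sql("SELECT *, (SELECT max(id) FROM range(5)) AS max_id FROM range(2)")
q.explain(extended = true)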

Spark Analyzer uses ResolveSubquery resolution rule to resolve subqueries and at the end
makes sure that they are valid.

Catalyst Optimizer uses the following optimizations for subqueries:

PullupCorrelatedPredicates optimization to rewrite subqueries and pull up correlated predicates

RewriteCorrelatedScalarSubquery optimization to constructLeftJoins

Spark Physical Optimizer uses PlanSubqueries physical optimization to plan queries with
scalar subqueries.

Caution FIXME Describe how a physical ScalarSubquery is executed (cf. updateResult , eval and doGenCode ).


Hint Framework
Structured queries can be optimized using Hint Framework that allows for specifying query
hints.

Query hints allow for annotating a query and give a hint to the query optimizer how to
optimize logical plans. This can be very useful when the query optimizer cannot make
optimal decision, e.g. with respect to join methods due to conservativeness or the lack of
proper statistics.

Spark SQL supports COALESCE, REPARTITION and BROADCAST hints. All remaining unresolved hints are silently removed from a query plan at analysis.

Note Hint Framework was added in Spark SQL 2.2.

Specifying Query Hints


You can specify query hints using Dataset.hint operator or SELECT SQL statements with
hints.

// Dataset API
val q = spark.range(1).hint(name = "myHint", 100, true)
val plan = q.queryExecution.logical
scala> println(plan.numberedTreeString)
00 'UnresolvedHint myHint, [100, true]
01 +- Range (0, 1, step=1, splits=Some(8))

// SQL
val q = sql("SELECT /*+ myHint (100, true) */ 1")
val plan = q.queryExecution.logical
scala> println(plan.numberedTreeString)
00 'UnresolvedHint myHint, [100, true]
01 +- 'Project [unresolvedalias(1, None)]
02 +- OneRowRelation

SELECT SQL Statements With Hints


SELECT SQL statement supports query hints as comments in SQL query that Spark SQL

translates into a UnresolvedHint unary logical operator in a logical plan.

COALESCE and REPARTITION Hints


Spark SQL 2.4 added support for COALESCE and REPARTITION hints (using SQL
comments):

SELECT /*+ COALESCE(5) */ …​

SELECT /*+ REPARTITION(3) */ …​
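
A minimal sketch of both hints (on Spark 2.4 or later):

val qc = sql("SELECT /*+ COALESCE(5) */ * FROM range(100)")
val qr = sql("SELECT /*+ REPARTITION(3) */ * FROM range(100)")

// REPARTITION always gives the requested number of partitions
assert(qr.rdd.getNumPartitions == 3)

// COALESCE can only reduce the number of partitions,
// so the result depends on how many partitions the input had
println(qc.rdd.getNumPartitions)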

Broadcast Hints
Spark SQL 2.2 supports BROADCAST hints using broadcast standard function or SQL
comments:

SELECT /*+ MAPJOIN(b) */ …​

SELECT /*+ BROADCASTJOIN(b) */ …​

SELECT /*+ BROADCAST(b) */ …​

broadcast Standard Function


While hint operator allows for attaching any hint to a logical plan, broadcast standard function attaches the broadcast hint only (which actually makes it a special case of hint operator).

broadcast standard function is used for broadcast joins (aka map-side joins), i.e. to hint the

Spark planner to broadcast a dataset regardless of the size.


val small = spark.range(1)


val large = spark.range(100)

// Let's use broadcast standard function first


val q = large.join(broadcast(small), "id")
val plan = q.queryExecution.logical
scala> println(plan.numberedTreeString)
00 'Join UsingJoin(Inner,List(id))
01 :- Range (0, 100, step=1, splits=Some(8))
02 +- ResolvedHint (broadcast)
03 +- Range (0, 1, step=1, splits=Some(8))

// Please note that broadcast standard function uses ResolvedHint not UnresolvedHint

// Let's "replicate" standard function using hint operator


// Any of the names would work (case-insensitive)
// "BROADCAST", "BROADCASTJOIN", "MAPJOIN"
val smallHinted = small.hint("broadcast")
val plan = smallHinted.queryExecution.logical
scala> println(plan.numberedTreeString)
00 'UnresolvedHint broadcast
01 +- Range (0, 1, step=1, splits=Some(8))

// join is "clever"
// i.e. resolves UnresolvedHint into ResolvedHint immediately
val q = large.join(smallHinted, "id")
val plan = q.queryExecution.logical
scala> println(plan.numberedTreeString)
00 'Join UsingJoin(Inner,List(id))
01 :- Range (0, 100, step=1, splits=Some(8))
02 +- ResolvedHint (broadcast)
03 +- Range (0, 1, step=1, splits=Some(8))

Spark Analyzer
There are the following logical rules that Spark Analyzer uses to analyze logical plans with
the UnresolvedHint logical operator:

1. ResolveBroadcastHints resolves UnresolvedHint operators with BROADCAST , BROADCASTJOIN , MAPJOIN hints to a ResolvedHint

2. ResolveCoalesceHints resolves UnresolvedHint logical operators with COALESCE or REPARTITION hints

3. RemoveAllHints simply removes all UnresolvedHint operators

The order of executing the above rules matters.


// Let's hint the query twice


// The order of hints matters as every hint operator executes Spark analyzer
// That will resolve all but the last hint
val q = spark.range(100).
hint("broadcast").
hint("myHint", 100, true)
val plan = q.queryExecution.logical
scala> println(plan.numberedTreeString)
00 'UnresolvedHint myHint, [100, true]
01 +- ResolvedHint (broadcast)
02 +- Range (0, 100, step=1, splits=Some(8))

// Let's resolve unresolved hints


import org.apache.spark.sql.catalyst.rules.RuleExecutor
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.analysis.ResolveHints
import org.apache.spark.sql.internal.SQLConf
object HintResolver extends RuleExecutor[LogicalPlan] {
lazy val batches =
Batch("Hints", FixedPoint(maxIterations = 100),
new ResolveHints.ResolveBroadcastHints(SQLConf.get),
ResolveHints.RemoveAllHints) :: Nil
}
val resolvedPlan = HintResolver.execute(plan)
scala> println(resolvedPlan.numberedTreeString)
00 ResolvedHint (broadcast)
01 +- Range (0, 100, step=1, splits=Some(8))

Hint Operator in Catalyst DSL


You can use hint operator from Catalyst DSL to create a UnresolvedHint logical operator,
e.g. for testing or Spark SQL internals exploration.

// Create a logical plan to add hint to


import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
val r1 = LocalRelation('a.int, 'b.timestamp, 'c.boolean)
scala> println(r1.numberedTreeString)
00 LocalRelation <empty>, [a#0, b#1, c#2]

// Attach hint to the plan


import org.apache.spark.sql.catalyst.dsl.plans._
val plan = r1.hint(name = "myHint", 100, true)
scala> println(plan.numberedTreeString)
00 'UnresolvedHint myHint, [100, true]
01 +- LocalRelation <empty>, [a#0, b#1, c#2]


Adaptive Query Execution


Adaptive Query Execution (aka Adaptive Query Optimisation or Adaptive
Optimisation) is an optimisation of a query execution plan that Spark Planner uses for
allowing alternative execution plans at runtime that would be optimized better based on
runtime statistics.

Quoting the description of a talk by the authors of Adaptive Query Execution:

At runtime, the adaptive execution mode can change shuffle join to broadcast join if it
finds the size of one table is less than the broadcast threshold. It can also handle
skewed input data for join and change the partition number of the next stage to better fit
the data scale. In general, adaptive execution decreases the effort involved in tuning
SQL query parameters and improves the execution performance by choosing a better
execution plan and parallelism at runtime.

Adaptive Query Execution is disabled by default. Set spark.sql.adaptive.enabled configuration property to true to enable it.

Note Adaptive query execution is not supported for streaming Datasets and is disabled at their execution.

spark.sql.adaptive.enabled Configuration Property


spark.sql.adaptive.enabled configuration property turns adaptive query execution on.

Tip Use adaptiveExecutionEnabled method to access the current value.
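
A minimal sketch:

// Enable adaptive query execution for the current session
spark.conf.set("spark.sql.adaptive.enabled", true)
assert(spark.sessionState.conf.adaptiveExecutionEnabled)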

EnsureRequirements
EnsureRequirements is…​FIXME

Further Reading and Watching


1. (video) An Adaptive Execution Engine For Apache Spark SQL — Carson Wang

2. An adaptive execution mode for Spark SQL by Carson Wang (Intel), Yucai Yu (Intel) at
Strata Data Conference in Singapore, December 7, 2017


ExchangeCoordinator
ExchangeCoordinator is created when EnsureRequirements physical query optimization is

requested to add an ExchangeCoordinator for Adaptive Query Execution.

ExchangeCoordinator takes the following to be created:

Number of ShuffleExchangeExec unary physical operators

Recommended size of the input data of a post-shuffle partition (configured by spark.sql.adaptive.shuffle.targetPostShuffleInputSize property)

Optional advisory minimum number of post-shuffle partitions (default: None ) (configured by spark.sql.adaptive.minNumPostShufflePartitions property)

ExchangeCoordinator keeps track of ShuffleExchangeExec unary physical operators that were registered (when ShuffleExchangeExec unary physical operator was requested to prepare itself for execution).

ExchangeCoordinator uses the following text representation (i.e. toString ):

coordinator[target post-shuffle partition size: [advisoryTargetPostShuffleInputSize]]

postShuffleRDD Method

postShuffleRDD(exchange: ShuffleExchangeExec): ShuffledRowRDD

postShuffleRDD …​FIXME

Note postShuffleRDD is used exclusively when ShuffleExchangeExec unary physical operator is requested to execute.

doEstimationIfNecessary Internal Method

doEstimationIfNecessary(): Unit

doEstimationIfNecessary …​FIXME

Note doEstimationIfNecessary is used exclusively when ExchangeCoordinator is requested for a post-shuffle RDD (ShuffledRowRDD).


estimatePartitionStartIndices Method

estimatePartitionStartIndices(
mapOutputStatistics: Array[MapOutputStatistics]): Array[Int]

estimatePartitionStartIndices …​FIXME

Note estimatePartitionStartIndices is used exclusively when ExchangeCoordinator is requested to doEstimationIfNecessary.

registerExchange Method

registerExchange(exchange: ShuffleExchangeExec): Unit

registerExchange simply adds the ShuffleExchangeExec unary physical operator to the

exchanges internal registry.

Note registerExchange is used exclusively when ShuffleExchangeExec unary physical operator is requested to prepare itself for execution.


Subexpression Elimination In Code-Generated Expression Evaluation (Common Expression Reuse)
Subexpression Elimination (aka Common Expression Reuse) is an optimisation of a
logical query plan that eliminates expressions in code-generated (non-interpreted)
expression evaluation.

Subexpression Elimination is enabled by default. Use the internal spark.sql.subexpressionElimination.enabled configuration property to control whether the feature is enabled ( true ) or not ( false ).

Subexpression Elimination is used (by means of subexpressionEliminationEnabled flag of SparkPlan ) when the following physical operators are requested to execute (i.e. moving away from queries to an RDD of internal rows to describe a distributed computation):

ProjectExec

HashAggregateExec (and for finishAggregate)

ObjectHashAggregateExec

SortAggregateExec

WindowExec (and creates a lookup table for WindowExpressions and factory functions
for WindowFunctionFrame)

Internally, subexpression elimination happens when CodegenContext is requested for subexpressionElimination (when CodegenContext is requested to generateExpressions with subexpression elimination enabled).

spark.sql.subexpressionElimination.enabled Configuration Property
spark.sql.subexpressionElimination.enabled internal configuration property controls whether
the subexpression elimination optimization is enabled or not.

Tip Use subexpressionEliminationEnabled method to access the current value.


scala> import spark.sessionState.conf


import spark.sessionState.conf

scala> conf.subexpressionEliminationEnabled
res1: Boolean = true


EquivalentExpressions
EquivalentExpressions is…​FIXME

Table 1. EquivalentExpressions’s Internal Properties (e.g. Registries, Counters and Flags)

equivalenceMap
Equivalent sets of expressions, i.e. semantically equal expressions by their Expr "representative". Used when…​FIXME

addExprTree Method

addExprTree(expr: Expression): Unit

addExprTree …​FIXME

Note addExprTree is used when CodegenContext is requested to subexpressionElimination or subexpressionEliminationForWholeStageCodegen.

addExpr Method

addExpr(expr: Expression): Boolean

addExpr …​FIXME

Note

addExpr is used when:

EquivalentExpressions is requested to addExprTree

PhysicalAggregation is requested to destructure an Aggregate logical operator

Getting Equivalent Sets Of Expressions —  getAllEquivalentExprs Method

getAllEquivalentExprs: Seq[Seq[Expression]]


getAllEquivalentExprs takes the values of all the equivalent sets of expressions.

Note getAllEquivalentExprs is used when CodegenContext is requested to subexpressionElimination or subexpressionEliminationForWholeStageCodegen.
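
A minimal sketch (using Catalyst DSL) of how two semantically-equal expression trees end up in one equivalent set:

import org.apache.spark.sql.catalyst.expressions.EquivalentExpressions
import org.apache.spark.sql.catalyst.dsl.expressions._

val a = 'a.int
val b = 'b.int

// Two separate but semantically-equal Add expression trees
val equiv = new EquivalentExpressions
equiv.addExprTree(a + b)
equiv.addExprTree(a + b)

// The equivalent set for a + b holds both (common) expressions
val commonSets = equiv.getAllEquivalentExprs.filter(_.size > 1)
assert(commonSets.nonEmpty)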


Cost-Based Optimization (CBO) of Logical Query Plan
Cost-Based Optimization (aka Cost-Based Query Optimization or CBO Optimizer) is an
optimization technique in Spark SQL that uses table statistics to determine the most efficient
query execution plan of a structured query (given the logical query plan).

Cost-based optimization is disabled by default. Spark SQL uses spark.sql.cbo.enabled configuration property to control whether the CBO should be enabled and used for query optimization or not.

Cost-Based Optimization uses logical optimization rules (e.g. CostBasedJoinReorder) to optimize the logical plan of a structured query based on statistics.

You first use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute table
statistics. Use DESCRIBE EXTENDED SQL command to inspect the statistics.

Logical operators have statistics support that is used for query planning.

There is also support for equi-height column histograms.

Table Statistics
The table statistics can be computed for tables, partitions and columns and are as follows:

1. Total size (in bytes) of a table or table partitions

2. Row count of a table or table partitions

3. Column statistics, i.e. min, max, num_nulls, distinct_count, avg_col_len, max_col_len, histogram

spark.sql.cbo.enabled Spark SQL Configuration Property


Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is
turned on, i.e. true .

Note spark.sql.cbo.enabled configuration property is turned off, i.e. false , by default.

Tip Use SQLConf.cboEnabled to access the current value of spark.sql.cbo.enabled property.


// CBO is disabled by default


val sqlConf = spark.sessionState.conf
scala> println(sqlConf.cboEnabled)
false

// Create a new SparkSession with CBO enabled


// You could spark-submit -c spark.sql.cbo.enabled=true
val sparkCboEnabled = spark.newSession
import org.apache.spark.sql.internal.SQLConf.CBO_ENABLED
sparkCboEnabled.conf.set(CBO_ENABLED.key, true)
val isCboEnabled = sparkCboEnabled.conf.get(CBO_ENABLED.key)
println(s"Is CBO enabled? $isCboEnabled")

Note CBO is disabled explicitly in Spark Structured Streaming.

ANALYZE TABLE COMPUTE STATISTICS SQL Command


Cost-Based Optimization uses the statistics stored in a metastore (aka external catalog)
using ANALYZE TABLE SQL command.

ANALYZE TABLE tableIdentifier partitionSpec?
COMPUTE STATISTICS (NOSCAN | FOR COLUMNS identifierSeq)?

Depending on the variant, ANALYZE TABLE computes different statistics, i.e. of a table,
partitions or columns.

1. ANALYZE TABLE with neither PARTITION specification nor FOR COLUMNS clause

2. ANALYZE TABLE with PARTITION specification (but no FOR COLUMNS clause)

3. ANALYZE TABLE with FOR COLUMNS clause (but no PARTITION specification)
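
A minimal sketch of the three variants against the partitioned t1 table used in the DESCRIBE EXTENDED examples below (partitioned by p1 and p2):

// 1. Table-level statistics (total size in bytes and row count)
sql("ANALYZE TABLE t1 COMPUTE STATISTICS")

// 2. Partition-level statistics
sql("ANALYZE TABLE t1 PARTITION (p1, p2) COMPUTE STATISTICS")

// 3. Column-level statistics
sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, p1")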

Tip Use spark.sql.statistics.histogram.enabled configuration property to enable column (equi-height) histograms that can provide better estimation accuracy but cause an extra table scan. spark.sql.statistics.histogram.enabled is off by default.


Note

ANALYZE TABLE with PARTITION specification and FOR COLUMNS clause is incorrect.

// !!! INCORRECT !!!
ANALYZE TABLE t1 PARTITION (p1, p2) COMPUTE STATISTICS FOR COLUMNS id, p1

In such a case, SparkSqlAstBuilder reports a WARN message to the logs and simply ignores the partition specification.

WARN Partition specification is ignored when collecting column statistics: [partitionSpec]

When executed, the above ANALYZE TABLE variants are translated to the following logical
commands (in a logical query plan), respectively:

1. AnalyzeTableCommand

2. AnalyzePartitionCommand

3. AnalyzeColumnCommand

DESCRIBE EXTENDED SQL Command


You can view the statistics of a table, partitions or a column (stored in a metastore) using
DESCRIBE EXTENDED SQL command.

(DESC | DESCRIBE) TABLE? (EXTENDED | FORMATTED)?
tableIdentifier partitionSpec? describeColName?

Table-level statistics are in Statistics row while partition-level statistics are in Partition
Statistics row.

Tip Use DESC EXTENDED tableName for table-level statistics and DESC EXTENDED
tableName PARTITION (p1, p2, …​) for partition-level statistics only.

// table-level statistics are in Statistics row


scala> sql("DESC EXTENDED t1").show(numRows = 30, truncate = false)
+----------------------------+--------------------------------------------------------
------+-------+
|col_name |data_type
|comment|
+----------------------------+--------------------------------------------------------
------+-------+
|id |int
|null |
|p1 |int
|null |


|p2 |string
|null |
|# Partition Information |
| |
|# col_name |data_type
|comment|
|p1 |int
|null |
|p2 |string
|null |
| |
| |
|# Detailed Table Information|
| |
|Database |default
| |
|Table |t1
| |
|Owner |jacek
| |
|Created Time |Wed Dec 27 14:10:44 CET 2017
| |
|Last Access |Thu Jan 01 01:00:00 CET 1970
| |
|Created By |Spark 2.3.0
| |
|Type |MANAGED
| |
|Provider |parquet
| |
|Table Properties |[transient_lastDdlTime=1514453141]
| |
|Statistics |714 bytes, 2 rows
| |
|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/t1
| |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSe
rDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputF
ormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutput
Format| |
|Storage Properties |[serialization.format=1]
| |
|Partition Provider |Catalog
| |
+----------------------------+--------------------------------------------------------
------+-------+

scala> spark.table("t1").show
+---+---+----+
| id| p1| p2|
+---+---+----+


| 0| 0|zero|
| 1| 1| one|
+---+---+----+

// partition-level statistics are in Partition Statistics row


scala> sql("DESC EXTENDED t1 PARTITION (p1=0, p2='zero')").show(numRows = 30, truncate
= false)
+--------------------------------+----------------------------------------------------
-----------------------------+-------+
|col_name |data_type
|comment|
+--------------------------------+----------------------------------------------------
-----------------------------+-------+
|id |int
|null |
|p1 |int
|null |
|p2 |string
|null |
|# Partition Information |
| |
|# col_name |data_type
|comment|
|p1 |int
|null |
|p2 |string
|null |
| |
| |
|# Detailed Partition Information|
| |
|Database |default
| |
|Table |t1
| |
|Partition Values |[p1=0, p2=zero]
| |
|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/t1/p
1=0/p2=zero | |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHi
veSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetIn
putFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOu
tputFormat | |
|Storage Properties |[path=file:/Users/jacek/dev/oss/spark/spark-warehous
e/t1, serialization.format=1]| |
|Partition Parameters |{numFiles=1, transient_lastDdlTime=1514469540, total
Size=357} | |
|Partition Statistics |357 bytes, 1 rows
| |
| |
| |


|# Storage Information |
| |
|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/t1
| |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHi
veSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetIn
putFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOu
tputFormat | |
|Storage Properties |[serialization.format=1]
| |
+--------------------------------+----------------------------------------------------
-----------------------------+-------+

You can view the statistics of a single column using DESC EXTENDED tableName columnName,
which gives a Dataset with two columns, i.e. info_name and info_value .


scala> sql("DESC EXTENDED t1 id").show


+--------------+----------+
|info_name |info_value|
+--------------+----------+
|col_name |id |
|data_type |int |
|comment |NULL |
|min |0 |
|max |1 |
|num_nulls |0 |
|distinct_count|2 |
|avg_col_len |4 |
|max_col_len |4 |
|histogram |NULL |
+--------------+----------+

scala> sql("DESC EXTENDED t1 p1").show


+--------------+----------+
|info_name |info_value|
+--------------+----------+
|col_name |p1 |
|data_type |int |
|comment |NULL |
|min |0 |
|max |1 |
|num_nulls |0 |
|distinct_count|2 |
|avg_col_len |4 |
|max_col_len |4 |
|histogram |NULL |
+--------------+----------+

scala> sql("DESC EXTENDED t1 p2").show


+--------------+----------+
|info_name |info_value|
+--------------+----------+
|col_name |p2 |
|data_type |string |
|comment |NULL |
|min |NULL |
|max |NULL |
|num_nulls |0 |
|distinct_count|2 |
|avg_col_len |4 |
|max_col_len |4 |
|histogram |NULL |
+--------------+----------+

Cost-Based Optimizations


The Spark Optimizer uses heuristics (rules) that are applied to a logical query plan for cost-
based optimization.

Among the optimization rules are the following:

1. CostBasedJoinReorder logical optimization rule for join reordering with 2 or more
consecutive inner or cross joins (possibly separated by Project operators) when
spark.sql.cbo.enabled and spark.sql.cbo.joinReorder.enabled configuration properties
are both enabled.
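
The following is a minimal sketch of enabling both properties on the current session (you could also pass them with --conf at spark-submit time):

// Enable cost-based optimization and CBO-based join reordering
spark.conf.set("spark.sql.cbo.enabled", true)
spark.conf.set("spark.sql.cbo.joinReorder.enabled", true)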

Logical Commands for Altering Table Statistics


The following are the logical commands that alter table statistics in a metastore (aka external
catalog):

1. AnalyzeTableCommand

2. AnalyzeColumnCommand

3. AlterTableAddPartitionCommand

4. AlterTableDropPartitionCommand

5. AlterTableSetLocationCommand

6. TruncateTableCommand

7. InsertIntoHiveTable

8. InsertIntoHadoopFsRelationCommand

9. LoadDataCommand

EXPLAIN COST SQL Command

Caution FIXME See LogicalPlanStats
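
Until the section is complete, the following is a minimal sketch of the command; the output is the optimized logical plan annotated with the statistics of every logical operator (row counts appear once the table has been analyzed and CBO is enabled):

// EXPLAIN COST prints the optimized logical plan with per-operator statistics
println(spark.sql("EXPLAIN COST SELECT * FROM t1").head.getString(0))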

LogicalPlanStats — Statistics Estimates of Logical Operator


LogicalPlanStats adds statistics support to logical operators and is used for query planning
(with or without cost-based optimization, e.g. CostBasedJoinReorder or JoinSelection,
respectively).
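
A minimal sketch of requesting the statistics of the optimized logical plan of a query directly (assuming the t1 table has been analyzed as described above):

// Statistics estimates of the optimized logical plan
val q = spark.table("t1")
val stats = q.queryExecution.optimizedPlan.stats
println(stats.simpleString)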

Equi-Height Histograms for Columns


From SPARK-17074 generate equi-height histogram for column:


Equi-height histogram is effective in handling skewed data distribution.

For equi-height histogram, the heights of all bins(intervals) are the same. The default
number of bins we use is 254.

Now we use a two-step method to generate an equi-height histogram: 1. use
percentile_approx to get percentiles (end points of the equi-height bin intervals); 2. use
a new aggregate function to get distinct counts in each of these bins.

Note that this method takes two table scans. In the future we may provide other
algorithms which need only one table scan.

From [SPARK-17074] [SQL] Generate equi-height histogram in column statistics #19479:

Equi-height histogram is effective in cardinality estimation, and more accurate than


basic column stats (min, max, ndv, etc) especially in skew distribution.

For equi-height histogram, all buckets (intervals) have the same height (frequency).

we use a two-step method to generate an equi-height histogram:

1. use ApproximatePercentile to get percentiles p(0), p(1/n), p(2/n) …​ p((n-1)/n), p(1);

2. construct range values of buckets, e.g. [p(0), p(1/n)], [p(1/n), p(2/n)] …​ [p((n-1)/n),
p(1)], and use ApproxCountDistinctForIntervals to count ndv in each bucket. Each
bucket is of the form: (lowerBound, higherBound, ndv).

Spark SQL uses column statistics that may optionally hold the histogram of values (which is
empty by default). With spark.sql.statistics.histogram.enabled configuration property turned
on ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS SQL command generates
column (equi-height) histograms.

Note spark.sql.statistics.histogram.enabled is off by default.


// Computing column statistics with histogram


// ./bin/spark-shell --conf spark.sql.statistics.histogram.enabled=true
scala> spark.sessionState.conf.histogramEnabled
res1: Boolean = true

val tableName = "t1"

// Make the example reproducible


import org.apache.spark.sql.catalyst.TableIdentifier
val tid = TableIdentifier(tableName)
val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tid, ignoreIfNotExists = true, purge = true)

// CREATE TABLE t1
Seq((0, 0, "zero"), (1, 1, "one")).
toDF("id", "p1", "p2").
write.
saveAsTable(tableName)

// As we drop and create immediately we may face problems with unavailable partition f
iles
// Invalidate cache
spark.sql(s"REFRESH TABLE $tableName")

// Use ANALYZE TABLE...FOR COLUMNS to compute column statistics


// that saves them in a metastore (aka an external catalog)
val df = spark.table(tableName)
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $allCols"
spark.sql(analyzeTableSQL)

// Column statistics with histogram should be in the external catalog (metastore)

You can inspect the column statistics using DESCRIBE EXTENDED SQL command.


// Inspecting column statistics with column histogram


// See the above example for how to compute the stats
val colName = "id"
val descExtSQL = s"DESC EXTENDED $tableName $colName"

// 254 bins by default --> num_of_bins in histogram row below


scala> sql(descExtSQL).show(truncate = false)
+--------------+-----------------------------------------------------+
|info_name |info_value |
+--------------+-----------------------------------------------------+
|col_name |id |
|data_type |int |
|comment |NULL |
|min |0 |
|max |1 |
|num_nulls |0 |
|distinct_count|2 |
|avg_col_len |4 |
|max_col_len |4 |
|histogram |height: 0.007874015748031496, num_of_bins: 254 |
|bin_0 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_1 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_2 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_3 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_4 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_5 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_6 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_7 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_8 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
|bin_9 |lower_bound: 0.0, upper_bound: 0.0, distinct_count: 1|
+--------------+-----------------------------------------------------+
only showing top 20 rows


CatalogStatistics — Table Statistics From External Catalog (Metastore)

CatalogStatistics are table statistics that are stored in an external catalog (aka
metastore):

Physical total size (in bytes)

Estimated number of rows (aka row count)

Column statistics (i.e. column names and their statistics)

Note CatalogStatistics is a "subset" of the statistics in Statistics (as there are no
concepts of attributes and broadcast hint in metastore). CatalogStatistics are often
stored in a Hive metastore and are referred to as Hive statistics while Statistics are
the Spark statistics.

CatalogStatistics can be converted to Spark statistics using toPlanStats method.

CatalogStatistics is created when:

AnalyzeColumnCommand, AlterTableAddPartitionCommand and TruncateTableCommand
commands are executed (and store statistics in ExternalCatalog)

CommandUtils is requested for updating existing table statistics, the current statistics (if
changed)

HiveExternalCatalog is requested for restoring Spark statistics from properties (from a
Hive Metastore)

DetermineTableStats and PruneFileSourcePartitions logical optimizations are executed
(i.e. applied to a logical plan)

HiveClientImpl is requested for a table or partition statistics from Hive’s parameters


scala> :type spark.sessionState.catalog


org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Using higher-level interface to access CatalogStatistics


// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats

scala> :type stats


Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

// Using low-level internal SessionCatalog interface to access CatalogTables


val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats

scala> :type stats


Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

CatalogStatistics has a text representation.

scala> :type stats


Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

scala> stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows

Converting Metastore Statistics to Spark Statistics —  toPlanStats Method

toPlanStats(planOutput: Seq[Attribute], cboEnabled: Boolean): Statistics

toPlanStats converts the table statistics (from an external metastore) to Spark statistics.

With cost-based optimization enabled and row count statistics available, toPlanStats
creates a Statistics with the estimated total (output) size, row count and column statistics.

Note Cost-based optimization is enabled when spark.sql.cbo.enabled configuration
property is turned on, i.e. true , and is disabled by default.

Otherwise, when cost-based optimization is disabled, toPlanStats creates a Statistics with
just the mandatory sizeInBytes.

Caution FIXME Why does toPlanStats compute sizeInBytes differently per CBO?


toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.

Note FIXME Example

Note toPlanStats is used when HiveTableRelation and LogicalRelation are requested
for statistics.
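
Until the example is complete, the following is a minimal sketch that builds on the stats value from the example above (the planStats name is only for illustration):

// Convert the metastore statistics (CatalogStatistics) to Spark statistics (Statistics)
val output = spark.table("t1").queryExecution.analyzed.output
val cboEnabled = spark.sessionState.conf.cboEnabled
val planStats = stats.map(_.toPlanStats(output, cboEnabled))
planStats.foreach(s => println(s.simpleString))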


ColumnStat — Column Statistics

ColumnStat holds the statistics of a table column (as part of the table statistics in a
metastore).

Table 1. Column Statistics


Name Description
distinctCount Number of distinct values

min Minimum value

max Maximum value

nullCount Number of null values

avgLen Average length of the values

maxLen Maximum length of the values

histogram Histogram of values (as Histogram which is empty by default)

ColumnStat is computed (and created from the result row) using ANALYZE TABLE

COMPUTE STATISTICS FOR COLUMNS SQL command (that SparkSqlAstBuilder


translates to AnalyzeColumnCommand logical command).

val cols = "id, p1, p2"


val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $cols"
spark.sql(analyzeTableSQL)

ColumnStat may optionally hold the histogram of values which is empty by default. With

spark.sql.statistics.histogram.enabled configuration property turned on ANALYZE TABLE


COMPUTE STATISTICS FOR COLUMNS SQL command generates column (equi-height) histograms.

Note spark.sql.statistics.histogram.enabled is off by default.

You can inspect the column statistics using DESCRIBE EXTENDED SQL command.


scala> sql("DESC EXTENDED t1 id").show


+--------------+----------+
|info_name |info_value|
+--------------+----------+
|col_name |id |
|data_type |int |
|comment |NULL |
|min |0 |
|max |1 |
|num_nulls |0 |
|distinct_count|2 |
|avg_col_len |4 |
|max_col_len |4 |
|histogram |NULL | <-- no histogram (spark.sql.statistics.histogram.enabled off)
+--------------+----------+

ColumnStat is part of the statistics of a table.

// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats.get

scala> :type stats


org.apache.spark.sql.catalyst.catalog.CatalogStatistics

val colStats = stats.colStats


scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

ColumnStat is converted to properties (serialized) while persisting the table (statistics) to a

metastore.


scala> :type colStats


Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

val colName = "p1"

val p1stats = colStats(colName)


scala> :type p1stats
org.apache.spark.sql.catalyst.plans.logical.ColumnStat

import org.apache.spark.sql.types.DoubleType
val props = p1stats.toMap(colName, dataType = DoubleType)
scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8
, nullCount -> 0)

ColumnStat is re-created from properties (deserialized) when HiveExternalCatalog is

requested for restoring table statistics from properties (from a Hive Metastore).

scala> :type props


Map[String,String]

scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8
, nullCount -> 0)

import org.apache.spark.sql.types.StructField
val p1 = $"p1".double

import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
val colStatsOpt = ColumnStat.fromMap(table = "t1", field = p1, map = props)

scala> :type colStatsOpt


Option[org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

ColumnStat is also created when JoinEstimation is requested to estimateInnerOuterJoin

for Inner , Cross , LeftOuter , RightOuter and FullOuter joins.


val tableName = "t1"

// Make the example reproducible


import org.apache.spark.sql.catalyst.TableIdentifier
val tid = TableIdentifier(tableName)
val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tid, ignoreIfNotExists = true, purge = true)

// CREATE TABLE t1
Seq((0, 0, "zero"), (1, 1, "one")).
toDF("id", "p1", "p2").
write.
saveAsTable(tableName)

// As we drop and create immediately we may face problems with unavailable partition f
iles
// Invalidate cache
spark.sql(s"REFRESH TABLE $tableName")

// Use ANALYZE TABLE...FOR COLUMNS to compute column statistics


// that saves them in a metastore (aka an external catalog)
val df = spark.table(tableName)
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $allCols"
spark.sql(analyzeTableSQL)

// Fetch the table metadata (with column statistics) from a metastore


val metastore = spark.sharedState.externalCatalog
val db = spark.catalog.currentDatabase
val tableMeta = metastore.getTable(db, table = tableName)

// The column statistics are part of the table statistics


val colStats = tableMeta.stats.get.colStats

scala> :type colStats


Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

scala> colStats.map { case (name, cs) => s"$name: $cs" }.foreach(println)


// the output may vary
id: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p1: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p2: ColumnStat(2,None,None,0,4,4,None)

Note ColumnStat does not support minimum and maximum metrics for binary (i.e.
Array[Byte] ) and string types.

Converting Value to External/Java Representation (per Catalyst Data Type) —  toExternalString Internal Method


toExternalString(v: Any, colName: String, dataType: DataType): String

toExternalString …​FIXME

Note toExternalString is used exclusively when ColumnStat is requested for statistic
properties.

supportsHistogram Method

supportsHistogram(dataType: DataType): Boolean

supportsHistogram …​FIXME

Note supportsHistogram is used when…​FIXME

Converting ColumnStat to Properties (ColumnStat Serialization) —  toMap Method

toMap(colName: String, dataType: DataType): Map[String, String]

toMap converts ColumnStat to the properties.

Table 2. ColumnStat.toMap’s Properties


Key Value
version 1

distinctCount distinctCount

nullCount nullCount

avgLen avgLen

maxLen maxLen

min External/Java representation of min

max External/Java representation of max

histogram Serialized version of Histogram (using HistogramSerializer.serialize )


Note toMap adds min , max , histogram entries only if they are available.

Note Interestingly, colName and dataType input parameters bring no value to toMap
itself, but merely allow for a more user-friendly error reporting when converting min
and max column statistics.

Note toMap is used exclusively when HiveExternalCatalog is requested for converting
table statistics to properties (before persisting them as part of table metadata in a
Hive metastore).

Re-Creating Column Statistics from Properties (ColumnStat Deserialization) —  fromMap Method

fromMap(table: String, field: StructField, map: Map[String, String]): Option[ColumnStat]

fromMap creates a ColumnStat by fetching properties of every column statistic from the

input map .

fromMap returns None when recovering column statistics fails for whatever reason.

WARN Failed to parse column statistics for column [fieldName] in table [table]

Note Interestingly, table input parameter brings no value to fromMap itself, but merely
allows for a more user-friendly error reporting when parsing column statistics fails.

Note fromMap is used exclusively when HiveExternalCatalog is requested for restoring
table statistics from properties (from a Hive Metastore).

Creating Column Statistics from InternalRow (Result of Computing Column Statistics) —  rowToColumnStat Method

rowToColumnStat(
row: InternalRow,
attr: Attribute,
rowCount: Long,
percentiles: Option[ArrayData]): ColumnStat

rowToColumnStat creates a ColumnStat from the input row and the following positions:

130
ColumnStat — Column Statistics

0. distinctCount

1. min

2. max

3. nullCount

4. avgLen

5. maxLen

If the 6th field is not empty, rowToColumnStat uses it to create a histogram.

Note rowToColumnStat is used exclusively when AnalyzeColumnCommand is executed
(to compute the statistics for specified columns).

statExprs Method

statExprs(
col: Attribute,
conf: SQLConf,
colPercentiles: AttributeMap[ArrayData]): CreateNamedStruct

statExprs …​FIXME

Note statExprs is used when…​FIXME


EstimationUtils
EstimationUtils is…​FIXME

getOutputSize Method

getOutputSize(
attributes: Seq[Attribute],
outputRowCount: BigInt,
attrStats: AttributeMap[ColumnStat] = AttributeMap(Nil)): BigInt

getOutputSize …​FIXME

Note getOutputSize is used when…​FIXME

nullColumnStat Method

nullColumnStat(dataType: DataType, rowCount: BigInt): ColumnStat

nullColumnStat …​FIXME

Note nullColumnStat is used exclusively when JoinEstimation is requested to
estimateInnerOuterJoin for LeftOuter and RightOuter joins.

Checking Availability of Row Count Statistic —  rowCountsExist Method

rowCountsExist(plans: LogicalPlan*): Boolean

rowCountsExist is positive (i.e. true ) when every logical plan (in the input plans ) has

estimated number of rows (aka row count) statistic computed.

Otherwise, rowCountsExist is negative (i.e. false ).

Note rowCountsExist uses LogicalPlanStats to access the estimated statistics and
query hints of a logical plan.


Note rowCountsExist is used when:

AggregateEstimation is requested to estimate statistics and query hints of an
Aggregate logical operator

JoinEstimation is requested to estimate statistics and query hints of a Join
logical operator (regardless of the join type)

ProjectEstimation is requested to estimate statistics and query hints of a
Project logical operator


CommandUtils — Utilities for Table Statistics


CommandUtils is a helper class that logical commands, e.g. InsertInto* ,

AlterTable*Command , LoadDataCommand , and CBO’s Analyze* , use to manage table

statistics.

CommandUtils defines the following utilities:

Calculating Total Size of Table or Its Partitions

Calculating Total File Size Under Path

Creating CatalogStatistics with Current Statistics

Updating Existing Table Statistics

Tip Enable INFO logging level for
org.apache.spark.sql.execution.command.CommandUtils logger to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.command.CommandUtils=INFO

Refer to Logging.

Updating Existing Table Statistics —  updateTableStats Method

updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit

updateTableStats updates the table statistics of the input CatalogTable (only if the statistics

are available in the metastore already).

updateTableStats requests SessionCatalog to alterTableStats with the current total size
(when spark.sql.statistics.size.autoUpdate.enabled property is turned on) or empty statistics
(that effectively removes the recorded statistics completely).

Important updateTableStats uses spark.sql.statistics.size.autoUpdate.enabled property
to auto-update table statistics and can be expensive (and slow down data change
commands) if the total number of files of a table is very large.
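
A minimal sketch of turning the property on for the current session (it is off by default):

// Auto-update the size statistic after data-changing commands
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)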


Note updateTableStats uses SparkSession to access the current SessionState that it
then uses to access the session-scoped SessionCatalog.

Note updateTableStats is used when InsertIntoHiveTable,
InsertIntoHadoopFsRelationCommand, AlterTableDropPartitionCommand ,
AlterTableSetLocationCommand and LoadDataCommand commands are executed.

Calculating Total Size of Table (with Partitions) —  calculateTotalSize Method

calculateTotalSize(sessionState: SessionState, catalogTable: CatalogTable): BigInt

calculateTotalSize calculates total file size for the entire input CatalogTable (when it has
no partitions defined) or all its partitions (through the session-scoped SessionCatalog).

Note calculateTotalSize uses the input SessionState to access the SessionCatalog.

Note calculateTotalSize is used when:

AnalyzeColumnCommand and AnalyzeTableCommand commands are executed

CommandUtils is requested to update existing table statistics (when
InsertIntoHiveTable, InsertIntoHadoopFsRelationCommand,
AlterTableDropPartitionCommand , AlterTableSetLocationCommand and
LoadDataCommand commands are executed)

Calculating Total File Size Under Path —  calculateLocationSize Method

calculateLocationSize(
sessionState: SessionState,
identifier: TableIdentifier,
locationUri: Option[URI]): Long

calculateLocationSize reads hive.exec.stagingdir configuration property for the staging

directory (with .hive-staging being the default).

You should see the following INFO message in the logs:

INFO CommandUtils: Starting to calculate the total file size under path [locationUri].


calculateLocationSize calculates the sum of the length of all the files under the input
locationUri .

Note calculateLocationSize uses Hadoop’s FileSystem.getFileStatus and
FileStatus.getLen to access a file and the length of the file (in bytes), respectively.

In the end, you should see the following INFO message in the logs:

INFO CommandUtils: It took [durationInMs] ms to calculate the total file size under path [locationUri].

Note calculateLocationSize is used when:

AnalyzePartitionCommand and AlterTableAddPartitionCommand commands are executed

CommandUtils is requested for total size of a table or its partitions

Creating CatalogStatistics with Current Statistics —  compareAndGetNewStats Method

compareAndGetNewStats(
oldStats: Option[CatalogStatistics],
newTotalSize: BigInt,
newRowCount: Option[BigInt]): Option[CatalogStatistics]

compareAndGetNewStats creates a new CatalogStatistics with the input newTotalSize and
newRowCount only when they are different from the oldStats .

Note compareAndGetNewStats is used when AnalyzePartitionCommand and
AnalyzeTableCommand are executed.


Catalyst DSL — Implicit Conversions for Catalyst Data Structures

Catalyst DSL is a collection of Scala implicit conversions for constructing Catalyst data
structures, i.e. expressions and logical plans, more easily.

The goal of Catalyst DSL is to make working with Spark SQL’s building blocks easier (e.g.
for testing or Spark SQL internals exploration).

Table 1. Catalyst DSL’s Implicit Conversions

Name Description

ExpressionConversions Creates expressions: Literals, UnresolvedAttribute and UnresolvedReference, …​

ImplicitOperators Adds operators to expressions for complex expressions

plans Creates logical plans: hint, join, table, DslLogicalPlan

Catalyst DSL is part of org.apache.spark.sql.catalyst.dsl package object.

import org.apache.spark.sql.catalyst.dsl.expressions._
scala> :type $"hello"
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute


Some implicit conversions from the Catalyst DSL interfere with the implicit conversions from
SQLImplicits that are imported automatically in spark-shell (through spark.implicits._ ).

scala> 'hello.decimal
<console>:30: error: type mismatch;
found : Symbol
required: ?{def decimal: ?}
Note that implicit conversions are not applicable because they are ambiguous:
both method symbolToColumn in class SQLImplicits of type (s: Symbol)org.apache.spark.
and method DslSymbol in trait ExpressionConversions of type (sym: Symbol)org.apache.s
are possible conversion functions from Symbol to ?{def decimal: ?}
'hello.decimal
^
<console>:30: error: value decimal is not a member of Symbol
'hello.decimal
^

Important Use sbt console with Spark libraries defined (in build.sbt ) instead.

You can also disable an implicit conversion using a trick described in How can an implicit b

// HACK: Disable symbolToColumn implicit conversion


// It is imported automatically in spark-shell (and makes demos impossible)
// implicit def symbolToColumn(s: Symbol): org.apache.spark.sql.ColumnName
trait ThatWasABadIdea
implicit def symbolToColumn(ack: ThatWasABadIdea) = ack

// HACK: Disable $ string interpolator


// It is imported automatically in spark-shell (and makes demos impossible)
implicit class StringToColumn(val sc: StringContext) {}

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._

// ExpressionConversions

import org.apache.spark.sql.catalyst.expressions.Literal
scala> val trueLit: Literal = true
trueLit: org.apache.spark.sql.catalyst.expressions.Literal = true

import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
scala> val name: UnresolvedAttribute = 'name
name: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = 'name

// NOTE: This conversion may not work, e.g. in spark-shell


// There is another implicit conversion StringToColumn in SQLImplicits


// It is automatically imported in spark-shell
// See :imports
val id: UnresolvedAttribute = $"id"

import org.apache.spark.sql.catalyst.expressions.Expression
scala> val expr: Expression = sum('id)
expr: org.apache.spark.sql.catalyst.expressions.Expression = sum('id)

// implicit class DslSymbol


scala> 'hello.s
res2: String = hello

scala> 'hello.attr
res4: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = 'hello

// implicit class DslString


scala> "helo".expr
res0: org.apache.spark.sql.catalyst.expressions.Expression = helo

scala> "helo".attr
res1: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = 'helo

// logical plans

scala> val t1 = table("t1")


t1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'UnresolvedRelation `t1`

scala> val p = t1.select('*).serialize[String].where('id % 2 == 0)


p: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Filter false
+- 'SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String,
StringType, fromString, input[0, java.lang.String, true], true) AS value#1]
+- 'Project ['*]
+- 'UnresolvedRelation `t1`

// FIXME Does not work because SimpleAnalyzer's catalog is empty


// the p plan references a t1 table
import org.apache.spark.sql.catalyst.analysis.SimpleAnalyzer
scala> p.analyze

ImplicitOperators Implicit Conversions


Operators for expressions, i.e. in .

ExpressionConversions Implicit Conversions


ExpressionConversions implicit conversions add ImplicitOperators operators to Catalyst

expressions.


Type Conversions to Literal Expressions


ExpressionConversions adds conversions of Scala native types (e.g. Boolean , Long ,

String , Date , Timestamp ) and Spark SQL types (i.e. Decimal ) to Literal expressions.

// DEMO FIXME
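
Until the demo is complete, the following is a minimal sketch of the conversions:

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.Literal

// Scala native types are converted to Literal expressions
val boolLit: Literal = true
val longLit: Literal = 1L
val strLit: Literal = "hello"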

Converting Symbols to UnresolvedAttribute and AttributeReference Expressions

ExpressionConversions adds conversions of Scala’s Symbol to UnresolvedAttribute and
AttributeReference expressions.

// DEMO FIXME
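
Until the demo is complete, the following is a minimal sketch (assuming the spark-shell hacks shown earlier, or a plain sbt console, so the DSL's Symbol conversions win over symbolToColumn):

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// A Symbol becomes an UnresolvedAttribute...
val name: UnresolvedAttribute = 'name

// ...and with a data type it becomes an AttributeReference (as used later in this page)
val id = 'id.long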

Converting $-Prefixed String Literals to UnresolvedAttribute Expressions

ExpressionConversions adds conversions of $"col name" to an UnresolvedAttribute
expression.

// DEMO FIXME

Adding Aggregate And Non-Aggregate Functions to Expressions

star(names: String*): Expression

ExpressionConversions adds the aggregate and non-aggregate functions to Catalyst

expressions (e.g. sum , count , upper , star , callFunction , windowSpec , windowExpr )

import org.apache.spark.sql.catalyst.dsl.expressions._
val s = star()

import org.apache.spark.sql.catalyst.analysis.UnresolvedStar
assert(s.isInstanceOf[UnresolvedStar])

val s = star("a", "b")


scala> println(s)
WrappedArray(a, b).*


Creating UnresolvedFunction Expressions —  function and distinctFunction Methods

ExpressionConversions allows creating UnresolvedFunction expressions with function and
distinctFunction operators.

function(exprs: Expression*): UnresolvedFunction


distinctFunction(exprs: Expression*): UnresolvedFunction

import org.apache.spark.sql.catalyst.dsl.expressions._

// Works with Scala Symbols only


val f = 'f.function()
scala> :type f
org.apache.spark.sql.catalyst.analysis.UnresolvedFunction

scala> f.isDistinct
res0: Boolean = false

val g = 'g.distinctFunction()
scala> g.isDistinct
res1: Boolean = true

Creating AttributeReference Expressions With nullability On or Off —  notNull and canBeNull Methods

ExpressionConversions adds canBeNull and notNull operators to create an
AttributeReference with nullability turned on or off, respectively.

notNull: AttributeReference
canBeNull: AttributeReference

// DEMO FIXME
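
Until the demo is complete, the following is a minimal sketch (assuming the spark-shell hacks shown earlier so the DSL's Symbol conversions apply; the variable names are only for illustration):

import org.apache.spark.sql.catalyst.dsl.expressions._

// notNull and canBeNull tweak the nullability of an AttributeReference
val required = 'id.string.notNull
assert(required.nullable == false)

val optional = 'id.string.canBeNull
assert(optional.nullable == true)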

Creating BoundReference —  at Method

at(ordinal: Int): BoundReference

ExpressionConversions adds at method to AttributeReferences to create

BoundReference expressions.


import org.apache.spark.sql.catalyst.dsl.expressions._
val boundRef = 'hello.string.at(4)
scala> println(boundRef)
input[4, string, true]

plans Implicit Conversions for Logical Plans

Creating UnresolvedHint Logical Operator —  hint Method


plans adds hint method to create an UnresolvedHint logical operator.

hint(name: String, parameters: Any*): LogicalPlan

Creating Join Logical Operator —  join Method


join creates a Join logical operator.

join(
otherPlan: LogicalPlan,
joinType: JoinType = Inner,
condition: Option[Expression] = None): LogicalPlan

Creating UnresolvedRelation Logical Operator —  table Method

table creates an UnresolvedRelation logical operator.

table(ref: String): LogicalPlan


table(db: String, ref: String): LogicalPlan

import org.apache.spark.sql.catalyst.dsl.plans._

val t1 = table("t1")
scala> println(t1.treeString)
'UnresolvedRelation `t1`

DslLogicalPlan Implicit Class

implicit class DslLogicalPlan(val logicalPlan: LogicalPlan)


DslLogicalPlan implicit class is part of plans implicit conversions with extension methods

(of logical operators) to build entire logical plans.

select(exprs: Expression*): LogicalPlan


where(condition: Expression): LogicalPlan
filter[T: Encoder](func: T => Boolean): LogicalPlan
filter[T: Encoder](func: FilterFunction[T]): LogicalPlan
serialize[T: Encoder]: LogicalPlan
deserialize[T: Encoder]: LogicalPlan
limit(limitExpr: Expression): LogicalPlan
join(
otherPlan: LogicalPlan,
joinType: JoinType = Inner,
condition: Option[Expression] = None): LogicalPlan
cogroup[Key: Encoder, Left: Encoder, Right: Encoder, Result: Encoder](
otherPlan: LogicalPlan,
func: (Key, Iterator[Left], Iterator[Right]) => TraversableOnce[Result],
leftGroup: Seq[Attribute],
rightGroup: Seq[Attribute],
leftAttr: Seq[Attribute],
rightAttr: Seq[Attribute]): LogicalPlan
orderBy(sortExprs: SortOrder*): LogicalPlan
sortBy(sortExprs: SortOrder*): LogicalPlan
groupBy(groupingExprs: Expression*)(aggregateExprs: Expression*): LogicalPlan
window(
windowExpressions: Seq[NamedExpression],
partitionSpec: Seq[Expression],
orderSpec: Seq[SortOrder]): LogicalPlan
subquery(alias: Symbol): LogicalPlan
except(otherPlan: LogicalPlan): LogicalPlan
intersect(otherPlan: LogicalPlan): LogicalPlan
union(otherPlan: LogicalPlan): LogicalPlan
generate(
generator: Generator,
unrequiredChildIndex: Seq[Int] = Nil,
outer: Boolean = false,
alias: Option[String] = None,
outputNames: Seq[String] = Nil): LogicalPlan
insertInto(tableName: String, overwrite: Boolean = false): LogicalPlan
as(alias: String): LogicalPlan
coalesce(num: Integer): LogicalPlan
repartition(num: Integer): LogicalPlan
distribute(exprs: Expression*)(n: Int): LogicalPlan
hint(name: String, parameters: Any*): LogicalPlan


// Import plans object


// That loads implicit class DslLogicalPlan
// And so every LogicalPlan is the "target" of the DslLogicalPlan methods
import org.apache.spark.sql.catalyst.dsl.plans._

val t1 = table(ref = "t1")

// HACK: Disable symbolToColumn implicit conversion


// It is imported automatically in spark-shell (and makes demos impossible)
// implicit def symbolToColumn(s: Symbol): org.apache.spark.sql.ColumnName
trait ThatWasABadIdea
implicit def symbolToColumn(ack: ThatWasABadIdea) = ack

import org.apache.spark.sql.catalyst.dsl.expressions._
val id = 'id.long
val logicalPlan = t1.select(id)
scala> println(logicalPlan.numberedTreeString)
00 'Project [id#1L]
01 +- 'UnresolvedRelation `t1`

val t2 = table("t2")
import org.apache.spark.sql.catalyst.plans.LeftSemi
val logicalPlan = t1.join(t2, joinType = LeftSemi, condition = Some(id))
scala> println(logicalPlan.numberedTreeString)
00 'Join LeftSemi, id#1: bigint
01 :- 'UnresolvedRelation `t1`
02 +- 'UnresolvedRelation `t2`

Analyzing Logical Plan —  analyze Method

analyze: LogicalPlan

analyze resolves attribute references.

analyze method is part of DslLogicalPlan implicit class.

Internally, analyze uses EliminateSubqueryAliases logical optimization and
SimpleAnalyzer logical analyzer.

// DEMO FIXME


Fundamentals of Spark SQL Application Development

Development of a Spark SQL application requires the following steps:

1. Setting up Development Environment (IntelliJ IDEA, Scala and sbt)

2. Specifying Library Dependencies

3. Creating SparkSession

4. Loading Data from Data Sources

5. Processing Data Using Dataset API

6. Saving Data to Persistent Storage

7. Deploying Spark Application to Cluster (using spark-submit )


SparkSession — The Entry Point to Spark SQL


SparkSession is the entry point to Spark SQL. It is one of the very first objects you create

while developing a Spark SQL application.

As a Spark developer, you create a SparkSession using the SparkSession.builder method


(that gives you access to Builder API that you use to configure the session).

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.appName("My Spark Application") // optional and will be autogenerated if not speci
fied
.master("local[*]") // only for demo and testing purposes, use spark-s
ubmit instead
.enableHiveSupport() // self-explanatory, isn't it?
.config("spark.sql.warehouse.dir", "target/spark-warehouse")
.withExtensions { extensions =>
extensions.injectResolutionRule { session =>
...
}
extensions.injectOptimizerRule { session =>
...
}
}
.getOrCreate

Once created, SparkSession allows for creating a DataFrame (based on an RDD or a Scala
Seq ), creating a Dataset, accessing the Spark SQL services (e.g. ExperimentalMethods,

ExecutionListenerManager, UDFRegistration), executing a SQL query, loading a table and


the last but not least accessing DataFrameReader interface to load a dataset of the format
of your choice (to some extent).

You can enable Apache Hive support with support for an external Hive metastore.

Note spark object in spark-shell (the instance of SparkSession that is auto-created)
has Hive support enabled.

In order to disable the pre-configured Hive support in the spark object, use
spark.sql.catalogImplementation internal configuration property with in-memory value
(that uses InMemoryCatalog external catalog instead).

$ spark-shell --conf spark.sql.catalogImplementation=in-memory


You can have as many SparkSessions as you want in a single Spark application. The
common use case is to keep relational entities separate logically in catalogs per
SparkSession .

In the end, you stop a SparkSession using SparkSession.stop method.

spark.stop

Table 1. SparkSession API (Object and Instance Methods)


Method Description

active: SparkSession
active

(New in 2.4.0)

builder(): Builder

builder
Object method to create a Builder to get the current SparkSession instance or
create a new one.

catalog: Catalog

catalog
Access to the current metadata catalog of relational entities, e.g. database(s),
tables, functions, table columns, and temporary views.

clearActiveSession(): Unit
clearActiveSession
Object method

clearDefaultSession(): Unit
clearDefaultSession
Object method

close close(): Unit

conf: RuntimeConfig
conf
Access to the current runtime configuration


createDataFrame(rdd: RDD[_], beanClass: Class[_]): DataFrame


createDataFrame createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame
createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame

createDataset[T : Encoder](data: RDD[T]): Dataset[T]


createDataset createDataset[T : Encoder](data: Seq[T]): Dataset[T]

emptyDataFrame emptyDataFrame: DataFrame

emptyDataset emptyDataset[T: Encoder]: Dataset[T]

experimental: ExperimentalMethods
experimental
Access to the current ExperimentalMethods

getActiveSession: Option[SparkSession]
getActiveSession
Object method

getDefaultSession: Option[SparkSession]
getDefaultSession
Object method

import spark.implicits._
implicits
Implicits conversions

listenerManager: ExecutionListenerManager
listenerManager
Access to the current ExecutionListenerManager

newSession(): SparkSession
newSession
Creates a new SparkSession


range(end: Long): Dataset[java.lang.Long]
range(start: Long, end: Long): Dataset[java.lang.Long]
range range(start: Long, end: Long, step: Long): Dataset[java.lang.Long]
range(start: Long, end: Long, step: Long, numPartitions: Int): Dataset[java.lang.Long]

Creates a Dataset[java.lang.Long]

read: DataFrameReader

read
Access to the current DataFrameReader to load data from external data
sources

sessionState: SessionState

Access to the current SessionState


Internally, sessionState clones the optional parent SessionState (if given
sessionState when creating the SparkSession) or creates a new SessionState using
BaseSessionStateBuilder as defined by spark.sql.catalogImplementation
configuration property:
in-memory (default) for
org.apache.spark.sql.internal.SessionStateBuilder
hive for org.apache.spark.sql.hive.HiveSessionStateBuilder

setActiveSession(session: SparkSession): Unit


setActiveSession
Object method

setDefaultSession(session: SparkSession): Unit


setDefaultSession
Object method

sharedState: SharedState
sharedState
Access to the current SharedState

sparkContext: SparkContext
sparkContext
Access to the underlying SparkContext


sql sql(sqlText: String): DataFrame

"Executes" a SQL query

sqlContext: SQLContext
sqlContext

Access to the underlying SQLContext

stop(): Unit
stop
Stops the associated SparkContext

table(tableName: String): DataFrame


table
Loads data from a table

time[T](f: => T): T

time
Executes a code block and prints out (to standard output) the time taken to
execute it

udf: UDFRegistration
udf
Access to the current UDFRegistration

version: String
version
Returns the version of Apache Spark

Note baseRelationToDataFrame acts as a mechanism to plug BaseRelation object
hierarchy into LogicalPlan object hierarchy that SparkSession uses to bridge them.

Creating SparkSession Using Builder Pattern —  builder Object Method

builder(): Builder


builder creates a new Builder that you use to build a fully-configured SparkSession using

a fluent API.

import org.apache.spark.sql.SparkSession
val builder = SparkSession.builder

Tip Read about Fluent interface design pattern in Wikipedia, the free encyclopedia.

Accessing Version of Spark —  version Method

version: String

version returns the version of Apache Spark in use.

Internally, version uses spark.SPARK_VERSION value that is the version property in spark-
version-info.properties properties file on CLASSPATH.

Creating Empty Dataset (Given Encoder) —  emptyDataset Operator

emptyDataset[T: Encoder]: Dataset[T]

emptyDataset creates an empty Dataset (assuming that future records being of type T ).

scala> val strings = spark.emptyDataset[String]


strings: org.apache.spark.sql.Dataset[String] = [value: string]

scala> strings.printSchema
root
|-- value: string (nullable = true)

emptyDataset creates a LocalRelation logical query plan.

Creating Dataset from Local Collections or RDDs —  createDataset methods

createDataset[T : Encoder](data: RDD[T]): Dataset[T]


createDataset[T : Encoder](data: Seq[T]): Dataset[T]


createDataset is an experimental API to create a Dataset from a local Scala collection, i.e.

Seq[T] , Java’s List[T] , or a distributed RDD[T] .

scala> val one = spark.createDataset(Seq(1))


one: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> one.show
+-----+
|value|
+-----+
| 1|
+-----+

createDataset creates a LocalRelation (for the input data collection) or LogicalRDD (for

the input RDD[T] ) logical operators.

Tip You may want to consider implicits object and toDS method instead.

val spark: SparkSession = ...
import spark.implicits._

scala> val one = Seq(1).toDS
one: org.apache.spark.sql.Dataset[Int] = [value: int]

Internally, createDataset first looks up the implicit expression encoder in scope to access
the AttributeReference s (of the schema).

Note Only unresolved expression encoders are currently supported.

The expression encoder is then used to map elements (of the input Seq[T] ) into a
collection of InternalRows. With the references and rows, createDataset returns a Dataset
with a LocalRelation logical query plan.

Creating Dataset With Single Long Column —  range Operator

range(end: Long): Dataset[java.lang.Long]


range(start: Long, end: Long): Dataset[java.lang.Long]
range(start: Long, end: Long, step: Long): Dataset[java.lang.Long]
range(start: Long, end: Long, step: Long, numPartitions: Int): Dataset[java.lang.Long]

range family of methods create a Dataset of Long numbers.


scala> spark.range(start = 0, end = 4, step = 2, numPartitions = 5).show


+---+
| id|
+---+
| 0|
| 2|
+---+

Note The first three variants (that do not specify numPartitions explicitly) use
SparkContext.defaultParallelism for the number of partitions numPartitions .

Internally, range creates a new Dataset[Long] with Range logical plan and Encoders.LONG
encoder.

Creating Empty DataFrame —  emptyDataFrame method

emptyDataFrame: DataFrame

emptyDataFrame creates an empty DataFrame (with no rows and columns).

It calls createDataFrame with an empty RDD[Row] and an empty schema StructType(Nil).
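
A minimal sketch:

// An empty DataFrame has no rows and no columns
val empty = spark.emptyDataFrame
assert(empty.count == 0)
assert(empty.schema.isEmpty)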

Creating DataFrames from Local Collections or RDDs —  createDataFrame Method

createDataFrame(rdd: RDD[_], beanClass: Class[_]): DataFrame


createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame
createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
// private[sql]
createDataFrame(rowRDD: RDD[Row], schema: StructType, needsConversion: Boolean): DataF
rame

createDataFrame creates a DataFrame using RDD[Row] and the input schema . It is

assumed that the rows in rowRDD all match the schema .

Caution FIXME
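
Until the section is complete, the following is a minimal sketch of the two most common variants (the column names are only for illustration):

// From a local Seq of Products (the schema is inferred from the tuple)
val df1 = spark.createDataFrame(Seq((0, "zero"), (1, "one"))).toDF("id", "name")

// From an RDD[Row] with an explicit schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))
val rows = spark.sparkContext.parallelize(Seq(Row(0, "zero"), Row(1, "one")))
val df2 = spark.createDataFrame(rows, schema)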

Executing SQL Queries (aka SQL Mode) —  sql Method

sql(sqlText: String): DataFrame


sql executes the sqlText SQL statement and creates a DataFrame.

Note sql is imported in spark-shell so you can execute SQL statements as if sql were
a part of the environment.

scala> :imports
1) import spark.implicits._ (72 terms, 43 are implicit)
2) import spark.sql (1 terms)

scala> sql("SHOW TABLES")


res0: org.apache.spark.sql.DataFrame = [tableName: string, isTemporary: boolean]

scala> sql("DROP TABLE IF EXISTS testData")


res1: org.apache.spark.sql.DataFrame = []

// Let's create a table to SHOW it


spark.range(10).write.option("path", "/tmp/test").saveAsTable("testData")

scala> sql("SHOW TABLES").show


+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| testdata| false|
+---------+-----------+

Internally, sql requests the current ParserInterface to execute a SQL query that gives a
LogicalPlan.

Note sql uses SessionState to access the current ParserInterface .

sql then creates a DataFrame using the current SparkSession (itself) and the LogicalPlan.

Tip spark-sql is the main SQL environment in Spark to work with pure SQL
statements (where you do not have to use Scala to execute them).

spark-sql> show databases;
default
Time taken: 0.028 seconds, Fetched 1 row(s)

Accessing UDFRegistration —  udf Attribute

udf: UDFRegistration


udf attribute gives access to UDFRegistration that allows registering user-defined

functions for SQL-based queries.

val spark: SparkSession = ...


spark.udf.register("myUpper", (s: String) => s.toUpperCase)

val strs = ('a' to 'c').map(_.toString).toDS


strs.registerTempTable("strs")

scala> sql("SELECT *, myUpper(value) UPPER FROM strs").show


+-----+-----+
|value|UPPER|
+-----+-----+
| a| A|
| b| B|
| c| C|
+-----+-----+

Internally, it is simply an alias for SessionState.udfRegistration.

Loading Data From Table —  table Method

table(tableName: String): DataFrame (1)


// private[sql]
table(tableIdent: TableIdentifier): DataFrame

1. Parses tableName to a TableIdentifier and calls the other table

table creates a DataFrame (wrapper) from the input tableName table (but only if available

in the session catalog).

scala> spark.catalog.tableExists("t1")
res1: Boolean = true

// t1 exists in the catalog


// let's load it
val t1 = spark.table("t1")

Accessing Metastore —  catalog Attribute

catalog: Catalog

catalog attribute is a (lazy) interface to the current metastore, i.e. data catalog (of relational

entities like databases, tables, functions, table columns, and views).


Tip All methods in Catalog return Datasets .

scala> spark.catalog.listTables.show
+------------------+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+------------------+--------+-----------+---------+-----------+
|my_permanent_table| default| null| MANAGED| false|
| strs| null| null|TEMPORARY| true|
+------------------+--------+-----------+---------+-----------+

Internally, catalog creates a CatalogImpl (that uses the current SparkSession ).

Accessing DataFrameReader —  read method

read: DataFrameReader

read method returns a DataFrameReader that is used to read data from external storage

systems and load it into a DataFrame .

val spark: SparkSession = // create instance


val dfReader: DataFrameReader = spark.read

Getting Runtime Configuration —  conf Attribute

conf: RuntimeConfig

conf returns the current RuntimeConfig.

Internally, conf creates a RuntimeConfig (when requested the very first time and cached
afterwards) with the SQLConf of the SessionState.

readStream method

readStream: DataStreamReader

readStream returns a new DataStreamReader.

streams Attribute


streams: StreamingQueryManager

streams attribute gives access to StreamingQueryManager (through SessionState).

val spark: SparkSession = ...


spark.streams.active.foreach(println)

experimentalMethods Attribute

experimental: ExperimentalMethods

experimentalMethods is an extension point with ExperimentalMethods that is a per-session

collection of extra strategies and Rule[LogicalPlan] s.

Note experimental is used in SparkPlanner and SparkOptimizer. Hive and Structured
Streaming use it for their own extra strategies and optimization rules.

Creating SparkSession Instance —  newSession method

newSession(): SparkSession

newSession creates (starts) a new SparkSession (with the current SparkContext and

SharedState).

scala> val newSession = spark.newSession


newSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@122f
58a

Stopping SparkSession —  stop Method

stop(): Unit

stop stops the SparkSession , i.e. stops the underlying SparkContext .

Create DataFrame from BaseRelation —  baseRelationToDataFrame Method


baseRelationToDataFrame(baseRelation: BaseRelation): DataFrame

Internally, baseRelationToDataFrame creates a DataFrame from the input BaseRelation
wrapped inside LogicalRelation.

Note LogicalRelation is a logical plan adapter for BaseRelation (so BaseRelation can
be part of a logical plan).

Note baseRelationToDataFrame is used when:

DataFrameReader loads data from a data source that supports multiple paths

DataFrameReader loads data from an external table using JDBC

TextInputCSVDataSource creates a base Dataset (of Strings)

TextInputJsonDataSource creates a base Dataset (of Strings)

Creating SessionState Instance —  instantiateSessionState Internal Method

instantiateSessionState(className: String, sparkSession: SparkSession): SessionState

instantiateSessionState finds the className that is then used to create and build a

BaseSessionStateBuilder .

instantiateSessionState may report an IllegalArgumentException while instantiating the

class of a SessionState :

Error while instantiating '[className]'

Note instantiateSessionState is used exclusively when SparkSession is requested for
SessionState per spark.sql.catalogImplementation configuration property (and one is
not available yet).

sessionStateClassName Internal Method

sessionStateClassName(conf: SparkConf): String

sessionStateClassName gives the name of the class of the SessionState per

spark.sql.catalogImplementation, i.e.


org.apache.spark.sql.hive.HiveSessionStateBuilder for hive

org.apache.spark.sql.internal.SessionStateBuilder for in-memory

Note sessionStateClassName is used exclusively when SparkSession is requested for
the SessionState (and one is not available yet).

Creating DataFrame From RDD Of Internal Binary Rows and Schema —  internalCreateDataFrame Internal Method

internalCreateDataFrame(
catalystRows: RDD[InternalRow],
schema: StructType,
isStreaming: Boolean = false): DataFrame

internalCreateDataFrame creates a DataFrame with a LogicalRDD.

Note internalCreateDataFrame is used when:

DataFrameReader is requested to create a DataFrame from Dataset of JSONs or CSVs

SparkSession is requested to create a DataFrame from RDD of rows

InsertIntoDataSourceCommand logical command is executed

Creating SparkSession Instance


SparkSession takes the following when created:

Spark Core’s SparkContext

Optional SharedState

Optional SessionState

SparkSessionExtensions

clearActiveSession Object Method

clearActiveSession(): Unit

clearActiveSession …​FIXME


clearDefaultSession Object Method

clearDefaultSession(): Unit

clearDefaultSession …​FIXME

Accessing ExperimentalMethods —  experimental Method

experimental: ExperimentalMethods

experimental …​FIXME

getActiveSession Object Method

getActiveSession: Option[SparkSession]

getActiveSession …​FIXME

getDefaultSession Object Method

getDefaultSession: Option[SparkSession]

getDefaultSession …​FIXME
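The object methods above are still to be described in detail, but a minimal sketch of how they behave together (assuming a Spark application whose SparkSession was created with SparkSession.builder.getOrCreate and is available as spark):

import org.apache.spark.sql.SparkSession

// getOrCreate registers the session it returns as the default session
assert(SparkSession.getDefaultSession.contains(spark))

// setActiveSession binds a session to the current thread...
SparkSession.setActiveSession(spark)
assert(SparkSession.getActiveSession.contains(spark))

// ...and clearActiveSession only clears that binding, not the session itself
SparkSession.clearActiveSession()
assert(SparkSession.getActiveSession.isEmpty)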

Accessing ExecutionListenerManager —  listenerManager Method

listenerManager: ExecutionListenerManager

listenerManager …​FIXME

Accessing SessionState —  sessionState Lazy Attribute

sessionState: SessionState

sessionState …​FIXME


setActiveSession Object Method

setActiveSession(session: SparkSession): Unit

setActiveSession …​FIXME

setDefaultSession Object Method

setDefaultSession(session: SparkSession): Unit

setDefaultSession …​FIXME

Accessing SharedState —  sharedState Method

sharedState: SharedState

sharedState …​FIXME

Measuring Duration of Executing Code Block —  time Method

time[T](f: => T): T

time …​FIXME
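A minimal usage sketch (assuming time simply measures the wall-clock duration of the given code block, prints it out and returns the block's result):

val spark: SparkSession = ...

// Prints something like "Time taken: 123 ms" and returns the count
val rowCount = spark.time {
  spark.range(1000000).count()
}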

Builder — Building SparkSession using Fluent API
Builder is the fluent API to create a SparkSession.

Table 1. Builder API

appName
  appName(name: String): Builder

config
  config(conf: SparkConf): Builder
  config(key: String, value: Boolean): Builder
  config(key: String, value: Double): Builder
  config(key: String, value: Long): Builder
  config(key: String, value: String): Builder

enableHiveSupport
  enableHiveSupport(): Builder
  Enables Hive support

getOrCreate
  getOrCreate(): SparkSession
  Gets the current SparkSession or creates a new one.

master
  master(master: String): Builder

withExtensions
  withExtensions(f: SparkSessionExtensions => Unit): Builder
  Access to the SparkSessionExtensions

Builder is available using the builder object method of a SparkSession.


import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
  .appName("My Spark Application") // optional and will be autogenerated if not specified
  .master("local[*]")              // only for demo and testing purposes, use spark-submit instead
  .enableHiveSupport()             // self-explanatory, isn't it?
  .config("spark.sql.warehouse.dir", "target/spark-warehouse")
  .withExtensions { extensions =>
    extensions.injectResolutionRule { session =>
      ...
    }
    extensions.injectOptimizerRule { session =>
      ...
    }
  }
  .getOrCreate

Note You can have multiple SparkSessions in a single Spark application for different data catalogs (through relational entities).

Table 2. Builder’s Internal Properties (e.g. Registries, Counters and Flags)

extensions
  SparkSessionExtensions
  Used when…​FIXME

options
  Used when…​FIXME

Getting Or Creating SparkSession Instance —  getOrCreate Method

getOrCreate(): SparkSession

getOrCreate …​FIXME
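The internals are not described here yet, but the observable contract is that getOrCreate reuses an already-started SparkSession instead of always creating a new one. A minimal sketch:

import org.apache.spark.sql.SparkSession

val first = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
val second = SparkSession.builder.getOrCreate()

// The second call returns the session created by the first one
assert(first eq second)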

Enabling Hive Support —  enableHiveSupport Method

enableHiveSupport(): Builder

enableHiveSupport enables Hive support, i.e. running structured queries on Hive tables (and

a persistent Hive metastore, support for Hive serdes and Hive user-defined functions).


Note You do not need any existing Hive installation to use Spark’s Hive support. SparkSession will automatically create metastore_db in the current directory of a Spark application and a directory configured by spark.sql.warehouse.dir. Refer to SharedState.

Internally, enableHiveSupport makes sure that the Hive classes are on CLASSPATH, i.e.
Spark SQL’s org.apache.hadoop.hive.conf.HiveConf , and sets
spark.sql.catalogImplementation internal configuration property to hive .
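A minimal sketch of enabling Hive support (assuming the Hive classes are on the CLASSPATH, e.g. a Spark distribution built with Hive support, and that the internal property can be read back through the runtime configuration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Expected to be "hive" (and "in-memory" without enableHiveSupport)
spark.conf.get("spark.sql.catalogImplementation")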

withExtensions Method

withExtensions(f: SparkSessionExtensions => Unit): Builder

withExtensions simply executes the input f function with the SparkSessionExtensions.

appName Method

appName(name: String): Builder

appName …​FIXME

config Method

config(conf: SparkConf): Builder


config(key: String, value: Boolean): Builder
config(key: String, value: Double): Builder
config(key: String, value: Long): Builder
config(key: String, value: String): Builder

config …​FIXME

master Method

master(master: String): Builder

master …​FIXME


implicits Object — Implicits Conversions


implicits object gives implicit conversions for converting Scala objects (incl. RDDs) into a

Dataset , DataFrame , Columns or supporting such conversions (through Encoders).

Table 1. implicits API

localSeqToDatasetHolder
  implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T]
  Creates a DatasetHolder with the input Seq[T] converted to a Dataset[T] (using SparkSession.createDataset).

Encoders
  Encoders for primitive and object types in Scala and Java (aka boxed types)

StringToColumn
  implicit class StringToColumn(val sc: StringContext)
  Converts $"name" into a Column

rddToDatasetHolder
  implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T]

symbolToColumn
  implicit def symbolToColumn(s: Symbol): ColumnName

implicits object is defined inside SparkSession and hence requires that you build a

SparkSession instance first before importing implicits conversions.


import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
import spark.implicits._

scala> val ds = Seq("I am a shiny Dataset!").toDS


ds: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val df = Seq("I am an old grumpy DataFrame!").toDF


df: org.apache.spark.sql.DataFrame = [value: string]

scala> val df = Seq("I am an old grumpy DataFrame with text column!").toDF("text")


df: org.apache.spark.sql.DataFrame = [text: string]

val rdd = sc.parallelize(Seq("hello, I'm a very low-level RDD"))


scala> val ds = rdd.toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]

Tip In Scala REPL-based environments, e.g. spark-shell , use :imports to know what imports are in scope.

scala> :help imports

show import history, identifying sources of names

scala> :imports
1) import org.apache.spark.SparkContext._ (69 terms, 1 are implicit)
2) import spark.implicits._ (1 types, 67 terms, 37 are implicit)
3) import spark.sql (1 terms)
4) import org.apache.spark.sql.functions._ (354 terms)

implicits object extends SQLImplicits abstract class.

DatasetHolder Scala Case Class


DatasetHolder is a Scala case class that, when created, takes a Dataset[T] .

DatasetHolder is created (implicitly) when rddToDatasetHolder and

localSeqToDatasetHolder implicit conversions are used.

DatasetHolder has toDS and toDF methods that simply return the Dataset[T] (it was

created with) or a DataFrame (using Dataset.toDF operator), respectively.

toDS(): Dataset[T]
toDF(): DataFrame
toDF(colNames: String*): DataFrame
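A short example of the conversions at work in spark-shell (the DatasetHolder itself usually stays invisible since toDS or toDF is called right after the implicit conversion kicks in):

import spark.implicits._

// localSeqToDatasetHolder handles the local Seq, rddToDatasetHolder the RDD
val ds = Seq(1, 2, 3).toDS
val df = Seq(("a", 1), ("b", 2)).toDF("token", "count")
val fromRdd = sc.parallelize(Seq("hello", "world")).toDS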


SparkSessionExtensions
SparkSessionExtensions is an interface that a Spark developer can use to extend a

SparkSession with custom query execution rules and a relational entity parser.

As a Spark developer, you use Builder.withExtensions method (while building a new


SparkSession) to access the session-bound SparkSessionExtensions .

Table 1. SparkSessionExtensions API

injectCheckRule
  injectCheckRule(builder: SparkSession => LogicalPlan => Unit): Unit

injectOptimizerRule
  injectOptimizerRule(builder: SparkSession => Rule[LogicalPlan]): Unit
  Registering a custom operator optimization rule

injectParser
  injectParser(builder: (SparkSession, ParserInterface) => ParserInterface): Unit

injectPlannerStrategy
  injectPlannerStrategy(builder: SparkSession => Strategy): Unit

injectPostHocResolutionRule
  injectPostHocResolutionRule(builder: SparkSession => Rule[LogicalPlan]): Unit

injectResolutionRule
  injectResolutionRule(builder: SparkSession => Rule[LogicalPlan]): Unit

SparkSessionExtensions is an integral part of SparkSession (and is indirectly required to

create one).


Table 2. SparkSessionExtensions’s Internal Properties (e.g. Registries, Counters and Flags)

optimizerRules
  Collection of RuleBuilder functions (i.e. SparkSession ⇒ Rule[LogicalPlan] )
  Used when SparkSessionExtensions is requested to:
  Associate custom operator optimization rules with SparkSession
  Register a custom operator optimization rule

Associating Custom Operator Optimization Rules with SparkSession —  buildOptimizerRules Method

buildOptimizerRules(session: SparkSession): Seq[Rule[LogicalPlan]]

buildOptimizerRules gives the optimizerRules logical rules that are associated with the

input SparkSession.

Note buildOptimizerRules is used exclusively when BaseSessionStateBuilder is requested for the custom operator optimization rules to add to the base Operator Optimization batch.

Registering Custom Check Analysis Rule (Builder) —  injectCheckRule Method

injectCheckRule(builder: SparkSession => LogicalPlan => Unit): Unit

injectCheckRule …​FIXME

Registering Custom Operator Optimization Rule (Builder) —  injectOptimizerRule Method

injectOptimizerRule(builder: SparkSession => Rule[LogicalPlan]): Unit

injectOptimizerRule simply registers a custom operator optimization rule (as a

RuleBuilder function) to the optimizerRules internal registry.
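A minimal sketch of registering a custom operator optimization rule (the rule below is a do-nothing placeholder defined only for this example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A no-op optimization rule, used here only to demonstrate the registration
case class NoopOptimizationRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder
  .withExtensions { extensions =>
    extensions.injectOptimizerRule { session => NoopOptimizationRule(session) }
  }
  .getOrCreate()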

Registering Custom Parser (Builder) —  injectParser Method

injectParser(builder: (SparkSession, ParserInterface) => ParserInterface): Unit

injectParser …​FIXME

Registering Custom Planner Strategy (Builder) —  injectPlannerStrategy Method

injectPlannerStrategy(builder: SparkSession => Strategy): Unit

injectPlannerStrategy …​FIXME

Registering Custom Post-Hoc Resolution Rule (Builder) —  injectPostHocResolutionRule Method

injectPostHocResolutionRule(builder: SparkSession => Rule[LogicalPlan]): Unit

injectPostHocResolutionRule …​FIXME

Registering Custom Resolution Rule (Builder) —  injectResolutionRule Method

injectResolutionRule(builder: SparkSession => Rule[LogicalPlan]): Unit

injectResolutionRule …​FIXME


Dataset — Structured Query with Data Encoder


Dataset is a strongly-typed data structure in Spark SQL that represents a structured query.

Note A structured query can be written using SQL or Dataset API.

The following figure shows the relationship between different entities of Spark SQL that all
together give the Dataset data structure.

Figure 1. Dataset’s Internals


It is therefore fair to say that Dataset consists of the following three elements:

1. QueryExecution (with the parsed unanalyzed LogicalPlan of a structured query)

2. Encoder (of the type of the records for fast serialization and deserialization to and from
InternalRow)

3. SparkSession

When created, Dataset takes such a 3-element tuple with a SparkSession , a


QueryExecution and an Encoder .

Dataset is created when:

Dataset.apply (for a LogicalPlan and a SparkSession with the Encoder in a Scala implicit scope)

Dataset.ofRows (for a LogicalPlan and a SparkSession)

Dataset.toDF untyped transformation is used

Dataset.select, Dataset.randomSplit and Dataset.mapPartitions typed transformations


are used

KeyValueGroupedDataset.agg operator is used (that requests KeyValueGroupedDataset


to aggUntyped)

SparkSession.emptyDataset and SparkSession.range operators are used

CatalogImpl is requested to makeDataset (when requested to list databases, tables,

functions and columns)

Spark Structured Streaming’s MicroBatchExecution is requested to runBatch

Datasets are lazy and structured query operators and expressions are only triggered when
an action is invoked.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

scala> val dataset = spark.range(5)


dataset: org.apache.spark.sql.Dataset[Long] = [id: bigint]

// Variant 1: filter operator accepts a Scala function


dataset.filter(n => n % 2 == 0).count

// Variant 2: filter operator accepts a Column-based SQL expression


dataset.filter('value % 2 === 0).count

// Variant 3: filter operator accepts a SQL query


dataset.filter("value % 2 = 0").count

The Dataset API offers declarative and type-safe operators that make for an improved experience for data processing (compared to DataFrames, which were a set of index- or column name-based Rows).

Note Dataset was first introduced in Apache Spark 1.6.0 as an experimental feature, and has since turned itself into a fully supported API. As of Spark 2.0.0, DataFrame - the flagship data abstraction of previous versions of Spark SQL - is currently a mere type alias for Dataset[Row]:

type DataFrame = Dataset[Row]

See package object sql.

173
Dataset — Structured Query with Data Encoder

Dataset offers the convenience of RDDs with the performance optimizations of DataFrames and the strong static type-safety of Scala. The last feature, bringing strong type-safety to DataFrame, is what makes Dataset so appealing. All the features together give you a more functional programming interface to work with structured data.

scala> spark.range(1).filter('id === 0).explain(true)


== Parsed Logical Plan ==
'Filter ('id = 0)
+- Range (0, 1, splits=8)

== Analyzed Logical Plan ==


id: bigint
Filter (id#51L = cast(0 as bigint))
+- Range (0, 1, splits=8)

== Optimized Logical Plan ==


Filter (id#51L = 0)
+- Range (0, 1, splits=8)

== Physical Plan ==
*Filter (id#51L = 0)
+- *Range (0, 1, splits=8)

scala> spark.range(1).filter(_ == 0).explain(true)


== Parsed Logical Plan ==
'TypedFilter <function1>, class java.lang.Long, [StructField(value,LongType,true)], un
resolveddeserializer(newInstance(class java.lang.Long))
+- Range (0, 1, splits=8)

== Analyzed Logical Plan ==


id: bigint
TypedFilter <function1>, class java.lang.Long, [StructField(value,LongType,true)], new
Instance(class java.lang.Long)
+- Range (0, 1, splits=8)

== Optimized Logical Plan ==


TypedFilter <function1>, class java.lang.Long, [StructField(value,LongType,true)], new
Instance(class java.lang.Long)
+- Range (0, 1, splits=8)

== Physical Plan ==
*Filter <function1>.apply
+- *Range (0, 1, splits=8)

It is only with Datasets that you get syntax and analysis checks at compile time (which was not possible using DataFrame, regular SQL queries or even RDDs).

174
Dataset — Structured Query with Data Encoder

Using Dataset objects turns DataFrames of Row instances into a DataFrames of case
classes with proper names and types (following their equivalents in the case classes).
Instead of using indices to access respective fields in a DataFrame and cast it to a type, all
this is automatically handled by Datasets and checked by the Scala compiler.

If however a LogicalPlan is used to create a Dataset , the logical plan is first executed
(using the current SessionState in the SparkSession ) that yields the QueryExecution plan.

A Dataset is Queryable and Serializable , i.e. can be saved to a persistent storage.

Note SparkSession and QueryExecution are transient attributes of a Dataset and therefore do not participate in Dataset serialization. The only firmly-tied feature of a Dataset is the Encoder.

You can request the "untyped" view of a Dataset or access the RDD that is generated after
executing the query. It is supposed to give you a more pleasant experience while
transitioning from the legacy RDD-based or DataFrame-based APIs you may have used in
the earlier versions of Spark SQL or encourage migrating from Spark Core’s RDD API to
Spark SQL’s Dataset API.

The default storage level for Datasets is MEMORY_AND_DISK because recomputing the
in-memory columnar representation of the underlying table is expensive. You can however
persist a Dataset .

Note Spark 2.0 has introduced a new query model called Structured Streaming for continuous incremental execution of structured queries. That made it possible to consider Datasets static and bounded as well as streaming and unbounded data sets with a single unified API for different execution models.

A Dataset is local if it was created from local collections using SparkSession.emptyDataset


or SparkSession.createDataset methods and their derivatives like toDF. If so, the queries on
the Dataset can be optimized and run locally, i.e. without using Spark executors.

Note Dataset makes sure that the underlying QueryExecution is analyzed and checked.

Table 1. Dataset’s Properties

boundEnc
  ExpressionEncoder
  Used when…​FIXME

deserializer
  Deserializer expression to convert internal rows to objects of type T
  Created lazily by requesting the ExpressionEncoder to resolveAndBind
  Used when:
  Dataset is created (for a logical plan in a given SparkSession )
  Dataset.toLocalIterator operator is used (to create a Java Iterator of type T )
  Dataset is requested to collect all rows from a spark plan

exprEnc
  Implicit ExpressionEncoder
  Used when…​FIXME

logicalPlan
  Analyzed logical plan with all logical commands executed and turned into a LocalRelation.

  logicalPlan: LogicalPlan

  When initialized, logicalPlan requests the QueryExecution for the analyzed logical plan. If the plan is a logical command or a union thereof, logicalPlan executes the QueryExecution (using executeCollect).

planWithBarrier
  planWithBarrier: AnalysisBarrier

rdd
  (lazily-created) RDD of JVM objects of type T (as converted from rows in the internal binary row format).

  rdd: RDD[T]

  Note rdd gives RDD with the extra execution step to convert rows from their internal binary row format to JVM objects that will impact the JVM memory as the objects are inside JVM (while they were outside before). You should not use rdd directly.

  Internally, rdd first creates a new logical plan that deserializes the Dataset’s plan.

  val dataset = spark.range(5).withColumn("group", 'id % 2)

  scala> dataset.rdd.toDebugString
  res1: String =
  (8) MapPartitionsRDD[8] at rdd at <console>:26 [] // <-- extra deserialization
   |  MapPartitionsRDD[7] at rdd at <console>:26 []
   |  MapPartitionsRDD[6] at rdd at <console>:26 []
   |  MapPartitionsRDD[5] at rdd at <console>:26 []
   |  ParallelCollectionRDD[4] at rdd at <console>:26 []

  scala> dataset.queryExecution.toRdd.toDebugString
  res2: String =
  (8) MapPartitionsRDD[11] at toRdd at <console>:26 []
   |  MapPartitionsRDD[10] at toRdd at <console>:26 []
   |  ParallelCollectionRDD[9] at toRdd at <console>:26 []

  rdd then requests SessionState to execute the logical plan to get the corresponding RDD of binary rows.

  Note rdd uses SparkSession to access SessionState .

  rdd then requests the Dataset’s ExpressionEncoder for the data type (of the deserializer expression) and maps over them (per partition) to create records of the expected type T .

  Note rdd is at the "boundary" between the internal binary row format and the JVM type of the dataset. Avoid the extra deserialization step to lower JVM memory requirements of your Spark application.

sqlContext
  Lazily-created SQLContext
  Used when…​FIXME

Getting Input Files of Relations (in Structured Query) —  inputFiles Method

inputFiles: Array[String]

inputFiles requests QueryExecution for optimized logical plan and collects the following

logical operators:

LogicalRelation with FileRelation (as the BaseRelation)

FileRelation

HiveTableRelation

inputFiles then requests the logical operators for their underlying files:

inputFiles of the FileRelations

locationUri of the HiveTableRelation
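For example (assuming a local people.csv file):

val people = spark.read.option("header", true).csv("people.csv")

// Fully-qualified paths of the files behind the query, e.g. file:/.../people.csv
people.inputFiles.foreach(println)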

resolve Internal Method

resolve(colName: String): NamedExpression

Caution FIXME

Creating Dataset Instance


Dataset takes the following when created:

SparkSession

QueryExecution

Encoder for the type T of the records

You can also create a Dataset using LogicalPlan that is immediately executed
Note
using SessionState .

Internally, Dataset requests QueryExecution to analyze itself.

Dataset initializes the internal registries and counters.

Is Dataset Local? —  isLocal Method

isLocal: Boolean

isLocal flag is enabled (i.e. true ) when operators like collect or take could be run

locally, i.e. without using executors.

Internally, isLocal checks whether the logical query plan of a Dataset is LocalRelation.
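A quick way to see the flag in action (assuming a spark-shell session with spark.implicits._ in scope):

// A Dataset created from a local collection is local...
val local = Seq(1, 2, 3).toDS
assert(local.isLocal)

// ...while one backed by a file-based relation is not
val notLocal = spark.read.textFile("README.md")
assert(!notLocal.isLocal)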

Is Dataset Streaming? —  isStreaming method

isStreaming: Boolean

isStreaming is enabled (i.e. true ) when the logical plan is streaming.

Internally, isStreaming takes the Dataset’s logical plan and gives whether the plan is
streaming or not.
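For example (the streaming variant assumes Spark 2.2+ where the rate source of Spark Structured Streaming is available):

// A batch Dataset
assert(!spark.range(5).isStreaming)

// A streaming Dataset
val rates = spark.readStream.format("rate").load
assert(rates.isStreaming)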

Queryable

Caution FIXME

withNewRDDExecutionId Internal Method

withNewRDDExecutionId[U](body: => U): U

withNewRDDExecutionId executes the input body action under new execution id.

Caution FIXME What’s the difference between withNewRDDExecutionId and withNewExecutionId?

Note withNewRDDExecutionId is used when Dataset.foreach and Dataset.foreachPartition actions are used.

Creating DataFrame (For Logical Query Plan and SparkSession) —  ofRows Internal Factory Method

ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame

Note ofRows is part of Dataset Scala object that is marked as a private[sql] and so can only be accessed from code in org.apache.spark.sql package.

ofRows returns DataFrame (which is the type alias for Dataset[Row] ). ofRows uses

RowEncoder to convert the schema (based on the input logicalPlan logical plan).

Internally, ofRows prepares the input logicalPlan for execution and creates a
Dataset[Row] with the current SparkSession, the QueryExecution and RowEncoder.

Note ofRows is used when:

DataFrameReader is requested to load data from a data source

Dataset is requested to execute checkpoint, mapPartitionsInR , untyped transformations and set-based typed transformations

RelationalGroupedDataset is requested to create a DataFrame from aggregate expressions, flatMapGroupsInR and flatMapGroupsInPandas

SparkSession is requested to create a DataFrame from a BaseRelation, createDataFrame, internalCreateDataFrame, sql and table

CacheTableCommand , CreateTempViewUsing, InsertIntoDataSourceCommand and SaveIntoDataSourceCommand logical commands are executed (run)

DataSource is requested to writeAndRead (for a CreatableRelationProvider)

FrequentItems is requested to singlePassFreqItems

StatFunctions is requested to crossTabulate and summary

Spark Structured Streaming’s DataStreamReader is requested to load

Spark Structured Streaming’s DataStreamWriter is requested to start

Spark Structured Streaming’s FileStreamSource is requested to getBatch

Spark Structured Streaming’s MemoryStream is requested to toDF


Tracking Multi-Job Structured Query Execution (PySpark) —  withNewExecutionId Internal Method

withNewExecutionId[U](body: => U): U

withNewExecutionId executes the input body action under new execution id.

Note withNewExecutionId sets a unique execution id so that all Spark jobs belong to the Dataset action execution.

Note withNewExecutionId is used exclusively when Dataset is executing Python-based actions (i.e. collectToPython , collectAsArrowToPython and toPythonIterator ) that are not of much interest in this gitbook. Feel free to contact me at [email protected] if you think I should re-consider my decision.

Executing Action Under New Execution ID —  withAction Internal Method

withAction[U](name: String, qe: QueryExecution)(action: SparkPlan => U)

withAction requests QueryExecution for the optimized physical query plan and resets the

metrics of every physical operator (in the physical plan).

withAction requests SQLExecution to execute the input action with the executable

physical plan (tracked under a new execution id).

In the end, withAction notifies ExecutionListenerManager that the name action has finished
successfully or with an exception.

Note withAction uses SparkSession to access ExecutionListenerManager.

Note withAction is used when Dataset is requested for the following:

Computing the logical plan (and executing a logical command or their Union )

Dataset operators: collect, count, head and toLocalIterator

Creating Dataset Instance (For LogicalPlan and SparkSession) —  apply Internal Factory Method

apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T]

Note apply is part of Dataset Scala object that is marked as a private[sql] and so can only be accessed from code in org.apache.spark.sql package.

apply …​FIXME

Note apply is used when:

Dataset is requested to execute typed transformations and set-based typed transformations

Spark Structured Streaming’s MemoryStream is requested to toDS

Collecting All Rows From Spark Plan —  collectFromPlan Internal Method

collectFromPlan(plan: SparkPlan): Array[T]

collectFromPlan …​FIXME

Note collectFromPlan is used for Dataset.head, Dataset.collect and Dataset.collectAsList operators.

selectUntyped Internal Method

selectUntyped(columns: TypedColumn[_, _]*): Dataset[_]

selectUntyped …​FIXME

Note selectUntyped is used exclusively when Dataset.select typed transformation is used.

Helper Method for Typed Transformations —  withTypedPlan Internal Method

withTypedPlan[U: Encoder](logicalPlan: LogicalPlan): Dataset[U]

withTypedPlan …​FIXME

Note withTypedPlan is annotated with Scala’s @inline annotation that requests the Scala compiler to try especially hard to inline it.

Note withTypedPlan is used in the Dataset typed transformations, i.e. withWatermark, joinWith, hint, as, filter, limit, sample, dropDuplicates, filter, map, repartition, repartitionByRange, coalesce and sort with sortWithinPartitions (through the sortInternal internal method).

Helper Method for Set-Based Typed Transformations —  withSetOperator Internal Method

withSetOperator[U: Encoder](logicalPlan: LogicalPlan): Dataset[U]

withSetOperator …​FIXME

Note withSetOperator is annotated with Scala’s @inline annotation that requests the Scala compiler to try especially hard to inline it.

Note withSetOperator is used in the Dataset typed transformations, i.e. union, unionByName, intersect and except.

sortInternal Internal Method

sortInternal(global: Boolean, sortExprs: Seq[Column]): Dataset[T]

sortInternal creates a Dataset with Sort unary logical operator (and the logicalPlan as the

child logical plan).

val nums = Seq((0, "zero"), (1, "one")).toDF("id", "name")


// Creates a Sort logical operator:
// - descending sort direction for id column (specified explicitly)
// - name column is wrapped with ascending sort direction
val numsSorted = nums.sort('id.desc, 'name)
val logicalPlan = numsSorted.queryExecution.logical
scala> println(logicalPlan.numberedTreeString)
00 'Sort ['id DESC NULLS LAST, 'name ASC NULLS FIRST], true
01 +- Project [_1#11 AS id#14, _2#12 AS name#15]
02 +- LocalRelation [_1#11, _2#12]

Internally, sortInternal firstly builds ordering expressions for the given sortExprs columns, i.e. takes the sortExprs columns and makes sure that they are SortOrder expressions already (and leaves them untouched) or wraps them into SortOrder expressions with Ascending sort direction.

In the end, sortInternal creates a Dataset with Sort unary logical operator (with the
ordering expressions, the given global flag, and the logicalPlan as the child logical plan).

Note sortInternal is used for the sort and sortWithinPartitions typed transformations in the Dataset API (with the only change of the global flag being enabled and disabled, respectively).

Helper Method for Untyped Transformations and Basic Actions —  withPlan Internal Method

withPlan(logicalPlan: LogicalPlan): DataFrame

withPlan simply uses ofRows internal factory method to create a DataFrame for the input

LogicalPlan and the current SparkSession.

Note withPlan is annotated with Scala’s @inline annotation that requests the Scala compiler to try especially hard to inline it.

Note withPlan is used in the Dataset untyped transformations (i.e. join, crossJoin and select) and basic actions (i.e. createTempView, createOrReplaceTempView, createGlobalTempView and createOrReplaceGlobalTempView).

Further Reading and Watching


(video) Structuring Spark: DataFrames, Datasets, and Streaming

DataFrame — Dataset of Rows with RowEncoder
Spark SQL introduces a tabular functional data abstraction called DataFrame. It is designed
to ease developing Spark applications for processing large amount of structured tabular data
on Spark infrastructure.

DataFrame is a data abstraction or a domain-specific language (DSL) for working with


structured and semi-structured data, i.e. datasets that you can specify a schema for.

DataFrame is a collection of rows with a schema that is the result of executing a structured
query (once it will have been executed).

DataFrame uses the immutable, in-memory, resilient, distributed and parallel capabilities of
RDD, and applies a structure called schema to the data.

Note In Spark 2.0.0 DataFrame is a mere type alias for Dataset[Row] .

type DataFrame = Dataset[Row]

See org.apache.spark.package.scala.

DataFrame is a distributed collection of tabular data organized into rows and named columns. It is conceptually equivalent to a table in a relational database with operations to project ( select ), filter , intersect , join , group , sort , aggregate , or convert to an RDD (consult DataFrame API).

data.groupBy('Product_ID).sum('Score)

Spark SQL borrowed the concept of DataFrame from pandas' DataFrame and made it
immutable, parallel (one machine, perhaps with many processors and cores) and
distributed (many machines, perhaps with many processors and cores).

Hey, big data consultants, time to help teams migrate the code from pandas'
Note DataFrame into Spark’s DataFrames (at least to PySpark’s DataFrame) and
offer services to set up large clusters!

DataFrames in Spark SQL strongly rely on the features of RDD - it’s basically a RDD
exposed as structured DataFrame by appropriate operations to handle very big data from
the day one. So, petabytes of data should not scare you (unless you’re an administrator to
create such clustered Spark environment - contact me when you feel alone with the task).


val df = Seq(("one", 1), ("one", 1), ("two", 1))


.toDF("word", "count")

scala> df.show
+----+-----+
|word|count|
+----+-----+
| one| 1|
| one| 1|
| two| 1|
+----+-----+

val counted = df.groupBy('word).count

scala> counted.show
+----+-----+
|word|count|
+----+-----+
| two| 1|
| one| 2|
+----+-----+

You can create DataFrames by loading data from structured files (JSON, Parquet, CSV),
RDDs, tables in Hive, or external databases (JDBC). You can also create DataFrames from
scratch and build upon them (as in the above example). See DataFrame API. You can read
any format given you have appropriate Spark SQL extension of DataFrameReader to format
the dataset appropriately.

Caution FIXME Diagram of reading data from sources to create DataFrame

You can execute queries over DataFrames using two approaches:

the good ol' SQL - helps migrating from "SQL databases" world into the world of
DataFrame in Spark SQL

Query DSL - an API that helps ensuring proper syntax at compile time.

DataFrame also allows you to do the following tasks:

Filtering

DataFrames use the Catalyst query optimizer to produce efficient queries (and so they are
supposed to be faster than corresponding RDD-based queries).

Note Your DataFrames can also be type-safe and moreover further improve their performance through specialized encoders that can significantly cut serialization and deserialization times.


You can enforce types on generic rows and hence bring type safety (at compile time) by
encoding rows into type-safe Dataset object. As of Spark 2.0 it is a preferred way of
developing Spark applications.

Features of DataFrame
A DataFrame is a collection of "generic" Row instances (as RDD[Row] ) and a schema.

Note Regardless of how you create a DataFrame , it will always be a pair of RDD[Row] and StructType.

SQLContext, spark, and Spark shell


You use org.apache.spark.sql.SQLContext to build DataFrames and execute SQL queries.

The quickest and easiest way to work with Spark SQL is to use Spark shell and spark
object.

scala> spark
res1: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@60ae950f

As you may have noticed, spark in Spark shell is actually a


org.apache.spark.sql.hive.HiveContext that integrates the Spark SQL execution engine
with data stored in Apache Hive.

The Apache Hive™ data warehouse software facilitates querying and managing large
datasets residing in distributed storage.

Creating DataFrames from Scratch


Use Spark shell as described in Spark shell.

Using toDF
After you import spark.implicits._ (which is done for you by Spark shell) you may apply
toDF method to convert objects to DataFrames.

scala> val df = Seq("I am a DataFrame!").toDF("text")


df: org.apache.spark.sql.DataFrame = [text: string]

Creating DataFrame using Case Classes in Scala


This method assumes the data comes from a Scala case class that will describe the
schema.

scala> case class Person(name: String, age: Int)


defined class Person

scala> val people = Seq(Person("Jacek", 42), Person("Patryk", 19), Person("Maksym", 5)


)
people: Seq[Person] = List(Person(Jacek,42), Person(Patryk,19), Person(Maksym,5))

scala> val df = spark.createDataFrame(people)


df: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> df.show
+------+---+
| name|age|
+------+---+
| Jacek| 42|
|Patryk| 19|
|Maksym| 5|
+------+---+

Custom DataFrame Creation using createDataFrame


SQLContext offers a family of createDataFrame operations.

scala> val lines = sc.textFile("Cartier+for+WinnersCurse.csv")


lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>
:24

scala> val headers = lines.first


headers: String = auctionid,bid,bidtime,bidder,bidderrate,openbid,price

scala> import org.apache.spark.sql.types.{StructField, StringType}


import org.apache.spark.sql.types.{StructField, StringType}

scala> val fs = headers.split(",").map(f => StructField(f, StringType))


fs: Array[org.apache.spark.sql.types.StructField] = Array(StructField(auctionid,String
Type,true), StructField(bid,StringType,true), StructField(bidtime,StringType,true), St
ructField(bidder,StringType,true), StructField(bidderrate,StringType,true), StructFiel
d(openbid,StringType,true), StructField(price,StringType,true))

scala> import org.apache.spark.sql.types.StructType


import org.apache.spark.sql.types.StructType

scala> val schema = StructType(fs)


schema: org.apache.spark.sql.types.StructType = StructType(StructField(auctionid,Strin
gType,true), StructField(bid,StringType,true), StructField(bidtime,StringType,true), S
tructField(bidder,StringType,true), StructField(bidderrate,StringType,true), StructFie
ld(openbid,StringType,true), StructField(price,StringType,true))


scala> val noheaders = lines.filter(_ != headers)


noheaders: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <conso
le>:33

scala> import org.apache.spark.sql.Row


import org.apache.spark.sql.Row

scala> val rows = noheaders.map(_.split(",")).map(a => Row.fromSeq(a))


rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] at map
at <console>:35

scala> val auctions = spark.createDataFrame(rows, schema)


auctions: org.apache.spark.sql.DataFrame = [auctionid: string, bid: string, bidtime: s
tring, bidder: string, bidderrate: string, openbid: string, price: string]

scala> auctions.printSchema
root
|-- auctionid: string (nullable = true)
|-- bid: string (nullable = true)
|-- bidtime: string (nullable = true)
|-- bidder: string (nullable = true)
|-- bidderrate: string (nullable = true)
|-- openbid: string (nullable = true)
|-- price: string (nullable = true)

scala> auctions.dtypes
res28: Array[(String, String)] = Array((auctionid,StringType), (bid,StringType), (bidt
ime,StringType), (bidder,StringType), (bidderrate,StringType), (openbid,StringType), (
price,StringType))

scala> auctions.show(5)
+----------+----+-----------+-----------+----------+-------+-----+
| auctionid| bid| bidtime| bidder|bidderrate|openbid|price|
+----------+----+-----------+-----------+----------+-------+-----+
|1638843936| 500|0.478368056| kona-java| 181| 500| 1625|
|1638843936| 800|0.826388889| doc213| 60| 500| 1625|
|1638843936| 600|3.761122685| zmxu| 7| 500| 1625|
|1638843936|1500|5.226377315|carloss8055| 5| 500| 1625|
|1638843936|1600| 6.570625| jdrinaz| 6| 500| 1625|
+----------+----+-----------+-----------+----------+-------+-----+
only showing top 5 rows

Loading data from structured files

Creating DataFrame from CSV file


Let’s start with an example in which schema inference relies on a custom case class in
Scala.


scala> val lines = sc.textFile("Cartier+for+WinnersCurse.csv")


lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>
:24

scala> val header = lines.first


header: String = auctionid,bid,bidtime,bidder,bidderrate,openbid,price

scala> lines.count
res3: Long = 1349

scala> case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: Strin
g, bidderrate: Int, openbid: Float, price: Float)
defined class Auction

scala> val noheader = lines.filter(_ != header)


noheader: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[53] at filter at <consol
e>:31

scala> val auctions = noheader.map(_.split(",")).map(r => Auction(r(0), r(1).toFloat,


r(2).toFloat, r(3), r(4).toInt, r(5).toFloat, r(6).toFloat))
auctions: org.apache.spark.rdd.RDD[Auction] = MapPartitionsRDD[59] at map at <console>
:35

scala> val df = auctions.toDF


df: org.apache.spark.sql.DataFrame = [auctionid: string, bid: float, bidtime: float, b
idder: string, bidderrate: int, openbid: float, price: float]

scala> df.printSchema
root
|-- auctionid: string (nullable = true)
|-- bid: float (nullable = false)
|-- bidtime: float (nullable = false)
|-- bidder: string (nullable = true)
|-- bidderrate: integer (nullable = false)
|-- openbid: float (nullable = false)
|-- price: float (nullable = false)

scala> df.show
+----------+------+----------+-----------------+----------+-------+------+
| auctionid| bid| bidtime| bidder|bidderrate|openbid| price|
+----------+------+----------+-----------------+----------+-------+------+
|1638843936| 500.0|0.47836804| kona-java| 181| 500.0|1625.0|
|1638843936| 800.0| 0.8263889| doc213| 60| 500.0|1625.0|
|1638843936| 600.0| 3.7611227| zmxu| 7| 500.0|1625.0|
|1638843936|1500.0| 5.2263775| carloss8055| 5| 500.0|1625.0|
|1638843936|1600.0| 6.570625| jdrinaz| 6| 500.0|1625.0|
|1638843936|1550.0| 6.8929167| carloss8055| 5| 500.0|1625.0|
|1638843936|1625.0| 6.8931136| carloss8055| 5| 500.0|1625.0|
|1638844284| 225.0| 1.237419|[email protected]| 0| 200.0| 500.0|
|1638844284| 500.0| 1.2524074| njbirdmom| 33| 200.0| 500.0|
|1638844464| 300.0| 1.8111342| aprefer| 58| 300.0| 740.0|
|1638844464| 305.0| 3.2126737| 19750926o| 3| 300.0| 740.0|
|1638844464| 450.0| 4.1657987| coharley| 30| 300.0| 740.0|
|1638844464| 450.0| 6.7363195| adammurry| 5| 300.0| 740.0|
|1638844464| 500.0| 6.7364697| adammurry| 5| 300.0| 740.0|
|1638844464|505.78| 6.9881945| 19750926o| 3| 300.0| 740.0|
|1638844464| 551.0| 6.9896526| 19750926o| 3| 300.0| 740.0|
|1638844464| 570.0| 6.9931483| 19750926o| 3| 300.0| 740.0|
|1638844464| 601.0| 6.9939003| 19750926o| 3| 300.0| 740.0|
|1638844464| 610.0| 6.994965| 19750926o| 3| 300.0| 740.0|
|1638844464| 560.0| 6.9953704| ps138| 5| 300.0| 740.0|
+----------+------+----------+-----------------+----------+-------+------+
only showing top 20 rows

Creating DataFrame from CSV files using spark-csv module


You’re going to use spark-csv module to load data from a CSV data source that handles
proper parsing and loading.

Note Support for CSV data sources is available by default in Spark 2.0.0. No need for an external module.

Start the Spark shell using --packages option as follows:


➜ spark git:(master) ✗ ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
Ivy Default Cache set to: /Users/jacek/.ivy2/cache
The jars for the packages stored in: /Users/jacek/.ivy2/jars
:: loading settings :: url = jar:file:/Users/jacek/dev/oss/spark/assembly/target/scala
-2.11/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.1.jar!/org/apache/ivy/core/settings/ivys
ettings.xml
com.databricks#spark-csv_2.11 added as a dependency

scala> val df = spark.read.format("com.databricks.spark.csv").option("header", "true").load("Cartier+for+WinnersCurse.csv")
df: org.apache.spark.sql.DataFrame = [auctionid: string, bid: string, bidtime: string,
bidder: string, bidderrate: string, openbid: string, price: string]

scala> df.printSchema
root
|-- auctionid: string (nullable = true)
|-- bid: string (nullable = true)
|-- bidtime: string (nullable = true)
|-- bidder: string (nullable = true)
|-- bidderrate: string (nullable = true)
|-- openbid: string (nullable = true)
|-- price: string (nullable = true)

scala> df.show
+----------+------+-----------+-----------------+----------+-------+-----+
| auctionid| bid| bidtime| bidder|bidderrate|openbid|price|
+----------+------+-----------+-----------------+----------+-------+-----+
|1638843936| 500|0.478368056| kona-java| 181| 500| 1625|
|1638843936| 800|0.826388889| doc213| 60| 500| 1625|
|1638843936| 600|3.761122685| zmxu| 7| 500| 1625|
|1638843936| 1500|5.226377315| carloss8055| 5| 500| 1625|
|1638843936| 1600| 6.570625| jdrinaz| 6| 500| 1625|
|1638843936| 1550|6.892916667| carloss8055| 5| 500| 1625|
|1638843936| 1625|6.893113426| carloss8055| 5| 500| 1625|
|1638844284| 225|1.237418982|[email protected]| 0| 200| 500|
|1638844284| 500|1.252407407| njbirdmom| 33| 200| 500|
|1638844464| 300|1.811134259| aprefer| 58| 300| 740|
|1638844464| 305|3.212673611| 19750926o| 3| 300| 740|
|1638844464| 450|4.165798611| coharley| 30| 300| 740|
|1638844464| 450|6.736319444| adammurry| 5| 300| 740|
|1638844464| 500|6.736469907| adammurry| 5| 300| 740|
|1638844464|505.78|6.988194444| 19750926o| 3| 300| 740|
|1638844464| 551|6.989652778| 19750926o| 3| 300| 740|
|1638844464| 570|6.993148148| 19750926o| 3| 300| 740|
|1638844464| 601|6.993900463| 19750926o| 3| 300| 740|
|1638844464| 610|6.994965278| 19750926o| 3| 300| 740|
|1638844464| 560| 6.99537037| ps138| 5| 300| 740|
+----------+------+-----------+-----------------+----------+-------+-----+
only showing top 20 rows


Reading Data from External Data Sources (read method)


You can create DataFrames by loading data from structured files (JSON, Parquet, CSV),
RDDs, tables in Hive, or external databases (JDBC) using SQLContext.read method.

read: DataFrameReader

read returns a DataFrameReader instance.

Among the supported structured data (file) formats are (consult Specifying Data Format
(format method) for DataFrameReader ):

JSON

parquet

JDBC

ORC

Tables in Hive and any JDBC-compliant database

libsvm

val reader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@59e67a18

reader.parquet("file.parquet")
reader.json("file.json")
reader.format("libsvm").load("sample_libsvm_data.txt")

Querying DataFrame

Note Spark SQL offers a Pandas-like Query DSL.

Using Query DSL


You can select specific columns using select method.

Note This variant (in which you use stringified column names) can only select existing columns, i.e. you cannot create new ones using select expressions.


scala> predictions.printSchema
root
|-- id: long (nullable = false)
|-- topic: string (nullable = true)
|-- text: string (nullable = true)
|-- label: double (nullable = true)
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- features: vector (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)

scala> predictions.select("label", "words").show


+-----+-------------------+
|label| words|
+-----+-------------------+
| 1.0| [hello, math!]|
| 0.0| [hello, religion!]|
| 1.0|[hello, phy, ic, !]|
+-----+-------------------+

scala> auctions.groupBy("bidder").count().show(5)
+--------------------+-----+
| bidder|count|
+--------------------+-----+
| dennisthemenace1| 1|
| amskymom| 5|
| [email protected]| 4|
| millyjohn| 1|
|ykelectro@hotmail...| 2|
+--------------------+-----+
only showing top 5 rows

In the following example you query for the top 5 of the most active bidders.

Note the tiny $ and desc together with the column name to sort the rows by.


scala> auctions.groupBy("bidder").count().sort($"count".desc).show(5)
+------------+-----+
| bidder|count|
+------------+-----+
| lass1004| 22|
| pascal1666| 19|
| freembd| 17|
|restdynamics| 17|
| happyrova| 17|
+------------+-----+
only showing top 5 rows

scala> import org.apache.spark.sql.functions._


import org.apache.spark.sql.functions._

scala> auctions.groupBy("bidder").count().sort(desc("count")).show(5)
+------------+-----+
| bidder|count|
+------------+-----+
| lass1004| 22|
| pascal1666| 19|
| freembd| 17|
|restdynamics| 17|
| happyrova| 17|
+------------+-----+
only showing top 5 rows


scala> df.select("auctionid").distinct.count
res88: Long = 97

scala> df.groupBy("bidder").count.show
+--------------------+-----+
| bidder|count|
+--------------------+-----+
| dennisthemenace1| 1|
| amskymom| 5|
| [email protected]| 4|
| millyjohn| 1|
|ykelectro@hotmail...| 2|
| [email protected]| 1|
| rrolex| 1|
| bupper99| 2|
| cheddaboy| 2|
| adcc007| 1|
| varvara_b| 1|
| yokarine| 4|
| steven1328| 1|
| anjara| 2|
| roysco| 1|
|lennonjasonmia@ne...| 2|
|northwestportland...| 4|
| bosspad| 10|
| 31strawberry| 6|
| nana-tyler| 11|
+--------------------+-----+
only showing top 20 rows

Using SQL
Register a DataFrame as a named temporary table to run SQL.

scala> df.registerTempTable("auctions") (1)

scala> val sql = spark.sql("SELECT count(*) AS count FROM auctions")


sql: org.apache.spark.sql.DataFrame = [count: bigint]

1. Register a temporary table so SQL queries make sense

You can execute a SQL query on a DataFrame using sql operation, but before the query is
executed it is optimized by Catalyst query optimizer. You can print the physical plan for a
DataFrame using the explain operation.


scala> sql.explain
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[
count#148L])
TungstenExchange SinglePartition
TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], outp
ut=[currentCount#156L])
TungstenProject
Scan PhysicalRDD[auctionid#49,bid#50,bidtime#51,bidder#52,bidderrate#53,openbid#54
,price#55]

scala> sql.show
+-----+
|count|
+-----+
| 1348|
+-----+

scala> val count = sql.collect()(0).getLong(0)


count: Long = 1348

Filtering

scala> df.show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
| bbb| 200| 0.53|
| bbb| 300| 0.42|
+----+---------+-----+

scala> df.filter($"name".like("a%")).show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
+----+---------+-----+

Handling data in Avro format


Use custom serializer using spark-avro.

Run Spark shell with --packages com.databricks:spark-avro_2.11:2.0.0 (the 2.0.0 artifact is not in any public Maven repository, which is why --repositories is required).


./bin/spark-shell --packages com.databricks:spark-avro_2.11:2.0.0 --repositories "http://dl.bintray.com/databricks/maven"

And then…​

val fileRdd = sc.textFile("README.md")


val df = fileRdd.toDF

import org.apache.spark.sql.SaveMode

val outputF = "test.avro"


df.write.mode(SaveMode.Append).format("com.databricks.spark.avro").save(outputF)

See org.apache.spark.sql.SaveMode (and perhaps org.apache.spark.sql.SaveMode from


Scala’s perspective).

val df = spark.read.format("com.databricks.spark.avro").load("test.avro")

Example Datasets
eBay online auctions

SFPD Crime Incident Reporting system


Row
Row is a generic row object with an ordered collection of fields that can be accessed by an

ordinal / an index (aka generic access by ordinal), a name (aka native primitive access) or
using Scala’s pattern matching.

Note Row is also called Catalyst Row.

Row may have an optional schema.

The traits of Row :

length or size - Row knows the number of elements (columns).

schema - Row knows the schema

Row belongs to the org.apache.spark.sql package.

import org.apache.spark.sql.Row

Creating Row —  apply Factory Method

Caution FIXME

Field Access by Index —  apply and get methods


Fields of a Row instance can be accessed by index (starting from 0 ) using apply or
get .

scala> val row = Row(1, "hello")


row: org.apache.spark.sql.Row = [1,hello]

scala> row(1)
res0: Any = hello

scala> row.get(1)
res1: Any = hello

Note Generic access by ordinal (using apply or get ) returns a value of type Any .

Get Field As Type —  getAs method


You can query for fields with their proper types using getAs with an index

val row = Row(1, "hello")

scala> row.getAs[Int](0)
res1: Int = 1

scala> row.getAs[String](1)
res2: String = hello

Note FIXME row.getAs[String](null)

Schema
A Row instance can have a schema defined.

Note Unless you are instantiating Row yourself (using Row Object), a Row always has a schema.

Note It is RowEncoder that takes care of assigning a schema to a Row when toDF is called on a Dataset or when instantiating a DataFrame through DataFrameReader.
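A minimal illustration (assuming a spark-shell session with spark.implicits._ in scope):

import org.apache.spark.sql.Row

// A Row created by hand has no schema
val bare = Row(1, "hello")
assert(bare.schema == null)

// A Row coming from a DataFrame carries the DataFrame's schema
val fromDF = Seq((1, "hello")).toDF("id", "text").head
assert(fromDF.schema.fieldNames.toSeq == Seq("id", "text"))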

Row Object
Row companion object offers factory methods to create Row instances from a collection of

elements ( apply ), a sequence of elements ( fromSeq ) and tuples ( fromTuple ).

scala> Row(1, "hello")


res0: org.apache.spark.sql.Row = [1,hello]

scala> Row.fromSeq(Seq(1, "hello"))


res1: org.apache.spark.sql.Row = [1,hello]

scala> Row.fromTuple((0, "hello"))


res2: org.apache.spark.sql.Row = [0,hello]

Row object can merge Row instances.

scala> Row.merge(Row(1), Row("hello"))


res3: org.apache.spark.sql.Row = [1,hello]


It can also return an empty Row instance.

scala> Row.empty == Row()


res4: Boolean = true

Pattern Matching on Row


Row can be used in pattern matching (since Row Object comes with unapplySeq ).

scala> Row.unapplySeq(Row(1, "hello"))


res5: Some[Seq[Any]] = Some(WrappedArray(1, hello))

Row(1, "hello") match { case Row(key: Int, value: String) =>


key -> value
}

DataSource API — Managing Datasets in External Data Sources
Reading Datasets
Spark SQL can read data from external storage systems like files, Hive tables and JDBC
databases through DataFrameReader interface.

You use SparkSession to access DataFrameReader using read operation.

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate

val reader = spark.read

DataFrameReader is an interface to create DataFrames (aka Dataset[Row] ) from files, Hive

tables or tables using JDBC.

val people = reader.csv("people.csv")


val cities = reader.format("json").load("cities.json")

As of Spark 2.0, DataFrameReader can read text files using textFile methods that return
Dataset[String] (not DataFrames ).

spark.read.textFile("README.md")

You can also define your own custom file formats.

val countries = reader.format("customFormat").load("countries.cf")

There are two operation modes in Spark SQL, i.e. batch and streaming (part of Spark
Structured Streaming).

You can access DataStreamReader for reading streaming datasets through


SparkSession.readStream method.

import org.apache.spark.sql.streaming.DataStreamReader
val stream: DataStreamReader = spark.readStream

The available methods in DataStreamReader are similar to DataFrameReader .


Saving Datasets
Spark SQL can save data to external storage systems like files, Hive tables and JDBC
databases through DataFrameWriter interface.

You use write method on a Dataset to access DataFrameWriter .

import org.apache.spark.sql.{DataFrameWriter, Dataset}


val ints: Dataset[Int] = (0 to 5).toDS

val writer: DataFrameWriter[Int] = ints.write

DataFrameWriter is an interface to persist a Dataset to an external storage system in a batch fashion.

You can access DataStreamWriter for writing streaming datasets through


Dataset.writeStream method.

val papers = spark.readStream.text("papers").as[String]

import org.apache.spark.sql.streaming.DataStreamWriter
val writer: DataStreamWriter[String] = papers.writeStream

The available methods in DataStreamWriter are similar to DataFrameWriter .

DataFrameReader — Loading Data From External Data Sources
DataFrameReader is the public interface to describe how to load data from an external data

source (e.g. files, tables, JDBC or Dataset[String]).

Table 1. DataFrameReader API

csv
  csv(csvDataset: Dataset[String]): DataFrame
  csv(path: String): DataFrame
  csv(paths: String*): DataFrame

format
  format(source: String): DataFrameReader

jdbc
  jdbc(
    url: String,
    table: String,
    predicates: Array[String],
    connectionProperties: Properties): DataFrame
  jdbc(
    url: String,
    table: String,
    properties: Properties): DataFrame
  jdbc(
    url: String,
    table: String,
    columnName: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    connectionProperties: Properties): DataFrame

json
  json(jsonDataset: Dataset[String]): DataFrame
  json(path: String): DataFrame
  json(paths: String*): DataFrame

load
  load(): DataFrame
  load(path: String): DataFrame
  load(paths: String*): DataFrame

option
  option(key: String, value: Boolean): DataFrameReader
  option(key: String, value: Double): DataFrameReader
  option(key: String, value: Long): DataFrameReader
  option(key: String, value: String): DataFrameReader

options
  options(options: scala.collection.Map[String, String]): DataFrameReader
  options(options: java.util.Map[String, String]): DataFrameReader

orc
  orc(path: String): DataFrame
  orc(paths: String*): DataFrame

parquet
  parquet(path: String): DataFrame
  parquet(paths: String*): DataFrame

schema
  schema(schemaString: String): DataFrameReader
  schema(schema: StructType): DataFrameReader

table
  table(tableName: String): DataFrame

text
  text(path: String): DataFrame
  text(paths: String*): DataFrame

textFile
  textFile(path: String): Dataset[String]
  textFile(paths: String*): Dataset[String]

DataFrameReader is available using SparkSession.read.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

import org.apache.spark.sql.DataFrameReader
val reader: DataFrameReader = spark.read

DataFrameReader supports many file formats natively and offers the interface to define

custom formats.

Note DataFrameReader assumes the parquet data source file format by default, which you can change using the spark.sql.sources.default Spark property.

After you have described the loading pipeline (i.e. the "Extract" part of ETL in Spark SQL),
you eventually "trigger" the loading using format-agnostic load or format-specific (e.g. json,
csv, jdbc) operators.


import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

import org.apache.spark.sql.DataFrame

// Using format-agnostic load operator


val csvs: DataFrame = spark
.read
.format("csv")
.option("header", true)
.option("inferSchema", true)
.load("*.csv")

// Using format-specific load operator


val jsons: DataFrame = spark
.read
.json("metrics/*.json")

Note All methods of DataFrameReader merely describe a process of loading data and do not trigger a Spark job (until an action is called).

DataFrameReader can read text files using textFile methods that return typed Datasets .

import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

import org.apache.spark.sql.Dataset
val lines: Dataset[String] = spark
.read
.textFile("README.md")

Note Loading datasets using textFile methods allows for additional preprocessing before final processing of the string values as json or csv lines.

(New in Spark 2.2) DataFrameReader can load datasets from Dataset[String] (with lines
being complete "files") using format-specific csv and json operators.


val csvLine = "0,Warsaw,Poland"

import org.apache.spark.sql.Dataset
val cities: Dataset[String] = Seq(csvLine).toDS
scala> cities.show
+---------------+
| value|
+---------------+
|0,Warsaw,Poland|
+---------------+

// Define schema explicitly (as below)
// or
// option("header", true) + option("inferSchema", true)
import org.apache.spark.sql.types.StructType
val schema = new StructType()
.add($"id".long.copy(nullable = false))
.add($"city".string)
.add($"country".string)
scala> schema.printTreeString
root
|-- id: long (nullable = false)
|-- city: string (nullable = true)
|-- country: string (nullable = true)

import org.apache.spark.sql.DataFrame
val citiesDF: DataFrame = spark
.read
.schema(schema)
.csv(cities)
scala> citiesDF.show
+---+------+-------+
| id| city|country|
+---+------+-------+
| 0|Warsaw| Poland|
+---+------+-------+


Table 2. DataFrameReader’s Internal Properties (e.g. Registries, Counters and Flags)

extraOptions
  Used when…FIXME

source
  Name of the input data source (aka format or provider) with the default format per the
  spark.sql.sources.default configuration property (default: parquet).
  source can be changed using the format method.
  Used exclusively when DataFrameReader is requested to load.

userSpecifiedSchema
  Optional user-specified schema (default: None, i.e. undefined).
  Set when DataFrameReader is requested to set a schema, load data from an external data
  source, loadV1Source (when creating a DataSource), and load data using the json and csv
  file formats.
  Used when DataFrameReader is requested to assertNoSpecifiedSchema (while loading data
  using jdbc, table and textFile).

Specifying Format Of Input Data Source — format method

format(source: String): DataFrameReader

You use format to configure DataFrameReader to use the appropriate source format.

Supported data formats:

json

csv (since 2.0.0)

parquet (see Parquet)

orc

text

jdbc

libsvm  — only when used in format("libsvm")


Note Spark SQL allows for developing custom data source formats.
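A minimal sketch of using format with the format-agnostic load operator (the paths, the provider class name and the option below are hypothetical):

import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

// Alias of a built-in data source
val parquets = spark
  .read
  .format("parquet")
  .load("dir-with-parquet-files")   // hypothetical path

// Fully-qualified class name of a custom data source
val custom = spark
  .read
  .format("org.example.datasource.CustomSource")  // hypothetical provider class
  .option("custom.option", "value")               // hypothetical option
  .load()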

Specifying Schema —  schema method

schema(schema: StructType): DataFrameReader

schema allows for specifying the schema of a data source (that the DataFrameReader is about to read a dataset from).

import org.apache.spark.sql.types.StructType
val schema = new StructType()
.add($"id".long.copy(nullable = false))
.add($"city".string)
.add($"country".string)
scala> schema.printTreeString
root
|-- id: long (nullable = false)
|-- city: string (nullable = true)
|-- country: string (nullable = true)

import org.apache.spark.sql.DataFrameReader
val r: DataFrameReader = spark.read.schema(schema)
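The schema(schemaString: String) variant (listed in Table 1) accepts a DDL-formatted string. A minimal sketch with the same column names as above:

// Schema given as a DDL-formatted string
val rFromDDL = spark.read.schema("id LONG, city STRING, country STRING")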

Note: Some formats can infer schema from datasets (e.g. csv or json) using the inferSchema option.

Tip Read up on Schema.

Specifying Load Options — option and options Methods

option(key: String, value: String): DataFrameReader
option(key: String, value: Boolean): DataFrameReader
option(key: String, value: Long): DataFrameReader
option(key: String, value: Double): DataFrameReader

You can also use the options method to describe different options in a single Map.

options(options: scala.collection.Map[String, String]): DataFrameReader
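A sketch showing that option and options describe the same load options (using the standard header and inferSchema CSV options):

val withOption = spark.read
  .option("header", true)
  .option("inferSchema", true)

val withOptions = spark.read
  .options(Map("header" -> "true", "inferSchema" -> "true"))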


Loading Datasets from Files (into DataFrames) Using Format-Specific Load Operators

DataFrameReader supports the following file formats:

JSON

CSV

parquet

ORC

text

json method

json(path: String): DataFrame
json(paths: String*): DataFrame
json(jsonDataset: Dataset[String]): DataFrame
json(jsonRDD: RDD[String]): DataFrame

New in 2.0.0: prefersDecimal

csv method

csv(path: String): DataFrame
csv(paths: String*): DataFrame
csv(csvDataset: Dataset[String]): DataFrame

parquet method

parquet(path: String): DataFrame
parquet(paths: String*): DataFrame

The supported options:

compression (default: snappy )

New in 2.0.0: snappy is the default Parquet codec. See [SPARK-14482][SQL] Change
default Parquet codec from gzip to snappy.

The compressions supported:


none or uncompressed

snappy - the default codec in Spark 2.0.0.

gzip - the default codec in Spark before 2.0.0

lzo

val tokens = Seq("hello", "henry", "and", "harry")
  .zipWithIndex
  .map(_.swap)
  .toDF("id", "token")

val parquetWriter = tokens.write
parquetWriter.option("compression", "none").save("hello-none")

// The exception is mostly for my learning purposes
// so I know where and how to find the trace to the compressions
// Sorry...
scala> parquetWriter.option("compression", "unsupported").save("hello-unsupported")
java.lang.IllegalArgumentException: Codec [unsupported] is not available. Available co
decs are uncompressed, gzip, lzo, snappy, none.
at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions.<init>(ParquetO
ptions.scala:43)
at org.apache.spark.sql.execution.datasources.parquet.DefaultSource.prepareWrite(Par
quetRelation.scala:77)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$ru
n$1$$anonfun$4.apply(InsertIntoHadoopFsRelation.scala:122)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$ru
n$1$$anonfun$4.apply(InsertIntoHadoopFsRelation.scala:122)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.driverSideSetup(Wr
iterContainer.scala:103)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$ru
n$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:141)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$ru
n$1.apply(InsertIntoHadoopFsRelation.scala:116)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$ru
n$1.apply(InsertIntoHadoopFsRelation.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scal
a:53)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertI
ntoHadoopFsRelation.scala:116)
at org.apache.spark.sql.execution.command.ExecutedCommand.sideEffectResult$lzycomput
e(commands.scala:61)
at org.apache.spark.sql.execution.command.ExecutedCommand.sideEffectResult(commands.
scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommand.doExecute(commands.scala:73
)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:
118)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:
118)


at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.
scala:137)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:134)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:117)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.sca
la:65)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:65)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:390)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)
... 48 elided

orc method

orc(path: String): DataFrame
orc(paths: String*): DataFrame

Optimized Row Columnar (ORC) file format is a highly efficient columnar format to store
Hive data with more than 1,000 columns and improve performance. ORC format was
introduced in Hive version 0.11 to use and retain the type information from the table
definition.

Tip Read ORC Files document to learn about the ORC file format.
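A minimal sketch of loading ORC files (the path is hypothetical):

// Format-specific operator
val hits = spark.read.orc("hits.orc")

// Equivalent using the format-agnostic load operator
val hitsViaLoad = spark.read.format("orc").load("hits.orc")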

text method

text method loads a text file.

text(path: String): DataFrame
text(paths: String*): DataFrame

Example


val lines: Dataset[String] = spark.read.text("README.md").as[String]

scala> lines.show
+--------------------+
| value|
+--------------------+
| # Apache Spark|
| |
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Spark Streami...|
| |
|<http://spark.apa...|
| |
| |
|## Online Documen...|
| |
|You can find the ...|
|guide, on the [pr...|
|and [project wiki...|
|This README file ...|
| |
| ## Building Spark|
+--------------------+
only showing top 20 rows

Loading Table to DataFrame —  table Method

table(tableName: String): DataFrame

table loads the content of the tableName table into an untyped DataFrame.

scala> spark.catalog.tableExists("t1")
res1: Boolean = true

// t1 exists in the catalog
// let's load it
val t1 = spark.read.table("t1")

Note: table simply passes the call to SparkSession.table after making sure that a user-defined schema has not been specified.


Loading Data From External Table using JDBC Data Source — jdbc Method

jdbc(url: String, table: String, properties: Properties): DataFrame
jdbc(
  url: String,
  table: String,
  predicates: Array[String],
  connectionProperties: Properties): DataFrame
jdbc(
  url: String,
  table: String,
  columnName: String,
  lowerBound: Long,
  upperBound: Long,
  numPartitions: Int,
  connectionProperties: Properties): DataFrame

jdbc loads data from an external table using the JDBC data source.

Internally, jdbc creates a JDBCOptions from the input url , table and extraOptions
with connectionProperties .

jdbc then creates one JDBCPartition per predicates .

In the end, jdbc requests the SparkSession to create a DataFrame for a JDBCRelation
(with JDBCPartitions and JDBCOptions created earlier).
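A sketch of loading a table over JDBC (the URL, table name, credentials and driver below are all hypothetical):

import java.util.Properties
val props = new Properties()
props.put("user", "dbuser")      // hypothetical
props.put("password", "secret")  // hypothetical
props.put("driver", "org.postgresql.Driver")

val people = spark.read.jdbc(
  url = "jdbc:postgresql://localhost:5432/mydb",
  table = "people",
  properties = props)

// Partitioned variant: one JDBCPartition per range of the id column
val peoplePartitioned = spark.read.jdbc(
  url = "jdbc:postgresql://localhost:5432/mydb",
  table = "people",
  columnName = "id",
  lowerBound = 0L,
  upperBound = 1000L,
  numPartitions = 4,
  connectionProperties = props)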

Note: jdbc does not support a custom schema and throws an AnalysisException if defined:

User specified schema not supported with `[jdbc]`

Note: jdbc method uses java.util.Properties (and appears overly Java-centric). Use format("jdbc") instead.

Tip: Review the exercise Creating DataFrames from Tables using JDBC and PostgreSQL.

Loading Datasets From Text Files —  textFile Method

textFile(path: String): Dataset[String]
textFile(paths: String*): Dataset[String]


textFile loads one or many text files into a typed Dataset[String].

import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

import org.apache.spark.sql.Dataset
val lines: Dataset[String] = spark
.read
.textFile("README.md")

Note: textFile methods are similar to the text family of methods in that they both read text files, but the text methods return an untyped DataFrame while textFile methods return a typed Dataset[String].

Internally, textFile passes calls on to text method and selects the only value column
before it applies Encoders.STRING encoder.

Creating DataFrameReader Instance


DataFrameReader takes the following when created:

SparkSession

Loading Dataset (Data Source API V1) — loadV1Source Internal Method

loadV1Source(paths: String*): DataFrame

loadV1Source creates a DataSource and requests it to resolve the underlying relation (as a BaseRelation).

In the end, loadV1Source requests SparkSession to create a DataFrame from the BaseRelation.

Note: loadV1Source is used when DataFrameReader is requested to load (and the data source is not of DataSourceV2 type or a DataSourceReader could not be created).

"Loading" Data As DataFrame —  load Method

load(): DataFrame
load(path: String): DataFrame
load(paths: String*): DataFrame


load loads a dataset from a data source (with optional support for multiple paths) as an untyped DataFrame.

Internally, load looks up the data source (lookupDataSource) for the source. load then branches off per its type (i.e. whether it is of DataSourceV2 marker type or not).

For a "Data Source V2" data source, load …FIXME

Otherwise, if the source is not a "Data Source V2" data source, load simply calls loadV1Source.

load throws an AnalysisException when the source format is hive:

Hive data source can only be used with tables, you can not read files of Hive data source directly.

assertNoSpecifiedSchema Internal Method

assertNoSpecifiedSchema(operation: String): Unit

assertNoSpecifiedSchema throws an AnalysisException if the userSpecifiedSchema is defined:

User specified schema not supported with `[operation]`

Note: assertNoSpecifiedSchema is used when DataFrameReader is requested to load data using jdbc, table and textFile.

verifyColumnNameOfCorruptRecord Internal Method

verifyColumnNameOfCorruptRecord(
schema: StructType,
columnNameOfCorruptRecord: String): Unit

verifyColumnNameOfCorruptRecord …​FIXME

Note: verifyColumnNameOfCorruptRecord is used when DataFrameReader is requested to load data using json and csv.


DataFrameWriter — Saving Data To External Data Sources

DataFrameWriter is the interface to describe how data (as the result of executing a structured query) should be saved to an external data source.

Table 1. DataFrameWriter API / Writing Operators

bucketBy
  bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter[T]

csv
  csv(path: String): Unit

format
  format(source: String): DataFrameWriter[T]

insertInto
  insertInto(tableName: String): Unit
  Inserts (the results of) a DataFrame into a table

jdbc
  jdbc(url: String, table: String, connectionProperties: Properties): Unit

json
  json(path: String): Unit

mode
  mode(saveMode: SaveMode): DataFrameWriter[T]
  mode(saveMode: String): DataFrameWriter[T]

option
  option(key: String, value: String): DataFrameWriter[T]
  option(key: String, value: Boolean): DataFrameWriter[T]
  option(key: String, value: Long): DataFrameWriter[T]
  option(key: String, value: Double): DataFrameWriter[T]

options
  options(options: scala.collection.Map[String, String]): DataFrameWriter[T]

orc
  orc(path: String): Unit

parquet
  parquet(path: String): Unit

partitionBy
  partitionBy(colNames: String*): DataFrameWriter[T]

save
  save(): Unit
  save(path: String): Unit
  Saves a DataFrame (i.e. writes the result of executing a structured query) to the data source

saveAsTable
  saveAsTable(tableName: String): Unit

sortBy
  sortBy(colName: String, colNames: String*): DataFrameWriter[T]

text
  text(path: String): Unit

DataFrameWriter is available using Dataset.write operator.

scala> :type df
org.apache.spark.sql.DataFrame

val writer = df.write

scala> :type writer
org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row]

DataFrameWriter supports many file formats and JDBC databases. It also allows for plugging in new formats.

DataFrameWriter defaults to the parquet data source format. You can change the default format using the spark.sql.sources.default configuration property, the format method, or the format-specific methods.


// see above for writer definition

// Save dataset in Parquet format
writer.save(path = "nums")

// Save dataset in JSON format
writer.format("json").save(path = "nums-json")

// Alternatively, use the format-specific method
writer.json(path = "nums-json")

In the end, you trigger the actual saving of the content of a Dataset (i.e. the result of
executing a structured query) using save method.

writer.save

DataFrameWriter uses internal mutable attributes to build a properly-defined "write

specification" for insertInto, save and saveAsTable methods.

Table 2. Internal Attributes and Corresponding Setters

Attribute            Setters
source               format
mode                 mode
extraOptions         option, options, save
partitioningColumns  partitionBy
bucketColumnNames    bucketBy
numBuckets           bucketBy
sortColumnNames      sortBy
Note: DataFrameWriter is a type constructor in Scala that keeps an internal reference to the source DataFrame for the whole lifecycle (starting right from the moment it was created).

Note: Spark Structured Streaming’s DataStreamWriter is responsible for writing the content of streaming Datasets in a streaming fashion.

Executing Logical Command(s) — runCommand Internal Method

runCommand(session: SparkSession, name: String)(command: LogicalPlan): Unit

runCommand uses the input SparkSession to access the SessionState that is in turn requested to execute the logical command (that simply creates a QueryExecution).

runCommand records the current time (start time) and uses the SQLExecution helper object to execute the action (under a new execution id) that simply requests the QueryExecution for the RDD[InternalRow] (and triggers execution of logical commands).

Tip: Use web UI’s SQL tab to see the execution or a SparkListener to be notified when the execution is started and finished. The SparkListener should intercept SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd events.

runCommand records the current time (end time).

In the end, runCommand uses the input SparkSession to access the ExecutionListenerManager and requests it to onSuccess (with the input name, the QueryExecution and the duration).

In case of any exceptions, runCommand requests the ExecutionListenerManager to onFailure (with the exception) and (re)throws it.

Note: runCommand is used when DataFrameWriter is requested to save the rows of a structured query (a DataFrame) to a data source (and indirectly executing a logical command for writing to a data source V1), insert the rows of a structured query (a DataFrame) into a table, and create a table (that is used exclusively for saveAsTable).

Saving Rows of Structured Query (DataFrame) to Table — saveAsTable Method

saveAsTable(tableName: String): Unit
// PRIVATE API
saveAsTable(tableIdent: TableIdentifier): Unit

saveAsTable saves the content of a DataFrame to the tableName table.


val ids = spark.range(5)
ids.write.
  option("path", "/tmp/five_ids").
  saveAsTable("five_ids")

// Check out if saveAsTable as five_ids was successful
val q = spark.catalog.listTables.filter($"name" === "five_ids")
scala> q.show
+--------+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+--------+--------+-----------+---------+-----------+
|five_ids| default| null| EXTERNAL| false|
+--------+--------+-----------+---------+-----------+

Internally, saveAsTable requests the current ParserInterface to parse the input table
name.

Note: saveAsTable uses the internal DataFrame to access the SparkSession that is used to access the SessionState and in the end the ParserInterface.

saveAsTable then requests the SessionCatalog to check whether the table exists or not.

Note: saveAsTable uses the internal DataFrame to access the SparkSession that is used to access the SessionState and in the end the SessionCatalog.

In the end, saveAsTable branches off per whether the table exists or not and the save
mode.

Table 3. saveAsTable’s Behaviour per Save Mode

Does table exist?  Save Mode      Behaviour
yes                Ignore         Does nothing
yes                ErrorIfExists  Reports an AnalysisException with "Table [tableIdent] already exists." error message
yes                Overwrite      FIXME
anything           anything       createTable

Saving Rows of Structured Query (DataFrame) to Data Source — save Method

save(): Unit


save saves the rows of a structured query (a Dataset) to a data source.

Internally, save uses DataSource to look up the class of the requested data source (for the
source option and the SQLConf).

Note: save uses SparkSession to access the SessionState that is in turn used to access the SQLConf.

  val df: DataFrame = ???
  df.sparkSession.sessionState.conf

If the class is a DataSourceV2…FIXME

Otherwise, if not a DataSourceV2, save simply calls saveToV1Source.

save does not support saving to Hive (i.e. the source is hive) and throws an AnalysisException when requested so:

Hive data source can only be used with tables, you can not write files of Hive data source directly.

save does not support bucketing (i.e. when the numBuckets or sortColumnNames options are defined) and throws an AnalysisException when requested so:

'[operation]' does not support bucketing right now

Saving Data to Table Using JDBC Data Source — jdbc Method

jdbc(url: String, table: String, connectionProperties: Properties): Unit

jdbc method saves the content of the DataFrame to an external database table via JDBC.

You can use mode to control save mode, i.e. what happens when an external table exists
when save is executed.

It is assumed that the jdbc save pipeline is neither partitioned nor bucketed.

All options are overridden by the input connectionProperties.

The required options are:

driver which is the class name of the JDBC driver (that is passed to Spark’s own DriverRegistry.register and later used to connect(url, properties)).


When the table exists and the Overwrite save mode is in use, DROP TABLE table is executed.

It creates the input table (using CREATE TABLE table (schema) where schema is the
schema of the DataFrame ).
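A sketch of saving a DataFrame over JDBC (the URL, table name, credentials and driver below are all hypothetical):

import java.util.Properties
val props = new Properties()
props.put("user", "dbuser")      // hypothetical
props.put("password", "secret")  // hypothetical
props.put("driver", "org.postgresql.Driver")

spark.range(5)
  .write
  .mode("append")
  .jdbc("jdbc:postgresql://localhost:5432/mydb", "people", props)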

bucketBy Method

bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter[T]

bucketBy simply sets the internal numBuckets and bucketColumnNames to the input numBuckets and colName with colNames, respectively.

val df = spark.range(5)
import org.apache.spark.sql.DataFrameWriter
val writer: DataFrameWriter[java.lang.Long] = df.write

val bucketedTable = writer.bucketBy(numBuckets = 8, "col1", "col2")

scala> :type bucketedTable
org.apache.spark.sql.DataFrameWriter[Long]

partitionBy Method

partitionBy(colNames: String*): DataFrameWriter[T]

Caution FIXME

Specifying Save Mode —  mode Method

mode(saveMode: String): DataFrameWriter[T]
mode(saveMode: SaveMode): DataFrameWriter[T]

mode defines the behaviour of save when an external file or table (Spark writes to) already exists, i.e. SaveMode.


Table 4. Types of SaveMode

Name           Description
Append         Records are appended to existing data.
ErrorIfExists  Exception is thrown.
Ignore         Do not save the records and do not change the existing data in any way.
Overwrite      Existing data is overwritten by new records.
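A minimal sketch of specifying the save mode (both forms are equivalent; the path is hypothetical):

import org.apache.spark.sql.SaveMode

spark.range(5).write.mode(SaveMode.Overwrite).save("nums")
spark.range(5).write.mode("overwrite").save("nums")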

Specifying Sorting Columns —  sortBy Method

sortBy(colName: String, colNames: String*): DataFrameWriter[T]

sortBy simply sets sorting columns to the input colName and colNames column names.

Note: sortBy must be used together with bucketBy or DataFrameWriter reports an IllegalArgumentException.

Note assertNotBucketed asserts that bucketing is not used by some methods.
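A sketch of sortBy used together with bucketBy (the table name is hypothetical; bucketed writes go through saveAsTable):

spark.range(100)
  .write
  .bucketBy(4, "id")
  .sortBy("id")
  .mode("overwrite")
  .saveAsTable("bucketed_sorted_ids")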

Specifying Writer Configuration —  option Method

option(key: String, value: Boolean): DataFrameWriter[T]
option(key: String, value: Double): DataFrameWriter[T]
option(key: String, value: Long): DataFrameWriter[T]
option(key: String, value: String): DataFrameWriter[T]

option …​FIXME

Specifying Writer Configuration —  options Method

options(options: scala.collection.Map[String, String]): DataFrameWriter[T]

options …​FIXME

Writing DataFrames to Files

Caution FIXME


Specifying Data Source (by Alias or Fully-Qualified Class Name) — format Method

format(source: String): DataFrameWriter[T]

format simply sets the source internal property.

Parquet

Caution FIXME

Note Parquet is the default data source format.

Inserting Rows of Structured Query (DataFrame) into Table — insertInto Method

insertInto(tableName: String): Unit (1)
insertInto(tableIdent: TableIdentifier): Unit

1. Parses tableName and calls the other insertInto with a TableIdentifier

insertInto inserts the content of the DataFrame to the specified tableName table.

Note: insertInto ignores column names and just uses a position-based resolution, i.e. the order (not the names!) of the columns in (the output of) the Dataset matters.

Internally, insertInto creates an InsertIntoTable logical operator (with UnresolvedRelation operator as the only child) and executes it right away (that submits a Spark job).

224
DataFrameWriter — Saving Data To External Data Sources

Figure 1. DataFrameWriter.insertInto Executes SQL Command (as a Spark job)

insertInto reports an AnalysisException for bucketed DataFrames, i.e. buckets or sortColumnNames are defined:

'insertInto' does not support bucketing right now

val writeSpec = spark.range(4).
  write.
  bucketBy(numBuckets = 3, colName = "id")
scala> writeSpec.insertInto("t1")
org.apache.spark.sql.AnalysisException: 'insertInto' does not support bucketing right
now;
at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:334)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:302)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:298)
... 49 elided

insertInto reports an AnalysisException for partitioned DataFrames, i.e. partitioningColumns is defined:

insertInto() can't be used together with partitionBy(). Partition columns have already been defined for the table. It is not necessary to use partitionBy().


val writeSpec = spark.range(4).
  write.
  partitionBy("id")
scala> writeSpec.insertInto("t1")
org.apache.spark.sql.AnalysisException: insertInto() can't be used together with parti
tionBy(). Partition columns have already be defined for the table. It is not necessary
to use partitionBy().;
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:305)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:298)
... 49 elided

getBucketSpec Internal Method

getBucketSpec: Option[BucketSpec]

getBucketSpec returns a new BucketSpec if numBuckets was defined (with bucketColumnNames and sortColumnNames).

getBucketSpec throws an IllegalArgumentException when numBuckets is not defined but sortColumnNames is:

sortBy must be used together with bucketBy

Note: getBucketSpec is used exclusively when DataFrameWriter is requested to create a table.

Creating Table —  createTable Internal Method

createTable(tableIdent: TableIdentifier): Unit

createTable builds a CatalogStorageFormat per extraOptions.

createTable assumes CatalogTableType.EXTERNAL when the location URI of CatalogStorageFormat is defined and CatalogTableType.MANAGED otherwise.

createTable creates a CatalogTable (with the bucketSpec per getBucketSpec).

In the end, createTable creates a CreateTable logical command (with the CatalogTable ,
mode and the logical query plan of the dataset) and runs it.

Note createTable is used when DataFrameWriter is requested to saveAsTable.


assertNotBucketed Internal Method

assertNotBucketed(operation: String): Unit

assertNotBucketed simply throws an AnalysisException if either numBuckets or sortColumnNames internal property is defined:

'[operation]' does not support bucketing right now

Note: assertNotBucketed is used when DataFrameWriter is requested to save, insertInto and jdbc.

Executing Logical Command for Writing to Data Source V1 — saveToV1Source Internal Method

saveToV1Source(): Unit

saveToV1Source creates a DataSource (for the source class name, the partitioningColumns and the extraOptions) and requests it for the logical command for writing (with the mode and the analyzed logical plan of the structured query).

Note: While requesting the analyzed logical plan of the structured query, saveToV1Source triggers execution of logical commands.

In the end, saveToV1Source runs the logical command for writing.

Note: The logical command for writing can be one of the following:

A SaveIntoDataSourceCommand for CreatableRelationProviders

An InsertIntoHadoopFsRelationCommand for FileFormats

Note: saveToV1Source is used exclusively when DataFrameWriter is requested to save the rows of a structured query (a DataFrame) to a data source (for all but DataSourceV2 writers with WriteSupport).

assertNotPartitioned Internal Method

assertNotPartitioned(operation: String): Unit

assertNotPartitioned …​FIXME


Note assertNotPartitioned is used when…​FIXME

csv Method

csv(path: String): Unit

csv …​FIXME
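Although the format-specific save operators are not described in detail yet, here is a minimal sketch of their common usage pattern (the paths are hypothetical):

val nums = spark.range(5)

// Format-specific save operator
nums.write.option("header", true).csv("nums-csv")

// The other methods (json, orc, parquet, text) follow the same pattern, e.g.
nums.write.json("nums-json")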

json Method

json(path: String): Unit

json …​FIXME

orc Method

orc(path: String): Unit

orc …​FIXME

parquet Method

parquet(path: String): Unit

parquet …​FIXME

text Method

text(path: String): Unit

text …​FIXME

partitionBy Method

partitionBy(colNames: String*): DataFrameWriter[T]


partitionBy simply sets the partitioningColumns internal property.
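A minimal sketch of a partitioned write (the column and path are hypothetical; each distinct value of the partition column becomes a separate directory):

import org.apache.spark.sql.functions.col

spark.range(10)
  .withColumn("bucket", col("id") % 2)
  .write
  .partitionBy("bucket")
  .parquet("ids-by-bucket")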


Dataset API — Dataset Operators


Dataset API is a set of operators with typed and untyped transformations, and actions to
work with a structured query (as a Dataset) as a whole.

Table 1. Dataset Operators (Transformations and Actions)


Operator Description

agg(aggExpr: (String, String), aggExprs: (String,


agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame
agg

An untyped transformation

alias(alias: String): Dataset[T]


alias(alias: Symbol): Dataset[T]
alias

A typed transformation that is a mere synonym of as.

apply(colName: String): Column

apply
An untyped transformation to select a column based on the colum
Dataset onto a Column )

as(alias: String): Dataset[T]


as(alias: Symbol): Dataset[T]
as

A typed transformation

as[U : Encoder]: Dataset[U]

as A typed transformation to enforce a type, i.e. marking the records


given data type (data type conversion). as simply changes the v
passed into typed operations (e.g. map) and does not eagerly pro
that are not present in the specified class.

cache(): this.type
cache
A basic action that is a mere synonym of persist.


checkpoint(): Dataset[T]
checkpoint checkpoint(eager: Boolean): Dataset[T]

A basic action to checkpoint the Dataset in a reliable way (using


compliant file system, e.g. Hadoop HDFS or Amazon S3)

coalesce(numPartitions: Int): Dataset[T]


coalesce
A typed transformation to repartition a Dataset

col(colName: String): Column


col
An untyped transformation to create a column (reference) based o

collect(): Array[T]
collect
An action

colRegex(colName: String): Column

colRegex
An untyped transformation to create a column (reference) based o
specified as a regex

columns: Array[String]
columns
A basic action

count(): Long
count
An action to count the number of rows

createGlobalTempView(viewName: String): Unit


createGlobalTempView
A basic action

createOrReplaceGlobalTempView(viewName: String): Unit


createOrReplaceGlobalTempView
A basic action


createOrReplaceTempView(viewName: String): Unit


createOrReplaceTempView

A basic action

createTempView(viewName: String): Unit


createTempView
A basic action

crossJoin(right: Dataset[_]): DataFrame


crossJoin
An untyped transformation

cube(cols: Column*): RelationalGroupedDataset


cube(col1: String, cols: String*): RelationalGroupedDataset
cube

An untyped transformation

describe(cols: String*): DataFrame


describe
An action

distinct(): Dataset[T]

distinct
A typed transformation that is a mere synonym of dropDuplicates
the Dataset )

drop(colName: String): DataFrame


drop(colNames: String*): DataFrame
drop(col: Column): DataFrame
drop

An untyped transformation

dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Array[String]): Dataset[T
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates dropDuplicates(col1: String, cols: String*): Dataset

A typed transformation

dtypes: Array[(String, String)]

dtypes


A basic action

except(other: Dataset[T]): Dataset[T]


except
A typed transformation

exceptAll(other: Dataset[T]): Dataset[T]


exceptAll
(New in 2.4.4) A typed transformation

explain(): Unit
explain(extended: Boolean): Unit

explain
A basic action to display the logical and physical plans of the
logical and physical plans (with optional cost and codegen summa
output

filter(condition: Column): Dataset[T]


filter(conditionExpr: String): Dataset[T]
filter(func: T => Boolean): Dataset[T]
filter

A typed transformation

first(): T
first
An action that is a mere synonym of head

flatMap[U : Encoder](func: T => TraversableOnce[U]):


flatMap
A typed transformation

foreach(f: T => Unit): Unit


foreach
An action

foreachPartition(f: Iterator[T] => Unit): Unit


foreachPartition
An action


groupBy groupBy(cols: Column*): RelationalGroupedDataset


groupBy(col1: String, cols: String*): RelationalGroupedDataset

An untyped transformation

groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset


groupByKey
A typed transformation

head(): T (1)
head(n: Int): Array[T]

head
1. Uses 1 for n
An action

hint(name: String, parameters: Any*): Dataset[T]


hint
A basic action to specify a hint (and optional parameters)

inputFiles: Array[String]
inputFiles
A basic action

intersect(other: Dataset[T]): Dataset[T]


intersect
A typed transformation

intersectAll(other: Dataset[T]): Dataset[T]


intersectAll
(New in 2.4.4) A typed transformation

isEmpty: Boolean
isEmpty
(New in 2.4.4) A basic action

isLocal: Boolean
isLocal
A basic action


isStreaming isStreaming: Boolean

join(right: Dataset[_]): DataFrame


join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]):
join(right: Dataset[_], usingColumns: Seq[String], joinType:
join join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType:

An untyped transformation

joinWith[U](other: Dataset[U], condition: Column):


joinWith[U](other: Dataset[U], condition: Column, joinType:
joinWith

A typed transformation

limit(n: Int): Dataset[T]


limit
A typed transformation

localCheckpoint(): Dataset[T]
localCheckpoint(eager: Boolean): Dataset[T]
localCheckpoint

A basic action to checkpoint the Dataset locally on executors (an

map[U: Encoder](func: T => U): Dataset[U]


map
A typed transformation

mapPartitions[U : Encoder](func: Iterator[T] => Iterator


mapPartitions
A typed transformation

na: DataFrameNaFunctions
na
An untyped transformation

orderBy(sortExprs: Column*): Dataset[T]


orderBy(sortCol: String, sortCols: String*): Dataset
orderBy

A typed transformation


persist(): this.type
persist(newLevel: StorageLevel): this.type

A basic action to persist the Dataset


persist
Although its category persist is not an action in the c
means executing anything in a Spark cluster (i.e. execu
Note
on executors). It acts only as a marker to perform Data
an action is really executed.

printSchema(): Unit
printSchema
A basic action

randomSplit(weights: Array[Double]): Array[Dataset


randomSplit(weights: Array[Double], seed: Long): Array
randomSplit

A typed transformation to split a Dataset randomly into two

rdd: RDD[T]
rdd
A basic action

reduce(func: (T, T) => T): T


reduce
An action to reduce the records of the Dataset using the specifie

repartition(partitionExprs: Column*): Dataset[T]


repartition(numPartitions: Int): Dataset[T]
repartition(numPartitions: Int, partitionExprs: Column
repartition

A typed transformation to repartition a Dataset

repartitionByRange(partitionExprs: Column*): Dataset


repartitionByRange(numPartitions: Int, partitionExprs:
repartitionByRange

A typed transformation

rollup(cols: Column*): RelationalGroupedDataset


rollup(col1: String, cols: String*): RelationalGroupedDataset
rollup

An untyped transformation


sample(withReplacement: Boolean, fraction: Double):


sample(withReplacement: Boolean, fraction: Double, seed:
sample(fraction: Double): Dataset[T]
sample sample(fraction: Double, seed: Long): Dataset[T]

A typed transformation

schema: StructType
schema
A basic action

select(cols: Column*): DataFrame


select(col: String, cols: String*): DataFrame

select[U1](c1: TypedColumn[T, U1]): Dataset[U1]


select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn
select[U1, U2, U3](
c1: TypedColumn[T, U1],
c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]
select[U1, U2, U3, U4](
c1: TypedColumn[T, U1],
select c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3],
c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4
select[U1, U2, U3, U4, U5](
c1: TypedColumn[T, U1],
c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3],
c4: TypedColumn[T, U4],
c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4

An (untyped and typed) transformation

selectExpr(exprs: String*): DataFrame


selectExpr
An untyped transformation

show(): Unit
show(truncate: Boolean): Unit
show(numRows: Int): Unit
show(numRows: Int, truncate: Boolean): Unit
show show(numRows: Int, truncate: Int): Unit
show(numRows: Int, truncate: Int, vertical: Boolean

An action

sort(sortExprs: Column*): Dataset[T]


sort(sortCol: String, sortCols: String*): Dataset[

sort


A typed transformation to sort elements globally (across partitions


sortWithinPartitions transformation for partition-local sort

sortWithinPartitions(sortExprs: Column*): Dataset[


sortWithinPartitions(sortCol: String, sortCols: String
sortWithinPartitions
A typed transformation to sort elements within partitions (aka
transformation for global sort (across partitions)

stat: DataFrameStatFunctions
stat
An untyped transformation

storageLevel: StorageLevel
storageLevel
A basic action

summary(statistics: String*): DataFrame

summary
An action to calculate statistics (e.g. count , mean ,
50% , 75% percentiles)

take(n: Int): Array[T]


take
An action to take the first records of a Dataset

toDF(): DataFrame
toDF(colNames: String*): DataFrame
toDF

A basic action to convert a Dataset to a DataFrame

toJSON: Dataset[String]
toJSON
A typed transformation

toLocalIterator(): java.util.Iterator[T]

toLocalIterator
An action that returns an iterator with all rows in the
as much memory as the largest partition in the Dataset


transform[U](t: Dataset[T] => Dataset[U]): Dataset


transform

A typed transformation for chaining custom transformations

union(other: Dataset[T]): Dataset[T]


union
A typed transformation

unionByName(other: Dataset[T]): Dataset[T]


unionByName
A typed transformation

unpersist(): this.type (1)


unpersist(blocking: Boolean): this.type

unpersist
1. Uses unpersist with blocking disabled ( false
A basic action to unpersist the Dataset

where(condition: Column): Dataset[T]


where(conditionExpr: String): Dataset[T]
where

A typed transformation

withColumn(colName: String, col: Column): DataFrame


withColumn
An untyped transformation

withColumnRenamed(existingName: String, newName: String


withColumnRenamed
An untyped transformation

write: DataFrameWriter[T]

write
A basic action that returns a DataFrameWriter for saving the conte
streaming) Dataset out to an external storage


Dataset API — Typed Transformations


Typed transformations are part of the Dataset API for transforming a Dataset with an
Encoder (except the RowEncoder).

Note: Typed transformations are the methods in the Dataset Scala class that are grouped in the typedrel group name, i.e. @group typedrel.

Table 1. Dataset API’s Typed Transformations


Transformation Description

alias(alias: String): Dataset[T]


alias alias(alias: Symbol): Dataset[T]

as(alias: String): Dataset[T]


as as(alias: Symbol): Dataset[T]

as as[U : Encoder]: Dataset[U]

Repartitions a Dataset
coalesce
coalesce(numPartitions: Int): Dataset[T]

distinct distinct(): Dataset[T]

dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Array[String]): Dataset[T]
dropDuplicates dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]

except except(other: Dataset[T]): Dataset[T]

filter(condition: Column): Dataset[T]


filter filter(conditionExpr: String): Dataset[T]
filter(func: T => Boolean): Dataset[T]


flatMap flatMap[U : Encoder](func: T => TraversableOnce[U]): Dataset[U]

groupByKey groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K,

intersect intersect(other: Dataset[T]): Dataset[T]

joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U


joinWith joinWith[U](other: Dataset[U], condition: Column, joinType: String

limit limit(n: Int): Dataset[T]

map map[U: Encoder](func: T => U): Dataset[U]

mapPartitions mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset

orderBy(sortExprs: Column*): Dataset[T]


orderBy orderBy(sortCol: String, sortCols: String*): Dataset[T]

randomSplit(weights: Array[Double]): Array[Dataset[T]]


randomSplit randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T

repartition(partitionExprs: Column*): Dataset[T]


repartition repartition(numPartitions: Int): Dataset[T]
repartition(numPartitions: Int, partitionExprs: Column*): Dataset

repartitionByRange(partitionExprs: Column*): Dataset[T]


repartitionByRange repartitionByRange(numPartitions: Int, partitionExprs: Column*):

sample(withReplacement: Boolean, fraction: Double): Dataset[T]


sample(withReplacement: Boolean, fraction: Double, seed: Long):
sample sample(fraction: Double): Dataset[T]
sample(fraction: Double, seed: Long): Dataset[T]


select[U1](c1: TypedColumn[T, U1]): Dataset[U1]


select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]):
select[U1, U2, U3](
c1: TypedColumn[T, U1],
c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]
select[U1, U2, U3, U4](
c1: TypedColumn[T, U1],
select c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3],
c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]
select[U1, U2, U3, U4, U5](
c1: TypedColumn[T, U1],
c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3],
c4: TypedColumn[T, U4],
c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]

sort(sortExprs: Column*): Dataset[T]


sort sort(sortCol: String, sortCols: String*): Dataset[T]

sortWithinPartitions(sortExprs: Column*): Dataset[T]


sortWithinPartitions sortWithinPartitions(sortCol: String, sortCols: String*): Dataset

toJSON toJSON: Dataset[String]

transform transform[U](t: Dataset[T] => Dataset[U]): Dataset[U]

union union(other: Dataset[T]): Dataset[T]

unionByName unionByName(other: Dataset[T]): Dataset[T]

where(condition: Column): Dataset[T]


where where(conditionExpr: String): Dataset[T]

as Typed Transformation

as(alias: String): Dataset[T]
as(alias: Symbol): Dataset[T]


as …​FIXME

Enforcing Type —  as Typed Transformation

as[U: Encoder]: Dataset[U]

as[T] allows for converting from a weakly-typed Dataset of Rows to Dataset[T] with T being a domain class (that can enforce a stronger schema).

// Create DataFrame of pairs
val df = Seq("hello", "world!").zipWithIndex.map(_.swap).toDF("id", "token")

scala> df.printSchema
root
 |-- id: integer (nullable = false)
 |-- token: string (nullable = true)

scala> val ds = df.as[(Int, String)]
ds: org.apache.spark.sql.Dataset[(Int, String)] = [id: int, token: string]

// It's more helpful to have a case class for the conversion
final case class MyRecord(id: Int, token: String)

scala> val myRecords = df.as[MyRecord]
myRecords: org.apache.spark.sql.Dataset[MyRecord] = [id: int, token: string]

Repartitioning Dataset with Shuffle Disabled — coalesce Typed Transformation

coalesce(numPartitions: Int): Dataset[T]

coalesce operator repartitions the Dataset to exactly numPartitions partitions.

Internally, coalesce creates a Repartition logical operator with shuffle disabled (which
is marked as false in the below explain 's output).


scala> spark.range(5).coalesce(1).explain(extended = true)
== Parsed Logical Plan ==
Repartition 1, false
+- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint
Repartition 1, false
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Repartition 1, false
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
Coalesce 1
+- *Range (0, 5, step=1, splits=Some(8))

dropDuplicates Typed Transformation

dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Array[String]): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]

dropDuplicates …​FIXME

except Typed Transformation

except(other: Dataset[T]): Dataset[T]

except …​FIXME

exceptAll Typed Transformation

exceptAll(other: Dataset[T]): Dataset[T]

exceptAll …​FIXME

filter Typed Transformation


filter(condition: Column): Dataset[T]
filter(conditionExpr: String): Dataset[T]
filter(func: T => Boolean): Dataset[T]

filter …​FIXME

Creating Zero or More Records — flatMap Typed Transformation

flatMap[U: Encoder](func: T => TraversableOnce[U]): Dataset[U]

flatMap returns a new Dataset (of type U) with all records (of type T) mapped over using the function func and then flattening the results.

Note flatMap can create new records. It deprecated explode .

final case class Sentence(id: Long, text: String)
val sentences = Seq(Sentence(0, "hello world"), Sentence(1, "witaj swiecie")).toDS

scala> sentences.flatMap(s => s.text.split("\\s+")).show
+-------+
| value|
+-------+
| hello|
| world|
| witaj|
|swiecie|
+-------+

Internally, flatMap calls mapPartitions with the partitions flatMap(ped) .

intersect Typed Transformation

intersect(other: Dataset[T]): Dataset[T]

intersect …​FIXME

intersectAll Typed Transformation

intersectAll(other: Dataset[T]): Dataset[T]


intersectAll …​FIXME

joinWith Typed Transformation

joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]


joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]

joinWith …​FIXME

limit Typed Transformation

limit(n: Int): Dataset[T]

limit …​FIXME

map Typed Transformation

map[U : Encoder](func: T => U): Dataset[U]

map …​FIXME

mapPartitions Typed Transformation

mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U]

mapPartitions …​FIXME

Randomly Split Dataset Into Two or More Datasets Per Weight — randomSplit Typed Transformation

randomSplit(weights: Array[Double]): Array[Dataset[T]]


randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]

randomSplit randomly splits the Dataset per weights .

weights doubles should sum up to 1 and will be normalized if they do not.

You can define seed and if you don’t, a random seed will be used.


Note: randomSplit is commonly used in Spark MLlib to split an input Dataset into two datasets for training and validation.

val ds = spark.range(10)
scala> ds.randomSplit(Array[Double](2, 3)).foreach(_.show)
+---+
| id|
+---+
| 0|
| 1|
| 2|
+---+

+---+
| id|
+---+
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+

Repartitioning Dataset (Shuffle Enabled) — repartition Typed Transformation

repartition(partitionExprs: Column*): Dataset[T]


repartition(numPartitions: Int): Dataset[T]
repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]

repartition operators repartition the Dataset to exactly numPartitions partitions or using partitionExprs expressions.

Internally, repartition creates a Repartition or RepartitionByExpression logical operator with shuffle enabled (which is true in the below explain's output beside Repartition).


scala> spark.range(5).repartition(1).explain(extended = true)
== Parsed Logical Plan ==
Repartition 1, true
+- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint
Repartition 1, true
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Repartition 1, true
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *Range (0, 5, step=1, splits=Some(8))

Note: repartition methods correspond to SQL’s DISTRIBUTE BY or CLUSTER BY clauses.

repartitionByRange Typed Transformation

repartitionByRange(partitionExprs: Column*): Dataset[T] (1)
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]

1. Uses spark.sql.shuffle.partitions configuration property for the number of partitions to use

repartitionByRange simply creates a Dataset with a RepartitionByExpression logical operator.


scala> spark.version
res1: String = 2.3.1

val q = spark.range(10).repartitionByRange(numPartitions = 5, $"id")


scala> println(q.queryExecution.logical.numberedTreeString)
00 'RepartitionByExpression ['id ASC NULLS FIRST], 5
01 +- AnalysisBarrier
02 +- Range (0, 10, step=1, splits=Some(8))

scala> println(q.queryExecution.toRdd.getNumPartitions)
5

scala> println(q.queryExecution.toRdd.toDebugString)
(5) ShuffledRowRDD[18] at toRdd at <console>:26 []
+-(8) MapPartitionsRDD[17] at toRdd at <console>:26 []
| MapPartitionsRDD[13] at toRdd at <console>:26 []
| MapPartitionsRDD[12] at toRdd at <console>:26 []
| ParallelCollectionRDD[11] at toRdd at <console>:26 []

repartitionByRange uses a SortOrder with the Ascending sort order, i.e. ascending nulls first, when no explicit sort order is specified.

repartitionByRange throws an IllegalArgumentException when no partitionExprs partition-by expression is specified:

requirement failed: At least one partition-by expression must be specified.

sample Typed Transformation

sample(withReplacement: Boolean, fraction: Double): Dataset[T]


sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]
sample(fraction: Double): Dataset[T]
sample(fraction: Double, seed: Long): Dataset[T]

sample …​FIXME

select Typed Transformation


select[U1](c1: TypedColumn[T, U1]): Dataset[U1]


select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]
select[U1, U2, U3](
c1: TypedColumn[T, U1],
c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]
select[U1, U2, U3, U4](
c1: TypedColumn[T, U1],
c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3],
c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]
select[U1, U2, U3, U4, U5](
c1: TypedColumn[T, U1],
c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3],
c4: TypedColumn[T, U4],
c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]

select …​FIXME

sort Typed Transformation

sort(sortExprs: Column*): Dataset[T]


sort(sortCol: String, sortCols: String*): Dataset[T]

sort …​FIXME

sortWithinPartitions Typed Transformation

sortWithinPartitions(sortExprs: Column*): Dataset[T]


sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]

sortWithinPartitions simply calls the internal sortInternal method with the global flag disabled (false).
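A minimal sketch contrasting the partition-local sort with the global sort operator:

import org.apache.spark.sql.functions.col

// Sorts rows within each partition only (no shuffle)
val locallySorted = spark.range(10)
  .repartition(2)
  .sortWithinPartitions(col("id").desc)

// Compare with sort, which orders rows across all partitions
val globallySorted = spark.range(10).sort(col("id").desc)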

toJSON Typed Transformation

toJSON: Dataset[String]

toJSON maps the content of Dataset to a Dataset of strings in JSON format.


scala> val ds = Seq("hello", "world", "foo bar").toDS


ds: org.apache.spark.sql.Dataset[String] = [value: string]

scala> ds.toJSON.show
+-------------------+
| value|
+-------------------+
| {"value":"hello"}|
| {"value":"world"}|
|{"value":"foo bar"}|
+-------------------+

Internally, toJSON grabs the RDD[InternalRow] (of the QueryExecution of the Dataset ) and
maps the records (per RDD partition) into JSON.

Note toJSON uses Jackson’s JSON parser — jackson-module-scala.

Transforming Datasets — transform Typed Transformation

transform[U](t: Dataset[T] => Dataset[U]): Dataset[U]

transform applies t function to the source Dataset[T] to produce a result Dataset[U] . It

is for chaining custom transformations.

val dataset = spark.range(5)

// Transformation t
import org.apache.spark.sql.Dataset
def withDoubled(longs: Dataset[java.lang.Long]) = longs.withColumn("doubled", 'id * 2)

scala> dataset.transform(withDoubled).show
+---+-------+
| id|doubled|
+---+-------+
| 0| 0|
| 1| 2|
| 2| 4|
| 3| 6|
| 4| 8|
+---+-------+

Internally, transform executes t function on the current Dataset[T] .

union Typed Transformation


union(other: Dataset[T]): Dataset[T]

union …​FIXME

unionByName Typed Transformation

unionByName(other: Dataset[T]): Dataset[T]

unionByName creates a new Dataset that is a union of the rows in this and the other Datasets column-wise, i.e. the order of columns in the Datasets does not matter as long as their names and number match.

val left = spark.range(1).withColumn("rand", rand()).select("id", "rand")
val right = Seq(("0.1", 11)).toDF("rand", "id")
val q = left.unionByName(right)
scala> q.show
+---+-------------------+
| id| rand|
+---+-------------------+
| 0|0.14747380134150134|
| 11| 0.1|
+---+-------------------+

Internally, unionByName creates a Union logical operator for this Dataset and Project logical
operator with the other Dataset.

In the end, unionByName applies the CombineUnions logical optimization to the Union
logical operator and requests the result LogicalPlan to wrap the child operators with
AnalysisBarriers.

scala> println(q.queryExecution.logical.numberedTreeString)
00 'Union
01 :- AnalysisBarrier
02 : +- Project [id#90L, rand#92]
03 : +- Project [id#90L, rand(-9144575865446031058) AS rand#92]
04 : +- Range (0, 1, step=1, splits=Some(8))
05 +- AnalysisBarrier
06 +- Project [id#103, rand#102]
07 +- Project [_1#99 AS rand#102, _2#100 AS id#103]
08 +- LocalRelation [_1#99, _2#100]

unionByName throws an AnalysisException if there are duplicate columns in either Dataset.


Found duplicate column(s)

unionByName throws an AnalysisException if this Dataset has a column that is not available in the other Dataset:

Cannot resolve column name "[name]" among ([rightNames])

where Typed Transformation

where(condition: Column): Dataset[T]


where(conditionExpr: String): Dataset[T]

where is simply a synonym of the filter operator, i.e. it passes the input parameters along to filter.

Creating Streaming Dataset with EventTimeWatermark Logical Operator — withWatermark Streaming Typed Transformation

withWatermark(eventTime: String, delayThreshold: String): Dataset[T]

Internally, withWatermark creates a Dataset with EventTimeWatermark logical plan for streaming Datasets.

Note: withWatermark uses the EliminateEventTimeWatermark logical rule to eliminate the EventTimeWatermark logical plan for non-streaming batch Datasets.


// Create a batch dataset


val events = spark.range(0, 50, 10).
withColumn("timestamp", from_unixtime(unix_timestamp - 'id)).
select('timestamp, 'id as "count")
scala> events.show
+-------------------+-----+
| timestamp|count|
+-------------------+-----+
|2017-06-25 21:21:14| 0|
|2017-06-25 21:21:04| 10|
|2017-06-25 21:20:54| 20|
|2017-06-25 21:20:44| 30|
|2017-06-25 21:20:34| 40|
+-------------------+-----+

// the dataset is a non-streaming batch one...


scala> events.isStreaming
res1: Boolean = false

// ...so EventTimeWatermark is not included in the logical plan


val watermarked = events.
withWatermark(eventTime = "timestamp", delayThreshold = "20 seconds")
scala> println(watermarked.queryExecution.logical.numberedTreeString)
00 Project [timestamp#284, id#281L AS count#288L]
01 +- Project [id#281L, from_unixtime((unix_timestamp(current_timestamp(), yyyy-MM-dd
HH:mm:ss, Some(America/Chicago)) - id#281L), yyyy-MM-dd HH:mm:ss, Some(America/Chicago
)) AS timestamp#284]
02 +- Range (0, 50, step=10, splits=Some(8))

// Let's create a streaming Dataset


import org.apache.spark.sql.types.StructType
val schema = new StructType().
add($"timestamp".timestamp).
add($"count".long)
scala> schema.printTreeString
root
|-- timestamp: timestamp (nullable = true)
|-- count: long (nullable = true)

val events = spark.


readStream.
schema(schema).
csv("events").
withWatermark(eventTime = "timestamp", delayThreshold = "20 seconds")
scala> println(events.queryExecution.logical.numberedTreeString)
00 'EventTimeWatermark 'timestamp, interval 20 seconds
01 +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@75abcdd4,csv,List
(),Some(StructType(StructField(timestamp,TimestampType,true), StructField(count,LongTy
pe,true))),List(),None,Map(path -> events),None), FileSource[events], [timestamp#329,
count#330L]


Note delayThreshold is parsed using CalendarInterval.fromString with the interval formatted as described in the TimeWindow unary expression, e.g. 0 years 0 months 1 week 0 days 0 hours 1 minute 20 seconds 0 milliseconds 0 microseconds

Note delayThreshold must not be negative (and the milliseconds and months parts should both be equal or greater than 0).

Note withWatermark is used when…​FIXME


Dataset API — Untyped Transformations


Untyped transformations are part of the Dataset API for transforming a Dataset to a
DataFrame, a Column, a RelationalGroupedDataset, a DataFrameNaFunctions or a
DataFrameStatFunctions (and hence untyped).

Note Untyped transformations are the methods in the Dataset Scala class that are grouped in the untypedrel group, i.e. @group untypedrel.

Table 1. Dataset API’s Untyped Transformations

agg
  agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
  agg(expr: Column, exprs: Column*): DataFrame
  agg(exprs: Map[String, String]): DataFrame

apply
  apply(colName: String): Column
  Selects a column based on the column name (i.e. maps a Dataset onto a Column)

col
  col(colName: String): Column
  Selects a column based on the column name (i.e. maps a Dataset onto a Column)

colRegex
  colRegex(colName: String): Column
  Selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column)

crossJoin
  crossJoin(right: Dataset[_]): DataFrame

cube
  cube(cols: Column*): RelationalGroupedDataset
  cube(col1: String, cols: String*): RelationalGroupedDataset

drop
  drop(colName: String): DataFrame
  drop(colNames: String*): DataFrame
  drop(col: Column): DataFrame

groupBy
  groupBy(cols: Column*): RelationalGroupedDataset
  groupBy(col1: String, cols: String*): RelationalGroupedDataset

join
  join(right: Dataset[_]): DataFrame
  join(right: Dataset[_], usingColumn: String): DataFrame
  join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
  join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
  join(right: Dataset[_], joinExprs: Column): DataFrame
  join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

na
  na: DataFrameNaFunctions

rollup
  rollup(cols: Column*): RelationalGroupedDataset
  rollup(col1: String, cols: String*): RelationalGroupedDataset

select
  select(cols: Column*): DataFrame
  select(col: String, cols: String*): DataFrame

selectExpr
  selectExpr(exprs: String*): DataFrame

stat
  stat: DataFrameStatFunctions

withColumn
  withColumn(colName: String, col: Column): DataFrame

withColumnRenamed
  withColumnRenamed(existingName: String, newName: String): DataFrame

agg Untyped Transformation

agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame


agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame


agg …​FIXME

apply Untyped Transformation

apply(colName: String): Column

apply selects a column based on the column name (i.e. maps a Dataset onto a Column ).

col Untyped Transformation

col(colName: String): Column

col selects a column based on the column name (i.e. maps a Dataset onto a Column ).

Internally, col branches off per the input column name.

If the column name is * (a star), col simply creates a Column with ResolvedStar
expression (with the schema output attributes of the analyzed logical plan of the
QueryExecution).

Otherwise, col uses colRegex untyped transformation when


spark.sql.parser.quotedRegexColumnNames configuration property is enabled.

In the case when the column name is not * and


spark.sql.parser.quotedRegexColumnNames configuration property is disabled, col
creates a Column with the column name resolved (as a NamedExpression).

colRegex Untyped Transformation

colRegex(colName: String): Column

colRegex selects a column based on the column name specified as a regex (i.e. maps a

Dataset onto a Column ).

Note colRegex is used in col when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled (and the column name is not * ).

Internally, colRegex matches the input column name to different regular expressions (in the
order):

1. For column names with quotes without a qualifier, colRegex simply creates a Column
with a UnresolvedRegex (with no table)


2. For column names with quotes with a qualifier, colRegex simply creates a Column with
a UnresolvedRegex (with a table specified)

3. For other column names, colRegex (behaves like col and) creates a Column with the
column name resolved (as a NamedExpression)

crossJoin Untyped Transformation

crossJoin(right: Dataset[_]): DataFrame

crossJoin …​FIXME

cube Untyped Transformation

cube(cols: Column*): RelationalGroupedDataset


cube(col1: String, cols: String*): RelationalGroupedDataset

cube …​FIXME

Dropping One or More Columns —  drop Untyped Transformation

drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame

drop …​FIXME

groupBy Untyped Transformation

groupBy(cols: Column*): RelationalGroupedDataset


groupBy(col1: String, cols: String*): RelationalGroupedDataset

groupBy …​FIXME

join Untyped Transformation


join(right: Dataset[_]): DataFrame


join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

join …​FIXME

na Untyped Transformation

na: DataFrameNaFunctions

na simply creates a DataFrameNaFunctions to work with missing data.

rollup Untyped Transformation

rollup(cols: Column*): RelationalGroupedDataset


rollup(col1: String, cols: String*): RelationalGroupedDataset

rollup …​FIXME

select Untyped Transformation

select(cols: Column*): DataFrame


select(col: String, cols: String*): DataFrame

select …​FIXME

Projecting Columns using SQL Statements —  selectExpr Untyped Transformation

selectExpr(exprs: String*): DataFrame

selectExpr is like select , but accepts SQL statements.


val ds = spark.range(5)

scala> ds.selectExpr("rand() as random").show


16/04/14 23:16:06 INFO HiveSqlParser: Parsing command: rand() as random
+-------------------+
| random|
+-------------------+
| 0.887675894185651|
|0.36766085091074086|
| 0.2700020856675186|
| 0.1489033635529543|
| 0.5862990791950973|
+-------------------+

Internally, it executes select with every expression in exprs mapped to Column (using
SparkSqlParser.parseExpression).

scala> ds.select(expr("rand() as random")).show


+------------------+
| random|
+------------------+
|0.5514319279894851|
|0.2876221510433741|
|0.4599999092045741|
|0.5708558868374893|
|0.6223314406247136|
+------------------+

stat Untyped Transformation

stat: DataFrameStatFunctions

stat simply creates a DataFrameStatFunctions to work with statistic functions.

withColumn Untyped Transformation

withColumn(colName: String, col: Column): DataFrame

withColumn …​FIXME

withColumnRenamed Untyped Transformation


withColumnRenamed(existingName: String, newName: String): DataFrame

withColumnRenamed …​FIXME
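A short sketch of renaming a column (assuming spark.implicits._ is in scope):

val df = Seq((0, "zero")).toDF("id", "txt")
df.withColumnRenamed("txt", "text").printSchema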


Dataset API — Basic Actions


Basic actions are a group of operators (methods) of the Dataset API for transforming a
Dataset into a session-scoped or global temporary view and other basic actions (FIXME).

Note Basic actions are the methods in the Dataset Scala class that are grouped in the basic group, i.e. @group basic.

Table 1. Dataset API’s Basic Actions

cache
  cache(): this.type
  Marks the Dataset to be persisted (cached); actually a synonym of the persist basic action

checkpoint
  checkpoint(): Dataset[T]
  checkpoint(eager: Boolean): Dataset[T]
  Checkpoints the Dataset in a reliable way (using a reliable HDFS-compliant file system, e.g. Hadoop HDFS or Amazon S3)

columns
  columns: Array[String]

createGlobalTempView
  createGlobalTempView(viewName: String): Unit

createOrReplaceGlobalTempView
  createOrReplaceGlobalTempView(viewName: String): Unit

createOrReplaceTempView
  createOrReplaceTempView(viewName: String): Unit

createTempView
  createTempView(viewName: String): Unit

dtypes
  dtypes: Array[(String, String)]

explain
  explain(): Unit
  explain(extended: Boolean): Unit
  Displays the logical and physical plans of the Dataset (with optional cost and codegen summaries) to the standard output

hint
  hint(name: String, parameters: Any*): Dataset[T]

inputFiles
  inputFiles: Array[String]

isEmpty
  isEmpty: Boolean
  (New in 2.4.4)

isLocal
  isLocal: Boolean

localCheckpoint
  localCheckpoint(): Dataset[T]
  localCheckpoint(eager: Boolean): Dataset[T]
  Checkpoints the Dataset locally on executors (and therefore unreliably)

persist
  persist(): this.type (1)
  persist(newLevel: StorageLevel): this.type
  1. Assumes the default storage level MEMORY_AND_DISK
  Marks the Dataset to be persisted the next time an action is executed. Internally, persist simply requests the CacheManager to cache the structured query.
  Note persist uses the CacheManager from the SharedState associated with the SparkSession (of the Dataset).

printSchema
  printSchema(): Unit

rdd
  rdd: RDD[T]

schema
  schema: StructType

storageLevel
  storageLevel: StorageLevel

toDF
  toDF(): DataFrame
  toDF(colNames: String*): DataFrame

unpersist
  unpersist(): this.type
  unpersist(blocking: Boolean): this.type
  Unpersists the Dataset

write
  write: DataFrameWriter[T]
  Returns a DataFrameWriter for saving the content of the (non-streaming) Dataset out to an external storage

Reliably Checkpointing Dataset —  checkpoint Basic Action

checkpoint(): Dataset[T] (1)
checkpoint(eager: Boolean): Dataset[T] (2)

1. eager and reliableCheckpoint flags enabled
2. reliableCheckpoint flag enabled

Note checkpoint is an experimental operator and the API is evolving towards becoming stable.

checkpoint simply requests the Dataset to checkpoint with the given eager flag and the

reliableCheckpoint flag enabled.
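A minimal sketch of reliable checkpointing (the checkpoint directory below is just an example path):

// Reliable checkpointing requires a checkpoint directory
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
val nums = spark.range(5)
val checkpointed = nums.checkpoint() // eager checkpointing by default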

createTempView Basic Action


createTempView(viewName: String): Unit

createTempView …​FIXME

Note createTempView is used when…​FIXME

createOrReplaceTempView Basic Action

createOrReplaceTempView(viewName: String): Unit

createOrReplaceTempView …​FIXME

Note createOrReplaceTempView is used when…​FIXME
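A minimal sketch of registering and querying a session-scoped temporary view:

val nums = spark.range(3)
nums.createOrReplaceTempView("nums")
spark.sql("SELECT id FROM nums WHERE id > 0").show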

createGlobalTempView Basic Action

createGlobalTempView(viewName: String): Unit

createGlobalTempView …​FIXME

Note createGlobalTempView is used when…​FIXME

createOrReplaceGlobalTempView Basic Action

createOrReplaceGlobalTempView(viewName: String): Unit

createOrReplaceGlobalTempView …​FIXME

Note createOrReplaceGlobalTempView is used when…​FIXME

createTempViewCommand Internal Method

createTempViewCommand(
viewName: String,
replace: Boolean,
global: Boolean): CreateViewCommand

createTempViewCommand …​FIXME


Note createTempViewCommand is used when the following Dataset operators are used: Dataset.createTempView, Dataset.createOrReplaceTempView, Dataset.createGlobalTempView and Dataset.createOrReplaceGlobalTempView.

Displaying Logical and Physical Plans, Their Cost and Codegen —  explain Basic Action

explain(): Unit (1)
explain(extended: Boolean): Unit

1. Calls the other explain with the extended flag disabled

explain prints the physical plan and, with the extended flag enabled, the parsed, analyzed and optimized logical plans as well (optionally with cost and codegen information) to the console.

Tip Use explain to review the structured queries and optimizations applied.

Internally, explain creates an ExplainCommand logical command and requests SessionState to execute it (to get a QueryExecution back).

Note explain uses the ExplainCommand logical command that, when executed, gives different text representations of QueryExecution (for the Dataset's LogicalPlan) depending on the flags (e.g. extended, codegen, and cost which are disabled by default).

explain then requests QueryExecution for the optimized physical query plan and collects the records (as InternalRow objects).

Note explain uses the Dataset's SparkSession to access the current SessionState.

In the end, explain goes over the InternalRow records and converts them to lines to display to console.

Note explain "converts" an InternalRow record to a line using getString at position 0.

Tip If you are serious about query debugging you could also use the Debugging Query Execution facility.


scala> spark.range(10).explain(extended = true)


== Parsed Logical Plan ==
Range (0, 10, step=1, splits=Some(8))

== Analyzed Logical Plan ==


id: bigint
Range (0, 10, step=1, splits=Some(8))

== Optimized Logical Plan ==


Range (0, 10, step=1, splits=Some(8))

== Physical Plan ==
*Range (0, 10, step=1, splits=Some(8))

Specifying Hint —  hint Basic Action

hint(name: String, parameters: Any*): Dataset[T]

hint operator is part of Hint Framework to specify a hint (by name and parameters ) for a

Dataset .

Internally, hint simply attaches UnresolvedHint unary logical operator to an "analyzed"


Dataset (i.e. the analyzed logical plan of a Dataset ).

val ds = spark.range(3)
val plan = ds.queryExecution.logical
scala> println(plan.numberedTreeString)
00 Range (0, 3, step=1, splits=Some(8))

// Attach a hint
val dsHinted = ds.hint("myHint", 100, true)
val plan = dsHinted.queryExecution.logical
scala> println(plan.numberedTreeString)
00 'UnresolvedHint myHint, [100, true]
01 +- Range (0, 3, step=1, splits=Some(8))

Note hint adds an UnresolvedHint unary logical operator to an analyzed logical plan that indirectly triggers the analysis phase that executes logical commands and their unions as well as resolves all hints that have already been added to a logical plan.

// FIXME Demo with UnresolvedHint


Locally Checkpointing Dataset —  localCheckpoint Basic Action

localCheckpoint(): Dataset[T] (1)


localCheckpoint(eager: Boolean): Dataset[T]

1. eager flag enabled

localCheckpoint simply uses Dataset.checkpoint operator with the input eager flag and

reliableCheckpoint flag disabled ( false ).

checkpoint Internal Method

checkpoint(eager: Boolean, reliableCheckpoint: Boolean): Dataset[T]

checkpoint requests QueryExecution (of the Dataset ) to generate an RDD of internal

binary rows (aka internalRdd ) and then requests the RDD to make a copy of all the rows
(by adding a MapPartitionsRDD ).

Depending on reliableCheckpoint flag, checkpoint marks the RDD for (reliable)


checkpointing ( true ) or local checkpointing ( false ).

With eager flag on, checkpoint counts the number of records in the RDD (by executing
RDD.count ) that gives the effect of immediate eager checkpointing.

checkpoint requests QueryExecution (of the Dataset ) for optimized physical query plan

(the plan is used to get the outputPartitioning and outputOrdering for the result Dataset ).

In the end, checkpoint creates a DataFrame with a new logical plan node for scanning data
from an RDD of InternalRows ( LogicalRDD ).

Note checkpoint (the internal method) is used for the checkpoint and localCheckpoint basic actions.

Generating RDD of Internal Binary Rows —  rdd Basic Action

rdd: RDD[T]


Whenever you need to convert a Dataset into an RDD, executing the rdd method gives you the RDD of the proper input object type (not Row as in DataFrames) that sits behind the Dataset.

scala> val rdd = tokens.rdd


rdd: org.apache.spark.rdd.RDD[Token] = MapPartitionsRDD[11] at rdd at <console>:30

Internally, it looks ExpressionEncoder (for the Dataset ) up and accesses the deserializer
expression. That gives the DataType of the result of evaluating the expression.

Note A deserializer expression is used to decode an InternalRow to an object of type T. See ExpressionEncoder.

It then executes a DeserializeToObject logical operator that will produce a


RDD[InternalRow] that is converted into the proper RDD[T] using the DataType and T .

Note It is a lazy operation that "produces" a RDD[T] .

Accessing Schema —  schema Basic Action


A Dataset has a schema.

schema: StructType

Tip You may also use the following methods to learn about the schema: printSchema(): Unit and explain.

Converting Typed Dataset to Untyped DataFrame —  toDF Basic Action

toDF(): DataFrame
toDF(colNames: String*): DataFrame

toDF converts a Dataset into a DataFrame.

Internally, the empty-argument toDF creates a Dataset[Row] using the Dataset 's
SparkSession and QueryExecution with the encoder being RowEncoder.

Caution FIXME Describe toDF(colNames: String*)
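A minimal sketch of the colNames variant (assuming spark.implicits._ is in scope):

val pairs = Seq((0, "zero"), (1, "one")).toDS // Dataset[(Int, String)]
val df = pairs.toDF("id", "name")             // DataFrame with the given column names
df.printSchema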


Unpersisting Cached Dataset —  unpersist Basic Action

unpersist(): this.type
unpersist(blocking: Boolean): this.type

unpersist uncaches the Dataset, optionally blocking until the operation completes (per the blocking flag).

Internally, unpersist requests CacheManager to uncache the query.

Caution FIXME

Accessing DataFrameWriter (to Describe Writing Dataset) —  write Basic Action

write: DataFrameWriter[T]

write gives DataFrameWriter for records of type T .

import org.apache.spark.sql.{DataFrameWriter, Dataset}


val ints: Dataset[Int] = (0 to 5).toDS
val writer: DataFrameWriter[Int] = ints.write

isEmpty Basic Action

isEmpty: Boolean

isEmpty …​FIXME

isLocal Basic Action

isLocal: Boolean

isLocal …​FIXME


Dataset API — Actions
Actions are part of the Dataset API for…​FIXME

Note Actions are the methods in the Dataset Scala class that are grouped in the action group, i.e. @group action.

Table 1. Dataset API’s Actions

collect
  collect(): Array[T]

count
  count(): Long

describe
  describe(cols: String*): DataFrame

first
  first(): T

foreach
  foreach(f: T => Unit): Unit

foreachPartition
  foreachPartition(f: Iterator[T] => Unit): Unit

head
  head(): T
  head(n: Int): Array[T]

reduce
  reduce(func: (T, T) => T): T

show
  show(): Unit
  show(truncate: Boolean): Unit
  show(numRows: Int): Unit
  show(numRows: Int, truncate: Boolean): Unit
  show(numRows: Int, truncate: Int): Unit
  show(numRows: Int, truncate: Int, vertical: Boolean): Unit

summary
  summary(statistics: String*): DataFrame
  Computes specified statistics for numeric and string columns. The default statistics are: count, mean, stddev, min, max and 25%, 50%, 75% percentiles.
  Note summary is an extended version of the describe action that simply calculates count, mean, stddev, min and max statistics.

take
  take(n: Int): Array[T]

toLocalIterator
  toLocalIterator(): java.util.Iterator[T]

collect Action

collect(): Array[T]

collect …​FIXME

count Action

count(): Long

count …​FIXME

Calculating Basic Statistics —  describe Action

describe(cols: String*): DataFrame

describe …​FIXME

first Action

first(): T


first …​FIXME

foreach Action

foreach(f: T => Unit): Unit

foreach …​FIXME

foreachPartition Action

foreachPartition(f: Iterator[T] => Unit): Unit

foreachPartition …​FIXME

head Action

head(): T (1)
head(n: Int): Array[T]

1. Calls the other head with n as 1 and takes the first element

head …​FIXME

reduce Action

reduce(func: (T, T) => T): T

reduce …​FIXME

show Action

show(): Unit
show(truncate: Boolean): Unit
show(numRows: Int): Unit
show(numRows: Int, truncate: Boolean): Unit
show(numRows: Int, truncate: Int): Unit
show(numRows: Int, truncate: Int, vertical: Boolean): Unit

show …​FIXME


Calculating Statistics —  summary Action

summary(statistics: String*): DataFrame

summary calculates specified statistics for numeric and string columns.

The default statistics are: count , mean , stddev , min , max and 25% , 50% , 75%
percentiles.

Note summary accepts arbitrary approximate percentiles specified as a percentage (e.g. 10%).

Internally, summary uses the StatFunctions to calculate the requested summaries for the
Dataset.
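A quick sketch of summary with custom statistics and percentiles:

val nums = spark.range(10).toDF("id")
nums.summary("count", "min", "25%", "75%", "max").show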

Taking First Records —  take Action

take(n: Int): Array[T]

take is an action on a Dataset that returns a collection of n records.

Warning take loads all the data into the memory of the Spark application's driver process and for a large n could result in OutOfMemoryError.

Internally, take creates a new Dataset with Limit logical plan for Literal expression
and the current LogicalPlan . It then runs the SparkPlan that produces a
Array[InternalRow] that is in turn decoded to Array[T] using a bounded encoder.
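A minimal example:

val firstThree = spark.range(10).take(3)
// firstThree: Array[java.lang.Long] = Array(0, 1, 2)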

toLocalIterator Action

toLocalIterator(): java.util.Iterator[T]

toLocalIterator …​FIXME


DataFrameNaFunctions — Working With Missing Data

DataFrameNaFunctions is used to work with missing data in a structured query (a DataFrame).

Table 1. DataFrameNaFunctions API

drop
  drop(): DataFrame
  drop(cols: Array[String]): DataFrame
  drop(minNonNulls: Int): DataFrame
  drop(minNonNulls: Int, cols: Array[String]): DataFrame
  drop(minNonNulls: Int, cols: Seq[String]): DataFrame
  drop(cols: Seq[String]): DataFrame
  drop(how: String): DataFrame
  drop(how: String, cols: Array[String]): DataFrame
  drop(how: String, cols: Seq[String]): DataFrame

fill
  fill(value: Boolean): DataFrame
  fill(value: Boolean, cols: Array[String]): DataFrame
  fill(value: Boolean, cols: Seq[String]): DataFrame
  fill(value: Double): DataFrame
  fill(value: Double, cols: Array[String]): DataFrame
  fill(value: Double, cols: Seq[String]): DataFrame
  fill(value: Long): DataFrame
  fill(value: Long, cols: Array[String]): DataFrame
  fill(value: Long, cols: Seq[String]): DataFrame
  fill(valueMap: Map[String, Any]): DataFrame
  fill(value: String): DataFrame
  fill(value: String, cols: Array[String]): DataFrame
  fill(value: String, cols: Seq[String]): DataFrame

replace
  replace[T](cols: Seq[String], replacement: Map[T, T]): DataFrame
  replace[T](col: String, replacement: Map[T, T]): DataFrame

DataFrameNaFunctions is available using na untyped transformation.

val q: DataFrame = ...


q.na
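A minimal sketch of dropping and filling missing values (assuming spark.implicits._ is in scope):

val people = Seq(("Alice", Some(30)), ("Bob", None)).toDF("name", "age")
people.na.drop().show                // drops rows with any null values
people.na.fill(0L, Seq("age")).show  // replaces nulls in "age" with 0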

convertToDouble Internal Method

convertToDouble(v: Any): Double


convertToDouble …​FIXME

Note convertToDouble is used when…​FIXME

drop Method

drop(): DataFrame
drop(cols: Array[String]): DataFrame
drop(minNonNulls: Int): DataFrame
drop(minNonNulls: Int, cols: Array[String]): DataFrame
drop(minNonNulls: Int, cols: Seq[String]): DataFrame
drop(cols: Seq[String]): DataFrame
drop(how: String): DataFrame
drop(how: String, cols: Array[String]): DataFrame
drop(how: String, cols: Seq[String]): DataFrame

drop …​FIXME

fill Method

fill(value: Boolean): DataFrame


fill(value: Boolean, cols: Array[String]): DataFrame
fill(value: Boolean, cols: Seq[String]): DataFrame
fill(value: Double): DataFrame
fill(value: Double, cols: Array[String]): DataFrame
fill(value: Double, cols: Seq[String]): DataFrame
fill(value: Long): DataFrame
fill(value: Long, cols: Array[String]): DataFrame
fill(value: Long, cols: Seq[String]): DataFrame
fill(valueMap: Map[String, Any]): DataFrame
fill(value: String): DataFrame
fill(value: String, cols: Array[String]): DataFrame
fill(value: String, cols: Seq[String]): DataFrame

fill …​FIXME

fillCol Internal Method

fillCol[T](col: StructField, replacement: T): Column

fillCol …​FIXME

Note fillCol is used when…​FIXME


fillMap Internal Method

fillMap(values: Seq[(String, Any)]): DataFrame

fillMap …​FIXME

Note fillMap is used when…​FIXME

fillValue Internal Method

fillValue[T](value: T, cols: Seq[String]): DataFrame

fillValue …​FIXME

Note fillValue is used when…​FIXME

replace0 Internal Method

replace0[T](cols: Seq[String], replacement: Map[T, T]): DataFrame

replace0 …​FIXME

Note replace0 is used when…​FIXME

replace Method

replace[T](cols: Seq[String], replacement: Map[T, T]): DataFrame


replace[T](col: String, replacement: Map[T, T]): DataFrame

replace …​FIXME

replaceCol Internal Method

replaceCol(col: StructField, replacementMap: Map[_, _]): Column

replaceCol …​FIXME

Note replaceCol is used when…​FIXME


DataFrameStatFunctions — Working With Statistic Functions

DataFrameStatFunctions is used to work with statistic functions in a structured query (a DataFrame).


Table 1. DataFrameStatFunctions API

approxQuantile
  approxQuantile(cols: Array[String], probabilities: Array[Double], relativeError: Double): Array[Array[Double]]
  approxQuantile(col: String, probabilities: Array[Double], relativeError: Double): Array[Double]

bloomFilter
  bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter
  bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): BloomFilter
  bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): BloomFilter
  bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): BloomFilter

corr
  corr(col1: String, col2: String): Double
  corr(col1: String, col2: String, method: String): Double

countMinSketch
  countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch
  countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch
  countMinSketch(colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch
  countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch

cov
  cov(col1: String, col2: String): Double

crosstab
  crosstab(col1: String, col2: String): DataFrame

freqItems
  freqItems(cols: Array[String]): DataFrame
  freqItems(cols: Array[String], support: Double): DataFrame
  freqItems(cols: Seq[String]): DataFrame
  freqItems(cols: Seq[String], support: Double): DataFrame

sampleBy
  sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame

DataFrameStatFunctions is available using stat untyped transformation.

val q: DataFrame = ...


q.stat
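A quick sketch of a couple of the statistic functions (assuming spark.implicits._ is in scope):

val df = Seq((1, 1.0), (2, 2.0), (3, 3.0)).toDF("x", "y")
df.stat.corr("x", "y")                        // Pearson correlation (1.0 for this data)
df.stat.approxQuantile("y", Array(0.5), 0.0)  // exact median when relativeError is 0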


approxQuantile Method

approxQuantile(
cols: Array[String],
probabilities: Array[Double],
relativeError: Double): Array[Array[Double]]
approxQuantile(
col: String,
probabilities: Array[Double],
relativeError: Double): Array[Double]

approxQuantile …​FIXME

bloomFilter Method

bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter


bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): BloomFilter
bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): BloomFilter
bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): BloomFilter

bloomFilter …​FIXME

buildBloomFilter Internal Method

buildBloomFilter(col: Column, zero: BloomFilter): BloomFilter

buildBloomFilter …​FIXME

Note buildBloomFilter is used when…​FIXME

corr Method

corr(col1: String, col2: String): Double


corr(col1: String, col2: String, method: String): Double

corr …​FIXME

countMinSketch Method


countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch
countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch
countMinSketch(colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch
countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch
// PRIVATE API
countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch

countMinSketch …​FIXME

cov Method

cov(col1: String, col2: String): Double

cov …​FIXME

crosstab Method

crosstab(col1: String, col2: String): DataFrame

crosstab …​FIXME

freqItems Method

freqItems(cols: Array[String]): DataFrame


freqItems(cols: Array[String], support: Double): DataFrame
freqItems(cols: Seq[String]): DataFrame
freqItems(cols: Seq[String], support: Double): DataFrame

freqItems …​FIXME

sampleBy Method

sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame

sampleBy …​FIXME


Column
Column represents a column in a Dataset that holds a Catalyst Expression that produces a

value per row.

Note A Column is a value generator for every row in a Dataset .

A special column * references all columns in a Dataset .

With the implicits conversions imported, you can create "free" column references using Scala's symbols.

val spark: SparkSession = ...


import spark.implicits._

import org.apache.spark.sql.Column
scala> val nameCol: Column = 'name
nameCol: org.apache.spark.sql.Column = name

Note "Free" column references are Column s with no association to a Dataset .

You can also create free column references from $ -prefixed strings.

// Note that $ alone creates a ColumnName


scala> val idCol = $"id"
idCol: org.apache.spark.sql.ColumnName = id

import org.apache.spark.sql.Column

// The target type triggers the implicit conversion to Column


scala> val idCol: Column = $"id"
idCol: org.apache.spark.sql.Column = id

Beside using the implicits conversions, you can create columns using col and column
functions.

import org.apache.spark.sql.functions._

scala> val nameCol = col("name")


nameCol: org.apache.spark.sql.Column = name

scala> val cityCol = column("city")


cityCol: org.apache.spark.sql.Column = city


Finally, you can create a bound Column using the Dataset the column is supposed to be
part of using Dataset.apply factory method or Dataset.col operator.

Note You can use bound Column references only with the Datasets they have been created from.

scala> val textCol = dataset.col("text")


textCol: org.apache.spark.sql.Column = text

scala> val idCol = dataset.apply("id")


idCol: org.apache.spark.sql.Column = id

scala> val idCol = dataset("id")


idCol: org.apache.spark.sql.Column = id

You can reference nested columns using . (dot).

Table 1. Column Operators

as
  Specifying type hint about the expected return value of the column

name

Column has a reference to Catalyst's Expression it was created for using the expr method.

Note
scala> window('time, "5 seconds").expr
res0: org.apache.spark.sql.catalyst.expressions.Expression = timewindow('time,

Tip Read about typed column references in TypedColumn Expressions.

Specifying Type Hint —  as Operator

as[U : Encoder]: TypedColumn[Any, U]

as creates a TypedColumn (that gives a type hint about the expected return value of the

column).

scala> $"id".as[Int]
res1: org.apache.spark.sql.TypedColumn[Any,Int] = id


name Operator

name(alias: String): Column

name …​FIXME

Note name is used when…​FIXME

Adding Column to Dataset —  withColumn Method

withColumn(colName: String, col: Column): DataFrame

withColumn returns a new DataFrame with a new column (named colName and defined by the col expression) added.

Note withColumn can replace an existing colName column.

scala> val df = Seq((1, "jeden"), (2, "dwa")).toDF("number", "polish")


df: org.apache.spark.sql.DataFrame = [number: int, polish: string]

scala> df.show
+------+------+
|number|polish|
+------+------+
| 1| jeden|
| 2| dwa|
+------+------+

scala> df.withColumn("polish", lit(1)).show


+------+------+
|number|polish|
+------+------+
| 1| 1|
| 2| 1|
+------+------+

You can add new columns do a Dataset using withColumn method.


val spark: SparkSession = ...


val dataset = spark.range(5)

// Add a new column called "group"


scala> dataset.withColumn("group", 'id % 2).show
+---+-----+
| id|group|
+---+-----+
| 0| 0|
| 1| 1|
| 2| 0|
| 3| 1|
| 4| 0|
+---+-----+

Creating Column Instance For Catalyst Expression —  apply Factory Method

val spark: SparkSession = ...


case class Word(id: Long, text: String)
val dataset = Seq(Word(0, "hello"), Word(1, "spark")).toDS

scala> val idCol = dataset.apply("id")


idCol: org.apache.spark.sql.Column = id

// or using Scala's magic a little bit


// the following is equivalent to the above explicit apply call
scala> val idCol = dataset("id")
idCol: org.apache.spark.sql.Column = id

like Operator

Caution FIXME

scala> df("id") like "0"


res0: org.apache.spark.sql.Column = id LIKE 0

scala> df.filter('id like "0").show


+---+-----+
| id| text|
+---+-----+
| 0|hello|
+---+-----+

Symbols As Column Names


scala> val df = Seq((0, "hello"), (1, "world")).toDF("id", "text")


df: org.apache.spark.sql.DataFrame = [id: int, text: string]

scala> df.select('id)
res0: org.apache.spark.sql.DataFrame = [id: int]

scala> df.select('id).show
+---+
| id|
+---+
| 0|
| 1|
+---+

Defining Windowing Column (Analytic Clause) —  over Operator

over(): Column
over(window: WindowSpec): Column

over creates a windowing column (aka analytic clause) that allows executing an aggregate function over a window (i.e. a group of records that are in some relation to the current record).

Tip Read up on windowed aggregation in Spark SQL in Window Aggregate Functions.

scala> val overUnspecifiedFrame = $"someColumn".over()


overUnspecifiedFrame: org.apache.spark.sql.Column = someColumn OVER (UnspecifiedFrame)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.WindowSpec
val spec: WindowSpec = Window.rangeBetween(Window.unboundedPreceding, Window.currentRo
w)
scala> val overRange = $"someColumn" over spec
overRange: org.apache.spark.sql.Column = someColumn OVER (RANGE BETWEEN UNBOUNDED PREC
EDING AND CURRENT ROW)

cast Operator
cast method casts a column to a data type. It makes for type-safe maps with Row objects

of the proper type (not Any ).


cast(to: String): Column


cast(to: DataType): Column

cast uses CatalystSqlParser to parse the data type from its canonical string

representation.

cast Example

scala> val df = Seq((0f, "hello")).toDF("label", "text")


df: org.apache.spark.sql.DataFrame = [label: float, text: string]

scala> df.printSchema
root
|-- label: float (nullable = false)
|-- text: string (nullable = true)

// without cast
import org.apache.spark.sql.Row
scala> df.select("label").map { case Row(label) => label.getClass.getName }.show(false
)
+---------------+
|value |
+---------------+
|java.lang.Float|
+---------------+

// with cast
import org.apache.spark.sql.types.DoubleType
scala> df.select(col("label").cast(DoubleType)).map { case Row(label) => label.getClas
s.getName }.show(false)
+----------------+
|value |
+----------------+
|java.lang.Double|
+----------------+

generateAlias Method

generateAlias(e: Expression): String

generateAlias …​FIXME


Note generateAlias is used when:
  Column is requested to named
  RelationalGroupedDataset is requested to alias

named Method

named: NamedExpression

named …​FIXME

Note named is used when the following operators are used:
  Dataset.select
  KeyValueGroupedDataset.agg


Column API — Column Operators


Column API is a set of operators to work with values in a column (of a Dataset).

Table 1. Column Operators

asc
  asc: Column

asc_nulls_first
  asc_nulls_first: Column

asc_nulls_last
  asc_nulls_last: Column

desc
  desc: Column

desc_nulls_first
  desc_nulls_first: Column

desc_nulls_last
  desc_nulls_last: Column

isin
  isin(list: Any*): Column

isInCollection
  isInCollection(values: scala.collection.Iterable[_]): Column
  (New in 2.4.4) An expression operator that is true if the value of the column is in the given values collection.
  isInCollection is simply a synonym of the isin operator.

isin Operator

isin(list: Any*): Column

Internally, isin creates a Column with In predicate expression.


val ids = Seq((1, 2, 2), (2, 3, 1)).toDF("x", "y", "id")


scala> ids.show
+---+---+---+
| x| y| id|
+---+---+---+
| 1| 2| 2|
| 2| 3| 1|
+---+---+---+

val c = $"id" isin ($"x", $"y")


val q = ids.filter(c)
scala> q.show
+---+---+---+
| x| y| id|
+---+---+---+
| 1| 2| 2|
+---+---+---+

// Note that isin accepts non-Column values


val c = $"id" isin ("x", "y")
val q = ids.filter(c)
scala> q.show
+---+---+---+
| x| y| id|
+---+---+---+
+---+---+---+
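isInCollection behaves like isin but accepts a Scala collection, e.g. (assuming spark.implicits._ is in scope):

val ids = Seq(1, 2, 3).toDF("id")
ids.filter($"id".isInCollection(Seq(1, 3))).show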


TypedColumn
TypedColumn is a Column with the ExpressionEncoder for the types of the input and the

output.

TypedColumn is created using as operator on a Column .

scala> val id = $"id".as[Int]


id: org.apache.spark.sql.TypedColumn[Any,Int] = id

scala> id.expr
res1: org.apache.spark.sql.catalyst.expressions.Expression = 'id

name Operator

name(alias: String): TypedColumn[T, U]

Note name is part of Column Contract to…​FIXME.

name …​FIXME

Note name is used when…​FIXME

Creating TypedColumn —  withInputType Internal Method

withInputType(
inputEncoder: ExpressionEncoder[_],
inputAttributes: Seq[Attribute]): TypedColumn[T, U]

withInputType …​FIXME

Note withInputType is used when the following typed operators are used:
  Dataset.select
  KeyValueGroupedDataset.agg
  RelationalGroupedDataset.agg

Creating TypedColumn Instance


TypedColumn takes the following when created:

Catalyst expression

ExpressionEncoder of the column results

TypedColumn initializes the internal registries and counters.


Basic Aggregation — Typed and Untyped Grouping Operators
You can calculate aggregates over a group of rows in a Dataset using aggregate operators
(possibly with aggregate functions).

Table 1. Aggregate Operators

Operator    Return Type                Description
agg         RelationalGroupedDataset   Aggregates with or without grouping (i.e. over an entire Dataset)
groupBy     RelationalGroupedDataset   Used for untyped aggregates using DataFrames. Grouping is described using column expressions or column names.
groupByKey  KeyValueGroupedDataset     Used for typed aggregates using Datasets with records grouped by a key-defining discriminator function.

Note Aggregate functions without aggregate operators return a single value. If you want to find the aggregate values for each unique value (in a column), you should groupBy first (over this column) to build the groups.

Note You can also use SparkSession to execute good ol' SQL with GROUP BY should you prefer.

val spark: SparkSession = ???
spark.sql("SELECT COUNT(*) FROM sales GROUP BY city")

SQL or Dataset API's operators go through the same query planning and optimizations, and have the same performance characteristic in the end.

Aggregates Over Subset Of or Whole Dataset —  agg Operator

agg(expr: Column, exprs: Column*): DataFrame


agg(exprs: Map[String, String]): DataFrame
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame


agg applies an aggregate function on a subset or the entire Dataset (i.e. considering the
entire data set as one group).

Note agg on a Dataset is simply a shortcut for groupBy().agg(…​).

scala> spark.range(10).agg(sum('id) as "sum").show


+---+
|sum|
+---+
| 45|
+---+

agg can compute aggregate expressions on all the records in a Dataset .

Untyped Grouping —  groupBy Operator

groupBy(cols: Column*): RelationalGroupedDataset


groupBy(col1: String, cols: String*): RelationalGroupedDataset

groupBy operator groups the rows in a Dataset by columns (as Column expressions or

names).

groupBy gives a RelationalGroupedDataset to execute aggregate functions or operators.

// 10^3-record large data set


val ints = 1 to math.pow(10, 3).toInt
val nms = ints.toDF("n").withColumn("m", 'n % 2)
scala> nms.count
res0: Long = 1000

val q = nms.
groupBy('m).
agg(sum('n) as "sum").
orderBy('m)
scala> q.show
+---+------+
| m| sum|
+---+------+
| 0|250500|
| 1|250000|
+---+------+

Internally, groupBy resolves column names (possibly quoted) and creates a


RelationalGroupedDataset (with groupType being GroupByType ).


Note The following uses the data setup as described in Test Setup section below.

scala> tokens.show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
| bbb| 200| 0.53|
| bbb| 300| 0.42|
+----+---------+-----+

scala> tokens.groupBy('name).avg().show
+----+--------------+----------+
|name|avg(productId)|avg(score)|
+----+--------------+----------+
| aaa| 150.0| 0.205|
| bbb| 250.0| 0.475|
+----+--------------+----------+

scala> tokens.groupBy('name, 'productId).agg(Map("score" -> "avg")).show


+----+---------+----------+
|name|productId|avg(score)|
+----+---------+----------+
| aaa| 200| 0.29|
| bbb| 200| 0.53|
| bbb| 300| 0.42|
| aaa| 100| 0.12|
+----+---------+----------+

scala> tokens.groupBy('name).count.show
+----+-----+
|name|count|
+----+-----+
| aaa| 2|
| bbb| 2|
+----+-----+

scala> tokens.groupBy('name).max("score").show
+----+----------+
|name|max(score)|
+----+----------+
| aaa| 0.29|
| bbb| 0.53|
+----+----------+

scala> tokens.groupBy('name).sum("score").show
+----+----------+
|name|sum(score)|
+----+----------+
| aaa| 0.41|
| bbb| 0.95|
+----+----------+

scala> tokens.groupBy('productId).sum("score").show
+---------+------------------+
|productId| sum(score)|
+---------+------------------+
| 300| 0.42|
| 100| 0.12|
| 200|0.8200000000000001|
+---------+------------------+

Typed Grouping —  groupByKey Operator

groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]

groupByKey groups records (of type T ) by the input func and in the end returns a

KeyValueGroupedDataset to apply aggregation to.

Note groupByKey is Dataset 's experimental API.

scala> tokens.groupByKey(_.productId).count.orderBy($"value").show
+-----+--------+
|value|count(1)|
+-----+--------+
| 100| 1|
| 200| 2|
| 300| 1|
+-----+--------+

import org.apache.spark.sql.expressions.scalalang._
val q = tokens.
groupByKey(_.productId).
agg(typed.sum[Token](_.score)).
toDF("productId", "sum").
orderBy('productId)
scala> q.show
+---------+------------------+
|productId| sum|
+---------+------------------+
| 100| 0.12|
| 200|0.8200000000000001|
| 300| 0.42|
+---------+------------------+

Test Setup
This is a setup for learning GroupedData . Paste it into Spark Shell using :paste .


import spark.implicits._

case class Token(name: String, productId: Int, score: Double)


val data = Seq(
Token("aaa", 100, 0.12),
Token("aaa", 200, 0.29),
Token("bbb", 200, 0.53),
Token("bbb", 300, 0.42))
val tokens = data.toDS.cache (1)

1. Cache the dataset so the following queries won’t load/recompute data over and over
again.


RelationalGroupedDataset — Untyped Row-based Grouping

RelationalGroupedDataset is an interface to calculate aggregates over groups of rows in a DataFrame.

Note KeyValueGroupedDataset is used for typed aggregates over groups of custom Scala objects (not Rows).

RelationalGroupedDataset is a result of executing the following grouping operators:

groupBy

rollup

cube

pivot

Table 1. RelationalGroupedDataset’s Aggregate Operators

agg
avg
count
max
mean
min

pivot
  pivot(pivotColumn: String): RelationalGroupedDataset
  pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset
  pivot(pivotColumn: Column): RelationalGroupedDataset (1)
  pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset (1)
  1. New in 2.4.0
  Pivots on a column (with new columns per distinct value)

sum


spark.sql.retainGroupColumns configuration property controls whether to retain columns used for aggregation or not (in RelationalGroupedDataset operators). spark.sql.retainGroupColumns is enabled by default.

Note
scala> spark.conf.get("spark.sql.retainGroupColumns")
res1: String = true

// Use dataFrameRetainGroupColumns method for type-safe access to the current value
import spark.sessionState.conf
scala> conf.dataFrameRetainGroupColumns
res2: Boolean = true
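A small sketch showing the effect of disabling the property (the grouping column disappears from the result):

spark.conf.set("spark.sql.retainGroupColumns", false)
spark.range(4).withColumn("m", 'id % 2).groupBy('m).sum("id").show // only the sum(id) column
spark.conf.set("spark.sql.retainGroupColumns", true)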

Computing Aggregates Using Aggregate Column Expressions or Function Names —  agg Operator

agg(expr: Column, exprs: Column*): DataFrame


agg(exprs: Map[String, String]): DataFrame
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

agg creates a DataFrame with the rows being the result of executing grouping expressions

(specified using columns or names) over row groups.

Note You can use untyped or typed column expressions.

val countsAndSums = spark.


range(10). // <-- 10-element Dataset
withColumn("group", 'id % 2). // <-- define grouping column
groupBy("group"). // <-- group by groups
agg(count("id") as "count", sum("id") as "sum")
scala> countsAndSums.show
+-----+-----+---+
|group|count|sum|
+-----+-----+---+
| 0| 5| 20|
| 1| 5| 25|
+-----+-----+---+

Internally, agg creates a DataFrame with Aggregate or Pivot logical operators.


// groupBy above
scala> println(countsAndSums.queryExecution.logical.numberedTreeString)
00 'Aggregate [group#179L], [group#179L, count('id) AS count#188, sum('id) AS sum#190]
01 +- Project [id#176L, (id#176L % cast(2 as bigint)) AS group#179L]
02 +- Range (0, 10, step=1, splits=Some(8))

// rollup operator
val rollupQ = spark.range(2).rollup('id).agg(count('id))
scala> println(rollupQ.queryExecution.logical.numberedTreeString)
00 'Aggregate [rollup('id)], [unresolvedalias('id, None), count('id) AS count(id)#267]
01 +- Range (0, 2, step=1, splits=Some(8))

// cube operator
val cubeQ = spark.range(2).cube('id).agg(count('id))
scala> println(cubeQ.queryExecution.logical.numberedTreeString)
00 'Aggregate [cube('id)], [unresolvedalias('id, None), count('id) AS count(id)#280]
01 +- Range (0, 2, step=1, splits=Some(8))

// pivot operator
val pivotQ = spark.
range(10).
withColumn("group", 'id % 2).
groupBy("group").
pivot("group").
agg(count("id"))
scala> println(pivotQ.queryExecution.logical.numberedTreeString)
00 'Pivot [group#296L], group#296: bigint, [0, 1], [count('id)]
01 +- Project [id#293L, (id#293L % cast(2 as bigint)) AS group#296L]
02 +- Range (0, 10, step=1, splits=Some(8))

Creating DataFrame from Aggregate Expressions —  toDF Internal Method

toDF(aggExprs: Seq[Expression]): DataFrame

Caution FIXME

Internally, toDF branches off per group type.

Caution FIXME

For PivotType , toDF creates a DataFrame with Pivot unary logical operator.


Note toDF is used when the following RelationalGroupedDataset operators are used:
  agg and count
  mean, max, avg, min and sum (indirectly through aggregateNumericColumns)

aggregateNumericColumns Internal Method

aggregateNumericColumns(colNames: String*)(f: Expression => AggregateFunction): DataFrame

aggregateNumericColumns …​FIXME

Note aggregateNumericColumns is used when the following RelationalGroupedDataset operators are used: mean, max, avg, min and sum.

Creating RelationalGroupedDataset Instance


RelationalGroupedDataset takes the following when created:

DataFrame

Grouping expressions

Group type (to indicate the "source" operator)

GroupByType for groupBy

CubeType

RollupType

PivotType

pivot Operator

pivot(pivotColumn: String): RelationalGroupedDataset (1)


pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset (2)
pivot(pivotColumn: Column): RelationalGroupedDataset (3)
pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset (3)


1. Selects distinct and sorted values on pivotColumn and calls the other pivot (that
results in 3 extra "scanning" jobs)

2. Preferred as more efficient because the unique values are already provided

3. New in 2.4.0

pivot pivots on a pivotColumn column, i.e. adds new columns per distinct values in

pivotColumn .

Note pivot is only supported after groupBy operation.

Note Only one pivot operation is supported on a RelationalGroupedDataset .


val visits = Seq(


(0, "Warsaw", 2015),
(1, "Warsaw", 2016),
(2, "Boston", 2017)
).toDF("id", "city", "year")

val q = visits
.groupBy("city") // <-- rows in pivot table
.pivot("year") // <-- columns (unique values queried)
.count() // <-- values in cells
scala> q.show
+------+----+----+----+
| city|2015|2016|2017|
+------+----+----+----+
|Warsaw| 1| 1|null|
|Boston|null|null| 1|
+------+----+----+----+

scala> q.explain
== Physical Plan ==
HashAggregate(keys=[city#8], functions=[pivotfirst(year#9, count(1) AS `count`#222L, 2
015, 2016, 2017, 0, 0)])
+- Exchange hashpartitioning(city#8, 200)
+- HashAggregate(keys=[city#8], functions=[partial_pivotfirst(year#9, count(1) AS `
count`#222L, 2015, 2016, 2017, 0, 0)])
+- *HashAggregate(keys=[city#8, year#9], functions=[count(1)])
+- Exchange hashpartitioning(city#8, year#9, 200)
+- *HashAggregate(keys=[city#8, year#9], functions=[partial_count(1)])
+- LocalTableScan [city#8, year#9]

scala> visits
.groupBy('city)
.pivot("year", Seq("2015")) // <-- one column in pivot table
.count
.show
+------+----+
| city|2015|
+------+----+
|Warsaw| 1|
|Boston|null|
+------+----+

Important Use pivot with a list of distinct values to pivot on so Spark does not have to compute the list itself (and run three extra "scanning" jobs).


Figure 1. pivot in web UI (Distinct Values Defined Explicitly)

Figure 2. pivot in web UI — Three Extra Scanning Jobs Due to Unspecified Distinct Values

Note spark.sql.pivotMaxValues (default: 10000) controls the maximum number of (distinct) values that will be collected without error (when doing pivot without specifying the values for the pivot column).

Internally, pivot creates a RelationalGroupedDataset with the PivotType group type and pivotColumn resolved using the DataFrame's columns, with values as Literal expressions.

Note The toDF internal method maps the PivotType group type to a DataFrame with a Pivot unary logical operator.

scala> q.queryExecution.logical
res0: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Pivot [city#8], year#9: int, [2015, 2016, 2017], [count(1) AS count#24L]
+- Project [_1#3 AS id#7, _2#4 AS city#8, _3#5 AS year#9]
   +- LocalRelation [_1#3, _2#4, _3#5]

strToExpr Internal Method

strToExpr(expr: String): (Expression => Expression)

strToExpr …​FIXME

Note strToExpr is used exclusively when RelationalGroupedDataset is requested to agg with aggregation functions specified by name.

alias Method

alias(expr: Expression): NamedExpression

alias …​FIXME

Note alias is used exclusively when RelationalGroupedDataset is requested to create a DataFrame from aggregate expressions.


KeyValueGroupedDataset — Typed Grouping
KeyValueGroupedDataset is an experimental interface to calculate aggregates over groups of

objects in a typed Dataset.

Note RelationalGroupedDataset is used for untyped Row -based aggregates.

KeyValueGroupedDataset is created using Dataset.groupByKey operator.

val dataset: Dataset[Token] = ...


scala> val tokensByName = dataset.groupByKey(_.name)
tokensByName: org.apache.spark.sql.KeyValueGroupedDataset[String,Token] = org.apache.s
park.sql.KeyValueGroupedDataset@1e3aad46

Table 1. KeyValueGroupedDataset’s Aggregate Operators (KeyValueGroupedDataset API)


Operator Description
agg

cogroup

count

flatMapGroups

flatMapGroupsWithState

keys

keyAs

mapGroups

mapGroupsWithState

mapValues

reduceGroups

KeyValueGroupedDataset holds the keys that were used to group the objects.


scala> tokensByName.keys.show
+-----+
|value|
+-----+
| aaa|
| bbb|
+-----+
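A sketch of a typed aggregation with mapGroups, reusing the Token dataset from the earlier examples:

val maxScores = dataset.groupByKey(_.name).mapGroups { (name, tokens) =>
  (name, tokens.map(_.score).max) // the highest score per name
}
maxScores.show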

aggUntyped Internal Method

aggUntyped(columns: TypedColumn[_, _]*): Dataset[_]

aggUntyped …​FIXME

Note aggUntyped is used exclusively when the KeyValueGroupedDataset.agg typed operator is used.

logicalPlan Internal Method

logicalPlan: AnalysisBarrier

logicalPlan …​FIXME

Note logicalPlan is used when…​FIXME


Dataset Join Operators


From PostgreSQL’s 2.6. Joins Between Tables:

Queries can access multiple tables at once, or access the same table in such a way
that multiple rows of the table are being processed at the same time. A query that
accesses multiple rows of the same or different tables at one time is called a join
query.

You can join two datasets using the join operators with an optional join condition.

Table 1. Join Operators

Operator   Return Type   Description
crossJoin  DataFrame     Untyped Row-based cross join
join       DataFrame     Untyped Row-based join
joinWith   Dataset       Used for a type-preserving join with two output columns for records for which a join condition holds

You can also use SQL mode to join datasets using good ol' SQL.

val spark: SparkSession = ...


spark.sql("select * from t1, t2 where t1.id = t2.id")

You can specify a join condition (aka join expression) as part of join operators or using
where or filter operators.

df1.join(df2, $"df1Key" === $"df2Key")


df1.join(df2).where($"df1Key" === $"df2Key")
df1.join(df2).filter($"df1Key" === $"df2Key")

You can specify the join type as part of join operators (using joinType optional parameter).

df1.join(df2, $"df1Key" === $"df2Key", "inner")


Table 2. Join Types

SQL           Name (joinType)           JoinType
CROSS         cross                     Cross
INNER         inner                     Inner
FULL OUTER    outer, full, fullouter    FullOuter
LEFT ANTI     leftanti                  LeftAnti
LEFT OUTER    leftouter, left           LeftOuter
LEFT SEMI     leftsemi                  LeftSemi
RIGHT OUTER   rightouter, right         RightOuter
NATURAL       (special case)            NaturalJoin, a special case for Inner, LeftOuter, RightOuter and FullOuter
USING         (special case)            UsingJoin, a special case for Inner, LeftOuter, LeftSemi, RightOuter, FullOuter and LeftAnti

ExistenceJoin is an artificial join type used to express an existential sub-query, that is often referred to as existential join.

Note LeftAnti and ExistenceJoin are special cases of LeftOuter.

You can also find that Spark SQL uses the following two families of joins:

InnerLike with Inner and Cross

LeftExistence with LeftSemi, LeftAnti and ExistenceJoin

Tip Names are case-insensitive and can use the underscore ( _ ) at any position, i.e. left_anti and LEFT_ANTI are equivalent.

Note Spark SQL offers different join strategies with Broadcast Joins (aka Map-Side Joins) among them that are supposed to optimize your join queries over large distributed datasets.

join Operators


join(right: Dataset[_]): DataFrame (1)


join(right: Dataset[_], usingColumn: String): DataFrame (2)
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame (3)
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame (4)
join(right: Dataset[_], joinExprs: Column): DataFrame (5)
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame (6)

1. Condition-less inner join

2. Inner join with a single column that exists on both sides

3. Inner join with columns that exist on both sides

4. Equi-join with explicit join type

5. Inner join

6. Join with explicit join type. Self-joins are acceptable.

join joins two Dataset s.

val left = Seq((0, "zero"), (1, "one")).toDF("id", "left")


val right = Seq((0, "zero"), (2, "two"), (3, "three")).toDF("id", "right")

// Inner join
scala> left.join(right, "id").show
+---+----+-----+
| id|left|right|
+---+----+-----+
| 0|zero| zero|
+---+----+-----+

scala> left.join(right, "id").explain


== Physical Plan ==
*Project [id#50, left#51, right#61]
+- *BroadcastHashJoin [id#50], [id#60], Inner, BuildRight
:- LocalTableScan [id#50, left#51]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as
bigint)))
+- LocalTableScan [id#60, right#61]

// Full outer
scala> left.join(right, Seq("id"), "fullouter").show
+---+----+-----+
| id|left|right|
+---+----+-----+
| 1| one| null|
| 3|null|three|
| 2|null| two|
| 0|zero| zero|
+---+----+-----+


scala> left.join(right, Seq("id"), "fullouter").explain


== Physical Plan ==
*Project [coalesce(id#50, id#60) AS id#85, left#51, right#61]
+- SortMergeJoin [id#50], [id#60], FullOuter
:- *Sort [id#50 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#50, 200)
: +- LocalTableScan [id#50, left#51]
+- *Sort [id#60 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#60, 200)
+- LocalTableScan [id#60, right#61]

// Left anti
scala> left.join(right, Seq("id"), "leftanti").show
+---+----+
| id|left|
+---+----+
| 1| one|
+---+----+

scala> left.join(right, Seq("id"), "leftanti").explain


== Physical Plan ==
*BroadcastHashJoin [id#50], [id#60], LeftAnti, BuildRight
:- LocalTableScan [id#50, left#51]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as big
int)))
+- LocalTableScan [id#60]

Internally, join(right: Dataset[_]) creates a DataFrame with a condition-less Join logical


operator (in the current SparkSession).

Note: join(right: Dataset[_]) creates a logical plan with a condition-less Join operator whose two child logical plans are the left and right sides of the join.

Note: join(right: Dataset[_], usingColumns: Seq[String], joinType: String) creates a logical plan with a condition-less Join operator with the UsingJoin join type.

Note: join(right: Dataset[_], joinExprs: Column, joinType: String) accepts self-joins where joinExprs is of the form:

df("key") === df("key")

Such a condition is usually considered trivially true and would otherwise be refused. With the spark.sql.selfJoinAutoResolveAmbiguity option enabled (which it is by default), join automatically resolves such ambiguous join conditions into ones that make sense.

See [SPARK-6231] Join on two tables (generated from same one) is broken.
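For reference, here is a minimal sketch of the condition-based variants (using the left and right DataFrames defined earlier in this section):

// Variant 5: inner join with an explicit join condition
val innerWithCondition = left.join(right, left("id") === right("id"))

// Variant 6: explicit join condition and join type
// A left outer join keeps every row of left; unmatched right-side columns become null
val leftOuter = left.join(right, left("id") === right("id"), "left_outer")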


crossJoin Method

crossJoin(right: Dataset[_]): DataFrame

crossJoin joins two Datasets using Cross join type with no condition.

crossJoin creates an explicit cartesian join that can be very expensive without
Note
an extra filter (that can be pushed down).
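A quick sanity check of the resulting row count (a sketch using the left and right DataFrames defined earlier in this section):

// 2 rows on the left x 3 rows on the right = 6 rows
scala> left.crossJoin(right).count
res0: Long = 6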

Type-Preserving Joins —  joinWith Operators

joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)] (1)


joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]

1. inner equi-join

joinWith creates a Dataset with two columns _1 and _2 that each contain records for

which condition holds.


case class Person(id: Long, name: String, cityId: Long)


case class City(id: Long, name: String)
val family = Seq(
Person(0, "Agata", 0),
Person(1, "Iweta", 0),
Person(2, "Patryk", 2),
Person(3, "Maksym", 0)).toDS
val cities = Seq(
City(0, "Warsaw"),
City(1, "Washington"),
City(2, "Sopot")).toDS

val joined = family.joinWith(cities, family("cityId") === cities("id"))


scala> joined.printSchema
root
|-- _1: struct (nullable = false)
| |-- id: long (nullable = false)
| |-- name: string (nullable = true)
| |-- cityId: long (nullable = false)
|-- _2: struct (nullable = false)
| |-- id: long (nullable = false)
| |-- name: string (nullable = true)
scala> joined.show
+------------+----------+
| _1| _2|
+------------+----------+
| [0,Agata,0]|[0,Warsaw]|
| [1,Iweta,0]|[0,Warsaw]|
|[2,Patryk,2]| [2,Sopot]|
|[3,Maksym,0]|[0,Warsaw]|
+------------+----------+

Note joinWith preserves type-safety with the original object types.

Note joinWith creates a Dataset with Join logical plan.
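joinWith also accepts an explicit join type (the same names as for join). A sketch of a left outer variant of the example above:

// Type-preserving left outer join
// People with no matching city would get null on the cities side
val joinedLeftOuter = family.joinWith(cities, family("cityId") === cities("id"), "left_outer")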


Broadcast Joins (aka Map-Side Joins)


Spark SQL uses a broadcast join (aka broadcast hash join) instead of a shuffle-based join to optimize join queries when the size of one side's data is below spark.sql.autoBroadcastJoinThreshold.

Broadcast join can be very efficient for joins between a large table (fact) and relatively small tables (dimensions), as in a star-schema join, since it avoids sending all the data of the large table over the network.

You can use broadcast function or SQL’s broadcast hints to mark a dataset to be broadcast
when used in a join query.

According to the article Map-Side Join in Spark, broadcast join is also called a
Note replicated join (in the distributed system community) or a map-side join (in the
Hadoop community).

CanBroadcast object matches a LogicalPlan with output small enough for broadcast join.

Currently statistics are only supported for Hive Metastore tables where the
Note
command ANALYZE TABLE [tableName] COMPUTE STATISTICS noscan has been run.

JoinSelection execution planning strategy uses the spark.sql.autoBroadcastJoinThreshold property (default: 10MB) as the maximum estimated size of a dataset that gets broadcast to all worker nodes when performing a join.

val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toInt


scala> threshold / 1024 / 1024
res0: Int = 10

val q = spark.range(100).as("a").join(spark.range(100).as("b")).where($"a.id" === $"b.


id")
scala> println(q.queryExecution.logical.numberedTreeString)
00 'Filter ('a.id = 'b.id)
01 +- Join Inner
02 :- SubqueryAlias a
03 : +- Range (0, 100, step=1, splits=Some(8))
04 +- SubqueryAlias b
05 +- Range (0, 100, step=1, splits=Some(8))

scala> println(q.queryExecution.sparkPlan.numberedTreeString)
00 BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight
01 :- Range (0, 100, step=1, splits=8)
02 +- Range (0, 100, step=1, splits=8)

scala> q.explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight


:- *Range (0, 100, step=1, splits=8)


+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *Range (0, 100, step=1, splits=8)

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
res1: String = -1

scala> q.explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#4L], Inner
:- *Sort [id#0L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0L, 200)
: +- *Range (0, 100, step=1, splits=8)
+- *Sort [id#4L ASC NULLS FIRST], false, 0
+- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)

// Force BroadcastHashJoin with broadcast hint (as function)


val qBroadcast = spark.range(100).as("a").join(broadcast(spark.range(100)).as("b")).wh
ere($"a.id" === $"b.id")
scala> qBroadcast.explain
== Physical Plan ==
*BroadcastHashJoin [id#14L], [id#18L], Inner, BuildRight
:- *Range (0, 100, step=1, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *Range (0, 100, step=1, splits=8)

// Force BroadcastHashJoin using SQL's BROADCAST hint


// Supported hints: BROADCAST, BROADCASTJOIN or MAPJOIN
val qBroadcastLeft = """
SELECT /*+ BROADCAST (lf) */ *
FROM range(100) lf, range(1000) rt
WHERE lf.id = rt.id
"""
scala> sql(qBroadcastLeft).explain
== Physical Plan ==
*BroadcastHashJoin [id#34L], [id#35L], Inner, BuildRight
:- *Range (0, 100, step=1, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *Range (0, 1000, step=1, splits=8)

val qBroadcastRight = """


SELECT /*+ MAPJOIN (rt) */ *
FROM range(100) lf, range(1000) rt
WHERE lf.id = rt.id
"""
scala> sql(qBroadcastRight).explain
== Physical Plan ==
*BroadcastHashJoin [id#42L], [id#43L], Inner, BuildRight
:- *Range (0, 100, step=1, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *Range (0, 1000, step=1, splits=8)


Window Aggregation
Window Aggregation is…​FIXME

From Structured Query to Physical Plan


Spark Analyzer uses ExtractWindowExpressions logical resolution rule to replace (extract)
WindowExpression expressions with Window logical operators in a logical query plan.

Note: A Window logical operator is planned (by the BasicOperators execution planning strategy) as a WindowExec physical operator, which is then executed (doExecute) subject to the windowExecBufferInMemoryThreshold and windowExecBufferSpillThreshold configuration properties.
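A minimal sketch of a window aggregation (the dataset and column names are made up for this example):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// a toy dataset
val salaries = Seq(
  ("dev",   1, 1000),
  ("dev",   2, 1500),
  ("sales", 3, 1200)).toDF("dept", "id", "salary")

// rank rows by salary within every department
val byDeptSalaryDesc = Window.partitionBy($"dept").orderBy($"salary".desc)
val ranked = salaries.withColumn("rank", rank over byDeptSalaryDesc)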


WindowSpec — Window Specification
WindowSpec is a window specification that defines which rows are included in a window

(frame), i.e. the set of rows that are associated with the current row by some relation.

WindowSpec takes the following when created:

Partition specification ( Seq[Expression] ) which defines which records are in the


same partition. With no partition defined, all records belong to a single partition

Ordering Specification ( Seq[SortOrder] ) which defines how records in a partition are


ordered that in turn defines the position of a record in a partition. The ordering could be
ascending ( ASC in SQL or asc in Scala) or descending ( DESC or desc ).

Frame Specification ( WindowFrame ) which defines the rows to be included in the


frame for the current row, based on their relative position to the current row. For
example, "the three rows preceding the current row to the current row" describes a
frame including the current input row and three rows appearing before the current row.

You use Window object to create a WindowSpec .

import org.apache.spark.sql.expressions.Window
scala> val byHTokens = Window.partitionBy('token startsWith "h")
byHTokens: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressi
ons.WindowSpec@574985d8

Once the initial version of a WindowSpec is created, you use its methods (in the table below) to further configure the window specification.


Table 1. WindowSpec API

orderBy
  orderBy(cols: Column*): WindowSpec
  orderBy(colName: String, colNames: String*): WindowSpec

partitionBy
  partitionBy(cols: Column*): WindowSpec
  partitionBy(colName: String, colNames: String*): WindowSpec

rangeBetween
  rangeBetween(start: Column, end: Column): WindowSpec
  rangeBetween(start: Long, end: Long): WindowSpec

rowsBetween
  rowsBetween(start: Long, end: Long): WindowSpec

With a window specification fully defined, you use Column.over operator that associates the
WindowSpec with an aggregate or window function.

scala> :type windowSpec


org.apache.spark.sql.expressions.WindowSpec

import org.apache.spark.sql.functions.rank
val c = rank over windowSpec
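The resulting Column can then be used like any other column, e.g. in withColumn. A sketch using the byHTokens specification created above:

import org.apache.spark.sql.functions.count
val tokens = Seq("hello", "henry", "and", "harry").toDF("token")
// count of tokens in every partition (tokens starting with "h" vs the rest)
val counted = tokens.withColumn("count", count("token") over byHTokens)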

withAggregate Internal Method

withAggregate(aggregate: Column): Column

withAggregate …​FIXME

Note withAggregate is used exclusively when Column.over operator is used.


Window Utility Object — Defining Window Specification

Window utility object is a set of static methods to define a window specification.

Table 1. Window API

currentRow
  currentRow: Long
  Value representing the current row. Used to define frame boundaries.

orderBy
  orderBy(cols: Column*): WindowSpec
  orderBy(colName: String, colNames: String*): WindowSpec
  Creates a WindowSpec with the ordering defined.

partitionBy
  partitionBy(cols: Column*): WindowSpec
  partitionBy(colName: String, colNames: String*): WindowSpec
  Creates a WindowSpec with the partitioning defined.

rangeBetween
  rangeBetween(start: Column, end: Column): WindowSpec
  rangeBetween(start: Long, end: Long): WindowSpec
  Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). Both start and end are relative to the current row based on the actual value of the ORDER BY expression(s).

rowsBetween
  rowsBetween(start: Long, end: Long): WindowSpec
  Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). Both start and end are positions relative to the current row based on the position of the row within the partition.

unboundedFollowing
  unboundedFollowing: Long
  Value representing the last row in a partition (equivalent to "UNBOUNDED FOLLOWING" in SQL). Used to define frame boundaries.

unboundedPreceding
  unboundedPreceding: Long
  Value representing the first row in a partition (equivalent to "UNBOUNDED PRECEDING" in SQL). Used to define frame boundaries.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{currentRow, lit}
val windowSpec = Window
.partitionBy($"orderId")
.orderBy($"time")
.rangeBetween(currentRow, lit(1))
scala> :type windowSpec
org.apache.spark.sql.expressions.WindowSpec
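As another sketch, rowsBetween can define a running total over the rows seen so far in a partition (the dataset below is made up for the example):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val orders = Seq(
  (1, "2017-01-01", 50),
  (1, "2017-01-02", 20),
  (2, "2017-01-01", 70)).toDF("orderId", "day", "amount")

// running total of amount per orderId, ordered by day
val runningTotal = Window
  .partitionBy($"orderId")
  .orderBy($"day")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val withTotal = orders.withColumn("running_total", sum($"amount") over runningTotal)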

Creating "Empty" WindowSpec —  spec Internal Method

spec: WindowSpec

spec creates an "empty" WindowSpec, i.e. with empty partition and ordering specifications,

and a UnspecifiedFrame .

Note: spec is used when:

Column.over operator is used (with no WindowSpec)

Window utility object is requested to partitionBy, orderBy, rowsBetween and rangeBetween


Standard Functions — functions Object


org.apache.spark.sql.functions object defines built-in standard functions to work with

(values produced by) columns.

You can access the standard functions using the following import statement in your Scala
application:

import org.apache.spark.sql.functions._

Table 1. (Subset of) Standard Functions in Spark SQL

Aggregate functions

approx_count_distinct
  approx_count_distinct(e: Column): Column
  approx_count_distinct(columnName: String): Column
  approx_count_distinct(e: Column, rsd: Double): Column
  approx_count_distinct(columnName: String, rsd: Double): Column

avg
  avg(e: Column): Column
  avg(columnName: String): Column

collect_list
  collect_list(e: Column): Column
  collect_list(columnName: String): Column

collect_set
  collect_set(e: Column): Column
  collect_set(columnName: String): Column

corr
  corr(column1: Column, column2: Column): Column
  corr(columnName1: String, columnName2: String): Column

count
  count(e: Column): Column
  count(columnName: String): TypedColumn[Any, Long]

countDistinct
  countDistinct(expr: Column, exprs: Column*): Column
  countDistinct(columnName: String, columnNames: String*): Column

covar_pop
  covar_pop(column1: Column, column2: Column): Column
  covar_pop(columnName1: String, columnName2: String): Column

covar_samp
  covar_samp(column1: Column, column2: Column): Column
  covar_samp(columnName1: String, columnName2: String): Column

first
  first(e: Column): Column
  first(e: Column, ignoreNulls: Boolean): Column
  first(columnName: String): Column
  first(columnName: String, ignoreNulls: Boolean): Column
  Returns the first value in a group. Returns the first non-null value with the ignoreNulls flag on. If all values are null, returns null.

grouping
  grouping(e: Column): Column
  grouping(columnName: String): Column
  Indicates whether a given column is aggregated or not

grouping_id
  grouping_id(cols: Column*): Column
  grouping_id(colName: String, colNames: String*): Column
  Computes the level of grouping

kurtosis
  kurtosis(e: Column): Column
  kurtosis(columnName: String): Column

last
  last(e: Column): Column
  last(columnName: String): Column
  last(e: Column, ignoreNulls: Boolean): Column
  last(columnName: String, ignoreNulls: Boolean): Column

max
  max(e: Column): Column
  max(columnName: String): Column

mean
  mean(e: Column): Column
  mean(columnName: String): Column

min
  min(e: Column): Column
  min(columnName: String): Column

skewness
  skewness(e: Column): Column
  skewness(columnName: String): Column

stddev
  stddev(e: Column): Column
  stddev(columnName: String): Column

stddev_pop
  stddev_pop(e: Column): Column
  stddev_pop(columnName: String): Column

stddev_samp
  stddev_samp(e: Column): Column
  stddev_samp(columnName: String): Column

sum
  sum(e: Column): Column
  sum(columnName: String): Column

sumDistinct
  sumDistinct(e: Column): Column
  sumDistinct(columnName: String): Column

variance
  variance(e: Column): Column
  variance(columnName: String): Column

var_pop
  var_pop(e: Column): Column
  var_pop(columnName: String): Column

var_samp
  var_samp(e: Column): Column
  var_samp(columnName: String): Column

Collection functions

array_contains
  array_contains(column: Column, value: Any): Column

array_distinct
  array_distinct(e: Column): Column
  (New in 2.4.0)

array_except
  array_except(col1: Column, col2: Column): Column
  (New in 2.4.0)

array_intersect
  array_intersect(col1: Column, col2: Column): Column
  (New in 2.4.0)

array_join
  array_join(column: Column, delimiter: String): Column
  array_join(column: Column, delimiter: String, nullReplacement: String): Column
  (New in 2.4.0)

array_max
  array_max(e: Column): Column
  (New in 2.4.0)

array_min
  array_min(e: Column): Column
  (New in 2.4.0)

array_position
  array_position(column: Column, value: Any): Column
  (New in 2.4.0)

array_remove
  array_remove(column: Column, element: Any): Column
  (New in 2.4.0)

array_repeat
  array_repeat(e: Column, count: Int): Column
  array_repeat(left: Column, right: Column): Column
  (New in 2.4.0)

array_sort
  array_sort(e: Column): Column
  (New in 2.4.0)

array_union
  array_union(col1: Column, col2: Column): Column
  (New in 2.4.0)

arrays_zip
  arrays_zip(e: Column*): Column
  (New in 2.4.0)

arrays_overlap
  arrays_overlap(a1: Column, a2: Column): Column
  (New in 2.4.0)

element_at
  element_at(column: Column, value: Any): Column
  (New in 2.4.0)

explode
  explode(e: Column): Column

explode_outer
  explode_outer(e: Column): Column
  Creates a new row for each element in the given array or map column. If the array/map is null or empty then null is produced.

flatten
  flatten(e: Column): Column
  (New in 2.4.0)

from_json
  from_json(e: Column, schema: Column): Column (New in 2.4.0)
  from_json(e: Column, schema: DataType): Column
  from_json(e: Column, schema: DataType, options: Map[String, String]): Column
  from_json(e: Column, schema: String, options: Map[String, String]): Column
  from_json(e: Column, schema: StructType): Column
  from_json(e: Column, schema: StructType, options: Map[String, String]): Column
  Parses a column with a JSON string into a StructType or ArrayType of StructType elements with the specified schema.

map_concat
  map_concat(cols: Column*): Column
  (New in 2.4.0)

map_from_entries
  map_from_entries(e: Column): Column
  (New in 2.4.0)

map_keys
  map_keys(e: Column): Column

map_values
  map_values(e: Column): Column

posexplode
  posexplode(e: Column): Column

posexplode_outer
  posexplode_outer(e: Column): Column

reverse
  reverse(e: Column): Column
  Returns a reversed string or an array with reverse order of elements (support for reversing arrays is new in 2.4.0)

schema_of_json
  schema_of_json(json: Column): Column
  schema_of_json(json: String): Column
  (New in 2.4.0)

sequence
  sequence(start: Column, stop: Column): Column
  sequence(start: Column, stop: Column, step: Column): Column
  (New in 2.4.0)

shuffle
  shuffle(e: Column): Column
  (New in 2.4.0)

size
  size(e: Column): Column
  Returns the size of the given array or map. Returns -1 if null.

slice
  slice(x: Column, start: Int, length: Int): Column
  (New in 2.4.0)

Date and time functions

current_date
  current_date(): Column

current_timestamp
  current_timestamp(): Column

from_utc_timestamp
  from_utc_timestamp(ts: Column, tz: String): Column
  from_utc_timestamp(ts: Column, tz: Column): Column (New in 2.4.0)

months_between
  months_between(end: Column, start: Column): Column
  months_between(end: Column, start: Column, roundOff: Boolean): Column (New in 2.4.0)

to_date
  to_date(e: Column): Column
  to_date(e: Column, fmt: String): Column

to_timestamp
  to_timestamp(s: Column): Column
  to_timestamp(s: Column, fmt: String): Column

to_utc_timestamp
  to_utc_timestamp(ts: Column, tz: String): Column
  to_utc_timestamp(ts: Column, tz: Column): Column (New in 2.4.0)

unix_timestamp
  unix_timestamp(): Column
  unix_timestamp(s: Column): Column
  unix_timestamp(s: Column, p: String): Column
  Converts current or specified time to Unix timestamp (in seconds)

window
  window(timeColumn: Column, windowDuration: String): Column
  window(timeColumn: Column, windowDuration: String, slideDuration: String): Column
  window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column
  Generates tumbling time windows

Math functions

bin
  Converts the value of a long column to binary format

Regular functions (Non-aggregate functions)

array

broadcast

coalesce
  Gives the first non-null value among the given columns or null

col and column
  Creating Columns

expr

lit

map

monotonically_increasing_id
  Returns monotonically increasing 64-bit integers that are guaranteed to be monotonically increasing and unique, but not consecutive

struct

typedLit

when

String functions

split

upper

UDF functions

udf
  Creating UDFs

callUDF
  Executing a UDF by name with a variable-length list of columns

Window functions

cume_dist
  cume_dist(): Column
  Computes the cumulative distribution of records across window partitions

currentRow
  currentRow(): Column

dense_rank
  dense_rank(): Column
  Computes the rank of records per window partition

lag
  lag(e: Column, offset: Int): Column
  lag(columnName: String, offset: Int): Column
  lag(columnName: String, offset: Int, defaultValue: Any): Column

lead
  lead(columnName: String, offset: Int): Column
  lead(e: Column, offset: Int): Column
  lead(columnName: String, offset: Int, defaultValue: Any): Column
  lead(e: Column, offset: Int, defaultValue: Any): Column

ntile
  ntile(n: Int): Column
  Computes the ntile group

percent_rank
  percent_rank(): Column
  Computes the rank of records per window partition

rank
  rank(): Column
  Computes the rank of records per window partition

row_number
  row_number(): Column
  Computes the sequential numbering per window partition

unboundedFollowing
  unboundedFollowing(): Column

unboundedPreceding
  unboundedPreceding(): Column

Tip: This page gives only a brief overview of the many functions available in the functions object, so you should read the official documentation of the functions object.

Executing UDF by Name and Variable-Length Column List — callUDF Function

callUDF(udfName: String, cols: Column*): Column

callUDF executes a UDF by udfName with a variable-length list of columns.
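A sketch that first registers a UDF under a name so callUDF can look it up (the strlen name is made up for the example):

import org.apache.spark.sql.functions.callUDF

// register a UDF under a name
spark.udf.register("strlen", (s: String) => s.length)

val words = Seq("hello", "world").toDF("word")
val withLen = words.withColumn("len", callUDF("strlen", $"word"))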

Defining UDFs —  udf Function

udf(f: FunctionN[...]): UserDefinedFunction

The udf family of functions lets you create user-defined functions (UDFs) from Scala functions. It accepts a function f of 0 to 10 arguments, and the input and output types are automatically inferred from the type of f.

import org.apache.spark.sql.functions._
val _length: String => Int = _.length
val _lengthUDF = udf(_length)

// define a dataframe
val df = sc.parallelize(0 to 3).toDF("num")

// apply the user-defined function to "num" column


scala> df.withColumn("len", _lengthUDF($"num")).show
+---+---+
|num|len|
+---+---+
| 0| 1|
| 1| 1|
| 2| 1|
| 3| 1|
+---+---+

Since Spark 2.0.0, there is another variant of udf function:


udf(f: AnyRef, dataType: DataType): UserDefinedFunction

udf(f: AnyRef, dataType: DataType) allows you to use a Scala closure for the function

argument (as f ) and explicitly declaring the output data type (as dataType ).

// given the dataframe above

import org.apache.spark.sql.types.IntegerType
val byTwo = udf((n: Int) => n * 2, IntegerType)

scala> df.withColumn("len", byTwo($"num")).show


+---+---+
|num|len|
+---+---+
| 0| 0|
| 1| 2|
| 2| 4|
| 3| 6|
+---+---+

split Function

split(str: Column, pattern: String): Column

split function splits str column using pattern . It returns a new Column .

Note split UDF uses java.lang.String.split(String regex, int limit) method.

val df = Seq((0, "hello|world"), (1, "witaj|swiecie")).toDF("num", "input")


val withSplit = df.withColumn("split", split($"input", "[|]"))

scala> withSplit.show
+---+-------------+----------------+
|num| input| split|
+---+-------------+----------------+
| 0| hello|world| [hello, world]|
| 1|witaj|swiecie|[witaj, swiecie]|
+---+-------------+----------------+

Note .$|()[{^?*+\ are RegEx’s meta characters and are considered special.

upper Function


upper(e: Column): Column

upper function converts a string column into one with all letters in upper case. It returns a new Column.

The following example uses two functions that accept a Column and return
Note
another to showcase how to chain them.

val df = Seq((0,1,"hello"), (2,3,"world"), (2,4, "ala")).toDF("id", "val", "name")


val withUpperReversed = df.withColumn("upper", reverse(upper($"name")))

scala> withUpperReversed.show
+---+---+-----+-----+
| id|val| name|upper|
+---+---+-----+-----+
| 0| 1|hello|OLLEH|
| 2| 3|world|DLROW|
| 2| 4| ala| ALA|
+---+---+-----+-----+

Converting Long to Binary Format (in String Representation) — bin Function

bin(e: Column): Column


bin(columnName: String): Column (1)

1. Calls the first bin with columnName as a Column

bin converts the long value in a column to its binary format (i.e. as an unsigned integer in

base 2) with no extra leading 0s.


scala> spark.range(5).withColumn("binary", bin('id)).show


+---+------+
| id|binary|
+---+------+
| 0| 0|
| 1| 1|
| 2| 10|
| 3| 11|
| 4| 100|
+---+------+

val withBin = spark.range(5).withColumn("binary", bin('id))


scala> withBin.printSchema
root
|-- id: long (nullable = false)
|-- binary: string (nullable = false)

Internally, bin creates a Column with Bin unary expression.

scala> withBin.queryExecution.logical
res2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [*, bin('id) AS binary#14]
+- Range (0, 5, step=1, splits=Some(8))

Note Bin unary expression uses java.lang.Long.toBinaryString for the conversion.

Bin expression supports code generation (aka CodeGen).

val withBin = spark.range(5).withColumn("binary", bin('id))


scala> withBin.queryExecution.debug.codegen
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 ==
Note *Project [id#19L, bin(id#19L) AS binary#22]
+- *Range (0, 5, step=1, splits=Some(8))
...
/* 103 */ UTF8String project_value1 = null;
/* 104 */ project_value1 = UTF8String.fromString(java.lang.Long.toBinaryString(ra


Standard Aggregate Functions


Table 1. Standard Aggregate Functions

approx_count_distinct
  approx_count_distinct(e: Column): Column
  approx_count_distinct(columnName: String): Column
  approx_count_distinct(e: Column, rsd: Double): Column
  approx_count_distinct(columnName: String, rsd: Double): Column

avg
  avg(e: Column): Column
  avg(columnName: String): Column

collect_list
  collect_list(e: Column): Column
  collect_list(columnName: String): Column

collect_set
  collect_set(e: Column): Column
  collect_set(columnName: String): Column

corr
  corr(column1: Column, column2: Column): Column
  corr(columnName1: String, columnName2: String): Column

count
  count(e: Column): Column
  count(columnName: String): TypedColumn[Any, Long]

countDistinct
  countDistinct(expr: Column, exprs: Column*): Column
  countDistinct(columnName: String, columnNames: String*): Column

covar_pop
  covar_pop(column1: Column, column2: Column): Column
  covar_pop(columnName1: String, columnName2: String): Column

covar_samp
  covar_samp(column1: Column, column2: Column): Column
  covar_samp(columnName1: String, columnName2: String): Column

first
  first(e: Column): Column
  first(e: Column, ignoreNulls: Boolean): Column
  first(columnName: String): Column
  first(columnName: String, ignoreNulls: Boolean): Column
  Returns the first value in a group. Returns the first non-null value with the ignoreNulls flag on. If all values are null, returns null.

grouping
  grouping(e: Column): Column
  grouping(columnName: String): Column
  Indicates whether a given column is aggregated or not

grouping_id
  grouping_id(cols: Column*): Column
  grouping_id(colName: String, colNames: String*): Column
  Computes the level of grouping

kurtosis
  kurtosis(e: Column): Column
  kurtosis(columnName: String): Column

last
  last(e: Column): Column
  last(columnName: String): Column
  last(e: Column, ignoreNulls: Boolean): Column
  last(columnName: String, ignoreNulls: Boolean): Column

max
  max(e: Column): Column
  max(columnName: String): Column

mean
  mean(e: Column): Column
  mean(columnName: String): Column

min
  min(e: Column): Column
  min(columnName: String): Column

skewness
  skewness(e: Column): Column
  skewness(columnName: String): Column

stddev
  stddev(e: Column): Column
  stddev(columnName: String): Column

stddev_pop
  stddev_pop(e: Column): Column
  stddev_pop(columnName: String): Column

stddev_samp
  stddev_samp(e: Column): Column
  stddev_samp(columnName: String): Column

sum
  sum(e: Column): Column
  sum(columnName: String): Column

sumDistinct
  sumDistinct(e: Column): Column
  sumDistinct(columnName: String): Column

variance
  variance(e: Column): Column
  variance(columnName: String): Column

var_pop
  var_pop(e: Column): Column
  var_pop(columnName: String): Column

var_samp
  var_samp(e: Column): Column
  var_samp(columnName: String): Column
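A quick sketch of a few of the aggregate functions above used with groupBy (with made-up data):

import org.apache.spark.sql.functions.{avg, collect_list, countDistinct}

val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2016, 50)).toDF("city", "year", "amount")

val summary = sales
  .groupBy("city")
  .agg(
    avg("amount") as "avg_amount",
    collect_list("year") as "years",
    countDistinct("year") as "distinct_years")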

grouping Aggregate Function

grouping(e: Column): Column


grouping(columnName: String): Column (1)

1. Calls the first grouping with columnName as a Column


grouping is an aggregate function that indicates whether a specified column is aggregated

or not and:

returns 1 if the column is in a subtotal and is NULL

returns 0 if the underlying value is NULL or any other value

grouping can only be used with cube, rollup or GROUPING SETS multi-
Note dimensional aggregate operators (and is verified when Analyzer does check
analysis).

From Hive’s documentation about Grouping__ID function (that can somehow help to
understand grouping ):

When aggregates are displayed for a column its value is null . This may conflict in
case the column itself has some null values. There needs to be some way to identify
NULL in column, which means aggregate and NULL in column, which means value.

GROUPING__ID function is the solution to that.


val tmpWorkshops = Seq(


("Warsaw", 2016, 2),
("Toronto", 2016, 4),
("Toronto", 2017, 1)).toDF("city", "year", "count")

// there seems to be a bug with nulls


// and so the need for the following union
val cityNull = Seq(
(null.asInstanceOf[String], 2016, 2)).toDF("city", "year", "count")

val workshops = tmpWorkshops union cityNull

scala> workshops.show
+-------+----+-----+
| city|year|count|
+-------+----+-----+
| Warsaw|2016| 2|
|Toronto|2016| 4|
|Toronto|2017| 1|
| null|2016| 2|
+-------+----+-----+

val q = workshops
.cube("city", "year")
.agg(grouping("city"), grouping("year")) // <-- grouping here
.sort($"city".desc_nulls_last, $"year".desc_nulls_last)

scala> q.show
+-------+----+--------------+--------------+
| city|year|grouping(city)|grouping(year)|
+-------+----+--------------+--------------+
| Warsaw|2016| 0| 0|
| Warsaw|null| 0| 1|
|Toronto|2017| 0| 0|
|Toronto|2016| 0| 0|
|Toronto|null| 0| 1|
| null|2017| 1| 0|
| null|2016| 1| 0|
| null|2016| 0| 0| <-- null is city
| null|null| 0| 1| <-- null is city
| null|null| 1| 1|
+-------+----+--------------+--------------+

Internally, grouping creates a Column with Grouping expression.


val q = workshops.cube("city", "year").agg(grouping("city"))


scala> println(q.queryExecution.logical)
'Aggregate [cube(city#182, year#183)], [city#182, year#183, grouping('city) AS groupin
g(city)#705]
+- Union
:- Project [_1#178 AS city#182, _2#179 AS year#183, _3#180 AS count#184]
: +- LocalRelation [_1#178, _2#179, _3#180]
+- Project [_1#192 AS city#196, _2#193 AS year#197, _3#194 AS count#198]
+- LocalRelation [_1#192, _2#193, _3#194]

scala> println(q.queryExecution.analyzed)
Aggregate [city#724, year#725, spark_grouping_id#721], [city#724, year#725, cast((shif
tright(spark_grouping_id#721, 1) & 1) as tinyint) AS grouping(city)#720]
+- Expand [List(city#182, year#183, count#184, city#722, year#723, 0), List(city#182,
year#183, count#184, city#722, null, 1), List(city#182, year#183, count#184, null, yea
r#723, 2), List(city#182, year#183, count#184, null, null, 3)], [city#182, year#183, c
ount#184, city#724, year#725, spark_grouping_id#721]
+- Project [city#182, year#183, count#184, city#182 AS city#722, year#183 AS year#7
23]
+- Union
:- Project [_1#178 AS city#182, _2#179 AS year#183, _3#180 AS count#184]
: +- LocalRelation [_1#178, _2#179, _3#180]
+- Project [_1#192 AS city#196, _2#193 AS year#197, _3#194 AS count#198]
+- LocalRelation [_1#192, _2#193, _3#194]

Note: grouping was added to Spark SQL in [SPARK-12706] (support grouping/grouping_id function together group set).

grouping_id Aggregate Function

grouping_id(cols: Column*): Column


grouping_id(colName: String, colNames: String*): Column (1)

1. Calls the first grouping_id with colName and colNames as objects of type Column

grouping_id is an aggregate function that computes the level of grouping:

0 for combinations of each column

1 for subtotals of column 1

2 for subtotals of column 2

And so on…

val tmpWorkshops = Seq(


("Warsaw", 2016, 2),
("Toronto", 2016, 4),


("Toronto", 2017, 1)).toDF("city", "year", "count")

// there seems to be a bug with nulls


// and so the need for the following union
val cityNull = Seq(
(null.asInstanceOf[String], 2016, 2)).toDF("city", "year", "count")

val workshops = tmpWorkshops union cityNull

scala> workshops.show
+-------+----+-----+
| city|year|count|
+-------+----+-----+
| Warsaw|2016| 2|
|Toronto|2016| 4|
|Toronto|2017| 1|
| null|2016| 2|
+-------+----+-----+

val query = workshops


.cube("city", "year")
.agg(grouping_id()) // <-- all grouping columns used
.sort($"city".desc_nulls_last, $"year".desc_nulls_last)
scala> query.show
+-------+----+-------------+
| city|year|grouping_id()|
+-------+----+-------------+
| Warsaw|2016| 0|
| Warsaw|null| 1|
|Toronto|2017| 0|
|Toronto|2016| 0|
|Toronto|null| 1|
| null|2017| 2|
| null|2016| 2|
| null|2016| 0|
| null|null| 1|
| null|null| 3|
+-------+----+-------------+

scala> spark.catalog.listFunctions.filter(_.name.contains("grouping_id")).show(false)
+-----------+--------+-----------+----------------------------------------------------
+-----------+
|name |database|description|className
|isTemporary|
+-----------+--------+-----------+----------------------------------------------------
+-----------+
|grouping_id|null |null |org.apache.spark.sql.catalyst.expressions.GroupingID|
true |
+-----------+--------+-----------+----------------------------------------------------
+-----------+

// bin function gives the string representation of the binary value of the given long
column


scala> query.withColumn("bitmask", bin($"grouping_id()")).show


+-------+----+-------------+-------+
| city|year|grouping_id()|bitmask|
+-------+----+-------------+-------+
| Warsaw|2016| 0| 0|
| Warsaw|null| 1| 1|
|Toronto|2017| 0| 0|
|Toronto|2016| 0| 0|
|Toronto|null| 1| 1|
| null|2017| 2| 10|
| null|2016| 2| 10|
| null|2016| 0| 0| <-- null is city
| null|null| 3| 11|
| null|null| 1| 1|
+-------+----+-------------+-------+

The list of columns of grouping_id should match grouping columns (in cube or rollup )
exactly, or empty which means all the grouping columns (which is exactly what the function
expects).

grouping_id can only be used with cube, rollup or GROUPING SETS multi-
Note dimensional aggregate operators (and is verified when Analyzer does check
analysis).

Note Spark SQL’s grouping_id function is known as grouping__id in Hive.

From Hive’s documentation about Grouping__ID function:

When aggregates are displayed for a column its value is null . This may conflict in
case the column itself has some null values. There needs to be some way to identify
NULL in column, which means aggregate and NULL in column, which means value.

GROUPING__ID function is the solution to that.

Internally, grouping_id() creates a Column with GroupingID unevaluable expression.

Unevaluable expressions are expressions replaced by some other expressions


Note
during analysis or optimization.

// workshops dataset was defined earlier


val q = workshops
.cube("city", "year")
.agg(grouping_id())

// grouping_id function is spark_grouping_id virtual column internally


// that is resolved during analysis - see Analyzed Logical Plan
scala> q.explain(true)
== Parsed Logical Plan ==
'Aggregate [cube(city#182, year#183)], [city#182, year#183, grouping_id() AS grouping_


id()#742]
+- Union
:- Project [_1#178 AS city#182, _2#179 AS year#183, _3#180 AS count#184]
: +- LocalRelation [_1#178, _2#179, _3#180]
+- Project [_1#192 AS city#196, _2#193 AS year#197, _3#194 AS count#198]
+- LocalRelation [_1#192, _2#193, _3#194]

== Analyzed Logical Plan ==


city: string, year: int, grouping_id(): int
Aggregate [city#757, year#758, spark_grouping_id#754], [city#757, year#758, spark_grou
ping_id#754 AS grouping_id()#742]
+- Expand [List(city#182, year#183, count#184, city#755, year#756, 0), List(city#182,
year#183, count#184, city#755, null, 1), List(city#182, year#183, count#184, null, yea
r#756, 2), List(city#182, year#183, count#184, null, null, 3)], [city#182, year#183, c
ount#184, city#757, year#758, spark_grouping_id#754]
+- Project [city#182, year#183, count#184, city#182 AS city#755, year#183 AS year#7
56]
+- Union
:- Project [_1#178 AS city#182, _2#179 AS year#183, _3#180 AS count#184]
: +- LocalRelation [_1#178, _2#179, _3#180]
+- Project [_1#192 AS city#196, _2#193 AS year#197, _3#194 AS count#198]
+- LocalRelation [_1#192, _2#193, _3#194]

== Optimized Logical Plan ==


Aggregate [city#757, year#758, spark_grouping_id#754], [city#757, year#758, spark_grou
ping_id#754 AS grouping_id()#742]
+- Expand [List(city#755, year#756, 0), List(city#755, null, 1), List(null, year#756,
2), List(null, null, 3)], [city#757, year#758, spark_grouping_id#754]
+- Union
:- LocalRelation [city#755, year#756]
+- LocalRelation [city#755, year#756]

== Physical Plan ==
*HashAggregate(keys=[city#757, year#758, spark_grouping_id#754], functions=[], output=
[city#757, year#758, grouping_id()#742])
+- Exchange hashpartitioning(city#757, year#758, spark_grouping_id#754, 200)
+- *HashAggregate(keys=[city#757, year#758, spark_grouping_id#754], functions=[], o
utput=[city#757, year#758, spark_grouping_id#754])
+- *Expand [List(city#755, year#756, 0), List(city#755, null, 1), List(null, yea
r#756, 2), List(null, null, 3)], [city#757, year#758, spark_grouping_id#754]
+- Union
:- LocalTableScan [city#755, year#756]
+- LocalTableScan [city#755, year#756]

Note: grouping_id was added to Spark SQL in [SPARK-12706] (support grouping/grouping_id function together group set).


Standard Functions for Collections (Collection Functions)

Table 1. (Subset of) Standard Functions for Handling Collections

array_contains
  array_contains(column: Column, value: Any): Column

explode
  explode(e: Column): Column

explode_outer
  explode_outer(e: Column): Column
  Creates a new row for each element in the given array or map column. If the array/map is null or empty then null is produced.

from_json
  from_json(e: Column, schema: DataType): Column
  from_json(e: Column, schema: DataType, options: Map[String, String]): Column
  from_json(e: Column, schema: String, options: Map[String, String]): Column
  from_json(e: Column, schema: StructType): Column
  from_json(e: Column, schema: StructType, options: Map[String, String]): Column
  Extract data from arbitrary JSON-encoded values into a StructType or ArrayType of StructType elements with the specified schema

map_keys
  map_keys(e: Column): Column

map_values
  map_values(e: Column): Column

posexplode
  posexplode(e: Column): Column

posexplode_outer
  posexplode_outer(e: Column): Column

reverse
  reverse(e: Column): Column
  Returns a reversed string or an array with reverse order of elements (support for reversing arrays is new in 2.4.0)

size
  size(e: Column): Column
  Returns the size of the given array or map. Returns -1 if null.

reverse Collection Function

reverse(e: Column): Column

reverse …​FIXME

size Collection Function

size(e: Column): Column

size returns the size of the given array or map. Returns -1 if null .

Internally, size creates a Column with Size unary expression.

import org.apache.spark.sql.functions.size
val c = size('id)
scala> println(c.expr.asCode)
Size(UnresolvedAttribute(ArrayBuffer(id)))

posexplode Collection Function

posexplode(e: Column): Column

posexplode …​FIXME
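A quick sketch (posexplode adds the position of every element, as pos, next to the value, as col):

import org.apache.spark.sql.functions.posexplode

val letters = Seq((1, Array("a", "b", "c"))).toDF("id", "letters")
val exploded = letters.select($"id", posexplode($"letters"))
// exploded has the columns: id, pos, col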

posexplode_outer Collection Function

posexplode_outer(e: Column): Column


posexplode_outer …​FIXME

explode Collection Function

Caution FIXME

scala> Seq(Array(0,1,2)).toDF("array").withColumn("num", explode('array)).show


+---------+---+
| array|num|
+---------+---+
|[0, 1, 2]| 0|
|[0, 1, 2]| 1|
|[0, 1, 2]| 2|
+---------+---+

Note explode function is an equivalent of flatMap operator for Dataset .

explode_outer Collection Function

explode_outer(e: Column): Column

explode_outer generates a new row for each element in e array or map column.

Note: Unlike explode, explode_outer generates null when the array or map is null or empty.

val arrays = Seq((1,Seq.empty[String])).toDF("id", "array")


scala> arrays.printSchema
root
|-- id: integer (nullable = false)
|-- array: array (nullable = true)
| |-- element: string (containsNull = true)
scala> arrays.select(explode_outer($"array")).show
+----+
| col|
+----+
|null|
+----+

Internally, explode_outer creates a Column with GeneratorOuter and Explode Catalyst


expressions.


val explodeOuter = explode_outer($"array").expr


scala> println(explodeOuter.numberedTreeString)
00 generatorouter(explode('array))
01 +- explode('array)
02 +- 'array

Extracting Data from Arbitrary JSON-Encoded Values — from_json Collection Function

from_json(e: Column, schema: StructType, options: Map[String, String]): Column (1)


from_json(e: Column, schema: DataType, options: Map[String, String]): Column (2)
from_json(e: Column, schema: StructType): Column (3)
from_json(e: Column, schema: DataType): Column (4)
from_json(e: Column, schema: String, options: Map[String, String]): Column (5)

1. Calls <2> with StructType converted to DataType

2. (fixme)

3. Calls <1> with empty options

4. Relays to the other from_json with empty options

5. Uses schema as DataType in the JSON format or falls back to StructType in the DDL
format

from_json parses a column with a JSON-encoded value into a StructType or ArrayType of

StructType elements with the specified schema.

val jsons = Seq("""{ "id": 0 }""").toDF("json")

import org.apache.spark.sql.types._
val schema = new StructType()
.add($"id".int.copy(nullable = false))

import org.apache.spark.sql.functions.from_json
scala> jsons.select(from_json($"json", schema) as "ids").show
+---+
|ids|
+---+
|[0]|
+---+


Note: A schema can be one of the following:

1. DataType as a Scala object or in the JSON format

2. StructType in the DDL format

// Define the schema for JSON-encoded messages


// Note that the schema is nested (on the addresses field)
import org.apache.spark.sql.types._
val addressesSchema = new StructType()
.add($"city".string)
.add($"state".string)
.add($"zip".string)
val schema = new StructType()
.add($"firstName".string)
.add($"lastName".string)
.add($"email".string)
.add($"addresses".array(addressesSchema))
scala> schema.printTreeString
root
|-- firstName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- email: string (nullable = true)
|-- addresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| | |-- zip: string (nullable = true)

// Generate the JSON-encoded schema


// That's the variant of the schema that from_json accepts
val schemaAsJson = schema.json

// Use prettyJson to print out the JSON-encoded schema


// Only for demo purposes
scala> println(schema.prettyJson)
{
"type" : "struct",
"fields" : [ {
"name" : "firstName",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "lastName",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "email",
"type" : "string",


"nullable" : true,
"metadata" : { }
}, {
"name" : "addresses",
"type" : {
"type" : "array",
"elementType" : {
"type" : "struct",
"fields" : [ {
"name" : "city",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "state",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "zip",
"type" : "string",
"nullable" : true,
"metadata" : { }
} ]
},
"containsNull" : true
},
"nullable" : true,
"metadata" : { }
} ]
}

// Let's "validate" the JSON-encoded schema


import org.apache.spark.sql.types.DataType
val dt = DataType.fromJson(schemaAsJson)
scala> println(dt.sql)
STRUCT<`firstName`: STRING, `lastName`: STRING, `email`: STRING, `addresses`: ARRAY<ST
RUCT<`city`: STRING, `state`: STRING, `zip`: STRING>>>

// No exception means that the JSON-encoded schema should be fine


// Use it with from_json
val rawJsons = Seq("""
{
"firstName" : "Jacek",
"lastName" : "Laskowski",
"email" : "[email protected]",
"addresses" : [
{
"city" : "Warsaw",
"state" : "N/A",
"zip" : "02-791"
}
]


}
""").toDF("rawjson")
val people = rawJsons
.select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
.select("json.*") // <-- flatten the struct field
.withColumn("address", explode($"addresses")) // <-- explode the array field
.drop("addresses") // <-- no longer needed
.select("firstName", "lastName", "email", "address.*") // <-- flatten the struct fie
ld
scala> people.show
+---------+---------+---------------+------+-----+------+
|firstName| lastName| email| city|state| zip|
+---------+---------+---------------+------+-----+------+
| Jacek|Laskowski|[email protected]|Warsaw| N/A|02-791|
+---------+---------+---------------+------+-----+------+

options controls how a JSON is parsed and contains the same options as the
Note
json format.

Internally, from_json creates a Column with JsonToStructs unary expression.

from_json (creates a JsonToStructs that) uses a JSON parser in FAILFAST


Note parsing mode that simply fails early when a corrupted/malformed record is
found (and hence does not support columnNameOfCorruptRecord JSON option).

val jsons = Seq("""{ id: 0 }""").toDF("json")

import org.apache.spark.sql.types._
val schema = new StructType()
.add($"id".int.copy(nullable = false))
.add($"corrupted_records".string)
val opts = Map("columnNameOfCorruptRecord" -> "corrupted_records")
scala> jsons.select(from_json($"json", schema, opts) as "ids").show
+----+
| ids|
+----+
|null|
+----+

Note from_json corresponds to SQL’s from_json .

array_contains Collection Function

array_contains(column: Column, value: Any): Column


array_contains creates a Column that checks whether the column argument (an array) contains value, which has to be of the same type as the elements of the array.

Internally, array_contains creates a Column with a ArrayContains expression.

// Arguments must be an array followed by a value of same type as the array elements
import org.apache.spark.sql.functions.array_contains
val c = array_contains(column = $"ids", value = 1)

val ids = Seq(Seq(1,2,3), Seq(1), Seq(2,3)).toDF("ids")


val q = ids.filter(c)
scala> q.show
+---------+
| ids|
+---------+
|[1, 2, 3]|
| [1]|
+---------+

array_contains corresponds to SQL’s array_contains .

import org.apache.spark.sql.functions.array_contains
val c = array_contains(column = $"ids", value = Array(1, 2))
val e = c.expr
scala> println(e.sql)
array_contains(`ids`, [1,2])

Use SQL’s array_contains to use values from columns for the column and
Tip
value arguments.


val codes = Seq(


(Seq(1, 2, 3), 2),
(Seq(1), 1),
(Seq.empty[Int], 1),
(Seq(2, 4, 6), 0)).toDF("codes", "cd")
scala> codes.show
+---------+---+
| codes| cd|
+---------+---+
|[1, 2, 3]| 2|
| [1]| 1|
| []| 1|
|[2, 4, 6]| 0|
+---------+---+

val q = codes.where("array_contains(codes, cd)")


scala> q.show
+---------+---+
| codes| cd|
+---------+---+
|[1, 2, 3]| 2|
| [1]| 1|
+---------+---+

// array_contains standard function with Columns does NOT work. Why?!


// Asked this question on StackOverflow --> https://stackoverflow.com/q/50412939/13053
44
val q = codes.where(array_contains($"codes", $"cd"))
scala> q.show
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column
Name cd
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.functions$.array_contains(functions.scala:3046)
... 50 elided

// Thanks Russel for this excellent "workaround"


// https://stackoverflow.com/a/50413766/1305344
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.ArrayContains
val q = codes.where(new Column(ArrayContains($"codes".expr, $"cd".expr)))
scala> q.show
+---------+---+
| codes| cd|
+---------+---+
|[1, 2, 3]| 2|
| [1]| 1|
+---------+---+

map_keys Collection Function


map_keys(e: Column): Column

map_keys …​FIXME

map_values Collection Function

map_values(e: Column): Column

map_values …​FIXME
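A quick sketch of both functions on a simple map column:

import org.apache.spark.sql.functions.{map_keys, map_values}

val maps = Seq((1, Map("a" -> 1, "b" -> 2))).toDF("id", "m")
val keysAndValues = maps.select(map_keys($"m") as "keys", map_values($"m") as "values")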


Date and Time Functions


Table 1. (Subset of) Standard Functions for Date and Time

current_date         Gives current date as a date column

current_timestamp

date_format

to_date              Converts column to date type (with an optional date format)

to_timestamp         Converts column to timestamp type (with an optional timestamp format)

unix_timestamp       Converts current or specified time to Unix timestamp (in seconds)

window               Generates time windows (i.e. tumbling, sliding and delayed windows)

Current Date As Date Column —  current_date Function

current_date(): Column

current_date function gives the current date as a date column.

val df = spark.range(1).select(current_date)
scala> df.show
+--------------+
|current_date()|
+--------------+
| 2017-09-16|
+--------------+

scala> df.printSchema
root
|-- current_date(): date (nullable = false)

Internally, current_date creates a Column with CurrentDate Catalyst leaf expression.


val c = current_date()
import org.apache.spark.sql.catalyst.expressions.CurrentDate
val cd = c.expr.asInstanceOf[CurrentDate]
scala> println(cd.prettyName)
current_date

scala> println(cd.numberedTreeString)
00 current_date(None)

date_format Function

date_format(dateExpr: Column, format: String): Column

Internally, date_format creates a Column with DateFormatClass binary expression.


DateFormatClass takes the expression from dateExpr column and format .

val c = date_format($"date", "dd/MM/yyyy")

import org.apache.spark.sql.catalyst.expressions.DateFormatClass
val dfc = c.expr.asInstanceOf[DateFormatClass]
scala> println(dfc.prettyName)
date_format

scala> println(dfc.numberedTreeString)
00 date_format('date, dd/MM/yyyy, None)
01 :- 'date
02 +- dd/MM/yyyy
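A quick sketch of date_format applied to a date column:

import java.sql.Date
import org.apache.spark.sql.functions.date_format

val dates = Seq((1, Date.valueOf("2017-09-16"))).toDF("id", "date")
val formatted = dates.withColumn("dd/MM/yyyy", date_format($"date", "dd/MM/yyyy"))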

current_timestamp Function

current_timestamp(): Column

Caution FIXME

Note: current_timestamp is also available as the now function in SQL.
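A quick sketch:

import org.apache.spark.sql.functions.current_timestamp

// the resulting column is of TimestampType
val now = spark.range(1).select(current_timestamp() as "now")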

Converting Current or Specified Time to Unix Timestamp — unix_timestamp Function


unix_timestamp(): Column (1)


unix_timestamp(time: Column): Column (2)
unix_timestamp(time: Column, format: String): Column

1. Gives current timestamp (in seconds)

2. Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds)

unix_timestamp converts the current or specified time in the specified format to a Unix

timestamp (in seconds).

unix_timestamp supports a column of type Date , Timestamp or String .

// no time and format => current time


scala> spark.range(1).select(unix_timestamp as "current_timestamp").show
+-----------------+
|current_timestamp|
+-----------------+
| 1493362850|
+-----------------+

// no format so yyyy-MM-dd HH:mm:ss assumed


scala> Seq("2017-01-01 00:00:00").toDF("time").withColumn("unix_timestamp", unix_times
tamp($"time")).show
+-------------------+--------------+
| time|unix_timestamp|
+-------------------+--------------+
|2017-01-01 00:00:00| 1483225200|
+-------------------+--------------+

scala> Seq("2017/01/01 00:00:00").toDF("time").withColumn("unix_timestamp", unix_times


tamp($"time", "yyyy/MM/dd")).show
+-------------------+--------------+
| time|unix_timestamp|
+-------------------+--------------+
|2017/01/01 00:00:00| 1483225200|
+-------------------+--------------+

unix_timestamp returns null if conversion fails.

// note slashes as date separators


scala> Seq("2017/01/01 00:00:00").toDF("time").withColumn("unix_timestamp", unix_times
tamp($"time")).show
+-------------------+--------------+
| time|unix_timestamp|
+-------------------+--------------+
|2017/01/01 00:00:00| null|
+-------------------+--------------+


Note: unix_timestamp is also supported in SQL mode.

scala> spark.sql("SELECT unix_timestamp() as unix_timestamp").show
+--------------+
|unix_timestamp|
+--------------+
|    1493369225|
+--------------+

Internally, unix_timestamp creates a Column with UnixTimestamp binary expression


(possibly with CurrentTimestamp ).

Generating Time Windows —  window Function

window(
timeColumn: Column,
windowDuration: String): Column (1)
window(
timeColumn: Column,
windowDuration: String,
slideDuration: String): Column (2)
window(
timeColumn: Column,
windowDuration: String,
slideDuration: String,
startTime: String): Column (3)

1. Creates a tumbling time window with slideDuration as windowDuration and 0 second


for startTime

2. Creates a sliding time window with 0 second for startTime

3. Creates a delayed time window

window generates tumbling, sliding or delayed time windows of windowDuration duration

given a timeColumn timestamp specifying column.

From Tumbling Window (Azure Stream Analytics):

Note Tumbling windows are a series of fixed-sized, non-overlapping and


contiguous time intervals.


From Introducing Stream Windows in Apache Flink:

Tumbling windows group elements of a stream into finite sets where each
Note set corresponds to an interval.
Tumbling windows discretize a stream into non-overlapping windows.

scala> val timeColumn = window('time, "5 seconds")


timeColumn: org.apache.spark.sql.Column = timewindow(time, 5000000, 5000000, 0) AS `wi
ndow`

timeColumn should be of TimestampType, i.e. with java.sql.Timestamp values.

Tip: Use java.sql.Timestamp.from or java.sql.Timestamp.valueOf factory methods to create Timestamp instances.

// https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html
import java.time.LocalDateTime
// https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html
import java.sql.Timestamp
val levels = Seq(
// (year, month, dayOfMonth, hour, minute, second)
((2012, 12, 12, 12, 12, 12), 5),
((2012, 12, 12, 12, 12, 14), 9),
((2012, 12, 12, 13, 13, 14), 4),
((2016, 8, 13, 0, 0, 0), 10),
((2017, 5, 27, 0, 0, 0), 15)).
map { case ((yy, mm, dd, h, m, s), a) => (LocalDateTime.of(yy, mm, dd, h, m, s), a)
}.
map { case (ts, a) => (Timestamp.valueOf(ts), a) }.
toDF("time", "level")
scala> levels.show
+-------------------+-----+
| time|level|
+-------------------+-----+
|2012-12-12 12:12:12| 5|
|2012-12-12 12:12:14| 9|
|2012-12-12 13:13:14| 4|
|2016-08-13 00:00:00| 10|
|2017-05-27 00:00:00| 15|
+-------------------+-----+

val q = levels.select(window($"time", "5 seconds"), $"level")


scala> q.show(truncate = false)
+---------------------------------------------+-----+
|window |level|
+---------------------------------------------+-----+
|[2012-12-12 12:12:10.0,2012-12-12 12:12:15.0]|5 |
|[2012-12-12 12:12:10.0,2012-12-12 12:12:15.0]|9 |
|[2012-12-12 13:13:10.0,2012-12-12 13:13:15.0]|4 |


|[2016-08-13 00:00:00.0,2016-08-13 00:00:05.0]|10 |


|[2017-05-27 00:00:00.0,2017-05-27 00:00:05.0]|15 |
+---------------------------------------------+-----+

scala> q.printSchema
root
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- level: integer (nullable = false)

// calculating the sum of levels every 5 seconds


val sums = levels.
groupBy(window($"time", "5 seconds")).
agg(sum("level") as "level_sum").
select("window.start", "window.end", "level_sum")
scala> sums.show
+-------------------+-------------------+---------+
| start| end|level_sum|
+-------------------+-------------------+---------+
|2012-12-12 13:13:10|2012-12-12 13:13:15| 4|
|2012-12-12 12:12:10|2012-12-12 12:12:15| 14|
|2016-08-13 00:00:00|2016-08-13 00:00:05| 10|
|2017-05-27 00:00:00|2017-05-27 00:00:05| 15|
+-------------------+-------------------+---------+

windowDuration and slideDuration are strings specifying the width of the window for

duration and sliding identifiers, respectively.

Tip Use CalendarInterval for valid window identifiers.

Note window is available as of Spark 2.0.0.

Internally, window creates a Column (with TimeWindow expression) available as window


alias.


// q is the query defined earlier


scala> q.show(truncate = false)
+---------------------------------------------+-----+
|window |level|
+---------------------------------------------+-----+
|[2012-12-12 12:12:10.0,2012-12-12 12:12:15.0]|5 |
|[2012-12-12 12:12:10.0,2012-12-12 12:12:15.0]|9 |
|[2012-12-12 13:13:10.0,2012-12-12 13:13:15.0]|4 |
|[2016-08-13 00:00:00.0,2016-08-13 00:00:05.0]|10 |
|[2017-05-27 00:00:00.0,2017-05-27 00:00:05.0]|15 |
+---------------------------------------------+-----+

scala> println(timeColumn.expr.numberedTreeString)
00 timewindow('time, 5000000, 5000000, 0) AS window#22
01 +- timewindow('time, 5000000, 5000000, 0)
02 +- 'time

Example — Traffic Sensor

Note The example is borrowed from Introducing Stream Windows in Apache Flink.

The example shows how to use window function to model a traffic sensor that counts every
15 seconds the number of vehicles passing a certain location.

Converting Column To DateType —  to_date Function

to_date(e: Column): Column


to_date(e: Column, fmt: String): Column

to_date converts the column into DateType (by casting to DateType ).

Note fmt follows the date format patterns of java.text.SimpleDateFormat.

Internally, to_date creates a Column with ParseToDate expression (and Literal expression for fmt).

Tip Use ParseToDate expression to use a column for the values of fmt .
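
A quick check of to_date (a sketch, not one of the book’s examples; it assumes a spark-shell session with implicits in scope):

import org.apache.spark.sql.functions.to_date
val dates = Seq("2017-05-27", "27/05/2017").toDF("s")
// rows that cannot be parsed with the default format or the given fmt become null
dates.select($"s", to_date($"s"), to_date($"s", "dd/MM/yyyy")).show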

Converting Column To TimestampType —  to_timestamp Function

to_timestamp(s: Column): Column


to_timestamp(s: Column, fmt: String): Column


to_timestamp converts the column into TimestampType (by casting to TimestampType ).

Note fmt follows the date format patterns of java.text.SimpleDateFormat.

Internally, to_timestamp creates a Column with ParseToTimestamp expression (and Literal expression for fmt).

Tip Use ParseToTimestamp expression to use a column for the values of fmt .
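
A quick check of to_timestamp with a custom fmt (a sketch, not one of the book’s examples):

import org.apache.spark.sql.functions.to_timestamp
val times = Seq("2017-05-27 12:30:00", "27/05/2017 12:30").toDF("s")
// fmt lets you parse values that do not follow the default timestamp pattern
times.select($"s", to_timestamp($"s"), to_timestamp($"s", "dd/MM/yyyy HH:mm")).show(truncate = false)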


Regular Functions (Non-Aggregate Functions)


Table 1. (Subset of) Regular Functions
Name                          Description
array
broadcast
coalesce                      Gives the first non-null value among the given columns or null.
col and column                Creating Columns
expr
lit
map
monotonically_increasing_id
struct
typedLit
when

broadcast Function

broadcast[T](df: Dataset[T]): Dataset[T]

broadcast function marks the input Dataset as small enough to be used in broadcast join.

Tip Read up on Broadcast Joins (aka Map-Side Joins).


val left = Seq((0, "aa"), (0, "bb")).toDF("id", "token").as[(Int, String)]


val right = Seq(("aa", 0.99), ("bb", 0.57)).toDF("token", "prob").as[(String, Double)]

scala> left.join(broadcast(right), "token").explain(extended = true)


== Parsed Logical Plan ==
'Join UsingJoin(Inner,List(token))
:- Project [_1#123 AS id#126, _2#124 AS token#127]
: +- LocalRelation [_1#123, _2#124]
+- BroadcastHint
+- Project [_1#136 AS token#139, _2#137 AS prob#140]
+- LocalRelation [_1#136, _2#137]

== Analyzed Logical Plan ==


token: string, id: int, prob: double
Project [token#127, id#126, prob#140]
+- Join Inner, (token#127 = token#139)
:- Project [_1#123 AS id#126, _2#124 AS token#127]
: +- LocalRelation [_1#123, _2#124]
+- BroadcastHint
+- Project [_1#136 AS token#139, _2#137 AS prob#140]
+- LocalRelation [_1#136, _2#137]

== Optimized Logical Plan ==


Project [token#127, id#126, prob#140]
+- Join Inner, (token#127 = token#139)
:- Project [_1#123 AS id#126, _2#124 AS token#127]
: +- Filter isnotnull(_2#124)
: +- LocalRelation [_1#123, _2#124]
+- BroadcastHint
+- Project [_1#136 AS token#139, _2#137 AS prob#140]
+- Filter isnotnull(_1#136)
+- LocalRelation [_1#136, _2#137]

== Physical Plan ==
*Project [token#127, id#126, prob#140]
+- *BroadcastHashJoin [token#127], [token#139], Inner, BuildRight
:- *Project [_1#123 AS id#126, _2#124 AS token#127]
: +- *Filter isnotnull(_2#124)
: +- LocalTableScan [_1#123, _2#124]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- *Project [_1#136 AS token#139, _2#137 AS prob#140]
+- *Filter isnotnull(_1#136)
+- LocalTableScan [_1#136, _2#137]

Note broadcast standard function is a special case of Dataset.hint operator that allows for attaching any hint to a logical plan.

coalesce Function


coalesce(e: Column*): Column

coalesce gives the first non- null value among the given columns or null .

coalesce requires at least one column and all columns have to be of the same or

compatible types.

Internally, coalesce creates a Column with a Coalesce expression (with the children being
the expressions of the input Column ).

Example: coalesce Function

val q = spark.range(2)
.select(
coalesce(
lit(null),
lit(null),
lit(2) + 2,
$"id") as "first non-null value")
scala> q.show
+--------------------+
|first non-null value|
+--------------------+
| 4|
| 4|
+--------------------+

Creating Columns —  col and column Functions

col(colName: String): Column


column(colName: String): Column

col and column methods create a Column that you can later use to reference a column in

a dataset.

import org.apache.spark.sql.functions._

scala> val nameCol = col("name")


nameCol: org.apache.spark.sql.Column = name

scala> val cityCol = column("city")


cityCol: org.apache.spark.sql.Column = city


expr Function

expr(expr: String): Column

expr function parses the input expr SQL expression into the Column it represents.

val ds = Seq((0, "hello"), (1, "world"))


.toDF("id", "token")
.as[(Long, String)]

scala> ds.show
+---+-----+
| id|token|
+---+-----+
| 0|hello|
| 1|world|
+---+-----+

val filterExpr = expr("token = 'hello'")

scala> ds.filter(filterExpr).show
+---+-----+
| id|token|
+---+-----+
| 0|hello|
+---+-----+

Internally, expr uses the active session’s sqlParser or creates a new SparkSqlParser to call
parseExpression method.

lit Function

lit(literal: Any): Column

lit function creates a Column with a literal (constant) value.
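
A minimal sketch (not one of the book’s examples) of lit in a spark-shell session:

import org.apache.spark.sql.functions.lit
val q = spark.range(2).select('id, lit(1) as "one", lit("left") as "side")
scala> q.show
+---+---+----+
| id|one|side|
+---+---+----+
|  0|  1|left|
|  1|  1|left|
+---+---+----+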

struct Functions

struct(cols: Column*): Column


struct(colName: String, colNames: String*): Column

struct family of functions allows you to create a new struct column based on a collection of

Column or their names.


Note The difference between struct and another similar array function is that the types of the columns can be different (in struct).

scala> df.withColumn("struct", struct($"name", $"val")).show


+---+---+-----+---------+
| id|val| name| struct|
+---+---+-----+---------+
| 0| 1|hello|[hello,1]|
| 2| 3|world|[world,3]|
| 2| 4| ala| [ala,4]|
+---+---+-----+---------+

typedLit Function

typedLit[T : TypeTag](literal: T): Column

typedLit creates a Column with a literal value and, unlike lit, can handle parameterized Scala types, e.g. List, Seq and Map.
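
A minimal sketch (not one of the book’s examples):

import org.apache.spark.sql.functions.typedLit
// lit cannot create literals of Scala collection types, typedLit can
val q = spark.range(1).select(
  typedLit(Seq(1, 2, 3)) as "ints",
  typedLit(Map("a" -> 1)) as "m")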

array Function

array(cols: Column*): Column


array(colName: String, colNames: String*): Column

array …​FIXME

map Function

map(cols: Column*): Column

map …​FIXME

when Function

when(condition: Column, value: Any): Column

when evaluates the condition and gives value for the rows where it holds; chained with Column.otherwise (or more when calls) it builds a CASE WHEN-like conditional expression.
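
A minimal sketch (not one of the book’s examples) of when paired with otherwise:

import org.apache.spark.sql.functions.when
val q = spark.range(5).select('id,
  when('id % 2 === 0, "even").otherwise("odd") as "parity")
scala> q.show
+---+------+
| id|parity|
+---+------+
|  0|  even|
|  1|   odd|
|  2|  even|
|  3|   odd|
|  4|  even|
+---+------+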

monotonically_increasing_id Function


monotonically_increasing_id(): Column

monotonically_increasing_id returns monotonically increasing 64-bit integers. The generated IDs are guaranteed to be monotonically increasing and unique, but not consecutive (unless all rows are in a single partition, which you rarely want due to the amount of data).

val q = spark.range(1).select(monotonically_increasing_id)
scala> q.show
+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
| 60129542144|
+-----------------------------+

The current implementation uses the partition ID in the upper 31 bits, and the lower 33 bits
represent the record number within each partition. That assumes that the data set has less
than 1 billion partitions, and each partition has less than 8 billion records.


// Demo to show the internals of monotonically_increasing_id function


// i.e. how MonotonicallyIncreasingID expression works

// Create a dataset with the same number of rows per partition


val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)

// Make sure that every partition has the same number of rows
q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
q.select(monotonically_increasing_id).show

// Assign consecutive IDs for rows per partition


import org.apache.spark.sql.expressions.Window
// count is the name of the internal registry of MonotonicallyIncreasingID to count rows
// Could also be "id" since it is unique and consecutive in a partition
import org.apache.spark.sql.functions.{row_number, shiftLeft, spark_partition_id}
val rowNumber = row_number over Window.partitionBy(spark_partition_id).orderBy("id")
// row_number is a sequential number starting at 1 within a window partition
val count = rowNumber - 1 as "count"
val partitionMask = shiftLeft(spark_partition_id cast "long", 33) as "partitionMask"
// FIXME Why does the following sum give "weird" results?!
val sum = (partitionMask + count) as "partitionMask + count"
val demo = q.select(
$"id",
partitionMask,
count,
// FIXME sum,
monotonically_increasing_id)
scala> demo.orderBy("id").show
+---+-------------+-----+-----------------------------+
| id|partitionMask|count|monotonically_increasing_id()|
+---+-------------+-----+-----------------------------+
| 0| 0| 0| 0|
| 1| 0| 1| 1|
| 2| 8589934592| 0| 8589934592|
| 3| 8589934592| 1| 8589934593|
| 4| 17179869184| 0| 17179869184|
| 5| 17179869184| 1| 17179869185|
| 6| 25769803776| 0| 25769803776|
| 7| 25769803776| 1| 25769803777|
+---+-------------+-----+-----------------------------+

Internally, monotonically_increasing_id creates a Column with a MonotonicallyIncreasingID non-deterministic leaf expression.


Standard Functions for Window Aggregation (Window Functions)

Window aggregate functions (aka window functions or windowed aggregates) are
functions that perform a calculation over a group of records called window that are in some
relation to the current record (i.e. can be in the same partition or frame as the current row).

In other words, when executed, a window function computes a value for each and every row
in a window (per window specification).

Note Window functions are also called over functions due to how they are applied using over operator.

Spark SQL supports three kinds of window functions:

ranking functions

analytic functions

aggregate functions

Table 1. Window Aggregate Functions in Spark SQL
Function                                            Purpose
rank, dense_rank, percent_rank, ntile, row_number   Ranking functions
cume_dist, lag, lead                                Analytic functions

For aggregate functions, you can use the existing aggregate functions as window functions,
e.g. sum , avg , min , max and count .


// Borrowed from 3.5. Window Functions in PostgreSQL documentation


// Example of window functions using Scala API
//
case class Salary(depName: String, empNo: Long, salary: Long)
val empsalary = Seq(
Salary("sales", 1, 5000),
Salary("personnel", 2, 3900),
Salary("sales", 3, 4800),
Salary("sales", 4, 4800),
Salary("personnel", 5, 3500),
Salary("develop", 7, 4200),
Salary("develop", 8, 6000),
Salary("develop", 9, 4500),
Salary("develop", 10, 5200),
Salary("develop", 11, 5200)).toDS

import org.apache.spark.sql.expressions.Window
// Windows are partitions of deptName
scala> val byDepName = Window.partitionBy('depName)
byDepName: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@1a711314

scala> empsalary.withColumn("avg", avg('salary) over byDepName).show


+---------+-----+------+-----------------+
| depName|empNo|salary| avg|
+---------+-----+------+-----------------+
| develop| 7| 4200| 5020.0|
| develop| 8| 6000| 5020.0|
| develop| 9| 4500| 5020.0|
| develop| 10| 5200| 5020.0|
| develop| 11| 5200| 5020.0|
| sales| 1| 5000|4866.666666666667|
| sales| 3| 4800|4866.666666666667|
| sales| 4| 4800|4866.666666666667|
|personnel| 2| 3900| 3700.0|
|personnel| 5| 3500| 3700.0|
+---------+-----+------+-----------------+

You describe a window using the convenient factory methods in Window object that create a
window specification that you can further refine with partitioning, ordering, and frame
boundaries.

After you describe a window you can apply window aggregate functions like ranking
functions (e.g. RANK ), analytic functions (e.g. LAG ), and the regular aggregate functions,
e.g. sum , avg , max .

Note Window functions are supported in structured queries using SQL and Column-based expressions.


Although similar to aggregate functions, a window function does not group rows into a single
output row and retains their separate identities. A window function can access rows that are
linked to the current row.

Note The main difference between window aggregate functions and aggregate functions with grouping operators is that the former calculate values for every row in a window while the latter give you at most the number of input rows, one value per group.

Tip See Examples section in this document.

You can mark a function as a window function with the OVER clause after a function in SQL, e.g. avg(revenue) OVER (…), or with the over method on a function in the Dataset API, e.g. rank().over(…).

Note Window functions belong to Window functions group in Spark’s Scala API.

Note Window-based framework is available as an experimental feature since Spark 1.4.0.

Window object
Window object provides functions to define windows (as WindowSpec instances).

Window object lives in org.apache.spark.sql.expressions package. Import it to use Window

functions.

import org.apache.spark.sql.expressions.Window

There are two families of the functions available in Window object that create WindowSpec
instance for one or many Column instances:

partitionBy

orderBy

Partitioning Records —  partitionBy Methods

partitionBy(colName: String, colNames: String*): WindowSpec


partitionBy(cols: Column*): WindowSpec

partitionBy creates an instance of WindowSpec with partition expression(s) defined for one

or more columns.


// partition records into two groups


// * tokens starting with "h"
// * others
val byHTokens = Window.partitionBy('token startsWith "h")

// count the sum of ids in each group
val result = tokens.select('*, sum('id) over byHTokens as "sum over h tokens").orderBy('id)

scala> result.show
+---+-----+-----------------+
| id|token|sum over h tokens|
+---+-----+-----------------+
| 0|hello| 4|
| 1|henry| 4|
| 2| and| 2|
| 3|harry| 4|
+---+-----+-----------------+

Ordering in Windows —  orderBy Methods

orderBy(colName: String, colNames: String*): WindowSpec


orderBy(cols: Column*): WindowSpec

orderBy allows you to control the order of records in a window.


import org.apache.spark.sql.expressions.Window
val byDepnameSalaryDesc = Window.partitionBy('depname).orderBy('salary desc)

// a numerical rank within the current row's partition for each distinct ORDER BY value

scala> val rankByDepname = rank().over(byDepnameSalaryDesc)


rankByDepname: org.apache.spark.sql.Column = RANK() OVER (PARTITION BY depname ORDER BY salary DESC UnspecifiedFrame)

scala> empsalary.select('*, rankByDepname as 'rank).show


+---------+-----+------+----+
| depName|empNo|salary|rank|
+---------+-----+------+----+
| develop| 8| 6000| 1|
| develop| 10| 5200| 2|
| develop| 11| 5200| 2|
| develop| 9| 4500| 4|
| develop| 7| 4200| 5|
| sales| 1| 5000| 1|
| sales| 3| 4800| 2|
| sales| 4| 4800| 2|
|personnel| 2| 3900| 1|
|personnel| 5| 3500| 2|
+---------+-----+------+----+

rangeBetween Method

rangeBetween(start: Long, end: Long): WindowSpec

rangeBetween creates a WindowSpec with the frame boundaries from start (inclusive) to

end (inclusive).

Note It is recommended to use Window.unboundedPreceding, Window.unboundedFollowing and Window.currentRow to describe the frame boundaries when a frame is unbounded preceding, unbounded following and at current row, respectively.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.WindowSpec
val spec: WindowSpec = Window.rangeBetween(Window.unboundedPreceding, Window.currentRow)

Internally, rangeBetween creates a WindowSpec with SpecifiedWindowFrame and RangeFrame type.
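
A sketch of the frame in action (not one of the book’s examples): a running total over a frame that spans from the beginning of the partition up to the current row.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val upToCurrentRow = Window
  .orderBy("id")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
val runningTotal = spark.range(1, 6).withColumn("running_total", sum("id") over upToCurrentRow)
scala> runningTotal.show
+---+-------------+
| id|running_total|
+---+-------------+
|  1|            1|
|  2|            3|
|  3|            6|
|  4|           10|
|  5|           15|
+---+-------------+

Since no partitioning is defined, Spark warns that all data moves to a single partition for this window operation.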


Frame
At its core, a window function calculates a return value for every input row of a table based
on a group of rows, called the frame. Every input row can have a unique frame associated
with it.

When you define a frame you have to specify three components of a frame specification -
the start and end boundaries, and the type.

Types of boundaries (two positions and three offsets):

UNBOUNDED PRECEDING - the first row of the partition

UNBOUNDED FOLLOWING - the last row of the partition

CURRENT ROW

<value> PRECEDING

<value> FOLLOWING

Offsets specify the offset from the current input row.

Types of frames:

ROW - based on physical offsets from the position of the current input row

RANGE - based on logical offsets from the position of the current input row

In the current implementation of WindowSpec you can use two methods to define a frame:

rowsBetween

rangeBetween

See WindowSpec for their coverage.

Window Operators in SQL Queries


The grammar of windows operators in SQL accepts the following:

1. CLUSTER BY or PARTITION BY or DISTRIBUTE BY for partitions,

2. ORDER BY or SORT BY for sorting order,

3. RANGE , ROWS , RANGE BETWEEN , and ROWS BETWEEN for window frame types,

4. UNBOUNDED PRECEDING , UNBOUNDED FOLLOWING , CURRENT ROW for frame bounds.

Tip Consult withWindows helper in AstBuilder .
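
A sketch of the SQL syntax in action (using the empsalary dataset defined earlier on this page):

empsalary.createOrReplaceTempView("empsalary")
val q = sql("""
  SELECT depName, empNo, salary,
         RANK() OVER (PARTITION BY depName ORDER BY salary DESC) AS rank
  FROM empsalary
""")
// gives the same result as the rank().over(byDepnameSalaryDesc) query shown earlier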


Examples

Top N per Group


Top N per Group is useful when you need to compute the first and second best-sellers in
category.

Note This example is borrowed from an excellent article Introducing Window Functions in Spark SQL.

Table 2. Table PRODUCT_REVENUE


product category revenue
Thin cell phone 6000

Normal tablet 1500

Mini tablet 5500

Ultra thin cell phone 5000

Very thin cell phone 6000

Big tablet 2500

Bendable cell phone 3000

Foldable cell phone 3000

Pro tablet 4500

Pro2 tablet 6500

Question: What are the best-selling and the second best-selling products in every category?


val dataset = Seq(


("Thin", "cell phone", 6000),
("Normal", "tablet", 1500),
("Mini", "tablet", 5500),
("Ultra thin", "cell phone", 5000),
("Very thin", "cell phone", 6000),
("Big", "tablet", 2500),
("Bendable", "cell phone", 3000),
("Foldable", "cell phone", 3000),
("Pro", "tablet", 4500),
("Pro2", "tablet", 6500))
.toDF("product", "category", "revenue")

scala> dataset.show
+----------+----------+-------+
| product| category|revenue|
+----------+----------+-------+
| Thin|cell phone| 6000|
| Normal| tablet| 1500|
| Mini| tablet| 5500|
|Ultra thin|cell phone| 5000|
| Very thin|cell phone| 6000|
| Big| tablet| 2500|
| Bendable|cell phone| 3000|
| Foldable|cell phone| 3000|
| Pro| tablet| 4500|
| Pro2| tablet| 6500|
+----------+----------+-------+

scala> dataset.where('category === "tablet").show


+-------+--------+-------+
|product|category|revenue|
+-------+--------+-------+
| Normal| tablet| 1500|
| Mini| tablet| 5500|
| Big| tablet| 2500|
| Pro| tablet| 4500|
| Pro2| tablet| 6500|
+-------+--------+-------+

The question boils down to ranking products in a category based on their revenue, and to pick the best-selling and the second best-selling products based on the ranking.


import org.apache.spark.sql.expressions.Window
val overCategory = Window.partitionBy('category).orderBy('revenue.desc)

val ranked = dataset.withColumn("rank", dense_rank.over(overCategory))

scala> ranked.show
+----------+----------+-------+----+
| product| category|revenue|rank|
+----------+----------+-------+----+
| Pro2| tablet| 6500| 1|
| Mini| tablet| 5500| 2|
| Pro| tablet| 4500| 3|
| Big| tablet| 2500| 4|
| Normal| tablet| 1500| 5|
| Thin|cell phone| 6000| 1|
| Very thin|cell phone| 6000| 1|
|Ultra thin|cell phone| 5000| 2|
| Bendable|cell phone| 3000| 3|
| Foldable|cell phone| 3000| 3|
+----------+----------+-------+----+

scala> ranked.where('rank <= 2).show


+----------+----------+-------+----+
| product| category|revenue|rank|
+----------+----------+-------+----+
| Pro2| tablet| 6500| 1|
| Mini| tablet| 5500| 2|
| Thin|cell phone| 6000| 1|
| Very thin|cell phone| 6000| 1|
|Ultra thin|cell phone| 5000| 2|
+----------+----------+-------+----+

Revenue Difference per Category

Note This example is the 2nd example from an excellent article Introducing Window Functions in Spark SQL.


import org.apache.spark.sql.expressions.Window
val reveDesc = Window.partitionBy('category).orderBy('revenue.desc)
val reveDiff = max('revenue).over(reveDesc) - 'revenue

scala> dataset.select('*, reveDiff as 'revenue_diff).show


+----------+----------+-------+------------+
| product| category|revenue|revenue_diff|
+----------+----------+-------+------------+
| Pro2| tablet| 6500| 0|
| Mini| tablet| 5500| 1000|
| Pro| tablet| 4500| 2000|
| Big| tablet| 2500| 4000|
| Normal| tablet| 1500| 5000|
| Thin|cell phone| 6000| 0|
| Very thin|cell phone| 6000| 0|
|Ultra thin|cell phone| 5000| 1000|
| Bendable|cell phone| 3000| 3000|
| Foldable|cell phone| 3000| 3000|
+----------+----------+-------+------------+

Difference on Column
Compute a difference between values in rows in a column.


val pairs = for {


x <- 1 to 5
y <- 1 to 2
} yield (x, 10 * x * y)
val ds = pairs.toDF("ns", "tens")

scala> ds.show
+---+----+
| ns|tens|
+---+----+
| 1| 10|
| 1| 20|
| 2| 20|
| 2| 40|
| 3| 30|
| 3| 60|
| 4| 40|
| 4| 80|
| 5| 50|
| 5| 100|
+---+----+

import org.apache.spark.sql.expressions.Window
val overNs = Window.partitionBy('ns).orderBy('tens)
val diff = lead('tens, 1).over(overNs)

scala> ds.withColumn("diff", diff - 'tens).show


+---+----+----+
| ns|tens|diff|
+---+----+----+
| 1| 10| 10|
| 1| 20|null|
| 3| 30| 30|
| 3| 60|null|
| 5| 50| 50|
| 5| 100|null|
| 4| 40| 40|
| 4| 80|null|
| 2| 20| 20|
| 2| 40|null|
+---+----+----+

See also the question Why do Window functions fail with "Window function X does not take a frame specification"?

The key here is to remember that DataFrames are RDDs under the covers and hence
aggregation like grouping by a key in DataFrames is RDD’s groupBy (or worse,
reduceByKey or aggregateByKey transformations).


Running Total
The running total is the sum of all previous rows including the current one.

val sales = Seq(


(0, 0, 0, 5),
(1, 0, 1, 3),
(2, 0, 2, 1),
(3, 1, 0, 2),
(4, 2, 0, 8),
(5, 2, 2, 8))
.toDF("id", "orderID", "prodID", "orderQty")

scala> sales.show
+---+-------+------+--------+
| id|orderID|prodID|orderQty|
+---+-------+------+--------+
| 0| 0| 0| 5|
| 1| 0| 1| 3|
| 2| 0| 2| 1|
| 3| 1| 0| 2|
| 4| 2| 0| 8|
| 5| 2| 2| 8|
+---+-------+------+--------+

val orderedByID = Window.orderBy('id)

val totalQty = sum('orderQty).over(orderedByID).as('running_total)


val salesTotalQty = sales.select('*, totalQty).orderBy('id)

scala> salesTotalQty.show
16/04/10 23:01:52 WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+-------+------+--------+-------------+
| id|orderID|prodID|orderQty|running_total|
+---+-------+------+--------+-------------+
| 0| 0| 0| 5| 5|
| 1| 0| 1| 3| 8|
| 2| 0| 2| 1| 9|
| 3| 1| 0| 2| 11|
| 4| 2| 0| 8| 19|
| 5| 2| 2| 8| 27|
+---+-------+------+--------+-------------+

val byOrderId = orderedByID.partitionBy('orderID)


val totalQtyPerOrder = sum('orderQty).over(byOrderId).as('running_total_per_order)
val salesTotalQtyPerOrder = sales.select('*, totalQtyPerOrder).orderBy('id)

scala> salesTotalQtyPerOrder.show
+---+-------+------+--------+-----------------------+
| id|orderID|prodID|orderQty|running_total_per_order|
+---+-------+------+--------+-----------------------+
| 0| 0| 0| 5| 5|
| 1| 0| 1| 3| 8|
| 2| 0| 2| 1| 9|
| 3| 1| 0| 2| 2|
| 4| 2| 0| 8| 8|
| 5| 2| 2| 8| 16|
+---+-------+------+--------+-----------------------+

Calculate rank of row


See "Explaining" Query Plans of Windows for an elaborate example.

Interval data type for Date and Timestamp types


See [SPARK-8943] CalendarIntervalType for time intervals.

With the Interval data type, you could use intervals as values specified in <value> PRECEDING
and <value> FOLLOWING for RANGE frame. It is specifically suited for time-series analysis with
window functions.

Accessing values of earlier rows


FIXME What’s the value of rows before current one?

Moving Average

Cumulative Aggregates
Eg. cumulative sum

User-defined aggregate functions


See [SPARK-3947] Support Scala/Java UDAF.

With the window function support, you could use user-defined aggregate functions as
window functions.

"Explaining" Query Plans of Windows


import org.apache.spark.sql.expressions.Window
val byDepnameSalaryDesc = Window.partitionBy('depname).orderBy('salary desc)

scala> val rankByDepname = rank().over(byDepnameSalaryDesc)


rankByDepname: org.apache.spark.sql.Column = RANK() OVER (PARTITION BY depname ORDER BY salary DESC UnspecifiedFrame)

// empsalary defined at the top of the page


scala> empsalary.select('*, rankByDepname as 'rank).explain(extended = true)
== Parsed Logical Plan ==
'Project [*, rank() windowspecdefinition('depname, 'salary DESC, UnspecifiedFrame) AS
rank#9]
+- LocalRelation [depName#5, empNo#6L, salary#7L]

== Analyzed Logical Plan ==


depName: string, empNo: bigint, salary: bigint, rank: int
Project [depName#5, empNo#6L, salary#7L, rank#9]
+- Project [depName#5, empNo#6L, salary#7L, rank#9, rank#9]
+- Window [rank(salary#7L) windowspecdefinition(depname#5, salary#7L DESC, ROWS BET
WEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rank#9], [depname#5], [salary#7L DESC]
+- Project [depName#5, empNo#6L, salary#7L]
+- LocalRelation [depName#5, empNo#6L, salary#7L]

== Optimized Logical Plan ==


Window [rank(salary#7L) windowspecdefinition(depname#5, salary#7L DESC, ROWS BETWEEN U
NBOUNDED PRECEDING AND CURRENT ROW) AS rank#9], [depname#5], [salary#7L DESC]
+- LocalRelation [depName#5, empNo#6L, salary#7L]

== Physical Plan ==
Window [rank(salary#7L) windowspecdefinition(depname#5, salary#7L DESC, ROWS BETWEEN U
NBOUNDED PRECEDING AND CURRENT ROW) AS rank#9], [depname#5], [salary#7L DESC]
+- *Sort [depname#5 ASC, salary#7L DESC], false, 0
+- Exchange hashpartitioning(depname#5, 200)
+- LocalTableScan [depName#5, empNo#6L, salary#7L]

lag Window Function

lag(e: Column, offset: Int): Column


lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column
lag(e: Column, offset: Int, defaultValue: Any): Column

lag returns the value in e / columnName column that is offset records before the current record. When there are fewer than offset records before the current record in a window partition, lag returns defaultValue (or null if no defaultValue is specified).


val buckets = spark.range(9).withColumn("bucket", 'id % 3)


// Make duplicates
val dataset = buckets.union(buckets)

import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy('bucket).orderBy('id)
scala> dataset.withColumn("lag", lag('id, 1) over windowSpec).show
+---+------+----+
| id|bucket| lag|
+---+------+----+
| 0| 0|null|
| 3| 0| 0|
| 6| 0| 3|
| 1| 1|null|
| 4| 1| 1|
| 7| 1| 4|
| 2| 2|null|
| 5| 2| 2|
| 8| 2| 5|
+---+------+----+

scala> dataset.withColumn("lag", lag('id, 2, "<default_value>") over windowSpec).show


+---+------+----+
| id|bucket| lag|
+---+------+----+
| 0| 0|null|
| 3| 0|null|
| 6| 0| 0|
| 1| 1|null|
| 4| 1|null|
| 7| 1| 1|
| 2| 2|null|
| 5| 2|null|
| 8| 2| 2|
+---+------+----+

Caution FIXME It looks like lag with a default value has a bug — the default value’s not used at all.

lead Window Function

lead(columnName: String, offset: Int): Column


lead(e: Column, offset: Int): Column
lead(columnName: String, offset: Int, defaultValue: Any): Column
lead(e: Column, offset: Int, defaultValue: Any): Column


lead returns the value that is offset records after the current record. When there are fewer than offset records after the current record in a window partition, lead returns defaultValue (or null if no defaultValue is specified).

val buckets = spark.range(9).withColumn("bucket", 'id % 3)


// Make duplicates
val dataset = buckets.union(buckets)

import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy('bucket).orderBy('id)
scala> dataset.withColumn("lead", lead('id, 1) over windowSpec).show
+---+------+----+
| id|bucket|lead|
+---+------+----+
| 0| 0| 0|
| 0| 0| 3|
| 3| 0| 3|
| 3| 0| 6|
| 6| 0| 6|
| 6| 0|null|
| 1| 1| 1|
| 1| 1| 4|
| 4| 1| 4|
| 4| 1| 7|
| 7| 1| 7|
| 7| 1|null|
| 2| 2| 2|
| 2| 2| 5|
| 5| 2| 5|
| 5| 2| 8|
| 8| 2| 8|
| 8| 2|null|
+---+------+----+

scala> dataset.withColumn("lead", lead('id, 2, "<default_value>") over windowSpec).sho


w
+---+------+----+
| id|bucket|lead|
+---+------+----+
| 0| 0| 3|
| 0| 0| 3|
| 3| 0| 6|
| 3| 0| 6|
| 6| 0|null|
| 6| 0|null|
| 1| 1| 4|
| 1| 1| 4|
| 4| 1| 7|
| 4| 1| 7|
| 7| 1|null|
| 7| 1|null|
| 2| 2| 5|
| 2| 2| 5|
| 5| 2| 8|
| 5| 2| 8|
| 8| 2|null|
| 8| 2|null|
+---+------+----+

Caution FIXME It looks like lead with a default value has a bug — the default value’s not used at all.

Cumulative Distribution of Records Across Window Partitions —  cume_dist Window Function

cume_dist(): Column

cume_dist computes the cumulative distribution of the records in window partitions. This is

equivalent to SQL’s CUME_DIST function.

val buckets = spark.range(9).withColumn("bucket", 'id % 3)


// Make duplicates
val dataset = buckets.union(buckets)

import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy('bucket).orderBy('id)
scala> dataset.withColumn("cume_dist", cume_dist over windowSpec).show
+---+------+------------------+
| id|bucket| cume_dist|
+---+------+------------------+
| 0| 0|0.3333333333333333|
| 3| 0|0.6666666666666666|
| 6| 0| 1.0|
| 1| 1|0.3333333333333333|
| 4| 1|0.6666666666666666|
| 7| 1| 1.0|
| 2| 2|0.3333333333333333|
| 5| 2|0.6666666666666666|
| 8| 2| 1.0|
+---+------+------------------+

Sequential numbering per window partition —  row_number Window Function

row_number(): Column


row_number returns a sequential number starting at 1 within a window partition.

val buckets = spark.range(9).withColumn("bucket", 'id % 3)


// Make duplicates
val dataset = buckets.union(buckets)

import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy('bucket).orderBy('id)
scala> dataset.withColumn("row_number", row_number() over windowSpec).show
+---+------+----------+
| id|bucket|row_number|
+---+------+----------+
| 0| 0| 1|
| 0| 0| 2|
| 3| 0| 3|
| 3| 0| 4|
| 6| 0| 5|
| 6| 0| 6|
| 1| 1| 1|
| 1| 1| 2|
| 4| 1| 3|
| 4| 1| 4|
| 7| 1| 5|
| 7| 1| 6|
| 2| 2| 1|
| 2| 2| 2|
| 5| 2| 3|
| 5| 2| 4|
| 8| 2| 5|
| 8| 2| 6|
+---+------+----------+

ntile Window Function

ntile(n: Int): Column

ntile computes the ntile group id (from 1 to n inclusive) in an ordered window partition.


val dataset = spark.range(7).select('*, 'id % 3 as "bucket")

import org.apache.spark.sql.expressions.Window
val byBuckets = Window.partitionBy('bucket).orderBy('id)
scala> dataset.select('*, ntile(3) over byBuckets as "ntile").show
+---+------+-----+
| id|bucket|ntile|
+---+------+-----+
| 0| 0| 1|
| 3| 0| 2|
| 6| 0| 3|
| 1| 1| 1|
| 4| 1| 2|
| 2| 2| 1|
| 5| 2| 2|
+---+------+-----+

Caution FIXME How is ntile different from rank ? What about performance?

Ranking Records per Window Partition —  rank Window Function

rank(): Column
dense_rank(): Column
percent_rank(): Column

rank functions assign the sequential rank of each distinct value per window partition. They

are equivalent to RANK , DENSE_RANK and PERCENT_RANK functions in the good ol' SQL.


val dataset = spark.range(9).withColumn("bucket", 'id % 3)

import org.apache.spark.sql.expressions.Window
val byBucket = Window.partitionBy('bucket).orderBy('id)

scala> dataset.withColumn("rank", rank over byBucket).show


+---+------+----+
| id|bucket|rank|
+---+------+----+
| 0| 0| 1|
| 3| 0| 2|
| 6| 0| 3|
| 1| 1| 1|
| 4| 1| 2|
| 7| 1| 3|
| 2| 2| 1|
| 5| 2| 2|
| 8| 2| 3|
+---+------+----+

scala> dataset.withColumn("percent_rank", percent_rank over byBucket).show


+---+------+------------+
| id|bucket|percent_rank|
+---+------+------------+
| 0| 0| 0.0|
| 3| 0| 0.5|
| 6| 0| 1.0|
| 1| 1| 0.0|
| 4| 1| 0.5|
| 7| 1| 1.0|
| 2| 2| 0.0|
| 5| 2| 0.5|
| 8| 2| 1.0|
+---+------+------------+

rank function assigns the same rank for duplicate rows with a gap in the sequence

(similarly to Olympic medal places). dense_rank is like rank for duplicate rows but
compacts the ranks and removes the gaps.

// rank function with duplicates


// Note the missing/sparse ranks, i.e. 2 and 4
scala> dataset.union(dataset).withColumn("rank", rank over byBucket).show
+---+------+----+
| id|bucket|rank|
+---+------+----+
| 0| 0| 1|
| 0| 0| 1|
| 3| 0| 3|
| 3| 0| 3|
| 6| 0| 5|
| 6| 0| 5|
| 1| 1| 1|
| 1| 1| 1|
| 4| 1| 3|
| 4| 1| 3|
| 7| 1| 5|
| 7| 1| 5|
| 2| 2| 1|
| 2| 2| 1|
| 5| 2| 3|
| 5| 2| 3|
| 8| 2| 5|
| 8| 2| 5|
+---+------+----+

// dense_rank function with duplicates


// Note that the missing ranks are now filled in
scala> dataset.union(dataset).withColumn("dense_rank", dense_rank over byBucket).show
+---+------+----------+
| id|bucket|dense_rank|
+---+------+----------+
| 0| 0| 1|
| 0| 0| 1|
| 3| 0| 2|
| 3| 0| 2|
| 6| 0| 3|
| 6| 0| 3|
| 1| 1| 1|
| 1| 1| 1|
| 4| 1| 2|
| 4| 1| 2|
| 7| 1| 3|
| 7| 1| 3|
| 2| 2| 1|
| 2| 2| 1|
| 5| 2| 2|
| 5| 2| 2|
| 8| 2| 3|
| 8| 2| 3|
+---+------+----------+

// percent_rank function with duplicates


scala> dataset.union(dataset).withColumn("percent_rank", percent_rank over byBucket).s
how
+---+------+------------+
| id|bucket|percent_rank|
+---+------+------------+
| 0| 0| 0.0|
| 0| 0| 0.0|
| 3| 0| 0.4|
| 3| 0| 0.4|
| 6| 0| 0.8|
| 6| 0| 0.8|
| 1| 1| 0.0|
| 1| 1| 0.0|
| 4| 1| 0.4|
| 4| 1| 0.4|
| 7| 1| 0.8|
| 7| 1| 0.8|
| 2| 2| 0.0|
| 2| 2| 0.0|
| 5| 2| 0.4|
| 5| 2| 0.4|
| 8| 2| 0.8|
| 8| 2| 0.8|
+---+------+------------+

currentRow Window Function

currentRow(): Column

currentRow …​FIXME

unboundedFollowing Window Function

unboundedFollowing(): Column

unboundedFollowing …​FIXME

unboundedPreceding Window Function

unboundedPreceding(): Column

unboundedPreceding …​FIXME

Further Reading and Watching


Introducing Window Functions in Spark SQL

3.5. Window Functions in the official documentation of PostgreSQL

Window Functions in SQL

Working with Window Functions in SQL Server

OVER Clause (Transact-SQL)


An introduction to windowed functions

Probably the Coolest SQL Feature: Window Functions

Window Functions


UDFs — User-Defined Functions
User-Defined Functions (aka UDF) is a feature of Spark SQL to define new Column-based
functions that extend the vocabulary of Spark SQL’s DSL for transforming Datasets.

Important Use the higher-level standard Column-based functions (with Dataset operators) whenever possible before reverting to developing user-defined functions since UDFs are a blackbox for Spark SQL and it cannot (and does not even try to) optimize them.
As Reynold Xin from the Apache Spark project has once said on Spark’s dev mailing list: "There are simple cases in which we can analyze the UDFs byte code and infer what it is doing, but it is pretty difficult to do in general."
Check out UDFs are Blackbox — Don’t Use Them Unless You’ve Got No Choice if you want to know the internals.

You define a new UDF by defining a Scala function as an input parameter of udf function.
It accepts Scala functions of up to 10 input parameters.

val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")

// Define a regular Scala function


val upper: String => String = _.toUpperCase

// Define a UDF that wraps the upper Scala function defined above
// You could also define the function in place, i.e. inside udf
// but separating Scala functions from Spark SQL's UDFs allows for easier testing
import org.apache.spark.sql.functions.udf
val upperUDF = udf(upper)

// Apply the UDF to change the source dataset


scala> dataset.withColumn("upper", upperUDF('text)).show
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
| 0|hello|HELLO|
| 1|world|WORLD|
+---+-----+-----+

You can register UDFs to use in SQL-based query expressions via UDFRegistration (that is
available through SparkSession.udf attribute).


val spark: SparkSession = ...


scala> spark.udf.register("myUpper", (input: String) => input.toUpperCase)
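
Once registered, myUpper can be called from SQL text (a quick sketch, not one of the book’s examples):

Seq("hello", "world").toDF("text").createOrReplaceTempView("texts")
scala> sql("SELECT text, myUpper(text) AS upper FROM texts").show
+-----+-----+
| text|upper|
+-----+-----+
|hello|HELLO|
|world|WORLD|
+-----+-----+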

You can query for available standard and user-defined functions using the Catalog interface
(that is available through SparkSession.catalog attribute).

val spark: SparkSession = ...


scala> spark.catalog.listFunctions.filter('name like "%upper%").show(false)
+-------+--------+-----------+-----------------------------------------------+-----------+
|name |database|description|className |isTemporary|
+-------+--------+-----------+-----------------------------------------------+-----------+
|myupper|null |null |null |true |
|upper |null |null |org.apache.spark.sql.catalyst.expressions.Upper|true |
+-------+--------+-----------+-----------------------------------------------+-----------+

UDFs play a vital role in Spark MLlib to define new Transformers that are
Note function objects that transform DataFrames into DataFrames by introducing new
columns.

udf Functions (in functions object)

udf[RT: TypeTag](f: Function0[RT]): UserDefinedFunction
...
udf[RT: TypeTag, A1: TypeTag, A2: TypeTag, A3: TypeTag, A4: TypeTag, A5: TypeTag, A6: TypeTag, A7: TypeTag, A8: TypeTag, A9: TypeTag, A10: TypeTag](f: Function10[A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, RT]): UserDefinedFunction

org.apache.spark.sql.functions object comes with udf function to let you define a UDF for

a Scala function f .


val df = Seq(
(0, "hello"),
(1, "world")).toDF("id", "text")

// Define a "regular" Scala function


// It's a clone of upper UDF
val toUpper: String => String = _.toUpperCase

import org.apache.spark.sql.functions.udf
val upper = udf(toUpper)

scala> df.withColumn("upper", upper('text)).show


+---+-----+-----+
| id| text|upper|
+---+-----+-----+
| 0|hello|HELLO|
| 1|world|WORLD|
+---+-----+-----+

// You could have also defined the UDF this way


val upperUDF = udf { s: String => s.toUpperCase }

// or even this way


val upperUDF = udf[String, String](_.toUpperCase)

scala> df.withColumn("upper", upperUDF('text)).show


+---+-----+-----+
| id| text|upper|
+---+-----+-----+
| 0|hello|HELLO|
| 1|world|WORLD|
+---+-----+-----+

Tip Define custom UDFs based on "standalone" Scala functions (e.g. toUpperUDF) so you can test the Scala functions the Scala way (without Spark SQL’s "noise") and, once they are defined, reuse the UDFs in UnaryTransformers.


UDFs are Blackbox — Don’t Use Them Unless You’ve Got No Choice

Let’s review an example with a UDF. This example converts strings of size 7 characters only and uses the Dataset standard operators first and then a custom UDF to do the same transformation.

scala> spark.conf.get("spark.sql.parquet.filterPushdown")
res0: String = true

You are going to use the following cities dataset that is based on a Parquet file (as used in Predicate Pushdown / Filter Pushdown for Parquet Data Source section). The reason for parquet is that it is an external data source that supports optimizations Spark uses to optimize queries, such as predicate pushdown.


// no optimization as it is a more involved Scala function in filter


// 08/30 Asked on dev@spark mailing list for explanation
val cities6chars = cities.filter(_.name.length == 6).map(_.name.toUpperCase)

cities6chars.explain(true)

// or simpler when only concerned with PushedFilters attribute in Parquet


scala> cities6chars.queryExecution.optimizedPlan
res33: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, Stri
ngType, fromString, input[0, java.lang.String, true], true) AS value#248]
+- MapElements <function1>, class City, [StructField(id,LongType,false), StructField(n
ame,StringType,true)], obj#247: java.lang.String
+- Filter <function1>.apply
+- DeserializeToObject newInstance(class City), obj#246: City
+- Relation[id#236L,name#237] parquet

// no optimization for Dataset[City]?!


// 08/30 Asked on dev@spark mailing list for explanation
val cities6chars = cities.filter(_.name == "Warsaw").map(_.name.toUpperCase)

cities6chars.explain(true)

// The filter predicate is pushed down fine for Dataset's Column-based query in where
operator
scala> cities.where('name === "Warsaw").queryExecution.executedPlan
res29: org.apache.spark.sql.execution.SparkPlan =
*Project [id#128L, name#129]
+- *Filter (isnotnull(name#129) && (name#129 = Warsaw))
+- *FileScan parquet [id#128L,name#129] Batched: true, Format: ParquetFormat, Input
Paths: file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [], PushedFil
ters: [IsNotNull(name), EqualTo(name,Warsaw)], ReadSchema: struct<id:bigint,name:strin
g>

// Let's define a UDF to do the filtering


val isWarsaw = udf { (s: String) => s == "Warsaw" }

// Use the UDF in where (replacing the Column-based query)


scala> cities.where(isWarsaw('name)).queryExecution.executedPlan
res33: org.apache.spark.sql.execution.SparkPlan =
*Filter UDF(name#129)
+- *FileScan parquet [id#128L,name#129] Batched: true, Format: ParquetFormat, InputPat
hs: file:/Users/jacek/dev/oss/spark/cities.parquet, PartitionFilters: [], PushedFilters
: [], ReadSchema: struct<id:bigint,name:string>


UserDefinedFunction
UserDefinedFunction represents a user-defined function.

UserDefinedFunction is created when:

udf function is executed

UDFRegistration is requested to register a Scala function as a user-defined function (in

FunctionRegistry )

import org.apache.spark.sql.functions.udf
val lengthUDF = udf { s: String => s.length }

scala> :type lengthUDF


org.apache.spark.sql.expressions.UserDefinedFunction

val r = lengthUDF($"name")

scala> :type r
org.apache.spark.sql.Column

UserDefinedFunction can have an optional name.

val namedLengthUDF = lengthUDF.withName("lengthUDF")


scala> namedLengthUDF($"name")
res2: org.apache.spark.sql.Column = UDF:lengthUDF(name)

UserDefinedFunction is nullable by default, but can be changed to non-nullable.

val nonNullableLengthUDF = lengthUDF.asNonNullable


assert(nonNullableLengthUDF.nullable == false)

UserDefinedFunction is deterministic by default, i.e. produces the same result for the same

input. UserDefinedFunction can be changed to be non-deterministic.

assert(lengthUDF.deterministic)
val ndUDF = lengthUDF.asNondeterministic
assert(ndUDF.deterministic == false)


Table 1. UserDefinedFunction’s Internal Properties (e.g. Registries, Counters and Flags)
Name            Description
_deterministic  Flag that controls whether the function is deterministic (true) or not (false).
                Default: true
                Use asNondeterministic to change it to false.
                Used when UserDefinedFunction is requested to execute.

Executing UserDefinedFunction (Creating Column with ScalaUDF Expression) —  apply Method

apply(exprs: Column*): Column

apply creates a Column with ScalaUDF expression.

import org.apache.spark.sql.functions.udf
scala> val lengthUDF = udf { s: String => s.length }
lengthUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(
<function1>,IntegerType,Some(List(StringType)))

scala> lengthUDF($"name")
res1: org.apache.spark.sql.Column = UDF(name)

Note apply is used when…​FIXME

Marking UserDefinedFunction as NonNullable —  asNonNullable Method

asNonNullable(): UserDefinedFunction

asNonNullable …​FIXME

Note asNonNullable is used when…​FIXME

Naming UserDefinedFunction —  withName Method

withName(name: String): UserDefinedFunction


withName …​FIXME

Note withName is used when…​FIXME

Creating UserDefinedFunction Instance


UserDefinedFunction takes the following when created:

A Scala function (as Scala’s AnyRef )

Output data type

Input data types (if available)

UserDefinedFunction initializes the internal registries and counters.


Schema — Structure of Data
A schema is the description of the structure of your data (which together create a Dataset in
Spark SQL). It can be implicit (and inferred at runtime) or explicit (and known at compile
time).

A schema is described using StructType which is a collection of StructField objects (that in turn are tuples of names, types, and nullability classifier).

StructType and StructField belong to the org.apache.spark.sql.types package.

import org.apache.spark.sql.types.StructType
val schemaUntyped = new StructType()
.add("a", "int")
.add("b", "string")

// alternatively using Schema DSL


val schemaUntyped_2 = new StructType()
.add($"a".int)
.add($"b".string)

You can use the canonical string representation of SQL types to describe the types in a schema (that is inherently untyped at compile time) or use type-safe types from the org.apache.spark.sql.types package.

// it is equivalent to the above expressions


import org.apache.spark.sql.types.{IntegerType, StringType}
val schemaTyped = new StructType()
.add("a", IntegerType)
.add("b", StringType)

Tip Read up on CatalystSqlParser that is responsible for parsing data types.

It is however recommended to use the singleton DataTypes class with static methods to
create schema types.

import org.apache.spark.sql.types.DataTypes._
val schemaWithMap = StructType(
StructField("map", createMapType(LongType, StringType), false) :: Nil)

StructType offers printTreeString that makes presenting the schema more user-friendly.

403
Schema — Structure of Data

scala> schemaTyped.printTreeString
root
|-- a: integer (nullable = true)
|-- b: string (nullable = true)

scala> schemaWithMap.printTreeString
root
|-- map: map (nullable = false)
| |-- key: long
| |-- value: string (valueContainsNull = true)

// You can use prettyJson method on any DataType
scala> println(schemaTyped.prettyJson)
{
"type" : "struct",
"fields" : [ {
"name" : "a",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "b",
"type" : "string",
"nullable" : true,
"metadata" : { }
} ]
}

As of Spark 2.0, you can describe the schema of your strongly-typed datasets using
encoders.

import org.apache.spark.sql.Encoders

scala> Encoders.INT.schema.printTreeString
root
|-- value: integer (nullable = true)

scala> Encoders.product[(String, java.sql.Timestamp)].schema.printTreeString


root
|-- _1: string (nullable = true)
|-- _2: timestamp (nullable = true)

case class Person(id: Long, name: String)


scala> Encoders.product[Person].schema.printTreeString
root
|-- id: long (nullable = false)
|-- name: string (nullable = true)

Implicit Schema


val df = Seq((0, s"""hello\tworld"""), (1, "two spaces inside")).toDF("label", "sentence")

scala> df.printSchema
root
|-- label: integer (nullable = false)
|-- sentence: string (nullable = true)

scala> df.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(label,IntegerType,false), StructField(sentence,StringType,true))

scala> df.schema("label").dataType
res1: org.apache.spark.sql.types.DataType = IntegerType


StructType — Data Type for Schema Definition


StructType is a built-in data type that is a collection of StructFields.

StructType is used to define a schema or its part.

You can compare two StructType instances to see whether they are equal.

import org.apache.spark.sql.types.StructType

val schemaUntyped = new StructType()


.add("a", "int")
.add("b", "string")

import org.apache.spark.sql.types.{IntegerType, StringType}


val schemaTyped = new StructType()
.add("a", IntegerType)
.add("b", StringType)

scala> schemaUntyped == schemaTyped


res0: Boolean = true

StructType presents itself as <struct> or STRUCT in query plans or SQL.

Note StructType is a Seq[StructField] and therefore all things Seq apply equally here.

scala> schemaTyped.foreach(println)
StructField(a,IntegerType,true)
StructField(b,StringType,true)

Read the official documentation of Scala’s scala.collection.Seq.

As of Spark 2.4.0, StructType can be converted to DDL format using toDDL method.

Example: Using StructType.toDDL

// Generating a schema from a case class


// Because we're all properly lazy
case class Person(id: Long, name: String)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
scala> println(schema.toDDL)
`id` BIGINT,`name` STRING


fromAttributes Method

fromAttributes(attributes: Seq[Attribute]): StructType

fromAttributes …​FIXME

Note fromAttributes is used when…​FIXME

toAttributes Method

toAttributes: Seq[AttributeReference]

toAttributes …​FIXME

Note toAttributes is used when…​FIXME

Adding Fields to Schema —  add Method


You can add a new StructField to your StructType . There are different variants of add
method that all make for a new StructType with the field added.


add(field: StructField): StructType


add(name: String, dataType: DataType): StructType
add(name: String, dataType: DataType, nullable: Boolean): StructType
add(
name: String,
dataType: DataType,
nullable: Boolean,
metadata: Metadata): StructType
add(
name: String,
dataType: DataType,
nullable: Boolean,
comment: String): StructType
add(name: String, dataType: String): StructType
add(name: String, dataType: String, nullable: Boolean): StructType
add(
name: String,
dataType: String,
nullable: Boolean,
metadata: Metadata): StructType
add(
name: String,
dataType: String,
nullable: Boolean,
comment: String): StructType
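
A quick sketch of adding fields (not one of the book’s examples), using a DataType for one field and the type’s string name for another:

import org.apache.spark.sql.types.{LongType, StructType}
val schema = new StructType()
  .add("id", LongType, nullable = false)
  .add("name", "string", nullable = true, comment = "full name")
scala> schema.printTreeString
root
|-- id: long (nullable = false)
|-- name: string (nullable = true)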

DataType Name Conversions

simpleString: String
catalogString: String
sql: String

StructType as a custom DataType is used in query plans or SQL. It can present itself using

simpleString , catalogString or sql (see DataType Contract).

scala> schemaTyped.simpleString
res0: String = struct<a:int,b:string>

scala> schemaTyped.catalogString
res1: String = struct<a:int,b:string>

scala> schemaTyped.sql
res2: String = STRUCT<`a`: INT, `b`: STRING>

Accessing StructField —  apply Method


apply(name: String): StructField

StructType defines its own apply method that gives you an easy access to a

StructField by name.

scala> schemaTyped.printTreeString
root
|-- a: integer (nullable = true)
|-- b: string (nullable = true)

scala> schemaTyped("a")
res4: org.apache.spark.sql.types.StructField = StructField(a,IntegerType,true)

Creating StructType from Existing StructType —  apply Method

apply(names: Set[String]): StructType

This variant of apply lets you create a StructType out of an existing StructType with the
names only.

scala> schemaTyped(names = Set("a"))


res0: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))

It will throw an IllegalArgumentException exception when a field could not be found.

scala> schemaTyped(names = Set("a", "c"))


java.lang.IllegalArgumentException: Field c does not exist.
at org.apache.spark.sql.types.StructType.apply(StructType.scala:275)
... 48 elided

Displaying Schema As Tree —  printTreeString Method

printTreeString(): Unit

printTreeString prints out the schema to standard output.


scala> schemaTyped.printTreeString
root
|-- a: integer (nullable = true)
|-- b: string (nullable = true)

Internally, it uses treeString method to build the tree and then println it.

Creating StructType For DDL-Formatted Text —  fromDDL Object Method

fromDDL(ddl: String): StructType

fromDDL …​FIXME

Note fromDDL is used when…​FIXME

Converting to DDL Format —  toDDL Method

toDDL: String

toDDL converts all the fields to DDL format and concatenates them using the comma ( , ).


StructField — Single Field in StructType


StructField describes a single field in a StructType with the following:

Name

DataType

nullable flag (enabled by default)

Metadata (empty by default)

A comment is part of metadata under comment key and is used to build a Hive column or
when describing a table.

scala> schemaTyped("a").getComment
res0: Option[String] = None

scala> schemaTyped("a").withComment("this is a comment").getComment


res1: Option[String] = Some(this is a comment)

As of Spark 2.4.0, StructField can be converted to DDL format using toDDL method.

Example: Using StructField.toDDL

import org.apache.spark.sql.types.MetadataBuilder
val metadata = new MetadataBuilder()
.putString("comment", "this is a comment")
.build
import org.apache.spark.sql.types.{LongType, StructField}
val f = new StructField(name = "id", dataType = LongType, nullable = false, metadata)
scala> println(f.toDDL)
`id` BIGINT COMMENT 'this is a comment'

Converting to DDL Format —  toDDL Method

toDDL: String

toDDL gives a text in the format:

[quoted name] [dataType][optional comment]


Note toDDL is used when:

StructType is requested to convert itself to DDL format

ShowCreateTableCommand logical command is executed (and showHiveTableHeader, showHiveTableNonDataColumns, showDataSourceTableDataColumns)


Data Types
DataType abstract class is the base type of all built-in data types in Spark SQL, e.g. strings,

longs.

DataType has two main type families:

Atomic Types as an internal type to represent types that are not null , UDTs, arrays,
structs, and maps

Numeric Types with fractional and integral types


Table 1. Standard Data Types
Type Family                                           Data Types (Scala Types)
Atomic Types (except fractional and integral types)   BinaryType, BooleanType, DateType, StringType, TimestampType (java.sql.Timestamp)
Fractional Types (concrete NumericType)               DecimalType, DoubleType, FloatType
Integral Types (concrete NumericType)                 ByteType, IntegerType, LongType, ShortType
(no type family)                                       ArrayType, CalendarIntervalType, MapType, NullType, ObjectType, StructType, UserDefinedType
                                                       AnyDataType (matches any concrete data type)

Caution FIXME What about AbstractDataType?

You can extend the type system and create your own user-defined types (UDTs).

The DataType Contract defines methods to build SQL, JSON and string representations.


Note DataType (and the concrete Spark SQL types) live in org.apache.spark.sql.types package.

import org.apache.spark.sql.types.StringType

scala> StringType.json
res0: String = "string"

scala> StringType.sql
res1: String = STRING

scala> StringType.catalogString
res2: String = string

You should use DataTypes object in your code to create complex Spark SQL types, i.e.
arrays or maps.

import org.apache.spark.sql.types.DataTypes

scala> val arrayType = DataTypes.createArrayType(BooleanType)


arrayType: org.apache.spark.sql.types.ArrayType = ArrayType(BooleanType,true)

scala> val mapType = DataTypes.createMapType(StringType, LongType)


mapType: org.apache.spark.sql.types.MapType = MapType(StringType,LongType,true)

DataType has support for Scala’s pattern matching using unapply method.

???

DataType Contract
Any type in Spark SQL follows the DataType contract which means that the types define the
following methods:

json and prettyJson to build JSON representations of a data type

defaultSize to know the default size of values of a type

simpleString and catalogString to build user-friendly string representations (with the

latter for external catalogs)

sql to build SQL representation


import org.apache.spark.sql.types.DataTypes._

val maps = StructType(


StructField("longs2strings", createMapType(LongType, StringType), false) :: Nil)

scala> maps.prettyJson
res0: String =
{
"type" : "struct",
"fields" : [ {
"name" : "longs2strings",
"type" : {
"type" : "map",
"keyType" : "long",
"valueType" : "string",
"valueContainsNull" : true
},
"nullable" : false,
"metadata" : { }
} ]
}

scala> maps.defaultSize
res1: Int = 2800

scala> maps.simpleString
res2: String = struct<longs2strings:map<bigint,string>>

scala> maps.catalogString
res3: String = struct<longs2strings:map<bigint,string>>

scala> maps.sql
res4: String = STRUCT<`longs2strings`: MAP<BIGINT, STRING>>

DataTypes — Factory Methods for Data Types


DataTypes is a Java class with methods to access simple or create complex DataType

types in Spark SQL, i.e. arrays and maps.

Tip It is recommended to use DataTypes class to define DataType types in a schema.

DataTypes lives in org.apache.spark.sql.types package.


import org.apache.spark.sql.types.DataTypes

scala> val arrayType = DataTypes.createArrayType(BooleanType)


arrayType: org.apache.spark.sql.types.ArrayType = ArrayType(BooleanType,true)

scala> val mapType = DataTypes.createMapType(StringType, LongType)


mapType: org.apache.spark.sql.types.MapType = MapType(StringType,LongType,true)

Note Simple DataType types themselves, i.e. StringType or CalendarIntervalType, come with their own Scala case objects alongside their definitions. You may also import the types package and have access to the types.

import org.apache.spark.sql.types._

UDTs — User-Defined Types

Caution: FIXME


Multi-Dimensional Aggregation
Multi-dimensional aggregate operators are enhanced variants of the groupBy operator that allow you to compute subtotals, grand totals and supersets of subtotals in one go.

val sales = Seq(


("Warsaw", 2016, 100),
("Warsaw", 2017, 200),
("Boston", 2015, 50),
("Boston", 2016, 150),
("Toronto", 2017, 50)
).toDF("city", "year", "amount")

// very labor-intensive
// groupBy results unioned manually
val groupByCityAndYear = sales
.groupBy("city", "year") // <-- subtotals (city, year)
.agg(sum("amount") as "amount")
val groupByCityOnly = sales
.groupBy("city") // <-- subtotals (city)
.agg(sum("amount") as "amount")
.select($"city", lit(null) as "year", $"amount") // <-- year is null
val withUnion = groupByCityAndYear
.union(groupByCityOnly)
.sort($"city".desc_nulls_last, $"year".asc_nulls_last)
scala> withUnion.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
+-------+----+------+

Multi-dimensional aggregate operators are semantically equivalent to the union operator (or SQL's UNION ALL) combining multiple single-grouping queries.


// Roll up your sleeves!


val withRollup = sales
.rollup("city", "year")
.agg(sum("amount") as "amount", grouping_id() as "gid")
.sort($"city".desc_nulls_last, $"year".asc_nulls_last)
.filter(grouping_id() =!= 3)
.select("city", "year", "amount")
scala> withRollup.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
+-------+----+------+

// Be even smarter?
// SQL only, alas.
sales.createOrReplaceTempView("sales")
val withGroupingSets = sql("""
SELECT city, year, SUM(amount) as amount
FROM sales
GROUP BY city, year
GROUPING SETS ((city, year), (city))
ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
""")
scala> withGroupingSets.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
+-------+----+------+

Note: It is assumed that using one of the operators is usually more efficient (than union and groupBy) as it gives more freedom for query optimization.


Table 1. Multi-dimensional Aggregate Operators

  cube (returns RelationalGroupedDataset): calculates subtotals and a grand total for every permutation of the columns specified.

  rollup (returns RelationalGroupedDataset): calculates subtotals and a grand total over (ordered) combinations of the groups.

Beside the cube and rollup multi-dimensional aggregate operators, Spark SQL supports the GROUPING SETS clause (in SQL mode only).

Note: SQL's GROUPING SETS is the most general aggregate "operator" and can generate the same dataset as using simple groupBy, cube and rollup operators.


import java.time.{LocalDate, Month}
import java.sql.Date
val expenses = Seq(
((2012, Month.DECEMBER, 12), 5),
((2016, Month.AUGUST, 13), 10),
((2017, Month.MAY, 27), 15))
.map { case ((yy, mm, dd), a) => (LocalDate.of(yy, mm, dd), a) }
.map { case (d, a) => (d.toString, a) }
.map { case (d, a) => (Date.valueOf(d), a) }
.toDF("date", "amount")
scala> expenses.show
+----------+------+
| date|amount|
+----------+------+
|2012-12-12| 5|
|2016-08-13| 10|
|2017-05-27| 15|
+----------+------+

// rollup time!
val q = expenses
.rollup(year($"date") as "year", month($"date") as "month")
.agg(sum("amount") as "amount")
.sort($"year".asc_nulls_last, $"month".asc_nulls_last)
scala> q.show
+----+-----+------+
|year|month|amount|
+----+-----+------+
|2012| 12| 5|
|2012| null| 5|
|2016| 8| 10|
|2016| null| 10|
|2017| 5| 15|
|2017| null| 15|
|null| null| 30|
+----+-----+------+

Tip: Review the examples per operator in the following sections.

Note: Support for multi-dimensional aggregate operators was added in [SPARK-6356] Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext.

rollup Operator

rollup(cols: Column*): RelationalGroupedDataset


rollup(col1: String, cols: String*): RelationalGroupedDataset


The rollup multi-dimensional aggregate operator is an extension of the groupBy operator that calculates subtotals and a grand total across the specified group of n + 1 dimensions (with n being the number of columns given as cols and col1, plus 1 extra level where the values become null, i.e. undefined).

Note: The rollup operator is commonly used for analysis over hierarchical data, e.g. total salary by department, division, and company-wide total. See PostgreSQL's 7.2.4. GROUPING SETS, CUBE, and ROLLUP.

Note: The rollup operator is equivalent to GROUP BY ... WITH ROLLUP in SQL (which in turn is equivalent to GROUP BY ... GROUPING SETS ((a,b,c),(a,b),(a),()) when used with 3 columns: a, b, and c).

val sales = Seq(


("Warsaw", 2016, 100),
("Warsaw", 2017, 200),
("Boston", 2015, 50),
("Boston", 2016, 150),
("Toronto", 2017, 50)
).toDF("city", "year", "amount")

val q = sales
.rollup("city", "year")
.agg(sum("amount") as "amount")
.sort($"city".desc_nulls_last, $"year".asc_nulls_last)
scala> q.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100| <-- subtotal for Warsaw in 2016
| Warsaw|2017| 200|
| Warsaw|null| 300| <-- subtotal for Warsaw (across years)
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|null| 550| <-- grand total
+-------+----+------+

// The above query is semantically equivalent to the following


val q1 = sales
.groupBy("city", "year") // <-- subtotals (city, year)
.agg(sum("amount") as "amount")
val q2 = sales
.groupBy("city") // <-- subtotals (city)
.agg(sum("amount") as "amount")
.select($"city", lit(null) as "year", $"amount") // <-- year is null
val q3 = sales

  .groupBy() // <-- grand total
  .agg(sum("amount") as "amount")
  .select(lit(null) as "city", lit(null) as "year", $"amount") // <-- city and year are null
val qq = q1
.union(q2)
.union(q3)
.sort($"city".desc_nulls_last, $"year".asc_nulls_last)
scala> qq.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|null| 550|
+-------+----+------+
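The SQL equivalence stated in the note above can be checked directly, since Spark SQL accepts the WITH ROLLUP syntax. A short sketch reusing the sales Dataset defined above (the view and value names are arbitrary):

sales.createOrReplaceTempView("sales")
val withRollupSQL = sql("""
  SELECT city, year, sum(amount) AS amount
  FROM sales
  GROUP BY city, year WITH ROLLUP
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
""")
// Expected to give the same rows as sales.rollup("city", "year").agg(sum("amount") as "amount")
withRollupSQL.show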

From Using GROUP BY with ROLLUP, CUBE, and GROUPING SETS in Microsoft’s
TechNet:

The ROLLUP, CUBE, and GROUPING SETS operators are extensions of the GROUP
BY clause. The ROLLUP, CUBE, or GROUPING SETS operators can generate the
same result set as when you use UNION ALL to combine single grouping queries;
however, using one of the GROUP BY operators is usually more efficient.

From PostgreSQL’s 7.2.4. GROUPING SETS, CUBE, and ROLLUP:

References to the grouping columns or expressions are replaced by null values in result
rows for grouping sets in which those columns do not appear.

From Summarizing Data Using ROLLUP in Microsoft’s TechNet:

The ROLLUP operator is useful in generating reports that contain subtotals and totals.
(…​) ROLLUP generates a result set that shows aggregates for a hierarchy of values in
the selected columns.


// Borrowed from Microsoft's "Summarizing Data Using ROLLUP" article


val inventory = Seq(
("table", "blue", 124),
("table", "red", 223),
("chair", "blue", 101),
("chair", "red", 210)).toDF("item", "color", "quantity")

scala> inventory.show
+-----+-----+--------+
| item|color|quantity|
+-----+-----+--------+
|chair| blue| 101|
|chair| red| 210|
|table| blue| 124|
|table| red| 223|
+-----+-----+--------+

// ordering and empty rows done manually for demo purposes


scala> inventory.rollup("item", "color").sum().show
+-----+-----+-------------+
| item|color|sum(quantity)|
+-----+-----+-------------+
|chair| blue| 101|
|chair| red| 210|
|chair| null| 311|
| | | |
|table| blue| 124|
|table| red| 223|
|table| null| 347|
| | | |
| null| null| 658|
+-----+-----+-------------+

From Hive’s Cubes and Rollups:

WITH ROLLUP is used with the GROUP BY only. ROLLUP clause is used with GROUP
BY to compute the aggregate at the hierarchy levels of a dimension.

GROUP BY a, b, c with ROLLUP assumes that the hierarchy is "a" drilling down to "b"
drilling down to "c".

GROUP BY a, b, c, WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).

Note: Read up on ROLLUP in Hive's LanguageManual in Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function.


// Borrowed from http://stackoverflow.com/a/27222655/1305344


val quarterlyScores = Seq(
("winter2014", "Agata", 99),
("winter2014", "Jacek", 97),
("summer2015", "Agata", 100),
("summer2015", "Jacek", 63),
("winter2015", "Agata", 97),
("winter2015", "Jacek", 55),
("summer2016", "Agata", 98),
("summer2016", "Jacek", 97)).toDF("period", "student", "score")

scala> quarterlyScores.show
+----------+-------+-----+
| period|student|score|
+----------+-------+-----+
|winter2014| Agata| 99|
|winter2014| Jacek| 97|
|summer2015| Agata| 100|
|summer2015| Jacek| 63|
|winter2015| Agata| 97|
|winter2015| Jacek| 55|
|summer2016| Agata| 98|
|summer2016| Jacek| 97|
+----------+-------+-----+

// ordering and empty rows done manually for demo purposes


scala> quarterlyScores.rollup("period", "student").sum("score").show
+----------+-------+----------+
| period|student|sum(score)|
+----------+-------+----------+
|winter2014| Agata| 99|
|winter2014| Jacek| 97|
|winter2014| null| 196|
| | | |
|summer2015| Agata| 100|
|summer2015| Jacek| 63|
|summer2015| null| 163|
| | | |
|winter2015| Agata| 97|
|winter2015| Jacek| 55|
|winter2015| null| 152|
| | | |
|summer2016| Agata| 98|
|summer2016| Jacek| 97|
|summer2016| null| 195|
| | | |
| null| null| 706|
+----------+-------+----------+

From PostgreSQL’s 7.2.4. GROUPING SETS, CUBE, and ROLLUP:


The individual elements of a CUBE or ROLLUP clause may be either individual expressions, or sublists of elements in parentheses. In the latter case, the sublists are treated as single units for the purposes of generating the individual grouping sets.

// given the above inventory dataset

// using struct function


scala> inventory.rollup(struct("item", "color") as "(item,color)").sum().show
+------------+-------------+
|(item,color)|sum(quantity)|
+------------+-------------+
| [table,red]| 223|
|[chair,blue]| 101|
| null| 658|
| [chair,red]| 210|
|[table,blue]| 124|
+------------+-------------+

// using expr function


scala> inventory.rollup(expr("(item, color)") as "(item, color)").sum().show
+-------------+-------------+
|(item, color)|sum(quantity)|
+-------------+-------------+
| [table,red]| 223|
| [chair,blue]| 101|
| null| 658|
| [chair,red]| 210|
| [table,blue]| 124|
+-------------+-------------+

Internally, rollup converts the Dataset into a DataFrame (i.e. uses RowEncoder as the
encoder) and then creates a RelationalGroupedDataset (with RollupType group type).

Note: The Rollup expression represents SQL's GROUP BY ... WITH ROLLUP in Spark's Catalyst expression tree (after AstBuilder parses a structured query with aggregation).

Tip: Read up on rollup in Deeper into Postgres 9.5 - New Group By Options for Aggregation.

cube Operator

cube(cols: Column*): RelationalGroupedDataset


cube(col1: String, cols: String*): RelationalGroupedDataset


The cube multi-dimensional aggregate operator is an extension of the groupBy operator that calculates subtotals and a grand total across all combinations of the specified group of n + 1 dimensions (with n being the number of columns given as cols and col1, plus 1 extra level where the values become null, i.e. undefined).

cube returns a RelationalGroupedDataset on which you can execute an aggregate function or operator.

Note: cube is more than the rollup operator, i.e. cube does rollup with aggregation over all the missing combinations given the columns.

val sales = Seq(


("Warsaw", 2016, 100),
("Warsaw", 2017, 200),
("Boston", 2015, 50),
("Boston", 2016, 150),
("Toronto", 2017, 50)
).toDF("city", "year", "amount")

val q = sales.cube("city", "year")


.agg(sum("amount") as "amount")
.sort($"city".desc_nulls_last, $"year".asc_nulls_last)
scala> q.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100| <-- total in Warsaw in 2016
| Warsaw|2017| 200| <-- total in Warsaw in 2017
| Warsaw|null| 300| <-- total in Warsaw (across all years)
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|2015| 50| <-- total in 2015 (across all cities)
| null|2016| 250|
| null|2017| 250|
| null|null| 550| <-- grand total (across cities and years)
+-------+----+------+
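Because rollup and cube use null to mark subtotal rows, a null that is already present in the data can be indistinguishable from a subtotal marker. A sketch (reusing the sales Dataset above) that uses the grouping standard function to tell the two apart; the column aliases are arbitrary:

import org.apache.spark.sql.functions.{grouping, sum}

val withGroupingFlags = sales
  .cube("city", "year")
  .agg(
    sum("amount") as "amount",
    grouping("city") as "is_city_subtotal",  // 1 when the city column is aggregated away
    grouping("year") as "is_year_subtotal")  // 1 when the year column is aggregated away
  .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
withGroupingFlags.show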

GROUPING SETS SQL Clause

GROUP BY ... GROUPING SETS (...)

The GROUPING SETS clause generates a dataset that is equivalent to a union of multiple groupBy operators.


val sales = Seq(


("Warsaw", 2016, 100),
("Warsaw", 2017, 200),
("Boston", 2015, 50),
("Boston", 2016, 150),
("Toronto", 2017, 50)
).toDF("city", "year", "amount")
sales.createOrReplaceTempView("sales")

// equivalent to rollup("city", "year")


val q = sql("""
SELECT city, year, sum(amount) as amount
FROM sales
GROUP BY city, year
GROUPING SETS ((city, year), (city), ())
ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
""")
scala> q.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|null| 550| <-- grand total across all cities and years
+-------+----+------+

// equivalent to cube("city", "year")


// note the additional (year) grouping set
val q = sql("""
SELECT city, year, sum(amount) as amount
FROM sales
GROUP BY city, year
GROUPING SETS ((city, year), (city), (year), ())
ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
""")
scala> q.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|2015| 50| <-- total across all cities in 2015
| null|2016| 250| <-- total across all cities in 2016
| null|2017| 250| <-- total across all cities in 2017
| null|null| 550|
+-------+----+------+

Internally, the GROUPING SETS clause is parsed in the withAggregation parsing handler (in AstBuilder) and becomes a GroupingSets logical operator.
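You can peek at this yourself by printing the logical plans of the GROUPING SETS query q defined above (exact operator names may differ slightly across Spark versions):

// Parsed (unresolved) logical plan -- look for the grouping sets
println(q.queryExecution.logical.numberedTreeString)

// Analyzed logical plan -- the grouping sets are rewritten into an Aggregate over an Expand
println(q.queryExecution.analyzed.numberedTreeString)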

Rollup GroupingSet with CodegenFallback Expression (for rollup Operator)

Rollup(groupByExprs: Seq[Expression])
extends GroupingSet

Rollup expression represents rollup operator in Spark’s Catalyst Expression tree (after

AstBuilder parses a structured query with aggregation).

Note: GroupingSet is an Expression with CodegenFallback support.


Dataset Caching and Persistence


One of the optimizations in Spark SQL is Dataset caching (aka Dataset persistence), which is available through the following basic actions of the Dataset API:

cache

persist

unpersist

cache is simply persist with MEMORY_AND_DISK storage level.
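If you want a different storage level, use persist with an explicit StorageLevel and unpersist to remove the Dataset from the cache. A short sketch (the storage level choice is just an example):

import org.apache.spark.storage.StorageLevel

val other = spark.range(1).persist(StorageLevel.MEMORY_ONLY)
other.count                       // trigger caching, as with cache

other.unpersist(blocking = true)  // remove from the cache when no longer needed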

// Cache Dataset -- it is lazy and so nothing really happens


val data = spark.range(1).cache

// Trigger caching by executing an action


// The common idiom is to execute count since it's fairly cheap
data.count

At this point you could use web UI’s Storage tab to review the Datasets persisted. Visit
http://localhost:4040/storage.

Figure 1. web UI’s Storage tab


persist uses CacheManager for an in-memory cache of structured queries and simply registers the structured query being cached as an InMemoryRelation leaf logical operator.

At the withCachedData phase (of execution of a structured query), QueryExecution requests the CacheManager to replace segments of the logical query plan with their cached data (including subqueries).

scala> println(data.queryExecution.withCachedData.numberedTreeString)
00 InMemoryRelation [id#9L], StorageLevel(disk, memory, deserialized, 1 replicas)
01 +- *(1) Range (0, 1, step=1, splits=8)


// Use the cached Dataset in another query
// Notice InMemoryRelation in use for cached queries
scala> data.withColumn("newId", 'id).explain(extended = true)
== Parsed Logical Plan ==
'Project [*, 'id AS newId#16]
+- Range (0, 1, step=1, splits=Some(8))

== Analyzed Logical Plan ==


id: bigint, newId: bigint
Project [id#0L, id#0L AS newId#16L]
+- Range (0, 1, step=1, splits=Some(8))

== Optimized Logical Plan ==


Project [id#0L, id#0L AS newId#16L]
+- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Range (0, 1, step=1, splits=Some(8))

== Physical Plan ==
*Project [id#0L, id#0L AS newId#16L]
+- InMemoryTableScan [id#0L]
      +- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Range (0, 1, step=1, splits=Some(8))

// Clear in-memory cache using SQL


// Equivalent to spark.catalog.clearCache
scala> sql("CLEAR CACHE").collect
res1: Array[org.apache.spark.sql.Row] = Array()

// Visit http://localhost:4040/storage to confirm the cleaning

Note: You can also use SQL's CACHE TABLE [tableName] to cache the tableName table in memory. Unlike the cache and persist operators, CACHE TABLE is an eager operation that is executed as soon as the statement is executed.

sql("CACHE TABLE [tableName]")

You could however use the LAZY keyword to make caching lazy.

sql("CACHE LAZY TABLE [tableName]")

Use SQL's REFRESH TABLE [tableName] to refresh a cached table.

Use SQL's UNCACHE TABLE (IF EXISTS)? [tableName] to remove a table from the cache.

Use SQL's CLEAR CACHE to remove all tables from the cache.


Note: Be careful where you place cache in a query, i.e. which Dataset is cached, as different placements give you different cached queries.

// cache after range(5)
val q1 = spark.range(5).cache.filter($"id" % 2 === 0).select("id")
scala> q1.explain
== Physical Plan ==
*Filter ((id#0L % 2) = 0)
+- InMemoryTableScan [id#0L], [((id#0L % 2) = 0)]
      +- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *Range (0, 5, step=1, splits=8)

// cache at the end
val q2 = spark.range(1).filter($"id" % 2 === 0).select("id").cache
scala> q2.explain
== Physical Plan ==
InMemoryTableScan [id#17L]
   +- InMemoryRelation [id#17L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *Filter ((id#17L % 2) = 0)
            +- *Range (0, 1, step=1, splits=8)

Tip: You can check whether a Dataset was cached or not using the following code:

scala> :type q2
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]

val cache = spark.sharedState.cacheManager
scala> cache.lookupCachedData(q2.queryExecution.logical).isDefined
res0: Boolean = false

SQL’s CACHE TABLE


SQL's CACHE TABLE corresponds to requesting the session-specific Catalog to cache the table.
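A short sketch showing the eager SQL statement next to the (lazy) Catalog API it corresponds to; the table name is made up:

spark.range(5).createOrReplaceTempView("nums")

// SQL's CACHE TABLE is eager -- the table is cached as soon as the statement runs
sql("CACHE TABLE nums")
assert(spark.catalog.isCached("nums"))

// The corresponding Catalog API call (lazy -- caching happens on the next action)
spark.catalog.uncacheTable("nums")
spark.catalog.cacheTable("nums")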

Internally, CACHE TABLE becomes CacheTableCommand runnable command that…​FIXME


User-Friendly Names Of Cached Queries in web UI's Storage Tab
As you may have noticed, web UI's Storage tab displays some cached queries with user-friendly RDD names (e.g. "In-memory table [name]") while others do not get one (e.g. "Scan JDBCRelation…​").

Figure 1. Cached Queries in web UI (Storage Tab)


"In-memory table [name]" RDD names are the result of SQL’s CACHE TABLE or when
Catalog is requested to cache a table.


// register Dataset as temporary view (table)


spark.range(1).createOrReplaceTempView("one")
// caching is lazy and won't happen until an action is executed
val one = spark.table("one").cache
// The following gives "*Range (0, 1, step=1, splits=8)"
// WHY?!
one.show

scala> spark.catalog.isCached("one")
res0: Boolean = true

one.unpersist

import org.apache.spark.storage.StorageLevel
// caching is lazy
spark.catalog.cacheTable("one", StorageLevel.MEMORY_ONLY)
// The following gives "In-memory table one"
one.show

spark.range(100).createOrReplaceTempView("hundred")
// SQL's CACHE TABLE is eager
// The following gives "In-memory table `hundred`"
// WHY single quotes?
spark.sql("CACHE TABLE hundred")

// register Dataset under name


val ds = spark.range(20)
spark.sharedState.cacheManager.cacheQuery(ds, Some("twenty"))
// trigger an action
ds.head

The other RDD names are due to caching a Dataset.

val ten = spark.range(10).cache


ten.head


Dataset Checkpointing
Dataset Checkpointing is a feature of Spark SQL that truncates the logical query plan, which can be particularly useful for highly iterative data algorithms (e.g. Spark MLlib, which uses Spark SQL's Dataset API for data manipulation).

Note: Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with the previously computed state of a distributed computation described as an RDD. That has been successfully used in Spark Streaming, the now-obsolete Spark module for stream processing based on the RDD API.

Checkpointing truncates the lineage of the RDD to be checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.

Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of the Dataset being checkpointed.

Checkpointing can be eager or lazy per the eager flag of the checkpoint operator. Eager checkpointing is the default and happens immediately when requested; lazy checkpointing only happens when an action is executed.

Using Dataset checkpointing requires that you specify the checkpoint directory. The
directory stores the checkpoint files for RDDs to be checkpointed. Use
SparkContext.setCheckpointDir to set the path to a checkpoint directory.
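A minimal sketch putting the pieces together (the checkpoint directory path is just an example):

// The checkpoint directory must be set before checkpointing
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val nums = spark.range(10)

// Eager (default): checkpointing happens immediately
val eagerly = nums.checkpoint()

// Lazy: checkpointing happens only when an action is executed
val lazily = nums.checkpoint(eager = false)
lazily.count

// The logical plan of the checkpointed Dataset is truncated
println(eagerly.queryExecution.logical.numberedTreeString)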

Checkpointing can be local or reliable which defines how reliable the checkpoint directory is