
Table of Contents
Introduction 1.1
Overview of Apache Spark 1.2

Spark Core / Transferring Data Blocks In Spark Cluster
ShuffleClient — Contract to Fetch Shuffle Blocks 2.1
BlockTransferService — Pluggable Block Transfers (To Fetch and Upload Blocks) 2.1.1
ExternalShuffleClient 2.1.2
NettyBlockTransferService — Netty-Based BlockTransferService 2.2
NettyBlockRpcServer — NettyBlockTransferService’s RpcHandler 2.2.1
BlockFetchingListener 2.3
RetryingBlockFetcher 2.4
BlockFetchStarter 2.4.1

Spark Core / Web UI


Web UI — Spark Application’s Web Console 3.1
Jobs 3.1.1
Stages 3.1.2
Storage 3.1.3
Environment 3.1.4
Executors 3.1.5
JobsTab 3.2
AllJobsPage 3.2.1
JobPage 3.2.2
StagesTab — Stages for All Jobs 3.3
AllStagesPage — Stages for All Jobs 3.3.1
StagePage — Stage Details 3.3.2
PoolPage — Pool Details 3.3.3

StorageTab 3.4
StoragePage 3.4.1
RDDPage 3.4.2
EnvironmentTab 3.5
EnvironmentPage 3.5.1
ExecutorsTab 3.6
ExecutorsPage 3.6.1
ExecutorThreadDumpPage 3.6.2
SparkUI — Web UI of Spark Application 3.7
SparkUITab 3.7.1
BlockStatusListener Spark Listener 3.8
EnvironmentListener Spark Listener 3.9
ExecutorsListener Spark Listener 3.10
JobProgressListener Spark Listener 3.11
StorageStatusListener Spark Listener 3.12
StorageListener — Spark Listener for Tracking Persistence Status of RDD Blocks 3.13
RDDOperationGraphListener Spark Listener 3.14
WebUI — Framework For Web UIs 3.15
WebUIPage — Contract of Pages in Web UI 3.15.1
WebUITab — Contract of Tabs in Web UI 3.15.2
RDDStorageInfo 3.16
RDDInfo 3.17
LiveEntity 3.18
LiveRDD 3.18.1
UIUtils 3.19
JettyUtils 3.20
web UI Configuration Properties 3.21

Spark Core / Metrics


Spark Metrics 4.1
MetricsSystem 4.2
MetricsConfig — Metrics System Configuration 4.3
Source — Contract of Metrics Sources 4.4

Sink — Contract of Metrics Sinks 4.5
MetricsServlet JSON Metrics Sink 4.5.1
Metrics Configuration Properties 4.6

Spark Core / Status REST API


Status REST API — Monitoring Spark Applications Using REST API 5.1
ApiRootResource — /api/v1 URI Handler 5.2
ApplicationListResource — applications URI Handler 5.2.1
OneApplicationResource — applications/appId URI Handler 5.2.2
StagesResource 5.2.2.1
OneApplicationAttemptResource 5.2.3
AbstractApplicationResource 5.3
BaseAppResource 5.4
ApiRequestContext 5.5
UIRoot — Contract for Root Contrainers of Application UI Information 5.6
UIRootFromServletContext 5.6.1

Spark MLlib
Spark MLlib — Machine Learning in Spark 6.1
ML Pipelines (spark.ml) 6.2
Pipeline 6.2.1
PipelineStage 6.2.2
Transformers 6.2.3
Transformer 6.2.3.1
Tokenizer 6.2.3.2
Estimators 6.2.4
Estimator 6.2.4.1
StringIndexer 6.2.4.1.1
KMeans 6.2.4.1.2
TrainValidationSplit 6.2.4.1.3
Predictor 6.2.4.2
RandomForestRegressor 6.2.4.2.1

Regressor 6.2.4.3
LinearRegression 6.2.4.3.1
Classifier 6.2.4.4
RandomForestClassifier 6.2.4.4.1
DecisionTreeClassifier 6.2.4.4.2
Models 6.2.5
Model 6.2.5.1
Evaluator — ML Pipeline Component for Model Scoring 6.2.6
BinaryClassificationEvaluator — Evaluator of Binary Classification Models 6.2.6.1
ClusteringEvaluator — Evaluator of Clustering Models 6.2.6.2
MulticlassClassificationEvaluator — Evaluator of Multiclass Classification Models 6.2.6.3
RegressionEvaluator — Evaluator of Regression Models 6.2.6.4
CrossValidator — Model Tuning / Finding The Best Model 6.2.7
CrossValidatorModel 6.2.7.1
ParamGridBuilder 6.2.7.2
CrossValidator with Pipeline Example 6.2.7.3
Params and ParamMaps 6.2.8
ValidatorParams 6.2.8.1
HasParallelism 6.2.8.2
ML Persistence — Saving and Loading Models and Pipelines 6.3
MLWritable 6.3.1
MLReader 6.3.2
Example — Text Classification 6.4
Example — Linear Regression 6.5
Logistic Regression 6.6
LogisticRegression 6.6.1
Latent Dirichlet Allocation (LDA) 6.7
Vector 6.8
LabeledPoint 6.9
Streaming MLlib 6.10
GeneralizedLinearRegression 6.11
Alternating Least Squares (ALS) Matrix Factorization 6.12
ALS — Estimator for ALSModel 6.12.1

ALSModel — Model for Predictions 6.12.2
ALSModelReader 6.12.3
Instrumentation 6.13
MLUtils 6.14

Spark Core / Tools


Spark Shell — spark-shell shell script 7.1
Spark Submit — spark-submit shell script 7.2
SparkSubmitArguments 7.2.1
SparkSubmitOptionParser — spark-submit’s Command-Line Parser 7.2.2
SparkSubmitCommandBuilder Command Builder 7.2.3
spark-class shell script 7.3
AbstractCommandBuilder 7.3.1
SparkLauncher — Launching Spark Applications Programmatically 7.4

Spark Core / Architecture


Spark Architecture 8.1
Driver 8.2
Executor 8.3
TaskRunner 8.3.1
ExecutorSource 8.3.2
Master 8.4
Workers 8.5

Spark Core / RDD


Anatomy of Spark Application 9.1
SparkConf — Programmable Configuration for Spark Applications 9.2
Spark Properties and spark-defaults.conf Properties File 9.2.1
Deploy Mode 9.2.2
SparkContext 9.3
HeartbeatReceiver RPC Endpoint 9.3.1

Inside Creating SparkContext 9.3.2
ConsoleProgressBar 9.3.3
SparkStatusTracker 9.3.4
Local Properties — Creating Logical Job Groups 9.3.5
RDD — Resilient Distributed Dataset 9.4
RDD 9.4.1
RDD Lineage — Logical Execution Plan 9.4.2
TaskLocation 9.4.3
ParallelCollectionRDD 9.4.4
MapPartitionsRDD 9.4.5
OrderedRDDFunctions 9.4.6
CoGroupedRDD 9.4.7
SubtractedRDD 9.4.8
HadoopRDD 9.4.9
NewHadoopRDD 9.4.10
ShuffledRDD 9.4.11
Operators 9.5
Transformations 9.5.1
PairRDDFunctions 9.5.1.1
Actions 9.5.2
Caching and Persistence 9.6
StorageLevel 9.6.1
Partitions and Partitioning 9.7
Partition 9.7.1
Partitioner 9.7.2
HashPartitioner 9.7.2.1
Shuffling 9.8
Checkpointing 9.9
CheckpointRDD 9.9.1
RDD Dependencies 9.10
NarrowDependency — Narrow Dependencies 9.10.1
ShuffleDependency — Shuffle Dependencies 9.10.2
Map/Reduce-side Aggregator 9.11
AppStatusStore 9.12

AppStatusPlugin 9.13
AppStatusListener 9.14
KVStore 9.15
KVStoreView 9.15.1
ElementTrackingStore 9.15.2
InMemoryStore 9.15.3
LevelDB 9.15.4
InterruptibleIterator — Iterator With Support For Task Cancellation 9.16

Spark Core / Optimizations


Broadcast variables 10.1
Accumulators 10.2
AccumulatorContext 10.2.1

Spark Core / Services


SerializerManager 11.1
MemoryManager — Memory Management 11.2
UnifiedMemoryManager — Spark’s Memory Manager 11.2.1
StaticMemoryManager — Legacy Memory Manager 11.2.2
MemoryManager Configuration Properties 11.2.3
SparkEnv — Spark Runtime Environment 11.3
DAGScheduler — Stage-Oriented Scheduler 11.4
Jobs 11.4.1
Stage — Physical Unit Of Execution 11.4.2
ShuffleMapStage — Intermediate Stage in Execution DAG 11.4.2.1
ResultStage — Final Stage in Job 11.4.2.2
StageInfo 11.4.2.3
DAGSchedulerSource — Metrics Source for DAGScheduler 11.4.3
DAGScheduler Event Bus 11.4.4
JobListener 11.4.5
JobWaiter 11.4.5.1

TaskScheduler — Spark Scheduler 11.5
Tasks 11.5.1
ShuffleMapTask — Task for ShuffleMapStage 11.5.1.1
ResultTask 11.5.1.2
FetchFailedException 11.5.2
MapStatus — Shuffle Map Output Status 11.5.3
TaskSet — Set of Tasks for Stage 11.5.4
TaskSetManager 11.5.5
Schedulable 11.5.5.1
Schedulable Pool 11.5.5.2
Schedulable Builders 11.5.5.3
FIFOSchedulableBuilder 11.5.5.3.1
FairSchedulableBuilder 11.5.5.3.2
Scheduling Mode — spark.scheduler.mode Spark Property 11.5.5.4
TaskInfo 11.5.5.5
TaskDescription — Metadata of Single Task 11.5.6
TaskSchedulerImpl — Default TaskScheduler 11.5.7
Speculative Execution of Tasks 11.5.7.1
TaskResultGetter 11.5.7.2
TaskContext 11.5.8
TaskContextImpl 11.5.8.1
TaskResults — DirectTaskResult and IndirectTaskResult 11.5.9
TaskMemoryManager — Memory Manager of Single Task 11.5.10
MemoryConsumer 11.5.10.1
TaskMetrics 11.5.11
ShuffleWriteMetrics 11.5.11.1
TaskSetBlacklist — Blacklisting Executors and Nodes For TaskSet 11.5.12
SchedulerBackend — Pluggable Scheduler Backends 11.6
CoarseGrainedSchedulerBackend 11.6.1
DriverEndpoint — CoarseGrainedSchedulerBackend RPC Endpoint 11.6.1.1
ExecutorBackend — Pluggable Executor Backends 11.7
CoarseGrainedExecutorBackend 11.7.1
MesosExecutorBackend 11.7.2
BlockManager — Key-Value Store of Blocks of Data 11.8

MemoryStore 11.8.1
BlockEvictionHandler 11.8.2
StorageMemoryPool 11.8.3
MemoryPool 11.8.4
DiskStore 11.8.5
BlockDataManager 11.8.6
RpcHandler 11.8.7
RpcResponseCallback 11.8.8
TransportRequestHandler 11.8.9
TransportContext 11.8.10
TransportServer 11.8.11
TransportClientFactory 11.8.12
MessageHandler 11.8.13
BlockManagerMaster — BlockManager for Driver 11.8.14
BlockManagerMasterEndpoint — BlockManagerMaster RPC Endpoint 11.8.14.1
DiskBlockManager 11.8.15
BlockInfoManager 11.8.16
BlockInfo 11.8.16.1
BlockManagerSlaveEndpoint 11.8.17
DiskBlockObjectWriter 11.8.18
BlockManagerSource — Metrics Source for BlockManager 11.8.19
ShuffleMetricsSource — Metrics Source of BlockManager for Shuffle-Related Metrics 11.8.20
StorageStatus 11.8.21
ManagedBuffer 11.8.22
MapOutputTracker — Shuffle Map Output Registry 11.9
MapOutputTrackerMaster — MapOutputTracker For Driver 11.9.1
MapOutputTrackerMasterEndpoint 11.9.1.1
MapOutputTrackerWorker — MapOutputTracker for Executors 11.9.2
ShuffleManager — Pluggable Shuffle Systems 11.10
SortShuffleManager — The Default Shuffle System 11.10.1
ExternalShuffleService 11.10.2
OneForOneStreamManager 11.10.3
ShuffleBlockResolver 11.10.4

IndexShuffleBlockResolver 11.10.4.1
ShuffleWriter 11.10.5
BypassMergeSortShuffleWriter 11.10.5.1
SortShuffleWriter 11.10.5.2
UnsafeShuffleWriter — ShuffleWriter for SerializedShuffleHandle 11.10.5.3
BaseShuffleHandle — Fallback Shuffle Handle 11.10.6
BypassMergeSortShuffleHandle — Marker Interface for Bypass Merge Sort Shuffle Handles 11.10.7
SerializedShuffleHandle — Marker Interface for Serialized Shuffle Handles 11.10.8
ShuffleReader 11.10.9
BlockStoreShuffleReader 11.10.9.1
ShuffleBlockFetcherIterator 11.10.10
ShuffleExternalSorter — Cache-Efficient Sorter 11.10.11
ExternalSorter 11.10.12
Serialization 11.11
Serializer — Task SerDe 11.11.1
SerializerInstance 11.11.2
SerializationStream 11.11.3
DeserializationStream 11.11.4
ExternalClusterManager — Pluggable Cluster Managers 11.12
BroadcastManager 11.13
BroadcastFactory — Pluggable Broadcast Variable Factories 11.13.1
TorrentBroadcastFactory 11.13.1.1
TorrentBroadcast 11.13.1.2
CompressionCodec 11.13.2
ContextCleaner — Spark Application Garbage Collector 11.14
CleanerListener 11.14.1
Dynamic Allocation (of Executors) 11.15
ExecutorAllocationManager — Allocation Manager for Spark Core 11.15.1
ExecutorAllocationClient 11.15.2
ExecutorAllocationListener 11.15.3
ExecutorAllocationManagerSource 11.15.4
HTTP File Server 11.16
Data Locality 11.17

Cache Manager 11.18
OutputCommitCoordinator 11.19
RpcEnv — RPC Environment 11.20
RpcEndpoint 11.20.1
RpcEndpointRef 11.20.2
RpcEnvFactory 11.20.3
Netty-based RpcEnv 11.20.4
TransportConf — Transport Configuration 11.21
Utils Helper Object 11.22

Spark Core / Security


Securing Web UI 12.1

Spark Deployment Environments


Deployment Environments — Run Modes 13.1
Spark local (pseudo-cluster) 13.2
LocalSchedulerBackend 13.2.1
LocalEndpoint 13.2.2
Spark on cluster 13.3

Spark on YARN
Spark on YARN 14.1
YarnShuffleService — ExternalShuffleService on YARN 14.2
ExecutorRunnable 14.3
Client 14.4
YarnRMClient 14.5
ApplicationMaster 14.6
AMEndpoint — ApplicationMaster RPC Endpoint 14.6.1
YarnClusterManager — ExternalClusterManager for YARN 14.7
TaskSchedulers for YARN 14.8

YarnScheduler 14.8.1
YarnClusterScheduler 14.8.2
SchedulerBackends for YARN 14.9
YarnSchedulerBackend 14.9.1
YarnClientSchedulerBackend 14.9.2
YarnClusterSchedulerBackend 14.9.3
YarnSchedulerEndpoint RPC Endpoint 14.9.4
YarnAllocator 14.10
Introduction to Hadoop YARN 14.11
Setting up YARN Cluster 14.12
Kerberos 14.13
ConfigurableCredentialManager 14.13.1
ClientDistributedCacheManager 14.14
YarnSparkHadoopUtil 14.15
Settings 14.16

Spark Standalone
Spark Standalone 15.1
Standalone Master — Cluster Manager of Spark Standalone 15.2
Standalone Worker 15.3
web UI 15.4
ApplicationPage 15.4.1
LocalSparkCluster — Single-JVM Spark Standalone Cluster 15.5
Submission Gateways 15.6
Management Scripts for Standalone Master 15.7
Management Scripts for Standalone Workers 15.8
Checking Status 15.9
Example 2-workers-on-1-node Standalone Cluster (one executor per worker) 15.10
StandaloneSchedulerBackend 15.11

Spark on Mesos
Spark on Mesos 16.1

MesosCoarseGrainedSchedulerBackend 16.2
About Mesos 16.3

Execution Model
Execution Model 17.1

Monitoring, Tuning and Debugging


Unified Memory Management 18.1
Spark History Server 18.2
HistoryServer — WebUI For Active And Completed Spark Applications 18.2.1
SQLHistoryListener 18.2.2
FsHistoryProvider — File-System-Based History Provider 18.2.3
ApplicationHistoryProvider 18.2.4
HistoryServerArguments 18.2.5
ApplicationCacheOperations 18.2.6
ApplicationCache 18.2.7
Logging 18.3
Performance Tuning 18.4
SparkListener — Intercepting Events from Spark Scheduler 18.5
LiveListenerBus 18.5.1
ReplayListenerBus 18.5.2
SparkListenerBus — Internal Contract for Spark Event Buses 18.5.3
EventLoggingListener — Spark Listener for Persisting Events 18.5.4
StatsReportListener — Logging Summary Statistics 18.5.5
JsonProtocol 18.6
Debugging Spark 18.7

Varia
Building Apache Spark from Sources 19.1
Spark and Hadoop 19.2
SparkHadoopUtil 19.2.1

Spark and software in-memory file systems 19.3
Spark and The Others 19.4
Distributed Deep Learning on Spark 19.5
Spark Packages 19.6

Interactive Notebooks
Interactive Notebooks 20.1
Apache Zeppelin 20.1.1
Spark Notebook 20.1.2

Spark Tips and Tricks


Spark Tips and Tricks 21.1
Access private members in Scala in Spark shell 21.2
SparkException: Task not serializable 21.3
Running Spark Applications on Windows 21.4

Exercises
One-liners using PairRDDFunctions 22.1
Learning Jobs and Partitions Using take Action 22.2
Spark Standalone - Using ZooKeeper for High-Availability of Master 22.3
Spark’s Hello World using Spark shell and Scala 22.4
WordCount using Spark shell 22.5
Your first complete Spark application (using Scala and sbt) 22.6
Spark (notable) use cases 22.7
Using Spark SQL to update data in Hive using ORC files 22.8
Developing Custom SparkListener to monitor DAGScheduler in Scala 22.9
Developing RPC Environment 22.10
Developing Custom RDD 22.11
Working with Datasets from JDBC Data Sources (and PostgreSQL) 22.12
Causing Stage to Fail 22.13

Further Learning
Courses 23.1
Books 23.2

(separate book) Spark SQL


Spark SQL — Batch and Streaming Queries Over Structured Data on Massive Scale 24.1

(separate book) Spark Structured Streaming


Spark Structured Streaming — Streaming Datasets 25.1

(obsolete) Spark Streaming


Spark Streaming — Streaming RDDs 26.1
BlockRDD 26.1.1

(obsolete) Spark GraphX


Spark GraphX — Distributed Graph Computations 27.1
Graph Algorithms 27.2

Introduction

Mastering Apache Spark (2.3.1)


Welcome to Mastering Apache Spark gitbook! I’m very excited to have you here and hope
you will enjoy exploring the internals of Apache Spark (Core) as much as I have.

I write to discover what I know.

— Flannery O'Connor
I’m Jacek Laskowski, an independent consultant, software developer and technical instructor
specializing in Apache Spark, Apache Kafka and Kafka Streams (with Scala, sbt,
Kubernetes, DC/OS, Apache Mesos, and Hadoop YARN).

I offer software development and consultancy services with very hands-on in-depth
workshops and mentoring. Reach out to me at [email protected] or @jaceklaskowski to
discuss opportunities.

Consider joining me at Warsaw Scala Enthusiasts and Warsaw Spark meetups in Warsaw,
Poland.

Tip I’m also writing Mastering Spark SQL, Mastering Kafka Streams, Apache Kafka Notebook and Spark Structured Streaming Notebook gitbooks.

Expect text and code snippets from a variety of public sources. Attribution follows.

Now, let me introduce you to Apache Spark.

Overview of Apache Spark

Apache Spark
Apache Spark is an open-source distributed general-purpose cluster computing
framework with (mostly) in-memory data processing engine that can do ETL, analytics,
machine learning and graph processing on large volumes of data at rest (batch processing)
or in motion (streaming processing) with rich concise high-level APIs for the programming
languages: Scala, Python, Java, R, and SQL.

Figure 1. The Spark Platform


You could also describe Spark as a distributed, data processing engine for batch and
streaming modes featuring SQL queries, graph processing, and machine learning.

In contrast to Hadoop’s two-stage disk-based MapReduce computation engine, Spark’s multi-stage (mostly) in-memory computing engine allows for running most computations in memory, and hence most of the time provides better performance for certain applications, e.g. iterative algorithms or interactive data mining (read Spark officially sets a new record in large-scale sorting).

Spark aims at speed, ease of use, extensibility and interactive analytics.

Spark is often called cluster computing engine or simply execution engine.

Spark is a distributed platform for executing complex multi-stage applications, like machine learning algorithms, and interactive ad hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called Resilient Distributed Dataset.


Using Spark Application Frameworks, Spark simplifies access to machine learning and
predictive analytics at scale.

Spark is mainly written in Scala, but provides developer API for languages like Java, Python,
and R.

Note Microsoft’s Mobius project provides C# API for Spark "enabling the implementation of Spark driver program and data processing operations in the languages supported in the .NET framework like C# or F#."

If you have large amounts of data that requires low latency processing that a typical
MapReduce program cannot provide, Spark is a viable alternative.

Access any data type across any data source.

Huge demand for storage and data processing.

The Apache Spark project is an umbrella for SQL (with Datasets), streaming, machine
learning (pipelines) and graph processing engines built atop Spark Core. You can run them
all in a single application using a consistent API.

Spark runs locally as well as in clusters, on-premises or in cloud. It runs on top of Hadoop
YARN, Apache Mesos, standalone or in the cloud (Amazon EC2 or IBM Bluemix).

Spark can access data from many data sources.

Apache Spark’s Streaming and SQL programming models with MLlib and GraphX make it
easier for developers and data scientists to build applications that exploit machine learning
and graph analytics.

At a high level, any Spark application creates RDDs out of some input, runs (lazy) transformations of these RDDs to some other form (shape), and finally performs actions to collect or store data. Not much, huh?
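
A minimal sketch of that create-transform-act flow (assuming a spark-shell session where the SparkContext is available as sc and a README.md file exists in the current directory):

val lines = sc.textFile("README.md")      // create an RDD out of some input
val lengths = lines.map(_.length)         // lazy transformation to some other form
val total = lengths.reduce(_ + _)         // action that actually runs the distributed job
println(s"Total number of characters: $total")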

You can look at Spark from programmer’s, data engineer’s and administrator’s point of view.
And to be honest, all three types of people will spend quite a lot of their time with Spark to
finally reach the point where they exploit all the available features. Programmers use
language-specific APIs (and work at the level of RDDs using transformations and actions),
data engineers use higher-level abstractions like DataFrames or Pipelines APIs or external
tools (that connect to Spark), and it is all possible to run only because administrators set up Spark clusters to deploy Spark applications to.

It is Spark’s goal to be a general-purpose computing platform with various specialized application frameworks on top of a single unified engine.


Note When you hear "Apache Spark" it can be two things — the Spark engine aka Spark Core, or the Apache Spark open source project, which is an "umbrella" term for Spark Core and the accompanying Spark Application Frameworks, i.e. Spark SQL, Spark Streaming, Spark MLlib and Spark GraphX, that sit on top of Spark Core and the main data abstraction in Spark called RDD - Resilient Distributed Dataset.

Why Spark
Let’s list a few of the many reasons for Spark. We are doing it first, and then comes the
overview that lends a more technical helping hand.

Easy to Get Started


Spark offers spark-shell that makes for a very easy head start to writing and running Spark
applications on the command line on your laptop.

You could then use Spark Standalone built-in cluster manager to deploy your Spark
applications to a production-grade cluster to run on a full dataset.
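
A quick, illustrative spark-shell session (a sketch; the REPL echo lines are abbreviated and the exact output depends on your environment):

$ ./bin/spark-shell
scala> val rdd = sc.parallelize(1 to 5)
scala> rdd.map(_ * 2).collect
res0: Array[Int] = Array(2, 4, 6, 8, 10)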

Unified Engine for Diverse Workloads


As said by Matei Zaharia - the author of Apache Spark - in Introduction to AmpLab Spark
Internals video (quoting with few changes):

One of the Spark project goals was to deliver a platform that supports a very wide array of diverse workflows - not only MapReduce batch jobs (they were already available in Hadoop at that time), but also iterative computations like graph algorithms or Machine Learning.

And also different scales of workloads from sub-second interactive jobs to jobs that run
for many hours.

Spark combines batch, interactive, and streaming workloads under one rich concise API.

Spark supports near real-time streaming workloads via Spark Streaming application
framework.

ETL workloads and Analytics workloads are different, however Spark attempts to offer a
unified platform for a wide variety of workloads.

Graph and Machine Learning algorithms are iterative by nature, and fewer saves to disk or transfers over the network mean better performance.

There is also support for interactive workloads using Spark shell.


You should watch the video What is Apache Spark? by Mike Olson, Chief Strategy Officer and Co-Founder at Cloudera, who provides an exceptional overview of Apache Spark, its rise in popularity in the open source community, and how Spark is primed to replace MapReduce as the general processing engine in Hadoop.

Leverages the Best in distributed batch data processing


When you think about distributed batch data processing, Hadoop naturally comes to mind
as a viable solution.

Spark draws many ideas out of Hadoop MapReduce. They work together well - Spark on
YARN and HDFS - while improving on the performance and simplicity of the distributed
computing engine.

For many, Spark is Hadoop++, i.e. MapReduce done in a better way.

And it should not come as a surprise, without Hadoop MapReduce (its advances and
deficiencies), Spark would not have been born at all.

RDD - Distributed Parallel Scala Collections


As a Scala developer, you may find Spark’s RDD API very similar (if not identical) to Scala’s
Collections API.

It is also exposed in Java, Python and R (as well as SQL, i.e. SparkSQL, in a sense).

So, when you have a need for distributed Collections API in Scala, Spark with RDD API
should be a serious contender.

Rich Standard Library


Not only can you use map and reduce (as in Hadoop MapReduce jobs) in Spark, but also
a vast array of other higher-level operators to ease your Spark queries and application
development.

It expanded the available computation styles beyond the map-and-reduce model that was the only one available in Hadoop MapReduce.
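
As a small taste of those higher-level operators, the classic word count needs only flatMap, map and reduceByKey (a sketch assuming a spark-shell session with sc and a local README.md):

val counts = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(3).foreach(println)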

Unified development and deployment environment for all


Regardless of the Spark tools you use - the Spark API for the many programming languages supported - Scala, Java, Python, R, or the Spark shell, or the many Spark Application Frameworks leveraging the concept of RDD, i.e. Spark SQL, Spark Streaming, Spark MLlib and Spark GraphX - you still use the same development and deployment environment for large data sets to yield a result, be it a prediction (Spark MLlib), a structured data query (Spark SQL) or just a large distributed batch (Spark Core) or streaming (Spark Streaming) computation.

Spark is also very productive in that teams can exploit the different skills the team members have acquired so far. Data analysts, data scientists, and Python, Java, Scala or R programmers can all use the same Spark platform through tailor-made APIs. It makes it possible to bring skilled people with their expertise in different programming languages together on a Spark project.

Interactive Exploration / Exploratory Analytics


It is also called ad hoc queries.

Using the Spark shell you can execute computations to process large amounts of data (The Big Data). It’s all interactive and very useful to explore the data before a final production release.

Also, using the Spark shell you can access any Spark cluster as if it was your local machine. Just point the Spark shell to a 20-node cluster with 10TB of RAM in total (using --master ) and use all the components (and their abstractions) like Spark SQL, Spark MLlib, Spark Streaming, and Spark GraphX.

Depending on your needs and skills, you may see a better fit for SQL vs programming APIs
or apply machine learning algorithms (Spark MLlib) from data in graph data structures
(Spark GraphX).

Single Environment
Regardless of which programming language you are good at, be it Scala, Java, Python, R or
SQL, you can use the same single clustered runtime environment for prototyping, ad hoc
queries, and deploying your applications leveraging the many ingestion data points offered
by the Spark platform.

You can be as low-level as using RDD API directly or leverage higher-level APIs of Spark
SQL (Datasets), Spark MLlib (ML Pipelines), Spark GraphX (Graphs) or Spark Streaming
(DStreams).

Or use them all in a single application.

The single programming model and execution engine for different kinds of workloads
simplify development and deployment architectures.


Data Integration Toolkit with Rich Set of Supported Data Sources
Spark can read from many types of data sources — relational, NoSQL, file systems, etc. — 
using many types of data formats - Parquet, Avro, CSV, JSON.

Both input and output data sources allow programmers and data engineers to use Spark as the platform where large amounts of data are read from or saved to for processing, interactively (using the Spark shell) or in applications.

Tools unavailable then, at your fingertips now


As much and often as it’s recommended to pick the right tool for the job, it’s not always
feasible. Time, personal preference, operating system you work on are all factors to decide
what is right at a time (and using a hammer can be a reasonable choice).

Spark embraces many concepts in a single unified development and runtime environment.

Machine learning that is so tool- and feature-rich in Python, e.g. SciKit library, can now
be used by Scala developers (as Pipeline API in Spark MLlib or calling pipe() ).

DataFrames from R are available in Scala, Java, Python, R APIs.

Single node computations in machine learning algorithms are migrated to their distributed versions in Spark MLlib.

This single platform gives plenty of opportunities for Python, Scala, Java, and R
programmers as well as data engineers (SparkR) and scientists (using proprietary enterprise
data warehouses with Thrift JDBC/ODBC Server in Spark SQL).

Mind the proverb if all you have is a hammer, everything looks like a nail, too.

Low-level Optimizations
Apache Spark uses a directed acyclic graph (DAG) of computation stages (aka execution
DAG). It postpones any processing until really required for actions. Spark’s lazy evaluation
gives plenty of opportunities to induce low-level optimizations (so users have to know less to
do more).

Mind the proverb less is more.
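
A short sketch of that laziness (assuming a spark-shell session with sc): no job runs until an action is called, and the lineage behind the execution DAG can be inspected with toDebugString.

val rdd = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 4 == 0)        // still nothing has been computed
println(rdd.toDebugString)   // shows the lineage (logical execution plan)
rdd.count()                  // the action finally triggers the job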

Excels at low-latency iterative workloads


Spark supports diverse workloads, but successfully targets low-latency iterative ones. They
are often used in Machine Learning and graph algorithms.

Many Machine Learning algorithms, like logistic regression, require plenty of iterations before the resulting models become optimal. The same applies to graph algorithms that traverse all the nodes and edges when needed. Such computations can increase their performance when the interim partial results are stored in memory or on very fast solid state drives.

Spark can cache intermediate data in memory for faster model building and training. Once
the data is loaded to memory (as an initial step), reusing it multiple times incurs no
performance slowdowns.
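
A sketch of that reuse pattern (assuming a spark-shell session with sc and a hypothetical data.txt of comma-separated numbers): the dataset is cached once and every subsequent pass reads it from memory.

val training = sc.textFile("data.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()                                       // keep the parsed records in memory

(1 to 10).foreach { i =>
  val sum = training.map(_.sum).reduce(_ + _)    // reads from memory after the first pass
  println(s"iteration $i: $sum")
}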

Also, graph algorithms can traverse graphs one connection per iteration with the partial
result in memory.

Less disk access and network traffic can make a huge difference when you need to process lots of data, especially when it is Big Data.

ETL done easier


Spark gives Extract, Transform and Load (ETL) a new look with the many programming
languages supported - Scala, Java, Python (less likely R). You can use them all or pick the
best for a problem.

Scala in Spark, especially, makes for much less boilerplate code (compared to other languages and approaches like MapReduce in Java).

Unified Concise High-Level API


Spark offers unified, concise, high-level APIs for batch analytics (RDD API), SQL queries (Dataset API), real-time analysis (DStream API), machine learning (ML Pipeline API) and graph processing (Graph API).

Developers no longer have to learn many different processing engines and platforms, and can instead spend their time mastering framework APIs per use case (atop a single computation engine, Spark).

Different kinds of data processing using unified API


Spark offers three kinds of data processing using batch, interactive, and stream
processing with the unified API and data structures.

Little to no disk use for better performance


In the not-so-long-ago times, when the most prevalent distributed computing framework was Hadoop MapReduce, you could reuse data between computations (even partial ones!) only after you had written it to an external storage like Hadoop Distributed Filesystem (HDFS). That can cost you a lot of time to compute even very basic multi-stage computations. It simply suffers from IO (and perhaps network) overhead.

One of the many motivations to build Spark was to have a framework that is good at data
reuse.

Spark cuts that out by keeping as much data as possible in memory and keeping it there until a job is finished. It doesn’t matter how many stages belong to a job. What does matter is the available memory and how effective you are in using Spark API (so no shuffle occurs).

The less network and disk IO, the better performance, and Spark tries hard to find ways to
minimize both.

Fault Tolerance included


Faults are not considered a special case in Spark, but an obvious consequence of being a parallel and distributed system. Spark handles and recovers from faults by default without particularly complex logic to deal with them.

Small Codebase Invites Contributors


Spark’s design is fairly simple and the code that comes out of it is not huge compared to the features it offers.

The reasonably small codebase of Spark invites project contributors - programmers who extend the platform and fix bugs at a steady pace.

Further reading or watching


(video) Keynote: Spark 2.0 - Matei Zaharia, Apache Spark Creator and CTO of
Databricks

ShuffleClient — Contract to Fetch Shuffle Blocks
ShuffleClient is the contract of clients that can fetch shuffle block files.

ShuffleClient can optionally be initialized with an appId (which actually does nothing by default).

ShuffleClient has shuffle-related Spark metrics that are used when BlockManager is

requested for a shuffle-related Spark metrics source (only when Executor is created for a
non-local / cluster mode).

package org.apache.spark.network.shuffle;

abstract class ShuffleClient implements Closeable {
  // only required methods that have no implementation
  // the others follow
  abstract void fetchBlocks(
      String host,
      int port,
      String execId,
      String[] blockIds,
      BlockFetchingListener listener,
      TempFileManager tempFileManager);
}

Table 1. (Subset of) ShuffleClient Contract
fetchBlocks: Fetches a sequence of blocks from a remote block manager node asynchronously. Used exclusively when ShuffleBlockFetcherIterator is requested to sendRequest.

Table 2. ShuffleClients
ShuffleClient Description

BlockTransferService

ExternalShuffleClient

init Method


void init(String appId)

init does nothing by default.

Note init is used when:
BlockManager is requested to initialize
Spark on Mesos' MesosCoarseGrainedSchedulerBackend is requested to registered

Requesting Shuffle-Related Spark Metrics — shuffleMetrics Method

MetricSet shuffleMetrics()

shuffleMetrics returns an empty Dropwizard Metrics' MetricSet by default.

Note shuffleMetrics is used exclusively when BlockManager is requested for a shuffle-related Spark metrics source (only when Executor is created for a non-local / cluster mode).

BlockTransferService — Pluggable Block Transfers (To Fetch and Upload Blocks)
BlockTransferService is the base for ShuffleClients that can fetch and upload blocks of data

synchronously or asynchronously.

package org.apache.spark.network

abstract class BlockTransferService extends ShuffleClient {
  // only required methods that have no implementation
  // the others follow
  def init(blockDataManager: BlockDataManager): Unit
  def close(): Unit
  def port: Int
  def hostName: String
  def fetchBlocks(
    host: String,
    port: Int,
    execId: String,
    blockIds: Array[String],
    listener: BlockFetchingListener,
    tempFileManager: TempFileManager): Unit
  def uploadBlock(
    hostname: String,
    port: Int,
    execId: String,
    blockId: BlockId,
    blockData: ManagedBuffer,
    level: StorageLevel,
    classTag: ClassTag[_]): Future[Unit]
}

Note BlockTransferService is a private[spark] contract.


Table 1. (Subset of) BlockTransferService Contract
init: Used when…​FIXME
close: Used when…​FIXME
port: Used when…​FIXME
hostName: Used when…​FIXME
fetchBlocks: Fetches a sequence of blocks from a remote node asynchronously. Used exclusively when BlockTransferService is requested to fetch only one block (in a blocking fashion). Note fetchBlocks is part of ShuffleClient Contract to…​FIXME.
uploadBlock: Used exclusively when BlockTransferService is requested to upload a single block to a remote node (in a blocking fashion).

Note NettyBlockTransferService is the one and only known implementation of the BlockTransferService Contract.

Note BlockTransferService was introduced in SPARK-3019 Pluggable block transfer interface (BlockTransferService) and is available since Spark 1.2.0.

fetchBlockSync Method

fetchBlockSync(
host: String,
port: Int,
execId: String,
blockId: String,
tempFileManager: TempFileManager): ManagedBuffer

fetchBlockSync …​FIXME

Synchronous (and hence blocking) fetchBlockSync fetches one block blockId (and corresponds to the ShuffleClient parent’s asynchronous fetchBlocks).

fetchBlockSync is a mere wrapper around fetchBlocks to fetch one blockId block that waits until the fetch finishes.


Note fetchBlockSync is used when…​FIXME
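
A minimal, self-contained sketch (not Spark’s actual code) of that blocking-wrapper idea: install a listener that signals completion of the asynchronous fetch and wait for it. FetchListener, fetchAsync and fetchSync below are hypothetical stand-ins for BlockFetchingListener, fetchBlocks and fetchBlockSync.

import java.util.concurrent.CountDownLatch

// hypothetical stand-in for BlockFetchingListener
trait FetchListener {
  def onSuccess(blockId: String, data: Array[Byte]): Unit
  def onFailure(blockId: String, cause: Throwable): Unit
}

// hypothetical asynchronous fetcher in the role of fetchBlocks (simulated with a thread)
def fetchAsync(blockId: String, listener: FetchListener): Unit =
  new Thread(new Runnable {
    def run(): Unit = listener.onSuccess(blockId, "some data".getBytes)
  }).start()

// the blocking wrapper in the role of fetchBlockSync
def fetchSync(blockId: String): Array[Byte] = {
  var result: Either[Throwable, Array[Byte]] = null
  val done = new CountDownLatch(1)
  fetchAsync(blockId, new FetchListener {
    def onSuccess(id: String, data: Array[Byte]): Unit = { result = Right(data); done.countDown() }
    def onFailure(id: String, cause: Throwable): Unit = { result = Left(cause); done.countDown() }
  })
  done.await()                        // wait until the asynchronous fetch finishes
  result.fold(e => throw e, identity)
}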

Uploading Single Block to Remote Node (Blocking Fashion) — uploadBlockSync Method

uploadBlockSync(
hostname: String,
port: Int,
execId: String,
blockId: BlockId,
blockData: ManagedBuffer,
level: StorageLevel,
classTag: ClassTag[_]): Unit

uploadBlockSync …​FIXME

uploadBlockSync is a mere blocking wrapper around uploadBlock that waits until the upload

finishes.

Note uploadBlockSync is used exclusively when BlockManager is requested to replicate (when a replication level is greater than 1).


ExternalShuffleClient
ExternalShuffleClient is a ShuffleClient that…​FIXME

Register Block Manager with Shuffle Server — registerWithShuffleServer Method

void registerWithShuffleServer(
String host,
int port,
String execId,
ExecutorShuffleInfo executorInfo) throws IOException, InterruptedException

registerWithShuffleServer …​FIXME

Note registerWithShuffleServer is used when…​FIXME

fetchBlocks Method

void fetchBlocks(
String host,
int port,
String execId,
String[] blockIds,
BlockFetchingListener listener,
TempFileManager tempFileManager)

Note fetchBlocks is part of ShuffleClient Contract to…​FIXME.

fetchBlocks …​FIXME

NettyBlockTransferService — Netty-Based BlockTransferService
NettyBlockTransferService is a BlockTransferService that uses Netty for uploading or

fetching blocks of data.

NettyBlockTransferService is created when SparkEnv is created for the driver and

executors (to create the BlockManager).

Figure 1. Creating NettyBlockTransferService for BlockManager


BlockManager uses NettyBlockTransferService for the following:
FIXME (should it be here or in BlockManager?)
ShuffleClient (when spark.shuffle.service.enabled configuration property is off) for…​FIXME

NettyBlockTransferService simply requests the TransportServer for the port.

Tip Enable INFO or TRACE logging level for org.apache.spark.network.netty.NettyBlockTransferService logger to see what happens inside.
Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.network.netty.NettyBlockTransferService=TRACE

Refer to Logging.

fetchBlocks Method


fetchBlocks(
host: String,
port: Int,
execId: String,
blockIds: Array[String],
listener: BlockFetchingListener): Unit

Note fetchBlocks is part of BlockTransferService Contract to…​FIXME.

When executed, fetchBlocks prints out the following TRACE message in the logs:

TRACE Fetch blocks from [host]:[port] (executor id [execId])

fetchBlocks then creates a RetryingBlockFetcher.BlockFetchStarter where

createAndStart method…​FIXME

Depending on the maximum number of acceptable IO exceptions (such as connection timeouts) per request, if the number is greater than 0 , fetchBlocks creates a RetryingBlockFetcher and starts it immediately.

Note RetryingBlockFetcher is created with the RetryingBlockFetcher.BlockFetchStarter created earlier, the input blockIds and listener .

If however the number of retries is not greater than 0 (it could be 0 or less), the
RetryingBlockFetcher.BlockFetchStarter created earlier is started (with the input blockIds

and listener ).

In case of any Exception , you should see the following ERROR message in the logs and
the input BlockFetchingListener gets notified (using onBlockFetchFailure for every block
id).

ERROR Exception while beginning fetchBlocks

Application Id —  appId Property

Caution FIXME

Closing NettyBlockTransferService —  close Method

close(): Unit


Note close is part of the BlockTransferService Contract.

close …​FIXME

Initializing NettyBlockTransferService —  init Method

init(blockDataManager: BlockDataManager): Unit

Note init is part of the BlockTransferService Contract.

init starts a server for…​FIXME

Internally, init creates a NettyBlockRpcServer (using the application id, a JavaSerializer and the input blockDataManager ).

Caution FIXME Describe security when authEnabled is enabled.

init creates a TransportContext with the NettyBlockRpcServer created earlier.

Caution FIXME Describe transportConf and TransportContext .

init creates the internal clientFactory and a server.

Caution FIXME What’s the "a server"?

In the end, you should see the INFO message in the logs:

INFO NettyBlockTransferService: Server created on [hostName]:[port]

Note hostname is given when NettyBlockTransferService is created and is controlled by spark.driver.host Spark property for the driver and differs per deployment environment for executors (as controlled by --hostname for CoarseGrainedExecutorBackend ).

Uploading Block —  uploadBlock Method

uploadBlock(
hostname: String,
port: Int,
execId: String,
blockId: BlockId,
blockData: ManagedBuffer,
level: StorageLevel,
classTag: ClassTag[_]): Future[Unit]


Note uploadBlock is part of the BlockTransferService Contract.

Internally, uploadBlock creates a TransportClient client to send a UploadBlock message (to the input hostname and port ).

Note UploadBlock message is processed by NettyBlockRpcServer.

The UploadBlock message holds the application id, the input execId and blockId . It also
holds the serialized bytes for block metadata with level and classTag serialized (using
the internal JavaSerializer ) as well as the serialized bytes for the input blockData itself
(this time however the serialization uses ManagedBuffer.nioByteBuffer method).

The entire UploadBlock message is further serialized before sending (using TransportClient.sendRpc ).

Caution FIXME Describe TransportClient and clientFactory.createClient .

When blockId block was successfully uploaded, you should see the following TRACE
message in the logs:

TRACE NettyBlockTransferService: Successfully uploaded block [blockId]

When an upload failed, you should see the following ERROR message in the logs:

ERROR Error while uploading block [blockId]

UploadBlock Message
UploadBlock is a BlockTransferMessage that describes a block being uploaded, i.e. send

over the wire from a NettyBlockTransferService to a NettyBlockRpcServer.

Table 1. UploadBlock Attributes


Attribute Description
appId The application id (the block belongs to)

execId The executor id

blockId The block id

metadata

blockData The block data as an array of bytes


As an Encodable , UploadBlock can calculate the encoded size and do encoding and
decoding itself to or from a ByteBuf , respectively.

createServer Internal Method

createServer(bootstraps: List[TransportServerBootstrap]): TransportServer

createServer …​FIXME

Note createServer is used exclusively when NettyBlockTransferService is requested to init.

Creating NettyBlockTransferService Instance


NettyBlockTransferService takes the following when created:

SparkConf

SecurityManager

Bind address to bind to

Host name to bind to

Port number

Number of CPU cores

NettyBlockTransferService initializes the internal registries and counters.

NettyBlockRpcServer — NettyBlockTransferService’s RpcHandler
NettyBlockRpcServer is a RpcHandler that handles messages for

NettyBlockTransferService.

NettyBlockRpcServer is created when…​FIXME

NettyBlockRpcServer uses a OneForOneStreamManager for…​FIXME

Table 1. NettyBlockRpcServer Messages
OpenBlocks: Obtaining local blocks and registering them with the internal OneForOneStreamManager
UploadBlock: Deserializes a block and stores it in BlockDataManager

Tip Enable TRACE logging level to see received messages in the logs.

Tip Enable TRACE logging level for org.apache.spark.network.netty.NettyBlockRpcServer logger to see what happens inside.
Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.network.netty.NettyBlockRpcServer=TRACE

Refer to Logging.

Obtaining Local Blocks and Registering with Internal OneForOneStreamManager — OpenBlocks Message Handler
When OpenBlocks arrives, NettyBlockRpcServer requests block data (from
BlockDataManager ) for every block id in the message. The block data is a collection of

ManagedBuffer for every block id in the incoming message.

Note BlockDataManager is given when NettyBlockRpcServer is created.

NettyBlockRpcServer then registers a stream of ManagedBuffer s (for the blocks) with the

internal StreamManager under streamId .


Note The internal StreamManager is OneForOneStreamManager and is created when NettyBlockRpcServer is created.

You should see the following TRACE message in the logs:

TRACE NettyBlockRpcServer: Registered streamId [streamId] with [size] buffers

In the end, NettyBlockRpcServer responds with a StreamHandle (with the streamId and the
number of blocks). The response is serialized as a ByteBuffer .

Deserializing Block and Storing in BlockDataManager — UploadBlock Message Handler
When UploadBlock arrives, NettyBlockRpcServer deserializes the metadata of the input
message to get the StorageLevel and ClassTag of the block being uploaded.

Note metadata is serialized before NettyBlockTransferService sends a UploadBlock message (using the internal JavaSerializer ) that is given as serializer when NettyBlockRpcServer is created.

NettyBlockRpcServer creates a BlockId for the block id and requests the

BlockDataManager to store the block.

Note The BlockDataManager is passed in when NettyBlockRpcServer is created.

In the end, NettyBlockRpcServer responds with a 0 -capacity ByteBuffer .

Note UploadBlock is sent when NettyBlockTransferService uploads a block.

Creating NettyBlockRpcServer Instance


NettyBlockRpcServer takes the following when created:

Application ID

Serializer

BlockDataManager

NettyBlockRpcServer initializes the internal registries and counters.

Receiving RPC Messages —  receive Method


receive(
client: TransportClient,
rpcMessage: ByteBuffer,
responseContext: RpcResponseCallback): Unit

Note receive is part of RpcHandler Contract to…​FIXME.

receive …​FIXME


BlockFetchingListener
BlockFetchingListener is the contract of EventListeners that want to be notified about

onBlockFetchSuccess and onBlockFetchFailure.

BlockFetchingListener is used when:

ShuffleClient, BlockTransferService, NettyBlockTransferService, and ExternalShuffleClient are requested to fetch a sequence of blocks

BlockFetchStarter is requested to createAndStart

RetryingBlockFetcher and OneForOneBlockFetcher are created

package org.apache.spark.network.shuffle;

interface BlockFetchingListener extends EventListener {
  void onBlockFetchSuccess(String blockId, ManagedBuffer data);
  void onBlockFetchFailure(String blockId, Throwable exception);
}

Table 1. BlockFetchingListener Contract


Method Description
onBlockFetchSuccess Used when…​FIXME

onBlockFetchFailure Used when…​FIXME

Table 2. BlockFetchingListeners
RetryingBlockFetchListener
"Unnamed" in ShuffleBlockFetcherIterator
"Unnamed" in BlockTransferService
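
As a minimal sketch, a custom BlockFetchingListener that only logs the outcome of every block fetch could look as follows (LoggingBlockFetchingListener is a hypothetical name; the example assumes the Spark network-shuffle classes are on the classpath):

import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.network.shuffle.BlockFetchingListener

class LoggingBlockFetchingListener extends BlockFetchingListener {
  override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit =
    println(s"Fetched block $blockId (${data.size} bytes)")

  override def onBlockFetchFailure(blockId: String, exception: Throwable): Unit =
    println(s"Failed to fetch block $blockId: ${exception.getMessage}")
}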


RetryingBlockFetcher
RetryingBlockFetcher is…​FIXME

RetryingBlockFetcher is created and immediately started when:

NettyBlockTransferService is requested to fetchBlocks (when maxIORetries is greater than 0 which it is by default)

ExternalShuffleClient is requested to fetchBlocks (when maxIORetries is greater than 0 which it is by default)

RetryingBlockFetcher uses a BlockFetchStarter to createAndStart when requested to start and later initiateRetry.

RetryingBlockFetcher uses outstandingBlocksIds internal registry of outstanding block IDs to fetch that is initially the block IDs to fetch when created.

At initiateRetry, RetryingBlockFetcher prints out the following INFO message to the logs
(with the number of outstandingBlocksIds):

Retrying fetch ([retryCount]/[maxRetries]) for [size] outstanding blocks after [retryWaitTime] ms

On onBlockFetchSuccess and onBlockFetchFailure, RetryingBlockFetchListener removes the block ID from outstandingBlocksIds.

RetryingBlockFetcher uses a RetryingBlockFetchListener to remove block IDs from the outstandingBlocksIds internal registry.
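
A simplified, self-contained sketch (not the actual RetryingBlockFetcher) of the retry idea described above: keep a registry of outstanding block IDs, remove an ID on success, and retry the whole outstanding set until the retry budget is exhausted. fetchWithRetries and fetchOne are hypothetical names introduced only for the illustration.

import scala.collection.mutable

def fetchWithRetries(
    blockIds: Seq[String],
    maxRetries: Int,
    fetchOne: String => Either[Throwable, Array[Byte]]): Map[String, Array[Byte]] = {
  // analogue of the outstandingBlocksIds internal registry
  val outstanding = mutable.LinkedHashSet(blockIds: _*)
  val fetched = mutable.Map.empty[String, Array[Byte]]
  var retryCount = 0
  var done = false

  while (!done) {
    outstanding.toList.foreach { id =>
      fetchOne(id) match {
        case Right(data) => fetched(id) = data; outstanding -= id  // success removes the block ID
        case Left(_)     =>                                        // keep the ID for a possible retry
      }
    }
    if (outstanding.isEmpty || retryCount == maxRetries) done = true
    else {
      retryCount += 1
      println(s"Retrying fetch ($retryCount/$maxRetries) for ${outstanding.size} outstanding blocks")
    }
  }
  fetched.toMap
}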

Creating RetryingBlockFetcher Instance


RetryingBlockFetcher takes the following when created:

TransportConf

BlockFetchStarter

Block IDs to fetch

BlockFetchingListener

Starting RetryingBlockFetcher —  start Method


void start()

start simply calls fetchAllOutstanding.

Note start is used when:
NettyBlockTransferService is requested to fetchBlocks (when maxIORetries is greater than 0 which it is by default)
ExternalShuffleClient is requested to fetchBlocks (when maxIORetries is greater than 0 which it is by default)

initiateRetry Internal Method

synchronized void initiateRetry()

initiateRetry …​FIXME

Note initiateRetry is used when:
RetryingBlockFetcher is requested to fetchAllOutstanding
RetryingBlockFetchListener is requested to onBlockFetchFailure

fetchAllOutstanding Internal Method

void fetchAllOutstanding()

fetchAllOutstanding requests BlockFetchStarter to createAndStart for the outstandingBlocksIds.

Note fetchAllOutstanding is used when RetryingBlockFetcher is requested to start and initiateRetry.

RetryingBlockFetchListener
RetryingBlockFetchListener is a BlockFetchingListener that RetryingBlockFetcher uses to remove block IDs from the outstandingBlocksIds internal registry.

onBlockFetchSuccess Method


void onBlockFetchSuccess(String blockId, ManagedBuffer data)

Note onBlockFetchSuccess is part of BlockFetchingListener Contract to…​FIXME.

onBlockFetchSuccess …​FIXME

onBlockFetchFailure Method

void onBlockFetchFailure(String blockId, Throwable exception)

Note onBlockFetchFailure is part of BlockFetchingListener Contract to…​FIXME.

onBlockFetchFailure …​FIXME


BlockFetchStarter
BlockFetchStarter is the contract of…​FIXME…​to createAndStart.

void createAndStart(String[] blockIds, BlockFetchingListener listener)
    throws IOException, InterruptedException;

createAndStart is used when:

ExternalShuffleClient is requested to fetchBlocks (when maxIORetries is 0 )

NettyBlockTransferService is requested to fetchBlocks (when maxIORetries is 0 )

RetryingBlockFetcher is requested to fetchAllOutstanding


Web UI — Spark Application’s Web Console


Web UI (aka Application UI or webUI or Spark UI) is the web interface of a Spark
application to monitor and inspect Spark job executions in a web browser.

Figure 1. Welcome page - Jobs page


Every time you create a SparkContext in a Spark application you also launch an instance of web UI. The web UI is available at http://[driverHostname]:4040 by default.

The default port can be changed using spark.ui.port configuration property.

Note SparkContext will increase the port if it is already taken until an open port is found.
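
A minimal sketch of setting a non-default web UI port programmatically (the application name, master and port below are just for the demo; you could equally pass --conf spark.ui.port=4080 on the command line):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("webui-demo")
  .setMaster("local[*]")
  .set("spark.ui.port", "4080")   // the web UI is then served at http://driverHostname:4080
val sc = new SparkContext(conf)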

web UI comes with the following tabs (which may not all be visible immediately, but only
after the respective modules are in use, e.g. the SQL or Streaming tabs):

1. Jobs

2. Stages

3. Storage

4. Environment

5. Executors

Tip You can use the web UI after the application has finished by persisting events (using EventLoggingListener) and using Spark History Server.


Note All the information that is displayed in web UI is available thanks to JobProgressListener and other SparkListeners. One could say that web UI is a web layer to Spark listeners.


Jobs Tab
Jobs tab in web UI shows status of all Spark jobs in a Spark application (i.e. a
SparkContext).

Figure 1. Jobs Tab in Web UI


The Jobs tab is available under /jobs URL, i.e. http://localhost:4040/jobs.

Figure 2. Event Timeline in Jobs Tab


The Jobs tab consists of two pages, i.e. All Jobs and Details for Job pages.

Internally, the Jobs tab is represented by JobsTab.

Details for Job —  JobPage Page


When you click a job in AllJobsPage, you see the Details for Job page.


Figure 3. Details for Job Page


JobPage is a WebUIPage that shows statistics and stage list for a given job.

Details for Job page is registered under /job URL, i.e. http://localhost:4040/jobs/job/?id=0 and accepts one mandatory id request parameter as a job identifier.

When a job id is not found, you should see "No information to display for job ID" message.

Figure 4. "No information to display for job" in Details for Job Page

JobPage displays the job’s status, group (if available), and the stages per state: active, pending, completed, skipped, and failed.

Note A job can be in a running, succeeded, failed or unknown state.


Figure 5. Details for Job Page with Active and Pending Stages


Figure 6. Details for Job Page with Four Stages


Stages Tab
Stages tab in web UI shows…​FIXME

Figure 1. Stages Tab in Web UI


The Stages tab is available under /stages URL, i.e. http://localhost:4040/stages.

Internally, the Stages tab is represented by StagesTab.


Storage Tab
Storage tab in web UI shows…​FIXME

Figure 1. Storage Tab in Web UI


The Storage tab is available under /storage URL, i.e. http://localhost:4040/storage.

Internally, the Storage tab is represented by StorageTab.


Environment Tab
Environment tab in web UI shows…​FIXME

Figure 1. Environment Tab in Web UI


The Environment tab is available under /environment URL, i.e. http://localhost:4040/environment.

Internally, the Environment tab is represented by EnvironmentTab.


Executors Tab
Executors tab in web UI shows…​FIXME

Figure 1. Executors Tab in web UI (local mode)


The Executors tab is available under /executors URL, i.e. http://localhost:4040/executors.

Internally, the Executors tab is represented by ExecutorsTab.

Note What’s interesting in how Storage Memory is displayed in the Executors tab is that the memory manager reports the default value in a way that is different from what the page displays (which uses a custom JavaScript formatBytes function):

// local mode with spark.driver.memory 2g
// ./bin/spark-shell --conf spark.driver.memory=2g
// UnifiedMemoryManager reports 912MB
// You can see it after enabling INFO messages for BlockManagerMasterEndpoint

INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.8:54503 with

// custom JavaScript `formatBytes` function (from utils.js) reports...956.6MB
// See https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark
def formatBytes(bytes: Double) = {
  val k = 1000
  val i = math.floor(math.log(bytes) / math.log(k))
  val maxMemoryWebUI = bytes / math.pow(k, i)
  f"$maxMemoryWebUI%1.1f"
}

scala> println(formatBytes(maxMemory))
956.6

getExecInfo Method

getExecInfo(
listener: ExecutorsListener,
statusId: Int,
isActive: Boolean): ExecutorSummary


getExecInfo creates a ExecutorSummary .

Caution FIXME

Note getExecInfo is used when…​FIXME

Settings

spark.ui.threadDumpsEnabled
spark.ui.threadDumpsEnabled (default: true ) enables ( true ) or disables ( false ) ExecutorThreadDumpPage.


JobsTab
JobsTab is a SparkUITab with jobs prefix.

JobsTab is created exclusively when SparkUI is initialized.

JobsTab takes the following when created:

Parent SparkUI

AppStatusStore

When created, JobsTab creates the following pages and attaches them immediately:

AllJobsPage

JobPage

Note The Jobs tab uses JobProgressListener to access statistics of job executions in a Spark application to display.

handleKillRequest Method

handleKillRequest(request: HttpServletRequest): Unit

handleKillRequest …​FIXME

Note handleKillRequest is used when…​FIXME


AllJobsPage — Showing All Jobs in Web UI


AllJobsPage is a WebUIPage with an empty prefix.

AllJobsPage is created exclusively when JobsTab is created.

AllJobsPage renders a summary, an event timeline, and active, completed, and failed jobs of a Spark application.

Tip Jobs (in any state) are displayed when their number is greater than 0 .

AllJobsPage displays the Summary section with the current Spark user, total uptime, scheduling mode, and the number of jobs per status.

Note AllJobsPage uses JobProgressListener for Scheduling Mode .

Figure 1. Summary Section in Jobs Tab


Under the summary section is the Event Timeline section.

Figure 2. Event Timeline in Jobs Tab


Note AllJobsPage uses ExecutorsListener to build the event timeline.


Active Jobs, Completed Jobs, and Failed Jobs sections follow.

Figure 3. Job Status Section in Jobs Tab


Jobs are clickable, i.e. you can click on a job to see information about the stages and tasks inside it.

When you hover over a job in the Event Timeline, not only do you see the job legend but the job is also highlighted in the Summary section.

Figure 4. Hovering Over Job in Event Timeline Highlights The Job in Status Section
The Event Timeline section shows not only jobs but also executors.


Figure 5. Executors in Event Timeline


Tip Use Programmable Dynamic Allocation (using SparkContext ) to manage executors for demo purposes.
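A minimal sketch of what that could look like in spark-shell, assuming a coarse-grained cluster manager (the executor id "1" is purely illustrative):

// ask the cluster manager for two additional executors
sc.requestExecutors(2)

// and later give one back (executor ids are visible in the Executors tab)
sc.killExecutors(Seq("1"))

Both calls return a Boolean that says whether the cluster manager acknowledged the request.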

Creating AllJobsPage Instance


AllJobsPage takes the following when created:

Parent JobsTab

AppStatusStore


JobPage
JobPage is a WebUIPage with job prefix.

JobPage is created exclusively when JobsTab is created.

Creating JobPage Instance


JobPage takes the following when created:

Parent JobsTab

AppStatusStore


StagesTab — Stages for All Jobs


StagesTab is a SparkUITab with stages prefix.

StagesTab is created exclusively when SparkUI is initialized.

When created, StagesTab creates the following pages and attaches them immediately:

AllStagesPage

StagePage

PoolPage

Stages tab in web UI shows the current state of all stages of all jobs in a Spark application
(i.e. a SparkContext) with two optional pages for the tasks and statistics for a stage (when a
stage is selected) and pool details (when the application works in FAIR scheduling mode).

The title of the tab is Stages for All Jobs.

You can access the Stages tab under /stages URL, i.e. http://localhost:4040/stages.

With no jobs submitted yet (and hence no stages to display), the page shows nothing but the
title.

Figure 1. Stages Page Empty


The Stages page shows the stages in a Spark application per state in their respective
sections — Active Stages, Pending Stages, Completed Stages, and Failed Stages.

Figure 2. Stages Page With One Stage Completed


Note The state sections are only displayed when there are stages in a given state. Refer to Stages for All Jobs.

In FAIR scheduling mode you have access to the table showing the scheduler pools.

Figure 3. Fair Scheduler Pools Table


Internally, the page is represented by org.apache.spark.ui.jobs.StagesTab class.

The page uses the parent’s SparkUI to access required services, i.e. SparkContext,
SparkConf, JobProgressListener, RDDOperationGraphListener, and to know whether kill is
enabled or not.

StagesTab is created when…​FIXME

killEnabled flag

Caution FIXME

Creating StagesTab Instance


StagesTab takes the following when created:

SparkUI

AppStatusStore

Handling Request to Kill Stage (from web UI) —  handleKillRequest Method

handleKillRequest(request: HttpServletRequest): Unit

handleKillRequest …​FIXME

Note handleKillRequest is used when…​FIXME


Stages for All Jobs Page


AllStagesPage is a web page (section) that is registered with the Stages tab and displays all stages in a Spark application: active, pending, completed, and failed stages with their count.

Figure 1. Stages Tab in web UI for FAIR scheduling mode (with pools only)
In FAIR scheduling mode you have access to the table showing the scheduler pools as well
as the pool names per stage.

Note Pool names are calculated using SparkContext.getAllPools.

Internally, AllStagesPage is a WebUIPage with access to the parent Stages tab and more
importantly the JobProgressListener to have access to current state of the entire Spark
application.

Rendering AllStagesPage (render method)

render(request: HttpServletRequest): Seq[Node]

render generates an HTML page to display in a web browser.

It uses the parent’s JobProgressListener to know about:

active stages (as activeStages )

pending stages (as pendingStages )

completed stages (as completedStages )

failed stages (as failedStages )

the number of completed stages (as numCompletedStages )

the number of failed stages (as numFailedStages )

Note Stage information is available as StageInfo object.


There are 4 different tables for the different states of stages - active, pending, completed,
and failed. They are displayed only when there are stages in a given state.

Figure 2. Stages Tab in web UI for FAIR scheduling mode (with pools and stages)
You could also notice "retry" for a stage when it was retried.

Caution FIXME A screenshot


StagePage — Stage Details
StagePage is a WebUIPage with stage prefix.

StagePage is created exclusively when StagesTab is created.

StagePage shows the task details for a stage given its id and attempt id.

Figure 1. Details for Stage


StagePage renders a page available under /stage URL that requires two request parameters —  id and attempt , e.g. http://localhost:4040/stages/stage/?id=2&attempt=0.

StagePage uses the parent’s JobProgressListener and RDDOperationGraphListener to calculate the metrics. More specifically, StagePage uses JobProgressListener 's stageIdToData registry to access the stage for the given stage id and attempt .

StagePage uses ExecutorsListener to display stdout and stderr logs of the executors in Tasks section.

Tasks Section


Figure 2. Tasks Section


Tasks paged table displays StageUIData that JobProgressListener collected for a stage and
stage attempt.

Note The section uses ExecutorsListener to access stdout and stderr logs for the Executor ID / Host column.

Summary Metrics for Completed Tasks in Stage


The summary metrics table shows the metrics for the tasks in a given stage that have
already finished with SUCCESS status and metrics available.

The table consists of the following columns: Metric, Min, 25th percentile, Median, 75th
percentile, Max.

Figure 3. Summary Metrics for Completed Tasks in Stage


Note All the quantiles are doubles using TaskUIData.metrics (sorted in ascending order).


The 1st row is Duration which includes the quantiles based on executorRunTime .

The 2nd row is the optional Scheduler Delay which includes the time to ship the task from
the scheduler to executors, and the time to send the task result from the executors to the
scheduler. It is not enabled by default and you should select Scheduler Delay checkbox
under Show Additional Metrics to include it in the summary table.

Tip If Scheduler Delay is large, consider decreasing the size of tasks or decreasing the size of task results.

The 3rd row is the optional Task Deserialization Time which includes the quantiles based
on executorDeserializeTime task metric. It is not enabled by default and you should select
Task Deserialization Time checkbox under Show Additional Metrics to include it in the
summary table.

The 4th row is GC Time which is the time that an executor spent paused for Java garbage
collection while the task was running (using jvmGCTime task metric).

The 5th row is the optional Result Serialization Time which is the time spent serializing the
task result on an executor before sending it back to the driver (using
resultSerializationTime task metric). It is not enabled by default and you should select

Result Serialization Time checkbox under Show Additional Metrics to include it in the
summary table.

The 6th row is the optional Getting Result Time which is the time that the driver spends
fetching task results from workers. It is not enabled by default and you should select Getting
Result Time checkbox under Show Additional Metrics to include it in the summary table.

Tip If Getting Result Time is large, consider decreasing the amount of data returned from each task.

If Tungsten is enabled (it is by default), the 7th row is the optional Peak Execution Memory
which is the sum of the peak sizes of the internal data structures created during shuffles,
aggregations and joins (using peakExecutionMemory task metric). For SQL jobs, this only
tracks all unsafe operators, broadcast joins, and external sort. It is not enabled by default
and you should select Peak Execution Memory checkbox under Show Additional Metrics
to include it in the summary table.

If the stage has an input, the 8th row is Input Size / Records which is the bytes and records
read from Hadoop or from a Spark storage (using inputMetrics.bytesRead and
inputMetrics.recordsRead task metrics).

If the stage has an output, the 9th row is Output Size / Records which is the bytes and
records written to Hadoop or to a Spark storage (using outputMetrics.bytesWritten and
outputMetrics.recordsWritten task metrics).


If the stage has shuffle read there will be three more rows in the table. The first row is
Shuffle Read Blocked Time which is the time that tasks spent blocked waiting for shuffle
data to be read from remote machines (using shuffleReadMetrics.fetchWaitTime task
metric). The other row is Shuffle Read Size / Records which is the total shuffle bytes and
records read (including both data read locally and data read from remote executors using
shuffleReadMetrics.totalBytesRead and shuffleReadMetrics.recordsRead task metrics). And

the last row is Shuffle Remote Reads which is the total shuffle bytes read from remote
executors (which is a subset of the shuffle read bytes; the remaining shuffle data is read
locally). It uses shuffleReadMetrics.remoteBytesRead task metric.

If the stage has shuffle write, the following row is Shuffle Write Size / Records (using
shuffleWriteMetrics.bytesWritten and shuffleWriteMetrics.recordsWritten task metrics).

If the stage has bytes spilled, the following two rows are Shuffle spill (memory) (using
memoryBytesSpilled task metric) and Shuffle spill (disk) (using diskBytesSpilled task

metric).

Request Parameters
id is…​

attempt is…​

Note id and attempt uniquely identify the stage in JobProgressListener.stageIdToData to retrieve StageUIData .

task.page (default: 1 ) is…​

task.sort (default: Index )

task.desc (default: false )

task.pageSize (default: 100 )

task.prevPageSize (default: task.pageSize )

Metrics
Scheduler Delay is…​FIXME

Task Deserialization Time is…​FIXME

Result Serialization Time is…​FIXME

Getting Result Time is…​FIXME

Peak Execution Memory is…​FIXME


Shuffle Read Time is…​FIXME

Executor Computing Time is…​FIXME

Shuffle Write Time is…​FIXME

Figure 4. DAG Visualization


Figure 5. Event Timeline

Figure 6. Stage Task and Shuffle Stats

Aggregated Metrics by Executor


ExecutorTable table shows the following columns:

Executor ID

Address

Task Time

Total Tasks

Failed Tasks

Killed Tasks

Succeeded Tasks

(optional) Input Size / Records (only when the stage has an input)

(optional) Output Size / Records (only when the stage has an output)

(optional) Shuffle Read Size / Records (only when the stage read bytes for a shuffle)

(optional) Shuffle Write Size / Records (only when the stage wrote bytes for a shuffle)


(optional) Shuffle Spill (Memory) (only when the stage spilled memory bytes)

(optional) Shuffle Spill (Disk) (only when the stage spilled bytes to disk)

Figure 7. Aggregated Metrics by Executor


It gets executorSummary from StageUIData (for the stage and stage attempt id) and creates
rows per executor.

It also requests BlockManagers (from JobProgressListener) to map executor ids to a pair of host and port to display in Address column.

Accumulators
Stage page displays the table with named accumulators (only if they exist). It contains the
name and value of the accumulators.

Figure 8. Accumulators Section


Note The information with name and value is stored in AccumulableInfo (that is available in StageUIData).
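A named accumulator created in user code is what shows up in this section; a minimal sketch (the accumulator name is made up):

// named accumulators appear in the Accumulators section of the stage page
val goodRecords = sc.longAccumulator("good records")
sc.parallelize(1 to 100).foreach { n =>
  if (n % 2 == 0) goodRecords.add(1)
}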

Creating StagePage Instance


StagePage takes the following when created:

Parent StagesTab

AppStatusStore


PoolPage — Fair Scheduler Pool Details Page


PoolPage is a WebUIPage with pool prefix.

The Fair Scheduler Pool Details page shows information about a Schedulable pool and is
only available when a Spark application uses the FAIR scheduling mode (which is controlled
by spark.scheduler.mode setting).

Figure 1. Details Page for production Pool


PoolPage renders a page under /pool URL and requires one request parameter poolname that is the name of the pool to display, e.g. http://localhost:4040/stages/pool/?poolname=production. It is made up of two tables: Summary (with the details of the pool) and Active Stages (with the active stages in the pool).

PoolPage is created exclusively when StagesTab is created.

PoolPage takes a StagesTab when created.

PoolPage uses the parent’s SparkContext to access information about the pool and JobProgressListener for active stages in the pool (sorted by submissionTime in descending order by default).
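To see this page at all you need a FAIR-scheduled application and at least one job submitted to a named pool. The following hypothetical spark-shell session uses the pool name production from the screenshots:

// ./bin/spark-shell --conf spark.scheduler.mode=FAIR
sc.setLocalProperty("spark.scheduler.pool", "production")
sc.parallelize(1 to 1000000).count()   // its stages run in the "production" pool
// http://localhost:4040/stages/pool/?poolname=production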

Summary Table
The Summary table shows the details of a Schedulable pool.

Figure 2. Summary for production Pool


It uses the following columns:

Pool Name


Minimum Share

Pool Weight

Active Stages - the number of the active stages in a Schedulable pool.

Running Tasks

SchedulingMode

All the columns are attributes of a Schedulable except the number of active stages, which is calculated using the list of active stages of a pool (from the parent’s JobProgressListener).

Active Stages Table


The Active Stages table shows the active stages in a pool.

Figure 3. Active Stages for production Pool


It uses the following columns:

Stage Id

(optional) Pool Name - only available when in FAIR scheduling mode.

Description

Submitted

Duration

Tasks: Succeeded/Total

Input — Bytes and records read from Hadoop or from Spark storage.

Output — Bytes and records written to Hadoop.

Shuffle Read — Total shuffle bytes and records read (includes both data read locally
and data read from remote executors).

Shuffle Write — Bytes and records written to disk in order to be read by a shuffle in a future stage.

The table uses JobProgressListener for information per stage in the pool.

Request Parameters


poolname
poolname is the name of the scheduler pool to display on the page. It is a mandatory request parameter.


StorageTab
StorageTab is a SparkUITab with storage prefix.

StorageTab is created exclusively when SparkUI is initialized.

StorageTab takes the following when created:

Parent SparkUI

AppStatusStore

When created, StorageTab creates the following pages and attaches them immediately:

StoragePage

RDDPage


StoragePage
StoragePage is a WebUIPage with an empty prefix.

StoragePage is created exclusively when StorageTab is created.

StoragePage takes the following when created:

Parent SparkUITab

AppStatusStore

Rendering HTML Table Row for RDD Details —  rddRow Internal Method

rddRow(rdd: v1.RDDStorageInfo): Seq[Node]

rddRow …​FIXME

Note rddRow is used when…​FIXME

Rendering HTML Table with RDD Details —  rddTable Method

rddTable(rdds: Seq[v1.RDDStorageInfo]): Seq[Node]

rddTable …​FIXME

Note rddTable is used when…​FIXME

receiverBlockTables Method

receiverBlockTables(blocks: Seq[StreamBlockData]): Seq[Node]

receiverBlockTables …​FIXME

Note receiverBlockTables is used when…​FIXME

Rendering Page —  render Method


render(request: HttpServletRequest): Seq[Node]

Note render is part of WebUIPage Contract to…​FIXME.

render requests the AppStatusStore for rddList and renders an HTML table with their details (if available).

render requests the AppStatusStore for streamBlocksList and renders an HTML table with receiver blocks (if available).

In the end, render requests UIUtils to headerSparkPage (with Storage title).


RDDPage
RDDPage is a WebUIPage with rdd prefix.

RDDPage is created exclusively when StorageTab is created.

RDDPage takes the following when created:

Parent SparkUITab

AppStatusStore

render Method

render(request: HttpServletRequest): Seq[Node]

Note render is part of WebUIPage Contract to…​FIXME.

render …​FIXME


EnvironmentTab
EnvironmentTab is a SparkUITab with environment prefix.

EnvironmentTab is created exclusively when SparkUI is initialized.

EnvironmentTab takes the following when created:

Parent SparkUI

AppStatusStore

When created, EnvironmentTab creates the EnvironmentPage page and attaches it immediately.


EnvironmentPage
EnvironmentPage is a WebUIPage with an empty prefix.

EnvironmentPage is created exclusively when EnvironmentTab is created.

Creating EnvironmentPage Instance


EnvironmentPage takes the following when created:

Parent EnvironmentTab

SparkConf

AppStatusStore


ExecutorsTab
ExecutorsTab is a SparkUITab with executors prefix.

ExecutorsTab is created exclusively when SparkUI is initialized.

ExecutorsTab takes the parent SparkUI when created.

When created, ExecutorsTab creates the following pages and attaches them immediately:

ExecutorsPage

ExecutorThreadDumpPage

ExecutorsTab uses ExecutorsListener to collect information about executors in a Spark application.


ExecutorsPage
ExecutorsPage is a WebUIPage with an empty prefix.

ExecutorsPage is created exclusively when ExecutorsTab is created.

Creating ExecutorsPage Instance


ExecutorsPage takes the following when created:

Parent SparkUITab

threadDumpEnabled flag


ExecutorThreadDumpPage
ExecutorThreadDumpPage is a WebUIPage with threadDump prefix.

ExecutorThreadDumpPage is created exclusively when ExecutorsTab is created (with spark.ui.threadDumpsEnabled configuration property enabled).

Note spark.ui.threadDumpsEnabled configuration property is enabled (i.e. true ) by default.

Creating ExecutorThreadDumpPage Instance


ExecutorThreadDumpPage takes the following when created:

SparkUITab

Optional SparkContext


SparkUI — Web UI of Spark Application


SparkUI is the web UI of a Spark application (aka Application UI).

SparkUI is created along with the following:

SparkContext (for a live Spark application with spark.ui.enabled configuration property enabled)

FsHistoryProvider is requested for the application UI (for a live or completed Spark application)

Figure 1. Creating SparkUI for Live Spark Application


When created (while SparkContext is created for a live Spark application), SparkUI gets
the following:

Live AppStatusStore (with an ElementTrackingStore using an InMemoryStore and a live AppStatusListener)

Name of the Spark application that is exactly the value of spark.app.name configuration
property

Empty base path

When started, SparkUI binds to appUIAddress address that you can control using
SPARK_PUBLIC_DNS environment variable or spark.driver.host Spark property.

Note With spark.ui.killEnabled configuration property turned on, SparkUI allows you to kill jobs and stages (subject to SecurityManager.checkModifyPermissions permissions).
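If you want the kill links gone (e.g. on a shared UI), a hypothetical way to disable them:

// ./bin/spark-shell --conf spark.ui.killEnabled=false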

SparkUI gets an AppStatusStore that is then used for the following:

Initializing tabs, i.e. JobsTab, StagesTab, StorageTab, EnvironmentTab

AbstractApplicationResource is requested for jobsList, oneJob, executorList,

allExecutorList, rddList, rddData, environmentInfo


StagesResource is requested for stageList, stageData, oneAttemptData, taskSummary,

taskList

SparkUI is requested for the current Spark user

Creating Spark SQL’s SQLTab (when SQLHistoryServerPlugin is requested to setupUI )

Spark Streaming’s BatchPage is created

Table 1. SparkUI’s Internal Properties (e.g. Registries, Counters and Flags)


Name Description
appId

Tip Enable INFO logging level for org.apache.spark.ui.SparkUI logger to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.ui.SparkUI=INFO

Refer to Logging.

Assigning Unique Identifier of Spark Application —  setAppId Method

setAppId(id: String): Unit

setAppId sets the internal appId.

Note setAppId is used exclusively when SparkContext is initialized.

Stopping SparkUI  —  stop Method

stop(): Unit

stop stops the HTTP server and prints the following INFO message to the logs:

INFO SparkUI: Stopped Spark web UI at [appUIAddress]


Note appUIAddress in the above INFO message is the result of appUIAddress method.

appUIAddress Method

appUIAddress: String

appUIAddress returns the entire URL of a Spark application’s web UI, including the http:// scheme.

Internally, appUIAddress uses appUIHostPort.
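For a live application the same address is exposed through SparkContext, so a quick check in spark-shell could look like this (the host and port in the output are illustrative):

scala> sc.uiWebUrl
res0: Option[String] = Some(http://192.168.1.8:4040)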

Accessing Spark User —  getSparkUser Method

getSparkUser: String

getSparkUser returns the name of the user a Spark application runs as.

Internally, getSparkUser requests user.name System property from EnvironmentListener Spark listener.

Note getSparkUser is used…​FIXME

createLiveUI Method

createLiveUI(
sc: SparkContext,
conf: SparkConf,
listenerBus: SparkListenerBus,
jobProgressListener: JobProgressListener,
securityManager: SecurityManager,
appName: String,
startTime: Long): SparkUI

createLiveUI creates a SparkUI for a live running Spark application.

Internally, createLiveUI simply forwards the call to create.

Note createLiveUI is called when SparkContext is created (and spark.ui.enabled is enabled).


createHistoryUI Method

Caution FIXME

appUIHostPort Method

appUIHostPort: String

appUIHostPort returns the Spark application’s web UI which is the public hostname and port, excluding the scheme.

Note appUIAddress uses appUIHostPort and adds the http:// scheme.

getAppName Method

getAppName: String

getAppName returns the name of the Spark application (of a SparkUI instance).

Note getAppName is used when…​FIXME

Creating SparkUI Instance —  create Factory Method

create(
sc: Option[SparkContext],
store: AppStatusStore,
conf: SparkConf,
securityManager: SecurityManager,
appName: String,
basePath: String = "",
startTime: Long,
appSparkVersion: String = org.apache.spark.SPARK_VERSION): SparkUI

create creates a SparkUI backed by an AppStatusStore.

Internally, create simply creates a new SparkUI (with the predefined Spark version).


Note create is used when:

SparkContext is created (for a running Spark application)

FsHistoryProvider is requested to getAppUI (for a Spark application that already finished)

Creating SparkUI Instance


SparkUI takes the following when created:

AppStatusStore

SparkContext

SparkConf

SecurityManager

Application name

basePath

Start time

appSparkVersion

SparkUI initializes the internal registries and counters and the tabs and handlers.

Attaching Tabs and Context Handlers —  initialize Method

initialize(): Unit

Note initialize is part of WebUI Contract to initialize web components.

initialize creates and attaches the following tabs (with the reference to the SparkUI and its AppStatusStore):

1. JobsTab

2. StagesTab

3. StorageTab

4. EnvironmentTab


5. ExecutorsTab

In the end, initialize creates and attaches the following ServletContextHandlers :

1. Creates a static handler for serving files from a static directory, i.e. /static to serve
static files from org/apache/spark/ui/static directory (on CLASSPATH)

2. Creates a redirect handler to redirect / to /jobs/ (and so the Jobs tab is the
welcome tab when you open the web UI)

3. Creates the /api/* context handler for the Status REST API

4. Creates a redirect handler to redirect /jobs/job/kill to /jobs/ and request the JobsTab to execute handleKillRequest before redirection

5. Creates a redirect handler to redirect /stages/stage/kill to /stages/ and request the StagesTab to execute handleKillRequest before redirection


SparkUITab
SparkUITab is the contract of WebUITab extensions with two additional properties:

appName

appSparkVersion

package org.apache.spark.ui

abstract class SparkUITab(parent: SparkUI, prefix: String)
  extends WebUITab(parent, prefix) {
  def appName: String
  def appSparkVersion: String
}

Note SparkUITab is a private[spark] contract.

Table 1. SparkUITab Contract


Method Description
appName Used when…​FIXME

appSparkVersion Used when…​FIXME

Table 2. SparkUITabs
SparkUITab Description
EnvironmentTab

ExecutorsTab

JobsTab

StagesTab

StorageTab

SQLTab Used in Spark SQL module

StreamingTab Used in Spark Streaming module

ThriftServerTab Used in Spark Thrift Server


BlockStatusListener Spark Listener


BlockStatusListener is a SparkListener that tracks BlockManagers and the blocks for Storage tab in web UI.

Table 1. BlockStatusListener Registries


Registry Description

blockManagers
The lookup table for a collection of BlockId and
BlockUIData per BlockManagerId.

Caution FIXME When are the events posted?

Table 2. BlockStatusListener Event Handlers


Event Handler Description

onBlockManagerAdded
Registers a BlockManager in blockManagers internal
registry (with no blocks).

onBlockManagerRemoved
Removes a BlockManager from blockManagers internal
registry.

onBlockUpdated
Puts an updated BlockUIData for BlockId for BlockManagerId in blockManagers internal registry.
Ignores updates for unregistered BlockManager s or non- StreamBlockId s.
For invalid StorageLevels (i.e. they do not use a memory or a disk or no replication) the block is removed.


EnvironmentListener Spark Listener

Caution FIXME


ExecutorsListener Spark Listener


ExecutorsListener is a SparkListener that tracks executors and their tasks in a Spark application for Stage Details page, Jobs tab and /allexecutors REST endpoint.

Table 1. ExecutorsListener’s SparkListener Callbacks (in alphabetical order)


Event Handler Description
onApplicationStart May create an entry for the driver in executorToTaskSummary registry.

onExecutorAdded May create an entry in executorToTaskSummary registry. It also makes sure that the number of entries for dead executors does not exceed spark.ui.retainedDeadExecutors and removes excess. Adds an entry to executorEvents registry and optionally removes the oldest if the number of entries exceeds spark.ui.timeline.executors.maximum.

onExecutorBlacklisted FIXME

onExecutorRemoved Marks an executor dead in executorToTaskSummary registry. Adds an entry to executorEvents registry and optionally removes the oldest if the number of entries exceeds spark.ui.timeline.executors.maximum.

onExecutorUnblacklisted FIXME

onNodeBlacklisted FIXME

onNodeUnblacklisted FIXME

onTaskStart May create an entry for an executor in executorToTaskSummary registry.

onTaskEnd May create an entry for an executor in executorToTaskSummary registry.

ExecutorsListener requires a StorageStatusListener and SparkConf.


Table 2. ExecutorsListener’s Internal Registries and Counters


Registry Description

executorToTaskSummary The lookup table for ExecutorTaskSummary per executor id. Used to build an ExecutorSummary for /allexecutors REST endpoint, to display stdout and stderr logs in Tasks and Aggregated Metrics by Executor sections in Stage Details page.

executorEvents A collection of SparkListenerEvents. Used to build the event timeline in AllJobsPage and Details for Job pages.

updateExecutorBlacklist Method

Caution FIXME

Intercepting Executor Was Blacklisted Events —  onExecutorBlacklisted Callback

Caution FIXME

Intercepting Executor Is No Longer Blacklisted Events —  onExecutorUnblacklisted Callback

Caution FIXME

Intercepting Node Was Blacklisted Events —  onNodeBlacklisted Callback

Caution FIXME

Intercepting Node Is No Longer Blacklisted Events —  onNodeUnblacklisted Callback

Caution FIXME


Intercepting Application Started Events —  onApplicationStart Callback

onApplicationStart(applicationStart: SparkListenerApplicationStart): Unit

Note onApplicationStart is part of SparkListener contract to announce that a Spark application has been started.

onApplicationStart takes driverLogs property from the input applicationStart (if

defined) and finds the driver’s active StorageStatus (using the current
StorageStatusListener). onApplicationStart then uses the driver’s StorageStatus (if
defined) to set executorLogs .

Table 3. ExecutorTaskSummary and ExecutorInfo Attributes


SparkListenerApplicationStart
ExecutorTaskSummary Attribute
Attribute
executorLogs driverLogs (if defined)

Intercepting Executor Added Events —  onExecutorAdded Callback

onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit

Note onExecutorAdded is part of SparkListener contract to announce that a new executor has been registered with the Spark application.

onExecutorAdded finds the executor (using the input executorAdded ) in the internal

executorToTaskSummary registry and sets the attributes. If not found, onExecutorAdded

creates a new entry.

Table 4. ExecutorTaskSummary and ExecutorInfo Attributes


ExecutorTaskSummary Attribute ExecutorInfo Attribute
executorLogs logUrlMap

totalCores totalCores

tasksMax totalCores / spark.task.cpus


onExecutorAdded adds the input executorAdded to executorEvents collection. If the number

of elements in executorEvents collection is greater than


spark.ui.timeline.executors.maximum configuration property, the first/oldest event is
removed.

onExecutorAdded removes the oldest dead executor from executorToTaskSummary lookup

table if their number is greater than spark.ui.retainedDeadExecutors.

Intercepting Executor Removed Events —  onExecutorRemoved Callback

onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved): Unit

Note onExecutorRemoved is part of SparkListener contract to announce that an executor has been unregistered with the Spark application.

onExecutorRemoved adds the input executorRemoved to executorEvents collection. It then

removes the oldest event if the number of elements in executorEvents collection is greater
than spark.ui.timeline.executors.maximum configuration property.

The executor is marked as removed/inactive in executorToTaskSummary lookup table.

Intercepting Task Started Events —  onTaskStart Callback

onTaskStart(taskStart: SparkListenerTaskStart): Unit

Note onTaskStart is part of SparkListener contract to announce that a task has been started.

onTaskStart increments tasksActive for the executor (using the input

SparkListenerTaskStart ).

Table 5. ExecutorTaskSummary and SparkListenerTaskStart Attributes


ExecutorTaskSummary Attribute Description
tasksActive Uses taskStart.taskInfo.executorId .

Intercepting Task End Events —  onTaskEnd Callback


onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit

Note onTaskEnd is part of SparkListener contract to announce that a task has ended.

onTaskEnd takes TaskInfo from the input taskEnd (if available).

Depending on the reason for SparkListenerTaskEnd onTaskEnd does the following:

Table 6. onTaskEnd Behaviour per SparkListenerTaskEnd Reason


SparkListenerTaskEnd
onTaskEnd Behaviour
Reason
Resubmitted Does nothing

ExceptionFailure Increment tasksFailed

anything Increment tasksComplete

tasksActive is decremented but only when the number of active tasks for the executor is

greater than 0 .

Table 7. ExecutorTaskSummary and onTaskEnd Behaviour


ExecutorTaskSummary Attribute Description
tasksActive Decremented if greater than 0.

duration Uses taskEnd.taskInfo.duration

If the TaskMetrics (in the input taskEnd ) is available, the metrics are added to the
taskSummary for the task’s executor.


Table 8. Task Metrics and Task Summary


Task Summary Task Metric
inputBytes inputMetrics.bytesRead

inputRecords inputMetrics.recordsRead

outputBytes outputMetrics.bytesWritten

outputRecords outputMetrics.recordsWritten

shuffleRead shuffleReadMetrics.remoteBytesRead

shuffleWrite shuffleWriteMetrics.bytesWritten

jvmGCTime metrics.jvmGCTime

Finding Active BlockManagers —  activeStorageStatusList Method

activeStorageStatusList: Seq[StorageStatus]

activeStorageStatusList requests StorageStatusListener for active BlockManagers (on

executors).

Note activeStorageStatusList is used when:

FIXME

AllExecutorListResource does executorList

ExecutorListResource does executorList

ExecutorsListener gets informed that the Spark application has started, onNodeBlacklisted, and onNodeUnblacklisted


JobProgressListener Spark Listener

JobProgressListener is a SparkListener for web UI.

JobProgressListener intercepts the following Spark events.

Table 1. JobProgressListener Events


Handler Purpose

onJobStart Creates a JobUIData. It updates jobGroupToJobIds, pendingStages, jobIdToData, activeJobs, stageIdToActiveJobIds, stageIdToInfo and stageIdToData.

onJobEnd Removes an entry in activeJobs. It also removes entries in pendingStages and stageIdToActiveJobIds. It updates completedJobs, numCompletedJobs, failedJobs, numFailedJobs and skippedStages.

onStageCompleted Updates the StageUIData and JobUIData .

onTaskStart Updates the task’s StageUIData and JobUIData , and registers a new TaskUIData .

onTaskEnd Updates the task’s StageUIData (and TaskUIData ), ExecutorSummary , and JobUIData .

onExecutorMetricsUpdate

onEnvironmentUpdate Sets schedulingMode property using the current spark.scheduler.mode (from Spark Properties environment details). Used in AllJobsPage (for the Scheduling Mode), and to display pools in JobsTab and StagesTab. FIXME: Add the links/screenshots for pools.

onBlockManagerAdded Records an executor and its block manager in the internal executorIdToBlockManagerId registry.

onBlockManagerRemoved Removes the executor from the internal executorIdToBlockManagerId registry.

onApplicationStart Records a Spark application’s start time (in the internal startTime ). Used in Jobs tab (for a total uptime and the event timeline) and Job page (for the event timeline).

onApplicationEnd Records a Spark application’s end time (in the internal endTime ). Used in Jobs tab (for a total uptime).

onTaskGettingResult Does nothing. FIXME: Why is this event intercepted at all?!
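These are the same public SparkListener callbacks you can intercept yourself. A minimal sketch (the class name is made up) that logs jobs much like the listener above tracks them:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class JobLoggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

sc.addSparkListener(new JobLoggingListener)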

updateAggregateMetrics Method

Caution FIXME

Registries and Counters


JobProgressListener uses registries to collect information about job executions.


Table 2. JobProgressListener Registries and Counters


Name Description
numCompletedStages

numFailedStages

stageIdToData
Holds StageUIData per stage, i.e. the stage and stage
attempt ids.

stageIdToInfo

stageIdToActiveJobIds

poolToActiveStages

activeJobs

completedJobs

failedJobs

jobIdToData

jobGroupToJobIds

pendingStages

activeStages

completedStages

skippedStages

failedStages

executorIdToBlockManagerId The lookup table for BlockManagerId per executor id. Used to track block managers so the Stage page can display Address in Aggregated Metrics by Executor. FIXME: How does Executors page collect the very same information?

onJobStart Callback

onJobStart(jobStart: SparkListenerJobStart): Unit


onJobStart creates a JobUIData. It updates jobGroupToJobIds, pendingStages,

jobIdToData, activeJobs, stageIdToActiveJobIds, stageIdToInfo and stageIdToData.

onJobStart reads the optional Spark Job group id as spark.jobGroup.id (from properties

in the input jobStart ).

onJobStart then creates a JobUIData using the input jobStart with status attribute set

to JobExecutionStatus.RUNNING and records it in jobIdToData and activeJobs registries.

onJobStart looks the job ids for the group id (in jobGroupToJobIds registry) and adds the

job id.

The internal pendingStages is updated with StageInfo for the stage id (for every StageInfo
in SparkListenerJobStart.stageInfos collection).

onJobStart records the stages of the job in stageIdToActiveJobIds.

onJobStart records StageInfos in stageIdToInfo and stageIdToData.

onJobEnd Method

onJobEnd(jobEnd: SparkListenerJobEnd): Unit

onJobEnd removes an entry in activeJobs. It also removes entries in pendingStages and

stageIdToActiveJobIds. It updates completedJobs, numCompletedJobs, failedJobs,


numFailedJobs and skippedStages.

onJobEnd removes the job from activeJobs registry. It removes stages from pendingStages

registry.

When completed successfully, the job is added to completedJobs registry with status
attribute set to JobExecutionStatus.SUCCEEDED . numCompletedJobs gets incremented.

When failed, the job is added to failedJobs registry with status attribute set to
JobExecutionStatus.FAILED . numFailedJobs gets incremented.

For every stage in the job, the stage is removed from the active jobs (in
stageIdToActiveJobIds) that can remove the entire entry if no active jobs exist.

Every pending stage in stageIdToInfo gets added to skippedStages.

onExecutorMetricsUpdate Method


onExecutorMetricsUpdate(executorMetricsUpdate: SparkListenerExecutorMetricsUpdate): Unit

onTaskStart Method

onTaskStart(taskStart: SparkListenerTaskStart): Unit

onTaskStart updates StageUIData and JobUIData , and registers a new TaskUIData .

onTaskStart takes TaskInfo from the input taskStart .

onTaskStart looks the StageUIData for the stage and stage attempt ids up (in

stageIdToData registry).

onTaskStart increments numActiveTasks and puts a TaskUIData for the task in

stageData.taskData .

Ultimately, onTaskStart looks the stage in the internal stageIdToActiveJobIds and for each
active job reads its JobUIData (from jobIdToData). It then increments numActiveTasks .

onTaskEnd Method

onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit

onTaskEnd updates the StageUIData (and TaskUIData ), ExecutorSummary , and JobUIData .

onTaskEnd takes TaskInfo from the input taskEnd .

Note onTaskEnd does its processing when the TaskInfo is available and stageAttemptId is not -1 .

onTaskEnd looks the StageUIData for the stage and stage attempt ids up (in stageIdToData

registry).

onTaskEnd saves accumulables in the StageUIData .

onTaskEnd reads the ExecutorSummary for the executor (the task has finished on).

Depending on the task end’s reason onTaskEnd increments succeededTasks , killedTasks


or failedTasks counters.

onTaskEnd adds the task’s duration to taskTime .

onTaskEnd decrements the number of active tasks (in the StageUIData ).


Again, depending on the task end’s reason onTaskEnd computes errorMessage and
updates StageUIData .

Caution FIXME Why is the same information in two different registries —  stageData and execSummary ?!

If taskMetrics is available, updateAggregateMetrics is executed.

The task’s TaskUIData is looked up in stageData.taskData and updateTaskInfo and


updateTaskMetrics are executed. errorMessage is updated.

onTaskEnd makes sure that the number of tasks in StageUIData ( stageData.taskData ) is

not above spark.ui.retainedTasks and drops the excess.

Ultimately, onTaskEnd looks the stage in the internal stageIdToActiveJobIds and for each
active job reads its JobUIData (from jobIdToData). It then decrements numActiveTasks and
increments numCompletedTasks , numKilledTasks or numFailedTasks depending on the task’s
end reason.

onStageSubmitted Method

onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit

onStageCompleted Method

onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit

onStageCompleted updates the StageUIData and JobUIData .

onStageCompleted reads stageInfo from the input stageCompleted and records it in

stageIdToInfo registry.

onStageCompleted looks the StageUIData for the stage and the stage attempt ids up in

stageIdToData registry.

onStageCompleted records accumulables in StageUIData .

onStageCompleted removes the stage from poolToActiveStages and activeStages registries.

If the stage completed successfully (i.e. has no failureReason ), onStageCompleted adds the
stage to completedStages registry and increments numCompletedStages counter. It trims
completedStages.


Otherwise, when the stage failed, onStageCompleted adds the stage to failedStages registry
and increments numFailedStages counter. It trims failedStages.

Ultimately, onStageCompleted looks the stage in the internal stageIdToActiveJobIds and for
each active job reads its JobUIData (from jobIdToData). It then decrements
numActiveStages . When completed successfully, it adds the stage to

completedStageIndices . With failure, numFailedStages gets incremented.

JobUIData

Caution FIXME

blockManagerIds method

blockManagerIds: Seq[BlockManagerId]

Caution FIXME

StageUIData
Caution FIXME

Settings
Table 3. Spark Properties
Setting Default Value Description

spark.ui.retainedJobs 1000
The number of jobs to hold
information about

spark.ui.retainedStages 1000
The number of stages to
hold information about

spark.ui.retainedTasks 100000
The number of tasks to
hold information about
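These limits can be tuned like any other Spark property; a minimal sketch with made-up values:

import org.apache.spark.SparkConf

// values are illustrative
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "200")
  .set("spark.ui.retainedTasks", "10000")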


StorageStatusListener — Spark Listener for Tracking BlockManagers
StorageStatusListener is a SparkListener that uses SparkListener callbacks to track status

of every BlockManager in a Spark application.

StorageStatusListener is created and registered when SparkUI is created. It is later used

to create ExecutorsListener and StorageListener Spark listeners.

Table 1. StorageStatusListener’s SparkListener Callbacks (in alphabetical order)


Callback Description
onBlockManagerAdded Adds an executor id with StorageStatus (with BlockManager and maximum memory on the executor) to executorIdToStorageStatus internal registry. Removes any other BlockManager that may have been registered for the executor earlier in deadExecutorStorageStatus internal registry.

onBlockManagerRemoved Removes an executor from executorIdToStorageStatus internal registry and adds the removed StorageStatus to deadExecutorStorageStatus internal registry. Removes the oldest StorageStatus when the number of entries in deadExecutorStorageStatus is bigger than spark.ui.retainedDeadExecutors.

onBlockUpdated Updates StorageStatus for an executor in executorIdToStorageStatus internal registry, i.e. removes a block for NONE storage level and updates otherwise.

onUnpersistRDD Removes the RDD blocks for an unpersisted RDD (on every BlockManager registered as StorageStatus in executorIdToStorageStatus internal registry).


Table 2. StorageStatusListener’s Internal Registries and Counters


Name Description

deadExecutorStorageStatus Collection of StorageStatus of removed/inactive BlockManagers . Accessible using deadStorageStatusList method. Adds an element when StorageStatusListener handles a BlockManager being removed (possibly removing one element from the head when the number of elements is above spark.ui.retainedDeadExecutors property). Removes an element when StorageStatusListener handles a new BlockManager (per executor) so the executor is no longer dead.

executorIdToStorageStatus Lookup table of StorageStatus per executor (including the driver). Adds an entry when StorageStatusListener handles a new BlockManager. Removes an entry when StorageStatusListener handles a BlockManager being removed. Updates StorageStatus of an executor when StorageStatusListener handles StorageStatus updates.

Updating Storage Status For Executor —  updateStorageStatus Method

Caution FIXME

Active BlockManagers (on Executors) —  storageStatusList Method

storageStatusList: Seq[StorageStatus]

storageStatusList gives a collection of StorageStatus (from executorIdToStorageStatus

internal registry).


Note storageStatusList is used when:

StorageStatusListener removes the RDD blocks for an unpersisted RDD

ExecutorsListener does activeStorageStatusList

StorageListener does activeStorageStatusList

deadStorageStatusList Method

deadStorageStatusList: Seq[StorageStatus]

deadStorageStatusList gives deadExecutorStorageStatus internal registry.

Note deadStorageStatusList is used when ExecutorsListener is requested for inactive/dead BlockManagers.

Removing RDD Blocks for Unpersisted RDD —  updateStorageStatus Internal Method

updateStorageStatus(unpersistedRDDId: Int)

updateStorageStatus takes active BlockManagers.

updateStorageStatus then finds RDD blocks for unpersistedRDDId RDD (for every

BlockManager ) and removes the blocks.

Note storageStatusList is used exclusively when StorageStatusListener is notified that an RDD was unpersisted.


StorageListener — Spark Listener for Tracking Persistence Status of RDD Blocks
StorageListener is a BlockStatusListener that uses SparkListener callbacks to track changes in the persistence status of RDD blocks in a Spark application.

Table 1. StorageListener’s SparkListener Callbacks (in alphabetical order)


Callback Description
onBlockUpdated Updates _rddInfoMap with the update to a single block.

onStageCompleted Removes RDDInfo instances from _rddInfoMap that participated in the completed stage as well as the ones that are no longer cached.

onStageSubmitted Updates _rddInfoMap registry with the names of every RDDInfo in the submitted stage, possibly adding new RDDInfo instances if they were not registered yet.

onUnpersistRDD Removes an RDDInfo from _rddInfoMap registry for the unpersisted RDD.

Table 2. StorageListener’s Internal Registries and Counters


Name Description
_rddInfoMap RDDInfo instances per IDs. Used when…​FIXME

Creating StorageListener Instance


StorageListener takes the following when created:

StorageStatusListener

StorageListener initializes the internal registries and counters.

Note StorageListener is created when SparkUI is created.

Finding Active BlockManagers —  activeStorageStatusList Method


activeStorageStatusList: Seq[StorageStatus]

activeStorageStatusList requests StorageStatusListener for active BlockManagers (on

executors).

Note activeStorageStatusList is used when:

AllRDDResource does rddList and getRDDStorageInfo

StorageListener updates registered RDDInfos (with block updates from BlockManagers)

Intercepting Block Status Update Events —  onBlockUpdated Callback

onBlockUpdated(blockUpdated: SparkListenerBlockUpdated): Unit

onBlockUpdated creates a BlockStatus (from the input SparkListenerBlockUpdated ) and

updates registered RDDInfos (with block updates from BlockManagers) (passing in BlockId
and BlockStatus as a single-element collection of updated blocks).

Note onBlockUpdated is part of SparkListener contract to announce that there was a change in a block status (on a BlockManager on an executor).

Intercepting Stage Completed Events —  onStageCompleted Callback

onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit

onStageCompleted finds the identifiers of the RDDs that have participated in the completed

stage and removes them from _rddInfoMap registry as well as the RDDs that are no longer
cached.

Note onStageCompleted is part of SparkListener contract to announce that a stage has finished.

Intercepting Stage Submitted Events —  onStageSubmitted Callback

onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit


onStageSubmitted updates _rddInfoMap registry with the names of every RDDInfo in

stageSubmitted , possibly adding new RDDInfo instances if they were not registered yet.

Note onStageSubmitted is part of SparkListener contract to announce that the missing tasks of a stage were submitted for execution.

Intercepting Unpersist RDD Events —  onUnpersistRDD Callback

onUnpersistRDD(unpersistRDD: SparkListenerUnpersistRDD): Unit

onUnpersistRDD removes the RDDInfo from _rddInfoMap registry for the unpersisted RDD

(from unpersistRDD ).

Note onUnpersistRDD is part of SparkListener contract to announce that an RDD has been unpersisted.

Updating Registered RDDInfos (with Block Updates from BlockManagers) —  updateRDDInfo Internal Method

updateRDDInfo(updatedBlocks: Seq[(BlockId, BlockStatus)]): Unit

updateRDDInfo finds the RDDs for the input updatedBlocks (for BlockIds).

Note updateRDDInfo finds BlockIds that are RDDBlockIds.

updateRDDInfo takes RDDInfo entries (in _rddInfoMap registry) for which there are blocks in the input updatedBlocks and updates RDDInfos (using StorageStatus) (from activeStorageStatusList).

Note updateRDDInfo is used exclusively when StorageListener gets notified about a change in a block status (on a BlockManager on an executor).

Updating RDDInfos (using StorageStatus) —  StorageUtils.updateRddInfo Method

updateRddInfo(rddInfos: Seq[RDDInfo], statuses: Seq[StorageStatus]): Unit

Caution FIXME


Note updateRddInfo is used when:

SparkContext is requested for storage status of cached RDDs

StorageListener updates registered RDDInfos (with block updates from BlockManagers)


RDDOperationGraphListener Spark Listener

Caution FIXME


WebUI — Base Web UI
WebUI is the base of the web UIs in Apache Spark:

Active Spark applications

Spark History Server

Spark Standalone cluster manager

Spark on Mesos cluster manager

Note Spark on YARN uses a different web framework for the web UI.

WebUI is used as the parent of WebUITabs.

package org.apache.spark.ui

abstract class WebUI {
  // only required methods that have no implementation
  // the others follow
  def initialize(): Unit
}

Note WebUI is a private[spark] contract.

Table 1. (Subset of) WebUI Contract


Method Description
initialize Used in implementations only to let them initialize their web components.

Note initialize does not add anything special to the Scala type hierarchy but a common name to use across WebUIs (that could also be possible without it). In other words, initialize does not participate in any design pattern or a type hierarchy.

WebUI is a Scala abstract class and cannot be created directly, but only as one of the web

UIs.


Table 2. WebUIs
WebUI Description

HistoryServer Used in Spark History Server

MasterWebUI Used in Spark Standalone cluster manager

MesosClusterUI Used in Spark on Mesos cluster manager

SparkUI WebUI of a Spark application

WorkerWebUI Used in Spark Standalone cluster manager

Once bound to a Jetty HTTP server, WebUI is available at an HTTP port (and is used in the
web URL as boundPort ).

WebUI is available at a web URL, i.e. http://[publicHostName]:[boundPort] . The publicHostName is…​FIXME and the boundPort is the port that the Jetty HTTP Server bound to.


Table 3. WebUI’s Internal Properties (e.g. Registries, Counters and Flags)


Name Description

WebUITabs
tabs
Used when…​FIXME

ServletContextHandlers
handlers
Used when…​FIXME

ServletContextHandlers per WebUIPage


pageToHandlers
Used when…​FIXME

Optional ServerInfo (default: None )


serverInfo
Used when…​FIXME

publicHostName Host name of the UI.

publicHostName is either SPARK_PUBLIC_DNS environment variable or spark.driver.host configuration property. Defaults to the following if defined (in order):

1. SPARK_LOCAL_HOSTNAME environment variable

2. Host name of SPARK_LOCAL_IP environment variable

3. Utils.findLocalInetAddress

Used exclusively when WebUI is requested for the web URL

className
Used when…​FIXME

Tip Enable INFO or ERROR logging level for the corresponding loggers of the WebUIs, e.g. org.apache.spark.ui.SparkUI , to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.ui=INFO

Refer to Logging.

Creating WebUI Instance


WebUI takes the following when created:


SecurityManager

SSLOptions

Port number

SparkConf

basePath (default: empty)

Name (default: empty)

WebUI initializes the internal registries and counters.

Note WebUI is a Scala abstract class and cannot be created directly, but only as one of the implementations.

Detaching Page And Associated Handlers from UI —  detachPage Method

detachPage(page: WebUIPage): Unit

detachPage …​FIXME

Note detachPage is used when…​FIXME

Detaching Tab And Associated Pages from UI —  detachTab Method

detachTab(tab: WebUITab): Unit

detachTab …​FIXME

Note detachTab is used when…​FIXME

Detaching Handler —  detachHandler Method

detachHandler(handler: ServletContextHandler): Unit

detachHandler …​FIXME

Note detachHandler is used when…​FIXME


Detaching Handler At Path —  detachHandler Method

detachHandler(path: String): Unit

detachHandler …​FIXME

Note detachHandler is used when…​FIXME

Attaching Page to UI —  attachPage Method

attachPage(page: WebUIPage): Unit

Internally, attachPage creates the path of the WebUIPage that is / (forward slash)
followed by the prefix of the page.

attachPage creates a HTTP request handler…​FIXME

Note attachPage is used when:

WebUI is requested to attach a WebUITab (the WebUIPages actually)

HistoryServer, Spark Standalone’s MasterWebUI and WorkerWebUI , Spark on Mesos' MesosClusterUI are requested to initialize

Attaching Tab And Associated Pages to UI —  attachTab Method

attachTab(tab: WebUITab): Unit

attachTab attaches every WebUIPage of the input WebUITab.

In the end, attachTab adds the input WebUITab to WebUITab tabs.

Note attachTab is used when…​FIXME

Attaching Static Handler —  addStaticHandler Method

addStaticHandler(resourceBase: String, path: String): Unit

addStaticHandler …​FIXME


Note addStaticHandler is used when…​FIXME

Attaching Handler to UI —  attachHandler Method

attachHandler(handler: ServletContextHandler): Unit

attachHandler simply adds the input Jetty ServletContextHandler to handlers registry and

requests the ServerInfo to addHandler (only if the ServerInfo is defined).

Note attachHandler is used when:

web UIs (i.e. HistoryServer, Spark Standalone’s MasterWebUI and WorkerWebUI , Spark on Mesos' MesosClusterUI , SparkUI) are requested to initialize

WebUI is requested to attach a page to web UI and addStaticHandler

SparkContext is created (and attaches the driver metrics servlet handler to the web UI)

HistoryServer is requested to attachSparkUI

Spark Standalone’s Master and Worker are requested to onStart (and attach their metrics servlet handlers to the web ui)

getBasePath Method

getBasePath: String

getBasePath simply returns the base path.

Note getBasePath is used exclusively when WebUITab is requested for the base path.

Requesting Header Tabs —  getTabs Method

getTabs: Seq[WebUITab]

getTabs simply returns the registered tabs.

Note getTabs is used exclusively when WebUITab is requested for the header tabs.


Requesting Handlers —  getHandlers Method

getHandlers: Seq[ServletContextHandler]

getHandlers simply returns the registered handlers.

Note getHandlers is used when…​FIXME

Binding UI to Jetty HTTP Server on Host —  bind Method

bind(): Unit

bind …​FIXME

Note bind is used when…​FIXME

Stopping UI —  stop Method

stop(): Unit

stop …​FIXME

Note stop is used when…​FIXME


WebUIPage — Contract of Pages in Web UI


WebUIPage is the contract of web pages of a WebUI that can be rendered in HTML and

JSON.

WebUIPage can be:

attached or detached from a WebUI

attached to a WebUITab

WebUIPage has a prefix that…​FIXME

package org.apache.spark.ui

abstract class WebUIPage(var prefix: String) {


def render(request: HttpServletRequest): Seq[Node]
def renderJson(request: HttpServletRequest): JValue = JNothing
}

Note WebUIPage is a private[spark] contract.

Table 1. WebUIPage Contract


Method Description

render
Used exclusively when WebUI is requested to attach a
page (and…​FIXME)

renderJson Used when…​FIXME
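As an illustration only, the following is a minimal sketch of a custom WebUIPage (the HelloPage name and the hello prefix are made up). Since WebUIPage is a private[spark] contract, the sketch assumes the class is compiled as part of the org.apache.spark.ui package.

package org.apache.spark.ui

import javax.servlet.http.HttpServletRequest
import scala.xml.Node

// A hypothetical page served under the "hello" prefix of the owning WebUI or WebUITab
class HelloPage extends WebUIPage("hello") {
  // render gives the HTML body of the page
  override def render(request: HttpServletRequest): Seq[Node] = {
    <div>
      <h3>Hello from a custom WebUIPage</h3>
      <p>Requested URI: {request.getRequestURI}</p>
    </div>
  }
  // renderJson keeps the default JNothing, i.e. no JSON representation
}

Attaching such a page is then a matter of WebUI.attachPage (or WebUITab.attachPage), as described in the WebUI and WebUITab pages.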

Table 2. WebUIPages
WebUIPage Description
AllExecutionsPage Used in Spark SQL module

AllJobsPage

AllStagesPage

ApplicationPage Used in Spark Standalone cluster manager

BatchPage Used in Spark Streaming module

DriverPage Used in Spark on Mesos module


EnvironmentPage

ExecutionPage Used in Spark SQL module

ExecutorsPage

ExecutorThreadDumpPage

HistoryPage Used in Spark History Server module

JobPage

LogPage Used in Spark Standalone cluster manager

MasterPage Used in Spark Standalone cluster manager

MesosClusterPage Used in Spark on Mesos module

PoolPage

RDDPage

StagePage

StoragePage

StreamingPage Used in Spark Streaming module

ThriftServerPage Used in Spark Thrift Server module

ThriftServerSessionPage Used in Spark Thrift Server module

WorkerPage Used in Spark Standalone cluster manager


WebUITab — Contract of Tabs in Web UI


WebUITab represents a tab in web UI with a name and pages.

WebUITab can be:

attached or detached from a WebUI

attached to a WebUITab

WebUITab is simply a collection of WebUIPages that can be attached to the tab.

WebUITab has a name (and defaults to prefix capitalized).

Note SparkUITab is the one and only implementation of WebUITab contract.

Note WebUITab is a private[spark] contract.

Attaching Page to Tab —  attachPage Method

attachPage(page: WebUIPage): Unit

attachPage prepends the tab prefix (followed by a slash, i.e. / ) to the page prefix of the input WebUIPage , dropping the trailing slash when the page prefix is empty.

In the end, attachPage adds the WebUIPage to pages registry.

Note attachPage is used when web UI tabs register their pages.

Requesting Base URI Path —  basePath Method

basePath: String

basePath requests the parent WebUI for the base path.

Note basePath is used when…​FIXME

Requesting Header Tabs —  headerTabs Method

headerTabs: Seq[WebUITab]


headerTabs requests the parent WebUI for the header tabs.

Note headerTabs is used exclusively when UIUtils is requested to headerSparkPage.

Creating WebUITab Instance


WebUITab takes the following when created:

Parent WebUI

Prefix

WebUITab initializes the internal registries and counters.

Note WebUITab is a Scala abstract class and cannot be created directly, but only as one of the implementations.


RDDStorageInfo
RDDStorageInfo contains information about RDD persistence:

RDD id

RDD name

Number of RDD partitions

Number of cached RDD partitions

Storage level ID

Memory used

Disk used

Data distribution (as Seq[RDDDataDistribution] )

Partitions (as Seq[RDDPartitionInfo] )

RDDStorageInfo is created exclusively when LiveRDD is requested to doUpdate (when

requested to write).

RDDStorageInfo is used when:

1. web UI’s StoragePage is requested to render an HTML table row and an entire table for
RDD details

2. REST API’s AbstractApplicationResource is requested for rddList (at storage/rdd


path)

3. AppStatusStore is requested for rddList


RDDInfo
RDDInfo is…​FIXME


LiveEntity
LiveEntity is the contract of a live entity in Spark that…​FIXME

package org.apache.spark.status

abstract class LiveEntity {


// only required methods that have no implementation
// the others follow
protected def doUpdate(): Any
}

Note LiveEntity is a private[spark] contract.

Table 1. LiveEntity Contract


Method Description
doUpdate Used exclusively when LiveEntity is requested to write.

LiveEntity tracks the last write time (in lastWriteTime internal registry).

write Method

write(store: ElementTrackingStore, now: Long, checkTriggers: Boolean = false): Unit

write requests the input ElementTrackingStore to write the updated value.

In the end, write records the time in the lastWriteTime.

Note: write is used when:

1. AppStatusListener is requested to update

2. SQLAppStatusListener is created (and registers a flush trigger) and requested to update


LiveRDD
LiveRDD is a LiveEntity that…​FIXME

LiveRDD is created exclusively when AppStatusListener is requested to handle

onStageSubmitted event

LiveRDD takes a RDDInfo when created.

doUpdate Method

doUpdate(): Any

Note doUpdate is part of LiveEntity Contract to…​FIXME.

doUpdate …​FIXME


UIUtils
UIUtils is a utility object for…​FIXME

headerSparkPage Method

headerSparkPage(
request: HttpServletRequest,
title: String,
content: => Seq[Node],
activeTab: SparkUITab,
refreshInterval: Option[Int] = None,
helpText: Option[String] = None,
showVisualization: Boolean = false,
useDataTables: Boolean = false): Seq[Node]

headerSparkPage …​FIXME

Note headerSparkPage is used when…​FIXME


JettyUtils
JettyUtils is a set of utility methods for creating Jetty HTTP Server-specific components.

Table 1. JettyUtils’s Utility Methods


Name Description

createServlet Creates an HttpServlet

createStaticHandler Creates a Handler for a static content

createServletHandler Creates a ServletContextHandler for a path

createRedirectHandler

Creating ServletContextHandler for Path —  createServletHandler Method

createServletHandler(
path: String,
servlet: HttpServlet,
basePath: String): ServletContextHandler (1)
createServletHandler[T <: AnyRef](
path: String,
servletParams: ServletParams[T],
securityMgr: SecurityManager,
conf: SparkConf,
basePath: String = ""): ServletContextHandler (2)

2. Creates an HttpServlet for the input ServletParams (using createServlet) and delegates to the three-argument createServletHandler (1)

createServletHandler …​FIXME

Note: createServletHandler is used when:

WebUI is requested to attachPage

MetricsServlet is requested to getHandlers

Spark Standalone’s WorkerWebUI is requested to initialize

Creating HttpServlet —  createServlet Method


createServlet[T <: AnyRef](
  servletParams: ServletParams[T],
  securityMgr: SecurityManager,
  conf: SparkConf): HttpServlet

createServlet creates the X-Frame-Options header that can be either ALLOW-FROM with the

value of spark.ui.allowFramingFrom configuration property if defined or SAMEORIGIN .

createServlet creates a Java Servlets HttpServlet with support for GET requests.

When handling GET requests, the HttpServlet first checks view permissions of the remote
user (by requesting the SecurityManager to checkUIViewPermissions of the remote user).

Tip: Enable DEBUG logging level for org.apache.spark.SecurityManager logger to see what happens when SecurityManager does the security check.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.SecurityManager=DEBUG

You should see the following DEBUG message in the logs:

DEBUG SecurityManager: user=[user] aclsEnabled=[aclsEnabled] viewAcls=[viewAcls] viewAclsGroups=[viewAclsGroups]

With view permissions check passed, the HttpServlet sends a response with the following:

FIXME

In case the view permissions didn’t allow to view the page, the HttpServlet sends an error
response with the following:

Status 403

Cache-Control header with "no-cache, no-store, must-revalidate"

Error message: "User is not authorized to access this page."

Note createServlet is used exclusively when JettyUtils is requested to createServletHandler.

Creating Handler For Static Content —  createStaticHandler Method

createStaticHandler(resourceBase: String, path: String): ServletContextHandler


createStaticHandler creates a handler for serving files from a static directory.

Internally, createStaticHandler creates a Jetty ServletContextHandler and sets the
org.eclipse.jetty.servlet.Default.gzip init parameter to false .

createStaticHandler creates a Jetty DefaultServlet.

Note: Quoting the official documentation of Jetty’s DefaultServlet:

DefaultServlet The default servlet. This servlet, normally mapped to / ,
provides the handling for static content, OPTION and TRACE methods for
the context. The following initParameters are supported, these can be set
either on the servlet itself or as ServletContext initParameters with a prefix
of org.eclipse.jetty.servlet.Default.

gzip If set to true, then static content will be served as gzip content
encoded if a matching resource is found ending with ".gz" (default false )
(deprecated: use precompressed)

With that, org.eclipse.jetty.servlet.Default.gzip is to configure the gzip init
parameter of Jetty’s DefaultServlet .

createStaticHandler resolves the resourceBase in the Spark classloader and, if
successful, sets the resourceBase init parameter of the Jetty DefaultServlet to the URL.

Note: The resourceBase init parameter is used to replace the context resource base.

createStaticHandler requests the ServletContextHandler to use the path as the context
path and registers the DefaultServlet to serve it.

createStaticHandler throws an Exception if the input resourceBase could not be
resolved:

Could not find resource path for Web UI: [resourceBase]

Note createStaticHandler is used when SparkUI, HistoryServer, Spark Standalone’s MasterWebUI and WorkerWebUI , Spark on Mesos' MesosClusterUI are requested to initialize.

createRedirectHandler Method

createRedirectHandler(
srcPath: String,
destPath: String,
beforeRedirect: HttpServletRequest => Unit = x => (),
basePath: String = "",
httpMethods: Set[String] = Set("GET")): ServletContextHandler


createRedirectHandler …​FIXME

Note createRedirectHandler is used when SparkUI and Spark Standalone’s MasterWebUI are requested to initialize.


web UI Configuration Properties


Table 1. web UI Configuration Properties

spark.ui.allowFramingFrom
  Default: (undefined)
  Defines the URL to use in ALLOW-FROM in the X-Frame-Options header (as described in http://tools.ietf.org/html/rfc7034).
  Used exclusively when JettyUtils is requested to create an HttpServlet.

spark.ui.consoleProgress.update.interval
  Default: 200 (ms)
  Update interval, i.e. how often to show the progress.

spark.ui.enabled
  Default: true
  The flag to control whether the web UI is started ( true ) or not ( false ).

spark.ui.killEnabled
  Default: true
  Enables jobs and stages to be killed from the web UI ( true ) or not ( false ).
  Used exclusively when SparkUI is requested to initialize (and registers the redirect handlers for /jobs/job/kill and /stages/stage/kill URIs).

spark.ui.port
  Default: 4040
  The port web UI binds to. If multiple SparkContext s attempt to run on the same host (it is not possible to have two or more Spark contexts on a single JVM, though), they will bind to successive ports beginning with spark.ui.port .

spark.ui.retainedDeadExecutors
  Default: 100
  The maximum number of entries in executorToTaskSummary (in ExecutorsListener ) and deadExecutorStorageStatus (in StorageStatusListener ) internal registries.

spark.ui.showConsoleProgress
  Default: true
  Controls whether to create ConsoleProgressBar ( true ) or not ( false ).

spark.ui.timeline.executors.maximum
  Default: 1000
  The maximum number of entries in executorEvents registry.

spark.ui.timeline.tasks.maximum
  Default: 1000
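A quick sketch of setting some of these properties programmatically (the property names come from the table above; the values are illustrative):

import org.apache.spark.SparkConf

// Illustrative values only: move the web UI to port 4041, disable killing jobs/stages
// from the UI and switch off the console progress bar
val conf = new SparkConf()
  .set("spark.ui.port", "4041")
  .set("spark.ui.killEnabled", "false")
  .set("spark.ui.showConsoleProgress", "false")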


Spark Metrics
Spark Metrics gives you execution metrics of Spark subsystems (aka metrics instances),
e.g. the driver of a Spark application or the master of a Spark Standalone cluster.

Spark Metrics uses Dropwizard Metrics 3.1.0 Java library for the metrics infrastructure.

Metrics is a Java library which gives you unparalleled insight into what your code does
in production.

Metrics provides a powerful toolkit of ways to measure the behavior of critical


components in your production environment.

MetricsSystem — Registry of Metrics Sources and Sinks of Spark Subsystem
The main part of Spark Metrics is MetricsSystem which is a registry of metrics sources and
sinks of a Spark subsystem.

MetricsSystem uses Dropwizard Metrics' MetricRegistry that acts as the integration point

between Spark and the metrics library.

A Spark subsystem can access the MetricsSystem through the SparkEnv.metricsSystem


property.

val metricsSystem = SparkEnv.get.metricsSystem

MetricsConfig — Metrics System Configuration


MetricsConfig is the configuration of the MetricsSystem (i.e. metrics sources and sinks).

metrics.properties is the default metrics configuration file. It is configured using


spark.metrics.conf configuration property. The file is first loaded from the path directly before
using Spark’s CLASSPATH.

MetricsConfig also accepts a metrics configuration using spark.metrics.conf. -prefixed

configuration properties.

Spark comes with conf/metrics.properties.template file that is a template of metrics


configuration.

MetricsServlet Metrics Sink


Among the metrics sinks is MetricsServlet that is used when sink.servlet metrics sink is
configured in metrics configuration.

Caution FIXME Describe configuration files and properties

JmxSink Metrics Sink


Enable org.apache.spark.metrics.sink.JmxSink in metrics configuration.

You can then use jconsole to access Spark metrics through JMX.

*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

Figure 1. jconsole and JmxSink in spark-shell

JSON URI Path


Metrics System is available at http://localhost:4040/metrics/json (for the default setup of a
Spark application).

$ http --follow http://localhost:4040/metrics/json


HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, must-revalidate
Content-Length: 2200
Content-Type: text/json;charset=utf-8
Date: Sat, 25 Feb 2017 14:14:16 GMT
Server: Jetty(9.2.z-SNAPSHOT)
X-Frame-Options: SAMEORIGIN

{
"counters": {
"app-20170225151406-0000.driver.HiveExternalCatalog.fileCacheHits": {
"count": 0
},
"app-20170225151406-0000.driver.HiveExternalCatalog.filesDiscovered": {
"count": 0
},
"app-20170225151406-0000.driver.HiveExternalCatalog.hiveClientCalls": {
"count": 2
},
"app-20170225151406-0000.driver.HiveExternalCatalog.parallelListingJobCount":
{
"count": 0
},
"app-20170225151406-0000.driver.HiveExternalCatalog.partitionsFetched": {
"count": 0
}
},
"gauges": {
...
"timers": {
"app-20170225151406-0000.driver.DAGScheduler.messageProcessingTime": {
"count": 0,
"duration_units": "milliseconds",
"m15_rate": 0.0,
"m1_rate": 0.0,
"m5_rate": 0.0,
"max": 0.0,
"mean": 0.0,
"mean_rate": 0.0,
"min": 0.0,
"p50": 0.0,
"p75": 0.0,
"p95": 0.0,
"p98": 0.0,
"p99": 0.0,
"p999": 0.0,
"rate_units": "calls/second",
"stddev": 0.0
}
},
"version": "3.0.0"
}


Note You can access a Spark subsystem’s MetricsSystem using its corresponding "leading" port, e.g. 4040 for the driver , 8080 for Spark Standalone’s master and applications .

Note You have to use the trailing slash ( / ) to have the output.

Spark Standalone Master

$ http http://192.168.1.4:8080/metrics/master/json/path
HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, must-revalidate
Content-Length: 207
Content-Type: text/json;charset=UTF-8
Server: Jetty(8.y.z-SNAPSHOT)
X-Frame-Options: SAMEORIGIN

{
"counters": {},
"gauges": {
"master.aliveWorkers": {
"value": 0
},
"master.apps": {
"value": 0
},
"master.waitingApps": {
"value": 0
},
"master.workers": {
"value": 0
}
},
"histograms": {},
"meters": {},
"timers": {},
"version": "3.0.0"
}


MetricsSystem — Registry of Metrics Sources and Sinks of Spark Subsystem
MetricsSystem is a registry of metrics sources and sinks of a Spark subsystem, e.g. the

driver of a Spark application.

Figure 1. Creating MetricsSystem for Driver


MetricsSystem may have at most one MetricsServlet JSON metrics sink (which is registered

by default).

When created, MetricsSystem requests MetricsConfig to initialize.

Figure 2. Creating MetricsSystem


Table 1. Metrics Instances (Subsystems) and MetricsSystems


Name When Created
applications Spark Standalone’s Master is created.

driver SparkEnv is created for the driver.

executor SparkEnv is created for an executor.

master Spark Standalone’s Master is created.

mesos_cluster Spark on Mesos' MesosClusterScheduler is created.

shuffleService ExternalShuffleService is created.

worker Spark Standalone’s Worker is created.

MetricsSystem uses MetricRegistry as the integration point to Dropwizard Metrics library.


Table 2. MetricsSystem’s Internal Registries and Counters

metricsConfig
  MetricsConfig
  Initialized when MetricsSystem is created.
  Used when MetricsSystem registers sinks and sources.

metricsServlet
  MetricsServlet JSON metrics sink that is only available for the metrics instances with a web UI, i.e. the driver of a Spark application and Spark Standalone’s Master .
  Initialized when MetricsSystem registers sinks (and finds a configuration entry with servlet sink name).
  Used exclusively when MetricsSystem is requested for a JSON servlet handler.

registry
  Dropwizard Metrics' MetricRegistry
  Used when MetricsSystem is requested to: register a metrics source, remove a metrics source, start (that in turn registers metrics sinks)

running
  Flag that indicates whether MetricsSystem has been started ( true ) or not ( false )
  Default: false

sinks
  Metrics sinks in a Spark application.
  Used when MetricsSystem registers a new metrics sink and starts them eventually.

sources
  Metrics sources in a Spark application.
  Used when MetricsSystem registers a new metrics source.

Tip: Enable WARN or ERROR logging levels for org.apache.spark.metrics.MetricsSystem logger to see what happens in MetricsSystem .

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.metrics.MetricsSystem=WARN

Refer to Logging.

"Static" Metrics Sources for Spark SQL — StaticSources


Caution FIXME

Registering Metrics Source —  registerSource Method

registerSource(source: Source): Unit

registerSource adds source to sources internal registry.

registerSource creates an identifier for the metrics source and registers it with

MetricRegistry.

Note registerSource uses Metrics' MetricRegistry.register to register a metrics source under a given name.

When registerSource tries to register a name more than once, you should see the following
INFO message in the logs:

INFO Metrics already registered


Note: registerSource is used when:

SparkContext registers metrics sources for:

  DAGScheduler
  BlockManager
  ExecutorAllocationManager (when dynamic allocation is enabled)

MetricsSystem is started (and registers the "static" metrics sources —  CodegenMetrics and HiveCatalogMetrics ) and does registerSources.

Executor is created (and registers a ExecutorSource)

ExternalShuffleService is started (and registers ExternalShuffleServiceSource )

Spark Structured Streaming’s StreamExecution runs batches as data arrives (when metrics are enabled).

Spark Streaming’s StreamingContext is started (and registers StreamingSource )

Spark Standalone’s Master and Worker start (and register their MasterSource and WorkerSource , respectively)

Spark Standalone’s Master registers a Spark application (and registers an ApplicationSource )

Spark on Mesos' MesosClusterScheduler is started (and registers a MesosClusterSchedulerSource )
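As a sketch only, registering a custom metrics source could look as follows. The source name and counter are made up for the example, and since Source and MetricsSystem are private[spark] contracts, the sketch assumes code compiled inside an org.apache.spark package.

import com.codahale.metrics.MetricRegistry
import org.apache.spark.SparkEnv
import org.apache.spark.metrics.source.Source

// A made-up application-level metrics source
val mySource = new Source {
  override val sourceName: String = "myApp"
  override val metricRegistry: MetricRegistry = new MetricRegistry
}
mySource.metricRegistry.counter("recordsProcessed").inc()

// Register the source with the driver's MetricsSystem
SparkEnv.get.metricsSystem.registerSource(mySource)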

Building Metrics Source Identifier —  buildRegistryName Method

buildRegistryName(source: Source): String

Note buildRegistryName is used to build the metrics source identifiers for a Spark application’s driver and executors, but also for other Spark framework’s components (e.g. Spark Standalone’s master and workers).

Note buildRegistryName uses spark.metrics.namespace and spark.executor.id Spark properties to differentiate between a Spark application’s driver and executors, and the other Spark framework’s components.

(only when instance is driver or executor ) buildRegistryName builds metrics source


name that is made up of spark.metrics.namespace, spark.executor.id and the name of the
source .


Note buildRegistryName uses Dropwizard Metrics' MetricRegistry to build metrics source identifiers.

Caution FIXME Finish for the other components.

Note buildRegistryName is used when MetricsSystem registers or removes a metrics source.

Registering Metrics Sources for Spark Instance —  registerSources Internal Method

registerSources(): Unit

registerSources finds metricsConfig configuration for the metrics instance.

Note instance is defined when MetricsSystem is created.

registerSources finds the configuration of all the metrics sources for the subsystem (as

described with source. prefix).

For every metrics source, registerSources finds class property, creates an instance, and
in the end registers it.

When registerSources fails, you should see the following ERROR message in the logs
followed by the exception.

ERROR Source class [classPath] cannot be instantiated

Note registerSources is used exclusively when MetricsSystem is started.

Requesting JSON Servlet Handler —  getServletHandlers Method

getServletHandlers: Array[ServletContextHandler]

If the MetricsSystem is running and the MetricsServlet is defined for the metrics system,
getServletHandlers simply requests the MetricsServlet for the JSON servlet handler.

When MetricsSystem is not running getServletHandlers throws an


IllegalArgumentException .


Can only call getServletHandlers on a running MetricsSystem

Note: getServletHandlers is used when:

SparkContext is created

Spark Standalone’s Master and Worker are requested to start (as onStart )

Registering Metrics Sinks —  registerSinks Internal Method

registerSinks(): Unit

registerSinks requests the MetricsConfig for the configuration of the instance.

registerSinks requests the MetricsConfig for the configuration of all metrics sinks (i.e.

configuration entries that match ^sink\\.(.)\\.(.) regular expression).

For every metrics sink configuration, registerSinks takes the class property and (if defined) creates an instance of the metric sink using a constructor that takes the configuration, MetricRegistry and SecurityManager.

For a single servlet metrics sink, registerSinks converts the sink to a MetricsServlet and
sets the metricsServlet internal registry.

For all other metrics sinks, registerSinks adds the sink to the sinks internal registry.

In case of an Exception , registerSinks prints out the following ERROR message to the
logs:

Sink class [classPath] cannot be instantiated

Note registerSinks is used exclusively when MetricsSystem is requested to start.

stop Method

stop(): Unit

stop …​FIXME


Note stop is used when…​FIXME

getSourcesByName Method

getSourcesByName(sourceName: String): Seq[Source]

getSourcesByName …​FIXME

Note getSourcesByName is used when…​FIXME

removeSource Method

removeSource(source: Source): Unit

removeSource …​FIXME

Note removeSource is used when…​FIXME

Creating MetricsSystem Instance


MetricsSystem takes the following when created:

Instance name

SparkConf

SecurityManager

MetricsSystem initializes the internal registries and counters.

When created, MetricsSystem requests MetricsConfig to initialize.

Note createMetricsSystem is used to create a new MetricsSystem instance instead.

Creating MetricsSystem Instance For Subsystem —  createMetricsSystem Factory Method

createMetricsSystem(
  instance: String,
  conf: SparkConf,
  securityMgr: SecurityManager): MetricsSystem


createMetricsSystem returns a new MetricsSystem.

Note createMetricsSystem is used when a metrics instance is created.

Requesting Sinks to Report Metrics —  report Method

report(): Unit

report simply requests the registered metrics sinks to report metrics.

Note report is used when SparkContext, Executor, Spark Standalone’s Master and Worker , and Spark on Mesos' MesosClusterScheduler are requested to stop.

Starting MetricsSystem —  start Method

start(): Unit

start turns running flag on.

Note start can only be called once and throws an IllegalArgumentException when called multiple times.

start registers the "static" metrics sources for Spark SQL, i.e. CodegenMetrics and

HiveCatalogMetrics .

start then registers the configured metrics sources and sinks for the Spark instance.

In the end, start requests the registered metrics sinks to start.

start throws an IllegalArgumentException when running flag is on.

requirement failed: Attempting to start a MetricsSystem that is already running

Note: start is used when:

SparkContext is created

SparkEnv is created (on executors)

ExternalShuffleService is requested to start

Spark Standalone’s Master and Worker , and Spark on Mesos' MesosClusterScheduler are requested to start


MetricsConfig — Metrics System Configuration


MetricsConfig is the configuration of the MetricsSystem (i.e. metrics sources and sinks).

MetricsConfig is created when MetricsSystem is.

MetricsConfig uses metrics.properties as the default metrics configuration file. It is

configured using spark.metrics.conf configuration property. The file is first loaded from the
path directly before using Spark’s CLASSPATH.

MetricsConfig accepts a metrics configuration using spark.metrics.conf. -prefixed

configuration properties.

Spark comes with conf/metrics.properties.template file that is a template of metrics


configuration.

MetricsConfig makes sure that the default metrics properties are always defined.

Table 1. MetricsConfig’s Default Metrics Properties


Name Description
*.sink.servlet.class org.apache.spark.metrics.sink.MetricsServlet

*.sink.servlet.path /metrics/json

master.sink.servlet.path /metrics/master/json

applications.sink.servlet.path /metrics/applications/json

Note: The order of precedence of metrics configuration settings is as follows (from lowest to highest):

1. Default metrics properties

2. spark.metrics.conf configuration property or metrics.properties configuration file

3. spark.metrics.conf. -prefixed Spark properties

MetricsConfig takes a SparkConf when created.
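For illustration, a sketch of configuring a metrics sink through spark.metrics.conf. -prefixed Spark properties rather than a metrics.properties file (ConsoleSink with its period and unit options is one of Spark's built-in sinks; the values are illustrative):

import org.apache.spark.SparkConf

// Equivalent to the following entries in conf/metrics.properties:
//   *.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
//   *.sink.console.period=10
//   *.sink.console.unit=seconds
val conf = new SparkConf()
  .set("spark.metrics.conf.*.sink.console.class", "org.apache.spark.metrics.sink.ConsoleSink")
  .set("spark.metrics.conf.*.sink.console.period", "10")
  .set("spark.metrics.conf.*.sink.console.unit", "seconds")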


Table 2. MetricsConfig’s Internal Registries and Counters

properties
  java.util.Properties with metrics properties
  Used to initialize per-subsystem’s perInstanceSubProperties.

perInstanceSubProperties
  Lookup table of metrics properties per subsystem

Initializing MetricsConfig —  initialize Method

initialize(): Unit

initialize sets the default properties and loads configuration properties from a

configuration file (that is defined using spark.metrics.conf configuration property).

initialize takes all Spark properties that start with spark.metrics.conf. prefix from

SparkConf and adds them to properties (without the prefix).

In the end, initialize splits configuration per Spark subsystem with the default
configuration (denoted as * ) assigned to all subsystems afterwards.

Note initialize accepts * (star) for the default configuration or any combination of lower- and upper-case letters for Spark subsystem names.

Note initialize is used exclusively when MetricsSystem is created.

setDefaultProperties Internal Method

setDefaultProperties(prop: Properties): Unit

setDefaultProperties sets the default properties (in the input prop ).

Note setDefaultProperties is used exclusively when MetricsConfig is initialized.

Loading Custom Metrics Configuration File or metrics.properties —  loadPropertiesFromFile Method

loadPropertiesFromFile(path: Option[String]): Unit


loadPropertiesFromFile tries to open the input path file (if defined) or the default metrics

configuration file metrics.properties (on CLASSPATH).

If either file is available, loadPropertiesFromFile loads the properties (to properties registry).

In case of exceptions, you should see the following ERROR message in the logs followed by
the exception.

ERROR Error loading configuration file [file]

Note loadPropertiesFromFile is used exclusively when MetricsConfig is initialized.

Grouping Properties Per Subsystem —  subProperties Method

subProperties(prop: Properties, regex: Regex): mutable.HashMap[String, Properties]

subProperties takes prop properties and destructures keys given regex . subProperties

takes the matching prefix (of a key per regex ) and uses it as a new key with the value(s)
being the matching suffix(es).

driver.hello.world => (driver, (hello.world))

Note subProperties is used when MetricsConfig is initialized (to apply the default metrics configuration) and when MetricsSystem registers metrics sources and sinks.

getInstance Method

getInstance(inst: String): Properties

getInstance …​FIXME

Note getInstance is used when…​FIXME


Source — Contract of Metrics Sources


Source is a contract of metrics sources.

package org.apache.spark.metrics.source

trait Source {
def sourceName: String
def metricRegistry: MetricRegistry
}

Note Source is a private[spark] contract.

Table 1. Source Contract


Method Description
sourceName Used when…​FIXME

Dropwizard Metrics' MetricRegistry


metricRegistry
Used when…​FIXME


Table 2. Sources
Source Description
ApplicationSource

BlockManagerSource

CacheMetrics

CodegenMetrics

DAGSchedulerSource

ExecutorAllocationManagerSource

ExecutorSource

ExternalShuffleServiceSource

HiveCatalogMetrics

JvmSource

LiveListenerBusMetrics

MasterSource

MesosClusterSchedulerSource

ShuffleMetricsSource

StreamingSource

WorkerSource
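As a sketch of what implementing the contract looks like (the MyAppSource name and its metrics are made up; since Source is private[spark], such a class has to be compiled in an org.apache.spark package):

package org.apache.spark.metrics.source

import com.codahale.metrics.{Gauge, MetricRegistry}

// A made-up metrics source with one counter and one gauge
class MyAppSource extends Source {
  override val sourceName: String = "myApp"
  override val metricRegistry: MetricRegistry = new MetricRegistry

  val recordsProcessed = metricRegistry.counter(MetricRegistry.name("recordsProcessed"))

  metricRegistry.register(MetricRegistry.name("startTime"), new Gauge[Long] {
    override def getValue: Long = System.currentTimeMillis()
  })
}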


Sink — Contract of Metrics Sinks


Sink is a contract of metrics sinks.

package org.apache.spark.metrics.sink

trait Sink {
def start(): Unit
def stop(): Unit
def report(): Unit
}

Note Sink is a private[spark] contract.

Table 1. Sink Contract


Method Description
start Used when…​FIXME

stop Used when…​FIXME

report Used when…​FIXME

Table 2. Sinks
Sink Description
ConsoleSink

CsvSink

GraphiteSink

JmxSink

MetricsServlet

Slf4jSink

StatsdSink

Note All known Sinks in Spark 2.3 are in the org.apache.spark.metrics.sink Scala package.
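A minimal sketch of a custom sink (the MyConsoleSink name is made up; as described in MetricsSystem's registerSinks, sinks are instantiated reflectively through a constructor taking the sink configuration, a MetricRegistry and a SecurityManager, and since Sink is private[spark] the class has to live in the org.apache.spark.metrics.sink package):

package org.apache.spark.metrics.sink

import java.util.Properties
import java.util.concurrent.TimeUnit

import com.codahale.metrics.{ConsoleReporter, MetricRegistry}
import org.apache.spark.SecurityManager

// A made-up sink that prints metrics to the console every 10 seconds
class MyConsoleSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager) extends Sink {

  private val reporter = ConsoleReporter.forRegistry(registry)
    .convertRatesTo(TimeUnit.SECONDS)
    .convertDurationsTo(TimeUnit.MILLISECONDS)
    .build()

  override def start(): Unit = reporter.start(10, TimeUnit.SECONDS)
  override def stop(): Unit = reporter.stop()
  override def report(): Unit = reporter.report()
}

Such a sink would then be wired in through the metrics configuration, e.g. *.sink.myconsole.class=org.apache.spark.metrics.sink.MyConsoleSink .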


MetricsServlet JSON Metrics Sink


MetricsServlet is a metrics sink that gives metrics snapshots in JSON format.

MetricsServlet is a "special" sink as it is only available to the metrics instances with a web

UI:

Driver of a Spark application

Spark Standalone’s Master and Worker

You can access the metrics from MetricsServlet at /metrics/json URI by default. The entire
URL depends on a metrics instance, e.g. http://localhost:4040/metrics/json/ for a running
Spark application.


$ http http://localhost:4040/metrics/json/
HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, must-revalidate
Content-Length: 5005
Content-Type: text/json;charset=utf-8
Date: Mon, 11 Jun 2018 06:29:03 GMT
Server: Jetty(9.3.z-SNAPSHOT)
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

{
"counters": {
"local-1528698499919.driver.HiveExternalCatalog.fileCacheHits": {
"count": 0
},
"local-1528698499919.driver.HiveExternalCatalog.filesDiscovered": {
"count": 0
},
"local-1528698499919.driver.HiveExternalCatalog.hiveClientCalls": {
"count": 0
},
"local-1528698499919.driver.HiveExternalCatalog.parallelListingJobCount": {
"count": 0
},
"local-1528698499919.driver.HiveExternalCatalog.partitionsFetched": {
"count": 0
},
"local-1528698499919.driver.LiveListenerBus.numEventsPosted": {
"count": 7
},
"local-1528698499919.driver.LiveListenerBus.queue.appStatus.numDroppedEvents":
{
"count": 0
},
"local-1528698499919.driver.LiveListenerBus.queue.executorManagement.numDroppe
dEvents": {
"count": 0
}
},
...

MetricsServlet is created exclusively when MetricsSystem is started (and requested to

register metrics sinks).

MetricsServlet can be configured using configuration properties with sink.servlet prefix (in

metrics configuration). That is not required since MetricsConfig makes sure that
MetricsServlet is always configured.


MetricsServlet uses jackson-databind, the general data-binding package for Jackson (as

ObjectMapper) with Dropwizard Metrics library (i.e. registering a Coda Hale MetricsModule ).

Table 1. MetricsServlet’s Configuration Properties


Name Default Description
path /metrics/json/ Path URI prefix to bind to

sample false
Whether to show entire set of samples for
histograms

Table 2. MetricsServlet’s Internal Properties (e.g. Registries, Counters and Flags)

mapper
  Jackson’s com.fasterxml.jackson.databind.ObjectMapper that "provides functionality for reading and writing JSON, either to and from basic POJOs (Plain Old Java Objects), or to and from a general-purpose JSON Tree Model (JsonNode), as well as related functionality for performing conversions."
  When created, mapper is requested to register a Coda Hale com.codahale.metrics.json.MetricsModule.
  Used exclusively when MetricsServlet is requested to getMetricsSnapshot.

servletPath
  Value of path configuration property

servletShowSample
  Flag to control whether to show samples ( true ) or not ( false ).
  servletShowSample is the value of sample configuration property (if defined) or false .
  Used when ObjectMapper is requested to register a Coda Hale com.codahale.metrics.json.MetricsModule.

Creating MetricsServlet Instance


MetricsServlet takes the following when created:

Configuration Properties (as Java Properties )

Dropwizard Metrics' MetricRegistry

SecurityManager

MetricsServlet initializes the internal registries and counters.


Requesting Metrics Snapshot —  getMetricsSnapshot Method

getMetricsSnapshot(request: HttpServletRequest): String

getMetricsSnapshot simply requests the Jackson ObjectMapper to serialize the

MetricRegistry to a JSON string (using ObjectMapper.writeValueAsString).

Note getMetricsSnapshot is used exclusively when MetricsServlet is requested to getHandlers.
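For illustration, a standalone sketch of what this serialization amounts to, using Jackson's ObjectMapper with the Dropwizard MetricsModule (the counter name and the module's time units are illustrative):

import java.util.concurrent.TimeUnit

import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.json.MetricsModule
import com.fasterxml.jackson.databind.ObjectMapper

// Build a registry with a single counter and serialize it to a JSON string
val registry = new MetricRegistry
registry.counter("my.counter").inc()

val mapper = new ObjectMapper()
  .registerModule(new MetricsModule(TimeUnit.SECONDS, TimeUnit.MILLISECONDS, false))

val json: String = mapper.writeValueAsString(registry)
println(json)  // {"version":...,"counters":{"my.counter":{"count":1}},...}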

Requesting JSON Servlet Handler —  getHandlers Method

getHandlers(conf: SparkConf): Array[ServletContextHandler]

getHandlers returns just a single ServletContextHandler (in a collection) that gives metrics

snapshot in JSON format at every request at servletPath URI path.

Note getHandlers is used exclusively when MetricsSystem is requested for metrics ServletContextHandlers.


Metrics Configuration Properties


Table 1. Metrics Configuration Properties

spark.metrics.conf
  Default: metrics.properties
  The metrics configuration file.

spark.metrics.namespace
  Default: Spark application’s ID (i.e. spark.app.id configuration property)
  Root namespace for metrics reporting.
  Since a Spark application’s ID changes with every execution of a Spark application, a custom spark.metrics.namespace can be specified for an easier metrics reporting.
  Used when MetricsSystem is requested for a metrics source identifier (aka metrics namespace).
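For example (a sketch; the namespace value is illustrative), a fixed namespace keeps the driver's metrics source names stable across runs:

import org.apache.spark.SparkConf

// With this setting, a driver metrics source such as DAGScheduler is reported as
// "my_app.driver.DAGScheduler" instead of "<application id>.driver.DAGScheduler"
val conf = new SparkConf()
  .set("spark.metrics.namespace", "my_app")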


Status REST API — Monitoring Spark Applications Using REST API
Status REST API is a collection of REST endpoints under /api/v1 URI path in the root
containers for application UI information:

SparkUI - Application UI for an active Spark application (i.e. a Spark application that is
still running)

HistoryServer - Application UI for active and completed Spark applications (i.e. Spark
applications that are still running or have already finished)

Status REST API uses ApiRootResource main resource class that registers /api/v1 URI
path and the subpaths.

Table 1. URI Paths


Path Description
applications Delegates to the ApplicationListResource resource class

applications/{appId} Delegates to the OneApplicationResource resource class

version Creates a VersionInfo with the current version of Spark

Status REST API uses the following components:

Jersey RESTful Web Services framework with support for the Java API for RESTful
Web Services (JAX-RS API)

Eclipse Jetty as the lightweight HTTP server and the Java Servlet container


ApiRootResource — /api/v1 URI Handler


ApiRootResource is the ApiRequestContext for the /v1 URI path.

ApiRootResource uses @Path("/v1") annotation at the class level. It is a partial URI path

template relative to the base URI of the server on which the resource is deployed, the
context root of the application, and the URL pattern to which the JAX-RS runtime responds.

Tip: Learn more about @Path annotation in The @Path Annotation and URI Path Templates.

ApiRootResource registers the /api/* context handler (with the REST resources and

providers in org.apache.spark.status.api.v1 package).

With the @Path("/v1") annotation and after registering the /api/* context handler,
ApiRootResource serves HTTP requests for paths under the /api/v1 URI paths for SparkUI

and HistoryServer.

ApiRootResource gives the metrics of a Spark application in JSON format (using JAX-RS

API).


// start spark-shell
$ http http://localhost:4040/api/v1/applications
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 257
Content-Type: application/json
Date: Tue, 05 Jun 2018 18:36:16 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

[
{
"attempts": [
{
"appSparkVersion": "2.3.1-SNAPSHOT",
"completed": false,
"duration": 0,
"endTime": "1969-12-31T23:59:59.999GMT",
"endTimeEpoch": -1,
"lastUpdated": "2018-06-05T15:04:48.328GMT",
"lastUpdatedEpoch": 1528211088328,
"sparkUser": "jacek",
"startTime": "2018-06-05T15:04:48.328GMT",
"startTimeEpoch": 1528211088328
}
],
"id": "local-1528211089216",
"name": "Spark shell"
}
]

// Fixed in Spark 2.3.1


// https://issues.apache.org/jira/browse/SPARK-24188
$ http http://localhost:4040/api/v1/version
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 43
Content-Type: application/json
Date: Thu, 14 Jun 2018 08:19:06 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

{
"spark": "2.3.1"
}


Table 1. ApiRootResource’s Paths

applications
  Delegates to the ApplicationListResource resource class

applications/{appId}
  Delegates to the OneApplicationResource resource class

version (GET)
  Creates a VersionInfo with the current version of Spark

Creating /api/* Context Handler —  getServletHandler Method

getServletHandler(uiRoot: UIRoot): ServletContextHandler

getServletHandler creates a Jetty ServletContextHandler for /api context path.

Note The Jetty ServletContextHandler created does not support HTTP sessions as REST API is stateless.

getServletHandler creates a Jetty ServletHolder with the resources and providers in

org.apache.spark.status.api.v1 package. It then registers the ServletHolder to serve /*

context path (under the ServletContextHandler for /api ).

getServletHandler requests UIRootFromServletContext to setUiRoot with the

ServletContextHandler and the input UIRoot.

Note getServletHandler is used when SparkUI and HistoryServer are requested to initialize.


ApplicationListResource — applications URI
Handler
ApplicationListResource is a ApiRequestContext that ApiRootResource uses to handle

applications URI path.

Table 1. ApplicationListResource’s Paths


Path HTTP Method Description
/ GET appList

// start spark-shell
// there should be a single Spark application -- the spark-shell itself
$ http http://localhost:4040/api/v1/applications
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 255
Content-Type: application/json
Date: Wed, 06 Jun 2018 12:40:33 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

[
{
"attempts": [
{
"appSparkVersion": "2.3.1-SNAPSHOT",
"completed": false,
"duration": 0,
"endTime": "1969-12-31T23:59:59.999GMT",
"endTimeEpoch": -1,
"lastUpdated": "2018-06-06T12:30:19.220GMT",
"lastUpdatedEpoch": 1528288219220,
"sparkUser": "jacek",
"startTime": "2018-06-06T12:30:19.220GMT",
"startTimeEpoch": 1528288219220
}
],
"id": "local-1528288219790",
"name": "Spark shell"
}
]

isAttemptInRange Internal Method


isAttemptInRange(
attempt: ApplicationAttemptInfo,
minStartDate: SimpleDateParam,
maxStartDate: SimpleDateParam,
minEndDate: SimpleDateParam,
maxEndDate: SimpleDateParam,
anyRunning: Boolean): Boolean

isAttemptInRange …​FIXME

Note isAttemptInRange is used exclusively when ApplicationListResource is requested to handle a GET / HTTP request.

appList Method

appList(
@QueryParam("status") status: JList[ApplicationStatus],
@DefaultValue("2010-01-01") @QueryParam("minDate") minDate: SimpleDateParam,
@DefaultValue("3000-01-01") @QueryParam("maxDate") maxDate: SimpleDateParam,
@DefaultValue("2010-01-01") @QueryParam("minEndDate") minEndDate: SimpleDateParam,
@DefaultValue("3000-01-01") @QueryParam("maxEndDate") maxEndDate: SimpleDateParam,
@QueryParam("limit") limit: Integer)
: Iterator[ApplicationInfo]

appList …​FIXME

Note appList is used when…​FIXME


OneApplicationResource — applications/appId
URI Handler
OneApplicationResource is a AbstractApplicationResource (and so a ApiRequestContext

indirectly) that ApiRootResource uses to handle applications/appId URI path.

Table 1. OneApplicationResource’s Paths


Path HTTP Method Description
/ GET getApp

// start spark-shell
// there should be a single Spark application -- the spark-shell itself
$ http http://localhost:4040/api/v1/applications
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 255
Content-Type: application/json
Date: Wed, 06 Jun 2018 12:40:33 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

[
{
"attempts": [
{
"appSparkVersion": "2.3.1-SNAPSHOT",
"completed": false,
"duration": 0,
"endTime": "1969-12-31T23:59:59.999GMT",
"endTimeEpoch": -1,
"lastUpdated": "2018-06-06T12:30:19.220GMT",
"lastUpdatedEpoch": 1528288219220,
"sparkUser": "jacek",
"startTime": "2018-06-06T12:30:19.220GMT",
"startTimeEpoch": 1528288219220
}
],
"id": "local-1528288219790",
"name": "Spark shell"
}
]

$ http http://localhost:4040/api/v1/applications/local-1528288219790
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 255
Content-Type: application/json
Date: Wed, 06 Jun 2018 12:41:43 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

{
"attempts": [
{
"appSparkVersion": "2.3.1-SNAPSHOT",
"completed": false,
"duration": 0,
"endTime": "1969-12-31T23:59:59.999GMT",
"endTimeEpoch": -1,
"lastUpdated": "2018-06-06T12:30:19.220GMT",
"lastUpdatedEpoch": 1528288219220,
"sparkUser": "jacek",
"startTime": "2018-06-06T12:30:19.220GMT",
"startTimeEpoch": 1528288219220
}
],
"id": "local-1528288219790",
"name": "Spark shell"
}

getApp Method

getApp(): ApplicationInfo

getApp requests the UIRoot for the application info (given the appId).

In the end, getApp returns the ApplicationInfo if available or reports a


NotFoundException :

unknown app: [appId]


StagesResource
StagesResource is…​FIXME

Table 1. StagesResource’s Paths

/ (GET)
  stageList

{stageId: \d+} (GET)
  stageData

{stageId: \d+}/{stageAttemptId: \d+} (GET)
  oneAttemptData

{stageId: \d+}/{stageAttemptId: \d+}/taskSummary (GET)
  taskSummary

{stageId: \d+}/{stageAttemptId: \d+}/taskList (GET)
  taskList

stageList Method

stageList(@QueryParam("status") statuses: JList[StageStatus]): Seq[StageData]

stageList …​FIXME

Note stageList is used when…​FIXME

stageData Method

stageData(
@PathParam("stageId") stageId: Int,
@QueryParam("details") @DefaultValue("true") details: Boolean): Seq[StageData]

stageData …​FIXME

Note stageData is used when…​FIXME

oneAttemptData Method


oneAttemptData(
@PathParam("stageId") stageId: Int,
@PathParam("stageAttemptId") stageAttemptId: Int,
@QueryParam("details") @DefaultValue("true") details: Boolean): StageData

oneAttemptData …​FIXME

Note oneAttemptData is used when…​FIXME

taskSummary Method

taskSummary(
  @PathParam("stageId") stageId: Int,
  @PathParam("stageAttemptId") stageAttemptId: Int,
  @DefaultValue("0.05,0.25,0.5,0.75,0.95") @QueryParam("quantiles") quantileString: String): TaskMetricDistributions

taskSummary …​FIXME

Note taskSummary is used when…​FIXME

taskList Method

taskList(
@PathParam("stageId") stageId: Int,
@PathParam("stageAttemptId") stageAttemptId: Int,
@DefaultValue("0") @QueryParam("offset") offset: Int,
@DefaultValue("20") @QueryParam("length") length: Int,
@DefaultValue("ID") @QueryParam("sortBy") sortBy: TaskSorting): Seq[TaskData]

taskList …​FIXME

Note taskList is used when…​FIXME


OneApplicationAttemptResource
OneApplicationAttemptResource is a AbstractApplicationResource (and so a

ApiRequestContext indirectly).

OneApplicationAttemptResource is used when AbstractApplicationResource is requested to

applicationAttempt.

Table 1. OneApplicationAttemptResource’s Paths


Path HTTP Method Description
/ GET getAttempt

// start spark-shell
// there should be a single Spark application -- the spark-shell itself
// CAUTION: FIXME Demo of OneApplicationAttemptResource in Action

getAttempt Method

getAttempt(): ApplicationAttemptInfo

getAttempt requests the UIRoot for the application info (given the appId) and finds the

attemptId among the available attempts.

Note appId and attemptId are path parameters.

In the end, getAttempt returns the ApplicationAttemptInfo if available or reports a


NotFoundException :

unknown app [appId], attempt [attemptId]


AbstractApplicationResource
AbstractApplicationResource is a BaseAppResource with a set of URI paths that are

common across implementations.


// start spark-shell
$ http http://localhost:4040/api/v1/applications
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 257
Content-Type: application/json
Date: Tue, 05 Jun 2018 18:46:32 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

[
{
"attempts": [
{
"appSparkVersion": "2.3.1-SNAPSHOT",
"completed": false,
"duration": 0,
"endTime": "1969-12-31T23:59:59.999GMT",
"endTimeEpoch": -1,
"lastUpdated": "2018-06-05T15:04:48.328GMT",
"lastUpdatedEpoch": 1528211088328,
"sparkUser": "jacek",
"startTime": "2018-06-05T15:04:48.328GMT",
"startTimeEpoch": 1528211088328
}
],
"id": "local-1528211089216",
"name": "Spark shell"
}
]

$ http http://localhost:4040/api/v1/applications/local-1528211089216/storage/rdd
HTTP/1.1 200 OK
Content-Length: 3
Content-Type: application/json
Date: Tue, 05 Jun 2018 18:48:00 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

[]

// Execute the following query in spark-shell


spark.range(5).cache.count

$ http http://localhost:4040/api/v1/applications/local-1528211089216/storage/rdd
// output omitted for brevity


Table 1. AbstractApplicationResources
AbstractApplicationResource Description

OneApplicationResource Handles applications/appId requests

OneApplicationAttemptResource

Table 2. AbstractApplicationResource’s Paths


Path HTTP Method Description
allexecutors GET allExecutorList

environment GET environmentInfo

executors GET executorList

jobs GET jobsList

jobs/{jobId: \\d+} GET oneJob

logs GET getEventLogs

stages stages

storage/rdd/{rddId: \\d+} GET rddData

storage/rdd GET rddList

rddList Method

rddList(): Seq[RDDStorageInfo]

rddList …​FIXME

Note rddList is used when…​FIXME

environmentInfo Method

environmentInfo(): ApplicationEnvironmentInfo

environmentInfo …​FIXME


Note environmentInfo is used when…​FIXME

rddData Method

rddData(@PathParam("rddId") rddId: Int): RDDStorageInfo

rddData …​FIXME

Note rddData is used when…​FIXME

allExecutorList Method

allExecutorList(): Seq[ExecutorSummary]

allExecutorList …​FIXME

Note allExecutorList is used when…​FIXME

executorList Method

executorList(): Seq[ExecutorSummary]

executorList …​FIXME

Note executorList is used when…​FIXME

oneJob Method

oneJob(@PathParam("jobId") jobId: Int): JobData

oneJob …​FIXME

Note oneJob is used when…​FIXME

jobsList Method

jobsList(@QueryParam("status") statuses: JList[JobExecutionStatus]): Seq[JobData]


jobsList …​FIXME

Note jobsList is used when…​FIXME


BaseAppResource
BaseAppResource is the contract of ApiRequestContexts that can withUI and use appId and

attemptId path parameters in URI paths.

Table 1. BaseAppResource’s Path Parameters


Name Description
@PathParam("appId")
appId
Used when…​FIXME

@PathParam("attemptId")
attemptId
Used when…​FIXME

Table 2. BaseAppResources
BaseAppResource Description
AbstractApplicationResource

BaseStreamingAppResource

StagesResource

Note BaseAppResource is a private[v1] contract.

withUI Method

withUI[T](fn: SparkUI => T): T

withUI …​FIXME

Note withUI is used when…​FIXME


ApiRequestContext
ApiRequestContext is the contract of…​FIXME

package org.apache.spark.status.api.v1

trait ApiRequestContext {
// only required methods that have no implementation
// the others follow
@Context
var servletContext: ServletContext = _

@Context
var httpRequest: HttpServletRequest = _
}

Note ApiRequestContext is a private[v1] contract.

Table 1. ApiRequestContext Contract


Method Description
Java Servlets' HttpServletRequest
httpRequest
Used when…​FIXME

Java Servlets' ServletContext


servletContext
Used when…​FIXME

Table 2. ApiRequestContexts
ApiRequestContext Description
ApiRootResource

ApiStreamingApp

ApplicationListResource

BaseAppResource

SecurityFilter

Getting Current UIRoot —  uiRoot Method


uiRoot: UIRoot

uiRoot simply requests UIRootFromServletContext to get the current UIRoot (for the given

servletContext).

Note uiRoot is used when…​FIXME


UIRoot — Contract for Root Containers of Application UI Information
UIRoot is the contract of the root containers for application UI information.

package org.apache.spark.status.api.v1

trait UIRoot {
// only required methods that have no implementation
// the others follow
def withSparkUI[T](appId: String, attemptId: Option[String])(fn: SparkUI => T): T
def getApplicationInfoList: Iterator[ApplicationInfo]
def getApplicationInfo(appId: String): Option[ApplicationInfo]
def securityManager: SecurityManager
}

Note UIRoot is a private[spark] contract.

Table 1. UIRoot Contract


Method Description
getApplicationInfo Used when…​FIXME

getApplicationInfoList Used when…​FIXME

securityManager Used when…​FIXME

withSparkUI Used exclusively when BaseAppResource is requested to withUI

Table 2. UIRoots
UIRoot Description
Application UI for active and completed Spark
HistoryServer applications (i.e. Spark applications that are still running
or have already finished)

Application UI for an active Spark application (i.e. a Spark


SparkUI
application that is still running)

writeEventLogs Method


writeEventLogs(appId: String, attemptId: Option[String], zipStream: ZipOutputStream): Unit

writeEventLogs …​FIXME

Note writeEventLogs is used when…​FIXME


UIRootFromServletContext
UIRootFromServletContext manages the current UIRoot object in a Jetty ContextHandler .

UIRootFromServletContext uses its canonical name for the context attribute that is used to

set or get the current UIRoot object (in Jetty’s ContextHandler ).

Note ContextHandler is the environment for multiple Jetty Handlers , e.g. URI context path, class loader, static resource base.

In essence, UIRootFromServletContext is simply a "bridge" between two worlds, Spark’s


UIRoot and Jetty’s ContextHandler .

setUiRoot Method

setUiRoot(contextHandler: ContextHandler, uiRoot: UIRoot): Unit

setUiRoot …​FIXME

Note setUiRoot is used exclusively when ApiRootResource is requested to register the /api/* context handler.

getUiRoot Method

getUiRoot(context: ServletContext): UIRoot

getUiRoot …​FIXME

Note getUiRoot is used exclusively when ApiRequestContext is requested for the current UIRoot.


Spark MLlib
Caution: I’m new to Machine Learning as a discipline and Spark MLlib in particular so mistakes in this document are considered a norm (not an exception).

Spark MLlib is a module (a library / an extension) of Apache Spark to provide distributed


machine learning algorithms on top of Spark’s RDD abstraction. Its goal is to simplify the
development and usage of large scale machine learning.

You can find the following types of machine learning algorithms in MLlib:

Classification

Regression

Frequent itemsets (via FP-growth Algorithm)

Recommendation

Feature extraction and selection

Clustering

Statistics

Linear Algebra

You can also do the following using MLlib:

Model import and export

Pipelines

Note: There are two libraries for Machine Learning in Spark MLlib: org.apache.spark.mllib for RDD-based Machine Learning and a higher-level API under org.apache.spark.ml for DataFrame-based Machine Learning with Pipelines.

Machine Learning uses large datasets to identify (infer) patterns and make decisions (aka
predictions). Automated decision making is what makes Machine Learning so appealing.
You can teach a system from a dataset and let the system act by itself to predict the future.

The amount of data (measured in TB or PB) is what makes Spark MLlib especially important
since a human could not possibly extract much value from the dataset in a short time.

Spark handles data distribution and makes the huge data available by means of RDDs,
DataFrames, and recently Datasets.

185
Spark MLlib — Machine Learning in Spark

Use cases for Machine Learning (and hence Spark MLlib that comes with appropriate
algorithms):

Security monitoring and fraud detection

Operational optimizations

Product recommendations or (more broadly) Marketing optimization

Ad serving and optimization

Concepts
This section introduces the concepts of Machine Learning and how they are modeled in
Spark MLlib.

Observation
An observation is used to learn about or evaluate (i.e. draw conclusions about) the
observed item’s target value.

Spark models observations as rows in a DataFrame .

Feature
A feature (aka dimension or variable) is an attribute of an observation. It is an independent
variable.

Spark models features as columns in a DataFrame (one per feature or a set of features).

Note Ultimately, it is up to an algorithm to expect one or many features per column.

There are two classes of features:

Categorical with discrete values, i.e. the set of possible values is limited, and can range
from one to many thousands. There is no ordering implied, and so the values are
incomparable.

Numerical with quantitative values, i.e. any numerical values that you can compare to
each other. You can further classify them into discrete and continuous features.

Label
A label is a variable that a machine learning system learns to predict. Labels are assigned to observations.

There are categorical and numerical labels.

A label is a dependent variable that depends on other dependent or independent variables like features.
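
A tiny illustration of these concepts in spark-shell terms (a made-up dataset: every row is an observation, height and weight are feature columns, and label is the value the system learns to predict):

// rows = observations, columns = features, label = the target to predict
val observations = Seq(
  (180.0, 75.0, 1.0),
  (160.0, 60.0, 0.0)).toDF("height", "weight", "label")

scala> observations.show
+------+------+-----+
|height|weight|label|
+------+------+-----+
| 180.0|  75.0|  1.0|
| 160.0|  60.0|  0.0|
+------+------+-----+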

FP-growth Algorithm
Spark 1.5 significantly improved on frequent pattern mining capabilities with new algorithms for association rule generation and sequential pattern mining.

Frequent Itemset Mining using the Parallel FP-growth algorithm (since Spark 1.3)

Frequent Pattern Mining in MLlib User Guide

frequent pattern mining

reveals the most frequently visited site in a particular period

finds popular routing paths that generate most traffic in a particular region

models its input as a set of transactions, e.g. a path of nodes.

A transaction is a set of items, e.g. network nodes.

the algorithm looks for common subsets of items that appear across transactions,
e.g. sub-paths of the network that are frequently traversed.

A naive solution: generate all possible itemsets and count their occurrence

A subset is considered a pattern when it appears in some minimum proportion of all transactions - the support.

the items in a transaction are unordered

analyzing traffic patterns from network logs

the algorithm finds all frequent itemsets without generating and testing all
candidates

suffix trees (FP-trees) constructed and grown from filtered transactions

Also available in Mahout, but slower.

Distributed generation of association rules (since Spark 1.5).

in a retailer’s transaction database, a rule {toothbrush, floss} ⇒ {toothpaste} with a confidence value 0.8 would indicate that 80% of customers who buy a toothbrush and floss also purchase a toothpaste in the same transaction. The retailer could then use this information, put both toothbrush and floss on sale, but raise the price of toothpaste to increase overall profit.

FPGrowth model

parallel sequential pattern mining (since Spark 1.5)

PrefixSpan algorithm with modifications to parallelize the algorithm for Spark.

extract frequent sequential patterns like routing updates, activation failures, and
broadcasting timeouts that could potentially lead to customer complaints and
proactively reach out to customers when it happens.
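
A minimal sketch of the RDD-based Parallel FP-growth API sketched in the notes above, assuming a spark-shell session (every element of the RDD is one transaction, i.e. a set of items; minSupport is the minimum proportion of transactions an itemset must appear in):

import org.apache.spark.mllib.fpm.FPGrowth

// every element is one transaction, i.e. a set of (unique) items
val transactions = sc.parallelize(Seq(
  Array("toothbrush", "floss", "toothpaste"),
  Array("toothbrush", "floss"),
  Array("toothbrush", "toothpaste"),
  Array("floss", "toothpaste")))

val fpg = new FPGrowth().setMinSupport(0.5)
val model = fpg.run(transactions)

// frequent itemsets together with how often they occur
model.freqItemsets.collect.foreach { is =>
  println(is.items.mkString("{", ",", "}") + " -> " + is.freq) }

// association rules with a minimum confidence of 0.8 (available since Spark 1.5)
model.generateAssociationRules(0.8).collect.foreach { rule =>
  println(rule.antecedent.mkString("{", ",", "}") +
    " => " + rule.consequent.mkString("{", ",", "}") +
    " (confidence " + rule.confidence + ")") }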

Power Iteration Clustering


since Spark 1.3

unsupervised learning including clustering

identifying similar behaviors among users or network clusters

Power Iteration Clustering (PIC) in MLlib, a simple and scalable graph clustering
method

PIC in MLlib User Guide

org.apache.spark.mllib.clustering.PowerIterationClustering

a graph algorithm

Among the first MLlib algorithms built upon GraphX.

takes an undirected graph with similarities defined on edges and outputs clustering
assignment on nodes

uses truncated power iteration to find a very low-dimensional embedding of the nodes, and this embedding leads to effective graph clustering.

stores the normalized similarity matrix as a graph with normalized similarities defined as edge properties

The edge properties are cached and remain static during the power iterations.

The embedding of nodes is defined as node properties on the same graph topology.

update the embedding through power iterations, where aggregateMessages is used to compute matrix-vector multiplications, the essential operation in a power iteration method


k-means is used to cluster nodes using the embedding.

able to distinguish clearly the degree of similarity – as represented by the Euclidean distance among the points – even though their relationship is non-linear
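
A minimal sketch of the PowerIterationClustering API described in the notes above (the similarity triples are made up: each (srcId, dstId, similarity) tuple is one undirected edge of the similarity graph):

import org.apache.spark.mllib.clustering.PowerIterationClustering

// (srcId, dstId, similarity) triples describing the similarity graph
val similarities = sc.parallelize(Seq(
  (0L, 1L, 1.0), (1L, 2L, 1.0), (0L, 2L, 1.0),
  (3L, 4L, 1.0), (4L, 5L, 1.0), (3L, 5L, 1.0),
  (2L, 3L, 0.1)))

val pic = new PowerIterationClustering()
  .setK(2)               // number of clusters
  .setMaxIterations(10)  // number of power iterations

val model = pic.run(similarities)

// every node gets a cluster assignment
model.assignments.collect.foreach { a =>
  println(s"node ${a.id} -> cluster ${a.cluster}") }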

Further reading or watching


Improved Frequent Pattern Mining in Spark 1.5: Association Rules and Sequential
Patterns

New MLlib Algorithms in Spark 1.3: FP-Growth and Power Iteration Clustering

(video) GOTO 2015 • A Taste of Random Decision Forests on Apache Spark • Sean
Owen


ML Pipelines (spark.ml)
ML Pipeline API (aka Spark ML or spark.ml due to the package the API lives in) lets Spark
users quickly and easily assemble and configure practical distributed Machine Learning
pipelines (aka workflows) by standardizing the APIs for different Machine Learning concepts.

Both scikit-learn and GraphLab have the concept of pipelines built into their
Note
system.

The ML Pipeline API is a new DataFrame-based API developed under the org.apache.spark.ml package and is the primary API for MLlib as of Spark 2.0.

Important The previous RDD-based API under the org.apache.spark.mllib package is in maintenance-only mode, which means that it is still maintained with bug fixes but no new features are expected.

The key concepts of Pipeline API (aka spark.ml Components):

Pipeline

PipelineStage

Transformers

Models

Estimators

Evaluator

Params (and ParamMaps)

Figure 1. Pipeline with Transformers and Estimator (and corresponding Model)


The beauty of using Spark ML is that the ML dataset is simply a DataFrame (and all
calculations are simply UDF applications on columns).


Use of a machine learning algorithm is only one component of a predictive analytic workflow. There can also be additional pre-processing steps for the machine learning algorithm to work.

Note While an RDD computation in Spark Core, a Dataset manipulation in Spark SQL, and a continuous DStream computation in Spark Streaming are the main data abstractions of their modules, a ML Pipeline is the main data abstraction in Spark MLlib.

A typical standard machine learning workflow is as follows:

1. Loading data (aka data ingestion)

2. Extracting features (aka feature extraction)

3. Training model (aka model training)

4. Evaluate (or predictionize)

You may also think of a few additional steps before the final model becomes production ready and hence of any use:

1. Testing model (aka model testing)

2. Selecting the best model (aka model selection or model tuning)

3. Deploying model (aka model deployment and integration)

Note The Pipeline API lives under org.apache.spark.ml package.

Given the Pipeline Components, a typical machine learning pipeline is as follows:

You use a collection of Transformer instances to prepare input DataFrame - the dataset
with proper input data (in columns) for a chosen ML algorithm.

You then fit (aka build) a Model .

With a Model you can calculate predictions (in prediction column) on features input
column through DataFrame transformation.

Example: In text classification, preprocessing steps like n-gram extraction, and TF-IDF
feature weighting are often necessary before training of a classification model like an SVM.

Upon deploying a model, your system must not only know the SVM weights to apply to input
features, but also transform raw data into the format the model is trained on.

Pipeline for text categorization

Pipeline for image classification

Pipelines are like a query plan in a database system.
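
A minimal text-classification sketch of that flow, assuming Tokenizer and HashingTF as the preparing Transformers and LogisticRegression as the Estimator (the toy dataset is made up):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.classification.LogisticRegression

val training = Seq(
  (0L, "spark is great", 1.0),
  (1L, "who needs hadoop", 0.0)).toDF("id", "text", "label")

// Transformers prepare the features column...
val tok = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
// ...and the Estimator fits a Model on it
val lr = new LogisticRegression().setMaxIter(5)

val pipeline = new Pipeline().setStages(Array(tok, hashingTF, lr))

// fit builds a PipelineModel...
val model = pipeline.fit(training)

// ...that adds a prediction column through a DataFrame transformation
model.transform(training).select("text", "prediction").show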


Components of ML Pipeline:

Pipeline Construction Framework – A DSL for the construction of pipelines that includes concepts of Nodes and Pipelines.

Nodes are data transformation steps (Transformers)

Pipelines are a DAG of Nodes.

Pipelines become objects that can be saved out and applied in real-time to new
data.

It can help create domain-specific feature transformers, general purpose transformers, statistical utilities and nodes.

You could persist (i.e. save to a persistent storage) or unpersist (i.e. load from a
persistent storage) ML components as described in Persisting Machine Learning
Components.

Note A ML component is any object that belongs to the Pipeline API, e.g. Pipeline, LinearRegressionModel, etc.

Features of Pipeline API


The features of the Pipeline API in Spark MLlib:

DataFrame as a dataset format

ML Pipelines API is similar to scikit-learn

Easy debugging (via inspecting columns added during execution)

Parameter tuning

Compositions (to build more complex pipelines out of existing ones)

Pipelines
A ML pipeline (or a ML workflow) is a sequence of Transformers and Estimators to fit a
PipelineModel to an input dataset.

pipeline: DataFrame =[fit]=> DataFrame (using transformers and estimators)

A pipeline is represented by Pipeline class.

import org.apache.spark.ml.Pipeline


Pipeline is also an Estimator (so it is acceptable to set up a Pipeline with other

Pipeline instances).

The Pipeline object can read or load pipelines (refer to Persisting Machine Learning
Components page).

read: MLReader[Pipeline]
load(path: String): Pipeline

You can create a Pipeline with an optional uid identifier. It is of the format
pipeline_[randomUid] when unspecified.

val pipeline = new Pipeline()

scala> println(pipeline.uid)
pipeline_94be47c3b709

val pipeline = new Pipeline("my_pipeline")

scala> println(pipeline.uid)
my_pipeline

The identifier uid is used to create an instance of PipelineModel to return from the fit(dataset: DataFrame): PipelineModel method.

scala> val pipeline = new Pipeline("my_pipeline")


pipeline: org.apache.spark.ml.Pipeline = my_pipeline

scala> val df = (0 to 9).toDF("num")


df: org.apache.spark.sql.DataFrame = [num: int]

scala> val model = pipeline.setStages(Array()).fit(df)


model: org.apache.spark.ml.PipelineModel = my_pipeline

The stages mandatory parameter can be set using the setStages(value: Array[PipelineStage]): this.type method.

Pipeline Fitting (fit method)

fit(dataset: DataFrame): PipelineModel

The fit method returns a PipelineModel that holds a collection of Transformer objects
that are results of Estimator.fit method for every Estimator in the Pipeline (with possibly-
modified dataset ) or simply input Transformer objects. The input dataset DataFrame is passed to transform for every Transformer instance in the Pipeline.

It first transforms the schema of the input dataset DataFrame.

It then searches for the index of the last Estimator to calculate Transformers for Estimator
and simply return Transformer back up to the index in the pipeline. For each Estimator the
fit method is called with the input dataset . The result DataFrame is passed to the next

Transformer in the chain.

Note An IllegalArgumentException exception is thrown when a stage is neither an Estimator nor a Transformer .

transform method is called for every Transformer calculated but the last one (that is the

result of executing fit on the last Estimator ).

The calculated Transformers are collected.

After the last Estimator there can only be Transformer stages.

The method returns a PipelineModel with uid and transformers. The parent Estimator is
the Pipeline itself.

Further reading or watching


ML Pipelines

ML Pipelines: A New High-Level API for MLlib

(video) Building, Debugging, and Tuning Spark Machine Learning Pipelines - Joseph
Bradley (Databricks)

(video) Spark MLlib: Making Practical Machine Learning Easy and Scalable

(video) Apache Spark MLlib 2 0 Preview: Data Science and Production by Joseph K.
Bradley (Databricks)


Pipeline — ML Pipeline Component


Pipeline is a ML component in Spark MLlib 2 that…​FIXME


PipelineStage — ML Pipeline Component


The PipelineStage abstract class represents a single stage in a Pipeline.

PipelineStage has the following direct implementations (of which few are abstract classes,

too):

Estimators

Models

Pipeline

Predictor

Transformer

Each PipelineStage transforms schema using the transformSchema family of methods:

transformSchema(schema: StructType): StructType
transformSchema(schema: StructType, logging: Boolean): StructType

Note StructType describes a schema of a DataFrame.

Tip Enable DEBUG logging level for the respective PipelineStage implementations to see what happens beneath.
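
A short illustration of transformSchema using a Tokenizer, assuming a Spark version where the one-argument transformSchema is publicly callable from spark-shell:

import org.apache.spark.ml.feature.Tokenizer

val df = Seq((0, "hello world")).toDF("id", "sentence")
val tok = new Tokenizer().setInputCol("sentence").setOutputCol("tokens")

// transformSchema validates the input schema and describes the output schema
// (here: id and sentence plus a new array-of-strings tokens column)
// without touching any rows
val outputSchema = tok.transformSchema(df.schema)
println(outputSchema.treeString)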


Transformers
A transformer is a ML Pipeline component that transforms a DataFrame into another
DataFrame (both called datasets).

transformer: DataFrame =[transform]=> DataFrame

Transformers prepare a dataset for an machine learning algorithm to work with. They are
also very helpful to transform DataFrames in general (even outside the machine learning
space).

Transformers are instances of the org.apache.spark.ml.Transformer abstract class that offers the transform family of methods:

transform(dataset: DataFrame): DataFrame
transform(dataset: DataFrame, paramMap: ParamMap): DataFrame
transform(dataset: DataFrame, firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame

A Transformer is a PipelineStage and thus can be a part of a Pipeline.

A few available implementations of Transformer :

StopWordsRemover

Binarizer

SQLTransformer

VectorAssembler — a feature transformer that assembles (merges) multiple columns into a (feature) vector column.

UnaryTransformer

Tokenizer

RegexTokenizer

NGram

HashingTF

OneHotEncoder

Model


See Custom UnaryTransformer section for a custom Transformer implementation.

StopWordsRemover
StopWordsRemover is a machine learning feature transformer that takes a string array column

and outputs a string array column with all defined stop words removed. The transformer
comes with a standard set of English stop words as default (that are the same as scikit-learn
uses, i.e. from the Glasgow Information Retrieval Group).

Note It works as if it were a UnaryTransformer but it has not been migrated to extend the class yet.

StopWordsRemover class belongs to org.apache.spark.ml.feature package.

import org.apache.spark.ml.feature.StopWordsRemover
val stopWords = new StopWordsRemover

It accepts the following parameters:

scala> println(stopWords.explainParams)
caseSensitive: whether to do case-sensitive comparison during filtering (default: false
)
inputCol: input column name (undefined)
outputCol: output column name (default: stopWords_9c2c0fdd8a68__output)
stopWords: stop words (default: [Ljava.lang.String;@5dabe7c8)

Note null values from the input array are preserved unless adding null to stopWords explicitly.


import org.apache.spark.ml.feature.RegexTokenizer
val regexTok = new RegexTokenizer("regexTok")
.setInputCol("text")
.setPattern("\\W+")

import org.apache.spark.ml.feature.StopWordsRemover
val stopWords = new StopWordsRemover("stopWords")
.setInputCol(regexTok.getOutputCol)

val df = Seq("please find it done (and empty)", "About to be rich!", "empty")


.zipWithIndex
.toDF("text", "id")

scala> stopWords.transform(regexTok.transform(df)).show(false)
+-------------------------------+---+------------------------------------+------------
-----+
|text |id |regexTok__output |stopWords__o
utput|
+-------------------------------+---+------------------------------------+------------
-----+
|please find it done (and empty)|0 |[please, find, it, done, and, empty]|[]
|
|About to be rich! |1 |[about, to, be, rich] |[rich]
|
|empty |2 |[empty] |[]
|
+-------------------------------+---+------------------------------------+------------
-----+

Binarizer
Binarizer is a Transformer that splits the values in the input column into two groups -

"ones" for values larger than the threshold and "zeros" for the others.

It works with DataFrames with the input column of DoubleType or VectorUDT. The type of
the result output column matches the type of the input column, i.e. DoubleType or
VectorUDT .


import org.apache.spark.ml.feature.Binarizer
val bin = new Binarizer()
.setInputCol("rating")
.setOutputCol("label")
.setThreshold(3.5)

scala> println(bin.explainParams)
inputCol: input column name (current: rating)
outputCol: output column name (default: binarizer_dd9710e2a831__output, current: label
)
threshold: threshold used to binarize continuous features (default: 0.0, current: 3.5)

val doubles = Seq((0, 1d), (1, 1d), (2, 5d)).toDF("id", "rating")

scala> bin.transform(doubles).show
+---+------+-----+
| id|rating|label|
+---+------+-----+
| 0| 1.0| 0.0|
| 1| 1.0| 0.0|
| 2| 5.0| 1.0|
+---+------+-----+

import org.apache.spark.mllib.linalg.Vectors
val denseVec = Vectors.dense(Array(4.0, 0.4, 3.7, 1.5))
val vectors = Seq((0, denseVec)).toDF("id", "rating")

scala> bin.transform(vectors).show
+---+-----------------+-----------------+
| id| rating| label|
+---+-----------------+-----------------+
| 0|[4.0,0.4,3.7,1.5]|[1.0,0.0,1.0,0.0]|
+---+-----------------+-----------------+

SQLTransformer
SQLTransformer is a Transformer that does transformations by executing SELECT …​ FROM __THIS__ with __THIS__ being the underlying temporary table registered for the input dataset.

Internally, __THIS__ is replaced with a random name for a temporary table (using registerTempTable).

Note It has been available since Spark 1.6.0.

It requires that the SELECT query uses __THIS__ that corresponds to a temporary table and simply executes the mandatory statement using the sql method.

You have to specify the mandatory statement parameter using setStatement method.


import org.apache.spark.ml.feature.SQLTransformer
val sql = new SQLTransformer()

// dataset to work with


val df = Seq((0, s"""hello\tworld"""), (1, "two spaces inside")).toDF("label", "sente
nce")

scala> sql.setStatement("SELECT sentence FROM __THIS__ WHERE label = 0").transform(df)


.show
+-----------+
| sentence|
+-----------+
|hello world|
+-----------+

scala> println(sql.explainParams)
statement: SQL statement (current: SELECT sentence FROM __THIS__ WHERE label = 0)

VectorAssembler
VectorAssembler is a feature transformer that assembles (merges) multiple columns into a

(feature) vector column.

It supports columns of the types NumericType , BooleanType , and VectorUDT . Doubles are
passed on untouched. Other numeric types and booleans are cast to doubles.


import org.apache.spark.ml.feature.VectorAssembler
val vecAssembler = new VectorAssembler()

scala> print(vecAssembler.explainParams)
inputCols: input column names (undefined)
outputCol: output column name (default: vecAssembler_5ac31099dbee__output)

final case class Record(id: Int, n1: Int, n2: Double, flag: Boolean)
val ds = Seq(Record(0, 4, 2.0, true)).toDS

scala> ds.printSchema
root
|-- id: integer (nullable = false)
|-- n1: integer (nullable = false)
|-- n2: double (nullable = false)
|-- flag: boolean (nullable = false)

val features = vecAssembler


.setInputCols(Array("n1", "n2", "flag"))
.setOutputCol("features")
.transform(ds)

scala> features.printSchema
root
|-- id: integer (nullable = false)
|-- n1: integer (nullable = false)
|-- n2: double (nullable = false)
|-- flag: boolean (nullable = false)
|-- features: vector (nullable = true)

scala> features.show
+---+---+---+----+-------------+
| id| n1| n2|flag| features|
+---+---+---+----+-------------+
| 0| 4|2.0|true|[4.0,2.0,1.0]|
+---+---+---+----+-------------+

UnaryTransformers
The UnaryTransformer abstract class is a specialized Transformer that applies
transformation to one input column and writes results to another (by appending a new
column).

Each UnaryTransformer defines the input and output columns using the following "chain"
methods (they return the transformer on which they were executed and so are chainable):

setInputCol(value: String)


setOutputCol(value: String)

Each UnaryTransformer calls validateInputType while executing transformSchema(schema: StructType) (that is part of the PipelineStage contract).

Note A UnaryTransformer is a PipelineStage.

When transform is called, it first calls transformSchema (with DEBUG logging enabled) and
then adds the column as a result of calling a protected abstract createTransformFunc .

Note createTransformFunc function is abstract and defined by concrete UnaryTransformer objects.

Internally, transform method uses Spark SQL’s udf to define a function (based on
createTransformFunc function described above) that will create the new output column (with

appropriate outputDataType ). The UDF is later applied to the input column of the input
DataFrame and the result becomes the output column (using DataFrame.withColumn

method).

Note Using udf and withColumn methods from Spark SQL demonstrates an excellent integration between the Spark modules: MLlib and SQL.
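
The pattern described above boils down to the following simplified sketch of what a UnaryTransformer-like transform does (an illustration, not the actual implementation):

import org.apache.spark.sql.functions.{col, udf}

val df = Seq((0, "hello"), (1, "world")).toDF("id", "text")

// createTransformFunc equivalent: a plain Scala function...
val transformFunc: String => String = _.toUpperCase
// ...wrapped in a UDF and applied to the input column with withColumn
val transformUDF = udf(transformFunc)
df.withColumn("text_upper", transformUDF(col("text"))).show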

The following are UnaryTransformer implementations in spark.ml:

Tokenizer that converts a string column to lowercase and then splits it by white spaces.

RegexTokenizer that extracts tokens.

NGram that converts the input array of strings into an array of n-grams.

HashingTF that maps a sequence of terms to their term frequencies (cf. SPARK-13998
HashingTF should extend UnaryTransformer)

OneHotEncoder that maps a numeric input column of label indices onto a column of
binary vectors.

RegexTokenizer
RegexTokenizer is a UnaryTransformer that tokenizes a String into a collection of String .


import org.apache.spark.ml.feature.RegexTokenizer
val regexTok = new RegexTokenizer()

// dataset to transform with tabs and spaces


val df = Seq((0, s"""hello\tworld"""), (1, "two spaces inside")).toDF("label", "sente
nce")

val tokenized = regexTok.setInputCol("sentence").transform(df)

scala> tokenized.show(false)
+-----+------------------+-----------------------------+
|label|sentence |regexTok_810b87af9510__output|
+-----+------------------+-----------------------------+
|0 |hello world |[hello, world] |
|1 |two spaces inside|[two, spaces, inside] |
+-----+------------------+-----------------------------+

Note Read the official scaladoc for org.apache.spark.ml.feature.RegexTokenizer.

It supports minTokenLength parameter that is the minimum token length that you can change
using setMinTokenLength method. It simply filters out smaller tokens and defaults to 1 .

// see above to set up the vals

scala> rt.setInputCol("line").setMinTokenLength(6).transform(df).show
+-----+--------------------+-----------------------------+
|label| line|regexTok_8c74c5e8b83a__output|
+-----+--------------------+-----------------------------+
| 1| hello world| []|
| 2|yet another sentence| [another, sentence]|
+-----+--------------------+-----------------------------+

It has gaps parameter that indicates whether regex splits on gaps ( true ) or matches
tokens ( false ). You can set it using setGaps . It defaults to true .

When set to true (i.e. splits on gaps) it uses Regex.split while Regex.findAllIn for false .


scala> rt.setInputCol("line").setGaps(false).transform(df).show
+-----+--------------------+-----------------------------+
|label| line|regexTok_8c74c5e8b83a__output|
+-----+--------------------+-----------------------------+
| 1| hello world| []|
| 2|yet another sentence| [another, sentence]|
+-----+--------------------+-----------------------------+

scala> rt.setInputCol("line").setGaps(false).setPattern("\\W").transform(df).show(false
)
+-----+--------------------+-----------------------------+
|label|line |regexTok_8c74c5e8b83a__output|
+-----+--------------------+-----------------------------+
|1 |hello world |[] |
|2 |yet another sentence|[another, sentence] |
+-----+--------------------+-----------------------------+

It has pattern parameter that is the regex for tokenizing. It uses Scala’s .r method to
convert the string to regex. Use setPattern to set it. It defaults to \\s+ .

It has toLowercase parameter that indicates whether to convert all characters to lowercase
before tokenizing. Use setToLowercase to change it. It defaults to true .

NGram
In this example you use org.apache.spark.ml.feature.NGram that converts the input
collection of strings into a collection of n-grams (of n words).

import org.apache.spark.ml.feature.NGram

val bigram = new NGram("bigrams")


val df = Seq((0, Seq("hello", "world"))).toDF("id", "tokens")
bigram.setInputCol("tokens").transform(df).show

+---+--------------+---------------+
| id| tokens|bigrams__output|
+---+--------------+---------------+
| 0|[hello, world]| [hello world]|
+---+--------------+---------------+

HashingTF
Another example of a transformer is org.apache.spark.ml.feature.HashingTF that works on a
Column of ArrayType .

It transforms the rows for the input column into a sparse term frequency vector.


import org.apache.spark.ml.feature.HashingTF
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(5000)

// see above for regexTok transformer


val regexedDF = regexTok.transform(df)

// Use HashingTF
val hashedDF = hashingTF.transform(regexedDF)

scala> hashedDF.show(false)
+---+------------------+---------------------+-----------------------------------+
|id |text |words |features |
+---+------------------+---------------------+-----------------------------------+
|0 |hello world |[hello, world] |(5000,[2322,3802],[1.0,1.0])
|
|1 |two spaces inside|[two, spaces, inside]|(5000,[276,940,2533],[1.0,1.0,1.0])|
+---+------------------+---------------------+-----------------------------------+

The name of the output column is optional, and if not specified, it becomes the identifier of a
HashingTF object with the __output suffix.

scala> hashingTF.uid
res7: String = hashingTF_fe3554836819

scala> hashingTF.transform(regexedDF).show(false)
+---+------------------+---------------------+----------------------------------------
---+
|id |text |words |hashingTF_fe3554836819__output
|
+---+------------------+---------------------+----------------------------------------
---+
|0 |hello world |[hello, world] |(262144,[71890,72594],[1.0,1.0])
|
|1 |two spaces inside|[two, spaces, inside]|(262144,[53244,77869,115276],[1.0,1.0,1.0
])|
+---+------------------+---------------------+----------------------------------------
---+

OneHotEncoder
OneHotEncoder is a Transformer that maps a numeric input column of label indices onto a column of binary vectors.


// dataset to transform
val df = Seq(
(0, "a"), (1, "b"),
(2, "c"), (3, "a"),
(4, "a"), (5, "c"))
.toDF("label", "category")
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer().setInputCol("category").setOutputCol("cat_index").fi
t(df)
val indexed = indexer.transform(df)

import org.apache.spark.sql.types.NumericType

scala> indexed.schema("cat_index").dataType.isInstanceOf[NumericType]
res0: Boolean = true

import org.apache.spark.ml.feature.OneHotEncoder
val oneHot = new OneHotEncoder()
.setInputCol("cat_index")
.setOutputCol("cat_vec")

val oneHotted = oneHot.transform(indexed)

scala> oneHotted.show(false)
+-----+--------+---------+-------------+
|label|category|cat_index|cat_vec |
+-----+--------+---------+-------------+
|0 |a |0.0 |(2,[0],[1.0])|
|1 |b |2.0 |(2,[],[]) |
|2 |c |1.0 |(2,[1],[1.0])|
|3 |a |0.0 |(2,[0],[1.0])|
|4 |a |0.0 |(2,[0],[1.0])|
|5 |c |1.0 |(2,[1],[1.0])|
+-----+--------+---------+-------------+

scala> oneHotted.printSchema
root
|-- label: integer (nullable = false)
|-- category: string (nullable = true)
|-- cat_index: double (nullable = true)
|-- cat_vec: vector (nullable = true)

scala> oneHotted.schema("cat_vec").dataType.isInstanceOf[VectorUDT]
res1: Boolean = true

Custom UnaryTransformer
The following class is a custom UnaryTransformer that transforms words using upper letters.


package pl.japila.spark

import org.apache.spark.ml._
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types._

class UpperTransformer(override val uid: String)
  extends UnaryTransformer[String, String, UpperTransformer] {

  def this() = this(Identifiable.randomUID("upper"))

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType)
  }

  protected def createTransformFunc: String => String = {
    _.toUpperCase
  }

  protected def outputDataType: DataType = StringType
}

Given a DataFrame you could use it as follows:

val upper = new UpperTransformer

scala> upper.setInputCol("text").transform(df).show
+---+-----+--------------------------+
| id| text|upper_0b559125fd61__output|
+---+-----+--------------------------+
| 0|hello| HELLO|
| 1|world| WORLD|
+---+-----+--------------------------+


Transformer
Transformer is the contract in Spark MLlib for transformers that transform one dataset into

another.

Transformer is a PipelineStage and so…​FIXME

Transforming Dataset with Extra Parameters —  transform Method

Caution FIXME
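
A sketch of passing extra parameters at transform time, reusing Binarizer from the earlier examples (the extra ParamMap overrides the transformer’s own parameters for this single call):

import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.ml.param.ParamMap

val df = Seq((0, 1.0), (1, 5.0)).toDF("id", "rating")
val bin = new Binarizer().setInputCol("rating").setOutputCol("label")

// the extra ParamMap takes precedence over bin's own parameters for this call only
val extraParams = ParamMap(bin.threshold -> 3.5)
bin.transform(df, extraParams).show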

Transformer Contract

package org.apache.spark.ml

abstract class Transformer {
  // only required methods that have no implementation
  def transform(dataset: Dataset[_]): DataFrame
  def copy(extra: ParamMap): Transformer
}

Table 1. Transformer Contract


Method Description
copy Used when…​

transform Used when…​


Tokenizer
Tokenizer is a unary transformer that converts the column of String values to lowercase

and then splits it by white spaces.

import org.apache.spark.ml.feature.Tokenizer
val tok = new Tokenizer()

// dataset to transform
val df = Seq(
(1, "Hello world!"),
(2, "Here is yet another sentence.")).toDF("id", "sentence")

val tokenized = tok.setInputCol("sentence").setOutputCol("tokens").transform(df)

scala> tokenized.show(truncate = false)


+---+-----------------------------+-----------------------------------+
|id |sentence |tokens |
+---+-----------------------------+-----------------------------------+
|1 |Hello world! |[hello, world!] |
|2 |Here is yet another sentence.|[here, is, yet, another, sentence.]|
+---+-----------------------------+-----------------------------------+


Estimators — ML Pipeline Component


An estimator is an abstraction of a learning algorithm that fits a model on a dataset.

Note That was so machine learning to explain an estimator this way, wasn’t it? It is that the more I spend time with the Pipeline API, the more often I use the terms and phrases from this space. Sorry.

Technically, an Estimator produces a Model (i.e. a Transformer) for a given DataFrame and
parameters (as ParamMap ). It fits a model to the input DataFrame and ParamMap to produce
a Transformer (a Model ) that can calculate predictions for any DataFrame -based input
datasets.

It is basically a function that maps a DataFrame onto a Model through fit method, i.e. it
takes a DataFrame and produces a Transformer as a Model .

estimator: DataFrame =[fit]=> Model

Estimators are instances of the org.apache.spark.ml.Estimator abstract class that comes with the fit method (with the return type M being a Model ):

fit(dataset: DataFrame): M

Estimator is a PipelineStage and so it can be a part of a Pipeline.

Note Pipeline considers Estimator special and executes fit method before transform (as for other Transformer objects in a pipeline). Consult the Pipeline document.


Estimator
Estimator is the contract in Spark MLlib for estimators that fit models to a dataset.

Estimator accepts parameters that you can set through dedicated setter methods upon

creating an Estimator . You could also fit a model with extra parameters.

import org.apache.spark.ml.classification.LogisticRegression

// Define parameters upon creating an Estimator


val lr = new LogisticRegression().
setMaxIter(5).
setRegParam(0.01)
val training: DataFrame = ...
val model1 = lr.fit(training)

// Define parameters through fit


import org.apache.spark.ml.param.ParamMap
val customParams = ParamMap(
lr.maxIter -> 10,
lr.featuresCol -> "custom_features"
)
val model2 = lr.fit(training, customParams)

Estimator is a PipelineStage and so can be a part of a Pipeline.

Estimator Contract

package org.apache.spark.ml

abstract class Estimator[M <: Model[M]] {


// only required methods that have no implementation
def fit(dataset: Dataset[_]): M
def copy(extra: ParamMap): Estimator[M]
}

Table 1. Estimator Contract


Method Description
copy Used when…​

fit Used when…​


Fitting Model with Extra Parameters —  fit Method

fit(dataset: Dataset[_], paramMap: ParamMap): M

fit copies the extra paramMap and fits a model (of type M ).

Note fit is used mainly for model tuning to find the best model (using CrossValidator and TrainValidationSplit).


StringIndexer
org.apache.spark.ml.feature.StringIndexer is an Estimator that produces a

StringIndexerModel .

val df = ('a' to 'j').map(_.toString)
  .zip(0 to 9)
  .map(_.swap)
  .toDF("id", "label")

import org.apache.spark.ml.feature.StringIndexer
val strIdx = new StringIndexer()
.setInputCol("label")
.setOutputCol("index")

scala> println(strIdx.explainParams)
handleInvalid: how to handle invalid entries. Options are skip (which will filter out
rows with bad values), or error (which will throw an error). More options may be added
later (default: error)
inputCol: input column name (current: label)
outputCol: output column name (default: strIdx_ded89298e014__output, current: index)

val model = strIdx.fit(df)


val indexed = model.transform(df)

scala> indexed.show
+---+-----+-----+
| id|label|index|
+---+-----+-----+
| 0| a| 3.0|
| 1| b| 5.0|
| 2| c| 7.0|
| 3| d| 9.0|
| 4| e| 0.0|
| 5| f| 2.0|
| 6| g| 6.0|
| 7| h| 8.0|
| 8| i| 4.0|
| 9| j| 1.0|
+---+-----+-----+


KMeans
KMeans class is an implementation of the K-means clustering algorithm in machine learning

with support for k-means|| (aka k-means parallel) in Spark MLlib.

Roughly, k-means is an unsupervised iterative algorithm that groups input data in a predefined number of k clusters. Each cluster has a centroid which is a cluster center. It is a highly iterative machine learning algorithm that assigns every vector to the cluster with the nearest centroid (mean). The algorithm steps are repeated until convergence or until a specified number of steps has been reached.

Note The K-Means algorithm is also known as Lloyd’s algorithm in computer science.

It is an Estimator that produces a KMeansModel.

Tip Do import org.apache.spark.ml.clustering.KMeans to work with the KMeans algorithm.

KMeans defaults to use the following values:

Number of clusters or centroids ( k ): 2

Maximum number of iterations ( maxIter ): 20

Initialization algorithm ( initMode ): k-means||

Number of steps for the k-means|| ( initSteps ): 5

Convergence tolerance ( tol ): 1e-4

import org.apache.spark.ml.clustering._
val kmeans = new KMeans()

scala> println(kmeans.explainParams)
featuresCol: features column name (default: features)
initMode: initialization algorithm (default: k-means||)
initSteps: number of steps for k-means|| (default: 5)
k: number of clusters to create (default: 2)
maxIter: maximum number of iterations (>= 0) (default: 20)
predictionCol: prediction column name (default: prediction)
seed: random seed (default: -1689246527)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-4)

KMeans assumes that featuresCol is of type VectorUDT and appends predictionCol of

type IntegerType .


Internally, fit method "unwraps" the feature vector in featuresCol column in the input DataFrame and creates an RDD[Vector] . It then hands the call over to the MLlib variant of KMeans in org.apache.spark.mllib.clustering.KMeans . The result is copied to KMeansModel with a calculated KMeansSummary .

Each item (row) in a data set is described by a numeric vector of attributes called features .
A single feature (a dimension of the vector) represents a word (token) with a value that is a
metric that defines the importance of that word or term in the document.

Tip Enable INFO logging level for org.apache.spark.mllib.clustering.KMeans logger to see what happens inside a KMeans .
Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.mllib.clustering.KMeans=INFO

Refer to Logging.

KMeans Example
You can represent a text corpus (document collection) using the vector space model. In this
representation, the vectors have dimension that is the number of different words in the
corpus. It is quite natural to have vectors with a lot of zero values as not all words will be in a
document. We will use an optimized memory representation to avoid zero values using
sparse vectors.

This example shows how to use k-means to classify emails as a spam or not.

// NOTE Don't copy and paste the final case class with the other lines
// It won't work with paste mode in spark-shell
final case class Email(id: Int, text: String)

val emails = Seq(


"This is an email from your lovely wife. Your mom says...",
"SPAM SPAM spam",
"Hello, We'd like to offer you").zipWithIndex.map(_.swap).toDF("id", "text").as[Email
]

// Prepare data for k-means


// Pass emails through a "pipeline" of transformers
import org.apache.spark.ml.feature._
val tok = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("tokens")
.setPattern("\\W+")

val hashTF = new HashingTF()
  .setInputCol("tokens")
.setOutputCol("features")
.setNumFeatures(20)

val preprocess = (tok.transform _).andThen(hashTF.transform)

val features = preprocess(emails.toDF)

scala> features.select('text, 'features).show(false)


+--------------------------------------------------------+----------------------------
--------------------------------+
|text |features
|
+--------------------------------------------------------+----------------------------
--------------------------------+
|This is an email from your lovely wife. Your mom says...|(20,[0,3,6,8,10,11,17,19],[1
.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0])|
|SPAM SPAM spam |(20,[13],[3.0])
|
|Hello, We'd like to offer you |(20,[0,2,7,10,11,19],[2.0,1.0
,1.0,1.0,1.0,1.0]) |
+--------------------------------------------------------+----------------------------
--------------------------------+

import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans

scala> val kmModel = kmeans.fit(features.toDF)


16/04/08 15:57:37 WARN KMeans: The input data is not directly cached, which may hurt p
erformance if its parent RDDs are also uncached.
16/04/08 15:57:37 INFO KMeans: Initialization with k-means|| took 0.219 seconds.
16/04/08 15:57:37 INFO KMeans: Run 0 finished in 1 iterations
16/04/08 15:57:37 INFO KMeans: Iterations took 0.030 seconds.
16/04/08 15:57:37 INFO KMeans: KMeans converged in 1 iterations.
16/04/08 15:57:37 INFO KMeans: The cost for the best run is 5.000000000000002.
16/04/08 15:57:37 WARN KMeans: The input data was not directly cached, which may hurt
performance if its parent RDDs are also uncached.
kmModel: org.apache.spark.ml.clustering.KMeansModel = kmeans_7a13a617ce0b

scala> kmModel.clusterCenters.map(_.toSparse)
res36: Array[org.apache.spark.mllib.linalg.SparseVector] = Array((20,[13],[3.0]), (20,[
0,2,3,6,7,8,10,11,17,19],[1.5,0.5,1.0,0.5,0.5,0.5,1.5,1.0,1.0,1.0]))

val email = Seq("hello mom").toDF("text")


val result = kmModel.transform(preprocess(email))

scala> .show(false)
+---------+------------+---------------------+----------+
|text |tokens |features |prediction|
+---------+------------+---------------------+----------+
|hello mom|[hello, mom]|(20,[2,19],[1.0,1.0])|1 |
+---------+------------+---------------------+----------+


TrainValidationSplit
TrainValidationSplit is…​FIXME
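
A minimal usage sketch of TrainValidationSplit for model tuning (assuming a training DataFrame with label and features columns is already available, and using LinearRegression and RegressionEvaluator as the estimator and evaluator). TrainValidationSplit evaluates each parameter combination on a single train/validation split:

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()

// candidate parameter combinations to try
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.1))
  .addGrid(lr.fitIntercept, Array(true, false))
  .build()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.75)  // 75% training, 25% validation

// training is assumed to be a DataFrame with label and features columns
val bestModel = tvs.fit(training)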

Validating and Transforming Schema —  transformSchema Method

transformSchema(schema: StructType): StructType

Note transformSchema is part of PipelineStage Contract.

transformSchema simply passes the call to transformSchemaImpl (that is shared between

CrossValidator and TrainValidationSplit ).


Predictor
Predictor is an Estimator for a PredictionModel with its own abstract train method.

train(dataset: DataFrame): M

The train method is supposed to ease dealing with schema validation and copying
parameters to a trained PredictionModel model. It also sets the parent of the model to itself.

A Predictor is basically a function that maps a DataFrame onto a PredictionModel .

predictor: DataFrame =[train]=> PredictionModel

It implements the abstract fit(dataset: DataFrame) of the Estimator abstract class that
validates and transforms the schema of a dataset (using a custom transformSchema of
PipelineStage), and then calls the abstract train method.

Validation and transformation of a schema (using transformSchema ) makes sure that:

1. features column exists and is of correct type (defaults to Vector).

2. label column exists and is of Double type.

As the last step, it adds the prediction column of Double type.


RandomForestRegressor
RandomForestRegressor is a Predictor for the Random Forest machine learning algorithm that trains a RandomForestRegressionModel .


import org.apache.spark.mllib.linalg.Vectors
val features = Vectors.sparse(10, Seq((2, 0.2), (4, 0.4)))

val data = (0.0 to 4.0 by 1).map(d => (d, features)).toDF("label", "features")


// data.as[LabeledPoint]

scala> data.show(false)
+-----+--------------------------+
|label|features |
+-----+--------------------------+
|0.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|1.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|2.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|3.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|4.0 |(10,[2,4,6],[0.2,0.4,0.6])|
+-----+--------------------------+

import org.apache.spark.ml.regression.{ RandomForestRegressor, RandomForestRegressionM


odel }
val rfr = new RandomForestRegressor
val model: RandomForestRegressionModel = rfr.fit(data)

scala> model.trees.foreach(println)
DecisionTreeRegressionModel (uid=dtr_247e77e2f8e0) of depth 1 with 3 nodes
DecisionTreeRegressionModel (uid=dtr_61f8eacb2b61) of depth 2 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_63fc5bde051c) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_64d4e42de85f) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_693626422894) of depth 3 with 9 nodes
DecisionTreeRegressionModel (uid=dtr_927f8a0bc35e) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_82da39f6e4e1) of depth 3 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_cb94c2e75bd1) of depth 0 with 1 nodes
DecisionTreeRegressionModel (uid=dtr_29e3362adfb2) of depth 1 with 3 nodes
DecisionTreeRegressionModel (uid=dtr_d6d896abcc75) of depth 3 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_aacb22a9143d) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_18d07dadb5b9) of depth 2 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_f0615c28637c) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_4619362d02fc) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_d39502f828f4) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_896f3a4272ad) of depth 3 with 9 nodes
DecisionTreeRegressionModel (uid=dtr_891323c29838) of depth 3 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_d658fe871e99) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_d91227b13d41) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_4a7976921f4b) of depth 2 with 5 nodes

scala> model.treeWeights
res12: Array[Double] = Array(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0
, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)

scala> model.featureImportances
res13: org.apache.spark.mllib.linalg.Vector = (1,[0],[1.0])


Regressor
Regressor is…​FIXME


LinearRegression
LinearRegression is a Regressor that represents the linear regression algorithm in Machine

Learning.

LinearRegression belongs to org.apache.spark.ml.regression package.

Tip Read the scaladoc of LinearRegression.

It expects org.apache.spark.mllib.linalg.Vector as the input type of the column in a dataset and produces LinearRegressionModel.

import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression

The acceptable parameters:

scala> println(lr.explainParams)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the
penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0)
solver: the solver algorithm for optimization. If this is not set or empty, default va
lue is 'auto' (default: auto)
standardization: whether to standardize the training features before fitting the model
(default: true)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weig
hts as 1.0 (default: )

LinearRegression Example

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val data = (0.0 to 9.0 by 1) // create a collection of Doubles
.map(n => (n, n)) // make it pairs
.map { case (label, features) =>
LabeledPoint(label, Vectors.dense(features)) } // create labeled points of dense v
ectors
.toDF // make it a DataFrame


scala> data.show
+-----+--------+
|label|features|
+-----+--------+
| 0.0| [0.0]|
| 1.0| [1.0]|
| 2.0| [2.0]|
| 3.0| [3.0]|
| 4.0| [4.0]|
| 5.0| [5.0]|
| 6.0| [6.0]|
| 7.0| [7.0]|
| 8.0| [8.0]|
| 9.0| [9.0]|
+-----+--------+

import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression

val model = lr.fit(data)

scala> model.intercept
res1: Double = 0.0

scala> model.coefficients
res2: org.apache.spark.mllib.linalg.Vector = [1.0]

// make predictions
scala> val predictions = model.transform(data)
predictions: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 1 m
ore field]

scala> predictions.show
+-----+--------+----------+
|label|features|prediction|
+-----+--------+----------+
| 0.0| [0.0]| 0.0|
| 1.0| [1.0]| 1.0|
| 2.0| [2.0]| 2.0|
| 3.0| [3.0]| 3.0|
| 4.0| [4.0]| 4.0|
| 5.0| [5.0]| 5.0|
| 6.0| [6.0]| 6.0|
| 7.0| [7.0]| 7.0|
| 8.0| [8.0]| 8.0|
| 9.0| [9.0]| 9.0|
+-----+--------+----------+

import org.apache.spark.ml.evaluation.RegressionEvaluator

// rmse is the default metric


// We're explicit here for learning purposes
val regEval = new RegressionEvaluator().setMetricName("rmse")


val rmse = regEval.evaluate(predictions)

scala> println(s"Root Mean Squared Error: $rmse")


Root Mean Squared Error: 0.0

import org.apache.spark.mllib.linalg.DenseVector
// NOTE Follow along to learn spark.ml-way (not RDD-way)
predictions.rdd.map { r =>
  (r(0).asInstanceOf[Double], r(1).asInstanceOf[DenseVector](0).toDouble, r(2).asInstanceOf[Double]) }.
  toDF("label", "feature0", "prediction").show
+-----+--------+----------+
|label|feature0|prediction|
+-----+--------+----------+
| 0.0| 0.0| 0.0|
| 1.0| 1.0| 1.0|
| 2.0| 2.0| 2.0|
| 3.0| 3.0| 3.0|
| 4.0| 4.0| 4.0|
| 5.0| 5.0| 5.0|
| 6.0| 6.0| 6.0|
| 7.0| 7.0| 7.0|
| 8.0| 8.0| 8.0|
| 9.0| 9.0| 9.0|
+-----+--------+----------+

// Let's make it nicer to the eyes using a Scala case class


scala> :pa
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.DenseVector
case class Prediction(label: Double, feature0: Double, prediction: Double)
object Prediction {
def apply(r: Row) = new Prediction(
label = r(0).asInstanceOf[Double],
feature0 = r(1).asInstanceOf[DenseVector](0).toDouble,
prediction = r(2).asInstanceOf[Double])
}

// Exiting paste mode, now interpreting.

import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.DenseVector
defined class Prediction
defined object Prediction

scala> predictions.rdd.map(Prediction.apply).toDF.show
+-----+--------+----------+
|label|feature0|prediction|
+-----+--------+----------+
| 0.0| 0.0| 0.0|
| 1.0| 1.0| 1.0|


| 2.0| 2.0| 2.0|


| 3.0| 3.0| 3.0|
| 4.0| 4.0| 4.0|
| 5.0| 5.0| 5.0|
| 6.0| 6.0| 6.0|
| 7.0| 7.0| 7.0|
| 8.0| 8.0| 8.0|
| 9.0| 9.0| 9.0|
+-----+--------+----------+

train Method

train(dataset: DataFrame): LinearRegressionModel

train (protected) method of LinearRegression expects a dataset DataFrame with two

columns:

1. label of type DoubleType .

2. features of type Vector.

It returns LinearRegressionModel .

It first counts the number of elements in the features column (as specified by featuresCol , usually features ). The column has to be of mllib.linalg.Vector type (and can easily be prepared using the HashingTF transformer).

val spam = Seq(


(0, "Hi Jacek. Wanna more SPAM? Best!"),
(1, "This is SPAM. This is SPAM")).toDF("id", "email")

import org.apache.spark.ml.feature.RegexTokenizer
val regexTok = new RegexTokenizer()
val spamTokens = regexTok.setInputCol("email").transform(spam)

scala> spamTokens.show(false)
+---+--------------------------------+---------------------------------------+
|id |email |regexTok_646b6bcc4548__output |
+---+--------------------------------+---------------------------------------+
|0 |Hi Jacek. Wanna more SPAM? Best!|[hi, jacek., wanna, more, spam?, best!]|
|1 |This is SPAM. This is SPAM |[this, is, spam., this, is, spam] |
+---+--------------------------------+---------------------------------------+

import org.apache.spark.ml.feature.HashingTF
val hashTF = new HashingTF()
.setInputCol(regexTok.getOutputCol)
.setOutputCol("features")
.setNumFeatures(5000)


val spamHashed = hashTF.transform(spamTokens)

scala> spamHashed.select("email", "features").show(false)


+--------------------------------+----------------------------------------------------
------------+
|email |features
|
+--------------------------------+----------------------------------------------------
------------+
|Hi Jacek. Wanna more SPAM? Best!|(5000,[2525,2943,3093,3166,3329,3980],[1.0,1.0,1.0,1
.0,1.0,1.0])|
|This is SPAM. This is SPAM |(5000,[1713,3149,3370,4070],[1.0,1.0,2.0,2.0])
|
+--------------------------------+----------------------------------------------------
------------+

// Create labeled datasets for spam (1)
import org.apache.spark.sql.functions.lit
val spamLabeled = spamHashed.withColumn("label", lit(1d))

scala> spamLabeled.show
+---+--------------------+-----------------------------+--------------------+-----+
| id| email|regexTok_646b6bcc4548__output| features|label|
+---+--------------------+-----------------------------+--------------------+-----+
| 0|Hi Jacek. Wanna m...| [hi, jacek., wann...|(5000,[2525,2943,...| 1.0|
| 1|This is SPAM. Thi...| [this, is, spam.,...|(5000,[1713,3149,...| 1.0|
+---+--------------------+-----------------------------+--------------------+-----+

val regular = Seq(


(2, "Hi Jacek. I hope this email finds you well. Spark up!"),
(3, "Welcome to Apache Spark project")).toDF("id", "email")
val regularTokens = regexTok.setInputCol("email").transform(regular)
val regularHashed = hashTF.transform(regularTokens)
// Create labeled datasets for non-spam regular emails (0)
val regularLabeled = regularHashed.withColumn("label", lit(0d))

val training = regularLabeled.union(spamLabeled).cache

scala> training.show
+---+--------------------+-----------------------------+--------------------+-----+
| id| email|regexTok_646b6bcc4548__output| features|label|
+---+--------------------+-----------------------------+--------------------+-----+
| 2|Hi Jacek. I hope ...| [hi, jacek., i, h...|(5000,[72,105,942...| 0.0|
| 3|Welcome to Apache...| [welcome, to, apa...|(5000,[2894,3365,...| 0.0|
| 0|Hi Jacek. Wanna m...| [hi, jacek., wann...|(5000,[2525,2943,...| 1.0|
| 1|This is SPAM. Thi...| [this, is, spam.,...|(5000,[1713,3149,...| 1.0|
+---+--------------------+-----------------------------+--------------------+-----+

import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression

// the following calls train by the Predictor contract (see above)


val lrModel = lr.fit(training)


// Let's predict whether an email is a spam or not


val email = Seq("Hi Jacek. you doing well? Bye!").toDF("email")
val emailTokens = regexTok.setInputCol("email").transform(email)
val emailHashed = hashTF.transform(emailTokens)

scala> lrModel.transform(emailHashed).select("prediction").show
+-----------------+
| prediction|
+-----------------+
|0.563603440350882|
+-----------------+


Classifier
Classifier is a Predictor that…​FIXME

Classifier accepts parameters.

extractLabeledPoints Method

extractLabeledPoints(dataset: Dataset[_], numClasses: Int): RDD[LabeledPoint]

extractLabeledPoints …​FIXME

Note extractLabeledPoints is used when…​FIXME

getNumClasses Method

getNumClasses(dataset: Dataset[_], maxNumClasses: Int = 100): Int

getNumClasses …​FIXME

Note getNumClasses is used when…​FIXME


RandomForestClassifier
RandomForestClassifier is a probabilistic Classifier for…​FIXME
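
A minimal usage sketch (assuming a training DataFrame with label and features columns is already available):

import org.apache.spark.ml.classification.RandomForestClassifier

val rfc = new RandomForestClassifier()
  .setNumTrees(10)   // number of trees in the forest
  .setMaxDepth(5)    // maximum depth of every tree

// training is assumed to be a DataFrame with label and features columns
val rfcModel = rfc.fit(training)

// being probabilistic, the model adds rawPrediction, probability and prediction columns
rfcModel.transform(training).select("prediction", "probability").show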


DecisionTreeClassifier
DecisionTreeClassifier is a probabilistic Classifier for…​FIXME


ML Pipeline Models
Model abstract class is a Transformer with the optional Estimator that has produced it (as a

transient parent field).

model: DataFrame =[predict]=> DataFrame (with predictions)

Note An Estimator is optional and is available only after fit (of an Estimator) has been executed and produced the model.

As a Transformer it takes a DataFrame and transforms it to a result DataFrame with a prediction column added.

There are two direct implementations of the Model class that are not directly related to a
concrete ML algorithm:

PipelineModel

PredictionModel

PipelineModel

Caution PipelineModel is a private[ml] class.

PipelineModel is a Model of Pipeline estimator.

Once fit, you can use the result model as any other models to transform datasets (as
DataFrame ).

A very interesting use case of PipelineModel is when a Pipeline is made up of Transformer instances.


// Transformer #1
import org.apache.spark.ml.feature.Tokenizer
val tok = new Tokenizer().setInputCol("text")

// Transformer #2
import org.apache.spark.ml.feature.HashingTF
val hashingTF = new HashingTF().setInputCol(tok.getOutputCol).setOutputCol("features")

// Fuse the Transformers in a Pipeline


import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tok, hashingTF))

val dataset = Seq((0, "hello world")).toDF("id", "text")

// Since there's no fitting, any dataset works fine


val featurize = pipeline.fit(dataset)

// Use the pipelineModel as a series of Transformers


scala> featurize.transform(dataset).show(false)
+---+-----------+------------------------+--------------------------------+
|id |text |tok_8aec9bfad04a__output|features |
+---+-----------+------------------------+--------------------------------+
|0 |hello world|[hello, world] |(262144,[71890,72594],[1.0,1.0])|
+---+-----------+------------------------+--------------------------------+

PredictionModel
PredictionModel is an abstract class to represent a model for prediction algorithms like

regression and classification (that have their own specialized models - details coming up
below).

PredictionModel is basically a Transformer with predict method to calculate predictions

(that end up in prediction column).

PredictionModel belongs to org.apache.spark.ml package.

import org.apache.spark.ml.PredictionModel

The contract of PredictionModel class requires that every custom implementation defines
predict method (with FeaturesType type being the type of features ).

predict(features: FeaturesType): Double

The direct less-algorithm-specific extensions of the PredictionModel class are:

RegressionModel


ClassificationModel

RandomForestRegressionModel

As a custom Transformer it comes with its own custom transform method.

Internally, transform first ensures that the type of the features column matches the type
of the model and adds the prediction column of type Double to the schema of the result
DataFrame .

It then creates the result DataFrame and adds the prediction column with a predictUDF
function applied to the values of the features column.

Caution FIXME A diagram to show the transformation from a dataframe (on the left) and another (on the right) with an arrow to represent the transformation method.

Tip Enable DEBUG logging level for a PredictionModel implementation, e.g. LinearRegressionModel, to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.ml.regression.LinearRegressionModel=DEBUG

Refer to Logging.

ClassificationModel
ClassificationModel is a PredictionModel that transforms a DataFrame with mandatory features , label , and rawPrediction (of type Vector) columns to a DataFrame with prediction column added.

Note A Model with ClassifierParams parameters, e.g. ClassificationModel , requires that a DataFrame have the mandatory features , label (of type Double ), and rawPrediction (of type Vector) columns.

ClassificationModel comes with its own transform (as Transformer) and predict (as

PredictionModel).

The following is a list of the known ClassificationModel custom implementations (as of March, 24th):

ProbabilisticClassificationModel (the abstract parent of the following classification

models)

DecisionTreeClassificationModel ( final )


LogisticRegressionModel

NaiveBayesModel

RandomForestClassificationModel ( final )

RegressionModel
RegressionModel is a PredictionModel that transforms a DataFrame with mandatory label ,

features , and prediction columns.

It comes with no methods or values of its own and so is more of a marker abstract class (to combine different features of regression models under one type).

LinearRegressionModel
LinearRegressionModel represents a model produced by a LinearRegression estimator. It

transforms the required features column of type org.apache.spark.mllib.linalg.Vector.

Note It is a private[ml] class so what you, a developer, may eventually work with is the more general RegressionModel , and since RegressionModel is just a marker no-method abstract class, it is more a PredictionModel.

As a linear regression model that extends LinearRegressionParams it expects the following schema of an input DataFrame :

label (required)

features (required)

prediction

regParam

elasticNetParam

maxIter (Int)

tol (Double)

fitIntercept (Boolean)

standardization (Boolean)

weightCol (String)

solver (String)


(New in 1.6.0) LinearRegressionModel is also an MLWritable (so you can save it to a persistent storage for later reuse).

With DEBUG logging enabled (see above) you can see the following messages in the logs
when transform is called and transforms the schema.

16/03/21 06:55:32 DEBUG LinearRegressionModel: Input schema: {"type":"struct","fields"


:[{"name":"label","type":"double","nullable":false,"metadata":{}},{"name":"features","
type":{"type":"udt","class":"org.apache.spark.mllib.linalg.VectorUDT","pyClass":"pyspa
rk.mllib.linalg.VectorUDT","sqlType":{"type":"struct","fields":[{"name":"type","type":
"byte","nullable":false,"metadata":{}},{"name":"size","type":"integer","nullable":true
,"metadata":{}},{"name":"indices","type":{"type":"array","elementType":"integer","cont
ainsNull":false},"nullable":true,"metadata":{}},{"name":"values","type":{"type":"array
","elementType":"double","containsNull":false},"nullable":true,"metadata":{}}]}},"null
able":true,"metadata":{}}]}
16/03/21 06:55:32 DEBUG LinearRegressionModel: Expected output schema: {"type":"struct
","fields":[{"name":"label","type":"double","nullable":false,"metadata":{}},{"name":"f
eatures","type":{"type":"udt","class":"org.apache.spark.mllib.linalg.VectorUDT","pyCla
ss":"pyspark.mllib.linalg.VectorUDT","sqlType":{"type":"struct","fields":[{"name":"typ
e","type":"byte","nullable":false,"metadata":{}},{"name":"size","type":"integer","null
able":true,"metadata":{}},{"name":"indices","type":{"type":"array","elementType":"inte
ger","containsNull":false},"nullable":true,"metadata":{}},{"name":"values","type":{"ty
pe":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}
}]}},"nullable":true,"metadata":{}},{"name":"prediction","type":"double","nullable":fa
lse,"metadata":{}}]}

The implementation of predict for LinearRegressionModel calculates dot(v1, v2) of two Vectors - features and coefficients - (of DenseVector or SparseVector types) of the same size and adds intercept .

Note The coefficients Vector and intercept Double are the integral part of LinearRegressionModel as the required input parameters of the constructor.

LinearRegressionModel Example


// Create a (sparse) Vector


import org.apache.spark.mllib.linalg.Vectors
val indices = 0 to 4
val elements = indices.zip(Stream.continually(1.0))
val sv = Vectors.sparse(elements.size, elements)

// Create a proper DataFrame


val ds = sc.parallelize(Seq((0.5, sv))).toDF("label", "features")

import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression

// Importing LinearRegressionModel and being explicit about the type of model value
// is for learning purposes only
import org.apache.spark.ml.regression.LinearRegressionModel
val model: LinearRegressionModel = lr.fit(ds)

// Use the same ds - just for learning purposes


scala> model.transform(ds).show
+-----+--------------------+----------+
|label| features|prediction|
+-----+--------------------+----------+
| 0.5|(5,[0,1,2,3,4],[1...| 0.5|
+-----+--------------------+----------+

RandomForestRegressionModel
RandomForestRegressionModel is a PredictionModel with features column of type Vector.

Interestingly, DataFrame transformation (as part of Transformer contract) uses SparkContext.broadcast to send itself to the nodes in a Spark cluster and calculates predictions (as prediction column) on features .

KMeansModel
KMeansModel is a Model of KMeans algorithm.

It belongs to org.apache.spark.ml.clustering package.


// See spark-mllib-estimators.adoc#KMeans
val kmeans: KMeans = ???
val trainingDF: DataFrame = ???
val kmModel = kmeans.fit(trainingDF)

// Know the cluster centers


scala> kmModel.clusterCenters
res0: Array[org.apache.spark.mllib.linalg.Vector] = Array([0.1,0.3], [0.1,0.1])

import org.apache.spark.mllib.linalg.Vectors
val inputDF = Seq((0.0, Vectors.dense(0.2, 0.4))).toDF("label", "features")

scala> kmModel.transform(inputDF).show(false)
+-----+---------+----------+
|label|features |prediction|
+-----+---------+----------+
|0.0 |[0.2,0.4]|0 |
+-----+---------+----------+


Model
Model is the contract for a fitted model, i.e. a Transformer that was produced by an

Estimator.

Model Contract

package org.apache.spark.ml

abstract class Model[M] extends Transformer {


def copy(extra: ParamMap): M
}

Table 1. Model Contract


Method Description
copy Used when…​

parent Estimator that produced this model.
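
The parent is simply the Estimator instance that produced the model. A tiny illustrative sketch (the trainDF value with label and features columns is an assumption and is not defined here):

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression

// val lrModel = lr.fit(trainDF)
// lrModel.parent == lr  // true -- fit sets the parent to the estimator that produced the model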


Evaluator — ML Pipeline Component for Model Scoring
Evaluator is the contract in Spark MLlib for ML Pipeline components that can evaluate

models for given parameters.

ML Pipeline evaluators are transformers that take DataFrames and compute metrics
indicating how good a model is.

evaluator: DataFrame =[evaluate]=> Double

Evaluator is used to evaluate models and is usually (if not always) used for best model

selection by CrossValidator and TrainValidationSplit.

Evaluator uses isLargerBetter method to indicate whether the Double metric should be maximized ( true ) or minimized ( false ). It considers a larger value better ( true ) by default.

Table 1. Evaluators
Evaluator Description
BinaryClassificationEvaluator Evaluator of binary classification models

ClusteringEvaluator Evaluator of clustering models

MulticlassClassificationEvaluator Evaluator of multiclass classification models

RegressionEvaluator Evaluator of regression models

Evaluating Model Output with Extra Parameters —  evaluate Method

evaluate(dataset: Dataset[_], paramMap: ParamMap): Double

evaluate copies the extra paramMap and evaluates a model output.

Note evaluate is used…​FIXME

Evaluator Contract


package org.apache.spark.ml.evaluation

abstract class Evaluator {


def evaluate(dataset: Dataset[_]): Double
def copy(extra: ParamMap): Evaluator
def isLargerBetter: Boolean = true
}

Table 2. Evaluator Contract
Method         | Description
copy           | Used when…​
evaluate       | Used when…​
isLargerBetter | Indicates whether the metric returned by evaluate should be maximized ( true ) or minimized ( false ). Gives true by default.
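
Given the size of the contract, a custom evaluator is easy to sketch. The following is a minimal example (not a Spark MLlib class) that scores model output by the fraction of rows where the assumed label and prediction columns match exactly:

import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col

class ExactMatchEvaluator(override val uid: String) extends Evaluator {
  def this() = this(Identifiable.randomUID("exactMatchEval"))

  // Fraction of rows where label equals prediction
  override def evaluate(dataset: Dataset[_]): Double = {
    val total = dataset.count
    if (total == 0) 0.0
    else dataset.filter(col("label") === col("prediction")).count.toDouble / total
  }

  // A larger fraction of exact matches is better
  override def isLargerBetter: Boolean = true

  override def copy(extra: ParamMap): ExactMatchEvaluator = defaultCopy(extra)
}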


BinaryClassificationEvaluator — Evaluator of Binary Classification Models
BinaryClassificationEvaluator is an Evaluator of cross-validated models from binary classification algorithms (e.g. LogisticRegression, RandomForestClassifier, NaiveBayes , DecisionTreeClassifier, MultilayerPerceptronClassifier , GBTClassifier , LinearSVC ).

BinaryClassificationEvaluator finds the best model by maximizing the model evaluation

metric that is the area under the specified curve (and so isLargerBetter is turned on for either
metric).

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val binEval = new BinaryClassificationEvaluator().
setMetricName("areaUnderROC").
setRawPredictionCol("rawPrediction").
setLabelCol("label")

scala> binEval.isLargerBetter
res0: Boolean = true

scala> println(binEval.explainParams)
labelCol: label column name (default: label)
metricName: metric name in evaluation (areaUnderROC|areaUnderPR) (default: areaUnderRO
C)
rawPredictionCol: raw prediction (a.k.a. confidence) column name (default: rawPredicti
on)

Table 1. BinaryClassificationEvaluator' Parameters
Parameter        | Default Value | Description
metricName       | areaUnderROC  | Name of the classification metric for evaluation. Can be either areaUnderROC (default) or areaUnderPR
rawPredictionCol | rawPrediction | Column name with raw predictions (a.k.a. confidence)
labelCol         | label         | Name of the column with indexed labels (i.e. 0 s or 1 s)

Evaluating Model Output —  evaluate Method


evaluate(dataset: Dataset[_]): Double

Note evaluate is part of Evaluator Contract.

evaluate …​FIXME


ClusteringEvaluator — Evaluator of Clustering Models
ClusteringEvaluator is an Evaluator of clustering models (e.g. FPGrowth , GaussianMixture , ALS, KMeans , LinearSVC , RandomForestRegressor , GeneralizedLinearRegression , LinearRegression, GBTRegressor , DecisionTreeRegressor , NaiveBayes ).

Note ClusteringEvaluator is available since Spark 2.3.0.

ClusteringEvaluator finds the best model by maximizing the model evaluation metric (i.e.

isLargerBetter is always turned on).

import org.apache.spark.ml.evaluation.ClusteringEvaluator
val cluEval = new ClusteringEvaluator().
setPredictionCol("prediction").
setFeaturesCol("features").
setMetricName("silhouette")

scala> cluEval.isLargerBetter
res0: Boolean = true

scala> println(cluEval.explainParams)
featuresCol: features column name (default: features, current: features)
metricName: metric name in evaluation (silhouette) (default: silhouette, current: silh
ouette)
predictionCol: prediction column name (default: prediction, current: prediction)

Table 1. ClusteringEvaluator' Parameters
Parameter     | Default Value | Description
featuresCol   | features      | Name of the column with features (of type VectorUDT )
metricName    | silhouette    | Name of the clustering metric for evaluation. Note: metricName can only be silhouette .
predictionCol | prediction    | Name of the column with prediction (of type NumericType )


Evaluating Model Output —  evaluate Method

evaluate(dataset: Dataset[_]): Double

Note evaluate is part of Evaluator Contract.

evaluate …​FIXME


MulticlassClassificationEvaluator — Evaluator of Multiclass Classification Models
MulticlassClassificationEvaluator is an Evaluator that takes datasets with the following two columns:

prediction (of DoubleType values)

label (of float or double values)


RegressionEvaluator — Evaluator of Regression Models
RegressionEvaluator is an Evaluator of regression models (e.g. ALS, DecisionTreeRegressor , DecisionTreeClassifier, GBTRegressor , GBTClassifier , RandomForestRegressor, RandomForestClassifier, LinearRegression, RFormula , NaiveBayes , LogisticRegression, MultilayerPerceptronClassifier , LinearSVC , GeneralizedLinearRegression).

Table 1. RegressionEvaluator’s Metrics and isLargerBetter Flag
Metric | Description                                                             | isLargerBetter
rmse   | Root mean squared error (default)                                       | false
mse    | Mean squared error                                                      | false
r2     | Unadjusted coefficient of determination (regression through the origin) | true
mae    | Mean absolute error                                                     | false

import org.apache.spark.ml.evaluation.RegressionEvaluator
val regEval = new RegressionEvaluator().
setMetricName("r2").
setPredictionCol("prediction").
setLabelCol("label")

scala> regEval.isLargerBetter
res0: Boolean = true

scala> println(regEval.explainParams)
labelCol: label column name (default: label, current: label)
metricName: metric name in evaluation (mse|rmse|r2|mae) (default: rmse, current: r2)
predictionCol: prediction column name (default: prediction, current: prediction)


Table 2. RegressionEvaluator' Parameters
Parameter     | Default Value | Description
metricName    | rmse          | Name of the regression metric for evaluation. Can be one of the following: mae , mse , rmse (default), r2
predictionCol | prediction    | Name of the column with predictions
labelCol      | label         | Name of the column with indexed labels

// prepare a fake input dataset using transformers


import org.apache.spark.ml.feature.Tokenizer
val tok = new Tokenizer().setInputCol("text")

import org.apache.spark.ml.feature.HashingTF
val hashTF = new HashingTF()
.setInputCol(tok.getOutputCol) // it reads the output of tok
.setOutputCol("features")

// Scala trick to chain transform methods


// It's of little to no use since we've got Pipelines
// Just to have it as an alternative
val transform = (tok.transform _).andThen(hashTF.transform _)

val dataset = Seq((0, "hello world", 0.0)).toDF("id", "text", "label")

// we're using Linear Regression algorithm


import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tok, hashTF, lr))

val model = pipeline.fit(dataset)

// Let's do prediction
// Note that we're using the same dataset as for fitting the model
// Something you'd definitely not be doing in prod
val predictions = model.transform(dataset)

// Now we're ready to evaluate the model


// Evaluator works on datasets with predictions

import org.apache.spark.ml.evaluation.RegressionEvaluator
val regEval = new RegressionEvaluator

scala> regEval.evaluate(predictions)
res0: Double = 0.0

Evaluating Model Output —  evaluate Method

evaluate(dataset: Dataset[_]): Double

Note evaluate is part of Evaluator Contract.

evaluate …​FIXME


CrossValidator — Model Tuning / Finding The Best Model
CrossValidator is an Estimator for model tuning, i.e. finding the best model for given

parameters and a dataset.

CrossValidator splits the dataset into a set of non-overlapping randomly-partitioned

numFolds pairs of training and validation datasets.

CrossValidator generates a CrossValidatorModel to hold the best model and average

cross-validation metrics.

Note CrossValidator takes any Estimator for model selection, including the Pipeline that is used to transform raw datasets and generate a Model.

Note Use ParamGridBuilder for the parameter grid, i.e. collection of ParamMaps for model tuning.

import org.apache.spark.ml.Pipeline
val pipeline: Pipeline = ...

import org.apache.spark.ml.param.ParamMap
val paramGrid: Array[ParamMap] = new ParamGridBuilder().
addGrid(...).
addGrid(...).
build

import org.apache.spark.ml.tuning.CrossValidator
val cv = new CrossValidator().
setEstimator(pipeline).
setEvaluator(...).
setEstimatorParamMaps(paramGrid).
setNumFolds(...).
setParallelism(...)

import org.apache.spark.ml.tuning.CrossValidatorModel
val bestModel: CrossValidatorModel = cv.fit(training)

CrossValidator is a MLWritable.


Table 1. CrossValidator' Parameters
Parameter          | Default Value | Description
estimator          | (undefined)   | Estimator for best model selection
estimatorParamMaps | (undefined)   | Param maps for the estimator
evaluator          | (undefined)   | Evaluator to select hyper-parameters that maximize the validated metric
numFolds           | 3             | The number of folds for cross validation. Must be at least 2 .
parallelism        | 1             | The number of threads to use while fitting a model. Must be at least 1 .
seed               |               | Random seed

Tip Enable INFO or DEBUG logging levels for org.apache.spark.ml.tuning.CrossValidator logger to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.ml.tuning.CrossValidator=DEBUG

Refer to Logging.

Finding The Best Model —  fit Method

fit(dataset: Dataset[_]): CrossValidatorModel

Note fit is part of Estimator Contract to fit a model (i.e. produce a model).

fit validates the schema (with logging turned on).

You should see the following DEBUG message in the logs:

DEBUG CrossValidator: Input schema: [json]

fit makes sure that estimator, evaluator, estimatorParamMaps and parallelism

parameters are defined or reports a NoSuchElementException .


java.util.NoSuchElementException: Failed to find a default value for [name]

fit creates an ExecutionContext (per parallelism parameter).

fit creates an Instrumentation and requests it to print out the parameters numFolds, seed, parallelism to the logs.

INFO ...FIXME

fit requests Instrumentation to print out the tuning parameters to the logs.

INFO ...FIXME

fit kFolds the RDD of the dataset per numFolds and seed parameters.

Note fit passes the underlying RDD of the dataset to kFolds.
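
The kFolds link above refers to MLUtils.kFold from spark.mllib. A small standalone illustration of what it produces (the plain RDD of numbers is an assumption used just for the demo; sc is the spark-shell SparkContext):

import org.apache.spark.mllib.util.MLUtils

val data = sc.parallelize(1 to 100)

// 3 pairs of (training, validation) RDDs; every element lands in exactly one validation split
val folds = MLUtils.kFold(data, 3, 42)

folds.foreach { case (training, validation) =>
  println(s"training: ${training.count}, validation: ${validation.count}")
}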

fit computes metrics for every pair of training and validation RDDs.

fit calculates the average metrics over all kFolds.

You should see the following INFO message in the logs:

INFO Average cross-validation metrics: [metrics]

fit requests the Evaluator for the best cross-validation metric.

You should see the following INFO message in the logs:

INFO Best set of parameters:


[estimatorParamMap]
INFO Best cross-validation metric: [bestMetric].

fit requests the Estimator to fit the best model (for the dataset and the best set of

estimatorParamMap).

You should see the following INFO message in the logs:

INFO training finished

In the end, fit creates a CrossValidatorModel (for the ID, the best model and the average
metrics for every kFold) and copies parameters to it.


fit and Computing Metric for Training and Validation RDDs


fit computes metrics for every pair of training and validation RDDs (from kFold).

fit creates and persists training and validation datasets.

Tip You can monitor the storage for persisting the datasets in web UI’s Storage tab.

fit prints out the following DEBUG message to the logs:

DEBUG Train split [index] with multiple sets of parameters.

For every map in estimatorParamMaps parameter fit fits a model using the Estimator.

fit does the fitting in parallel per parallelism parameter.

Note parallelism parameter defaults to 1 , i.e. no parallelism for fitting models.

Note fit unpersists the training data (per pair of training and validation RDDs) when all models have been trained.

fit requests the models to transform their respective validation datasets (with the

corresponding parameters from estimatorParamMaps) and then requests the Evaluator to


evaluate the transformed datasets.

fit prints out the following DEBUG message to the logs:

DEBUG Got metric [metric] for model trained with $paramMap.

fit waits until all metrics are available and unpersists the validation dataset.

Creating CrossValidator Instance


CrossValidator takes the following when created:

Unique ID

Validating and Transforming Schema —  transformSchema Method

transformSchema(schema: StructType): StructType

Note transformSchema is part of PipelineStage Contract.


transformSchema simply passes the call to transformSchemaImpl (that is shared between

CrossValidator and TrainValidationSplit).


CrossValidatorModel
CrossValidatorModel is a Model that is created when CrossValidator is requested to find

the best model (per parameters and dataset).

CrossValidatorModel is MLWritable, i.e. FIXME

Creating CrossValidatorModel Instance


CrossValidatorModel takes the following when created:

Unique ID

Best Model

Average cross-validation metrics

CrossValidatorModel initializes the internal registries and counters.


ParamGridBuilder
ParamGridBuilder is…​FIXME
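
Until the description above is filled in, here is a short sketch of the typical usage (the hashingTF and lr instances mirror the other examples in this book):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.tuning.ParamGridBuilder

val hashingTF = new HashingTF()
val lr = new LogisticRegression()

// Every combination of the values becomes one ParamMap (2 x 2 = 4 here)
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(100, 1000))
  .addGrid(lr.regParam, Array(0.05, 0.2))
  .build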


CrossValidator with Pipeline Example


Caution FIXME The example below does NOT work. Being investigated.

Caution FIXME Can k-means be cross-validated? Does it make any sense? Does it only apply to supervised learning?

// Let's create a pipeline with transformers and estimator


import org.apache.spark.ml.feature._

val tok = new Tokenizer().setInputCol("text")

val hashTF = new HashingTF()


.setInputCol(tok.getOutputCol)
.setOutputCol("features")
.setNumFeatures(10)

import org.apache.spark.ml.classification.RandomForestClassifier
val rfc = new RandomForestClassifier

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline()
.setStages(Array(tok, hashTF, rfc))

// CAUTION: label must be double


// 0 = scientific text
// 1 = non-scientific text
val trainDS = Seq(
(0L, "[science] hello world", 0d),
(1L, "long text", 1d),
(2L, "[science] hello all people", 0d),
(3L, "[science] hello hello", 0d)).toDF("id", "text", "label").cache

// Check out the train dataset


// Values in label and prediction columns should be alike
val sampleModel = pipeline.fit(trainDS)
sampleModel
.transform(trainDS)
.select('text, 'label, 'features, 'prediction)
.show(truncate = false)

+--------------------------+-----+--------------------------+----------+
|text |label|features |prediction|
+--------------------------+-----+--------------------------+----------+
|[science] hello world |0.0 |(10,[0,8],[2.0,1.0]) |0.0 |
|long text |1.0 |(10,[4,9],[1.0,1.0]) |1.0 |
|[science] hello all people|0.0 |(10,[0,6,8],[1.0,1.0,2.0])|0.0 |
|[science] hello hello |0.0 |(10,[0,8],[1.0,2.0]) |0.0 |
+--------------------------+-----+--------------------------+----------+


val input = Seq("Hello ScienCE").toDF("text")


sampleModel
.transform(input)
.select('text, 'rawPrediction, 'prediction)
.show(truncate = false)

+-------------+--------------------------------------+----------+
|text |rawPrediction |prediction|
+-------------+--------------------------------------+----------+
|Hello ScienCE|[12.666666666666668,7.333333333333333]|0.0 |
+-------------+--------------------------------------+----------+

import org.apache.spark.ml.tuning.ParamGridBuilder
val paramGrid = new ParamGridBuilder().build

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val binEval = new BinaryClassificationEvaluator

import org.apache.spark.ml.tuning.CrossValidator
val cv = new CrossValidator()
.setEstimator(pipeline) // <-- pipeline is the estimator
.setEvaluator(binEval) // has to match the estimator
.setEstimatorParamMaps(paramGrid)

// WARNING: It does not work!!!


val cvModel = cv.fit(trainDS)


Params (and ParamMaps)


Params is the contract in Spark MLlib for ML components that take parameters.

Params has params collection of Param objects.

import org.apache.spark.ml.recommendation.ALS
val als = new ALS().
setMaxIter(5).
setRegParam(0.01).
setUserCol("userId").
setItemCol("movieId").
setRatingCol("rating")
scala> :type als.params
Array[org.apache.spark.ml.param.Param[_]]

scala> println(als.explainParams)
alpha: alpha for implicit preference (default: 1.0)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10
means that the cache will get checkpointed every 10 iterations (default: 10)
coldStartStrategy: strategy for dealing with unknown or new users/items at prediction
time. This may be useful in cross-validation or production scenarios, for handling use
r/item ids the model has not seen in the training data. Supported values: nan,drop. (d
efault: nan)
finalStorageLevel: StorageLevel for ALS model factors. (default: MEMORY_AND_DISK)
implicitPrefs: whether to use implicit preference (default: false)
intermediateStorageLevel: StorageLevel for intermediate datasets. Cannot be 'NONE'. (d
efault: MEMORY_AND_DISK)
itemCol: column name for item ids. Ids must be within the integer value range. (defaul
t: item, current: movieId)
maxIter: maximum number of iterations (>= 0) (default: 10, current: 5)
nonnegative: whether to use nonnegative constraint for least squares (default: false)
numItemBlocks: number of item blocks (default: 10)
numUserBlocks: number of user blocks (default: 10)
predictionCol: prediction column name (default: prediction)
rank: rank of the factorization (default: 10)
ratingCol: column name for ratings (default: rating, current: rating)
regParam: regularization parameter (>= 0) (default: 0.1, current: 0.01)
seed: random seed (default: 1994790107)
userCol: column name for user ids. Ids must be within the integer value range. (defaul
t: user, current: userId)


import org.apache.spark.ml.tuning.CrossValidator
val cv = new CrossValidator
scala> println(cv.explainParams)
estimator: estimator for selection (undefined)
estimatorParamMaps: param maps for the estimator (undefined)
evaluator: evaluator used to select hyper-parameters that maximize the validated metri
c (undefined)
numFolds: number of folds for cross validation (>= 2) (default: 3)
seed: random seed (default: -1191137437)

Params comes with $ (dollar) method for Spark MLlib developers to access the user-defined or the default value of a parameter.
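
A minimal sketch (not a Spark class) of a Params implementation with its own Param, showing how set, setDefault and $ cooperate:

import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable

class GreetingParams(override val uid: String) extends Params {
  def this() = this(Identifiable.randomUID("greeting"))

  // a custom parameter with a default value
  val greeting: Param[String] = new Param[String](this, "greeting", "text to print")
  setDefault(greeting -> "hello")

  def setGreeting(value: String): this.type = set(greeting, value)

  // $(greeting) gives the user-defined value or, if not set, the default
  def greet(): String = $(greeting)

  override def copy(extra: ParamMap): GreetingParams = defaultCopy(extra)
}

// new GreetingParams().greet()                      // hello
// new GreetingParams().setGreeting("witaj").greet() // witaj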

Params Contract

package org.apache.spark.ml.param

trait Params {
def copy(extra: ParamMap): Params
}

Table 1. (Subset of) Params Contract


Method Description
copy

Explaining Parameters —  explainParams Method

explainParams(): String

explainParams takes params collection of parameters and converts every parameter to a

corresponding help text with the param name, the description and optionally the default and
the user-defined values if available.


import org.apache.spark.ml.recommendation.ALS
val als = new ALS().
setMaxIter(5).
setRegParam(0.01).
setUserCol("userId").
setItemCol("movieId").
setRatingCol("rating")
scala> println(als.explainParams)
alpha: alpha for implicit preference (default: 1.0)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10
means that the cache will get checkpointed every 10 iterations (default: 10)
coldStartStrategy: strategy for dealing with unknown or new users/items at prediction
time. This may be useful in cross-validation or production scenarios, for handling use
r/item ids the model has not seen in the training data. Supported values: nan,drop. (d
efault: nan)
finalStorageLevel: StorageLevel for ALS model factors. (default: MEMORY_AND_DISK)
implicitPrefs: whether to use implicit preference (default: false)
intermediateStorageLevel: StorageLevel for intermediate datasets. Cannot be 'NONE'. (d
efault: MEMORY_AND_DISK)
itemCol: column name for item ids. Ids must be within the integer value range. (default
: item, current: movieId)
maxIter: maximum number of iterations (>= 0) (default: 10, current: 5)
nonnegative: whether to use nonnegative constraint for least squares (default: false)
numItemBlocks: number of item blocks (default: 10)
numUserBlocks: number of user blocks (default: 10)
predictionCol: prediction column name (default: prediction)
rank: rank of the factorization (default: 10)
ratingCol: column name for ratings (default: rating, current: rating)
regParam: regularization parameter (>= 0) (default: 0.1, current: 0.01)
seed: random seed (default: 1994790107)
userCol: column name for user ids. Ids must be within the integer value range. (default
: user, current: userId)

Copying Parameters with Optional Extra Values —  copyValues Method

copyValues[T](to: T, extra: ParamMap = ParamMap.empty): T

copyValues adds extra parameters to paramMap, possibly overriding existing keys.

copyValues iterates over params collection and sets the default value followed by what may

have been defined using the user-defined and extra parameters.

Note copyValues is used mainly for copy method.


ValidatorParams
Table 1. ValidatorParams' Parameters
Parameter          | Default Value | Description
estimator          | (undefined)   | Estimator for best model selection
estimatorParamMaps | (undefined)   | Param maps for the estimator
evaluator          | (undefined)   | Evaluator to select hyper-parameters that maximize the validated metric

logTuningParams Method

logTuningParams(instrumentation: Instrumentation[_]): Unit

logTuningParams …​FIXME

Note logTuningParams is used when…​FIXME

loadImpl Method

loadImpl[M](
path: String,
sc: SparkContext,
expectedClassName: String): (Metadata, Estimator[M], Evaluator, Array[ParamMap])

loadImpl …​FIXME

Note loadImpl is used when…​FIXME

transformSchemaImpl Method

transformSchemaImpl(schema: StructType): StructType

transformSchemaImpl …​FIXME

Note transformSchemaImpl is used when CrossValidator and TrainValidationSplit validate and transform schema.


HasParallelism
HasParallelism is a Scala trait for Spark MLlib components that allow for specifying the level of parallelism for multi-threaded execution and provide a thread-pool-based execution context.

HasParallelism defines parallelism parameter that controls the number of threads in a

cached thread pool.

Table 1. HasParallelism' Parameters
Parameter   | Default Value | Description
parallelism | 1             | The number of threads to use when running parallel algorithms. Must be at least 1 .

getExecutionContext Method

getExecutionContext: ExecutionContext

getExecutionContext …​FIXME

Note getExecutionContext is used when…​FIXME
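
HasParallelism is internal to Spark MLlib, but the idea behind getExecutionContext is easy to reproduce. A sketch (an assumption, not Spark's actual code) of turning the parallelism value into an ExecutionContext:

import java.util.concurrent.{Executor, Executors}
import scala.concurrent.ExecutionContext

def executionContextFor(parallelism: Int): ExecutionContext =
  if (parallelism <= 1) {
    // run every task on the caller's thread, i.e. no parallelism
    ExecutionContext.fromExecutor(new Executor {
      def execute(runnable: Runnable): Unit = runnable.run()
    })
  } else {
    // a thread pool with as many threads as the parallelism level
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(parallelism))
  }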


ML Persistence — Saving and Loading Models and Pipelines
MLWriter and MLReader belong to org.apache.spark.ml.util package.

They allow you to save and load models regardless of the language (Scala, Java, Python or R) they have been saved in, and load them later on.

MLWriter
MLWriter abstract class comes with save(path: String) method to save an ML component to a given path .

save(path: String): Unit

It comes with another (chainable) method overwrite to overwrite the output path if it
already exists.

overwrite(): this.type

The component is saved into a JSON file (see MLWriter Example section below).

Tip Enable INFO logging level for the MLWriter implementation logger to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.ml.Pipeline$.PipelineWriter=INFO

Refer to Logging.

Caution FIXME The logging doesn’t work and overwriting does not print out INFO message to the logs :(

MLWriter Example

import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array.empty[PipelineStage])
pipeline.write.overwrite.save("sample-pipeline")


The result of save for "unfitted" pipeline is a JSON file for metadata (as shown below).

$ cat sample-pipeline/metadata/part-00000 | jq
{
"class": "org.apache.spark.ml.Pipeline",
"timestamp": 1472747720477,
"sparkVersion": "2.1.0-SNAPSHOT",
"uid": "pipeline_181c90b15d65",
"paramMap": {
"stageUids": []
}
}

The result of save for a pipeline model is a JSON file for metadata and Parquet files for model data, e.g. coefficients.

val model = pipeline.fit(training)


model.write.save("sample-model")


$ cat sample-model/metadata/part-00000 | jq
{
"class": "org.apache.spark.ml.PipelineModel",
"timestamp": 1472748168005,
"sparkVersion": "2.1.0-SNAPSHOT",
"uid": "pipeline_3ed598da1c4b",
"paramMap": {
"stageUids": [
"regexTok_bf73e7c36e22",
"hashingTF_ebece38da130",
"logreg_819864aa7120"
]
}
}

$ tree sample-model/stages/
sample-model/stages/
|-- 0_regexTok_bf73e7c36e22
| `-- metadata
| |-- _SUCCESS
| `-- part-00000
|-- 1_hashingTF_ebece38da130
| `-- metadata
| |-- _SUCCESS
| `-- part-00000
`-- 2_logreg_819864aa7120
|-- data
| |-- _SUCCESS
| `-- part-r-00000-56423674-0208-4768-9d83-2e356ac6a8d2.snappy.parquet
`-- metadata
|-- _SUCCESS
`-- part-00000

7 directories, 8 files

MLReader
MLReader abstract class comes with load(path: String) method to load an ML component from a given path .


import org.apache.spark.ml._
val pipeline = Pipeline.read.load("sample-pipeline")

scala> val stageCount = pipeline.getStages.size


stageCount: Int = 0

val pipelineModel = PipelineModel.read.load("sample-model")

scala> pipelineModel.stages
res1: Array[org.apache.spark.ml.Transformer] = Array(regexTok_bf73e7c36e22, hashingTF_
ebece38da130, logreg_819864aa7120)


MLWritable
MLWritable is…​FIXME


MLReader
MLReader is the contract for…​FIXME

MLReader Contract

package org.apache.spark.ml.util

abstract class MLReader[T] {


def load(path: String): T
}

Table 1. MLReader Contract


Method Description
load Used when…​


Example — Text Classification
Note The example was inspired by the video Building, Debugging, and Tuning Spark Machine Learning Pipelines - Joseph Bradley (Databricks).

Problem: Given a text document, classify it as a scientific or non-scientific one.

Note The example uses a case class LabeledText to have the schema described nicely.

import spark.implicits._

sealed trait Category


case object Scientific extends Category
case object NonScientific extends Category

// FIXME: Define schema for Category

case class LabeledText(id: Long, category: Category, text: String)

val data = Seq(LabeledText(0, Scientific, "hello world"), LabeledText(1, NonScientific


, "witaj swiecie")).toDF

scala> data.show
+-----+-------------+
|label| text|
+-----+-------------+
| 0| hello world|
| 1|witaj swiecie|
+-----+-------------+

It is then tokenized and transformed into another DataFrame with an additional column
called features that is a Vector of numerical values.

Note Paste the code below into Spark Shell using :paste mode.

import spark.implicits._

case class Article(id: Long, topic: String, text: String)


val articles = Seq(
Article(0, "sci.math", "Hello, Math!"),
Article(1, "alt.religion", "Hello, Religion!"),
Article(2, "sci.physics", "Hello, Physics!"),
Article(3, "sci.math", "Hello, Math Revised!"),
Article(4, "sci.math", "Better Math"),
Article(5, "alt.religion", "TGIF")).toDS


Now comes the tokenization part that maps the input text of each text document into tokens (a Seq[String] ) and then into a Vector of numerical values that can only then be understood by a machine learning algorithm (that operates on Vector instances).

scala> articles.show
+---+------------+--------------------+
| id| topic| text|
+---+------------+--------------------+
| 0| sci.math| Hello, Math!|
| 1|alt.religion| Hello, Religion!|
| 2| sci.physics| Hello, Physics!|
| 3| sci.math|Hello, Math Revised!|
| 4| sci.math| Better Math|
| 5|alt.religion| TGIF|
+---+------------+--------------------+

val topic2Label: Boolean => Double = isSci => if (isSci) 1 else 0


val toLabel = udf(topic2Label)

val labelled = articles.withColumn("label", toLabel($"topic".like("sci%"))).cache

val Array(trainDF, testDF) = labelled.randomSplit(Array(0.75, 0.25))

scala> trainDF.show
+---+------------+--------------------+-----+
| id| topic| text|label|
+---+------------+--------------------+-----+
| 1|alt.religion| Hello, Religion!| 0.0|
| 3| sci.math|Hello, Math Revised!| 1.0|
+---+------------+--------------------+-----+

scala> testDF.show
+---+------------+---------------+-----+
| id| topic| text|label|
+---+------------+---------------+-----+
| 0| sci.math| Hello, Math!| 1.0|
| 2| sci.physics|Hello, Physics!| 1.0|
| 4| sci.math| Better Math| 1.0|
| 5|alt.religion| TGIF| 0.0|
+---+------------+---------------+-----+

The "train a model" phase uses the logistic regression machine learning algorithm to build a model and predict the label for future input text documents (and hence classify them as scientific or non-scientific).


import org.apache.spark.ml.feature.RegexTokenizer
val tokenizer = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("words")

import org.apache.spark.ml.feature.HashingTF
val hashingTF = new HashingTF()
.setInputCol(tokenizer.getOutputCol) // it does not wire transformers -- it's just a column name
.setOutputCol("features")
.setNumFeatures(5000)

import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

It uses two columns, namely label and features vector to build a logistic regression
model to make predictions.


val model = pipeline.fit(trainDF)

val trainPredictions = model.transform(trainDF)


val testPredictions = model.transform(testDF)

scala> trainPredictions.select('id, 'topic, 'text, 'label, 'prediction).show


+---+------------+--------------------+-----+----------+
| id| topic| text|label|prediction|
+---+------------+--------------------+-----+----------+
| 1|alt.religion| Hello, Religion!| 0.0| 0.0|
| 3| sci.math|Hello, Math Revised!| 1.0| 1.0|
+---+------------+--------------------+-----+----------+

// Notice that the computations add new columns


scala> trainPredictions.printSchema
root
|-- id: long (nullable = false)
|-- topic: string (nullable = true)
|-- text: string (nullable = true)
|-- label: double (nullable = true)
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- features: vector (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC")

import org.apache.spark.ml.param.ParamMap
val evaluatorParams = ParamMap(evaluator.metricName -> "areaUnderROC")

scala> val areaTrain = evaluator.evaluate(trainPredictions, evaluatorParams)


areaTrain: Double = 1.0

scala> val areaTest = evaluator.evaluate(testPredictions, evaluatorParams)


areaTest: Double = 0.6666666666666666

Let’s tune the model’s hyperparameters (using "tools" from org.apache.spark.ml.tuning


package).

Caution FIXME Review the available classes in the org.apache.spark.ml.tuning package.


import org.apache.spark.ml.tuning.ParamGridBuilder
val paramGrid = new ParamGridBuilder()
.addGrid(hashingTF.numFeatures, Array(100, 1000))
.addGrid(lr.regParam, Array(0.05, 0.2))
.addGrid(lr.maxIter, Array(5, 10, 15))
.build

// That gives all the combinations of the parameters

paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
logreg_cdb8970c1f11-maxIter: 5,
hashingTF_8d7033d05904-numFeatures: 100,
logreg_cdb8970c1f11-regParam: 0.05
}, {
logreg_cdb8970c1f11-maxIter: 5,
hashingTF_8d7033d05904-numFeatures: 1000,
logreg_cdb8970c1f11-regParam: 0.05
}, {
logreg_cdb8970c1f11-maxIter: 10,
hashingTF_8d7033d05904-numFeatures: 100,
logreg_cdb8970c1f11-regParam: 0.05
}, {
logreg_cdb8970c1f11-maxIter: 10,
hashingTF_8d7033d05904-numFeatures: 1000,
logreg_cdb8970c1f11-regParam: 0.05
}, {
logreg_cdb8970c1f11-maxIter: 15,
hashingTF_8d7033d05904-numFeatures: 100,
logreg_cdb8970c1f11-regParam: 0.05
}, {
logreg_cdb8970c1f11-maxIter: 15,
hashingTF_8d7033d05904-numFeatures: 1000,
logreg_cdb8970c1f11-...

import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.param._
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEstimatorParamMaps(paramGrid)
.setEvaluator(evaluator)
.setNumFolds(10)

val cvModel = cv.fit(trainDF)

Let’s use the cross-validated model to calculate predictions and evaluate their precision.


val cvPredictions = cvModel.transform(testDF)

scala> cvPredictions.select('topic, 'text, 'prediction).show


+------------+---------------+----------+
| topic| text|prediction|
+------------+---------------+----------+
| sci.math| Hello, Math!| 0.0|
| sci.physics|Hello, Physics!| 0.0|
| sci.math| Better Math| 1.0|
|alt.religion| TGIF| 0.0|
+------------+---------------+----------+

scala> evaluator.evaluate(cvPredictions, evaluatorParams)


res26: Double = 0.6666666666666666

scala> val bestModel = cvModel.bestModel


bestModel: org.apache.spark.ml.Model[_] = pipeline_8873b744aac7

Caution FIXME Review https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/tuning

You can eventually save the model for later use.

cvModel.write.overwrite.save("model")

Congratulations! You’re done.


Example — Linear Regression
The DataFrame used for Linear Regression has to have features column of
org.apache.spark.mllib.linalg.VectorUDT type.

Note You can change the name of the column using featuresCol parameter.

The list of the parameters of LinearRegression :

scala> println(lr.explainParams)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the
penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0)
solver: the solver algorithm for optimization. If this is not set or empty, default va
lue is 'auto' (default: auto)
standardization: whether to standardize the training features before fitting the model
(default: true)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weig
hts as 1.0 (default: )

Caution FIXME The following example is work in progress.


import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline("my_pipeline")

import org.apache.spark.ml.regression._
val lr = new LinearRegression

val df = sc.parallelize(0 to 9).toDF("num")


val stages = Array(lr)
val model = pipeline.setStages(stages).fit(df)

// the above lines gives:


java.lang.IllegalArgumentException: requirement failed: Column features must be of type
org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually IntegerType.
at scala.Predef$.require(Predef.scala:219)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.sc
ala:51)
at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:72)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:117)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:182)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:182)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:182)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:66)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)
... 51 elided
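
A hedged sketch of satisfying the requirement from the error message above: give the input DataFrame a features column of Vector type (the values below are made up, and the mllib Vectors import matches the VectorUDT type the error mentions):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression

// label is the number itself, features is a one-element vector with the same number
val training = sc.parallelize(0 to 9)
  .map(n => (n.toDouble, Vectors.dense(n.toDouble)))
  .toDF("label", "features")

val model = new LinearRegression().fit(training)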


Logistic Regression
In statistics, logistic regression, or logit regression, or logit model is a regression
model where the dependent variable (DV) is categorical.

— Wikipedia, the free encyclopedia


Logistic regression


LogisticRegression
LogisticRegression is…​FIXME


Latent Dirichlet Allocation (LDA)


Note The information here is based almost exclusively on the blog post Topic modeling with LDA: MLlib meets GraphX.

Topic modeling is a type of model that can be very useful in identifying hidden thematic
structure in documents. Broadly speaking, it aims to find structure within an unstructured
collection of documents. Once the structure is "discovered", you may answer questions like:

What is document X about?

How similar are documents X and Y?

If I am interested in topic Z, which documents should I read first?

Spark MLlib offers out-of-the-box support for Latent Dirichlet Allocation (LDA) which is the
first MLlib algorithm built upon GraphX.

Topic models automatically infer the topics discussed in a collection of documents.

Example

Caution FIXME Use Tokenizer, StopWordsRemover, CountVectorizer, and finally LDA in a pipeline.
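
Until the example above is written, here is a hedged sketch of such a pipeline (the sample documents and parameter values are made up; spark-shell implicits are assumed for toDF):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, StopWordsRemover, Tokenizer}

val docs = Seq(
  (0L, "spark mllib meets graphx for topic modeling"),
  (1L, "lda infers the hidden topics in a collection of documents")).toDF("id", "text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("terms")
val vectorizer = new CountVectorizer().setInputCol("terms").setOutputCol("features")
val lda = new LDA().setK(2).setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, vectorizer, lda))

// val topicModel = pipeline.fit(docs)
// topicModel.transform(docs).select("id", "topicDistribution").show(truncate = false)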


Vector
Vector sealed trait represents a numeric vector of values (of Double type) and their

indices (of Int type).

It belongs to org.apache.spark.mllib.linalg package.

Note To Scala and Java developers: Vector class in Spark MLlib belongs to org.apache.spark.mllib.linalg package. It is not the Vector type in Scala or Java. Train your eyes to see two types of the same name. You’ve been warned.

A Vector object knows its size .

A Vector object can be converted to:

Array[Double] using toArray .

a dense vector as DenseVector using toDense .

a sparse vector as SparseVector using toSparse .

(1.6.0) a JSON string using toJson .

(internal) a breeze vector as BV[Double] using toBreeze .

There are exactly two available implementations of Vector sealed trait (that also belong to
org.apache.spark.mllib.linalg package):

DenseVector

SparseVector

Tip Use Vectors factory object to create vectors, be it DenseVector or SparseVector .


import org.apache.spark.mllib.linalg.Vectors

// You can create dense vectors explicitly by giving values per index
val denseVec = Vectors.dense(Array(0.0, 0.4, 0.3, 1.5))
val almostAllZeros = Vectors.dense(Array(0.0, 0.4, 0.3, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0))

// You can however create a sparse vector by the size and non-zero elements
val sparse = Vectors.sparse(10, Seq((1, 0.4), (2, 0.3), (3, 1.5)))

// Convert a dense vector to a sparse one


val fromSparse = sparse.toDense

scala> almostAllZeros == fromSparse


res0: Boolean = true

Note The factory object is called Vectors (plural).

import org.apache.spark.mllib.linalg._

// prepare elements for a sparse vector


// NOTE: It is more Scala rather than Spark
val indices = 0 to 4
val elements = indices.zip(Stream.continually(1.0))
val sv = Vectors.sparse(elements.size, elements)

// Notice how Vector is printed out


scala> sv
res4: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,1.0,1.0,1.0,1.0])

scala> sv.size
res0: Int = 5

scala> sv.toArray
res1: Array[Double] = Array(1.0, 1.0, 1.0, 1.0, 1.0)

scala> sv == sv.copy
res2: Boolean = true

scala> sv.toJson
res3: String = {"type":0,"size":5,"indices":[0,1,2,3,4],"values":[1.0,1.0,1.0,1.0,1.0]}


LabeledPoint
Caution FIXME

LabeledPoint is a convenient class for declaring a schema for DataFrames that are used as

input data for Linear Regression in Spark MLlib.
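
A short sketch of creating LabeledPoints by hand (spark.mllib API); turning them into a DataFrame assumes the spark-shell implicits used throughout the examples:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// label first, then the features vector
val points = Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(1.0, Vectors.sparse(2, Seq((0, 2.0)))))

// val df = sc.parallelize(points).toDF  // columns: label, features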


Streaming MLlib
The following Machine Learning algorithms have their streaming variants in MLlib:

k-means

Linear Regression

Logistic Regression

They can train models and predict on streaming data.

Note The streaming algorithms belong to spark.mllib (the older RDD-based API).

Streaming k-means
org.apache.spark.mllib.clustering.StreamingKMeans
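
A hedged sketch of the StreamingKMeans API; the DStreams of Vectors are assumed to exist and the surrounding StreamingContext setup is omitted:

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

def clusterOnStreams(trainingStream: DStream[Vector], testStream: DStream[Vector]): Unit = {
  val model = new StreamingKMeans()
    .setK(3)                            // number of clusters
    .setDecayFactor(1.0)                // how quickly older batches are forgotten
    .setRandomCenters(dim = 2, weight = 0.0)

  model.trainOn(trainingStream)         // update the cluster centers with every batch
  model.predictOn(testStream).print()   // assign incoming points to clusters
}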

Streaming Linear Regression


org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

Streaming Logistic Regression


org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD

Sources
Streaming Machine Learning in Spark- Jeremy Freeman (HHMI Janelia Research
Center)


GeneralizedLinearRegression (GLM)
GeneralizedLinearRegression is a regression algorithm. It supports the following error

distribution families:

1. gaussian

2. binomial

3. poisson

4. gamma

GeneralizedLinearRegression supports the following link functions, i.e. relationships between the linear predictor and the mean of the distribution function (see the sketch after the list):

1. identity

2. logit

3. log

4. inverse

5. probit

6. cloglog

7. sqrt
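
A short sketch of selecting a family and a link explicitly (the chosen values are made up; trainDF refers to a DataFrame with features and label columns like the one in the example below):

import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson") // error distribution family
  .setLink("log")       // link function
  .setMaxIter(10)
  .setRegParam(0.3)

// glr.fit(trainDF) produces a GeneralizedLinearRegressionModel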

GeneralizedLinearRegression supports at most 4096 features.

The label column has to be of DoubleType type.

Note GeneralizedLinearRegression belongs to org.apache.spark.ml.regression package.

import org.apache.spark.ml.regression._
val glm = new GeneralizedLinearRegression()

import org.apache.spark.ml.linalg._
val features = Vectors.sparse(5, Seq((3,1.0)))
val trainDF = Seq((0, features, 1)).toDF("id", "features", "label")
val glmModel = glm.fit(trainDF)

GeneralizedLinearRegression is a Regressor with features of Vector type that can train a

GeneralizedLinearRegressionModel.


GeneralizedLinearRegressionModel

Regressor
Regressor is a custom Predictor.


Alternating Least Squares (ALS) Matrix Factorization for Recommender Systems
Alternating Least Squares (ALS) Matrix Factorization is a recommendation
algorithm…​FIXME

Tip Read the original paper Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights by Robert M. Bell and Yehuda Koren.

Recommender systems based on collaborative filtering predict user preferences for


products or services by learning past user-item relationships. A predominant approach
to collaborative filtering is neighborhood based ("k-nearest neighbors"), where a user-
item preference rating is interpolated from ratings of similar items and/or users.

Our method is very fast in practice, generating a prediction in about 0.2 milliseconds.
Importantly, it does not require training many parameters or a lengthy preprocessing,
making it very practical for large scale applications. Finally, we show how to apply these
methods to the perceivably much slower user-oriented approach. To this end, we
suggest a novel scheme for low dimensional embedding of the users. We evaluate
these methods on the Netflix dataset, where they deliver significantly better results than
the commercial Netflix Cinematch recommender system.

— Robert M. Bell and Yehuda Koren


Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights
Tip Read the follow-up paper Collaborative Filtering for Implicit Feedback Datasets by Yifan Hu, Yehuda Koren and Chris Volinsky.

ALS Example

// Based on JavaALSExample from the official Spark examples


// https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java

// 1. Save the code to als.scala


// 2. Run `spark-shell -i als.scala`

import spark.implicits._

import org.apache.spark.ml.recommendation.ALS
val als = new ALS().
setMaxIter(5).
setRegParam(0.01).
setUserCol("userId").
setItemCol("movieId").
setRatingCol("rating")


import org.apache.spark.ml.recommendation.ALS.Rating
// FIXME Use a much richer dataset, i.e. Spark's data/mllib/als/sample_movielens_ratings.txt
// FIXME Load it using spark.read
val ratings = Seq(
Rating(0, 2, 3),
Rating(0, 3, 1),
Rating(0, 5, 2),
Rating(1, 2, 2)).toDF("userId", "movieId", "rating")
val Array(training, testing) = ratings.randomSplit(Array(0.8, 0.2))

// Make sure that the RDDs have at least one record


assert(training.count > 0)
assert(testing.count > 0)

import org.apache.spark.ml.recommendation.ALSModel
val model = als.fit(training)

// drop NaNs
model.setColdStartStrategy("drop")
val predictions = model.transform(testing)

import org.apache.spark.ml.evaluation.RegressionEvaluator
val evaluator = new RegressionEvaluator().
setMetricName("rmse"). // root mean squared error
setLabelCol("rating").
setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")

// Model is ready for recommendations

// Generate top 10 movie recommendations for each user


val userRecs = model.recommendForAllUsers(10)
userRecs.show(truncate = false)

// Generate top 10 user recommendations for each movie


val movieRecs = model.recommendForAllItems(10)
movieRecs.show(truncate = false)

// Generate top 10 movie recommendations for a specified set of users


// Use a trick to make sure we work with the known users from the input
val users = ratings.select(als.getUserCol).distinct.limit(3)
val userSubsetRecs = model.recommendForUserSubset(users, 10)
userSubsetRecs.show(truncate = false)

// Generate top 10 user recommendations for a specified set of movies


val movies = ratings.select(als.getItemCol).distinct.limit(3)
val movieSubSetRecs = model.recommendForItemSubset(movies, 10)
movieSubSetRecs.show(truncate = false)

System.exit(0)


ALS — Estimator for ALSModel


ALS is an Estimator that generates an ALSModel.

ALS uses als-[random-numbers] for the default identifier.

ALS can be fine-tuned using parameters.

Table 1. ALS’s Parameters (aka ALSParams)
Parameter                | Default Value      | Description
alpha                    | 1.0                | Alpha constant in the implicit preference formulation. Must be non-negative, i.e. at least 0 . Used when ALS trains a model (and computes factors for users and items datasets) with implicit preference enabled (which is disabled by default)
checkpointInterval       | 10                 | Checkpoint interval, i.e. how many iterations between checkpoints. Must be at least 1 or exactly -1 to disable checkpointing
coldStartStrategy        | nan                | Strategy for dealing with unknown or new users/items at prediction time, i.e. what happens for user or item ids the model has not seen in the training data. Supported values: nan (predicted value for unknown ids will be NaN) and drop (rows in the input DataFrame containing unknown ids are dropped from the output DataFrame with predictions)
finalStorageLevel        | MEMORY_AND_DISK    | StorageLevel for ALS model factors
implicitPrefs            | false              | Flag to turn implicit preference on ( true ) or off ( false )
intermediateStorageLevel | MEMORY_AND_DISK    | StorageLevel for intermediate datasets. Must not be NONE .
itemCol                  | item               | Column name for item ids. Must be all integers or numerics within the integer value range
maxIter                  | 10                 | Maximum number of iterations. Must be non-negative, i.e. at least 0 .
nonnegative              | Disabled ( false ) | Flag to decide whether to apply nonnegativity constraints for least squares
numUserBlocks            | 10                 | Number of user blocks. Has to be at least 1 .
numItemBlocks            | 10                 | Number of item blocks. Has to be at least 1 .
predictionCol            | prediction         | Column name for predictions, the main purpose of the estimator. Of type FloatType
rank                     | 10                 | Rank of the matrix factorization. Has to be at least 1 .
ratingCol                | rating             | Column name for ratings. Must be all integers or numerics within the integer value range. Cast to FloatType . Set to 1.0 when undefined
regParam                 | 0.1                | Regularization parameter. Must be non-negative, i.e. at least 0 .
seed                     | Randomly-generated | Random seed
userCol                  | user               | Column name for user ids. Must be all integers or numerics within the integer value range

computeFactors Internal Method

computeFactors[ID](
srcFactorBlocks: RDD[(Int, FactorBlock)],
srcOutBlocks: RDD[(Int, OutBlock)],
dstInBlocks: RDD[(Int, InBlock[ID])],
rank: Int,
regParam: Double,
srcEncoder: LocalIndexEncoder,
implicitPrefs: Boolean = false,
alpha: Double = 1.0,
solver: LeastSquaresNESolver): RDD[(Int, FactorBlock)]

computeFactors …​FIXME

Note computeFactors is used when…​FIXME

Fitting ALSModel —  fit Method

fit(dataset: Dataset[_]): ALSModel

Internally, fit validates the schema of the dataset (to make sure that the types of the
columns are correct and the prediction column is not available yet).

fit casts the rating column (as defined using ratingCol parameter) to FloatType .

fit selects user, item and rating columns (from the dataset ) and converts it to RDD of

Rating instances.

Note fit converts the dataset to RDD using rdd operator.

fit prints out the training parameters as INFO message to the logs:


INFO ...FIXME

fit trains a model, i.e. generates a pair of RDDs of user and item factors.

fit converts the RDDs with user and item factors to corresponding DataFrames with id

and features columns.

fit creates an ALSModel .

fit prints out the following INFO message to the logs:

INFO training finished

Caution FIXME Check out the log

In the end, fit copies parameter values to the ALSModel model.

Caution FIXME Why is the copying necessary?
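A short sketch of fitting and inspecting the result (training is a hypothetical ratings DataFrame and als the estimator configured earlier):

// Fit an ALSModel and peek at the learned factor DataFrames
val alsModel = als.fit(training)
alsModel.userFactors.show(3)
alsModel.itemFactors.show(3)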

partitionRatings Internal Method

partitionRatings[ID](
ratings: RDD[Rating[ID]],
srcPart: Partitioner,
dstPart: Partitioner): RDD[((Int, Int), RatingBlock[ID])]

partitionRatings …​FIXME

Note partitionRatings is used when…​FIXME

makeBlocks Internal Method

makeBlocks[ID](
prefix: String,
ratingBlocks: RDD[((Int, Int), RatingBlock[ID])],
srcPart: Partitioner,
dstPart: Partitioner,
storageLevel: StorageLevel)(
implicit srcOrd: Ordering[ID]): (RDD[(Int, InBlock[ID])], RDD[(Int, OutBlock)])

makeBlocks …​FIXME

Note makeBlocks is used when…​FIXME


train Method

train[ID](
ratings: RDD[Rating[ID]],
rank: Int = 10,
numUserBlocks: Int = 10,
numItemBlocks: Int = 10,
maxIter: Int = 10,
regParam: Double = 0.1,
implicitPrefs: Boolean = false,
alpha: Double = 1.0,
nonnegative: Boolean = false,
intermediateRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
finalRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
checkpointInterval: Int = 10,
seed: Long = 0L)(
implicit ord: Ordering[ID]): (RDD[(ID, Array[Float])], RDD[(ID, Array[Float])])

train first partitions the ratings RDD (using two HashPartitioners with numUserBlocks and
numItemBlocks partitions) and immediately persists the RDD per
intermediateRDDStorageLevel storage level.

train creates a pair of user in and out block RDDs for blockRatings .

train triggers caching.

train uses a Spark idiom to trigger caching by counting the elements of an


Note
RDD.

train swaps users and items to create a swappedBlockRatings RDD.

train creates a pair of user in and out block RDDs for the swappedBlockRatings RDD.

train triggers caching.

train creates LocalIndexEncoders for user and item HashPartitioner partitioners.

Caution FIXME train gets too "heavy", i.e. advanced. Gave up for now. Sorry.

train throws an IllegalArgumentException when ratings is empty.

requirement failed: No ratings available from [ratings]

train throws an IllegalArgumentException when intermediateRDDStorageLevel is NONE .


requirement failed: ALS is not designed to run without persisting intermediate RDDs.

Note train is used when…​FIXME

validateAndTransformSchema Internal Method

validateAndTransformSchema(schema: StructType): StructType

validateAndTransformSchema …​FIXME

validateAndTransformSchema is used exclusively when ALS is requested to


Note
transform a dataset schema.

Transforming Dataset Schema —  transformSchema


Method

transformSchema(schema: StructType): StructType

Internally, transformSchema …​FIXME


ALSModel — Model for Predictions


ALSModel is a model fitted by the ALS algorithm.

Note A Model in Spark MLlib is a Transformer that comes with a custom transform method.

When making prediction (i.e. executed), ALSModel …​FIXME

ALSModel is created when:

ALS fits an ALSModel

ALSModel copies an ALSModel

ALSModelReader loads an ALSModel from a persistent storage

ALSModel is a MLWritable.


// The following spark-shell session is used to show


// how ALSModel works under the covers
// Mostly to learn how to work with the private ALSModel class

// Use paste raw mode to copy the code


// :paste -raw (or its shorter version :pa -raw)
// BEGIN :pa -raw
package org.apache.spark.ml

import org.apache.spark.sql._
class MyALS(spark: SparkSession) {
import spark.implicits._
val userFactors = Seq((0, Seq(0.3, 0.2))).toDF("id", "features")
val itemFactors = Seq((0, Seq(0.3, 0.2))).toDF("id", "features")
import org.apache.spark.ml.recommendation._
val alsModel = new ALSModel(uid = "uid", rank = 10, userFactors, itemFactors)
}
// END :pa -raw

// Copy the following to spark-shell directly


import org.apache.spark.ml._
val model = new MyALS(spark).
alsModel.
setUserCol("user").
setItemCol("item")

import org.apache.spark.sql.types._
val mySchema = new StructType().
add($"user".float).
add($"item".float)

val transformedSchema = model.transformSchema(mySchema)


scala> transformedSchema.printTreeString
root
|-- user: float (nullable = true)
|-- item: float (nullable = true)
|-- prediction: float (nullable = false)

Making Predictions —  transform Method

transform(dataset: Dataset[_]): DataFrame

Note transform is part of Transformer Contract.

Internally, transform validates the schema of the dataset .

transform left-joins the dataset with userFactors dataset (using userCol column of

dataset and id column of userFactors).


Left join takes two datasets and gives all the rows from the left side (of the join)
combined with the corresponding row from the right side if available or null .

val rows0 = spark.range(0)


val rows5 = spark.range(5)
scala> rows0.join(rows5, Seq("id"), "left").show
+---+
| id|
+---+
+---+

scala> rows5.join(rows0, Seq("id"), "left").count


res3: Long = 5

scala> spark.range(0, 55).join(spark.range(56, 200), Seq("id"), "left").count


res4: Long = 55

val rows02 = spark.range(0, 2)


val rows39 = spark.range(3, 9)
scala> rows02.join(rows39, Seq("id"), "left").show
+---+
| id|
+---+
| 0|
| 1|
+---+

val names = Seq((3, "three"), (4, "four")).toDF("id", "name")


scala> rows02.join(names, Seq("id"), "left").show
+---+----+
| id|name|
+---+----+
| 0|null|
| 1|null|
+---+----+

transform left-joins the dataset with itemFactors dataset (using itemCol column of

dataset and id column of itemFactors).

transform makes predictions using the features columns of userFactors and itemFactors

datasets (per every row in the left-joined dataset).

transform takes (selects) all the columns from the dataset and predictionCol with

predictions.

Ultimately, transform drops rows containing null or NaN values for predictions if
coldStartStrategy is drop .

Note The default value of coldStartStrategy is nan, which does not drop missing values from the predictions column.
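For example, a hedged sketch of scoring a dataset while dropping NaN predictions (alsModel is a fitted ALSModel and test is a hypothetical DataFrame with user and item columns):

// Drop rows whose users/items were unseen during training
val predictions = alsModel.
  setColdStartStrategy("drop").
  transform(test)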

transformSchema Method

transformSchema(schema: StructType): StructType


Note transformSchema is part of Transformer Contract.

Internally, transformSchema validates the schema of the dataset .

Creating ALSModel Instance


ALSModel takes the following when created:

Unique ID

Rank

DataFrame of user factors

DataFrame of item factors

ALSModel initializes the internal registries and counters.

Requesting sdot from BLAS —  predict Internal Property

predict: UserDefinedFunction

predict is a user-defined function (UDF) that takes two collections of float numbers and

requests BLAS for sdot .

Caution FIXME Read about com.github.fommil.netlib.BLAS.getInstance.sdot .

Note predict is a mere wrapper of com.github.fommil.netlib.BLAS.

Note predict is used exclusively when ALSModel is requested to transform.
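A rough sketch of what such a dot-product UDF boils down to; this is a plain Scala equivalent written for illustration, not the BLAS-backed implementation:

import org.apache.spark.sql.functions.udf

// Dot product of the user and item factor vectors; NaN when either side is missing
val predictSketch = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
  if (userFeatures != null && itemFeatures != null) {
    userFeatures.zip(itemFeatures).map { case (u, i) => u * i }.sum
  } else Float.NaN
}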

Creating ALSModel with Extra Parameters —  copy


Method

copy(extra: ParamMap): ALSModel

Note copy is part of Model Contract.

copy creates a new ALSModel .

copy then copies extra parameters to the new ALSModel and sets the parent.


ALSModelReader
ALSModelReader is…​FIXME

load Method

load(path: String): ALSModel

Note load is part of MLReader Contract.

load …​FIXME


Instrumentation
Instrumentation is…​FIXME

Printing Out Parameters to Logs —  logParams Method

logParams(params: Param[_]*): Unit

logParams …​FIXME

Note logParams is used when…​FIXME

Creating Instrumentation —  create Method

create[E](estimator: E, dataset: Dataset[_]): Instrumentation[E]

create …​FIXME

Note create is used when…​FIXME


MLUtils
MLUtils is…​FIXME

kFold Method

kFold[T](rdd: RDD[T], numFolds: Int, seed: Long): Array[(RDD[T], RDD[T])]

kFold …​FIXME

Note kFold is used when…​FIXME
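A hedged usage sketch (the RDD contents and fold count below are made up for illustration):

import org.apache.spark.mllib.util.MLUtils

val data = sc.parallelize(1 to 100)
// Split the RDD into 3 (training, validation) pairs
val folds = MLUtils.kFold(data, 3, 42L)
folds.foreach { case (training, validation) =>
  println(s"training: ${training.count}, validation: ${validation.count}")
}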


Spark Shell — spark-shell shell script


Spark shell is an interactive environment where you can learn how to make the most out of
Apache Spark quickly and conveniently.

Tip Spark shell is particularly helpful for fast interactive prototyping.

Under the covers, Spark shell is a standalone Spark application written in Scala that offers
environment with auto-completion (using TAB key) where you can run ad-hoc queries and
get familiar with the features of Spark (that help you in developing your own standalone
Spark applications). It is a very convenient tool to explore the many things available in Spark
with immediate feedback. It is one of the many reasons why Spark is so helpful for tasks to
process datasets of any size.

There are variants of Spark shell for different languages: spark-shell for Scala, pyspark
for Python and sparkR for R.

Note This document (and the book in general) uses spark-shell for Scala only.

You can start Spark shell using spark-shell script.

$ ./bin/spark-shell
scala>

spark-shell is an extension of Scala REPL with automatic instantiation of SparkSession as

spark (and SparkContext as sc ).

scala> :type spark


org.apache.spark.sql.SparkSession

// Learn the current version of Spark in use


scala> spark.version
res0: String = 2.1.0-SNAPSHOT

spark-shell also imports Spark SQL's implicits and the sql method.

scala> :imports
1) import spark.implicits._ (59 terms, 38 are implicit)
2) import spark.sql (1 terms)
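For instance, you can use the sql method straight away (a trivial query shown for illustration):

scala> spark.sql("select 1 + 1 as sum").show
+---+
|sum|
+---+
|  2|
+---+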


When you execute spark-shell you actually execute Spark submit as follows:

Note org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name "Spark shell" spark-shell

Set SPARK_PRINT_LAUNCH_COMMAND to see the entire command to be executed. Refer to Print Launch Command of Spark Scripts.

Using Spark shell


You start Spark shell using spark-shell script (available in bin directory).

$ ./bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newL
evel).
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.47.71.138:4040
Spark context available as 'sc' (master = local[*], app id = local-1477858597347).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0-SNAPSHOT
/_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Spark shell creates an instance of SparkSession under the name spark for you (so you
don’t have to know the details how to do it yourself on day 1).

scala> :type spark


org.apache.spark.sql.SparkSession

Besides, there is also sc value created which is an instance of SparkContext.


scala> :type sc
org.apache.spark.SparkContext

To close Spark shell, you press Ctrl+D or type in :q (or any prefix of :quit ).

scala> :q

Settings
Table 1. Spark Properties

spark.repl.class.uri (default: null)
Used in spark-shell to create REPL ClassLoader to load new classes defined in the Scala REPL as a user types code.
Enable INFO logging level for org.apache.spark.executor.Executor logger to have the value printed out to the logs:

INFO Using REPL class URI: [classUri]


Spark Submit —  spark-submit shell script


spark-submit shell script allows you to manage your Spark applications.

You can submit your Spark application to a Spark deployment environment for execution, kill
or request status of Spark applications.

You can find spark-submit script in bin directory of the Spark distribution.

$ ./bin/spark-submit
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
...

When executed, spark-submit script first checks whether SPARK_HOME environment variable
is set and sets it to the directory that contains bin/spark-submit shell script if not. It then
executes spark-class shell script to run SparkSubmit standalone application.

Caution FIXME Add Cluster Manager and Deploy Mode to the table below (see options value)

Table 1. Command-Line Options, Spark Properties and Environment Variables (from SparkSubmitArguments' handle)

Command-Line Option | Spark Property | Environment Variable | Description
action | | | Defaults to SUBMIT
--archives | | |
--conf | | |
--deploy-mode | spark.submit.deployMode | DEPLOY_MODE | Deploy mode
--driver-class-path | spark.driver.extraClassPath | | The driver's class path
--driver-java-options | spark.driver.extraJavaOptions | | The driver's JVM options
--driver-library-path | spark.driver.extraLibraryPath | | The driver's native library path
--driver-memory | spark.driver.memory | SPARK_DRIVER_MEMORY | The driver's memory
--driver-cores | spark.driver.cores | |
--exclude-packages | spark.jars.excludes | |
--executor-cores | spark.executor.cores | SPARK_EXECUTOR_CORES | The number of executor cores
--executor-memory | spark.executor.memory | SPARK_EXECUTOR_MEMORY | An executor's memory
--files | spark.files | |
ivyRepoPath | spark.jars.ivy | |
--jars | spark.jars | |
--keytab | spark.yarn.keytab | |
--kill | | | submissionToKill and action set to KILL
--master | spark.master | MASTER | Master URL. Defaults to local[*]
--class | | |
--name | spark.app.name | SPARK_YARN_APP_NAME (YARN only) | Uses mainClass or (when not set) the base name off primaryResource
--num-executors | spark.executor.instances | |
--packages | spark.jars.packages | |
--principal | spark.yarn.principal | |
--properties-file | | |
--proxy-user | | |
--py-files | | |
--queue | | |
--repositories | | |
--status | | | submissionToRequestStatus and action set to REQUEST_STATUS
--supervise | | |
--total-executor-cores | spark.cores.max | |
--verbose | | |
--version | | | SparkSubmit.printVersionAndExit
--help | | | printUsageAndExit(0)
--usage-error | | | printUsageAndExit(1)

Tip Set SPARK_PRINT_LAUNCH_COMMAND environment variable to have the complete Spark command printed out to the console, e.g.

$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
Spark Command: /Library/Ja...

Refer to Print Launch Command of Spark Scripts (or org.apache.spark.launcher.Main Standalone Application where this environment variable is actually used).

Tip Avoid using scala.App trait for a Spark application's main class in Scala as reported in SPARK-4170 Closure problems when running Scala app that "extends App". Refer to Executing Main —  runMain internal method in this document.

Preparing Submit Environment 


—  prepareSubmitEnvironment Internal Method

prepareSubmitEnvironment(args: SparkSubmitArguments)
: (Seq[String], Seq[String], Map[String, String], String)

prepareSubmitEnvironment creates a 4-element tuple, i.e. (childArgs, childClasspath,

sysProps, childMainClass) .


Table 2. prepareSubmitEnvironment 's Four-Element Return Tuple


Element Description
childArgs Arguments

childClasspath Classpath elements

sysProps Spark properties

childMainClass Main class

prepareSubmitEnvironment uses options to…​

Caution FIXME

Note prepareSubmitEnvironment is used in SparkSubmit object.

Tip See the elements of the return tuple using --verbose command-line option.

Custom Spark Properties File —  --properties-file


command-line option

--properties-file [FILE]

--properties-file command-line option sets the path to a file FILE from which Spark

loads extra Spark properties.

Tip Spark uses conf/spark-defaults.conf by default.
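For example (my-spark.conf, com.example.MyApp and /path/to/app.jar are hypothetical placeholders):

$ cat my-spark.conf
spark.master local[2]
spark.app.name PropsFileDemo

$ ./bin/spark-submit \
  --properties-file my-spark.conf \
  --verbose \
  --class com.example.MyApp \
  /path/to/app.jar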

Driver Cores in Cluster Deploy Mode —  --driver-cores


command-line option

--driver-cores NUM

--driver-cores command-line option sets the number of cores to NUM for the driver in the

cluster deploy mode.

Note --driver-cores switch is only available for cluster mode (for Standalone, Mesos, and YARN).

Note It corresponds to spark.driver.cores setting.


Note It is printed out to the standard error output in verbose mode.

Additional JAR Files to Distribute —  --jars command-


line option

--jars JARS

--jars is a comma-separated list of local jars to include on the driver’s and executors'

classpaths.

Caution FIXME

Additional Files to Distribute --files command-line


option

--files FILES

Caution FIXME

Additional Archives to Distribute —  --archives


command-line option

--archives ARCHIVES

Caution FIXME

Specifying YARN Resource Queue —  --queue command-


line option

--queue QUEUE_NAME

With --queue you can choose the YARN resource queue to submit a Spark application to.
The default queue name is default .

Caution FIXME What is a queue ?

Note It corresponds to spark.yarn.queue Spark’s setting.


Tip It is printed out to the standard error output in verbose mode.
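For example (the queue name, main class and jar below are hypothetical placeholders):

$ ./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue ml_queue \
  --class com.example.MyApp \
  /path/to/app.jar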

Actions

Submitting Applications for Execution —  submit method


The default action of spark-submit script is to submit a Spark application to a deployment
environment for execution.

Tip Use --verbose command-line switch to know the main class to be executed, arguments, system properties, and classpath (to ensure that the command-line arguments and switches were processed properly).

When executed, spark-submit executes submit method.

submit(args: SparkSubmitArguments): Unit

If proxyUser is set it will…​FIXME

Caution FIXME Review why and when to use proxyUser .

It passes the execution on to runMain.

Executing Main —  runMain internal method

runMain(
childArgs: Seq[String],
childClasspath: Seq[String],
sysProps: Map[String, String],
childMainClass: String,
verbose: Boolean): Unit

runMain is an internal method to build execution environment and invoke the main method

of the Spark application that has been submitted for execution.

Note It is exclusively used when submitting applications for execution.

When verbose input flag is enabled (i.e. true ) runMain prints out all the input
parameters, i.e. childMainClass , childArgs , sysProps , and childClasspath (in that
order).


Main class:
[childMainClass]
Arguments:
[childArgs one per line]
System properties:
[sysProps one per line]
Classpath elements:
[childClasspath one per line]

Note Use spark-submit 's --verbose command-line option to enable verbose flag.

runMain builds the context classloader (as loader ) depending on

spark.driver.userClassPathFirst flag.

Caution FIXME Describe spark.driver.userClassPathFirst

It adds the jars specified in childClasspath input parameter to the context classloader (that
is later responsible for loading the childMainClass main class).

Note childClasspath input parameter corresponds to --jars command-line option with the primary resource if specified in client deploy mode.

It sets all the system properties specified in sysProps input parameter (using Java’s
System.setProperty method).

It creates an instance of childMainClass main class (as mainClass ).

Note childMainClass is the main class spark-submit has been invoked with.

Tip Avoid using scala.App trait for a Spark application's main class in Scala as reported in SPARK-4170 Closure problems when running Scala app that "extends App".

If you use scala.App for the main class, you should see the following warning message in
the logs:

Warning: Subclasses of scala.App may not work correctly. Use a main() method instead.

Finally, runMain executes the main method of the Spark application passing in the
childArgs arguments.

Any SparkUserAppException exceptions lead to System.exit while the others are simply re-
thrown.

Adding Local Jars to ClassLoader —  addJarToClasspath internal method


addJarToClasspath(localJar: String, loader: MutableURLClassLoader)

addJarToClasspath is an internal method to add file or local jars (as localJar ) to the

loader classloader.

Internally, addJarToClasspath resolves the URI of localJar . If the URI is file or local
and the file denoted by localJar exists, localJar is added to loader . Otherwise, the
following warning is printed out to the logs:

Warning: Local jar /path/to/fake.jar does not exist, skipping.

For all other URIs, the following warning is printed out to the logs:

Warning: Skip remote jar hdfs://fake.jar.

Note addJarToClasspath assumes file URI when localJar has no URI specified, e.g. /path/to/local.jar .

Caution FIXME What is a URI fragment? How does this change re YARN distributed cache? See Utils#resolveURI .

Killing Applications —  --kill command-line option


--kill

Requesting Application Status —  --status command-line


option
--status

Command-line Options
Execute spark-submit --help to know about the command-line options supported.

➜ spark git:(master) ✗ ./bin/spark-submit --help


Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client")


or
on one of the worker machines inside the cluster ("clust
er")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the dri
ver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to inc
lude
on the driver and executor classpaths. Will search the l
ocal
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude w
hile
resolving the dependencies provided in --packages to avo
id
dependency conflicts.
--repositories Comma-separated list of additional remote repositories t
o
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to plac
e
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the workin
g
directory of each executor.

--conf PROP=VALUE Arbitrary Spark configuration property.


--properties-file FILE Path to a file from which to load extra properties. If n
ot
specified, this will look for conf/spark-defaults.conf.

--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note tha
t
jars added with --jars are automatically included in the
classpath.

--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).

--proxy-user NAME User to impersonate when submitting the application.


This argument does not work with --principal / --keytab.

--help, -h Show this help message and exit.


--verbose, -v Print additional debug output.
--version, Print the version of current Spark.


Spark standalone with cluster deploy mode only:


--driver-cores NUM Cores for driver (Default: 1).

Spark standalone or Mesos with cluster deploy mode only:


--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.

Spark standalone and Mesos only:


--total-executor-cores NUM Total cores for all executors.

Spark standalone and YARN only:


--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)

YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma separated list of archives to be extracted into th
e
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for t
he
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and th
e
delegation tokens periodically.

--class

--conf or -c

--deploy-mode (see Deploy Mode)

--driver-class-path (see --driver-class-path command-line option)

--driver-cores (see Driver Cores in Cluster Deploy Mode)

--driver-java-options

--driver-library-path

--driver-memory

--executor-memory

--files


--jars

--kill for Standalone cluster mode only

--master

--name

--packages

--exclude-packages

--properties-file (see Custom Spark Properties File)

--proxy-user

--py-files

--repositories

--status for Standalone cluster mode only

--total-executor-cores

List of switches, i.e. command-line options that do not take parameters:

--help or -h

--supervise for Standalone cluster mode only

--usage-error

--verbose or -v (see Verbose Mode)

--version (see Version)

YARN-only options:

--archives

--executor-cores

--keytab

--num-executors

--principal

--queue (see Specifying YARN Resource Queue (--queue switch))

--driver-class-path command-line option


--driver-class-path command-line option sets the extra class path entries (e.g. jars and

directories) that should be added to a driver’s JVM.

Tip You should use --driver-class-path in client deploy mode (not SparkConf) to ensure that the CLASSPATH is set up with the entries. client deploy mode uses the same JVM for the driver as spark-submit 's.

--driver-class-path sets the internal driverExtraClassPath property (when

SparkSubmitArguments.handle called).

It works for all cluster managers and deploy modes.

If driverExtraClassPath is not set on command-line, the spark.driver.extraClassPath setting is used.

Note Command-line options (e.g. --driver-class-path ) have higher precedence than their corresponding Spark settings in a Spark properties file (e.g. spark.driver.extraClassPath ). You can therefore control the final settings by overriding Spark settings on command line using the command-line options.

Table 3. Spark Settings in Spark Properties File and on Command Line

Setting / System Property | Command-Line Option | Description
spark.driver.extraClassPath | --driver-class-path | Extra class path entries (e.g. jars and directories) to pass to a driver's JVM.
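A hedged illustration of the precedence; the property value and jar paths below are made up:

# conf/spark-defaults.conf (or a file given with --properties-file)
spark.driver.extraClassPath /opt/libs/common.jar

# The command-line option below takes precedence over the property above
$ ./bin/spark-submit \
  --driver-class-path /opt/libs/override.jar \
  --class com.example.MyApp \
  /path/to/app.jar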

Version —  --version command-line option

$ ./bin/spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0-SNAPSHOT
/_/

Branch master
Compiled by user jacek on 2016-09-30T07:08:39Z
Revision 1fad5596885aab8b32d2307c0edecbae50d5bd7a
Url https://github.com/apache/spark.git
Type --help for more information.

Verbose Mode —  --verbose command-line option


When spark-submit is executed with --verbose command-line option, it enters verbose mode.

In verbose mode, the parsed arguments are printed out to the System error output.

FIXME

It also prints out propertiesFile and the properties from the file.

FIXME

Deploy Mode —  --deploy-mode command-line option


You use spark-submit’s --deploy-mode command-line option to specify the deploy mode for
a Spark application.

Environment Variables
The following is the list of environment variables that are considered when command-line
options are not specified:

MASTER for --master

SPARK_DRIVER_MEMORY for --driver-memory

SPARK_EXECUTOR_MEMORY (see Environment Variables in the SparkContext document)

SPARK_EXECUTOR_CORES

DEPLOY_MODE

SPARK_YARN_APP_NAME

_SPARK_CMD_USAGE

External packages and custom repositories


The spark-submit utility supports specifying external packages using Maven coordinates
using --packages and custom repositories using --repositories .

./bin/spark-submit \
--packages my:awesome:package \
--repositories s3n://$aws_ak:$aws_sak@bucket/path/to/repo


FIXME Why should I care?

Launching SparkSubmit Standalone Application —  main


method

Tip The source code of the script lives in https://github.com/apache/spark/blob/master/bin/spark-submit.

When executed, spark-submit script simply passes the call to spark-class with
org.apache.spark.deploy.SparkSubmit class followed by command-line arguments.

Tip spark-class uses the class name —  org.apache.spark.deploy.SparkSubmit  — to parse command-line arguments appropriately. Refer to org.apache.spark.launcher.Main Standalone Application.

It creates an instance of SparkSubmitArguments.

If in verbose mode, it prints out the application arguments.

It then relays the execution to action-specific internal methods (with the application
arguments):

submit (the default when no action was explicitly given)

kill (when --kill switch is used)

requestStatus (when --status switch is used)

Note The action can only have one of the three available values: SUBMIT , KILL , or REQUEST_STATUS .

spark-env.sh - load additional environment settings


spark-env.sh consists of environment settings to configure Spark for your site.

export JAVA_HOME=/your/directory/java
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=1G

spark-env.sh is loaded at the startup of Spark’s command line scripts.

SPARK_ENV_LOADED env var is to ensure the spark-env.sh script is loaded once.


SPARK_CONF_DIR points at the directory with spark-env.sh or $SPARK_HOME/conf is used.

spark-env.sh is executed if it exists.

$SPARK_HOME/conf directory has spark-env.sh.template file that serves as a template

for your own custom configuration.

Consult Environment Variables in the official documentation.


SparkSubmitArguments  — spark-submit’s
Command-Line Argument Parser
SparkSubmitArguments is a custom SparkSubmitArgumentsParser to handle the command-line

arguments of spark-submit script that the actions (i.e. submit, kill and status) use for their
execution (possibly with the explicit env environment).

Note SparkSubmitArguments is created when launching spark-submit script with only args passed in and later used for printing the arguments in verbose mode.

Calculating Spark Properties 


—  loadEnvironmentArguments internal method

loadEnvironmentArguments(): Unit

loadEnvironmentArguments calculates the Spark properties for the current execution of spark-

submit.

loadEnvironmentArguments reads command-line options first followed by Spark properties

and System’s environment variables.

Note Spark config properties start with spark. prefix and can be set using --conf [key=value] command-line option.

handle Method

protected def handle(opt: String, value: String): Boolean

handle parses the input opt argument and returns true or throws an

IllegalArgumentException when it finds an unknown opt .

handle sets the internal properties in the table Command-Line Options, Spark Properties

and Environment Variables.

mergeDefaultSparkProperties Internal Method

mergeDefaultSparkProperties(): Unit


mergeDefaultSparkProperties merges Spark properties from the default Spark properties file,

i.e. spark-defaults.conf with those specified through --conf command-line option.


SparkSubmitOptionParser  — spark-submit’s
Command-Line Parser
SparkSubmitOptionParser is the parser of spark-submit's command-line options.

Table 1. spark-submit Command-Line Options

Command-Line Option | Description
--archives |
--class | The main class to run (as mainClass internal attribute).
--conf [prop=value] or -c [prop=value] | All = -separated values end up in conf potentially overriding existing settings. Order on command-line matters.
--deploy-mode | deployMode internal property
--driver-class-path | spark.driver.extraClassPath in conf  — the driver class path
--driver-cores |
--driver-java-options | spark.driver.extraJavaOptions in conf  — the driver VM options
--driver-library-path | spark.driver.extraLibraryPath in conf  — the driver native library path
--driver-memory | spark.driver.memory in conf
--exclude-packages |
--executor-cores |
--executor-memory |
--files |
--help or -h | The option is added to sparkArgs
--jars |
--keytab |
--kill | The option and a value are added to sparkArgs
--master | master internal property
--name |
--num-executors |
--packages |
--principal |
--properties-file [FILE] | propertiesFile internal property. Refer to Custom Spark Properties File —  --properties-file command-line option.
--proxy-user |
--py-files |
--queue |
--repositories |
--status | The option and a value are added to sparkArgs
--supervise |
--total-executor-cores |
--usage-error | The option is added to sparkArgs
--verbose or -v |
--version | The option is added to sparkArgs

SparkSubmitOptionParser Callbacks
SparkSubmitOptionParser is supposed to be overridden for the following capabilities (as callbacks).


Table 2. Callbacks
Callback Description
handle Executed when an option with an argument is parsed.

handleUnknown Executed when an unrecognized option is parsed.

handleExtraArgs
Executed for the command-line arguments that handle and
handleUnknown callbacks have not processed.

SparkSubmitOptionParser belongs to org.apache.spark.launcher Scala package and spark-

launcher Maven/sbt module.

Note org.apache.spark.launcher.SparkSubmitArgumentsParser is a custom SparkSubmitOptionParser .

Parsing Command-Line Arguments —  parse Method

final void parse(List<String> args)

parse parses a list of command-line arguments.

parse calls handle callback whenever it finds a known command-line option or a switch (a

command-line option with no parameter). It calls handleUnknown callback for unrecognized


command-line options.

parse keeps processing command-line arguments until handle or handleUnknown callback

return false or all command-line arguments have been consumed.

Ultimately, parse calls handleExtraArgs callback.


SparkSubmitCommandBuilder Command Builder

SparkSubmitCommandBuilder is used to build a command that spark-submit and

SparkLauncher use to launch a Spark application.

SparkSubmitCommandBuilder uses the first argument to distinguish between shells:

1. pyspark-shell-main

2. sparkr-shell-main

3. run-example

Caution FIXME Describe run-example

SparkSubmitCommandBuilder parses command-line arguments using OptionParser (which is

a SparkSubmitOptionParser). OptionParser comes with the following methods:

1. handle to handle the known options (see the table below). It sets up master ,

deployMode , propertiesFile , conf , mainClass , sparkArgs internal properties.

2. handleUnknown to handle unrecognized options that usually lead to Unrecognized

option error message.

3. handleExtraArgs to handle extra arguments that are considered a Spark application’s

arguments.

Note For spark-shell it assumes that the application arguments are after spark-submit 's arguments.

SparkSubmitCommandBuilder.buildCommand /
buildSparkSubmitCommand

public List<String> buildCommand(Map<String, String> env)

Note buildCommand is part of the AbstractCommandBuilder public API.

SparkSubmitCommandBuilder.buildCommand simply passes calls on to

buildSparkSubmitCommand private method (unless it was executed for pyspark or sparkr


scripts which we are not interested in in this document).

buildSparkSubmitCommand Internal Method


private List<String> buildSparkSubmitCommand(Map<String, String> env)

buildSparkSubmitCommand starts by building so-called effective config. When in client mode,

buildSparkSubmitCommand adds spark.driver.extraClassPath to the result Spark command.

Note Use spark-submit to have spark.driver.extraClassPath in effect.

buildSparkSubmitCommand builds the first part of the Java command passing in the extra

classpath (only for client deploy mode).

Caution FIXME Add isThriftServer case.

buildSparkSubmitCommand appends SPARK_SUBMIT_OPTS and SPARK_JAVA_OPTS environment

variables.

(only for client deploy mode) …​

Caution FIXME Elaborate on the client deploy mode case.

addPermGenSizeOpt case…​elaborate

Caution FIXME Elaborate on addPermGenSizeOpt

buildSparkSubmitCommand appends org.apache.spark.deploy.SparkSubmit and the command-

line arguments (using buildSparkSubmitArgs).

buildSparkSubmitArgs method

List<String> buildSparkSubmitArgs()

buildSparkSubmitArgs builds a list of command-line arguments for spark-submit.

buildSparkSubmitArgs uses a SparkSubmitOptionParser to add the command-line

arguments that spark-submit recognizes (when it is executed later on and uses the very
same SparkSubmitOptionParser parser to parse command-line arguments).


Table 1. SparkSubmitCommandBuilder Properties and Corresponding SparkSubmitOptionParser Attributes

SparkSubmitCommandBuilder Property    SparkSubmitOptionParser Attribute

verbose VERBOSE

master MASTER [master]

deployMode DEPLOY_MODE [deployMode]

appName NAME [appName]

conf CONF [key=value]*

propertiesFile PROPERTIES_FILE [propertiesFile]

jars JARS [comma-separated jars]

files FILES [comma-separated files]

pyFiles PY_FILES [comma-separated pyFiles]

mainClass CLASS [mainClass]

sparkArgs sparkArgs (passed straight through)

appResource appResource (passed straight through)

appArgs appArgs (passed straight through)

getEffectiveConfig Internal Method

Map<String, String> getEffectiveConfig()

getEffectiveConfig internal method builds effectiveConfig that is conf with the Spark

properties file loaded (using loadPropertiesFile internal method) skipping keys that have
already been loaded (it happened when the command-line options were parsed in handle
method).

Note Command-line options (e.g. --driver-class-path ) have higher precedence than their corresponding Spark settings in a Spark properties file (e.g. spark.driver.extraClassPath ). You can therefore control the final settings by overriding Spark settings on command line using the command-line options.

The properties file is read using UTF-8 charset and white spaces around values are trimmed.


isClientMode Internal Method

private boolean isClientMode(Map<String, String> userProps)

isClientMode checks master first (from the command-line options) and then spark.master

Spark property. Same with deployMode and spark.submit.deployMode .

Caution FIXME Review master and deployMode . How are they set?

isClientMode responds positive when no explicit master and client deploy mode set

explicitly.

OptionParser
OptionParser is a custom SparkSubmitOptionParser that SparkSubmitCommandBuilder uses

to parse command-line arguments. It defines all the SparkSubmitOptionParser callbacks, i.e.


handle, handleUnknown, and handleExtraArgs, for command-line argument handling.

OptionParser’s handle Callback

boolean handle(String opt, String value)

OptionParser comes with a custom handle callback (from the SparkSubmitOptionParser

callbacks).


Table 2. handle Method

Command-Line Option | Property / Behaviour
--master | master
--deploy-mode | deployMode
--properties-file | propertiesFile
--driver-memory | Sets spark.driver.memory (in conf )
--driver-java-options | Sets spark.driver.extraJavaOptions (in conf )
--driver-library-path | Sets spark.driver.extraLibraryPath (in conf )
--driver-class-path | Sets spark.driver.extraClassPath (in conf )
--conf | Expects a key=value pair that it puts in conf
--class | Sets mainClass (in conf ). It may also set allowsMixedArguments and appResource if the execution is for one of the special classes, i.e. spark-shell, SparkSQLCLIDriver , or HiveThriftServer2.
--kill or --status | Disables isAppResourceReq and adds itself with the value to sparkArgs .
--help or --usage-error | Disables isAppResourceReq and adds itself to sparkArgs .
--version | Disables isAppResourceReq and adds itself to sparkArgs .
anything else | Adds an element to sparkArgs

OptionParser’s handleUnknown Method

boolean handleUnknown(String opt)


If allowsMixedArguments is enabled, handleUnknown simply adds the input opt to appArgs


and allows for further parsing of the argument list.

Caution FIXME Where’s allowsMixedArguments enabled?

If isExample is enabled, handleUnknown sets mainClass to be org.apache.spark.examples.


[opt] (unless the input opt has already the package prefix) and stops further parsing of

the argument list.

Caution FIXME Where’s isExample enabled?

Otherwise, handleUnknown sets appResource and stops further parsing of the argument list.

OptionParser’s handleExtraArgs Method

void handleExtraArgs(List<String> extra)

handleExtraArgs adds all the extra arguments to appArgs .


spark-class shell script


spark-class shell script is the Spark application command-line launcher that is responsible

for setting up JVM environment and executing a Spark application.

Note Ultimately, any shell script in Spark, e.g. spark-submit, calls spark-class script.

You can find spark-class script in bin directory of the Spark distribution.

When started, spark-class first loads $SPARK_HOME/bin/load-spark-env.sh , collects the


Spark assembly jars, and executes org.apache.spark.launcher.Main.

Depending on the Spark distribution (or rather lack thereof), i.e. whether RELEASE file exists
or not, it sets SPARK_JARS_DIR environment variable to [SPARK_HOME]/jars or
[SPARK_HOME]/assembly/target/scala-[SPARK_SCALA_VERSION]/jars , respectively (with the latter

being a local build).

If SPARK_JARS_DIR does not exist, spark-class prints the following error message and exits
with the code 1 .

Failed to find Spark jars directory ([SPARK_JARS_DIR]).


You need to build Spark with the target "package" before running this program.

spark-class sets LAUNCH_CLASSPATH environment variable to include all the jars under

SPARK_JARS_DIR .

If SPARK_PREPEND_CLASSES is enabled, [SPARK_HOME]/launcher/target/scala-


[SPARK_SCALA_VERSION]/classes directory is added to LAUNCH_CLASSPATH as the first entry.

Note Use SPARK_PREPEND_CLASSES to have the Spark launcher classes (from [SPARK_HOME]/launcher/target/scala-[SPARK_SCALA_VERSION]/classes ) appear before the other Spark assembly jars. It is useful for development so your changes don't require rebuilding Spark again.

SPARK_TESTING and SPARK_SQL_TESTING environment variables enable test special mode.

Caution FIXME What’s so special about the env vars?

spark-class uses org.apache.spark.launcher.Main command-line application to compute

the Spark command to launch. The Main class programmatically computes the command
that spark-class executes afterwards.

Tip Use JAVA_HOME to point at the JVM to use.


Launching org.apache.spark.launcher.Main Standalone


Application
org.apache.spark.launcher.Main is a Scala standalone application used in spark-class to

prepare the Spark command to execute.

Main expects that the first parameter is the class name that is the "operation mode":

1. org.apache.spark.deploy.SparkSubmit  —  Main uses SparkSubmitCommandBuilder to

parse command-line arguments. This is the mode spark-submit uses.

2. anything —  Main uses SparkClassCommandBuilder to parse command-line arguments.

$ ./bin/spark-class org.apache.spark.launcher.Main
Exception in thread "main" java.lang.IllegalArgumentException: Not enough arguments: m
issing class name.
at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderU
tils.java:241)
at org.apache.spark.launcher.Main.main(Main.java:51)

Main uses buildCommand method on the builder to build a Spark command.

If SPARK_PRINT_LAUNCH_COMMAND environment variable is enabled, Main prints the final Spark


command to standard error.

Spark Command: [cmd]


========================================

If on Windows it calls prepareWindowsCommand while on non-Windows OSes


prepareBashCommand with tokens separated by \0 .

Caution FIXME What’s prepareWindowsCommand ? prepareBashCommand ?

Main uses the following environment variables:

SPARK_DAEMON_JAVA_OPTS and SPARK_MASTER_OPTS to be added to the command line of

the command.

SPARK_DAEMON_MEMORY (default: 1g ) for -Xms and -Xmx .


AbstractCommandBuilder
AbstractCommandBuilder is the base command builder for SparkSubmitCommandBuilder

and SparkClassCommandBuilder specialized command builders.

AbstractCommandBuilder expects that command builders define buildCommand .

Table 1. AbstractCommandBuilder Methods


Method Description
buildCommand The only abstract method that subclasses have to define.

buildJavaCommand

getConfDir

Loads the configuration file for a Spark application, be it


loadPropertiesFile the user-specified properties file or spark-defaults.conf
file under the Spark configuration directory.

buildJavaCommand Internal Method

List<String> buildJavaCommand(String extraClassPath)

buildJavaCommand builds the Java command for a Spark application (which is a collection of

elements with the path to java executable, JVM options from java-opts file, and a class
path).

If javaHome is set, buildJavaCommand adds [javaHome]/bin/java to the result Java


command. Otherwise, it uses JAVA_HOME or, when no earlier checks succeeded, falls
through to java.home Java’s system property.

Caution FIXME Who sets javaHome internal property and when?

buildJavaCommand loads extra Java options from the java-opts file in configuration

directory if the file exists and adds them to the result Java command.

Eventually, buildJavaCommand builds the class path (with the extra class path if non-empty)
and adds it as -cp to the result Java command.

buildClassPath method


List<String> buildClassPath(String appClassPath)

buildClassPath builds the classpath for a Spark application.

Note Directories always end up with the OS-specific file separator at the end of their paths.

buildClassPath adds the following in that order:

1. SPARK_CLASSPATH environment variable

2. The input appClassPath

3. The configuration directory

4. (only with SPARK_PREPEND_CLASSES set or SPARK_TESTING being 1 ) Locally compiled


Spark classes in classes , test-classes and Core’s jars.

Caution FIXME Elaborate on "locally compiled Spark classes".

5. (only with SPARK_SQL_TESTING being 1 ) …​

Caution FIXME Elaborate on the SQL testing case

6. HADOOP_CONF_DIR environment variable

7. YARN_CONF_DIR environment variable

8. SPARK_DIST_CLASSPATH environment variable

Note childEnv is queried first before System properties. It is always empty for AbstractCommandBuilder (and SparkSubmitCommandBuilder , too).

Loading Properties File —  loadPropertiesFile Internal


Method

Properties loadPropertiesFile()

loadPropertiesFile is part of AbstractCommandBuilder private API that loads Spark settings

from a properties file (when specified on the command line) or spark-defaults.conf in the
configuration directory.

It loads the settings from the following files starting from the first and checking every location
until the first properties file is found:


1. propertiesFile (if specified using --properties-file command-line option or set by

AbstractCommandBuilder.setPropertiesFile ).

2. [SPARK_CONF_DIR]/spark-defaults.conf

3. [SPARK_HOME]/conf/spark-defaults.conf

Note loadPropertiesFile reads a properties file using UTF-8 .

Spark’s Configuration Directory —  getConfDir Internal


Method
AbstractCommandBuilder uses getConfDir to compute the current configuration directory of

a Spark application.

It uses SPARK_CONF_DIR (from childEnv which is always empty anyway or as an environment variable) and falls through to [SPARK_HOME]/conf (with SPARK_HOME from getSparkHome internal method).

Spark’s Home Directory —  getSparkHome Internal


Method
AbstractCommandBuilder uses getSparkHome to compute Spark’s home directory for a Spark

application.

It uses SPARK_HOME (from childEnv which is always empty anyway or as an environment variable).

If SPARK_HOME is not set, Spark throws an IllegalStateException :

Spark home not found; set it explicitly or use the SPARK_HOME environment variable.


SparkLauncher  — Launching Spark


Applications Programmatically
SparkLauncher is an interface to launch Spark applications programmatically, i.e. from a

code (not spark-submit directly). It uses a builder pattern to configure a Spark application
and launch it as a child process using spark-submit.

SparkLauncher belongs to org.apache.spark.launcher Scala package in spark-launcher

build module.

SparkLauncher uses SparkSubmitCommandBuilder to build the Spark command of a Spark

application to launch.

Table 1. SparkLauncher 's Builder Methods to Set Up Invocation of Spark Application


Setter Description

addAppArgs(String…​ args)
Adds command line arguments for a
Spark application.

addFile(String file)
Adds a file to be submitted with a Spark
application.

addJar(String jar)
Adds a jar file to be submitted with the
application.

addPyFile(String file)
Adds a python file / zip / egg to be
submitted with a Spark application.

addSparkArg(String arg)
Adds a no-value argument to the Spark
invocation.

Adds an argument with a value to the


Spark invocation. It recognizes known
addSparkArg(String name, String value) command-line arguments, i.e. --master ,
--properties-file , --conf , --class , -
-jars , --files , and --py-files .

directory(File dir)
Sets the working directory of spark-
submit.

redirectError() Redirects stderr to stdout.

redirectError(File errFile)
Redirects error output to the specified
errFile file.

redirectError(ProcessBuilder.Redirect to)
Redirects error output to the specified to Redirect.

redirectOutput(File outFile)
Redirects output to the specified outFile
file.

redirectOutput(ProcessBuilder.Redirect Redirects standard output to the specified


to) to Redirect.

Sets all output to be logged and


redirectToLog(String loggerName) redirected to a logger with the specified
name.

setAppName(String appName) Sets the name of a Spark application

Sets the main application resource, i.e.


setAppResource(String resource) the location of a jar file for Scala/Java
applications.

setConf(String key, String value)


Sets a Spark property. Expects key
starting with spark. prefix.

setDeployMode(String mode) Sets the deploy mode.

setJavaHome(String javaHome) Sets a custom JAVA_HOME .

setMainClass(String mainClass) Sets the main class.

setMaster(String master) Sets the master URL.

Sets the internal propertiesFile .


setPropertiesFile(String path)
See loadPropertiesFile Internal Method.

setSparkHome(String sparkHome) Sets a custom SPARK_HOME .

setVerbose(boolean verbose)
Enables verbose reporting for
SparkSubmit.

After the invocation of a Spark application is set up, use launch() method to launch a sub-
process that will start the configured Spark application. It is however recommended to use
startApplication method instead.


import org.apache.spark.launcher.SparkLauncher

val command = new SparkLauncher()


.setAppResource("SparkPi")
.setVerbose(true)

val appHandle = command.startApplication()
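A slightly fuller sketch using startApplication with a state listener; the Spark home, application jar and main class below are hypothetical placeholders:

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// All paths and class names below are placeholders
val handle = new SparkLauncher().
  setSparkHome("/path/to/spark").
  setAppResource("/path/to/app.jar").
  setMainClass("com.example.MyApp").
  setMaster("local[*]").
  setConf("spark.ui.enabled", "false").
  startApplication(new SparkAppHandle.Listener {
    // invoked whenever the application changes state, e.g. CONNECTED, RUNNING, FINISHED
    override def stateChanged(h: SparkAppHandle): Unit =
      println(s"State changed to ${h.getState}")
    // invoked when information about the application changes, e.g. its application id
    override def infoChanged(h: SparkAppHandle): Unit =
      println(s"Application id: ${h.getAppId}")
  })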


Spark Architecture
Spark uses a master/worker architecture. There is a driver that talks to a single
coordinator called master that manages workers in which executors run.

Figure 1. Spark architecture


The driver and the executors run in their own Java processes. You can run them all on the
same (horizontal cluster) or separate machines (vertical cluster) or in a mixed machine
configuration.


Figure 2. Spark architecture in detail


Physical machines are called hosts or nodes.


Driver
A Spark driver (aka an application’s driver process) is a JVM process that hosts
SparkContext for a Spark application. It is the master node in a Spark application.

It is the cockpit of jobs and tasks execution (using DAGScheduler and Task Scheduler). It
hosts Web UI for the environment.

Figure 1. Driver with the services


It splits a Spark application into tasks and schedules them to run on executors.

A driver is where the task scheduler lives and spawns tasks across workers.

A driver coordinates workers and overall execution of tasks.

Note Spark shell is a Spark application and the driver. It creates a SparkContext that is available as sc .


Driver requires the additional services (beside the common ones like ShuffleManager,
MemoryManager, BlockTransferService, BroadcastManager, CacheManager):

Listener Bus

RPC Environment

MapOutputTrackerMaster with the name MapOutputTracker

BlockManagerMaster with the name BlockManagerMaster

HttpFileServer

MetricsSystem with the name driver

OutputCommitCoordinator with the endpoint’s name OutputCommitCoordinator

Caution FIXME Diagram of RpcEnv for a driver (and later executors). Perhaps it should be in the notes about RpcEnv?

High-level control flow of work

Your Spark application runs as long as the Spark driver.

Once the driver terminates, so does your Spark application.

Creates SparkContext , RDDs, and executes transformations and actions

Launches tasks

Driver’s Memory
It can be set first using spark-submit’s --driver-memory command-line option or
spark.driver.memory and falls back to SPARK_DRIVER_MEMORY if not set earlier.

Note It is printed out to the standard error output in spark-submit’s verbose mode.
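For example, two equivalent ways of setting the driver's memory (the main class and jar are placeholders):

$ ./bin/spark-submit --driver-memory 2g --class com.example.MyApp /path/to/app.jar
$ ./bin/spark-submit --conf spark.driver.memory=2g --class com.example.MyApp /path/to/app.jar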

Driver’s Cores
It can be set first using spark-submit’s --driver-cores command-line option for cluster
deploy mode.

Note In client deploy mode the driver's memory corresponds to the memory of the JVM process the Spark application runs on.

Note It is printed out to the standard error output in spark-submit’s verbose mode.


Settings
Table 1. Spark Properties

spark.driver.blockManager.port (default: spark.blockManager.port)
Port to use for the BlockManager on the driver. More precisely, spark.driver.blockManager.port is used when NettyBlockTransferService is created (while SparkEnv is created for the driver).

spark.driver.host (default: localHostName)
The address of the node where the driver runs on. Set when SparkContext is created.

spark.driver.port (default: 0)
The port the driver listens to. It is first set to 0 in the driver when SparkContext is initialized. Set to the port of RpcEnv of the driver (in SparkEnv.create) or when client-mode ApplicationMaster connects to the driver (in Spark on YARN).

spark.driver.memory (default: 1g)
The driver's memory size (in MiBs). Refer to Driver's Memory.

spark.driver.cores (default: 1)
The number of CPU cores assigned to the driver in cluster deploy mode. NOTE: When Client is created (for Spark on YARN in cluster mode only), it sets the number of cores for ApplicationManager using spark.driver.cores. Refer to Driver's Cores.

spark.driver.extraLibraryPath

spark.driver.extraJavaOptions
Additional JVM options for the driver.

spark.driver.appUIAddress
spark.driver.appUIAddress is used exclusively in Spark on YARN. It is set when YarnClientSchedulerBackend starts to run ExecutorLauncher (and register ApplicationMaster for the Spark application).

spark.driver.libraryPath

spark.driver.extraClassPath
spark.driver.extraClassPath system property sets the additional classpath entries (e.g. jars
and directories) that should be added to the driver’s classpath in cluster deploy mode.

For client deploy mode you can use a properties file or command line to set
spark.driver.extraClassPath .

Note Do not use SparkConf since it is too late for client deploy mode given the JVM has
already been set up to start a Spark application. Refer to buildSparkSubmitCommand Internal
Method for the very low-level details of how it is handled internally.

spark.driver.extraClassPath uses an OS-specific path separator.

Note Use spark-submit 's --driver-class-path command-line option to override
spark.driver.extraClassPath from a Spark properties file.

Executor
Executor is a distributed agent that is responsible for executing tasks.

Executor is created when:

CoarseGrainedExecutorBackend receives RegisteredExecutor message (for Spark Standalone and YARN)

Spark on Mesos’s MesosExecutorBackend receives the registered notification

LocalEndpoint is created (for local mode)

Executor typically runs for the entire lifetime of a Spark application which is called static

allocation of executors (but you could also opt in for dynamic allocation).

Note Executors are managed exclusively by executor backends.

Executors report heartbeats and partial metrics for active tasks to the HeartbeatReceiver RPC
Endpoint on the driver.

Figure 1. HeartbeatReceiver’s Heartbeat Message Handler


Executors provide in-memory storage for RDDs that are cached in Spark applications (via
Block Manager).

When an executor starts it first registers with the driver and communicates directly to
execute tasks.

Figure 2. Launching tasks on executor using TaskRunners


Executor offers are described by executor id and the host on which an executor runs (see
Resource Offers in this document).

Executors can run multiple tasks over their lifetime, both in parallel and sequentially. They
track running tasks (by their task ids in the runningTasks internal registry). Consult the
Launching Tasks section.

Executors use a Executor task launch worker thread pool for launching tasks.

Executors send metrics (and heartbeats) using the internal heartbeater - Heartbeat Sender
Thread.

It is recommended to have as many executors as data nodes and as many cores as you can
get from the cluster.

Executors are described by their id, hostname, environment (as SparkEnv ), and
classpath (and, less importantly, and more for internal optimization, whether they run in
local or cluster mode).

Caution FIXME How many cores are assigned per executor?

Table 1. Executor’s Internal Properties


Name Initial Value Description

executorSource ExecutorSource FIXME

Table 2. Executor’s Internal Registries and Counters

heartbeatFailures
    FIXME

heartbeatReceiverRef
    RPC endpoint reference to HeartbeatReceiver on the driver (available on
    spark.driver.host at spark.driver.port port). Set when Executor is created.
    Used exclusively when Executor reports heartbeats and partial metrics for
    active tasks to the driver (that happens every
    spark.executor.heartbeatInterval interval).

maxDirectResultSize

maxResultSize

runningTasks
    Lookup table of TaskRunners per…FIXME

Tip Enable INFO or DEBUG logging level for org.apache.spark.executor.Executor
logger to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.executor.Executor=DEBUG

Refer to Logging.

updateDependencies Internal Method

updateDependencies(newFiles: Map[String, Long], newJars: Map[String, Long]): Unit

updateDependencies …​FIXME

Note updateDependencies is used exclusively when TaskRunner is started to run a task.

createClassLoader Method

Caution FIXME

addReplClassLoaderIfNeeded Method

Caution FIXME

Creating Executor Instance


Executor takes the following when created:

Executor ID

Executor’s host name

SparkEnv

Collection of user-defined JARs (to add to tasks' class path). Empty by default

Flag that says whether the executor runs in local or cluster mode (default: false , i.e.
cluster mode is preferred)

Java’s UncaughtExceptionHandler (default: SparkUncaughtExceptionHandler )

Note User-defined JARs are defined using --user-class-path command-line option of
CoarseGrainedExecutorBackend that can be set using spark.executor.extraClassPath property.

Note isLocal is enabled exclusively for LocalEndpoint (for Spark in local mode).

When created, you should see the following INFO messages in the logs:

INFO Executor: Starting executor ID [executorId] on host [executorHostname]

(only for non-local modes) Executor sets SparkUncaughtExceptionHandler as the default


handler invoked when a thread abruptly terminates due to an uncaught exception.

(only for non-local modes) Executor requests the BlockManager to initialize (with the Spark
application id of the SparkConf).

Note Spark application id corresponds to the value of spark.app.id Spark property.

(only for non-local modes) Executor requests the MetricsSystem to register the
ExecutorSource and shuffleMetricsSource of the BlockManager.

Note Executor uses SparkEnv to access the local MetricsSystem and BlockManager.

Executor creates a task class loader (optionally with REPL support) that the current

Serializer is requested to use (when deserializing task later).

Note Executor uses SparkEnv to access the local Serializer .

Executor starts sending heartbeats and active tasks metrics.

Executor initializes the internal registries and counters in the meantime (not necessarily at

the very end).

Launching Task —  launchTask Method

launchTask(
context: ExecutorBackend,
taskId: Long,
attemptNumber: Int,
taskName: String,
serializedTask: ByteBuffer): Unit

launchTask executes the input serializedTask task concurrently.

Internally, launchTask creates a TaskRunner, registers it in runningTasks internal registry


(by taskId ), and finally executes it on "Executor task launch worker" thread pool.

Figure 3. Launching tasks on executor using TaskRunners


Note launchTask is called by CoarseGrainedExecutorBackend (when it handles LaunchTask message),
MesosExecutorBackend, and LocalEndpoint.

Sending Heartbeats and Active Tasks Metrics — startDriverHeartbeater Method

Executors keep sending metrics for active tasks to the driver every
spark.executor.heartbeatInterval (defaults to 10s with some random initial delay so the
heartbeats from different executors do not pile up on the driver).

Figure 4. Executors use HeartbeatReceiver endpoint to report task metrics


An executor sends heartbeats using the internal heartbeater — Heartbeat Sender Thread.

Figure 5. HeartbeatReceiver’s Heartbeat Message Handler


For each task in TaskRunner (in runningTasks internal registry), the task’s metrics are
computed (i.e. mergeShuffleReadMetrics and setJvmGCTime ) that become part of the
heartbeat (with accumulators).

Caution FIXME How do mergeShuffleReadMetrics and setJvmGCTime influence accumulators ?

Note Executors track the TaskRunners that run tasks. A task might not be assigned to a
TaskRunner yet when the executor sends a heartbeat.

A blocking Heartbeat message that holds the executor id, all accumulator updates (per task
id), and BlockManagerId is sent to HeartbeatReceiver RPC endpoint (with
spark.executor.heartbeatInterval timeout).

Caution FIXME When is heartbeatReceiverRef created?

If the response requests to reregister BlockManager, you should see the following INFO
message in the logs:

INFO Executor: Told to re-register on heartbeat

The BlockManager is reregistered.

The internal heartbeatFailures counter is reset (i.e. becomes 0 ).

If there are any issues with communicating with the driver, you should see the following
WARN message in the logs:

WARN Executor: Issue communicating with driver in heartbeater

The internal heartbeatFailures is incremented and checked to be less than the acceptable
number of failures (i.e. spark.executor.heartbeat.maxFailures Spark property). If the number
is greater, the following ERROR is printed out to the logs:

ERROR Executor: Exit as unable to send heartbeats to driver more than [HEARTBEAT_MAX_F
AILURES] times

The executor exits (using System.exit and exit code 56).

Tip Read about TaskMetrics in TaskMetrics.

Reporting Heartbeat and Partial Metrics for Active Tasks to Driver — reportHeartBeat Internal Method

reportHeartBeat(): Unit

reportHeartBeat collects TaskRunners for currently running tasks (aka active tasks) with

their tasks deserialized (i.e. either ready for execution or already started).

Note TaskRunner has task deserialized when it runs the task.

For every running task, reportHeartBeat takes its TaskMetrics and:

Requests ShuffleRead metrics to be merged

Sets jvmGCTime metrics

reportHeartBeat then records the latest values of internal and external accumulators for

every task.

Note Internal accumulators are a task’s metrics while external accumulators are a Spark
application’s accumulators that a user has created.

reportHeartBeat sends a blocking Heartbeat message to HeartbeatReceiver endpoint

(running on the driver). reportHeartBeat uses spark.executor.heartbeatInterval for the RPC


timeout.

Note A Heartbeat message contains the executor identifier, the accumulator updates, and the
identifier of the BlockManager.

Note reportHeartBeat uses SparkEnv to access the current BlockManager .

If the response (from HeartbeatReceiver endpoint) is to re-register the BlockManager , you


should see the following INFO message in the logs and reportHeartBeat requests
BlockManager to re-register (which will register the blocks the BlockManager manages with

the driver).

INFO Told to re-register on heartbeat

Note HeartbeatResponse requests BlockManager to re-register when either TaskScheduler or
HeartbeatReceiver know nothing about the executor.

When posting the Heartbeat was successful, reportHeartBeat resets heartbeatFailures


internal counter.

In case of a non-fatal exception, you should see the following WARN message in the logs
(followed by the stack trace).

WARN Issue communicating with driver in heartbeater

On every failure, reportHeartBeat increments the heartbeat failure counter (bounded by the
spark.executor.heartbeat.maxFailures Spark property). When the number of heartbeat failures
reaches the maximum, you should see the following ERROR message in the logs and the executor
terminates with the error code: 56 .

ERROR Exit as unable to send heartbeats to driver more than [HEARTBEAT_MAX_FAILURES] t


imes

Note reportHeartBeat is used when Executor schedules reporting heartbeat and partial metrics
for active tasks to the driver (that happens every spark.executor.heartbeatInterval Spark property).

heartbeater — Heartbeat Sender Thread


heartbeater is a daemon ScheduledThreadPoolExecutor with a single thread.

The name of the thread pool is driver-heartbeater.

Coarse-Grained Executors
Coarse-grained executors are executors that use CoarseGrainedExecutorBackend for task
scheduling.

Resource Offers
Read resourceOffers in TaskSchedulerImpl and resourceOffer in TaskSetManager.

"Executor task launch worker" Thread Pool 


—  threadPool Property
Executor uses threadPool daemon cached thread pool with the name Executor task

launch worker-[ID] (with ID being the task id) for launching tasks.

threadPool is created when Executor is created and shut down when it stops.

Executor Memory — spark.executor.memory or SPARK_EXECUTOR_MEMORY Settings

You can control the amount of memory per executor using spark.executor.memory setting. It
sets the available memory equally for all executors per application.

Note The amount of memory per executor is looked up when SparkContext is created.

You can change the assigned memory per executor per node in standalone cluster using
SPARK_EXECUTOR_MEMORY environment variable.

You can find the value displayed as Memory per Node in web UI for standalone Master (as
depicted in the figure below).

Figure 6. Memory per Node in Spark Standalone’s web UI


The above figure shows the result of running Spark shell with the amount of memory per
executor defined explicitly (on command line), i.e.

./bin/spark-shell --master spark://localhost:7077 -c spark.executor.memory=2g

Metrics
Every executor registers its own ExecutorSource to report metrics.

Stopping Executor —  stop Method

stop(): Unit

stop requests MetricsSystem for a report.

Note stop uses SparkEnv to access the current MetricsSystem .

stop shuts driver-heartbeater thread down (and waits at most 10 seconds).

stop shuts Executor task launch worker thread pool down.

(only when not local) stop requests SparkEnv to stop.

Note stop is used when CoarseGrainedExecutorBackend and LocalEndpoint are requested to stop
their managed executors.

Settings
Table 3. Spark Properties

spark.executor.cores
    Number of cores for an executor.

spark.executor.extraClassPath  (default: empty)
    List of URLs representing user-defined class path entries that are added to
    an executor’s class path. Each entry is separated by a system-dependent path
    separator, i.e. : on Unix/MacOS systems and ; on Microsoft Windows.

spark.executor.extraJavaOptions
    Extra Java options for executors. Used to prepare the command to launch
    CoarseGrainedExecutorBackend in a YARN container.

spark.executor.extraLibraryPath
    Extra library paths separated by a system-dependent path separator, i.e. :
    on Unix/MacOS systems and ; on Microsoft Windows. Used to prepare the
    command to launch CoarseGrainedExecutorBackend in a YARN container.

spark.executor.heartbeat.maxFailures  (default: 60)
    Number of times an executor will try to send heartbeats to the driver before
    it gives up and exits (with exit code 56). NOTE: It was introduced in
    SPARK-13522 Executor should kill itself when it’s unable to heartbeat to the
    driver more than N times.

spark.executor.heartbeatInterval  (default: 10s)
    Interval after which an executor reports heartbeat and metrics for active
    tasks to the driver. Refer to Sending heartbeats and partial metrics for
    active tasks in this document.

spark.executor.id

spark.executor.instances  (default: 0)
    Number of executors to use.

spark.executor.logs.rolling.maxSize

spark.executor.logs.rolling.maxRetainedFiles

spark.executor.logs.rolling.strategy

spark.executor.logs.rolling.time.interval

spark.executor.memory  (default: 1g)
    Amount of memory to use per executor process. Equivalent to the
    SPARK_EXECUTOR_MEMORY environment variable. Refer to Executor Memory — 
    spark.executor.memory or SPARK_EXECUTOR_MEMORY settings in this document.

spark.executor.port

spark.executor.userClassPathFirst  (default: false)
    Flag to control whether to load classes in user jars before those in Spark jars.

spark.executor.uri
    Equivalent to SPARK_EXECUTOR_URI.

spark.task.maxDirectResultSize  (default: 1048576B)

TaskRunner
TaskRunner is a thread of execution of a single task.

TaskRunner is created exclusively when Executor is requested to launch a task.

Figure 1. Executor creates TaskRunner and runs (almost) immediately


TaskRunner can be run or killed that simply means running or killing the task this

TaskRunner object manages, respectively.

Table 1. TaskRunner’s Internal Registries and Counters

taskId           FIXME. Used when…FIXME
threadName       FIXME. Used when…FIXME
taskName         FIXME. Used when…FIXME
finished         FIXME. Used when…FIXME
killed           FIXME. Used when…FIXME
threadId         FIXME. Used when…FIXME
startGCTime      FIXME. Used when…FIXME
task             FIXME. Used when…FIXME
replClassLoader  FIXME. Used when…FIXME

Tip Enable INFO or DEBUG logging level for org.apache.spark.executor.Executor
logger to see what happens inside TaskRunner (since TaskRunner is an internal
class of Executor ).

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.executor.Executor=DEBUG

Refer to Logging.

Creating TaskRunner Instance


TaskRunner takes the following when created:

ExecutorBackend

TaskDescription

TaskRunner initializes the internal registries and counters.

computeTotalGcTime Method

Caution FIXME

updateDependencies Method

Caution FIXME

setTaskFinishedAndClearInterruptStatus Method

Caution FIXME

Lifecycle

Caution FIXME Image with state changes

A TaskRunner object is created when an executor is requested to launch a task.

It is created with an ExecutorBackend (to send the task’s status updates to), task and
attempt ids, task name, and serialized version of the task (as ByteBuffer ).

Running Task —  run Method

run(): Unit

Note run is part of Java’s java.lang.Runnable contract.

When executed, run initializes threadId as the current thread identifier (using Java’s
Thread)

run then sets the name of the current thread as threadName (using Java’s Thread).

run creates a TaskMemoryManager (using the current MemoryManager and taskId).

Note run uses SparkEnv to access the current MemoryManager .

run starts tracking the time to deserialize a task.

run sets the current thread’s context classloader (with replClassLoader).

run creates a closure Serializer .

Note run uses SparkEnv to access the current closure Serializer .

You should see the following INFO message in the logs:

INFO Executor: Running [taskName] (TID [taskId])

run notifies ExecutorBackend that taskId is in TaskState.RUNNING state.

Note run uses ExecutorBackend that was specified when TaskRunner was created.

run computes startGCTime .

run updates dependencies.

Note run uses TaskDescription that is specified when TaskRunner is created.

run deserializes the task (using the context class loader) and sets its localProperties and

TaskMemoryManager . run sets the task internal reference to hold the deserialized task.

Note run uses TaskDescription to access serialized task.

If killed flag is enabled, run throws a TaskKilledException .

You should see the following DEBUG message in the logs:

DEBUG Executor: Task [taskId]'s epoch is [task.epoch]

run notifies MapOutputTracker about the epoch of the task.

Note run uses SparkEnv to access the current MapOutputTracker .

run records the current time as the task’s start time (as taskStart ).

run runs the task (with taskAttemptId as taskId, attemptNumber from TaskDescription ,

and metricsSystem as the current MetricsSystem).

Note run uses SparkEnv to access the current MetricsSystem .

The task runs inside a "monitored" block (i.e. try-finally block) to detect any
Note memory and lock leaks after the task’s run finishes regardless of the final
outcome - the computed value or an exception thrown.

After the task’s run has finished (inside the "finally" block of the "monitored" block), run
requests BlockManager to release all locks of the task (for the task’s taskId). The locks are
later used for lock leak detection.

run then requests TaskMemoryManager to clean up allocated memory (that helps finding

memory leaks).

If run detects memory leak of the managed memory (i.e. the memory freed is greater than
0 ) and spark.unsafe.exceptionOnMemoryLeak Spark property is enabled (it is not by

default) and no exception was reported while the task ran, run reports a SparkException :

Managed memory leak detected; size = [freedMemory] bytes, TID = [taskId]

Otherwise, if spark.unsafe.exceptionOnMemoryLeak is disabled, you should see the


following ERROR message in the logs instead:

ERROR Executor: Managed memory leak detected; size = [freedMemory] bytes, TID = [taskI
d]

Note If run detects a memory leak, it leads to a SparkException or ERROR message in the logs.

If run detects lock leaking (i.e. the number of locks released) and
spark.storage.exceptionOnPinLeak Spark property is enabled (it is not by default) and no
exception was reported while the task ran, run reports a SparkException :

[releasedLocks] block locks were not released by TID = [taskId]:


[releasedLocks separated by comma]

Otherwise, if spark.storage.exceptionOnPinLeak is disabled or the task reported an


exception, you should see the following INFO message in the logs instead:

INFO Executor: [releasedLocks] block locks were not released by TID = [taskId]:
[releasedLocks separated by comma]

Note If run detects any lock leak, it leads to a SparkException or INFO message in the logs.

Rigth after the "monitored" block, run records the current time as the task’s finish time (as
taskFinish ).

If the task was killed (while it was running), run reports a TaskKilledException (and the
TaskRunner exits).

run creates a Serializer and serializes the task’s result. run measures the time to

serialize the result.

Note run uses SparkEnv to access the current Serializer . SparkEnv was specified when the owning
Executor was created.

Important This is when TaskRunner serializes the computed value of a task to be sent back to the driver.

run records the task metrics:

executorDeserializeTime

executorDeserializeCpuTime

executorRunTime

executorCpuTime

jvmGCTime

resultSerializationTime

run collects the latest values of internal and external accumulators used in the task.

run creates a DirectTaskResult (with the serialized result and the latest values of

accumulators).

run serializes the DirectTaskResult and gets the byte buffer’s limit.

Note A serialized DirectTaskResult is Java’s java.nio.ByteBuffer.

run selects the proper serialized version of the result before sending it to ExecutorBackend .

run branches off based on the serialized DirectTaskResult byte buffer’s limit.

When maxResultSize is greater than 0 and the serialized DirectTaskResult buffer limit
exceeds it, the following WARN message is displayed in the logs:

WARN Executor: Finished [taskName] (TID [taskId]). Result is larger than maxResultSize
([resultSize] > [maxResultSize]), dropping it.

Tip Read about spark.driver.maxResultSize.

$ ./bin/spark-shell -c spark.driver.maxResultSize=1m

scala> sc.version
res0: String = 2.0.0-SNAPSHOT

scala> sc.getConf.get("spark.driver.maxResultSize")
res1: String = 1m

scala> sc.range(0, 1024 * 1024 + 10, 1).collect


WARN Executor: Finished task 4.0 in stage 0.0 (TID 4). Result is larger than maxResult
Size (1031.4 KB > 1024.0 KB), dropping it.
...
ERROR TaskSetManager: Total size of serialized results of 1 tasks (1031.4 KB) is bigge
r than spark.driver.maxResultSize (1024.0 KB)
...
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of seria
lized results of 1 tasks (1031.4 KB) is bigger than spark.driver.maxResultSize (1024.0
KB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$
failJobAndIndependentStages(DAGScheduler.scala:1448)
...

In this case, run creates an IndirectTaskResult (with a TaskResultBlockId for the task’s
taskId and resultSize ) and serializes it.

When maxResultSize is not positive or resultSize is smaller than maxResultSize but


greater than maxDirectResultSize, run creates a TaskResultBlockId for the task’s taskId
and stores the serialized DirectTaskResult in BlockManager (as the TaskResultBlockId
with MEMORY_AND_DISK_SER storage level).

You should see the following INFO message in the logs:

INFO Executor: Finished [taskName] (TID [taskId]). [resultSize] bytes result sent via
BlockManager)

In this case, run creates an IndirectTaskResult (with a TaskResultBlockId for the task’s
taskId and resultSize ) and serializes it.

The difference between the two above cases is that the result is dropped or
Note
stored in BlockManager with MEMORY_AND_DISK_SER storage level.

When the two cases above do not hold, you should see the following INFO message in the
logs:

INFO Executor: Finished [taskName] (TID [taskId]). [resultSize] bytes result sent to d
river

run uses the serialized DirectTaskResult byte buffer as the final serializedResult .

Note The final serializedResult is either an IndirectTaskResult (possibly with the block stored
in BlockManager ) or a DirectTaskResult.

run notifies ExecutorBackend that taskId is in TaskState.FINISHED state with the serialized

result and removes taskId from the owning executor’s runningTasks registry.

Note run uses ExecutorBackend that is specified when TaskRunner is created.

Note TaskRunner is Java’s Runnable and the contract requires that once a TaskRunner has
completed execution it must not be restarted.

When run catches a exception while executing the task, run acts according to its type (as
presented in the following "run’s Exception Cases" table and the following sections linked
from the table).

Table 2. run’s Exception Cases, TaskState and Serialized ByteBuffer


Exception Type TaskState Serialized ByteBuffer
FetchFailedException FAILED TaskFailedReason

TaskKilledException KILLED TaskKilled

InterruptedException KILLED TaskKilled

CommitDeniedException FAILED TaskFailedReason

Throwable FAILED ExceptionFailure

FetchFailedException
When FetchFailedException is reported while running a task, run
setTaskFinishedAndClearInterruptStatus.

run requests FetchFailedException for the TaskFailedReason , serializes it and notifies

ExecutorBackend that the task has failed (with taskId, TaskState.FAILED , and a serialized

reason).

Note ExecutorBackend was specified when TaskRunner was created.

Note run uses a closure Serializer to serialize the failure reason. The Serializer was created
before run ran the task.

TaskKilledException
When TaskKilledException is reported while running a task, you should see the following
INFO message in the logs:

INFO Executor killed [taskName] (TID [taskId])

run then setTaskFinishedAndClearInterruptStatus and notifies ExecutorBackend that the

task has been killed (with taskId, TaskState.KILLED , and a serialized TaskKilled object).

InterruptedException (with Task Killed)


When InterruptedException is reported while running a task, and the task has been killed,
you should see the following INFO message in the logs:

INFO Executor interrupted and killed [taskName] (TID [taskId])

run then setTaskFinishedAndClearInterruptStatus and notifies ExecutorBackend that the

task has been killed (with taskId, TaskState.KILLED , and a serialized TaskKilled object).

Note The difference between this InterruptedException and TaskKilledException is the INFO
message in the logs.

CommitDeniedException
When CommitDeniedException is reported while running a task, run
setTaskFinishedAndClearInterruptStatus and notifies ExecutorBackend that the task has
failed (with taskId, TaskState.FAILED , and a serialized reason).

Note The difference between this CommitDeniedException and FetchFailedException is just the
reason being sent to ExecutorBackend .

Throwable
When run catches a Throwable , you should see the following ERROR message in the
logs (followed by the exception).

ERROR Exception in [taskName] (TID [taskId])

run then records the following task metrics (only when Task is available):

executorRunTime

jvmGCTime

run then collects the latest values of internal and external accumulators (with taskFailed

flag enabled to inform that the collection is for a failed task).

Otherwise, when Task is not available, the accumulator collection is empty.

run converts the task accumulators to collection of AccumulableInfo , creates a

ExceptionFailure (with the accumulators), and serializes them.

Note run uses a closure Serializer to serialize the ExceptionFailure .

Caution FIXME Why does run create new ExceptionFailure(t, accUpdates).withAccums(accums) ,
i.e. accumulators occur twice in the object.

run setTaskFinishedAndClearInterruptStatus and notifies ExecutorBackend that the task

has failed (with taskId, TaskState.FAILED , and the serialized ExceptionFailure ).

run may also trigger SparkUncaughtExceptionHandler.uncaughtException(t) if this is a fatal

error.

Note The difference between this Throwable case and the other FAILED cases
(i.e. FetchFailedException and CommitDeniedException) is just the serialized
ExceptionFailure vs a reason being sent to ExecutorBackend , respectively.

Killing Task —  kill Method

kill(interruptThread: Boolean): Unit

kill marks the TaskRunner as killed and kills the task (if available and not finished

already).

Note kill passes the input interruptThread on to the task itself while killing it.

When executed, you should see the following INFO message in the logs:

INFO TaskRunner: Executor is trying to kill [taskName] (TID [taskId])

Note killed flag is checked periodically in run to stop executing the task. Once killed, the
task will eventually stop.

Settings
Table 3. Spark Properties
Spark Property Default Value Description
spark.unsafe.exceptionOnMemoryLeak false FIXME

spark.storage.exceptionOnPinLeak false FIXME

ExecutorSource
ExecutorSource is a metrics source of an Executor. It uses an executor’s threadPool for

calculating the gauges.

Note Every executor has its own separate ExecutorSource that is registered when
CoarseGrainedExecutorBackend receives a RegisteredExecutor .

The name of a ExecutorSource is executor.

Figure 1. ExecutorSource in JConsole (using Spark Standalone)

Table 1. ExecutorSource Gauges

threadpool.activeTasks
    Approximate number of threads that are actively executing tasks.
    Uses ThreadPoolExecutor.getActiveCount().

threadpool.completeTasks
    Approximate total number of tasks that have completed execution.
    Uses ThreadPoolExecutor.getCompletedTaskCount().

threadpool.currentPool_size
    Current number of threads in the pool.
    Uses ThreadPoolExecutor.getPoolSize().

threadpool.maxPool_size
    Maximum allowed number of threads that have ever simultaneously been in the pool.
    Uses ThreadPoolExecutor.getMaximumPoolSize().

filesystem.hdfs.read_bytes
    Uses Hadoop’s FileSystem.getAllStatistics() and getBytesRead().

filesystem.hdfs.write_bytes
    Uses Hadoop’s FileSystem.getAllStatistics() and getBytesWritten().

filesystem.hdfs.read_ops
    Uses Hadoop’s FileSystem.getAllStatistics() and getReadOps().

filesystem.hdfs.largeRead_ops
    Uses Hadoop’s FileSystem.getAllStatistics() and getLargeReadOps().

filesystem.hdfs.write_ops
    Uses Hadoop’s FileSystem.getAllStatistics() and getWriteOps().

filesystem.file.read_bytes
filesystem.file.write_bytes
filesystem.file.read_ops
filesystem.file.largeRead_ops
filesystem.file.write_ops
    The same as the hdfs gauges above but for the file scheme.

Master
A master is a running Spark instance that connects to a cluster manager for resources.

The master acquires cluster nodes to run executors.

Caution FIXME Add it to the Spark architecture figure above.

Workers
Workers (aka slaves) are running Spark instances where executors live to execute tasks.
They are the compute nodes in Spark.

Caution FIXME Are workers perhaps part of Spark Standalone only?

Caution FIXME How many executors are spawned per worker?

A worker receives serialized tasks that it runs in a thread pool.

It hosts a local Block Manager that serves blocks to other workers in a Spark cluster.
Workers communicate among themselves using their Block Manager instances.

Caution FIXME Diagram of a driver with workers as boxes.

Explain task execution in Spark and understand Spark’s underlying execution model.

New vocabulary often faced in Spark UI

When you create SparkContext, each worker starts an executor. This is a separate process
(JVM), and it loads your jar, too. The executors connect back to your driver program. Now
the driver can send them commands, like flatMap , map and reduceByKey . When the
driver quits, the executors shut down.

A new process is not started for each step. A new process is started on each worker when
the SparkContext is constructed.

The executor deserializes the command (this is possible because it has loaded your jar),
and executes it on a partition.

Shortly speaking, an application in Spark is executed in three steps:

1. Create RDD graph, i.e. DAG (directed acyclic graph) of RDDs to represent entire
computation.

2. Create stage graph, i.e. a DAG of stages that is a logical execution plan based on the
RDD graph. Stages are created by breaking the RDD graph at shuffle boundaries.

3. Based on the plan, schedule and execute tasks on workers.

In the WordCount example, the RDD graph is as follows:

file → lines → words → per-word count → global word count → output
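
A minimal WordCount sketch (the input path below is a placeholder) that produces this RDD graph;
the shuffle introduced by reduceByKey is what later splits the computation into two stages:

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch; "input.txt" is an illustrative input path.
val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))
val counts = sc.textFile("input.txt")     // file -> lines
  .flatMap(_.split("\\s+"))               // lines -> words
  .map(word => (word, 1))                 // words -> (word, 1) pairs
  .reduceByKey(_ + _)                     // per-word count (shuffle boundary => second stage)
counts.collect().foreach(println)         // the action that triggers the job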

Based on this graph, two stages are created. The stage creation rule is based on the idea of
pipelining as many narrow transformations as possible. RDD operations with "narrow"
dependencies, like map() and filter() , are pipelined together into one set of tasks in
each stage.

In the end, every stage will only have shuffle dependencies on other stages, and may
compute multiple operations inside it.

In the WordCount example, the narrow transformation finishes at per-word count. Therefore,
you get two stages:

file → lines → words → per-word count

global word count → output

Once stages are defined, Spark will generate tasks from stages. The first stage will create
ShuffleMapTasks with the last stage creating ResultTasks because in the last stage, one
action operation is included to produce results.

The number of tasks to be generated depends on how your files are distributed. Suppose that you
have three different files on three different nodes; the first stage will then generate 3 tasks:
one task per partition.

Therefore, you should not map your steps to tasks directly. A task belongs to a stage, and is
related to a partition.

The number of tasks being generated in each stage will be equal to the number of partitions.

Cleanup

Caution FIXME

Settings
spark.worker.cleanup.enabled (default: false ) Cleanup enabled.

Anatomy of Spark Application


Every Spark application starts from creating SparkContext.

Note Without SparkContext no computation (as a Spark job) can be started.

Note A Spark application is an instance of SparkContext. Or, put it differently, a Spark context
constitutes a Spark application.

A Spark application is uniquely identified by a pair of the application and application attempt
ids.

For it to work, you have to create a Spark configuration using SparkConf or use a custom
SparkContext constructor.

package pl.japila.spark

import org.apache.spark.{SparkContext, SparkConf}

object SparkMeApp {
  def main(args: Array[String]) {
    val masterURL = "local[*]"  (1)

    val conf = new SparkConf()  (2)
      .setAppName("SparkMe Application")
      .setMaster(masterURL)

    val sc = new SparkContext(conf)  (3)

    val fileName = util.Try(args(0)).getOrElse("build.sbt")

    val lines = sc.textFile(fileName).cache()  (4)

    val c = lines.count()  (5)
    println(s"There are $c lines in $fileName")
  }
}

1. Master URL to connect the application to

2. Create Spark configuration

3. Create Spark context

4. Create lines RDD

5. Execute count action

Tip Spark shell creates a Spark context and SQL context for you at startup.

When a Spark application starts (using spark-submit script or as a standalone application), it


connects to Spark master as described by master URL. It is part of Spark context’s
initialization.

Figure 1. Submitting Spark application to master using master URL


Note Your Spark application can run locally or on the cluster which is based on the cluster
manager and the deploy mode ( --deploy-mode ). Refer to Deployment Modes.

You can then create RDDs, transform them to other RDDs and ultimately execute actions.
You can also cache interim RDDs to speed up data processing.

After all the data processing is completed, the Spark application finishes by stopping the
Spark context.

SparkConf — Spark Application’s Configuration


Tip Refer to Spark Configuration in the official documentation for an extensive coverage of how
to configure Spark and user programs.

Caution TODO Describe SparkConf object for the application configuration, the default configs
and system properties.

There are three ways to configure Spark and user programs:

Spark Properties - use Web UI to learn the current properties.

…​

setIfMissing Method

Caution FIXME

isExecutorStartupConf Method

Caution FIXME

set Method

Caution FIXME
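
Even though set and setIfMissing are yet to be described here, a minimal sketch of how they are
typically used may help (the property values below are illustrative):

import org.apache.spark.SparkConf

// A small sketch of SparkConf's fluent setters; values are placeholders.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("SparkMe Application")
  .set("spark.driver.memory", "2g")        // set overrides any existing value
conf.setIfMissing("spark.logConf", "true") // setIfMissing keeps a value that is already defined
println(conf.toDebugString)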

Mandatory Settings - spark.master and spark.app.name


There are two mandatory settings of any Spark application that have to be defined before
this Spark application could be run — spark.master and spark.app.name.

Spark Properties
Every user program starts with creating an instance of SparkConf that holds the master
URL to connect to ( spark.master ), the name for your Spark application (that is later
displayed in web UI and becomes spark.app.name ) and other Spark properties required for
proper runs. The instance of SparkConf can be used to create SparkContext.

Tip Start Spark shell with --conf spark.logConf=true to log the effective Spark configuration
as INFO when SparkContext is started.

$ ./bin/spark-shell --conf spark.logConf=true
...
15/10/19 17:13:49 INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT
15/10/19 17:13:49 INFO SparkContext: Spark configuration:
spark.app.name=Spark shell
spark.home=/Users/jacek/dev/oss/spark
spark.jars=
spark.logConf=true
spark.master=local[*]
spark.repl.class.uri=http://10.5.10.20:64055
spark.submit.deployMode=client
...

Use sc.getConf.toDebugString to have a richer output once SparkContext has finished initializing.

You can query for the values of Spark properties in Spark shell as follows:

scala> sc.getConf.getOption("spark.local.dir")
res0: Option[String] = None

scala> sc.getConf.getOption("spark.app.name")
res1: Option[String] = Some(Spark shell)

scala> sc.getConf.get("spark.master")
res2: String = local[*]

Setting up Spark Properties


There are the following places where a Spark application looks for Spark properties (in the
order of importance from the least important to the most important):

conf/spark-defaults.conf - the configuration file with the default Spark properties.

Read spark-defaults.conf.

--conf or -c - the command-line option used by spark-submit (and other shell scripts

that use spark-submit or spark-class under the covers, e.g. spark-shell )

SparkConf

Default Configuration
The default Spark configuration is created when you execute the following code:

import org.apache.spark.SparkConf
val conf = new SparkConf

It simply loads spark.* system properties.

You can use conf.toDebugString or conf.getAll to have the loaded spark.* system properties
printed out.

scala> conf.getAll
res0: Array[(String, String)] = Array((spark.app.name,Spark shell), (spark.jars,""), (
spark.master,local[*]), (spark.submit.deployMode,client))

scala> conf.toDebugString
res1: String =
spark.app.name=Spark shell
spark.jars=
spark.master=local[*]
spark.submit.deployMode=client

scala> println(conf.toDebugString)
spark.app.name=Spark shell
spark.jars=
spark.master=local[*]
spark.submit.deployMode=client

Unique Identifier of Spark Application — getAppId Method

getAppId: String

getAppId gives spark.app.id Spark property or reports NoSuchElementException if not set.

Note getAppId is used when:

NettyBlockTransferService is initialized (and creates a NettyBlockRpcServer as well as saves
the identifier for later use).

Executor is created (in non-local mode and requests BlockManager to initialize).

Settings

Table 1. Spark Properties

spark.master
    Master URL.

spark.app.id  (default: TaskScheduler.applicationId())
    Unique identifier of a Spark application that Spark uses to uniquely identify
    metric sources. Set when SparkContext is created (right after TaskScheduler
    is started that actually gives the identifier).

spark.app.name
    Application Name.

Spark Properties and spark-defaults.conf Properties File

Spark properties are the means of tuning the execution environment for your Spark
applications.

The default Spark properties file is $SPARK_HOME/conf/spark-defaults.conf . It can be overridden
using spark-submit with the --properties-file command-line option.

Table 1. Environment Variables

SPARK_CONF_DIR  (default: ${SPARK_HOME}/conf)
    Spark’s configuration directory (with spark-defaults.conf ).

Tip Read the official documentation of Apache Spark on Spark Configuration.

Table 2. Spark Application’s Properties

spark.local.dir  (default: /tmp)
    Comma-separated list of directories that are used as a temporary storage for
    "scratch" space, including map output files and RDDs that get stored on disk.
    This should be on a fast, local disk in your system. It can also be a
    comma-separated list of multiple directories on different disks.

spark-defaults.conf — Default Spark Properties File

spark-defaults.conf (under SPARK_CONF_DIR or $SPARK_HOME/conf ) is the default properties file
with the Spark properties of your Spark applications.

Note spark-defaults.conf is loaded by AbstractCommandBuilder’s loadPropertiesFile internal method.

Calculating Path of Default Spark Properties — Utils.getDefaultPropertiesFile Method

getDefaultPropertiesFile(env: Map[String, String] = sys.env): String

getDefaultPropertiesFile calculates the absolute path to the spark-defaults.conf properties
file that can be either in the directory specified by the SPARK_CONF_DIR environment variable
or in the $SPARK_HOME/conf directory.

Note getDefaultPropertiesFile is part of the private[spark] org.apache.spark.util.Utils object.

Deploy Mode
Deploy mode specifies the location of where driver executes in the deployment
environment.

Deploy mode can be one of the following options:

client (default) - the driver runs on the machine the Spark application was launched from.

cluster - the driver runs on a random node in a cluster.

Note cluster deploy mode is only available for non-local cluster deployments.

You can control the deploy mode of a Spark application using spark-submit’s --deploy-mode
command-line option or spark.submit.deployMode Spark property.

Note spark.submit.deployMode setting can be client or cluster .
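
For example, in the Spark shell (which runs in the client deploy mode) you can query the effective
deploy mode through SparkContext:

scala> sc.deployMode
res0: String = client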

Client Deploy Mode

Caution FIXME

Cluster Deploy Mode


Caution FIXME

spark.submit.deployMode
spark.submit.deployMode (default: client ) can be client or cluster .

SparkContext — Entry Point to Spark Core


SparkContext (aka Spark context) is the heart of a Spark application.

Note You could also assume that a SparkContext instance is a Spark application.

Spark context sets up internal services and establishes a connection to a Spark execution
environment.

Once a SparkContext is created you can use it to create RDDs, accumulators and
broadcast variables, access Spark services and run jobs (until SparkContext is stopped).

A Spark context is essentially a client of Spark’s execution environment and acts as the
master of your Spark application (don’t get confused with the other meaning of Master in
Spark, though).

Figure 1. Spark context acts as the master of your Spark application


SparkContext offers the following functions:

Getting current status of a Spark application

SparkEnv

SparkConf

deployment environment (as master URL)

application name

unique identifier of execution attempt

deploy mode

default level of parallelism that specifies the number of partitions in RDDs when
they are created without specifying the number explicitly by a user.

Spark user

the time (in milliseconds) when SparkContext was created

URL of web UI

Spark version

Storage status

Setting Configuration

master URL

Local Properties — Creating Logical Job Groups

Setting Local Properties to Group Spark Jobs

Default Logging Level

Creating Distributed Entities

RDDs

Accumulators

Broadcast variables

Accessing services, e.g. AppStatusStore, TaskScheduler, LiveListenerBus,


BlockManager, SchedulerBackends, ShuffleManager and the optional ContextCleaner.

Running jobs synchronously

Submitting jobs asynchronously

Cancelling a job

Cancelling a stage

Assigning custom Scheduler Backend, TaskScheduler and DAGScheduler

Closure cleaning

391
SparkContext

Accessing persistent RDDs

Unpersisting RDDs, i.e. marking RDDs as non-persistent

Registering SparkListener

Programmable Dynamic Allocation

Table 1. SparkContext’s Internal Registries and Counters

persistentRdds
    Lookup table of persistent/cached RDDs per their ids.
    Used when SparkContext is requested to: persistRDD, getRDDStorageInfo,
    getPersistentRDDs and unpersistRDD.

Table 2. SparkContext’s Internal Properties

_taskScheduler  (initial value: uninitialized)
    TaskScheduler

Tip Read the scaladoc of org.apache.spark.SparkContext.

Tip Enable INFO logging level for org.apache.spark.SparkContext logger to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.SparkContext=INFO

Refer to Logging.

addFile Method

addFile(path: String): Unit (1)


addFile(path: String, recursive: Boolean): Unit

1. recursive flag is off

addFile adds the path file to be downloaded…​FIXME

Note addFile is used when:

SparkContext is initialized (and files were defined)

Spark SQL’s AddFileCommand is executed

Spark SQL’s SessionResourceLoader is requested to load a file resource

Removing RDD Blocks from BlockManagerMaster — unpersistRDD Internal Method

unpersistRDD(rddId: Int, blocking: Boolean = true): Unit

unpersistRDD requests BlockManagerMaster to remove the blocks for the RDD (given

rddId ).

Note unpersistRDD uses SparkEnv to access the current BlockManager that is in turn used to
access the current BlockManagerMaster .

unpersistRDD removes rddId from persistentRdds registry.

In the end, unpersistRDD posts a SparkListenerUnpersistRDD (with rddId ) to


LiveListenerBus Event Bus.

Note unpersistRDD is used when:

ContextCleaner does doCleanupRDD

SparkContext unpersists an RDD (i.e. marks an RDD as non-persistent)

Unique Identifier of Spark Application — applicationId Method

Caution FIXME

postApplicationStart Internal Method

Caution FIXME

postApplicationEnd Method

Caution FIXME

393
SparkContext

clearActiveContext Method

Caution FIXME

Accessing persistent RDDs — getPersistentRDDs Method

getPersistentRDDs: Map[Int, RDD[_]]

getPersistentRDDs returns the collection of RDDs that have marked themselves as

persistent via cache.

Internally, getPersistentRDDs returns persistentRdds internal registry.
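
A minimal sketch that shows an RDD appearing in the registry after it has been cached and
materialized (the RDD itself is arbitrary):

// After cache() and an action, the RDD shows up in getPersistentRDDs.
val nums = sc.parallelize(0 to 9).cache()
nums.count()                                   // materializes and caches the RDD
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id: $rdd")                    // the map is keyed by RDD id
}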

Cancelling Job —  cancelJob Method

cancelJob(jobId: Int)

cancelJob requests DAGScheduler to cancel a Spark job.

Cancelling Stage —  cancelStage Methods

cancelStage(stageId: Int): Unit


cancelStage(stageId: Int, reason: String): Unit

cancelStage simply requests DAGScheduler to cancel a Spark stage (with an optional

reason ).

Note cancelStage is used when StagesTab handles a kill request (from a user in web UI).

Programmable Dynamic Allocation


SparkContext offers the following methods as the developer API for dynamic allocation of

executors:

requestExecutors

killExecutors

requestTotalExecutors

394
SparkContext

(private!) getExecutorIds

Requesting New Executors —  requestExecutors Method

requestExecutors(numAdditionalExecutors: Int): Boolean

requestExecutors requests numAdditionalExecutors executors from

CoarseGrainedSchedulerBackend.
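
A sketch of this developer API in use (it is acknowledged only by coarse-grained scheduler
backends, e.g. Spark Standalone or YARN):

// Ask the cluster manager for two additional executors.
val granted: Boolean = sc.requestExecutors(numAdditionalExecutors = 2)
if (!granted) println("The scheduler backend did not acknowledge the request")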

Requesting to Kill Executors —  killExecutors Method

killExecutors(executorIds: Seq[String]): Boolean

Caution FIXME

Requesting Total Executors — requestTotalExecutors Method

requestTotalExecutors(
numExecutors: Int,
localityAwareTasks: Int,
hostToLocalTaskCount: Map[String, Int]): Boolean

requestTotalExecutors is a private[spark] method that requests the exact number of

executors from a coarse-grained scheduler backend.

Note It works for coarse-grained scheduler backends only.

When called for other scheduler backends you should see the following WARN message in
the logs:

WARN Requesting executors is only supported in coarse-grained mode

Getting Executor Ids —  getExecutorIds Method


getExecutorIds is a private[spark] method that is part of ExecutorAllocationClient

contract. It simply passes the call on to the current coarse-grained scheduler backend, i.e.
calls getExecutorIds .

Note It works for coarse-grained scheduler backends only.

When called for other scheduler backends you should see the following WARN message in
the logs:

WARN Requesting executors is only supported in coarse-grained mode

Caution FIXME Why does SparkContext implement the method for coarse-grained scheduler backends?
Why doesn’t SparkContext throw an exception when the method is called? Nobody seems to be using it (!)

Creating SparkContext Instance


You can create a SparkContext instance with or without creating a SparkConf object first.

Note You may want to read Inside Creating SparkContext to learn what happens behind the scenes
when SparkContext is created.

Getting Existing or Creating New SparkContext — getOrCreate Methods

getOrCreate(): SparkContext
getOrCreate(conf: SparkConf): SparkContext

getOrCreate methods allow you to get the existing SparkContext or create a new one.

import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()

// Using an explicit SparkConf object
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("SparkMe App")
val sc = SparkContext.getOrCreate(conf)

The no-param getOrCreate method requires that the two mandatory Spark settings - master
and application name - are specified using spark-submit.

Constructors

SparkContext()
SparkContext(conf: SparkConf)
SparkContext(master: String, appName: String, conf: SparkConf)
SparkContext(
master: String,
appName: String,
sparkHome: String = null,
jars: Seq[String] = Nil,
environment: Map[String, String] = Map())

You can create a SparkContext instance using the four constructors.

import org.apache.spark.SparkConf
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("SparkMe App")

import org.apache.spark.SparkContext
val sc = new SparkContext(conf)

When a Spark context starts up you should see the following INFO in the logs (amongst the
other messages that come from the Spark services):

INFO SparkContext: Running Spark version 2.0.0-SNAPSHOT

Note Only one SparkContext may be running in a single JVM (check out SPARK-2243 Support multiple
SparkContexts in the same JVM). Sharing access to a SparkContext in the JVM is the solution to
share data within Spark (without relying on other means of data sharing using external data stores).

Accessing Current SparkEnv —  env Method

Caution FIXME

Getting Current SparkConf —  getConf Method

getConf: SparkConf

getConf returns the current SparkConf.

Note Changing the SparkConf object does not change the current configuration (as the method
returns a copy).

Deployment Environment —  master Method

master: String

master method returns the current value of spark.master which is the deployment

environment in use.

Application Name —  appName Method

appName: String

appName gives the value of the mandatory spark.app.name setting.

Note appName is used when SparkDeploySchedulerBackend starts, SparkUI creates a web UI, when
postApplicationStart is executed, and for Mesos and checkpointing in Spark Streaming.

Unique Identifier of Execution Attempt — applicationAttemptId Method

applicationAttemptId: Option[String]

applicationAttemptId gives the unique identifier of the execution attempt of a Spark

application.

Note applicationAttemptId is used when:

ShuffleMapTask and ResultTask are created

SparkContext announces that a Spark application has started

Storage Status (of All BlockManagers) — getExecutorStorageStatus Method

getExecutorStorageStatus: Array[StorageStatus]

getExecutorStorageStatus requests BlockManagerMaster for storage status (of all

BlockManagers).

Note getExecutorStorageStatus is a developer API.

Note getExecutorStorageStatus is used when:

SparkContext is requested for storage status of cached RDDs

SparkStatusTracker is requested for information about all known executors

Deploy Mode —  deployMode Method

deployMode: String

deployMode returns the current value of spark.submit.deployMode setting or client if not

set.

Scheduling Mode —  getSchedulingMode Method

getSchedulingMode: SchedulingMode.SchedulingMode

getSchedulingMode returns the current Scheduling Mode.

Schedulable (Pool) by Name —  getPoolForName Method

getPoolForName(pool: String): Option[Schedulable]

getPoolForName returns a Schedulable by the pool name, if one exists.

Note getPoolForName is part of the Developer’s API and may change in the future.

Internally, it requests the TaskScheduler for the root pool and looks up the Schedulable by
the pool name.

It is exclusively used to show pool details in web UI (for a stage).

All Pools —  getAllPools Method

getAllPools: Seq[Schedulable]

getAllPools collects the Pools in TaskScheduler.rootPool.

Note TaskScheduler.rootPool is part of the TaskScheduler Contract.


Note getAllPools is part of the Developer’s API.

Caution FIXME Where is the method used?

Note getAllPools is used to calculate pool names for Stages tab in web UI with FAIR scheduling
mode used.

Default Level of Parallelism

defaultParallelism: Int

defaultParallelism requests TaskScheduler for the default level of parallelism.

Note Default level of parallelism specifies the number of partitions in RDDs when created
without specifying them explicitly by a user.

Note defaultParallelism is used in SparkContext.parallelize, SparkContext.range and
SparkContext.makeRDD (as well as Spark Streaming’s DStream.countByValue and
DStream.countByValueAndWindow et al.). defaultParallelism is also used to instantiate
HashPartitioner and for the minimum number of partitions in HadoopRDDs.
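
A short illustration of how defaultParallelism affects partition counts (the actual numbers
depend on your master URL):

// defaultParallelism decides the number of partitions when it is not given explicitly.
val withDefault  = sc.parallelize(1 to 100)      // sc.defaultParallelism partitions
val withExplicit = sc.parallelize(1 to 100, 8)   // exactly 8 partitions
println((sc.defaultParallelism, withDefault.getNumPartitions, withExplicit.getNumPartitions))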

Current Spark Scheduler (aka TaskScheduler) — taskScheduler Property

taskScheduler: TaskScheduler
taskScheduler_=(ts: TaskScheduler): Unit

taskScheduler manages (i.e. reads or writes) _taskScheduler internal property.

Getting Spark Version —  version Property

version: String

version returns the Spark version this SparkContext uses.

makeRDD Method

Caution FIXME

Submitting Jobs Asynchronously —  submitJob Method

submitJob[T, U, R](
rdd: RDD[T],
processPartition: Iterator[T] => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit,
resultFunc: => R): SimpleFutureAction[R]

submitJob submits a job in an asynchronous, non-blocking way to DAGScheduler.

It cleans the processPartition input function argument and returns an instance of


SimpleFutureAction that holds the JobWaiter instance.

Caution FIXME What are resultFunc ?

It is used in:

AsyncRDDActions methods

Spark Streaming for ReceiverTrackerEndpoint.startReceiver
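
A sketch of submitJob in use: counting the elements of every partition asynchronously and
collecting the per-partition sizes on the driver (the RDD and partition count are arbitrary):

// processPartition runs on executors; resultHandler runs on the driver per partition.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val sizes = new Array[Int](rdd.partitions.length)
val action = sc.submitJob(
  rdd,
  (it: Iterator[Int]) => it.size,                   // processPartition
  0 until rdd.partitions.length,                    // partitions to compute
  (index: Int, size: Int) => sizes(index) = size,   // resultHandler
  sizes.toSeq                                       // resultFunc: the overall result
)
// action is a SimpleFutureAction you can wait on or cancel.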

Spark Configuration

Caution FIXME

SparkContext and RDDs


You use a Spark context to create RDDs (see Creating RDD).

When an RDD is created, it belongs to and is completely owned by the Spark context it
originated from. RDDs can’t by design be shared between SparkContexts.


Figure 2. A Spark context creates a living space for RDDs.

Creating RDD —  parallelize Method


SparkContext allows you to create many different RDDs from input sources like:

Scala’s collections, i.e. sc.parallelize(0 to 100)

local or remote filesystems, i.e. sc.textFile("README.md")

Any Hadoop InputSource using sc.newAPIHadoopFile

Read Creating RDDs in RDD - Resilient Distributed Dataset.

Unpersisting RDD (Marking RDD as Non-Persistent) —  unpersist Method

Caution FIXME

unpersist removes an RDD from the master’s Block Manager (calls removeRdd(rddId: Int,

blocking: Boolean) ) and the internal persistentRdds mapping.

It finally posts SparkListenerUnpersistRDD message to listenerBus .

Setting Checkpoint Directory —  setCheckpointDir Method


setCheckpointDir(directory: String)

setCheckpointDir method is used to set up the checkpoint directory…​FIXME

Caution FIXME
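A minimal sketch of the typical usage (the directory below is illustrative; on a cluster it should point to a fault-tolerant file system, e.g. HDFS):

sc.setCheckpointDir("/tmp/spark-checkpoints")

val nums = sc.parallelize(1 to 100)
nums.checkpoint()   // marks the RDD for reliable checkpointing
nums.count()        // the first action materializes the checkpoint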

Registering Accumulator —  register Methods

register(acc: AccumulatorV2[_, _]): Unit


register(acc: AccumulatorV2[_, _], name: String): Unit

register registers the acc accumulator. You can optionally give an accumulator a name .

Tip You can create built-in accumulators for longs, doubles, and collection types using specialized methods.

Internally, register registers acc accumulator (with the current SparkContext ).

Creating Built-In Accumulators

longAccumulator: LongAccumulator
longAccumulator(name: String): LongAccumulator
doubleAccumulator: DoubleAccumulator
doubleAccumulator(name: String): DoubleAccumulator
collectionAccumulator[T]: CollectionAccumulator[T]
collectionAccumulator[T](name: String): CollectionAccumulator[T]

You can use longAccumulator , doubleAccumulator or collectionAccumulator to create and


register accumulators for simple and collection values.

longAccumulator returns LongAccumulator with the zero value 0 .

doubleAccumulator returns DoubleAccumulator with the zero value 0.0 .

collectionAccumulator returns CollectionAccumulator with the zero value

java.util.List[T] .


scala> val acc = sc.longAccumulator


acc: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: None, value:
0)

scala> val counter = sc.longAccumulator("counter")


counter: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 1, name: Some(cou
nter), value: 0)

scala> counter.value
res0: Long = 0

scala> sc.parallelize(0 to 9).foreach(n => counter.add(n))

scala> counter.value
res3: Long = 45

The name input parameter allows you to give a name to an accumulator and have it
displayed in Spark UI (under Stages tab for a given stage).

Figure 3. Accumulators in the Spark UI


Tip You can register custom accumulators using register methods.
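A minimal sketch of a custom accumulator (the class name and use case are made up for illustration): it collects the distinct Int codes observed while processing records.

import scala.collection.mutable
import org.apache.spark.util.AccumulatorV2

class DistinctCodesAccumulator extends AccumulatorV2[Int, Set[Int]] {
  private val codes = mutable.Set.empty[Int]
  def isZero: Boolean = codes.isEmpty
  def copy(): DistinctCodesAccumulator = {
    val acc = new DistinctCodesAccumulator
    acc.codes ++= codes
    acc
  }
  def reset(): Unit = codes.clear()
  def add(v: Int): Unit = codes += v
  def merge(other: AccumulatorV2[Int, Set[Int]]): Unit = codes ++= other.value
  def value: Set[Int] = codes.toSet
}

val distinctCodes = new DistinctCodesAccumulator
sc.register(distinctCodes, "distinctCodes")
sc.parallelize(Seq(404, 500, 404)).foreach(distinctCodes.add)
distinctCodes.value  // Set(404, 500)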

Creating Broadcast Variable —  broadcast Method

broadcast[T](value: T): Broadcast[T]

broadcast method creates a broadcast variable. It is a shared memory with value (as

broadcast blocks) on the driver and later on all Spark executors.


val sc: SparkContext = ???


scala> val hello = sc.broadcast("hello")
hello: org.apache.spark.broadcast.Broadcast[String] = Broadcast(0)

Spark transfers the value to Spark executors once, and tasks can share it without incurring
repetitive network transmissions when the broadcast variable is used multiple times.

Figure 4. Broadcasting a value to executors


Internally, broadcast requests the current BroadcastManager to create a new broadcast
variable.

Note The current BroadcastManager is available using SparkEnv.broadcastManager attribute and is always BroadcastManager (with a few internal configuration changes to reflect where it runs, i.e. inside the driver or executors).

You should see the following INFO message in the logs:

INFO SparkContext: Created broadcast [id] from [callSite]

If ContextCleaner is defined, the new broadcast variable is registered for cleanup.


Note Spark does not support broadcasting RDDs.

scala> sc.broadcast(sc.range(0, 10))
java.lang.IllegalArgumentException: requirement failed: Can not directly broadcast RDDs; in
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1392)
  ... 48 elided

Once created, the broadcast variable (and other blocks) are displayed per executor and the
driver in web UI (under Executors tab).

Figure 5. Broadcast Variables In web UI’s Executors Tab

Distribute JARs to workers


The jar you specify with SparkContext.addJar will be copied to all the worker nodes.

The configuration setting spark.jars is a comma-separated list of jar paths to be included


in all tasks executed from this SparkContext. A path can either be a local file, a file in HDFS
(or other Hadoop-supported filesystems), an HTTP, HTTPS or FTP URI, or local:/path for
a file on every worker node.

scala> sc.addJar("build.sbt")
15/11/11 21:54:54 INFO SparkContext: Added JAR build.sbt at http://192.168.1.4:49427/j
ars/build.sbt with timestamp 1447275294457

Caution FIXME Why is HttpFileServer used for addJar?

SparkContext as Application-Wide Counter


SparkContext keeps track of:


shuffle ids using nextShuffleId internal counter for registering shuffle dependencies to
Shuffle Service.

Running Job Synchronously —  runJob Methods


RDD actions run jobs using one of runJob methods.

runJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit
runJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int]): Array[U]
runJob[T, U](
rdd: RDD[T],
func: Iterator[T] => U,
partitions: Seq[Int]): Array[U]
runJob[T, U](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U): Array[U]
runJob[T, U](rdd: RDD[T], func: Iterator[T] => U): Array[U]
runJob[T, U](
rdd: RDD[T],
processPartition: (TaskContext, Iterator[T]) => U,
resultHandler: (Int, U) => Unit)
runJob[T, U: ClassTag](
rdd: RDD[T],
processPartition: Iterator[T] => U,
resultHandler: (Int, U) => Unit)

runJob executes a function on one or many partitions of a RDD (in a SparkContext space)

to produce a collection of values per partition.

Note runJob can only work when a SparkContext is not stopped.

Internally, runJob first makes sure that the SparkContext is not stopped. If it is, you should
see the following IllegalStateException exception in the logs:

java.lang.IllegalStateException: SparkContext has been shutdown


at org.apache.spark.SparkContext.runJob(SparkContext.scala:1893)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
... 48 elided

runJob then calculates the call site and cleans a func closure.


You should see the following INFO message in the logs:

INFO SparkContext: Starting job: [callSite]

With spark.logLineage enabled (which is not by default), you should see the following INFO
message with toDebugString (executed on rdd ):

INFO SparkContext: RDD's recursive dependencies:


[toDebugString]

runJob requests DAGScheduler to run a job.

Tip runJob just prepares input parameters for DAGScheduler to run a job.

After DAGScheduler is done and the job has finished, runJob stops ConsoleProgressBar
and performs RDD checkpointing of rdd .

Tip For some actions, e.g. first() and lookup() , there is no need to compute all the partitions of the RDD in a job. And Spark knows it.

// RDD to work with


val lines = sc.parallelize(Seq("hello world", "nice to see you"))

import org.apache.spark.TaskContext
scala> sc.runJob(lines, (t: TaskContext, i: Iterator[String]) => 1) (1)
res0: Array[Int] = Array(1, 1) (2)

1. Run a job using runJob on lines RDD with a function that returns 1 for every partition
(of lines RDD).

2. What can you say about the number of partitions of the lines RDD? Is your result
res0 different than mine? Why?

Tip Read TaskContext.

Running a job is essentially executing a func function on all or a subset of partitions in an


rdd RDD and returning the result as an array (with elements being the results per

partition).


Figure 6. Executing action

Stopping SparkContext  —  stop Method

stop(): Unit

stop stops the SparkContext .

Internally, stop enables stopped internal flag. If already stopped, you should see the
following INFO message in the logs:

INFO SparkContext: SparkContext already stopped.

stop then does the following:

1. Removes _shutdownHookRef from ShutdownHookManager

2. Posts a SparkListenerApplicationEnd (to LiveListenerBus Event Bus)

3. Stops web UI

4. Requests MetricSystem to report metrics (from all registered sinks)

5. Stops ContextCleaner

6. Requests ExecutorAllocationManager to stop


7. If LiveListenerBus was started, requests LiveListenerBus to stop

8. Requests EventLoggingListener to stop

9. Requests DAGScheduler to stop

10. Requests RpcEnv to stop HeartbeatReceiver endpoint

11. Requests ConsoleProgressBar to stop

12. Clears the reference to TaskScheduler , i.e. _taskScheduler is null

13. Requests SparkEnv to stop and clears SparkEnv

14. Clears SPARK_YARN_MODE flag

15. Clears an active SparkContext

Ultimately, you should see the following INFO message in the logs:

INFO SparkContext: Successfully stopped SparkContext

Registering SparkListener —  addSparkListener Method

addSparkListener(listener: SparkListenerInterface): Unit

You can register a custom SparkListenerInterface using addSparkListener method

Note You can also register custom listeners using spark.extraListeners setting.
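A minimal sketch of a custom listener (the class name is made up): it extends the no-op SparkListener adapter and prints a line whenever a job finishes.

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

class JobEndLogger extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished with ${jobEnd.jobResult}")
}

sc.addSparkListener(new JobEndLogger)
sc.parallelize(1 to 10).count()
// eventually prints something like: Job 0 finished with JobSucceeded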

Custom SchedulerBackend, TaskScheduler and DAGScheduler
By default, SparkContext uses ( private[spark] class)
org.apache.spark.scheduler.DAGScheduler , but you can develop your own custom

DAGScheduler implementation, and use ( private[spark] ) SparkContext.dagScheduler_=(ds:


DAGScheduler) method to assign yours.

It is also applicable to SchedulerBackend and TaskScheduler using schedulerBackend_=(sb:


SchedulerBackend) and taskScheduler_=(ts: TaskScheduler) methods, respectively.

Caution FIXME Make it an advanced exercise.

Events


When a Spark context starts, it triggers SparkListenerEnvironmentUpdate and


SparkListenerApplicationStart messages.

Refer to the section SparkContext’s initialization.

Setting Default Logging Level —  setLogLevel Method

setLogLevel(logLevel: String)

setLogLevel allows you to set the root logging level in a Spark application, e.g. Spark shell.

Internally, setLogLevel converts the input logLevel to a org.apache.log4j.Level (using org.apache.log4j.Level.toLevel(logLevel)) and sets it as the root logger's level using org.apache.log4j.LogManager.getRootLogger().setLevel(level).

Tip You can directly set the logging level using org.apache.log4j.LogManager.getLogger(), e.g.

LogManager.getLogger("org").setLevel(Level.OFF)
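For example, to make a Spark shell less chatty:

sc.setLogLevel("WARN")  // accepts the standard log4j level names, e.g. ALL, DEBUG, INFO, WARN, ERROR, OFF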

Closure Cleaning —  clean Method

clean(f: F, checkSerializable: Boolean = true): F

Every time an action is called, Spark cleans up the closure, i.e. the body of the action, before
it is serialized and sent over the wire to executors.

SparkContext comes with clean(f: F, checkSerializable: Boolean = true) method that


does this. It in turn calls ClosureCleaner.clean method.

Not only does ClosureCleaner.clean method clean the closure, but also does it transitively,
i.e. referenced closures are cleaned transitively.

A closure is considered serializable as long as it does not explicitly reference unserializable


objects. It does so by traversing the hierarchy of enclosing closures and null out any
references that are not actually used by the starting closure.


Tip Enable DEBUG logging level for org.apache.spark.util.ClosureCleaner logger to see what happens inside the class.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.util.ClosureCleaner=DEBUG

Refer to Logging.

With DEBUG logging level you should see the following messages in the logs:

+++ Cleaning closure [func] ([func.getClass.getName]) +++


+ declared fields: [declaredFields.size]
[field]
...
+++ closure [func] ([func.getClass.getName]) is now cleaned +++

Serialization is verified using a new instance of Serializer (as closure Serializer). Refer to
Serialization.

Caution FIXME an example, please.
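A sketch of the classic situation that the cleaning and the serializability check guard against (the class and value names are made up): a closure that refers to a field of an enclosing, non-serializable class captures the whole instance, while copying the field to a local val first keeps the closure small and serializable.

class Multiplier(sc: org.apache.spark.SparkContext) {  // Multiplier itself is not Serializable
  val factor = 2

  // Fails with "Task not serializable": `_ * factor` really means `_ * this.factor`,
  // so the closure drags the whole Multiplier instance along.
  def bad(): Array[Int] = sc.parallelize(1 to 3).map(_ * factor).collect()

  // Works: the closure captures only the local val `f`.
  def good(): Array[Int] = {
    val f = factor
    sc.parallelize(1 to 3).map(_ * f).collect()
  }
}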

Hadoop Configuration
While a SparkContext is being created, so is a Hadoop configuration (as an instance of
org.apache.hadoop.conf.Configuration that is available as _hadoopConfiguration ).

Note SparkHadoopUtil.get.newConfiguration is used.

If a SparkConf is provided it is used to build the configuration as described. Otherwise, the


default Configuration object is returned.

If AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are both available, the following settings


are set for the Hadoop configuration:

fs.s3.awsAccessKeyId , fs.s3n.awsAccessKeyId , fs.s3a.access.key are set to the value

of AWS_ACCESS_KEY_ID

fs.s3.awsSecretAccessKey , fs.s3n.awsSecretAccessKey , and fs.s3a.secret.key are set

to the value of AWS_SECRET_ACCESS_KEY

Every spark.hadoop. setting becomes a setting of the configuration with the prefix
spark.hadoop. removed for the key.
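For example, in a standalone application (the Hadoop key below is only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("hadoop-conf-demo")
  .set("spark.hadoop.fs.s3a.connection.maximum", "100")

val sc = new SparkContext(conf)
assert(sc.hadoopConfiguration.get("fs.s3a.connection.maximum") == "100")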


The value of spark.buffer.size (default: 65536 ) is used as the value of


io.file.buffer.size .

listenerBus  —  LiveListenerBus Event Bus


listenerBus is a LiveListenerBus object that acts as a mechanism to announce events to

other services on the driver.

Note It is created and started when SparkContext starts and, since it is a single-JVM event bus, is exclusively used on the driver.

Note listenerBus is a private[spark] value in SparkContext .

Time when SparkContext was Created —  startTime Property

startTime: Long

startTime is the time in milliseconds when SparkContext was created.

scala> sc.startTime
res0: Long = 1464425605653

Spark User —  sparkUser Property

sparkUser: String

sparkUser is the user who started the SparkContext instance.

Note It is computed when SparkContext is created using Utils.getCurrentUserName.

Submitting ShuffleDependency for Execution —  submitMapStage Internal Method

submitMapStage[K, V, C](
dependency: ShuffleDependency[K, V, C]): SimpleFutureAction[MapOutputStatistics]

submitMapStage submits the input ShuffleDependency to DAGScheduler for execution and

returns a SimpleFutureAction .


Internally, submitMapStage calculates the call site first and submits it with localProperties .

Note Interestingly, submitMapStage is used exclusively when Spark SQL’s ShuffleExchange physical operator is executed.

Note submitMapStage seems related to Adaptive Query Planning / Adaptive Scheduling.

Calculating Call Site —  getCallSite Method

Caution FIXME

Cancelling Job Group —  cancelJobGroup Method

cancelJobGroup(groupId: String)

cancelJobGroup requests DAGScheduler to cancel a group of active Spark jobs.

Note cancelJobGroup is used exclusively when SparkExecuteStatementOperation does cancel .

Cancelling All Running and Scheduled Jobs —  cancelAllJobs Method

Caution FIXME

Note cancelAllJobs is used when spark-shell is terminated (e.g. using Ctrl+C, so it can in turn terminate all active Spark jobs) or SparkSQLCLIDriver is terminated.

Setting Local Properties to Group Spark Jobs —  setJobGroup Method

setJobGroup(
groupId: String,
description: String,
interruptOnCancel: Boolean = false): Unit

setJobGroup sets local properties:

spark.jobGroup.id as groupId

spark.job.description as description


spark.job.interruptOnCancel as interruptOnCancel

Note setJobGroup is used when:

Spark Thrift Server’s SparkExecuteStatementOperation runs a query

Structured Streaming’s StreamExecution runs batches
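A minimal sketch of grouping and cancelling jobs together (the group id and description are made up):

sc.setJobGroup("nightly-report", "Jobs of the nightly report", interruptOnCancel = true)

// Every job submitted from this thread now belongs to the "nightly-report" group.
sc.parallelize(1 to 100).count()

// Later, possibly from another thread:
sc.cancelJobGroup("nightly-report")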

cleaner Method

cleaner: Option[ContextCleaner]

cleaner is a private[spark] method to get the optional application-wide ContextCleaner.

Note ContextCleaner is created when SparkContext is created with spark.cleaner.referenceTracking Spark property enabled (which it is by default).

Finding Preferred Locations (Placement Preferences) for RDD Partition —  getPreferredLocs Method

getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation]

getPreferredLocs simply requests DAGScheduler for the preferred locations for partition .

Note Preferred locations of a partition of a RDD are also called placement preferences or locality preferences.

Note getPreferredLocs is used in CoalescedRDDPartition , DefaultPartitionCoalescer and PartitionerAwareUnionRDD .

Registering RDD in persistentRdds Internal Registry —  persistRDD Internal Method

persistRDD(rdd: RDD[_]): Unit

persistRDD registers rdd in persistentRdds internal registry.

Note persistRDD is used exclusively when RDD is persisted or locally checkpointed.


Getting Storage Status of Cached RDDs (as RDDInfos) —  getRDDStorageInfo Methods

getRDDStorageInfo: Array[RDDInfo] (1)


getRDDStorageInfo(filter: RDD[_] => Boolean): Array[RDDInfo] (2)

1. Part of Spark’s Developer API that calls <2> with a filter that accepts every RDD (i.e. it filters out none)

getRDDStorageInfo takes all the RDDs (from persistentRdds registry) that match filter

and creates a collection of RDDInfo instances.

getRDDStorageInfo then updates the RDDInfos with the current status of all BlockManagers

(in a Spark application).

In the end, getRDDStorageInfo gives only the RDD that are cached (i.e. the sum of memory
and disk sizes as well as the number of partitions cached are greater than 0 ).

Note getRDDStorageInfo is used when RDD is requested for RDD lineage graph.
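A quick illustration (the output depends on the dataset and the storage level used):

val nums = sc.parallelize(1 to 1000).setName("numbers").cache()
nums.count()  // materializes the cache

sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}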

Settings

spark.driver.allowMultipleContexts
Quoting the scaladoc of org.apache.spark.SparkContext:

Only one SparkContext may be active per JVM. You must stop() the active
SparkContext before creating a new one.

You can however control the behaviour using spark.driver.allowMultipleContexts flag.

It is disabled, i.e. false , by default.

If enabled (i.e. true ), Spark prints the following WARN message to the logs:

WARN Multiple running SparkContexts detected in the same JVM!

If disabled (default), it will throw an SparkException exception:

Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this erro
r, set spark.driver.allowMultipleContexts = true. The currently running SparkContext w
as created at:
[ctx.creationSite.longForm]


When creating an instance of SparkContext , Spark marks the current thread as the one creating the instance (very early in the instantiation process).

Caution It’s not guaranteed that Spark will work properly with two or more SparkContexts. Consider the feature a work in progress.

Accessing AppStatusStore —  statusStore Method

statusStore: AppStatusStore

statusStore gives the current AppStatusStore.

Note statusStore is used when:

ConsoleProgressBar is requested to refresh

Spark SQL’s SharedState is requested for a SQLAppStatusStore (as statusStore )

Requesting URL of web UI —  uiWebUrl Method

uiWebUrl: Option[String]

uiWebUrl requests the SparkUI for webUrl.

Environment Variables
Table 3. Environment Variables

SPARK_EXECUTOR_MEMORY (default: 1024 ) - Amount of memory to allocate for a Spark executor in MB. See Executor Memory.

SPARK_USER (no default) - The user who is running SparkContext . Available later as sparkUser.


HeartbeatReceiver RPC Endpoint


HeartbeatReceiver is a ThreadSafeRpcEndpoint registered on the driver under the name

HeartbeatReceiver.

HeartbeatReceiver receives Heartbeat messages from executors that Spark uses as the

mechanism to receive accumulator updates (with task metrics and a Spark application’s
accumulators) and pass them along to TaskScheduler .

Figure 1. HeartbeatReceiver RPC Endpoint and Heartbeats from Executors


Note HeartbeatReceiver is registered immediately after a Spark application is started, i.e. when SparkContext is created.

HeartbeatReceiver is a SparkListener to get notified when a new executor is added to or no

longer available in a Spark application. HeartbeatReceiver tracks executors (in


executorLastSeen registry) to handle Heartbeat and ExpireDeadHosts messages from
executors that are assigned to the Spark application.


Table 1. HeartbeatReceiver RPC Endpoint’s Messages (in alphabetical order)

ExecutorRegistered - Posted when HeartbeatReceiver is notified that a new executor has been registered (with a Spark application).

ExecutorRemoved - Posted when HeartbeatReceiver is notified that an executor is no longer available (to a Spark application).

ExpireDeadHosts - FIXME

Heartbeat - Posted when Executor informs that it is alive and reports task metrics.

TaskSchedulerIsSet - Posted when SparkContext informs that TaskScheduler is available.

Table 2. HeartbeatReceiver’s Internal Registries and Counters

executorLastSeen - Executor ids and the timestamps of when the last heartbeat was received.

scheduler - TaskScheduler

Tip Enable DEBUG or TRACE logging levels for org.apache.spark.HeartbeatReceiver to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.HeartbeatReceiver=TRACE

Refer to Logging.

Creating HeartbeatReceiver Instance


HeartbeatReceiver takes the following when created:

SparkContext

Clock

HeartbeatReceiver registers itself as a SparkListener .

HeartbeatReceiver initializes the internal registries and counters.


Starting HeartbeatReceiver RPC Endpoint —  onStart Method

Note onStart is part of the RpcEndpoint Contract

When called, HeartbeatReceiver sends a blocking ExpireDeadHosts every


spark.network.timeoutInterval on eventLoopThread - Heartbeat Receiver Event Loop
Thread.

ExecutorRegistered

ExecutorRegistered(executorId: String)

When received, HeartbeatReceiver registers the executorId executor and the current time
(in executorLastSeen internal registry).

Note HeartbeatReceiver uses the internal Clock to know the current time.

ExecutorRemoved

ExecutorRemoved(executorId: String)

When ExecutorRemoved arrives, HeartbeatReceiver removes executorId from


executorLastSeen internal registry.

ExpireDeadHosts

ExpireDeadHosts

When ExpireDeadHosts arrives the following TRACE is printed out to the logs:

TRACE HeartbeatReceiver: Checking for hosts with no recent heartbeats in HeartbeatRece


iver.

For each executor (in executorLastSeen registry), HeartbeatReceiver checks whether the time since its last heartbeat is longer than spark.network.timeout.

For any such executor, the following WARN message is printed out to the logs:


WARN HeartbeatReceiver: Removing executor [executorId] with no recent heartbeats: [tim


e] ms exceeds timeout [timeout] ms

TaskScheduler.executorLost is called (with SlaveLost("Executor heartbeat timed out after [timeout] ms") ).

SparkContext.killAndReplaceExecutor is asynchronously called for the executor (i.e. on

killExecutorThread).

The executor is removed from executorLastSeen.

Heartbeat

Heartbeat(executorId: String,
accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],
blockManagerId: BlockManagerId)

When received, HeartbeatReceiver finds the executorId executor (in executorLastSeen


registry).

When the executor is found, HeartbeatReceiver updates the time the heartbeat was
received (in executorLastSeen).

Note HeartbeatReceiver uses the internal Clock to know the current time.

HeartbeatReceiver then submits an asynchronous task to notify TaskScheduler that the

heartbeat was received from the executor (using TaskScheduler internal reference).
HeartbeatReceiver posts a HeartbeatResponse back to the executor (with the response from

TaskScheduler whether the executor has been registered already or not so it may eventually

need to re-register).

If however the executor was not found (in executorLastSeen registry), i.e. the executor was
not registered before, you should see the following DEBUG message in the logs and the
response is to notify the executor to re-register.

DEBUG Received heartbeat from unknown executor [executorId]

In a very rare case, when TaskScheduler is not yet assigned to HeartbeatReceiver , you
should see the following WARN message in the logs and the response is to notify the
executor to re-register.

WARN Dropping [heartbeat] because TaskScheduler is not ready yet


Note TaskScheduler can be unassigned when no TaskSchedulerIsSet message has been received yet.

Note Heartbeat messages are the mechanism for executors to inform the Spark application that they are alive and to report the state of active tasks.

TaskSchedulerIsSet

TaskSchedulerIsSet

When received, HeartbeatReceiver sets the internal reference to TaskScheduler.

Note HeartbeatReceiver uses SparkContext that is given when HeartbeatReceiver is created.

onExecutorAdded Method

onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit

onExecutorAdded simply sends an ExecutorRegistered message to itself (that in turn registers an executor).

Note onExecutorAdded is part of SparkListener contract to announce that a new executor was registered with a Spark application.

Sending ExecutorRegistered Message to Itself —  addExecutor Internal Method

addExecutor(executorId: String): Option[Future[Boolean]]

addExecutor sends an ExecutorRegistered message (to register executorId executor).

Note addExecutor is used when HeartbeatReceiver is notified that a new executor was added.

onExecutorRemoved Method

onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved): Unit


onExecutorRemoved simply passes the call to removeExecutor (that in turn unregisters an

executor).

Note onExecutorRemoved is part of SparkListener contract to announce that an executor is no longer available for a Spark application.

Sending ExecutorRemoved Message to Itself —  removeExecutor Method

removeExecutor(executorId: String): Option[Future[Boolean]]

removeExecutor sends an ExecutorRemoved message to itself (passing in executorId ).

Note removeExecutor is used when HeartbeatReceiver is notified that an executor is no longer available.

Stopping HeartbeatReceiver RPC Endpoint —  onStop Method

Note onStop is part of the RpcEndpoint Contract

When called, HeartbeatReceiver cancels the checking task (that sends a blocking
ExpireDeadHosts every spark.network.timeoutInterval on eventLoopThread - Heartbeat
Receiver Event Loop Thread - see Starting (onStart method)) and shuts down
eventLoopThread and killExecutorThread executors.

killExecutorThread  — Kill Executor Thread


killExecutorThread is a daemon ScheduledThreadPoolExecutor with a single thread.

The name of the thread pool is kill-executor-thread.

Note It is used to request SparkContext to kill the executor.

eventLoopThread  — Heartbeat Receiver Event Loop Thread
eventLoopThread is a daemon ScheduledThreadPoolExecutor with a single thread.

The name of the thread pool is heartbeat-receiver-event-loop-thread.


expireDeadHosts Internal Method

expireDeadHosts(): Unit

Caution FIXME

Note expireDeadHosts is used when HeartbeatReceiver receives an ExpireDeadHosts message.

Settings
Table 3. Spark Properties
Spark Property Default Value
spark.storage.blockManagerTimeoutIntervalMs 60s

spark.storage.blockManagerSlaveTimeoutMs 120s

spark.network.timeout spark.storage.blockManagerSlaveTimeoutMs

spark.network.timeoutInterval spark.storage.blockManagerTimeoutIntervalMs


Inside Creating SparkContext


This document describes what happens when you create a new SparkContext.

import org.apache.spark.{SparkConf, SparkContext}

// 1. Create Spark configuration


val conf = new SparkConf()
.setAppName("SparkMe Application")
.setMaster("local[*]") // local mode

// 2. Create Spark context


val sc = new SparkContext(conf)

Note The example uses Spark in local mode, but the initialization with the other cluster modes would follow similar steps.

Creating SparkContext instance starts by setting the internal allowMultipleContexts field


with the value of spark.driver.allowMultipleContexts and marking this SparkContext instance
as partially constructed. It makes sure that no other thread is creating a SparkContext
instance in this JVM. It does so by synchronizing on SPARK_CONTEXT_CONSTRUCTOR_LOCK and
using the internal atomic reference activeContext (that eventually has a fully-created
SparkContext instance).

Note The entire code of SparkContext that creates a fully-working SparkContext instance is between two statements:

SparkContext.markPartiallyConstructed(this, allowMultipleContexts)

// the SparkContext code goes here

SparkContext.setActiveContext(this, allowMultipleContexts)

startTime is set to the current time in milliseconds.

stopped internal flag is set to false .

The very first information printed out is the version of Spark as an INFO message:

INFO SparkContext: Running Spark version 2.0.0-SNAPSHOT

Tip You can use version method to learn about the current Spark version or org.apache.spark.SPARK_VERSION value.


A LiveListenerBus instance is created (as listenerBus ).

The current user name is computed.

Caution FIXME Where is sparkUser used?

It saves the input SparkConf (as _conf ).

Caution FIXME Review _conf.validateSettings()

It ensures that the first mandatory setting - spark.master is defined. SparkException is


thrown if not.

A master URL must be set in your configuration

It ensures that the other mandatory setting - spark.app.name is defined. SparkException is


thrown if not.

An application name must be set in your configuration

For Spark on YARN in cluster deploy mode, it checks existence of spark.yarn.app.id .


SparkException is thrown if it does not exist.

Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not
supported directly by SparkContext. Please use spark-submit.

Caution FIXME How to "trigger" the exception? What are the steps?

When spark.logConf is enabled SparkConf.toDebugString is called.

Note SparkConf.toDebugString is called very early in the initialization process and other settings configured afterwards are not included. Use sc.getConf.toDebugString once SparkContext is initialized.

The driver’s host and port are set if missing. spark.driver.host becomes the value of
Utils.localHostName (or an exception is thrown) while spark.driver.port is set to 0 .

Note spark.driver.host and spark.driver.port are expected to be set on the driver. It is later asserted by SparkEnv.

spark.executor.id setting is set to driver .

Tip Use sc.getConf.get("spark.executor.id") to know where the code is executed — driver or executors.


It sets the jars and files based on spark.jars and spark.files , respectively. These are
files that are required for proper task execution on executors.

If event logging is enabled, i.e. spark.eventLog.enabled flag is true , the internal field
_eventLogDir is set to the value of spark.eventLog.dir setting or the default value

/tmp/spark-events .

Also, if spark.eventLog.compress is enabled (it is not by default), the short name of the
CompressionCodec is assigned to _eventLogCodec . The config key is
spark.io.compression.codec (default: lz4 ).

Tip Read about compression codecs in Compression.

Creating LiveListenerBus
SparkContext creates a LiveListenerBus.

Creating Live AppStatusStore


SparkContext requests AppStatusStore to create a live store (i.e. the AppStatusStore for a

live Spark application) and requests LiveListenerBus to add the AppStatusListener to the
status queue.

Note The current AppStatusStore is available as statusStore property of the SparkContext .

Creating SparkEnv
SparkContext creates a SparkEnv and requests SparkEnv to use the instance as the

default SparkEnv.

Caution FIXME Describe the following steps.

MetadataCleaner is created.

Caution FIXME What’s MetadataCleaner?

Creating SparkStatusTracker
SparkContext creates a SparkStatusTracker (with itself and the AppStatusStore).

Creating ConsoleProgressBar


SparkContext creates the optional ConsoleProgressBar when

spark.ui.showConsoleProgress property is enabled and the INFO logging level for


SparkContext is disabled.

Creating SparkUI
SparkContext creates a SparkUI when spark.ui.enabled configuration property is enabled

(i.e. true ) with the following:

AppStatusStore

Name of the Spark application that is exactly the value of spark.app.name configuration
property

Empty base path

Note spark.ui.enabled Spark property is assumed enabled when undefined.

Caution FIXME Where’s _ui used?

A Hadoop configuration is created. See Hadoop Configuration.

If there are jars given through the SparkContext constructor, they are added using addJar .

If there were files specified, they are added using addFile.

At this point in time, the amount of memory to allocate to each executor (as
_executorMemory ) is calculated. It is the value of spark.executor.memory setting, or

SPARK_EXECUTOR_MEMORY environment variable (or currently-deprecated SPARK_MEM ),


or defaults to 1024 .

_executorMemory is later available as sc.executorMemory and used for

LOCAL_CLUSTER_REGEX, Spark Standalone’s SparkDeploySchedulerBackend, to set


executorEnvs("SPARK_EXECUTOR_MEMORY") , MesosSchedulerBackend,

CoarseMesosSchedulerBackend.

The value of SPARK_PREPEND_CLASSES environment variable is included in executorEnvs .

FIXME

What’s _executorMemory ?
What’s the unit of the value of _executorMemory exactly?
Caution
What are "SPARK_TESTING", "spark.testing"? How do they contribute
to executorEnvs ?

What’s executorEnvs ?


The Mesos scheduler backend’s configuration is included in executorEnvs , i.e.


SPARK_EXECUTOR_MEMORY, _conf.getExecutorEnv , and SPARK_USER .

SparkContext registers HeartbeatReceiver RPC endpoint.

SparkContext.createTaskScheduler is executed (using the master URL) and the result


becomes the internal _schedulerBackend and _taskScheduler .

Note The internal _schedulerBackend and _taskScheduler are used by schedulerBackend and taskScheduler methods, respectively.

DAGScheduler is created (as _dagScheduler ).

SparkContext sends a blocking TaskSchedulerIsSet message to HeartbeatReceiver RPC

endpoint (to inform that the TaskScheduler is now available).

Starting TaskScheduler
SparkContext starts TaskScheduler .

Setting Spark Application’s and Execution Attempt’s IDs —  _applicationId and _applicationAttemptId
SparkContext sets the internal fields —  _applicationId and _applicationAttemptId  — 

(using applicationId and applicationAttemptId methods from the TaskScheduler


Contract).

Note SparkContext requests TaskScheduler for the unique identifier of a Spark application (that is currently only implemented by TaskSchedulerImpl that uses SchedulerBackend to request the identifier).

Note The unique identifier of a Spark application is used to initialize SparkUI and BlockManager.

Note _applicationAttemptId is used when SparkContext is requested for the unique identifier of execution attempt of a Spark application and when EventLoggingListener is created.

Setting spark.app.id Spark Property in SparkConf


SparkContext sets spark.app.id property to be the unique identifier of a Spark application

and, if enabled, passes it on to SparkUI .

Initializing BlockManager


The BlockManager (for the driver) is initialized (with _applicationId ).

Starting MetricsSystem
SparkContext requests the MetricsSystem to start.

Note SparkContext starts MetricsSystem after setting spark.app.id Spark property as MetricsSystem uses it to build unique identifiers for metrics sources.

Requesting JSON Servlet Handler


SparkContext requests the MetricsSystem for a JSON servlet handler and requests the

SparkUI to attach it.

_eventLogger is created and started if isEventLogEnabled . It uses EventLoggingListener

that gets registered to LiveListenerBus.

FIXME Why is _eventLogger required to be the internal field of


Caution
SparkContext? Where is this used?

If dynamic allocation is enabled, ExecutorAllocationManager is created (as


_executorAllocationManager ) and immediately started.

Note _executorAllocationManager is exposed (as a method) to YARN scheduler backends to reset their state to the initial state.

If spark.cleaner.referenceTracking Spark property is enabled (i.e. true ), SparkContext


creates ContextCleaner (as _cleaner ) and started immediately. Otherwise, _cleaner is
empty.

Note spark.cleaner.referenceTracking Spark property is enabled by default.

FIXME It’d be quite useful to have all the properties with their default values
Caution in sc.getConf.toDebugString , so when a configuration is not included but
does change Spark runtime configuration, it should be added to _conf .

It registers user-defined listeners and starts SparkListenerEvent event delivery to the


listeners.

postEnvironmentUpdate is called that posts SparkListenerEnvironmentUpdate message on

LiveListenerBus with information about Task Scheduler’s scheduling mode, added jar and
file paths, and other environmental details. They are displayed in web UI’s Environment tab.

SparkListenerApplicationStart message is posted to LiveListenerBus (using the internal


postApplicationStart method).


TaskScheduler is notified that SparkContext is almost fully initialized.

Note TaskScheduler.postStartHook does nothing by default, but custom implementations offer more advanced features, i.e. TaskSchedulerImpl blocks the current thread until SchedulerBackend is ready. There is also YarnClusterScheduler for Spark on YARN in cluster deploy mode.

Registering Metrics Sources


SparkContext requests MetricsSystem to register metrics sources for the following services:

1. DAGScheduler

2. BlockManager

3. ExecutorAllocationManager (if dynamic allocation is enabled)

Adding Shutdown Hook


SparkContext adds a shutdown hook (using ShutdownHookManager.addShutdownHook() ).

You should see the following DEBUG message in the logs:

DEBUG Adding shutdown hook

Caution FIXME ShutdownHookManager.addShutdownHook()

Any non-fatal Exception leads to termination of the Spark context instance.

Caution FIXME What does NonFatal represent in Scala?

Caution FIXME Finish me

Initializing nextShuffleId and nextRddId Internal Counters


nextShuffleId and nextRddId start with 0 .

Caution FIXME Where are nextShuffleId and nextRddId used?

A new instance of Spark context is created and ready for operation.

Creating SchedulerBackend and TaskScheduler —  createTaskScheduler Internal Method


createTaskScheduler(
sc: SparkContext,
master: String,
deployMode: String): (SchedulerBackend, TaskScheduler)

createTaskScheduler is executed as part of creating an instance of SparkContext to create

TaskScheduler and SchedulerBackend objects.

createTaskScheduler uses the master URL to select the requested implementation.

Figure 1. SparkContext creates Task Scheduler and Scheduler Backend


createTaskScheduler understands the following master URLs:

local - local mode with 1 thread only

local[n] or local[*] - local mode with n threads.

local[n, m] or local[*, m]  — local mode with n threads and m number of failures.

spark://hostname:port for Spark Standalone.

local-cluster[n, m, z]  — local cluster with n workers, m cores per worker, and z

memory per worker.

mesos://hostname:port for Spark on Apache Mesos.

any other URL is passed to getClusterManager to load an external cluster manager.

Caution FIXME

Loading External Cluster Manager for URL (getClusterManager method)

getClusterManager(url: String): Option[ExternalClusterManager]


getClusterManager loads ExternalClusterManager that can handle the input url .

If there are two or more external cluster managers that could handle url , a
SparkException is thrown:

Multiple Cluster Managers ([serviceLoaders]) registered for the url [url].

Note getClusterManager uses Java’s ServiceLoader.load method.

Note getClusterManager is used to find a cluster manager for a master URL when creating a SchedulerBackend and a TaskScheduler for the driver.

setupAndStartListenerBus

setupAndStartListenerBus(): Unit

setupAndStartListenerBus is an internal method that reads spark.extraListeners setting from

the current SparkConf to create and register SparkListenerInterface listeners.

It expects that the class name represents a SparkListenerInterface listener with one of the
following constructors (in this order):

a single-argument constructor that accepts SparkConf

a zero-argument constructor

setupAndStartListenerBus registers every listener class.

You should see the following INFO message in the logs:

INFO Registered listener [className]

It starts LiveListenerBus and records it in the internal _listenerBusStarted .

When no single- SparkConf or zero-argument constructor could be found for a class name in
spark.extraListeners setting, a SparkException is thrown with the message:

[className] did not have a zero-argument constructor or a single-argument constructor


that accepts SparkConf. Note: if the class is defined inside of another Scala class, t
hen its constructors may accept an implicit parameter that references the enclosing cl
ass; in this case, you must define the listener as a top-level class in order to preve
nt this extra parameter from breaking Spark's ability to find a valid constructor.


Any exception while registering a SparkListenerInterface listener stops the SparkContext and a SparkException is thrown with the source exception’s message:

Exception when registering SparkListener

Tip Set INFO on org.apache.spark.SparkContext logger to see the extra listeners being registered.

INFO SparkContext: Registered listener pl.japila.spark.CustomSparkListener

Creating SparkEnv for Driver —  createSparkEnv Method

createSparkEnv(
conf: SparkConf,
isLocal: Boolean,
listenerBus: LiveListenerBus): SparkEnv

createSparkEnv simply delegates the call to SparkEnv to create a SparkEnv for the driver.

It calculates the number of cores to 1 for local master URL, the number of processors
available for JVM for * or the exact number in the master URL, or 0 for the cluster
master URLs.

Utils.getCurrentUserName Method

getCurrentUserName(): String

getCurrentUserName computes the user name who has started the SparkContext instance.

Note It is later available as SparkContext.sparkUser.

Internally, it reads SPARK_USER environment variable and, if not set, reverts to Hadoop
Security API’s UserGroupInformation.getCurrentUser().getShortUserName() .

Note It is another place where Spark relies on Hadoop API for its operation.

Utils.localHostName Method
localHostName computes the local host name.


It starts by checking SPARK_LOCAL_HOSTNAME environment variable for the value. If it is not


defined, it uses SPARK_LOCAL_IP to find the name (using InetAddress.getByName ). If it is not
defined either, it calls InetAddress.getLocalHost for the name.

Note Utils.localHostName is executed while SparkContext is created and also to compute the default value of spark.driver.host Spark property.

Caution FIXME Review the rest.

stopped Flag

Caution FIXME Where is this used?


ConsoleProgressBar
ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr . It uses SparkStatusTracker to poll the status of stages periodically and print out active stages with more than one task. It keeps overwriting the same line, showing the status of at most 3 concurrent stages at a time.

[Stage 0:====> (316 + 4) / 1000][Stage 1:> (0 + 0) / 1000][Sta


ge 2:> (0 + 0) / 1000]]]

The progress includes the stage id, the number of completed, active, and total tasks.

Tip ConsoleProgressBar may be useful when you ssh to workers and want to see the progress of active stages.

ConsoleProgressBar is created when SparkContext starts with spark.ui.showConsoleProgress enabled and the logging level of org.apache.spark.SparkContext logger as WARN or higher (i.e. fewer messages are printed out and so there is a "space" for ConsoleProgressBar ).

import org.apache.log4j._
Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)

To print the progress nicely ConsoleProgressBar uses COLUMNS environment variable to


know the width of the terminal. It assumes 80 columns.

The progress bar prints out the status after a stage has run for at least 500 milliseconds, and then every spark.ui.consoleProgress.update.interval milliseconds.

Note The initial delay of 500 milliseconds before ConsoleProgressBar shows the progress is not configurable.

See the progress bar in Spark shell with the following:


$ ./bin/spark-shell --conf spark.ui.showConsoleProgress=true (1)

scala> sc.setLogLevel("OFF") (2)

import org.apache.log4j._
scala> Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN) (3)

scala> sc.parallelize(1 to 4, 4).map { n => Thread.sleep(500 + 200 * n); n }.count (4)
[Stage 2:> (0 + 4) / 4]
[Stage 2:==============> (1 + 3) / 4]
[Stage 2:=============================> (2 + 2) / 4]
[Stage 2:============================================> (3 + 1) / 4]

1. Make sure spark.ui.showConsoleProgress is true . It is by default.

2. Disable ( OFF ) the root logger (that includes Spark’s logger)

3. Make sure org.apache.spark.SparkContext logger is at least WARN .

4. Run a job with 4 tasks with 500ms initial sleep and 200ms sleep chunks to see the
progress bar.

Tip Watch the short video that shows ConsoleProgressBar in action.

You may want to use the following example to see the progress bar in full glory - all 3
concurrent stages in console (borrowed from a comment to [SPARK-4017] show progress
bar in console #3029):

> ./bin/spark-shell
scala> val a = sc.makeRDD(1 to 1000, 10000).map(x => (x, x)).reduceByKey(_ + _)
scala> val b = sc.makeRDD(1 to 1000, 10000).map(x => (x, x)).reduceByKey(_ + _)
scala> a.union(b).count()

Creating ConsoleProgressBar Instance


ConsoleProgressBar requires a SparkContext.

When being created, ConsoleProgressBar reads spark.ui.consoleProgress.update.interval


configuration property to set up the update interval and COLUMNS environment variable for
the terminal width (or assumes 80 columns).

ConsoleProgressBar starts the internal timer refresh progress that does refresh and shows

progress.


Note ConsoleProgressBar is created when SparkContext starts, spark.ui.showConsoleProgress configuration property is enabled, and the logging level of org.apache.spark.SparkContext logger is WARN or higher (i.e. fewer messages are printed out and so there is a "space" for ConsoleProgressBar ).

Note Once created, ConsoleProgressBar is available internally as _progressBar .

finishAll Method

Caution FIXME

stop Method

stop(): Unit

stop cancels (stops) the internal timer.

Note stop is executed when SparkContext stops.

refresh Internal Method

refresh(): Unit

refresh …​FIXME

Note refresh is used when…​FIXME


SparkStatusTracker
SparkStatusTracker is…​FIXME

SparkStatusTracker is created when SparkContext is created.

Creating SparkStatusTracker Instance


SparkStatusTracker takes the following when created:

SparkContext

AppStatusStore


Local Properties — Creating Logical Job Groups
The purpose of the local properties concept is to create logical groups of jobs by means of properties that (regardless of the threads used to submit the jobs) make separate jobs launched from different threads belong to a single logical group.

You can set a local property that will affect Spark jobs submitted from a thread, such as the
Spark fair scheduler pool. You can use your own custom properties. The properties are
propagated through to worker tasks and can be accessed there via
TaskContext.getLocalProperty.

Note Propagating local properties to workers starts when SparkContext is requested to run or submit a Spark job that in turn passes them along to DAGScheduler .

Note Local properties are used to group jobs into pools in FAIR job scheduler by spark.scheduler.pool per-thread property and in SQLExecution.withNewExecutionId Helper Methods.

A common use case for the local property concept is to set a local property in a thread, say
spark.scheduler.pool, after which all jobs submitted within the thread will be grouped, say
into a pool by FAIR job scheduler.

val rdd = sc.parallelize(0 to 9)

sc.setLocalProperty("spark.scheduler.pool", "myPool")

// these two jobs (one per action) will run in the myPool pool
rdd.count
rdd.collect

sc.setLocalProperty("spark.scheduler.pool", null)

// this job will run in the default pool


rdd.count

Local Properties —  localProperties Property

localProperties: InheritableThreadLocal[Properties]

localProperties is a protected[spark] property of a SparkContext that holds the properties through which you can create logical job groups.


Tip Read up on Java’s java.lang.InheritableThreadLocal.

Setting Local Property —  setLocalProperty Method

setLocalProperty(key: String, value: String): Unit

setLocalProperty sets key local property to value .

Tip When value is null the key property is removed from localProperties.

Getting Local Property —  getLocalProperty Method

getLocalProperty(key: String): String

getLocalProperty gets a local property by key in this thread. It returns null if key is

missing.
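Local properties set on the driver can also be read from inside tasks, as in this sketch (the property name is made up):

sc.setLocalProperty("job.owner", "alice")

import org.apache.spark.TaskContext
sc.parallelize(1 to 2, 2).foreach { _ =>
  // runs on executors; local properties are propagated with the tasks
  println(TaskContext.get.getLocalProperty("job.owner"))  // alice
}

sc.getLocalProperty("job.owner")  // "alice" in this thread, null in other threads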

Getting Local Properties —  getLocalProperties Method

getLocalProperties: Properties

getLocalProperties is a private[spark] method that gives access to localProperties.

setLocalProperties Method

setLocalProperties(props: Properties): Unit

setLocalProperties is a private[spark] method that sets props as localProperties.


RDD — Resilient Distributed Dataset


Resilient Distributed Dataset (aka RDD) is the primary data abstraction in Apache Spark
and the core of Spark (that I often refer to as "Spark Core").

The origins of RDD


The original paper that gave birth to the concept of RDD is Resilient Distributed Datasets: A
Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.

A RDD is a resilient and distributed collection of records spread over one or many partitions.

Note One could compare RDDs to collections in Scala, i.e. a RDD is computed on many JVMs while a Scala collection lives on a single JVM.

Using the RDD abstraction, Spark hides data partitioning and distribution, which in turn allowed its designers to build a parallel computational framework with a higher-level programming interface (API) for four mainstream programming languages.

The features of RDDs (decomposing the name):

Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to
recompute missing or damaged partitions due to node failures.

Distributed with data residing on multiple nodes in a cluster.

Dataset is a collection of partitioned data with primitive values or values of values, e.g.
tuples or other objects (that represent records of the data you work with).

Figure 1. RDDs


From the scaladoc of org.apache.spark.rdd.RDD:

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an


immutable, partitioned collection of elements that can be operated on in parallel.

From the original paper about RDD - Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing:

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets
programmers perform in-memory computations on large clusters in a fault-tolerant
manner.

Beside the above traits (that are directly embedded in the name of the data abstraction -
RDD) it has the following additional traits:

In-Memory, i.e. data inside RDD is stored in memory as much (size) and long (time) as
possible.

Immutable or Read-Only, i.e. it does not change once created and can only be
transformed using transformations to new RDDs.

Lazy evaluated, i.e. the data inside RDD is not available or transformed until an action
is executed that triggers the execution.

Cacheable, i.e. you can hold all the data in a persistent "storage" like memory (default
and the most preferred) or disk (the least preferred due to access speed).

Parallel, i.e. process data in parallel.

Typed — RDD records have types, e.g. Long in RDD[Long] or (Int, String) in


RDD[(Int, String)] .

Partitioned — records are partitioned (split into logical partitions) and distributed across
nodes in a cluster.

Location-Stickiness —  RDD can define placement preferences to compute partitions


(as close to the records as possible).

Note Preferred location (aka locality preferences or placement preferences or locality info) is information about the locations of RDD records (that Spark’s DAGScheduler uses to place computing partitions on to have the tasks as close to the data as possible).

Computing partitions in a RDD is a distributed process by design and to achieve even data
distribution as well as leverage data locality (in distributed systems like HDFS or
Cassandra in which data is partitioned by default), they are partitioned to a fixed number of


partitions - logical chunks (parts) of data. The logical division is for processing only and internally the data is not divided whatsoever. Each partition comprises records.

Figure 2. RDDs
Partitions are the units of parallelism. You can control the number of partitions of a RDD
using repartition or coalesce transformations. Spark tries to be as close to data as possible
without wasting time to send data across network by means of RDD shuffling, and creates
as many partitions as required to follow the storage layout and thus optimize data access. It
leads to a one-to-one mapping between (physical) data in distributed data storage, e.g.
HDFS or Cassandra, and partitions.

RDDs support two kinds of operations:

transformations - lazy operations that return another RDD.

actions - operations that trigger computation and return values.

The motivation to create RDD were (after the authors) two types of applications that current
computing frameworks handle inefficiently:

iterative algorithms in machine learning and graph computations.

interactive data mining tools as ad-hoc queries on the same dataset.

The goal is to reuse intermediate in-memory results across multiple data-intensive


workloads with no need for copying large amounts of data over the network.

Technically, RDDs follow the contract defined by the five main intrinsic properties:

List of parent RDDs that are the dependencies of the RDD.

An array of partitions that a dataset is divided to.

A compute function to do a computation on partitions.


An optional Partitioner that defines how keys are hashed, and the pairs partitioned (for
key-value RDDs)

Optional preferred locations (aka locality info), i.e. hosts for a partition where the
records live or are the closest to read from.

This RDD abstraction supports an expressive set of operations without having to modify
scheduler for each one.

An RDD is a named (by name ) and uniquely identified (by id ) entity in a SparkContext
(available as context property).

RDDs live in one and only one SparkContext that creates a logical boundary.

Note RDDs cannot be shared between SparkContexts (see SparkContext and RDDs).

An RDD can optionally have a friendly name accessible using name that can be changed
using = :

scala> val ns = sc.parallelize(0 to 10)


ns: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <consol
e>:24

scala> ns.id
res0: Int = 2

scala> ns.name
res1: String = null

scala> ns.name = "Friendly name"


ns.name: String = Friendly name

scala> ns.name
res2: String = Friendly name

scala> ns.toDebugString
res3: String = (8) Friendly name ParallelCollectionRDD[2] at parallelize at <console>:
24 []

RDDs are a container of instructions on how to materialize big (arrays of) distributed data,
and how to split it into partitions so Spark (using executors) can hold some of them.

In general data distribution can help executing processing in parallel so a task processes a
chunk of data that it could eventually keep in memory.

Spark does jobs in parallel, and RDDs are split into partitions to be processed and written in
parallel. Inside a partition, data is processed sequentially.


Saving partitions results in part-files instead of one single file (unless there is a single
partition).

checkpointRDD Internal Method

Caution FIXME

isCheckpointedAndMaterialized Method

Caution FIXME

getNarrowAncestors Method

Caution FIXME

toLocalIterator Method

Caution FIXME

cache Method

Caution FIXME

persist Methods

persist(): this.type
persist(newLevel: StorageLevel): this.type

Refer to Persisting RDD —  persist Methods.

persist Internal Method

persist(newLevel: StorageLevel, allowOverride: Boolean): this.type

Caution FIXME

persist is used when RDD is requested to persist itself and marks itself for
Note
local checkpointing.


unpersist Method

Caution FIXME

localCheckpoint Method

localCheckpoint(): this.type

Refer to Marking RDD for Local Checkpointing —  localCheckpoint Method.

RDD Contract

abstract class RDD[T] {


def compute(split: Partition, context: TaskContext): Iterator[T]
def getPartitions: Array[Partition]
def getDependencies: Seq[Dependency[_]]
def getPreferredLocations(split: Partition): Seq[String] = Nil
val partitioner: Option[Partitioner] = None
}

Note RDD is an abstract class in Scala.

Table 1. RDD Contract

compute - Used exclusively when RDD computes a partition (possibly by reading from a checkpoint).

getPartitions - Used exclusively when RDD is requested for its partitions (called only once as the value is cached).

getDependencies - Used when RDD is requested for its dependencies (called only once as the value is cached).

getPreferredLocations - Defines placement preferences of a partition. Used exclusively when RDD is requested for the preferred locations of a partition.

partitioner - Defines the Partitioner of a RDD .
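A minimal sketch of a custom RDD that satisfies the contract (the class is made up for illustration): it produces the numbers 0 until n split across numSlices partitions and relies on the default getDependencies, getPreferredLocations and partitioner.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

case class SlicePartition(index: Int) extends Partition

class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(i => SlicePartition(i))

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by numSlices).iterator
}

val rdd = new SimpleRangeRDD(sc, 10, 3)
rdd.collect()  // Array(0, 3, 6, 9, 1, 4, 7, 2, 5, 8)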

Types of RDDs


These are some of the most interesting types of RDDs:

ParallelCollectionRDD

CoGroupedRDD

HadoopRDD is an RDD that provides core functionality for reading data stored in HDFS
using the older MapReduce API. The most notable use case is the return RDD of
SparkContext.textFile .

MapPartitionsRDD - a result of calling operations like map , flatMap , filter ,


mapPartitions, etc.

CoalescedRDD - a result of repartition or coalesce transformations.

ShuffledRDD - a result of shuffling, e.g. after repartition or coalesce transformations.

PipedRDD - an RDD created by piping elements to a forked external process.

PairRDD (implicit