
Introduction to Apache Spark

© Copyright IBM Corporation 2021


Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Unit objectives
• Explain the nature and purpose of Apache Spark in the Hadoop
infrastructure.
• Describe the architecture and list the components of the Apache Spark
unified stack.
• Describe the role of a Resilient Distributed Dataset (RDD).
• Explain the principles of Apache Spark programming.
• List and describe the Apache Spark libraries.
• Start and use Apache Spark Scala and Python shells.
• Describe Apache Spark Streaming, Apache Spark SQL, MLlib, and
GraphX.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark overview

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Big data and Apache Spark
• Faster results from analytics are increasingly important.
• Apache Spark is a computing platform that is fast, general-purpose, and
easy to use.

Speed:
• In-memory computations.
• Faster than MapReduce for complex applications on disk.
Generality:
• Covers a wide range of workloads on one system.
• Batch applications (for example, MapReduce).
• Iterative algorithms.
• Interactive queries and streaming.
Ease of use:
• APIs for Scala, Python, Java, and R.
• Libraries for SQL, machine learning, streaming, and graph processing.
• Runs on Hadoop clusters or as a stand-alone product.
• Includes the popular MapReduce model.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Ease of use
• To implement the classic WordCount in Java MapReduce, you need
three classes: the main class that sets up the job, a mapper, and a
reducer, each about 10 lines long.
• Here is the same WordCount program that is written in Scala for
Apache Spark:

val conf = new SparkConf().setAppName("Spark wordcount")


val sc = new SparkContext(conf)
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1)).countByKey()
counts.saveAsTextFile("hdfs://...")

• Because it is written in Scala, Apache Spark can take advantage of the versatility,
flexibility, and functional programming concepts of Scala.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Who uses Apache Spark and why
• Apache Spark is used for parallel distributed processing, fault tolerance
on commodity hardware, scalability, in-memory computing, high-level
APIs, and other tasks.
• Data scientists:
▪ Analyze and model the data to obtain insight by using ad hoc analysis.
▪ Transform the data into a usable format.
▪ Use statistics, machine learning, and SQL.
• Data engineers:
▪ Develop a data processing system or application.
▪ Inspect and tune their applications.
▪ Program with the Apache Spark API.
• Everyone else:
▪ Ease of use.
▪ Wide variety of functions.
▪ Mature and reliable.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark unified stack

The unified stack layers several libraries on top of a common core:
• Apache Spark SQL
• Apache Spark Streaming (real-time processing)
• MLlib (machine learning)
• GraphX (graph processing)
These libraries run on Apache Spark Core, which in turn runs on one of the supported
cluster managers: the stand-alone scheduler, Apache Mesos, or YARN.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark jobs and shell
• Apache Spark jobs can be written in Scala, Python, or Java. APIs are
available for all three at the following website:
http://spark.apache.org/docs/latest
• Apache Spark shells are provided for Scala (spark-shell) and Python
(pyspark).
• The Apache Spark native language is Scala, so it is natural to write Apache
Spark applications by using Scala.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark Scala and Python shells
• Apache Spark shells provide simple ways to learn the APIs and provide
a set of powerful tools to analyze data interactively.
• Scala shell:
▪ Runs on the Java virtual machine (JVM), which provides a good way to use
existing Java libraries
▪ To launch the Scala shell, run ./bin/spark-shell.
▪ The prompt is scala>.
▪ To read in a text file, run val textFile =
sc.textFile("README.md")
• Python shell:
▪ To launch the Python shell, run ./bin/pyspark.
▪ The prompt is >>>.
▪ To read in a text file, run textFile = sc.textFile("README.md").
▪ Two more variations are IPython and the IPython Notebook.
• To quit either shell, press Ctrl-D (the EOF character).
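• For example, a brief Scala shell session might look like the following sketch (README.md is the file in the top-level Spark directory):

scala> val textFile = sc.textFile("README.md")
scala> textFile.count()                                   // number of lines in the file
scala> textFile.first()                                   // the first line
scala> textFile.filter(line => line.contains("Spark")).count()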

Introduction to Apache Spark © Copyright IBM Corporation 2021


Scala overview

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Brief overview of Scala
• Everything is an object:
▪ Primitive types such as numbers or Boolean.
▪ Functions.
• Numbers are objects:
▪ 1 + 2 * 3 / 4 🡺 (1).+(((2).*(3))./(4)).
▪ The +, *, and / characters are valid identifiers in Scala.
• Functions are objects:
▪ Pass functions as arguments.
▪ Store them in variables.
▪ Return them from other functions.
• Function declaration:
def functionName ([list of parameters]) : [return type]
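For example, a small sketch of a named function and of treating functions as values (illustrative names only):

// A named function: name, parameter list, return type, body
def addOne(x: Int): Int = x + 1

// Functions are objects: store one in a variable and pass it to another function
val twice = (x: Int) => x * 2
List(1, 2, 3).map(twice)     // List(2, 4, 6)
List(1, 2, 3).map(addOne)    // List(2, 3, 4)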

Introduction to Apache Spark © Copyright IBM Corporation 2021


Scala: Anonymous functions (Lambda functions)
• Lambda (=> syntax): Functions without a name are created for one-time use to
pass to another function.
• To the left of => go the function's arguments (an empty parameter list if there are none).
• To the right of => goes the body of the function (for example, a println statement).
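For example, a minimal sketch of an anonymous function with no arguments and one with a single argument:

// No arguments: the empty parameter list sits to the left of =>
val greet = () => println("Hello")
greet()

// One argument: the body to the right of => is evaluated for each element
List(1, 2, 3).foreach(n => println(n * n))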

Introduction to Apache Spark © Copyright IBM Corporation 2021


Computing WordCount by using Lambda functions
• The classic WordCount program can be written with anonymous
(Lambda) functions.
• Three functions are needed:
▪ Tokenize each line into words (with a space as the delimiter).
▪ Map to produce the <word, 1> key/value pair from each word that is read.
▪ Reduce to aggregate the counts for each word individually (reduceByKey).
• The results are written to HDFS.

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

• Lambda functions can be used with Scala, Python, and Java 8. This
example is written in Python.
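For comparison, a minimal sketch of the same WordCount written with Scala anonymous functions (assuming an existing SparkContext named sc):

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
counts.saveAsTextFile("hdfs://...")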
Introduction to Apache Spark © Copyright IBM Corporation 2021
Resilient Distributed Dataset

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Resilient Distributed Dataset
• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• RDDs are immutable.
• Three methods for creating an RDD:
▪ Parallelizing an existing collection
▪ Referencing a data set
▪ Transforming from an existing RDD
• Two types of RDD operations:
▪ Transformations
▪ Actions
• An RDD can use a data set from any storage that is supported by Hadoop,
such as HDFS and Amazon S3.
[Diagram: example RDD flow — Hadoop RDD → filtered RDD → mapped RDD → reduced RDD]

Introduction to Apache Spark © Copyright IBM Corporation 2021


Creating an RDD
To create an RDD, complete the following steps:
1. Start the Apache Spark shell (the spark-shell script must be on your PATH):
spark-shell
2. Create some data:
val data = 1 to 10000
3. Parallelize that data (creating the RDD):
val distData = sc.parallelize(data)
4. Perform more transformations or invoke an action on the
transformation:
distData.filter(…)
You can also create an RDD from an external data set:
val readmeFile = sc.textFile("Readme.md")

Introduction to Apache Spark © Copyright IBM Corporation 2021


RDD basic operations
• Loading a file:
val lines = sc.textFile("hdfs://data.txt")
• Applying a transformation:
val lineLengths = lines.map(s => s.length)
• Starting an action:
val totalLengths = lineLengths.reduce((a,b) => a + b)
• Viewing the DAG:
lineLengths.toDebugString

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (1 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver program and three worker nodes]

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (2 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each worker holding one data block (Block 1, Block 2, Block 3)]
The data is partitioned into different blocks.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (3 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each worker holding one data block]
The driver sends the code to be run on each block.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (4 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each worker holding one data block]
Read the HDFS block.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (5 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers; each worker now holds a cache]
Each worker processes its block and caches the data.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (6 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each with a cache]
Send the data back to the driver.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (7 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each with a cache]
Process from cache.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (8 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each with a cache]
Send the data back to the driver.

Introduction to Apache Spark © Copyright IBM Corporation 2021


RDD operations: Transformations
• Here are some of the transformations that are available. The full set can be found
on the Apache Spark website.
• Transformations are lazy evaluations.
• Each transformation returns a pointer to the transformed RDD.

Transformation: Meaning
map(func): Returns a new data set that is formed by passing each element of the source through the function func.
filter(func): Returns a new data set that is formed by selecting those elements of the source on which func returns true.
flatMap(func): Like map, but each input item can be mapped to 0 or more output items, so func should return a Seq rather than a single item.
join(otherDataset, [numTasks]): When called on data sets of type (K, V) and (K, W), returns a data set of (K, (V, W)) pairs with all pairs of elements for each key.
reduceByKey(func): When called on a data set of (K, V) pairs, returns a data set of (K, V) pairs where the values for each key are aggregated by using the reduce function func.
sortByKey([ascending], [numTasks]): When called on a data set of (K, V) pairs where K implements Ordered, returns a data set of (K, V) pairs that are sorted by keys in ascending or descending order.
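A brief sketch that chains several of these transformations in the Scala shell (the input path is a placeholder):

val lines = sc.textFile("hdfs://...")                  // base RDD
val words = lines.flatMap(line => line.split(" "))     // flatMap: one line -> many words
val longWords = words.filter(word => word.length > 3)  // filter: keep words longer than 3 characters
val counts = longWords.map(word => (word, 1))          // map: build (word, 1) pairs
  .reduceByKey((a, b) => a + b)                        // reduceByKey: sum the counts per word
  .sortByKey()                                         // sortByKey: order the results by word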

Introduction to Apache Spark © Copyright IBM Corporation 2021


RDD operations: Actions
Actions return values to the driver program.

Action: Meaning
collect(): Returns all the elements of the data set as an array at the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of the data.
count(): Returns the number of elements in the data set.
first(): Returns the first element of the data set.
take(n): Returns an array with the first n elements of the data set.
foreach(func): Runs the function func on each element of the data set.
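A small sketch of these actions in the Scala shell, using a parallelized range as the data set:

val nums = sc.parallelize(1 to 100)
nums.count()                                    // 100
nums.first()                                    // 1
nums.take(5)                                    // Array(1, 2, 3, 4, 5)
val evens = nums.filter(_ % 2 == 0).collect()   // materialize the filtered subset at the driver
evens.foreach(println)                          // print each element (here, locally at the driver)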

Introduction to Apache Spark © Copyright IBM Corporation 2021


RDD persistence
• Each node stores in memory any partitions of the data set that it computes.
• The node reuses them in other actions on that data set (or data sets derived from it).
Future actions are much faster (often by more than 10x).
• There are two methods for RDD persistence:
▪ persist()
▪ cache()

Storage level: Meaning
MEMORY_ONLY: The RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions are not cached and are recomputed as needed. This level is the default, and it is the level that the cache() method uses.
MEMORY_AND_DISK: Same as MEMORY_ONLY, except that partitions that do not fit in memory are stored on disk and read from disk when needed.
MEMORY_ONLY_SER: The RDD is stored as serialized Java objects (one byte array per partition). This is space-efficient, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: Like MEMORY_AND_DISK, but the RDD is stored as serialized objects.
DISK_ONLY: The RDD is stored only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, and so on: Same as the levels above, but each partition is replicated on two cluster nodes.
OFF_HEAP (experimental): Stores the RDD in serialized format in Tachyon.
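A minimal sketch of choosing storage levels (the file path is a placeholder):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))

lines.cache()                                   // equivalent to persist(StorageLevel.MEMORY_ONLY)
errors.persist(StorageLevel.MEMORY_AND_DISK)    // partitions that do not fit in memory spill to disk

errors.count()                                  // the first action computes and persists the RDD
errors.count()                                  // later actions reuse the persisted partitions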

Introduction to Apache Spark © Copyright IBM Corporation 2021


Best practices for which storage level to choose
• Use the default storage level (MEMORY_ONLY) when possible.
Otherwise, use MEMORY_ONLY_SER and a fast serialization library.
• Do not spill to disk unless the functions that computed your data sets
are expensive or they filter a large amount of the data (recomputing a
partition might be as fast as reading it from disk).
• Use the replicated storage levels if you want fast fault recovery (such as
using Apache Spark to serve requests from a web application).
• All the storage levels provide full fault tolerance by recomputing lost
data, but the replicated ones let you continue running tasks on the RDD
without waiting to recompute a lost partition.
• The experimental OFF_HEAP mode has several advantages:
▪ Allows multiple executors to share the pool of memory in Tachyon.
▪ Reduces garbage collection costs.
▪ Cached data is not lost if individual executors fail.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Shared variables and key-value pairs

• When a function is passed from the driver to a worker, each worker normally
receives a separate copy of the variables that the function uses ("pass by value").
• Two types of shared variables:
▪ Broadcast variables:
− Read-only copy on each machine.
− Distributed by using efficient broadcast algorithms.
▪ Accumulators:
− Variables that are added to through an associative operation.
− Used to implement counters and sums.
− Only the driver can read an accumulator's value.
− Numeric types are supported natively; support can be extended to new types.
• Key-value pairs:
Scala:  val pair = ('a', 'b')
        pair._1 // returns 'a'
        pair._2 // returns 'b'
Python: pair = ('a', 'b')
        pair[0] # returns 'a'
        pair[1] # returns 'b'
Java:   Tuple2 pair = new Tuple2('a', 'b');
        pair._1 // returns 'a'
        pair._2 // returns 'b'
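A short sketch of both kinds of shared variables in Scala, using the Spark 1.x API (the lookup table and record values are illustrative):

// Broadcast variable: read-only lookup data shipped once to each worker
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: workers add to it; only the driver reads its value
val badRecords = sc.accumulator(0)

val data = sc.parallelize(Seq("a", "b", "x"))
data.foreach { key =>
  if (!lookup.value.contains(key)) badRecords += 1
}
println(badRecords.value)   // read on the driver, after the action completes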

Introduction to Apache Spark © Copyright IBM Corporation 2021


Programming with key-value pairs
• There are special operations that are available on RDDs of key-value
pairs. You group or aggregate elements by using a key.
• Tuple2 objects are created by writing (a, b), and you must import
org.apache.spark.SparkContext._.
• PairRDDFunctions contains key-value pair operations:
reduceByKey((a, b) => a + b)
• Custom objects that are used as the key in a key-value pair require a custom
equals() method with a matching hashCode() method.
• Here is an example:

val textFile = sc.textFile("…")
val readmeCount = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Introduction to Apache Spark © Copyright IBM Corporation 2021


Programming with Apache
Spark

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Programming with Apache Spark
• You reviewed accessing Apache Spark with interactive shells:
▪ spark-shell (for Scala)
▪ pyspark (for Python)
• Next, you review programming with Apache Spark with the following
languages:
▪ Scala
▪ Python
▪ Java
• Compatible versions of software are needed:
▪ Apache Spark 1.6.3 uses Scala 2.10. To write applications in Scala, you must
use a compatible Scala version (for example, 2.10.X).
▪ Apache Spark 1.x works with Python 2.6 or higher (but not with Python 3).
▪ Apache Spark 1.x works with Java 6 and higher, and Java 8 supports
Lambda expressions.

Introduction to Apache Spark © Copyright IBM Corporation 2021


SparkContext
• The SparkContext is the main entry point for Apache Spark functions: It
represents the connection to an Apache Spark cluster.
• Use the SparkContext to create RDDs, accumulators, and broadcast
variables on that cluster.
• With the Apache Spark shell, the SparkContext (sc) is automatically
initialized for you to use.
• But in an Apache Spark program, you must add code to import some
classes and implicit conversions into your program:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

Introduction to Apache Spark © Copyright IBM Corporation 2021


Linking with Apache Spark: Scala
• Apache Spark applications require certain dependencies.
• Apache Spark needs a compatible Scala version to write applications.
For example, Apache Spark 1.6.3 uses Scala 2.10.
• To write an Apache Spark application, you must add a Maven
dependency to Apache Spark, which is available through
Maven Central:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.6.3
• To access an HDFS cluster, you must add a dependency on
hadoop-client for your version of HDFS:
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>

Introduction to Apache Spark © Copyright IBM Corporation 2021


Initializing Apache Spark: Scala
• Build a SparkConf object that contains information about your
application:
val conf = new SparkConf().setAppName(appName).setMaster(master)
• Use the appName parameter to set the name for your application that
appears on the cluster UI.
• The master parameter is an Apache Spark, Apache Mesos, or YARN
cluster URL (or a special "local" string to run in local mode):
▪ In testing, you can pass "local" to run Apache Spark.
▪ local[16] allocates 16 cores.
▪ In production mode, do not hardcode master in the program. Start with
spark-submit and use it there.
• Create the SparkContext object:
new SparkContext(conf)
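Putting these pieces together, a minimal sketch (the application name and the local[4] master are illustrative; in production the master comes from spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("My Spark application")   // name shown in the cluster UI
  .setMaster("local[4]")                // testing only; do not hardcode the master in production
val sc = new SparkContext(conf)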

Introduction to Apache Spark © Copyright IBM Corporation 2021


Linking with Apache Spark: Python
• Apache Spark 1.x works with Python 2.6 or higher.
• It uses the standard CPython interpreter, so C libraries like NumPy can
be used.
• To run Apache Spark applications in Python, use the
bin/spark-submit script in the Apache Spark directory:
▪ Load Apache Spark Java or Scala libraries.
▪ Then, you can submit applications to a cluster.
• If you want to access HDFS, you must use a build of PySpark linking to
your version of HDFS.
• Import Apache Spark classes:
from pyspark import SparkContext, SparkConf

Introduction to Apache Spark © Copyright IBM Corporation 2021


Initializing Apache Spark: Python
• Build a SparkConf object that contains information about your
application:
conf = SparkConf().setAppName(appName).setMaster(master)
• Use the appName parameter to set the name for your application that
appears on the cluster UI.
• The master parameter is an Apache Spark, Apache Mesos, or YARN
cluster URL (or a special "local" string to run in local mode):
▪ In production mode, do not hardcode master in the program. Start
spark-submit and use it there.
▪ In testing, you can pass "local" to run Apache Spark.
• Create the SparkContext object:
sc = SparkContext(conf=conf)

Introduction to Apache Spark © Copyright IBM Corporation 2021


Linking with Apache Spark: Java
• Apache Spark supports Java 8 Lambda expressions.
• Add a dependency to Apache Spark, which is available through Maven
Central:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.6.3
• If you want to access an HDFS cluster, you must add the dependency
too:
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
• Import some Apache Spark classes:
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.SparkConf

Introduction to Apache Spark © Copyright IBM Corporation 2021


Initializing Apache Spark: Java
• Build a SparkConf object that contains information about your
application:
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master)

• Use the appName parameter to set the name for your application that
appears on the cluster UI.
• The master parameter is an Apache Spark, Apache Mesos, or YARN
cluster URL (or a special "local" string to run in local mode):
▪ In production mode, do not hardcode master in the program. Start
spark-submit and use it there.
▪ In testing, you can pass "local" to run Apache Spark.
• Create the JavaSparkContext object:
JavaSparkContext sc = new JavaSparkContext(conf);

Introduction to Apache Spark © Copyright IBM Corporation 2021


Passing functions to Apache Spark
• The Apache Spark API relies heavily on passing functions in the driver program
to run on the cluster.
• Three methods:
▪ Anonymous function syntax:
(x: Int) => x + 1
▪ Static methods in a global singleton object:
object MyFunctions {
  def func1(s: String): String = {…}
}
myRdd.map(MyFunctions.func1)
▪ Passing a reference to a method or field of a class instance sends the entire object
to the cluster. To avoid that, copy the field to a local variable. For example, for a class
with val field = "Hello":
• Avoid:
def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
• Consider:
def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}
Introduction to Apache Spark © Copyright IBM Corporation 2021
Programming the business logic
• The Apache Spark API is
available in Scala, Java, R,
and Python.
• Create the RDD from an
external data set or from an
existing RDD.
• There are transformations
and actions to process the
data.
• Use RDD persistence to
improve performance.
• Use broadcast variables or
accumulators for specific use
cases.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Running Apache Spark examples
• Apache Spark samples are available in the examples directory of the
Apache Spark distribution, on the Apache Spark website, and on GitHub.
• Run the examples by running the following command:
./bin/run-example SparkPi

SparkPi is the name of the sample application.


• In Python, you can run any of the Python examples by running the
following command:
./bin/spark-submit examples/src/main/python/pi.py
pi.py is the Python example name.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Creating Apache Spark stand-alone applications: Scala

A complete stand-alone Scala application has three parts:
• Import statements
• SparkConf and SparkContext setup
• Transformations and actions
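A minimal sketch of such an application, following the official quick-start pattern (the file name README.md and the application name are placeholders):

/* SimpleApp.scala */
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    // SparkConf and SparkContext
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Transformations + actions
    val logData = sc.textFile("README.md").cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

    sc.stop()
  }
}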

Introduction to Apache Spark © Copyright IBM Corporation 2021


Running stand-alone applications
• Define the dependencies by using any system build mechanism
(Ant, SBT, Maven, or Gradle)
• Example:
• Scala: simple.sbt
• Java: pom.xml
• Python: --py-files argument (not needed for SimpleApp.py)

• Create a typical directory structure with the files:


Scala using SBT : Java using Maven:
./simple.sbt ./pom.xml
./src ./src
./src/main ./src/main
./src/main/scala ./src/main/java
./src/main/scala/SimpleApp.scala ./src/main/java/SimpleApp.java

• Create a JAR package that contains the application's code.


• Use spark-submit to run the program.
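For example, a minimal simple.sbt might look like the following sketch (the Scala and Spark versions follow the 1.6.3 / 2.10 pairing used earlier in this unit and are only illustrative):

// simple.sbt
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3"

After sbt package builds the JAR, run it with, for example:
./bin/spark-submit --class SimpleApp --master local[4] <path-to-jar>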

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark libraries

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark libraries
• Extension of the core Apache Spark API.
• Improvements that are made to the core are passed to these libraries.
• There is little overhead to using them with the Apache Spark Core.


Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark SQL
• Allows relational queries to be expressed in the following languages:
▪ SQL
▪ HiveQL
▪ Scala
• SchemaRDD:
▪ Row objects
▪ Schema
▪ Created from:
− Existing RDD
− Parquet file
− JSON data set
− HiveQL against Apache Hive
• Supports Scala, Java, R, and Python.
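In Spark 1.3 and later, the SchemaRDD evolved into the DataFrame. A brief sketch using the Spark 1.x SQLContext from Scala (the JSON file name is a placeholder):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")   // infer the schema from a JSON data set
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)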

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark Streaming
• Scalable, high-throughput, and fault-tolerant stream processing of live data streams.
• Receives live input data and divides it into small batches that are processed and
returned as batches.
• DStreams: a sequence of RDDs.
• Supports Scala, Java, and Python.
• Receives data from: Apache Kafka, Flume, HDFS and S3, Kinesis, and Twitter.
• Pushes data out to: HDFS, databases, and dashboards.
[Diagram: sources (Apache Kafka, Flume, HDFS/S3, Kinesis, Twitter) → Apache Spark Streaming → sinks (HDFS, databases, dashboards)]
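A minimal DStream sketch in Scala using the Spark 1.x streaming API (the application name, host, port, and 10-second batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(10))       // 10-second batches

val lines = ssc.socketTextStream("localhost", 9999)     // a DStream of text lines
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()                                          // emit each batch's result

ssc.start()
ssc.awaitTermination()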

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark Streaming: Internals
• The input stream (DStream) goes into Apache Spark Streaming.
• The data is broken up into batches that are fed into the Apache
Spark engine for processing.
• The results are generated as a stream of batches.
[Diagram (spark.apache.org): input data stream → Apache Spark Streaming → batches of input data → Apache Spark engine → batches of processed data]
• Sliding window operations:
▪ Windowed computations:
− Window length
− Sliding interval
− reduceByKeyAndWindow
[Diagram (spark.apache.org): original DStream and windowed DStream; windows at time 1, time 3, and time 5 feed a window-based operation]
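A brief sketch of a windowed computation, reusing the lines DStream and imports from the previous sketch (the 30-second window and 10-second slide are illustrative and must be multiples of the batch interval):

// Count words over the last 30 seconds of data, recomputed every 10 seconds
val windowedCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()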

Introduction to Apache Spark © Copyright IBM Corporation 2021


GraphX
• GraphX for graph processing:
▪ Graphs and graph parallel computation
▪ Social networks and language modeling
• The goal of GraphX is to optimize graph processing by making it easier to
view data both as a graph and as collections, such as RDDs, without
data movement or duplication.

https://spark.apache.org/docs/latest/graphx-programming-guide.html
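A small sketch of viewing the same data as a graph and as collections (the edge-list file name is a placeholder):

import org.apache.spark.graphx.GraphLoader

// Each line of the edge-list file holds a source and a destination vertex ID
val graph = GraphLoader.edgeListFile(sc, "followers.txt")
println(graph.vertices.count())                 // the same data viewed as a vertex collection (an RDD)
println(graph.edges.count())                    // and as an edge collection
val ranks = graph.pageRank(0.0001).vertices     // a graph-parallel computation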

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark cluster and
monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark cluster overview
• Components:
▪ Driver program (contains the SparkContext)
▪ Cluster manager
▪ Executors (run tasks and hold a cache on the worker nodes)
[Diagram: the driver program's SparkContext talks to the cluster manager, which allocates executors on worker nodes; each executor holds a cache and runs tasks]
• Three supported cluster managers:
▪ Stand-alone
▪ Apache Mesos
▪ Hadoop YARN
Introduction to Apache Spark © Copyright IBM Corporation 2021
Apache Spark monitoring
There are three ways to monitor Apache Spark applications:
• Web UI:
▪ Port 4040 (lab exercise on port 8088)
▪ Available only while the application is running
• Metrics:
▪ Based on the Coda Hale Metrics Library
▪ Report to various sinks (HTTP, JMX, and CSV)
▪ /conf/metrics.properties
• External instruments:
▪ Cluster-wide monitoring tool (Ganglia)
▪ OS profiling tools (dstat, iostat, and iotop)
▪ JVM utilities (jstack, jmap, jstat, and jconsole)

Introduction to Apache Spark © Copyright IBM Corporation 2021


Unit summary
• Explained the nature and purpose of Apache Spark in the Hadoop
infrastructure.
• Described the architecture and listed the components of the Apache
Spark unified stack.
• Described the role of a Resilient Distributed Dataset (RDD).
• Explained the principles of Apache Spark programming.
• Listed and described the Apache Spark libraries.
• Started and used Apache Spark Scala and Python shells.
• Described Apache Spark Streaming, Apache Spark SQL, MLlib, and
GraphX.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Review questions
1. True or False: Ease of use is one of the benefits of Apache
Spark.
2. Which language is supported by Apache Spark?
A. C++
B. C#
C. Java
D. Node.js
3. True or False: Scala is the primary abstraction of Apache
Spark.
4. In RDD actions, which function returns all the elements of
the data set as an array at the driver program?
A. Collect
B. Take
C. Count
D. Reduce
5. True or False: Referencing a data set is one of the methods
to create an RDD.
Introduction to Apache Spark © Copyright IBM Corporation 2021
Review answers
1. True or False: Ease of use is one of the benefits of using
Apache Spark. (True)
2. Which language is supported by Apache Spark?
A. C++
B. C#
C. Java (correct)
D. Node.js
3. True or False: Scala is the primary abstraction of Apache
Spark. (False. The RDD is the primary abstraction; Scala is the native language.)
4. In RDD actions, which function returns all the elements of
the data set as an array at the driver program?
A. Collect (correct)
B. Take
C. Count
D. Reduce
5. True or False: Referencing a data set is one of the methods
to create an RDD. (True)
Introduction to Apache Spark © Copyright IBM Corporation 2021
Exercise: Running Apache
Spark applications in Python

Introduction to Apache Spark © Copyright IBM Corporation 2021


Exercise objectives
• In this exercise, you explore some Spark 2 client program
examples and learn how to run them. You gain experience
with the fundamental aspects of running Spark in the HDP
environment.
• After completing this exercise, you should be able to do the
following tasks:
▪ Browse files and folders in HDFS.
▪ Work with Apache Spark RDD with Python.

Introduction to Apache Spark © Copyright IBM Corporation 2021
