
Introduction to Apache Spark

© Copyright IBM Corporation 2021


Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Unit objectives
• Explain the nature and purpose of Apache Spark in the Hadoop
infrastructure.
• Describe the architecture and list the components of the Apache Spark
unified stack.
• Describe the role of a Resilient Distributed Dataset (RDD).
• Explain the principles of Apache Spark programming.
• List and describe the Apache Spark libraries.
• Start and use Apache Spark Scala and Python shells.
• Describe Apache Spark Streaming, Apache Spark SQL, MLlib, and
GraphX.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark overview

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Big data and Apache Spark
• Faster results from analytics are increasingly important.
• Apache Spark is a computing platform that is fast, general-purpose, and
easy to use.

Speed:
• In-memory computations.
• Faster than MapReduce for complex applications on disk.
Generality:
• Covers a wide range of workloads on one system.
• Batch applications (for example, MapReduce).
• Iterative algorithms.
• Interactive queries and streaming.
Ease of use:
• APIs for Scala, Python, Java, and R.
• Libraries for SQL, machine learning, streaming, and graph processing.
• Runs on Hadoop clusters or as a stand-alone product.
• Includes the popular MapReduce model.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Ease of use
• To implement the classic WordCount in Java MapReduce, you need
three classes: the main class that sets up the job, a mapper, and a
reducer, each about 10 lines long.
• Here is the same WordCount program that is written in Scala for
Apache Spark:

val conf = new SparkConf().setAppName("Spark wordcount")


val sc = new SparkContext(conf)
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1)).countByKey()
counts.saveAsTextFile("hdfs://...")

• Because it is written in Scala, Apache Spark can take advantage of the versatility,
flexibility, and functional programming concepts of Scala.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Who uses Apache Spark and why
• Apache Spark is used for parallel distributed processing, fault tolerance
on commodity hardware, scalability, in-memory computing, high-level
APIs, and other tasks.
• Data scientists:
▪ Analyze and model the data to obtain insight by using ad hoc analysis.
▪ Transform the data into a usable format.
▪ Use statistics, machine learning, and SQL.
• Data engineers:
▪ Develop a data processing system or application.
▪ Inspect and tune their applications.
▪ Program with the Apache Spark API.
• Everyone else:
▪ Ease of use.
▪ Wide variety of functions.
▪ Mature and reliable.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark unified stack

The unified stack layers several libraries on top of a common core:
• Apache Spark SQL
• Apache Spark Streaming (real-time processing)
• MLlib (machine learning)
• GraphX (graph processing)
These libraries run on Apache Spark Core, which in turn runs on one of the supported
cluster managers: the stand-alone scheduler, Apache Mesos, or YARN.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark jobs and shell
• Apache Spark jobs can be written in Scala, Python, or Java. APIs are
available for all three at the following website:
http://spark.apache.org/docs/latest
• Apache Spark shells are provided for Scala (spark-shell) and Python
(pyspark).
• The Apache Spark native language is Scala, so it is natural to write Apache
Spark applications by using Scala.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark Scala and Python shells
• Apache Spark shells provide simple ways to learn the APIs and provide
a set of powerful tools to analyze data interactively.
• Scala shell:
▪ Runs on the Java virtual machine (JVM), which provides a good way to use
existing Java libraries
▪ To launch the Scala shell, run ./bin/spark-shell.
▪ The prompt is scala>.
▪ To read in a text file, run val textFile =
sc.textFile("README.md")
• Python shell:
▪ To launch the Python shell, run ./bin/pyspark.
▪ The prompt is >>>.
▪ To read in a text file, run textFile = sc.textFile("README.md").
▪ Two more variations are IPython and the IPython Notebook.
• To quit either shell, press Ctrl-D (the EOF character).
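• For example, a brief Scala shell session might look like the following sketch (README.md is the file in the top-level Spark directory):

scala> val textFile = sc.textFile("README.md")
scala> textFile.count()                                   // number of lines in the file
scala> textFile.first()                                   // the first line
scala> textFile.filter(line => line.contains("Spark")).count()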

Introduction to Apache Spark © Copyright IBM Corporation 2021


Scala overview

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Brief overview of Scala
• Everything is an object:
▪ Primitive types such as numbers or Boolean.
▪ Functions.
• Numbers are objects:
▪ 1 + 2 * 3 / 4 🡺 (1).+(((2).*(3))./(4)).
▪ The +, *, and / characters are valid identifiers in Scala.
• Functions are objects:
▪ Pass functions as arguments.
▪ Store them in variables.
▪ Return them from other functions.
• Function declaration:
def functionName ([list of parameters]) : [return type]
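For example, a small sketch of a named function and of treating functions as values (illustrative names only):

// A named function: name, parameter list, return type, body
def addOne(x: Int): Int = x + 1

// Functions are objects: store one in a variable and pass it to another function
val twice = (x: Int) => x * 2
List(1, 2, 3).map(twice)     // List(2, 4, 6)
List(1, 2, 3).map(addOne)    // List(2, 3, 4)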

Introduction to Apache Spark © Copyright IBM Corporation 2021


Scala: Anonymous functions (Lambda functions)
• Lambda (=> syntax): Functions without a name are created for one-time use to
pass to another function.
• To the left of => go the function's arguments (an empty parameter list if there are none).
• To the right of => goes the body of the function (for example, a println statement).
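For example, a minimal sketch of an anonymous function with no arguments and one with a single argument:

// No arguments: the empty parameter list sits to the left of =>
val greet = () => println("Hello")
greet()

// One argument: the body to the right of => is evaluated for each element
List(1, 2, 3).foreach(n => println(n * n))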

Introduction to Apache Spark © Copyright IBM Corporation 2021


Computing WordCount by using Lambda functions
• The classic WordCount program can be written with anonymous
(Lambda) functions.
• Three functions are needed:
▪ Tokenize each line into words (with a space as the delimiter).
▪ Map to produce the <word, 1> key/value pair from each word that is read.
▪ Reduce to aggregate the counts for each word individually (reduceByKey).
• The results are written to HDFS.

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

• Lambda functions can be used with Scala, Python, and Java 8. This
example is written in Python.
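For comparison, a minimal sketch of the same WordCount written with Scala anonymous functions (assuming an existing SparkContext named sc):

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
counts.saveAsTextFile("hdfs://...")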
Introduction to Apache Spark © Copyright IBM Corporation 2021
Resilient Distributed Dataset

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Resilient Distributed Dataset
• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• RDDs are immutable.
• Three methods for creating an RDD:
▪ Parallelizing an existing collection
▪ Referencing a data set
▪ Transforming from an existing RDD
• Two types of RDD operations:
▪ Transformations
▪ Actions
• An RDD can use a data set from any storage that is supported by Hadoop,
such as HDFS and Amazon S3.
[Diagram: example RDD flow — Hadoop RDD → filtered RDD → mapped RDD → reduced RDD]

Introduction to Apache Spark © Copyright IBM Corporation 2021


Creating an RDD
To create an RDD, complete the following steps:
1. Start the Apache Spark shell (the spark-shell script must be on your PATH):
spark-shell
2. Create some data:
val data = 1 to 10000
3. Parallelize that data (creating the RDD):
val distData = sc.parallelize(data)
4. Perform more transformations or invoke an action on the
transformation:
distData.filter(…)
You can also create an RDD from an external data set:
val readmeFile = sc.textFile("Readme.md")

Introduction to Apache Spark © Copyright IBM Corporation 2021


RDD basic operations
• Loading a file:
val lines = sc.textFile("hdfs://data.txt")
• Applying a transformation:
val lineLengths = lines.map(s => s.length)
• Starting an action:
val totalLengths = lineLengths.reduce((a,b) => a + b)
• Viewing the DAG:
lineLengths.toDebugString

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (1 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver program and three worker nodes]

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (2 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each worker holding one data block (Block 1, Block 2, Block 3)]
The data is partitioned into different blocks.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (3 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each worker holding one data block]
The driver sends the code to be run on each block.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (4 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each worker holding one data block]
Read the HDFS block.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (5 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers; each worker now holds a cache]
Each worker processes its block and caches the data.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (6 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each with a cache]
Send the data back to the driver.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (7 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each with a cache]
Process from cache.

Introduction to Apache Spark © Copyright IBM Corporation 2021


What happens when an action is run (8 of 8)
// Creating the RDD
val logFile = sc.textFile("hdfs://…")
// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
// Caching
messages.cache()
// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: the driver and three workers, each with a cache]
Send the data back to the driver.

Introduction to Apache Spark © Copyright IBM Corporation 2021


RDD operations: Transformations
• Here are some of the transformations that are available. The full set can be found
on the Apache Spark website.
• Transformations are lazy evaluations.
• Each transformation returns a pointer to the transformed RDD.

Transformation: Meaning
map(func): Returns a new data set that is formed by passing each element of the source through the function func.
filter(func): Returns a new data set that is formed by selecting those elements of the source on which func returns true.
flatMap(func): Like map, but each input item can be mapped to 0 or more output items, so func should return a Seq rather than a single item.
join(otherDataset, [numTasks]): When called on data sets of type (K, V) and (K, W), returns a data set of (K, (V, W)) pairs with all pairs of elements for each key.
reduceByKey(func): When called on a data set of (K, V) pairs, returns a data set of (K, V) pairs where the values for each key are aggregated by using the reduce function func.
sortByKey([ascending], [numTasks]): When called on a data set of (K, V) pairs where K implements Ordered, returns a data set of (K, V) pairs that are sorted by keys in ascending or descending order.
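A brief sketch that chains several of these transformations in the Scala shell (the input path is a placeholder):

val lines = sc.textFile("hdfs://...")                  // base RDD
val words = lines.flatMap(line => line.split(" "))     // flatMap: one line -> many words
val longWords = words.filter(word => word.length > 3)  // filter: keep words longer than 3 characters
val counts = longWords.map(word => (word, 1))          // map: build (word, 1) pairs
  .reduceByKey((a, b) => a + b)                        // reduceByKey: sum the counts per word
  .sortByKey()                                         // sortByKey: order the results by word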

Introduction to Apache Spark © Copyright IBM Corporation 2021


RDD operations: Actions
Actions return values to the driver program.

Action: Meaning
collect(): Returns all the elements of the data set as an array at the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of the data.
count(): Returns the number of elements in the data set.
first(): Returns the first element of the data set.
take(n): Returns an array with the first n elements of the data set.
foreach(func): Runs the function func on each element of the data set.
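A small sketch of these actions in the Scala shell, using a parallelized range as the data set:

val nums = sc.parallelize(1 to 100)
nums.count()                                    // 100
nums.first()                                    // 1
nums.take(5)                                    // Array(1, 2, 3, 4, 5)
val evens = nums.filter(_ % 2 == 0).collect()   // materialize the filtered subset at the driver
evens.foreach(println)                          // print each element (here, locally at the driver)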

Introduction to Apache Spark © Copyright IBM Corporation 2021


RDD persistence
• Each node stores in memory any partitions of the data set that it computes.
• The node reuses them in other actions on that data set (or data sets derived from it).
Future actions are much faster (often by more than 10x).
• There are two methods for RDD persistence:
▪ persist()
▪ cache()

Storage level: Meaning
MEMORY_ONLY: The RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions are not cached and are recomputed as needed. This level is the default, and it is the level that the cache() method uses.
MEMORY_AND_DISK: Same as MEMORY_ONLY, except that partitions that do not fit in memory are stored on disk and read from disk when needed.
MEMORY_ONLY_SER: The RDD is stored as serialized Java objects (one byte array per partition). This is space-efficient, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: Like MEMORY_AND_DISK, but the RDD is stored as serialized objects.
DISK_ONLY: The RDD is stored only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, and so on: Same as the levels above, but each partition is replicated on two cluster nodes.
OFF_HEAP (experimental): Stores the RDD in serialized format in Tachyon.
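A minimal sketch of choosing storage levels (the file path is a placeholder):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))

lines.cache()                                   // equivalent to persist(StorageLevel.MEMORY_ONLY)
errors.persist(StorageLevel.MEMORY_AND_DISK)    // partitions that do not fit in memory spill to disk

errors.count()                                  // the first action computes and persists the RDD
errors.count()                                  // later actions reuse the persisted partitions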

Introduction to Apache Spark © Copyright IBM Corporation 2021


Best practices for which storage level to choose
• Use the default storage level (MEMORY_ONLY) when possible.
Otherwise, use MEMORY_ONLY_SER and a fast serialization library.
• Do not spill to disk unless the functions that computed your data sets
are expensive or they filter a large amount of the data (recomputing a
partition might be as fast as reading it from disk).
• Use the replicated storage levels if you want fast fault recovery (such as
using Apache Spark to serve requests from a web application).
• All the storage levels provide full fault tolerance by recomputing lost
data, but the replicated ones let you continue running tasks on the RDD
without waiting to recompute a lost partition.
• The experimental OFF_HEAP mode has several advantages:
▪ Allows multiple executors to share the pool of memory in Tachyon.
▪ Reduces garbage collection costs.
▪ Cached data is not lost if individual executors fail.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Shared variables and key-value pairs

• When a function is passed from the driver to a worker, each worker normally
receives a separate copy of the variables that the function uses ("pass by value").
• Two types of shared variables:
▪ Broadcast variables:
− Read-only copy on each machine.
− Distributed by using efficient broadcast algorithms.
▪ Accumulators:
− Variables that are added to through an associative operation.
− Used to implement counters and sums.
− Only the driver can read an accumulator's value.
− Numeric types are supported natively; support can be extended to new types.
• Key-value pairs:
Scala:  val pair = ('a', 'b')
        pair._1 // returns 'a'
        pair._2 // returns 'b'
Python: pair = ('a', 'b')
        pair[0] # returns 'a'
        pair[1] # returns 'b'
Java:   Tuple2 pair = new Tuple2('a', 'b');
        pair._1 // returns 'a'
        pair._2 // returns 'b'
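A short sketch of both kinds of shared variables in Scala, using the Spark 1.x API (the lookup table and record values are illustrative):

// Broadcast variable: read-only lookup data shipped once to each worker
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: workers add to it; only the driver reads its value
val badRecords = sc.accumulator(0)

val data = sc.parallelize(Seq("a", "b", "x"))
data.foreach { key =>
  if (!lookup.value.contains(key)) badRecords += 1
}
println(badRecords.value)   // read on the driver, after the action completes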

Introduction to Apache Spark © Copyright IBM Corporation 2021


Programming with key-value pairs
• There are special operations that are available on RDDs of key-value
pairs. You group or aggregate elements by using a key.
• Tuple2 objects are created by writing (a, b), and you must import
org.apache.spark.SparkContext._.
• PairRDDFunctions contains key-value pair operations:
reduceByKey((a, b) => a + b)
• Custom objects that are used as the key in a key-value pair require a custom
equals() method with a matching hashCode() method.
• Here is an example:

val textFile = sc.textFile("…")
val readmeCount = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Introduction to Apache Spark © Copyright IBM Corporation 2021


Programming with Apache
Spark

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Programming with Apache Spark
• You reviewed accessing Apache Spark with interactive shells:
▪ spark-shell (for Scala)
▪ pyspark (for Python)
• Next, you review programming with Apache Spark with the following
languages:
▪ Scala
▪ Python
▪ Java
• Compatible versions of software are needed:
▪ Apache Spark 1.6.3 uses Scala 2.10. To write applications in Scala, you must
use a compatible Scala version (for example, 2.10.X).
▪ Apache Spark 1.x works with Python 2.6 or higher (but not with Python 3).
▪ Apache Spark 1.x works with Java 6 and higher, and Java 8 supports
Lambda expressions.

Introduction to Apache Spark © Copyright IBM Corporation 2021


SparkContext
• The SparkContext is the main entry point for Apache Spark functions: It
represents the connection to an Apache Spark cluster.
• Use the SparkContext to create RDDs, accumulators, and broadcast
variables on that cluster.
• With the Apache Spark shell, the SparkContext (sc) is automatically
initialized for you to use.
• But in an Apache Spark program, you must add code to import some
classes and implicit conversions into your program:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

Introduction to Apache Spark © Copyright IBM Corporation 2021


Linking with Apache Spark: Scala
• Apache Spark applications require certain dependencies.
• Apache Spark needs a compatible Scala version to write applications.
For example, Apache Spark 1.6.3 uses Scala 2.10.
• To write an Apache Spark application, you must add a Maven
dependency to Apache Spark, which is available through
Maven Central:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.6.3
• To access an HDFS cluster, you must add a dependency on
hadoop-client for your version of HDFS:
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>

Introduction to Apache Spark © Copyright IBM Corporation 2021


Initializing Apache Spark: Scala
• Build a SparkConf object that contains information about your
application:
val conf = new SparkConf().setAppName(appName).setMaster(master)
• Use the appName parameter to set the name for your application that
appears on the cluster UI.
• The master parameter is an Apache Spark, Apache Mesos, or YARN
cluster URL (or a special "local" string to run in local mode):
▪ In testing, you can pass "local" to run Apache Spark.
▪ local[16] allocates 16 cores.
▪ In production mode, do not hardcode master in the program. Start with
spark-submit and use it there.
• Create the SparkContext object:
new SparkContext(conf)
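Putting these pieces together, a minimal sketch (the application name and the local[4] master are illustrative; in production the master comes from spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("My Spark application")   // name shown in the cluster UI
  .setMaster("local[4]")                // testing only; do not hardcode the master in production
val sc = new SparkContext(conf)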

Introduction to Apache Spark © Copyright IBM Corporation 2021


Linking with Apache Spark: Python
• Apache Spark 1.x works with Python 2.6 or higher.
• It uses the standard CPython interpreter, so C libraries like NumPy can
be used.
• To run Apache Spark applications in Python, use the
bin/spark-submit script in the Apache Spark directory:
▪ Load Apache Spark Java or Scala libraries.
▪ Then, you can submit applications to a cluster.
• If you want to access HDFS, you must use a build of PySpark linking to
your version of HDFS.
• Import Apache Spark classes:
from pyspark import SparkContext, SparkConf

Introduction to Apache Spark © Copyright IBM Corporation 2021


Initializing Apache Spark: Python
• Build a SparkConf object that contains information about your
application:
conf = SparkConf().setAppName(appName).setMaster(master)
• Use the appName parameter to set the name for your application that
appears on the cluster UI.
• The master parameter is an Apache Spark, Apache Mesos, or YARN
cluster URL (or a special "local" string to run in local mode):
▪ In production mode, do not hardcode master in the program. Start
spark-submit and use it there.
▪ In testing, you can pass "local" to run Apache Spark.
• Create the SparkContext object:
sc = SparkContext(conf=conf)

Introduction to Apache Spark © Copyright IBM Corporation 2021


Linking with Apache Spark: Java
• Apache Spark supports Java 8 Lambda expressions.
• Add a dependency to Apache Spark, which is available through Maven
Central:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.6.3
• If you want to access an HDFS cluster, you must add the dependency
too:
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
• Import some Apache Spark classes:
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.SparkConf

Introduction to Apache Spark © Copyright IBM Corporation 2021


Initializing Apache Spark: Java
• Build a SparkConf object that contains information about your
application:
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master)

• Use the appName parameter to set the name for your application that
appears on the cluster UI.
• The master parameter is an Apache Spark, Apache Mesos, or YARN
cluster URL (or a special "local" string to run in local mode):
▪ In production mode, do not hardcode master in the program. Start
spark-submit and use it there.
▪ In testing, you can pass "local" to run Apache Spark.
• Create the JavaSparkContext object:
JavaSparkContext sc = new JavaSparkContext(conf);

Introduction to Apache Spark © Copyright IBM Corporation 2021


Passing functions to Apache Spark
• The Apache Spark API relies heavily on passing functions in the driver program
to run on the cluster.
• Three methods:
▪ Anonymous function syntax:
(x: Int) => x + 1
▪ Static methods in a global singleton object:
object MyFunctions {
  def func1(s: String): String = {…}
}
myRdd.map(MyFunctions.func1)
▪ Passing a reference to a method or field of a class instance sends the entire object
to the cluster. To avoid that, copy the field to a local variable. For example, for a class
with val field = "Hello":
• Avoid:
def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
• Consider:
def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}
Introduction to Apache Spark © Copyright IBM Corporation 2021
Programming the business logic
• The Apache Spark API is
available in Scala, Java, R,
and Python.
• Create the RDD from an
external data set or from an
existing RDD.
• There are transformations
and actions to process the
data.
• Use RDD persistence to
improve performance.
• Use broadcast variables or
accumulators for specific use
cases.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Running Apache Spark examples
• Apache Spark samples are available in the examples directory of the
Apache Spark distribution, on the Apache Spark website, and on GitHub.
• Run the examples by running the following command:
./bin/run-example SparkPi

SparkPi is the name of the sample application.


• In Python, you can run any of the Python examples by running the
following command:
./bin/spark-submit examples/src/main/python/pi.py
pi.py is the Python example name.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Creating Apache Spark stand-alone applications: Scala

A complete stand-alone Scala application has three parts:
• Import statements
• SparkConf and SparkContext setup
• Transformations and actions
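A minimal sketch of such an application, following the official quick-start pattern (the file name README.md and the application name are placeholders):

/* SimpleApp.scala */
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    // SparkConf and SparkContext
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Transformations + actions
    val logData = sc.textFile("README.md").cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

    sc.stop()
  }
}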

Introduction to Apache Spark © Copyright IBM Corporation 2021


Running stand-alone applications
• Define the dependencies by using any system build mechanism
(Ant, SBT, Maven, or Gradle)
• Example:
• Scala: simple.sbt
• Java: pom.xml
• Python: --py-files argument (not needed for SimpleApp.py)

• Create a typical directory structure with the files:


Scala using SBT : Java using Maven:
./simple.sbt ./pom.xml
./src ./src
./src/main ./src/main
./src/main/scala ./src/main/java
./src/main/scala/SimpleApp.scala ./src/main/java/SimpleApp.java

• Create a JAR package that contains the application's code.


• Use spark-submit to run the program.
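For example, a minimal simple.sbt might look like the following sketch (the Scala and Spark versions follow the 1.6.3 / 2.10 pairing used earlier in this unit and are only illustrative):

// simple.sbt
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3"

After sbt package builds the JAR, run it with, for example:
./bin/spark-submit --class SimpleApp --master local[4] <path-to-jar>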

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark libraries

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark libraries
• Extension of the core Apache Spark API.
• Improvements that are made to the core are passed to these libraries.
• There is little overhead to using them with the Apache Spark Core.


Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark SQL
• Allows relational queries to be expressed in the following languages:
▪ SQL
▪ HiveQL
▪ Scala
• SchemaRDD:
▪ Row objects
▪ Schema
▪ Created from:
− Existing RDD
− Parquet file
− JSON data set
− HiveQL against Apache Hive
• Supports Scala, Java, R, and Python.
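In Spark 1.3 and later, the SchemaRDD evolved into the DataFrame. A brief sketch using the Spark 1.x SQLContext from Scala (the JSON file name is a placeholder):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")   // infer the schema from a JSON data set
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)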

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark Streaming
• Scalable, high-throughput, and fault-tolerant stream processing of live data streams.
• Receives live input data and divides it into small batches that are processed and
returned as batches.
• DStreams: a sequence of RDDs.
• Supports Scala, Java, and Python.
• Receives data from: Apache Kafka, Flume, HDFS and S3, Kinesis, and Twitter.
• Pushes data out to: HDFS, databases, and dashboards.
[Diagram: sources (Apache Kafka, Flume, HDFS/S3, Kinesis, Twitter) → Apache Spark Streaming → sinks (HDFS, databases, dashboards)]
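A minimal DStream sketch in Scala using the Spark 1.x streaming API (the application name, host, port, and 10-second batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(10))       // 10-second batches

val lines = ssc.socketTextStream("localhost", 9999)     // a DStream of text lines
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()                                          // emit each batch's result

ssc.start()
ssc.awaitTermination()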

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark Streaming: Internals
• The input stream (DStream) goes into Apache Spark Streaming.
• The data is broken up into batches that are fed into the Apache
Spark engine for processing.
• The results are generated as a stream of batches.
[Diagram (spark.apache.org): input data stream → Apache Spark Streaming → batches of input data → Apache Spark engine → batches of processed data]
• Sliding window operations:
▪ Windowed computations:
− Window length
− Sliding interval
− reduceByKeyAndWindow
[Diagram (spark.apache.org): original DStream and windowed DStream; windows at time 1, time 3, and time 5 feed a window-based operation]
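A brief sketch of a windowed computation, reusing the lines DStream and imports from the previous sketch (the 30-second window and 10-second slide are illustrative and must be multiples of the batch interval):

// Count words over the last 30 seconds of data, recomputed every 10 seconds
val windowedCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()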

Introduction to Apache Spark © Copyright IBM Corporation 2021


GraphX
• GraphX for graph processing:
▪ Graphs and graph parallel computation
▪ Social networks and language modeling
• The goal of GraphX is to optimize graph processing by making it easier to
view data both as a graph and as collections, such as RDDs, without
data movement or duplication.

https://spark.apache.org/docs/latest/graphx-programming-guide.html
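A small sketch of viewing the same data as a graph and as collections (the edge-list file name is a placeholder):

import org.apache.spark.graphx.GraphLoader

// Each line of the edge-list file holds a source and a destination vertex ID
val graph = GraphLoader.edgeListFile(sc, "followers.txt")
println(graph.vertices.count())                 // the same data viewed as a vertex collection (an RDD)
println(graph.edges.count())                    // and as an edge collection
val ranks = graph.pageRank(0.0001).vertices     // a graph-parallel computation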

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark cluster and
monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring

Introduction to Apache Spark © Copyright IBM Corporation 2021


Apache Spark cluster overview
• Components:
▪ Driver program (contains the SparkContext)
▪ Cluster manager
▪ Executors (run tasks and hold a cache on the worker nodes)
[Diagram: the driver program's SparkContext talks to the cluster manager, which allocates executors on worker nodes; each executor holds a cache and runs tasks]
• Three supported cluster managers:
▪ Stand-alone
▪ Apache Mesos
▪ Hadoop YARN
Introduction to Apache Spark © Copyright IBM Corporation 2021
Apache Spark monitoring
There are three ways to monitor Apache Spark applications:
• Web UI:
▪ Port 4040 (lab exercise on port 8088)
▪ Available only while the application is running
• Metrics:
▪ Based on the Coda Hale Metrics Library
▪ Report to various sinks (HTTP, JMX, and CSV)
▪ /conf/metrics.properties
• External instruments:
▪ Cluster-wide monitoring tool (Ganglia)
▪ OS profiling tools (dstat, iostat, and iotop)
▪ JVM utilities (jstack, jmap, jstat, and jconsole)

Introduction to Apache Spark © Copyright IBM Corporation 2021


Unit summary
• Explained the nature and purpose of Apache Spark in the Hadoop
infrastructure.
• Described the architecture and listed the components of the Apache
Spark unified stack.
• Described the role of a Resilient Distributed Dataset (RDD).
• Explained the principles of Apache Spark programming.
• Listed and described the Apache Spark libraries.
• Started and used Apache Spark Scala and Python shells.
• Described Apache Spark Streaming, Apache Spark SQL, MLlib, and
GraphX.

Introduction to Apache Spark © Copyright IBM Corporation 2021


Review questions
1. True or False: Ease of use is one of the benefits of Apache
Spark.
2. Which language is supported by Apache Spark?
A. C++
B. C#
C. Java
D. Node.js
3. True or False: Scala is the primary abstraction of Apache
Spark.
4. In RDD actions, which function returns all the elements of
the data set as an array at the driver program?
A. Collect
B. Take
C. Count
D. Reduce
5. True or False: Referencing a data set is one of the methods
to create an RDD.
Introduction to Apache Spark © Copyright IBM Corporation 2021
Review answers
1. True or False: Ease of use is one of the benefits of using
Apache Spark. (True)
2. Which language is supported by Apache Spark?
A. C++
B. C#
C. Java (correct)
D. Node.js
3. True or False: Scala is the primary abstraction of Apache
Spark. (False. The RDD is the primary abstraction; Scala is the native language.)
4. In RDD actions, which function returns all the elements of
the data set as an array at the driver program?
A. Collect (correct)
B. Take
C. Count
D. Reduce
5. True or False: Referencing a data set is one of the methods
to create an RDD. (True)
Introduction to Apache Spark © Copyright IBM Corporation 2021
Exercise: Running Apache
Spark applications in Python

Introduction to Apache Spark © Copyright IBM Corporation 2021


Exercise objectives
• In this exercise, you explore some Spark 2 client program
examples and learn how to run them. You gain experience
with the fundamental aspects of running Spark in the HDP
environment.
• After completing this exercise, you should be able to do the
following tasks:
▪ Browse files and folders in HDFS.
▪ Work with Apache Spark RDD with Python.

Introduction to Apache Spark © Copyright IBM Corporation 2021
