Parallel Programming With Spark
Matei Zaharia, UC Berkeley
www.spark-project.org
What is Spark?
Fast and expressive cluster computing system compatible with Apache Hadoop
- Works with any Hadoop-supported storage system and data format (HDFS, S3, SequenceFile, ...)
Improves efficiency through:
- In-memory computing primitives
- General computation graphs
As much as 30x faster
Improves usability through:
- Rich Scala and Java APIs
- Interactive shell
Often 2-10x less code
How to Run It
- Local multicore: just a library in your program
- EC2: scripts for launching a Spark cluster
- Private cluster: Mesos, YARN*, standalone*
*Coming soon in Spark 0.6
Scala vs Java APIs
- Spark originally written in Scala, which allows concise function syntax and interactive use
- Recently added Java API for standalone apps (dev branch on GitHub)
- Interactive shell still in Scala
- This course: mostly Scala, with translations to Java
Outline
- Introduction to Scala & functional programming
- Spark concepts
- Tour of Spark operations
- Job execution
About Scala
High-level language for the Java VM
- Object-oriented + functional programming
Statically typed
- Comparable in speed to Java
- But often no need to write types due to type inference
Interoperates with Java
- Can use any Java class, inherit from it, etc.; can also call Scala code from Java
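For example, a quick sketch (not from the original slides) of using plain JDK classes directly from Scala:

// java.util.Random and java.util.ArrayList are ordinary Java classes,
// instantiated and called from Scala with no wrapper code
val rng = new java.util.Random(42)
val nums = new java.util.ArrayList[Int]()
nums.add(rng.nextInt(100))
println(nums)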
Best Way to Learn Scala
- Interactive shell: just type scala
- Supports importing libraries, tab completion, and all constructs in the language
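For instance, a shell session looks roughly like this (exact output formatting varies by Scala version):

scala> import scala.math._
import scala.math._
scala> sqrt(2)
res0: Double = 1.4142135623730951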
Quick Tour
Declaring variables:
var x: Int = 7
var x = 7                 // type inferred
val y = "hi"              // read-only
Java equivalent:
int x = 7;
final String y = "hi";

Functions:
def square(x: Int): Int = x*x
def square(x: Int): Int = {
  x*x                     // last expression in block is returned
}
def announce(text: String) {
  println(text)
}
Java equivalent:
int square(int x) {
  return x*x;
}
void announce(String text) {
  System.out.println(text);
}
Quick Tour
Generic types:
var arr = new Array[Int](8)
var lst = List(1, 2, 3)   // factory method; type of lst is List[Int]
Java equivalent:
int[] arr = new int[8];
List<Integer> lst = new ArrayList<Integer>();   // can't hold primitive types
lst.add(...)

Indexing:
arr(5) = 7
println(lst(5))
Java equivalent:
arr[5] = 7;
System.out.println(lst.get(5));
Quick Tour
Processing collections with functional programming:
val list = List(1, 2, 3)
list.foreach(x => println(x))   // prints 1, 2, 3; x => println(x) is a function expression (closure)
list.foreach(println)           // same
list.map(x => x + 2)            // => List(3, 4, 5)
list.map(_ + 2)                 // same, with placeholder notation
list.filter(x => x % 2 == 1)    // => List(1, 3)
list.filter(_ % 2 == 1)         // => List(1, 3)
list.reduce((x, y) => x + y)    // => 6
list.reduce(_ + _)              // => 6
All of these leave the list unchanged (List is immutable)
Scala Closure Syntax
(x: Int) => x + 2   // full version
x => x + 2          // type inferred
_ + 2               // when each argument is used exactly once
x => {              // when body is a block of code
  val numberToAdd = 2
  x + numberToAdd
}
// If closure is too long, can always pass a function
def addTwo(x: Int): Int = x + 2
list.map(addTwo)
Scala allows defining a local function inside another function, as sketched below.
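A minimal sketch (not from the original slides), using a hypothetical helper name:

def addAllTwo(nums: List[Int]): List[Int] = {
  def addTwo(x: Int): Int = x + 2   // local function, visible only inside addAllTwo
  nums.map(addTwo)
}
// addAllTwo(List(1, 2, 3)) => List(3, 4, 5)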
Other Collection Methods
Scala collections provide many other functional methods; for example, Google for "Scala Seq"

Method on Seq[T]                        Explanation
map(f: T => U): Seq[U]                  Pass each element through f
flatMap(f: T => Seq[U]): Seq[U]         One-to-many map
filter(f: T => Boolean): Seq[T]         Keep elements passing f
exists(f: T => Boolean): Boolean        True if one element passes f
forall(f: T => Boolean): Boolean        True if all elements pass f
reduce(f: (T, T) => T): T               Merge elements using f
groupBy(f: T => K): Map[K, List[T]]     Group elements by f(element)
sortBy(f: T => K): Seq[T]               Sort elements by f(element)
...
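A quick sketch (not from the original slides) exercising a few of these on a plain List:

val nums = List(3, 1, 4, 1, 5)
nums.flatMap(x => List(x, x))   // => List(3, 3, 1, 1, 4, 4, 1, 1, 5, 5)
nums.exists(_ > 4)              // => true
nums.forall(_ > 0)              // => true
nums.groupBy(_ % 2)             // => Map(1 -> List(3, 1, 1, 5), 0 -> List(4)) (key order may vary)
nums.sortBy(x => -x)            // => List(5, 4, 3, 1, 1)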
Outline
- Introduction to Scala & functional programming
- Spark concepts
- Tour of Spark operations
- Job execution
Spark Overview
Goal: work with distributed collections as you would with local ones
Concept: resilient distributed datasets (RDDs)
- Immutable collections of objects spread across a cluster
- Built through parallel transformations (map, filter, etc.)
- Automatically rebuilt on failure
- Controllable persistence (e.g. caching in RAM) for reuse
Main Primitives
Resilient distributed datasets (RDDs)
- Immutable, partitioned collections of objects
Transformations (e.g. map, filter, groupBy, join)
- Lazy operations to build RDDs from other RDDs
Actions (e.g. count, collect, save)
- Return a result or write it to storage
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

val lines = spark.textFile("hdfs://...")           // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("foo")).count   // action
messages.filter(_.contains("bar")).count

[Diagram: the driver ships tasks to workers; each worker reads its block (Block 1-3) from storage and keeps the filtered results in its cache (Cache 1-3) for later queries]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data. E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split("\t")(2))

Lineage: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...))
Fault Recovery Test
[Bar chart: iteration time (s) over 10 iterations of a job; the first iteration takes 119 s, later ones ~56-59 s, and the iteration during which a failure happens rises to 81 s before times return to normal]
Behavior with Less RAM
[Bar chart: iteration time (s) vs % of working set in cache — cache disabled: ~69 s, 25%: ~58 s, 50%: ~41 s, 75%: ~30 s, fully cached: ~12 s]
How it Looks in Java
Scala:
lines.filter(_.contains("error")).count()

Java:
JavaRDD<String> lines = ...;
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();

More examples in the next talk
Outline
- Introduction to Scala & functional programming
- Spark concepts
- Tour of Spark operations
- Job execution
Learning Spark
Easiest way: the Spark interpreter (spark-shell)
- Modified version of the Scala interpreter for cluster use
Runs in local mode on 1 thread by default, but can be controlled through the MASTER environment variable:

MASTER=local       ./spark-shell   # local, 1 thread
MASTER=local[2]    ./spark-shell   # local, 2 threads
MASTER=host:port   ./spark-shell   # run on Mesos
First Stop: SparkContext
- Main entry point to Spark functionality
- Created for you in spark-shell as the variable sc
- In standalone programs, you'd make your own (see later for details)
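A minimal sketch of what that looks like, assuming the Spark 0.5/0.6-era constructor taking a master URL and a job name (check the Programming Guide for the exact signature):

import spark.SparkContext
import spark.SparkContext._

object MyApp {
  def main(args: Array[String]) {
    // "local[2]" and "MyApp" are placeholders; a Mesos master URL also works
    val sc = new SparkContext("local[2]", "MyApp")
    println(sc.parallelize(1 to 10).reduce(_ + _))   // => 55
  }
}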
Creating RDDs

// Turn a Scala collection into an RDD
sc.parallelize(List(1, 2, 3))

// Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

// Use any existing Hadoop InputFormat
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Basic Transformations

val nums = sc.parallelize(List(1, 2, 3))

// Pass each element through a function
val squares = nums.map(x => x*x)   // {1, 4, 9}

// Keep elements passing a predicate
val even = squares.filter(_ % 2 == 0)   // {4}

// Map each element to zero or more others
nums.flatMap(x => 1 to x)   // => {1, 1, 2, 1, 2, 3}
// (1 to x is a Range object: the sequence of numbers 1, 2, ..., x)
Basic Actions

val nums = sc.parallelize(List(1, 2, 3))

// Retrieve RDD contents as a local collection
nums.collect()   // => Array(1, 2, 3)

// Return first K elements
nums.take(2)   // => Array(1, 2)

// Count number of elements
nums.count()   // => 3

// Merge elements with an associative function
nums.reduce(_ + _)   // => 6

// Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs

Scala pair syntax:
val pair = (a, b)   // sugar for new Tuple2(a, b)

Accessing pair elements:
pair._1   // => a
pair._2   // => b
Some Key-Value Operations

val pets = sc.parallelize(
  List(("cat", 1), ("dog", 1), ("cat", 2)))

pets.reduceByKey(_ + _)   // => {(cat, 3), (dog, 1)}
pets.groupByKey()         // => {(cat, Seq(1, 2)), (dog, Seq(1))}
pets.sortByKey()          // => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side
Example: Word Count

val lines = sc.textFile("hamlet.txt")
val counts = lines.flatMap(line => line.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

[Diagram: "to be or" and "not to be" are split into words, mapped to pairs such as (to, 1), (be, 1), (or, 1), (not, 1), and reduced by key to (be, 2), (not, 1), (or, 1), (to, 2)]
Other Key-Value Operations

val visits = sc.parallelize(List(
  ("index.html", "1.2.3.4"),
  ("about.html", "3.4.5.6"),
  ("index.html", "1.3.3.1")))

val pageNames = sc.parallelize(List(
  ("index.html", "Home"), ("about.html", "About")))

visits.join(pageNames)
// ("index.html", ("1.2.3.4", "Home"))
// ("index.html", ("1.3.3.1", "Home"))
// ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
// ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
// ("about.html", (Seq("3.4.5.6"), Seq("About")))
Controlling the Number of Reduce Tasks
All the pair RDD operations take an optional second parameter for the number of tasks:

words.reduceByKey(_ + _, 5)
words.groupByKey(5)
visits.join(pageViews, 5)

Can also set the spark.default.parallelism property (sketched below)
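A minimal sketch of setting that property, assuming the Spark 0.5/0.6 convention of configuring Spark through Java system properties before the SparkContext is created:

// Assumption: configuration is read from Java system properties in this era,
// so the default parallelism must be set before the context exists
System.setProperty("spark.default.parallelism", "10")
val sc = new spark.SparkContext("local[4]", "ParallelismDemo")   // placeholder names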
Using Local Variables
Any external variables you use in a closure will automatically be shipped to the cluster:

val query = Console.readLine()
pages.filter(_.contains(query)).count()

Some caveats:
- Each task gets a new copy (updates aren't sent back)
- Variable must be Serializable
- Don't use fields of an outer object (ships all of it!)
Closure Mishap Example

class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)   // NotSerializableException: MyCoolRddApp (or Log)
       .reduce(...)
  }
}

How to get around it:

class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param        // references only the local variable instead of this.param
    rdd.map(x => x + param_)
       .reduce(...)
  }
}
Other RDD Operations
- sample(): deterministically sample a subset (quick sketch below)
- union(): merge two RDDs
- cartesian(): cross product
- pipe(): pass through external program

See the Programming Guide for more: www.spark-project.org/documentation.html
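A brief sketch (not from the original slides) of the first three on a small RDD, assuming the 0.5/0.6-era sample(withReplacement, fraction, seed) signature:

val nums = sc.parallelize(1 to 10)
val letters = sc.parallelize(List("a", "b"))
nums.sample(false, 0.5, 42).collect()       // roughly half the elements, without replacement
nums.union(letters.map(_.length)).count()   // => 12
nums.cartesian(letters).count()             // => 20 pairs such as (1, "a")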
Outline
- Introduction to Scala & functional programming
- Spark concepts
- Tour of Spark operations
- Job execution
Software Components
- Spark runs as a library in your program (1 instance per app)
- Runs tasks locally or on Mesos
  - dev branch also supports YARN, standalone deployment
- Accesses storage systems via the Hadoop InputFormat API
  - Can use HBase, HDFS, S3, ...

[Diagram: your application creates a SparkContext, which schedules work through a Mesos master (or local threads); Spark workers on each slave run the tasks and read HDFS or other storage]
Task Scheduler
- Runs general task graphs
- Pipelines functions where possible
- Cache-aware data reuse & locality
- Partitioning-aware to avoid shuffles

[Diagram: an example job DAG over RDDs A-F, cut into three stages around a groupBy and a join, with map and filter pipelined within a stage; shaded boxes mark already-cached partitions]
Data Storage
Cached RDDs normally stored as Java objects
- Fastest access on the JVM, but can be larger than ideal
Can also store in serialized format
- Spark 0.5: spark.cache.class=spark.SerializingCache
Default serialization library is Java serialization
- Very slow for large data!
- Can customize through spark.serializer (see later; a sketch follows)
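A minimal sketch of what that configuration might look like, assuming the 0.5/0.6-era system-property convention and the spark.KryoSerializer class name (treat both as assumptions and check the documentation covered later):

// Assumptions: property-based config and these exact property/class names;
// set before the SparkContext is created
System.setProperty("spark.cache.class", "spark.SerializingCache")
System.setProperty("spark.serializer", "spark.KryoSerializer")
val sc = new spark.SparkContext("local", "SerializationDemo")   // placeholder names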
How to Get Started
git clone git://github.com/mesos/spark
cd spark
sbt/sbt compile
./spark-shell
More Information
Scala resources:
- www.artima.com/scalazine/articles/steps.html (First Steps to Scala)
- www.artima.com/pins1ed (free book)
Spark documentation: www.spark-project.org/documentation.html