What is Spark?
Fast, expressive cluster computing system compatible with Apache Hadoop
- Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
Improves efficiency through:
- In-memory computing primitives (up to 100× faster)
- General computation graphs
Improves usability through:
- Rich APIs in Java, Scala, Python (often 2-10× less code)
- Interactive shell
How to Run It
Local multicore: just a library in your program
EC2: scripts for launching a Spark cluster
Private cluster: Mesos, YARN, Standalone Mode
Languages
APIs in Java, Scala and Python
Interactive shells in Scala and Python
Outline
Introduction to Spark
Tour of Spark operations
Job execution
Standalone programs
Deployment options
Key Idea
Work with distributed collections as you would with local ones
Concept: resilient distributed datasets (RDDs)
- Immutable collections of objects spread across a cluster
- Built through parallel transformations (map, filter, etc)
- Automatically rebuilt on failure
- Controllable persistence (e.g. caching in RAM)
Operations
Transformations (e.g. map, filter, groupBy, join)
- Lazy operations to build RDDs from other RDDs
Actions (e.g. count, collect, save)
- Return a result or write it to storage
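For example, in the Python shell (where the SparkContext variable sc already exists), a minimal sketch of this laziness:

nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)           # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still just a recipe
evens.collect()                               # action: tasks actually run; => [4, 16]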
Example: Mining Console Logs
Load error messages from a log into memory, then interactively search for patterns
lines = sc.textFile("hdfs://...")                        # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()            # Action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver sends tasks to workers; each worker reads its block of the log (Block 1-3), builds and caches its partition of messages (Cache 1-3), and returns results to the driver.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance
RDDs track the transformations used to build them (their lineage) to recompute lost data
E.g:
messages = textFile(...).filter(lambda s: s.startswith("ERROR"))
                        .map(lambda s: s.split('\t')[2])

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = contains(...)) → MappedRDD (func = split(…))
Fault Recovery Test
[Chart: iteration time (s) for iterations 1-10: the first iteration takes 119 s, normal iterations take 56-59 s, and the iteration during which a failure happens takes 81 s while lost partitions are recomputed.]
Behavior with Less RAM
[Chart: iteration time (s) vs. % of working set in cache: cache disabled 69 s, 25% cached 58 s, 50% 41 s, 75% 30 s, fully cached 12 s.]
Spark in Java and Scala
Java API:

JavaRDD<String> lines = sc.textFile(...);
errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      return s.contains("ERROR");
    }
  });
errors.count()

Scala API:

val lines = sc.textFile(...)
errors = lines.filter(s => s.contains("ERROR"))
// can also write filter(_.contains("ERROR"))
errors.count
Which Language Should I Use?
Standalone programs can be written in any of the three, but the interactive console is only available in Python & Scala
Python developers: can stay with Python for both
Java developers: consider using Scala for console (to learn the API)
Performance: Java / Scala will be faster (statically typed), but Python can do well for
numerical work with NumPy
Scala Cheat Sheet
Variables:

var x: Int = 7
var x = 7       // type inferred
val y = "hi"    // read-only

Functions:

def square(x: Int): Int = x*x

def square(x: Int): Int = {
  x*x   // last line returned
}

Collections and closures:

val nums = Array(1, 2, 3)

nums.map((x: Int) => x + 2)    // => Array(3, 4, 5)
nums.map(x => x + 2)           // => same
nums.map(_ + 2)                // => same

nums.reduce((x, y) => x + y)   // => 6
nums.reduce(_ + _)             // => 6

Java interop:

import java.net.URL
new URL("http://...").openStream()

More details: scala-lang.org
Outline
Introduction to Spark
Tour of Spark operations
Job execution
Standalone programs
Deployment options
Learning Spark
Easiest way: Spark interpreter (spark-shell or pyspark)
- Special Scala and Python consoles for cluster use
Runs in local mode on 1 thread by default, but you can control this with the MASTER environment variable:
MASTER=local ./spark-shell # local, 1 thread
MASTER=local[2] ./spark-shell # local, 2 threads
MASTER=spark://host:port ./spark-shell # Spark standalone cluster
First Stop: SparkContext
Main entry point to Spark functionality
Created for you in Spark shells as variable sc
In standalone programs, you’d make your own (see later for details)
Creating RDDs
# Turn a local collection into an RDD
sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

# Use any existing Hadoop InputFormat
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Basic Transformations
nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squares = nums.map(lambda x: x*x)             # => {1, 4, 9}

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)   # => {4}

# Map each element to zero or more others
nums.flatMap(lambda x: range(0, x))           # => {0, 0, 1, 0, 1, 2}
# (range(0, x) is the sequence of numbers 0, 1, …, x-1)
Basic Actions
nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect()    # => [1, 2, 3]

# Return first K elements
nums.take(2)      # => [1, 2]

# Count number of elements
nums.count()      # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations act on RDDs of key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b); // class scala.Tuple2
pair._1 // => a
pair._2 // => b
Some Key-Value Operations
pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}

pets.groupByKey()
# => {(cat, Seq(1, 2)), (dog, Seq(1))}

pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements combiners on the map side
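As a rough sketch of why that matters (reusing the pets RDD above; not the only way to write it):

sums = pets.reduceByKey(lambda x, y: x + y)          # partial sums are combined map-side,
                                                     # so less data crosses the network
grouped = pets.groupByKey() \
              .map(lambda kv: (kv[0], sum(kv[1])))   # ships every value, then sums
# Both yield {(cat, 3), (dog, 1)}, but reduceByKey combines before the shuffle.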
Example: Word Count
lines = sc.textFile("hamlet.txt")

counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
[Dataflow: "to be or" → "to", "be", "or" → (to, 1), (be, 1), (or, 1);
           "not to be" → "not", "to", "be" → (not, 1), (to, 1), (be, 1);
           after reduceByKey → (be, 2), (not, 1), (or, 1), (to, 2)]
Multiple Datasets
visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1")])

pageNames = sc.parallelize([("index.html", "Home"), ("about.html", "About")])

visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
# ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
# ("about.html", (Seq("3.4.5.6"), Seq("About")))
Controlling the Level of Parallelism
All the pair RDD operations take an optional second parameter for number of tasks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
Using Local Variables
External variables you use in a closure will automatically be shipped to the cluster:
query = raw_input("Enter a query:")
pages.filter(lambda x: x.startswith(query)).count()
Some caveats:
- Each task gets a new copy (updates aren’t sent back)
- Variable must be Serializable (Java/Scala) or Pickle-able (Python)
- Don’t use fields of an outer object (ships all of it!)
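A quick sketch of the first caveat (assuming a working sc; the counter here is purely illustrative):

counter = 0

def incr(x):
    global counter
    counter += 1          # mutates the copy of counter shipped with the task
    return x

sc.parallelize([1, 2, 3]).map(incr).count()
print(counter)            # still 0 on the driver: the workers' updates never come back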
Closure Mishap Example
class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)
       .reduce(...)
  }
}
// Throws NotSerializableException: MyCoolRddApp (or Log)

How to get around it:

class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param
    rdd.map(x => x + param_)   // references only the local variable
       .reduce(...)            // instead of this.param
  }
}
More Details
Spark supports lots of other operations!
Full programming guide: spark-project.org/documentation
Outline
Introduction to Spark
Tour of Spark operations
Job execution
Standalone programs
Deployment options
Software Components
Spark runs as a library in your program (one instance per app)

Runs tasks locally or on a cluster
- Standalone deploy cluster, Mesos or YARN

Accesses storage via Hadoop InputFormat API
- Can use HBase, HDFS, S3, …

[Diagram: your application creates a SparkContext, which submits tasks to a cluster manager (or to local threads); each worker node runs a Spark executor, and executors read from HDFS or other storage.]
Task Scheduler
Supports general task graphs

Pipelines functions where possible

Cache-aware data reuse & locality

Partitioning-aware to avoid shuffles

[Diagram: an example job's RDDs (A-F) split into Stage 1 (ending in a groupBy), Stage 2 (map and filter feeding a join), and Stage 3 (after the join); already-cached partitions are marked.]
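A small sketch of how this plays out (toy data; sc assumed from the shell):

pairs = sc.parallelize(["a", "b", "a", "c"]) \
          .map(lambda w: (w, 1)) \
          .filter(lambda kv: kv[0] != "c")     # map + filter are narrow: pipelined in one stage
grouped = pairs.groupByKey()                   # groupByKey needs a shuffle: new stage boundary
grouped.collect()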
Hadoop Compatibility
Spark can read/write to any storage system / format that has a plugin for Hadoop!
- Examples: HDFS, S3, HBase, Cassandra, Avro, SequenceFile
- Reuses Hadoop’s InputFormat and OutputFormat APIs
APIs like SparkContext.textFile support filesystems, while SparkContext.hadoopRDD allows passing any Hadoop JobConf to configure an input source
Outline
Introduction to Spark
Tour of Spark operations
Job execution
Standalone programs
Deployment options
Build Spark
Requires Java 6+, Scala 2.9.2
git clone git://github.com/mesos/spark
cd spark
sbt/sbt package
# Optional: publish to local Maven cache
sbt/sbt publish-local
Add Spark to Your Project
Scala and Java: add a Maven dependency on
groupId: org.spark-project
artifactId: spark-core_2.9.1
version: 0.7.0-SNAPSHOT
Python: run program with our pyspark script
Create a SparkContext
Scala:

import spark.SparkContext
import spark.SparkContext._

val sc = new SparkContext("masterUrl", "name", "sparkHome", Seq("app.jar"))
// args: cluster URL (or local / local[N]), app name,
//       Spark install path on cluster, list of JARs with app code (to ship)

Java:

import spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(
  "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python:

from pyspark import SparkContext

sc = SparkContext("masterUrl", "name", "sparkHome", ["library.zip"])
Complete App: Scala
import spark.SparkContext
import spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "WordCount", args(0), Seq(args(1)))
    val lines = sc.textFile(args(2))
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .saveAsTextFile(args(3))
  }
}
Complete App: Python
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    lines.flatMap(lambda s: s.split(" ")) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda x, y: x + y) \
         .saveAsTextFile(sys.argv[2])
Example: PageRank
Why PageRank?
Good example of a more complex algorithm
- Multiple stages of map & reduce
Benefits from Spark’s in-memory caching
- Multiple iterations over the same data
Basic Idea
Give pages ranks (scores) based on links to them
- Links from many pages → high rank
- Link from a high-rank page → high rank
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Animation: on an example graph of four pages, all ranks start at 1.0; after the first iteration the ranks are 1.85, 1.0, 0.58 and 0.58, after a later iteration 1.72, 1.31, 0.58 and 0.39, and the final state is 1.44, 1.37, 0.73 and 0.46.]
Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
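For comparison, a rough PySpark version of the same loop (toy graph; the data and iteration count here are illustrative, and sc is assumed from the shell):

links = sc.parallelize([("a", ["b", "c"]), ("b", ["a"]), ("c", ["a", "b"])]).cache()
ranks = links.mapValues(lambda neighbors: 1.0)

for i in range(10):   # number of iterations
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    ranks = contribs.reduceByKey(lambda x, y: x + y) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())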
PageRank Performance
[Chart: PageRank iteration time (s) vs. number of machines: Hadoop 171 s on 30 machines and 80 s on 60; Spark 23 s on 30 machines and 14 s on 60.]
Other Iterative Algorithms
[Chart: time per iteration (s): K-Means Clustering 155 s on Hadoop vs. 4.1 s on Spark; Logistic Regression 110 s on Hadoop vs. 0.96 s on Spark.]
Outline
Introduction to Spark
Tour of Spark operations
Job execution
Standalone programs
Deployment options
Local Mode
Just pass local or local[k] as master URL
Still serializes tasks to catch marshaling errors
Debug using local debuggers
- For Java and Scala, just run your main program in a debugger
- For Python, use an attachable debugger (e.g. PyDev, winpdb)
Great for unit testing
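For instance, a minimal unit-test sketch in local mode (class and test names here are made up for illustration):

import unittest
from pyspark import SparkContext

class SimpleSparkTest(unittest.TestCase):
    def setUp(self):
        self.sc = SparkContext("local", "test")   # one local thread, no cluster needed

    def tearDown(self):
        self.sc.stop()

    def test_sum(self):
        total = self.sc.parallelize([1, 2, 3]).reduce(lambda x, y: x + y)
        self.assertEqual(total, 6)

if __name__ == "__main__":
    unittest.main()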
Private Cluster
Can run with one of:
- Standalone deploy mode (similar to Hadoop cluster scripts)
- Apache Mesos: spark-project.org/docs/latest/running-on-mesos.html
- Hadoop YARN: spark-project.org/docs/0.6.0/running-on-yarn.html
Basically requires configuring a list of workers, running launch scripts, and passing a
special cluster URL to SparkContext
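For example, once the cluster is up, pointing an application at it is just a different master URL (the host, port and app name below are placeholders):

from pyspark import SparkContext

sc = SparkContext("spark://masterhost:7077", "MyApp")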
Amazon EC2
Easiest way to launch a Spark cluster
git clone git://github.com/mesos/spark.git
cd spark/ec2
./spark-ec2 -k keypair -i id_rsa.pem -s slaves \
[launch|stop|start|destroy] clusterName
Details: spark-project.org/docs/latest/ec2-scripts.html
New: run Spark on Elastic MapReduce – tinyurl.com/spark-emr
Viewing Logs
Click through the web UI at master:8080
Or, look at the stdout and stderr files in the Spark or Mesos “work” directory for your app:
work/<ApplicationID>/<ExecutorID>/stdout
Application ID (Framework ID in Mesos) is printed when Spark connects
Community
Join the Spark Users mailing list:
groups.google.com/group/spark-users
Come to the Bay Area meetup:
www.meetup.com/spark-users
Conclusion
Spark offers a rich API to make data analytics fast: both fast to write and fast to run
Achieves 100x speedups in real applications
Growing community with 14 companies contributing
Details, tutorials, videos: www.spark-project.org