
Apache Spark

The following text is the input data; it is saved in a file named input.txt in the home directory.

people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.

Follow the procedure given below to execute the given example.

Open Spark-Shell
The following command is used to open the Spark shell. Generally, Spark is built with Scala; therefore, a Spark program runs in a Scala environment.

$ spark-shell

If the Spark shell opens successfully, you will find output like the following. The last line of the output, "Spark context available as sc", means that the Spark shell has automatically created a SparkContext object with the name sc. The SparkContext object must be created before the first step of a program can run.

Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(hadoop); users
with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server'
on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
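Once the scala> prompt appears, the sc object is ready to use. As a quick sanity check (not part of the original example), you can create a small RDD from a local collection and count its elements; parallelize() and count() are standard SparkContext/RDD methods, and the result should be 5.

scala> sc.parallelize(1 to 5).count()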


Create an RDD
First, we have to read the input file using the Spark-Scala API and create an RDD.

The following command is used to read a file from the given location. Here, a new RDD is created with the name inputfile. The String given as an argument to the textFile("") method is the absolute path of the input file. However, if only the file name is given, the input file is assumed to be in the current directory.

scala> val inputfile = sc.textFile("input.txt")
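Because textFile() is lazy, nothing is read until an action runs. As a quick check (an illustrative step, not part of the original procedure), you can run the count() action to confirm the path is correct; it returns the number of lines in the file.

scala> inputfile.count()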

Execute Word Count Transformation

Our aim is to count the words in a file. Create a flat map to split each line into words (flatMap(line => line.split(" "))).

Next, read each word as a key with a value of 1 (<key, value> = <word, 1>) using the map function (map(word => (word, 1))).

Finally, reduce those keys by adding the values of identical keys (reduceByKey(_+_)).

The following command is used for executing the word count logic. After executing it, you will not find any output because this is not an action; it is a transformation, which only defines a new RDD and tells Spark what to do with the given data.

scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
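If the chained form is hard to read, the same pipeline can be written step by step; each intermediate value is just another RDD. The names words and pairs below are illustrative and not part of the original example.

scala> val words = inputfile.flatMap(line => line.split(" "))
scala> val pairs = words.map(word => (word, 1))
scala> val counts = pairs.reduceByKey(_ + _)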

Current RDD
While working with the RDD, if you want to know about the current RDD, use the following command. It shows a description of the current RDD and its dependencies, which is useful for debugging.

scala> counts.toDebugString

Caching the Transformations


You can mark an RDD to be persisted using the persist() or cache() methods on it. The
first time it is computed in an action, it will be kept in memory on the nodes. Use the
following command to store the intermediate transformations in memory.

scala> counts.cache()
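cache() is shorthand for persist() with the default storage level (MEMORY_ONLY for RDDs). If you prefer to state the level explicitly, you could instead call persist() with a StorageLevel, as in the following sketch (an alternative to the cache() call above, not an extra step in the tutorial).

scala> import org.apache.spark.storage.StorageLevel
scala> counts.persist(StorageLevel.MEMORY_ONLY)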

Applying the Action

Applying an action, such as saveAsTextFile, stores the result of all the transformations into a text file. The String argument to the saveAsTextFile("") method is the absolute path of the output folder. Try the following command to save the output in a text file. In the following example, the 'output' folder is in the current directory.


scala> counts.saveAsTextFile("output")

Checking the Output

Open another terminal and go to the home directory (where Spark was executed in the other terminal). Use the following commands to check the output directory.

[hadoop@localhost ~]$ cd output/


[hadoop@localhost output]$ ls -1

part-00000
part-00001
_SUCCESS

The following command is used to see the output from the part-00000 file.

[hadoop@localhost output]$ cat part-00000

Output
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they,7)
(look,1)

The following command is used to see the output from the part-00001 file.

[hadoop@localhost output]$ cat part-00001

Output

(walk,1)
(or,1)
(talk,1)
(only,1)
(love,1)
(care,1)
(share,1)

Un-persist the Storage

Before un-persisting, if you want to see the storage space used by this application, use the following URL in your browser.

http://localhost:4040

You will see the following screen, which shows the storage space used by the application running on the Spark shell.

If you want to un-persist the storage space of a particular RDD, use the following command.

scala> counts.unpersist()

You will see the output as follows:

15/06/27 00:57:33 INFO ShuffledRDD: Removing RDD 9 from persistence list


15/06/27 00:57:33 INFO BlockManager: Removing RDD 9
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_1


15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_1 of size 480 dropped from
memory (free 280061810)
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_0
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_0 of size 296 dropped from
memory (free 280062106)
res7: counts.type = ShuffledRDD[9] at reduceByKey at <console>:14

For verifying the storage space in the browser, use the following URL.

http://localhost:4040

You will see the following screen. It shows the storage space used by the application running on the Spark shell.
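You can also check from the shell whether the RDD is still persisted by inspecting its storage level (a quick check based on the standard RDD API, not part of the original tutorial); after unpersist(), the reported level should indicate that nothing is stored.

scala> counts.getStorageLevel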

5. SPARK – DEPLOYMENT

spark-submit is a shell command used to deploy a Spark application on a cluster. It uses all the respective cluster managers through a uniform interface. Therefore, you do not have to configure your application for each one.

Example
Let us take the same word count example we used before with shell commands. Here, we consider the same example as a Spark application.

Sample Input
The following text is the input data, and the file is named in.txt.

people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.

Look at the following program:

SparkWordCount.scala

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkWordCount {
  def main(args: Array[String]) {

    val sc = new SparkContext("local", "Word Count", "/usr/local/spark", Nil, Map(), Map())
    /* local = master URL; Word Count = application name;             */
    /* /usr/local/spark = Spark home; Nil = jars; Map() = environment */
    /* Map() = variables passed to worker nodes                       */

    /* Create an input RDD by reading the text file (in.txt) through the Spark context */
    val input = sc.textFile("in.txt")


    /* Transform the input RDD into a count RDD */
    val count = input.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

    /* saveAsTextFile is an action that triggers computation of the RDD and writes the result */
    count.saveAsTextFile("outfile")
    System.out.println("OK")
  }
}

Save the above program into a file named SparkWordCount.scala and place it in a
user-defined directory named spark-application.

Note: While transforming the input RDD into the count RDD, we use flatMap() to tokenize the lines (from the text file) into words, the map() method to pair each word with a count of 1, and the reduceByKey() method to add up the occurrences of each word.
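The positional SparkContext constructor used above is the older style. An equivalent way to build the context, sketched below with the standard SparkConf API (this variant is shown for illustration and is not taken from the original program), would be:

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]) {
    // Build the context from a configuration object instead of positional arguments
    val conf = new SparkConf().setAppName("Word Count").setMaster("local")
    val sc = new SparkContext(conf)

    val count = sc.textFile("in.txt")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    count.saveAsTextFile("outfile")
    sc.stop()   // release resources when the application finishes
  }
}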

Use the following steps to submit this application. Execute all steps in the spark-
application directory through the terminal.

Step 1: Download Spark Jar

The Spark core jar is required for compilation. Therefore, download spark-core_2.10-1.3.0.jar from the Spark core jar link and move the jar file from the download directory to the spark-application directory.

Step 2: Compile program

Compile the above program using the command given below. This command should be executed from the spark-application directory. Here, /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar is a Hadoop support jar taken from the Spark library.

$ scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkWordCount.scala

Step 3: Create a JAR

Create a jar file of the Spark application using the following command. Here, wordcount is the file name of the jar file.

jar -cvf wordcount.jar SparkWordCount*.class spark-core_2.10-1.3.0.jar /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar
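To confirm that the archive was created correctly (an optional check, not part of the original steps), you can list its contents.

jar -tf wordcount.jar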

Step 4: Submit spark application
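As a sketch of what the submission looks like (the flags below are standard spark-submit options and the jar name comes from Step 3; the exact command is an assumption, not copied from the original), you would run the following from the spark-application directory:

spark-submit --class SparkWordCount --master local wordcount.jar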


