
Apache Spark

The following text is the input data; it is saved in a file named input.txt in the home directory.

people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.

Follow the procedure given below to execute the given example.

Open Spark-Shell
The following command is used to open the Spark shell. Generally, Spark is built with Scala; therefore, a Spark program runs in a Scala environment.

$ spark-shell

If the Spark shell opens successfully, you will find output like the following. The last line of the output, "Spark context available as sc", means that the Spark shell has automatically created a SparkContext object with the name sc. The SparkContext object must be created before the first step of a program can run.

Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(hadoop); users
with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server'
on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
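Once the scala> prompt appears, the sc object is ready to use. As a quick sanity check (not part of the original example), you can create a small RDD from a local collection and count its elements; parallelize() and count() are standard SparkContext/RDD methods, and the result should be 5.

scala> sc.parallelize(1 to 5).count()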


Create an RDD
First, we have to read the input file using the Spark-Scala API and create an RDD.

The following command is used to read a file from the given location. Here, a new RDD is created with the name inputfile. The String given as an argument to the textFile("") method is the absolute path of the input file. However, if only the file name is given, the input file is assumed to be in the current directory.

scala> val inputfile = sc.textFile("input.txt")
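Because textFile() is lazy, nothing is read until an action runs. As a quick check (an illustrative step, not part of the original procedure), you can run the count() action to confirm the path is correct; it returns the number of lines in the file.

scala> inputfile.count()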

Execute Word Count Transformation

Our aim is to count the words in a file. Create a flat map to split each line into words (flatMap(line => line.split(" "))).

Next, read each word as a key with a value of 1 (<key, value> = <word, 1>) using the map function (map(word => (word, 1))).

Finally, reduce those keys by adding the values of identical keys (reduceByKey(_+_)).

The following command is used for executing the word count logic. After executing it, you will not find any output because this is not an action; it is a transformation, which only defines a new RDD and tells Spark what to do with the given data.

scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
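If the chained form is hard to read, the same pipeline can be written step by step; each intermediate value is just another RDD. The names words and pairs below are illustrative and not part of the original example.

scala> val words = inputfile.flatMap(line => line.split(" "))
scala> val pairs = words.map(word => (word, 1))
scala> val counts = pairs.reduceByKey(_ + _)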

Current RDD
While working with the RDD, if you want to know about the current RDD, use the following command. It shows a description of the current RDD and its dependencies, which is useful for debugging.

scala> counts.toDebugString

Caching the Transformations


You can mark an RDD to be persisted using the persist() or cache() methods on it. The
first time it is computed in an action, it will be kept in memory on the nodes. Use the
following command to store the intermediate transformations in memory.

scala> counts.cache()
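cache() is shorthand for persist() with the default storage level (MEMORY_ONLY for RDDs). If you prefer to state the level explicitly, you could instead call persist() with a StorageLevel, as in the following sketch (an alternative to the cache() call above, not an extra step in the tutorial).

scala> import org.apache.spark.storage.StorageLevel
scala> counts.persist(StorageLevel.MEMORY_ONLY)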

Applying the Action

Applying an action, such as saveAsTextFile, stores the result of all the transformations into a text file. The String argument to the saveAsTextFile("") method is the absolute path of the output folder. Try the following command to save the output in a text file. In the following example, the 'output' folder is in the current directory.


scala> counts.saveAsTextFile("output")

Checking the Output

Open another terminal and go to the home directory (where Spark was executed in the other terminal). Use the following commands to check the output directory.

[hadoop@localhost ~]$ cd output/


[hadoop@localhost output]$ ls -1

part-00000
part-00001
_SUCCESS

The following command is used to see the output from the part-00000 file.

[hadoop@localhost output]$ cat part-00000

Output
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they,7)
(look,1)

The following command is used to see the output from the part-00001 file.

[hadoop@localhost output]$ cat part-00001

Output

(walk,1)
(or,1)
(talk,1)
(only,1)
(love,1)
(care,1)
(share,1)

Un-persist the Storage

Before un-persisting, if you want to see the storage space used by this application, use the following URL in your browser.

http://localhost:4040

You will see the following screen, which shows the storage space used by the application running on the Spark shell.

If you want to un-persist the storage space of a particular RDD, use the following command.

scala> counts.unpersist()

You will see the output as follows:

15/06/27 00:57:33 INFO ShuffledRDD: Removing RDD 9 from persistence list


15/06/27 00:57:33 INFO BlockManager: Removing RDD 9
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_1


15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_1 of size 480 dropped from
memory (free 280061810)
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_0
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_0 of size 296 dropped from
memory (free 280062106)
res7: counts.type = ShuffledRDD[9] at reduceByKey at <console>:14

For verifying the storage space in the browser, use the following URL.

http://localhost:4040

You will see the following screen. It shows the storage space used by the application running on the Spark shell.
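You can also check from the shell whether the RDD is still persisted by inspecting its storage level (a quick check based on the standard RDD API, not part of the original tutorial); after unpersist(), the reported level should indicate that nothing is stored.

scala> counts.getStorageLevel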

5. SPARK – DEPLOYMENT

spark-submit is a shell command used to deploy a Spark application on a cluster. It uses all the respective cluster managers through a uniform interface. Therefore, you do not have to configure your application for each one.

Example
Let us take the same word count example we used before with shell commands. Here, we consider the same example as a Spark application.

Sample Input
The following text is the input data, and the file is named in.txt.

people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.

Look at the following program:

SparkWordCount.scala

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkWordCount {
  def main(args: Array[String]) {

    val sc = new SparkContext("local", "Word Count", "/usr/local/spark", Nil, Map(), Map())
    /* local = master URL; Word Count = application name;             */
    /* /usr/local/spark = Spark home; Nil = jars; Map() = environment */
    /* Map() = variables passed to worker nodes                       */

    /* Create an input RDD by reading the text file (in.txt) through the Spark context */
    val input = sc.textFile("in.txt")


    /* Transform the input RDD into a count RDD */
    val count = input.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

    /* saveAsTextFile is an action that triggers computation of the RDD and writes the result */
    count.saveAsTextFile("outfile")
    System.out.println("OK")
  }
}

Save the above program into a file named SparkWordCount.scala and place it in a
user-defined directory named spark-application.

Note: While transforming the input RDD into the count RDD, we use flatMap() to tokenize the lines (from the text file) into words, the map() method to pair each word with a count of 1, and the reduceByKey() method to add up the occurrences of each word.
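The positional SparkContext constructor used above is the older style. An equivalent way to build the context, sketched below with the standard SparkConf API (this variant is shown for illustration and is not taken from the original program), would be:

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]) {
    // Build the context from a configuration object instead of positional arguments
    val conf = new SparkConf().setAppName("Word Count").setMaster("local")
    val sc = new SparkContext(conf)

    val count = sc.textFile("in.txt")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    count.saveAsTextFile("outfile")
    sc.stop()   // release resources when the application finishes
  }
}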

Use the following steps to submit this application. Execute all steps in the spark-
application directory through the terminal.

Step 1: Download Spark Jar

The Spark core jar is required for compilation. Therefore, download spark-core_2.10-1.3.0.jar from the Spark core jar link and move the jar file from the download directory to the spark-application directory.

Step 2: Compile program

Compile the above program using the command given below. This command should be executed from the spark-application directory. Here, /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar is a Hadoop support jar taken from the Spark library.

$ scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkWordCount.scala

Step 3: Create a JAR

Create a jar file of the Spark application using the following command. Here, wordcount is the file name of the jar file.

jar -cvf wordcount.jar SparkWordCount*.class spark-core_2.10-1.3.0.jar /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar
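To confirm that the archive was created correctly (an optional check, not part of the original steps), you can list its contents.

jar -tf wordcount.jar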

Step 4: Submit spark application
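As a sketch of what the submission looks like (the flags below are standard spark-submit options and the jar name comes from Step 3; the exact command is an assumption, not copied from the original), you would run the following from the spark-application directory:

spark-submit --class SparkWordCount --master local wordcount.jar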


