DEV3000 LabGuide
Applications
Lab Guide
Spring 2017 – Version 5.1.0
© 2017, MapR Technologies, Inc. All rights reserved. All other trademarks cited here are the property of
their respective owners.
Course Sandbox
For instructor-led training, clusters are provided to students through the MapR Academy lab environment.
Students taking the on-demand version of the course must download one of the MapR Sandboxes listed
below to complete the lab exercises. See the Connection Guide provided with your student materials for
details on how to use the sandboxes.
• VMware Course Sandbox: http://package.mapr.com/releases/v5.1.0/sandbox/MapR-Sandbox-For-Hadoop-5.1.0-vmware.ova
• VirtualBox Course Sandbox: http://package.mapr.com/releases/v5.1.0/sandbox/MapR-Sandbox-For-Hadoop-5.1.0.ova
CAUTION: Exercises for this course have been tested and validated ONLY with the
Sandboxes listed above. Do not use the most current Sandbox from the MapR website for
these labs.
Note: Additional information that clarifies something, provides details, or helps you
avoid mistakes.
Try This! Exercises you can complete after class (or during class if you finish a lab
early) to strengthen learning.
Command Syntax
When command syntax is presented, any arguments that are enclosed in chevrons, <like this>,
should be substituted with an appropriate value. For example, a syntax line shown as:
# cp <file> <file>.bak
would be entered with actual values, such as:
# cp /etc/passwd /etc/passwd.bak
Note: Sample commands provide guidance, but do not always reflect exactly what you will
see on the screen. For example, if there is output associated with a command, it may not be
shown.
Caution: Code samples in this lab guide may not work correctly when cut and pasted. For
best results, type commands in rather than cutting and pasting.
Note: Some commands shown throughout this lab guide are too long to fit on a single line.
The backslash character (\) indicates that the command continues on the next line. Do not
include the backslash character, or a carriage return, when typing the commands.
Note: In this and subsequent commands that include the <cluster> designator,
replace <cluster> with the actual name of your cluster. For example,
/mapr/maprdemo. The /mapr/<cluster> prefix indicates where the cluster file
system is mounted with Direct Access NFS™, which makes it possible to use
standard Linux commands to access the cluster file system.
4. Run the MRv1 version of the wordcount application against the input file.
$ hadoop2 jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-\
examples.jar wordcount /user/user01/Lab1.3/in.txt \
/user/user01/Lab1.3/OUT
Note: With hadoop commands such as this, you do not need to include the prefix
/mapr/<cluster>, since you are dealing directly with the cluster file system and not
using Direct Access NFS™.
$ cat /mapr/<cluster>/user/user01/Lab1.3/OUT/part-r-00000
3. Run the MRv2 version of the wordcount application against the directory:
$ hadoop2 jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce\
/hadoop-mapreduce-examples-2.7.0-mapr-1602.jar wordcount \
/user/user01/Lab1.3/IN2 /user/user01/Lab1.3/OUT2
7. Cross-reference the frequency of the “word” ATUH in the binary and in the wordcount output:
$ strings /mapr/<cluster>/user/user01/Lab1.3/IN3/mybinary | \
grep -c ATUH
$ egrep -ac ATUH /mapr/<cluster>/user/user01/Lab1.3/OUT3/\
part-r-00000
3. In the list of displayed jobs, scroll to the bottom to find your jobs (identified by the combination of
the Job Name and User fields).
4. Click the word count link for one of the jobs you launched.
a. How many tasks comprised that job?
b. How long did they each last?
c. On which node did they run? Note that in a single-node cluster, there's only one machine
the job can run on.
2. When the job completes, scroll back through the output to determine your container ID for the
shell, as shown in the sample output below:
15/01/21 18:34:07 INFO distributedshell.Client: Got application
report from ASM for, appId=1, clientToAMToken=null, appDiagnostics=,
appMasterHost=yarn-training/192.168.56.102, appQueue=root.user01,
appMasterRpcPort=-1, appStartTime=1421894036331,
yarnAppState=FINISHED, distributedFinalState=SUCCEEDED,
appTrackingUrl=http://yarn-
training:8088/proxy/application_1421893926516_0001/A, appUser=user01
6. Display the contents of the stdout file. You should see a listing of the /user/user01 directory.
$ cat stdout
In this exercise, you will use the Web UI provided by the History Server to examine information for the job
you previously launched.
1. Connect to the History Server in your web browser:
http://<IP address>:19888
The data set we’re using is the history of the United States federal budget from the year 1901 to 2012.
The data was downloaded from the White House website and has been massaged for this exercise. The
existing code calculates minimum and maximum values in the data set. You will modify the code to
calculate the mean surplus or deficit.
The fields of interest in this exercise are the first and fourth fields (year and surplus or deficit). The
second field is the total income derived from federal income taxes, and the third field is the expenditures
for that year. The fourth field is the difference between the second and third fields. A negative value in
the fourth field indicates a budget deficit and a positive value indicates a budget surplus.
2. Create a directory for the lab work, and position yourself in that directory:
$ mkdir /mapr/<cluster>/user/user01/Lab3
$ cd /mapr/<cluster>/user/user01/Lab3
This will create two directories: RECEIPTS_LAB, which contains the source files for the lab, and
RECEIPTS_SOLUTION which contains files with the solution correctly implemented. You can
review solutions files as needed for help completing the lab.
Lesson 3: Modify a MapReduce Program
3. Examine the output from your MapReduce job. Note you may need to wait a minute before the
job output is completely written to the output files.
$ cat /mapr/<cluster>/user/user01/Lab3/RECEIPTS_LAB/OUT/part*
If you did not obtain the results above, you'll need to revisit your Mapper class. Ask your
instructor for help if needed. Once you obtain the correct intermediate results from the map-
only code, proceed to the next section.
Recall that the mapper code you ran above will produce intermediate results. One such record looks like
this:
summary 1968_-25161
When you execute the code for this lab, there will only be one reducer (since there is only one key –
“summary”). That reducer will iterate over all the intermediate results and pull out the year and surplus or
deficit. Your reducer will keep track of the minimum and maximum values (as temp variables) as well as
the year those values occurred. You will also need to keep track of the sum of the surplus or deficit and
count of the records in order to calculate the mean value.
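As a reference for the logic just described, here is a rough sketch of such a reducer. It assumes the year_value format of the intermediate records shown above (for example, 1968_-25161); the class name, types, and output format are illustrative and will not match ReceiptsReducer.java exactly, so use the RECEIPTS_SOLUTION files as the authoritative reference.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch of the reducer logic described above, not the lab's class.
public class SummaryReducerSketch extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        String minYear = "", maxYear = "";
        double sum = 0.0;
        long count = 0;

        // all records share the single key "summary", so this loop
        // sees every intermediate value of the form year_delta
        for (Text value : values) {
            String[] parts = value.toString().split("_");
            String year = parts[0];
            double delta = Double.parseDouble(parts[1]);

            if (delta < min) { min = delta; minYear = year; }
            if (delta > max) { max = delta; maxYear = year; }
            sum += delta;
            count++;
        }

        context.write(new Text("min(" + minYear + ")"), new Text(Double.toString(min)));
        context.write(new Text("max(" + maxYear + ")"), new Text(Double.toString(max)));
        context.write(new Text("mean"), new Text(Double.toString(sum / count)));
    }
}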
1. Open the ReceiptsReducer.java source file with your favorite text editor.
$ vi ReceiptsReducer.java
2. Find the // TODO statements in the file, and make the changes indicated. Refer to the solutions
file as needed for help.
3. Save the ReceiptsReducer.java file.
4. Open the ReceiptsDriver.java source file with your favorite text editor. Find the line //
TODO comment out the Reducer class definition. Recall that in the previous section,
you commented out the Reducer definition – in this section, you will need to uncomment it so it
will be included again.
5. Save the ReceiptsDriver.java file.
max(2000): 236241.0
mean: -93862.0
Summary of Data
This lab examines data sampled from universities across North America. The data set can be
downloaded from http://archive.ics.uci.edu/ml/datasets/University.
Not every record contains the same number of fields, but every record starts with the string (def-
instance and ends with the string )). Each record contains information for a single university in the
survey. Here is a sample record:
(def-instance Adelphi
(state newyork)
(control private)
(no-of-students thous:5-10)
(male:female ratio:30:70)
(student:faculty ratio:15:1)
(sat math 475)
(expenses thous$:7-10)
(percent-financial-aid 60)
(no-applicants thous:4-7)
(percent-admittance 70)
(percent-enrolled 40)
(academics scale:1-5 2)
(social scale:1-5 2)
(sat verbal 500)
(quality-of-life scale:1-5 2)
(academic-emphasis business-administration)
(academic-emphasis biology))
Prepare
1. Log into a node as the user user01.
This will create two subdirectories: a UNIVERSITY_LAB directory containing the source files, and
a UNIVERSITY_SOLUTION directory that contains the modified files with the correct solutions.
You can refer to the files in the UNIVERSITY_SOLUTION directory if you get stuck on a step.
Each record contains an unknown number of fields after the start of the record and before either
the sat math or sat verbal field. The sat math field may come before or after the sat
verbal field, and one or both of the fields may not be part of the record at all. For example:
(def-instance <University Name>
. . .
(sat verbal 500)
. . .
(sat math 475)
. . .))
Examine the first few records in the file, then skip to line 1000 or so. Note that the data set is not
uniform from beginning to end.
3. Close the data file, and open the UniversityMapper.java source file with your favorite text
editor.
$ vi UniversityMapper.java
4. The UniversityMapper.java file contains a number of TODO directives. Make the changes
necessary to address each TODO entry, and then save the file. Compare your results to what is
shown in the file in the UNIVERSITY_SOLUTION directory.
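For comparison while you work through the TODOs, here is one possible way to pull the SAT scores out of a record no matter where (or whether) those fields appear. This is a hypothetical sketch, not the lab's UniversityMapper, and it assumes the mapper receives record text as its input value:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: emit ("satm", score) and ("satv", score) pairs
// wherever the sat math / sat verbal fields occur in the input text.
public class SatScoreMapperSketch extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Pattern SAT_FIELD =
            Pattern.compile("\\(sat (math|verbal) (\\d+)\\)");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher m = SAT_FIELD.matcher(value.toString());
        while (m.find()) {
            String outKey = m.group(1).equals("math") ? "satm" : "satv";
            context.write(new Text(outKey),
                          new IntWritable(Integer.parseInt(m.group(2))));
        }
    }
}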
The other reducer will be given a list of key-value pairs that looks like this:
satm 400 500 510 . . .
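Whatever statistics the lab asks for per key, the reducer's job is to iterate over that list of scores. Below is a hypothetical sketch of the iteration pattern (not the lab's UniversityReducer; min, max, and mean are shown only as an example aggregate):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch: for each key ("satm" or "satv"), iterate over
// every score and keep running statistics.
public class SatScoreReducerSketch extends Reducer<Text, IntWritable, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        long sum = 0, count = 0;

        for (IntWritable value : values) {
            int score = value.get();
            min = Math.min(min, score);
            max = Math.max(max, score);
            sum += score;
            count++;
        }

        context.write(key, new Text("min=" + min + " max=" + max
                + " mean=" + ((double) sum / count)));
    }
}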
2. Implement each TODO in the UniversityReducer.java file as follows, just as you did for the
UniversityMapper.java file. Save your changes. Compare your changes to the file in the
UNIVERSITY_SOLUTION directory.
2. Implement each TODO in the UniversityDriver.java file, and save your changes. Compare
your changes to the file in the UNIVERSITY_SOLUTION directory.
If you get any errors that you can't resolve, it might help to check the output from your map phase
by setting the number of reduce tasks to 0 in your configuration (the mapreduce.job.reduces property).
2. Launch the rerun.sh script to execute the code.
$ ./rerun.sh
In this exercise, you will run the teragen and terasort MapReduce applications from the examples
provided in the Hadoop distribution. You will then examine the records produced from running each one.
2. Download and unzip the lab files into that directory, and position yourself in the directory created:
$ wget http://course-files.mapr.com/DEV3000/DEV301-v5.1-Lab5.zip
$ unzip DEV301-v5.1-Lab5.zip
$ cd DEV301-v5.1-Lab5
You should see three directories created when the lab file is unzipped: SLOW_LAB, VOTER_LAB,
and VOTER_SOLUTION.
3. Uncompress the data file for the VOTER_LAB:
$ gunzip VOTER_LAB/DATA/myvoter.csv.gz
4. Inject some faulty records into your data set. For example:
$ echo "0,anna,14,independent,100,100" >> VOTER_LAB/DATA/myvoter.csv
$ echo "0,anna,25" >> VOTER_LAB/DATA/myvoter.csv
Run teragen
1. Run the teragen MapReduce application to generate 1000 records:
$ hadoop2 jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-\
examples.jar teragen 1000 /user/user01/Lab5/TERA_IN
Lesson 5: Manage, Monitor, and Test MapReduce Jobs
Q: Why are there no input or output records for the reducer in the job output?
3. Examine the files produced by teragen and answer the questions below.
b. Why is the number of records we generated with teragen different than the total number of
lines in the files?
$ wc -l /mapr/<cluster>/user/user01/Lab5/TERA_IN/part-m-0000*
Run terasort
1. Run the terasort application to sort those records you just created and look at the job output.
$ hadoop2 jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-\
examples.jar terasort /user/user01/Lab5/TERA_IN \
/user/user01/Lab5/TERA_OUT
3. In the list of displayed jobs, scroll to the bottom to find your job (identified by the combination of
the Job Name, User, and application ID fields). Click the job.
4. Look at the terasort standard output to determine the following:
a. Look at the number of mappers launched. Is this equal to the number of input files?
b. Look at the number of map and reduce input and output records. When would the
number of map input records be different than the number of map output records?
c. Look at the number of combine input and output records. What does this imply about the
terasort application?
In this exercise, you will write the logic to identify a “bad” record in a data set, then define a custom
counter to count “bad” records from that data set. This is what a “good” record looks like:
1,david davidson,10,socialist,369.78,5108
There are 6 fields total – a primary key, name, age, party affiliation, and two more fields you don’t care
about. You will implement a record checker that validates that there are exactly 6 fields in the record, and
that the third field is a “reasonable” age for a voter.
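The validation logic itself goes behind the TODOs in VoterMapper.java. The sketch below only illustrates the general pattern (a field-count check, an age-range check, and a custom counter defined as an enum); the class name, counter names, age bounds, and output types are hypothetical, not the lab's.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustration only: hypothetical names, not the lab's VoterMapper.
public class RecordCheckingMapperSketch
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // custom counters are usually defined as an enum
    public enum BadRecords { WRONG_FIELD_COUNT, BAD_AGE }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");

        // rule 1: the record must have exactly 6 fields
        if (fields.length != 6) {
            context.getCounter(BadRecords.WRONG_FIELD_COUNT).increment(1);
            return;
        }

        // rule 2: the third field must be a "reasonable" voter age
        int age;
        try {
            age = Integer.parseInt(fields[2]);
        } catch (NumberFormatException e) {
            age = -1;
        }
        if (age < 18 || age > 120) {   // hypothetical bounds
            context.getCounter(BadRecords.BAD_AGE).increment(1);
            return;
        }

        // good record: emit (party, age) as described in the lab
        context.write(new Text(fields[3]), new IntWritable(age));
    }
}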
1. Change to the VOTER_LAB directory:
$ cd /mapr/<cluster>/user/user01/Lab5/VOTER_LAB
2. Open the VoterDriver.java file with the view command. What character separates the
keys from the values in the records? Close the file.
3. Open the VoterMapper.java file with your favorite editor. On which character is the value of the
record being tokenized? Keep the file open for the next step.
4. Locate the // TODO statements in the file, and implement the changes necessary to validate the
record. Then save the file.
5. Compile and execute the code, using rebuild.sh and rerun.sh. Based on the minimum,
maximum, and mean values for voter ages, what do you conclude about the nature of the data
set?
6. Examine the output in your terminal from the job to determine the number of bad records.
a. How many records have the wrong number of fields?
b. How many records have a bad age field?
c. Does the total number of bad records, plus the total number of reduce input records,
equal the total number of map input records?
In this exercise, you will generate standard error and log messages and then consume them in the MCS. A rough sketch of both logging approaches follows the list below.
• Instead of incrementing the bad record counter for incorrect number of tokens, write a
message to standard error. Include the bad record in the message.
• Instead of incrementing the bad record counter or writing to standard error for incorrect
number of tokens, write a message to syslog. Include the bad record in the message.
• Instead of incrementing the bad record counter or writing to standard error for invalid age,
write a message to syslog. Include the bad record in the message.
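A hypothetical sketch of the two mechanisms described in the list above; the logger shown assumes Apache Commons Logging (the API Hadoop itself uses), and the class name and message text are illustrative, not the lab's code:

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustration only: shows where the stderr and syslog messages would go.
public class LoggingMapperSketch extends Mapper<LongWritable, Text, Text, Text> {

    // Commons Logging output from a task ends up in the task's syslog file
    private static final Log LOG = LogFactory.getLog(LoggingMapperSketch.class);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");

        if (fields.length != 6) {
            // approach 1: standard error, which appears in the task's stderr log
            System.err.println("Bad record (wrong number of tokens): " + value);
            return;
        }
        // approach 2: the task syslog, via the logging framework, e.g.:
        // LOG.info("Bad record (invalid age): " + value);
    }
}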
2. Launch the MapReduce application and specify the sleep time (in ms).
$ ./rerun.sh "-DXmx1024m -D my.map.sleep=4000"
3. Display the job counter for MAPRFS_BYTES_READ. Replace <jobID> with the job ID shown in the
output from the previous command. Note: wait until you see that the job has started before
running this command.
$ mapred job -counter <jobID> \
org.apache.hadoop.mapreduce.FileSystemCounter MAPRFS_BYTES_READ
2. Scroll through the list and find the job using the Id, Name, and User. Click on the Id to view the
job history.
If the job hangs, change the memory allocation for filesystem in the warden.conf file:
service.command.mfs.heapsize.max=1024
In this exercise, the code to test the mapper using MRUnit is already provided. You will follow that
example to implement the reducer test.
Recall the VoterMapper map method emits the key-value pair: (party, age). For example, with input
"1,david davidson,20,socialist,369.78,5108" you should expect output (socialist, 20).
3. Test the map method against the test file – you should get a "success" message.
$ ./retest.sh map mymaptest.dat
4. Now edit the test file so that the input and expected output do not match.
5. Test the map method against the test file – this time you should get an exception
$ ./retest.sh map mymaptest.dat
2. Implement the TODO in the VoterTest.java file to write the unit test for the reducer.
4a. Is the number of mappers equal to the number of input files?
    Yes, there are two of each.
4b. When would the number of map input records be different than the number of map output records?
    If the map method is doing any sort of filtering (for example, dropping "bad" records).
4c. What does this imply about the terasort application?
    The terasort application does not use a combiner.
2.  What character separates the keys from the values in the records?
    The field separator character is a comma.
5.  Based on the minimum, maximum, and mean values for voter ages, what do you conclude about the nature of the data set?
    The minimum, maximum, and mean values for all parties (democratic, republican, green, etc.) are exactly the same. This is unlikely, and you should investigate to make sure your data is accurate.
6a. How many records have the wrong number of fields?
    The sample data already has one record with the wrong number of fields; the instructions have you add another.
6b. How many records have a bad age field?
    The sample data already has one record with a bad age field; the instructions have you add another.
4. Determine the aggregate map phase run time of the job. Connect to the JobHistoryServer using
the IP address of the node, at port 19888:
http://<IP address>:19888
5. Run it a few times more to establish a good baseline. Remove the output directory
/user/user01/Lab6/TERA_OUT_1 before each run. Here are some values for a few runs. Fill
in the aggregate map phase run times for your runs, below the ones given in the table.
Run 1 Run 2 Run 3 Run 4
4. Run it a few times more to establish a good test. Change the name of the output directory for
each rerun. Here are some values for a few runs: below those, fill in the results for your runs.
Note: There is a significant difference in the sample times shown in this table
between the first run and the rest of the runs. This is one reason we take several
samples when benchmarking. Without a clear explanation for the first sample, we
would probably discard it as a statistical outlier.
5. It appears that the change has impacted the amount of time spent in the map phase (which
makes sense given we are changing the io.sort.mb parameter). Calculate the change in
performance due to the modification. Here is the calculation with the sample numbers provided:
perform the same calculation with your test numbers.
• Average aggregate time, modified (not using outlier from Run 1):
= (305 + 349 + 321) / 3 = 325 seconds
• Performance differential:
= ((baseline - modified) / baseline) * 100
= ((129 – 325) / 129) * 100
= (-196 / 129) * 100 ≈ -151%
In other words, the modified job performs 151% slower than the baseline (takes 151% longer). If
the result is a positive number, then the modified job is faster than the baseline job.
In this exercise, you will create and populate a table in HBase to store the voter data from previous
exercises. You will then run a MapReduce program to calculate the usual maximum, minimum, and mean
values using data read from that table.
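The MapReduce program for this exercise is supplied with the lab files. For orientation only, a job that reads from an HBase (MapR-DB) table is typically wired up with TableMapReduceUtil, roughly as sketched below. The class names and column choices are assumptions, not the lab's code, though the table path and column family match the ones used in this lab.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Rough sketch of how a MapReduce job reads from an HBase (MapR-DB) table.
public class VoterTableJobSketch {

    // hypothetical mapper: reads the party and age columns and emits (party, age)
    public static class VoterTableMapperSketch
            extends TableMapper<Text, DoubleWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] age = value.getValue("cf2".getBytes(), "age".getBytes());
            byte[] party = value.getValue("cf2".getBytes(), "party".getBytes());
            if (age != null && party != null) {
                context.write(new Text(new String(party)),
                              new DoubleWritable(Double.parseDouble(new String(age))));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "voter table stats");
        job.setJarByClass(VoterTableJobSketch.class);

        Scan scan = new Scan();
        scan.addFamily("cf2".getBytes());   // family holding age and party

        TableMapReduceUtil.initTableMapperJob(
                "/user/user01/Lab7/myvoter_table",   // table path from this lab
                scan,
                VoterTableMapperSketch.class,
                Text.class,            // mapper output key (party)
                DoubleWritable.class,  // mapper output value (age)
                job);

        // reducer and output configuration omitted for brevity
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}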
8. Use the importtsv utility to import the data into the HBase table.
$ hadoop jar /opt/mapr/hbase/hbase-1.1.1/lib/hbase-server-\
1.1.1-mapr-1602.jar importtsv -Dimporttsv.columns=\
HBASE_ROW_KEY,cf1:name,cf2:age,cf2:party,cf3:contribution_amount,\
cf3:voter_number /user/user01/Lab7/myvoter_table \
/user/user01/Lab7/VOTERHBASE_SOLUTION/myvoter.tsv
9. Use the hbase command to validate the contents of the new table.
$ echo "scan '/user/user01/Lab7/myvoter_table'" | hbase shell
ROW COLUMN+CELL
1 column=cf1:name, timestamp=1406142938710, value=david
davidson
1 column=cf2:age, timestamp=1406142938710, value=49
1 column=cf2:party, timestamp=1406142938710, value=socialist
1 column=cf3:contribution_amount, timestamp=1406142938710,
value=369.78
1 column=cf3:voter_number, timestamp=1406142938710,
value=5108
10 column=cf1:name, timestamp=1406142938710, value=Oscar
xylophone
. . . <output omitted>
1000000 row(s) in 1113.9850 seconds
libertarian 47.0
republican 18.0
republican 77.0
republican 47.0
socialist 18.0
socialist 77.0
socialist 47.0
In this exercise, you will modify a MapReduce driver that launches two jobs. The first job calculates
minimum, maximum, and mean values for the SAT verbal and math scores. The second job calculates
the numerator and denominator for the Spearman correlation coefficient between the verbal and math
scores. The driver then calculates the correlation coefficient by dividing the numerator by the square root
of the denominator. The code for both MapReduce jobs has been provided.
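Purely as an illustration of the chaining pattern described above (and not the lab's driver), a driver that launches two jobs back to back usually just calls waitForCompletion() on each Job in turn and stops if either fails:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative two-job driver sketch; class and job names are placeholders.
public class TwoJobDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: calculate the mean SAT verbal and math scores
        Job meansJob = Job.getInstance(conf, "calculate means");
        // ... set jar, mapper, reducer, and input/output paths for job 1 ...
        if (!meansJob.waitForCompletion(true)) {
            System.exit(1);   // stop if the first job fails
        }

        // Job 2: calculate the numerator and the squared denominator,
        // using the results of the first job
        Job corrJob = Job.getInstance(conf, "correlation terms");
        // ... set jar, mapper, reducer, and input/output paths for job 2 ...
        if (!corrJob.waitForCompletion(true)) {
            System.exit(1);
        }

        // Finally, the driver divides the numerator by the square root of
        // the (squared) denominator to get the coefficient, e.g.:
        // double coefficient = numerator / Math.sqrt(denominatorSquared);
    }
}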
Programming Objective
Let X represent the SAT verbal scores and Y represent the SAT math scores. The first MapReduce job
calculates the mean values for X and Y, and the second MapReduce job calculates the numerator and
the squared value of the denominator. The driver you write must configure and launch both jobs and then
calculate the Spearman correlation coefficient.
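Expressed as a formula (assuming the standard product-moment form, which is what the solution output names in this exercise suggest: product_sumofsquares is the numerator sum, and var1_sumofsquares and var2_sumofsquares are the two sums of squares under the square root):

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \cdot \sum_i (y_i - \bar{y})^2}}

With the sample output values shown later in this lesson, this works out to 243128.0 / sqrt(259871.0 * 289679.0), or approximately 0.886.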
2. Download the lab file to your cluster, and extract the zip file.
$ wget http://course-files.mapr.com/DEV3000/DEV302-v5.1-Lab8.zip
$ unzip DEV302-v5.1-Lab8.zip
Lesson 8: Launch Jobs
product_sumofsquares is 243128.0
var1_sumofsquares is 259871.0
var2_sumofsquares is 289679.0
spearman's coefficient is 0.886130250066755
In this exercise, you will implement a MapReduce streaming application using the language of your
choice (Python or Perl). Guidance will be provided for building the application in the UNIX bash shell.
We return to the RECEIPTS data set to calculate the minimum, maximum, and mean, and the years
associated with those values.
minyear=""
maxyear=""
while read line
do
value=`echo $line | awk '{print $2}'`
if [ -n "$value" ]
then
year=`echo $value | awk -F_ '{print $1}'`
delta=`echo $value | awk -F_ '{print $2}'`
fi
if [ $delta -lt $min ]
then
min=$delta
minyear=$year
elif [ $delta -gt $max ]
then
max=$delta
maxyear=$year
fi
count=$(( count + 1 ))
sum=$(( sum + delta ))
done
mean=$(( sum / count ))
printf "min year is %s\n" "$minyear"
printf "min value is %s\n" "$min"
printf "max year is %s\n" "$maxyear"
printf "max value is %s\n" "$max"
printf "sum is %s\n" "$sum"
printf "count is %s\n" "$count"
printf "mean is %d\n" "$mean"
#!/usr/bin/env bash
USER=`whoami`
# 1) test map script
echo -e "1901 588 525 63 588 525 63" | ./receipts_mapper.sh | od –c
# 2) test reduce script
echo -e "summary\t1901_63" | ./receipts_reducer.sh | od –c
# 3) map/reduce on Hadoop
export JOBHOME=/user/$USER/9/STREAMING_RECEIPTS
export CONTRIB=/opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming
export STREAMINGJAR=hadoop-*-streaming.jar
export THEJARFILE=$CONTRIB/$STREAMINGJAR
rm -rf $JOBHOME/OUT