Hadoop HDFS and MapReduce
LAB GUIDE
Hadoop Getting Started | Big Data Technologies | Oct 16, 2017
Login and Environment Setup
1. Start PuTTY on your system and enter the given IP address to connect to the Linux
server with Hadoop installed.
2. Log in with user id hadoopX and password huX, where X is your assigned number (e.g. hadoop1, hu1).
You can set the Hadoop environment variables by appending the following lines to the
~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
To do this, perform the following steps:
1. Type nano ~/.bashrc to open the file in the nano editor.
2. You will see that a number of lines of text are already present in the file. Take
care that you don't accidentally modify the existing content.
3. Press and hold the down arrow key to go to the end of the file, then press Enter to start a new line.
4. Copy (ctrl-c) the lines above (beginning with export) and paste them into the
nano window (in PuTTY, a right-click pastes the clipboard contents). These lines should appear at the end of the file in the editor.
5. Press ctrl-x to exit the editor, press y when prompted to save, and press Enter to confirm the file name.
6. To apply the changes to the shell environment, type the following command at the
bash prompt:
$ source ~/.bashrc
7. To verify that the changes have taken effect, type the following command at the
bash prompt:
$ hadoop version
This should show the version of Hadoop running (2.8.1) on the Linux server.
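As an optional sanity check, you can also confirm that the variables themselves were exported correctly. A minimal sketch, assuming the paths from the export lines above:
$ echo $HADOOP_HOME   # should print /usr/local/hadoop
$ which hadoop        # should resolve to a path under $HADOOP_HOME/bin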
Familiarizing yourself with HDFS
1. First, format the HDFS file system. Note that formatting erases any existing HDFS data, so do this only on first setup:
$ hdfs namenode -format
(The older form hadoop namenode -format still works in 2.8.1 but prints a deprecation warning.)
2. Start the distributed file system. The following command will start the namenode as
well as the datanodes as a cluster.
$ start-dfs.sh
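To confirm that the daemons came up, you can use the jps utility (bundled with the JDK), which lists the running Java processes. On a healthy single-node setup, you should see entries for NameNode, DataNode, and SecondaryNameNode:
$ jps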
3. Listing Files in HDFS
$ hadoop fs -ls
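Note that with no path argument, hadoop fs -ls lists your HDFS home directory (/user/<username>), which does not exist until you create it in the next step, so the command may report an error at this point. To list the HDFS root instead:
$ hadoop fs -ls /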
4. Make the HDFS directories required to execute MapReduce jobs (note that ~ is not expanded in HDFS paths, so absolute paths are used):
$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/<username>
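Alternatively, both directories can be created in a single command with the -p flag (available in Hadoop 2.x), which creates any missing parent directories:
$ hadoop fs -mkdir -p /user/<username>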
5. Create a data file, data.txt, in your local home directory, containing input data for
the example program:
$ cat /usr/local/hadoop/etc/hadoop/*.xml >> ~/data.txt
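You can optionally inspect the file before uploading it; it should contain the concatenated Hadoop configuration XML:
$ wc -l ~/data.txt
$ head -5 ~/data.txt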
6. Inserting Data into HDFS
Copy the file data.txt from the home directory of the local filesystem to the directory
/user/<username>/input in the HDFS filesystem.
a) Create an input directory in HDFS:
$ hadoop fs -mkdir /user/<username>/input
b) Copy the file from the local filesystem:
$ hadoop fs -put ~/data.txt /user/<username>/input
c) Verify that the file has been copied:
$ hadoop fs -ls /user/<username>/input
7. Run a MapReduce program from the set of example programs provided (replace <username> with your login, e.g. hadoop1):
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar grep /user/<username>/input /user/<username>/output 'dfs[a-z.]+'
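The grep example actually runs two chained jobs: the first extracts every match of the regular expression dfs[a-z.]+ from the input files, and the second sorts the matches by frequency. Once the jobs complete, you can list the output directory; by convention, a successful MapReduce job writes an empty _SUCCESS marker file alongside one part-r-NNNNN file per reducer:
$ hadoop fs -ls /user/<username>/output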
8. Retrieving Data from HDFS
The job above wrote its results to the directory /user/<username>/output in HDFS. Given below is a
simple demonstration of retrieving those files from the Hadoop file system.
Step 1
First, view the data in HDFS using the cat command.
$ hadoop fs -cat /user/<username>/output/*
Step 2
Get the files from HDFS to the local file system using the get command.
$ mkdir ~/output
$ hadoop fs -get /user/<username>/output/* ~/output
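To confirm the transfer, view the retrieved results on the local filesystem; each line should show a count followed by a matched string:
$ cat ~/output/*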
Shutting Down the HDFS
You can shut down HDFS by using the following command.
$ stop-dfs.sh
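As before, you can run jps to verify that the NameNode and DataNode processes are no longer listed:
$ jps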
Additional Reading:
1. You can find the complete list of HDFS commands here.
2. A detailed explanation of MapReduce and a complete description of the steps in
developing a MapReduce program can be found here.