
Week 1

Downloading and installing Hadoop; understanding different Hadoop modes; start-up scripts; configuration files.

Install JDK on Ubuntu

The Hadoop framework is written in Java, and its services require a compatible Java Runtime
Environment (JRE) and Java Development Kit (JDK). Use the following command to update
your system before initiating a new installation:

sudo apt update

At the moment, Apache Hadoop 3.x fully supports Java 8 and 11. The OpenJDK 8 package in
Ubuntu contains both the runtime environment and development kit.

Type the following command in your terminal to install OpenJDK 8:

sudo apt install openjdk-8-jdk -y

Once the installation process is complete, verify the current Java version:

java -version; javac -version

The output informs you which Java version is in use.
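For OpenJDK 8, the output should resemble the lines below; the exact update and build numbers depend on your system:

openjdk version "1.8.0_392"
OpenJDK Runtime Environment (build 1.8.0_392-8u392-b08)
OpenJDK 64-Bit Server VM (build 25.392-b08, mixed mode)
javac 1.8.0_392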

Set Up Hadoop User and Configure SSH

It is advisable to create a non-root user, specifically for the Hadoop environment. A distinct user
improves security and helps you manage your cluster more efficiently. To ensure the smooth
functioning of Hadoop services, the user should have the ability to establish a passwordless SSH
connection with the localhost.

Install OpenSSH on Ubuntu

Install the OpenSSH server and client using the following command:

sudo apt install openssh-server openssh-client -y

Create Hadoop User

Utilize the adduser command to create a new Hadoop user:

sudo adduser hdoop

The username, in this example, is hdoop. You are free to use any username and password you
see fit.
Switch to the newly created user and enter the corresponding password:

su - hdoop

The user now needs to be able to SSH to the localhost without being prompted for a password.

Enable Passwordless SSH for Hadoop User

Generate an SSH key pair and define the location it is to be stored in:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

The system proceeds to generate and save the SSH key pair.

Use the cat command to store the public key as authorized_keys in the .ssh directory:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Set the file permissions for your user with the chmod command:

chmod 0600 ~/.ssh/authorized_keys

The new user can now SSH without entering a password every time. Verify everything is set up
correctly by using the hdoop user to SSH to localhost:

ssh localhost

After an initial prompt, the Hadoop user can seamlessly establish an SSH connection to the
localhost.

Download and Install Hadoop on Ubuntu

After configuring the Hadoop user, you are ready to install Hadoop on your system. Follow the
steps below:

Use the provided mirror link and download the Hadoop package using the wget command:

wget [Link]

Once the download completes, use the tar command to extract the hadoop-3.4.0.tar.gz archive and
initiate the Hadoop installation:

tar xzf hadoop-3.4.0.tar.gz

The Hadoop binary files are now located within the hadoop-3.4.0 directory.
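If you want to confirm the extraction worked before configuring anything, you can print the Hadoop version directly from the extracted directory (the path below assumes you downloaded and extracted the archive in the hdoop user's home directory):

~/hadoop-3.4.0/bin/hadoop version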

Single Node Hadoop Deployment (Pseudo-Distributed Mode)


Hadoop excels when deployed in a fully distributed mode on a large cluster of networked
servers. However, if you are new to Hadoop and want to explore basic commands or test
applications, you can configure Hadoop on a single node.

This setup, also called pseudo-distributed mode, allows each Hadoop daemon to run as a single
Java process. Configure a Hadoop environment by editing a set of configuration files:

.bashrc

hadoop-env.sh

core-site.xml

hdfs-site.xml

mapred-site.xml

yarn-site.xml

Configure Hadoop Environment Variables (bashrc)

The .bashrc config file is a shell script that initializes user-specific settings, such as environment
variables, aliases, and functions, every time a new Bash shell session is started. Follow the steps
below to configure Hadoop environment variables:

1. Open the .bashrc shell configuration file using a text editor of your choice (we will use nano):

nano .bashrc

2. Define the Hadoop environment variables by adding the following content to the end of the
file:

# Hadoop Environment Variables

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

export HADOOP_HOME=/home/hdoop/hadoop-3.4.0

export HADOOP_INSTALL=$HADOOP_HOME

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

3. Once you add the variables, save and exit the .bashrc file.

4. Run the command below to apply the changes to the current running environment:

source ~/.bashrc
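To confirm that the variables are active in the current shell, echo one of them; with the paths used in this guide the output should be:

echo $HADOOP_HOME
/home/hdoop/hadoop-3.4.0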

Edit hadoop-env.sh File

The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and
Hadoop-related project settings. When setting up a single-node Hadoop cluster, you need to
define which Java implementation will be utilized.

Follow the steps below:

1. Use the previously created $HADOOP_HOME variable to access the hadoop-env.sh file:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
2. Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full path to
the OpenJDK installation on your system. If you have installed the same version as
presented in the first part of this tutorial, add the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
The path needs to match the location of the Java installation on your system.

If you need help locating the correct Java path, run the following command in your terminal
window.

which javac

The resulting output provides the path to the Java binary directory.

3. Use the provided path to find the OpenJDK directory with the following command:

readlink -f /usr/bin/javac

The section of the path just before the /bin/javac directory needs to be assigned to the
$JAVA_HOME variable.
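For example, with the OpenJDK 8 package installed earlier, readlink typically returns the path shown below, so everything before /bin/javac becomes the value of JAVA_HOME (this matches the export used in .bashrc above):

readlink -f /usr/bin/javac
/usr/lib/jvm/java-8-openjdk-amd64/bin/javac

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64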

Edit core-site.xml File


The core-site.xml file defines HDFS and Hadoop core properties. To set up Hadoop in pseudo-
distributed mode, you need to specify the URL for your NameNode and the temporary directory
Hadoop uses for the map and reduce process.

The steps below show how to configure the file.

1. Open the core-site.xml file in a text editor:


nano $HADOOP_HOME/etc/hadoop/core-site.xml

2. Add the following configuration to override the default values for the temporary directory and
add your HDFS URL to replace the default local file system setting:

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>

</property>

<property>

<name>hadoop.tmp.dir</name>

<value>/home/hdoop/tmpdata</value>

</property>

</configuration>
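Hadoop will normally create the temporary directory on first use, but you can create it ahead of time as the hdoop user to rule out permission issues; the path matches the hadoop.tmp.dir value above:

mkdir -p /home/hdoop/tmpdata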

Edit hdfs-site.xml File

The hdfs-site.xml file specifies critical parameters, such as data storage paths,
replication settings, and block sizes, which govern the behavior and performance of the HDFS
cluster. Configure the file by defining the NameNode and DataNode storage directories.
Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single-
node setup.

Follow the steps below:

1. Run the following command to open the hdfs-site.xml file for editing:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

2. Add the following configuration to the file and, if needed, adjust the NameNode and
DataNode directories to your custom locations:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop_space/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop_space/hdfs/datanode</value>
</property>
</configuration>
Edit mapred-site.xml File
The mapred-site.xml file defines settings for the MapReduce framework, including
parameters such as the job tracker address, the number of map and reduce tasks, and
resource management, which control how MapReduce jobs are executed across the cluster.
Follow the steps below to configure the mapred-site.xml file:
1. Use the following command to access the mapred-site.xml file and define
MapReduce values:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
2. Add the following configuration to change the default MapReduce framework name
value to yarn:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Edit yarn-site.xml File
The yarn-site.xml file defines YARN settings. It contains configurations for the Node
Manager, Resource Manager, Containers, and Application Master.
1. Open the yarn-site.xml file in a text editor:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
2. Append the following configuration to the file:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR</value>
</property>
</configuration>
Create the NameNode and DataNode directories specified in hdfs-site.xml and give ownership to the hdoop user:
sudo mkdir -p /usr/local/hadoop_space/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_space/hdfs/datanode
sudo chown -R hdoop:hdoop /usr/local/hadoop_space
Format HDFS NameNode
It is important to format the NameNode before starting Hadoop services for the first time:
hdfs namenode -format
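If the format succeeds, the command prints a series of INFO messages and ends with a shutdown notice; look for a line similar to the one below (the exact wording varies between Hadoop versions):

INFO common.Storage: Storage directory /usr/local/hadoop_space/hdfs/namenode has been successfully formatted.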
Start Hadoop Cluster
Starting a Hadoop cluster involves initializing the key services - HDFS for distributed
storage and YARN for resource management. This enables the system to process and
store large-scale data across multiple nodes.
Follow the steps below:
1. Navigate to the hadoop-3.4.0/sbin directory and execute the following command to
start the NameNode and DataNode:
./start-dfs.sh
2. Once the NameNode, DataNodes, and secondary NameNode are up and running, start
the YARN resource and node managers by typing:

./start-yarn.sh

3. Type the following command to check if all the daemons are active and
running as Java processes:

jps
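If every daemon started correctly, jps lists the five Hadoop processes plus Jps itself; the process IDs below are only examples and will differ on your machine:

11852 NameNode
12006 DataNode
12230 SecondaryNameNode
12487 ResourceManager
12632 NodeManager
12983 Jps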
Access Hadoop from Browser

Use your preferred browser and navigate to your localhost URL or IP. The default port
number 9870 gives you access to the Hadoop NameNode UI:

http://localhost:9870

The default port 9864 is used to access individual DataNodes directly from your browser:

http://localhost:9864

The YARN Resource Manager is accessible on port 8088:

http://localhost:8088

Conclusion

You have successfully installed Hadoop on Ubuntu and deployed it in pseudo-distributed
mode. A single-node Hadoop deployment is an excellent starting point for exploring basic
HDFS commands and acquiring the experience you need to design a fully distributed
Hadoop cluster.
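As a quick first exercise against the running cluster, you can try a few basic HDFS commands as the hdoop user; the directory name below is just an example:

hdfs dfs -mkdir -p /user/hdoop/input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hdoop/input
hdfs dfs -ls /user/hdoop/input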
