Big Data Analytics Lab Manual
(DS605PC)
Experiment 1: Create a Hadoop Cluster
Objective:
To set up and configure a Hadoop cluster for distributed data processing.
Requirements:
- One or more computers (or virtual machines)
- Linux operating system (e.g., Ubuntu, CentOS)
- Java Development Kit (JDK) installed on each machine
Procedure:
1. Install Hadoop on each machine:
- Download the Hadoop binary from the official Apache Hadoop website.
- Extract the files to a designated directory.
2. Edit the configuration files (a minimal sample is sketched after this procedure):
- core-site.xml: core Hadoop settings, such as the default filesystem URI (fs.defaultFS).
- hdfs-site.xml: HDFS settings, such as the replication factor and NameNode/DataNode directories.
- yarn-site.xml and mapred-site.xml: YARN resource management and MapReduce job settings.
3. Set up SSH key-based authentication to allow password-less login between machines.
4. Format the HDFS filesystem by running the 'hdfs namenode -format' command on the
master node.
5. Start the Hadoop daemons by executing the following on the master node:
- start-dfs.sh (starts the NameNode, SecondaryNameNode, and DataNode daemons)
- start-yarn.sh (starts the ResourceManager and NodeManager daemons)
6. Verify the setup with the 'jps' command and through the NameNode and ResourceManager web UIs.
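The following is a minimal configuration sketch, not the only valid setup. fs.defaultFS and dfs.replication are standard Hadoop properties; the host name master, port 9000, and the replication factor of 2 are placeholder values to be adapted to the actual cluster.

    <!-- core-site.xml: default filesystem URI ("master" is a placeholder host name) -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: replication factor (2 is an example value for a small cluster) -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    </configuration>

Steps 3-6 can then be carried out with commands along these lines, assuming a dedicated hadoop user and worker hosts named worker1, worker2, and so on (both names are placeholders):

    # generate a key pair on the master and copy it to every node
    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
    ssh-copy-id hadoop@worker1        # repeat for each worker

    # one-time formatting of the NameNode metadata (master only)
    hdfs namenode -format

    # start the HDFS and YARN daemons from the master
    start-dfs.sh
    start-yarn.sh

    # list the Java daemons running on each node
    jps

By default, the NameNode web UI listens on port 9870 (50070 on Hadoop 2.x) and the ResourceManager UI on port 8088.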
Expected Result:
A running Hadoop cluster capable of distributed data processing.
Key Notes:
- Master node manages the file system namespace, job scheduling, and resource allocation.
- Worker nodes store actual data and perform computations.
Advantages:
- Enables scalable and reliable data storage and processing.
- Fault tolerance through data replication across nodes.
- Efficient resource management using YARN.
Figure 1: Hadoop Cluster Architecture
Experiment 2: Implement MapReduce Job for Inverted Index
Objective
To implement a simple MapReduce job that builds an inverted index on the set of input
documents using Hadoop.
Requirements
- Hadoop installed and configured
- Java installed
- Input text files
- Basic knowledge of Java MapReduce programming
Procedure
1. Create input files containing sample text.
2. Write the Mapper class that tokenizes each input line and emits (word, document) pairs.
3. Write the Reducer class that collects, for each word, the list of documents containing it (a Java sketch follows this procedure).
4. Compile the Java classes and create a JAR file.
5. Run the job using the Hadoop command line and pass the input/output paths.
6. Check the output directory for the inverted index results.
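A minimal Java sketch of the Mapper, Reducer, and driver is given below. The class names (InvertedIndex, IndexMapper, IndexReducer) are illustrative, not prescribed by the manual. The Mapper tags every word with the name of the file it came from; the Reducer collects the distinct file names for each word.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class InvertedIndex {

        // Mapper: for every word in a line, emit (word, source file name)
        public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
            private final Text word = new Text();
            private final Text docId = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Name of the input file the current split belongs to
                String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
                docId.set(fileName);
                for (String token : value.toString().toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, docId);
                    }
                }
            }
        }

        // Reducer: collect the distinct document names for each word
        public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                Set<String> docs = new HashSet<>();
                for (Text v : values) {
                    docs.add(v.toString());
                }
                context.write(key, new Text(String.join(", ", docs)));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "inverted index");
            job.setJarByClass(InvertedIndex.class);
            job.setMapperClass(IndexMapper.class);
            job.setReducerClass(IndexReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The job can then be packaged into a JAR and run, for example:

    hadoop jar invertedindex.jar InvertedIndex /input /output

where invertedindex.jar, /input, and /output are example names and HDFS paths. With sample files doc1.txt and doc2.txt that both contain the word "hadoop", a line of output would look like: hadoop  doc1.txt, doc2.txt.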
Expected Result
A text-based inverted index where each word is mapped to a list of documents containing it.
Key Notes
- Demonstrates core MapReduce workflow
- Enhances understanding of text processing and Hadoop jobs
Advantages
- Facilitates faster search queries in distributed document sets
Figure 2: Word-Document mapping from MapReduce output
Experiment 3: Process Big Data in HBase
Objective
To store and retrieve large volumes of data using HBase, a distributed, scalable NoSQL
database.
Requirements
- Hadoop and HBase installed
- Sample dataset (e.g., student records)
- HBase shell or Java API
Procedure
1. Start Hadoop and HBase services.
2. Create a new table from the HBase shell: create 'students', 'info'.
3. Insert data with the put command: put 'students', '1', 'info:name', 'John'.
4. Retrieve data using the get and scan commands.
5. Perform updates and deletions from the HBase shell; the same operations are available through the Java API (see the sketch after this procedure).
6. Explore integration with MapReduce for batch processing.
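As a minimal sketch of the Java API route mentioned in step 5, the program below performs the same put and get as the shell commands above. It assumes the 'students' table from step 2 already exists, that hbase-site.xml is on the classpath so the client can reach the cluster, and that a recent HBase client library is used; the class name StudentStore is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StudentStore {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath to locate the cluster
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("students"))) {

                // Equivalent of: put 'students', '1', 'info:name', 'John'
                Put put = new Put(Bytes.toBytes("1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("John"));
                table.put(put);

                // Equivalent of: get 'students', '1'
                Result result = table.get(new Get(Bytes.toBytes("1")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println("info:name = " + Bytes.toString(name));
            }
        }
    }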
Expected Result
Data inserted, updated, and retrieved successfully using HBase shell or API.
Key Notes
- HBase provides random real-time read/write access to big data
- Flexible, column-family-based schema
Advantages
- Scalability and strong consistency for real-time read/write workloads
Figure 3: HBase Table with Column Families