Laboratory Manual
20IC404P: Big Data
Analytics And Computing Lab
COMPUTER ENGINEERING DEPARTMENT
SCHOOL OF TECHNOLOGY
Name of Student:
Roll No:
Branch: Sem./Year:
Academic Year:
PANDIT DEENDAYAL ENERGY UNIVERSITY
Raisan, Gandhinagar – 380 007, Gujarat, India
Computer Engineering Department
Certificate
This is to certify that
Mr./Ms. _____________________________________ Roll no.____________Exam
No. ______________________ of the 7th Semester of the Degree course in Information and
Communication Technology has satisfactorily completed his/her term work
in the subject Big Data Analytics Lab (20IC404P) during the semester
from _____________ to _____________ at School of Technology, PDEU.
Date of Submission:
Signature:
Faculty In-charge                                        Head of Department
Index
Name:
Roll No:
Exam No:
Sr. No. | Experiment Title | Date of Completion | Pages (From – To) | Marks (out of 10) | Sign.
(Date of Completion, Pages, Marks, and Sign. columns are to be filled in by the student.)
1 | To Explore Big Data Analytical Tools and Their Applications
2 | To install Hadoop and perform basic HDFS operations
3 | To perform HDFS Configuration related Commands
4 | To perform HBase Commands
5 | To explore Hive Fundamentals and perform operations on Databases, Tables, and Data Loading
6 | Install and Run Hive, then use Hive to create, alter, and drop databases, tables, views, functions, and indexes; create the HDFS tables, load them in Hive, and learn joining of tables in Hive
7 | Explore the programming architecture of the MapReduce API; run a basic Word Count MapReduce program to understand the MapReduce paradigm
8 | Develop a MapReduce work application
9 | Install, deploy, and configure an Apache Spark cluster; run Apache Spark applications using Scala
10 | Explore Spark MLlib and GraphX libraries
11 | Implement basic CRUD operations in MongoDB
12 | Data analytics using Apache Spark on the Amazon food dataset, e.g., find all pairs of items frequently reviewed together
COURSE OBJECTIVES
• Know the big data characteristics and challenges.
• Learn the tools and techniques of collection, integration, processing
and storage of big data.
• Understand real applications which use big data.
COURSE OUTCOMES
On completion of the course, students will be able to
CO1: List big data characteristics and challenges.
CO2: Understand essential features of the programming tools.
CO3: Practice big data programming tools.
CO4: Analyze computing problems for a given application.
CO5: Evaluate computing-based solutions.
CO6: Design applications using big data analytics.
ASSESSMENT
Continuous Assessment: 50 Marks
End Sem Lab Exam: VIVA (25 Marks) + Project (25 Marks)
Experiment 1: To Explore Big Data Analytical Tools and Their
Applications
AIM:
To explore and understand various Big Data analytical tools such as
Hadoop ecosystem components (HDFS, MapReduce, HBase, PIG, Hive),
Apache Spark, and MongoDB, along with their key features and
applications.
OBJECTIVES:
1. Understand the purpose and architecture of different Big Data tools.
2. Compare the functionalities of Hadoop ecosystem components,
Spark, and NoSQL databases.
3. Identify real-world use cases of these tools in industry.
MAPPED COURSE OUTCOME (CO):
CO1: List big data characteristics and challenges. (Primary Mapping)
CO2: Understand essential features of the programming tools. (Secondary
Mapping)
THEORY:
Big Data tools are designed to handle large-scale data processing, storage,
and analytics. Below is a brief overview of key tools:
1. Hadoop & HDFS
▪ Hadoop: Open-source framework for distributed storage and batch
processing using MapReduce.
▪ HDFS (Hadoop Distributed File System): Stores data across
multiple machines with fault tolerance.
▪ Applications: Log processing, recommendation systems (e.g.,
Amazon, LinkedIn).
2. MapReduce
▪ Programming model for processing large datasets in parallel across
clusters.
▪ Phases: Map (filtering/sorting) → Reduce (aggregation); see the worked word-count
trace after this list.
▪ Applications: Log analysis, web indexing (e.g., Google Search).
3. HBase
▪ Distributed, column-oriented NoSQL database built on HDFS.
▪ Provides real-time read/write access to large datasets.
▪ Applications: Messaging systems (e.g., Facebook Messenger), time-
series data.
4. PIG
▪ High-level scripting language for Hadoop data processing.
▪ Simplifies writing MapReduce jobs using Pig Latin.
▪ Applications: ETL (Extract, Transform, Load) operations, data
cleaning.
5. Hive
▪ Data warehousing tool with SQL-like queries (HiveQL).
▪ Runs on top of Hadoop for structured data analysis.
▪ Applications: Business intelligence, ad-hoc querying (e.g.,
Facebook analytics).
6. Apache Spark
▪ In-memory processing engine for fast analytics.
▪ Supports batch & real-time processing (Spark Streaming).
▪ Applications: Fraud detection (PayPal), recommendation engines
(Netflix).
7. MongoDB
▪ NoSQL database for unstructured/semi-structured data.
▪ Uses JSON-like documents for flexible storage.
▪ Applications: User profiles, IoT data (e.g., Uber, eBay).
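As referenced in the MapReduce overview above, the two phases are easiest to see on a
tiny, made-up input (a purely illustrative trace, not tied to any specific tool version):
Input line: big data big tools
Map emits: (big, 1), (data, 1), (big, 1), (tools, 1)
Shuffle/sort groups: (big, [1, 1]), (data, [1]), (tools, [1])
Reduce emits: (big, 2), (data, 1), (tools, 1)
The same pattern scales to very large inputs because many map and reduce tasks run in
parallel across the cluster; Experiment 7 implements this word count as an actual
MapReduce program.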
Task to be performed:
1. Research & Compare Tools
Fill a comparison table with:
▪ Tool Name
▪ Primary Use Case
▪ Key Features
▪ Industry Example
2. Case Study Analysis
Select one tool (e.g., Spark, MongoDB) and research how a company (e.g.,
Netflix, Uber) uses it.
3. Group Discussion
Discuss which tool is best suited for:
▪ Real-time analytics
▪ Large-scale batch processing
▪ Unstructured data storage
LEARNING OUTCOME:
After completing this experiment, students will be able to:
1. Explain the role of Hadoop, Spark, Hive, PIG, and MongoDB in Big
Data processing.
2. Compare different tools based on their use cases and performance.
3. Identify real-world applications of each tool in industry.
Experiment 2 – To install Hadoop and perform basic HDFS
operations
Practical
To deploy Hadoop in standalone mode and perform fundamental HDFS
file operations
Aim: To install Hadoop and configure it in standalone (single-node) mode for basic
operations and testing.
Objectives:
1. Install Hadoop on a Linux system
2. Configure Hadoop in standalone mode
3. Verify the installation and run basic HDFS commands
Theory: HDFS (Hadoop Distributed File System) is Hadoop's scalable,
fault-tolerant storage system designed for big data. It follows a master-
slave architecture with a NameNode (manages metadata) and DataNodes
(store data blocks). Files are split into large blocks (default 128MB) and
replicated (default 3x) across nodes for fault tolerance.
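As a worked example with illustrative numbers (assuming the default 128 MB block size
and replication factor 3): a 300 MB file is split into three blocks of 128 MB, 128 MB,
and 44 MB; each block is stored on three different DataNodes, so the cluster holds 9
block replicas and consumes roughly 900 MB of raw storage for 300 MB of user data.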
HDFS uses a write-once-read-many model, optimized for batch processing rather than
random writes. Commands resemble Linux file operations but use
hadoop fs or hdfs dfs prefixes. Key features include:
• High throughput for large files
• Data integrity via checksums
• Scalability across commodity hardware
Basic operations include directory management (mkdir, ls), file transfers
(put, get), and permissions (chmod, chown). Unlike local file systems,
HDFS prioritizes sequential access, making it ideal for MapReduce/Spark
workflows.
Standalone mode uses the local filesystem instead of HDFS, which makes it ideal for
learning the basic operations; the same hadoop fs / hdfs dfs command syntax applies in both modes.
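Before starting the tasks, it is worth confirming that the environment is ready. A
minimal check (assuming Java is installed and Hadoop's bin directory is on the PATH;
exact paths vary by installation):
java -version        # Hadoop requires a working Java runtime
hadoop version       # Confirms the Hadoop installation and build
echo $HADOOP_HOME    # Installation directory, if the variable is set
If any of these fail, revisit the installation and environment-variable setup before
continuing.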
Tasks to be Completed
Open Terminal and perform the given tasks:
1. Check HDFS Version
Displays the installed Hadoop and HDFS version details.
Syntax: hdfs version
Output:
2. help - Command Assistance
Purpose: Get help for any HDFS command.
Syntax:
hdfs dfs -help <COMMAND_NAME>
Example:
hdfs dfs -help put
Output:
Usage: hdfs dfs [generic options] -put [-f] [-p] <localsrc> ... <dst>
Copy files from local to HDFS.
...
3. ls - List Files/Directories
Purpose: List contents of an HDFS directory.
Syntax:
hdfs dfs -ls <HDFS_PATH>
Example:
hdfs dfs -ls /
Output:
4. mkdir - Create Directory
Purpose: Create a new directory in HDFS.
Syntax:
hdfs dfs -mkdir <HDFS_DIR_PATH>
Example:
hdfs dfs -mkdir /bigdatatesting
Verification:
hdfs dfs -ls /
Output:
5. touchz - Create Empty File
Purpose: Create a 0-byte file in HDFS.
Syntax:
hdfs dfs -touchz <HDFS_FILE_PATH>
Example:
hdfs dfs -touchz /bigdatatesting/test.dat
Verification:
hdfs dfs -ls /bigdatatesting
Output:
-rw-r--r-- 1 user supergroup 0 2023-01-01 12:36
/bigdatatesting/test.dat
6. copyFromLocal (or put) - Upload Files
Purpose: Copy files from local filesystem to HDFS.
Syntax:
hdfs dfs -copyFromLocal <LOCAL_FILE> <HDFS_PATH>
# OR
hdfs dfs -put <LOCAL_FILE> <HDFS_PATH>
Example:
echo "Hello HDFS" > localfile.txt
hdfs dfs -copyFromLocal localfile.txt /bigdatatesting/
Verification:
hdfs dfs -ls /bigdatatesting
Output:
-rw-r--r-- 1 user supergroup 10 2023-01-01 12:37
/bigdatatesting/localfile.txt
7. cat - View File Content
Purpose: Display contents of an HDFS file.
Syntax:
hdfs dfs -cat <HDFS_FILE_PATH>
Example:
hdfs dfs -cat /bigdatatesting/localfile.txt
Output:
Hello HDFS
8. copyToLocal (or get) - Download Files
Purpose: Copy files from HDFS to local filesystem.
Syntax:
hdfs dfs -copyToLocal <HDFS_FILE> <LOCAL_PATH>
# OR
hdfs dfs -get <HDFS_FILE> <LOCAL_PATH>
Example:
hdfs dfs -copyToLocal /bigdatatesting/localfile.txt ~/Downloads/
Verification:
cat ~/Downloads/localfile.txt
Output:
Hello HDFS
9. du - Check File Size
Purpose: Show file/directory size in bytes.
Syntax:
hdfs dfs -du [-s] <HDFS_PATH>
Example:
hdfs dfs -du -s /bigdatatesting/localfile.txt
Output:
10 10 /bigdatatesting/localfile.txt # (file size | total size across replicas | path)
10. mv - Move/Rename Files
Purpose: Move or rename files in HDFS.
Syntax:
hdfs dfs -mv <SOURCE_PATH> <DEST_PATH>
Example:
hdfs dfs -mv /bigdatatesting/localfile.txt
/bigdatatesting/renamedfile.txt
Verification:
hdfs dfs -ls /bigdatatesting
Output:
-rw-r--r-- 1 user supergroup 10 2023-01-01 12:37
/bigdatatesting/renamedfile.txt
11. cp - Copy Files in HDFS
Purpose: Copy files within HDFS.
Syntax:
hdfs dfs -cp <SOURCE_PATH> <DEST_PATH>
Example:
hdfs dfs -cp /bigdatatesting/renamedfile.txt /backup/
Verification:
hdfs dfs -ls /backup
Output:
-rw-r--r-- 1 user supergroup 10 2023-01-01 12:38
/backup/renamedfile.txt
12. rm - Delete File
Purpose: Remove a file from HDFS.
Syntax:
hdfs dfs -rm <HDFS_FILE_PATH>
Example:
hdfs dfs -rm /bigdatatesting/renamedfile.txt
Output:
Deleted /bigdatatesting/renamedfile.txt
Output Result
//Attach screenshots of the output after every command.
Learning Outcome
1. Perform basic file operations in HDFS
2. Understand HDFS directory structure
3. Navigate HDFS web interface
4. Differentiate between local FS and HDFS operations
Experiment 3: To Perform HDFS Configuration
Related Commands
Practical
To execute essential HDFS configuration and administration commands
for cluster setup, monitoring, and optimization.
Aim
To perform critical HDFS configuration and file management operations
for effective cluster administration.
Objectives
a) Verify and modify HDFS configurations
b) Monitor storage utilization and replication
c) Manage permissions and system health
d) Handle compressed data operations
Theory
HDFS configuration commands allow administrators to:
• Check active settings (getconf, classpath)
• Control storage (dfsadmin, du, setrep)
• Maintain health (fsck, safemode)
• Secure data (chmod)
• Optimize transfers (compression commands)
HDFS maintains data reliability through replication, storing 3 copies of
each block by default for fault tolerance. The system uses a 128MB default
block size, which directly impacts both storage efficiency and MapReduce
task performance. For security, HDFS implements POSIX-style
permissions to control file and directory access. Additionally, built-in
compression support reduces storage requirements and network transfer
overhead while maintaining data accessibility.
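The block size and replication factor described above can also be overridden for a
single upload using the generic -D options. A minimal sketch with illustrative values
(here 64 MB blocks and 2 replicas; assumes a local file data.txt and an existing /data
directory in HDFS):
hdfs dfs -D dfs.blocksize=67108864 -D dfs.replication=2 -put data.txt /data/
hdfs fsck /data/data.txt -files -blocks   # shows the block count and replication of the uploaded file
These options affect only this command; the cluster-wide defaults stay as configured in
hdfs-site.xml.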
Tasks to be Done
1. Verify Configuration
Command: hdfs getconf -confKey dfs.blocksize
Purpose: Check active block size setting
Output: 134217728 (128MB in bytes)
2. Storage Report
Command: hdfs dfsadmin -report
Purpose: View cluster capacity and node status
Key Output: Live nodes, used/remaining space
3. Change Replication
Command: hdfs dfs -setrep -w 2 /data/important
Purpose: Reduce replication factor to 2
4. Check Disk Usage
Command: hdfs dfs -du -h /user
Purpose: Analyze storage consumption
Sample Output: 2.1G 6.3G /user/hadoop # (actual | replicated size)
5. Filesystem Health Check
Command: hdfs fsck / -blocks -locations
Purpose: Verify block health and placement
Critical Output: Under/Missing/Corrupt blocks
6. Modify Permissions
Command: hdfs dfs -chmod 750 /secure/data
Purpose: Restrict directory access
Validation: hdfs dfs -ls /secure
7. View File Head or Tail
a) Command: hdfs dfs -cat /logs/app.log | head -10
Purpose: Inspect the first lines of a file
Output: First 10 lines of the log
b) Command: hdfs dfs -tail /logs/app.log
Purpose: Inspect the end of a file (last 1 KB)
Output: Last portion of the log
8. Safemode Check
a) Command: hdfs dfsadmin -safemode get
Purpose: Verify safemode state
Output: Safe mode is OFF
b) Enter Safemode (Read-only Mode)
Purpose: Force HDFS into maintenance state (blocks modifications)
Command: hdfs dfsadmin -safemode enter
Output: Safe mode is ON
c) Exit Safemode (Normal Operations)
Purpose: Resume write operations
Command: hdfs dfsadmin -safemode leave
Output: Safe mode is OFF
9. Find Files
Command: hdfs dfs -find /data -name "*.csv"
Purpose: Locate files by pattern
Output: Full HDFS paths of matching files
10. Compressed File Transfer
Create and Upload:
gzip data.txt
hdfs dfs -put data.txt.gz /compressed/
Download and Extract:
hdfs dfs -get /compressed/data.txt.gz ~/
gunzip ~/data.txt.gz
11. Check Classpath
Command: hadoop classpath
Purpose: Verify library paths
Output: All JARs/config paths separated by :
Explore the Hadoop configuration files by changing to the configuration directory
(a command-line cross-check of one of these properties is shown after task 12).
a) Configuration files
core-site.xml: common properties shared across the Hadoop ecosystem (e.g., fs.defaultFS)
hdfs-site.xml: properties specific to the HDFS services (NameNode/DataNode settings)
Command:
cd /etc/hadoop/conf
cat core-site.xml
12. View Compressed Data
Command: hadoop fs -text /compressed/data.txt.gz | head -5
Purpose: Inspect without full download
Output: First 5 uncompressed lines
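To tie the configuration files from the previous step back to the getconf command in
task 1, one of the core-site.xml properties can be read directly from the command line
(the output depends on the installation; the values shown are typical examples):
Command: hdfs getconf -confKey fs.defaultFS
Purpose: Show the default filesystem URI configured in core-site.xml
Output: hdfs://localhost:8020 on a single-node HDFS setup, or file:/// in standalone mode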
Output Result
//Attach screenshots of the output after every command.
Learning Outcomes
a) Configure HDFS parameters and environment
b) Monitor and optimize cluster storage
c) Administer permissions and health checks
d) Efficiently handle compressed data
e) Troubleshoot using configuration tools
Experiment 4: To Perform Operations on HBase
Practical
To execute essential HBase commands for table management, data operations, and cluster
administration.
Aim
To perform CRUD operations, table configurations, and cluster monitoring
in HBase using shell commands.
Objectives
a) Start/stop HBase and access the HBase shell
b) Create, describe, and manage HBase tables
c) Perform data operations (insert, read, update, delete)
d) Monitor cluster status and table health
e) Use advanced operations (truncate, versioning)
Theory
HBase is a distributed, column-oriented NoSQL database built on HDFS. Its schema is
flexible: column families are fixed at table creation, but columns within them can be
added at write time. Key features include:
i. Column-oriented storage (data grouped by column families)
ii. Automatic sharding (tables split into regions distributed across
RegionServers)
iii. Strong consistency for row-level operations
iv. Real-time read/write access with low latency
v. Data versioning (multiple cell versions with timestamps)
Core components of HBase are:
i. HMaster: Manages metadata, load balancing, and failover.
ii. RegionServer: Handles read/write requests for regions.
iii. ZooKeeper: Coordinates cluster state and configuration.
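As a small illustration of data versioning (feature v above), the following hbase shell
sketch uses a hypothetical table and values; task 5(c) below repeats the same idea on
the Student_Rollno table:
create 'Sensor_Demo', {NAME => 'readings', VERSIONS => 3}
put 'Sensor_Demo', 'R001', 'readings:temp', '21.5'
put 'Sensor_Demo', 'R001', 'readings:temp', '22.0'
put 'Sensor_Demo', 'R001', 'readings:temp', '22.4'
get 'Sensor_Demo', 'R001', {COLUMN => 'readings:temp', VERSIONS => 3}
The final get returns all three timestamped values, newest first, because the column
family retains up to three versions of each cell.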
Tasks to be Done
1. Start HBase Services & Access Shell
sudo hbase master start #Start HMaster
sudo hbase regionserver start #Start RegionServer
hbase shell # Enter HBase shell
Purpose: Initialize HBase cluster and access the command interface.
2. Cluster Status & Version Check
status # Check cluster health (live RegionServers)
version # Display HBase version
whoami # Show current HBase user
Purpose: Verify cluster status and environment details.
3. Table Management
a) Create Table:
create 'Student_Rollno', 'personal', 'academic'
Purpose: Create table Student_Rollno with column families personal and
academic.
b) List Tables:
list #Verify table creation
c) Describe Table:
describe 'Student_Rollno'
Purpose: Show column family configurations (VERSIONS, BLOCKCACHE,
etc.).
d) Disable/Enable Table:
disable 'Student_Rollno' # Prepare for schema changes
enable 'Student_Rollno' # Reactivate table
Validation:
is_disabled 'Student_Rollno' #false
is_enabled 'Student_Rollno' # true
e) Drop Table:
Note: Create “Emp_Temp” table with suitable column families and drop
it.
disable 'Emp_Temp'
drop 'Emp_Temp'
Purpose: Permanently delete table.
4. Data Operations
a) Insert Student Data:
put 'Student_Rollno', 'S001', 'personal:name', 'Rahul Sharma'
put 'Student_Rollno', 'S001', 'personal:city', 'Mumbai'
put 'Student_Rollno', 'S001', 'academic:cgpa', '9.2'
put 'Student_Rollno', 'S001', 'academic:course', 'B.Tech'
Purpose: Add a student record with roll number S001. Now add one more
student record with roll number S002.
b) Read Data:
get 'Student_Rollno', 'S001' #Read full record
get 'Student_Rollno', 'S001', {COLUMN => 'academic:cgpa'} # Read CGPA
only
c) Update Data:
put 'Student_Rollno', 'S001', 'academic:cgpa', '9.5' # Update CGPA
Purpose: Modify existing student record.
d) Scan Table:
scan 'Student_Rollno' # View all student records
scan 'Student_Rollno', {LIMIT => 1} # View first record
Note: You can try different values of LIMIT after adding more rows to the
table.
e) Delete Data:
delete 'Student_Rollno', 'S001', 'personal:city' # Delete city info
deleteall 'Student_Rollno', 'S001' # Delete entire record
5. Advanced Operations
a) Count Records:
Note: Add a few more student records (S003, S004).
count 'Student_Rollno'
Output: Number of rows in table
b) Truncate Table:
truncate 'Student_Rollno' # Wipe all data (disable->drop->recreate)
c) Data Versioning:
# Requires VERSIONS>1 in column family
#To alter the column family to store 3 versions, use the command below:
#alter 'Student_Rollno', NAME => 'academic', VERSIONS => 3
#then update the cgpa at least 3 times using the put command (choose your own values)
get 'Student_Rollno', 'S001', {COLUMN => 'academic:cgpa', VERSIONS => 3}
#Check the VERSIONS setting using describe
Purpose: Retrieve previous CGPA values.
Output: Displays current and historical values.
6. Cluster Administration
a) Table Existence Check:
exists 'Student_Rollno' # true
b) Bulk Operations:
disable_all 'Student_.*' # Disable multiple student tables
drop_all 'Student_.*' # Drop multiple student tables
c) Exit Shell & Stop Cluster:
exit # Leave HBase shell
./bin/stop-hbase.sh # Stop HBase (from HBase home directory)
Output Result
// Attach screenshots of all command executions showing:
1. Table creation and description
2. Data insertion and retrieval
3. Update and delete operations
4. Cluster status checks
5. Truncate and versioning operations
Learning Outcomes
a) Start/stop HBase services and navigate HBase shell
b) Design and manage tables with custom column families
c) Perform CRUD operations on student records
d) Monitor cluster health and table states
e) Implement data versioning and bulk operations
f) Administer table lifecycle (create/enable/truncate/drop)
Experiment 5: Hive Fundamentals - Databases,
Tables, and Data Loading
Aim: To perform core Hive operations including
creating/altering/dropping databases and tables, and loading data from
local files using both LOAD DATA and CTAS methods.
Objectives:
1. Manage Hive databases (Create, Use, Alter, Drop).
2. Manage Hive tables (Create, Describe, Alter, Drop).
3. Load data using LOAD DATA LOCAL INPATH.
4. Create tables and load data using CTAS (Create Table As Select).
CO Mapping:
CO3: Practice big data programming tools.
CO4: Analyze computing problems for a given application.
1. Database Operations
Syntax & Examples:
-- 1. List all databases
SHOW DATABASES;
-- 2. Create a new database
CREATE DATABASE sales_db;
-- 3. Use the database
USE sales_db;
-- 4. Verify current database
SELECT current_database();
-- 5. (Optional) Alter database to add a property
ALTER DATABASE sales_db SET DBPROPERTIES
('created_by'='student');
-- 6. (Optional) Drop database (use with caution!)
-- DROP DATABASE sales_db CASCADE;
2. Managed Table Operations
Syntax & Examples:
-- 1. Create a managed table 'customers'
CREATE TABLE customers (
cust_id INT,
first_name STRING,
last_name STRING,
city STRING,
signup_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
-- 2. Show tables in the current database
SHOW TABLES;
-- 3. View detailed table structure
DESCRIBE FORMATTED customers;
-- 4. Alter table to add a new column
ALTER TABLE customers ADD COLUMNS (email STRING);
-- 5. Rename the table
ALTER TABLE customers RENAME TO customer_info;
-- 6. (Optional) Drop the table
-- DROP TABLE customer_info;
3. Loading Data with LOAD DATA LOCAL INPATH
Steps:
Create a local data file (/home/cloudera/customer_data.csv):
1001|Aman|Sharma|Delhi|2023-02-10
1002|Bobby|Singh|Agra|2022-11-15
1003|Ketan|Patel|Gandhinagar|2023-01-20
1004|Riya|Shah|Delhi|2022-12-05
Load data into Hive table:
LOAD DATA LOCAL INPATH '/home/cloudera/customer_data.csv'
INTO TABLE customer_info;
-- Verify data loading
SELECT * FROM customer_info LIMIT 5;
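To see where the managed table physically stores its data (learning outcome d), the
table location can be checked and browsed. A minimal sketch, assuming the default Hive
warehouse directory /user/hive/warehouse (the actual path appears in the Location:
field of DESCRIBE FORMATTED):
-- Inside Hive: the Location: field shows the table's HDFS directory
DESCRIBE FORMATTED customer_info;
-- From a terminal outside Hive, list the files under that location:
-- hdfs dfs -ls /user/hive/warehouse/sales_db.db/customer_info
The loaded file appears under this directory because LOAD DATA LOCAL copies the local
file into the table's warehouse location.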
4. Creating Tables with CTAS (Create Table As Select)
Syntax:
CREATE TABLE <new_table_name>
[STORED AS <file_format>]
AS
SELECT <column_list>
FROM <source_table>
[WHERE <condition>];
Example:
-- Create a new table 'delhi_customers' containing only Delhi customers
CREATE TABLE delhi_customers
STORED AS TEXTFILE
AS
SELECT cust_id, first_name, last_name, signup_date
FROM customer_info WHERE city = 'Delhi';
-- Verify the new table and its data
SELECT * FROM delhi_customers;
Tasks to be Performed by Students
[Note: Write the commands and attach screenshot of outputs]
1. Database Management:
o Create a database named inventory_db.
o Switch to this database.
2. Table Management:
o Create a managed table named products with this schema:
▪ product_id (INT)
▪ product_name (STRING)
▪ category (STRING)
▪ price (DECIMAL(10,2))
▪ stock (INT)
o Use comma , as the field terminator.
o Rename the table to product_details.
3. Data Loading with LOAD DATA:
o Create a local file /home/cloudera/product_data.csv:
2001,Laptop,Electronics,55000.00,25
2002,Smartphone,Electronics,25000.00,50
2003,Desk,Furniture,12000.00,15
2004,Chair,Furniture,5000.00,30
o Load this data into product_details.
o Show all records from the table.
4. Data Loading with CTAS:
o Create a new table electronics_stock using CTAS.
o It should contain only product_name and stock for items in
the 'Electronics' category.
o Query the new table to verify.
Learning Outcome
After completing this experiment, students will be able to:
a) Perform full lifecycle management of Hive databases and tables.
b) Load data from local files into Hive using LOAD DATA.
c) Use CTAS to create new tables from existing data.
d) Understand the physical storage location of managed tables.