
Laboratory Manual

20IC404P: Big Data Analytics and Computing Lab

COMPUTER ENGINEERING DEPARTMENT

SCHOOL OF TECHNOLOGY

Name of Student:

Roll No:

Branch: Sem./Year:

Academic Year:
PANDIT DEENDAYAL ENERGY UNIVERSITY
Raisan, Gandhinagar – 380 007, Gujarat, India

Computer Engineering Department

Certificate
This is to certify that Mr./Ms. _____________________________________ Roll No. ____________ Exam No. ______________________ of 7th Semester Degree course in Information and Communication Technology has satisfactorily completed his/her term work in the subject Big Data Analytics Lab (20IC404P) during the semester from _____________ to _____________ at School of Technology, PDEU.

Date of Submission:

Signature:
Faculty In-charge Head of Department
Index
Name:

Roll No:

Exam No:

Columns: Sr. No. | Experiment Title | Date of Completion | Pages (From, To) | Marks (out of 10) | Sign.

1. To Explore Big Data Analytical Tools and Their Applications
2. To install Hadoop and perform basic HDFS operations
3. To perform HDFS Configuration related Commands
4. To perform HBase Commands
5. To explore Hive Fundamentals and perform operations on Databases, Tables, and Data Loading
6. To do data analysis using Hive Query
7. To explore HIVE Operations - Partitioning, Bucketing and Views
8. To Develop Map Reduce Work Application
9. Install, Deploy & configure Apache Spark Cluster. Run Apache Spark applications using Scala.
10. Explore Spark MLlib and GraphX libraries
11. Implement basic CRUD operations in MongoDB
12. Data analytics using Apache Spark on the Amazon food dataset, e.g., find all the pairs of items frequently reviewed together.
COURSE OBJECTIVES

• Know the big data characteristics and challenges.


• Learn the tools and techniques of collection, integration, processing
and storage of big data.
• Understand real-world applications that use big data.

COURSE OUTCOMES

On completion of the course, students will be able to

CO1: List big data characteristics and challenges.


CO2: Understand essential features of the programming tools.
CO3: Practice big data programming tools.
CO4: Analyze computing problems for given application.
CO5: Evaluate computing based solution.
CO6: Design applications using big data analytics.

ASSESSMENT

Continuous Assessment: 50 Marks

End Sem Lab Exam: VIVA (25 Marks) + Project (25 Marks)
Experiment 1: To Explore Big Data Analytical Tools and Their
Applications

AIM:
To explore and understand various Big Data analytical tools such as
Hadoop ecosystem components (HDFS, MapReduce, HBase, PIG, Hive),
Apache Spark, and MongoDB, along with their key features and
applications.

OBJECTIVES:
1. Understand the purpose and architecture of different Big Data tools.
2. Compare the functionalities of Hadoop ecosystem components, Spark, and NoSQL databases.
3. Identify real-world use cases of these tools in industry.

MAPPED COURSE OUTCOME (CO):

CO1: List big data characteristics and challenges. (Primary Mapping)


CO2: Understand essential features of the programming tools. (Secondary
Mapping)

THEORY:
Big Data tools are designed to handle large-scale data processing, storage,
and analytics. Below is a brief overview of key tools:
1. Hadoop & HDFS
▪ Hadoop: Open-source framework for distributed storage and batch
processing using MapReduce.
▪ HDFS (Hadoop Distributed File System): Stores data across
multiple machines with fault tolerance.
▪ Applications: Log processing, recommendation systems (e.g.,
Amazon, LinkedIn).
2. MapReduce
▪ Programming model for processing large datasets in parallel across
clusters.
▪ Phases: Map (filtering/sorting) → Reduce (aggregation).
▪ Applications: Log analysis, web indexing (e.g., Google Search).
3. HBase
▪ Distributed, column-oriented NoSQL database built on HDFS.
▪ Provides real-time read/write access to large datasets.
▪ Applications: Messaging systems (e.g., Facebook Messenger), time-
series data.
4. PIG
▪ High-level scripting language for Hadoop data processing.
▪ Simplifies writing MapReduce jobs using Pig Latin.
▪ Applications: ETL (Extract, Transform, Load) operations, data
cleaning.
5. Hive
▪ Data warehousing tool with SQL-like queries (HiveQL).
▪ Runs on top of Hadoop for structured data analysis.
▪ Applications: Business intelligence, ad-hoc querying (e.g.,
Facebook analytics).
6. Apache Spark
▪ In-memory processing engine for fast analytics.
▪ Supports batch & real-time processing (Spark Streaming).
▪ Applications: Fraud detection (PayPal), recommendation engines
(Netflix).
7. MongoDB
▪ NoSQL database for unstructured/semi-structured data.
▪ Uses JSON-like documents for flexible storage.
▪ Applications: User profiles, IoT data (e.g., Uber, eBay).
Task to be performed:
1. Research & Compare Tools
Fill a comparison table (a sample row is shown after this list) with:
▪ Tool Name
▪ Primary Use Case
▪ Key Features
▪ Industry Example
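
For reference, one filled row (drawn from the theory above) might look like:

Apache Spark | Primary Use Case: fast in-memory batch and real-time analytics | Key Features: in-memory processing, Spark Streaming | Industry Example: recommendation engines at Netflix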
2. Case Study Analysis
Select one tool (e.g., Spark, MongoDB) and research how a company (e.g.,
Netflix, Uber) uses it.
3. Group Discussion
Discuss which tool is best suited for:
▪ Real-time analytics
▪ Large-scale batch processing
▪ Unstructured data storage

LEARNING OUTCOME:

After completing this experiment, students will be able to:

1. Explain the role of Hadoop, Spark, Hive, PIG, and MongoDB in Big
Data processing.
2. Compare different tools based on their use cases and performance.
3. Identify real-world applications of each tool in industry.
Experiment 2 – To install Hadoop and perform basic HDFS
operations

Practical
To deploy Hadoop in standalone mode and perform fundamental HDFS
file operations

Aim: To install Hadoop and configure it in standalone mode (single-node


cluster) for basic operations and testing.

Objectives:

1. Install Hadoop on a Linux system


2. Configure Hadoop in standalone mode
3. Verify the installation and run basic HDFS commands

Theory: HDFS (Hadoop Distributed File System) is Hadoop's scalable,


fault-tolerant storage system designed for big data. It follows a master-
slave architecture with a NameNode (manages metadata) and DataNodes
(store data blocks). Files are split into large blocks (default 128MB) and
replicated (default 3x) across nodes for fault tolerance.
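
As a quick worked example of how these defaults interact, the short Python sketch below estimates how many blocks a file occupies and the raw storage it consumes after replication (illustrative only; both values are configurable per cluster):

import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor

def hdfs_footprint(file_size_mb):
    # Number of blocks the file is split into, and total raw storage after replication
    blocks = int(math.ceil(file_size_mb / float(BLOCK_SIZE_MB)))
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb

# A 300 MB file -> 3 blocks (128 + 128 + 44 MB) and ~900 MB of raw storage
print(hdfs_footprint(300))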

HDFS uses a write-once-read-many model, optimizing batch processing


over random writes. Commands resemble Linux file operations but use
hadoop fs or hdfs dfs prefixes. Key features include:

• High throughput for large files


• Data integrity via checksums
• Scalability across commodity hardware

Basic operations include directory management (mkdir, ls), file transfers


(put, get), and permissions (chmod, chown). Unlike local file systems,
HDFS prioritizes sequential access, making it ideal for MapReduce/Spark
workflows.

Standalone mode uses local filesystem instead of HDFS, ideal for learning
basic operations. All commands use hadoop fs or hdfs dfs prefixes.
Tasks to be Completed
Open Terminal and perform the given tasks:

1. Check HDFS Version


Displays the installed Hadoop and HDFS version details.
Syntax: hdfs version
Output:

2. help - Command Assistance


Purpose: Get help for any HDFS command.
Syntax:
hdfs dfs -help <COMMAND_NAME>
Example:
hdfs dfs -help put
Output:
Usage: hdfs dfs [generic options] -put [-f] [-p] <localsrc> ... <dst>
Copy files from local to HDFS.
...

3. ls - List Files/Directories
Purpose: List contents of an HDFS directory.
Syntax:
hdfs dfs -ls <HDFS_PATH>
Example:
hdfs dfs -ls /
Output:

4. mkdir - Create Directory


Purpose: Create a new directory in HDFS.
Syntax:
hdfs dfs -mkdir <HDFS_DIR_PATH>
Example:
hdfs dfs -mkdir /bigdatatesting
Verification:
hdfs dfs -ls /
Output:
5. touchz - Create Empty File
Purpose: Create a 0-byte file in HDFS.
Syntax:
hdfs dfs -touchz <HDFS_FILE_PATH>
Example:
hdfs dfs -touchz /bigdatatesting/[Link]
Verification:
hdfs dfs -ls /bigdatatesting
Output:
-rw-r--r-- 1 user supergroup 0 2023-01-01 12:36
/bigdatatesting/[Link]

6. copyFromLocal (or put) - Upload Files


Purpose: Copy files from local filesystem to HDFS.
Syntax:
hdfs dfs -copyFromLocal <LOCAL_FILE> <HDFS_PATH>
# OR
hdfs dfs -put <LOCAL_FILE> <HDFS_PATH>
Example:
echo "Hello HDFS" > [Link]
hdfs dfs -copyFromLocal [Link] /bigdatatesting/
Verification:
hdfs dfs -ls /bigdatatesting
Output:
-rw-r--r-- 1 user supergroup 10 2023-01-01 12:37
/bigdatatesting/[Link]

7. cat - View File Content


Purpose: Display contents of an HDFS file.
Syntax:
hdfs dfs -cat <HDFS_FILE_PATH>
Example:
hdfs dfs -cat /bigdatatesting/[Link]
Output:
Hello HDFS

8. copyToLocal (or get) - Download Files


Purpose: Copy files from HDFS to local filesystem.
Syntax:
hdfs dfs -copyToLocal <HDFS_FILE> <LOCAL_PATH>
# OR
hdfs dfs -get <HDFS_FILE> <LOCAL_PATH>
Example:
hdfs dfs -copyToLocal /bigdatatesting/[Link] ~/Downloads/
Verification:
cat ~/Downloads/[Link]
Output:
Hello HDFS

9. du - Check File Size


Purpose: Show file/directory size in bytes.
Syntax:
hdfs dfs -du [-s] <HDFS_PATH>
Example:
hdfs dfs -du -s /bigdatatesting/[Link]
Output:
10 10 /bigdatatesting/[Link] # (size | replication size | path)

10. mv - Move/Rename Files


Purpose: Move or rename files in HDFS.
Syntax:
hdfs dfs -mv <SOURCE_PATH> <DEST_PATH>
Example:
hdfs dfs -mv /bigdatatesting/[Link]
/bigdatatesting/[Link]
Verification:
hdfs dfs -ls /bigdatatesting
Output:
-rw-r--r-- 1 user supergroup 10 2023-01-01 12:37
/bigdatatesting/[Link]

11. cp - Copy Files in HDFS


Purpose: Copy files within HDFS.
Syntax:
hdfs dfs -cp <SOURCE_PATH> <DEST_PATH>
Example:
hdfs dfs -cp /bigdatatesting/[Link] /backup/
Verification:
hdfs dfs -ls /backup
Output:
-rw-r--r-- 1 user supergroup 10 2023-01-01 12:38
/backup/[Link]
12. rm - Delete File
Purpose: Remove a file from HDFS.
Syntax:
hdfs dfs -rm <HDFS_FILE_PATH>
Example:
hdfs dfs -rm /bigdatatesting/[Link]
Output:
Deleted /bigdatatesting/[Link]

Output Result

//Attach screenshots of the output after every command.

Learning Outcome
1. Perform basic file operations in HDFS
2. Understand HDFS directory structure
3. Navigate HDFS web interface
4. Differentiate between local file system and HDFS operations
Experiment 3: To Perform HDFS Configuration
Related Commands

Practical

To execute essential HDFS configuration and administration commands


for cluster setup, monitoring, and optimization.

Aim

To perform critical HDFS configuration and file management operations


for effective cluster administration.

Objectives

a) Verify and modify HDFS configurations


b) Monitor storage utilization and replication
c) Manage permissions and system health
d) Handle compressed data operations

Theory

HDFS configuration commands allow administrators to:

• Check active settings (getconf, classpath)


• Control storage (dfsadmin, du, setrep)
• Maintain health (fsck, safemode)
• Secure data (chmod)
• Optimize transfers (compression commands)

HDFS maintains data reliability through replication, storing 3 copies of


each block by default for fault tolerance. The system uses a 128MB default
block size, which directly impacts both storage efficiency and MapReduce
task performance. For security, HDFS implements POSIX-style
permissions to control file and directory access. Additionally, built-in
compression support reduces storage requirements and network transfer
overhead while maintaining data accessibility.
Tasks to be Done

1. Verify Configuration

Command: hdfs getconf -confKey dfs.blocksize

Purpose: Check active block size setting

Output: 134217728 (128MB in bytes)

2. Storage Report

Command: hdfs dfsadmin -report

Purpose: View cluster capacity and node status

Key Output: Live nodes, used/remaining space

3. Change Replication

Command: hdfs dfs -setrep -w 2 /data/important

Purpose: Reduce replication factor to 2

4. Check Disk Usage

Command: hdfs dfs -du -h /user

Purpose: Analyze storage consumption

Sample Output: 2.1G 6.3G /user/hadoop # (actual | replicated size)
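
Note: in this sample, 6.3G = 2.1G × 3; the second column is the first multiplied by the replication factor (3 by default), i.e. the disk space actually consumed across the cluster.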

5. Filesystem Health Check

Command: hdfs fsck / -blocks -locations

Purpose: Verify block health and placement

Critical Output: Under/Missing/Corrupt blocks

6. Modify Permissions

Command: hdfs dfs -chmod 750 /secure/data

Purpose: Restrict directory access

Validation: hdfs dfs -ls /secure


7. View File Head or Tail

Command: hdfs dfs -cat /logs/[Link] | head -10

Purpose: Inspect file contents

Output: First 10 lines of log

8. Safemode Check
a) Command: hdfs dfsadmin -safemode get
Purpose: Verify safemode state
Output: Safe mode is OFF

b) Enter Safemode (Read-only Mode)


Purpose: Force HDFS into maintenance state (blocks modifications)
Command: hdfs dfsadmin -safemode enter
Output: Safe mode is ON

c) Exit Safemode (Normal Operations)


Purpose: Resume write operations
Command: hdfs dfsadmin -safemode leave
Output: Safe mode is OFF

9. Find Files

Command: hdfs dfs -find /data -name "*.csv"

Purpose: Locate files by pattern

Output: Full HDFS paths of matching files

10. Compressed File Transfer

Create and Upload:

gzip [Link]

hdfs dfs -put [Link] /compressed/

Download and Extract:

hdfs dfs -get /compressed/[Link] ~/

gunzip ~/[Link]
11. Check Classpath

Command: hadoop classpath

Purpose: Verify library paths

Output: All JARs/config paths separated by :

Explore the configuration files by changing to the configuration directory (cd <path of config directory>).

a) Explore Configuration files

core-site.xml: stores properties common to the whole Hadoop ecosystem
hdfs-site.xml: properties specific to the HDFS services across nodes
Command:
cd /etc/hadoop/conf
cat core-site.xml

12. View Compressed Data

Command: hadoop fs -text /compressed/[Link] | head -5

Purpose: Inspect without full download

Output: First 5 uncompressed lines

Output Result

//Attach screenshots of the output after every command.

Learning Outcomes

a) Configure HDFS parameters and environment


b) Monitor and optimize cluster storage
c) Administer permissions and health checks
d) Efficiently handle compressed data
e) Troubleshoot using configuration tools
Experiment 4: To Perform Operations on HBase
Practical

To execute essential HBase commands for table management, data operations, and cluster administration.

Aim

To perform CRUD operations, table configurations, and cluster monitoring


in HBase using shell commands.

Objectives

a) Start/stop HBase and access the HBase shell


b) Create, describe, and manage HBase tables
c) Perform data operations (insert, read, update, delete)
d) Monitor cluster status and table health
e) Use advanced operations (truncate, versioning)

Theory

HBase is a distributed, schema-less NoSQL database built on HDFS. Key


features include:

i. Column-oriented storage (data grouped by column families)


ii. Automatic sharding (tables split into regions distributed across
RegionServers)
iii. Strong consistency for row-level operations
iv. Real-time read/write access with low latency
v. Data versioning (multiple cell versions with timestamps)

Core components of HBase are:

i. HMaster: Manages metadata, load balancing, and failover.

ii. RegionServer: Handles read/write requests for regions.

iii. ZooKeeper: Coordinates cluster state and configuration.
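
To make the storage model concrete, the sketch below uses a plain Python dictionary to show how one logical row of the Student_Rollno table (used in the tasks that follow) maps to column families, qualifiers, and timestamped cell versions. This is purely conceptual; HBase stores cells as byte arrays in HFiles, and the timestamps here are made-up examples.

# Conceptual layout: row key -> column family -> qualifier -> {timestamp: value}
row = {
    "S001": {
        "personal": {
            "name": {1700000100: "Rahul Sharma"},
            "city": {1700000100: "Mumbai"},
        },
        "academic": {
            # With VERSIONS => 3, up to three timestamped values are kept per cell
            "cgpa": {1700000300: "9.5", 1700000200: "9.2"},
        },
    },
}

# A get of 'academic:cgpa' returns the value with the newest timestamp by default
latest_ts = max(row["S001"]["academic"]["cgpa"])
print(row["S001"]["academic"]["cgpa"][latest_ts])   # prints 9.5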


Tasks to be Done

1. Start HBase Services & Access Shell

sudo hbase master start #Start HMaster

sudo hbase regionserver start #Start RegionServer

hbase shell # Enter HBase shell

Purpose: Initialize HBase cluster and access the command interface.

2. Cluster Status & Version Check

status # Check cluster health (live RegionServers)

version # Display HBase version

whoami # Show current HBase user

Purpose: Verify cluster status and environment details.

3. Table Management

a) Create Table:

create 'Student_Rollno', 'personal', 'academic'

Purpose: Create table Student_Rollno with column families personal and


academic.

b) List Tables:

list #Verify table creation

c) Describe Table:

describe 'Student_Rollno'

Purpose: Show column family configurations (VERSIONS, BLOCKCACHE,


etc.).

d) Disable/Enable Table:

disable 'Student_Rollno' # Prepare for schema changes

enable 'Student_Rollno' # Reactivate table


Validation:

is_disabled 'Student_Rollno' #false

is_enabled 'Student_Rollno' # true

e) Drop Table:

Note: Create “Emp_Temp” table with suitable column families and drop
it.

disable 'Emp_Temp'

drop 'Emp_Temp'

Purpose: Permanently delete table.

4. Data Operations

a) Insert Student Data:

put 'Student_Rollno', 'S001', 'personal:name', 'Rahul Sharma'

put 'Student_Rollno', 'S001', 'personal:city', 'Mumbai'

put 'Student_Rollno', 'S001', 'academic:cgpa', '9.2'

put 'Student_Rollno', 'S001', 'academic:course', '[Link]'

Purpose: Add a student record with roll number S001. Now add one more student's details with roll number S002.

b) Read Data:

get 'Student_Rollno', 'S001' #Read full record

get 'Student_Rollno', 'S001', {COLUMN => 'academic:cgpa'} # Read CGPA


only

c) Update Data:

put 'Student_Rollno', 'S001', 'academic:cgpa', '9.5' # Update CGPA

Purpose: Modify existing student record.

d) Scan Table:

scan 'Student_Rollno' # View all student records


scan 'Student_Rollno', {LIMIT => 1} # View first record

Note: You can try for different values of Limit by adding more rows in the
table.

e) Delete Data:

delete 'Student_Rollno', 'S001', 'personal:city' # Delete city info

deleteall 'Student_Rollno', 'S001' # Delete entire record

5. Advanced Operations

a) Count Records:

Note: Add a few more student records (S003, S004).

count 'Student_Rollno'

Output: Number of rows in table

b) Truncate Table:

truncate 'Student_Rollno' # Wipe all data (disable->drop->recreate)

c) Data Versioning:

# Requires VERSIONS>1 in column family

#To alter column family to store 3 versions use below mentioned command

#alter 'Student_Rollno', NAME => 'academic', VERSIONS => 3

#then update cgpa at least 3 times using the put command (choose your own values)

get 'Student_Rollno', 'S001', {COLUMN => 'academic:cgpa', VERSIONS => 3}

#Check the VERSIONS setting using describe

Purpose: Retrieve previous CGPA values.


Output: Displays current and historical values.
6. Cluster Administration

a) Table Existence Check:

exists 'Student_Rollno' # true

b) Bulk Operations:

disable_all 'Student_.*' # Disable multiple student tables

drop_all 'Student_.*' # Drop multiple student tables

c) Exit Shell & Stop Cluster:

exit # Leave HBase shell

./bin/stop-hbase.sh # Stop HBase (from HBase home directory)

Output Result

// Attach screenshots of all command executions showing:

1. Table creation and description

2. Data insertion and retrieval

3. Update and delete operations

4. Cluster status checks

5. Truncate and versioning operations

Learning Outcomes

a) Start/stop HBase services and navigate HBase shell


b) Design and manage tables with custom column families
c) Perform CRUD operations on student records
d) Monitor cluster health and table states
e) Implement data versioning and bulk operations
f) Administer table lifecycle (create/enable/truncate/drop)
Experiment 5: Hive Fundamentals - Databases,
Tables, and Data Loading
Aim: To perform core Hive operations including
creating/altering/dropping databases and tables, and loading data from
local files using both LOAD DATA and CTAS methods.

Objectives:

1. Manage Hive databases (Create, Use, Alter, Drop).


2. Manage Hive tables (Create, Describe, Alter, Drop).
3. Load data using LOAD DATA LOCAL INPATH.
4. Create tables and load data using CTAS (Create Table As Select).

CO Mapping:

CO3: Practice big data programming tools.

CO4: Analyze computing problems for a given application.

1. Database Operations

Syntax & Examples:

-- 1. List all databases

SHOW DATABASES;

-- 2. Create a new database

CREATE DATABASE sales_db;

-- 3. Use the database

USE sales_db;

-- 4. Verify current database

SELECT current_database();

-- 5. (Optional) Alter database to add a property

ALTER DATABASE sales_db SET DBPROPERTIES


('created_by'='student');
-- 6. (Optional) Drop database (use with caution!)

-- DROP DATABASE sales_db CASCADE;

2. Managed Table Operations

Syntax & Examples:

-- 1. Create a managed table 'customers'

CREATE TABLE customers (


cust_id INT,
first_name STRING,
last_name STRING,
city STRING,
signup_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- 2. Show tables in the current database

SHOW TABLES;

-- 3. View detailed table structure

DESCRIBE FORMATTED customers;

-- 4. Alter table to add a new column

ALTER TABLE customers ADD COLUMNS (email STRING);

-- 5. Rename the table

ALTER TABLE customers RENAME TO customer_info;

-- 6. (Optional) Drop the table

-- DROP TABLE customer_info;


3. Loading Data with LOAD DATA LOCAL INPATH

Steps:

Create a local data file (/home/cloudera/customer_data.csv):

1001|Aman|Sharma|Delhi|2023-02-10

1002|Bobby|Singh|Agra|2022-11-15

1003|Ketan|Patel|Gandhinagar|2023-01-20

1004|Riya|Shah|Delhi|2022-12-05

Load data into Hive table:

LOAD DATA LOCAL INPATH '/home/cloudera/customer_data.csv'

INTO TABLE customer_info;

-- Verify data loading

SELECT * FROM customer_info LIMIT 5;

4. Creating Tables with CTAS (Create Table As Select)

Syntax:

CREATE TABLE <new_table_name>


[STORED AS <file_format>]
AS
SELECT <column_list>
FROM <source_table>
[WHERE <condition>];

Example:

-- Create a new table 'delhi_customers' containing only Delhi customers

CREATE TABLE delhi_customers


STORED AS TEXTFILE
AS
SELECT cust_id, first_name, last_name, signup_date
FROM customer_info WHERE city = 'Delhi';
-- Verify the new table and its data

SELECT * FROM delhi_customers;

Tasks to be Performed by Students

[Note: Write the commands and attach screenshot of outputs]

1. Database Management:
o Create a database named inventory_db.
o Switch to this database.

2. Table Management:
o Create a managed table named products with this schema:
▪ product_id (INT)
▪ product_name (STRING)
▪ category (STRING)
▪ price (DECIMAL(10,2))
▪ stock (INT)
o Use comma , as the field terminator.
o Rename the table to product_details.

3. Data Loading with LOAD DATA:


o Create a local file /home/cloudera/product_data.csv:

2001,Laptop,Electronics,55000.00,25
2002,Smartphone,Electronics,25000.00,50
2003,Desk,Furniture,12000.00,15
2004,Chair,Furniture,5000.00,30

o Load this data into product_details.


o Show all records from the table.

4. Data Loading with CTAS:


o Create a new table electronics_stock using CTAS.
o It should contain only product_name and stock for items in
the 'Electronics' category.
o Query the new table to verify.
Learning Outcome

After completing this experiment, students will be able to:

a) Perform full lifecycle management of Hive databases and tables.


b) Load data from local files into Hive using LOAD DATA.
c) Use CTAS to create new tables from existing data.
d) Understand the physical storage location of managed tables.
Experiment 6: Data Analysis using HIVE
Aim

To perform comprehensive data analysis using HiveQL queries.

Objectives

1. To create and manage Hive tables for given data


2. To analyze data using different HiveQL features
3. To generate comprehensive reports using aggregation and sorting

Theory

Hive is a data warehouse infrastructure built on Hadoop that provides data


summarization, query, and analysis. It provides an SQL-like interface
called HiveQL to query data stored in HDFS. Hive is suitable for structured
data and supports external tables that can directly access data from HDFS
without loading it into Hive's managed storage.

Dataset: student_data.csv

Create the file without the header row (the column names below are shown only for reference).

student_id,name,gender,department,semester,subject,marks,attendance,city

1,Rajesh Kumar,M,CS,3,Mathematics,85,92,Mumbai
2,Priya Sharma,F,Electronics,2,Physics,78,88,Delhi
3,Amit Patel,M,CS,3,Programming,92,95,Bangalore
4,Anjali Singh,F,ICT,4,Mathematics,88,90,Chennai
5,Sanjay Verma,M,Electronics,2,Chemistry,76,85,Hyderabad
6,Divya Mehta,F,CS,3,Physics,81,89,Pune
7,Neha Reddy,F,ICT,4,Programming,79,87,Mumbai
8,Mohan Das,M,Electronics,2,Mathematics,91,94,Delhi
9,Kavita Joshi,F,CS,3,Chemistry,84,91,Bangalore
10,Rahul Sharma,M,ICT,4,Physics,87,93,Chennai
11,Sneha Patel,F,CS,5,Mathematics,93,96,Hyderabad
12,Vikram Singh,M,Electronics,4,Programming,82,86,Pune
13,Ananya Gupta,F,ICT,5,Chemistry,89,92,Mumbai
14,Rohan Malhotra,M,CS,5,Physics,85,90,Delhi
15,Pooja Desai,F,Electronics,4,Mathematics,90,95,Bangalore
A. Start Hive Services
# Start Hive services
sudo service hive-server2 start
sudo service hive-metastore start

# Access Hive shell


beeline -u jdbc:hive2://localhost:10000
OR
hive

B. Prepare Data in HDFS


Note: Run these commands in a separate terminal, as they are not Hive commands.
hadoop fs -mkdir -p /user/hive/input/student_data
hadoop fs -put student_data.csv /user/hive/input/student_data/

C. Create Database and Table


CREATE DATABASE IF NOT EXISTS student_db;
USE student_db;

CREATE EXTERNAL TABLE students (


student_id INT,
name STRING,
gender STRING,
department STRING,
semester INT,
subject STRING,
marks INT,
attendance INT,
city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/input/student_data/';

SHOW TABLES;
DESCRIBE students;
D. Verify Data Loading
SELECT COUNT(*) AS total_students FROM students;
SELECT * FROM students LIMIT 5;

E. Data Analysis Queries

1. Get all student records


SELECT * FROM students;

2. Select only name and department


SELECT name, department FROM students;

3. Students from CS department


SELECT name, marks FROM students WHERE department = 'CS';

4. Students sorted by marks (ascending)


SELECT name, marks FROM students ORDER BY marks;

5. Top students by marks


SELECT name, department, marks FROM students ORDER BY marks
DESC;

6. Top 3 students
SELECT name, marks FROM students ORDER BY marks DESC LIMIT 3;

7. Total number of students


SELECT COUNT(*) AS total_students FROM students;

8. Students count by gender


SELECT gender, COUNT(*) AS count FROM students GROUP BY gender;

9. Average marks and attendance


SELECT AVG(marks) AS avg_marks, AVG(attendance) AS avg_attendance
FROM students;

10. Department-wise Student Count


SELECT department, COUNT(*) AS student_count
FROM students
GROUP BY department
ORDER BY student_count DESC;
11. Average Marks by Department
SELECT department, AVG(marks) AS average_marks
FROM students
GROUP BY department
ORDER BY average_marks DESC;

12. Subject-wise Performance


SELECT subject, AVG(marks) AS average_marks
FROM students
GROUP BY subject
ORDER BY average_marks DESC;

13. Gender-wise Department Analysis


SELECT department, gender, COUNT(*) AS count, AVG(marks) AS
avg_marks
FROM students
GROUP BY department, gender
ORDER BY department, gender;

14. Semester and Department Analysis


SELECT semester, department, COUNT(*) AS student_count, AVG(marks)
AS avg_marks
FROM students
GROUP BY semester, department
ORDER BY semester, department;

15. Comprehensive Department Report


SELECT department,
COUNT(*) AS total_students,
AVG(marks) AS avg_marks,
MAX(marks) AS highest_marks,
MIN(marks) AS lowest_marks,
AVG(attendance) AS avg_attendance
FROM students
GROUP BY department
ORDER BY avg_marks DESC;
16. City-wise Analysis
SELECT city,
COUNT(*) AS student_count,
AVG(marks) AS avg_marks,
AVG(attendance) AS avg_attendance
FROM students
GROUP BY city
ORDER BY avg_marks DESC;

17. Departments with High Average Marks


SELECT department, AVG(marks) AS avg_marks
FROM students
GROUP BY department
HAVING AVG(marks) > 85
ORDER BY avg_marks DESC;

18. Cities with Good Attendance


SELECT city, AVG(attendance) AS avg_attendance
FROM students
GROUP BY city
HAVING AVG(attendance) > 90
ORDER BY avg_attendance DESC;

19. Performance by Gender and Department


SELECT department, gender,
COUNT(*) AS student_count,
AVG(marks) AS avg_marks,
AVG(attendance) AS avg_attendance,
CASE
WHEN AVG(marks) >= 85 THEN 'Excellent'
WHEN AVG(marks) >= 75 THEN 'Good'
ELSE 'Average'
END AS performance_rating
FROM students
GROUP BY department, gender
ORDER BY department, gender;

Tasks to be done:
1. Execute above queries and attach screenshots.
Learning Outcomes
After completing this experiment, students will be able to:
1. Create and manage Hive databases and tables
2. Load data into Hive from HDFS
3. Execute basic and advanced HiveQL queries
4. Perform multi-dimensional data analysis
5. Generate comprehensive analytical reports
Experiment 7: HIVE Operations - Partitioning,
Bucketing and Views
Aim

To understand and practice Hive partitioning (static and dynamic),


bucketing (with and without partitioning), and creating views for data
management.

Objectives

1. Implement static and dynamic partitioning in Hive


2. Create and use bucketed tables
3. Combine partitioning and bucketing
4. Create and manage views for data abstraction

Theory

Partitioning: Divides table data into partitions based on column values for faster query performance.
a) Static Partitioning: Partition values are explicitly specified during data insertion.
b) Dynamic Partitioning: Partition values are automatically determined from the SELECT query.
Bucketing: Divides data into a fixed number of buckets using a hash of the clustering column, enabling efficient sampling and joins.
Views: Virtual tables that provide a logical abstraction over the underlying data without storing it physically.
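
The short Python sketch below illustrates the bucketing idea: each row is assigned a bucket by hashing the clustering column and taking the result modulo the number of buckets. The hash used here is a simple stand-in, not Hive's actual hash function, so the bucket numbers will differ from what Hive produces; only the principle is the same.

NUM_BUCKETS = 4   # e.g. CLUSTERED BY (department) INTO 4 BUCKETS

def bucket_for(department, num_buckets=NUM_BUCKETS):
    # Deterministic toy hash of the clustering column (illustrative only)
    h = sum(ord(c) * (31 ** i) for i, c in enumerate(department))
    return h % num_buckets

for dept in ["CS", "Electronics", "ICT"]:
    print("%s -> bucket %d" % (dept, bucket_for(dept)))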

Dataset: student_data.csv (same as Exp 6)

student_id,name,gender,department,semester,subject,marks,attendance,city

1,Rajesh Kumar,M,CS,3,Mathematics,85,92,Mumbai
2,Priya Sharma,F,Electronics,2,Physics,78,88,Delhi
3,Amit Patel,M,CS,3,Programming,92,95,Bangalore
4,Anjali Singh,F,ICT,4,Mathematics,88,90,Chennai
5,Sanjay Verma,M,Electronics,2,Chemistry,76,85,Hyderabad
6,Divya Mehta,F,CS,3,Physics,81,89,Pune
7,Neha Reddy,F,ICT,4,Programming,79,87,Mumbai
8,Mohan Das,M,Electronics,2,Mathematics,91,94,Delhi
9,Kavita Joshi,F,CS,3,Chemistry,84,91,Bangalore
10,Rahul Sharma,M,ICT,4,Physics,87,93,Chennai
11,Sneha Patel,F,CS,5,Mathematics,93,96,Hyderabad
12,Vikram Singh,M,Electronics,4,Programming,82,86,Pune
13,Ananya Gupta,F,ICT,5,Chemistry,89,92,Mumbai
14,Rohan Malhotra,M,CS,5,Physics,85,90,Delhi
15,Pooja Desai,F,Electronics,4,Mathematics,90,95,Bangalore

Tasks to be Done

A. Start Hive Services

# Start Hive services


sudo service hive-server2 start
sudo service hive-metastore start

# Access Hive shell


beeline -u jdbc:hive2://localhost:10000

# If beeline gives an error, you may use the hive shell instead.

B. Prepare Data and Create Base Table

# Prepare HDFS directory


hadoop fs -mkdir -p /user/hive/input/student_data
hadoop fs -put student_data.csv /user/hive/input/student_data/

-- Create database
CREATE DATABASE IF NOT EXISTS student_analysis;
USE student_analysis;

-- Create base external table


CREATE EXTERNAL TABLE students_base (
student_id INT,
name STRING,
gender STRING,
department STRING,
semester INT,
subject STRING,
marks INT,
attendance INT,
city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/input/student_data/';

-- Verify data
SELECT COUNT(*) FROM students_base;
SELECT * FROM students_base LIMIT 5;

C. Static Partitioning

-- Create table with static partitioning by department


CREATE TABLE students_static_partition (
student_id INT,
name STRING,
gender STRING,
semester INT,
subject STRING,
marks INT,
attendance INT,
city STRING
)
PARTITIONED BY (department STRING)
STORED AS TEXTFILE;

-- Insert data with static partitioning (explicit partition values)


INSERT INTO TABLE students_static_partition PARTITION
(department='CS') SELECT student_id, name, gender, semester, subject,
marks, attendance, city FROM students_base WHERE department='CS';

INSERT INTO TABLE students_static_partition PARTITION


(department='Electronics') SELECT student_id, name, gender, semester,
subject, marks, attendance, city FROM students_base WHERE
department='Electronics';
INSERT INTO TABLE students_static_partition PARTITION
(department='ICT') SELECT student_id, name, gender, semester, subject,
marks, attendance, city FROM students_base WHERE department='ICT';

-- Show partitions
SHOW PARTITIONS students_static_partition;

-- Query specific partition


SELECT * FROM students_static_partition WHERE department='CS';

D. Dynamic Partitioning

-- Enable dynamic partitioning


SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Create table for dynamic partitioning by semester


CREATE TABLE students_dynamic_partition (
student_id INT,
name STRING,
gender STRING,
department STRING,
subject STRING,
marks INT,
attendance INT,
city STRING
)
PARTITIONED BY (semester INT)
STORED AS TEXTFILE;

-- Insert data with dynamic partitioning


INSERT INTO TABLE students_dynamic_partition PARTITION (semester)
SELECT student_id, name, gender, department, subject, marks,
attendance, city, semester FROM students_base;

-- Show all partitions created dynamically


SHOW PARTITIONS students_dynamic_partition;

-- Query data from specific semester partition


SELECT * FROM students_dynamic_partition WHERE semester=3;
E. Bucketing Without Partitioning

-- Enable bucketing
SET hive.enforce.bucketing = true;

-- Create bucketed table by department


CREATE TABLE students_bucketed (
student_id INT,
name STRING,
gender STRING,
department STRING,
semester INT,
subject STRING,
marks INT,
attendance INT,
city STRING
)
CLUSTERED BY (department) INTO 4 BUCKETS
STORED AS TEXTFILE;

-- Insert data into bucketed table


INSERT INTO TABLE students_bucketed
SELECT * FROM students_base;

-- Verify bucketing worked


SELECT department, COUNT(*) FROM students_bucketed GROUP BY
department;

F. Bucketing with Partitioning


-- Create table with both partitioning and bucketing
CREATE TABLE students_partitioned_bucketed (
student_id INT,
name STRING,
gender STRING,
department STRING,
subject STRING,
marks INT,
attendance INT,
city STRING
)
PARTITIONED BY (semester INT)
CLUSTERED BY (department) INTO 3 BUCKETS
STORED AS TEXTFILE;

-- Insert data
INSERT INTO TABLE students_partitioned_bucketed PARTITION (semester)
SELECT student_id, name, gender, department, subject, marks,
attendance, city, semester FROM students_base;

-- Show partitions
SHOW PARTITIONS students_partitioned_bucketed;

-- Show information of partitions and buckets at HDFS location


-- [Do it yourself]
hdfs dfs -ls ……..[location of files]

Hint: Check table properties for identifying locations

G. Creating and Using Views

-- Create view for top performing students


CREATE VIEW top_students AS
SELECT name, department, marks, attendance
FROM students_base WHERE marks > 85
ORDER BY marks DESC;

-- Query the view


SELECT * FROM top_students;

-- Create view for department-wise summary


CREATE VIEW department_summary AS
SELECT department,
COUNT(*) as total_students,
AVG(marks) as avg_marks,
AVG(attendance) as avg_attendance
FROM students_base
GROUP BY department;
-- Query the summary view
SELECT * FROM department_summary ORDER BY avg_marks DESC;

-- Create view for student performance analysis


CREATE VIEW student_performance AS
SELECT name, department, semester, marks,
CASE
WHEN marks >= 90 THEN 'Excellent'
WHEN marks >= 80 THEN 'Good'
WHEN marks >= 70 THEN 'Average'
ELSE 'Needs Improvement'
END as performance_grade
FROM students_base;

-- Query performance view


SELECT * FROM student_performance ORDER BY marks DESC;

-- View definition
DESCRIBE EXTENDED top_students;

Tasks to be Done
1. Execute all partitioning operations (static and dynamic)
2. Create and verify bucketed tables
3. Implement combined partitioning and bucketing
4. Create and query different types of views
5. Document observations and take screenshots of each step.

Learning Outcomes
After completing this experiment, students will be able to:
1. Implement static and dynamic partitioning in Hive
2. Create and use bucketed tables for efficient data access
3. Combine partitioning and bucketing for optimal performance
4. Create and manage views for data abstraction
Experiment 8: To Develop Map Reduce Work
Application
Aim

To run a MapReduce job using Python on Hadoop to count words in a file.

Objectives

1. Understand the MapReduce programming model and its components


2. Implement mapper and reducer functions using Python
3. Execute and analyze a complete MapReduce job on Hadoop cluster

Theory

MapReduce is a programming model for processing large datasets in


parallel across a distributed cluster. It consists of two main functions:

Mapper: Processes input data and generates key-value pairs


• Input: (key, value) pairs
• Output: Intermediate (key, value) pairs

Reducer: Aggregates intermediate results by key


• Input: (key, list of values)
• Output: Final (key, result) pairs
In word count example:
• Mapper emits (word, 1) for each word
• Reducer sums counts for each word
• Hadoop handles sorting, shuffling, and distribution automatically
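
To connect the model to the example that follows, here is the end-to-end flow for the test file used in the steps below (apple banana apple banana cherry apple cherry cherry apple):

Mapper output: (apple,1) (banana,1) (apple,1) (banana,1) (cherry,1) (apple,1) (cherry,1) (cherry,1) (apple,1)
After shuffle/sort: apple -> [1,1,1,1], banana -> [1,1], cherry -> [1,1,1]
Reducer output: apple 4, banana 2, cherry 3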

Steps

1. Create Test File (use gedit: >gedit [Link])

apple banana apple banana cherry apple cherry cherry apple

2. Put File in HDFS

hdfs dfs -mkdir -p /user/student/input

hdfs dfs -put [Link] /user/student/input/


3. Create Python Mapper (mapper.py)

#!/usr/bin/env python
# Mapper: read lines from standard input and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print("%s\t1" % word)

4. Create Python Reducer (reducer.py)

#!/usr/bin/env python
# Reducer: sum the counts emitted by the mapper for each word.
# It relies on the input being sorted by key (Hadoop's shuffle/sort, or the
# `sort` command in the local test), so all lines for a word arrive together.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t')
    count = int(count)

    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_word = word
        current_count = count

# Emit the final word's count
if current_word:
    print("%s\t%s" % (current_word, current_count))

5. Make Files Executable

chmod +x mapper.py

chmod +x reducer.py
6. Test Locally First

cat [Link] | python mapper.py | sort | python reducer.py

[Note: Check output of mapper separately in case of any error.]

7. Run on Hadoop

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files mapper.py,reducer.py \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-input /user/student/input/[Link] \
-output /user/student/output

8. Check Results

# Check the output directory

hdfs dfs -ls /user/student/output

# View the results

hdfs dfs -cat /user/student/output/part-00000

# Copy results to local file system for verification

hdfs dfs -get /user/student/output/part-00000 local_output.txt

cat local_output.txt

Expected Output:

apple 4
banana 2
cherry 3

Tasks to be done:

Execute the above steps, add more fruit words to the input file, and take screenshots.


Learning Outcomes

1. Understand the fundamental concepts of MapReduce programming


model.
2. Develop mapper and reducer functions using Python for data
processing.
3. Execute MapReduce jobs on Hadoop cluster using Hadoop
Streaming.
4. Analyze and verify the output of distributed data processing tasks.
