Q1 Write a Java program to create an HBase table named
"hospital_records" with column families "patient_info" (name, age,
blood_group) and "treatment_info" (disease, medication,
doctor_assigned).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateHBaseTable {
    public static void main(String[] args) throws Exception {
        // Set up the HBase configuration (reads hbase-site.xml from the classpath)
        Configuration config = HBaseConfiguration.create();

        // Open a connection and obtain an Admin instance
        Connection connection = ConnectionFactory.createConnection(config);
        Admin admin = connection.getAdmin();

        // Define the table descriptor for the "hospital_records" table
        HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("hospital_records"));

        // Add the two column families
        tableDescriptor.addFamily(new HColumnDescriptor(Bytes.toBytes("patient_info")));
        tableDescriptor.addFamily(new HColumnDescriptor(Bytes.toBytes("treatment_info")));

        // Create the table only if it does not already exist
        if (!admin.tableExists(tableDescriptor.getTableName())) {
            admin.createTable(tableDescriptor);
            System.out.println("Table 'hospital_records' created successfully with column families 'patient_info' and 'treatment_info'.");
        } else {
            System.out.println("Table 'hospital_records' already exists.");
        }

        // Release resources
        admin.close();
        connection.close();
    }
}
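If needed, the result can be confirmed from the HBase shell (the same describe command used later in Q6):
describe 'hospital_records'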
Q2 A research organization manages multiple datasets in a distributed file
system and needs to perform security and maintenance operations. Perform the
following tasks:
a. Check the number of files stored in "/research".
b. Change dataset ownership to a new user.
c. Restrict write access for group members.
d. Find and list all files in "/research" that were modified in the last 7 days.
e. Display the current file permissions.
Solution:
a. Check the number of files stored in /research
hdfs dfs -count /research
This shows four columns:
DIR_COUNT (number of directories),
FILE_COUNT (number of files),
CONTENT_SIZE (total size in bytes),
and the PATHNAME.
b. Change dataset ownership to a new user
hdfs dfs -chown newuser /research
To change both user and group:
hdfs dfs -chown newuser:newgroup /research
To do it recursively (for all files/folders inside):
hdfs dfs -chown -R newuser:newgroup /research
c. Restrict write access for group members
You can modify permissions using:
hdfs dfs -chmod g-w /research
This removes write (w) permission for the group.
d. Find and list all files in /research that were modified in the last 7 days
HDFS has no built-in "find by modification time" option, so the usual approach is to
take the recursive listing (hdfs dfs -ls -R), which prints the modification date and
time, and post-process it:
hdfs dfs -ls -R /research | awk '$6 != "" {print $6, $7, $8, $NF}' > files_list.txt
Then filter files_list.txt with a script outside HDFS. If the folder were on a local
filesystem, Linux find could do it directly:
find /research -type f -mtime -7
For HDFS itself, date filtering has to be scripted over the listing (or fsck) output,
so many administrators export the metadata or rely on external tools such as Apache
Ranger or Cloudera Manager.
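A minimal post-processing sketch, assuming GNU date and the usual -ls -R column layout (date in column 6, time in column 7, path in column 8):
SINCE=$(date -d '7 days ago' +%Y-%m-%d)
hdfs dfs -ls -R /research | awk -v since="$SINCE" '$1 !~ /^d/ && $6 >= since {print $6, $7, $8}'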
e. Display the current file permissions
hdfs dfs -ls /research
Q3. A bank has a dataset containing customer transaction details and
wants to process it using a scripting-based data flow language. Perform
the following tasks:
a. Load a dataset into Apache Pig containing the following columns:
transaction_id (INT) - Unique identifier for each transaction
account_number (STRING) - Customer's bank account number
transaction_type (STRING) - Type of transaction (e.g., Credit, Debit)
amount (DOUBLE) - Transaction amount
branch_name (STRING) - Bank branch where the transaction occurred
b. Write a Pig script to load the data and display all records.
c. Write a Pig command to calculate the total amount of all transactions
for each account_number.
d. Write a Pig command to group transactions by transaction_type.
a. Load a dataset into Apache Pig
Assume the dataset is stored as a CSV file at:
/user/bank/transactions.csv
Each line in the file looks like:
1001,AC12345,Debit,2500.75,MainBranch
1002,AC54321,Credit,1200.00,WestBranch
...
Pig command to load the dataset:
transactions = LOAD '/user/bank/transactions.csv'
USING PigStorage(',')
AS (transaction_id:int,
account_number:chararray,
transaction_type:chararray,
amount:double,
branch_name:chararray);
b. Display all records
DUMP transactions;
c. Calculate the total amount of all transactions for each
account_number
grouped_by_account = GROUP transactions BY account_number;
total_amount_per_account = FOREACH grouped_by_account GENERATE
group AS account_number,
SUM(transactions.amount) AS total_amount;
DUMP total_amount_per_account;
d. Group transactions by transaction_type
grouped_by_type = GROUP transactions BY transaction_type;
DUMP grouped_by_type;
Optionally, to count how many transactions of each type:
type_count = FOREACH grouped_by_type GENERATE
group AS transaction_type,
COUNT(transactions) AS transaction_count;
DUMP type_count;
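These statements can be run interactively in the Grunt shell, or saved to a script file and executed with the pig command. A small sketch, assuming the script is saved as transactions.pig (a file name chosen here just for illustration):
pig transactions.pig
# or, to test locally without a Hadoop cluster:
pig -x local transactions.pig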
Q4 An e-commerce company wants to optimize customer order
management for efficient search and retrieval.
Perform the following tasks:
a. Create a Hive table named "ecommerce_orders" with the following
columns:
order_id (INT) - Unique identifier for each order
customer_name (STRING) - Name of the customer
category (STRING) - Category of the product (e.g., Electronics, Clothing,
Home Appliances)
product (STRING) - Name of the product
price (DOUBLE) - Price of the product
b. Insert at least five records into the table.
c. Implement partitioning on the "ecommerce_orders" table using the
"category" column to improve query performance.
d. Write a Hive query to retrieve all orders for a specific product category.
a. Create the Hive table ecommerce_orders
We will create a partitioned table by category to support efficient query
filtering.
CREATE TABLE ecommerce_orders (
order_id INT,
customer_name STRING,
product STRING,
price DOUBLE
)
PARTITIONED BY (category STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Note: Since category is used as a partition column, it's not included in the
main table schema.
b. Insert at least five records into the table
Partitioned tables require partition information at the time of insertion.
-- Insert 2 Electronics orders
INSERT INTO TABLE ecommerce_orders PARTITION (category='Electronics')
VALUES
(1001, 'Alice', 'Smartphone', 699.99),
(1002, 'Bob', 'Laptop', 999.50);
-- Insert 2 Clothing orders
INSERT INTO TABLE ecommerce_orders PARTITION (category='Clothing')
VALUES
(1003, 'Carol', 'T-Shirt', 25.00),
(1004, 'Dave', 'Jeans', 49.99);
-- Insert 1 Home Appliances order
INSERT INTO TABLE ecommerce_orders PARTITION (category='Home Appliances') VALUES
(1005, 'Eve', 'Microwave Oven', 150.00);
c. Partitioning by the category column
Already handled while creating the table (step a). Partitioning optimizes
queries by scanning only relevant folders under
/user/hive/warehouse/ecommerce_orders/.
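As a quick check (a sketch, assuming the default warehouse location shown above and the default database), the partitions and their directories can be listed like this:
hive -e "SHOW PARTITIONS ecommerce_orders;"
hdfs dfs -ls /user/hive/warehouse/ecommerce_orders/
# expect one subdirectory per partition, e.g. category=Electronics, category=Clothing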
d. Hive query to retrieve all orders for a specific category (e.g.,
Electronics)
SELECT * FROM ecommerce_orders
WHERE category = 'Electronics';
Q5 A bank wants to efficiently analyze customer transactions by
categorizing them into different groups. Perform the following tasks:
a. Create a Hive table named "bank_transactions" with the following columns:
transaction_id (INT) - Unique identifier for each transaction
account_number (STRING) - Customer's bank account number
transaction_type (STRING) - Type of transaction (e.g., Credit, Debit)
amount (DOUBLE) - Transaction amount
b. Insert at least six records into the table.
c. Implement bucketing on the "bank_transactions" table using the
"account_number" column, creating 4 buckets.
d. Write a Hive query to calculate the total amount spent on debit
transactions.
a. Create the Hive table bank_transactions with bucketing
First, enable bucketing and create the table:
SET hive.enforce.bucketing = true;
CREATE TABLE bank_transactions (
transaction_id INT,
account_number STRING,
transaction_type STRING,
amount DOUBLE
)
CLUSTERED BY (account_number) INTO 4 BUCKETS
STORED AS TEXTFILE;
CLUSTERED BY with INTO 4 BUCKETS enables bucketing by
account_number.
b. Insert at least six records into the table
Important: Hive bucketing only works with INSERT ... SELECT from
another table, not direct VALUES.
So, first create a staging table without bucketing:
CREATE TABLE bank_transactions_stage (
transaction_id INT,
account_number STRING,
transaction_type STRING,
amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- Insert sample data into the staging table
INSERT INTO bank_transactions_stage VALUES
(101, 'ACC1001', 'Credit', 1500.00),
(102, 'ACC1002', 'Debit', 500.00),
(103, 'ACC1003', 'Debit', 1200.50),
(104, 'ACC1001', 'Debit', 750.00),
(105, 'ACC1004', 'Credit', 2000.00),
(106, 'ACC1002', 'Debit', 300.00);
INSERT INTO TABLE bank_transactions
SELECT * FROM bank_transactions_stage;
c. Bucketing by account_number into 4 buckets
Already done in part (a). To see bucket files, you can look into the Hive
warehouse directory after loading (requires CLI or HDFS access).
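For example (a sketch, assuming the default warehouse location and the default database), listing the table directory should show four bucket files, typically named 000000_0 through 000003_0:
hdfs dfs -ls /user/hive/warehouse/bank_transactions/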
d. Query: Total amount spent on debit transactions
SELECT SUM(amount) AS total_debit_amount
FROM bank_transactions
WHERE transaction_type = 'Debit';
Q6 A university administration requires a system to store and manage
student information along with their course details. Perform the following
tasks using HBase:
a. Retrieve a list of all tables in the database.
b. Display the structure of the "university_records" table, which consists of
student and course details.
c. Insert two student records into the table.
a. Retrieve a list of all tables in HBase
Use the HBase shell and run:
list
b. Display the structure of the "university_records" table
Use the describe command:
describe 'university_records'
If the table doesn’t exist yet, you can create it like this:
create 'university_records', 'student', 'course'
'student' column family can store fields like name, age, etc.
'course' column family can store fields like course_name, duration,
etc.
c. Insert two student records into the table
Each row is identified by a row key (e.g., the student ID). Use the put command:
# Insert Student 1
put 'university_records', '1001', 'student:name', 'Alice'
put 'university_records', '1001', 'student:age', '21'
put 'university_records', '1001', 'course:course_name', 'Computer Science'
put 'university_records', '1001', 'course:duration', '4 Years'
# Insert Student 2
put 'university_records', '1002', 'student:name', 'Bob'
put 'university_records', '1002', 'student:age', '22'
put 'university_records', '1002', 'course:course_name', 'Electronics'
put 'university_records', '1002', 'course:duration', '4 Years'
To check inserted data:
scan 'university_records'
Q7 A large-scale data processing system needs optimization. Perform the
following system maintenance tasks:
a. Check the memory usage of storage services.
b. Display the number of files in directory "/Research".
c. Reduce the size of the file abc.csv.
d. Set replication factor of abc.csv to 4.
a. Check the memory usage of storage services
First list the running Hadoop daemons to get their process IDs:
jps
Then check the memory usage of the NameNode and DataNode processes with:
top -p <PID>
Or, check overall memory status:
free -m
For HDFS-specific disk usage:
hdfs dfsadmin -report
b. Display the number of files in directory "/Research"
hdfs dfs -count /Research
Output will give you:
DIR_COUNT FILE_COUNT CONTENT_SIZE /Research
c. Reduce the size of the file abc.csv
One option is to compress the file with gzip and put the compressed copy back
(removing the original afterwards if it is no longer needed):
hdfs dfs -get /path/to/abc.csv
gzip abc.csv
hdfs dfs -put abc.csv.gz /path/to/
hdfs dfs -rm /path/to/abc.csv
Or, filter out unwanted records before writing the data back to HDFS:
hdfs dfs -cat /path/to/abc.csv | grep "important_data" | hdfs dfs -put - /path/to/abc_filtered.csv
d. Set replication factor of abc.csv to 4
hdfs dfs -setrep -w 4 /path/to/abc.csv
-w waits until replication is complete.
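To confirm the new replication factor afterwards (the second column of the -ls output is the replication count):
hdfs dfs -ls /path/to/abc.csv
hdfs dfs -stat "replication: %r" /path/to/abc.csv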
Q8 An educational institute needs a system to store student exam scores
and find the highest score in each subject. Perform the following tasks:
a. Create a Hive table named "student_scores" with the following columns:
roll_no (INT) - Unique identifier for each student
student_name (STRING) - Name of the student
subject (STRING) - Name of the subject
marks (INT) - Marks obtained by the student
b. Insert at least six records into the table.
c. Write a Hive query to fetch the highest marks in each subject.
a. Create the Hive table "student_scores"
To create the table with the required columns (roll_no, student_name,
subject, and marks), use the following query:
CREATE TABLE student_scores (
roll_no INT,
student_name STRING,
subject STRING,
marks INT
);
b. Insert at least six records into the table
You can insert records into the table like this:
INSERT INTO student_scores VALUES
(1, 'John Doe', 'Mathematics', 85),
(2, 'Jane Smith', 'Mathematics', 90),
(3, 'Sam Brown', 'Physics', 78),
(4, 'Lisa White', 'Physics', 82),
(5, 'Tom Green', 'Chemistry', 88),
(6, 'Emma Black', 'Chemistry', 92);
c. Write a Hive query to fetch the highest marks in each subject
To get the highest marks in each subject, use the GROUP BY clause together with the MAX() function:
SELECT subject, MAX(marks) AS highest_marks
FROM student_scores
GROUP BY subject;
Q9 Write a Java program to insert at least four records into the
"hospital_records" table with column families "patient_info" (name, age,
blood_group) and "treatment_info" (disease, medication,
doctor_assigned). (table should be created using hbase commands and
data should be entered using java program)
To accomplish this task, you need to perform two main steps:
1. Create the HBase table using HBase shell commands.
2. Write a Java program to insert records into the hospital_records
table.
Step 1: Create the HBase Table Using HBase Shell Commands
In HBase shell, you can create the table with two column families:
patient_info and treatment_info.
create 'hospital_records', 'patient_info', 'treatment_info'
This command will create a table named hospital_records with the two
column families patient_info (for patient details) and treatment_info (for
treatment details).
Step 2: Java Program to Insert Records into the Table
The Java program to insert records involves using the HBase API for
inserting data into the table. Below is the Java code that inserts records
into the hospital_records table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
public class HospitalRecordInsertion {
public static void main(String[] args) throws Exception {
// Set up the configuration for HBase connection
Configuration config = HBaseConfiguration.create();
// Create the HBase connection and table
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf("hospital_records"));
// Create the Put object for the row to insert data
// Insert record 1
Put put1 = new Put(Bytes.toBytes("row1"));
put1.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("name"), Bytes.toBytes("John Doe"));
put1.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("age"), Bytes.toBytes("45"));
put1.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("blood_group"), Bytes.toBytes("O+"));
put1.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("disease"), Bytes.toBytes("Flu"));
put1.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("medication"),
Bytes.toBytes("Tamiflu"));
put1.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("doctor_assigned"), Bytes.toBytes("Dr.
Smith"));
// Insert record 2
Put put2 = new Put(Bytes.toBytes("row2"));
put2.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("name"), Bytes.toBytes("Jane Doe"));
put2.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("age"), Bytes.toBytes("34"));
put2.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("blood_group"), Bytes.toBytes("A+"));
put2.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("disease"), Bytes.toBytes("Pneumonia"));
put2.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("medication"),
Bytes.toBytes("Antibiotics"));
put2.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("doctor_assigned"), Bytes.toBytes("Dr.
Adams"));
// Insert record 3
Put put3 = new Put(Bytes.toBytes("row3"));
put3.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("name"), Bytes.toBytes("Sam Brown"));
put3.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("age"), Bytes.toBytes("29"));
put3.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("blood_group"), Bytes.toBytes("B-"));
put3.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("disease"), Bytes.toBytes("COVID-19"));
put3.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("medication"),
Bytes.toBytes("Remdesivir"));
put3.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("doctor_assigned"), Bytes.toBytes("Dr.
Miller"));
// Insert record 4
Put put4 = new Put(Bytes.toBytes("row4"));
put4.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("name"), Bytes.toBytes("Emily White"));
put4.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("age"), Bytes.toBytes("60"));
put4.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("blood_group"), Bytes.toBytes("AB+"));
put4.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("disease"), Bytes.toBytes("Asthma"));
put4.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("medication"),
Bytes.toBytes("Ventolin"));
put4.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("doctor_assigned"), Bytes.toBytes("Dr.
Taylor"));
// Adding the records to the table
table.put(put1);
table.put(put2);
table.put(put3);
table.put(put4);
// Closing resources
table.close();
connection.close();
System.out.println("Records inserted successfully!");
}
}
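After running the program, the inserted rows can be checked from the HBase shell with a quick scan:
scan 'hospital_records'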
Q10 A company collects IoT sensor data and needs to store it in a structured distributed
environment. Perform the following tasks:
a. Create a directory "/sensor_data" in the distributed storage system.
b. Upload a file named "readings.csv" from the local system.
c. Check the integrity of "readings.csv".
d. Append new sensor readings from a local file named "additional_readings.csv" to
"readings.csv" in HDFS.
e. Delete "readings.csv" permanently from storage.
a. Create a directory "/sensor_data" in the distributed storage system
In HDFS, you can use the following command to create a directory:
hdfs dfs -mkdir /sensor_data
This will create a directory named /sensor_data in the HDFS filesystem.
b. Upload a file named "readings.csv" from the local system to HDFS
To upload the readings.csv file from your local system to the HDFS directory /sensor_data,
use the following command:
hdfs dfs -put /local/path/to/readings.csv /sensor_data/
Replace /local/path/to/readings.csv with the actual path where readings.csv is located on your
local system.
c. Check the integrity of "readings.csv"
To check the integrity of readings.csv, you can use the -checksum option in HDFS. This will
return the checksum of the file to verify its integrity:
hdfs dfs -checksum /sensor_data/readings.csv
This command will display the checksum for the file, which you can compare with the
expected checksum to ensure data integrity.
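If the original local copy is still available, a simple cross-check (a sketch, reusing the local path from part b) is to compare MD5 sums of the two copies:
md5sum /local/path/to/readings.csv
hdfs dfs -cat /sensor_data/readings.csv | md5sum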
d. Append new sensor readings from a local file named "additional_readings.csv" to
"readings.csv" in HDFS
To append data from additional_readings.csv to readings.csv in HDFS, you can use the
-appendToFile command:
hdfs dfs -appendToFile /local/path/to/additional_readings.csv /sensor_data/readings.csv
This command will append the contents of additional_readings.csv to the existing
readings.csv file in HDFS.
e. Delete "readings.csv" permanently from storage
To delete readings.csv from HDFS permanently, use the -rm command; adding -skipTrash
bypasses the trash so the file is removed immediately rather than moved to .Trash:
hdfs dfs -rm -skipTrash /sensor_data/readings.csv
This deletes readings.csv from HDFS. To remove a directory and its contents, use -rm -r instead.