Bigdata Question

The document contains a series of programming tasks and solutions related to data management using HBase, Hive, and Apache Pig. It includes creating tables, inserting records, managing datasets, and performing queries for various use cases such as hospital records, bank transactions, and student scores. Each section provides code snippets and commands necessary to accomplish the tasks outlined.


Q1. Write a Java program to create an HBase table named "hospital_records" with column families "patient_info" (name, age, blood_group) and "treatment_info" (disease, medication, doctor_assigned).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateHBaseTable {

    public static void main(String[] args) throws Exception {

        // Set HBase configuration (reads hbase-site.xml from the classpath)
        Configuration config = HBaseConfiguration.create();

        // Open a connection and obtain an Admin instance
        Connection connection = ConnectionFactory.createConnection(config);
        Admin admin = connection.getAdmin();

        // Define the table descriptor for the "hospital_records" table
        HTableDescriptor tableDescriptor =
                new HTableDescriptor(TableName.valueOf("hospital_records"));

        // Add the two column families
        tableDescriptor.addFamily(new HColumnDescriptor("patient_info"));
        tableDescriptor.addFamily(new HColumnDescriptor("treatment_info"));

        // Create the table only if it does not already exist
        if (!admin.tableExists(tableDescriptor.getTableName())) {
            admin.createTable(tableDescriptor);
            System.out.println("Table 'hospital_records' created successfully with column families 'patient_info' and 'treatment_info'.");
        } else {
            System.out.println("Table 'hospital_records' already exists.");
        }

        // Close admin and connection
        admin.close();
        connection.close();
    }
}
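
The program must be compiled and run with the HBase client libraries on the classpath. One common approach, assuming the hbase command-line tool is installed on the node, is:

javac -cp "$(hbase classpath)" CreateHBaseTable.java

java -cp "$(hbase classpath):." CreateHBaseTable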
Q2. A research organization manages multiple datasets in a distributed file system and needs to perform security and file-management tasks:

a. Check the number of files stored in "/research".

b. Change dataset ownership to a new user.

c. Restrict write access for group members.

d. Find and list all files in "/research" that were modified in the last 7 days.

e. Display the current file permissions.

Solution:

a. Check the number of files stored in /research

hdfs dfs -count /research

This shows:

- DIR_COUNT (number of directories)
- FILE_COUNT (number of files)
- CONTENT_SIZE (total size in bytes)
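
For example, the output has the form (the numbers below are purely illustrative):

           3           42           10485760 /research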

b. Change dataset ownership to a new user

hdfs dfs -chown newuser /research

To change both user and group:

hdfs dfs -chown newuser:newgroup /research

To do it recursively (for all files/folders inside):

hdfs dfs -chown -R newuser:newgroup /research

c. Restrict write access for group members

You can modify permissions using:

hdfs dfs -chmod g-w /research

This removes write (w) permission for the group.

d. Find and list all files in /research that were modified in the last 7 days

HDFS has no built-in equivalent of Linux find -mtime, so the usual approach is to list the files recursively along with their modification dates and filter the output with a script:

hdfs dfs -ls -R /research | awk '$6 != "" {print $6, $7, $NF}' > files_list.txt

The date in column 6 can then be compared against a cut-off date in a script (see the sketch below). If the data lives on a local filesystem rather than HDFS, the standard Linux find command is simpler:

find /research -type f -mtime -7

For large clusters, administrators often export HDFS metadata or rely on external tools such as Apache Ranger or Cloudera Manager rather than hand-rolled scripts.
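
A minimal shell sketch of such a filter, assuming GNU date and the default hdfs dfs -ls output format (modification date in column 6):

#!/bin/bash
# Cut-off date: 7 days ago, in the same YYYY-MM-DD format printed by hdfs dfs -ls
cutoff=$(date -d "7 days ago" +%Y-%m-%d)

# Keep only regular files (not directories) whose modification date is on or after the cut-off
hdfs dfs -ls -R /research | awk -v cutoff="$cutoff" '$1 !~ /^d/ && $6 >= cutoff {print $6, $7, $NF}'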

e. Display the current file permissions

hdfs dfs -ls /research

Q3. A bank has a dataset containing customer transaction details and wants to process it using a scripting-based data flow language. Perform the following tasks:

a. Load a dataset into Apache Pig containing the following columns:

- transaction_id (INT) - Unique identifier for each transaction
- account_number (STRING) - Customer's bank account number
- transaction_type (STRING) - Type of transaction (e.g., Credit, Debit)
- amount (DOUBLE) - Transaction amount
- branch_name (STRING) - Bank branch where the transaction occurred

b. Write a Pig script to load the data and display all records.

c. Write a Pig command to calculate the total amount of all transactions for each account_number.

d. Write a Pig command to group transactions by transaction_type.

a. Load a dataset into Apache Pig

Assume the dataset is stored as a CSV file at /user/bank/transactions.csv. Each line in the file looks like:

1001,AC12345,Debit,2500.75,MainBranch

1002,AC54321,Credit,1200.00,WestBranch

...

Pig command to load the dataset:

transactions = LOAD '/user/bank/transactions.csv'

USING PigStorage(',')

AS (transaction_id:int,

account_number:chararray,

transaction_type:chararray,

amount:double,

branch_name:chararray);

b. Display all records

DUMP transactions;

c. Calculate the total amount of all transactions for each account_number

grouped_by_account = GROUP transactions BY account_number;

total_amount_per_account = FOREACH grouped_by_account GENERATE

group AS account_number,

SUM(transactions.amount) AS total_amount;

DUMP total_amount_per_account;

d. Group transactions by transaction_type

grouped_by_type = GROUP transactions BY transaction_type;

DUMP grouped_by_type;
Optionally, to count how many transactions of each type:

type_count = FOREACH grouped_by_type GENERATE

group AS transaction_type,

COUNT(transactions) AS transaction_count;

DUMP type_count;
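
Optionally, the aggregated results can be written back to HDFS instead of only being dumped to the console (the output path below is illustrative):

STORE total_amount_per_account INTO '/user/bank/output/total_per_account' USING PigStorage(',');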

Q4. An e-commerce company wants to optimize customer order management for efficient search and retrieval. Perform the following tasks:

a. Create a Hive table named "ecommerce_orders" with the following columns:

- order_id (INT) - Unique identifier for each order
- customer_name (STRING) - Name of the customer
- category (STRING) - Category of the product (e.g., Electronics, Clothing, Home Appliances)
- product (STRING) - Name of the product
- price (DOUBLE) - Price of the product

b. Insert at least five records into the table.

c. Implement partitioning on the "ecommerce_orders" table using the "category" column to improve query performance.

d. Write a Hive query to retrieve all orders for a specific product category.

a. Create the Hive table ecommerce_orders

We will create a table partitioned by category to support efficient query filtering.

CREATE TABLE ecommerce_orders (

order_id INT,

customer_name STRING,

product STRING,

price DOUBLE
)

PARTITIONED BY (category STRING)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE;

Note: Since category is used as a partition column, it's not included in the
main table schema.

b. Insert at least five records into the table

Partitioned tables require partition information at the time of insertion.

-- Insert 2 Electronics orders
INSERT INTO TABLE ecommerce_orders PARTITION (category='Electronics') VALUES
(1001, 'Alice', 'Smartphone', 699.99),
(1002, 'Bob', 'Laptop', 999.50);

-- Insert 2 Clothing orders
INSERT INTO TABLE ecommerce_orders PARTITION (category='Clothing') VALUES
(1003, 'Carol', 'T-Shirt', 25.00),
(1004, 'Dave', 'Jeans', 49.99);

-- Insert 1 Home Appliances order
INSERT INTO TABLE ecommerce_orders PARTITION (category='Home Appliances') VALUES
(1005, 'Eve', 'Microwave Oven', 150.00);

c. Partitioning by the category column

Already handled while creating the table (step a). Partitioning optimizes
queries by scanning only relevant folders under
/user/hive/warehouse/ecommerce_orders/.
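
The partitions created by the inserts in step (b) can be listed directly from Hive; assuming only the three sample categories were inserted, the expected output is one line per partition:

SHOW PARTITIONS ecommerce_orders;

-- category=Clothing
-- category=Electronics
-- category=Home Appliances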

d. Hive query to retrieve all orders for a specific category (e.g., Electronics)

SELECT * FROM ecommerce_orders

WHERE category = 'Electronics';


Q5. A bank wants to efficiently analyze customer transactions by categorizing them into different groups. Perform the following tasks:

a. Create a Hive table named "bank_transactions" with the following columns:

- transaction_id (INT) - Unique identifier for each transaction
- account_number (STRING) - Customer's bank account number
- transaction_type (STRING) - Type of transaction (e.g., Credit, Debit)
- amount (DOUBLE) - Transaction amount

b. Insert at least six records into the table.

c. Implement bucketing on the "bank_transactions" table using the "account_number" column, creating 4 buckets.

d. Write a Hive query to calculate the total amount spent on debit transactions.

a. Create the Hive table bank_transactions with bucketing

First, enable bucketing and create the table:

SET hive.enforce.bucketing = true;

CREATE TABLE bank_transactions (
    transaction_id INT,
    account_number STRING,
    transaction_type STRING,
    amount DOUBLE
)
CLUSTERED BY (account_number) INTO 4 BUCKETS
STORED AS TEXTFILE;

CLUSTERED BY with INTO 4 BUCKETS enables bucketing by account_number.

b. Insert at least six records into the table

Important: In many Hive versions, bucketed tables are reliably populated with INSERT ... SELECT from another table rather than direct VALUES clauses. So, first create a staging table without bucketing:

CREATE TABLE bank_transactions_stage (
    transaction_id INT,
    account_number STRING,
    transaction_type STRING,
    amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Insert sample data into the staging table

INSERT INTO bank_transactions_stage VALUES

(101, 'ACC1001', 'Credit', 1500.00),

(102, 'ACC1002', 'Debit', 500.00),

(103, 'ACC1003', 'Debit', 1200.50),

(104, 'ACC1001', 'Debit', 750.00),

(105, 'ACC1004', 'Credit', 2000.00),

(106, 'ACC1002', 'Debit', 300.00);

INSERT INTO TABLE bank_transactions

SELECT * FROM bank_transactions_stage;

c. Bucketing by account_number into 4 buckets

Already done in part (a). To see bucket files, you can look into the Hive
warehouse directory after loading (requires CLI or HDFS access).
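
For example, assuming the default warehouse location (the exact path depends on the Hive configuration), the bucket files can be listed with:

hdfs dfs -ls /user/hive/warehouse/bank_transactions/

Four data files (typically named 000000_0 through 000003_0) should appear, one per bucket.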

d. Query: Total amount spent on debit transactions

SELECT SUM(amount) AS total_debit_amount


FROM bank_transactions

WHERE transaction_type = 'Debit';

Q6. A university administration requires a system to store and manage student information along with their course details. Perform the following tasks using HBase:

a. Retrieve a list of all tables in the database.

b. Display the structure of the "university_records" table, which consists of student and course details.

c. Insert two student records into the table.

a. Retrieve a list of all tables in HBase

Use the HBase shell and run:

list

b. Display the structure of the "university_records" table

Use the describe command:

describe 'university_records'

If the table doesn’t exist yet, you can create it like this:

create 'university_records', 'student', 'course'

- The 'student' column family can store fields like name, age, etc.
- The 'course' column family can store fields like course_name, duration, etc.

c. Insert two student records into the table

Each row is identified by a row key (e.g., student ID). Use put command:

# Insert Student 1

put 'university_records', '1001', 'student:name', 'Alice'

put 'university_records', '1001', 'student:age', '21'

put 'university_records', '1001', 'course:course_name', 'Computer Science'

put 'university_records', '1001', 'course:duration', '4 Years'

# Insert Student 2
put 'university_records', '1002', 'student:name', 'Bob'

put 'university_records', '1002', 'student:age', '22'

put 'university_records', '1002', 'course:course_name', 'Electronics'

put 'university_records', '1002', 'course:duration', '4 Years'

To check inserted data:

scan 'university_records'

Q7. A large-scale data processing system needs optimization. Perform the following system maintenance tasks:

a. Check the memory usage of storage services.

b. Display the number of files in directory "/Research".

c. Reduce the size of the file abc.csv.

d. Set replication factor of abc.csv to 4.

a. Check the memory usage of storage services

Use the following command to check memory usage:

jps

Find the NameNode and DataNode process IDs, then use:

top -p <PID>

Or, check overall memory status:

free -m

For HDFS-specific disk usage:

hdfs dfsadmin -report

b. Display the number of files in directory "/Research"

hdfs dfs -count /Research

Output will give you:


DIR_COUNT FILE_COUNT CONTENT_SIZE /Research

c. Reduce the size of the file abc.csv

You can compress the file using gzip:

hdfs dfs -get /path/to/abc.csv

gzip abc.csv
hdfs dfs -put abc.csv.gz /path/to/

Or, filter unwanted data before saving again in HDFS:

hdfs dfs -cat /path/to/abc.csv | grep "important_data" | hdfs dfs -put - /path/to/abc_filtered.csv

d. Set replication factor of abc.csv to 4

hdfs dfs -setrep -w 4 /path/to/abc.csv

-w waits until replication is complete.
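
To confirm the new replication factor afterwards, the file's status can be checked (the path is illustrative):

hdfs dfs -stat "Replication: %r" /path/to/abc.csv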

Q8. An educational institute needs a system to store student exam scores and find the highest score in each subject. Perform the following tasks:

a. Create a Hive table named "student_scores" with the following columns:

- roll_no (INT) - Unique identifier for each student
- student_name (STRING) - Name of the student
- subject (STRING) - Name of the subject
- marks (INT) - Marks obtained by the student

b. Insert at least six records into the table.

c. Write a Hive query to fetch the highest marks in each subject.

a. Create the Hive table "student_scores"

To create the table with the required columns (roll_no, student_name, subject, and marks), use the following query:

CREATE TABLE student_scores (

roll_no INT,

student_name STRING,

subject STRING,

marks INT

);

b. Insert at least six records into the table

You can insert records into the table like this:

INSERT INTO student_scores VALUES

(1, 'John Doe', 'Mathematics', 85),

(2, 'Jane Smith', 'Mathematics', 90),


(3, 'Sam Brown', 'Physics', 78),

(4, 'Lisa White', 'Physics', 82),

(5, 'Tom Green', 'Chemistry', 88),

(6, 'Emma Black', 'Chemistry', 92);

c. Write a Hive query to fetch the highest marks in each subject

To get the highest marks in each subject, you can use the GROUP BY and
MAX() function like this:

SELECT subject, MAX(marks) AS highest_marks

FROM student_scores

GROUP BY subject;
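
With the six sample rows inserted above, the query should return one row per subject (row order may vary):

Chemistry    92
Mathematics  90
Physics      82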

Q9. Write a Java program to insert at least four records into the "hospital_records" table with column families "patient_info" (name, age, blood_group) and "treatment_info" (disease, medication, doctor_assigned). (The table should be created using HBase commands and the data should be entered using the Java program.)

To accomplish this task, you need to perform two main steps:

1. Create the HBase table using HBase shell commands.

2. Write a Java program to insert records into the hospital_records table.

Step 1: Create the HBase Table Using HBase Shell Commands

In HBase shell, you can create the table with two column families:
patient_info and treatment_info.

create 'hospital_records', 'patient_info', 'treatment_info'

This command will create a table named hospital_records with the two
column families patient_info (for patient details) and treatment_info (for
treatment details).

Step 2: Java Program to Insert Records into the Table

The Java program to insert records involves using the HBase API for
inserting data into the table. Below is the Java code that inserts records
into the hospital_records table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HospitalRecordInsertion {

    public static void main(String[] args) throws Exception {

        // Set up the configuration for the HBase connection
        Configuration config = HBaseConfiguration.create();

        // Create the HBase connection and get a handle to the table
        Connection connection = ConnectionFactory.createConnection(config);
        Table table = connection.getTable(TableName.valueOf("hospital_records"));

        // Insert record 1
        Put put1 = new Put(Bytes.toBytes("row1"));
        put1.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("name"), Bytes.toBytes("John Doe"));
        put1.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("age"), Bytes.toBytes("45"));
        put1.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("blood_group"), Bytes.toBytes("O+"));
        put1.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("disease"), Bytes.toBytes("Flu"));
        put1.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("medication"), Bytes.toBytes("Tamiflu"));
        put1.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("doctor_assigned"), Bytes.toBytes("Dr. Smith"));

        // Insert record 2
        Put put2 = new Put(Bytes.toBytes("row2"));
        put2.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("name"), Bytes.toBytes("Jane Doe"));
        put2.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("age"), Bytes.toBytes("34"));
        put2.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("blood_group"), Bytes.toBytes("A+"));
        put2.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("disease"), Bytes.toBytes("Pneumonia"));
        put2.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("medication"), Bytes.toBytes("Antibiotics"));
        put2.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("doctor_assigned"), Bytes.toBytes("Dr. Adams"));

        // Insert record 3
        Put put3 = new Put(Bytes.toBytes("row3"));
        put3.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("name"), Bytes.toBytes("Sam Brown"));
        put3.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("age"), Bytes.toBytes("29"));
        put3.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("blood_group"), Bytes.toBytes("B-"));
        put3.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("disease"), Bytes.toBytes("COVID-19"));
        put3.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("medication"), Bytes.toBytes("Remdesivir"));
        put3.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("doctor_assigned"), Bytes.toBytes("Dr. Miller"));

        // Insert record 4
        Put put4 = new Put(Bytes.toBytes("row4"));
        put4.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("name"), Bytes.toBytes("Emily White"));
        put4.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("age"), Bytes.toBytes("60"));
        put4.addColumn(Bytes.toBytes("patient_info"), Bytes.toBytes("blood_group"), Bytes.toBytes("AB+"));
        put4.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("disease"), Bytes.toBytes("Asthma"));
        put4.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("medication"), Bytes.toBytes("Ventolin"));
        put4.addColumn(Bytes.toBytes("treatment_info"), Bytes.toBytes("doctor_assigned"), Bytes.toBytes("Dr. Taylor"));

        // Write the records to the table
        table.put(put1);
        table.put(put2);
        table.put(put3);
        table.put(put4);

        // Close resources
        table.close();
        connection.close();

        System.out.println("Records inserted successfully!");
    }
}
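
After running the program, the inserted rows can be verified from the HBase shell:

scan 'hospital_records'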
Q10. A company collects IoT sensor data and needs to store it in a structured distributed environment. Perform the following tasks:
a. Create a directory "/sensor_data" in the distributed storage system.
b. Upload a file named "readings.csv" from the local system.
c. Check the integrity of "readings.csv".
d. Append new sensor readings from a local file named "additional_readings.csv" to "readings.csv" in HDFS.
e. Delete "readings.csv" permanently from storage.
a. Create a directory "/sensor_data" in the distributed storage system
In HDFS, you can use the following command to create a directory:
hdfs dfs -mkdir /sensor_data
This will create a directory named /sensor_data in the HDFS filesystem.
b. Upload a file named "readings.csv" from the local system to HDFS
To upload the readings.csv file from your local system to the HDFS directory /sensor_data,
use the following command:
hdfs dfs -put /local/path/to/readings.csv /sensor_data/
Replace /local/path/to/readings.csv with the actual path where readings.csv is located on your
local system.
c. Check the integrity of "readings.csv"
To check the integrity of readings.csv, you can use the -checksum option in HDFS. This will
return the checksum of the file to verify its integrity:
hdfs dfs -checksum /sensor_data/readings.csv
This command will display the checksum for the file, which you can compare with the
expected checksum to ensure data integrity.
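The HDFS checksum is not directly comparable to a plain local md5sum (HDFS uses a block-based CRC scheme), so a simple end-to-end verification is to copy the file back and compare it byte for byte (local paths are illustrative):
hdfs dfs -get /sensor_data/readings.csv /tmp/readings_from_hdfs.csv
diff /local/path/to/readings.csv /tmp/readings_from_hdfs.csv && echo "Files are identical"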
d. Append new sensor readings from a local file named "additional_readings.csv" to
"readings.csv" in HDFS
To append data from additional_readings.csv to readings.csv in HDFS, you can use the -appendToFile command:
hdfs dfs -appendToFile /local/path/to/additional_readings.csv /sensor_data/readings.csv
This command will append the contents of additional_readings.csv to the existing
readings.csv file in HDFS.
e. Delete "readings.csv" permanently from storage
To delete readings.csv from HDFS permanently, use the -rm command. If the HDFS trash feature is enabled, add -skipTrash so the file is removed immediately rather than moved to the trash directory:
hdfs dfs -rm -skipTrash /sensor_data/readings.csv
This deletes the file readings.csv from HDFS. If the target were a directory, the -rm -r option would remove it recursively.
