DEPARTMENT OF DATA SCIENCE AND COMPUTER
APPLICATIONS
VI SEMESTER B. TECH
MID-TERM SCHEME OF EVALUATION
SUBJECT: BIG DATA ANALYTICS AND TOOLS (DSE 3222)
SET-1 (Mid-term)
Each question is tagged with Marks (M), CO (CLO), LO, and BL.
1 Describe how MapReduce optimizes vote counting for national election applications compared to traditional methods, ensuring faster result declaration. Justify your answer with an example. [2M | CO 2 | LO 2 | BL 2]
Ans
In the traditional approach, as shown in Figure 1, counting follows a centralized method: each booth, labeled A, B, C, D, and E, collects votes locally but sends them all to a single result center for final tallying. Because every vote, whether physical or digital, converges at one hub, the center quickly becomes a bottleneck. This central node must handle the entire workload of verifying, tallying, and finalizing results, which strains its resources. Moreover, moving large numbers of voting machines physically, or transferring their data over the network, is both time-consuming and costly. The result center's single point of control also creates a single point of failure: any delay or disruption at the center postpones the entire count. As the number of booths or voters grows, this approach struggles to scale, ultimately prolonging the announcement of final results.
In contrast, a MapReduce-inspired approach to vote counting offers a distributed and more efficient alternative to the traditional centralized method. In this approach, each booth (labeled A, B, C, D, and E) tallies its votes locally. Rather than sending all raw ballots or detailed data to the central office, each booth transfers only summarized results (for instance, total votes per candidate) to the result center. This parallels the Map phase in MapReduce, where local nodes process and summarize their data independently before sending those summaries forward.
By reducing the flow of raw votes into the center, the system avoids the massive data-transfer bottleneck that plagues traditional counting. The result center receives partial tallies from each booth and aggregates them to determine the final counts, akin to the Reduce phase in MapReduce. Because only summarized data is transmitted, rather than every individual vote, network load is minimized and the final calculation becomes both straightforward and fast.
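To make the two phases concrete, a minimal Python sketch follows; the booth names, candidates, and ballots are invented purely for illustration.
from collections import Counter

# Hypothetical ballots collected at each booth.
booths = {
    'A': ['X', 'Y', 'X'],
    'B': ['Y', 'Y', 'Z'],
    'C': ['X', 'Z'],
}

def map_phase(ballots):
    # Each booth tallies its own ballots locally and emits only a small summary.
    return Counter(ballots)

def reduce_phase(partial_tallies):
    # The result center merely merges the per-booth summaries.
    total = Counter()
    for tally in partial_tallies:
        total.update(tally)
    return total

partials = [map_phase(ballots) for ballots in booths.values()]  # runs independently per booth
print(reduce_phase(partials))  # Counter({'X': 3, 'Y': 3, 'Z': 2})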
Scheme: Explaining MapReduce over the traditional approach for the given use case – 1M
With diagrammatic representation and example – 1M
2 Explain the results after execution of the following command. [2M | CO 2 | LO 2 | BL 3]
CREATE TABLE IF NOT EXISTS students (
  StudentID INT,
  Name STRING,
  Age INT,
  Course STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
Ans
CREATE TABLE IF NOT EXISTS students
This statement tells Hive to create a table named students only if it does not already exist. It prevents errors by not attempting to recreate an existing table with the specified schema.
ROW FORMAT DELIMITED
This clause specifies that the table's rows are
stored in a delimited format, meaning specific
characters separate the fields within each row.
FIELDS TERMINATED BY ','
This line defines the delimiter for fields (columns)
in each row: a comma. When reading or writing
data, Hive will split the line at each comma to
separate the columns.
LINES TERMINATED BY '\n'
This clause specifies that each row in the text file
ends with a newline character, ensuring that each
line represents one record.
STORED AS TEXTFILE
Finally, this line instructs Hive to store the data in
a plain text file format. This means that the
underlying data files will be simple text files,
following the defined delimiters.
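To make the delimiter clauses concrete, here is a minimal Python sketch of how one line of the underlying text file is split into the four columns; the sample records are invented for illustration.
# Each '\n'-terminated line is one record; each ',' separates the fields.
sample = "101,Asha,21,Hadoop\n102,Ravi,22,Python\n"

for line in sample.splitlines():
    student_id, name, age, course = line.split(',')
    print(student_id, name, age, course)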
Scheme: Explaining
CREATE TABLE IF NOT EXISTS – 0.5M
ROW FORMAT DELIMITED – 0.5M
LINES TERMINATED BY '\n' & FIELDS TERMINATED BY ',' – 0.5M
STORED AS TEXTFILE – 0.5M
3 Describe the different data models of Hive and explain how Hive partitions and buckets improve query execution performance compared to RDBMS indexing, with a suitable example. [2M | CO 3 | LO 3 | BL 2]
Ans
Hive's data model organizes data into databases, tables, partitions, and buckets. To optimize query performance on massive datasets, Hive uses partitioning and bucketing rather than depending on traditional RDBMS-style indexing. Partitioning divides a table into segments based on the value of a specific column (such as date or region), so that when you query data for a particular partition (say, a specific date), Hive only scans that relevant partition. This reduces the amount of data read during query execution. Bucketing further subdivides these partitions into a fixed number of buckets based on a hash function applied to a chosen column (like customer_id), which helps in distributing data evenly.
Example: Consider a national sales dataset with millions of transactions. In an
RDBMS, you might create an index on the sale_date
column to speed up queries, but maintaining such
an index can be cumbersome on a large, dynamic
dataset. In Hive, you would partition the table by
sale_date, so that a query filtering for a specific day
only reads the corresponding partition, avoiding a
full scan. Hive’s partitioning and bucketing
optimize query performance by reducing the data
scanned and enabling efficient parallel processing
in a distributed environment, making it highly
suitable for handling big data workloads.
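To visualize the idea, here is a toy Python sketch of how partition pruning and hash bucketing limit the data a query touches; the column names follow the sales example above, while the rows and bucket count are invented.
NUM_BUCKETS = 4

sales = [
    {'sale_date': '2025-01-01', 'customer_id': 'C1', 'amount': 120.0},
    {'sale_date': '2025-01-01', 'customer_id': 'C7', 'amount': 60.0},
    {'sale_date': '2025-01-02', 'customer_id': 'C1', 'amount': 40.0},
]

# "Partitioning": one group per sale_date, the way Hive keeps one directory per partition value.
# "Bucketing": rows inside a partition are spread over a fixed number of hash buckets.
partitions = {}
for row in sales:
    bucket = hash(row['customer_id']) % NUM_BUCKETS
    partitions.setdefault(row['sale_date'], {}).setdefault(bucket, []).append(row)

# A query filtering on sale_date reads only that partition's buckets (partition pruning).
day = partitions.get('2025-01-01', {})
print(sum(r['amount'] for rows in day.values() for r in rows))  # 180.0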
Scheme: List and describe data models – 1M
Explaining the benefits of partitioning and bucketing over RDBMS during data processing – 1M
4 An online video streaming platform uses Hadoop to analyze user watch history and recommend videos. However, MapReduce jobs take too long to process large datasets. Identify and explain three techniques to optimize Hadoop's performance, considering data locality, file formats, and resource management. [3M | CO 1 | LO 1 | BL 3]
Ans • Data locality refers to processing data as close as possible to where it is
stored in HDFS to minimize data transfer.
Techniques:
Enable rack awareness in Hadoop to schedule tasks closer to the data.
Use Hadoop's speculative execution to launch duplicate copies of slow-running tasks on other nodes, so the fastest copy completes the work.
Ensure data is evenly distributed across nodes by avoiding small files and
balancing HDFS blocks.
(Any 2 techniques – 1Mark)
• The choice of file format significantly impacts Hadoop’s read/write speed
and overall job efficiency.
Techniques:
Use columnar storage formats like Parquet or ORC instead of plain text
(CSV/JSON) to improve read performance.
Enable compression techniques (e.g., Snappy, Gzip) to reduce disk I/O and
speed up processing.
Use larger HDFS block sizes (e.g., 256 MB instead of the default 128 MB) to reduce the number of input splits and increase efficiency.
(Any 2 techniques – 1Mark)
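As a sketch of the columnar-format and compression techniques above (assuming the pyarrow library is available; the file name and columns are illustrative):
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny stand-in for the watch-history dataset.
watch_history = pa.table({
    'user_id': ['u1', 'u2', 'u1'],
    'video_id': ['v10', 'v11', 'v12'],
    'watch_seconds': [120, 95, 300],
})

# Columnar layout plus Snappy compression reduces disk I/O compared to CSV/JSON.
pq.write_table(watch_history, 'watch_history.parquet', compression='snappy')

# Jobs can read back only the columns they need instead of whole records.
views = pq.read_table('watch_history.parquet', columns=['video_id'])
print(views.num_rows)  # 3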
• Hadoop’s resource allocation and scheduling are crucial in job performance.
Techniques:
Use YARN (Yet Another Resource Negotiator) to efficiently allocate
resources across multiple jobs.
Adjust map and reduce task settings ([Link],
[Link]) for better resource utilization.
Configure dynamic resource allocation to prioritize critical jobs and prevent
resource bottlenecks.
(Any 2 techniques – 1Mark)
Total – ( 1 + 1 + 1 = 3 Marks)
5 A large e-commerce company stores terabytes of clickstream data from its website users. The data needs to be processed for customer behavior analysis. The company wants to process this data to identify the most viewed products in the last 24 hours. Describe how a MapReduce job can solve this problem, including the role of the Mapper and Reducer functions. [3M | CO 2 | LO 2 | BL 3]
Ans The clickstream logs contain user interactions with products. The goal is to count the
number of views per product in the last 24 hours.
Role of the Mapper Function: Extracts product views from the clickstream logs and
emits key-value pairs.
• Reads each line of the input data.
• Filters out records older than 24 hours based on the timestamp.
• Emits a key-value pair (product_id, 1).
Role of the Reducer Function: Aggregates product views to compute total views
per product.
• Receives key-value pairs from the Mapper.
• Sums up the counts for each product_id.
• Emits the final result (product_id, total_views).
Sorting and Ranking the Most Viewed Products:
• Once the Reducer output is generated, it can be sorted in descending order to
get the most viewed products.
• An optional combiner function can be used to partially aggregate counts
before sending them to the reducer, improving efficiency.
• The final result is stored in HDFS or sent to a recommendation engine for
further processing.
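A minimal Hadoop Streaming sketch of this job is given below; the log format (epoch timestamp, user_id, product_id, event) is an assumption for illustration, and in a real Streaming job the mapper and reducer would run as separate scripts reading sys.stdin.
#!/usr/bin/env python
import sys
import time

WINDOW_SECONDS = 24 * 60 * 60

def mapper(lines, now=None):
    # Map phase: keep 'view' events from the last 24 hours, emit (product_id, 1).
    now = now or time.time()
    for line in lines:
        parts = line.strip().split(',')
        if len(parts) < 4:
            continue  # skip malformed records
        ts, _user_id, product_id, event = parts[:4]
        if event == 'view' and now - float(ts) <= WINDOW_SECONDS:
            yield f"{product_id}\t1"

def reducer(lines):
    # Reduce phase: input arrives grouped and sorted by key; sum counts per product_id.
    current_key, count = None, 0
    for line in lines:
        key, value = line.strip().split('\t')
        if key != current_key:
            if current_key is not None:
                yield f"{current_key}\t{count}"
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        yield f"{current_key}\t{count}"

if __name__ == '__main__':
    stage = sys.argv[1] if len(sys.argv) > 1 else 'map'
    for record in (mapper(sys.stdin) if stage == 'map' else reducer(sys.stdin)):
        print(record)
A follow-up sort of the reducer output by count then yields the ranked list of most viewed products.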
(Mapper – 1M + Reducer – 1M + Shuffle – 1M) = 3 Marks
6 A food delivery service has two datasets: orders (fields: order_id, user_id, restaurant_id, amount, order_date) and users (fields: user_id, name, email, city). The company wants to analyze total spending per city to determine demand trends. Explain the different types of joins in Pig (e.g., inner, outer, replicated joins) and recommend the most efficient type for this use case. Write a Pig Latin script to calculate the total revenue per city. [3M | CO 2 | LO 2 | BL 3]
Ans Inner Join: (0.5M)
• Returns only matching records from both datasets.
• Suitable when we need data that exists in both tables.
• Example: If a user has never placed an order, they will be excluded.
JOIN orders BY user_id, users BY user_id;
Outer Join:(0.5M)
Includes records even if there is no match in one dataset.
• Left Outer Join: Keeps all records from the left (orders) dataset.
• Right Outer Join: Keeps all records from the right (users) dataset.
• Full Outer Join: Keeps all records from both datasets.
JOIN orders BY user_id LEFT OUTER, users BY user_id;
Replicated Join:(0.5M)
• Optimized for small datasets (users) and large datasets (orders).
• The users dataset is replicated across all nodes to improve performance.
• Most efficient choice in this case since users is smaller than orders.
JOIN orders BY user_id, users BY user_id USING 'replicated';
Since the users dataset is smaller, a replicated join is most efficient. (0.5M)
• It avoids expensive shuffle operations and improves query performance.
• Why not a plain (shuffle-based) inner join? – It would move the large orders dataset across the network unnecessarily.
• Why not an outer join? – Unnecessary overhead, as we only need matched records.
Pig Latin Script: (1M)
-- Load orders data
orders = LOAD '[Link]' USING PigStorage(',') AS (order_id:chararray, user_id:chararray, restaurant_id:chararray, amount:double, order_date:chararray);
-- Load users data
users = LOAD '[Link]' USING PigStorage(',') AS (user_id:chararray, name:chararray, email:chararray, city:chararray);
-- Perform Replicated Join (efficient because the users dataset is small)
joined_data = JOIN orders BY user_id, users BY user_id USING 'replicated';
-- Group data by city
grouped_data = GROUP joined_data BY city;
-- Calculate total spending per city
total_spending = FOREACH grouped_data GENERATE group AS city, SUM(joined_data.amount) AS total_revenue;
-- Output the result
DUMP total_spending;
Total = (0.5M + 0.5M + 0.5M +0.5M + 1M = 3 Marks)
7 A large e-commerce company stores transaction data (transaction_id, user_id, amount, payment_method, transaction_date) in Hadoop using Hive. The data is organized by transaction date, and the company frequently queries transactions from specific days. Explain how partitioning in Hive can improve query performance. The company wants to partition the transactions table by transaction_date. Write a HiveQL query to create a partitioned table for transactions and load data into the appropriate partitions for the months of January and February 2025. [3M | CO 3 | LO 3 | BL 4]
Ans How Partitioning Improves Query Performance? (1 Mark)
• Faster Queries: Queries only scan relevant partitions instead of the entire
table.
• Efficient Storage Management: Data is physically separated, making retrieval
more structured.
• Reduced Disk I/O: Less data is read from storage when filtering by
transaction_date
HiveQL Query for Creating a Partitioned Table (1 Mark)
CREATE TABLE transactions (
transaction_id STRING,
user_id STRING,
amount DOUBLE,
payment_method STRING
) PARTITIONED BY (transaction_date STRING)
STORED AS PARQUET;
Loading Data into Partitions (1 Mark)
-- Load transactions for January 2025
LOAD DATA INPATH 'hdfs:/data/transactions/january_2025.csv'
INTO TABLE transactions PARTITION (transaction_date='2025-01');
-- Load transactions for February 2025
LOAD DATA INPATH 'hdfs:/data/transactions/february_2025.csv'
INTO TABLE transactions PARTITION (transaction_date='2025-02');
Total = 1M + 1M + 1M = 3 Marks
8 A logistics company needs to store real-time package tracking data and is evaluating whether to use HBase or a relational database. Compare HBase with traditional RDBMS in terms of data consistency, scalability, and query flexibility for real-time package tracking. The company wants to quickly retrieve the latest status update for a package. How should they design the row key and column family to ensure efficient lookups? [3M | CO 3 | LO 3 | BL 4]
Ans Comparison (1M):
• Data consistency: HBase offers strong consistency at the row level but no multi-row ACID transactions, whereas an RDBMS provides full ACID guarantees.
• Scalability: HBase scales horizontally across commodity nodes and absorbs very high write rates from real-time tracking events; an RDBMS typically scales vertically and becomes a bottleneck at that volume.
• Query flexibility: an RDBMS supports ad-hoc SQL with joins; HBase is limited to fast key-based gets and range scans on the row key, so the row key must be designed around the access pattern.
Row Key Design (1M)
• Format: "package_id#timestamp", using a reverse timestamp (e.g., Long.MAX_VALUE minus the epoch time) so the most recent update sorts first.
• Example: "PKG123#20250224103000" (Package ID + timestamp in YYYYMMDDHHMMSS format; in practice the reversed value is stored to get latest-first ordering).
• Why?
o Ensures the latest update appears first when scanned.
o Enables efficient range scans for historical tracking.
Column Family Design (1M)
• Column Family: "tracking_info" (Stores tracking details).
• Columns: "location", "status", "updated_by".
• Why?
o Organizes related tracking data together for faster retrieval.
o Flexible schema allows new attributes to be added easily
Example:
create 'package_tracking', {NAME => 'tracking_info', VERSIONS => 1}
put 'package_tracking', 'PKG123#20250224103000', 'tracking_info:location', 'New York'
put 'package_tracking', 'PKG123#20250224103000', 'tracking_info:status', 'In Transit'
put 'package_tracking', 'PKG123#20250224103000', 'tracking_info:updated_by', 'Logistics System'
scan 'package_tracking', {STARTROW => 'PKG123#', LIMIT => 1}
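To show how a reverse timestamp makes the newest update sort first, here is a small Python helper; the key layout and padding width are illustrative choices, not a fixed HBase API.
import time

LONG_MAX = 2**63 - 1  # same basis as Java's Long.MAX_VALUE

def make_row_key(package_id, event_time_ms=None):
    # Build 'package_id#reverse_timestamp' so the latest update has the
    # lexicographically smallest suffix and is returned first by a scan.
    if event_time_ms is None:
        event_time_ms = int(time.time() * 1000)
    reverse_ts = LONG_MAX - event_time_ms
    return f"{package_id}#{reverse_ts:019d}"

print(make_row_key('PKG123'))  # e.g. 'PKG123#92233702...'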
( 1M + 1M + 1M ) = 3 Marks
9 A university wants to store and manage student activity records using HBase. The data should be structured efficiently for easy retrieval and analysis. Perform the following tasks using HBase queries based on the given student activity dataset: [4M | CO 3 | LO 3 | BL 4]
i. Create a database named University Records.
ii. Create a table named student_activity under this database with the following column families:
personal_details – stores student_name, email, age.
academic_activity – stores course_name, last_login, IP_address.
iii. Insert data for the above student records (S001, S002, and S1003).
iv. Retrieve all records from the student_activity table.
Ans Create the Namespace (Database)
In HBase, a database corresponds to a namespace, so first create the UniversityRecords namespace:
create_namespace 'UniversityRecords'
Create the Table with Column Families
Create a table named student_activity under the UniversityRecords namespace. The table is defined with two column families:
personal_details – to store fields such as student_name, email, and age.
academic_activity – to store fields like course_name, last_login, and IP_address.
The command below creates the table and defines these families:
create 'UniversityRecords:student_activity', 'personal_details', 'academic_activity'
Insert Data for Student Records
Insert data for three student records (S001, S002,
and S1003) using the put command. Each put
statement specifies the table, row key, column
(using the syntax column_family:column), and the
corresponding value.
For example, for student S001 we insert data into both column families; the same is done for S002 and S1003.
# Insert data for S001
put 'UniversityRecords:student_activity', 'S001', 'personal_details:student_name', 'SSB'
put 'UniversityRecords:student_activity', 'S001', 'personal_details:email', 'ssb@[Link]'
put 'UniversityRecords:student_activity', 'S001', 'personal_details:age', '20'
put 'UniversityRecords:student_activity', 'S001', 'academic_activity:course_name', 'hadoop'
put 'UniversityRecords:student_activity', 'S001', 'academic_activity:last_login', '2021-11-01'
put 'UniversityRecords:student_activity', 'S001', 'academic_activity:IP_address', '[Link]'
# Insert data for S002
put 'UniversityRecords:student_activity', 'S002', 'personal_details:student_name', 'XYZ'
put 'UniversityRecords:student_activity', 'S002', 'personal_details:email', 'xyz@[Link]'
put 'UniversityRecords:student_activity', 'S002', 'personal_details:age', '22'
put 'UniversityRecords:student_activity', 'S002', 'academic_activity:course_name', 'python'
put 'UniversityRecords:student_activity', 'S002', 'academic_activity:last_login', '2021-11-02'
put 'UniversityRecords:student_activity', 'S002', 'academic_activity:IP_address', '[Link]'
# Insert data for S1003
put 'UniversityRecords:student_activity', 'S1003', 'personal_details:student_name', 'abc'
put 'UniversityRecords:student_activity', 'S1003', 'personal_details:email', 'abc@[Link]'
put 'UniversityRecords:student_activity', 'S1003', 'personal_details:age', '21'
put 'UniversityRecords:student_activity', 'S1003', 'academic_activity:course_name', 'Java'
put 'UniversityRecords:student_activity', 'S1003', 'academic_activity:last_login', '2021-11-03'
put 'UniversityRecords:student_activity', 'S1003', 'academic_activity:IP_address', '[Link]'
Retrieve All Records
To verify that the data has been inserted correctly and to view all records, use the scan command, which scans the entire table:
scan 'UniversityRecords:student_activity'
Scheme: i) 0.5M
ii) 1M
iii) 2M
iv) 0.5M
10 Consider two relations, Department and Course, with the following attributes: Department (Dept_ID, Dept_name, Course_code) and Course (course_code, course_name). Using this schema, write a Pig Latin script and a MapReduce program separately to perform a join operation between Department and Course on the course_code column. Compare and analyze why Apache Pig offers a more efficient and simplified approach for data processing and analysis compared to raw MapReduce for the given problem. [5M | CO 2 | LO 2 | BL 4]
Ans Pig Latin script:
dept = LOAD 'Department' USING PigStorage(',') AS (Dept_ID:chararray, Dept_name:chararray, Course_code:chararray);
course = LOAD 'Course' USING PigStorage(',') AS (course_code:chararray, course_name:chararray);
joined = JOIN dept BY Course_code, course BY course_code;
DUMP joined;
MapReduce code:
Mapper for Department records:
#!/usr/bin/env python
# Emits course_code as the join key, with the record tagged 'D' for Department.
import sys
for line in sys.stdin:
    dept_id, dept_name, course_code = line.strip().split(',')
    print(f"{course_code}\tD|{dept_id},{dept_name}")
Mapper for Course records:
#!/usr/bin/env python
# Emits course_code as the join key, with the record tagged 'C' for Course.
import sys
for line in sys.stdin:
    course_code, course_name = line.strip().split(',')
    print(f"{course_code}\tC|{course_name}")
Reducer code:
#!/usr/bin/env python
# Reduce-side join: input arrives sorted by course_code, with values tagged
# 'D' (Department) or 'C' (Course) by the mappers above.
import sys

def emit(course_code, depts, courses):
    for dept_id, dept_name in depts:
        for course_name in courses:
            print(dept_id, dept_name, course_name, course_code)

current_key, depts, courses = None, [], []
for line in sys.stdin:
    key, value = line.strip().split('\t')
    if key != current_key:
        if current_key is not None:
            emit(current_key, depts, courses)
        current_key, depts, courses = key, [], []
    tag, payload = value.split('|', 1)
    if tag == 'D':
        depts.append(payload.split(',', 1))
    else:
        courses.append(payload)
if current_key is not None:
    emit(current_key, depts, courses)
Apache Pig offers a higher-level language that
abstracts the details of data processing. With Pig,
you write just a few lines to load, join, and store
data. The platform automatically translates these commands into optimized MapReduce jobs. This
not only reduces the amount of code you need to
write and maintain but also minimizes the potential
for errors. Pig handles data partitioning, parallel
execution, and schema management under the
hood, allowing you to focus on the logic of your
analysis rather than the mechanics of distributed
processing. In contrast, raw MapReduce requires
detailed programming for each phase (mapping,
shuffling, and reducing) and manual handling of
complex tasks like data tagging and join logic. For
tasks such as joining Department and Course
relations, Pig provides a cleaner, more concise
solution that is easier to write, understand, and
maintain.
Scheme: Pig Latin script – 1.5M
Mapper code – 1M
Reducer code – 1.5M
Analysis on the optimality of data processing – 1M