Nile University - CSCI 461: Introduction to Big Data - Spring 2025
Assignment #2
----------------------------------------------------------------------------------------------------
INSTRUCTIONS:
- EACH MEMBER MUST UNDERSTAND EVERYTHING IN THE ASSIGNMENT.
- ANY MEMBER MAY BE ASKED TO DO AN UPDATE OR CHANGE IN CODE.
- The assignment deadline is May 11, 2024 @ 11:45 PM.
- Assignment discussions start in the week beginning May 11, 2024. [Discussion slots will
be announced for each TA]
- The assignment's total grade is 10 marks [distribution is specified in the assignment
requirements below], plus 1 bonus mark.
- Any submission after the deadline incurs a deduction of 2 marks from the assignment's total
grade. [Unless you have a clearly accepted reason sent by mail immediately]
- CHEATING in the assignment results in a ZERO for each member's total grade.
- All members MUST be present in the discussion [unless you have a
clearly accepted reason sent by mail to the TA before the discussion]; otherwise, there
will be a grade deduction of 1 mark.
ASSIGNMENT REQUIREMENTS:
PART ONE:
1. Work on the data in the attached JSON file named mds.json.
2. Use the MongoDB container we worked on during the lab to access MongoDB,
create a database named moviesDB, and import the JSON file data into MongoDB
in a collection named moviesColl.
3. Write at least 2 queries:
a. The first query should make use of indexing in MongoDB. [1.5 Marks]
b. The second query should use logical operators. [1.5 Marks]
PART TWO:
1. Convert mds.json from JSON format to CSV format.
2. Use the container from the lab that contains Hadoop, and copy the data (after
conversion from JSON to CSV) to HDFS.
3. Write a MapReduce job (using IntelliJ; instructions on how to create and run a
MapReduce job in an IntelliJ project were explained in the lab and the slides) to calculate the
average rating for each genre in both movies and TV shows. (Write the code in one
class named ARDriver.java, which also contains the mapper and reducer code; use only the
first genre if more than one exists, and exclude items with a null genre.) [1 Mark for
the driver, 3 Marks for the mapper, 3 Marks for the reducer]
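As an illustration only, the map and reduce logic described in step 3 can be sketched in plain Java, without the Hadoop API, so that it runs standalone. In the real job the same logic would sit inside the map and reduce methods of the Hadoop Mapper and Reducer classes in ARDriver.java. The class name AverageRatingSketch and the column positions TYPE_COL, GENRE_COL, and RATING_COL are hypothetical; the actual positions depend on the CSV header produced from mds.json.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class AverageRatingSketch {
    // ASSUMPTION: hypothetical column positions; adjust to the real CSV header.
    static final int TYPE_COL = 0;
    static final int GENRE_COL = 1;
    static final int RATING_COL = 2;

    /** Splits one CSV line; commas inside double quotes stay in the field.
     *  Simplification: quote characters are dropped, escaped quotes are not preserved. */
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    /** Map step: emits ("type, firstGenre", rating), or null if the record is skipped. */
    static Map.Entry<String, Double> map(String line) {
        List<String> f = splitCsv(line);
        int maxCol = Math.max(TYPE_COL, Math.max(GENRE_COL, RATING_COL));
        if (f.size() <= maxCol) return null;
        String genres = f.get(GENRE_COL).trim();
        if (genres.isEmpty() || genres.equalsIgnoreCase("null")) return null;
        int comma = genres.indexOf(',');
        String first = (comma < 0 ? genres : genres.substring(0, comma)).trim();
        if (first.isEmpty()) return null;
        try {
            double rating = Double.parseDouble(f.get(RATING_COL).trim());
            return new AbstractMap.SimpleEntry<>(f.get(TYPE_COL).trim() + ", " + first, rating);
        } catch (NumberFormatException e) {
            return null; // also skips the header row, whose rating cell is not numeric
        }
    }

    /** Reduce step: averages all ratings collected for one key. */
    static double reduce(List<Double> ratings) {
        double sum = 0;
        for (double r : ratings) sum += r;
        return sum / ratings.size();
    }

    /** Simulates map, shuffle (group by key), and reduce in memory. */
    static Map<String, Double> run(List<String> lines) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Double> kv = map(line);
            if (kv != null) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            out.put(e.getKey(), reduce(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from mds.json.
        List<String> lines = Arrays.asList(
            "type,genre,rating",
            "Movie,\"Drama, Crime\",8.0",
            "Movie,Drama,7.0",
            "TV,,9.0");
        run(lines).forEach((k, v) ->
            System.out.println(String.format(Locale.ROOT, "%s, %.1f", k, v)));
    }
}
```

Building the key from the type and the first genre lets a single reducer pass produce one average per (type, genre) pair, which is the shape shown in NOTE #2.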
BONUS
Additional query in MongoDB. [0.5 Mark]
Using the concept of Combiner in MapReduce in PART TWO. [0.5 Mark]
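One point worth keeping in mind for the Combiner bonus: an average of averages is generally not the overall average, and Hadoop may run a combiner zero, one, or several times. A common way around this is to pass partial (sum, count) pairs and divide only in the reducer. The plain-Java sketch below illustrates that idea without the Hadoop API; the class and method names are hypothetical, and in a real job the combiner's input and output types must match the mapper's output types.

```java
import java.util.Arrays;
import java.util.List;

final class CombinerSketch {
    /** Wraps a single rating as a partial aggregate {sum, count}. */
    static double[] fromRating(double rating) {
        return new double[] { rating, 1 };
    }

    /** Combiner logic: merges partial aggregates into one {sum, count}.
     *  Safe to apply any number of times because addition is associative. */
    static double[] combine(List<double[]> partials) {
        double sum = 0;
        double count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new double[] { sum, count };
    }

    /** Reducer logic: divides only once, at the very end. */
    static double finish(double[] total) {
        return total[0] / total[1];
    }

    public static void main(String[] args) {
        // Made-up ratings for one key, spread over two input splits.
        double[] splitA = combine(Arrays.asList(fromRating(8.0), fromRating(6.0)));
        double[] splitB = combine(Arrays.asList(fromRating(10.0)));
        double average = finish(combine(Arrays.asList(splitA, splitB)));
        System.out.println(average);
    }
}
```

With these made-up values, averaging the two per-split averages (7.0 and 10.0) would give 8.5, whereas carrying sums and counts gives the true mean of the three ratings.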
DELIVERABLES
ALL TEAM MEMBERS MUST SUBMIT THE FOLLOWING
AS ONE ZIP FILE ON Moodle:
- A text file named mDBQ.txt containing your MongoDB queries.
- A text file named mDBR.txt containing your MongoDB query results.
- The Java class ARDriver.java used for the MapReduce job. (ARDriver.java should
contain the Driver code, Mapper code, and Reducer code)
- The output file of the MapReduce job, which is named part-r-00000.
NOTE #1: The conversion from JSON format to CSV format should look like the following:
{
  "name": "Alice Johnson",
  "age": 25,
  "city": "Wonderland",
  "isStudent": true,
  "grades": [90, 88, 95],
  "address": {
    "street": "456 Oak Avenue",
    "zipCode": "54321",
    "country": "Fantasyland"
  }
}
name,age,city,isStudent,grades,address.street,address.zipCode,address.country
Alice Johnson,25,Wonderland,true,"90, 88, 95",456 Oak Avenue,54321,Fantasyland
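The flattening rule shown above (nested objects become dotted column names, arrays become one quoted cell) can be sketched as follows. This is a minimal illustration, assuming the JSON has already been parsed into nested Map and List objects by a JSON library of your choice (parsing is not shown); the class FlattenSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FlattenSketch {
    /** Recursively copies leaf values into 'out', joining nested keys with a dot. */
    static void flatten(String prefix, Map<String, ?> node, Map<String, String> out) {
        for (Map.Entry<String, ?> e : node.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object v = e.getValue();
            if (v instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, ?> child = (Map<String, ?>) v;
                flatten(key, child, out);
            } else if (v instanceof List) {
                // Arrays become a single cell: items joined with ", ".
                List<String> parts = new ArrayList<>();
                for (Object item : (List<?>) v) parts.add(String.valueOf(item));
                out.put(key, String.join(", ", parts));
            } else {
                out.put(key, v == null ? "" : String.valueOf(v));
            }
        }
    }

    /** Quotes a cell when it contains a comma, a quote, or a newline. */
    static String csvField(String s) {
        if (s.contains(",") || s.contains("\"") || s.contains("\n")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    /** Joins cells into one CSV line. */
    static String toCsvLine(Iterable<String> fields) {
        List<String> cells = new ArrayList<>();
        for (String f : fields) cells.add(csvField(f));
        return String.join(",", cells);
    }

    /** The example record from this note, built by hand instead of parsed. */
    static Map<String, Object> sampleRecord() {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("street", "456 Oak Avenue");
        address.put("zipCode", "54321");
        address.put("country", "Fantasyland");
        Map<String, Object> rec = new LinkedHashMap<>();
        rec.put("name", "Alice Johnson");
        rec.put("age", 25);
        rec.put("city", "Wonderland");
        rec.put("isStudent", true);
        rec.put("grades", Arrays.asList(90, 88, 95));
        rec.put("address", address);
        return rec;
    }

    public static void main(String[] args) {
        Map<String, String> flat = new LinkedHashMap<>();
        flatten("", sampleRecord(), flat);
        System.out.println(toCsvLine(flat.keySet()));  // header row
        System.out.println(toCsvLine(flat.values()));  // data row
    }
}
```

A LinkedHashMap is used so that columns keep the order in which the keys appear in the record. For a file with many records, the header would typically be taken from the union of keys across all records, with missing values written as empty cells.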
NOTE #2: The output of the MapReduce job should look like the following:
Movie, Drama, 7.8
Movie, Horror, 8.9
TV, Drama, 7.6
TV, Comedy, 9.8