
Tutorial For Course Work


Task A explanation

Hai Huang
Task A

 Task A (25 marks): Hive Data Warehouse Design


Please design a data warehouse in Hive with your own data (a collection with at least 50 records in
at least 3 tables). Please implement 10 different queries on that data. Make sure the data and queries
show adequate variety and complexity. Please provide appropriate explanation/discussion and
adequate screenshots to prove your implementation of data, queries, and results of queries.
• You need access to a Hadoop Virtual Machine.
• Follow the week 2 lab document to access your own Hadoop Virtual Machine.
• If you work at home on your laptop, first open the student virtual desktop at https://rdweb.wvd.microsoft.com/arm/webclient/index.html , then access the Hadoop Virtual Machine from inside the student virtual desktop.
• Step 1: Build CSV files for your Hive tables (you need to build your own data).
• Note that you can't transfer a file from University desktops (or the student virtual desktop) to your Hadoop Virtual Machine.
• Instead, use copy/paste to copy the data into a file on the Hadoop VM (for example, copy the content of a CSV file on a university desktop into a file on the Hadoop VM).
• Or edit the files directly on the Hadoop VM.

• Step 2: Start Hadoop (if it is not started yet).
• Use the command: start-all.sh
• If Hadoop fails to work properly, restart it with two commands:
• stop-all.sh (stops the Hadoop services)
• start-all.sh (starts the Hadoop services)

• Step 3: Upload your CSV files to Hadoop (HDFS).
• Use the command hdfs dfs -put followed by the file name. For example: hdfs dfs -put file123.csv (make sure the working directory contains file123.csv).
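As a rough illustration of Step 1, the sketch below generates a placeholder CSV file with 50 rows shaped like the testTable example used in this tutorial (three strings and one integer). The function names build_rows and write_csv, the city values and the id/name patterns are all hypothetical; the coursework asks for your own data, so treat this only as a starting point. No header line is written because, by default, a Hive text table reads a header as an ordinary data row.

```python
import csv
import random


def build_rows(n, seed=0):
    """Return n rows of placeholder data: three strings and one integer."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    cities = ["London", "Leeds", "Bristol"]  # hypothetical values
    rows = []
    for i in range(1, n + 1):
        rows.append([f"id{i:03d}", f"name{i}", rng.choice(cities), rng.randint(1, 100)])
    return rows


def write_csv(path, rows):
    """Write the rows comma-separated, one record per line, without a header."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    write_csv("file123.csv", build_rows(50))
```

The resulting file123.csv can then be uploaded as described in Step 3.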
• Step 4: Start Hive by entering: hive

• Step 5: Create tables. For example:
create table testTable(c1 STRING, c2 STRING, c3 STRING, v4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

• Step 6: Load the data into the tables you have built in Hive. For example:
LOAD DATA INPATH 'file123.csv' OVERWRITE INTO TABLE testTable; (make sure that file123.csv has been put on Hadoop in Step 3)

• Step 7: Design your queries and execute them in Hive.
• In your report, include a screenshot of each query and its result in Hive.
• Also include screenshots of the records in each table.
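Before loading a file in Step 6, it can help to check that every line matches the table definition. Hive does not reject badly shaped rows at load time; with a delimited text table, a field that cannot be read as the declared type typically shows up as NULL when queried. The sketch below is one possible check, assuming the four-column testTable layout with an INT in the last column. The function name check_csv and its parameters are hypothetical. It splits on every comma, as the delimited text format does, so quoted commas are not treated specially.

```python
import re

# Plain ASCII integer with an optional minus sign.
INT_PATTERN = re.compile(r"-?\d+", re.ASCII)


def check_csv(path, n_columns=4, int_columns=(3,)):
    """Return (line number, problem) pairs for rows that do not fit the table."""
    problems = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split(",")  # split on every comma
            if len(fields) != n_columns:
                problems.append((line_no, f"expected {n_columns} fields, found {len(fields)}"))
                continue
            for i in int_columns:  # zero-based positions of INT columns
                if not INT_PATTERN.fullmatch(fields[i]):
                    problems.append((line_no, f"field {i + 1} is not an integer: {fields[i]!r}"))
    return problems
```

Calling check_csv("file123.csv") returns an empty list when every line has four fields and an integer in the last one.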
Task B explanation
Task B

Please design a MapReduce algorithm (using pseudo-code or Java code) for the task assigned to you. The
algorithm is expected to be as efficient as possible.

• Task b.1: Output the number of papers by each author for each year.
• Task b.2: Output the average number of papers per conference for each year.
• Task b.3: Output the number of authors by each conference for each year.
• Task b.4: Output the average number of authors per paper for each year.
• Task b.5: Output the number of papers by each conference for each year.
Select your Task B
For example, if a student ID is 001374720, the last digit is "0", so the student needs to select task b.1.

Please ignore any number after "-" in your student ID. For example, if your student ID shows "011340894 – 1", ignore the 1. The last digit is then 4, and you need to select task b.3.
• Please review the slides of Lecture 6 (Advanced MapReduce programming, 19 Feb).
• Pseudo-code is highly recommended.
• Check the Lecture 6 slides for pseudo-code styles.
• Design at least two classes: Mapper and Reducer.
• Design a Map function for the Mapper and a Reduce function for the Reducer.
• Clearly show the input and output key-value pairs of the Map function and the Reduce function.
• To make your algorithm more efficient, consider designing a:
• Combiner
• In-Mapper combiner
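To make the Mapper/Reducer structure concrete, here is a minimal local simulation in Python. It uses word count as a stand-in problem (it is not one of the assigned tasks) and runs in a single process rather than on Hadoop. The names map_fn, reduce_fn and run_job are illustrative only and do not belong to any Hadoop API. The docstrings show the input and output key-value pairs of each function.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map: input (line offset, line text) -> output (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: input (word, [1, 1, ...]) -> output (word, total)."""
    return word, sum(counts)


def run_job(lines):
    """Simulate map, shuffle (group values by key) and reduce locally."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


result = run_job(["big data big ideas", "data warehouse"])
print(result)  # {'big': 2, 'data': 2, 'ideas': 1, 'warehouse': 1}
```

For your own task, the same outline applies: decide what the map output key and value are, then describe what the reducer does with the list of values for each key.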
• You should also explain how the input is mapped into (key, value) pairs by the
map stage, i.e., specify what is the key and what is the associated value in each
pair, and, how the key(s) and value(s) are computed.

• Then you should explain how the output (key, value) pairs of the map stage are
processed by the reduce stage to get the final answer(s).

• You need to discuss the efficiency of your algorithm (How does your design make
your algorithm efficient?).
Combiner
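A combiner pre-aggregates the output of each map task before it is sent to the reducers, which reduces the number of key-value pairs that are shuffled. The sketch below simulates this locally in Python with word count as a stand-in problem (not one of the assigned tasks). The names map_fn, combine and run_job and the two input splits are made up for illustration and are not Hadoop API calls.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def combine(pairs):
    """Combiner: pre-sum the (word, count) pairs produced by one map task."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def run_job(splits, use_combiner):
    """Return the final counts and the number of pairs that were shuffled."""
    shuffled = []  # pairs that would travel from map tasks to reducers
    for split in splits:  # one simulated map task per input split
        pairs = [kv for line in split for kv in map_fn(line)]
        shuffled.extend(combine(pairs) if use_combiner else pairs)
    totals = defaultdict(int)
    for word, count in shuffled:  # reduce: sum the counts per word
        totals[word] += count
    return dict(totals), len(shuffled)


splits = [["big data big data", "big ideas"], ["data data data"]]
plain, n_plain = run_job(splits, use_combiner=False)
combined, n_combined = run_job(splits, use_combiner=True)
print(plain == combined, n_plain, n_combined)  # True 9 4
```

Both runs give the same totals, but the run with the combiner shuffles 4 pairs instead of 9. In Hadoop the framework may apply a combiner zero, one or several times, so the reducer must give the correct answer either way. Summing satisfies this; an average does not if the combiner emits averages, so a design that needs an average usually carries a (sum, count) pair instead.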
In-Mapper combiner
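With in-mapper combining, the mapper itself keeps partial results in memory across calls to its map function and emits them once, after the last input record. The sketch below shows the idea in plain Python, again with word count as a stand-in problem (not one of the assigned tasks). The class name InMapperCombiningMapper and the method name close are illustrative; in Hadoop's Java API the hook that runs once at the end of a map task is cleanup().

```python
from collections import defaultdict


class InMapperCombiningMapper:
    """Mapper that aggregates in memory instead of emitting (word, 1) pairs."""

    def __init__(self):
        self.partial = defaultdict(int)  # state kept across map() calls

    def map(self, offset, line):
        """Process one input record; update partial counts, emit nothing yet."""
        for word in line.split():
            self.partial[word.lower()] += 1

    def close(self):
        """Run once after the last record; emit one (word, count) per word."""
        yield from sorted(self.partial.items())


mapper = InMapperCombiningMapper()
for offset, line in enumerate(["big data big data", "big ideas"]):
    mapper.map(offset, line)
emitted = list(mapper.close())
print(emitted)  # [('big', 3), ('data', 2), ('ideas', 1)]
```

Six words go in and only three pairs come out of the map task. The trade-off is memory: the dictionary grows with the number of distinct keys seen by one mapper, so a real design may need to flush it periodically.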
Task C explanation
Task C.1
Task C.2
Task C.3

Cloud hosting strategy
