
VIET NAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF SCIENCE

INTRODUCTION TO BIG DATA ANALYSIS

Lab instructors

Vũ Công Thành | [email protected]


Huỳnh Lâm Hải Đăng | [email protected]

Lab 04: Spark Streaming
1. Statements
In this lab, you and your team are going to implement an Extract-Transform-Load (ETL) pipeline
that performs streaming analysis on time-series data of cryptocurrency prices, in particular
the price of the symbol BTCUSDT from the Binance trading platform. Similar to Lab 1, this lab
is also an exploration of frameworks for processing big data, so you may encounter many
unfamiliar issues during setup as well as during the processing steps. You are recommended to
provide a detailed description of these issues, and how you solved them, in your report.
Disclaimer:
This exercise is for educational and technical demonstration purposes only. It is hereby
declared that no encouragement, endorsement, or recommendation for engaging in
cryptocurrency trading, investment, or speculative activities is provided.

1.1. Extract
In the extract stage, you will utilize Binance's APIs to crawl time-series data about the
symbol, using any programming language capable of doing the task; this means you are NOT
restricted to Python or Scala for this stage. The data is then published through Kafka to the
transform stage.

Requirements: Implement a Kafka producer that

• Fetches the price of the symbol from the Binance API; the response contains a floating-point
value representing the symbol's price.
• Upon receiving a response from the API, checks whether the received JSON conforms to the
following output format, based on the Binance documentation:
{
    "symbol": <a string>,
    "price": <a floating-point value>
}
• Inserts event-time information:
o Add another field to the above JSON that denotes the timestamp associated with this
response; in other words, the timestamp at which your crawler received this JSON response.
o Refer to the ISO 8601 standard for the detailed format of the timestamp; you are
recommended to use your language's time-related libraries for processing these timestamps.
• Runs with a frequency of at least once per 100 milliseconds.
• Pushes those records to a Kafka topic named btc-price.
• You and your team should take screenshots of the significant steps you performed and
include them in your report with detailed explanations.

Binance API:

• Reference: https://developers.binance.com/docs/binance-spot-api-docs/rest-api/market-data-endpoints
• API: api.binance.com/api/v3/ticker/price?symbol=BTCUSDT
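
To make the requirements concrete, a minimal sketch of such a producer is given below. It is
written in Python and assumes the requests and kafka-python packages and a Kafka broker at
localhost:9092; the language, libraries, broker address, and polling interval are all
assumptions that you may change to fit your own setup.

# Sketch of the extract-stage producer (assumptions: Python, requests,
# kafka-python, and a Kafka broker at localhost:9092).
import json
import time
from datetime import datetime, timezone

import requests
from kafka import KafkaProducer

API_URL = "https://api.binance.com/api/v3/ticker/price"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get(API_URL, params={"symbol": "BTCUSDT"}, timeout=5)
    data = resp.json()
    # Only publish responses that conform to the expected {symbol, price} format.
    if isinstance(data.get("symbol"), str) and "price" in data:
        record = {
            "symbol": data["symbol"],
            "price": float(data["price"]),
            # Event time: when the crawler received this response, as ISO 8601 UTC.
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        producer.send("btc-price", record)
    time.sleep(0.1)  # at least one fetch every 100 milliseconds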

1.2. Transform
The transform stage has two steps that involve Kafka publications and subscriptions.
Specifically, the first step calculates the moving average and moving standard deviation
within specified sliding windows, while the second computes the Z-scores of the latest price
against those windows' moving averages and standard deviations.

Allowed programming language(s): Java, Python, and Scala. NOTE: your implementations
should handle late data with a tolerance of up to 10 seconds.
Requirements:
• Implement a program using Spark Structured Streaming, together with Spark SQL, to:
o Subscribe to the btc-price topic from Kafka of the extract stage.
o Use event-time processing to group the received messages into sliding windows of the
following lengths: 30s (30 seconds), 1m (1 minute or 60 seconds), 5m (5 minutes), 15m (15
minutes), 30m (30 minutes), and 1h (1 hour or 60 minutes).
o Compute the moving averages and moving standard deviations by calculating the average
and standard deviation of prices per window.
o You should also handle edge cases with your own rules and definitions, then output the
results in the following format:
{
    "timestamp": <ISO8601 UTC timestamp>,
    "symbol": <a string>,
    [
        {
            "window": <a string among 30s, 1m, 5m, 15m, 30m, 1h>,
            "avg_price": <a floating-point value>,
            "std_price": <a floating-point value>
        },
        ... # Repeat until avg and std of all windows are provided
    ]
}
o Publish the results to another Kafka topic called btc-price-moving in append mode (a
minimal sketch of this step is given after these requirements).
• Implement a program using Spark Structured Streaming, together with Spark SQL, to:
o Listen to both of the following Kafka topics: btc-price and btc-price-moving.
o With one record read from btc-price and another read from btc-price-moving that share
the same timestamp information, compute the Z-score of the price with respect to each
sliding window given in the moving-statistics record.
o After handling the edge cases (if any, with your own rules and definitions), output the
results in the following format:
{
    "timestamp": <ISO8601 UTC timestamp>,
    "symbol": <a string>,
    [
        {
            "window": <a string among 30s, 1m, 5m, 15m, 30m, 1h>,
            "zscore_price": <a floating-point value>
        },
        ... # Repeat until Z-scores of all windows are provided
    ]
}
o Publish this result to a new Kafka topic called btc-price-zscore in append mode, similar
to the prior task (a sketch of this step is given after the references below).
• You and your team should take screenshots of the significant steps you performed and
include them in your report with detailed explanations.
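
As a starting point for the first program above, here is a minimal PySpark sketch. It assumes
the spark-sql-kafka package is on the classpath, a broker at localhost:9092, and a 10-second
slide interval for every window length (the slide interval is not fixed by the lab text). It
emits one flat record per window length rather than the nested array format required above;
assembling that exact format, and handling edge cases, is left to you.

# Sketch of the moving-statistics job (assumptions: spark-sql-kafka on the
# classpath, broker at localhost:9092, 10-second slide for every window).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("btc-price-moving").getOrCreate()

schema = (StructType()
          .add("symbol", StringType())
          .add("price", DoubleType())
          .add("timestamp", StringType()))   # ISO 8601 string from the producer

prices = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "btc-price")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
          .select("r.*")
          .withColumn("event_time", F.col("timestamp").cast("timestamp"))
          .withWatermark("event_time", "10 seconds"))   # 10 s late-data tolerance

lengths = ["30 seconds", "1 minute", "5 minutes",
           "15 minutes", "30 minutes", "1 hour"]

# One streaming query per window length; each publishes flat JSON records.
for length in lengths:
    moving = (prices
              .groupBy(F.window("event_time", length, "10 seconds"), "symbol")
              .agg(F.avg("price").alias("avg_price"),
                   F.stddev("price").alias("std_price"))
              .withColumn("window_length", F.lit(length)))
    (moving
     .select(F.to_json(F.struct("*")).alias("value"))
     .writeStream.format("kafka")
     .option("kafka.bootstrap.servers", "localhost:9092")
     .option("topic", "btc-price-moving")
     .option("checkpointLocation", f"/tmp/ckpt-moving-{length.replace(' ', '')}")
     .outputMode("append")
     .start())

spark.streams.awaitAnyTermination()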

References:

• Z-score: also known as standard score, refer to Standard score - Wikipedia for more
details.
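
For the second program, a stream-stream join sketch is shown below. As a simplification it
assumes that btc-price-moving carries flat JSON records with fields timestamp, symbol, window,
avg_price, and std_price (one record per window length), and that matching records are joined
on exact timestamp equality; the broker address, schemas, and the handling of std_price = 0
are assumptions or omissions you will need to revisit for the lab's exact output format.

# Sketch of the Z-score job (simplifying assumptions described above).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("btc-price-zscore").getOrCreate()

def read_topic(topic, schema):
    # Read a Kafka topic, parse the JSON payload, and attach a 10 s watermark.
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", topic)
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
            .select("r.*")
            .withColumn("event_time", F.col("timestamp").cast("timestamp"))
            .withWatermark("event_time", "10 seconds"))

price_schema = (StructType()
                .add("symbol", StringType())
                .add("price", DoubleType())
                .add("timestamp", StringType()))
moving_schema = (StructType()
                 .add("timestamp", StringType())
                 .add("symbol", StringType())
                 .add("window", StringType())
                 .add("avg_price", DoubleType())
                 .add("std_price", DoubleType()))

prices = read_topic("btc-price", price_schema).alias("p")
moving = read_topic("btc-price-moving", moving_schema).alias("m")

# Stream-stream inner join on matching event time and symbol, then the Z-score.
joined = (prices.join(moving,
                      (F.col("p.event_time") == F.col("m.event_time"))
                      & (F.col("p.symbol") == F.col("m.symbol")))
          .withColumn("zscore_price",
                      (F.col("p.price") - F.col("m.avg_price")) / F.col("m.std_price")))

query = (joined
         .select(F.to_json(F.struct(F.col("p.timestamp").alias("timestamp"),
                                    F.col("p.symbol").alias("symbol"),
                                    F.col("m.window").alias("window"),
                                    F.col("zscore_price"))).alias("value"))
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "btc-price-zscore")
         .option("checkpointLocation", "/tmp/ckpt-btc-price-zscore")
         .outputMode("append")
         .start())
query.awaitTermination()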

1.3. Load
In the load stage, you and your team will use Spark Structured Streaming and the MongoDB
Spark Connector to store the calculated data as collections in streaming mode.

Allowed programming language(s): Java, Python, and Scala. NOTE: your implementations
should handle late data with a tolerance of up to 10 seconds.

Requirements:

• Set up a MongoDB instance for persistently storing the computed data; you are free to
choose where to install the database management system as well as the method of
installation.
• Subscribe to the btc-price-zscore Kafka topic from the transform stage and create a
Spark Structured Stream from it.
• Write this stream to MongoDB collections named btc-price-zscore-<window>,
where <window> encodes the interval associated with the sliding window (30s, 1m,
5m, 15m, 30m, 1h). You are free to define the schema of these collections, but be sure
to denote it in the final report.
• You and your team should take screenshots of the significant steps you performed and
include them in your report with detailed explanations.

References:
• MongoDB Documentation: https://www.mongodb.com/docs/manual/introduction/
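
A minimal sketch of this stage is given below. It assumes the MongoDB Spark Connector (10.x)
and the Kafka connector are on the classpath, MongoDB runs at localhost:27017, a database
named bigdata_lab04, and that btc-price-zscore carries flat JSON records with a window field;
all of these are assumptions, and the schema must match whatever you actually published in
the transform stage.

# Sketch of the load stage (assumptions listed above).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("btc-price-load").getOrCreate()

schema = (StructType()
          .add("timestamp", StringType())
          .add("symbol", StringType())
          .add("window", StringType())
          .add("zscore_price", DoubleType()))

zscores = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "btc-price-zscore")
           .load()
           .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
           .select("r.*"))

# One streaming query per sliding-window length, each writing to its own
# collection btc-price-zscore-<window>.
for w in ["30s", "1m", "5m", "15m", "30m", "1h"]:
    (zscores.filter(F.col("window") == w)
     .writeStream.format("mongodb")
     .option("spark.mongodb.connection.uri", "mongodb://localhost:27017")
     .option("spark.mongodb.database", "bigdata_lab04")
     .option("spark.mongodb.collection", f"btc-price-zscore-{w}")
     .option("checkpointLocation", f"/tmp/ckpt-load-{w}")
     .outputMode("append")
     .start())

spark.streams.awaitAnyTermination()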

1.4. Bonus
This is a bonus extension of the transform stage in which, for each price record published
to the Kafka topic btc-price, you find the shortest window until a different price is
observed. There are two such windows per record: one for the first lower price and another
for the first higher price.

Allowed programming language(s): Java, Python, and Scala. NOTE: your implementations
should handle late data with a tolerance of up to 10 seconds.

Requirements:

• Implement a program using Spark Structured Streaming, together with Spark SQL, to:
o Listen to the Kafka topic btc-price published in the extract stage.
o For each price record p received from that topic with an event-time timestamp t, you
will need to find at most two records in the 20-second interval after it, i.e. (t, t + 20],
where one is the first encountered message with a price higher than that of p and the
other is the first with a lower price.
o Calculate the time difference between each of the found records and t, in floating-point
seconds, then publish the results to the Kafka topics btc-price-higher and btc-price-lower,
respectively. If no higher (or lower) price is found within this 20-second interval, the
program must publish a placeholder record with the window-length field set to 20.0.
o The published JSON should have the following format:

{
    "timestamp": <ISO8601 UTC timestamp>,
    "<higher/lower>_window": <a floating-point value>
}
o The Kafka topics’ publications are in append mode.
• You and your team should take screenshots of the significant steps you performed and
include them in your report with detailed explanations.

Hint: You may need to employ some form of stateful operations for this part of the lab.
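
Spark's stateful operators (for example flatMapGroupsWithState in Scala or
applyInPandasWithState in recent PySpark releases) are one way to implement this
incrementally. Independent of the streaming machinery, the computation being asked for can be
illustrated on an already-collected, time-ordered batch of records as in the plain-Python
sketch below; the helper name and the record representation are hypothetical and for clarity
only.

# Illustration only: the bonus computation over a time-ordered list of
# (timestamp_seconds, price) pairs. A streaming solution would compute the
# same result incrementally with a stateful operator.
def higher_lower_windows(records):
    """For each record, return (higher_window, lower_window) in seconds,
    using the 20.0-second placeholder when no such record exists."""
    results = []
    for i, (t, p) in enumerate(records):
        higher = lower = None
        for t2, p2 in records[i + 1:]:
            if t2 - t > 20.0:          # only consider the interval (t, t + 20]
                break
            if higher is None and p2 > p:
                higher = t2 - t        # first strictly higher price
            if lower is None and p2 < p:
                lower = t2 - t         # first strictly lower price
            if higher is not None and lower is not None:
                break
        results.append((higher if higher is not None else 20.0,
                        lower if lower is not None else 20.0))
    return results

# For the first record below, a higher price first appears 0.3 s later and no
# lower price appears within 20 s, so its result is (0.3, 20.0).
print(higher_lower_windows([(0.0, 100.0), (0.1, 100.0), (0.3, 100.5)]))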

2. Submission Guideline
This lab requires a group submission: the work of your group's members is compressed into a
single file, and only one representative submits this file on Moodle. The submission file
contains a single folder named <GroupID>, where the student ID of the first member (as
registered by your group in the earlier form) is used. Its internal structure is as follows:
<GroupID>
├─ docs
│ ├── Report.pdf

├─ src
│ ├── Extract
│ │ ├── <GroupID>.{py, ipynb, jar, sc, scala} # Executable files
│ │ └── code # Original code & results for the extract stage, if any
│ │
│ ├── Transform
│ │ ├── <GroupID>_moving.{py, ipynb, jar, sc, scala} # Executable files for moving statistics
│ │ ├── <GroupID>_zscore.{py, ipynb, jar, sc, scala} # Executable files for Z-scores
│ │ └── code # Original code & results for the transform stage, if any
│ │
│ ├── Load
│ │ ├── <GroupID>.{py, ipynb, jar, sc, scala} # Executable files
│ │ └── code # Original code & results for the load stage, if any
│ │
│ ├── Bonus
│ │ ├── <GroupID>.{py, ipynb, jar, sc, scala} # Executable files for bonus part
│ │ └── code # Original code & results for the bonus part, if any

└── README.md # (Optional) Instructions to run your code

You must strictly follow the above file structure and compress the whole folder into a
ZIP file named <GroupID>.zip, which is your final file to be submitted to Moodle.

Grading Criteria
The grading criteria are summarized in the table below.

Requirements Points
Extract 2
- The crawler gets the pricing data without any error. 1
- The timestamp is inserted appropriately. 0.5
- The Kafka producer publishes this data to the correct topic. 0.5
Transform 4.25
- The sliding windows are created according to the required intervals. 1
- The means and standard deviations are correctly computed. 1
- These statistics are parsed into the required format. 0.25
- The moving statistics are piped to the correct Kafka topic. 0.25
- The price record is matched against statistics of the same event-time. 1
- Z-scores of the matched price are successfully calculated and formatted. 0.5
- The results are directed towards the output topic. 0.25
Load 2
- MongoDB is set up and running. 1
- The stream is read and written to the appropriate collections in MongoDB. 1
Bonus 1
- Correctly compute the required window lengths. 0.75
- Format and publish the results appropriately. 0.25
Report 1.75
- Overview: source code’s structure, components, and implemented methods. 0.75
- Detailed explanation: verbal and/or visual illustration of the methods. 0.75
- Contribution table: assigned tasks for each member in the group. 0.25
TOTAL 11

Also note that:


• Ensure your code is well-documented with clear comments.

• Include all necessary files, logs, and screenshots to verify successful execution.
• Each task can be accomplished in different environments and programming languages;
remember to provide instructions for running each task if this is the case.

Happy Coding and Best of Luck!

The Instructor./.
