
VIET NAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF SCIENCE

INTRODUCTION TO BIG DATA ANALYSIS

Lab instructors

Vũ Công Thành | [email protected]


Huỳnh Lâm Hải Đăng | [email protected]

Lab 04: Spark Streaming
1. Statements
In this lab, you and your team are going to implement an Extract-Transform-Load (ETL) pipeline
that performs streaming analysis on time-series data of cryptocurrency prices, in particular
the price of the symbol BTCUSDT from the Binance trading platform. Similar to Lab 1, this lab
is also an exploration of frameworks for processing big data, so you may encounter many
unfamiliar issues during setup as well as during the processing steps. You are recommended to
provide a detailed description of these issues, and how you solved them, in your report.
Disclaimer:
This exercise is for educational and technical demonstration purposes only. It is hereby
declared that no encouragement, endorsement, or recommendation for engaging in
cryptocurrency trading, investment, or speculative activities is provided.

1.1. Extract
In the extract stage, you will utilize Binance's APIs to crawl time-series data about the
symbol, using any programming language capable of doing the task; this means you are NOT
restricted to Python or Scala for this stage. The data is then published through Kafka to the
transform stage.

Requirements: Implement a Kafka producer that

• Fetches the price of the symbol from the Binance API; the response contains a floating-point
value representing the symbol's price.
• Upon receiving a response from the API, checks whether the received JSON conforms to the
following output format, based on the Binance documentation:
{
    "symbol": <a string>,
    "price": <a floating-point value>
}
• Inserts event-time information:
o Add another field to the above JSON that denotes the timestamp associated with this
response; in other words, the timestamp at which your crawler received this JSON response.
o Refer to the ISO 8601 standard for the detailed format of the timestamp; you are
recommended to use your language's time-related libraries for processing these timestamps.
• Runs with a frequency of at least once per 100 milliseconds.
• Pushes those records to a Kafka topic named btc-price.
• You and your team should take screenshots of the significant steps you performed and
include them in your report with detailed explanations.

Binance API:

• Reference: https://developers.binance.com/docs/binance-spot-api-docs/rest-api/market-data-endpoints
• API: api.binance.com/api/v3/ticker/price?symbol=BTCUSDT
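
To make the requirements concrete, a minimal sketch of such a producer is given below. It is
written in Python and assumes the requests and kafka-python packages and a Kafka broker at
localhost:9092; the language, libraries, broker address, and polling interval are all
assumptions that you may change to fit your own setup.

# Sketch of the extract-stage producer (assumptions: Python, requests,
# kafka-python, and a Kafka broker at localhost:9092).
import json
import time
from datetime import datetime, timezone

import requests
from kafka import KafkaProducer

API_URL = "https://api.binance.com/api/v3/ticker/price"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get(API_URL, params={"symbol": "BTCUSDT"}, timeout=5)
    data = resp.json()
    # Only publish responses that conform to the expected {symbol, price} format.
    if isinstance(data.get("symbol"), str) and "price" in data:
        record = {
            "symbol": data["symbol"],
            "price": float(data["price"]),
            # Event time: when the crawler received this response, as ISO 8601 UTC.
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        producer.send("btc-price", record)
    time.sleep(0.1)  # at least one fetch every 100 milliseconds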

1.2. Transform
The transform stage has two steps that involve Kafka publications and subscriptions.
Specifically, the first step calculates the moving average and moving standard deviation
within specified sliding windows, while the second computes the Z-scores of the latest price
against those windows' moving averages and standard deviations.

Allowed programming language(s): Java, Python, and Scala. NOTE: your implementations
should handle late data with a tolerance of up to 10 seconds.
Requirements:
• Implement a program using Spark Structured Streaming, together with Spark SQL, to:
o Subscribe to the btc-price topic from Kafka of the extract stage.
o Use event-time processing to group the received messages into sliding windows of the
following lengths: 30s (30 seconds), 1m (1 minute or 60 seconds), 5m (5 minutes), 15m (15
minutes), 30m (30 minutes), and 1h (1 hour or 60 minutes).
o Compute the moving averages and moving standard deviations by calculating the average
and standard deviation of prices per window.
o You should also handle edge cases with your own rules and definitions, then output the
results in the following format:
{
    "timestamp": <ISO8601 UTC timestamp>,
    "symbol": <a string>,
    [
        {
            "window": <a string among 30s, 1m, 5m, 15m, 30m, 1h>,
            "avg_price": <a floating-point value>,
            "std_price": <a floating-point value>
        },
        ... # Repeat until avg and std of all windows are provided
    ]
}
o Publish the results to another Kafka topic called btc-price-moving in append mode (a
minimal sketch of this step is given after these requirements).
• Implement a program using Spark Structured Streaming, together with Spark SQL, to:
o Listen to both of the following Kafka topics: btc-price and btc-price-moving.
o With one record read from btc-price and another read from btc-price-moving that share
the same timestamp information, compute the Z-score of the price with respect to each
sliding window given in the moving-statistics record.
o After handling the edge cases (if any, with your own rules and definitions), output the
results in the following format:
{
    "timestamp": <ISO8601 UTC timestamp>,
    "symbol": <a string>,
    [
        {
            "window": <a string among 30s, 1m, 5m, 15m, 30m, 1h>,
            "zscore_price": <a floating-point value>
        },
        ... # Repeat until Z-scores of all windows are provided
    ]
}
o Publish this result to a new Kafka topic called btc-price-zscore in append mode, similar
to the prior task (a sketch of this step is given after the references below).
• You and your team should take screenshots of the significant steps you performed and
include them in your report with detailed explanations.
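
As a starting point for the first program above, here is a minimal PySpark sketch. It assumes
the spark-sql-kafka package is on the classpath, a broker at localhost:9092, and a 10-second
slide interval for every window length (the slide interval is not fixed by the lab text). It
emits one flat record per window length rather than the nested array format required above;
assembling that exact format, and handling edge cases, is left to you.

# Sketch of the moving-statistics job (assumptions: spark-sql-kafka on the
# classpath, broker at localhost:9092, 10-second slide for every window).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("btc-price-moving").getOrCreate()

schema = (StructType()
          .add("symbol", StringType())
          .add("price", DoubleType())
          .add("timestamp", StringType()))   # ISO 8601 string from the producer

prices = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "btc-price")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
          .select("r.*")
          .withColumn("event_time", F.col("timestamp").cast("timestamp"))
          .withWatermark("event_time", "10 seconds"))   # 10 s late-data tolerance

lengths = ["30 seconds", "1 minute", "5 minutes",
           "15 minutes", "30 minutes", "1 hour"]

# One streaming query per window length; each publishes flat JSON records.
for length in lengths:
    moving = (prices
              .groupBy(F.window("event_time", length, "10 seconds"), "symbol")
              .agg(F.avg("price").alias("avg_price"),
                   F.stddev("price").alias("std_price"))
              .withColumn("window_length", F.lit(length)))
    (moving
     .select(F.to_json(F.struct("*")).alias("value"))
     .writeStream.format("kafka")
     .option("kafka.bootstrap.servers", "localhost:9092")
     .option("topic", "btc-price-moving")
     .option("checkpointLocation", f"/tmp/ckpt-moving-{length.replace(' ', '')}")
     .outputMode("append")
     .start())

spark.streams.awaitAnyTermination()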

References:

• Z-score: also known as standard score, refer to Standard score - Wikipedia for more
details.
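
For the second program, a stream-stream join sketch is shown below. As a simplification it
assumes that btc-price-moving carries flat JSON records with fields timestamp, symbol, window,
avg_price, and std_price (one record per window length), and that matching records are joined
on exact timestamp equality; the broker address, schemas, and the handling of std_price = 0
are assumptions or omissions you will need to revisit for the lab's exact output format.

# Sketch of the Z-score job (simplifying assumptions described above).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("btc-price-zscore").getOrCreate()

def read_topic(topic, schema):
    # Read a Kafka topic, parse the JSON payload, and attach a 10 s watermark.
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", topic)
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
            .select("r.*")
            .withColumn("event_time", F.col("timestamp").cast("timestamp"))
            .withWatermark("event_time", "10 seconds"))

price_schema = (StructType()
                .add("symbol", StringType())
                .add("price", DoubleType())
                .add("timestamp", StringType()))
moving_schema = (StructType()
                 .add("timestamp", StringType())
                 .add("symbol", StringType())
                 .add("window", StringType())
                 .add("avg_price", DoubleType())
                 .add("std_price", DoubleType()))

prices = read_topic("btc-price", price_schema).alias("p")
moving = read_topic("btc-price-moving", moving_schema).alias("m")

# Stream-stream inner join on matching event time and symbol, then the Z-score.
joined = (prices.join(moving,
                      (F.col("p.event_time") == F.col("m.event_time"))
                      & (F.col("p.symbol") == F.col("m.symbol")))
          .withColumn("zscore_price",
                      (F.col("p.price") - F.col("m.avg_price")) / F.col("m.std_price")))

query = (joined
         .select(F.to_json(F.struct(F.col("p.timestamp").alias("timestamp"),
                                    F.col("p.symbol").alias("symbol"),
                                    F.col("m.window").alias("window"),
                                    F.col("zscore_price"))).alias("value"))
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "btc-price-zscore")
         .option("checkpointLocation", "/tmp/ckpt-btc-price-zscore")
         .outputMode("append")
         .start())
query.awaitTermination()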

1.3. Load
In the load stage, you and your team will use Spark Structured Streaming and the MongoDB
Spark Connector to store the calculated data as collections in streaming mode.

Allowed programming language(s): Java, Python, and Scala. NOTE: your implementations
should handle late data with a tolerance of up to 10 seconds.

Requirements:

• Set up a MongoDB instance for persistently storing the computed data; you are free to
choose where to install the database management system as well as the method of
installation.
• Subscribe to the btc-price-zscore Kafka topic from the transform stage and create a
Spark Structured Stream from it.
• Write this stream to MongoDB collections named btc-price-zscore-<window>,
where <window> encodes the interval associated with the sliding window (30s, 1m,
5m, 15m, 30m, 1h). You are free to define the schema of these collections, but be sure
to denote it in the final report.
• You and your team should take screenshots of the significant steps you performed and
include them in your report with detailed explanations.

References:
• MongoDB Documentation: https://www.mongodb.com/docs/manual/introduction/
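
A minimal sketch of this stage is given below. It assumes the MongoDB Spark Connector (10.x)
and the Kafka connector are on the classpath, MongoDB runs at localhost:27017, a database
named bigdata_lab04, and that btc-price-zscore carries flat JSON records with a window field;
all of these are assumptions, and the schema must match whatever you actually published in
the transform stage.

# Sketch of the load stage (assumptions listed above).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("btc-price-load").getOrCreate()

schema = (StructType()
          .add("timestamp", StringType())
          .add("symbol", StringType())
          .add("window", StringType())
          .add("zscore_price", DoubleType()))

zscores = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "btc-price-zscore")
           .load()
           .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
           .select("r.*"))

# One streaming query per sliding-window length, each writing to its own
# collection btc-price-zscore-<window>.
for w in ["30s", "1m", "5m", "15m", "30m", "1h"]:
    (zscores.filter(F.col("window") == w)
     .writeStream.format("mongodb")
     .option("spark.mongodb.connection.uri", "mongodb://localhost:27017")
     .option("spark.mongodb.database", "bigdata_lab04")
     .option("spark.mongodb.collection", f"btc-price-zscore-{w}")
     .option("checkpointLocation", f"/tmp/ckpt-load-{w}")
     .outputMode("append")
     .start())

spark.streams.awaitAnyTermination()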

1.4. Bonus
This is a bonus extension of the transform stage in which, for each price record published
to the Kafka topic btc-price, you find the shortest window until a different price is
observed. There are two such windows per record: one for the first lower price and another
for the first higher price.

Allowed programming language(s): Java, Python, and Scala. NOTE: your implementations
should handle late data with a tolerance of up to 10 seconds.

Requirements:

• Implement a program using Spark Structured Streaming, together with Spark SQL, to:
o Listen to the Kafka topic btc-price published in the extract stage.
o For each price record p received from that topic with an event-time timestamp t, you
will need to find at most two records in the 20-second interval after it, i.e. (t, t + 20],
where one is the first encountered message with a price higher than that of p and the
other is the first with a lower price.
o Calculate the time difference between each of the found records and t, in floating-point
seconds, then publish the results to the Kafka topics btc-price-higher and btc-price-lower,
respectively. If no higher (or lower) price is found within this 20-second interval, the
program must publish a placeholder record with the window-length field set to 20.0.
o The published JSON should have the following format:

{
    "timestamp": <ISO8601 UTC timestamp>,
    "<higher/lower>_window": <a floating-point value>
}
o The Kafka topics’ publications are in append mode.
• You and your team should take screenshots of the significant steps you performed and
include them in your report with detailed explanations.

Hint: You may need to employ some form of stateful operations for this part of the lab.
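
Spark's stateful operators (for example flatMapGroupsWithState in Scala or
applyInPandasWithState in recent PySpark releases) are one way to implement this
incrementally. Independent of the streaming machinery, the computation being asked for can be
illustrated on an already-collected, time-ordered batch of records as in the plain-Python
sketch below; the helper name and the record representation are hypothetical and for clarity
only.

# Illustration only: the bonus computation over a time-ordered list of
# (timestamp_seconds, price) pairs. A streaming solution would compute the
# same result incrementally with a stateful operator.
def higher_lower_windows(records):
    """For each record, return (higher_window, lower_window) in seconds,
    using the 20.0-second placeholder when no such record exists."""
    results = []
    for i, (t, p) in enumerate(records):
        higher = lower = None
        for t2, p2 in records[i + 1:]:
            if t2 - t > 20.0:          # only consider the interval (t, t + 20]
                break
            if higher is None and p2 > p:
                higher = t2 - t        # first strictly higher price
            if lower is None and p2 < p:
                lower = t2 - t         # first strictly lower price
            if higher is not None and lower is not None:
                break
        results.append((higher if higher is not None else 20.0,
                        lower if lower is not None else 20.0))
    return results

# For the first record below, a higher price first appears 0.3 s later and no
# lower price appears within 20 s, so its result is (0.3, 20.0).
print(higher_lower_windows([(0.0, 100.0), (0.1, 100.0), (0.3, 100.5)]))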

2. Submission Guideline
This lab requires a group submission: the work of your group's members is compressed into a
single file, and only one representative submits this file on Moodle. The submission file
contains a single folder named <GroupID>, where the student ID of the first member (as
registered by your group in the earlier form) is used. Its internal structure is as follows:
<GroupID>
├─ docs
│ ├── Report.pdf

├─ src
│ ├── Extract
│ │ ├── <GroupID>.{py, ipynb, jar, sc, scala} # Executable files
│ │ └── code # Original code & results for the extract stage, if any
│ │
│ ├── Transform
│ │ ├── <GroupID>_moving.{py, ipynb, jar, sc, scala} # Executable files for moving statistics
│ │ ├── <GroupID>_zscore.{py, ipynb, jar, sc, scala} # Executable files for Z-scores
│ │ └── code # Original code & results for the transform stage, if any
│ │
│ ├── Load
│ │ ├── <GroupID>.{py, ipynb, jar, sc, scala} # Executable files
│ │ └── code # Original code & results for the load stage, if any
│ │
│ ├── Bonus
│ │ ├── <GroupID>.{py, ipynb, jar, sc, scala} # Executable files for bonus part
│ │ └── code # Original code & results for the bonus part, if any

└── README.md # (Optional) Instructions to run your code

You must strictly follow the above file structure and compress the whole folder into a
ZIP file named <GroupID>.zip, which is your final file to be submitted to Moodle.

Grading Criteria
The grading criteria are summarized in the table below.

Requirements Points
Extract 2
- The crawler gets the pricing data without any error. 1
- The timestamp is inserted appropriately. 0.5
- The Kafka producer publishes this data to the correct topic. 0.5
Transform 4.25
- The sliding windows are created according to the required intervals. 1
- The means and standard deviations are correctly computed. 1
- These statistics are parsed into the required format. 0.25
- The moving statistics are piped to the correct Kafka topic. 0.25
- The price record is matched against statistics of the same event-time. 1
- Z-scores of the matched price are successfully calculated and formatted. 0.5
- The results are directed towards the output topic. 0.25
Load 2
- MongoDB is set up and running. 1
- The stream is read and written to the appropriate collections in MongoDB. 1
Bonus 1
- Correctly compute the required window lengths. 0.75
- Format and publish the results appropriately. 0.25
Report 1.75
- Overview: source code’s structure, components, and implemented methods. 0.75
- Detailed explanation: verbal and/or visual illustration of the methods. 0.75
- Contribution table: assigned tasks for each member in the group. 0.25
TOTAL 11

Also note that:


• Ensure your code is well-documented with clear comments.

• Include all necessary files, logs, and screenshots to verify successful execution.
• Each task can be accomplished in different environments and programming languages;
remember to provide instructions for running each task if this is the case.

Happy Coding and Best of Luck!

The Instructor./.
