Here you can find the materials for the "Data Engineering 3: Orchestration and Real-time Data Processing" course, part of the MSc in Business Analytics at CEU. For the previous editions, see 2017/2018, 2018/2019, 2019/2020, 2020/2021, 2021/2022, 2022/2023, 2023/2024, and 2024/2025.
2 x 3 x 100 mins on Feb 16 and 23:
- 13:30 - 15:10 session 1
- 15:10 - 15:40 break
- 15:40 - 17:20 session 2
- 17:20 - 17:40 break
- 17:40 - 19:20 session 3
In-person at the Vienna campus (QS B-421).
Please find the syllabus in the syllabus folder of this repository.
- You need a laptop with any operating system and stable Internet connection.
- Please make sure that Internet/network firewall rules are not limiting your access to unusual ports (e.g. 22, 8787, 8080, 8000), as we will heavily use these in the class (can be a problem on a company network). CEU WiFi should have the related firewall rules applied for the class.
- Join the Teams channel dedicated to the class: CEU BA DE3 Batch Jobs and APIs ('25/26), using the c1vc62r team code.
- When joining remotely, it's highly suggested to get a second monitor where you can follow the online stream, and keep your main monitor for your own work. The second monitor could be an external screen attached to your laptop (e.g. a TV, monitor, or projector), but if you don't have access to one, you may also use a tablet or phone to dial in to the Zoom call.
To be updated weekly.
Goal: learn how to run and schedule Python or R jobs in the cloud.
Excerpts from https://daroczig.github.io/talks
- "A Decade of Using R in Production" (Real Data Science USA - R meetup)
- "Getting Things Logged" (RStudio::conf 2020)
- "Analytics databases in a startup environment: beyond MySQL and Spark" (Budapest Data Forum 2018)
- Use the following sign-in URL to access the class AWS account: https://657609838022.signin.aws.amazon.com/console
- Secure your access key(s), other credentials and any login information ...
  ... because a truly wise person learns from the mistakes of others!
"When I woke up the next morning, I had four emails and a missed phone call from Amazon AWS - something about 140 servers running on my AWS account, mining Bitcoin" -- Hoffman said
"Nevertheless, now I know that Bitcoin can be mined with SQL, which is priceless ;-)" -- Uri Shaked
So set up 2FA (go to IAM / Users / username / Security credentials / Assigned MFA device): https://console.aws.amazon.com/iam
PS: you probably do not need to store any access keys at all; you can rely on IAM roles (and the Key Management Service, the Secrets Manager, and so on) instead.
- Let's use the eu-west-1 (Ireland) region
Note: we follow the instructions on Windows in the Computer Lab, but please find below how to access the boxes from Mac or Linux as well when working with the instances remotely.
- Create (or import) an SSH key in AWS (EC2 / Key Pairs): https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#KeyPairs:sort=keyName -- don't forget the Owner tag!
- Get an SSH client:
  - Windows -- Download and install PuTTY: https://www.putty.org
  - Mac -- Install PuTTY for Mac using Homebrew or MacPorts (note that Homebrew should be run without sudo):
    brew install putty
    # or
    sudo port install putty
  - Linux -- probably the OpenSSH client is already installed, but to use the same tools on all operating systems, please install and use PuTTY on Linux too, e.g. on Ubuntu:
    sudo apt install putty
- Convert the generated pem key to PuTTY format. No need to do this anymore: AWS can provide the key as PPK now. CLI:
  puttygen key.pem -O private -o key.ppk
- Make sure the key is readable only by your Windows/Linux/Mac user, e.g.
  chmod 0400 key.ppk
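For illustration, the effect of chmod 0400 can be reproduced from Python's standard library; this sketch uses a hypothetical throwaway file in place of your real key:

```python
import os
import stat
import tempfile

# hypothetical stand-in for your real key file
fd, key_path = tempfile.mkstemp(suffix=".ppk")
os.close(fd)

# equivalent of `chmod 0400 key.ppk`: owner read-only, no access for group/others
os.chmod(key_path, 0o400)

mode = stat.S_IMODE(os.stat(key_path).st_mode)
print(oct(mode))  # 0o400
```

SSH clients refuse private keys that are accessible to other users, which is why this step matters.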
Create an EC2 instance
- Optional: create an Elastic IP for your box
- Go to the Instances overview at https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Instances:sort=instanceId
- Click "Launch Instance"
- Provide a name for your server (e.g. daroczig-de3-week1) and some additional tags for resource tracking, including tagging downstream services, such as Instance and Volumes:
  - Class: DE3
  - Owner: daroczig
- Pick the Ubuntu Server 24.04 LTS (HVM), SSD Volume Type AMI
- Pick the t3a.small instance type (2 GiB of RAM should be enough for most tasks; see more instance types)
- Select your AWS key created above and launch
- Update the volume size to 20 GiB to make sure we have enough space.
- Pick a unique name for the security group after clicking "Edit" on the "Network settings"
- Click "Launch instance"
- Note and click on the instance id
Connect to the box
- Specify the hostname or IP address
- Specify the "Private key file for authentication" in the Connection category's SSH/Auth/Credentials pane
- Set the username to ubuntu on the Connection/Data tab
- Save the Session profile
- Click the "Open" button
- Accept & cache server's host key
Alternatively, you can connect via a standard SSH client on a Mac or Linux, something like:
chmod 0400 /path/to/your/pem
ssh -i /path/to/your/pem ubuntu@ip-address-of-your-machine

As a last resort, use "EC2 Instance Connect" from the EC2 dashboard by clicking "Connect" in the context menu of the instance (triggered by right click in the table).
Yes, although most of you are using Python, we will install RStudio Server (with support for both R and Python), as it comes with a lot of useful features for the coming hours of this class: e.g. it's the most complete and production-ready open-source IDE supporting multiple users and languages.
- Look at the docs: https://www.rstudio.com/products/rstudio/download-server
- First, we will upgrade the system to the most recent version of the already installed packages. Note: check on the concept of a package manager! Download the Ubuntu apt package list:
  sudo apt update
Optionally upgrade the system:
sudo apt upgrade
And optionally also reboot so that kernel upgrades can take effect.
- Install R
  sudo apt install r-base
To avoid manually answering "Yes" to the question to confirm installation, you can specify the -y flag:
  sudo apt install -y r-base
- Try R
  R
For example:
1 + 4            # any ideas what this command does?
hist(runif(100)) # duh, where is the plot?!
Exit:
q()
Look at the files:
  ls
  ls -latr
Note, if you have X11 server installed, you can forward X11 through SSH to render locally, but this can be complicated to set up on a random operating system, and also not very convenient, so we will not bother with it for now.
- Try this in Python as well!
$ python
Command 'python' not found, did you mean:
  command 'python3' from deb python3
  command 'python' from deb python-is-python3
$ python3 --version
Python 3.12.3
Let's symlink python to python3 to make it easier to use:
  sudo apt install python-is-python3
And install matplotlib:
  sudo apt install python3-matplotlib
Then replicate that histogram:
import matplotlib.pyplot as plt
import random
numbers = [random.random() for _ in range(100)]
plt.hist(numbers)
plt.show()
But uh oh, there's no plot! Save it to a file instead:
  plt.savefig("python.png")
- Install RStudio Server
  wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2026.01.0-392-amd64.deb
  sudo apt install -y gdebi-core
  sudo gdebi rstudio-server-2026.01.0-392-amd64.deb
- Check process and open ports
  rstudio-server status
  sudo rstudio-server status
  sudo systemctl status rstudio-server
  sudo ps aux | grep rstudio
  sudo apt -y install net-tools
  sudo netstat -tapen | grep LIST
  sudo netstat -tapen
- Confirm that the service is up and running and the port is open
  ubuntu@ip-172-31-12-150:~$ sudo ss -tapen | grep LIST
  tcp   0  0  0.0.0.0:8787  0.0.0.0:*  LISTEN  0  49065  23587/rserver
  tcp   0  0  0.0.0.0:22    0.0.0.0:*  LISTEN  0  15671  1305/sshd
  tcp6  0  0  :::22         :::*       LISTEN  0  15673  1305/sshd
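The same reachability check can be scripted; a minimal stdlib sketch (the port_open helper below is our own, demonstrated against a throwaway local listener rather than a real RStudio Server):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# demo against a throwaway listener instead of a real service
server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
server.listen(1)
host, port = server.getsockname()
print(port_open(host, port))   # True: something is listening
server.close()
print(port_open(host, port))   # False: the connection is refused now
```

The same helper pointed at your box's public IP and port 8787 tells you whether the security group change below worked.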
- Try to connect to the host from a browser on port 8787, e.g. http://foobar.eu-west-1.compute.amazonaws.com:8787
- Realize it's not working
- Open up port 8787 in the security group by selecting your security group and clicking "Edit inbound rules":
Now you should be able to access the service. If not (e.g. blocked by a company firewall), don't worry, we can work around that by using a proxy server, to be set up in the next section.
Optionally you can associate a fixed IP address with your box, so that the IP address does not change when you stop and start the box.
- Allocate a new Elastic IP address at https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Addresses:
- Name this resource by assigning the "Name" and "Owner" tags
- Associate this Elastic IP with your stopped box, then start it
Optionally you can associate a subdomain with your node, using the above created Elastic IP address:
- Go to Route 53: https://console.aws.amazon.com/route53/home
- Go to Hosted Zones and click on de3.click
- Create a new A record:
  - fill in the desired Record name (subdomain), e.g. foobar (well, use your own username as the subdomain)
  - paste the public IP address or hostname of your server in the Value field
  - click Create records
Now you will be able to access your box using this custom (sub)domain, no need to remember IP addresses.
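Resolution can also be checked programmatically, similar to dig; a minimal sketch using localhost as a stand-in for your own subdomain (e.g. foobar.de3.click):

```python
import socket

# resolve a hostname to an IPv4 address, like `dig` does;
# substitute your own subdomain, e.g. foobar.de3.click
print(socket.gethostbyname("localhost"))  # 127.0.0.1
```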
To avoid using ports like 8787 and 8080 (and getting blocked by the firewall on the CEU WiFi), let's configure our services to listen on the standard port 80 (HTTP), and potentially on 443 (HTTPS) as well, serving RStudio on the /rstudio path and later Jenkins on /jenkins.
To this end, we will use Caddy as a reverse proxy, so let's install it first:
  sudo apt install -y caddy
Then let's edit the main configuration file /etc/caddy/Caddyfile, which will also do some transformations, e.g. rewriting the URL (removing the /rstudio path) before hitting RStudio Server. To edit the file, we can use the nano editor:
  sudo nano /etc/caddy/Caddyfile
If you are not familiar with the nano editor, check the keyboard shortcuts at
the bottom of the screen, e.g. ^X (Ctrl+X keys pressed simultaneously) to
exit, or M-A (Alt+A keys pressed simultaneously) to start marking text for
later copying or deleting.
Delete everything (either by patiently pressing the Delete or Backspace
keys, or by pressing M-A, then navigating to the bottom of the file and
pressing ^K), and copy/paste (Shift+Insert or Ctrl+Shift+C) the following
content:
:80 {
redir /rstudio /rstudio/ permanent
handle_path /rstudio/* {
reverse_proxy localhost:8787 {
transport http {
read_timeout 20d
}
# need to rewrite the Location header to remove the port number
# https://caddy.community/t/reverse-proxy-header-down-on-location-header-or-something-equivalent/13157/3
header_down Location ([^:]+://[^:]+(:[0-9]+)?/) ./
}
}
}
And restart the Caddy service:
sudo systemctl restart caddy

Find more information at https://support.rstudio.com/hc/en-us/articles/200552326-Running-RStudio-Server-with-a-Proxy.
Let's see if the port is open on the machine:
  sudo ss -tapen | grep LIST
Let's see if we can access RStudio Server on the new path:
  curl localhost/rstudio
Now let's see from the outside world ... and realize that we need to open up port 80!
Now we need to tweak the config to support other services as well in the future e.g. Jenkins:
:80 {
redir /rstudio /rstudio/ permanent
handle_path /rstudio/* {
reverse_proxy localhost:8787 {
transport http {
read_timeout 20d
}
# need to rewrite the Location header to remove the port number
# https://caddy.community/t/reverse-proxy-header-down-on-location-header-or-something-equivalent/13157/3
header_down Location ([^:]+://[^:]+(:[0-9]+)?/) ./
}
}
handle /jenkins/* {
reverse_proxy 127.0.0.1:8080
}
}
It might be useful to also proxy port 8000 for future use via updating the Caddy config to:
:80 {
redir /rstudio /rstudio/ permanent
handle_path /rstudio/* {
reverse_proxy localhost:8787 {
transport http {
read_timeout 20d
}
# need to rewrite the Location header to remove the port number
# https://caddy.community/t/reverse-proxy-header-down-on-location-header-or-something-equivalent/13157/3
header_down Location ([^:]+://[^:]+(:[0-9]+)?/) ./
}
}
handle /jenkins/* {
reverse_proxy 127.0.0.1:8080
}
handle_path /8000/* {
reverse_proxy 127.0.0.1:8000
}
}
This way you can access the above services via the below URLs:
RStudio Server:
- http://your.ip.address:8787
- http://your.ip.address/rstudio
Jenkins:
- http://your.ip.address:8080/jenkins
- http://your.ip.address/jenkins
Port 8000:
- http://your.ip.address:8000
- http://your.ip.address/8000
If you cannot access RStudio Server on port 80, you might need to restart caddy as per above.
It's useful to note the above paths on the index page as a reminder, which you can achieve by adding the following to the Caddy configuration:
handle / {
respond "Welcome to DE3! Are you looking for /rstudio or /jenkins?" 200
}
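For illustration only, here is roughly what that respond directive amounts to, mimicked with Python's stdlib http.server (a toy stand-in, not how Caddy is implemented):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

MESSAGE = b"Welcome to DE3! Are you looking for /rstudio or /jenkins?"

class Welcome(BaseHTTPRequestHandler):
    def do_GET(self):
        # like Caddy's `respond "..." 200` for the index page
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(MESSAGE)))
        self.end_headers()
        self.wfile.write(MESSAGE)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), Welcome)
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/") as response:
    body = response.read().decode()
server.shutdown()

print(body)  # Welcome to DE3! Are you looking for /rstudio or /jenkins?
```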
Next, you might really want to set up SSL, either with Caddy or by placing an AWS Load Balancer in front of the EC2 node. For a simple setup, we can rely on Caddy's built-in SSL support using LetsEncrypt:
- Register a domain name (or use a subdomain of the already registered de3.click domain), and point it to your EC2 node's public IP address: https://us-east-1.console.aws.amazon.com/route53/v2/hostedzones
You might need to wait a bit for the DNS to propagate. Check via
digor similar, e.g.:dig foobar.de3.click -
Update the Caddy configuration to use the new domain name instead of
:80in/etc/caddy/Caddyfile:foobar.de3.click { redir /rstudio /rstudio/ permanent handle_path /rstudio/* { reverse_proxy localhost:8787 { transport http { read_timeout 20d } # need to rewrite the Location header to remove the port number # https://caddy.community/t/reverse-proxy-header-down-on-location-header-or-something-equivalent/13157/3 header_down Location ([^:]+://[^:]+(:[0-9]+)?/) ./ } } handle /jenkins/* { reverse_proxy 127.0.0.1:8080 } handle_path /8000/* { reverse_proxy 127.0.0.1:8000 } handle / { respond "Welcome to DE3! Are you looking for /rstudio or /jenkins?" 200 } } -
Caddy will then automatically obtain and renew the SSL certificate using LetsEncrypt, and you will be able to access the services via the new domain name through HTTPS. If you are interested in the related logs, you can view them in the Caddy logs:
sudo journalctl -u caddy
sudo journalctl -fu caddy  # follow logs
- Create a new user:
  sudo adduser foobar
- Login & quick demo:
  1 + 2
  plot(runif(100))
  install.packages('fortunes')
  library(fortunes)
  fortune()
  fortune(200)
  system('whoami')
- Reload the webpage (F5), and realize we continue where we left off in the browser :)
- Create a Python script and try to run it (in class: don't run it yet, just pay attention to the shared screen):
  import matplotlib.pyplot as plt
  import random
  numbers = [random.random() for _ in range(100)]
  plt.hist(numbers)
  plt.show()
Note that RStudio Server will ask you to confirm the installation of a few packages ... which takes ages (compiling C++ etc), so we better install the binary packages instead:
sudo apt install --no-install-recommends \ r-cran-jsonlite r-cran-reticulate r-cran-png
Now return to the Python script. But we still cannot run it, as the matplotlib package is not installed. Strange, we just installed it in the shell! To understand what's happening, get back to R and check the Python interpreter:
  reticulate::py_config()
  reticulate::py_require("matplotlib")
Note that this is a temporary virtual environment, so you need to install the packages again if you restart the R session.
Now the script runs .. until you restart the R session.
So let's create a persistent virtual environment for Python and install the packages there:
library(reticulate) virtualenv_create("de3") virtualenv_install("de3", packages = c("matplotlib")) use_virtualenv("de3", required = TRUE)
Note that creating the virtual environment failed due to some missing OS dependencies (e.g. pip), so let's install them first in the shell:
  sudo apt install python3-venv python3-pip python3-dev
Then run the following commands in R, and then try to rerun the Python script as well. You might need to restart R and go to the Tools menu / Global Options / Python / Use Virtual Environment and select the de3 environment. Now return to the Python script again, and rerun your script.
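Under the hood, reticulate's virtualenv_create() does what Python's own venv module does; a minimal sketch using a throwaway scratch directory instead of ~/.virtualenvs/de3:

```python
import os
import sys
import tempfile
import venv

# create a virtual environment in a scratch directory
env_dir = os.path.join(tempfile.mkdtemp(), "de3")
venv.create(env_dir, with_pip=False)  # with_pip=True would also bootstrap pip

# the environment ships its own interpreter and a pyvenv.cfg marker file
bin_dir = "Scripts" if sys.platform == "win32" else "bin"
interpreter = os.path.join(env_dir, bin_dir, "python")
print(os.path.exists(interpreter))                          # True
print(os.path.isfile(os.path.join(env_dir, "pyvenv.cfg")))  # True
```

Activating such an environment (or pointing RStudio at it) simply puts its bin directory first on the PATH, which is why packages installed into it persist across sessions.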
- Annoyed already with switching between R and Python? And then switching to SSH? Let's try to simplify that by using the built-in terminal in RStudio:
$ whoami
ceu
$ sudo whoami
ceu is not in the sudoers file.  This incident will be reported.
- Grant sudo access to the new user by going back to SSH with root access:
  sudo apt install -y mc
  sudo mc
  sudo mcedit /etc/sudoers
  sudo adduser ceu admin
  man adduser
  man deluser
Note 1: might need to relogin / restart RStudio / reload R / reload page .. to force a new shell login so that the updated group setting is applied
Note 2: you might want to add NOPASSWD to the sudoers file:
  ceu ALL=(ALL) NOPASSWD:ALL
Note 3: also consider the related security risks.
- Optionally, configure RStudio Server to listen directly on port 80:
  echo "www-port=80" | sudo tee -a /etc/rstudio/rserver.conf
  sudo rstudio-server restart
Great, we have a working environment for R and Python. Now let's try to do something useful with it!
Create a Python or R script to get the most recent Bitcoin <> USD prices (e.g. from the Binance API), report the last price and the price change in the last 1 hour, and plot a line chart of the price history, something like:
BTC current price is $42,000, with a standard deviation of 100.
from binance.client import Client
client = Client()
# https://python-binance.readthedocs.io/en/latest/binance.html#binance.client.Client.get_klines
klines = client.get_klines(symbol='BTCUSDT', interval='1m', limit=60)
# report on closing prices
close = [float(d[4]) for d in klines]
from statistics import stdev
print(f"BTC current price is ${close[-1]}, with a standard deviation of {round(stdev(close), 2)}.")
# create a line chart of the price history
from datetime import datetime
dates = [datetime.fromtimestamp(k[0] / 1000) for k in klines]
import matplotlib.pyplot as plt
plt.clf()
plt.plot(dates, close, marker='o')
plt.title('BTC Price History')
plt.show()
# save the plot to a file
plt.savefig('btc_price_history.png')

To demo how it would be implemented in R, let's install some related packages:
sudo apt install --no-install-recommends \
r-cran-ggplot2 r-cran-glue r-cran-remotes \
r-cran-data.table r-cran-httr r-cran-digest r-cran-logger r-cran-jsonlite r-cran-snakecase

Then install an R package from GitHub:
library(remotes)
install_github("daroczig/binancer")

And the actual R code:
library(binancer)
klines <- binance_klines('BTCUSDT', interval = '1m', limit = 60)
library(glue)
print(glue("BTC current price is ${klines$close[60]}, with a standard deviation of {round(sd(klines$close), 2)}."))
library(ggplot2)
ggplot(klines, aes(close_time, close)) + geom_line()

Great! Now let's create a candlestick chart of the price history, something like:
from binance.client import Client
client = Client()
klines = client.get_klines(symbol='BTCUSDT', interval='1m', limit=60)
# reticulate::py_install("pandas")
import pandas as pd
df = pd.DataFrame(klines, columns=[
'timestamp', 'open', 'high', 'low', 'close', 'volume',
'close_time', 'quote_volume', 'trades', 'taker_buy_base',
'taker_buy_quote', 'ignore'
])
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
df[['open', 'high', 'low', 'close', 'volume']] = df[['open', 'high', 'low', 'close', 'volume']].astype(float)
# reticulate::py_install("mplfinance")
import mplfinance as mpf
df_plot = df.set_index('timestamp')
df_plot = df_plot[['open', 'high', 'low', 'close', 'volume']]
mpf.plot(df_plot, type='candle', style='charles',
title='BTC Price History',
         ylabel='Price (USD)')

Or via matplotlib:
from matplotlib.patches import Rectangle
fig, ax = plt.subplots(figsize=(12, 6))
for i, row in df.iterrows():
color = 'green' if row['close'] >= row['open'] else 'red'
# candle lines for high/low
ax.plot([i, i], [row['low'], row['high']], color=color, linewidth=1)
# candle body for open/close
height = abs(row['close'] - row['open'])
bottom = min(row['open'], row['close'])
rect = Rectangle((i - 0.3, bottom), 0.6, height, facecolor=color, edgecolor=color, alpha=0.8)
ax.add_patch(rect)
ax.set_title('BTC Price History')
plt.show()

Same in R:
ggplot(klines, aes(open_time)) +
  geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) +
  geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) +
  theme_bw() + theme('legend.position' = 'none') + xlab('')

Or a bit more polished version:
library(scales)
ggplot(klines, aes(open_time)) +
geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) +
geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) +
theme_bw() + theme('legend.position' = 'none') + xlab('') +
ggtitle(paste('Last Updated:', Sys.time())) +
scale_y_continuous(labels = dollar) +
scale_color_manual(values = c('#1a9850', '#d73027')) # RdYlGn

For the record, doing the same for 4 symbols would be just as simple:
library(data.table)
klines <- rbindlist(lapply(
c('BTCUSDT', 'ETHUSDT', 'BNBUSDT', 'XRPUSDT'),
binance_klines,
interval = '15m', limit = 4*24))
ggplot(klines, aes(open_time)) +
geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) +
geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) +
theme_bw() + theme('legend.position' = 'none') + xlab('') +
ggtitle(paste('Last Updated:', Sys.time())) +
scale_color_manual(values = c('#1a9850', '#d73027')) +
facet_wrap(~symbol, scales = 'free', nrow = 2)
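Back to the original exercise: it also asked for the price change over the last hour, which the sample scripts above leave out. A small sketch using hypothetical closing prices (oldest first, the same shape as the close list built from the klines):

```python
# hypothetical 1-minute closing prices, oldest first (stand-in for the
# `close` list extracted from the Binance klines)
close = [42000.0, 42120.5, 41980.2, 42210.8]

change = close[-1] - close[0]          # absolute change over the window
pct_change = 100 * change / close[0]   # relative change in percent

print(f"BTC moved {change:+.2f} USD ({pct_change:+.2f}%) over the last hour.")
```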
- Install Jenkins from the RStudio/Terminal: https://www.jenkins.io/doc/book/installing/linux/#debianubuntu
  sudo apt install -y fontconfig openjdk-21-jre
  sudo wget -O /usr/share/keyrings/jenkins-keyring.asc \
    https://pkg.jenkins.io/debian-stable/jenkins.io-2026.key
  echo deb [signed-by=/usr/share/keyrings/jenkins-keyring.asc] \
    https://pkg.jenkins.io/debian-stable binary/ | sudo tee \
    /etc/apt/sources.list.d/jenkins.list > /dev/null
  sudo apt-get update
  sudo apt-get install -y jenkins
  # check which port is open by java (jenkins)
  sudo ss -tapen | grep java
- Open up port 8080 in the related security group if you want direct access
To make use of the Caddy proxy, we need to update the Jenkins configuration to use the
/jenkinspath: uncommentEnvironment="JENKINS_PREFIX=/jenkins"in/lib/systemd/system/jenkins.service, then reload the Systemd configs and restart Jenkins:sudo systemctl daemon-reload sudo systemctl restart jenkins
You can find more details at the Jenkins reverse proxy guide and troubleshooting guide.
- Access Jenkins from your browser and finish the installation
- Read the initial admin password from RStudio/Terminal via
  sudo cat /var/lib/jenkins/secrets/initialAdminPassword
- Proceed with installing the suggested plugins
- Create your first user (e.g. ceu)
- Note that if loading Jenkins takes a long time after the box got a new IP, it might be because the browser cannot load theme.css: it is still requested from the previous IP (as per the Jenkins URL setting). To overcome this, wait 2 mins for the theme.css timeout, log in, disable the dark theme plugin at the /jenkins/manage/pluginManager/installed path, and then restart Jenkins via the Restart button at the bottom of the page. Find more details at jenkinsci/dark-theme-plugin#458.
Let's schedule a Jenkins job to check on the Bitcoin prices every hour!
- Create a "New Item" (job) in Jenkins:
- Debug & figure out what's the problem: it's a permission error, so let's add the jenkins user to the <USERNAME> group:
  sudo adduser jenkins <USERNAME>
Then restart Jenkins from the RStudio Server terminal:
sudo systemctl restart jenkins
A better solution will be later to commit our Python or R script into a git repo, and make it part of the job to update from the repo .. or even better, use Docker to run the job in a container.
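To verify that the group change actually took effect, group membership can also be inspected from code; a small Unix-only sketch using the stdlib grp module (listing the current user's supplementary groups, analogous to checking that jenkins joined yours):

```python
import getpass
import grp

# list the supplementary groups of the current user
# (Unix-only: the grp module is not available on Windows)
user = getpass.getuser()
groups = sorted(g.gr_name for g in grp.getgrall() if user in g.gr_mem)
print(groups)
```

Note that group changes only apply to new login sessions, which is why the restart in the previous step is needed.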
- Yay, another error:
This is due to not finding the virtual environment, so let's add that to our build step:
. /home/<USERNAME>/.virtualenvs/de3/bin/activate
Note the leading dot (.) in the command, which is a special character in the shell: a shorthand for the source command to set environment variables. As Jenkins by default runs the commands in sh (and not e.g. bash), we need to use the . shorthand.
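The effect can be demonstrated by driving sh from Python; the env file below is a hypothetical stand-in for the virtualenv's activate script:

```python
import subprocess
import tempfile

# a stand-in for ~/.virtualenvs/de3/bin/activate: a file that only sets a variable
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("MY_VAR=from_sourced_file\n")
    env_file = f.name

# `.` sources the file in the *current* shell, so MY_VAR is visible afterwards
result = subprocess.run(
    ["sh", "-c", f". {env_file} && echo $MY_VAR"],
    capture_output=True, text=True,
)
print(result.stdout.strip())  # from_sourced_file
```

Running the script as a child process instead (sh env_file) would set the variable only in the child shell, which exits immediately.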
It runs at last:
- Now let's update our code to generate a line plot and store it in the workspace:
  from binance.client import Client
  client = Client()
  klines = client.get_klines(symbol='BTCUSDT', interval='1m', limit=60)
  close = [float(d[4]) for d in klines]
  from statistics import stdev
  print(f"BTC current price is ${close[-1]}, with a standard deviation of {round(stdev(close), 2)}.")
  # create a line chart of the price history
  from datetime import datetime
  dates = [datetime.fromtimestamp(k[0] / 1000) for k in klines]
  import matplotlib.pyplot as plt
  plt.clf()
  plt.plot(dates, close, marker='o')
  plt.title('BTC Price History')
  plt.savefig("btcprice.png")
- Then find the Workspace of the Project, such as https://daroczig.de3.click/jenkins/job/t/ws/btcprice.png. Note that this image will be updated on every run.
If you are not happy with RStudio Server, you can also install "VS Code in the browser" from https://github.com/coder/code-server:
- Check the source of their install script: https://code-server.dev/install.sh
- Test the install script:
  curl -fsSL https://code-server.dev/install.sh | sh -s -- --dry-run
- Install the code-server package:
  curl -fsSL https://code-server.dev/install.sh | sh
- Start the code-server service for a given user (an emphasis here to run this as a user):
  sudo systemctl restart code-server@<USERNAME>
Note that the service will be started for the given user, so there's no
multi-user access like in RStudio Server, and it runs on port 8080 by default.
Let's update the port number to something special (to avoid later port number conflicts) and configure Caddy to proxy it:
- Edit (e.g. using nano) the ~/.config/code-server/config.yaml file to set the port number to 8888
- Take a note of the password from the config file, as you will need it to login to the service.
- Restart the code-server service:
  sudo systemctl restart code-server@<USERNAME>
- Update the Caddy configuration to proxy the new port:
  handle /coder/* {
      reverse_proxy 127.0.0.1:8888
      uri strip_prefix /coder
  }
- Restart the Caddy service:
  sudo systemctl restart caddy
- You can now access the code-server service via the browser at https://<USERNAME>.de3.click/coder.
- Install the Python extension or anything else you need :)
In short, it was meant to be a hands-on session to demonstrate and get a feel for setting up cloud infrastructure manually. You might like it and decide to learn/do more, or you might prefer not dealing with infrastructure ever again -- which is fine.
Honestly, the actual programming environment (e.g. Python or R) and the tooling (e.g. Jenkins, Airflow, etc.) do not really matter that much -- the tools were chosen to have a defined number of moving pieces and to make sure we understand how those play together.
Today we used:
- EC2 Instance Connect instead of SSH to connect to our virtual machine to overcome the firewall limitations,
- the apt package manager to install software,
- RStudio Server as the main control interface for R, Python, and the terminal as well,
- Jenkins to schedule R or Python commands to run on a regular basis,
- Caddy as a reverse proxy to access the services via a human-friendly domain name and HTTPS.
Quiz: https://forms.office.com/e/wRAxGqirdV (5 mins to convince me -- using your own words -- that you understand the concepts)
- 2FA/MFA in AWS
- Creating EC2 nodes
- Connecting to EC2 nodes via SSH/Putty or EC2 Instance Connect
- Updating security groups
- Installing RStudio Server
- Setting up a reverse proxy along with a domain name and SSL certificate
- The difference between R console and Shell
- The use of sudo and how to grant root (system administrator) privileges
- Adding new Linux users, setting passwords, adding to groups
- Installing Python packages within RStudio Server and in a virtual environment
- Installing Jenkins
- Scheduling basic commands on Jenkins
- Installing VS Code "Server"
Note that you do NOT need to do the instructions below marked with the 💪 emoji -- those have been already done for you, and the related steps are only included below for documenting what has been done and demonstrated in the class.
💪 Instead of starting from scratch, let's create an Amazon Machine Image (AMI) from the EC2 node we used last week, so that we can use that as the basis of all the next steps:
- Find the EC2 node in the EC2 console
- Right click, then "Image and templates" / "Create image"
- Name the AMI and click "Create image"
- It might take a few minutes to finish
Then you can use the newly created de3-week2 AMI to spin up a new instance for you:
- Go to the Instances overview at https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Instances:sort=instanceId
- Click "Launch Instance"
- Provide a name for your server (e.g. daroczig-de3-week2) and some additional tags for resource tracking, including tagging downstream services, such as Instance and Volumes:
  - Class: DE3
  - Owner: daroczig
  - subdomain: daroczig -- NOTE that this is important for the next step! The startup script will register this subdomain under the count-down-timer.eu.org domain name so that you can access RStudio Server, Jenkins, etc. from your browser without fighting with firewall rules.
-
Pick the
de3-week2AMI -
Pick
t3a.medium(4 GiB of RAM should be enough for most tasks) instance type (see more instance types) -
Select your AWS key created above and launch
-
Select the
de3security group (granting access to ports 22, 443, 8000, 8080, and 8787) -
Click "Advanced details" and select the `ceudataserver` IAM instance profile, which grants permissions to read EC2 tags, update Route53 records, and use a few other services required in later steps.
Note and click on the instance id
We need a script that:
- Reads the subdomain from the EC2 tags
- Looks up the hosted zone ID for the domain name
- Updates the Route53 record to point to the EC2 instance's public IP address
- Configures Caddy to proxy the requests to the EC2 instance's ports
Note that this script requires the AWS CLI to be installed and configured with the appropriate permissions. The AWS CLI was installed via:
```shell
sudo snap install aws-cli --classic
```

And the required permissions were granted via the `ceudataserver` IAM instance profile, including read-only access to EC2 tags and write permissions on Route53 records.
#!/usr/bin/env bash
set -euo pipefail
DOMAIN_NAME="count-down-timer.eu.org"
# look up info on the EC2 instance using the EC2 metadata endpoint
META=http://169.254.169.254/latest
TOKEN=$(curl -s -X PUT "$META/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
get_metadata () {
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
"$META/meta-data/$1"
}
INSTANCE_ID=$(get_metadata instance-id)
REGION=$(get_metadata placement/region)
# hit a bump querying the tag from the metadata server, so let's use the AWS CLI instead
SUBDOMAIN=$(aws ec2 describe-tags \
--region "$REGION" \
--filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=subdomain" \
--query "Tags[0].Value" \
--output text)
if [ "$SUBDOMAIN" == "None" ] || [ -z "$SUBDOMAIN" ]; then
echo "ERROR: 'subdomain' tag not found on instance $INSTANCE_ID"
exit 1
fi
DOMAIN="${SUBDOMAIN}.${DOMAIN_NAME}"
# update Route53 record
HOSTED_ZONE_ID=$(aws route53 list-hosted-zones-by-name \
--dns-name "${DOMAIN_NAME}" \
--query "HostedZones[0].Id" \
--output text | cut -d'/' -f3)
echo "Hosted Zone ID: $HOSTED_ZONE_ID"
PUBLIC_IP=$(get_metadata public-ipv4)
echo "Public IP: $PUBLIC_IP"
cat > /tmp/route53-change.json <<EOF
{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "$DOMAIN",
"Type": "A",
"TTL": 300,
"ResourceRecords": [{"Value": "$PUBLIC_IP"}]
}
}]
}
EOF
echo "Updating Route53 record..."
aws route53 change-resource-record-sets \
--hosted-zone-id "$HOSTED_ZONE_ID" \
--change-batch file:///tmp/route53-change.json
rm /tmp/route53-change.json
# configure caddy
mkdir -p /etc/caddy
cat <<EOF >/etc/caddy/Caddyfile
$DOMAIN {
redir /rstudio /rstudio/ permanent
handle_path /rstudio/* {
reverse_proxy localhost:8787 {
transport http {
read_timeout 20d
}
header_down Location ([^:]+://[^:]+(:[0-9]+)?/) ./
}
}
handle /jenkins/* {
reverse_proxy 127.0.0.1:8080
}
handle_path /8000/* {
reverse_proxy 127.0.0.1:8000
}
handle / {
respond "Welcome to DE3! Are you looking for /rstudio or /jenkins?" 200
}
encode gzip
log {
output file /var/log/caddy/access.log
format json
}
}
EOF

We need to make that script executable:

```shell
sudo chmod +x /usr/local/bin/update-caddy-domain.sh
```

Create a systemd service at `/etc/systemd/system/caddy-setup.service` to run the script at startup before Caddy starts:
[Unit]
Description=Update Route53 and Caddy config before Caddy starts
Before=caddy.service
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/update-caddy-domain.sh
RemainAfterExit=yes
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target

Enable and test the service:

```shell
sudo systemctl daemon-reload
sudo systemctl enable caddy-setup.service
sudo systemctl start caddy-setup.service
```

Profit 💸
In case profit does not happen .. a few hints for debugging after booting the instance:
# check the service status and
systemctl status caddy-setup.service
journalctl -u caddy-setup.service -n 100 -f
# try to run the script manually
/usr/local/bin/update-caddy-domain.sh

We'll export the list of IAM users from AWS and create a system user for everyone.
-
Attach a newly created IAM EC2 Role (let's call it `ceudataserver`) to the EC2 box and assign 'Read-only IAM access' (`IAMReadOnlyAccess`):
Install the AWS CLI tool (note that we are using the snap package manager, as the CLI was removed from the apt repos):

```shell
sudo snap install aws-cli --classic
```
List all the IAM users: https://docs.aws.amazon.com/cli/latest/reference/iam/list-users.html
aws iam list-users -
Install R packages for JSON parsing and logging (used in the next steps) from the apt repo instead of CRAN sources, as per https://github.com/eddelbuettel/r2u
```shell
wget -q -O- https://eddelbuettel.github.io/r2u/assets/dirk_eddelbuettel_key.asc | sudo tee -a /etc/apt/trusted.gpg.d/cranapt_key.asc
sudo add-apt-repository "deb [arch=amd64] https://r2u.stat.illinois.edu/ubuntu noble main"
sudo apt update
sudo apt install --no-install-recommends r-cran-jsonlite r-cran-logger r-cran-glue
```
Note that all dependencies (be it an R package or a system/Ubuntu package) have been automatically resolved and installed.
Don't forget to click on the brush icon to clean up your terminal output if needed.
Optionally enable `bspm` to allow binary package installations via the traditional `install.packages` R function.
Export the list of users from R:
```r
library(jsonlite)
users <- fromJSON(system('aws iam list-users', intern = TRUE))
str(users)
users[[1]]$UserName
```
Or Python:
```python
import boto3
iam = boto3.client('iam')
response = iam.list_users()
users = response['Users']
users[0]
users[0]["UserName"]
```
-
Create a new system user on the box (for RStudio Server access) for every IAM user, set password and add to group:
```r
library(logger)
library(glue)
for (user in users[[1]]$UserName) {
  ## remove invalid characters
  user <- sub('@.*', '', user)
  user <- sub('.', '_', user, fixed = TRUE)
  log_info('Creating {user}')
  system(glue("sudo adduser --disabled-password --quiet --gecos '' {user}"))
  log_info('Setting password for {user}')
  system(glue("echo '{user}:secretpass' | sudo chpasswd")) # note the single quotes + placement of sudo
  log_info('Adding {user} to sudo group')
  system(glue('sudo adduser {user} sudo'))
  log_info('Adding {user} to jenkins group')
  system(glue('sudo adduser {user} jenkins'))
}
```
Note, you may have to temporarily enable passwordless sudo for this user (if not done already) :/
ceu ALL=(ALL) NOPASSWD:ALL
Check users:
readLines('/etc/passwd')
-
Install the "PAM Authentication Plugin" in Jenkins.
-
Enable Jenkins to use PAM authentication by adding the `jenkins` user to the `shadow` group, then restart Jenkins:

```shell
sudo adduser jenkins shadow
sudo systemctl restart jenkins
```
-
Update the security backend to use real Unix users for shared access (if users already created):
Then make sure to test new user access in an incognito window to avoid closing yourself out :)
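The `readLines('/etc/passwd')` check above can also be done in Python; a minimal sketch, assuming a standard Linux passwd layout (the `list_users` helper name is made up for illustration):

```python
# List local system users by parsing /etc/passwd (fields are colon-separated;
# the first field is the username, the third is the numeric UID).
def list_users(passwd_path='/etc/passwd'):
    users = []
    with open(passwd_path) as f:
        for line in f:
            if line.strip() and not line.startswith('#'):
                fields = line.rstrip('\n').split(':')
                users.append({'name': fields[0], 'uid': int(fields[2])})
    return users

# e.g. show only "real" (non-system) users, typically UID >= 1000 on Ubuntu
humans = [u['name'] for u in list_users() if u['uid'] >= 1000]
print(humans)
```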
Replicate the below plot either in R or Python! Feel free to use your notes from
last week, or scroll up in this README.md file for the actual code, and how to
install the required R or Python packages.
Please do NOT try to hammer the server with AI recommendations on how to fix things when things go wrong -- you are operating on a cloud server, not your local machine, and you could do serious harm to this high-value production server (and the billing account associated with the AWS account)! 😊
Example solution in Python:
-
Install the required Python packages (in the R console):
```r
reticulate::py_install(c('pandas', 'matplotlib', 'python-binance'))
```
-
Create a Python script to replicate the plot:
```python
from binance.client import Client
client = Client()
# https://python-binance.readthedocs.io/en/latest/binance.html#binance.client.Client.get_klines
klines = client.get_klines(symbol='BTCUSDT', interval='1m', limit=60)

# report on closing prices
close = [float(d[4]) for d in klines]
from statistics import stdev
print(f"BTC current price is ${close[-1]}, with a standard deviation of {round(stdev(close), 2)}.")

# create a line chart of the price history
from datetime import datetime
dates = [datetime.fromtimestamp(k[0] / 1000) for k in klines]
import matplotlib.pyplot as plt
plt.clf()
plt.plot(dates, close, marker='o')
plt.title('BTC Price History')
#plt.show()
plt.savefig('btc_price_history_linechart.png')

import pandas as pd
df = pd.DataFrame(klines, columns=[
    'timestamp', 'open', 'high', 'low', 'close', 'volume',
    'close_time', 'quote_volume', 'trades',
    'taker_buy_base', 'taker_buy_quote', 'ignore'
])
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
df[['open', 'high', 'low', 'close', 'volume']] = df[['open', 'high', 'low', 'close', 'volume']].astype(float)

from matplotlib.patches import Rectangle
fig, ax = plt.subplots(figsize=(12, 6))
for i, row in df.iterrows():
    color = 'green' if row['close'] >= row['open'] else 'red'
    # candle lines for high/low
    ax.plot([i, i], [row['low'], row['high']], color=color, linewidth=1)
    # candle body for open/close
    height = abs(row['close'] - row['open'])
    bottom = min(row['open'], row['close'])
    rect = Rectangle((i - 0.3, bottom), 0.6, height, facecolor=color, edgecolor=color, alpha=0.8)
    ax.add_patch(rect)
ax.set_title('BTC Price History')
# plt.show()
plt.savefig('btc_price_history_candlestick-chart.png')
```
Now create a Jenkins job to run the Python script every minute!
-
Create a new job:
- Name: `get current Bitcoin price`
- Type: `Freestyle project`
- Click `OK`
-
Define a schedule:
* * * * * -
Add a new `Execute shell` build step:

```shell
. /home/<USERNAME>/.virtualenvs/de3/bin/activate
python /home/<USERNAME>/<SCRIPT_NAME>.py
```
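As a reminder, the five cron-style schedule fields used by Jenkins are, in order, minute, hour, day of month, month, and day of week, so `* * * * *` fires every minute:

```
# ┌───────── minute (0-59)
# │ ┌───────── hour (0-23)
# │ │ ┌───────── day of month (1-31)
# │ │ │ ┌───────── month (1-12)
# │ │ │ │ ┌───────── day of week (0-7, Sunday is 0 or 7)
# │ │ │ │ │
  * * * * *
```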
- Create a new gist on GitHub (to demo a super simple git repository).
- Add the script to the repository along with a `requirements.txt` file for the Python dependencies.
- Configure the Jenkins job to use the git repository as the source code management. Find the git repository URL in the "Clone via HTTPS" button, which returns the gist's URL with a `.git` suffix. Also note that the default `master` branch name will not work, as GitHub defaults to the more modern `main` branch name, so update that in the Jenkins job configuration.
- Update the `Execute shell` build step to refer to the script in the git repository instead of the hardcoded local path.
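A minimal `requirements.txt` for the dependencies above might look like the below (package names taken from the earlier install step; pin versions as needed):

```
pandas
matplotlib
python-binance
```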
Example solution: https://gist.github.com/daroczig/9e4004bbb6532edb6da384260da201c2
Example command to run the script:
```shell
. /home/<USERNAME>/.virtualenvs/de3/bin/activate
python btcprice.py
```

-
💪 Install Docker:
```shell
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce
```
-
Make sure you have the Python script and the `requirements.txt` file locally (on the RStudio Server instance).
Create a Dockerfile to build the image:
```dockerfile
FROM python:3.11-slim
RUN pip install matplotlib pandas python-binance
WORKDIR /app
ADD my_local_file.py /app/btcreport.py
CMD ["python3", "btcreport.py"]
```
-
Build the image:
sudo docker build -t btcprice . -
Run the container:
sudo docker run --rm -ti btcprice
Where are the images stored?
-
Update the Python script to write to a special folder, e.g. `/outputs`, and attach it from outside of the container:

```shell
sudo docker run --rm -ti -v /home/<USERNAME>/outputs:/outputs btcprice
```
-
Update the Jenkins job to run the container and attach the output folder.
-
Optionally start using the Docker Jenkins plugin instead of issuing the `docker run` command(s) in the `Execute shell` build step.
Let's set up e-mail notifications via e.g. https://app.mailjet.com/signin
-
💪 Sign up, confirm your e-mail address and domain
-
💪 Take note of the SMTP settings, e.g.
- SMTP server: in-v3.mailjet.com
- Port: 465
- SSL: Yes
- Username: ***
- Password: ***
-
💪 Configure Jenkins at http://de3.ceudata.net/jenkins/configure
-
Set up the default FROM e-mail address at "System Admin e-mail address": [email protected]
-
Search for "Extended E-mail Notification" and configure
- SMTP Server
- Click "Advanced"
- Check "Use SMTP Authentication"
- Enter User Name from the above steps
- Enter Password from the above steps
- Check "Use SSL"
- SMTP port: 465
-
-
Set up "Post-build Actions" in Jenkins: Editable Email Notification - read the manual and info popups, configure to get an e-mail on job failures and fixes
-
Configure the job to use the whole build log as the default body template for all outgoing emails:

```
${BUILD_LOG, maxLines=1000}
```

Optionally, look at other Jenkins plugins, e.g. the Slack Notifier: https://plugins.jenkins.io/slack
But who uses emails anymore? Let's set up MS Teams notifications instead!
-
Join the #bots-bots-bots channel in the DE3 course's MS Teams
-
Click on "Manage channel" in the triple-dot context menu of the channel, then click "Edit" of the "Connectors" tab, and add an incoming webhook with your username and optional logo, store the URL for later use
-
Install the `apprise` Python package in your virtual environment so that you can test it interactively:

```r
reticulate::py_install("apprise")
```
-
Don't forget to add the package name to your `requirements.txt` file as well if you plan to use it in a Jenkins job as a Docker container.
Example script saying hello to the channel:
```python
import apprise
poster = apprise.Apprise()
poster.add('https://ceuedu.webhook.office.com/webhookb2/...')
poster.notify(
    title='Hello from Python!',
    body='Such a warm hello.',
)
```
Find more details in the `apprise` docs:

- https://appriseit.com/library/quick-start/
- https://appriseit.com/library/attachments/
- https://appriseit.com/services/msteams/
-
Update your Python script to send a message to the channel when the Bitcoin price is above $50,000 💸
```python
from apprise import Apprise, NotifyType
from binance.client import Client

poster = Apprise()
poster.add('https://ceuedu.webhook.office.com/webhookb2/...')

client = Client()
klines = client.get_klines(symbol='BTCUSDT', interval='1m', limit=1)
price = float(klines[0][4])  # note: kline values arrive as strings

if price > 50_000:
    poster.notify(
        title='Bitcoin price change alert',
        body=f'The current price of a BTC is ${price}',
        notify_type=NotifyType.WARNING,
    )
```

What's the problem with the current approach?
- Hardcoded webhook URL (security risk)
- Spamming the channel
Let's solve the latter first!
We need a central place that acts as a persistent storage for our Jenkins jobs, e.g. to mark if we have sent a recent alert in MS Teams ... let's give a key-value database a try:
-
💪 Install the ~~Redis~~ Valkey server:

```shell
sudo apt install valkey
ss -tapen | grep LIST
```

Test using the `valkey-cli` tool:

```
get foo
set foo 42
get foo
del foo
set foo 42 ex 5
get foo
get foo
exit
```
Install a Python client by running the following in the R console:
```r
reticulate::py_install("valkey")
```
-
Get familiar with using Valkey from Python by testing it in the Python console:
```python
from valkey import Valkey
from time import sleep

# no need to specify the host/port and authentication as running locally
r = Valkey()
r.set('foo', 'bar')
r.get('foo')
r.delete('foo')
r.set('foo', 2, ex=2)
r.get('foo')
sleep(2)
r.get('foo')
```
-
Update the Python script alerting in MS Teams to silence alerts for 5 minutes after the last alert was sent.
```python
import time

if price > 50_000 and r.get('last_alert_time') is None:
    poster.notify(
        title='Bitcoin price change alert',
        body=f'The current price of a BTC is ${price}',
        notify_type=NotifyType.WARNING,
    )
    r.set('last_alert_time', time.time(), ex=300)
```
-
Exercises: Update the Python script to
-
Try to read the alert threshold from the Valkey database instead of sticking with a hardcoded value, falling back to the hardcoded value as a default if the key is not found.

```python
alert_threshold = r.get('alert_threshold') or 50_000
```
Note that you should check for type mismatches, e.g.:
```python
try:
    alert_threshold = float(r.get('alert_threshold'))
except (TypeError, ValueError):
    # a missing key returns None (TypeError), a non-numeric value raises ValueError
    alert_threshold = 50_000
```
-
Count the number of alerts sent in the last hour:
-
Naive approach: set one key per alert with TTL and count the keys.
```python
r.set(f'alert:{time.time_ns()}', '1', ex=3600)
# count the still existing keys
sum(1 for _ in r.scan_iter("alert:*"))
```
-
Sorted set approach: set one key with the timestamp as the score and count the keys in the range of the last hour.
```python
import time
r.zadd('alerts', {str(time.time_ns()): time.time()})
# count the keys in the range of the last hour
r.zcount('alerts', time.time() - 3600, time.time())
# trim old entries
r.zremrangebyscore('alerts', "-inf", time.time() - 3600)
```
-
-
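The sorted-set bookkeeping above can be illustrated without a running Valkey server: below is a plain-Python sketch of the same sliding-window idea (the `SlidingWindowCounter` class is made up for illustration; timestamps play the role of the sorted-set scores):

```python
import time

class SlidingWindowCounter:
    """In-memory stand-in for the ZADD/ZCOUNT/ZREMRANGEBYSCORE pattern above."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = []  # list of timestamps, the "scores" of the sorted set

    def record(self, ts=None):
        # like ZADD: store one entry per alert with its timestamp as score
        self.events.append(time.time() if ts is None else ts)

    def count(self, now=None):
        # like ZCOUNT over the (now - window, now) range, combined with
        # ZREMRANGEBYSCORE to trim entries older than the window
        now = time.time() if now is None else now
        self.events = [t for t in self.events if t > now - self.window]
        return len(self.events)

counter = SlidingWindowCounter(window_seconds=3600)
counter.record(ts=100)    # an alert well before the window
counter.record(ts=4000)   # a recent alert
print(counter.count(now=4500))  # only the recent one is counted
```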
-
Do NOT store the webhook URL in plain-text (e.g. in your R or Python script)!
-
Let's use Amazon's Key Management Service: https://github.com/daroczig/CEU-R-prod/raw/2017-2018/AWR.Kinesis/AWR.Kinesis-talk.pdf (slides 73-75)
-
Install the `boto3` Python module from R to experiment with it interactively:

```r
reticulate::py_install("boto3")
```
-
💪 Create a key in the Key Management Service (KMS): `alias/de3`
💪 Grant access to that KMS key by creating an EC2 IAM role at https://console.aws.amazon.com/iam/home?region=eu-west-1#/roles with the `AWSKeyManagementServicePowerUser` policy and explicit grant access to the key in the KMS console
💪 Attach the newly created IAM role if not yet done
-
Test how KMS encryption works:
```python
from boto3 import client
kms = client('kms', region_name="eu-west-1")
encrypted = kms.encrypt(KeyId="alias/de3", Plaintext="Foo")
import base64
base64.b64encode(encrypted["CiphertextBlob"]).decode('utf-8')
```
Now you can post that base64-encoded ciphertext anywhere, as there's no way to decrypt it without having access to the KMS key.
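The base64 step matters because `CiphertextBlob` is raw binary. A quick standard-library illustration of the round-trip, with an arbitrary byte string standing in for a real ciphertext (no AWS involved):

```python
import base64

# arbitrary binary payload standing in for a KMS CiphertextBlob
blob = bytes([0, 17, 254, 65, 66, 67])

# encode to a text-safe ASCII string that can be stored in configs, tickets, chat, etc.
encoded = base64.b64encode(blob).decode('utf-8')
print(encoded)

# decoding restores the exact original bytes, ready to pass to kms.decrypt()
assert base64.b64decode(encoded) == blob
```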
-
Store the ciphertext and use `kms.decrypt` to decrypt later, see e.g.

```python
secret = kms.decrypt(CiphertextBlob=base64.b64decode('AQICAHgzIk6iRoD8yYhFk//xayHj0G7uYfdCxrW6ncfAZob2MwF9MDMxdkLzSi1zOCr0BijiAAAAbzBtBgkqhkiG9w0BBwagYDBeAgEAMFkGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQM2J6fxSA6NeNtA7lEAgEQgCzWhyZY2bYqnVWLmbbAgYd4nKmUHQ4dM1MwecLgusbDXryXYNp5bEFQ+NlQzQ=='))
secret
secret["Plaintext"].decode('utf-8')
```
-
💪 Alternatively, use the AWS Parameter Store or Secrets Manager, see e.g. https://eu-west-1.console.aws.amazon.com/systems-manager/parameters/?region=eu-west-1&tab=Table and grant the `AmazonSSMReadOnlyAccess` policy to your IAM role or user.
Then query the parameter store from Python:
```python
ssm = client('ssm', region_name="eu-west-1")
parameter = ssm.get_parameter(Name='/teams/daroczig', WithDecryption=True)
webhook_url = parameter["Parameter"]["Value"]
```
-
Store your own webhook in the Parameter Store and use it in your Python script.
Note, if you are running the script inside a Docker container, and you face errors when trying to access AWS services, you might need to use the AWS CLI to adjust access to the metadata server, see e.g. https://stackoverflow.com/questions/71884350/using-imds-v2-with-token-inside-docker-on-ec2-or-ecs/71884476#71884476.
-
💪 Install plumber: rplumber.io
```shell
sudo apt install --no-install-recommends -y r-cran-plumber
```
-
Create an API endpoint to show the min, max and mean price of a BTC in the past hour!
Create `~/plumber.R` with the below content:

```r
library(binancer)

#* BTC stats
#* @get /btc
function() {
  klines <- binance_klines('BTCUSDT', interval = '1m', limit = 60L)
  klines[, .(min = min(close), mean = mean(close), max = max(close))]
}
```
Start the plumber application either by clicking the "Run API" button or via the below commands:
```r
library(plumber)
pr("plumber.R") %>% pr_run(host = '0.0.0.0', port = 8000)
```
-
Add a new API endpoint to generate the candlestick chart with dynamic symbol (defaulting to BTC), interval and limit! Note that you might need a new `@serializer`, function arguments, and type conversions as well.

Example solution for the above ...
```r
library(binancer)
library(ggplot2)
library(scales)

#* Generate plot
#* @param symbol coin pair
#* @param interval:str enum
#* @param limit integer
#* @get /klines
#* @serializer png
function(symbol = 'BTCUSDT', interval = '1m', limit = 60L) {
  klines <- binance_klines(symbol, interval = interval, limit = as.integer(limit)) # NOTE int conversion
  p <- ggplot(klines, aes(open_time)) +
    geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) +
    geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) +
    theme_bw() +
    theme('legend.position' = 'none') +
    xlab('') +
    ggtitle(paste('Last Updated:', Sys.time())) +
    scale_y_continuous(labels = dollar) +
    scale_color_manual(values = c('#1a9850', '#d73027')) # RdYlGn
  print(p)
}
```
-
Add a new API endpoint to generate an HTML report including both the above!
Example solution for the above ...
💪 Update the `markdown` package:

```shell
sudo apt install --no-install-recommends -y r-cran-markdown
```
Create an R markdown for the reporting:
````markdown
---
title: "report"
output: html_document
date: "`r Sys.Date()`"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE)
library(binancer)
library(ggplot2)
library(scales)
library(knitr)
klines <- function() {
  binance_klines('BTCUSDT', interval = '1m', limit = 60L)
}
```

Bitcoin stats:

```{r stats}
kable(klines()[, .(min = min(close), mean = mean(close), max = max(close))])
```

On a nice plot:

```{r plot}
ggplot(klines(), aes(open_time)) +
  geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) +
  geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) +
  theme_bw() +
  theme('legend.position' = 'none') +
  xlab('') +
  ggtitle(paste('Last Updated:', Sys.time())) +
  scale_y_continuous(labels = dollar) +
  scale_color_manual(values = c('#1a9850', '#d73027'))
```
````
And the plumber file:
```r
library(binancer)
library(ggplot2)
library(scales)
library(rmarkdown)
library(plumber)

#' Gets BTC data from the past hour
#' @return data.table
klines <- function() {
  binance_klines('BTCUSDT', interval = '1m', limit = 60L)
}

#* BTC stats
#* @get /stats
function() {
  klines()[, .(min = min(close), mean = mean(close), max = max(close))]
}

#* Generate plot
#* @get /plot
#* @serializer png
function() {
  p <- ggplot(klines(), aes(open_time)) +
    geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) +
    geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) +
    theme_bw() +
    theme('legend.position' = 'none') +
    xlab('') +
    ggtitle(paste('Last Updated:', Sys.time())) +
    scale_y_continuous(labels = dollar) +
    scale_color_manual(values = c('#1a9850', '#d73027')) # RdYlGn
  print(p)
}

#* Generate HTML
#* @get /report
#* @serializer html
function(res) {
  filename <- tempfile(fileext = '.html')
  on.exit(unlink(filename))
  render('report.Rmd', output_file = filename)
  include_file(filename, res)
}
```
Run via:
```r
library(plumber)
pr('plumber.R') %>% pr_run(port = 8000)
```
Try to DRY (don't repeat yourself!) this up as much as possible.
-
Install the `FastAPI` package:

```r
reticulate::py_install("fastapi")
```
-
Create a new Python script, e.g. `~/api.py` with the below content:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/hello")
def hello():
    return {"hello": "world"}
```
-
Install `uvicorn` to run the FastAPI application:

```r
reticulate::py_install("uvicorn")
```
-
Start the FastAPI application in the Terminal:
```shell
source .virtualenvs/de3/bin/activate
uvicorn api:app --reload
```
-
Test the API endpoint from your browser by hitting your domain name's `/8000/hello` endpoint.
Write a Python script that replicates the 3 API endpoints implemented above in R:

- `/stats` reports on the min/mean/max BTC price from the past 3 hours
- `/plot` generates a candlestick chart on the price of BTC from the past 3 hours
- `/report` generates an HTML report including both the above
Example solution for the above in Python ...
Install dependencies:
```shell
pip install fastapi uvicorn python-binance pandas matplotlib
```
Create `api.py` (FastAPI app with `/stats`, `/plot`, `/report`):

```python
from io import BytesIO
import base64

from binance.client import Client
import pandas as pd
from fastapi import FastAPI
from fastapi.responses import HTMLResponse, Response
from pydantic import BaseModel, Field

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

app = FastAPI(
    title="BTC Price API",
    description="Min/mean/max BTC price and candlestick chart for the past 3 hours.",
)

# Past 3 hours = 180 x 1-minute candles
LIMIT = 60 * 3


class StatsResponse(BaseModel):
    """Summary stats for BTC close price over the last 3 hours."""
    min: float = Field(..., description="Minimum close price (USD)")
    mean: float = Field(..., description="Mean close price (USD)")
    max: float = Field(..., description="Maximum close price (USD)")


def klines() -> pd.DataFrame:
    """Fetch BTCUSDT 1m klines from Binance for the past 3 hours."""
    client = Client()
    raw = client.get_klines(symbol="BTCUSDT", interval="1m", limit=LIMIT)
    df = pd.DataFrame(
        raw,
        columns=[
            "timestamp", "open", "high", "low", "close", "volume",
            "close_time", "quote_volume", "trades",
            "taker_buy_base", "taker_buy_quote", "ignore",
        ],
    )
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    for col in ["open", "high", "low", "close", "volume"]:
        df[col] = df[col].astype(float)
    return df


@app.get("/stats", response_model=StatsResponse)
def stats() -> StatsResponse:
    """Return min, mean, and max BTC close price from the past 3 hours."""
    df = klines()
    return StatsResponse(
        min=float(df["close"].min()),
        mean=float(df["close"].mean()),
        max=float(df["close"].max()),
    )


def plot_png() -> bytes:
    """Render candlestick chart as PNG bytes."""
    df = klines()
    fig, ax = plt.subplots(figsize=(12, 6))
    for i, row in df.iterrows():
        color = "green" if row["close"] >= row["open"] else "red"
        ax.plot([i, i], [row["low"], row["high"]], color=color, linewidth=1)
        h = abs(row["close"] - row["open"])
        bot = min(row["open"], row["close"])
        ax.add_patch(
            Rectangle((i - 0.3, bot), 0.6, h, facecolor=color, edgecolor=color, alpha=0.8)
        )
    ax.set_title("BTC Price (past 3h)")
    ax.set_ylabel("Price (USD)")
    ax.set_xlabel("Time")
    ax.set_xticks(range(0, len(df), 30))
    ax.set_xticklabels(df["timestamp"].iloc[::30].dt.strftime("%H:%M"), rotation=45)
    buf = BytesIO()
    fig.savefig(buf, format="png", dpi=100, bbox_inches="tight")
    plt.close(fig)
    buf.seek(0)
    return buf.getvalue()


@app.get("/plot")
def plot() -> Response:
    """Return a candlestick chart (PNG) of BTC price for the past 3 hours."""
    return Response(content=plot_png(), media_type="image/png")


@app.get("/report", response_class=HTMLResponse)
def report() -> HTMLResponse:
    """Return an HTML report with stats and embedded candlestick chart."""
    s = stats()
    b64 = base64.b64encode(plot_png()).decode()
    html = f"""
    <!DOCTYPE html>
    <html>
    <head><meta charset="utf-8"><title>BTC Report</title></head>
    <body>
      <h1>BTC price report (past 3 hours)</h1>
      <h2>Stats</h2>
      <table>
        <tr><th>min</th><th>mean</th><th>max</th></tr>
        <tr><td>{s.min:.2f}</td><td>{s.mean:.2f}</td><td>{s.max:.2f}</td></tr>
      </table>
      <h2>Plot</h2>
      <img src="data:image/png;base64,{b64}" alt="BTC candlestick" />
    </body>
    </html>
    """
    return HTMLResponse(html)
```
Run the API:
```shell
uvicorn api:app --host 0.0.0.0 --port 8000
```
When behind Caddy at `/8000/`, use `--root-path` so `/docs` can load the OpenAPI spec:

```shell
uvicorn api:app --host 0.0.0.0 --port 8000 --root-path /8000
```
Test: `/stats`, `/plot`, `/report` (and `/docs` for Swagger).

Example solution for the above in R ...
💪 Update the `markdown` package:

```shell
sudo apt install -y r-cran-markdown
```
Create an R markdown for the reporting:
````markdown
---
title: "report"
output: html_document
date: "`r Sys.Date()`"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE)
library(binancer)
library(ggplot2)
library(scales)
library(knitr)
klines <- function() {
  binance_klines('BTCUSDT', interval = '1m', limit = 60L)
}
```

Bitcoin stats:

```{r stats}
kable(klines()[, .(min = min(close), mean = mean(close), max = max(close))])
```

On a nice plot:

```{r plot}
ggplot(klines(), aes(open_time)) +
  geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) +
  geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) +
  theme_bw() +
  theme('legend.position' = 'none') +
  xlab('') +
  ggtitle(paste('Last Updated:', Sys.time())) +
  scale_y_continuous(labels = dollar) +
  scale_color_manual(values = c('#1a9850', '#d73027'))
```
````
And the plumber file:
```r
library(binancer)
library(ggplot2)
library(scales)
library(rmarkdown)
library(plumber)

#' Gets BTC data from the past hour
#' @return data.table
klines <- function() {
  binance_klines('BTCUSDT', interval = '1m', limit = 60L)
}

#* BTC stats
#* @get /stats
function() {
  klines()[, .(min = min(close), mean = mean(close), max = max(close))]
}

#* Generate plot
#* @get /plot
#* @serializer png
function() {
  p <- ggplot(klines(), aes(open_time)) +
    geom_linerange(aes(ymin = open, ymax = close, color = close < open), size = 2) +
    geom_errorbar(aes(ymin = low, ymax = high), size = 0.25) +
    theme_bw() +
    theme('legend.position' = 'none') +
    xlab('') +
    ggtitle(paste('Last Updated:', Sys.time())) +
    scale_y_continuous(labels = dollar) +
    scale_color_manual(values = c('#1a9850', '#d73027')) # RdYlGn
  print(p)
}

#* Generate HTML
#* @get /report
#* @serializer html
function(res) {
  filename <- tempfile(fileext = '.html')
  on.exit(unlink(filename))
  render('report.Rmd', output_file = filename)
  include_file(filename, res)
}
```
Run via:
```r
library(plumber)
pr('plumber.R') %>% pr_run(port = 8000)
```
Why API? Why R-based API? Why Python-based API? See previously mentioned examples in the slide decks, e.g.
- adtech
- healthtech
Why containers? How to run in production in other ways?!
Let's bundle all the scripts into a single Docker image:
- 💪 Install Docker:
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce

-
Create a new file named `Dockerfile` (File/New file/Text file to avoid auto-adding the `R` or `py` file extension) with the below content to add the required files and set the default working directory to the same folder:
Python image:
```dockerfile
FROM python:3.11-slim
RUN pip install fastapi uvicorn pandas matplotlib python-binance
ADD api.py /app/api.py
EXPOSE 8000
WORKDIR /app
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
```
R image:
```dockerfile
FROM rstudio/plumber
RUN apt-get update && apt-get install -y pandoc && apt-get clean && rm -rf /var/lib/apt/lists/
RUN install2.r ggplot2 rmarkdown
RUN installGithub.r daroczig/binancer
ADD report.Rmd /app/report.Rmd
ADD plumber.R /app/plumber.R
EXPOSE 8000
WORKDIR /app
CMD ["plumber.R"]
```
-
-
- Build the Docker image:

  ```shell
  sudo docker build -t btc-report-api .
  ```

- Run a container based on the above image:

  ```shell
  sudo docker run -p 8000:8000 --rm -ti btc-report-api
  ```

- Test by visiting the `8000` port or the Caddy proxy at <https://.count-down-timer.eu.org/8000>, e.g. the Swagger docs at <https://.count-down-timer.eu.org/8000/__docs__/> (R) or <https://.count-down-timer.eu.org/8000/docs> (Python), or an actual endpoint directly at e.g. <https://.count-down-timer.eu.org/8000/report>.
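Beyond the browser, the running container can be smoke-tested from a short script as well. A sketch using only the standard library — the unwrapping of length-1 arrays is an assumption about plumber's default JSON serializer, which boxes scalars in arrays:

```python
import json
import urllib.request


def parse_stats(payload):
    """Parse the /stats JSON and sanity-check that min <= mean <= max."""
    stats = json.loads(payload)
    # plumber serializes a one-row data.table as length-1 arrays; unwrap those
    stats = {k: v[0] if isinstance(v, list) else v for k, v in stats.items()}
    assert stats['min'] <= stats['mean'] <= stats['max']
    return stats


def fetch_stats(url='http://localhost:8000/stats'):
    """Hit the running container (assumed to listen on localhost:8000)."""
    with urllib.request.urlopen(url) as response:
        return parse_stats(response.read())
```

Calling `fetch_stats()` against the container started above should return a plain dict of the three statistics, or raise if the service is down or the payload looks wrong.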
Now let's make the Docker image created and tested above available outside of the RStudio Server by uploading it to Elastic Container Registry (ECR):
- Create a new private repository at https://eu-west-1.console.aws.amazon.com/ecr/home?region=eu-west-1, call it `de3-example-api`
- 💪 Assign the `EC2InstanceProfileForImageBuilderECRContainerBuilds` policy to the `ceudataserver` IAM role so that we get read/write access to the ECR repositories. Tighten this role up in prod!
- Let's log in to ECR on the RStudio Server so that we can upload the Docker image:

  ```shell
  aws ecr get-login-password --region eu-west-1 | sudo docker login --username AWS --password-stdin 657609838022.dkr.ecr.eu-west-1.amazonaws.com
  ```
- Tag the already built Docker image for upload:

  ```shell
  sudo docker tag btc-report-api:latest 657609838022.dkr.ecr.eu-west-1.amazonaws.com/de3-example-api:latest
  ```
- Push the Docker image:

  ```shell
  sudo docker push 657609838022.dkr.ecr.eu-west-1.amazonaws.com/de3-example-api:latest
  ```
- Check the Docker repository in the AWS console, e.g. at https://eu-west-1.console.aws.amazon.com/ecr/repositories/private/657609838022/de3-example-api?region=eu-west-1 if using the above repository name.
- Go to the Elastic Container Service (ECS) dashboard at https://eu-west-1.console.aws.amazon.com/ecs/home?region=eu-west-1#/
- Create a task definition for the Docker run:
  - Task name: `btc-api`
  - Container name: `api`
  - Image URI: `657609838022.dkr.ecr.eu-west-1.amazonaws.com/de3-example-api`
  - Container port: 8000
  - Review the Task size, but the default values should be fine for this simple task
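The console form above boils down to an ECS task-definition document. The sketch below shows roughly what gets registered — only the family, container name, image, and port come from the steps above; the CPU/memory values are assumptions (the smallest Fargate size), and a real Fargate task also needs an execution role:

```python
import json

# Hypothetical equivalent of the console form as a task-definition document
task_definition = {
    'family': 'btc-api',
    'requiresCompatibilities': ['FARGATE'],
    'networkMode': 'awsvpc',  # required for Fargate tasks
    'cpu': '256',             # assumption: smallest Fargate task size
    'memory': '512',
    'containerDefinitions': [{
        'name': 'api',
        'image': '657609838022.dkr.ecr.eu-west-1.amazonaws.com/de3-example-api:latest',
        'portMappings': [{'containerPort': 8000, 'protocol': 'tcp'}],
    }],
}

print(json.dumps(task_definition, indent=2))
```

Saved to a file, such a document could also be registered from the CLI with `aws ecs register-task-definition --cli-input-json file://task-definition.json` instead of clicking through the console.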
- Create a new cluster, call it `BTC_API`, using Fargate. Don't forget to add the `Class` tag!
- Create a Service in the newly created Cluster at https://eu-west-1.console.aws.amazon.com/ecs/v2/clusters/btc-api/services?region=eu-west-1
  - Compute option can be "Launch type" for now
  - Specify the Task Family as `btc-api`
  - Provide the same as the service name
  - Use the `de3` security group
  - Create a load balancer listening on port 80 (we would need to create an SSL cert for HTTPS), and specify `/stats` as the health-check path, with a 10-second grace period
  - Test the deployed service behind the load balancer, e.g. https://btc-api-1417435399.eu-west-1.elb.amazonaws.com/report
Read the rOpenSci Docker tutorial -- quiz next week! Think about why we might want to use Docker.
The goal of this assignment is to confirm that you have a general understanding of how to build data pipelines using Amazon Web Services and R or Python, and can actually implement a stream-processing application (running either in near real-time or in a batched/scheduled way) or an R- or Python-based API in practice.
To minimize the system administration and some of the already-covered engineering tasks for the students, the below pre-configured tools are provided as free options, but students may build their own environment (on top of or independently from these) and are free to use any other tools:
- `de3` Amazon Machine Image that you can use to spin up an EC2 node with RStudio Server, Shiny Server, Jenkins, Redis and Docker installed & pre-configured (use your AWS username and the password shared on Slack previously)
- `de3` EC2 IAM role with full access to Kinesis, DynamoDB, CloudWatch and the `slack` token in the Parameter Store
- `de3` security group with open ports for RStudio Server and Jenkins
- lecture and seminar notes at https://github.com/daroczig/CEU-R-prod
Make sure to clean up your EC2 nodes, security groups, keys, etc. created in the past weeks, as left-over AWS resources will contribute negative points to your final grade! E.g. the EC2 node you created in the second week should be terminated.
- Minimal project (for grade up to "B"): schedule a Jenkins job that runs every hour, getting the past hour's 1-minute interval klines data on ETH prices (in USD). The job should be configured to pull the R or Python script at the start of the job from either a private or public git repo or gist. Then:
  - Find the min and max price of ETH in the past hour, and post these stats in the `#bots-bots-bots` MS Teams channel. Make sure to set your username for the message, and use a custom emoji as the icon.
  - Set up email notification for the job when it fails.
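The hourly job's logic can be sketched roughly as below, using only the standard library. The webhook-based posting and the message format are assumptions — adapt the `post_message` part to however your Teams channel actually accepts messages, and read the webhook URL from the Parameter Store or a Jenkins credential rather than hard-coding it:

```python
import json
import urllib.request

# Public Binance REST endpoint for 1-minute ETH/USDT candles over the past hour
KLINES_URL = ('https://api.binance.com/api/v3/klines'
              '?symbol=ETHUSDT&interval=1m&limit=60')


def summarise(closes):
    """Human-readable min/max summary of the past hour's closing prices."""
    return 'ETH/USDT last hour: min ${:,.2f} / max ${:,.2f}'.format(
        min(closes), max(closes))


def fetch_closes(url=KLINES_URL):
    # Binance kline rows look like [open_time, open, high, low, close, ...]
    with urllib.request.urlopen(url) as response:
        return [float(row[4]) for row in json.loads(response.read())]


def post_message(webhook_url, text):
    # Assumption: the channel is reachable via an incoming-webhook URL
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({'text': text}).encode(),
        headers={'Content-Type': 'application/json'})
    urllib.request.urlopen(request)
```

In the Jenkins job the script would end with something like `post_message(webhook, summarise(fetch_closes()))`; the hourly cadence comes from the Jenkins trigger, not from the script itself.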
- Recommended project (for grade up to "A"): deploy an R- or Python-based API in ECS (like we did last week) for analyzing recent Binance (or any other real-time) data. The API should include at least 4 endpoints using different serializers, and these endpoints should differ from the ones we covered in class. At least one endpoint should take a few parameters. Build a Docker image, push it to ECR, and deploy it as a service in ECS. Document the steps required to set up ECR/ECS with screenshots, then delete all services after confirming that everything works correctly.
Regarding feedback: by default, I add super short feedback on Moodle as a comment to your submission (e.g. "good job" or "excellent" for grade A, or short details on why it was not A). If you want to receive more detailed feedback, please send me an email to schedule a quick call. If you want early feedback (before grading), send me an email at least a week before the submission deadline!
- Create a PDF document that describes your solution and all the main steps involved in low-level detail: attach screenshots (including the URL bar and the date/time widget of your OS, i.e. full-screen rather than area-picked screenshots) of your browser showing what you are doing in RStudio Server, Jenkins, the AWS dashboards, or example messages posted in MS Teams, and make sure that the code you wrote is either visible on the screenshots or included in the PDF.
- STOP the EC2 instance you worked on, but don't terminate it, so I can start it and check how it works. Note that your instance will be terminated by me after the end of the class.
- Include the `instance_id` on the first page of the PDF, along with your name or student ID.
- Upload the PDF to Moodle.
Midnight (CET) on March 13, 2026.
File a GitHub ticket.