
This project automates the extraction, processing, and storage of Bitcoin (BTC) market data using Apache Airflow, Apache Spark, and PostgreSQL. It follows a medallion architecture pattern with Bronze and Silver layers.

BINANCE BTC DATA PIPELINE

Technical ETL Suite for Real-time Cryptocurrency Analytics


An automated pipeline designed to extract high-frequency Bitcoin (BTC) market data, perform advanced feature engineering using Apache Spark, and persist the results into a centralized PostgreSQL data warehouse.


PROJECT ARCHITECTURE

This project implements a Medallion Architecture to ensure data quality and lineage.

  • Orchestration Layer: Apache Airflow (utilizing TaskFlow API)
  • Processing Engine: Apache Spark (PySpark) 3.5+
  • Data Storage:
    • Bronze: Local Parquet Files (Raw API Responses)
    • Silver: PostgreSQL (Processed Features)
  • Infrastructure: Dockerized environment with custom Java/Spark integrations
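
The stack above is wired together in `docker-compose.yml`. A simplified sketch of the implied service layout (service names and image tags are assumptions; the repository's actual file is authoritative):

```yaml
services:
  postgres:                # Silver-layer warehouse (hosts silver_db)
    image: postgres:15
    env_file: .env
  airflow:                 # Built from the Dockerfile: Airflow + OpenJDK 17 + Spark
    build: .
    env_file: .env
    depends_on: [postgres]
    ports:
      - "8080:8080"        # Airflow Web UI
```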

TECHNICAL WORKFLOW

  1. Extraction: Automated calls to the Binance REST API (v3) capture 1-minute candlesticks (klines).
  2. Bronze Layer: Data is serialized to Parquet. Timestamps are coerced to microsecond precision (us) to ensure seamless Spark interoperability.
  3. Feature Engineering: A distributed Spark session calculates rolling windows and future targets.
  4. Silver Layer: The enriched dataset is written to the silver_db via the PostgreSQL JDBC driver.
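
The Bronze-layer timestamp handling in step 2 can be sketched as follows. Column names and sample values are illustrative, not taken from the repository:

```python
import pandas as pd

# One illustrative Binance kline row: the open time is epoch milliseconds,
# and Binance returns the numeric fields as strings.
raw = [[1700000000000, "37500.1", "37510.0", "37490.0", "37505.2", "12.34"]]
df = pd.DataFrame(raw, columns=["open_time", "open", "high", "low", "close", "volume"])

# Cast the string-typed numeric fields before any feature work.
price_cols = ["open", "high", "low", "close", "volume"]
df[price_cols] = df[price_cols].astype(float)

# Coerce the epoch-ms timestamp to microsecond precision, matching the
# TIMESTAMP_MICROS representation Spark expects when reading the Parquet files.
df["open_time"] = pd.to_datetime(df["open_time"], unit="ms").astype("datetime64[us]")

# df.to_parquet("bronze/btc_1m.parquet", index=False)  # Bronze write (needs pyarrow)
```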

FEATURE ENGINEERING OVERVIEW

| Feature | Logic | Purpose |
| --- | --- | --- |
| Close T+10 | `lead(close, 10)` | Future price target for predictive modeling |
| Returns | `(close_t - close_t-1) / close_t-1` | Per-period relative price change (basis for volatility measures) |
| MA (5/10) | `avg(close)` over rolling 5-/10-row windows | Short-term trend smoothing |
| Taker Ratio | `taker_buy_volume / total_volume` | Sentiment & market-intensity indicator |
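
The table above maps to standard window operations. In the pipeline these run as Spark window functions; a pandas equivalent of the same feature logic (column names assumed, data synthetic) looks like this:

```python
import pandas as pd

# Synthetic 1-minute bars, just to exercise the feature logic.
df = pd.DataFrame({
    "close": [100.0, 101.0, 99.0, 102.0, 103.0, 104.0],
    "taker_buy_volume": [3.0, 4.0, 2.0, 5.0, 6.0, 3.0],
    "volume": [10.0, 8.0, 4.0, 10.0, 12.0, 6.0],
})

df["close_t10"] = df["close"].shift(-10)      # lead(close, 10): future price target
df["returns"] = df["close"].pct_change()      # (close_t - close_{t-1}) / close_{t-1}
df["ma_5"] = df["close"].rolling(5).mean()    # 5-row moving average
df["ma_10"] = df["close"].rolling(10).mean()  # 10-row moving average
df["taker_ratio"] = df["taker_buy_volume"] / df["volume"]  # buy-side pressure
```

In Spark the same logic is expressed with `F.lead` and `F.avg` over an ordered `Window`, which distributes the computation instead of running it on a single node.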

REPOSITORY STRUCTURE

Data_Airflow/
├── dags/
│   └── auto_collect_proces.py   # Pipeline orchestration logic
├── utils/
│   ├── get_data.py             # Data acquisition & Bronze storage
│   └── EDA_data_proces.py      # Spark Feature Engineering & Silver storage
├── Dockerfile                  # Airflow + OpenJDK 17 + Spark image
├── docker-compose.yml          # Full stack orchestration
├── requirements.txt            # Python ecosystem dependencies
└── .env                        # Private environment configuration

IMPLEMENTATION GUIDE

1. Environment Setup

Create a .env file in the root of the Data_Airflow directory with the following schema:

POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DB=silver_db
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
SILVER_TABLE=btc_silver_data
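
At runtime these variables are assembled into the JDBC URL used by the Silver-layer writer. A minimal sketch, assuming the variable names from the schema above (the exact handling in `EDA_data_proces.py` may differ):

```python
import os

# Read connection settings, falling back to the defaults from the .env schema above.
host = os.environ.get("POSTGRES_HOST", "postgres")
port = os.environ.get("POSTGRES_PORT", "5432")
db = os.environ.get("POSTGRES_DB", "silver_db")
table = os.environ.get("SILVER_TABLE", "btc_silver_data")

jdbc_url = f"jdbc:postgresql://{host}:{port}/{db}"

# Spark then writes the enriched DataFrame via the PostgreSQL JDBC driver, e.g.:
# df.write.format("jdbc").option("url", jdbc_url).option("dbtable", table) \
#     .option("user", os.environ["POSTGRES_USER"]) \
#     .option("password", os.environ["POSTGRES_PASSWORD"]) \
#     .mode("append").save()
```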

2. System Launch

Deploy the architecture using Docker Compose:

docker compose up -d

3. Pipeline Operation

The pipeline is pre-configured to run hourly. Monitor execution via the Airflow Web UI at http://localhost:8080.

To trigger manually for verification, you may execute:

python utils/get_data.py && python utils/EDA_data_proces.py

Developed for Advanced Agentic Coding - Binance AI Project
