
This project automates the extraction, processing, and storage of Bitcoin (BTC) market data using Apache Airflow, Apache Spark, and PostgreSQL. It follows a medallion architecture pattern with Bronze and Silver layers.

BINANCE BTC DATA PIPELINE

Technical ETL Suite for Real-time Cryptocurrency Analytics


An automated pipeline designed to extract high-frequency Bitcoin (BTC) market data, perform advanced feature engineering using Apache Spark, and persist the results into a centralized PostgreSQL data warehouse.


PROJECT ARCHITECTURE

This project implements a Medallion Architecture to ensure data quality and lineage.

  • Orchestration Layer: Apache Airflow (utilizing TaskFlow API)
  • Processing Engine: Apache Spark (PySpark) 3.5+
  • Data Storage:
    • Bronze: Local Parquet Files (Raw API Responses)
    • Silver: PostgreSQL (Processed Features)
  • Infrastructure: Dockerized environment with custom Java/Spark integrations
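
The stack above is wired together in `docker-compose.yml`. A simplified sketch of the implied service layout (service names and image tags are assumptions; the repository's actual file is authoritative):

```yaml
services:
  postgres:                # Silver-layer warehouse (hosts silver_db)
    image: postgres:15
    env_file: .env
  airflow:                 # Built from the Dockerfile: Airflow + OpenJDK 17 + Spark
    build: .
    env_file: .env
    depends_on: [postgres]
    ports:
      - "8080:8080"        # Airflow Web UI
```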

TECHNICAL WORKFLOW

  1. Extraction: Automated calls to the Binance REST API (v3) capture 1-minute candlesticks (klines).
  2. Bronze Layer: Data is serialized to Parquet. Timestamps are coerced to microsecond precision (us) to ensure seamless Spark interoperability.
  3. Feature Engineering: A distributed Spark session calculates rolling windows and future targets.
  4. Silver Layer: The enriched dataset is written to the silver_db via the PostgreSQL JDBC driver.
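
The Bronze-layer timestamp handling in step 2 can be sketched as follows. Column names and sample values are illustrative, not taken from the repository:

```python
import pandas as pd

# One illustrative Binance kline row: the open time is epoch milliseconds,
# and Binance returns the numeric fields as strings.
raw = [[1700000000000, "37500.1", "37510.0", "37490.0", "37505.2", "12.34"]]
df = pd.DataFrame(raw, columns=["open_time", "open", "high", "low", "close", "volume"])

# Cast the string-typed numeric fields before any feature work.
price_cols = ["open", "high", "low", "close", "volume"]
df[price_cols] = df[price_cols].astype(float)

# Coerce the epoch-ms timestamp to microsecond precision, matching the
# TIMESTAMP_MICROS representation Spark expects when reading the Parquet files.
df["open_time"] = pd.to_datetime(df["open_time"], unit="ms").astype("datetime64[us]")

# df.to_parquet("bronze/btc_1m.parquet", index=False)  # Bronze write (needs pyarrow)
```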

FEATURE ENGINEERING OVERVIEW

| Feature | Logic | Purpose |
| --- | --- | --- |
| Close T+10 | `lead(close, 10)` | Future price target for predictive modeling |
| Returns | `(close_t - close_t-1) / close_t-1` | Per-period relative price change (basis for volatility measures) |
| MA (5/10) | `avg(close)` over rolling 5-/10-row windows | Short-term trend smoothing |
| Taker Ratio | `taker_buy_volume / total_volume` | Sentiment & market-intensity indicator |
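
The table above maps to standard window operations. In the pipeline these run as Spark window functions; a pandas equivalent of the same feature logic (column names assumed, data synthetic) looks like this:

```python
import pandas as pd

# Synthetic 1-minute bars, just to exercise the feature logic.
df = pd.DataFrame({
    "close": [100.0, 101.0, 99.0, 102.0, 103.0, 104.0],
    "taker_buy_volume": [3.0, 4.0, 2.0, 5.0, 6.0, 3.0],
    "volume": [10.0, 8.0, 4.0, 10.0, 12.0, 6.0],
})

df["close_t10"] = df["close"].shift(-10)      # lead(close, 10): future price target
df["returns"] = df["close"].pct_change()      # (close_t - close_{t-1}) / close_{t-1}
df["ma_5"] = df["close"].rolling(5).mean()    # 5-row moving average
df["ma_10"] = df["close"].rolling(10).mean()  # 10-row moving average
df["taker_ratio"] = df["taker_buy_volume"] / df["volume"]  # buy-side pressure
```

In Spark the same logic is expressed with `F.lead` and `F.avg` over an ordered `Window`, which distributes the computation instead of running it on a single node.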

REPOSITORY STRUCTURE

Data_Airflow/
├── dags/
│   └── auto_collect_proces.py   # Pipeline orchestration logic
├── utils/
│   ├── get_data.py             # Data acquisition & Bronze storage
│   └── EDA_data_proces.py      # Spark Feature Engineering & Silver storage
├── Dockerfile                  # Airflow + OpenJDK 17 + Spark image
├── docker-compose.yml          # Full stack orchestration
├── requirements.txt            # Python ecosystem dependencies
└── .env                        # Private environment configuration

IMPLEMENTATION GUIDE

1. Environment Setup

Create a .env file in the root of the Data_Airflow directory with the following schema:

POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DB=silver_db
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
SILVER_TABLE=btc_silver_data
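
At runtime these variables are assembled into the JDBC URL used by the Silver-layer writer. A minimal sketch, assuming the variable names from the schema above (the exact handling in `EDA_data_proces.py` may differ):

```python
import os

# Read connection settings, falling back to the defaults from the .env schema above.
host = os.environ.get("POSTGRES_HOST", "postgres")
port = os.environ.get("POSTGRES_PORT", "5432")
db = os.environ.get("POSTGRES_DB", "silver_db")
table = os.environ.get("SILVER_TABLE", "btc_silver_data")

jdbc_url = f"jdbc:postgresql://{host}:{port}/{db}"

# Spark then writes the enriched DataFrame via the PostgreSQL JDBC driver, e.g.:
# df.write.format("jdbc").option("url", jdbc_url).option("dbtable", table) \
#     .option("user", os.environ["POSTGRES_USER"]) \
#     .option("password", os.environ["POSTGRES_PASSWORD"]) \
#     .mode("append").save()
```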

2. System Launch

Deploy the architecture using Docker Compose:

docker compose up -d

3. Pipeline Operation

The pipeline is pre-configured to run hourly. Monitor execution via the Airflow Web UI at http://localhost:8080.

To trigger manually for verification, you may execute:

python utils/get_data.py && python utils/EDA_data_proces.py

Developed for Advanced Agentic Coding - Binance AI Project
