An automated pipeline designed to extract high-frequency Bitcoin (BTC) market data, perform advanced feature engineering using Apache Spark, and persist the results into a centralized PostgreSQL data warehouse.
This project implements a Medallion Architecture to ensure data quality and lineage.
- Orchestration Layer: Apache Airflow (utilizing TaskFlow API)
- Processing Engine: Apache Spark (PySpark) 3.5+
- Data Storage:
  - Bronze: Local Parquet files (raw API responses)
  - Silver: PostgreSQL (processed features)
- Infrastructure: Dockerized environment with custom Java/Spark integrations
- Extraction: Automated calls to Binance V3 API capturing 1-minute candlesticks.
- Bronze Layer: Data is serialized to Parquet. Timestamps are coerced to microsecond precision (`us`) to ensure seamless Spark interoperability.
- Feature Engineering: A distributed Spark session calculates rolling windows and future targets.
- Silver Layer: The enriched dataset is written to `silver_db` via the PostgreSQL JDBC driver.
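The microsecond coercion in the Bronze step matters because Spark reads Parquet timestamps at microsecond precision, while pandas defaults to nanoseconds. A minimal pandas sketch of the idea (illustrative only; column names are assumptions, not the project's exact code):

```python
import pandas as pd

# Binance kline open times arrive as epoch milliseconds.
raw = pd.DataFrame({
    "open_time_ms": [1700000000000, 1700000060000],
    "close": [36000.5, 36010.0],
})

# Coerce to microsecond precision ("us") so Spark can read the resulting
# Parquet file without nanosecond-compatibility workarounds.
raw["open_time"] = pd.to_datetime(raw["open_time_ms"], unit="ms").astype("datetime64[us]")

print(raw["open_time"].dtype)  # datetime64[us]

# The Bronze step would then persist with raw.to_parquet(...).
```

The `astype("datetime64[us]")` cast requires pandas 2.x; older pandas only supports nanosecond resolution.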
| Feature | Logic | Purpose |
|---|---|---|
| Close T+10 | `lead(close, 10)` | Future price target for predictive modeling |
| Returns | `(close_t - close_t-1) / close_t-1` | Relative price volatility measurement |
| MA (5/10) | `avg(close)` over rolling rows | Short-term trend smoothing |
| Taker Ratio | `taker_buy_volume / total_volume` | Sentiment & market intensity indicator |
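The table's logic can be mirrored in pandas for quick local inspection. This is a hedged sketch, not the production code: the real job uses Spark's `F.lead`, `F.avg`, and `Window` over a distributed DataFrame, and the toy column names below are assumptions:

```python
import pandas as pd

# Toy candle data; columns mirror the assumed Silver schema.
df = pd.DataFrame({
    "close": [100.0, 101.0, 99.0, 102.0, 103.0, 104.0,
              105.0, 104.5, 106.0, 107.0, 108.0, 109.0],
    "taker_buy_volume": [3.0] * 12,
    "volume": [5.0] * 12,
})

# Close T+10: future price target, like F.lead("close", 10) in Spark.
df["close_t_plus_10"] = df["close"].shift(-10)

# Returns: (close_t - close_{t-1}) / close_{t-1}.
df["returns"] = df["close"].pct_change()

# MA(5): rolling mean over the last 5 rows, like F.avg over a rowsBetween window.
df["ma_5"] = df["close"].rolling(window=5).mean()

# Taker Ratio: buy-side aggression relative to total traded volume.
df["taker_ratio"] = df["taker_buy_volume"] / df["volume"]
```

Note that `shift(-10)` leaves the last 10 rows as NaN, matching Spark's `lead`, which returns null where no future row exists; those rows are typically dropped before model training.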
```
Data_Airflow/
├── dags/
│   └── auto_collect_proces.py   # Pipeline orchestration logic
├── utils/
│   ├── get_data.py              # Data acquisition & Bronze storage
│   └── EDA_data_proces.py       # Spark feature engineering & Silver storage
├── Dockerfile                   # Airflow + OpenJDK 17 + Spark image
├── docker-compose.yml           # Full stack orchestration
├── requirements.txt             # Python ecosystem dependencies
└── .env                         # Private environment configuration
```
Create a `.env` file in the root of the `Data_Airflow` directory with the following schema:

```
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DB=silver_db
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
SILVER_TABLE=btc_silver_data
```

Deploy the architecture using Docker Compose:

```shell
docker compose up -d
```

The pipeline is pre-configured to run hourly.
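The services read these variables at runtime. As a hypothetical sketch (the helper name `jdbc_url` is an assumption, not code from this repo), the Silver writer could assemble its PostgreSQL JDBC URL from them like so:

```python
import os

# Defaults mirror the .env schema above; inside the containers,
# docker-compose injects the real values.
def jdbc_url() -> str:
    host = os.environ.get("POSTGRES_HOST", "postgres")
    port = os.environ.get("POSTGRES_PORT", "5432")
    db = os.environ.get("POSTGRES_DB", "silver_db")
    return f"jdbc:postgresql://{host}:{port}/{db}"

# A Spark writer would then use it roughly like:
#   df.write.format("jdbc").option("url", jdbc_url()) \
#     .option("dbtable", os.environ["SILVER_TABLE"]) \
#     .option("user", os.environ["POSTGRES_USER"]) \
#     .option("password", os.environ["POSTGRES_PASSWORD"]) \
#     .mode("append").save()

print(jdbc_url())
```

Keeping credentials out of the URL and passing them via the `user`/`password` options avoids leaking secrets into Spark logs.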
Monitor execution via the Airflow Web UI at http://localhost:8080.
To trigger manually for verification, you may execute:
```shell
python utils/get_data.py && python utils/EDA_data_proces.py
```

Developed for Advanced Agentic Coding - Binance AI Project