🚀 Ctrl_Alt_db – Airflow MariaDB Connector

Team Name: Ctrl_Alt_db
Project Title: Airflow MariaDB Connector
Theme: Integration


🧩 Problem Statement

Apache Airflow currently lacks native integration with MariaDB, forcing developers to rely on the MySQL connector.
However, this connector is incompatible with key MariaDB-specific features such as:

  • ⚑ ColumnStore
  • πŸ“₯ cpimport
  • 🧠 Native JSON functions

This limitation results in reduced functionality and performance bottlenecks in ETL workflows.
Data pipelines built on Airflow cannot fully leverage MariaDB's high-performance architecture.

πŸ” Benchmark Insight:
The MariaDB Python connector outperforms the MySQL connector by up to 3Γ— in operations like:

  • executemany
  • SELECT
  • JSON_INSERT

The absence of a native Airflow–MariaDB connector thus limits Airflow's ability to orchestrate modern, high-performance, and scalable MariaDB data workflows.


💡 Solution Overview

The Airflow MariaDB Connector introduces seamless, native integration between Apache Airflow and MariaDB (including ColumnStore).

✅ Key Features

  • 🧩 Native Airflow connection type for direct MariaDB integration (no MySQL fallback)
  • 🚀 High-speed data ingestion using cpimport, optimized for bulk ETL operations
  • 🔄 ETL workflows: download → transform → load between MariaDB and S3
  • 📊 Columnar architecture support for faster analytical queries
  • ⚙️ Up to 3× performance improvement over the MySQL connector for critical database operations

⚡ Performance Comparison: MariaDB vs MySQL

We ran a quick benchmark using mysql_vs_mariadb.py (see the project structure below) to compare the execution speed of common operations.
The results clearly demonstrate the performance advantage of MariaDB's Python connector over MySQL's, especially for bulk inserts, SELECT queries, and JSON operations.
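
For reference, a minimal sketch of such a benchmark is shown below; the connection parameters are placeholders, and the actual mysql_vs_mariadb.py script may measure more operations.

import time

import mariadb           # MariaDB Connector/Python: pip install mariadb
import mysql.connector   # MySQL Connector/Python: pip install mysql-connector-python

ROWS = [(i, f"name-{i}") for i in range(10_000)]

def bench_executemany(conn, placeholder):
    # Time a bulk insert of ROWS; the SQL placeholder style differs per driver.
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS bench")
    cur.execute("CREATE TABLE bench (id INT, name VARCHAR(32))")
    start = time.perf_counter()
    cur.executemany(f"INSERT INTO bench VALUES ({placeholder}, {placeholder})", ROWS)
    conn.commit()
    return time.perf_counter() - start

# Placeholder credentials and ports; adjust to your environment.
maria = mariadb.connect(host="localhost", port=3307, user="user",
                        password="password", database="test")
mysql_conn = mysql.connector.connect(host="localhost", port=3306, user="user",
                                     password="password", database="test")

print("mariadb executemany:", bench_executemany(maria, "?"))        # qmark style
print("mysql executemany:  ", bench_executemany(mysql_conn, "%s"))  # format style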

[Chart: MariaDB vs MySQL performance comparison (images/comparison_between_mariadb_mysql.png)]

🧠 Concept

🎯 Goal

Build a seamless ETL integration between Apache Airflow and MariaDB ColumnStore.

💭 Idea

Automate OpenFlights data ingestion using:

  • Airflow DAGs for orchestration
  • Secure SSH transfers
  • cpimport for high-performance bulk loading into ColumnStore

πŸ—οΈ Principles & Design

Principle Description
πŸ” Automation Entire data pipeline runs automatically via Airflow scheduling
πŸ” Security Uses SSH-based file transfer β€” no direct DB exposure
βš–οΈ Scalability ColumnStore ensures distributed & parallel data loading
🧩 Modularity Each dataset (airports, airlines, routes, etc.) is processed independently
πŸ”„ Reusability DAG supports adding new datasets via simple JSON config updates
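
As a sketch of the reusability principle, a DAG can expand a JSON config into one independent task per dataset. The config keys below (url, table) are assumptions for illustration; the actual schema is defined in dags/config/datasets.json.

import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed config shape: {"airports": {"url": "https://...", "table": "airports"}, ...}
datasets = json.loads(Path("/opt/airflow/dags/config/datasets.json").read_text())

with DAG("openflights_config_sketch", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    for name, cfg in datasets.items():
        # One task per dataset; adding a dataset means editing JSON only.
        PythonOperator(
            task_id=f"load_{name}",
            python_callable=lambda c=cfg: print(f"load {c['url']} -> {c['table']}"),
        )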

βš™οΈ Installation & Setup Instructions

Choose the installation method that best fits your needs:


🚀 Option 1: Quick Install (Recommended for Production)

Install the MariaDB provider directly from GitHub:

# Install the latest version
pip install -U git+https://github.com/Pratush12/mariadb-airflow-hackathon.git@main#subdirectory=airflow-mariadb-provider

Requirements:

  • Python 3.8+
  • Apache Airflow 2.5.0+
  • MariaDB server (for database operations)
  • SSH access (for cpimport operations)

🔧 Option 2: Clone and Install (Development)

For development or customization:

# Clone the repository
git clone https://github.com/Pratush12/mariadb-airflow-hackathon.git
cd mariadb-airflow-hackathon/airflow-mariadb-provider

# Install in development mode
pip install -e ".[all]"

# Or install with specific features
pip install -e ".[amazon]"  # S3 support
pip install -e ".[ssh]"     # SSH support

🐳 Option 3: Docker Setup (Full Development Environment)

For a complete development environment with MariaDB ColumnStore:

🧱 Step 1: Add Custom Provider to Airflow

Since Airflow doesn't natively support MariaDB, we created a custom provider.

This provider:

  • Adds MariaDB connection type
  • Provides S3 hooks and cpimport operators
  • Enables direct integration with MariaDB from Airflow DAGs
# Clone or copy your provider into the airflow directory
COPY airflow-mariadb-provider /opt/airflow/airflow-mariadb-provider

# Install the provider
RUN pip install --no-deps /opt/airflow/airflow-mariadb-provider

πŸƒβ€β™‚οΈ Quick Start

After installation, create a MariaDB connection in Airflow:

  1. Go to Admin → Connections in the Airflow UI (localhost:8080)

  2. Add new connection:

    • Connection ID: maria_db_default
    • Connection Type: MariaDB
    • Host: Your MariaDB server
    • Port: 3306
    • Login: Your username
    • Password: Your password
    • Schema: Your database name
  3. Use the operators in your DAGs:

from airflow.providers.mariadb.operators.mariadb import MariaDBOperator

# Basic SQL execution
sql_task = MariaDBOperator(
    task_id="execute_sql",
    mariadb_conn_id="maria_db_default",
    sql="SELECT * FROM my_table"
)
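
If the hook follows Airflow's standard DbApiHook interface (an assumption; the actual API is in hooks/mariadb.py), you can also query MariaDB from plain Python callables:

from airflow.providers.mariadb.hooks.mariadb import MariaDBHook  # class and kwarg names assumed

def count_rows():
    # DbApiHook-style helper: get_records executes a query and returns all rows.
    hook = MariaDBHook(mariadb_conn_id="maria_db_default")
    rows = hook.get_records("SELECT COUNT(*) FROM my_table")
    print("row count:", rows[0][0])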

🐬 Step 2: Install MariaDB with ColumnStore Engine

We use MariaDB ColumnStore inside Docker for high-performance analytical queries.

docker run -d -p 3307:3306 -p 2222:22 --shm-size=512m -e PM1=mcs1 --hostname=mcs1 mariadb/columnstore
docker exec -it mcs1 provision mcs1

🧠 Why ColumnStore? It enables parallelized columnar data storage, perfect for analytical workloads.

⚓ Step 3: Connect Airflow to MariaDB via Docker Network

docker-compose down -v
docker-compose up -d
docker network connect airflow_net mcs1

Your docker-compose.yml connects both containers (Airflow + MariaDB) via the same network for smooth communication.

🐳 Step 4: Dockerfile for Airflow with MariaDB Connector

# Use the official Airflow image
FROM apache/airflow:2.9.0

# Switch to root user to install system dependencies
USER root

# Install system dependencies for MariaDB
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        gcc \
        libmariadb-dev \
        mariadb-client && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

USER airflow

# Set the PATH for the airflow user
ENV PATH="/home/airflow/.local/bin:${PATH}"

# Install Python dependencies (MariaDB driver)
RUN pip install --no-cache-dir mariadb

# Install your custom provider
# Make sure you install it without upgrading core Airflow packages
COPY airflow-mariadb-provider /opt/airflow/airflow-mariadb-provider
RUN pip install --no-deps /opt/airflow/airflow-mariadb-provider

πŸ” Step 5: Enable SSH Connection in MariaDB Container

SSH is used for secure file transfers (e.g., CSV → cpimport).

docker exec -it mcs1 bash
ssh-keygen -A
/usr/sbin/sshd -D &
exit

Then restart the Airflow webserver:

docker restart airflow-docker-airflow-webserver-1

🔗 Airflow Connections

Once Airflow is running, configure these connections in the Airflow UI (localhost:8080 → Admin → Connections):

  • maria_db_default (MariaDB): Host: hostname, Port: 3306, User: user, Password: password
  • mariadb_ssh_connection (SSH): Host: hostname, Port: 22, Username: user, Password: password
  • aws_default (S3, optional): For S3 data transfer workflows
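
With maria_db_default and mariadb_ssh_connection in place, a cpimport task might look like the sketch below. The class name and parameters are assumptions for illustration only; the real interface is defined in operators/cpimport.py.

# Hypothetical usage sketch; check operators/cpimport.py for the actual signature.
from airflow.providers.mariadb.operators.cpimport import MariaDBCpImportOperator  # name assumed

load_airports = MariaDBCpImportOperator(
    task_id="cpimport_airports",
    ssh_conn_id="mariadb_ssh_connection",          # SSH connection used to run cpimport remotely
    mariadb_conn_id="maria_db_default",
    database="openflights",                        # target schema (assumed)
    table="airports",                              # target ColumnStore table
    local_file="/opt/airflow/data/airports.csv",   # CSV transferred over SSH before import
)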

📂 Project Structure

Below is the overall structure of the project:

mariadb-airflow-hackathon/
│
├─ readme.MD # Project documentation
├─ docker-compose.yml # Docker Compose setup for Airflow + MariaDB
├─ Dockerfile # Custom Airflow image with MariaDB connector
├─ requirements.txt # Python dependencies
├─ mysql_vs_mariadb.py # Performance benchmarking script
├─ images/ # Screenshots, performance charts
│ └─ comparison_between_mariadb_mysql.png
├─ dags/ # Airflow DAGs
│ ├─ config/
│ │ └─ datasets.json # Dataset configuration
│ ├─ openflights_dag.py # Main OpenFlights data ingestion DAG
│ └─ mariadb_s3_operators_dag.py # S3 integration DAG
└─ airflow-mariadb-provider/ # Apache Airflow MariaDB Provider
   ├─ pyproject.toml # Modern Python packaging configuration
   ├─ README.rst # Provider documentation
   ├─ docs/ # Sphinx documentation
   │ ├─ index.rst
   │ ├─ changelog.rst
   │ ├─ commits.rst
   │ ├─ operators.rst
   │ ├─ security.rst
   │ ├─ connections/
   │ │ └─ mariadb.rst
   │ ├─ example_dags/
   │ │ ├─ example_mariadb_basic.py
   │ │ ├─ example_mariadb_cpimport.py
   │ │ └─ example_mariadb_s3.py
   │ └─ installing-providers-from-sources.rst
   │
   ├─ src/airflow/providers/mariadb/ # Main provider code
   │ ├─ __init__.py # Provider metadata
   │ ├─ get_provider_info.py # Provider information
   │ ├─ hooks/
   │ │ ├─ __init__.py
   │ │ └─ mariadb.py # MariaDB hook implementation
   │ ├─ operators/
   │ │ ├─ __init__.py
   │ │ ├─ mariadb.py # Basic MariaDB operator
   │ │ ├─ cpimport.py # cpimport operator
   │ │ └─ s3.py # S3 integration operators
   │ ├─ sensors/
   │ │ └─ __init__.py
   │ └─ transfers/
   │    └─ __init__.py
   │
   └─ tests/ # Comprehensive test suite
      ├─ conftest.py # Test configuration
      ├─ unit/mariadb/ # Unit tests
      │ ├─ hooks/
      │ │ └─ test_mariadb.py
      │ └─ __init__.py
      └─ system/mariadb/ # System tests
         ├─ example_mariadb.py
         └─ __init__.py

🧪 Running the DAG

Start the Airflow UI → http://localhost:8080

Trigger the DAG: OpenFlights Data Ingestion

Watch:

  • 🗂️ SSH file upload logs
  • ⚙️ cpimport execution
  • 📊 Data validation queries inside MariaDB

🧭 Outcome & Learnings

Through this project, we successfully:

  • ✅ Built the first native Airflow–MariaDB connector
  • ✅ Integrated MariaDB ColumnStore for parallel ETL
  • ✅ Learned Airflow provider development and Docker networking
  • ✅ Explored secure SSH integration for data transfer
  • ✅ Benchmarked MariaDB vs MySQL connector performance

💬 "This hackathon gave us deep insights into how Airflow orchestrates ETL pipelines and how MariaDB's performance capabilities can be unlocked with the right integration."


🏁 Conclusion

The Airflow MariaDB Connector bridges a major integration gap in modern data engineering.

It enables:

  • ⚑ Direct and optimized Airflow–MariaDB communication
  • πŸš€ High-speed ETL via cpimport
  • πŸ“ˆ Scalable analytics with ColumnStore

With this, Ctrl_Alt_db has taken the first step toward empowering the Airflow community with a truly MariaDB-native data orchestration solution.
