Team Name: Ctrl_Alt_db
Project Title: Airflow MariaDB Connector
Theme: Integration
Apache Airflow currently lacks native integration with MariaDB, forcing developers to rely on the MySQL connector.
However, this connector is incompatible with key MariaDB-specific features such as:
- ⚡ ColumnStore
- 🔥 cpimport
- 🔧 Native JSON functions
This limitation results in reduced functionality and performance bottlenecks in ETL workflows.
Data pipelines built on Airflow cannot fully leverage MariaDBβs high-performance architecture.
📊 Benchmark Insight:
The MariaDB Python connector outperforms the MySQL connector by up to 3× in operations such as:
- `executemany` (bulk inserts)
- `SELECT` queries
- `JSON_INSERT` operations

The absence of a native Airflow-MariaDB connector thus limits Airflow's ability to orchestrate modern, high-performance, and scalable MariaDB data workflows.
The Airflow MariaDB Connector introduces seamless, native integration between Apache Airflow and MariaDB (including ColumnStore).
- 🧩 Native Airflow connection type for direct MariaDB integration (no MySQL fallback)
- 🚀 High-speed data ingestion using cpimport, optimized for bulk ETL operations
- 🔄 ETL workflows: download → transform → load between MariaDB and S3
- 📊 Columnar architecture support for faster analytical queries
- ⚙️ 3× performance improvement over the MySQL connector for critical database operations
We ran a quick benchmark using mysql_vs_mariadb.py to compare the execution speed of common operations.
The results clearly demonstrate the performance advantage of MariaDB's Python connector over MySQL's, especially for bulk inserts, SELECT queries, and JSON operations.

Note:
- The above analysis was done on a local machine with 1 million records.
- Code ref: https://github.com/Pratush12/mariadb-airflow-hackathon/blob/main/mysql_vs_mariadb.py
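As a rough illustration of how such a comparison can be written, here is a minimal sketch of an `executemany()` timing test. Hosts, credentials, table name, and row count are placeholder assumptions, not the actual contents of mysql_vs_mariadb.py:

```python
import time

import mariadb            # MariaDB Connector/Python
import mysql.connector    # MySQL Connector/Python

ROWS = [(i, f"name_{i}") for i in range(100_000)]  # sample workload

def time_bulk_insert(conn, placeholder: str) -> float:
    """Time an executemany() bulk insert; placeholder is '?' (MariaDB) or '%s' (MySQL)."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS bench")
    cur.execute("CREATE TABLE bench (id INT, name VARCHAR(64))")
    start = time.perf_counter()
    cur.executemany(
        f"INSERT INTO bench (id, name) VALUES ({placeholder}, {placeholder})", ROWS
    )
    conn.commit()
    return time.perf_counter() - start

# Placeholder hosts/credentials; adjust to your environment.
maria = mariadb.connect(host="localhost", port=3307, user="user",
                        password="password", database="test")
mysql_conn = mysql.connector.connect(host="localhost", port=3306, user="user",
                                     password="password", database="test")

print(f"MariaDB connector: {time_bulk_insert(maria, '?'):.2f}s")
print(f"MySQL connector:   {time_bulk_insert(mysql_conn, '%s'):.2f}s")
```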
Build a seamless ETL integration between Apache Airflow and MariaDB ColumnStore.
Automate OpenFlights data ingestion using:
- Airflow DAGs for orchestration
- Secure SSH transfers
- cpimport for high-performance bulk loading into ColumnStore
| Principle | Description |
|---|---|
| 🔁 Automation | The entire data pipeline runs automatically via Airflow scheduling |
| 🔐 Security | Uses SSH-based file transfer, so there is no direct DB exposure |
| ⚙️ Scalability | ColumnStore ensures distributed, parallel data loading |
| 🧩 Modularity | Each dataset (airports, airlines, routes, etc.) is processed independently |
| 🔄 Reusability | The DAG supports adding new datasets via simple JSON config updates (see the sketch below) |
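To illustrate the reusability principle, here is a minimal sketch of a config-driven DAG that generates one load task per entry in dags/config/datasets.json. The JSON field names (name, url, table) and the task body are illustrative assumptions, not the project's actual schema:

```python
import json
from datetime import datetime
from pathlib import Path

from airflow.decorators import dag, task

# Assumed (hypothetical) shape of dags/config/datasets.json:
# [{"name": "airports", "url": "https://.../airports.dat", "table": "airports"}, ...]
DATASETS = json.loads(
    (Path(__file__).parent / "config" / "datasets.json").read_text()
)

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def openflights_ingestion():
    # One independent task per configured dataset
    for cfg in DATASETS:
        @task(task_id=f"load_{cfg['name']}")
        def load(dataset: dict = cfg):
            # download → SSH transfer → cpimport, driven entirely by config
            print(f"Loading {dataset['name']} into {dataset['table']}")
        load()

openflights_ingestion()
```

Adding a new dataset then means adding one JSON entry; no DAG code changes are required.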
Choose the installation method that best fits your needs:
Install the MariaDB provider directly from GitHub:
```bash
# Install the latest version
pip install -U git+https://github.com/Pratush12/mariadb-airflow-hackathon.git@main#subdirectory=airflow-mariadb-provider
```

Requirements:
- Python 3.8+
- Apache Airflow 2.5.0+
- MariaDB server (for database operations)
- SSH access (for cpimport operations)
For development or customization:
```bash
# Clone the repository
git clone https://github.com/Pratush12/mariadb-airflow-hackathon.git
cd mariadb-airflow-hackathon/airflow-mariadb-provider

# Install in development mode
pip install -e ".[all]"

# Or install with specific features
pip install -e ".[amazon]"  # S3 support
pip install -e ".[ssh]"     # SSH support
```

For a complete development environment with MariaDB ColumnStore, see the Docker setup below.
Since Airflow doesn't natively support MariaDB, we created a custom provider.
This provider:
- Adds MariaDB connection type
- Provides S3 hooks and cpimport operators
- Enables direct integration with MariaDB from Airflow DAGs
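For illustration, here is a stripped-down sketch of what the provider's hook could look like. The module path mirrors the project layout shown later, but the body is a simplified assumption, not the actual implementation:

```python
import mariadb
from airflow.hooks.base import BaseHook

class MariaDBHook(BaseHook):
    """Simplified sketch: open a MariaDB connection from an Airflow connection ID."""

    conn_name_attr = "mariadb_conn_id"
    default_conn_name = "maria_db_default"
    conn_type = "mariadb"
    hook_name = "MariaDB"

    def __init__(self, mariadb_conn_id: str = default_conn_name):
        super().__init__()
        self.mariadb_conn_id = mariadb_conn_id

    def get_conn(self):
        # Resolve credentials from the Airflow connection store
        conn = self.get_connection(self.mariadb_conn_id)
        return mariadb.connect(
            host=conn.host,
            port=conn.port or 3306,
            user=conn.login,
            password=conn.password,
            database=conn.schema,
        )
```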
In the custom Airflow image, the provider is installed as follows:

```dockerfile
# Clone or copy your provider into the airflow directory
COPY airflow-mariadb-provider /opt/airflow/airflow-mariadb-provider

# Install the provider
RUN pip install --no-deps /opt/airflow/airflow-mariadb-provider
```

After installation, create a MariaDB connection in Airflow:
1. Go to Admin → Connections in the Airflow UI (`localhost:8080`)
2. Add a new connection:
   - Connection ID: `maria_db_default`
   - Connection Type: `MariaDB`
   - Host: your MariaDB server
   - Port: `3306`
   - Login: your username
   - Password: your password
   - Schema: your database name
3. Use the operators in your DAGs:
```python
from airflow.providers.mariadb.operators.mariadb import MariaDBOperator

# Basic SQL execution
sql_task = MariaDBOperator(
    task_id="execute_sql",
    mariadb_conn_id="maria_db_default",
    sql="SELECT * FROM my_table",
)
```
We use MariaDB ColumnStore inside Docker for high-performance analytical queries.
```bash
docker run -d --name mcs1 -p 3307:3306 -p 2222:22 --shm-size=512m -e PM1=mcs1 --hostname=mcs1 mariadb/columnstore
docker exec -it mcs1 provision mcs1
```

🔧 Why ColumnStore? It enables parallelized columnar data storage, which is ideal for analytical workloads.
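As an example of targeting the columnar engine from a DAG, a table can be created with `ENGINE=ColumnStore` via the MariaDBOperator shown earlier (table and column names here are illustrative):

```python
# Create an analytical table on the ColumnStore engine (illustrative schema)
create_cs_table = MariaDBOperator(
    task_id="create_columnstore_table",
    mariadb_conn_id="maria_db_default",
    sql="""
        CREATE TABLE IF NOT EXISTS flights (
            flight_id INT,
            origin VARCHAR(8),
            destination VARCHAR(8)
        ) ENGINE=ColumnStore;
    """,
)
```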
Connect the ColumnStore container to the Airflow network and restart the stack:

```bash
docker network connect airflow_net mcs1
docker-compose down -v
docker-compose up -d
```

Your docker-compose.yml connects both containers (Airflow + MariaDB) via the same network for smooth communication.
The custom Airflow image is built from the following Dockerfile:

```dockerfile
# Use the official Airflow image
FROM apache/airflow:2.9.0

# Switch to root user to install system dependencies
USER root

# Install system dependencies for MariaDB
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        gcc \
        libmariadb-dev \
        mariadb-client && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

USER airflow

# Set the PATH for the airflow user
ENV PATH="/home/airflow/.local/bin:${PATH}"

# Install Python dependencies (MariaDB driver)
RUN pip install --no-cache-dir mariadb

# Install the custom provider without upgrading core Airflow packages
COPY airflow-mariadb-provider /opt/airflow/airflow-mariadb-provider
RUN pip install --no-deps /opt/airflow/airflow-mariadb-provider
```

SSH is used for secure file transfers (e.g., CSV → cpimport).
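To illustrate the idea, here is a minimal sketch of an SFTP upload followed by a remote cpimport call, using paramiko directly. The host, credentials, database, and paths are placeholder assumptions; the actual DAG uses the provider's operators:

```python
import paramiko

# Placeholder connection details for the ColumnStore container
# (port 2222 matches the -p 2222:22 mapping above)
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("localhost", port=2222, username="user", password="password")

# Upload the CSV, then bulk-load it with cpimport
sftp = client.open_sftp()
sftp.put("airports.csv", "/tmp/airports.csv")
sftp.close()

stdin, stdout, stderr = client.exec_command(
    "cpimport openflights airports /tmp/airports.csv"
)
print(stdout.read().decode())
client.close()
```

The commands below enable SSH inside the ColumnStore container: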
```bash
docker exec -it mcs1 bash
ssh-keygen -A
/usr/sbin/sshd -D &
exit
```

Then restart the Airflow webserver:

```bash
docker restart airflow-docker-airflow-webserver-1
```

Once Airflow is running, configure these connections in the Airflow UI (localhost:8080 → Admin → Connections):
| Connection ID | Type | Description |
|---|---|---|
| `maria_db_default` | MariaDB | Host: hostname, Port: 3306, User: user, Password: password |
| `mariadb_ssh_connection` | SSH | Host: hostname, Port: 22, Username: user, Password: password |
| `aws_default` (optional) | S3 | For S3 data transfer workflows |
Below is the overall structure of the project:
```
mariadb-airflow-hackathon/
│
├── readme.MD                        # Project documentation
├── docker-compose.yml               # Docker Compose setup for Airflow + MariaDB
├── Dockerfile                       # Custom Airflow image with MariaDB connector
├── requirements.txt                 # Python dependencies
├── mysql_vs_mariadb.py              # Performance benchmarking script
├── images/                          # Screenshots, performance charts
│   └── comparison_between_mariadb_mysql.png
├── dags/                            # Airflow DAGs
│   ├── config/
│   │   └── datasets.json            # Dataset configuration
│   ├── openflights_dag.py           # Main OpenFlights data ingestion DAG
│   └── mariadb_s3_operators_dag.py  # S3 integration DAG
└── airflow-mariadb-provider/        # Apache Airflow MariaDB Provider
    ├── pyproject.toml               # Modern Python packaging configuration
    ├── README.rst                   # Provider documentation
    ├── docs/                        # Sphinx documentation
    │   ├── index.rst
    │   ├── changelog.rst
    │   ├── commits.rst
    │   ├── operators.rst
    │   ├── security.rst
    │   ├── connections/
    │   │   └── mariadb.rst
    │   ├── example_dags/
    │   │   ├── example_mariadb_basic.py
    │   │   ├── example_mariadb_cpimport.py
    │   │   └── example_mariadb_s3.py
    │   └── installing-providers-from-sources.rst
    │
    ├── src/airflow/providers/mariadb/   # Main provider code
    │   ├── __init__.py              # Provider metadata
    │   ├── get_provider_info.py     # Provider information
    │   ├── hooks/
    │   │   ├── __init__.py
    │   │   └── mariadb.py           # MariaDB hook implementation
    │   ├── operators/
    │   │   ├── __init__.py
    │   │   ├── mariadb.py           # Basic MariaDB operator
    │   │   ├── cpimport.py          # cpimport operator
    │   │   └── s3.py                # S3 integration operators
    │   ├── sensors/
    │   │   └── __init__.py
    │   └── transfers/
    │       └── __init__.py
    │
    └── tests/                       # Comprehensive test suite
        ├── conftest.py              # Test configuration
        ├── unit/mariadb/            # Unit tests
        │   ├── hooks/
        │   │   └── test_mariadb.py
        │   └── __init__.py
        └── system/mariadb/          # System tests
            ├── example_mariadb.py
            └── __init__.py
```
Start the Airflow UI → http://localhost:8080

Trigger the DAG: OpenFlights Data Ingestion

Watch:
- 🗂️ SSH file upload logs
- ⚙️ cpimport execution
- 🔍 Data validation queries inside MariaDB (sketched below)
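A validation task might look like the following, using the MariaDBOperator from earlier (the query and task name are illustrative):

```python
# Sanity-check the loaded row count after cpimport finishes
validate_airports = MariaDBOperator(
    task_id="validate_airports",
    mariadb_conn_id="maria_db_default",
    sql="SELECT COUNT(*) FROM airports;",
)
```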
Through this project, we successfully:
- ✅ Built the first Airflow-MariaDB native connector
- ✅ Integrated MariaDB ColumnStore for parallel ETL
- ✅ Learned Airflow provider development and Docker networking
- ✅ Explored secure SSH integration for data transfer
- ✅ Benchmarked MariaDB vs MySQL connector performance

💬 "This hackathon gave us deep insights into how Airflow orchestrates ETL pipelines and how MariaDB's performance capabilities can be unlocked with the right integration."
The Airflow MariaDB Connector bridges a major integration gap in modern data engineering.
It enables:
- ⚡ Direct and optimized Airflow-MariaDB communication
- 🚀 High-speed ETL via cpimport
- 📊 Scalable analytics with ColumnStore
With this, Ctrl_Alt_db has taken the first step toward empowering the Airflow community with a truly MariaDB-native data orchestration solution.
