Stress-test dlt's Arrow route on the smallest Azure Container Apps footprint:
- source: SNOWFLAKE_SAMPLE_DATA.TPCH_SF{1,10,100,1000}.LINEITEM
- destination: dummy.TPCH_SF{1,10,100,1000}.LINEITEM
- runtime: Azure Container Apps Consumption at 0.25 vCPU / 0.5 GiB
- packaging: uv
- secrets: Azure Key Vault
The app extracts Snowflake result sets with Arrow batches, yields pyarrow.Table objects into dlt, and logs memory, disk, batch, and throughput metrics during the run.
```mermaid
flowchart LR
    S["Source Snowflake\nSNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM"] --> Q["Ordered page query\nLIMIT 50000"]
    Q --> X["Snowflake connector\nfetch_arrow_all()"]
    K["Azure Key Vault\nPATs + run key"] --> A["Azure Container App\nbenchmark runner"]
    X --> A
    A --> P["Arrow page\npyarrow.Table"]
    P --> D["dlt resource"]
    D --> L["dlt load package\nParquet payload + pipeline state"]
    L --> T["Target Snowflake\ndummy.TPCH_SF1.LINEITEM"]
    A --> M["Telemetry sampler\nmemory, disk, batch, timing"]
    M --> Z["Azure Monitor\nlogs + metrics"]
```
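The telemetry sampler in the diagram can be sketched with stdlib calls alone. This is a minimal sketch of that kind of sampler; the cgroup path, metric names, and work directory are assumptions, not the app's actual implementation.

```python
# Minimal telemetry sampler sketch: cgroup memory, process RSS, work disk.
# Paths and metric names are illustrative, not the benchmark's own code.
import os
import shutil
import time

def sample_telemetry(work_dir: str = "/tmp") -> dict:
    metrics = {"ts": time.time()}
    # cgroup v2 exposes the container's current memory usage here
    cgroup_mem = "/sys/fs/cgroup/memory.current"
    if os.path.exists(cgroup_mem):
        with open(cgroup_mem) as f:
            metrics["cgroup_memory_bytes"] = int(f.read().strip())
    # RSS of this process from /proc (Linux); None on other platforms
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        metrics["rss_bytes"] = resident_pages * os.sysconf("SC_PAGE_SIZE")
    except (FileNotFoundError, ValueError):
        metrics["rss_bytes"] = None
    # disk used on the temp/work volume that holds dlt load packages
    metrics["work_disk_used_bytes"] = shutil.disk_usage(work_dir).used
    return metrics

snapshot = sample_telemetry()
```

Sampling on a timer alongside the pipeline run is enough to produce the per-stage memory and disk figures reported below.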
Install:

```sh
brew install azure-cli
brew install azure/azd/azd
brew install snowflake-cli
brew install uv
```

Authenticate once:

```sh
az login
azd auth login
snow connection add mpz
snow connection test -c mpz
```

The Snowflake CLI connection must be named mpz. The bootstrap script uses that name by default.
Run the bootstrap script. It creates the dedicated benchmark identity AZAPP, grants AZAPP_ROLE, prepares the dummy database, and creates a PAT:

```sh
./scripts/bootstrap_snowflake.sh
```

The script prints a token_secret. Keep it outside the repo; you will use it for the azd environment values.
```sh
azd env new swc
azd env set AZURE_SUBSCRIPTION_ID "<your-subscription-id>"
azd env set AZURE_LOCATION swedencentral
azd env set AZURE_RESOURCE_GROUP rg-dlthubarrow-swc
azd env set SOURCE_SNOWFLAKE_ACCOUNT "<your-snowflake-account>"
azd env set SOURCE_SNOWFLAKE_USER AZAPP
azd env set SOURCE_SNOWFLAKE_TOKEN "<azapp-pat>"
azd env set SOURCE_SNOWFLAKE_WAREHOUSE COMPUTE_WH
azd env set SOURCE_SNOWFLAKE_ROLE AZAPP_ROLE
azd env set SOURCE_SNOWFLAKE_DATABASE SNOWFLAKE_SAMPLE_DATA
azd env set DESTINATION_SNOWFLAKE_ACCOUNT "<your-snowflake-account>"
azd env set DESTINATION_SNOWFLAKE_USER AZAPP
azd env set DESTINATION_SNOWFLAKE_TOKEN "<azapp-pat>"
azd env set DESTINATION_SNOWFLAKE_WAREHOUSE COMPUTE_WH
azd env set DESTINATION_SNOWFLAKE_ROLE AZAPP_ROLE
azd env set DESTINATION_SNOWFLAKE_DATABASE dummy
```

Then deploy:

```sh
azd up --no-prompt
```

This provisions:
- Azure Container Registry
- Azure Key Vault
- Log Analytics
- Application Insights
- Azure Container Apps environment
- Azure Container App with a user-assigned managed identity
Run a smoke test:

```sh
./scripts/run_smoke_test.sh
```

Get the endpoint:

```sh
az containerapp show \
  --resource-group rg-dlthubarrow-swc \
  --name "$(azd env get-value CONTAINER_APP_NAME)" \
  --query properties.configuration.ingress.fqdn \
  --output tsv
```

Check the latest run state:

```sh
curl "https://$(az containerapp show \
  --resource-group rg-dlthubarrow-swc \
  --name "$(azd env get-value CONTAINER_APP_NAME)" \
  --query properties.configuration.ingress.fqdn \
  --output tsv)/latest"
```

Tail logs:

```sh
az containerapp logs show \
  --resource-group rg-dlthubarrow-swc \
  --name "$(azd env get-value CONTAINER_APP_NAME)" \
  --follow
```

The benchmark does not yield Python dict rows into dlt.
- Snowflake extraction uses keyset-paged SQL queries
- each page is fetched from Snowflake as Arrow with `fetch_arrow_all()`
- pages are normalized into `pyarrow.Table`
- the dlt resource yields those Arrow tables directly
- dlt loads with `loader_file_format="parquet"`
That matches the Arrow route described in the dlt article, with Arrow entering from the Snowflake connector rather than from dlt's SQL backend abstraction.
Successful Azure run on the smallest Container Apps size:
| Metric | Value |
|---|---|
| Compute | 0.25 vCPU / 0.5 GiB |
| Extracted rows | 6,001,215 |
| Loaded rows | 6,001,215 |
| Arrow pages / batches | 121 |
| dlt load packages | 1 |
| dlt jobs in package | 2 |
| Stage duration | 211.24s |
| Throughput | 28,410 rows/s |
| Source bytes | 165,228,544 |
| Effective throughput | 0.746 MB/s |
| Peak in-app cgroup memory | 536,752,128 bytes |
| Memory at stage completion | 282,013,696 bytes |
| RSS at stage completion | 337,235,968 bytes |
| Temp/work disk at stage completion | 241,966,510 bytes |
| Replica restarts during TPCH_SF1 | 0 |
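The derived values in the table follow from the raw ones. A quick check (small differences come from the duration being reported to two decimals):

```python
# Re-derive throughput figures from the table's raw values.
rows = 6_001_215
source_bytes = 165_228_544
duration_s = 211.24

rows_per_s = rows / duration_s                      # ~28,409 rows/s (table: 28,410)
mb_per_s = source_bytes / (1024 ** 2) / duration_s  # ~0.746 MB/s
```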
Azure Monitor view for the same window:
- max `MemoryPercentage`: 99%
- max `WorkingSetBytes`: 318,771,200
- max `CpuPercentage`: 62%
Interpretation: TPCH_SF1 completes on the smallest Container Apps footprint, but memory is still the limiting resource and the extraction phase runs very close to the container memory cap.
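The "close to the cap" claim is easy to verify from the table: 0.5 GiB is 536,870,912 bytes, and the reported peak cgroup usage leaves only about 116 KiB of headroom.

```python
# Headroom at peak, from the reported peak cgroup memory figure.
cap_bytes = int(0.5 * 1024 ** 3)   # 0.5 GiB container limit
peak_bytes = 536_752_128           # peak in-app cgroup memory from the table

headroom = cap_bytes - peak_bytes  # bytes of slack at peak
pct = 100 * peak_bytes / cap_bytes # consistent with the 99% MemoryPercentage max
```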
Install and run locally:

```sh
uv lock --python 3.12
uv sync --python 3.12 --dev
uv run python -m benchmark_runner
```

The app expects these environment variables when running locally:
- `RUN_API_KEY`
- `SOURCE_SNOWFLAKE_ACCOUNT`
- `SOURCE_SNOWFLAKE_USER`
- `SOURCE_SNOWFLAKE_PASSWORD`
- `DESTINATION_SNOWFLAKE_ACCOUNT`
- `DESTINATION_SNOWFLAKE_USER`
- `DESTINATION_SNOWFLAKE_PASSWORD`
Optional values:
- `SOURCE_SNOWFLAKE_WAREHOUSE`
- `SOURCE_SNOWFLAKE_ROLE`
- `SOURCE_SNOWFLAKE_DATABASE`
- `DESTINATION_SNOWFLAKE_WAREHOUSE`
- `DESTINATION_SNOWFLAKE_ROLE`
- `DESTINATION_SNOWFLAKE_DATABASE`
- `BENCHMARK_DATASETS`
- `BENCHMARK_SOURCE_TABLE`
- `BENCHMARK_SOURCE_CHUNK_ROWS`
- `BENCHMARK_WORK_ROOT`
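One plausible way to read the optional values with fallbacks; this is a sketch, and the defaults shown are illustrative, not the app's actual defaults.

```python
# Read optional settings from the environment with illustrative defaults.
import os

def setting(name: str, default: str) -> str:
    return os.environ.get(name, default)

datasets = setting("BENCHMARK_DATASETS", "TPCH_SF1").split(",")
chunk_rows = int(setting("BENCHMARK_SOURCE_CHUNK_ROWS", "50000"))
```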
Endpoints:

- `GET /healthz`
- `GET /latest`
- `POST /run`
POST /run requires the X-Run-Key header.
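A minimal way to build such a request from Python's standard library; the FQDN and key are placeholders, and the app's request body schema is not shown here.

```python
# Build (but do not send) a POST /run request carrying the X-Run-Key header.
import urllib.request

def build_run_request(fqdn: str, run_key: str) -> urllib.request.Request:
    return urllib.request.Request(
        f"https://{fqdn}/run",
        method="POST",
        headers={"X-Run-Key": run_key},
    )

req = build_run_request("<app-fqdn>", "<run-key>")
# urllib.request.urlopen(req) would trigger the run; omitted here.
```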
- Runtime secrets are stored in Azure Key Vault.
- `azd env` writes local values into `.azure/<env>/.env`, which is ignored by `.azure/.gitignore`.
- No PATs or passwords are committed in tracked files.