Sling gets stuck when using DuckDB to write Parquet files #668

@noat28

Description

Sling version:

sling==1.4.23

What is the Operating System?

Linux

Do you have a CLI Pro/Platform subscription?

No

Description of the issue

Running sling in batches (Parquet format) gets stuck for no apparent reason; I hit a timeout.

[screenshot attached]

I wrote my own script that loads the Parquet files from the batch that failed, and it worked smoothly.

Replication Configuration

import sling

NUM_BATCHES = 10  # these tables are too large to fit in a single batch

for i in range(NUM_BATCHES):
    defaults = {
        'mode': 'incremental',
        'object': 'slow_features.noa_orders_info_sync',  # where to write to
        'update_key': 'shopper_id',  # not used - we do manual batching logic
        'sql': f"SELECT * FROM {{stream_full_name}} WHERE MOD(ABS(FNV_HASH(shopper_id)), {NUM_BATCHES}) = {i} --{{incremental_value}}",
        'source_options': {
            # Parquet format fixes bugs we had with exporting large CSVs,
            # but it uses DuckDB to read the parquets, which requires a stronger machine
            'format': 'parquet',
        },
    }

    streams = {
        'analytics_v2.shopper_orders_info': sling.ReplicationStream(**defaults),
    }
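The manual batching idea above can be sketched in plain Python (an illustrative FNV-1a 64-bit port; Redshift's FNV_HASH may differ in variant or seed, but the partitioning property is the same): hashing the key and taking it modulo NUM_BATCHES assigns every row to exactly one batch, so the union of all batch queries covers the table with no overlap.

```python
NUM_BATCHES = 10

def fnv1a_64(s: str) -> int:
    """Illustrative 64-bit FNV-1a hash over the key's bytes."""
    h = 0xcbf29ce484222325
    for b in s.encode():
        h ^= b
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

shoppers = [f"shopper-{i}" for i in range(1000)]
batches = [
    [s for s in shoppers if fnv1a_64(s) % NUM_BATCHES == i]
    for i in range(NUM_BATCHES)
]

# every shopper lands in exactly one batch: disjoint and exhaustive
assert sum(len(b) for b in batches) == len(shoppers)
```

This is why the `update_key` is effectively unused: the WHERE clause alone determines batch membership.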

Log Output

2025-10-19 21:29:49 DBG select * from read_parquet(['sling_temp/stream/1760909308449.parquet/u01-0003_part_00.parquet', 'sling_temp/stream/1760909308449.parquet/u01-0000_part_00.parquet', 'sling_temp/stream/1760909308449.parquet/u01-0001_part_00.parquet', 'sling_temp/stream/1760909308449.parquet/u01-0002_part_00.parquet'])
2025-10-19 21:29:49 DBG applying column casing (normalize) for target type ([REDACTED])
2025-10-19 21:29:49 DBG casting column 'shopper_id' as 'text'
2025-10-19 21:29:49 DBG casting column 'store_id' as 'text'
2025-10-19 21:29:49 DBG casting column 'store_name' as 'text'
2025-10-19 21:29:49 DBG casting column 'orders_info' as 'text'
2025-10-19 21:29:49 INF writing to target database [mode: incremental]
2025-10-19 21:29:49 INF streaming data (direct insert)
2025-10-19 21:30:43 INF inserted 3527602 rows into "slow_features"."noa_orders_info_sync" in 134 secs [26,138 r/s] [1.4 GB]
2025-10-19 21:30:43 DBG closed "redshift" connection (conn-redshift-8bX)
2025-10-19 21:30:43 DBG closed "[REDACTED]" connection (conn-[REDACTED]-wis)
2025-10-19 21:30:43 INF execution succeeded

FYI there is a new sling version released (1.4.24). Please run pip install -U sling.
running batch number: 7 out of 10
2025-10-19 21:30:43 INF Sling CLI | https://slingdata.io/
2025-10-19 21:30:43 INF Sling Replication | SLING_RS -> SLING_PG | analytics_v2.shopper_orders_info
2025-10-19 21:30:43 DBG Sling version: 1.4.23 (linux amd64)
2025-10-19 21:30:43 DBG type is db-db
2025-10-19 21:30:43 DBG using: {"columns":{},"mode":"incremental","select":[],"transforms":null}
2025-10-19 21:30:43 DBG using source options: {"empty_as_null":false,"format":"parquet","datetime_format":"AUTO","max_decimals":-1}
2025-10-19 21:30:43 DBG using target options: {"datetime_format":"auto","file_max_rows":0,"max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"normalize"}
2025-10-19 21:30:43 INF connecting to source database (redshift)
2025-10-19 21:30:43 DBG opened "redshift" connection (conn-redshift-7ES)
2025-10-19 21:30:43 INF connecting to target database ([REDACTED])
2025-10-19 21:30:43 DBG opened "[REDACTED]" connection (conn-[REDACTED]-GV0)
2025-10-19 21:30:43 INF getting checkpoint value (shopper_id)
2025-10-19 21:30:43 DBG select max("shopper_id") as max_val from "slow_features"."noa_orders_info_sync"
2025-10-19 21:30:43 INF reading from source database
2025-10-19 21:30:43 INF unloading from redshift to s3
2025-10-19 21:30:43 DBG opened "s3" connection (conn-s3-0pA)
2025-10-19 21:30:43 DBG unload ('SELECT * FROM "analytics_v2"."shopper_orders_info" WHERE MOD(ABS(FNV_HASH(shopper_id)), 10) = 6 --null')
to 's3://[REDACTED]/sling_temp/stream/1760909443793.parquet/u01-'
credentials 'aws_access_key_id=;aws_secret_access_key='
allowoverwrite PARQUET PARALLEL
2025-10-19 21:31:31 DBG Unloaded to s3://[REDACTED]/sling_temp/stream/1760909443793.parquet
2025-10-19 21:31:31 DBG opened "s3" connection (conn-s3-s8L)
2025-10-19 21:31:31 DBG copying 4 files from remote to /tmp/duck.temp.1760909491369.i4E for local processing
2025-10-19 21:31:32 DBG opened "file" connection (conn-file-oSu)
2025-10-19 21:31:32 DBG reading datastream from s3://[REDACTED]/sling_temp/stream/1760909443793.parquet [format=parquet, nodes=4]
2025-10-19 21:31:32 DBG select * from read_parquet(['sling_temp/stream/1760909443793.parquet/u01-0003_part_00.parquet', 'sling_temp/stream/1760909443793.parquet/u01-0000_part_00.parquet', 'sling_temp/stream/1760909443793.parquet/u01-0001_part_00.parquet', 'sling_temp/stream/1760909443793.parquet/u01-0002_part_00.parquet'])
2025-10-19 21:31:33 DBG applying column casing (normalize) for target type ([REDACTED])
2025-10-19 21:31:33 DBG casting column 'shopper_id' as 'text'
2025-10-19 21:31:33 DBG casting column 'store_id' as 'text'
2025-10-19 21:31:33 DBG casting column 'store_name' as 'text'
2025-10-19 21:31:33 DBG casting column 'orders_info' as 'text'
2025-10-19 21:31:33 INF writing to target database [mode: incremental]
2025-10-19 21:31:33 INF streaming data (direct insert)

The log above covers the last two runs.

Up until the last batch, everything worked smoothly. There are 4 Parquet files, each around 120 MB.
