Sling Gets stuck when using DuckDB to write Parquet Files #668
Description
Sling version:
sling==1.4.23
What is the Operating System?
Linux
Do you have a CLI Pro/Platform subscription?
No
Description of the issue
Running sling in batches (Parquet format) gets stuck for no apparent reason; the run eventually times out.
I tried writing my own script that loads the Parquet files from a batch that failed, and it worked smoothly.
Replication Configuration
```python
import sling

NUM_BATCHES = 10

# These tables are too large to fit in a single batch
for i in range(NUM_BATCHES):
    defaults = {
        'mode': 'incremental',
        'object': 'slow_features.noa_orders_info_sync',  # where to write to
        'update_key': 'shopper_id',  # not used -- we do manual batching logic
        'sql': (
            f"SELECT * FROM {{stream_full_name}} "
            f"WHERE MOD(ABS(FNV_HASH(shopper_id)), {NUM_BATCHES}) = {i} "
            f"--{{incremental_value}}"
        ),
        'source_options': {
            # Parquet format fixes bugs we had with exporting large CSVs,
            # but it uses DuckDB to read the parquets, which requires a
            # stronger machine
            'format': 'parquet',
        },
    }
    streams = {
        'analytics_v2.shopper_orders_info': sling.ReplicationStream(**defaults),
    }
```
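For context, the batching predicate is meant to partition rows deterministically: `FNV_HASH` returns a signed 64-bit integer, so `ABS` plus `MOD` maps every `shopper_id` into exactly one of the `NUM_BATCHES` buckets. Below is a minimal pure-Python sketch of the same idea, using a 64-bit FNV-1a implementation as a stand-in for Redshift's `FNV_HASH` (the exact hash values will differ from Redshift's, but the bucketing behavior is the same):

```python
NUM_BATCHES = 10

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash (a stand-in for Redshift's FNV_HASH)."""
    h = 0xcbf29ce484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def batch_of(shopper_id: str, num_batches: int = NUM_BATCHES) -> int:
    # Interpret the unsigned hash as a signed 64-bit value, then mirror
    # the SQL predicate MOD(ABS(FNV_HASH(shopper_id)), num_batches).
    h = fnv1a_64(shopper_id.encode())
    signed = h - (1 << 64) if h >= (1 << 63) else h
    return abs(signed) % num_batches

# Every id lands in exactly one bucket, deterministically
ids = [f"shopper-{n}" for n in range(1000)]
batches = [batch_of(i) for i in ids]
assert all(0 <= b < NUM_BATCHES for b in batches)
assert batch_of("shopper-1") == batch_of("shopper-1")
```

Since the buckets are disjoint and cover every row, the ten batch runs together replicate the whole table exactly once.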
Log Output
2025-10-19 21:29:49 DBG select * from read_parquet(['sling_temp/stream/1760909308449.parquet/u01-0003_part_00.parquet', 'sling_temp/stream/1760909308449.parquet/u01-0000_part_00.parquet', 'sling_temp/stream/1760909308449.parquet/u01-0001_part_00.parquet', 'sling_temp/stream/1760909308449.parquet/u01-0002_part_00.parquet'])
2025-10-19 21:29:49 DBG applying column casing (normalize) for target type ([REDACTED])
2025-10-19 21:29:49 DBG casting column 'shopper_id' as 'text'
2025-10-19 21:29:49 DBG casting column 'store_id' as 'text'
2025-10-19 21:29:49 DBG casting column 'store_name' as 'text'
2025-10-19 21:29:49 DBG casting column 'orders_info' as 'text'
2025-10-19 21:29:49 INF writing to target database [mode: incremental]
2025-10-19 21:29:49 INF streaming data (direct insert)
2025-10-19 21:30:43 INF inserted 3527602 rows into "slow_features"."noa_orders_info_sync" in 134 secs [26,138 r/s] [1.4 GB]
2025-10-19 21:30:43 DBG closed "redshift" connection (conn-redshift-8bX)
2025-10-19 21:30:43 DBG closed "[REDACTED]" connection (conn-[REDACTED]-wis)
2025-10-19 21:30:43 INF execution succeeded
FYI there is a new sling version released (1.4.24). Please run pip install -U sling.
running batch number: 7 out of 10
2025-10-19 21:30:43 INF Sling CLI | https://slingdata.io/
2025-10-19 21:30:43 INF Sling Replication | SLING_RS -> SLING_PG | analytics_v2.shopper_orders_info
2025-10-19 21:30:43 DBG Sling version: 1.4.23 (linux amd64)
2025-10-19 21:30:43 DBG type is db-db
2025-10-19 21:30:43 DBG using: {"columns":{},"mode":"incremental","select":[],"transforms":null}
2025-10-19 21:30:43 DBG using source options: {"empty_as_null":false,"format":"parquet","datetime_format":"AUTO","max_decimals":-1}
2025-10-19 21:30:43 DBG using target options: {"datetime_format":"auto","file_max_rows":0,"max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"normalize"}
2025-10-19 21:30:43 INF connecting to source database (redshift)
2025-10-19 21:30:43 DBG opened "redshift" connection (conn-redshift-7ES)
2025-10-19 21:30:43 INF connecting to target database ([REDACTED])
2025-10-19 21:30:43 DBG opened "[REDACTED]" connection (conn-[REDACTED]-GV0)
2025-10-19 21:30:43 INF getting checkpoint value (shopper_id)
2025-10-19 21:30:43 DBG select max("shopper_id") as max_val from "slow_features"."noa_orders_info_sync"
2025-10-19 21:30:43 INF reading from source database
2025-10-19 21:30:43 INF unloading from redshift to s3
2025-10-19 21:30:43 DBG opened "s3" connection (conn-s3-0pA)
2025-10-19 21:30:43 DBG unload ('SELECT * FROM "analytics_v2"."shopper_orders_info" WHERE MOD(ABS(FNV_HASH(shopper_id)), 10) = 6 --null')
to 's3://[REDACTED]/sling_temp/stream/1760909443793.parquet/u01-'
credentials 'aws_access_key_id=;aws_secret_access_key='
allowoverwrite PARQUET PARALLEL
2025-10-19 21:31:31 DBG Unloaded to s3://[REDACTED]/sling_temp/stream/1760909443793.parquet
2025-10-19 21:31:31 DBG opened "s3" connection (conn-s3-s8L)
2025-10-19 21:31:31 DBG copying 4 files from remote to /tmp/duck.temp.1760909491369.i4E for local processing
2025-10-19 21:31:32 DBG opened "file" connection (conn-file-oSu)
2025-10-19 21:31:32 DBG reading datastream from s3://[REDACTED]/sling_temp/stream/1760909443793.parquet [format=parquet, nodes=4]
2025-10-19 21:31:32 DBG select * from read_parquet(['sling_temp/stream/1760909443793.parquet/u01-0003_part_00.parquet', 'sling_temp/stream/1760909443793.parquet/u01-0000_part_00.parquet', 'sling_temp/stream/1760909443793.parquet/u01-0001_part_00.parquet', 'sling_temp/stream/1760909443793.parquet/u01-0002_part_00.parquet'])
2025-10-19 21:31:33 DBG applying column casing (normalize) for target type ([REDACTED])
2025-10-19 21:31:33 DBG casting column 'shopper_id' as 'text'
2025-10-19 21:31:33 DBG casting column 'store_id' as 'text'
2025-10-19 21:31:33 DBG casting column 'store_name' as 'text'
2025-10-19 21:31:33 DBG casting column 'orders_info' as 'text'
2025-10-19 21:31:33 INF writing to target database [mode: incremental]
2025-10-19 21:31:33 INF streaming data (direct insert)
The log above shows the last two runs. Up until the last batch, everything worked smoothly. There are 4 Parquet files, each about 120 MB in size.