Stack overflow with pathological cases of WriterProperties::max_row_group_count and WriterProperties::max_row_group_bytes #9386

Description

@alamb

Since this recurses, it could potentially blow out the stack with pathological inputs (e.g. a RecordBatch with 1M rows and a max_row_group_count of 1). I don't think it is necessary to fix now; I just wanted to point it out.

Originally posted by @alamb in #9357 (comment)
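To make the failure mode concrete, here is a minimal, self-contained sketch of the recursive-splitting pattern being described. This is a hypothetical stand-in, not the actual `ArrowWriter` internals: the point is only that each row-group split consumes one stack frame, so the recursion depth grows as `rows / max_row_group_count`.

```rust
// Hypothetical stand-in for the writer's splitting logic; the real
// ArrowWriter code differs. This only illustrates the stack growth.
fn split_recursive(rows: usize, max_rows: usize, depth: usize) -> usize {
    if rows == 0 {
        return depth; // final recursion depth reached
    }
    let take = rows.min(max_rows);
    // Each row-group split adds a stack frame. Rust does not guarantee
    // tail-call elimination, so 1M rows with max_rows = 1 would need
    // ~1M frames, far beyond a default thread stack.
    split_recursive(rows - take, max_rows, depth + 1)
}

fn main() {
    // Small inputs are fine: 10 rows, 1 row per group => depth 10.
    assert_eq!(split_recursive(10, 1, 0), 10);
    // Depth scales linearly with rows / max_rows: 1_000 rows at 7 rows
    // per group needs 143 splits (142 full groups plus a remainder).
    assert_eq!(split_recursive(1_000, 7, 0), 143);
    println!("depth for 1_000 rows, max 7: {}", split_recursive(1_000, 7, 0));
}
```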

Here is a reproducer (add to arrow/arrow_writer/mod.rs) that fails: the process aborts due to a stack overflow.

    #[test]
    fn test_row_group_limit_rows_only_pathological_stack_overflow_demo() {
        let schema = Arc::new(Schema::new(vec![Field::new(
            "int",
            ArrowDataType::Int32,
            false,
        )]));
        let array = Int32Array::from((0..1_000_000_i32).collect::<Vec<_>>());
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)]).unwrap();

        let props = WriterProperties::builder()
            .set_max_row_group_row_count(Some(1))
            .set_max_row_group_bytes(None)
            .build();

        let file = tempfile::tempfile().unwrap();
        let mut writer = ArrowWriter::try_new(file, schema, Some(props)).unwrap();

        // This currently recurses once per row-group split and can overflow the stack.
        writer.write(&batch).unwrap();
    }

The expected behavior is either an error or, ideally, a successful write of the file.
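One plausible direction for a fix (an assumption on my part, not a committed design) is to replace the per-split recursion with a loop, which keeps stack depth constant no matter how many row groups a single `write` call produces:

```rust
// Hypothetical iterative rewrite of the splitting step; names are
// illustrative and do not correspond to the actual ArrowWriter API.
fn split_iterative(mut rows: usize, max_rows: usize) -> Vec<usize> {
    let mut groups = Vec::new();
    while rows > 0 {
        let take = rows.min(max_rows);
        groups.push(take); // flush one row group of `take` rows
        rows -= take;
    }
    groups
}

fn main() {
    // The pathological case from the reproducer: 1M rows with a limit of
    // 1 row per group. The loop uses constant stack, so no overflow.
    let groups = split_iterative(1_000_000, 1);
    assert_eq!(groups.len(), 1_000_000);
    assert!(groups.iter().all(|&g| g == 1));
    println!("row groups: {}", groups.len());
}
```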

Labels: parquet (Changes to the parquet crate)
