feat: add max_row_group_bytes option to WriterProperties #9357
alamb merged 15 commits into apache:main
Conversation
parquet/src/file/properties.rs
Outdated
/// Sets maximum size of a row group in bytes, or `None` for unlimited.
///
/// Row groups are flushed when their estimated encoded size exceeds this threshold.
/// This is similar to parquet-mr's `parquet.block.size` behavior.
parquet-mr is just the official Java implementation of Parquet; you can rewrite the comment to clarify that this matches the official Parquet Java implementation.
Also, parquet-mr is I think now officially called "parquet-java" https://github.com/apache/parquet-java
@@ -575,7 +595,34 @@ impl WriterPropertiesBuilder {
/// If the value is set to 0.
pub fn set_max_row_group_size(mut self, value: usize) -> Self {
Should we deprecate this function?
Wait wait, not so fast - this is a breaking change, as clippy will fail for users. I was asking; it might be in a different PR, but I'm open for discussion. If you keep it, please update the PR description under changes to users.
I do agree that this API should be deprecated. Thanks for pointing it out!
this is a breaking change
This is inherently not a breaking change; the purpose of marking APIs as deprecated is to warn users before making a breaking change, without actually making this change.
This PR already calls for a minor bump due to the new APIs introduced; deprecating the old one does not change the version semantics for this PR.
as clippy will fail for users
It's a rustc warning:
The deprecated attribute marks an item as deprecated. rustc will issue warnings on usage of #[deprecated] items
So unless users add -D warnings, compilation won't break.
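To illustrate the point (a minimal sketch; the function here is a hypothetical stand-in for the builder method, not the actual parquet-rs API), `#[deprecated]` only produces a compiler warning:

```rust
// Sketch only: a free function standing in for the method being deprecated.
// `#[deprecated]` items still compile and run.
#[deprecated(since = "58.0.0", note = "use `set_max_row_group_row_count` instead")]
fn set_max_row_group_size(value: usize) -> usize {
    value
}

fn main() {
    // rustc emits a deprecation warning here (suppressed with `allow`),
    // but compilation succeeds unless warnings are promoted to errors.
    #[allow(deprecated)]
    let v = set_max_row_group_size(1024);
    println!("{v}");
}
```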
it might be in a different pr
I don't mind either way: Leaving it here or opening a new PR for deprecating the old API. LMK what's your preference and I'll do it.
if you keep it please update the pr description under changes to users
Done!
Wait wait, not so fast - this is a breaking change, as clippy will fail for users. I was asking; it might be in a different PR, but I'm open for discussion. If you keep it, please update the PR description under changes to users.

Yeah, in general I agree with @yonipeleg33 -- I don't think a clippy failure is a breaking change per se -- the Rust compiler will be happy to compile it. If downstream projects want to take a more strict "clippy must pass" stance, I don't think that is technically an API breakage.
}

/// Helper to create a test batch with the given number of rows.
/// Each row is approximately 4 bytes (one i32).
- /// Each row is approximately 4 bytes (one i32).
+ /// Each row is 4 bytes (one `i32`).
    ArrowDataType::Int32,
    false,
)]));
let array = Int32Array::from((0..num_rows as i32).collect::<Vec<_>>());
- let array = Int32Array::from((0..num_rows as i32).collect::<Vec<_>>());
+ let array = Int32Array::from_iter(0..num_rows as i32);


#[test]
fn test_row_group_limit_rows_only() {
    // When only max_row_group_size is set, respect the row limit
the comment is not on the correct line


#[test]
fn test_row_group_limit_none_writes_single_row_group() {
    // When both limits are None, all data should go into a single row group
the comment is not on the correct line
Done (moved to above the test function)
    false,
)]));


// Set byte limit to approximately fit ~30 rows worth of data (~100 bytes each)
the comment is not on the correct line (it should be on the Some(3500) one)


#[test]
fn test_row_group_limit_both_row_wins() {
    // When both limits are set, the row limit triggers first
the comment is not on the correct line
parquet/src/file/properties.rs
Outdated
pub const DEFAULT_WRITE_PAGE_HEADER_STATISTICS: bool = false;
/// Default value for [`WriterProperties::max_row_group_size`]
pub const DEFAULT_MAX_ROW_GROUP_SIZE: usize = 1024 * 1024;
/// Default value for [`WriterProperties::max_row_group_bytes`] (128 MB, same as parquet-mr's parquet.block.size)
same as my other parquet-mr comment
parquet/src/file/properties.rs
Outdated
if let Some(v) = value {
    assert!(v > 0, "Cannot have a 0 max row group bytes");
}
- if let Some(v) = value {
-     assert!(v > 0, "Cannot have a 0 max row group bytes");
- }
+ assert_ne!(value, Some(0), "Cannot have a 0 max row group bytes");
}


#[test]
fn test_row_group_limit_both_row_wins() {
Can you add a similar test when bytes win with the same structure as this, i.e. writing single large batch, but only changing the config (same test with only conf change)
No can do; the first batch is always written as a whole, because we need some statistics in order to calculate the average row size. This is also noted in the PR description:
This means that the first batch will always be written as a whole (unless row count limit is also set).
You don't need statistics; you can calculate it from the data types you need to encode.
AFAICT, that defeats the purpose of this configuration: its purpose is to control the IO profile of the writer (i.e. how much and when it writes to disk), and for that, the data needs to at least be already encoded before calculating the row group size.
This is also backed by the Java source code:
it calculates memSize using columnStore.getBufferedSize(), which is documented as follows:
@return approximate size of the buffered encoded binary data
}


#[test]
fn test_row_group_limit_both_bytes_wins() {
Can you have a similar test with rows wins that have the same structure but only config change?
There was a problem hiding this comment.
Done - see test_row_group_limit_both_row_wins_multiple_batches vs. test_row_group_limit_both_row_wins_single_batch
#[test]
fn test_row_group_limit_both_bytes_wins() {
    // When both limits are set, the byte limit triggers first
    // Write in multiple small batches so byte-based splitting can work
According to the comment on the method, the way you write batches should not matter, only the config should: if the byte-based limit is hit first, it should flush then; if the row limit is hit first, it should flush then.
and also, it should work regardless of how you feed the data
Unfortunately, the way data is fed does affect the row group splits, because of the first batch issue (noted in the PR description):
This means that the first batch will always be written as a whole (unless row count limit is also set).
And even beyond the first batch, the behaviour is not fully predictable: the byte-based limit is enforced by calculating the average row size based on previous batches, which is more dynamic than the row-based limit.
I'm not sure what's actionable from this comment. If you think there's still a missing test case, please LMK.
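To make the averaging concrete, here is a rough sketch (illustrative only, not the actual writer code) of how an average encoded row size derived from previous batches translates into a split point for the next batch:

```rust
// Illustrative sketch of the byte-limit split: estimate how many more rows
// fit in the current row group from the average encoded row size observed
// in previous writes. All names here are hypothetical.
fn rows_until_byte_limit(
    bytes_buffered: usize,       // bytes already buffered in the current row group
    avg_row_bytes: usize,        // average encoded row size from previous batches
    max_row_group_bytes: usize,  // configured byte limit
) -> usize {
    let remaining = max_row_group_bytes.saturating_sub(bytes_buffered);
    remaining / avg_row_bytes.max(1) // avoid division by zero
}

fn main() {
    // ~100-byte rows against a 3500-byte limit: roughly 35 rows fit in an
    // empty row group...
    assert_eq!(rows_until_byte_limit(0, 100, 3500), 35);
    // ...but with 3000 bytes already buffered, only ~5 more rows fit.
    assert_eq!(rows_until_byte_limit(3000, 100, 3500), 5);
    println!("ok");
}
```

Because `avg_row_bytes` is an estimate from past data, the actual split points shift as the data changes, which is why the resulting row groups depend on how batches are fed.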
parquet/src/file/properties.rs
Outdated
/// Sets maximum size of a row group in bytes, or `None` for unlimited.
///
/// Row groups are flushed when their estimated encoded size exceeds this threshold.
/// This is similar to the official `parquet.block.size` behavior.
- /// This is similar to the official `parquet.block.size` behavior.
+ /// This is similar to the official Java implementation `parquet.block.size` behavior.
This is not part of the spec, so there is nothing "official" about it.
yonipeleg33
left a comment
Thanks @rluvaton, PTAL
etseidl
left a comment
Thanks @yonipeleg33. Flushing partial review for now, but I think this is looking pretty sound so far.
parquet/src/file/properties.rs
Outdated
/// # Panics
/// If the value is `Some(0)`.
pub fn set_max_row_group_row_count(mut self, value: Option<usize>) -> Self {
    assert_ne!(value, Some(0), "Cannot have a 0 max row group bytes");
- assert_ne!(value, Some(0), "Cannot have a 0 max row group bytes");
+ assert_ne!(value, Some(0), "Cannot have a 0 max row group row count");
parquet/src/file/properties.rs
Outdated
/// # Panics
/// If the value is set to 0.
#[deprecated(
    since = "57.3.0",
- since = "57.3.0",
+ since = "58.0.0",
57.3.0 has already been released
parquet/src/file/properties.rs
Outdated
pub const DEFAULT_MAX_ROW_GROUP_SIZE: usize = 1024 * 1024;
/// Default value for [`WriterProperties::max_row_group_bytes`] (128 MB, same as the official Java
/// implementation for `parquet.block.size`)
pub const DEFAULT_MAX_ROW_GROUP_BYTES: usize = 128 * 1024 * 1024;
This constant appears to be unused. I'd vote for less clutter and get rid of it
Removed. Good catch! (Leftover from previous implementations)
Or should we set it?
@alamb I think not, as it changes behaviour without users opting-in for that new behaviour. None preserves the existing behaviour by default, which is no byte count limit at all.
parquet/src/file/properties.rs
Outdated
/// This is similar to the official Java implementation for `parquet.block.size`'s behavior.
///
/// If both `max_row_group_row_count` and `max_row_group_bytes` are set,
/// the row group with the smallest limit will be applied.
- /// the row group with the smallest limit will be applied.
+ /// the row group with the smaller limit will be produced.
parquet/src/file/properties.rs
Outdated
/// Sets maximum number of rows in a row group, or `None` for unlimited.
///
/// If both `max_row_group_row_count` and `max_row_group_bytes` are set,
/// the row group with the smallest limit will be applied.
- /// the row group with the smallest limit will be applied.
+ /// the row group with the smaller limit will be produced.
@@ -314,8 +320,12 @@ impl<W: Write + Send> ArrowWriter<W> {
/// Encodes the provided [`RecordBatch`]
///
/// If this would cause the current row group to exceed [`WriterProperties::max_row_group_size`]
- /// If this would cause the current row group to exceed [`WriterProperties::max_row_group_size`]
+ /// If this would cause the current row group to exceed [`WriterProperties::max_row_group_row_count`]
?
- Change deprecation notice to 58.0.0
- Improve wording in comments
- Clean up references to the newly deprecated API
yonipeleg33
left a comment
Thanks @etseidl, PTAL
/// Returns maximum number of rows in a row group, or `usize::MAX` if unlimited.
///
/// For more details see [`WriterPropertiesBuilder::set_max_row_group_size`]
pub fn max_row_group_size(&self) -> usize {
Given the introduction of max_row_group_count, what would you think about deprecating max_row_group_size and directing people to that new setting?
That makes sense, as we also deprecate the corresponding setter. Done!
parquet/src/file/properties.rs
Outdated
data_page_row_count_limit: DEFAULT_DATA_PAGE_ROW_COUNT_LIMIT,
write_batch_size: DEFAULT_WRITE_BATCH_SIZE,
max_row_group_size: DEFAULT_MAX_ROW_GROUP_SIZE,
max_row_group_row_count: Some(DEFAULT_MAX_ROW_GROUP_SIZE),
Could we also please align the constant name to the parameter name (e.g. `DEFAULT_MAX_ROW_GROUP_COUNT`)?
Thank you @yonipeleg33 -- sorry I forgot to submit my review from the other day when I reviewed this PR
yonipeleg33
left a comment
Thank you @yonipeleg33 -- sorry I forgot to submit my review from the other day when I reviewed this PR
Happens to the best of us 😄
Done, thanks for the review! PTAL
etseidl
left a comment
Thanks @yonipeleg33, this looks good to me.
Thanks so much for the review, guys!
alamb
left a comment
Looks good to me too -- thanks @yonipeleg33 @etseidl and @rluvaton
let a = batch.slice(0, to_write);
let b = batch.slice(to_write, batch.num_rows() - to_write);
self.write(&a)?;
return self.write(&b);
Since this recurses, it could potentially blow out the stack with pathological inputs (e.g. a RecordBatch with 1M rows and a max_row_group_count of 1). I don't think it is necessary to fix now, I just wanted to point it out.
here is a reproducer (I will file a follow on ticket)
#[test]
fn test_row_group_limit_rows_only_pathological_stack_overflow_demo() {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "int",
        ArrowDataType::Int32,
        false,
    )]));
    let array = Int32Array::from((0..1_000_000_i32).collect::<Vec<_>>());
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)]).unwrap();
    let props = WriterProperties::builder()
        .set_max_row_group_row_count(Some(1))
        .set_max_row_group_bytes(None)
        .build();
    let file = tempfile::tempfile().unwrap();
    let mut writer = ArrowWriter::try_new(file, schema, Some(props)).unwrap();
    // This currently recurses once per row-group split and can overflow the stack.
    writer.write(&batch).unwrap();
}
I filed a ticket to track
Given the prior code path for max_row_group_size also uses recursion, I don't think this is a new bug introduced by this PR (though the max bytes path is now also susceptible to the same issue).
Thanks for looking into it!
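For reference, one way the recursion could be avoided (a sketch over a simplified model, not the actual fix or the real writer API): advance an offset in a loop instead of calling `write()` recursively on the tail slice, so stack depth stays constant regardless of how many row groups a batch spans.

```rust
// Simplified model: "rows" stands in for a RecordBatch, and the returned
// Vec records the size of each flushed "row group". The loop replaces the
// `return self.write(&b)` tail recursion from the snippet above.
fn write_in_row_groups(rows: &[i32], max_rows_per_group: usize) -> Vec<usize> {
    let mut groups = Vec::new();
    let mut offset = 0;
    while offset < rows.len() {
        // Flush at most `max_rows_per_group` rows per iteration.
        let take = max_rows_per_group.min(rows.len() - offset);
        groups.push(take);
        offset += take;
    }
    groups
}

fn main() {
    // 1M rows with a limit of 1 would recurse 1M deep; the loop is fine.
    let rows: Vec<i32> = (0..1_000_000).collect();
    assert_eq!(write_in_row_groups(&rows, 1).len(), 1_000_000);
    // Uneven split: 5 rows, 2 per group -> groups of 2, 2, 1.
    assert_eq!(write_in_row_groups(&[1, 2, 3, 4, 5], 2), vec![2, 2, 1]);
    println!("ok");
}
```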
I made a small follow on PR to add some additional tests
…oup_row_count,max_row_group_size} (#9387)

# Which issue does this PR close?

- Follow on to #9357

# Rationale for this change

While reviewing this PR, I found (with codex) some additional code paths that I think would be valuable to test:

1. That you can't set `Some(0)` for the max sizes
2. Certain code paths

# What changes are included in this PR?

Add tests

# Are these changes tested?

# Are there any user-facing changes?
Which issue does this PR close?
This PR implements another suggestion introduced in #1213:
So it does not "Close" anything new.
Rationale for this change
A best effort to match Spark's (or more specifically, Hadoop's) `parquet.block.size` configuration behaviour, as documented in parquet-hadoop's README.

Since arrow's parquet writer writes batches, it is inherently different from Hadoop's per-record writer behaviour - so the behaviour of `max_row_group_bytes` will be different from Hadoop's `parquet.block.size`, but this is the closest I could reasonably get (see details below).

What changes are included in this PR?
max_row_group_byteswill be different than Hadoop'sparquet.block.size, but this is the closest I could reasonably get (see details below).What changes are included in this PR?
Configuration changes
- New `max_row_group_bytes` configuration option in `WriterProperties`
- Renamed the `max_row_group_size` private property to `max_row_group_row_count`
- `set_max_row_group_size()` and `max_row_group_size()` still remain with their existing signatures.
- Added `set_max_row_group_row_count()` and `max_row_group_row_count()`, which expose the `Option<usize>` type.
- If `set_max_row_group_row_count(None)` is called, `max_row_group_size()` will return `usize::MAX`.
`ArrowWriter::write` now supports any combination of these two properties (row count and row bytes).
Byte limit is calculated once per batch (as opposed to Hadoop's per-record calculation):
Before writing each batch, compute the average row size in bytes based on previous writes, and flush or split the batch according to that average before hitting the limit.
This means that the first batch will always be written as a whole (unless row count limit is also set).
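The per-batch flush decision described above can be sketched as follows (an illustrative sketch in plain Rust, not the actual writer internals; both limits are optional, and whichever is hit first triggers the flush):

```rust
// Hypothetical helper mirroring the "any combination of the two limits"
// behaviour: flush the current row group when either configured limit
// has been reached. Names are illustrative.
fn should_flush(
    buffered_rows: usize,
    buffered_bytes: usize,
    max_rows: Option<usize>,
    max_bytes: Option<usize>,
) -> bool {
    max_rows.is_some_and(|m| buffered_rows >= m)
        || max_bytes.is_some_and(|m| buffered_bytes >= m)
}

fn main() {
    // Row limit wins: 1024 rows buffered against a 1024-row limit.
    assert!(should_flush(1024, 4096, Some(1024), Some(1 << 20)));
    // Byte limit wins before the row limit is reached.
    assert!(should_flush(10, 2048, Some(1024), Some(2048)));
    // Neither limit set: never flush early, yielding a single row group.
    assert!(!should_flush(1_000_000, 1 << 30, None, None));
    println!("ok");
}
```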
Are these changes tested?
Yes - added unit tests to check all different combinations of these two properties being set.
Are there any user-facing changes?
Yes:
- `usize::MAX` is returned from `max_row_group_size()` if it was unset by the user.
- Deprecated the `set_max_row_group_size` and `max_row_group_size` APIs.