
GH-38503: [Go][Parquet] Style improvement for using ArrowColumnWriter #38581

Merged
zeroshade merged 1 commit into apache:main from mapleFU:go-parquet/improve-style
Nov 15, 2023

Conversation

@mapleFU
Member

@mapleFU mapleFU commented Nov 4, 2023

Rationale for this change

Currently, ArrowColumnWriter does not appear to have a bug, but its usage is confusing. For nested types, ArrowColumnWriter should account for the logic below:

  /// 0 foo.bar
  ///       foo.bar.baz           0
  ///       foo.bar.baz2          1
  ///   foo.qux                   2
  /// 1 foo2                      3
  /// 2 foo3                      4

The left column is the column index at the root of the arrow::Schema; Parquet itself only stores leaf nodes,
so the Parquet column IDs are listed on the right.

In ArrowColumnWriter, the final argument is the leaf index in Parquet, so the writer should use
leafIdx. It also needs a LeafCount API for getting the leaf count here.
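A minimal, self-contained sketch of the bookkeeping described above (the names here are illustrative, not the actual pqarrow API): each root arrow column's writer starts at the current parquet leaf index, and the caller advances that index by the writer's leaf count:

```go
package main

import "fmt"

// leafStarts maps each root arrow column (given its parquet leaf count) to the
// parquet leaf index its writer should start at: a running sum of leaf counts.
func leafStarts(leafCounts []int) []int {
	starts := make([]int, len(leafCounts))
	idx := 0
	for i, n := range leafCounts {
		starts[i] = idx
		idx += n // advance by the leaf count, as the PR suggests
	}
	return starts
}

func main() {
	// Example schema from the description: foo expands to 3 parquet leaves
	// (foo.bar.baz, foo.bar.baz2, foo.qux); foo2 and foo3 are leaves themselves.
	names := []string{"foo", "foo2", "foo3"}
	counts := []int{3, 1, 1}
	for i, s := range leafStarts(counts) {
		fmt.Printf("%s starts at parquet leaf %d (writes %d leaf columns)\n", names[i], s, counts[i])
	}
}
```

With the example schema this yields start indices 0, 3, and 4, matching the right-hand column in the diagram above.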

What changes are included in this PR?

Style enhancements for LeafCount, leafIdx, and the usage of ArrowColumnWriter

Are these changes tested?

no

Are there any user-facing changes?

no

@mapleFU mapleFU requested a review from zeroshade as a code owner November 4, 2023 09:32
@github-actions

github-actions bot commented Nov 4, 2023

⚠️ GitHub issue #38503 has been automatically assigned in GitHub to PR creator.

@mapleFU
Member Author

mapleFU commented Nov 4, 2023

@zeroshade @tschaub

I've updated the sample here. Also, we may need to check the type for the writer. Any advice is welcome.

@tschaub
Contributor

tschaub commented Nov 4, 2023

Thank you for looking into this and finding the issue, @mapleFU.

My assumption was that the last argument to the pqarrow.NewArrowColumnWriter function was the Arrow column index instead of the Parquet column index (or logical instead of physical, not sure which terms to use). And the current tests support this assumption.

Maybe the question for @zeroshade is whether the tests are correct (last arg is logical/Arrow col index) or the source is correct (last arg is physical/Parquet col index).

This has been a source of uncertainty for me when col is used in the documentation. I look forward to better understanding which interpretation is meant (or is field a better term to differentiate?).

@mapleFU
Member Author

mapleFU commented Nov 4, 2023

Sorry for the late reply, I was playing games :-)

The schema is an important part of using Parquet, because Parquet only stores leaf nodes. You can also refer to the code here: https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.h#L142-L153

  /// \brief Read the given columns into a Table
  ///
  /// The indicated column indices are relative to the internal representation
  /// of the parquet table. For instance :
  /// 0 foo.bar
  ///       foo.bar.baz           0
  ///       foo.bar.baz2          1
  ///   foo.qux                   2
  /// 1 foo2                      3
  /// 2 foo3                      4
  ///
  /// i=0 will read foo.bar.baz, i=1 will read only foo.bar.baz2 and so on.
  /// Only leaf fields have indices; foo itself doesn't have an index.
  /// To get the index for a particular leaf field, one can use
  /// manifest().schema_fields to get the top level fields, and then walk the
  /// tree to identify the relevant leaf fields and access its column_index.
  /// To get the total number of leaf fields, use FileMetadata.num_columns().

The code above is for the reader, but the reasoning is the same. The left side is the "arrow column number" and the right side is the "parquet leaf column number"; we need a mapping between them.

So, in the case above, the "real" Parquet schema would be:

foo.bar.baz
foo.bar.baz2
foo.qux
foo2
foo3

You can also read some of the comments in https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.h. They help a lot.

ArrowColumnWriter needs to map the arrow writer (like the writer for foo) to the underlying leaf writers (writers for foo.bar.baz, foo.bar.baz2, and foo.qux). In this case, the foo writer will write 3 leaf nodes, so the next ArrowColumnWriter should start 3 columns later.
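To make the mapping concrete, here is a toy sketch (the `field` type and names are hypothetical, not the actual pqarrow schema types) that counts how many parquet leaf columns a nested arrow field fans out to:

```go
package main

import "fmt"

// field is a toy schema node: it is a leaf if it has no children. Parquet only
// stores leaf nodes, so a writer for a nested arrow column must fan out to one
// leaf writer per leaf beneath it.
type field struct {
	name     string
	children []field
}

// leafCount walks the subtree and counts the parquet leaf columns under f.
func leafCount(f field) int {
	if len(f.children) == 0 {
		return 1
	}
	n := 0
	for _, c := range f.children {
		n += leafCount(c)
	}
	return n
}

func main() {
	// foo from the example: foo.bar.baz, foo.bar.baz2, foo.qux are the leaves.
	foo := field{name: "foo", children: []field{
		{name: "bar", children: []field{{name: "baz"}, {name: "baz2"}}},
		{name: "qux"},
	}}
	fmt.Println("leaves under foo:", leafCount(foo))
}
```

For the example schema this reports 3 leaves under foo, which is exactly the amount the next writer's starting index must be bumped by.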

@mapleFU
Member Author

mapleFU commented Nov 8, 2023

Ping @zeroshade
Would you mind taking a look? Or is there a better way here?

@zeroshade
Member

Sorry for the delay here, I've been away at a conference all last week. I've been catching up on my notifications.

Reading through the code a bunch, it looks like the intent is that the passed in column index (originally named col, renamed to leafColIdx in this PR) should be the index of the first physical/parquet column index that will be used by the writer. For example, if you look inside the ArrowColumnWriter.Write method, it uses colIdx + leafIdx to tell the BufferedRowGroupWriter which physical/parquet column to write. This is also verified by looking at FileWriter.WriteColumnChunked which contains fw.colIdx += acw.leafCount, showing that we bump the column index we are constructing the column writer with by the number of leaf columns we found.

So it looks like you've got the right interpretation here I believe.

Member

@zeroshade zeroshade left a comment


LGTM

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Nov 13, 2023
@tschaub
Contributor

tschaub commented Nov 13, 2023

@zeroshade - I agree that the implementation of the pqarrow.NewArrowColumnWriter function expects the last argument to be the physical/Parquet column index.

What was confusing to me and caused uncertainty was that the pqarrow package is mostly about working with Arrow, and the name of the function NewArrowColumnWriter suggests this is about Arrow columns. Then I looked at the tests for confirmation and saw that they use the Arrow table column index in calling this function:

```go
for i := int64(0); i < tbl.NumCols(); i++ {
	acw, err := pqarrow.NewArrowColumnWriter(tbl.Column(int(i)).Data(), 0, tbl.NumRows(), manifest, srgw, int(i))
	require.NoError(t, err)
	require.NoError(t, acw.Write(ctx))
}
```

So while I agree that the changes here improve the name of that last argument and make it easier to get the leaf index with the newly exported arrowColumnWriter.LeafCount() method, I'm not sure this makes for the most intuitive API. Since the function is called with an Arrow column chunk and the number of Arrow rows, wouldn't it make sense to call it with the Arrow column index as well (as done in the tests above)?

@tschaub
Contributor

tschaub commented Nov 14, 2023

I was envisioning a change something like this: main...tschaub:arrow:arrow-col

@zeroshade
Member

Hmm, originally the intent when I wrote that was essentially a sort of "destination column index" type thing, which is why it referred to the physical parquet index to start at. It also simplified the code in general. Typically, most people would likely not create an ArrowColumnWriter directly but would just use the pqarrow.Writer itself and write the records as is (with the ArrowColumnWriter being used internally in most situations).

Personally, I prefer the last arg being the physical parquet column to start writing at, which seems more intuitive to me in general if you look at it as a "destination to start writing at"; that makes the ArrowColumnWriter more versatile and usable regardless of what the arrow schema is (at least in my mind).

What do you think @mapleFU of @tschaub's comments? I'm fine with being outvoted on this if others find it more intuitive for it to be the opposite. You get to be a tie breaker 😄

@mapleFU
Member Author

mapleFU commented Nov 14, 2023

So while I agree that the changes here improve the name of that last argument and make it easier to get the leaf index with the newly exported arrowColumnWriter.LeafCount() method, I'm not sure this makes for the most intuitive API. Since the function is called with an Arrow column chunk and the number of Arrow rows, wouldn't it make sense to call it with the Arrow column index as well (as done in the tests above)?

This is not the most intuitive API. Generally, the parquet/arrow module needs to convert an arrow record batch into parquet leaves, and this needs to be done in parquet/arrow:

```go
func (fw *FileWriter) WriteColumnChunked(data *arrow.Chunked, offset, size int64) error {
	acw, err := NewArrowColumnWriter(data, offset, size, fw.manifest, fw.rgw, fw.colIdx)
	if err != nil {
		return err
	}
	fw.colIdx += acw.leafCount
	return acw.Write(fw.ctx)
}
```

This API is much easier to use; it's what WriteTable uses. Also, C++ has a WriteRecordBatch, which we use in our in-house parquet writer. I think it's easy to port and use.

ArrowColumnWriter is necessary for writing parquet, but I think it's a bit hacky to expose it as an exported API. Users would always need to maintain the offsets, array types, etc.

@zeroshade
Member

zeroshade commented Nov 14, 2023

@mapleFU Do you think it would make more sense to simply change it and no longer expose the ArrowColumnWriter directly, and direct users to using the FileWriter apis instead?

Go has an equivalent to WriteRecordBatch, which is the Write(rec arrow.Record) error method; it conforms to the array.RecordWriter interface.

@mapleFU
Member Author

mapleFU commented Nov 14, 2023

Yeah, I think ArrowColumnWriter is a bit "internal" because its users need to understand the mapping... It's OK for non-nested types, but when a user has nested types, it would be very hard to understand.

Also, I think we'd better add some type checking to the parquet writer. Some Array-level checking would not harm performance.

@zeroshade
Member

@mapleFU would you wanna take a stab at updating this PR accordingly?

@tschaub thoughts on the conversation so far?

@mapleFU
Member Author

mapleFU commented Nov 14, 2023

Updated, feel free to edit it.

@zeroshade
Member

@mapleFU In order to make the ArrowColumnWriter no longer exported, you need to change the first character to be lowercase

@mapleFU
Member Author

mapleFU commented Nov 15, 2023

Hmmm, I know your meaning, but I think a separate patch (or closing this one) might be better, since some other modules are already using NewArrowColumnWriter. I don't know if it's OK to just make it private directly.

@tschaub
Contributor

tschaub commented Nov 15, 2023

@mapleFU Do you think it would make more sense to simply change it and no longer expose the ArrowColumnWriter directly, and direct users to using the FileWriter apis instead?

I also think this makes sense. I was only using the ArrowColumnWriter after digging around the tests. But I have since switched to using a fileWriter.WriteColumnChunked() with a pqarrow.FileWriter. I think this is in line with what is being suggested here.

@mapleFU
Member Author

mapleFU commented Nov 15, 2023

@zeroshade I found there are some tests using ArrowColumnWriter in other modules, like encoding. Should I also remove them, or how should I fix them?

@tschaub
Contributor

tschaub commented Nov 15, 2023

Feel free to cherry pick if this is what you had in mind: main...tschaub:arrow:inernal-arrow-column-writer

I don't have PARQUET_TEST_DATA set up, but other tests pass.

@mapleFU
Member Author

mapleFU commented Nov 15, 2023

Nice, let Matt decide that!

@zeroshade
Member

Nice! Thanks both of you! This looks good to me as is, so I'm gonna merge this and then go review #38727

@zeroshade zeroshade merged commit dfdebdd into apache:main Nov 15, 2023
@zeroshade zeroshade removed the awaiting merge Awaiting merge label Nov 15, 2023
@mapleFU mapleFU deleted the go-parquet/improve-style branch November 15, 2023 15:41
zeroshade pushed a commit that referenced this pull request Nov 15, 2023
This makes it so the Arrow column writer is not exported from the `pqarrow` package.  This follows up on comments from #38581.
* Closes: #38503

Authored-by: Tim Schaub <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit dfdebdd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…Writer (apache#38581)


* Closes: apache#38503

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#38727)

This makes it so the Arrow column writer is not exported from the `pqarrow` package.  This follows up on comments from apache#38581.
* Closes: apache#38503

Authored-by: Tim Schaub <[email protected]>
Signed-off-by: Matt Topol <[email protected]>

Development

Successfully merging this pull request may close these issues.

[Go][Parquet] Trouble using the C++ reader to read a Parquet file written with the Go writer

3 participants