ARROW-16590: [C++] Consolidate files dealing with row-major storage by westonpace · Pull Request #13218 · apache/arrow

westonpace · 2022-05-23T21:03:11Z

The primary goal of this refactor of old code is to improve the readability and clarity of the code base. I did not make any functional changes. If functional changes to existing code are suggested, I will happily discuss them here but will defer the changes themselves to follow-up PRs. I would very much appreciate any feedback on naming, on whether we have sufficient test coverage, and on the overall layout of the code.

KeyRowArray -> RowTableImpl, KeyEncoder -> RowTableEncoder: The old name made sense because this data is currently represented physically as an array of rows. However, the data is conceptually tabular: we are storing rows and columns. In particular, I found it confusing that KeyColumnArray was a 1D data structure while KeyRowArray was a 2D table structure.
KeyEncoder::Context -> LightContext: There's nothing particular to the key encoder here and I worry keeping it there may lead to fracturing into many different "context" objects.
Overall structure: I created a new folder arrow/compute/row and put all row-based utilities in here. Most of the files are now marked as _internal and the content in these files is not used outside of arrow/compute/row. The grouper had previously been alongside the kernel code and it didn't really belong there as it relies very heavily on the internal structure of the row encoding.
Row structure: I documented the file arrow/compute/row/row_internal.h

github-actions · 2022-05-23T21:03:32Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

westonpace · 2022-05-23T22:35:57Z

CI failures appear unrelated.

pitrou

I tried to take a quick look at this.

pitrou · 2022-05-31T14:33:17Z

cpp/src/arrow/compute/api_aggregate.h

  const FunctionOptions* options;
 };

+Result<std::vector<const HashAggregateKernel*>> GetKernels(


Do we need to expose these APIs here, or can there be a separate header file for internal hash-aggregation APIs?

IIRC these are only used in grouped aggregation and in tests, so api_aggregate_internal.h would be appropriate to house anything which is in namespace internal here

Yes, but api_..._internal feels a bit awkward. I created arrow/compute/exec/aggregate.h. This follows the same convention as things like arrow/compute/exec/hash_join.h, which contains logic specific to the operators but unaware of the fact that it's being used in an exec plan. I think it makes sense for the aggregate tests to use this type. It's still using the internal namespace, but that's because we need it in the hash kernels tests, and at least this keeps the kernels folder cleaner.

Maybe a longer term fix would be to modify the hash aggregate tests to use the exec plan and an aggregate node?

pitrou · 2022-05-31T14:36:46Z

cpp/src/arrow/compute/exec/key_hash.h

-  static void HashMultiColumn(const std::vector<KeyColumnArray>& cols,
-                              KeyEncoder::KeyEncoderContext* ctx, uint32_t* out_hash);
+  static void HashMultiColumn(const std::vector<KeyColumnArray>& cols, LightContext* ctx,
+                              uint32_t* out_hash);


For the record, is this a class with only static methods/attributes? This seems like an anti-pattern.

Yes, that is what it is. It is essentially a namespace to distinguish between 32bit and 64bit implementations. Hashing32::HashBatch will hash rows into uint32_t while Hashing64::HashBatch will hash rows into uint64_t. Would a namespace be an option? (e.g. arrow::compute::hash32::HashBatch)

Alternatively, I suppose we could rename all the functions (e.g. arrow::compute::HashBatch32 and arrow::compute::HashBatch64).

Or we could template all the functions (e.g. arrow::compute::HashBatch<uint32_t> and arrow::compute::HashBatch<uint64_t>)

Do we have a strong style preference here?

Do we have a strong style preference here?

Hmm, I don't think so. If it's used for templating then I suppose the class is necessary.
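To illustrate the point about templating, here is a minimal sketch of a class with only static methods being passed as a template parameter, which a namespace cannot do. The names Hash32, Hash64, HashOne and HashColumn are hypothetical stand-ins, and the FNV-1a mixing is a placeholder, not the hash Arrow actually uses.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 32-bit hashing class with only static members.
// The FNV-1a mixing below is a placeholder, not Arrow's hash function.
struct Hash32 {
  using HashType = uint32_t;
  static HashType HashOne(uint64_t value) {
    HashType h = 2166136261u;
    for (int i = 0; i < 8; ++i) {
      h ^= static_cast<HashType>((value >> (8 * i)) & 0xFF);
      h *= 16777619u;
    }
    return h;
  }
};

// Hypothetical 64-bit counterpart with the same static interface.
struct Hash64 {
  using HashType = uint64_t;
  static HashType HashOne(uint64_t value) {
    HashType h = 14695981039346656037ull;
    for (int i = 0; i < 8; ++i) {
      h ^= static_cast<HashType>((value >> (8 * i)) & 0xFF);
      h *= 1099511628211ull;
    }
    return h;
  }
};

// Generic code selects the hash width through the template argument.
// This only works because Hash32/Hash64 are types rather than namespaces.
template <typename Hasher>
std::vector<typename Hasher::HashType> HashColumn(
    const std::vector<uint64_t>& values) {
  std::vector<typename Hasher::HashType> out;
  out.reserve(values.size());
  for (uint64_t v : values) {
    out.push_back(Hasher::HashOne(v));
  }
  return out;
}
```

A caller would write HashColumn<Hash32>(values) or HashColumn<Hash64>(values). With two namespaces or two suffixed function names, the calling code would have to be duplicated or dispatched by hand.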

pitrou · 2022-05-31T14:41:25Z

cpp/src/arrow/compute/light_array.h

+/// allows us to take advantage of these resources without coupling the logic with
+/// the execution engine.
+struct LightContext {
+  bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; }


Why is this not using CpuInfo::IsSupported(CpuInfo::AVX2)?

IIRC, the concept here was to be able to attach hardware flags to a specific context rather than needing to disable or enable for the whole library using CpuInfo::EnableFeature(). It and many other things are certainly candidates for follow up refactoring

Leaving this alone for now.
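As a rough sketch of the per-context idea described above: if the flags live on a context object, two callers in the same process can take different code paths without touching any global state. SketchContext, kFlagAvx2, Path and ChoosePath are hypothetical names for illustration; the flag values are made up and are not Arrow's CpuInfo constants.

```cpp
#include <cstdint>

// Hypothetical flag bits; Arrow's CpuInfo defines its own constants.
constexpr int64_t kFlagAvx2 = 1 << 0;
constexpr int64_t kFlagAvx512 = 1 << 1;

// Hypothetical, simplified analogue of LightContext: the hardware flags are
// carried by the context instead of being read from a process-wide singleton.
struct SketchContext {
  int64_t hardware_flags = 0;
  bool has_avx2() const { return (hardware_flags & kFlagAvx2) != 0; }
};

enum class Path { kScalar, kAvx2 };

// A kernel consults the context it was handed, so a test can force the scalar
// path for one call while another call still uses the vectorized path.
Path ChoosePath(const SketchContext& ctx) {
  return ctx.has_avx2() ? Path::kAvx2 : Path::kScalar;
}
```

The trade-off is that the context's flags can disagree with what the CPU really supports, so whoever builds the context is responsible for setting them sensibly.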

pitrou · 2022-05-31T14:49:05Z

cpp/src/arrow/compute/row/encode_internal.h

+  std::vector<uint32_t> batch_varbinary_cols_base_offsets_;
+};
+
+class EncoderInteger {


Do these all have to be exposed in a .h?

Some don't. I think any encoder that has a method with an AVX2 implementation does. So, since I was going to need an internal header anyway, it seemed more consistent to just put them all in. However, I can prune this down to just the encoders that need it if that would be better.

pitrou · 2022-05-31T14:51:31Z

cpp/src/arrow/compute/row/row_internal.h

+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;


Why are some sizes or quantities unsigned and other signed?

I'm not sure if there is a particular reason.

pitrou · 2022-05-31T14:56:23Z

cpp/src/arrow/compute/row/row_internal.h

+  // Buffers can only expand during lifetime and never shrink.
+  std::unique_ptr<ResizableBuffer> null_masks_;
+  // Only used if the table has variable-length columns
+  // Stores the offsets into the binary data


Where is the binary data stored?

I added a comment but it's stored after the fixed-size fields. So, for example, if you had 2 int32 fields and a string field and 3 rows you might have something like...

i1  i2  s1
1   3   abc
2   4   xy

// buffers_[1]
0x00000001 0x00000002 0x61 0x62 0x63 0x00000003 0x00000004 0x78 0x79
// offsets_
2, 5, 7, 9

I'm probably off on a few details in that example but that is the rough idea.
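The rough idea can also be sketched in code. This is a deliberately simplified model, not the layout documented in row_internal.h: it ignores alignment, null masks and padding, and assumes each row is stored as its two int32 fields followed immediately by its string bytes, with one offset per row boundary. SketchRowTable, EncodeRows, ReadInt and ReadString are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical simplified row table: one byte buffer plus row start offsets.
struct SketchRowTable {
  std::vector<uint8_t> rows;      // row-major bytes
  std::vector<uint32_t> offsets;  // offsets[i] = start of row i, last = total size
};

inline void AppendBytes(std::vector<uint8_t>* out, const void* data, size_t size) {
  const uint8_t* p = static_cast<const uint8_t*>(data);
  out->insert(out->end(), p, p + size);
}

// Encodes two fixed-width columns and one variable-length column row by row.
// All three input vectors are assumed to have the same length.
SketchRowTable EncodeRows(const std::vector<int32_t>& i1,
                          const std::vector<int32_t>& i2,
                          const std::vector<std::string>& s1) {
  SketchRowTable table;
  table.offsets.push_back(0);
  for (size_t r = 0; r < i1.size(); ++r) {
    AppendBytes(&table.rows, &i1[r], sizeof(int32_t));    // fixed-size fields first
    AppendBytes(&table.rows, &i2[r], sizeof(int32_t));
    AppendBytes(&table.rows, s1[r].data(), s1[r].size());  // then the binary data
    table.offsets.push_back(static_cast<uint32_t>(table.rows.size()));
  }
  return table;
}

// Reads fixed-width column `col` (0 or 1) of a row.
int32_t ReadInt(const SketchRowTable& t, size_t row, size_t col) {
  int32_t value = 0;
  std::memcpy(&value, t.rows.data() + t.offsets[row] + col * sizeof(int32_t),
              sizeof(int32_t));
  return value;
}

// The string occupies whatever follows the fixed-size part of the row.
std::string ReadString(const SketchRowTable& t, size_t row) {
  const size_t begin = t.offsets[row] + 2 * sizeof(int32_t);
  const size_t end = t.offsets[row + 1];
  return std::string(reinterpret_cast<const char*>(t.rows.data()) + begin,
                     end - begin);
}
```

In this model the offsets are needed only because rows have different sizes. With fixed-width columns alone, row i would simply start at i times the row width.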

pitrou · 2022-05-31T14:56:39Z

cpp/src/arrow/compute/row/row_internal.h

+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;


Suggested change

-  static constexpr int64_t padding_for_vectors = 64;
+  static constexpr int64_t kPaddingForVectors = 64;

Also add a comment explaining what this is?

I agree that this change should be made but I'd recommend doing so in follow up; I'd prefer to keep this refactor move-only since it's large as it is

I went ahead and did the rename. It's a private constant so the scope should be pretty minimal.

pitrou · 2022-05-31T14:57:48Z

cpp/src/arrow/compute/row/row_internal.h

+  // The number of bytes that can be stored in the table without resizing
+  int64_t bytes_capacity_;
+
+  // Mutable to allow lazy evaluation


Should these be atomic or is the row table not thread safe?

The row table is not thread safe. I updated the class comment to mention this fact.
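For context, a minimal sketch of the "mutable to allow lazy evaluation" pattern and why it is not thread safe. SketchTable and its members are hypothetical, and the cached quantity (whether any row is null) is chosen only for illustration.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical class showing a lazily computed value cached in mutable members.
class SketchTable {
 public:
  explicit SketchTable(std::vector<uint8_t> validity)
      : validity_(std::move(validity)) {}

  void Append(uint8_t valid) {
    validity_.push_back(valid);
    cache_valid_ = false;  // new data invalidates the cached answer
  }

  // const, yet it writes to the mutable cache. Two threads calling this at the
  // same time would race on cache_valid_ and has_any_nulls_, which is why such
  // members would need to be atomic (or guarded) in a thread-safe class.
  bool has_any_nulls() const {
    if (!cache_valid_) {
      has_any_nulls_ = false;
      for (uint8_t v : validity_) {
        if (v == 0) {
          has_any_nulls_ = true;
          break;
        }
      }
      cache_valid_ = true;
      ++scan_count_;
    }
    return has_any_nulls_;
  }

  // Number of full scans performed so far, exposed only to make the caching visible.
  int scan_count() const { return scan_count_; }

 private:
  std::vector<uint8_t> validity_;
  mutable bool cache_valid_ = false;
  mutable bool has_any_nulls_ = false;
  mutable int scan_count_ = 0;
};
```

Documenting the class as not thread safe, as was done here, keeps the members plain and pushes synchronization to the caller.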

bkietz · 2022-05-31T17:05:31Z

cpp/src/arrow/compute/api_aggregate.h

  const FunctionOptions* options;
 };

+Result<std::vector<const HashAggregateKernel*>> GetKernels(


IIRC these are only used in grouped aggregation and in tests, so api_aggregate_internal.h would be appropriate to house anything which is in namespace internal here

bkietz · 2022-05-31T17:08:52Z

cpp/src/arrow/compute/row/row_internal.h

+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;


I agree that this change should be made but I'd recommend doing so in follow up; I'd prefer to keep this refactor move-only since it's large as it is

bkietz · 2022-05-31T17:20:19Z

cpp/src/arrow/compute/light_array.h

+/// allows us to take advantage of these resources without coupling the logic with
+/// the execution engine.
+struct LightContext {
+  bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; }


IIRC, the concept here was to be able to attach hardware flags to a specific context rather than needing to disable or enable for the whole library using CpuInfo::EnableFeature(). It and many other things are certainly candidates for follow up refactoring

pitrou · 2022-06-07T13:36:18Z

cpp/src/arrow/compute/row/row_internal.h

Are these a different thing than {null_masks_, offsets_, rows_}?

pitrou · 2022-06-07T13:38:36Z

cpp/src/arrow/compute/row/row_internal.h

Once I've called AppendEmpty, what am I supposed to do?

wesm

+1. I rebased after #13364 and addressed some of the remaining code review comments. This conflicts with some of the ongoing refactoring to transition from ExecBatch to ExecSpan so I will merge this as soon as we have a green CI build

…to arrow/compute/row

ARROW-16590: Moved GroupBy out of the kernels layer (api_aggregate) and into the exec layer (exec/aggregate). Added some comments and renamed a few fields to adhere to style conventions.

ARROW-16590: Fix includes in benchmark to address GroupBy move

Add missing #pragma once

github-actions · 2022-06-14T03:06:48Z

https://issues.apache.org/jira/browse/ARROW-16590

westonpace requested a review from bkietz May 23, 2022 21:03

github-actions bot added the Component: C++ label May 23, 2022

This was referenced May 23, 2022

ARROW-16590: [C++] Consolidate files dealing with row-major storage, add some helper methods #13172

Closed

ARROW-16637: [C++] Add row-based utilities for encoding a batch and merging row tables #13220

Closed

pitrou self-requested a review May 30, 2022 15:48

pitrou reviewed May 31, 2022

View reviewed changes

bkietz approved these changes May 31, 2022

View reviewed changes

westonpace force-pushed the feature/ARROW-16590--consolidate-row-major-utilities-2 branch from 9bef021 to a8b4ae1 Compare June 3, 2022 02:53

westonpace requested a review from pitrou June 3, 2022 03:05

pitrou reviewed Jun 7, 2022

View reviewed changes

wesm force-pushed the feature/ARROW-16590--consolidate-row-major-utilities-2 branch from 65ebdaa to 2de80c9 Compare June 13, 2022 20:03

wesm approved these changes Jun 13, 2022

View reviewed changes

wesm force-pushed the feature/ARROW-16590--consolidate-row-major-utilities-2 branch from 2de80c9 to cbbae69 Compare June 13, 2022 20:32

westonpace and others added 3 commits June 13, 2022 21:02

Address some code review comments

0b8ae51

Add missing #pragma once

Fix Cython compilation

d4860c8

wesm force-pushed the feature/ARROW-16590--consolidate-row-major-utilities-2 branch from cbbae69 to d4860c8 Compare June 14, 2022 02:02

github-actions bot added the Component: Python label Jun 14, 2022

wesm changed the title ~~ARROW-16590: [C++] Consolidate files dealing with row-major storage~~ ARROW-16590: [C++] Consolidate files dealing with row-major storage Jun 14, 2022

wesm merged commit 5b859fd into apache:master Jun 14, 2022
