ARROW-16590: [C++] Consolidate files dealing with row-major storage#13218
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
|
CI failures appear unrelated. |
pitrou
left a comment
There was a problem hiding this comment.
I tried to take a quick look at this.
| const FunctionOptions* options; | ||
| }; | ||
|
|
||
| Result<std::vector<const HashAggregateKernel*>> GetKernels( |
There was a problem hiding this comment.
Do we need to expose these APIs here, or can there be a separate header file for internal hash-aggregation APIs?
There was a problem hiding this comment.
IIRC these are only used in grouped aggregation and in tests, so api_aggregate_internal.h would be appropriate to house anything which is in namespace internal here
There was a problem hiding this comment.
Yes but api_..._internal feels a bit awkward. I created arrow/compute/exec/aggregate.h. This follows the same convention as things like arrow/compute/exec/hash_join.h which contains logic specific to the operators but unaware of the fact its being used in an exec plan. I think it makes sense for the aggregate tests to use this type. It's still using the internal namespace but that's because we need it in the hash kernels tests and at least this keeps the kernels folder cleaner.
Maybe a longer term fix would be to modify the hash aggregate tests to use the exec plan and an aggregate node?
| static void HashMultiColumn(const std::vector<KeyColumnArray>& cols, | ||
| KeyEncoder::KeyEncoderContext* ctx, uint32_t* out_hash); | ||
| static void HashMultiColumn(const std::vector<KeyColumnArray>& cols, LightContext* ctx, | ||
| uint32_t* out_hash); |
There was a problem hiding this comment.
For the record, is this a class with only static methods/attributes? This seems like an anti-pattern.
There was a problem hiding this comment.
Yes, that is what it is. It is essentially a namespace to distinguish between 32bit and 64bit implementations. Hashing32::HashBatch will hash rows into uint32_t while Hashing64::HashBatch will hash rows into uint64_t. Would a namespace be an option? (e.g. arrow::compute::hash32::HashBatch)
Alternatively, I suppose we could rename all the functions (e.g. arrow::compute::HashBatch32 and arrow::compute::HashBatch64).
Or we could template all the functions (e.g. arrow::compute::HashBatch<uint32_t> and arrow::compute::HashBatch<uint64_t>)
Do we have a strong style preference here?
There was a problem hiding this comment.
Do we have a strong style preference here?
Hmm, I don't think so. If it's used for templating then I suppose the class is necessary.
| /// allows us to take advantage of these resources without coupling the logic with | ||
| /// the execution engine. | ||
| struct LightContext { | ||
| bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; } |
There was a problem hiding this comment.
Why is this no using CpuInfo::IsSupported(CpuInfo::AVX2)?
There was a problem hiding this comment.
IIRC, the concept here was to be able to attach hardware flags to a specific context rather than needing to disable or enable for the whole library using CpuInfo::EnableFeature(). It and many other things are certainly candidates for follow up refactoring
There was a problem hiding this comment.
Leaving this alone for now.
| std::vector<uint32_t> batch_varbinary_cols_base_offsets_; | ||
| }; | ||
|
|
||
| class EncoderInteger { |
There was a problem hiding this comment.
Do these all have to be exposed in a .h?
There was a problem hiding this comment.
Some don't. Any of the encoders that have an AVX2 implemented method do I think. So if I was going to need an internal header anyways it seemed more consistent to just throw them all in. However, I can prune this down to just the encoders needed if that would be better.
| /// For a varying-length binary, size of all encoded fixed-length key columns, | ||
| /// including lengths of varying-length columns, rounded up to the multiple of string | ||
| /// alignment. | ||
| uint32_t fixed_length; |
There was a problem hiding this comment.
Why are some sizes or quantities unsigned and other signed?
There was a problem hiding this comment.
I'm not sure if there is a particular reason.
| // Buffers can only expand during lifetime and never shrink. | ||
| std::unique_ptr<ResizableBuffer> null_masks_; | ||
| // Only used if the table has variable-length columns | ||
| // Stores the offsets into the binary data |
There was a problem hiding this comment.
Where is the binary data stored?
There was a problem hiding this comment.
I added a comment but it's stored after the fixed-size fields. So, for example, if you had 2 int32 fields and a string field and 3 rows you might have something like...
| i1 | i2 | s1 |
|---|---|---|
| 1 | 3 | abc |
| 2 | 4 | xy |
// buffers_[1]
0x00000001 0x00000002 0x61 0x62 0x63 0x00000003 0x00000004 0x78 0x79
// offsets_
2, 5, 7, 9
I'm probably off on a few details in that example but that is the rough idea.
| // Called after resize to fix pointers | ||
| void update_buffer_pointers(); | ||
|
|
||
| static constexpr int64_t padding_for_vectors = 64; |
There was a problem hiding this comment.
| static constexpr int64_t padding_for_vectors = 64; | |
| static constexpr int64_t kPaddingForVectors = 64; |
There was a problem hiding this comment.
Also add a comment explaining what this is?
There was a problem hiding this comment.
I agree that this change should be made but I'd recommend doing so in follow up; I'd prefer to keep this refactor move-only since it's large as it is
There was a problem hiding this comment.
I went ahead and did the rename. It's a private constant so the scope should be pretty minimal.
| // The number of bytes that can be stored in the table without resizing | ||
| int64_t bytes_capacity_; | ||
|
|
||
| // Mutable to allow lazy evaluation |
There was a problem hiding this comment.
Should these be atomic or is the row table not thread safe?
There was a problem hiding this comment.
The row table is not thread safe. I updated the class comment to mention this fact.
| const FunctionOptions* options; | ||
| }; | ||
|
|
||
| Result<std::vector<const HashAggregateKernel*>> GetKernels( |
There was a problem hiding this comment.
IIRC these are only used in grouped aggregation and in tests, so api_aggregate_internal.h would be appropriate to house anything which is in namespace internal here
| // Called after resize to fix pointers | ||
| void update_buffer_pointers(); | ||
|
|
||
| static constexpr int64_t padding_for_vectors = 64; |
There was a problem hiding this comment.
I agree that this change should be made but I'd recommend doing so in follow up; I'd prefer to keep this refactor move-only since it's large as it is
| /// allows us to take advantage of these resources without coupling the logic with | ||
| /// the execution engine. | ||
| struct LightContext { | ||
| bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; } |
There was a problem hiding this comment.
IIRC, the concept here was to be able to attach hardware flags to a specific context rather than needing to disable or enable for the whole library using CpuInfo::EnableFeature(). It and many other things are certainly candidates for follow up refactoring
9bef021 to
a8b4ae1
Compare
There was a problem hiding this comment.
Are these a different thing than {null_masks_, offsets_, rows_)?
There was a problem hiding this comment.
Once I've called AppendEmpty, what am I supposed to do?
65ebdaa to
2de80c9
Compare
wesm
left a comment
There was a problem hiding this comment.
+1. I rebased after #13364 and addressed some of the remaining code review comments. This conflicts with some of the ongoing refactoring to transition from ExecBatch to ExecSpan so I will merge this as soon as we have a green CI build
2de80c9 to
cbbae69
Compare
…to arrow/compute/row ARROW-16590: Moved GroupBy out of the kernels layer (api_aggregate) and into the exec layer (exec/aggregate). Added some comments and renamed a few fields to adhere to style conventions. ARROW-16590: Fix includes in benchmark to address GroupBy move
Add missing #pragma once
cbbae69 to
d4860c8
Compare
The primary goal of this refactor of old code was to improve the readability and clarity of the code base. I did not make any functional changes to the code and if any functional changes are suggested which modify existing code I will happily discuss them here but defer the changes themselves to follow-up PRs. I would very much appreciate any feedback on naming, making sure we have sufficient test coverage, and overall layout of the code.
KeyColumnArraywas a 1D data structure whileKeyRowArraywas a 2D table structure.