ARROW-11981: [C++] Implement Union ExecNode#10927
ARROW-11981: [C++] Implement Union ExecNode#10927aocsa wants to merge 5 commits intoapache:masterfrom
Conversation
lidavidm
left a comment
There was a problem hiding this comment.
Thanks for doing this. I think we can clean up the implementation here using AtomicCounter instead of manually synchronizing with mutexes.
nirandaperera
left a comment
There was a problem hiding this comment.
left some comments. :-) Good work @aocsa
54c2e6b to
09df8df
Compare
|
Thanks for the review @lidavidm, @nirandaperera.
|
lidavidm
left a comment
There was a problem hiding this comment.
Thanks for the changes! This is much cleaner now.
There was a problem hiding this comment.
I wonder if it actually makes sense to forward seq like this. As I understand nodes can't really rely on this anyways so I don't think it harms things, but it seems odd to even have the parameter in that case. CC @bkietz for opinions.
(We could modify Increment to also give the count, if we wanted to give a unique seq_num to each batch we emit here.)
There was a problem hiding this comment.
I think seq should be ultimately removed. It's a confusing parameter which we don't provide guarantees on. I'll add a JIRA for this anon
There was a problem hiding this comment.
|
Thanks @lidavidm, I updated this PR addressing feedback and ensuring the following cases are covered.
|
0e3cc0d to
ecc69c2
Compare
lidavidm
left a comment
There was a problem hiding this comment.
Thanks, just a few final comments.
Also there are a few lint failures: https://github.com/apache/arrow/pull/10927/checks?check_run_id=3350092948#step:7:4091
You can use docker-compose to lint code: https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci
Tests: - union node with variable number of inputs and with different schemas - union node with 0 inputs - union node with inputs with empty batches minor fix
Address: - lint fixes - use cppguide - mingw ci failures
minor fix
58e35e0 to
27c53af
Compare
|
Merged. Thanks! |
|
I filed ARROW-13653 as a follow-up. |
|
Thanks @lidavidm 👍 |
UnionDataset allows Fragments of multiple schemas and differing file formats to be scanned together as a single Dataset. This is useful functionality but makes the Dataset interface somewhat difficult to reason about since it must be general enough to accommodate UnionDataset.
After ARROW-11928 it will probably be more natural to support unioning of datasets through a subclass of ExecNode. Reconciliation of differing schemas can then be trivially handled by a full ProjectNode.
Note this would obviate both ARROW-11001 and ARROW-11749. In addition, Dataset could be simplified to a concrete class containing a set of compatibly typed/formatted Fragments.