Skip to content

ARROW-11981: [C++] Implement Union ExecNode#10927

Closed
aocsa wants to merge 5 commits intoapache:masterfrom
aocsa:aocsa/ARROW-11981
Closed

ARROW-11981: [C++] Implement Union ExecNode#10927
aocsa wants to merge 5 commits intoapache:masterfrom
aocsa:aocsa/ARROW-11981

Conversation

@aocsa
Copy link
Contributor

@aocsa aocsa commented Aug 12, 2021

UnionDataset allows Fragments of multiple schemas and differing file formats to be scanned together as a single Dataset. This is useful functionality but makes the Dataset interface somewhat difficult to reason about since it must be general enough to accommodate UnionDataset.

After ARROW-11928 it will probably be more natural to support unioning of datasets through a subclass of ExecNode. Reconciliation of differing schemas can then be trivially handled by a full ProjectNode.

Note this would obviate both ARROW-11001 and ARROW-11749. In addition, Dataset could be simplified to a concrete class containing a set of compatibly typed/formatted Fragments.

@github-actions
Copy link

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. I think we can clean up the implementation here using AtomicCounter instead of manually synchronizing with mutexes.

Copy link
Contributor

@nirandaperera nirandaperera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some comments. :-) Good work @aocsa

@aocsa aocsa force-pushed the aocsa/ARROW-11981 branch from 54c2e6b to 09df8df Compare August 16, 2021 16:47
@aocsa
Copy link
Contributor Author

aocsa commented Aug 16, 2021

Thanks for the review @lidavidm, @nirandaperera.
I update this PR with the following main changes:

  1. The use AtomicCounter
  2. Handle variable number of inputs
  3. Validate input schemas

@lidavidm lidavidm self-requested a review August 16, 2021 17:05
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! This is much cleaner now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it actually makes sense to forward seq like this. As I understand nodes can't really rely on this anyways so I don't think it harms things, but it seems odd to even have the parameter in that case. CC @bkietz for opinions.

(We could modify Increment to also give the count, if we wanted to give a unique seq_num to each batch we emit here.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think seq should be ultimately removed. It's a confusing parameter which we don't provide guarantees on. I'll add a JIRA for this anon

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. @aocsa we can leave this as-is then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aocsa
Copy link
Contributor Author

aocsa commented Aug 17, 2021

Thanks @lidavidm, I updated this PR addressing feedback and ensuring the following cases are covered.

  • union node with variable number of inputs with different schemas
  • union node with 0 inputs
  • union node with inputs with empty batches

@aocsa aocsa force-pushed the aocsa/ARROW-11981 branch from 0e3cc0d to ecc69c2 Compare August 17, 2021 04:37
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aocsa added 5 commits August 17, 2021 20:38
Tests: 
- union node with variable number of inputs and with different schemas 
- union node with 0 inputs
- union node with inputs with empty batches

minor fix
Address: 
- lint fixes
- use cppguide 
- mingw ci failures
@aocsa aocsa force-pushed the aocsa/ARROW-11981 branch from 58e35e0 to 27c53af Compare August 18, 2021 01:38
@lidavidm lidavidm closed this in 48f42e9 Aug 18, 2021
@lidavidm
Copy link
Member

Merged. Thanks!

@lidavidm
Copy link
Member

I filed ARROW-13653 as a follow-up.

@aocsa
Copy link
Contributor Author

aocsa commented Aug 18, 2021

Thanks @lidavidm 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants