ARROW-8314: [Python] Add a Table.select method to select a subset of columns#7272
jorisvandenbossche wants to merge 6 commits into apache:master
|
Here's a C++ version, albeit from the R bindings: https://github.com/apache/arrow/blob/master/r/src/table.cpp#L128-L143 Since you're doing this in Python as well, maybe this should be moved to C++? |
|
@jorisvandenbossche Do you need help on this? |
940fe20 to f3175e2
|
I added a C++ version (didn't yet update R to use it) |
pitrou left a comment:
LGTM, just a small type problem on the C++ side
cpp/src/arrow/table.cc (outdated):
Need a static_cast, or use int64_t or size_t instead.
|
It would also be nice to add a test on the C++ side, if that's not too time-consuming. |
Are you intending to make that change in this PR too? |
f3175e2 to ca8cf21
|
I updated this, and added a small C++ test.
Realistically speaking, not at the moment (I certainly want to learn how to set up an R dev environment, but right before the release, with other priorities, might not be the best time ;-)) |
|
I added https://issues.apache.org/jira/browse/ARROW-9387 for using this in R. It might be trivial but in case it isn't I don't want to block this. |
|
@pitrou could you check the C++ test? (fixing the linting issue) |
ca8cf21 to 858a8d4
|
@jorisvandenbossche I'll take a look when I'm done with 1.0.0-critical tasks. Hopefully before the end of the week :-) |
cpp/src/arrow/table.cc (outdated):
If this is returning Result anyway we might as well boundscheck the indices, thoughts?
Added a boundscheck
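To make the bounds check discussed above concrete, here is a minimal pure-Python sketch of the idea. It is not the C++ code from this PR: `select_columns` and its arguments are hypothetical names, a plain list of lists stands in for the table's columns, and an `IndexError` stands in for the error status that a `Result`-returning C++ function would carry.

```python
from typing import List, Sequence, Tuple


def select_columns(
    names: Sequence[str],
    columns: Sequence[list],
    indices: Sequence[int],
) -> Tuple[List[str], List[list]]:
    """Return the names and columns at `indices`, in the order given.

    Hypothetical helper for illustration only; not part of Arrow.
    """
    n = len(names)
    for i in indices:
        # Reject negative and too-large indices up front instead of
        # relying on Python's negative-index wraparound.
        if i < 0 or i >= n:
            raise IndexError(
                f"Invalid column index {i} for a table with {n} columns"
            )
    return [names[i] for i in indices], [columns[i] for i in indices]


names = ["x1", "x2", "y"]
columns = [[1, 2], [3, 4], [5, 6]]
subset_names, subset_columns = select_columns(names, columns, [0, 2])
```

Checking every index before building the subset means a bad index never produces a partially constructed result.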
cpp/src/arrow/table_test.cc (outdated):
ASSERT_OK(subset->ValidateFull())?
858a8d4 to 042689a
|
+1 |
R follow up from #7272

The current `$select()` uses a more familiar (though more expensive) tidyselect interface:

``` r
library(arrow, warn.conflicts = FALSE)

tab <- Table$create(x1 = 1:2, x2 = 3:4, y = 5:6)

# lower level 0-based indices
tab$SelectColumns(0:1)
#> Table
#> 2 rows x 2 columns
#> $x1 <int32>
#> $x2 <int32>

# higher level tidyselect based
tab$select(starts_with("x"))
#> Table
#> 2 rows x 2 columns
#> $x1 <int32>
#> $x2 <int32>
```

<sup>Created on 2020-09-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>

Do we want both? `$select()` is used e.g. by the `read_csv(col_select=)` argument:

```r
tab <- reader$Read()$select(!!enquo(col_select))
```

Closes #8125 from romainfrancois/ARROW-9387/Table_SelectColumns

Lead-authored-by: Romain Francois <[email protected]>
Co-authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
This is a pure Python implementation. We might want this on the C++ side as well (unless it already exists?), but having it available in Python is already useful IMO.