Skip to content

ARROW-8314: [Python] Add a Table.select method to select a subset of columns#7272

Closed
jorisvandenbossche wants to merge 6 commits intoapache:masterfrom
jorisvandenbossche:ARROW-8314-table-select
Closed

ARROW-8314: [Python] Add a Table.select method to select a subset of columns#7272
jorisvandenbossche wants to merge 6 commits intoapache:masterfrom
jorisvandenbossche:ARROW-8314-table-select

Conversation

@jorisvandenbossche
Copy link
Copy Markdown
Member

This is a pure python implementation. It might be we want that on the C++ side (unless it already exists?), but having it available in Python is already useful IMO.

@github-actions
Copy link
Copy Markdown

@nealrichardson
Copy link
Copy Markdown
Member

nealrichardson commented May 26, 2020

Here's a C++ version, albeit from the R bindings: https://github.com/apache/arrow/blob/master/r/src/table.cpp#L128-L143

Since you're doing this in Python as well, maybe this should be moved to C++?

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Jun 10, 2020

@jorisvandenbossche Do you need help on this?

@jorisvandenbossche jorisvandenbossche force-pushed the ARROW-8314-table-select branch 2 times, most recently from 940fe20 to f3175e2 Compare June 22, 2020 14:03
@jorisvandenbossche
Copy link
Copy Markdown
Member Author

I added a C++ version (didn't yet update R to use it)

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a small type problem on the C++ side

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a static_cast, or use int64_t or size_t instead.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Jun 22, 2020

It would also be nice to add a test on the C++ side, if that's not too time-consuming.

@nealrichardson
Copy link
Copy Markdown
Member

I added a C++ version (didn't yet update R to use it)

Are you intending to make that change in this PR too?

@jorisvandenbossche jorisvandenbossche force-pushed the ARROW-8314-table-select branch from f3175e2 to ca8cf21 Compare July 9, 2020 12:41
@jorisvandenbossche
Copy link
Copy Markdown
Member Author

I updated this, and added a small C++ test.

Are you intending to make that change in this PR too?

Realistically speaking, not at the moment (I certainly want to learn how to set up a R dev environment, but just before the release with other priorities might not be the best time ;-))
So would either merge without, unless you can quickly push a change (I suppose it might be an easy change).

@nealrichardson
Copy link
Copy Markdown
Member

I added https://issues.apache.org/jira/browse/ARROW-9387 for using this in R. It might be trivial but in case it isn't I don't want to block this.

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

jorisvandenbossche commented Jul 13, 2020

@pitrou could you check the C++ test?

(fixing the linting issue)

@jorisvandenbossche jorisvandenbossche force-pushed the ARROW-8314-table-select branch from ca8cf21 to 858a8d4 Compare July 13, 2020 14:33
@pitrou
Copy link
Copy Markdown
Member

pitrou commented Jul 13, 2020

@jorisvandenbossche I'll take a look when I'm done with 1.0.0-critical tasks. Hopefully before the end of the week :-)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is returning Result anyway we might as well boundscheck the indices, thoughts?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a boundscheck

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASSERT_OK(subset->ValidateFull())?

@jorisvandenbossche jorisvandenbossche force-pushed the ARROW-8314-table-select branch from 858a8d4 to 042689a Compare July 14, 2020 19:43
@wesm
Copy link
Copy Markdown
Member

wesm commented Jul 14, 2020

+1

@wesm wesm closed this in 1413963 Jul 14, 2020
@jorisvandenbossche jorisvandenbossche deleted the ARROW-8314-table-select branch July 14, 2020 21:04
romainfrancois added a commit to romainfrancois/arrow that referenced this pull request Sep 7, 2020
nealrichardson added a commit that referenced this pull request Sep 11, 2020
R follow up from #7272

The current `$select()` uses a more familiar (though more expensive) tidyselect interface:

``` r
library(arrow, warn.conflicts = FALSE)

tab <- Table$create(x1 = 1:2, x2 = 3:4, y = 5:6)
# lower level 0-based indices
tab$SelectColumns(0:1)
#> Table
#> 2 rows x 2 columns
#> $x1 <int32>
#> $x2 <int32>

# higher level tidyselect based
tab$select(starts_with("x"))
#> Table
#> 2 rows x 2 columns
#> $x1 <int32>
#> $x2 <int32>
```

<sup>Created on 2020-09-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>

Do we want both ? `$select()` is used e.g. by the `read_csv(col_select=)` argument:

```r
tab <- reader$Read()$select(!!enquo(col_select))
```

Closes #8125 from romainfrancois/ARROW-9387/Table_SelectColumns

Lead-authored-by: Romain Francois <[email protected]>
Co-authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants