Update Arrow to apache-arrow-4.0.1#6

Merged
Avogar merged 1389 commits into ClickHouse:master from Avogar:update
Jun 1, 2021

Conversation

@Avogar Avogar commented Jun 1, 2021

No description provided.

alamb and others added 30 commits March 31, 2021 13:36
# Rationale
Accessing the list of tables via `select * from information_schema.tables` (introduced in apache#9818) is a lot to type

See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit#

# Proposal

Add support for `SHOW TABLES` command.

# Commentary

This is different from both Postgres (which uses `\d` in `psql` for this purpose) and MySQL (which uses `DESCRIBE`).

I could be convinced that we should not add `SHOW TABLES` at all (and just stay with `select * from information_schema.tables`), but I wanted to put up the proposal.

# Example Use

Setup:
```
echo "1,Foo,44.9" > /tmp/table.csv
echo "2,Bar,22.1" >> /tmp/table.csv
cargo run --bin datafusion-cli
```

Then run:

```
CREATE EXTERNAL TABLE t(a int, b varchar, c float)
STORED AS CSV
LOCATION '/tmp/table.csv';

> show tables;
+---------------+--------------------+------------+------------+
| table_catalog | table_schema       | table_name | table_type |
+---------------+--------------------+------------+------------+
| datafusion    | public             | t          | BASE TABLE |
| datafusion    | information_schema | tables     | VIEW       |
+---------------+--------------------+------------+------------+
2 rows in set. Query took 0 seconds.

```

Closes apache#9847 from alamb/alamb/really_show_tables

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
…support to GROUP BY/hash aggregates

This PR adds support for TimestampMillisecondArray to `Scalar`, `GroupByScalar`, and hash_aggregate / GROUP BYs to Rust DataFusion, thus fixing 12028.   I believe it might also fix 11940, though this needs input from you all.

There are some formatting changes from running `cargo fmt`.

Closes apache#9773 from velvia/evan/pr-arrow-12028-groupby-tsmillis

Authored-by: Evan Chan <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
The `append` functions in the `Builder` structs are often used in "hot" code. This PR tags them with `#[inline]`, making it possible to inline the function calls across crate boundaries.

Closes apache#9860 from ritchie46/inline_builder_appends

Authored-by: Ritchie Vink <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
…ld and dump the output.

This adds a one-hour timeout to the R builds. It changes the R CI test script to use `reporter="location"` by default. It adds a "dump test logs" step to the end of the build that will always dump the test output regardless of success or failure.

These three changes combined will make it much easier to debug test failures in R tests.

Closes apache#9846 from westonpace/feature/arrow-12143

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Jonathan Keane <[email protected]>
Signed-off-by: David Li <[email protected]>
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec:
apache/parquet-format#45
but then quickly removed in favour of the Null logical type:
apache/parquet-format#51

Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type.

Also remove the confusingly-named LogicalType::Unknown; "UNKNOWN" in the Thrift specification maps to LogicalType::Null.

Closes apache#9863 from pitrou/PARQUET-1990-invalid-converted-type

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
* Add ZSTD codec implementation.
* Removes dependency on netty by using ArrowBuf methods.
* Updates docs and archery test (will wait on CI to run archery; if that fails, will do more debugging).

Closes apache#9822 from emkornfield/zstd

Lead-authored-by: emkornfield <[email protected]>
Co-authored-by: Micah Kornfield <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
Closes apache#9856 from kou/glib-gandiva-filter

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Clean up some things that `clippy` was complaining to me locally about

Closes apache#9867 from alamb/alamb/rust-clippy

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
…hema.columns`

Builds on the code in apache#9818

# Rationale

Provide schema metadata access (so a user can see what columns exist and their type).

See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit#

I plan to add support for `SHOW COLUMNS` possibly as a follow on PR (though I have found out that `SHOW COLUMNS` and `SHOW TABLES` are not supported by either MySQL or by Postgres 🤔 )

# Changes
I chose to add the first 15 columns from `information_schema.columns`. You can see the full list in Postgres [here](https://www.postgresql.org/docs/9.5/infoschema-columns.html) and SQL Server [here](https://docs.microsoft.com/en-us/sql/relational-databases/system-information-schema-views/columns-transact-sql?view=sql-server-ver15).

There are a bunch more columns that say "Applies to features not available in PostgreSQL" and that don't apply to DataFusion either -- since my use case is to get the basic schema information out, I chose not to add a bunch of columns that are always null at this time.

I feel the use of column builders here is somewhat awkward (as it requires many calls to `unwrap`). I am thinking of a follow-on PR to refactor this code to use `Vec<String>` and `Vec<u64>` and then create `StringArray` and `UInt64Array` directly from them, but for now I just want the functionality.

# Example use

Setup:
```
echo "1,Foo,44.9" > /tmp/table.csv
echo "2,Bar,22.1" >> /tmp/table.csv
cargo run --bin datafusion-cli
```

Then run:

```
> CREATE EXTERNAL TABLE t(a int, b varchar, c float)
STORED AS CSV
LOCATION '/tmp/table.csv';
0 rows in set. Query took 0 seconds.

>   select table_name, column_name, ordinal_position, is_nullable, data_type from information_schema.columns;
+------------+-------------+------------------+-------------+-----------+
| table_name | column_name | ordinal_position | is_nullable | data_type |
+------------+-------------+------------------+-------------+-----------+
| t          | a           | 0                | NO          | Int32     |
| t          | b           | 1                | NO          | Utf8      |
| t          | c           | 2                | NO          | Float32   |
+------------+-------------+------------------+-------------+-----------+
3 rows in set. Query took 0 seconds.
```

Closes apache#9840 from alamb/alamn/information_schema_columns

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Add an `into_inner()` method to `ipc::writer::StreamWriter`, allowing users to recover the underlying writer, consuming the StreamWriter. Essentially exposes `into_inner()` from the BufWriter contained in the StreamWriter. The StreamWriter will 'finish' itself if not already finished when returning the writer.

Also added `ArrowError::From<std::io::IntoInnerError>` conversion to allow for ergonomic return of the potential `IntoInnerError` returned by `BufWriter.into_inner()`.

Closes apache#9858 from ericwburden/into-inner-fn-for-stream-writer

Authored-by: Eric Burden <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
…ex groups from strings

Adds a regexp_extract compute kernel to select a substring based on a regular expression.

Some things I did that I may be doing wrong:

* I exposed `GenericStringBuilder`
* I build the resulting Array using a builder - this looks quite different from e.g. the substring kernel. Should I change it accordingly, e.g. because of performance considerations?
* In order to apply the new function in datafusion, I did not see a better solution than to handle the pattern string as `StringArray` and take the first record to compile the regex pattern from it and apply it to all values. Is there a way to define that an argument has to be a literal/scalar and cannot be filled by e.g. another column? I consider my current implementation quite error prone and would like to make this a bit more robust.

Closes apache#9428 from sweb/ARROW-10354/regexp_extract

Authored-by: Florian Müller <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
…eads

Closes apache#9808 from westonpace/feature/arrow-12097

Authored-by: Weston Pace <[email protected]>
Signed-off-by: David Li <[email protected]>
According to PEP 632, distutils will be deprecated in Python 3.10 and removed in 3.12.

* switch to `setuptools` for general packaging
* use the `Version` class from the `packaging` project instead of `distutils.LooseVersion`
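
The swap itself is mechanical; here is a minimal sketch of the `packaging` usage (illustrative only, not the actual pyarrow build code):

```python
# packaging.version.Version is the PEP 440-compliant replacement for
# the deprecated distutils.version.LooseVersion.
from packaging.version import Version

# Version implements proper ordering: pre-releases sort before finals,
# and trailing zeros are insignificant.
assert Version("4.0.1") > Version("4.0.0")
assert Version("4.0.0rc1") < Version("4.0.0")
assert Version("2.0") == Version("2.0.0")

print(Version("4.0.1").release)  # structured access to the components
```

Unlike `LooseVersion`, which compared versions as loosely-split strings, `Version` rejects malformed version strings up front and orders pre-releases correctly.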

Closes apache#9849 from pitrou/ARROW-12068-remove-distutils

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Compressed files such as `.gz` can contain multiple concatenated "streams".
If the last stream in the file decompressed to empty data, we would erroneously raise an error.
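
The same multi-member scenario can be sketched with the Python stdlib `gzip` module (shown here only to illustrate the file layout; Arrow's codecs are implemented in C++):

```python
import gzip

# A .gz file may contain several concatenated members; the last one
# here decompresses to empty data, which a reader must tolerate.
data = gzip.compress(b"hello, ") + gzip.compress(b"world") + gzip.compress(b"")

# gzip.decompress walks all members, including the empty trailing one,
# without raising an error.
assert gzip.decompress(data) == b"hello, world"
```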

Closes apache#9864 from pitrou/ARROW-12169-compressed-empty-stream

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: David Li <[email protected]>
It's for GNU Autotools.

Closes apache#9869 from kou/glib-remove-config

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
This PR adds child data to Arrow's C FFI implementation and implements it for `List` and `LargeList` datatypes.

Closes apache#9778 from ritchie46/ffi_types

Authored-by: Ritchie Vink <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
…run synchronously from datasets

Calling the async streaming CSV reader from the synchronous Scanner::Scan was causing a form of nested parallelism and causing nested deadlocks.  This commit brings over some of the work in ARROW-7001 and allows the CSV scan task to be called in an async fashion.  In addition, an async path is put in the scanner and dataset write so that all internal uses of ScanTask()->Execute happen in an async-friendly way.  External uses of ScanTask()->Execute should already be outside the CPU thread pool and should not cause deadlock.

Some of this PR will be obsoleted by ARROW-7001 but the work in file_csv and the test cases should remain fairly intact.

Closes apache#9868 from westonpace/bugfix/arrow-12161

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: David Li <[email protected]>
… integration tests

# Rationale
Rust debug symbols are quite verbose, taking up memory during the final link as well as significant disk space. Turning off the creation of symbols should save us compile/test time in CI as well as space in the integration tests.

# Change
Do not produce debug symbols on Rust CI (keep enough to have line numbers in `panic!` tracebacks, but not enough to interpret a core file, which to my knowledge no one does anyway).

Note that the integration test passed: https://github.com/apache/arrow/pull/9879/checks?check_run_id=2256148363

Closes apache#9879 from alamb/less_symbols_in_integration

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
This updates the zstd version used by the parquet crate to zstd = "0.7.0+zstd.1.4.9".

Closes apache#9881 from aldanor/feature/zstd-0.7

Authored-by: Ivan Smirnov <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
This just moves the tests to allow the feature-flag to be used and pass this kind of test (where previously it would fail)

```bash
cargo test --no-default-features --features cli
```

Closes apache#9874 from seddonm1/regexp_match_test

Authored-by: Mike Seddon <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Closes apache#9763 from emkornfield/trivial_prs

Lead-authored-by: Micah Kornfield <[email protected]>
Co-authored-by: emkornfield <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
This depends on ARROW-12192: apache/arrow-site#99

Closes apache#9885 from kou/release-post-website-download

Lead-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
This throws an error if the user attempts to create a Table with columns of different lengths. We already had this for RecordBatches but not for Tables.
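
A hypothetical Python sketch of the kind of validation being added (`make_table` is illustrative, not the Arrow R/C++ API):

```python
def make_table(columns):
    """Refuse to build a table whose columns have differing lengths,
    mirroring the check that already existed for RecordBatches."""
    lengths = {name: len(col) for name, col in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns must all be the same length, got {lengths}")
    return columns

make_table({"a": [1, 2, 3], "b": ["x", "y", "z"]})  # ok
try:
    make_table({"a": [1, 2, 3], "b": ["x"]})  # mismatched lengths
except ValueError as e:
    print("rejected:", e)
```

Failing eagerly at construction time turns a class of confusing downstream errors into one clear message at the call site.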

Closes apache#9851 from ianmcook/ARROW-12155

Authored-by: Ian Cook <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…r functions and aggregates

Broken out from apache#9600 by @wqc200. Note this does not contain the part of apache#9600 that controls the output display of functions.

# Rationale

Aggregate functions are checked using case-insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work -- the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75)).

However, scalar functions, user-defined aggregates, and user-defined functions are checked using case-sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT(x)` does not). Postgres always uses case-insensitive comparison:

```
alamb=# select sqrt(x) from foo;
 sqrt
------
(0 rows)

alamb=# select SQRT(x) from foo;
 sqrt
------
(0 rows)
```

# Changes

Always use case-insensitive comparison for unquoted identifiers, both for consistency within DataFusion and for consistency with Postgres (and the SQL standard)

Adds tests that demonstrate the behavior
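
A hypothetical sketch of the resolution rule (the names `resolve` and `registry` are illustrative, not DataFusion's actual API):

```python
def resolve(registry, name, quoted=False):
    """Unquoted identifiers are folded to lower case before lookup;
    quoted identifiers are matched exactly."""
    key = name if quoted else name.lower()
    if key not in registry:
        raise KeyError(f"no function named {name!r}")
    return registry[key]

registry = {"my_sqrt": "udf-1", "MY_SQRT": "udf-2"}

assert resolve(registry, "MY_SQRT") == "udf-1"               # unquoted: folds to my_sqrt
assert resolve(registry, "MY_SQRT", quoted=True) == "udf-2"  # quoted: exact match
```

This is why, after the change, the lower-case registration shadows the upper-case one for unquoted references.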

# Notes

This PR changes how user-defined functions are resolved in SQL queries. If a user registers two functions with the names
`"my_sqrt"` and `"MY_SQRT"`, previously they could both be called
individually. After this PR, `my_sqrt` will be called unless the user
specifically puts `"MY_SQRT"` (in quotes) in their query.

Closes apache#9827 from alamb/case_insensitive_functions

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
There are two typos:

1. In comment: `*memory consumption.` -> `* memory consumption`.
2. The names reader_writer.h and reader-writer.cc are inconsistent; when you use vim to edit them, it's confusing why `:A` doesn't work.

Closes apache#9870 from Clcanny/fix-reader-writer-example-typo

Authored-by: Clcanny <[email protected]>
Signed-off-by: Yibo Cai <[email protected]>
Also `stringr::str_replace()` and `stringr::str_replace_all()`

Closes apache#9878 from ianmcook/ARROW-11513

Lead-authored-by: Ian Cook <[email protected]>
Co-authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
This adds `quantile()` and `median()` methods for `ArrowDatum`

Closes apache#9875 from ianmcook/ARROW-11338

Authored-by: Ian Cook <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
…etend version in the macOS wheel builds

Submitted crossbow job manually with target version 4.0.0: https://github.com/ursacomputing/crossbow/tree/build-124-github-wheel-osx-high-sierra-cp36m

Closes apache#9872 from kszucs/ARROW-12172

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
I was debugging another issue (not a bug in DataFusion I don't think) but noticed there wasn't any coverage for LIMIT in exec.rs, so I figured I would add some.

(well really I was writing a test to trigger what I thought was a bug in DataFusion -- lol)

Closes apache#9897 from alamb/limit_fix

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
kszucs and others added 25 commits April 21, 2021 18:11
Closes apache#10143 from jonkeane/ARROW-12520

Authored-by: Jonathan Keane <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
Fixes false positives in a check that pkg-config is installed

Closes apache#10198 from ianmcook/ARROW-12601

Authored-by: Ian Cook <[email protected]>
Signed-off-by: Ian Cook <[email protected]>
Minimal changes to link to ucrt builds specifically once supported (and will be safely redundant until then).

Closes apache#10217 from jeroen/winucrt

Lead-authored-by: Jeroen Ooms <[email protected]>
Co-authored-by: Neal Richardson <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>
An uninitialized StopToken caused segfaults if you ever called read_csv with cancellation disabled or when not on the main thread (e.g. if used in a Flight server). If we have a 4.0.1 I think this qualifies as a regression.

Closes apache#10227 from lidavidm/arrow-12622

Authored-by: David Li <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…t.write_table

Closes apache#10223 from jorisvandenbossche/ARROW-12617-orc-write_table-signature

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
… no nulls

Closes apache#10184 from cyb70289/12568-cast-crash

Lead-authored-by: Yibo Cai <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Closes apache#10237 from jonkeane/ARROW-12571-valgrindCI-patch

Authored-by: Jonathan Keane <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>
…rs should not be case-sensitive

This makes the environment variables `LIBARROW_MINIMAL`, `LIBARROW_DOWNLOAD`, and `NOT_CRAN` case-insensitive

Closes apache#10252 from ianmcook/ARROW-12642

Authored-by: Ian Cook <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>
With the nvcc 11.2 compiler we get a segfault when we have a copy and move assignment operator:
```
using Impl::operator=;
```
before a move-assignment operator:

```
Variant& operator=(Variant&& other) noexcept {
  this->destroy();
  other.move_to(this);
  return *this;
}
```

A minimal repro:

With a segfault: https://godbolt.org/z/h9eYv6zas

Without a segfault: https://godbolt.org/z/oWhK5qPd8

In this PR, as a workaround, we have essentially re-ordered the move-assignment of `Variant` before `using Impl::operator=;`.

Closes apache#10257 from galipremsagar/patch-2

Authored-by: GALI PREM SAGAR <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Closes apache#10287 from pitrou/ARROW-12670-extract-regex-nulls

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
…data

Closes apache#10297 from zeroshade/flight-client-metadata

Authored-by: Matthew Topol <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
…d arrays => crash

fixing ARROW-12774

Closes apache#10320 from nirandaperera/ARROW-12774

Authored-by: niranda perera <[email protected]>
Signed-off-by: Yibo Cai <[email protected]>
…t bundlers such as Rollup

Bundlers such as Rollup do not recognize the `_Buffer` import, which breaks their builds. This change resolves the issue by removing `Buffer` in favor of `TextEncoder`. Note that this change incurs a performance penalty on Node, as `Buffer` is often faster.

Co-authored-by: Adam Lippai <[email protected]>
Co-authored-by: Paul Taylor <[email protected]>

Closes apache#10332 from domoritz/remove-buffer-js

Authored-by: Dominik Moritz <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…ite_js_test_json

The integration build has started to fail on master: https://github.com/apache/arrow/runs/2575265526#step:9:4265

I don't entirely understand why we see this error; in order to call that function we would need to pass `--write_generated_json` to the archery command, but we don't.
The only occurrence of that option in the codebase is in the javascript [test runner](https://github.com/apache/arrow/blob/master/js/gulp/test-task.js#L97), but that seems to use the old `integration_test.py` script, which has been deleted since we ported it to archery (cc @trxcllnt @domoritz).

Additionally, I'm unable to reproduce it locally since `archery integration` doesn't call `write_js_test_json` by default.

The implementation is clearly wrong though.

Closes apache#10314 from kszucs/integration-decimal

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
… > stop)

When the normalized slice has a start > stop, we were creating invalid arrays with a negative length (which then errors on subsequent operations)
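
A minimal sketch of the clamping logic, with illustrative names (not the actual pyarrow internals):

```python
def slice_bounds(start, stop, n):
    """Normalize slice bounds against an array of length n and return
    (offset, length), clamping the length so start > stop yields an
    empty slice rather than a negative length."""
    start = max(0, min(start + n if start < 0 else start, n))
    stop = max(0, min(stop + n if stop < 0 else stop, n))
    return start, max(0, stop - start)

assert slice_bounds(3, 1, 5) == (3, 0)   # start > stop -> empty, not length -2
assert slice_bounds(1, 4, 5) == (1, 3)   # ordinary slice
assert slice_bounds(-2, 10, 5) == (3, 2) # negative start, over-long stop
```

The key line is the final `max(0, stop - start)`: without it, a reversed slice produces an array with negative length that only fails later.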

Closes apache#10341 from jorisvandenbossche/ARROW-12769

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: David Li <[email protected]>
…set mark

Closes apache#10346 from jorisvandenbossche/ARROW-12806

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: David Li <[email protected]>
There is a fallback_version configuration option for setuptools_scm which we don't use: https://github.com/pypa/setuptools_scm#configuration-parameters

This setting seems to have issues, though, according to pypa/setuptools-scm#549.
We already have a workaround in setup.py for the functionality of the fallback_version option, but it is disabled for the sdist case: https://github.com/apache/arrow/blob/master/python/setup.py#L529

Closes apache#10342 from kszucs/ARROW-12619

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
…pes (apache#10344)

This backports the relevant part of ARROW-12500 into the 4.0.1 branch.

While ARROW-12500 cherry-picks cleanly, it doesn't build since it depends on prior changes - this just includes the actual fix and not the larger refactoring that was the focus of the patch.
