ARROW-5396: [JS] Support files and streams with no record batches #4373
trxcllnt wants to merge 3 commits into apache:master
Conversation
wesm: @trxcllnt so the dummy RecordBatch is needed only to obtain the schema? Curious whether this causes a zero-length RecordBatch to be subsequently sent. With this change, does this pass the integration test I added?
trxcllnt: @wesm yes, it passed the integration test when I ran it locally. The change in this PR causes the RecordBatchStreamReader to yield a dummy zero-length RecordBatch if the stream would otherwise terminate without yielding at least one RecordBatch (here), and it also updates the Writer to ignore zero-length RecordBatches (here). This means the Writer never writes any zero-length RecordBatches, and I can't think of a case where that'd be a problem. If it is, we could instead use an internal flag to indicate whether a zero-length RecordBatch should or shouldn't be written :-).
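For anyone following along, here is a minimal sketch of the round trip being described, written against the public apache-arrow JS API as I understand it (the field name `i32` and the wrapper function are illustrative, not from this PR):

```typescript
import {
  Field, Int32, Schema, Table,
  RecordBatchReader, RecordBatchStreamWriter
} from 'apache-arrow';

async function roundTripEmptyTable() {
  // a Table with a Schema but zero RecordBatches
  const schema = new Schema([new Field('i32', new Int32())]);
  const empty = new Table(schema);

  // the writer emits the Schema message, but (with this PR) skips
  // the zero-length placeholder batch
  const buffer = await RecordBatchStreamWriter.writeAll(empty).toUint8Array();

  // the reader still surfaces the Schema, even though the stream
  // terminates without any RecordBatch messages
  const reader = await RecordBatchReader.from(buffer);
  console.log(reader.schema.fields.map((f) => f.name)); // > [ 'i32' ]
  console.log([...reader].length);                      // > 0
}
```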
wesm: Interesting. We also have an integration test for length-0 RecordBatch objects, so if this passes both tests then I suppose it's OK. I guess I should merge my no-record-batches patch with only C++ using it, and then you can rebase this and enable JS so we can get a complete build?
trxcllnt: Ah, I forgot about that test. I thought you were asking about the no-record-batches integration test from PR #3871, which is what I tested locally. I don't think it'll pass the length-0 RecordBatch test as-is, since the Writer in this PR drops all zero-length RecordBatches. I'll make that change here shortly and update this PR.
trxcllnt: @wesm ok, I updated the Reader and Writer to use a new `_InternalEmptyPlaceholderRecordBatch` class instead, so the Writer skips only the placeholder batches and still writes genuine zero-length RecordBatches.
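A placeholder like this might look roughly as follows. This is an illustrative sketch only: the real class is internal to the JS library, and the `Data.new` factory arguments shown here are an assumption.

```typescript
import { Data, RecordBatch, Schema } from 'apache-arrow';

// Sketch of a zero-length RecordBatch whose only job is to carry a
// Schema through the stream plumbing; the Writer can detect it with
// an instanceof check and skip writing it.
class EmptyPlaceholderRecordBatch extends RecordBatch {
  constructor(schema: Schema) {
    // zero rows, plus one empty Data child per field so code that
    // inspects the batch's children still sees the right shape
    // (the Data.new arguments here are an assumption)
    super(schema, 0, schema.fields.map((f) => Data.new(f.type, 0, 0)));
  }
}
```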
…a schema but no recordbatches
…ource stream has no RecordBatches
Force-pushed 622c875 to c860696
trxcllnt: Rebased and enabled the no-batches integration test for JS.
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #4373      +/-   ##
==========================================
+ Coverage   88.34%   90.33%    +1.99%
==========================================
  Files         781       74      -707
  Lines       98721     5464    -93257
  Branches     1251     1253        +2
==========================================
- Hits        87212     4936    -82276
+ Misses      11273      520    -10753
+ Partials      236        8      -228
```

Continue to review the full report at Codecov.
wesm: @TheNeuralBit do you mind reviewing the JS?
TheNeuralBit: Would you mind adding a docstring on `_InternalEmptyPlaceholderRecordBatch`?
wesm: @trxcllnt if you can add that comment, I can go ahead and merge this.
wesm: I'm going ahead and merging this. Please feel free to submit a follow-up PR to document this.
The follow-up PR, apache/arrow#4476 (trxcllnt/js/data-builders), added the requested docstring as part of a larger change; its description and squashed commits follow.

This PR adds Vector Builder implementations for each DataType, as well as high-level stream primitives for Iterables/AsyncIterables, Node streams, and DOM streams.

edit: I've created a demo that transforms a CSV file/stream of JSON rows to an Arrow table in this repository: https://github.com/trxcllnt/csv-to-arrow-js

#### Builder API

The new `Builder` class exposes an API for sequentially appending (or setting into slots that have already been allocated) arbitrary JavaScript values that will be flushed to the same underlying Data chunk. The `Builder` class also supports specifying a list of null-value sentinels: values that will be interpreted to mean "null" and written to the null bitmap instead of being written as valid elements.

Similar to the existing `Vector` API, `Builder` has a static `Builder.new()` method that will return the correct `Builder` subclass instance for the supplied DataType. Since the `Builder` constructor takes an options Object, this method also takes an Object:

```typescript
import { Builder, Utf8 } from 'apache-arrow';

const utf8Builder = Builder.new({
    type: new Utf8(),
    nullValues: [null, 'n/a']
});

utf8Builder
    .append('hello')
    .append('n/a') // will be interpreted to mean `null`
    .append('world')
    .append(null);

const utf8Vector = utf8Builder.finish().toVector();

console.log(utf8Vector.toJSON());
// > ["hello", null, "world", null]
```

The `Builder` class has two methods for flushing the pending values to their underlying ArrayBuffer representations: `flush(): Data<T>` and `toVector(): Vector<T>` (`toVector()` calls `flush()` and creates a `Vector` instance from the returned Data instance).

Calling `Builder.prototype.finish()` will finalize a `Builder` instance. After this, no more values should be written to the Builder instance. This is a no-op for most types, except the `DictionaryBuilder`, which flushes its internal dictionary and writes the values to the `Dictionary` type's `dictionaryVector` field.

#### Iterable and stream APIs

Creating and using Builders directly is a bit cumbersome, so we provide some high-level streaming APIs for automatically creating builders, appending values, and flushing chunks of a certain size:

```typescript
Builder.throughIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughAsyncIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughDOM(options: BuilderTransformOptions<T, TNull>)
Builder.throughNode(options: BuilderDuplexOptions<T, TNull>)
```

#### Iterables and AsyncIterables

The static `throughIterable` and `throughAsyncIterable` methods take an `options` argument that indicates the Builder's type and null-value sentinels, and return a function which accepts an Iterable or AsyncIterable, respectively, of values to transform:

```typescript
import { Chunked, Builder, Utf8 } from 'apache-arrow';

const options = { type: new Utf8(), nullValues: [null, 'n/a'] };
const buildUtf8 = Builder.throughIterable(options);
const utf8Vector = Chunked.concat(...buildUtf8(['hello', 'n/a', 'world', null]));
```

The `options` argument can also specify a `queueingStrategy` and `highWaterMark` that control the chunking semantics:

* If the `queueingStrategy` is `"count"` (or is omitted), the returned generator function will flush the `Builder` and yield a chunk once the number of values written to the Builder reaches the value supplied for `highWaterMark`, regardless of how much memory the `Builder` has allocated.
* If the `queueingStrategy` is `"bytes"`, the returned generator function will flush the `Builder` and yield a new chunk once the Builder's `byteLength` field reaches or exceeds the value supplied for `highWaterMark`, regardless of how many elements the `Builder` contains.

#### Node and DOM Streams

In addition to the Iterable transform APIs, we can also create Node and DOM transform streams with similar options:

```typescript
import { Readable } from 'stream';
import { toArray } from 'ix/asynciterable/toarray';
import { Chunked, Builder, Utf8 } from 'apache-arrow';

const options = {
    type: new Utf8(),
    nullValues: [undefined, 'n/a'],
    queueingStrategy: 'bytes',
    highWaterMark: 64, // flush each chunk once 64 bytes have been written
};

const utf8Vector = Chunked.concat(await toArray(
    Readable
        .from(['hello', 'n/a', 'world', undefined])
        .pipe(Builder.throughNode(options))
));
```

#### Miscellaneous

* Updates most dependencies, updates TypeScript to v3.5.1 (and resolves #4452)
* Updates the BigInt compatibility type to use `Object.setPrototypeOf()`, yielding a 4x speedup
* Updates the Int64 and Uint64 set routines to accept native `bigint` types if available
* Adds a docstring to the `_InternalEmptyPlaceholderRecordBatch` class added in #4373

Author: ptaylor <[email protected]>

Closes #4476 from trxcllnt/js/data-builders and squashes the following commits:

7998d2a <ptaylor> add createIsValidFunction example docstring
07fa443 <ptaylor> remove default dictionary hash function
8b0752f <ptaylor> fix possible AsyncRandomAccessFile race condition retrieving filehandle size
4d2a4f0 <ptaylor> regenerate flatbuffer source from current format schemas
05acad8 <ptaylor> fix minor row serialization issues to be compatible with console.table()
c0f7f7b <ptaylor> fix a few minor formatting issues in arrow2csv
9c3a865 <ptaylor> ensure byteLength is calculated for offsets buffer
ba755ad <ptaylor> use 53 bit hash fn to further avoid collisions
8659323 <ptaylor> add test for builder iterable byte queueing strategy
13b51db <ptaylor> print more details about each message
f250845 <ptaylor> use a better default dictionary builder hash function
3adf555 <ptaylor> adds or updates most of the high-level Vector.from() methods to use the Vector Builders
5aea4f9 <ptaylor> Add more specific Int64 and Uint64 Builder tests
e974444 <ptaylor> fix lint
5d4b0be <ptaylor> ensure bitmapbufferbuilder increments and decrements _popCount appropriately
4b5375e <ptaylor> remove unnecessary jsdoc
b2556b7 <ptaylor> add docstring for _InternalEmptyPlaceholderRecordBatch
5b990c3 <ptaylor> Clean up typedoc output, update typedoc to master to use [email protected]
c2673ac <ptaylor> add initial builder jsdoc
893c74f <ptaylor> ensure ListBuilder supports random insertion
dddba1a <ptaylor> ensure variablewidthbuilder supports random insertion
84beb68 <ptaylor> remove some property getters, clean up
0209e9d <ptaylor> update to typescript 3.5
c587c33 <ptaylor> add BufferBuilders, clean up public Builder API
86a6cd0 <ptaylor> update closure compiler dependency
d032e2d <ptaylor> update dependencies
fd46163 <ptaylor> fix nan checks
26387c7 <ptaylor> update typescript, finish streaming builders, add comprehensive builder tests
79bfd46 <ptaylor> fix node and dom builder streams
2b7a63c <ptaylor> update test types
0990344 <ptaylor> add Builder throughDOM and throughNode transform streams
15ac17c <ptaylor> use Object.setPrototypeOf to improve bn performance
f708190 <ptaylor> use safe BigInt64Array constructor
f651657 <ptaylor> update typescript, ts-jest, jest
a307273 <ptaylor> fix readable-stream detection in firefox
4c2ef42 <ptaylor> add the rest of the builder types
9ba9965 <ptaylor> update row type inference sanity check
f8760cb <ptaylor> enumerate each type key individually now that they're a real thing
70377a1 <ptaylor> add vectorname getter to chunked for completion
d193141 <ptaylor> add helper method to return the stride for a datatype
385fee3 <ptaylor> fix typo
2f36741 <ptaylor> move stream methods to io folder
fb2c9a2 <ptaylor> cleanup
1398ebd <ptaylor> don't clone Dictionary DataType instances in Schema assign to preserve pointer to original instance
f7fe8c1 <ptaylor> update builder buffer padding
2b7b992 <ptaylor> ensure union typeids buffer is an Int8Array
6e64c40 <ptaylor> ensure builder allocates aligned buffers, null bitmaps initialized to null, add Int64 builder tests
cf5f778 <ptaylor> return signed or unsigned 53bit integers
8a5c713 <ptaylor> show a better error message if gulp fails
6b7a20e <ptaylor> ensure 64-bit int builders use BN to check for nulls
c609266 <ptaylor> fix bool builder, add primitive builder tests
2f423ec <ptaylor> fix date builder tests
6ebaa36 <ptaylor> WIP streaming data builders
9e5e3ea <ptaylor> fix bool builder, add primitive builder tests
ff32cbb <ptaylor> fix date builder tests
738fabb <ptaylor> WIP streaming data builders
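The description above demonstrates `Builder.throughIterable`; a matching `Builder.throughAsyncIterable` usage might look like the following sketch. The async source, option values, and inferred chunk types are illustrative, not from the PR:

```typescript
import { Builder, Chunked, Utf8, Vector } from 'apache-arrow';

// an illustrative async source of values to encode
async function* source() {
  yield 'hello'; yield 'n/a'; yield 'world'; yield null;
}

async function main() {
  const buildUtf8 = Builder.throughAsyncIterable({
    type: new Utf8(),
    nullValues: [null, 'n/a'],
  });

  // collect each flushed chunk, then concatenate into a single vector
  const chunks: Vector<Utf8>[] = [];
  for await (const chunk of buildUtf8(source())) {
    chunks.push(chunk);
  }
  const utf8Vector = Chunked.concat(...chunks);
  console.log(utf8Vector.toJSON()); // > ["hello", null, "world", null]
}
```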
Re: #3871, ARROW-2119, and closes ARROW-5396.
This PR updates the JS Readers and Writers to support files and streams with no RecordBatches. The approach here is two-fold:

* When a source file or stream contains a Schema but no RecordBatches, the Reader yields a single zero-length placeholder RecordBatch that carries the Schema.
* The Writer recognizes these placeholder RecordBatches and skips writing them, so they never appear in the output stream or file.

This is necessary because the Reader and Writer don't know about each other when they're communicating via the Node and DOM stream I/O primitives; they only know about the values pushed through the streams. Since the RecordBatchReader and Writer don't yield the Schema message as a standalone value, we pump the stream with a zero-length RecordBatch that carries the schema instead.
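As a rough illustration of the writer side, here is a sketch under the assumption that the placeholder is distinguishable by type, as the conversation above suggests; the class and function names here are hypothetical stand-ins:

```typescript
import { RecordBatch } from 'apache-arrow';

// Hypothetical stand-in for the internal placeholder class; the real
// name in this PR is `_InternalEmptyPlaceholderRecordBatch`.
class EmptyPlaceholderRecordBatch extends RecordBatch {}

// Sketch of the writer-side decision: drop only the schema-carrying
// placeholder, so genuine zero-length RecordBatches (exercised by the
// length-0 integration test) are still written.
function shouldWriteBatch(batch: RecordBatch): boolean {
  return !(batch instanceof EmptyPlaceholderRecordBatch);
}
```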