ARROW-5396: [JS] Support files and streams with no record batches #4373
trxcllnt wants to merge 3 commits into apache:master
Conversation
wesm: @trxcllnt so the dummy RecordBatch is needed only to obtain the schema? Curious whether this causes a zero-length RecordBatch to be subsequently sent. With this change, does this pass the integration test I added?
trxcllnt: @wesm yes, it passed the integration test when I ran it locally. The change in this PR causes the RecordBatchStreamReader to yield a dummy zero-length RecordBatch if the stream would otherwise terminate without yielding at least one RecordBatch (here), and it also updates the Writer to ignore zero-length RecordBatches (here). This means the Writer never writes any zero-length RecordBatches, and I can't think of a case where that'd be a problem. If it is, we could instead use an internal flag to indicate whether a zero-length RecordBatch should or shouldn't be written :-).
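For anyone following along, here is a minimal sketch of the round trip being described, written against the public apache-arrow JS API as I understand it (the field name `i32` and the wrapper function are illustrative, not from this PR):

```typescript
import {
  Field, Int32, Schema, Table,
  RecordBatchReader, RecordBatchStreamWriter
} from 'apache-arrow';

async function roundTripEmptyTable() {
  // a Table with a Schema but zero RecordBatches
  const schema = new Schema([new Field('i32', new Int32())]);
  const empty = new Table(schema);

  // the writer emits the Schema message, but (with this PR) skips
  // the zero-length placeholder batch
  const buffer = await RecordBatchStreamWriter.writeAll(empty).toUint8Array();

  // the reader still surfaces the Schema, even though the stream
  // terminates without any RecordBatch messages
  const reader = await RecordBatchReader.from(buffer);
  console.log(reader.schema.fields.map((f) => f.name)); // > [ 'i32' ]
  console.log([...reader].length);                      // > 0
}
```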
wesm: Interesting. We also have an integration test for length-0 RecordBatch objects, so if this passes both tests then I suppose it's OK. I guess I should merge my no-record-batches patch with only C++ using it, and then you can rebase this and enable JS so we can get a complete build?
trxcllnt: Ah, I forgot about that test. I thought you were asking about the no-record-batches integration test from PR #3871, which is what I tested locally. I don't think it'll pass the length-0 RecordBatch test as-is, since the Writer in this PR drops all zero-length RecordBatches. I'll make that change here shortly and update this PR.
trxcllnt: @wesm ok, I updated the Reader and Writer to use a new `_InternalEmptyPlaceholderRecordBatch` class instead, so the Writer skips only the placeholder batches and still writes genuine zero-length RecordBatches.
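A placeholder like this might look roughly as follows. This is an illustrative sketch only: the real class is internal to the JS library, and the `Data.new` factory arguments shown here are an assumption.

```typescript
import { Data, RecordBatch, Schema } from 'apache-arrow';

// Sketch of a zero-length RecordBatch whose only job is to carry a
// Schema through the stream plumbing; the Writer can detect it with
// an instanceof check and skip writing it.
class EmptyPlaceholderRecordBatch extends RecordBatch {
  constructor(schema: Schema) {
    // zero rows, plus one empty Data child per field so code that
    // inspects the batch's children still sees the right shape
    // (the Data.new arguments here are an assumption)
    super(schema, 0, schema.fields.map((f) => Data.new(f.type, 0, 0)));
  }
}
```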
…a schema but no recordbatches
…ource stream has no RecordBatches
Force-pushed 622c875 to c860696
trxcllnt: Rebased and enabled the no-batches integration test for JS.
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #4373      +/-   ##
==========================================
+ Coverage   88.34%   90.33%    +1.99%
==========================================
  Files         781       74      -707
  Lines       98721     5464    -93257
  Branches     1251     1253        +2
==========================================
- Hits        87212     4936    -82276
+ Misses      11273      520    -10753
+ Partials      236        8      -228
```

Continue to review the full report at Codecov.
wesm: @TheNeuralBit do you mind reviewing the JS?
TheNeuralBit: Would you mind adding a docstring on `_InternalEmptyPlaceholderRecordBatch`?
wesm: @trxcllnt if you can add that comment, I can go ahead and merge this.
wesm: I'm going ahead and merging this. Please feel free to submit a follow-up PR to document this.
The follow-up PR, apache/arrow#4476 (trxcllnt/js/data-builders), added the requested docstring as part of a larger change; its description and squashed commits follow.

This PR adds Vector Builder implementations for each DataType, as well as high-level stream primitives for Iterables/AsyncIterables, Node streams, and DOM streams.

edit: I've created a demo that transforms a CSV file/stream of JSON rows to an Arrow table in this repository: https://github.com/trxcllnt/csv-to-arrow-js

#### Builder API

The new `Builder` class exposes an API for sequentially appending (or setting into slots that have already been allocated) arbitrary JavaScript values that will be flushed to the same underlying Data chunk. The `Builder` class also supports specifying a list of null-value sentinels: values that will be interpreted to mean "null" and written to the null bitmap instead of being written as valid elements.

Similar to the existing `Vector` API, `Builder` has a static `Builder.new()` method that will return the correct `Builder` subclass instance for the supplied DataType. Since the `Builder` constructor takes an options Object, this method also takes an Object:

```typescript
import { Builder, Utf8 } from 'apache-arrow';

const utf8Builder = Builder.new({
    type: new Utf8(),
    nullValues: [null, 'n/a']
});

utf8Builder
    .append('hello')
    .append('n/a') // will be interpreted to mean `null`
    .append('world')
    .append(null);

const utf8Vector = utf8Builder.finish().toVector();

console.log(utf8Vector.toJSON());
// > ["hello", null, "world", null]
```

The `Builder` class has two methods for flushing the pending values to their underlying ArrayBuffer representations: `flush(): Data<T>` and `toVector(): Vector<T>` (`toVector()` calls `flush()` and creates a `Vector` instance from the returned Data instance).

Calling `Builder.prototype.finish()` will finalize a `Builder` instance. After this, no more values should be written to the Builder instance. This is a no-op for most types, except the `DictionaryBuilder`, which flushes its internal dictionary and writes the values to the `Dictionary` type's `dictionaryVector` field.

#### Iterable and stream APIs

Creating and using Builders directly is a bit cumbersome, so we provide some high-level streaming APIs for automatically creating builders, appending values, and flushing chunks of a certain size:

```typescript
Builder.throughIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughAsyncIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughDOM(options: BuilderTransformOptions<T, TNull>)
Builder.throughNode(options: BuilderDuplexOptions<T, TNull>)
```

#### Iterables and AsyncIterables

The static `throughIterable` and `throughAsyncIterable` methods take an `options` argument that indicates the Builder's type and null-value sentinels, and return a function which accepts an Iterable or AsyncIterable, respectively, of values to transform:

```typescript
import { Chunked, Builder, Utf8 } from 'apache-arrow';

const options = { type: new Utf8(), nullValues: [null, 'n/a'] };
const buildUtf8 = Builder.throughIterable(options);
const utf8Vector = Chunked.concat(...buildUtf8(['hello', 'n/a', 'world', null]));
```

The `options` argument can also specify a `queueingStrategy` and `highWaterMark` that control the chunking semantics:

* If the `queueingStrategy` is `"count"` (or is omitted), the returned generator function will flush the `Builder` and yield a chunk once the number of values written to the Builder reaches the value supplied for `highWaterMark`, regardless of how much memory the `Builder` has allocated.
* If the `queueingStrategy` is `"bytes"`, the returned generator function will flush the `Builder` and yield a new chunk once the Builder's `byteLength` field reaches or exceeds the value supplied for `highWaterMark`, regardless of how many elements the `Builder` contains.

#### Node and DOM Streams

In addition to the Iterable transform APIs, we can also create Node and DOM transform streams with similar options:

```typescript
import { Readable } from 'stream';
import { toArray } from 'ix/asynciterable/toarray';
import { Chunked, Builder, Utf8 } from 'apache-arrow';

const options = {
    type: new Utf8(),
    nullValues: [undefined, 'n/a'],
    queueingStrategy: 'bytes',
    highWaterMark: 64, // flush each chunk once 64 bytes have been written
};

const utf8Vector = Chunked.concat(await toArray(
    Readable
        .from(['hello', 'n/a', 'world', undefined])
        .pipe(Builder.throughNode(options))
));
```

#### Miscellaneous

* Updates most dependencies, updates TypeScript to v3.5.1 (and resolves #4452)
* Updates the BigInt compatibility type to use `Object.setPrototypeOf()`, yielding a 4x speedup
* Updates the Int64 and Uint64 set routines to accept native `bigint` types if available
* Adds a docstring to the `_InternalEmptyPlaceholderRecordBatch` class added in #4373

Author: ptaylor <[email protected]>

Closes #4476 from trxcllnt/js/data-builders and squashes the following commits:

7998d2a <ptaylor> add createIsValidFunction example docstring
07fa443 <ptaylor> remove default dictionary hash function
8b0752f <ptaylor> fix possible AsyncRandomAccessFile race condition retrieving filehandle size
4d2a4f0 <ptaylor> regenerate flatbuffer source from current format schemas
05acad8 <ptaylor> fix minor row serialization issues to be compatible with console.table()
c0f7f7b <ptaylor> fix a few minor formatting issues in arrow2csv
9c3a865 <ptaylor> ensure byteLength is calculated for offsets buffer
ba755ad <ptaylor> use 53 bit hash fn to further avoid collisions
8659323 <ptaylor> add test for builder iterable byte queueing strategy
13b51db <ptaylor> print more details about each message
f250845 <ptaylor> use a better default dictionary builder hash function
3adf555 <ptaylor> adds or updates most of the high-level Vector.from() methods to use the Vector Builders
5aea4f9 <ptaylor> Add more specific Int64 and Uint64 Builder tests
e974444 <ptaylor> fix lint
5d4b0be <ptaylor> ensure bitmapbufferbuilder increments and decrements _popCount appropriately
4b5375e <ptaylor> remove unnecessary jsdoc
b2556b7 <ptaylor> add docstring for _InternalEmptyPlaceholderRecordBatch
5b990c3 <ptaylor> Clean up typedoc output, update typedoc to master to use [email protected]
c2673ac <ptaylor> add initial builder jsdoc
893c74f <ptaylor> ensure ListBuilder supports random insertion
dddba1a <ptaylor> ensure variablewidthbuilder supports random insertion
84beb68 <ptaylor> remove some property getters, clean up
0209e9d <ptaylor> update to typescript 3.5
c587c33 <ptaylor> add BufferBuilders, clean up public Builder API
86a6cd0 <ptaylor> update closure compiler dependency
d032e2d <ptaylor> update dependencies
fd46163 <ptaylor> fix nan checks
26387c7 <ptaylor> update typescript, finish streaming builders, add comprehensive builder tests
79bfd46 <ptaylor> fix node and dom builder streams
2b7a63c <ptaylor> update test types
0990344 <ptaylor> add Builder throughDOM and throughNode transform streams
15ac17c <ptaylor> use Object.setPrototypeOf to improve bn performance
f708190 <ptaylor> use safe BigInt64Array constructor
f651657 <ptaylor> update typescript, ts-jest, jest
a307273 <ptaylor> fix readable-stream detection in firefox
4c2ef42 <ptaylor> add the rest of the builder types
9ba9965 <ptaylor> update row type inference sanity check
f8760cb <ptaylor> enumerate each type key individually now that they're a real thing
70377a1 <ptaylor> add vectorname getter to chunked for completion
d193141 <ptaylor> add helper method to return the stride for a datatype
385fee3 <ptaylor> fix typo
2f36741 <ptaylor> move stream methods to io folder
fb2c9a2 <ptaylor> cleanup
1398ebd <ptaylor> don't clone Dictionary DataType instances in Schema assign to preserve pointer to original instance
f7fe8c1 <ptaylor> update builder buffer padding
2b7b992 <ptaylor> ensure union typeids buffer is an Int8Array
6e64c40 <ptaylor> ensure builder allocates aligned buffers, null bitmaps initialized to null, add Int64 builder tests
cf5f778 <ptaylor> return signed or unsigned 53bit integers
8a5c713 <ptaylor> show a better error message if gulp fails
6b7a20e <ptaylor> ensure 64-bit int builders use BN to check for nulls
c609266 <ptaylor> fix bool builder, add primitive builder tests
2f423ec <ptaylor> fix date builder tests
6ebaa36 <ptaylor> WIP streaming data builders
9e5e3ea <ptaylor> fix bool builder, add primitive builder tests
ff32cbb <ptaylor> fix date builder tests
738fabb <ptaylor> WIP streaming data builders
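The description above demonstrates `Builder.throughIterable`; a matching `Builder.throughAsyncIterable` usage might look like the following sketch. The async source, option values, and inferred chunk types are illustrative, not from the PR:

```typescript
import { Builder, Chunked, Utf8, Vector } from 'apache-arrow';

// an illustrative async source of values to encode
async function* source() {
  yield 'hello'; yield 'n/a'; yield 'world'; yield null;
}

async function main() {
  const buildUtf8 = Builder.throughAsyncIterable({
    type: new Utf8(),
    nullValues: [null, 'n/a'],
  });

  // collect each flushed chunk, then concatenate into a single vector
  const chunks: Vector<Utf8>[] = [];
  for await (const chunk of buildUtf8(source())) {
    chunks.push(chunk);
  }
  const utf8Vector = Chunked.concat(...chunks);
  console.log(utf8Vector.toJSON()); // > ["hello", null, "world", null]
}
```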
Re: #3871, ARROW-2119, and closes ARROW-5396.
This PR updates the JS Readers and Writers to support files and streams with no RecordBatches. The approach here is two-fold:

* When a source file or stream contains a Schema but no RecordBatches, the Reader yields a single zero-length placeholder RecordBatch that carries the Schema.
* The Writer recognizes these placeholder RecordBatches and skips writing them, so they never appear in the output stream or file.

This is necessary because the Reader and Writer don't know about each other when they're communicating via the Node and DOM stream I/O primitives; they only know about the values pushed through the streams. Since the RecordBatchReader and Writer don't yield the Schema message as a standalone value, we pump the stream with a zero-length RecordBatch that carries the schema instead.
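As a rough illustration of the writer side, here is a sketch under the assumption that the placeholder is distinguishable by type, as the conversation above suggests; the class and function names here are hypothetical stand-ins:

```typescript
import { RecordBatch } from 'apache-arrow';

// Hypothetical stand-in for the internal placeholder class; the real
// name in this PR is `_InternalEmptyPlaceholderRecordBatch`.
class EmptyPlaceholderRecordBatch extends RecordBatch {}

// Sketch of the writer-side decision: drop only the schema-carrying
// placeholder, so genuine zero-length RecordBatches (exercised by the
// length-0 integration test) are still written.
function shouldWriteBatch(batch: RecordBatch): boolean {
  return !(batch instanceof EmptyPlaceholderRecordBatch);
}
```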