
Operate directly on UTF-8 encoded bytes #99

@rxin


Apache Spark started using the univocity parser for CSV parsing in 2.0. Internally in Spark we don't use Java Strings but rather UTF-8 encoded bytes directly (we use a combination of on-heap byte arrays as well as off-heap memory). Spark currently reads CSV files in as bytes, converts them into Java Strings, runs them through univocity, and then converts the results back into UTF-8 bytes. This double conversion is pretty expensive.

Is there a way to extend univocity to support operating directly on UTF-8 encoded bytes?
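
For reference, here is a minimal sketch of the round-trip described above (the class and variable names are illustrative, not Spark's actual code; only the univocity calls are real API):

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        // Input data already sits in memory as UTF-8 encoded bytes.
        byte[] utf8Input = "a,b,c\n1,2,3\n".getBytes(StandardCharsets.UTF_8);

        CsvParser parser = new CsvParser(new CsvParserSettings());

        // Conversion 1: UTF-8 bytes are decoded into Java Strings (UTF-16),
        // because the parser only consumes character data via a Reader.
        parser.beginParsing(new InputStreamReader(
                new ByteArrayInputStream(utf8Input), StandardCharsets.UTF_8));

        String[] row;
        while ((row = parser.parseNext()) != null) {
            for (String field : row) {
                // Conversion 2: each parsed String field is re-encoded back
                // into UTF-8 bytes for Spark's internal row representation.
                byte[] utf8Field = field.getBytes(StandardCharsets.UTF_8);
                // ... utf8Field would then be stored in on-heap/off-heap memory ...
            }
        }
    }
}
```

A byte-oriented parsing mode would let both conversions be skipped when the input and output encodings are already UTF-8.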
