
Operate directly on UTF-8 encoded bytes #99

@rxin


Apache Spark started using the univocity parser for CSV parsing in 2.0. Internally in Spark we don't use Java Strings but rather UTF-8 encoded bytes directly (we use a combination of on-heap byte arrays as well as off-heap memory). Spark currently reads CSV files in as bytes, converts them into Java Strings, runs them through univocity, and then converts the results back into UTF-8 bytes. This double conversion is pretty expensive.

Is there a way to extend univocity to support operating directly on UTF-8 encoded bytes?
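
For reference, here is a minimal sketch of the round-trip described above (the class and variable names are illustrative, not Spark's actual code; only the univocity calls are real API):

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        // Input data already sits in memory as UTF-8 encoded bytes.
        byte[] utf8Input = "a,b,c\n1,2,3\n".getBytes(StandardCharsets.UTF_8);

        CsvParser parser = new CsvParser(new CsvParserSettings());

        // Conversion 1: UTF-8 bytes are decoded into Java Strings (UTF-16),
        // because the parser only consumes character data via a Reader.
        parser.beginParsing(new InputStreamReader(
                new ByteArrayInputStream(utf8Input), StandardCharsets.UTF_8));

        String[] row;
        while ((row = parser.parseNext()) != null) {
            for (String field : row) {
                // Conversion 2: each parsed String field is re-encoded back
                // into UTF-8 bytes for Spark's internal row representation.
                byte[] utf8Field = field.getBytes(StandardCharsets.UTF_8);
                // ... utf8Field would then be stored in on-heap/off-heap memory ...
            }
        }
    }
}
```

A byte-oriented parsing mode would let both conversions be skipped when the input and output encodings are already UTF-8.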
