Apache Spark started using the univocity parser for CSV parsing in 2.0. Internally in Spark we don't use Java Strings but rather UTF-8 encoded bytes directly (we use a combination of on-heap byte arrays and off-heap memory). Spark currently reads CSV files in as bytes, converts them into Java Strings, runs them through univocity, and then converts the results back into UTF-8 bytes. This is pretty expensive.
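
To make the overhead concrete, here is a minimal sketch of the current round trip. The univocity calls (`CsvParserSettings`, `CsvParser.parseLine`) are real, but the buffer handling is simplified and the hand-off to Spark's internal row format is only indicated in a comment:

```java
import java.nio.charset.StandardCharsets;

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class RoundTripExample {
    public static void main(String[] args) {
        // Input arrives as UTF-8 encoded bytes (on-heap or off-heap in Spark).
        byte[] utf8Line = "a,b,c".getBytes(StandardCharsets.UTF_8);

        CsvParser parser = new CsvParser(new CsvParserSettings());

        // 1. Decode the UTF-8 bytes into a Java String (first copy + decode).
        String line = new String(utf8Line, StandardCharsets.UTF_8);

        // 2. Parse the String with univocity.
        String[] fields = parser.parseLine(line);

        // 3. Re-encode every field back into UTF-8 bytes (second copy + encode).
        for (String field : fields) {
            byte[] utf8Field = field.getBytes(StandardCharsets.UTF_8);
            // ... hand utf8Field to Spark's internal binary row format
        }
    }
}
```

Steps 1 and 3 exist only to satisfy univocity's String-based API; the data starts and ends as UTF-8 bytes.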
Is there a way to extend univocity to support operating directly on UTF-8 encoded bytes?
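
For illustration, a byte-oriented entry point might look something like the sketch below. This is purely hypothetical: `ByteCsvParser` and this `parseLine` signature do not exist in univocity today and are only meant to show the shape of what is being asked for.

```java
// Hypothetical API sketch -- not part of univocity.
// A byte-oriented parser would scan for delimiters and quotes directly in
// the UTF-8 buffer (both are ASCII, so no decoding is needed to find them)
// and report field boundaries instead of materializing Strings.
public interface ByteCsvParser {
    /**
     * Parses one CSV record from UTF-8 encoded bytes and returns
     * (offset, length) pairs for each field within the original buffer,
     * so the caller can slice fields without creating any Strings.
     */
    int[] parseLine(byte[] utf8, int offset, int length);
}
```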