Improving the CSV schema inference #2580

@bezbac

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There's an open issue in the datafusion repository regarding CSV schema inference. The current implementation in arrow returns Int64 as the data type for any numeric column that has no decimals and doesn't match a date format. This causes problems when the CSV is read later, should a value overflow the Int64 data type.

Here's the datafusion issue apache/datafusion#3174
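To make the failure mode concrete, here is a minimal illustration in plain Rust (no arrow dependency; the helper names are made up for this example): a field with no decimal point can still overflow Int64, e.g. u64::MAX has 20 digits.

```rust
// A numeric CSV field with no decimal point can still overflow Int64:
// a column inferred as Int64 will then fail to parse at read time.
fn fits_int64(field: &str) -> bool {
    field.parse::<i64>().is_ok()
}

fn fits_uint64(field: &str) -> bool {
    field.parse::<u64>().is_ok()
}
```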

Describe the solution you'd like
Maybe arrow could additionally support the UInt64 and Decimal128 data types, should it notice that the values inside the CSV are too large for Int64. Or it could even default to String, should it notice that even these types are too small, to ensure the CSV can be read without problems.
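A widening inference along those lines could look like the following sketch (plain Rust, not the actual arrow-rs inference code; the `InferredType` name is made up here, and Decimal128 handling is omitted for brevity): each value is assigned the narrowest type that can hold it, and the column takes the widest type seen.

```rust
// Candidate types ordered from narrowest to widest; the derived Ord
// follows declaration order, so `max` picks the widest type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum InferredType {
    Int64,
    UInt64,
    Utf8,
}

// Narrowest type that can represent a single field.
fn infer_value(field: &str) -> InferredType {
    if field.parse::<i64>().is_ok() {
        InferredType::Int64
    } else if field.parse::<u64>().is_ok() {
        InferredType::UInt64
    } else {
        InferredType::Utf8
    }
}

// The column's type is the widest type needed by any of its values.
fn infer_column<'a>(values: impl Iterator<Item = &'a str>) -> InferredType {
    values.map(infer_value).max().unwrap_or(InferredType::Int64)
}
```

A real implementation would slot Decimal128 between UInt64 and Utf8 in the ordering and add the corresponding parse check.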

Describe alternatives you've considered
Alternatively, I imagine the column's type could be "upgraded" while reading the CSV, should there be any parsing errors due to overflow. I imagine this would require all previously parsed values to be cast, which could hopefully be avoided given better inference results.
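The upgrade-on-overflow alternative can be sketched like this (again plain Rust with hypothetical names, not the arrow-rs reader): the column is read as Int64 until a value overflows, at which point everything parsed so far is re-cast to the wider type. The re-cast of the existing values is exactly the cost mentioned above.

```rust
// A column that started as Int64 but may be upgraded to UInt64 mid-read.
#[derive(Debug, PartialEq)]
enum Column {
    Int64(Vec<i64>),
    UInt64(Vec<u64>),
}

fn read_column<'a>(values: impl Iterator<Item = &'a str>) -> Result<Column, String> {
    let mut ints: Vec<i64> = Vec::new();
    let mut iter = values;
    while let Some(v) = iter.next() {
        match v.parse::<i64>() {
            Ok(n) => ints.push(n),
            Err(_) => {
                // Overflowed Int64: upgrade the column to UInt64, re-casting
                // every previously parsed value. Negative values cannot be
                // re-cast, so mixed signed/overflowing columns still fail.
                let mut wide: Vec<u64> = Vec::with_capacity(ints.len() + 1);
                for n in ints {
                    wide.push(u64::try_from(n).map_err(|e| e.to_string())?);
                }
                wide.push(v.parse::<u64>().map_err(|e| e.to_string())?);
                for v in iter {
                    wide.push(v.parse::<u64>().map_err(|e| e.to_string())?);
                }
                return Ok(Column::UInt64(wide));
            }
        }
    }
    Ok(Column::Int64(ints))
}
```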

Additional context
I'd be open to implementing this change. My naive approach looks something like this: 4b3104e. If anyone here has suggestions on how to improve it, I'd be very happy to hear them.

Labels

    enhancement: Any new improvement worthy of an entry in the changelog
