arrow_json: support decimal 128 and 256 types in json writer#5197
arrow_json: support decimal 128 and 256 types in json writer#5197matthewgapp wants to merge 4 commits intoapache:masterfrom
Conversation
| # arrow-data = { workspace = true } | ||
| # arrow-schema = { workspace = true } | ||
|
|
||
| arrow-buffer = { version = "49" } |
There was a problem hiding this comment.
temporary change while testing in another system
| explicit_nulls: bool, | ||
| ) -> Result<(), ArrowError> { | ||
| let options = FormatOptions::default(); | ||
| let formatter = ArrayFormatter::try_new(array.as_ref(), &options)?; |
There was a problem hiding this comment.
this formatter approach was lifted from the date-type writer logic above
There was a problem hiding this comment.
This will result in precision loss, decimals need to use the formatting logic in arrow-cast. I'm not entirely sure how to support this within the constraint of a serde_json value
There was a problem hiding this comment.
hey @tustvold, i'm confused - the ArrayFormatter is from arrow_cast and, from what I can tell, uses the format_decimal logic in this impl (from arrow_cast)
arrow-rs/arrow-cast/src/display.rs
Line 441 in bc96ca3
Let me know what I'm missing! Thank you
There was a problem hiding this comment.
@tustvold did you mean to reference this method in the arrow_array crate?
arrow-rs/arrow-cast/src/display.rs
Line 441 in bc96ca3
There was a problem hiding this comment.
The problem is you are then parsing to a f64 which will result in precision loss. As serde_json value lacks a decimal representation you may be forced to encode decimals as JSON strings to avoid this
There was a problem hiding this comment.
Writing non-native numeric quantities as string in JSON is a fairly standard approach to workaround the limited precision many readers have by default, serde_json included. Protobuf even serializes u64 to strings as part of its JSON specification. This is not just because the JSON specification only mandated double precision floating point, but because there are real performance disadvantages to doing otherwise.
I agree using a string is not a good thing, ideally we wouldn't be relying on serde_json to serialize, but aside from a fairly major rewrite of this code to not usw serde_json value, it seems to me like a pragmatic way to get something, which is better than not supporting anything?
Edit: FWIW I filed the ticket for actually de-serdifying the writer - #5314
There was a problem hiding this comment.
Thanks for the context. 5314 which builds on top of the impressive #3479 looks like a great step forward. In the meantime, I appreciate that writing to string is the correct thing to do, but I recommend that we balance the ideal technical solution with how it will be used in the real world and that user experience.
For practical purposes, I strongly suggest that we provide the user an option to write decimals as json numbers. We could do this in a non-breaking way by adding a pub JsonWriter that provides a method to set the decimal decoding as number (the default being string). This would ensure that by default, the correct thing is performed but the user can opt in for the practical parsing to number, saving them a ton of headache downstream.
For example, our downstream use case would require this feature; otherwise we'd have to maintain a fork of the crate that lets us serialize decimals as number; otherwise we'd have to layer in additional parsing on each field to see if string fields could actually be parsed to numbers (which would be commonly the case since we're dealing with small decimal-type numbers at the source).
There was a problem hiding this comment.
Cool, will that implementation enable us to write decimal types to json as a number?
|
Hi @matthewgapp , I was wondering if you'd have any plans to continue working on this PR, please? |
Hey @saeedzareian apologies about ghosting this. I'll pick this up again this week. It's blocking our downstream project. |
|
#5318 has now been merged and should give us flexibility to format decimals directly |
|
Hey, I would also be interested in this feature, are the next steps here known? Can I help? FWIW, I think I'd prefer decimals to be printed as strings, since otherwise it's easier for the reader of the json to screw things up when reading (if you print a decimal like "10.3" as |
|
Closing as this PR has been inactive for a while |
|
Implemented in #6606 |
Which issue does this PR close?
Closes #5198
Rationale for this change
This PR will enable support for Decimal type columns to json as floats
What changes are included in this PR?
This PR uses similar logic that's used to write arrow date types to write the decimal types to json.
Are there any user-facing changes?