-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Add Option-returning methods for parsing literals #6538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| //integral types | ||
| @inline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this inline? Seems like enough code that it should not be inlined in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be.
| def parseByte(from: String): Option[Byte] = { | ||
| val len = from.length() | ||
| @tailrec | ||
| def step(i: Int, agg: Int, isPositive: Boolean): Option[Byte] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is basically identical between Byte and Short. You should make it a private[this] method (maybe even using the Int variant if it isn't significantly slower) and check the range afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean have a private method returning Option[Int] (or Int) and bounds check afterwards? That's possible, but it would parse more than needed on overflow. Maybe a variation with a private method that passes in the bound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ichoran can you maybe help me with the benchmarks?
From the results I'm getting, "100".toByte is slower than
try {
Some("100".toByte)
}
catch {
case e: NumberFormatException => None
}
(on my machine 13.244 +/- 0.321 ns/op for the plain case vs 12.697 +/- 0.180 ns/op for the try catch with 10 warmup iterations and 30 measurement iterations of 1 second each)
|
I added simple benchmarks to show how hilariously slow the current checking of parseDouble is - and for convenience in discussing the PR. Should those be in the final PR? If so, where, and do they need to be harnassed up for neat result rendering like the others? |
|
bikeshed: I don't like Personally, I like |
| case "true" => Some(true) | ||
| case "false" => Some(false) | ||
| case _ => None | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
equalsIgnoreCase will have better performance.
| val endchar = unsigned.charAt(unsigned.length - 1) | ||
| if (endchar == 'f' || endchar == 'F' || endchar == 'd' || endchar == 'D') unsigned.substring(0, unsigned.length - 1) | ||
| else unsigned | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do floats and doubles get the suffix analysis, but not (that I can see) longs?
Is "5L" a valid long?
Is "0x10" a valid base-16 int?
Is "0x10L" a valid long?
Is "0777" valid? Is it in base-8 or base-10? What about "-0777" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paulp the behaviour of the new methods is in all cases the same as the old toInt/toLong/toDouble etc. Those are the same as the equivalent java methods.
In other words, this implementation is intended to have x.parseFoo == Try(x.toFoo).toOption == Try(java.lang.Foo.parseFoo(x)).toOption for all x != null, with (much) better runtime performance for the unhappy case.
The floating point conversions accept a different set of formats. Why the original distinction was made I don't know exactly (does IEEE 754 or related include a notation guide? If so, probably that, but that's a guess).
If this were a greenfields implementation, I might be tempted to argue over what trade-offs and formats I like, rather than that symmetry (I would, for example, not support any digits outside 7 bit ASCII, which are supported for the integral types, but not for the floating point types) - and probably also want to have consistency with scala literals.
But this is not, and I care about having these default parsers to Option a great deal more than about which consistencies are the most important ones to support, and which trade-offs are best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, while those are the reasons why I think the trade-offs I chose are the most palatable ones, I'm not married to them, and I'm happy to change to any other consensus that you or anyone else is able to find.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martijnhoekstra Thanks. I didn't realize you were reproducing the java behavior in all cases. So am I understanding correctly that you've reproduced all the java library logic solely to avoid the exception? If the reproduced inconsistencies are faithful to the java inconsistencies then that choice seems fine, but it's quite damning for the jvm architecture if the native parsing routines offer a maximum of one element from { performance, correctness }.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - the only exception I made is for null, which this implementation throws a NPE for (which would escape the Option), and the other implementations handle with a NumberFormatException for the integral types (but not for Double and Float, which also throw a NPE for the JDK version). ¯\_(ツ)_/¯
This is just failing much faster, and with less typing than Try("wait, this is not an Int at all!".toInt).toOption. Such a victory, I know.
ffcb6d6 to
96a2682
Compare
|
@martijnhoekstra - If you are getting strange results with benchmarks on things that have times of less than about 100 ns/op, I would first switch to a batch benchmark, where you do some dozens or hundreds of operations pulled out of an array. JMH tries its best to give good values, but you can often get weird differences in how well the JIT compiler can optimize the inner loop of the JMH measurement routine. If you can make it more expensive and still comparable, it's more likely to make sense. It is also possible that try/catching can be faster than direct reading of something that throws an uncaught exception due to different logic in handling the exception, but I haven't run into that very often. |
|
Benchmark results for the integral types on my machine: Raw output: A google doc with structured results: https://docs.google.com/spreadsheets/d/1hTYMSLCbUHztiLN6jue5KuqWhPwpxbgRvg1WyR8T5SM/edit?usp=sharing Summary for my machine: For all string sizes, with the exception of single digits, performance for a parse is a flat 3 to 4 nanoseconds slower than the equivalent try/catch in the successful case. This amounts to just 62% of the baseline speed for the string "-1", and more than 99% of the baseline speed on the string "-000000000200000000003123456" For single digit strings, the re-implementation is actually marginally faster than the baseline. For the unhappy path, the re-implementation is about a factor 300 faster in the test case, though this will in practice differ with stack depth, and this speedup is likely larger rather than smaller in real world scenarios. The adapted version of toInt on scala/native in these benchmarks is about 1 nanosecond faster for the all success cases but the single digit one, where it is 3 ns slower in this benchmark. But does a bit less work - it doesn't go through The adapted version of toByte from scala/native is slower in all cases, but particularly when detecting overflow. |
|
Seems pretty good to me! Also fairly expected--3 to 4 ns is a typical overhead for boxing into an |
c3840b0 to
1353443
Compare
|
Benchmarks for double/float validation are harder to interpret, but the troughput is about 0.65-0.7 of the bare case. Since the string needs to be traversed twice, this doesn't seem entirely unreasonable. Speedups for the unhappy path are still about a factor 100. |
|
@martijnhoekstra - Yeah, that's about what I remember getting. I wrote a huge pile of approximate FP parsing code for Grok and a different huge pile for Jsonal to do better (only deferring to the Java routines in tough cases), but they both are a huge amount of code and it's difficult to verify correctness (especially for Grok which uses big tables of precomputed numbers). I think the way you did it is the way to go for the standard library. |
3482e7e to
53636be
Compare
|
No longer WIP (though review may have something to say about that) |
|
rebased and force-pushed |
| @Test | ||
| def nullByte: Unit = try { | ||
| nullstring.parseByte | ||
| Assert.fail("null.parseByte throws NPE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Very nice work, @martijnhoekstra! The remaining open question seems to be the method names, @NthPortal suggests |
|
Except for parsing, the variant of exception-throwing method |
|
So no strong objections against |
includes cleanup of cruft/boyscout rule
|
@lrytz I rebased and squased some commits, but left the separate commit where I remove the benchmarks in, since that is at least of interest for this PR (results of those benchmarks are discussed in this ticket, so it makes sense to know what was discussed IMO). Let me know if that's what you want, or that I should slice and dice differently. |
lrytz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 thank you, @martijnhoekstra!
|
scala-collection-compat backport: scala/scala-collection-compat#566 |
Fixes SI-16
Follow-up to scala/collection-strawman#431
Needs checking and benchmarking against the alternative in scala/native, and the baroque check on the float format needs patching up - at least to not substring so much, but keep track of start and end indices.
An alternative could be to do a faster, easier less thorough check, and fall back to
Try#toOption, which would then only be really slow on false positives of the check.Trade-offs made in this PR that could be bike-shed over (but not by me, tell me the outcome, and I'll adapt):
(str: String) => str.parseInt == Try(str.toInt).toOptionholdsOption[Double], rather than useNaNfor signaling failure.parseInt- I'm also fine withtoIntOptionor any other scheme.Todo: