Skip to content

Conversation

@martijnhoekstra
Copy link
Contributor

@martijnhoekstra martijnhoekstra commented Apr 19, 2018

Fixes SI-16

Follow-up to scala/collection-strawman#431

Needs checking and benchmarking against the alternative in scala/native, and the baroque check on the float format needs patching up - at least to not substring so much, but keep track of start and end indices.

An alternative could be to do a faster, easier less thorough check, and fall back to Try#toOption, which would then only be really slow on false positives of the check.

Trade-offs made in this PR that could be bike-shed over (but not by me, tell me the outcome, and I'll adapt):

  • Parses all digits that Java parses, including (combinations of) more exotic scripts, not just 0-9, so that the invariant (str: String) => str.parseInt == Try(str.toInt).toOption holds
  • parseDouble parses to Option[Double], rather than use NaN for signaling failure.
  • null strings throw NPE
  • I like the name parseInt - I'm also fine with toIntOption or any other scheme.

Todo:

  • Homogenize benchmarks for good cross validation
  • Add jUnit tests for known edge-cases (possibly take from scala/native and scala-js)
  • Clean up unfortunate switches created by matches
  • Run benchmarks, post results
  • Clean up cruft (what to do with benchmarks? guidance requested)

}

//integral types
@inline
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this inline? Seems like enough code that it should not be inlined in most cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be.

def parseByte(from: String): Option[Byte] = {
val len = from.length()
@tailrec
def step(i: Int, agg: Int, isPositive: Boolean): Option[Byte] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is basically identical between Byte and Short. You should make it a private[this] method (maybe even using the Int variant if it isn't significantly slower) and check the range afterwards.

Copy link
Contributor Author

@martijnhoekstra martijnhoekstra Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean have a private method returning Option[Int] (or Int) and bounds check afterwards? That's possible, but it would parse more than needed on overflow. Maybe a variation with a private method that passes in the bound?

Copy link
Contributor Author

@martijnhoekstra martijnhoekstra Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ichoran can you maybe help me with the benchmarks?

From the results I'm getting, "100".toByte is slower than

try {
  Some("100".toByte)
}
catch {
  case e: NumberFormatException => None
}

(on my machine 13.244 +/- 0.321 ns/op for the plain case vs 12.697 +/- 0.180 ns/op for the try catch with 10 warmup iterations and 30 measurement iterations of 1 second each)

@martijnhoekstra
Copy link
Contributor Author

I added simple benchmarks to show how hilariously slow the current checking of parseDouble is - and for convenience in discussing the PR. Should those be in the final PR? If so, where, and do they need to be harnassed up for neat result rendering like the others?

@NthPortal
Copy link
Contributor

bikeshed: I don't like parse<type>. parse is sufficiently different from to that it almost seems like to<type> isn't parsing it or something.

Personally, I like tryTo<type>, though I could also live with to<type>Option.

@SethTisue SethTisue added the WIP label Apr 19, 2018
case "true" => Some(true)
case "false" => Some(false)
case _ => None
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equalsIgnoreCase will have better performance.

val endchar = unsigned.charAt(unsigned.length - 1)
if (endchar == 'f' || endchar == 'F' || endchar == 'd' || endchar == 'D') unsigned.substring(0, unsigned.length - 1)
else unsigned
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do floats and doubles get the suffix analysis, but not (that I can see) longs?
Is "5L" a valid long?
Is "0x10" a valid base-16 int?
Is "0x10L" a valid long?
Is "0777" valid? Is it in base-8 or base-10? What about "-0777" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paulp the behaviour of the new methods is in all cases the same as the old toInt/toLong/toDouble etc. Those are the same as the equivalent java methods.

In other words, this implementation is intended to have x.parseFoo == Try(x.toFoo).toOption == Try(java.lang.Foo.parseFoo(x)).toOption for all x != null, with (much) better runtime performance for the unhappy case.

The floating point conversions accept a different set of formats. Why the original distinction was made I don't know exactly (does IEEE 754 or related include a notation guide? If so, probably that, but that's a guess).

If this were a greenfields implementation, I might be tempted to argue over what trade-offs and formats I like, rather than that symmetry (I would, for example, not support any digits outside 7 bit ASCII, which are supported for the integral types, but not for the floating point types) - and probably also want to have consistency with scala literals.

But this is not, and I care about having these default parsers to Option a great deal more than about which consistencies are the most important ones to support, and which trade-offs are best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, while those are the reasons why I think the trade-offs I chose are the most palatable ones, I'm not married to them, and I'm happy to change to any other consensus that you or anyone else is able to find.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnhoekstra Thanks. I didn't realize you were reproducing the java behavior in all cases. So am I understanding correctly that you've reproduced all the java library logic solely to avoid the exception? If the reproduced inconsistencies are faithful to the java inconsistencies then that choice seems fine, but it's quite damning for the jvm architecture if the native parsing routines offer a maximum of one element from { performance, correctness }.

Copy link
Contributor Author

@martijnhoekstra martijnhoekstra Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - the only exception I made is for null, which this implementation throws a NPE for (which would escape the Option), and the other implementations handle with a NumberFormatException for the integral types (but not for Double and Float, which also throw a NPE for the JDK version). ¯\_(ツ)_/¯

This is just failing much faster, and with less typing than Try("wait, this is not an Int at all!".toInt).toOption. Such a victory, I know.

@martijnhoekstra martijnhoekstra force-pushed the SI-16 branch 2 times, most recently from ffcb6d6 to 96a2682 Compare April 20, 2018 14:22
@Ichoran
Copy link
Contributor

Ichoran commented Apr 22, 2018

@martijnhoekstra - If you are getting strange results with benchmarks on things that have times of less than about 100 ns/op, I would first switch to a batch benchmark, where you do some dozens or hundreds of operations pulled out of an array. JMH tries its best to give good values, but you can often get weird differences in how well the JIT compiler can optimize the inner loop of the JMH measurement routine. If you can make it more expensive and still comparable, it's more likely to make sense.

It is also possible that try/catching can be faster than direct reading of something that throws an uncaught exception due to different logic in handling the exception, but I haven't run into that very often.

@martijnhoekstra
Copy link
Contributor Author

Benchmark results for the integral types on my machine:

Raw output:


[info] Benchmark                                                    (parsee)  Mode  Cnt     Score    Error  Units
[info] ToAllIntegralsBenchmark.checkedToByte                                  avgt   40  1055.499 ▒  9.919  ns/op
[info] ToAllIntegralsBenchmark.checkedToByte                               1  avgt   40     7.139 ▒  0.031  ns/op
[info] ToAllIntegralsBenchmark.checkedToByte                              -1  avgt   40     7.721 ▒  0.057  ns/op
[info] ToAllIntegralsBenchmark.checkedToByte                            +010  avgt   40    14.939 ▒  0.641  ns/op
[info] ToAllIntegralsBenchmark.checkedToByte                        -0001000  avgt   40  1171.832 ▒  8.989  ns/op
[info] ToAllIntegralsBenchmark.checkedToByte                +000012345678912  avgt   40  1099.203 ▒ 10.522  ns/op
[info] ToAllIntegralsBenchmark.checkedToByte    -000000000200000000003123456  avgt   40  1125.148 ▒ 10.296  ns/op
[info] ToAllIntegralsBenchmark.checkedToInt                                   avgt   40  1038.656 ▒ 31.087  ns/op
[info] ToAllIntegralsBenchmark.checkedToInt                                1  avgt   40     6.871 ▒  0.034  ns/op
[info] ToAllIntegralsBenchmark.checkedToInt                               -1  avgt   40     7.141 ▒  0.048  ns/op
[info] ToAllIntegralsBenchmark.checkedToInt                             +010  avgt   40    13.272 ▒  0.059  ns/op
[info] ToAllIntegralsBenchmark.checkedToInt                         -0001000  avgt   40    20.795 ▒  0.242  ns/op
[info] ToAllIntegralsBenchmark.checkedToInt                 +000012345678912  avgt   40  1075.602 ▒ 13.646  ns/op
[info] ToAllIntegralsBenchmark.checkedToInt     -000000000200000000003123456  avgt   40  1097.770 ▒ 13.117  ns/op
[info] ToAllIntegralsBenchmark.checkedToLong                                  avgt   40  1022.257 ▒  8.984  ns/op
[info] ToAllIntegralsBenchmark.checkedToLong                               1  avgt   40     7.128 ▒  0.060  ns/op
[info] ToAllIntegralsBenchmark.checkedToLong                              -1  avgt   40     7.726 ▒  0.111  ns/op
[info] ToAllIntegralsBenchmark.checkedToLong                            +010  avgt   40    13.636 ▒  0.077  ns/op
[info] ToAllIntegralsBenchmark.checkedToLong                        -0001000  avgt   40    21.610 ▒  0.217  ns/op
[info] ToAllIntegralsBenchmark.checkedToLong                +000012345678912  avgt   40    40.798 ▒  0.137  ns/op
[info] ToAllIntegralsBenchmark.checkedToLong    -000000000200000000003123456  avgt   40    67.833 ▒  0.901  ns/op
[info] ToAllIntegralsBenchmark.checkedToShort                                 avgt   40  1044.435 ▒ 10.855  ns/op
[info] ToAllIntegralsBenchmark.checkedToShort                              1  avgt   40     7.138 ▒  0.063  ns/op
[info] ToAllIntegralsBenchmark.checkedToShort                             -1  avgt   40     7.717 ▒  0.042  ns/op
[info] ToAllIntegralsBenchmark.checkedToShort                           +010  avgt   40    13.651 ▒  0.053  ns/op
[info] ToAllIntegralsBenchmark.checkedToShort                       -0001000  avgt   40    20.996 ▒  0.066  ns/op
[info] ToAllIntegralsBenchmark.checkedToShort               +000012345678912  avgt   40  1125.459 ▒ 45.085  ns/op
[info] ToAllIntegralsBenchmark.checkedToShort   -000000000200000000003123456  avgt   40  1105.831 ▒  9.946  ns/op
[info] ToAllIntegralsBenchmark.parseByte                                      avgt   40     3.571 ▒  0.036  ns/op
[info] ToAllIntegralsBenchmark.parseByte                                   1  avgt   40     6.398 ▒  0.034  ns/op
[info] ToAllIntegralsBenchmark.parseByte                                  -1  avgt   40    11.195 ▒  0.177  ns/op
[info] ToAllIntegralsBenchmark.parseByte                                +010  avgt   40    16.957 ▒  0.030  ns/op
[info] ToAllIntegralsBenchmark.parseByte                            -0001000  avgt   40    20.567 ▒  0.194  ns/op
[info] ToAllIntegralsBenchmark.parseByte                    +000012345678912  avgt   40    21.859 ▒  0.240  ns/op
[info] ToAllIntegralsBenchmark.parseByte        -000000000200000000003123456  avgt   40    29.000 ▒  0.324  ns/op
[info] ToAllIntegralsBenchmark.parseByteNative                                avgt   40     3.352 ▒  0.051  ns/op
[info] ToAllIntegralsBenchmark.parseByteNative                             1  avgt   40    11.815 ▒  0.020  ns/op
[info] ToAllIntegralsBenchmark.parseByteNative                            -1  avgt   40    13.029 ▒  0.190  ns/op
[info] ToAllIntegralsBenchmark.parseByteNative                          +010  avgt   40    17.174 ▒  0.130  ns/op
[info] ToAllIntegralsBenchmark.parseByteNative                      -0001000  avgt   40    24.166 ▒  0.350  ns/op
[info] ToAllIntegralsBenchmark.parseByteNative              +000012345678912  avgt   40    34.451 ▒  0.055  ns/op
[info] ToAllIntegralsBenchmark.parseByteNative  -000000000200000000003123456  avgt   40    43.238 ▒  0.086  ns/op
[info] ToAllIntegralsBenchmark.parseInt                                       avgt   40     3.553 ▒  0.015  ns/op
[info] ToAllIntegralsBenchmark.parseInt                                    1  avgt   40     6.589 ▒  0.297  ns/op
[info] ToAllIntegralsBenchmark.parseInt                                   -1  avgt   40    11.478 ▒  0.044  ns/op
[info] ToAllIntegralsBenchmark.parseInt                                 +010  avgt   40    16.203 ▒  0.246  ns/op
[info] ToAllIntegralsBenchmark.parseInt                             -0001000  avgt   40    24.168 ▒  0.273  ns/op
[info] ToAllIntegralsBenchmark.parseInt                     +000012345678912  avgt   40    34.990 ▒  0.308  ns/op
[info] ToAllIntegralsBenchmark.parseInt         -000000000200000000003123456  avgt   40    42.876 ▒  0.169  ns/op
[info] ToAllIntegralsBenchmark.parseIntNative                                 avgt   40     3.085 ▒  0.014  ns/op
[info] ToAllIntegralsBenchmark.parseIntNative                              1  avgt   40     9.556 ▒  0.093  ns/op
[info] ToAllIntegralsBenchmark.parseIntNative                             -1  avgt   40    10.137 ▒  0.045  ns/op
[info] ToAllIntegralsBenchmark.parseIntNative                           +010  avgt   40    15.664 ▒  0.056  ns/op
[info] ToAllIntegralsBenchmark.parseIntNative                       -0001000  avgt   40    22.726 ▒  0.074  ns/op
[info] ToAllIntegralsBenchmark.parseIntNative               +000012345678912  avgt   40    35.158 ▒  0.332  ns/op
[info] ToAllIntegralsBenchmark.parseIntNative   -000000000200000000003123456  avgt   40    45.039 ▒  0.271  ns/op
[info] ToAllIntegralsBenchmark.parseLong                                      avgt   40     3.568 ▒  0.034  ns/op
[info] ToAllIntegralsBenchmark.parseLong                                   1  avgt   40     6.543 ▒  0.106  ns/op
[info] ToAllIntegralsBenchmark.parseLong                                  -1  avgt   40    12.139 ▒  0.045  ns/op
[info] ToAllIntegralsBenchmark.parseLong                                +010  avgt   40    16.881 ▒  0.029  ns/op
[info] ToAllIntegralsBenchmark.parseLong                            -0001000  avgt   40    25.604 ▒  0.134  ns/op
[info] ToAllIntegralsBenchmark.parseLong                    +000012345678912  avgt   40    42.787 ▒  0.199  ns/op
[info] ToAllIntegralsBenchmark.parseLong        -000000000200000000003123456  avgt   40    67.274 ▒  0.224  ns/op
[info] ToAllIntegralsBenchmark.parseShort                                     avgt   40     3.567 ▒  0.026  ns/op
[info] ToAllIntegralsBenchmark.parseShort                                  1  avgt   40     6.447 ▒  0.057  ns/op
[info] ToAllIntegralsBenchmark.parseShort                                 -1  avgt   40    11.488 ▒  0.043  ns/op
[info] ToAllIntegralsBenchmark.parseShort                               +010  avgt   40    15.395 ▒  0.160  ns/op
[info] ToAllIntegralsBenchmark.parseShort                           -0001000  avgt   40    25.727 ▒  0.186  ns/op
[info] ToAllIntegralsBenchmark.parseShort                   +000012345678912  avgt   40    25.420 ▒  0.089  ns/op
[info] ToAllIntegralsBenchmark.parseShort       -000000000200000000003123456  avgt   40    34.065 ▒  0.322  ns/op
[info] ToAllIntegralsBenchmark.toByte                                      1  avgt   40     7.105 ▒  0.030  ns/op
[info] ToAllIntegralsBenchmark.toByte                                     -1  avgt   40     7.696 ▒  0.035  ns/op
[info] ToAllIntegralsBenchmark.toByte                                   +010  avgt   40    13.550 ▒  0.040  ns/op
[info] ToAllIntegralsBenchmark.toInt                                       1  avgt   40     6.811 ▒  0.024  ns/op
[info] ToAllIntegralsBenchmark.toInt                                      -1  avgt   40     7.114 ▒  0.042  ns/op
[info] ToAllIntegralsBenchmark.toInt                                    +010  avgt   40    13.607 ▒  0.068  ns/op
[info] ToAllIntegralsBenchmark.toInt                                -0001000  avgt   40    20.639 ▒  0.095  ns/op
[info] ToAllIntegralsBenchmark.toLong                                      1  avgt   40     7.118 ▒  0.049  ns/op
[info] ToAllIntegralsBenchmark.toLong                                     -1  avgt   40     7.718 ▒  0.055  ns/op
[info] ToAllIntegralsBenchmark.toLong                                   +010  avgt   40    13.252 ▒  0.115  ns/op
[info] ToAllIntegralsBenchmark.toLong                               -0001000  avgt   40    21.081 ▒  0.098  ns/op
[info] ToAllIntegralsBenchmark.toLong                       +000012345678912  avgt   40    40.374 ▒  0.329  ns/op
[info] ToAllIntegralsBenchmark.toLong           -000000000200000000003123456  avgt   40    64.527 ▒  0.224  ns/op
[info] ToAllIntegralsBenchmark.toShort                                     1  avgt   40     7.097 ▒  0.013  ns/op
[info] ToAllIntegralsBenchmark.toShort                                    -1  avgt   40     7.680 ▒  0.014  ns/op
[info] ToAllIntegralsBenchmark.toShort                                  +010  avgt   40    13.403 ▒  0.030  ns/op
[info] ToAllIntegralsBenchmark.toShort                              -0001000  avgt   40    21.349 ▒  0.143  ns/op

A google doc with structured results: https://docs.google.com/spreadsheets/d/1hTYMSLCbUHztiLN6jue5KuqWhPwpxbgRvg1WyR8T5SM/edit?usp=sharing

Summary for my machine:

For all string sizes, with the exception of single digits, performance for a parse is a flat 3 to 4 nanoseconds slower than the equivalent try/catch in the successful case. This amounts to just 62% of the baseline speed for the string "-1", and more than 99% of the baseline speed on the string "-000000000200000000003123456"

For single digit strings, the re-implementation is actually marginally faster than the baseline.

For the unhappy path, the re-implementation is about a factor 300 faster in the test case, though this will in practice differ with stack depth, and this speedup is likely larger rather than smaller in real world scenarios.

The adapted version of toInt on scala/native in these benchmarks is about 1 nanosecond faster for the all success cases but the single digit one, where it is 3 ns slower in this benchmark. But does a bit less work - it doesn't go through StringOps. In a separate benchmark to eliminate this difference, the scala/native performance advantage disappears.

The adapted version of toByte from scala/native is slower in all cases, but particularly when detecting overflow.

@Ichoran
Copy link
Contributor

Ichoran commented Apr 24, 2018

Seems pretty good to me! Also fairly expected--3 to 4 ns is a typical overhead for boxing into an Option.

@martijnhoekstra martijnhoekstra force-pushed the SI-16 branch 2 times, most recently from c3840b0 to 1353443 Compare April 25, 2018 12:51
@martijnhoekstra
Copy link
Contributor Author

Benchmarks for double/float validation are harder to interpret, but the troughput is about 0.65-0.7 of the bare case. Since the string needs to be traversed twice, this doesn't seem entirely unreasonable.

Speedups for the unhappy path are still about a factor 100.

@Ichoran
Copy link
Contributor

Ichoran commented Apr 25, 2018

@martijnhoekstra - Yeah, that's about what I remember getting. I wrote a huge pile of approximate FP parsing code for Grok and a different huge pile for Jsonal to do better (only deferring to the Java routines in tough cases), but they both are a huge amount of code and it's difficult to verify correctness (especially for Grok which uses big tables of precomputed numbers). I think the way you did it is the way to go for the standard library.

@martijnhoekstra martijnhoekstra force-pushed the SI-16 branch 2 times, most recently from 3482e7e to 53636be Compare April 25, 2018 16:24
@martijnhoekstra martijnhoekstra changed the title WIP: add optional parsing to Byte/Short/Int/Long/Float/Double Add optional parsing to Byte/Short/Int/Long/Float/Double Apr 26, 2018
@martijnhoekstra
Copy link
Contributor Author

No longer WIP (though review may have something to say about that)

@martijnhoekstra
Copy link
Contributor Author

rebased and force-pushed

@Test
def nullByte: Unit = try {
nullstring.parseByte
Assert.fail("null.parseByte throws NPE")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lrytz
Copy link
Member

lrytz commented May 29, 2018

Very nice work, @martijnhoekstra! The remaining open question seems to be the method names, @NthPortal suggests tryToInt or toIntOption instead of parseInt (#6538 (comment)). I also prefer these names over parse... Any more opinions?

@Ichoran
Copy link
Contributor

Ichoran commented May 29, 2018

Except for parsing, the variant of exception-throwing method x which instead returns an Option is called xOption. (headOption, etc.). So I favor toIntOption and friends.

@martijnhoekstra
Copy link
Contributor Author

So no strong objections against toIntOption. If that doesn't change until tomorrow, I'll change that (and squash everything?)

@martijnhoekstra
Copy link
Contributor Author

@lrytz toXOption seems to be a variant everyone can live with, and also the most popular one, so i renamed to that.

I rebased and squased some commits, but left the separate commit where I remove the benchmarks in, since that is at least of interest for this PR (results of those benchmarks are discussed in this ticket, so it makes sense to know what was discussed IMO).

Let me know if that's what you want, or that I should slice and dice differently.

Copy link
Member

@lrytz lrytz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 thank you, @martijnhoekstra!

@lrytz lrytz merged commit b7fef0e into scala:2.13.x May 30, 2018
@lrytz lrytz added the release-notes worth highlighting in next release notes label May 30, 2018
@SethTisue SethTisue changed the title Add optional parsing to Byte/Short/Int/Long/Float/Double Add Option-returning method for parsing literals Aug 22, 2018
@SethTisue SethTisue changed the title Add Option-returning method for parsing literals Add Option-returning methods for parsing literals Aug 22, 2018
@SethTisue
Copy link
Member

scala-collection-compat backport: scala/scala-collection-compat#566

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes worth highlighting in next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

String.toInt, .toDouble should allow a default parameter

8 participants