Provide a simple string matcher as the dual of the simple string interpolator #7387

lihaoyi · 2018-11-02T16:35:49Z

Examples:

val s"Hello, $name" = "Hello, James"
println(name)  // "James"

val s"$greeting, $name" = "Hello, James"
println(greeting)  // "Hello"
println(name)  // "James"

val TimeSplitter = "([0-9]+)[.:]([0-9]+)".r
val s"The time is ${TimeSplitter(hours, mins)}" = "The time is 10.50"
println(hours) // 10
println(mins) // 50

This is slightly random, but I think it fits very well as a counterpart
to the existing s"..." interpolator, has a very simple definition, has
lived in the ammonite-ops library as the r interpolator for a long
time, and I have generally found it very useful.

It supports simple globbing matches, without any fancy regex
group-matching or whatever. This is a nice fit for the simple semantics
of the existing s interpolator.

One subtlety is that in the case of ambiguity, the implementation gives
the characters to later matches rather than earlier matches:

scala> val s"$a-$b-$c-$d" = "1-2-3-4-5-6-7-8-9"
a: String = 1
b: String = 2
c: String = 3
d: String = 4-5-6-7-8-9

This is somewhat arbitrary. This is for simple matches; for anything
complex or ambiguous, you can use a full regex.

Fancier regex matching, e.g. using regex-syntax in the literal parts of the string or regex capturing groups to the bound variables, is out of scope for this PR. Maybe it can be done separately, but even if it does appear I think there's value in a simple matcher that people can use without worrying about needing to escape their .s and other regex special-characters.

Notably, this cannot be implemented in user-land right now as an extension method on StringContext, because the presence of the s method stops the extension method from being picked up.

What do people think? If it seems plausible that we could get this into the standard library, I'll be happy to add a few unit tests and do whatever polish is necessary.

src/library/scala/StringContext.scala

som-snytt · 2018-11-02T22:03:31Z

A pragmatic proposal in the best sense.

One can ask if a core feature should be most "performant" because it's used all the time, or can be casual because folks who care will roll their own.

lihaoyi · 2018-11-02T23:18:28Z

No reason why this shouldn't be performant; I just thought it would be nice to get some feedback before spending the time implementing a linear time string-globbing algorithm from someone's blog (e.g. https://research.swtch.com/glob). If people are happy with the idea I'll happily make it fast

mkeskells · 2018-11-05T08:32:41Z

@lihaoyi could this be done as a macro, or at least a macro façade?
It should be more performant, but it is hard to make it a macro later as it will break binary compatibility.
Even if we just make it a macro that calls another function for now - then we can improve the macro, to cherry-pick the cases that are improvable. I think that we will need the run-time function anyway where the format is not a literal

The binary compatibility issue is why the s interpolator changes had to wait until 2.13 - https://github.com/scala/scala-dev/issues/475 and https://github.com/scala/scala/pull/6093/files worked around limitations for 2.12

SethTisue · 2018-11-06T16:18:13Z

this is cool, I'm in favor. anyone else on the Lightbend team want to encourage or discourage Haoyi from pursuing this further?

lihaoyi · 2018-11-07T00:05:55Z

yeah it shouldn't be hard to make it a macro, if people think that's the right thing to do

smarter · 2018-11-07T00:18:03Z

The 2.13 standard library is supposed to cross-compile between 2.13 and Dotty. Any macro addition complicates this (right now there's one macro in the standard library, the f string interpolator, which is already causing us headaches).

lihaoyi · 2018-11-07T08:14:30Z

I have swapped in a linear (or at least non-exponential) time algorithm for globbing, adapted from @rsc's blog. I suspect that this should be reasonably fast and good enough for most use cases; if someone wants a macro-powered NFA-generated-at-compile-time implementation to eek out the last bit of performance and be the fastest string matching algorithm on the JVM, they can implement that themselves (as I have done in FastParse for similar cases)

Notably, the current implementation doesn't really require any heavy compilation of regexes, so it's unclear how much a macro doing stuff at compile time can really help us.

I don't really know why it works, but it does. Added a moderately thorough test suite to walk through the various edge conditions in the glob matching implementation and it seems to do the right thing.

I haven't actually built/compiled/tested the scala/scala project, but have tested the code standalone, so hopefully I didn't mess up the copy-paste!

lihaoyi · 2018-11-07T08:21:31Z

src/library/scala/StringContext.scala

+      .toIndexedSeq
+      .reverse
+      .drop(1)
+      .dropRight(1)


Any suggestions on how to write this bit better performance while still being reasonably elegant would be welcome

lihaoyi · 2018-11-07T08:21:55Z

src/library/scala/StringContext.scala

+    val matchEnds = Array.fill(patternChunks.length - 1)(-1)
+
+    // -1 is the magic marker number, since it cannot be a normal scala.Char
+    val pattern = patternChunks.flatMap(-1 +: _.toCharArray.map(_.toShort)).drop(1)


This bit too. We don't seem to have any mkString equivalent for non-string sequences

I often find myself wanting something a bit more generic like intercalate.
https://github.com/typelevel/cats/blob/master/core/src/main/scala/cats/Foldable.scala#L545

lihaoyi · 2018-11-07T08:23:24Z

test/junit/scala/lang/stringinterpol/StringContextTest.scala

+      (StringContext.glob(Seq("", ""), " ab ")    -> Some(Seq(" ab "))),
+
+      // Ambiguous lonely wildcard
+      (StringContext.glob(Seq("", "", ""), "ab")      -> Some(Seq("", "ab"))),


Note that unlike the previous regex-based implementation, in case of ambiguity the later wildcards are prioritized over the earlier wildcards when characters can go to either

Maybe document that stars are reluctant: val s"${a}x${b}" = "axxxb" gives a, xxb instead of axx. Maybe add that unit test sample, since I think in the covering case, I can't tell which b is shown. Or no, L338 is a,b not ab,empty.

lihaoyi · 2018-11-07T08:25:42Z

src/library/scala/StringContext.scala

+          matchEnds(n) = matchEnds(n) match{
+            case -1 => inputIndex
+            case s => math.max(s, inputIndex)
+          }


For every patternIndex which matches one of the wildcards in matchIndices, we keep track of the earliest and latest parts of the input where inputIndex has been, and use that as the range of characters the wildcard has consumed.

lihaoyi · 2018-11-07T08:26:26Z

src/library/scala/StringContext.scala

+    val nameLength = input.length
+
+    while(patternIndex < patternLength || inputIndex < nameLength) {
+      matchIndices.indexOf(patternIndex) match{


Not sure if it's worth getting rid of this indexOf in favor of direct lookup, but I guess I can do that if necessary

lihaoyi · 2018-11-07T08:27:32Z

I've left comments on the bits of the algorithm I could use help in writing better or more elegantly. Any suggestions are welcome; feel free to just push commits to this PR if you have some improvement in mind

lihaoyi · 2018-11-07T09:27:08Z

Not sure what that build failure is; any ideas?

lrytz · 2018-11-07T11:51:28Z

I like that feature too.

Pushed a change that will fix the test failure.

lihaoyi · 2018-11-07T12:32:10Z

Pushed some tweaks to optimize the pre-calculation of lookup tables that we use in the main algorithm. Hopefully there shouldn't be any gross inefficiencies anymore...

lihaoyi · 2018-11-07T13:16:55Z

huh not sure why the test is failing in scala/scala master when it passes in scala 2.12.7; is there some easy command to run just that test?

lihaoyi · 2018-11-07T13:22:18Z

sbt test/test is giving me [error] SERVER ERROR: GATEWAY_TIMEOUT url=https://repo.lightbend.com/typesafe/scala-sha-bootstrap/org/scala-lang/bootstrap/02fe2ed93766323a13f22c7a7e2ecdcd84259b6c/test/files/lib/annotations.jar

lrytz · 2018-11-07T13:26:33Z

Our sbt build is a bit special... https://github.com/scala/scala#using-the-sbt-build. testOnly *StringContextTest should probably work.

lrytz · 2018-11-07T13:28:57Z

(or just run it in intellij)

lihaoyi · 2018-11-07T14:42:08Z

k fixed it, validate-main passes

lihaoyi · 2018-11-09T13:56:02Z

this is basically ready for review

lihaoyi · 2018-11-26T15:19:08Z

is anyone home?

som-snytt

Clever usage! Maybe decide on code formatting before squashing? I'll take another look at the algo. There are a couple of minor artifacts like nameLength or nx where I went huh? but nbd.

som-snytt · 2018-11-26T17:12:14Z

src/library/scala/StringContext.scala

+    * @return None if there is no match, Some containing the sequence of matched
+    *         wildcard strings if there is a match 
+    */
+  def glob(patternChunks: Seq[String], input: String): Option[Seq[String]] = {


Should be private, although of general utility, just to avoid committing to it as API?

test/junit/scala/lang/stringinterpol/StringContextTest.scala

som-snytt · 2018-11-26T18:05:39Z

test/junit/scala/lang/stringinterpol/StringContextTest.scala

+      (StringContext.glob(Seq("", ""), " ab ")    -> Some(Seq(" ab "))),
+
+      // Ambiguous lonely wildcard
+      (StringContext.glob(Seq("", "", ""), "ab")      -> Some(Seq("", "ab"))),


Maybe document that stars are reluctant: val s"${a}x${b}" = "axxxb" gives a, xxb instead of axx. Maybe add that unit test sample, since I think in the covering case, I can't tell which b is shown. Or no, L338 is a,b not ab,empty.

som-snytt · 2018-11-26T18:10:47Z

I'm not actually home.

I just saw the previous invitation to push improvements to comments; maybe I'll follow up with adding some spaces, too.

lihaoyi · 2018-11-27T04:05:25Z

@som-snytt yeah feel free to push whatever changes you want; easier than going back and forth

SethTisue

I'm in favor of merging this once @som-snytt finishes waving his magic wand over it

but I'd like another member of the Scala team to take a last look and do the final merge

lihaoyi · 2018-12-17T10:55:56Z

Just to follow up here, after another two weeks. Is there anything I should do here? Or does approval mean it's out of my hands and just waiting for someone to press the button

som-snytt · 2018-12-17T16:48:08Z

I didn't get around to fiddling with comments, which doesn't matter because RC is for re-commenting, but they will require squashing. Seth asked someone with a bigger wand to do an incantation.

I said: "Maybe decide on code formatting before squashing?" Conditional approval is no doubt a bad practice.

lrytz

Rebased, pushed some tiny syntax cleanups. LGTM, thanks a lot @lihaoyi!

Examples: ```scala val s"Hello, $name" = "Hello, James" println(name) // "James" ``` ```scala val s"$greeting, $name" = "Hello, James" println(greeting) // "Hello" println(name) // "James" ``` ```scala val TimeSplitter = "([0-9]+)[.:]([0-9]+)".r val s"The time is ${TimeSplitter(hours, mins)}" = "The time is 10.50" println(hours) // 10 println(mins) // 50 ``` This is slightly random, but I think it fits very well as a counterpart to the existing `s"..."` interpolator, has a very simple definition, has lived in the ammonite-ops library as the `r` interpolator for a long time, and I have generally found it very useful. It supports simple globbing matches, without any fancy regex group-matching or whatever. This is a nice fit for the simple semantics of the existing `s` interpolator. One subtlety is that in the case of ambiguity, the implementation gives the characters to later matches rather than earlier matches: ```scala scala> val s"$a-$b-$c-$d" = "1-2-3-4-5-6-7-8-9" a: String = 1 b: String = 2 c: String = 3 d: String = 4-5-6-7-8-9 ``` This is somewhat arbitrary. This is for simple matches; for anything complex or ambiguous, you can use a full regex.

lihaoyi · 2018-12-20T16:27:03Z

Thanks lukas, sorry for being a pest!

lrytz · 2018-12-21T07:27:43Z

Not at all! We know that the PR queue size and delays are a little long these days, but we're actually glad when contributors push for their changes to get over the finish line :-)

scala-jenkins added this to the 2.13.0-RC1 milestone Nov 2, 2018

som-snytt reviewed Nov 2, 2018

View reviewed changes

src/library/scala/StringContext.scala Outdated Show resolved Hide resolved

som-snytt reviewed Nov 2, 2018

View reviewed changes

src/library/scala/StringContext.scala Outdated Show resolved Hide resolved

som-snytt reviewed Nov 2, 2018

View reviewed changes

src/library/scala/StringContext.scala Outdated Show resolved Hide resolved

lihaoyi commented Nov 7, 2018

View reviewed changes

som-snytt approved these changes Nov 26, 2018

View reviewed changes

SethTisue reviewed Dec 4, 2018

View reviewed changes

smarter mentioned this pull request Dec 5, 2018

Add sym shorthand for Symbol #7495

Merged

lrytz approved these changes Dec 20, 2018

View reviewed changes

lrytz merged commit cccfb26 into scala:2.13.x Dec 20, 2018

lrytz added the release-notes worth highlighting in next release notes label Dec 21, 2018

rminehan mentioned this pull request Oct 17, 2023

Dual string interpolator not processing strings with escaped characters scala/bug#12893

Closed

Provide a simple string matcher as the dual of the simple string interpolator #7387

Provide a simple string matcher as the dual of the simple string interpolator #7387

Conversation

lihaoyi commented Nov 2, 2018 • edited by lrytz Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

som-snytt commented Nov 2, 2018

Uh oh!

lihaoyi commented Nov 2, 2018

Uh oh!

mkeskells commented Nov 5, 2018

Uh oh!

SethTisue commented Nov 6, 2018

Uh oh!

lihaoyi commented Nov 7, 2018

Uh oh!

smarter commented Nov 7, 2018

Uh oh!

lihaoyi commented Nov 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lihaoyi commented Nov 7, 2018

Uh oh!

lihaoyi commented Nov 7, 2018

Uh oh!

lrytz commented Nov 7, 2018

Uh oh!

lihaoyi commented Nov 7, 2018

Uh oh!

lihaoyi commented Nov 7, 2018

Uh oh!

lihaoyi commented Nov 7, 2018

Uh oh!

lrytz commented Nov 7, 2018

Uh oh!

lrytz commented Nov 7, 2018

Uh oh!

lihaoyi commented Nov 7, 2018

Uh oh!

lihaoyi commented Nov 9, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lihaoyi commented Nov 26, 2018

Uh oh!

som-snytt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

som-snytt commented Nov 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lihaoyi commented Nov 27, 2018

Uh oh!

SethTisue left a comment

Choose a reason for hiding this comment

Uh oh!

lihaoyi commented Dec 17, 2018

Uh oh!

som-snytt commented Dec 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

lihaoyi commented Nov 2, 2018 •

edited by lrytz

Loading

lihaoyi commented Nov 7, 2018 •

edited

Loading

lihaoyi commented Nov 9, 2018 •

edited

Loading

som-snytt commented Nov 26, 2018 •

edited

Loading

som-snytt commented Dec 17, 2018 •

edited

Loading