Skip to content

Conversation

@lihaoyi
Copy link
Contributor

@lihaoyi lihaoyi commented Nov 2, 2018

Examples:

val s"Hello, $name" = "Hello, James"
println(name)  // "James"
val s"$greeting, $name" = "Hello, James"
println(greeting)  // "Hello"
println(name)  // "James"
val TimeSplitter = "([0-9]+)[.:]([0-9]+)".r
val s"The time is ${TimeSplitter(hours, mins)}" = "The time is 10.50"
println(hours) // 10
println(mins) // 50

This is slightly random, but I think it fits very well as a counterpart
to the existing s"..." interpolator, has a very simple definition, has
lived in the ammonite-ops library as the r interpolator for a long
time, and I have generally found it very useful.

It supports simple globbing matches, without any fancy regex
group-matching or whatever. This is a nice fit for the simple semantics
of the existing s interpolator.

One subtlety is that in the case of ambiguity, the implementation gives
the characters to later matches rather than earlier matches:

scala> val s"$a-$b-$c-$d" = "1-2-3-4-5-6-7-8-9"
a: String = 1
b: String = 2
c: String = 3
d: String = 4-5-6-7-8-9

This is somewhat arbitrary. This is for simple matches; for anything
complex or ambiguous, you can use a full regex.

Fancier regex matching, e.g. using regex-syntax in the literal parts of the string or regex capturing groups to the bound variables, is out of scope for this PR. Maybe it can be done separately, but even if it does appear I think there's value in a simple matcher that people can use without worrying about needing to escape their .s and other regex special-characters.

Notably, this cannot be implemented in user-land right now as an extension method on StringContext, because the presence of the s method stops the extension method from being picked up.

What do people think? If it seems plausible that we could get this into the standard library, I'll be happy to add a few unit tests and do whatever polish is necessary.

@scala-jenkins scala-jenkins added this to the 2.13.0-RC1 milestone Nov 2, 2018
@som-snytt
Copy link
Contributor

A pragmatic proposal in the best sense.

One can ask if a core feature should be most "performant" because it's used all the time, or can be casual because folks who care will roll their own.

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 2, 2018

No reason why this shouldn't be performant; I just thought it would be nice to get some feedback before spending the time implementing a linear time string-globbing algorithm from someone's blog (e.g. https://research.swtch.com/glob). If people are happy with the idea I'll happily make it fast

@mkeskells
Copy link
Contributor

@lihaoyi could this be done as a macro, or at least a macro façade?
It should be more performant, but it is hard to make it a macro later as it will break binary compatibility.
Even if we just make it a macro that calls another function for now - then we can improve the macro, to cherry-pick the cases that are improvable. I think that we will need the run-time function anyway where the format is not a literal

The binary compatibility issue is why the s interpolator changes had to wait until 2.13 - https://github.com/scala/scala-dev/issues/475 and https://github.com/scala/scala/pull/6093/files worked around limitations for 2.12

@SethTisue
Copy link
Member

this is cool, I'm in favor. anyone else on the Lightbend team want to encourage or discourage Haoyi from pursuing this further?

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 7, 2018

yeah it shouldn't be hard to make it a macro, if people think that's the right thing to do

@smarter
Copy link
Member

smarter commented Nov 7, 2018

The 2.13 standard library is supposed to cross-compile between 2.13 and Dotty. Any macro addition complicates this (right now there's one macro in the standard library, the f string interpolator, which is already causing us headaches).

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 7, 2018

I have swapped in a linear (or at least non-exponential) time algorithm for globbing, adapted from @rsc's blog. I suspect that this should be reasonably fast and good enough for most use cases; if someone wants a macro-powered NFA-generated-at-compile-time implementation to eek out the last bit of performance and be the fastest string matching algorithm on the JVM, they can implement that themselves (as I have done in FastParse for similar cases)

Notably, the current implementation doesn't really require any heavy compilation of regexes, so it's unclear how much a macro doing stuff at compile time can really help us.

I don't really know why it works, but it does. Added a moderately thorough test suite to walk through the various edge conditions in the glob matching implementation and it seems to do the right thing.

I haven't actually built/compiled/tested the scala/scala project, but have tested the code standalone, so hopefully I didn't mess up the copy-paste!

.toIndexedSeq
.reverse
.drop(1)
.dropRight(1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any suggestions on how to write this bit better performance while still being reasonably elegant would be welcome

val matchEnds = Array.fill(patternChunks.length - 1)(-1)

// -1 is the magic marker number, since it cannot be a normal scala.Char
val pattern = patternChunks.flatMap(-1 +: _.toCharArray.map(_.toShort)).drop(1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit too. We don't seem to have any mkString equivalent for non-string sequences

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(StringContext.glob(Seq("", ""), " ab ") -> Some(Seq(" ab "))),

// Ambiguous lonely wildcard
(StringContext.glob(Seq("", "", ""), "ab") -> Some(Seq("", "ab"))),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that unlike the previous regex-based implementation, in case of ambiguity the later wildcards are prioritized over the earlier wildcards when characters can go to either

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe document that stars are reluctant: val s"${a}x${b}" = "axxxb" gives a, xxb instead of axx. Maybe add that unit test sample, since I think in the covering case, I can't tell which b is shown. Or no, L338 is a,b not ab,empty.

matchEnds(n) = matchEnds(n) match{
case -1 => inputIndex
case s => math.max(s, inputIndex)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For every patternIndex which matches one of the wildcards in matchIndices, we keep track of the earliest and latest parts of the input where inputIndex has been, and use that as the range of characters the wildcard has consumed.

val nameLength = input.length

while(patternIndex < patternLength || inputIndex < nameLength) {
matchIndices.indexOf(patternIndex) match{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it's worth getting rid of this indexOf in favor of direct lookup, but I guess I can do that if necessary

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 7, 2018

I've left comments on the bits of the algorithm I could use help in writing better or more elegantly. Any suggestions are welcome; feel free to just push commits to this PR if you have some improvement in mind

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 7, 2018

Not sure what that build failure is; any ideas?

@lrytz
Copy link
Member

lrytz commented Nov 7, 2018

I like that feature too.

Pushed a change that will fix the test failure.

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 7, 2018

Pushed some tweaks to optimize the pre-calculation of lookup tables that we use in the main algorithm. Hopefully there shouldn't be any gross inefficiencies anymore...

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 7, 2018

huh not sure why the test is failing in scala/scala master when it passes in scala 2.12.7; is there some easy command to run just that test?

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 7, 2018

sbt test/test is giving me [error] SERVER ERROR: GATEWAY_TIMEOUT url=https://repo.lightbend.com/typesafe/scala-sha-bootstrap/org/scala-lang/bootstrap/02fe2ed93766323a13f22c7a7e2ecdcd84259b6c/test/files/lib/annotations.jar

@lrytz
Copy link
Member

lrytz commented Nov 7, 2018

Our sbt build is a bit special... https://github.com/scala/scala#using-the-sbt-build. testOnly *StringContextTest should probably work.

@lrytz
Copy link
Member

lrytz commented Nov 7, 2018

(or just run it in intellij)

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 7, 2018

k fixed it, validate-main passes

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 9, 2018

this is basically ready for review

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 26, 2018

is anyone home?

Copy link
Contributor

@som-snytt som-snytt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever usage! Maybe decide on code formatting before squashing? I'll take another look at the algo. There are a couple of minor artifacts like nameLength or nx where I went huh? but nbd.

* @return None if there is no match, Some containing the sequence of matched
* wildcard strings if there is a match
*/
def glob(patternChunks: Seq[String], input: String): Option[Seq[String]] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be private, although of general utility, just to avoid committing to it as API?

(StringContext.glob(Seq("", ""), " ab ") -> Some(Seq(" ab "))),

// Ambiguous lonely wildcard
(StringContext.glob(Seq("", "", ""), "ab") -> Some(Seq("", "ab"))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe document that stars are reluctant: val s"${a}x${b}" = "axxxb" gives a, xxb instead of axx. Maybe add that unit test sample, since I think in the covering case, I can't tell which b is shown. Or no, L338 is a,b not ab,empty.

@som-snytt
Copy link
Contributor

som-snytt commented Nov 26, 2018

I'm not actually home.

I just saw the previous invitation to push improvements to comments; maybe I'll follow up with adding some spaces, too.

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Nov 27, 2018

@som-snytt yeah feel free to push whatever changes you want; easier than going back and forth

Copy link
Member

@SethTisue SethTisue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of merging this once @som-snytt finishes waving his magic wand over it

but I'd like another member of the Scala team to take a last look and do the final merge

@lihaoyi
Copy link
Contributor Author

lihaoyi commented Dec 17, 2018

Just to follow up here, after another two weeks. Is there anything I should do here? Or does approval mean it's out of my hands and just waiting for someone to press the button

@som-snytt
Copy link
Contributor

som-snytt commented Dec 17, 2018

I didn't get around to fiddling with comments, which doesn't matter because RC is for re-commenting, but they will require squashing. Seth asked someone with a bigger wand to do an incantation.

I said: "Maybe decide on code formatting before squashing?" Conditional approval is no doubt a bad practice.

Copy link
Member

@lrytz lrytz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased, pushed some tiny syntax cleanups. LGTM, thanks a lot @lihaoyi!

Examples:

```scala
val s"Hello, $name" = "Hello, James"
println(name)  // "James"
```

```scala
val s"$greeting, $name" = "Hello, James"
println(greeting)  // "Hello"
println(name)  // "James"
```

```scala
val TimeSplitter = "([0-9]+)[.:]([0-9]+)".r
val s"The time is ${TimeSplitter(hours, mins)}" = "The time is 10.50"
println(hours) // 10
println(mins) // 50
```

This is slightly random, but I think it fits very well as a counterpart
to the existing `s"..."` interpolator, has a very simple definition, has
lived in the ammonite-ops library as the `r` interpolator for a long
time, and I have generally found it very useful.

It supports simple globbing matches, without any fancy regex
group-matching or whatever. This is a nice fit for the simple semantics
of the existing `s` interpolator.

One subtlety is that in the case of ambiguity, the implementation gives
the characters to later matches rather than earlier matches:

```scala
scala> val s"$a-$b-$c-$d" = "1-2-3-4-5-6-7-8-9"
a: String = 1
b: String = 2
c: String = 3
d: String = 4-5-6-7-8-9
```

This is somewhat arbitrary. This is for simple matches; for anything
complex or ambiguous, you can use a full regex.
@lrytz lrytz merged commit cccfb26 into scala:2.13.x Dec 20, 2018
@lihaoyi
Copy link
Contributor Author

lihaoyi commented Dec 20, 2018

Thanks lukas, sorry for being a pest!

@lrytz
Copy link
Member

lrytz commented Dec 21, 2018

Not at all! We know that the PR queue size and delays are a little long these days, but we're actually glad when contributors push for their changes to get over the finish line :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes worth highlighting in next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants