-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Provide a simple string matcher as the dual of the simple string interpolator #7387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
A pragmatic proposal in the best sense. One can ask if a core feature should be most "performant" because it's used all the time, or can be casual because folks who care will roll their own. |
|
No reason why this shouldn't be performant; I just thought it would be nice to get some feedback before spending the time implementing a linear time string-globbing algorithm from someone's blog (e.g. https://research.swtch.com/glob). If people are happy with the idea I'll happily make it fast |
|
@lihaoyi could this be done as a macro, or at least a macro façade? The binary compatibility issue is why the |
|
this is cool, I'm in favor. anyone else on the Lightbend team want to encourage or discourage Haoyi from pursuing this further? |
|
yeah it shouldn't be hard to make it a macro, if people think that's the right thing to do |
|
The 2.13 standard library is supposed to cross-compile between 2.13 and Dotty. Any macro addition complicates this (right now there's one macro in the standard library, the f string interpolator, which is already causing us headaches). |
|
I have swapped in a linear (or at least non-exponential) time algorithm for globbing, adapted from @rsc's blog. I suspect that this should be reasonably fast and good enough for most use cases; if someone wants a macro-powered NFA-generated-at-compile-time implementation to eek out the last bit of performance and be the fastest string matching algorithm on the JVM, they can implement that themselves (as I have done in FastParse for similar cases) Notably, the current implementation doesn't really require any heavy compilation of regexes, so it's unclear how much a macro doing stuff at compile time can really help us. I don't really know why it works, but it does. Added a moderately thorough test suite to walk through the various edge conditions in the glob matching implementation and it seems to do the right thing. I haven't actually built/compiled/tested the scala/scala project, but have tested the code standalone, so hopefully I didn't mess up the copy-paste! |
| .toIndexedSeq | ||
| .reverse | ||
| .drop(1) | ||
| .dropRight(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestions on how to write this bit better performance while still being reasonably elegant would be welcome
| val matchEnds = Array.fill(patternChunks.length - 1)(-1) | ||
|
|
||
| // -1 is the magic marker number, since it cannot be a normal scala.Char | ||
| val pattern = patternChunks.flatMap(-1 +: _.toCharArray.map(_.toShort)).drop(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit too. We don't seem to have any mkString equivalent for non-string sequences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I often find myself wanting something a bit more generic like intercalate.
https://github.com/typelevel/cats/blob/master/core/src/main/scala/cats/Foldable.scala#L545
| (StringContext.glob(Seq("", ""), " ab ") -> Some(Seq(" ab "))), | ||
|
|
||
| // Ambiguous lonely wildcard | ||
| (StringContext.glob(Seq("", "", ""), "ab") -> Some(Seq("", "ab"))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that unlike the previous regex-based implementation, in case of ambiguity the later wildcards are prioritized over the earlier wildcards when characters can go to either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe document that stars are reluctant: val s"${a}x${b}" = "axxxb" gives a, xxb instead of axx. Maybe add that unit test sample, since I think in the covering case, I can't tell which b is shown. Or no, L338 is a,b not ab,empty.
| matchEnds(n) = matchEnds(n) match{ | ||
| case -1 => inputIndex | ||
| case s => math.max(s, inputIndex) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For every patternIndex which matches one of the wildcards in matchIndices, we keep track of the earliest and latest parts of the input where inputIndex has been, and use that as the range of characters the wildcard has consumed.
| val nameLength = input.length | ||
|
|
||
| while(patternIndex < patternLength || inputIndex < nameLength) { | ||
| matchIndices.indexOf(patternIndex) match{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's worth getting rid of this indexOf in favor of direct lookup, but I guess I can do that if necessary
|
I've left comments on the bits of the algorithm I could use help in writing better or more elegantly. Any suggestions are welcome; feel free to just push commits to this PR if you have some improvement in mind |
|
Not sure what that build failure is; any ideas? |
|
I like that feature too. Pushed a change that will fix the test failure. |
|
Pushed some tweaks to optimize the pre-calculation of lookup tables that we use in the main algorithm. Hopefully there shouldn't be any gross inefficiencies anymore... |
|
huh not sure why the test is failing in scala/scala master when it passes in scala 2.12.7; is there some easy command to run just that test? |
|
|
|
Our sbt build is a bit special... https://github.com/scala/scala#using-the-sbt-build. |
|
(or just run it in intellij) |
|
k fixed it, validate-main passes |
|
this is basically ready for review |
|
is anyone home? |
som-snytt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever usage! Maybe decide on code formatting before squashing? I'll take another look at the algo. There are a couple of minor artifacts like nameLength or nx where I went huh? but nbd.
| * @return None if there is no match, Some containing the sequence of matched | ||
| * wildcard strings if there is a match | ||
| */ | ||
| def glob(patternChunks: Seq[String], input: String): Option[Seq[String]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be private, although of general utility, just to avoid committing to it as API?
| (StringContext.glob(Seq("", ""), " ab ") -> Some(Seq(" ab "))), | ||
|
|
||
| // Ambiguous lonely wildcard | ||
| (StringContext.glob(Seq("", "", ""), "ab") -> Some(Seq("", "ab"))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe document that stars are reluctant: val s"${a}x${b}" = "axxxb" gives a, xxb instead of axx. Maybe add that unit test sample, since I think in the covering case, I can't tell which b is shown. Or no, L338 is a,b not ab,empty.
|
I'm not actually home. I just saw the previous invitation to push improvements to comments; maybe I'll follow up with adding some spaces, too. |
|
@som-snytt yeah feel free to push whatever changes you want; easier than going back and forth |
SethTisue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favor of merging this once @som-snytt finishes waving his magic wand over it
but I'd like another member of the Scala team to take a last look and do the final merge
|
Just to follow up here, after another two weeks. Is there anything I should do here? Or does approval mean it's out of my hands and just waiting for someone to press the button |
|
I didn't get around to fiddling with comments, which doesn't matter because I said: "Maybe decide on code formatting before squashing?" Conditional approval is no doubt a bad practice. |
lrytz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebased, pushed some tiny syntax cleanups. LGTM, thanks a lot @lihaoyi!
Examples:
```scala
val s"Hello, $name" = "Hello, James"
println(name) // "James"
```
```scala
val s"$greeting, $name" = "Hello, James"
println(greeting) // "Hello"
println(name) // "James"
```
```scala
val TimeSplitter = "([0-9]+)[.:]([0-9]+)".r
val s"The time is ${TimeSplitter(hours, mins)}" = "The time is 10.50"
println(hours) // 10
println(mins) // 50
```
This is slightly random, but I think it fits very well as a counterpart
to the existing `s"..."` interpolator, has a very simple definition, has
lived in the ammonite-ops library as the `r` interpolator for a long
time, and I have generally found it very useful.
It supports simple globbing matches, without any fancy regex
group-matching or whatever. This is a nice fit for the simple semantics
of the existing `s` interpolator.
One subtlety is that in the case of ambiguity, the implementation gives
the characters to later matches rather than earlier matches:
```scala
scala> val s"$a-$b-$c-$d" = "1-2-3-4-5-6-7-8-9"
a: String = 1
b: String = 2
c: String = 3
d: String = 4-5-6-7-8-9
```
This is somewhat arbitrary. This is for simple matches; for anything
complex or ambiguous, you can use a full regex.
|
Thanks lukas, sorry for being a pest! |
|
Not at all! We know that the PR queue size and delays are a little long these days, but we're actually glad when contributors push for their changes to get over the finish line :-) |
Examples:
This is slightly random, but I think it fits very well as a counterpart
to the existing
s"..."interpolator, has a very simple definition, haslived in the ammonite-ops library as the
rinterpolator for a longtime, and I have generally found it very useful.
It supports simple globbing matches, without any fancy regex
group-matching or whatever. This is a nice fit for the simple semantics
of the existing
sinterpolator.One subtlety is that in the case of ambiguity, the implementation gives
the characters to later matches rather than earlier matches:
This is somewhat arbitrary. This is for simple matches; for anything
complex or ambiguous, you can use a full regex.
Fancier regex matching, e.g. using regex-syntax in the literal parts of the string or regex capturing groups to the bound variables, is out of scope for this PR. Maybe it can be done separately, but even if it does appear I think there's value in a simple matcher that people can use without worrying about needing to escape their
.s and other regex special-characters.Notably, this cannot be implemented in user-land right now as an extension method on
StringContext, because the presence of thesmethod stops the extension method from being picked up.What do people think? If it seems plausible that we could get this into the standard library, I'll be happy to add a few unit tests and do whatever polish is necessary.