Fix OOM in regex for large regex quantifier#26543
Conversation
| case RegexNode.Onelazy: | ||
| if (curNode._m > 0) | ||
|
|
||
| int cutoff = 1_000_000; // In release, cutoff at a length to which we can reasonably construct a string |
There was a problem hiding this comment.
const int Cutoff =
#if DEBUG
50;
#else
1_000_000;
#endifThere was a problem hiding this comment.
This would mean people using debug builds and RegexOptions.Debug will need to edit this code and build a private in order to get accurate regex debug tracing.
There was a problem hiding this comment.
Can you clarify @joshfree ? You mean if they want to get the same cutoff as release, but compiled for debug? I guess so. I think that is rarely going to be needed.
There was a problem hiding this comment.
Yup.
when investigating a bug in regex you can pass RegexOptions.Debug to a debug build and get trace output. See https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs#L132-L155
There was a problem hiding this comment.
Since we don't ship debug builds, someone doing such an investigation would be building it themselves, at which point they could make this whatever they want if it's a problem. That said, I don't have a preference whether we use a different value for debug or not.
There was a problem hiding this comment.
Why do we differ between DEBUG and non-DEBUG here? More precisely, why do we disable this optimization for quantifiers > 50 on DEBUG?
There was a problem hiding this comment.
Why do we differ between DEBUG and non-DEBUG here?
This kind of thing is done sometimes to increase the chances that both code paths are appropriately tested. I'm assuming that's what Dan had in mind here.
There was a problem hiding this comment.
Right. I was once bitten by a case where we called a Win32 API with a buffer that was always large enough in testing, and the "try again with larger buffer" path was bad.
|
|
||
| int cutoff = 1_000_000; // In release, cutoff at a length to which we can reasonably construct a string | ||
| #if DEBUG | ||
| cutoff = 50; // In debug, use a smaller cutoff to exercise the cutoff path |
There was a problem hiding this comment.
Note: this is fairly small. This would mean a totally reasonably regex of "a{100,110}" would fail. I don't know what contexts debug builds are used in. but something to be potentially worried about. Perhaps another way would be to have a static and allow tests to change it.
There was a problem hiding this comment.
Hopefully not fail - this just disables the prefix optimization for >50.
There was a problem hiding this comment.
Re static - we try to avoid more use of internalsvisibleto
| yield return new object[] { @"a{100}b", new string('a', 100) + "bc", RegexOptions.None, 0, 102, true, new string('a', 100) + "b" }; | ||
|
|
||
| yield return new object[] { @"a{11}b", new string('a', 10) + "bc", RegexOptions.None, 0, 12, false, string.Empty }; | ||
| yield return new object[] { @"a{101}b", new string('a', 100) + "bc", RegexOptions.None, 0, 102, false, string.Empty }; |
There was a problem hiding this comment.
can we have tests with quantifiers like {a,b} and {a,}
There was a problem hiding this comment.
Will do. I discovered that we do not support {,n} for some reason. Unfortunately it is treated as a literal (not error) so we probably can't switch it on. Workaround is {0,n}.
There was a problem hiding this comment.
Yeah, i thought that was odd myself. But it's not in the Regex reference, so it's at least expected as per our docs.
ViktorHofer
left a comment
There was a problem hiding this comment.
LGTM. As Stephen already commented, please change cutoff to be a const int.
Thanks for the fix!
|
@dotnet-bot test Windows x86 Release Build please (tests ran fine, but helix only started them just before the timeout) |
|
@dotnet/dnceng the tests in https://ci3.dot.net/job/dotnet_corefx/job/master/job/windows-TGroup_netcoreapp+CGroup_Release+AGroup_x86+TestOuter_false_prtest/7958/ seem to have completed OK, but the system got jammed and eventually timed out? Also note that the "Xunit result XML" links just go to the homepage. @Anipik can you take a look at those logs to check that it's not Dumpling doing it? There are various errors like |
|
@dotnet-bot test NETFX x86 Release Build please (issue) |
|
@dotnet/dnceng I could use a pointer here. The log shows 3 configurations timed out. If I look at the details though, it seems the tests did not start until too late. Eg Windows.10.Amd64.Open:Release-x86 has but in the runner log I see For some reason the tests didn't get kicked off until the timeout happened! |
|
@danmosemsft the test did finish, but it looks like XUnit or Runtests.cmd hung and did not actually exit thereafter. DUmpling even tried to make a dump of this hang but it's unfortunately lost. |
|
@danmosemsft |
|
@MattGal I added logging of the test being run and all of them are completing quickly. What I can't understand is why the tests don't START running until AFTER the infrastructure has timed out and FINISHED. (based on the run_client.py) |
|
@dotnet/dnceng |
|
@dotnet/dnceng I will need help on this one as above |
|
@MattGal or @ChadNedzlek could one of you take a look at the issue Dan is running into? |
|
@danmosemsft The windows jobs all ran multiple times. That's why there are multiple "run_client.py" logs. For 1 of them, we couldn't find the zip file (something I think @MattGal might have a fix for). The others had one run that timed out, and we reran it, only for the second one to also timeout. Both run_client.py logs indicate a timeout, and in both cases, the second run_client.py logs' time envelope lines up with the console logs. |
|
@ChadNedzlek reading logs, I see only missing .dmp files (meaning Dumpling is not functioning properly); the workitem-payload-zip bug was rolled out last Friday and seems to have fixed that bug, at least in this incarnation of it. |
|
@dotnet/dnceng @MattGal @ChadNedzlek now jobs are failing like this: |
|
We're taking a look at this |
|
@dotnet-bot test Windows x86 Release Build |
|
We restarted Jenkins and the leg I manually triggered got past the error. Should be able to try again now. |
|
@dotnet-bot test Windows x86 Release Build |
43037e9 to
5cf57fc
Compare
|
Dan, I removed the merge conflict (after my PR came in) and pushed again. Would we great to get that in for 2.1. |
|
Thanks for the reminder about this one, I'll take a look at that failure .. |
|
Seems to be an infrastructure issue. Tests all passed in the log: |
|
@dotnet-bot test windows x86 release build |
I'll take a look, but we've had lots of problems in this space with tests leaking processes and failing for that reason recently (including an invariant assert which due to working too much was removed); I'll take a look to be sure. Please tag @dnceng every time you have a fresh thing you consider an infra problem as we're very keen to investigate these. (or ping me directly if you're not getting traction from one of us) |
|
@danmosemsft , @ViktorHofer this is definitely not an infrastructure issue, rather it's the same thing Dan has been dealing with recently, specifically you leak a dotnet.exe that doesn't exit within 20 minutes, and given the logging whose tests are probably ignored in your results. If you run your test directly using RunTests.cmd, you'll note it leaks a dotnet.exe. If you attach to this dotnet.exe and break (I let it run ~2 minutes) you'll find it hung in places like: ... to my untrained eye, it would seem that this long-running loop might run for a long time: (also, SourceLink is AMAZING) for (; ;)
{
if (scan == beforefirst || pattern[match] != pattern[scan])
{
// at the end of the match, note the difference in _positive
// this is not the length of the match, but the distance from the internal match
// to the tail suffix.
if (Positive[match] == 0)
Positive[match] = match - scan;
// System.Diagnostics.Debug.WriteLine("Set positive[" + match + "] to " + (match - scan));
break;
}
scan -= bump;
match -= bump;
}We've definitely discussed having some kind of poison environment variable so we can list the reasons that a work item ended up timing out, we simply haven't had the opportunity to implement such a thing yet. |
|
Yes this is for me to fix. The regex doesn't OOM now but on x86 it takes too long presumably because of paging. I will dial back the size or add a timeout. |
|
Put the Dispose in -- which now I recall was the reason I put this aside and started the change to cause missing Disposes to crash instead of leak... Thanks @MattGal for pointing this out. |
|
@dotnet-bot test OSX x64 Debug Build |
|
Tizen is known. OSX is Http tests again. |
* Fix OOM in regex for large quantifier
* tidy
* tidy
* cr
* {,n} is literal
* Disable for NETFX
* Temporarily track test name
* Hanging test
* Reduce slow case test
* Revert logging
* Omitted dispose
* Merge error
Commit migrated from dotnet/corefx@18ea494
Fix #26484
As an optimization, the regex engine tries to establish a definite prefix that matches must have, if it can. For example "a{5,10}bcd" indicates all matches must begin with "aaaaa". If there is such a prefix, the engine can use Boyer-Moore to skip more quickly to the next plausible match. It is optional though (as far as I can tell).
As implemented the code constructs a string with the prefix. So if the quantifier is very large, it will run out of memory. This could be avoided with a larger change, but I'm making a minimal change to not make prefix strings larger than 10^6 characters, an arbitrarily large number that can still be formed into a string.