Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Fix OOM in regex for large regex quantifier#26543

Merged
danmoseley merged 12 commits intodotnet:masterfrom
danmoseley:regex.oom
Mar 1, 2018
Merged

Fix OOM in regex for large regex quantifier#26543
danmoseley merged 12 commits intodotnet:masterfrom
danmoseley:regex.oom

Conversation

@danmoseley
Copy link
Member

Fix #26484

As an optimization, the regex engine tries to establish a definite prefix that matches must have, if it can. For example "a{5,10}bcd" indicates all matches must begin with "aaaaa". If there is such a prefix, the engine can use Boyer-Moore to skip more quickly to the next plausible match. It is optional though (as far as I can tell).

As implemented the code constructs a string with the prefix. So if the quantifier is very large, it will run out of memory. This could be avoided with a larger change, but I'm making a minimal change to not make prefix strings larger than 10^6 characters, an arbitrarily large number that can still be formed into a string.

case RegexNode.Onelazy:
if (curNode._m > 0)

int cutoff = 1_000_000; // In release, cutoff at a length to which we can reasonably construct a string
Copy link
Member

@stephentoub stephentoub Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const int Cutoff =
#if DEBUG
    50;
#else 
    1_000_000;
#endif

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would mean people using debug builds and RegexOptions.Debug will need to edit this code and build a private in order to get accurate regex debug tracing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify @joshfree ? You mean if they want to get the same cutoff as release, but compiled for debug? I guess so. I think that is rarely going to be needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't ship debug builds, someone doing such an investigation would be building it themselves, at which point they could make this whatever they want if it's a problem. That said, I don't have a preference whether we use a different value for debug or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we differ between DEBUG and non-DEBUG here? More precisely, why do we disable this optimization for quantifiers > 50 on DEBUG?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we differ between DEBUG and non-DEBUG here?

This kind of thing is done sometimes to increase the chances that both code paths are appropriately tested. I'm assuming that's what Dan had in mind here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I was once bitten by a case where we called a Win32 API with a buffer that was always large enough in testing, and the "try again with larger buffer" path was bad.


int cutoff = 1_000_000; // In release, cutoff at a length to which we can reasonably construct a string
#if DEBUG
cutoff = 50; // In debug, use a smaller cutoff to exercise the cutoff path

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is fairly small. This would mean a totally reasonably regex of "a{100,110}" would fail. I don't know what contexts debug builds are used in. but something to be potentially worried about. Perhaps another way would be to have a static and allow tests to change it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully not fail - this just disables the prefix optimization for >50.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re static - we try to avoid more use of internalsvisibleto

yield return new object[] { @"a{100}b", new string('a', 100) + "bc", RegexOptions.None, 0, 102, true, new string('a', 100) + "b" };

yield return new object[] { @"a{11}b", new string('a', 10) + "bc", RegexOptions.None, 0, 12, false, string.Empty };
yield return new object[] { @"a{101}b", new string('a', 100) + "bc", RegexOptions.None, 0, 102, false, string.Empty };

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have tests with quantifiers like {a,b} and {a,}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. I discovered that we do not support {,n} for some reason. Unfortunately it is treated as a literal (not error) so we probably can't switch it on. Workaround is {0,n}.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i thought that was odd myself. But it's not in the Regex reference, so it's at least expected as per our docs.

Copy link
Member

@ViktorHofer ViktorHofer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. As Stephen already commented, please change cutoff to be a const int.

Thanks for the fix!

@danmoseley
Copy link
Member Author

@dotnet-bot test Windows x86 Release Build please (tests ran fine, but helix only started them just before the timeout)

@danmoseley
Copy link
Member Author

@dotnet/dnceng the tests in https://ci3.dot.net/job/dotnet_corefx/job/master/job/windows-TGroup_netcoreapp+CGroup_Release+AGroup_x86+TestOuter_false_prtest/7958/ seem to have completed OK, but the system got jammed and eventually timed out? Also note that the "Xunit result XML" links just go to the homepage.

Example
https://mc.dot.net/#/user/danmosemsft/pr~2Fjenkins~2Fdotnet~2Fcorefx~2Fmaster~2F/test~2Ffunctional~2Fcli~2F/5c04bb1c4a3f0e60fd0b0f9fede590594877335b/workItem/System.Text.RegularExpressions.Tests/wilogs

@Anipik can you take a look at those logs to check that it's not Dumpling doing it? There are various errors like Could not find C on disk. and Access Denied. I know the site is down.

@danmoseley
Copy link
Member Author

@dotnet-bot test NETFX x86 Release Build please (issue)
@dotnet-bot test Windows x86 Release Build

@danmoseley
Copy link
Member Author

@dotnet/dnceng I could use a pointer here. The log shows 3 configurations timed out. If I look at the details though, it seems the tests did not start until too late. Eg Windows.10.Amd64.Open:Release-x86 has
2018-02-05 17:53:04,858: INFO: scriptrunner(33): _main: BuildTools Helix Script Runner v0.1 starting

but in the runner log I see
018-02-05 17:33:01,989: INFO: executor(386): _execute_command: running %HELIX_PYTHONPATH% %HELIX_CORRELATION_PAYLOAD%\RunnerScripts\scriptrunner\scriptrunner.py --script RunTests.cmd %HELIX_CORRELATION_PAYLOAD% in C:\dotnetbuild\work\9b1c0b15-cfee-4360-8566-ced8b66915bb\Work\56a31538-1178-46a1-a15b-78c08ff1e74e\Exec max 1200 seconds
...
2018-02-05 17:53:03,026: ERROR: executor(397): _execute_command: Executor timed out after 1200 seconds and was killed.

For some reason the tests didn't get kicked off until the timeout happened!

https://mc.dot.net/#/user/danmosemsft/pr~2Fjenkins~2Fdotnet~2Fcorefx~2Fmaster~2F/test~2Ffunctional~2Fcli~2F/5c04bb1c4a3f0e60fd0b0f9fede590594877335b/workItem/System.Text.RegularExpressions.Tests/wilogs

@MattGal
Copy link
Member

MattGal commented Feb 5, 2018

@danmosemsft the test did finish, but it looks like XUnit or Runtests.cmd hung and did not actually exit thereafter. DUmpling even tried to make a dump of this hang but it's unfortunately lost.

@Anipik
Copy link

Anipik commented Feb 5, 2018

@danmosemsft
This is due the fact that dumpling.py script (which we download from the GitHub) is present on helix machine but then it tries to install the debugger (cdb) by making call to dumpling website which is offline hence not able to download the debugger, hance not able to create dumps which uses that downloaded debugger path.

@danmoseley
Copy link
Member Author

danmoseley commented Feb 6, 2018

@MattGal I added logging of the test being run and all of them are completing quickly. What I can't understand is why the tests don't START running until AFTER the infrastructure has timed out and FINISHED. (based on the run_client.py)

https://mc.dot.net/#/user/danmosemsft/pr~2Fjenkins~2Fdotnet~2Fcorefx~2Fmaster~2F/test~2Ffunctional~2Fcli~2F/43037e91b6963e0c64e38f904d49fd4f2fbc1468/workItem/System.Text.RegularExpressions.Tests/wilogs

2018-02-06 00:53:57,275: ERROR: job(44): kill: Job running for too long. Killing...
2018-02-06 00:53:57,275: ERROR: executor(397): _execute_command: Executor timed out after 1200 seconds and was killed.
2018-02-06 00:53:57,275: INFO: event(43): send: Sending event type WorkItemTimeout
2018-02-06 00:53:57,430: INFO: saferequests(90): request_with_retry: Response complete with status code '201'
2018-02-06 00:53:57,430: INFO: executor(416): _execute_command: Finished _execute_command, exit code: -3
Executed on wuw81o7sj000E66
2018-02-06 00:54:06,723: INFO: scriptrunner(33): _main: BuildTools Helix Script Runner v0.1 starting
2018-02-06 00:54:06,723: INFO: helix_test_execution(24): __init__: Copying execution payload files from C:\dotnetbuild\work\a44152d5-1c71-48c1-9628-37f4b91e1664\Work\97c4ebf8-cfdc-4f71-bfe9-a9511a55e1a0\Unzip to C:\dotnetbuild\work\a44152d5-1c71-48c1-9628-37f4b91e1664\Work\97c4ebf8-cfdc-4f71-bfe9-a9511a55e1a0\Exec\execution
...
2018-02-06 00:54:13,836: INFO: proc(54): run_and_log_output: Output:    System.Text.RegularExpressions.Tests  Total: 831, Errors: 0, Failed: 0, Skipped: 0, Time: 5.923s
2018-02-06 00:54:13,979: INFO: proc(54): run_and_log_output: Output: Finished running tests.  End time= 0:54:13.97, Exit code = 0

@danmoseley
Copy link
Member Author

@dotnet/dnceng

@danmoseley
Copy link
Member Author

@dotnet/dnceng I will need help on this one as above

@Chrisboh
Copy link
Member

Chrisboh commented Feb 7, 2018

@MattGal or @ChadNedzlek could one of you take a look at the issue Dan is running into?

@ChadNedzlek
Copy link

@danmosemsft The windows jobs all ran multiple times. That's why there are multiple "run_client.py" logs. For 1 of them, we couldn't find the zip file (something I think @MattGal might have a fix for).

The others had one run that timed out, and we reran it, only for the second one to also timeout. Both run_client.py logs indicate a timeout, and in both cases, the second run_client.py logs' time envelope lines up with the console logs.

@MattGal
Copy link
Member

MattGal commented Feb 7, 2018

@ChadNedzlek reading logs, I see only missing .dmp files (meaning Dumpling is not functioning properly); the workitem-payload-zip bug was rolled out last Friday and seems to have fixed that bug, at least in this incarnation of it.

@danmoseley danmoseley closed this Feb 8, 2018
@danmoseley danmoseley reopened this Feb 8, 2018
@danmoseley
Copy link
Member Author

danmoseley commented Feb 8, 2018

@dotnet/dnceng @MattGal @ChadNedzlek now jobs are failing like this:

Using context: Linux x64 Release Build
java.lang.NullPointerException
	at java.util.Collections$UnmodifiableCollection.<init>(Collections.java:1026)
	at java.util.Collections$UnmodifiableList.<init>(Collections.java:1302)
	at java.util.Collections.unmodifiableList(Collections.java:1287)
	at hudson.model.ParametersAction.getAllParameters(ParametersAction.java:351)
	at sun.reflect.GeneratedMethodAccessor439.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.jenkinsci.plugins.workflow.cps.ParamsVariable.getValue(ParamsVariable.java:61)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.getProperty(CpsScript.java:121)

@riarenas
Copy link

riarenas commented Feb 8, 2018

We're taking a look at this

@riarenas
Copy link

riarenas commented Feb 9, 2018

@dotnet-bot test Windows x86 Release Build

@riarenas
Copy link

riarenas commented Feb 9, 2018

We restarted Jenkins and the leg I manually triggered got past the error. Should be able to try again now.

@danmoseley danmoseley closed this Feb 9, 2018
@danmoseley danmoseley reopened this Feb 9, 2018
@ahsonkhan
Copy link

@dotnet-bot test Windows x86 Release Build

@ViktorHofer
Copy link
Member

Dan, I removed the merge conflict (after my PR came in) and pushed again. Would we great to get that in for 2.1.

@ViktorHofer ViktorHofer added this to the 2.1.0 milestone Feb 28, 2018
@danmoseley
Copy link
Member Author

Thanks for the reminder about this one, I'll take a look at that failure ..

@ViktorHofer
Copy link
Member

Seems to be an infrastructure issue. Tests all passed in the log:

2018-02-28 14:44:10,069: INFO: proc(54): run_and_log_output: Output: === TEST EXECUTION SUMMARY ===

2018-02-28 14:44:10,099: INFO: proc(54): run_and_log_output: Output:    System.Text.RegularExpressions.Tests  Total: 831, Errors: 0, Failed: 0, Skipped: 0, Time: 6.442s

2018-02-28 14:44:10,177: INFO: proc(54): run_and_log_output: Output: ----- end 14:44:10.17 ----- exit code 0 ----------------------------------------------------------

@danmoseley
Copy link
Member Author

@dotnet-bot test windows x86 release build

@MattGal
Copy link
Member

MattGal commented Feb 28, 2018

@ViktorHofer

Seems to be an infrastructure issue. Tests all passed in the log:

I'll take a look, but we've had lots of problems in this space with tests leaking processes and failing for that reason recently (including an invariant assert which due to working too much was removed); I'll take a look to be sure.

Please tag @dnceng every time you have a fresh thing you consider an infra problem as we're very keen to investigate these. (or ping me directly if you're not getting traction from one of us)

@MattGal
Copy link
Member

MattGal commented Feb 28, 2018

@danmosemsft , @ViktorHofer this is definitely not an infrastructure issue, rather it's the same thing Dan has been dealing with recently, specifically you leak a dotnet.exe that doesn't exit within 20 minutes, and given the logging whose tests are probably ignored in your results.

If you run your test directly using RunTests.cmd, you'll note it leaks a dotnet.exe. If you attach to this dotnet.exe and break (I let it run ~2 minutes) you'll find it hung in places like:

>	System.Text.RegularExpressions.dll!System.Text.RegularExpressions.RegexBoyerMoore.RegexBoyerMoore(string pattern, bool caseInsensitive, bool rightToLeft, System.Globalization.CultureInfo culture) Line 127	C#
 	System.Text.RegularExpressions.dll!System.Text.RegularExpressions.RegexWriter.RegexCodeFromRegexTree(System.Text.RegularExpressions.RegexTree tree) Line 131	C#
 	System.Text.RegularExpressions.dll!System.Text.RegularExpressions.RegexWriter.Write(System.Text.RegularExpressions.RegexTree t) Line 64	C#
 	System.Text.RegularExpressions.dll!System.Text.RegularExpressions.Regex.Regex(string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout, bool addToCache) Line 128	C#
 	System.Text.RegularExpressions.dll!System.Text.RegularExpressions.Regex.IsMatch(string input, string pattern) Line 14	C#
 	System.Text.RegularExpressions.Tests.dll!System.Text.RegularExpressions.Tests.RegexMatchTests.Match_ExcessPrefix.AnonymousMethod__12_0() Line 773	C#
 	[Native to Managed Transition]	
 	[Managed to Native Transition]	
 	RemoteExecutorConsoleApp.exe!RemoteExecutorConsoleApp.Program.Main(string[] args) Line 52	C#

... to my untrained eye, it would seem that this long-running loop might run for a long time: (also, SourceLink is AMAZING)

                for (; ;)
                {
                    if (scan == beforefirst || pattern[match] != pattern[scan])
                    {
                        // at the end of the match, note the difference in _positive
                        // this is not the length of the match, but the distance from the internal match
                        // to the tail suffix.
                        if (Positive[match] == 0)
                            Positive[match] = match - scan;

                        // System.Diagnostics.Debug.WriteLine("Set positive[" + match + "] to " + (match - scan));

                        break;
                    }

                    scan -= bump;
                    match -= bump;
                }

https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexBoyerMoore.cs#L112-L129

We've definitely discussed having some kind of poison environment variable so we can list the reasons that a work item ended up timing out, we simply haven't had the opportunity to implement such a thing yet.

@danmoseley
Copy link
Member Author

Yes this is for me to fix. The regex doesn't OOM now but on x86 it takes too long presumably because of paging. I will dial back the size or add a timeout.

@danmoseley
Copy link
Member Author

Put the Dispose in -- which now I recall was the reason I put this aside and started the change to cause missing Disposes to crash instead of leak... Thanks @MattGal for pointing this out.

@ViktorHofer
Copy link
Member

@dotnet-bot test OSX x64 Debug Build
@dotnet-bot test Tizen armel Debug Build

@danmoseley
Copy link
Member Author

Tizen is known. OSX is Http tests again.

@danmoseley danmoseley merged commit 18ea494 into dotnet:master Mar 1, 2018
@danmoseley danmoseley deleted the regex.oom branch March 1, 2018 17:22
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Fix OOM in regex for large quantifier

* tidy

* tidy

* cr

* {,n} is literal

* Disable for NETFX

* Temporarily track test name

* Hanging test

* Reduce slow case test

* Revert logging

* Omitted dispose

* Merge error


Commit migrated from dotnet/corefx@18ea494
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regex throws out of memory while constructing (not matching) simple regex string.