Fix OOM in regex for large regex quantifier by danmoseley · Pull Request #26543 · dotnet/corefx

danmoseley · 2018-01-23T07:46:48Z

Fix #26484

As an optimization, the regex engine tries to establish a definite prefix that matches must have, if it can. For example "a{5,10}bcd" indicates all matches must begin with "aaaaa". If there is such a prefix, the engine can use Boyer-Moore to skip more quickly to the next plausible match. It is optional though (as far as I can tell).

As implemented the code constructs a string with the prefix. So if the quantifier is very large, it will run out of memory. This could be avoided with a larger change, but I'm making a minimal change to not make prefix strings larger than 10^6 characters, an arbitrarily large number that can still be formed into a string.

stephentoub · 2018-01-23T15:55:31Z

src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFCD.cs

                    case RegexNode.Onelazy:
-                        if (curNode._m > 0)
+
+                        int cutoff = 1_000_000; // In release, cutoff at a length to which we can reasonably construct a string


const int Cutoff = #if DEBUG 50; #else 1_000_000; #endif

This would mean people using debug builds and RegexOptions.Debug will need to edit this code and build a private in order to get accurate regex debug tracing.

Can you clarify @joshfree ? You mean if they want to get the same cutoff as release, but compiled for debug? I guess so. I think that is rarely going to be needed.

Yup.

when investigating a bug in regex you can pass RegexOptions.Debug to a debug build and get trace output. See https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs#L132-L155

Since we don't ship debug builds, someone doing such an investigation would be building it themselves, at which point they could make this whatever they want if it's a problem. That said, I don't have a preference whether we use a different value for debug or not.

Why do we differ between DEBUG and non-DEBUG here? More precisely, why do we disable this optimization for quantifiers > 50 on DEBUG?

Why do we differ between DEBUG and non-DEBUG here?

This kind of thing is done sometimes to increase the chances that both code paths are appropriately tested. I'm assuming that's what Dan had in mind here.

Right. I was once bitten by a case where we called a Win32 API with a buffer that was always large enough in testing, and the "try again with larger buffer" path was bad.

CyrusNajmabadi · 2018-01-23T20:43:49Z

src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFCD.cs

+
+                        int cutoff = 1_000_000; // In release, cutoff at a length to which we can reasonably construct a string
+#if DEBUG
+                        cutoff = 50; // In debug, use a smaller cutoff to exercise the cutoff path


Note: this is fairly small. This would mean a totally reasonably regex of "a{100,110}" would fail. I don't know what contexts debug builds are used in. but something to be potentially worried about. Perhaps another way would be to have a static and allow tests to change it.

Hopefully not fail - this just disables the prefix optimization for >50.

Re static - we try to avoid more use of internalsvisibleto

CyrusNajmabadi · 2018-01-23T20:44:17Z

src/System.Text.RegularExpressions/tests/Regex.Match.Tests.cs

+            yield return new object[] { @"a{100}b", new string('a', 100) + "bc", RegexOptions.None, 0, 102, true, new string('a', 100) + "b" };
+
+            yield return new object[] { @"a{11}b", new string('a', 10) + "bc", RegexOptions.None, 0, 12, false, string.Empty };
+            yield return new object[] { @"a{101}b", new string('a', 100) + "bc", RegexOptions.None, 0, 102, false, string.Empty };


can we have tests with quantifiers like {a,b} and {a,}

Will do. I discovered that we do not support {,n} for some reason. Unfortunately it is treated as a literal (not error) so we probably can't switch it on. Workaround is {0,n}.

Yeah, i thought that was odd myself. But it's not in the Regex reference, so it's at least expected as per our docs.

ViktorHofer

LGTM. As Stephen already commented, please change cutoff to be a const int.

Thanks for the fix!

danmoseley · 2018-02-04T04:00:44Z

@dotnet-bot test Windows x86 Release Build please (tests ran fine, but helix only started them just before the timeout)

danmoseley · 2018-02-05T05:33:21Z

@dotnet/dnceng the tests in https://ci3.dot.net/job/dotnet_corefx/job/master/job/windows-TGroup_netcoreapp+CGroup_Release+AGroup_x86+TestOuter_false_prtest/7958/ seem to have completed OK, but the system got jammed and eventually timed out? Also note that the "Xunit result XML" links just go to the homepage.

Example
https://mc.dot.net/#/user/danmosemsft/pr~2Fjenkins~2Fdotnet~2Fcorefx~2Fmaster~2F/test~2Ffunctional~2Fcli~2F/5c04bb1c4a3f0e60fd0b0f9fede590594877335b/workItem/System.Text.RegularExpressions.Tests/wilogs

@Anipik can you take a look at those logs to check that it's not Dumpling doing it? There are various errors like Could not find C on disk. and Access Denied. I know the site is down.

danmoseley · 2018-02-05T17:00:21Z

@dotnet-bot test NETFX x86 Release Build please (issue)
@dotnet-bot test Windows x86 Release Build

danmoseley · 2018-02-05T18:57:30Z

@dotnet/dnceng I could use a pointer here. The log shows 3 configurations timed out. If I look at the details though, it seems the tests did not start until too late. Eg Windows.10.Amd64.Open:Release-x86 has
2018-02-05 17:53:04,858: INFO: scriptrunner(33): _main: BuildTools Helix Script Runner v0.1 starting

but in the runner log I see
018-02-05 17:33:01,989: INFO: executor(386): _execute_command: running %HELIX_PYTHONPATH% %HELIX_CORRELATION_PAYLOAD%\RunnerScripts\scriptrunner\scriptrunner.py --script RunTests.cmd %HELIX_CORRELATION_PAYLOAD% in C:\dotnetbuild\work\9b1c0b15-cfee-4360-8566-ced8b66915bb\Work\56a31538-1178-46a1-a15b-78c08ff1e74e\Exec max 1200 seconds
...
2018-02-05 17:53:03,026: ERROR: executor(397): _execute_command: Executor timed out after 1200 seconds and was killed.

For some reason the tests didn't get kicked off until the timeout happened!

https://mc.dot.net/#/user/danmosemsft/pr~2Fjenkins~2Fdotnet~2Fcorefx~2Fmaster~2F/test~2Ffunctional~2Fcli~2F/5c04bb1c4a3f0e60fd0b0f9fede590594877335b/workItem/System.Text.RegularExpressions.Tests/wilogs

MattGal · 2018-02-05T21:06:40Z

@danmosemsft the test did finish, but it looks like XUnit or Runtests.cmd hung and did not actually exit thereafter. DUmpling even tried to make a dump of this hang but it's unfortunately lost.

Anipik · 2018-02-05T21:26:11Z

@danmosemsft
This is due the fact that dumpling.py script (which we download from the GitHub) is present on helix machine but then it tries to install the debugger (cdb) by making call to dumpling website which is offline hence not able to download the debugger, hance not able to create dumps which uses that downloaded debugger path.

danmoseley · 2018-02-06T02:06:24Z

@MattGal I added logging of the test being run and all of them are completing quickly. What I can't understand is why the tests don't START running until AFTER the infrastructure has timed out and FINISHED. (based on the run_client.py)

https://mc.dot.net/#/user/danmosemsft/pr~2Fjenkins~2Fdotnet~2Fcorefx~2Fmaster~2F/test~2Ffunctional~2Fcli~2F/43037e91b6963e0c64e38f904d49fd4f2fbc1468/workItem/System.Text.RegularExpressions.Tests/wilogs

2018-02-06 00:53:57,275: ERROR: job(44): kill: Job running for too long. Killing...
2018-02-06 00:53:57,275: ERROR: executor(397): _execute_command: Executor timed out after 1200 seconds and was killed.
2018-02-06 00:53:57,275: INFO: event(43): send: Sending event type WorkItemTimeout
2018-02-06 00:53:57,430: INFO: saferequests(90): request_with_retry: Response complete with status code '201'
2018-02-06 00:53:57,430: INFO: executor(416): _execute_command: Finished _execute_command, exit code: -3

Executed on wuw81o7sj000E66
2018-02-06 00:54:06,723: INFO: scriptrunner(33): _main: BuildTools Helix Script Runner v0.1 starting
2018-02-06 00:54:06,723: INFO: helix_test_execution(24): __init__: Copying execution payload files from C:\dotnetbuild\work\a44152d5-1c71-48c1-9628-37f4b91e1664\Work\97c4ebf8-cfdc-4f71-bfe9-a9511a55e1a0\Unzip to C:\dotnetbuild\work\a44152d5-1c71-48c1-9628-37f4b91e1664\Work\97c4ebf8-cfdc-4f71-bfe9-a9511a55e1a0\Exec\execution
...
2018-02-06 00:54:13,836: INFO: proc(54): run_and_log_output: Output:    System.Text.RegularExpressions.Tests  Total: 831, Errors: 0, Failed: 0, Skipped: 0, Time: 5.923s
2018-02-06 00:54:13,979: INFO: proc(54): run_and_log_output: Output: Finished running tests.  End time= 0:54:13.97, Exit code = 0

danmoseley · 2018-02-06T02:45:21Z

@dotnet/dnceng

danmoseley · 2018-02-07T06:41:42Z

@dotnet/dnceng I will need help on this one as above

Chrisboh · 2018-02-07T17:50:58Z

@MattGal or @ChadNedzlek could one of you take a look at the issue Dan is running into?

ChadNedzlek · 2018-02-07T18:25:39Z

@danmosemsft The windows jobs all ran multiple times. That's why there are multiple "run_client.py" logs. For 1 of them, we couldn't find the zip file (something I think @MattGal might have a fix for).

The others had one run that timed out, and we reran it, only for the second one to also timeout. Both run_client.py logs indicate a timeout, and in both cases, the second run_client.py logs' time envelope lines up with the console logs.

MattGal · 2018-02-07T19:52:55Z

@ChadNedzlek reading logs, I see only missing .dmp files (meaning Dumpling is not functioning properly); the workitem-payload-zip bug was rolled out last Friday and seems to have fixed that bug, at least in this incarnation of it.

danmoseley · 2018-02-08T23:12:41Z

@dotnet/dnceng @MattGal @ChadNedzlek now jobs are failing like this:

Using context: Linux x64 Release Build
java.lang.NullPointerException
	at java.util.Collections$UnmodifiableCollection.<init>(Collections.java:1026)
	at java.util.Collections$UnmodifiableList.<init>(Collections.java:1302)
	at java.util.Collections.unmodifiableList(Collections.java:1287)
	at hudson.model.ParametersAction.getAllParameters(ParametersAction.java:351)
	at sun.reflect.GeneratedMethodAccessor439.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.jenkinsci.plugins.workflow.cps.ParamsVariable.getValue(ParamsVariable.java:61)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.getProperty(CpsScript.java:121)

riarenas · 2018-02-08T23:52:59Z

We're taking a look at this

riarenas · 2018-02-09T01:19:14Z

@dotnet-bot test Windows x86 Release Build

riarenas · 2018-02-09T01:46:52Z

We restarted Jenkins and the leg I manually triggered got past the error. Should be able to try again now.

ahsonkhan · 2018-02-26T19:14:01Z

@dotnet-bot test Windows x86 Release Build

ViktorHofer · 2018-02-28T14:09:23Z

Dan, I removed the merge conflict (after my PR came in) and pushed again. Would we great to get that in for 2.1.

danmoseley · 2018-02-28T16:10:41Z

Thanks for the reminder about this one, I'll take a look at that failure ..

ViktorHofer · 2018-02-28T16:26:52Z

Seems to be an infrastructure issue. Tests all passed in the log:

2018-02-28 14:44:10,069: INFO: proc(54): run_and_log_output: Output: === TEST EXECUTION SUMMARY ===

2018-02-28 14:44:10,099: INFO: proc(54): run_and_log_output: Output:    System.Text.RegularExpressions.Tests  Total: 831, Errors: 0, Failed: 0, Skipped: 0, Time: 6.442s

2018-02-28 14:44:10,177: INFO: proc(54): run_and_log_output: Output: ----- end 14:44:10.17 ----- exit code 0 ----------------------------------------------------------

danmoseley · 2018-02-28T16:46:04Z

@dotnet-bot test windows x86 release build

MattGal · 2018-02-28T17:44:23Z

@ViktorHofer

Seems to be an infrastructure issue. Tests all passed in the log:

I'll take a look, but we've had lots of problems in this space with tests leaking processes and failing for that reason recently (including an invariant assert which due to working too much was removed); I'll take a look to be sure.

Please tag @dnceng every time you have a fresh thing you consider an infra problem as we're very keen to investigate these. (or ping me directly if you're not getting traction from one of us)

MattGal · 2018-02-28T18:02:08Z

@danmosemsft , @ViktorHofer this is definitely not an infrastructure issue, rather it's the same thing Dan has been dealing with recently, specifically you leak a dotnet.exe that doesn't exit within 20 minutes, and given the logging whose tests are probably ignored in your results.

If you run your test directly using RunTests.cmd, you'll note it leaks a dotnet.exe. If you attach to this dotnet.exe and break (I let it run ~2 minutes) you'll find it hung in places like:

>	System.Text.RegularExpressions.dll!System.Text.RegularExpressions.RegexBoyerMoore.RegexBoyerMoore(string pattern, bool caseInsensitive, bool rightToLeft, System.Globalization.CultureInfo culture) Line 127	C#
 	System.Text.RegularExpressions.dll!System.Text.RegularExpressions.RegexWriter.RegexCodeFromRegexTree(System.Text.RegularExpressions.RegexTree tree) Line 131	C#
 	System.Text.RegularExpressions.dll!System.Text.RegularExpressions.RegexWriter.Write(System.Text.RegularExpressions.RegexTree t) Line 64	C#
 	System.Text.RegularExpressions.dll!System.Text.RegularExpressions.Regex.Regex(string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout, bool addToCache) Line 128	C#
 	System.Text.RegularExpressions.dll!System.Text.RegularExpressions.Regex.IsMatch(string input, string pattern) Line 14	C#
 	System.Text.RegularExpressions.Tests.dll!System.Text.RegularExpressions.Tests.RegexMatchTests.Match_ExcessPrefix.AnonymousMethod__12_0() Line 773	C#
 	[Native to Managed Transition]	
 	[Managed to Native Transition]	
 	RemoteExecutorConsoleApp.exe!RemoteExecutorConsoleApp.Program.Main(string[] args) Line 52	C#

... to my untrained eye, it would seem that this long-running loop might run for a long time: (also, SourceLink is AMAZING)

                for (; ;)
                {
                    if (scan == beforefirst || pattern[match] != pattern[scan])
                    {
                        // at the end of the match, note the difference in _positive
                        // this is not the length of the match, but the distance from the internal match
                        // to the tail suffix.
                        if (Positive[match] == 0)
                            Positive[match] = match - scan;

                        // System.Diagnostics.Debug.WriteLine("Set positive[" + match + "] to " + (match - scan));

                        break;
                    }

                    scan -= bump;
                    match -= bump;
                }

https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexBoyerMoore.cs#L112-L129

We've definitely discussed having some kind of poison environment variable so we can list the reasons that a work item ended up timing out, we simply haven't had the opportunity to implement such a thing yet.

danmoseley · 2018-02-28T18:27:38Z

Yes this is for me to fix. The regex doesn't OOM now but on x86 it takes too long presumably because of paging. I will dial back the size or add a timeout.

danmoseley · 2018-03-01T07:22:18Z

Put the Dispose in -- which now I recall was the reason I put this aside and started the change to cause missing Disposes to crash instead of leak... Thanks @MattGal for pointing this out.

ViktorHofer · 2018-03-01T14:03:59Z

@dotnet-bot test OSX x64 Debug Build
@dotnet-bot test Tizen armel Debug Build

danmoseley · 2018-03-01T17:22:38Z

Tizen is known. OSX is Http tests again.

* Fix OOM in regex for large quantifier * tidy * tidy * cr * {,n} is literal * Disable for NETFX * Temporarily track test name * Hanging test * Reduce slow case test * Revert logging * Omitted dispose * Merge error Commit migrated from dotnet/corefx@18ea494

danmoseley requested review from CyrusNajmabadi and ViktorHofer January 23, 2018 07:46

stephentoub reviewed Jan 23, 2018

View reviewed changes

stephentoub approved these changes Jan 23, 2018

View reviewed changes

CyrusNajmabadi reviewed Jan 23, 2018

View reviewed changes

karelz added the area-System.Text.RegularExpressions label Jan 25, 2018

karelz assigned danmoseley Jan 25, 2018

ViktorHofer approved these changes Jan 30, 2018

View reviewed changes

danmoseley mentioned this pull request Feb 4, 2018

Fix regex scanner for \c[ #26828

Merged

danmoseley closed this Feb 8, 2018

danmoseley reopened this Feb 8, 2018

danmoseley closed this Feb 9, 2018

danmoseley reopened this Feb 9, 2018

ViktorHofer force-pushed the regex.oom branch from 43037e9 to 5cf57fc Compare February 28, 2018 14:08

ViktorHofer added this to the 2.1.0 milestone Feb 28, 2018

danmoseley added 10 commits February 28, 2018 23:16

Fix OOM in regex for large quantifier

409c097

tidy

9f9f87a

tidy

47a4e48

cr

6e03aec

{,n} is literal

33c102e

Disable for NETFX

81ff17c

Temporarily track test name

fbe3438

Hanging test

875ac7e

Reduce slow case test

70926ed

Revert logging

6f20506

danmoseley force-pushed the regex.oom branch from 5cf57fc to 6f20506 Compare March 1, 2018 07:19

Omitted dispose

74f953d

Merge error

0520f11

danmoseley merged commit 18ea494 into dotnet:master Mar 1, 2018

danmoseley deleted the regex.oom branch March 1, 2018 17:22

danmoseley mentioned this pull request Jan 31, 2020

Tests are not waiting on child processes when run dotnet/runtime#25025

Closed

Conversation

danmoseley commented Jan 23, 2018

Uh oh!

stephentoub Jan 23, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ViktorHofer left a comment

Choose a reason for hiding this comment

Uh oh!

danmoseley commented Feb 4, 2018

Uh oh!

danmoseley commented Feb 5, 2018

Uh oh!

danmoseley commented Feb 5, 2018

Uh oh!

danmoseley commented Feb 5, 2018

Uh oh!

MattGal commented Feb 5, 2018

Uh oh!

Anipik commented Feb 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danmoseley commented Feb 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danmoseley commented Feb 6, 2018

Uh oh!

danmoseley commented Feb 7, 2018

Uh oh!

Chrisboh commented Feb 7, 2018

Uh oh!

ChadNedzlek commented Feb 7, 2018

Uh oh!

MattGal commented Feb 7, 2018

Uh oh!

danmoseley commented Feb 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

riarenas commented Feb 8, 2018

Uh oh!

riarenas commented Feb 9, 2018

Uh oh!

riarenas commented Feb 9, 2018

Uh oh!

ahsonkhan commented Feb 26, 2018

Uh oh!

ViktorHofer commented Feb 28, 2018

Uh oh!

danmoseley commented Feb 28, 2018

Uh oh!

ViktorHofer commented Feb 28, 2018

Uh oh!

danmoseley commented Feb 28, 2018

stephentoub Jan 23, 2018 •

edited

Loading

Anipik commented Feb 5, 2018 •

edited

Loading

danmoseley commented Feb 6, 2018 •

edited

Loading

danmoseley commented Feb 8, 2018 •

edited

Loading

MattGal commented Feb 28, 2018 •

edited

Loading