Improve log parsing performance #7093

Merged
becker33 merged 4 commits into develop from bugfix/log-parse-performance on Feb 1, 2018

Conversation

@tgamblin tgamblin commented Jan 28, 2018

When a build errors out, Spack goes through the build log and scrapes it for errors and warnings (see #5016, #5561). We found that this was taking forever (like an hour) on large logs. It turns out that Python regular expressions are slow, and this affects our log scraping in bad ways.

This PR does several things:

  • Add a profile mode to the parser that tells you which regexes take the most time. A number of regular expressions from ctest_log_parser have really poor performance, mostly due to "untethered" * and + (i.e., they don't start with ^, so Python's backtracking regex implementation checks for repetition at every position in the string). I can't verify that CTest's regexes work as intended with an initial ^ (which would be an obvious solution), so I don't really want to touch them. This fear was borne out when I tried just adding ^ and some tests broke.

  • Instead of relying only on "efficient" regular expressions, add a prefilter() class that allows the parser to quickly check a precondition before evaluating any of the expensive expressions. Preconditions do things like test whether the string contains "error" or "warning" (linear-time checks) before evaluating regexes. Some simple preconditions speed things up by about 200x for the log file I tested.

  • Parallelize the log parsing. Files are now divided up and parsed by all CPUs. This gets you another speedup approximately equal to the number of physical cores you've got.

  • Add a spack log-parse command that you can run on any file. This let me test things, and it's handy because you can run it on build logs that aren't even from Spack.
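The prefilter idea above can be sketched roughly as follows. This is a hypothetical illustration: only the `prefilter()` name comes from this PR, and the helper's signature and the example patterns are made up.

```python
import re

def prefilter(precondition, *patterns):
    """Pair a cheap, linear-time precondition with expensive regexes.

    Hypothetical sketch: the returned matcher runs the compiled regexes
    only on lines that pass the precondition, so most log lines never
    pay the backtracking cost.
    """
    compiled = [re.compile(p) for p in patterns]

    def match(line):
        if not precondition(line):
            return False
        return any(r.search(line) for r in compiled)
    return match

# Only lines that contain "error" pay the regex cost (patterns are made up).
error_match = prefilter(
    lambda line: 'error' in line.lower(),
    r'^[^:]+:[0-9]+: error:',   # compiler-style "file.c:12: error: ..."
    r'Error [0-9]+',            # make-style "Error 2"
)
```

With a filter like this, the vast majority of log lines are rejected by a substring check before any regex runs, which is where a speedup on the order of the 200x reported above can come from.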

Here's spack log-parse:

$ spack log-parse -h
usage: spack log-parse [-h] [--show SHOW] [-c CONTEXT] [-p] [-w WIDTH]
                       [-j JOBS]
                       file

filter errors and warnings from build logs

positional arguments:
  file                  a log file containing build output, or - for stdin

optional arguments:
  -h, --help            show this help message and exit
  --show SHOW           comma-separated list of what to show; options: errors,
                        warnings
  -c CONTEXT, --context CONTEXT
                        lines of context to show around lines of interest
  -p, --profile         print out a profile of time spent in regexes during
                        parse
  -w WIDTH, --width WIDTH
                        wrap width: auto-size to terminal by default; 0 for no
                        wrap
  -j JOBS, --jobs JOBS  number of jobs to parse log file; default is 1 for
                        short logs, ncpus for long logs

Try it -- it'll find errors (and optionally warnings) from your builds.

Finally, this PR cleans up the log output a bit so that long lines wrap better with the line numbers: wrapped lines are now indented so you can still see the line numbers. I find this makes things more readable for really nasty build logs.
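The hanging-indent wrapping can be illustrated with a small sketch (hypothetical; this is not the PR's actual formatting code, just the same idea expressed with stdlib `textwrap`):

```python
import textwrap

def format_log_line(lineno, text, width=50):
    """Wrap a long log line with a hanging indent, so continuation
    lines align past the line-number column (illustrative sketch).
    """
    prefix = '%6d ' % lineno
    return textwrap.fill(
        text,
        width=width,
        initial_indent=prefix,
        subsequent_indent=' ' * len(prefix),
    )
```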

@tgamblin tgamblin requested a review from becker33 January 28, 2018 07:52
@tgamblin (Member, Author)

@becker33: see if this helps

- Commands should allow dashes in their names like the rest of Spack,
  e.g. `spack log-parse`
  - It might be too late for `spack build-cache` (since it is already
    called `spack buildcache`), but we should try a bit to avoid
    inconsistencies in naming conventions

- The code was inconsistent about where commands should be called by
  their python module name (e.g. `log_parse`) and where the actual
  command name should be used (e.g. `log-parse`).

- This made it hard to make a command with a dash in the name, and it
  made `SpackCommand` fail to recognize commands with dashes.

- The code now uses the user-facing name with dashes for function
  parameters, then converts that to the module name when needed.
- A number of regular expressions from ctest_log_parser have really poor
  performance, mostly due to untethered expressions with * or + (i.e., they
  don't start with ^, so the repetition has to be checked for every
  position in the string with Python's backtracking regex implementation)

- I can't verify that CTest's regexes work with an added ^, so I don't
  really want to touch them.  I tried adding this and found that it
  caused some tests to break.

- Instead of using only "efficient" regular expressions, added a
  prefilter() class that allows the parser to quickly check a
  precondition before evaluating any of the expensive regexes.

- Preconditions do things like check whether the string contains "error"
  or "warning" (linear time things) before evaluating regexes that would
  require them.  It's sad that Python doesn't use Thompson string
  matching (see https://swtch.com/~rsc/regexp/regexp1.html)

- Even with Python's slow implementation, this makes the parser ~200x
  faster on the input we tried it on.
@tgamblin tgamblin force-pushed the bugfix/log-parse-performance branch from 0ad41e8 to 9beb408 Compare January 28, 2018 18:36
- Add better coloring and line wrapping to the log parse output.  This
  makes nasty build output look better with the line numbers.

- `spack log-parse` allows the log parsing logic used at the end of
  builds to be executed on arbitrary files, which is handy even outside
  of spack.

- Also provides a profile option -- we can profile arbitrary files and
  show which regular expressions in the magic CTest parser take the most
  time.
- Log parsing now uses multiple worker processes for long logs
- Lines from logs are divided into chunks and farmed out to <ncpus>
- Add -j option to `spack log-parse`
@tgamblin tgamblin force-pushed the bugfix/log-parse-performance branch from 9beb408 to 5b0b651 Compare January 28, 2018 21:38

@alalazo alalazo left a comment


Still need to check and try the command, but for the rest this LGTM. I have just a couple of minor questions on pool.map_async.



def chunks(l, n):
    """Divide l into n approximately-even chunks."""
Member
Super-minor note: are we disabling this style rule? Or not using pycodestyle yet?

Member
Out of curiosity, do you have any opinions on pycodestyle vs. flake8? They seem to serve the same purpose, but I just learned about pycodestyle recently.
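For context, the `chunks` helper quoted above might be implemented along these lines. This is an illustrative sketch under the docstring's contract, not the PR's actual code (note it yields chunks of equal size except the last, which is only approximately even):

```python
def chunks(l, n):
    """Divide l into n approximately-even chunks (illustrative sketch)."""
    size = (len(l) + n - 1) // n  # ceiling division: items per chunk
    return [l[i:i + size] for i in range(0, len(l), size)]
```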


# create a pool and farm out the matching job
pool = multiprocessing.Pool(jobs)
try:
Member
Here it's kind of unfortunate that to be compatible with python 2 we can't do:

with multiprocessing.Pool(jobs) as pool:
    results = pool.map_async(_parse_unpack, args, 1).get(9999999)
    errors, warnings, timings = zip(*results)

Also, why is the timeout so long? Does it have to do with the comment?

Member Author
It does. The pool doesn't handle interrupts well unless you use map_async -- I should add a link.


# add log context, as well
if jobs is None:
    jobs = multiprocessing.cpu_count()
Member
Might consider:

jobs = jobs or multiprocessing.cpu_count()

try:
    # this is a workaround for a Python bug in Pool with ctrl-C
    results = pool.map_async(_parse_unpack, args, 1).get(9999999)
    errors, warnings, timings = zip(*results)
Member
I see you like mind-bending tricks with zip 😄 Next time somebody tells me they need to turn an AoS into an SoA, I'll reply "Just use zip".

Member Author
This isn't even the cool one -- this one just makes lists of lists of results from workers. The better zip is on line 464, where we sum the profile timings 😄
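The zip trick in question: each worker returns an (errors, warnings, timings) tuple, and `zip(*results)` transposes the list of per-worker tuples into parallel sequences. The values below are made up for illustration:

```python
# Simulated per-worker results: a list of (errors, warnings, timings).
results = [
    (['err A'], ['warn A'], 0.5),
    (['err B'], ['warn B'], 0.7),
]

# zip(*results) turns the "array of structs" into a "struct of arrays".
errors, warnings, timings = zip(*results)
print(errors)   # (['err A'], ['err B'])
print(timings)  # (0.5, 0.7)
```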


@becker33 becker33 left a comment


I have reproduced @tgamblin's fantastic performance results on ~12MB log files.

Coverage is 6 ppt below the 73% threshold, but light testing in anger shows good results, and the change is unlikely to affect unrelated features.

@becker33 becker33 merged commit df758e1 into develop Feb 1, 2018
@tgamblin tgamblin deleted the bugfix/log-parse-performance branch February 1, 2018 18:45
smuething pushed a commit to smuething/spack that referenced this pull request Feb 28, 2018
ch741 pushed a commit to ch741/spack that referenced this pull request Apr 20, 2018