Improve log parsing performance #7093

Merged
becker33 merged 4 commits into develop from bugfix/log-parse-performance on Feb 1, 2018

Conversation

@tgamblin tgamblin commented Jan 28, 2018

When a build errors out, Spack goes through the build log and scrapes it for errors and warnings (see #5016, #5561). We found that this was taking forever (like an hour) on large logs. It turns out that Python regular expressions are slow, and this affects our log scraping in bad ways.

This PR does several things:

  • Add a profile mode to the parser that tells you which regexes take the most time. A number of regular expressions from ctest_log_parser have really poor performance, mostly due to "untethered" * and + (i.e., they don't start with ^, so Python's backtracking regex implementation checks for repetition at every position in the string). I can't verify that CTest's regexes work as intended with an initial ^ (which would be an obvious solution), so I don't really want to touch them. This fear was borne out when I tried just adding ^ and some tests broke.

  • Instead of relying only on "efficient" regular expressions, add a prefilter() class that allows the parser to quickly check a precondition before evaluating any of the expensive expressions. Preconditions do things like test whether the string contains "error" or "warning" (linear-time checks) before evaluating regexes. Some simple preconditions speed things up by about 200x for the log file I tested.

  • Parallelize the log parsing. Files are now divided up and parsed by all CPUs. This gets you another speedup approximately equal to the number of physical cores you've got.

  • Add a spack log-parse command that you can run on any file. This let me test things, and it's handy because you can run it on build logs that aren't even from Spack.
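The prefilter idea above can be sketched roughly as follows. This is a hypothetical illustration: only the `prefilter()` name comes from this PR, and the helper's signature and the example patterns are made up.

```python
import re

def prefilter(precondition, *patterns):
    """Pair a cheap, linear-time precondition with expensive regexes.

    Hypothetical sketch: the returned matcher runs the compiled regexes
    only on lines that pass the precondition, so most log lines never
    pay the backtracking cost.
    """
    compiled = [re.compile(p) for p in patterns]

    def match(line):
        if not precondition(line):
            return False
        return any(r.search(line) for r in compiled)
    return match

# Only lines that contain "error" pay the regex cost (patterns are made up).
error_match = prefilter(
    lambda line: 'error' in line.lower(),
    r'^[^:]+:[0-9]+: error:',   # compiler-style "file.c:12: error: ..."
    r'Error [0-9]+',            # make-style "Error 2"
)
```

With a filter like this, the vast majority of log lines are rejected by a substring check before any regex runs, which is where a speedup on the order of the 200x reported above can come from.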

Here's spack log-parse:

$ spack log-parse -h
usage: spack log-parse [-h] [--show SHOW] [-c CONTEXT] [-p] [-w WIDTH]
                       [-j JOBS]
                       file

filter errors and warnings from build logs

positional arguments:
  file                  a log file containing build output, or - for stdin

optional arguments:
  -h, --help            show this help message and exit
  --show SHOW           comma-separated list of what to show; options: errors,
                        warnings
  -c CONTEXT, --context CONTEXT
                        lines of context to show around lines of interest
  -p, --profile         print out a profile of time spent in regexes during
                        parse
  -w WIDTH, --width WIDTH
                        wrap width: auto-size to terminal by default; 0 for no
                        wrap
  -j JOBS, --jobs JOBS  number of jobs to parse log file; default is 1 for
                        short logs, ncpus for long logs

Try it -- it'll find errors (and optionally warnings) from your builds.

Finally, this PR cleans up the log output a bit so that long lines wrap better with the line numbers: wrapped lines are now indented so you can still see the line numbers. I find this makes things more readable for really nasty build logs.
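The hanging-indent wrapping can be illustrated with a small sketch (hypothetical; this is not the PR's actual formatting code, just the same idea expressed with stdlib `textwrap`):

```python
import textwrap

def format_log_line(lineno, text, width=50):
    """Wrap a long log line with a hanging indent, so continuation
    lines align past the line-number column (illustrative sketch).
    """
    prefix = '%6d ' % lineno
    return textwrap.fill(
        text,
        width=width,
        initial_indent=prefix,
        subsequent_indent=' ' * len(prefix),
    )
```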

@tgamblin tgamblin requested a review from becker33 January 28, 2018 07:52
@tgamblin (Member, Author)

@becker33: see if this helps

- Commands should allow dashes in their names like the rest of Spack,
  e.g. `spack log-parse`
  - It might be too late for `spack build-cache` (since it is already
    called `spack buildcache`), but we should try a bit to avoid
    inconsistencies in naming conventions

- The code was inconsistent about where commands should be called by
  their python module name (e.g. `log_parse`) and where the actual
  command name should be used (e.g. `log-parse`).

- This made it hard to make a command with a dash in the name, and it
  made `SpackCommand` fail to recognize commands with dashes.

- The code now uses the user-facing name with dashes for function
  parameters, then converts that to the module name when needed.
- A number of regular expressions from ctest_log_parser have really poor
  performance, mostly due to untethered expressions with * or + (i.e., they
  don't start with ^, so the repetition has to be checked for every
  position in the string with Python's backtracking regex implementation)

- I can't verify that CTest's regexes work with an added ^, so I don't
  really want to touch them.  I tried adding this and found that it
  caused some tests to break.

- Instead of using only "efficient" regular expressions, added a
  prefilter() class that allows the parser to quickly check a
  precondition before evaluating any of the expensive regexes.

- Preconditions do things like check whether the string contains "error"
  or "warning" (linear time things) before evaluating regexes that would
  require them.  It's sad that Python doesn't use Thompson string
  matching (see https://swtch.com/~rsc/regexp/regexp1.html)

- Even with Python's slow implementation, this makes the parser ~200x
  faster on the input we tried it on.
@tgamblin tgamblin force-pushed the bugfix/log-parse-performance branch from 0ad41e8 to 9beb408 Compare January 28, 2018 18:36
- Add better coloring and line wrapping to the log parse output.  This
  makes nasty build output look better with the line numbers.

- `spack log-parse` allows the log parsing logic used at the end of
  builds to be executed on arbitrary files, which is handy even outside
  of spack.

- Also provides a profile option -- we can profile arbitrary files and
  show which regular expressions in the magic CTest parser take the most
  time.
- Log parsing now uses multiple worker processes for long logs
- Lines from logs are divided into chunks and farmed out to <ncpus>
- Add -j option to `spack log-parse`
@tgamblin tgamblin force-pushed the bugfix/log-parse-performance branch from 9beb408 to 5b0b651 Compare January 28, 2018 21:38

@alalazo alalazo left a comment


Still need to check and try the command, but for the rest this LGTM. I have just a couple of minor questions on pool.map_async.



def chunks(l, n):
    """Divide l into n approximately-even chunks."""
Member
Super-minor note: are we disabling this style rule? Or not using pycodestyle yet?

Member
Out of curiosity, do you have any opinions on pycodestyle vs. flake8? They seem to serve the same purpose, but I just learned about pycodestyle recently.
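For context, the `chunks` helper quoted above might be implemented along these lines. This is an illustrative sketch under the docstring's contract, not the PR's actual code (note it yields chunks of equal size except the last, which is only approximately even):

```python
def chunks(l, n):
    """Divide l into n approximately-even chunks (illustrative sketch)."""
    size = (len(l) + n - 1) // n  # ceiling division: items per chunk
    return [l[i:i + size] for i in range(0, len(l), size)]
```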


# create a pool and farm out the matching job
pool = multiprocessing.Pool(jobs)
try:
Member
Here it's kind of unfortunate that to be compatible with python 2 we can't do:

with multiprocessing.Pool(jobs) as pool:
    results = pool.map_async(_parse_unpack, args, 1).get(9999999)
    errors, warnings, timings = zip(*results)

Also, why is the timeout so long? Does it have to do with the comment?

Member Author
It does. The pool doesn't handle interrupts well unless you use map_async -- I should add a link.


# add log context, as well
if jobs is None:
    jobs = multiprocessing.cpu_count()
Member
Might consider:

jobs = jobs or multiprocessing.cpu_count()

try:
    # this is a workaround for a Python bug in Pool with ctrl-C
    results = pool.map_async(_parse_unpack, args, 1).get(9999999)
    errors, warnings, timings = zip(*results)
Member
I see you like mind-bending tricks with zip 😄 Next time somebody tells me they need to turn an AoS into an SoA, I'll reply "Just use zip".

Member Author
This isn't even the cool one -- this one just makes lists of lists of results from workers. The better zip is on line 464, where we sum the profile timings 😄
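The zip trick in question: each worker returns an (errors, warnings, timings) tuple, and `zip(*results)` transposes the list of per-worker tuples into parallel sequences. The values below are made up for illustration:

```python
# Simulated per-worker results: a list of (errors, warnings, timings).
results = [
    (['err A'], ['warn A'], 0.5),
    (['err B'], ['warn B'], 0.7),
]

# zip(*results) turns the "array of structs" into a "struct of arrays".
errors, warnings, timings = zip(*results)
print(errors)   # (['err A'], ['err B'])
print(timings)  # (0.5, 0.7)
```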


@becker33 becker33 left a comment


I have reproduced @tgamblin's fantastic performance results on ~12MB log files.

Coverage is 6 ppt below the 73% threshold, but light testing in anger shows good results, and the change is unlikely to affect unrelated features.

@becker33 becker33 merged commit df758e1 into develop Feb 1, 2018
@tgamblin tgamblin deleted the bugfix/log-parse-performance branch February 1, 2018 18:45
smuething pushed a commit to smuething/spack that referenced this pull request Feb 28, 2018
ch741 pushed a commit to ch741/spack that referenced this pull request Apr 20, 2018