Conversation
@becker33: see if this helps
- Commands should allow dashes in their names like the rest of spack,
e.g. `spack log-parse`
- It might be too late for `spack build-cache` (since it is already
called `spack buildcache`), but we should try a bit to avoid
inconsistencies in naming conventions
- The code was inconsistent about where commands should be called by
their python module name (e.g. `log_parse`) and where the actual
command name should be used (e.g. `log-parse`).
- This made it hard to make a command with a dash in the name, and it
made `SpackCommand` fail to recognize commands with dashes.
- The code now uses the user-facing name with dashes for function
  parameters, then converts that to the module name when needed.
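The convention described above can be sketched as a pair of tiny helpers (hypothetical names for illustration; Spack's actual helpers may differ):

```python
def python_name(cmd_name):
    """Convert a user-facing command name (with dashes) to a module name."""
    return cmd_name.replace("-", "_")


def cmd_name(python_name):
    """Convert a python module name back to the user-facing command name."""
    return python_name.replace("_", "-")


print(python_name("log-parse"))  # log_parse
print(cmd_name("log_parse"))     # log-parse
```

The key point is that one direction is used consistently in function parameters (the dashed, user-facing form) and conversion happens only at the module-lookup boundary.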
- A number of regular expressions from ctest_log_parser have really poor
  performance, mostly due to untethered expressions with * or + (i.e., they
  don't start with ^, so the repetition has to be checked for every
  position in the string with Python's backtracking regex implementation).
- I can't verify that CTest's regexes work with an added ^, so I don't
  really want to touch them. I tried adding this and found that it caused
  some tests to break.
- Instead of using only "efficient" regular expressions, added a
  prefilter() class that allows the parser to quickly check a precondition
  before evaluating any of the expensive regexes.
- Preconditions do things like check whether the string contains "error"
  or "warning" (linear time things) before evaluating regexes that would
  require them. It's sad that Python doesn't use Thompson string matching
  (see https://swtch.com/~rsc/regexp/regexp1.html).
- Even with Python's slow implementation, this makes the parser ~200x
  faster on the input we tried it on.
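The prefilter idea can be illustrated with a minimal sketch (not Spack's actual implementation): a cheap linear-time substring test guards a list of expensive regexes, so the backtracking engine never runs on lines that can't possibly match.

```python
import re


class prefilter(object):
    """Run expensive regexes only when a cheap precondition holds.

    A minimal sketch of the idea, not Spack's actual class.
    """

    def __init__(self, precondition, *patterns):
        self.precondition = precondition
        self.patterns = [re.compile(p) for p in patterns]

    def search(self, line):
        # linear-time check first; skip the regexes entirely if it fails
        if not self.precondition(line):
            return None
        for regex in self.patterns:
            match = regex.search(line)  # potentially slow backtracking search
            if match:
                return match
        return None


error_filter = prefilter(lambda line: "error" in line,
                         r"^.*[Ee]rror:", r"fatal error")

print(error_filter.search("gcc: fatal error: no input files") is not None)  # True
print(error_filter.search("checking for gcc... yes"))                       # None
```

Most build-log lines fail the substring test, so the ~200x speedup quoted above comes from almost never entering the regex engine at all.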
Force-pushed from 0ad41e8 to 9beb408
- Add better coloring and line wrapping to the log parse output. This
  makes nasty build output look better with the line numbers.
- `spack log-parse` allows the log parsing logic used at the end of builds
  to be executed on arbitrary files, which is handy even outside of spack.
- Also provides a profile option -- we can profile arbitrary files and
  show which regular expressions in the magic CTest parser take the most
  time.
- Log parsing now uses multiple threads for long logs
- Lines from logs are divided into chunks and farmed out to <ncpus>
- Add -j option to `spack log-parse`
Force-pushed from 9beb408 to 5b0b651
alalazo left a comment:
Still need to check and try the command, but otherwise this LGTM. I have just a couple of minor questions on `pool.map_async`.
def chunks(l, n):
    """Divide l into n approximately-even chunks."""
Super-minor note: are we disabling this style rule? Or not using pycodestyle yet?
Out of curiosity, do you have any opinions on pycodestyle vs. flake8? They seem to serve the same purpose, but I just learned about pycodestyle recently.
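For reference, `chunks` can be implemented in a few lines; this is a sketch consistent with the docstring above, not necessarily the code under review:

```python
def chunks(l, n):
    """Divide l into n approximately-even chunks."""
    if not l:
        return []
    size = (len(l) + n - 1) // n  # ceiling division gives the chunk size
    return [l[i:i + size] for i in range(0, len(l), size)]


print(chunks(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```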
# create a pool and farm out the matching job
pool = multiprocessing.Pool(jobs)
try:
Here it's kind of unfortunate that, to be compatible with Python 2, we can't do:

with multiprocessing.Pool(jobs) as pool:
    results = pool.map_async(_parse_unpack, args, 1).get(9999999)
    errors, warnings, timings = zip(*results)

Also, why is the timeout so long? Does it have to do with the comment?
It does. The pool doesn't handle interrupts well unless you use map_async -- I should add a link.
# add log context, as well
if jobs is None:
    jobs = multiprocessing.cpu_count()
Might consider:

jobs = jobs or multiprocessing.cpu_count()

try:
    # this is a workaround for a Python bug in Pool with ctrl-C
    results = pool.map_async(_parse_unpack, args, 1).get(9999999)
    errors, warnings, timings = zip(*results)
I see you like mind bending tricks with zip 😄 Next time somebody tells me he needs to turn an AoS into an SoA I'll reply "Just use zip"
This isn't even the cool one -- this one just makes lists of lists of results from workers. The better zip is on line 464, where we sum the profile timings 😄
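Both zips mentioned here are the same transpose trick: `zip(*rows)` turns a list of per-worker result tuples (an "array of structs") into one sequence per field (a "struct of arrays"). Illustrative data below, not the actual parser's:

```python
# each worker returns a tuple (errors, warnings, timings)
results = [([1, 2], ["w1"], [10, 50]),
           ([3],    ["w2"], [20, 40])]

# transpose: one sequence per field across all workers
errors, warnings, timings = zip(*results)
print(errors)  # ([1, 2], [3])

# the "better zip": sum per-regex timings (here in ms) across workers
totals = [sum(t) for t in zip(*timings)]
print(totals)  # [30, 90]
```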
* Allow dashes in command names and fix command name handling
- Commands should allow dashes in their names like the rest of spack,
e.g. `spack log-parse`
- It might be too late for `spack build-cache` (since it is already
called `spack buildcache`), but we should try a bit to avoid
inconsistencies in naming conventions
- The code was inconsistent about where commands should be called by
their python module name (e.g. `log_parse`) and where the actual
command name should be used (e.g. `log-parse`).
- This made it hard to make a command with a dash in the name, and it
made `SpackCommand` fail to recognize commands with dashes.
- The code now uses the user-facing name with dashes for function
parameters, then converts that to the module name when needed.
* Improve performance of log parsing
- A number of regular expressions from ctest_log_parser have really poor
performance, mostly due to untethered expressions with * or + (i.e., they
don't start with ^, so the repetition has to be checked for every
position in the string with Python's backtracking regex implementation)
- I can't verify that CTest's regexes work with an added ^, so I don't
really want to touch them. I tried adding this and found that it
caused some tests to break.
- Instead of using only "efficient" regular expressions, added a
prefilter() class that allows the parser to quickly check a
precondition before evaluating any of the expensive regexes.
- Preconditions do things like check whether the string contains "error"
or "warning" (linear time things) before evaluating regexes that would
require them. It's sad that Python doesn't use Thompson string
matching (see https://swtch.com/~rsc/regexp/regexp1.html)
- Even with Python's slow implementation, this makes the parser ~200x
faster on the input we tried it on.
* Add `spack log-parse` command and improve the display of parsed logs
- Add better coloring and line wrapping to the log parse output. This
makes nasty build output look better with the line numbers.
- `spack log-parse` allows the log parsing logic used at the end of
builds to be executed on arbitrary files, which is handy even outside
of spack.
- Also provides a profile option -- we can profile arbitrary files and
show which regular expressions in the magic CTest parser take the most
time.
* Parallelize log parsing
- Log parsing now uses multiple threads for long logs
- Lines from logs are divided into chunks and farmed out to <ncpus>
- Add -j option to `spack log-parse`
When a build errors out, Spack goes through the build log and scrapes it for errors and warnings (see #5016, #5561). We found that this was taking forever (like an hour) on large logs. It turns out that Python regular expressions are slow, and this affects our log scraping in bad ways.
This PR does several things:
- Add a profile mode to the parser that tells you which regexes take the
  most time. A number of regular expressions from `ctest_log_parser` have
  really poor performance, mostly due to "untethered" `*` and `+` (i.e.,
  they don't start with `^`, so Python's backtracking regex implementation
  checks for repetition at every position in the string). I can't verify
  that CTest's regexes work as intended with an initial `^` (which would be
  an obvious solution), so I don't really want to touch them. This fear was
  borne out when I tried just adding `^` and some tests broke.
- Instead of using only "efficient" regular expressions, added a
  `prefilter()` class that allows the parser to quickly check a
  precondition before evaluating any of the expensive ones. Preconditions
  do things like test whether the string contains "error" or "warning"
  (linear time things) before evaluating regexes. Some simple preconditions
  speed things up by about 200x for the log file I tested.
- Parallelize the log parsing. Files are now divided up and parsed by all
  CPUs. This gets you another speedup approximately equal to the number of
  physical cores you've got.
- Add a `spack log-parse` command that you can run on any file. This let me
  test things, and it's handy because you can run it on build logs that
  aren't even from Spack. Here's `spack log-parse` -- try it, it'll find
  errors (and optionally warnings) from your builds.
Finally, this PR cleans up the log output a bit so that long lines wrap better with the line numbers: wrapped lines are now indented so you can still see the numbers. I find this makes things more readable for really nasty build logs.
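The profile mode boils down to timing each regex over the same input and sorting by cost. A minimal illustrative sketch (not Spack's actual profiling code):

```python
import re
import time


def profile_regexes(patterns, lines):
    """Time each regex over all lines; return (seconds, pattern), slowest first."""
    report = []
    for pattern in patterns:
        regex = re.compile(pattern)
        start = time.time()
        for line in lines:
            regex.search(line)  # same backtracking search the parser does
        report.append((time.time() - start, pattern))
    return sorted(report, reverse=True)


lines = ["checking whether the C compiler works... yes"] * 10000
for seconds, pattern in profile_regexes([r"error", r".*: error [0-9]+"], lines):
    print("%8.4fs  %s" % (seconds, pattern))
```

Running something like this over a real build log is what surfaces the untethered `.*`-style patterns as the dominant cost.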