Skip to content

Test report fails to get a cache hit the second time it runs #14426

@segonpin

Description

@segonpin

Description of the problem / feature request:

When running a test twice using remote-cache or disk-cache, bazel fails to cache the action that generates the xml test report (external/bazel_tools/tools/test/generate-xml.sh) on the second invocation. This happens because the action use as arguments the walltime duration of the previous action (the action running the test, external/bazel_tools/tools/test/test-setup.sh), and that time is zero when the action gets a cache hit on the second invocation.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Using the following (repository)[https://github.com/segonpin/bazel_test_cache_miss] (or any simple test that takes longer than 1 second to run), just run the test three times.

$ bazelisk clean; bazelisk test //... --disk_cache=/tmp/bazel_disk_cache --execution_log_json_file=exec1.json
$ bazelisk clean; bazelisk test //... --disk_cache=/tmp/bazel_disk_cache --execution_log_json_file=exec2.json
$ bazelisk clean; bazelisk test //... --disk_cache=/tmp/bazel_disk_cache --execution_log_json_file=exec3.json

The first time there are 0 cache-hits, the third time everything gets a cache-hit, but the second time one action does not get a cache-hit, and that seems incorrect to me.

What operating system are you running Bazel on?

Ubuntu 20.04

What's the output of bazel info release?

release 6.0.0-pre.20211117.1

Any other information, logs, or outputs that you want to share?

When we run the test for the first time. It will run 7 actions, of which 4 are linux-sandbox. Notice that the test takes 5 seconds to run.

$ bazelisk clean; bazelisk test //... --disk_cache=/tmp/bazel_disk_cache --execution_log_json_file=exec1.json
...
INFO: 7 processes: 3 internal, 4 linux-sandbox.
INFO: Build completed successfully, 7 total actions
//:sleep_test                                                            PASSED in 5.0s
...

Then we run the test for the second time. It will run 7 actions, and now only 1 is linux-sandbox. None should be. The reason for the cache-miss is that the test duration is used as input for the missed action, and because the test was cached now, the duration has changed from 5 seconds to 0 seconds, so the commandArgs was different for the action.

$ bazelisk clean; bazelisk test //... --disk_cache=/tmp/bazel_disk_cache --execution_log_json_file=exec2.json
...
INFO: 7 processes: 3 disk cache hit, 3 internal, 1 linux-sandbox.
INFO: Build completed successfully, 7 total actions
//:sleep_test                                                   (cached) PASSED in 0.0s
...

Now we run the test again, and now the test run in 0 seconds, as the previous time, so that the action printing the xml report get a cache hit from before.

$ bazelisk clean; bazelisk test //... --disk_cache=/tmp/bazel_disk_cache --execution_log_json_file=exec3.json
...
INFO: 7 processes: 4 disk cache hit, 3 internal.
INFO: Build completed successfully, 7 total actions
//:sleep_test                                                   (cached) PASSED in 0.0s
...

The action getting the cache miss runs

# in the first invocation
"commandArgs": ["external/bazel_tools/tools/test/generate-xml.sh", "bazel-out/k8-fastbuild/testlogs/sleep_test/test.log", "bazel-out/k8-fastbuild/testlogs/sleep_test/test.xml", "5", "0"],
# second and third invocation
"commandArgs": ["external/bazel_tools/tools/test/generate-xml.sh", "bazel-out/k8-fastbuild/testlogs/sleep_test/test.log", "bazel-out/k8-fastbuild/testlogs/sleep_test/test.xml", "0", "0"],

And the generated test report looks like follows (using 5 the first invocation, and 0 the second and third invocations)

$ cat bazel-out/k8-fastbuild/testlogs/sleep_test/test.xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuites>
  <testsuite name="sleep_test" tests="1" failures="0" errors="0">
    <testcase name="sleep_test" status="run" duration="5" time="5"></testcase>
      <system-out>
Generated test.log (if the file is not UTF-8, then this may be unreadable):
<![CDATA[exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //:sleep_test
-----------------------------------------------------------------------------]]>
      </system-out>
    </testsuite>
</testsuites>

In my opinion the test report should be always the first one (with the duration="5"), or otherwise the last action should not be remoteCacheable.

Metadata

Metadata

Assignees

Labels

P2We'll consider working on this in future. (Assignee optional)team-Remote-ExecIssues and PRs for the Execution (Remote) teamtype: bug

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions