
CI: Use GCP-backed kernel cache in Windows CI #9738

Merged
leofang merged 4 commits into cupy:main from kmaehashi:gcp-backed-cache on Mar 3, 2026

Conversation

@kmaehashi
Member

@kmaehashi kmaehashi commented Feb 21, 2026

Part of #9665. Based on #9737, this PR introduces a GCP-backed kernel cache to Windows CI, which is activated if the CUPY_CI_ENABLE_GCP_KERNEL_CACHE=1 environment variable is set when invoking pytest.
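As an illustration of the opt-in described above, here is a minimal sketch of gating on the environment variable. Only the variable name CUPY_CI_ENABLE_GCP_KERNEL_CACHE comes from this PR; the helper names and the "gcp"/"disk" labels are hypothetical and do not reflect the actual implementation.

```python
import os

# Variable name taken from the PR description; everything else is hypothetical.
_ENV_VAR = "CUPY_CI_ENABLE_GCP_KERNEL_CACHE"


def gcp_kernel_cache_enabled(environ=None):
    """Return True only when the opt-in variable is exactly "1"."""
    if environ is None:
        environ = os.environ
    return environ.get(_ENV_VAR) == "1"


def select_cache_backend(environ=None):
    """Return a backend label; a real hook would install a backend object."""
    return "gcp" if gcp_kernel_cache_enabled(environ) else "disk"
```

Treating anything other than "1" as disabled keeps local test runs on the default cache unless CI opts in explicitly.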

Disclosure: The initial implementation was done by Copilot in kmaehashi#84.


Observations:

@kmaehashi kmaehashi requested a review from a team as a code owner February 21, 2026 10:19
@kmaehashi kmaehashi added cat:test Test code / CI to-be-backported Pull-requests to be backported to stable branch prio:medium labels Feb 21, 2026
@kmaehashi
Member Author

kmaehashi commented Feb 21, 2026

/test windows,cuda130

This will take 6 hours and will likely result in a timeout. The second invocation should be faster (I hope).

@leofang
Member

leofang commented Feb 22, 2026

/test windows,cuda130

@kmaehashi
Member Author

(I forgot to do pip install google-cloud-storage...)

/test windows,cuda130

@kmaehashi
Member Author

kmaehashi commented Feb 22, 2026

(Confirmed that kernel cache started to appear in the GCS bucket.)

@kmaehashi
Member Author

/test windows,cuda130

@leofang
Member

leofang commented Feb 22, 2026

/test windows,cuda130

@kmaehashi
Member Author

/test windows,cuda130

@kmaehashi
Member Author

This looks effective. In the fourth run (i.e., with the cache fully populated by previous runs), the build + unit test run completed in 2h13m.

@kmaehashi kmaehashi marked this pull request as ready for review February 24, 2026 09:01
@kmaehashi
Member Author

/test windows,cuda130

@kmaehashi
Member Author

The CI failure should be fixed by the latest cuda-pathfinder (v1.4.0).

/test windows,cuda130

@leofang leofang self-assigned this Feb 27, 2026
@leofang
Member

leofang commented Feb 27, 2026

There are 4 test failures. I'll check tomorrow if we should just skip the tests when the GCP backend is in use (because we'd never be able to change the cache dir locally with use_temporary_cache_dir()).

================================== FAILURES ===================================
___ TestRaw_param_0_{backend='nvrtc', in_memory=False}.test_compile_kernel ____

self = <<cupy_tests.core_tests.test_raw.TestRaw_param_0_{backend='nvrtc', in_memory=False} testMethod=test_compile_kernel>  parameter: {'backend': 'nvrtc', 'in_memory': False}>

    @unittest.skipUnless(not cupy.cuda.runtime.is_hip,
                         'only CUDA raises warning')
    @pytest.mark.thread_unsafe(reason="mutates global cache directory")
    def test_compile_kernel(self):
        kern = cupy.RawKernel(
            _test_compile_src, 'test_op',
            options=('-DOP=+',),
            backend=self.backend,
            jitify=self.jitify)
        log = io.StringIO()
        with use_temporary_cache_dir():
            kern.compile(log_stream=log)
>       assert 'warning' in log.getvalue()
E       AssertionError: assert 'warning' in ''
E        +  where '' = <built-in method getvalue of _io.StringIO object at 0x0000016836E8F6D0>()
E        +    where <built-in method getvalue of _io.StringIO object at 0x0000016836E8F6D0> = <_io.StringIO object at 0x0000016836E8F6D0>.getvalue

cupy_tests\core_tests\test_raw.py:1089: AssertionError
___ TestRaw_param_0_{backend='nvrtc', in_memory=False}.test_compile_module ____

self = <<cupy_tests.core_tests.test_raw.TestRaw_param_0_{backend='nvrtc', in_memory=False} testMethod=test_compile_module>  parameter: {'backend': 'nvrtc', 'in_memory': False}>

    @unittest.skipUnless(not cupy.cuda.runtime.is_hip,
                         'only CUDA raises warning')
    @pytest.mark.thread_unsafe(reason="mutates global cache directory")
    def test_compile_module(self):
        module = cupy.RawModule(
            code=_test_compile_src,
            backend=self.backend,
            options=('-DOP=+',),
            jitify=self.jitify)
        log = io.StringIO()
        with use_temporary_cache_dir():
            module.compile(log_stream=log)
>       assert 'warning' in log.getvalue()
E       AssertionError: assert 'warning' in ''
E        +  where '' = <built-in method getvalue of _io.StringIO object at 0x0000016836E8F7F0>()
E        +    where <built-in method getvalue of _io.StringIO object at 0x0000016836E8F7F0> = <_io.StringIO object at 0x0000016836E8F7F0>.getvalue

cupy_tests\core_tests\test_raw.py:1105: AssertionError
____ TestRaw_param_3_{backend='nvcc', in_memory=False}.test_compile_kernel ____

self = <<cupy_tests.core_tests.test_raw.TestRaw_param_3_{backend='nvcc', in_memory=False} testMethod=test_compile_kernel>  parameter: {'backend': 'nvcc', 'in_memory': False}>

    @unittest.skipUnless(not cupy.cuda.runtime.is_hip,
                         'only CUDA raises warning')
    @pytest.mark.thread_unsafe(reason="mutates global cache directory")
    def test_compile_kernel(self):
        kern = cupy.RawKernel(
            _test_compile_src, 'test_op',
            options=('-DOP=+',),
            backend=self.backend,
            jitify=self.jitify)
        log = io.StringIO()
        with use_temporary_cache_dir():
            kern.compile(log_stream=log)
>       assert 'warning' in log.getvalue()
E       AssertionError: assert 'warning' in ''
E        +  where '' = <built-in method getvalue of _io.StringIO object at 0x0000016836E8F880>()
E        +    where <built-in method getvalue of _io.StringIO object at 0x0000016836E8F880> = <_io.StringIO object at 0x0000016836E8F880>.getvalue

cupy_tests\core_tests\test_raw.py:1089: AssertionError
____ TestRaw_param_3_{backend='nvcc', in_memory=False}.test_compile_module ____

self = <<cupy_tests.core_tests.test_raw.TestRaw_param_3_{backend='nvcc', in_memory=False} testMethod=test_compile_module>  parameter: {'backend': 'nvcc', 'in_memory': False}>

    @unittest.skipUnless(not cupy.cuda.runtime.is_hip,
                         'only CUDA raises warning')
    @pytest.mark.thread_unsafe(reason="mutates global cache directory")
    def test_compile_module(self):
        module = cupy.RawModule(
            code=_test_compile_src,
            backend=self.backend,
            options=('-DOP=+',),
            jitify=self.jitify)
        log = io.StringIO()
        with use_temporary_cache_dir():
            module.compile(log_stream=log)
>       assert 'warning' in log.getvalue()
E       AssertionError: assert 'warning' in ''
E        +  where '' = <built-in method getvalue of _io.StringIO object at 0x0000016791682DD0>()
E        +    where <built-in method getvalue of _io.StringIO object at 0x0000016791682DD0> = <_io.StringIO object at 0x0000016791682DD0>.getvalue

cupy_tests\core_tests\test_raw.py:1105: AssertionError

@kmaehashi
Member Author

Maybe we can implement a NullCacheBackend and activate it in the context manager instead of mocking the local cache directory.
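To make the idea concrete, here is a rough sketch of what a null backend could look like. The load/save method names and the compile_with_cache helper are assumptions for illustration only; the real backend interface from #9737 is not shown in this thread.

```python
class NullCacheBackend:
    """Hypothetical backend that never stores or returns a cached kernel."""

    def load(self, key):
        # Always report a miss so the caller has to recompile.
        return None

    def save(self, key, data):
        # Intentionally discard the compiled binary.
        pass


def compile_with_cache(backend, key, compile_fn, log):
    """Toy lookup-then-compile flow; not CuPy's actual code path."""
    cached = backend.load(key)
    if cached is not None:
        # Cache hit: the compiler is skipped, so nothing is written to log.
        return cached
    binary = compile_fn(log)
    backend.save(key, binary)
    return binary
```

With a backend like this, every compile call reaches the compiler, so a test that asserts on compiler warnings in the log would see them regardless of what a shared cache already holds.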

@seberg
Member

seberg commented Feb 28, 2026

I think use_temporary_cache_dir could mock cupy.cuda.compiler._kernel_cache_backend to DiskKernelCacheBackend(tmp_dir).

A NullCacheBackend is neater, but maybe it doesn't hurt to just use DiskKernelCacheBackend in a few extra tests anyway? (There is one check to see that the in_memory flag is honored, but that could easily be done with a NullCacheBackend also).
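A minimal sketch of the mocking approach suggested here, written against stand-ins so it runs without CuPy. The names _kernel_cache_backend, DiskKernelCacheBackend, and use_temporary_cache_dir come from this thread; the class body, its constructor signature, and the compiler placeholder object are assumptions.

```python
import contextlib
import tempfile
import types
from unittest import mock


class DiskKernelCacheBackend:
    """Stand-in using the class name from the comment; the body is hypothetical."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir


# Stand-in for the cupy.cuda.compiler module so the sketch is self-contained.
compiler = types.SimpleNamespace(_kernel_cache_backend="gcp-backend-placeholder")


@contextlib.contextmanager
def use_temporary_cache_dir():
    """Swap the active backend for a disk backend rooted in a fresh temp dir."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        backend = DiskKernelCacheBackend(tmp_dir)
        # patch.object restores the previous backend when the block exits.
        with mock.patch.object(compiler, "_kernel_cache_backend", backend):
            yield tmp_dir
```

Because the backend object itself is replaced, the temporary directory takes effect even when a remote backend is active outside the context manager.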

@kmaehashi
Member Author

/test windows,cuda130

Member

@leofang leofang left a comment

Thanks, @kmaehashi! Looks like it's working now! I left a question but it is not blocking.

Comment thread .pfnci/windows/test.ps1
# DownloadCache "${cache_pr_gcs_dir}" "${cache_archive}"
#}

$Env:CUPY_CI_ENABLE_GCP_KERNEL_CACHE = "1"
Member

Since the old cache is now working (after #9728), I wonder if we should set this to 1 randomly (with a 50-50 chance) so that both the old cache (which is used exclusively in userland) and the new cache (used only in CI) are tested with equal probability.

Member

@seberg @asi1024 thoughts?

Member Author

> set this to 1 randomly

This would double the number of test runs needed to fully populate the kernel cache. Even with the GCP cache, we have to wait for three runs before all tests succeed. #9738 (comment)

Member

My feeling is that we should look into adding some tests if we are worried, rather than keeping two cache mechanisms in CI.

Member

Sounds good. Let's track this in a follow-up issue. No need to block the merge.

@leofang leofang merged commit 8862fea into cupy:main Mar 3, 2026
70 checks passed
chainer-ci pushed a commit to chainer-ci/cupy that referenced this pull request Mar 3, 2026
CI: Use GCP-backed kernel cache in Windows CI
@leofang leofang linked an issue Mar 5, 2026 that may be closed by this pull request
@leofang leofang added this to the v15.0.0a1 milestone Mar 7, 2026

Labels

cat:test Test code / CI prio:medium to-be-backported Pull-requests to be backported to stable branch


Development

Successfully merging this pull request may close these issues.

CI: Improve how kernel caches are updated/pruned

4 participants