Summary
Each run under local_provider_runs currently creates its own isolated .cache directory. Because these directories hold large Hugging Face models and uv package caches, disk space is exhausted quickly when multiple runs are started.
Root Cause
The issue stems from the run environment remapping $HOME. When $HOME is redirected for a run, tools such as uv and the Hugging Face libraries default to creating a new cache tree under the new home instead of reusing the host user's existing cache. uv normally links packages from its cache into each environment rather than copying them, but the per-run cache gives it no shared global store to link from, so the same data is downloaded and stored again for every run.
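To make the mechanism concrete, the following is a minimal sketch of how the default cache locations follow $HOME on a Unix-like system. It assumes the commonly documented resolution order (UV_CACHE_DIR, then $XDG_CACHE_HOME/uv, then $HOME/.cache/uv for uv; HF_HOME, then $XDG_CACHE_HOME/huggingface, then $HOME/.cache/huggingface for Hugging Face) and ignores any other overrides. The function name `default_cache_dirs` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def default_cache_dirs(env):
    """Approximate where uv and Hugging Face place their caches for a given environment.

    Assumption: a simplified model of the tools' default lookup on Unix-like systems,
    not their actual implementation.
    """
    home = PurePosixPath(env["HOME"])
    xdg_cache = PurePosixPath(env.get("XDG_CACHE_HOME") or home / ".cache")
    uv_cache = PurePosixPath(env.get("UV_CACHE_DIR") or xdg_cache / "uv")
    hf_home = PurePosixPath(env.get("HF_HOME") or xdg_cache / "huggingface")
    return {"uv": uv_cache, "hf_hub": hf_home / "hub"}


# Hypothetical paths: the host user's home versus a remapped per-run home.
host = default_cache_dirs({"HOME": "/home/alice"})
run = default_cache_dirs({"HOME": "/data/local_provider_runs/run-1/home"})

# With only $HOME changed, both caches move under the run directory.
print(host["uv"], "->", run["uv"])
print(host["hf_hub"], "->", run["hf_hub"])
```

With nothing but $HOME changed, every run resolves to a different cache path, which matches the duplication described above. Pinning the cache variables explicitly breaks that dependency on $HOME.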
Impact
- Rapid Disk Exhaustion: Each run duplicates several gigabytes of data.
- Performance Hit: Runs take longer because they must re-download or re-index assets that should already be cached locally.
Proposed Solution
Modify the provider configuration to map the user's global .cache and uv directories back into the run environment.
- Action: Explicitly mount/link ~/.cache and the uv cache directory into the remapped environment.
- Reference: This follows the pattern previously implemented for the SkyPilot Kubernetes cluster to ensure cache persistence across isolated tasks.
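One way the proposed action could look is sketched below. This is not the provider's actual code: the helper names `link_shared_cache` and `build_run_env` are hypothetical, and the sketch assumes the provider can both prepare the run's home directory and control the environment passed to the run. It links the run's .cache to the host cache and also pins XDG_CACHE_HOME, UV_CACHE_DIR, and HF_HOME, so tools that read those variables never fall back to the remapped $HOME.

```python
import os
from pathlib import Path


def link_shared_cache(run_home: Path, host_cache: Path) -> Path:
    """Make <run_home>/.cache a symlink to the shared host cache (idempotent)."""
    host_cache.mkdir(parents=True, exist_ok=True)
    run_home.mkdir(parents=True, exist_ok=True)
    link = run_home / ".cache"
    if link.is_symlink():
        if link.resolve() == host_cache.resolve():
            return link  # already linked to the shared cache
        link.unlink()  # stale link pointing somewhere else
    elif link.exists():
        # A real per-run cache already exists; refuse to delete data silently.
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(host_cache, target_is_directory=True)
    return link


def build_run_env(run_home: Path, host_cache: Path, base_env=None) -> dict:
    """Return an environment with $HOME remapped but caches pinned to the host."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOME"] = str(run_home)
    env["XDG_CACHE_HOME"] = str(host_cache)
    env["UV_CACHE_DIR"] = str(host_cache / "uv")
    env["HF_HOME"] = str(host_cache / "huggingface")
    return env
```

A caller might use `link_shared_cache(run_home, Path.home() / ".cache")` before launching the run and pass `build_run_env(...)` as the `env` argument to `subprocess.run`. Doing both is deliberately redundant: the environment variables cover tools that honor them, and the symlink covers tools that only look under $HOME/.cache.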
Steps to Reproduce
- Trigger multiple runs using the local_provider.
- Inspect the filesystem within the local_provider_runs directory.
- Observe that each run ID has a unique, high-volume .cache folder.
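The inspection step can be scripted. The sketch below reports the size of each run's .cache directory; it assumes the cache sits directly at `<runs_dir>/<run_id>/.cache`, which may not match the real layout, and the names `dir_size_bytes` and `cache_sizes` are hypothetical.

```python
import os
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under root, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                total += path.lstat().st_size
    return total


def cache_sizes(runs_dir: Path) -> dict:
    """Map each run ID to the size of its own (non-symlinked) .cache directory."""
    sizes = {}
    for run in sorted(p for p in runs_dir.iterdir() if p.is_dir()):
        cache = run / ".cache"
        if cache.is_dir() and not cache.is_symlink():
            sizes[run.name] = dir_size_bytes(cache)
    return sizes


if __name__ == "__main__":
    # Assumed location; adjust to wherever local_provider_runs actually lives.
    runs_dir = Path("local_provider_runs")
    if runs_dir.is_dir():
        for run_id, size in cache_sizes(runs_dir).items():
            print(f"{run_id}: {size / 1e9:.2f} GB")
```

Hard-linked files are counted once per path, so the totals are an upper bound on actual disk usage. Runs whose .cache is a symlink to a shared cache are skipped, which makes the same script usable for checking a fix.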