Bug: Incorrect GCS path construction for phenotype data functions #815
Description
There is a bug in the path construction logic within the AnophelesPhenotypeData class that prevents phenotype data from being loaded in cloud environments like Bespin.
This issue was identified and diagnosed by @jonbrenas (thank you!).
The Problem
The methods _load_phenotype_data and phenotype_sample_sets were using a hardcoded and improperly formatted path string to locate phenotype data files. This resulted in FileNotFoundError or ValueError exceptions when running in a live cloud environment, although it sometimes appeared to work locally due to file caching.
The original faulty logic was:
base_phenotype_path = f"{self._url}v3.2/phenotypes/all"
This leads to two problems:
- Incorrect Path: It produces a malformed path with a missing /, for example: gs://...us_central1v3.2/...
- Hardcoded Version: The v3.2 is hardcoded, which will break when the API is updated to a new data release.
These functions likely only worked in local testing environments due to file caching, which masked the path issue.
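To illustrate the missing separator, here is a minimal sketch. The bucket name `gs://example-bucket/data` is a made-up placeholder, not the real base URL, and the `rstrip` variant is only one possible way to show a well-formed join, not the fix proposed in this issue.

```python
# Hypothetical base URL with no trailing slash (placeholder, not the real bucket).
base_url = "gs://example-bucket/data"

# Original pattern: the release segment is glued directly onto the base URL.
faulty = f"{base_url}v3.2/phenotypes/all"

# With an explicit separator the path is well formed.
fixed = f"{base_url.rstrip('/')}/v3.2/phenotypes/all"

print(faulty)
print(fixed)
```

Whether the original f-string happens to work depends entirely on whether the base URL ends in a slash, which is why the failure can be environment-dependent.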
The Solution
The correct and robust solution is to programmatically construct the path for each sample_set by first looking up its specific data release. This ensures the correct version path is used and mirrors the established pattern in other data access classes within the library (e.g., AnophelesCnvData).
The fix changes the logic inside the for loop in both _load_phenotype_data and phenotype_sample_sets to the following:
# Look up the release for each sample set
release = self.lookup_release(sample_set=sample_set)
release_path = self._release_to_path(release)
# Construct the full, correct path
phenotype_path = f"{self._base_path}/{release_path}/phenotypes/all/{sample_set}/phenotypes.csv"
This ensures the path is always correctly formatted and version-aware for both live and cached data access.
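The per-sample-set lookup can be tried in isolation with a small stand-in class. This is only a sketch: `FakePhenotypeData`, its release table, the sample set names, and the `"3.2" -> "v3.2"` mapping are assumptions made for illustration, and the real `lookup_release` and `_release_to_path` may behave differently.

```python
class FakePhenotypeData:
    """Stand-in for illustration only; not the library's class."""

    def __init__(self, base_path, releases):
        self._base_path = base_path.rstrip("/")
        self._releases = releases  # hypothetical map: sample_set -> release

    def lookup_release(self, sample_set):
        return self._releases[sample_set]

    def _release_to_path(self, release):
        # Assumed convention for this sketch only: "3.2" -> "v3.2".
        return f"v{release}"

    def phenotype_path(self, sample_set):
        # Look up the release for this sample set, then build the full path.
        release = self.lookup_release(sample_set=sample_set)
        release_path = self._release_to_path(release)
        return (
            f"{self._base_path}/{release_path}"
            f"/phenotypes/all/{sample_set}/phenotypes.csv"
        )


# Hypothetical sample sets spread across two releases.
data = FakePhenotypeData(
    base_path="gs://example-bucket/data",
    releases={"set-a": "3.2", "set-b": "3.5"},
)
path_a = data.phenotype_path("set-a")
path_b = data.phenotype_path("set-b")
print(path_a)
print(path_b)
```

Because the release is resolved per sample set, sample sets from different releases map to different version directories without any hardcoded version string.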