Bug: Incorrect GCS path construction for phenotype data functions #815

@mohamed-laarej

Description

There is a bug in the path construction logic within the AnophelesPhenotypeData class that prevents phenotype data from being loaded in cloud environments like Bespin.

This issue was identified and diagnosed by @jonbrenas (thank you!).


The Problem

The methods _load_phenotype_data and phenotype_sample_sets use a hardcoded, improperly formatted path string to locate phenotype data files. This raises FileNotFoundError or ValueError when running in a live cloud environment.

The original faulty logic was:

base_phenotype_path = f"{self._url}v3.2/phenotypes/all"

This leads to two problems:

  • Incorrect Path
    It produces a malformed path with a missing /, for example:
    gs://...us_central1v3.2/...

  • Hardcoded Version
    The v3.2 is hardcoded, which will break when the API is updated to a new data release.
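
To make the first problem concrete, here is a minimal sketch of how the missing separator arises and how a join that normalises slashes avoids it. The base URL is a hypothetical placeholder (the real bucket path is not reproduced here), and build_path_buggy and build_path_joined are illustrative helpers, not functions from the library.

```python
def build_path_buggy(url: str) -> str:
    # Mirrors the faulty f-string: nothing separates the base URL from "v3.2".
    return f"{url}v3.2/phenotypes/all"


def build_path_joined(url: str, *parts: str) -> str:
    # Strip slashes at the seams, then join with exactly one "/" per seam.
    return "/".join([url.rstrip("/"), *(p.strip("/") for p in parts)])


# Hypothetical base URL without a trailing slash (an assumption for illustration).
base = "gs://example-bucket/example_prefix"

print(build_path_buggy(base))
# gs://example-bucket/example_prefixv3.2/phenotypes/all

print(build_path_joined(base, "v3.2", "phenotypes/all"))
# gs://example-bucket/example_prefix/v3.2/phenotypes/all

# The joined form gives the same result whether or not the base ends in "/".
print(build_path_joined(base + "/", "v3.2", "phenotypes/all"))
# gs://example-bucket/example_prefix/v3.2/phenotypes/all
```

The buggy form only yields a valid path when the base URL happens to end in a slash, which is why the failure depends on the environment's configuration.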

These functions likely only worked in local testing environments due to file caching, which masked the path issue.

The Solution

The correct and robust solution is to programmatically construct the path for each sample_set by first looking up its specific data release. This ensures the correct version path is used and mirrors the established pattern in other data access classes within the library (e.g., AnophelesCnvData).
The fix changes the logic inside the for loop in both _load_phenotype_data and phenotype_sample_sets to the following:

# Look up the release for each sample set
release = self.lookup_release(sample_set=sample_set)
release_path = self._release_to_path(release)

# Construct the full, correct path
phenotype_path = f"{self._base_path}/{release_path}/phenotypes/all/{sample_set}/phenotypes.csv"

This ensures the path is always correctly formatted and version-aware for both live and cached data access.
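
The per-sample-set lookup can be sketched in isolation as follows. This is a stand-in under stated assumptions, not the library's implementation: PhenotypePathSketch, the sample set names, the release values, the bucket URL, and the "v" + release convention in _release_to_path are all hypothetical. Only the method names lookup_release and _release_to_path and the final path template come from the proposed fix.

```python
from typing import Dict, List


class PhenotypePathSketch:
    """Stand-in for the path logic only; not the library's class."""

    def __init__(self, base_path: str, releases: Dict[str, str]) -> None:
        # Assumption: normalise the base so it never ends in "/".
        self._base_path = base_path.rstrip("/")
        # Hypothetical mapping of sample set -> release.
        self._releases = releases

    def lookup_release(self, sample_set: str) -> str:
        # Stand-in lookup; the real method resolves this from library metadata.
        try:
            return self._releases[sample_set]
        except KeyError:
            raise ValueError(f"No release found for sample set {sample_set!r}")

    def _release_to_path(self, release: str) -> str:
        # Assumed convention for illustration only: "3.2" -> "v3.2".
        return f"v{release}"

    def phenotype_paths(self, sample_sets: List[str]) -> Dict[str, str]:
        paths = {}
        for sample_set in sample_sets:
            # Look up the release for each sample set, as in the proposed fix.
            release = self.lookup_release(sample_set=sample_set)
            release_path = self._release_to_path(release)
            paths[sample_set] = (
                f"{self._base_path}/{release_path}"
                f"/phenotypes/all/{sample_set}/phenotypes.csv"
            )
        return paths


sketch = PhenotypePathSketch(
    "gs://example-bucket/example_prefix/",  # hypothetical base URL
    {"example-set-a": "3.2", "example-set-b": "3.5"},  # hypothetical releases
)
for name, path in sketch.phenotype_paths(["example-set-a", "example-set-b"]).items():
    print(name, path)
```

Because the release is resolved per sample set, two sample sets from different releases get different version segments, and no version string appears in the path template itself.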
