Feat: Add functions to access insecticide resistance phenotype data#792
Feat: Add functions to access insecticide resistance phenotype data#792
Conversation
jonbrenas
left a comment
There was a problem hiding this comment.
Thanks @mohamed-laarej. I put a few comments here and there that I hope will nudge you in the right direction. This is a great start, though.
malariagen_data/anoph/phenotypes.py
Outdated
| import fsspec | ||
|
|
||
|
|
||
| class PhenotypeDataMixin: |
There was a problem hiding this comment.
I think the "phenotype" class should be named something like "AnophelesPhenotypeData" to be consistent with other similar classes. It isn't really a mixing because it is inherited by the classes using its functions.
malariagen_data/ag3.py
Outdated
|
|
||
|
|
||
| class Ag3(AnophelesDataResource): | ||
| class Ag3(AnophelesDataResource, PhenotypeDataMixin): |
There was a problem hiding this comment.
AnophelesDataResource is the class that needs to inherit the "phenotype" class as funestus and other species will probably have some phenotypic data added at some point.
malariagen_data/ag3.py
Outdated
|
|
||
| _xpehh_gwss_cache_name = XPEHH_GWSS_CACHE_NAME | ||
| _ihs_gwss_cache_name = IHS_GWSS_CACHE_NAME | ||
| _phenotype_gcs_path_template = "gs://vo_agam_release_master_us_central1/v3.2/phenotypes/all/{sample_set}/phenotypes.csv" |
There was a problem hiding this comment.
The GCS path should be built from the "GCS_DEFAULT_URL" to be consistent with similar data types.
malariagen_data/anoph/phenotypes.py
Outdated
| duplicate_cols = [col for col in df_merged.columns if col.endswith("_meta")] | ||
| for col in duplicate_cols: | ||
| base_col = col.replace("_meta", "") | ||
| if base_col in df_merged.columns: | ||
| df_merged[base_col] = df_merged[base_col].fillna(df_merged[col]) | ||
| df_merged = df_merged.drop(columns=[col]) |
There was a problem hiding this comment.
This seems a little convoluted. I think you can check that the shared information is the same in both DataFrames and then only merge the new info.
malariagen_data/anoph/phenotypes.py
Outdated
| sample_sets: Optional[Union[str, List[str]]] = None, | ||
| sample_query: Optional[str] = None, | ||
| sample_query_options: Optional[Dict[str, Any]] = None, | ||
| insecticide: Optional[Union[str, List[str]]] = None, | ||
| dose: Optional[Union[float, List[float]]] = None, | ||
| phenotype: Optional[Union[str, List[str]]] = None, | ||
| cohort_size: Optional[int] = None, | ||
| min_cohort_size: Optional[int] = None, | ||
| max_cohort_size: Optional[int] = None, |
There was a problem hiding this comment.
I would advise to reuse the types defined in base_params.py and to create a phenotype_params.py for the ones that are new (say dose or insecticide)
malariagen_data/anoph/phenotypes.py
Outdated
| """ | ||
| Load phenotypic data from insecticide resistance bioassays. | ||
| """ | ||
| # 1. Normalize sample_sets |
There was a problem hiding this comment.
A function like _prep_sample_sets_param (which is in base.py) would probably be helpful here. There are probably a few other utility functions there that would save you time.
|
|
||
| return pd.concat(filtered_groups, ignore_index=True) | ||
|
|
||
| def _create_phenotype_binary_series(self, df: pd.DataFrame) -> pd.Series: |
There was a problem hiding this comment.
This may need some more thinking. The range of values for "phenotype" in the data may need to be constrained.
|
Hi @jonbrenas , Thank you very much for taking the time to review this PR and for your valuable feedback! |
jonbrenas
left a comment
There was a problem hiding this comment.
Thanks @mohamed-laarej.
In addition to the tests that were mentioned during our call, I added a few comments here.
| from typing import TypeAlias, Optional, Union, List | ||
|
|
||
| # Type alias for insecticide parameter | ||
| insecticide: TypeAlias = Optional[Union[str, List[str]]] |
There was a problem hiding this comment.
These should be 'Annotated` to help with the docs.
| Provides methods for accessing insecticide resistance phenotypic data. | ||
| Inherited by AnophelesDataResource subclasses (e.g., Ag3). | ||
| """ | ||
|
|
There was a problem hiding this comment.
I think you are missing the constructor here.
|
Thanks @mohamed-laarej. I ran a quick debug of your code and the reason why you don't get the |
- Fixed genetic data merge to dynamically detect and use correct coordinate names - Added comprehensive detection for both 'samples' and 'sample_id' coordinates - Implemented dimension alignment before merging xarray datasets - Added tests for the new phenotypes functions Resolves #789
|
Thanks @mohamed-laarej. This is great! Here are a few more comments:
|
…esponse to mentor feedback)
|
Thanks @jonbrenas for the detailed feedback! Really helpful stuff. Let me go through each point: Functions as Global VariablesAh yeah, I can see why that looked weird! I ended up doing it that way because I was getting a bunch of MyPy errors when I first tried the mixin approach: malariagen_data\anoph\phenotypes.py:31: error: "AnophelesPhenotypeData" has no attribute "_url" [attr-defined]
malariagen_data\anoph\phenotypes.py:377: error: "AnophelesPhenotypeData" has no attribute "sample_metadata" [attr-defined]To fix this, I added type annotations and made them constructor params when you suggested adding a constructor. But after reviewing how the other mixins work, I refactored it to use cooperative inheritance like this: class AnophelesPhenotypeData:
# Type annotations for MyPy
_url: str
_fs: fsspec.AbstractFileSystem
sample_metadata: Callable[..., pd.DataFrame]
# etc...
def __init__(self, **kwargs):
super().__init__(**kwargs)I think it's much cleaner! Now it just calls Sample Query FilteringCompletely agree on this one! Consolidating everything into Clearer Data Type SelectionThis is a great point, the current logic is definitely confusing. I'd love your thoughts on which approach would work best: Option A - Separate functions:def phenotypes_with_snps(self, region, sample_query=None, ...):
def phenotypes_with_haplotypes(self, region, sample_query=None, analysis="arab", ...):
def phenotypes_with_cnvs(self, region, sample_query=None, ...):
def phenotypes_only(self, sample_query=None, ...):Option B - Single function with parameter:def phenotypes(self, region=None, genetic_data="snps", sample_query=None, ...):
# where genetic_data can be "snps", "haplotypes", "cnvs", or NoneI'm personally leaning toward Option A since it makes the intent really clear and avoids any ambiguity, but I'd definitely defer to your judgment on what would be most intuitive for users. Next StepsI'll get started on: Thanks again for taking the time to provide such detailed feedback, it's really helpful for understanding the codebase patterns better! |
|
Thanks @mohamed-laarej. I think you got that right. Option A would probably be my choice: if we want to add more options later on, it will be clearer to add a new function than an accepted value for a parameter. |
…ltering - Replaced direct phenotype parameters with 'sample_query' for more flexible filtering. - Split 'phenotypes' method into 'phenotypes_with_snps', 'phenotypes_with_haplotypes', and 'phenotype_data'. - Updated integration tests to reflect these changes.
… SNPs, Haplotypes).
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
Hi @jonbrenas , I've addressed the points regarding:
Regarding the CNVs, I did attempt to integrate them using |
jonbrenas
left a comment
There was a problem hiding this comment.
Great work, @mohamed-laarej. In phenotype_data, sample_query_options isn't used properly: it is passed to sample_metadata without a query (i.e., it does nothing) and not used when the query is actually being evaluated.
malariagen_data/anopheles.py
Outdated
| plotly_discrete_legend, | ||
| ) | ||
|
|
||
| from .anoph.phenotypes import AnophelesPhenotypeData |
There was a problem hiding this comment.
Shift to before the functions import to be consistent.
|
I've always assumed that the blank line between imports at the top of This might be a style choice and does not need to be changed in this PR. Although the style does not seem to be applied consistently everywhere (e.g. If we decide to police "don't have lines between imports", then that would require a lot of changes. Normally this wouldn't even be noticeable or noteworthy, but there is no other change in |
|
Thanks @leehart for the observation! To be honest, I hadn't noticed the specific pattern you're referring to in the import sections before. I tend to add blank lines wherever I feel they improve readability or visually separate blocks of code, not following any strict convention. You're right about the pattern in |
- Re-implemented the phenotype_binary method, which was inadvertently removed during previous refactoring. - Updated phenotype_binary to leverage the sample_query mechanism for filtering, aligning with mentor feedback. - Added new integration tests for phenotype_binary functionality. - Incorporated phenotype_binary examples into the demonstration notebook.
|
Hi @jonbrenas, I just wanted to give you a quick heads-up |
|
That's good thanks @mohamed-laarej ! |
Feat: Add functions to access insecticide resistance phenotype data
Resolves: #789
Description:
This PR introduces new functionality to access insecticide resistance phenotype data, as requested in issue #789 for GSoC 2025. The primary goal is to provide users with easy access to bioassay results (e.g., alive/dead status after insecticide exposure) alongside sample metadata and optionally linked genetic data.
Changes Implemented:
malariagen_data.anoph.phenotypes.PhenotypeDataMixinlocated in the new filemalariagen_data/anoph/phenotypes.py. This mixin encapsulates all logic related to fetching, processing, and merging phenotype data.Ag3class (inmalariagen_data/ag3.py) now inherits fromPhenotypeDataMixin. This seamlessly integrates the new phenotype methods, making them directly available onAg3instances (e.g.,ag3.phenotype_data(...)).New Functionality Provided by
PhenotypeDataMixin:The following public methods are now available via
Ag3objects:phenotype_data(...): Loads insecticide bioassay data for specified sample sets from GCS. It merges this with sample metadata and returns a pandas DataFrame. Users can filter bysample_sets,sample_query,insecticide,dose,phenotype, and apply cohort size constraints (cohort_size,min_cohort_size,max_cohort_size).phenotype_binary(...): Similar tophenotype_data, but processes the results to return a pandas Series containing binary phenotype outcomes (1 = alive/resistant, 0 = dead/susceptible), indexed bysample_id. This is useful for downstream analyses requiring binary resistance status.phenotypes(...): Provides a comprehensive view by loading phenotype data and optionally merging it with genetic variant data (snp_callsorhaplotypes) for a specifiedregion. It returns an xarray Dataset containing both phenotype and genetic information for common samples. The function handles cases where genetic data is unavailable or does not overlap with the phenotype samples for the given query.phenotype_sample_sets(): A utility method that returns a list of sample sets for which phenotype data files were found in GCS.Several internal helper methods (
_load_phenotype_data,_merge_phenotype_with_metadata,_create_phenotype_binary_series,_create_phenotype_dataset, etc.) support these public functions.Testing:
Local testing using a script confirmed that
phenotype_dataandphenotype_binaryload and process data correctly. Thephenotypesfunction was also tested; the logic for merging phenotype data with genetic data appears sound and correctly handles cases where genetic data is unavailable or does not overlap with the available phenotype samples. However, fully verifying the merge output with actual overlapping data was challenging due to data availability limitations in the local test environment using the currently accessible sample sets.Formal unit tests covering the new functions will be added as per project guidelines and mentor recommendations following this initial review.
Next Steps: