Skip to content

Feat: Add functions to access insecticide resistance phenotype data#792

Merged
jonbrenas merged 17 commits intomasterfrom
phenotypes-loader
Jul 4, 2025
Merged

Feat: Add functions to access insecticide resistance phenotype data#792
jonbrenas merged 17 commits intomasterfrom
phenotypes-loader

Conversation

@mohamed-laarej
Copy link
Copy Markdown
Collaborator

Feat: Add functions to access insecticide resistance phenotype data

Resolves: #789

Description:

This PR introduces new functionality to access insecticide resistance phenotype data, as requested in issue #789 for GSoC 2025. The primary goal is to provide users with easy access to bioassay results (e.g., alive/dead status after insecticide exposure) alongside sample metadata and optionally linked genetic data.

Changes Implemented:

  1. New Mixin Class: Created malariagen_data.anoph.phenotypes.PhenotypeDataMixin located in the new file malariagen_data/anoph/phenotypes.py. This mixin encapsulates all logic related to fetching, processing, and merging phenotype data.
  2. Integration with Ag3: The main Ag3 class (in malariagen_data/ag3.py) now inherits from PhenotypeDataMixin. This seamlessly integrates the new phenotype methods, making them directly available on Ag3 instances (e.g., ag3.phenotype_data(...)).

New Functionality Provided by PhenotypeDataMixin:

The following public methods are now available via Ag3 objects:

  • phenotype_data(...): Loads insecticide bioassay data for specified sample sets from GCS. It merges this with sample metadata and returns a pandas DataFrame. Users can filter by sample_sets, sample_query, insecticide, dose, phenotype, and apply cohort size constraints (cohort_size, min_cohort_size, max_cohort_size).
  • phenotype_binary(...): Similar to phenotype_data, but processes the results to return a pandas Series containing binary phenotype outcomes (1 = alive/resistant, 0 = dead/susceptible), indexed by sample_id. This is useful for downstream analyses requiring binary resistance status.
  • phenotypes(...): Provides a comprehensive view by loading phenotype data and optionally merging it with genetic variant data (snp_calls or haplotypes) for a specified region. It returns an xarray Dataset containing both phenotype and genetic information for common samples. The function handles cases where genetic data is unavailable or does not overlap with the phenotype samples for the given query.
  • phenotype_sample_sets(): A utility method that returns a list of sample sets for which phenotype data files were found in GCS.

Several internal helper methods (_load_phenotype_data, _merge_phenotype_with_metadata, _create_phenotype_binary_series, _create_phenotype_dataset, etc.) support these public functions.

Testing:

Local testing using a script confirmed that phenotype_data and phenotype_binary load and process data correctly. The phenotypes function was also tested; the logic for merging phenotype data with genetic data appears sound and correctly handles cases where genetic data is unavailable or does not overlap with the available phenotype samples. However, fully verifying the merge output with actual overlapping data was challenging due to data availability limitations in the local test environment using the currently accessible sample sets.

Formal unit tests covering the new functions will be added as per project guidelines and mentor recommendations following this initial review.

Next Steps:

  • Requesting review of the implementation approach and code.
  • Plan to add formal unit tests covering the new functions as per project guidelines after initial feedback.

@mohamed-laarej mohamed-laarej requested a review from jonbrenas June 5, 2025 12:42
Copy link
Copy Markdown
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mohamed-laarej. I put a few comments here and there that I hope will nudge you in the right direction. This is a great start, though.

import fsspec


class PhenotypeDataMixin:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "phenotype" class should be named something like "AnophelesPhenotypeData" to be consistent with other similar classes. It isn't really a mixing because it is inherited by the classes using its functions.



class Ag3(AnophelesDataResource):
class Ag3(AnophelesDataResource, PhenotypeDataMixin):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnophelesDataResource is the class that needs to inherit the "phenotype" class as funestus and other species will probably have some phenotypic data added at some point.


_xpehh_gwss_cache_name = XPEHH_GWSS_CACHE_NAME
_ihs_gwss_cache_name = IHS_GWSS_CACHE_NAME
_phenotype_gcs_path_template = "gs://vo_agam_release_master_us_central1/v3.2/phenotypes/all/{sample_set}/phenotypes.csv"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GCS path should be built from the "GCS_DEFAULT_URL" to be consistent with similar data types.

Comment on lines +88 to +93
duplicate_cols = [col for col in df_merged.columns if col.endswith("_meta")]
for col in duplicate_cols:
base_col = col.replace("_meta", "")
if base_col in df_merged.columns:
df_merged[base_col] = df_merged[base_col].fillna(df_merged[col])
df_merged = df_merged.drop(columns=[col])
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little convoluted. I think you can check that the shared information is the same in both DataFrames and then only merge the new info.

Comment on lines +236 to +244
sample_sets: Optional[Union[str, List[str]]] = None,
sample_query: Optional[str] = None,
sample_query_options: Optional[Dict[str, Any]] = None,
insecticide: Optional[Union[str, List[str]]] = None,
dose: Optional[Union[float, List[float]]] = None,
phenotype: Optional[Union[str, List[str]]] = None,
cohort_size: Optional[int] = None,
min_cohort_size: Optional[int] = None,
max_cohort_size: Optional[int] = None,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would advise to reuse the types defined in base_params.py and to create a phenotype_params.py for the ones that are new (say dose or insecticide)

"""
Load phenotypic data from insecticide resistance bioassays.
"""
# 1. Normalize sample_sets
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A function like _prep_sample_sets_param (which is in base.py) would probably be helpful here. There are probably a few other utility functions there that would save you time.


return pd.concat(filtered_groups, ignore_index=True)

def _create_phenotype_binary_series(self, df: pd.DataFrame) -> pd.Series:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may need some more thinking. The range of values for "phenotype" in the data may need to be constrained.

@mohamed-laarej
Copy link
Copy Markdown
Collaborator Author

Hi @jonbrenas ,

Thank you very much for taking the time to review this PR and for your valuable feedback!
I've pushed new commits addressing the points you raised.
Please let me know if these changes look good or if there's anything else. Thanks again!

@mohamed-laarej mohamed-laarej requested a review from jonbrenas June 6, 2025 04:45
Copy link
Copy Markdown
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mohamed-laarej.

In addition to the tests that were mentioned during our call, I added a few comments here.

from typing import TypeAlias, Optional, Union, List

# Type alias for insecticide parameter
insecticide: TypeAlias = Optional[Union[str, List[str]]]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be 'Annotated` to help with the docs.

Provides methods for accessing insecticide resistance phenotypic data.
Inherited by AnophelesDataResource subclasses (e.g., Ag3).
"""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are missing the constructor here.

@jonbrenas
Copy link
Copy Markdown
Collaborator

Thanks @mohamed-laarej.

I ran a quick debug of your code and the reason why you don't get the snp_calls in phenotypes is that, in _create_phenotype_dataset, you use variant_data.coords["samples"].values as the list of sample ids from variant_data instead of variant_data.coords["sample_id"].values. I haven't looked at why you can't find the haplotypes yet but I am sure you will have it figured out before I get to it.

- Fixed genetic data merge to dynamically detect and use correct coordinate names

- Added comprehensive detection for both 'samples' and 'sample_id' coordinates

- Implemented dimension alignment before merging xarray datasets

- Added tests for the new phenotypes functions

Resolves #789
@jonbrenas
Copy link
Copy Markdown
Collaborator

Thanks @mohamed-laarej. This is great!

Here are a few more comments:

  • The way you have a few important functions (e.g., sample_metadata) as global variables is a little strange to me. For simplicity and clarity's sake, can you change the code so that they are inherited from the correct classes instead?
  • I think the selection of the location, the insecticide, ... should be part of the sample query and not additional parameters.
  • I think it would be simpler and clearer to have different functions that return the phenotypes + either the SNP calls, the haplotypes, or the CNVs (which I don't think are currently an option); or one function with a parameter that let's you select what data is included; or both, than your current approach of having that being dependent on the choice of a region. Unless you have a model in mind that will use both SNP calls and haplotypes I don't think a function that returns both is a clear improvement on one that returns either.
  • A notebook showing the results of a few functions would still be very helpful.

@mohamed-laarej
Copy link
Copy Markdown
Collaborator Author

Thanks @jonbrenas for the detailed feedback! Really helpful stuff. Let me go through each point:

Functions as Global Variables

Ah yeah, I can see why that looked weird! I ended up doing it that way because I was getting a bunch of MyPy errors when I first tried the mixin approach:

malariagen_data\anoph\phenotypes.py:31: error: "AnophelesPhenotypeData" has no attribute "_url" [attr-defined]
malariagen_data\anoph\phenotypes.py:377: error: "AnophelesPhenotypeData" has no attribute "sample_metadata" [attr-defined]

To fix this, I added type annotations and made them constructor params when you suggested adding a constructor. But after reviewing how the other mixins work, I refactored it to use cooperative inheritance like this:

class AnophelesPhenotypeData:
    # Type annotations for MyPy
    _url: str
    _fs: fsspec.AbstractFileSystem
    sample_metadata: Callable[..., pd.DataFrame]
    # etc...
    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

I think it's much cleaner! Now it just calls self.sample_metadata() directly like it should.
Could you please confirm if this is the right approach? Or is there anything else I should consider here to make it more idiomatic or robust?

Sample Query Filtering

Completely agree on this one! Consolidating everything into sample_query will definitely make the API cleaner and more consistent with how the rest of the codebase works.

Clearer Data Type Selection

This is a great point, the current logic is definitely confusing. I'd love your thoughts on which approach would work best:

Option A - Separate functions:

def phenotypes_with_snps(self, region, sample_query=None, ...):
def phenotypes_with_haplotypes(self, region, sample_query=None, analysis="arab", ...):
def phenotypes_with_cnvs(self, region, sample_query=None, ...):
def phenotypes_only(self, sample_query=None, ...):

Option B - Single function with parameter:

def phenotypes(self, region=None, genetic_data="snps", sample_query=None, ...):
    # where genetic_data can be "snps", "haplotypes", "cnvs", or None

I'm personally leaning toward Option A since it makes the intent really clear and avoids any ambiguity, but I'd definitely defer to your judgment on what would be most intuitive for users.

Next Steps

I'll get started on:
Cleaning up the inheritance structure based on your guidance
Moving all filtering logic to sample_query
Implementing whichever genetic data selection approach you prefer
Putting together a demo notebook to show how everything works

Thanks again for taking the time to provide such detailed feedback, it's really helpful for understanding the codebase patterns better!

@jonbrenas
Copy link
Copy Markdown
Collaborator

Thanks @mohamed-laarej. I think you got that right. Option A would probably be my choice: if we want to add more options later on, it will be clearer to add a new function than an accepted value for a parameter.

…ltering

- Replaced direct phenotype parameters with 'sample_query' for more flexible filtering.
- Split 'phenotypes' method into 'phenotypes_with_snps', 'phenotypes_with_haplotypes', and 'phenotype_data'.
- Updated integration tests to reflect these changes.
@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@mohamed-laarej
Copy link
Copy Markdown
Collaborator Author

Hi @jonbrenas ,
Thanks again for your valuable feedback!

I've addressed the points regarding:

  1. Making location, insecticide, etc., part of the sample_query parameter.
  2. Splitting the phenotypes function into phenotypes_with_snps, phenotypes_with_haplotypes, and a dedicated phenotype_data function for phenotype-only retrieval.
  3. A notebook (phenotype_data_demo.ipynb) to show how to use the new functionalities.

Regarding the CNVs, I did attempt to integrate them using cnv_coverage_calls. However, I ran into some persistent issues, mainly related to the specific cnv_analysis availability for certain sample sets and the cnv_coverage_calls API expecting a single sample_set rather than a list. I haven't quite figured out how to make it work for now, but I will continue to try and see if I can get it integrated.

@mohamed-laarej mohamed-laarej requested a review from jonbrenas June 24, 2025 11:56
Copy link
Copy Markdown
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @mohamed-laarej. In phenotype_data, sample_query_options isn't used properly: it is passed to sample_metadata without a query (i.e., it does nothing) and not used when the query is actually being evaluated.

plotly_discrete_legend,
)

from .anoph.phenotypes import AnophelesPhenotypeData
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shift to before the functions import to be consistent.

Copy link
Copy Markdown
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@leehart
Copy link
Copy Markdown
Collaborator

leehart commented Jun 30, 2025

I've always assumed that the blank line between imports at the top of ag3.py is used to visually separate external from internal imports. For comparison, there are also blank lines between the imports in af1.py, which I took to be visual separations for standard library modules (e.g. sys) and external modules (e.g. plotly.express) and internal modules (e.g. .anopheles).

This might be a style choice and does not need to be changed in this PR. Although the style does not seem to be applied consistently everywhere (e.g. anoph/aim_data.py), perhaps because it was never written as policy.

If we decide to police "don't have lines between imports", then that would require a lot of changes. Normally this wouldn't even be noticeable or noteworthy, but there is no other change in ag3.py in this PR.

@mohamed-laarej
Copy link
Copy Markdown
Collaborator Author

Thanks @leehart for the observation! To be honest, I hadn't noticed the specific pattern you're referring to in the import sections before. I tend to add blank lines wherever I feel they improve readability or visually separate blocks of code, not following any strict convention. You're right about the pattern in af1.py; I can see how those blank lines help distinguish between standard library, external, and internal imports.
That said, I appreciate you bringing this up a consistent approach across the codebase does make sense if the team ever decides to formalize it. For now, I've left things as-is since this PR is focused on new functionality, and standardizing import styling seems like something that might be better addressed team-wide.

- Re-implemented the phenotype_binary method, which was inadvertently removed during previous refactoring.
- Updated phenotype_binary to leverage the sample_query mechanism for filtering, aligning with mentor feedback.
- Added new integration tests for phenotype_binary functionality.
- Incorporated phenotype_binary examples into the demonstration notebook.
@mohamed-laarej
Copy link
Copy Markdown
Collaborator Author

Hi @jonbrenas, I just wanted to give you a quick heads-up
While reviewing the PR after approval (thank you again!), I realized that the phenotype_binary method had been inadvertently removed during the recent refactoring around sample_query and phenotype_data, and I unfortunately overlooked re-adding it.
Since this method was part of the original functionality and provides a convenient way to access binary phenotype outcomes, I’ve restored it in line with the new design principles, it now uses sample_query for filtering. I’ve also added integration tests and updated the demo notebook with relevant examples.
Apologies for the late addition to an already-approved PR. If you’d prefer this be handled in a separate PR instead, I’m happy to make that change just let me know.
Thanks again for your time and understanding!

@jonbrenas
Copy link
Copy Markdown
Collaborator

That's good thanks @mohamed-laarej !

@jonbrenas jonbrenas merged commit 2dba673 into master Jul 4, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants