Feat: Add functions to access insecticide resistance phenotype data by mohamed-laarej · Pull Request #792 · malariagen/malariagen-data-python

mohamed-laarej · 2025-06-04T20:14:35Z

Feat: Add functions to access insecticide resistance phenotype data

Resolves: #789

Description:

This PR introduces new functionality to access insecticide resistance phenotype data, as requested in issue #789 for GSoC 2025. The primary goal is to provide users with easy access to bioassay results (e.g., alive/dead status after insecticide exposure) alongside sample metadata and optionally linked genetic data.

Changes Implemented:

New Mixin Class: Created malariagen_data.anoph.phenotypes.PhenotypeDataMixin located in the new file malariagen_data/anoph/phenotypes.py. This mixin encapsulates all logic related to fetching, processing, and merging phenotype data.
Integration with Ag3: The main Ag3 class (in malariagen_data/ag3.py) now inherits from PhenotypeDataMixin. This seamlessly integrates the new phenotype methods, making them directly available on Ag3 instances (e.g., ag3.phenotype_data(...)).

New Functionality Provided by PhenotypeDataMixin:

The following public methods are now available via Ag3 objects:

phenotype_data(...): Loads insecticide bioassay data for specified sample sets from GCS. It merges this with sample metadata and returns a pandas DataFrame. Users can filter by sample_sets, sample_query, insecticide, dose, phenotype, and apply cohort size constraints (cohort_size, min_cohort_size, max_cohort_size).
phenotype_binary(...): Similar to phenotype_data, but processes the results to return a pandas Series containing binary phenotype outcomes (1 = alive/resistant, 0 = dead/susceptible), indexed by sample_id. This is useful for downstream analyses requiring binary resistance status.
phenotypes(...): Provides a comprehensive view by loading phenotype data and optionally merging it with genetic variant data (snp_calls or haplotypes) for a specified region. It returns an xarray Dataset containing both phenotype and genetic information for common samples. The function handles cases where genetic data is unavailable or does not overlap with the phenotype samples for the given query.
phenotype_sample_sets(): A utility method that returns a list of sample sets for which phenotype data files were found in GCS.

Several internal helper methods (_load_phenotype_data, _merge_phenotype_with_metadata, _create_phenotype_binary_series, _create_phenotype_dataset, etc.) support these public functions.

Testing:

Local testing using a script confirmed that phenotype_data and phenotype_binary load and process data correctly. The phenotypes function was also tested; the logic for merging phenotype data with genetic data appears sound and correctly handles cases where genetic data is unavailable or does not overlap with the available phenotype samples. However, fully verifying the merge output with actual overlapping data was challenging due to data availability limitations in the local test environment using the currently accessible sample sets.

Formal unit tests covering the new functions will be added as per project guidelines and mentor recommendations following this initial review.

Next Steps:

Requesting review of the implementation approach and code.
Plan to add formal unit tests covering the new functions as per project guidelines after initial feedback.

jonbrenas

Thanks @mohamed-laarej. I put a few comments here and there that I hope will nudge you in the right direction. This is a great start, though.

jonbrenas · 2025-06-05T14:53:04Z

malariagen_data/anoph/phenotypes.py

+import fsspec
+
+
+class PhenotypeDataMixin:


I think the "phenotype" class should be named something like "AnophelesPhenotypeData" to be consistent with other similar classes. It isn't really a mixing because it is inherited by the classes using its functions.

jonbrenas · 2025-06-05T14:53:23Z

malariagen_data/ag3.py



-class Ag3(AnophelesDataResource):
+class Ag3(AnophelesDataResource, PhenotypeDataMixin):


AnophelesDataResource is the class that needs to inherit the "phenotype" class as funestus and other species will probably have some phenotypic data added at some point.

jonbrenas · 2025-06-05T14:53:41Z

malariagen_data/ag3.py


    _xpehh_gwss_cache_name = XPEHH_GWSS_CACHE_NAME
    _ihs_gwss_cache_name = IHS_GWSS_CACHE_NAME
+    _phenotype_gcs_path_template = "gs://vo_agam_release_master_us_central1/v3.2/phenotypes/all/{sample_set}/phenotypes.csv"


The GCS path should be built from the "GCS_DEFAULT_URL" to be consistent with similar data types.

jonbrenas · 2025-06-05T15:00:53Z

malariagen_data/anoph/phenotypes.py

+        duplicate_cols = [col for col in df_merged.columns if col.endswith("_meta")]
+        for col in duplicate_cols:
+            base_col = col.replace("_meta", "")
+            if base_col in df_merged.columns:
+                df_merged[base_col] = df_merged[base_col].fillna(df_merged[col])
+            df_merged = df_merged.drop(columns=[col])


This seems a little convoluted. I think you can check that the shared information is the same in both DataFrames and then only merge the new info.

jonbrenas · 2025-06-05T15:03:58Z

malariagen_data/anoph/phenotypes.py

+        sample_sets: Optional[Union[str, List[str]]] = None,
+        sample_query: Optional[str] = None,
+        sample_query_options: Optional[Dict[str, Any]] = None,
+        insecticide: Optional[Union[str, List[str]]] = None,
+        dose: Optional[Union[float, List[float]]] = None,
+        phenotype: Optional[Union[str, List[str]]] = None,
+        cohort_size: Optional[int] = None,
+        min_cohort_size: Optional[int] = None,
+        max_cohort_size: Optional[int] = None,


I would advise to reuse the types defined in base_params.py and to create a phenotype_params.py for the ones that are new (say dose or insecticide)

jonbrenas · 2025-06-05T15:05:27Z

malariagen_data/anoph/phenotypes.py

+        """
+        Load phenotypic data from insecticide resistance bioassays.
+        """
+        # 1. Normalize sample_sets


A function like _prep_sample_sets_param (which is in base.py) would probably be helpful here. There are probably a few other utility functions there that would save you time.

jonbrenas · 2025-06-05T15:08:59Z

malariagen_data/anoph/phenotypes.py

+
+        return pd.concat(filtered_groups, ignore_index=True)
+
+    def _create_phenotype_binary_series(self, df: pd.DataFrame) -> pd.Series:


This may need some more thinking. The range of values for "phenotype" in the data may need to be constrained.

mohamed-laarej · 2025-06-06T04:45:33Z

Hi @jonbrenas ,

Thank you very much for taking the time to review this PR and for your valuable feedback!
I've pushed new commits addressing the points you raised.
Please let me know if these changes look good or if there's anything else. Thanks again!

jonbrenas

Thanks @mohamed-laarej.

In addition to the tests that were mentioned during our call, I added a few comments here.

jonbrenas · 2025-06-06T20:42:29Z

malariagen_data/anoph/phenotype_params.py

+from typing import TypeAlias, Optional, Union, List
+
+# Type alias for insecticide parameter
+insecticide: TypeAlias = Optional[Union[str, List[str]]]


These should be 'Annotated` to help with the docs.

jonbrenas · 2025-06-06T20:43:50Z

malariagen_data/anoph/phenotypes.py

+    Provides methods for accessing insecticide resistance phenotypic data.
+    Inherited by AnophelesDataResource subclasses (e.g., Ag3).
+    """
+


I think you are missing the constructor here.

jonbrenas · 2025-06-09T19:37:25Z

Thanks @mohamed-laarej.

I ran a quick debug of your code and the reason why you don't get the snp_calls in phenotypes is that, in _create_phenotype_dataset, you use variant_data.coords["samples"].values as the list of sample ids from variant_data instead of variant_data.coords["sample_id"].values. I haven't looked at why you can't find the haplotypes yet but I am sure you will have it figured out before I get to it.

- Fixed genetic data merge to dynamically detect and use correct coordinate names - Added comprehensive detection for both 'samples' and 'sample_id' coordinates - Implemented dimension alignment before merging xarray datasets - Added tests for the new phenotypes functions Resolves #789

jonbrenas · 2025-06-11T16:35:31Z

Thanks @mohamed-laarej. This is great!

Here are a few more comments:

The way you have a few important functions (e.g., sample_metadata) as global variables is a little strange to me. For simplicity and clarity's sake, can you change the code so that they are inherited from the correct classes instead?
I think the selection of the location, the insecticide, ... should be part of the sample query and not additional parameters.
I think it would be simpler and clearer to have different functions that return the phenotypes + either the SNP calls, the haplotypes, or the CNVs (which I don't think are currently an option); or one function with a parameter that let's you select what data is included; or both, than your current approach of having that being dependent on the choice of a region. Unless you have a model in mind that will use both SNP calls and haplotypes I don't think a function that returns both is a clear improvement on one that returns either.
A notebook showing the results of a few functions would still be very helpful.

…esponse to mentor feedback)

mohamed-laarej · 2025-06-11T19:52:28Z

Thanks @jonbrenas for the detailed feedback! Really helpful stuff. Let me go through each point:

Functions as Global Variables

Ah yeah, I can see why that looked weird! I ended up doing it that way because I was getting a bunch of MyPy errors when I first tried the mixin approach:

malariagen_data\anoph\phenotypes.py:31: error: "AnophelesPhenotypeData" has no attribute "_url" [attr-defined]
malariagen_data\anoph\phenotypes.py:377: error: "AnophelesPhenotypeData" has no attribute "sample_metadata" [attr-defined]

To fix this, I added type annotations and made them constructor params when you suggested adding a constructor. But after reviewing how the other mixins work, I refactored it to use cooperative inheritance like this:

class AnophelesPhenotypeData:
    # Type annotations for MyPy
    _url: str
    _fs: fsspec.AbstractFileSystem
    sample_metadata: Callable[..., pd.DataFrame]
    # etc...
    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

I think it's much cleaner! Now it just calls self.sample_metadata() directly like it should.
Could you please confirm if this is the right approach? Or is there anything else I should consider here to make it more idiomatic or robust?

Sample Query Filtering

Completely agree on this one! Consolidating everything into sample_query will definitely make the API cleaner and more consistent with how the rest of the codebase works.

Clearer Data Type Selection

This is a great point, the current logic is definitely confusing. I'd love your thoughts on which approach would work best:

Option A - Separate functions:

def phenotypes_with_snps(self, region, sample_query=None, ...):
def phenotypes_with_haplotypes(self, region, sample_query=None, analysis="arab", ...):
def phenotypes_with_cnvs(self, region, sample_query=None, ...):
def phenotypes_only(self, sample_query=None, ...):

Option B - Single function with parameter:

def phenotypes(self, region=None, genetic_data="snps", sample_query=None, ...):
    # where genetic_data can be "snps", "haplotypes", "cnvs", or None

I'm personally leaning toward Option A since it makes the intent really clear and avoids any ambiguity, but I'd definitely defer to your judgment on what would be most intuitive for users.

Next Steps

I'll get started on:
Cleaning up the inheritance structure based on your guidance
Moving all filtering logic to sample_query
Implementing whichever genetic data selection approach you prefer
Putting together a demo notebook to show how everything works

Thanks again for taking the time to provide such detailed feedback, it's really helpful for understanding the codebase patterns better!

jonbrenas · 2025-06-12T09:02:53Z

Thanks @mohamed-laarej. I think you got that right. Option A would probably be my choice: if we want to add more options later on, it will be clearer to add a new function than an accepted value for a parameter.

…ltering - Replaced direct phenotype parameters with 'sample_query' for more flexible filtering. - Split 'phenotypes' method into 'phenotypes_with_snps', 'phenotypes_with_haplotypes', and 'phenotype_data'. - Updated integration tests to reflect these changes.

… SNPs, Haplotypes).

review-notebook-app · 2025-06-24T11:45:24Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

mohamed-laarej · 2025-06-24T11:54:19Z

Hi @jonbrenas ,
Thanks again for your valuable feedback!

I've addressed the points regarding:

Making location, insecticide, etc., part of the sample_query parameter.
Splitting the phenotypes function into phenotypes_with_snps, phenotypes_with_haplotypes, and a dedicated phenotype_data function for phenotype-only retrieval.
A notebook (phenotype_data_demo.ipynb) to show how to use the new functionalities.

Regarding the CNVs, I did attempt to integrate them using cnv_coverage_calls. However, I ran into some persistent issues, mainly related to the specific cnv_analysis availability for certain sample sets and the cnv_coverage_calls API expecting a single sample_set rather than a list. I haven't quite figured out how to make it work for now, but I will continue to try and see if I can get it integrated.

jonbrenas

Great work, @mohamed-laarej. In phenotype_data, sample_query_options isn't used properly: it is passed to sample_metadata without a query (i.e., it does nothing) and not used when the query is actually being evaluated.

jonbrenas · 2025-06-24T14:43:23Z

malariagen_data/anopheles.py

    plotly_discrete_legend,
 )
-
+from .anoph.phenotypes import AnophelesPhenotypeData


Shift to before the functions import to be consistent.

jonbrenas

LGTM

leehart · 2025-06-30T09:53:49Z

I've always assumed that the blank line between imports at the top of ag3.py is used to visually separate external from internal imports. For comparison, there are also blank lines between the imports in af1.py, which I took to be visual separations for standard library modules (e.g. sys) and external modules (e.g. plotly.express) and internal modules (e.g. .anopheles).

This might be a style choice and does not need to be changed in this PR. Although the style does not seem to be applied consistently everywhere (e.g. anoph/aim_data.py), perhaps because it was never written as policy.

If we decide to police "don't have lines between imports", then that would require a lot of changes. Normally this wouldn't even be noticeable or noteworthy, but there is no other change in ag3.py in this PR.

mohamed-laarej · 2025-06-30T10:35:01Z

Thanks @leehart for the observation! To be honest, I hadn't noticed the specific pattern you're referring to in the import sections before. I tend to add blank lines wherever I feel they improve readability or visually separate blocks of code, not following any strict convention. You're right about the pattern in af1.py; I can see how those blank lines help distinguish between standard library, external, and internal imports.
That said, I appreciate you bringing this up a consistent approach across the codebase does make sense if the team ever decides to formalize it. For now, I've left things as-is since this PR is focused on new functionality, and standardizing import styling seems like something that might be better addressed team-wide.

- Re-implemented the phenotype_binary method, which was inadvertently removed during previous refactoring. - Updated phenotype_binary to leverage the sample_query mechanism for filtering, aligning with mentor feedback. - Added new integration tests for phenotype_binary functionality. - Incorporated phenotype_binary examples into the demonstration notebook.

mohamed-laarej · 2025-07-01T01:46:12Z

Hi @jonbrenas, I just wanted to give you a quick heads-up
While reviewing the PR after approval (thank you again!), I realized that the phenotype_binary method had been inadvertently removed during the recent refactoring around sample_query and phenotype_data, and I unfortunately overlooked re-adding it.
Since this method was part of the original functionality and provides a convenient way to access binary phenotype outcomes, I’ve restored it in line with the new design principles, it now uses sample_query for filtering. I’ve also added integration tests and updated the demo notebook with relevant examples.
Apologies for the late addition to an already-approved PR. If you’d prefer this be handled in a separate PR instead, I’m happy to make that change just let me know.
Thanks again for your time and understanding!

jonbrenas · 2025-07-04T16:21:07Z

That's good thanks @mohamed-laarej !

mohamed-laarej added 4 commits June 4, 2025 20:59

Add PhenotypeDataMixin: support phenotype data access in Ag3

672b549

Fixing Mypy errors

b8d2ad4

Fixing the new Mypy errors

b90c627

Mypy errors :)

adf2242

mohamed-laarej requested a review from jonbrenas June 5, 2025 12:42

jonbrenas requested changes Jun 5, 2025

View reviewed changes

Addresses mentor feedback

03ef646

mohamed-laarej requested a review from jonbrenas June 6, 2025 04:45

mohamed-laarej added 2 commits June 6, 2025 05:52

Fixing Mypy errors

2440b3a

Fixing another Mypy error

c662806

jonbrenas requested changes Jun 6, 2025

View reviewed changes

mohamed-laarej added 2 commits June 9, 2025 14:55

Add constructor and annotate parameters

48d1097

Fix dose filtering logic for None values (mypy-related regression)

56f0d7a

mohamed-laarej added 3 commits June 11, 2025 03:21

Fix mypy operator error with type ignore comment

2b95ce6

Fix the linting error

d632f32

Delegated function initialization to superclass via kwargs (partial r…

617d522

…esponse to mentor feedback)

mohamed-laarej added 2 commits June 24, 2025 11:58

Add Jupyter notebook demonstrating new phenotype data API (filtering,…

3dc320a

… SNPs, Haplotypes).

mohamed-laarej requested a review from jonbrenas June 24, 2025 11:56

jonbrenas requested changes Jun 24, 2025

View reviewed changes

fix: correct sample_query_options use and import order

dd9146b

jonbrenas approved these changes Jun 27, 2025

View reviewed changes

jonbrenas merged commit 2dba673 into master Jul 4, 2025
10 checks passed



		class Ag3(AnophelesDataResource):
		class Ag3(AnophelesDataResource, PhenotypeDataMixin):


		return pd.concat(filtered_groups, ignore_index=True)

		def _create_phenotype_binary_series(self, df: pd.DataFrame) -> pd.Series:

Conversation

mohamed-laarej commented Jun 4, 2025

Uh oh!

jonbrenas left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mohamed-laarej commented Jun 6, 2025

Uh oh!

jonbrenas left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonbrenas commented Jun 9, 2025

Uh oh!

jonbrenas commented Jun 11, 2025

Uh oh!

mohamed-laarej commented Jun 11, 2025

Functions as Global Variables

Sample Query Filtering

Clearer Data Type Selection

Option A - Separate functions:

Option B - Single function with parameter:

Next Steps

Uh oh!

jonbrenas commented Jun 12, 2025

Uh oh!

review-notebook-app bot commented Jun 24, 2025

Uh oh!

mohamed-laarej commented Jun 24, 2025

Uh oh!

jonbrenas left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonbrenas left a comment

Choose a reason for hiding this comment

Uh oh!

leehart commented Jun 30, 2025

Uh oh!

mohamed-laarej commented Jun 30, 2025

Uh oh!

mohamed-laarej commented Jul 1, 2025

Uh oh!

jonbrenas commented Jul 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants