Skip to content

Anopheles refactor part 5 - AnophelesSnpData#393

Merged
alimanfoo merged 59 commits intomalariagen:masterfrom
alimanfoo:big-refactor-part-cinq-snp-data-2023-05-03
May 18, 2023
Merged

Anopheles refactor part 5 - AnophelesSnpData#393
alimanfoo merged 59 commits intomalariagen:masterfrom
alimanfoo:big-refactor-part-cinq-snp-data-2023-05-03

Conversation

@alimanfoo
Copy link
Copy Markdown
Member

@alimanfoo alimanfoo commented May 3, 2023

Here we continue working towards #366 by pulling out functions for accessing SNP data into a new AnophelesSnpData composable class.

  • open_snp_sites()
  • open_snp_genotypes()
  • open_site_filters()
  • site_filters()
  • snp_sites()
  • snp_genotypes()
  • snp_variants()
  • snp_calls()
  • open_site_annotations()
  • site_annotations()
  • is_accessible()
  • snp_allele_counts()
  • plot_snps_track()
  • plot_snps()
  • fast unit tests on simulated data, coverage up

While we are here, we also introduce the use of the typeguard package. A check_types decorator is added to public methods to provide runtime checking of parameter types, which then removes the need for us to write any type checking code. This resolves #358 (preferring typeguard over pydantic because the type checking behaviour is simpler and there is no coersion of types).

Along the way, this PR finds and fixes a number of typing errors.

Also resolves #396 by upgrading scikit-allel.

@codecov
Copy link
Copy Markdown

codecov bot commented May 3, 2023

Codecov Report

Attention: Patch coverage is 96.17391% with 22 lines in your changes missing coverage. Please review.

Project coverage is 96.04%. Comparing base (61bb489) to head (b482314).
Report is 600 commits behind head on master.

Files with missing lines Patch % Lines
malariagen_data/anoph/snp_data.py 97.00% 14 Missing ⚠️
malariagen_data/anoph/base.py 91.37% 5 Missing ⚠️
malariagen_data/anoph/sample_metadata.py 92.10% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
+ Coverage   95.79%   96.04%   +0.25%     
==========================================
  Files           4        5       +1     
  Lines         689     1238     +549     
==========================================
+ Hits          660     1189     +529     
- Misses         29       49      +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alimanfoo alimanfoo marked this pull request as ready for review May 17, 2023 13:33
@alimanfoo alimanfoo requested a review from leehart May 17, 2023 13:35
Copy link
Copy Markdown
Member Author

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes on the changes in this PR...

# any remote data in order to fail fast where possible.
- name: Run fast unit tests
run: poetry run pytest -v tests/anoph
run: poetry run pytest -v tests/anoph --typeguard-packages=malariagen_data,malariagen_data.anoph
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The --typeguard-packages option triggers the pytest typeguard extension to inject runtime type checking everywhere that type annotations have been used. This helps to catch typing errors which are not picked up statically by mypy.

Note, however, that this is something separate from the @check_types decorator which performs parameter type checks for public methods.

Comment on lines +25 to +26
parse_multi_region,
parse_single_region,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functions for parsing region parameter have been reworked to make sure it's clear when a function will accept multiple regions in a list, versus when only a single region is accepted.

self,
sample,
region,
region: base_params.single_region,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example of where we use an explicit parameter type to make clear that parameter value must be a single region.

Comment on lines +96 to +104
sample_indices: TypeAlias = Annotated[
List[int],
"""
Advanced usage parameter. A list of indices of samples to select,
corresponding to the order in which the samples are found within the
sample metadata. Either provide this parameter or sample_query, not
both.
""",
]
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter has been introduced to make typing easier when handling the case where samples are being selected via a list of indices rather than a pandas query.

Comment on lines -492 to -496
else:
raise TypeError(
f"Invalid type for sample_sets parameter; expected str, list or tuple; found: {sample_sets!r}"
)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling type errors isn't required any more, because the @check_types decorator ensures that function parameters will conform to the type annotations.

return details


def check_types(f):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B., we roll our own simple decorator here for performing runtime type checks on parameter values. See the docstring for explanation of why we don't use typeguard's @typechecked decorator.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clear outputs.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All test functions have been migrated to use simulated data where possible, to speed up the test suite.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All test functions have been migrated to use simulated data where possible, to speed up the test suite.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All test functions have been migrated to use simulated data where possible, to speed up the test suite.

alimanfoo

This comment was marked as duplicate.

@alimanfoo alimanfoo mentioned this pull request May 17, 2023
24 tasks
@alimanfoo
Copy link
Copy Markdown
Member Author

Going to merge if CI passes, but happy to follow up any comments. Need to cut a release as the xarray concatenation problem is causing big slowdowns for trying to run large PCAs, we need a release with the fix in #395.

@alimanfoo alimanfoo merged commit 77cd67b into malariagen:master May 18, 2023
@alimanfoo alimanfoo deleted the big-refactor-part-cinq-snp-data-2023-05-03 branch May 18, 2023 12:36
@alimanfoo
Copy link
Copy Markdown
Member Author

Going to merge if CI passes, but happy to follow up any comments. Need to cut a release as the xarray concatenation problem is causing big slowdowns for trying to run large PCAs, we need a release with the fix in #395.

Doh, actually forgot we already had a release with #395 in it. Ah well, probably good to get these changes in sooner.

@alimanfoo alimanfoo added the BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024). label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Speed up install on Python 3.10 Consider validating function arguments with pydantic

1 participant