Skip to content

Anopheles refactor part 2 - AnophelesGenomeSequenceData#381

Merged
alimanfoo merged 14 commits intomalariagen:masterfrom
alimanfoo:anoph-refactor-part-deux-genome-sequence-2023-03-31
Apr 3, 2023
Merged

Anopheles refactor part 2 - AnophelesGenomeSequenceData#381
alimanfoo merged 14 commits intomalariagen:masterfrom
alimanfoo:anoph-refactor-part-deux-genome-sequence-2023-03-31

Conversation

@alimanfoo
Copy link
Copy Markdown
Member

@alimanfoo alimanfoo commented Apr 3, 2023

Here we continue working towards #366 by factoring out a class AnophelesGenomeSequenceData which assumes responsibility for providing access to a reference genome sequence.

Also here we reorganise the tests to create a new file conftest.py which provides test fixtures common to multiple test modules.

Also here we try out adding coverage reports via codecov.

Comment on lines +13 to +14
MAJOR_VERSION_NUMBER = 1
MAJOR_VERSION_PATH = "v1.0"
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed for consistency with other variable names.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module provides the new AnophelesGenomeSequenceData class.

Comment on lines -753 to -782
@property
@abstractmethod
def contigs(self):
raise NotImplementedError("Must override contigs")

@property
@abstractmethod
def _genome_fasta_path(self):
raise NotImplementedError("Must override _genome_fasta_path")

@property
@abstractmethod
def _genome_fai_path(self):
raise NotImplementedError("Must override _genome_fai_path")

@property
@abstractmethod
def _genome_zarr_path(self):
raise NotImplementedError("Must override _genome_zarr_path")

@property
@abstractmethod
def _genome_ref_id(self):
raise NotImplementedError("Must override _genome_ref_id")

@property
@abstractmethod
def _genome_ref_name(self):
raise NotImplementedError("Must override _genome_ref_name")

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to parent class.

Comment on lines -2016 to -2068
@doc(
summary="Open the reference genome zarr.",
returns="Zarr hierarchy containing the reference genome sequence.",
)
def open_genome(self) -> zarr.hierarchy.Group:
if self._cache_genome is None:
path = f"{self._base_path}/{self._genome_zarr_path}"
store = init_zarr_store(fs=self._fs, path=path)
self._cache_genome = zarr.open_consolidated(store=store)
return self._cache_genome

def _genome_sequence_for_contig(self, *, contig, inline_array, chunks):
"""Obtain the genome sequence for a given contig as an array."""
assert contig in self.contigs
genome = self.open_genome()
z = genome[contig]
d = da_from_zarr(z, inline_array=inline_array, chunks=chunks)
return d

@doc(
summary="Access the reference genome sequence.",
returns="""
An array of nucleotides giving the reference genome sequence for the
given contig.
""",
)
def genome_sequence(
self,
region: base_params.region,
inline_array: base_params.inline_array = base_params.inline_array_default,
chunks: base_params.chunks = base_params.chunks_default,
) -> da.Array:
resolved_region: Region = self.resolve_region(region)
del region

# obtain complete sequence for the requested contig
d = self._genome_sequence_for_contig(
contig=resolved_region.contig, inline_array=inline_array, chunks=chunks
)

# deal with region start and stop
if resolved_region.start:
slice_start = resolved_region.start - 1
else:
slice_start = None
if resolved_region.end:
slice_stop = resolved_region.end
else:
slice_stop = None
loc_region = slice(slice_start, slice_stop)

return d[loc_region]

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to parent class.

Comment on lines +392 to +398
if hasattr(resource, "genome_features"):
gene_annotation = resource.genome_features(attributes=["ID"])
results = gene_annotation.query(f"ID == '{region}'")
if not results.empty:
# the region is a feature ID
feature = results.squeeze()
return Region(feature.contig, int(feature.start), int(feature.end))
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small modification here to support testing of the AnophelesGenomeSequenceData class.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new file defines test fixtures which are common to the base and genome_sequence test modules. Basically, we want a single place to create some test data.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of refactoring here to use pytest fixtures, and to make use of common fixtures defined in conftest.py.

@alimanfoo alimanfoo mentioned this pull request Apr 3, 2023
24 tasks
@alimanfoo alimanfoo changed the title Anophelese refactor part 2 - AnophelesGenomeSequenceData Anopheles refactor part 2 - AnophelesGenomeSequenceData Apr 3, 2023
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Poetry environment updated to include pytest-cases.

@alimanfoo alimanfoo requested review from leehart and sanjaynagi April 3, 2023 15:11
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 3, 2023

Codecov Report

❗ No coverage uploaded for pull request base (master@1285723). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 8a4705b differs from pull request most recent head 4e59bc1. Consider uploading reports for the commit 4e59bc1 to get more accurate results

@@            Coverage Diff            @@
##             master     #381   +/-   ##
=========================================
  Coverage          ?   92.02%           
=========================================
  Files             ?        2           
  Lines             ?      188           
  Branches          ?        0           
=========================================
  Hits              ?      173           
  Misses            ?       15           
  Partials          ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@alimanfoo
Copy link
Copy Markdown
Member Author

Updating the poetry environment has bumped pandas from 1.5 to 2.0, which is then generating some errors around datetime dtype conversions when trying to build the "quarter" column in the sample metadata. I've worked around here by calculating quarter directly.

@leehart
Copy link
Copy Markdown
Collaborator

leehart commented Apr 3, 2023

Updating the poetry environment has bumped pandas from 1.5 to 2.0, which is then generating some errors around datetime dtype conversions when trying to build the "quarter" column in the sample metadata. I've worked around here by calculating quarter directly.

Looks like the tests will need to be updated too, since they use the same method:

        # Check that quarter is derived from month, in cases where it is not -1
        df_samples_with_quarter = df_samples.query("quarter != -1")
        df_derived_quarters = (
            df_samples_with_quarter["month"].astype("datetime64[M]").dt.quarter
        )
>       assert np.all(df_samples_with_quarter["quarter"] == df_derived_quarters)

@alimanfoo alimanfoo merged commit 388ac7e into malariagen:master Apr 3, 2023
@alimanfoo alimanfoo deleted the anoph-refactor-part-deux-genome-sequence-2023-03-31 branch April 3, 2023 22:33
@alimanfoo
Copy link
Copy Markdown
Member Author

Thanks Lee 🙏

@alimanfoo alimanfoo added the BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024). label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants