Skip to content

Anopheles refactor part 4 - AnophelesSampleMetadata#386

Merged
alimanfoo merged 35 commits intomalariagen:masterfrom
alimanfoo:sample-data-refactor-2023-04-15
Apr 27, 2023
Merged

Anopheles refactor part 4 - AnophelesSampleMetadata#386
alimanfoo merged 35 commits intomalariagen:masterfrom
alimanfoo:sample-data-refactor-2023-04-15

Conversation

@alimanfoo
Copy link
Copy Markdown
Member

@alimanfoo alimanfoo commented Apr 19, 2023

Here we continue towards #366 by pulling out functions for accessing sample metadata into a new class AnophelesSampleMetadata.

Also we resolve #341 along the way, by making use of the ability of fsspec.cat() to read multiple files concurrently. Because reading metadata is now faster, we change the caching strategy a little, and we do not report a progress bar while loading metadata.

Also we take this opportunity to remove support for older legacy AIM and species analyses, and to rename some of the supporting functions for clarity. This should not affect user code, as the main user-facing functions retain the same names and parameters.

Comment on lines +246 to +272
def read_files(
self,
paths: Iterable[str],
on_error: Literal["raise", "omit", "return"] = "return",
) -> Mapping[str, bytes]:
# TODO Add caching?

# Prepend the base path.
prefix = self._base_path + "/"
full_paths = [prefix + path for path in paths]

# Retrieve all files. N.B., depending on what type of storage is
# being used, the cat() function may be able to read multiple files
# concurrently. E.g., this is true if the file system is a
# GCSFileSystem, which achieves concurrency by using async I/O under the
# hood. This is a useful performance optimisation, because reading a
# file from GCS incurs some latency, and so reading many small files
# from GCS can be slow if files are not read concurrently. Hence we
# want to make use of cat() here and provide paths for all files to
# read concurrently. For more information see:
# https://filesystem-spec.readthedocs.io/en/latest/async.html
full_files = self._fs.cat(full_paths, on_error=on_error)

# Strip off the prefix.
files = {k.split(prefix, 1)[1]: v for k, v in full_files.items()}

return files
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new function allows us to read files concurrently.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 19, 2023

Codecov Report

Attention: Patch coverage is 96.57534% with 10 lines in your changes missing coverage. Please review.

Project coverage is 95.79%. Comparing base (5c8b914) to head (9a90d8d).
Report is 602 commits behind head on master.

Files with missing lines Patch % Lines
malariagen_data/anoph/sample_metadata.py 96.40% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #386      +/-   ##
==========================================
+ Coverage   95.23%   95.79%   +0.55%     
==========================================
  Files           3        4       +1     
  Lines         399      689     +290     
==========================================
+ Hits          380      660     +280     
- Misses         19       29      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alimanfoo alimanfoo changed the title Anopheles refactor part 4 - AnophelesSampleData Anopheles refactor part 4 - AnophelesSampleMetadata Apr 19, 2023
@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@alimanfoo alimanfoo marked this pull request as ready for review April 24, 2023 23:45
@alimanfoo alimanfoo mentioned this pull request Apr 24, 2023
24 tasks
@alimanfoo alimanfoo requested a review from leehart April 25, 2023 00:19
@alimanfoo
Copy link
Copy Markdown
Member Author

There is probably more test refactoring that could be done here, moving across some testing logic which currently runs against the real data, and running it on the locally simulated data instead. But I might leave that for a future PR.

@alimanfoo
Copy link
Copy Markdown
Member Author

A note about the testing strategy. Here we continue with the strategy of simulating a smaller data resource with local files, to accelerate the unit tests. Rather than simulate sample metadata from scratch, I have pulled in a small selection of real sample metadata CSV files, and then we sub-sample from those to generate the simulated data resource.

@alimanfoo
Copy link
Copy Markdown
Member Author

Hi @leehart, lots going on here so please let me know if there's anything would be good to unpack or talk through.

Planning to merge later tomorrow if no objections.

@alimanfoo alimanfoo merged commit 29b7023 into malariagen:master Apr 27, 2023
@alimanfoo alimanfoo deleted the sample-data-refactor-2023-04-15 branch April 27, 2023 14:47
@alimanfoo alimanfoo added the BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024). label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Asynchronously load sample metadata

1 participant