Anopheles refactor part 4 - AnophelesSampleMetadata by alimanfoo · Pull Request #386 · malariagen/malariagen-data-python

alimanfoo · 2023-04-19T00:21:03Z

Here we continue towards #366 by pulling out functions for accessing sample metadata into a new class AnophelesSampleMetadata.

Also we resolve #341 along the way, by making use of the ability of fsspec.cat() to read multiple files concurrently. Because reading metadata is now faster, we change the caching strategy a little, and we do not report a progress bar while loading metadata.
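To illustrate why a single batched read helps, here is a minimal sketch, not taken from this PR, that simulates per-file latency with `asyncio.sleep` and compares fetching files one at a time with fetching them all at once. The `fetch` coroutine, the `LATENCY` value, and the file names are hypothetical stand-ins for real object-store requests.

```python
import asyncio
import time

LATENCY = 0.05  # hypothetical per-request latency, in seconds


async def fetch(path: str) -> bytes:
    # Stand-in for a remote read: wait, then return fake content.
    await asyncio.sleep(LATENCY)
    return f"contents of {path}".encode()


async def read_sequential(paths):
    # One request at a time: total time is roughly len(paths) * LATENCY.
    return {p: await fetch(p) for p in paths}


async def read_concurrent(paths):
    # All requests in flight together: total time is roughly LATENCY.
    contents = await asyncio.gather(*(fetch(p) for p in paths))
    return dict(zip(paths, contents))


def timed(coro):
    # Run a coroutine to completion and report the elapsed wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


paths = [f"release/samples_{i}.csv" for i in range(10)]
seq_files, seq_seconds = timed(read_sequential(paths))
con_files, con_seconds = timed(read_concurrent(paths))
print(f"sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

Both approaches return the same mapping; only the elapsed time differs, and the gap grows with the number of files.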

Also we take this opportunity to remove support for older legacy AIM and species analyses, and to rename some of the supporting functions for clarity. This should not affect user code, as the main user-facing functions retain the same names and parameters.

alimanfoo · 2023-04-19T00:26:34Z

malariagen_data/anoph/base.py

+    def read_files(
+        self,
+        paths: Iterable[str],
+        on_error: Literal["raise", "omit", "return"] = "return",
+    ) -> Mapping[str, bytes]:
+        # TODO Add caching?
+
+        # Prepend the base path.
+        prefix = self._base_path + "/"
+        full_paths = [prefix + path for path in paths]
+
+        # Retrieve all files. N.B., depending on what type of storage is
+        # being used, the cat() function may be able to read multiple files
+        # concurrently. E.g., this is true if the file system is a
+        # GCSFileSystem, which achieves concurrency by using async I/O under the
+        # hood. This is a useful performance optimisation, because reading a
+        # file from GCS incurs some latency, and so reading many small files
+        # from GCS can be slow if files are not read concurrently. Hence we
+        # want to make use of cat() here and provide paths for all files to
+        # read concurrently. For more information see:
+        # https://filesystem-spec.readthedocs.io/en/latest/async.html
+        full_files = self._fs.cat(full_paths, on_error=on_error)
+
+        # Strip off the prefix.
+        files = {k.split(prefix, 1)[1]: v for k, v in full_files.items()}
+
+        return files


This new function allows us to read files concurrently.
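One thing to keep in mind with `on_error="return"`: the fsspec documentation for `cat()` says each value in the result is then either the file's bytes or an exception instance, so callers need to check values before using them. The sketch below shows one way a caller might split such a result. `split_results` and the example mapping are hypothetical and not part of this PR.

```python
from typing import Dict, Mapping, Tuple, Union


def split_results(
    files: Mapping[str, Union[bytes, Exception]],
) -> Tuple[Dict[str, bytes], Dict[str, Exception]]:
    # Separate successfully read files from paths that failed.
    ok: Dict[str, bytes] = {}
    failed: Dict[str, Exception] = {}
    for path, value in files.items():
        if isinstance(value, Exception):
            failed[path] = value
        else:
            ok[path] = value
    return ok, failed


# Hypothetical mapping shaped like the return value of read_files()
# when called with on_error="return".
result = {
    "a/samples.meta.csv": b"sample_id,country\nS1,Ghana\n",
    "b/samples.meta.csv": FileNotFoundError("b/samples.meta.csv"),
}
ok, failed = split_results(result)
print(sorted(ok), sorted(failed))
```

With `on_error="omit"` the failed paths would instead be absent from the mapping, and with `on_error="raise"` the first failure would propagate as an exception.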

codecov · 2023-04-19T20:52:18Z

Codecov Report

Attention: Patch coverage is 96.57534% with 10 lines in your changes missing coverage. Please review.

Project coverage is 95.79%. Comparing base (5c8b914) to head (9a90d8d).
Report is 602 commits behind head on master.

Files with missing lines	Patch %	Lines
malariagen_data/anoph/sample_metadata.py	96.40%	10 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #386      +/-   ##
==========================================
+ Coverage   95.23%   95.79%   +0.55%     
==========================================
  Files           3        4       +1     
  Lines         399      689     +290     
==========================================
+ Hits          380      660     +280     
- Misses         19       29      +10


review-notebook-app · 2023-04-24T23:11:24Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

alimanfoo · 2023-04-25T00:56:23Z

There is probably more test refactoring that could be done here, moving across some testing logic which currently runs against the real data, and running it on the locally simulated data instead. But I might leave that for a future PR.

alimanfoo · 2023-04-25T15:56:51Z

A note about the testing strategy. Here we continue with the strategy of simulating a smaller data resource with local files, to accelerate the unit tests. Rather than simulate sample metadata from scratch, I have pulled in a small selection of real sample metadata CSV files, and then we sub-sample from those to generate the simulated data resource.
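As a rough illustration of the sub-sampling idea, here is a minimal sketch using only the standard library. It is an assumption about the general approach, not the code used in this PR; `subsample_csv`, the column names, and the example rows are hypothetical.

```python
import csv
import io
import random


def subsample_csv(text: str, n: int, seed: int = 0) -> str:
    # Parse the CSV, keep the header, and draw n rows without replacement.
    reader = csv.reader(io.StringIO(text))
    header, *rows = list(reader)
    rng = random.Random(seed)  # seeded so the simulated data is reproducible
    chosen = rng.sample(rows, min(n, len(rows)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(chosen)
    return out.getvalue()


# Hypothetical miniature metadata file; real files have more columns.
source = (
    "sample_id,country,year\n"
    "S1,Ghana,2012\n"
    "S2,Mali,2014\n"
    "S3,Kenya,2010\n"
    "S4,Uganda,2015\n"
)
simulated = subsample_csv(source, n=2)
print(simulated)
```

Seeding the random generator keeps the simulated resource stable between test runs, which makes failures easier to reproduce.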

alimanfoo · 2023-04-26T13:04:07Z

Hi @leehart, lots going on here, so please let me know if there's anything that would be good to unpack or talk through.

Planning to merge later tomorrow if no objections.

alimanfoo added 13 commits April 15, 2023 22:14

start work on sample metadata refactor

3f7e56b

wip

e789162

wip

bfc0997

fix

623c7f1

wip

d952ebe

wip

42b1aa8

general metadata

cb7be50

consistify approac

8fa8988

wip

f1a5d1b

comments

5216ffc

wip aim metadata

9561678

big sample metadata hack

8a46df7

resolveconflicts

8528b32

alimanfoo commented Apr 19, 2023

fix bug and tests

2f188b4

rename to metadata; fix typing

a669ec9

alimanfoo changed the title ~~Anopheles refactor part 4 - AnophelesSampleData~~ Anopheles refactor part 4 - AnophelesSampleMetadata Apr 19, 2023

alimanfoo added 12 commits April 20, 2023 23:35

wip fix test errors

2d161bb

wip more metadata tests

4b414ce

wip test sample_metadata

a1152b6

wip test sample_metadata

54621c2

refactor extra metadata

35c0426

clean out after refactor

d9f55d5

clean out after refactor

e927843

cache sample metadata

c23287b

clean out after refactor

9bccb4c

fix test

b749d8f

refresh notebook

5988572

refresh notebook cache

6e8656f

alimanfoo added 2 commits April 25, 2023 00:24

add file caching

196b709

clean up

57b28e5

alimanfoo marked this pull request as ready for review April 24, 2023 23:45

alimanfoo mentioned this pull request Apr 24, 2023

Anopheles refactor #366

Open

24 tasks

move more metadata functions

e613bde

alimanfoo requested a review from leehart April 25, 2023 00:19

alimanfoo added 2 commits April 25, 2023 01:36

simplify

e0cfc65

fix notebook

2554a4c

alimanfoo added 3 commits April 25, 2023 15:09

increase coverage

47d16a6

simplify

9153f11

tighten types

9a90d8d

alimanfoo merged commit 29b7023 into malariagen:master Apr 27, 2023

alimanfoo deleted the sample-data-refactor-2023-04-15 branch April 27, 2023 14:47

alimanfoo added the BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024). label Dec 4, 2024

alimanfoo merged 35 commits into malariagen:master from alimanfoo:sample-data-refactor-2023-04-15
