Anopheles refactor part 4 - AnophelesSampleMetadata#386
Anopheles refactor part 4 - AnophelesSampleMetadata#386alimanfoo merged 35 commits intomalariagen:masterfrom
Conversation
| def read_files( | ||
| self, | ||
| paths: Iterable[str], | ||
| on_error: Literal["raise", "omit", "return"] = "return", | ||
| ) -> Mapping[str, bytes]: | ||
| # TODO Add caching? | ||
|
|
||
| # Prepend the base path. | ||
| prefix = self._base_path + "/" | ||
| full_paths = [prefix + path for path in paths] | ||
|
|
||
| # Retrieve all files. N.B., depending on what type of storage is | ||
| # being used, the cat() function may be able to read multiple files | ||
| # concurrently. E.g., this is true if the file system is a | ||
| # GCSFileSystem, which achieves concurrency by using async I/O under the | ||
| # hood. This is a useful performance optimisation, because reading a | ||
| # file from GCS incurs some latency, and so reading many small files | ||
| # from GCS can be slow if files are not read concurrently. Hence we | ||
| # want to make use of cat() here and provide paths for all files to | ||
| # read concurrently. For more information see: | ||
| # https://filesystem-spec.readthedocs.io/en/latest/async.html | ||
| full_files = self._fs.cat(full_paths, on_error=on_error) | ||
|
|
||
| # Strip off the prefix. | ||
| files = {k.split(prefix, 1)[1]: v for k, v in full_files.items()} | ||
|
|
||
| return files |
There was a problem hiding this comment.
This new function allows us to read files concurrently.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #386 +/- ##
==========================================
+ Coverage 95.23% 95.79% +0.55%
==========================================
Files 3 4 +1
Lines 399 689 +290
==========================================
+ Hits 380 660 +280
- Misses 19 29 +10 ☔ View full report in Codecov by Sentry. |
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
There is probably more test refactoring that could be done here, moving across some testing logic which currently runs against the real data, and running it on the locally simulated data instead. But I might leave that for a future PR. |
|
A note about the testing strategy. Here we continue with the strategy of simulating a smaller data resource with local files, to accelerate the unit tests. Rather than simulate sample metadata from scratch, I have pulled in a small selection of real sample metadata CSV files, and then we sub-sample from those to generate the simulated data resource. |
|
Hi @leehart, lots going on here so please let me know if there's anything would be good to unpack or talk through. Planning to merge later tomorrow if no objections. |
Here we continue towards #366 by pulling out functions for accessing sample metadata into a new class
AnophelesSampleMetadata.Also we resolve #341 along the way, by making use of the ability of
fsspec.cat()to read multiple files concurrently. Because reading metadata is now faster, we change the caching strategy a little, and we do not report a progress bar while loading metadata.Also we take this opportunity to remove support for older legacy AIM and species analyses, and to rename some of the supporting functions for clarity. This should not affect user code, as the main user-facing functions retain the same names and parameters.