Support multiple releases via the sample_sets parameter by alimanfoo · Pull Request #88 · malariagen/malariagen-data-python

alimanfoo · 2021-11-23T23:49:51Z

This PR makes some changes to how the sample_sets parameter is handled. In particular:

Make sure that multiple releases can be requested via all functions that support a sample_sets parameter.
Change the default behaviour of these functions so that data from all sample sets in all available releases are returned.

Note that this second point makes a subtle change to how the API behaves. When interacting with public releases, this will currently be equivalent to requesting sample_sets="v3". When accessing pre-releases, this will access everything from v3 to v3.5. The rationale here is that, generally, this change should mean that users can forget about the sample_sets parameter for most use cases, which will hopefully make things a little simpler.

Resolves #85. Resolves #86.

alimanfoo · 2021-11-23T23:51:05Z

N.B., currently some tests are skipped due to problems in the discordant read calls data. The skip marks should be removed when the data are fixed, prior to merging this PR.

cclarkson · 2021-11-24T13:47:09Z

malariagen_data/ag3.py

                    xarray.concat(
                        [
-                            self._snp_calls_dataset(
+                            self.snp_calls(


Hi @alimanfoo, I'm not completely clear why we no longer need to call _snp_calls_dataset here (see also _cnv_discordant_read_calls_dataset)?

Calling recursively like this allows multiple releases to be provided as the sample_sets argument. E.g., sample_sets = ['v3', 'v3.1', 'v3.2'].

An alternative approach could be to properly flatten out the sample_sets parameter within the _prep_sample_sets_arg function, i.e., turn something like ['v3', 'v3.1', 'v3.2'] into a list of sample sets. I might look into that.

cclarkson

Other than my one comment, this all makes sense and looks great to me.

alimanfoo · 2021-11-25T08:03:58Z

Hi @cclarkson, thanks for looking at this. I took another look and decided it would be better to handle all the polymorphism of the sample_sets parameter within the _prep_sample_sets_arg() method.

So now, it doesn't matter whether the user provides a single sample set, or a list of sample sets, or a single release, or a list of releases, or None (implying all data please), _prep_sample_sets_arg() always returns a list of sample sets.
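To illustrate the normalisation described above, here is a minimal sketch. It is not the actual code in ag3.py: the function name prep_sample_sets_arg, the RELEASE_SAMPLE_SETS mapping and the "set-a"-style identifiers are all hypothetical placeholders, and dropping duplicates is a choice made for this sketch only.

```python
# Hypothetical mapping from release identifier to sample set identifiers.
# The identifiers are placeholders, not real sample set names.
RELEASE_SAMPLE_SETS = {
    "v3": ["set-a", "set-b"],
    "v3.1": ["set-c"],
}


def prep_sample_sets_arg(sample_sets=None):
    """Normalise sample_sets to a flat list of sample set identifiers."""
    if sample_sets is None:
        # None means "everything": expand all known releases.
        requested = list(RELEASE_SAMPLE_SETS)
    elif isinstance(sample_sets, str):
        # A single sample set or a single release.
        requested = [sample_sets]
    else:
        # A list or tuple of sample sets and/or releases.
        requested = list(sample_sets)

    known = {s for sets in RELEASE_SAMPLE_SETS.values() for s in sets}
    flat = []
    for item in requested:
        if item in RELEASE_SAMPLE_SETS:
            expanded = RELEASE_SAMPLE_SETS[item]  # a release
        elif item in known:
            expanded = [item]  # a single sample set
        else:
            raise ValueError(f"unknown sample set or release: {item!r}")
        for s in expanded:
            if s not in flat:  # sketch-only choice: drop duplicates
                flat.append(s)
    return flat
```

Because the return value is always a flat list, a caller can loop over it directly without checking what kind of value the user passed in.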

This then simplifies logic elsewhere. In particular, methods which are performing concatenation over sample sets are simpler, and no longer need to use recursion.

Hope that makes sense.

cclarkson · 2021-11-25T13:53:49Z

malariagen_data/ag3.py

+            df = pandas.concat(
+                [self.sample_sets(release=r) for r in release],
+                axis=0,
+                ignore_index=True,


can we cache here?

I don't think it's worth it, as the dataframes for each of the releases in the list will be cached.
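A rough sketch of why caching the combined result adds little when each per-release result is already cached. This assumes a functools.lru_cache-style cache and uses tuples in place of dataframes; the loader name and the LOAD_COUNT counter are hypothetical, and the real caching mechanism in ag3.py may differ.

```python
from functools import lru_cache

# Counts how often the (pretend) expensive per-release read happens.
LOAD_COUNT = {"n": 0}


@lru_cache(maxsize=None)
def load_release_manifest(release):
    # Stands in for an expensive read of one release's sample set table.
    LOAD_COUNT["n"] += 1
    return tuple(f"{release}-set-{i}" for i in range(2))


def sample_sets(release):
    # A single release is served straight from the per-release cache.
    if isinstance(release, str):
        return list(load_release_manifest(release))
    # For a list of releases, concatenate the cached per-release results.
    # Only the cheap concatenation is repeated on each call.
    rows = []
    for r in release:
        rows.extend(load_release_manifest(r))
    return rows


sample_sets(["v3", "v3.1"])
sample_sets(["v3", "v3.1"])  # no further loads: both releases are cached
```

After both calls each release has been loaded only once, so the second call repeats just the concatenation.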

cclarkson · 2021-11-25T14:00:16Z

malariagen_data/ag3.py

            return root

+    def _snp_genotypes(self, *, contig, sample_set, field, inline_array, chunks):
+        # single single contig, single sample set


nitpick "single single"

it's really single :) will fix

cclarkson · 2021-11-25T14:03:37Z

malariagen_data/ag3.py

            self._cache_snp_genotypes[sample_set] = root
            return root

+    def _snp_genotypes(self, *, contig, sample_set, field, inline_array, chunks):


What does the * do here? Is this the same as *args?

It disallows any positional arguments, i.e., it forces the method to be called with keyword-only arguments, so it is not the same as *args. This is defensive: a common source of bugs is relying on positional arguments, then later adding more positional arguments or changing the argument order, and forgetting to update calling code somewhere. Forcing all calls to use keyword-only arguments means that if you add more arguments, all existing calls will still work. See also https://www.python.org/dev/peps/pep-3102/.
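A small standalone example of the bare * in a signature. The function below is a simplified stand-in, not the real _snp_genotypes method; its body and the default for field are made up for illustration.

```python
def snp_genotypes(*, contig, sample_set, field="GT"):
    # Everything after the bare * must be passed by keyword.
    return (contig, sample_set, field)


# Keyword call: works, and keeps working if parameters are reordered
# or new ones are added later.
result = snp_genotypes(contig="3L", sample_set="set-a")

# Positional call: rejected with a TypeError.
try:
    snp_genotypes("3L", "set-a")
    positional_call_failed = False
except TypeError:
    positional_call_failed = True
```

By contrast, *args collects any extra positional arguments into a tuple, whereas a bare * accepts none at all.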

cclarkson

LGTM

alimanfoo · 2021-11-25T17:08:16Z

Thanks Chris :)

alimanfoo added 4 commits November 23, 2021 18:14

sample_sets multi releases

0b29177

wip multi releases

a310c26

wip sample_sets rework

445e11b

wip test multiple releases

1378a94

alimanfoo changed the title ~~Support multiple releases~~ Support multiple releases via the sample_sets parameter Nov 23, 2021

tidy

7b7e445

cclarkson reviewed Nov 24, 2021


cclarkson approved these changes Nov 24, 2021


alimanfoo added 4 commits November 25, 2021 06:07

remove test skips

3f00a20

simplify handling of multiple sample sets

c9d4278

simplify sample_sets normalisation

c165b33

add test for chunks pass through

31709a8

alimanfoo mentioned this pull request Nov 25, 2021

Fix chunks argument to Ag3.snp_genotypes() getting passed through #87

Closed

fix recursion

8a7a50c

alimanfoo marked this pull request as ready for review November 25, 2021 08:04

alimanfoo requested review from cclarkson and leehart November 25, 2021 08:15

cclarkson reviewed Nov 25, 2021


fix comment

a793536

cclarkson approved these changes Nov 25, 2021


alimanfoo merged commit 9d47f6f into master Nov 25, 2021

alimanfoo deleted the fix-multi-releases-param-alimanfoo-2021-11-23 branch November 25, 2021 17:08

alimanfoo added this to the v1.0.0 milestone Dec 15, 2021

alimanfoo added the BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024). label Dec 4, 2024
