Skip to content

Support multiple releases via the sample_sets parameter#88

Merged
alimanfoo merged 11 commits intomasterfrom
fix-multi-releases-param-alimanfoo-2021-11-23
Nov 25, 2021
Merged

Support multiple releases via the sample_sets parameter#88
alimanfoo merged 11 commits intomasterfrom
fix-multi-releases-param-alimanfoo-2021-11-23

Conversation

@alimanfoo
Copy link
Copy Markdown
Member

@alimanfoo alimanfoo commented Nov 23, 2021

This PR makes some changes to how the sample_sets parameter is handled. In particular:

  • Make sure that multiple releases can be requested via all functions that support a sample_sets parameter.
  • Change the default behaviour of these functions so that data from all samples sets from all available releases are returned by default.

Note that this second point makes a subtle change to how the API behaves. When interacting with public releases, this will currently be equivalent to requesting sample_sets="v3". When accessing pre-releases, this will access everything from v3 to v3.5. The rationale here is that, generally, this change should mean that users can forget about the sample_sets parameter for most use cases, which will hopefully make things a little simpler.

Resolves #85. Resolves #86.

@alimanfoo alimanfoo changed the title Support multiple releases Support multiple releases via the sample_sets parameter Nov 23, 2021
@alimanfoo
Copy link
Copy Markdown
Member Author

N.B., currently some tests are skipped due to problems in the discordant read calls data. The skip marks should be removed when the data are fixed, prior to merging this PR.

xarray.concat(
[
self._snp_calls_dataset(
self.snp_calls(
Copy link
Copy Markdown
Collaborator

@cclarkson cclarkson Nov 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alimanfoo, I'm not completely clear why we no longer need to call _snp_calls_dataset here (see also _cnv_discordant_read_calls_dataset)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling recursively like this allows multiple releases to be provided as the sample_sets argument. E.g., sample_sets = ['v3', 'v3.1', 'v3.2'].

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative approach could be to properly flatten out the sample_sets parameter within the _prep_sample_sets_arg function, i.e., turn something like ['v3', 'v3.1', 'v3.2'] into a list of sample sets. I might look into that.

Copy link
Copy Markdown
Collaborator

@cclarkson cclarkson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than my one comment, this all makes sense and looks great to me.

@alimanfoo
Copy link
Copy Markdown
Member Author

Hi @cclarkson, thanks looking at this. I took another look and decided it would be better to handle all the polymorphism of the sample_sets parameter within the _prep_sample_sets_arg() method.

So now, it doesn't matter whether the user provides a single sample set, or a list of sample sets, or a single release, or a list of releases, or None (implying all data please), _prep_sample_sets_arg() always returns a list of sample sets.

This then simplifies logic elsewhere. In particular, methods which are performing concatenation over sample sets are simpler, and no longer need to use recursion.

Hope that makes sense.

@alimanfoo alimanfoo marked this pull request as ready for review November 25, 2021 08:04
df = pandas.concat(
[self.sample_sets(release=r) for r in release],
axis=0,
ignore_index=True,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we cache here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth it, as the dataframes for each of the releases in the list will be cached.

return root

def _snp_genotypes(self, *, contig, sample_set, field, inline_array, chunks):
# single single contig, single sample set
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick "single single"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's really single :) will fix

self._cache_snp_genotypes[sample_set] = root
return root

def _snp_genotypes(self, *, contig, sample_set, field, inline_array, chunks):
Copy link
Copy Markdown
Collaborator

@cclarkson cclarkson Nov 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the * do here? Is this the same as *args?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It disallows any positional arguments, i.e., forces the method to be called with keyword only arguments. This is defensive, a common source of bugs is where you rely on positional arguments but then later add more positional args or change the argument order, and forget to update calling code somewhere. Forcing all calls to use keyword only arguments means if you add more arguments then all previous code will still work. See also https://www.python.org/dev/peps/pep-3102/.

Copy link
Copy Markdown
Collaborator

@cclarkson cclarkson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alimanfoo alimanfoo merged commit 9d47f6f into master Nov 25, 2021
@alimanfoo
Copy link
Copy Markdown
Member Author

Thanks Chris :)

@alimanfoo alimanfoo deleted the fix-multi-releases-param-alimanfoo-2021-11-23 branch November 25, 2021 17:08
@alimanfoo alimanfoo added this to the v1.0.0 milestone Dec 15, 2021
@alimanfoo alimanfoo added the BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024). label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Chunks parameter appears to be ignored Ag3.snp_calls() cannot take multiple releases for sample_set parameter

2 participants