Support multiple releases via the sample_sets parameter#88
Support multiple releases via the sample_sets parameter#88
Conversation
|
N.B., currently some tests are skipped due to problems in the discordant read calls data. The skip marks should be removed when the data are fixed, prior to merging this PR. |
malariagen_data/ag3.py
Outdated
| xarray.concat( | ||
| [ | ||
| self._snp_calls_dataset( | ||
| self.snp_calls( |
There was a problem hiding this comment.
Hi @alimanfoo, I'm not completely clear why we no longer need to call _snp_calls_dataset here (see also _cnv_discordant_read_calls_dataset)?
There was a problem hiding this comment.
Calling recursively like this allows multiple releases to be provided as the sample_sets argument. E.g., sample_sets = ['v3', 'v3.1', 'v3.2'].
There was a problem hiding this comment.
An alternative approach could be to properly flatten out the sample_sets parameter within the _prep_sample_sets_arg function, i.e., turn something like ['v3', 'v3.1', 'v3.2'] into a list of sample sets. I might look into that.
cclarkson
left a comment
There was a problem hiding this comment.
Other than my one comment, this all makes sense and looks great to me.
|
Hi @cclarkson, thanks looking at this. I took another look and decided it would be better to handle all the polymorphism of the So now, it doesn't matter whether the user provides a single sample set, or a list of sample sets, or a single release, or a list of releases, or None (implying all data please), This then simplifies logic elsewhere. In particular, methods which are performing concatenation over sample sets are simpler, and no longer need to use recursion. Hope that makes sense. |
| df = pandas.concat( | ||
| [self.sample_sets(release=r) for r in release], | ||
| axis=0, | ||
| ignore_index=True, |
There was a problem hiding this comment.
I don't think it's worth it, as the dataframes for each of the releases in the list will be cached.
malariagen_data/ag3.py
Outdated
| return root | ||
|
|
||
| def _snp_genotypes(self, *, contig, sample_set, field, inline_array, chunks): | ||
| # single single contig, single sample set |
There was a problem hiding this comment.
nitpick "single single"
There was a problem hiding this comment.
it's really single :) will fix
| self._cache_snp_genotypes[sample_set] = root | ||
| return root | ||
|
|
||
| def _snp_genotypes(self, *, contig, sample_set, field, inline_array, chunks): |
There was a problem hiding this comment.
What does the * do here? Is this the same as *args?
There was a problem hiding this comment.
It disallows any positional arguments, i.e., forces the method to be called with keyword only arguments. This is defensive, a common source of bugs is where you rely on positional arguments but then later add more positional args or change the argument order, and forget to update calling code somewhere. Forcing all calls to use keyword only arguments means if you add more arguments then all previous code will still work. See also https://www.python.org/dev/peps/pep-3102/.
|
Thanks Chris :) |
This PR makes some changes to how the
sample_setsparameter is handled. In particular:sample_setsparameter.Note that this second point makes a subtle change to how the API behaves. When interacting with public releases, this will currently be equivalent to requesting
sample_sets="v3". When accessing pre-releases, this will access everything from v3 to v3.5. The rationale here is that, generally, this change should mean that users can forget about thesample_setsparameter for most use cases, which will hopefully make things a little simpler.Resolves #85. Resolves #86.