Add support for writing Anopheles SNP data to the plink binary file format#515
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
Hey Tristan. Nice work! Ill save comments for now but FYI - when you add notebooks to malariagen_data, make sure you have cleared all outputs, otherwise they can become quite hefty in size and then the repo balloons in size over time (all of it is stored in git history). |
…s, avoids using dask
…variant_allele mapping
|
I've found the source of the AssertionError (also see issue #516) - something to do with how I haven't managed to get to the bottom of it yet but in this PR there's a temporary fix that just applies |
Removed numba decorator for the apply_allele_mapping function, for now
|
I've (I hope!) made a fix to the above error (issue #516), and described it in more detail there |
|
I think it should work. Feel free to mark it as "Ready for review" when you think that it is appropriate. |
jonbrenas
left a comment
There was a problem hiding this comment.
Hi @tristanpwdennis. Before we merge this PR, could I ask you to do some clean-up?
test-1.ipynb needs to be removed and you made some changes to .gitignore and test_snp_data.py. There are also quite a few print commands that need to be removed or switched to debug mode.
Could you also add a test that checks (at least) that the file is created?
malariagen_data/anoph/to_plink.py
Outdated
| # Filter SNPs for segregating sites only | ||
| with self._spinner("Subsetting to segregating sites"): | ||
| gt = ds_snps["call_genotype"].data.compute() | ||
| print("count alleles") |
There was a problem hiding this comment.
Can you remove this print statement?
malariagen_data/anoph/to_plink.py
Outdated
| print("count alleles") | ||
| with ProgressBar(): | ||
| ac = allel.GenotypeArray(gt).count_alleles(max_allele=3) | ||
| print("ascertain segregating sites") |
malariagen_data/anoph/to_plink.py
Outdated
| & (ac[:, 0] <= max_ref_ac) | ||
| & (an_missing <= max_missing_an) | ||
| ) | ||
| print(f"ascertained {np.count_nonzero(loc_sites):,} sites") |
malariagen_data/anoph/to_plink.py
Outdated
| print(f"ascertained {np.count_nonzero(loc_sites):,} sites") | ||
|
|
||
| # Set up dataset with required vars for plink conversion | ||
| print("Set up dataset") |
tests/anoph/test_snp_data.py
Outdated
| assert "variants" in contig_grp | ||
| variants = contig_grp["variants"] | ||
| assert "POS" in variants | ||
| assert False |
There was a problem hiding this comment.
Not sure what this is here for.
|
Thank you very much @tristanpwdennis. This is great. I added a few comments where there are still some |
|
Hi Jon, |
|
Hi @jonbrenas, I had a tidy and removed some redundant code from to_plink.py. I also added a test (test_plink_converter.py) to make sure the files are created. Let me know how everything looks & if this is sufficient, or if I can add any more tests. Hope this works ok! |
alimanfoo
left a comment
There was a problem hiding this comment.
Hi @tristanpwdennis, all checks are passing in the shadow PR so we're looking great here and almost ready to merge!
Just a couple more things here to add documentation for the new function...
|
Btw have taken the liberty to edit the PR title and description just so this PR will look good in the release notes :) |
Co-authored-by: Alistair Miles <[email protected]>
|
Addressed comments and feeling the urge to merge |
alimanfoo
left a comment
There was a problem hiding this comment.
Thanks @tristanpwdennis, noticed a couple more small things. Promise then I'll stop nit picking! If we could make these changes, then get all the checks to pass over in the shadow PR, happy to merge.
Co-authored-by: Alistair Miles <[email protected]>
Co-authored-by: Alistair Miles <[email protected]>
Co-authored-by: Alistair Miles <[email protected]>
Co-authored-by: Alistair Miles <[email protected]>
Co-authored-by: Alistair Miles <[email protected]>
|
Accepted corrections :) |
|
Thanks @tristanpwdennis. You probably saw that the tests failed. It looks like the |
|
EDIT: I see this now on github, let me try to fix |
alimanfoo
left a comment
There was a problem hiding this comment.
All looks good to me, happy to be merged once checks have passed.
|
Wooohoo! |
|
Hooray, thanks guys! I wonder if I get a GH badge for being the longest open PR on a given repo... |
This PR adds a new function
biallelic_snps_to_plink()to theAg3andAf1APIs which allows for selection of a set of biallelic SNPs for export to the plink binary file format. This then allows for data to be imported into tools like ADMIXTURE which can read the plink format.Resolves #248.