Haplotype frequency analysis methods by jonbrenas · Pull Request #630 · malariagen/malariagen-data-python

jonbrenas · 2024-09-27T14:47:04Z

This PR adds two new functions for computing haplotype frequencies within a genome region of interest, haplotype_frequencies() and haplotype_frequencies_advanced(). The outputs from these functions can then be passed into generic functions for visualising frequency data as a heatmap, time series or map.

Resolves #367.

jonbrenas · 2024-10-16T16:12:21Z

I think I am getting there, but I have not been able to run a notebook with this version of the code, so I am not yet 100% sure that it does what it is supposed to.

alimanfoo · 2024-10-16T17:25:22Z

Hi @jonbrenas, just to say this looks like it's moving in a good direction :)

In the malariagen_data.anoph.h12 module there is a function haplotype_frequencies() which you could use to calculate haplotype frequencies here, if you moved haplotype_frequencies() to a common location like malariagen_data.util module.

jonbrenas · 2024-10-16T17:40:39Z

Thanks @alimanfoo. I had used haplotype_frequencies() in my first prototype but, for the time series plot, counts and nobs are also needed. I considered moving the function and slightly modifying the functions in malariagen_data.anoph.h12 that use haplotype_frequencies() (which is probably only garud_h12), but I wanted to keep side modifications in this PR to a minimum. I guess haplotype_frequencies() is much faster thanks to numba, so I'll transition to it now.
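
To make the three quantities mentioned above concrete, here is a minimal pure-Python sketch that returns frequencies, counts and the number of observations, each keyed by a hash of the haplotype. It is not the numba implementation in the library: the name `toy_haplotype_frequencies` and the input layout (a list of haplotypes, each a tuple of allele codes) are assumptions made for illustration only.

```python
from collections import Counter


def toy_haplotype_frequencies(haplotypes):
    """Return (frequencies, counts, nobs) dicts keyed by haplotype hash.

    Hypothetical helper: `haplotypes` is a non-empty list of tuples of
    allele codes, one tuple per sampled haplotype.
    """
    n = len(haplotypes)
    # Count how many times each distinct haplotype (by hash) is seen.
    counts = dict(Counter(hash(h) for h in haplotypes))
    # Frequency is the count divided by the number of haplotypes sampled.
    freqs = {k: c / n for k, c in counts.items()}
    # Every observed haplotype shares the same number of observations.
    nobs = {k: n for k in counts}
    return freqs, counts, nobs


haps = [(0, 1, 0), (0, 1, 0), (1, 1, 0), (0, 0, 0)]
freqs, counts, nobs = toy_haplotype_frequencies(haps)
```

Returning all three from one call is what would let a time series plot derive confidence intervals without recomputing anything.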

alimanfoo · 2024-10-18T11:26:36Z

malariagen_data/anoph/hap_freq.py

+        hap_track: dict[np.int64, float] = {}
+        for coh, loc_coh in cohorts_iterator:
+            hap_track = {k: 0 for k in hap_track.keys()}
+            n_samples = np.count_nonzero(loc_coh)
+            assert n_samples >= min_cohort_size
+            gt_coh = allel.GenotypeDaskArray(da.compress(loc_coh, gt, axis=1))
+            gt_hap = gt_coh.to_haplotypes().compute()
+            f, _, _ = haplotype_frequencies(gt_hap)
+            hap_track.update(f)
+            freq_cols["frq_" + coh] = list(hap_track.values())


I am a little concerned about ensuring that the frequency values are the same for each cohort, i.e., the values correspond to the same haplotypes. Also I think this could end up bleeding values between different cohorts.

Am I right that you are trying to use hap_track to keep a record of what haplotypes you have seen so far?

Consider what would happen here if you have two cohorts, where some haplotypes are present in both cohorts, and each cohort has haplotypes which are not observed in the other cohort.

I wonder if it would be safer to do this in two passes. In the first pass, you compute the results of calling haplotype_frequencies() for each cohort. Then you find the set of all haplotype keys across all cohorts, and use this to generate a consistent array of frequencies for each cohort, guaranteeing the same order of values for each cohort, and filling with 0 in case the haplotype is not observed.

If you did that you could also add a "label" column to the output dataframe, which could contain the hash for each haplotype. Not super user friendly, but would at least allow us to refer to the different haplotypes in some way?
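
The two-pass idea can be sketched roughly as follows, using plain dicts and lists so the example stays self-contained. The helper name `align_cohort_frequencies` and the integer hashes are hypothetical; the first pass is assumed to have already produced one `{hash: frequency}` dict per cohort.

```python
def align_cohort_frequencies(freqs_by_cohort):
    """Hypothetical helper: build aligned frequency columns.

    `freqs_by_cohort` maps a cohort name to a {haplotype_hash: frequency}
    dict, i.e. the result of the first pass over the cohorts.
    """
    # Second pass, step 1: the set of all haplotype keys across cohorts,
    # sorted so that every column uses the same row order.
    labels = sorted(set().union(*(f.keys() for f in freqs_by_cohort.values())))
    columns = {"label": labels}
    # Second pass, step 2: one column per cohort, filling with 0 when the
    # haplotype was not observed in that cohort.
    for coh, f in freqs_by_cohort.items():
        columns["frq_" + coh] = [f.get(k, 0.0) for k in labels]
    return columns


per_cohort = {
    "coh_a": {101: 0.75, 202: 0.25},
    "coh_b": {202: 0.5, 303: 0.5},
}
table = align_cohort_frequencies(per_cohort)
```

Because every column is built from the same `labels` list, row `i` refers to the same haplotype in every cohort, and the `label` column gives a way to refer to it.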

Not sure I get what you mean. I have little time so what I say might be gibberish.

For the first part, haplotype_frequencies() returns a dict where the keys are the hash for each haplotype and the value is the frequency. These keys are then used by hap_track so it should already do what I think you asked with one pass.
For the second part, "label" is already used for the index (because plot_frequencies_heatmap() requires it, I think). We could keep the old index (containing the hash) in a column "hash" if you want.

> For the first part, haplotype_frequencies() returns a dict where the keys are the hash for each haplotype and the value is the frequency. These keys are then used by hap_track so it should already do what I think you asked with one pass.

Right, after you have iterated through all cohorts then hap_track will have seen all haplotypes present in any cohort, and so the keys will contain hashes of all haplotypes present.

But if you are outputting the frequencies as you go, within that first pass, then the number of values you find will vary between cohorts. This is because you find more haplotypes as you visit more cohorts.

Also, hap_track currently accumulates data from all cohorts. When you visit a new cohort, any haplotypes previously found will have their frequencies updated. But any haplotypes found in previous cohorts but not present in the current cohort will still be present within hap_track. If you then take all values from hap_track you could get a mangle of data from the current cohort and data from cohorts visited in previous iterations.
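
A toy reproduction of the one-pass loop may help to see both effects. It is only a sketch: `one_pass`, the `reset` flag and the integer hashes are hypothetical, and the per-cohort frequency dicts stand in for the output of haplotype_frequencies().

```python
def one_pass(cohort_freqs, reset):
    """Mimic the single-pass loop over cohorts (hypothetical helper).

    `cohort_freqs` is a list of (cohort_name, {hash: frequency}) pairs.
    """
    hap_track = {}
    cols = {}
    for coh, f in cohort_freqs:
        if reset:
            # Zero the values of every haplotype seen so far.
            hap_track = {k: 0 for k in hap_track}
        hap_track.update(f)
        # Values are emitted before later cohorts have been visited.
        cols["frq_" + coh] = list(hap_track.values())
    return cols


cohorts = [
    ("coh_a", {101: 0.75, 202: 0.25}),
    ("coh_b", {202: 0.5, 303: 0.5}),
]
with_reset = one_pass(cohorts, reset=True)
without_reset = one_pass(cohorts, reset=False)
```

With the reset, the first column has two values and the second has three, so the columns need aligning afterwards. Without the reset, the value 0.75 from the first cohort leaks into the second cohort's column.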

Sorry @alimanfoo, I thought I had commented further yesterday but my comments seem to have got lost (I probably forgot to click on the big green button).

> Right, after you have iterated through all cohorts then hap_track will have seen all haplotypes present in any cohort, and so the keys will contain hashes of all haplotypes present.
> But if you are outputting the frequencies as you go, within that first pass, then the number of values you find will vary between cohorts. This is because you find more haplotypes as you visit more cohorts.

That's right. This is the reason why there is a step before creating the DataFrame that adds 0 to the end of the lists that are shorter than the total number of haplotypes (which is the length of the last list generated). This could be a problem if update doesn't treat the dict entries as a queue and reorders them (according to whatever order it likes). I tried this:

d = {"B": 0, "C": 1}
d.update({"A": 2, "B": 3})
d.values()

and it correctly returns the values in the order [3, 1, 2], so the order does seem to be kept and new haplotypes are always added at the end of hap_track. I should probably document my code better (that is, at least a little) to make my reasoning clearer.
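
For reference, dict insertion order is a language guarantee since Python 3.7, and updating an existing key keeps its original position. The sketch below combines that with the padding step described above; `pad_columns` and the column names are hypothetical and only illustrate the idea.

```python
def pad_columns(cols):
    """Append zeros to shorter lists so all columns have equal length.

    Hypothetical helper: this is only valid if new haplotypes are always
    appended at the end, which dict insertion order guarantees.
    """
    n = max(len(v) for v in cols.values())
    return {name: v + [0] * (n - len(v)) for name, v in cols.items()}


hap_track = {"B": 0, "C": 1}
hap_track.update({"A": 2, "B": 3})
keys_in_order = list(hap_track.keys())  # existing keys stay put, new ones go last

cols = {"frq_first": [0.75, 0.25], "frq_second": [0, 0.5, 0.5]}
padded = pad_columns(cols)
```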

> Also, hap_track currently accumulates data from all cohorts. When you visit a new cohort, any haplotypes previously found will have their frequencies updated. But any haplotypes found in previous cohorts but not present in the current cohort will still be present within hap_track. If you then take all values from hap_track you could get a mangle of data from the current cohort and data from cohorts visited in previous iterations.

The first step of every cohort run is to reset all values of hap_track to 0 so I don't think there is a risk of mangling. I was scared at some point that the content of hap_track would be copied by reference instead of by value in freq_coh but given that they end up different, I don't think that's the case.

There is one situation that I am not quite sure how to handle though. Resetting all values to 0 for each cohort run makes sense for freq and count but, if a haplotype is not observed, having 0 as the nobs value might be wrong. What do you think? Should the lengthening step set all values in nobs to be equal, or is 0 OK for haplotypes that are not observed? As far as I can see, it only matters for the CIs ... and I just convinced myself that it is wrong to have nobs == 0, so I will update that when I come back from my leave.
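
A small sketch of why nobs == 0 is a problem for confidence intervals: an interval for an unobserved haplotype is still well defined when nobs is the number of haplotypes sampled in the cohort, but not when nobs is 0. The thread does not say which interval method the library uses, so the Wilson score interval and the function name `wilson_interval` are assumptions here.

```python
import math


def wilson_interval(count, nobs, z=1.96):
    """Wilson score interval for a proportion (assumed method, see above)."""
    if nobs <= 0:
        # With no observations the proportion is undefined.
        raise ValueError("nobs must be positive")
    p = count / nobs
    denom = 1 + z**2 / nobs
    centre = (p + z**2 / (2 * nobs)) / denom
    half = z * math.sqrt(p * (1 - p) / nobs + z**2 / (4 * nobs**2)) / denom
    # Clip to the valid range for a frequency.
    return max(0.0, centre - half), min(1.0, centre + half)


# Haplotype not observed in a cohort of 40 sampled haplotypes.
ci_low, ci_high = wilson_interval(count=0, nobs=40)
```

With count=0 and nobs=40 the lower bound is essentially 0 and the upper bound is a finite value below 0.1, whereas nobs=0 cannot produce an interval at all.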

leehart · 2024-11-05T14:59:47Z

@jonbrenas to add examples here and notebook examples.

review-notebook-app · 2024-11-06T10:12:14Z

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

alimanfoo

Thanks so much @jonbrenas, this is looking good. Some suggestions for code simplification and to remove duplication.

Also please add the new public methods haplotype_frequencies() and haplotype_frequencies_advanced() to the Ag3 and Af1 API docs.

malariagen_data/anoph/hap_freq.py

alimanfoo · 2024-11-21T12:50:00Z

malariagen_data/anopheles.py

-    AnophelesHapData,
+    AnophelesHapFrequencyAnalysis,


Suggest leaving the AnophelesHapData class explicitly here within the parents, and adding AnophelesHapFrequencyAnalysis further back, e.g., together with the other analysis classes.

Thanks @alimanfoo. It would make sense but when AnophelesHapData is added to the list, this error pops up:

E   TypeError: Cannot create a consistent method resolution
E   order (MRO) for bases AnophelesSnpData, AnophelesHapData, AnophelesSampleMetadata, AnophelesCnvData, AnophelesGenomeFeaturesData, AnophelesGenomeSequenceData, AnophelesBase

I have tried to figure out why but I have not been able to yet.

Turns out AnophelesHapData really wants to be before AnophelesSnpData and AnophelesCnvData.

Cool, you managed to resolve the MRO issue?

Sort of. The only reason (as far as I can see) the order of these 3 would be important would be if the superclasses are used differently in AnophelesHapData and I don't see where they would be. I am not sure it is worth investigating too much given that it works this way, though.
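
For readers unfamiliar with this error, here is one common way it arises, shown with toy classes. The hierarchy below is hypothetical and is not claimed to match the real classes in malariagen_data; it only illustrates that listing a class before one of its own subclasses in the bases makes Python's C3 linearisation fail.

```python
class Base:
    pass


class SnpData(Base):
    pass


class HapData(SnpData):  # hypothetical: HapData derives from SnpData
    pass


try:
    # SnpData is listed before its subclass HapData, so no consistent
    # method resolution order exists.
    class Broken(SnpData, HapData):
        pass

    mro_error = None
except TypeError as e:
    mro_error = str(e)


# Listing the subclass first resolves the conflict.
class Works(HapData, SnpData):
    pass


mro_names = [c.__name__ for c in Works.__mro__]
```

If something similar holds in the real hierarchy, that would explain why one class has to come before the others in the list of parents.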

alimanfoo · 2024-11-21T12:54:24Z

Took the liberty to edit PR title and description so this PR looks good in the release notes.

alimanfoo

Hi @jonbrenas, this is looking good.

One small suggestion: for the functions that have been refactored into the util module, please remove the leading underscore from the function names. By convention, a leading underscore indicates that a function is intended to be private to a single module.

…-python into 367-haps-freq

jonbrenas · 2024-11-28T16:18:25Z

Thanks @alimanfoo, makes a lot of sense. I think I made the changes you requested.

alimanfoo

Looks good, thanks @jonbrenas 🌼

alimanfoo · 2024-12-03T09:58:56Z

Just to note that this PR and #665 have some overlap and so expect there will be conflicts to resolve after one is merged before the second can be merged.

It might be slightly easier to merge this one first, and then deal with any conflicts in #665, but happy to go in whichever order you think best.

jonbrenas added 11 commits September 27, 2024 15:44

Not yet started copying my code

caf0699

Added the function to compute the dataframe. Next, the heatmap.

48904f3

Trying to correctly type count_rows. Also, I don't think I will need …

65f715d

…to do the heatmap.

Not quite there yet.

644e78b

Attempt at hap_freq

c8b752c

Merge branch 'master' into 367-haps-freq

5f730c5

Adding some tests

08bfeb0

Merge branch '367-haps-freq' of github.com:malariagen/malariagen-data…

0649cfe

…-python into 367-haps-freq

Better tests

758f42a

Added some spinners and progresses

fbf4039

Removed a duplicated spinner

ed875a8

jonbrenas marked this pull request as ready for review October 16, 2024 16:12

Merge branch 'master' into 367-haps-freq

10b40ce

jonbrenas added 8 commits October 16, 2024 19:07

Clearing the way to use haplotype_frequencies in the haplotype_freque…

c88f248

…ncies code.

Merge branch '367-haps-freq' of github.com:malariagen/malariagen-data…

ad6c811

…-python into 367-haps-freq

Passed the tests. I am not going to check that it does what I want un…

ba7a7ed

…til I have had dinner.

Passed the tests. I am not going to check that it does what I want un…

56f473a

…til I have had dinner.

Fighting the lint

0993d82

Fighting the lint still

8de60c3

Updated the advanced version too

1d257fc

Better typing.

2d85d2e

alimanfoo reviewed Oct 18, 2024

jonbrenas added 3 commits November 6, 2024 08:28

Merge branch 'master' into 367-haps-freq

12ac956

Added a first run of all haplotypes to set up the list of observed ha…

5e636f2

…plotypes.

Adding the notebook.

5870a35

Forgot to add a file.

c7f3141

jonbrenas requested a review from alimanfoo November 6, 2024 12:14

alimanfoo reviewed Nov 21, 2024

alimanfoo changed the title ~~Haplotype frequency time series plot~~ Haplotype frequency analysis methods Nov 21, 2024

jonbrenas and others added 12 commits November 21, 2024 13:18

Update malariagen_data/anoph/hap_freq.py

b340dfe

Co-authored-by: Alistair Miles <[email protected]>

Update malariagen_data/anoph/hap_freq.py

28067b3

Co-authored-by: Alistair Miles <[email protected]>

Starting work on alimanfoo's comments

bb1394d

Doing the renaming correctly

fe95e08

Trying to find the MRO

304eff0

Making progress through alimanfoo's comments

a168e03

Last set of easy changes

5f8e855

Merge branch 'master' into 367-haps-freq

5cc504a

Solved the MRO issue

5abc240

Solved the MRO issue

deb5f9a

Factoring some shared code

73b4439

Looks like I missed a file

c7346a4

jonbrenas mentioned this pull request Nov 25, 2024

plot_frequencies_heatmap is not in the correct file #664

Open

alimanfoo reviewed Nov 27, 2024

jonbrenas added 3 commits November 28, 2024 10:28

Merge branch 'master' into 367-haps-freq

80f8cc5

De-underscored some functions

0ec6cb3

Merge branch '367-haps-freq' of github.com:malariagen/malariagen-data…

c9d8ec4

…-python into 367-haps-freq

alimanfoo mentioned this pull request Dec 2, 2024

Refactor CNV frequencies functions to their own module #665

Merged

alimanfoo approved these changes Dec 3, 2024

Merge branch 'master' into 367-haps-freq

1321ede

jonbrenas merged commit ef6991f into master Dec 3, 2024

alimanfoo added the BMGF-068808 Work supported by BMGF grant INV-068808 (MalariaGEN 2024-2027). label Dec 4, 2024

leehart deleted the 367-haps-freq branch December 9, 2024 10:52
