Conversation
…to do the heatmap.
…-python into 367-haps-freq
I think I am getting there, but I have not been able to run a notebook with this version of the code, so I am not 100% sure that it does what it is supposed to yet.

Hi @jonbrenas, just to say this looks like it's moving in a good direction :) In the …

Thanks @alimanfoo. I had used …
…-python into 367-haps-freq
…til I have had dinner.
malariagen_data/anoph/hap_freq.py
Outdated
hap_track: dict[np.int64, float] = {}
for coh, loc_coh in cohorts_iterator:
    hap_track = {k: 0 for k in hap_track.keys()}
    n_samples = np.count_nonzero(loc_coh)
    assert n_samples >= min_cohort_size
    gt_coh = allel.GenotypeDaskArray(da.compress(loc_coh, gt, axis=1))
    gt_hap = gt_coh.to_haplotypes().compute()
    f, _, _ = haplotype_frequencies(gt_hap)
    hap_track.update(f)
    freq_cols["frq_" + coh] = list(hap_track.values())
I am a little concerned about ensuring that the frequency values are aligned across cohorts, i.e., that the values at each position correspond to the same haplotypes. Also I think this could end up bleeding values between different cohorts.
Am I right that you are trying to use hap_track to keep a record of what haplotypes you have seen so far?
Consider what would happen here if you have two cohorts, where some haplotypes are present in both cohorts, and each cohort has haplotypes which are not observed in the other cohort.
I wonder if it would be safer to do this in two passes. In the first pass, you compute the results of calling haplotype_frequencies() for each cohort. Then you find the set of all haplotype keys across all cohorts, and use this to generate a consistent array of frequencies for each cohort, guaranteeing the same order of values for each cohort, and filling with 0 in case the haplotype is not observed.
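The two-pass idea could be sketched roughly like this (plain dicts and a hypothetical helper name, not the actual malariagen_data code):

```python
# Hypothetical sketch: pass 1 has already produced one
# {haplotype_hash: frequency} dict per cohort; pass 2 aligns them.
def align_cohort_frequencies(per_cohort_freqs):
    # Union of all haplotype keys observed in any cohort, in a fixed order.
    all_haps = sorted({h for freqs in per_cohort_freqs.values() for h in freqs})
    # One identically ordered list per cohort, filling 0.0 where a
    # haplotype was not observed in that cohort.
    aligned = {
        cohort: [freqs.get(h, 0.0) for h in all_haps]
        for cohort, freqs in per_cohort_freqs.items()
    }
    return all_haps, aligned
```

This guarantees every cohort column has the same length and the same haplotype order, regardless of which cohorts each haplotype appears in.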
If you did that you could also add a "label" column to the output dataframe, which could contain the hash for each haplotype. Not super user friendly, but would at least allow us to refer to the different haplotypes in some way?
Not sure I get what you mean. I have little time so what I say might be gibberish.
For the first part, haplotype_frequencies() returns a dict where the keys are the hash for each haplotype and the value is the frequency. These keys are then used by hap_track so it should already do what I think you asked with one pass.
For the second part, "label" is already used for the index (because plot_frequencies_heatmap() requires it, I think). We could keep the old index (containing the hash) in a column "hash" if you want.
For the first part,
haplotype_frequencies() returns a dict where the keys are the hash for each haplotype and the value is the frequency. These keys are then used by hap_track so it should already do what I think you asked with one pass.
Right, after you have iterated through all cohorts then hap_track will have seen all haplotypes present in any cohort, and so the keys will contain hashes of all haplotypes present.
But if you are outputting the frequencies as you go, within that first pass, then the number of values you find will vary between cohorts. This is because you find more haplotypes as you visit more cohorts.
Also, hap_track currently accumulates data from all cohorts. When you visit a new cohort, any haplotypes previously found will have their frequencies updated. But any haplotypes found in previous cohorts but not present in the current cohort will still be present within hap_track. If you then take all values from hap_track you could get a mangle of data from the current cohort and data from cohorts visited in previous iterations.
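The concern can be illustrated with plain dicts (illustrative haplotype keys, not real data): if hap_track were never reset between cohorts, a haplotype absent from the current cohort would silently keep its frequency from an earlier one.

```python
# Toy illustration of the accumulation concern, with no reset between cohorts.
hap_track = {}
hap_track.update({"h1": 0.6, "h2": 0.4})   # cohort 1
hap_track.update({"h2": 0.3, "h3": 0.7})   # cohort 2: h1 not observed here
# h1's cohort-1 frequency (0.6) leaks into cohort 2's column of values:
mixed = list(hap_track.values())
```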
Sorry @alimanfoo, I thought I had commented further yesterday but my comments seem to have got lost (I probably forgot to click on the big green button).
Right, after you have iterated through all cohorts then hap_track will have seen all haplotypes present in any cohort, and so the keys will contain hashes of all haplotypes present.
But if you are outputting the frequencies as you go, within that first pass, then the number of values you find will vary between cohorts. This is because you find more haplotypes as you visit more cohorts.
That's right. This is why, before creating the DataFrame, there is a step that appends 0 to any list shorter than the total number of haplotypes (which is the length of the last list generated). This could be a problem if update() didn't preserve insertion order and reordered the dict entries however it liked. I tried this:
d = {"B": 0, "C": 1}
d.update({"A": 2, "B": 3})
d.values()
and it returns [3, 1, 2], so insertion order is indeed kept and new haplotypes are always appended at the end of hap_track. I should probably document my code better (that is, at least a little) to make my reasoning clearer.
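The padding step mentioned above could be sketched like this (freq_cols / n_haps are illustrative names, not the actual implementation): columns built earlier in the iteration are shorter, because fewer haplotypes had been seen at that point, so they are padded with 0 to align all columns.

```python
# Hypothetical per-cohort frequency columns of unequal length.
freq_cols = {
    "frq_coh1": [0.5, 0.5],        # built when only 2 haplotypes were known
    "frq_coh2": [0.2, 0.3, 0.5],   # built after a 3rd haplotype appeared
}
# Pad every column with 0.0 up to the longest column's length.
n_haps = max(len(col) for col in freq_cols.values())
padded = {
    name: col + [0.0] * (n_haps - len(col))
    for name, col in freq_cols.items()
}
```

Because new haplotypes are only ever appended at the end of hap_track, the trailing zeros correspond exactly to the haplotypes that cohort never saw.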
Also, hap_track currently accumulates data from all cohorts. When you visit a new cohort, any haplotypes previously found will have their frequencies updated. But any haplotypes found in previous cohorts but not present in the current cohort will still be present within hap_track. If you then take all values from hap_track you could get a mangle of data from the current cohort and data from cohorts visited in previous iterations.
The first step of every cohort run is to reset all values of hap_track to 0, so I don't think there is a risk of mangling. I was scared at some point that the content of hap_track would be copied by reference instead of by value into freq_cols, but given that they end up different, I don't think that's the case.
There is one situation that I am not quite sure how to handle, though. Resetting all values to 0 for each cohort run makes sense for freq and count but, if a haplotype is not observed, having 0 as the nobs value might be wrong. What do you think? Should the lengthening step set all values in nobs to be equal, or is 0 OK for haplotypes that are not observed? As far as I can see, it only matters for the CIs ... and I just convinced myself that it is wrong to have nobs == 0, so I will update that when I come back from my leave.
@jonbrenas to add examples here and notebook examples.
alimanfoo
left a comment
Thanks so much @jonbrenas, this is looking good. Some suggestions for code simplification and to remove duplication.
Also please add the new public methods haplotype_frequencies() and haplotype_frequencies_advanced() to the Ag3 and Af1 API docs.
malariagen_data/anopheles.py
Outdated
AnophelesHapData,
AnophelesHapFrequencyAnalysis,
Suggest to leave the AnophelesHapData class explicitly here within the parents, and add AnophelesHapFrequencyAnalysis further back, e.g., together with other analysis classes.
Thanks @alimanfoo. It would make sense, but when AnophelesHapData is added to the list, this error pops up:
E TypeError: Cannot create a consistent method resolution
E order (MRO) for bases AnophelesSnpData, AnophelesHapData, AnophelesSampleMetadata, AnophelesCnvData, AnophelesGenomeFeaturesData, AnophelesGenomeSequenceData, AnophelesBase
I have tried to figure out why but I have not been able to yet.
Turns out AnophelesHapData really wants to be before AnophelesSnpData and AnophelesCnvData.
Cool, you managed to resolve the MRO issue?
Sort of. The only reason (as far as I can see) that the order of these three would matter is if the superclasses were used differently in AnophelesHapData, and I don't see where they would be. I am not sure it is worth investigating much further, given that it works this way.
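For reference, here is a minimal illustration (toy classes, unrelated to the actual Anopheles hierarchy) of how Python's C3 linearization rejects an inconsistent base-class order, which is the kind of failure seen above:

```python
class A:
    pass

class B(A):
    pass

# Listing the subclass before its base is consistent with C3 linearization.
class OK(B, A):
    pass

# Listing a base before its own subclass is not: Python cannot build an MRO
# that satisfies both the requested order (A before B) and B's own MRO
# (B before A), so class creation fails with a TypeError.
try:
    Bad = type("Bad", (A, B), {})
    mro_error = False
except TypeError:
    mro_error = True
```

In the real class, reordering AnophelesHapData relative to AnophelesSnpData and AnophelesCnvData presumably changes which linearizations are possible in the same way.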
Took the liberty to edit the PR title and description so this PR looks good in the release notes.
Co-authored-by: Alistair Miles <[email protected]>
Co-authored-by: Alistair Miles <[email protected]>
alimanfoo
left a comment
Hi @jonbrenas, this is looking good.
Small suggestion: for the functions that have been refactored into the util module, suggest removing the leading underscore from the function names. A leading underscore by convention indicates a function is intended to be private to a single module.
Thanks @alimanfoo, that makes a lot of sense. I think I have made the changes you requested.
alimanfoo
left a comment
Looks good, thanks @jonbrenas 🌼
Just to note that this PR and #665 have some overlap, so I expect there will be conflicts to resolve after one is merged before the second can be merged. It might be slightly easier to merge this one first, and then deal with any conflicts in #665, but happy to go in whichever order you think best.
This PR adds two new functions for computing haplotype frequencies within a genome region of interest, haplotype_frequencies() and haplotype_frequencies_advanced(). The outputs from these functions can then be passed into generic functions for visualising frequency data as a heatmap, time series or map. Resolves #367.