Skip to content

Haplotype frequency analysis methods#630

Merged
jonbrenas merged 40 commits intomasterfrom
367-haps-freq
Dec 3, 2024
Merged

Haplotype frequency analysis methods#630
jonbrenas merged 40 commits intomasterfrom
367-haps-freq

Conversation

@jonbrenas
Copy link
Copy Markdown
Collaborator

@jonbrenas jonbrenas commented Sep 27, 2024

This PR adds two new functions for computing haplotype frequencies within a genome region of interest, haplotype_frequencies() and haplotype_frequencies_advanced(). The outputs from these functions can then be passed into generic functions for visualising frequency data as a heatmap, time series or map.

Resolves #367.

@jonbrenas
Copy link
Copy Markdown
Collaborator Author

I think I am getting there but I have not been able to run a notebook with this version of the code so I am not 100% sure that it does what it is supposed to yet.

@jonbrenas jonbrenas marked this pull request as ready for review October 16, 2024 16:12
@alimanfoo
Copy link
Copy Markdown
Member

Hi @jonbrenas, just to say this looks like it's moving in a good direction :)

In the malariagen_data.anoph.h12 module there is a function haplotype_frequencies() which you could use to calculate haplotype frequencies here, if you moved haplotype_frequencies() to a common location like malariagen_data.util module.

@jonbrenas
Copy link
Copy Markdown
Collaborator Author

Thanks @alimanfoo. I had used haplotype_frequencies() in my first prototype but, for the time series plot, counts and nobs are also needed. I considered moving the function and modifying slightly the functions using haplotype_frequencies() in malariagen_data.anoph.h12 (which is probably only garud_h12) but I wanted to limit the amount of side modifications in this PR to a minimum. I guess haplotype_frequencies() is much faster thanks to numba so I'll transition it now.

Comment on lines +98 to +107
hap_track: dict[np.int64, float] = {}
for coh, loc_coh in cohorts_iterator:
hap_track = {k: 0 for k in hap_track.keys()}
n_samples = np.count_nonzero(loc_coh)
assert n_samples >= min_cohort_size
gt_coh = allel.GenotypeDaskArray(da.compress(loc_coh, gt, axis=1))
gt_hap = gt_coh.to_haplotypes().compute()
f, _, _ = haplotype_frequencies(gt_hap)
hap_track.update(f)
freq_cols["frq_" + coh] = list(hap_track.values())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little concerned about ensuring that the frequency values are the same for each cohort, i.e., the values correspond to the same haplotypes. Also I think this could end up bleeding values between different cohorts.

Am I right that you are trying to use hap_track to keep a record of what haplotypes you have seen so far?

Consider what would happen here if you have two cohorts, where some haplotypes are present in both cohorts, and each cohort has haplotypes which are not observed in the other cohort.

I wonder if it would be safer to do this in two passes. In the first pass, you compute the results of calling haplotype_frequencies() for each cohort. Then you find the set of all haplotype keys across all cohorts, and use this to generate a consistent array of frequencies for each cohort, guaranteeing the same order of values for each cohort, and filling with 0 in case the haplotype is not observed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you did that you could also add a "label" column to the output dataframe, which could contain the hash for each haplotype. Not super user friendly, but would at least allow us to refer to the different haplotypes in some way?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I get what you mean. I have little time so what I say might be gibberish.

For the first part, haplotype_frequencies() returns a dict where the keys are the hash for each haplotype and the value is the frequency. These keys are then used by hap_track so it should already do what I think you asked with one pass.
For the second part, "label" is already used for the index (because plot_frequencies_heatmap() requires it, I think). We could keep the old index (containing the hash) in a column "hash" if you want.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first part, haplotype_frequencies() returns a dict where the keys are the hash for each haplotype and the value is the frequency. These keys are then used by hap_track so it should already do what I think you asked with one pass.

Right, after you have iterated through all cohorts then hap_track will have seen all haplotypes present in any cohort, and so the keys will contain hashes of all haplotypes present.

But if you are outputting the frequencies as you go, within that first pass, then the number of values you find will vary between cohorts. This is because you find more haplotypes as you visit more cohorts.

Also, hap_track currently accumulates data from all cohorts. When you visit a new cohort, any haplotypes previously found will have their frequencies updated. But any haplotypes found in previous cohorts but not present in the current cohort will still be present within hap_track. If you then take all values from hap_track you could get a mangle of data from the current cohort and data from cohorts visited in previous iterations.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @alimanfoo, I thought I had commented further yesterday but my comments seems to have got lost (I probably forgot to click on the big green button).

Right, after you have iterated through all cohorts then hap_track will have seen all haplotypes present in any cohort, and so the keys will contain hashes of all haplotypes present.
But if you are outputting the frequencies as you go, within that first pass, then the number of values you find will vary between cohorts. This is because you find more haplotypes as you visit more cohorts.

That's right. This is the reason why there is a step before creating the DataFrame that adds 0 to the end of the lists that are shorter than the total number of haplotypes (which is the length of the last list generated). This could be a problem if update doesn't treat the dict entries as a queue and reorders them (according to whatever order it likes). I tried this:

d = {"B": 0, "C":1}
d.update({"A": 2, "B":3}
d.values()

and it correctly returns [3,1,2] so the order seems to be indeed kept. So the new haplotypes are always added at the end of hap_track. I should probably document my code better (that is, at least a little) to make my reasoning clearer.

Also, hap_track currently accumulates data from all cohorts. When you visit a new cohort, any haplotypes previously found will have their frequencies updated. But any haplotypes found in previous cohorts but not present in the current cohort will still be present within hap_track. If you then take all values from hap_track you could get a mangle of data from the current cohort and data from cohorts visited in previous iterations.

The first step of every cohort run is to reset all values of hap_track to 0 so I don't think there is a risk of mangling. I was scared at some point that the content of hap_track would be copied by reference instead of by value in freq_coh but given that they end up different, I don't think that's the case.

There is one situation that I am not quite sure how to handle though. Resetting all values to 0 for each cohort run makes sense for freq and count but, if a haplotype is not observed, having 0 as the nobs value might be wrong. What do you think? Should the lengthening step set all values in nobs to be equal or is 0 for haplotypes that are not observed OK. As far as I can see, it only matters for the CIs ... and I just convinced myself that it is wrong to have nobs == 0 so I will update that when I come back from my leave.

@leehart
Copy link
Copy Markdown
Collaborator

leehart commented Nov 5, 2024

@jonbrenas to add examples here and notebook examples.

@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@jonbrenas jonbrenas requested a review from alimanfoo November 6, 2024 12:14
Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much @jonbrenas, this is looking good. Some suggestions for code simplification and to remove duplication.

Also please add the new public methods haplotype_frequencies() and haplotype_frequencies_advanced() to the Ag3 and Af1 API docs.

Comment on lines +105 to +106
AnophelesHapData,
AnophelesHapFrequencyAnalysis,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to leave the AnophelesHapData class explicitly here within the parents, and add AnophelesHapData further back, e.g., together with other analysis classes.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alimanfoo. It would make sense but when AnophelesHapData is added to the list, this error pops up:

E   TypeError: Cannot create a consistent method resolution
E   order (MRO) for bases AnophelesSnpData, AnophelesHapData, AnophelesSampleMetadata, AnophelesCnvData, AnophelesGenomeFeaturesData, AnophelesGenomeSequenceData, AnophelesBase

I have tried to figure out why but I have not been able to yet.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out AnophelesHapData really wants to be before AnophelesSnpData and AnophelesCnvData.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, you managed to resolve the MRO issue?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of. The only reason (as far as I can see) the order of these 3 would be important would be if the superclasses are used differently in AnophelesHapData and I don't see where they would be. I am not sure it is worth investigating too much given that it works this way, though.

@alimanfoo alimanfoo changed the title Haplotype frequency time series plot Haplotype frequency analysis methods Nov 21, 2024
@alimanfoo
Copy link
Copy Markdown
Member

Took the liberty to edit PR title and description so this PR looks good in the release notes.

Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jonbrenas, this is looking good.

Small suggestion, the functions that have been refactored into the util module, suggest to remove the leading underscore from the function names. A leading underscore by convention indicates a function is intended to be private to a single module.

@jonbrenas
Copy link
Copy Markdown
Collaborator Author

Thanks @alimanfoo, makes a lot of sense. I think I made the changes we requested.

Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @jonbrenas 🌼

@alimanfoo
Copy link
Copy Markdown
Member

Just to note that this PR and #665 have some overlap and so expect there will be conflicts to resolve after one is merged before the second can be merged.

It might be slightly easier to merge this one first, and then deal with any conflicts in #665, but happy to go in whichever order you think best.

@jonbrenas jonbrenas merged commit ef6991f into master Dec 3, 2024
@alimanfoo alimanfoo added the BMGF-068808 Work supported by BMGF grant INV-068808 (MalariaGEN 2024-2027). label Dec 4, 2024
@leehart leehart deleted the 367-haps-freq branch December 9, 2024 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BMGF-068808 Work supported by BMGF grant INV-068808 (MalariaGEN 2024-2027).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Haplotype frequency time series plot

3 participants