Reduce memory usage when accessing SNP calls with site_class parameter #501

Merged
alimanfoo merged 7 commits into malariagen:master from alimanfoo:gh498-investigation-2024-01-11
Jan 11, 2024

Conversation

@alimanfoo
Member Author

alimanfoo commented Jan 11, 2024

There are two separate optimisations here.

The first change is within the _snp_calls() method and reduces memory usage when requesting a relatively small genome region:

image

The second change is within the _locate_site_class() method and reduces memory usage even when accessing a large genome region:

image

@codecov

codecov bot commented Jan 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.71%. Comparing base (8f7f86f) to head (9e7a859).
Report is 421 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #501   +/-   ##
=======================================
  Coverage   98.71%   98.71%           
=======================================
  Files          33       33           
  Lines        3274     3277    +3     
=======================================
+ Hits         3232     3235    +3     
  Misses         42       42           

Comment on lines +850 to +859
# therefore subset to SNP positions.
pos = self.snp_sites(
    region=region,
    field="POS",
    site_mask=site_mask,
    inline_array=inline_array,
    chunks=chunks,
)
idx = (pos - 1).compute()
loc_ann = da.take(loc_ann, idx, axis=0)

@alimanfoo alimanfoo Jan 11, 2024

This is the key move: we delay indexing the site annotations data to SNP positions until later on. This avoids the indexing array (idx) being used to index multiple arrays, and therefore being copied multiple times.
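
A minimal sketch of the idea, not the actual malariagen_data implementation. The array names (`codon_degeneracy`, `seq_cls`), the toy site-class rule and the two helper functions are all hypothetical, made up for illustration. Only `da.take(loc_ann, idx, axis=0)` and the `pos - 1` conversion mirror the diff above. Both functions return the same boolean array, but the second only passes `idx` to a single `da.take` call:

```python
import numpy as np
import dask.array as da

# Hypothetical toy data: one annotation value per genome position.
n_positions = 1_000
rng = np.random.default_rng(42)
codon_degeneracy = da.from_array(rng.integers(0, 5, size=n_positions), chunks=250)
seq_cls = da.from_array(rng.integers(0, 3, size=n_positions), chunks=250)

# 1-based SNP positions (as in a POS field), converted to 0-based indices.
pos = np.sort(rng.choice(np.arange(1, n_positions + 1), size=200, replace=False))
idx = pos - 1


def locate_early(idx):
    # Index each annotation array first: idx is used in two take operations.
    deg = da.take(codon_degeneracy, idx, axis=0)
    cls = da.take(seq_cls, idx, axis=0)
    return (deg == 4) & (cls == 1)  # made-up site-class rule


def locate_late(idx):
    # Build the boolean locator over all positions, then index it once.
    loc_ann = (codon_degeneracy == 4) & (seq_cls == 1)
    return da.take(loc_ann, idx, axis=0)


early = locate_early(idx).compute()
late = locate_late(idx).compute()
```

The trade-off, under these assumptions, is that the late version evaluates the comparison over every genome position rather than only the SNP positions, in exchange for referencing the index array once.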

@alimanfoo alimanfoo merged commit c560a96 into malariagen:master Jan 11, 2024
@alimanfoo alimanfoo deleted the gh498-investigation-2024-01-11 branch January 11, 2024 13:50
@alimanfoo alimanfoo added the BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024). label Dec 4, 2024

Development

Successfully merging this pull request may close these issues.

RAM use for Af diversity_stats

1 participant