Support genomic regions by nkran · Pull Request #106 · malariagen/malariagen-data-python

nkran · 2021-12-08T21:37:37Z

I wanted to give #14 a go, as it would be a neat feature to have in everyday data wrangling.
Although this is still a WIP, I wanted to open a PR to check in with you about the implementation and get suggestions.

Still to-do:

Add support for CNV and haplotype functions
Tests for region_slice()

alimanfoo · 2021-12-09T10:32:05Z

Hi @nkran, this is great stuff and very much appreciated!

At first glance this looks like a great direction. I'll give it a more thorough review asap, but I expect any comments will be relatively minor.

alimanfoo

Looking very good, couple of small suggestions...

malariagen_data/ag3.py

alimanfoo · 2021-12-09T17:49:49Z

malariagen_data/ag3.py

+        # search the geneset and match a genomic region regex pattern
+        gene_annotation = self.geneset(["ID"]).query(
+            f"type == 'gene' and ID == '{region}'"
+        )
+        region_pattern_match = re.search(r"([a-zA-Z0-9]+)\:(.+)\-(.+)", region)
+
+        # region is a chromosome arm
+        if region in self.contigs:
+            contig, start, end = region, None, None
+
+        # region is a gene name
+        elif not gene_annotation.empty:
+            gene_annotation = gene_annotation.squeeze()
+            contig = gene_annotation.contig
+            start = gene_annotation.start
+            end = gene_annotation.end
+
+        # parse region string that contains genomic coordinates
+        elif region_pattern_match:
+            region_split = region_pattern_match.groups()
+
+            contig = region_split[0]
+            start = int(region_split[1].replace(",", ""))
+            end = int(region_split[2].replace(",", ""))
+
+            if contig not in self.contigs:
+                raise ValueError(f"Contig {contig} does not exist in the dataset.")
+            elif (
+                start < 0
+                or end <= start
+                or end > self.genome_sequence(region=contig).shape[0]
+            ):
+                raise ValueError("Provided genomic coordinates are not valid.")
+
+        else:
+            raise ValueError(f"Region {region} is not valid.")


Suggested change

        # check type, fail early if bad
        if not isinstance(region, str):
            raise TypeError("The region parameter must be a string or Region object.")

        # check if region is a chromosome arm
        if region in self.contigs:
            return Ag3.Region(region, None, None)

        # check if region is a region string
        region_pattern_match = re.search(r"([a-zA-Z0-9]+)\:(.+)\-(.+)", region)
        if region_pattern_match:
            # parse region string that contains genomic coordinates
            region_split = region_pattern_match.groups()
            contig = region_split[0]
            start = int(region_split[1].replace(",", ""))
            end = int(region_split[2].replace(",", ""))

            if contig not in self.contigs:
                raise ValueError(f"Contig {contig} does not exist in the dataset.")
            elif (
                start < 0
                or end <= start
                or end > self.genome_sequence(region=contig).shape[0]
            ):
                raise ValueError("Provided genomic coordinates are not valid.")

            return Ag3.Region(contig, start, end)

        # check if region is a gene annotation feature ID
        gene_annotation = self.geneset(["ID"]).query(f"ID == '{region}'")
        if not gene_annotation.empty:
            # region is a feature ID
            gene_annotation = gene_annotation.squeeze()
            return Ag3.Region(gene_annotation.contig, gene_annotation.start, gene_annotation.end)

        raise ValueError(f"Region {region!r} is not a valid contig, region string or feature ID.")

This suggestion reorders the logic to deal with the simplest and cheapest cases first. Searching the geneset for an ID is not particularly expensive, but seems like it's worth avoiding if in many cases the region parameter will be a contig or region string and those cases can be dealt with cheaply first.

Also the query on the geneset here is relaxed a little to allow matching any feature ID.
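For readers following along, the cheapest-first ordering described above can be sketched outside the `Ag3` class. This is only an illustration under stated assumptions: `CONTIG_LENGTHS` and `FEATURES` are hypothetical toy stand-ins for `self.contigs`, the genome sequence lengths and the geneset table, and `resolve_region()` here is not the method that was merged.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    contig: str
    start: Optional[int]
    end: Optional[int]


# Hypothetical toy data standing in for the real contigs and geneset.
CONTIG_LENGTHS = {"2R": 1000, "3L": 800}
FEATURES = {"gene-1": Region("2R", 100, 200)}

# Same shape as the pattern in the PR: contig, colon, start, hyphen, end.
REGION_PATTERN = re.compile(r"([a-zA-Z0-9]+):(.+)-(.+)")


def resolve_region(region: str) -> Region:
    # Fail early on a bad type.
    if not isinstance(region, str):
        raise TypeError("The region parameter must be a string.")

    # 1. Cheapest case: the region is a whole contig.
    if region in CONTIG_LENGTHS:
        return Region(region, None, None)

    # 2. Next cheapest: a region string such as "2R:100-200".
    match = REGION_PATTERN.search(region)
    if match:
        contig, start_text, end_text = match.groups()
        start = int(start_text.replace(",", ""))
        end = int(end_text.replace(",", ""))
        if contig not in CONTIG_LENGTHS:
            raise ValueError(f"Contig {contig} does not exist in the dataset.")
        if start < 0 or end <= start or end > CONTIG_LENGTHS[contig]:
            raise ValueError("Provided genomic coordinates are not valid.")
        return Region(contig, start, end)

    # 3. Only now pay for the feature lookup (a dict here, a geneset query in the PR).
    feature = FEATURES.get(region)
    if feature is not None:
        return feature

    raise ValueError(
        f"Region {region!r} is not a valid contig, region string or feature ID."
    )
```

With the toy data, `resolve_region("2R")` takes the first branch, `resolve_region("2R:1-1,000")` the second, and `resolve_region("gene-1")` falls through to the feature lookup.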

nkran · 2021-12-09

Great, thanks. Will commit the changes.

malariagen_data/ag3.py

nkran · 2021-12-10T22:03:33Z

Hi @alimanfoo, thanks for the first review :) I added your corrections and support for regions in the other functions.

I saw that there were TODO tags for multiple contigs/regions support in cnv_discordant_read_calls(), cnv_hmm() and cnv_coverage_calls() so I followed suit from other methods to do the same there. Hope I didn't mess up.

A few combinations of tests for CNV methods with some regions are failing with KeyError from allel.locate_region(). I assume that is due to the empty dataset that gets returned if there are no CNVs in a given region. Any suggestions or preference about how to resolve that?

Have a nice weekend,
N

alimanfoo

Couple of small suggestions.

alimanfoo · 2021-12-13T23:09:12Z

malariagen_data/ag3.py

+        # find genes in the region
+        if region.start and region.end:
+            df_genes = self.geneset().query(
+                f"type == 'gene' and contig == '{region.contig}' and start >= {region.start} and end <= {region.end}"


Suggested change

-                f"type == 'gene' and contig == '{region.contig}' and start >= {region.start} and end <= {region.end}"
+                f"type == 'gene' and contig == '{region.contig}' and start <= {region.end} and end >= {region.start}"

Possible tweak to retrieve all genes overlapping the requested region.
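The difference between the two query strings is easiest to see on a handful of made-up intervals. The sketch below uses plain Python rather than a pandas `query()` call; the `Gene` tuple, the gene names and the coordinates are all hypothetical, and coordinates are assumed to be inclusive at both ends.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    contig: str
    start: int
    end: int


def contained_in(gene: Gene, contig: str, start: int, end: int) -> bool:
    # Original predicate: the gene must lie entirely inside the region.
    return gene.contig == contig and gene.start >= start and gene.end <= end


def overlaps(gene: Gene, contig: str, start: int, end: int) -> bool:
    # Suggested predicate: the gene only has to touch the region.
    return gene.contig == contig and gene.start <= end and gene.end >= start


# Toy annotations, not real gene coordinates.
genes = [
    Gene("inside", "2R", 120, 180),
    Gene("left_edge", "2R", 50, 110),
    Gene("right_edge", "2R", 190, 260),
    Gene("spanning", "2R", 10, 500),
    Gene("outside", "2R", 300, 400),
    Gene("other_contig", "3L", 120, 180),
]

region = ("2R", 100, 200)
contained = [g.gene_id for g in genes if contained_in(g, *region)]
overlapping = [g.gene_id for g in genes if overlaps(g, *region)]

print(contained)    # ['inside']
print(overlapping)  # ['inside', 'left_edge', 'right_edge', 'spanning']
```

The containment test drops genes that straddle either boundary or span the whole region, which is what the overlap form is meant to recover.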

alimanfoo · 2021-12-13T23:13:24Z

tests/test_ag3.py

+    region = ag3._resolve_region(region)
+    if region.start and region.end:
+        df_genes = df_geneset.query(
+            f"type == 'gene' and contig == '{region.contig}' and start >= {region.start} and end <= {region.end}"


Will need to edit this to match above if changing to overlap query.

alimanfoo · 2021-12-13T23:17:27Z

> A few combinations of tests for CNV methods with some regions are failing with KeyError from allel.locate_region(). I assume that is due to the empty dataset that gets returned if there are no CNVs in a given region. Any suggestions or preference about how to resolve that?

Yes that would be my guess, especially the discordant read calls, there are only a couple of regions with data there. Could just adapt the regions tested so they return some data and avoid KeyError, e.g., for the discordant read calls could use genes from the cyp6p/aa cluster on 2R?

alimanfoo · 2021-12-14T15:15:38Z

Hi @nkran, on second thoughts, it might be better to back out the changes to the CNV methods and revisit those in a later PR. The reason is that CNVs span intervals, and so we'd need to do an overlap query for regions. I.e., the current approach of using the POS field (which holds the start coordinates) and SortedIndex.locate_range() will only find CNVs where the start occurs within the requested region.
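To make the concern concrete, here is a small sketch comparing a lookup on start coordinates with a true overlap test. It uses `bisect` on a sorted list as a rough stand-in for a sorted-index range lookup; it does not call scikit-allel, and the CNV intervals and function names are invented for illustration.

```python
from bisect import bisect_left, bisect_right

# Toy CNV calls, sorted by start coordinate (hypothetical values).
cnv_starts = [100, 400, 900]
cnv_ends = [600, 450, 950]


def locate_by_start(starts, region_start, region_end):
    """Indices of intervals whose START lies in [region_start, region_end]."""
    lo = bisect_left(starts, region_start)
    hi = bisect_right(starts, region_end)
    return list(range(lo, hi))


def locate_by_overlap(starts, ends, region_start, region_end):
    """Indices of intervals sharing at least one position with the region."""
    return [
        i
        for i, (start, end) in enumerate(zip(starts, ends))
        if start <= region_end and end >= region_start
    ]


by_start = locate_by_start(cnv_starts, 300, 500)
by_overlap = locate_by_overlap(cnv_starts, cnv_ends, 300, 500)

print(by_start)    # [1]
print(by_overlap)  # [0, 1]
```

The first CNV (100 to 600) covers the whole of region 300 to 500, yet the start-based lookup misses it because its start falls before the region.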

nkran · 2021-12-14T17:58:42Z

Hi @alimanfoo! Ah yes, that makes sense! I'll omit those two commits and update the branch.

alimanfoo · 2021-12-15T00:29:21Z

Hi @nkran, thanks so much, this all looks good to me now, except that one .pyc file has crept in (tests/__pycache__/__init__.cpython-37.pyc). If you could remove that, then it looks good to merge.

nkran · 2021-12-15T11:52:18Z

Hi @alimanfoo, just to double check - my commit (a2362bd) removes that .pyc file that is currently in master. Do you want me to revert the commit that deletes it and keep the .pyc file or to remove the .pyc file? Thanks for your patience :) Just want to get it right.

alimanfoo · 2021-12-15T14:40:57Z

> Hi @alimanfoo, just to double check - my commit (a2362bd) removes that .pyc file that is currently in master. Do you want me to revert the commit that deletes it and keep the .pyc file or to remove the .pyc file? Thanks for your patience :) Just want to get it right.

My bad, I wasn't looking carefully enough to realise the file was already in master and you were cleaning up :)

OK to merge?

nkran · 2021-12-15T15:15:05Z

OK! :) 🚀

alimanfoo · 2021-12-15T15:17:07Z

Done, thanks again 😄 🙏

nkran added 3 commits December 6, 2021 14:47

Add hidden vscode folder to gitignore

f3fd9af

Remove cached pyc

a2362bd

Support genomic regions - SNPs

06d609f

alimanfoo added this to the v1.0.0 milestone Dec 9, 2021

alimanfoo reviewed Dec 9, 2021

Refactor _resolve_region() and renaming

1f0e7fa

alimanfoo reviewed Dec 13, 2021

nkran added 2 commits December 14, 2021 18:06

Move region resolving to private methods for consistency

2b80fad

(cherry picked from commit 5eb2fbc)

Add tests for locate_region()

88e7c0a

(cherry picked from commit 1723bfa)

nkran force-pushed the genome-regions branch from 1723bfa to 88e7c0a Compare December 14, 2021 21:15

nkran changed the title ~~WIP: Support genomic regions~~ Support genomic regions Dec 14, 2021

alimanfoo approved these changes Dec 15, 2021

alimanfoo merged commit fac93bc into malariagen:master Dec 15, 2021

alimanfoo mentioned this pull request Dec 23, 2021

Support genome regions when accessing data for Ag3 #14

Closed

alimanfoo added the BMGF-001927 label (Work supported by BMGF grant INV-001927, MalariaGEN 2019-2024) Dec 4, 2024
