Skip to content

Support genomic regions#106

Merged
alimanfoo merged 6 commits intomalariagen:masterfrom
nkran:genome-regions
Dec 15, 2021
Merged

Support genomic regions#106
alimanfoo merged 6 commits intomalariagen:masterfrom
nkran:genome-regions

Conversation

@nkran
Copy link
Copy Markdown
Contributor

@nkran nkran commented Dec 8, 2021

Wanted to give #14 a go as it would be a neat feature to have in every day data wrangling.
Despite WIP I wanted to open a PR to check in with you regarding implementation and suggestions.

Still to-do:

  • Add support for CNV and haplotype functions
  • Tests for region_slice()

@alimanfoo
Copy link
Copy Markdown
Member

Hi @nkran, this is great stuff and very much appreciated!

At a first glance this looks like a great direction. I'll give it a more thorough review asap but expect comments will be relatively minor.

@alimanfoo alimanfoo added this to the v1.0.0 milestone Dec 9, 2021
Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking very good, couple of small suggestions...

Comment on lines +328 to +363
# search the geneset and match a genomic region regex pattern
gene_annotation = self.geneset(["ID"]).query(
f"type == 'gene' and ID == '{region}'"
)
region_pattern_match = re.search(r"([a-zA-Z0-9]+)\:(.+)\-(.+)", region)

# region is a chromosome arm
if region in self.contigs:
contig, start, end = region, None, None

# region is a gene name
elif not gene_annotation.empty:
gene_annotation = gene_annotation.squeeze()
contig = gene_annotation.contig
start = gene_annotation.start
end = gene_annotation.end

# parse region string that contains genomic coordinates
elif region_pattern_match:
region_split = region_pattern_match.groups()

contig = region_split[0]
start = int(region_split[1].replace(",", ""))
end = int(region_split[2].replace(",", ""))

if contig not in self.contigs:
raise ValueError(f"Contig {contig} does not exist in the dataset.")
elif (
start < 0
or end <= start
or end > self.genome_sequence(region=contig).shape[0]
):
raise ValueError("Provided genomic coordinates are not valid.")

else:
raise ValueError(f"Region {region} is not valid.")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# search the geneset and match a genomic region regex pattern
gene_annotation = self.geneset(["ID"]).query(
f"type == 'gene' and ID == '{region}'"
)
region_pattern_match = re.search(r"([a-zA-Z0-9]+)\:(.+)\-(.+)", region)
# region is a chromosome arm
if region in self.contigs:
contig, start, end = region, None, None
# region is a gene name
elif not gene_annotation.empty:
gene_annotation = gene_annotation.squeeze()
contig = gene_annotation.contig
start = gene_annotation.start
end = gene_annotation.end
# parse region string that contains genomic coordinates
elif region_pattern_match:
region_split = region_pattern_match.groups()
contig = region_split[0]
start = int(region_split[1].replace(",", ""))
end = int(region_split[2].replace(",", ""))
if contig not in self.contigs:
raise ValueError(f"Contig {contig} does not exist in the dataset.")
elif (
start < 0
or end <= start
or end > self.genome_sequence(region=contig).shape[0]
):
raise ValueError("Provided genomic coordinates are not valid.")
else:
raise ValueError(f"Region {region} is not valid.")
# check type, fail early if bad
if not isinstance(region, str):
raise TypeError("The region parameter must be a string or Region object.")
# check if region is a chromosome arm
if region in self.contigs:
return Ag3.Region(region, None, None)
# check if region is a region string
region_pattern_match = re.search(r"([a-zA-Z0-9]+)\:(.+)\-(.+)", region)
if region_pattern_match:
# parse region string that contains genomic coordinates
region_split = region_pattern_match.groups()
contig = region_split[0]
start = int(region_split[1].replace(",", ""))
end = int(region_split[2].replace(",", ""))
if contig not in self.contigs:
raise ValueError(f"Contig {contig} does not exist in the dataset.")
elif (
start < 0
or end <= start
or end > self.genome_sequence(region=contig).shape[0]
):
raise ValueError("Provided genomic coordinates are not valid.")
return Ag3.Region(contig, start, end)
# check if region is a gene annotation feature ID
gene_annotation = self.geneset(["ID"]).query(f"ID == '{region}'")
if not gene_annotation.empty:
# region is a feature ID
gene_annotation = gene_annotation.squeeze()
return Ag3.Region(gene_annotation.contig, gene_annotation.start, gene_annotation.end)
raise ValueError(f"Region {region!r} is not a valid contig, region string or feature ID.")

This suggestion reorders the logic to deal with the simplest and cheapest cases first. Searching the geneset for an ID is not particularly expensive, but seems like it's worth avoiding if in many cases the region parameter will be a contig or region string and those cases can be dealt with cheaply first.

Also the query on the geneset here is relaxed a little to allow matching any feature ID.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks. Will commit the changes.

@nkran
Copy link
Copy Markdown
Contributor Author

nkran commented Dec 10, 2021

Hi @alimanfoo, thanks for the first review :) I added your corrections and support for regions in the other functions.

I saw that there were TODO tags for multiple contigs/regions support in cnv_discordant_read_calls(), cnv_hmm() and cnv_coverage_calls() so I followed suit from other methods to do the same there. Hope I didn't mess up.

A few combinations of tests for CNV methods with some regions are failing with KeyError from allel.locate_region(). I assume that is due to the empty dataset that gets returned if there are no CNVs in a given region. Any suggestions or preference about how to resolve that?

Have a nice weekend,
N

Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of small suggestions.

# find genes in the region
if region.start and region.end:
df_genes = self.geneset().query(
f"type == 'gene' and contig == '{region.contig}' and start >= {region.start} and end <= {region.end}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"type == 'gene' and contig == '{region.contig}' and start >= {region.start} and end <= {region.end}"
f"type == 'gene' and contig == '{region.contig}' and start <= {region.end} and end >= {region.start}"

Possible tweak to retrieve all genes overlapping the requested region.

region = ag3._resolve_region(region)
if region.start and region.end:
df_genes = df_geneset.query(
f"type == 'gene' and contig == '{region.contig}' and start >= {region.start} and end <= {region.end}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to edit this to match above if changing to overlap query.

@alimanfoo
Copy link
Copy Markdown
Member

A few combinations of tests for CNV methods with some regions are failing with KeyError from allel.locate_region(). I assume that is due to the empty dataset that gets returned if there are no CNVs in a given region. Any suggestions or preference about how to resolve that?

Yes that would be my guess, especially the discordant read calls, there are only a couple of regions with data there. Could just adapt the regions tested so they return some data and avoid KeyError, e.g., for the discordant read calls could use genes from the cyp6p/aa cluster on 2R?

@alimanfoo
Copy link
Copy Markdown
Member

Hi @nkran, on second thoughts, it might be better to back out the changes to the CNV methods and revisit those in a later PR. The reason is that CNVs span intervals, and so we'd need to do an overlap query for regions. I.e., the current approach of using the POS field (which holds the start coordinates) and SortedIndex.locate_range() will only find CNVs where the start occurs within the requested region.

@nkran
Copy link
Copy Markdown
Contributor Author

nkran commented Dec 14, 2021

Hi @alimanfoo! Ah yes, that makes sense! I'll omit those two commits and update the branch.

@nkran nkran changed the title WIP: Support genomic regions Support genomic regions Dec 14, 2021
@alimanfoo
Copy link
Copy Markdown
Member

Hi @nkran, thanks so much, this all looks good to me now, except for one .pyc file has crept in (tests/__pycache__/__init__.cpython-37.pyc). If you could remove that then it looks good to merge.

@nkran
Copy link
Copy Markdown
Contributor Author

nkran commented Dec 15, 2021

Hi @alimanfoo, just to double check - my commit (a2362bd) removes that .pyc file that is currently in master. Do you want me to revert the commit that deletes it and keep the .pyc file or to remove the .pyc file? Thanks for your patience :) Just want to get it right.

@alimanfoo
Copy link
Copy Markdown
Member

Hi @alimanfoo, just to double check - my commit (a2362bd) removes that .pyc file that is currently in master. Do you want me to revert the commit that deletes it and keep the .pyc file or to remove the .pyc file? Thanks for your patience :) Just want to get it right.

My bad, I wasn't looking carefully enough to realise the file was already in master and you were cleaning up :)

OK to merge?

@nkran
Copy link
Copy Markdown
Contributor Author

nkran commented Dec 15, 2021

OK! :) 🚀

@alimanfoo alimanfoo merged commit fac93bc into malariagen:master Dec 15, 2021
@alimanfoo
Copy link
Copy Markdown
Member

Done, thanks again 😄 🙏

@alimanfoo alimanfoo added the BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024). label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants