Conversation
|
Hi @nkran, this is great stuff and very much appreciated! At a first glance this looks like a great direction. I'll give it a more thorough review asap but expect comments will be relatively minor. |
alimanfoo
left a comment
There was a problem hiding this comment.
Looking very good, couple of small suggestions...
malariagen_data/ag3.py
Outdated
| # search the geneset and match a genomic region regex pattern | ||
| gene_annotation = self.geneset(["ID"]).query( | ||
| f"type == 'gene' and ID == '{region}'" | ||
| ) | ||
| region_pattern_match = re.search(r"([a-zA-Z0-9]+)\:(.+)\-(.+)", region) | ||
|
|
||
| # region is a chromosome arm | ||
| if region in self.contigs: | ||
| contig, start, end = region, None, None | ||
|
|
||
| # region is a gene name | ||
| elif not gene_annotation.empty: | ||
| gene_annotation = gene_annotation.squeeze() | ||
| contig = gene_annotation.contig | ||
| start = gene_annotation.start | ||
| end = gene_annotation.end | ||
|
|
||
| # parse region string that contains genomic coordinates | ||
| elif region_pattern_match: | ||
| region_split = region_pattern_match.groups() | ||
|
|
||
| contig = region_split[0] | ||
| start = int(region_split[1].replace(",", "")) | ||
| end = int(region_split[2].replace(",", "")) | ||
|
|
||
| if contig not in self.contigs: | ||
| raise ValueError(f"Contig {contig} does not exist in the dataset.") | ||
| elif ( | ||
| start < 0 | ||
| or end <= start | ||
| or end > self.genome_sequence(region=contig).shape[0] | ||
| ): | ||
| raise ValueError("Provided genomic coordinates are not valid.") | ||
|
|
||
| else: | ||
| raise ValueError(f"Region {region} is not valid.") |
There was a problem hiding this comment.
| # search the geneset and match a genomic region regex pattern | |
| gene_annotation = self.geneset(["ID"]).query( | |
| f"type == 'gene' and ID == '{region}'" | |
| ) | |
| region_pattern_match = re.search(r"([a-zA-Z0-9]+)\:(.+)\-(.+)", region) | |
| # region is a chromosome arm | |
| if region in self.contigs: | |
| contig, start, end = region, None, None | |
| # region is a gene name | |
| elif not gene_annotation.empty: | |
| gene_annotation = gene_annotation.squeeze() | |
| contig = gene_annotation.contig | |
| start = gene_annotation.start | |
| end = gene_annotation.end | |
| # parse region string that contains genomic coordinates | |
| elif region_pattern_match: | |
| region_split = region_pattern_match.groups() | |
| contig = region_split[0] | |
| start = int(region_split[1].replace(",", "")) | |
| end = int(region_split[2].replace(",", "")) | |
| if contig not in self.contigs: | |
| raise ValueError(f"Contig {contig} does not exist in the dataset.") | |
| elif ( | |
| start < 0 | |
| or end <= start | |
| or end > self.genome_sequence(region=contig).shape[0] | |
| ): | |
| raise ValueError("Provided genomic coordinates are not valid.") | |
| else: | |
| raise ValueError(f"Region {region} is not valid.") | |
| # check type, fail early if bad | |
| if not isinstance(region, str): | |
| raise TypeError("The region parameter must be a string or Region object.") | |
| # check if region is a chromosome arm | |
| if region in self.contigs: | |
| return Ag3.Region(region, None, None) | |
| # check if region is a region string | |
| region_pattern_match = re.search(r"([a-zA-Z0-9]+)\:(.+)\-(.+)", region) | |
| if region_pattern_match: | |
| # parse region string that contains genomic coordinates | |
| region_split = region_pattern_match.groups() | |
| contig = region_split[0] | |
| start = int(region_split[1].replace(",", "")) | |
| end = int(region_split[2].replace(",", "")) | |
| if contig not in self.contigs: | |
| raise ValueError(f"Contig {contig} does not exist in the dataset.") | |
| elif ( | |
| start < 0 | |
| or end <= start | |
| or end > self.genome_sequence(region=contig).shape[0] | |
| ): | |
| raise ValueError("Provided genomic coordinates are not valid.") | |
| return Ag3.Region(contig, start, end) | |
| # check if region is a gene annotation feature ID | |
| gene_annotation = self.geneset(["ID"]).query(f"ID == '{region}'") | |
| if not gene_annotation.empty: | |
| # region is a feature ID | |
| gene_annotation = gene_annotation.squeeze() | |
| return Ag3.Region(gene_annotation.contig, gene_annotation.start, gene_annotation.end) | |
| raise ValueError(f"Region {region!r} is not a valid contig, region string or feature ID.") |
This suggestion reorders the logic to deal with the simplest and cheapest cases first. Searching the geneset for an ID is not particularly expensive, but seems like it's worth avoiding if in many cases the region parameter will be a contig or region string and those cases can be dealt with cheaply first.
Also the query on the geneset here is relaxed a little to allow matching any feature ID.
There was a problem hiding this comment.
Great, thanks. Will commit the changes.
|
Hi @alimanfoo, thanks for the first review :) I added your corrections and support for regions in the other functions. I saw that there were TODO tags for multiple contigs/regions support in A few combinations of tests for CNV methods with some regions are failing with Have a nice weekend, |
alimanfoo
left a comment
There was a problem hiding this comment.
Couple of small suggestions.
malariagen_data/ag3.py
Outdated
| # find genes in the region | ||
| if region.start and region.end: | ||
| df_genes = self.geneset().query( | ||
| f"type == 'gene' and contig == '{region.contig}' and start >= {region.start} and end <= {region.end}" |
There was a problem hiding this comment.
| f"type == 'gene' and contig == '{region.contig}' and start >= {region.start} and end <= {region.end}" | |
| f"type == 'gene' and contig == '{region.contig}' and start <= {region.end} and end >= {region.start}" |
Possible tweak to retrieve all genes overlapping the requested region.
tests/test_ag3.py
Outdated
| region = ag3._resolve_region(region) | ||
| if region.start and region.end: | ||
| df_genes = df_geneset.query( | ||
| f"type == 'gene' and contig == '{region.contig}' and start >= {region.start} and end <= {region.end}" |
There was a problem hiding this comment.
Will need to edit this to match above if changing to overlap query.
Yes that would be my guess, especially the discordant read calls, there are only a couple of regions with data there. Could just adapt the regions tested so they return some data and avoid KeyError, e.g., for the discordant read calls could use genes from the cyp6p/aa cluster on 2R? |
|
Hi @nkran, on second thoughts, it might be better to back out the changes to the CNV methods and revisit those in a later PR. The reason is that CNVs span intervals, and so we'd need to do an overlap query for regions. I.e., the current approach of using the |
|
Hi @alimanfoo! Ah yes, that makes sense! I'll omit those two commits and update the branch. |
1723bfa to
88e7c0a
Compare
|
Hi @nkran, thanks so much, this all looks good to me now, except for one .pyc file has crept in ( |
|
Hi @alimanfoo, just to double check - my commit (a2362bd) removes that .pyc file that is currently in master. Do you want me to revert the commit that deletes it and keep the .pyc file or to remove the .pyc file? Thanks for your patience :) Just want to get it right. |
My bad, I wasn't looking carefully enough to realise the file was already in master and you were cleaning up :) OK to merge? |
|
OK! :) 🚀 |
|
Done, thanks again 😄 🙏 |
Wanted to give #14 a go as it would be a neat feature to have in every day data wrangling.
Despite WIP I wanted to open a PR to check in with you regarding implementation and suggestions.
Still to-do:
region_slice()