Skip to content

Refactor CNV data functions for Anopheles (inc. Af1)#411

Merged
leehart merged 51 commits intomalariagen:masterfrom
leehart:GH404_Af1_CNV_data
Aug 3, 2023
Merged

Refactor CNV data functions for Anopheles (inc. Af1)#411
leehart merged 51 commits intomalariagen:masterfrom
leehart:GH404_Af1_CNV_data

Conversation

@leehart
Copy link
Copy Markdown
Collaborator

@leehart leehart commented Jun 8, 2023

Re: #404

  • Part 1: setup
  • Part 2: code refactoring
  • Part 3: type hints and documentation
  • Part 4: test refactoring

@codecov
Copy link
Copy Markdown

codecov bot commented Jun 8, 2023

Codecov Report

Attention: Patch coverage is 97.78870% with 9 lines in your changes missing coverage. Please review.

Project coverage is 97.20%. Comparing base (ebb4384) to head (440a5ef).
Report is 559 commits behind head on master.

Files with missing lines Patch % Lines
malariagen_data/anoph/cnv_data.py 97.63% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #411      +/-   ##
==========================================
+ Coverage   97.07%   97.20%   +0.12%     
==========================================
  Files          24       26       +2     
  Lines        1673     2072     +399     
==========================================
+ Hits         1624     2014     +390     
- Misses         49       58       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jun 8, 2023

Need to resolve _lookup_sample() in AnophelesCnvData.
I wonder if the plan is to move it from anopheles to anoph.base, to work in a similar fashion to lookup_release(). 🤔

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jun 8, 2023

Perhaps lookup_sample() belongs in AnophelesSampleMetadata instead.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jun 8, 2023

Need to resolve the contig param of cnv_discordant_read_calls() in AnophelesCnvData.
Can't use base_params.contig because it needs to support a list as well as a str.
There is also a note that says:

# N.B., we cannot support region instead of contig here, because some
# CNV alleles have unknown start or end coordinates.

🤔

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jun 8, 2023

In order to cater for cnv_discordant_read_calls(), I wonder if we need to switch to using something similar to the region_param_type for contigs, e.g.

single_contig_param_type: TypeAlias = str

contig_param_type: TypeAlias = Union[
    single_contig_param_type,
    List[single_contig_param_type],
    Tuple[single_contig_param_type, ...],
]

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jun 9, 2023

Another approach or convention, which would be a breaking change, could be to use plural param names when either a string or a list or a tuple are accepted, e.g. contigs, and use singular when only a string, representing the singular, are accepted, e.g. contig.

@alimanfoo
Copy link
Copy Markdown
Member

Another approach or convention, which would be a breaking change, could be to use plural param names when either a string or a list or a tuple are accepted, e.g. contigs, and use singular when only a string, representing the singular, are accepted, e.g. contig.

This is tempting, but I think I'd prefer to avoid API change unless really necessary.

@alimanfoo
Copy link
Copy Markdown
Member

In order to cater for cnv_discordant_read_calls(), I wonder if we need to switch to using something similar to the region_param_type for contigs, e.g.

single_contig_param_type: TypeAlias = str

contig_param_type: TypeAlias = Union[
    single_contig_param_type,
    List[single_contig_param_type],
    Tuple[single_contig_param_type, ...],
]

This seems reasonable to me.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jun 9, 2023

Need to resolve plot_cnv_hmm_coverage_track() use of the string "auto" for param y_max, which is usually a float.

@alimanfoo
Copy link
Copy Markdown
Member

Perhaps lookup_sample() belongs in AnophelesSampleMetadata instead.

SGTM.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jun 9, 2023

It was easier to create a new base_param named contigs than it was to change all cases that use contig to use single_contig (there is only one case that supports a list of contigs), but this leads to a naming inconsistency with region. I am tempted to rename and refactor all cases of region to use regions instead. This would make the difference between singular and plural types more obvious, and consistent.

contig: TypeAlias = Annotated[
    single_contig_param_type,
    """
    Reference genome contig name. See the `contigs` property for valid contig
    names.
    """,
]

contigs: TypeAlias = Annotated[
    contig_param_type,
    """
    Reference genome contig name. See the `contigs` property for valid contig
    names. Can also be a sequence (e.g., list) of contigs.
    """,
]

single_region: TypeAlias = Annotated[
    single_region_param_type,
    """
    Region of the reference genome. Can be a contig name, region string
    (formatted like "{contig}:{start}-{end}"), or identifier of a genome
    feature such as a gene or transcript.
    """,
]

region: TypeAlias = Annotated[
    region_param_type,
    """
    Region of the reference genome. Can be a contig name, region string
    (formatted like "{contig}:{start}-{end}"), or identifier of a genome
    feature such as a gene or transcript. Can also be a sequence (e.g., list)
    of regions.
    """,
]

Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great so far :)

A few minor comments...

@alimanfoo
Copy link
Copy Markdown
Member

I am tempted to rename and refactor all cases of region to use regions instead. This would make the difference between singular and plural types more obvious, and consistent.

No objection. Would also then have consistency in type name with the sample_set and sample_sets types too.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jun 9, 2023

I'm now wondering if it makes sense for cnv_data to be using het_params, or whether it would be better to promote these to base_params (or perhaps gplt_params), like single_sample was.

    def plot_cnv_hmm_coverage_track(
        self,
        sample: het_params.sample,
        region: base_params.region,
        sample_set: Optional[base_params.sample_set] = None,
        y_max: cnv_params.y_max = cnv_params.y_max_default,
        sizing_mode: gplt_params.sizing_mode = gplt_params.sizing_mode_default,
        width: gplt_params.width = gplt_params.width_default,
        height: gplt_params.height = 200,
        circle_kwargs: Optional[het_params.circle_kwargs] = None,
        line_kwargs: Optional[het_params.line_kwargs] = None,
        show: gplt_params.show = True,
        x_range: Optional[gplt_params.x_range] = None,
        output_backend: gplt_params.output_backend = gplt_params.output_backend_default,
    ) -> gplt_params.figure:

I am also slightly concerned about cnv_params now having a mixed-type y_max and a different y_max_default to het_params, although this avoids changing the API.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jun 9, 2023

As suspected, having a mixed-type for y_max causes type-safety / inconstencey flags, e.g.

fig.yaxis.ticker = list(range(y_max + 1))

which would error if a string other than "auto" was used as param

        if y_max == "auto":
            y_max = data["call_CN"].max() + 2

🤔

It also looks like there is a problem re-using the het_params circle_kwargs for cnv_data, e.g.

circle_kwargs.setdefault("size", 3)

Raising no attribute "setdefault".

🤔

There is also an "incompatible types" issue in cnv_data's treatment of base_param sample_sets, i.e.

        if sample_sets is not None:
            if isinstance(sample_sets, (list, tuple)):
                sample_sets_text = ", ".join(sample_sets)
            else:
                sample_sets_text = sample_sets
            title += f" - {sample_sets_text}"

Somehow clashing with Union[Sequence[str], str],. I gather Sequence isn't actually an instance of a list or tuple, so gets treated as if it were a string in this logic.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jul 4, 2023

I gather I need to improve test coverage on cnv_data.py from 65% to 97% to satisfy these CI checks.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jul 6, 2023

Nearly there! 96.32% of diff hit (target 96.95%)

@leehart leehart marked this pull request as ready for review July 6, 2023 16:08
Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review, one small thing I noticed...

Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a sheep ton of work in here, thanks so much @leehart 🙏🏻

It looks great, I just picked up one thing regarding how the CNV stard and end coordinates are simulated. There should be no dependency between the CNV data and the SNP data, so it would be good to simulate the CNV start and end coordinates independently, according to the different types of CNV calling. Happy to talk through if would help.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jul 7, 2023

Ah, I suspected some of this and should have asked! Many thanks @alimanfoo . I'll see what I can do and might need to clarify some things next week.

Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, feel free to merge after making the suggested changes to simulation of CNV coordinates.

@leehart leehart merged commit 43d3357 into malariagen:master Aug 3, 2023
@leehart leehart deleted the GH404_Af1_CNV_data branch August 3, 2023 21:27
@alimanfoo alimanfoo added the BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024). label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants