Skip to content

Add "taxon_by" param to *_frequencies_advanced() functions and allow the "period_by" param to specify a column name#694

Merged
leehart merged 20 commits intomasterfrom
GH391_add_params_to_allele_frequencies_advanced
Jul 4, 2025
Merged

Add "taxon_by" param to *_frequencies_advanced() functions and allow the "period_by" param to specify a column name#694
leehart merged 20 commits intomasterfrom
GH391_add_params_to_allele_frequencies_advanced

Conversation

@leehart
Copy link
Copy Markdown
Collaborator

@leehart leehart commented Dec 10, 2024

Re: parent issue #391

Re: issues #722 and #723

@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@leehart leehart changed the title Add taxon_by param to snp_allele_frequencies_advanced(). Add example … Add params to snp_allele_frequencies_advanced Dec 10, 2024
@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Dec 13, 2024

Hi @alimanfoo . Do we know how we want to allow any value in the period_by column? Looking at the current code, it's not clear to me what we want to do here.

util.py currently has:

def prep_samples_for_cohort_grouping(
    *, df_samples, area_by, period_by, taxon_by="taxon"
):
    # Take a copy, as we will modify the dataframe.
    df_samples = df_samples.copy()

    # Fix "intermediate" or "unassigned" taxon values - we only want to build
    # cohorts with clean taxon calls, so we set other values to None.
    loc_intermediate_taxon = (
        df_samples[taxon_by].str.startswith("intermediate").fillna(False)
    )
    df_samples.loc[loc_intermediate_taxon, taxon_by] = None
    loc_unassigned_taxon = (
        df_samples[taxon_by].str.startswith("unassigned").fillna(False)
    )
    df_samples.loc[loc_unassigned_taxon, taxon_by] = None

    # Add period column.
    if period_by == "year":
        make_period = _make_sample_period_year
    elif period_by == "quarter":
        make_period = _make_sample_period_quarter
    elif period_by == "month":
        make_period = _make_sample_period_month
    else:  # pragma: no cover
        raise ValueError(
            f"Value for period_by parameter must be one of 'year', 'quarter', 'month'; found {period_by!r}."
        )
    sample_period = df_samples.apply(make_period, axis="columns")
    df_samples["period"] = sample_period

    # Add area column for consistent output.
    df_samples["area"] = df_samples[area_by]

    return df_samples

with

def _make_sample_period_month(row):
    year = row.year
    month = row.month
    if year > 0 and month > 0:
        return pd.Period(freq="M", year=year, month=month)
    else:
        return pd.NaT


def _make_sample_period_quarter(row):
    year = row.year
    month = row.month
    if year > 0 and month > 0:
        return pd.Period(freq="Q", year=year, month=month)
    else:
        return pd.NaT


def _make_sample_period_year(row):
    year = row.year
    if year > 0:
        return pd.Period(freq="Y", year=year)
    else:
        return pd.NaT

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Jan 17, 2025

@cclarkson @ahernank @jonbrenas Any thoughts or knowledge regarding the above question for this issue?

@jonbrenas
Copy link
Copy Markdown
Collaborator

Your guess is probably as good as mine. Maybe allow to use a column that already exists as "period" or a dictionary of queries? It might be worth asking for details at the next tools meeting.

@leehart leehart changed the title Add params to snp_allele_frequencies_advanced Add "taxon_by" param to snp_allele_frequencies_advanced() Feb 4, 2025
@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Feb 4, 2025

Restricting this PR to just adding the "taxon_by" param.

I've split the original issue into two sub-issues, in order to try to address the period_by feature request separately later, when we have a clearer idea of a path forward for that.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Feb 4, 2025

To do: add to other _frequencies_advanced functions.

@leehart leehart marked this pull request as draft February 4, 2025 14:26
Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @leehart. Would be good to add this parameter to the other ..._frequencies_advanced() functions.

@leehart leehart changed the title Add "taxon_by" param to snp_allele_frequencies_advanced() Add "taxon_by" param to *_allele_frequencies_advanced() Feb 4, 2025
@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Feb 4, 2025

To do: investigate and resolve potential merge conflict issue around _karyotype_tags_n_alt().

@leehart leehart changed the title Add "taxon_by" param to *_allele_frequencies_advanced() Add "taxon_by" param to *_frequencies_advanced() functions Feb 20, 2025
@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Feb 21, 2025

Now that I have some clarity on the sibling issue #723 I'm tempted to include it under this PR, and then update the PR title accordingly.

…uping(). Add nb test. Remove old _make_sample_period_...() funcs from util.py.
@leehart leehart changed the title Add "taxon_by" param to *_frequencies_advanced() functions Add "taxon_by" param to *_frequencies_advanced() functions and allow the "period_by" param to specify a column name Feb 21, 2025
@codecov
Copy link
Copy Markdown

codecov bot commented Feb 21, 2025

Codecov Report

Attention: Patch coverage is 75.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 95.97%. Comparing base (0197957) to head (3c2ce1d).

Files with missing lines Patch % Lines
malariagen_data/anoph/frq_base.py 61.53% 5 Missing ⚠️
malariagen_data/anoph/snp_frq.py 80.00% 2 Missing ⚠️
malariagen_data/anoph/cnv_frq.py 80.00% 1 Missing ⚠️
malariagen_data/anoph/hap_frq.py 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #694      +/-   ##
==========================================
- Coverage   96.13%   95.97%   -0.16%     
==========================================
  Files          47       47              
  Lines        4683     4699      +16     
==========================================
+ Hits         4502     4510       +8     
- Misses        181      189       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@leehart
Copy link
Copy Markdown
Collaborator Author

leehart commented Feb 21, 2025

To do: increase test coverage, in light of codecov test failures.

@leehart leehart marked this pull request as ready for review March 6, 2025 16:40
@leehart leehart requested a review from alimanfoo March 6, 2025 16:41
Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just a couple of minor suggestions re parameter checking.

@leehart leehart requested a review from alimanfoo March 10, 2025 16:01
Copy link
Copy Markdown
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of my comments are required changes and I think @leehart addressed all the concerns expressed by @alimanfoo (but I will let @alimanfoo confirm that it is so).

…with underscore when non-default taxon_by col
Copy link
Copy Markdown
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@leehart leehart merged commit d53e326 into master Jul 4, 2025
10 checks passed
@leehart leehart deleted the GH391_add_params_to_allele_frequencies_advanced branch July 4, 2025 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants