Fix TypeError for single-exon transcripts in veff.Annotator.get_children #873

Merged
jonbrenas merged 8 commits into malariagen:master from Vedag812:fix/issue-840-single-exon-typeerror
Mar 3, 2026

Conversation

@Vedag812
Contributor

Fixes #840

What's the issue?

When snp_allele_frequencies (or snp_effects) is called on a transcript that has only a single child (for instance, a single-exon non-coding transcript), Annotator.get_children() in veff.py returns a pandas Series rather than a DataFrame. The downstream .sort_values("start") call then fails with a TypeError, because Series.sort_values() sorts the Series' own values and does not accept a column name the way DataFrame.sort_values() does.
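A minimal sketch of the pandas behaviour behind this, assuming a small made-up feature table indexed by parent ID (the table, the `Parent` index name, and the IDs `T1`/`T2` are illustrative, not taken from the real GFF data):

```python
import pandas as pd

# Hypothetical feature table, indexed by the parent transcript ID.
features = pd.DataFrame(
    {
        "type": ["exon", "exon", "exon"],
        "start": [300, 100, 500],
        "end": [400, 200, 600],
    },
    index=pd.Index(["T1", "T1", "T2"], name="Parent"),
)

multi = features.loc["T1"]   # label matches two rows
single = features.loc["T2"]  # label matches exactly one row

print(type(multi).__name__)   # DataFrame
print(type(single).__name__)  # Series

# Sorting by column name is only meaningful in the DataFrame case.
print(multi.sort_values("start")["start"].tolist())  # [100, 300]
```

In the single-row case the column names become the Series index, so there is no "start" column left to sort by.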

As @jonbrenas noted in the issue, the deeper question is what to do with transcripts that lack coding regions entirely. The effects we compute (synonymous, start/stop lost, etc.) don't make sense for non-coding transcripts, so a clear error is more helpful than a cryptic crash.

What I changed

malariagen_data/veff.py

  • get_children() — when DataFrame.loc[] matches a single row, pandas returns a Series. I added a check: if the result is a Series, convert it back to a single-row DataFrame with .to_frame().T and preserve the index name. This way the rest of the pipeline (sorting, column filtering, .itertuples()) works as expected regardless of how many children the transcript has.

  • get_effects() — after extracting CDS, UTR, and exon children, I added a guard that checks whether the transcript has any coding features (CDS or UTR). If not, it raises a ValueError with a message explaining that variant effect annotation only applies to protein-coding transcripts and that the annotation may be incomplete. This replaces the previous behavior of silently producing incorrect results or crashing further downstream.
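The two changes above can be sketched roughly as follows. This is not the code from veff.py: the function names, the feature table, the error message wording, and the UTR type names (`five_prime_UTR`, `three_prime_UTR`) are assumptions made for illustration.

```python
import pandas as pd


def get_children_sketch(features: pd.DataFrame, parent_id: str) -> pd.DataFrame:
    """Return the children of parent_id, always as a DataFrame."""
    children = features.loc[parent_id]
    if isinstance(children, pd.Series):
        # A single match comes back as a Series whose index holds the
        # column names; transpose it into a one-row DataFrame.
        children = children.to_frame().T
        # The round trip drops the index name, so restore it.
        children.index.name = features.index.name
    return children


def check_coding_sketch(children: pd.DataFrame, transcript_id: str) -> None:
    """Raise ValueError if the transcript has no CDS or UTR children."""
    # Assumed feature type names; the real annotation may use others.
    coding_types = {"CDS", "five_prime_UTR", "three_prime_UTR"}
    if not children["type"].isin(coding_types).any():
        raise ValueError(
            f"Transcript {transcript_id} has no CDS or UTR features; "
            "variant effects only apply to protein-coding transcripts."
        )


# Hypothetical data: T1 is coding (exon + CDS), T2 is a single exon.
features = pd.DataFrame(
    {
        "type": ["exon", "CDS", "exon"],
        "start": [300, 100, 500],
        "end": [400, 200, 600],
    },
    index=pd.Index(["T1", "T1", "T2"], name="Parent"),
)

one = get_children_sketch(features, "T2")
print(type(one).__name__, one.index.name, len(one))  # DataFrame Parent 1
```

One side effect worth knowing: after `.to_frame().T` every column has object dtype, because the intermediate Series mixes strings and integers. Sorting a single row is unaffected by this.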

How I tested

Wrote a standalone test script that exercises three scenarios:

  1. Single-child transcript: confirms get_children() returns a DataFrame and sort_values("start") works
  2. Multi-child transcript: confirms the fix doesn't break the normal case
  3. Non-coding transcript: confirms the new ValueError is raised with an informative message

All three pass.

Vedag812 and others added 5 commits February 20, 2026 16:02
…alariagen#840)

When a transcript has only one child (e.g. a single-exon non-coding
transcript), pandas DataFrame.loc returns a Series instead of a
DataFrame. This caused a TypeError when downstream code called
.sort_values('start') on the result.

Changes:
- Annotator.get_children() now always returns a DataFrame by converting
  a Series result to a single-row DataFrame via .to_frame().T
- Annotator.get_effects() now raises an informative ValueError when a
  transcript has no CDS or UTR children (non-coding), instead of
  failing with a confusing error downstream

Closes malariagen#840
Collaborator

@jonbrenas jonbrenas left a comment

LGTM

@Vedag812
Contributor Author

@jonbrenas Any updates sir?

Copilot AI review requested due to automatic review settings March 3, 2026 08:13
Contributor

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

@jonbrenas jonbrenas merged commit 5999f49 into malariagen:master Mar 3, 2026
5 checks passed
Development

Successfully merging this pull request may close these issues.

TypeError when running ag3.snp_allele_frequencies for some transcripts

3 participants