Modify naming of species metadata columns in Ag3 #93

@alimanfoo

This issue proposes some modifications to how the species calls are handled and presented through sample metadata.

The main problem coming up is that there are three possible sources of species/taxon assignment, available through different means, and this can be confusing:

  • AIM species calls, included by default when calling Ag3.sample_metadata(), providing the species column (and several other columns)
  • PCA species calls, which can be included when calling Ag3.sample_metadata() if the species_calls parameter is given as ("20200422", "pca"), providing the species column (and several other columns).
  • Cohorts metadata, accessible via the Ag3.sample_cohorts() method, which provides the taxon column.

Below are some proposed changes to reduce this confusion.

AIM species columns

Propose that we change the naming of the AIM species columns, from:

    aim_cols = (
        "aim_fraction_colu",
        "aim_fraction_arab",
        "species_gambcolu_arabiensis",
        "species_gambiae_coluzzii",
        "species",
    )

...to:

    aim_cols = (
        "aim_fraction_colu",
        "aim_fraction_arab",
        "aim_species_gambcolu_arabiensis",
        "aim_species_gambiae_coluzzii",
        "aim_species",
    )

This would make it easier to always talk about the "AIM species" assignment, and have that always visible in dataframes and pivot tables. I.e., it's always clear where the species assignment has come from.
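As a rough sketch of what the rename means for a user's dataframe (not the actual Ag3 implementation), the mapping can be expressed as a pandas column rename. Only the column names come from the proposal above; the `rename_aim_columns` helper, the sample IDs, the countries and all values are made up for illustration.

```python
import pandas as pd

# Old -> new names, taken from the proposal above. The two "aim_fraction_*"
# columns already carry the prefix and are left unchanged.
AIM_RENAMES = {
    "species_gambcolu_arabiensis": "aim_species_gambcolu_arabiensis",
    "species_gambiae_coluzzii": "aim_species_gambiae_coluzzii",
    "species": "aim_species",
}


def rename_aim_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with the AIM species columns renamed."""
    return df.rename(columns=AIM_RENAMES)


# Toy data standing in for the output of Ag3.sample_metadata() (invented values).
df_samples = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3"],
        "country": ["Mali", "Mali", "Kenya"],
        "aim_fraction_colu": [0.95, 0.03, 0.50],
        "aim_fraction_arab": [0.01, 0.02, 0.98],
        "species_gambcolu_arabiensis": ["gamb_colu", "gamb_colu", "arabiensis"],
        "species_gambiae_coluzzii": ["coluzzii", "gambiae", None],
        "species": ["coluzzii", "gambiae", "arabiensis"],
    }
)

df_renamed = rename_aim_columns(df_samples)

# Sample counts per country and AIM species call. The column axis of the
# pivot table is now labelled "aim_species" rather than a bare "species".
df_pivot = df_renamed.pivot_table(
    index="country",
    columns="aim_species",
    values="sample_id",
    aggfunc="count",
    fill_value=0,
)
print(df_pivot)
```

With the new names, the pivot table header itself records that the counts are based on the AIM calls.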

PCA species columns

Propose that we change the naming of the PCA species columns, from:

    pca_cols = (
        "PC1",
        "PC2",
        "species_gambcolu_arabiensis",
        "species_gambiae_coluzzii",
        "species",
    )

...to:

    pca_cols = (
        "pca_species_pc1",
        "pca_species_pc2",
        "pca_species_gambcolu_arabiensis",
        "pca_species_gambiae_coluzzii",
        "pca_species",
    )

In general we don't recommend use of the PCA species assignments, but in case anyone ever does use them, these column names make it clear what is being used.
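One way to see the effect of the two renames together is to compare the old and new column sets directly. The sketch below is illustrative only: the tuples are copied from this proposal, while `PCA_RENAMES` and the other variable names are made up here and are not part of the Ag3 API.

```python
# Column sets as listed in this proposal.
old_aim_cols = (
    "aim_fraction_colu",
    "aim_fraction_arab",
    "species_gambcolu_arabiensis",
    "species_gambiae_coluzzii",
    "species",
)
old_pca_cols = (
    "PC1",
    "PC2",
    "species_gambcolu_arabiensis",
    "species_gambiae_coluzzii",
    "species",
)

# Hypothetical old -> new mapping for the PCA columns.
PCA_RENAMES = {
    "PC1": "pca_species_pc1",
    "PC2": "pca_species_pc2",
    "species_gambcolu_arabiensis": "pca_species_gambcolu_arabiensis",
    "species_gambiae_coluzzii": "pca_species_gambiae_coluzzii",
    "species": "pca_species",
}

new_aim_cols = (
    "aim_fraction_colu",
    "aim_fraction_arab",
    "aim_species_gambcolu_arabiensis",
    "aim_species_gambiae_coluzzii",
    "aim_species",
)
new_pca_cols = tuple(PCA_RENAMES[col] for col in old_pca_cols)

# Names shared by the AIM and PCA column sets, before and after the rename.
old_clash = sorted(set(old_aim_cols) & set(old_pca_cols))
new_clash = sorted(set(new_aim_cols) & set(new_pca_cols))

print("shared names before:", old_clash)
print("shared names after:", new_clash)
```

Under the old naming, three column names are shared between the two sources, so a dataframe cannot tell the reader which method produced a given "species" value. Under the proposed naming the two sets are disjoint.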

Notes

Still open for discussion is whether the cohort metadata should also be included when calling Ag3.sample_metadata(). That would provide the taxon column, which is the best column to use because it gives the most refined view of taxa within the dataset. However, I'll raise a separate issue to discuss that.

After this change is implemented, there will be some downstream consequences: the vector data user guide will likely need to be updated (or at least rerun), and possibly the partner user guides as well.
