Modify naming of species metadata columns in Ag3 #93
Description
This issue proposes some modifications to how the species calls are handled and presented through sample metadata.
The main problem coming up is that there are three possible sources of species/taxon assignment, available through different means, and this can be confusing:
- AIM species calls, included by default when calling Ag3.sample_metadata(), providing the species column (and several other columns).
- PCA species calls, which can be included when calling Ag3.sample_metadata() if the species_calls parameter is given as ("20200422", "pca"), providing the species column (and several other columns).
- Cohorts metadata, accessible via the Ag3.sample_cohorts() method, which provides the taxon column.
Below are some proposed changes to reduce this confusion.
AIM species columns
Propose that we change the naming of the AIM species columns, from:
aim_cols = (
    "aim_fraction_colu",
    "aim_fraction_arab",
    "species_gambcolu_arabiensis",
    "species_gambiae_coluzzii",
    "species",
)
...to:
aim_cols = (
    "aim_fraction_colu",
    "aim_fraction_arab",
    "aim_species_gambcolu_arabiensis",
    "aim_species_gambiae_coluzzii",
    "aim_species",
)
This would make it easier to always talk about the "AIM species" assignment, and have that always visible in dataframes and pivot tables. I.e., it's always clear where the species assignment has come from.
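As a rough illustration of the AIM renaming, here is a minimal pandas sketch. The old and new column names are taken from the proposal; the helper name rename_aim_columns, the sample_id column and all values in the toy dataframe are made up for the example and are not real Ag3 data or part of the malariagen_data API.

```python
import pandas as pd

# Old-to-new column names, as listed in the proposal above.
AIM_RENAMES = {
    "species_gambcolu_arabiensis": "aim_species_gambcolu_arabiensis",
    "species_gambiae_coluzzii": "aim_species_gambiae_coluzzii",
    "species": "aim_species",
}


def rename_aim_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the AIM species columns renamed.

    Hypothetical helper for illustration only.
    """
    return df.rename(columns=AIM_RENAMES)


# Toy dataframe using the current column names (values are invented).
df_old = pd.DataFrame(
    {
        "sample_id": ["S1", "S2"],
        "aim_fraction_colu": [0.95, 0.02],
        "aim_fraction_arab": [0.01, 0.01],
        "species_gambcolu_arabiensis": ["gamb_colu", "gamb_colu"],
        "species_gambiae_coluzzii": ["coluzzii", "gambiae"],
        "species": ["coluzzii", "gambiae"],
    }
)

df_new = rename_aim_columns(df_old)
print(list(df_new.columns))
```

The aim_fraction_* columns already carry the prefix, so only the three species columns change.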
PCA species columns
Propose that we change the naming of the PCA species columns, from:
pca_cols = (
    "PC1",
    "PC2",
    "species_gambcolu_arabiensis",
    "species_gambiae_coluzzii",
    "species",
)
...to:
pca_cols = (
    "pca_species_pc1",
    "pca_species_pc2",
    "pca_species_gambcolu_arabiensis",
    "pca_species_gambiae_coluzzii",
    "pca_species",
)
In general we don't recommend use of the PCA species assignment, but in case anyone ever does use it, these column names make it clear what is being used.
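One consequence of prefixing both sets of columns is that AIM and PCA calls could sit side by side in one dataframe without name collisions. The sketch below shows this with plain pandas, assuming the proposed names. The sample_id join key and all values are invented for illustration, and nothing here calls the real Ag3 API.

```python
import pandas as pd

# Old-to-new column names for the PCA species calls, as proposed above.
PCA_RENAMES = {
    "PC1": "pca_species_pc1",
    "PC2": "pca_species_pc2",
    "species_gambcolu_arabiensis": "pca_species_gambcolu_arabiensis",
    "species_gambiae_coluzzii": "pca_species_gambiae_coluzzii",
    "species": "pca_species",
}

# Toy PCA calls using the current column names (values are invented).
df_pca_old = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3"],
        "PC1": [-10.0, -12.0, 40.0],
        "PC2": [5.0, -6.0, 0.5],
        "species_gambcolu_arabiensis": ["gamb_colu", "gamb_colu", "arabiensis"],
        "species_gambiae_coluzzii": ["gambiae", "coluzzii", None],
        "species": ["gambiae", "coluzzii", "arabiensis"],
    }
)
df_pca = df_pca_old.rename(columns=PCA_RENAMES)

# Toy AIM calls, already using the proposed aim_ prefix.
df_aim = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3"],
        "aim_species": ["gambiae", "coluzzii", "arabiensis"],
    }
)

# With distinct prefixes the merge needs no suffixes to disambiguate.
merged = df_aim.merge(df_pca, on="sample_id")

# Cross-tabulate the two assignments; the axis labels show their source.
comparison = pd.crosstab(merged["aim_species"], merged["pca_species"])
print(comparison)
```

With the current names, both dataframes would contain a species column, and pandas would fall back to its default _x and _y suffixes on merge, which hides where each assignment came from.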
Notes
Still open for discussion is whether the cohort metadata should also be included when calling Ag3.sample_metadata(). That would provide the taxon column, which is the best column to use because it gives the most refined view of taxa within the dataset. However, I'll raise a separate issue to discuss that.
After this change is implemented, there will be some downstream consequences: the vector data user guide will likely need to be updated (or at least rerun), and possibly also the partner user guides.