Add `unrestricted_use_only` and `surveillance_use_only` constructor params by leehart · Pull Request #724 · malariagen/malariagen-data-python

leehart · 2025-02-07T10:51:19Z

Re: issue #716

…_releases property.

leehart · 2025-02-20T15:43:40Z

Opening this for a WIP review, to potentially avoid going too far down the wrong path.

@alimanfoo @cclarkson @ahernank @jonbrenas I'm trying to identify other functions that we need to filter results for, according to:

unrestricted_use_only, which I've so far applied to sample_sets(), and
surveillance_use_only. which I've so far applied to sample_metadata().

Those seem like the main ones, which some other functions use, but I want to make sure we plug all potential leaks.

leehart · 2025-02-25T14:33:13Z

To do: add _prep_sample_query_param()

…use_only

leehart · 2025-03-21T10:43:33Z

To do: check that all public functions honour the unrestricted_use_only and surveillance_use_only constructor params.

leehart · 2025-04-08T11:52:40Z

It looks like Ag3 currently has about 137 public methods... 🤔

leehart · 2025-04-08T12:11:43Z

It also looks like about 119 of Ag3's public methods cannot be called without specifying params (cannot rely on defaults), which makes the testing of these constructor params somewhat difficult to automate. This also smells a lot like a "god object" anti-pattern https://en.wikipedia.org/wiki/God_object

We should probably consider re-organising all of those methods, despite the inconvenience, but that will probably have to wait. In the meantime, I hope to be able to figure out which functions are vulnerable to leaking unfiltered data, relating to either the surveillance-only or unrestricted-use-only flags.

leehart · 2025-04-08T14:01:47Z

@ahernank @jonbrenas Checkmarks indicate whether the function gets its data from an upstream public function, or file, or param, or otherwise looks covered. Unchecked functions indicate some doubt and require further investigation, discussion or coding.

aa_allele_frequencies - gets its data from snp_allele_frequencies
aa_allele_frequencies_advanced - gets its data from snp_allele_frequencies_advanced
add_extra_metadata - checks data from sample_metadata
aim_calls - uses _prep_sample_query_param and sample_metadata
aim_metadata - gets its data from samples.species_aim.csv
aim_variants - gets its data from aim_defs_{analysis}/{aims}.zarr
average_fst - uses snp_allele_counts
biallelic_diplotype_pairwise_distances - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
biallelic_diplotypes - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
biallelic_snp_calls - uses snp_allele_counts and snp_calls
biallelic_snps_to_plink - uses biallelic_snp_calls
clear_extra_metadata - is simply self._extra_metadata = []
cnv_coverage_calls - TO INVESTIGATE: uses _cnv_coverage_calls_dataset
cnv_discordant_read_calls - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _cnv_discordant_read_calls_dataset
cnv_hmm - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _cnv_hmm_dataset
cohort_diversity_stats - uses sample_metadata and snp_allele_counts
cohorts - gets its data from cohorts_{cohort_set}.csv
cohorts_metadata - gets its data from samples.cohorts.csv
count_samples - uses sample_metadata
cross_metadata - gets its data from crosses.fam
diplotype_pairwise_distances - uses _prep_sample_query_param
diversity_stats - TO INVESTIGATE: uses _setup_cohort_queries and cohort_diversity_stats not _prep_sample_query_param
fst_gwss - uses _prep_sample_query_param
g123_calibration - uses _prep_sample_query_param
g123_gwss - uses _prep_sample_query_param
gene_cnv - TO INVESTIGATE: uses _gene_cnv not _prep_sample_query_param
gene_cnv_frequencies - TO INVESTIGATE: uses _gene_cnv_frequencies not _prep_sample_query_param
gene_cnv_frequencies_advanced - TO INVESTIGATE: uses _gene_cnv_frequencies_advanced not _prep_sample_query_param
general_metadata - TO CONSIDER: gets its data from samples.meta.csv (NOT FILTERED)
geneset - uses genome_features
genome_feature_children - TO INVESTIGATE: uses _genome_features
genome_features - TO INVESTIGATE: uses _genome_features_for_contig and _genome_features
genome_sequence - TO INVESTIGATE: uses _genome_sequence_for_contig
h12_calibration - uses _prep_sample_query_param
h12_gwss - uses _prep_sample_query_param
h1x_gwss - uses _prep_sample_query_param
haplotype_pairwise_distances - uses _prep_sample_query_param
haplotype_sites - uses _haplotype_sites_for_region
haplotypes - TO INVESTIGATE: uses _prep_sample_query_param and sample_metadata but also _haplotypes_for_contig
haplotypes_frequencies - uses sample_metadata and locate_cohorts and haplotypes and haplotype_frequencies
haplotypes_frequencies_advanced - uses sample_metadata and haplotypes and haplotype_frequencies
igv - uses _igv_config and igv_notebook (LOOKS COVERED)
ihs_gwss - TO INVESTIGATE: uses _prep_sample_query_param but also _ihs_gwss
is_accessible - uses genome_sequence and snp_sites and _site_filters_for_region
karyotype - TO INVESTIGATE: uses load_inversion_tags and snp_calls but not _prep_sample_query_param
load_inversion_tags - gets its data from self._inversion_tag_path
lookup_release - uses sample_sets
lookup_sample - uses sample_metadata
lookup_study - uses sample_sets
lookup_study_info - uses sample_sets
lookup_terms_of_use_info - uses sample_sets
njt - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
open_cnv_coverage_calls - gets its data from coverage_calls/{analysis}/zarr
open_cnv_discordant_read_calls - gets its data from cnv/{sample_set}/{calls_version}/zarr
open_cnv_hmm - gets its data from cnv/{sample_set}/hmm/zarr
open_file - simply self._fs.open(f"{self._base_path}/{path}")
open_genome - gets its data from {self._base_path}/{self._genome_zarr_path}
open_haplotype_sites - gets its data from snp_haplotypes/sites/{analysis}/zarr
open_haplotypes - gets its data from snp_haplotypes/{sample_set}/{analysis}/zarr
open_site_annotations - gets its data from {self._base_path}/{self._site_annotations_zarr_path}
open_site_filters - gets its data from site_filters/{self._site_filters_analysis}/{mask_prepped}/
open_snp_genotypes - gets its data from snp_genotypes/all/{sample_set}/
open_snp_sites - gets its data from snp_genotypes/all/sites/
pairwise_average_fst - TO INVESTIGATE: uses _setup_cohort_queries not _prep_sample_query_param
pca - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
plot_aim_heatmap - uses aim_calls and go_make_subplots
plot_cnv_hmm_coverage - uses plot_cnv_hmm_coverage_track and plot_genes
plot_cnv_hmm_coverage_track - uses cnv_hmm
plot_cnv_hmm_heatmap - uses plot_cnv_hmm_heatmap_track and plot_genes
plot_cnv_hmm_heatmap_track - uses cnv_hmm
plot_diplotype_clustering - uses sample_metadata and diplotype_pairwise_distances
plot_diplotype_clustering_advanced - TO INVESTIGATE: uses plot_diplotype_clustering but also _dipclust_het_bar_trace and _dipclust_cnv_bar_trace and _insert_dipclust_snp_trace and _dipclust_concat_subplots
plot_diversity_stats - gets its data from df_stats param
plot_frequencies_heatmap - gets its data from df param
plot_frequencies_interactive_map - gets its data from ds param
plot_frequencies_map_markers - gets its data from ds param
plot_frequencies_time_series - gets its data from ds param
plot_fst_gwss - uses plot_fst_gwss_track and plot_genes
plot_fst_gwss_track - uses fst_gwss
plot_g123_calibration - uses g123_calibration
plot_g123_gwss - uses plot_g123_gwss_track and plot_genes
plot_g123_gwss_track - uses g123_gwss
plot_genes - TO INVESTIGATE: uses genome_sequence and _plot_genes_setup_data
plot_h12_calibration - uses h12_calibration
plot_h12_gwss - uses plot_h12_gwss_track and plot_genes
plot_h12_gwss_multi_overlay - uses plot_h12_gwss_multi_overlay_track and plot_genes
plot_h12_gwss_multi_overlay_track - TO INVESTIGATE: uses h12_gwss and _setup_cohort_queries not _prep_sample_query_param
plot_h12_gwss_multi_panel - TO INVESTIGATE: uses plot_h12_gwss_track and plot_genes and _setup_cohort_queries not _prep_sample_query_param
plot_h12_gwss_track - uses h12_gwss
plot_h1x_gwss - uses plot_h1x_gwss_track and plot_genes
plot_h1x_gwss_track - uses h1x_gwss
plot_haplotype_clustering - TO INVESTIGATE: uses sample_metadata and haplotype_pairwise_distances but also _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly and plot_dendrogram
plot_haplotype_network - uses haplotypes and sample_metadata and median_joining_network and mjn_graph and plotly_discrete_legend (I expect _setup_sample_colors_plotly is OK)
plot_heterozygosity - uses plot_heterozygosity_track and plot_genes
plot_heterozygosity_track - TO INVESTIGATE: uses _sample_count_het and _plot_heterozygosity_track
plot_ihs_gwss - uses plot_ihs_gwss_track and plot_genes
plot_ihs_gwss_track - uses ihs_gwss
plot_njt - uses njt and sample_metadata (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
plot_pairwise_average_fst - gets its data from fst_df param
plot_pca_coords - gets its data from data param (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
plot_pca_coords_3d - gets its data from data param (I expect _setup_sample_symbol and _setup_sample_colors_plotly and _setup_sample_hover_data_plotly are OK)
plot_pca_variance - gets its data from evr param (simple bar plot)
plot_roh - TO INVESTIGATE: uses _sample_count_het and _plot_heterozygosity_track and _roh_hmm_predict and plot_roh_track and plot_genes
plot_roh_track - gets its data from df_roh param
plot_sample_location_geo - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
plot_sample_location_mapbox - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
plot_samples_bar - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
plot_samples_interactive_map - uses sample_metadata (NOTE: could potentially use _prep_sample_query_param too)
plot_snps - uses plot_snps_track and plot_genes
plot_snps_track - uses snp_allele_counts and snp_variants and genome_sequence
plot_transcript - uses genome_features and genome_feature_children
plot_xpehh_gwss - uses plot_xpehh_gwss_track and plot_genes
plot_xpehh_gwss_track - uses xpehh_gwss
read_files - gets its data from its paths param
results_cache_get - gets its data from results.zarr.zip or results.npz
results_cache_set - gets its data from its params and results params and writes params.json and results.zarr.zip
roh_hmm - TO INVESTIGATE: uses _sample_count_het and _roh_hmm_predict
sample_metadata - uses _prep_sample_query_param and general_metadata and sequence_qc_metadata and surveillance_flags and aim_metadata and cohorts_metadata
sample_sets - uses _read_sample_sets and _unrestricted_use_only
sequence_qc_metadata - gets its data from sequence_qc_stats.csv
site_annotations - uses _site_annotations_raw and snp_sites
site_filters - uses _site_filters_for_region
snp_allele_counts - uses _prep_sample_selection_cache_params which uses _prep_sample_query_param
snp_allele_frequencies - uses sample_metadata and snp_calls
snp_allele_frequencies_advanced - uses sample_metadata and snp_calls
snp_calls - uses _prep_sample_query_param and _snp_calls
snp_dataset - uses snp_calls
snp_effects - uses snp_variants and _snp_df_melt and _snp_effect_annotator
snp_genotype_allele_counts - uses snp_calls
snp_genotypes - uses _prep_sample_query_param and _snp_genotypes_for_contig
snp_sites - uses _snp_sites_for_region and site_filters
snp_variants - uses _snp_variants_for_contig
surveillance_flags - gets its data from surveillance.flags.csv
view_alignments - uses _igv_view_alignments_tracks and igv
wgs_data_catalog - TO CONSIDER: gets its data from wgs_snp_data.csv (NOT FILTERED)
wgs_run_accessions - TO CONSIDER: gets its data from wgs_accession_data.csv (NOT FILTERED)
xpehh_gwss - uses _prep_sample_query_param and _xpehh_gwss

leehart · 2025-04-24T09:14:10Z

Now investigating unexpected test failures after merging with master branch.

Local pytest:

446 passed, 38 warnings, 224 errors

CI pytest:

474 passed, 36 warnings, 224 errors

jonbrenas · 2025-04-24T09:33:19Z

Thanks @leehart. For IGV to work, it needed to have access to a 'public' version of the reference genomes, e.g., at gs://vo_anoph_temp_us_central1/vo_afun_release/reference/genome/idAnoFuneDA-416_04. This, in turn, needed to be passed to the AnophelesBase constructor. The tests use simulated data so the link needs to be specified (but None should work).

leehart · 2025-06-05T08:11:55Z

Submitting this for review while we figure out sample_indices.

leehart · 2025-06-05T11:27:04Z

Testing this on Colab is throwing out a bunch of these ugly warnings, which would be good to resolve:

/usr/local/lib/python3.11/dist-packages/malariagen_data/anoph/sample_metadata.py:780: RuntimeWarning: Engine has switched to 'python' because numexpr does not support extension array dtypes. Please set your engine to python manually.
df_samples = df_samples.query(prepared_sample_query, **sample_query_options)

anoph/sample_metadata.py:780 currently (in this PR) has:

        # Apply the sample_query or sample_indices, if specified.
        if prepared_sample_query is not None:
            # Assume a pandas query string.
            sample_query_options = sample_query_options or {}
            df_samples = df_samples.query(prepared_sample_query, **sample_query_options)
            df_samples = df_samples.reset_index(drop=True)

anoph/base.py:951 currently (in this PR) has:

        # Determine which samples match the sample query.
        if sample_query != "":
            loc_samples = df_samples.eval(sample_query, **sample_query_options)

anoph/sample_metadata.py:1071 currently (in this PR) has:

        if prepared_sample_query is not None:
            # Resolve query to a list of integers for more cache hits - we
            # do this because there are different ways to write the same pandas
            # query, and so it's better to evaluate the query and use a list of
            # integer indices instead.
            df_samples = self.sample_metadata(sample_sets=prepared_sample_sets)
            sample_query_options = sample_query_options or {}
            loc_samples = df_samples.eval(
                prepared_sample_query, **sample_query_options
            ).values
            sample_indices = np.nonzero(loc_samples)[0].tolist()

Basically, it seems to be around using Pandas query or eval in a way that's incompatible with the more efficient numexpr engine. I gather it might be because the DataFrame in question is using exotic data types, for some reason, which would want to avoid if we want to use the numexpr engine, rather than specifying the python engine just to quash this warning. We'll need to track down wherever these "extension array dtypes" are being used and see how they might be avoided, for the sake of performance. If they're unavoidable, then I guess we might be forced to use the less efficient python engine.

leehart · 2025-06-05T15:50:00Z

Since the numexpr engine doesn't support dtypes like Int64, Float64 and even boolean, (judging from test_extension_array_eval in https://github.com/pandas-dev/pandas/blob/main/pandas/tests/frame/test_query_eval.py ) which we currently use in the sample_metadata DataFrame in order to support missing values, e.g.:

mean_cov                           Float64
median_cov                           Int64
modal_cov                            Int64
mean_cov_2L                        Float64
median_cov_2L                        Int64
mode_cov_2L                          Int64
mean_cov_2R                        Float64
median_cov_2R                        Int64
mode_cov_2R                          Int64
mean_cov_3L                        Float64
median_cov_3L                        Int64
mode_cov_3L                          Int64
mean_cov_3R                        Float64
median_cov_3R                        Int64
mode_cov_3R                          Int64
mean_cov_X                         Float64
median_cov_X                         Int64
mode_cov_X                           Int64
frac_gen_cov                       Float64
divergence                         Float64
contam_pct                         Float64
contam_LLR                         Float64

I'm tempted to force the use of the less efficient python engine, despite this not being recommended by the Pandas docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html

You can change the semantics of the expression by passing the keyword argument parser='python'. This enforces the same semantics as evaluation in Python space. Likewise, you can pass engine='python' to evaluate an expression using Python itself as a backend. This is not recommended as it is inefficient compared to using numexpr as the engine.

leehart · 2025-06-10T10:07:06Z

Hi @jonbrenas @ahernank - I've tried to elucidate the problem with sample_indices (for this PR) below. I'm looking into resolving it now, so I'll put this PR back into draft mode.

What is the `sample_indices` parameter used for?

The sample_indices parameter is available for some functions as a way to select the samples returned by the function by providing the corresponding indices of the required samples with reference to the sample_metadata function. It is intended as an alternative to the sample_query parameter, which selects samples based on Pandas query criteria. Some functions prevent both the sample_query and the sample_indices parameters being used at the same time. Some functions internally convert the results of the sample_query parameter into an equivalent sample_indices parameter.

Public functions that support the sample_indices param are:

biallelic_diplotype_pairwise_distances
njt
plot_njt
pca
sample_metadata
snp_genotypes
snp_calls
snp_allele_counts
biallelic_snp_calls
biallelic_diplotypes
biallelic_snps_to_plink

Functions that use _prep_sample_selection_cache_params to convert sample_query to sample_indices:

biallelic_diplotype_pairwise_distances
njt
pca
snp_allele_counts
biallelic_diplotypes

Functions that use base_params.validate_sample_selection_params to prevent both sample_query and sample_indices:

sample_metadata
snp_genotypes
snp_calls

What is the problem?

One of the mechanisms being used to honour the surveillance_use_only setting is to internally inject is_surveillance == True into the sample_query parameter via the _prep_sample_query_param function. This means that functions that were originally called with no sample_query parameter gain a sample_query parameter, regardless of whether they also had a sample_indices parameter passed to them. For the sample_metadata function, which uses validate_sample_selection_params to ensure both selection methods are not used, using sample_indices with a surveillance-only object will currently (in this PR) result in a silent failure, e.g.

sample_sets = "AG1000G-FR"
sample_indices = [0, 1, 2, 3]
default_sample_metadata_df = ag3_default.sample_metadata(sample_sets=sample_sets, sample_indices=sample_indices)
surveillance_sample_metadata_df = ag3_surveillance.sample_metadata(sample_sets=sample_sets, sample_indices=sample_indices)
print('Number of samples from default object:', len(default_sample_metadata_df))
print('Number of samples from surveillance object:', len(surveillance_sample_metadata_df))

Number of samples from default object: 4
Number of samples from surveillance object: 23

In this case, sample_indices are ignored due to logic in the sample_metadata function:

        if prepared_sample_query is not None:
		# ...
        elif sample_indices is not None:
            # ...

For the pca function, which doesn't use validate_sample_selection_params but uses _prep_sample_selection_cache_params, which internally converts any sample_query into sample_indices, using sample_indices with a surveillance-only object will currently (in this PR) also result in a silent failure, e.g.

sample_indices = list(range(0, 500))
n_snps = 10_000
pca_region = '3L:15,000,000-16,000,000'
ag3_pca_df, evr = ag3_default.pca(region=pca_region, sample_sets=mixed_sample_set, n_snps=n_snps, sample_indices=sample_indices)
ag3_surveillance_pca_df, evr = ag3_surveillance.pca(region=pca_region, sample_sets=mixed_sample_set, n_snps=n_snps, sample_indices=sample_indices)
print('Samples returned by the default object:', len(ag3_pca_df['sample_id'].unique()))
print('Samples returned by the surveillance object:', len(ag3_surveillance_pca_df['sample_id'].unique()))

Samples returned by the default object: 500
Samples returned by the surveillance object: 86

In this case, sample_indices are ignored due to logic in the _prep_sample_selection_cache_params function:

        if prepared_sample_query is not None:
            # ...
            sample_indices = np.nonzero(loc_samples)[0].tolist()

        return prepared_sample_sets, sample_indices

For the functions that don't use either validate_sample_selection_params nor _prep_sample_selection_cache_params:

plot_njt passes its sample_indices to both njt and sample_metadata, so is reliant on their behaviour.
biallelic_snp_calls uses snp_allele_counts and snp_calls, so is reliant on their behaviour.
biallelic_snps_to_plink uses biallelic_snp_calls, so is reliant on its behaviour.

What is the proposed solution in principle?

To restore proper support for the sample_indices param for affected functions in this PR, when surveillance_use_only is set:

Ensure that the validate_sample_selection_params check is used at the beginning of any function that accepts both sample_query and sample_indices param. (This doesn't directly solve the problem but tightens up the user-interface and protects the assumptions of downstream code.)
Ensure that wherever sample_query is not provided by the user (before being "prepared" and hijacked internally) and sample_indices is provided, that sample_indices are still applied appropriately. For instance, if a sample_query starts as None but becomes is_surveillance == True due to _prep_sample_query_param, then any sample_indices should correspond to the results of the is_surveillance == True sample query, rather than None. In other words, sample_indices should be relative to whatever samples the function returns (or uses, in the case of plots) when passed no sample_query nor sample_indices by the user.

…le_indices. Add sample_query_options to biallelic_snps_to_plink.

leehart · 2025-06-10T11:28:53Z

jonbrenas · 2025-06-10T11:31:58Z

Thanks @leehart. It is kind of wild (and also wrong) that we had three different potential behaviours for the interaction between sample_query and sample_indices. I agree that we need to remediate that but I am not sure I agree with the chosen solution.

As far as I am concerned, surveillance_use_only and unrestricted_use_only are queries (and, as you say, it is how they are handled internally) so IMHO going with the middle road isn't working. The easiest solution would be to simply ban the use of sample_indices if any of the "query" options is used but it might be too restrictive. A better solution would be to end up with a process like what is down for min_cohort_size where any ID explicitly listed in sample_indices that doesn't make the cut of the sample_query/surveillance_use_only/unrestricted_use_only is listed in a warning (if possible with an explanation of why it failed but that may be harder to produce).

leehart · 2025-06-10T13:52:06Z

Thanks @jonbrenas. I think I understand some of your concerns, but I reckon I still need some clarification.

I guess it might boil down to how the users are (or should be) determining the sample_indices to pass into these functions in the first place. One assumption I had was that the full list of sample indices would be aligned with the results of the sample_metadata without any kind of query. So for an object with no surveillance_use_only setting, the full list for a particular sample set would be given by something like:

samples_df = ag3_default.sample_metadata(sample_sets=sample_set)

Then it would seem natural to the user to expect the same behaviour for an object with surveillance_use_only set to True, e.g.

samples_df = ag3_surveillance.sample_metadata(sample_sets=sample_set)

In this context, the user is not explicitly passing any sample_query to the function, regardless of whether we inject is_surveillance == True into sample_query behind the scenes. In other words, as far as the user is concerned, there is no sample_query, so they have no obvious reason to expect that sample_indices cannot or should not be used. (I don't personally like the hijacking of sample_query for this purpose, but that's already water under the bridge.)

However, if the user is getting their sample_indices from some other source of truth, besides sample_metadata, then I can see how things can go wrong. For instance, sample_indices of [0, 1, 2, 3] in the example above is going to return different samples for ag3_default.sample_metadata(sample_sets=sample_set) compared to ag3_surveillance.sample_metadata(sample_sets=sample_set). However, I see this as similar to the misalignment that we might see from sample_indices for ag3_default.sample_metadata(sample_sets=sample_set) not aligning with the same samples from ag3_default.sample_metadata(), for example. That's an explainable mistake or misuse.

As far as I can tell, these sample_indices are meant to be used in a similar context to how df_samples = df_samples.iloc[sample_indices] is used within sample_metadata, and the documentation clearly says:

A list of indices of samples to select, corresponding to the order in which the samples are found within the sample metadata. Either provide this parameter or sample_query, not both.

Banning the use of sample_indices wherever we have surveillance_use_only switched on would mean that we'd effectively have different function signatures depending on different constructor params, which seems to confuse matters and deprives the user of previously available features. But it also seems unnecessary, since it looks like we can restore this functionality (which is only temporarily lost in this PR) with a few tweaks to the internal logic, as demonstrated in the updated sample_metadata function in this PR, which I believe can be also achieved for the other affected functions, as detailed above.

I'm not sure I understand the solution regarding min_cohort_size. Sample ids are not explicitly determined by sample_indices, which is part of the complication here. sample_indices only identify samples with reference to some kind of original full list of samples, represented by df_samples in the df_samples.iloc[sample_indices] example. As with the existing sample_metadata function and the documentation for the sample_indices param, it seems that the user-level expectation is that the full list of samples, upon which sample_indices must be based, ought to come from the "sample metadata", which would naturally seem to be functionally equivalent to (or semantically synonymous with) the sample_metadata function, depending on its parameters (such as sample_sets) or the higher-level constructor params (such as surveillance_use_only).

In any case, it would be good to have a clear understanding and position on this. Ideally, I think we want to preserve existing functionality and expectations, and minimise potential confusion and complexity, while keeping the user experience as unsurprising and intuitive as possible.

I feel like I'm missing something or overlooking something important here!

jonbrenas · 2025-06-10T15:42:48Z

Thanks @leehart.

Sorry, I had misinterpreted the way these parameters were going to be used and, as a result, my comment didn't make much sense. I thought that unrestricted_use_only and surveillance_use_only could be used when the AnophelesBase was defined (which is the case) and/or when a function is called (which is not the case). My objection was only applicable in the second case (e.g., I have run a PCA analysis with 100 samples chosen at random and I want to do it again with the same samples but only if they are surveillance in which case I would have asked for surveillance_use_only and sample_indices when calling pca). In that case, one would have to use a convoluted query with Ag3 defined using surveillance_use_only. My bad!

leehart · 2025-06-10T16:00:54Z

Ah, I see, thanks @jonbrenas . Yeah, these params are only going to be used (AFAIK) during object construction, e.g. ag3_unrestricted_surveillance = malariagen_data.Ag3(unrestricted_use_only=True, surveillance_use_only=True). I hadn't even considered these params being used for individual functions... Interesting / mind-boggling! No worries, I'll continue on this path then, unless there are other objections. Thanks again for your thoughts on this, which made me think more deeply about the problem and consider other angles. 👍

leehart · 2025-06-17T09:46:21Z

OK, I think I've sorted out the sample_indices support now.

snp_genotypes was a little tricky, and the solution here might be a little fragile since it relies (in this PR) on the unfiltered general_metadata, which we should probably make private using an underscore prefix, along with a bunch of other functions, in a follow-up PR. (Issue #796.)

It seems good to at least have an internal method of getting an unfiltered list of samples though, e.g. via _general_metadata, for cases such as this, e.g. aligning with unfiltered unlabelled SNP data.

leehart · 2025-06-17T10:03:25Z

Investigating CI test failures, which didn't fail locally:

=========================== short test summary info ============================
FAILED tests/anoph/test_cnv_frq.py::test_gene_cnv_frequencies_advanced_with_period_by[af1_sim-year] - AssertionError: 
Arrays are not equal

Mismatched elements: 2 / 20 (10%)
 ACTUAL: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
 DESIRED: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'amp', 'del', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
FAILED tests/anoph/test_cnv_frq.py::test_gene_cnv_frequencies_advanced_with_sample_query[af1_sim] - AssertionError: 
Arrays are not equal

Mismatched elements: 2 / 18 (11.1%)
 ACTUAL: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'amp', 'del',
       'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp'],
      dtype=object)
 DESIRED: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'amp', 'amp',
       'del', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp'],
      dtype=object)
FAILED tests/anoph/test_cnv_frq.py::test_gene_cnv_frequencies_advanced_with_min_cohort_size[af1_sim-0] - AssertionError: 
Arrays are not equal

Mismatched elements: 2 / 20 (10%)
 ACTUAL: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
 DESIRED: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'amp', 'del', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
FAILED tests/anoph/test_cnv_frq.py::test_gene_cnv_frequencies_advanced_with_min_cohort_size[af1_sim-10] - AssertionError: 
Arrays are not equal

Mismatched elements: 2 / 20 (10%)
 ACTUAL: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
 DESIRED: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'amp', 'del', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
FAILED tests/anoph/test_cnv_frq.py::test_gene_cnv_frequencies_advanced_with_nobs_mode[af1_sim-fixed] - AssertionError: 
Arrays are not equal

Mismatched elements: 2 / 20 (10%)
 ACTUAL: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
 DESIRED: array(['amp', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del', 'amp',
       'amp', 'del', 'del', 'amp', 'del', 'amp', 'del', 'amp', 'del',
       'amp', 'del'], dtype=object)
=========== 5 failed, 693 passed, 125 warnings in 857.50s (0:14:17) ============
Error: Process completed with exit code 1.

[ It looks like that was one of those transient random failures, due to simulated test data. Resolved by re-running. ]

leehart · 2025-06-20T08:43:57Z

@jonbrenas Just to note that this code hasn't been reviewed or merged in yet, so work on the advanced course might need to be provisional or require changes if something here needs to change. But I expect the general behaviour will be the same.

jonbrenas · 2025-06-20T08:51:53Z

Thanks @leehart. Yes, the work on the advanced course is going to be provisional, it is also going to be mostly textual (i.e., explaining why some sample sets are restricted, why some samples might be biased in a way that means they are not to be used for surveillance, ...).

I don't remember if we had agreed that we would meet for you to show what has changed how and why or if you thought that I could figure it out my own. If the latter, I'll get to reviewing this PR asap.

leehart · 2025-06-20T09:29:07Z

Hi @jonbrenas . Maybe take a look and take some notes, and then we can meet to discuss, if you like. There's quite a bit, so we might need to do a few rounds. Or you might think it all looks OK! 🚀

leehart · 2025-08-08T08:50:04Z

@jonbrenas Let's try to meet to go through this PR in September.

jonbrenas

It looked good to me last time I checked and the additional tests I ran all worked fine.

leehart · 2025-11-25T09:33:15Z

Also approved by @ahernank . Merging in now.

WIP: dev support for unrestricted_use_only, surveillance_use_only

fc6c2ba

leehart changed the title ~~Support unrestricted_use_only and surveillance_use_only~~ Support unrestricted_use_only and surveillance_use_only constructor params Feb 14, 2025

leehart added 4 commits February 18, 2025 14:40

Add test sample sets for Af1 with unrestricted_use_only. Add relevant…

dfdd4e2

…_releases property.

Update comment re skipping test due to lack of relevant fixtures

1b6cbb9

Add surveillance flags to sample_metadata(). Add tests.

02921c9

Merge branch 'master' into GH716_add_constructor_params

d4af40a

leehart marked this pull request as ready for review February 20, 2025 15:36

leehart requested review from ahernank, alimanfoo, cclarkson and jonbrenas February 20, 2025 15:43

leehart marked this pull request as draft March 18, 2025 09:23

leehart added 3 commits March 18, 2025 17:32

Merge branch 'master' into GH716_add_constructor_params

d4e7e70

WIP: add _prep_sample_query_param() stub where _prep_sample_set_param()

19902b0

Add logic to _prep_sample_query_param() to honour self._surveillance_…

d7b8383

…use_only

leehart added 4 commits March 21, 2025 16:25

Merge branch 'master' into GH716_add_constructor_params

de0daf8

Allow _prep_sample_query_param() to return None

435e8a7

Return consistent data type from _prep_sample_query_param()

bde3d4e

Merge branch 'master' into GH716_add_constructor_params

bfed3f4

Merge branch 'master' into GH716_add_constructor_params

78d26d1

leehart changed the title ~~Support unrestricted_use_only and surveillance_use_only constructor params~~ Add unrestricted_use_only and surveillance_use_only constructor params Apr 24, 2025

Merge branch 'master' into GH716_add_constructor_params

6396126

Use python engine for sample_query to support extension dtypes

1177e63

leehart marked this pull request as draft June 10, 2025 10:07

leehart added 2 commits June 10, 2025 12:05

Add validate_sample_selection_params to funcs with sample_query, samp…

d9ba5eb

…le_indices. Add sample_query_options to biallelic_snps_to_plink.

Amend sample_metadata to allow sample_indices when surveillance_use_only

9b7d6cc

leehart added 2 commits June 13, 2025 17:46

WIP: handle sample_indices when surveillance_use_only

6c4e74f

Amend snp_genotypes to handle sample_indices when surveillance_use_only

4b9904f

leehart marked this pull request as ready for review June 17, 2025 09:46

Merge branch 'master' into GH716_add_constructor_params

0f20d3a

leehart removed the request for review from alimanfoo August 8, 2025 08:48

Merge branch 'master' into GH716_add_constructor_params

dad61a6

jonbrenas approved these changes Nov 11, 2025

View reviewed changes

leehart mentioned this pull request Nov 20, 2025

Adir1.x #795

Merged

leehart merged commit 03a6d5d into master Nov 25, 2025
10 checks passed

leehart deleted the GH716_add_constructor_params branch November 25, 2025 09:33

Conversation

leehart commented Feb 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leehart commented Feb 20, 2025

Uh oh!

leehart commented Feb 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leehart commented Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leehart commented Apr 8, 2025

Uh oh!

leehart commented Apr 8, 2025

Uh oh!

leehart commented Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leehart commented Apr 24, 2025

Uh oh!

jonbrenas commented Apr 24, 2025

Uh oh!

leehart commented Jun 5, 2025

Uh oh!

leehart commented Jun 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leehart commented Jun 5, 2025

Uh oh!

leehart commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What is the sample_indices parameter used for?

What is the problem?

What is the proposed solution in principle?

Uh oh!

leehart commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonbrenas commented Jun 10, 2025

Uh oh!

leehart commented Jun 10, 2025

Uh oh!

jonbrenas commented Jun 10, 2025

Uh oh!

leehart commented Jun 10, 2025

Uh oh!

leehart commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leehart commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leehart commented Jun 20, 2025

Uh oh!

jonbrenas commented Jun 20, 2025

Uh oh!

leehart commented Jun 20, 2025

Uh oh!

leehart commented Aug 8, 2025

Uh oh!

jonbrenas left a comment

Choose a reason for hiding this comment

Uh oh!

leehart commented Nov 25, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

leehart commented Feb 7, 2025 •

edited

Loading

leehart commented Feb 25, 2025 •

edited

Loading

leehart commented Mar 21, 2025 •

edited

Loading

leehart commented Apr 8, 2025 •

edited

Loading

leehart commented Jun 5, 2025 •

edited

Loading

leehart commented Jun 10, 2025 •

edited

Loading

What is the `sample_indices` parameter used for?

leehart commented Jun 10, 2025 •

edited

Loading

leehart commented Jun 17, 2025 •

edited

Loading

leehart commented Jun 17, 2025 •

edited

Loading