Skip to content

gene_cnv_frequencies - fixes pandas fragment warning#68

Merged
cclarkson merged 3 commits intomasterfrom
0710-cc-67-panda-problems
Oct 14, 2021
Merged

gene_cnv_frequencies - fixes pandas fragment warning#68
cclarkson merged 3 commits intomasterfrom
0710-cc-67-panda-problems

Conversation

@cclarkson
Copy link
Copy Markdown
Collaborator

resolves #67

Making a df.copy() of the dataframe each iteration consolidates the df and stops the warning.

@cclarkson cclarkson requested a review from alimanfoo October 7, 2021 16:58
@alimanfoo
Copy link
Copy Markdown
Member

Hi @cclarkson, suggestion to resolve this...

Instead of this block of code:

        # compute cohort frequencies
        for coh, loc_samples in coh_dict.items():
            df = df.copy()
            n_samples = np.count_nonzero(loc_samples)
            if n_samples == 0:
                raise ValueError(f"no samples for cohort {coh!r}")
            if n_samples < min_cohort_size:
                df[f"{coh}_amp"] = np.nan
                df[f"{coh}_del"] = np.nan
            else:
                is_amp_coh = np.compress(loc_samples, is_amp, axis=1)
                is_del_coh = np.compress(loc_samples, is_del, axis=1)
                amp_count_coh = np.sum(is_amp_coh, axis=1)
                del_count_coh = np.sum(is_del_coh, axis=1)
                amp_freq_coh = amp_count_coh / n_samples
                del_freq_coh = del_count_coh / n_samples
                df[f"{coh}_amp"] = amp_freq_coh
                df[f"{coh}_del"] = del_freq_coh

...store new columns in a dict, then use pd.concat, e.g.:

        # compute cohort frequencies
        freq_cols = dict()
        for coh, loc_samples in coh_dict.items():
            n_samples = np.count_nonzero(loc_samples)
            if n_samples == 0:
                raise ValueError(f"no samples for cohort {coh!r}")
            if n_samples < min_cohort_size:
                freq_cols[f"{coh}_amp"] = np.nan
                freq_cols[f"{coh}_del"] = np.nan
            else:
                is_amp_coh = np.compress(loc_samples, is_amp, axis=1)
                is_del_coh = np.compress(loc_samples, is_del, axis=1)
                amp_count_coh = np.sum(is_amp_coh, axis=1)
                del_count_coh = np.sum(is_del_coh, axis=1)
                amp_freq_coh = amp_count_coh / n_samples
                del_freq_coh = del_count_coh / n_samples
                freq_cols[f"{coh}_amp"] = amp_freq_coh
                freq_cols[f"{coh}_del"] = del_freq_coh

        # build a dataframe with the frequency columns
        df_freqs = pd.DataFrame(freq_cols)

        # build the final dataframe
        df = pd.concat([df, df_freqs], axis=1)

@cclarkson
Copy link
Copy Markdown
Collaborator Author

@alimanfoo - works with a dict now, should be good to go.

Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cclarkson, this looks good.

I added a couple of comments about possible ways to reduce the number of times dataframes are copied. Because the dataframe is likely small here then copies are not necessarily a problem, but it's probably not a bad practice to be thinking about when copies are made and if they are necessary or not - could stand us in good stead if/when we have to handle larger dataframes. Up to you how to address, merge when you feel ready.

freq_cols[f"{coh}_del"] = del_freq_coh

# build a dataframe with the frequency columns
df_freqs = pandas.DataFrame.from_dict(freq_cols)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, I don't think from_dict is necessary, can just use DataFrame constructor.


# set gene ID as index for convenience
df.set_index("ID", inplace=True)
df = df.set_index("ID")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a copy of the dataframe. Here the dataframe is likely to be small and so extra copies probably don't hurt, but it seems unnecessary. Using inplace=True avoids the dataframe copy.

Comment on lines +1763 to +1764
df = df.reset_index(drop=True)
df = pandas.concat([df, df_freqs], axis=1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
df = df.reset_index(drop=True)
df = pandas.concat([df, df_freqs], axis=1)
df = pandas.concat([df, df_freqs], axis=1, ignore_index=True)

Calling reset_index causes a dataframe copy. As with comment below, copying a relatively small dataframe probably doesn't hurt, but in this case could be avoided by using concat with ignore_index=True which will cause the columns to just get stacked up without trying to do any kind of join.

@cclarkson cclarkson merged commit 9d9eb4c into master Oct 14, 2021
@cclarkson cclarkson deleted the 0710-cc-67-panda-problems branch October 14, 2021 10:15
@alimanfoo alimanfoo added the BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024). label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BMGF-001927 Work supported by BMGF grant INV-001927 (MalariaGEN 2019-2024).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gene_cnv_frequencies PerformanceWarning: DataFrame is highly fragmented

2 participants