`gene_cnv_frequencies` - fixes pandas fragment warning by cclarkson · Pull Request #68 · malariagen/malariagen-data-python

cclarkson · 2021-10-07T16:58:56Z

resolves #67

Calling df.copy() on the dataframe at each iteration consolidates it and stops the pandas fragmentation PerformanceWarning.
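A minimal sketch of the behaviour described here, assuming a pandas version that emits the "highly fragmented" PerformanceWarning once a frame has accumulated many single-column blocks. The helper name `count_fragmentation_warnings` and the toy column names are hypothetical and not part of the PR.

```python
import warnings

import numpy as np
import pandas as pd


def count_fragmentation_warnings(n_cols, copy_each_iteration):
    """Insert n_cols columns one at a time and count PerformanceWarnings."""
    df = pd.DataFrame({"ID": np.arange(5)})
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        for i in range(n_cols):
            if copy_each_iteration:
                # a deep copy consolidates the internal blocks
                df = df.copy()
            df[f"col_{i}"] = np.zeros(len(df))
    return sum(
        issubclass(w.category, pd.errors.PerformanceWarning) for w in caught
    )


# compare the two variants; the no-copy count depends on the pandas version
n_without_copy = count_fragmentation_warnings(150, copy_each_iteration=False)
n_with_copy = count_fragmentation_warnings(150, copy_each_iteration=True)
print(n_without_copy, n_with_copy)
```

The copy keeps the warning away but pays for a full dataframe copy on every iteration, which is the cost the review comments that follow try to reduce.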

alimanfoo · 2021-10-11T22:24:42Z

Hi @cclarkson, suggestion to resolve this...

Instead of this block of code:

        # compute cohort frequencies
        for coh, loc_samples in coh_dict.items():
            df = df.copy()
            n_samples = np.count_nonzero(loc_samples)
            if n_samples == 0:
                raise ValueError(f"no samples for cohort {coh!r}")
            if n_samples < min_cohort_size:
                df[f"{coh}_amp"] = np.nan
                df[f"{coh}_del"] = np.nan
            else:
                is_amp_coh = np.compress(loc_samples, is_amp, axis=1)
                is_del_coh = np.compress(loc_samples, is_del, axis=1)
                amp_count_coh = np.sum(is_amp_coh, axis=1)
                del_count_coh = np.sum(is_del_coh, axis=1)
                amp_freq_coh = amp_count_coh / n_samples
                del_freq_coh = del_count_coh / n_samples
                df[f"{coh}_amp"] = amp_freq_coh
                df[f"{coh}_del"] = del_freq_coh

...store new columns in a dict, then use pd.concat, e.g.:

        # compute cohort frequencies
        freq_cols = dict()
        for coh, loc_samples in coh_dict.items():
            n_samples = np.count_nonzero(loc_samples)
            if n_samples == 0:
                raise ValueError(f"no samples for cohort {coh!r}")
            if n_samples < min_cohort_size:
                freq_cols[f"{coh}_amp"] = np.nan
                freq_cols[f"{coh}_del"] = np.nan
            else:
                is_amp_coh = np.compress(loc_samples, is_amp, axis=1)
                is_del_coh = np.compress(loc_samples, is_del, axis=1)
                amp_count_coh = np.sum(is_amp_coh, axis=1)
                del_count_coh = np.sum(is_del_coh, axis=1)
                amp_freq_coh = amp_count_coh / n_samples
                del_freq_coh = del_count_coh / n_samples
                freq_cols[f"{coh}_amp"] = amp_freq_coh
                freq_cols[f"{coh}_del"] = del_freq_coh

        # build a dataframe with the frequency columns
        df_freqs = pd.DataFrame(freq_cols)

        # build the final dataframe
        df = pd.concat([df, df_freqs], axis=1)
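The suggestion above can be tried on toy data as follows. This is a sketch under stated assumptions, not the code in `ag3.py`: the function name `add_cohort_frequencies`, the toy arrays and the cohort labels are hypothetical. Passing `index=df.index` to the constructor is also an addition of this sketch; it keeps the rows aligned for `concat` and lets the scalar NaN entries broadcast even when every cohort is below `min_cohort_size` (a dict of only scalars with no index raises a ValueError in `pd.DataFrame`).

```python
import numpy as np
import pandas as pd


def add_cohort_frequencies(df, is_amp, is_del, coh_dict, min_cohort_size):
    """Collect frequency columns in a dict, then concatenate once."""
    freq_cols = dict()
    for coh, loc_samples in coh_dict.items():
        n_samples = np.count_nonzero(loc_samples)
        if n_samples == 0:
            raise ValueError(f"no samples for cohort {coh!r}")
        if n_samples < min_cohort_size:
            # scalars are broadcast to full columns by the constructor below
            freq_cols[f"{coh}_amp"] = np.nan
            freq_cols[f"{coh}_del"] = np.nan
        else:
            is_amp_coh = np.compress(loc_samples, is_amp, axis=1)
            is_del_coh = np.compress(loc_samples, is_del, axis=1)
            freq_cols[f"{coh}_amp"] = np.sum(is_amp_coh, axis=1) / n_samples
            freq_cols[f"{coh}_del"] = np.sum(is_del_coh, axis=1) / n_samples

    # assumption of this sketch: reuse df's row index for the new columns
    df_freqs = pd.DataFrame(freq_cols, index=df.index)
    return pd.concat([df, df_freqs], axis=1)


# toy inputs: 3 genes x 4 samples (hypothetical values)
df_genes = pd.DataFrame({"ID": ["g1", "g2", "g3"]})
is_amp = np.array(
    [[1, 1, 0, 0], [0, 0, 0, 0], [1, 0, 1, 1]], dtype=bool
)
is_del = np.array(
    [[0, 0, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]], dtype=bool
)
coh_dict = {
    "a": np.array([True, True, False, False]),   # 2 samples
    "b": np.array([False, False, True, False]),  # 1 sample, below threshold
}
result = add_cohort_frequencies(
    df_genes, is_amp, is_del, coh_dict, min_cohort_size=2
)
print(result)
```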

cclarkson · 2021-10-13T19:42:44Z

@alimanfoo - works with a dict now, should be good to go.

alimanfoo

Hi @cclarkson, this looks good.

I added a couple of comments about possible ways to reduce the number of times dataframes are copied. Because the dataframe is likely small here, copies are not necessarily a problem, but it's probably good practice to think about when copies are made and whether they are necessary; that could stand us in good stead if/when we have to handle larger dataframes. Up to you how to address; merge when you feel ready.

alimanfoo · 2021-10-13T21:04:09Z

malariagen_data/ag3.py

+                freq_cols[f"{coh}_del"] = del_freq_coh
+
+        # build a dataframe with the frequency columns
+        df_freqs = pandas.DataFrame.from_dict(freq_cols)


Nit, I don't think from_dict is necessary, can just use DataFrame constructor.
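As a quick check of this nit, a small sketch with hypothetical column names and values: for a dict mapping column names to arrays, `DataFrame.from_dict` with its default orientation and the plain constructor should build the same frame.

```python
import numpy as np
import pandas as pd

# hypothetical frequency columns keyed by column name
freq_cols = {
    "coh_amp": np.array([0.5, 0.0]),
    "coh_del": np.array([0.0, 0.25]),
}

via_from_dict = pd.DataFrame.from_dict(freq_cols)
via_constructor = pd.DataFrame(freq_cols)

# both routes yield identical columns, dtypes and values
same = via_from_dict.equals(via_constructor)
print(same)
```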

alimanfoo · 2021-10-13T21:05:38Z

malariagen_data/ag3.py


        # set gene ID as index for convenience
-        df.set_index("ID", inplace=True)
+        df = df.set_index("ID")


This creates a copy of the dataframe. Here the dataframe is likely to be small and so extra copies probably don't hurt, but it seems unnecessary. Using inplace=True avoids the dataframe copy.

alimanfoo · 2021-10-13T21:07:47Z

malariagen_data/ag3.py

+        df = df.reset_index(drop=True)
+        df = pandas.concat([df, df_freqs], axis=1)


Suggested change

-        df = df.reset_index(drop=True)
-        df = pandas.concat([df, df_freqs], axis=1)
+        df = pandas.concat([df, df_freqs], axis=1, ignore_index=True)

Calling reset_index causes a dataframe copy. As with the set_index comment, copying a relatively small dataframe probably doesn't hurt, but in this case it could be avoided by using concat with ignore_index=True, which will cause the columns to just get stacked up without trying to do any kind of join.
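One thing worth checking before adopting the suggestion above: in pandas, `ignore_index` applies to the concatenation axis, so with `axis=1` it relabels the columns as 0, 1, 2, ... while the rows are still aligned on the row index. The toy frames below (hypothetical names and values) illustrate this. The last variant, building the frequency frame on `df.index`, is an assumption of this sketch and not something the PR did.

```python
import numpy as np
import pandas as pd

# hypothetical gene table with a non-default row index
df = pd.DataFrame({"ID": ["g1", "g2"], "Name": ["a", "b"]}, index=[10, 11])
freq_cols = {
    "coh_amp": np.array([0.5, 0.0]),
    "coh_del": np.array([0.0, 0.25]),
}
df_freqs = pd.DataFrame(freq_cols)

# ignore_index=True with axis=1 discards the column labels;
# rows are still joined on the row index (outer join by default)
stacked = pd.concat([df, df_freqs], axis=1, ignore_index=True)

# the approach in the diff: reset the row index first, then concatenate
reset = pd.concat([df.reset_index(drop=True), df_freqs], axis=1)

# possible alternative: give the new columns the same row index up front
aligned = pd.concat([df, pd.DataFrame(freq_cols, index=df.index)], axis=1)

print(list(stacked.columns), stacked.shape)
print(list(reset.columns), reset.shape)
print(list(aligned.columns), aligned.shape)
```

Only the second and third variants keep the column names and the expected row count when the two frames carry different row indexes.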

fixes pandas fragment warning

bbfbe4f

cclarkson requested a review from alimanfoo October 7, 2021 16:58

fixes panda warnings using dict

15cc169

alimanfoo approved these changes Oct 13, 2021

Alistairs inplace suggestions

ca1e601

cclarkson merged commit 9d9eb4c into master Oct 14, 2021

cclarkson deleted the 0710-cc-67-panda-problems branch October 14, 2021 10:15

alimanfoo added the BMGF-001927 label (Work supported by BMGF grant INV-001927, MalariaGEN 2019-2024) Dec 4, 2024

cclarkson merged 3 commits into master from 0710-cc-67-panda-problems