Skip to content

Conversation

@taldcroft
Copy link
Member

This builds on #8789 to simplify and speed up replacing a column. The speed up is about a factor of 8 for a simple table with 20 columns (e.g. from 420 us to 54 us).

Along the way it adds a new internal method that factors out the whole processing of turning arbitrary sequence data into something suitable for inclusion in the columns dict. This is independently a good thing and will see more use beyond this PR.

Once #8789 is merged I will rebase and the diffs here will clean up.

It will also supercede #8702 as a fix for #8642.

Closes #8702
Fixes #8642

@taldcroft taldcroft changed the title Table replace col Simplify and speed up Table replace_column() Jun 21, 2019
@pllim pllim requested a review from astrofrog June 21, 2019 20:25
@pllim pllim added the table label Jun 21, 2019
@pllim pllim requested a review from mhvk June 21, 2019 20:26
@pllim pllim added this to the v4.0 milestone Jun 21, 2019
@mhvk
Copy link
Contributor

mhvk commented Jun 21, 2019

@taldcroft - let me know when this is rebased and I'll have a look.

@bsipocz
Copy link
Member

bsipocz commented Jun 22, 2019

@taldcroft - #8789 being closed, could you please rebase?

@taldcroft
Copy link
Member Author

BTW, there is at least one little issue with this that needs fixing before I push up a rebased version.

@taldcroft taldcroft force-pushed the table-replace-col branch from f21c553 to 69b3b26 Compare June 24, 2019 00:59
@taldcroft
Copy link
Member Author

This is rebased now. Some of the infrastructure change is anticipating work on add_column to make it possible to build a table one column at a time in O(N) instead of O(N**2). Right now each new column addition re-creates the entire table up to that point.

Still needs a test on the new option to replace a column without copying.

Copy link
Member

@astrofrog astrofrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, and very happy about this performance improvement! Just a few small comments below, and as you said a test should be added for replacing without copying.

from . import conf


_implementation_notes = """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check, what is the motivation for this as opposed to comments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me of your (@astrofrog's) old suggestion of using the module docstring for information for developers. I liked the idea then, and I think this note shows that it certainly makes sense to have some general mechanism for doing that.

@taldcroft - I recall that at the time you felt that information for developers should also be in the documentation - I think there is certainly an argument for that too, but maybe your choice here suggests that having something inside the module is the easier route after all.

... ... ...

Actually, I don't recall quite right - it is issue #4821 that suggested adding a "README" type thingie.

For now, my sense is to leave things as they are here, but open a new issue - I would love for other modules to have similar implementation notes!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #8930 for a place to discuss

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation was to remind myself upfront of global gotchas that could be missed if buried in the code. Of course this could be a comment block. Hopefully before release this will be in some better form.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taldcroft - this is probably the most nitpicky review I've written (well, in a while anyway). Really, this all looks good so I'm approving as my comments are all suggestions only, and @astrofrog already covered the things that would be good to clarify.

@mhvk
Copy link
Contributor

mhvk commented Jun 28, 2019

p.s. The failure looks real.

@taldcroft
Copy link
Member Author

BTW, the real travis failure before was due to a table feature I didn't know about: if a table has just one column then you can replace that with a new column of different length. My new implementation did not originally allow that, but it does now.

@taldcroft
Copy link
Member Author

I think I have addressed all comments, thanks for the reviews! Very helpful and the code is better now.

@taldcroft
Copy link
Member Author

And BTW, the failed test where I learned about the length-changing feature was not intended to change the table length, I think that was just a mistake. But good thing for this!

@taldcroft
Copy link
Member Author

Tests finally passing...

raise ValueError('column name {0} is not in the table'.format(name))

if self[name].info.indices:
raise ValueError('cannot replace a table index column')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taldcroft - Do you think that this check is still needed ? Removing it would allow to fix #8077, and it seems to work if I remove it but I have not checked in detail that the output file is correct.

And btw thanks for these PRs which looks a great improvement for masked tables !

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one issue here is if a column is part of an index with another column, then you can corrupt that index. Indices in Table was done by a GSoC student and I'll confess to not understanding the details all that well, so my first reaction is always to leave things alone.

Maybe when replacing an index column then just drop the indices that use it? Users should not be surprised that indices go away if they replace the column. Thoughts @saimn @astrofrog ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah indeed the multi-column case needs a special treatment. I think it's fine to drop the index in this case, this feature is probably not much used anyway (it is not possible to use a mlti-column index with .loc).

@taldcroft taldcroft deleted the table-replace-col branch October 1, 2019 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug when trying to set the unit on a TimeSeries column

6 participants