-
-
Notifications
You must be signed in to change notification settings - Fork 2k
Simplify and speed up Table replace_column() #8902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@taldcroft - let me know when this is rebased and I'll have a look. |
|
@taldcroft - #8789 being closed, could you please rebase? |
|
BTW, there is at least one little issue with this that needs fixing before I push up a rebased version. |
f21c553 to
69b3b26
Compare
|
This is rebased now. Some of the infrastructure change is anticipating work on Still needs a test on the new option to replace a column without copying. |
astrofrog
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, and very happy about this performance improvement! Just a few small comments below, and as you said a test should be added for replacing without copying.
| from . import conf | ||
|
|
||
|
|
||
| _implementation_notes = """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to check, what is the motivation for this as opposed to comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reminds me of your (@astrofrog's) old suggestion of using the module docstring for information for developers. I liked the idea then, and I think this note shows that it certainly makes sense to have some general mechanism for doing that.
@taldcroft - I recall that at the time you felt that information for developers should also be in the documentation - I think there is certainly an argument for that too, but maybe your choice here suggests that having something inside the module is the easier route after all.
... ... ...
Actually, I don't recall quite right - it is issue #4821 that suggested adding a "README" type thingie.
For now, my sense is to leave things as they are here, but open a new issue - I would love for other modules to have similar implementation notes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #8930 for a place to discuss
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation was to remind myself upfront of global gotchas that could be missed if buried in the code. Of course this could be a comment block. Hopefully before release this will be in some better form.
mhvk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@taldcroft - this is probably the most nitpicky review I've written (well, in a while anyway). Really, this all looks good so I'm approving as my comments are all suggestions only, and @astrofrog already covered the things that would be good to clarify.
|
p.s. The failure looks real. |
|
BTW, the real travis failure before was due to a table feature I didn't know about: if a table has just one column then you can replace that with a new column of different length. My new implementation did not originally allow that, but it does now. |
|
I think I have addressed all comments, thanks for the reviews! Very helpful and the code is better now. |
|
And BTW, the failed test where I learned about the length-changing feature was not intended to change the table length, I think that was just a mistake. But good thing for this! |
|
Tests finally passing... |
| raise ValueError('column name {0} is not in the table'.format(name)) | ||
|
|
||
| if self[name].info.indices: | ||
| raise ValueError('cannot replace a table index column') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@taldcroft - Do you think that this check is still needed ? Removing it would allow to fix #8077, and it seems to work if I remove it but I have not checked in detail that the output file is correct.
And btw thanks for these PRs which looks a great improvement for masked tables !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one issue here is if a column is part of an index with another column, then you can corrupt that index. Indices in Table was done by a GSoC student and I'll confess to not understanding the details all that well, so my first reaction is always to leave things alone.
Maybe when replacing an index column then just drop the indices that use it? Users should not be surprised that indices go away if they replace the column. Thoughts @saimn @astrofrog ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah indeed the multi-column case needs a special treatment. I think it's fine to drop the index in this case, this feature is probably not much used anyway (it is not possible to use a mlti-column index with .loc).
This builds on #8789 to simplify and speed up replacing a column. The speed up is about a factor of 8 for a simple table with 20 columns (e.g. from 420 us to 54 us).
Along the way it adds a new internal method that factors out the whole processing of turning arbitrary sequence data into something suitable for inclusion in the columns dict. This is independently a good thing and will see more use beyond this PR.
Once #8789 is merged I will rebase and the diffs here will clean up.
It will also supercede #8702 as a fix for #8642.
Closes #8702
Fixes #8642