-
-
Notifications
You must be signed in to change notification settings - Fork 2k
Indexing system for Table #3915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This is very exciting! I will try and take a look. |
e62df42 to
53a171f
Compare
astropy/table/column.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there is a little problem here... In #3930 (and #3929 which it superceded) we decided to remove __getitem__ from the BaseColumn class because it slows single item access (and hence iterating over column values) by more than a factor of 10. This is a fundamentally due to having to call into the Python layer instead of directly to the C version from ndarray.
I don't see any obvious way to get around this. What do you think @mdboom ? If not, we may need to just lose indices whenever you do indexing (slice, fancy indexing, whatever) on a Column. We can still do this at the Table level, which is a more likely case where you really want the indices to follow along.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think let's just preserve them at the table level and see whether anyone notices that the aren't preserved when slicing columns...
|
This looks like a problem: |
astropy/table/index.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the name, I would have thought that this prevents the updating of indices as the values change (and then recalculates the index when the with block is exited). Is that something that would be hard to implement? The use case being "I need to add 1000 rows to a table, and don't want the index recalculated with each addition".
Even if we don't implement what I just described, maybe we should change the name of this to avoid confusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this context just prevents indices from being copied upon creation of a new Table (e.g. Table(t), t[1:3], etc.) You're right that the name probably isn't appropriate, and actually this is somewhat of a hack to improve the performance of group_by (which takes many slices). I'm not sure if there are other use cases, but if we keep it, maybe another solution would be to include a parameter signaling whether to avoid updating indices completely or just to avoid copying indices? Something like:
with index_mode('modify'):
# add rows to a table without index changes
with index_mode('copy'):
# take slices/boolean masks of a table without index copying
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like what you suggest -- it will allow us to have other indexing contexts in the future...
8749aba to
ab6bf80
Compare
|
Does anyone have any thoughts about the best way to support indices for mixin columns? The issue is that indices are stored and managed in the |
|
The first thought that comes to mind for mixins is to put the indices into the |
0ab9e2e to
63a43d1
Compare
|
Makes sense to me, although the real issue is that indices have to be updated every time a column is modified. For example, |
|
@mhvk - this table indexing PR is touching on the whole mixin concept and you might have some ideas. See the comments above about indexes and mixins. One idea is to use some kind of mixin class (here "mixin" in the traditional sense) like [EDIT: @mdmueller - for the canonical case of |
|
Was just going to comment we could have the mixin on the |
81935c2 to
37c8f38
Compare
|
What about naming the |
|
Sounds good -- I'll update the names. |
|
Can a context manager temporarily override where |
|
Ah, that does seem like a nice idea for convenience. It should actually be pretty simple to override, e.g. |
bec2020 to
b7b2d9a
Compare
7236155 to
6a1086e
Compare
6a1086e to
66feeed
Compare
|
@mdmueller - it looks like there is a doctest that needs skipping on Windows (see the AppVeyor 2.6 failure) |
|
@mdmueller - instead of skipping the doctest you can add the type (and check that all your examples specify int64 where needed). |
|
Closed by #4202. |
Column.__getitem__ speedup branch of astropy#4075, and made adjustments to take advantage of it in the indices code. This included getting rid of the `def get_item(...)` hacks, which I think are no longer necessary--the base Column.__getitem__ is now almost exactly as fast as the stock implementation in the basic case of single item scalar access. In the process I also refactored the Cython code a bit so that it could be less repetitive. For now, tests that failed due to the change in behavior (indicies now *are* copied by default when slicing) are commented out; but all other tests pass. It should also that the 'copy_on_getitem' index mode is now a no-op since that's the default behavior. We could still change this so that it's not the default behavior, if we prefer, or add a no_copy_on_getitem mode instead.
Column.__getitem__ speedup branch of astropy#4075, and made adjustments to take advantage of it in the indices code. This included getting rid of the `def get_item(...)` hacks, which I think are no longer necessary--the base Column.__getitem__ is now almost exactly as fast as the stock implementation in the basic case of single item scalar access. In the process I also refactored the Cython code a bit so that it could be less repetitive. For now, tests that failed due to the change in behavior (indicies now *are* copied by default when slicing) are commented out; but all other tests pass. It should also that the 'copy_on_getitem' index mode is now a no-op since that's the default behavior. We could still change this so that it's not the default behavior, if we prefer, or add a no_copy_on_getitem mode instead.
Column.__getitem__ speedup branch of astropy#4075, and made adjustments to take advantage of it in the indices code. This included getting rid of the `def get_item(...)` hacks, which I think are no longer necessary--the base Column.__getitem__ is now almost exactly as fast as the stock implementation in the basic case of single item scalar access. In the process I also refactored the Cython code a bit so that it could be less repetitive. For now, tests that failed due to the change in behavior (indicies now *are* copied by default when slicing) are commented out; but all other tests pass. It should also that the 'copy_on_getitem' index mode is now a no-op since that's the default behavior. We could still change this so that it's not the default behavior, if we prefer, or add a no_copy_on_getitem mode instead.
This is a first pass at a new system allowing the
Tableclass to hold indices, both for improved grouping performance and for searching by column values. Indices are stored in columns rather than the full table, and indices in turn store a reference to the column(s) they act on, so that composite indices work correctly. Some parts of this PR are final and some are still unfinished, so any comments/suggestions would be great.Complete
Indexclass, which provides a wrapper around the sorting engineColumn, which ensure that indices are modified as appropriate when aColumnis modifiedTableUnfinished
TheTable.where()method, particularly finishing up the searching syntaxThe actual sorting engine. I have bintrees (which provides fast binary search tree and red-black tree implementations) as an optional dependency, which is replaced by a pure-Python binary search tree class if bintrees isn't present. I also began work on a pure-Python red-black tree and a sorted array class, both unfinished.Experimental parts of this PR have been removed.
@mdboom @taldcroft