Skip to content

Conversation

@mdmueller
Copy link
Contributor

This is a first pass at a new system allowing the Table class to hold indices, both for improved grouping performance and for searching by column values. Indices are stored in columns rather than the full table, and indices in turn store a reference to the column(s) they act on, so that composite indices work correctly. Some parts of this PR are final and some are still unfinished, so any comments/suggestions would be great.

Complete

  • The Index class, which provides a wrapper around the sorting engine
  • Modifications to Column, which ensure that indices are modified as appropriate when a Column is modified
  • Modifications to Table

Unfinished

  • The Table.where() method, particularly finishing up the searching syntax
  • The actual sorting engine. I have bintrees (which provides fast binary search tree and red-black tree implementations) as an optional dependency, which is replaced by a pure-Python binary search tree class if bintrees isn't present. I also began work on a pure-Python red-black tree and a sorted array class, both unfinished.
    Experimental parts of this PR have been removed.

@mdboom @taldcroft

@Cadair
Copy link
Member

Cadair commented Jul 3, 2015

This is very exciting! I will try and take a look.

@mdmueller mdmueller force-pushed the table-indexing branch 2 times, most recently from e62df42 to 53a171f Compare July 13, 2015 20:57
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there is a little problem here... In #3930 (and #3929 which it superceded) we decided to remove __getitem__ from the BaseColumn class because it slows single item access (and hence iterating over column values) by more than a factor of 10. This is a fundamentally due to having to call into the Python layer instead of directly to the C version from ndarray.

I don't see any obvious way to get around this. What do you think @mdboom ? If not, we may need to just lose indices whenever you do indexing (slice, fancy indexing, whatever) on a Column. We can still do this at the Table level, which is a more likely case where you really want the indices to follow along.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think let's just preserve them at the table level and see whether anyone notices that the aren't preserved when slicing columns...

@taldcroft
Copy link
Member

This looks like a problem:

In [40]: t
Out[40]: 
<Table length=3>
  a      b     c  
int32 float32 str1
----- ------- ----
    3     3.0    e
    2     2.0    d
    1     1.0    c

In [44]: t.insert_row(2, (10, 10.0, 'z'))

In [45]: t
Out[45]: 
<Table length=4>
  a      b     c  
int32 float32 str1
----- ------- ----
    3     3.0    e
    2     2.0    d
   10    10.0    z
    1     1.0    c

In [46]: t.indices
Out[46]: [[[ 1  2  3 10], [2 1 0 2]]]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the name, I would have thought that this prevents the updating of indices as the values change (and then recalculates the index when the with block is exited). Is that something that would be hard to implement? The use case being "I need to add 1000 rows to a table, and don't want the index recalculated with each addition".

Even if we don't implement what I just described, maybe we should change the name of this to avoid confusion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this context just prevents indices from being copied upon creation of a new Table (e.g. Table(t), t[1:3], etc.) You're right that the name probably isn't appropriate, and actually this is somewhat of a hack to improve the performance of group_by (which takes many slices). I'm not sure if there are other use cases, but if we keep it, maybe another solution would be to include a parameter signaling whether to avoid updating indices completely or just to avoid copying indices? Something like:

with index_mode('modify'):
   # add rows to a table without index changes
with index_mode('copy'):
   # take slices/boolean masks of a table without index copying

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what you suggest -- it will allow us to have other indexing contexts in the future...

@mdmueller mdmueller force-pushed the table-indexing branch 3 times, most recently from 8749aba to ab6bf80 Compare July 19, 2015 02:15
@mdmueller
Copy link
Contributor Author

Does anyone have any thoughts about the best way to support indices for mixin columns? The issue is that indices are stored and managed in the Column class, so mixin indices will require a workaround--maybe we should store mixin indices directly in Table? I can't think of a clean way to update indices based on mixin column modifications, though, unless we change mixin classes directly.

@taldcroft
Copy link
Member

The first thought that comes to mind for mixins is to put the indices into the info container. That is the new generic easy to handle attributes in a mixin safe way. What do you think?

@mdmueller
Copy link
Contributor Author

Makes sense to me, although the real issue is that indices have to be updated every time a column is modified. For example, t['col'][row] = val should update all indices in t['col']. This functionality is currently built into Column but the fact that mixins can be essentially arbitrary makes it more difficult.

@taldcroft
Copy link
Member

@mhvk - this table indexing PR is touching on the whole mixin concept and you might have some ideas. See the comments above about indexes and mixins. One idea is to use some kind of mixin class (here "mixin" in the traditional sense) like SupportsTableIndexing to indicate that a mixin like Quantity has the machinery in __setitem__ to handle in-place updates. Another idea is a class attribute like supports_table_indexing = True. The latter is perhaps simpler. I haven't tested to see which is faster.

[EDIT: SupportsTableIndexing would be on the info object, not the parent object!]

@mdmueller - for the canonical case of Time objects (for Time series) we are helped by the fact that Time objects are immutable and you cannot set them. For at least this case there is no problem. But of course we will want to index Quantitys.

@mhvk
Copy link
Contributor

mhvk commented Jul 21, 2015

Was just going to comment we could have the mixin on the info object by an attribute, or, perhaps easier, by a subclass that provides the required index -- but I see you edited your comment to say the same thing!

@mdmueller mdmueller force-pushed the table-indexing branch 3 times, most recently from 81935c2 to 37c8f38 Compare July 22, 2015 16:40
@taldcroft
Copy link
Member

What about naming the index_modes as freeze (instead of modify) and discard_on_copy instead of copy. The current names seem to be exactly opposite of what I would expect.

@mdmueller
Copy link
Contributor Author

Sounds good -- I'll update the names.

@taldcroft
Copy link
Member

Can a context manager temporarily override __getitem__ for an instance? I'm thinking:

with col.index_mode('copy_on_getitem'):
    col_new = col[i0:i1]

where __getitem__ is your current getitem?

@mdmueller
Copy link
Contributor Author

Ah, that does seem like a nice idea for convenience. It should actually be pretty simple to override, e.g.

In [1]: class A:
   ...:     def __getitem__(self, item):
   ...:         return item
   ...:         

In [2]: a = A()

In [5]: class context:
    def __init__(self, a):
        self.a = a
    def __enter__(self):
        self.func_copy = self.a.__getitem__
        self.a.__getitem__ = lambda x: x*2
    def __exit__(self, exc_type, exc_value, traceback):
        self.a.__getitem__ = self.func_copy
   ...:         

In [7]: a[5]
5

In [8]: with context(a):
   ...:     print(a[5])
   ...:     
10

In [9]: a[5]
5

@mdmueller mdmueller force-pushed the table-indexing branch 2 times, most recently from bec2020 to b7b2d9a Compare July 28, 2015 21:03
@astrofrog
Copy link
Member

@mdmueller - it looks like there is a doctest that needs skipping on Windows (see the AppVeyor 2.6 failure)

@taldcroft
Copy link
Member

@mdmueller - instead of skipping the doctest you can add the type (and check that all your examples specify int64 where needed).

t = Table([(1, 2, 3, 4), (10, 1, 9, 9)], names=('a', 'b'), dtype=[np.int64, np.int64])

@taldcroft
Copy link
Member

To try finishing this out and to provide a passing base for #4090, I've cloned this PR off to #4202 and added a commit to fix the Windows docstring problem. If that passes then I'll merge #4202.

@taldcroft
Copy link
Member

Closed by #4202.

@taldcroft taldcroft closed this Sep 30, 2015
embray added a commit to embray/astropy that referenced this pull request Sep 30, 2015
Column.__getitem__ speedup branch of astropy#4075, and made adjustments to take
advantage of it in the indices code.  This included getting rid of the
`def get_item(...)` hacks, which I think are no longer necessary--the
base Column.__getitem__ is now almost exactly as fast as the stock
implementation in the basic case of single item scalar access.

In the process I also refactored the Cython code a bit so that it could
be less repetitive.

For now, tests that failed due to the change in behavior (indicies now
*are* copied by default when slicing) are commented out; but all other
tests pass.

It should also that the 'copy_on_getitem' index mode is now a no-op
since that's the default behavior.  We could still change this so that
it's not the default behavior, if we prefer, or add a no_copy_on_getitem
mode instead.
dhomeier pushed a commit to dhomeier/astropy that referenced this pull request Dec 17, 2015
Column.__getitem__ speedup branch of astropy#4075, and made adjustments to take
advantage of it in the indices code.  This included getting rid of the
`def get_item(...)` hacks, which I think are no longer necessary--the
base Column.__getitem__ is now almost exactly as fast as the stock
implementation in the basic case of single item scalar access.

In the process I also refactored the Cython code a bit so that it could
be less repetitive.

For now, tests that failed due to the change in behavior (indicies now
*are* copied by default when slicing) are commented out; but all other
tests pass.

It should also that the 'copy_on_getitem' index mode is now a no-op
since that's the default behavior.  We could still change this so that
it's not the default behavior, if we prefer, or add a no_copy_on_getitem
mode instead.
dhomeier pushed a commit to dhomeier/astropy that referenced this pull request Jun 12, 2016
Column.__getitem__ speedup branch of astropy#4075, and made adjustments to take
advantage of it in the indices code.  This included getting rid of the
`def get_item(...)` hacks, which I think are no longer necessary--the
base Column.__getitem__ is now almost exactly as fast as the stock
implementation in the basic case of single item scalar access.

In the process I also refactored the Cython code a bit so that it could
be less repetitive.

For now, tests that failed due to the change in behavior (indicies now
*are* copied by default when slicing) are commented out; but all other
tests pass.

It should also that the 'copy_on_getitem' index mode is now a no-op
since that's the default behavior.  We could still change this so that
it's not the default behavior, if we prefer, or add a no_copy_on_getitem
mode instead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants