Indexing system for Table #3915

mdmueller · 2015-07-03T16:52:28Z

This is a first pass at a new system allowing the Table class to hold indices, both for improved grouping performance and for searching by column values. Indices are stored in columns rather than the full table, and indices in turn store a reference to the column(s) they act on, so that composite indices work correctly. Some parts of this PR are final and some are still unfinished, so any comments/suggestions would be great.

Complete

The Index class, which provides a wrapper around the sorting engine
Modifications to Column, which ensure that indices are modified as appropriate when a Column is modified
Modifications to Table

Unfinished

~~The Table.where() method, particularly finishing up the searching syntax~~
The actual sorting engine. I have bintrees (which provides fast binary search tree and red-black tree implementations) as an optional dependency, which is replaced by a pure-Python binary search tree class if bintrees isn't present. I also began work on a pure-Python red-black tree and a sorted array class, both unfinished.
Experimental parts of this PR have been removed.

@mdboom @taldcroft

Cadair · 2015-07-03T17:23:11Z

This is very exciting! I will try and take a look.

taldcroft · 2015-07-15T15:35:55Z

astropy/table/column.py

So there is a little problem here... In #3930 (and #3929 which it superceded) we decided to remove __getitem__ from the BaseColumn class because it slows single item access (and hence iterating over column values) by more than a factor of 10. This is a fundamentally due to having to call into the Python layer instead of directly to the C version from ndarray.

I don't see any obvious way to get around this. What do you think @mdboom ? If not, we may need to just lose indices whenever you do indexing (slice, fancy indexing, whatever) on a Column. We can still do this at the Table level, which is a more likely case where you really want the indices to follow along.

Yeah, I think let's just preserve them at the table level and see whether anyone notices that the aren't preserved when slicing columns...

taldcroft · 2015-07-15T17:11:14Z

This looks like a problem:

In [40]: t
Out[40]: 
<Table length=3>
  a      b     c  
int32 float32 str1
----- ------- ----
    3     3.0    e
    2     2.0    d
    1     1.0    c

In [44]: t.insert_row(2, (10, 10.0, 'z'))

In [45]: t
Out[45]: 
<Table length=4>
  a      b     c  
int32 float32 str1
----- ------- ----
    3     3.0    e
    2     2.0    d
   10    10.0    z
    1     1.0    c

In [46]: t.indices
Out[46]: [[[ 1  2  3 10], [2 1 0 2]]]

mdboom · 2015-07-15T18:26:08Z

astropy/table/index.py

Given the name, I would have thought that this prevents the updating of indices as the values change (and then recalculates the index when the with block is exited). Is that something that would be hard to implement? The use case being "I need to add 1000 rows to a table, and don't want the index recalculated with each addition".

Even if we don't implement what I just described, maybe we should change the name of this to avoid confusion?

Actually, this context just prevents indices from being copied upon creation of a new Table (e.g. Table(t), t[1:3], etc.) You're right that the name probably isn't appropriate, and actually this is somewhat of a hack to improve the performance of group_by (which takes many slices). I'm not sure if there are other use cases, but if we keep it, maybe another solution would be to include a parameter signaling whether to avoid updating indices completely or just to avoid copying indices? Something like:

with index_mode('modify'): # add rows to a table without index changes with index_mode('copy'): # take slices/boolean masks of a table without index copying

I like what you suggest -- it will allow us to have other indexing contexts in the future...

mdmueller · 2015-07-19T20:02:23Z

Does anyone have any thoughts about the best way to support indices for mixin columns? The issue is that indices are stored and managed in the Column class, so mixin indices will require a workaround--maybe we should store mixin indices directly in Table? I can't think of a clean way to update indices based on mixin column modifications, though, unless we change mixin classes directly.

taldcroft · 2015-07-21T11:54:47Z

The first thought that comes to mind for mixins is to put the indices into the info container. That is the new generic easy to handle attributes in a mixin safe way. What do you think?

mdmueller · 2015-07-21T20:37:42Z

Makes sense to me, although the real issue is that indices have to be updated every time a column is modified. For example, t['col'][row] = val should update all indices in t['col']. This functionality is currently built into Column but the fact that mixins can be essentially arbitrary makes it more difficult.

taldcroft · 2015-07-21T21:04:49Z

@mhvk - this table indexing PR is touching on the whole mixin concept and you might have some ideas. See the comments above about indexes and mixins. One idea is to use some kind of mixin class (here "mixin" in the traditional sense) like SupportsTableIndexing to indicate that a mixin like Quantity has the machinery in __setitem__ to handle in-place updates. Another idea is a class attribute like supports_table_indexing = True. The latter is perhaps simpler. I haven't tested to see which is faster.

[EDIT: SupportsTableIndexing would be on the info object, not the parent object!]

@mdmueller - for the canonical case of Time objects (for Time series) we are helped by the fact that Time objects are immutable and you cannot set them. For at least this case there is no problem. But of course we will want to index Quantitys.

mhvk · 2015-07-21T23:22:48Z

Was just going to comment we could have the mixin on the info object by an attribute, or, perhaps easier, by a subclass that provides the required index -- but I see you edited your comment to say the same thing!

taldcroft · 2015-07-22T17:04:39Z

What about naming the index_modes as freeze (instead of modify) and discard_on_copy instead of copy. The current names seem to be exactly opposite of what I would expect.

mdmueller · 2015-07-22T17:11:02Z

Sounds good -- I'll update the names.

taldcroft · 2015-07-22T17:19:36Z

Can a context manager temporarily override __getitem__ for an instance? I'm thinking:

with col.index_mode('copy_on_getitem'):
    col_new = col[i0:i1]

where __getitem__ is your current getitem?

mdmueller · 2015-07-22T17:26:07Z

Ah, that does seem like a nice idea for convenience. It should actually be pretty simple to override, e.g.

In [1]: class A:
   ...:     def __getitem__(self, item):
   ...:         return item
   ...:         

In [2]: a = A()

In [5]: class context:
    def __init__(self, a):
        self.a = a
    def __enter__(self):
        self.func_copy = self.a.__getitem__
        self.a.__getitem__ = lambda x: x*2
    def __exit__(self, exc_type, exc_value, traceback):
        self.a.__getitem__ = self.func_copy
   ...:         

In [7]: a[5]
5

In [8]: with context(a):
   ...:     print(a[5])
   ...:     
10

In [9]: a[5]
5

…le method

… changes

astrofrog · 2015-09-29T14:33:32Z

@mdmueller - it looks like there is a doctest that needs skipping on Windows (see the AppVeyor 2.6 failure)

taldcroft · 2015-09-29T16:27:49Z

@mdmueller - instead of skipping the doctest you can add the type (and check that all your examples specify int64 where needed).

t = Table([(1, 2, 3, 4), (10, 1, 9, 9)], names=('a', 'b'), dtype=[np.int64, np.int64])

taldcroft · 2015-09-29T19:41:21Z

To try finishing this out and to provide a passing base for #4090, I've cloned this PR off to #4202 and added a commit to fix the Windows docstring problem. If that passes then I'll merge #4202.

taldcroft · 2015-09-30T13:36:16Z

Closed by #4202.

Column.__getitem__ speedup branch of astropy#4075, and made adjustments to take advantage of it in the indices code. This included getting rid of the `def get_item(...)` hacks, which I think are no longer necessary--the base Column.__getitem__ is now almost exactly as fast as the stock implementation in the basic case of single item scalar access. In the process I also refactored the Cython code a bit so that it could be less repetitive. For now, tests that failed due to the change in behavior (indicies now *are* copied by default when slicing) are commented out; but all other tests pass. It should also that the 'copy_on_getitem' index mode is now a no-op since that's the default behavior. We could still change this so that it's not the default behavior, if we prefer, or add a no_copy_on_getitem mode instead.

mdmueller force-pushed the table-indexing branch 2 times, most recently from e62df42 to 53a171f Compare July 13, 2015 20:57

taldcroft reviewed Jul 15, 2015
View reviewed changes

taldcroft mentioned this pull request Jul 15, 2015

Consider using np.lexsort instead of argsort for table group_by #3964

Open

mdboom reviewed Jul 15, 2015
View reviewed changes

mdmueller force-pushed the table-indexing branch 3 times, most recently from 8749aba to ab6bf80 Compare July 19, 2015 02:15

mdmueller force-pushed the table-indexing branch from 0ab9e2e to 63a43d1 Compare July 21, 2015 19:13

mdmueller force-pushed the table-indexing branch 3 times, most recently from 81935c2 to 37c8f38 Compare July 22, 2015 16:40

mdmueller force-pushed the table-indexing branch 2 times, most recently from bec2020 to b7b2d9a Compare July 28, 2015 21:03

mdmueller added 16 commits September 28, 2015 23:54

Finished SortedArray rewrite in terms of Table methods

2b399ad

Made table.indices allow for special __getitem__ retrieval

6f9cb26

Fixed bug when renaming indexed columns

5ed38a5

Implemented loc and iloc in Table

c501897

Made lexsort work with Time, Table.sort() can take no arguments

1c6acb8

Made loc and iloc take individual labels and list/ndarray

dc0f72f

Made table .loc work with Time objects

15dc4c4

Improvements for BST and index replace_rows, also index_mode as a Tab…

d15f99e

…le method

Made Table sort faster with indices

c5e4aad

Wrote section on indexing in documentation

28f0d6a

Added new engine section in docs

428a23e

Made performance improvements for .loc retrieval

7d6a205

Fixed bug with updating column references in index

ecc5ea2

Added 'unique' mode for indexes

de6bb8c

Made .loc return a single row instead of a Table, minor documentation…

66d65a5

… changes

Added experimental warning

11230c9

mdmueller force-pushed the table-indexing branch from 7236155 to 6a1086e Compare September 29, 2015 03:54

Added changelog entry

66feeed

mdmueller force-pushed the table-indexing branch from 6a1086e to 66feeed Compare September 29, 2015 04:16

taldcroft mentioned this pull request Sep 29, 2015

Table indexing with docstring fix for Windows #4202

Merged

taldcroft closed this Sep 30, 2015

This was referenced Oct 5, 2015

Table indexing should not affect Mixin classes except through info #4222

Merged

Shape changing mixin #4123

Merged

Uh oh!

Indexing system for Table #3915

Indexing system for Table #3915

Uh oh!

Conversation

mdmueller commented Jul 3, 2015

Complete

Unfinished

Uh oh!

Cadair commented Jul 3, 2015

Uh oh!

taldcroft Jul 15, 2015

Choose a reason for hiding this comment

Uh oh!

mdboom Jul 15, 2015

Choose a reason for hiding this comment

Uh oh!

taldcroft commented Jul 15, 2015

Uh oh!

mdboom Jul 15, 2015

Choose a reason for hiding this comment

Uh oh!

mdmueller Jul 18, 2015

Choose a reason for hiding this comment

Uh oh!

mdboom Jul 19, 2015

Choose a reason for hiding this comment

Uh oh!

mdmueller commented Jul 19, 2015

Uh oh!

taldcroft commented Jul 21, 2015

Uh oh!

mdmueller commented Jul 21, 2015

Uh oh!

taldcroft commented Jul 21, 2015

Uh oh!

mhvk commented Jul 21, 2015

Uh oh!

taldcroft commented Jul 22, 2015

Uh oh!

mdmueller commented Jul 22, 2015

Uh oh!

taldcroft commented Jul 22, 2015

Uh oh!

mdmueller commented Jul 22, 2015

Uh oh!

astrofrog commented Sep 29, 2015

Uh oh!

taldcroft commented Sep 29, 2015

Uh oh!

taldcroft commented Sep 29, 2015

Uh oh!

taldcroft commented Sep 30, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants