
Conversation

@taldcroft
Member

This issue is to explore the idea of entirely removing the np.ndarray or np.ma.MaskedArray structured array object that currently serves as the primary data storage. This object is referenced as Table._data in the code.

The way Table._data currently works is that it is essentially the owner of all the table data and the Column objects then hold references to the columns in Table._data (hereafter just _data). A fair bit of the trickier table code is maintaining that duality when mutable table operations are performed. When you start trying to integrate other data types as Columns it gets very tricky / untenable.

The proposal is to make a Table a collection of Columns where the Column objects are the primary and independent data owners. I am hoping that implementation of this will actually result in a net reduction of code and overall simplification of logic (a minimal sketch of the idea follows the issues list below). Some advantages:

Issues:

  • The Row class will need work to replicate the functionality currently handled by np.void and np.mvoid classes. There is something weird and magical about those classes I have never entirely understood, so fully replicating them represents a real uncertainty right now.
  • It's likely that iterating through a table by rows will be even slower. Even now there is a lot of Python code running for each row so it's starting out slow.
  • Users who are defying convention and using _data directly will be out of luck, though we could put in a compatibility property that basically creates the equivalent ndarray on request (and issues a warning saying don't use this!).
  • The accepted way to create a pure np.array from a Table will always generate a copy. Currently it can be a reference.
  • The high level table operations are implemented at a low level using pure ndarrays. This will need some work.
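For concreteness, here is a minimal sketch of the shape of the idea (all names are illustrative, not the eventual implementation):

from collections import OrderedDict
import numpy as np

class ToyTable:
    """Sketch: a table is just an ordered collection of independent columns."""
    def __init__(self, cols, names):
        # Each column owns its own data; there is no shared structured array.
        self.columns = OrderedDict(zip(names, cols))

    def __getitem__(self, name):
        return self.columns[name]

    def __len__(self):
        # Table operations are responsible for keeping all columns the same length.
        return len(next(iter(self.columns.values())))

t = ToyTable([np.arange(3), np.ones(3)], names=['a', 'b'])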

Comments / discussion please!

@astrofrog @eteq @embray @mhvk @mdboom

@taldcroft taldcroft added this to the v1.0.0 milestone Jul 24, 2014
@mhvk
Contributor

mhvk commented Jul 24, 2014

@taldcroft - Interesting thought! Though I'm not completely sure it will make it that much easier to put different items in columns; my sense is that the hard part is to adapt Column, and that is not affected much here. Is this not true for your trials? Among the negatives, a small one to add would be that slicing a table by sets of rows will likely become slower.

Overall, my sense would be to get the columns holding Time, Coordinates, etc., working first (for me, that means mostly letting go of the idea to do the masked versions as well...).

@perrygreenfield
Member

My big concern is losing the advantages of the row/column duality. Yes, you make column operations much easier, but how does the performance of rows go? Not only slicing, but use of index arrays. If one has many columns, that could become quite expensive. Or row insertion operations. Don't you also take on the added responsibility of ensuring columns stay consistent in length?

@taldcroft
Member Author

Though I'm not completely sure it will make it that much easier to put different items in columns; my sense is that the hard part is to adapt Column, and that is not affected much here.

One of the first issues I encountered with Time columns is that internally Time maintains two columns jd1 and jd2, but this is conceptually one column and from the user perspective what should be visible is the single-column representation (e.g. a list of ISO strings). So you get this problem: what to display and where to hold jd1 and jd2.

My original idea was putting jd1 and jd2 into the structured array as hidden columns with auto-generated column names. This gets messy quickly since there are different numbers of columns in the structured array than in the table. Eventually the internal Time values were not stored in _data at all, nor were the representation (string) values. This is basically what I proposed, but it ends up creating a difference between "Mixin" columns and normal ones, so in the case of a table with Mixin columns one would need to go around np.void anyway.

I think that if none of the columns actually live in _data then these decisions and complications will just go away.
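To make the point concrete, a toy version of such a column might look like the following (hypothetical class, not the real Time machinery):

import numpy as np

class ToyTimeColumn:
    """Toy mixin column: two internal arrays, one user-visible column."""
    def __init__(self, jd1, jd2, name=None):
        self.jd1 = np.asarray(jd1)  # internal storage, never visible to users
        self.jd2 = np.asarray(jd2)
        self.name = name

    def __len__(self):
        return len(self.jd1)

    def __getitem__(self, item):
        # Slicing keeps the two internal arrays aligned.
        return ToyTimeColumn(self.jd1[item], self.jd2[item], name=self.name)

    def __repr__(self):
        # Users see a single-column representation of the combined values.
        return repr(self.jd1 + self.jd2)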

@taldcroft
Member Author

My big concern is losing the advantages of the row/column duality. Yes, you make column operations much easier, but how does the performance of rows go?

Good questions. Here is at least a partial answer: http://nbviewer.ipython.org/gist/taldcroft/cccf5fac271f2ac49e0d

The upshot is that if you iterate through a wide table (100 cols) and then do something with just one of the columns, you lose a factor of three in speed. This is a silly thing to do but that doesn't stop people. But interestingly if you iterate through a wide table and then actually manipulate all of the columns you gain a factor of almost 3. (I don't know why and hopefully didn't do something silly in my test).

About slicing, yes, if you do t2 = t[3:30] then in the proposed scenario you replace a single ndarray slice with N_COLS ndarray slices. But in that table slice operation there is a lot of other Python code getting run which handles all the metadata, creating a new TableColumns object, and so on. Here are some numbers (using the table in the notebook). The actual slicing is completely negligible:

>>> timeit t2 = t[10:30]
10 loops, best of 3: 17.2 ms per loop
>>> timeit t2 = t._data[10:30]
1000000 loops, best of 3: 469 ns per loop

Granted, it seems the current Table slicing performance is pretty bad relative to numpy ndarray, but this would imply that doing 100 individual slices for columns isn't going to make things much worse.
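For reference, the raw comparison one would want is something like this (IPython commands only, outputs omitted since they will vary by machine):

>>> import numpy as np
>>> arr = np.zeros(100000, dtype=[(str(i), 'f8') for i in range(100)])
>>> cols = [arr[name] for name in arr.dtype.names]
>>> timeit arr[10:30]                  # one structured-array slice
>>> timeit [c[10:30] for c in cols]    # 100 separate column slices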

@embray
Member

embray commented Jul 24, 2014

Yes please. This will also make supporting FITS columns much easier. Still need to look more closely at the details though.

@mhvk
Contributor

mhvk commented Jul 24, 2014

@taldcroft - regarding holding Time, I think our approach in #1835 for holding the values in an array worked fine. So, my sense would still be that the problems are turning things into Columns. Though on further thought I can see that it would make things much easier to drop the requirement that Column is an ndarray subclass, and rather insist it has a couple of attributes that allow accessing, slicing, etc. I guess table would become more similar to TableColumns, except that of course lengths are guaranteed to be the same.

A propos slicing: yes, I guess it makes sense that overheads are large enough that it doesn't matter.

@perrygreenfield
Member

@taldcroft are these the right timing tests? What I'm wondering about is using numpy operations on tables. Not really having used tables much personally (if you are willing to take over some of my exciting administrative duties, I think I might be able to do that :-)), I'm sort of presuming that I can apply numpy operations to table columns or numpy indexing to the table as a whole, but I actually haven't done that. Can I supply a mask array as an index to a table directly, as I would for a naked numpy structured array (e.g., table[mask]), and have it return a new table with the selected rows, using the built-in numpy machinery for doing it? If what is going on under the covers is actual loops over rows, then I would agree that using a numpy structured array isn't helping much. But each of your examples does explicit looping over rows, and explicit numerical operations on individual table cells. What I'm wondering is whether the example I gave (table[mask]) will be much slower than it would be for a numpy structured array.

@taldcroft
Member Author

...on further thought I can see that it would make things much easier to drop the requirement that Column is an ndarray subclass, and rather insist it has a couple of attributes that allow accessing, slicing, etc

Exactly. From #1833:
"These are table columns that are not subclassed from Column or ndarray but which implement a protocol that allows them to be part of a Table. Essentially they have a sufficiently close API and can do the operations required by the Table API."

@embray
Member

embray commented Jul 24, 2014

Row-based iteration could probably also be sped up by making some linear access assumptions and aligning columns with each other a chunk at a time.
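A rough sketch of that idea (helper name and chunk size made up; assumes a table exposing a columns dict as sketched above):

import numpy as np

def iter_rows_chunked(table, chunk=1024):
    # Materialize aligned chunks of every column, then yield plain row tuples.
    cols = list(table.columns.values())
    for start in range(0, len(table), chunk):
        block = [np.asarray(col[start:start + chunk]) for col in cols]
        for row in zip(*block):
            yield row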

@taldcroft
Member Author

@perrygreenfield - I was going straight to what I am worried about for the worst case, which is iterating through rows.

The astropy Table object does indeed support all methods of indexing allowed by numpy arrays, largely by delegating to ndarray.__getitem__ via the _data ndarray. The difference now would be that instead of a single call to _data[index] there would be a loop of calls to col[index] for each column. However, this looping is already being done as part of Table creation, so there will be negligible overall slowdown due to the change in how the data indexing is done.
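So the table[mask] case reduces to a per-column application of exactly the numpy machinery @perrygreenfield describes; a sketch, reusing the toy table from the top of the thread:

>>> mask = np.asarray(t['a']) > 1        # boolean mask built from one column
>>> new_cols = [col[mask] for col in t.columns.values()]
>>> t2 = ToyTable(new_cols, names=list(t.columns))   # equivalent of t[mask]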

We actually need to work on the performance of Table, as there appear to be some huge overheads associated with creating new tables and columns that are basically unrelated to the underlying time for slicing or fancy indexing. Creating a Table takes almost 20 ms while the actual numpy data indexing is less than 0.1 ms. At least most of the time I'm using indexing like this it isn't in a tight inner loop, and it's associated with other processing that is more expensive. But still, this is pretty bad.

@eteq
Member

eteq commented Jul 25, 2014

@taldcroft I'm 👍 for this in concept. My main worry is that some people may have optimized code based on the current row/column performance characteristics, and if this causes a major change they might be annoyed. But if it can stay within factors of 2 or so, I definitely see the advantages here.

There are two other issues that might come up:

  • I know plenty of people who use the other idiom for making tables: feeding in numpy structured arrays. Those people also want to get structured arrays out sometimes via np.array(table). Might it be a major performance degradation to have to pull these apart and put them back together? It still might be worth it, but it's a thing to consider.
  • Will __repr__ become onerous? Right now it looks like it uses ndarray, so it might be quite a pain to replicate. But maybe that doesn't really matter, as __repr__ isn't necessarily guaranteed to be backwards compatible.

@taldcroft
Member Author

@eteq - I'm more worried about memory than speed in the case you mention of using numpy structured arrays. For the most part allocating memory and doing memory copies is pretty quick, at least in relation to the other things that one is usually doing with that data.

I just realized that we could be a little clever and make a common use case not consume extra memory. If someone initializes a Table with copy=False and a structured array, then we could create the Column objects as references to the structured array columns. Of course one needs to be careful here that subsequent mutable operations (inserting rows) always do the right thing. This may or may not be feasible...
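A quick demonstration that the references would behave as needed (plain numpy, no Table involved):

>>> import numpy as np
>>> arr = np.array([(1, 2.0), (3, 4.0)], dtype=[('a', 'i8'), ('b', 'f8')])
>>> cols = [arr[name] for name in arr.dtype.names]   # views, no data copied
>>> cols[0][0] = 99          # mutating the view...
>>> arr['a'][0]              # ...mutates the original structured array
99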

Yes __repr__ will take some work, sigh.

AND of course there is always plan B, which was the original plan A, of having a hybrid approach where some columns are in the _data ndarray and some are not. This would basically preserve the existing interfaces / performance for tables that can be represented that way, while allowing the possibility of "Mixin" columns that are just contained in TableColumns. Basically the code just gets a bit ugly because you still need to handle the case (e.g. a FITS table) with only Mixin columns. __repr__ would still need work along with Row. There would be lots of conditionals asking if the table is pure ndarray-based, and likewise whether a column is part of an ndarray table or not. It's all possible, but probably not in an elegant way.

@taldcroft
Member Author

I'm also looking at bcolz (aka carray) for inspiration and as something that Table could support as mixin columns (not as a default but as an option for people wanting the features of bcolz):

http://bcolz.blosc.org/tutorial.html#tutorials

@taldcroft
Member Author

Which of course leads to a much more ambitious idea of Table as a uniform front-end for any of a number of underlying data storage backends like ndarray, pandas data series/frame, bcolz carray/ctable. Hmm..

@taldcroft
Member Author

I've finished a POC for this which passes all table tests on Python 2.7. Please see:
https://github.com/taldcroft/astropy/compare/table-no-array
For now I'm not making this a PR since there will be a lot more work and I don't want to burn travis time or [skip ci] every commit.

Some things that occurred to me:

  • Though this might be optimistic, I am hoping this could make Quantity integration nearly trivial. Basically a column could literally be a Quantity (with just a few additions to Quantity). We might be able to avoid issues with "column-ness".
  • Another idea I haven't thought through to the end: I think we might be able to drop the concept of a "masked table", because a table would be a collection of columns, some of which could be masked.
  • The above point is important because there are no masked Quantity, Time, or Coordinate classes. This would be less of a limitation if a Table is not tied to a global concept of masked or not.
  • We could have unicode column names.
  • You can now create a table from a list of columns by reference. In other words if you have some numpy arrays you can assemble them into a table without making a copy.
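For example (assuming copy=False reference semantics as described in the last bullet above):

>>> import numpy as np
>>> from astropy.table import Table
>>> a = np.arange(5)
>>> t = Table([a], names=['a'], copy=False)   # column references a directly
>>> t['a'][0] = 99
>>> a[0]                                      # the original array sees the change
99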

Code details:

  • Generally I was just hacking to get this POC to pass tests. A real implementation would be cleaner.
  • Right now in the code there is a new read-only Table._data property that basically creates the old structured ndarray on demand. This is just a transition crutch to jump start getting it running and passing tests, but will be removed.
  • I purposely left some of the old code structure and some commented-out stuff for reference.
  • The biggest single change was in Table.__setitem__ where I had to reimplement the relevant API of ndarray.__setitem__.
  • add_row got less efficient because I couldn't figure out a way to do an in-place resize of a Column, because it always has the OWNDATA flag set to False. This was always the case with masked tables (no in-place resize), but previously it was sometimes possible to efficiently expand the entire structured array.
  • sort lost the ability to do an in-place sort and instead uses an argsort.
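The argsort-based sort amounts to something like this (sketch):

import numpy as np

def sort_table(table, key):
    idx = np.argsort(table.columns[key])    # compute the row order once
    for name, col in table.columns.items():
        table.columns[name] = col[idx]      # fancy-index every column (makes copies)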

@mhvk
Contributor

mhvk commented Sep 26, 2014

@taldcroft - the more I think about this, the more I like it: Column essentially becomes a container class that adds some metadata. Ideally, one can then make any suitable object into a column by making a new class that subclasses from both (i.e., a mixin class, maybe a bit like I'm doing in #1839); columns holding ndarray and MaskedArray are just the two currently defined examples. I particularly like that masked columns with quantities then properly becomes the problem of getting masked quantities to work!

As part of this, it may be useful for me to do something I had wondered about already: move the dependence of BaseColumn on ndarray into Column, so that BaseColumn can become the container class (possibly becoming just Column in the process). But what are the minimum properties a column should have (i.e., what should anything that is contained in a column support)? Obviously, it should have a length, and should support __getitem__ (and, not essential but probably expected, __setitem__). Is anything else really required?

@taldcroft
Member Author

@mhvk - I'm planning to try a couple of things today:

  1. Get Quantity to work as a column
  2. Make a minimal new class that works as a column. E.g. something that is effectively an array from 1 to N.

These will be empirical ways to answer your question of what is required. I think that the list is not really that long. #1833 is a good starting point for answering this as well. This shows a minimal list of:

  • __getitem__, __setitem__
  • name attribute
  • copy()
  • __len__
  • shape property
  • data property
  • unit (optional)
  • format (optional)
  • description (optional)
  • meta (optional)

Most of these are things that could naturally be included in classes that want to be part of Table, thus eliminating the need to even have a mixin subclass. At least within the astropy core classes we have that freedom, so I'd argue for making those core classes directly Table-compatible. That eliminates all the headache of the distinction between the column version and pure version of a particular class like Quantity. Anyway, that's the direction I'd like to pursue, but only some code will really answer whether it works.
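To make that concrete, here is a minimal class implementing the list above (hypothetical, just to probe the protocol):

class MinimalColumn:
    """Smallest thing that could plausibly act as a Table column."""
    def __init__(self, data, name=None, unit=None, format=None,
                 description=None, meta=None):
        self._data = list(data)
        self.name = name
        self.unit = unit
        self.format = format
        self.description = description
        self.meta = meta if meta is not None else {}

    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        return self._data[item]

    def __setitem__(self, item, value):
        self._data[item] = value

    def copy(self):
        return MinimalColumn(self._data, self.name, self.unit,
                             self.format, self.description, dict(self.meta))

    @property
    def shape(self):
        return (len(self._data),)

    @property
    def data(self):
        return self._data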

@mhvk
Contributor

mhvk commented Sep 26, 2014

It seems to me the above list mixes things that are required to possibly be a column with things that one gets from being one. I think the requirements for being able to be column content are:

  • must have: __getitem__, __len__
  • most likely: copy
  • probably: __setitem__, shape (not sure, since in principle I see no reason a list couldn't be a column, or indeed a tuple; it would just be rather inefficient...).

While the stuff that is brought by column would be:

  • data: just exposes column content (e.g., Quantity) [I'd probably call this content, since, e.g., MaskedArray already has data and I do not think you'd want to override that.]
  • format, description, meta: as record holder (if not provided by column content)
  • link to parent table (for grouping)

I do not think the column should provide a unit; if one wants that, use a column with quantity as content!
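In code, that division of labor might look like this (sketch; names hypothetical):

class ColumnWrapper:
    """Adds metadata and a parent-table link around arbitrary content."""
    def __init__(self, content, name, format=None, description=None, meta=None):
        self.content = content      # only needs __getitem__ and __len__
        self.name = name
        self.format = format
        self.description = description
        self.meta = meta if meta is not None else {}
        self.parent_table = None    # set when the column joins a table

    def __len__(self):
        return len(self.content)

    def __getitem__(self, item):
        return self.content[item]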

@taldcroft
Member Author

OK, check out the latest:

This adds initial support for having Quantity columns. Most importantly, check out the almost negligible footprint of this change on Quantity itself. Basically just the addition of name and data properties. In fact I think it would be possible to get rid of even those, but the price would be an API change in Table: currently table columns always have a name attribute, and this is quite useful in practice.

Personally I think that having a few properties in Quantity that are "do-nothing" except in the context of Table is well worth the huge win of having pure Quantities within Table.
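For reference, the additions are on this order (a sketch of the idea, not the actual diff; class name hypothetical):

import numpy as np
import astropy.units as u

class ColumnQuantity(u.Quantity):
    """Sketch: the entire 'column-ness' added on top of Quantity."""
    name = None             # assigned by Table when the column is added
    parent_table = None     # back-reference, e.g. for grouping and printing

    @property
    def data(self):
        return self.view(np.ndarray)    # plain values, as Table expects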

@mhvk
Contributor

mhvk commented Sep 26, 2014

@taldcroft - I like how easy it is, but think it would be just as easy without any changes to Quantity, but with a minimal Column class. I may try to cook something up along those lines; I am fairly confident that anything that works with ndarray will automatically work with Quantity as well.

My main resistance, though, is that now in your demo you have:

In [3]: t['b']
Out[3]:
<Quantity [ 1., 2.] m>

Shouldn't it in this case still be a column, if only so one has a link to the parent table?

@taldcroft
Member Author

t['b'].parent_table is already there. 😄 So in fact it isn't a pure-pure Quantity, but for all practical purposes it is just a Quantity with a few non-impacting attributes. I just find this cleaner than a new QuantityColumn class (and TimeColumn, and SkyCoordColumn, and ...).

@mhvk
Contributor

mhvk commented Sep 26, 2014

Or, to put it another way, should print(t['a']) still display a vertically stacked column, or just print a Quantity? My feeling is that it should give the same format as print(t['a','b']).

@taldcroft
Member Author

You have a point there about printing. But for instance Quantity could know to print itself as a table column if parent_table is defined. So yes, Quantity gets to be a slight hybrid, but I think that from the user perspective this is more intuitive than the wrapper Column mixin.

But you may generally have a point that the hybridization gets to be too much and having a mixin keeps things cleaner.

Member

"objets" -> "objects"

Member Author

Done.

@mhvk
Contributor

mhvk commented Nov 5, 2014

@embray - with Table becoming effectively an OrderedDict of listables, it may be best to think of Row as an OrderedDict of items. It may be that if one foregoes the construction of the np.void (which is hardly useful in your example), things would be faster. But I guess the best one could do would be something like this:

def __iter__(self):
    return iter(col[self._index] for col in self._table.columns.values())

which, hopefully, would be similar to

def __iter__(self):
    return iter(self[key] for key in self._table.columns.keys())

(sorry, no time to benchmark myself, but it would be nice to ensure this isn't slowed down as much as you seem to have found -- though obviously it is exactly this type of use case that is likely to suffer most!).

@taldcroft
Member Author

@embray - about the figures in the new docs, I did those with google docs drawings.

@astrofrog
Member

@eteq - do you want to do a final review of this? From my point of view it is good to go.

@taldcroft
Member Author

@mhvk @astrofrog - you might want to have a look at ec07b42 which fixes what I would consider a bug in that a Column gets returned instead of a plain numpy array in the following case:

In [4]: c = Column([[1,2],[3,4]])
In [5]: c[0]
Out[5]: 
<Column name=None unit=None format=None description=None> 
array([1, 2])

For this particular case the timing performance for c[0] is improved by a factor of 40, which is why I originally noticed the issue. But the penalty is a factor of ~10 increase in the time for a more common case of:

>>> c = Column([1,2,3,4])
>>> c[1]

Here, because the pure Python __getitem__ gets called, the item access time goes up. I'm thinking that any actual pure Python code doing column access like this will be much slower than the actual value fetch operation, so in real cases it won't matter. But maybe Cython code would get slowed down? Not sure.

@astrofrog
Member

@taldcroft - in principle this sounds fine to me, but just to note that this will break backward compatibility for some very specific situations, specifically:

In [6]: c = Column([[1,2],[3,4]], unit='m')

In [7]: c / u.s
Out[7]: 
<Quantity [[ 1., 2.],
           [ 3., 4.]] m / s>

In [13]: c[1,1:2] / u.s
Out[13]: <Quantity [ 4.] m / s>

In [8]: c[1,1] / u.s
Out[8]: <Quantity 4.0 1 / s>

In [9]: c[1] / u.s  # this will change and return a quantity in 1/s
Out[9]: <Quantity [ 3., 4.] m / s>

so if we go ahead with this, there should be a note in the API section of the changelog.

@mhvk
Contributor

mhvk commented Nov 10, 2014

@taldcroft - since you are somewhat changing the API no matter what you do here with getting individual items, how about going the other way and always returning a Column? The logic might be that the metadata is just as relevant to a single item as it is to a slice.

@mhvk
Contributor

mhvk commented Nov 10, 2014

I tried quickly and the following does not break everything (though there are quite a few tests that explicitly check that items are np.float etc.):

    def __getitem__(self, item):
        # Always return a Column (even for scalar items), re-attaching metadata.
        out = np.array(self.data[item], copy=False).view(self.__class__)
        out.__array_finalize__(self)
        return out

    @property
    def data(self):
        # Expose the underlying ndarray; unwrap zero-dimensional results.
        data = self.view(np.ndarray)
        if self.shape:
            return data
        else:
            return data.item()

    def __unicode__(self):
        # Scalar "columns" print as plain values; real columns use the formatter.
        if not np.iterable(self):
            return six.text_type(self.data)
        _pformat_col = self._formatter._pformat_col
        lines, n_header = _pformat_col(self)
        return '\n'.join(lines)

    def __bytes__(self):
        if not np.iterable(self):
            return six.text_type(self.data).encode('utf-8')
        return six.text_type(self).encode('utf-8')

@mhvk
Contributor

mhvk commented Nov 10, 2014

@taldcroft - thinking more about this, how about undoing the last commit and just raising a separate issue for what is clearly a bug, but one that (at least according to me) requires more thought? The advantage is that we stick to the idea that this PR only changes the implementation, while minimizing changes to the API. Even more so as I think it was ready to go in!

@astrofrog
Member

+1 to @mhvk's suggestion of removing the last commit. This is an issue with the current API that we should deal with separately.

@taldcroft
Member Author

Removing the last commit is fine by me, will do.

@taldcroft
Member Author

OK ec07b42 is gone and now lives in #3095.

@mhvk
Contributor

mhvk commented Nov 11, 2014

I'm happy with this to go in.

taldcroft added a commit that referenced this pull request Nov 11, 2014
Remove numpy structured array as data container in Table
@taldcroft taldcroft merged commit a7dd2ef into astropy:master Nov 11, 2014
@astrofrog
Member

🎉

@eteq
Member

eteq commented Nov 12, 2014

👏 - I was going to say I can try to review it but don't know that I have too much to add... but this is good too!

@taldcroft taldcroft deleted the table-no-array branch October 1, 2019 20:43
eerovaher added a commit to eerovaher/astropy that referenced this pull request Aug 3, 2021
A `Table` object does not store data in a NumPy ndarray since PR astropy#2790,
but the documentation still claimed that it does in a couple of places.