Remove numpy structured array as data container in Table #2790
Conversation
|
@taldcroft - Interesting thought! Though I'm not completely sure it will make it that much easier to put different items in columns; my sense is that the hard part is to adapt … Overall, my sense would be to get the columns holding Time, Coordinates, etc., working first (for me, that means mostly letting go of the idea of doing the masked versions as well...). |
|
My big concern is losing the advantages of the row/column duality. Yes, you make column operations much easier, but what happens to row performance? Not only slicing, but the use of index arrays. If one has many columns, that could become quite expensive. Or row insertion operations. Don't you also take on the added responsibility of ensuring that columns stay consistent in length? |
One of the first issues I encountered with … My original idea was putting … I think that if none of the columns actually live in … |
Good questions. Here is at least a partial answer: http://nbviewer.ipython.org/gist/taldcroft/cccf5fac271f2ac49e0d The upshot is that if you iterate through a wide table (100 cols) and then do something with just one of the columns, you lose a factor of three in speed. This is a silly thing to do, but that doesn't stop people. Interestingly, though, if you iterate through a wide table and then actually manipulate all of the columns, you gain a factor of almost 3 (I don't know why, and hopefully didn't do something silly in my test). About slicing: yes, if you do … Granted, it seems the current Table slicing performance is pretty bad relative to a numpy ndarray, but this would imply that doing 100 individual slices for columns isn't going to make things much worse. |
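For concreteness, a rough sketch of the kind of comparison discussed in that notebook (the 100-column table, the column names, and the timing harness here are illustrative assumptions, not taken from the linked gist):

```python
import numpy as np
from astropy.table import Table
from timeit import timeit

# Build a wide table (100 float columns) purely for illustration.
ncols, nrows = 100, 1000
t = Table({'c{}'.format(i): np.random.random(nrows) for i in range(ncols)})

def row_iter_one_col():
    # Iterate over rows but only touch a single column's value.
    total = 0.0
    for row in t:
        total += row['c0']
    return total

def column_wise():
    # The same reduction done directly on the column.
    return t['c0'].sum()

print('row iteration:', timeit(row_iter_one_col, number=10))
print('column access:', timeit(column_wise, number=10))
```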
|
Yes please. This will also make supporting FITS columns much easier. Still need to look more closely at the details though. |
|
@taldcroft - regarding holding … Apropos of slicing: yes, I guess it makes sense that the overheads are large enough that it doesn't matter. |
|
@taldcroft are these the right timing tests? What I'm wondering about is using numpy operations on tables. Not having really used tables much personally (if you are willing to take over some of my exciting administrative duties, I think I might be able to do that :-), I'm sort of presuming that I can apply numpy operations to table columns or numpy indexing to the table as a whole, but I actually haven't done that. Can I supply a mask array as an index to a table directly, as I would be able to for a naked numpy structured array (e.g., table[mask]), and have it return a new table with the selected rows, using the built-in numpy machinery for doing it? If what is going on under the covers is actual loops over rows, then I would agree that using a numpy structured array isn't helping much. But each of these examples does explicit looping over rows and explicit numerical operations on individual table cells. What I'm wondering is whether using the example I gave (table[mask]) will be much slower than it would be for a numpy structured array. |
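For concreteness, the usage being asked about is something like the following (a minimal sketch; the column names and data are illustrative, not from the thread):

```python
import numpy as np
from astropy.table import Table

t = Table({'flux': np.random.random(1000),
           'err': np.random.random(1000)})

# Boolean mask indexing, the same pattern one would use on a
# structured array: returns a new Table with the selected rows.
mask = t['flux'] > 0.5
selected = t[mask]

# The equivalent operation on a plain structured array, for comparison.
arr = np.array(t)        # copies the Table data into a structured ndarray
selected_arr = arr[mask]
```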
Exactly. From #1833: |
|
Row-based iteration could probably also be sped up by making some linear access assumptions and aligning columns with each other a chunk at a time. |
|
@perrygreenfield - I was going straight to what I am worried about for the worst case, which is iterating through rows. … We actually need to work on the performance of Table, as there appear to be some huge overheads associated with creating new tables and columns that are basically unrelated to the underlying time for slicing or fancy indexing. Creating a Table takes almost 20 msec, while the actual numpy data indexing is less than 0.1 msec. At least most of the time I'm using indexing like this, it isn't in a tight inner loop and it's associated with other processing that is more expensive. But still, this is pretty bad. |
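A rough way to measure the kind of overhead being described (the ~20 msec and <0.1 msec figures above come from the author's own tests; this sketch only shows how one might time the same pattern):

```python
import numpy as np
from astropy.table import Table
from timeit import timeit

t = Table({'a': np.arange(100000), 'b': np.random.random(100000)})
arr = np.array(t)                        # structured-array copy for comparison
idx = np.random.randint(0, len(t), 1000)

# Fancy indexing on the Table pays the cost of building a brand-new Table.
table_time = timeit(lambda: t[idx], number=100) / 100
# The same fancy indexing on the raw structured array.
array_time = timeit(lambda: arr[idx], number=100) / 100

print('Table fancy indexing  : {:.3f} ms'.format(table_time * 1e3))
print('ndarray fancy indexing: {:.3f} ms'.format(array_time * 1e3))
```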
|
@taldcroft I'm 👍 for this in concept. My main worry is that some people may have optimized code based on the current row/column performance characteristics; if this causes a major change, they might be annoyed. But if it can stay within factors of 2 or so, I definitely see the advantages here. There are two other issues that might come up:
|
|
@eteq - I'm more worried about memory than speed in the case you mention of using numpy structured arrays. For the most part, allocating memory and doing memory copies is pretty quick, at least in relation to the other things that one is usually doing with that data. I just realized that we could be a little clever and make a common use case not consume extra memory: if someone initializes a Table with … Yes, AND of course there is always plan B, which was the original plan A, of having a hybrid approach where some columns are in the … |
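The memory-saving initialization alluded to above is presumably along the lines of the `copy=False` option that `Table` exposes; a small sketch (whether the column shares memory with the input in every case is something to verify, not a guarantee):

```python
import numpy as np
from astropy.table import Table

a = np.arange(5)
b = np.random.random(5)

# With copy=False the Table columns can reference the input arrays
# rather than copying them, so no extra memory is consumed.
t = Table([a, b], names=('a', 'b'), copy=False)

t['a'][0] = 99
print(a[0])   # 99 if the column really is a view on the original array
```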
|
I'm also looking at bcolz (aka carray) for inspiration and as something that |
|
Which of course leads to a much more ambitious idea of |
|
I've finished a POC for this which passes all table tests on Python 2.7. Please see: … Some things that occurred to me:
Code details:
|
|
@taldcroft - the more I think about this, the more I like it: … As part of this, it may be useful for me to do something I had wondered about already: move the dependence of … |
|
@mhvk - I'm planning to try a couple of things today:
These will be empirical ways to answer your question of what is required. I think that the list is not really that long. #1833 is a good starting point for answering this as well. It shows a minimal list of: …
Most of these are things that could be naturally included in classes that want to be part of a Table, thus eliminating the need to even have a mixin subclass. At least within the astropy core classes we have that freedom, so I'd argue for making those core classes directly Table-compatible. That eliminates all the headache of the distinction between the column version and the pure version of a particular class like Quantity. Anyway, that's the direction I'd like to pursue, but only actual code will really answer whether it works. |
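As an illustration of what "directly Table-compatible" could mean in practice, here is a minimal duck-typed sketch; the exact attribute set (name, shape, dtype, indexing) is an assumption for illustration, not the finalized protocol:

```python
import numpy as np

class MinimalColumnLike:
    """Toy data container exposing just enough array-like behavior
    that a column-of-columns Table could hold it directly.  The
    attribute list is illustrative, not the real mixin protocol."""

    def __init__(self, data, name=None, unit=None, description=None):
        self._data = np.asarray(data)
        self.name = name
        self.unit = unit
        self.description = description

    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        result = self._data[item]
        if np.isscalar(result) or getattr(result, 'ndim', 0) == 0:
            return result
        return MinimalColumnLike(result, name=self.name, unit=self.unit)

    @property
    def shape(self):
        return self._data.shape

    @property
    def dtype(self):
        return self._data.dtype
```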
|
It seems to me the above list mixes things that are required to possibly be a column with things that one gets from being one. I think the requirements for being able to be column content are:
While the stuff that would be brought by being a column would be: …
I do not think the column should provide a |
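One way to picture the split being argued for here (purely illustrative; the `ColumnWrapper` name and the exact division of responsibilities are assumptions, not anything from this PR):

```python
import numpy as np

class ColumnWrapper:
    """Thin table-facing wrapper: holds table-level bookkeeping
    (name, link to the parent table, print format) while delegating
    all data behavior to the wrapped content object."""

    def __init__(self, data, name, parent_table=None, format=None):
        self.data = data              # content: ndarray, Quantity, Time, ...
        self.name = name
        self.parent_table = parent_table
        self.format = format

    # Data behavior is simply delegated to the content object.
    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        return self.data[item]

# The content object only needs array-like behavior; everything
# table-specific lives in the wrapper.
col = ColumnWrapper(np.arange(5), name='a')
print(col.name, len(col), col[2])
```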
|
OK, check out the latest:
This adds initial support for having Quantity columns. Most importantly, check out the almost negligible footprint of this change on … Personally, I think that having a few properties in Quantity that are "do-nothing" except in the context of Table is well worth the huge win of having pure Quantities within Table. |
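For readers arriving later: in present-day astropy the descendant of this idea is the mixin-column machinery, where something like the following works (shown with `QTable`, which keeps the column as a genuine `Quantity`; treat this as a current-day illustration rather than exactly what this 2014 proof of concept did):

```python
import astropy.units as u
from astropy.table import QTable

t = QTable()
t['v'] = [1.0, 2.0, 3.0] * u.km / u.s

print(type(t['v']))           # a Quantity, not a plain Column
print(t['v'].to(u.m / u.s))   # full Quantity behavior inside the table
```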
|
@taldcroft - I like how easy it is, but I think it would be just as easy without any changes to … My main resistance, though, is that now in your demo you have: … Shouldn't it in this case still be a column, if only so one has a link to the parent table? |
|
Or, in another way, should |
|
You have a point there about printing. But, for instance, Quantity could know to print itself as a table column if … But you may generally have a point that the hybridization gets to be too much and having a mixin keeps things cleaner. |
"objets" -> "objects"
Done.
|
@embray - with …, which hopefully would be similar to … (sorry, no time to benchmark myself, but it would be nice to ensure this isn't slowed down as much as you seem to have found -- though obviously it is exactly this type of use case that is likely to suffer most!). |
|
@embray - about the figures in the new docs, I did those with google docs drawings. |
|
@eteq - do you want to do a final review of this? From my point of view it is good to go. |
|
@mhvk @astrofrog - you might want to have a look at ec07b42, which fixes what I would consider a bug, in that a … For this particular case the timing performance for … Here, because the pure Python … |
|
@taldcroft - in principle this sounds fine to me, but just to note that this will break backward compatibility in some very specific situations, specifically: … So if we go ahead with this, there should be a note in the API section of the changelog. |
|
@taldcroft - since you are somewhat changing the API no matter what you do here with getting individual items, how about going the other way and always returning a … |
|
I tried quickly, and the following does not break everything (though there are quite a few tests that explicitly test that items are … |
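The behavior under discussion is the type returned by scalar item access; a small sketch of what such tests presumably exercise (the expected types here are my assumption, not quoted from the test suite):

```python
import numpy as np
from astropy.table import Table

t = Table({'a': [1, 2, 3]})

item = t['a'][0]
print(type(item))                    # typically a numpy scalar, e.g. numpy.int64
print(isinstance(item, np.integer))  # True for the numpy-scalar behavior

# The alternative floated above would have this return a Column-like object
# instead, which is exactly what such explicit type checks would catch.
```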
|
@taldcroft - thinking more about this, how about undoing the last commit and just raising a separate issue for what is clearly a bug, but one that, at least in my view, requires more thought? The advantage is that we stick to the idea that this PR only changes the implementation, while minimizing changes to the API. This is even more so as I think it was ready to go in! |
|
+1 to @mhvk's suggestion of removing the last commit. This is an issue with the current API that we should deal with separately. |
|
Removing the last commit is fine by me, will do.
|
Force-pushed from ec07b42 to 698d0b9
|
I'm happy for this to go in. |
Remove numpy structured array as data container in Table
|
🎉 |
|
👏 - I was going to say I can try to review it but don't know that I have too much to add... but this is good too! |
A `Table` object has not stored its data in a NumPy ndarray since PR astropy#2790, but the documentation still claimed that it does in a couple of places.

This issue is to explore the idea of entirely removing the `np.ndarray` or `np.ma.MaskedArray` structured array object that currently serves as the primary data storage. This object is referenced as `Table._data` in the code.

The way `Table._data` currently works is that it is essentially the owner of all the table data, and the `Column` objects then hold references to the columns in `Table._data` (hereafter just `_data`). A fair bit of the trickier table code is maintaining that duality when mutable table operations are performed. When you start trying to integrate other data types as `Column`s it gets very tricky / untenable.

The proposal is to make a `Table` be a collection of `Column`s where the `Column` objects are the primary and independent data owners. I am hoping that implementation of this will actually result in a net reduction of code and overall simplification of logic. Some advantages:

- … no `_data` object needs to be re-built from scratch for each column.

Issues:

- The `Row` class will need work to replicate the functionality currently handled by the `np.void` and `np.mvoid` classes. There is something weird and magical about those classes I have never entirely understood, so fully replicating them represents a real uncertainty right now.
- … `_data` directly will be out of luck, though we could put in a compatibility property that basically creates the equivalent `ndarray` on request (and issues a warning saying don't use this!).
- … `np.array` from a `Table` will always generate a copy. Currently it can be a reference.
- … `ndarray`s. This will need some work.

Comments / discussion please!
@astrofrog @eteq @embray @mhvk @mdboom
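As a purely conceptual sketch of the proposed structure (the `ColumnsTable` class, its `columns` OrderedDict, and its methods are illustrative assumptions, not the actual astropy implementation):

```python
from collections import OrderedDict
import numpy as np

class ColumnsTable:
    """Toy model of the proposal: the table is an ordered collection of
    independent column objects, with no backing structured array
    (no `_data`) owning the memory."""

    def __init__(self, cols):
        # cols: mapping of name -> array-like; each column owns its data.
        self.columns = OrderedDict((name, np.asarray(data))
                                   for name, data in cols.items())

    def __len__(self):
        first = next(iter(self.columns.values()), np.empty(0))
        return len(first)

    def __getitem__(self, name):
        return self.columns[name]

    def add_column(self, name, data):
        # No structured-array rebuild needed: just check the length
        # and insert the new column.
        data = np.asarray(data)
        if self.columns and len(data) != len(self):
            raise ValueError('new column length does not match table')
        self.columns[name] = data

t = ColumnsTable({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
t.add_column('c', ['x', 'y', 'z'])
print(len(t), list(t.columns))
```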