-
-
Notifications
You must be signed in to change notification settings - Fork 2k
Interface Table to pandas DataFrame #3504
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
astropy/table/table.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be a bit cleaner with for column in self.columns.values(): then check column.ndim > 1. If you like functional, if any(col.ndim > 1 for col in self.columns.values()): is even better.
|
This may be a can of worms you don't want to open, but what about masked data? This is supported in both Table and DataFrame, but in different ways. At the least you should do some checking and raise exceptions if not supported. |
|
I think you can support mixin columns since they just go to object arrays in Of course if you have a coordinate object with a million elements this will blow up... |
|
|
|
@Cadair - this PR does include |
|
@taldcroft - indeed we need to be able to support the masking. I'll play around with it. |
|
Note to self: regarding the masking, I think that when masked columns are present they need to be converted to float to be able to represent |
|
@astrofrog I should probably read things better :p |
|
@astrofrog - yes now I remember not liking the way pandas handles missing values. Works great for float, no so much for any other type. |
|
This should now support masked values. At the moment the caveat is that if a Pandas series does not contain any masked values, it gets converted to a Anyway, this is ready for an initial review/testing to see if there is anything that I have overlooked. |
|
If this becomes more complicated I could also move the code to |
|
OK, I just ran this on my table which was the output of a 791 x 17 astroquery job. It worked just fine (not that there was any doubt), with exactly the behavior I would expect. Thank you @astrofrog ! |
|
For now I'm leaving the conversion disabled for mix-in columns since I think we may want to decide how to handle these better. For example for a I've added a documentation page but I'm getting a weird issue - |
astropy/table/table.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than check beforehand, maybe let DataFrame raise an error if something bad is passed in? It keeps the code cleaner, and if pandas starts to support multi-dimensional columns, things will just work. (or If you think it is important to give a clearer error message, maybe have the actual data frame assignments inside try/except clauses?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that the pandas error messages are not always very explicit. For the mixin columns it might actually work but just be very inefficient. So I prefer to control it at this stage. We could put it in a try...except but I'm not sure if it's easy to know what the issue is so as to know what better error to raise. I can look into it, but would be interested in @taldcroft's opinion here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhvk - the error that Pandas gives is pretty cryptic:
In [13]: t = Table([[[1,2]]])
In [14]: print(t)
col0 [2]
--------
1 .. 2
In [15]: pandas.DataFrame(np.array(t))
...
ValueError: If using all scalar values, you must must pass an index
I think pre-catching the exception is better. Also, I don't think support for multi-d columns is coming any time soon. I think this would cut rather deeply into the fundamental data structures (though I'm not an expert in Pandas so I could be wrong). I asked Wes about this several years ago and the answer was hmm... would have to think about that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely no big deal. Though since every usage of the dtype property recreates the dtypes of all the columns, maybe write:
if any(getattr(col, 'ndim', 1) > 1 for col in self.columns.values()):
raise ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
astropy/table/table.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing an else: clause here if not isinstance(column, MaskedColumn). (Though without mixins that can't actually happen, but good to future-proof things). Inspired by @mdboom, I've been using an idiom of collecting columns in a dict prior to making the table. So something like this might generically work and doesn't require an intermediate numpy array in the non-masked case. This also could support special-casing for certain mixins like you mentioned, e.g. Time => datetime:
out = OrderedDict()
for name, column in self.columns.items():
if isinstance(column, MaskedColumn):
....
elif isinstance(column, Time):
out[name] = column.datetime # this actually works in pandas, don't know if there is a more efficient way though
else: # Not a masked column
out[name] = column
return DataFrame(out)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can take this further and support the existing built-in mixins.
elif isinstance(column, Quantity):
out[name] = column.value
elif isinstance(column, SkyCoord):
for rcn in column.representation_component_names:
outname = name + '__' + rcn # I like double underscore here for better namespace separation
if outname in self.colnames:
# do something...
out[outname] = getattr(column, rcn)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 to going via an OrderDict rather than as_array. I like that in that case DataFrame gets set in only one place. Furthermore, it would also still allow the check on column dimensionality to be done in an try/except clause -- I mean to keep the if statement, but only execute it if there was an actual error:
try:
d = DataFrame(out)
except:
if any(...):
raise ValueError("Multi...")
else:
raise
The advantage (I think) is that the normal path is both as readable and as fast as it can possibly be, while the exception is still informative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've switched to OrderedDict. Mixin column support can come later.
|
On the mixin: @taldcroft - I like the idea of supporting |
|
For reference, there is a long-running but still-active pandas issue pandas-dev/pandas#2485 about allowing metadata to be attached to pandas objects (with milestone "someday"). In order to have any hope of round-tripping astropy mixin columns we would need a metadata container on the pandas side. |
|
@mhvk - I agree with your proposal for the optimal implementation using the framework of OTOH @astrofrog might prefer to just limit the scope of this PR to non-mixins for the sake of getting something done and merged, and that would be fine. |
Interface Table to pandas DataFrame
|
@barentsen - just FYI, this has been merged! Let me know if you try it out and have any issues! |
|
Oh, I mistakenly assumed this was using the unified I/O framework. To which my question becomes, why not? |
|
@embray - we could support that but I'd still recommend we keep the explicit methods in addition. In particular, writing is a bit trickier API-wise. |
|
I actually thing that |
|
I'm not sure I necessary agree with that. For example an "HDUList" object (despite its odd name) is sort of a high-level "file-like" object representing a FITS file, though not in the sense you mean (in that it's like a stream with a read and possibly seek methods). But it's still an object used for reading out of a FITS file, and is the recommended object to use when passing around a FITS file open for reading between different parts of code. I think for Pandas you have more of a point though. A Pandas DF is an abstract data structure that is not tied to any particular file format. |
|
I'm attempting to use .to_pandas() on an astropy Table object (from astroquery) and it's not working. My astropy is up-to-date from conda. Could someone more knowledgeable help me with this? I would attach the ipy notebook (using python 2) i've been testing it with but that's not allowed, so here's the code: |
|
@abigailStev This feature will be in Astropy v1.1.0 which isn't out yet. If you got Astropy through conda then you'd only get the most recent release which is v1.0.4 (unless there's some mechanism to get updates from master). Where'd you find out about this feature? |
|
In this issue right here. Hadn't realized it wasn't out yet! I'll be patient :) |
|
@abigailStev Usually you can check by looking at the milestone. That will tend to correspond with the lowest version number that something is available in (though there isn't a great way to tell if that version has been released yet, though if the milestone isn't closed it probably hasn't been yet, or just check the tags or changelog or something :/) |
|
@abigailStev The other option is to look at the version number in the upper left corder of the docs. For the pandas page on "latest", it's v1.1dev... meaning 1.1 is not out yet. |
|
That's why I asked--if @abigailStev had found it via the docs I was going to point that out. Good to know anyways though. |

This is simple but often requested by users. Still needs docs, but let me know if you have any feedback in the mean time.
Remaining to-dos:
Closes #2804