
Conversation

@astrofrog
Member

This is simple but often requested by users. It still needs docs, but let me know if you have any feedback in the meantime.

Remaining to-dos:

  • Add docs

Closes #2804

Member

I think this can be a bit cleaner: loop with for column in self.columns.values() and then check column.ndim > 1. If you prefer the functional style, if any(col.ndim > 1 for col in self.columns.values()): is even better.
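
For illustration, a minimal sketch of the suggested guard (the helper name and its placement inside to_pandas are assumptions, not code from this PR):

def _check_columns_are_1d(table):
    # A pandas DataFrame column can only hold 1-d data, so refuse the
    # conversion when any column is multi-dimensional.
    if any(col.ndim > 1 for col in table.columns.values()):
        raise ValueError("Cannot convert a table with multi-dimensional "
                         "columns to a pandas DataFrame")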

@taldcroft
Member

This may be a can of worms you don't want to open, but what about masked data? This is supported in both Table and DataFrame, but in different ways. At the least you should do some checking and raise exceptions if not supported.

@taldcroft
Member

I think you can support mixin columns since they just go to object arrays in as_array():

In [66]: t = Table([Time(['2001:001', '2002:002']), ['a', 'b'], [1, 2]], names=['a', 'b', 'c'])

In [67]: nt = t.as_array()

In [68]: ntd = OrderedDict((name, nt[name]) for name in nt.dtype.names)

In [69]: df = DataFrame(ntd)

In [70]: df['a'][1].jyear
Out[70]: 2002.002737850787

Of course if you have a coordinate object with a million elements this will blow up...

@Cadair
Member

Cadair commented Feb 15, 2015

from_pandas would be nice if such a thing doesn't already exist.

@astrofrog
Member Author

@Cadair - this PR does include from_pandas. It's pretty simple for now, but once I start messing with the masking it might start to get more complex.

@astrofrog
Member Author

@taldcroft - indeed we need to be able to support the masking. I'll play around with it.

@astrofrog
Member Author

Note to self: regarding the masking, I think that when masked columns are present they need to be converted to float to be able to represent NaN (and in that case a warning should be raised about the data type loss). For string arrays, missing values are represented by passing an object array that contains None.
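
For illustration, a minimal sketch of the conversion just described, using only numpy and pandas (the variable names are illustrative, not from the PR):

import numpy as np
import pandas as pd

# Masked integer data: pandas has no integer NaN, so cast to float and
# fill the masked entries with NaN (the original int dtype is lost).
values = np.ma.array([1, 2, 3], mask=[False, True, False])
numeric_series = pd.Series(values.astype(float).filled(np.nan))

# Masked string data: use an object array with None for the missing
# entries, which pandas treats as missing values.
strings = np.array(['a', 'b', 'c'], dtype=object)
mask = np.array([False, True, False])
strings[mask] = None
string_series = pd.Series(strings)

print(numeric_series.dtype)  # float64, with NaN at index 1
print(string_series[1])      # None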

@Cadair
Member

Cadair commented Feb 15, 2015

@astrofrog I should probably read things better :p

@taldcroft
Copy link
Member

@astrofrog - yes, now I remember not liking the way pandas handles missing values. Works great for float, not so much for any other type.

@astrofrog
Member Author

This should now support masked values. At the moment the caveat is that if a pandas series does not contain any masked values, it gets converted to a Column, and if it contains any masked values, to a MaskedColumn. One can imagine having a masked column that happens to have no masked values in a specific table, but here this would just get converted to a Column. I think this is ok, but I'd be interested in hearing if not. Also, the original numerical dtypes are lost for masked columns, since pandas needs to convert these to float64 (to be able to represent NaN).

Anyway, this is ready for an initial review/testing to see if there is anything that I have overlooked.
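
A short usage sketch of the behaviour described above, assuming the to_pandas/from_pandas API as merged here (i.e. astropy 1.1 or later):

import numpy as np
import pandas as pd
from astropy.table import Table

# A masked integer column: one value is masked out.
t = Table({'a': np.arange(3)}, masked=True)
t['a'].mask = [False, True, False]

df = t.to_pandas()
print(df['a'].dtype)           # float64 -- the masked value becomes NaN,
                               # so the original int dtype is lost

t2 = Table.from_pandas(df)
print(type(t2['a']).__name__)  # MaskedColumn, since the series contains NaN

# A series with no missing values comes back as a plain Column.
t3 = Table.from_pandas(pd.DataFrame({'b': [1, 2, 3]}))
print(type(t3['b']).__name__)  # Column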

@astrofrog
Member Author

If this becomes more complicated I could also move the code to pandas.py and then call helper functions from to_pandas and from_pandas.

@gully

gully commented Feb 17, 2015

OK, I just ran this on my table which was the output of a 791 x 17 astroquery job. It worked just fine (not that there was any doubt), with exactly the behavior I would expect. Thank you @astrofrog !
👍

astrofrog changed the title from "WIP: interface Table to pandas DataFrame" to "Interface Table to pandas DataFrame" on Feb 18, 2015
@astrofrog
Member Author

For now I'm leaving the conversion disabled for mixin columns, since I think we may want to decide how to handle these better. For example, for a Time column we should be able to actually make a date/time series in pandas, etc.

I've added a documentation page but I'm getting a weird issue - to_pandas and from_pandas are not showing up in the API docs. I must be missing something obvious, so I will look into it later (if anyone spots the issue in the meantime, please let me know!)
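
For reference, a sketch of what a Time conversion could look like (illustrative only; it uses the Time.datetime attribute that comes up later in this thread, and is not part of this PR):

import pandas as pd
from astropy.table import Table
from astropy.time import Time

t = Table({'obstime': Time(['2001-01-01', '2002-01-02'])})

# Time.datetime yields an array of datetime.datetime objects, which
# pandas converts into a proper datetime64[ns] series.
series = pd.Series(t['obstime'].datetime)
print(series.dtype)  # datetime64[ns]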

Contributor

Rather than check beforehand, maybe let DataFrame raise an error if something bad is passed in? It keeps the code cleaner, and if pandas starts to support multi-dimensional columns, things will just work. (Or, if you think it is important to give a clearer error message, maybe have the actual data frame assignments inside try/except clauses?)

Member Author

The issue is that the pandas error messages are not always very explicit. For the mixin columns it might actually work but just be very inefficient, so I prefer to control it at this stage. We could put it in a try...except, but I'm not sure it's easy to tell what the issue was in order to raise a better error. I can look into it, but would be interested in @taldcroft's opinion here too.

Member

@mhvk - the error that Pandas gives is pretty cryptic:

In [13]: t = Table([[[1,2]]])
In [14]: print(t)
col0 [2]
--------
  1 .. 2
In [15]: pandas.DataFrame(np.array(t))
...
ValueError: If using all scalar values, you must pass an index

I think checking beforehand and raising our own exception is better. Also, I don't think support for multi-d columns is coming any time soon. I think this would cut rather deeply into the fundamental data structures (though I'm not an expert in pandas, so I could be wrong). I asked Wes about this several years ago and the answer was hmm... would have to think about that...

Contributor

Definitely no big deal. Though since every usage of the dtype property recreates the dtypes of all the columns, maybe write:

if any(getattr(col, 'ndim', 1) > 1 for col in self.columns.values()):
    raise ...

Member Author

Done

Member

Missing an else: clause here for the case where column is not a MaskedColumn. (Without mixins that can't actually happen, but it's good to future-proof things.) Inspired by @mdboom, I've been using an idiom of collecting columns in a dict prior to making the table. So something like this might generically work, and it doesn't require an intermediate numpy array in the non-masked case. This could also support special-casing for certain mixins like you mentioned, e.g. Time => datetime:

out = OrderedDict()
for name, column in self.columns.items():
    if isinstance(column, MaskedColumn):
        ....
    elif isinstance(column, Time):
        out[name] = column.datetime  # this actually works in pandas, don't know if there is a more efficient way though
    else:  # Not a masked column
        out[name] = column

return DataFrame(out)

Member

You can take this further and support the existing built-in mixins.

elif isinstance(column, Quantity):
    out[name] = column.value
elif isinstance(column, SkyCoord):
    for rcn in column.representation_component_names:
        outname = name + '__' + rcn  # I like double underscore here for better namespace separation
        if outname in self.colnames: 
            # do something...
        out[outname] = getattr(column, rcn)

Contributor

👍 to going via an OrderedDict rather than as_array. I like that in that case DataFrame gets set in only one place. Furthermore, it would still allow the check on column dimensionality to be done in a try/except clause -- I mean to keep the if statement, but only execute it if there was an actual error:

try:
    d = DataFrame(out)
except:
    if any(...):
        raise ValueError("Multi...")
    else:
        raise

The advantage (I think) is that the normal path is both as readable and as fast as it can possibly be, while the exception is still informative.

Member Author

I've switched to OrderedDict. Mixin column support can come later.

@mhvk
Contributor

mhvk commented Feb 18, 2015

On the mixin: @taldcroft - I like the idea of supporting Time and SkyCoord, but think we should not do that in this PR. This is mostly because I think the way the objects should be stored in a pandas dataframe should be left to the classes themselves. Indeed, they seem really good examples of what could go into the ColumnAttribute class that we discussed before; just like we thought it could override __str__, it could have a to_pandas and from_pandas method. If so, in the routine here one would simply check whether the column had that attribute; indeed, with that in place, it would be good to just define such an attribute on MaskedColumn rather than special-case it here. See new issue #3518.

@taldcroft
Member

For reference, there is a long-running but still-active pandas issue pandas-dev/pandas#2485 about allowing metadata to be attached to pandas objects (with milestone "someday"). In order to have any hope of round-tripping astropy mixin columns we would need a metadata container on the pandas side.

@taldcroft
Member

@mhvk - I agree with your proposal for the optimal implementation using the framework of ColumnAttribute (when it exists), but I don't see why we shouldn't put in the functionality now in a hardwired way if it won't change the user API. That would allow testing the basic interface concept in astropy master sooner rather than later. (Of course for SkyCoord there is a real API decision about mapping one coordinate column to multiple pandas columns.)

OTOH @astrofrog might prefer to just limit the scope of this PR to non-mixins for the sake of getting something done and merged, and that would be fine.

taldcroft added a commit that referenced this pull request Apr 25, 2015
Interface Table to pandas DataFrame
taldcroft merged commit 0cf8f67 into astropy:master Apr 25, 2015
@astrofrog
Member Author

@barentsen - just FYI, this has been merged! Let me know if you try it out and have any issues!

@embray
Member

embray commented May 4, 2015

Oh, I mistakenly assumed this was using the unified I/O framework. To which my question becomes, why not? Table.read(some_dataframe) might be nice?

@astrofrog
Member Author

@embray - we could support that but I'd still recommend we keep the explicit methods in addition. In particular, writing is a bit trickier API-wise.

@taldcroft
Member

I actually think that read() should be restricted to reading from a file-like object. A more natural way would be t = Table(some_dataframe).

@embray
Member

embray commented May 5, 2015

I'm not sure I necessarily agree with that. For example, an "HDUList" object (despite its odd name) is sort of a high-level "file-like" object representing a FITS file, though not in the sense you mean (i.e. a stream with read and possibly seek methods). But it's still an object used for reading out of a FITS file, and it's the recommended object to use when passing a FITS file open for reading between different parts of code.

I think for Pandas you have more of a point though. A Pandas DF is an abstract data structure that is not tied to any particular file format.

@abigailStev
Contributor

I'm attempting to use .to_pandas() on an astropy Table object (from astroquery) and it's not working. My astropy is up-to-date from conda. Could someone more knowledgeable help me with this?

I would attach the IPython notebook (using Python 2) I've been testing it with, but that's not allowed, so here's the code:

from astroquery.sdss import SDSS
import astropy ## Using astropy v 1.0.4 (from Conda)
import pandas as pd

#### Get some data from the SDSS SQL server
cmd = """
SELECT TOP 10000
    p.u, p.g, p.r, p.i, p.z, s.class, s.z, s.zerr
FROM
    PhotoObj AS p
JOIN
    SpecObj AS s ON s.bestobjid = p.objid
WHERE
    p.u BETWEEN 0 AND 19.6 AND
    p.g BETWEEN 0 AND 20 AND
    (s.class = 'STAR' OR s.class = 'GALAXY' OR s.class = 'QSO')
"""
data = SDSS.query_sql(cmd)  # Loads data as an astropy Table
type(data)
#### yields: astropy.table.table.Table
df = data.to_pandas()
#### Gives an error!
"""
 ---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-d0f0089f5e2a> in <module>()
----> 1 df = data.to_pandas()

AttributeError: 'Table' object has no attribute 'to_pandas'
"""

@embray
Member

embray commented Oct 2, 2015

@abigailStev This feature will be in Astropy v1.1.0 which isn't out yet. If you got Astropy through conda then you'd only get the most recent release which is v1.0.4 (unless there's some mechanism to get updates from master).

Where'd you find out about this feature?
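
A quick way to check which astropy is actually being imported (a generic snippet; to_pandas/from_pandas need 1.1 or later):

import astropy
print(astropy.__version__)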

@abigailStev
Contributor

In this issue right here. Hadn't realized it wasn't out yet! I'll be patient :)

@embray
Member

embray commented Oct 2, 2015

@abigailStev Usually you can check by looking at the milestone. That will tend to correspond to the lowest version number that something is available in. (There isn't a great way to tell whether that version has been released yet, but if the milestone isn't closed it probably hasn't been; or just check the tags or changelog or something :/)

@bsipocz
Member

bsipocz commented Oct 3, 2015

@abigailStev The other option is to look at the version number in the upper left corner of the docs. For the pandas page on "latest", it's v1.1.dev..., meaning 1.1 is not out yet.

@embray
Member

embray commented Oct 5, 2015

That's why I asked--if @abigailStev had found it via the docs I was going to point that out. Good to know anyways though.

