
Conversation

@astrofrog
Member

This is simple but often requested by users. It still needs docs, but let me know if you have any feedback in the meantime.

Remaining to-dos:

  • Add docs

Closes #2804

Member

I think this can be a bit cleaner: loop with for column in self.columns.values() and then check column.ndim > 1. If you prefer the functional style, if any(col.ndim > 1 for col in self.columns.values()): is even better.
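
For illustration, a minimal sketch of the suggested guard (the helper name and its placement inside to_pandas are assumptions, not code from this PR):

def _check_columns_are_1d(table):
    # A pandas DataFrame column can only hold 1-d data, so refuse the
    # conversion when any column is multi-dimensional.
    if any(col.ndim > 1 for col in table.columns.values()):
        raise ValueError("Cannot convert a table with multi-dimensional "
                         "columns to a pandas DataFrame")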

@taldcroft
Member

This may be a can of worms you don't want to open, but what about masked data? This is supported in both Table and DataFrame, but in different ways. At the least you should do some checking and raise exceptions if not supported.

@taldcroft
Member

I think you can support mixin columns since they just go to object arrays in as_array():

In [66]: t = Table([Time(['2001:001', '2002:002']), ['a', 'b'], [1, 2]], names=['a', 'b', 'c'])

In [67]: nt = t.as_array()

In [68]: ntd = OrderedDict((name, nt[name]) for name in nt.dtype.names)

In [69]: df = DataFrame(ntd)

In [70]: df['a'][1].jyear
Out[70]: 2002.002737850787

Of course if you have a coordinate object with a million elements this will blow up...

@Cadair
Member

Cadair commented Feb 15, 2015

from_pandas would be nice if such a thing doesn't already exist.

@astrofrog
Member Author

@Cadair - this PR does include from_pandas. It's pretty simple for now, but once I start messing with the masking it might start to get more complex.

@astrofrog
Member Author

@taldcroft - indeed we need to be able to support the masking. I'll play around with it.

@astrofrog
Member Author

Note to self: regarding the masking, I think that when masked columns are present they need to be converted to float to be able to represent NaN (and in that case a warning should be raised about the data type loss). For string arrays, missing values are represented by passing an object array that contains None.
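
For illustration, a minimal sketch of the conversion just described, using only numpy and pandas (the variable names are illustrative, not from the PR):

import numpy as np
import pandas as pd

# Masked integer data: pandas has no integer NaN, so cast to float and
# fill the masked entries with NaN (the original int dtype is lost).
values = np.ma.array([1, 2, 3], mask=[False, True, False])
numeric_series = pd.Series(values.astype(float).filled(np.nan))

# Masked string data: use an object array with None for the missing
# entries, which pandas treats as missing values.
strings = np.array(['a', 'b', 'c'], dtype=object)
mask = np.array([False, True, False])
strings[mask] = None
string_series = pd.Series(strings)

print(numeric_series.dtype)  # float64, with NaN at index 1
print(string_series[1])      # None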

@Cadair
Member

Cadair commented Feb 15, 2015

@astrofrog I should probably read things better :p

@taldcroft
Copy link
Member

@astrofrog - yes, now I remember not liking the way pandas handles missing values. Works great for float, not so much for any other type.

@astrofrog
Member Author

This should now support masked values. At the moment the caveat is that if a pandas series does not contain any masked values, it gets converted to a Column, and if it contains any masked values, to a MaskedColumn. One can imagine having a masked column that happens to have no masked values in a specific table, but here this would just get converted to a Column. I think this is ok, but I'd be interested in hearing if not. Also, the original numerical dtypes are lost for masked columns, since pandas needs to convert these to float64 (to be able to represent NaN).

Anyway, this is ready for an initial review/testing to see if there is anything that I have overlooked.
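
A short usage sketch of the behaviour described above, assuming the to_pandas/from_pandas API as merged here (i.e. astropy 1.1 or later):

import numpy as np
import pandas as pd
from astropy.table import Table

# A masked integer column: one value is masked out.
t = Table({'a': np.arange(3)}, masked=True)
t['a'].mask = [False, True, False]

df = t.to_pandas()
print(df['a'].dtype)           # float64 -- the masked value becomes NaN,
                               # so the original int dtype is lost

t2 = Table.from_pandas(df)
print(type(t2['a']).__name__)  # MaskedColumn, since the series contains NaN

# A series with no missing values comes back as a plain Column.
t3 = Table.from_pandas(pd.DataFrame({'b': [1, 2, 3]}))
print(type(t3['b']).__name__)  # Column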

@astrofrog
Member Author

If this becomes more complicated I could also move the code to pandas.py and then call helper functions from to_pandas and from_pandas.

@gully

gully commented Feb 17, 2015

OK, I just ran this on my table which was the output of a 791 x 17 astroquery job. It worked just fine (not that there was any doubt), with exactly the behavior I would expect. Thank you @astrofrog !
👍

astrofrog changed the title from "WIP: interface Table to pandas DataFrame" to "Interface Table to pandas DataFrame" on Feb 18, 2015
@astrofrog
Member Author

For now I'm leaving the conversion disabled for mixin columns, since I think we may want to decide how to handle these better. For example, for a Time column we should be able to actually make a date/time series in pandas, etc.

I've added a documentation page but I'm getting a weird issue - to_pandas and from_pandas are not showing up in the API docs. I must be missing something obvious, so I will look into it later (if anyone spots the issue in the meantime, please let me know!)
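
For reference, a sketch of what a Time conversion could look like (illustrative only; it uses the Time.datetime attribute that comes up later in this thread, and is not part of this PR):

import pandas as pd
from astropy.table import Table
from astropy.time import Time

t = Table({'obstime': Time(['2001-01-01', '2002-01-02'])})

# Time.datetime yields an array of datetime.datetime objects, which
# pandas converts into a proper datetime64[ns] series.
series = pd.Series(t['obstime'].datetime)
print(series.dtype)  # datetime64[ns]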

Contributor

Rather than check beforehand, maybe let DataFrame raise an error if something bad is passed in? It keeps the code cleaner, and if pandas starts to support multi-dimensional columns, things will just work. (Or, if you think it is important to give a clearer error message, maybe have the actual data frame assignments inside try/except clauses?)

Member Author

The issue is that the pandas error messages are not always very explicit. For the mixin columns it might actually work but just be very inefficient, so I prefer to control it at this stage. We could put it in a try...except, but I'm not sure it's easy to tell what the issue was in order to raise a better error. I can look into it, but would be interested in @taldcroft's opinion here too.

Member

@mhvk - the error that Pandas gives is pretty cryptic:

In [13]: t = Table([[[1,2]]])
In [14]: print(t)
col0 [2]
--------
  1 .. 2
In [15]: pandas.DataFrame(np.array(t))
...
ValueError: If using all scalar values, you must pass an index

I think checking beforehand and raising our own exception is better. Also, I don't think support for multi-d columns is coming any time soon. I think this would cut rather deeply into the fundamental data structures (though I'm not an expert in pandas, so I could be wrong). I asked Wes about this several years ago and the answer was hmm... would have to think about that...

Contributor

Definitely no big deal. Though since every usage of the dtype property recreates the dtypes of all the columns, maybe write:

if any(getattr(col, 'ndim', 1) > 1 for col in self.columns.values()):
    raise ...

Member Author

Done

Member

Missing an else: clause here for the case where column is not a MaskedColumn. (Without mixins that can't actually happen, but it's good to future-proof things.) Inspired by @mdboom, I've been using an idiom of collecting columns in a dict prior to making the table. So something like this might generically work, and it doesn't require an intermediate numpy array in the non-masked case. This could also support special-casing for certain mixins like you mentioned, e.g. Time => datetime:

out = OrderedDict()
for name, column in self.columns.items():
    if isinstance(column, MaskedColumn):
        ....
    elif isinstance(column, Time):
        out[name] = column.datetime  # this actually works in pandas, don't know if there is a more efficient way though
    else:  # Not a masked column
        out[name] = column

return DataFrame(out)

Member

You can take this further and support the existing built-in mixins.

elif isinstance(column, Quantity):
    out[name] = column.value
elif isinstance(column, SkyCoord):
    for rcn in column.representation_component_names:
        outname = name + '__' + rcn  # I like double underscore here for better namespace separation
        if outname in self.colnames: 
            # do something...
        out[outname] = getattr(column, rcn)

Contributor

👍 to going via an OrderedDict rather than as_array. I like that in that case DataFrame gets set in only one place. Furthermore, it would still allow the check on column dimensionality to be done in a try/except clause -- I mean to keep the if statement, but only execute it if there was an actual error:

try:
    d = DataFrame(out)
except:
    if any(...):
        raise ValueError("Multi...")
    else:
        raise

The advantage (I think) is that the normal path is both as readable and as fast as it can possibly be, while the exception is still informative.

Member Author

I've switched to OrderedDict. Mixin column support can come later.

@mhvk
Contributor

mhvk commented Feb 18, 2015

On the mixin: @taldcroft - I like the idea of supporting Time and SkyCoord, but think we should not do that in this PR. This is mostly because I think the way the objects should be stored in a pandas dataframe should be left to the classes themselves. Indeed, they seem really good examples of what could go into the ColumnAttribute class that we discussed before; just like we thought it could override __str__, it could have a to_pandas and from_pandas method. If so, in the routine here one would simply check whether the column had that attribute; indeed, with that in place, it would be good to just define such an attribute on MaskedColumn rather than special-case it here. See new issue #3518.

@taldcroft
Member

For reference, there is a long-running but still-active pandas issue pandas-dev/pandas#2485 about allowing metadata to be attached to pandas objects (with milestone "someday"). In order to have any hope of round-tripping astropy mixin columns we would need a metadata container on the pandas side.

@taldcroft
Member

@mhvk - I agree with your proposal for the optimal implementation using the framework of ColumnAttribute (when it exists), but I don't see why we shouldn't put in the functionality now in a hardwired way if it won't change the user API. That would allow testing the basic interface concept in astropy master sooner rather than later. (Of course for SkyCoord there is a real API decision about mapping one coordinate column to multiple pandas columns.)

OTOH @astrofrog might prefer to just limit the scope of this PR to non-mixins for the sake of getting something done and merged, and that would be fine.

taldcroft added a commit that referenced this pull request Apr 25, 2015
Interface Table to pandas DataFrame
taldcroft merged commit 0cf8f67 into astropy:master Apr 25, 2015
@astrofrog
Member Author

@barentsen - just FYI, this has been merged! Let me know if you try it out and have any issues!

@embray
Member

embray commented May 4, 2015

Oh, I mistakenly assumed this was using the unified I/O framework. To which my question becomes, why not? Table.read(some_dataframe) might be nice?

@astrofrog
Member Author

@embray - we could support that but I'd still recommend we keep the explicit methods in addition. In particular, writing is a bit trickier API-wise.

@taldcroft
Member

I actually think that read() should be restricted to reading from a file-like object. A more natural way would be t = Table(some_dataframe).

@embray
Member

embray commented May 5, 2015

I'm not sure I necessarily agree with that. For example, an "HDUList" object (despite its odd name) is sort of a high-level "file-like" object representing a FITS file, though not in the sense you mean (i.e. a stream with read and possibly seek methods). But it's still an object used for reading out of a FITS file, and it's the recommended object to use when passing a FITS file open for reading between different parts of code.

I think for Pandas you have more of a point though. A Pandas DF is an abstract data structure that is not tied to any particular file format.

@abigailStev
Contributor

I'm attempting to use .to_pandas() on an astropy Table object (from astroquery) and it's not working. My astropy is up-to-date from conda. Could someone more knowledgeable help me with this?

I would attach the IPython notebook (using Python 2) I've been testing it with, but that's not allowed, so here's the code:

from astroquery.sdss import SDSS
import astropy ## Using astropy v 1.0.4 (from Conda)
import pandas as pd

#### Get some data from the SDSS SQL server
cmd = """
SELECT TOP 10000
    p.u, p.g, p.r, p.i, p.z, s.class, s.z, s.zerr
FROM
    PhotoObj AS p
JOIN
    SpecObj AS s ON s.bestobjid = p.objid
WHERE
    p.u BETWEEN 0 AND 19.6 AND
    p.g BETWEEN 0 AND 20 AND
    (s.class = 'STAR' OR s.class = 'GALAXY' OR s.class = 'QSO')
"""
data = SDSS.query_sql(cmd)  # Loads data as an astropy Table
type(data)
#### yields: astropy.table.table.Table
df = data.to_pandas()
#### Gives an error!
"""
 ---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-d0f0089f5e2a> in <module>()
----> 1 df = data.to_pandas()

AttributeError: 'Table' object has no attribute 'to_pandas'
"""

@embray
Member

embray commented Oct 2, 2015

@abigailStev This feature will be in Astropy v1.1.0 which isn't out yet. If you got Astropy through conda then you'd only get the most recent release which is v1.0.4 (unless there's some mechanism to get updates from master).

Where'd you find out about this feature?
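
A quick way to check which astropy is actually being imported (a generic snippet; to_pandas/from_pandas need 1.1 or later):

import astropy
print(astropy.__version__)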

@abigailStev
Contributor

In this issue right here. Hadn't realized it wasn't out yet! I'll be patient :)

@embray
Member

embray commented Oct 2, 2015

@abigailStev Usually you can check by looking at the milestone. That will tend to correspond to the lowest version number that something is available in. (There isn't a great way to tell whether that version has been released yet, but if the milestone isn't closed it probably hasn't been; or just check the tags or changelog or something :/)

@bsipocz
Member

bsipocz commented Oct 3, 2015

@abigailStev The other option is to look at the version number in the upper left corner of the docs. For the pandas page on "latest", it's v1.1.dev..., meaning 1.1 is not out yet.

@embray
Member

embray commented Oct 5, 2015

That's why I asked--if @abigailStev had found it via the docs I was going to point that out. Good to know anyways though.

