Skip to content

Conversation

@taldcroft
Copy link
Member

Hi,

I've started using the Table class and really like it, but one annoyance is that one has to know whether a column exists before assigning to it, as, when it does not, an exception is thrown. Would it be an idea to try to create the column automatically (i.e., have behaviour more closely similar to dict),

i.e.,

mytable['new'] = mytable['old']
or
mytable['new'] = 'test'

would create a new column. A very simple (and likely imperfect) implementation could replace the current setitem with:

     def __setitem__(self, item, value):
        try:
            self._data[item] = value
        except (ValueError, KeyError, TypeError):
            if hasattr(value,'__len__') and len(value) == len(self):
                self.add_column(Column(item, value))
            else:
                self.add_column(Column(item, [value]*len(self)))
        except:
            raise

@taldcroft
Copy link
Member

This is an interesting idea and I agree there is something nice in making it so easy to make new columns. I think the implementation will be straightforward.

The main concern I have is about the possibility for allowing errors by effectively removing the current validation that the column exists when assigning to that column. For example if you make a typo, or have a program error that generates the wrong column name, then Table will happily create that new column instead of raising an exception and indicating that something unexpected is happening.

@astrofrog @eteq or others - do you have any thoughts? I'm not totally sure where to draw the line between syntactic convenience and preventing foot shooting.

@eteq
Copy link
Member

eteq commented Feb 4, 2013

Hmm... I see both sides @taldcroft. What about a solution with the following behavior:

>>> mytable['new'] = 'newcolname'
TypeError: 'newcolname' is a string, not a Column
>>> mytable['new'] = Column('', data=whatever) #succeeds - name in Column init doesn't matter because it's overwritten
>>> mytable['replacement'] = mytable['old']  # works fine b/c 'old' is already a column - this just renames it.

That is, check for Column or subclasses, and only accept them. I can't really think of a case where you might accidentally do this without meaning to create a new column, whereas I could imagine it if you allow any iterable as the RHS of __setitem__.

@mhvk - do you think this would meet the use case you had in mind?

@mhvk
Copy link
Contributor Author

mhvk commented Feb 4, 2013

Hi Erik, Tom,

Thanks for your quick replies. My use case is one where I have a table
with measurements and have a function that runs some analysis on it,
adding the results in new columns. I would seem nicest not to have to
worry about whether these columns exist, i.e., whether I am running the
routine for a second time with perhaps some changes to the parameters.
(This is partly informed by decades worth of conditioning by ESO-MIDAS's
compute/table file :col = ,
which was one of the main reasons to sticking with MIDAS for as long as
I did -- but porting my MIDAS scripts and fortran routines to python has
been surprisingly painless -- thanks at least in part to astropy -- Time
is marvellous too).

Anyway, I'd prefer contents of a column and creating one to be as
similar as possible. Within python, it also seems closest to the
behaviour of OrderedDict, which Table (somewhat) looks like. But I
realise I am ignoring worries about shooting oneself in the foot (but
then, I did that so frequently that further damage may not be possible).

As for the solutions with explicitly adding Column(...), that would be
an improvement but seems cluttery. Ideally, one would duck type, if
only to avoid issues like, e.g., ds9's .set_pyfits not recognizing
objects read in with astropy.io.fits).

One smaller comment on what Erik wrote:

mytable['new'] = 'newcolname'
TypeError: 'newcolname' is a string, not a Column
mytable['new'] = Column('', data=whatever) #succeeds - name in Column init doesn't matter because it's overwritten
mytable['replacement'] = mytable['old'] # works fine b/c 'old' is already a column - this just renames it.

Would one actually want the last one to be a rename? Looking at such an
equation, I would intuitively assume that what I get is a column called
"replacement" that has the same contents as "old" (perhaps by pointing
at the same data rather than by copying), i.e., the same as if
"replacement" already existed. But perhaps I'm simply misunderstanding;
the documentation lists the more obvious renaming methods

t.rename_column('a', 'a_new')
mytable['old'].name = 'replacement'

All the best,

Marten

Prof. M. H. van Kerkwijk
Dept. of Astronomy & Astroph., 50 St George St., Toronto, ON, M5S 3H4, Canada
McLennan Labs, room 1203B, tel: +1(416)9467288, fax: +1(416)9467287
[email protected], http://www.astro.utoronto.ca/~mhvk

@eteq
Copy link
Member

eteq commented Feb 4, 2013

I guess when I think about it more, it's not clear to me what anyone could mean by tab['newtable'] = somearray other than tab.add_column(Column('newtable',data=somearray)). So the need for an explicit call to Column is not necessary unless you want to add units or other extra information or whatever.

And after re-reading your post, @taldcroft, I see what I said before is not fully addressing what you're worried about, as you were thinking of typos when you mean to use an already-existing column. I do see the concern there too, but when I think about it, adding this behavior seems consistent with other things in python. E.g., a dict doesn't raise an exception if you insert a new item into it without using a set method or something. So I think convenience here weakly trumps caution.

@astrofrog
Copy link
Member

I need to think about this, and am not very keen at first glance, even though it is appealing from a convenience point of view.

What would:

tab['c'][0] = 1

do if 'c' does not already exist? and does:

tab['b'] = tab['a']

copy the data, or is it a reference? One could limit the cases to:

tab['newcolumn'] = x

where x is already a column instance, i.e. no numpy arrays, scalars, etc. That might prevent some foot shooting?

@eteq
Copy link
Member

eteq commented Feb 4, 2013

@astrofrog - here's what I was thinking:

tab['c'][0] = 1

raises KeyError if 'c' does not exist, because tab['c'] uses __getitem__, which still fails if nothing is present.

tab['b'] = tab['a']

I would think would create a new column named 'b'. I just noticed I didn't respond to @mhvk's comment on this - I mis-spoke before and didn't mean "rename" - I mean what I just said here - it would a new column named 'b', that has the same contants as 'a'.

There is an issue here that occurs to me: if you do tab['a'] = Column('b',...), it overwrites the 'b' name, which may not be intended. In such a case it should log a warning or something (except perhaps in the tab['a'] = Column('', ...) case).

tab['newcolumn'] = x

That was what I was discussing above: at first I was thinking only allow Column objects, but when I thought for a while, I couldn't think of any scenario where I would do this and not mean to create a new column with name 'newcolumn' and data of x (except if 'newcolumn' is a typo, that is). My opinion here is not terribly strong, though - requiring a Column is also fine with me.

@taldcroft
Copy link
Member

To address some points in the previous couple of comments:

tab['b'] = tab['a']

This would definitely copy the data. This is required in any case because in the underlying ndarray structured array cannot have one column that is a view of another array. That always happens at the point of adding a column, which creates a brand new ndarray.

tab['a'] = Column('b', ...)

Not a problem wrt to the name because it would always ignore the name of the RHS column. In this syntax the name of the new column is uniquely taken from the key item ('a' in this case).

Note that tab['a'] does return a Column('a', ..) object, so tab['b'] = tab['a'] is effectively the same as tab['b'] = Column('a', ..).

If you think about something in __setitem__ like:

new_col = Column(key, val, length=len(table))  # Column does validation, coercion broadcasting etc
table.add_column(newcol)

(This is sort of pseudo-code). I think it's actually pretty clean and let's you put anything on the right hand side val that would work to initialize a column. There are probably details that will need to be addressed, but on the whole I think it will be fairly clean and robust, and use the existing Column machinery.

@taldcroft
Copy link
Member

Anyway, I'm going to take a quick stab at this and see if any problems come up. Hopefully there will be a PR and code soon which should make it easier to decide on the behavior.

@astrofrog
Copy link
Member

@taldcroft - awesome, thanks!

@eteq
Copy link
Member

eteq commented Feb 4, 2013

Sounds good, @taldcroft

One quick comment on the tab['a'] = Column('b', ...) case. I was just suggesting that this should generate a warning (or maybe just go to the INFO log) that tells the user that the pre-existing name of the column is being ignored. That gives weak protection against the case where you do tab['adatasetr'] = Column('adataset',...), where the first part was just a typo.

I was also thinking you can suppress the warning for tab['a'] = Column('', ...) or possibly tab['a'] = Column(None, ...) so that people can just leave it to the LHS name to set the Column and not have annoying warnings pop up when they know what they're doing.

But maybe you were going to do this anyway - I'll look forward to the PR!

@taldcroft
Copy link
Member

OK, see attached code. So far there is a net addition of like 4 lines of code. See also the ipynb:

http://nbviewer.ipython.org/4710147/

I think I really like this!

@astrofrog
Copy link
Member

Nice! Ok, now starts the hard work of trying to find examples that would confuse users, and if we can't find any, then I think it should go in! (kind of like trying to disprove a scientific hypothesis ;-) )

@taldcroft
Copy link
Member

@eteq - having warnings like that will produce unwanted warnings in the case of something like t['a'] = t['b']. The reason is that t['a'] returns a named column object:

In [2]: ca = Column('a', [1, 2])
In [3]: t = Table([ca])
In [4]: t['a']
Out[4]: 
<Column name='a' units=None format=None description=None>
array([1, 2])

In the Don't Repeat Yourself spirit, for this syntax the RHS column name is irrelevant and should not be cause for even a warning. For the most part users won't need to bother with Column. The exception is probably when there is metadata to handle, though of course that can all be set post-creation.

This does highlight a good use case for anonymous (unnamed) columns. The unfortunate thing is that this would be best handled by a serious API change in Column to where it was initialized using Column(val=None, name=None, ...) instead of the current Column(name, val=None, ...). The current ordering is actually bothersome to me because I often do Column([1, 2]) before I remember the first arg must be name. We can support something like Column(None, [1, 2]) for an unnamed column.

@astrofrog
Copy link
Member

@taldcroft - I wonder whether we made a mistake with the original API - maybe it should be:

t.add_column('a', Column([1,2,3], ...))

and then all this wouldn't be an issue here - that is, a column really only has a name as part of a table (though add_columns would be more complicated...)

@eteq
Copy link
Member

eteq commented Feb 4, 2013

@astrofrog - I've definitely found myself doing things like:

magcols = [tab[band] for band in 'ugriz']
... some code long enough that I've forgotten exactly what all the mags were in magcols ...
for magcol in magcols:
    scatter(something, magcol, label=magcol.name)
legend()

That is, it's useful that columns know their own names, independent of living in a Table. So I think what the original API did is still valuable... it's just that we want to make sure when they're in a table they are subordinate to the Table for their names.

@taldcroft - I see your point with the t['a'] = t['b'] case. But I am worried about the substantial room for confusion in the tab['a'] = Column('b', ...) case - some user who hasn't read the documentation closely might reasonably think that they can later do tab['a'].name =='b' and have it come out True. Or they might think tab['b'] will also get them the same thing.

Perhaps a way around this is to also silence the warning if the Column you're adding has a name that's already in the Table? Then tab['b'] = tab['a'] works fine - it will set the new column's name to 'b', but not give a warning because 'a' is already present. This still leaves a little room for confusion (tab['b'] = Column('a', data=somenewarray)), but that seems like a corner case not worth worrying too much about.

@eteq
Copy link
Member

eteq commented Feb 4, 2013

Oh, and @taldcroft - if we want to change the order to Table(value, name), now is the time to do it (v0.2) so you don't have to break the API when we go to v0.3. I'm personally fine with this - I'd suggest a PR that just switches them around for v0.2, and then add the new features in v0.3.

@eteq
Copy link
Member

eteq commented Feb 4, 2013

I totally missed you had code @taldcroft - I agree this is very nice! (modulo my comments above, which aren't about the major functionality).

Also, nice use of http://nbviewer.ipython.org - I didn't realize you can just toss gists up and have it work. That's super-useful!

@taldcroft
Copy link
Member

@astrofrog - I think the add_column(s) API is clean the way it is. If you want to add columns they need to be named. But I do think that having name be the required first arg in __init__ may have been a mistake.

@eteq - Despite a desire to fix this now, I'm a little hesitant about such a big change so late in the release. This would hit around 15 or 20 files in astropy and maybe 200 lines, and will completely break any calls to Column that are currently in the wild. While the number of users is relatively small now, this kind of sudden API change might not be nice. I think the issue with the current API is relatively minor and using Column(None, val, ...) is an acceptable way to have unnamed columns.

@astrofrog
Copy link
Member

With the API you are suggesting in this PR, I think most users will use it as

t['a'] = np.array([1,2,3])

i.e. there's no point using the Column in most cases, except if they want to add meta-data. But even then, they can do

t['a'] = np.array([1,2,3])
t['a'].unit = u.m

or something like that. So I think having to use Column(None, [1,2,3]) may be a rare case anyway (so it's fine to just have it that way when it's really needed).

@eteq
Copy link
Member

eteq commented Feb 4, 2013

@taldcroft re: name/value swap. Ok that sounds fine. I agree it's not a big deal.

What did you think about my suggestion about warning only if the table name is not already present? I can put a PR against your branch here to implement what I'm thinking, if you like.

@astrofrog - that makes sense. I think at least I personally would still end up using Column a fair amount because I think it's wise to have units in there, but I imagine you're right that a fair number of people will use it the way you described instead of the Column initializer. (Hopefully in forthcoming releases more of that information will make it into ASCII outputs, which would make it more used generally, I think).

@mhvk
Copy link
Contributor Author

mhvk commented Feb 5, 2013

Hi All,

Great to see it implemented. (And nice to see how to do this properly;
makes much more sense to initialize the column and then assign to it,
letting the column mechanism work out whether this is possible (lenth=1
or length(Table).

@taldcroft re: name/value swap. Ok that sounds fine. I agree it's not a big deal.

For what it's worth, as a new user I found it somewhat confusing that
the initialisation of a Table has the reverse order of that of a Column,
i.e., that

t = Table(, names=colnames)

while

c = Column(name, )

It does seem a bit more logical for both to have the same order, and for
that order to match recarray.

All the best,

Marten

@taldcroft
Copy link
Member

@mhvk - sigh, I agree 100% on the disconnect with the order of args for Column vs. everything else. The question is what to do at this stage. The astropy 0.2 release is already at rc1. Making this change is not that difficult technically, but will break any existing user code (which is probably not that much at this point).

See also #731 for a path to fix this by 0.3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this make masked arrays non-masked? Is there a reason for doing this here rather than just passing value to data in NewColumn since the re-casting to a Numpy masked or non-masked array should already happen there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A masked array will fail the isinstance() test and won't get to this line.

I originally wrote basically what you suggested, but it turns out that Column ignores the length kwarg if you pass in data, so it was creating a wrong length column if you passed a scalar. Instead of modifying the Column.__new__ and MaskedColumn.__new__, I decided to just ask numpy for the dtype of non-ndarray inputs in advance and then make an initially-empty Column.

BTW, version 2 was something like (with no if not isinstance(..))

value = np.array(value, subok=True, copy=False)

That would have also preserved masked arrays and not copied values, but I thought it was better to not touch value at all if already an ndarray.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would np.asarray make sense here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdboom - I agree np.asarray would be the preferred method here. Just for my own understanding, is there a situation where there would be any difference between asarray() and array() in the case where the input value is not an ndarray subclass?

@astrofrog
Copy link
Member

I've tagged this with the 0.3 milestone for now, but if @eteq and @taldcroft both think it can be included in 0.2, we can do that too (but it might require substantial improvements to the docs, so maybe 0.3 is easier time-wise).

@taldcroft
Copy link
Member

I think 0.3 is fine.

@eteq
Copy link
Member

eteq commented Feb 5, 2013

yeah, I'd say 0.3 makes the most sense

@embray
Copy link
Member

embray commented Feb 6, 2013

On argument order for Column, if it makes sense to we could definitely change that for 0.3. It could still use type inference to maintain backwards compatibility with the old arguments (with a deprecated warning of course).

@taldcroft
Copy link
Member

So here's a challenge that I couldn't figure out (was going to try stackoverflow, but I suppose I could start here).

Is there a way to know within a function how the arguments were provided? Can I distinguish between:

def func(a=None, b=None):
    pass
func(1, b=2)  # case 1
func(a=1, b=2)  # case 2

I know with inspect I can get the list of args and the values, but I couldn't see how to distinguish between cases 1 and 2. The reason this would be useful is to insert a deprecation warning into 0.2 now that tells people they should not be calling Column('name', data, ...) because this will break in 0.3, and should instead be writing Column(name='name', data=data, ...), which will work in 0.3.

@taldcroft
Copy link
Member

@mhvk, @eteq - But of course! I was trying to be too clever and missing the easy way. This is implemented in #731.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to add
shape=value.shape[1:]

That way even assignment of a multidimensional array will do the right thing -- yes, I'm using that; my Gemini/GMOS tables have columns holding the three chips and two stars on the slit, i.e., shape=(512,2,3)

@mhvk
Copy link
Contributor Author

mhvk commented Feb 9, 2013

Hi All,

I've been using the new auto-assignment in my Gemini/GMOS reduction scripts now, and realised I'd like it even better if it did multidimensional assignments; all that is needed is to add "shape=value.shape[1:]" to the column creation in the new version (I added a line note at 1178 above, but am not sure that information gets propagated).

Thanks!

Marten

@taldcroft
Copy link
Member

@mhvk - excellent idea. This is done in 316b79d and my local tests show that this does the trick. I still need to write a full set of tests.

BTW when you write inline comments they do show up as notifications to those who are participating in the thread.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further use shows that the change from a "try" to an "if" statement breaks assignment to rows by row number, i.e.,
mytable[0] = (somedata)
throws ValueError.
maybe go back to try/except? I guess one could also expand the if statement, but this seems less clean:
if item in self.colnames or isinstance(item, int)
-- Marten

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, I am disappointed that there is no test case of assignment like mytable[0] = (somedata) since this problem should have been caught in test. I'll do that and try to work something out based on your suggestions.

@taldcroft
Copy link
Member

@mhvk - I have the fixed the issue of setting rows and added some tests. It would be great if you can look over the code and tests, and try to break this latest version.

In the end I just made sure that the item is a string. Using try turns out to be problematic because there are many different reasons an exception can come up in the self._data[item] = value line, but only a couple of different exception classes that get raised. Trying to figure this all out is a mess.

This issue of assigning a row to a row is separate from this PR, and is now tracked in #770.

@taldcroft
Copy link
Member

@mhvk @eteq @astrofrog - I think this is basically ready. I noticed it wasn't propagating column metadata in the case of setting from an existing Column, so that is fixed now. I just need to update the docs.

BTW, pandas allows and encourages this syntax, and on reflection it's really the right way to go. I think we were being overly conservative initially about possible problems.

@taldcroft
Copy link
Member

I think this is ready for final review (assuming it passes Travis tests). I rebased against master to get the docs all straight.

@taldcroft
Copy link
Member

If anyone would like more time to review this then let me know, otherwise I'll plan to merge tomorrow.

taldcroft added a commit that referenced this pull request Mar 9, 2013
Use setitem syntax to create a new Table column
@taldcroft taldcroft merged commit 428289e into astropy:master Mar 9, 2013
@taldcroft taldcroft deleted the table-new-col branch March 9, 2013 17:51
@eteq
Copy link
Member

eteq commented Mar 14, 2013

@taldcroft - Sorry I didn't get back to you on this - been a bit swamped... but it looks great to me, anyway, so fortunately there was no need :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants