ARROW-1291: [Python] Cast non-string DataFrame columns to strings in RecordBatch/Table.from_pandas#911
ARROW-1291: [Python] Cast non-string DataFrame columns to strings in RecordBatch/Table.from_pandas#911wesm wants to merge 1 commit intoapache:masterfrom
Conversation
…m_pandas Change-Id: I7fdd4c32b2f54d3003c6b87b9ae13186c35bcec0
|
|
||
|
|
||
| def construct_metadata(df, index_levels, preserve_index, types): | ||
| def construct_metadata(df, column_names, index_levels, preserve_index, types): |
There was a problem hiding this comment.
Why pass the column_names instead of:
column_names = [str(col) for col in df.columns]
?
There was a problem hiding this comment.
these got sanitized earlier as part of creating the schema
| return 0 | ||
|
|
||
|
|
||
| cdef tuple _dataframe_to_arrays( |
There was a problem hiding this comment.
Out of curiosity , why was this written in cython originally?
There was a problem hiding this comment.
Started small, got bigger =)
|
I didn't compare "dataframe_to_arrays" with the original cython implementation too carefully. I assume they are the same except for the column name casting? Otherwise LGTM |
|
|
||
| for name in df.columns: | ||
| col = df[name] | ||
| if not isinstance(name, six.string_types): |
There was a problem hiding this comment.
This allows anything that isn't a string including floats, timestamps, and other any wacky thing someone puts in a column index. Should this be more strict about what type(df.columns) is?
There was a problem hiding this comment.
In a lot of cases it will just be "Index". I'd rather have someone complaining about this rather than pre-emptively guessing what will be the right thing to do
| df['a'] = df['a'].astype('category') | ||
| self._check_pandas_roundtrip(df) | ||
|
|
||
| def test_non_string_columns(self): |
There was a problem hiding this comment.
There should be a test for additional column types that either fails or explicitly succeeds based on what we decide about allowing other types in.
There was a problem hiding this comment.
I suppose we can leave this as the only test right now and say that anything other integers or strings is undefined behavior.
| ), | ||
| 'pandas_version': pd.__version__, | ||
| } | ||
| ).encode('utf8') |
There was a problem hiding this comment.
I'll start making more local variables :)
No description provided.