-
Notifications
You must be signed in to change notification settings - Fork 4k
Add support for handling raw arrow data #10191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I gave it a first pass and it already looks great! The API of arrow / pandas seems to be very involved, so thanks a lot for all the refactoring and making it easier to use. I think the main feedback is to have a look at the comments and parameters of the internal APIs again and add some more information about them, especially the column-touching ones. It feels like there is a lot of intrinsic knowledge of yours about the data structure ( |
|
I added more comments and clarifications. I think the easiest way to understand the overall parsing/pre-processing logic is by starting with |
| // Load all index data cells: | ||
| // Load all cell data for index columns. | ||
| // Will be empty if the table was not processed through Pandas. | ||
| const indexData = parseIndexData(table, pandasSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the comment makes it sound like this is Pandas only. If so, the variable and function names should be renamed similar to the other renamings you did, e.g. pandasIndexData and parsePandasIndexData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed it 👍
raethlein
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🐼
## Describe your changes This PR makes the usage of Pandas metadata in serialized Arrow tables entirely optional. This means that our arrow-backed frontend components (e.g. table, dataframe, vega lite) can work with raw arrow data that wasn't processed by Pandas. This PR also simplifies the Quiver API to a small number of public methods: - `columnNames`: Matrix of column names of the index- & data-columns. - `columnTypes`: List of column types for every index- & data-column - `dimensions`: Dimensions of the DataFrame - `getCell`: Return a single cell from an index- or data-column. - `hash`: A hash that identifies the underlying data. - `styler`: Pandas Styler data. - `addRows`: Add the contents of another table to this table. ## GitHub Issue Link (if applicable) - Closes streamlit#5606 ## Testing Plan - Added usage of a raw pyarrow table and array to data mocks. - Update some snapshots with expected changes. - Update a huge number of unit tests in frontend to conform with the new quiver interface and type information. --- **Contribution License Agreement** By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.
Describe your changes
This PR makes the usage of Pandas metadata in serialized Arrow tables entirely optional. This means that our arrow-backed frontend components (e.g. table, dataframe, vega lite) can work with raw arrow data that wasn't processed by Pandas.
This PR also simplifies the Quiver API to a small number of public methods:
columnNames: Matrix of column names of the index- & data-columns.columnTypes: List of column types for every index- & data-columndimensions: Dimensions of the DataFramegetCell: Return a single cell from an index- or data-column.hash: A hash that identifies the underlying data.styler: Pandas Styler data.addRows: Add the contents of another table to this table.GitHub Issue Link (if applicable)
Testing Plan
Contribution License Agreement
By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.