@lukasmasuch (Collaborator) commented Jan 15, 2025

Describe your changes

This PR makes the use of Pandas metadata in serialized Arrow tables entirely optional. As a result, our Arrow-backed frontend components (e.g. table, dataframe, Vega-Lite) can work with raw Arrow data that was not processed by Pandas.
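
As an illustration of what "optional Pandas metadata" can mean in practice, here is a minimal sketch, not the code from this PR. It assumes the schema's custom metadata is available as a `Map<string, string>` and relies on the fact that pyarrow stores the Pandas metadata as a JSON string under the `pandas` key. The helper name `parseOptionalPandasSchema` and the reduced `PandasSchema` type are hypothetical.

```typescript
// Hypothetical, reduced view of the Pandas metadata; only two fields are typed here.
interface PandasSchema {
  index_columns: unknown[]
  columns: unknown[]
}

// Returns the parsed Pandas metadata, or undefined for raw Arrow data
// that never went through Pandas (no "pandas" key in the schema metadata).
function parseOptionalPandasSchema(
  metadata: Map<string, string>
): PandasSchema | undefined {
  const raw = metadata.get("pandas")
  if (raw === undefined) {
    return undefined
  }
  return JSON.parse(raw) as PandasSchema
}

// Raw Arrow table: no Pandas metadata at all.
const rawArrowMetadata = new Map<string, string>()
// Table serialized from a Pandas DataFrame: metadata is present.
const pandasMetadata = new Map<string, string>([
  [
    "pandas",
    JSON.stringify({ index_columns: ["__index_level_0__"], columns: [] }),
  ],
])

const schemaFromRaw = parseOptionalPandasSchema(rawArrowMetadata)
const schemaFromPandas = parseOptionalPandasSchema(pandasMetadata)
```

Callers can then branch on `undefined` instead of assuming the metadata always exists.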

This PR also simplifies the Quiver API to a small number of public methods:

  • columnNames: Matrix of column names of the index- & data-columns.
  • columnTypes: List of column types for every index- & data-column.
  • dimensions: Dimensions of the DataFrame
  • getCell: Return a single cell from an index- or data-column.
  • hash: A hash that identifies the underlying data.
  • styler: Pandas Styler data.
  • addRows: Add the contents of another table to this table.
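
To make the shape of this reduced surface more concrete, here is a rough sketch of a toy in-memory table exposing the same member names. It is not the real Quiver class: `MiniTable`, the `Dimensions` fields, the string-based hash, and the choice to have `addRows` return a new table are all assumptions made for illustration, and `styler` is left out.

```typescript
// Hypothetical shape; the real Quiver dimensions may carry different fields.
interface Dimensions {
  numRows: number
  numIndexColumns: number
  numDataColumns: number
}

class MiniTable {
  names: string[]
  types: string[]
  rows: unknown[][]
  numIndexColumns: number

  constructor(
    names: string[],
    types: string[],
    rows: unknown[][],
    numIndexColumns: number = 0
  ) {
    this.names = names
    this.types = types
    this.rows = rows
    this.numIndexColumns = numIndexColumns
  }

  // Matrix of column names; this sketch only has a single header row.
  get columnNames(): string[][] {
    return [this.names]
  }

  // One type per index or data column, in column order.
  get columnTypes(): string[] {
    return this.types
  }

  get dimensions(): Dimensions {
    return {
      numRows: this.rows.length,
      numIndexColumns: this.numIndexColumns,
      numDataColumns: this.names.length - this.numIndexColumns,
    }
  }

  // Returns a single cell; index columns come first, then data columns.
  getCell(rowIndex: number, columnIndex: number): unknown {
    const row = this.rows[rowIndex]
    if (row === undefined || columnIndex < 0 || columnIndex >= row.length) {
      throw new RangeError(`Cell (${rowIndex}, ${columnIndex}) is out of range`)
    }
    return row[columnIndex]
  }

  // Stand-in for a real hash: a string derived from the names and the data.
  get hash(): string {
    return JSON.stringify([this.names, this.rows])
  }

  // Assumption: returns a new table rather than mutating this one.
  addRows(other: MiniTable): MiniTable {
    if (other.names.length !== this.names.length) {
      throw new Error("Column count mismatch")
    }
    return new MiniTable(
      this.names,
      this.types,
      [...this.rows, ...other.rows],
      this.numIndexColumns
    )
  }
}

const base = new MiniTable(["id", "value"], ["int64", "string"], [[0, "a"]], 1)
const extra = new MiniTable(["id", "value"], ["int64", "string"], [[1, "b"]], 1)
const combined = base.addRows(extra)
```

The point of the sketch is only that index and data columns are addressed through one uniform cell accessor, so callers do not need to know whether the table carried Pandas metadata.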

GitHub Issue Link (if applicable)

Testing Plan

  • Added a raw pyarrow table and array to the data mocks.
  • Updated some snapshots with the expected changes.
  • Updated a large number of frontend unit tests to conform to the new Quiver interface and type information.

Contribution License Agreement

By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.

@raethlein (Collaborator) commented Jan 16, 2025

I gave it a first pass and it already looks great! The Arrow / Pandas API seems to be very involved, so thanks a lot for all the refactoring and for making it easier to use. My main feedback is to take another look at the comments and parameters of the internal APIs and add more information about them, especially the ones that touch columns. It feels like a lot of your implicit knowledge about the data structure (columns = [], columns = {}, schema, table) is baked in, and writing that knowledge down would make it much easier to follow what's going on for everyone who is not as deep into Arrow as you are 🙂

@lukasmasuch (Collaborator, Author) commented

I added more comments and clarifications. I think the easiest way to understand the overall parsing and pre-processing logic is to start with parseArrowIpcBytes, which goes through all the steps with some explanations.

// Load all index data cells:
// Load all cell data for index columns.
// Will be empty if the table was not processed through Pandas.
const indexData = parseIndexData(table, pandasSchema)
Collaborator

nit: the comment makes it sound like this is Pandas-only. If so, the variable and function should be renamed in line with your other renamings, e.g. pandasIndexData and parsePandasIndexData.

Collaborator Author

Renamed it 👍
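
For readers following this thread, here is a hedged sketch of why the index data can be empty for raw Arrow tables. It is not the function from this PR: `parsePandasIndexDataSketch`, `ColumnarTable`, and the reduced schema type are hypothetical. It relies only on the documented Pandas metadata layout, where `index_columns` entries are either a column name or a range descriptor with `kind: "range"`.

```typescript
// A RangeIndex is stored as a descriptor in the metadata, not as a real column.
type RangeIndexDescriptor = {
  kind: "range"
  name: string | null
  start: number
  stop: number
  step: number
}
type IndexColumn = string | RangeIndexDescriptor

// Hypothetical, reduced view of the Pandas metadata.
interface PandasSchemaSketch {
  index_columns: IndexColumn[]
}

// Hypothetical stand-in for an Arrow table: column name -> cell values.
type ColumnarTable = Record<string, unknown[]>

// Returns one array of cells per index column.
// Without Pandas metadata there are no index columns, so the result is empty.
function parsePandasIndexDataSketch(
  table: ColumnarTable,
  pandasSchema?: PandasSchemaSketch
): unknown[][] {
  if (pandasSchema === undefined) {
    return []
  }
  return pandasSchema.index_columns.map(indexColumn => {
    if (typeof indexColumn === "string") {
      // Regular index: the cells live in a normal column of the table.
      return table[indexColumn] ?? []
    }
    // Range index: materialize the values from start/stop/step.
    const { start, stop, step } = indexColumn
    const values: number[] = []
    if (step === 0) {
      return values
    }
    for (let v = start; step > 0 ? v < stop : v > stop; v += step) {
      values.push(v)
    }
    return values
  })
}

const exampleTable: ColumnarTable = {
  __index_level_0__: ["a", "b"],
  value: [1, 2],
}
const noPandas = parsePandasIndexDataSketch(exampleTable)
const namedIndex = parsePandasIndexDataSketch(exampleTable, {
  index_columns: ["__index_level_0__"],
})
const rangeIndex = parsePandasIndexDataSketch(exampleTable, {
  index_columns: [{ kind: "range", name: null, start: 0, stop: 3, step: 1 }],
})
```

With no schema the helper returns an empty list, which matches the comment in the diff that the index data "will be empty if the table was not processed through Pandas".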

@raethlein (Collaborator) left a comment

LGTM 🐼

@lukasmasuch lukasmasuch changed the title from "Add support for handling raw arrow data in frontend component" to "Add support for handling raw arrow data" Jan 16, 2025
@lukasmasuch lukasmasuch enabled auto-merge (squash) January 16, 2025 15:06
@lukasmasuch lukasmasuch merged commit 028e078 into develop Jan 16, 2025
33 checks passed
@lukasmasuch lukasmasuch deleted the refactor/support-raw-arrow-v2 branch January 16, 2025 15:37
edegp pushed a commit to edegp/streamlit that referenced this pull request Jan 19, 2025
Closes streamlit#5606
Labels

  • change:refactor: PR contains code refactoring without behavior change
  • impact:users: PR changes affect end users
  • security-assessment-completed: Security assessment has been completed for PR

Development

Successfully merging this pull request may close these issues.

st.dataframe fails with simple Arrow Table

3 participants