Add support for handling raw arrow data #10191

lukasmasuch · 2025-01-15T02:29:03Z

Describe your changes

This PR makes the use of Pandas metadata in serialized Arrow tables entirely optional. As a result, our Arrow-backed frontend components (e.g. table, dataframe, Vega-Lite) can work with raw Arrow data that wasn't processed by Pandas.
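
One way to picture the optional metadata handling: PyArrow stores the Pandas-specific information as a JSON string under the `pandas` key of the schema's key-value metadata, and a table built without Pandas simply lacks that key. The sketch below is an illustration only, not the code from this PR. `PandasSchemaSketch` and `readPandasSchema` are hypothetical names, the interface covers only a small subset of the real JSON, and the metadata is modeled as a plain `Map<string, string>`.

```typescript
// Hypothetical, simplified shape of the JSON stored under the "pandas" key.
interface PandasSchemaSketch {
  // Entries are column names (strings) or descriptor objects (e.g. for a range index).
  index_columns: Array<string | Record<string, unknown>>
  columns: Array<{ name: string | null; pandas_type: string }>
}

// Returns the parsed Pandas schema, or undefined for raw Arrow data.
function readPandasSchema(
  metadata: Map<string, string>
): PandasSchemaSketch | undefined {
  const raw = metadata.get("pandas")
  if (raw === undefined) {
    return undefined
  }
  return JSON.parse(raw) as PandasSchemaSketch
}

// A table that went through Pandas carries the extra metadata ...
const fromPandas = new Map<string, string>([
  ["pandas", JSON.stringify({ index_columns: ["id"], columns: [] })],
])
// ... while a raw Arrow table does not.
const rawArrow = new Map<string, string>()

const parsedPandasSchema = readPandasSchema(fromPandas)
const parsedRawSchema = readPandasSchema(rawArrow)
```

Callers can then branch on `undefined` instead of assuming the metadata is always present.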

This PR also simplifies the Quiver API to a small number of public methods:

- `columnNames`: Matrix of column names of the index- & data-columns.
- `columnTypes`: List of column types for every index- & data-column.
- `dimensions`: Dimensions of the DataFrame.
- `getCell`: Return a single cell from an index- or data-column.
- `hash`: A hash that identifies the underlying data.
- `styler`: Pandas Styler data.
- `addRows`: Add the contents of another table to this table.
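
The list above could be pictured as a TypeScript interface. This is a hedged sketch only: the member names come from the list, but every type and signature below (`QuiverLikeSketch`, `CellSketch`, the shape of `dimensions`, `addRows` returning a new table, and the toy `TinyTable` class) is an assumption made for illustration and is not taken from the actual Quiver code.

```typescript
// All types here are illustrative assumptions, not the real Quiver types.
interface CellSketch {
  content: unknown
}

interface QuiverLikeSketch {
  // Assumed layout: one row per header level, one entry per column.
  columnNames: string[][]
  // One type label per index and data column.
  columnTypes: string[]
  dimensions: { numRows: number; numColumns: number }
  getCell(rowIndex: number, columnIndex: number): CellSketch
  hash: string
  // Pandas Styler data; assumed to be absent for raw Arrow data.
  styler?: unknown
  addRows(other: QuiverLikeSketch): QuiverLikeSketch
}

// Toy in-memory implementation, just to exercise the interface.
class TinyTable implements QuiverLikeSketch {
  private readonly names: string[]
  private readonly types: string[]
  private readonly rows: unknown[][]

  constructor(names: string[], types: string[], rows: unknown[][]) {
    this.names = names
    this.types = types
    this.rows = rows
  }

  get columnNames(): string[][] {
    return [this.names]
  }
  get columnTypes(): string[] {
    return this.types
  }
  get dimensions(): { numRows: number; numColumns: number } {
    return { numRows: this.rows.length, numColumns: this.names.length }
  }
  get hash(): string {
    // Toy hash for illustration only.
    return JSON.stringify([this.names, this.rows])
  }
  getCell(rowIndex: number, columnIndex: number): CellSketch {
    return { content: this.rows[rowIndex]?.[columnIndex] }
  }
  addRows(other: QuiverLikeSketch): QuiverLikeSketch {
    // Copy the other table cell by cell through its public API.
    const extra: unknown[][] = []
    for (let r = 0; r < other.dimensions.numRows; r++) {
      const row: unknown[] = []
      for (let c = 0; c < other.dimensions.numColumns; c++) {
        row.push(other.getCell(r, c).content)
      }
      extra.push(row)
    }
    return new TinyTable(this.names, this.types, [...this.rows, ...extra])
  }
}

const tableA = new TinyTable(["x", "y"], ["int64", "unicode"], [[1, "a"]])
const tableB = new TinyTable(["x", "y"], ["int64", "unicode"], [[2, "b"]])
const combinedTable = tableA.addRows(tableB)
```

Keeping the surface this small means consumers only depend on a handful of members rather than on the parsing internals.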

GitHub Issue Link (if applicable)

- Closes #5606 (st.dataframe fails with simple Arrow Table)

Testing Plan

- Added usage of a raw pyarrow table and array to data mocks.
- Updated some snapshots with expected changes.
- Updated a large number of frontend unit tests to conform with the new Quiver interface and type information.

Contribution License Agreement

By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.

raethlein · 2025-01-16T11:02:40Z

I gave it a first pass and it already looks great! The API of arrow / pandas seems to be very involved, so thanks a lot for all the refactoring and making it easier to use. I think the main feedback is to have a look at the comments and parameters of the internal APIs again and add some more information about them, especially the column-touching ones. It feels like there is a lot of intrinsic knowledge of yours about the data structure (columns = [], columns = {}, schema, table) and I think it would be extremely helpful to verbalize that knowledge for easier digesting of what's going on for everyone who is not as deep into arrow as you are 🙂

lukasmasuch · 2025-01-16T12:01:58Z

I added more comments and clarifications. I think the easiest way to understand the overall parsing/pre-processing logic is by starting with parseArrowIpcBytes -> this goes through all the steps with some explanations.

raethlein · 2025-01-16T13:17:03Z

frontend/lib/src/dataframes/arrowParseUtils.ts

-  // Load all index data cells:
+  // Load all cell data for index columns.
+  // Will be empty if the table was not processed through Pandas.
  const indexData = parseIndexData(table, pandasSchema)


nit: the comment makes it sound like this is Pandas only. If so, the variable and function names should be renamed similar to the other renamings you did, e.g. pandasIndexData and parsePandasIndexData

Renamed it 👍
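
The behaviour described in the code comment (index data is empty when the table did not pass through Pandas) could look roughly like the sketch below. It is an assumption-laden illustration, not the function from `arrowParseUtils.ts`: `parsePandasIndexDataSketch`, its parameters, and the column-map input are hypothetical, and only string entries in `index_columns` are handled.

```typescript
// Hypothetical minimal schema: only the field this sketch needs.
interface PandasIndexSchemaSketch {
  index_columns: Array<string | Record<string, unknown>>
}

// Returns one array of cell values per index column.
// Without a Pandas schema there is no index information, so the result is empty.
function parsePandasIndexDataSketch(
  columns: Map<string, unknown[]>,
  schema?: PandasIndexSchemaSketch
): unknown[][] {
  if (schema === undefined) {
    return []
  }
  const indexData: unknown[][] = []
  for (const entry of schema.index_columns) {
    // Non-string entries (e.g. range index descriptors) are skipped here
    // to keep the sketch small.
    if (typeof entry !== "string") {
      continue
    }
    const values = columns.get(entry)
    if (values !== undefined) {
      indexData.push(values)
    }
  }
  return indexData
}

const tableColumns = new Map<string, unknown[]>([
  ["id", [10, 20]],
  ["value", ["a", "b"]],
])

const indexWithPandas = parsePandasIndexDataSketch(tableColumns, {
  index_columns: ["id"],
})
const indexWithoutPandas = parsePandasIndexDataSketch(tableColumns)
```

Prefixing the name with `Pandas` makes it clear at the call site that an empty result is expected for raw Arrow input.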

raethlein

LGTM 🐼

lukasmasuch added 30 commits January 14, 2025 17:08

Refactor Pandas styler handling in frontend

a5ff092

More styler refactoring

6e91c6a

Improvements

fd48b5f

Apply updates

2fffd30

Fix unit tests

c0d9ca4

Fix period test

150972b

Add top level comment

299d927

Minor fix

b70180f

Add comments

ebebdd7

Add links to pandas styler docs

33c64cb

Fix usage of caption

067991a

Simplify

997edf0

Add better comments

e5285e0

Update licenses

441d177

Add snapshots

947b669

Remove unused variable

dc365ee

Update imports

a2a198d

Fix tests

1053a46

Merge remote-tracking branch 'upstream/develop' into refactor/pandas-…

548976d

…styler-handling

Cleaned up linting

0dc1fe0

Add check back

3fdd26a

Use enum constants

b5192cc

Refactoring to support raw arrow

8096c4c

Update

d7f2ee9

Add support for arrow

0714978

Add map

c81c3d6

Extend example

1a42b4a

Add comments to clarify index

9f198f0

Update comment

2fb750d

Update comments

1b59938

lukasmasuch added 6 commits January 16, 2025 12:25

Apply feedback

02acb60

Add more comments

1e557e5

Update comments

485dbe8

Improve comments

f401bba

Remove tests

3a844ad

Merge remote-tracking branch 'upstream/develop' into refactor/support…

48404de

…-raw-arrow-v2

lukasmasuch added 5 commits January 16, 2025 13:03

Update comment

ce39d6a

Add comment

71965d2

Update comment

6f1c563

Add more comments

e8d6937

Update comment

4b5bec3

raethlein reviewed Jan 16, 2025

lukasmasuch added 8 commits January 16, 2025 14:26

Renamings

3b6279d

Rename methods

be5b392

Add additional init props method

0b974b5

Change default to false

72ca143

Change more comments

eb4badf

Fix issue

d9d585c

Add additional tests

b74519b

Improvements to type utils

a9a5d21

raethlein approved these changes Jan 16, 2025

lukasmasuch changed the title ~~Add support for handling raw arrow data in frontend component~~ Add support for handling raw arrow data Jan 16, 2025

Stabilize pydeck e2e test

2558e5b

lukasmasuch enabled auto-merge (squash) January 16, 2025 15:06

lukasmasuch merged commit 028e078 into develop Jan 16, 2025
33 checks passed

lukasmasuch deleted the refactor/support-raw-arrow-v2 branch January 16, 2025 15:37
