This repository was archived by the owner on Mar 13, 2026. It is now read-only.

feat: to_gbq uses Parquet by default, use api_method="load_csv" for old behavior #413

Merged
tswast merged 21 commits into main from issue366-null-strings
Nov 2, 2021

Conversation

@tswast (Contributor) commented Oct 26, 2021

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #366 🦕

@google-cla google-cla Bot added the cla: yes This human has signed the Contributor License Agreement. label Oct 26, 2021
@product-auto-label product-auto-label Bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Oct 26, 2021
@tswast tswast marked this pull request as ready for review October 27, 2021 21:41
@tswast tswast requested a review from a team October 27, 2021 21:41
@tswast (Contributor, Author) commented Oct 28, 2021

    pyarrow 3.0.0 depends on numpy>=1.16.6
    The user requested (constraint) numpy==1.14.5

Looks like we need to bump the minimum numpy version.

Per https://numpy.org/neps/nep-0029-deprecation_policy.html, we should already be on NumPy 1.18, so requiring numpy>=1.16.6 is justifiable.

@tswast (Contributor, Author) commented Oct 28, 2021

Looks like most of the CircleCI failures are caused by the same issue.

FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data - Fil...
FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_if_table_exists_append
FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_subset_columns_if_table_exists_append
FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_if_table_exists_replace
FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_flexible_column_order
FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_with_timestamp
FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_tokyo
FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_tokyo_non_existing_dataset
E   pyarrow.lib.ArrowInvalid: Casting from timestamp[ns, tz=US/Arizona] to timestamp[ms] would lose data: 1635431957963003000

The same tests pass with the latest package versions in the Python 3.9 session.

We don't run the system tests with Python 3.7 in the Kokoro session, so I can't tell whether the same issue happens with the pip-installed versions. I'll update the noxfile to run them with 3.7.

@tswast (Contributor, Author) commented Oct 28, 2021

Looks like we're not the only ones encountering this: https://stackoverflow.com/questions/59682833/pyarrow-lib-arrowinvalid-casting-from-timestampns-to-timestampms-would-los

I wonder whether bumping the minimum pyarrow to 4.0.0 would fix it.

@tswast (Contributor, Author) commented Oct 28, 2021

I've tried pyarrow 4, 5, and 6; none of them fixed it. Possibly a problem with pandas?

@tswast (Contributor, Author) commented Oct 28, 2021

Tests pass with

attrs==21.2.0
cachetools==4.2.4
certifi==2021.10.8
charset-normalizer==2.0.7
click==8.0.3
google-api-core==1.16.0
google-auth==1.4.1
google-auth-oauthlib==0.0.1
google-cloud-bigquery==1.11.1
google-cloud-bigquery-storage==1.1.0
google-cloud-core==0.29.1
google-cloud-testutils==1.2.0
google-crc32c==1.3.0
google-resumable-media==2.1.0
googleapis-common-protos==1.53.0
grpcio==1.41.1
idna==3.3
importlib-metadata==4.8.1
iniconfig==1.1.1
mock==4.0.3
numpy==1.21.3
oauthlib==3.1.1
packaging==21.0
pandas==1.3.4
-e git+ssh://[email protected]/tswast/python-bigquery-pandas.git@845ff322a5d7900826c97a4da652aead5518ca73#egg=pandas_gbq
pluggy==1.0.0
protobuf==3.19.0
py==1.10.0
pyarrow==6.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pydata-google-auth==0.1.2
pyparsing==3.0.3
pytest==6.2.5
python-dateutil==2.8.2
pytz==2021.3
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
six==1.16.0
toml==0.10.2
tqdm==4.23.0
typing-extensions==3.10.0.2
urllib3==1.26.7
zipp==3.6.0

I'll try different versions of pandas.

@tswast (Contributor, Author) commented Oct 28, 2021

This issue is fixed by upgrading to pandas 1.1.0+.

The pandas 1.1.0 changelog lists several bug fixes relating to timestamp data. I'm not sure which one in particular helped here, but possibly the fix for parsing timezone-aware formats with different timezones: https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#parsing-timezone-aware-format-with-different-timezones-in-to-datetime

I don't think we want to require pandas 1.1.0 just yet. Perhaps the tests could be updated not to mix and match timezones, since that isn't actually supported by pandas until 1.1.0?
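As a sketch of that test fix (not code from this PR), mixed timezone offsets can be normalized to a single zone with `utc=True`, which is supported on pandas versions both before and after 1.1.0:

```python
import pandas as pd

# Sketch of the proposed test fix: rather than mixing timezone offsets
# in one column (unsupported before pandas 1.1.0), normalize everything
# to UTC up front. The timestamps below are illustrative values only.
mixed_offsets = [
    "2021-10-28 07:19:17.963003-07:00",
    "2021-10-28 14:19:17.963003+00:00",
]
ts = pd.to_datetime(mixed_offsets, utc=True)  # single tz-aware dtype
```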

@tswast tswast requested a review from plamut October 29, 2021 14:25
@plamut left a comment

I can confirm the fix gets rid of the linked issue.

Overall it looks good; the comments are just some nits and one possible refactoring opportunity.

Comment thread pandas_gbq/load.py
Comment thread pandas_gbq/load.py Outdated
Comment thread pandas_gbq/exceptions.py Outdated
Comment thread pandas_gbq/exceptions.py
@tswast tswast requested a review from plamut November 1, 2021 19:37
@plamut left a comment

LGTM.

The Python 3.10 check fails because the BigQuery client does not yet support Python 3.10 and cannot be installed as a dependency.

@tswast tswast merged commit 9a65383 into main Nov 2, 2021
@tswast tswast deleted the issue366-null-strings branch November 2, 2021 14:52
Comment thread pandas_gbq/load.py
    job_config.schema = pandas_gbq.schema.to_google_cloud_bigquery(schema)
# If not, let BigQuery determine schema unless we are encoding the CSV files ourselves.
elif not FEATURES.bigquery_has_from_dataframe_with_csv:
    schema = pandas_gbq.schema.generate_bq_schema(dataframe)

This may introduce a failure when schema is None and generate_bq_schema ends up unused: the Parquet conversion may succeed, but the resulting column types may not match the actual BigQuery table schema.

Contributor Author


Interesting that our tests wouldn't have caught that. Do you have an example of a dataframe that demonstrates this?

Contributor Author


FWIW, the reason we don't have this here is that the google-cloud-bigquery library performs similar dataframe-to-BigQuery-schema conversion when the schema is not populated on the job config: https://github.com/googleapis/python-bigquery/blob/66b3dd9f9aec3fda9610a3ceec8d8a477f2ab3b9/google/cloud/bigquery/client.py#L2625
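To illustrate the kind of inference being discussed, here is a hypothetical, heavily simplified dtype-to-BigQuery-type mapping (the real logic lives in pandas_gbq.schema and google-cloud-bigquery and handles many more cases):

```python
import pandas as pd

# Hypothetical simplification of dataframe-to-BigQuery schema
# inference; the actual implementations in pandas_gbq and
# google-cloud-bigquery cover far more dtypes and edge cases.
_DTYPE_TO_BQ = {
    "int64": "INTEGER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
    "object": "STRING",
}


def sketch_bq_schema(dataframe: pd.DataFrame) -> list:
    """Map each column's pandas dtype to a BigQuery type name."""
    return [
        {"name": name, "type": _DTYPE_TO_BQ.get(str(dtype), "STRING")}
        for name, dtype in dataframe.dtypes.items()
    ]
```

Note how an `object` column is mapped to STRING regardless of its contents, which is one way an inferred schema can diverge from the types of an existing table.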


Labels

api: bigquery Issues related to the googleapis/python-bigquery-pandas API. cla: yes This human has signed the Contributor License Agreement.


Development

Successfully merging this pull request may close these issues.

Empty strings inconsistently converted to NULL's when using df.to_gbq()

3 participants