
test: update test_datasets.py #3857

Merged
mattijn merged 17 commits into main from improve-tests-datasets
Aug 22, 2025

Conversation

@mattijn
Contributor

@mattijn mattijn commented Jul 15, 2025

This PR improves the datasets test suite based on the feedback in #3854, so that the automated tests now reflect the implementation. Namely:

  • Fixed test failures: corrected the regex pattern in test_all_datasets to properly match error messages (the test was looking for icon_7zip.png but the actual error shows 7zip.png).
  • Added data API tests: integrated comprehensive tests for the new data accessor API into the main test file.
  • Streamlined test coverage: removed the redundant is_url test, which, with vega-datasets 3.2.0, only validated precomputed URLs.
  • Maintained core functionality: preserved all essential tests for backend functionality, error handling, caching, and data parsing.

The test suite now provides better coverage, and the test names reflect what they cover. No changes to the code in altair were required.
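The regex fix in the first bullet can be sketched as follows (a minimal stand-in with a hypothetical loader and error message; the real test uses pytest.raises with a match pattern):

```python
import re

def load(path: str) -> None:
    # Hypothetical stand-in for the loader: images are not tabular datasets.
    raise ValueError(f"Unable to parse {path!r} as a dataset")

def error_matches(pattern: str, path: str) -> bool:
    # Mirrors pytest.raises(..., match=pattern): re.search on the message.
    try:
        load(path)
    except ValueError as exc:
        return re.search(pattern, str(exc)) is not None
    return False

# The old pattern expected "icon_7zip.png", but the message contains "7zip.png".
assert not error_matches(re.escape("icon_7zip.png"), "7zip.png")
assert error_matches(re.escape("7zip.png"), "7zip.png")
```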

Could you have a look to see whether this addresses your concerns, @dangotbanned?

Closes #3854

@dangotbanned
Member

dangotbanned commented Jul 16, 2025

Thanks @mattijn, I've only skimmed through but this is looking better! 🎉

Note

I'll aim to do a proper review soon

One thing that stands out to me on a first look:

No changes to the code in altair were required.

Apologies for not being clearer: I'm absolutely fine with further changes being made in both altair and tools.

test_pandas_date_parse

I think that at least (#3854 (comment)) will require some changes outside of the test suite - since that issue seems very related to (vega/vega-datasets#702).

My guess is that because that PR introduced a disconnect between the Resource.name and Resource.path, like below, the way I was previously handling these fields probably needs to be updated in some way:

| Before | After (vega/vega-datasets#702) |
| --- | --- |
| `{"name": "7zip.png", "path": "7zip.png"}` | `{"name": "icon_7zip", "path": "7zip.png"}` |
| `{"name": "annual-precip.json", "path": "annual-precip.json"}` | `{"name": "annual_precip", "path": "annual-precip.json"}` |
| `{"name": "flights-200k.arrow", "path": "flights-200k.arrow"}` | `{"name": "flights_200k_arrow", "path": "flights-200k.arrow"}` |
| `{"name": "flights-20k.json", "path": "flights-20k.json"}` | `{"name": "flights_20k", "path": "flights-20k.json"}` |

Important

Would you be able to restore (test_pandas_date_parse), with the only change being the names used in the @pytest.mark.parametrize please?
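For illustration, the kind of property that test guards can be sketched like this (sample data and column names assumed; not the actual test body): pandas needs the schema's temporal columns passed via parse_dates so the resulting dtypes are datetime-like.

```python
import io

import pandas as pd

# Illustrative sample rows in the style of the stocks dataset, not the real data.
CSV = "symbol,date,price\nMSFT,Jan 1 2000,39.81\nMSFT,Feb 1 2000,36.35\n"

# Temporal columns as the schema (via SchemaCache) would report them for pandas.
temporal = ["date"]

df = pd.read_csv(io.StringIO(CSV), parse_dates=temporal)

# Every schema-temporal column should come back as a datetime dtype.
assert all(df[col].dtype.kind == "M" for col in temporal)
```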

For reference:

  • (909e7d0) was the commit that introduced using schema data type information for pandas, in response to @joelostblom's early experience (feat(RFC): Adds altair.datasets #3631 (comment))
  • (a776e2f) was the last major change to SchemaCache
  • Also here's the docstring for SchemaCache:

    class SchemaCache(CompressedCache["_Dataset", "_FlSchema"]):
        """
        `json`_, `gzip`_ -based, lazy schema lookup.

        - Primarily benefits ``pandas``, which needs some help identifying **temporal** columns.
        - Utilizes `data package`_ schema types.
        - All methods return falsy containers instead of exceptions.

        .. _json:
            https://docs.python.org/3/library/json.html
        .. _gzip:
            https://docs.python.org/3/library/gzip.html
        .. _data package:
            https://github.com/vega/vega-datasets/pull/631
        """
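The "falsy containers instead of exceptions" contract from that docstring could be sketched like this (a hypothetical minimal cache, not the real SchemaCache):

```python
class MiniSchemaCache:
    """Hypothetical stand-in for SchemaCache's lookup contract."""

    def __init__(self, schemas: dict) -> None:
        self._schemas = schemas

    def schema(self, name: str) -> dict:
        # Falsy empty dict when the dataset is unknown, rather than raising.
        return self._schemas.get(name, {})

cache = MiniSchemaCache({"stocks": {"symbol": "string", "date": "date"}})
assert cache.schema("stocks")["date"] == "date"
assert cache.schema("missing") == {}  # falsy container, no exception
```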

Does that help demonstrate why this is a regression?

  • I added functionality to solve an issue that came up in the PR
  • I added a test to make sure the data types were at least all temporal data types
  • But now they aren't, and the issue is only showing up in pandas - which is where the problem was in the first place 😔

@dangotbanned dangotbanned self-requested a review July 16, 2025 10:52
Comment thread on tools/datasets/datapackage.py
@dangotbanned
Member

(42236c2)

@mattijn did you generate this?

@mattijn
Contributor Author

mattijn commented Jul 17, 2025

Yes, but ruff and/or mypy aren't happy yet. Or don't you think we should make pylance happy too?

@dangotbanned
Member

Yes, but ruff and/or mypy aren't happy yet. Or don't you think we should make pylance happy too?

No I mean did you write this yourself?

@mattijn
Contributor Author

mattijn commented Jul 17, 2025

I'm not a type expert, so let me leave this to you, but the stubs are now derived from the DataPackage and not from the other type files.

@mattijn
Contributor Author

mattijn commented Aug 22, 2025

Thanks for reviewing @dangotbanned! I've included one raise of AltairDatasetsError for the PyArrow backend, for the case where a date column is in a non-ISO date format; related pyarrow issue: apache/arrow#41488.

This is specifically for the 'stocks' dataset example (the date column contains values like "Jan 1 2000"):

from altair.datasets import data

data.stocks(engine="pyarrow")

AltairDatasetsError: PyArrow cannot parse date format in dataset 'stocks'. This is a known limitation of PyArrow's CSV reader for non-ISO date formats.
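The guard behind that error can be sketched like this (class and function names are assumed for illustration, not Altair's actual implementation):

```python
class AltairDatasetsError(Exception):
    """Stand-in for altair.datasets' error type."""

def read_with_pyarrow(name: str, *, date_format_iso: bool) -> None:
    # Fail early with a dataset-level error when PyArrow's CSV reader
    # cannot handle a non-ISO date column.
    if not date_format_iso:
        msg = (
            f"PyArrow cannot parse date format in dataset {name!r}. "
            "This is a known limitation of PyArrow's CSV reader "
            "for non-ISO date formats."
        )
        raise AltairDatasetsError(msg)
    # ISO-formatted dates would be handed to pyarrow.csv.read_csv here.

try:
    read_with_pyarrow("stocks", date_format_iso=False)  # dates like "Jan 1 2000"
except AltairDatasetsError as exc:
    print(exc)
```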

@mattijn mattijn merged commit 5e9f60e into main Aug 22, 2025
25 checks passed