Skip to content

Remove Java URI validations for Blob Storage providers#1604

Merged
eric-maynard merged 3 commits intoapache:mainfrom
adnanhemani:ahemani_fix_1545_try_2
May 27, 2025
Merged

Remove Java URI validations for Blob Storage providers#1604
eric-maynard merged 3 commits intoapache:mainfrom
adnanhemani:ahemani_fix_1545_try_2

Conversation

@adnanhemani
Copy link
Contributor

@adnanhemani adnanhemani commented May 17, 2025

This is a retry at #1586, which I'll close if this PR is on the right direction instead.

Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in #1545.

@snazy
Copy link
Member

snazy commented May 19, 2025

Can you elaborate on when .. would be a valid in a path? For me, .. is a path-traversal in the presence of it raises the question "why is this not the source of a path-traversal-attack".

@snazy
Copy link
Member

snazy commented May 19, 2025

I'm also confused on having multiple PRs for the same issue.

@adnanhemani
Copy link
Contributor Author

I'm also confused on having multiple PRs for the same issue.

Please see #1586 (comment) and #1586 (comment)

@adnanhemani
Copy link
Contributor Author

Can you elaborate on when .. would be a valid in a path? For me, .. is a path-traversal in the presence of it raises the question "why is this not the source of a path-traversal-attack".

Blob Storage providers like S3 and GCS do not recognize ".." as a path traversal. If you store an object in S3 with the path "s3://abcd/a/b/../c", S3 stores the object exactly as such - it will not normalize the path into "s3://abcd/a/c". While it may be an anti-pattern to make such a location, we should not deviate from what the blob storage provider allows/does. The same can be said about multiple forward slashes together ("s3://abcd/a/b//c" is not the same as "s3://abcd/a/b/c").

As a result, we should only allow ".." and "." normalization in local filesystem calls where these can be resolved - and those are the code paths where Java URI is being used in this PR. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if location is FILE:////////tmp/smth?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out - fixed!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line of code is gone now, but I think the question about upper case URIs remains. Since the purpose of this PR is directly related to that, I'm ok with addressing upper case URIs later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have tests for multiple leading / chars?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we do not, but am introducing in revision. Thanks!

@snazy
Copy link
Member

snazy commented May 20, 2025

In ADLS / is a literally a path separator, so I guess that has to be normalized.

Sure, in S3 + GCS the path is a plain string. But are patterns like /../ or /// in S3/GCS locations used in practice? I've seen usages of special characters like ", but not /../ or ///.

@dimas-b
Copy link
Contributor

dimas-b commented May 20, 2025

I've heard stories about /../ being used in S3 URI. I think it is preferable for Polaris not to interpret storage URI paths beyond determining "prefix" matches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of this assertion? Is it not preferable to assert exact expected type rather that asserting what the type is not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question! AzureLocation is an extension of StorageLocation (due to Azure's very specific way of defining locations). All other locations remain instances of StorageLocation. So if I test that any of these locations (including an instance of AzureLocation) are an instance of StorageLocation, they'll always return true.

This test is basically just testing that we are not creating AzureLocation subclassed objects for these paths.

Copy link
Contributor

@dimas-b dimas-b May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cf. assertThat(...).isExactlyInstanceOf(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL this exists in AssertJ :) Thanks @dimas-b - pushed changes with this. PTAL!

Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks like it provides incremental values as it avoids interpreting storage URIs as RFC-compliant URI. I'd be fine merging it.

I have one comment about test code, though.

@adnanhemani
Copy link
Contributor Author

But are patterns like /../ or /// in S3/GCS locations used in practice?

Agreed with @dimas-b here. Although users should probably not be doing this, I'd prefer we not make Polaris opinionated on this (since we have no good reason to do so).

@adnanhemani adnanhemani requested a review from ajantha-bhat as a code owner May 21, 2025 19:07
dimas-b
dimas-b previously approved these changes May 22, 2025
@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 22, 2025
@dimas-b
Copy link
Contributor

dimas-b commented May 22, 2025

Please rebase to get the latest CI changes.

Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are too many unrelated changes in this PR after the last push, it seems

@adnanhemani adnanhemani force-pushed the ahemani_fix_1545_try_2 branch from 5ab25fe to 04b506b Compare May 22, 2025 21:00
@adnanhemani
Copy link
Contributor Author

adnanhemani commented May 22, 2025

There are too many unrelated changes in this PR after the last push, it seems

Sorry, made a mistake in how I rebased it and got every commit replayed on top of this PR. I've forced-push to correct - changes should be back to normal :)

@eric-maynard eric-maynard merged commit ef7cefe into apache:main May 27, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board May 27, 2025
williamhyun pushed a commit to williamhyun/polaris that referenced this pull request May 27, 2025
This is a retry at apache#1586, which I'll close if this PR is on the right direction instead.

Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in apache#1545.
snazy added a commit to snazy/polaris that referenced this pull request Jun 13, 2025
* fix: Remove duplicated code in IcebergCatalog (apache#1681)

* Fix SparkClient listGenericTable to use ListGenericTablesRESTResponse

ListTableResponse is the class used by iceberg endpoint, which is the same as ListGenericTablesRESTResponse. However, we are suppose to use ListGenericTablesRESTResponse to be correct

* fix: Remove info log about deprecated internal method from PolarisConfiguration (apache#1672)


Remove the INFO log about calls to this method in Polaris, because users cannot do anything about these messages.

Phasing out old property names requires coordination with users (e.g. release notes), so it is not a matter of merely avoiding calls to that method in Polaris code.

Fixes apache#1666

* Create a single binary distribution bundle (apache#1589)

* fix(quickstart): Correct Quickstart Instructions (apache#1673)

* Remove Java URI validations for Blob Storage providers (apache#1604)

This is a retry at apache#1586, which I'll close if this PR is on the right direction instead.

Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in apache#1545.

* Improve test coverage for invalid inputs in Policy APIs (apache#1665)

* Fix getting-started docker start by PR apache#1532 (apache#1687)

* Fix the manual test broken by PR apache#1532 (apache#1688)

* Fix credentials printing twice (apache#1682)

* main: Update dependency com.diffplug.spotless:spotless-plugin-gradle to v7.0.4 (apache#1690)

* INFO: last merged commit 493de03

---------

Co-authored-by: Alexandre Dutra <[email protected]>
Co-authored-by: gh-yzou <[email protected]>
Co-authored-by: Dmitri Bourlatchkov <[email protected]>
Co-authored-by: Yufei Gu <[email protected]>
Co-authored-by: Adnan Hemani <[email protected]>
Co-authored-by: William Hyun <[email protected]>
Co-authored-by: Christopher Lambert <[email protected]>
Co-authored-by: Mend Renovate <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants