Remove Java URI validations for Blob Storage providers#1604
Remove Java URI validations for Blob Storage providers#1604eric-maynard merged 3 commits intoapache:mainfrom
Conversation
|
Can you elaborate on when |
|
I'm also confused on having multiple PRs for the same issue. |
Please see #1586 (comment) and #1586 (comment) |
Blob Storage providers like S3 and GCS do not recognize ".." as a path traversal. If you store an object in S3 with the path "s3://abcd/a/b/../c", S3 stores the object exactly as such - it will not normalize the path into "s3://abcd/a/c". While it may be an anti-pattern to make such a location, we should not deviate from what the blob storage provider allows/does. The same can be said about multiple forward slashes together ("s3://abcd/a/b//c" is not the same as "s3://abcd/a/b/c"). As a result, we should only allow ".." and "." normalization in local filesystem calls where these can be resolved - and those are the code paths where Java URI is being used in this PR. WDYT? |
There was a problem hiding this comment.
What if location is FILE:////////tmp/smth?
There was a problem hiding this comment.
Good call out - fixed!
There was a problem hiding this comment.
This line of code is gone now, but I think the question about upper case URIs remains. Since the purpose of this PR is directly related to that, I'm ok with addressing upper case URIs later.
There was a problem hiding this comment.
Do we have tests for multiple leading / chars?
There was a problem hiding this comment.
No, we do not, but am introducing in revision. Thanks!
|
In ADLS Sure, in S3 + GCS the path is a plain string. But are patterns like |
|
I've heard stories about |
There was a problem hiding this comment.
What is the meaning of this assertion? Is it not preferable to assert exact expected type rather that asserting what the type is not?
There was a problem hiding this comment.
This is a good question! AzureLocation is an extension of StorageLocation (due to Azure's very specific way of defining locations). All other locations remain instances of StorageLocation. So if I test that any of these locations (including an instance of AzureLocation) are an instance of StorageLocation, they'll always return true.
This test is basically just testing that we are not creating AzureLocation subclassed objects for these paths.
There was a problem hiding this comment.
Cf. assertThat(...).isExactlyInstanceOf(...)
There was a problem hiding this comment.
TIL this exists in AssertJ :) Thanks @dimas-b - pushed changes with this. PTAL!
dimas-b
left a comment
There was a problem hiding this comment.
This PR looks like it provides incremental values as it avoids interpreting storage URIs as RFC-compliant URI. I'd be fine merging it.
I have one comment about test code, though.
Agreed with @dimas-b here. Although users should probably not be doing this, I'd prefer we not make Polaris opinionated on this (since we have no good reason to do so). |
|
Please rebase to get the latest CI changes. |
dimas-b
left a comment
There was a problem hiding this comment.
There are too many unrelated changes in this PR after the last push, it seems
5ab25fe to
04b506b
Compare
Sorry, made a mistake in how I rebased it and got every commit replayed on top of this PR. I've forced-push to correct - changes should be back to normal :) |
This is a retry at apache#1586, which I'll close if this PR is on the right direction instead. Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in apache#1545.
* fix: Remove duplicated code in IcebergCatalog (apache#1681) * Fix SparkClient listGenericTable to use ListGenericTablesRESTResponse ListTableResponse is the class used by iceberg endpoint, which is the same as ListGenericTablesRESTResponse. However, we are suppose to use ListGenericTablesRESTResponse to be correct * fix: Remove info log about deprecated internal method from PolarisConfiguration (apache#1672) Remove the INFO log about calls to this method in Polaris, because users cannot do anything about these messages. Phasing out old property names requires coordination with users (e.g. release notes), so it is not a matter of merely avoiding calls to that method in Polaris code. Fixes apache#1666 * Create a single binary distribution bundle (apache#1589) * fix(quickstart): Correct Quickstart Instructions (apache#1673) * Remove Java URI validations for Blob Storage providers (apache#1604) This is a retry at apache#1586, which I'll close if this PR is on the right direction instead. Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in apache#1545. * Improve test coverage for invalid inputs in Policy APIs (apache#1665) * Fix getting-started docker start by PR apache#1532 (apache#1687) * Fix the manual test broken by PR apache#1532 (apache#1688) * Fix credentials printing twice (apache#1682) * main: Update dependency com.diffplug.spotless:spotless-plugin-gradle to v7.0.4 (apache#1690) * INFO: last merged commit 493de03 --------- Co-authored-by: Alexandre Dutra <[email protected]> Co-authored-by: gh-yzou <[email protected]> Co-authored-by: Dmitri Bourlatchkov <[email protected]> Co-authored-by: Yufei Gu <[email protected]> Co-authored-by: Adnan Hemani <[email protected]> Co-authored-by: William Hyun <[email protected]> Co-authored-by: Christopher Lambert <[email protected]> Co-authored-by: Mend Renovate <[email protected]>
This is a retry at #1586, which I'll close if this PR is on the right direction instead.
Java URI does not actually apply any normalization to URIs if we do not call
URI.normalize()(which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in #1545.