Skip to content

Core: enforce writing POSIX compatible paths for data location#6772

Closed
jackye1995 wants to merge 6 commits intoapache:mainfrom
jackye1995:posix
Closed

Core: enforce writing POSIX compatible paths for data location#6772
jackye1995 wants to merge 6 commits intoapache:mainfrom
jackye1995:posix

Conversation

@jackye1995
Copy link
Contributor

@jackye1995 jackye1995 commented Feb 8, 2023

part of fixes for #6758

@github-actions github-actions bot added the core label Feb 8, 2023
@jackye1995
Copy link
Contributor Author

}

public static String posixNormalize(String path) {
return Paths.get(path).normalize().toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is not suitable for Iceberg paths, since Java Path is for paths, but Iceberg paths are actually URIs. It mangles URIs and doesn't understand the scheme vs path distinction:

jshell> Paths.get("s3://foo/bar").toString()
$2 ==> "s3:/foo/bar"

jshell> Paths.get("s3://foo/bar/../..").normalize().toString()
$5 ==> "s3:"

jshell> Paths.get("s3://foo/bar/../../..").normalize().toString()
$6 ==> ""

URI.normalize() seems better, though it has interesting behavior where it preserves leading .. path segments at the start of the path that would go past the root:

jshell> URI.create("s3://foo/abc/../../..").normalize()
$11 ==> s3://foo/../..

Copy link
Contributor Author

@jackye1995 jackye1995 Feb 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like using URI is still not the right way, because URI would fail with characters like space, which are supposed to be escaped with URL encoding like %20. However, it is a valid posix path.

I think what should be done is that, for path with scheme, everything after that should use Paths.get(path).normalize().toString().

@jackye1995 jackye1995 changed the title Core: enforce writing POSIX compatible paths Core: enforce writing POSIX compatible paths for data location Feb 11, 2023
@jackye1995 jackye1995 requested review from RussellSpitzer and electrum and removed request for electrum February 11, 2023 20:01
@jackye1995
Copy link
Contributor Author

@jackye1995
Copy link
Contributor Author

We have different places for creating file path names,

  • LocationProvider for data files
  • TableOperations.metadataFileLocation for manifests and manifest lists
  • catalog specific impl for metadata file location (e.g. BaseMetastoreTableOperations.newTableMetadataFilePath)

So I will restrict this PR to just data location, and do other ones later to avoid having too many changes in different places.

public void testPosixNormalizePathsWithSchemeAndWithoutAuthority() {
Assert.assertEquals(
"Must work with the root directory representation",
"file:///",
Copy link
Contributor Author

@jackye1995 jackye1995 Feb 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a place where Hadoop Path behavior differs, it resolves to file:/, but that seems like an invalid resolution.

Also I am treating / as a special case, both here and in a path without a scheme. For paths with a scheme and authority, we can use scheme://authority as the root path without ambiguity. But file:// (just scheme) and (empty path) do not mean root location, only file:/// and / mean that.

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 25, 2024
@github-actions
Copy link

github-actions bot commented Sep 1, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants