S3FileIO Can Create Non-Posix Paths #6758

@RussellSpitzer

Apache Iceberg version

1.1.0 (latest release)

Query engine

None

Please describe the bug 🐞

An interesting thing we ran into:

Our FileIO API contains this method:

/** Get a {@link OutputFile} instance to write bytes to the file at the given path. */
OutputFile newOutputFile(String path);

This method uses a "String" as the path for creating a new file reference. In general this is not an issue, but there are some edge cases. For example, S3FileIO doesn't enforce POSIX rules when creating paths or directories (since neither of those really exists in S3). This means the following two locations are actually different:

  1. foo//bar/file
  2. foo/bar/file

But within POSIX systems these two should refer to exactly the same thing. See https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_266:

A pathname may optionally contain one or more trailing slashes. Multiple successive slashes are considered to be the same as one slash.

We can see that this holds true in Java's Path class:

scala> import java.nio.file.Paths
import java.nio.file.Paths

scala> Paths.get("/foo//bar/file").equals(Paths.get("/foo/bar/file"))
res29: Boolean = true

Or, more importantly, in Hadoop's Path class:

scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path

scala> val p = new Path("foo/bar//bazz")
p: org.apache.hadoop.fs.Path = foo/bar/bazz

scala> p.toUri
res38: java.net.URI = foo/bar/bazz

This leads to an issue when a table is written by S3FileIO but then read with HadoopFileIO. HadoopFileIO cannot read from the special foo//bar/file path, because Hadoop's Path treats it as the POSIX-equivalent foo/bar/file (as shown above), which is a different location in S3. This means that if for some reason we end up generating double slashes in our metadata_location (or other paths) when using S3FileIO, those files will be inaccessible if S3FileIO is swapped for HadoopFileIO.
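One way to avoid producing such locations in the first place would be to collapse repeated slashes before a path string is handed to a FileIO. The following is only a minimal sketch of that idea: `LocationNormalizer` and `collapseSlashes` are hypothetical names, not part of Iceberg, and it assumes the `://` after the scheme is the only place where consecutive slashes are meaningful.

```java
// Hypothetical helper, not part of Iceberg: collapses runs of '/' in a
// location string so that "foo//bar/file" and "foo/bar/file" become the same.
class LocationNormalizer {

    static String collapseSlashes(String location) {
        // Keep the "scheme://" prefix (if any) untouched.
        int schemeEnd = location.indexOf("://");
        int start = schemeEnd < 0 ? 0 : schemeEnd + 3;
        String prefix = location.substring(0, start);

        // Everything after the prefix: replace two or more slashes with one.
        String rest = location.substring(start).replaceAll("/{2,}", "/");
        return prefix + rest;
    }
}
```

With this sketch, `collapseSlashes("s3://bucket/foo//bar/file")` yields `s3://bucket/foo/bar/file`, which is the form Hadoop's Path would resolve to anyway. Note that rewriting is only safe for new writes; an object already stored under the double-slash key would still live at the original key.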

I think in this case we should probably add to the spec that all files (and paths) must comply with POSIX standards.
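If the spec did gain such a rule, a writer could reject offending locations instead of silently rewriting them. The sketch below is an assumption about what such a check might look like, covering only the repeated-slash case discussed here; `PosixLocationCheck`, `hasEmptySegment`, and `requireNoEmptySegment` are hypothetical names and not an existing Iceberg API.

```java
// Hypothetical validation helper, not part of Iceberg: flags locations whose
// path portion contains consecutive slashes (an empty path segment).
class PosixLocationCheck {

    static boolean hasEmptySegment(String location) {
        // Skip past "scheme://" so the scheme separator is not reported.
        int schemeEnd = location.indexOf("://");
        String rest = schemeEnd < 0 ? location : location.substring(schemeEnd + 3);
        return rest.contains("//");
    }

    static String requireNoEmptySegment(String location) {
        if (hasEmptySegment(location)) {
            throw new IllegalArgumentException(
                "Location contains consecutive slashes: " + location);
        }
        return location;
    }
}
```

Failing fast like this would surface the problem at write time with S3FileIO, rather than later when the table is read through HadoopFileIO.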
