Description
Apache Iceberg version
1.1.0 (latest release)
Query engine
None
Please describe the bug 🐞
An interesting thing we ran into:
Our FileIO API contains this method:
iceberg/api/src/main/java/org/apache/iceberg/io/FileIO.java
Lines 44 to 46 in c07f2aa
/** Get a {@link OutputFile} instance to write bytes to the file at the given path. */
OutputFile newOutputFile(String path);
This method takes a String as the path for creating a new file reference. In general this is not an issue, but there are some edge cases. For example, S3FileIO doesn't enforce POSIX rules when creating paths or directories (since neither of those really exists in S3). This means the following two locations are actually different:
- foo//bar/file
- foo/bar/file
But on POSIX systems these two should refer to exactly the same thing. See https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_266:
A pathname may optionally contain one or more trailing slashes. Multiple successive slashes are considered to be the same as one slash.
We can see that this holds true in Java's Path class:
scala> import java.nio.file.Paths
import java.nio.file.Paths
scala> Paths.get("/foo//bar/file").equals(Paths.get("/foo/bar/file"))
res29: Boolean = true
Or, more importantly, in Hadoop's Path class:
scala> val p = new Path("foo/bar//bazz")
p: org.apache.hadoop.fs.Path = foo/bar/bazz
scala> p.toUri
res38: java.net.URI = foo/bar/bazz
This leads to an issue when a table is written with S3FileIO but then read with HadoopFileIO. HadoopFileIO cannot read from the special foo//bar/file path because Hadoop's Path normalizes it to foo/bar/file, as shown above, following POSIX rules. This means that if for some reason we end up generating double slashes in a table's metadata_location (or in other paths) when using S3FileIO, those files will be inaccessible once S3FileIO is swapped for HadoopFileIO.
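To make the failure mode concrete, here is a minimal sketch that does not use Iceberg, S3, or Hadoop at all. It assumes a plain in-memory map as a stand-in for an object store (keys compared as raw strings) and a hypothetical `posixStyleRead` helper as a stand-in for a client that collapses successive slashes before looking anything up. All names here are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class DoubleSlashLookupDemo {
    // Stand-in for an object store: keys are opaque strings, compared character by character.
    static final Map<String, String> objectStore = new HashMap<>();

    // Reads the key exactly as given, the way a raw object-store client would.
    static String rawRead(String key) {
        return objectStore.get(key);
    }

    // Hypothetical stand-in for a POSIX-style client: successive slashes are
    // collapsed into one before the lookup happens.
    static String posixStyleRead(String key) {
        return objectStore.get(key.replaceAll("/+", "/"));
    }

    public static void main(String[] args) {
        // The writer stores the object under a key containing a double slash.
        objectStore.put("foo//bar/file", "metadata");

        System.out.println(rawRead("foo//bar/file"));        // prints "metadata"
        System.out.println(posixStyleRead("foo//bar/file")); // prints "null": the lookup goes to foo/bar/file
    }
}
```

The object is still there, but the normalizing reader can never name it, which is the situation described above.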
I think in this case we should probably add to the spec that all file locations (and paths) must comply with POSIX path rules.
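As a rough idea of what such a rule could look like in code, here is a sketch of a hypothetical helper (not part of Iceberg) that collapses successive slashes in the path portion of a location while leaving a leading `scheme://authority` prefix such as `s3://bucket` untouched. The class name, method names, and the prefix regex are assumptions made for illustration, and the sketch does not give a leading `//` any special treatment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class LocationNormalizer {
    // Matches an optional "scheme://authority" prefix, e.g. "s3://bucket" or "file://".
    private static final Pattern PREFIX =
        Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*://[^/]*");

    private LocationNormalizer() {}

    // Collapses runs of '/' in the path part; the scheme and authority are kept as-is.
    static String collapseSlashes(String location) {
        Matcher m = PREFIX.matcher(location);
        int pathStart = m.find() ? m.end() : 0;
        String prefix = location.substring(0, pathStart);
        String path = location.substring(pathStart);
        return prefix + path.replaceAll("/{2,}", "/");
    }

    // True if the location contains no successive slashes in its path part.
    static boolean isNormalized(String location) {
        return location.equals(collapseSlashes(location));
    }

    public static void main(String[] args) {
        System.out.println(collapseSlashes("s3://bucket/foo//bar/file")); // s3://bucket/foo/bar/file
        System.out.println(isNormalized("foo//bar/file"));                // false
        System.out.println(isNormalized("foo/bar/file"));                 // true
    }
}
```

A check like `isNormalized` could reject bad locations at write time, whereas `collapseSlashes` silently rewrites them; which of the two is appropriate is a design question for the spec discussion.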