Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We've run into a performance issue when using arrow-rs via DeltaLake. Delta tables often hold dozens of parquet files in a log directory. To select recent files, the function LocalFileSystem::list_with_offset is called. It does not have an efficient implementation: the entire directory is scanned, which in our example results in more than 100,000 statx system calls, several for each file in the _delta_log subdirectory. This is terribly slow for our use case.
Describe the solution you'd like
Upgrade LocalFileSystem::list_with_offset to filter the files before they are inspected, which cuts the number of statx calls. We have a simple PR (to follow) which does this.
For our use case, it cuts the time to open the delta table from 35 seconds to 4 seconds.
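To illustrate the idea, here is a minimal sketch using only the Rust standard library. It is not the code from the PR and not the object_store API: the helper name `list_after_offset`, its signature, and the flat (non-recursive) directory walk are all assumptions made for illustration. The point is only that the file name is compared against the offset first, so entries at or before the offset never trigger a metadata lookup.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of object_store): list the regular files in
/// `dir` whose file name sorts strictly after `offset`, with their sizes.
fn list_after_offset(dir: &Path, offset: &str) -> io::Result<Vec<(PathBuf, u64)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let os_name = entry.file_name();

        // Skip names that are not valid UTF-8 (a simplification for this sketch).
        let Some(name) = os_name.to_str() else { continue };

        // Cheap pre-filter: a string comparison only, no metadata lookup.
        if name <= offset {
            continue;
        }

        // Only entries that pass the name filter pay for a metadata lookup.
        let meta = entry.metadata()?;
        if meta.is_file() {
            out.push((entry.path(), meta.len()));
        }
    }
    // read_dir returns entries in no particular order, so sort for stable output.
    out.sort();
    Ok(out)
}
```

Calling `list_after_offset(log_dir, "00000000000000000002.json")` on a directory of zero-padded log files would return only the files named after that entry. A real implementation would need to compare full object paths rather than bare file names and handle nested directories.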
Describe alternatives you've considered
There are likely fancier ways to filter these files and cut the number of statx calls, but a simple pre-filter like the one in the associated PR is quite effective. More generally, it would be nice to have a more optimized LocalFileSystem implementation.
Additional context