[C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK #40035

@Tom-Newton

Describe the enhancement requested

Optimisation to #37511
Child of #18014

When reading from Azure blob storage, the bandwidth we get per connection is very dependent on the latency to the filesystem. To achieve good bandwidth at high latency, far greater concurrency is needed. This is relevant, for example, when reading from blob storage in a different region from your compute.
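To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes a deliberately simple model that is not taken from Arrow or the Azure SDK: each ranged request costs one round trip of latency plus the transfer time at a fixed per-connection line rate, and a connection issues its requests back to back. The function names and parameters are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed model: time per request = latency + chunk_bytes / line_rate.
// Returns the effective throughput of one connection in bytes per second.
double PerConnectionThroughput(double chunk_bytes, double latency_s,
                               double line_rate_bytes_per_s) {
  assert(chunk_bytes > 0 && latency_s >= 0 && line_rate_bytes_per_s > 0);
  return chunk_bytes / (latency_s + chunk_bytes / line_rate_bytes_per_s);
}

// Number of concurrent connections needed to reach a target aggregate
// throughput under the same model (rounded up).
int64_t ConnectionsNeeded(double target_bytes_per_s, double chunk_bytes,
                          double latency_s, double line_rate_bytes_per_s) {
  assert(target_bytes_per_s > 0);
  const double per_connection =
      PerConnectionThroughput(chunk_bytes, latency_s, line_rate_bytes_per_s);
  return static_cast<int64_t>(std::ceil(target_bytes_per_s / per_connection));
}
```

Under these assumptions, with 4 MB chunks, a 100 MB/s line rate and 100 ms of latency, one connection manages roughly 28.6 MB/s, so about 4 connections are needed to reach 100 MB/s. At 10 ms of latency the same target needs 2.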

As an example, let's consider reading a Parquet file. There are two levels of parallelism that I'm aware of when using Arrow and the native AzureFileSystem:

  1. Arrow makes concurrent calls to ReadAt, one for each column and row group combination. At most we can have one concurrent connection per column and row group combination, so for small Parquet files this may be less than we would like.
  2. Within ReadAt, the AzureFileSystem calls BlobClient::DownloadTo, which implements some extra concurrency internally: https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516. The purpose of this issue is to expose the options that control this parallelism to the user.
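To illustrate the second level, the sketch below models how one ReadAt-sized request could be split into ranged requests. The three fields are named after the `TransferOptions` members of the SDK's `DownloadBlobToOptions` as I understand them (`InitialChunkSize`, `ChunkSize`, `Concurrency`), but `TransferOptionsSketch`, `ByteRange`, `PlanRanges` and `MaxParallelRequests` are hypothetical, standalone helpers with illustrative default values. This is not the SDK's or Arrow's actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the options this issue proposes to expose.
// Default values are illustrative assumptions only.
struct TransferOptionsSketch {
  int64_t initial_chunk_size = 256LL * 1024 * 1024;  // size of the first request
  int64_t chunk_size = 4LL * 1024 * 1024;            // size of each later request
  int32_t concurrency = 5;                           // max requests in flight
};

struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Split [offset, offset + length) into one initial range followed by
// fixed-size chunks; the last chunk may be shorter.
std::vector<ByteRange> PlanRanges(int64_t offset, int64_t length,
                                  const TransferOptionsSketch& opts) {
  assert(offset >= 0 && length >= 0);
  assert(opts.initial_chunk_size > 0 && opts.chunk_size > 0);
  std::vector<ByteRange> ranges;
  if (length == 0) return ranges;

  const int64_t first = std::min(length, opts.initial_chunk_size);
  ranges.push_back({offset, first});

  const int64_t end = offset + length;
  for (int64_t pos = offset + first; pos < end;) {
    const int64_t len = std::min(opts.chunk_size, end - pos);
    ranges.push_back({pos, len});
    pos += len;
  }
  return ranges;
}

// Upper bound on requests in flight at once: a lone initial request, or
// the remaining chunks capped by the concurrency setting.
int64_t MaxParallelRequests(const std::vector<ByteRange>& ranges,
                            const TransferOptionsSketch& opts) {
  const int64_t count = static_cast<int64_t>(ranges.size());
  if (count <= 1) return count;
  return std::min<int64_t>(count - 1, opts.concurrency);
}
```

In this model, a read that fits inside the initial chunk is a single request and gets no extra parallelism at all, which is why being able to lower the initial chunk size and raise the concurrency matters for high-latency reads.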

Component(s)

C++
