Implement a lazy multipart writer that does direct PUT for small files#5205
Implement a lazy multipart writer that does direct PUT for small files#5205Samrose-Ahmed wants to merge 1 commit intoapache:masterfrom
Conversation
|
I'm a little confused by this, I was perhaps expecting a struct that could be created with a Path and an object store and would implement |
object_store/src/aws/mod.rs
Outdated
| self.client.put_part(path, id, part_idx, data).await | ||
| } | ||
|
|
||
| async fn put_object(&self, path: &Path, data: Bytes) -> Result<()> { |
There was a problem hiding this comment.
If the data is already in memory, i.e. Bytes what is the motivation for using multipart?
That doesn't seem correct, because that wouldnt be atomic, i.e you could have half data uploaded. The motivation is imagine u upload very small files half the time and large ones the rest, but you don't know ahead of time. This saves you the needless three calls when ur uploading small files, you just directly PUT the buffered data. |
|
It would need to buffer in memory until the threshold is reached, at which point it falls back to usinng multipart, otherwise on close it just does a put request with what it has buffered? |
|
Yes that's what I was thinking |
object_store/src/multipart.rs
Outdated
| /// up to the configured maximum concurrency | ||
| /// The multipart upload will only be created when more than the 10 MiB is written, | ||
| /// otherwise the data will be uploaded directly as PUT. | ||
| pub struct LazyWriteMultiPart { |
There was a problem hiding this comment.
Why can't this just call ObjectStore::put_multipart to obtain a boxed AsyncWrite if/when it decides to take this path? That would simpler, faster and work for all stores?
There was a problem hiding this comment.
Ah yeah that might be a better way, then you wouldn't need the multipartstore.
|
Marking as draft to signify this is not waiting on review, feel free to unmark when you would like me to take another look |
39adfc0 to
6ddbc82
Compare
|
Thanks let me know ur comments, I will fix the test |
tustvold
left a comment
There was a problem hiding this comment.
I like where this is heading, left some comments
6ddbc82 to
8280a8f
Compare
|
I made an update according to ur suggestions, let me know comments. |
8280a8f to
cba3bec
Compare
Signed-off-by: 🐼 Samrose Ahmed 🐼 <[email protected]>
cba3bec to
372bcbd
Compare
|
I've run out of time to review this today, trying to avoid spending all my holiday reviewing code, will try to find time to take a look in the coming days. |
tustvold
left a comment
There was a problem hiding this comment.
Had a brief look, I think this would benefit from some low-level testing of the interface and state transitions, as it stands I'm not very comfortable the logic is actually correct
| } | ||
| } | ||
| LazyWriteState::AsyncWrite(writer) => { | ||
| return Pin::new(writer).poll_write(cx, buf).map_ok(|n| n + wrote) |
There was a problem hiding this comment.
I'm not sure this is correct, the returned count is the number from buf that is consumed, as it stands I think this will include any data that was previously buffered as well
|
Marking as a draft, please feel free to unmark when you would like me to take another look |
|
I picked this up in #5431 PTAL |
Which issue does this PR close?
I can create an issue for this.
Rationale for this change
For small files, multipart upload is unnecessary, but it can be convenient to use the same interface, ie. same AsyncWrite that can internally either do a direct PUT for small files or use multipart upload.
What changes are included in this PR?
Adds new code to create an async writer that lazily creates the multipart upload after there is enough data.
Are there any user-facing changes?
yes
notes