-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[pytorch] Make FileStore not segfault with concurrent accesses. #28812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
FileStore wasn't thread-safe. Maybe it wasn't necessarily intended as such, but it's not documented as such, and we already have unittests that use this store from multiple threads, and occasionally crash as a result. We can address this by simply throwing a small-scope std::mutex around accesses to the std::unordered_map<> cache_. Differential Revision: [D18187821](https://our.internmc.facebook.com/intern/diff/D18187821/) [ghstack-poisoned]
…sses." FileStore wasn't thread-safe. Maybe it wasn't necessarily intended as such, but it's not documented as such, and we already have unittests that use this store from multiple threads, and occasionally crash as a result. We can address this by simply throwing a small-scope std::mutex around accesses to the std::unordered_map<> cache_. Differential Revision: [D18187821](https://our.internmc.facebook.com/intern/diff/D18187821/) [ghstack-poisoned]
…sses." FileStore wasn't thread-safe. Maybe it wasn't necessarily intended as such, but it's not documented as such, and we already have unittests that use this store from multiple threads, and occasionally crash as a result. We can address this by simply throwing a small-scope std::mutex around accesses to the std::unordered_map<> cache_. Differential Revision: [D18187821](https://our.internmc.facebook.com/intern/diff/D18187821/) [ghstack-poisoned]
Pull Request resolved: #28812 FileStore wasn't thread-safe. Maybe it wasn't necessarily intended as such, but it's not documented as such, and we already have unittests that use this store from multiple threads, and occasionally crash as a result. We can address this by simply throwing a small-scope std::mutex around accesses to the std::unordered_map<> cache_. ghstack-source-id: 92787892 Differential Revision: [D18187821](https://our.internmc.facebook.com/intern/diff/D18187821/)
pietern
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this used from multiple threads?
If we must make it thread safe I think there are more subtle things going on. The store keeps a file handle and an offset within that file that is already processed. If multiple threads try and refresh the store with the latest values (everything from the current offset to the end of the file), then that also must happen while holding a lock, or the threads will clobber the offset (pos_).
torch/lib/c10d/FileStore.cpp
Outdated
| file, | ||
| pos_, | ||
| [this](const std::string& key, std::vector<uint8_t>&& value) { | ||
| insertCache(key, std::move(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Superfluous std::move?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, this wasn't superfluous - the compiler gets cranky without the extra std::move sprinkled in.
(Though this isn't needed in the new approach)
|
Where are those unittests that use it from multiple threads? |
|
Sorry, I should have been more clear; I meant to tidy a few things with this PE EOD yesterday. I discovered yesterday that many of the [fb-only] ThriftRpcAgent tests are using FileStore in an unsafe manner (i.e. creating one of these, and handing to be used across two different threads, see the CommunicationManagerTest). I forgot about pos_ here, the segfault I saw when running the tests in a loop with a bad unordered_map access. LMK if you think it makes sense to write up a separate class if this interferes too much - though my concern is that even if we documen |
|
(my paste got cut off) |
…sses." FileStore wasn't thread-safe. Maybe it wasn't necessarily intended as such, but it's not documented as such, and we already have unittests that use this store from multiple threads, and occasionally crash as a result. We can address this by simply throwing a small-scope std::mutex around accesses to the std::unordered_map<> cache_. Differential Revision: [D18187821](https://our.internmc.facebook.com/intern/diff/D18187821/) [ghstack-poisoned]
|
Peter, I changed the locking approach to effective a semaphore around file operations. The cv/mutex cost should be quite low in the context of file ops/file locking. If you really don't think it's warranted in this class, I'm sort of ok with us just documenting the class as one dangerous to use with threads, though this might lead to misuse, or spraying coarser-grained locks (that are held across file syscalls) at the problem. |
|
@jjlilley I took a look at the test you're referring to and it's unintended use. If there is some multi threaded test, you should be able to use an in process "store", which is nothing more than a map and a mutex (see for example https://github.com/facebookincubator/gloo/blob/master/gloo/rendezvous/hash_store.cc). No need to hit a file on disk in that case. I don't see a valid use case where you have both multiple processes and multiple threads that need access to the same store. Even if it were, I would instead give every thread its own store instance instead of reusing the same one. Making this one thread safe doesn't hurt, though, but it shouldn't be a proxy for folks to assume everything is thread-safe. Also, using process groups must always be done from a single thread, because the calls must be issued in a deterministic order across all processes. We can go forward with this change, but I recommend removing the RAII class and instead using a |
…sses." FileStore isn't thread-safe. We've observed a few FB unittests already using this class in an unsafe manner. This change enforces at most a single concurrent use of the various file options, from this specific Store instance. This protects the cache_, pos_, and the relative integrity of the operations. An alternative would be simply to explicitly document this class as non-thread-safe, though perhaps not everybody will read the warning. Differential Revision: [D18187821](https://our.internmc.facebook.com/intern/diff/D18187821/) [ghstack-poisoned]
…sses." FileStore isn't thread-safe. We've observed a few FB unittests already using this class in an unsafe manner. This change enforces at most a single concurrent use of the various file options, from this specific Store instance. This protects the cache_, pos_, and the relative integrity of the operations. An alternative would be simply to explicitly document this class as non-thread-safe, though perhaps not everybody will read the warning. Differential Revision: [D18187821](https://our.internmc.facebook.com/intern/diff/D18187821/) [ghstack-poisoned]
|
Thanks for your response, Pieter.
|
…sses." FileStore isn't thread-safe. We've observed a few FB unittests already using this class in an unsafe manner. This change enforces at most a single concurrent use of the various file options, from this specific Store instance. This protects the cache_, pos_, and the relative integrity of the operations. An alternative would be simply to explicitly document this class as non-thread-safe, though perhaps not everybody will read the warning. Differential Revision: [D18187821](https://our.internmc.facebook.com/intern/diff/D18187821/) [ghstack-poisoned]
…sses." FileStore isn't thread-safe. We've observed a few FB unittests already using this class in an unsafe manner. This change enforces at most a single concurrent use of the various file options, from this specific Store instance. This protects the cache_, pos_, and the relative integrity of the operations. An alternative would be simply to explicitly document this class as non-thread-safe, though perhaps not everybody will read the warning. Differential Revision: [D18187821](https://our.internmc.facebook.com/intern/diff/D18187821/) [ghstack-poisoned]
Pull Request resolved: #28812 FileStore isn't thread-safe. We've observed a few FB unittests already using this class in an unsafe manner. This change enforces at most a single concurrent use of the various file options, from this specific Store instance. This protects the cache_, pos_, and the relative integrity of the operations. An alternative would be simply to explicitly document this class as non-thread-safe, though perhaps not everybody will read the warning. ghstack-source-id: 92874098 Differential Revision: [D18187821](https://our.internmc.facebook.com/intern/diff/D18187821/)
pietern
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent, thanks for the update @jjlilley!
In this diff I see the mutex is locked at the same time as the file is opened. We could make this even simpler by locking the mutex from the file constructor, e.g. by passing the mutex as ref.
|
Thanks for the review. Maybe I'll submit as-is, and will send a quick [optional] followup? It's not clear 100% to me from squinting at the code, which seems more elegant, because of the needed interaction with Lock. |
|
This pull request has been merged in 331e09e. |
|
You're right, it would make it more convoluted. |
Stack from ghstack:
FileStore isn't thread-safe. We've observed a few FB unittests
already using this class in an unsafe manner.
This change enforces at most a single concurrent use of
the various file options, from this specific Store instance.
This protects the cache_, pos_, and the relative integrity
of the operations.
An alternative would be simply to explicitly document this
class as non-thread-safe, though perhaps not everybody will
read the warning.
Differential Revision: D18187821