Add pooling for parallel usage of SyntaxSet#78
Add pooling for parallel usage of SyntaxSet#78williamyaoh wants to merge 6 commits intotrishume:masterfrom
Conversation
|
I'd like some eyes on the |
trishume
left a comment
There was a problem hiding this comment.
Thanks for taking on this feature, I appreciate it.
Unfortunately, I have a number of issues with the way you did it. I'm sorry you put the work into implementing it this way, if you had told me your plans in #20 I could have told you ahead of time what changes I'd request.
But, I've included suggestions for what changes you could make to get this PR to a state where I'd approve it.
|
|
||
| #[bench] | ||
| fn bench_parallel_parsing(b: &mut Bencher) { | ||
| let files = |
There was a problem hiding this comment.
I'd prefer that all these files weren't added to the repo, but if you have a good reason I'm willing to be convinced.
My proposed alternative is just using the existing large test files (parser.rs and jquery.js) multiple times each.
| use std::io::{stdout, BufReader, BufRead, Write}; | ||
| use std::fs::File; | ||
|
|
||
| trait HasSync: Sync {} |
| /// Same as `new`, but with a defined maximum pool size. | ||
| pub fn with_pool_size(init_fn: F, pool_size: usize) -> Self { | ||
| SyntaxSetPool { | ||
| syntaxes: Mutex::new(repeating(LazyInit::new).take(pool_size).collect()), |
There was a problem hiding this comment.
This could be replaced with (0..pool_size).iter().map(|_| LazyInit::new).collect() or just a for loop using Vec::push and then the whole bunch repeating iterator code wouldn't need to be included just for this single use.
| pub struct SyntaxSetPool<F: Fn() -> SyntaxSet> { | ||
| /// We intentionally use a *stack* of `SyntaxSet`s so that | ||
| /// already-initialized syntaxes get reused as much as possible. | ||
| syntaxes: Mutex<Vec<LazyInit<SyntaxSet>>>, |
There was a problem hiding this comment.
Instead of using this whole unsafe LazyInit business (which I'm very uncomfortable with, especially given that without it syntect itself contains no unsafe code), you could do something like this:
- Define `type PooledSet = Arc<Mutex>;
- In SyntaxSetPool use
syntaxes: Mutex<Vec<PooledSet>>. - Add a field
max_pool_sizethat you set on initialization to the core count. - Add a field
num_in_poolthat starts at0. - At initialization the
syntaxesvector is empty. - When getting a syntax, if there are none left in
syntaxes:- If
num_in_pool < max_pool_size, create a new one and bumpnum_in_pool. - Else block on
has_syntax
- If
- If there is a syntax, pop it, lock it, call
go, then unlock it, lock the pool and return it.
So at the cost of one extra never-contested lock (which AFAIK should be very cheap) per syntax get, the amount of code required decreases substantially and it no longer requires unsafe.
There was a problem hiding this comment.
Unfortunately, I don't think that solution would work, since that wouldn't make SyntaxSetPool be Sync, since PooledSet wouldn't be Sync. This is the big sticking point for this pull request too, but I don't see a way to transport SyntaxSets across thread boundaries the way it's needed without using unsafe.
I agree with you that adding unsafe code is unsavoury, so maybe in the end the best way to resolve #20 would be to recommend people use thread_local! + rayon.
There was a problem hiding this comment.
@williamyaoh you're right, my mistake. Looking into the docs for Send and Sync to try and understand why the Mutex doesn't end up Sync, I think the same issue might apply to your unsafe code.
I haven't thought about it very long yet but I think you might be able to pass an initializer function that clones one of the Rcs inside SyntaxSet, stashes it outside the closure inside a RefCell or something and use that to create a data race on the refcount.
Unless one of us thinks of a way to do this without unsafe, or perhaps more possibly, using a well-tested established crate (which might use unsafe) to do it, I'm leaning towards just recommending thread_local! and rayon in the docs like you say.
There was a problem hiding this comment.
I don't believe, with SyntaxSet being written the way it is, that there's a way to implement pooling in syntect itself using only safe code.
For the purposes of pooling, the only marker we care about is making SyntaxSet be Send (If it were Sync, we could just share it between threads directly, with no need for pooling). My understanding of Send is that it denotes a type that's safe to move ownership of across thread boundaries, so making SyntaxSet be Send would require everything it owns to also be Send, which clearly can't happen because of the Rcs. For the same reason, no external crate could provide a safe interface to transport non-Send types across thread boundaries in general.
Without it being either Send or Sync, there's simply no way to safely share a SyntaxSet between threads. Either you move it, or you share it through a reference, whether that's through a normal reference or some kind of smart pointer.
I'd just go with recommending rayon. On reflection, the amount of code within the unsafety boundary caused by unsafely implementing Send on SyntaxSet is a bit too large to stomach.
There was a problem hiding this comment.
@williamyaoh that's a good point, you're right. There's basically no way to prove to the compiler that you haven't cloned one of the Rcs and are going to access it from another thread. And I'm not even sure there is a way to stop you from doing that, even without proving it to the compiler.
I can update the docs and maybe add an example using Rayon and thread_local!, but it's inconvenient to contribute to syntect right now because I'm working at Google and it's used in a Google project, so I can't claim that it's unrelated and I'd have to go through a process.
So either I can do that in a couple months after my internship or I can accept a PR if you do it.
| serde = { version = "1.0", features = ["rc"] } | ||
| serde_derive = "1.0" | ||
| serde_json = "1.0" | ||
| num_cpus = "^1.5" |
There was a problem hiding this comment.
I think this should be behind the parsing feature just like a bunch of the other crates. Same for the corresponding extern crate line.
| /// Use syntax sets by passing a closure to `with_syntax_set()`. A | ||
| /// `SyntaxSet` will be removed from the pool, and a reference to it | ||
| /// passed to the given closure. Afterwards, the set will automatically | ||
| /// be returned to the pool. |
There was a problem hiding this comment.
I would add a note here to avoid using this when you're already using a crate like rayon that uses a small thread pool. In that case it is simpler to just use:
thread_local!(static SYNTAX_SET: SyntaxSet = SyntaxSet::load_defaults_newlines());
// and later:
SYNTAX_SET.with(|syntax_set| ... );
See #20.
This PR adds a struct
SyntaxSetPoolwhich poolsSyntaxSets for parallel usage with the minimum amount of initialization of saidSyntaxSets. By default, it will only initialize as manySyntaxSets as the running machine has CPUs, though this can be specified as well.Examples of the usage of
SyntaxSetPoolcan be found inexamples/parsyncat.rs, which highlights multiple source files in parallel and prints the results tostdout, andbenches/parallel_parsing.rs.Issues
SyntaxDefinition's preexistingCloneimplementation, which was merely incorrect before (due toWeakpointers in the resulting clone), can now cause memory unsafety due to their containingSyntaxSetbeing ferried across thread boundaries. Since this implementation was already incorrect, ideally thisCloneimplementation would be fixed or removed before merging this PR. Alternatively, there might be a way to hide it from library users.