prototype of configurable parallelism #3934
Conversation
Code Review
This pull request introduces configurable parallelism for the generate command, breaking it down into parallel clean, generate, and format steps. The implementation correctly uses errgroup.WithContext and SetLimit to control concurrency. My review focuses on a critical correctness issue with loop variable capture in the new parallel execution helper and a potential reintroduction of a race condition during the clean step that was previously guarded against. Addressing these points will improve the robustness and correctness of the new parallel processing logic.
```go
for _, lib := range libraries {
	g.Go(func() error {
		return function(gctx, lib)
	})
}
```
There's a common Go pitfall here with loop variable capture in a closure. Before Go 1.22, the `lib` variable is reused across iterations of the loop, so the goroutines created by `g.Go` will likely all observe the value of `lib` from the final iteration, leading to incorrect behavior where the same library is processed multiple times and others are skipped. (Go 1.22 changed `for` loops to declare a fresh variable per iteration, but this only applies if the module's `go` directive targets 1.22 or later.) To fix this for earlier versions, create a new variable inside the loop scope to capture the current value for each goroutine.
Suggested change:

```diff
 for _, lib := range libraries {
+	lib := lib // Capture loop variable.
 	g.Go(func() error {
 		return function(gctx, lib)
 	})
 }
```
```go
if err := executeInParallel(ctx, libraries, parallelization.Clean, func(gctx context.Context, lib *config.Library) error {
	return cleanLibrary(cfg.Language, lib)
}); err != nil {
	return err
}
```
The previous implementation performed cleaning sequentially with a comment explaining it was to avoid race conditions when output directories are nested. This change introduces parallel cleaning, which could re-introduce those race conditions if a user configures parallelization.clean to be greater than 1. For example, if one library's output directory is a subdirectory of another's, cleaning them in parallel could lead to unpredictable behavior. Consider either forcing the clean operation to be sequential or adding a strong warning in the configuration documentation about this potential issue.
internal/librarian/generate.go
Outdated
```go
if err := python.Generate(ctx, library, googleapisDir); err != nil {
	return err
}
return nil
```
drive-by: simplify to:
```diff
-if err := python.Generate(ctx, library, googleapisDir); err != nil {
-	return err
-}
-return nil
+return python.Generate(ctx, library, googleapisDir)
```
This change shouldn't be in here at all; it crept in when I was replacing the existing code with just "log and sleep". But the same simplification could still be made, in fact for all branches...
I've refactored the cleanLibrary function though, given that that was changing anyway...
```go
// parallelization; a positive number represents a specific limit of
// parallelization; 0 is treated as equivalent to 1 (for the sake of
// convenience, as a default value).
type Parallelization struct {
```
I don't think we should put this in the librarian.yaml. We should just hardcode it in internal/librarian/{language} since it is an implementation detail of the function.
I'm not convinced this needs to be configurable right now. If languages need different behavior, I'd rather handle that at the language implementation level than introduce orchestration knobs in librarian.yaml. For debugging, I think we can cross that bridge when we get there, and maybe adding a simple

My preference would be to solve this problem in code right now, and we can always add configuration later if there are real use cases that can't be solved without it.
@julieqiu Fair enough. My guess is that the simplest way to do that is to change the sort of interface we've got with the languages: pass in the whole slice of libraries to clean (generate, format) and let the language decide whether to do it in series or in parallel. That means a bit of duplication (looping, basically) but is probably simpler than a sort of "parallelism negotiation" where the agnostic code asks the language code whether or not to parallelize, but you may have had something else in mind. If you're happy with the "pass the slice in" approach, I'll create a new PR tomorrow.
Closing in favor of #4005 (which still may not be what we want).