
Add locking around steps. #4108

Closed

aledegano wants to merge 1 commit into concourse:master from Pix4D:lock_worker_choice_in_tasks

Conversation

@aledegano
Contributor

This is my tentative fix for #3301, taking a pragmatic approach and implementing a lock wherever needed.

All steps choose the worker serially by acquiring the same lock.

The lock is acquired in the task, put and get steps,
each implemented in the respective `Run` method.

The lock is released by the corresponding method that creates
the Container in the database, just after that's done:
either `worker.FindOrCreateContainer` or `fetcher.Fetch`.

Resource_scanner, resource_type_scanner and image_resource_fetcher
do not implement the locking mechanism, thus their processing is independent.

The unit tests are adjusted for the addition of the lock.

New test cases are probably required, but I'd like to have feedback on the general direction I've taken
before working more on this.

@ddadlani @deniseyu @kcmannem

Signed-off-by: Alessandro Degano [email protected]

@aledegano aledegano force-pushed the lock_worker_choice_in_tasks branch from fb29a79 to 6c73810 Compare July 8, 2019 12:59

@ddadlani ddadlani left a comment


Hi @aledeganopix4d

Thanks for working on fixing #3301!

My biggest concern with this implementation is that all steps are effectively serialized. Some get or task steps could take quite a while and holding the lock for this time would mean that nothing else would get scheduled while a step is running, which seems too restrictive.

For the use case of #4076, I think it would be sufficient to have a TaskStepLock that is only required with that feature. What are your thoughts?

We also have some incoming refactors (such as #4109) that will break this locking mechanism because we are trying to centralize container creation in one place (to enable things like scheduling and multiple runtimes).

Would it be sufficient for your use case to limit the lock to #4076?

```go
versionedSource, err := f.fetchWithLock(ctx, logger, source, imageFetchingDelegate.Stdout())
if containerLock != nil {
	containerLock.Release()
}
```
Contributor


The lock would need to be released after the second call to `fetchWithLock()` on line 86, since container creation could occur there. But this is susceptible to deadlock, because we have two locks being held/released here.

```go
s.session.Metadata,
s.containerSpec,
s.resourceTypes,
nil,
```
Contributor


We should be passing the lock in here since we could be creating a `db.CreatingContainer` inside `FindOrCreateContainer`.

@aledegano
Contributor Author

> Hi @aledeganopix4d
>
> Thanks for working on fixing #3301!

Hi!
Thank you for taking the time to review this PR!

> My biggest concern with this implementation is that all steps are effectively serialized. Some get or task steps could take quite a while and holding the lock for this time would mean that nothing else would get scheduled while a step is running, which seems too restrictive.
>
> For the use case of #4076, I think it would be sufficient to have a TaskStepLock that is only required with that feature. What are your thoughts?

Yes, for #4076 it would be sufficient; however, the bug reported in #3301 won't be solved for all use cases: `in_parallel` puts and gets could still land on the same worker.

> We also have some incoming refactors (such as #4109) that will break this locking mechanism because we are trying to centralize container creation in one place (to enable things like scheduling and multiple runtimes).

I've noticed #4109 and I was indeed wondering how and when that refactor would impact my work with this PR and #4076. But my understanding is that I can keep working on the current HEAD as that rework might take a while?

> Would it be sufficient for your use case to limit the lock to #4076?

Yes, in that case it's probably better if I create a single PR that introduces `max-build-tasks-per-worker` and the minimum amount of locking required for that mechanism to work as intended.
Do you agree?

@ddadlani
Contributor

ddadlani commented Jul 9, 2019

Hi @aledeganopix4d,

We are planning to merge #4109 into master in the next couple of days. The only effect it should have on your PR is where the "active_tasks" number is incremented. In fact, you could likely move the calls to `IncrementActiveTasks` and `DecrementActiveTasks`, together with the new locking logic, into `RunTaskStep` in `worker.Client`, so it could remain encapsulated in that part of the code.

What do you think?

@ddadlani
Contributor

ddadlani commented Jul 9, 2019

> Yes, in that case, probably, it's better if I create a single PR, that introduces max-build-tasks-per-worker and the minimum amount of locking required for that mechanism to work as intended.
> Do you agree?

Yes, I agree

@aledegano
Contributor Author

> Hi @aledeganopix4d,
>
> We are planning to merge #4109 into master in the next couple of days. The only effect it should have on your PR is where the "active_tasks" number is incremented. In fact, you could likely move the calls to IncrementActiveTasks and DecrementActiveTasks and the new locking logic, into RunTaskStep in worker.Client and it could remain encapsulated in that part of the code.
>
> What do you think?

That's perfect, thanks!

@aledegano aledegano closed this Jul 16, 2019
@aledegano
Contributor Author

Superseded by #4118

@aledegano aledegano deleted the lock_worker_choice_in_tasks branch July 16, 2019 12:53