
Conversation

@swaminathanmanish (Contributor) commented Mar 21, 2023

Problem: Currently, ZK updates for job types RELOAD, RELOAD_ALL, and FORCE_COMMIT all go to the same Znode. This increases write load on that Znode, especially since updates for all tables for the listed job types go to it. We discovered the problem while implementing rebalance status and decided to create a separate Znode for TABLE_REBALANCE. This PR finishes the cleanup by moving the other job types to their own Znodes.

Solution: Move updates for job types RELOAD, RELOAD_ALL, and FORCE_COMMIT to their own Znodes. RELOAD and RELOAD_ALL updates are combined into a single Znode since they are related.
Note: This is a backwards-incompatible change. The status of previously completed reload jobs will not be available after this change is deployed. Given the low impact, and to keep the code changes simple, we made it backwards incompatible.
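A rough before/after sketch of the layout change described above; the ZNode path names here are illustrative, not necessarily the ones in the diff:

```
Before: progress updates for every job type and every table funnel into
one shared ZNode, concentrating ZK write load on a single path:

  /CONTROLLER_JOB_STATUS   <- RELOAD, RELOAD_ALL, FORCE_COMMIT (all tables)

After: a small, fixed set of ZNodes, one per job type (RELOAD and
RELOAD_ALL share one since they are related) -- still per type, not per table:

  /CONTROLLER_JOB_STATUS/RELOAD            <- reload jobs for all tables
  /CONTROLLER_JOB_STATUS/FORCE_COMMIT      <- force-commit jobs for all tables
  /CONTROLLER_JOB_STATUS/TABLE_REBALANCE   <- already split out in a prior PR
```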

@swaminathanmanish marked this pull request as ready for review March 21, 2023 19:04
@swaminathanmanish (Contributor, Author)

@saurabhd336, @Jackie-Jiang - please take a look.

@codecov-commenter commented Mar 22, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.38%. Comparing base (3687b2b) to head (8ae7e9f).
Report is 2397 commits behind head on master.

Additional details and impacted files
```
@@             Coverage Diff              @@
##             master   #10451      +/-   ##
============================================
+ Coverage     63.23%   70.38%   +7.14%
- Complexity     5905     6468     +563
============================================
  Files          2036     2103      +67
  Lines        110973   112766    +1793
  Branches      16892    16981      +89
============================================
+ Hits          70177    79367    +9190
+ Misses        35606    27848    -7758
- Partials       5190     5551     +361
```
| Flag | Coverage Δ |
| --- | --- |
| integration1 | 24.46% <100.00%> (-0.03%) ⬇️ |
| integration2 | 24.25% <100.00%> (-0.11%) ⬇️ |
| unittests1 | 67.88% <0.00%> (-0.08%) ⬇️ |
| unittests2 | 13.88% <0.00%> (?) |

Flags with carried forward coverage won't be shown.


@mcvsubbu self-assigned this Mar 29, 2023
@mcvsubbu added the backward-incompat label (referenced by PRs that introduce or fix backward compat issues) Mar 29, 2023
@mcvsubbu (Contributor)

Back when we were designing pause/resume, we specifically did not want a proliferation of znodes (as was proposed in the first design). We changed the design to fold it all into the idealstate znode.
cc: @kishoreg, @npawar, @navina, @sajjad-moradi

@swaminathanmanish (Contributor, Author)

> Back when we were designing pause/resume, we specifically did not want a proliferation of znodes (as was proposed in the first design). We changed the design to fold it all into the idealstate znode. cc: @kishoreg, @npawar, @navina, @sajjad-moradi

Thanks Subbu. Could you point me to the design/reference doc, just for my understanding? In this case, we did not want to send all updates (progress status updates corresponding to reload, force_commit, rebalance) to a single Znode, as that would become a bottleneck. Instead, we have one Znode per update type, and the number of types is fixed/constant (3). This Znode is not per table; it is per type (3 types) and takes updates for all tables.

@Jackie-Jiang (Contributor)

@mcvsubbu There are pros and cons of storing the status within the IS.
For pause/consume, it makes more sense to store the status within the IS because:

  • There is only one state associated with the table (a boolean pause flag)
  • It is accessed quite frequently (not ad-hoc)
  • We never need to access this flag for all tables

For the controller task, IMO it makes more sense to keep the status in a separate ZNode because:

  • It is a per-task-type node, instead of a per-table node
  • All the access is ad-hoc (when manually submitting a task, or checking status)
  • We want to know all the tasks (across multiple tables) for a given task type
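To make the contrast above concrete, here is a minimal sketch of the two storage shapes using Helix's ZNRecord API. The field names and key schema are hypothetical, not Pinot's actual ones, and the ZNRecord import path varies across Helix versions.

```java
import java.util.Map;
import org.apache.helix.zookeeper.datamodel.ZNRecord;

class ZNodeShapes {
  public static void main(String[] args) {
    // Pause flag in the IdealState: a single boolean per table, kept in a
    // simple field of a record the controller already reads frequently.
    ZNRecord idealState = new ZNRecord("myTable_REALTIME");
    idealState.setSimpleField("isTablePaused", "true"); // hypothetical field name

    // Per-task-type ZNode: one record collects ad-hoc job statuses for ALL
    // tables, so listing every RELOAD job is a single read of this record.
    ZNRecord reloadJobs = new ZNRecord("RELOAD");
    reloadJobs.setMapField("myTable_OFFLINE.job-1", // hypothetical key schema
        Map.of("status", "IN_PROGRESS", "submissionTimeMs", "1680000000000"));
  }
}
```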

@mcvsubbu (Contributor)

> Back when we were designing pause/resume, we specifically did not want a proliferation of znodes (as was proposed in the first design). We changed the design to fold it all into the idealstate znode. cc: @kishoreg, @npawar, @navina, @sajjad-moradi

> Thanks Subbu. Could you point me to the design/reference doc, just for my understanding? In this case, we did not want to send all updates (progress status updates corresponding to reload, force_commit, rebalance) to a single Znode, as that would become a bottleneck. Instead, we have one Znode per update type, and the number of types is fixed/constant (3). This Znode is not per table; it is per type (3 types) and takes updates for all tables.

Here is the design doc in which we proposed znode hierarchy specifically for pause/resume.
https://docs.google.com/document/d/19uKzPRowJ8WLE0A4g6i8XLOBwqpswH2_OvFvbxxZ_X4/edit#heading=h.tvfsvrm5pwew

And here is the later design doc in which we changed it to be in IDEALSTATE, thereby losing some flexibility (e.g. per-partition pause).

I am not saying that one is better than the other. There are pros and cons.

I am asking if we are changing direction here.
cc: @kishoreg

@mcvsubbu (Contributor)

> @mcvsubbu There are pros and cons of storing the status within the IS. For pause/consume, it makes more sense to store the status within the IS because:
>
> • There is only one state associated with the table (a boolean pause flag)
> • It is accessed quite frequently (not ad-hoc)
> • We never need to access this flag for all tables
>
> For the controller task, IMO it makes more sense to keep the status in a separate ZNode because:
>
> • It is a per-task-type node, instead of a per-table node
> • All the access is ad-hoc (when manually submitting a task, or checking status)
> • We want to know all the tasks (across multiple tables) for a given task type

The arguments make sense. I think the PR includes (or at least suggests) that we try a new znode for forceCommit, hence my questions. Note that forceCommit is built on pause/resume.

@mcvsubbu (Contributor)

> @mcvsubbu There are pros and cons of storing the status within the IS. For pause/consume, it makes more sense to store the status within the IS because:
>
> • There is only one state associated with the table (a boolean pause flag)
> • It is accessed quite frequently (not ad-hoc)
> • We never need to access this flag for all tables
>
> For the controller task, IMO it makes more sense to keep the status in a separate ZNode because:
>
> • It is a per-task-type node, instead of a per-table node
> • All the access is ad-hoc (when manually submitting a task, or checking status)
> • We want to know all the tasks (across multiple tables) for a given task type

Not sure what you mean by "frequent access".

Also, table reload is on a per-table basis, and this PR suggests we move that to a separate znode as well.

@kishoreg (Member)

Subbu has valid questions. Before we get to the znode creation: why are we not using the existing task framework for this? There is nothing stopping us from creating a task for each of these commands that need to run in the background; the only difference is controllers running the tasks instead of minions.

Creating a task and managing fault tolerance etc. is not trivial. Using the existing task framework lets us leverage everything we built for minions in terms of getting progress updates, and the UI will have a tab to show the pending tasks, etc.

@swaminathanmanish (Contributor, Author)

> @mcvsubbu There are pros and cons of storing the status within the IS. For pause/consume, it makes more sense to store the status within the IS because:
>
> • There is only one state associated with the table (a boolean pause flag)
> • It is accessed quite frequently (not ad-hoc)
> • We never need to access this flag for all tables
>
> For the controller task, IMO it makes more sense to keep the status in a separate ZNode because:
>
> • It is a per-task-type node, instead of a per-table node
> • All the access is ad-hoc (when manually submitting a task, or checking status)
> • We want to know all the tasks (across multiple tables) for a given task type

> Not sure what you mean by "frequent access".
>
> Also, table reload is on a per-table basis, and this PR suggests we move that to a separate znode as well.

@mcvsubbu - To clarify about reload + znode: the Znode is per type, not per table. Reload for any table will go to the same znode. Likewise, status updates for rebalance/force_commit for any table will go to the same znode, based on their job type.

@Jackie-Jiang (Contributor) commented Mar 30, 2023

> @mcvsubbu There are pros and cons of storing the status within the IS. For pause/consume, it makes more sense to store the status within the IS because:
>
> • There is only one state associated with the table (a boolean pause flag)
> • It is accessed quite frequently (not ad-hoc)
> • We never need to access this flag for all tables
>
> For the controller task, IMO it makes more sense to keep the status in a separate ZNode because:
>
> • It is a per-task-type node, instead of a per-table node
> • All the access is ad-hoc (when manually submitting a task, or checking status)
> • We want to know all the tasks (across multiple tables) for a given task type

> Not sure what you mean by "frequent access".
>
> Also, table reload is on a per-table basis, and this PR suggests we move that to a separate znode as well.

Say we want to take the same approach as pause consumption by storing the job status in the IS; there would be the following problems:

  1. We can only use the simple fields, because the map fields are reserved for segment assignment. For job status, we want to keep different job statuses apart by using the map fields.
  2. It would be hard to use a simple field to track multiple tasks of the same task type on the same table (e.g. reloading multiple segments within a table).
  3. To gather the job status for a given job type, we would need to loop over all the ISs, which is very inefficient.

Another main difference between this and pause consumption is that we are adding very few ZNodes here (one per task type), so there won't be a scale issue.
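A rough illustration of point 3, assuming a Helix ZkHelixPropertyStore; the paths ("/IDEALSTATES", "/CONTROLLER_JOBS/RELOAD") and class names are hypothetical, not the PR's actual ones.

```java
import java.util.List;
import org.apache.helix.AccessOption;
import org.apache.helix.store.zk.ZkHelixPropertyStore;
import org.apache.helix.zookeeper.datamodel.ZNRecord;

class JobStatusReads {
  // Status-in-IS approach: answering "what RELOAD jobs exist?" means reading
  // the IdealState of every table, i.e. O(#tables) ZK reads.
  static void scanAllIdealStates(ZkHelixPropertyStore<ZNRecord> store) {
    List<String> tables = store.getChildNames("/IDEALSTATES", AccessOption.PERSISTENT);
    for (String table : tables) {
      ZNRecord idealState = store.get("/IDEALSTATES/" + table, null, AccessOption.PERSISTENT);
      // ... inspect idealState for the reload status of this table
    }
  }

  // Per-type ZNode (this PR's direction): a single read returns the job
  // statuses for all tables of a given job type.
  static ZNRecord readAllReloadJobs(ZkHelixPropertyStore<ZNRecord> store) {
    return store.get("/CONTROLLER_JOBS/RELOAD", null, AccessOption.PERSISTENT);
  }
}
```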

@Jackie-Jiang (Contributor)

> Subbu has valid questions. Before we get to the znode creation: why are we not using the existing task framework for this? There is nothing stopping us from creating a task for each of these commands that need to run in the background; the only difference is controllers running the tasks instead of minions.
>
> Creating a task and managing fault tolerance etc. is not trivial. Using the existing task framework lets us leverage everything we built for minions in terms of getting progress updates, and the UI will have a tab to show the pending tasks, etc.

@kishoreg I think these are two very different questions. We didn't leverage the task framework for controller tasks in the first place because directly executing the controller task is much lighter weight, from both a coding perspective and an execution perspective (no need to implement a separate task generator/executor, no need to serialize task configs, etc.). We can consider moving it to the task framework so that we get fault tolerance (retry/cancel logic) from it, but that involves a very big change: we would need to introduce the controller task executor concept, and inject the resource manager into the controller task executor. That is out of scope for this PR, since this PR only splits the single ZNode shared by all task types into one ZNode per task type.

@Jackie-Jiang (Contributor) left a comment

LGTM otherwise

…ill spread out Zk write load across jobTypes
@snleee merged commit 28c602b into apache:master Apr 7, 2023