
Conversation

@swaminathanmanish (Contributor) commented Mar 21, 2023

Problem: Currently, ZK updates for job types RELOAD, RELOAD_ALL, and FORCE_COMMIT all go to the same Znode. This increases write load on that Znode, especially since updates for all tables for the listed job types go to it. We discovered the problem while implementing rebalance status and decided to create a separate Znode for TABLE_REBALANCE. This PR finishes the cleanup by moving the other job types to their own Znodes.

Solution: Move updates for job types RELOAD, RELOAD_ALL, and FORCE_COMMIT to their own Znodes. RELOAD and RELOAD_ALL updates are combined into a single Znode since they are related.
Note: This is a backwards-incompatible change. The status of previously completed reload jobs will not be available after this change is deployed. Given the low impact, and to keep the code changes simple, we made it backwards incompatible.
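A rough before/after sketch of the layout change described above; the ZNode path names here are illustrative, not necessarily the ones in the diff:

```
Before: progress updates for every job type and every table funnel into
one shared ZNode, concentrating ZK write load on a single path:

  /CONTROLLER_JOB_STATUS   <- RELOAD, RELOAD_ALL, FORCE_COMMIT (all tables)

After: a small, fixed set of ZNodes, one per job type (RELOAD and
RELOAD_ALL share one since they are related) -- still per type, not per table:

  /CONTROLLER_JOB_STATUS/RELOAD            <- reload jobs for all tables
  /CONTROLLER_JOB_STATUS/FORCE_COMMIT      <- force-commit jobs for all tables
  /CONTROLLER_JOB_STATUS/TABLE_REBALANCE   <- already split out in a prior PR
```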

@swaminathanmanish marked this pull request as ready for review March 21, 2023 19:04
@swaminathanmanish (Contributor, Author)

@saurabhd336, @Jackie-Jiang - please take a look.

@codecov-commenter commented Mar 22, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.38%. Comparing base (3687b2b) to head (8ae7e9f).
Report is 2397 commits behind head on master.

Additional details and impacted files
```
@@             Coverage Diff              @@
##             master   #10451      +/-   ##
============================================
+ Coverage     63.23%   70.38%   +7.14%
- Complexity     5905     6468     +563
============================================
  Files          2036     2103      +67
  Lines        110973   112766    +1793
  Branches      16892    16981      +89
============================================
+ Hits          70177    79367    +9190
+ Misses        35606    27848    -7758
- Partials       5190     5551     +361
```
| Flag | Coverage Δ |
| --- | --- |
| integration1 | 24.46% <100.00%> (-0.03%) ⬇️ |
| integration2 | 24.25% <100.00%> (-0.11%) ⬇️ |
| unittests1 | 67.88% <0.00%> (-0.08%) ⬇️ |
| unittests2 | 13.88% <0.00%> (?) |

Flags with carried forward coverage won't be shown.


@mcvsubbu self-assigned this Mar 29, 2023
@mcvsubbu added the backward-incompat label (referenced by PRs that introduce or fix backward compat issues) Mar 29, 2023
@mcvsubbu (Contributor)

Back when we were designing pause/resume, we specifically did not want a proliferation of znodes (as was proposed in the first design). We changed the design to fold it all into the idealstate znode.
cc: @kishoreg, @npawar, @navina, @sajjad-moradi

@swaminathanmanish (Contributor, Author)

> Back when we were designing pause/resume, we specifically did not want a proliferation of znodes (as was proposed in the first design). We changed the design to fold it all into the idealstate znode. cc: @kishoreg, @npawar, @navina, @sajjad-moradi

Thanks Subbu. Could you point me to the design/reference doc, just for my understanding? In this case, we did not want to send all updates (progress status updates corresponding to reload, force_commit, rebalance) to a single Znode, as that would become a bottleneck. Instead, we have one Znode per update type, and the number of types is fixed/constant (3). This Znode is not per table; it is per type (3 types) and takes updates for all tables.

@Jackie-Jiang (Contributor)

@mcvsubbu There are pros and cons of storing the status within the IS.
For pause/consume, it makes more sense to store the status within the IS because:

  • There is only one state associated with the table (a boolean pause flag)
  • It is accessed quite frequently (not ad-hoc)
  • We never need to access this flag for all tables

For the controller task, IMO it makes more sense to keep the status in a separate ZNode because:

  • It is a per-task-type node, instead of a per-table node
  • All the access is ad-hoc (when manually submitting a task, or checking status)
  • We want to know all the tasks (across multiple tables) for a given task type
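To make the contrast above concrete, here is a minimal sketch of the two storage shapes using Helix's ZNRecord API. The field names and key schema are hypothetical, not Pinot's actual ones, and the ZNRecord import path varies across Helix versions.

```java
import java.util.Map;
import org.apache.helix.zookeeper.datamodel.ZNRecord;

class ZNodeShapes {
  public static void main(String[] args) {
    // Pause flag in the IdealState: a single boolean per table, kept in a
    // simple field of a record the controller already reads frequently.
    ZNRecord idealState = new ZNRecord("myTable_REALTIME");
    idealState.setSimpleField("isTablePaused", "true"); // hypothetical field name

    // Per-task-type ZNode: one record collects ad-hoc job statuses for ALL
    // tables, so listing every RELOAD job is a single read of this record.
    ZNRecord reloadJobs = new ZNRecord("RELOAD");
    reloadJobs.setMapField("myTable_OFFLINE.job-1", // hypothetical key schema
        Map.of("status", "IN_PROGRESS", "submissionTimeMs", "1680000000000"));
  }
}
```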

@mcvsubbu (Contributor)

> Back when we were designing pause/resume, we specifically did not want a proliferation of znodes (as was proposed in the first design). We changed the design to fold it all into the idealstate znode. cc: @kishoreg, @npawar, @navina, @sajjad-moradi

> Thanks Subbu. Could you point me to the design/reference doc, just for my understanding? In this case, we did not want to send all updates (progress status updates corresponding to reload, force_commit, rebalance) to a single Znode, as that would become a bottleneck. Instead, we have one Znode per update type, and the number of types is fixed/constant (3). This Znode is not per table; it is per type (3 types) and takes updates for all tables.

Here is the design doc in which we proposed znode hierarchy specifically for pause/resume.
https://docs.google.com/document/d/19uKzPRowJ8WLE0A4g6i8XLOBwqpswH2_OvFvbxxZ_X4/edit#heading=h.tvfsvrm5pwew

And here is the later design doc in which we changed it to be in IDEALSTATE, thereby losing some flexibility (e.g. per-partition pause).

I am not saying that one is better than the other. There are pros and cons.

I am asking if we are changing direction here.
cc: @kishoreg

@mcvsubbu (Contributor)

> @mcvsubbu There are pros and cons of storing the status within the IS. For pause/consume, it makes more sense to store the status within the IS because:
>
> • There is only one state associated with the table (a boolean pause flag)
> • It is accessed quite frequently (not ad-hoc)
> • We never need to access this flag for all tables
>
> For the controller task, IMO it makes more sense to keep the status in a separate ZNode because:
>
> • It is a per-task-type node, instead of a per-table node
> • All the access is ad-hoc (when manually submitting a task, or checking status)
> • We want to know all the tasks (across multiple tables) for a given task type

The arguments make sense. I think the PR includes (or at least suggests) that we try a new znode for forceCommit, hence my questions. Note that forceCommit is built on pause/resume.

@mcvsubbu (Contributor)

> @mcvsubbu There are pros and cons of storing the status within the IS. For pause/consume, it makes more sense to store the status within the IS because:
>
> • There is only one state associated with the table (a boolean pause flag)
> • It is accessed quite frequently (not ad-hoc)
> • We never need to access this flag for all tables
>
> For the controller task, IMO it makes more sense to keep the status in a separate ZNode because:
>
> • It is a per-task-type node, instead of a per-table node
> • All the access is ad-hoc (when manually submitting a task, or checking status)
> • We want to know all the tasks (across multiple tables) for a given task type

Not sure what you mean by "frequent access".

Also, table reload is on a per-table basis, and this PR suggests we move that to a separate znode as well.

@kishoreg (Member)

Subbu has valid questions. Before we get to the znode creation: why are we not using the existing task framework for this? There is nothing stopping us from creating a task for each of these commands that need to run in the background; the only difference is controllers running the tasks instead of minions.

Creating a task and managing fault tolerance etc. is not trivial. Using the existing task framework lets us leverage everything we built for minions in terms of getting progress updates, and the UI will have a tab to show the pending tasks, etc.

@swaminathanmanish (Contributor, Author)

> @mcvsubbu There are pros and cons of storing the status within the IS. For pause/consume, it makes more sense to store the status within the IS because:
>
> • There is only one state associated with the table (a boolean pause flag)
> • It is accessed quite frequently (not ad-hoc)
> • We never need to access this flag for all tables
>
> For the controller task, IMO it makes more sense to keep the status in a separate ZNode because:
>
> • It is a per-task-type node, instead of a per-table node
> • All the access is ad-hoc (when manually submitting a task, or checking status)
> • We want to know all the tasks (across multiple tables) for a given task type

> Not sure what you mean by "frequent access".
>
> Also, table reload is on a per-table basis, and this PR suggests we move that to a separate znode as well.

@mcvsubbu - To clarify about reload + znode: the Znode is per type, not per table. Reload for any table will go to the same znode. Likewise, status updates for rebalance/force_commit for any table will go to the same znode, based on their job type.

@Jackie-Jiang (Contributor) commented Mar 30, 2023

> @mcvsubbu There are pros and cons of storing the status within the IS. For pause/consume, it makes more sense to store the status within the IS because:
>
> • There is only one state associated with the table (a boolean pause flag)
> • It is accessed quite frequently (not ad-hoc)
> • We never need to access this flag for all tables
>
> For the controller task, IMO it makes more sense to keep the status in a separate ZNode because:
>
> • It is a per-task-type node, instead of a per-table node
> • All the access is ad-hoc (when manually submitting a task, or checking status)
> • We want to know all the tasks (across multiple tables) for a given task type

> Not sure what you mean by "frequent access".
>
> Also, table reload is on a per-table basis, and this PR suggests we move that to a separate znode as well.

Say we want to take the same approach as pause consumption by storing the job status in the IS; there would be the following problems:

  1. We can only use the simple fields, because the map fields are reserved for segment assignment. For job status, we want to keep different job statuses apart by using the map fields.
  2. It would be hard to use a simple field to track multiple tasks of the same task type on the same table (e.g. reloading multiple segments within a table).
  3. To gather the job status for a given job type, we would need to loop over all the ISs, which is very inefficient.

Another main difference between this and pause consumption is that we are adding very few ZNodes here (one per task type), so there won't be a scale issue.
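A rough illustration of point 3, assuming a Helix ZkHelixPropertyStore; the paths ("/IDEALSTATES", "/CONTROLLER_JOBS/RELOAD") and class names are hypothetical, not the PR's actual ones.

```java
import java.util.List;
import org.apache.helix.AccessOption;
import org.apache.helix.store.zk.ZkHelixPropertyStore;
import org.apache.helix.zookeeper.datamodel.ZNRecord;

class JobStatusReads {
  // Status-in-IS approach: answering "what RELOAD jobs exist?" means reading
  // the IdealState of every table, i.e. O(#tables) ZK reads.
  static void scanAllIdealStates(ZkHelixPropertyStore<ZNRecord> store) {
    List<String> tables = store.getChildNames("/IDEALSTATES", AccessOption.PERSISTENT);
    for (String table : tables) {
      ZNRecord idealState = store.get("/IDEALSTATES/" + table, null, AccessOption.PERSISTENT);
      // ... inspect idealState for the reload status of this table
    }
  }

  // Per-type ZNode (this PR's direction): a single read returns the job
  // statuses for all tables of a given job type.
  static ZNRecord readAllReloadJobs(ZkHelixPropertyStore<ZNRecord> store) {
    return store.get("/CONTROLLER_JOBS/RELOAD", null, AccessOption.PERSISTENT);
  }
}
```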

@Jackie-Jiang (Contributor)

> Subbu has valid questions. Before we get to the znode creation: why are we not using the existing task framework for this? There is nothing stopping us from creating a task for each of these commands that need to run in the background; the only difference is controllers running the tasks instead of minions.
>
> Creating a task and managing fault tolerance etc. is not trivial. Using the existing task framework lets us leverage everything we built for minions in terms of getting progress updates, and the UI will have a tab to show the pending tasks, etc.

@kishoreg I think these are two very different questions. We didn't leverage the task framework for controller tasks in the first place because directly executing the controller task is much lighter weight, from both a coding perspective and an execution perspective (no need to implement a separate task generator/executor, no need to serialize task configs, etc.). We can consider moving it to the task framework so that we get fault tolerance (retry/cancel logic) from it, but that involves a very big change: we would need to introduce the controller task executor concept, and inject the resource manager into the controller task executor. That is out of scope for this PR, since this PR only splits the single ZNode shared by all task types into one ZNode per task type.

@Jackie-Jiang (Contributor) left a comment

LGTM otherwise

…ill spread out Zk write load across jobTypes
@snleee merged commit 28c602b into apache:master Apr 7, 2023