-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Moving Zk updates for reload, force_commit to their own Znodes which … #10451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@saurabhd336 , @Jackie-Jiang - Please take a look |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #10451 +/- ##
============================================
+ Coverage 63.23% 70.38% +7.14%
- Complexity 5905 6468 +563
============================================
Files 2036 2103 +67
Lines 110973 112766 +1793
Branches 16892 16981 +89
============================================
+ Hits 70177 79367 +9190
+ Misses 35606 27848 -7758
- Partials 5190 5551 +361
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
bd6873d to
7277459
Compare
...ntroller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.java
Outdated
Show resolved
Hide resolved
|
Back when we were designing pause/resume, we specifically did not want proliferation of znodes (as was proposed in the first design). We changed to fold it all into idealstate znode. |
Thanks Subbu. Could you point me to the design/reference doc, just for my understanding ? Actually in this case, we did not want to send all updates (progress status updates corresponding to reload, force_commit, rebalance) to a single Znode as that would become a bottleneck. Instead have a Znode per update type which is fixed/constant (3). This Znode is not per table. Its per type (3 types) and takes updates for all tables. |
|
@mcvsubbu There are pros and cons of storing the status within the IS:
For the controller task, IMO it makes more sense to keep the status into a separate ZNode because:
|
Here is the design doc in which we proposed znode hierarchy specifically for pause/resume. And here is the later design doc in which we changed it to be in IDEALSTATE, thereby losing some flxibility (e.g. per partition pause). I am not saying that one is better than the other. There are pros and cons. Asking if we are changing direction here: |
Arguments make sense. I think the PR is including (or at least suggesting) that we try a new znode for forceCommit, and hence my questions. Note that forceCommit is built on pause/resume |
Not sure what you mean by "frequent access". Also, reload table is on a per-table basis and this PR suggests we move that to a separate znode as well |
|
Subbu has valid questions. Before we get to the znode creation. why are we not using the existing task framework for this? there is nothing stopping us from creating a task for each of these commands that need to run in background.. the only difference is controllers running the tasks instead of minions. Creating a task and managing fault tolerance etc is not trivial. Using the existing task framework allows us to leverage all the things we build for minion in terms of getting progress updates, the UI will have a tab to show the pending tasks etc. |
@mcvsubbu - To clarify about reload + znode . The Znode is per type and not per table. Reload for any table will go to the same znode. Likewise status updates for rebalance/force_commits for any table will go to the same znode based on their job type. |
Say we want to take the same approach as pause consumption by storing the job status in the IS, there will be the following problems:
Another main difference for this vs pause consumption is that we are adding very few ZNodes here, one per task type, so there won't be scale issue. |
@kishoreg I think these are two very different questions. We didn't leverage the task framework for controller task in the first place because directly executing the controller task is much lighter weight for both coding perspective and execution perspective (no need to implement separate task generator/executor, no need to serialize task config etc.). We can consider moving it to task framework so that we get the fault tolerance (retry/cancel logic) from the task framework, but that involves very big change. We'll need to introduce the controller task executor concept, and ingest the resource manager into the controller task executor. That is out of the scope of this PR since this PR only splits the ZNode for all task types into one per task type. |
...ler/src/main/java/org/apache/pinot/controller/api/resources/PinotSegmentRestletResource.java
Outdated
Show resolved
Hide resolved
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise
...ler/src/main/java/org/apache/pinot/controller/api/resources/PinotSegmentRestletResource.java
Outdated
Show resolved
Hide resolved
...ntroller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.java
Outdated
Show resolved
Hide resolved
7277459 to
ebb5c3e
Compare
…ill spread out Zk write load across jobTypes
ebb5c3e to
8ae7e9f
Compare
Problem: Currently Zk updates for jobTypes RELOAD, RELOAD_ALL, FORCE_COMMIT goes to the same Znode. This increases write load on that Znode., especially given updates for all tables for the listed jobTypes goes to that Znode. We discovered the problem while implementing rebalanceStatus and decided to create a separate Znode for TABLE_REBALANCE. This PR is to finish the clean up to move the other types to their separate Znodes.
Solution : Move updates for for jobTypes RELOAD, RELOAD_ALL, FORCE_COMMIT, to their own Znodes. Combined RELOAD, RELOAD_ALL updates into a single Znode as they are related.
Note: This is a backwards incompatible change. The status of previously completed reload jobs, will not be available after this change is deployed. Given the low impact and to keep the code changes simple, we made this backwards incompatible.