-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Add controller API for reload segment task status #8828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall approach looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can consider adding a size limit to the cache, and once the entry count reaches the limit, we remove the earliest entries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I replace this map with a LRU map instead then? I think that might achieve something close to this and we can specify max size when initialising the map
pinot-server/src/main/java/org/apache/pinot/server/starter/helix/HelixInstanceDataManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this handle null status properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It returns "null" string
...ntroller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null response need to be properly handled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. The msgId should just be the ZNode id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the getMsgId does fetch the record's zk id but for some reason I was seeing this difference on server and controller side. I'll look into this more and update the PR accordingly if we are able to use msgId itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any update on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So helix, when sending a PARTICIPANT message, generates one message per participant to store into ZK. When doing that, it overrides the original message's id with a new random one
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/messaging/DefaultMessagingService.java#L199
It only retains the fields from the original message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider storing the task id and reload info (table name / segment name) into a ZK node, and add another API to read the reload tasks in the cluster in case user loses the response of the reload call. We can also put other useful info there, e.g. number of messages sent, which can be included in the status check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that API just returns the list of active reload task ids and description like # of messages sent, table names etc?
Agree that might be useful. But should that list be cleaned up by some sort of periodic task? Or no need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We may put some default retention to keep the reload info (e.g. 1 day). I don't think we need a periodic task to clean up the ZK node. Instead, for each reload call, when we add new entry, we can clean up the expired ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have added the API, anc the changes in realod API to store taskMetadata into ZK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I could not find "retention" specification when creating ZNodes. Is there any example already in place where we might be doing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've handled retention via explicitly cleaning up old tasks when new ones are created.
9e12366 to
e2664c3
Compare
|
I've updated the PR to
Yet to add
I was hoping to get the approach validated first, for all segments API. Once we are aligned on this, I'll update the PR to include changes for single segment reload. cc: @Jackie-Jiang @npawar |
Codecov Report
@@ Coverage Diff @@
## master #8828 +/- ##
=============================================
- Coverage 68.80% 24.86% -43.95%
+ Complexity 4747 51 -4696
=============================================
Files 1842 1834 -8
Lines 97579 97437 -142
Branches 14716 14698 -18
=============================================
- Hits 67141 24228 -42913
- Misses 25705 70827 +45122
+ Partials 4733 2382 -2351
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any update on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest not keeping per-task level ZK metadata. It might create too many ZNode, and can be easily left over if not handled properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| public static String constructPropertyStorePathForTaskResource(String resourceName) { | |
| public static String constructPropertyStorePathForTask(String resourceName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) revert the empty lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this create a LRU cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
4823c44 to
1047e19
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we name this something other than TASKS? Task is already a concept in helix task framework and as a result in minion, so I can see this getting very confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CONTROLLER_JOB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, there's already a string association between TaskType and minion tasks like MergeRollup, R2O etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract this block into a common util to share across this and the above method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/tableName/tableNameWithType ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in case of a hybrid table, these 2 lines would run twice and always be redundant for one of the tables. Curious, why you chose to not just have tableNameWithType in the params?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reload API accepts tableName and type is optional. Hence thought of keeping it same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But even the reload API, returns 1 jobId per tableType for hybrid table. I think expecting tableNameWithType for this API makes sense. Have changed
...ler/src/main/java/org/apache/pinot/controller/api/resources/PinotSegmentRestletResource.java
Outdated
Show resolved
Hide resolved
7959a4a to
3e6f5a0
Compare
mcvsubbu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still reviewing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add segment name & tablename in the log message.
Will not repeat for other log messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe most other APIs have table name and table type as independent parameters. We should keep all APIs that way, and move towards this model if that is not already the case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
|
Please change the description. This PR does much more than just get reload status. Earlier design for reload was to "fire and forget" (send a reload helix message and let servers handle it as they please). This one (if I understand right) changes the reload command to start a helix task. have you considered what happens when controller is upgraded to the new version but the servers are running old version, and the reload command is issued? Please evaluate and add that to the PR description, and also mark this PR for release notes |
|
@mcvsubbu So nothing about the core functionality wrt reloading segments has been changed here. It's still largely a fire and forget operation, using helix messages that are handled by servers. This PR just associates each reload op with a trackable id, writes some additional meta about the task into ZK, and provides APIs which can be used to 1) Get the status of a given realod op task id 2) Get a historical list of reload tasks. |
When you start using new znodes, then we need to worry about upgrades, cleaning up of znodes, etc. Please write up a design document on this before proceeding. |
Sounds good. I'll share the design doc for this. |
1b74f18 to
e65acbc
Compare
|
@mcvsubbu @npawar @Jackie-Jiang I've updated the description with the design doc link. Please do review 👍 |
d65b2bc to
cdbc371
Compare
npawar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changed approach is clever and very light weight. It's just a little odd to think of this as "status of the exact reload id" because there's no correlation anymore between the exact fired reload and the status we're returning.
Are there any other operations that would set that load time? Like a resetSegment? (I think not, but just confirming). Or any other gotchas that need to be called out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jobMetadata or just metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename all "Task" in this file to "Job" (and also in PinotHelixResourceManager)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
@npawar Yes and that's a idea with this change. Scenarios like server restarting in the middle of a reload op, a rebalance, etc can lead to this. Earlier with the tight coupling of task id -> status map at the server, there were issues with server restarts etc leading to missing status data at the server. With this approach, success of a reload op has simply been defined as the load time being > reload job submission time. |
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the reload message, we should also carry the reload submission time so that we don't need to reload the same segment multiple times when multiple reload is submitted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to check if the record exists. We can directly get() and check if the return is null. It can also prevent race condition.
When the record exist, we should track the record version (passing a Stat into the get()), and check the expected version when setting the record to prevent race condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Should we be retrying in case of a version mismatch? For now I've not added retries, in line with other version based _propertyStore updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return new SuccessResponse("Sent " + msgInfo + " reload messages"); | |
| return new SuccessResponse("Sent " + msgInfo.getLeft() + " reload messages"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted to return the job id as part of the response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then let's change the response message to include both job id and the messages sent, something like "Submitted reload job: %s, sent %d reload messages"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Added both the job id as well as success / failure status of ZK write
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest reflecting this back to the user, instead of just logging an error and return success. User won't be able to read the reload status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Modified the response for this API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) _loadTimeMs for simplicity, same for the getter and setter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then let's change the response message to include both job id and the messages sent, something like "Submitted reload job: %s, sent %d reload messages"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^^
...ler/src/main/java/org/apache/pinot/controller/api/resources/PinotSegmentRestletResource.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make the response more clear to the user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Changed this to return both job id as well as zk meta write success / failure status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will create one ZNode per table, which is too much IMO. We are just storing the recent controller jobs in the ZNode, and there shouldn't be a lot of them running in parallel, so we can have one single ZNode for all the tables. Keeping one single node will help track all the running jobs, and also clean up the old jobs.
We can also consider having one node per job type to isolate the metadata for different jobs. All the entries for the same job type should have the same format, which is easier to manage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. I suppose since the tableName is part of the meta being stored itself and the jobId is guaranteed to be unique, we need not have to keep tableName as a key? I've flattened it out
3bbe9bc to
214c76f
Compare
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall logic looks good.
We can consider adding a test into the OfflineClusterIntegrationTest where we have several places reloading the table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) We usually put : for easier searching purpose, same for other places
| LOGGER.error("Failed to add reload segment job meta into zookeeper for table {}, segment {}", | |
| LOGGER.error("Failed to add reload segment job meta into zookeeper for table: {}, segment: {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can save one redundant ZK access and also avoid race condition, same for other places
| if (_propertyStore.exists(controllerJobResourcePath, AccessOption.PERSISTENT)) { | |
| ZNRecord taskResourceZnRecord = _propertyStore.get(controllerJobResourcePath, null, -1); | |
| ZNRecord controllerJobRecord = _propertyStore.get(controllerJobResourcePath, null, -1); | |
| if (controllerJobRecord != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helix may throw a ZkNoNodeException exception if options = -1. So I've handled that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Annotate with @Nullable.
Let's add some javadoc for the public method. Same for other public methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added javadoc. jobId must not be be null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this line into addNewReloadSegmentJob() and addNewReloadAllSegmentsJob() to be more readable. We want to keep the metadata creation part together
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit)
| * Task ZK props | |
| * Controller job ZK props |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one applies to all jobs, instead of just the reload jobs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually use camel case as the key name in the ZNode. We may also simplify the key as it is always under the context of controller job. E.g. jobType, jobId, tableName, submissionTimeMs, messagesCount, reloadSegmentName
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we handle this properly on the controller side? Should we return 0/0 instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With controller now filtering servers for the segment being queried using idealstate, this should be far less frequent.
And yes the controller makes the API calls using CompletionServiceHelper which handles non 200 responses correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel returning 0/0 might be preferred so that we can differentiate server not responding vs server not having the segment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to return 0 / 0
...ler/src/main/java/org/apache/pinot/controller/api/resources/PinotSegmentRestletResource.java
Outdated
Show resolved
Hide resolved
...ler/src/main/java/org/apache/pinot/controller/api/resources/PinotSegmentRestletResource.java
Outdated
Show resolved
Hide resolved
e0ca1a4 to
5f0e293
Compare
892bdea to
966e46f
Compare
|
Let's add a |
| ServerReloadControllerJobStatusResponse response = | ||
| JsonUtils.stringToObject(responseString, ServerReloadControllerJobStatusResponse.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we using ServerReloadControllerJobStatusResponse.class in JsonUtil instead of SegmentReloadStatusValue.class ? does it guarantee to be always compatible? this seems like a risky assumption consider controller and server are upgraded separately during rolling upgrade
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that this is risky assumption. We can change this to SegmentReloadStatusValue.class
But, I think since the agreed upon upgrade sequence is controller -> broker -> servers, if new fields are added to SegmentReloadStatusValue, in the controller we'd always need to ensure the newer fields are null checked before accessing. Since during upgrade server can be on older version.
This PR adds a controller API that can be used to get the current status of a 'reload segments' task. When a segment reload task is submitted, an identifier will be returned, and the task details will be stored into ZK. Users can then call the newly added API with the task ID, to get the status details of the reload task.
Another API has been added to get all historical reload task ids.
Design document: https://docs.google.com/document/d/1Eqn2FDDIhCr8G2JFlifs5FjT0LsVPfpPTpdJIJvorwI/edit?usp=sharing