Description
We are seeing an issue in production where the RealtimeToOfflineSegments task is taking a very long time to complete and eventually times out.
It is slow in two areas. During segment mapping:
2022-01-12T12:40:28Z Initialized mapper with 43 record readers
And during row sorting:
2022-01-13T10:10:09Z Start sorting on numRows: 42757049, numSortFields: 9
We are using Pinot 0.9.3 with 8 minions, each with an 8 GB heap, 16 GB of memory, and 4 CPU cores.
Task config:
"taskTypeConfigsMap": {
"RealtimeToOfflineSegmentsTask": {
"bucketTimePeriod": "3h",
"bufferTimePeriod": "2d",
"mergeType": "dedup",
"maxNumRecordsPerSegment": 10000000,
"roundBucketTimePeriod": "1h"
}
}
See the log files and profiling screenshots below.
Full run log
log1.txt
Sometimes segment mapping runs with up to 90 record readers. This stage then takes anywhere from 35 minutes to an hour. Once that's done, it starts the usual reduction and sorting phase, even though the task should have been cancelled by then for hitting the timeout.
It then starts destroying segments and immediately finishes with a timeout. I'm not sure exactly what happens here, but I'm wondering whether the task is failing to clean up properly, causing data loss.
The logs above show the final moments, where the task times out right after issuing the segment deletions.
I have been talking with @richardstartin on Slack, and he pointed me towards this issue: #7929
To circumvent this problem, I disabled all raw indexes.
After doing that, I ran another task; here is the log output:
log2.txt
Looking at this log file, we can see that sorting took roughly 1.43 hours for 75 million rows:
2022-01-12T14:19:37Z Start sorting on numRows: 75094464, numSortFields: 9
2022-01-12T15:45:24Z Finish sorting in 5147498ms
We decided to profile the application to try to figure out what is going on, using:
jcmd <pid> JFR.start duration=300s filename=minion.jfr settings=profile
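As a side note: once the recording finishes, the same hot-method view can also be pulled out of minion.jfr programmatically with the JDK's jdk.jfr.consumer API (JDK 11+). A minimal sketch, counting the top frame of each jdk.ExecutionSample event; the class name TopMethods and the hard-coded file name are just for illustration:

import java.nio.file.Path;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedFrame;
import jdk.jfr.consumer.RecordingFile;

public class TopMethods {
  public static void main(String[] args) throws Exception {
    // Count how often each method appears as the top frame of a CPU sample.
    Map<String, Long> counts = new HashMap<>();
    for (RecordedEvent event : RecordingFile.readAllEvents(Path.of("minion.jfr"))) {
      if (!"jdk.ExecutionSample".equals(event.getEventType().getName())
          || event.getStackTrace() == null
          || event.getStackTrace().getFrames().isEmpty()) {
        continue;
      }
      RecordedFrame top = event.getStackTrace().getFrames().get(0);
      String method = top.getMethod().getType().getName() + "." + top.getMethod().getName();
      counts.merge(method, 1L, Long::sum);
    }
    // Print the five most frequently sampled methods.
    counts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
        .limit(5)
        .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
  }
}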
Profiling during segment mapping:
2022-01-13T09:53:44Z Initialized mapper with 45 record readers, output dir: /var/pinot/minion/data/RealtimeToOfflineSegmentsTask/tmp-a4085745-be54-4411-abfa-c0dfb8fa05a1/workingDir/mapper_output, timeHandler: class org.apache.pinot.core.segment.processing.timehandler.EpochTimeHandler, partitioners: class org.apache.pinot.core.segment.processing.partitioner.TableConfigPartitioner
Top 5 methods during segment mapping:
[profiling screenshot]
Top 5 TLAB during segment mapping:
[profiling screenshot]
Profiling during segment sorting:
2022-01-13T10:10:09Z Start sorting on numRows: 42757049, numSortFields: 9
Top 5 methods during segment sorting:
[profiling screenshot]
Top 5 TLAB during segment sorting:
[profiling screenshot]
Looking at the EBS volume attached to the minion, which has 500 GB and 3,000 IOPS, it is maxing out the IOPS on reads.

Here is the resource usage of the minion:
[resource usage screenshot]
@richardstartin indicated that this is due to a 9-dimensional sort taking place (on 42 million rows) because dedup is enabled. I am currently trying to work around the issue by disabling dedup to see how that goes. This still does not explain why the segment mapping is slow.
Here is the code Richard Startin linked, which is causing the excessive disk usage:
_sortedRowIds = new int[numRows];
for (int i = 0; i < numRows; i++) {
  _sortedRowIds[i] = i;
}
Arrays.quickSort(0, _endRowId, (i1, i2) -> _fileReader.compare(_sortedRowIds[i1], _sortedRowIds[i2]),
    (i1, i2) -> {
      int temp = _sortedRowIds[i1];
      _sortedRowIds[i1] = _sortedRowIds[i2];
      _sortedRowIds[i2] = temp;
    });
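As a rough back-of-envelope on why this hurts (my own estimate, not something from the logs): a comparison sort over numRows = 42,757,049 rows performs on the order of n * log2(n) ≈ 42.7M × 25 ≈ 1 billion compare calls, and each _fileReader.compare(...) has to read up to 9 sort fields for each of the two rows out of the mapper output file. Once that file no longer fits in the page cache, a large fraction of those accesses turn into random disk reads, which matches the EBS volume being pinned at its 3,000 IOPS limit.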
Update: I've changed the RealtimeToOfflineSegmentsTask mergeType to concat (see the config sketch below). This seems to alleviate the row-sorting problem for most tables, with the exception of two large outliers. Profiling screenshots from those two tables follow the sketch.
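For reference, here is the adjusted task config (a sketch, assuming nothing changed besides the mergeType):

"taskTypeConfigsMap": {
  "RealtimeToOfflineSegmentsTask": {
    "bucketTimePeriod": "3h",
    "bufferTimePeriod": "2d",
    "mergeType": "concat",
    "maxNumRecordsPerSegment": 10000000,
    "roundBucketTimePeriod": "1h"
  }
}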
Large Table 1
Segment mapping is still running; it has been running for 1 hour.
Large Table 2
Segment mapping is still running; it has been running for 1 hour.
[profiling screenshots]