Description
We are seeing an issue in production where the RealtimeToOfflineSegments task is taking a very long time to complete and eventually times out.
It is slow in two areas. During segment mapping:
2022-01-12T12:40:28Z Initialized mapper with 43 record readers
And during row sorting:
2022-01-13T10:10:09Z Start sorting on numRows: 42757049, numSortFields: 9
We are using Pinot 0.9.3 with 8 minions, each with an 8 GB heap, 16 GB of memory, and 4 CPU cores.
Task config:
"taskTypeConfigsMap": {
"RealtimeToOfflineSegmentsTask": {
"bucketTimePeriod": "3h",
"bufferTimePeriod": "2d",
"mergeType": "dedup",
"maxNumRecordsPerSegment": 10000000,
"roundBucketTimePeriod": "1h"
}
}
See the log files and profiling screenshots below.
Full run log
log1.txt
Sometimes segment mapping runs with up to 90 record readers. This stage then takes anywhere from 35 minutes to an hour. Once that's done, it starts the usual reduction and sorting phase, even though the task should have been cancelled by then for hitting the timeout.
It then starts destroying segments and immediately finishes with a timeout. I'm not sure exactly what happens here, but I'm wondering whether the task is failing to clean up properly, causing data loss.
The logs above show the final moments, where the task times out right after issuing the segment deletions.
I have been talking with @richardstartin on Slack, and he pointed me towards this issue: #7929
To circumvent this problem, I disabled all raw indexes.
After doing that, I ran another task; here is the log output:
log2.txt
Looking at this log file, we can see that sorting took roughly 1.43 hours for 75 million rows:
2022-01-12T14:19:37Z Start sorting on numRows: 75094464, numSortFields: 9
2022-01-12T15:45:24Z Finish sorting in 5147498ms
We decided to profile the application to try to figure out what is going on, using:
jcmd <pid> JFR.start duration=300s filename=minion.jfr settings=profile
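As a side note: once the recording finishes, the same hot-method view can also be pulled out of minion.jfr programmatically with the JDK's jdk.jfr.consumer API (JDK 11+). A minimal sketch, counting the top frame of each jdk.ExecutionSample event; the class name TopMethods and the hard-coded file name are just for illustration:

import java.nio.file.Path;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedFrame;
import jdk.jfr.consumer.RecordingFile;

public class TopMethods {
  public static void main(String[] args) throws Exception {
    // Count how often each method appears as the top frame of a CPU sample.
    Map<String, Long> counts = new HashMap<>();
    for (RecordedEvent event : RecordingFile.readAllEvents(Path.of("minion.jfr"))) {
      if (!"jdk.ExecutionSample".equals(event.getEventType().getName())
          || event.getStackTrace() == null
          || event.getStackTrace().getFrames().isEmpty()) {
        continue;
      }
      RecordedFrame top = event.getStackTrace().getFrames().get(0);
      String method = top.getMethod().getType().getName() + "." + top.getMethod().getName();
      counts.merge(method, 1L, Long::sum);
    }
    // Print the five most frequently sampled methods.
    counts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
        .limit(5)
        .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
  }
}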
Profiling during segment mapping:
2022-01-13T09:53:44Z Initialized mapper with 45 record readers, output dir: /var/pinot/minion/data/RealtimeToOfflineSegmentsTask/tmp-a4085745-be54-4411-abfa-c0dfb8fa05a1/workingDir/mapper_output, timeHandler: class org.apache.pinot.core.segment.processing.timehandler.EpochTimeHandler, partitioners: class org.apache.pinot.core.segment.processing.partitioner.TableConfigPartitioner
Top 5 methods during segment mapping:
[profiling screenshot]
Top 5 TLAB during segment mapping:
[profiling screenshot]
Profiling during segment sorting:
2022-01-13T10:10:09Z Start sorting on numRows: 42757049, numSortFields: 9
Top 5 methods during segment sorting:
[profiling screenshot]
Top 5 TLAB during segment sorting:
[profiling screenshot]
Looking at the EBS volume attached to the minion, which has 500 GB and 3,000 IOPS, it is maxing out the IOPS on reads.

Here is the resource usage of the minion:
[resource usage screenshot]
@richardstartin indicated that this is due to a 9-dimensional sort taking place (on 42 million rows) because dedup is enabled. I am currently trying to work around the issue by disabling dedup to see how that goes. This still does not explain why the segment mapping is slow.
Here is the code Richard Startin linked, which is causing the excessive disk usage:
_sortedRowIds = new int[numRows];
for (int i = 0; i < numRows; i++) {
  _sortedRowIds[i] = i;
}
Arrays.quickSort(0, _endRowId, (i1, i2) -> _fileReader.compare(_sortedRowIds[i1], _sortedRowIds[i2]),
    (i1, i2) -> {
      int temp = _sortedRowIds[i1];
      _sortedRowIds[i1] = _sortedRowIds[i2];
      _sortedRowIds[i2] = temp;
    });
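As a rough back-of-envelope on why this hurts (my own estimate, not something from the logs): a comparison sort over numRows = 42,757,049 rows performs on the order of n * log2(n) ≈ 42.7M × 25 ≈ 1 billion compare calls, and each _fileReader.compare(...) has to read up to 9 sort fields for each of the two rows out of the mapper output file. Once that file no longer fits in the page cache, a large fraction of those accesses turn into random disk reads, which matches the EBS volume being pinned at its 3,000 IOPS limit.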
Update: I've changed the RealtimeToOfflineSegmentsTask mergeType to concat (see the config sketch below). This seems to alleviate the row-sorting problem for most tables, with the exception of two large outliers. Profiling screenshots from those two tables follow the sketch.
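For reference, here is the adjusted task config (a sketch, assuming nothing changed besides the mergeType):

"taskTypeConfigsMap": {
  "RealtimeToOfflineSegmentsTask": {
    "bucketTimePeriod": "3h",
    "bufferTimePeriod": "2d",
    "mergeType": "concat",
    "maxNumRecordsPerSegment": 10000000,
    "roundBucketTimePeriod": "1h"
  }
}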
Large Table 1
Segment mapping is still running; it has been running for 1 hour.
Large Table 2
Segment mapping is still running; it has been running for 1 hour.
[profiling screenshots]