
Upsert table intermittently misses out on records during concurrent ingestion updates #12667

@tibrewalpratik17

Description


We have a partial-upsert table where the number of updates per key is very high over a short period (thousands of updates per key within an hour).

During this, if we query for a particular primary key, we intermittently see no response from Pinot for that key. I noticed this coincides with an update arriving for that primary key (the query lands within one second of a new record for that key).

After a few seconds, the record shows up again in query responses and everything works fine until the next overlap between query time and ingestion time.

I suspect this happens because, when the new record lands in a different segment, we update the docId by removing the old one first and then adding the new one:

```java
if (segment == currentSegment) {
  replaceDocId(validDocIds, queryableDocIds, currentDocId, newDocId, recordInfo);
} else {
  removeDocId(currentSegment, currentDocId);
  addDocId(validDocIds, queryableDocIds, newDocId, recordInfo);
}
```

This looks like a race condition where a query snapshot is taken between these two actions.

Is this expected behaviour? Is there a way we can guarantee that at least the older record (if not the newer one) stays visible during this window?
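The disappearing-record window above can be shown with a minimal, deterministic sketch. A plain concurrent set stands in for `ThreadSafeMutableRoaringBitmap`, and the method name is hypothetical; the point is only that between `removeDocId` and `addDocId` a snapshot sees neither doc:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class UpsertRaceDemo {
    // Stand-in for the validDocIds bitmap: a concurrent set of docIds.
    static final Set<Integer> validDocIds = ConcurrentHashMap.newKeySet();

    // Simulates the cross-segment update path, probing visibility in between.
    static boolean[] crossSegmentUpdate(int currentDocId, int newDocId) {
        validDocIds.add(currentDocId);        // old record is valid
        validDocIds.remove(currentDocId);     // step 1: removeDocId(...)
        // A query snapshotting validDocIds here sees neither the old nor the new doc.
        boolean visibleInWindow =
                validDocIds.contains(currentDocId) || validDocIds.contains(newDocId);
        validDocIds.add(newDocId);            // step 2: addDocId(...)
        boolean visibleAfter = validDocIds.contains(newDocId);
        return new boolean[] {visibleInWindow, visibleAfter};
    }

    public static void main(String[] args) {
        boolean[] r = crossSegmentUpdate(7, 42);
        System.out.println(r[0] + " " + r[1]); // false true — key "disappears" mid-update
    }
}
```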

One possible solution is to guard validDocIds with a lock before updating, and to take the same lock when snapshotting in FilterPlanNode -

```java
MutableRoaringBitmap queryableDocIdSnapshot = null;
if (!_queryContext.isSkipUpsert()) {
  ThreadSafeMutableRoaringBitmap queryableDocIds = _indexSegment.getQueryableDocIds();
  if (queryableDocIds != null) {
    queryableDocIdSnapshot = queryableDocIds.getMutableRoaringBitmap();
  } else {
    ThreadSafeMutableRoaringBitmap validDocIds = _indexSegment.getValidDocIds();
    if (validDocIds != null) {
      queryableDocIdSnapshot = validDocIds.getMutableRoaringBitmap();
    }
  }
}
```

But this can become a major issue in high-QPS / high-ingestion scenarios, where one side starves the other.
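For illustration, here is a sketch of the lock-based idea, assuming a `ReentrantReadWriteLock` guarding a plain set (all names are hypothetical stand-ins, not the actual Pinot API): the updater holds the write lock across the remove/add pair, so snapshots can never observe the intermediate state.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockedValidDocIds {
    private final Set<Integer> validDocIds = new HashSet<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Writer: makes remove + add atomic with respect to query snapshots.
    public void replaceAcrossSegments(int oldDocId, int newDocId) {
        lock.writeLock().lock();
        try {
            validDocIds.remove(oldDocId);
            validDocIds.add(newDocId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reader (FilterPlanNode side): snapshots under the read lock.
    public Set<Integer> snapshot() {
        lock.readLock().lock();
        try {
            return new HashSet<>(validDocIds);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockedValidDocIds ids = new LockedValidDocIds();
        ids.replaceAcrossSegments(-1, 7);   // -1: no previous doc
        ids.replaceAcrossSegments(7, 42);
        System.out.println(ids.snapshot()); // [42]
    }
}
```

If starvation is the concern, `new ReentrantReadWriteLock(true)` enables fairness so neither readers nor the writer are starved indefinitely, at some throughput cost.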

One more approach is to take a clone of validDocIds before updating, apply the add/remove to the clone, and then copy it back over the original validDocIds. This can work for updates within the current consuming segment, but might not work when the previous record lives in an old immutable segment. It also has memory implications, since we would maintain a clone of validDocIds for every message. What if we do this in batches? That would reduce the memory footprint, but the implementation could become quite intrusive.
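The clone-then-copy-back idea is essentially copy-on-write. A minimal sketch, assuming an `AtomicReference` publishes each clone atomically (names hypothetical; the per-message clone is exactly the memory cost noted above):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

public class CowValidDocIds {
    private final AtomicReference<Set<Integer>> validDocIds =
            new AtomicReference<>(Collections.emptySet());

    // Clone, mutate the clone, then publish it atomically: readers always
    // see either the old state (with oldDocId) or the new one (with newDocId),
    // never the empty in-between window.
    public void replaceAcrossSegments(int oldDocId, int newDocId) {
        Set<Integer> current;
        Set<Integer> updated;
        do {
            current = validDocIds.get();
            updated = new HashSet<>(current);   // per-update clone (memory cost)
            updated.remove(oldDocId);
            updated.add(newDocId);
        } while (!validDocIds.compareAndSet(current, updated));
    }

    // Lock-free read: the returned set is a fully consistent view.
    public Set<Integer> snapshot() {
        return validDocIds.get();
    }

    public static void main(String[] args) {
        CowValidDocIds ids = new CowValidDocIds();
        ids.replaceAcrossSegments(-1, 7);   // -1: no previous doc
        ids.replaceAcrossSegments(7, 42);
        System.out.println(ids.snapshot()); // [42]
    }
}
```

Batching would amortize the clone across several updates, at the cost of delaying visibility of the batched docIds until the next publish.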
