
Upsert table intermittently misses out on records during concurrent ingestion updates #12667

@tibrewalpratik17

Description


We have a partial-upsert table where the number of updates per key is very high over a short period (thousands of updates per key within an hour).

During this, if we query for a particular primary key, we intermittently see no response from Pinot for that key. I noticed this coincides with an update arriving for that primary key (the query lands within one second of a new record for that key).

After a few seconds, the record shows up again in query responses and everything works fine until the next overlap between query time and ingestion time.

I suspect this happens because, when the new record lands in a different segment, we update the docId by removing the old one first and then adding the new one:

```java
if (segment == currentSegment) {
  replaceDocId(validDocIds, queryableDocIds, currentDocId, newDocId, recordInfo);
} else {
  removeDocId(currentSegment, currentDocId);
  addDocId(validDocIds, queryableDocIds, newDocId, recordInfo);
}
```

This looks like a race condition where a query snapshot is taken between these two actions.

Is this expected behaviour? Is there a way we can guarantee that at least the older record (if not the newer one) stays visible during this window?
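The disappearing-record window above can be shown with a minimal, deterministic sketch. A plain concurrent set stands in for `ThreadSafeMutableRoaringBitmap`, and the method name is hypothetical; the point is only that between `removeDocId` and `addDocId` a snapshot sees neither doc:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class UpsertRaceDemo {
    // Stand-in for the validDocIds bitmap: a concurrent set of docIds.
    static final Set<Integer> validDocIds = ConcurrentHashMap.newKeySet();

    // Simulates the cross-segment update path, probing visibility in between.
    static boolean[] crossSegmentUpdate(int currentDocId, int newDocId) {
        validDocIds.add(currentDocId);        // old record is valid
        validDocIds.remove(currentDocId);     // step 1: removeDocId(...)
        // A query snapshotting validDocIds here sees neither the old nor the new doc.
        boolean visibleInWindow =
                validDocIds.contains(currentDocId) || validDocIds.contains(newDocId);
        validDocIds.add(newDocId);            // step 2: addDocId(...)
        boolean visibleAfter = validDocIds.contains(newDocId);
        return new boolean[] {visibleInWindow, visibleAfter};
    }

    public static void main(String[] args) {
        boolean[] r = crossSegmentUpdate(7, 42);
        System.out.println(r[0] + " " + r[1]); // false true — key "disappears" mid-update
    }
}
```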

One possible solution is to guard validDocIds with a lock before updating, and to take the same lock when snapshotting in FilterPlanNode -

```java
MutableRoaringBitmap queryableDocIdSnapshot = null;
if (!_queryContext.isSkipUpsert()) {
  ThreadSafeMutableRoaringBitmap queryableDocIds = _indexSegment.getQueryableDocIds();
  if (queryableDocIds != null) {
    queryableDocIdSnapshot = queryableDocIds.getMutableRoaringBitmap();
  } else {
    ThreadSafeMutableRoaringBitmap validDocIds = _indexSegment.getValidDocIds();
    if (validDocIds != null) {
      queryableDocIdSnapshot = validDocIds.getMutableRoaringBitmap();
    }
  }
}
```

But this can become a major issue in high-QPS / high-ingestion scenarios, where one side starves the other.
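For illustration, here is a sketch of the lock-based idea, assuming a `ReentrantReadWriteLock` guarding a plain set (all names are hypothetical stand-ins, not the actual Pinot API): the updater holds the write lock across the remove/add pair, so snapshots can never observe the intermediate state.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockedValidDocIds {
    private final Set<Integer> validDocIds = new HashSet<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Writer: makes remove + add atomic with respect to query snapshots.
    public void replaceAcrossSegments(int oldDocId, int newDocId) {
        lock.writeLock().lock();
        try {
            validDocIds.remove(oldDocId);
            validDocIds.add(newDocId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reader (FilterPlanNode side): snapshots under the read lock.
    public Set<Integer> snapshot() {
        lock.readLock().lock();
        try {
            return new HashSet<>(validDocIds);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockedValidDocIds ids = new LockedValidDocIds();
        ids.replaceAcrossSegments(-1, 7);   // -1: no previous doc
        ids.replaceAcrossSegments(7, 42);
        System.out.println(ids.snapshot()); // [42]
    }
}
```

If starvation is the concern, `new ReentrantReadWriteLock(true)` enables fairness so neither readers nor the writer are starved indefinitely, at some throughput cost.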

One more approach is to take a clone of validDocIds before updating, apply the add/remove to the clone, and then copy it back over the original validDocIds. This can work for updates within the current consuming segment, but might not work when the previous record lives in an old immutable segment. It also has memory implications, since we would maintain a clone of validDocIds for every message. What if we do this in batches? That would reduce the memory footprint, but the implementation could become quite intrusive.
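The clone-then-copy-back idea is essentially copy-on-write. A minimal sketch, assuming an `AtomicReference` publishes each clone atomically (names hypothetical; the per-message clone is exactly the memory cost noted above):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

public class CowValidDocIds {
    private final AtomicReference<Set<Integer>> validDocIds =
            new AtomicReference<>(Collections.emptySet());

    // Clone, mutate the clone, then publish it atomically: readers always
    // see either the old state (with oldDocId) or the new one (with newDocId),
    // never the empty in-between window.
    public void replaceAcrossSegments(int oldDocId, int newDocId) {
        Set<Integer> current;
        Set<Integer> updated;
        do {
            current = validDocIds.get();
            updated = new HashSet<>(current);   // per-update clone (memory cost)
            updated.remove(oldDocId);
            updated.add(newDocId);
        } while (!validDocIds.compareAndSet(current, updated));
    }

    // Lock-free read: the returned set is a fully consistent view.
    public Set<Integer> snapshot() {
        return validDocIds.get();
    }

    public static void main(String[] args) {
        CowValidDocIds ids = new CowValidDocIds();
        ids.replaceAcrossSegments(-1, 7);   // -1: no previous doc
        ids.replaceAcrossSegments(7, 42);
        System.out.println(ids.snapshot()); // [42]
    }
}
```

Batching would amortize the clone across several updates, at the cost of delaying visibility of the batched docIds until the next publish.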
