Real time tiered storage feature causing temporary offline segments #7779

@dongxiaoman

NOTE: This is not an urgent bug, but it would be quite annoying if we can confirm it is the root cause.

We see a clear correlation: queries fail seconds after a realtime segment is moved to another storage tier (currently, any offline segment can cause a query to fail).

We have a query error for a missing segment at "timestamp":"2021-11-16T22:13:18.256Z",
and 5 seconds earlier we see logs indicating the segment was dropped from the realtime server. A few minutes later, the same segment shows up on its tiered servers.

The log for the drop on the streaming (realtime) server is:

2021/11/16 22:13:15.345 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_STATE_TRANSITION] Instance Server_st-fw-81.service.consul_8098, partition point_entry__34__576__20211116T0930Z received state transition from OFFLINE to DROPPED on session 30043ed1a3604f2, message id: d9310a75-1742-4758-981e-32c0b193f7eb

My mental model of the cause is:

  1. The segment is scheduled to move to another tier because of its TTL.
  2. The segment is dropped from the realtime server, but the new tier has not yet completed the "ONLINE" task for that segment.
  3. The segment appears offline to the Pinot controller. A query arrives, and the brokers (or servers?) complain that the segment is missing from the realtime server.
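
The suspected sequence can be illustrated with a toy model. This is a minimal sketch, not Pinot code: the `Event` type, the server names, the `unavailable_windows` helper, and the timings are all hypothetical and chosen only for illustration. It replays online/offline events for one segment and reports any interval in which no server hosts it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float       # seconds since the move started (illustrative values)
    server: str    # hypothetical server name
    online: bool   # True: segment came ONLINE here; False: it was dropped


def unavailable_windows(events, initially_online):
    """Return (start, end) intervals during which no server hosts the segment."""
    online = set(initially_online)
    windows = []
    gap_start = None if online else 0.0
    for ev in sorted(events, key=lambda e: e.t):
        if ev.online:
            online.add(ev.server)
        else:
            online.discard(ev.server)
        if not online and gap_start is None:
            gap_start = ev.t                    # last replica just went away
        elif online and gap_start is not None:
            windows.append((gap_start, ev.t))   # a replica is back
            gap_start = None
    if gap_start is not None:
        windows.append((gap_start, float("inf")))
    return windows


# Drop-first ordering, as hypothesized in steps 1-3 (assumed timings).
drop_first = [
    Event(0.0, "realtime-1", False),   # dropped from the realtime server
    Event(180.0, "tier-1", True),      # ONLINE on the new tier minutes later
]
# Load-first ordering, for comparison.
load_first = [
    Event(0.0, "tier-1", True),
    Event(5.0, "realtime-1", False),
]

print(unavailable_windows(drop_first, {"realtime-1"}))  # [(0.0, 180.0)]
print(unavailable_windows(load_first, {"realtime-1"}))  # []
```

Under the drop-first ordering, any query routed during the reported window would hit a segment with no online replica, which matches the failures described above.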

Step 3 is still a bit strange: did the broker not receive the segment's external view change event within 5 seconds? The segment is going to show up in another storage tier.

If we think of moving a segment to another tier as a "rebalance", we should have the option to do a "no-downtime" move of segments: keep one replica in place, ensure the new replica shows up, and then move the next one.
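
The "no-downtime" ordering proposed above could look roughly like the sketch below. It is not how Pinot implements tier moves or rebalance; `move_segment_no_downtime` and the `load`, `drop`, and `is_online` callbacks are hypothetical stand-ins for whatever the controller would actually call, and the sketch assumes the source and target tiers have the same replica count.

```python
import time


def move_segment_no_downtime(sources, targets, load, drop, is_online,
                             timeout_s=60.0, poll_s=0.01):
    """Move replicas one at a time: load the target, wait for ONLINE, then drop the source."""
    for src, dst in zip(sources, targets):
        load(dst)
        deadline = time.monotonic() + timeout_s
        while not is_online(dst):
            if time.monotonic() > deadline:
                # Leave the old replica in place so the segment stays queryable.
                raise TimeoutError(f"{dst} did not come ONLINE; {src} left in place")
            time.sleep(poll_s)
        drop(src)  # safe: the replacement replica is already serving


# In-memory stand-in for the cluster state (hypothetical server names).
online = {"realtime-1", "realtime-2"}
history = []  # number of online replicas after each step


def load(server):
    online.add(server)
    history.append(len(online))


def drop(server):
    online.discard(server)
    history.append(len(online))


move_segment_no_downtime(
    ["realtime-1", "realtime-2"], ["tier-1", "tier-2"],
    load, drop, lambda server: server in online,
)
print(sorted(online))  # ['tier-1', 'tier-2']
print(min(history))    # 2
```

The replica count never falls below its starting value during the move, at the cost of temporarily holding one extra copy of the segment.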
