
Support pushFileNamePattern job conf option #8141

@kkrugler

We build offline segments daily using Hadoop and store the results in HDFS, in the directory that is configured as our Pinot cluster’s deep store. Each day the build generates 35 new (or, more typically, updated) per-month segments, which we then deploy to our Pinot cluster via a metadata push.

As a result, the deep store directory in HDFS for a table holds ≈ 1200 segments (representing 3 years of data). When we do the metadata push, every segment is downloaded, its metadata is extracted, and the resulting metadata tarball is sent to the controller. This currently takes about 3 hours, even though we only want to push the 35 new or updated segments.

A simple solution would be to support a new, optional pushFileNamePattern parameter in the job conf, which would filter the push down to only the segments we care about. It could use the same format as the existing includeFileNamePattern option.
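
To illustrate the idea, here is a minimal sketch of the filtering step. It assumes the new option would be interpreted with the "glob:" / "regex:" syntax of Java's PathMatcher, which is my reading of how includeFileNamePattern behaves. The class name, the filter helper, and the segment file names are all hypothetical, not existing Pinot code.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class PushFilterSketch {

    // Hypothetical helper: keep only the segment paths that match the pattern.
    // The pattern is passed straight to java.nio's PathMatcher, so it must
    // start with "glob:" or "regex:".
    public static List<String> filter(List<String> segmentPaths, String pushFileNamePattern) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher(pushFileNamePattern);
        List<String> selected = new ArrayList<>();
        for (String path : segmentPaths) {
            if (matcher.matches(Paths.get(path))) {
                selected.add(path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up per-month segment files in a deep store directory.
        List<String> all = List.of(
            "deepstore/myTable/myTable_2021-12.tar.gz",
            "deepstore/myTable/myTable_2022-01.tar.gz",
            "deepstore/myTable/myTable_2022-02.tar.gz");

        // Only the 2022 segments would be downloaded and pushed.
        System.out.println(filter(all, "glob:**/myTable_2022-*.tar.gz"));
    }
}
```

With a filter like this applied before the download step, segments that do not match would never be fetched from the deep store, so the cost of the push would depend on the number of matching segments rather than on the size of the directory.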
