Description
We do daily builds of offline segments using Hadoop and store the results in HDFS, in the directory configured as our Pinot cluster’s deep store. Each daily build generates 35 new (or, more typically, updated) monthly segments, which we then deploy to our Pinot cluster via a metadata push.
As a result, our deep store directory in HDFS contains ≈1200 segments (representing 3 years of data) for the table. When we do the metadata push, every segment is downloaded, its metadata is extracted, and that metadata tarball is sent to the controller. This currently takes about 3 hours, even though only the 35 segments from the latest build actually need to be pushed.
It seems like a simple solution would be to support a new, optional pushFileNamePattern parameter in the job conf, which could be used to filter the push down to only the segments we care about. The format could be the same as the existing includeFileNamePattern parameter; see the sketch below.
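
As a rough illustration, here is how this might look in an ingestion job spec. This is only a sketch: `pushFileNamePattern` is the proposed parameter and does not exist today, and the URIs, table name, and date prefix are placeholders.

```yaml
# Hypothetical metadata-push job spec excerpt.
executionFrameworkSpec:
  name: 'standalone'
jobType: SegmentMetadataPush
# Deep store directory holding all ~1200 segment tarballs (placeholder URI).
inputDirURI: 'hdfs://namenode/pinot/deepstore/myTable'
# Existing parameter: which files in inputDirURI are segments at all.
includeFileNamePattern: 'glob:**/*.tar.gz'
# Proposed parameter (not yet implemented): restrict the push to segments
# matching this pattern, using the same glob/regex syntax as
# includeFileNamePattern. Here, only the segments rebuilt today.
pushFileNamePattern: 'glob:**/myTable_2024-06-*.tar.gz'
```

With something like this, a daily job could push just the 35 rebuilt segments while leaving the rest of the deep store untouched.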