Wrong file names created in distributed Snapshot #10390
Closed
Description
Describe the bug
Running a Snapshot in distributed mode creates as many partial snapshots as there are distributed tasks. Each task appends its task id to the original file name so that different tasks do not overwrite each other's data. The input file name is modified in place in each task, however, so a task can receive an already-modified file name and modify it further, which produces wrong file names after the Snapshot. The following reproducer
import ROOT
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
from dask.distributed import Client, LocalCluster
import os

def test(client):
    df = RDataFrame(10, daskclient=client).Define("a", "1.")
    snap_df = df.Snapshot("dummy_distributed", "dummy_distributed.root")
    # Print the input files of the dataframe returned by Snapshot,
    # i.e. the partial snapshot files written by the tasks
    print(snap_df._headnode.inputfiles)
    # Expected partial file names (not used further in this reproducer)
    tmp_files = [f"dummy_distributed_{i}.root" for i in range(2)]

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2, threads_per_worker=1, processes=True))
    # Run the same Snapshot twice on the same client
    for i in range(2):
        test(client)

leads to
$: python repro.py
['dummy_distributed_0.root', 'dummy_distributed_1.root']
['dummy_distributed_1_0.root', 'dummy_distributed_0_1.root']
Interestingly, this happens only with Dask.
Expected behavior
Each task can properly process the Snapshot in isolation.
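As an illustration of the difference, here is a minimal sketch in plain Python that does not use ROOT or Dask. The `SnapshotOp` class and both helper functions are hypothetical names introduced only for this example, and the sketch assumes that the same operation object ends up being seen by more than one task; it is not the actual distributed RDataFrame code.

```python
import os
from dataclasses import dataclass


@dataclass
class SnapshotOp:
    """Hypothetical stand-in for the Snapshot operation sent to tasks."""
    filename: str


def task_filename_in_place(op, task_id):
    # Problematic pattern: the task-specific name overwrites the shared
    # attribute, so the next task starts from an already-modified name.
    stem, ext = os.path.splitext(op.filename)
    op.filename = f"{stem}_{task_id}{ext}"
    return op.filename


def task_filename_isolated(op, task_id):
    # Isolated pattern: the task-specific name is a local value and the
    # operation object is left untouched.
    stem, ext = os.path.splitext(op.filename)
    return f"{stem}_{task_id}{ext}"


shared = SnapshotOp("dummy_distributed.root")
in_place_names = [task_filename_in_place(shared, i) for i in (1, 0)]

fresh = SnapshotOp("dummy_distributed.root")
isolated_names = [task_filename_isolated(fresh, i) for i in (1, 0)]

print(in_place_names)
print(isolated_names)
```

With the in-place helper the second call builds on the name left by the first, which reproduces the `dummy_distributed_1_0.root` pattern seen in the output above. With the isolated helper every call starts from the original `dummy_distributed.root`.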
Setup
ROOT >= 6.26