Wrong file names created in distributed Snapshot #10390

@vepadulano

Describe the bug

Running a Snapshot in distributed mode creates as many partial snapshots as there are distributed tasks. Each task appends its task id to the original file name, so that different tasks do not overwrite each other's data. However, the input file name is modified in place in each task, so a task can receive an already modified file name and modify it further, which produces wrong file names after the Snapshot. The following reproducer

import ROOT
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame

from dask.distributed import Client, LocalCluster

import os

def test(client):

    df = RDataFrame(10, daskclient=client).Define("a","1.")

    snap_df = df.Snapshot("dummy_distributed", "dummy_distributed.root")
    print(snap_df._headnode.inputfiles)

    tmp_files = [f"dummy_distributed_{i}.root" for i in range(2)]

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2, threads_per_worker=1, processes=True))
    for i in range(2):
        test(client)

leads to

$: python repro.py
['dummy_distributed_0.root', 'dummy_distributed_1.root']
['dummy_distributed_1_0.root', 'dummy_distributed_0_1.root']

Interestingly, this happens only with Dask.

Expected behavior

Each task can properly process the Snapshot in isolation.
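
To illustrate the kind of isolation meant here, below is a minimal pure-Python sketch. It does not use ROOT or Dask, and it is not the actual implementation: `SnapshotOp`, `name_inplace`, `name_isolated` and `simulate` are hypothetical names, and the idea that each worker holds a long-lived copy of the operation is only an assumption made to mimic the observed output. It contrasts overwriting the stored file name with deriving a per-task name and leaving the stored one untouched.

```python
import os


class SnapshotOp:
    """Hypothetical stand-in for a Snapshot operation that outlives one task."""

    def __init__(self, filename):
        self.filename = filename


def name_inplace(op, task_id):
    # Problematic pattern: the stored file name is overwritten,
    # so the next task that sees this object starts from a modified name.
    stem, ext = os.path.splitext(op.filename)
    op.filename = f"{stem}_{task_id}{ext}"
    return op.filename


def name_isolated(op, task_id):
    # Isolated pattern: derive the per-task name, leave the object untouched.
    stem, ext = os.path.splitext(op.filename)
    return f"{stem}_{task_id}{ext}"


def simulate(namer, schedule):
    # Assumption: each simulated worker keeps its own long-lived copy of the operation.
    workers = [SnapshotOp("dummy_distributed.root") for _ in range(2)]
    rounds = []
    for assignment in schedule:
        # assignment[w] is the task id handed to worker w in this round
        names = {task_id: namer(workers[w], task_id)
                 for w, task_id in enumerate(assignment)}
        rounds.append([names[t] for t in sorted(names)])
    return rounds


# Two rounds; in the second one the tasks land on the opposite workers.
schedule = [(0, 1), (1, 0)]
buggy = simulate(name_inplace, schedule)
fixed = simulate(name_isolated, schedule)
print(buggy)
print(fixed)
```

Under these assumptions, the in-place variant yields doubly suffixed names in the second round, while the isolated variant yields the same `_0`/`_1` names in both rounds.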

Setup

ROOT >= 6.26
