dask.bag.to_textfiles does not store each element in a line #1072

@Jeffrey04

Not sure if this is intended, but the documentation describes the behaviour of bag.to_textfiles() as

Write bag to disk, one filename per partition, one line per element

However, as the following example reproduces, it actually writes every element on a single line

import dask.bag as db

db.from_sequence(range(10), 10) \
    .map(str) \
    .to_textfiles('foo.*.txt')

and the output file reads

0123456789

After some tinkering, changing this loop in the write() function (in dask/bag/core.py) from

        for line in data:
            f.write(line.encode(encoding))

to

        for i, line in enumerate(data):
            f.write((line if i == 0 else '\n{}'.format(line)).encode(encoding))

seems to produce the behaviour described in the documentation
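
For anyone who wants to compare the two loops without a dask checkout, here is a minimal standalone sketch. The helper names write_original and write_patched are hypothetical and are not part of dask; they only copy the two loop bodies quoted above and write into an in-memory buffer instead of a file on disk.

```python
import io


def write_original(data, f, encoding="utf-8"):
    # Hypothetical helper: same loop body as the current write() quoted above.
    # Elements are written back to back with no separator.
    for line in data:
        f.write(line.encode(encoding))


def write_patched(data, f, encoding="utf-8"):
    # Hypothetical helper: same loop body as the proposed change.
    # Every element except the first is preceded by a newline.
    for i, line in enumerate(data):
        f.write((line if i == 0 else '\n{}'.format(line)).encode(encoding))


elements = [str(i) for i in range(10)]

before = io.BytesIO()
write_original(elements, before)

after = io.BytesIO()
write_patched(elements, after)

print(before.getvalue())                           # b'0123456789'
print(after.getvalue().decode("utf-8").split("\n"))  # one entry per element
```

Note that the proposed loop places the newline between elements, so the last line has no trailing newline. Whether that matters probably depends on what reads the files afterwards.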
