When I tried to use split_file as in your train-from-scratch example, I got this error:
Splitting file ...: 5%|5 | 127877/2407713 [00:00<00:02, 869200.61it/s]
Traceback (most recent call last):
File "finetune_lm.py", line 43, in <module>
split_file(data_dir / "train.txt", output_dir=Path('/data/german_old_texts/processed/lm/split_files'), docs_per_file=20)
File "/home/user/farm/data_handler/utils.py", line 785, in split_file
write_file.writelines(lines_to_write)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 62: ordinal not in range(128)
According to the split_file method definition in the source code, the default encoding should be utf-8, so I think this error shouldn't happen.
After reading the source code of this method, I noticed that the encoding argument is only used when opening the file for reading. However, when opening the file for writing (in lines 784 and 793), you are using:
write_file = stack.enter_context(open(filename, 'w+', buffering=10 * 1024 * 1024))
instead of
write_file = stack.enter_context(open(filename, 'w+', encoding=encoding, buffering=10 * 1024 * 1024))
This is probably what causes the error.
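Here is a minimal sketch reproducing the problem outside of FARM. When no encoding= is passed to open(), Python falls back to the locale's preferred encoding, which can be ASCII (e.g. under LANG=C); to make the repro deterministic, encoding="ascii" is forced explicitly here to simulate that fallback. The filename and text are made up for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")
lines_to_write = ["Bär\n"]  # 'ä' is '\xe4', the character from the traceback

# Simulate the buggy write path: no utf-8 encoding, so a non-ASCII
# character cannot be encoded (forced to "ascii" to mimic a C locale).
try:
    with open(path, "w+", encoding="ascii") as write_file:
        write_file.writelines(lines_to_write)
    raised = False
except UnicodeEncodeError:
    raised = True

# The proposed fix: pass encoding explicitly, as the read path already does.
with open(path, "w+", encoding="utf-8") as write_file:
    write_file.writelines(lines_to_write)

with open(path, encoding="utf-8") as read_file:
    content = read_file.read()
```

With the forced ASCII codec the write raises UnicodeEncodeError, while the explicit utf-8 version round-trips the text correctly.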
I'm not sure how to change the source and test it myself in order to contribute a fix.
FARM version = 0.4.6