When I tried to use split_file as in your train-from-scratch example, I got this error:
Splitting file ...: 5%|5 | 127877/2407713 [00:00<00:02, 869200.61it/s]
Traceback (most recent call last):
File "finetune_lm.py", line 43, in <module>
split_file(data_dir / "train.txt", output_dir=Path('/data/german_old_texts/processed/lm/split_files'), docs_per_file=20)
File "/home/user/farm/data_handler/utils.py", line 785, in split_file
write_file.writelines(lines_to_write)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 62: ordinal not in range(128)
According to the split_file method definition in the source code, the default encoding should be utf-8, so I think this error shouldn't happen.
After reading the source code of this method, I noticed that the encoding argument is only used when opening the file for reading. However, when opening the file for writing (in lines 784 and 793), you are using:
write_file = stack.enter_context(open(filename, 'w+', buffering=10 * 1024 * 1024))
instead of
write_file = stack.enter_context(open(filename, 'w+', encoding=encoding, buffering=10 * 1024 * 1024))
This is probably what causes the error.
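Here is a minimal sketch reproducing the problem outside of FARM. When no encoding= is passed to open(), Python falls back to the locale's preferred encoding, which can be ASCII (e.g. under LANG=C); to make the repro deterministic, encoding="ascii" is forced explicitly here to simulate that fallback. The filename and text are made up for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")
lines_to_write = ["Bär\n"]  # 'ä' is '\xe4', the character from the traceback

# Simulate the buggy write path: no utf-8 encoding, so a non-ASCII
# character cannot be encoded (forced to "ascii" to mimic a C locale).
try:
    with open(path, "w+", encoding="ascii") as write_file:
        write_file.writelines(lines_to_write)
    raised = False
except UnicodeEncodeError:
    raised = True

# The proposed fix: pass encoding explicitly, as the read path already does.
with open(path, "w+", encoding="utf-8") as write_file:
    write_file.writelines(lines_to_write)

with open(path, encoding="utf-8") as read_file:
    content = read_file.read()
```

With the forced ASCII codec the write raises UnicodeEncodeError, while the explicit utf-8 version round-trips the text correctly.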
I'm not sure how to change the source and test it myself in order to contribute a fix.
FARM version = 0.4.6