Is your feature request related to a problem? Please describe.
After sending PR #677, I think the data module should be refactored: right now it is hard to use with a user-defined dataset.
Decouple data_type and data_path
In RL/examples/run_sft.py (line 94 at 51d8006):
    data_cls = data_config["dataset_name"]
dataset_name is actually the data type, and the data is then loaded from HF.
In practice, users' data is usually not open source, so we have to change the source code here by hand.
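To make the request concrete, here is a minimal sketch of what decoupling could look like, assuming a small registry keyed by a data_type string and a separate data_path config entry. The names DATASET_REGISTRY, register, LoadedData and build_dataset are hypothetical and are not part of the current codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LoadedData:
    """Hypothetical placeholder for whatever a real loader would return."""
    data_type: str
    paths: List[str]


# Hypothetical registry: maps a data_type string to a loader function.
DATASET_REGISTRY: Dict[str, Callable[[str], LoadedData]] = {}


def register(data_type: str):
    """Decorator that registers a loader under the given data_type."""
    def wrap(fn):
        DATASET_REGISTRY[data_type] = fn
        return fn
    return wrap


@register("openai_format")
def load_openai_format(data_path: str) -> LoadedData:
    # A real loader would read the file; this one only records the request.
    return LoadedData(data_type="openai_format", paths=[data_path])


def build_dataset(data_config: dict) -> LoadedData:
    """Pick the loader from data_type and feed it the user's data_path."""
    data_type = data_config["data_type"]
    if data_type not in DATASET_REGISTRY:
        raise ValueError(f"unknown data_type: {data_type!r}")
    return DATASET_REGISTRY[data_type](data_config["data_path"])
```

With something like this, a private dataset only needs a config change (data_type plus data_path) instead of an edit to run_sft.py.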
Support multiple data paths
In RL/nemo_rl/data/hf_datasets/oai_format_dataset.py (line 51 at 51d8006):
    train_original_dataset = load_dataset("json", data_files=train_ds_path)["train"]
OpenAIFormatDataset.__init__ only supports a single path, not a list of data paths. That does not match the real scenario, where training data comes from different domains or teams.
The LLaMA-Factory style (https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) may be better.
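As a rough illustration of the behaviour I mean, the sketch below accepts either one path or a list of paths and concatenates the records. It uses only the standard library and assumes JSON Lines input; normalize_paths and load_jsonl_files are hypothetical helper names. As far as I know, the data_files argument of load_dataset in Hugging Face datasets also accepts a list of paths, so the existing call may only need the constructor to pass a list through.

```python
import json
from typing import Dict, List, Union

PathSpec = Union[str, List[str]]


def normalize_paths(paths: PathSpec) -> List[str]:
    """Turn a single path or a list of paths into a list of paths."""
    if isinstance(paths, str):
        return [paths]
    return list(paths)


def load_jsonl_files(paths: PathSpec) -> List[Dict]:
    """Read every JSON Lines file in order and concatenate the records."""
    records: List[Dict] = []
    for path in normalize_paths(paths):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

A config could then list one file per domain or team under the same key, and a single string would keep working as before.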
Dataset constructor definition
In RL/examples/run_sft.py (line 116 at 51d8006), the list of constructor keys differs between datasets.
If you do not set system_key for OpenAIFormatDataset, it crashes. So I have to change the code:
    data = hf_datasets.OpenAIFormatDataset(
        data_config["train_data_path"],
        data_config["val_data_path"],
        **{k: data_config[k] for k in ("chat_key", "system_key", "system_prompt") if k in data_config},
    )
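A more general version of the workaround above could derive the accepted keys from the constructor signature instead of hard-coding them per dataset. This is only a sketch: FakeOpenAIFormatDataset is a hypothetical stand-in with made-up defaults, not the real class, and accepted_kwargs is a hypothetical helper.

```python
import inspect
from typing import Any, Dict


class FakeOpenAIFormatDataset:
    """Hypothetical stand-in for a dataset class; defaults are made up."""

    def __init__(self, train_ds_path, val_ds_path,
                 chat_key="messages", system_key=None, system_prompt=None):
        self.train_ds_path = train_ds_path
        self.val_ds_path = val_ds_path
        self.chat_key = chat_key
        self.system_key = system_key
        self.system_prompt = system_prompt


def accepted_kwargs(cls: type, config: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the config entries that cls.__init__ names as parameters."""
    params = inspect.signature(cls.__init__).parameters
    return {k: v for k, v in config.items() if k in params and k != "self"}


config = {
    "dataset_name": "openai_format",
    "train_data_path": "train.jsonl",
    "val_data_path": "val.jsonl",
    "chat_key": "conversations",  # system_key is deliberately not set
}
dataset = FakeOpenAIFormatDataset(
    config["train_data_path"],
    config["val_data_path"],
    **accepted_kwargs(FakeOpenAIFormatDataset, config),
)
```

Keys that are missing from the config fall back to the constructor defaults, and keys the constructor does not know (such as dataset_name) are ignored rather than causing a crash.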