Su Zhu

13 comments by Su Zhu

Find solid baselines in https://github.com/sz128/slot_filling_and_intent_detection_of_SLU.

@lale314 You can filter out the incomplete samples with the following code:

```python
import sys
import json

with open(sys.argv[1]) as fin:
    for line in fin:
        line = line.strip()
        sample = json.loads(line)
        output = sample['output'].strip(" \n\"”")
        if output[-1] in set("?!.。?!})]`》)") or...
```

> Isn't this already supported with #31629?

It seems that both implementations are similar. But we need to consider the situation where position IDs are not reset between different...

> Hey! Don't you think that ragging the tensor would be more efficient?

Yes. I didn't describe it well. I updated the description of this PR. The implementations of this...
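For context, "ragging" a tensor here means dropping the padding and storing the variable-length sequences contiguously, instead of carrying a rectangular padded batch. A minimal sketch of the idea (the function name and `pad_id` are illustrative, not from the PR):

```python
# Sketch: convert a padded batch into a "ragged" packed form.
# Illustrative only; the actual PR's implementation may differ.
def rag(padded, pad_id=0):
    """Flatten a padded batch into one packed token list plus per-sample lengths."""
    packed, lengths = [], []
    for row in padded:
        toks = [t for t in row if t != pad_id]  # strip padding tokens
        packed.extend(toks)
        lengths.append(len(toks))
    return packed, lengths

padded = [
    [5, 6, 7, 0, 0],   # sample 1 (pad_id = 0)
    [8, 9, 0, 0, 0],   # sample 2
]
packed, lengths = rag(padded)
print(packed)   # [5, 6, 7, 8, 9]
print(lengths)  # [3, 2]
```

The packed form avoids wasting compute on pad positions; the `lengths` list is enough to recover the original sample boundaries.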

> > > Hey! Don't you think that ragging the tensor would be more efficient?
> > >
> > > Yes. I didn't describe it well. I updated the...

> > we need to consider the scenario where position IDs are not reset between different short samples, especially for LLM pre-training
>
> does this imply us properly computing...

> > > > we need to consider the scenario where position IDs are not reset between different short samples, especially for LLM pre-training
> > >
> > > ...

> Why wouldn't we use `position_ids` to encode all information (packed, not packed, padded, not padded) in a slightly more elegant way without touching `attention_mask`?
>
> For example let's...
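The proposal above relies on `position_ids` carrying the packing information: if each packed sample's positions restart at 0, the sample boundaries can be recovered from the position IDs alone. A minimal sketch of that recovery, assuming the reset convention (the function name is hypothetical):

```python
# Sketch: recover packed-sample lengths from position_ids alone,
# assuming each packed sample's positions restart at 0 (illustrative).
def seq_lengths_from_position_ids(pos):
    lengths, start = [], 0
    for i in range(1, len(pos)):
        if pos[i] == 0:            # a reset to 0 marks the start of a new sample
            lengths.append(i - start)
            start = i
    lengths.append(len(pos) - start)
    return lengths

print(seq_lengths_from_position_ids([0, 1, 2, 0, 1]))  # [3, 2]
```

This is exactly why the non-reset convention needs separate handling: with continuous positions there are no resets to detect, so the boundaries would have to be passed explicitly.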

> Yep, agree with that definitely! My proposal was to leave this choice to users to set in data collator. If they wish to treat such concatenated sequences as a...