
GPT dynamic-graph hybrid-parallel case: loss becomes NaN after 20,000+ steps #60142

@Liujie0926

Description


Describe the Bug

Reproduction environment: CUDA 11.7, Python 3.10, V100-32G, single machine with eight GPUs
Paddle commit: 3bcdeef55611b66f49fca4b68bd99daf7e44b40b
git clone http://github.com/PaddlePaddle/PaddleNLP.git -b develop && cd PaddleNLP/model_zoo/gpt-3/
Data and environment preparation
python -m pip install -r requirements.txt
mkdir data
wget -O data/gpt_en_dataset_300m_ids.npy https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy
wget -O data/gpt_en_dataset_300m_idx.npz https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz
Launch commands
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7;

With the gpt_recompute_bs16_fp16_DP2-MP2-PP2 configuration, NaN starts to appear after 25,000+ steps:

python -m paddle.distributed.launch --log_dir=./mylog --devices=0,1,2,3,4,5,6,7 tools/train.py -c ppfleetx/configs/nlp/gpt/pretrain_gpt_1.3B_dp8.yaml -o Global.seed=1234 -o Global.local_batch_size=8 -o Global.micro_batch_size=2 -o Engine.max_steps=50000 -o Engine.eval_freq=1000 -o Engine.mix_precision.enable=True -o Engine.save_load.save_steps=100000 -o Model.hidden_size=1024 -o Model.num_layers=4 -o Model.num_attention_heads=4 -o Model.type_vocab_size=1 -o Model.use_recompute=True -o Distributed.dp_degree=2 -o Distributed.mp_degree=2 -o Distributed.pp_degree=2 -o Distributed.sharding.sharding_degree=1 -o Distributed.sharding.sharding_stage=1 -o Distributed.sharding.sharding_offload=False -o Profiler_pretrain.memory_stats=True -o Optimizer.lr.max_lr=1e-4 -o Optimizer.lr.min_lr=1e-5

With the gpt_bs64_fp16_DP8-MP1-PP1 configuration, NaN starts to appear after 17,000+ steps:

python -m paddle.distributed.launch --log_dir=./mylog --devices=0,1,2,3,4,5,6,7 tools/train.py -c ppfleetx/configs/nlp/gpt/pretrain_gpt_1.3B_dp8.yaml -o Global.seed=1234 -o Global.local_batch_size=8 -o Global.micro_batch_size=8 -o Engine.max_steps=50000 -o Engine.eval_freq=1000 -o Engine.mix_precision.enable=True -o Engine.save_load.save_steps=100000 -o Model.hidden_size=1024 -o Model.num_layers=4 -o Model.num_attention_heads=4 -o Model.type_vocab_size=1 -o Model.use_recompute=True -o Distributed.dp_degree=8 -o Distributed.mp_degree=1 -o Distributed.pp_degree=1 -o Distributed.sharding.sharding_degree=1 -o Distributed.sharding.sharding_stage=1 -o Distributed.sharding.sharding_offload=False -o Profiler_pretrain.memory_stats=True -o Optimizer.lr.max_lr=1e-4 -o Optimizer.lr.min_lr=1e-5
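As a sanity check on the two launch commands above, the following minimal sketch recomputes the derived quantities from the `-o` overrides. It assumes (this is not stated in the report) that the global batch size is `local_batch_size × dp_degree × sharding_degree` and that the number of micro-batches per optimizer step is `local_batch_size / micro_batch_size`. Under that assumption the results match the `bs16` and `bs64` tags in the configuration names. The `ParallelConfig` class and `summarize` helper are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    """Values copied from the -o overrides of one launch command."""
    name: str
    dp_degree: int
    mp_degree: int
    pp_degree: int
    sharding_degree: int
    local_batch_size: int
    micro_batch_size: int


def summarize(cfg: ParallelConfig) -> dict:
    """Derive world size, global batch size, and micro-batches per step."""
    world_size = (cfg.dp_degree * cfg.mp_degree
                  * cfg.pp_degree * cfg.sharding_degree)
    # Assumption: each data-parallel replica consumes local_batch_size samples.
    global_batch = cfg.local_batch_size * cfg.dp_degree * cfg.sharding_degree
    # Assumption: the local batch is split into equal micro-batches.
    micro_batches = cfg.local_batch_size // cfg.micro_batch_size
    return {
        "world_size": world_size,
        "global_batch_size": global_batch,
        "micro_batches_per_step": micro_batches,
    }


CONFIGS = [
    ParallelConfig("gpt_recompute_bs16_fp16_DP2-MP2-PP2",
                   dp_degree=2, mp_degree=2, pp_degree=2, sharding_degree=1,
                   local_batch_size=8, micro_batch_size=2),
    ParallelConfig("gpt_bs64_fp16_DP8-MP1-PP1",
                   dp_degree=8, mp_degree=1, pp_degree=1, sharding_degree=1,
                   local_batch_size=8, micro_batch_size=8),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(cfg.name, summarize(cfg))
```

Both configurations use all eight GPUs; they differ in how the devices are split between data, model, and pipeline parallelism, and in how many micro-batches make up one step.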

Symptom
During training the loss becomes NaN, as shown in the screenshot:
image
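To pin down the exact step at which the loss first turns non-finite, the training logs can be scanned with a small helper. The sketch below is illustrative only: the log line shape (a `step` number and a `loss` value on the same line) is an assumption, not the actual format produced by the training script, and `first_non_finite_step` is a hypothetical helper name. The regular expressions would need adjusting to the real log lines.

```python
import math
import re
from typing import Iterable, Optional

# Assumed (hypothetical) line shape: "... step: 17001 ... loss: nan ..."
STEP_RE = re.compile(r"step[:=\s]+(\d+)", re.IGNORECASE)
LOSS_RE = re.compile(
    r"loss[:=\s]+([-+]?(?:nan|inf|\d+(?:\.\d*)?(?:e[-+]?\d+)?))",
    re.IGNORECASE,
)


def first_non_finite_step(lines: Iterable[str]) -> Optional[int]:
    """Return the first step whose logged loss is NaN or inf, else None."""
    for line in lines:
        step = STEP_RE.search(line)
        loss = LOSS_RE.search(line)
        if not (step and loss):
            continue  # Line does not carry both a step and a loss value.
        if not math.isfinite(float(loss.group(1))):
            return int(step.group(1))
    return None
```

A possible usage, assuming the launcher wrote a per-rank log file under the `--log_dir=./mylog` directory (the file name here is an assumption):

```python
with open("mylog/workerlog.0", encoding="utf-8", errors="replace") as f:
    print(first_non_finite_step(f))
```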

Additional Supplementary Information

No response
