This repository was archived by the owner on Mar 12, 2024. It is now read-only.

Out of memory at later stage of training #150

@netw0rkf10w


Hello,

I observed some strange behavior when launching a training run on a server with 4 × 16 GB P100 GPUs using:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /path/to/coco

The training went well for 12 epochs, and then, in the middle of the 13th epoch, it crashed with an out-of-memory (OOM) error. Usually, memory usage shouldn't change between epochs, but I don't know whether this is the case for DETR.
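To check this, I could log the peak GPU memory at the end of each epoch and see whether it actually grows over time. A minimal sketch (the `log_epoch_memory` helper and its placement in the training loop are my own, not part of DETR's main.py):

```python
import torch

def log_epoch_memory(epoch, device=0):
    # Peak memory allocated by tensors on this GPU since the last reset.
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    # Peak memory reserved by the caching allocator (what the OOM limit is hit against).
    reserved_gb = torch.cuda.max_memory_reserved(device) / 1024 ** 3
    print(f"epoch {epoch}: peak allocated {peak_gb:.2f} GiB, peak reserved {reserved_gb:.2f} GiB")
    # Reset the peak counters so the next epoch is measured independently.
    torch.cuda.reset_peak_memory_stats(device)

# Assumed usage inside the training loop of main.py:
# for epoch in range(start_epoch, args.epochs):
#     train_one_epoch(...)
#     log_epoch_memory(epoch)
```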

According to the paper, you trained your models using "16 V100 GPUs, with 4 images per GPU (hence a total batch size of 64)". Could you tell me whether your GPUs have 16 GB or 32 GB of memory?

Thanks a lot!
