This repository was archived by the owner on Mar 12, 2024. It is now read-only.

Out of memory at later stage of training #150

@netw0rkf10w


Hello,

I observed some strange behavior when launching a training run on a server with 4 × 16 GB P100 GPUs using:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /path/to/coco

The training went well for 12 epochs, and then, in the middle of the 13th epoch, it crashed with an out-of-memory (OOM) error. Usually, memory usage shouldn't change between epochs, but I don't know whether this is the case for DETR.
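To check this, I could log the peak GPU memory at the end of each epoch and see whether it actually grows over time. A minimal sketch (the `log_epoch_memory` helper and its placement in the training loop are my own, not part of DETR's main.py):

```python
import torch

def log_epoch_memory(epoch, device=0):
    # Peak memory allocated by tensors on this GPU since the last reset.
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    # Peak memory reserved by the caching allocator (what the OOM limit is hit against).
    reserved_gb = torch.cuda.max_memory_reserved(device) / 1024 ** 3
    print(f"epoch {epoch}: peak allocated {peak_gb:.2f} GiB, peak reserved {reserved_gb:.2f} GiB")
    # Reset the peak counters so the next epoch is measured independently.
    torch.cuda.reset_peak_memory_stats(device)

# Assumed usage inside the training loop of main.py:
# for epoch in range(start_epoch, args.epochs):
#     train_one_epoch(...)
#     log_epoch_memory(epoch)
```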

According to the paper, you trained your models using "16 V100 GPUs, with 4 images per GPU (hence a total batch size of 64)". Could you tell me whether your GPUs have 16 GB or 32 GB of memory?

Thanks a lot!
