This repository was archived by the owner on Apr 8, 2025. It is now read-only.

remove hardcoded seeds from trainer#424

Merged
tholor merged 1 commit into deepset-ai:master from tstadel:hardcoded_seeds_fix
Jun 30, 2020

Conversation

@tstadel
Member

@tstadel tstadel commented Jun 25, 2020

resolves #423 by removing calls to set_all_seeds in Trainer.train()

@tholor
Member

tholor commented Jun 26, 2020

Thanks for flagging this and proposing a fix @tstadel !

The reason why we set the seeds there was to gain full reproducibility of training with checkpointing vs. without.
If you load a checkpoint, we "fast-forward" the DataLoader to the batch where you stopped in your last run (e.g. step 100). During this fast-forwarding, the random state will change, as there are operations that modify it (e.g. finding a random next sentence). So even though we load the random states from the checkpoint, they will be altered after the fast-forwarding.
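
A minimal illustration (plain Python `random`, not FARM's actual DataLoader code) of why the skip drifts the RNG state, and why restoring the saved state after the skip would fix it:

```python
import random

random.seed(0)
saved_state = random.getstate()  # what a checkpoint would have stored

# "Fast-forward" past 100 already-seen batches; each one draws from the
# RNG (e.g. when picking a random next sentence for LM pretraining).
for _ in range(100):
    random.random()

# The global RNG state has drifted away from the checkpointed one.
assert random.getstate() != saved_state

# Restoring the saved state after the skip brings the RNG back in sync.
random.setstate(saved_state)
assert random.getstate() == saved_state
```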

I agree that they shouldn't be hardcoded to specific values (= 39), but do you see a particular downside to having them there? We could, for example, add a "seed" arg to the Trainer() and use this value instead.

@tstadel
Member Author

tstadel commented Jun 26, 2020

Thanks for the explanation!

Nevertheless, I have to say it just looks pretty odd to me to set the seed in every step... Even if we introduce an arg to the Trainer, calls to set_all_seeds() from upstream code would effectively be ignored. So in order to reproduce results from such code without having to change it, the seed setting within train() should be optional.
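
A hypothetical sketch of that idea (the `seed` parameter, its `None` default, and the simplified `set_all_seeds` here are assumptions, not FARM's actual API): when `seed` is `None`, `train()` leaves the RNGs alone, so an upstream `set_all_seeds()` call keeps its effect.

```python
import random


def set_all_seeds(seed: int) -> None:
    # Sketch only: the real helper would also seed numpy and torch.
    random.seed(seed)


class Trainer:
    def __init__(self, seed=None):
        # seed=None (the default) means: do not touch the RNGs inside
        # train(), respecting any set_all_seeds() call by upstream code.
        self.seed = seed

    def train(self):
        if self.seed is not None:
            set_all_seeds(self.seed)
        # ... training loop would run here ...
        return random.random()  # stand-in for a seed-dependent result
```

With `Trainer(seed=39)` the old behaviour is available explicitly; with the default, reproducibility stays in the caller's hands.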

Is it simply too cumbersome to restore the seeds stored in the checkpoint after fast-forwarding or is there another reason we can't do that?

@tholor
Member

tholor commented Jun 30, 2020

Is it simply too cumbersome to restore the seeds stored in the checkpoint after fast-forwarding or is there another reason we can't do that?

It's not really complex, but it would probably make the code a bit less readable, as we don't have access to the checkpoint anymore after the initial loading via the Trainer's class method. One option would be to pass the checkpoint path as an attribute to the Trainer instance and then load the random states like here:

FARM/farm/train.py

Lines 477 to 482 in 0098819

numpy_rng_state = trainer_checkpoint["numpy_rng_state"]
numpy.random.set_state(numpy_rng_state)
rng_state = trainer_checkpoint["rng_state"]
cuda_rng_state = trainer_checkpoint["cuda_rng_state"]
torch.set_rng_state(rng_state)
torch.cuda.set_rng_state(cuda_rng_state)

Thinking more about this, I agree that the current situation is not ideal, as recent research also shows considerable variance in performance across different seeds on downstream tasks. This kind of "seed hacking" is not possible right now.

Thus, my suggestion is to:

  • short-term: get rid of these seeds
  • mid/long-term: load the random states from the checkpoint after fast-forwarding
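
The mid-term idea can be sketched like this (hypothetical function name; only Python's `random` is shown, while the real fix would also restore the numpy and torch CPU/CUDA states from the checkpoint, as in the train.py excerpt above):

```python
import random


def resume_from_checkpoint(trainer_checkpoint, data_loader, resume_step):
    """Sketch of the mid-term fix (not FARM's actual API)."""
    # 1) Fast-forward: consume the batches already seen in the last run.
    #    This may draw random numbers and perturb the global RNG state.
    for _step, _batch in zip(range(resume_step), data_loader):
        pass

    # 2) Only now restore the random state saved in the checkpoint,
    #    overwriting whatever the fast-forwarding did to it.
    random.setstate(trainer_checkpoint["rng_state"])
```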

Let me know if you are willing to work on the midterm solution.

@tstadel
Member Author

tstadel commented Jun 30, 2020

Yes, I'd like to work on the midterm solution. I'm trying to find some time for it before the end of the week.

@tholor
Member

tholor commented Jun 30, 2020

Perfect, thanks! I am merging this one with the short-term fix now, and we can tackle the midterm solution in a separate PR.

@tholor tholor merged commit daa6291 into deepset-ai:master Jun 30, 2020


Development

Successfully merging this pull request may close these issues.

Trainer fixes all seeds to 39
