Currently at Harvard, we're doing our EAD uploads via an ingest script that talks to the API, largely so we can get per-file rather than per-process failure on errors. This has worked out well, but it's fairly slow across 6000+ finding aids, so our script runs multiple uploads at once in hopes of turning a 4-day process into an overnight one.
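For context, a minimal sketch of the shape of such a script. The endpoint path, payload handling, session token, and thread count here are illustrative assumptions, not our actual script (in particular, the real flow converts EAD to JSONModel before posting):

```ruby
require 'net/http'
require 'json'

BACKEND = URI('http://localhost:8089')        # assumed local backend address
SESSION = ENV.fetch('ASPACE_SESSION')         # token from a prior login call
IMPORT  = '/repositories/2/batch_imports'     # illustrative endpoint/repo id
THREADS = 4

queue = Queue.new
Dir.glob('ead/*.xml').sort.each { |path| queue << path }
failures = Queue.new

workers = Array.new(THREADS) do
  Thread.new do
    loop do
      path = begin
        queue.pop(true)
      rescue ThreadError
        break # queue drained
      end
      # One POST per finding aid, so a bad file fails alone rather
      # than sinking the whole run.
      req = Net::HTTP::Post.new(IMPORT)
      req['X-ArchivesSpace-Session'] = SESSION
      req.body = File.read(path) # assumes payload already converted upstream
      res = Net::HTTP.start(BACKEND.host, BACKEND.port) { |http| http.request(req) }
      failures << path unless res.is_a?(Net::HTTPSuccess)
    end
  end
end
workers.each(&:join)
puts "#{failures.size} files failed"
```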
When testing this, I noticed that when we do uploads in parallel, we get a class of errors that we don't see when uploading serially:
Couldn't create version of: #<AgentCorporateEntity:0x69c234c6>
Couldn't create version of: #<AgentCorporateEntity:0x13619867>
Couldn't create version of: #<AgentCorporateEntity:0x3a539471>
Couldn't create version of: #<AgentCorporateEntity:0x35e9e1e7>
Couldn't create version of: #<Subject:0x62ea4d3c>
Couldn't create version of: #<Subject:0x2354d32d>
Couldn't create version of: #<Subject:0x262c6aa6>
Couldn't create version of: #<Subject:0x2bfaa80c>
This appears to happen when records with identical subject terms or corpnames are uploaded in the same batch. My guess is that the "create or fetch existing" logic for those models is neither serialized nor isolated: two parallel requests each decide the record doesn't exist yet, both try to create it with the same identifier, and one or the other (or both?) gets kicked out.
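A sketch of that suspected check-then-act race, not the actual ArchivesSpace code; the model and method names are illustrative:

```ruby
# Two parallel imports can interleave between the check and the act.
def create_or_fetch_subject(term)
  existing = Subject.first(title: term) # check: neither request sees a row yet
  return existing if existing
  Subject.create(title: term)           # act: both requests reach this line,
end                                     # and the second insert (or its
                                        # version record) gets kicked out
```

Serializing that step, e.g. relying on a unique index and rescuing the duplicate-key error with a re-fetch, or wrapping the check-and-create in a lock or suitably isolated transaction, would presumably make parallel batches safe.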
I don't currently have time to fix this and have resigned myself to slow imports, but I suspect it could also surface, if infrequently, in normal use, and it would be VERY nice to be able to parallelize import scripts.
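In the meantime, one client-side workaround sketch (hypothetical helper names): run the parallel pass, then retry anything that failed one file at a time, which sidesteps the race at the cost of a slower tail.

```ruby
# Hypothetical helpers: parallel_upload runs the threaded pass above and
# returns the paths that failed; upload_file posts a single finding aid
# and returns true on success.
failed = parallel_upload(paths)
# Retry losers serially: with one request in flight at a time, the
# get-vs-create race can't fire, so only genuinely bad files remain.
still_failed = failed.reject { |path| upload_file(path) }
warn "gave up on: #{still_failed.join(', ')}" unless still_failed.empty?
```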