Improve file processing performance #59

@davidmezzetti

This issue will improve the file processing performance with the following changes.

  • Serialize data to temporary storage within each process instead of passing it between processes.
  • Set a batch size so that data is queued in batches. The current implementation spends a lot of time blocking on the queue record by record.
  • Remove the output queue limit. With the new method, the database loader will catch up at the end of processing.
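
A rough sketch of how the three changes could fit together, not the actual implementation. All names here (`BATCH_SIZE`, `batches`, `produce`, `consume`) are hypothetical, `pickle` is an assumed serialization format, and `queue.Queue` stands in for a `multiprocessing.Queue` so the example runs in a single process.

```python
import os
import pickle
import tempfile
from itertools import islice
from queue import Queue

BATCH_SIZE = 1000  # hypothetical value; tune for the workload


def batches(records, size=BATCH_SIZE):
    """Yield lists of up to `size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def produce(records, output, size=BATCH_SIZE):
    """Worker side: write each batch to a temp file and queue only its path."""
    for batch in batches(records, size):
        with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as handle:
            pickle.dump(batch, handle)
        output.put(handle.name)  # one queue operation per batch, not per record
    output.put(None)  # sentinel: this worker is done


def consume(output):
    """Loader side: read batches back from disk until the sentinel arrives."""
    while True:
        path = output.get()
        if path is None:
            return
        with open(path, "rb") as handle:
            batch = pickle.load(handle)
        os.remove(path)  # clean up the temporary file once loaded
        yield from batch


# No maxsize is set, so the queue is unbounded and the producer never
# blocks waiting on a slower loader.
output = Queue()
produce(({"id": i} for i in range(2500)), output, size=1000)
loaded = list(consume(output))
```

Only short file paths cross the queue, so the cost of moving data between processes no longer grows with record size, and the loader can drain the remaining paths after parsing finishes.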

On a reasonably modern machine the full PubMed baseline (37 million articles as of JAN-2025) can be parsed in 1.5 hours.
