This repository was archived by the owner on Apr 8, 2025. It is now read-only.

perf(DPR-Data-Read): Improve data reading process#733

Merged
Timoeller merged 2 commits into deepset-ai:master from voidful:improve-dpr-data-read
Mar 5, 2021



@voidful voidful commented Mar 4, 2021

Refer to issue #730

Improvements:

  • Support the jsonl format for loading large files.
  • Downsample the data to reduce memory usage.
  • Add a passage ID when one does not exist.
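
The three points above could be sketched roughly as follows. This is not the code from this PR, only a minimal illustration under stated assumptions: the field names `positive_ctxs`, `hard_negative_ctxs`, and `passage_id` are assumed to follow the usual DPR data layout, and the helper names `iter_jsonl`, `downsample`, and `load_samples`, as well as the generated ID scheme, are hypothetical.

```python
import json
import random
from typing import Dict, Iterable, Iterator, List, Optional


def iter_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Parse one JSON object per non-empty line, lazily.

    Accepts any iterable of strings, e.g. an open file handle, so a large
    file never has to be held in memory as a single JSON document.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def downsample(ctxs: List[Dict], k: int, rng: random.Random) -> List[Dict]:
    """Keep at most k passages, chosen at random without replacement."""
    return list(ctxs) if len(ctxs) <= k else rng.sample(ctxs, k)


def load_samples(
    lines: Iterable[str],
    max_samples: Optional[int] = None,
    num_positives: int = 1,
    num_hard_negatives: int = 1,
    seed: int = 0,
) -> List[Dict]:
    """Read samples, downsample their passages, and fill in missing IDs."""
    rng = random.Random(seed)
    # Assumed field names; adjust to the actual dataset schema.
    limits = {
        "positive_ctxs": num_positives,
        "hard_negative_ctxs": num_hard_negatives,
    }
    samples = []
    for idx, sample in enumerate(iter_jsonl(lines)):
        if max_samples is not None and idx >= max_samples:
            break
        for key, k in limits.items():
            ctxs = downsample(sample.get(key, []), k, rng)
            for i, ctx in enumerate(ctxs):
                # Hypothetical ID scheme: only set when no ID is present.
                ctx.setdefault("passage_id", f"{idx}-{key}-{i}")
            sample[key] = ctxs
        samples.append(sample)
    return samples
```

Since `load_samples` takes any iterable of lines, an open file handle can be passed directly, and the `max_samples` cut-off stops reading before the rest of the file is parsed.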

Support jsonl format, downsampling data, add passage id when not exist.

@Timoeller Timoeller left a comment


Looks very good already. Thanks for the changes.

  • Let's keep max samples.
  • I made a comment about a failing test.

Although the downsampling now exists twice in our codebase, I vote for keeping the downsampling in our TextSimilarityProcessor as well, since we might use it for other datasets. Or do you have any strong preferences?

@Timoeller Timoeller self-requested a review March 5, 2021 08:23

@Timoeller Timoeller left a comment


Very nice, thanks for the suggested changes.

@Timoeller Timoeller merged commit fc82058 into deepset-ai:master Mar 5, 2021