Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move ArchiveResult from detail index.json history to database model #525

Merged
merged 21 commits into from
Nov 28, 2020

Conversation

cdvv7788
Copy link
Contributor

@cdvv7788 cdvv7788 commented Nov 3, 2020

Summary

When this PR is ready, archivebox will be able to:

  • Save ArchiveResults to the database
  • Load ArchiveResults from the filesystem if nothing is found in the database (to ease migration)
  • Use ArchiveResults to answer questions about available extractors output per snapshot

Related issues

#513

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

@cdvv7788 cdvv7788 marked this pull request as draft November 3, 2020 14:57
@cdvv7788
Copy link
Contributor Author

cdvv7788 commented Nov 4, 2020

Current state: functional.
It creates ArchiveResult objects based on the index.json detail for every archive (when migrating)
It saves new ArchiveResult objects when an extractor finish.
The icons displayed in index are calculated based on the ArchiveResults for the Snapshot

Missing:

  • Remove ArchiveResult schema from the extractor process. Use the ArchiveResult model instead Cannot be done yet. I will leave this for a later PR.
  • Add WARC to the list of icons. This requires specific logic.
  • Reduce N+1 bottlenecks
  • Handle the case where there are several ArchiveResult objects for a Snapshot with the same extractor. - Unnecessary: The presence of a single succeeded ArchiveResult is enough for the icon to show up.
  • There are a couple extractors that require special treatment to display the items: warc and archive_org

@cdvv7788 cdvv7788 marked this pull request as ready for review November 12, 2020 16:38
@cdvv7788
Copy link
Contributor Author

@pirate Ready for review.

@cdvv7788 cdvv7788 mentioned this pull request Nov 14, 2020
6 tasks
@pirate
Copy link
Member

pirate commented Nov 23, 2020

This is great so far, sorry I took so long to review it.

I'm going to merge it into v0.5 then release because I think it's important for the performance gains and I want v0.5 to be the "performance release".

@pirate pirate changed the title ArchiveResult move to database (sqlite) Move ArchiveResult from detail index.json history to database model Nov 23, 2020
@cdvv7788
Copy link
Contributor Author

Thanks for the review! I will fix all of those details today (they are all small details).

@cdvv7788
Copy link
Contributor Author

@pirate implemented all of the fixes. After you approve it, I will rebase-merge it.

@cdvv7788
Copy link
Contributor Author

@pirate Added two changes you requested. I think this is ready for review.

  • Replace charfield with jsonfield
  • Re-add icons, even if the related extractors were not used

I also bumped the django version to 3.1

@pirate pirate merged commit be6dcfa into ArchiveBox:master Nov 28, 2020
@cdvv7788 cdvv7788 deleted the archive-result branch November 28, 2020 04:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants