Feature Request: Scheduling Archival from the UI #578

Open
4 of 9 tasks
BlipRanger opened this issue Dec 10, 2020 · 3 comments
Labels
size: medium status: wip Work is in-progress / has already been partially completed touches: configuration touches: data/schema/architecture touches: dependencies/packaging Issues or changes that add/remove/affect dependencies touches: docs why: functionality Intended to improve ArchiveBox functionality or features

@BlipRanger
Contributor

Type

  • General question or discussion
  • Propose a brand new feature
  • Request modification of existing behavior or design

What is the problem that your feature request solves

Currently, scheduling ingestion of new URLs requires writing a cron job outside the web UI (outside the Docker container, in my case), which isn't ideal in a Docker/self-contained setup. I believe this would be a nice convenience feature for users who want to manage the entire operation of ArchiveBox from within the web UI.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

This feature would add a way to set up scheduled pulls from various data sources via the web UI rather than only externally via cron. At a minimum, I imagine a way to specify an RSS feed to subscribe to, which ArchiveBox would then watch for new content (a Wallabag feed, in my particular use case). Technically, I think this would involve a new menu/button in the UI, and it should dovetail with the internal scheduling processes already available.
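To make the "watch an RSS feed for new content" idea concrete, here is a minimal sketch using only the Python standard library. It is not ArchiveBox code: the function names, the sample feed, and the in-memory `seen` set are all assumptions made for illustration, and a real implementation would fetch the feed over HTTP and persist what it has already imported.

```python
import xml.etree.ElementTree as ET
from typing import List, Set

# Hypothetical sample feed, standing in for what a subscribed source might return.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title><link>https://example.com/a</link></item>
    <item><title>Second</title><link>https://example.com/b</link></item>
  </channel>
</rss>"""


def extract_links(feed_xml: str) -> List[str]:
    """Return the <link> of every <item> in an RSS 2.0 document, in feed order."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


def new_links(feed_xml: str, seen: Set[str]) -> List[str]:
    """Return links not yet in `seen`, and record them so later polls skip them."""
    fresh = [url for url in extract_links(feed_xml) if url not in seen]
    seen.update(fresh)
    return fresh


if __name__ == "__main__":
    already_imported = {"https://example.com/a"}
    print(new_links(SAMPLE_FEED, already_imported))
```

Each scheduled poll would call `new_links` and hand only the returned URLs to the archiver, so re-polling the same feed does not queue duplicates.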

How badly do you want this new feature?

  • It's an urgent deal-breaker, I can't live without it
  • It's important to add it in the near-mid term future
  • It would be nice to have eventually

  • I'm willing to contribute dev time / money to fix this issue
  • I like ArchiveBox so far / would recommend it to a friend
  • I've had a lot of difficulty getting ArchiveBox set up
@BlipRanger BlipRanger added why: functionality Intended to improve ArchiveBox functionality or features status: idea-phase Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet labels Dec 10, 2020
@pirate
Member

pirate commented Dec 10, 2020

Yeah, this is definitely on our minds. It probably won't be added for a couple of versions, but it's something I've been planning.

It's blocked on adding a background queue system like Huey or dramatiq: #91

In the meantime, I recommend using docker-compose instead of docker alone, as it allows you to declaratively define your scheduled imports all in one place (see the commented-out section of docker-compose.yml for an example of how to do that).

@BlipRanger
Contributor Author

Gotcha, I saw the future queuing system and that makes sense! And yes, currently using compose, so I'll look into doing that. Thanks!

@pirate pirate added this to the v0.6.3 milestone Apr 16, 2021
@pirate
Member

pirate commented Apr 16, 2021

Here's my proposed implementation of a new model to track scheduled imports: https://github.com/ArchiveBox/ArchiveBox/pull/707/files

Remaining TODOs:

  • figure out which python scheduler to use
    • huey + django-huey-monitor (my current favorite)
    • celery (ugh...)
    • APScheduler (will require lots of manual models and concurrency control code)
    • yacron (not sure if it can be configured dynamically)
    • dramatiq (doesn't support sqlite)
  • decide whether to continue supporting system crontab at all, or tear it out (imo we should just tear it out and move to using an internal scheduler)
  • fork the scheduled task worker off the server process automatically on startup, so there is no need to run a separate archivebox schedule --foreground process manually
  • figure out how to enforce "at least once" or "at most once" concurrency model for scheduled tasks

Follow that PR for more updates as work progresses. #707

See this thread here for my WIP design that moves us towards a message-passing / async job worker structure internally: #91 (comment)


2 participants