Call for public comments: Considering deprecating the archivebox oneshot command as of the 0.7 release #1082

Closed
pirate opened this issue Jan 11, 2023 · 6 comments
Labels
size: hard status: wontfix We are not planning to make these changes at the moment (sorry) touches: data/schema/architecture why: functionality Intended to improve ArchiveBox functionality or features

@pirate
Member

pirate commented Jan 11, 2023

Long, long ago, before ArchiveBox was a Django app, it was a one-shot bash script called archive-pocket-stream.sh. When we moved to Django, archivebox oneshot was provided as an escape hatch for users who did not like suddenly being forced to create collections and manage data directories. It lets the new, fancy Django archivebox run in "oneshot" mode without creating a main index file, data dir, etc., writing only the results of a single snapshot into PWD.

As you might imagine, it required tremendous haxx to run the new Django archivebox without a db file in this way, including instantiating a fake sqlite3 db in memory, filesystem write filtering, etc. and it's imposing a large maintenance burden by making it hard to refactor other subsystems.
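For readers unfamiliar with the in-memory database trick mentioned above, here is a minimal sketch using Python's standard `sqlite3` module. It is an illustration only, not ArchiveBox's actual code: the `snapshot` table, its columns, and both helper functions are hypothetical.

```python
import sqlite3


def make_in_memory_index() -> sqlite3.Connection:
    """Create a throwaway SQLite database that lives only in RAM."""
    # ":memory:" tells SQLite not to create any file on disk
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, for illustration only
    conn.execute("CREATE TABLE snapshot (url TEXT PRIMARY KEY, title TEXT)")
    return conn


def record_snapshot(conn: sqlite3.Connection, url: str, title: str) -> None:
    """Store one row in the in-memory index."""
    conn.execute("INSERT INTO snapshot (url, title) VALUES (?, ?)", (url, title))
    conn.commit()


if __name__ == "__main__":
    conn = make_in_memory_index()
    record_snapshot(conn, "https://example.com/somepage.html", "Example")
    print(conn.execute("SELECT url, title FROM snapshot").fetchall())
    conn.close()  # the database is discarded once the connection closes
```

Nothing is written to disk, which is why this approach suits a mode that must not leave an index file behind.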

Now that we have been solidly on Django for several major versions, I think we can safely retire archivebox oneshot?

If anyone is using it, speak up now and make a case for keeping it 😅 🤠👋

@pirate pirate added size: hard touches: data/schema/architecture status: backlog Work is planned someday but is not the highest priority at the moment is: enhancement labels Jan 11, 2023
@pirate pirate changed the title Attention: I intend to deprecate the archivebox oneshot command soon Attention: We intend to remove the archivebox oneshot command from ArchiveBox in the next 0.7 release Jan 11, 2023
@pirate pirate added this to the v0.7.0 milestone Jan 11, 2023
@ianrobertsFF

This is my entire use case. I need to be able to take single-page snapshots on a daily basis, and I have no need for any of ArchiveBox's other functionality. I'll be running it CLI-only, using another tool to trigger the daily snapshots of the pages and piping them into specific directories.

Without the ability to create multiple snapshots of the same URL in your normal use cases, this is the only way I can achieve it.
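A rough sketch of that workflow, assuming only what this thread states: that `archivebox oneshot <url>` writes its results into the current working directory. The directory layout, the `label` argument, and the helper names are hypothetical; the external scheduler that triggers the daily run is not shown.

```python
import datetime
import subprocess
from pathlib import Path


def snapshot_dir(base: Path, label: str, day: datetime.date) -> Path:
    """Return a per-page, per-day output directory (hypothetical layout)."""
    return base / label / day.isoformat()


def oneshot_command(url: str) -> list[str]:
    """Build the CLI invocation; oneshot writes into the working directory."""
    return ["archivebox", "oneshot", url]


def take_snapshot(base, label, url, day=None, dry_run=False):
    """Snapshot `url` into its own dated directory and return (dir, command)."""
    day = day or datetime.date.today()
    out = snapshot_dir(Path(base), label, day)
    cmd = oneshot_command(url)
    if dry_run:
        # Only report what would happen; touch neither disk nor network
        return out, cmd
    out.mkdir(parents=True, exist_ok=True)
    # Run with cwd set to the dated directory so the output lands there
    subprocess.run(cmd, cwd=out, check=True)
    return out, cmd
```

Because each run gets its own dated directory, repeated snapshots of the same URL never collide.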

@pirate
Member Author

pirate commented Apr 7, 2023

Good to know! Would your needs be satisfied if we add better native support for multiple snapshots in archivebox instead of keeping this older feature? @ianrobertsFF

@jvican

jvican commented Apr 9, 2023

I'm also using this. I think it makes a lot of sense to keep a command like oneshot around because it's fairly self-contained, and it aligns well with the UNIX philosophy. It does one thing and it does it well, without the need for archive init and the use of the rest of the software. Please don't take it away.

@ianrobertsFF

My needs would be satisfied by multiple snapshots, although I still wouldn't be using any of the functionality that oneshot doesn't currently use, so it wouldn't be a better workflow, as oneshot does exactly what I need.

However, assuming I can continue to take on-demand snapshots with the native support for multiple snapshots, this would be acceptable to me.

@jwmh

jwmh commented Jul 31, 2023

Q:
Would it be possible to fork this off into its own separate project/repo?
Would it even be desirable?

I appreciate @jvican's comments on this, and agree.

@pirate
Member Author

pirate commented Dec 18, 2023

Ok I've decided to keep oneshot because it ties in nicely with the ongoing refactor to move ArchiveBox towards an event-driven job queue model. The old oneshot will be renamed and joined by a new command to run a single extractor method:

archivebox snapshot

Can be run to snapshot an individual URL into the current directory (runs all extractors by default).

archivebox snapshot --methods=all 'https://example.com/somepage.html'
# creates a subfolder for each extractor method, and an index.html and index.json file in $PWD

This works the same way as oneshot does now, and I'll alias oneshot to the new command so we don't break backwards compatibility.

archivebox extract

This runs an individual extractor method and outputs into the current directory.

archivebox extract --method=PDF --method-args-here 'https://example.com/somepage.html'
# writes output.pdf (and an index.json containing cmd+output for each run) into $PWD using the headless browser

After the refactor, archivebox add will work by internally enqueuing a job that runs archivebox snapshot ... for each imported URL.
The snapshot job then in turn enqueues a job for each extractor needed on that URL.
Each extractor job then runs archivebox extract --method=... internally to write the output into the final archive directory.
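The three-level flow above can be sketched as a toy FIFO queue. This is a simplified illustration of the described design, not the planned implementation: the job tuples, the function names, and the extractor names used here are all hypothetical, and real extract jobs would shell out to `archivebox extract --method=...` rather than just being logged.

```python
from collections import deque

# Hypothetical extractor names, for illustration only
EXTRACTORS = ["title", "pdf"]


def add(queue, urls):
    """Model of `archivebox add`: enqueue one snapshot job per imported URL."""
    for url in urls:
        queue.append(("snapshot", url, None))


def run_queue(queue, extractors=EXTRACTORS):
    """Drain the queue in FIFO order and return the jobs in execution order."""
    log = []
    while queue:
        kind, url, method = queue.popleft()
        log.append((kind, url, method))
        if kind == "snapshot":
            # A snapshot job fans out into one extract job per method
            for name in extractors:
                queue.append(("extract", url, name))
        # "extract" jobs are leaves: they enqueue nothing further
    return log


if __name__ == "__main__":
    jobs = deque()
    add(jobs, ["https://example.com/somepage.html"])
    for job in run_queue(jobs):
        print(job)
```

Keeping each level as a separate job means a single extractor can be retried or run on its own, which is what the standalone snapshot and extract commands expose.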

Please subscribe to this issue for updates: #1289

@pirate pirate closed this as not planned Dec 18, 2023
@pirate pirate changed the title Attention: We intend to remove the archivebox oneshot command from ArchiveBox in the next 0.7 release Call for public comments: Considering deprecating the archivebox oneshot command as of the 0.7 release Dec 18, 2023
@pirate pirate added why: functionality Intended to improve ArchiveBox functionality or features status: wontfix We are not planning to make these changes at the moment (sorry) status: backlog Work is planned someday but is not the highest priority at the moment and removed status: backlog Work is planned someday but is not the highest priority at the moment labels Dec 18, 2023

4 participants