Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add _EXTRA_ARGS for various extractors #1360

Merged
merged 6 commits into from
Mar 14, 2024
Merged

Conversation

benmuth
Copy link
Contributor

@benmuth benmuth commented Feb 23, 2024

Summary

This PR adds a way to configure wget, curl, singlefile, youtube-dl, and chrome without overriding the default options.

The main default options, extra options, and more specific options (like WGET_USER_AGENT) are all deduplicated. It's assumed that options set with more specificity should take precedence, so something like the --user-agent argument for wget will come from WGET_USER_AGENT instead of _ARGS or _EXTRA_ARGS, and options set in _EXTRA_ARGS take precedence over _ARGS.

This PR might need some more testing with more complex configurations. Hopefully it's simple enough that won't break anything while still being useful, but I'm not a wizard with curl or wget so there might be some possibilities I don't know about.

Related issues

#1025

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

@pirate
Copy link
Member

pirate commented Feb 23, 2024

Awesome! thanks for doing this, it's a great addition

should we add CHROME_EXTRA_ARGS and YOUTUBEDL_EXTRA_ARGS too?

@benmuth benmuth marked this pull request as ready for review February 24, 2024 00:44
@benmuth benmuth changed the title Add EXTRA_*_ARGS for wget, curl, and singlefile Add _EXTRA_ARGS for various extractors Feb 24, 2024
@pirate
Copy link
Member

pirate commented Feb 29, 2024

Looks great!

One minor change then I'm ready to merge:

@enforce_types
def dedupe(*options: List[str]) -> List[str]:
    """
    Deduplicates the given options. Options that come earlier in the list clobber
    later conflicting options.
    """
    ...

Can you flip the dedupe() precedence order to last arg wins?
(and update all the extractor CMDs to put *EXTRA_ARGS after *ARGS)

In my experience, most CLI tools that accept duplicate args take the last arg as the winning one if they see the same arg twice, so I want to follow that convention.

Site note: you can invoice me for all your PRs as you submit them, no need to wait till they're merged. 👍

@pirate pirate added this to the v0.8.0 milestone Feb 29, 2024
@pirate pirate added touches: configuration why: functionality Intended to improve ArchiveBox functionality or features status: wip Work is in-progress / has already been partially completed type: enhancement expected: next release labels Feb 29, 2024
@pirate
Copy link
Member

pirate commented Mar 1, 2024

@benmuth
Copy link
Contributor Author

benmuth commented Mar 1, 2024

@benmuth
Copy link
Contributor Author

benmuth commented Mar 6, 2024

@pirate I went ahead and added both the ARGS and EXTRA_ARGS options for the Mercury extractor to be more consistent with the rest of the changes in this PR.

@pirate pirate merged commit ca2c484 into ArchiveBox:dev Mar 14, 2024
1 of 2 checks passed
@benmuth benmuth deleted the extra-args branch March 15, 2024 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expected: next release status: wip Work is in-progress / has already been partially completed touches: configuration why: functionality Intended to improve ArchiveBox functionality or features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants