Add Wayback Machine URL archiver and replacer script #2504
arkid15r merged 52 commits into vacanza:dev
This commit introduces a new CLI script that:
- Recursively scans the `holidays` package for `.py` and `.po` files, ignoring `__pycache__`
- Extracts all HTTP(S) links, filtering out known domains (e.g., GitHub, Wikipedia, Python docs)
- Queries the Wayback Machine CDX API for existing captures
- Submits URLs to the Wayback Machine Save API when needed
- Supports three archive policies:
  • `if-missing` (default): archive only when no capture exists
  • `always`: always submit for archiving, even if captures exist
  • `never`: only look up existing captures, do not archive
- Replaces original URLs in source files with their Wayback snapshots
- Uses a retrying `requests` session with exponential backoff for robustness
- Prints progress summaries and warning messages for files that cannot be read or written

This script makes it easy to freeze external references within the codebase, ensuring all links remain valid over time.
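The extraction, filtering, and policy steps described above can be sketched roughly as follows. This is an illustration only, not the merged script: `URL_PATTERN`, `IGNORED_DOMAINS`, `extract_urls`, and `should_archive` are hypothetical names, and the real ignore list and regular expression may differ.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern and ignore list; the merged script may use different ones.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")
IGNORED_DOMAINS = ("docs.python.org", "github.com", "wikipedia.org")
ARCHIVE_POLICIES = ("always", "if-missing", "never")


def is_ignored(url: str) -> bool:
    """Return True if the URL's host is an ignored domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain) for domain in IGNORED_DOMAINS)


def extract_urls(text: str) -> list[str]:
    """Return unique, non-ignored HTTP(S) URLs in order of first appearance."""
    seen: dict[str, None] = {}
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that the pattern may have swallowed.
        url = match.group(0).rstrip(".,;:")
        if not is_ignored(url):
            seen.setdefault(url)
    return list(seen)


def should_archive(policy: str, has_capture: bool) -> bool:
    """Decide whether to submit a URL for archiving under the given policy."""
    if policy not in ARCHIVE_POLICIES:
        raise ValueError(f"Unknown archive policy: {policy!r}")
    if policy == "always":
        return True
    if policy == "never":
        return False
    # "if-missing": archive only when no capture exists yet.
    return not has_capture
```

Matching on the parsed hostname rather than on a substring of the URL avoids skipping unrelated links that merely mention an ignored domain in their path.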
|
There are still 41 files with ~177 URLs that could not be archived by the Internet Archive. Some of these URLs might no longer be accessible.
PPsyrius
left a comment
I've checked up to letter N for now
Co-authored-by: Panpakorn Siripanich <[email protected]> Signed-off-by: Kriti Birda <[email protected]>
Co-authored-by: Panpakorn Siripanich <[email protected]> Signed-off-by: Kriti Birda <[email protected]>
Signed-off-by: Kriti Birda <[email protected]>
Remove tiny.cc source link aliases, Thailand sources archive work
PPsyrius
left a comment
I did look into SonarQube's error suppression a bit; it seems they only have `# NOSONAR` as an in-line, global issue suppression marker, with no option to disable only a specific rule the way mypy or other tools allow...
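For comparison, a minimal sketch of the two suppression styles mentioned here. The function and variable names are made up for illustration; only the comment markers matter.

```python
def legacy_parse(value):  # NOSONAR (silences every Sonar issue reported on this line)
    return int(value)


count: int = legacy_parse("3")

# mypy allows per-error-code suppression: only the "assignment" error is silenced here.
label: str = count  # type: ignore[assignment]
```

The `# NOSONAR` marker applies to the whole line regardless of which rule fired, whereas the bracketed code after `# type: ignore` limits the suppression to one mypy error code.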
Sure, I'll review it this week 👍
Co-authored-by: Panpakorn Siripanich <[email protected]> Signed-off-by: Kriti Birda <[email protected]>
arkid15r
left a comment
@kritibirda26 great work -- both idea and implementation 👏
I don't want to be a blocker here just because of readability and best-practice suggestions. Moreover, refactoring w/o tests is a bit tricky. So I'll just provide some general feedback you might use later:
- if you need just a sequence prefer using tuples instead of lists
- use consistent naming (ignore vs ignored)
- use Path instead of os.path
- if dict (or other) params order doesn't matter -- order alphabetically
- use spellcheck locally (e.g. your IDE plugin)
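A small sketch of how several of these suggestions might look in practice. All names here (`FILE_EXTENSIONS`, `IGNORED_DIRECTORIES`, `CDX_PARAMS`, `find_source_files`) are hypothetical and are not taken from the merged script.

```python
from pathlib import Path

# Tuples rather than lists for fixed, read-only sequences.
FILE_EXTENSIONS = (".po", ".py")
# One spelling used throughout: "ignored", not a mix of "ignore" and "ignored".
IGNORED_DIRECTORIES = ("__pycache__",)

# Keys ordered alphabetically because their order carries no meaning.
CDX_PARAMS = {
    "limit": "-1",
    "output": "json",
    "url": "",
}


def find_source_files(root: Path) -> tuple[Path, ...]:
    """Return matching files under root, skipping ignored directories."""
    return tuple(
        sorted(
            path
            for path in root.rglob("*")
            if path.is_file()
            and path.suffix in FILE_EXTENSIONS
            # Check only the parts below root so the root's own location is irrelevant.
            and not any(part in IGNORED_DIRECTORIES for part in path.relative_to(root).parts)
        )
    )
```

Using `Path.rglob` and `Path.suffix` replaces the `os.walk` plus `os.path.splitext` combination with a single object-oriented API.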
Signed-off-by: Arkadii Yakovets <[email protected]>
|



Proposed change
Add Wayback Machine URL archiver and replacer script
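This commit introduces a new CLI script that:
- Recursively scans the `holidays` package for `.py` and `.po` files, ignoring `__pycache__`
- Extracts all HTTP(S) links, filtering out known domains (e.g., GitHub, Wikipedia, Python docs)
- Queries the Wayback Machine CDX API for existing captures
- Submits URLs to the Wayback Machine Save API when needed
- Supports three archive policies:
  • `if-missing` (default): archive only when no capture exists
  • `always`: always submit for archiving, even if captures exist
  • `never`: only look up existing captures, do not archive
- Replaces original URLs in source files with their Wayback snapshots
- Uses a retrying `requests` session with exponential backoff for robustness
- Prints progress summaries and warning messages for files that cannot be read or written

This script makes it easy to freeze external references within the codebase, ensuring all links remain valid over time.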
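As a rough illustration of the lookup and replacement steps, the sketch below builds a CDX query URL, turns a CDX JSON response into a snapshot URL, and rewrites text in a single pass. It performs no network requests. The function names and the chosen query parameters are assumptions for this example, not the script's actual implementation.

```python
import re
from typing import Optional
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
SNAPSHOT_PREFIX = "https://web.archive.org/web"


def build_cdx_query(url: str) -> str:
    """Build a CDX query asking for the most recent successful capture of url."""
    params = {
        "filter": "statuscode:200",
        "fl": "timestamp,original",
        "limit": "-1",  # a negative limit asks for the last N captures
        "output": "json",
        "url": url,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def latest_snapshot(rows: list[list[str]]) -> Optional[str]:
    """Convert CDX JSON rows (a header row, then data rows) into a snapshot URL."""
    if len(rows) < 2:
        return None  # empty response or header only: no capture exists
    record = dict(zip(rows[0], rows[-1]))
    return f"{SNAPSHOT_PREFIX}/{record['timestamp']}/{record['original']}"


def replace_urls(text: str, snapshots: dict[str, str]) -> str:
    """Replace each original URL with its snapshot in one pass over the text."""
    if not snapshots:
        return text
    # Longest URLs first so a URL is not shadowed by one of its own prefixes.
    ordered = sorted(snapshots, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(url) for url in ordered))
    return pattern.sub(lambda match: snapshots[match.group(0)], text)
```

A single-pass substitution matters because every snapshot URL contains the original URL; repeated `str.replace` calls could rewrite the same link twice.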
Fix #2467.
Type of change
`holidays` functionality in general)
Checklist
`make check`, all checks and tests are green