feat: upload and extract zip/tar archives into workspace #1226

Closed

bergeouss wants to merge 2 commits into nesquena:master from bergeouss:feat/525-upload-zip-folder
Conversation

@bergeouss
Contributor

Fixes #525

What

Allow uploading .zip, .tar.gz, .tgz, .tar.bz2, and .tar.xz archives; they are automatically extracted into a named subfolder of the workspace.

How

  • extract_archive() in api/upload.py: detects archive type, extracts into a named subfolder (archive stem). Uses Python stdlib only (zipfile, tarfile).
  • Zip-slip / tar-slip protection: every member path is resolved and verified to start with the destination directory prefix. No symlinks or .. traversal allowed.
  • Collision avoidance: if the target folder name exists, appends a 3-digit suffix.
  • New route POST /api/upload/extract: same multipart format as /api/upload, but calls extract_archive().
  • Frontend: uploadPendingFiles() auto-detects archive files via _ARCHIVE_EXTS regex and routes them to the extract endpoint. Workspace file tree refreshes after extraction.
  • Archive extensions added to the file picker accept attribute.
  • Toast notification shows extraction summary (file count, archive count).
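The traversal guard described above can be sketched as follows. This is a minimal illustration of the resolve-then-prefix-check idea, not the PR's actual code; the helper name is hypothetical.

```python
import os
from pathlib import Path

def is_safe_member(dest_dir: Path, member_name: str) -> bool:
    """Reject archive members that would escape dest_dir (zip-slip / tar-slip).

    Illustrative sketch: resolve the candidate path, then require that it is
    the destination itself or lives strictly underneath it.
    """
    base = dest_dir.resolve()
    target = (base / member_name).resolve()
    # Appending os.sep avoids a sibling-directory false positive
    # (e.g. /ws/dest-evil matching the prefix /ws/dest).
    return target == base or str(target).startswith(str(base) + os.sep)
```

A member named `../escape.txt` or an absolute path like `/etc/passwd` resolves outside the destination and is rejected before extraction.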

Changes

  • api/upload.py: extract_archive(), handle_upload_extract()
  • api/routes.py: new import + /api/upload/extract route
  • static/ui.js: _ARCHIVE_EXTS regex, archive-aware upload logic, toast
  • static/index.html: archive extensions in file input accept
  • static/i18n.js: archive_extracted key in all 7 locales
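The frontend's `_ARCHIVE_EXTS` check is a JavaScript regex; a Python analogue of the same extension test looks like this. The pattern below is an assumption built from the formats named in the PR description, not the actual regex in `static/ui.js`.

```python
import re

# Hypothetical mirror of the frontend's _ARCHIVE_EXTS extension check.
# Matches the archive formats listed in the PR description.
_ARCHIVE_EXTS = re.compile(r"\.(zip|tgz|tar\.(gz|bz2|xz))$", re.IGNORECASE)

def is_archive(filename: str) -> bool:
    """Return True if the filename ends in a supported archive extension."""
    return bool(_ARCHIVE_EXTS.search(filename))
```

Files matching the pattern are routed to `/api/upload/extract`; everything else goes through the plain `/api/upload` path.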

Testing

2828 passed, 0 failed (full suite).

- Add extract_archive() with zip-slip and tar-slip protection
- New /api/upload/extract endpoint for archive uploads
- Auto-detect archive files (.zip, .tar.gz, .tgz, .bz2, .xz)
- Archives extracted into named subfolder (avoids overwrites)
- Workspace file tree auto-refreshes after extraction
- Archive extensions added to file picker accept list
- i18n: archive_extracted key in all 7 locales

Security: path traversal blocked via resolve() prefix check,
matching existing safe_resolve_ws() sandbox pattern.
@nesquena-hermes
Collaborator

Thanks for the PR, @bergeouss! Archive extraction in the workspace is a useful workflow improvement — being able to upload a zip and have it auto-extracted saves a manual unzip step.

The security considerations are appreciated: zip-slip/tar-slip protection and no symlink traversal are the right safeguards for server-side archive extraction. A few things to verify before merge:

  1. Path resolution guard: Confirm that the zip-slip protection resolves member paths against the real destination directory (after os.path.realpath()) before comparing the prefix. Symlinks inside the archive that point outside the workspace could bypass a pure string-prefix check, so verifying that realpath() or an equivalent is used is important.

  2. Large archive handling: Is there a size limit on archives? A very large zip (or a zip bomb) could exhaust disk space or memory. A configurable or hardcoded max extraction size would be a good safety rail.

  3. Error handling / partial extraction: If extraction fails partway through (disk full, corrupted archive member), is the partial subfolder cleaned up, or does it remain? The collision-avoidance logic creates a new subfolder — confirming failed extractions don't leave orphaned directories would be good.

  4. File type detection: The _ARCHIVE_EXTS regex on the frontend determines routing to the extract endpoint. Is there also server-side validation of the archive content type (not just the extension)? A file named malicious.zip that isn't actually a zip should fail gracefully.

2828 passed with 0 failures is a clean signal. The 5-file scope is well-contained. The above are the main items to validate — especially the path resolution and size limits. Looking good!

- Add cumulative extraction size limit (_MAX_EXTRACTED_BYTES = 200 MB)
  that tracks uncompressed file sizes during extraction to guard against
  zip/tar bombs (small compressed archives that expand to huge sizes).
- On any extraction failure (disk full, corrupted member, size limit),
  clean up the partially-extracted destination directory to avoid
  leaving orphaned folders in the workspace.
@bergeouss
Contributor Author

Review Feedback Addressed

  • Issue (2) — Large archive / zip-bomb protection: Added a cumulative extraction size limit (_MAX_EXTRACTED_BYTES = 200 MB — 10x the upload limit). During extraction, each member's uncompressed size is tracked; if the total exceeds the limit, a clear error is raised. This prevents a small compressed archive from consuming excessive disk space.
  • Issue (3) — Partial extraction cleanup: Wrapped the extraction loop in a try/except. On any failure (disk full, corrupted member, size limit exceeded), the partially-extracted destination directory is cleaned up via shutil.rmtree() to avoid leaving orphaned folders in the workspace.
  • Fix: Both protections are in extract_archive() in api/upload.py. The cleanup runs before re-raising so the caller still gets the original error.
  • Files: api/upload.py
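The two protections for issues (2) and (3) can be sketched together like this. This is an illustrative zip-only reduction under the 200 MB limit named above; the helper name is hypothetical, and the PR's actual logic lives in extract_archive().

```python
import shutil
import zipfile
from pathlib import Path

_MAX_EXTRACTED_BYTES = 200 * 1024 * 1024  # 200 MB cap, per the PR

def extract_zip_capped(archive_path: Path, dest_dir: Path) -> None:
    """Extract a zip with a cumulative uncompressed-size cap.

    On any failure (size limit, corrupt member, disk full), the partially
    extracted destination directory is removed before the error propagates.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    total = 0
    try:
        with zipfile.ZipFile(archive_path) as zf:
            for info in zf.infolist():
                total += info.file_size  # uncompressed size from the header
                if total > _MAX_EXTRACTED_BYTES:
                    raise ValueError("archive exceeds extraction size limit")
                zf.extract(info, dest_dir)
    except Exception:
        # Clean up the partial subfolder, then re-raise so the caller
        # still sees the original error.
        shutil.rmtree(dest_dir, ignore_errors=True)
        raise
```

Tracking `info.file_size` mid-loop means a zip bomb is rejected as soon as the running total crosses the cap, without inspecting the whole archive upfront.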

Regarding the other points:

  • (1) Path resolution guard: Confirmed — (dest_dir / member.filename).resolve() uses Python's Path.resolve(), which follows symlinks; startswith(str(dest_dir.resolve())) then validates that the final resolved path stays within the workspace. A symlink inside the archive pointing to /etc/passwd would be resolved by .resolve() and caught by the prefix check.
  • (4) File type detection: Server-side validation happens implicitly — zipfile.ZipFile() and tarfile.open() will raise on invalid content. The error is caught by the generic handler and returned as "Archive extraction failed" (400). This is sufficient since the frontend already filters by extension.
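The implicit server-side validation described in (4) can be sketched as a small opener that maps stdlib parse errors to one graceful failure. The function name and the suffix dispatch are illustrative assumptions, not the PR's code.

```python
import tarfile
import zipfile
from pathlib import Path

def validate_and_open(path: Path):
    """Open an archive, failing gracefully on mislabelled content.

    A file named malicious.zip that is not actually a zip makes
    zipfile.ZipFile() raise BadZipFile; likewise tarfile.open() raises
    ReadError on non-tar content. Both map to one user-facing error.
    """
    try:
        if path.suffix.lower() == ".zip":
            return zipfile.ZipFile(path)
        return tarfile.open(path)
    except (zipfile.BadZipFile, tarfile.ReadError) as exc:
        # Upstream, this surfaces as the 400 "Archive extraction failed".
        raise ValueError("Archive extraction failed") from exc
```

No magic-byte sniffing is needed: the stdlib parsers reject invalid content on open, which is the behaviour the comment above relies on.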

🤖 AI-assisted via Hermes Agent

@nesquena-hermes
Collaborator

Thanks for following up, @bergeouss! Both of the substantive items are now addressed:

  • (2) Large archive / zip-bomb protection: The 200 MB cumulative extraction limit with mid-loop tracking is the right approach — prevents compressed-small/extracted-large attacks without needing to inspect the archive upfront.
  • (3) Partial extraction cleanup: shutil.rmtree() on failure before re-raising is correct cleanup semantics — the caller still gets the error, and the workspace doesn't accumulate orphaned directories.

The confirmations on path resolution and server-side type validation are also correct:

  • Path.resolve() + startswith(str(dest_dir.resolve())) correctly handles symlinks inside archives pointing outside the workspace.
  • zipfile.ZipFile() / tarfile.open() raising on invalid content is sufficient — the error surfaces cleanly to the user.

All four points are addressed. This looks ready for maintainer merge review.

@nesquena-hermes
Collaborator

Merged in v0.50.237 via #1243. Thank you @bergeouss! 🎉



Development

Successfully merging this pull request may close these issues.

feat(workspace): upload a zip/archive or folder and extract into workspace

2 participants