Autoscroll before archiving and take full-height screenshots #80

Open
pirate opened this issue Jun 19, 2018 · 7 comments
Labels
size: medium status: wip Work is in-progress / has already been partially completed why: functionality Intended to improve ArchiveBox functionality or features

Comments

@pirate
Member

pirate commented Jun 19, 2018

I've submitted a Chromium bug tracker feature request for adding a --full-page flag: https://bugs.chromium.org/p/chromium/issues/detail?id=854013

Hopefully it gets merged, which would let us screenshot the full height of pages instead of limiting them to the dimensions set by the DIMENSIONS config setting.

@pirate pirate added status: needs followup Work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers easy size: medium labels Jun 19, 2018
@pirate
Member Author

pirate commented Mar 15, 2019

This will be easy with user scripts the moment pyppeteer is merged in #177. Or if we switch to playwright it's also easy using playwright's --full-page flag. #51

@pirate pirate changed the title from "Support for full page screenshots instead of fixed dimensions" to "Autoscroll before archiving and take full-height screenshots" Mar 15, 2019
@pirate pirate added why: functionality Intended to improve ArchiveBox functionality or features status: wip Work is in-progress / has already been partially completed and removed status: needs followup Work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers labels Mar 15, 2019
@mtvu

mtvu commented Jun 10, 2021

The code provided in this playwright issue solves the full-page screenshot problem for me:
microsoft/playwright#620

Here is the code I use to take a full-page screenshot with playwright:

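const { chromium } = require('playwright');

(async () => {
  // Launch an installed Chrome build via the `channel` option.
  const browser = await chromium.launch({
    channel: 'chrome' // or 'msedge', 'chrome-beta', 'msedge-beta', 'msedge-dev', etc.
  });
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://apple.com/');
  // Scroll to the bottom first so lazy-loaded content has a chance to render.
  await scrollFullPage(page);

  // Capture the whole scrollable page, not just the visible viewport.
  await page.screenshot({
    path: 'apple.png',
    fullPage: true
  });

  await browser.close();
})();

// Scroll down in 100px steps, one step every 100ms, until the total distance
// scrolled reaches the page's current scroll height.
async function scrollFullPage(page) {
  await page.evaluate(async () => {
    await new Promise(resolve => {
      let totalHeight = 0;
      const distance = 100;
      const timer = setInterval(() => {
        // Re-read the height on every tick, since it can grow as content loads.
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;

        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}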

@timdonovanuk

Is this feature natively available now or only via hacking in user scripts?

@pirate
Member Author

pirate commented Jun 11, 2021

Not available natively yet; it's blocked on #51.

@timdonovanuk

Ah fair enough, thanks! Seems like #51 encapsulates a whole ton of effort to make this happen, so thanks and good luck!

@DeoLeung

DeoLeung commented Mar 7, 2025

It would be great to have the ability to take full-height screenshots! Any update on this after 4 years?

@pirate
Member Author

pirate commented Mar 10, 2025

My conclusion after a lot of work on this issue is that full-page screenshots are OK up to a maximum height of ~8000px, but many pages are longer than that, and most common image formats don't support images that big. Even the formats that do (PNG) cause most image viewers to crash when you try to open them. You need to mess with Chrome's GPU memory settings to get it to capture more than 16,000px in one image, let alone the 90,000px+ that some long comment-thread pages reach.
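
For illustration only, here is a minimal sketch of how the ~8000px guideline above could gate a single full-page capture. It assumes Playwright (as in the earlier comment in this thread). The names `MAX_FULL_PAGE_HEIGHT`, `fitsInSingleScreenshot`, and `screenshotIfShortEnough` are hypothetical and are not part of ArchiveBox or Playwright.

```javascript
// Height limit taken from the comment above; treat it as a rough guideline,
// not a hard limit of any browser or image format.
const MAX_FULL_PAGE_HEIGHT = 8000;

// Hypothetical helper: true when one full-page image is likely to be usable.
function fitsInSingleScreenshot(pageHeight, maxHeight = MAX_FULL_PAGE_HEIGHT) {
  return pageHeight > 0 && pageHeight <= maxHeight;
}

// Assumes `page` is a Playwright Page that has already loaded the URL.
// Resolves to true if a full-page screenshot was written, or false if the
// page is too tall and the caller should use some other strategy.
async function screenshotIfShortEnough(page, path) {
  const pageHeight = await page.evaluate(
    () => document.documentElement.scrollHeight
  );
  if (!fitsInSingleScreenshot(pageHeight)) return false;
  await page.screenshot({ path, fullPage: true });
  return true;
}
```

The height check is kept as a separate pure function so the threshold can be adjusted or tested without launching a browser.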

Multiple screenshots are the better solution. My solution so far is one 4:3 screenshot at the top of the page, followed by numbered 16:10 screenshots for ~15 full-height scrolls down the page. This also works well for feeding the screenshots to vision and OCR models for analysis.
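
A rough sketch of the layout described above, for illustration only. It assumes Playwright (as in the earlier comment in this thread), not the puppeteer-based implementation mentioned in this comment, which is not public. The function names, the 1440px width, the file names, and the choice to start the numbered shots from the top of the page are all assumptions.

```javascript
// Hypothetical helper: plan one 4:3 shot at the top of the page, then
// numbered 16:10 shots, one per viewport-height scroll down the page.
function planScreenshots(pageHeight, width = 1440, maxScrolls = 15) {
  const topHeight = Math.round((width * 3) / 4);       // 4:3 viewport
  const scrollHeight = Math.round((width * 10) / 16);  // 16:10 viewport
  const shots = [{ name: 'top.png', y: 0, width, height: topHeight }];

  // Number of 16:10 viewports needed to cover the page, capped at maxScrolls.
  const count = Math.min(maxScrolls, Math.ceil(pageHeight / scrollHeight));
  for (let i = 0; i < count; i++) {
    shots.push({
      name: `scroll-${String(i + 1).padStart(2, '0')}.png`,
      y: i * scrollHeight,
      width,
      height: scrollHeight,
    });
  }
  return shots;
}

// Assumes `page` is a Playwright Page that has already loaded the URL.
async function captureScreenshots(page, outDir = '.') {
  const pageHeight = await page.evaluate(
    () => document.documentElement.scrollHeight
  );
  for (const shot of planScreenshots(pageHeight)) {
    await page.setViewportSize({ width: shot.width, height: shot.height });
    // The browser clamps the scroll position at the bottom of the page,
    // so the last shot may overlap the one before it.
    await page.evaluate(y => window.scrollTo(0, y), shot.y);
    await page.screenshot({ path: `${outDir}/${shot.name}` });
  }
}
```

Each image stays at a fixed, modest size regardless of page length, and the cap on scrolls bounds the total output for very long pages.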

I built this more advanced puppeteer-based screenshot approach for a paying client last year, and it's still in active development. It's all in TS, while ArchiveBox is all Python, so it takes time to bridge that gap, refactor, open source it, document it, package it, ship it, etc. for the public.
