Bug: Adding two different links with the same timestamp (e.g. from JSON) errors out and stops the entire import #1188

Open
melyux opened this issue Jul 22, 2023 · 2 comments
Labels
expected: next release good first ticket help wanted size: easy status: backlog Work is planned someday but is not the highest priority at the moment

melyux commented Jul 22, 2023

Describe the bug

If you have two links with the same timestamp, ArchiveBox throws this error:

AssertionError: Cannot merge two links with different URLs ...

and stops the import. If the links don't match, it should just make separate records, not stop the entire import process. I see that the directories for these two snapshots are already created in a way that resolves this: the timestamp for the "duplicate" snapshot's directory is incremented by 1, so we get 1611619200.0 and 1611619201.0. That part is good.
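
To illustrate the "don't stop the entire import" part, here is a minimal sketch of per-link error isolation. It is not ArchiveBox's actual code: `archive_all` and the `archive_one` callback are hypothetical names, and the only assumption is that archiving one link can raise an exception such as the `AssertionError` above.

```python
from typing import Callable, Iterable


def archive_all(
    urls: Iterable[str],
    archive_one: Callable[[str], None],  # hypothetical per-link archiver
) -> list[tuple[str, str]]:
    """Archive every URL, collecting failures instead of aborting on the first one."""
    failures: list[tuple[str, str]] = []
    for url in urls:
        try:
            archive_one(url)
        except Exception as err:  # deliberately broad: one bad link must not stop the rest
            failures.append((url, f"{type(err).__name__}: {err}"))
    return failures
```

With something like this, a failing link would be reported at the end of the run while every link after it still gets processed.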

Steps to reproduce

  1. Import two links with the same timestamp using `--parser json`.
  2. Notice that the pull/import process stops after throwing the error `AssertionError: Cannot merge two links with different URLs`.
  3. Notice that any links after this are not pulled.

Screenshots or log output

Source:

[
  ...
  {
    "url": "https://google.com",
    "title": "Google",
    "created": "2021-01-26T00:00:00+0000"
  },
  {
    "url": "https://yahoo.com",
    "title": "Yahoo",
    "created": "2021-01-26T00:00:00+0000"
  },
  ...
]
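
As a possible workaround until this is fixed, the colliding entries can be found before importing. This is a rough sketch that assumes the export is a JSON list of objects with `url` and `created` fields, as in the sample above; `find_timestamp_collisions` is a made-up helper name, not part of ArchiveBox.

```python
import json
from collections import defaultdict


def find_timestamp_collisions(entries: list[dict]) -> dict[str, list[str]]:
    """Return {created: [urls]} for timestamps shared by more than one distinct URL."""
    by_created: dict[str, list[str]] = defaultdict(list)
    for entry in entries:
        by_created[entry["created"]].append(entry["url"])
    return {
        created: urls
        for created, urls in by_created.items()
        if len(set(urls)) > 1  # same URL repeated is a true duplicate, not a collision
    }


# Only the two entries shown in the sample; the elided ones are not reproduced.
sample = json.loads("""
[
  {"url": "https://google.com", "title": "Google", "created": "2021-01-26T00:00:00+0000"},
  {"url": "https://yahoo.com", "title": "Yahoo", "created": "2021-01-26T00:00:00+0000"}
]
""")

for created, urls in find_timestamp_collisions(sample).items():
    print(created, urls)
```

The reported timestamps could then be adjusted by hand in the export before running the import.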

Log:

[*] [2023-07-22 03:42:04] Archiving 269/1742 URLs from added set...

[▶] [2023-07-22 03:42:04] Starting archiving of 269 snapshots in index...
    ! Failed to archive link: AssertionError: Cannot merge two links with different URLs (google.com != yahoo.com)

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/cli/archivebox_add.py", line 109, in main
    add(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/main.py", line 660, in add
    archive_links(new_links, overwrite=False, **archive_kwargs)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/extractors/__init__.py", line 200, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/extractors/__init__.py", line 96, in archive_link
    link = load_link_details(link, out_dir=out_dir)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/index/__init__.py", line 350, in load_link_details
    return merge_links(existing_link, link)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/index/__init__.py", line 63, in merge_links
    assert a.base_url == b.base_url, f'Cannot merge two links with different URLs ({a.base_url} != {b.base_url})'
           ^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Cannot merge two links with different URLs (google.com != yahoo.com)

ArchiveBox version

0.6.3
ArchiveBox v0.6.3 40ddd33 Cpython Linux Linux-6.1.0-10-amd64-x86_64-with-glibc2.31 x86_64
DEBUG=False IN_DOCKER=True IS_TTY=True TZ=UTC FS_ATOMIC=True FS_REMOTE=False FS_PERMS=644 1000:1000 SEARCH_BACKEND=ripgrep

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.4         valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v18.16.1        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.0.44         valid     /usr/lib/node_modules/single-file-cli/single-file                           
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2023.07.06     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v114.0.5735.198  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              


[i] Data locations:
@neel-suthar

@pirate Is this because we use the timestamp value to create the output directory?

pirate (Member) commented Jan 20, 2024

Yes, timestamp is currently the unique key for snapshots. Because it has millisecond-level resolution, we can always bump it by a few ms if there are conflicts (and even add more decimals). Resolving conflicts here and deduping correctly has historically been a big source of complexity in the ArchiveBox internals.
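
The bumping idea could look roughly like the sketch below. This is an illustration only, not how ArchiveBox implements it: `unique_timestamp`, the `taken` set, and the 1 ms default step are all assumptions. `Decimal` is used so that repeated small increments don't pick up float rounding error.

```python
from decimal import Decimal


def unique_timestamp(
    timestamp: str,
    taken: set[Decimal],
    step: Decimal = Decimal("0.001"),  # assumed 1 ms bump
) -> Decimal:
    """Return the first free timestamp at or after `timestamp`, and reserve it."""
    candidate = Decimal(timestamp)
    while candidate in taken:
        candidate += step  # bump until no existing snapshot uses this key
    taken.add(candidate)
    return candidate


taken: set[Decimal] = set()
first = unique_timestamp("1611619200.0", taken)   # stays at 1611619200.0
second = unique_timestamp("1611619200.0", taken)  # bumped to 1611619200.001
```

Passing `step=Decimal("1")` instead would give the whole-second increments seen in the directory names in the original report.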

This will change in the future when we add official support for taking multiple snapshots of the same URL over time (#179) and when we switch to using UUIDs for unique keys (#74).
