mirror create --all can mirror everything #12940

Merged
tgamblin merged 67 commits into spack:develop from scheibelp:features/mirror-everything
Oct 26, 2019

Conversation

@scheibelp
Member

@scheibelp scheibelp commented Sep 25, 2019

Add support for mirroring all packages with `spack mirror create --all`. In this mode there is no concretization: Spack pulls every version of every package into a mirror. It makes multiple attempts for each package/version combination (in case of a temporary fetch failure for a given version) and moves on to the next one if all attempts fail (in case of a permanent failure for that version).
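The retry-then-continue behavior described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `mirror_all`, the `fetch` callable, and the default of three attempts are all hypothetical.

```python
def mirror_all(specs, fetch, attempts=3):
    """Try to fetch every spec; return (mirrored, failed) lists.

    `fetch` is a hypothetical callable that raises on failure.
    """
    mirrored, failed = [], []
    for spec in specs:
        for _ in range(attempts):
            try:
                fetch(spec)
            except Exception:
                continue  # possibly a transient failure: try again
            mirrored.append(spec)
            break
        else:
            # Every attempt failed: record it and move on to the next spec.
            failed.append(spec)
    return mirrored, failed
```

A failure for one package/version ends up in `failed` without aborting the rest of the run, which is the best-effort behavior the description asks for.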

This includes an update to the mirroring logic to prefer storing each source under a unique name that is derived entirely from the source itself (i.e. the name of the cached source does not include the package name). For example:

  • For archives with checksums, the name is the sha256 sum (where before it might be <package-name>-<package-version>.tar.gz)
  • For SCM repositories, it is a concatenation of the hash of the full repository URL and the branch/tag/commit

This allows different packages to refer to the same resource or source without duplicating that download in the mirror/cache. This change is not essential to mirroring everything but is expected to save space when mirroring many versions of packages which all use the same resource.
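As a rough sketch of the naming scheme, assuming the URL hash is a sha256 hex digest (the description does not say which hash is used) and using hypothetical helper names that are not part of Spack:

```python
import hashlib


def archive_cache_path(sha256, extension="tar.gz"):
    """Name a checksummed archive purely by its sha256 sum."""
    return "_source-cache/archive/{0}.{1}".format(sha256, extension)


def scm_cache_path(kind, url, ref, extension="tar.gz"):
    """Name an SCM checkout by a hash of its URL plus the branch/tag/commit.

    Assumption: sha256 stands in for whatever hash the real code applies
    to the repository URL.
    """
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return "_source-cache/{0}/{1}-{2}.{3}".format(kind, url_hash, ref, extension)
```

Because neither function takes a package name, two packages that reference the same archive or the same repository ref map to the same path and share a single download.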

The new structure of the mirror is:

<base directory>/
  _source-cache/   <-- the _source-cache directory is new
    archive/
      sldkfjasldkfjs9djsldkfjs0sjdkflsjf0.tar.gz        <-- not human-readable
    git/
      slkdjf98dej2j0s-develop.tar.gz     <-- the mirror can store the latest commit of a repository branch
    svn/      <-- each fetch strategy has its own subdirectory
      ...
  openmpi/   <-- the remaining package directories have the old format
    openmpi-1.10.1.tar.gz   <-- these human-readable names are symlinks into _source-cache

When creating a mirror with archive names as described above, the mirror creation logic now also creates symlinks with the old format in order to help users understand which package each mirrored archive is associated with; the symlinks are relative so the mirror directory can still itself be archived.
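A minimal sketch of how such a relative symlink might be created on a POSIX system. The function name and arguments are hypothetical, and this is not the implementation in this PR:

```python
import os


def link_readable_name(mirror_root, cache_rel_path, readable_rel_path):
    """Create a relative symlink from the old-style name to the cached file.

    Both paths are given relative to the mirror root.
    """
    target = os.path.join(mirror_root, cache_rel_path)
    link = os.path.join(mirror_root, readable_rel_path)
    os.makedirs(os.path.dirname(link), exist_ok=True)

    # Express the target relative to the directory that holds the link,
    # so the link stays valid if the whole mirror is moved or archived.
    rel_target = os.path.relpath(target, start=os.path.dirname(link))

    if os.path.lexists(link):
        os.remove(link)
    os.symlink(rel_target, link)
    return rel_target
```

For a link at `openmpi/openmpi-1.10.1.tar.gz` the stored target starts with `../_source-cache/`, so it never mentions the absolute location of the mirror.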

Other changes include:

  • spack mirror create will not re-download resources that have already been placed in it
  • When creating a mirror, the resources downloaded to the mirror will not be cached
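Both points can be illustrated with a small sketch in which the download is written straight to its destination in the mirror and skipped when that destination already exists. `fetch_into_mirror` and the `download` callable are hypothetical names, not Spack functions:

```python
import os


def fetch_into_mirror(dest_path, download):
    """Download a resource directly into the mirror unless it is already there.

    `download` is a hypothetical callable that writes the resource to the
    path it is given. Returns True if a download happened.
    """
    if os.path.exists(dest_path):
        return False  # already mirrored: nothing to do

    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    # Write to the mirror itself rather than to a separate cache,
    # so the resource is not stored twice.
    download(dest_path)
    return True
```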

TODOs:

  • (possibly in a later PR) The logic for caching patches is separate from package source/resource caching: applying the caching updates here to patches will take additional work (which is considered worthwhile but less of an impact since patches are generally small).
  • Make the mirror optionally output traditional source code paths by default (since it can be difficult to map a machine-generated filename to the associated package, and users may want to actually peruse these files to understand them).
  • (possibly in a later PR) I presume that most users don't want every version of every package by default, but this is difficult to determine up front without concretizing packages (which would be expensive when we are talking about all packages). If a decent approximate solution can be found, I'd like to add it as an option (while still allowing all versions to be downloaded). Update 10/15: there is now an --all-versions option which you can use to collect all versions/dependencies of a set of root packages, which offers an alternative to caching everything.
  • (new: 10/14) Print stats while doing the mirroring (right now stats are only printed at the end which is a long time to wait when mirroring everything)
  • (new 10/14) globally-cached resources should not all be stored in a single directory (e.g. one directory for all git resources): in order to keep the directory size small, they should be stored in an intermediate directory that is randomly selected based on the file name
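One way to picture the last item is to shard files by a short prefix of their name. For hash-named archives the prefix is effectively uniformly distributed, and the merged commit message quoted later on this page shows a 2-letter sha256 prefix in the final layout. The helper below is a hypothetical sketch, not the code in this PR:

```python
import posixpath


def sharded_path(filename, prefix_len=2):
    """Place a file under an intermediate directory named after the
    first `prefix_len` characters of its name."""
    return posixpath.join(filename[:prefix_len], filename)
```

With sha256-based names and a 2-character prefix, the files spread over at most 256 intermediate directories instead of one large one.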

… to store, it just downloads each spec that was provided to it
…nk of code can move out of the disable_compiler_existence_check context_manager
… temporary failure, so this adds a small number of retries for each spec that is mirrored
… mirror vs. using a per-package directory which will allow multiple packages to refer to the same resource and have it be reused
…s') and also a case where I was using list.append instead of list.extend
@tgamblin tgamblin self-requested a review September 27, 2019 17:30
@tgamblin tgamblin self-assigned this Sep 27, 2019
@tgamblin tgamblin changed the title mirror everything add mirror --all to fetch all downloadable resources Oct 26, 2019
@tgamblin tgamblin changed the title add mirror --all to fetch all downloadable resources add mirror create --all to mirror all packages Oct 26, 2019
@tgamblin tgamblin changed the title add mirror create --all to mirror all packages commands: mirror create --all can mirror everything Oct 26, 2019
@tgamblin tgamblin changed the title commands: mirror create --all can mirror everything mirror create --all can mirror everything Oct 26, 2019
@tgamblin tgamblin merged commit 4af4487 into spack:develop Oct 26, 2019
jrmadsen pushed a commit to jrmadsen/spack that referenced this pull request Oct 30, 2019
Support mirroring all packages with `spack mirror create --all`.

In this mode there is no concretization:

* Spack pulls every version of every package into the created mirror.
* It also makes multiple attempts for each package/version combination
  (if there is a temporary connection failure).
* It continues if all attempts fail, i.e., it makes a best effort to
  fetch everything, even if all attempts to fetch one package fail.

This also changes mirroring logic to prefer storing sources by their hash
or by a unique name derived from the source.  For example:

* Archives with checksums are named by the sha256 sum, e.g.,
  `archive/f6/f6cf3bd233f9ea6147b21c7c02cac24e5363570ce4fd6be11dab9f499ed6a7d8.tar.gz`
  vs the previous `<package-name>-<package-version>.tar.gz`
* VCS repositories are stored by a path derived from their URL,
  e.g. `git/google/leveldb.git/master.tar.gz`.

The new mirror layout allows different packages to refer to the same
resource or source without duplicating that download in the
mirror/cache. This change is not essential to mirroring everything but is
expected to save space when mirroring packages that all use the same
resource.

The new structure of the mirror is:

```
<base directory>/
  _source-cache/   <-- the _source-cache directory is new
    archive/       <-- archives/resources/patches stored by hash
      00/          <-- 2-letter sha256 prefix
        002748bdd0319d5ab82606cf92dc210fc1c05d0607a2e1d5538f60512b029056.tar.gz
      01/
        0154c25c45b5506b6d618ca8e18d0ef093dac47946ac0df464fb21e77b504118.tar.gz
        0173a74a515211997a3117a47e7b9ea43594a04b865b69da5a71c0886fa829ea.tar.gz
        ...
    git/
      OpenFAST/
        openfast.git/
          master.tar.gz     <-- repo by branch name
      PHASTA/
        phasta.git/
          11f431f2d1a53a529dab4b0f079ab8aab7ca1109.tar.gz  <-- repo by commit
      ...
    svn/      <-- each fetch strategy has its own subdirectory
      ...
  openmpi/   <-- the remaining package directories have the old format
    openmpi-1.10.1.tar.gz  <-- human-readable name is symlink to _source-cache
```

In addition to the archive names as described above, `mirror create` now
also creates symlinks with the old format to help users understand which
package each mirrored archive is associated with, and to allow mirrors to
work with old spack versions. The symlinks are relative so the mirror
directory can still itself be archived.

Other improvements:

* `spack mirror create` will not re-download resources that have already
  been placed in it.

* When creating a mirror, the resources downloaded to the mirror will not
  be cached (things are not stored twice).