You can persist your project outputs to a remote storage using the
push command. This can help you export your
pipeline packages, share work with your team, or cache results to avoid
repeating work. The pull command will download
any outputs that are in the remote storage and aren't available locally.
You can list one or more remotes in the remotes section of your
project.yml by mapping a string name
to the URL of the storage. Under the hood, Weasel uses
cloudpathlib to communicate with the
remote storages, so you can use any protocol that CloudPath supports,
including S3,
Google Cloud Storage, and the local
filesystem, although you may need to install extra dependencies to use certain
protocols.
💡 Example using remote storage
$ python -m weasel pull local

remotes:
  default: 's3://my-weasel-bucket'
  local: '/mnt/scratch/cache'
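To make the name-to-URL mapping concrete, here is a minimal sketch that looks up a remote name in a dictionary shaped like the remotes section above and reports which protocol the URL implies. The `resolve_remote` helper and the `REMOTES` constant are hypothetical names used only for illustration; they are not part of Weasel, which passes the URL on to cloudpathlib.

```python
from urllib.parse import urlparse

# Mirrors the `remotes` section of the project.yml shown above.
REMOTES = {
    "default": "s3://my-weasel-bucket",
    "local": "/mnt/scratch/cache",
}


def resolve_remote(name: str, remotes: dict) -> tuple:
    """Return (url, scheme) for a named remote. Hypothetical helper."""
    if name not in remotes:
        raise KeyError(f"Unknown remote {name!r}; expected one of {sorted(remotes)}")
    url = remotes[name]
    # A plain filesystem path has no scheme, so treat it as a local remote.
    scheme = urlparse(url).scheme or "file"
    return url, scheme


print(resolve_remote("default", REMOTES))
print(resolve_remote("local", REMOTES))
```

The scheme is what decides whether extra dependencies are needed: a plain path works out of the box, while a cloud protocol such as S3 needs the matching client library installed.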
ℹ️ How it works
Inside the remote storage, Weasel uses a clever directory structure to avoid overwriting files. The top level of the directory structure is a URL-encoded version of the output's path. Within this directory are subdirectories named according to a hash of the command string and the command's dependencies. Finally, within those directories are files, named according to an MD5 hash of their contents.
└── urlencoded_file_path            # Path of original file
    ├── some_command_hash           # Hash of command you ran
    │   ├── some_content_hash       # Hash of file content
    │   └── another_content_hash
    └── another_command_hash
        └── third_content_hash
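The three-level layout can be sketched in a few lines of Python. This is an approximation under stated assumptions, not Weasel's actual implementation: the `content_hash` and `storage_key` helpers are hypothetical, the command hash is passed in as an opaque string, and the content hash is taken over a single byte string rather than a real file or directory.

```python
import hashlib
from urllib.parse import quote_plus


def content_hash(data: bytes) -> str:
    """MD5 hex digest of a file's bytes."""
    return hashlib.md5(data).hexdigest()


def storage_key(output_path: str, command_hash: str, data: bytes) -> str:
    """Build '<urlencoded path>/<command hash>/<content hash>'."""
    encoded = quote_plus(output_path)  # '/' becomes '%2F'
    return "/".join([encoded, command_hash, content_hash(data)])


# Placeholder command hash and file content, for illustration only.
key = storage_key("training/model-best", "some_command_hash", b"hello")
print(key)
```

Because the content hash is the last path component, two pushes of byte-identical output under the same command hash resolve to the same key, while any difference in content produces a sibling entry instead of an overwrite.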
For instance, let's say you had the following spaCy command in your
project.yml:
- name: train
  help: 'Train a spaCy pipeline using the specified corpus and config'
  script:
    - 'spacy train ./config.cfg --output training/'
  deps:
    - 'corpus/train'
    - 'corpus/dev'
    - 'config.cfg'
  outputs:
    - 'training/model-best'

After you finish training, you run push to make
sure the training/model-best output is saved to remote storage. Weasel will
then construct a hash from your command script and the listed dependencies,
corpus/train, corpus/dev and config.cfg, in order to identify the
execution context of your output. It will then compute an MD5 hash of the
training/model-best directory, and use those three pieces of information (the
output path, the command hash and the content hash) to construct the storage URL.
python -m weasel run train
python -m weasel push

└── s3://my-weasel-bucket/training%2Fmodel-best
    └── 1d8cb33a06cc345ad3761c6050934a1b
        └── d8e20c3537a084c5c10d95899fe0b1ff
If you change the command or one of its dependencies (for instance, by editing
the config.cfg file to tune the
hyperparameters), a different creation hash will be calculated, so when you use
push you won't be overwriting your previous file.
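The effect of editing a dependency can be illustrated with a simplified hash. The `command_hash` function below is a hypothetical stand-in: it assumes the hash covers only the script lines and the contents of the listed dependencies, whereas the hash Weasel computes may take further inputs into account. The dependency contents are made-up byte strings.

```python
import hashlib


def command_hash(script: list, dep_contents: dict) -> str:
    """Hash the command strings together with a digest of each dependency.

    Illustrative only: Weasel's real hash may include further inputs.
    """
    hasher = hashlib.md5()
    for line in script:
        hasher.update(line.encode("utf8"))
    # Sort by path so the result does not depend on dict ordering.
    for path in sorted(dep_contents):
        hasher.update(path.encode("utf8"))
        hasher.update(hashlib.md5(dep_contents[path]).digest())
    return hasher.hexdigest()


script = ["spacy train ./config.cfg --output training/"]
deps = {
    "corpus/train": b"training data",
    "corpus/dev": b"development data",
    "config.cfg": b"[training]\ndropout = 0.1\n",
}
before = command_hash(script, deps)

# Tune a hyperparameter in config.cfg; everything else is unchanged.
deps["config.cfg"] = b"[training]\ndropout = 0.2\n"
after = command_hash(script, deps)

print(before != after)  # a changed dependency yields a different hash
```

Since the hash names the subdirectory, the second push lands next to the first one rather than on top of it.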
The system even supports multiple outputs for the same file and the same
context, which can happen if your training process is not deterministic, or if
you have dependencies that aren't represented in the command.
In summary, Weasel's remote storages are designed to make a particular set
of trade-offs. Priority is placed on convenience, correctness and
avoiding data loss. You can use push freely, as
you'll never overwrite remote state, and you don't have to come up with names or
version numbers. However, it's up to you to manage the size of your remote
storage, and to remove files that are no longer relevant to you.
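As a starting point for that housekeeping, the sketch below reports how many bytes are stored per output. It rests on several assumptions: the remote is a local filesystem path, it follows the directory layout described above, and the `remote_usage` helper is hypothetical rather than something Weasel provides. It only reads the directory and never deletes anything.

```python
import tempfile
from pathlib import Path
from urllib.parse import unquote_plus


def remote_usage(remote_root: Path) -> dict:
    """Map each stored output path to the total bytes kept for it."""
    usage = {}
    for output_dir in remote_root.iterdir():
        if not output_dir.is_dir():
            continue
        total = sum(f.stat().st_size for f in output_dir.rglob("*") if f.is_file())
        # Directory names are URL-encoded output paths, so decode them.
        usage[unquote_plus(output_dir.name)] = total
    return usage


# Demo on a throwaway directory laid out like a remote.
with tempfile.TemporaryDirectory() as tmp:
    stored = Path(tmp) / "training%2Fmodel-best" / "some_command_hash"
    stored.mkdir(parents=True)
    (stored / "some_content_hash").write_bytes(b"12345")
    usage = remote_usage(Path(tmp))

print(usage)
```

A report like this shows which outputs have accumulated the most versions, which is usually where pruning pays off first.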