I have a process that consumes data from GCS. There are about 70k objects in the bucket, each several Kb to several Mb in size. A separate process is continuously recomputing and refreshing the objects. My issue is that the consumer will occasionally download a file using `download_to_filename()` only to find that the local copy is corrupt. Specifically, the file will be truncated to a small multiple of 15Kb plus a few bytes. `download_to_filename()` does not raise an exception when this happens. I do not bother checking the return value from `download_to_filename()`, as my understanding from reading the source is that it is always None.
I do not have a simple, clean repro, but am filing this issue at the request of @jonparrott.
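In case it helps, here is a sketch of how the corruption can be detected independently of the size check (untested; `md5_hash` is the base64-encoded digest the library exposes on `Blob` after `get_blob()`):

```python
# Untested sketch: compare the local file's MD5 against the blob's
# md5_hash property (a base64-encoded digest, populated by get_blob()).
import base64
import hashlib


def md5_matches(blob, local_path):
    """Returns True if the local file's MD5 matches blob.md5_hash."""
    md5 = hashlib.md5()
    with open(local_path, "rb") as f:
        # Hash in 1 MiB chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii") == blob.md5_hash
```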
- OS type and version
```
# uname -a
Linux XXXXXXX 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux
```
- Python version and virtual environment information
```
# python --version
Python 2.7.9
```
- google-cloud-python version
`pip show google-cloud`, `pip show google-<service>`, or `pip freeze`:
```
pip freeze | grep google
gapic-google-cloud-datastore-v1==0.15.3
gapic-google-cloud-error-reporting-v1beta1==0.15.3
gapic-google-cloud-logging-v2==0.91.3
gapic-google-cloud-pubsub-v1==0.15.4
gapic-google-cloud-spanner-admin-database-v1==0.15.3
gapic-google-cloud-spanner-admin-instance-v1==0.15.3
gapic-google-cloud-spanner-v1==0.15.3
google-api-python-client==1.6.2
google-auth==1.1.1
google-auth-httplib2==0.0.2
google-cloud==0.27.0
google-cloud-bigquery==0.26.0
google-cloud-bigtable==0.26.0
google-cloud-core==0.26.0
google-cloud-datastore==1.2.0
google-cloud-dns==0.26.0
google-cloud-error-reporting==0.26.0
google-cloud-happybase==0.22.0
google-cloud-language==0.27.0
google-cloud-logging==1.2.0
google-cloud-monitoring==0.26.0
google-cloud-pubsub==0.27.0
google-cloud-resource-manager==0.26.0
google-cloud-runtimeconfig==0.26.0
google-cloud-spanner==0.26.0
google-cloud-speech==0.28.0
google-cloud-storage==1.3.2
google-cloud-translate==1.1.0
google-cloud-videointelligence==0.25.0
google-cloud-vision==0.26.0
google-compute-engine==2.3.2
google-gax==0.15.15
google-resumable-media==0.2.3
googleapis-common-protos==1.5.3
grpc-google-cloud-datastore-v1==0.14.0
grpc-google-cloud-logging-v2==0.90.0
grpc-google-cloud-pubsub-v1==0.14.0
grpc-google-iam-v1==0.11.4
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-error-reporting-v1beta1==0.15.3
proto-google-cloud-logging-v2==0.91.3
proto-google-cloud-pubsub-v1==0.15.4
proto-google-cloud-spanner-admin-database-v1==0.15.3
proto-google-cloud-spanner-admin-instance-v1==0.15.3
proto-google-cloud-spanner-v1==0.15.3
```
Note that this problem appeared only after I updated my Google Cloud libraries (which I was forced to do because of a bug on Google's side involving Pub/Sub). Here's a diff of my requirements.txt (context removed):
```diff
+cachetools==2.0.1
+certifi==2017.7.27.1
-chardet==2.3.0
+chardet==3.0.4
-dill==0.2.6
+dill==0.2.7.1
-futures==3.0.5
-gapic-google-cloud-datastore-v1==0.14.1
-gapic-google-cloud-logging-v2==0.90.1
-gapic-google-cloud-pubsub-v1==0.14.1
+futures==3.1.1
+gapic-google-cloud-datastore-v1==0.15.3
+gapic-google-cloud-error-reporting-v1beta1==0.15.3
+gapic-google-cloud-logging-v2==0.91.3
+gapic-google-cloud-pubsub-v1==0.15.4
+gapic-google-cloud-spanner-admin-database-v1==0.15.3
+gapic-google-cloud-spanner-admin-instance-v1==0.15.3
+gapic-google-cloud-spanner-v1==0.15.3
-google-auth==0.5.0
+google-auth==1.1.1
-google-cloud==0.22.0
```
- Stacktrace if available
That it does not produce a stacktrace is the issue! 😃
- Steps to reproduce
As I mentioned, I don't have a simple repro case. I think it's something like:
- Create a bucket with 70k objects, each several Mb in size. The objects can be random data. It might be important that another process continuously refreshes these objects.
- Write a program to download the objects. To mimic my setup as closely as possible, use a `multiprocessing.Pool` of 8 processes on an 8-core GCE machine and `Pool.map()` over a few dozen objects at a time.
- Check that the downloaded file size matches the file size reported by GCS.
- Note that for about 1 in 1,000 downloads, the local size will be a small multiple of 15Kb plus a few bytes, not the size reported by GCS.

- Code example
Use something like this to download. Note that I've hacked this up from my actual code and haven't tested it.
```python
import logging
import os
import tempfile

from google.cloud import storage


def copy_from_gcs(bucket_name, src_path, dst_path):
    """Copies the file at src_path in GCS to dst_path locally."""
    logging.info("Copying gs://%s/%s to %s", bucket_name, src_path, dst_path)
    bucket = storage.client.Client().bucket(bucket_name)
    if not bucket.exists():
        raise RuntimeError("GCS bucket %s doesn't exist!" % (bucket_name,))
    blob = bucket.get_blob(src_path)
    if blob is None:
        raise RuntimeError("No GCS object at %s/%s" % (bucket_name, src_path))
    remote_size = blob.size
    (tempfd, temp_fname) = tempfile.mkstemp()
    os.close(tempfd)
    blob.download_to_filename(temp_fname)
    local_size = os.stat(temp_fname).st_size
    if local_size != remote_size:
        logging.warning(
            "When downloading gs://%s/%s to %s, the local file's size (%d) "
            "didn't match the remote size (%d).",
            bucket_name, src_path, dst_path, local_size, remote_size)
        os.remove(temp_fname)
        raise RuntimeError(
            "Downloaded file size (%d) doesn't match GCS file size (%d)"
            % (local_size, remote_size))
    os.rename(temp_fname, dst_path)
    return True
```