I have a process that consumes data from GCS. There are about 70k objects in the bucket, each several Kb to several Mb in size. A separate process is continuously recomputing and refreshing the objects. My issue is that the consumer will occasionally download a file using `download_to_filename()` only to find that the local copy is corrupt. Specifically, the file will be truncated to a small multiple of 15Kb plus a few bytes. `download_to_filename()` does not raise an exception when this happens. I do not bother checking the return value from `download_to_filename()`, as my understanding from reading the source is that it is always None.
I do not have a simple, clean repro, but am filing this issue at the request of @jonparrott.
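In case it helps, here is a sketch of how the corruption can be detected independently of the size check (untested; `md5_hash` is the base64-encoded digest the library exposes on `Blob` after `get_blob()`):

```python
# Untested sketch: compare the local file's MD5 against the blob's
# md5_hash property (a base64-encoded digest, populated by get_blob()).
import base64
import hashlib


def md5_matches(blob, local_path):
    """Returns True if the local file's MD5 matches blob.md5_hash."""
    md5 = hashlib.md5()
    with open(local_path, "rb") as f:
        # Hash in 1 MiB chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii") == blob.md5_hash
```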
- OS type and version
```
# uname -a
Linux XXXXXXX 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux
```
- Python version and virtual environment information
```
# python --version
Python 2.7.9
```
- google-cloud-python version
`pip show google-cloud`, `pip show google-<service>`, or `pip freeze`:
```
pip freeze | grep google
gapic-google-cloud-datastore-v1==0.15.3
gapic-google-cloud-error-reporting-v1beta1==0.15.3
gapic-google-cloud-logging-v2==0.91.3
gapic-google-cloud-pubsub-v1==0.15.4
gapic-google-cloud-spanner-admin-database-v1==0.15.3
gapic-google-cloud-spanner-admin-instance-v1==0.15.3
gapic-google-cloud-spanner-v1==0.15.3
google-api-python-client==1.6.2
google-auth==1.1.1
google-auth-httplib2==0.0.2
google-cloud==0.27.0
google-cloud-bigquery==0.26.0
google-cloud-bigtable==0.26.0
google-cloud-core==0.26.0
google-cloud-datastore==1.2.0
google-cloud-dns==0.26.0
google-cloud-error-reporting==0.26.0
google-cloud-happybase==0.22.0
google-cloud-language==0.27.0
google-cloud-logging==1.2.0
google-cloud-monitoring==0.26.0
google-cloud-pubsub==0.27.0
google-cloud-resource-manager==0.26.0
google-cloud-runtimeconfig==0.26.0
google-cloud-spanner==0.26.0
google-cloud-speech==0.28.0
google-cloud-storage==1.3.2
google-cloud-translate==1.1.0
google-cloud-videointelligence==0.25.0
google-cloud-vision==0.26.0
google-compute-engine==2.3.2
google-gax==0.15.15
google-resumable-media==0.2.3
googleapis-common-protos==1.5.3
grpc-google-cloud-datastore-v1==0.14.0
grpc-google-cloud-logging-v2==0.90.0
grpc-google-cloud-pubsub-v1==0.14.0
grpc-google-iam-v1==0.11.4
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-error-reporting-v1beta1==0.15.3
proto-google-cloud-logging-v2==0.91.3
proto-google-cloud-pubsub-v1==0.15.4
proto-google-cloud-spanner-admin-database-v1==0.15.3
proto-google-cloud-spanner-admin-instance-v1==0.15.3
proto-google-cloud-spanner-v1==0.15.3
```
Note that this problem appeared only after I updated my Google Cloud libraries (which I was forced to do because of a bug on Google's side involving Pub/Sub). Here's a diff of my requirements.txt (context removed):
```diff
+cachetools==2.0.1
+certifi==2017.7.27.1
-chardet==2.3.0
+chardet==3.0.4
-dill==0.2.6
+dill==0.2.7.1
-futures==3.0.5
-gapic-google-cloud-datastore-v1==0.14.1
-gapic-google-cloud-logging-v2==0.90.1
-gapic-google-cloud-pubsub-v1==0.14.1
+futures==3.1.1
+gapic-google-cloud-datastore-v1==0.15.3
+gapic-google-cloud-error-reporting-v1beta1==0.15.3
+gapic-google-cloud-logging-v2==0.91.3
+gapic-google-cloud-pubsub-v1==0.15.4
+gapic-google-cloud-spanner-admin-database-v1==0.15.3
+gapic-google-cloud-spanner-admin-instance-v1==0.15.3
+gapic-google-cloud-spanner-v1==0.15.3
-google-auth==0.5.0
+google-auth==1.1.1
-google-cloud==0.22.0
```
- Stacktrace if available
That it does not produce a stacktrace is the issue! 😃
- Steps to reproduce
As I mentioned, I don't have a simple repro case. I think it's something like:
- Create a bucket with 70k objects, each several Mb in size. The objects can be random data. It might be important that another process continuously refreshes these objects.
- Write a program to download the objects. To mimic my setup as closely as possible, use a `multiprocessing.Pool` of 8 processes on an 8-core GCE machine and `Pool.map()` over a few dozen objects at a time.
- Check that the downloaded file size matches the file size reported by GCS.
- Note that for about 1 in 1,000 downloads, the local size will be a small multiple of 15Kb plus a few bytes, not the size reported by GCS.

- Code example
Use something like this to download. Note that I've hacked this up from my actual code and haven't tested it.
```python
import logging
import os
import tempfile

from google.cloud import storage


def copy_from_gcs(bucket_name, src_path, dst_path):
    """Copies the file at src_path in GCS to dst_path locally."""
    logging.info("Copying gs://%s/%s to %s", bucket_name, src_path, dst_path)
    bucket = storage.client.Client().bucket(bucket_name)
    if not bucket.exists():
        raise RuntimeError("GCS bucket %s doesn't exist!" % (bucket_name,))
    blob = bucket.get_blob(src_path)
    if blob is None:
        raise RuntimeError("No GCS object at %s/%s" % (bucket_name, src_path))
    remote_size = blob.size
    (tempfd, temp_fname) = tempfile.mkstemp()
    os.close(tempfd)
    blob.download_to_filename(temp_fname)
    local_size = os.stat(temp_fname).st_size
    if local_size != remote_size:
        logging.warning(
            "When downloading gs://%s/%s to %s, the local file's size (%d) "
            "didn't match the remote size (%d).",
            bucket_name, src_path, dst_path, local_size, remote_size)
        os.remove(temp_fname)
        raise RuntimeError(
            "Downloaded file size (%d) doesn't match GCS file size (%d)"
            % (local_size, remote_size))
    os.rename(temp_fname, dst_path)
    return True
```