Publisher thread terminates, forever breaking publication when GCE metadata service blips #1173

@pgcamus

Description

Environment details

  • OS type and version: Ubuntu 22.04
  • Python version: 3.10.9
  • pip version: pip --version
  • google-cloud-pubsub version: 2.21.1

Steps to reproduce

Run a publisher that uses google-cloud-pubsub on GCE and hit a metadata service outage like https://status.cloud.google.com/incidents/u6rQ2nNVbhAFqGCcTm58.

Note that even a brief GCE metadata outage can trigger this: once the exception is raised a single time, the commit thread is dead forever. In our case, the metadata server was down only briefly, but all of the retries happened so quickly that the exception was raised before the service recovered:

2024-04-26T07:30:45.783 Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/universe/universe_domain (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7c1ca813c1c0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-04-26T07:30:45.788 Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/universe/universe_domain (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7c1ca8290730>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-04-26T07:30:45.794 Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/universe/universe_domain (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7c1ca82918a0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-04-26T07:30:45.801 [...]
2024-04-26T07:30:45.806 [...]
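
The timestamps above show all five attempts landing within roughly 25 ms, so none of them could outlast even a short blip. As a rough illustration only (this is not google-auth's retry logic), the sketch below shows how retries with exponential backoff would spread the same number of attempts over several seconds. `call_with_backoff`, its parameters, and the local `TransportError` stand-in are hypothetical names chosen for this example.

```python
import time


class TransportError(Exception):
    """Stand-in for google.auth.exceptions.TransportError (keeps the sketch dependency-free)."""


def call_with_backoff(fn, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on TransportError with exponentially growing pauses.

    All parameter names and defaults are assumptions made for this illustration.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransportError:
            if attempt == attempts:
                # Out of attempts: let the caller see the original error.
                raise
            sleep(delay)
            # Double the pause each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

With these defaults, five attempts span about 7.5 seconds (0.5 + 1 + 2 + 4) instead of a few milliseconds, which may be enough to ride out a short outage.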

Stack trace

Traceback (most recent call last):
  File "/app/device/trimark/proxy/proxy.runfiles/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/app/device/trimark/proxy/proxy.runfiles/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_cloud_pubsub/site-packages/google/cloud/pubsub_v1/publisher/_batch/thread.py", line 274, in _commit
    response = self._client._gapic_publish(
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_cloud_pubsub/site-packages/google/cloud/pubsub_v1/publisher/client.py", line 267, in _gapic_publish
    return super().publish(*args, **kwargs)
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_cloud_pubsub/site-packages/google/pubsub_v1/services/publisher/client.py", line 1058, in publish
    self._validate_universe_domain()
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_cloud_pubsub/site-packages/google/pubsub_v1/services/publisher/client.py", line 554, in _validate_universe_domain
    or PublisherClient._compare_universes(
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_cloud_pubsub/site-packages/google/pubsub_v1/services/publisher/client.py", line 531, in _compare_universes
    credentials_universe = getattr(credentials, "universe_domain", default_universe)
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_auth/site-packages/google/auth/compute_engine/credentials.py", line 154, in universe_domain
    self._universe_domain = _metadata.get_universe_domain(
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_auth/site-packages/google/auth/compute_engine/_metadata.py", line 284, in get_universe_domain
    universe_domain = get(
  File "/app/device/trimark/proxy/proxy.runfiles/common_deps_google_auth/site-packages/google/auth/compute_engine/_metadata.py", line 217, in get
    raise exceptions.TransportError(
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/universe/universe_domain from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

Speculative analysis

It looks like the google-auth library raises a TransportError that the batch commit thread in this library does not catch. Potential fixes include catching it in Batch._commit, or catching it further down in google-cloud-pubsub and wrapping it in a GoogleAPIError.
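
To make the first suggestion concrete, here is a minimal, dependency-free sketch of a commit routine that catches the transport failure and reports it through the pending futures instead of letting the exception escape the thread. It is not the library's actual `Batch._commit`; `commit`, `PublishError`, and the local `TransportError` class are hypothetical stand-ins for illustration.

```python
import threading
from concurrent.futures import Future


class TransportError(Exception):
    """Stand-in for google.auth.exceptions.TransportError."""


class PublishError(Exception):
    """Hypothetical wrapper, playing the role of a GoogleAPIError-style exception."""


def commit(publish, messages, futures):
    """Publish a batch and resolve every future, even if the transport fails."""
    try:
        message_ids = publish(messages)
    except TransportError as exc:
        # Wrap the auth-layer failure and hand it to each caller's future,
        # so the thread returns normally instead of dying with a traceback.
        error = PublishError(f"publish failed: {exc}")
        error.__cause__ = exc
        for future in futures:
            future.set_exception(error)
        return
    for future, message_id in zip(futures, message_ids):
        future.set_result(message_id)


def failing_publish(messages):
    # Simulates the metadata outage from the stack trace above.
    raise TransportError("Compute Engine Metadata server unavailable")


futures = [Future(), Future()]
worker = threading.Thread(target=commit, args=(failing_publish, ["a", "b"], futures))
worker.start()
worker.join()
print(type(futures[0].exception()).__name__)
```

In this sketch the worker thread exits cleanly and each future carries the wrapped error, so the caller can decide whether to retry rather than waiting on a publish that will never complete.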

Labels

api: pubsub (Issues related to the googleapis/python-pubsub API.)
