Pilot does not remove old endpoints from Envoy cluster #9480

@kop

Description

Describe the bug

During rolling upgrades in Kubernetes, when old pods are deleted and new ones are added, Envoy keeps forwarding traffic to old pod IPs that no longer exist.
This happens every time I update a deployment in my Kubernetes cluster.

For example, starting from a single pod with IP 10.28.0.37, a deployment update creates two new pods with IPs 10.28.0.42 and 10.28.6.113 and deletes the old one.

The Pilot endpoint :15003/v1/registration correctly returns only the new pods:

{
 "service-key": "ops-k8s-monitoring-kiali.moba-system.svc.cluster.local|http",
 "hosts": [
  {
   "ip_address": "10.28.0.42",
   "port": 8080
  },
  {
   "ip_address": "10.28.6.113",
   "port": 8080
  }
 ]
},

However, the Envoy admin endpoint :15000/clusters still lists the old pod IP 10.28.0.37:

outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_connections::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_pending_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_retries::3
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_connections::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_pending_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_retries::3
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::added_via_api::true
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::cx_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::cx_connect_fail::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::cx_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_error::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_success::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_timeout::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::health_flags::healthy
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::weight::1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::region::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::sub_zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::canary::false
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::success_rate::-1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::cx_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::cx_connect_fail::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::cx_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_error::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_success::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_timeout::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::health_flags::healthy
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::weight::1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::region::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::sub_zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::canary::false
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::success_rate::-1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::cx_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::cx_connect_fail::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::cx_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_error::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_success::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_timeout::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::health_flags::healthy
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::weight::1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::region::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::zone::us-central1/us-central1-f
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::sub_zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::canary::false
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::success_rate::-1
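The mismatch between the two outputs can be captured with a small diff, sketched below under a few assumptions: Pilot's :15003 and the sidecar's Envoy admin :15000 are port-forwarded to localhost, `jq` is available, and the jq filter is an illustrative guess at the registration schema rather than a documented contract:

```shell
#!/bin/sh
# Hypothetical check: list endpoint IPs as seen by Pilot vs. Envoy and diff them.
SVC="ops-k8s-monitoring-kiali.moba-system.svc.cluster.local"

# IPs Pilot advertises for the service (jq filter is an assumption about the schema).
curl -s http://localhost:15003/v1/registration \
  | jq -r --arg svc "$SVC" \
      '.[] | select(."service-key" | startswith($svc)) | .hosts[].ip_address' \
  | sort -u > /tmp/pilot-ips.txt

# IPs present in the Envoy cluster (extracted from the admin /clusters dump).
curl -s http://localhost:15000/clusters \
  | grep -F "outbound|80||$SVC" \
  | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
  | sort -u > /tmp/envoy-ips.txt

# Stale endpoints show up only on the Envoy side (here: 10.28.0.37).
diff /tmp/pilot-ips.txt /tmp/envoy-ips.txt
```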

After enabling debug logging in Envoy I was able to find this EDS update:

[2018-10-23 08:23:27.025][20][debug][config] bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:60] gRPC config for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment accepted with 1 resources: [cluster_name: "outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local"
endpoints {
  locality {
  }
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.28.0.42"
          port_value: 8080
        }
      }
    }
    metadata {
      filter_metadata {
        key: "istio"
        value {
          fields {
            key: "uid"
            value {
              string_value: "kubernetes://ops-k8s-monitoring-kiali-768d6cf478-b4v24.moba-system"
            }
          }
        }
      }
    }
  }
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.28.6.113"
          port_value: 8080
        }
      }
    }
    metadata {
      filter_metadata {
        key: "istio"
        value {
          fields {
            key: "uid"
            value {
              string_value: "kubernetes://ops-k8s-monitoring-kiali-768d6cf478-j824q.moba-system"
            }
          }
        }
      }
    }
  }
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.28.0.37"
          port_value: 8080
        }
      }
    }
    metadata {
      filter_metadata {
        key: "istio"
        value {
          fields {
            key: "uid"
            value {
              string_value: "kubernetes://ops-k8s-monitoring-kiali-649c4c9dd9-8cr8m.moba-system"
            }
          }
        }
      }
    }
  }
}
]

However, Envoy receives no further update removing the deleted endpoint.
In the Pilot logs I could not find anything interesting except:

2018-10-23T08:49:05.242425Z	info	Handling event delete for pod ops-k8s-monitoring-kiali-768d6cf478-b4v24 in namespace moba-system -> 10.28.0.42
2018-10-23T08:47:25.184087Z	warn	Endpoint without pod 10.28.0.42 &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:ops-k8s-monitoring-kiali,GenerateName:,Namespace:moba-system,SelfLink:/api/v1/namespaces/moba-system/endpoints/ops-k8s-monitoring-kiali,UID:44027206-d696-11e8-bc39-42010a8000b6,ResourceVersion:35125760,Generation:0,CreationTimestamp:2018-10-23 07:35:50 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: ops-k8s-monitoring-kiali,component: kiali,department: ops,stack: monitoring,team: k8s,},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Subsets:[{[{10.28.0.42  0xc420bc9760 ObjectReference{Kind:Pod,Namespace:moba-system,Name:ops-k8s-monitoring-kiali-768d6cf478-b4v24,UID:e8e571b2-d69c-11e8-bc39-42010a8000b6,APIVersion:,ResourceVersion:35125759,FieldPath:,}} {10.28.6.113  0xc420bc9770 &ObjectReference{Kind:Pod,Namespace:moba-system,Name:ops-k8s-monitoring-kiali-768d6cf478-j824q,UID:e9149882-d69c-11e8-bc39-42010a8000b6,APIVersion:,ResourceVersion:35125749,FieldPath:,}}] [] [{http 8080 TCP}]}],}

After some time (~30 minutes) Envoy receives an updated config and removes the old endpoints.
If I restart Pilot, the old endpoints are removed immediately.
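The restart workaround can be a one-liner; the `istio=pilot` label selector is an assumption based on the official Helm chart's labeling, and deleting the pods simply lets the Deployment recreate them, forcing a full EDS rebuild and push:

```shell
# Restart all Pilot replicas (label selector assumed from the official Helm chart);
# the recreated pods rebuild EDS state, which drops the stale endpoints.
kubectl -n istio-system delete pod -l istio=pilot
```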

Expected behavior
Old endpoints are removed from the Envoy cluster as soon as the corresponding pods are deleted.

Steps to reproduce the bug
Described in the first section: perform a rolling update of a deployment, then compare Pilot's registration output with Envoy's cluster membership.

Version

Istio:

Version: 1.0.2
GitRevision: d639408fded355fb906ef2a1f9e8ffddc24c3d64
User: root@66ce69d4a51e
Hub: gcr.io/istio-release
GolangVersion: go1.10.1
BuildStatus: Clean

Kubernetes:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-gke.6", GitCommit:"06898a4d0f2b96f82b43d9e59fa2825bd3d616a2", GitTreeState:"clean", BuildDate:"2018-10-02T17:32:01Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}

Installation

Istio is installed using the official Helm charts; the custom values are:

pilot:
  replicaCount: 3
  autoscaleMin: 3
  autoscaleMax: 6
  traceSampling: 100

mixer:
  enabled: true
  replicaCount: 2
  autoscaleMin: 2
  autoscaleMax: 6

  istio-policy:
    autoscaleEnabled: true
    autoscaleMin: 2
    autoscaleMax: 6

  istio-telemetry:
    autoscaleEnabled: true
    autoscaleMin: 2
    autoscaleMax: 6

sidecarInjectorWebhook:
  enableNamespacesByDefault: true

prometheus:
  enabled: false

certmanager:
  enabled: true

Environment
Google Kubernetes Engine v1.10.7-gke.6

Cluster state
6 nodes with 120 pods.

istio-dump.tar.gz
