Pilot does not remove old endpoints from Envoy cluster #9480

@kop

Description

Describe the bug

During rolling upgrades in Kubernetes, when old pods are deleted and new ones are added, Envoy keeps forwarding traffic to old pod IPs that no longer exist.
This happens every time I update a deployment in my Kubernetes cluster.

For example, starting from a single pod with IP 10.28.0.37, a deployment update creates two new pods with IPs 10.28.0.42 and 10.28.6.113 and deletes the old one.

The Pilot endpoint :15003/v1/registration correctly returns only the new pods:

{
 "service-key": "ops-k8s-monitoring-kiali.moba-system.svc.cluster.local|http",
 "hosts": [
  {
   "ip_address": "10.28.0.42",
   "port": 8080
  },
  {
   "ip_address": "10.28.6.113",
   "port": 8080
  }
 ]
},

However, the Envoy admin endpoint :15000/clusters still lists the old pod IP 10.28.0.37:

outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_connections::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_pending_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_retries::3
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_connections::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_pending_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_retries::3
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::added_via_api::true
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::cx_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::cx_connect_fail::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::cx_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_error::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_success::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_timeout::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::health_flags::healthy
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::weight::1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::region::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::sub_zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::canary::false
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::success_rate::-1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::cx_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::cx_connect_fail::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::cx_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_error::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_success::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_timeout::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::health_flags::healthy
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::weight::1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::region::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::sub_zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::canary::false
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::success_rate::-1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::cx_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::cx_connect_fail::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::cx_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_error::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_success::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_timeout::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::health_flags::healthy
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::weight::1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::region::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::zone::us-central1/us-central1-f
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::sub_zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::canary::false
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::success_rate::-1
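The mismatch between the two outputs can be captured with a small diff, sketched below under a few assumptions: Pilot's :15003 and the sidecar's Envoy admin :15000 are port-forwarded to localhost, `jq` is available, and the jq filter is an illustrative guess at the registration schema rather than a documented contract:

```shell
#!/bin/sh
# Hypothetical check: list endpoint IPs as seen by Pilot vs. Envoy and diff them.
SVC="ops-k8s-monitoring-kiali.moba-system.svc.cluster.local"

# IPs Pilot advertises for the service (jq filter is an assumption about the schema).
curl -s http://localhost:15003/v1/registration \
  | jq -r --arg svc "$SVC" \
      '.[] | select(."service-key" | startswith($svc)) | .hosts[].ip_address' \
  | sort -u > /tmp/pilot-ips.txt

# IPs present in the Envoy cluster (extracted from the admin /clusters dump).
curl -s http://localhost:15000/clusters \
  | grep -F "outbound|80||$SVC" \
  | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
  | sort -u > /tmp/envoy-ips.txt

# Stale endpoints show up only on the Envoy side (here: 10.28.0.37).
diff /tmp/pilot-ips.txt /tmp/envoy-ips.txt
```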

After enabling debug logging in Envoy I was able to find this EDS update:

[2018-10-23 08:23:27.025][20][debug][config] bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:60] gRPC config for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment accepted with 1 resources: [cluster_name: "outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local"
endpoints {
  locality {
  }
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.28.0.42"
          port_value: 8080
        }
      }
    }
    metadata {
      filter_metadata {
        key: "istio"
        value {
          fields {
            key: "uid"
            value {
              string_value: "kubernetes://ops-k8s-monitoring-kiali-768d6cf478-b4v24.moba-system"
            }
          }
        }
      }
    }
  }
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.28.6.113"
          port_value: 8080
        }
      }
    }
    metadata {
      filter_metadata {
        key: "istio"
        value {
          fields {
            key: "uid"
            value {
              string_value: "kubernetes://ops-k8s-monitoring-kiali-768d6cf478-j824q.moba-system"
            }
          }
        }
      }
    }
  }
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.28.0.37"
          port_value: 8080
        }
      }
    }
    metadata {
      filter_metadata {
        key: "istio"
        value {
          fields {
            key: "uid"
            value {
              string_value: "kubernetes://ops-k8s-monitoring-kiali-649c4c9dd9-8cr8m.moba-system"
            }
          }
        }
      }
    }
  }
}
]

However, Envoy receives no further update removing the deleted endpoint.
In the Pilot logs I could not find anything interesting except:

2018-10-23T08:49:05.242425Z	info	Handling event delete for pod ops-k8s-monitoring-kiali-768d6cf478-b4v24 in namespace moba-system -> 10.28.0.42
2018-10-23T08:47:25.184087Z	warn	Endpoint without pod 10.28.0.42 &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:ops-k8s-monitoring-kiali,GenerateName:,Namespace:moba-system,SelfLink:/api/v1/namespaces/moba-system/endpoints/ops-k8s-monitoring-kiali,UID:44027206-d696-11e8-bc39-42010a8000b6,ResourceVersion:35125760,Generation:0,CreationTimestamp:2018-10-23 07:35:50 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: ops-k8s-monitoring-kiali,component: kiali,department: ops,stack: monitoring,team: k8s,},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Subsets:[{[{10.28.0.42  0xc420bc9760 ObjectReference{Kind:Pod,Namespace:moba-system,Name:ops-k8s-monitoring-kiali-768d6cf478-b4v24,UID:e8e571b2-d69c-11e8-bc39-42010a8000b6,APIVersion:,ResourceVersion:35125759,FieldPath:,}} {10.28.6.113  0xc420bc9770 &ObjectReference{Kind:Pod,Namespace:moba-system,Name:ops-k8s-monitoring-kiali-768d6cf478-j824q,UID:e9149882-d69c-11e8-bc39-42010a8000b6,APIVersion:,ResourceVersion:35125749,FieldPath:,}}] [] [{http 8080 TCP}]}],}

After some time (~30 minutes) Envoy receives an updated config and removes the old endpoints.
If I restart Pilot, the old endpoints are removed immediately.
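The restart workaround can be a one-liner; the `istio=pilot` label selector is an assumption based on the official Helm chart's labeling, and deleting the pods simply lets the Deployment recreate them, forcing a full EDS rebuild and push:

```shell
# Restart all Pilot replicas (label selector assumed from the official Helm chart);
# the recreated pods rebuild EDS state, which drops the stale endpoints.
kubectl -n istio-system delete pod -l istio=pilot
```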

Expected behavior
Old endpoints are removed from the Envoy cluster as soon as the corresponding pods are deleted.

Steps to reproduce the bug
Described in the first section: perform a rolling update of a deployment, then compare Pilot's registration output with Envoy's cluster membership.

Version

Istio:

Version: 1.0.2
GitRevision: d639408fded355fb906ef2a1f9e8ffddc24c3d64
User: root@66ce69d4a51e
Hub: gcr.io/istio-release
GolangVersion: go1.10.1
BuildStatus: Clean

Kubernetes:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-gke.6", GitCommit:"06898a4d0f2b96f82b43d9e59fa2825bd3d616a2", GitTreeState:"clean", BuildDate:"2018-10-02T17:32:01Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}

Installation

Istio is installed using the official Helm charts; the custom values are:

pilot:
  replicaCount: 3
  autoscaleMin: 3
  autoscaleMax: 6
  traceSampling: 100

mixer:
  enabled: true
  replicaCount: 2
  autoscaleMin: 2
  autoscaleMax: 6

  istio-policy:
    autoscaleEnabled: true
    autoscaleMin: 2
    autoscaleMax: 6

  istio-telemetry:
    autoscaleEnabled: true
    autoscaleMin: 2
    autoscaleMax: 6

sidecarInjectorWebhook:
  enableNamespacesByDefault: true

prometheus:
  enabled: false

certmanager:
  enabled: true

Environment
Google Kubernetes Engine v1.10.7-gke.6

Cluster state
6 nodes with 120 pods.

istio-dump.tar.gz
