Describe the bug
During rolling upgrades in Kubernetes, when old pods are deleted and new ones are added, Envoy keeps forwarding traffic to the old pod IPs, which no longer exist.
This happens EVERY time I update a deployment in my K8s cluster.
For example, I start from an initial state with one pod with IP 10.28.0.37; after a deployment update, two new pods with IPs 10.28.0.42 and 10.28.6.113 are created and the old one is deleted.
Pilot endpoint `:15003/v1/registration` returns:

```json
{
  "service-key": "ops-k8s-monitoring-kiali.moba-system.svc.cluster.local|http",
  "hosts": [
    {
      "ip_address": "10.28.0.42",
      "port": 8080
    },
    {
      "ip_address": "10.28.6.113",
      "port": 8080
    }
  ]
},
```
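For reference, this is how I pulled the registration data from Pilot's debug port; the `istio=pilot` pod label is the one used by the default Helm install (adjust if your install differs, and note this obviously needs access to a live cluster):

```shell
# Port-forward Pilot's legacy debug port (15003) from one of its pods
kubectl -n istio-system port-forward \
  "$(kubectl -n istio-system get pod -l istio=pilot -o jsonpath='{.items[0].metadata.name}')" \
  15003:15003 &

# Dump the endpoints Pilot currently knows for the affected service
curl -s localhost:15003/v1/registration | grep -A4 'ops-k8s-monitoring-kiali'
```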
However, Envoy endpoint `:15000/clusters` returns:

```
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_connections::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_pending_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::default_priority::max_retries::3
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_connections::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_pending_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_requests::1024
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::high_priority::max_retries::3
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::added_via_api::true
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::cx_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::cx_connect_fail::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::cx_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_error::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_success::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_timeout::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::rq_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::health_flags::healthy
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::weight::1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::region::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::sub_zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::canary::false
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.42:8080::success_rate::-1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::cx_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::cx_connect_fail::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::cx_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_error::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_success::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_timeout::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::rq_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::health_flags::healthy
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::weight::1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::region::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::sub_zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::canary::false
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.6.113:8080::success_rate::-1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::cx_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::cx_connect_fail::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::cx_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_active::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_error::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_success::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_timeout::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::rq_total::0
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::health_flags::healthy
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::weight::1
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::region::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::zone::us-central1/us-central1-f
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::sub_zone::
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::canary::false
outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local::10.28.0.37:8080::success_rate::-1
```

Note that the deleted pod 10.28.0.37 is still listed, with `health_flags::healthy`.
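The mismatch between the two dumps is easy to see by diffing the endpoint sets; a minimal sketch with the IPs from above inlined:

```shell
# Endpoints Pilot reports via :15003/v1/registration
pilot_ips="10.28.0.42
10.28.6.113"

# Endpoints Envoy still lists via :15000/clusters
envoy_ips="10.28.0.42
10.28.6.113
10.28.0.37"

# Lines present only in Envoy's set are the stale endpoints
comm -13 <(echo "$pilot_ips" | sort) <(echo "$envoy_ips" | sort)
# -> 10.28.0.37
```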
After enabling debug logging in Envoy I was able to find this EDS update:

```
[2018-10-23 08:23:27.025][20][debug][config] bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:60] gRPC config for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment accepted with 1 resources: [cluster_name: "outbound|80||ops-k8s-monitoring-kiali.moba-system.svc.cluster.local"
endpoints {
  locality {
  }
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.28.0.42"
          port_value: 8080
        }
      }
    }
    metadata {
      filter_metadata {
        key: "istio"
        value {
          fields {
            key: "uid"
            value {
              string_value: "kubernetes://ops-k8s-monitoring-kiali-768d6cf478-b4v24.moba-system"
            }
          }
        }
      }
    }
  }
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.28.6.113"
          port_value: 8080
        }
      }
    }
    metadata {
      filter_metadata {
        key: "istio"
        value {
          fields {
            key: "uid"
            value {
              string_value: "kubernetes://ops-k8s-monitoring-kiali-768d6cf478-j824q.moba-system"
            }
          }
        }
      }
    }
  }
  lb_endpoints {
    endpoint {
      address {
        socket_address {
          address: "10.28.0.37"
          port_value: 8080
        }
      }
    }
    metadata {
      filter_metadata {
        key: "istio"
        value {
          fields {
            key: "uid"
            value {
              string_value: "kubernetes://ops-k8s-monitoring-kiali-649c4c9dd9-8cr8m.moba-system"
            }
          }
        }
      }
    }
  }
}
]
```
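For completeness, the debug output came from bumping the sidecar's log level at runtime through Envoy's admin API on port 15000; something like the following, run against an affected pod (`<pod-name>` is a placeholder and this needs a live cluster):

```shell
# Raise the log level of a running sidecar via Envoy's admin endpoint
kubectl -n moba-system exec <pod-name> -c istio-proxy -- \
  curl -s -X POST "localhost:15000/logging?level=debug"
```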
However, no further update (with the old endpoint removed) is received by Envoy.
In the Pilot logs I was not able to find anything interesting, except for:

```
2018-10-23T08:49:05.242425Z info Handling event delete for pod ops-k8s-monitoring-kiali-768d6cf478-b4v24 in namespace moba-system -> 10.28.0.42
2018-10-23T08:47:25.184087Z warn Endpoint without pod 10.28.0.42 &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:ops-k8s-monitoring-kiali,GenerateName:,Namespace:moba-system,SelfLink:/api/v1/namespaces/moba-system/endpoints/ops-k8s-monitoring-kiali,UID:44027206-d696-11e8-bc39-42010a8000b6,ResourceVersion:35125760,Generation:0,CreationTimestamp:2018-10-23 07:35:50 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: ops-k8s-monitoring-kiali,component: kiali,department: ops,stack: monitoring,team: k8s,},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Subsets:[{[{10.28.0.42 0xc420bc9760 ObjectReference{Kind:Pod,Namespace:moba-system,Name:ops-k8s-monitoring-kiali-768d6cf478-b4v24,UID:e8e571b2-d69c-11e8-bc39-42010a8000b6,APIVersion:,ResourceVersion:35125759,FieldPath:,}} {10.28.6.113 0xc420bc9770 &ObjectReference{Kind:Pod,Namespace:moba-system,Name:ops-k8s-monitoring-kiali-768d6cf478-j824q,UID:e9149882-d69c-11e8-bc39-42010a8000b6,APIVersion:,ResourceVersion:35125749,FieldPath:,}}] [] [{http 8080 TCP}]}],}
```
After some time (~30 minutes) Envoy receives an updated config and removes the old endpoints.
If I restart Pilot, the old endpoints are removed instantly.
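As a stopgap, the resync can be forced by deleting the Pilot pods and letting the Deployment recreate them; the `istio=pilot` label below is the one I believe the default Helm install uses (requires a live cluster):

```shell
# Kill Pilot pods; the Deployment recreates them and Envoy receives fresh EDS
kubectl -n istio-system delete pod -l istio=pilot
```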
Expected behavior
Old endpoints are removed from the Envoy cluster promptly after the corresponding pods are deleted.
Steps to reproduce the bug
Described in the first section.
Version
Istio:

```
Version: 1.0.2
GitRevision: d639408fded355fb906ef2a1f9e8ffddc24c3d64
User: root@66ce69d4a51e
Hub: gcr.io/istio-release
GolangVersion: go1.10.1
BuildStatus: Clean
```

Kubernetes:

```
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-gke.6", GitCommit:"06898a4d0f2b96f82b43d9e59fa2825bd3d616a2", GitTreeState:"clean", BuildDate:"2018-10-02T17:32:01Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
```
Installation
Istio is installed with the help of the official Helm charts; the custom values are:

```yaml
pilot:
  replicaCount: 3
  autoscaleMin: 3
  autoscaleMax: 6
  traceSampling: 100
mixer:
  enabled: true
  replicaCount: 2
  autoscaleMin: 2
  autoscaleMax: 6
istio-policy:
  autoscaleEnabled: true
  autoscaleMin: 2
  autoscaleMax: 6
istio-telemetry:
  autoscaleEnabled: true
  autoscaleMin: 2
  autoscaleMax: 6
sidecarInjectorWebhook:
  enableNamespacesByDefault: true
prometheus:
  enabled: false
certmanager:
  enabled: true
```
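These values are applied with a plain Helm install/upgrade of the istio chart; roughly like this, where the chart path is the one from the release tarball and `values-custom.yaml` is a placeholder name for the file above:

```shell
helm upgrade --install istio install/kubernetes/helm/istio \
  --namespace istio-system \
  -f values-custom.yaml
```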
Environment
Google Kubernetes Engine v1.10.7-gke.6
Cluster state
6 nodes with 120 pods.
istio-dump.tar.gz