Keda-Operator OOM problem after upgrade to Keda v2.11.* #4789

@andreb89

Report

Hi,
we have an OOM problem in Kubernetes (AKS 1.26.3) with the keda-operator that was introduced with version 2.11.*. We use Postgres and Prometheus triggers for ScaledJobs. For now, we have downgraded to 2.10.1 again, where we do not have this issue.

Grafana metrics for the keda-operator pod with 2.11.1:
image

After the downgrade to 2.10.1:
image

I added some keda-operator pod logs, but they show nothing useful around the time the OOM happens.

We are using the default resource requests/limits, e.g. for keda-operator:

    Limits:
      cpu:     1
      memory:  1000Mi
    Requests:
      cpu:      100m
      memory:   100Mi
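
To compare the Grafana memory graphs against these settings, it helps to have the limit in bytes. The sketch below is a minimal converter for Kubernetes binary memory quantities; the helper name `memory_to_bytes` is made up for this example, and only the `Ki`/`Mi`/`Gi` suffixes are handled.

```python
# Binary suffixes used by Kubernetes memory quantities (subset, sketch only).
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def memory_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '1000Mi' to a number of bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a plain byte count


limit_bytes = memory_to_bytes("1000Mi")
request_bytes = memory_to_bytes("100Mi")
print(limit_bytes)                   # 1048576000
print(limit_bytes // request_bytes)  # 10
```

So the container is killed once it needs more than roughly 1.05 GB, ten times its request.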

We have about 500 ScaledJob instances and 1 ScaledObject instance. Most of the jobs have a Prometheus trigger and follow this template:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  annotations:
    meta.helm.sh/release-name: worker
    meta.helm.sh/release-namespace: ks-ns
  creationTimestamp: "2022-01-26T17
  finalizers:
  - finalizer.keda.sh
  generation: 21
  labels:
    ...
  name: worker
  namespace: ks-ns
  resourceVersion: "647584627"
  uid: 057f6b4b-8cc0-43aa-a16c-fe9ab7611d79
spec:
  failedJobsHistoryLimit: 10
  jobTargetRef:
    activeDeadlineSeconds: 1800
    backoffLimit: 6
    template:
      metadata:
        creationTimestamp: null
        labels:
         ...
      spec:
        containers:
        ...
    ttlSecondsAfterFinished: 3600
  maxReplicaCount: 20
  pollingInterval: 5
  rolloutStrategy: default
  scalingStrategy: {}
  successfulJobsHistoryLimit: 1
  triggers:
  - metadata:
      metricName: serverless_pendingjobs
      query: max(serverless_pendingjobs{queue="queue", namespace="ks-ns"})
      serverAddress: http://[cluster]:9090
      threshold: "1"
    type: prometheus
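
For a rough sense of the load this template implies, the arithmetic below estimates how many trigger polls per second the operator performs. It assumes, which the report does not state, that all of the roughly 500 ScaledJobs use `pollingInterval: 5` and carry exactly one trigger; the function name is made up for this sketch.

```python
def polls_per_second(scaled_jobs: int, triggers_per_job: int, polling_interval_s: float) -> float:
    """Average number of trigger queries per second across all ScaledJobs."""
    return scaled_jobs * triggers_per_job / polling_interval_s


# Assumed values: ~500 ScaledJobs, one trigger each, pollingInterval of 5 seconds.
rate = polls_per_second(scaled_jobs=500, triggers_per_job=1, polling_interval_s=5)
print(rate)  # 100.0
```

Under those assumptions the operator issues on the order of 100 scaler queries per second, which may be relevant when comparing memory behaviour between versions.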

Expected Behavior

Memory consumption should stay the same after the Keda version update.

Actual Behavior

Huge jump in memory consumption after the upgrade.

Steps to Reproduce the Problem

Take a larger cluster with a lot of different ScaledJobs and upgrade Keda from 2.10.* to 2.11.*.

It may happen for you too, but honestly I am not sure.
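
One possible way to approximate "a lot of different scale jobs" on a test cluster is to generate many ScaledJob manifests from the template above. The following is only a sketch: the namespace `keda-repro`, the `busybox` image, the `worker-N` names and the Prometheus address are hypothetical placeholders, and it has not been confirmed that this setup triggers the memory growth.

```python
# Minimal ScaledJob modelled on the template in this report.
# Doubled braces are literal braces for str.format().
TEMPLATE = """\
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: worker-{index}
  namespace: {namespace}
spec:
  failedJobsHistoryLimit: 10
  successfulJobsHistoryLimit: 1
  maxReplicaCount: 20
  pollingInterval: 5
  jobTargetRef:
    activeDeadlineSeconds: 1800
    backoffLimit: 6
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: worker
          image: busybox  # hypothetical placeholder image
          command: ["sleep", "10"]
  triggers:
  - type: prometheus
    metadata:
      query: max(serverless_pendingjobs{{queue="queue-{index}", namespace="{namespace}"}})
      serverAddress: {server}
      threshold: "1"
"""


def render_manifests(count: int,
                     namespace: str = "keda-repro",
                     server: str = "http://prometheus.example:9090") -> str:
    """Return `count` ScaledJob documents joined into one multi-document YAML string."""
    docs = (TEMPLATE.format(index=i, namespace=namespace, server=server)
            for i in range(count))
    return "---\n".join(docs)


manifests = render_manifests(500)
print(manifests.count("kind: ScaledJob"))  # 500
```

The resulting string can be written to a file and applied with `kubectl apply -f <file>` before and after the upgrade, so that operator memory can be compared under the same object count.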

Logs from KEDA operator

  | Jul 11, 2023 @ 01:10:54.809 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appg-lucmerge-appf", "scaledJob.Namespace": "staging-appg-application", "Number of pending Jobs ": 0}
  | Jul 11, 2023 @ 01:10:54.809 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appf-bundle-finalizer-m", "scaledJob.Namespace": "staging-appf-application", "Number of pending Jobs ": 0}
  | Jul 11, 2023 @ 01:10:54.827 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "appf-import-data-xl", "scaledJob.Namespace": "staging-appf-application", "Effective number of max jobs": 0}
  | Jul 11, 2023 @ 01:10:54.827 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "appf-import-data-xl", "scaledJob.Namespace": "staging-appf-application", "Number of jobs": 0}
  | Jul 11, 2023 @ 01:10:54.827 | 2023-07-10T23:10:54Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "appf-import-data-xl", "scaledJob.Namespace": "staging-appf-application", "Number of jobs": 0}
  | Jul 11, 2023 @ 01:10:55.749 | 2023-07-10T23:10:55Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Starting manager
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Git Commit: b8dbd298cf9001b1597a2756fd0be4fa4df2059f
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	KEDA Version: 2.11.1
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Running on Kubernetes 1.26	{"version": "v1.26.3"}
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Go Version: go1.20.5
  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z	INFO	setup	Go OS/Arch: linux/amd64
  | Jul 11, 2023 @ 01:10:55.874 | 2023-07-10T23:10:55Z	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
  | Jul 11, 2023 @ 01:10:55.874 | 2023-07-10T23:10:55Z	INFO	starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
  | Jul 11, 2023 @ 01:10:55.874 | I0710 23:10:55.874494       1 leaderelection.go:245] attempting to acquire leader lease staging-keda-serverless/operator.keda.sh...
  | Jul 11, 2023 @ 01:10:56.668 | I0710 23:10:56.668438       1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="158.111µs" userAgent="kube-probe/1.26" audit-ID="b69846c6-e714-4b2e-8109-460408fc4fa0" srcIP="10.4.8.122:49694" resp=200
  | Jul 11, 2023 @ 01:10:56.668 | I0710 23:10:56.668040       1 httplog.go:132] "HTTP" verb="GET" URI="/readyz" latency="213.915µs" userAgent="kube-probe/1.26" audit-ID="280df468-0e0a-4222-84ee-0aed41f7c566" srcIP="10.4.8.122:49710" resp=200
  | Jul 11, 2023 @ 01:11:01.274 | I0710 23:11:01.274250       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="11.654072ms" userAgent="Go-http-client/2.0" audit-ID="7c0694eb-ae9d-4f70-b035-452bbd726728" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.274 | I0710 23:11:01.274315       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="11.800979ms" userAgent="Go-http-client/2.0" audit-ID="5d45c41a-6417-4d00-9e08-41b10b1477c1" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.274 | I0710 23:11:01.274552       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="12.225696ms" userAgent="Go-http-client/2.0" audit-ID="f1594dda-145d-4acb-b768-b74e1608460a" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.284 | I0710 23:11:01.283973       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="21.652078ms" userAgent="Go-http-client/2.0" audit-ID="11e6d398-2328-4f5a-abdc-8301c366b3b4" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:01.284 | I0710 23:11:01.283992       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="21.631177ms" userAgent="Go-http-client/2.0" audit-ID="3c64c045-a66f-4f51-9c09-a43a411cf3fc" srcIP="10.4.1.72:50284" resp=200
  | Jul 11, 2023 @ 01:11:02.144 | I0710 23:11:02.144046       1 httplog.go:132] "HTTP" verb="GET" URI="/openapi/v2" latency="18.651856ms" userAgent="" audit-ID="3d4f8945-2587-42c2-9703-3a10a6862d03" srcIP="10.4.1.72:52592" resp=304
  | Jul 11, 2023 @ 01:11:02.144 | I0710 23:11:02.144171       1 httplog.go:132] "HTTP" verb="GET" URI="/openapi/v3" latency="17.086093ms" userAgent="" audit-ID="d2813a34-fd4a-4bf0-a79e-d434a98a8cba" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:05.252 | I0710 23:11:05.252401       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="13.751949ms" userAgent="kube-controller-manager/v1.26.3 (linux/amd64) kubernetes/f18584a/system:serviceaccount:kube-system:resourcequota-controller" audit-ID="e25b658c-cdc2-4eb9-9b42-8e3c5dde004f" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:06.670 | I0710 23:11:06.670464       1 httplog.go:132] "HTTP" verb="GET" URI="/readyz" latency="221.919µs" userAgent="kube-probe/1.26" audit-ID="710f9c70-3312-4855-a44d-d88e5d548618" srcIP="10.4.8.122:51914" resp=200
  | Jul 11, 2023 @ 01:11:06.673 | I0710 23:11:06.673761       1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="164.413µs" userAgent="kube-probe/1.26" audit-ID="77c18635-cf67-4ae2-a901-6e9ed6568e06" srcIP="10.4.8.122:51928" resp=200
  | Jul 11, 2023 @ 01:11:06.848 | I0710 23:11:06.848846       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="14.929647ms" userAgent="kube-controller-manager/v1.26.3 (linux/amd64) kubernetes/f18584a/system:serviceaccount:kube-system:generic-garbage-collector" audit-ID="af777025-5f23-413c-8672-1ed69c616df0" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:08.502 | I0710 23:11:08.502363       1 httplog.go:132] "HTTP" verb="GET" URI="/apis/external.metrics.k8s.io/v1beta1" latency="14.735956ms" userAgent="kube-controller-manager/v1.26.3 (linux/amd64) kubernetes/f18584a/controller-discovery" audit-ID="c8b38271-99a6-4c7b-9fb7-dab401da7004" srcIP="10.4.1.72:52592" resp=200
  | Jul 11, 2023 @ 01:11:12.084 | I0710 23:11:12.084112       1 leaderelection.go:255] successfully acquired lease staging-keda-serverless/operator.keda.sh
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting EventSource	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
  | Jul 11, 2023 @ 01:11:12.084 | 2023-07-10T23:11:12Z	INFO	Starting Controller	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
  | Jul 11, 2023 @ 01:11:12.095 | 2023-07-10T23:11:12Z	INFO	grpc_server	Starting Metrics Service gRPC Server	{"address": ":9666"}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Reconciling ScaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-leak-detection-m","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-leak-detection-m", "reconcileID": "5bb6525a-17bb-4149-a142-1970bf22d248"}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	RolloutStrategy is deprecated, please us Rollout.Strategy in order to define the desired strategy for job rollouts	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-leak-detection-m","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-leak-detection-m", "reconcileID": "5bb6525a-17bb-4149-a142-1970bf22d248"}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
  | Jul 11, 2023 @ 01:11:12.488 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
  | Jul 11, 2023 @ 01:11:12.489 | 2023-07-10T23:11:12Z	INFO	Starting workers	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
  | Jul 11, 2023 @ 01:11:12.490 | 2023-07-10T23:11:12Z	INFO	"metricName" is deprecated and will be removed in v2.12, please do not set it anymore	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"appac-analysis-calculation-worker","namespace":"staging-appac-application"}, "namespace": "staging-appac-application", "name": "appac-analysis-calculation-worker", "reconcileID": "23148f01-9201-4373-a770-6d7d4b5bbcf7", "trigger.type": "prometheus"}
  | Jul 11, 2023 @ 01:11:12.490 | 2023-07-10T23:11:12Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"appac-analysis-calculation-worker","namespace":"staging-appac-application"}, "namespace": "staging-appac-application", "name": "appac-analysis-calculation-worker", "reconcileID": "23148f01-9201-4373-a770-6d7d4b5bbcf7"}
  | Jul 11, 2023 @ 01:11:13.296 | 2023-07-10T23:11:13Z	INFO	Initializing Scaling logic according to ScaledJob Specification	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-leak-detection-m","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-leak-detection-m", "reconcileID": "5bb6525a-17bb-4149-a142-1970bf22d248"}
  | Jul 11, 2023 @ 01:11:13.309 | 2023-07-10T23:11:13Z	INFO	Initializing Scaling logic according to ScaledObject Specification	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"appac-analysis-calculation-worker","namespace":"staging-appac-application"}, "namespace": "staging-appac-application", "name": "appac-analysis-calculation-worker", "reconcileID": "23148f01-9201-4373-a770-6d7d4b5bbcf7"}
  | Jul 11, 2023 @ 01:11:13.310 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-leak-detection-m", "scaledJob.Namespace": "staging-appd-application", "Number of running Jobs": 0}
  | Jul 11, 2023 @ 01:11:13.310 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-leak-detection-m", "scaledJob.Namespace": "staging-appd-application", "Number of pending Jobs ": 0}
  | Jul 11, 2023 @ 01:11:13.315 | 2023-07-10T23:11:13Z	INFO	Reconciling ScaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-toll-qa-routing-l","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-toll-qa-routing-l", "reconcileID": "1fac22ef-d1df-4c3b-b7bf-ea6196839a2e"}
  | Jul 11, 2023 @ 01:11:13.315 | 2023-07-10T23:11:13Z	INFO	RolloutStrategy is deprecated, please us Rollout.Strategy in order to define the desired strategy for job rollouts	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-toll-qa-routing-l","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-toll-qa-routing-l", "reconcileID": "1fac22ef-d1df-4c3b-b7bf-ea6196839a2e"}
  | Jul 11, 2023 @ 01:11:13.315 | 2023-07-10T23:11:13Z	INFO	Initializing Scaling logic according to ScaledJob Specification	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"appd-toll-qa-routing-l","namespace":"staging-appd-application"}, "namespace": "staging-appd-application", "name": "appd-toll-qa-routing-l", "reconcileID": "1fac22ef-d1df-4c3b-b7bf-ea6196839a2e"}
  | Jul 11, 2023 @ 01:11:13.320 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-toll-qa-routing-l", "scaledJob.Namespace": "staging-appd-application", "Number of running Jobs": 0}
  | Jul 11, 2023 @ 01:11:13.320 | 2023-07-10T23:11:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "appd-toll-qa-routing-l", "scaledJob.Namespace": "staging-appd-application", "Number of pending Jobs ": 0}
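
Since nothing stands out when reading the log by eye, a small summary per logger can make a comparison between 2.10.1 and 2.11.1 easier. The sketch below assumes the exported format shown above: a ` | `-separated timestamp prefix followed by tab-separated operator fields. The `summarize` helper is made up for this example, and the sample lines are abbreviated copies of entries from this log.

```python
from collections import Counter


def summarize(lines):
    """Count log lines by the field after the level (logger name, or the
    message itself when the logger is unnamed). Lines without tab-separated
    fields, such as klog output, are counted as 'other'."""
    counts = Counter()
    for line in lines:
        payload = line.split(" | ", 2)[-1]  # drop the export's timestamp column
        fields = payload.split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
        else:
            counts["other"] += 1
    return counts


sample = [
    "  | Jul 11, 2023 @ 01:10:54.809 | 2023-07-10T23:10:54Z\tINFO\tscaleexecutor\tScaling Jobs\t{}",
    "  | Jul 11, 2023 @ 01:10:55.773 | 2023-07-10T23:10:55Z\tINFO\tsetup\tStarting manager",
    "  | Jul 11, 2023 @ 01:10:55.874 | I0710 23:10:55.874494       1 leaderelection.go:245] attempting to acquire leader lease",
]

for key, count in summarize(sample).most_common():
    print(key, count)
```

Running the same summary over logs from both versions would show whether one logger became noticeably more active after the upgrade.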

KEDA Version

2.11.1

Kubernetes Version

1.26

Platform

Microsoft Azure

Scaler Details

Prometheus & Postgres

Anything else?

No response

Labels

bug (Something isn't working)

Status

Ready To Ship