[Enhancement] [Bug] Incomplete StatefulSet configuration and concurrency risks in EventMesh Operator #5214

@qqeasonchen

Search before asking

  • I have searched the issues and found no similar issues.

Enhancement Request

Description
The current implementation of eventmesh-operator has several critical issues regarding Kubernetes resource management and internal concurrency safety, which may lead to deployment failures
or unstable behavior in production environments.

Issues Identified

  1. Missing Headless Service for StatefulSets

    • Problem: The operator creates StatefulSet resources for both Runtime and Connectors but fails to create the corresponding Headless Service. It also does not set the serviceName
      field in the StatefulSet spec.
    • Impact: Pods managed by the StatefulSet will not have stable network identities (DNS entries like pod-0.service-name.namespace.svc.cluster.local), which is a core feature of
      StatefulSets and essential for cluster communication.
  2. Unsafe Global State Usage

    • Problem: A global variable IsEventMeshRuntimeInitialized in share/share.go is used to track runtime readiness.
    • Impact: This design is not thread-safe and breaks in multi-tenant or multi-cluster scenarios (e.g., managing multiple EventMesh clusters in different namespaces). It causes race
      conditions and incorrect dependency checks.
  3. Hardcoded Replica Logic

    • Problem: The RuntimeReconciler hardcodes Replicas to 1 in some paths, potentially ignoring the replicaPerGroup configuration defined in the CRD.
  4. Blocking Operations

    • Problem: The controller uses time.Sleep() for retries or waiting.
    • Impact: This blocks the reconcile worker for the full sleep duration, so other queued objects wait and the operator's throughput and responsiveness drop. The controller should return
      reconcile.Result{RequeueAfter: ...} instead.
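
To illustrate issues 2 and 4 together, here is a minimal sketch, not the operator's actual code. It assumes readiness is looked up per namespace/name rather than read from one global flag, and that a "not ready" result is turned into a requeue instead of a sleep. `Result` is a local stand-in for controller-runtime's `reconcile.Result`; `runtimeKey`, `readinessLookup`, `reconcileConnector`, and the 5-second `requeueDelay` are hypothetical names and values chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// requeueDelay is an illustrative value, not taken from the operator.
const requeueDelay = 5 * time.Second

// Result is a local stand-in for controller-runtime's reconcile.Result,
// reduced to the one field used in this sketch.
type Result struct {
	RequeueAfter time.Duration
}

// runtimeKey identifies one Runtime resource. Keying by namespace and name
// keeps clusters in different namespaces from sharing readiness state.
type runtimeKey struct {
	Namespace string
	Name      string
}

// readinessLookup is a hypothetical hook. A real operator would query the
// Kubernetes API for the Runtime resource's status here.
type readinessLookup func(k runtimeKey) bool

// reconcileConnector never sleeps. If the Runtime is not ready it returns a
// Result asking to be called again after requeueDelay.
func reconcileConnector(lookup readinessLookup, runtime runtimeKey) Result {
	if !lookup(runtime) {
		return Result{RequeueAfter: requeueDelay}
	}
	return Result{}
}

func main() {
	// Two independent clusters: one ready, one not.
	status := map[runtimeKey]bool{
		{Namespace: "team-a", Name: "runtime"}: true,
		{Namespace: "team-b", Name: "runtime"}: false,
	}
	lookup := func(k runtimeKey) bool { return status[k] }

	for _, ns := range []string{"team-a", "team-b"} {
		res := reconcileConnector(lookup, runtimeKey{Namespace: ns, Name: "runtime"})
		fmt.Printf("%s: requeue after %v\n", ns, res.RequeueAfter)
	}
}
```

With a single global boolean, the two namespaces above would overwrite each other's state. With the keyed lookup, each connector only sees the readiness of its own Runtime.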

Describe the solution you'd like

Proposed Fixes

  1. Refactor Controllers:

    • Implement logic to automatically create a Headless Service (ClusterIP: None) for each StatefulSet.
    • Ensure the StatefulSet.Spec.ServiceName matches the created Service.
  2. Remove Global State:

    • Delete IsEventMeshRuntimeInitialized.
    • Update ConnectorsReconciler to dynamically query the Kubernetes API for Runtime resource status to determine readiness.
  3. Enhance Robustness:

    • Use the replica values from the CR spec (replicaPerGroup) instead of the hardcoded 1.
    • Replace blocking sleeps with non-blocking requeue mechanisms.
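
As a rough sketch of fixes 1 and 3, the example below builds a headless Service and a StatefulSet that point at each other and takes the replica count from the spec. The structs are simplified stand-ins for the relevant fields of `corev1.Service` and `appsv1.StatefulSet`, not the real API types. `RuntimeSpec`, `buildRuntimeObjects`, `podDNS`, the `-headless` suffix, and the `app` label are assumptions made for illustration, and `cluster.local` is assumed as the cluster domain.

```go
package main

import "fmt"

// Service mirrors only the fields of corev1.Service that matter here.
type Service struct {
	Name      string
	Namespace string
	ClusterIP string // "None" makes the Service headless
	Selector  map[string]string
}

// StatefulSet mirrors only the fields of appsv1.StatefulSet that matter here.
type StatefulSet struct {
	Name        string
	Namespace   string
	ServiceName string // must name the governing headless Service
	Replicas    int32
	Labels      map[string]string
}

// RuntimeSpec is a hypothetical slice of the CR spec. Only the
// replicaPerGroup setting is named in the issue.
type RuntimeSpec struct {
	ReplicaPerGroup int32
}

// buildRuntimeObjects derives both objects from the same inputs so the
// Service name and StatefulSet.ServiceName cannot drift apart.
func buildRuntimeObjects(namespace, name string, spec RuntimeSpec) (Service, StatefulSet) {
	labels := map[string]string{"app": name}
	svc := Service{
		Name:      name + "-headless", // naming scheme is an assumption
		Namespace: namespace,
		ClusterIP: "None",
		Selector:  labels,
	}
	sts := StatefulSet{
		Name:        name,
		Namespace:   namespace,
		ServiceName: svc.Name,
		Replicas:    spec.ReplicaPerGroup, // taken from the spec, not hardcoded
		Labels:      labels,
	}
	return svc, sts
}

// podDNS returns the stable DNS name of the pod with the given ordinal,
// assuming the default cluster domain.
func podDNS(sts StatefulSet, ordinal int) string {
	return fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local",
		sts.Name, ordinal, sts.ServiceName, sts.Namespace)
}

func main() {
	svc, sts := buildRuntimeObjects("prod", "eventmesh-runtime", RuntimeSpec{ReplicaPerGroup: 3})
	fmt.Println("service:", svc.Name, "clusterIP:", svc.ClusterIP)
	fmt.Println("replicas:", sts.Replicas)
	fmt.Println("pod 0:", podDNS(sts, 0))
}
```

In the real controllers the same idea would apply to the actual Kubernetes types: create the Service with `ClusterIP: None` before or alongside the StatefulSet, and set `Spec.ServiceName` from the Service's name rather than from a separately maintained string.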

Environment

  • EventMesh Version: (Current Master)
  • Kubernetes Version: Any

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct


Labels: enhancement
