[Bug] [seatunnel-engine] Memory Leak #9637

@liangcw1111

Description

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

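When a SeaTunnel cluster receives a large number of job submissions over a long period, the memory that remains uncollectable after a manually triggered GC keeps growing. Investigation found three memory leaks in CoordinatorService. The first two always occur; the third occurs with a fairly high probability when the cluster is under heavy load.

  1. runningJobStateIMap: four types of keys are put into this IMap, but only three are removed. Checkpoint-related data is stored and never removed, so each job leaves one entry behind. We currently run 8000+ jobs per day, and monitoring shows the size of this IMap growing by 8000+ per day.
  2. pendingJobMasterMap in CoordinatorService: with fixed slots and the REJECT strategy configured, a job fails when resources are insufficient, but pendingJobMasterMap is not cleaned up. Every job that is submitted successfully but then fails resource allocation adds one entry. Monitoring shows the size of this map growing by 200+ per day.
  3. metricsImap: removing entries from this IMap requires acquiring a lock first. If lock acquisition fails, the method simply returns without throwing an exception, so the outer retry is never triggered. Under high cluster load this causes cleanup of the IMap data to fail. Monitoring shows the size of the map stored in this IMap growing by 40+ per day.
    See the screenshots for the relevant code.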
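
Since the screenshots are not reproduced here, the three patterns above can be sketched in plain Java. This is not the SeaTunnel source: `ConcurrentHashMap` stands in for the Hazelcast IMaps, and the key kinds, method names, and the `ReentrantLock` are all hypothetical, chosen only to illustrate one possible shape of each fix.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative stand-in only; every name except the three map names is an assumption.
class CoordinatorLeakSketch {
    // Hypothetical key kinds: the report says four kinds are stored, one of them checkpoint-related.
    static final String[] KEY_KINDS = {"job", "pipeline", "task", "checkpoint"};

    final Map<String, Object> runningJobStateIMap = new ConcurrentHashMap<>();
    final Map<Long, Object> pendingJobMasterMap = new ConcurrentHashMap<>();
    final Map<Long, Object> metricsImap = new ConcurrentHashMap<>();
    final ReentrantLock metricsLock = new ReentrantLock();

    void putJobState(long jobId) {
        for (String kind : KEY_KINDS) {
            runningJobStateIMap.put(kind + ":" + jobId, "state");
        }
    }

    // Leak 1 fix: remove every key kind that was stored, including the checkpoint one.
    void removeJobState(long jobId) {
        for (String kind : KEY_KINDS) {
            runningJobStateIMap.remove(kind + ":" + jobId);
        }
    }

    // Leak 2 fix: the pending entry is removed even when the job is rejected.
    void schedule(long jobId, boolean slotsAvailable) {
        pendingJobMasterMap.put(jobId, "jobMaster");
        try {
            if (!slotsAvailable) {
                throw new IllegalStateException("no free slots, job " + jobId + " rejected");
            }
            // A real scheduler would hand the job over to the running state here.
        } finally {
            pendingJobMasterMap.remove(jobId);
        }
    }

    // Leak 3 fix: a failed lock attempt throws instead of returning silently,
    // so the caller's retry logic can see the failure.
    void removeMetrics(long jobId) {
        if (!metricsLock.tryLock()) {
            throw new IllegalStateException("could not lock metrics for job " + jobId);
        }
        try {
            metricsImap.remove(jobId);
        } finally {
            metricsLock.unlock();
        }
    }

    // Hypothetical outer retry: only works because removeMetrics reports failure.
    void removeMetricsWithRetry(long jobId, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                removeMetrics(jobId);
                return;
            } catch (IllegalStateException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }
}
```

The common thread is that every code path that inserts an entry needs a matching removal that either runs unconditionally or reports its failure, so that a per-job entry cannot silently outlive the job.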

SeaTunnel Version

2.3.10

SeaTunnel Config

seatunnel:
  engine:
    job-schedule-strategy: REJECT
    classloader-cache-mode: true
    history-job-expire-minutes: 1440
    telemetry:
      logs:
        scheduled-deletion-enable: false
    backup-count: 1
    print-execution-info-interval: 60
    print-job-metrics-info-interval: 60
    queue-type: blockingqueue
    slot-service:
      dynamic-slot: false
      slot-num: 50
      slot-allocation-strategy: SYSTEM_LOAD
    checkpoint:
      interval: 30000
      timeout: 2147483647
      max-concurrent: 5
      tolerable-failure: 2
      storage:
        type: oss
       
    http:
      enable-http: true
      port: 80
      context-path: /seatunnel

Running Command

sh /alidata1/za-seatunnel/seatunnel-2.3.10/bin/seatunnel-cluster.sh -d -r worker

Error Exception

Zeta or Flink or Spark Version

No response

Java or Scala Version

No response

Screenshots

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
