Closed
Description
Search before asking
- I had searched in the issues and found no similar issues.
What happened
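When a SeaTunnel cluster receives a large volume of job submissions over a long period, the memory that cannot be reclaimed after a manually triggered GC keeps growing. Investigation found three memory leaks in CoordinatorService. The first two always occur; the third occurs with fairly high probability when the cluster is under heavy load.
- runningJobStateIMap: four types of keys are put into this IMap, but only three are ever removed. Checkpoint-related data is stored and never removed, so each job leaves one entry behind. We currently run 8000+ jobs per day, and monitoring shows the size of this IMap growing by 8000+ per day.
- pendingJobMasterMap in CoordinatorService: with a fixed slot count and the REJECT strategy, a job fails when resources are insufficient, but its entry in pendingJobMasterMap is not cleaned up. Every job that is submitted successfully but fails resource allocation therefore adds one entry. Monitoring shows the size of this map growing by 200+ per day.
- metricsImap: removing entries from this IMap requires acquiring a lock first. If the lock cannot be acquired, the code returns directly without throwing an exception, so the outer retry is never triggered. Under high cluster load this causes the IMap cleanup to fail. Monitoring shows the size of the map stored in this IMap growing by 40+ per day.
See the screenshots for the relevant code.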
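To make the three patterns concrete, here is a minimal, simplified model in plain Java. It is not SeaTunnel code: the class `LeakSketch`, its methods, the four key kinds, and the use of `ConcurrentHashMap` and `Semaphore` in place of Hazelcast IMaps and the distributed lock are all assumptions made for illustration. It only sketches the direction a fix could take: remove every key kind on cleanup, clear the pending entry on rejection, and throw when the lock cannot be acquired so that a retry can happen.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Simplified model of the three leak patterns. NOT SeaTunnel code:
// every name below is hypothetical and local maps stand in for IMaps.
class LeakSketch {
    // Hypothetical key kinds; the issue only says four kinds of keys are stored.
    static final String[] KEY_KINDS = {"job", "pipeline", "task", "checkpoint"};

    final Map<String, Object> runningJobState = new ConcurrentHashMap<>();
    final Map<Long, Object> pendingJobMasters = new ConcurrentHashMap<>();
    final Map<Long, Object> metrics = new ConcurrentHashMap<>();
    // A one-permit semaphore stands in for the lock guarding metrics cleanup.
    final Semaphore metricsLock = new Semaphore(1);

    // Records a submitted job in all three structures.
    void register(long jobId) {
        pendingJobMasters.put(jobId, new Object());
        metrics.put(jobId, new Object());
        for (String kind : KEY_KINDS) {
            runningJobState.put(kind + ":" + jobId, new Object());
        }
    }

    // Pattern 1: remove every key kind, including the checkpoint entry.
    void cleanJobState(long jobId) {
        for (String kind : KEY_KINDS) {
            runningJobState.remove(kind + ":" + jobId);
        }
    }

    // Pattern 2: a job rejected for lack of slots must also leave the pending map.
    void rejectForLackOfSlots(long jobId) {
        pendingJobMasters.remove(jobId);
        cleanJobState(jobId);
    }

    // Pattern 3: fail loudly when the lock is unavailable instead of returning.
    void removeMetrics(long jobId) {
        if (!metricsLock.tryAcquire()) {
            throw new IllegalStateException("metrics lock busy for job " + jobId);
        }
        try {
            metrics.remove(jobId);
        } finally {
            metricsLock.release();
        }
    }

    // The outer retry only works because removeMetrics throws on failure.
    void removeMetricsWithRetry(long jobId, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IllegalStateException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                removeMetrics(jobId);
                return;
            } catch (IllegalStateException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

In this model, a silent `return` in `removeMetrics` would make `removeMetricsWithRetry` exit after the first attempt with the entry still present, which matches the behavior described for metricsImap.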
SeaTunnel Version
2.3.10
SeaTunnel Config
seatunnel:
  engine:
    job-schedule-strategy: REJECT
    classloader-cache-mode: true
    history-job-expire-minutes: 1440
    telemetry:
      logs:
        scheduled-deletion-enable: false
    backup-count: 1
    print-execution-info-interval: 60
    print-job-metrics-info-interval: 60
    queue-type: blockingqueue
    slot-service:
      dynamic-slot: false
      slot-num: 50
      slot-allocation-strategy: SYSTEM_LOAD
    checkpoint:
      interval: 30000
      timeout: 2147483647
      max-concurrent: 5
      tolerable-failure: 2
      storage:
        type: oss
    http:
      enable-http: true
      port: 80
      context-path: /seatunnel
Running Command
sh /alidata1/za-seatunnel/seatunnel-2.3.10/bin/seatunnel-cluster.sh -d -r worker
Error Exception
None
Zeta or Flink or Spark Version
No response
Java or Scala Version
No response
Screenshots
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct