LogScheduler: fix IO job starvation#906

Merged
MrAnno merged 4 commits into axoflow:main from alltilla:logscheduler-fix-starvation
Jan 15, 2026

@alltilla
Member

@alltilla alltilla commented Jan 14, 2026

We can have more partitions (IO jobs), coming from one or multiple parallelize() calls, than the number of cores/IO workers available.

The LogScheduler correctly spreads the messages between partitions, but we can only run a limited number of IO jobs at the same time.

The work() method is implemented in a way that new batches can be added to its workload while it is running. Also, it does not return as long as there is even one batch available for processing. Under heavy message load, a running IO job therefore runs indefinitely, or at least until new batches are no longer added to its queue.

If some IO jobs are scheduled but have not run yet, new batches are still added to those non-running jobs, but only the batches of the running jobs get processed.

This goes on until log-iw-size() messages have accumulated in the non-running jobs. At that point the running jobs stop receiving new logs and eventually return, which finally lets the previously non-running jobs run. So the only things that save us from complete starvation are log-iw-size() and backpressure, which is not the intended behavior.
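
The starvation pattern described above can be reproduced with a toy model. The sketch below is not the LogScheduler code: it assumes a single IO worker, round-robin assignment of messages to partitions, and a single shared window standing in for log-iw-size(). The function name and all parameters are hypothetical.

```python
from collections import deque


def simulate_unbounded(total_messages, partitions=2, iw_size=10):
    """Toy model: one IO worker, several partitions, unbounded work().

    Returns (runs, peak_waiting_backlog), where runs is a list of
    (partition_index, messages_processed_in_that_run).
    """
    queues = [deque() for _ in range(partitions)]
    ready = deque()        # scheduled jobs waiting for the worker
    produced = 0
    next_partition = 0     # round-robin assignment (assumed even spread)
    running = None
    peak_waiting_backlog = 0

    def feed():
        """The source keeps assigning messages while the window has room."""
        nonlocal produced, next_partition, peak_waiting_backlog
        while produced < total_messages and sum(map(len, queues)) < iw_size:
            p = next_partition
            queues[p].append(produced)
            produced += 1
            next_partition = (next_partition + 1) % partitions
            if p != running and p not in ready:
                ready.append(p)
        # Track how many messages pile up in jobs that are not running.
        waiting = [len(q) for i, q in enumerate(queues) if i != running]
        peak_waiting_backlog = max(peak_waiting_backlog, max(waiting, default=0))

    runs = []
    feed()
    while ready:
        running = ready.popleft()
        processed = 0
        # Unbounded work(): keep going while anything is queued for this job.
        while queues[running]:
            queues[running].popleft()
            processed += 1
            feed()  # new messages arrive while the job is still running
        runs.append((running, processed))
        running = None
    return runs, peak_waiting_backlog


runs, peak = simulate_unbounded(100, partitions=2, iw_size=10)
print(runs[:3], peak)
```

In this model the first job only returns once the whole window is occupied by messages queued for the waiting partition, so the waiting backlog peaks at the window size.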

We need to put a hard limit on the number of logs processed by one job run, so ivykis can work its magic and cycle between the scheduled IO jobs. We now count the number of logs processed in one work() run, and if the count reaches a configurable limit, we finish the current batch and return from work(). The new log-fetch-limit() option of parallelize() changes this limit from its default value of 1000.
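
A minimal sketch of the bounded loop, assuming one worker and a simple FIFO of scheduled jobs. The Job class, run_single_worker() and the batch sizes are hypothetical illustrations, not the actual C implementation. The limit is checked only after a batch is finished, so a run may slightly overshoot it.

```python
from collections import deque


class Job:
    """Hypothetical stand-in for one partition's IO job."""

    def __init__(self, name, fetch_limit=1000):
        self.name = name
        self.fetch_limit = fetch_limit
        self.batches = deque()

    def work(self):
        """Process queued batches, but return once the limit is reached."""
        processed = 0
        while self.batches:
            batch = self.batches.popleft()
            processed += len(batch)      # "process" every message in the batch
            if processed >= self.fetch_limit:
                break                    # batch finished, give up the worker
        return processed


def run_single_worker(jobs):
    """Cycle through scheduled jobs on one worker and record each run."""
    ready = deque(jobs)
    order = []
    while ready:
        job = ready.popleft()
        order.append((job.name, job.work()))
        if job.batches:                  # still has work: reschedule at the back
            ready.append(job)
    return order


# Two jobs, three batches of four messages each, limit of five per run.
jobs = [Job("a", fetch_limit=5), Job("b", fetch_limit=5)]
for job in jobs:
    for _ in range(3):
        job.batches.append([0, 1, 2, 3])

order = run_single_worker(jobs)
print(order)  # [('a', 8), ('b', 8), ('a', 4), ('b', 4)]
```

With the limit in place the two jobs alternate instead of one draining its whole queue first.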

I have also added some level 4 metrics that can help debug parallelize() errors.

syslogng_parallelized_assigned_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="0"} 614700
syslogng_parallelized_assigned_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="1"} 614661
syslogng_parallelized_assigned_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="2"} 614600
syslogng_parallelized_assigned_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="3"} 614600
syslogng_parallelized_processed_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="0"} 614513
syslogng_parallelized_processed_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="1"} 614500
syslogng_parallelized_processed_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="2"} 614500
syslogng_parallelized_processed_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="3"} 614500
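
One way to read these counters is to subtract processed from assigned for each partition. The sketch below does that for the sample above. Treating the difference as "assigned but not yet processed" is an assumption based on the metric names, and the helper name is hypothetical.

```python
import re
from collections import defaultdict

METRICS = """\
syslogng_parallelized_assigned_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="0"} 614700
syslogng_parallelized_assigned_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="1"} 614661
syslogng_parallelized_assigned_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="2"} 614600
syslogng_parallelized_assigned_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="3"} 614600
syslogng_parallelized_processed_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="0"} 614513
syslogng_parallelized_processed_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="1"} 614500
syslogng_parallelized_processed_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="2"} 614500
syslogng_parallelized_processed_events_total{id="/home/alltilla/repos/axosyslog/build/install/etc/syslog-ng.conf:6:3",partition_index="3"} 614500
"""

ASSIGNED = "syslogng_parallelized_assigned_events_total"
PROCESSED = "syslogng_parallelized_processed_events_total"

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\d+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def backlog_per_partition(text):
    """Return {(id, partition_index): assigned - processed}."""
    counters = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels))
        key = (labels.get("id"), labels.get("partition_index"))
        counters[key][name] = int(value)
    # Only report partitions for which both counters are present.
    return {
        key: c[ASSIGNED] - c[PROCESSED]
        for key, c in counters.items()
        if ASSIGNED in c and PROCESSED in c
    }


backlog = backlog_per_partition(METRICS)
for (_, partition), pending in sorted(backlog.items(), key=lambda kv: int(kv[0][1])):
    print(f"partition {partition}: {pending} pending")
```

For the sample above this gives 187, 161, 100 and 100 pending events for partitions 0 to 3, a roughly even backlog across partitions.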

alltilla added a commit to alltilla/axosyslog that referenced this pull request Jan 14, 2026
Signed-off-by: Attila Szakacs <[email protected]>
@alltilla alltilla force-pushed the logscheduler-fix-starvation branch from fe4959e to 51624d9 Compare January 14, 2026 16:11
@alltilla alltilla changed the title LogScheduler: fix starvation LogScheduler: fix IO job starvation Jan 14, 2026
alltilla added a commit to alltilla/axosyslog that referenced this pull request Jan 14, 2026
Signed-off-by: Attila Szakacs <[email protected]>
@alltilla alltilla force-pushed the logscheduler-fix-starvation branch from 51624d9 to 5051e33 Compare January 14, 2026 16:41
@alltilla alltilla requested review from MrAnno and bazsi and removed request for MrAnno January 14, 2026 16:44

Signed-off-by: Attila Szakacs <[email protected]>
alltilla added a commit to alltilla/axosyslog that referenced this pull request Jan 15, 2026
Signed-off-by: Attila Szakacs <[email protected]>
@alltilla alltilla force-pushed the logscheduler-fix-starvation branch from 5051e33 to b5a6e2e Compare January 15, 2026 08:37
MrAnno
MrAnno previously approved these changes Jan 15, 2026
Contributor

@MrAnno MrAnno left a comment

Nice catch.

We can have more partitions (IO jobs), coming from
one or multiple parallelize() calls, than the
number of cores/IO workers available.

The LogScheduler correctly spreads the messages
between partitions, but we can only run a limited
number of IO jobs at the same time.

The work() method is implemented in a way that
new batches can be added to its workload while it
is running. Also, it does not return as long as
there is even one batch available for processing.
Under heavy message load, a running IO job
therefore runs indefinitely, or at least until new
batches are no longer added to its queue.

If some IO jobs are scheduled but have not run
yet, new batches are still added to those
non-running jobs, but only the batches of the
running jobs get processed.

This goes on until log-iw-size() logs have
accumulated in the non-running jobs. At that point
the running jobs stop receiving new logs and
eventually return, which finally lets the
previously non-running jobs run.
So the only things that save us from complete
starvation are log-iw-size() and backpressure,
which is not the intended behavior.

We need to put a hard limit on the number of logs
processed by one job run, so ivykis can work its
magic and cycle between the scheduled IO jobs.
We now count the number of logs processed in
one work() run, and if the count reaches a
configurable limit, we finish the current batch
and return from work().
The new log-fetch-limit() option of parallelize()
changes this limit from its default value of 1000.

Signed-off-by: Attila Szakacs <[email protected]>
Signed-off-by: Attila Szakacs <[email protected]>
@alltilla alltilla marked this pull request as draft January 15, 2026 10:04
@alltilla alltilla force-pushed the logscheduler-fix-starvation branch from 84d8eed to 04850eb Compare January 15, 2026 10:28
@alltilla alltilla marked this pull request as ready for review January 15, 2026 10:39
@MrAnno MrAnno merged commit ad0be26 into axoflow:main Jan 15, 2026
21 of 22 checks passed