The Linux perf Event Scheduling Algorithm


One of the first things that anyone learns about hardware performance monitoring on modern Intel processors is that there are three fixed-function counters and four general-purpose counters per logical core. As long as there are sufficient counters for all the events that need to be measured simultaneously, assigning one counter to each event is straightforward. However, if there are more events than counters, multiplexing occurs, where different events are measured in different time intervals. In the perf stat tool, if you see a column of percentages on the right-hand side of the output, it means that multiplexing has been used to measure the given events. But have you ever looked at an output and said “WTF, why is there multiplexing?”

Here are a few perf stat commands you can try to see whether the output matches your expectations. The event codes for the events used on microarchitectures older than Skylake are the same as those on Skylake and later; it’s just that the names of some of the events are slightly different. Try running these commands on different microarchitectures with hyperthreading enabled or disabled. You can also try systems with the same microarchitecture but different kernel versions. Does multiplexing occur as you expect? Do you see <not counted> or <not supported> next to some of the events instead of actual event counts? Note that, in this article, the event counts as numerical values are not of interest; what matters is how events are scheduled. The exact meaning of each event also doesn’t matter.

Sandy Bridge up to and including Broadwell
------------------------------------------
perf stat -e l1d_pend_miss.pending,cycle_activity.stalls_l1d_pending dd if=/dev/zero of=/dev/null count=1000000
perf stat -e l1d_pend_miss.pending,cycle_activity.stalls_l1d_pending:D dd if=/dev/zero of=/dev/null count=1000000
perf stat -e '{l1d_pend_miss.pending,faults}',cycle_activity.stalls_l1d_pending:D,mem_uops_retired.all_loads dd if=/dev/zero of=/dev/null count=1000000
perf stat -e mem_load_uops_retired.l1_hit,mem_load_uops_retired.l1_miss,mem_load_uops_retired.l2_hit dd if=/dev/zero of=/dev/null count=1000000
perf stat -e mem_load_uops_retired.l1_hit,mem_load_uops_retired.l1_miss,mem_load_uops_retired.hit_lfb,mem_load_uops_retired.l2_hit,mem_load_uops_retired.l3_hit dd if=/dev/zero of=/dev/null count=1000000
Haswell and Broadwell
---------------------
perf stat -e dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_4k,dtlb_store_misses.walk_completed,dtlb_store_misses.walk_completed_4k,itlb_misses.walk_completed,itlb_misses.walk_completed_4k dd if=/dev/zero of=/dev/null count=1000000
perf stat -e '{dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_4k,dtlb_store_misses.walk_completed,dtlb_store_misses.walk_completed_4k,itlb_misses.walk_completed,itlb_misses.walk_completed_4k}' dd if=/dev/zero of=/dev/null count=1000000
perf stat -e instructions,dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_4k,dtlb_store_misses.walk_completed,dtlb_store_misses.walk_completed_4k,itlb_misses.walk_completed,itlb_misses.walk_completed_4k sleep 1
Skylake up to and including Sunny Cove
--------------------------------------
perf stat -e l1d_pend_miss.pending,cycle_activity.stalls_l1d_miss dd if=/dev/zero of=/dev/null count=1000000
perf stat -e l1d_pend_miss.pending,cycle_activity.stalls_l1d_miss:D dd if=/dev/zero of=/dev/null count=1000000
perf stat -e '{l1d_pend_miss.pending,faults}',cycle_activity.stalls_l1d_miss:D,mem_inst_retired.all_loads dd if=/dev/zero of=/dev/null count=1000000
perf stat -e mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.l2_hit dd if=/dev/zero of=/dev/null count=1000000
perf stat -e mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.fb_hit,mem_load_retired.l2_hit,mem_load_retired.l3_hit dd if=/dev/zero of=/dev/null count=1000000
perf stat -e dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_4k,dtlb_store_misses.walk_completed,dtlb_store_misses.walk_completed_4k,itlb_misses.walk_completed,itlb_misses.walk_completed_4k dd if=/dev/zero of=/dev/null count=1000000
perf stat -e '{dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_4k,dtlb_store_misses.walk_completed,dtlb_store_misses.walk_completed_4k,itlb_misses.walk_completed,itlb_misses.walk_completed_4k}' dd if=/dev/zero of=/dev/null count=1000000
perf stat -e instructions,dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_4k,dtlb_store_misses.walk_completed,dtlb_store_misses.walk_completed_4k,itlb_misses.walk_completed,itlb_misses.walk_completed_4k sleep 1

In some of these commands, I’ve used the ‘{…}’ syntax, which means that all the events in the brackets should be scheduled together as a group without any multiplexing among them, but the group as a whole can still be multiplexed with other groups. The happy face modifier (colon followed by D) means that the event (or event group) should be pinned, i.e., it’s not subject to multiplexing and should always be measured. Both of these features are discussed in the perf list manual. The dd tool copies the specified number of blocks from one file to another. It doesn’t really matter what it does; I just needed some program with a minimal amount of effort.

Once you know the perf event scheduling algorithm, you’ll be able to understand all the possible outputs for each of the commands above and any other command with a potentially more complex set of events. So I’ll first discuss the algorithm and then explain the outputs at the end. I apologize beforehand for not making any figures, which would have made the article a bit easier to follow. This article covers AMD processors as well, but there are currently no AMD examples; if you’d like to discuss with me some AMD examples that you have produced, let me know in the comments.

Introduction

The Linux perf subsystem consists of a kernel component, called perf_event, which is exposed to userspace through the perf_event_open system call, and a collection of userspace tools that can be used to count or capture samples of one or more events of interest. An event is any operation that is performed by the operating system or the hardware. An event that is counted by the OS is called a software event. The OS can allocate a counter in memory and increment the counter whenever the event occurs. For example, the number of page faults can be counted in the page fault handler. Since page faults are raised by the hardware on x86 processors, they can also be counted by the processor. However, most events can only be counted either in hardware or in software, not both. For example, only the OS can count minor and major page faults separately. On the other hand, events such as L1 data cache accesses and branch mispredictions can only be counted by the hardware. These events and many others are important because they can have a significant impact on performance and/or energy consumption.

The OS can create one counter for each software event, resulting in a one-to-one correspondence between events and counters. This is possible because the number of OS event counters is practically unlimited. Each event can simply be counted on the counter allocated for it. In order for events to be effectively used for performance monitoring and analysis, the process of counting events itself should have no or negligible impact on performance. The cost of incrementing a counter on a page fault is negligible compared to that of the page fault itself. Unfortunately, this approach of allocating one counter for each event doesn’t work for hardware events because the cost of most hardware events is typically no larger than a few hundred cycles. Therefore, performing a load-modify-store operation whenever an event occurs to maintain the event’s counter is prohibitively expensive. Instead, hardware events are counted using a small number of counters and associated logic in the hardware.

From the perspective of the perf subsystem, the hardware resources used to maintain the counts for a set of events are collectively called a performance monitoring unit (PMU). A processor may offer multiple PMUs where each PMU has its own independent resources and supports a different set of events. Intel processors since Nehalem offer at least one PMU for each logical core and one uncore PMU. The number of counters in a PMU is typically much smaller than the total number of events supported by that PMU. In general, each counter has an associated control register that can be programmed to specify which event is counted on that counter. Some counters, called fixed-function counters, can only be used to count a single event, but they may still offer a limited amount of programmability, such as whether to count kernel-mode or user-mode occurrences of the event. Similarly, some events may have constraints regarding which counters can be used to count them. Given a set of hardware events to count, a counter needs to be assigned to each event such that the constraints of the event are satisfied. It may happen that there is no valid assignment of events to counters due to conflicts (e.g., two events require the same counter). This situation can be resolved using event multiplexing, whereby the PMU is programmed so that different events can be counted at different points in time. The task of finding a valid assignment for a given set of events is called the event scheduling problem. The one-to-one correspondence between software events and counters makes this problem trivial for software events.

Some tools leave it up to the user to correctly assign hardware events to counters. Consider for example likwid-perfctr, where the user has to specify the counter for each event that needs to be measured by appending :PMCn or :FIXCn, which represent a general-purpose (GP) counter and a fixed-function (FP) counter, respectively, to the event name. likwid-perfctr is aware of event constraints (well, most of them), though, and it will let you know if the schedule you came up with violates the constraints of one or more events. However, it will not tell you whether and how to schedule the events. likwid-perfctr uses, by default, the msr-safe or the msr kernel module to program the counters. Linux perf, on the other hand, implements an event scheduling algorithm, and it offers no option (through the perf_event_open interface) to specify an event schedule. I’ll discuss in this article the Linux perf event scheduling algorithms used in all Linux kernel versions up to 5.3-rc7, which is the most recent at the time of writing.

Event Groups

The unit of scheduling in perf is not an individual event, but rather an event group, which may contain one or more events (potentially on different PMUs). The notion of an event group is useful for ensuring that a set of mathematically related events are all simultaneously measured for the same period of time. For example, the number of L1 cache misses should not be larger than the number of L2 cache accesses. Otherwise, it may happen that the events get multiplexed and their measurements would no longer be comparable, making the analysis more difficult.

An event group consists of a leading event and zero or more sibling events. When perf_event_open is called to add an event to the event context specified by the cpu, pid, and type arguments, whether this event is added to a new group or an existing group is determined by the group_fd argument. If group_fd is -1, the event is considered to be a leading event and a new group is created for the event. The event is added to the group and the new group is added to the event context. Otherwise, if group_fd specifies a valid group file descriptor, the event is considered as a sibling event and is therefore added to the same group specified by group_fd. Every successful call to perf_event_open results in the addition of an event to the specified event context. There are two lists of event groups in an event context, one for pinned event groups and the other for flexible event groups. A pinned event group is created by setting the pinned field of the perf_event_attr argument to 1 when adding the leading event of the group, which tells perf that the event group should never be scheduled out due to multiplexing.
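The group mechanics described above can be sketched in C. This is a minimal illustration, not production code: the function name and error handling are hypothetical, and the calls may fail on systems with a restrictive perf_event_paranoid setting, which the sketch tolerates by returning -1.

```c
/* Sketch: creating an event group through perf_event_open. The leader
 * is created with group_fd == -1; the sibling passes the leader's fd
 * as group_fd, as described above. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return (int)syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int open_cycle_instruction_group(void)
{
    struct perf_event_attr attr;
    int leader_fd, sibling_fd;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;     /* enable the whole group later via ioctl */
    /* attr.pinned = 1; would make this a pinned group */

    /* group_fd == -1: this event becomes the leader of a new group */
    leader_fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (leader_fd < 0)
        return -1;         /* e.g., perf_event_paranoid too strict */

    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 0;     /* siblings follow the leader's state */
    /* group_fd == leader_fd: add this event as a sibling */
    sibling_fd = perf_event_open(&attr, 0, -1, leader_fd, 0);
    if (sibling_fd < 0) {
        close(leader_fd);
        return -1;
    }
    close(sibling_fd);
    close(leader_fd);
    return 0;
}
```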

When the leading event of a group is about to be created, if it’s a hardware event, a check is performed to ensure that the constraint of the event makes it possible to schedule the event, i.e., there is at least one counter on which the event can be counted. This is done at a CPU vendor-specific layer (Intel or AMD) of the perf subsystem because it is at this layer that the constraints are defined. If this check fails, perf_event_open returns EINVAL and the event group is not created. When a sibling event is added to an existing group, a check is performed to ensure that the new event together with all the existing events in the group that are on the same PMU as the new event can be scheduled together on that PMU. The schedulability of the event group on other PMUs has already been verified when the last event of the group on each PMU was added to the group. The new event only affects the schedule on its PMU, so that’s the only validation check that needs to be performed. Only the leader event and enabled sibling events participate in this check; sibling events created with the disabled bit of perf_event_attr set to 1 are not considered. The event group validation algorithm will be discussed later with the scheduling algorithm. If a valid schedule with the new event was not found (e.g., the number of events has become larger than the total number of GP and FP counters available), perf_event_open returns EINVAL and the event is not added to the group. There are many other checks that are performed; I only mentioned the ones that are most relevant to event scheduling.

The NMI Watchdog

The NMI watchdog is a mechanism that detects hard lockups and may perform some remedial action accordingly. By default, it’s enabled on each online logical core in the system. However, it can be disabled via boot-time and run-time kernel parameters. A hardware event with the following attributes (only the ones that are relevant are shown) gets created and added to the event context of each logical core on which the NMI watchdog is enabled.

static struct perf_event_attr wd_hw_attr = {
	.type		= PERF_TYPE_HARDWARE,
	.config		= PERF_COUNT_HW_CPU_CYCLES,
	.pinned		= 1,
};

The event PERF_COUNT_HW_CPU_CYCLES is implemented on Intel and AMD processors as the unhalted core cycles event, which can be scheduled on one of the FP counters and any of the GP counters. If the NMI watchdog is enabled at boot-time, it’s likely that the FP counter will be allocated for it (this will be clarified later). Since the event is pinned, the counter will essentially become unavailable for other potential uses. If the NMI watchdog gets disabled and later enabled dynamically, I think it would still get the FP counter because it’s the first pinned group to be scheduled in the list of pinned groups. The only situation where I think a GP counter may get assigned to the NMI watchdog is when it’s disabled at boot-time and then enabled later. Either way, when discussing the scheduling algorithm, keep in mind that one of the counters might be reserved for the NMI watchdog on each logical core. There is a proposal to implement the NMI watchdog using a chipset-based timer instead of permanently consuming a precious performance counter, but it has not been accepted yet in the kernel.

The Scheduling Algorithm

Before discussing the scheduling algorithm itself, the “when” and “what” questions need to be addressed first. That is, when to perform event scheduling and what events to consider for scheduling (i.e., the input to the algorithm). An event scheduling operation is initiated in the following cases:

  • On a context switch from one task to a different task, there may be one or more enabled events associated with the next task. The collection of events that need to be considered for scheduling could be different and so a new schedule has to be constructed. Note that the events associated with the previous task need to be disabled. However, only these particular events are disabled; all other events continue to be enabled and use the same schedule, unless the next task adds more events.
  • When an event, a group leader or otherwise, is manually enabled. There are five ways by which an event gets manually enabled: (1) performing an ioctl system call on a perf_event file descriptor with the PERF_EVENT_IOC_ENABLE flag, (2) performing a prctl system call on an event that is attached to the caller’s event context with the PR_TASK_PERF_EVENTS_ENABLE flag, (3) performing an ioctl system call on a perf_event file descriptor with the PERF_EVENT_IOC_REFRESH flag, (4) a call to exec enables all events that were created with enable_on_exec set to 1 and associated with the current task, and (5) performing an ioctl system call on a perf_event file descriptor with the PERF_EVENT_IOC_MODIFY_ATTRIBUTES flag. Event scheduling is only initiated for the PMUs to which the newly enabled events belong.
  • A call to perf_event_open with the disabled field of the perf_event_attr argument set to 0, which indicates that the new event is enabled and must be scheduled immediately on its PMU. If the event is associated with a task (i.e., the pid argument is nonnegative), event scheduling is only triggered if the task is currently running on at least one logical core. This is achieved by sending an interrupt to each core on which the task is running. Otherwise, the event is scheduled on the next context switch for the task.
  • When a timer interrupt expires for event multiplexing, which is needed when the events enabled on a logical core per PMU cannot all be simultaneously scheduled.
  • When ptrace sets or modifies a hardware breakpoint, a breakpoint event is enabled on the hardware breakpoint PMU.
  • When the NMI watchdog is enabled on a logical core at boot-time or run-time, it adds a new hardware event on that core as discussed previously.
  • When one of the following changes are made to a cgroup: (1) a new task migrates to the cgroup, (2) changing the type of the cgroup, and (3) enabling or disabling a subsystem of the cgroup.
  • When /sys/devices/cpu/allow_tsx_force_abort is changed on an Intel Skylake (and close derivatives) processor. This changes the availability of one of the GP counters of the core PMU, which requires rescheduling all enabled events on that PMU on all logical cores. I’ll discuss the event constraints later in this article.

Disabling an event also changes the set of events that are eligible for scheduling on the PMU of that event. In particular, it may be possible to avoid multiplexing with that event disabled. However, disabling an event by itself will not immediately trigger event scheduling. Instead, the event (and in case of a group leader, all its sibling events) will simply stop being measured. Later, when any of the above cases occur, a better schedule may be constructed.

All event groups are maintained in data structures called event contexts. Every task has an event context and every PMU has an event context per CPU (i.e., logical core). Depending on the pid, cpu, and perf_event_attr.type arguments used to create a group leading event, the event is added to either the event context of a task or the event context of a PMU on the specified CPU. In particular, if pid is nonnegative, then the event is added to the event context of the specified task. The cpu argument in this case can be used to only measure the event when the task runs on the specified CPU. Otherwise, the cpu argument must specify a valid CPU (and a PMU in the type argument) so that the event can be added to its event context. All events in the same group must have the same pid and cpu. However, the event type could be different, yet the whole group ends up being in a single event context. The exact details of how the event context is determined when there are events of different types are beyond the scope of this article.

Another scheduling-related attribute of any event is whether it’s pinned or flexible. These two attributes must be the same for all events in the same event group (otherwise perf_event_open returns EINVAL). Overall, any event group belongs to one of the following four categories:

  • Per-cpu and pinned: the event group should be measured on the specified logical core whenever it’s enabled. If the event group could not be scheduled because no counter is available for it, it goes into an error state, which will be discussed a bit later. Pinned event groups are not subject to multiplexing. The notion of a pinned event group is only useful for groups that contain hardware events where, due to PMU limitations, it may happen that not all enabled events can be simultaneously active.
  • Per-task and pinned: the event group should be measured whenever it’s enabled and the task that owns its event context is running on any logical core (or on a particular logical core if the cpu argument is valid).
  • Per-cpu and flexible: the event group should be measured on the specified logical core whenever it’s enabled, but it can be multiplexed if needed.
  • Per-task and flexible: the event group should be measured on the logical cores on which the task that owns its event context is running, but it can be multiplexed if needed.

Event group scheduling is performed independently on different logical cores. When event scheduling starts on a logical core, only those event groups that should be measured on that logical core are eligible for scheduling (based on pid and cpu). Therefore, only the event groups from the event context of the PMUs on that logical core and the event context of the task currently running on the logical core are considered for scheduling. In addition, if the group leader is in the disabled state, all its sibling events are effectively in the disabled state as well, and so none of them are schedulable. The event groups are scheduled in the following order: per-cpu and pinned, per-task and pinned, per-cpu and flexible, and then per-task and flexible. Pinned groups are prioritized over flexible groups to give them a better chance of being scheduled. An event schedule is constructed gradually by scheduling each group one after the other. If a pinned group could not be scheduled, it transitions from the enabled state to the error state, in which it’s not considered for scheduling. The only state transition that is possible for a group in the error state is to the enabled state, which triggers scheduling as discussed previously. If a flexible group could not be scheduled, multiplexing is enabled on the logical core. perf will attempt to schedule all eligible groups sequentially. If it sees a group that failed to schedule, all later groups that include at least one hardware event will not be considered for scheduling; groups with only software events are always scheduled successfully.

Now I’ll discuss how the events in a single group are scheduled on the PMU resources that have not been allocated for the earlier groups (the first group to be scheduled on a PMU has all that PMU’s resources available for it). At this point, each event of the group is handled by the PMU that the event belongs to and the exact details of the scheduling algorithm mostly depend on the PMU. I’ll only discuss the algorithm used in the x86 core PMU because this is the one most commonly used and the others are either similar or simpler. The algorithm for the enabled x86 core PMU events works as follows:

  1. Calculate the constraints of each event in the group currently being scheduled. The constraints of the events from the previously scheduled groups have already been determined from previous executions of this algorithm, but the constraints of events with dynamic constraints have to be recalculated. I’ll discuss event constraints calculation later in this article.
  2. Calculate the weight of the constraint of each event. An event constraint is represented as a bit vector that contains one bit for each counter in the PMU. For example, the vector for a PMU with 3 FP counters and 4 GP counters consists of 7 bits, where a set bit indicates that the event can be measured on the corresponding counter. The number of set bits is called the weight of the constraint.
  3. Determine the constraints on the usage of the GP counters. This step has been implemented since kernel 4.1-rc1 and is only necessary in certain cases (depending on the kernel version), as will be discussed later.
  4. For each event in the group being currently scheduled and all earlier groups, starting from the event with the smallest constraint weight (i.e., most constrained) to the largest constraint weight (i.e., least constrained), do the following:
    1. If the event can be scheduled on an FP counter and there is a matching FP counter that is unused, assign the event to that counter. FP counters are preferred over GP counters starting with kernel 3.3-rc1. In older kernel versions, it may happen that an event that can be measured on an FP counter gets assigned a GP counter, which may result in unnecessary multiplexing.
    2. Otherwise, if the event can be scheduled on a GP counter and there is a matching GP counter that is unused, assign the event to the first such counter.
    3. If the event has been assigned to a counter, the event has been successfully scheduled. If the constraint of the event is marked as overlapping (which I’ll define later) and the number of currently saved scheduler states is smaller than 2, save the current scheduler state. Overlapping event constraints currently only exist on AMD processors. Scheduling support for overlapping counters is implemented in the kernel 3.3-rc1 and later.
    4. If the event has not been assigned any counter, check whether there is a saved scheduler state. If there is, rollback to the last saved state by marking all the counters that have been assigned between the saved state and this point as unused, attempt to assign a different counter to the event scheduled when that state was saved, and redo the scheduling for later events. If there isn’t, the scheduler has failed to assign a counter to the event, which means that the whole event group that contains the event could not be scheduled.
  5. In case of success, all the assigned counters are programmed accordingly and enabled, making the events active. In case of failure, pinned and flexible groups are treated differently. For pinned groups, the group that failed scheduling is transitioned to the error state, which makes it ineligible for scheduling in the future until it gets re-enabled. For flexible groups, multiplexing is enabled on the event context of the group that failed scheduling. In addition, the scheduling of earlier groups (which succeeded) is made active.
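The greedy assignment in step 4 can be sketched in heavily simplified form. This sketch ignores the FP-counter preference, overlapping constraints, and rollback; the masks, counter count, and function names are illustrative, not taken from the kernel.

```c
/* Sketch of the greedy counter assignment: events are visited from
 * most to least constrained, and each takes the first free counter
 * its constraint mask allows. Bits 0..6 model a hypothetical PMU
 * with 4 GP and 3 FP counters. */
#include <stdint.h>

static int weight(uint32_t m)
{
    int w = 0;
    while (m) { m &= m - 1; w++; }   /* count set bits */
    return w;
}

/* Returns 1 if all n events got a counter, 0 on failure (which means
 * multiplexing for flexible groups, error state for pinned groups).
 * assign[i] receives the counter index for event i. n <= 64. */
static int schedule_events(const uint32_t *mask, int n, int *assign)
{
    uint32_t used = 0;
    int order[64];

    /* visit events in ascending constraint weight (selection sort) */
    for (int i = 0; i < n; i++) order[i] = i;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (weight(mask[order[j]]) < weight(mask[order[i]])) {
                int t = order[i]; order[i] = order[j]; order[j] = t;
            }

    for (int i = 0; i < n; i++) {
        int e = order[i];
        uint32_t avail = mask[e] & ~used;
        if (!avail)
            return 0;                /* no free counter: failure */
        int c = 0;
        while (!(avail & (1u << c))) c++;  /* first free matching counter */
        assign[e] = c;
        used |= 1u << c;
    }
    return 1;
}
```

For example, an event constrained to counter 2 is placed first, so two unconstrained events then fall back to counters 0 and 1; two events both constrained to counter 2 fail.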

In summary, the scheduler iterates over all enabled pinned and flexible event groups from the event context of the PMUs of the logical core and the currently running task until it reaches a group that fails to schedule, after which only pure software groups are scheduled. For each group, to maximize the use of available counters, the algorithm processes the events from most constrained to least constrained. Before kernel version 2.6.34-rc1, the weights of constraints were not considered, and so the order of scheduling the events was based on the order in which the events were created, which was undesirable because the user had to think about the order of events. Scheduling is done independently per PMU and per logical core (although a group, which is the scheduling unit, may contain events from different PMUs).

Group validation is done using the same algorithm, except that steps 3 and 5 are not performed. In group validation, only the group being validated is considered, to check whether the group by itself is schedulable. If it isn’t, the group is invalid and cannot be created.

It’s worth noting here that the availability of PMU resources may depend not only on the events scheduled so far on the same core but also on other cores. That’s because some PMUs or some types of PMU resources are shared between threads of the same core or all cores of the package, in which case their availability may change depending on what events are being measured on the other cores. For example, all cores share the uncore PMU on Intel processors and the northbridge PMU on AMD processors. On Nehalem and Westmere, the MSR registers for the off-core response events are shared between the two threads of a core.

If one or more of the flexible groups could not be scheduled, a timer of type CLOCK_MONOTONIC based on the hrtimer infrastructure is enabled to perform multiplexing. This timer is created per PMU per logical core. Before kernel 3.11-rc1, multiplexing was performed on the scheduler tick interrupt, which is also used by the scheduler to switch between tasks. The hrtimer makes it possible to configure the multiplexing period independently from the scheduler clock. The multiplexing period can be configured via a sysfs entry called /sys/devices/xxx/perf_event_mux_interval_ms, where xxx is the name of the PMU, such as cpu. Note that this can only be configured per PMU, but not per logical core. The multiplexing period involves a trade-off between performance overhead and measurement accuracy.

When a multiplexing interrupt occurs, the interrupt handler first checks whether multiplexing is enabled in the current PMU context and/or the current task context. This check is necessary because the event schedule may change dynamically and so does the need for multiplexing. In each context in which it’s enabled, it will perform a rotation operation on the list of flexible groups of that context. That is, it will remove the last event group in the list and insert it at the head of the list, thereby giving it a better chance of being scheduled. This triggers rescheduling of all the events from both contexts as discussed earlier, which can turn a different set of events from the inactive state (i.e., enabled but could not be scheduled) to the active state (i.e., enabled and being actively measured). Whenever the effective state of an event changes (the effective state depends on the state of the event and the state of its group leader), the amount of time the event spends in each state (active vs. inactive) is tracked accordingly. Scaling of the event measurements has to be performed by the userspace tool, such as perf stat.
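The scaling that a userspace tool performs for multiplexed events can be sketched as follows. The raw count is extrapolated by the ratio of time enabled to time running, which the tool obtains by reading the event with PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING; the helper name is mine.

```c
/* Sketch: extrapolating a multiplexed event count. Returns the scaled
 * estimate, or 0 if the event never ran (the "<not counted>" case in
 * perf stat). perf stat's percentage column is time_running as a
 * fraction of time_enabled. */
#include <stdint.h>

static uint64_t scale_count(uint64_t raw, uint64_t time_enabled,
                            uint64_t time_running)
{
    if (time_running == 0)
        return 0;
    /* raw * time_enabled may overflow for huge counts; fine for a sketch */
    return raw * time_enabled / time_running;
}
```

For example, an event that counted 100 occurrences while active for a quarter of the enabled time is reported as an estimated 400 occurrences.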

Some kernel versions suffer from bugs related to event scheduling:

  • Before kernel 4.18-rc1, adding a software event to an existing group that already contains software and hardware events may fail due to a bug in event context management. This may result in an error when creating a group of events. The patch can be found here, which also shows an example.
  • Before 4.13-rc2, a pinned group may not get prioritized over flexible groups due to a bug in the code that determines whether a group is pinned. This may lead to incorrect results. The patch can be found here, which also shows an example.
  • Before 4.6-rc2, there is a bug in the timer tracking code for multiplexed events where an enabled event may falsely appear to be measured for zero percent of the time (i.e., it was never activated by the scheduling algorithm). This may lead to incorrect results. The patch can be found here, which also shows an example.

Event Constraints Calculation

One of the steps of the scheduling algorithm is determining the constraints of each event to be scheduled so that it can be measured correctly. If an event is assigned a counter that violates its constraints, what happens on Intel and AMD processors is that the counter is simply never incremented when the event occurs, so the counter value will not change, which may give the false impression that no events occurred. It’s absolutely crucial to get the constraints right. In my experience, perf is the best open-source tool in this regard, but no tool is immune from bugs.

The constraints of an event are represented in perf as a bit vector, one bit for each FP and GP counter, plus additional special constraints. Most events on Intel and AMD processors require only a counter and can be counted on any GP counter. Some important events, however, have more complex constraints. For example, an off-core response event requires one GP counter and one MSR register. The L1D_PEND_MISS.PENDING event can only be counted on the third GP counter on SnB, IvB, HSW, and BDW, but it can be counted on any GP counter on later microarchitectures. All of the aforementioned constraints are known statically because they don't depend on any dynamic factors, so they must always be satisfied.
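
As a sketch, a constraint of this kind can be modeled as a counter bitmask whose popcount is the event's weight (the mask names below are made up for illustration):

```python
def weight(mask):
    """Constraint weight: the number of counters the event can use."""
    return bin(mask).count("1")

# Hypothetical masks where bit i means "GP counter i is allowed".
ANY_OF_4_GP = 0b1111  # a typical unconstrained event with 4 GP counters
ONLY_GP2 = 0b0100     # e.g., L1D_PEND_MISS.PENDING on SnB..BDW (third GP counter)

assert weight(ANY_OF_4_GP) == 4
assert weight(ONLY_GP2) == 1
```

The weight matters later: the scheduler considers events in order of increasing weight, so the most constrained events are placed first.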

There are also dynamic constraints. Starting with Sandy Bridge, if hyperthreading is disabled in the BIOS or if the processor doesn’t support hyperthreading, each core has 8 GP counters instead of just 4. So whether an event can be scheduled on the other 4 counters depends on whether hyperthreading is enabled. It also depends on the event. For example, on Sandy Bridge up to 9th Gen, all events with the event code 0xD1 (and any unit mask) can only be counted on the first 4 GP counters, irrespective of hyperthreading.

Some kernel versions suffer from bugs related to event constraints:

  • Before 4.7-rc7, perf may use the upper 4 GP counters when HT is disabled for events that cannot be counted on these counters. These events include all those with the event codes 0xD0, 0xD1, 0xD2, and 0xCD on Broadwell up to 9th Gen, and 0xC6 on 6th to 9th Gen, in non-PEBS mode. This may result in incorrect measurements. The patch can be found here.
  • Before 4.3-rc4, on Broadwell, perf unnecessarily constrains all the events 0xA3 (any unit mask) to only the third GP counter, but this is only necessary for the event with the umask 0x08. This may result in failed scheduling. The patch can be found here.
  • Before 4.1-rc1, there is a bug in perf that adds unnecessary dynamic constraints for events that have dynamic constraints. This may result in failed scheduling. The patch can be found here.
  • Before 4.0-rc7, on Haswell, perf unnecessarily constrains all the events 0xA3 (any unit mask) to only the third GP counter, but this is only necessary for the events with the umasks 0x08, 0x0C, and 0x04. This may result in failed scheduling. The patch can be found here.
  • Before 3.15-rc7, on Silvermont, perf incorrectly constrains the event 0x3C with umask 0x01 to the third fixed counter. This may result in incorrect measurements or failed scheduling. The patch can be found here.

Besides the bugs that exist in perf, there are also bugs in many processors, of which I’ll discuss two particularly nasty ones that exist in recent Intel processors and how they impact event scheduling: the TFA bug and the HT bug.

Intel has recently released a microcode update to fix a bug in the Restricted Transactional Memory (RTM) implementation in some of their processors. As a result of that update, there is a new erratum titled “Performance Monitoring General Purpose Counter 3 May Contain Unexpected Values,” which can be found in the specification update documents of the following processors: E3 v5, E3 v6, Xeon D, Xeon Scalable, and 6th, 7th, and 8th generation Core. This erratum says that the fourth GP counter (IA32_PMC3) and its control register (IA32_PERFEVTSEL3) may get corrupted during the execution of a transaction when bit 0 of the TSX_FORCE_ABORT MSR is set to 0. This is called the TFA bug, and it only exists if that microcode update is installed. So if bit 0 of this MSR is 0, IA32_PMC3 and IA32_PERFEVTSEL3 should not be used because they may get corrupted. However, if bit 0 of the MSR is 1, all transactions are forced to abort and IA32_PMC3 will report correct event counts. Starting with kernel 5.1-rc1, TSX_FORCE_ABORT[0] can be configured using /sys/devices/cpu/allow_tsx_force_abort; when it is set to 0, the perf scheduler will not use IA32_PMC3 for any event. You can find more information on it here. As discussed earlier, changing this setting triggers event scheduling, so the effect happens immediately. Fortunately, there is no event on the affected processors that can only be counted on IA32_PMC3. Otherwise, such an event could not be measured when TSX_FORCE_ABORT[0] is 0.

Another, much nastier, bug is titled “Performance Monitor Counters May Produce Incorrect Results” in the specification update documents of all Intel Sandy Bridge, Ivy Bridge, and Haswell processors. I have actually mentioned this erratum very briefly in another blog post. This bug was apparently discovered by the authors of Gooda, as discussed in the paper titled “Hierarchical cycle accounting: a new method for application performance tuning.” In this bug, an event that occurs on one logical core and for which there is an enabled counter may be dropped or counted on the corresponding counter (if enabled) of the sibling logical core. The events that are vulnerable to this bug are called corrupting events, which include 0xD0, 0xD1, 0xD2, and 0xD3 (any umasks). For example, if the second counter of a logical core is programmed and enabled to count a corrupting event and the second counter of the sibling logical core is also programmed and enabled to count any event (not necessarily corrupting), the values in both counters could be wrong. To be specific, there are two cases:

  • If the event on the sibling core is a non-corrupting event, then its counter may be (much) larger than it actually is and the count of the corrupting event may be (much) smaller than it actually is.
  • If the event on the sibling core is also a corrupting event, there can be event leaks in both directions.

Different kernel versions handle this bug in different ways. The scheduler in kernels before 3.10-rc1 is not aware of this bug, so corruption may occur. In 3.10-rc1 up to 4.1-rc1, all the corrupting events are blacklisted on all Ivy Bridge processors, so the scheduler always fails to schedule these events. This was a significant restriction because the corrupting events are very useful. A solution was proposed in a published paper on the bug and was implemented in 4.1-rc1. Basically, the scheduler was changed so that it takes into consideration what events are being measured on the sibling logical core. If a corrupting event is being measured on a counter, the scheduler will consider the corresponding sibling counter as unavailable and will not schedule any event on it. If a non-corrupting event is being measured on a counter, the scheduler will only program non-corrupting events on the corresponding counter of the sibling logical core. This essentially makes the constraints of all of the corrupting events dynamic. To ensure that a logical core doesn't starve for counters because the sibling core scheduled too many corrupting events, a limit is imposed on the maximum number of counters a logical core can use. In 4.1-rc1 up to 4.1-rc7, the limit is that the total number of enabled FP and GP counters on a logical core cannot exceed half of the GP counters, and it was imposed whenever hyperthreading is enabled. Starting with 4.1-rc7, the limit is changed so that only the total number of enabled GP counters cannot exceed half of the GP counters. Also, the limit is only imposed on a logical core when hyperthreading is enabled and there is at least one enabled corrupting event on that logical core. Determining this limit constitutes step 3 of the scheduling algorithm. Note that by “enabled” here I mean at the time of scheduling. Disabling an event dynamically would not, by itself, affect the limit that was imposed at the time of scheduling because this doesn't trigger event scheduling. But this is OK because scheduling will be performed the next time a multiplexing interrupt occurs. If there is no multiplexing, then it doesn't really matter.
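
The counter limit from the 4.1-rc7 workaround can be summarized in a few lines. This is a simplified model of the rule described above, not kernel code:

```python
def gp_counter_limit(n_gp, ht_enabled, has_corrupting_event):
    """Max GP counters a logical core may use under the 4.1-rc7+ HT-bug
    workaround (a simplified model of the rule, not kernel code)."""
    if ht_enabled and has_corrupting_event:
        return n_gp // 2  # leave the sibling counters usable
    return n_gp

assert gp_counter_limit(4, True, True) == 2   # HT on, corrupting event enabled
assert gp_counter_limit(4, True, False) == 4  # no corrupting event: no limit
assert gp_counter_limit(4, False, True) == 4  # HT off: the bug doesn't apply
```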

There is only one situation where the scheduling algorithm may backtrack to an earlier event in an event group and assign to it a different counter when it hits a dead end at some event: it only backtracks to overlapping events. An overlapping event is an event whose constraint weight is equal to or smaller than that of another event but whose constraints are not a subset of those of the other event. For example, assume that the events A, B, C, and D have the constraint vectors (in binary) 1001, 0011, 0111, and 0111, respectively. This means that event A can be scheduled on the first and fourth counters, event B can be scheduled on either of the first two counters, and events C and D can be scheduled on any of the first three counters. Their weights are 2, 2, 3, and 3, respectively. According to the scheduling algorithm, the scheduler will start with the event with the smallest constraint weight and assign to it the first available counter, which is counter 0. Subsequently, event B gets counter 1 and event C gets counter 2. At this point, no counters are left for event D. Without backtracking, scheduling the event group would fail. However, the scheduler saves the partial schedules at events A and B because both are overlapping. It will first try to backtrack to the saved state of event B, but there is no counter other than counter 1 to assign to it. So it will backtrack further to the saved state of event A and assign to it counter 3 instead of 0. In this way, the scheduler will assign counters 0, 1, and 2 to events B, C, and D, respectively, which results in a complete schedule for the group. Overlapping events only exist on AMD processors.
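
The backtracking behavior can be illustrated with a small simulation. This is a sketch of the idea rather than the kernel's exact algorithm (the kernel only saves restart points at overlapping events, whereas this toy version backtracks everywhere), but it finds the same schedule for the example above:

```python
def schedule(constraints):
    """Assign each event a counter from its constraint set, backtracking on
    dead ends (a toy model, not the kernel's exact algorithm)."""
    # Consider events in order of increasing constraint weight.
    order = sorted(range(len(constraints)), key=lambda i: len(constraints[i]))
    assignment = {}

    def place(pos, used):
        if pos == len(order):
            return True  # every event got a counter
        ev = order[pos]
        for c in sorted(constraints[ev]):
            if c not in used:
                assignment[ev] = c
                if place(pos + 1, used | {c}):
                    return True
                del assignment[ev]  # dead end: undo and try the next counter
        return False

    return assignment if place(0, frozenset()) else None

# Events A, B, C, D with masks 1001, 0011, 0111, 0111 (bit 0 = counter 0):
A, B, C, D = {0, 3}, {0, 1}, {0, 1, 2}, {0, 1, 2}
result = schedule([A, B, C, D])
# One valid schedule: A->3, B->0, C->1, D->2
```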

Examples

Armed with the knowledge of how event scheduling works in perf, let’s determine how the events from each command shown at the beginning of this article may get scheduled on different systems. The first command was:

perf stat -e l1d_pend_miss.pending,cycle_activity.stalls_l1d_pending dd if=/dev/zero of=/dev/null count=1000000

On SnB, IvB, HSW, and BDW, the most likely output is the following, irrespective of hyperthreading:

13,636,115 l1d_pend_miss.pending (49.93%)
2,715,693 cycle_activity.stalls_l1d_pending (50.07%)

There are two event groups, each containing a single hardware event that belongs to the core PMU. Both of these events are constrained to the third GP counter, so they cannot be scheduled simultaneously. Here is how it plays out:

  1. The first event group, which contains l1d_pend_miss.pending, will be scheduled first. The first unused counter that satisfies the constraints of the event is the third GP counter. The scheduler assigns that counter to the event and marks the counter as occupied. The counter is then enabled, which makes the event active.
  2. The second event group, which contains only cycle_activity.stalls_l1d_pending, will then be scheduled. Since the third GP counter is in use, there are no available counters that can be assigned to the event. The scheduler fails at this point, which leads to enabling multiplexing.
  3. There are no more event groups to be scheduled, so the scheduling process ends.

When a multiplexing interrupt occurs, the following steps are performed:

  1. The interrupt handler first checks whether multiplexing is enabled in the event context of the current task and/or the event context of the PMU associated with the expired multiplexing timer. In this case, multiplexing is enabled in the event context of the currently running task.
  2. The last event group in the list of flexible groups of the event context is removed from the tail and inserted at the beginning of the list. That is, the group that contains cycle_activity.stalls_l1d_pending now comes before the group that contains l1d_pend_miss.pending.
  3. A new schedule needs to be constructed for the events. This proceeds similarly to what I said above, except that cycle_activity.stalls_l1d_pending takes the third GP counter and l1d_pend_miss.pending fails to be scheduled this time. So multiplexing continues to be enabled.

The two events are measured alternately every timer interval until the program terminates. So each event is measured about 50% of the time.

What other possible outputs might one get? If the NMI watchdog somehow ended up using the third GP counter, then, since it's a pinned event, it will always be prioritized over our events. The output would be:

<not counted> l1d_pend_miss.pending (0.00%)
<not counted> cycle_activity.stalls_l1d_pending (0.00%)

This means that the events were measured zero percent of the time.

Another possible output can occur when hyperthreading is enabled and kernel version 4.1-rc1 or later is being used. It might happen that the core PMU of the sibling logical core has an active corrupting event on its third GP counter. This makes the third GP counter of our logical core unusable, and the output would also show <not counted> for both events. If the state of the corrupting event changed dynamically, both events may get the chance to be measured, but the time percentages may not add up to 100% and the percentage for one event may be very different from the other.

On Skylake and later processors (where the event names are slightly different), there is no situation where multiplexing would occur, assuming there are no other tools calling perf_event_open to add events on the same logical core. That's because these events can be scheduled on any of the first four GP counters and these processors don't suffer from the HT bug. Even with the TFA bug and with the NMI watchdog possibly occupying a GP counter, there will still be two GP counters for the two events.

In the next command, the set of events is the same. The only difference is that the second event in the list is marked as pinned.

perf stat -e l1d_pend_miss.pending,cycle_activity.stalls_l1d_pending:D dd if=/dev/zero of=/dev/null count=1000000

This causes the scheduler to always prioritize the second event over the first event. The most likely output on pre-Skylake microarchitectures would be:

<not counted> l1d_pend_miss.pending (0.00%)
1,283,966 cycle_activity.stalls_l1d_pending:D

Although both event groups are in the same event context, the first is held in the flexible group list and the second in the pinned group list. The flexible group list contains only one group, so rotation on every multiplexing interrupt doesn't really help.

This doesn’t make a difference on Skylake and later.

The next command is the following:

perf stat -e '{l1d_pend_miss.pending,faults}',cycle_activity.stalls_l1d_pending:D,mem_uops_retired.all_loads dd if=/dev/zero of=/dev/null count=1000000

There are two differences compared to the previous set of events. First, l1d_pend_miss.pending is now in a group with another event, faults, which is a software event that represents the number of page faults. Both of the events are in the same event group. Second, there is a new event group that contains the mem_uops_retired.all_loads event. One possible output on pre-Skylake could be:

<not counted> l1d_pend_miss.pending (0.00%)
<not counted> faults (0.00%)
1,412,367 cycle_activity.stalls_l1d_pending:D 
652,815,004 mem_uops_retired.all_loads (49.91%)

The software event can always be scheduled. However, since cycle_activity.stalls_l1d_pending is always prioritized over l1d_pend_miss.pending, the l1d_pend_miss.pending event will always fail to be scheduled, and so its event group as a whole will always fail to be scheduled. This explains the <not counted> next to these two events. The cycle_activity.stalls_l1d_pending event is measured 100% of the time, for reasons that should by now be obvious. The interesting part is that mem_uops_retired.all_loads is only measured 50% of the time. What is happening? Let's carefully go through the algorithm:

  1. All pinned event groups are scheduled first. There is only one such group: the one containing cycle_activity.stalls_l1d_pending. If the NMI watchdog is enabled, there would be another group. Both of them can be successfully scheduled.
  2. The flexible list contains two groups. The first group fails to schedule because there are no unused counters that satisfy the constraints of the hardware event in that group. Multiplexing is enabled as a result.
  3. The second group is checked. Since it's not a pure software event group and the previous group has already failed to be scheduled, the second group is not considered for scheduling. At this point, there are no more groups to be scheduled.

When a multiplexing interrupt occurs, it will rotate the list of flexible groups and scheduling is performed again:

  1. All pinned event groups are first scheduled as before.
  2. The first flexible group will be scheduled. This is the one containing mem_uops_retired.all_loads, which can be scheduled on any of the first four GP counters. Since at most two GP counters are occupied, the event can be successfully scheduled.
  3. The second flexible group, containing l1d_pend_miss.pending, will fail to schedule as discussed earlier. This causes multiplexing to be enabled again.

That’s why mem_uops_retired.all_loads gets measured 50% of the time.
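That alternation can be reproduced with a small simulation of the rotation logic. It is a simplified model (one counter per event, the first flexible failure stops scheduling of later groups), with constraint sets standing in for the real constraint masks:

```python
def simulate(pinned, flexible, counters, intervals):
    """Rotate the flexible list each interval and count how often each
    flexible group gets a counter (a simplified model of multiplexing)."""
    active_intervals = {name: 0 for name, _ in flexible}
    order = list(flexible)
    for _ in range(intervals):
        free = set(counters)
        # Pinned groups are scheduled first and always win their counters.
        for _, allowed in pinned:
            usable = sorted(allowed & free)
            if usable:
                free.discard(usable[0])
        # Flexible groups: the first failure stops scheduling of later groups.
        for name, allowed in order:
            usable = sorted(allowed & free)
            if not usable:
                break
            free.discard(usable[0])
            active_intervals[name] += 1
        order = [order[-1]] + order[:-1]  # rotate for the next interval
    return {name: n / intervals for name, n in active_intervals.items()}

fractions = simulate(
    pinned=[("cycle_activity.stalls_l1d_pending:D", {2})],
    flexible=[("l1d_pend_miss.pending", {2}),
              ("mem_uops_retired.all_loads", {0, 1, 2, 3})],
    counters={0, 1, 2, 3},
    intervals=100,
)
```

In this model, l1d_pend_miss.pending never becomes active (the pinned event always holds counter 2), while mem_uops_retired.all_loads is active in exactly half of the intervals.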

In general, there are some additional complications involved in scheduling this set of events on pre-Skylake. In 3.10-rc1 up to (but not including) 4.1-rc1, on Ivy Bridge, the mem_uops_retired.all_loads event is blacklisted, so the event would fail the validation step at the time of creating it using perf_event_open. You'll see <not supported> next to it in this case. In 4.1-rc1 up to (but not including) 4.1-rc7, the event is allowed, but there is a limit on the total number of counters (GP + FP) a logical core can use when hyperthreading is enabled, which is half of the GP counters. If the NMI watchdog is enabled, it would not be possible to schedule l1d_pend_miss.pending because half of the GP counters is 2 and the event would require a third counter. In this case, you'll see <not counted> next to it in the output. In 4.1-rc7 and later, the HT bug may also affect scheduling depending on what is happening on the sibling logical core.

On Skylake and later, there will probably be no multiplexing irrespective of the NMI watchdog and hyperthreading. There is one subtle issue that might occur, though. All three hardware events can only be measured on the bottom 4 GP counters. Before 4.7-rc7, the scheduler is not aware of this constraint for the mem_uops_retired.all_loads event, so it thinks the event can be scheduled on any of the 8 GP counters if all 8 are available. If there are more event groups, mem_uops_retired.all_loads might end up being scheduled on one of the top 4 GP counters and its event count would show up as zero.

The next command is the following:

perf stat -e mem_load_uops_retired.l1_hit,mem_load_uops_retired.l1_miss,mem_load_uops_retired.l2_hit dd if=/dev/zero of=/dev/null count=1000000

On pre-Skylake, one possible output may be:

648,161,977 mem_load_uops_retired.l1_hit (66.61%)
4,223,144 mem_load_uops_retired.l1_miss (66.73%)
212,313 mem_load_uops_retired.l2_hit (66.66%)

This occurs in 4.1-rc7 and later when hyperthreading is enabled. All of these events are corrupting events, so the maximum number of GP counters that can be used is 2. In any multiplexing interval, only 2 of the 3 events can be active. Due to rotation, each event ends up being measured about 66% of the time. If hyperthreading is disabled, there will be no multiplexing because the HT bug would not apply. Certain complications may occur as discussed earlier. On Skylake and later, the most probable outcome is no multiplexing.

In one of the steps of the algorithm, the events in a group are ordered according to their constraints weights from the smallest to the largest weight. To show how this is useful, consider the following event group:

perf stat -e '{mem_load_uops_retired.l1_hit,mem_load_uops_retired.l1_miss,mem_load_uops_retired.l2_hit ,l1d_pend_miss.pending}' dd if=/dev/zero of=/dev/null count=1000000

Consider a Haswell processor with hyperthreading disabled. The first three events can only be measured on the bottom 4 GP counters and the fourth event can only be measured on the third GP counter. If the events were not ordered according to their constraint weights, the first three events would get assigned the first three GP counters, which makes it impossible to schedule the fourth event, and the whole group would fail to be scheduled. The weight of the fourth event is 1 and the weight of each of the other events is 4. Ordering the events according to their weights results in successfully finding a schedule for the group.
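
The effect of weight ordering can be demonstrated with a first-fit assignment that does no backtracking. This sketch uses constraint sets standing in for the Haswell example above (counter numbers are 0-based):

```python
def greedy(constraints):
    """First-fit assignment in the given order; returns None on a dead end
    (no backtracking -- a sketch of why weight ordering matters)."""
    used, assignment = set(), []
    for allowed in constraints:
        usable = sorted(allowed - used)
        if not usable:
            return None
        used.add(usable[0])
        assignment.append(usable[0])
    return assignment

four_gp = {0, 1, 2, 3}  # the first three events: any of the bottom 4 GP counters
only_gp2 = {2}          # l1d_pend_miss.pending: third GP counter only

assert greedy([four_gp, four_gp, four_gp, only_gp2]) is None  # unordered: fails
assert greedy(sorted([four_gp, four_gp, four_gp, only_gp2], key=len)) \
    == [2, 0, 1, 3]  # weight order: the weight-1 event is placed first
```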

The next command is the following:

perf stat -e mem_load_uops_retired.l1_hit,mem_load_uops_retired.l1_miss,mem_load_uops_retired.hit_lfb,mem_load_uops_retired.l2_hit,mem_load_uops_retired.l3_hit dd if=/dev/zero of=/dev/null count=1000000

One possible output on pre-Skylake is the following:

647,510,926 mem_load_uops_retired.l1_hit (40.01%)
4,100,685 mem_load_uops_retired.l1_miss (40.01%)
54,919 mem_load_uops_retired.hit_lfb (40.00%)
76,342 mem_load_uops_retired.l2_hit (39.99%)
3,343,613 mem_load_uops_retired.l3_hit (40.00%)

With hyperthreading enabled, only 2 out of the 5 events can be active at any point in time, which is 40% of the time for each event. On Skylake and later, the most probable output is the following:

535,727,415 mem_load_retired.l1_hit (79.81%)
17,383 mem_load_retired.l1_miss (79.90%)
19,150 mem_load_retired.fb_hit (80.21%)
18,108 mem_load_retired.l2_hit (80.21%)
659 mem_load_retired.l3_hit (79.87%)

All of the events are constrained to the bottom 4 counters. So irrespective of whether hyperthreading is enabled, only 4 out of the 5 events can be active simultaneously, which is equal to 80% of the time. But if you use a kernel version older than 4.7-rc7 with hyperthreading disabled, there will be no multiplexing because perf thinks that each of these events can be scheduled on any of the 8 GP counters; the last event would then land on one of the upper 4 counters and its count would be zero (wrong). Also, starting with kernel 5.1-rc1, the TFA bug can render one of the bottom 4 counters unavailable, making each event active for only 60% of the time.

The sixth example is the following:

perf stat -e dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_4k,dtlb_store_misses.walk_completed,dtlb_store_misses.walk_completed_4k,itlb_misses.walk_completed,itlb_misses.walk_completed_4k dd if=/dev/zero of=/dev/null count=1000000

This is an example where the events are more constrained on Sunny Cove than on earlier microarchitectures. On Sunny Cove, each of these events can only be measured on the bottom 4 GP counters. Therefore, since there are a total of 6 events, there will be multiplexing irrespective of whether hyperthreading is enabled. In contrast, on earlier microarchitectures, each of the events can be measured on any of the 8 GP counters and so multiplexing will not occur when hyperthreading is disabled.

The seventh example is the following:

perf stat -e '{dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_4k,dtlb_store_misses.walk_completed,dtlb_store_misses.walk_completed_4k,itlb_misses.walk_completed,itlb_misses.walk_completed_4k}' dd if=/dev/zero of=/dev/null count=1000000

The events are the same as in the previous example, except that they are all in the same group instead of each event being in its own group. On all microarchitectures, if hyperthreading is disabled, the most likely output is the following:

<not counted> dtlb_load_misses.walk_completed 
<not counted> dtlb_load_misses.walk_completed_4k 
<not counted> dtlb_store_misses.walk_completed 
<not counted> dtlb_store_misses.walk_completed_4k 
<not supported> itlb_misses.walk_completed 
<not supported> itlb_misses.walk_completed_4k

The interesting part here is that the first 4 events are marked as <not counted> but the next two are marked as <not supported>. This means that group validation has failed while trying to add the last two events to the group because there are only 4 GP counters available. Notice how there is no 0.00% next to the events, which is an important difference from earlier cases.
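
Group validation can be sketched as walking the group members in order and reporting the first one that cannot be placed. This is a toy model of the creation-time check, not the kernel's actual code:

```python
def validate_group(constraints, n_counters):
    """Return the index of the first group member that cannot be placed,
    or None if the whole group fits (a toy model of creation-time validation)."""
    used = set()
    for i, allowed in enumerate(constraints):
        usable = sorted((allowed & set(range(n_counters))) - used)
        if not usable:
            return i
        used.add(usable[0])
    return None

bottom4 = {0, 1, 2, 3}
# Six events all restricted to the bottom 4 GP counters: validation first
# fails at index 4, so the 5th and 6th events show up as <not supported>.
assert validate_group([bottom4] * 6, 8) == 4
```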

The last example is the following:

perf stat -e instructions,dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_4k,dtlb_store_misses.walk_completed,dtlb_store_misses.walk_completed_4k,itlb_misses.walk_completed,itlb_misses.walk_completed_4k sleep 1

The only two differences from the sixth example are that there is one additional event, which can be scheduled on an FP counter, and that the task to be profiled is sleep instead of dd. On systems where multiplexing occurs, the most likely output is:

863,911 instructions 
598 dtlb_load_misses.walk_completed 
481 dtlb_load_misses.walk_completed_4k 
92 dtlb_store_misses.walk_completed 
74 dtlb_store_misses.walk_completed_4k 
<not counted> itlb_misses.walk_completed (0.00%)
<not counted> itlb_misses.walk_completed_4k (0.00%)

Here is how it plays out:

  1. The first five events are scheduled on an FP counter and the 4 available GP counters one after the other because each event is in its own flexible group.
  2. The sixth event will fail to be scheduled, so multiplexing is enabled and the seventh event is not considered for scheduling.
  3. The sleep task begins execution and quickly puts itself to sleep for 1 second, before the first multiplexing interrupt occurs.
  4. When the first multiplexing interrupt occurs, the handler will see that multiplexing is enabled. But there is no flexible group list to rotate: at this point, the sleep task is not running, so its event context is not considered for scheduling. As a result, multiplexing is disabled.
  5. Later the sleep task wakes up and scheduling is performed again. Since the event groups in the flexible list of the task have not been rotated, the same events will be scheduled and the same events will fail to be scheduled, which causes multiplexing to be enabled.
  6. The sleep task terminates quickly before the second multiplexing round occurs. The last two events end up never being measured. That’s why there is <not counted> and 0.00% next to these events.

I can give more complex examples, but I think that’s enough for this article.
