Profile Guided Compiler Optimization
Introduction
A profile-guided optimizer uses a program's run-time profile in two ways:
1. To identify optimization opportunities that are frequently observed at run time but are not
detected by static analysis.
2. To carry out sophisticated cost-benefit analysis when applying transformations that improve the
performance of the frequently executed parts of the program.
Classical optimization methods based purely on static analysis have serious drawbacks. First, they
fail to exploit many optimization opportunities (which can be uncovered by combining static
analysis with run-time profiles). Second, they may have an adverse impact on performance (which can
be avoided by using a detailed performance model during optimization).
Different types of optimizations need different kinds of profile data, such as control flow, value, and address
profiles. Given sufficient profile data, the following categories of optimization opportunities can
be exploited:
1. Optimization opportunities that static analysis can uncover but whose exploitation involves
performance trade-offs. Here the focus is on the transformation itself; whether to apply it
depends on a cost-benefit analysis driven by the profile.
2. Optimization opportunities that cannot be uncovered by static analysis but are frequently
observed to exist during program execution. Static analysis can relate the values of variables but
cannot identify the specific values involved. For example, for i = i + 1, static analysis tells us that
the statement increments the value of i by one, but the exact value of i depends on the program
input and on how many times the statement has been executed.
Consider the example below:
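A rough C sketch of the kind of code involved; the control structure, the variable names, and the predicate p below are illustrative assumptions rather than a reproduction of the original figure:

/* (a) Original: x*y is evaluated in two places; when the branch is taken,
   the second evaluation recomputes a value that is already known. */
int original(int p, int x, int y) {
    int z = 0;
    if (p)
        z = x * y;              /* first evaluation */
    int w = x * y;              /* second, possibly redundant evaluation */
    return z + w;
}

/* (b) Redundancy removed by introducing a temporary t. */
int with_temp(int p, int x, int y) {
    int z = 0;
    int t = x * y;              /* single evaluation */
    if (p)
        z = t;
    int w = t;
    return z + w;
}

/* (c) Value specialization: the profile says y == 1 almost always, so the
   multiply is replaced by a copy, guarded by an extra run-time test. */
int specialized(int p, int x, int y) {
    int z = 0;
    if (p)
        z = (y == 1) ? x : x * y;
    int w = (y == 1) ? x : x * y;
    return z + w;
}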
In the original code (a), x*y is computed twice, which is redundant; this can be optimized by
introducing a temporary variable t, as in (b). However, if the first branch almost always evaluates to
false, the optimization is of little use: the transformation is worthwhile only if line 4 is executed less
frequently than line 7. Now suppose the profile shows that y = 1 very frequently at line 4, so we have
identified a value-specialization opportunity and optimize the code to (c). In that case, the extra
instruction introduced to check the value of y must be taken into account: the transformation pays off
only if the overall benefit of eliminating the multiply exceeds the overall cost of executing the extra
instructions (the IA-64 architecture enables an efficient implementation of the optimization in (c)).
Thus, even a simple cost-benefit analysis should be carried out to determine whether or not a
transformation should be applied.
The profile-guided compilation process proceeds in two steps. First, an instrumented version of the
program is run on representative inputs to collect profile data. This profile data is then used by the
compiler to recompile the program.
Types of Profile Information
Control Flow Profiles
This profile captures a trace of the execution path taken by the program, referred to as a control
flow trace (CFT). From the CFT we can compute the execution frequency of any given
program subpath. However, CFTs are very large, so compressed or approximate forms are used in
practice. Several approximations, differing in their level of precision, are used:
1. Node profiles provide the execution frequencies of the basic blocks in the CFG. For decisions
such as code placement, or cases like the redundancy-removal example discussed earlier, such a profile is sufficient.
2. Edge profiles provide the execution frequencies of each edge in the CFG. Edge profiles are
superior to node profiles because a node profile can always be computed from an edge profile,
but not vice versa.
3. Two-edge profiles provide the execution frequencies of each pair of consecutive edges in
the CFG. They are superior to edge profiles because an edge profile can be computed from a two-
edge profile, but not vice versa; moreover, a two-edge profile captures the correlation between the
outcomes of consecutive conditional branches (see the sketch after this list).
4. Path profiles provide the execution frequencies of acyclic, intraprocedural subpaths in the CFG:
a recorded subpath is terminated when a loop back edge would be followed or when an exit node
of the procedure is reached.
The precision with which the execution frequency of an arbitrary program subpath can be estimated
varies among these profiles.
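As an illustration of the difference, consider the minimal C sketch below (not taken from the original text; the function and variable names are illustrative):

/* Edge profiles record how often each branch direction is taken, but lose the
   correlation between the two tests: they cannot tell whether the T/T and F/F
   paths dominate or the T/F and F/T paths do. A two-edge or path profile
   keeps separate counts for the four acyclic paths through this function. */
int correlated_branches(int a, int b) {
    int x = 0;
    if (a > 0)
        x += 1;
    if (b > 0)
        x += 2;
    return x;
}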
Value Profiles
This profile identifies the specific values encountered by the operands of an instruction, as well as the
frequencies with which those values are encountered. Example:
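A hypothetical example (the specific numbers below are invented purely for illustration):

/* Value profile for the operand y of the multiply below, as it might be
   reported by an instrumented run:
       y == 1 : 9,000 occurrences
       y == 8 :   700 occurrences
       other  :   300 occurrences
   Knowing this, the compiler can specialize the frequent cases, e.g. replace
   x*y by x when y == 1, or by x << 3 when y == 8, behind a run-time test. */
int scale(int x, int y) {
    return x * y;               /* profiled operand: y */
}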
With this information, the compiler can recognize operands that are almost always constant and
utilize this information to carry out value specialization optimizations like constant folding, strength
reduction, and motion of nearly invariant code out of loops.
Address Profiles
This profile is collected in the form of a stream of memory addresses referenced by the
program. Address profiles are usually used to apply data layout and placement transformations that
improve the performance of the memory hierarchy. Addresses can be collected at different
granularities.
The whole program stream (WPS) representation is analyzed to identify hot address streams.
Based on the WPS, the temporal relationship graph (TRG) has been proposed: the nodes of the graph
are the data items of interest, and weighted links are established between pairs of nodes that are
referenced close together in time. The link weights accumulated by the end of the program run can
then be used to place data so as to achieve good cache behavior.
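A minimal sketch of how such link weights might be accumulated; the window size, the item encoding, and the weighting scheme are assumptions of this sketch, not details from the text:

#include <stdio.h>

#define WINDOW 4                /* temporal window over the address stream */
#define NITEMS 8                /* data items identified by small integer ids */

static int weight[NITEMS][NITEMS];

/* Increment the link weight between every pair of distinct items that are
   referenced within WINDOW positions of each other in the stream. */
static void build_trg(const int *stream, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n && j < i + WINDOW; j++)
            if (stream[i] != stream[j]) {
                weight[stream[i]][stream[j]]++;
                weight[stream[j]][stream[i]]++;
            }
}

int main(void) {
    int stream[] = {0, 3, 0, 3, 1, 5, 0, 3};    /* items 0 and 3 co-occur often */
    build_trg(stream, (int)(sizeof stream / sizeof stream[0]));
    /* A heavy weight[0][3] suggests placing items 0 and 3 together in memory. */
    printf("weight(0,3) = %d\n", weight[0][3]);
    return 0;
}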
Profile Guided Classical Optimizations
In some optimizations, specialization leads to the elimination of a copy of a computation (e.g.,
redundancy elimination and dead code elimination), while in other optimizations specialization leads
to simpler and more efficient code (e.g., strength reduction and constant folding).
Transformations
Two primary classes of transformation are used to carry out code replication and enable
specialization of conditionally optimizable code: code motion of different types and control flow
restructuring with varying scope.
1. Code motion of statements. The basic form is safe code motion: for every execution of a
statement in the optimized code, there exists a corresponding execution of that statement in the
unoptimized code. Speculative code motion allows the compiler to introduce executions of a
statement in the optimized code that are not present in the unoptimized code; to be beneficial, the
added cost of these extra executions must be offset by the savings resulting from the motion, so a
profile-based cost-benefit analysis is needed. In predicated code motion, a statement can be moved
out of a control structure and executed at a different program point under its original conditions by
predicating its execution with an appropriately constructed predicate expression. (These three
variants are sketched at the end of this subsection.)
2. Control flow restructuring. If the conditions under which a statement can be optimized hold
along some execution paths, then control flow restructuring can be employed to enable the
optimization of the statement as follows.
a) Restructure the program so that when the statement is reached, the incoming edge
along which it is reached determines whether or not the statement can be
optimized.
b) Create unoptimized and optimized copies of the statement and place them along the
corresponding incoming edges.
The scope of the control flow restructuring determines the degree to which the code can be optimized.
Function inlining is one way to achieve interprocedural control flow restructuring. However,
it is accompanied by significant code growth. To limit code growth while performing
interprocedural optimizations, two alternative techniques have been proposed: partial
inlining of frequently executed paths through a procedure, and creating procedures with
multiple entries and multiple exits.
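A minimal C sketch (not from the original text) contrasting safe, speculative, and predicated code motion on a loop-invariant multiply:

/* Safe code motion cannot hoist x*y out of this loop, because the loop body
   may execute zero times and the hoisted evaluation would have no
   counterpart in the unoptimized code. */
void original_loop(int n, int x, int y, int *out) {
    for (int i = 0; i < n; i++)
        out[i] = x * y;
}

/* Speculative code motion hoists it anyway; when n <= 0 this is an extra
   execution, so a profile-based cost-benefit check decides whether the
   savings inside the loop outweigh that cost. */
void speculative_loop(int n, int x, int y, int *out) {
    int t = x * y;
    for (int i = 0; i < n; i++)
        out[i] = t;
}

/* Predicated code motion guards the moved statement with a predicate so it
   still executes only under its original condition. */
void predicated_loop(int n, int x, int y, int *out) {
    int t = 0;
    if (n > 0)
        t = x * y;
    for (int i = 0; i < n; i++)
        out[i] = t;
}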
Optimizations
1. Partial Redundancy Elimination (PRE). Classical PRE is based on safe code motion, and many
opportunities for redundancy removal cannot be exploited using this restricted form of code motion.
In case (a), the expression x+y at node 7 is partially redundant because if node 2 is visited, the
value has already been evaluated. If speculative code motion is used to hoist x+y above node 6,
as shown in case (b), and the execution frequency of node 3 is less than that of node 7, the result
is a net reduction in redundancy. By applying control flow restructuring, as in case (c), the partial
redundancy is fully eliminated, but at the cost of code growth. Thus the profile is needed to choose
among these methods. Partial redundancy elimination can also be applied in other contexts, such as
redundant load removal and the elimination of array bounds checks.
2. Partial dead code elimination (PDE). The PDE of an assignment is achieved by delaying the
execution of the statement to a point where its execution is definitely required.
In case (d), the assignment in node 2 is partially dead: if control reaches node 6 or node 7, x is
reassigned there. Simply delaying the assignment to node 8 is incorrect, because that would prevent
the correct value of x from reaching its use in node 10. Straightforward (safe) code motion blocks the
sinking of the assignment past node 4, and therefore any further. By predicating the statement, we can
enable its sinking past node 4 and down to node 8, achieving PDE as in case (e). This approach is
beneficial if the execution frequency of node 2 is larger than that of node 8. Control flow restructuring
can also be applied, as in case (f). Partially dead stores can also be eliminated through these methods.
3. Conditional branch elimination. A conditional branch is considered partially redundant if its
outcome can be determined at compile time along some of the paths reaching it. Eliminating it requires
flow-graph restructuring: through code duplication, those paths are separated from the paths along
which the outcome is unknown.
Along the first path, x > 0 evaluates to false, and hence x == 1 is definitely false; along the second
path, x == 1 is true, and if the call F() preserves the value of x, then x == 0 is false. The first
correlation can be exploited through intraprocedural control flow restructuring (sketched below),
while the second requires interprocedural control flow restructuring.
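A minimal C sketch of the intraprocedural case; the tests, the helper functions a() through d(), and the overall shape are assumptions, not the original figure:

#include <stdio.h>

static void a(void) { puts("a"); }
static void b(void) { puts("b"); }
static void c(void) { puts("c"); }
static void d(void) { puts("d"); }

/* Before restructuring: along the path where x > 0 is false (so x <= 0),
   the test x == 1 cannot succeed, making the second branch partially
   redundant. */
void before(int x) {
    if (x > 0) a(); else b();
    if (x == 1) c(); else d();
}

/* After restructuring: duplicating the code that follows the first branch
   separates the correlated path, and the second test disappears along it. */
void after(int x) {
    if (x > 0) {
        a();
        if (x == 1) c(); else d();
    } else {
        b();
        d();                    /* x <= 0 here, so x == 1 is known false */
    }
}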
Partial Redundancy Elimination via Speculative Code Motion
1. Cost model
For a given expression exp and a conditional node n, we formulate a condition under which
it is potentially profitable to enable speculative hoisting of exp above n. Essentially we
identify the paths through n which can benefit from speculation of exp and paths that incur
an additional execution-time cost due to speculation of exp. If the probability of following
paths that benefit is greater than the probability of following paths that incur a cost, then
speculation of exp is enabled at n. The computation of these probabilities is based upon an
execution profile.
It is useful to categorize the program subpaths that either originate or terminate at n as
follows:
a. Available subpaths which are subpaths from the start node to n along which an
evaluation of exp is encountered and is not killed by a redefinition of any of its
operands prior to reaching n.
b. Unavailable subpaths which are subpaths from the start node to n along which exp
is not available at n. These subpaths include those along which either an evaluation
of exp is not encountered or if exp is encountered, it is killed by a redefinition of
one of its operands prior to reaching n.
c. Anticipable subpaths which are subpaths from n to the end node along which exp is
evaluated prior to any redefinition of the operands of exp.
d. Unanticipable subpaths which are subpaths from n to the end node along which exp
is not anticipable at n. These subpaths include those along which either exp is not
evaluated or if exp is evaluated, it is done after a redefinition of one of its operands.
The paths that can potentially benefit from hoisting exp above a conditional node n are
the paths obtained by linking available subpaths with anticipable subpaths at n, for two
reasons:
a. Along these paths there is redundancy: an evaluation of exp prior to reaching
n causes an evaluation of exp performed after visiting n to become redundant.
b. This redundancy cannot be removed unless exp is hoisted above node n.
We can therefore state that, given a path p that passes through the conditional node
n, hoisting an expression exp above n benefits path p if and only if exp is both available and
anticipable at n's entry along p.
Denote the set of paths through n that benefit from hoisting of exp above n as
BenefitPaths_exp(n). The expected overall benefit of speculating exp can be computed by
summing up the execution frequencies of paths in this set:
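In symbols, with freq(p) denoting the profiled execution frequency of path p (the notation is assumed):

$$\mathit{Benefit}_{exp}(n) \;=\; \sum_{p \,\in\, \mathit{BenefitPaths}_{exp}(n)} \mathit{freq}(p)$$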
Similarly, the paths that incur a cost due to hoisting of exp above a conditional node n are
the paths that are obtained by linking unavailable subpaths with unanticipable subpaths
at n. The expected overall cost of speculating exp above n can be computed as below:
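Writing CostPaths_exp(n) for this set of paths (a name chosen here by analogy with BenefitPaths_exp(n)):

$$\mathit{Cost}_{exp}(n) \;=\; \sum_{p \,\in\, \mathit{CostPaths}_{exp}(n)} \mathit{freq}(p)$$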
Speculation should be enabled if based upon the profile data we conclude that the expected
benefit exceeds the expected cost.
A boolean variable EnableSpec_exp(n) is associated with an expression and conditional node
pair (exp, n). As shown below, speculation is enabled if the probability of following a
path through n that benefits is greater than the probability of following a path that incurs
an additional cost.
Note that the sum of the two probabilities used below need not be 1, because there may be
other paths which are unaffected by speculation.
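With Freq(n) denoting the total execution frequency of paths through n (notation assumed), the condition can be written as:

$$\mathit{EnableSpec}_{exp}(n) \;\equiv\; \frac{\mathit{Benefit}_{exp}(n)}{\mathit{Freq}(n)} \;>\; \frac{\mathit{Cost}_{exp}(n)}{\mathit{Freq}(n)}$$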
2. Cost-benefit analysis based upon probabilistic data flow analysis
This analysis uses probabilistic data flow analysis (PDFA) guided by two-edge profiles. The
algorithm neither requires the entire control flow trace nor computes data flow information
on a per-path basis.
While traditional data flow analysis calculates whether a data flow fact holds or does not
hold at some program point, PDFAs calculate the probability with which a data flow fact
will hold at some program point. PDFA provides us with the following probabilities:
AvailProb_exp(n) is the probability that expression exp is available at node n, and
AntiProb_exp(n) is the probability that exp is anticipable at node n. We therefore
reformulate the speculation-enabling condition for expression exp at a conditional node n
in terms of these probabilities as follows (see the reference for the detailed analysis system):
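One plausible form of this reformulation, under the simplifying assumption that availability before n and anticipability after n can be treated as independent events (both the independence assumption and the exact form below are this sketch's, not necessarily the original system's):

$$\mathit{EnableSpec}_{exp}(n) \;\equiv\; \mathit{AvailProb}_{exp}(n)\cdot\mathit{AntiProb}_{exp}(n) \;>\; \bigl(1-\mathit{AvailProb}_{exp}(n)\bigr)\cdot\bigl(1-\mathit{AntiProb}_{exp}(n)\bigr)$$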
Availability and anticipability are calculated based on the traditional formulations of the
data-flow problems as shown below:
Availability analysis is guided by the local Boolean predicates LocAvail and LocBlkAvail.
LocAvail_exp(n) equals T if there is an occurrence of exp in n which is not blocked by a
subsequent statement of n, that is, a statement which modifies variables used in exp.
LocBlkAvail_exp(n) equals T if exp is blocked by some statement of n.
Similarly, anticipability analysis is guided by the local predicates LocAnti and LocBlkAnti.
LocAnti_exp(n) equals T if there is an occurrence of exp in n which is not blocked by a
preceding statement of n, that is, a statement which modifies variables used in exp.
LocBlkAnti_exp(n) equals T if exp is blocked by some statement of n.
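The equation systems have the standard shape sketched below, where the N- and X- prefixes denote values at a node's entry and exit (matching the N-/X- notation used further down):

$$\textit{N-Avail}_{exp}(n) = \bigwedge_{m \in \mathit{pred}(n)} \textit{X-Avail}_{exp}(m), \qquad \textit{N-Avail}_{exp}(\mathit{start}) = \mathrm{F}$$
$$\textit{X-Avail}_{exp}(n) = \mathit{LocAvail}_{exp}(n) \;\lor\; \bigl(\textit{N-Avail}_{exp}(n) \land \lnot\,\mathit{LocBlkAvail}_{exp}(n)\bigr)$$
$$\textit{X-Anti}_{exp}(n) = \bigwedge_{m \in \mathit{succ}(n)} \textit{N-Anti}_{exp}(m), \qquad \textit{X-Anti}_{exp}(\mathit{end}) = \mathrm{F}$$
$$\textit{N-Anti}_{exp}(n) = \mathit{LocAnti}_{exp}(n) \;\lor\; \bigl(\textit{X-Anti}_{exp}(n) \land \lnot\,\mathit{LocBlkAnti}_{exp}(n)\bigr)$$

Here the meet over predecessors and successors is conjunction, reflecting that both are all-path (must) problems; this is why the complementation described next is needed before the union-based PDFA framework can be applied.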
Since probabilistic data-flow systems are based on any-path problems, in which the meet
operator is union, the analysis problems have to be transformed as described in [39]. If
a "must-be-X" problem is an intersection problem, then the corresponding "may-not-be-X" problem is a
union problem, and the solution of the "must-be-X" problem is the complement of the
solution of the "may-not-be-X" problem.
Note that the initializations of the start and end node have to be changed appropriately as
well. Since now the confluence operator for both problems is union, the PDFA framework
can be applied in a straightforward way. The solutions N-NoAvail*, X-NoAvail*, N-
NoAnti*, and X-NoAnti* of the equation systems denote the probabilities of the
complementary problems. Thus, the probabilities for availability and anticipability are
given by N-Avail* = 1 - N-NoAvail*, X-Avail* = 1 - X-NoAvail*, N-Anti* = 1 - N-NoAnti*,
and X-Anti* = 1 - X-NoAnti*.
[39] T. Reps, S. Horwitz, and M. Sagiv, "Precise interprocedural dataflow analysis via
graph reachability," in Proceedings of the 22nd ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages (POPL '95), San Francisco, California, United
States, 1995, pp. 49–61. doi: 10.1145/199448.199462.