Paper Observation -- Lecture 5

Xiao Yang - xiaoyang@ee.princeton.edu -

The following are discussions on paper Exploiting Hardware
Performance Counters with Flow and Context
Sensitivity Profiling.

The paper Exploiting Hardware Performance Counters with Flow
and Context Sensitivity Profiling proposed a new method for doing
efficient and accurate profiling. It associates metrics collected by
hardware counters with path information. The paths can be either
paths within a procedure or sequences of procedure calls. By doing
so, we gain much insight into the values of hardware metrics and their
relation to dynamic program execution. Therefore, we can perform much
more precise dynamic program analysis with the new profiling method
than with the old profiling methods, which don't associate metrics
with context. Traditional profiling method only apportion simple
metrics, such and execution frequency or elapsed time to static,
syntactic units such as statements or procedures. In this way, we can
identify the hot procedures which have much more impact on metrics
than the other procedures. However, hot procedures execute many
paths, so knowledge about hot procedures doesn't provide much useful
information about hot paths, such as which paths are the hot paths.
With relatively low cost, the method introduced in this paper provide
much more insight into program's dynamic behavior.

Besides all the benefits of the new profiling method as claimed, the
method still has shortcomings.

As in paper Efficient Path Profiling, the intra-procedural
profiling still need to break cyclic CFG's generated by loops or
similar structures to acyclic CFG's in order to deal with them. As a
result, the flow sensitive profiling can't deal with the case that
paths are strongly correlated to loop iterations. By breaking loop
structure to acyclic segments are treat them as different paths,
certain information about the loop is lost. For example, it could be
the case that a loop is iterated twice most commonly. However, this
cannot be seen from the profile collected using the proposed method,
because this method treat the loop as several pieces and the metrics
for these pieces are aggregated for each of them separately. For this
case specifically, it may be an effective optimization to unroll the
loop once to fit the characteristic of this loop, but there is no way
to reason this optimization based on the profile collected.

The same problem exists for context sensitive profiling. The context
calling tree(CCT) is a better idea than the dynamic calling tree(DCT)
in that it provide almost the same amount of context information, but
without redundancy. It's also superior to dynamic calling graph(DCG)
in that it have all the context information that DCG doesn't have and
no non-existing paths that falsely be deduced from DCG. The
shortcomings lie in the way it treats recursive calling. The result
is a compromise between space and accuracy. The problem it has here
is similar to the problem in introduced in breaking cyclic CFG's to
acyclic CFG's, but the one is more complicated. The self recursion
cases such as M -> A -> A -> A -> ... are like the loop case. Consider
to case that the procedure call
sequence A -> B -> A -> C is very frequently
executed in a program, while there exist other call sequences
M -> A -> B -> A -> B -> ..., which are much less frequently executed.
Such a scenario can be
captured easily in DCT because we can see many more instances of the
former sequence than the latter ones. However, in CCT, because the
existence of recursion here, all the sequences will be actually
aggregated. We'll see a loop formed by A and B and a separate
edge from A to C in CCT, and we would identify A -> B,
B -> A and A -> C as hot paths. This is not too
bad, except that we lose the chance to capture a much longer hot path
and perform optimal optimization.

In order to keep the profile size manageable, this method actually
makes a tradeoff by breaking full path profiling into context
sensitive profiling and flow sensitive profiling, for procedure call
sequences profiling and intra-procedural path profiling, respectively.
As a result, we lose accuracy. For example, for a hot path in a
procedure, we don't know whether the flow comes from a single path in
its parent procedure, or from many paths in one or many procedures.
In the case where some procedures are strongly correlated, we are not
able to do certain aggressive optimizations. For example, it may be
the case that a hot path in one procedure and a hot path in one of
this procedure's children procedure make a longer hot path. We can't
gain such information from separate context sensitive profiling and
flow sensitive profiling because we block intra-procedural path
information at procedure call boundaries. In the extreme case, it may
be beneficial to optimize the code by duplicate the code of one
procedure and optimize it together with code of another procedure
which is strongly related to it. But we can't gain support for this
kind of optimization from the proposed method. It may be worthwhile
to associate parent procedure's path information with current
procedure's path profile, then we could discover some interesting
characteristics about procedure pairs, although doing so could
increase the size of the profile significantly.

The perturbation of instrumentation is hard to estimate. When the CCT
is deep and bushy, tracing back to tree to find recursion or search
though a bunch of pointers to sibling procedures to identify previous
instance might be very expensive. The average results provide in the
paper seem to be geometric means of individual results, which hide the
large variation in individual data. The perturbation and overhead for
certain programs are quite large. For these case, it's questionable
whether the metrics collected is more heavily related to the spots of
instrumentation than to the original program.

Paper Observation -- Lecture 7

Xiao Yang <xiaoyang@ee.princeton.edu>

The follow is addition to the discussions on paper {Exploiting
Hardware Performance Counters with Flow and Context Sensitivity
Profiling}, which I forgot to put last time.

As mentioned in the last paper observation, we lost information across
procedure boundaries. We can't determine whether the flow in a hot
path in the current procedure comes from a single path in a single
caller procedure or from many paths in many caller procedures. While
doing full path profiling and associating HW metrics with it is too
costly, there is a way to obtain full path execution frequency
profiling with just context sensitive profiling and flow sensitive
profiling. What we need to do is to maintain a path map for callee
procedures. The entry in the table contains the identifier of the
caller procedure, the path number in caller procedure and the path
number in current procedure. It can be implemented using links and
create only when needed. During a procedure call, the caller's
identification and path number can be obtained by callee procedure.
When callee procedure exits, the path number in callee procedure is
also know. We can update the table correspondingly. Identical
entries can be merged and a counter maintained. The table in a
procedure may end up like

Caller Caller's Path Number Path Number Count

foo 7 3 1

bar 5 4 10

... ... ... ...

The upper bound of the size of all the tables is limited by the number
of executed paths in the program and it's manageable. With these
tables, we can construct the full path map for a program and their
execution frequencies. Although we can't get separate HW metrics for
them because we lose such information when merging identical entries,
we can still identify much longer hot paths then we can do using only
intra-procedural path profiling. This is especially helpful when we
have several strongly correlated procedures.

The following are discussions on paper \textit{An Efficient Algorithm
for Exploiting Multiple Arithmetic Units} by Tomasulo.

Tomasulo's algorithm is a very efficient algorithm for out-of-order
execution on multiple functional units(FU's). It can basically
eliminate all the stalls caused by false data dependencies(WAR and
WAW) and stall only for real data dependencies(RAW). By creating a
distributed set of instruction buffers between instruction dispatcher
and individual FU's and using common data bus(CDB) to forward data
within and across FU's, it can use multiple FU's efficiently and
achieve high level of ILP. Therefore, it is used in many legacy and
modern microprocessors.

The shortcoming of Tomasulo's algorithm is its very high
implementation cost.

Cost of reservation stations

Each reservation station hold many pieces of information, namely, busy
flag, operation of the instruction in it, destination register,
sources and their tags. It also need comparators to compare tags of
data coming from CDB. Effectively, each reservation station is a big
register with comparators and the cost is pretty high. Since the
algorithm will also generate stalls if the reservation stations for an
FU is full and they are all waiting for result from other FU's. In
this case, even if we have other instructions using the same FU and
having all the sources ready to proceed, it can't be executed because
no reservation stations are available for that FU. Therefore, to
ensure the FU's are used efficient, we need to have many reservation
stations attached to each FU to obtain a large instruction window.

Cost of common data bus

CDB is use to broadcast result from FU's to all FU's. Since all FU's
share the same CDB, a complicate bus controller is needed to arbitrate
the bus. FU's need to send requests to the bus controller before they
produce the result in order to reserve the bus, the bus controller
decide the order that data are put on the bus. It would be very
expensive when the number of FU's sharing the bus becomes large.
Since the CDB has large fanout, it is fat and costly to design.
Although reservation stations or instruction buffers are distributed,
the CDB and the bus controller is centralized, which are potentially
bottlenecks. We may want to use multiple data buses to alleviate the
problem if it becomes serious. Doing so will increase the cost. The
cost of reservation stations will also increase greatly because it's
the same as adding more ports to a register file.

Cost of comparators

The number of comparators is proportional to number of FU's, number of
reservation stations for each FU and number of CDB's. Large number of
FU's, number of reservation stations per FU and number of CDB's tend
to provide higher performance. The number of comparators increases
sharply with increase in these numbers and they are costly.

As a result of the high cost, most commercial implementations of
Tomasulo's algorithm are kept simple. For example, the PowerPC 750
microprocessor used in PowerMac G3 has 5 FU's with one reservation
station for each FU(2 for load/store unit) and one CDB.