Grant Wallace
CS 597d
Observation 4:
Exploiting Hardware Performance Counters with Flow and Context Sensitive
Profiling [1]
It’s interesting that dense procedures have less
than half the number of paths (on average) than sparse procedures (table
5). It doesn’t seem like this would be generally true, but if there were
a correlation it would be encouraging from an optimization standpoint.
In terms of the instrumentation perturbation problem,
maybe different profile runs could be optimized to measure different metrics.
For example if we wanted to look at cache misses, we could keep the count
array in non-cacheable memory. If we could avoid register peeling (by having
a special profiling register, maybe it could double as a debug register
in another context) then there would be no data-cache perturbation except
possibly at procedure calls were an extra stack save/restore would occur.
But since stack data is likely to be accessed here anyway, this might not
perturb the cache very much.
In section 4.1 it states that the breadth of a CCT
is bounded by the number of procedures in a program (i.e. O(n)). But consider
the case where every procedure calls every other procedure except itself
or any of its ancestors. This is equivalent to every procedure calling
every other procedure with the recursion removed as per the CCT algorithm.
In this case, the worst case, the breadth can be (n-1)! where n is the
number of procedures in the program.
As an example, consider nodes A,B,C,D. The CCT would look like…
A
/ | \
/
| \
/
| \
B C
D (n-1) children per parent node
/ | / \
| \
C D B D B C
(n-2) children per parent node
| | |
| | |
D C D B C B
(n-3) children per parent node
Breadth is (n-1)! which is 3! in this case.
[1] "Exploiting Hardware Performance Counters with Flow and Context
Sensitive Profiling", G. Ammons, T. Ball, and J. Larus, Proceedings of
the 1997 ACM SIGPLAN Conference on Programming Language Design and Implementation.