A Comparison of Full and Partial Predicated Execution Support for ILP Processors

This article brought up a couple of thoughts I've been having about
profiling and predication.  One problem I have with depending on
predication is the difficulty of compiling for it.  These articles show
that it can be done effectively, but it is still difficult.  I think
this difficulty comes in large part from the dependence on profiling
data, and my concern with profiling is its feasibility.  I haven't yet
seen a
system that would make me use profiling in the applications that I
write.  I figure that if I've seen, or at least been introduced to, the
benefits and still wouldn't use it because of the inconvenience, then
only a very small group of programmers will actually work to find these
performance increases.

So I wonder if it would be possible to effectively compile for
predication without actually profiling, or by creating a simple strategy
for gathering useful profile data.

Another thing I've been wondering is how programs perform when they are
profiled as fully as the SPEC benchmarks are in IMPACT.  Basically, I
look at these tables and see that, yes, there is a significant
performance increase, but at the same time we are able to completely
profile the running of these programs, optimize, and then run the same
test again,
only more optimized.  What happens when we run a small profile and then
distribute the real program?  What are the performance gains there?