Observations of Branch Prediction Papers -- Lecture 3

Xiao Yang - xiaoyang@ee.princeton.edu

The following observations are mainly for paper[1], with occasional references to paper[2].

Effectiveness of the Approach used in Paper[1]

The approach in paper [1] uses the value(s) of certain register(S) at several cycles before a branch to construct the predictor for that branch. While it was shown in the paper that this approach achieved satisfying result, the effectiveness of it in general is worth doubting. In certain cases, compiler is unable to schedule the branch condition to far ahead of the branch. and the condition is related to certain register(s) in some other manner than can be exploited using the approach in paper\cite[1], than it won't work. Consider the code below.

for (i=0;i<100;i++) {
  if(i%2==1)
    { ... }
  else
    { ... }
... }

In the above case, it may not be easy for the compiler to separate the branch condition from the branch for many cycles. The value of the register for i, which can be known well in advance, does not really provide any information about the branch. The branch condition is not related to any other registers than the one holding i. In this case, no matter how we profile the program and generate value-based predictor for the branch using the method from paper[1], the predictor would perform the same bad, while a synthesized dynamic branch predictor based on the POP scheme or PEP scheme[2], or a hardware two-level adaptive branch predictor can do much better. It can also be noticed that all the programs used for evaluation in paper[1] are integer programs. Currently, it seems to be the case that floating-point performance is more critical to the evaluation of processor performance because most multimedia applications use floating point. It is desirable that the method can also work well for floating point programs. However, since floating point registers usually cover much larger value range than integer registers, it will be very hard to find a generic quantizing scheme to fit the register value based approach in paper[1].

Question on Profile sensitivity

This is rather a question on profiling than a question on the approach. The input data set to a program can vary greatly at run time. It naturally becomes a question that how can profiling effectively cover all the branches in the program and get the correct taken(not taken) probability. For example, a branch may be taken if the value of a register(suppose it corresponds to input) is even and not taken otherwise. During run time, the values of that register are either all even or all odd. When profiling, no matter if we use even values, odd values or half even, half odd value, we didn't catch any valuable run time characteristics about the branch. Since the method in paper[1] is based solely on profiling, it would work poorly, while approaches not(or not entirely) based on profiling would work much better, such as POP, PEP[2], two-level hardware predictors.

Mix Synthesized Branch Predictor and Hardware Predictor

Our ultimate goal is to provide higher branch prediction rate. Hardware cost may become less of an issue as chip fabrication technology advances. It's not meaningful to substitute current hardware branch predictors with hardware that support compiler-synthesized branch predictors entirely if the resulting prediction rate becomes lower. If cost is not a big issue, perhaps it's better to keep both hardware and mix the use of them.

Compiler-synthesized predictor can utilized the useful information gained during compilation and profiling. Hardware predictors are inferior in terms of these. However, the advantages of hardware predictors are that they are fast and they perform very well in general. For synthesized predictors, because they insert prediction instructions into the code, they increase the number of instructions to be executed. In case that both synthesized predictor and hardware predictor predict correctly, using synthesized version is slower, as summarized below.
- S right, H right, S slow;
- S right, H wrong, S fast;
- S wrong, H right, S slow;
- S wrong, H wrong, S slow.

Therefore, it may be worth consideration that we mix synthesized predictor with hardware predictor based on profiling information(if profiling is useful in general). For the portion of code that synthesized predictor achieves higher prediction rate, use synthesized predictor, otherwise use hardware predictor.

[1] Scott. Mahlke and B. Natarajan. Branch Prediction Based on Universal Data Compression Algorithms. Proceedings of the 29th International Symposium on Microarchitecture, 1997

[2] D.~I. August, D.~A. Connors, J.~C. Gyllenhaal and W.~W. Hwu Architectural Support for Compiler-Synthesized Dynamic Branch Prediction Strategies: Rationale and Initial Results. The 3rd International Symposium on High-Performance Computer Architecture, 1997