Xiao Yang - xiaoyang@ee.princeton.edu
The following observations are mainly about the branch correlation paper [1], with occasional references to the other two.
The branch correlation scheme used in this paper is in fact a GAp implementation of a two-level adaptive branch predictor, with the GAg implementation [2] as a special case. It uses a single global branch history register (BHR) and a per-address pattern history table (PPHT). Its BHR is actually formed by concatenating the last few bits of the branch address with the branch history, which is slightly different from the BHR definition used in paper [2]. They use the branch history to index into the corresponding column, which they term the correlation dimension, and use the branch address to select the correct row (the entry dimension) to obtain the predictor bits.
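As a concrete illustration, here is a minimal sketch in C of how such a table might be indexed. The widths (4 history bits, 6 address bits), the 2-bit counters, and the names (bpt_index, etc.) are my own illustrative assumptions, not taken from the paper.

    #include <stdint.h>

    #define HIST_BITS   4                      /* global branch history bits     */
    #define ADDR_BITS   6                      /* low-order branch address bits  */
    #define NUM_ENTRIES (1 << (HIST_BITS + ADDR_BITS))

    static uint8_t  counters[NUM_ENTRIES];     /* 2-bit saturating counters      */
    static uint32_t ghr;                       /* global branch history register */

    /* Concatenate low-order PC bits (entry dimension) with the global
     * history (correlation dimension) to select a predictor counter. */
    static uint32_t bpt_index(uint32_t pc)
    {
        uint32_t addr = pc & ((1u << ADDR_BITS) - 1);
        uint32_t hist = ghr & ((1u << HIST_BITS) - 1);
        return (addr << HIST_BITS) | hist;
    }

    static int predict(uint32_t pc)
    {
        return counters[bpt_index(pc)] >= 2;   /* taken if counter is 2 or 3 */
    }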
The result that the branch correlation scheme achieves a significant performance gain over the plain 2-bit counter scheme is easy to understand. The 2-bit counter scheme predicts a branch without reference to previous correlated branches, which are normally present in programs. Thus it cannot exploit the very useful information provided by the outcomes of the branches before the current one, which are very likely correlated with it. The branch correlation scheme sacrifices address resolution to exploit the correlation among branches. Fortunately, most of the programs used for evaluation are computation-intensive tasks. They have good spatial locality, which means that at any given time they are very likely executing only small segments of code. There are usually not many branches in a small segment of code, so the sacrifice in address resolution does not really lose much. On the other hand, for branch prediction tables (BPT) of the same size and the same number of predictor bits, the two schemes have the same number of entries. The correlation scheme does not require a larger training set than the 2-bit counter scheme, yet it bases its prediction on more information. No wonder it outperforms the 2-bit counter scheme noticeably.
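To make the mechanics concrete, the following sketch continues the code above (again an assumed structure, not the paper's implementation): the 2-bit saturating counter update is common to both schemes, while the history shift is what the correlation scheme adds.

    /* Update the selected 2-bit counter after the branch resolves, then
     * shift the outcome into the global branch history register. */
    static void update(uint32_t pc, int taken)
    {
        uint8_t *c = &counters[bpt_index(pc)];

        if (taken) {
            if (*c < 3) (*c)++;                /* saturate at strongly taken     */
        } else {
            if (*c > 0) (*c)--;                /* saturate at strongly not taken */
        }

        ghr = (ghr << 1) | (taken ? 1u : 0u);  /* record the outcome in history  */
    }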
However, it is debatable that the GAg implementation (the degenerate case in paper [1]) performs quite well too. The disadvantage of GAg is that it discards all branch address information and thus loses the sequencing of branches completely, so the prediction result may not be very useful. For example, if we have 6 branches, b1 - b6, and the shift register contains 4 bits, it cannot distinguish the case of b1 - b4 generating history 1011 from the case of b3 - b6 generating history 1011. It is unlikely that these different branch sequences have the same characteristics. In such cases, branch outcomes from different sequences will interfere with each other, rendering the prediction result largely useless, if not entirely. I suspect the reason they achieved a significant performance gain in the degenerate case is that they used many (in their case, 15) bits for the shift register. It is unlikely for a program, especially a program from SPEC, to have 15 branches in a small segment of code. A 15-bit history register and the corresponding table may cover all the possible combinations of branch outcomes in most cases, so the indexing can be fairly accurate. Moreover, they used SPEC programs for evaluation, which I think have more loop structures than normal programs and therefore better spatial locality in their code. GAg may be particularly good for these cases. In general, it is a dubious result that GAg can gain much performance. Gshare, mentioned in paper [3], is a better scheme in general because it has the advantages of GAg while also retaining the branch address information in some form.
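For reference, a gshare-style index simply XORs the global history with low-order PC bits instead of concatenating them, so branch address information is retained without widening the index. This sketch reuses the definitions from the first code fragment; the widths are again my own assumption.

    /* gshare-style indexing: XOR history with PC bits so that branches
     * with the same history but different addresses map differently. */
    static uint32_t gshare_index(uint32_t pc)
    {
        uint32_t mask = (1u << (HIST_BITS + ADDR_BITS)) - 1;
        return (pc ^ ghr) & mask;
    }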
Another problem is the effectiveness of the method proposed in paper [1]. Although their scheme performs quite well on SPEC programs, it is hard to say that it would perform equally well on other programs. For example, there may be pollution problems. If there are irrelevant branches among correlated branches, their scheme would perform poorly; an extreme case is when many independent branches fall between two (or two groups of) correlated branches and the number of independent branches exceeds the number of history bits. The outcomes of the independent branches will purge the correlation information from the history. In the worst case the prediction result could be worthless. It is worthwhile to conduct benchmarking on more programs than the SPEC ones.
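The pollution concern can be illustrated with a small shift-register sketch (my own illustration, assuming a 4-bit history):

    /* With a 4-bit history, any four unrelated branch outcomes pushed
     * between two correlated branches shift the correlated outcome out
     * of the register before it can influence the next prediction. */
    #define H_MASK 0xF                              /* 4-bit history */

    static uint32_t push_outcome(uint32_t h, int taken)
    {
        return ((h << 1) | (taken ? 1u : 0u)) & H_MASK;
    }

If the outcome of the first correlated branch is pushed in and then four or more independent outcomes follow, the bit contributed by that branch is already gone when the second correlated branch is predicted.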
It is usually the case that all the entries in the BPT are initialized to 0, which means many branches will be predicted not taken at the beginning. For a for loop, the predictor would then need to mispredict twice before it can guess right. I do not know whether the same problem is present across context switches. The idea behind leaving the BPT unchanged during a context switch is that the BPT contents should be random enough at that time. However, I suspect many programs may leave the BPT in a not-so-random state due to loops, especially nested loops. Initializing the table to all 1's (weakly not taken) or all 2's (weakly taken), and resetting it to the same weak state on a context switch, would make it look more neutral to a new program. A table left in a weak state also adapts to a change in branch patterns faster than one left in an arbitrary state. This method may help the predictor eliminate a few mispredicted branches, though probably not very many. The conclusion needs to be verified by simulation.
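The initialization argument is easy to check with a toy simulation (my own sketch, not from any of the papers): an always-taken loop branch is mispredicted twice when its 2-bit counter starts at 0 (strongly not taken), once when it starts at 1 (weakly not taken), and never when it starts at 2 (weakly taken).

    #include <stdio.h>

    /* Count mispredictions for a branch that is taken n times in a row,
     * with the 2-bit counter starting at init (0 = strongly not taken,
     * 3 = strongly taken). */
    static int mispredicts(int init, int n)
    {
        int c = init, wrong = 0;
        for (int i = 0; i < n; i++) {
            if (c < 2) wrong++;                /* predicted not taken, was taken */
            if (c < 3) c++;                    /* counter saturates upward       */
        }
        return wrong;
    }

    int main(void)
    {
        printf("init 0: %d mispredictions\n", mispredicts(0, 100));  /* 2 */
        printf("init 1: %d mispredictions\n", mispredicts(1, 100));  /* 1 */
        printf("init 2: %d mispredictions\n", mispredicts(2, 100));  /* 0 */
        return 0;
    }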
[1] S. T. Pan, K. So, and J. T. Rahmeh. Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation. Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, 1992.
[2] T. Y. Yeh and Y. N. Patt. Alternative Implementations of Two-Level Adaptive Branch Prediction. Proceedings of the 19th International Symposium on Computer Architecture, 1992.
[3] P. Y. Chang, E. Hao, and Y. N. Patt. Alternative Implementations of Hybrid Branch Predictors. Proceedings of the 28th International Symposium on Microarchitecture, 1995.