NeuroBPredict

Branch Predictor


Modern CPUs have grown significantly more complex, as manufacturers like AMD and Intel can no longer achieve speedups simply by packing more transistors onto a single chip. This limitation is largely physical: quantum effects become significant at the feature sizes these chips have reached. As a result, manufacturers have turned to more creative ways of achieving speedups, largely by increasing parallelism. This has emerged in the form of multi-core processor designs and also branch prediction.

Branch prediction has become an essential part of the CPU pipeline, particularly as a way to increase ILP (instruction-level parallelism). Specifically, branch predictors are responsible for pre-fetching instructions for decoding/execution at conditional branches, i.e. points in the code where it is impossible to know which instruction to fetch next. In other words, the CPU guesses the outcome of the branch and fetches instructions accordingly, hoping to avoid stalling until the conditional branch finishes executing before further instructions can be fetched and processed. Modern branch predictors typically integrate global history and local history, which respectively correspond to applying the results of previous branch outcomes across the overall program and at particular values of the program counter (i.e. by location in the code). With the rise of neural nets, however, it was inevitable that they too would be tested in this regime.

The main issue with neural net implementations is latency, i.e. how long a prediction takes relative to the time it saves. Since modern methods already achieve accuracies upwards of 90-95%, perceptron-based predictors were long dismissed as unlikely to be adopted: a few extra percentage points of accuracy did not seem worth the added delay. Recent research suggests otherwise, however, finding that perceptrons may in fact be comparable in both accuracy and latency to the methods that are currently most widespread. We explored several branch predictor designs across different executables and across ISAs, i.e. ARM and x86, all through the gem5 simulator environment.

StaticBP


Branch predictors can be subdivided into two major classes: "static" and "dynamic." Static branch predictors are the simplest variant. As the name suggests, these predictors guess the same result regardless of their performance on past conditional cases. That is, a static predictor will either always predict "true" whenever it encounters a branch (i.e. will inform the CPU to directly fetch instructions for the "true" case) or always predict "false." There are variants, e.g. alternating between the two, but these two are clearly the most sensible and straightforward to implement. Surprisingly, static branch predictors work quite well in practice, especially given their zero latency: no computation must be performed before fetching instructions. Thus, static BPs form a solid baseline against which to evaluate dynamic BPs.
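The static policy above is simple enough to capture in a few lines. The following sketch (class and variable names are illustrative, not from our gem5 implementation) shows a predictor that always answers "true" and ignores all feedback:

```python
# Minimal sketch of a static branch predictor: one fixed guess, no training.
class StaticBP:
    def __init__(self, always_taken=True):
        self.always_taken = always_taken  # the single fixed guess

    def predict(self, pc):
        # The guess ignores the PC and all past outcomes entirely.
        return self.always_taken

    def update(self, pc, taken):
        pass  # static: there is no state to train

predictor = StaticBP(always_taken=True)
outcomes = [True, True, False, True]  # e.g. a loop branch taken 3 of 4 times
correct = sum(predictor.predict(0) == t for t in outcomes)
```

Because `predict` does no work, its "latency" is effectively zero, which is exactly why this trivial baseline is competitive on branch-heavy loops that are almost always taken.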


Branch Prediction Results

Executable Conditional Accuracy Indirect Accuracy Latency
Bubblesort.txt 330833 1934 192.72
ConnCompMedium.txt 127705 25611 74.40
ConnCompSmall.txt 57430 12348 30.76
IntMM.txt 10607 1889 32.09
Oscar.txt 20943 2852 58.84
Puzzle.txt 580458 3029 946.32
Queens.txt 268830 1998 100.58
Quicksort.txt 299392 13046 116.42
RealMM.txt 10813 1997 31.88
Towers.txt 209757 32586 82.68

LocalBP


Dynamic BPs, on the other hand, typically make use of global history, local history, or a combination of the two. The LocalBP relies exclusively on local history for its predictions, meaning that we keep a table (i.e. an array) indexed by a hash of the program counter. This predictor makes use of a saturating counter, a structure commonly integrated into branch predictors that prevents a single change in conditional outcome from completely throwing off future predictions. Specifically, this saturating counter is essentially a Markov model with four states, called "strongly accepting," "weakly accepting," "weakly rejecting," and "strongly rejecting," where the accepting states correspond to the BP predicting that the conditional will take the true branch and the rejecting states the opposite.

Thus, each distinct hash of the program counter corresponds to a different saturating state machine. These predictors typically perform well when different conditionals across the program have distinct behaviors, although they suffer the downside of not capturing any overall trends in the program (e.g. that most conditionals are true).
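The table-of-counters structure above can be sketched as follows, with states 0-1 playing the "rejecting" roles and 2-3 the "accepting" ones (table size and the PC hash are illustrative choices, not the gem5 defaults):

```python
# Sketch of a local predictor: a table of 2-bit saturating counters indexed
# by a hash of the program counter.
class LocalBP:
    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        self.counters = [2] * (1 << table_bits)  # start in "weakly accepting"

    def _index(self, pc):
        return pc & self.mask  # trivial PC hash for illustration

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # states 2-3 predict "true"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate high
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate low

bp = LocalBP()
# A branch that is almost always taken: one surprise outcome does not flip it.
for t in [True, True, False, True]:
    bp.update(0x40, t)
```

Note how the single `False` outcome only drops the counter from "strongly" to "weakly" accepting, so the very next prediction is still "true" — this is precisely the hysteresis the saturating counter provides.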

Branch Prediction Results

Executable Conditional Accuracy Indirect Accuracy Latency
Bubblesort.txt 198284 1486 91.44
ConnCompMedium.txt 104713 8770 55.55
ConnCompSmall.txt 55185 5713 31.15
IntMM.txt 9072 1345 32.73
Oscar.txt 15798 2538 58.68
Puzzle.txt 42277 2149 32.41
Queens.txt 210521 4005 59.36
Quicksort.txt 217449 5352 63.20
RealMM.txt 9286 1522 31.99
Towers.txt 58558 1653 81.71
Treesort.txt 368046 14900 114.14

TournamentBP


Unlike the previous predictor, the tournament branch predictor makes use of several of the previous branches. That is, it makes correlated predictions based on the last m branch outcomes. In particular, each local branch has an indicator that tracks which of the past m branches is most correlated with its outcome, making this a straightforward extension of the LocalBP: the tournament predictor can easily capture the case where the most recent branch is the most heavily correlated one. Unlike the single local branch predictor, this predictor can make predictions that depend on results further back than the most recent branch, effectively extending the scope of what is "significant" in making a prediction. Of course, with this extended input domain comes the risk of latching onto factors that merely appear relevant in a training sample, along with increased prediction/training latency.
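One common way to realize "correlated predictions based on the last m branches" is a two-level scheme: the last m global outcomes form a bit pattern that selects among 2^m saturating counters per PC hash. This is a simplified sketch of that idea (parameters and hashes are illustrative), not the exact gem5 TournamentBP structure:

```python
# Sketch of a correlated (two-level) predictor: the last m branch outcomes
# select one of 2**m saturating counters in the row chosen by the PC hash.
class CorrelatedBP:
    def __init__(self, m=4, table_bits=8):
        self.m = m
        self.history = 0                      # last m outcomes as a bit pattern
        self.mask = (1 << table_bits) - 1
        # one row of 2**m counters for every PC hash value
        self.table = [[2] * (1 << m) for _ in range(1 << table_bits)]

    def predict(self, pc):
        return self.table[pc & self.mask][self.history] >= 2

    def update(self, pc, taken):
        row = self.table[pc & self.mask]
        if taken:
            row[self.history] = min(3, row[self.history] + 1)
        else:
            row[self.history] = max(0, row[self.history] - 1)
        # shift the newest outcome into the m-bit history register
        self.history = ((self.history << 1) | taken) & ((1 << self.m) - 1)

bp = CorrelatedBP()
# An alternating branch becomes predictable once its pattern fills the history.
for t in [True, False] * 8:
    bp.update(0x10, t)
```

A plain LocalBP would hover around 50% on this alternating branch, while the history-indexed counters learn that pattern "1010" is always followed by "taken."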

Branch Prediction Results

Executable Conditional Accuracy Indirect Accuracy Latency
Bubblesort.txt 69726 3685 89.56
ConnCompMedium.txt 112205 26699 74.81
ConnCompSmall.txt 50478 13169 31.48
IntMM.txt 9214 2903 32.33
Oscar.txt 11094 5669 58.21
Puzzle.txt 2317 1367 0.65
Queens.txt 113906 16731 59.00
Quicksort.txt 274604 93990 71.36
RealMM.txt 9372 2978 32.55
Towers.txt 15795 3116 81.96
Treesort.txt 119107 29582 33.24

BimodalBP


Unlike both the tournament and local branch predictors, the Bimodal branch predictor, as the name suggests, maintains both a global and a local history of the conditional branches taken so far, along with a "choosing" parameter that determines which of the two histories is more directly correlated with the outcomes. In this way, future predictions can make use of both global and local structure in the program. That is, we can detect emerging global structure (e.g. loops alternating between true and false) as well as local constructs, further increasing the scope. These branch predictors tend to do quite well in practice and, in one variant or another, have largely become the de facto standard of branch predictors.
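The global component, local component, and "chooser" described above can be sketched as follows. This is a hedged illustration of the combining idea under simplified assumptions (equal table sizes, trivial hashes), not the exact gem5 implementation:

```python
# Sketch of a combining predictor: a local table, a global-history table, and
# a per-branch "choice" counter that learns which component is more accurate.
class CombinedBP:
    def __init__(self, bits=8):
        n = 1 << bits
        self.mask = n - 1
        self.local = [2] * n    # PC-indexed saturating counters
        self.glob = [2] * n     # global-history-indexed counters
        self.choice = [2] * n   # >= 2 means "trust the global component"
        self.history = 0

    def predict(self, pc):
        i = pc & self.mask
        local_pred = self.local[i] >= 2
        global_pred = self.glob[self.history & self.mask] >= 2
        return global_pred if self.choice[i] >= 2 else local_pred

    def update(self, pc, taken):
        i, g = pc & self.mask, self.history & self.mask
        local_pred, global_pred = self.local[i] >= 2, self.glob[g] >= 2
        # Move the chooser toward whichever component was right (if they differ).
        if local_pred != global_pred:
            if global_pred == taken:
                self.choice[i] = min(3, self.choice[i] + 1)
            else:
                self.choice[i] = max(0, self.choice[i] - 1)
        for tbl, j in ((self.local, i), (self.glob, g)):
            tbl[j] = min(3, tbl[j] + 1) if taken else max(0, tbl[j] - 1)
        self.history = ((self.history << 1) | taken) & self.mask

bp = CombinedBP()
for _ in range(8):
    bp.update(0x20, True)
```

The key design choice is that the chooser only trains when the two components disagree; when both are right or both are wrong, it has no evidence about which to prefer.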

Branch Prediction Results

Executable Conditional Accuracy Indirect Accuracy Latency
Bubblesort.txt 158021 2135 89.81
ConnCompMedium.txt 132635 37877 74.11
ConnCompSmall.txt 64185 19558 31.55
IntMM.txt 11714 1623 32.41
Oscar.txt 17935 4053 58.64
Puzzle.txt 2447 260 0.34
Queens.txt 189028 53924 59.32
Quicksort.txt 267541 85479 63.54
RealMM.txt 12137 1739 32.15
Towers.txt 52133 2006 82.27
Treesort.txt 1619 80 0.15

LTAGE


Much as neural networks grew prevalent through benchmark competition, the LTAGE branch predictor was essentially born out of online branch predictor competitions: competitions where the goal was to most accurately predict conditional outcomes produced by CPU dumps (with penalties for latency). LTAGE is a natural extension of the local branch predictor concept. With a local BP, the main issue lies in determining how large a history the predictor should consider. If it considers too much, latency grows and the predictor may find trends that are mere artifacts of the training set. If it considers too little, the BP has little information on which to base predictions, essentially reducing it to random chance. The LTAGE idea is thus to maintain variable-length local histories for different branches and track which length correlates best with the branches actually taken, allowing the predictor to sidestep the issue of fixing a history size in advance. You can explore this architecture more thoroughly through its paper at: https://www.jilp.org/vol9/v9paper6.pdf
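The variable-history-length idea can be sketched as a tiny TAGE-style predictor: several tagged tables track geometrically increasing history lengths, and the prediction comes from the longest history whose entry's tag matches, with a base table as fallback. Everything below (history lengths, hash functions, dict-based tables) is a toy illustration; real LTAGE adds a loop predictor, usefulness bits, and a careful allocation policy described in the linked paper:

```python
# Greatly simplified sketch of the TAGE idea underlying LTAGE.
class TinyTAGE:
    def __init__(self, hist_lens=(2, 4, 8), bits=6):
        self.hist_lens = hist_lens
        self.mask = (1 << bits) - 1
        self.base = {}                                # pc -> counter (fallback)
        self.tables = [dict() for _ in hist_lens]     # index -> (tag, counter)
        self.history = 0

    def _index_tag(self, pc, hlen):
        h = self.history & ((1 << hlen) - 1)          # use only hlen bits
        return (pc ^ h) & self.mask, (pc >> 2) ^ h    # toy hash functions

    def predict(self, pc):
        # Longest matching history wins; otherwise fall back to the base table.
        for t, hlen in reversed(list(zip(self.tables, self.hist_lens))):
            idx, tag = self._index_tag(pc, hlen)
            if idx in t and t[idx][0] == tag:
                return t[idx][1] >= 2
        return self.base.get(pc, 2) >= 2

    def update(self, pc, taken):
        for t, hlen in zip(self.tables, self.hist_lens):
            idx, tag = self._index_tag(pc, hlen)
            if idx not in t or t[idx][0] != tag:
                t[idx] = (tag, 2)                     # allocate a fresh entry
            ctr = t[idx][1]
            t[idx] = (tag, min(3, ctr + 1) if taken else max(0, ctr - 1))
        c = self.base.get(pc, 2)
        self.base[pc] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | taken) & 0xFFFF

bp = TinyTAGE()
for _ in range(4):
    bp.update(0x30, True)
```

The geometric spacing of history lengths is the crucial trick: short lengths cover common easy branches cheaply, while a few long lengths remain available for the rare branches that genuinely need deep history.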

Branch Prediction Results

Executable Conditional Accuracy Indirect Accuracy Latency
Bubblesort.txt 36086 2209 91.27
ConnCompMedium.txt 145920 48768 74.91
ConnCompSmall.txt 64545 21220 32.43
IntMM.txt 5908 1758 32.35
Oscar.txt 10587 3823 59.51
Puzzle.txt 104092 53946 453.51
Queens.txt 10384 3833 60.73
Quicksort.txt 190634 33496 63.22
RealMM.txt 6261 1959 31.65
Towers.txt 6500 1954 82.14

NeuralBP


Of course, as previously mentioned, with the rise of neural nets it was inevitable that they too would be applied to branch prediction. The first attempt ran into the following constraint: since these predictors operate at an extremely low level (on the CPU), they cannot afford the overhead of a complicated network's predictions. In other words, the "neural net" must be greatly limited in scope so that predictions can be made in a reasonable time frame. The resulting structure is simply a one-layer network of neurons, i.e. a series of weights associated with different points in the program, arranged in a table-like fashion.

In other words, different program counters (as in the local branch predictor) hash to different rows of the weight matrix, where each row corresponds to a fixed hashed PC value. The input to such a neuron is a vector of the global history of conditional outcomes at the point of prediction (as in the TournamentBP), and a weighted linear combination of this vector predicts the outcome of the branch. Thus, this algorithm closely resembles the TournamentBP except in how it is updated, which follows the standard perceptron update rule. More recent modifications to this predictor are described below. This work was made possible by the description in https://www.microarch.org/micro36/html/pdf/jimenez-FastPath.pdf.
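The perceptron scheme can be sketched as below. History outcomes are encoded as +1/-1, each hashed PC owns one row of integer weights, and training fires only on a misprediction or when the output magnitude falls below a threshold (the threshold form is the one suggested in the perceptron-predictor literature; row count and hash are illustrative):

```python
# Sketch of the single-layer perceptron predictor: one weight row per PC hash,
# input = global outcome history as +/-1, output = weighted sum (+ bias).
class PerceptronBP:
    def __init__(self, hist_len=8, rows=64):
        self.hist_len = hist_len
        self.rows = rows
        # theta ~ 1.93*h + 14 is the training threshold from the literature.
        self.theta = int(1.93 * hist_len + 14)
        # weights[row] = [bias, w_1 ... w_h]
        self.weights = [[0] * (hist_len + 1) for _ in range(rows)]
        self.history = [1] * hist_len          # outcomes as +1 (taken) / -1

    def _output(self, pc):
        w = self.weights[pc % self.rows]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y, t = self._output(pc), (1 if taken else -1)
        # Perceptron rule: train on a mispredict, or while |y| is still small.
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.weights[pc % self.rows]
            w[0] += t
            for j, xj in enumerate(self.history, start=1):
                w[j] += t * xj
        self.history = self.history[1:] + [t]  # shift in the newest outcome

bp = PerceptronBP()
for _ in range(10):
    bp.update(0x50, True)
```

The threshold keeps weights from growing without bound on easy branches, which both bounds storage and lets the predictor re-adapt quickly when a branch changes behavior.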

Branch Prediction Results

Executable Conditional Accuracy Indirect Accuracy Latency
Bubblesort.txt 185036 3312 133.41
ConnCompMedium.txt 132996 29265 129.98
ConnCompSmall.txt 63890 15752 53.99
IntMM.txt 8604 2259 39.19
Oscar.txt 13140 5125 71.05
Puzzle.txt 675985 204267 1202.73
Queens.txt 205150 19174 92.66
Quicksort.txt 253347 82381 103.60
RealMM.txt 9082 2631 37.24
Towers.txt 97834 18770 119.72
Treesort.txt 432152 70188 198.38

NeuralPathBP


The fast neural path BP is one of the state-of-the-art predictors that brought the neural perceptron design back into focus in the realm of BPs. The algorithm implemented here is described more thoroughly in: https://www.microarch.org/micro36/html/pdf/jimenez-FastPath.pdf. As a brief overview, this predictor integrates the neural structure of the previous algorithm with the "paths taken" through the program. That is, we maintain a global history of the conditional branch addresses, capturing how execution jumps around the program; this address history selects the weights that are combined with the outcome history in the weighted sum. Thus, unlike the previous algorithm, this one updates several rows per branch, presumably to capture how the same location in the program may be correlated with many subsequent conditional outcomes, which this version can directly capture in its updates.
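The path-based indexing trick can be sketched as follows: the i-th weight used in a prediction comes from the row selected by the i-th most recent branch address, so one branch outcome updates one weight in each of several rows. This is only the core idea from the linked paper, under illustrative parameters, not its pipelined ahead-of-time implementation:

```python
# Hedged sketch of path-based neural prediction: weights are gathered along
# the path of recent branch addresses rather than from a single PC row.
class PathNeuralBP:
    def __init__(self, hist_len=8, rows=64):
        self.hist_len, self.rows = hist_len, rows
        self.weights = [[0] * (hist_len + 1) for _ in range(rows)]
        self.path = [0] * hist_len             # recent branch PC hashes
        self.history = [1] * hist_len          # recent outcomes as +/-1
        self.theta = int(2.14 * (hist_len + 1) + 20.58)  # paper's threshold form

    def _output(self, pc):
        y = self.weights[pc % self.rows][0]    # bias from the current branch
        for i in range(self.hist_len):
            # weight i lives in the row chosen by the i-th path element
            y += self.weights[self.path[i] % self.rows][i + 1] * self.history[i]
        return y

    def predict(self, pc):
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y, t = self._output(pc), (1 if taken else -1)
        if (y >= 0) != taken or abs(y) <= self.theta:
            self.weights[pc % self.rows][0] += t
            # One outcome trains a weight in several different rows.
            for i in range(self.hist_len):
                self.weights[self.path[i] % self.rows][i + 1] += t * self.history[i]
        self.path = [pc] + self.path[:-1]
        self.history = [t] + self.history[:-1]

bp = PathNeuralBP()
for _ in range(10):
    bp.update(0x60, True)
```

Spreading each branch's weights across the rows of the preceding branches is what lets the hardware version start summing weights before the branch address is even known, which is the source of the latency win claimed for this design.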

Branch Prediction Results

Executable Conditional Accuracy Indirect Accuracy Latency
Bubblesort.txt 333474 2804 325.93
ConnCompMedium.txt 157877 34015 180.54
ConnCompSmall.txt 62784 14209 75.27
IntMM.txt 9765 1816 40.77
Oscar.txt 16688 3920 70.07
Perm.txt 124110 1980 151.88
Puzzle.txt 620337 139201 1412.84
Queens.txt 234622 3367 188.27
Quicksort.txt 281145 24475 209.31
RealMM.txt 10046 2145 40.74
Towers.txt 128113 1919 159.47
Treesort.txt 319174 2453 280.58