Enhancements to the CPU Pipeline
Let’s start our journey by talking about what is new in the way instructions are processed in the Sandy Bridge microarchitecture.
There are four instruction decoders, meaning that the CPU can decode up to four instructions per clock cycle. These decoders are in charge of translating IA32 (a.k.a. x86) instructions into the RISC-like microinstructions (µops) that are used internally by the CPU execution units. Like previous Intel CPUs, the Sandy Bridge microarchitecture supports both macro- and micro-fusion. Macro-fusion allows the CPU to join two related x86 instructions into a single one, while micro-fusion joins two related microinstructions into a single one. In both cases the goal is to improve performance.
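The classic macro-fusion case is a compare followed by a conditional jump. As a rough illustration (a toy sketch, not Intel’s actual decoder logic, and the fusible mnemonic sets here are simplified), a decoder performing macro-fusion might work like this:

```python
# Toy sketch of macro-fusion: an adjacent compare-and-branch pair of x86
# instructions is merged into a single µop, so the pair occupies only one
# slot in the rest of the pipeline. The mnemonic sets are illustrative.

FUSIBLE_FIRST = {"cmp", "test"}             # instructions that can start a fused pair
FUSIBLE_SECOND = {"je", "jne", "jl", "jg"}  # conditional jumps that can end one

def decode_with_macro_fusion(instructions):
    """Turn a list of x86 mnemonics into µops, fusing cmp/test + jcc pairs."""
    uops = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if cur in FUSIBLE_FIRST and nxt in FUSIBLE_SECOND:
            uops.append(f"{cur}+{nxt}")  # one fused µop for the pair
            i += 2
        else:
            uops.append(cur)
            i += 1
    return uops

# A four-instruction loop tail decodes into only three µops:
print(decode_with_macro_fusion(["add", "cmp", "jne", "mov"]))
# → ['add', 'cmp+jne', 'mov']
```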
What is completely new is the addition of a decoded microinstruction cache, capable of storing 1,536 microinstructions (which translates to roughly 6 kB). Intel refers to this cache as an “L0 cache.” The idea is simple: when the running program enters a loop (i.e., needs to repeat the same instructions several times), the CPU won’t need to decode the x86 instructions again, since they will already be decoded in the cache, saving time and thus improving performance. According to Intel, this cache has an 80% hit rate, i.e., it is used 80% of the time.
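The two figures quoted above imply an average storage cost per cached microinstruction, which is easy to check:

```python
# Sanity check of the numbers quoted above: 1,536 cached microinstructions
# occupying roughly 6 kB implies an average of about 4 bytes per µop.
UOPS = 1536
CACHE_BYTES = 6 * 1024          # 6 kB
bytes_per_uop = CACHE_BYTES / UOPS
print(bytes_per_uop)            # → 4.0
```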
Now you may be asking yourself whether this isn’t the same idea used in the Netburst microarchitecture (i.e., Pentium 4 processors), which had a trace cache that also stored decoded microinstructions. A trace cache works differently from a microinstruction cache: it stores the instructions in the same order they were originally executed. This way, when a program reaches a loop that runs, let’s say, 10 times, the trace cache will store the same instructions 10 times. Therefore, there are a lot of repeated instructions in the trace cache. The same doesn’t happen with the microinstruction cache, which stores each decoded instruction only once.
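The difference is easy to see with a toy model (an illustration only, not how either cache is actually organized): feed both caches the dynamic instruction stream of a small loop and count how many entries each one ends up holding.

```python
# Toy model of the difference described above: a trace cache records
# instructions in execution order, so a loop body appears once per
# iteration, while a µop cache stores each decoded instruction only once.

def run_loop(body, iterations):
    """Return the dynamic instruction stream of a loop executed N times."""
    return body * iterations

body = ["load", "add", "store", "cmp+jne"]
stream = run_loop(body, 10)

trace_cache_entries = list(stream)               # stores the whole trace
uop_cache_entries = list(dict.fromkeys(stream))  # stores each µop once

print(len(trace_cache_entries))  # → 40 (the same 4 µops repeated 10 times)
print(len(uop_cache_entries))    # → 4
```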
When the microinstruction cache is used, the CPU puts the L1 instruction cache and the decoders to “sleep,” which saves energy and makes the CPU run cooler.
The branch prediction unit was redesigned and the Branch Target Buffer (BTB) was doubled in size compared to Nehalem; it also now uses a compression technique to allow even more data to be stored. The branch prediction unit is a circuit that tries to guess the next steps of a program in advance, loading into the CPU the instructions it thinks will be needed next. If it guesses correctly, the CPU won’t waste time loading these instructions from memory, as they will already be inside the CPU. Increasing the size of the BTB allows this circuit to load even more instructions in advance, improving CPU performance.
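Conceptually, a BTB is a small table mapping the address of a branch instruction to the target it jumped to last time. The sketch below is a deliberately minimal, hypothetical structure (real BTBs are set-associative hardware tables, and Intel has not published Sandy Bridge’s exact design); it only shows the lookup-and-update idea:

```python
# Minimal sketch of a Branch Target Buffer: a table mapping a branch
# instruction's address to the target address it jumped to last time,
# letting the front end start fetching from the predicted target before
# the branch is actually resolved. Hypothetical structure, not Intel's.

class BranchTargetBuffer:
    def __init__(self, size):
        self.size = size
        self.table = {}  # branch address -> last-seen target address

    def predict(self, branch_addr):
        """Return the predicted target address, or None on a BTB miss."""
        return self.table.get(branch_addr)

    def update(self, branch_addr, actual_target):
        """Record the actual target after the branch resolves."""
        if branch_addr not in self.table and len(self.table) >= self.size:
            self.table.pop(next(iter(self.table)))  # evict the oldest entry
        self.table[branch_addr] = actual_target

btb = BranchTargetBuffer(size=4)
btb.update(0x400, 0x480)   # the branch at 0x400 jumped to 0x480
print(btb.predict(0x400))  # hit: predicts 0x480
print(btb.predict(0x500))  # miss: returns None
```

Doubling `size` in a structure like this simply lets it remember more distinct branches before it has to evict one.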
The scheduler used in the Sandy Bridge microarchitecture is similar to the one used in the Nehalem microarchitecture, with six dispatch ports: three used by execution units and three used by memory operations.
Although this configuration is the same, the Sandy Bridge microarchitecture has more execution units: while the Nehalem microarchitecture has 12 of them, Sandy Bridge has 15 (see Figure 2). According to Intel, they were redesigned in order to improve floating-point (i.e., non-integer math) performance.
Each execution unit is connected to the instruction scheduler through a 128-bit datapath. In order to execute the new AVX instructions, which carry 256 bits of data, two execution units are “merged” (i.e., used at the same time) instead of adding 256-bit datapaths and 256-bit units to the CPU, as you can see in Figure 3.
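The idea of “merging” two narrower units can be illustrated in software. In the sketch below (illustration only; the lane layout is the one of a YMM register holding eight single-precision floats, but the “units” are just Python functions), a 256-bit add is carried out as two 128-bit adds on the low and high halves:

```python
# Illustrative sketch of the idea above: a 256-bit AVX operation executed
# by two 128-bit units working on the two halves of the register at the
# same time. A 256-bit value is modeled as eight 32-bit float lanes.

def add_128(a, b):
    """One 128-bit unit: adds four 32-bit lanes."""
    return [x + y for x, y in zip(a, b)]

def add_256(a, b):
    """A 256-bit add issued to two 128-bit units, one per half."""
    low = add_128(a[:4], b[:4])    # first unit handles bits 0-127
    high = add_128(a[4:], b[4:])   # second unit handles bits 128-255
    return low + high

a = [1.0] * 8
b = [2.0] * 8
print(add_256(a, b))  # → [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]
```

In hardware, the two halves execute in the same cycle, which is why no 256-bit datapath is needed.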
After an instruction is executed, its result isn’t copied back to the re-order buffer as happened in previous Intel microarchitectures; instead, the instruction is simply marked as completed in a list. This way the CPU moves less data around, improving efficiency.
Another difference is in the memory ports. The Nehalem microarchitecture has one load unit, one store address unit, and one store data unit, each attached to its own dispatch port. This means that Nehalem-based processors can load 128 bits of data per cycle from the L1 data cache.
In the Sandy Bridge microarchitecture, the load and store address units can each be used either as a load unit or as a store address unit. This change allows twice as much data to be loaded from the L1 data cache at once (using two 128-bit units at the same time instead of only one), thus improving performance. As a result, Sandy Bridge-based processors can load 256 bits of data from the L1 data cache per cycle.
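The arithmetic behind that claim is straightforward:

```python
# The arithmetic behind the claim above: two 128-bit load ports working in
# parallel double the peak L1 data cache read bandwidth versus Nehalem's
# single load port.
PORT_WIDTH_BITS = 128
nehalem_load_bits = 1 * PORT_WIDTH_BITS       # one dedicated load port
sandy_bridge_load_bits = 2 * PORT_WIDTH_BITS  # load + repurposed port
print(nehalem_load_bits, sandy_bridge_load_bits)  # → 128 256
```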