Inside the AMD Bulldozer Architecture
The Fetch and Decode Units
The Fetch unit is in charge of getting the next instruction to be decoded from the RAM or the memory cache. For further information, we suggest you read our How a CPU Works and How the Memory Cache Works tutorials.
Figure 3: The Fetch and Decode units
As shown on the previous page, the Fetch unit is shared by the two “cores” available in each Bulldozer module. The L1 instruction cache is also shared by the two “cores,” because it is an essential part of the Fetch unit, but each CPU “core” has its own L1 data cache. Interestingly enough, AMD has already announced that the L1 instruction cache used in the Bulldozer architecture is a two-way set associative 64 KB cache, the same configuration used by CPUs based on the AMD64 architecture, with the obvious difference that while AMD64 CPUs have one L1 instruction cache per core, Bulldozer-based CPUs will have one L1 instruction cache for each pair of cores. However, the data cache used by each “core” will be only 16 KB, which is considerably smaller than the 64 KB per core currently used by CPUs based on the AMD64 architecture.
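To make "two-way set associative 64 KB" more concrete, here is a minimal Python sketch of how such a cache could be organized. The 64-byte line size is our assumption (AMD's figures quoted above do not include it), and the function names are made up for illustration.

```python
KB = 1024

def cache_geometry(size_bytes, ways, line_bytes=64):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    line_bytes=64 is an assumed value, not a figure disclosed for Bulldozer.
    All sizes are expected to be powers of two.
    """
    lines = size_bytes // line_bytes          # total cache lines
    sets = lines // ways                      # each set holds `ways` lines
    offset_bits = line_bytes.bit_length() - 1 # log2(line_bytes)
    index_bits = sets.bit_length() - 1        # log2(sets)
    return sets, offset_bits, index_bits

def set_index(address, size_bytes, ways, line_bytes=64):
    """Return the set that a given address maps to."""
    sets, offset_bits, _ = cache_geometry(size_bytes, ways, line_bytes)
    return (address >> offset_bits) % sets

# The shared L1 instruction cache described above: 64 KB, two-way.
sets, offset_bits, index_bits = cache_geometry(64 * KB, ways=2)
print(sets, offset_bits, index_bits)

# Two addresses that are sets * line_bytes apart land in the same set.
a = 0x12345
b = a + sets * 64
print(set_index(a, 64 * KB, 2) == set_index(b, 64 * KB, 2))
```

Under these assumptions the cache has 512 sets of two lines each, so two pieces of code whose addresses collide on the same set can both stay cached, while a third one would evict one of them.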
At this moment AMD hasn’t made public the size of Bulldozer’s BTB (Branch Target Buffer), a small memory that lists all identified branches in the program and is used by the branch prediction mechanism of the CPU.
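As a rough illustration of what a branch target buffer does, the sketch below models it as a small table that remembers where each known branch jumped last time. The class name, the capacity, and the least-recently-used replacement policy are all our assumptions; as noted above, AMD has not disclosed how Bulldozer's BTB is organized.

```python
from collections import OrderedDict

class BranchTargetBuffer:
    """Toy BTB: maps a branch's address to its last known target (hypothetical design)."""

    def __init__(self, capacity):
        self.capacity = capacity       # arbitrary; the real size is undisclosed
        self.entries = OrderedDict()   # branch address -> last seen target

    def predict(self, branch_address):
        """Return the remembered target, or None if this branch is unknown."""
        target = self.entries.get(branch_address)
        if target is not None:
            self.entries.move_to_end(branch_address)  # mark as recently used
        return target

    def update(self, branch_address, target):
        """Record where a branch actually went once it has been resolved."""
        self.entries[branch_address] = target
        self.entries.move_to_end(branch_address)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

btb = BranchTargetBuffer(capacity=2)
btb.update(0x1000, 0x2000)
print(btb.predict(0x1000) == 0x2000)  # a known branch returns its stored target
print(btb.predict(0x3000) is None)    # an unseen branch has no prediction
```

A real BTB is a fixed hardware structure indexed by address bits rather than a dictionary, but the idea is the same: a known branch lets the Fetch unit start loading instructions from the predicted target right away.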
The sizes of the TLBs (Translation Look-aside Buffers), on the other hand, have been disclosed, as you can see in Figure 3. These buffers are small memories that speed up the conversion between virtual addresses and physical addresses, and are used mainly by the virtual memory circuit (virtual memory is a technique where the CPU simulates having more RAM than is actually installed, typically by using a swap file on the hard drive).
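The role of a TLB can be sketched in a few lines of Python: it caches recent virtual-to-physical translations so the slower page table does not have to be consulted every time. This is only a toy model; the class name, the dictionary used as a page table, and the first-in-first-out replacement policy are our assumptions, and the capacity used here is arbitrary rather than one of the sizes shown in Figure 3.

```python
PAGE_SIZE = 4096  # 4 KB, the standard small page size on x86

class TLB:
    """Toy TLB sitting in front of a page table (hypothetical design)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # virtual page number -> physical frame number
        self.cache = {}               # the TLB entries themselves
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        """Convert a virtual address into a physical address."""
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                # dicts keep insertion order, so this drops the oldest entry
                self.cache.pop(next(iter(self.cache)))
            self.cache[vpn] = self.page_table[vpn]  # stands in for the slow lookup
        return self.cache[vpn] * PAGE_SIZE + offset

tlb = TLB(capacity=4, page_table={0: 5, 1: 9})
tlb.translate(0x0010)  # first access to page 0: miss
tlb.translate(0x0020)  # same page again: hit
print(tlb.hits, tlb.misses)
```

Because programs tend to access the same few pages repeatedly, even a small buffer of this kind avoids most of the slow translations.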
PC programs are written using x86 instructions, but nowadays the CPU Execution unit only understands proprietary RISC-like instructions. So the Decode unit is in charge of converting the x86 instructions provided by the running program into these RISC-like microinstructions, which are the kind of instruction understood by the Execution unit of the CPU. The Bulldozer architecture has four decoders, but at this moment AMD hasn’t given much information about what kind of instructions each decoder can handle. Usually at least one of these decoders exclusively handles complex instructions, using the provided microcode ROM (in the slide “Ucode” should be read as “µcode,” or “microcode”). The decoding of a complex instruction takes several clock cycles to complete, because the instruction is converted into several microinstructions. Simple instructions, however, are usually converted in only one clock cycle because they are translated into a single microinstruction. Usually processor manufacturers optimize their CPUs to decode the most common instructions as fast as possible, in just one clock cycle.
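The difference between simple and complex instructions can be illustrated with a small Python sketch. Everything in it is hypothetical: real microinstruction formats are proprietary, so the micro-op names, the two lookup tables, and the "one micro-op per clock cycle from the microcode ROM" timing are assumptions made only to show the idea.

```python
# Hypothetical translations; real micro-ops are proprietary and not documented.
SIMPLE_DECODE = {
    "add eax, ebx": ["uop_add eax, eax, ebx"],
    "mov eax, ebx": ["uop_mov eax, ebx"],
}

# Invented expansion of a string-copy instruction into several micro-ops.
MICROCODE_ROM = {
    "rep movsb": [
        "uop_load tmp, [esi]",
        "uop_store [edi], tmp",
        "uop_add esi, esi, 1",
        "uop_add edi, edi, 1",
        "uop_sub ecx, ecx, 1",
        "uop_branch_if_nonzero ecx",
    ],
}

def decode(instruction):
    """Return (micro_ops, cycles) for one x86 instruction.

    Assumed timing: a simple instruction decodes in one cycle, while a
    microcoded one needs one cycle per micro-op read from the ROM.
    """
    if instruction in SIMPLE_DECODE:
        return SIMPLE_DECODE[instruction], 1
    if instruction in MICROCODE_ROM:
        micro_ops = MICROCODE_ROM[instruction]
        return micro_ops, len(micro_ops)
    raise KeyError(f"no translation known for {instruction!r}")

for instruction in ("add eax, ebx", "rep movsb"):
    micro_ops, cycles = decode(instruction)
    print(instruction, "->", len(micro_ops), "micro-op(s) in", cycles, "cycle(s)")
```

In this model the common instruction is translated in a single cycle, while the complex one keeps the decoder busy for several cycles, which is why manufacturers concentrate on making the common case fast.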
