Inside AMD64 Architecture
AMD64 Pipeline
A pipeline is the list of all stages a given instruction must go through in order to be fully executed. The AMD64 architecture uses a 12-stage pipeline for executing integer instructions and a 17-stage pipeline for executing floating-point ones, so it takes 12 or 17 steps for a given instruction to be executed on AMD64 CPUs. AMD's previous architecture, K7 (used by the original Athlon, Athlon XP and some Sempron models), had a 10-stage pipeline. The Pentium 4 pipeline has 20 stages and the Pentium 4 “Prescott” pipeline has 31 stages. Intel has since reversed course, and the forthcoming Core 2 Duo processor will have a 14-stage pipeline.
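
To give a feel for what these stage counts mean, here is a minimal sketch under a strong simplifying assumption: an ideal pipeline that starts one instruction per clock cycle and never stalls. Under that assumption the first instruction needs as many cycles as there are stages, and each following instruction finishes one cycle later. Real CPUs, including AMD64 ones, handle several instructions per cycle and lose cycles to cache misses and mispredicted branches, so the numbers are only illustrative. The names `PIPELINE_DEPTH` and `ideal_cycles` are made up for this example.

```python
# Stage counts quoted in the text. The dictionary and function names are
# illustrative and do not come from any AMD or Intel documentation.
PIPELINE_DEPTH = {
    "K7": 10,
    "AMD64 (integer)": 12,
    "AMD64 (floating-point)": 17,
    "Pentium 4": 20,
    "Pentium 4 Prescott": 31,
    "Core 2 Duo": 14,
}


def ideal_cycles(stages: int, instructions: int) -> int:
    """Cycles needed by an idealized pipeline with no stalls.

    Assumption: one instruction enters the pipeline per cycle, so the
    pipeline fills once (`stages` cycles) and then completes one
    instruction per cycle.
    """
    if instructions <= 0:
        return 0
    return stages + instructions - 1


for name, depth in PIPELINE_DEPTH.items():
    total = ideal_cycles(depth, 1000)
    print(f"{name:24s} {depth:2d} stages, 1000 instructions in {total} cycles")
```

In this idealized model a deeper pipeline costs only a few extra cycles on a long run of instructions. The penalty of a long pipeline shows up mainly when it has to be emptied and refilled, which this sketch deliberately leaves out.
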
Let’s study AMD64’s integer pipeline. It is based on the K7 architecture pipeline. The main difference is that the decoder stages were broken into several separate stages, probably to allow AMD64 CPUs to achieve a higher clock rate.
Figure 11: AMD64 Integer Pipeline.
Here is a basic explanation of each stage, showing how a given instruction is processed by processors based on the AMD64 architecture. If this seems too complex, don’t worry. This is just a summary of what we will be explaining in the next pages.
- Fetch: Fetches instructions from the L1 instruction cache in groups of 16 bytes (128 bits). This phase is broken into two stages. The second stage is also known as “Transit”, as its main operation is to move data inside the CPU (resembling the “Drive” stage available on Pentium 4).
- Pick: The fetch unit sends the 128 bits that were fetched to this stage, feeding the buffer available here. Since x86 instructions don’t have a fixed length, in this stage the CPU locates and separates the instructions present in the buffer. It also decides which decoder each x86 instruction will be sent to: a simple (and quick) decoder, used for common x86 instructions that are converted into just one or two macro-ops, or a complex (and slow) decoder, used for x86 instructions that are converted into several macro-ops. This stage is also known as “scan”.
- Decode: Here the x86 instructions are translated into macro-ops that the CPU core can understand. This phase takes two stages.
- Pack: Pairs of decoded macro-ops are fused into a single macro-op here.
- Pack/Decode: Some more decoding is done here before the macro-ops are sent to AMD64’s Instruction Control Unit (which is the same thing as the “Reorder Buffer” present on CPUs from Intel).
- Dispatch: Macro-ops are sent to the appropriate scheduler in this stage.
- Schedule: The macro-ops are scheduled for execution by one of the CPU schedulers.
- AGU/ALU: The integer instruction or memory-related instruction is executed.
- Data Cache: The data generated by the execution unit is sent to the L1 data cache, the original registers are restored and the instruction is tagged as “executed” in the reorder buffer. This phase is equivalent to the “retirement” phase on Intel CPUs.
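
The work done in the Pick stage described above can be illustrated with a small sketch. It is not how the hardware is implemented. It only mirrors the two decisions the text mentions: splitting a 16-byte fetch window into variable-length instructions, and routing each one to the simple or the complex decoder depending on how many macro-ops it produces. The function `pick` and its inputs are hypothetical. In particular, the sketch is handed each instruction's length and macro-op count directly, whereas a real CPU has to work them out from the instruction bytes. How an instruction that crosses the window boundary is handled is also an assumption made for simplicity.

```python
from typing import List, Tuple

FETCH_WINDOW = 16  # bytes delivered per fetch (128 bits), as stated in the text


def pick(instructions: List[Tuple[int, int]]) -> List[Tuple[int, int, str]]:
    """Split one fetch window into instructions and choose a decoder for each.

    Each input tuple is (length_in_bytes, macro_op_count). Both values are
    hypothetical inputs for this sketch.

    Returns a list of (start_offset, length, decoder) for every instruction
    that fits entirely inside the window.
    """
    picked = []
    offset = 0
    for length, macro_ops in instructions:
        if offset + length > FETCH_WINDOW:
            # Assumption: an instruction crossing the window boundary is
            # simply left for the next fetch in this simplified model.
            break
        # Per the text: one or two macro-ops go to the simple decoder,
        # anything larger goes to the complex decoder.
        decoder = "simple" if macro_ops <= 2 else "complex"
        picked.append((offset, length, decoder))
        offset += length
    return picked


# Made-up window contents: four instructions of 3, 5, 7 and 2 bytes.
window = [(3, 1), (5, 2), (7, 4), (2, 1)]
for start, length, decoder in pick(window):
    print(f"offset {start:2d}, {length} bytes -> {decoder} decoder")
```

With these made-up values the first three instructions occupy bytes 0 to 14 of the window, and the last one does not fit, so only three instructions are picked. The third goes to the complex decoder because it expands into more than two macro-ops.
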
