Inside Pentium 4 Architecture
Dispatch and Execution Units
As we’ve seen, the Pentium 4 has four dispatch ports, numbered 0 through 3. Each port is connected to one, two or three execution units, as you can see in Figure 6.
Figure 6: Dispatch and execution units.
The units marked as “clock x2” can execute two microinstructions per clock cycle. Ports 0 and 1 can send two microinstructions per clock cycle to these units. So the maximum number of microinstructions that can be dispatched per clock cycle is six:
- Two microinstructions on port 0;
- Two microinstructions on port 1;
- One microinstruction on port 2;
- One microinstruction on port 3.
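The arithmetic behind this maximum can be written down as a short Python sketch. The per-port figures come straight from the list above; the names `DISPATCH_PER_CLOCK` and `max_dispatch_per_clock` are ours, used only for illustration.

```python
# Microinstructions each dispatch port can send per clock cycle,
# as listed in the text (ports 0 and 1 feed the "clock x2" units).
DISPATCH_PER_CLOCK = {0: 2, 1: 2, 2: 1, 3: 1}


def max_dispatch_per_clock(ports=DISPATCH_PER_CLOCK):
    """Return the peak number of microinstructions dispatched in one clock."""
    return sum(ports.values())


print(max_dispatch_per_clock())  # 6
```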
Keep in mind that complex instructions may take several clock cycles to be processed. Take port 1 as an example, where the complete floating-point unit (FPU) is located. While this unit is processing a very complex instruction that takes several clock cycles to execute, the port 1 dispatch unit won’t stall: it will keep sending simple instructions to the ALU (Arithmetic and Logic Unit) while the FPU is busy.
So, even though the maximum dispatch rate is six microinstructions per clock cycle, the CPU can have up to seven microinstructions being processed at the same time.
That’s why ports 0 and 1 have more than one execution unit attached. If you pay attention, Intel put one fast unit on the same port together with at least one complex (and slow) unit. So, while the complex unit is busy processing data, the other unit can keep receiving microinstructions from its corresponding dispatch port. As we mentioned before, the idea is to keep all execution units busy all the time.
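To make this idea concrete, here is a toy Python model of a single port that feeds one fast ALU and one slow FPU. It is only a sketch under our own assumptions, not a description of the real scheduler: we assume the FPU is not pipelined, that dispatching to the FPU uses the port for a whole clock, and that the FPU takes `fpu_latency` clocks (a made-up value) per instruction. The function name `simulate_port1` is hypothetical.

```python
def simulate_port1(alu_uops, fpu_uops, fpu_latency=4):
    """Toy model of one dispatch port with a fast ALU and a slow FPU.

    Returns a list with one entry per clock describing what was dispatched.
    Assumptions (ours, not from Intel): the FPU is not pipelined, and an
    FPU dispatch occupies the port for that whole clock.
    """
    timeline = []
    fpu_busy_until = 0  # first clock at which the FPU is free again
    clock = 0
    while alu_uops > 0 or fpu_uops > 0:
        if fpu_uops > 0 and clock >= fpu_busy_until:
            # The FPU is free: hand it the next complex microinstruction.
            fpu_uops -= 1
            fpu_busy_until = clock + fpu_latency
            timeline.append("fpu")
        elif alu_uops > 0:
            # The FPU is busy (or has no work): keep feeding the
            # double-speed ALU, up to two microinstructions per clock.
            n = min(2, alu_uops)
            alu_uops -= n
            timeline.append(f"alu x{n}")
        else:
            # Only FPU work is left and the FPU is still busy.
            timeline.append("idle")
        clock += 1
    return timeline


print(simulate_port1(alu_uops=6, fpu_uops=1))
# ['fpu', 'alu x2', 'alu x2', 'alu x2']
```

In this model the six simple microinstructions are dispatched while the FPU is still working on its single instruction, so the port never sits idle. With only FPU work queued, for example `simulate_port1(0, 2, fpu_latency=3)`, the port has nothing to send for two clocks, which is the situation the pairing of fast and slow units is meant to avoid.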
The two double-speed ALUs can each process two microinstructions per clock cycle. The other units need at least one clock cycle to process the microinstructions they receive. So, the Pentium 4 architecture is optimized for simple instructions.
As you can see in Figure 6, dispatch ports 2 and 3 are dedicated to memory operations: load (read data from memory) and store (write data to memory), respectively. As for memory operations, it is interesting to note that port 0 is also used during store operations (see Figure 5 and the list of operations in Figure 6). In such operations, port 3 is used to send the memory address, while port 0 is used to send the data to be stored at this address. This data can be generated by either the ALU or the FPU, depending on the kind of data to be stored (integer or floating point/SSE).
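The way a store is split between the two ports can be sketched in a few lines of Python. This is an illustration of the paragraph above only: the function name `dispatch_store`, the dictionary fields and the use of a Python `float` to stand in for floating point/SSE data are all our own assumptions.

```python
def dispatch_store(address, value):
    """Split one store into the two microinstructions described in the text.

    Port 3 carries the memory address; port 0 carries the data, which comes
    from the FPU for floating-point values and from the ALU otherwise.
    """
    # Hypothetical rule: a Python float stands in for floating point/SSE data.
    data_unit = "FPU" if isinstance(value, float) else "ALU"
    return [
        {"port": 3, "kind": "store-address", "address": address},
        {"port": 0, "kind": "store-data", "unit": data_unit, "value": value},
    ]


for uop in dispatch_store(0x1000, 42):
    print(uop)
```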
Figure 6 gives a complete list of the kinds of instructions each execution unit deals with. FXCH and LEA (Load Effective Address) are two x86 instructions. Intel’s implementation of the FXCH instruction on the Pentium 4 caused a great deal of surprise among experts, because on processors from the previous generation (Pentium III) and on processors from AMD this instruction can be executed in zero clock cycles, while on the Pentium 4 it takes some clock cycles to execute.
That’s it. If you want to compare the Pentium 4 architecture to the Athlon 64’s, read our Inside AMD64 Architecture tutorial.
