How a CPU Works
Processing Instructions
Contents
The fetch unit is in charge of loading instructions from memory. First, it will look if the instruction required by the CPU is in the L1 instruction cache. If it is not, it goes to the L2 memory cache. If the instruction is also not there, then it has to directly load from the slow system RAM memory.
When you turn on your PC all the caches are empty, of course, but as the system starts loading the operating system, the CPU starts processing the first instructions loaded from the hard drive, and the cache controller starts loading the caches, and the show begins.
After the fetch unit grabbed the instruction required by the CPU to be processed, it sends it to the decode unit.
The decode unit will then figure out what that particular instruction does. It does that by consulting a ROM memory that exists inside the CPU, called microcode. Each instruction that a given CPU understands has its own microcode. The microcode will “teach” the CPU what to do. It is like a step-by-step guide to every instruction. If the instruction loaded is, for example, add a+b, its microcode will tell the decode unit that it needs two parameters, a and b. The decode unit will then request the fetch unit to grab the data present in the next two memory positions, which fit the values for a and b. After the decode unit “translated” the instruction and grabbed all required data to execute the instruction, it will pass all data and the “step-by-step cookbook” on how to execute that instruction to the execute unit.
The execute unit will finally execute the instruction. On modern CPUs you will find more than one execution unit working in parallel. This is done in order to increase the processor performance. For example, a CPU with six execution units can execute six instructions in parallel, so in theory it could achieve the same performance of six processors with just one execution unit. This kind of architecture is called superscalar architecture.
Usually modern CPUs don’t have several identical execution units; they have execution units specialized in one kind of instructions. The best example is the FPU, Float Point Unit, which is in charge of executing complex math instructions. Usually between the decode unit and the execution unit there is an unit (called dispatch or schedule unit) in charge of sending the instruction to the correct execution unit, i.e., if the instruction is a math instruction it will send it to the FPU and not to one “generic” execution unit. By the way, “generic” execution units are called ALU, Arithmetic and Logic Unit.
Finally, when the processing is over, the result is sent to the L1 data cache. Continuing our add a+b example, the result would be sent to the L1 data cache. This result can be then sent back to RAM memory or to another place, as the video card, for example. But this will depend on the next instruction that is going to be processed next (the next instruction could be “print the result on the screen”).
Another interesting feature that all microprocessors have for a long time is called “pipeline”, which is the capability of having several different instructions at different stages of the CPU at the same time.
After the fetch unit sent the instruction to the decode unit, it will be idle, right? So, how about instead of doing nothing, put the fetch unit to grab the next instruction? When the first instruction goes to the execution unit, the fetch unit can send the second instruction to the decode unit and grab the third instruction, and so on.
In a modern CPU with an 11-stage pipeline (stage is another name for each unit of the CPU), it will probably have 11 instructions inside it at the same time almost all the time. In fact, since all modern CPUs have a superscalar architecture, the number of instructions simultaneously inside the CPU will be even higher.
Also, for an 11-stage pipeline CPU, an instruction to be fully executed will have to pass through 11 units. The higher the number of stages, the higher the time an instruction will delay to be fully executed. On the other hand, keep in mind that because of this concept several instructions can be running inside the CPU at the same time. The very first instruction loaded by the CPU can delay 11 steps to get out of it, but once it goes out, the second instruction will get out right after it (and not another 11 steps later).
There are several other tricks used by modern CPUs to increase performance. We will explain two of them, out-of-order execution (OOO) and speculative execution.
