
[nextpage title=”Introduction”]
Even though every microprocessor has its own internal design, all microprocessors share the same basic concept – which we will explain in this tutorial. We will take a look inside a generic CPU architecture, so you will be able to understand more about Intel and AMD products and the differences between them.
The CPU (Central Processing Unit) – also called the microprocessor or processor – is in charge of processing data. How it processes data depends on the program. The program can be a spreadsheet, a word processor or a game: to the CPU it makes no difference, since it doesn’t understand what the program is actually doing. It just follows the orders (called commands or instructions) contained inside the program. These orders could be to add two numbers or to send a piece of data to the video card, for example.

When you double click on an icon to run a program, here is what happens:
1. The program, which is stored inside the hard disk drive, is transferred to the RAM memory. A program is a series of instructions to the CPU.
2. The CPU, using a circuit called memory controller, loads the program data from the RAM memory.
3. The data, now inside the CPU, is processed.
4. What happens next depends on the program. The CPU could continue loading and executing the program, or could do something with the processed data, such as displaying something on the screen.

Figure 1: How stored data is transferred to the CPU.

In the past, the CPU controlled the data transfer between the hard disk drive and the RAM memory. Since the hard disk drive is slower than the RAM memory, this slowed down the system, because the CPU was busy until all the data had been transferred from the hard disk drive to the RAM memory. This method is called PIO, Processor I/O (or Programmed I/O). Nowadays data transfer between the hard disk drive and the RAM memory is made without using the CPU, thus making the system faster. This method is called bus mastering or DMA (Direct Memory Access). In order to simplify our drawing, we didn’t put the north bridge chip between the hard disk drive and the RAM memory in Figure 1, but it is there. If you’d like to learn more about this subject, we’ve already written a tutorial on that.
Processors from AMD based on sockets 754, 939 and 940 (Athlon 64, Athlon 64 X2, Athlon 64 FX, Opteron and some Sempron models) have an embedded memory controller. This means that for these processors the CPU accesses the RAM memory directly, without using the north bridge chip shown in Figure 1.
To better understand the role of the chipset in a computer, we recommend you to read our tutorial Everything You Need to Know About Chipsets.
[nextpage title=”Clock”]
So, what is the clock, anyway? The clock is a signal used to synchronize things inside the computer. Take a look at Figure 2, where we show a typical clock signal: it is a square wave changing from “0” to “1” at a fixed rate. In this figure you can see three full clock cycles (“ticks”). The beginning of each cycle is when the clock signal goes from “0” to “1”; we marked this with an arrow. The clock signal is measured in a unit called Hertz (Hz), which is the number of clock cycles per second. A clock of 100 MHz means that in one second there are 100 million clock cycles.

Figure 2: Clock signal.

In the computer, all timings are measured in clock cycles. For example, a RAM memory with a latency of “5” takes five full clock cycles to start delivering data. Inside the CPU, every instruction takes a certain number of clock cycles to be performed. For example, a given instruction can take seven clock cycles to be fully executed.
Regarding the CPU, the interesting thing is that it knows how many clock cycles each instruction will take, because it has a table which lists this information. So if it has two instructions to be executed and it knows that the first takes seven clock cycles, it will automatically start the execution of the next instruction on the 8th clock tick. Of course this is a generic explanation for a CPU with just one execution unit – modern processors have several execution units working in parallel and can execute the second instruction at the same time as the first. This is called superscalar architecture, and we will talk more about it later.
So, what does the clock have to do with performance? Thinking that clock and performance are the same thing is the most common misconception about processors.
If you compare two identical CPUs, the one running at a higher clock rate will be faster. In this case, with a higher clock rate, the time between clock cycles is shorter, so things are performed in less time and the performance is higher. But when you compare two different processors, this is not necessarily true.
If you get two processors with different architectures – for example, two different manufacturers, like Intel and AMD – things inside the CPU are completely different.
As we mentioned, each instruction takes a certain number of clock cycles to be executed. Let’s say that processor “A” takes seven clock cycles to perform a given instruction, and that processor “B” takes five clock cycles to perform the same instruction. If they are running at the same clock rate, processor “B” will be faster, because it can process this instruction in less time.
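To make the comparison concrete, here is a minimal Python sketch, using the made-up cycle counts from the example and an assumed 1 GHz clock for both processors:

```python
def execution_time_ns(cycles, clock_hz):
    """Time to run one instruction: cycles taken divided by clock rate (in ns)."""
    return cycles * (1e9 / clock_hz)

# Hypothetical processors "A" and "B" running at the same 1 GHz clock:
time_a = execution_time_ns(7, 1e9)  # processor "A" needs 7 cycles -> 7.0 ns
time_b = execution_time_ns(5, 1e9)  # processor "B" needs 5 cycles -> 5.0 ns
# Processor "B" is faster even though both clocks are identical.
```

The same arithmetic shows why a higher clock only helps when the cycle counts are comparable: processor “A” would need to run at 1.4 GHz just to match “B” at 1 GHz on this instruction.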
For modern CPUs there is much more in the performance game, as CPUs have different number of execution units, different cache sizes, different ways of transferring data inside the CPU, different ways of processing the instructions inside the execution units, different clock rates with the outside world, etc. Don’t worry; we will cover all that in this tutorial.
As processor clock rates became very high, one problem showed up. The motherboard where the processor is installed could not work at the same clock rate. If you look at a motherboard, you will see several tracks or paths. These tracks are wires that connect the several circuits of the computer. The problem is that at higher clock rates these wires start to work as antennas, so the signal, instead of arriving at the other end of the wire, would simply vanish, being transmitted as radio waves.

Figure 3: The wires on the motherboard can work as antennas.

[nextpage title=”External Clock”]
So CPU manufacturers started using a new concept, called clock multiplication, which began with the 486DX2 processor. Under this scheme, which is used in all CPUs nowadays, the CPU has an external clock, which is used when transferring data to and from the RAM memory (using the north bridge chip), and a higher internal clock.
To give a real example, on a 3.4 GHz Pentium 4 the “3.4 GHz” refers to the CPU internal clock, which is obtained by multiplying its 200 MHz external clock by 17. We illustrate this example in Figure 4.

Figure 4: Internal and external clocks on a Pentium 4 3.4 GHz.

The huge difference between the internal and external clocks on modern CPUs is one major roadblock to overcome in order to increase computer performance. Continuing with the Pentium 4 3.4 GHz example, it has to reduce its speed by 17x when reading data from the RAM memory! During this process, it works as if it were a 200 MHz CPU!
Several techniques are used to minimize the impact of this clock difference. One of them is the use of a memory cache inside the CPU. Another one is transferring more than one data chunk per clock cycle. Processors from both AMD and Intel use this feature, but while AMD CPUs transfer two data chunks per clock cycle, Intel CPUs transfer four.

Figure 5: Transferring more than one data chunk per clock cycle.

Because of this, AMD CPUs are listed with double their real external clock. For example, an AMD CPU with a real 200 MHz external clock is listed as having a 400 MHz external clock. Something similar happens with Intel CPUs: an Intel CPU with a real 200 MHz external clock is listed as having an 800 MHz external clock.
The technique of transferring two data chunks per clock cycle is called DDR (Double Data Rate), while the technique of transferring four data chunks per clock cycle is called QDR (Quad Data Rate).
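The listed (effective) clock is simply the real external clock multiplied by the number of transfers per cycle. A quick sketch of that arithmetic:

```python
def effective_rate_mhz(external_clock_mhz, transfers_per_cycle):
    """Listed (effective) clock: real external clock times transfers per cycle."""
    return external_clock_mhz * transfers_per_cycle

# AMD-style DDR bus: two transfers per clock cycle
amd_listed = effective_rate_mhz(200, 2)    # -> 400 MHz
# Intel-style QDR bus: four transfers per clock cycle
intel_listed = effective_rate_mhz(200, 4)  # -> 800 MHz
```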
[nextpage title=”Block Diagram of a CPU”]
In Figure 6, you can see a basic block diagram for a modern CPU. There are many differences between AMD and Intel architectures. Read our tutorial Inside Pentium 4 Architecture for a detailed view of the Pentium 4 architecture. We still plan to write a specific tutorial about the Athlon 64 architecture in the near future. Understanding the basic block diagram of a modern CPU is the first step to understanding how CPUs from Intel and AMD work and the differences between them.

Figure 6: Basic block diagram of a CPU.

The dotted line in Figure 6 represents the CPU body, as the RAM memory is located outside the CPU. The datapath between the RAM memory and the CPU is usually 64 bits wide (or 128 bits when a dual channel memory configuration is used), running at the memory clock or the CPU external clock, whichever is lower. The number of bits used and the clock rate can be combined into a unit called transfer rate, measured in MB/s. To calculate the transfer rate, the formula is number of bits x clock / 8. For a system using DDR400 memories in single channel configuration (64 bits) the memory transfer rate will be 3,200 MB/s, while the same system using dual channel memories (128 bits) will have a 6,400 MB/s memory transfer rate. For more information on this subject, read our tutorial Everything You Need to Know About DDR Dual Channel.
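The transfer rate formula can be written as a small Python helper; the DDR400 figures reproduce the numbers quoted above (note that the 400 here is already the effective DDR clock):

```python
def transfer_rate_mbs(bus_width_bits, effective_clock_mhz):
    """Transfer rate in MB/s: bits per transfer x effective clock (MHz) / 8 bits per byte."""
    return bus_width_bits * effective_clock_mhz / 8

single = transfer_rate_mbs(64, 400)   # DDR400, single channel -> 3,200 MB/s
dual = transfer_rate_mbs(128, 400)    # DDR400, dual channel   -> 6,400 MB/s
```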
All the circuits inside the dotted box run at the CPU internal clock. Depending on the CPU, some of its internal parts can even run at a higher clock rate. Also, the datapath between the CPU units can be wider, i.e., transfer more bits per clock cycle than 64 or 128. For example, the datapath between the L2 memory cache and the L1 instruction cache on modern processors is usually 256 bits wide. The higher the number of bits transferred per clock cycle, the faster the transfer will be (in other words, the transfer rate will be higher). In Figure 6 we used a red arrow between the RAM memory and the L2 memory cache and green arrows between all other blocks to express the different clock rates and datapath widths used.
[nextpage title=”Memory Cache”]
Memory cache is a high-performance kind of memory, also called static memory. The kind of memory used for the computer’s main RAM is called dynamic memory. Static memory consumes more power, is more expensive and is physically bigger than dynamic memory, but it is a lot faster. It can work at the same clock rate as the CPU, which dynamic memory is not capable of.
Since going to the “external world” to fetch data makes the CPU work at a lower clock rate, the memory cache technique is used. When the CPU loads data from a certain memory position, a circuit called the memory cache controller (not drawn in Figure 6 for the sake of simplicity) loads into the memory cache a whole block of data below the position the CPU has just loaded. Since programs usually flow in a sequential way, the next memory position the CPU will request will probably be the position immediately below the one it has just loaded. Since the memory cache controller already loaded a lot of data below the first memory position read by the CPU, the next data will be inside the memory cache, so the CPU doesn’t need to go outside to grab it: it is already loaded in the memory cache embedded in the CPU, which the CPU can access at its internal clock rate.
The cache controller is always observing the memory positions being loaded and loading data from several memory positions after the position that has just been read. To give a real example, if the CPU loaded data stored at address 1,000, the cache controller will load data from the “n” addresses after address 1,000. This number “n” is called the page size; if a given processor is working with 4 KB pages (a typical value), it will load data from the 4,096 addresses below the current memory position being loaded (address 1,000 in our example). By the way, 1 KB equals 1,024 bytes, which is why 4 KB is 4,096, not 4,000. In Figure 7 we illustrate this example.
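Below is a toy Python model of this prefetching behavior, assuming the simplified scheme described above (one 4 KB block loaded on every miss). Real cache controllers are organized into lines, sets and ways, so treat this purely as an illustration of hits and misses:

```python
PAGE_SIZE = 4096  # 4 KB: the number of addresses prefetched after a miss

class SimpleCacheController:
    """Toy model of the prefetch behavior described above (not a real cache design)."""

    def __init__(self):
        self.cache = set()   # addresses currently held in the cache
        self.misses = 0

    def load(self, address):
        if address in self.cache:
            return "hit"     # data is already inside the cache
        # Miss: go to RAM and prefetch the whole page starting at this address
        self.misses += 1
        self.cache.update(range(address, address + PAGE_SIZE))
        return "miss"

ctrl = SimpleCacheController()
first = ctrl.load(1000)    # "miss": goes to RAM, prefetches addresses 1000..5095
second = ctrl.load(1001)   # "hit": the sequential access was already prefetched
```

Sequential accesses after the first miss all hit, which is exactly why this technique pays off for typical program flow.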

Figure 7: How the memory cache controller works.

The bigger the memory cache, the higher the chance that the data required by the CPU is already there, so the CPU will need to access the RAM memory directly less often, thus increasing system performance (just remember that every time the CPU needs to access the RAM memory directly, it has to lower its clock rate for this operation).
We call it a “hit” when the CPU loads required data from the cache, and a “miss” when the required data isn’t there and the CPU needs to access the system RAM memory.
L1 and L2 mean “Level 1” and “Level 2”, respectively, and refer to how far they are from the CPU core (execution unit). A common question is why there are three separate cache memories (L1 data cache, L1 instruction cache and L2 cache). Pay attention to Figure 6 and you will see that the L1 instruction cache works as an “input cache”, while the L1 data cache works as an “output cache”. The L1 instruction cache – which is usually smaller than the L2 cache – is particularly efficient when the program starts to repeat a small part of itself (a loop), because the required instructions will be closer to the fetch unit.
On the specs page of a CPU, the L1 cache can be represented in different ways. Some manufacturers list the two L1 caches separately (sometimes calling the instruction cache “I” and the data cache “D”), some add the two amounts and write “separated” – so “128 KB, separated” would mean a 64 KB instruction cache and a 64 KB data cache – and some simply add the two, leaving you to guess that the number is a total and you should divide it by two to get the capacity of each cache. The exception, however, goes to the Pentium 4 and newer Celeron CPUs based on sockets 478 and 775.
Pentium 4 processors (and Celeron processors using sockets 478 and 775) don’t have an L1 instruction cache; instead they have a trace execution cache, which is a cache located between the decode unit and the execution unit. So the L1 instruction cache is there, but with a different name and in a different location. We are mentioning this here because it is a very common mistake to think that Pentium 4 processors don’t have an L1 instruction cache. When comparing the Pentium 4 to other CPUs, people might think its L1 cache is much smaller, because they are only counting the 8 KB L1 data cache. The trace execution cache of Pentium 4 and Celeron CPUs is 150 KB and should be taken into account, of course.
[nextpage title=”Branching”]
As we have mentioned several times, one of the main problems for the CPU is having too many cache misses, because then the fetch unit must access the slow RAM memory directly, slowing down the system.
Usually the memory cache avoids this, but there is one typical situation where the cache controller will miss: branches. If in the middle of the program there is an instruction called JMP (“jump” or “go to”) sending the program to a completely different memory position, this new position won’t be loaded in the L2 memory cache, forcing the fetch unit to go get that position directly from the RAM memory. In order to solve this issue, the cache controller of modern CPUs analyzes the memory block it has loaded and, whenever it finds a JMP instruction in there, loads the memory block for that position into the L2 memory cache before the CPU reaches the JMP instruction.

Figure 8: Unconditional branching situation.

This is pretty easy to implement; the problem is when the program has a conditional branch, i.e., the address the program should go to depends on a condition not yet known. For example, if a <= b go to address 1, or if a > b go to address 2. We illustrate this example in Figure 9. This would cause a cache miss, because the values of a and b are unknown and the cache controller would be looking only for JMP-like instructions. The solution: the cache controller loads both conditions into the memory cache. Later, when the CPU processes the branching instruction, it will simply discard the one that wasn’t chosen. It is better to load the memory cache with unnecessary data than to access the RAM memory directly.
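The prefetch logic above can be sketched as follows, using a made-up instruction encoding: an unconditional JMP contributes one target address to prefetch, while a conditional branch (labeled JCC here, as a placeholder for "jump if condition") contributes both possible targets:

```python
def prefetch_targets(block):
    """Scan a loaded block of (made-up) decoded instructions and return the
    branch target addresses the cache controller should prefetch."""
    targets = []
    for instr in block:
        if instr[0] == "JMP":              # unconditional branch: one known target
            targets.append(instr[1])
        elif instr[0] == "JCC":            # conditional branch: prefetch BOTH paths
            targets.extend([instr[1], instr[2]])
    return targets

# A block with a conditional branch (two targets) and an unconditional jump:
block = [("ADD",), ("JCC", 0x1000, 0x2000), ("MOV",), ("JMP", 0x3000)]
to_prefetch = prefetch_targets(block)   # -> [0x1000, 0x2000, 0x3000]
```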

Figure 9: Conditional branching situation.

[nextpage title=”Processing Instructions”]
The fetch unit is in charge of loading instructions from memory. First, it checks whether the instruction required by the CPU is in the L1 instruction cache. If it is not, it goes to the L2 memory cache. If the instruction is not there either, it has to be loaded directly from the slow system RAM memory.
When you turn on your PC all the caches are empty, of course, but as the system starts loading the operating system, the CPU starts processing the first instructions loaded from the hard drive, and the cache controller starts loading the caches, and the show begins.
After the fetch unit has grabbed the instruction required by the CPU, it sends it to the decode unit.
The decode unit then figures out what that particular instruction does. It does this by consulting a ROM memory that exists inside the CPU, called the microcode. Each instruction that a given CPU understands has its own microcode. The microcode “teaches” the CPU what to do; it is like a step-by-step guide for every instruction. If the loaded instruction is, for example, add a+b, its microcode will tell the decode unit that it needs two parameters, a and b. The decode unit will then request the fetch unit to grab the data present in the next two memory positions, which hold the values of a and b. After the decode unit has “translated” the instruction and grabbed all the data required to execute it, it passes all the data and the “step-by-step cookbook” on how to execute that instruction to the execute unit.
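Here is a minimal Python sketch of this decode step, with a made-up microcode table that maps each opcode to the number of operands the fetch unit must grab (real microcode is far richer, of course):

```python
# Hypothetical microcode ROM: opcode -> how many operands follow it in memory
MICROCODE = {
    "ADD": {"operands": 2},   # add a + b: needs two parameters
    "JMP": {"operands": 1},   # jump: needs one target address
    "NOP": {"operands": 0},   # no operation: needs nothing
}

def decode(opcode, memory, pc):
    """Consult the microcode ROM, then fetch the operands that follow the opcode."""
    n = MICROCODE[opcode]["operands"]
    operands = memory[pc + 1 : pc + 1 + n]  # the next n memory positions
    return opcode, operands

# Memory holding an ADD followed by its two parameters, a=3 and b=4:
mem = ["ADD", 3, 4]
op, args = decode("ADD", mem, 0)   # -> ("ADD", [3, 4])
```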
The execute unit will finally execute the instruction. On modern CPUs you will find more than one execution unit working in parallel. This is done in order to increase the processor performance. For example, a CPU with six execution units can execute six instructions in parallel, so in theory it could achieve the same performance of six processors with just one execution unit. This kind of architecture is called superscalar architecture.
Usually modern CPUs don’t have several identical execution units; they have execution units specialized in one kind of instruction. The best example is the FPU, Floating Point Unit, which is in charge of executing complex math instructions. Usually between the decode unit and the execution units there is a unit (called the dispatch or schedule unit) in charge of sending each instruction to the correct execution unit, i.e., if it is a math instruction it will be sent to the FPU and not to a “generic” execution unit. By the way, “generic” execution units are called ALUs, Arithmetic and Logic Units.
Finally, when the processing is over, the result is sent to the L1 data cache. Continuing our add a+b example, the result would be sent to the L1 data cache. From there it can be sent back to the RAM memory or to another place, such as the video card. But this will depend on the next instruction to be processed (which could be, for example, “print the result on the screen”).
Another interesting feature that all microprocessors have had for a long time is the “pipeline”, which is the capability of having several different instructions at different stages of the CPU at the same time.
After the fetch unit sends an instruction to the decode unit, it would be idle, right? So, instead of doing nothing, why not have the fetch unit grab the next instruction? When the first instruction goes to the execution unit, the fetch unit can send the second instruction to the decode unit and grab the third instruction, and so on.
In a modern CPU with an 11-stage pipeline (stage is another name for each unit of the CPU), it will probably have 11 instructions inside it at the same time almost all the time. In fact, since all modern CPUs have a superscalar architecture, the number of instructions simultaneously inside the CPU will be even higher.
Also, in an 11-stage pipeline CPU, an instruction has to pass through 11 units to be fully executed. The higher the number of stages, the longer an instruction takes to be fully executed. On the other hand, keep in mind that because of this concept several instructions can be running inside the CPU at the same time. The very first instruction loaded by the CPU may take 11 steps to get out, but once it does, the second instruction comes out right after it (and not another 11 steps later).
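The pipeline arithmetic described above can be sketched in a few lines of Python: with an ideal pipeline, once the first instruction fills the stages, one instruction completes per clock cycle (this ignores stalls and the superscalar effects mentioned earlier):

```python
def pipelined_cycles(stages, instructions):
    """Ideal pipeline: the first instruction takes `stages` cycles,
    then one instruction completes every cycle after that."""
    return stages + (instructions - 1)

def unpipelined_cycles(stages, instructions):
    """No overlap at all: every instruction pays the full latency."""
    return stages * instructions

# 11-stage pipeline, 100 instructions:
with_pipeline = pipelined_cycles(11, 100)      # -> 110 cycles
without_pipeline = unpipelined_cycles(11, 100) # -> 1100 cycles
```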
There are several other tricks used by modern CPUs to increase performance. We will explain two of them, out-of-order execution (OOO) and speculative execution.
[nextpage title=”Out-Of-Order Execution (OOO)”]
Remember that we said that modern CPUs have several execution units working in parallel? We also said that there are different kinds of execution units, like the ALU, which is a generic execution unit, and the FPU, which is a math execution unit. As a generic example to understand the problem, let’s say that a given CPU has six execution units, four “generic” and two FPUs. Let’s also say that the program has the following instruction flow at a given moment:

1. generic instruction
2. generic instruction
3. generic instruction
4. generic instruction
5. generic instruction
6. generic instruction
7. math instruction
8. generic instruction
9. generic instruction
10. math instruction

What will happen? The schedule/dispatch unit will send the first four instructions to the four ALUs but then, at the fifth instruction, the CPU would have to wait for one of its ALUs to be free in order to continue, since all four of its generic execution units are busy. That’s not good, because we still have two math units (FPUs) available, and they are idle. So a CPU with out-of-order execution (all modern CPUs have this feature) will look at the next instruction to see if it can be sent to one of the idle units. In our example, it can’t, because the sixth instruction also needs an ALU. The out-of-order engine continues its search and finds that the seventh instruction is a math instruction that can be executed in one of the available FPUs. Since the other FPU is still available, it will go down the program looking for another math instruction. In our example, it will skip the eighth and ninth instructions and load the tenth.
So, in our example, the execution units will be processing, at the same time, the first, the second, the third, the fourth, the seventh and the tenth instructions.
The name out-of-order comes from the fact that the CPU doesn’t need to wait; it can pull an instruction from further down the program and process it before the instructions above it are processed. Of course, the out-of-order engine cannot search forever if it cannot find a suitable instruction. The out-of-order engine of every CPU has a depth limit on how far ahead it can look for instructions (a typical value would be 512).
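A simplified Python sketch of this dispatch decision, using the example program above (instructions 1 through 10) and the hypothetical four-ALU, two-FPU machine. This ignores data dependencies between instructions, which a real out-of-order engine must also respect:

```python
def dispatch(instructions, alus=4, fpus=2, window=512):
    """Pick instructions for one dispatch cycle: fill the ALUs with generic
    instructions and the FPUs with math instructions, scanning ahead (out of
    order) up to `window` instructions. Returns the 1-based positions issued."""
    issued = []
    free = {"generic": alus, "math": fpus}
    for pos, kind in enumerate(instructions[:window], start=1):
        if free[kind] > 0:
            free[kind] -= 1
            issued.append(pos)
    return issued

# The example flow: instructions 7 and 10 are math, the rest are generic.
program = ["generic"] * 6 + ["math"] + ["generic"] * 2 + ["math"]
positions = dispatch(program)   # -> [1, 2, 3, 4, 7, 10]
```

The result matches the text: instructions 1 through 4 fill the ALUs, and the engine skips ahead to 7 and 10 to keep both FPUs busy.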
[nextpage title=”Speculative Execution”]
Let’s suppose that one of these generic instructions is a conditional branch. What will the out-of-order engine do? If the CPU implements a feature called speculative execution (all modern CPUs do), it will execute both branches. Consider the example below:

1. generic instruction
2. generic instruction
3. if a <= b go to instruction 15
4. generic instruction
5. generic instruction
6. generic instruction
7. math instruction
8. generic instruction
9. generic instruction
10. math instruction
15. math instruction
16. generic instruction

When the out-of-order engine analyzes this program, it will pull instruction 15 into one of the FPUs, since it needs a math instruction to fill one of the FPUs that would otherwise be idle. So at a given moment we could have both branches being processed at the same time. If, when the CPU finishes processing the third instruction, a is greater than b, it will simply discard the processing of instruction 15. You may think this is a waste of time, but in fact it is not. It doesn’t cost the CPU anything to execute that particular instruction, because the FPU would otherwise be idle anyway. On the other hand, if a <= b the CPU gets a performance boost, since when instruction 3 asks for instruction 15 it will already have been processed, going straight to instruction 16 or even further, if instruction 16 has also already been processed by the out-of-order engine.
Of course, everything we explained in this tutorial is an oversimplification in order to make this very technical subject easier to understand. Read our Inside Pentium 4 Architecture tutorial in order to study the architecture of the Pentium 4 processor. We will also be posting an Athlon 64 architecture tutorial very soon.
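A toy Python illustration of the idea: both paths are computed up front, and the branch condition only picks which result survives (in a real CPU the discarded work would have run on otherwise idle units, so it costs nothing):

```python
def speculative_branch(a, b, taken_path, not_taken_path):
    """Execute both paths ahead of time; the branch condition only selects
    which precomputed result to keep. The other result is simply discarded."""
    result_taken = taken_path()          # speculatively executed (e.g., instruction 15)
    result_not_taken = not_taken_path()  # the fall-through path (instruction 4 onward)
    # The condition from the example: "if a <= b go to instruction 15"
    return result_taken if a <= b else result_not_taken

# Both lambdas run, but only one result survives:
chosen = speculative_branch(1, 2, lambda: "instruction 15", lambda: "instruction 4")
# -> "instruction 15", since 1 <= 2
```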
