How The Cache Memory Works
Meet The Memory Cache
Contents
In Figure 2, you can see a basic block diagram from a generic single-core CPU. Of course the actual block diagram will vary depending on the CPU and you can read our tutorials for each CPU line to take a look at their real block diagram (Inside Pentium 4 Architecture, Inside Intel Core Microarchitecture and Inside AMD64 Architecture).
Figure 2: Basic block diagram of a CPU.
The dotted line in Figure 2 represents the CPU body, as the RAM memory is located outside the CPU. The datapath between the RAM memory and the CPU is usually 64-bit wide (or 128-bit when dual channel memory configuration is used), running at the memory clock or the CPU external clock (or memory bus clock, in the case of AMD processors), which one is lower. We have already taught you how to calculate the memory transfer rate on the first page of this tutorial.
All the circuits inside the dotted box run at the CPU internal clock. Depending on the CPU some of its internal parts can even run at a higher clock rate. Also, the datapath between the CPU units can be wider, i.e., transfer more bits per clock cycle than 64 or 128. For example, the datapath between the L2 memory cache and the L1 instruction cache on modern processors is usually 256-bit wide. The datapath between the L1 instruction cache and the CPU fetch unit also varies depending on the CPU model – 128 bits is a typical value, but at the end of this tutorial we will present a table containing the main memory cache specs for the CPUs available on the market today. The higher the number the bits transferred per clock cycle, the fast the transfer will be done (in other words, the transfer rate will be higher).
In summary, all modern CPUs have three memory caches: L2, which is bigger and found between the RAM memory and the L1 instruction cache, which holds both instructions and data; the L1 instruction cache, which is used to store instructions to be executed by the CPU; and the L1 data cache, which is used to store data to be written back to the memory. L1 and L2 means “Level 1” and “Level 2,” respectively, and refers to the distance they are from the CPU core (execution unit). One common doubt is why having three separated cache memories (L1 data cache, L1 instruction cache and L2 cache).
Making zero latency static memory is a huge challenge, especially with CPUs running at very high clock rates. Since manufacturing near zero latency static RAM is very hard, the manufacturer uses this kind of memory only on the L1 memory cache. The L2 memory cache uses a static RAM that isn’t as fast as the memory used on the L1 cache, as it provides some latency, thus it is a little bit slower than L1 memory cache.
Pay attention to Figure 2 and you will see that L1 instruction cache works as an “input cache,” while L1 data cache works as an “output cache.” L1 instruction cache – which is usually smaller than L2 cache – is particularly efficient when the program starts to repeat a small part of it (loop), because the required instructions will be closer to the fetch unit.
This is rarely mentioned, but the L1 instruction cache is also used to store other data besides the instructions to be decoded. Depending on the CPU it can be also used to store some pre-decode data and branching information (in summary, control data that will speed up the decoding process) and sometimes the L1 instruction cache is bigger than announced, because the manufacturer didn’t add the extra space available for these extra pieces of information.
On the specs page of a CPU the L1 cache can be found with different kinds of representation. Some manufacturers list the two L1 cache separately (some times calling the instruction cache as “I” and the data cache as “D”), some add the amount of the two and writes “separated” – so a “128 KB, separated” would mean 64 KB instruction cache and 64 KB data cache –, and some simply add the two and you have to guess that the amount is total and you should divide by two to get the capacity of each cache. The exception, however, goes to CPUs based on Netburst microarchitecture, i.e., Pentium 4, Pentium D, Pentium 4-based Xeon and Pentium 4-based Celeron CPUs.
Processors based on Netburst microarchitecture don’t have a L1 instruction cache, instead they have a trace execution cache, which is a cache located between the decode unit and the execution unit, storing instructions already decoded. So, the L1 instruction cache is there, but with a different name and a different location. We are mentioning this here because this is a very common mistake, to think that Pentium 4 processors don’t have L1 instruction cache. So when comparing Pentium 4 to other CPUs people would think that its L1 cache is much smaller, because they are only counting the 8 KB L1 data cache. The trace execution cache of Netburst-based CPUs is of 150 KB and should be taken in account, of course.
