Inside AMD K10 Architecture

[nextpage title=”Introduction”]

K10 is the name of the new architecture that upcoming AMD processors will use, such as the forthcoming Phenom and the Opteron based on the much-anticipated “Barcelona” core. Many people confuse the two and call the K10 architecture “Barcelona,” but Barcelona is only one of the CPUs that will use this new architecture. In this tutorial we will explain what is new in the K10 architecture and present a complete AMD roadmap showing all the K10-based products planned so far.

The new K10 architecture is based on the K8 (a.k.a. AMD64) architecture with some enhancements, so we recommend reading our Inside AMD64 Architecture tutorial before continuing. By the way, AMD never released an architecture called K9; it jumped from K8 straight to K10.

The slide shown in Figure 1 lists the main enhancements the K10 microarchitecture brings over K8.

Figure 1: K10 microarchitecture enhancements over K8.

The main enhancements are:

The fetch unit fetches 32 bytes (256 bits) of data per clock cycle from the L1 instruction cache, twice what CPUs based on K8 architecture could fetch per clock cycle. Intel CPUs based on Core microarchitecture, like Core 2 Duo, also fetch 32 bytes per clock cycle.
The use of a true 128-bit internal datapath. On previous CPUs based on K8 microarchitecture the internal datapath was only 64 bits wide. This was a problem for SSE instructions, since SSE registers, called XMM, are 128 bits long. So, when executing an instruction that manipulated 128-bit data, the operation had to be broken down into two 64-bit operations. The new 128-bit datapath makes K10 microarchitecture faster than K8 microarchitecture at processing SSE instructions that manipulate 128-bit data. Intel processors based on Core microarchitecture (Core 2 Duo, for example) also have 128-bit internal datapaths, while Intel processors based on Netburst microarchitecture (Pentium 4 and Pentium D) have 64-bit internal datapaths. AMD is calling this new feature “AMD Wide Floating Point Accelerator.”
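
To make the datapath point concrete, here is a minimal sketch in Python that counts how many internal passes a packed addition needs. It is only an illustration: the function `packed_add` and its `lanes_per_pass` parameter are our own hypothetical names, and we assume an XMM register holding four 32-bit values, so a 64-bit datapath handles two lanes per pass and a 128-bit datapath handles all four.

```python
def packed_add(a, b, lanes_per_pass):
    """Add two packed registers lane by lane, counting datapath passes.

    a, b           -- lists of lane values (four 32-bit lanes model one XMM register)
    lanes_per_pass -- how many lanes the datapath can process at once (assumption)
    """
    result = []
    passes = 0
    for start in range(0, len(a), lanes_per_pass):
        chunk = slice(start, start + lanes_per_pass)
        result.extend(x + y for x, y in zip(a[chunk], b[chunk]))
        passes += 1
    return result, passes


xmm0 = [1, 2, 3, 4]       # four 32-bit lanes = 128 bits
xmm1 = [10, 20, 30, 40]

k8_result, k8_passes = packed_add(xmm0, xmm1, lanes_per_pass=2)    # 64-bit datapath
k10_result, k10_passes = packed_add(xmm0, xmm1, lanes_per_pass=4)  # 128-bit datapath

print(k8_result, k8_passes)    # same result, two passes
print(k10_result, k10_passes)  # same result, one pass
```

Both calls produce the same sums; only the number of passes changes, which is the saving the wider datapath provides in this simplified model.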

In Figure 2, you can see a list of new features introduced by K10 architecture. We will be explaining each one of them in the next pages.

Figure 2: New features introduced by K10 architecture.

[nextpage title=”L3 Memory Cache”]

As a reminder, the memory cache is a high-speed memory (static RAM, or SRAM) embedded inside the CPU and used to store data that the CPU may need. If the data required by the CPU isn’t in the cache, the CPU must go all the way to the main RAM to get it, which slows it down, as the RAM is accessed using the CPU external clock rate. For example, on a 3 GHz AMD CPU, the memory cache is accessed at 3 GHz but the RAM is accessed at 800 MHz (if you are using DDR2-800 memory) or less.

On Pentium D and AMD dual-core CPUs based on K8 architecture, each CPU core has its own L2 memory cache. On Intel dual-core CPUs based on Core and Pentium M microarchitectures, there is only one L2 memory cache, which is shared between the two cores.

Intel says that this shared architecture is better, because with separate caches one core may at some point run out of cache while the other still has unused space in its own L2 memory cache. When this happens, the first core must fetch data from the main RAM, even though the empty space in the second core’s L2 memory cache could have held that data and spared the first core the trip to RAM. So on a Core 2 Duo processor with a 4 MB L2 memory cache, one core may be using 3.5 MB while the other uses 512 KB (0.5 MB), in contrast to the fixed 50%-50% division used on other dual-core CPUs.
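
The 3.5 MB / 0.5 MB example can be checked with a few lines of Python. This is a deliberately simplified sketch: it treats each core’s demand as a fixed “working set” and ignores real-world details such as replacement policy and associativity. The function names are our own and purely illustrative.

```python
def spill_split(working_sets_mb, per_core_mb):
    """MB that each core must fetch from RAM when every core has a private cache."""
    return [max(0.0, ws - per_core_mb) for ws in working_sets_mb]


def spill_shared(working_sets_mb, total_mb):
    """MB that must come from RAM when all cores share a single cache."""
    return max(0.0, sum(working_sets_mb) - total_mb)


working_sets = [3.5, 0.5]  # MB wanted by core 1 and core 2, as in the example above

split = spill_split(working_sets, per_core_mb=2.0)  # fixed 50%-50% division of 4 MB
shared = spill_shared(working_sets, total_mb=4.0)   # one 4 MB cache for both cores

print(split)   # core 1 overflows its 2 MB half, core 2 leaves 1.5 MB unused
print(shared)  # everything fits in the shared cache
```

In this model the split design sends 1.5 MB of core 1’s data to RAM while the shared design sends none, which is the argument Intel makes.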

On the other hand, current quad-core Intel CPUs like Core 2 Extreme QX and Core 2 Quad use two dual-core chips, meaning that this sharing only occurs between cores 1 & 2 and 3 & 4. In the future Intel plans to launch quad-core CPUs using a single chip. When this happens the L2 cache will be shared between the four cores.

In Figure 3, you can see a comparison between these three L2 memory cache solutions.

Figure 3: Comparison between L2 memory cache solutions on current multi-core CPUs.

K10 architecture adds a shared L3 memory cache inside the CPU. This is shown in Figure 4. The size of this cache will depend on the CPU model, just like what happens with the size of L2 cache.

Figure 4: K10 cache architecture.

AMD calls this approach “Balanced Smart Cache.”

By the way, the L1 memory cache remains unchanged: 64 KB for instructions and 64 KB for data per core (in Figure 1 AMD shows “512 KB,” but this is the total figure for a quad-core CPU).

[nextpage title=”Independent Memory Controller”]

The more data the CPU fetches from the RAM per clock cycle, the faster the system will be. As we explained on the previous page, the CPU is a lot faster than the RAM, so the fewer times it needs to fetch data from memory the better. Loading a lot of data at once reduces the number of fetches.

Memory modules are 64-bit devices. Instead of launching 128-bit memory modules, CPU and chipset manufacturers came up with the idea of dual-channel memory, a way to access two memory modules simultaneously, as if these two 64-bit memory modules were a single 128-bit module. This doubles the memory transfer rate, as two 64-bit pieces of data can now be loaded per clock cycle instead of one.
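
As a quick worked example of this doubling, the sketch below computes the theoretical peak transfer rate of DDR2-800 memory (800 million transfers per second over a 64-bit module) in single-channel and dual-channel configurations. The function name is our own, and the figures are theoretical maximums, not measured values.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bus_bits, channels=1):
    """Theoretical peak memory bandwidth in MB/s."""
    bytes_per_transfer = bus_bits // 8
    return mega_transfers_per_s * bytes_per_transfer * channels


single_channel = peak_bandwidth_mb_s(800, 64)            # one 64-bit DDR2-800 module
dual_channel = peak_bandwidth_mb_s(800, 64, channels=2)  # two modules accessed together

print(single_channel)  # 6400
print(dual_channel)    # 12800
```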

The problem with dual-channel technology is that the second 64-bit piece of data, loaded together with the data that was originally requested, necessarily comes from the following address. For example, if the CPU asks for data A stored at address 1, the memory controller will automatically load data A and data B, which is stored at address 2.

If the CPU has no use for data B, this second load is completely wasted, as the memory controller cannot use the parallel load to read data stored at any address other than the following one.

The memory controller used on K10 architecture allows the CPU to load data stored at an address other than the next one. This independence will increase CPU performance by not wasting memory loads. Figure 5 illustrates this feature with a CPU that wants to load data A and F. On K8 architecture, illustrated on the left side, two fetches are needed (and two of the four pieces of data loaded are completely useless), while on K10 architecture only one fetch is needed.
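
The A-and-F example can be reproduced with a small counting model, sketched below under simplifying assumptions: data A through F live at addresses 1 through 6, a ganged fetch always returns the requested address plus the following one, and an independent fetch can return any two addresses at once. A real controller can only pair addresses that sit on different channels, so treat this as an illustration and not a description of the actual hardware. The function names are ours.

```python
def ganged_fetches(wanted):
    """Fetches needed when each fetch returns an address and the one after it."""
    remaining = set(wanted)
    fetches = 0
    for addr in sorted(wanted):
        if addr in remaining:
            remaining -= {addr, addr + 1}  # the neighbor comes along, wanted or not
            fetches += 1
    return fetches


def independent_fetches(wanted, channels=2):
    """Fetches needed when each channel can load any address (simplified)."""
    return -(-len(set(wanted)) // channels)  # ceiling division


A, F = 1, 6  # hypothetical addresses for data A and data F

k8_fetches = ganged_fetches([A, F])
k10_fetches = independent_fetches([A, F])

print(k8_fetches)   # 2: one fetch for A+B, another for F and its neighbor
print(k10_fetches)  # 1: A and F are loaded together
```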

Figure 5: Independent memory controller.

Informally the independent architecture used on K10 is called "un-ganged", while the previous implementation that is used nowadays is called "ganged".

AMD calls this feature “AMD Memory Optimizer Technology.”

By the way, it seems that AMD fixed the “broken divider” problem found on current socket AM2 CPUs. Let’s wait to see if that is really true.

[nextpage title=”Energy-Saving Features”]

Most of the new features introduced by the K10 architecture are aimed at saving energy, and thus at making the CPU produce less heat.

Here are these features:

Independent Dynamic Core Technology allows each CPU core to run at a different clock rate. The voltage of the cores, however, is shared, and it will be the voltage required by the core running at the highest clock rate.
CoolCore Technology allows the CPU to automatically turn off parts of the CPU that are not being used. Processors based on Core microarchitecture also have a similar feature (“Advanced Power Gating”).

Figure 6: CoolCore Technology.

Dual Dynamic Power Management (DDPM), informally known as “split-plane,” allows the CPU and the memory controller (which is embedded inside the CPU) to use different power sources, i.e., different voltages. This will allow the memory controller to work at higher clock rates, typically 200 MHz above the standard clock. This technology also allows the CPU to reduce its voltage while keeping the memory controller working at full speed when the CPU enters one of its power-saving modes. When installed on older motherboards that don’t have separate power sources for the CPU and for the memory controller, the CPU will work like K8 processors, i.e., it will use the single voltage provided to feed both the CPU and the memory controller.

Figure 7: Dual Dynamic Power Management (DDPM).

Figure 8: Dual Dynamic Power Management (DDPM).

Desktop CPUs will use HyperTransport 3.0 instead of HyperTransport 1.x (server CPUs will adopt HT3 only in the future). There are two goals here. The more obvious one is a higher transfer rate for accessing peripherals: with HT3, K10-based CPUs will be able to communicate with the outside world at up to 10,400 MB/s, while K8-based CPUs can transfer data at up to 4,000 MB/s, a significant 2.6x increase in available bandwidth. The not-so-obvious advantage is power saving, as HT3 allows the CPU to change the HyperTransport clock rate and width (i.e., the number of bits transferred per clock cycle) on the fly. For example, if the CPU senses that 10,400 MB/s is more than it needs at the moment, it can lower the HyperTransport clock rate (and width) to a value better suited to its current workload. The lower the clock rate and the fewer bits transferred per clock cycle, the less electrical power is used. Since HT3 remains compatible with HT1, K10-based CPUs can be installed on older motherboards, but their HyperTransport bus will run at a lower clock rate. For a complete discussion of HyperTransport 3.0, please read our The HyperTransport Bus Used by AMD Processors tutorial.
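
The 4,000 MB/s and 10,400 MB/s figures can be derived from the link parameters, as the sketch below shows. It assumes a 16-bit link that transfers data twice per clock cycle, running at 1,000 MHz for HyperTransport 1.x on K8 and at 2,600 MHz for HyperTransport 3.0; these clock rates and the link width are not stated in the text above, so treat them as our assumptions. The reduced mode at the end (8 bits at 1,000 MHz) is a hypothetical power-saving setting chosen only to show how lowering clock and width cuts bandwidth.

```python
def ht_bandwidth_mb_s(clock_mhz, link_bits):
    """Peak HyperTransport bandwidth in one direction, in MB/s."""
    transfers_per_clock = 2  # data is transferred on both clock edges
    return clock_mhz * transfers_per_clock * link_bits // 8


ht1_peak = ht_bandwidth_mb_s(1000, 16)    # assumed K8 link: 1,000 MHz, 16 bits
ht3_peak = ht_bandwidth_mb_s(2600, 16)    # assumed K10 link: 2,600 MHz, 16 bits
ht3_reduced = ht_bandwidth_mb_s(1000, 8)  # hypothetical reduced clock and width

print(ht1_peak)             # 4000
print(ht3_peak)             # 10400
print(ht3_peak / ht1_peak)  # 2.6
print(ht3_reduced)          # 2000
```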

Now let’s talk about the CPUs that will use the new K10 architecture.

[nextpage title=”K10 Server CPUs Roadmap”]

You can see the roadmap for K10-based server CPUs in Figures 9 and 10.

Figure 9: K10 server CPUs roadmap.

Figure 10: K10 server CPUs roadmap.

As expected, the first CPU to be launched using the new K10 architecture will be a quad-core Opteron processor based on the “Barcelona” core. In Figure 11, you can see the Opteron “Barcelona” models AMD plans to launch; the tables below it list all models released so far.

Figure 11: Clock rates and TDP that will be available for the quad-core Opteron “Barcelona.”

AMD Opteron 2300 Series
Model	Core Frequency	TDP
2350	2.0 GHz	95 W
2347	1.9 GHz	95 W
2347 HE	1.9 GHz	68 W
2346 HE	1.8 GHz	68 W
2344 HE	1.7 GHz	68 W

AMD Opteron 8300 Series
Model	Core Frequency	TDP
8350	2.0 GHz	95 W
8347	1.9 GHz	95 W
8347 HE	1.9 GHz	68 W
8346 HE	1.8 GHz	68 W

Here is a quick summary of the cores that will be launched for the server market based on K10 architecture:

Barcelona: quad- or dual-core Opteron on the 2000 and 8000 series, 512 KB L2 memory cache per core, 2 MB L3 memory cache, registered DDR2 memory, socket 1207 (socket F), HyperTransport 1.x, 65 nm manufacturing process.
Budapest: dual-core Opteron on the 1000 series, 512 KB L2 memory cache per core, 2 MB L3 memory cache, conventional DDR2 memory, socket 1207 (socket F), HyperTransport 1.x or 3.0, 65 nm manufacturing process.
Shanghai: quad- or dual-core Opteron on the 2000 and 8000 series, 512 KB L2 memory cache per core, 6 MB L3 memory cache, registered DDR2 memory, socket 1207 (socket F), HyperTransport 1.x, 45 nm manufacturing process.
Montreal: eight- or quad-core Opteron on the 2000 and 8000 series, 1 MB L2 memory cache per core, 6 MB or 12 MB L3 memory cache, registered DDR3 memory, socket G3, HyperTransport 1.x, 45 nm manufacturing process.
Suzuka: quad- or dual-core Opteron on the 1000 series, 512 KB L2 memory cache per core, 6 MB L3 memory cache, conventional DDR3 memory, socket AM3, HyperTransport 3.0, 45 nm manufacturing process.

Now let’s see the planned desktop models.

[nextpage title=”K10 Desktop CPUs Roadmap”]

You can see the roadmap for K10-based desktop CPUs in Figure 12.

Figure 12: K10 desktop CPUs roadmap.

AMD has not disclosed the model numbers that will be released.

Here is a quick summary of the cores that will be launched for the desktop market based on K10 architecture:

Spica: A single-core Sempron LE CPU, with 512 KB L2 memory cache, regular DDR2 memory, HyperTransport 3.0 and socket AM2+.
Rana: Dual-core Athlon X2 LS CPU, with 512 KB L2 memory cache per core, L3 memory cache (size not disclosed), regular DDR2 memory, HyperTransport 3.0 and socket AM2+.
Kuma: Dual-core Phenom X2 CPU, with 512 KB L2 memory cache per core, 2 MB L3 memory cache, regular DDR2 memory, HyperTransport 3.0 and socket AM2+.
Agena: Quad-core Phenom X4 CPU, with 512 KB L2 memory cache per core, 2 MB L3 memory cache, regular DDR2 memory, HyperTransport 3.0 and socket AM2+.
Agena FX: Quad-core Phenom FX CPU, with 512 KB L2 memory cache per core, 2 MB L3 memory cache, regular DDR2 memory, HyperTransport 3.0 and socket AM2+ or socket 1207+.

Socket AM2+ and socket 1207+ are sockets AM2 and 1207 (socket F) with added support for HyperTransport 3.0 and Dual Dynamic Power Management (DDPM). As we said before, you can install K10-based processors on old socket AM2 or socket F motherboards, but the CPU won’t have access to the new transfer rates and features provided by HyperTransport 3.0 or to the separate voltage for the memory controller: both the CPU and the memory controller will be fed with the same voltage.
