Inside Intel Nehalem Microarchitecture

[nextpage title=”Introduction”]

Nehalem is the codename of the new Intel CPU with an integrated memory controller, which will reach the market next month under the name Core i7. This architecture will also be used on CPUs targeted at servers (Xeon) and, a few years from now, on entry-level CPUs. CPUs based on this architecture will have an embedded memory controller supporting three DDR3 channels, three cache levels, the return of Hyper-Threading technology, a new external bus called QuickPath and more. In this tutorial we will explain what’s new in this architecture.

Below is a summary of Nehalem’s main features; we will explain what they mean in the next pages:

Based on Intel Core microarchitecture.
Two to eight cores.
Integrated DDR3 triple-channel memory controller.
Individual 256 KB L2 memory caches for each core.
8 MB L3 memory cache.
New SSE 4.2 instruction set (seven new instructions).
Hyper-Threading technology.
Turbo mode (auto overclocking).
Enhancements to the microarchitecture (support for macro-fusion under 64-bit mode, improved Loop Stream Detector, six dispatch ports, etc).
Enhancements on the prediction unit, with the addition of a second Branch Target Buffer (BTB).
A second 512-entry Translation Look-aside Buffer (TLB).
Optimized for unaligned SSE instructions.
Improved virtualization performance (60% improvement on round-trip virtualization latency compared to 65-nm Core 2 CPUs and 20% improvement compared to 45-nm Core 2 CPUs, according to Intel).
New QuickPath Interconnect external bus.
New power control unit.
45 nm manufacturing technology at launch, with future models at 32 nm (CPUs codenamed “Westmere”).
New socket with 1366 pins.

It is important to remember that Core 2 CPUs manufactured under 45-nm technology have extra features compared to the Core 2 CPUs manufactured under 65-nm technology. All these features are present on Nehalem-based CPUs, and the most significant ones are:

SSE4.1 instruction set (47 new SSE instructions).
Deep Power Down Technology (only on mobile CPUs, also known as C6 state).
Enhanced Intel Dynamic Acceleration Technology (only on mobile CPUs).
Fast Radix-16 Divider (FPU enhancement).
Super Shuffle engine (FPU enhancement).
Enhanced Virtualization Technology (between 25% and 75% performance improvement on virtual machine transition time).

Now let’s discuss in detail the most significant differences introduced by this new architecture.

[nextpage title=”Integrated Memory Controller”]

Until now, Intel CPUs have always used an external bus called Front Side Bus, or simply FSB, which is shared between memory and I/O requests. Nehalem-based CPUs have an embedded memory controller and thus provide two external busses: a memory bus for connecting the CPU to the memory and an I/O bus for connecting the CPU to the external world.

This change improves system performance considerably for two main reasons. First, there are now separate datapaths for I/O and memory accesses. Second, memory access is faster, as the CPU no longer needs to communicate with an external controller first.

Figures 1 and 2 compare the traditional architecture used by Intel CPUs with the new architecture that will be used by Intel CPUs with an integrated memory controller.

Figure 1: Architecture used by current Intel CPUs.

Figure 2: Architecture used by Intel CPUs with embedded memory controller.

The new external I/O bus is called QuickPath Interconnect (QPI) and it provides two separate datapaths (one for transmitting data and another for receiving data) for the CPU to communicate with the chipset or with other CPUs, in the case of servers with more than one CPU. As you can see, this bus is the equivalent of the HyperTransport bus used on AMD CPUs. The first generation of QuickPath Interconnect will run at 3.2 GHz, transferring 16 bits of data twice per clock tick, which gives a maximum theoretical transfer rate of 12.8 GB/s in each direction (3.2 GHz x 2 transfers x 2 bytes). For a more detailed explanation about this new bus and also a comparison between it, HyperTransport and Front Side Bus, please read our Everything You Need to Know About the QuickPath Interconnect (QPI) tutorial.
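The transfer rate quoted above comes from a simple multiplication, which the short Python sketch below spells out. The function name and its parameters are our own, chosen for illustration only.

```python
def link_bandwidth_gb_s(clock_ghz, transfers_per_clock, bits_per_transfer):
    """Peak one-way bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bits_per_transfer / 8
    return clock_ghz * transfers_per_clock * bytes_per_transfer


# First-generation QPI figures from the text: 3.2 GHz clock,
# two transfers per clock tick, 16 data bits per transfer.
qpi_one_way = link_bandwidth_gb_s(clock_ghz=3.2,
                                  transfers_per_clock=2,
                                  bits_per_transfer=16)
print(f"QPI, each direction: {qpi_one_way:.1f} GB/s")
```

Since the bus has one datapath for each direction, this figure applies to transmitting and receiving independently.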

Desktop CPUs will have only one QuickPath Interconnect, while server CPUs will have two independent busses to allow them to be connected together on SMP (Symmetric Multiprocessing) environments.

The memory controller integrated on Nehalem-based processors provides three memory channels, i.e., it is capable of accessing three memory modules at the same time, in parallel, in order to improve performance – in theory triple-channel architecture provides a 50% increase of available bandwidth compared to a dual-channel architecture running at the same clock rate.

So in order to achieve the best possible performance with a Nehalem-based CPU such as Core i7, you need to install three or six (if your motherboard supports six memory sockets, of course) memory modules. You will have to pay close attention to this change, because most people today are used to having a PC with 2 GB or 4 GB (two or four memory modules in order to match the system’s two memory channels), while with Core i7 and other Nehalem-based CPUs you need to have a PC with 1.5 GB, 3 GB or 6 GB for best performance (three or six memory modules in order to match the system’s three memory channels).

Another thing you have to be very careful about is the fact that some motherboards targeted to Core i7 will have four memory sockets, like Intel "Smackover", which is based on the Intel X58 chipset. If you install four memory modules you will have more memory available, but you will decrease system performance. For example, if you install 4 GB (four 1 GB memory modules), the system will access the first 3 GB at triple-channel performance, but the memory area between 3 GB and 4 GB will be accessed at single-channel performance. So unless you really need more RAM, stick with 1.5 GB, 3 GB or 6 GB. Other manufacturers have already announced that they will produce motherboards with six sockets, so on these boards you must install memory modules in triplets in order to achieve the maximum possible performance.

With three memory channels available, the CPU will access the memory 192 bits at a time (3x 64 bits), provided you have three or six memory modules installed, of course. This gives a maximum theoretical transfer rate of 25.58 GB/s if DDR3-1066 memories are used.
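To make the four-module example concrete, here is a minimal Python sketch that splits the installed memory into regions according to how many channels serve each one. It assumes modules are spread over the channels in turn and that the controller interleaves across every channel that still has capacity left. This is a simplification of ours, not Intel's documented address mapping, and `channel_regions` is a hypothetical helper.

```python
def channel_regions(modules_gb, channels=3):
    """Return a list of (channels_in_parallel, region_size_gb) tuples.

    Assumption: module i is installed on channel i % channels, and the
    controller interleaves across all channels that still have capacity.
    """
    per_channel = [0] * channels
    for i, size in enumerate(modules_gb):
        per_channel[i % channels] += size

    regions = []
    level = 0  # capacity already consumed on every populated channel
    caps = sorted(c for c in per_channel if c > 0)
    while caps:
        active = len(caps)        # channels that can still be interleaved
        step = caps[0] - level    # how far the smallest channel reaches
        if step > 0:
            regions.append((active, step * active))
        level = caps[0]
        caps = [c for c in caps if c > level]
    return regions


print(channel_regions([1, 1, 1]))     # three 1 GB modules
print(channel_regions([1, 1, 1, 1]))  # four 1 GB modules
```

With four 1 GB modules the sketch reports a 3 GB region served by three channels followed by a 1 GB region served by a single channel, matching the example above.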
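The figure above, and the theoretical 50% gain of triple-channel over dual-channel, can be checked with a few lines of Python. The function below is our own illustration; it takes 1,066 million transfers per second for DDR3-1066 and 64 bits per channel, as stated in the text.

```python
def memory_bandwidth_gb_s(channels, mega_transfers_per_s, bus_width_bits=64):
    """Peak theoretical bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = channels * bus_width_bits // 8
    return mega_transfers_per_s * bytes_per_transfer / 1000


triple = memory_bandwidth_gb_s(channels=3, mega_transfers_per_s=1066)
dual = memory_bandwidth_gb_s(channels=2, mega_transfers_per_s=1066)

print(f"Triple-channel DDR3-1066: {triple:.2f} GB/s")
print(f"Dual-channel DDR3-1066:   {dual:.2f} GB/s")
print(f"Gain from the third channel: {(triple / dual - 1) * 100:.0f}%")
```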

The memory controller embedded on Nehalem-based CPUs accepts only DDR3 memories – no support for DDR2 is given.

Due to the integration of the memory controller, Intel had to change the CPU socket to a new socket using 1,366 pins. So you won’t be able to upgrade your current Intel-based system to a Core i7 by simply changing the CPU; you will also have to replace the motherboard and probably the memories, if you don’t have DDR3 memories. If you do have DDR3 memories but only two modules, you will probably need to buy one extra module in order to enable the triple-channel mode.

[nextpage title=”Memory Cache”]

On the cache memory side Intel will use the same cache arrangement AMD is using on their Phenom CPUs, i.e., individual L2 caches for each core and a shared L3 memory cache. Each L2 memory cache will be 256 KB and the L3 cache will be 8 MB, at least for the first models to be launched (Intel may launch Nehalem-based Xeon CPUs with more cache). L1 cache remains the same as on Core 2 Duo (64 KB, 32 KB for instructions and 32 KB for data).

Core 2 Duo processors have only one L2 memory cache, which is shared among all CPU cores, but quad-core CPUs from Intel like Core 2 Quad and Core 2 Extreme have two L2 caches, each one shared by each group of two cores. For a better understanding we summarize the available cache architectures on Figures 3 and 4.

Figure 3: A comparison between cache architectures.

Figure 4: A comparison between cache architectures.

[nextpage title=”Enhancements to the CPU Pipeline”]

As mentioned, Nehalem (Core i7) is based on the architecture used by Core 2 Duo, bringing some enhancements on the way instructions flow inside the CPU. On this page we will describe these enhancements.

Core 2 Duo is, by the way, based on Pentium M, which in turn is based on Pentium III. All these CPUs are 6th generation Intel CPUs (if you run a CPUID instruction all of them will return “6” for the Family field). Pentium 4 was a 7th generation Intel CPU, using a completely different microarchitecture; Core 2 and Core i7 CPUs have absolutely nothing to do with Pentium 4. You may find it strange for a manufacturer to go back to an “old” architecture, but this is what happened (the “old” microarchitecture proved to be more efficient than the “new” one).
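For readers curious about where the Family field lives, the sketch below decodes it from the 32-bit signature that the CPUID instruction (leaf 1) returns in the EAX register, following the bit layout Intel documents for that leaf. The two sample signatures are values we assume for illustration (a Core 2 part and a Nehalem part); on a real machine you would read the value with a CPUID utility.

```python
def decode_signature(eax):
    """Split a CPUID leaf 1 EAX value into family, model and stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    family = base_family
    if base_family == 0xF:
        # The extended family is only added when the base field is 15.
        family += ext_family

    model = base_model
    if base_family in (0x6, 0xF):
        # The extended model supplies the upper four bits of the model.
        model |= ext_model << 4

    return {"family": family, "model": model, "stepping": stepping}


# Sample signatures, assumed for illustration only.
for name, eax in [("Core 2 (sample)", 0x000006FB),
                  ("Nehalem (sample)", 0x000106A5)]:
    print(name, decode_signature(eax))
```

Both sample values decode to family 6; only the model number tells the two microarchitectures apart.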

Refer to Figure 5 to understand the genealogy of the new Nehalem microarchitecture. We also added the main improvements brought by each new CPU; each CPU has everything brought by the previous CPU plus the mentioned improvements. Of course each CPU has other minor improvements; we listed only the most important ones.

Figure 5: Nehalem microarchitecture genealogy tree.

In order to understand the improvements brought by this new microarchitecture you need to remember that programs are written using x86 instructions (also called “macro-ops” or simply “instructions”), which the CPU execution units cannot understand directly. They must first be decoded into microinstructions (also called “micro-ops” or “µops”). This architecture is a CISC/RISC hybrid and was introduced by the Pentium Pro: the CPU receives x86 (CISC) instructions, but executes proprietary microinstructions (RISC).

Core microarchitecture, used on Core 2 CPUs, introduced macro-fusion, which is the ability to translate two x86 instructions into just one microinstruction to be executed inside the CPU, improving performance and lowering the CPU power consumption, since the CPU executes only one microinstruction instead of two. This scheme, however, only works for a compare instruction followed by a conditional branch (i.e., CMP or TEST plus a Jcc instruction).

Nehalem microarchitecture improves macro-fusion in two ways. First it adds the support for several branching instructions that couldn’t be fused on Core 2 CPUs. And second, on Nehalem-based CPUs macro-fusion is used on both 32- and 64-bit modes, while on Core 2 CPUs macro-fusion only works when the CPU is working under 32-bit mode.
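The idea behind macro-fusion can be shown with a toy Python model that walks through a list of instruction mnemonics and merges each CMP or TEST with the conditional jump that follows it. This is only a sketch: it pretends every other instruction decodes to exactly one micro-op, and it does not model which particular Jcc variants each CPU generation is able to fuse.

```python
COMPARE_OPS = {"CMP", "TEST"}


def is_conditional_jump(mnemonic):
    # Jcc covers JE, JNE, JL and so on; JMP is unconditional.
    return mnemonic.startswith("J") and mnemonic != "JMP"


def fuse(instructions):
    """Return the micro-op list, merging compare + Jcc pairs into one."""
    uops = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if cur in COMPARE_OPS and nxt is not None and is_conditional_jump(nxt):
            uops.append(f"{cur}+{nxt}")  # one fused micro-op
            i += 2
        else:
            uops.append(cur)             # toy assumption: one micro-op each
            i += 1
    return uops


loop_body = ["MOV", "ADD", "CMP", "JNE"]
print(fuse(loop_body))
```

Four x86 instructions go in and three micro-ops come out, which is the saving described above.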

Core microarchitecture also added a Loop Stream Detector, basically a small 18-instruction cache between the fetch and the decode units of the CPU. When the CPU is running a loop (a part of a program that is repeated several times) the CPU doesn’t need to fetch the required instructions again from the L1 instruction cache: they are already close to the decode unit. In addition, the CPU actually turns off the fetch and branch prediction units while running a detected loop, which saves some power.

On Nehalem-based CPUs this small cache has been moved to after the decode unit. So instead of holding x86 instructions like on Core 2 CPUs, it holds micro-ops (up to 28). This improves performance, because when the CPU is running a loop, it now doesn’t need to decode the instructions present in the loop: they will be already decoded inside this small cache. Also, the CPU can now turn off the decode unit in addition to the fetch and branch prediction units when running a detected loop, saving even more power.
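A toy model helps to see what moving this cache buys. The Python sketch below counts how many instruction fetches and decodes the front end performs for a loop, using the two capacities given in the text (18 instructions on Core 2, 28 micro-ops on Nehalem). The function and its return format are our own invention, and the model ignores the iterations the real hardware needs to detect the loop.

```python
def front_end_work(loop_instructions, loop_uops, iterations, design):
    """Count fetches and decodes for a loop under a simplified model."""
    total = loop_instructions * iterations
    if design == "core2" and loop_instructions <= 18:
        # Instructions are replayed from the detector, but they still
        # have to be decoded on every iteration.
        return {"fetches": loop_instructions, "decodes": total}
    if design == "nehalem" and loop_uops <= 28:
        # Decoded micro-ops are replayed, so decoding happens only once.
        return {"fetches": loop_instructions, "decodes": loop_instructions}
    # The loop does not fit: fetch and decode everything every time.
    return {"fetches": total, "decodes": total}


for design in ("core2", "nehalem"):
    print(design, front_end_work(loop_instructions=10, loop_uops=12,
                                 iterations=100, design=design))
```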

Figure 6: Location of the Loop Stream Detector on Core and Nehalem CPUs.

Nehalem architecture adds one extra dispatch port and now has 12 execution units (see Figure 7). With that, CPUs based on this architecture can have more microinstructions being executed at the same time than previous CPUs.

Figure 7: Dispatch ports and execution units.

Nehalem microarchitecture also adds two extra buffers: a second 512-entry Translation Look-aside Buffer (TLB) and a second Branch Target Buffer (BTB). The addition of these buffers increases the CPU performance.

TLB is a table used by the virtual memory circuit for converting between virtual addresses and physical addresses. Virtual memory is a technique where the CPU simulates more RAM using a file on the hard drive (called swap file) to allow the computer to continue operating even when there is not enough RAM available (the CPU takes what is in RAM, stores it inside this swap file and then frees that memory for use).

Branch prediction is a circuit that tries to guess the next steps of a program in advance, loading into the CPU the instructions it thinks the CPU will need next. If it guesses right, the CPU won’t waste time loading these instructions from memory, as they will already be inside the CPU. Increasing the size of the BTB (or adding a second one, in the case of Nehalem-based CPUs) allows this circuit to load even more instructions in advance, improving the CPU performance.
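A TLB is essentially a small cache of recent address translations. The Python sketch below models one under assumptions of our own: 4 KB pages, a least-recently-used replacement policy, and a plain dictionary standing in for the page table. `ToyTLB` is a hypothetical class written for illustration, not a description of the actual circuit.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4 KB pages (assumption)


class ToyTLB:
    def __init__(self, page_table, entries=512):
        self.page_table = page_table  # virtual page number -> physical frame
        self.entries = entries
        self.cache = OrderedDict()    # recently used translations
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.cache:
            self.hits += 1
            self.cache.move_to_end(page)       # mark as most recently used
            frame = self.cache[page]
        else:
            self.misses += 1
            frame = self.page_table[page]      # slow path: page table lookup
            self.cache[page] = frame
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)  # evict least recently used
        return frame * PAGE_SIZE + offset


tlb = ToyTLB(page_table={0: 7, 1: 3}, entries=512)
physical = tlb.translate(0x1004)  # first access to page 1: a miss
tlb.translate(0x1008)             # same page again: a hit
print(hex(physical), "hits:", tlb.hits, "misses:", tlb.misses)
```

The more entries the table holds, the more often the translation is found without going through the slow path.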

[nextpage title=”Power Management Enhancements”]

Transistors inside the CPU work as a switch, with two possible states: conductive (a.k.a. “saturation mode”), working as a closed switch, and non-conductive (a.k.a. “cut-off” mode), working as an open switch. The problem is that when they are in their non-conductive state they shouldn’t, in theory, allow any current to flow, but a small amount of current still flows. This current is called leakage and if you add up all leakage currents you have a significant amount of current (and thus power) being wasted and unnecessary heat being generated. One of the challenges in designing CPUs in recent years has been trying to eliminate leakage current.

Nehalem brings a power control unit inside the CPU in order to better manage power (see Figure 8). This unit reduces leakage current and also allows the new “Turbo Mode,” which we will discuss in the next page. Basically, the CPU can now have different voltages and frequencies for each core, for the units outside the cores, for the memory controller, for the cache and for the I/O units. On previous CPUs, all cores had to run at the same clock rate but on Nehalem-based CPUs each core can be programmed to run at different clock rates to save power.

Figure 8: Power control unit.

The embedded power control unit can now switch off any of the CPU cores, a feature not available on mobile Core 2 CPUs. In fact, the CPU can now put any core into the C6 (“deep power down”) power state independently of the state the remaining cores are in. This allows energy savings when you are running your PC normally but one or more cores are idle and thus can be shut down.

[nextpage title=”Turbo Mode”]

The embedded power control unit also added power sensors for each core. So the CPU knows how much power each core is consuming and how much heat is being dissipated. This allowed the addition of a “Turbo Mode” in the CPU.

Turbo Mode allows the CPU to increase the clock rate of the active core(s). This idea isn’t new and Core i7 isn’t the first CPU to use it (some Xeon CPUs based on Netburst – i.e., Pentium 4 – architecture have this feature, known as “Foxton technology”). But on the previous incarnation of this technology it could only be used when the other processing cores were idle.

This new mode is a closed-loop system. The CPU is constantly monitoring its temperature and power consumption. The CPU will overclock the active cores until it reaches its maximum allowed TDP, based on the cooling system you are using. This is configurable in the motherboard setup. For example, if you say your CPU cooler is able to dissipate 130 W, the CPU will increase (or reduce) its clock so the power currently dissipated by the CPU matches the amount of power the CPU cooler can dissipate. So if you, for example, replace the CPU stock cooler with a better cooler, you will have to enter the motherboard setup and configure the new cooler TDP (i.e., the maximum amount of thermal power it can dissipate) in order to make Turbo Mode increase the CPU clock even more.

Notice that the CPU doesn’t necessarily have to shut down unused cores to enable Turbo Mode. But since this dynamic overclocking technique is based on how much power you can still dissipate using your current CPU cooler, shutting down unused cores will reduce the CPU consumption and power dissipation and thus will allow higher overclocking.
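The control loop described above can be sketched in a few lines of Python. Everything numeric here is hypothetical: the power formula, the base multiplier of 20 and the ceiling of 24 are values we picked for illustration, and a real CPU reads its sensors instead of evaluating a formula.

```python
def estimated_power_watts(multiplier, active_cores):
    # Hypothetical model: 20 W for the units outside the cores plus
    # 1.25 W per active core for each multiplier step.
    return 20.0 + 1.25 * active_cores * multiplier


def turbo_multiplier(active_cores, cooler_tdp_watts, base=20, ceiling=24):
    """Raise the clock multiplier step by step while power stays within TDP."""
    multiplier = base
    while (multiplier < ceiling and
           estimated_power_watts(multiplier + 1, active_cores) <= cooler_tdp_watts):
        multiplier += 1
    return multiplier


print(turbo_multiplier(active_cores=4, cooler_tdp_watts=130))  # stock cooler
print(turbo_multiplier(active_cores=4, cooler_tdp_watts=150))  # better cooler
print(turbo_multiplier(active_cores=2, cooler_tdp_watts=130))  # two idle cores
```

In this toy model, declaring a more capable cooler or leaving cores idle frees up thermal headroom, so the loop settles on a higher multiplier.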

The new Turbo Mode is an extension to the SpeedStep technology, so it is viewed by the system as a SpeedStep feature.

It only works for the CPU cores, so the memory controller and the memory cache are not affected by this technology.

Apparently Turbo Mode will only be available on “Extreme Edition” models.

[nextpage title=”Other Features”]

Now that we have covered all the main features brought by the new Nehalem core, we are going to explain a little bit more about two important features: Hyper-Threading and the optimization done to deal with unaligned SSE instructions.

Hyper-Threading technology allows each CPU core to be recognized as two CPUs. Thus if you have a Core i7 with four cores, the operating system will recognize it as having eight cores. This technology is based on the fact that while a CPU core is running, certain circuits inside it are idle and can be put to use. Originally released for the Pentium 4, this is the first time the technology is available on a 6th generation Intel CPU. It is also called Simultaneous Multi-Threading (SMT). It does not provide the same performance gain as “real” CPU cores would (i.e., a CPU with 8 cores is faster than a CPU with 4 cores and HT technology, provided that both run at the same clock rate and are based on the same architecture); however, you are getting these extra “CPU cores” for free.

There are two kinds of SSE instructions that access memory: aligned and unaligned (also called misaligned). Aligned instructions require the requested data to start on a 16-byte (128-bit) address boundary, while unaligned instructions don’t. See Figure 9 for an illustration.

Figure 9: Aligned vs. unaligned instructions.

We know this may sound cryptic, so let’s explain it in plain English.

Imagine a system with dual-channel memory. The memory controller will access the memory 128 bits at a time. So the memory will be divided into 128-bit (16 bytes) blocks. So in theory the address that you request must start at the beginning of each block, so you can make a 128-bit read (or write) and get what you want at just one request. This is the aligned request shown on top of Figure 9.

But suppose that you issue a command to read data from the memory, but instead of using the first address inside the block you ask for an address in the middle of the block. Since you are requesting 128 bits of data, half of the data will be in the first block and the other half will be in the next block, as shown on the bottom of Figure 9. Since the data you requested is split across two different blocks, the memory controller will have to read two memory blocks, not just one as in the previous example. On the first read you will get back half of the data you want and on the second read you will get the remainder of the data.
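The two cases above come down to simple address arithmetic, as the Python sketch below shows. The 16-byte block size comes from the text; the function names and the sample addresses are ours.

```python
BLOCK_BYTES = 16  # 128 bits


def is_aligned(address):
    """True when the address sits at the start of a 16-byte block."""
    return address % BLOCK_BYTES == 0


def blocks_touched(address, size=BLOCK_BYTES):
    """Number of 16-byte blocks a request of `size` bytes has to read."""
    first = address // BLOCK_BYTES
    last = (address + size - 1) // BLOCK_BYTES
    return last - first + 1


for address in (0x1000, 0x1008):
    print(hex(address),
          "aligned" if is_aligned(address) else "unaligned",
          "blocks read:", blocks_touched(address))
```

The request starting at 0x1000 fits in one block, while the one starting at 0x1008 straddles two blocks and therefore needs two reads.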

Although aligned requests are more efficient they are more difficult for programmers because they need to know the memory organization. Because of that most programmers end up using only unaligned instructions.

Previous Intel CPUs were optimized for aligned instructions and unaligned ones were slower and were translated into multiple micro-ops – in other words, unaligned instructions were easier for the programmer but ran slower. Nehalem-based CPUs are optimized for unaligned instructions, achieving the same speed as aligned instructions. The slide in Figure 10 summarizes this.

Figure 10: Nehalem is optimized for unaligned SSE instructions.
