Editorial - Revisiting Hyper-Threading for Nehalem

Nehalem re-incorporates a form of Simultaneous Multithreading (SMT) that first appeared on Pentium 4 processors several years ago, branded as "Hyper-Threading" (HTT). HTT has been conspicuously absent from the Core2 generation of processors. One theory is that this resulted from Intel's decision to let the Haifa team develop the Core2 based on their great success with the Pentium M, while the preceding Netburst architecture was swept aside (and HTT along with it).

Since the concept of HTT may be new to many or even foggy for those who once owned a P4 (or still do), I thought it might be prudent to write a brief editorial on HTT and how it works with current operating systems.

By the way, while it's not clear what the brand name for Nehalem's SMT implementation will be, I will stick with the Hyper-Threading (HTT) nomenclature throughout this article.

What is Hyper-Threading?

Hyper-Threading is an Intel-patented technique for simultaneous multithreading. It duplicates some key elements of the core, specifically those that store the execution state, but not those that actually do the execution. It thereby allows two threads to be processed concurrently on a single core, relying on the fact that a given execution unit is not always fully utilized. For example, an executing thread may occasionally sit idle waiting for data from main memory after a cache miss, and another thread can use those idle periods to execute.

However, it's important to note that Hyper-Threading is not like having two full cores. Many elements, including the L1 and L2 caches and the execution engine itself, are shared between the two competing threads.
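To make the idle-cycle idea concrete, here is a toy model in Python. It is only a sketch under heavy assumptions: each thread is a string in which "X" is a cycle that needs the shared execution unit and "." is a stall cycle (say, waiting on memory), and the core can execute for only one thread per cycle. The names run_serial and run_smt are mine, and a real core is far more complex than this.

```python
def run_serial(thread_a, thread_b):
    """Cycles needed when the threads run one after the other."""
    return len(thread_a) + len(thread_b)


def run_smt(thread_a, thread_b):
    """Cycles needed when both threads share one execution unit.

    Toy assumptions: a '.' step (stall) advances without the unit,
    an 'X' step needs the unit, and the unit serves at most one
    thread per cycle. Priority alternates between the two threads.
    """
    threads = [thread_a, thread_b]
    pos = [0, 0]          # next step to run in each thread
    prefer = 0            # which thread gets first pick this cycle
    cycles = 0
    while pos[0] < len(threads[0]) or pos[1] < len(threads[1]):
        unit_used = False
        for i in (prefer, 1 - prefer):
            if pos[i] >= len(threads[i]):
                continue                  # this thread has finished
            if threads[i][pos[i]] == ".":
                pos[i] += 1               # stall cycle passes on its own
            elif not unit_used:
                unit_used = True          # claim the shared unit
                pos[i] += 1
            # otherwise: wants the unit but it is taken, so it waits
        prefer = 1 - prefer
        cycles += 1
    return cycles


if __name__ == "__main__":
    print(run_serial("X..X", "X..X"), run_smt("X..X", "X..X"))
    print(run_serial("XXXX", "XXXX"), run_smt("XXXX", "XXXX"))
```

In this model two stall-heavy threads ("X..X") finish in 5 cycles together instead of 8 back to back, while two threads that never stall ("XXXX") still take 8 cycles. That is the shared-execution-unit caveat in miniature.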

Here is a list of the core elements that are duplicated, partitioned, and shared between the two threads on a core:

Duplicated:
  • Register renaming logic
  • Instruction Pointer
  • ITLB
  • Return stack predictor
  • Various other architectural registers
Partitioned:
  • Re-order buffer (ROB)
  • Load/Store buffer
  • Various queues, like the scheduling queues, uop queue, etc.
Shared:
  • Caches: trace cache, L1, L2
  • Microarchitectural registers
  • Execution Unit
According to Intel, the duplicated resources account for only a 5% increase in transistor count (and TDP).

HTT Performance:

In general, some applications can gain as much as 30% from HTT, making it a very economical performance option.

Knowing which resources are duplicated and which are shared helps us understand where performance issues may arise and where gains can be had. In particular, since the L1 and L2 caches are shared by both HTT threads, cache contention and thrashing can develop as the threads compete for these important resources. Conversely, multi-threaded applications with producer and consumer threads, where one thread produces data that another consumes, can benefit significantly from Hyper-Threading: both threads work on the same data set, so sharing the cache is a real advantage.
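For readers who have not met the producer/consumer pattern, here is a minimal Python sketch of its shape. The function name run_pipeline and the squared-number workload are placeholders of mine. Note that this only illustrates the structure: CPython's global interpreter lock keeps these two threads from executing Python code at the same instant, so it will not demonstrate an HTT speedup by itself.

```python
import queue
import threading


def run_pipeline(n_items):
    """One thread produces values, a second thread consumes them."""
    handoff = queue.Queue(maxsize=64)   # bounded buffer between the threads
    consumed = []

    def producer():
        for i in range(n_items):
            handoff.put(i * i)          # placeholder "work"
        handoff.put(None)               # sentinel: no more data

    def consumer():
        while True:
            item = handoff.get()
            if item is None:
                break
            consumed.append(item)

    workers = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return sum(consumed)


if __name__ == "__main__":
    print(run_pipeline(10))
```

The relevant point for HTT is that the data handed over by the producer is touched again almost immediately by the consumer, so two such threads on sibling logical processors tend to help each other's cache hits rather than evict each other's data.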

A revealing set of benchmarks was conducted by FiringSquad on a set of P4 processors in 2005...

The processors tested above include a Pentium 4 840 EE (2 cores with HTT), 840 (2 cores), and 540 (1 core with HTT).

To summarize the results:

  • The HTT-enabled single-core 540 shows an 18% improvement with multi-threading.
  • The dual-core 840 shows a 67% improvement with multi-threading.
  • An added core performs 44% better than an extra thread on the same core.
  • The game appears to be optimized for two threads, since two cores with HTT offer no further benefit.

Therefore, while it's clear that an additional core is the best way to improve performance in multi-threaded applications, noticeable gains can be had from HTT technology... even with a game.

The bottom line on performance is that the impact of HTT depends heavily on the particular nature of the processor load and, to some extent, on how the various threads are scheduled by the operating system...

Operating System Support:

HTT technology is abstracted so that the operating system sees a series of logical processors. Each HTT-enabled physical core appears to the operating system as a pair of logical processors, each of which can have threads assigned by the OS, allowing two concurrent threads of execution on a single physical core. Logic in the core manages which of the two concurrent threads executes at any given time (based on the idle cycles), but it is the operating system that assigns the threads to their respective logical processors.

An operating system that is HTT-aware can optimize performance by assigning new threads to logical processors on inactive cores rather than to logical processors on a core that already has an active thread. Conversely, an operating system that is not HTT-aware can introduce performance penalties by loading a single physical core with concurrent threads while other cores sit idle. For example, consider the Bloomfield CPU represented in the illustration below, which has 8 logical processors on 4 physical cores, with two active threads.

A shaded logical processor indicates an active thread; a non-shaded processor is inactive. Assuming no affinity has been set, the operating system is free to schedule the next available thread on any of the inactive logical processors. In an operating system that is not HTT-aware, the next thread might be randomly assigned to logical processor 2 or 5, either of which would incur a performance penalty: the new thread would compete for shared processor resources with a concurrent thread on the same core while the last two cores sit idle.

An HTT aware operating system will optimize performance by dispatching new threads onto inactive physical cores whenever possible (in this case logical processor 3, 4, 7, or 8).
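The placement rule can be sketched in a few lines of Python. This is a toy model, not how any real scheduler is implemented. It assumes one numbering that is consistent with the example above, in which logical processors n and n + 4 share a physical core and the two active threads sit on logical processors 1 and 6. The function names are hypothetical.

```python
CORE_COUNT = 4          # physical cores in the example
LOGICAL_COUNT = 8       # two logical processors per core


def core_of(logical):
    """Physical core (1-4) for a logical processor (1-8).

    Assumed numbering: logical processors n and n + 4 are siblings.
    """
    return (logical - 1) % CORE_COUNT + 1


def htt_aware_candidates(active):
    """Logical processors an HTT-aware scheduler would prefer next."""
    busy_cores = {core_of(lp) for lp in active}
    idle = [lp for lp in range(1, LOGICAL_COUNT + 1) if lp not in active]
    on_idle_cores = [lp for lp in idle if core_of(lp) not in busy_cores]
    # Fall back to any idle logical processor once every core is busy.
    return on_idle_cores or idle


def naive_candidates(active):
    """A scheduler that is not HTT-aware sees eight equal processors."""
    return [lp for lp in range(1, LOGICAL_COUNT + 1) if lp not in active]


if __name__ == "__main__":
    print(naive_candidates({1, 6}))
    print(htt_aware_candidates({1, 6}))
```

With threads active on 1 and 6, the naive list still contains 2 and 5 (the siblings of the busy logical processors), while the HTT-aware list narrows to 3, 4, 7, and 8.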

Here's a summary of recent Microsoft Operating Systems and their awareness of HTT:

  • Windows 2000: No HTT Awareness
  • Windows XP: HTT Aware
  • Server 2003: HTT Aware + API for application awareness
  • Windows Vista: HTT Aware + API for application awareness
  • Server 2008: HTT Aware + API for application awareness

(A full list of HTT aware operating systems is provided by Intel.)

The most recent Windows operating systems are not only HTT aware but expose an API that applications can use to optimize threading based on the logical processor configuration on the machine. This is advantageous in that it allows applications to determine the ideal thread distribution to maximize performance based on the logical/physical processor implementation at run-time.
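As a rough illustration of that run-time decision, here is a Python sketch. The policy and the name choose_worker_count are hypothetical, not something prescribed by Microsoft or Intel. Python's os.cpu_count() reports logical processors only, so the sketch assumes two logical processors per core to estimate the physical core count; a native Windows application would, as far as I know, get the real topology from a call such as GetLogicalProcessorInformation.

```python
import os


def choose_worker_count(logical, threads_per_core=2, share_data=False):
    """Pick a worker-thread count from the processor layout.

    Hypothetical policy: threads that share a data set can use every
    logical processor, while independent cache-hungry threads get one
    worker per physical core. threads_per_core is an assumption here,
    because the standard library does not report physical cores.
    """
    physical = max(1, logical // threads_per_core)
    return logical if share_data else physical


if __name__ == "__main__":
    logical = os.cpu_count() or 1       # cpu_count() may return None
    print("logical processors:", logical)
    print("independent workers:", choose_worker_count(logical))
    print("cooperating workers:",
          choose_worker_count(logical, share_data=True))
```

On the 8-logical-processor Bloomfield example this policy would start 4 independent workers, or 8 workers when the threads cooperate on shared data.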

It's also worth pointing out that users can override the operating system's thread scheduling by forcing affinity at the process level using Task Manager, as shown below.
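Behind the Task Manager checkboxes, process affinity is expressed as a bitmask in which bit n stands for logical processor n, counted from zero. The Python helper below is a minimal sketch of that arithmetic; affinity_mask is my own name. Applying the mask is platform-specific (SetProcessAffinityMask on Windows, or os.sched_setaffinity on Linux, which takes a set of CPU numbers rather than a mask) and is left out here.

```python
def affinity_mask(logical_processors):
    """Build an affinity bitmask from zero-based processor numbers."""
    mask = 0
    for index in logical_processors:
        if index < 0:
            raise ValueError("processor numbers start at 0")
        mask |= 1 << index      # set the bit for this processor
    return mask


if __name__ == "__main__":
    # Zero-based processors 2, 3, 6 and 7.
    print(hex(affinity_mask([2, 3, 6, 7])))
```

For zero-based processors 2, 3, 6, and 7 the mask is 0xcc. Keep in mind that Task Manager numbers processors from CPU 0, so these are not the same labels as the one-based numbers used in the scheduling example.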



Summary

In summary, HTT can provide a noticeable increase in performance at little additional cost in die area and power consumption. The relatively small duplication of some core execution elements has been shown to provide tangible benefits to a variety of tasks, including games and desktop media applications. Benefits of 10-30% can be expected from applications run on HTT-aware operating systems such as XP, Vista, and recent Server flavors.

With Nehalem reintroducing a form of HTT, Vista now supporting an HTT-aware API that applications can use to optimize threading, and an increasing emphasis on parallel programming, this next generation of hardware and software promises to be very interesting for performance enthusiasts.

Further Reading:

Hyper-threading (Wikipedia)
Windows Support for Hyper-Threading Technology (Microsoft)
Introduction to Multithreading, Superthreading and Hyperthreading (Ars Technica)
Hyper-threading Quake 4 benchmarks (FiringSquad)
