How to Pick a CPU When Buying Servers

As your IT team looks to upgrade your computational and enterprise storage systems, be sure you understand the choices and tradeoffs you face regarding your high-performance server CPU options. Some of our recommendations may seem, well, counterintuitive.

As 2013 rolls in and the economy stabilizes, many IT organizations are looking to upgrade their computational and storage systems. As with any IT purchasing decision, there are tradeoffs to consider and choices to make regarding hardware features and the technology available. When it comes to high-performance computing, the first step is understanding your CPU options.

Intel vs. AMD

For at least this year, the two server CPU choices remain Intel and AMD. ARM might solve some of the computational parts of some of the problems, but in 2013, ARM won't have enough I/O bandwidth to drive 10 Gigabit Ethernet ports and storage, so it is not yet a viable alternative. This might change for 2014, but it's too soon to predict because developing PCIe buses with enough performance is complex.

AMD responds to Henry Newman's CPU analysis at the end of this article.

The latest AMD CPUs have 16 cores, but only if you are running integer operations. When it comes to floating-point operations, you have only eight cores. This, combined with the fact that the latest Intel server processors can read and write data from memory significantly faster on a per-core basis than AMD processors, means that AMD processors are best suited for operations that have low computational intensity and do not require high memory bandwidth per core.

Communications Between CPU Sockets

Another place where Intel has a major advantage is communications between CPU sockets. The current crop of Intel server CPUs supports 25.6 gigabits per second (Gbps) of I/O bandwidth between CPU sockets over the QuickPath Interconnect (QPI).

[Related: Intel Aims for Faster Thunderbolt with PCI-Express 3.0]

This interconnect performance, combined with the per-socket memory bandwidth, exceeds what current AMD CPUs offer. On multi-socket machines, this has a dramatic impact on performance across all of the sockets because a process might request memory that has been allocated on another socket.

PCIe Bus Drives Intel Ahead

PCIe is where the rubber meets the road on why the latest Intel processors are far ahead of their AMD competitors. The Intel technology on the latest server CPUs runs PCIe 3 with 40 lanes on each CPU.

That means that the PCIe bus and the CPU are capable of 40Gbps of I/O bandwidth. This is far greater than the bandwidth available on AMD processors. So if you need to do a lot of network I/O or disk I/O, the Intel platform is the better choice for two reasons: PCIe 3 doubles the bus performance of PCIe 2.0, and the Intel CPU supports more PCIe lanes.

It's Intel's Year But There Are Still Issues

There is one problem with the new Intel CPUs that becomes more noticeable with quad-socket configurations. As mentioned earlier, the PCIe bus is on the CPU socket, so with four sockets you have four PCIe buses with 40 lanes each, for a total of 160 lanes of 1Gbps PCIe bandwidth. That is a lot of I/O bandwidth, but looking a bit deeper there is a problem:

  1. The QPI connection between sockets is a dual-channel link at 12.8Gbps per channel, for a total of 25.6Gbps.
  2. The PCIe bandwidth of a socket is 40 lanes at 1Gbps per lane, or 40Gbps of PCIe bandwidth to the socket.
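
The mismatch between those two figures can be tallied in a few lines. This is only a sketch that restates the numbers quoted above; the constant and function names are illustrative and do not come from any Intel tool or specification.

```python
# Figures as quoted in this article; all names here are illustrative.
PCIE_LANES_PER_SOCKET = 40
GBPS_PER_PCIE_LANE = 1.0
QPI_CHANNELS = 2
GBPS_PER_QPI_CHANNEL = 12.8


def pcie_bandwidth_gbps(sockets: int = 1) -> float:
    """Aggregate PCIe bandwidth across the given number of sockets."""
    return sockets * PCIE_LANES_PER_SOCKET * GBPS_PER_PCIE_LANE


def qpi_bandwidth_gbps() -> float:
    """Socket-to-socket bandwidth over the dual-channel QPI link."""
    return QPI_CHANNELS * GBPS_PER_QPI_CHANNEL


def qpi_shortfall_gbps() -> float:
    """PCIe bandwidth on one socket that QPI cannot carry to another socket."""
    return pcie_bandwidth_gbps(1) - qpi_bandwidth_gbps()


if __name__ == "__main__":
    print(f"PCIe, one socket:   {pcie_bandwidth_gbps(1):.1f} Gbps")
    print(f"PCIe, four sockets: {pcie_bandwidth_gbps(4):.1f} Gbps")
    print(f"QPI between sockets: {qpi_bandwidth_gbps():.1f} Gbps")
    print(f"Shortfall for remote access: {qpi_shortfall_gbps():.1f} Gbps")
```

Under these figures, a process on a remote socket can reach at most 25.6Gbps of a PCIe bus that is itself capable of 40Gbps.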

Problems quickly arise when PCIe bandwidth exceeds 25.6Gbps and the process requesting access to the PCIe bus is not on the socket that owns that bus. Some attempted workarounds lock processes to the socket with the PCIe bus that needs to be read or written, but this does not work for all applications. For example, applications with data coming in and going out of multiple locations, such as a striped file system, are affected because you cannot break up the request and move each piece to its own PCIe bus.


The real-world performance for general-purpose applications running on a four-socket system is likely about 90 percent of the QPI bandwidth between sockets (or 23Gbps) unless the data goes out on the socket with the PCIe bus. Every fourth I/O, if they are equally distributed, will run at 40Gbps, so the average performance would be (3x23Gbps + 40Gbps)/4, or about 27.25Gbps per socket for a quad-socket system.
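
That estimate can be reproduced with a short sketch. It assumes, as the paragraph above does, that I/O is spread evenly across sockets, that local access runs at the full 40Gbps, and that remote access is capped at the rounded 23Gbps figure. The function name and parameters are illustrative.

```python
def average_socket_bandwidth(sockets: int,
                             local_gbps: float = 40.0,
                             remote_gbps: float = 23.0) -> float:
    """Average per-socket I/O bandwidth with evenly distributed I/O.

    One request in every `sockets` lands on the socket that owns the
    PCIe bus and runs at local_gbps. The rest cross QPI at remote_gbps.
    """
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    remote_requests = sockets - 1
    return (local_gbps + remote_requests * remote_gbps) / sockets


if __name__ == "__main__":
    quad = average_socket_bandwidth(4)
    print(f"Quad-socket average: {quad:.2f} Gbps")
```

Changing `remote_gbps` shows how sensitive the average is to the assumed 90 percent QPI efficiency.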

This is, of course, the average based on equal distribution of the processes and I/O to the PCIe bus. A process that has PCIe processor affinity will significantly improve that average, but it is often difficult to architect a system that assigns every task to a PCIe bus and ensures that the process runs on the CPU with that bus. The probability of hitting this limitation is higher with a quad-socket system than with a dual-socket system.
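
As one illustration of what PCIe processor affinity can look like in practice, the sketch below pins a process to the CPUs that Linux reports as local to a given PCIe device. It assumes a Linux host that exposes the `local_cpulist` attribute under sysfs. The helper names and the device address in the usage note are hypothetical, and this is not presented as the approach used for the systems discussed in this article.

```python
import os
from typing import Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a Linux cpulist string such as "0-7,16-23" into CPU numbers."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_device_socket(pci_address: str, pid: int = 0) -> Set[int]:
    """Pin a process (0 means the caller) to the CPUs local to a PCIe device.

    Linux only. Reads the device's local_cpulist from sysfs.
    """
    path = f"/sys/bus/pci/devices/{pci_address}/local_cpulist"
    with open(path) as handle:
        cpus = parse_cpulist(handle.read())
    if not cpus:
        raise RuntimeError(f"no local CPUs reported for {pci_address}")
    os.sched_setaffinity(pid, cpus)
    return cpus
```

A call such as `pin_to_device_socket("0000:03:00.0")` (a made-up address) would keep the calling process on the socket that owns that device. It does not help an application whose I/O is striped across devices on several sockets.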

The diagram below shows an example of a dual-socket system that, though having the same issues, reduces the potential of hitting that architectural limitation.


My estimate for performance for a dual-socket system is (23Gbps + 40Gbps)/2, or average socket performance of 31.5Gbps. On a dual-socket system it is much easier to architect the system so that you can put the right I/O on the right CPU and achieve near-peak performance.
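
To see how much placement matters, the same estimate can be written as a blend of local and remote access. This is a rough model using the 40Gbps local and 23Gbps remote figures from this article. The function name and the fractions tried are illustrative assumptions.

```python
def blended_bandwidth(local_fraction: float,
                      local_gbps: float = 40.0,
                      remote_gbps: float = 23.0) -> float:
    """Average bandwidth when local_fraction of I/O stays on its own socket."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_gbps + (1.0 - local_fraction) * remote_gbps


if __name__ == "__main__":
    # 0.5 is the evenly distributed dual-socket case discussed above.
    for fraction in (0.5, 0.75, 1.0):
        gbps = blended_bandwidth(fraction)
        print(f"{fraction:.0%} local I/O: {gbps:.2f} Gbps")
```

With half of the I/O local, the model returns the 31.5Gbps estimate. As more I/O is placed on the socket that owns the bus, the average approaches the 40Gbps peak.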

CPU Conclusions Are Counter-Intuitive

New Intel systems have far more I/O bandwidth than previous systems and they have more than anything available from AMD. ARM is not currently competitive if you need to move lots of data in and out of the system.

The current Intel quad-socket systems will average about 27.25Gbps per socket unless significant work is done to architect the system so that processes run on the processors attached to the PCIe buses they use. The IOPS performance of the system will, of course, be higher, as IOPS is not impacted by the QPI bandwidth limitation.

Dual-socket systems make it easier to reach higher performance, and their average per-socket performance is about 4.25Gbps higher (31.5Gbps versus 27.25Gbps). So my conclusion is that you are better off using dual-socket systems rather than quad-socket systems for high I/O bandwidth requirements. This, of course, is clearly counterintuitive, but it is the best strategy given the current Intel architecture.

You will most likely see Ivy Bridge server processors in 2013, and the QPI bandwidth will go way up, so with Ivy Bridge, quad-socket systems will likely make sense. More on this after the Ivy Bridge server processors are released.

Henry Newman is CEO and CTO of Instrumental Inc., a consulting firm that specializes in high-performance computing and storage.

AMD Responds

Editor's note: There are many ways to evaluate CPUs, and while we stand by the facts reported and the conclusions drawn in this article, AMD didn't agree with our analysis. Below is the company's response:

The article titled "How to Pick a CPU When Buying Servers" represents one opinion and addresses several performance characteristics relevant only for the highest performing processors targeted at very high performing applications. Yet these represent a minority of the CPUs sold into the server market. For example, industry reports show the majority of Intel Xeon E5-2600 Series processors sold are at the mid to low-end of its SKU stack where there are decreased core counts, lower memory support speeds (lower potential memory bandwidth), slower QPI speeds, and importantly more affordable price points. In fact, price/performance is not even mentioned and surely this is an important consideration when buying servers. This means that the metrics in the article aren't the ones that drive most server CPU purchasing decisions. If cost were factored into the equation and a comparison performed on processors at the same price points, AMD has advantages in overall performance and feature sets.

In addition, there are several inaccuracies within the original article. The article states the following:

  • "The latest AMD CPUs have 16 cores, but only if you are running integer operations. When it comes to floating-point operations, you have only eight cores."

    AMD's response: The floating point unit in the latest AMD Opteron processors, called Flex FP, has the capability of operating as either eight 256-bit FPUs or sixteen 128-bit FPUs, giving technical customers more compute flexibility.

  • "...the latest Intel server processors can read and write data from memory significantly faster than AMD processors"

    AMD's response: In reality, AMD has the same number of memory channels and supports the same memory speeds as the high-end Intel Xeon E5-2600 Series processors. Based on STREAM scores, which measure memory bandwidth, Intel has a 2.5 percent performance advantage, which doesn't seem very significant. In addition, there are SKUs in the mid to low end of Intel's CPU stack that have reduced memory support speeds, which puts their memory bandwidth performance at a disadvantage. AMD Opteron 6300 Series processors support the same high speed memory for all models.

  • "AMD processors should be relegated to operations with low computational intensity that do not require high-memory bandwidth."

    AMD's response: Yet, recent purchases on the latest TOP500 list contradict this pretty clearly. There are 18 AMD Opteron-based systems in the top 100, including the #1 supercomputer, which uses AMD Opteron 6200 Series processors.