orly going thirty: Intel and AMD Processor Micro-Benchmarking

There are a large number of synthetic CPU benchmarks available - for example, GeekBench, JetStream, SPEC. The utility of these benchmarks for whole-system performance is debatable. Then we have benchmarks that attempt to measure whole-system performance; for example the time-honored Linux kernel compilation, and elaborate benchmarks such as SAP Sales and Distribution (SAP SD), otherwise known as the famous "SAPS rating."

Here I am attempting to measure some degree of whole-system performance by using ffmpeg to transcode Big Buck Bunny. This is a CPU-bound (more correctly, FPU-bound) benchmark with some memory and I/O load due to the very large size of the movie. I've used a statically linked binary that is not specifically optimized for particular processor features or for GPUs (ffmpeg can greatly speed up transcoding on Nvidia GPUs).

Here are the necessary steps to replicate my results (these are for Linux; on macOS, I used the ffmpeg distribution from brew, but the steps are otherwise identical):

wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz

tar xf ffmpeg-release-amd64-static.tar.xz

wget http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_60fps_normal.mp4

# Group the command so that the report printed by bash's time keyword is also appended to out.txt
for i in 1 2 3; do
rm -f output.mp4; { time ffmpeg-*-amd64-static/ffmpeg -threads 2 -loglevel panic -i bbb_sunflower_1080p_60fps_normal.mp4 -vcodec h264 -acodec aac -strict -2 -crf 26 output.mp4; } >>out.txt 2>&1
done
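To turn the three runs into a single number, the wall-clock times can be averaged. The following is a minimal sketch, assuming the report from bash's `time` keyword has ended up in a file and uses bash's default format (lines such as `real`, a tab, then `12m34.567s`). The function name `avg_real_seconds` is hypothetical, and other shells or `/usr/bin/time` print a different format that this does not handle.

```shell
# Average the "real" (wall-clock) times found in a file of bash `time` reports.
# Assumes bash's default format: real<TAB>MmS.SSSs
avg_real_seconds() {
  awk '$1 == "real" {
         split($2, t, /[ms]/)            # t[1] = minutes, t[2] = seconds
         total += t[1] * 60 + t[2]
         n++
       }
       END { if (n) printf "%.1f\n", total / n }' "$1"
}
```

Calling `avg_real_seconds out.txt` prints the mean elapsed time in seconds, ignoring the `user` and `sys` lines.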

Note that we are limiting the number of threads that ffmpeg can use to 2, which restricts it to 2 cores. On a machine with 4 or more cores, the encoding results are much better, but since many of my data points are from 2-core machines, we have to limit the number of threads to 2 in order to have an apples-to-apples comparison.

Note that on a 2-core hyper-threaded system, 4 threads is ideal "in theory"; however, hyper-threading mainly helps workloads whose threads spend time stalled (I/O-bound workloads, for example), and since ffmpeg is CPU-bound, a thread limit of 2 is more appropriate.

We can see from this simple test that, for the macOS trials:

there is a 37% performance improvement from Sandy Bridge to Broadwell (3 generations)
17% improvement from Broadwell to Kaby Lake (2 generations)

Over 5 generations there is a cumulative improvement of 48%.

For the AWS M instance family:

13% from Sandy Bridge (m1) to Ivy Bridge (m3) (1 generation)
14% from Ivy Bridge (m3) to Broadwell (m4) (2 generations)
13% from Broadwell (m4) to Skylake (m5) (1 generation)

Over 4 generations there is a cumulative improvement of 35%.

For the AWS C instance family:

28% from Ivy Bridge EP to Haswell (1 generation)
12% from Haswell to Skylake (2 generations)

Over 3 generations there is a cumulative improvement of 37% - but this is also partially due to differing clock speeds.
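The cumulative figures above are not simple sums of the per-step percentages. They are consistent with treating each percentage as a reduction in transcoding time and compounding the steps. This is a rough sketch of that arithmetic under that assumption; the function name `cumulative_reduction` is made up for illustration.

```shell
# Compound a series of per-step time reductions (in percent) into one
# cumulative reduction: each step multiplies the remaining time by (1 - p/100).
cumulative_reduction() {
  awk 'BEGIN {
         remaining = 1.0
         for (i = 1; i < ARGC; i++) remaining *= 1 - ARGV[i] / 100
         printf "%.0f\n", (1 - remaining) * 100
       }' "$@"
}

cumulative_reduction 37 17      # macOS: Sandy Bridge -> Broadwell -> Kaby Lake
cumulative_reduction 13 14 13   # AWS M: m1 -> m3 -> m4 -> m5
cumulative_reduction 28 12      # AWS C: Ivy Bridge EP -> Haswell -> Skylake
```

The three calls print 48, 35 and 37, matching the cumulative improvements quoted for each family.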

We normally would consider a benchmark such as SAPS to be a rigorous, whole-system benchmark because SAPS measures order line items per hour (an application metric) across infrastructure (CPU, memory, I/O), operating system, Java virtual machine, database, and ERP application. But it very much seems that SAPS is essentially a CPU benchmark.

Consider the following:

SAP certification #2015005 from 2015-03-10 (AWS c4.4xlarge, 8 cores / 16 threads) - 19,030 SAPS or 2,379 SAPS/core
SAP certification #2015006 from 2015-03-10 (AWS c4.8xlarge, 18 cores / 36 threads) - 37,950 SAPS or 2,108 SAPS/core

Here we observe almost linear scaling - as the number of cores/threads is increased from 8 to 18 (2.25X) the SAPS increases from 19,030 to 37,950 (1.99X).

If we consider the SAPS results for the previous-generation AWS C3 instance family:

SAP certification #2014041 from 2014-10-27 (AWS c3.8xlarge, 16 cores / 32 threads) - 31,830 SAPS or 1,989 SAPS/core

The C3 result is about 6% lower than the c4.8xlarge on a per-core basis. Recall that in the naive Big Buck Bunny transcoding benchmark, the C4 is about 12% faster than the C3. Thus it appears that SAPS is not purely a CPU benchmark (nor should it be, as a whole-system benchmark), but it is strongly CPU-dominated: at least half of the SAPS is directly attributable to CPU performance.
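The per-core and scaling figures can be re-derived from the three certification results quoted above. This is only an arithmetic check, using the SAPS totals and core counts as given in the text; the function name `saps_summary` is hypothetical.

```shell
# Recompute SAPS per core, the C4 scaling ratios, and the C3 vs C4 per-core gap.
saps_summary() {
  awk 'BEGIN {
         c4_4xl = 19030 / 8      # c4.4xlarge, 8 cores
         c4_8xl = 37950 / 18     # c4.8xlarge, 18 cores
         c3_8xl = 31830 / 16     # c3.8xlarge, 16 cores
         printf "SAPS/core: c4.4xlarge %.0f, c4.8xlarge %.0f, c3.8xlarge %.0f\n", c4_4xl, c4_8xl, c3_8xl
         printf "core ratio %.2fx, SAPS ratio %.2fx\n", 18 / 8, 37950 / 19030
         printf "c3.8xlarge is %.1f%% lower per core than c4.8xlarge\n", (1 - c3_8xl / c4_8xl) * 100
       }'
}

saps_summary
```

The SAPS ratio (1.99x) is about 89% of the core ratio (2.25x), which is the sense in which the scaling is close to linear.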

Naively concluding, there appears to be (on average) around a 10% performance improvement per Intel CPU generation (across tick and tock). Assuming roughly one generation per year, this means CPU performance doubles in about 7.3 years (87 months), a far cry from Moore's Law, which optimistically predicted 18 months.
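The doubling time follows from compound growth: with a constant gain g per generation, performance doubles after ln(2) / ln(1 + g) generations. The sketch below works that out, assuming a steady 10% gain and one generation per year (neither is a measured figure), with `doubling_time` as a hypothetical helper name. The quicker rule-of-69 estimate, 69 / 10, gives a slightly lower 6.9.

```shell
# Generations needed to double performance at a fixed per-generation gain:
# n = ln(2) / ln(1 + gain). One generation per year is an assumption.
doubling_time() {
  awk -v gain_pct="$1" 'BEGIN {
         gens = log(2) / log(1 + gain_pct / 100)
         printf "%.1f years (%.0f months)\n", gens, gens * 12
       }'
}

doubling_time 10
```

For a 10% gain this prints 7.3 years (87 months).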
