Recently I’ve decided to step out of my 32bit comfort zone and move to the x64 world, which was introduced 15 years ago (yes, I’m slow :D). I was worried about having to step into the Matrix and stuff, but the D community is really helpful; they gave me instant help even for my n00best questions.
I planned not only to try the official DMD compiler, which is fast and implements all the goodies of D, but also the LDC compiler, which is known for its strong optimization. I will test a simple vector expression and see how well the compilers can optimize it. I’ll do a human-optimized version too, to have something to compare against.
The test expression
a = a*b+c;
Where a, b, c are arrays of 32bit floats. This is a bit “memory heavy” (2 float ops and 4 IO ops per element), but it will do for testing.
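For reference, a minimal setup could look something like this (N = 1024 is my guess to reproduce the 12KB footprint mentioned below; the fill values are arbitrary):

enum N = 1024;                          // 3 arrays * 1024 floats * 4 bytes = 12KB, fits in L1
auto a = new float[N], b = new float[N], c = new float[N];
a[] = 1.0f;  b[] = 2.0f;  c[] = 3.0f;   // just to have non-NaN data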
1. simple float ops in a loop. The classic way to do this.
foreach(i; 0..a.length) a[i] = a[i]*b[i]+c[i];
2. float vector ops. The most elegant form in D.
a[] = a[]*b[] + c[];
3. float4 ops in a loop. When you think you’re helping the compiler.
import core.simd;
auto a4 = cast(float4[])a, b4 = cast(float4[])b, c4 = cast(float4[])c;  // see the cast note after variant 7
foreach(i; 0..a4.length) a4[i] = a4[i]*b4[i]+c4[i];
4. float4 vector ops. Same, but let Dlang do the indexing.
a4[] = a4[]*b4[] + c4[];
5. float4 __simd() intrinsics. You can specify exactly what you want to do in SSE, while the compiler can still inline/unroll your code. At least I think so. Sadly, it’s not implemented in LDC.
foreach(i; 0..a4.length) a4[i] = __simd(XMM.ADDPS, __simd(XMM.MULPS, a4[i], b4[i]), c4[i]);
6. inline asm (no unroll). A lot of writing. Optimized by carbon-based organic intelligence :D.
// R8=a, R9=b, R10=c, R11=count
    mov RCX, 0
L0: movaps XMM0, [R8 +RCX*4];
    mulps  XMM0, [R9 +RCX*4];
    addps  XMM0, [R10+RCX*4];
    movaps [R8 +RCX*4], XMM0;
    add RCX, 4;
    cmp RCX, R11;
    jb L0;
7. inline asm (4x unroll). 4x more to write. Interleaved to help the CPU look ahead. Also 4x more opportunities for typos. This is the best I can do. (Loop handling omitted.)
movaps XMM0, [R8 +RCX*4+00];
mulps  XMM0, [R9 +RCX*4+00];
movaps XMM1, [R8 +RCX*4+16];
addps  XMM0, [R10+RCX*4+00];
mulps  XMM1, [R9 +RCX*4+16];
movaps XMM2, [R8 +RCX*4+32];
movaps [R8 +RCX*4+00], XMM0;
addps  XMM1, [R10+RCX*4+16];
mulps  XMM2, [R9 +RCX*4+32];
movaps XMM3, [R8 +RCX*4+48];
movaps [R8 +RCX*4+16], XMM1;
addps  XMM2, [R10+RCX*4+32];
mulps  XMM3, [R9 +RCX*4+48];
movaps [R8 +RCX*4+32], XMM2;
addps  XMM3, [R10+RCX*4+48];
movaps [R8 +RCX*4+48], XMM3;
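By the way, a note on variants 3 and 4 (the cast marked in the code above): as far as I know, casting a float[] to a float4[] just reinterprets the memory, so the length must be divisible by 4, and the data should be 16-byte aligned if you want aligned loads. A little sketch of the checks I would add (the asserts are mine, not from the benchmark):

import core.simd;

auto a = new float[1024];
assert(a.length % 4 == 0);              // otherwise the array cast throws at runtime
assert((cast(size_t)a.ptr & 15) == 0);  // 16-byte alignment, needed for MOVAPS
auto a4 = cast(float4[])a;              // same memory, now seen as 256 float4 elements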
I’m using an AMD FX-8350 @ 4.1GHz, with dual-channel RAM at 2.4GHz. Only 1 core is loaded.
I’m using VisualD, compiling in RELEASE mode. I don’t know the exact command line, but I assume it adds all the necessary options to build the fastest executable.
Later I will change the array sizes to fit into different cache/RAM levels, starting with the L1 cache range, where most of the performance differences can show up.
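The measurement itself is nothing special; a minimal sketch of how such a timing loop could look (std.datetime.stopwatch and the repeat count are my choices here, not necessarily what the actual benchmark uses):

import std.stdio : writefln;
import std.datetime.stopwatch : StopWatch, AutoStart;

enum repeats = 10_000;
auto sw = StopWatch(AutoStart.yes);
foreach(r; 0..repeats)
    a[] = a[]*b[] + c[];                                         // variant 2, for example
sw.stop();
auto bytes = cast(double)repeats * a.length * float.sizeof * 4;  // 3 reads + 1 write per element
writefln("%.2f GB/s", bytes / sw.peek.total!"nsecs");            // bytes per nanosecond = GB/s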
The first test is DMD 64bit with a 12KB memory footprint (fits well in L1).
Before analyzing this, here are the LDC 64bit 12KB results:
So this is why they mention “strong optimization” at LDC, and why they don’t mention it at DMD. I’m totally amazed by LDC. Maybe the test expression is simple, but this means I will never have to think about manually optimizing simple things for SSE again in my life.
After examining the LDC-generated code: it does exactly my “inline asm (4x unrolled)” version, with a slightly different way of interleaving the instruction stream, but it still drives the single data IO port of my core at 100%. We can see that not unrolling the loop, and thus not making room for the MULPS instruction’s 3-cycle latency, means a 192->253 = ~30% slowdown. The great thing is that LDC optimizes consistently; it doesn’t matter which language features I’m using. So I can use the most compact form, the D vector operations, wherever I want.
This is the awesome 4x unrolled SSE loop that LDC automatically generates, regardless of the high level form of the algorithm:
L0: movups xmm0, [r9+rcx*4]
    movups xmm1, [r9+rcx*4+0x10]
    movups xmm2, [r10+rcx*4]
    mulps  xmm2, xmm0
    movups xmm0, [r10+rcx*4+0x10]
    mulps  xmm0, xmm1
    movups xmm1, [r10+rcx*4+0x20]
    movups xmm3, [r10+rcx*4+0x30]
    movups xmm4, [rdx+rcx*4]
    addps  xmm4, xmm2
    movups xmm2, [rdx+rcx*4+0x10]
    addps  xmm2, xmm0
    movups [r10+rcx*4], xmm4
    movups [r10+rcx*4+0x10], xmm2
    movups xmm0, [r9+rcx*4+0x20]
    mulps  xmm0, xmm1
    movups xmm1, [r9+rcx*4+0x30]
    mulps  xmm1, xmm3
    movups xmm2, [rdx+rcx*4+0x20]
    addps  xmm2, xmm0
    movups xmm0, [rdx+rcx*4+0x30]
    addps  xmm0, xmm1
    movups [r10+rcx*4+0x20], xmm2
    movups [r10+rcx*4+0x30], xmm0
    add rcx, 0x10
    add rsi, 0x2
    jnz L0
Note that it loads every operand as if it were unaligned. This could be a problem, but not in this example, because the bottleneck is IO, and there is always a free execution unit available to handle the extra load instructions.
The performance of DMD is not that good. But I won’t forget about it, because it has lightning fast compilation speed, making development with instant feedback easy. (I’m coming from Delphi, where 2-3 seconds of build time is enough for 150K LOC, and only 0.2s is needed if you modify a top level module. I know what that feels like. 😉)
So the worst case in DMD is doing things the traditional way, with a loop and indexing everything. There is no auto-vectorization; it translates to simple scalar SSE instructions (MULSS/ADDSS, which work on a single float):
L0: mov rax, [rbp-0x18]
    cmp rax, [rbp-0x10]
    jae 0x7ff7f0c715fa
    mov rax, [rbp-0x18]
    mov [rbp-0x8], rax
    mov rax, [rbp-0x18]
    shl rax, 0x2
    mov rcx, [rbp+0x20]
    mov rcx, [rcx+0x8]
    lea rcx, [rax+rcx]
    movss xmm0, [rcx]
    mov rdx, [rbp+0x18]
    mov rdx, [rdx+0x8]
    movss xmm1, [rax+rdx]
    mulss xmm0, xmm1
    mov rdx, [rbp+0x10]
    mov rdx, [rdx+0x8]
    movss xmm1, [rax+rdx]
    addss xmm0, xmm1
    movss [rcx], xmm0
    inc qword [rbp-0x18]
    jmp L0
Well, this looks about the same size as the previous LDC code snippet, but it does only 1/4 of the job, plus a lot of other things that aren’t essential to the job.
It doesn’t do any register optimization; it holds everything, even the loop counter, on the stack. In 64bit it would have plenty of general purpose registers at hand (like 12 of them), but even 32bit has enough for this purpose.
It reads the pointers of the slices from memory every time, and it also keeps every slice as a reference, because they’re input parameters. (Now this is a difference between DMD and LDC: DMD passes slices by reference (according to Microsoft’s 64bit calling convention), while LDC puts the whole 16-byte slice struct on the stack.)
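For reference, this is roughly what that 16-byte slice struct looks like (my own illustration, the real layout is built into the compiler); it also explains the +0x8 pointer loads in the disassembly above:

// what a float[] slice boils down to on 64bit
struct FloatSlice
{
    size_t length;   // at offset 0
    float* ptr;      // at offset 8 -> this is what the [reg+0x8] loads fetch
}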
There is one optimization: it calculates i*sizeof(float) (What did I say?!!! float.sizeof, I mean :D) from the loop index, and it uses it 3 times. It does the SHL by 2 manually, although the *4 scale could be specified at no cost in the Intel SIB byte.
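Just to illustrate the SIB remark: the *4 scaling could be encoded directly into the addressing mode instead of the separate shift, something like this (hand-written illustration, not actual compiler output):

// what DMD emits:
shl rax, 0x2              // byte offset = float index * 4
movss xmm0, [rax+rdx]
// what the SIB byte would allow for free:
movss xmm0, [rdx+rax*4]   // the *4 scale costs nothing extra to encode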
So basically this looks like DEBUG code, but I’m 100% sure I chose RELEASE in DMD 😀
Let’s see DMD’s best optimization, which is fortunately generated for the most beautiful code form: the vector operations.
L0: mov rcx, r11             //r11 is the float index, rcx is the byte index
                             //we only need one of them in the loop
    shl rcx, 0x2             //can do with SIB
    mov r8, [rbx+0x8]
    movups xmm0, [rcx+r8]    //unaligned reads, still an opportunity for manual opt
    mov r9, [rax+0x8]
    movups xmm1, [rcx+r9]
    mulps xmm0, xmm1         //it does the actual vectorization, but not the 4x unroll
    mov r10, [rdi+0x8]
    movups xmm2, [rcx+r10]
    addps xmm0, xmm2
    mov r8, [rsi+0x8]
    movups [rcx+r8], xmm0
    add r11, 0x4
    dec rdx
    test rdx, rdx            //dec just updated the zero flag already. Just sayin' :D
    jnz 0x7ff7ce6b1ec8
So it did well, with a bit of additional math.
Tests including bigger memory footprints
I’ve also tested an L2-sized footprint (1.5MB) and a RAM-sized one (48MB). Here are the results on a chart:
Nothing unusual there: the suboptimal parts of the DMD SIMD variants are smoothed out as the memory footprint rises, because waiting for uncached memory becomes the bottleneck there.
And LDC’s performance is just amazing (just like me lol,… kidding :D). In a future post I will investigate the compile-time cost of generating such well-optimized code…