Compiling speed of DLang compilers.

I wanted to see what is the best way to compile small dlang projects, so I made a benchmark for it.

Here are the things checked:

  • Compiler used: DMD or LDC
  • Release(with optimization) or debug
  • Target: win32 or win64
  • Build type: Incremental(all modules are compiled in parallel, then linking) or one step build it with the compiler.
  • module count: 4 vs 13(more opportunity to do it in parallel)

The system: AMD FX-8350 4.2GHz, 2.4GHz ram, SSD

And the results:

compiler_speed_test1.png

For only 4 modules the fastest is DMD single pass compilation but it without optimization. Single pass compilation for LDC is a bit slower, and it is much slower when optimization is turned on, but with incremental compilation it is only 1.5x slower than DMD_DBG. So I think the winner here is LDC.

compiler_speed_test2.png

For a larger number of moduiles(13) incremental compilation helps a lot for LDC, but not that much for DMD.

The incremental builds in these tests was used for compiling all the modules. In the next test I measured what’s the case when incremental compilation is only recompiling 1 modified module and it uses the precompiled object files for the other modules. This is more like a real life situation.

compiler_speed_test3.png

It takes only 3-4 seconds to launch the exe file and start testing it. In the C++ world it is usual to spend 20 seconds on compilation plus 5-10 seconds to launch a debugger this time is so long, sometimes I totally forget why I launched the exe in the first place lol. And also it is important to test optimized release code in an instant because I can have the same experience with it, like the users get.

There is an optimization opportunity when using the MS Linker (from VS2017). I currently use the msvcenv.bat file to setup the environment variables for the MS Linker. It would be better to cache those environment variables to get 0.5sec better times for all LDC builds and for the 64bit DMD builds too.

Advertisements
Posted in Uncategorized | Tagged | Leave a comment

My first impressions of 64 bit DLang compilers.

Recently I’ve decided to step out of my 32bit comfort zone and move to the x64 world which was introduced 15 years ago (Yes, I’m slow :D). I was worried about having to step into the Matrix and stuff, but the D community is really helpful, they gave me instant help even for my n00best questions.

I planned not only to try the official DMD compiler which is fast and implements all the goodies of D, but also the LDC compiler which has known by having strong optimization. I will test a simple vector expression and see how the compilers can optimize that. I’ll do a human-optimized version too, to be able to compare to something.

The test expression

a = a*b+c; 

Where a,b,c are arrays of 32bit floats. This is a bit “memory heavy” (2 float ops and 4 io ops), but will do for testing.

Various implementations

1. simple float ops in a loop. The classic way to do this.

foreach(i; 0..a.length) a[i] = a[i]*b[i]+c[i];

2. float[] vector ops. The most elegant form in D.

a[] = a[]*b[]+c[];

3. float4 ops in a loop. When you think you’re helping the compiler.

import core.simd;
auto a4 = cast(float4[])a, 
     b4 = cast(float4[])b, 
     c4 = cast(float4[])c;
foreach(i; 0..a4.length) 
  a4[i] = a4[i]*b4[i]+c4[i];

4. float4[] vector ops. Same, but let Dlang do the indexing.

a4[] = a4[]*b4[]+c4[];

5. float4 __simd() intrinsics. You can specify what exactly you want to do in SSE, while the compiler can still inline/unroll your code. At least I think so. Sadly, not implemented in LDC.

foreach(i; 0..a4.length) 
  a4[i] = __simd(XMM.ADDPS, __simd(XMM.MULPS, a4[i], b4[i]), c4[i]);

6. inline asm (no unroll). A lot of writing. Optimized by carbon based organic intelligence :D.

//R8=a, R9=b, R10=c, R11=count
mov RCX, 0
 L0: movaps XMM0, [R8 +RCX*4]; 
     mulps  XMM0, [R9 +RCX*4]; 
     addps  XMM0, [R10+RCX*4]; 
     movaps [R8 +RCX*4], XMM0; 
add RCX, 4; 
cmp RCX, R11; 
jb L0;

7. inline asm (4x unroll). 4x more to write. Interleaved to help the CPU looking ahead. Also 4x more opportunities for typing errors. This is the best I can do. (Loop handling omitted.)

movaps XMM0, [R8 +RCX*4+00];   
mulps  XMM0, [R9 +RCX*4+00];  movaps XMM1, [R8 +RCX*4+16]; 
addps  XMM0, [R10+RCX*4+00];  mulps  XMM1, [R9 +RCX*4+16];  movaps XMM2, [R8 +RCX*4+32];
movaps [R8 +RCX*4+00], XMM0;  addps  XMM1, [R10+RCX*4+16];  mulps  XMM2, [R9 +RCX*4+32];  movaps XMM3, [R8 +RCX*4+48]; 
                              movaps [R8 +RCX*4+16], XMM1;  addps  XMM2, [R10+RCX*4+32];  mulps  XMM3, [R9 +RCX*4+48];      
                                                            movaps [R8 +RCX*4+32], XMM2;  addps  XMM3, [R10+RCX*4+48];      
                                                                                          movaps [R8 +RCX*4+48], XMM3;

Test conditions

I’m using an AMD FX-8350 @4.1GHz, dual ch ram at 2.4GHz. Only 1 core loaded.
I’m using VisualD, compiling in RELEASE mode. I don’t know the exact command line, but I guess it will add all the necessary options to build the fastest executable.
Later I will change the array sizes to fit in different cache/ram levels. Starting with L1 cache range where most of the performance differences can show up.

The first test is DMD 64bit with 12KB memory footprint (fits well in L1).

dmdtest.png

Before analyzing this, here are the LDC 64bit 12KB results:

ldctest.png

So this is why they mention “strong optimization” at LDC and they doesn’t mention it at DMD. I’m totally amazed by LDC. Maybe the test expression is simple, but this means I will never have to think about manually optimizing simple things for SSE in the rest of my life.

After examining the LDC generated code: It exactly does my version of “inline asm (4x unrolled)” with a bit different way of interleaving the instruction stream, but it still driver the one data io port of my core at 100%. We can see that when not unrolling the loop, thus not making space for the MULPS instruction’s 3 cycle latency means 192->253 = 30% slowdown. The great thing is that LDC optimizes consequently, it doesn’t matter what language features I’m using. So I can use the most compact way, the D vector[] operations wherever I want.

This is the awesome 4x SSE unrolled loop, that LDC automatically generates, regardles off the high level form of the algorithm:

L0: movups xmm0, [r9+rcx*4] 
movups xmm1, [r9+rcx*4+0x10] 
movups xmm2, [r10+rcx*4] 
mulps xmm2, xmm0 
movups xmm0, [r10+rcx*4+0x10] 
mulps xmm0, xmm1 
movups xmm1, [r10+rcx*4+0x20] 
movups xmm3, [r10+rcx*4+0x30] 
movups xmm4, [rdx+rcx*4] 
addps xmm4, xmm2 
movups xmm2, [rdx+rcx*4+0x10] 
addps xmm2, xmm0 
movups [r10+rcx*4], xmm4 
movups [r10+rcx*4+0x10], xmm2 
movups xmm0, [r9+rcx*4+0x20] 
mulps xmm0, xmm1 
movups xmm1, [r9+rcx*4+0x30] 
mulps xmm1, xmm3 
movups xmm2, [rdx+rcx*4+0x20] 
addps xmm2, xmm0 
movups xmm0, [rdx+rcx*4+0x30] 
addps xmm0, xmm1 
movups [r10+rcx*4+0x20], xmm2 
movups [r10+rcx*4+0x30], xmm0 
add rcx, 0x10 
add rsi, 0x2 
jnz L0

Note that it loads every operand as they were unaligned. This could be a problem, but not at this example, because the bottleneck is IO, and there is always present a free execution unit that can handle the unnecessary load instructions.

The performance of DMD is not that good. But I will not forget about it because it has lightning fast compilation speeds, making development with instant feedback easy. (I’m coming from Delphi. Where 2-3 seconds of build-time is enough for 150K LOC. 0.2sec needed if you modify only a top level module. I know what it feels like 😉
So the worst case in DMD is when doing things traditionally by making a loop and indexing everything. There is no auto vectorization, it translates to simple single operand SSE instructions:

L0: mov rax, [rbp-0x18] 
cmp rax, [rbp-0x10] 
jae 0x7ff7f0c715fa 
mov rax, [rbp-0x18] 
mov [rbp-0x8], rax 
mov rax, [rbp-0x18] 
shl rax, 0x2 
mov rcx, [rbp+0x20] 
mov rcx, [rcx+0x8] 
lea rcx, [rax+rcx] 
movss xmm0, [rcx] 
mov rdx, [rbp+0x18] 
mov rdx, [rdx+0x8] 
movss xmm1, [rax+rdx] 
mulss xmm0, xmm1 
mov rdx, [rbp+0x10] 
mov rdx, [rdx+0x8] 
movss xmm1, [rax+rdx] 
addss xmm0, xmm1 
movss [rcx], xmm0 
inc qword [rbp-0x18]
jmp L0

Well this looks like the same size as the previous LDC code snippet, but it does only 1/4 of the job, and a lot else of what is not essential for the job.

It doesn’t use register optimization, holds everything, even the loop counter on the stack. In 64bit, it would have a lot (like 12) general registers at hand, but there are sufficient amount of them even on 32bit too for this purpose.
It reads the ptr-s of the slices every time from memory, and storing every slices as reference too, because they’re input parameters. (Now this is a difference between DMD and LDC: DMD passes slices as reference (according to Microsoft’s 64bit spec), and LDC is putting the whole 16byte slice struct on the stack.)
There is an optimization: It calculates i*sizeof(float) (What did I say?!!! float.sizeof, I mean :D) from the loop index, and it uses it 3 times. It does the SHL2 manually, although it would be specified with no cost in the intel SIB byte.
So basically this looks like DEBUG code, but I’m 100% sure, I choosed RELEASE DMD 😀

Let’s see DMD’s best optimization which fortunately generated for the most beautiful code format: the vector[] operations.

L0: mov rcx, r11       //r11 is the float[] index, rcx is the byte index
                       //we only need one of them in the loop
shl rcx, 0x2           //Can do with SIB
mov r8, [rbx+0x8] 
movups xmm0, [rcx+r8]  //unaligned reads, still an opportunity for manual opt
mov r9, [rax+0x8] 
movups xmm1, [rcx+r9] 
mulps xmm0, xmm1       //it does the actual vectorization, but not the 4x unroll
mov r10, [rdi+0x8] 
movups xmm2, [rcx+r10] 
addps xmm0, xmm2 
mov r8, [rsi+0x8] 
movups [rcx+r8], xmm0 
add r11, 0x4 
dec rdx 
test rdx, rdx          //dec just updated the zero flag already. Just sayin' :D
jnz 0x7ff7ce6b1ec8

So it did well, with a bit of additional math.

Tests including bigger memory footprints

I’ve also tested L2 size (1.5MB footprint), and on RAM size (48MB). Here are the results on a chart:

dmdldcchart.png

Nothing unusual there, the unoptimal things in the DMD SIMD variants are smoothed out as the memory footprint rises, because waiting for uncached memory became the bottleneck there.

And LDC’s performance is just amazing (just as me lol,… kidding :D). In a future post I will investigate what is the compile time cost generating such greatly optimized code…

 

Posted in SSE | Tagged , , , , | Leave a comment

Fast SSE 3×3 median filter for RGB24 images

Optimization opportunities:

After a bit of googling, a good solution to the problem would be a partial sorting network which only uses the median value.
[1] -> https://software.intel.com/sites/default/files/m/d/4/1/d/8/MedianFilter.pdf

Another way of optimization is to reuse intermediate data by doing not only one but more median kernels at a time.
[2] -> https://www.cg.tuwien.ac.at/research/publications/1994/Kopp-1994-EMF/TR-186-2-94-18Paper.pdf

The paper said that they reached 3x speedup by found 2 kernels at once, and reusing data while moving the kernel forward by 2 pixels.

Also SSE can really speed up this because it has 16x parallel min/max functions, I think the biggest speedup will come from this, so all the algorithm must be twisted in a way to support SSE as possible as could. This algo has a lot of computation and a moderate amount of memory IO, this is good for SSE.

Finally there’s multithreading, but it’s so straightforward, I don’t even will implement it.

Let’s plan the algorithm

First we need some basic sort network to start from. We need a 9 element sorting network for this. I choose the one that groups the data at the beginning into 3×3 element groups. There 3 element groups will represent the column of out moving 3×3 kernel window.

Here’s an online generator -> http://pages.ripco.net/~jgamble/nw.html
The variant called “Best” seems like to suit our needs well. -> http://jgamble.ripco.net/cgi-bin/nw.cgi?inputs=9&algorithm=best&output=svg

sortnet1.png

When visualizing data paths of the median value, it turned out that 5 min/max operations aren’t used at all and removing those we can reduce the operation count from 25 down to 20. This trick is used in the Intel Optimization paper [2].

After reorganizing a bit we can see the 3 element sort ranges at the beginning, these operations will be reused by adjacent kernels. Also there is a noticeable 6 element group. Also after making an error checker, it turned out that one additional operation can be eliminated.

sortnet2.png

Now let’s expand it to a 2 kernels, where we have 4 input columns, 12 values in total, and we need to pull 2 medians.

sortnet3.png

The workflow:

  • At every iteration, we move the 2 kernels forward by 2 pixels: copying 2×6 sorted values (light green) and sorting 2*3 incoming ones: 6 minmax ops 6 moves
  • Then the purple rectangle comes, this is a preparation step. 2 minmax ops
  • The next orange stage is to save 5 values for later use: 5 moves
  • And finally the two median calculations: 2*8 minmax ops, 5 moves (recovering saved orange values)

Total min/max ops: 6+2+2*8=24, total moves=5+5+6=16
That’s 12 minmax per one median value, it’s 60% more efficient than the single kernel partial sorting network.
There were attempts to optimize this further in 2D [1], but the gain was so small compared to the 1D version that I don’t want to make the algorithm bigger.

SSE estimation: In the worst case, a minmax op will use 3 cycle and a move will use 1. So the total cycle count is 12*3+8= 44 cycles for each median calculation. By using SIMD with byte data we’ll have 16 lanes, thus the cycle estimation is around 2.8 cycles for each median calculation. (not including fetch and store phases, those will be the biggest problem in this median filter because of 24bit input data, it will be discussed later.)

The two building blocks we need are these:

MinMax: Will use a temporary register and an sse min and max instruction. The result will be always in a register, one of the inputs can come from a memory location. There will many combinations of inputs used (reg/mem), also in contrast to the non-partial sorting networks here we gonna have unused outputs of either the min or the max functions. I will not try to code these manually, I will use an automatic instruction flow generator tool for this, that will handle register automation too, I only have to watch out for the proper instruction order.

Move: It will be implicit as I’m using the Instruction flow generator thingie.

Fetch and Store

This will be quite32 difficult because of the nature of RGB24.
First have to decide in which direction we want to process the data in the 2D bitmap.

sortnet4.png

There are two variations(for the above scalar example):

  • Left to right: This is the natural order of the system memory, so it would be so good for the cache.
  • Top to bottom: This is not so good for cache, would require additional prefetching, but at certain bitmap with it could still be able make cache useless.

Let’s widen it in order to fulfill an XMM register:

sortnet5.png

Now the register utilization is much better.
From the two variants I immediately throw de [left to right] on because it needs so many little reads, and that means a lot of shuffling in order to fill the register.
With the top down variant we could be able to read the memory in large chunks, and no complex shuffling is needed, the already loaded data can be reused by shifting. Additionally on recent systems that have the SSSE3 extension it would be even more effective by using this ‘super weapon’ called PALIGNR and even more complex shuffling can be achieved with the PSHUFB byte-permute instruction. But first, I’ll do it only with simple memory reads, and simple SSE instructions.

By thinking a bit more about the problem, that 93% register utilization can be simply extended to 100% because the only important thing in the current problem is not to separate RGB triplets properly, but only to pair the color components with their left neighbors. They are exactly 3 pixels away when using RGB, 4 pixels away when using RGBA and 1 pixels away with GrayScale. In other words it’s irrelevant if we group the colors like RGB, RGB, RGB, RGB or RGBR, GBRG, BRGB.

One last thing remains to deal with is the unnatural pattern of memory access. I’m thinking about 2 ways to solve it:

  • It’s easy to do but can’t solve cache conflicts.
  • Moving our window down like 10 times, then go up and right and do another run downwards adjacent to the previous run. This may help a lot with very large bitmaps. And still can be combined with prefetching by no cost.

When I implemented the algo, it turned out that we need both of the above. Prefetch is needed because the non-normal access pattern, and partitioning is needed to be able to stay inside the L3 cache.

Implementation details

It wasn’t so hard to implement this, I used the plan, rotated it by 90 degrees and designed it in my sse programming tool. Only 4 SSE instructions were used: MOVUPS for reads,  MOVAPS for stores and to move reg/reg data, and PMINUB/PMAXUB. On my FX-8350 cpu (AMD Piledriver architecture) it is possible to execute 2 MIN/MAX at one cycle, and those have only 2 cycle latencies, so this particular algorithm can run quite fast on it.

So here’s what the first design looked like:

median3x3_sse_plan.png

After some reordering and simplification it was possible to implement the whole algo on the 8 xmm registers and without any temp registers. But later I’ve noticed, that it’s not needed that much because I think the bottleneck is the rather L3 cache because of the unusual memory access pattern. Anyways, here’s what the cleaned up version looks like (note that, now the 16 SSE byte lane’s are visualized too):

median3x3_sse_plan2.png

The results:

lenna_median3x3.png

Personally I was amazed how fast it became.

The test config: AMD FX-8350 4.2GHz, 8MB cache, 2133MHz DDR3    Memory read speed=11.4 GByte/s
Tested with a big 3888*2592 RGB24 image (30Mbytes), the runtime on 1 core was only 9.84 ms, that is 2928 MByte/s image processing speed. Smaller images can run even faster because of better cache usage: I measured 3.4GByte/s on a small 640×480 image.
Compared to a non-SSE implementation written in a high level language, the difference was huge: An initial, lightly optimized, more readable version did it in 3.36 s (8.5MByte/s). This version can be found here: http://prog.hu/tarsalgo/?fid=194693&no=10#e10 Later on with various optimization (partial bubble-sort, faster data access) they were reached a 2x faster version. And finally with the introduction of FastDib/MMX routines (mixed with high level code) they got another 7.5x speedup on top of that 2x. Theoretically it’s 8.5*2*7.5=127.5MB/s. It’s theoretical, because I don’t have the sources to test those, I only rely on Stella_209’s time measurements on the forums. But it’s still so far away from this SSE version’s 2900MB/s. This shows the power of low level optimization because one well written inner loop can be much better that small highly optimized routines stitched together with other fast routines in a high level language. With rewriting the loop from scratch we can avoid unnecessary data movement and conversion, in addition to function call overheads, and most importantly we can twist the whole process around the SSE instructions that will do the heavy lifting in order to feed those instructions with the highest amount of data to work on.

Imperfections:

Yes, there are some but with a lot of work they can be fixed. This experiment was only to measure that how fast an SSE median filter can be in general.

  • Every 16th column is filtered twice because the 16wide kernel works in place and reusing a previous kernel’s output. Solution -> work into a separate output buffer.
  • Boundary conditions. I’m ignoring them at all. There can be also a big untouched gap on the left and right sides because the kernel only outputs 16byte aligned data. Solution -> handle the border situations by a traditional algo.

Source code:

The source code is relatively small. The assembly parts are compacted because they were just pulled out automatically from my SSE design tool in an order I controlled.

Compiled and tested on Delphi XE3 win32.

procedure SSE_median3x3_RGB_x2(dst:pointer; linesize, count:integer);
asm
  push esi push edi push ebx push ebp

  mov ebp,linesize  //ebp: linesize   //must be 16 aligned
  mov ebx,count     //ebx: counterr
  mov ecx,dst       //ecx: dst

  //calc src addresses
  mov eax, ecx   sub eax, ebp   sub eax, 3   //src = dst-linesize-3
  lea edx, [eax+ebp]  //one line down

  //prepare 2*7*$10 bytes of temp buffers
  mov esi, esp   sub esi, 2*$70;   and esi,not $FFF    //align temps to a 4K page
  lea edi, [esi+$70]  //the other half of the temp buffers

  //fetch the first 2 rows to the temp buffer
  movups xmm0,[eax] movups xmm1,[eax+3] movups xmm2,[eax+6] movaps xmm3,xmm0 pminub xmm0,xmm1 pmaxub xmm3,xmm1 movups xmm1,[edx]
  movaps xmm4,xmm3 pminub xmm3,xmm2 movups xmm5,[edx+3] pmaxub xmm4,xmm2 movaps xmm2,xmm0 pminub xmm0,xmm3 pmaxub xmm2,xmm3
  movups xmm3,[edx+6] movaps xmm6,xmm1 pminub xmm1,xmm5 pmaxub xmm6,xmm5 movaps xmm5,xmm6 pminub xmm6,xmm3 pmaxub xmm5,xmm3
  movaps [esi],xmm0 movaps xmm3,xmm1 pminub xmm1,xmm6 movaps [esi+$10],xmm2 pmaxub xmm3,xmm6 movaps [esi+$20],xmm4
  movaps [esi+$30],xmm1 movaps [esi+$40],xmm3 movaps [esi+$50],xmm5

  xchg esi, edi         //swap the temp banks,
  lea eax, [eax+ebp*2]  //advance the source offsets
  lea edx, [edx+ebp*2]

  cmp ebx, 0;  jle @end
  @1:
    //inputs      : eax, edx  : image source row0, row1
    //temps       : esi, edi  : 2x 7*16bytes alternating buffers. Last elements are used as 2 temps.
    //output      : ecx       : 2x 16bytes of output (used as temp internally)
    //remarks     : edx = eax+linesize;  next iteration: swap(esi,edi); eax += linesize*2;
    //source plan : SSEDesign_1610_Median3x3
prefetch [eax+ebp*2]  prefetch [edx+ebp*2]  //2.2->2.9 GB/s
    movups xmm0,[eax] movups xmm1,[eax+3] movups xmm2,[eax+6] movaps xmm3,xmm0 pminub xmm0,xmm1 pmaxub xmm3,xmm1 movups xmm1,[edx]
    movaps xmm4,xmm3 pminub xmm3,xmm2 pmaxub xmm4,xmm2 movups xmm2,[edx+3] movaps xmm5,xmm0 pminub xmm0,xmm3 movups xmm6,[edx+6]
    pmaxub xmm5,xmm3 movaps xmm3,xmm1 pminub xmm1,xmm2 pmaxub xmm3,xmm2 movaps [esi],xmm0 movaps xmm2,xmm3 pminub xmm3,xmm6
    movaps [esi+$10],xmm5 pmaxub xmm2,xmm6 movaps [esi+$20],xmm4 movaps xmm6,xmm1 pminub xmm1,xmm3 pmaxub xmm6,xmm3
    movaps [esi+$30],xmm1 movaps xmm3,[edi+$20] movaps [esi+$40],xmm6 movaps xmm7,xmm2 pminub xmm2,xmm3 movaps [edi+$60],xmm2
    movaps [esi+$50],xmm7 movaps xmm7,[edi+$10] movaps xmm3,xmm6 pminub xmm6,xmm7 pmaxub xmm3,xmm7 movaps [esi+$60],xmm6
    pminub xmm4,xmm2 pmaxub xmm5,xmm6 movaps xmm6,[edi] pminub xmm5,xmm3 pmaxub xmm0,xmm1 pmaxub xmm0,xmm6 movaps xmm2,xmm4
lea eax, [eax+ebp*2]  //advance the source offsets
    pminub xmm4,xmm0 pmaxub xmm2,xmm0 pmaxub xmm4,xmm5 movaps xmm5,[edi+$50] movaps xmm0,[esi+$60] pminub xmm4,xmm2
lea edx, [edx+ebp*2]
    movaps xmm2,[edi+$40] movaps [ecx+ebp],xmm4 pmaxub xmm2,xmm0 movaps xmm0,[edi+$60] pminub xmm5,xmm0 pminub xmm2,xmm3
    movaps xmm3,[edi+$30] pmaxub xmm3,xmm1 pmaxub xmm3,xmm6 movaps xmm6,xmm5 pminub xmm5,xmm3 pmaxub xmm6,xmm3 pmaxub xmm5,xmm2
xchg esi, edi         //swap the temp banks,
    pminub xmm5,xmm6 movaps [ecx],xmm5
lea ecx, [ecx+ebp*2]  //advance the dst offset too

    dec ebx jg @1
  @end:
  pop ebp pop ebx pop edi pop esi
end;

procedure applyMedian3x3(b:TBitmap);


  procedure Error(s:string);begin raise exception.Create('applyMedian3x3: '+s); end;
  function AlignUp(i:integer;const align:integer):integer; asm sub edx,1;  add eax,edx;  not edx;  and eax,edx end;

var base,dst0, ycnt, xcnt, w, h, x, y, yblocksize, ysiz, st, en:integer;

const L3CacheLimit=4096 shl 10;
begin
  if b.PixelFormat<>pf24bit then error('24bit format expected');
  h:=b.Height;  w:=b.Width*3;  if (w and 15)<>0 then error('Scanline size must be a multiple of 16');
  base:=integer(b.ScanLine[h-1]);

  //calculate first aligned dstByte (on the 0th line first)
  dst0:=AlignUp(base+w+3, 16);  //center of 3x3 window
  xcnt:=((base+w*2-3)-dst0)div 16;
  ycnt:=h shr 1-1;

  yblocksize:=ycnt;
  while(yblocksize>4)and(yblocksize*w*2>L3CacheLimit) do yblocksize:=yblocksize shr 1;

  while ycnt>0 do begin
    ysiz:=min(ycnt, yblocksize);
    for x:=0 to xcnt-1 do begin
      SSE_median3x3_RGB_x2(pointer(dst0+x shl 4), w, ysiz);
    end;
    ycnt:=ycnt-ysiz;
    inc(dst0, ysiz*w*2);
  end;
end;

 

Posted in SSE | Tagged , , , , | Leave a comment

GCN Quick Reference Card

Happy new year!

Most importantly here’s the document: GCN Reference Card

(Use it only on your own risk, as it can contain errors. The red color means the glorious GCN3)

Last autumn I had the opportunity to program a GCN3 Fury card. It is a real 8TFlops/s beast, even games are flowing incredibly on that while I still have a potato of a CPU 😀

So I had a job on it, and in order to do it, first I had to upgrade my assebler to support the changes that GCN3 introduced. I’ve spent the first days on comparing the old (Southert/Sea Islands) manual against the new one (GCN3 Isa Manual), spotting the differences, making notes on every important changes. The new instruction set is totally incompatible with the old one, but maybe first I was a bit angry, later I really liked the new changes. Just 2 things out of the many: Sub DWord Addressing, Data Parallel Processing. These two (mostly the latter) are kinda redefining how I think in parallel programming from now. Basically there’s no need to think in mass parallel, we can interweave adjacent threads without any wasted cycles(* there are penalties, though) to collaborate on the same job, while using the same memory resources as it would do with only 1 individual thread.

Long ago I thought about optimizing Scrypt on 4 adjacent lanes. I just guessed it is useful as it uses 1/4 the memory costs. On GCN1 it had to be implemented with the ds_swizzle instruction, but now it is free on the GCN3 to connect 4 lanes this way. Too bad I don’t have time to play with this…

5 years ago there was a mass parallel revolution (I’d call it hype), when sequential programmers were so optimistic to wait for new solutions that can make their sequential programs run on massive amounts of treads automatically. Now there is that inter-thread data sharing thing. If memory is a bottleneck, then we can think about connecting 2,4,16 or even 32 threads to work together on the same job. By this, it became even harder to optimize classic sequential code automatically to new hardware. But those guys are still optimistic as hell. I heard about someone who willing to wait these parallel things to be implemented under Java. Geeeeez 😀 Why not start OCL today?!

Posted in Uncategorized | Tagged | 1 Comment

Bitmap clock

Just got bored a bit and I’ve made a clock out of 6 decade counters and 7segment encoders. I’m starting to realize, that how much simpler it is to program these simple things rather than design their hardware.

clock

Update: Here’s clock V2.0. It is built entirely out of synchronous master-slave T flip-flops with synch’d clear. This way the circuit is much simpler. (And at least I finally know, how it works. :D)

clock2

Posted in Uncategorized | Tagged , | 2 Comments

Bitmap Logic Simulator

This post is not about GPU, it’s about low level computing.

Long ago I wanted to make a simple simulator, which can simulate logic gates with the simplest rules.
Earlier I’ve found  WireWorld An amazing logic simulation with really simple rules. I never played with it, I found that extremely difficult, that everything is in motion in this simulation when it’s working. I wanted something, that simulates logic gates in a similar way that in TTL electronics so I came up with these simple rules:

bitmap logic simulation rules

There are only five 3×3 pixel combinations that the simulation recognizes as special features: The wire crossing and the NOT gate facing all four directions. The OR behavior can be achieved when wires are connected to each others. And 2 serially connected NOT gates act like a buffer gate.

The simulation is working well with signal feedbacks:

  • Two NOT gates connected circularly is a memory cell. At startup it will have a random 0 or 1 bit, but will never oscillate because the simulation is upgraded with some randomness when changing logic states (just like in the real world where you’ll never have two logic gates that switches in the exact same time).
  • Odd number of NOT gates in a circular chain: They make an oscillator. The more gates used, the slower is the frequency.

Here are some more circuit examples:

JK_flipflop

Master Slave JK flipflop and its compact form

ttl chips

Some 74xxx chips

Now it’s enough information about the rules of the simulation. If you’re interested, you can find the Simulator and its (Delphi XE) source in the [Download Area] up on the menu bar of this page. Current link is: BmpLogicSim150902.zip  Let me see what incredible machines you can build with it! 😀

Using these instructions: [How-to-Build-an-8-Bit-Computer] I’ve managed to build a working machine that can compute the first 12 Fibonacci numbers. Well, that’s a… something… Now I have a bit more understanding on how things work under the hood… Next time I want to make it a bigger computer in order to make the classic Tetris game on it. Also it needs the simulator to be optimized 10..50x faster, but there are a lot of room for improvements in it.

So here’s the small computer on 1024×1024 resolution. First you have to click on [RESET], ant then on [RUN]. It will stop whenever a 8bit overflow happens in while calculating the Fibonacci series.

8bit_cpu

(When you load it, be patient, it take a few seconds to load and preprocess this circuit into the simulator)

Here’s a small video on how it works. Sry for bad quality.

Posted in Uncategorized | Tagged , | 50 Comments

Instructions: Compiling and running the ASM Groestl kernel in sgminer 5.1

(You can find more information and benchmark results on this page: https://realhet.wordpress.com/gcn-asm-groestl-coin-kernel/ )

Compiling the kernel for a specific GPU

Requirements:

  • Windows 7 or up
  • Catalyst 14.6, 14.7, 14.9,  14.12  (those are confirmed to work)
  • AMD Radeon gfx with GCN chip (HD7xxx and up)
  • sgminer 5.1

Step 1: Download HetPas??????_Groestl.zip! The latest version can be found in the [Download Area] on the MenuHeader of this blog.

Step 2: Unpack it to a folder that has write access (UAC let it to write there).

Step 3: Turn off DEP for HetPas.exe. (If you don’t know what is this and HetPas.exe wont start -> google “Turn off DEP”)

Step 4: Open HetPas.exe and load [F3] the “groestl/groestl_isa.hpas” program.

Step 5: Run the program by pressing [F9]! If everything goes well, you should see something like this:
groestl compile screen

Step 6: Run “sgminer.exe -k groestlcoin (…your usual command-line comes here)” and then stop it in order to precompile the original groestlcoin.cl opencl kernel.

Step 7: Look for the precompiled binary file that just have been created in the same path as sgminer.exe. The name of the file is something like this: “groestlcoinXXXXgw256l4.bin“. To the place of XXXX your graphics chip’s name will go. In my case it is “groestlcoinCapeverdegw256l4.bin”.

Step 8: Replace the binary file next to sgminer.exe with the “kernel_dump\kernel.elf” one you’ve created in Step 5.
groestl replace kernel

Step 9: Run sgminer.exe but now with the new binary!

Posted in Uncategorized | Tagged , , , , | 11 Comments