Fast SSE 3×3 median filter for RGB24 images

Optimization opportunities:

After a bit of googling, a good solution to the problem turned out to be a partial sorting network that computes only the median value.
[1] -> https://software.intel.com/sites/default/files/m/d/4/1/d/8/MedianFilter.pdf

Another way of optimization is to reuse intermediate data by doing not only one but more median kernels at a time.
[2] -> https://www.cg.tuwien.ac.at/research/publications/1994/Kopp-1994-EMF/TR-186-2-94-18Paper.pdf

The paper reports a 3x speedup achieved by computing 2 kernels at once and reusing data while moving the kernel forward by 2 pixels.

Also, SSE can really speed this up because it has 16-wide parallel min/max instructions. I think the biggest speedup will come from this, so the whole algorithm must be twisted to support SSE as much as possible. This algorithm does a lot of computation with a moderate amount of memory IO, which is a good fit for SSE.

Finally there’s multithreading, but it’s so straightforward that I won’t even implement it.

Let’s plan the algorithm

First we need a basic sorting network to start from — a 9-element one. I chose the variant that groups the data into 3-element groups at the beginning. These 3-element groups will represent the columns of our moving 3×3 kernel window.

Here’s an online generator -> http://pages.ripco.net/~jgamble/nw.html
The variant called “Best” seems to suit our needs well. -> http://jgamble.ripco.net/cgi-bin/nw.cgi?inputs=9&algorithm=best&output=svg

sortnet1.png

When visualizing the data paths of the median value, it turned out that 5 min/max operations aren’t used at all, and by removing those we can reduce the operation count from 25 down to 20. This trick is used in the Intel optimization paper [1].

After reorganizing a bit we can see the 3-element sort ranges at the beginning; these operations will be reused by adjacent kernels. There is also a noticeable 6-element group. And after writing an error checker, it turned out that one additional operation can be eliminated.

sortnet2.png
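As a scalar sketch of the idea (the SSE version runs 16 of these in parallel on byte lanes), here is the classic 19-exchange median-of-9 partial sorting network — the same operation count the pruning above arrives at:

```c
#include <stdlib.h>
#include <string.h>

/* One MinMax step: afterwards a holds the smaller, b the larger value
   (the scalar equivalent of a PMINUB/PMAXUB pair). */
#define MINMAX(a, b) do { unsigned char t = (a); \
                          (a) = t < (b) ? t : (b); \
                          (b) = t < (b) ? (b) : t; } while (0)

/* Median of 9 bytes using the well-known 19-exchange partial sorting
   network; every exchange that doesn't feed the median is pruned. */
unsigned char median9(const unsigned char *in)
{
    unsigned char p[9];
    memcpy(p, in, 9);
    MINMAX(p[1], p[2]); MINMAX(p[4], p[5]); MINMAX(p[7], p[8]);
    MINMAX(p[0], p[1]); MINMAX(p[3], p[4]); MINMAX(p[6], p[7]);
    MINMAX(p[1], p[2]); MINMAX(p[4], p[5]); MINMAX(p[7], p[8]);
    MINMAX(p[0], p[3]); MINMAX(p[5], p[8]); MINMAX(p[4], p[7]);
    MINMAX(p[3], p[6]); MINMAX(p[1], p[4]); MINMAX(p[2], p[5]);
    MINMAX(p[4], p[7]); MINMAX(p[4], p[2]); MINMAX(p[6], p[4]);
    MINMAX(p[4], p[2]);
    return p[4];    /* after the network, p[4] holds the median */
}
```

Each `MINMAX` maps directly to one PMINUB/PMAXUB pair in the SSE version.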

Now let’s expand it to 2 kernels, where we have 4 input columns (12 values in total) and need to pull 2 medians.

sortnet3.png

The workflow:

  • At every iteration we move the 2 kernels forward by 2 pixels: copying 2×6 sorted values (light green) and sorting the 2×3 incoming ones: 6 minmax ops, 6 moves
  • Then comes the purple rectangle, a preparation step: 2 minmax ops
  • The next, orange stage saves 5 values for later use: 5 moves
  • And finally the two median calculations: 2×8 minmax ops, 5 moves (recovering the saved orange values)

Total min/max ops: 6+2+2×8 = 24; total moves: 6+5+5 = 16.
That’s 12 minmax ops per median value, roughly 60% more efficient than the 19 ops of the single-kernel partial sorting network.
There were attempts to optimize this further in 2D [2], but the gain was so small compared to the 1D version that I didn’t want to make the algorithm bigger.

SSE estimation: in the worst case a minmax op takes 3 cycles and a move takes 1, so the total is 12×3+8 = 44 cycles per median calculation. Using SIMD with byte data we get 16 lanes, so the estimate drops to around 2.8 cycles per median. (This doesn’t include the fetch and store phases — those will be the biggest problem in this median filter because of the 24-bit input data; more on that later.)

The two building blocks we need are these:

MinMax: uses a temporary register plus an SSE min and max instruction. The result is always in a register; one of the inputs can come from a memory location. Many input combinations (reg/mem) will be used, and in contrast to full sorting networks, here we’re going to have unused outputs of either the min or the max. I won’t code these manually — I’ll use an automatic instruction-flow generator tool that also handles register allocation; I only have to watch out for the proper instruction order.

Move: It will be implicit as I’m using the Instruction flow generator thingie.
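As a scalar model of the MinMax building block (names are mine): one PMINUB/PMAXUB pair processes 16 independent byte lanes at once, with a temporary preserving the input that both instructions consume.

```c
/* Scalar model of the MinMax building block. 'a' plays the role of a
   register (it receives the min); 'b' may come from memory and ends up
   holding the max, via the temporary described above. */
void minmax16(unsigned char a[16], unsigned char b[16])
{
    for (int i = 0; i < 16; ++i) {      /* one iteration per SSE byte lane */
        unsigned char t = a[i];         /* MOVAPS tmp, a */
        a[i] = t < b[i] ? t : b[i];     /* PMINUB a, b   */
        b[i] = t < b[i] ? b[i] : t;     /* PMAXUB tmp, b */
    }
}
```

In the real kernel the compiler-like tool decides which of the two outputs is kept when one is unused.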

Fetch and Store

This will be quite difficult because of the nature of RGB24.
First we have to decide in which direction we want to process the data in the 2D bitmap.

sortnet4.png

There are two variations(for the above scalar example):

  • Left to right: this is the natural order of system memory, so it would be good for the cache.
  • Top to bottom: this is not so cache-friendly and would require additional prefetching; at certain bitmap widths it could even render the cache useless.

Let’s widen it in order to fulfill an XMM register:

sortnet5.png

Now the register utilization is much better.
Of the two variants I immediately threw out [left to right] because it needs so many small reads, and that means a lot of shuffling to fill the register.
With the top-down variant we can read memory in large chunks and no complex shuffling is needed; the already loaded data can be reused by shifting. Additionally, on recent systems with the SSSE3 extension it would be even more effective using that ‘super weapon’ called PALIGNR, and even more complex shuffling can be achieved with the PSHUFB byte-permute instruction. But first I’ll do it with only simple memory reads and simple SSE instructions.

Thinking a bit more about the problem, that 93% register utilization can simply be extended to 100%, because the important thing here is not to separate the RGB triplets properly, but only to pair each color component with its left neighbor. That neighbor is exactly 3 bytes away with RGB, 4 bytes away with RGBA and 1 byte away with grayscale. In other words it’s irrelevant whether we group the colors as RGB, RGB, RGB, RGB or RGBR, GBRG, BRGB.
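A tiny scalar illustration of that observation (the function and layout here are mine, just a sketch): a per-byte min between offset x and x+3 pairs every component with the same component of the pixel to its right, no matter where the pixel boundaries fall inside a 16-byte chunk.

```c
/* Hypothetical one-row horizontal min for RGB24: out[i] = min of a
   component and the same component of the pixel 3 bytes to the right.
   Works identically whether the chunk is grouped RGB,RGB,... or
   RGBR,GBRG,BRGB — only the +3 offset matters. */
void hmin_rgb24(const unsigned char *row, unsigned char *out, int n)
{
    for (int i = 0; i < n; ++i)               /* caller ensures i+3 is valid */
        out[i] = row[i] < row[i + 3] ? row[i] : row[i + 3];
}
```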

One last thing remains to deal with: the unnatural pattern of memory access. I’m thinking about 2 ways to solve it:

  • Prefetching: it’s easy to do but can’t solve cache conflicts.
  • Moving our window down, say, 10 times, then going up and right to do another run downwards adjacent to the previous one. This may help a lot with very large bitmaps, and it can still be combined with prefetching at no cost.

When I implemented the algo, it turned out that we need both of the above: prefetch because of the non-sequential access pattern, and partitioning to be able to stay inside the L3 cache.

Implementation details

It wasn’t that hard to implement: I took the plan, rotated it by 90 degrees and designed it in my SSE programming tool. Only 4 SSE instructions were used: MOVUPS for reads, MOVAPS for stores and reg/reg moves, and PMINUB/PMAXUB. On my FX-8350 CPU (AMD Piledriver architecture) it’s possible to execute 2 MIN/MAX operations per cycle, each with only 2 cycles of latency, so this particular algorithm can run quite fast on it.

So here’s what the first design looked like:

median3x3_sse_plan.png

After some reordering and simplification it was possible to implement the whole algorithm on the 8 XMM registers without any temp registers. Later I noticed this doesn’t matter that much, because I think the bottleneck is rather the L3 cache due to the unusual memory access pattern. Anyway, here’s what the cleaned-up version looks like (note that the 16 SSE byte lanes are now visualized too):

median3x3_sse_plan2.png

The results:

lenna_median3x3.png

Personally I was amazed how fast it became.

The test config: AMD FX-8350 4.2GHz, 8MB cache, 2133MHz DDR3    Memory read speed=11.4 GByte/s
Tested with a big 3888*2592 RGB24 image (30Mbytes), the runtime on 1 core was only 9.84 ms, that is 2928 MByte/s image processing speed. Smaller images can run even faster because of better cache usage: I measured 3.4GByte/s on a small 640×480 image.
Compared to a non-SSE implementation written in a high-level language, the difference was huge: an initial, lightly optimized, more readable version did it in 3.36 s (8.5 MByte/s). That version can be found here: http://prog.hu/tarsalgo/?fid=194693&no=10#e10 Later, with various optimizations (partial bubble sort, faster data access) they reached a 2x faster version, and finally, with the introduction of FastDib/MMX routines (mixed with high-level code) they got another 7.5x speedup on top of that 2x. Theoretically that’s 8.5*2*7.5 = 127.5 MB/s — theoretical because I don’t have the sources to test those; I only rely on Stella_209’s time measurements on the forums. But it’s still far away from this SSE version’s 2900 MB/s. This shows the power of low-level optimization: one well-written inner loop can be much better than small highly optimized routines stitched together with other fast routines in a high-level language. By rewriting the loop from scratch we can avoid unnecessary data movement, conversion and function-call overhead, and most importantly we can twist the whole process around the SSE instructions that do the heavy lifting, in order to feed them with the highest amount of data to work on.

Imperfections:

Yes, there are some, but with a lot of work they can be fixed. This experiment was only meant to measure how fast an SSE median filter can be in general.

  • Every 16th column is filtered twice, because the 16-wide kernel works in place and reuses a previous kernel’s output. Solution -> work into a separate output buffer.
  • Boundary conditions: I’m ignoring them entirely. There can also be a big untouched gap on the left and right sides because the kernel only outputs 16-byte-aligned data. Solution -> handle the border cases with a traditional algo.

Source code:

The source code is relatively small. The assembly parts are compacted because they were just pulled out automatically from my SSE design tool in an order I controlled.

Compiled and tested on Delphi XE3 win32.

procedure SSE_median3x3_RGB_x2(dst:pointer; linesize, count:integer);
asm
  push esi push edi push ebx push ebp

  mov ebp,linesize  //ebp: linesize   //must be 16 aligned
  mov ebx,count     //ebx: counter
  mov ecx,dst       //ecx: dst

  //calc src addresses
  mov eax, ecx   sub eax, ebp   sub eax, 3   //src = dst-linesize-3
  lea edx, [eax+ebp]  //one line down

  //prepare 2*7*$10 bytes of temp buffers
  mov esi, esp   sub esi, 2*$70;   and esi,not $FFF    //align temps to a 4K page
  lea edi, [esi+$70]  //the other half of the temp buffers

  //fetch the first 2 rows to the temp buffer
  movups xmm0,[eax] movups xmm1,[eax+3] movups xmm2,[eax+6] movaps xmm3,xmm0 pminub xmm0,xmm1 pmaxub xmm3,xmm1 movups xmm1,[edx]
  movaps xmm4,xmm3 pminub xmm3,xmm2 movups xmm5,[edx+3] pmaxub xmm4,xmm2 movaps xmm2,xmm0 pminub xmm0,xmm3 pmaxub xmm2,xmm3
  movups xmm3,[edx+6] movaps xmm6,xmm1 pminub xmm1,xmm5 pmaxub xmm6,xmm5 movaps xmm5,xmm6 pminub xmm6,xmm3 pmaxub xmm5,xmm3
  movaps [esi],xmm0 movaps xmm3,xmm1 pminub xmm1,xmm6 movaps [esi+$10],xmm2 pmaxub xmm3,xmm6 movaps [esi+$20],xmm4
  movaps [esi+$30],xmm1 movaps [esi+$40],xmm3 movaps [esi+$50],xmm5

  xchg esi, edi         //swap the temp banks,
  lea eax, [eax+ebp*2]  //advance the source offsets
  lea edx, [edx+ebp*2]

  cmp ebx, 0;  jle @end
  @1:
    //inputs      : eax, edx  : image source row0, row1
    //temps       : esi, edi  : 2x 7*16bytes alternating buffers. Last elements are used as 2 temps.
    //output      : ecx       : 2x 16bytes of output (used as temp internally)
    //remarks     : edx = eax+linesize;  next iteration: swap(esi,edi); eax += linesize*2;
    //source plan : SSEDesign_1610_Median3x3
prefetch [eax+ebp*2]  prefetch [edx+ebp*2]  //2.2->2.9 GB/s
    movups xmm0,[eax] movups xmm1,[eax+3] movups xmm2,[eax+6] movaps xmm3,xmm0 pminub xmm0,xmm1 pmaxub xmm3,xmm1 movups xmm1,[edx]
    movaps xmm4,xmm3 pminub xmm3,xmm2 pmaxub xmm4,xmm2 movups xmm2,[edx+3] movaps xmm5,xmm0 pminub xmm0,xmm3 movups xmm6,[edx+6]
    pmaxub xmm5,xmm3 movaps xmm3,xmm1 pminub xmm1,xmm2 pmaxub xmm3,xmm2 movaps [esi],xmm0 movaps xmm2,xmm3 pminub xmm3,xmm6
    movaps [esi+$10],xmm5 pmaxub xmm2,xmm6 movaps [esi+$20],xmm4 movaps xmm6,xmm1 pminub xmm1,xmm3 pmaxub xmm6,xmm3
    movaps [esi+$30],xmm1 movaps xmm3,[edi+$20] movaps [esi+$40],xmm6 movaps xmm7,xmm2 pminub xmm2,xmm3 movaps [edi+$60],xmm2
    movaps [esi+$50],xmm7 movaps xmm7,[edi+$10] movaps xmm3,xmm6 pminub xmm6,xmm7 pmaxub xmm3,xmm7 movaps [esi+$60],xmm6
    pminub xmm4,xmm2 pmaxub xmm5,xmm6 movaps xmm6,[edi] pminub xmm5,xmm3 pmaxub xmm0,xmm1 pmaxub xmm0,xmm6 movaps xmm2,xmm4
lea eax, [eax+ebp*2]  //advance the source offsets
    pminub xmm4,xmm0 pmaxub xmm2,xmm0 pmaxub xmm4,xmm5 movaps xmm5,[edi+$50] movaps xmm0,[esi+$60] pminub xmm4,xmm2
lea edx, [edx+ebp*2]
    movaps xmm2,[edi+$40] movaps [ecx+ebp],xmm4 pmaxub xmm2,xmm0 movaps xmm0,[edi+$60] pminub xmm5,xmm0 pminub xmm2,xmm3
    movaps xmm3,[edi+$30] pmaxub xmm3,xmm1 pmaxub xmm3,xmm6 movaps xmm6,xmm5 pminub xmm5,xmm3 pmaxub xmm6,xmm3 pmaxub xmm5,xmm2
xchg esi, edi         //swap the temp banks,
    pminub xmm5,xmm6 movaps [ecx],xmm5
lea ecx, [ecx+ebp*2]  //advance the dst offset too

    dec ebx jg @1
  @end:
  pop ebp pop ebx pop edi pop esi
end;

procedure applyMedian3x3(b:TBitmap);


  procedure Error(s:string);begin raise exception.Create('applyMedian3x3: '+s); end;
  function AlignUp(i:integer;const align:integer):integer; asm sub edx,1;  add eax,edx;  not edx;  and eax,edx end;

var base,dst0, ycnt, xcnt, w, h, x, y, yblocksize, ysiz, st, en:integer;

const L3CacheLimit=4096 shl 10;
begin
  if b.PixelFormat<>pf24bit then error('24bit format expected');
  h:=b.Height;  w:=b.Width*3;  if (w and 15)<>0 then error('Scanline size must be a multiple of 16');
  base:=integer(b.ScanLine[h-1]);

  //calculate first aligned dstByte (on the 0th line first)
  dst0:=AlignUp(base+w+3, 16);  //center of 3x3 window
  xcnt:=((base+w*2-3)-dst0)div 16;
  ycnt:=h shr 1-1;

  yblocksize:=ycnt;
  while(yblocksize>4)and(yblocksize*w*2>L3CacheLimit) do yblocksize:=yblocksize shr 1;

  while ycnt>0 do begin
    ysiz:=min(ycnt, yblocksize);
    for x:=0 to xcnt-1 do begin
      SSE_median3x3_RGB_x2(pointer(dst0+x shl 4), w, ysiz);
    end;
    ycnt:=ycnt-ysiz;
    inc(dst0, ysiz*w*2);
  end;
end;

 


GCN Quick Reference Card

Happy new year!

Most importantly here’s the document: GCN Reference Card

(Use it only at your own risk, as it may contain errors. The red color marks the glorious GCN3.)

Last autumn I had the opportunity to program a GCN3 Fury card. It is a real 8 TFlop/s beast; even games flow incredibly on it while I still have a potato of a CPU 😀

So I had a job on it, and to do it I first had to upgrade my assembler to support the changes GCN3 introduced. I spent the first days comparing the old (Southern/Sea Islands) manual against the new one (GCN3 ISA manual), spotting the differences and making notes on every important change. The new instruction set is totally incompatible with the old one; maybe at first I was a bit angry, but later I really liked the changes. Just 2 things out of the many: Sub-DWord Addressing and Data Parallel Processing. These two (mostly the latter) are kind of redefining how I think about parallel programming from now on. Basically there’s no need to think in mass-parallel terms: we can interweave adjacent threads without any wasted cycles (* there are penalties, though) to collaborate on the same job, while using the same memory resources as a single individual thread would.

Long ago I thought about optimizing Scrypt on 4 adjacent lanes; I guessed it would be useful as it needs 1/4 the memory. On GCN1 it had to be implemented with the ds_swizzle instruction, but on GCN3 connecting 4 lanes this way is free. Too bad I don’t have time to play with this…

5 years ago there was a mass-parallel revolution (I’d call it hype), when sequential programmers were optimistic enough to wait for new solutions that would make their sequential programs run on massive numbers of threads automatically. Now there is this inter-thread data sharing thing: if memory is a bottleneck, we can think about connecting 2, 4, 16 or even 32 threads to work together on the same job. With this, it became even harder to optimize classic sequential code automatically for new hardware. But those guys are still optimistic as hell. I heard about someone who is willing to wait for these parallel things to be implemented under Java. Geeeeez 😀 Why not start OCL today?!


Bitmap clock

Just got a bit bored, so I made a clock out of 6 decade counters and 7-segment encoders. I’m starting to realize how much simpler it is to program these simple things than to design their hardware.

clock

Update: here’s clock V2.0. It is built entirely out of synchronous master-slave T flip-flops with synchronized clear. This way the circuit is much simpler. (And at least I finally know how it works. :D)

clock2


Bitmap Logic Simulator

This post is not about GPU, it’s about low level computing.

Long ago I wanted to make a simple simulator, which can simulate logic gates with the simplest rules.
Earlier I found WireWorld, an amazing logic simulation with really simple rules. I never played with it though; I found it extremely difficult that everything is in motion in that simulation while it’s working. I wanted something that simulates logic gates similarly to TTL electronics, so I came up with these simple rules:

bitmap logic simulation rules

There are only five 3×3 pixel combinations that the simulation recognizes as special features: the wire crossing and the NOT gate facing all four directions. OR behavior is achieved when wires are connected to each other, and 2 serially connected NOT gates act like a buffer gate.

The simulation is working well with signal feedbacks:

  • Two NOT gates connected circularly form a memory cell. At startup it will hold a random 0 or 1 bit, but it will never oscillate because the simulation adds some randomness when changing logic states (just like in the real world, where you’ll never have two logic gates that switch at the exact same time).
  • An odd number of NOT gates in a circular chain makes an oscillator. The more gates used, the lower the frequency.

Here are some more circuit examples:

JK_flipflop

Master Slave JK flipflop and its compact form

ttl chips

Some 74xxx chips

Now that’s enough information about the rules of the simulation. If you’re interested, you can find the simulator and its (Delphi XE) source in the [Download Area] up on the menu bar of this page. Current link: BmpLogicSim150902.zip. Let me see what incredible machines you can build with it! 😀

Using these instructions: [How-to-Build-an-8-Bit-Computer] I managed to build a working machine that can compute the first 12 Fibonacci numbers. Well, that’s… something… Now I have a bit more understanding of how things work under the hood. Next time I want to build a bigger computer in order to run the classic Tetris game on it. That also needs the simulator to be optimized to be 10..50x faster, but there is a lot of room for improvement in it.
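For reference, what the simulated machine computes can be sketched in a few lines (function name is mine): iterate the Fibonacci series in 8-bit registers until the next addition would overflow a byte.

```c
/* Last Fibonacci number that still fits in 8 bits, starting from 1, 1.
   The simulated computer halts at the same point: when the 8-bit
   addition overflows. */
unsigned fib8_last(void)
{
    unsigned a = 1, b = 1;
    for (;;) {
        unsigned next = a + b;
        if (next > 255)      /* 8-bit overflow: the machine stops here */
            return b;
        a = b;
        b = next;
    }
}
```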

So here’s the small computer at 1024×1024 resolution. First you have to click [RESET], and then [RUN]. It will stop whenever an 8-bit overflow happens while calculating the Fibonacci series.

8bit_cpu

(When you load it, be patient — it takes a few seconds to load and preprocess this circuit into the simulator.)

Here’s a small video of how it works. Sorry for the bad quality.


Instructions: Compiling and running the ASM Groestl kernel in sgminer 5.1

(You can find more information and benchmark results on this page: https://realhet.wordpress.com/gcn-asm-groestl-coin-kernel/ )

Compiling the kernel for a specific GPU

Requirements:

  • Windows 7 or up
  • Catalyst 14.6, 14.7, 14.9,  14.12  (those are confirmed to work)
  • AMD Radeon gfx with GCN chip (HD7xxx and up)
  • sgminer 5.1

Step 1: Download HetPas??????_Groestl.zip! The latest version can be found in the [Download Area] on the MenuHeader of this blog.

Step 2: Unpack it to a folder with write access (so UAC lets it write there).

Step 3: Turn off DEP for HetPas.exe. (If you don’t know what this is and HetPas.exe won’t start -> google “Turn off DEP”.)

Step 4: Open HetPas.exe and load [F3] the “groestl/groestl_isa.hpas” program.

Step 5: Run the program by pressing [F9]! If everything goes well, you should see something like this:
groestl compile screen

Step 6: Run “sgminer.exe -k groestlcoin (…your usual command line comes here)” and then stop it, in order to precompile the original groestlcoin.cl OpenCL kernel.

Step 7: Look for the precompiled binary file that has just been created in the same path as sgminer.exe. Its name is something like “groestlcoinXXXXgw256l4.bin”, where XXXX is replaced by your graphics chip’s name. In my case it is “groestlcoinCapeverdegw256l4.bin”.

Step 8: Replace the binary file next to sgminer.exe with the “kernel_dump\kernel.elf” one you created in Step 5.
groestl replace kernel

Step 9: Run sgminer.exe but now with the new binary!


Testing the GCN ASM Groestl kernel using sgminer 5.1

Making the original and the new kernel identical (from the outside)

In the previous blog post I got to the point where the desired kernel parameters (char *block, uint *output, ulong target) worked well in a small test kernel. After applying them to the main kernel, only one thing was left to realize in asm: detecting whether result <= target and marking it in the output buffer.

  //compare and report result
  v_if_u64(g[0]<=target)
    enter \ v_temp addr, data, tmp \ s_temp oldE[2] align:2
    v_mov_b32     addr, $FF
    v_mov_b32     data, 1
    buffer_atomic_add  data, addr, resOutput, 0 idxen glc
    s_waitcnt     vmcnt(0)
    v_min_u32     addr, $FE, data //dont let it overflow

    //reverse byte order of gid
    v_bfe_u32 data, gid, 24, 8
    v_bfe_u32 tmp , gid, 16, 8 \ v_lshlrev_b32 tmp,  8, tmp \ v_or_b32 data, data, tmp
    v_bfe_u32 tmp , gid,  8, 8 \ v_lshlrev_b32 tmp, 16, tmp \ v_or_b32 data, data, tmp
                                 v_lshlrev_b32 tmp, 24, gid \ v_or_b32 data, data, tmp

    tbuffer_store_format_x  data, addr, resOutput, 0 idxen format:[BUF_DATA_FORMAT_32, BUF_NUM_FORMAT_FLOAT]
    dd $BF8C0F00 //s_waitcnt     vmcnt(0) & expcnt(0)
    leave
  _endif

Simple if/relation handling: v_if_u64() is not a GCN instruction. It is a kind of macro that identifies the relational operation and emits the appropriate compare instruction. It also jumps conditionally and saves/modifies the exec mask based on the compare result.

Atomics: I use an atomic increment when calculating the next output index. In the original OpenCL I had to use atomic_inc(&output[0xFF]) as well, because I’m using a special test that returns more than 100 values in the output buffer, and I have to make sure no values are lost due to concurrent incrementing of the output index.

Swapping byte order: well, that’s not too nice — I should rather find some packing instructions, but this is not the inner loop, so I just don’t care… Thinking about it a bit further, it would be pretty fast this way: swap the low and high words with v_bytealign; select the odd bytes with v_and and shift them right 8 bits; select the even bytes, scale them up by 256 and add them to the previous result with v_mad_u32_u24. Only 4 simple instructions instead of 9. It’s fun how a piece of functionality can be built from various ‘LEGO’ pieces of the instruction set.
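A scalar model of the byte-order swap done on gid in the listing above (the function name is mine): extract each byte as the v_bfe_u32 steps do, shift it to its mirrored position and OR the pieces together.

```c
#include <stdint.h>

/* Byte-reverse a 32-bit value the way the asm snippet does it:
   bit-field extract each byte, shift, OR together. */
uint32_t bswap32_bfe(uint32_t gid)
{
    uint32_t data = (gid >> 24) & 0xFF;       /* v_bfe_u32 data, gid, 24, 8 */
    data |= ((gid >> 16) & 0xFF) << 8;        /* v_bfe + v_lshlrev + v_or   */
    data |= ((gid >>  8) & 0xFF) << 16;       /* v_bfe + v_lshlrev + v_or   */
    data |= (gid & 0xFF) << 24;               /* v_lshlrev_b32 tmp, 24, gid */
    return data;
}
```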

New method of functional testing

Now the key is to compare the asm kernel to the original kernel. Here are the testing parameters:

  • Block: is the same randomly generated 80 bytes as before.
  • target: 0x0008FFFFFFFFFFFF  (must be high enough to generate more than 100 values in the output array)
  • global_work_offset: 567    (it’s quite random, too)
  • global_work_count: 256*10  *512   (approx. 1.2million)
  • Compressing the output[]: iterate through all the output values, multiply each by a large prime number (402785417) and sum them together. The order of the values is not important; what counts is that all the values are present in the array.
  • Checking whether the 2 kernels are identical is the same as checking the compressed ‘hashes’ of their outputs.
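The compression step above can be sketched like this (function name mine; the small test values below are made up for illustration, not the post’s actual kernel output):

```c
#include <stdint.h>

/* Order-independent compression of the output buffer: multiply every
   value by a large prime and sum, so the hash doesn't depend on the
   order in which concurrent threads appended their results. */
uint64_t compress_outputs(const uint32_t *vals, int n)
{
    uint64_t h = 0;
    for (int i = 0; i < n; ++i)
        h += (uint64_t)vals[i] * 402785417u;   /* the prime from the post */
    return h;
}
```

Since addition is commutative, any permutation of the same values gives the same hash — which is all the test needs.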

Just for the record, the compressed output hash value calculated from the result of the above parameters is: 335485889931504896.

It was checked for both kernels, and it was proven that the original and the new kernel calculate the *same results.
*Actually “pretty much the same”: verified by checking the outcome of 1.2 million GroestlCoin calculations using a relatively high target value.

Testing it live

Testing it by running sgminer 5.1 and replacing the precompiled kernel binary (groestlcoinCapeverdegw256l4.bin) with my new binary produced the expected 3.5x speedup. And I was kind of lucky, because I got the first ‘accepted’ after 10 minutes; the next one came 3 hours later. So I’m now more than sure it works correctly. Of course, we can only be 100% sure when it earns 3.5x more coins than the OpenCL version — and that I cannot test, because I don’t have a mining rig.

my_first_grs

second_grs

Note that GPU1 did the mining, not GPU0. GPU1 is an HD7770 running at 1000 MHz (stock); it has 640 streams and a peak performance of 1.28 TFlop/s. It ran at around 63 °C, which is kinda cool, because the bottleneck is LDS and L1.

In the next post I’ll write down instructions on how to build the kernel for a specific GCN GPU and use it with SG 5.1.


Making the GCN ASM Groestl kernel run inside sgminer 5.1

There are basically two ways to do this:

  • Make the kernel binary 100% compatible with the original kernel
  • Apply modifications in sgminer to handle my custom binary.

For the first option I have to mimic the kernel parameters of the original kernel. The following is a small dummy kernel that uses the same parameters:

__kernel void search(__global unsigned char* block, volatile __global uint* output, const ulong target) {
  if(target>0)
    output[get_local_id(0)] = block[get_global_id(0)];
}

The simple program in it ensures that none of the parameters are optimized out by the OCL compiler. Let’s see what it looks like in asm (cat14.9):

  s_buffer_load_dword s0, s[4:7], 0x18        //domain gid base
  s_lshl_b32 s1, s12, 8                       //s1:groupid*256
  s_buffer_load_dword s4, s[8:11], 0x00       //s4: block ofs
  s_buffer_load_dword s5, s[8:11], 0x04       //s5: output ofs
  s_buffer_load_dwordx2 s[6:7], s[8:11], 0x08 //s[6:7] target
  s_waitcnt lgkmcnt(0)
  s_add_u32 s0, s1, s0                        //s0: domain base id
  v_lshl_b64 v[1:2], 0, 0                     //fill with zero
  v_cmp_ne_i64 vcc, s[6:7], v[1:2]            //target>0?
  s_cbranch_vccz label_0018
  s_add_u32 s0, s0, s4                        //s0: threadgroup base id + block ofs
  s_load_dwordx4 s[8:11], s[2:3], 0x60        //s[8..11]: block res
  v_add_i32 v1, vcc, s0, v0                   //v1:gid+block ofs
  s_waitcnt lgkmcnt(0)
  buffer_load_ubyte v1, v1, s[8:11], 0 offen  //block[gid]
  s_load_dwordx4 s[0:3], s[2:3], 0x68         //s[0..3]: output res
  v_lshlrev_b32 v0, 2, v0                     //v0:lid*4
  v_add_i32 v0, vcc, s5, v0                   //v0:lid*4+output base
  s_waitcnt vmcnt(0) & lgkmcnt(0)
  tbuffer_store_format_x v1, v0, s[0:3], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] //output[lid]
  s_waitcnt vmcnt(0) & expcnt(0)
label_0018:
  s_endpgm

Well, that’s pretty much code for doing little more than nothing.
We can access the target int64 parameter indirectly from buffer s[8:11]; it is loaded into s[6:7].
get_local_id(0) is in v0.
threadgroupId is in s1.
There are 3 resources provided to the kernel by the driver:
s[4:7] contains the kernel domain parameters. The only thing used from it is the lowest index value of the 1D kernel domain range.
s[8:11] is a table full of base pointers (and sizes?) for the parameter buffers. It also contains the value parameters.
s[2:3] is actually a 64-bit pointer to the table of the resources of our 2 buffer parameters.
So these things are provided at start. Here’s the list of things that must be calculated:
get_global_id(0) is calculated as threadGroupId*256 + get_local_id(0) + domain_base_id.
BlockResource is loaded from s[2:3], along with BlockBaseOffset from s[8:11].
The same goes for OutputResource and OutputBase.

And now it is possible to access the ‘block’ and ‘output’ parameters. IMO it is a bit messy, and the mess isn’t even the real problem: it’s the possibility that all these input registers and this functionality change in future versions of the GPU drivers.
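In scalar C, the index math the kernel prologue has to perform boils down to this (names are mine, not part of the driver ABI — just a model of the calculation listed above):

```c
#include <stdint.h>

/* get_global_id(0) assembled from the driver-provided pieces:
   threadGroupId*256 + local id + the 1D domain's base index. */
uint32_t global_id(uint32_t groupId, uint32_t localId, uint32_t domainBase)
{
    return groupId * 256u + localId + domainBase;   /* 256 threads per group */
}

/* Byte address into block[]: the s_add_u32 of the block base offset
   before the buffer_load_ubyte. */
uint32_t block_byte_addr(uint32_t gid, uint32_t blockOfs)
{
    return gid + blockOfs;
}
```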

One additional buffer hasn’t been considered yet: the groestlcoin.cl kernel contains some big constant arrays, and the OpenCL environment silently passes these constants in a buffer. I just don’t know how this works in the binary ELF format. What I can do is put the data into the code stream, calculate its address with the s_getpc_b64 instruction and read it from there. But that brings an additional problem with the alignment of that data.

And finally, as if it weren’t complicated enough, I’d have to modify my binary generator to work not only with buffer parameters but from now on with ordinal/float parameters as well.

Altogether I think this is the better way, rather than making modifications in sgminer.

ver04: Packing the T tables into the code

As the original kernel passes T0 and T1 to the kernel in a silent buffer, I have to work around it in asm.
First I generated a groestl_table.inc file with 16 KB of constant declarations in asm style:

//contents of groestl_table.inc file
dd $A5F432C6
dd $C6A597F4
dd $84976FF8
...           //T0..T7: 16KB of data
dd $6D6DD6DA
dd $4E3A4E62
dd $2C2C3A58

Next step is to include this into the code:

  s_getpc_b64 s[40:41]  
  s_branch @tableEnd
  aligncode 16         //aligns the next code to the next 16 byte boundary 
@tableBegin:
  #include groestl_table.inc
@tableEnd:
  alias constBuf = s[40:43]  //make constbuf resource
  s_sub_u32 s40, s40, 4           \ s_subb_u32 s41, s41, 0 //beginning of code
  s_add_u32 s40, s40, @tableBegin \ s_addc_u32 s41, s41, 0 //physical offset of the table 
  s_or_b32  s41, s41, $00080000   //8byte record size
  s_mov_b32 s42, $FFFFFF00
  s_mov_b32 s43, $002A7204 

The above code first saves the value of the program counter, then includes the table data, and finally builds a 128-bit resource constant in s[40:43] which points to the beginning of the table. Later in the program this resource constant (called ‘constBuf’) will be used for the lookups. Just a note to self: it would be useful to be able to use @address labels in arithmetic expressions too (e.g. @tableBegin-4).
The total code size is now 42KB, but the actual code which must fit into the ICache is 16KB less, only 26KB. The execution time of the new code remained the same as before.

Emulating the original kernel parameters

New things to be implemented in the compiler and on the host side:

1. There must be a way to specify the kernel header manually. The current system only works with buffers (constant and UAV buffers, although there is no difference between the two on GCN). So there is now a second, optional parameter for the TClDevice.NewKernel() function: oclSkeleton.

var isacode:=asm_isa(
  isa79xx
  numthreadpergroup 256
  numsgprs 48
  numvgprs 128
...
  s_endpgm
);

var skeleton:=
  "void main(__global unsigned char* block, volatile __global uint* output, const ulong target)
   { 
     if(target>0) output[get_local_id(0)] = block[get_global_id(0)]; //must use all parameters here
   }";

var isakernel := dev.newKernel(isacode, skeleton); //<- optional skeleton opencl code

If the skeleton parameter is specified, then the base kernel will be generated from that code instead of an automatic skeleton based on the oclBuffers instruction. In this example the skeleton code specifies the same parameters as those used in the groestlcoin.cl kernel.

2. I also have to enable passing values (not just buffers) to the kernel from the host program. From now on this can be done with TClKernel.SetArg() calls:

isakernel.SetArg(0, block);
isakernel.SetArg(1, output);
isakernel.SetArg(2, int64(1234)); //must cast to int64 because otherwise it would be a simple 32bit integer.

3. Letting the kernel domain’s low range be nonzero (global_work_offset). As far as I’ve seen, sgminer actively uses this, so there is a new TClKernel.RunRange(offset, size, buf0, buf1, …) function.

isakernel.runRange(39,85).waitfor(); //that means run 85 workitems starting from gid=39. Also wait for it.

Testing dummy code with original kernel parameters

Here’s the assembly dummy code that does the same operation as its skeleton code:

#include stdgcn.inc

var i,j,k,dev:=findGCNDevice;   //or find the first GCN device.
writeln(dev.dump);

var isacode:=asm_isa(
  isa79xx
  numthreadpergroup 256
  numvgprs 128
  numsgprs 48
  ldssize 16384

  s_lshl_b32            s1, s12, 8                //groupid*256
  s_buffer_load_dword   s0      , s[4: 7], 0x18   //global_work_offset
  s_buffer_load_dword   s4      , s[8:11], 0x00   //block ofs
  s_buffer_load_dword   s5      , s[8:11], 0x04   //output ofs
  s_buffer_load_dwordx2 s[ 6: 7], s[8:11], 0x08   //target
  s_load_dwordx4        s[12:15], s[2: 3], 0x60   //block res
  s_load_dwordx4        s[16:19], s[2: 3], 0x68   //output res
  s_waitcnt     lgkmcnt(0)
  s_add_u32     s0, s1, s0                        //workgroup gid base
  s_add_u32     s12, s12, s4 \ s_addc_u32 s13 ,s13, 0  //adjust block res with offset
  s_add_u32     s16, s16, s5 \ s_addc_u32 s17 ,s17, 0  //adjust output res with offset
  v_add_i32     v1, vcc, s0, v0                   //gid
  alias lid = v0, gid = v1, resBlock = s[12:15], resOutput = s[16:19], target = s[6:7]

  v_temp_range 2..127
  s_temp_range 0..5, 8..11, 20..47

  v_temp zero[2], data, addr
  
  v_lshl_b64    zero[0], 0, 0                      //fill with zero
  v_cmp_gt_i64  vcc, target, zero[0]               //target>0?
  s_if(vccnz)
    buffer_load_ubyte  data, gid, resBlock, 0 offen  //block[g]
    v_lshlrev_b32  addr, 2, lid                      //v0:lid*4
    s_waitcnt vmcnt(0)
    tbuffer_store_format_x  data, addr, resOutput, 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]
  _endif
  s_endpgm
);

var skeleton:="void main(__global unsigned char* block, volatile __global uint* output, const ulong target)
               { if(target>0) output[get_local_id(0)] = block[get_global_id(0)]; }";

var isakernel := dev.newKernel(isacode, skeleton);
isakernel.dump("dummyIsa\");

var block:=dev.newBuffer('rw', 512);
for i:=0 to 127 do block.Ints[i]:=i;

var output:=dev.newBuffer('rw', 256*4);
output.clear;

with isakernel do begin
  SetArg(0, block);
  SetArg(1, output);
  SetArg(2, int64(1));
  runRange(52,256).waitfor();

  for i:=0 to 255 do writeln(i, ' ', output.Ints[i]);
end;

As I’m using variable names in the asm, it’s not as messy as the raw disassembled version. Next time I’ll combine this with the groestl miner asm code and build the new binary. After that I’m going to modify the functional test to make 100% sure that not only the groestl calculations but also the thread indexing works correctly. With the new testing method I’ll start at a nonzero global_offset and use a relatively high target value to get around 100 hits. If I put it inside an iteration and feed it randomized blocks and work_offsets, that will prove the new kernel works exactly like the original one.
