Testing the GCN ASM Groestl kernel using sgminer 5.1

Making the original and the new kernel identical (from the outside)

In the previous blog post I got to the point where the desired kernel parameters (char *block, uint *output, ulong target) worked well in a small test kernel. After applying them to the main kernel, only one thing was left to realize in asm: detecting whether the result<=target and marking it in the output buffer.

  //compare and report result
  v_if_u64(g[0]<=target)
    enter \ v_temp addr, data, tmp \ s_temp oldE[2] align:2
    v_mov_b32     addr, $FF
    v_mov_b32     data, 1
    buffer_atomic_add  data, addr, resOutput, 0 idxen glc
    s_waitcnt     vmcnt(0)
    v_min_u32     addr, $FE, data //dont let it overflow

    //reverse byte order of gid
    v_bfe_u32 data, gid, 24, 8
    v_bfe_u32 tmp , gid, 16, 8 \ v_lshlrev_b32 tmp,  8, tmp \ v_or_b32 data, data, tmp
    v_bfe_u32 tmp , gid,  8, 8 \ v_lshlrev_b32 tmp, 16, tmp \ v_or_b32 data, data, tmp
                                 v_lshlrev_b32 tmp, 24, gid \ v_or_b32 data, data, tmp

    tbuffer_store_format_x  data, addr, resOutput, 0 idxen format:[BUF_DATA_FORMAT_32, BUF_NUM_FORMAT_FLOAT]
    dd $BF8C0F00 //s_waitcnt     vmcnt(0) & expcnt(0)
    leave
  _endif

Simple if/relation handling: v_if_u64() is not a GCN instruction. It is a kind of macro that can identify the relational operator and emit the appropriate compare instruction. It also jumps conditionally and saves/modifies the exec mask based on the compare result.

Atomics: An atomic increment is used when calculating the next output index. In the original OpenCL I had to use atomic_inc(&output[0xFF]) as well, because I'm using a special test which returns more than 100 values in the output buffer, and I have to make sure that no values are lost because of the concurrent increments of the output index.
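
For reference, the reporting step above corresponds roughly to this OpenCL fragment (a hedged sketch using the names from this post; "result" stands for g[0], and the original kernel's exact line may differ):

if (result <= target) {
    uint slot = min(atomic_inc(&output[0xFF]), 0xFEu);   // next free index; don't let it overflow
    output[slot] = as_uint(as_uchar4(gid).wzyx);         // store gid with reversed byte order
}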

Swapping byte order: Well, that's not too nice; I should rather find some packing instructions, but this is not the inner loop, so I just don't care… Thinking about it a bit further, it could be pretty fast this way: swap the low and high words with v_bytealign, select the odd bytes with v_and and shift them right by 8 bits, then select the even bytes, scale them up by 256 and add them to the previous result with v_mad_u32_u24. Only 4 simple instructions instead of 9. It's fun how a piece of functionality can be built from various 'LEGO' pieces of the instruction set.
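
The byte-reversal idea above in plain C (just a sketch of the trick, untested on the GPU):

#include <stdint.h>

uint32_t bswap32_trick(uint32_t gid)
{
    uint32_t swapped = (gid << 16) | (gid >> 16);       // swap low and high words (v_bytealign)
    uint32_t odd     = (swapped & 0xFF00FF00u) >> 8;    // select the odd bytes, shift right 8 bits
    return (swapped & 0x00FF00FFu) * 256u + odd;        // even bytes * 256 + previous result
                                                        // (v_mad_u32_u24: the masked value fits in 24 bits)
}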

New method of functional testing

Now the key is to compare the asm kernel to the original kernel. Here are the testing parameters:

  • Block: the same randomly generated 80 bytes as before.
  • target: 0x0008FFFFFFFFFFFF  (must be high enough to generate more than 100 values in the output array)
  • global_work_offset: 567    (quite random too)
  • global_work_count: 256*10*512   (approx. 1.3 million)
  • Compressing the output[]: iterate through all the output values, multiply each with a large prime number (402785417) and sum them up (see the C sketch after this list). Because the order of the values is not important, all that counts is that every value must be in the array.
  • Checking whether the 2 kernels are identical: the same as checking the compressed 'hashes' of the two outputs.
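
A minimal host-side C sketch of this compression (the helper name compress_output is mine, not from the actual test program):

#include <stdint.h>

// Order-independent 'hash' of the output buffer: equal hashes mean both
// kernels reported the same set of values, regardless of ordering.
uint64_t compress_output(const uint32_t *output, uint32_t count)
{
    uint64_t sum = 0;
    for (uint32_t i = 0; i < count; ++i)
        sum += (uint64_t)output[i] * 402785417u;   // the large prime from the post
    return sum;
}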

Just for the record, the compressed output hash value calculated from the result of the above parameters is: 335485889931504896.

It was checked for both kernels, and it was proven that the original and the new kernel calculate the same* results.
*Actually it is "pretty much the same": verified by checking the outcome of 1.3 million GroestlCoin calculations using a relatively high target value.

Testing it live

Testing it by running sgminer 5.1 and replacing the precompiled kernel binary (groestlcoinCapeverdegw256l4.bin) with my new binary produced the expected 3.5x speedup. And I was kinda lucky because I got the first 'accepted' after 10 minutes. The next one came 3 hours later. So I'm now more than sure that it works correctly. But of course we can only be 100% sure when it earns 3.5x more coins than the OpenCL version; that I cannot test because I don't have a mining rig.

my_first_grs

second_grs

Note that GPU1 did the mining, not GPU0. GPU1 is an HD7770 running at 1000MHz (stock); it has 640 stream processors and a peak performance of 1.28 TFLOPS. It ran at around 63 °C. It stays kinda cool because the bottleneck is LDS and L1.

In the next post I’ll write down the instructions on how to build a kernel on a specific GCN GPU and use it with SG 5.1.


Making the GCN ASM Groestl kernel to run inside sgminer 5.1

There are basically two ways to do this:

  • Make the kernel binary 100% compatible with the original kernel
  • Apply modifications in sgminer to handle my custom binary.

For the first option I have to mimic the kernel parameters of the original kernel. The following is a small dummy kernel that uses the same parameters:

__kernel void search(__global unsigned char* block, volatile __global uint* output, const ulong target) {
  if(target>0)
    output[get_local_id(0)] = block[get_global_id(0)];
}

The simple program inside ensures that none of the parameters are optimized out by the OCL compiler. Let's see what it looks like in asm (cat14.9):

  s_buffer_load_dword s0, s[4:7], 0x18        //domain gid base
  s_lshl_b32 s1, s12, 8                       //s1:groupid*256
  s_buffer_load_dword s4, s[8:11], 0x00       //s4: block ofs
  s_buffer_load_dword s5, s[8:11], 0x04       //s5: output ofs
  s_buffer_load_dwordx2 s[6:7], s[8:11], 0x08 //s[6:7] target
  s_waitcnt lgkmcnt(0)
  s_add_u32 s0, s1, s0                        //s0: domain base id
  v_lshl_b64 v[1:2], 0, 0                     //fill with zero
  v_cmp_ne_i64 vcc, s[6:7], v[1:2]            //target>0?
  s_cbranch_vccz label_0018
  s_add_u32 s0, s0, s4                        //s0: threadgroup base id + block ofs
  s_load_dwordx4 s[8:11], s[2:3], 0x60        //s[8..11]: block res
  v_add_i32 v1, vcc, s0, v0                   //v1:gid+block ofs
  s_waitcnt lgkmcnt(0)
  buffer_load_ubyte v1, v1, s[8:11], 0 offen  //block[gid]
  s_load_dwordx4 s[0:3], s[2:3], 0x68         //s[0..3]: output res
  v_lshlrev_b32 v0, 2, v0                     //v0:lid*4
  v_add_i32 v0, vcc, s5, v0                   //v0:lid*4+output base
  s_waitcnt vmcnt(0) & lgkmcnt(0)
  tbuffer_store_format_x v1, v0, s[0:3], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] //output[lid]
  s_waitcnt vmcnt(0) & expcnt(0)
label_0018:
  s_endpgm

Well, this is pretty much code for doing little more than nothing.
The target int64 parameter is accessed indirectly from buffer s[8:11]; it is loaded into s[6:7].
get_local_id(0) is in v0.
threadGroupId is in s12 (s1 holds threadGroupId*256).
There are 3 resources provided to the kernel by the driver:

  • s[4:7] contains the kernel domain parameters. The only thing used from it is the lowest index value of the 1D kernel domain range.
  • s[8:11] is a table full of base pointers (and sizes?) for the parameter buffers. It also contains the value parameters.
  • s[2:3] is actually a 64bit pointer that points to the table of the resources of our 2 buffer parameters.

These things are provided at start. The following must be calculated:

  • get_global_id(0) is calculated as threadGroupId*256+get_local_id(0)+domain_base_id
  • BlockResource is loaded from s[2:3], along with BlockBaseOffset from s[8:11]
  • The same goes for OutputResource and OutputBase.

And now it is possible to access the 'block' and 'output' parameters. IMO it is a bit messy, and the real problem is not even the mess: it's the possibility that all these input registers and this functionality change in future versions of the GPU drivers.
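
A tiny C sketch of the index math the asm above has to reconstruct (the 256 is this kernel's workgroup size; register names refer to the dump above):

#include <stdint.h>

uint32_t global_id_0(uint32_t group_id,            // s12 in the dump
                     uint32_t local_id,            // v0
                     uint32_t global_work_offset)  // loaded from s[4:7] at offset 0x18
{
    return group_id * 256u + local_id + global_work_offset;
}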

One additional buffer has not been considered yet: in the groestlcoin.cl kernel there are some big constant arrays. The OpenCL environment passes these constants to the kernel in a buffer silently. I just don't know how this works in the binary .ELF format. I can only put the data into the code stream and, with the s_getpc_b64 instruction, calculate its address and read it from there. But this brings an additional problem with the alignment of that data.

And finally, as if it weren't complicated enough: I'd have to modify my binary generator to work not only with buffer parameters but from now on with ordinal/float parameters as well.

Altogether I think this is the better way, rather than making modifications in sgminer.

ver04: Packing the T tables into the code

As the original kernel gets T0 and T1 passed in a silent buffer, I have to do it in asm with a workaround.
First I generated a groestl_table.inc file with 16KB of constant declarations in asm style:

//contents of groestl_table.inc file
dd $A5F432C6
dd $C6A597F4
dd $84976FF8
...           //T0..T7: 16KB of data
dd $6D6DD6DA
dd $4E3A4E62
dd $2C2C3A58

Next step is to include this into the code:

  s_getpc_b64 s[40:41]  
  s_branch @tableEnd
  aligncode 16         //aligns the next code to the next 16 byte boundary 
@tableBegin:
  #include groestl_table.inc
@tableEnd:
  alias constBuf = s[40:43]  //make constbuf resource
  s_sub_u32 s40, s40, 4           \ s_subb_u32 s41, s41, 0 //beginning of code
  s_add_u32 s40, s40, @tableBegin \ s_addc_u32 s41, s41, 0 //physical offset of the table 
  s_or_b32  s41, s41, $00080000   //8byte record size
  s_mov_b32 s42, $FFFFFF00
  s_mov_b32 s43, $002A7204 

The above code first saves the value of the program counter, then includes the table data, and finally builds a 128-bit resource constant in s[40:43] which points to the beginning of the table. Later in the program this resource constant (called 'constBuf') will be used for the lookups. Just a note to self: it would be useful to be able to use @address labels in arithmetic expressions too (e.g. @tableBegin-4).
The total code size is now 42KB, but the actual code which must fit into the ICache is 16KB less, only 26KB. The execution time of the new code remained the same as before.
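
For reference, a C-struct view of the 128-bit resource constant the code above builds in s[40:43]; the field layout is my reading of the GCN buffer resource descriptor, and the word3 value is simply the constant used above, not something I decoded:

#include <stdint.h>

typedef struct {
    uint32_t word0;  // base address bits [31:0]  -> PC-relative address of @tableBegin
    uint32_t word1;  // base address bits [47:32] plus the stride field at bit 16:
                     //   $00080000 sets an 8-byte record size
    uint32_t word2;  // number of records          -> $FFFFFF00 here
    uint32_t word3;  // dst_sel / format bits      -> $002A7204, taken verbatim from the code above
} gcn_buffer_resource;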

Emulating the original kernel parameters

New things to be implemented in the compiler and on the host side:

1. There must be a way to specify the kernel header manually. The current system only works with buffers (constant and UAV buffers, although there is no difference between the two on GCN). So there is a second, optional parameter for the TClDevice.NewKernel() function: oclSkeleton.

var isacode:=asm_isa(
  isa79xx
  numthreadpergroup 256
  numsgprs 48
  numvgprs 128
...
  s_endpgm
);

var skeleton:=
  "void main(__global unsigned char* block, volatile __global uint* output, const ulong target)
   { 
     if(target>0) output[get_local_id(0)] = block[get_global_id(0)]; //must use all parameters here
   }";

var isakernel := dev.newKernel(isacode, skeleton); //<- optional skeleton opencl code

If the skeleton parameter is specified, then the base kernel will be generated from that code instead of an automatic skeleton based on the oclBuffers instruction. In this example the skeleton code specifies the same parameters as those used in the groestlcoin.cl kernel.

2. I also have to enable passing values (not just buffers) to the kernel from the host program. From now on this can be done using TClKernel.SetArg() calls:

isakernel.SetArg(0, block);
isakernel.SetArg(1, output);
isakernel.SetArg(2, int64(1234)); //must cast to int64 because otherwise it would be a simple 32bit integer.

3. Letting the kernel domain's low range be nonzero (global_work_offset). As far as I have seen, sgminer actively uses this. So from now on there is a new TClKernel.RunRange(offset, size, buf0, buf1, …) function.

isakernel.runRange(39,85).waitfor(); //that means run 85 workitems starting from gid=39. Also wait for it.

Testing dummy code with original kernel parameters

Here's the assembly dummy code that does the same operation as its skeleton code:

#include stdgcn.inc

var i,j,k,dev:=findGCNDevice;   //or find the first GCN device.
writeln(dev.dump);

var isacode:=asm_isa(
  isa79xx
  numthreadpergroup 256
  numvgprs 128
  numsgprs 48
  ldssize 16384

  s_lshl_b32            s1, s12, 8                //groupid*256
  s_buffer_load_dword   s0      , s[4: 7], 0x18   //global_work_offset
  s_buffer_load_dword   s4      , s[8:11], 0x00   //block ofs
  s_buffer_load_dword   s5      , s[8:11], 0x04   //output ofs
  s_buffer_load_dwordx2 s[ 6: 7], s[8:11], 0x08   //target
  s_load_dwordx4        s[12:15], s[2: 3], 0x60   //block res
  s_load_dwordx4        s[16:19], s[2: 3], 0x68   //output res
  s_waitcnt     lgkmcnt(0)
  s_add_u32     s0, s1, s0                        //workgroup gid base
  s_add_u32     s12, s12, s4 \ s_addc_u32 s13 ,s13, 0  //adjust block res with offset
  s_add_u32     s16, s16, s5 \ s_addc_u32 s17 ,s17, 0  //adjust output res with offset
  v_add_i32     v1, vcc, s0, v0                   //gid
  alias lid = v0, gid = v1, resBlock = s[12:15], resOutput = s[16:19], target = s[6:7]

  v_temp_range 2..127
  s_temp_range 0..5, 8..11, 20..47

  v_temp zero[2], data, addr
  
  v_lshl_b64    zero[0], 0, 0                      //fill with zero
  v_cmp_gt_i64  vcc, target, zero[0]               //target>0?
  s_if(vccnz)
    buffer_load_ubyte  data, gid, resBlock, 0 offen  //block[g]
    v_lshlrev_b32  addr, 2, lid                      //v0:lid*4
    s_waitcnt vmcnt(0)
    tbuffer_store_format_x  data, addr, resOutput, 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]
  _endif
  s_endpgm
);

var skeleton:="void main(__global unsigned char* block, volatile __global uint* output, const ulong target)
               { if(target>0) output[get_local_id(0)] = block[get_global_id(0)]; }";

var isakernel := dev.newKernel(isacode, skeleton);
isakernel.dump("dummyIsa\");

var block:=dev.newBuffer('rw', 512);
for i:=0 to 127 do block.Ints[i]:=i;

var output:=dev.newBuffer('rw', 256*4);
output.clear;

with isakernel do begin
  SetArg(0, block);
  SetArg(1, output);
  SetArg(2, int64(1));
  runRange(52,256).waitfor();

  for i:=0 to 255 do writeln(i, ' ', output.Ints[i]);
end;

As I'm using variable names in asm, it's not as messy as the raw disassembled version. Next time I'll have to combine this with the Groestl miner asm code and build the new binary. After that I'm going to modify the functional test to make 100% sure that not only the Groestl calculations but also the thread indexing works correctly. With the new testing method I'll start at a nonzero global_offset and use a relatively high target value to get around 100 hits. If I put it inside an iteration and give it randomized blocks and work_offsets, that will prove that the new kernel works exactly like the original one.


Optimizing Groestl hash function in GCN ASM – Part 2.

Yesterday I achieved a 2.65x boost over the OCL version with 128 vregs and <32KB code size. Today it is time to optimize the inner loop.

Reordering the instructions in the inner loop – not needed

It turned out that it is not necessary; the first version was OK. GCN works well when LDS and ALU instructions are grouped together; they do not need to be interleaved the way the OCL compiler does so desperately.

So the final order of instructions is the following:

  • extract the 8 bytes (and multiply them by 8 for the LDS byte offset)
  • add Tn offsets
  • initiate the load of 8×8 bytes from LDS
  • wait for all 8 LDS operations
  • xor the data into result

These well separated groups of instructions run fast, and look simple with macros.

ver03 Hybrid lookups (LDS and L1 cooperation) 2.65x->3.48x=131%

In the original OCL version I noticed that out of every 8 lookups only 6 are served from LDS and the rest are done from memory. Pallas told me that he wanted T0 and T1 to be in 'constant memory'. But on the GCN architecture there is no such memory type (as there is on Evergreen), so it used GPU RAM instead. From there I got the idea to reduce LDS usage with the help of the L1 cache. The small lookup table can easily fit in the 8K (or 16K?) L1 cache, and according to the OpenCL programming manual the L1 peak is 4 bytes/cycle while the LDS peak is 8 bytes/cycle.

The base situation is 8 LDS lookups, which gives 2.65x.

  • First I tried to replace all 8 LDS lookups with 8 memory reads: -> 0.95x  OK, that's pretty slow, as if it wasn't intended for random access.
  • 1 MEM + 7 LDS: -> 3.01x  It's getting better.
  • 2 MEM + 6 LDS: -> 3.48x  That is the best. This is also what Pallas did in the OpenCL kernel.
  • 3 MEM + 5 LDS: -> 2.59x  And finally the L1 cache became the bottleneck.

So the code looks like this now:

#macro RBTT(dst, didx, src, i0, i1, i2, i3, i4, i5, i6, i7) //2MEM 6LDS 3.48
enter
  v_temp addr[8]
  v_temp data[16] align:2

  v_and_b32 addr[0], $FF, src[i0*2+0]
  v_bfe_u32 addr[1], src[i1*2+0], 8, 8
  __for__(i in[0..1], v_add_i32 addr[i], vcc, $100*i+$20, addr[i] )
  tbuffer_load_format_xy dst[didx*2], addr[0], UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT]
  __for__(i in[1..1], tbuffer_load_format_xy data[i*2],  addr[i], UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT])

  v_bfe_u32 addr[2], src[i2*2+0], 16-3, 8+3
  v_lshrrev_b32 addr[3], 24-3, src[i3*2+0]
  v_and_b32 addr[4], $FF, src[i4*2+1] v_lshlrev_b32 addr[4], 3, addr[4]
  v_bfe_u32 addr[5], src[i5*2+1], 8-3, 8+3
  v_bfe_u32 addr[6], src[i6*2+1], 16-3, 8+3
  v_lshrrev_b32 addr[7], 24-3, src[i7*2+1]
  __for__(i in[2..7], v_add_i32 addr[i], vcc, $800*i, addr[i] ) 

 __for__(i in[2..7], ds_read_b64 data[i*2], addr[i])
 s_waitcnt vmcnt(0)
  __for__(i in[1..1], v_xor_b32 dst[didx*2 ], dst[didx*2 ], data[i*2 ]
                      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
  s_waitcnt lgkmcnt(0)
  __for__(i in[2..7], v_xor_b32 dst[didx*2 ], dst[didx*2 ], data[i*2 ]
                      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
leave
#endm

Some more failed attempts

I tried using only one T0 table and doing the 64-bit byte rotates manually with the v_bytealign instruction -> nothing: there was no benefit from reading the same region of LDS, because I forgot that broadcasting only occurs across workitems, not across consecutive LDS reads. There was also no slowdown from the additional math, because the V-ALU is not the bottleneck.

I tried replacing the $800*i constants with pre-initialized S-regs to make the instruction stream smaller. Less than 1% speedup, so it was not necessary, as the average instruction bytes/cycle doesn't exceed 12 bytes. I just had to try it to make sure. I did some experiments long ago regarding instruction stream density and the number of WFs/V-ALU; it turned out that a low number of WFs (we have only 2 now) combined with 64-bit vector instructions interleaved with 64-bit scalar instructions is a bad combination.

Making long kernels that do multiple GroestlCoin hashes in a loop: Pallas was right, it's not necessary (below 1%). This algorithm takes so much time by default, and the LDS initialization takes no time compared to it, so it's not worth making it more complicated with batch processing.

Maybe there's 9% more if I can make it run with 84 VGPRS. (I broke the algorithm by using only 3 ulong[16] arrays, and that way it needed only 90 regs. At that point it was at 3.42x; then I switched the kernel to allocate 84 regs and the speed became 3.81x. But it's tricky, because regs 84..89 became zeroes, and maybe that produced some easier memory patterns which caused the speedup.) If I split the 8 qword reads into 2*4 qword reads I need 12 fewer VRegs, but the speed also drops by 5%. And I would also have to swap ulong[16] arrays, so this opt is not worth it anyway.

What else is left

Now that I'm stuck at 3.48x, I'm finally kinda out of ideas. One last thing would be the first- and last-round optimization: as the whole message (except 1 dword) is known on the CPU side, it would be wise to calculate those things on the CPU and let the thousands of GPU threads calculate only the data they really have to. And in the last round there are lots of unused values. These are the kinds of optimizations I'd rather have solved by a machine; in the past I did these expression-graph optimizations that way, but for this particular problem there is too little benefit. Let's say I'm very optimistic and can eliminate 2 whole rounds: then the speedup would be only 5%. In reality I think maybe 75% of the first round can be done on the CPU, and a similar amount of calculation is unused in the last round, so maybe it's 2.5%. Simply because there are so many calculations inside, in contrast to the first/last round.

Final step…

…will be to make it work inside a miner program and to ensure that it works correctly. Are there common test vectors for this, or something? I don't know whether miner programs are flexible with kernel parameters or I have to follow the original parameters. I don't know how the driver passes single-value parameters, for example.

(Check the Download Area for latest version)


Optimizing Groestl hash function in GCN ASM – Part 1.

The last post was about making the first assembly version that just runs correctly. Luckily it became 1.17x faster than the original high-level version. Now I will apply a series of optimizations and see how effective they are. I'll start with the most obvious ones.

ver01: Reduce VGPRS count down to 128 -> 196%

I left the original VGPRS count unchanged in ver00, just to see how important it is to let each of the GCN vector ALUs deal with not one but two wavefronts. When the V-ALU can choose from more than one WF, then as soon as one WF starts waiting for LDS or RAM it can immediately switch to another. It effectively hides LDS and MEM latency without changing anything in the program code. (Latency hiding can be realized inside the code as well, but it needs more registers and more clever code. Letting the GCN chip parallelize the code in every cycle is much more elegant.)

So reducing the VGPRS count from 164 down to 128 raises the speedup from 1.17x to 2.30x (each SIMD has 256 VGPRs, so at 128 regs every V-ALU gets 2 wavefronts instead of 1). That's a 196% gain.

It can be realized in asm like this:

numvgprs 128     v_temp_range 2..127

The "numvgprs" instruction tells the compiler how many regs it should allocate on the hardware. "v_temp_range" gives the compiler a range of vregs that make up a pool of regs which it can temporarily hand out to code regions marked with enter/leave.

Just some theoretical nonsense at the moment: I have tried other vgprs counts:
85: It's just a bit bigger than the 3 WaveFronts/V-ALU setup, so it's still the same 2 WFs/V-ALU. But somehow it's 2.70x.
84: Every V-ALU can choose from not 2 but 3 WFs at any time. -> 4.20x
But of course it calculates wrong results. To make it work I would have to manage with 38 fewer VRegs. Maybe it's worth trying in the future: long-lived regions in the algorithm only need 2x ulong[16] arrays and some temps. That could be made to fit under 85 regs, and LDS could be used to swap out locally-unused data. But at this early stage it's not worth bothering with, only to find out later that this optimization was unnecessary.

ver02: Reduce code size to fit into 32KB ICache -> 115%

This is the second most obvious optimization in my opinion.
In order to do this I have to find repetitive tasks in the code and either put them into a loop or call them as a subroutine. For simplicity I’ll try to do it with loops.
Below is the previous code:

#macro PASS
 v_mov_b32 x[11*2+0], 0 v_mov_b32 x[11*2+1], 0
 v_mov_b32 x[12*2+0], 0 v_mov_b32 x[12*2+1], 0
 v_mov_b32 x[13*2+0], 0 v_mov_b32 x[13*2+1], 0
 v_mov_b32 x[14*2+0], 0 v_mov_b32 x[14*2+1], 0
 v_mov_b32 x[15*2+0], 0 v_mov_b32 x[15*2+1], M15hi

 v_xor_b32 x[15*2+1], H15hi, x[15*2+1]
 CNST_P(g, x,  0) ROUND_P(a, g) CNST_P(a, a,  1) ROUND_P(g, a)
 CNST_P(g, g,  2) ROUND_P(a, g) CNST_P(a, a,  3) ROUND_P(g, a)
 CNST_P(g, g,  4) ROUND_P(a, g) CNST_P(a, a,  5) ROUND_P(g, a)
 CNST_P(g, g,  6) ROUND_P(a, g) CNST_P(a, a,  7) ROUND_P(g, a)
 CNST_P(g, g,  8) ROUND_P(a, g) CNST_P(a, a,  9) ROUND_P(g, a)
 CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a)
 CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a)
 v_xor_b32 x[15*2+1], H15hi, x[15*2+1]

 CNST_Q(x, x,  0) ROUND_Q(a, x) CNST_Q(a, a,  1) ROUND_Q(x, a)
 CNST_Q(x, x,  2) ROUND_Q(a, x) CNST_Q(a, a,  3) ROUND_Q(x, a)
 CNST_Q(x, x,  4) ROUND_Q(a, x) CNST_Q(a, a,  5) ROUND_Q(x, a)
 CNST_Q(x, x,  6) ROUND_Q(a, x) CNST_Q(a, a,  7) ROUND_Q(x, a)
 CNST_Q(x, x,  8) ROUND_Q(a, x) CNST_Q(a, a,  9) ROUND_Q(x, a)
 CNST_Q(x, x, 10) ROUND_Q(a, x) CNST_Q(a, a, 11) ROUND_Q(x, a)
 CNST_Q(x, x, 12) ROUND_Q(a, x) CNST_Q(a, a, 13) ROUND_Q(x, a)

 __for__(i in[0..$1F], v_xor_b32 g[i], g[i], x[i] ) //combine P and Q
 __for__(i in[0.. $F], v_mov_b32 x[i], g[i+$10] )
 v_xor_b32 g[15*2+1], H15hi, g[15*2+1]

 CNST_P(g, g,  0) ROUND_P(a, g) CNST_P(a, a,  1) ROUND_P(g, a)
 CNST_P(g, g,  2) ROUND_P(a, g) CNST_P(a, a,  3) ROUND_P(g, a)
 CNST_P(g, g,  4) ROUND_P(a, g) CNST_P(a, a,  5) ROUND_P(g, a)
 CNST_P(g, g,  6) ROUND_P(a, g) CNST_P(a, a,  7) ROUND_P(g, a)
 CNST_P(g, g,  8) ROUND_P(a, g) CNST_P(a, a,  9) ROUND_P(g, a)
 CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a)
 CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a)
#endm

#macro PASS_Transition
 __for__(i in[0.. $F], v_xor_b32 x[i], x[i], g[i+$10] ) //combine

                          v_xor_b32 x[ 7*2+1], H15hi, x[7*2+1]
 v_mov_b32 x[ 8*2+0], $80 v_mov_b32 x[ 8*2+1], 0
 v_mov_b32 x[ 9*2+0],   0 v_mov_b32 x[ 9*2+1], 0
 v_mov_b32 x[10*2+0],   0 v_mov_b32 x[10*2+1], 0
#endm

//issuing macros
PASS
PASS_Transition
PASS

Processing these macros results in 340KB of code. First I tried to make a loop like this:

s_temp passIdx
s_movk_i32 passIdx, 0
s_while_i32(passIdx<2)
  PASS
  s_if_i32(passIdx=0)
    PASS_Transition
  _endif
  s_addk_i32 passIdx, 1
_endw

But it crashed. Later I found out that the s_cbranch instruction can only jump ±128KB away, and the jump needed here was 170KB. From now on the assembler will raise an exception when this happens.
So first I have to put the loops inside the PASS macro. The round constant addition will change: from now on SRegs will hold the current round index. There will be two types of round-constant additions: the first initializes the SRegs holding the round indexes, and the second uses and increments those SRegs. This way S instructions will be interleaved with V instructions, thus not requiring additional clock cycles. Below are the new round-constant adder macros (they can also copy from one array to another):

s_temp RoundCnst[16] //these will be incremented before each round

#macro CNST_P_first(dst, src)
  __for__(i in[0..$F], s_movk_i32 RoundCnst[i], i*$10+0
                       v_mov_b32 dst[i*2+1], src[i*2+1]
                       v_xor_b32 dst[i*2+0], RoundCnst[i], src[i*2] )
#endm

#macro CNST_P_next(dst, src)
  __for__(i in[0..$F], s_addk_i32 RoundCnst[i], 1
                       v_mov_b32 dst[i*2+1], src[i*2+1]
                       v_xor_b32 dst[i*2+0], RoundCnst[i], src[i*2] )
#endm

#macro CNST_Q_first(dst, src)
  __for__(i in[0..$F], s_mov_b32 RoundCnst[i], ![not((i*$10+0)<<24)]
                       v_not_b32 dst[i*2+0], src[i*2]
                       v_xor_b32 dst[i*2+1], RoundCnst[i], src[i*2+1] )
#endm

#macro CNST_Q_next(dst, src)
  __for__(i in[0..$F], s_sub_u32 RoundCnst[i], RoundCnst[i], 1<<24
                       v_not_b32 dst[i*2+0], src[i*2]
                       v_xor_b32 dst[i*2+1], RoundCnst[i], src[i*2+1] )
#endm

Also I made a simple iterator macro:

#macro mRepeat(repetionCount, what) //simple repeat macro.
enter s_temp i
  s_movk_i32 i, 0
  s_while_i32(i<repetionCount)
    what
    s_addk_i32 i, 1 //should rather check SCC right after this
  _endw
leave
#endm

Using these the new Groestl PASS can be squeezed:

#macro PASS
  v_mov_b32 x[11*2+0], 0 v_mov_b32 x[11*2+1], 0
  v_mov_b32 x[12*2+0], 0 v_mov_b32 x[12*2+1], 0
  v_mov_b32 x[13*2+0], 0 v_mov_b32 x[13*2+1], 0
  v_mov_b32 x[14*2+0], 0 v_mov_b32 x[14*2+1], 0
  v_mov_b32 x[15*2+0], 0 v_mov_b32 x[15*2+1], M15hi

  v_xor_b32 x[15*2+1], H15hi, x[15*2+1]

              CNST_P_first(a, x) ROUND_P(g, a)
  mRepeat(13, CNST_P_next( a, g) ROUND_P(g, a))

  v_xor_b32 x[15*2+1], H15hi, x[15*2+1]

              CNST_Q_first(a, x) ROUND_Q(x, a)
  mRepeat(13, CNST_Q_next( a, x) ROUND_Q(x, a) ) 

  __for__(i in[0..$1F], v_xor_b32 g[i], g[i], x[i] ) //combine P and Q
  __for__(i in[0.. $F], v_mov_b32 x[i], g[i+$10] )
  v_xor_b32 g[15*2+1], H15hi, g[15*2+1]

              CNST_P_first(a, g) ROUND_P(g, a)
  mRepeat(13, CNST_P_next( a, g) ROUND_P(g, a) )
#endm

This reduced the code size to 50KB, so the instruction cache is utilized better now. But let's not forget one more loop for the 2 passes:

mRepeat(2,
  PASS
  s_if_i32(i=0)
    PASS_Transition //this must be called between the 2 PASSes
  _endif
)

The final code size is now 25KB, which completely fits into the 32KB instruction cache. Because of this optimization the speed rose from 2.30x to 2.65x, which means a 115% improvement. I expected this to be higher, but in the current state of the algo the biggest bottleneck is the LDS. So next time I'll try to make LDS access faster somehow.
Theoretically, when I remove the LDS reads the speed goes up to 7.04x, just by doing only the V-ALU calculations. With this high V-ALU utilization the instruction cache becomes a bigger issue: when I revert from the 25KB back to the old 340KB code size, the speed drops from 7.04x down to 4.46x. So if LDS weren't a bottleneck, the ICache optimization would give a 158% benefit instead of 115%.


Implementing Groestl hash function in GCN ASM

 There is a new version of HetPas in the Download Area . It contains the Groestl asm project for Cat14.9. See details at the bottom of this post.

Some months ago I was looking for an interesting GPU project for my free time, and ran into the Groestl hash algorithm. It seemed a good algo to play with, so here's a case study of what benefits we get if we can go deeper than OpenCL.

The original code

I’ll start from this OpenCL source code: Pallas optimized groestlcoin / diamond etc. opencl kernel


Originally I downloaded it from here: http://devgurus.amd.com/message/1306845#1306845

Groestl documentation is here: http://www.groestl.info/


The Catalyst version I use is: 14.9
The speed of the algo on different hardware and/or Catalyst are shown below:

groestl_speeds

The baseline for any further performance comparison will be HD7770, 1000MHz, Cat 14.9  Pallas OpenCL version which is 4MH/s.

Examining the OpenCL-compiled binary

Below is the repeating 'main loop' from the OpenCL code. It does 2*8 T-table lookups and the XORing. 2*2 values are looked up from RAM, and the remaining 6*2 are accessed from LDS. T0[] and T1[] are const arrays which effectively live in memory; T2[]..T7[] are copied to local memory at startup.


a[0x0] ^= QC64(0x00, r); \
a[0x1] ^= QC64(0x10, r); \
...
a[0xE] ^= QC64(0xE0, r); \
a[0xF] ^= QC64(0xF0, r); \
t0[0x0] = B64_0(a[0x0]); \
t1[0x0] = B64_1(a[0x0]); \
...
t6[0xF] = B64_6(a[0xF]); \
t7[0xF] = B64_7(a[0xF]); \
RBTT(a[0x0], 0x1, 0x3, 0x5, 0xB, 0x0, 0x2, 0x4, 0x6); \
RBTT(a[0x1], 0x2, 0x4, 0x6, 0xC, 0x1, 0x3, 0x5, 0x7); \

Refer to Pallas's source code for details!
Here's what the Catalyst 14.9 driver compiled from it:

 s_waitcnt vmcnt(2)
 v_xor_b32 v9, v84, v86
 v_xor_b32 v16, v85, v87
 s_waitcnt lgkmcnt(8)
 v_xor_b32 v9, v9, v66
 s_waitcnt lgkmcnt(7)
 v_xor_b32 v16, v16, v47
 s_waitcnt lgkmcnt(6)
 v_xor_b32 v9, v9, v69
 v_xor_b32 v16, v16, v70
 s_waitcnt lgkmcnt(5)
 v_xor_b32 v9, v9, v74
 s_waitcnt lgkmcnt(4)
 v_xor_b32 v16, v16, v73
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v9, v9, v76
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v16, v16, v75
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v9, v9, v78
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v16, v16, v77
 s_waitcnt vmcnt(0)
 v_xor_b32 v43, v91, v93
 v_xor_b32 v47, v92, v94
 v_xor_b32 v59, 0x07000000, v59
 v_not_b32 v133, v40
 v_xor_b32 v134, 0xbfffffff, v59
 v_lshr_b64 v[73:74], v[133:134], 16
 v_bfe_u32 v66, v73, 0, 8
 v_lshlrev_b32 v66, 3, v66
 v_add_i32 v73, vcc, 0x00001800, v66
 v_add_i32 v66, vcc, 0x00001804, v66
 ds_read_b32 v74, v81
 ds_read_b32 v75, v80
 ds_read_b32 v73, v73
 ds_read_b32 v66, v66
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v9, v9, v74
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v16, v16, v75
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v43, v43, v73
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v47, v47, v66
 v_xor_b32 v66, 0x07000000, v90
 v_not_b32 v67, v89
 v_xor_b32 v68, 0x5fffffff, v66
 v_lshr_b64 v[76:77], v[67:68], 24
 v_bfe_u32 v76, v76, 0, 8
 v_lshlrev_b32 v76, 3, v76
 v_bfe_u32 v77, v100, 0, 8
 v_lshlrev_b32 v77, 3, v77
 v_add_i32 v78, vcc, 0x00000800, v77
 v_add_i32 v77, vcc, 0x00000804, v77
 v_bfe_u32 v80, v6, 8, 8
 v_lshlrev_b32 v80, 3, v80
 v_add_i32 v81, vcc, 0x00001000, v80
 ds_read2_b32 v[83:84], v76 offset1:1
 ds_read_b32 v76, v78
 ds_read_b32 v77, v77
 ds_read_b32 v78, v81
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v43, v43, v83
 v_xor_b32 v47, v47, v84
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v43, v43, v76
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v47, v47, v77
 v_add_i32 v76, vcc, 0x00001004, v80
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v43, v43, v78
 v_bfe_u32 v77, v14, 16, 8
 v_lshlrev_b32 v77, 3, v77
 v_add_i32 v78, vcc, 0x00002000, v77
 v_add_i32 v77, vcc, 0x00002004, v77
 v_xor_b32 v15, 0x07000000, v15
 v_not_b32 v4, v4
 v_xor_b32 v15, 0xafffffff, v15
 v_lshrrev_b32 v80, 24, v15
 v_lshlrev_b32 v80, 3, v80
 v_add_i32 v81, vcc, 0x00002800, v80
 ds_read_b32 v76, v76
 ds_read_b32 v78, v78
 ds_read_b32 v77, v77
 ds_read_b32 v81, v81
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v47, v47, v76
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v43, v43, v78
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v47, v47, v77
 v_add_i32 v76, vcc, 0x00002804, v80
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v43, v43, v81
 v_bfe_u32 v77, v99, 0, 8
 v_lshlrev_b32 v77, 3, v77
 v_add_i32 v77, vcc, s0, v77
 v_lshr_b64 v[83:84], v[5:6], 8
 v_bfe_u32 v78, v83, 0, 8
 v_lshlrev_b32 v78, 3, v78
 v_add_i32 v78, vcc, s9, v78
 v_mov_b32 v83, v8
 v_mov_b32 v84, v14
 v_lshr_b64 v[85:86], v[83:84], 16
 v_bfe_u32 v85, v85, 0, 8
 v_lshlrev_b32 v85, 3, v85
 v_add_i32 v86, vcc, 0x00001800, v85
 v_add_i32 v85, vcc, 0x00001804, v85
 v_xor_b32 v11, 0x07000000, v11
 v_not_b32 v87, v7
 v_xor_b32 v88, 0x6fffffff, v11
 v_lshr_b64 v[89:90], v[87:88], 24
 v_bfe_u32 v89, v89, 0, 8
 v_lshlrev_b32 v89, 3, v89
 ds_read_b32 v76, v76
 ds_read_b32 v86, v86
 ds_read_b32 v85, v85
 ds_read2_b32 v[89:90], v89 offset1:1
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v47, v47, v76
 v_bfe_u32 v76, v36, 0, 8
 v_lshlrev_b32 v76, 3, v76
 v_add_i32 v91, vcc, 0x00000800, v76
 v_add_i32 v76, vcc, 0x00000804, v76
 v_bfe_u32 v92, v33, 8, 8
 v_lshlrev_b32 v92, 3, v92
 v_add_i32 v93, vcc, 0x00001000, v92
 v_add_i32 v92, vcc, 0x00001004, v92
 ds_read_b32 v91, v91
 ds_read_b32 v76, v76
 ds_read_b32 v93, v93
 ds_read_b32 v92, v92
 v_bfe_u32 v94, v26, 16, 8
 v_lshlrev_b32 v94, 3, v94
 v_add_i32 v95, vcc, 0x00002000, v94
 v_add_i32 v94, vcc, 0x00002004, v94
 v_lshrrev_b32 v96, 24, v134
 v_lshlrev_b32 v96, 3, v96
 v_add_i32 v97, vcc, 0x00002800, v96
 v_add_i32 v96, vcc, 0x00002804, v96
 ds_read_b32 v95, v95
 ds_read_b32 v94, v94
 v_bfe_u32 v98, v109, 0, 8
 v_lshlrev_b32 v98, 3, v98
 v_add_i32 v98, vcc, s0, v98
 v_lshr_b64 v[101:102], v[99:100], 8
 v_bfe_u32 v17, v101, 0, 8
 v_lshlrev_b32 v17, 3, v17
 v_add_i32 v17, vcc, s9, v17
 tbuffer_load_format_xy v[101:102], v77, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
 tbuffer_load_format_xy v[77:78], v78, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
 tbuffer_load_format_xy v[103:104], v98, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
 tbuffer_load_format_xy v[105:106], v17, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]

Some observations:

  • The maximum number of s_waitcnts is used. Sure, the compiler finds them all, but I think it's not that necessary if the code reads and then processes in larger batches. The S-ALU could be used for better things, such as helping with address calculations.
  • Memory address calculations could be done with fewer instructions: for RAM, the UAV base and the table offset can be supplied in the tbuffer instruction's scalar address parameter. No need for v_add; let the address-calculation hardware do it. The int64 byte offset can be produced using the memory resource's stride field and the idxen flag. So altogether this calculation can be hardware accelerated: base_addr = uavbase+tableoffset+tableindex*8
  • The compiler is clever because it uses v_bfe (Bit Field Extract) when it finds x>>16&0xFF type C code.
  • But if we consider that 64bit LDS reads are always 64bit aligned regardless of the lowest 3 bits of the byte address, then we can spare an extra <<3 instruction: (x>>16&0xFF)<<3 becomes x>>13&0x7FF (sketched in C after this list).
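
The last point in C form (a sketch that relies on the stated assumption that ds_read_b64 ignores the lowest 3 address bits):

#include <stdint.h>

// naive: extract the byte, then scale by 8 for the LDS byte offset (two VALU ops)
uint32_t lds_addr_naive(uint32_t x) { return ((x >> 16) & 0xFFu) << 3; }
// folded: one v_bfe_u32; the upper bits are identical and the low 3 'garbage'
// bits are ignored by the 64-bit-aligned LDS read
uint32_t lds_addr_bfe(uint32_t x)   { return (x >> 13) & 0x7FFu; }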

Observing the OpenCL produced binary:

  • VGPRS=164. Bad because it allows only the minimum number of wavefronts/CU. No latency hiding, the ALU will sleep while waiting for RAM/LDS.
  • Binary size=110KB. Bad because it doesn’t fit into the 32KB instruction cache at all.

Making the first assembly version

First I tried to rewrite the OpenCL code into a very simple asm implementation that just works. I haven't implemented the first- and last-round optimizations; those could add an extra 8% speedup in the future, though.

Regarding kernel parameters: keep the input kernel parameters as simple as they can be: void main(__global ulong* a1). That's all. It's necessary at the moment because the current version of HetPas doesn't support non-buffer parameters, and fewer parameters mean fewer buffer-resource calculations and less S register usage. In the kernel header I manually adjust the offset in the buffer resource so it doesn't need to be added to the byte offset in every tbuffer instruction. I also set stride=8 in the buffer resource, so tbuffer instructions will scale my qword indices to byte indices whenever I use the 'idxen' option. In the final version it would be possible to make the kernel parameters compatible with the original OpenCL kernel, but the way that is done can change with every upcoming Catalyst version, so it is better to modify the host code a bit rather than hack it into the asm code. Also, the kernel is so simple that it doesn't request the kernel domain ranges; thus the global index is 1D and zero-based only.
This is what the kernel header looks like (sorry, no syntax highlighting this time):

//////////////////////////////// kernel header
isa79xx //this kernel must be called on a 0-based 1D kernel domain!
numthreadpergroup 256
ldssize 16384
oclbuffers 1, 0
numvgprs 256 v_temp_range 2..255
numsgprs 48 s_temp_range 1..3, 8..47

//////////////////////////////// Init ids, params
alias lid = v0, tid = s0, gid = v1, UAV = s[4:7]
s_buffer_load_dword s1, s[12:15], 0x00 //load uav base offset
s_mov_b32 tid, s16 //acquire tid
s_lshl_b32 s2, tid, 8 //calculate gid
v_add_i32 gid, vcc, s2, lid
s_waitcnt lgkmcnt(0)
s_add_u32 s4, s4, s1 s_addc_u32 s5, s5, 0 //adjust UAV res with uav base offset
s_andn2_b32 s5, s5, $3FFF0000 s_or_b32 s5, s5, $80000 //set 8byte record size for UAV
s_movk_i32 m0, -1 //disable LDS range checking

Next we initialize the LDS from RAM. It's a good example of local variable allocation (enter/leave/v_temp/s_temp) and of the __for__() macro iteration. Also note that tbuffer is used with hardware indexing (the 8-byte record size is specified in the header).

  //initialize lds with Groestl T table ////////////////////////////////////////////////
 enter
 v_temp data[16] align:2
 v_temp vaddr
 s_temp saddr

 v_mov_b32 vaddr, lid

 __for__(i in [0..7],
   s_movk_i32 saddr, $100+i*$800 //ram table select
   tbuffer_load_format_xy data[i*2], vaddr, UAV, saddr idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT]
 )
 s_waitcnt vmcnt(0)

 __for__(i in [0..7],
   s_movk_i32 saddr, i*$800 //lds table select
   v_mad_u32_u24 vaddr, lid, 8, saddr
   ds_write_b64 vaddr, data[i*2]
 )
 s_waitcnt lgkmcnt(0)

 s_barrier
 leave

Next I defined some macros for copying and adding round constants to ulong[16] arrays:

#define CNST_P(dst, src, r) __for__(i in[0..$F], v_xor_b32 dst[i*2], i*$10+r, src[i*2] v_mov_b32 dst[i*2+1], src[i*2+1])
#define CNST_Q(dst, src, r) __for__(i in[0..$F], v_xor_b32 dst[i*2+1], ![not((i*$10+r)<<24)], src[i*2+1] v_not_b32 dst[i*2], src[i*2])

Simple moves and XORs with constants. Note that arrays in the assembler are always considered arrays of 32bit values; that's why there are so many *2.
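
In plain C, the two macros above compute roughly this for one ulong[16] state, round r (derived from the asm above; a sketch, not the reference Groestl code):

#include <stdint.h>

void add_round_constants_P(uint64_t dst[16], const uint64_t src[16], int r) {
    for (int i = 0; i < 16; ++i)
        dst[i] = src[i] ^ (uint64_t)(i*0x10 + r);           // constant lands in the lowest byte
}

void add_round_constants_Q(uint64_t dst[16], const uint64_t src[16], int r) {
    for (int i = 0; i < 16; ++i)
        dst[i] = ~src[i] ^ ((uint64_t)(i*0x10 + r) << 56);  // all bits inverted, constant in the top byte
}
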
Now the most important part follows, which does 8 lookups and XORs the result into an element of an array:

#macro RBTT(dst, didx, src, i0, i1, i2, i3, i4, i5, i6, i7)
enter
  v_temp addr[8]
  v_temp data[16] align:2
  v_and_b32 addr[0], $FF, src[i0*2+0]
  v_lshlrev_b32 addr[0], 3, addr[0]
  v_bfe_u32 addr[1], src[i1*2+0], 8-3, 8+3   v_add_i32 addr[1], vcc, $800*1, addr[1]
  v_bfe_u32 addr[2], src[i2*2+0], 16-3, 8+3  v_add_i32 addr[2], vcc, $800*2, addr[2]
  v_lshrrev_b32 addr[3], 24-3, src[i3*2+0]   v_add_i32 addr[3], vcc, $800*3, addr[3]
  v_and_b32 addr[4], $FF, src[i4*2+1]
  v_lshlrev_b32 addr[4], 3, addr[4]          v_add_i32 addr[4], vcc, $800*4, addr[4]
  v_bfe_u32 addr[5], src[i5*2+1], 8-3, 8+3   v_add_i32 addr[5], vcc, $800*5, addr[5]
  v_bfe_u32 addr[6], src[i6*2+1], 16-3, 8+3  v_add_i32 addr[6], vcc, $800*6, addr[6]
  v_lshrrev_b32 addr[7], 24-3, src[i7*2+1]   v_add_i32 addr[7], vcc, $800*7, addr[7]
  ds_read_b64 dst[didx*2], addr[0]
  __for__(i in[1..7], ds_read_b64 data[i*2], addr[i])
  s_waitcnt lgkmcnt(0)
  __for__(i in[1..7], v_xor_b32 dst[didx*2 ], dst[didx*2 ], data[i*2 ]
                      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
leave
#endm

"dst" is the name of the destination array.
"didx" is the index into "dst".
"src" is the source array.
"i0".."i7" are the column indexes for the byte-scramble operation.
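
For reference, the plain-C form of what one RBTT invocation computes (T0..T7 stand for the 256-entry 64-bit lookup tables sitting in LDS at offsets $800*i; this mirrors the macro above, it is not the code that runs on the GPU):

#include <stdint.h>

extern const uint64_t T0[256], T1[256], T2[256], T3[256],
                      T4[256], T5[256], T6[256], T7[256];

uint64_t rbtt(const uint64_t src[16],
              int i0, int i1, int i2, int i3, int i4, int i5, int i6, int i7)
{
    return T0[(uint8_t)(src[i0] >>  0)] ^ T1[(uint8_t)(src[i1] >>  8)]
         ^ T2[(uint8_t)(src[i2] >> 16)] ^ T3[(uint8_t)(src[i3] >> 24)]
         ^ T4[(uint8_t)(src[i4] >> 32)] ^ T5[(uint8_t)(src[i5] >> 40)]
         ^ T6[(uint8_t)(src[i6] >> 48)] ^ T7[(uint8_t)(src[i7] >> 56)];
}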

Issuing RBTT 16 times (once for each of the 16 columns) makes up a round:

#macro ROUND(dst, src, i0, i1, i2, i3, i4, i5, i6, i7)
__for__(i in[0..$F], RBTT(dst, i, src, (i0+i)%16, (i1+i)%16, (i2+i)%16, (i3+i)%16, (i4+i)%16, (i5+i)%16, (i6+i)%16, (i7+i)%16) )
#endm

Then 2 work arrays are defined, and the message is loaded into one of them:


//load message block /////////////////////////////////
v_temp x[32],g[32], a[32] align:2 //x:message

enter
  v_temp vaddr
  __for__(i in [0..9],
    v_mov_b32 vaddr, $10+i
    tbuffer_load_format_xy x[i*2], vaddr, UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT]
  )
  s_waitcnt vmcnt(0)
leave
                         v_mov_b32 x[ 9*2+1], gid
v_mov_b32 x[10*2+0], $80 v_mov_b32 x[10*2+1], 0

The 1D, zero-based GlobalID goes into the last DWORD of the message.

There are 2 Groestl passes in GroestlCoin, so let’s make a PASS() macro:

#macro PASS0Final
  __for__(i in[0.. $F], v_xor_b32 x[i], x[i], g[i+$10] ) //combine

  v_xor_b32 x[ 7*2+1], H15hi, x[7*2+1]
  v_mov_b32 x[ 8*2+0], $80 v_mov_b32 x[ 8*2+1], 0
  v_mov_b32 x[ 9*2+0], 0 v_mov_b32 x[ 9*2+1], 0
  v_mov_b32 x[10*2+0], 0 v_mov_b32 x[10*2+1], 0
#endm

#macro PASS(passIdx) //passIdx: 0..1
  v_mov_b32 x[11*2+0], 0 v_mov_b32 x[11*2+1], 0
  v_mov_b32 x[12*2+0], 0 v_mov_b32 x[12*2+1], 0
  v_mov_b32 x[13*2+0], 0 v_mov_b32 x[13*2+1], 0
  v_mov_b32 x[14*2+0], 0 v_mov_b32 x[14*2+1], 0
  v_mov_b32 x[15*2+0], 0 v_mov_b32 x[15*2+1], M15hi

  v_xor_b32 x[15*2+1], H15hi, x[15*2+1]
  CNST_P(g, x, 0) ROUND_P(a, g) CNST_P(a, a, 1) ROUND_P(g, a)
  CNST_P(g, g, 2) ROUND_P(a, g) CNST_P(a, a, 3) ROUND_P(g, a)
  CNST_P(g, g, 4) ROUND_P(a, g) CNST_P(a, a, 5) ROUND_P(g, a)
  CNST_P(g, g, 6) ROUND_P(a, g) CNST_P(a, a, 7) ROUND_P(g, a)
  CNST_P(g, g, 8) ROUND_P(a, g) CNST_P(a, a, 9) ROUND_P(g, a)
  CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a)
  CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a)
  v_xor_b32 x[15*2+1], H15hi, x[15*2+1]

  CNST_Q(x, x, 0) ROUND_Q(a, x) CNST_Q(a, a, 1) ROUND_Q(x, a)
  CNST_Q(x, x, 2) ROUND_Q(a, x) CNST_Q(a, a, 3) ROUND_Q(x, a)
  CNST_Q(x, x, 4) ROUND_Q(a, x) CNST_Q(a, a, 5) ROUND_Q(x, a)
  CNST_Q(x, x, 6) ROUND_Q(a, x) CNST_Q(a, a, 7) ROUND_Q(x, a)
  CNST_Q(x, x, 8) ROUND_Q(a, x) CNST_Q(a, a, 9) ROUND_Q(x, a)
  CNST_Q(x, x, 10) ROUND_Q(a, x) CNST_Q(a, a, 11) ROUND_Q(x, a)
  CNST_Q(x, x, 12) ROUND_Q(a, x) CNST_Q(a, a, 13) ROUND_Q(x, a)

  __for__(i in[0..$1F], v_xor_b32 g[i], g[i], x[i] ) //combine P and Q
  __for__(i in[0.. $F], v_mov_b32 x[i], g[i+$10] )
  v_xor_b32 g[15*2+1], H15hi, g[15*2+1]

  CNST_P(g, g, 0) ROUND_P(a, g) CNST_P(a, a, 1) ROUND_P(g, a)
  CNST_P(g, g, 2) ROUND_P(a, g) CNST_P(a, a, 3) ROUND_P(g, a)
  CNST_P(g, g, 4) ROUND_P(a, g) CNST_P(a, a, 5) ROUND_P(g, a)
  CNST_P(g, g, 6) ROUND_P(a, g) CNST_P(a, a, 7) ROUND_P(g, a)
  CNST_P(g, g, 8) ROUND_P(a, g) CNST_P(a, a, 9) ROUND_P(g, a)
  CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a)
  CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a)

  __IF__(passIdx=0, PASS0Final)
#endm

And finally, use those macros and extract the one ulong value which will be compared against "target":

  PASS(0)
  PASS(1)

  v_xor_b32 g[0], g[11*2], x[3*2] v_xor_b32 g[1], g[11*2+1], x[3*2+1] //result value
  dump64($1F, g[0]) //dump64 writes an ulong into the buffer if the globalID is 1234

  s_endpgm

And that's the whole kernel. It does not compare against "target"; it only returns the ulong value if the current thread's GID is 1234.

Testing the first assembly version

These parameters are used for the functional tests:

const testBlock := #$6f7037939d1aa4a9863574ddf41a0d371799dfea+
                   #$89b37ecb1ecded76426afa25108feec755347891+
                   #$b3fa9afd2a360cf64f56e4d20f0c8c03ca411b3a+
                   #$29dd28ea4fc0cddf9a1e8c707966b7a700000000; //80 bytes
const testResponse := $9FB391FF6984DFA9;           //^^^^^^^^ gid goes here
#define debugThread 1234

The above block will be completed with the gid=1234 value at its end, and then the resulting ulong will be checked against the testResponse constant. The testBlock is random, and the testResponse constant was extracted from the original OpenCL kernel.
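
A minimal C sketch of how the test message is put together (the helper name is mine; the byte order assumes the little-endian layout the kernel uses):

#include <stdint.h>
#include <string.h>

void build_test_message(uint8_t msg[80], const uint8_t testBlock[80], uint32_t gid)
{
    memcpy(msg, testBlock, 80);
    memcpy(msg + 76, &gid, 4);   // gid (1234 for the debug thread) overwrites the last dword
}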

Checking the speed of the first asm version

Total workitems: 256*10*512   (256 = threadgroup size, and 10 because my card has 10 CUs)
Running the test 4x is enough at this point:

elapsed: 564.523 ms 4.644 MH/s gain: 1.16x
elapsed: 561.300 ms 4.670 MH/s gain: 1.17x
elapsed: 561.266 ms 4.671 MH/s gain: 1.17x
elapsed: 561.303 ms 4.670 MH/s gain: 1.17x

First speedup is 1.17x.

Without the first- and last-round optimizations, but with the simplified LDS addressing, it is already a bit faster than the OpenCL version. The gain is compared to a baseline OCL version of 4MH/s on my system: HD7770 1000MHz, cat14.9, win7/64. With cat14.6 it would be only around 1.00x, as 14.9 somehow produces less optimal output.

For this first version, 2 settings are close to the OpenCL version:

  • Kernel size: much greater than the 32KB I-cache; it's 340K (OCL: 110K, so there must be a small loop in it, compared to my version which is 100% unrolled at the moment)
  • VReg usage: artificially kept greater than 128, so one CU can have only the minimal amount of 4 wavefronts at any time. No latency hiding at all, just like in the OCL version.

Below is the 'main loop' extracted from the first asm version. It does the same amount of work as the disasm extracted from the OCL version: 2*8 lookups and the surrounding XORs.

enter v_temp addr[8] v_temp data[16] align:2
v_and_b32 addr[0], $FF, g[ ( 0+0)%16*2+0]
v_lshlrev_b32 addr[0], 3, addr[0]
v_bfe_u32 addr[1], g[ ( 1+0)%16*2+0], 8-3, 8+3
v_add_i32 addr[1], vcc, $800*1, addr[1]
v_bfe_u32 addr[2], g[ ( 2+0)%16*2+0], 16-3, 8+3
v_add_i32 addr[2], vcc, $800*2, addr[2]
v_lshrrev_b32 addr[3], 24-3, g[ ( 3+0)%16*2+0]
v_add_i32 addr[3], vcc, $800*3, addr[3]
v_and_b32 addr[4], $FF, g[ ( 4+0)%16*2+1]
v_lshlrev_b32 addr[4], 3, addr[4]
v_add_i32 addr[4], vcc, $800*4, addr[4]
v_bfe_u32 addr[5], g[ ( 5+0)%16*2+1], 8-3, 8+3
v_add_i32 addr[5], vcc, $800*5, addr[5]
v_bfe_u32 addr[6], g[ ( 6+0)%16*2+1], 16-3, 8+3
v_add_i32 addr[6], vcc, $800*6, addr[6]
v_lshrrev_b32 addr[7], 24-3, g[ ( 11+0)%16*2+1]
v_add_i32 addr[7], vcc, $800*7, addr[7]
ds_read_b64 a[ 0*2], addr[0]
ds_read_b64 data[1*2], addr[1]
ds_read_b64 data[2*2], addr[2]
ds_read_b64 data[3*2], addr[3]
ds_read_b64 data[4*2], addr[4]
ds_read_b64 data[5*2], addr[5]
ds_read_b64 data[6*2], addr[6]
ds_read_b64 data[7*2], addr[7]
s_waitcnt lgkmcnt(0)
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[1*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[1*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[2*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[2*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[3*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[3*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[4*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[4*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[5*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[5*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[6*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[6*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[7*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[7*2+1]
leave
enter v_temp addr[8] v_temp data[16] align:2
v_and_b32 addr[0], $FF, g[ ( 0+1)%16*2+0]
v_lshlrev_b32 addr[0], 3, addr[0]
v_bfe_u32 addr[1], g[ ( 1+1)%16*2+0], 8-3, 8+3
v_add_i32 addr[1], vcc, $800*1, addr[1]
v_bfe_u32 addr[2], g[ ( 2+1)%16*2+0], 16-3, 8+3
v_add_i32 addr[2], vcc, $800*2, addr[2]
v_lshrrev_b32 addr[3], 24-3, g[ ( 3+1)%16*2+0]
v_add_i32 addr[3], vcc, $800*3, addr[3]
v_and_b32 addr[4], $FF, g[ ( 4+1)%16*2+1]
v_lshlrev_b32 addr[4], 3, addr[4]
v_add_i32 addr[4], vcc, $800*4, addr[4]
v_bfe_u32 addr[5], g[ ( 5+1)%16*2+1], 8-3, 8+3
v_add_i32 addr[5], vcc, $800*5, addr[5]
v_bfe_u32 addr[6], g[ ( 6+1)%16*2+1], 16-3, 8+3
v_add_i32 addr[6], vcc, $800*6, addr[6]
v_lshrrev_b32 addr[7], 24-3, g[ ( 11+1)%16*2+1]
v_add_i32 addr[7], vcc, $800*7, addr[7]
ds_read_b64 a[ 1*2], addr[0]
ds_read_b64 data[1*2], addr[1]
ds_read_b64 data[2*2], addr[2]
ds_read_b64 data[3*2], addr[3]
ds_read_b64 data[4*2], addr[4]
ds_read_b64 data[5*2], addr[5]
ds_read_b64 data[6*2], addr[6]
ds_read_b64 data[7*2], addr[7]
s_waitcnt lgkmcnt(0)
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[1*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[1*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[2*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[2*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[3*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[3*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[4*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[4*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[5*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[5*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[6*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[6*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[7*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[7*2+1]
leave

With a few more registers this could be pipelined (2 stages: next address calculation, and previous XORs plus LDS reads), but it will be easier to go below 128 VRegs and let the GPU share the resources (ALU and LDS reads) across 2 concurrent wavefronts.
Just a random piece of information: the source file after macro processing is more than 2.3MB; the assembler is not a bad compressor, as it compresses it down to 340KB.

Check it yourself!

For those who want to see this working on their system, I uploaded a special Groestl version of HetPas. Click on the [Download Link] on the menu bar at the top of this page!

Instructions to make it run:

  • A minimum of Win7 (32 or 64 bit) is required; it's also the requirement for the GCN cards.
  • Use the 14.9 Catalyst! Older versions are guaranteed to crash as they pass the kernel parameters differently. Newer Cat versions are untested.
  • You have to disable Data Execution Prevention (DEP) for the exe, because it patches some Delphi runtime library functionality at startup.
  • Either [Run As Administrator] or put it in a folder where it can write files. This is needed to export the temporary files of the OpenCL compiler. It also writes an .ini file in the exe's path.
  • Open the "groestl\groestl_ocl.hpas" file to test the original OpenCL kernel (by Pallas). Press F9 to Run.
  • Open the "groestl\groestl_isa.hpas" file to test the GCN assembly version. If it works OK, then it should display "RESULT IS OK". It's fine if it says "TEMPDATA IS WRONG".
  • In the examples folder only OpenCL_OpenCL_HelloWorld.hpas is compatible with cat14.9. The others crash because of the changed method of passing kernel parameters in registers. To try the examples use cat13.4 or cat12.10. Cat12.10 has a working disassembler that disassembles binary-only ELF images, but that version is so old that it doesn't handle new cards. To 'decrease' the Catalyst version you may have to use the Catalyst Clean Uninstall Utility.

To be continued…

Now that the first asm version just works correctly, in the next post I'll examine different optimization methods to see whether it's worth descending from the OpenCL language down to GCN assembly.


GCN specific macros

([Download Link] section updated. The produced ELF is NOT compatible with cat14.6b. It works well with cat13.4, and if you want a working disassembler, you should use cat12.10.)

With the #include directive it is possible to inline headers. So I made a stdgcn.inc to help with 'everyday' programming tasks.

A kernel code may start like this:

var code:=asm_isa(
  #include stdgcn.inc
  KernelInitUC(64,64,8192)
  ...

KernelInitUC(WorkGroupSize, VRegCount, LDSBytes) is defined in stdgcn.inc and does the following:

  • sets the important kernel parameters: WorkGroupSize, VRegCount and LDSBytes
  • specifies buffers and reads pointers to them. In this case UC means 1 UAV and 1 constant buffer.
  • prepares the kernel indexes and stores them in the grpId=s0, lid=v0, gid=v1 aliases respectively.
  • allocates vector and scalar registers for temp variables (more info later), making sure that the pool does not include resource constants and other important registers.
  • measures the start time of the kernel. (Well, maybe this should be optional.)

 Temp registers

This is a new feature which helps using variables in a structured form.
Before using it, a register pool must be allocated with the s_temp_range and v_temp_range instructions. For example:

s_temp_range 1..7, 27..103
v_temp_range 2..64

From now on there will be a scope for variables allocated with the v_temp and s_temp instructions:

v_temp X, Y, Z  //note: the data type is always 32bit, for 64bit types you can use arrays
s_temp i,j,k
s_temp data[16] align:16  //allocates a 16 dword array of sregs aligned to 16 dword boundary

 Managing temp register scope

There are two special instructions for this: enter and leave. In a block between enter and leave a new scope is created. One can allocate registers with s_temp and v_temp inside the block, and the leave instruction will release all the variables that were allocated inside it. It is very useful inside macros.

Program structure macros

_if(), _else, _endif: Lets you create if/else statements without writing jumps and labels yourself. The _if statement has to know what kind of register/relation you are going to check with it, so the proper forms of the _if instruction are:

  • s_if(vccz) //scalar IF checking a scalar flag.
  • s_if_i32(s6>-32769) //scalar if checking 32bit signed integer relation
  • v_if_f64(v10<>s20) //vector if with 64bit float operand (and a 64bit float scalar)

Possible types for s_if are: i32, u32. And for v_if: i32, u32, i64, u64, f32, f64.

_while(), _endw: Makes a while block. You must use the same prefixes and suffixes for the _while macro as you would for the _if macro.

_repeat, _until(): Makes a repeat-until block. The prefix and suffix must be specified for _until().

_break, _continue: Can be used inside a _while-_endw or a _repeat-_until block.

Memory IO macros

dwAddr is a dword index. uavId is 0-based. AOption can be one or more options of the tbuffer_ instruction, for example glc.

uavWrite(uavId,dwaddr,value)
uavWrite(uavId,dwaddr,value,AOption)
uavRead(uavid, dwaddr,value)
uavRead(uavid, dwaddr,value,AOption)
cbRead(dwaddr,value)

note: These are so slow that they should not be used in an inner loop. But they provide easy access to memory.

 Accessing HW_INFO

These are easy-access macros for the bitfields of the HW_INFO value. The result is placed in the provided scalar reg.

getWaveId(ghwRes) 
getSIMDId(ghwRes) 
getCUId(ghwRes) 
getSHId(ghwRes) 
getSEId(ghwRes) 
getThreadGroupId(ghwRes) 
getVirtualMemoryId(ghwRes)
getRingId(ghwRes) 
getStateId(ghwRes)

And a complicated one that calculates the Global SIMD Id. You can identify the SIMD on which your program is running.

getGlobalSIMDId(ggsRes)

 GDS macros

gwAddr: dword index in GDS memory

gdsWrite(gwAddr,gwData)
gdsRead(gwAddr,gwData)
gdsAdd(gwAddr,gwData)

 Global Wave Synch

Id is a unique id chosen by you. gwsThreads: the total number of workgroups (or wavefronts, I'm not sure… the wrong one will crash :D)

gwsInit(gwsId,gwsThreads)
gwsBarrier(gwsId)

 Measuring execution time

_getTickInit     //initializes T0 time. All other timing macros will work relative to this.
getTick(gtRes)   //returns current time elapsed from T0 //with lame 32bit calculations
breakOnTimeOut(botTimeoutMS)  //ensures that a loop cannot be infinite. Calls s_endpgm if timeOutMS is reached.

 Kernel initialization

Must be called right after including stdgcn.inc.

AGrpSize: number of workItems in a workGroup. ANumVGPRS: number of vector regs to allocate. ALdsSizeBytes: as its name says.

KernelInitUUUC(AGrpSize,ANumVGPRS,ALdsSizeBytes)  //3 UAVs and 1 ConstBuffer

Other buffer variants implemented: UU, UC, U

 


Macro Preprocessor Documentation

Latest version can be downloaded from here: http://x.pgy.hu/~worm/het/hp
Suggested Catalyst version: 12.10; it also works with 13.4 (new ELF format), except for the disassembler. Later Catalyst versions haven't been tested.

Hi and Happy New Year!

As my script/asm compiler is getting more complicated, it is time to summarize everything that helps make coding in GCN ASM easier. The macro preprocessor runs right before the pascal interpreter, and it has a syntax close to that of the C macro preprocessor.

Defining a macro: #define, #assign, #undef

#define something               //just define it, equals ""

#define add(a,b) ((a)+(b))      //define a simple macro
#define add(a,b,c) ((a)+(b)+(c))//overload it with 3 parameters

#define multiline 1+  \
                  2             //multiline macro

{#define multiline2 1+
2 }                             //another form of multiline macro

#macro ensureRange(x, min, max) //multiline macro, best syntax
  if(x<min)then
    x:=min
  else if(x>max)then
    x:=max;
#endm

#assign two 1+1                 //'assign' evaluates the expression

#undef something                //undefine an existing macro

var x:=-0;                      //test some of the macroes
ensureRange(x,1,5);
writeln(x, ',', add(1,2,3), ',', two);    //the output should be 1,6,2

The # operator

# converts the contents of the following identifier to a string.
## concatenates its two operands.
Using whitespace next to the #, ## operators is not allowed.

Predefined macros:

There are some predefined macros: __file__, __line__, __date__, __time__

Conditional code generation can be done with __ifdef__(), __ifndef__(), __if__(). The first parameter is the condition, and the rest of the parameters are the code which will be compiled if the condition is met. You must ensure the proper use of brackets(), otherwise use a macro.

#define a
__ifdef__(a, writeln('a is defined'));
__ifndef__(b, writeln('b is undefined'));
#define a 15
__if__(a>=10, writeln('a>=10'));

Iterations can be written with the __for__() macro:

__for__(i in [1, 2, 5..8], write(i); ) //125678
__for__(i:=1 to 8 where i not in [3, 4], write(i); ) //125678
__for__(i:=1 to 8 step 2, write(i); ) //1357

Special macros:

#include pastes a file into the source code. No "" or <> needed, just the filename. First it searches for the file in the directory of the current .hpas file. If it can't find it there, it tries the include\ directory (which resides in hetpas.exe's directory).

It is possible to generate program code with actual program code. In the #script directive you can place a pascal script. Everything passed to write() and writeln() will be pasted into the source code:

{#script writeln('writeln("this is scripted");');}

Because this type of macro originates from back when I made a preprocessor for Delphi, it must be enclosed in {} comment brackets.

ASM macros:

There are two macros that help nesting assembly code into the pascal script: asm_il() and asm_isa(). They work just like quotation marks when defining string literals: they isolate pascal script code from assembly code, so the IDE can provide proper syntax highlighting and code-insight functionality.
When including a file, these macros are stripped automatically, so an included file can be used inside an asm() block while still having proper syntax highlighting on its own.
There is also a way to insert evaluated script into an asm block: if the preprocessor finds something like ![expression] in the asm block, it evaluates the expression and inserts its result into the code.

writeln(asm_isa(This code is displayed in !["GCN"+" ISA"] syntax s_endpgm));
//output is: This code is displayed in GCN ISA syntax s_endpgm
