Making the GCN ASM Groestl kernel to run inside sgminer 5.1

There are basically two ways to do this:

  • Make the kernel binary 100% compatible with the original kernel
  • Apply modifications in sgminer to handle a my custom binary.

For the first option I have to mimic the kernel parameters of the original kernel. The following is a small dummy kernel that uses the same parameters:

__kernel void search(__global unsigned char* block, volatile __global uint* output, const ulong target) {
    output[get_local_id(0)] = block[get_global_id(0)];

The simple program in it ensures that none of paramerers are optimized out byte the OCL compiler. Let’s see how it looks like in asm (cat14.9):

  s_buffer_load_dword s0, s[4:7], 0x18        //domain gid base
  s_lshl_b32 s1, s12, 8                       //s1:groupid*256
  s_buffer_load_dword s4, s[8:11], 0x00       //s4: block ofs
  s_buffer_load_dword s5, s[8:11], 0x04       //s5: output ofs
  s_buffer_load_dwordx2 s[6:7], s[8:11], 0x08 //s[6:7] target
  s_waitcnt lgkmcnt(0)
  s_add_u32 s0, s1, s0                        //s0: domain base id
  v_lshl_b64 v[1:2], 0, 0                     //fill with zero
  v_cmp_ne_i64 vcc, s[6:7], v[1:2]            //target>0?
  s_cbranch_vccz label_0018
  s_add_u32 s0, s0, s4                        //s0: threadgroup base id + block ofs
  s_load_dwordx4 s[8:11], s[2:3], 0x60        //s[8..11]: block res
  v_add_i32 v1, vcc, s0, v0                   //v1:gid+block ofs
  s_waitcnt lgkmcnt(0)
  buffer_load_ubyte v1, v1, s[8:11], 0 offen  //block[gid]
  s_load_dwordx4 s[0:3], s[2:3], 0x68         //s[0..3]: output res
  v_lshlrev_b32 v0, 2, v0                     //v0:lid*4
  v_add_i32 v0, vcc, s5, v0                   //v0:lid*4+output base
  s_waitcnt vmcnt(0) & lgkmcnt(0)
  tbuffer_store_format_x v1, v0, s[0:3], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] //output[lid]
  s_waitcnt vmcnt(0) & expcnt(0)

Well this is pretty much code for doing a little more than nothing.
We can access the target int64 parameter indirectly form buffer s[8:11], it is loaded into s[6:7].
get_local_id(0) is in v0.
threadgroupId is in s1.
There are 3 resources provided to the kernel by the driver:
s[4:7] contains kernel domain parameters. The only thing used in this is the lowest index value of the 1D kernel domain rangle.
s[8:11] is a table full of base pointers (and sizes?) for the parameter buffers. It also contains value parameters.
s[2:3] this is actually a 64bit pointer that points to the table the of resources of out 2 buffer parameters.
So these things are provided at start. Following is the list of things that must be calculated:
get_global_id(0) is calculated as threadGroupId*256+get_local_id(0)+domain_base_id
BlockResource is loaded from s[2:3] along with BlockBaseOffset from s[8:11]
The same from OutputResource and OutputBase.

And now it is possible to access the parameters ‘block’ and ‘output’. IMO it is a bit messy, and the real problem is not even the mess: It’s the possibility to change all these input registers and functionality in future versions of the GPU drivers.

One additional buffer is not considered yet: In the kernel there are some big constant arrays. The OpenCL environment pass these constant in a buffer silently. I just don’t know how is this works in the binary .ELF format. I can only put the data into the code stream and with the s_getpc_b64 instruction I can calculate the address of it and read it from there. But here comes an additional problem with the alignment of that data.

And finally as if it is not complicated enough: I’d have to modify my binary generator to be able to work with not only buffer parameters but from now with ordinal/float parameters as well.

Altogether I think this is the better way and not to make modifications in sgminer.

ver04: Packing the T tables into the code

As the orriginal kernel passes T0 and T1 in a silent buffer to the kernel, I have to do it in asm with a workaround.
First I generated a file with 16KB of constant declarations in asm style:

//contents of file
dd $A5F432C6
dd $C6A597F4
dd $84976FF8
...           //T0..T7: 16KB of data
dd $6D6DD6DA
dd $4E3A4E62
dd $2C2C3A58

Next step is to include this into the code:

  s_getpc_b64 s[40:41]  
  s_branch @tableEnd
  aligncode 16         //aligns the next code to the next 16 byte boundary 
  alias constBuf = s[40:43]  //make constbuf resource
  s_sub_u32 s40, s40, 4           \ s_subb_u32 s41, s41, 0 //beginning of code
  s_add_u32 s40, s40, @tableBegin \ s_addc_u32 s41, s41, 0 //physical offset of the table 
  s_or_b32  s41, s41, $00080000   //8byte record size
  s_mov_b32 s42, $FFFFFF00
  s_mov_b32 s43, $002A7204 

The above code first saves the value of the program counter, then includes the table data, and finally builds a 128bit resource constant in s[40:43] which points to the beginning of the table. Later in the program, this resource constant (called ‘constBuf’) will used for the lookups. Just a note to self, that it would be useful to be able to use @address labels in arithmetic expressions too (eg. @tableBegin-4).
The total code size is now 42KB, but the actual code which must fit into the ICache is 16KB less, only 26KB. The execution time of the new code remained the same as before.

Emulating the original kernel parameters

New things to be implemented in the compiler and on the host side:

1. There must be a way to specify the kernel header manually. The current system only works on buffers now (constant and uav buffers, although there is no difference between the two on GCN). So there is a second optional parameter for the TClDevice.NewKernel() function: oclSkeleton.

var isacode:=asm_isa(
  numthreadpergroup 256
  numsgprs 48
  numvgprs 128

var skeleton:=
  "void main(__global unsigned char* block, volatile __global uint* output, const ulong target)
     if(target>0) output[get_local_id(0)] = block[get_global_id(0)]; //must use all parameters here

var var isakernel := dev.newKernel(isacode, skeleton); //<- optional skeleton opencl code

If the skeleton parameter is specified, then the base kernel will be generated by that code instead of an automatic skeleton code based on the oclBuffers instruction. In this the example the skeleton code specifies the same parameters as those used in the kernel.

2. Also I have to enable to pass values (not just buffers) to the kernel in the host program. From now this can be realized using TClKernel.SetArg() calls:

isakernel.SetArg(0, block);
isakernel.SetArg(1, output);
isakernel.SetArg(2, int64(1234)); //must cast to int64 because otherwise it would be a simple 32bit integer.

3. Letting the kernel domain’s low range to be nonzero (global_work_offset). As I have seen sgminer is actively using this. So from now there is a new TClKernel.RunRange(offset, size, buf0, buf1, …) function.

isakernel.runRange(39,85).waitfor(); //that means run 85 workitems starting from gid=39. Also wait for it.

Testing dummy code with original kernel parameters

Here’s the assembly dummy code that does the same operation at it’s skeleton code:


var i,j,k,dev:=findGCNDevice;   //or find the first GCN device.

var isacode:=asm_isa(
  numthreadpergroup 256
  numvgprs 128
  numsgprs 48
  ldssize 16384

  s_lshl_b32            s1, s12, 8                //groupid*256
  s_buffer_load_dword   s0      , s[4: 7], 0x18   //global_work_offset
  s_buffer_load_dword   s4      , s[8:11], 0x00   //block ofs
  s_buffer_load_dword   s5      , s[8:11], 0x04   //output ofs
  s_buffer_load_dwordx2 s[ 6: 7], s[8:11], 0x08   //target
  s_load_dwordx4        s[12:15], s[2: 3], 0x60   //block res
  s_load_dwordx4        s[16:19], s[2: 3], 0x68   //output res
  s_waitcnt     lgkmcnt(0)
  s_add_u32     s0, s1, s0                        //workgroup gid base
  s_add_u32     s12, s12, s4 \ s_addc_u32 s13 ,s13, 0  //adjust block res with offset
  s_add_u32     s16, s16, s5 \ s_addc_u32 s17 ,s17, 0  //adjust output res with offset
  v_add_i32     v1, vcc, s0, v0                   //gid
  alias lid = v0, gid = v1, resBlock = s[12:15], resOutput = s[16:19], target = s[6:7]

  v_temp_range 2..127
  s_temp_range 0..5, 8..11, 20..47

  v_temp zero[2], data, addr
  v_lshl_b64    zero[0], 0, 0                      //fill with zero
  v_cmp_gt_i64  vcc, target, zero[0]               //target>0?
    buffer_load_ubyte  data, gid, resBlock, 0 offen  //block[g]
    v_lshlrev_b32  addr, 2, lid                      //v0:lid*4
    s_waitcnt vmcnt(0)
    tbuffer_store_format_x  data, addr, resOutput, 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]

var skeleton:="void main(__global unsigned char* block, volatile __global uint* output, const ulong target)
               { if(target>0) output[get_local_id(0)] = block[get_global_id(0)]; }";

var isakernel := dev.newKernel(isacode, skeleton);

var block:=dev.newBuffer('rw', 512);
for i:=0 to 127 do block.Ints[i]:=i;

var output:=dev.newBuffer('rw', 256*4);

with isakernel do begin
  SetArg(0, block);
  SetArg(1, output);
  SetArg(2, int64(1));

  for i:=0 to 255 do writeln(i, ' ', output.Ints[i]);

As I’m using variable names in asm it’s not that messy as the raw disassembled version. Next time I’ll have to combine this with the groestl miner asm code and build the new binary. After that I gonna modify the functional test to make 100% sure, that not only the groestl calculations but also the thread-indexing is working correctly. With the new testing method, I’ll start at a nonzero global_offset and will use a relatively high target value to get around 100 hits. If I put it inside an iteration and give it randomized blocks and work_offsets, I’ll prove that the new kernel works exactly like the original one.

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s