Optimizing Groestl hash function in GCN ASM – Part 2.

Yesterday I achieved a 2.65x boost over the OpenCL version with 128 VRegs and <32KB code size. Today it is time to optimize the inner loop.

Reordering the instructions in the inner loop – not needed

It turned out that it is not necessary; the first version was OK. GCN works well when LDS and ALU instructions are grouped together; they do not need to be interleaved the way the OpenCL compiler does it so desperately.

So the final order of instructions is the following:

  • extract the 8 bytes (and multiply them by 8 for the LDS byte offset)
  • add Tn offsets
  • initiate the load of 8×8 bytes from LDS
  • wait for all 8 LDS operations
  • xor the data into result

These well-separated groups of instructions run fast and look simple with macros.
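To make the byte-lookup structure concrete, here is a minimal C model of what one such group computes per output qword (the names are mine, not from the kernel; on the GPU the tables live in LDS and src/dst are VGPR pairs):

#include <stdint.h>

extern const uint64_t T[8][256]; /* the 8 Groestl lookup tables */

/* sketch: byte i of the selected state qword indexes table Ti,
   and the eight 64-bit entries are xor-ed together */
static uint64_t lookup_group(const uint64_t src[8], const int sel[8])
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t b = (uint8_t)(src[sel[i]] >> (8 * i)); /* extract byte i */
        r ^= T[i][b];                                  /* table lookup + xor */
    }
    return r;
}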

ver03 Hybrid lookups (LDS and L1 cooperation) 2.65x->3.48x=131%

In the original OCL version I noticed that out of every 8 lookups only 6 were done from LDS and the rest were done from memory. Pallas told me that he wanted T0 and T1 to be in ‘constant memory’. But on the GCN architecture there is no such type of memory (as there is on Evergreen), so it used GPU RAM instead. From there I got the idea to reduce the LDS usage with the help of the L1 cache. The small lookup table easily fits in the 8K (or 16K?) L1 cache, and according to the OpenCL programming manual the L1 peak is 4 bytes/cycle while the LDS peak is 8 bytes/cycle.

The baseline is 8 LDS lookups, which gives 2.65x.

  • First I tried to replace all 8 LDS lookups with 8 memory reads: -> 0.95x  OK, that's pretty slow; it seems this path wasn't intended for random access.
  • 1 Mem + 7 LDS: -> 3.01x  It's getting better.
  • 2 Mem + 6 LDS: -> 3.48x  That is the best; it is also what Pallas did in the OpenCL kernel.
  • 3 Mem + 5 LDS: -> 2.59x  And finally the L1 cache became the bottleneck.
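A quick sanity check, assuming the quoted peaks: 8 LDS lookups move 8*8 = 64 bytes at 8 bytes/cycle, roughly 8 cycles per group. With 2 lookups moved to L1, LDS moves 48 bytes (~6 cycles) while L1 moves 16 bytes (~4 cycles) in parallel, so ~6 cycles per group; 8/6 ≈ 1.33 is close to the measured 3.48/2.65 ≈ 1.31. At 3 Mem + 5 LDS the L1 side would already need ~6 cycles even at peak, and with random access it clearly falls well below that peak.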

So the code looks like this now:

#macro RBTT(dst, didx, src, i0, i1, i2, i3, i4, i5, i6, i7) //2MEM 6LDS 3.48
enter
  v_temp addr[8]
  v_temp data[16] align:2

  // lookups 0..1: bytes 0 and 1 of the row, served from memory through L1
  v_and_b32 addr[0], $FF, src[i0*2+0]
  v_bfe_u32 addr[1], src[i1*2+0], 8, 8
  __for__(i in[0..1], v_add_i32 addr[i], vcc, $100*i+$20, addr[i] ) // element index into the UAV: $100*i selects T0/T1, plus a $20 base
  tbuffer_load_format_xy dst[didx*2], addr[0], UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT] // first lookup initializes dst
  __for__(i in[1..1], tbuffer_load_format_xy data[i*2],  addr[i], UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT])

  // lookups 2..7: bytes 2..7 of the row, served from LDS; the addresses
  // arrive pre-multiplied by 8 (see the bfe trick below)
  v_bfe_u32 addr[2], src[i2*2+0], 16-3, 8+3
  v_lshrrev_b32 addr[3], 24-3, src[i3*2+0]
  v_and_b32 addr[4], $FF, src[i4*2+1]
  v_lshlrev_b32 addr[4], 3, addr[4]
  v_bfe_u32 addr[5], src[i5*2+1], 8-3, 8+3
  v_bfe_u32 addr[6], src[i6*2+1], 16-3, 8+3
  v_lshrrev_b32 addr[7], 24-3, src[i7*2+1]
  __for__(i in[2..7], v_add_i32 addr[i], vcc, $800*i, addr[i] ) // LDS byte offset of Ti ($800 = 256 entries * 8 bytes)

  __for__(i in[2..7], ds_read_b64 data[i*2], addr[i])
  s_waitcnt vmcnt(0)   // wait for the 2 memory lookups
  __for__(i in[1..1], v_xor_b32 dst[didx*2  ], dst[didx*2  ], data[i*2  ]
                      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
  s_waitcnt lgkmcnt(0) // wait for the 6 LDS lookups
  __for__(i in[2..7], v_xor_b32 dst[didx*2  ], dst[didx*2  ], data[i*2  ]
                      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
leave
#endm
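One detail worth pointing out in the LDS address math above: the v_bfe_u32 instructions extract an 11-bit field starting 3 bits early (offset 8-3, width 8+3), so the table byte arrives already multiplied by 8 and a separate shift is saved. The 3 garbage low bits come from the neighbouring byte, and the 8-byte ds_read_b64 apparently ignores them. A tiny C check of that equivalence:

#include <assert.h>
#include <stdint.h>

int main(void)
{
    for (uint32_t x = 0; x < 0x10000; x++) {       /* only bits 0..15 matter here */
        uint32_t fused = (x >> 5) & 0x7FF;         /* v_bfe_u32 a, x, 8-3, 8+3 */
        uint32_t exact = ((x >> 8) & 0xFF) << 3;   /* extract byte 1, then *8 */
        assert((fused & ~7u) == exact);            /* they differ only in the low 3 bits */
    }
    return 0;
}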

Some more failed attempts

I tried using only one T0 table and doing the 64-bit byte-rotates manually with the v_bytealign instruction -> nothing: no benefit from reading the same region of LDS, because I forgot that broadcasting only occurs across workitems, not across consecutive LDS reads. But also no slowdown from the additional math, because the V-ALU is not the bottleneck.
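For reference, the single-table idea relies on the other seven tables being byte-rotations of T0, so each entry could be reconstructed on the fly. A C sketch of the reconstruction (the rotation direction is my assumption here; on GCN one 64-bit byte-rotate would cost two v_bytealign ops):

#include <stdint.h>

extern const uint64_t T0[256]; /* the single table kept in LDS */

/* Ti[b] as a byte-rotation of T0[b] */
static uint64_t Ti(int i, uint8_t b)
{
    uint64_t v = T0[b];
    int r = 8 * i;                                 /* rotate by i bytes */
    return r ? (v >> r) | (v << (64 - r)) : v;     /* guard: v << 64 is UB in C */
}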

I tried replacing the $800*i constants with pre-initialized S-regs to make the instruction stream smaller (a vector add with a 32-bit literal encodes to 8 bytes, while with the constant in an S-reg it is only 4). It gave less than 1% speedup, so it was not necessary, as the average instruction fetch doesn't exceed 12 bytes/cycle here. I just had to try it to be sure. Long ago I did some experiments on instruction stream density versus the number of WFs per V-ALU, and it turned out that a low number of WFs (we have only 2 now) combined with 64-bit vector instructions interleaved with 64-bit scalar instructions is a bad combination.

Making long kernels that do multiple GroestlCoin hashes in a loop: Pallas was right, it's not necessary (below 1%). The algorithm takes so much time by default that the initialization of the LDS is negligible next to it, so it's not worth complicating the kernel with batch processing.

Maybe there's 9% more if I can make it run with 84 VGPRs. (I broke the algorithm by using only 3 ulong[16] arrays, so it needed only 90 regs. At that point it was at 3.42x; then I switched the kernel to allocate 84 regs and the speed became 3.81x. But it's tricky, because regs 84..89 became zeroes, and maybe that just produced some easier memory patterns which caused the speedup. If I split the 8-qword reads into 2x4-qword reads I need 12 fewer VRegs, but the speed also drops by 5%. And I would also have to swap the ulong[16] arrays, so this optimization isn't worth it anyway.)

What else is left

Now that I'm stuck at 3.48x, I'm finally kind of out of ideas. One last thing would be a first- and last-round optimization: as the whole message (except 1 dword) is known on the CPU side, it would be wise to calculate those parts on the CPU and let the thousands of GPU threads calculate only the data that really varies. And in the last round there are lots of unused values. These are the kinds of optimizations I'd rather leave to a machine; in the past I did such expression-graph optimizations that way, but for this particular problem there is too little benefit. Let's say I'm very optimistic and can eliminate 2 whole rounds: the speedup would still be only 5%. In reality I think maybe 75% of the first round can be done on the CPU, and a similar amount of calculation is unused in the last round, so maybe it's 2.5%, simply because there are so many calculations in between compared to the first/last round.

Final step…

…will be to make it work inside a miner program and ensure that it works correctly. Are there common test vectors for this, or something? I don't know whether miner programs are flexible with kernel parameters or I have to follow the original parameters. I don't know how the driver passes single-value parameters, for example.

(Check the Download Area for the latest version.)
