Yesterday I achieved a 2.65x boost over the OCL version with 128 VRegs and <32 KB code size. Today it's time to optimize the inner loop.
Reordering the instructions in the inner loop – not needed
It turned out that it is not necessary; the first version was fine. GCN works well when LDS and ALU instructions are grouped together; they do not need to be interleaved the way the OCL compiler does it so desperately.
So the final order of instructions is the following:
- extract the 8 bytes (and multiply them by 8 to get LDS byte offsets)
- add Tn offsets
- initiate the load of 8×8 bytes from LDS
- wait for all 8 LDS operations
- xor the data into result
These well-separated groups of instructions run fast and look simple with macros.
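To make the data flow of those steps concrete, here is a minimal Python sketch of one 8-way lookup-and-xor step. It models the LDS as a flat array of 64-bit words with each 256-entry table starting $800 bytes apart (as the $800*i offsets in the macro below suggest); the function name and layout are illustrative, not the actual kernel's.

```python
# Sketch of one 8-way table-lookup step, mirroring the instruction groups above.
# "lds" models the LDS as a flat list of 64-bit words; each 256-entry table
# starts 0x800 bytes (= 256 qwords) after the previous one.

def lookup_xor(index_bytes, lds):
    assert len(index_bytes) == 8
    result = 0
    for t, b in enumerate(index_bytes):
        # extract a byte, scale by 8 for the LDS byte offset, add the Tn base
        byte_offset = t * 0x800 + b * 8
        # the ds_read_b64 equivalent, then xor into the result
        result ^= lds[byte_offset // 8]
    return result
```

The real code differs in that the address math for all 8 lookups is issued first, then all loads, then one wait, then all xors, which is exactly the grouping described above.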
ver03 Hybrid lookups (LDS and L1 cooperation) 2.65x->3.48x=131%
In the original OCL version I noticed that out of every 8 lookups only 6 are served from LDS and the rest are done via MEM. Pallas told me that he wanted T0 and T1 to be in 'constant memory'. But on the GCN architecture there is no such memory type (as there is on Evergreen), so it used GPU RAM instead. From there I got the idea to reduce the LDS usage with the help of the L1 cache. The small lookup table can easily fit in the 8K (or 16K?) L1 cache, and according to the OpenCL programming manual the L1 peak is 4 bytes/cycle while the LDS peak is 8 bytes/cycle.
The baseline is 8 LDS lookups, which runs at 2.65x.
- First I tried to replace all 8 LDS lookups with 8 memory reads: -> 0.95x. OK, that's pretty slow; it seems it wasn't intended for random access.
- 1 Mem + 7 LDS: -> 3.01x. It's getting better.
- 2 Mem + 6 LDS: -> 3.48x. That is the best. Pallas did the same in the OpenCL kernel.
- 3 Mem + 5 LDS: -> 2.59x. And finally the L1 cache became the bottleneck.
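A crude throughput model explains the trend. Assuming the manual's peak figures (L1: 4 bytes/cycle, LDS: 8 bytes/cycle), that each lookup moves 8 bytes, and that the two paths overlap so the slower one dominates (all assumptions of this sketch, not measurements):

```python
# Back-of-envelope cycle model for splitting the 8 qword lookups between
# the L1 path (4 B/cycle peak) and LDS (8 B/cycle peak). Assumes the two
# units work in parallel, so the slower path dominates. Illustrative only;
# real latency and occupancy effects differ.

def cycles(mem_lookups, lds_lookups, bytes_per_lookup=8):
    l1_cycles = mem_lookups * bytes_per_lookup / 4.0   # L1: 4 bytes/cycle
    lds_cycles = lds_lookups * bytes_per_lookup / 8.0  # LDS: 8 bytes/cycle
    return max(l1_cycles, lds_cycles)

for mem in range(4):
    print(mem, "Mem +", 8 - mem, "LDS ->", cycles(mem, 8 - mem), "cycles")
```

The model gives 8, 7, 6, 6 cycles for 0..3 Mem lookups: at 2 Mem the LDS is still the (smaller) bottleneck, and at 3 Mem the L1 takes over, matching the measured drop at 3 Mem + 5 LDS.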
So the code looks like this now:
#macro RBTT(dst, didx, src, i0, i1, i2, i3, i4, i5, i6, i7) //2MEM 6LDS 3.48
  enter
    v_temp addr
    v_temp data align:2
    v_and_b32     addr, $FF, src[i0*2+0]
    v_bfe_u32     addr, src[i1*2+0], 8, 8
    __for__(i in[0..1], v_add_i32 addr[i], vcc, $100*i+$20, addr[i])
    tbuffer_load_format_xy dst[didx*2], addr, UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT]
    __for__(i in[1..1], tbuffer_load_format_xy data[i*2], addr[i], UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT])
    v_bfe_u32     addr, src[i2*2+0], 16-3, 8+3
    v_lshrrev_b32 addr, 24-3, src[i3*2+0]
    v_and_b32     addr, $FF, src[i4*2+1]
    v_lshlrev_b32 addr, 3, addr
    v_bfe_u32     addr, src[i5*2+1], 8-3, 8+3
    v_bfe_u32     addr, src[i6*2+1], 16-3, 8+3
    v_lshrrev_b32 addr, 24-3, src[i7*2+1]
    __for__(i in[2..7], v_add_i32 addr[i], vcc, $800*i, addr[i])
    __for__(i in[2..7], ds_read_b64 data[i*2], addr[i])
    s_waitcnt vmcnt(0)
    __for__(i in[1..1],
      v_xor_b32 dst[didx*2  ], dst[didx*2  ], data[i*2  ]
      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
    s_waitcnt lgkmcnt(0)
    __for__(i in[2..7],
      v_xor_b32 dst[didx*2  ], dst[didx*2  ], data[i*2  ]
      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
  leave
#endm
Some more failed attempts
I tried using only one T0 table and doing the 64-bit byte-rotates manually with the v_bytealign instruction -> nothing: no benefit from reading the same region of LDS, because I forgot that broadcasting only occurs across workitems, not across consecutive LDS reads. But also no slowdown from the additional math, because the V-ALU is not the bottleneck.
Tried to replace the $800*i constants with pre-initialized S-regs to make the instruction stream smaller. Less than 1% speedup, so it was not necessary, as the average instruction stream doesn't exceed 12 bytes/cycle; I just had to try it to make sure. I had some experiments long ago regarding instruction stream density versus the number of WFs/V-ALU. It turned out that a low number of WFs (we have only 2 now) combined with 64-bit vector instructions interleaved with 64-bit scalar instructions is a bad combination.
Making long kernels that do multiple GroestlCoin hashes in a loop. Pallas was right, it's not necessary (below 1%). The algorithm takes so much time by default that the initialization of the LDS is negligible in comparison, so it's not worth complicating the code with batch processing.
Maybe there's 9% more if I can make it run with 84 VGPRs. (I broke the algorithm by using only 3 ulong arrays, so it needed only 90 regs. At that point it ran at 3.42x; then I switched the kernel to allocate 84 regs and the speed became 3.81x. But it's tricky, because regs 84..89 became zeroes, and maybe that produced some easier memory patterns which caused the speedup.) If I split the 8 qword reads into 2×4 qword reads, I need 12 fewer VRegs, but the speed also drops by 5%. And I would also have to swap ulong arrays, so this optimization isn't worth it anyway.
What else is left
Now that I'm stuck at 3.48x, I'm finally kind of out of ideas. One last thing would be the first- and last-round optimization: as the whole message (except 1 dword) is known on the CPU side, it would be wise to calculate those things on the CPU and let the thousands of GPU threads calculate only what they really have to. And in the last round there are lots of unused values. These are the kinds of optimizations I'd rather have solved by a machine; in the past I did such expression graph optimizations that way, but for this particular problem there is too little benefit. Let's say I'm very optimistic and can eliminate 2 whole rounds: then the speedup would be only 5%. In reality I think maybe 75% of the first round can be done on the CPU, and a similar amount of calculation is unused in the last round, so maybe it's 2.5%. There are simply so many calculations in between compared to the first/last round.
…will be to make it work inside a miner program and to ensure that it works correctly. Are there common test vectors for this, or something? I don't know whether miner programs are flexible with kernel parameters or I have to follow the original parameters. I don't know how the driver passes single-value parameters, for example.
(Check the Download Area for latest version)