Optimizing Groestl hash function in GCN ASM – Part 2.

Yesterday I achieved a 2.65x boost over the OpenCL version with 128 VRegs and <32KB code size. Today it is time to optimize the inner loop.

Reordering the instructions in the inner loop – not needed

It turned out that it is not necessary; the first version was OK. GCN works well when LDS and ALU instructions are grouped together; they do not need to be interleaved the way the OCL compiler does it so desperately.

So the final order of instructions is the following:

  • extract the 8 bytes (and multiply them by 8 for the LDS byte offset)
  • add Tn offsets
  • initiate the load of 8×8 bytes from LDS
  • wait for all 8 LDS operations
  • xor the data into result

These well-separated groups of instructions run fast, and look simple with macros.
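As a sanity check, the address phase of these steps can be modeled in Python (a sketch only; the real code works on 64-bit lanes across a whole wavefront):

```python
def lds_addresses(qword):
    """Model of the first two steps above: extract each of the 8 bytes,
    scale by 8 (each T-table entry is 8 bytes wide), then add table k's
    base offset k*0x800 so byte k indexes its own table Tk."""
    return [(((qword >> (8 * k)) & 0xFF) << 3) + k * 0x800
            for k in range(8)]

# byte pattern 07 06 05 04 03 02 01 00 -> address k*0x800 + k*8
print([hex(a) for a in lds_addresses(0x0706050403020100)])
```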

ver03 Hybrid lookups (LDS and L1 cooperation) 2.65x->3.48x=131%

In the original OCL version I noticed that of every 8 lookups only 6 are served from LDS and the rest are done from memory. Pallas told me that he wanted T0 and T1 to be in ‘constant memory’. But on the GCN architecture there is no such memory type (as Evergreen has), so it used GPU RAM instead. From there I got the idea to reduce LDS usage with the help of the L1 cache. The small lookup table easily fits in the 8K (or 16K?) L1 cache, and according to the OpenCL programming manual the L1 peak is 4 bytes/cycle while the LDS peak is 8 bytes/cycle.

The base situation is 8 LDS lookups, which is 2.65x.

  • First I tried to replace all 8 LDS lookups with 8 memory reads: -> 0.95x  OK, that’s pretty slow; it seems L1 wasn’t intended for random access.
  • 1 Mem + 7 LDS: -> 3.01x  It’s getting better.
  • 2 Mem + 6 LDS: -> 3.48x  That is the best. Pallas did the same in the OpenCL kernel.
  • 3 Mem + 5 LDS: -> 2.59x  And finally the L1 cache became the bottleneck.
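The sweet spot at 2 Mem + 6 LDS matches a back-of-the-envelope model built from the peak figures quoted above (8 bytes/cycle for LDS, 4 bytes/cycle for L1, 8-byte table entries). This is an idealized sketch that ignores latency and cache misses:

```python
def cycles_per_8_lookups(mem_lookups, lds_peak=8, l1_peak=4, entry=8):
    """Lower bound on cycles for one 8-way lookup group: the slower of
    the two data paths (LDS vs L1) limits the loop."""
    lds_bytes = (8 - mem_lookups) * entry
    l1_bytes = mem_lookups * entry
    return max(lds_bytes / lds_peak, l1_bytes / l1_peak)

for m in range(4):
    print(m, "Mem:", cycles_per_8_lookups(m), "cycles")
```

The model bottoms out at 2 memory lookups (6 cycles versus 8 for pure LDS), and from 3 upward the L1 path becomes the binding constraint, in line with the measurements.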

So the code looks like this now:

#macro RBTT(dst, didx, src, i0, i1, i2, i3, i4, i5, i6, i7) //2MEM 6LDS 3.48
enter
  v_temp addr[8]
  v_temp data[16] align:2

  v_and_b32 addr[0], $FF, src[i0*2+0]
  v_bfe_u32 addr[1], src[i1*2+0], 8, 8
  __for__(i in[0..1], v_add_i32 addr[i], vcc, $100*i+$20, addr[i] )
  tbuffer_load_format_xy dst[didx*2], addr[0], UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT]
  __for__(i in[1..1], tbuffer_load_format_xy data[i*2],  addr[i], UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT])

  v_bfe_u32 addr[2], src[i2*2+0], 16-3, 8+3
  v_lshrrev_b32 addr[3], 24-3, src[i3*2+0]
  v_and_b32 addr[4], $FF, src[i4*2+1] v_lshlrev_b32 addr[4], 3, addr[4]
  v_bfe_u32 addr[5], src[i5*2+1], 8-3, 8+3
  v_bfe_u32 addr[6], src[i6*2+1], 16-3, 8+3
  v_lshrrev_b32 addr[7], 24-3, src[i7*2+1]
  __for__(i in[2..7], v_add_i32 addr[i], vcc, $800*i, addr[i] ) 

 __for__(i in[2..7], ds_read_b64 data[i*2], addr[i])
 s_waitcnt vmcnt(0)
  __for__(i in[1..1], v_xor_b32 dst[didx*2 ], dst[didx*2 ], data[i*2 ]
                      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
  s_waitcnt lgkmcnt(0)
  __for__(i in[2..7], v_xor_b32 dst[didx*2 ], dst[didx*2 ], data[i*2 ]
                      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
leave
#endm

Some more failed attempts

I tried using only one T0 table and doing the 64-bit byte-rotates manually with the v_bytealign instruction -> nothing: no benefit from reading the same region of LDS, because I forgot that broadcasting only occurs across workitems, not across consecutive LDS reads. But also no slowdown from the additional math, because the V-ALU is not the bottleneck.

I tried replacing the $800*i constants with pre-initialized S-regs to make the instruction stream smaller. Less than 1% speedup, so it was not necessary, as the average instruction fetch doesn’t exceed 12 bytes/cycle. I just had to try it to make sure. Long ago I ran some experiments regarding instruction stream density and the number of WFs/V-ALU. It turned out that a low number of WFs (we have only 2 now) combined with 64-bit vector instructions interleaved with 64-bit scalar instructions is a bad combination.

Making long kernels that do multiple GroestlCoin hashes in a loop: Pallas was right, it’s not necessary (below 1%). The algorithm takes so long by default that the LDS initialization costs nothing in comparison, so it’s not worth complicating things with batch processing.

Maybe there’s 9% more if I can make it run with 84 VGPRS. (I broke the algorithm by using only 3 ulong[16] arrays, and that way it needed only 90 regs. At that point it was at 3.42x; then I switched the kernel to allocate 84 regs and the speed became 3.81x. But it’s tricky, because regs 84..89 became zeroes, and maybe that produced some easier memory patterns which caused the speedup. If I split the 8-qword reads into 2*4-qword reads I need 12 fewer VRegs, but the speed also drops by 5%. And I would also have to swap ulong[16] arrays, so this opt isn’t worth it anyway.)

What else is left

Now that I’m stuck at 3.48x, I’m finally kind of out of ideas. One last thing would be the first- and last-round optimization: as the whole message (except 1 dword) is known on the CPU side, it would be wise to calculate those things on the CPU and let the thousands of GPU threads calculate only what they really have to. And in the last round there are lots of unused values. These are the kinds of optimizations I’d rather have solved automatically by a machine. In the past I did such expression-graph optimizations that way, but for this particular problem there is too little benefit. Let’s say I’m very optimistic and can eliminate 2 whole rounds: then the speedup would be only 5%. In reality I think maybe 75% of the first round can be done on the CPU, and a similar amount of the calculations are unused in the last round. So maybe it’s 2.5%, simply because there are so many calculations in between compared to the first/last round.

Final step…

…will be to make it work inside a miner program and ensure that it works correctly. Are there common test vectors for this, or something? I don’t know whether miner programs are flexible with kernel parameters or whether I have to follow the original parameters. I don’t know how the driver passes single-value parameters, for example.

(Check the Download Area for latest version)


Optimizing Groestl hash function in GCN ASM – Part 1.

Last post was about making the first assembly version that just runs correctly. Luckily it became 1.17x faster than the original high-level version. Now I will apply a series of optimizations and see how effective they are. I’ll start with the most obvious ones.

ver01: Reduce VGPRS count down to 128 -> 196%

I left the original VGPRS count unchanged in ver00, just to see how important it is to let each of the GCN vector V-ALUs deal with not just one but two WaveFronts. When the V-ALU can choose from more than one WF, then whenever one WF begins to wait for LDS or RAM it can immediately switch to another WF. This effectively hides LDS and MEM latency without changing anything in the program code. (Latency hiding can be realized inside the code as well, but that needs more registers and more clever code. Letting the GCN chip parallelize the code in every cycle is much more elegant.)

So reducing the VGPRS count from 164 down to 128 improves the speedup from 1.17x to 2.30x. That’s a 196% gain.

It can be realized in asm like this:

numvgprs 128     v_temp_range 2..127

The “numvgprs” instruction tells the compiler how many regs it should allocate on the hardware. “v_temp_range” gives the compiler a range of vregs that make up a pool of regs it can temporarily hand out to code regions marked with enter/leave.

Just theoretical nonsense at the moment: I have tried other vgprs counts:
85: It’s just a bit bigger than the 3 WaveFronts/V-ALU limit, so it’s still 2 WFs/V-ALU. But somehow it’s 2.70x.
84: Every V-ALU can choose from not 2 but 3 WFs at any time. -> 4.20x
But of course it calculates wrong results. To make it work I would have to get by with 38 fewer VRegs. Maybe it’s worth trying in the future: the long-lived regions of the algorithm only need 2x ulong[16] arrays and some temps. That could fit under 85, and LDS could be used to swap out locally-unused data. But at this early stage it’s not worth bothering with, only to find out later that this optimization was unnecessary.
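These wavefront counts follow from the usual GCN occupancy rule: each SIMD has a 256-entry VGPR file, allocated (by my assumption here) in blocks of 4 registers:

```python
def wavefronts_per_simd(vgprs, regfile=256, granularity=4):
    """How many wavefronts fit on one SIMD for a given VGPR count."""
    alloc = -(-vgprs // granularity) * granularity  # round up to a block
    return regfile // alloc

for n in (164, 128, 85, 84):
    print(n, "VGPRs ->", wavefronts_per_simd(n), "WFs/SIMD")
```

164 regs allow a single wavefront, 128 allows two, 85 rounds up to 88 so still only two fit, while 84 is exactly the threshold for three — matching the 2.30x / 2.70x / 4.20x observations.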

ver02: Reduce code size to fit into 32KB ICache -> 115%

This is the second most obvious optimization in my opinion.
In order to do this I have to find repetitive tasks in the code and either put them into a loop or call them as a subroutine. For simplicity I’ll try to do it with loops.
Below is the previous code:

#macro PASS
 v_mov_b32 x[11*2+0], 0 v_mov_b32 x[11*2+1], 0
 v_mov_b32 x[12*2+0], 0 v_mov_b32 x[12*2+1], 0
 v_mov_b32 x[13*2+0], 0 v_mov_b32 x[13*2+1], 0
 v_mov_b32 x[14*2+0], 0 v_mov_b32 x[14*2+1], 0
 v_mov_b32 x[15*2+0], 0 v_mov_b32 x[15*2+1], M15hi

 v_xor_b32 x[15*2+1], H15hi, x[15*2+1]
 CNST_P(g, x,  0) ROUND_P(a, g) CNST_P(a, a,  1) ROUND_P(g, a)
 CNST_P(g, g,  2) ROUND_P(a, g) CNST_P(a, a,  3) ROUND_P(g, a)
 CNST_P(g, g,  4) ROUND_P(a, g) CNST_P(a, a,  5) ROUND_P(g, a)
 CNST_P(g, g,  6) ROUND_P(a, g) CNST_P(a, a,  7) ROUND_P(g, a)
 CNST_P(g, g,  8) ROUND_P(a, g) CNST_P(a, a,  9) ROUND_P(g, a)
 CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a)
 CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a)
 v_xor_b32 x[15*2+1], H15hi, x[15*2+1]

 CNST_Q(x, x,  0) ROUND_Q(a, x) CNST_Q(a, a,  1) ROUND_Q(x, a)
 CNST_Q(x, x,  2) ROUND_Q(a, x) CNST_Q(a, a,  3) ROUND_Q(x, a)
 CNST_Q(x, x,  4) ROUND_Q(a, x) CNST_Q(a, a,  5) ROUND_Q(x, a)
 CNST_Q(x, x,  6) ROUND_Q(a, x) CNST_Q(a, a,  7) ROUND_Q(x, a)
 CNST_Q(x, x,  8) ROUND_Q(a, x) CNST_Q(a, a,  9) ROUND_Q(x, a)
 CNST_Q(x, x, 10) ROUND_Q(a, x) CNST_Q(a, a, 11) ROUND_Q(x, a)
 CNST_Q(x, x, 12) ROUND_Q(a, x) CNST_Q(a, a, 13) ROUND_Q(x, a)

 __for__(i in[0..$1F], v_xor_b32 g[i], g[i], x[i] ) //combine P and Q
 __for__(i in[0.. $F], v_mov_b32 x[i], g[i+$10] )
 v_xor_b32 g[15*2+1], H15hi, g[15*2+1]

 CNST_P(g, g,  0) ROUND_P(a, g) CNST_P(a, a,  1) ROUND_P(g, a)
 CNST_P(g, g,  2) ROUND_P(a, g) CNST_P(a, a,  3) ROUND_P(g, a)
 CNST_P(g, g,  4) ROUND_P(a, g) CNST_P(a, a,  5) ROUND_P(g, a)
 CNST_P(g, g,  6) ROUND_P(a, g) CNST_P(a, a,  7) ROUND_P(g, a)
 CNST_P(g, g,  8) ROUND_P(a, g) CNST_P(a, a,  9) ROUND_P(g, a)
 CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a)
 CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a)
#endm

#macro PASS_Transition
 __for__(i in[0.. $F], v_xor_b32 x[i], x[i], g[i+$10] ) //combine

                          v_xor_b32 x[ 7*2+1], H15hi, x[7*2+1]
 v_mov_b32 x[ 8*2+0], $80 v_mov_b32 x[ 8*2+1], 0
 v_mov_b32 x[ 9*2+0],   0 v_mov_b32 x[ 9*2+1], 0
 v_mov_b32 x[10*2+0],   0 v_mov_b32 x[10*2+1], 0
#endm

//issuing macros
PASS
PASS_Transition
PASS

Processing these macros results in 340KB of code. First I tried to make a loop like this:

s_temp passIdx
s_movk_i32 passIdx, 0
s_while_i32(passIdx<2)
  PASS
  s_if_i32(passIdx=0)
    PASS_Transition
  _endif
  s_addk_i32 passIdx, 1
_endw

But it crashed. Later I found out that the s_cbranch instruction can only jump +-128KB away, and the jump needed was 170KB. From now on the assembler will raise an exception when this happens.
So first I have to put the loops inside the PASS macro. Round-constant addition will change, as from now on SRegs will hold the current round index. There will be two types of round-constant additions: the first initializes the SRegs holding the round indexes, and the second uses and increments those SRegs. This way S instructions are interleaved with V instructions, thus requiring no additional clock cycles. Below are the new round-constant adder macros (they can also copy from one array to another):
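The +-128KB limit comes from the branch encoding: s_cbranch carries a signed 16-bit offset counted in 4-byte words (per the GCN ISA docs), which a quick calculation confirms is too short for the 170KB jump:

```python
# s_cbranch offset: signed 16-bit immediate, in units of 4-byte dwords
max_reach = (2 ** 15) * 4          # 131072 bytes = 128 KB
jump_needed = 170 * 1024           # the jump over the unrolled PASS
print(max_reach, jump_needed, jump_needed > max_reach)
```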

s_temp RoundCnst[16] //these will be incremented before each round

#macro CNST_P_first(dst, src)
  __for__(i in[0..$F], s_movk_i32 RoundCnst[i], i*$10+0
                       v_mov_b32 dst[i*2+1], src[i*2+1]
                       v_xor_b32 dst[i*2+0], RoundCnst[i], src[i*2] )
#endm

#macro CNST_P_next(dst, src)
  __for__(i in[0..$F], s_addk_i32 RoundCnst[i], 1
                       v_mov_b32 dst[i*2+1], src[i*2+1]
                       v_xor_b32 dst[i*2+0], RoundCnst[i], src[i*2] )
#endm

#macro CNST_Q_first(dst, src)
  __for__(i in[0..$F], s_mov_b32 RoundCnst[i], ![not((i*$10+0)<<24)]
                       v_not_b32 dst[i*2+0], src[i*2]
                       v_xor_b32 dst[i*2+1], RoundCnst[i], src[i*2+1] )
#endm

#macro CNST_Q_next(dst, src)
  __for__(i in[0..$F], s_sub_u32 RoundCnst[i], RoundCnst[i], 1<<24
                       v_not_b32 dst[i*2+0], src[i*2]
                       v_xor_b32 dst[i*2+1], RoundCnst[i], src[i*2+1] )
#endm
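A quick Python model (hypothetical helper names) verifies that the incremental updates in the *_next macros reproduce the directly computed constants of the *_first macros for all 14 rounds:

```python
M32 = 0xFFFFFFFF

def p_const(i, r):      # low dword XORed in by CNST_P: i*$10 + r
    return (i * 0x10 + r) & M32

def q_const(i, r):      # high dword XORed in by CNST_Q: not((i*$10+r)<<24)
    return ~((i * 0x10 + r) << 24) & M32

for i in range(16):
    p, q = p_const(i, 0), q_const(i, 0)
    for r in range(1, 14):
        p = (p + 1) & M32            # s_addk_i32 RoundCnst[i], 1
        q = (q - (1 << 24)) & M32    # s_sub_u32 RoundCnst[i], ..., 1<<24
        assert p == p_const(i, r) and q == q_const(i, r)
print("incremental updates match direct constants")
```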

Also I made a simple iterator macro:

#macro mRepeat(repetitionCount, what) //simple repeat macro
enter s_temp i
  s_movk_i32 i, 0
  s_while_i32(i<repetitionCount)
    what
    s_addk_i32 i, 1 //should rather check SCC right after this
  _endw
leave
#endm

Using these the new Groestl PASS can be squeezed:

#macro PASS
  v_mov_b32 x[11*2+0], 0 v_mov_b32 x[11*2+1], 0
  v_mov_b32 x[12*2+0], 0 v_mov_b32 x[12*2+1], 0
  v_mov_b32 x[13*2+0], 0 v_mov_b32 x[13*2+1], 0
  v_mov_b32 x[14*2+0], 0 v_mov_b32 x[14*2+1], 0
  v_mov_b32 x[15*2+0], 0 v_mov_b32 x[15*2+1], M15hi

  v_xor_b32 x[15*2+1], H15hi, x[15*2+1]

              CNST_P_first(a, x) ROUND_P(g, a)
  mRepeat(13, CNST_P_next( a, g) ROUND_P(g, a))

  v_xor_b32 x[15*2+1], H15hi, x[15*2+1]

              CNST_Q_first(a, x) ROUND_Q(x, a)
  mRepeat(13, CNST_Q_next( a, x) ROUND_Q(x, a) ) 

  __for__(i in[0..$1F], v_xor_b32 g[i], g[i], x[i] ) //combine P and Q
  __for__(i in[0.. $F], v_mov_b32 x[i], g[i+$10] )
  v_xor_b32 g[15*2+1], H15hi, g[15*2+1]

              CNST_P_first(a, g) ROUND_P(g, a)
  mRepeat(13, CNST_P_next( a, g) ROUND_P(g, a) )
#endm

This reduced the code size to 50KB, so the instruction cache is now utilized well. But let’s not forget one more loop for the 2 passes:

mRepeat(2,
  PASS
  s_if_i32(i=0)
    PASS_Transition //this must be called between the 2 PASSes
  _endif
)

Final code size is 25KB now, which fits completely into the 32KB instruction cache. Because of this optimization the speed rose from 2.30x to 2.65x, a 115% improvement. I expected it to be higher, but in the current state of the algo the biggest bottleneck is the LDS. So next time I’ll try to make LDS access faster somehow.
Theoretically, when I remove the LDS reads the speed goes up to 7.04x, doing only the V-ALU calculations. With this high V-ALU utilization the instruction cache becomes a bigger issue: when I revert from the 25KB to the old 340KB code size, the speed drops from 7.04x down to 4.46x. So if LDS weren’t a bottleneck, the ICache optimization would give a 158% benefit instead of 115%.
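The 115% and 158% figures are just the ratios of the measured speedups:

```python
# ICache optimization benefit, as a ratio of measured speedups:
with_lds = 2.65 / 2.30      # 25KB vs 340KB code, LDS reads present
no_lds = 7.04 / 4.46        # same comparison with LDS reads removed
print(round(with_lds, 2), round(no_lds, 2))   # ~1.15 vs ~1.58
```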


Implementing Groestl hash function in GCN ASM

There is a new version of HetPas in the Download Area. It contains the Groestl asm project for Cat14.9. See details at the bottom of this post.

Some months ago I was looking for an interesting GPU project for my free time, and ran into the Groestl hash algorithm. I found it a good algo to play with, so here’s a case study of what benefits we get if we can go deeper than OpenCL.

The original code

I’ll start from this OpenCL source code: Pallas optimized groestlcoin / diamond etc. opencl kernel


Originally I downloaded it from here: http://devgurus.amd.com/message/1306845#1306845

Groestl documentation is here: http://www.groestl.info/


The Catalyst version I use is: 14.9
The speed of the algo on different hardware and/or Catalyst are shown below:

[groestl_speeds chart]

The baseline for any further performance comparison will be HD7770, 1000MHz, Cat 14.9  Pallas OpenCL version which is 4MH/s.

Examining the OpenCL-compiled binary

Below is the repeating ‘main loop’ from the OpenCL code. It does 2*8 T-table lookups and the XORing. 2*2 values are looked up from RAM, and the remaining 6*2 are read from LDS. T0[] and T1[] are const arrays which are effectively in memory; T2[]..T7[] are copied to local memory at startup.


a[0x0] ^= QC64(0x00, r); \
a[0x1] ^= QC64(0x10, r); \
...
a[0xE] ^= QC64(0xE0, r); \
a[0xF] ^= QC64(0xF0, r); \
t0[0x0] = B64_0(a[0x0]); \
t1[0x0] = B64_1(a[0x0]); \
...
t6[0xF] = B64_6(a[0xF]); \
t7[0xF] = B64_7(a[0xF]); \
RBTT(a[0x0], 0x1, 0x3, 0x5, 0xB, 0x0, 0x2, 0x4, 0x6); \
RBTT(a[0x1], 0x2, 0x4, 0x6, 0xC, 0x1, 0x3, 0x5, 0x7); \

Refer to Pallas’s source code for details!
Here’s what the Catalyst 14.9 driver compiled from it:

 s_waitcnt vmcnt(2)
 v_xor_b32 v9, v84, v86
 v_xor_b32 v16, v85, v87
 s_waitcnt lgkmcnt(8)
 v_xor_b32 v9, v9, v66
 s_waitcnt lgkmcnt(7)
 v_xor_b32 v16, v16, v47
 s_waitcnt lgkmcnt(6)
 v_xor_b32 v9, v9, v69
 v_xor_b32 v16, v16, v70
 s_waitcnt lgkmcnt(5)
 v_xor_b32 v9, v9, v74
 s_waitcnt lgkmcnt(4)
 v_xor_b32 v16, v16, v73
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v9, v9, v76
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v16, v16, v75
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v9, v9, v78
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v16, v16, v77
 s_waitcnt vmcnt(0)
 v_xor_b32 v43, v91, v93
 v_xor_b32 v47, v92, v94
 v_xor_b32 v59, 0x07000000, v59
 v_not_b32 v133, v40
 v_xor_b32 v134, 0xbfffffff, v59
 v_lshr_b64 v[73:74], v[133:134], 16
 v_bfe_u32 v66, v73, 0, 8
 v_lshlrev_b32 v66, 3, v66
 v_add_i32 v73, vcc, 0x00001800, v66
 v_add_i32 v66, vcc, 0x00001804, v66
 ds_read_b32 v74, v81
 ds_read_b32 v75, v80
 ds_read_b32 v73, v73
 ds_read_b32 v66, v66
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v9, v9, v74
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v16, v16, v75
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v43, v43, v73
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v47, v47, v66
 v_xor_b32 v66, 0x07000000, v90
 v_not_b32 v67, v89
 v_xor_b32 v68, 0x5fffffff, v66
 v_lshr_b64 v[76:77], v[67:68], 24
 v_bfe_u32 v76, v76, 0, 8
 v_lshlrev_b32 v76, 3, v76
 v_bfe_u32 v77, v100, 0, 8
 v_lshlrev_b32 v77, 3, v77
 v_add_i32 v78, vcc, 0x00000800, v77
 v_add_i32 v77, vcc, 0x00000804, v77
 v_bfe_u32 v80, v6, 8, 8
 v_lshlrev_b32 v80, 3, v80
 v_add_i32 v81, vcc, 0x00001000, v80
 ds_read2_b32 v[83:84], v76 offset1:1
 ds_read_b32 v76, v78
 ds_read_b32 v77, v77
 ds_read_b32 v78, v81
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v43, v43, v83
 v_xor_b32 v47, v47, v84
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v43, v43, v76
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v47, v47, v77
 v_add_i32 v76, vcc, 0x00001004, v80
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v43, v43, v78
 v_bfe_u32 v77, v14, 16, 8
 v_lshlrev_b32 v77, 3, v77
 v_add_i32 v78, vcc, 0x00002000, v77
 v_add_i32 v77, vcc, 0x00002004, v77
 v_xor_b32 v15, 0x07000000, v15
 v_not_b32 v4, v4
 v_xor_b32 v15, 0xafffffff, v15
 v_lshrrev_b32 v80, 24, v15
 v_lshlrev_b32 v80, 3, v80
 v_add_i32 v81, vcc, 0x00002800, v80
 ds_read_b32 v76, v76
 ds_read_b32 v78, v78
 ds_read_b32 v77, v77
 ds_read_b32 v81, v81
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v47, v47, v76
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v43, v43, v78
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v47, v47, v77
 v_add_i32 v76, vcc, 0x00002804, v80
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v43, v43, v81
 v_bfe_u32 v77, v99, 0, 8
 v_lshlrev_b32 v77, 3, v77
 v_add_i32 v77, vcc, s0, v77
 v_lshr_b64 v[83:84], v[5:6], 8
 v_bfe_u32 v78, v83, 0, 8
 v_lshlrev_b32 v78, 3, v78
 v_add_i32 v78, vcc, s9, v78
 v_mov_b32 v83, v8
 v_mov_b32 v84, v14
 v_lshr_b64 v[85:86], v[83:84], 16
 v_bfe_u32 v85, v85, 0, 8
 v_lshlrev_b32 v85, 3, v85
 v_add_i32 v86, vcc, 0x00001800, v85
 v_add_i32 v85, vcc, 0x00001804, v85
 v_xor_b32 v11, 0x07000000, v11
 v_not_b32 v87, v7
 v_xor_b32 v88, 0x6fffffff, v11
 v_lshr_b64 v[89:90], v[87:88], 24
 v_bfe_u32 v89, v89, 0, 8
 v_lshlrev_b32 v89, 3, v89
 ds_read_b32 v76, v76
 ds_read_b32 v86, v86
 ds_read_b32 v85, v85
 ds_read2_b32 v[89:90], v89 offset1:1
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v47, v47, v76
 v_bfe_u32 v76, v36, 0, 8
 v_lshlrev_b32 v76, 3, v76
 v_add_i32 v91, vcc, 0x00000800, v76
 v_add_i32 v76, vcc, 0x00000804, v76
 v_bfe_u32 v92, v33, 8, 8
 v_lshlrev_b32 v92, 3, v92
 v_add_i32 v93, vcc, 0x00001000, v92
 v_add_i32 v92, vcc, 0x00001004, v92
 ds_read_b32 v91, v91
 ds_read_b32 v76, v76
 ds_read_b32 v93, v93
 ds_read_b32 v92, v92
 v_bfe_u32 v94, v26, 16, 8
 v_lshlrev_b32 v94, 3, v94
 v_add_i32 v95, vcc, 0x00002000, v94
 v_add_i32 v94, vcc, 0x00002004, v94
 v_lshrrev_b32 v96, 24, v134
 v_lshlrev_b32 v96, 3, v96
 v_add_i32 v97, vcc, 0x00002800, v96
 v_add_i32 v96, vcc, 0x00002804, v96
 ds_read_b32 v95, v95
 ds_read_b32 v94, v94
 v_bfe_u32 v98, v109, 0, 8
 v_lshlrev_b32 v98, 3, v98
 v_add_i32 v98, vcc, s0, v98
 v_lshr_b64 v[101:102], v[99:100], 8
 v_bfe_u32 v17, v101, 0, 8
 v_lshlrev_b32 v17, 3, v17
 v_add_i32 v17, vcc, s9, v17
 tbuffer_load_format_xy v[101:102], v77, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
 tbuffer_load_format_xy v[77:78], v78, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
 tbuffer_load_format_xy v[103:104], v98, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
 tbuffer_load_format_xy v[105:106], v17, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]

Some observations:

  • The maximum number of s_waitcnts is used. Sure, the compiler finds them all, but I think most are unnecessary if it reads and then processes in larger batches. The S-ALU could be used for better things, such as helping with address calculations.
  • Memory address calculations can be done with fewer instructions: for RAM, the UAV base and the table offset can be supplied in the tbuffer instruction’s scalar address parameter. No need to v_add; let the address-calculation hardware do it. The int64 byte offset can be multiplied using the memory resource’s stride field and the idxen flag. So altogether this calculation can be hardware accelerated: base_addr = uavbase+tableoffset+tableindex*8
  • The compiler is clever: it uses v_bfe (Bit Field Extract) when it finds x>>16&0xFF-style C code.
  • But if we consider that 64-bit LDS reads are always 64-bit aligned regardless of the lowest 3 bits of the byte address, then we can save an extra <<3 instruction: (x>>16&0xFF)<<3 becomes x>>13&0x7FF
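The last trick is easy to verify in Python: since ds_read_b64 ignores the low 3 bits of the byte address, the single-bfe form selects the same 8-byte entry as the mask-then-shift form:

```python
def addr_classic(x):            # (x >> 16 & 0xFF) << 3: two instructions
    return ((x >> 16) & 0xFF) << 3

def addr_bfe(x):                # x >> 13 & 0x7FF: one v_bfe_u32
    return (x >> 13) & 0x7FF

for x in (0x12345678, 0xFFFFFFFF, 0, 0xCAFEBABE):
    # masking off the 3 bits the LDS hardware ignores gives equality
    assert addr_bfe(x) & ~7 == addr_classic(x)
print(hex(addr_classic(0x12345678)), hex(addr_bfe(0x12345678)))
```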

Other observations on the OpenCL-produced binary:

  • VGPRS=164. Bad, because it allows only the minimum number of wavefronts/CU: no latency hiding, so the ALU will sleep while waiting for RAM/LDS.
  • Binary size=110KB. Bad, because it doesn’t fit into the 32KB instruction cache at all.

Making the first assembly version

First I tried to rewrite the OpenCL code into a very simple asm implementation that just works. I haven’t implemented the first- and last-round optimizations; those could be an extra 8% speedup in the future, though.

Regarding kernel parameters: keep the input kernel parameters as simple as they can be: void main(__global ulong* a1). That’s all. It’s necessary at the moment because the current version of HetPas doesn’t support non-buffer parameters, and fewer parameters mean fewer buffer-resource calculations and lower S-register usage. In the kernel header I manually adjust the offset in the buffer resource so it doesn’t need to be added to the byte offset in every tbuffer instruction. I also set stride=8 in the buffer resource, so tbuffer instructions will scale my qword indices to byte indices whenever I use the ‘idxen’ option. In the final version it would be possible to make the kernel parameters compatible with the original OpenCL kernel, but the way that is done could change with every upcoming Catalyst version, so it is better to modify the host code a bit than to hack it into the asm code. Also, the kernel is so simple that it doesn’t request the kernel domain ranges, so the global index is only 1D and zero-based.
This is how the kernel header looks (sorry, no syntax highlighting this time):

//////////////////////////////// kernel header
isa79xx //this kernel must be called on a 0-based 1D kernel domain!
numthreadpergroup 256
ldssize 16384
oclbuffers 1, 0
numvgprs 256 v_temp_range 2..255
numsgprs 48 s_temp_range 1..3, 8..47

//////////////////////////////// Init ids, params
alias lid = v0, tid = s0, gid = v1, UAV = s[4:7]
s_buffer_load_dword s1, s[12:15], 0x00 //load uav base offset
s_mov_b32 tid, s16 //acquire tid
s_lshl_b32 s2, tid, 8 //calculate gid
v_add_i32 gid, vcc, s2, lid
s_waitcnt lgkmcnt(0)
s_add_u32 s4, s4, s1 s_addc_u32 s5, s5, 0 //adjust UAV res with uav base offset
s_andn2_b32 s5, s5, $3FFF0000 s_or_b32 s5, s5, $80000 //set 8byte record size for UAV
s_movk_i32 m0, -1 //disable LDS range checking

Next we initialize the LDS from RAM. It’s a good example of local variable allocation (enter/leave/v_temp/s_temp) and of the macro __for__() iteration. Also note that tbuffer is used with hardware indexing (an 8-byte record size is specified in the header).

  //initialize lds with Groestl T table ////////////////////////////////////////////////
 enter
 v_temp data[16] align:2
 v_temp vaddr
 s_temp saddr

 v_mov_b32 vaddr, lid

 __for__(i in [0..7],
   s_movk_i32 saddr, $100+i*$800 //ram table select
   tbuffer_load_format_xy data[i*2], vaddr, UAV, saddr idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT]
 )
 s_waitcnt vmcnt(0)

 __for__(i in [0..7],
   s_movk_i32 saddr, i*$800 //lds table select
   v_mad_u32_u24 vaddr, lid, 8, saddr
   ds_write_b64 vaddr, data[i*2]
 )
 s_waitcnt lgkmcnt(0)

 s_barrier
 leave
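A small sanity check on the LDS layout this loop produces, using the numbers from the header above:

```python
THREADS = 256   # numthreadpergroup
ENTRY = 8       # ds_write_b64 writes 8 bytes per thread
TABLES = 8      # T0..T7

# each table is 256 entries * 8 bytes = 0x800 bytes, tables are stacked:
assert THREADS * ENTRY == 0x800
assert TABLES * THREADS * ENTRY == 16384        # matches 'ldssize 16384'

def write_addr(lid, k):   # v_mad_u32_u24 vaddr, lid, 8, saddr
    return lid * ENTRY + k * 0x800

assert write_addr(255, 7) == 16384 - 8          # last qword of last table
print("LDS layout consistent")
```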

Next I defined some macros for copying and adding round constants to ulong[16] arrays:

#define CNST_P(dst, src, r) __for__(i in[0..$F], v_xor_b32 dst[i*2], i*$10+r, src[i*2] v_mov_b32 dst[i*2+1], src[i*2+1])
#define CNST_Q(dst, src, r) __for__(i in[0..$F], v_xor_b32 dst[i*2+1], ![not((i*$10+r)<<24)], src[i*2+1] v_not_b32 dst[i*2], src[i*2])

Simple moves and XORs with constants. Note that arrays in the assembler are always considered arrays of 32-bit values; that’s why there are so many *2.
Now the most important part follows, which does 8 lookups and XORs the result into an element of an array:

#macro RBTT(dst, didx, src, i0, i1, i2, i3, i4, i5, i6, i7)
enter
  v_temp addr[8]
  v_temp data[16] align:2
  v_and_b32 addr[0], $FF, src[i0*2+0]
  v_lshlrev_b32 addr[0], 3, addr[0]
  v_bfe_u32 addr[1], src[i1*2+0], 8-3, 8+3   v_add_i32 addr[1], vcc, $800*1, addr[1]
  v_bfe_u32 addr[2], src[i2*2+0], 16-3, 8+3  v_add_i32 addr[2], vcc, $800*2, addr[2]
  v_lshrrev_b32 addr[3], 24-3, src[i3*2+0]   v_add_i32 addr[3], vcc, $800*3, addr[3]
  v_and_b32 addr[4], $FF, src[i4*2+1]
  v_lshlrev_b32 addr[4], 3, addr[4]          v_add_i32 addr[4], vcc, $800*4, addr[4]
  v_bfe_u32 addr[5], src[i5*2+1], 8-3, 8+3   v_add_i32 addr[5], vcc, $800*5, addr[5]
  v_bfe_u32 addr[6], src[i6*2+1], 16-3, 8+3  v_add_i32 addr[6], vcc, $800*6, addr[6]
  v_lshrrev_b32 addr[7], 24-3, src[i7*2+1]   v_add_i32 addr[7], vcc, $800*7, addr[7]
  ds_read_b64 dst[didx*2], addr[0]
  __for__(i in[1..7], ds_read_b64 data[i*2], addr[i])
  s_waitcnt lgkmcnt(0)
  __for__(i in[1..7], v_xor_b32 dst[didx*2 ], dst[didx*2 ], data[i*2 ]
                      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
leave
#endm

“dst” is the name of the destination array
“didx” is the index into “dst”
“src” is the source array
“i0”..“i7” are the column indexes for the byte-scramble operation.
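A Python reference model of what RBTT computes may make the parameters clearer (a sketch; T stands for the eight lookup tables that live in LDS/memory):

```python
def rbtt(state, idx, T):
    """state: 16 qwords; idx: the 8 column indexes i0..i7; T: tables
    T0..T7 of 256 qwords each. Byte k of column idx[k] indexes Tk,
    and the 8 looked-up qwords are XORed together."""
    out = 0
    for k, c in enumerate(idx):
        b = (state[c] >> (8 * k)) & 0xFF   # bytes 0..3 come from the low
        out ^= T[k][b]                     # dword, bytes 4..7 from the high
    return out
```

The asm version splits each qword into two dwords (hence all the *2 indexing), but computes the same thing.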

Issuing RBTT for all 16 columns makes up a round:

#macro ROUND(dst, src, i0, i1, i2, i3, i4, i5, i6, i7)
__for__(i in[0..$F], RBTT(dst, i, src, (i0+i)%16, (i1+i)%16, (i2+i)%16, (i3+i)%16, (i4+i)%16, (i5+i)%16, (i6+i)%16, (i7+i)%16) )
#endm

Then the work arrays are defined, and the message is loaded into one of them:


//load message block /////////////////////////////////
v_temp x[32],g[32], a[32] align:2 //x:message

enter
  v_temp vaddr
  __for__(i in [0..9],
    v_mov_b32 vaddr, $10+i
    tbuffer_load_format_xy x[i*2], vaddr, UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT]
  )
  s_waitcnt vmcnt(0)
leave
                         v_mov_b32 x[ 9*2+1], gid
v_mov_b32 x[10*2+0], $80 v_mov_b32 x[10*2+1], 0

The 1D zero-based GlobalID goes into the last DWORD of the message.

There are 2 Groestl passes in GroestlCoin, so let’s make a PASS() macro:

#macro PASS0Final
  __for__(i in[0.. $F], v_xor_b32 x[i], x[i], g[i+$10] ) //combine

  v_xor_b32 x[ 7*2+1], H15hi, x[7*2+1]
  v_mov_b32 x[ 8*2+0], $80 v_mov_b32 x[ 8*2+1], 0
  v_mov_b32 x[ 9*2+0], 0 v_mov_b32 x[ 9*2+1], 0
  v_mov_b32 x[10*2+0], 0 v_mov_b32 x[10*2+1], 0
#endm

#macro PASS(passIdx) //passIdx: 0..1
  v_mov_b32 x[11*2+0], 0 v_mov_b32 x[11*2+1], 0
  v_mov_b32 x[12*2+0], 0 v_mov_b32 x[12*2+1], 0
  v_mov_b32 x[13*2+0], 0 v_mov_b32 x[13*2+1], 0
  v_mov_b32 x[14*2+0], 0 v_mov_b32 x[14*2+1], 0
  v_mov_b32 x[15*2+0], 0 v_mov_b32 x[15*2+1], M15hi

  v_xor_b32 x[15*2+1], H15hi, x[15*2+1]
  CNST_P(g, x, 0) ROUND_P(a, g) CNST_P(a, a, 1) ROUND_P(g, a)
  CNST_P(g, g, 2) ROUND_P(a, g) CNST_P(a, a, 3) ROUND_P(g, a)
  CNST_P(g, g, 4) ROUND_P(a, g) CNST_P(a, a, 5) ROUND_P(g, a)
  CNST_P(g, g, 6) ROUND_P(a, g) CNST_P(a, a, 7) ROUND_P(g, a)
  CNST_P(g, g, 8) ROUND_P(a, g) CNST_P(a, a, 9) ROUND_P(g, a)
  CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a)
  CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a)
  v_xor_b32 x[15*2+1], H15hi, x[15*2+1]

  CNST_Q(x, x, 0) ROUND_Q(a, x) CNST_Q(a, a, 1) ROUND_Q(x, a)
  CNST_Q(x, x, 2) ROUND_Q(a, x) CNST_Q(a, a, 3) ROUND_Q(x, a)
  CNST_Q(x, x, 4) ROUND_Q(a, x) CNST_Q(a, a, 5) ROUND_Q(x, a)
  CNST_Q(x, x, 6) ROUND_Q(a, x) CNST_Q(a, a, 7) ROUND_Q(x, a)
  CNST_Q(x, x, 8) ROUND_Q(a, x) CNST_Q(a, a, 9) ROUND_Q(x, a)
  CNST_Q(x, x, 10) ROUND_Q(a, x) CNST_Q(a, a, 11) ROUND_Q(x, a)
  CNST_Q(x, x, 12) ROUND_Q(a, x) CNST_Q(a, a, 13) ROUND_Q(x, a)

  __for__(i in[0..$1F], v_xor_b32 g[i], g[i], x[i] ) //combine P and Q
  __for__(i in[0.. $F], v_mov_b32 x[i], g[i+$10] )
  v_xor_b32 g[15*2+1], H15hi, g[15*2+1]

  CNST_P(g, g, 0) ROUND_P(a, g) CNST_P(a, a, 1) ROUND_P(g, a)
  CNST_P(g, g, 2) ROUND_P(a, g) CNST_P(a, a, 3) ROUND_P(g, a)
  CNST_P(g, g, 4) ROUND_P(a, g) CNST_P(a, a, 5) ROUND_P(g, a)
  CNST_P(g, g, 6) ROUND_P(a, g) CNST_P(a, a, 7) ROUND_P(g, a)
  CNST_P(g, g, 8) ROUND_P(a, g) CNST_P(a, a, 9) ROUND_P(g, a)
  CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a)
  CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a)

  __IF__(passIdx=0, PASS0Final)
#endm

And finally we use those macros and extract one ulong value, which will be compared against the “target”:

  PASS(0)
  PASS(1)

  v_xor_b32 g[0], g[11*2], x[3*2] v_xor_b32 g[1], g[11*2+1], x[3*2+1] //result value
  dump64($1F, g[0]) //dump64 writes an ulong into the buffer if the globalID is 1234

  s_endpgm

And that’s the whole kernel. It does not compare with the “target”; it only returns the ulong value if the current thread’s GID is 1234.

Testing the first assembly version

These parameters are used for the functional tests:

const testBlock := #$6f7037939d1aa4a9863574ddf41a0d371799dfea+
                   #$89b37ecb1ecded76426afa25108feec755347891+
                   #$b3fa9afd2a360cf64f56e4d20f0c8c03ca411b3a+
                   #$29dd28ea4fc0cddf9a1e8c707966b7a700000000; //80 bytes
const testResponse := $9FB391FF6984DFA9;           //^^^^^^^^ gid goes here
#define debugThread 1234

The above block will be completed with the gid=1234 value at its end, and then the resulting ulong is checked against the testResponse constant. TestBlock is random; the testResponse constant was extracted from the original OpenCL kernel.

Checking the speed of the first asm version

Total workitems: 256*10*512   (256 = threadgroup size, 10 because my card has 10 CUs)
Running the test 4x is enough at this point:

elapsed: 564.523 ms 4.644 MH/s gain: 1.16x
elapsed: 561.300 ms 4.670 MH/s gain: 1.17x
elapsed: 561.266 ms 4.671 MH/s gain: 1.17x
elapsed: 561.303 ms 4.670 MH/s gain: 1.17x

First speedup is 1.17x.

Without the first- and last-round optimizations, but with the simplified LDS addressing, it is a bit faster than the OpenCL version. The gain is measured against the baseline OCL version of 4MH/s on my system: HD7770 1000MHz, Cat 14.9, Win7/64. With Cat 14.6 it would be only around 1.00x, as 14.9 somehow produces a less optimal output.

For this first version, two properties are still just as unfavorable as in the OpenCL version:

  • Kernel size: much greater than the 32KB I-cache, it’s 340K (ocl:110K, so there must be a small loop in it compared to my version which is 100% unrolled at the moment)
  • VReg usage: artificially greater than 128, so one CU can have only the minimal amount of 4 wavefronts at any time. No latency hiding at all, just like in the OCL version.

Below is the ‘main loop’ extracted from the first asm version. It does the same amount of work as the extracted disasm from the ocl version: 2*8 lookups and the surrounding XORs.

enter v_temp addr[8] v_temp data[16] align:2
v_and_b32 addr[0], $FF, g[ ( 0+0)%16*2+0]
v_lshlrev_b32 addr[0], 3, addr[0]
v_bfe_u32 addr[1], g[ ( 1+0)%16*2+0], 8-3, 8+3
v_add_i32 addr[1], vcc, $800*1, addr[1]
v_bfe_u32 addr[2], g[ ( 2+0)%16*2+0], 16-3, 8+3
v_add_i32 addr[2], vcc, $800*2, addr[2]
v_lshrrev_b32 addr[3], 24-3, g[ ( 3+0)%16*2+0]
v_add_i32 addr[3], vcc, $800*3, addr[3]
v_and_b32 addr[4], $FF, g[ ( 4+0)%16*2+1]
v_lshlrev_b32 addr[4], 3, addr[4]
v_add_i32 addr[4], vcc, $800*4, addr[4]
v_bfe_u32 addr[5], g[ ( 5+0)%16*2+1], 8-3, 8+3
v_add_i32 addr[5], vcc, $800*5, addr[5]
v_bfe_u32 addr[6], g[ ( 6+0)%16*2+1], 16-3, 8+3
v_add_i32 addr[6], vcc, $800*6, addr[6]
v_lshrrev_b32 addr[7], 24-3, g[ ( 11+0)%16*2+1]
v_add_i32 addr[7], vcc, $800*7, addr[7]
ds_read_b64 a[ 0*2], addr[0]
ds_read_b64 data[1*2], addr[1]
ds_read_b64 data[2*2], addr[2]
ds_read_b64 data[3*2], addr[3]
ds_read_b64 data[4*2], addr[4]
ds_read_b64 data[5*2], addr[5]
ds_read_b64 data[6*2], addr[6]
ds_read_b64 data[7*2], addr[7]
s_waitcnt lgkmcnt(0)
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[1*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[1*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[2*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[2*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[3*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[3*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[4*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[4*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[5*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[5*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[6*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[6*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[7*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[7*2+1]
leave
enter v_temp addr[8] v_temp data[16] align:2
v_and_b32 addr[0], $FF, g[ ( 0+1)%16*2+0]
v_lshlrev_b32 addr[0], 3, addr[0]
v_bfe_u32 addr[1], g[ ( 1+1)%16*2+0], 8-3, 8+3
v_add_i32 addr[1], vcc, $800*1, addr[1]
v_bfe_u32 addr[2], g[ ( 2+1)%16*2+0], 16-3, 8+3
v_add_i32 addr[2], vcc, $800*2, addr[2]
v_lshrrev_b32 addr[3], 24-3, g[ ( 3+1)%16*2+0]
v_add_i32 addr[3], vcc, $800*3, addr[3]
v_and_b32 addr[4], $FF, g[ ( 4+1)%16*2+1]
v_lshlrev_b32 addr[4], 3, addr[4]
v_add_i32 addr[4], vcc, $800*4, addr[4]
v_bfe_u32 addr[5], g[ ( 5+1)%16*2+1], 8-3, 8+3
v_add_i32 addr[5], vcc, $800*5, addr[5]
v_bfe_u32 addr[6], g[ ( 6+1)%16*2+1], 16-3, 8+3
v_add_i32 addr[6], vcc, $800*6, addr[6]
v_lshrrev_b32 addr[7], 24-3, g[ ( 11+1)%16*2+1]
v_add_i32 addr[7], vcc, $800*7, addr[7]
ds_read_b64 a[ 1*2], addr[0]
ds_read_b64 data[1*2], addr[1]
ds_read_b64 data[2*2], addr[2]
ds_read_b64 data[3*2], addr[3]
ds_read_b64 data[4*2], addr[4]
ds_read_b64 data[5*2], addr[5]
ds_read_b64 data[6*2], addr[6]
ds_read_b64 data[7*2], addr[7]
s_waitcnt lgkmcnt(0)
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[1*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[1*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[2*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[2*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[3*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[3*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[4*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[4*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[5*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[5*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[6*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[6*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[7*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[7*2+1]
leave
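The address arithmetic in the blocks above can be modeled in Python. Note the trick: extracting with `v_bfe_u32` at offset-3 and width+3 yields byte*8 plus 3 garbage low bits; the model masks those off, on the assumption that the 8-byte `ds_read_b64` aligns the address and ignores them (`bfe` and `lds_addr` are illustrative helpers, not part of the toolchain):

```python
def bfe(x, offset, width):
    """Mimics v_bfe_u32: extract 'width' bits starting at bit 'offset'."""
    return (x >> offset) & ((1 << width) - 1)

def lds_addr(word, byte_idx, table):
    """LDS byte offset for lookup table 'table' (each entry is 8 bytes,
    each table is $800 bytes). byte_idx selects the byte of 'word'."""
    if byte_idx == 0:
        a = (word & 0xFF) << 3                  # v_and_b32 + v_lshlrev_b32
    elif byte_idx == 3:
        a = word >> (24 - 3)                    # v_lshrrev_b32
    else:
        a = bfe(word, byte_idx * 8 - 3, 8 + 3)  # v_bfe_u32, offset-3/width+3
    # Mask the 3 garbage low bits, which the aligned ds_read_b64 is
    # assumed to ignore, then add the Tn table base.
    return (a & ~7) + 0x800 * table

word = 0x11223344
assert lds_addr(word, 0, 0) == 0x44 * 8
assert lds_addr(word, 1, 1) == 0x800 * 1 + 0x33 * 8
assert lds_addr(word, 2, 2) == 0x800 * 2 + 0x22 * 8
assert lds_addr(word, 3, 3) == 0x800 * 3 + 0x11 * 8
```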

With a few more registers this could be pipelined (2 stages: next address calculation, and previous XORs plus LDS reads), but it will be easier to go below 128 VRegs and let the GPU share the resources (ALU and LDS reads) across 2 concurrent wavefronts.
A random piece of trivia: the source file after macro processing is more than 2.3MB; the assembler is not a bad compressor, as it compresses this down to 340KB.

Check it yourself!

For those who want to see this working on their system, I uploaded a special Groestl version of HetPas. Click on the [Download Link] on the menu bar at the top of this page!

Instructions to make it run:

  • Minimum Win7, 32 or 64 bit, is required (the same requirement the GCN cards have).
  • Use the 14.9 Catalyst! Older versions are guaranteed to crash as they pass the kernel parameters differently. Newer Cat versions are untested.
  • You have to disable Data Execution Prevention (DEP) for the exe because it patches some Delphi runtime library functionality at startup.
  • Either [Run As Administrator] or put it in a folder where it can write files. This is needed to export the temporary files of the OpenCL compiler. It also writes an .ini file in the exe’s path.
  • Open the “groestl\groestl_ocl.hpas” file to test the original OpenCL kernel (by Pallas). Press F9 to Run.
  • Open the “groestl\groestl_isa.hpas” file to test the GCN assembly version. If it works ok, then it should display “RESULT IS OK”. It’s ok if it says “TEMPDATA IS WRONG”.
  • In the examples folder only OpenCL_OpenCL_HelloWorld.hpas is compatible with cat14.9. The others crash because of the changed method of passing kernel parameters in registers. To try the examples use cat13.4 or cat12.10. Cat12.10 has a working disassembler that disassembles binary-only ELF images, but that version is so old that it doesn’t handle new cards. To ‘decrease’ the Catalyst version you may have to use the Catalyst Clean Uninstall Utility.

To be continued…

Now that the first asm version works correctly, in the next post I’ll examine different optimization methods to make it worth descending from the OpenCL language down to GCN assembly.

Posted in Uncategorized | Tagged , , , , , | Leave a comment

GCN specific macros

( [Download Link] section updated. Produced ELF is NOT compatible with cat14.6b. Works well with cat13.4, and if you want a working disassembler, you should use cat12.10)

With the #include directive it is possible to inline headers. So I made a stdgcn.inc to help with ‘everyday’ programming tasks.

A kernel code may start like this:

var code:=asm_isa(
  #include stdgcn.inc
  KernelInitUC(64,64,8192)
  ...

KernelInitUC(WorkGroupSize, VRegCount, LDSBytes) is defined in stdgcn.inc and does the following:

  • sets important kernel parameters: WorkGroupSize, VRegCount and LDSBytes
  • specifies buffers and reads pointers to them. In this case UC means 1 UAV and 1 constant buffer.
  • prepares kernel indexes, and stores them in grpId=s0, lid=v0, gid=v1 aliases respectively.
  • allocates vector and scalar registers for temp variables (more info later), making sure that they do not include resource constants and other important registers.
  • measures the start time of the kernel. (well, maybe this should be optional)

 Temp registers

This is a new feature which helps using variables in a structured form.
Before using this, a register pool must be allocated with s_temp_range and v_temp_range instructions. For example:

s_temp_range 1..7, 27..103
v_temp_range 2..64

From now on there will be a scope for variables allocated with the v_temp and s_temp instructions:

v_temp X, Y, Z  //note: the data type is always 32bit, for 64bit types you can use arrays
s_temp i,j,k
s_temp data[16] align:16  //allocates a 16 dword array of sregs aligned to 16 dword boundary

 Managing temp register scope

There are two special instructions for this: enter and leave. In a block between enter and leave a new scope is created. One can allocate registers with s_temp and v_temp inside a block, and the leave instruction will release all the variables that were allocated inside the block. It is very useful inside macros.
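The scoping behavior can be sketched with a small Python model (purely illustrative; the `Pool` class and its method names are my invention, not part of HetPas):

```python
class Pool:
    """Toy model of a temp register pool with enter/leave scopes."""
    def __init__(self, regs):
        self.free = list(regs)   # registers given by s_temp_range/v_temp_range
        self.scopes = []

    def enter(self):
        self.scopes.append([])   # open a new scope

    def temp(self):
        r = self.free.pop(0)     # allocate from the pool
        self.scopes[-1].append(r)
        return r

    def leave(self):
        # release everything allocated inside the innermost scope
        for r in self.scopes.pop():
            self.free.insert(0, r)

p = Pool(range(2, 6))            # e.g. v_temp_range 2..5
p.enter()
a, b = p.temp(), p.temp()        # v_temp a, b
p.leave()                        # both released again
assert sorted(p.free) == [2, 3, 4, 5]
```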

Program structure macros

_if(), _else, _end: Lets you create if/else statements without using jumps and labels. The _if statement has to know what kind of register you are going to check with it, so the proper forms of the _if instruction are these:

  • s_if(vccz) //scalar IF checking a scalar flag.
  • s_if_i32(s6>-32769) //scalar if checking 32bit signed integer relation
  • v_if_f64(v10<>s20) //vector if with 64bit float operand (and a 64bit float scalar)

Possible types for s_if are: i32, u32. And for v_if: i32, u32, i64, u64, f32, f64.

_while(), _endw: Makes a while block. You must use the same prefixes and suffixes for _while macro as you would use for the _if macro.

_repeat, _until(): Makes a repeat-until block. Prefix and suffix must be specified for _until().

_break, _continue: Can be used inside a _while-_endw or a _repeat-_until block.

Memory IO macros

dwAddr is a dword index. uavId is 0-based. AOption can be one or more options of the tbuffer_ instruction, for example: glc.

uavWrite(uavId,dwaddr,value)
uavWrite(uavId,dwaddr,value,AOption)
uavRead(uavid, dwaddr,value)
uavRead(uavid, dwaddr,value,AOption)
cbRead(dwaddr,value)

note: They are so slow that they should not be used in an inner loop. But they provide easy access to memory.

 Accessing HW_INFO

These are easy-access macros for the bitfields of the HW_INFO value. The result is placed in the provided scalar reg.

getWaveId(ghwRes) 
getSIMDId(ghwRes) 
getCUId(ghwRes) 
getSHId(ghwRes) 
getSEId(ghwRes) 
getThreadGroupId(ghwRes) 
getVirtualMemoryId(ghwRes)
getRingId(ghwRes) 
getStateId(ghwRes)

And a complicated one that calculates the Global SIMD Id. You can identify the SIMD on which your program is running.

getGlobalSIMDId(ggsRes)

 GDS macros

gwAddr: dword index in GDS memory

gdsWrite(gwAddr,gwData)
gdsRead(gwAddr,gwData)
gdsAdd(gwAddr,gwData)

 Global Wave Synch

Id is a unique id chosen by you. gwsThreads: the number of total workgroups (or wavefronts, I’m not sure… The wrong one will crash :D)

gwsInit(gwsId,gwsThreads)
gwsBarrier(gwsId)

 Measuring execution time

_getTickInit     //initializes T0 time. All other timing macros will work relative to this.
getTick(gtRes)   //returns current time elapsed from T0 //with lame 32bit calculations
breakOnTimeOut(botTimeoutMS)  //ensures that a loop cannot be infinite. Calls s_endpgm if timeOutMS is reached.

 Kernel initialization

Must be called right after including stdgcn.inc.

AGrpSize: number of workItems in a workGroup. ANumVGPRS: allocated number of vector regs. ALdsSizeBytes: as its name says.

KernelInitUUUC(AGrpSize,ANumVGPRS,ALdsSizeBytes)  //3 UAVs and 1 ConstBuffer

Other buffer variants implemented: UU, UC, U

 

Posted in Uncategorized | Tagged | Leave a comment

Macro Preprocessor Documentation

Latest version can be downloaded from here: http://x.pgy.hu/~worm/het/hp
Suggested Catalyst ver: 12.10. It also works with 13.4 (new ELF format), except for the disassembler. Later Catalyst versions aren’t tested.

Hi and Happy New Year!

As my script/asm compiler is getting more complicated, it is time to summarize everything that helps to code in GCN ASM easier. The macro preprocessor is working right before the pascal interpreter and it has a syntax that is close to the C Macro Preprocessor.

Defining a macro: #define, #assign, #undef

#define something               //just define it, equals ""

#define add(a,b) ((a)+(b))      //define a simple macro
#define add(a,b,c) ((a)+(b)+(c))//overload it with 3 parameters

#define multiline 1+  \
                  2             //multiline macro

{#define multiline2 1+
2 }
         //another form of multiline macro

#macro ensureRange(x, min, max) //multiline macro, best syntax
  if(x<min)then
    x:=min
  else if(x>max)then
    x:=max;
#endm

#assign two 1+1                 //'assign' evaluates the expression

#undef something                //undefine an existing macro

var x:=-0;                      //test some of the macros
ensureRange(x,1,5);
writeln(x, ',', add(1,2,3), ',', two);    //the output should be 1,6,2
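For comparison, here is a Python mimic of the macros tested above (illustrative only; `ensure_range` returns a value instead of mutating its argument):

```python
def add(a, b, c=0):
    # mirrors the overloaded add(a,b) / add(a,b,c) macro
    return a + b + c

def ensure_range(x, lo, hi):
    # mirrors ensureRange(x, min, max), clamping x into [lo, hi]
    return max(lo, min(x, hi))

two = 1 + 1   # what '#assign two 1+1' evaluates to

x = ensure_range(-0, 1, 5)   # -0 is 0, clamped up to 1
assert f"{x},{add(1, 2, 3)},{two}" == "1,6,2"   # same output as the writeln
```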

The # operator

# converts the contents of the following identifier to string.
## concatenates its two operands.
Using whitespace is not allowed near the #, ## operators.

Predefined macros:

There are some predefined macros: __file__, __line__, __date__, __time__

Conditional code generation can be done with __ifdef__(), __ifndef__(), __if__(). The first parameter is the condition, and the rest of the parameters are the code which will be compiled if the condition is met. Proper use of brackets() must be ensured; otherwise use a macro.

#define a
__ifdef__(a, writeln('a is defined'));
__ifndef__(b, writeln('b is undefined'));
#define a 15
__if__(a>=10, writeln('a>=10'));

Iterations can be written with the __for__() macro:

__for__(i in [1, 2, 5..8], write(i); ) //125678
__for__(i:=1 to 8 where i not in [3, 4], write(i); ) //125678
__for__(i:=1 to 8 step 2, write(i); ) //1357
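The three iterations above can be reproduced in Python to verify the expected outputs (illustrative equivalents, not the preprocessor itself):

```python
# i in [1, 2, 5..8]
out1 = "".join(str(i) for i in [1, 2, 5, 6, 7, 8])
# i := 1 to 8 where i not in [3, 4]
out2 = "".join(str(i) for i in range(1, 9) if i not in (3, 4))
# i := 1 to 8 step 2
out3 = "".join(str(i) for i in range(1, 9, 2))

assert out1 == "125678"
assert out2 == "125678"
assert out3 == "1357"
```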

Special macros:

#include pastes a file into the source code. No “” or <> needed, just the filename. First it will search for the file in the directory of the current .hpas file. If it can’t find it there, it will try the include\ directory (which resides in hetpas.exe’s directory).

It is possible to generate program code with actual program code. In the #script directive you can place a pascal script. Everything passed to write() and writeln() will be pasted into the source code:

{#script writeln('writeln("this is scripted");');}

Because this type of macro originates back when I did a preprocessor for Delphi, this macro must be enclosed in {} comment brackets.

ASM macros:

There are two macros that help nesting assembly code into the pascal script: asm_il(), asm_isa(). They work exactly like quotation marks when defining string literals. They are there to isolate pascal script code from assembly code, so the IDE can provide proper syntax highlighting and code-insight functionality.
When including a file these macros are stripped down automatically so they can be included inside an asm() block while the included file has proper syntax highlighting.
There is a way to insert evaluated scripts into the asm block: if the preprocessor finds something like ![expression] in the asm block, it evaluates the expression and inserts its result into the code.

writeln(asm_isa(This code is displayed in !["GCN"+" ISA"] syntax s_endpgm));
//output is: This code is displayed in GCN ISA syntax s_endpgm

Posted in Uncategorized | Tagged , | Leave a comment

GCN Hello World example

This simple example will help you learn how to write simple GCN asm programs and how to test it in the HetPas environment. First look at this small OpenCL program:

__kernel void test(__global int *uav, __constant int *cb)
{
  int gid = get_global_id(0);
  uav[gid] = gid + cb[0];
}

This simple thing will be implemented in the next sections.

1. Writing the GCN code

In HetPas it actually means that you have to make a pascal style string. In order to see syntax highlight and to have Ctrl+Space (code completion) you have to put your asm code inside an asm_isa() construct:

const code:=asm_isa(
  //The text you write here will be seen as GCN assembly code
  //It will be syntax highlighted in the editor
);

Next we have to write a GCN header which will specify important things for the compiler. These instructions aren’t part of the GCN ISA.

isa79xx                  ;every gcn program starts with this 
  numVgprs 64            ;specify number of vector regs (max 256)
  numSgprs 105           ;and scalar regs (105 is the max)
  numThreadPerGroup 256  ;workitems in a workgroup
  ldsSize 0              ;LDS size in bytes (max 32KB)

Then we have to specify the kernel parameters. The current implementation supports any number of UAVs (__global int*) followed by one or zero constant buffers (__constant int *cb). The oclBuffers instruction is there to specify UAV_count and CB_count:

  oclBuffers 1,1 ;first value is number of UAVs,
     ;second value is number of const buffers(1 max)

Based on the UAV and CB configuration, the way you access those buffers and the scalar register which will contain the group_id at kernel startup will be different.
For the 1,1 combination here’s what we have to know:

 ; userElements[0] = IMM_UAV, 10, s[4:7]
 ; userElements[1] = IMM_CONST_BUFFER, 1, s[8:11]
 ; userElements[2] = IMM_CONST_BUFFER, 2, s[12:15]
 ; COMPUTE_PGM_RSRC2:USER_SGPR = 16

The above is a part of the disasm of our small OpenCL program. It tells us that:

  • the group_id will be placed in the s16 register.
  • UAV buffer resource (refer to SI ISA manual Table 8.5) is in s[4:7]
  • CB buffer is in s[8:11]
  • UAV offset table is in s[12:15]

So the OpenCL kernel’s first __global int *uav parameter is on the s[4:7] buffer resource, offset by the first dword at s[12:15]. (Note: s[12:15] is 128 bits long and it’s a buffer resource.)
The __constant int *cb parameter is at s[8:11] and the offset is 0 inside that buffer resource.
Note that we don’t have to know why it is this way, we’ll only use it. So let’s prepare these things in the GCN asm program:

  alias uav=s[4:7], cb=s[12:15], uavofs=s[8:11]

With the alias instruction we can have simple replace macros. (It works like #define in C but with a different syntax). Now we have simple labels for the 3 buffer resources we’re going to use.

  s_mulk_i32 s16, 256        //scale GroupID with GroupSize
  v_add_i32 v0, vcc, s16, v0 //v0: global threadId

As I mentioned before, we have the get_group_id() value in s16. To calculate get_global_id(), we have to multiply it by the GroupSize (specified with the numThreadPerGroup instruction) and add it to the v0 register. At kernel start v0 contains get_local_id(). Note that all ids are in flattened format.
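As a sanity check, the flattened global id computation can be sketched in Python (illustrative only):

```python
GROUP_SIZE = 256  # numThreadPerGroup

def get_global_id(group_id, local_id):
    # s_mulk_i32 s16, 256  ;  v_add_i32 v0, vcc, s16, v0
    return group_id * GROUP_SIZE + local_id

assert get_global_id(0, 0) == 0
assert get_global_id(3, 17) == 3 * 256 + 17
```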

And now let’s read the 2 constants we need: the UAV’s offset and the cb[0] value.

  s_buffer_load_dword s0, uavofs, 0x00 //s0: uav offset
  s_buffer_load_dword s1, cb, 0x00     //s1: cb[0]
  s_waitcnt lgkmcnt(0) //wait for s_buffer_loads

s_waitcnt is there to wait for the s_buffer instructions. Those are executed in the background, and lgkmcnt(0) ensures that s0 and s1 contain valid values read from the buffers.

Now it’s time to implement our simple OpenCL program:

  uav[gid] = gid + cb[0];

We have the globalID in the v0 register and cb[0] in the s1 register and UavOffset is in s0. The following code will calculate the address in the uav buffer resource which is gid*sizeof(int)+uavOffset.

  v_mad_u32_u24 v2, v0, 4, s0   ; v2: address

That’s the byte address inside the UAV buffer resource. Let’s calculate gid+cb[0] too:

  v_add_i32     v3, vcc, s1, v0   ;v3: data

Only one last step remains: Write into memory

tbuffer_store_format_x  v3, v2, uav, 0 offen offset:0 
  format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]

BTW as we know that we don’t need data conversion, we can use typeless buffer_store too:

buffer_store_dword v3, v2, uav, 0 offen offset:0

The buffer store instruction does not finish immediately, and it will need the DATA value (v3) at the time it actually writes into memory. With the s_waitcnt expcnt(0) instruction we can force the program to wait until the v3 register is accessible again.

The final instruction in our example tells the GPU that this is the end of the program:

s_endpgm
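Putting it together, the effect of one work-item can be modeled in Python (a sketch; the dword-indexed `mem` list stands in for the UAV, and the names are my own):

```python
def kernel_effect(gid, uav_offset, cb0, mem):
    """Model of one work-item of the kernel: uav[gid] = gid + cb[0]."""
    addr = gid * 4 + uav_offset   # v_mad_u32_u24 v2, v0, 4, s0 (byte address)
    data = gid + cb0              # v_add_i32 v3, vcc, s1, v0
    mem[addr // 4] = data         # buffer_store_dword (dword-indexed model)

mem = [0] * 8
for gid in range(8):              # 8 work-items, uav offset 0, cb[0] = 1000
    kernel_effect(gid, uav_offset=0, cb0=1000, mem=mem)
assert mem == [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007]
```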

2. Testing the kernel in HetPas

The steps are:

  • Select a CL device
  • Compile/Build the kernel
  • Allocate/Initialize buffers
  • Run the kernel
  • Read the buffers
  • Cleanup

To select a device:

var dev:=cl.devices[1];  //specify device index
writeln(dev.dump);       //display some info

Note that cl.devices.count reports how many CL devices are present in the system.

Compiling code is also simple:

var kernel:=dev.NewKernel(code);

The kernel object is ready to launch kernels. It can also produce temporary artifacts:

writeln(kernel.ILCode);  //show AMD_IL disasm
writeln(kernel.ISACode); //show ISA disasm
filewrite('c:\test.elf',kernel.ELFImage);

The last line saves the kernel image. Later you can reload it with:

var kernel2:=dev.NewKernel(readfile('c:\test.elf'));

Our next task is to allocate buffers:

const WorkCount:=1536;
var uav:=dev.NewBuffer('rw',WorkCount*4{in bytes});
var cb:=dev.NewBuffer('r',4096{in bytes});
cb.intVArray:=(1000);  //it gets an array of ints
cb.ints[0]:=1000;      //the same as above with indexed property
                       //also there is floats[] and floatVArray

Now that the buffers are initialized, it’s time to run the kernel:

var ev:=kernel.run(WorkCount, uav, cb);//it can have max 4 params
ev.waitfor;
writeln(format('elapsed: %.3f ms',ev.elapsedtime_sec*1000));
ev.free;

We can access the UAV now:

with uav do for i:=0 to WorkCount-1 do
  writeln(format('%6x: %.8x %10d %10f',i,Ints[i],Ints[i],Floats[i]));

This dumps every DWord in the UAV in hex, int and float32 format.
All we have left is to free the allocated objects.

uav.free;
cb.free;
kernel.free;

Note: You don’t have to free the device because it is handled dynamically inside.

So that’s all.

This example code can be found in the /examples/OpenCL_OpenCL_HelloWorld.hpas file. I tested it with Catalyst 12.10 and with the 2013 May 14th release of HetPas. Download link.
It should work on Catalyst 13.4 also, but in that version the kernel.ISACode feature is broken. (Catalyst produces no disasm when you upload a binary only kernel)

Posted in Uncategorized | Tagged , , | Leave a comment

HetPas ver0.00

* update 160102: Please always check for the latest version in the download area
* update 130514: A new version is available for Cat 13.4
Successfully tested on HD7770 and on HD6970 with the following examples:
AMDIL_OpenCL_HelloWorld.hpas
GCN_OpenCL_Fibonacci_recursive.hpas
GCN_OpenCL_HelloWorld.hpas
GCN_OpenCL_latency_test.hpas
GCN_OpenCL_mandel.hpas
OpenCL_OpenCL_HelloWorld.hpas

——————————————————————————-

Download link -> in the header of this blog.

Software requirements: Windows XP, AMD Catalyst driver
Win7+ users (The app will need a classic XP-like win32 environment):
– Use “Run as Administrator”, because it will generate some temp/result files into C:\.
– Disable Data Execution Prevention, as it will use runtime generated machine code.

What is this?

HetPas is a small script compiler/executor and a small IDE. It supports 3 languages with syntax highlight and code-insight to help faster development.

The supported languages are:

  • Pascal with some C inspired things, this is the main/host language.
  • AMD_IL. Middle-level asm-like language for cards HD4xxx..HD7xxx
  • GCN ISA. Lowest level asm language for HD77xx+ gfx cards.

What kernel files it can produce?

  • CAL .elf image with AMD_IL code inside (uses AMD’s internal compiler), all cards where amd_il is working, except HD77xx with new drivers.
  • CAL .elf image with GCN ISA binary  (generated with own compiler) hd77xx+ only
  • OpenCL .elf image loaded with GCN ISA binary, hd77xx+ only

What about this release?

It’s a very first one, so it can contain tons of bugs; also the GCN ISA compiler is a reduced one: it lacks some instruction groups, for example double precision encodings. Also, anything can change in the future, so don’t use it for serious projects. Just take it as a toy with which you can try out ideas on the GCN architecture.

Is there documentation?

Unfortunately not much: here’s a small reference of language elements -> HetPas Reference

Official documentation for AMD_IL and GCN_ISA -> amd-accelerated-parallel-processing-app-sdk/documentation
Check the documents “AMD Intermediate Language (IL) Specification (v2.0e)” and “AMD Southern Islands Instruction set Architecture”!

Indeed it’s not that much, how to start then?

(First if you’re a win7 user, you should disable UAC on this program, because it will write many temporary files in the C:\ path. Use Run as Administrator or XP compatibility mode or something.)

Note that at the moment this project is in an early beta/preview stage, so use it at your own risk.
I suggest you first check out some hpas programs in the examples folder and learn from them!

  • HetPasDemo.hpas – Contains many language elements of the host language.
  • mandel.hpas – a small mandelbrot renderer

Then you can choose a gpu target:

a) HD4xxx..HD7xxx with CAL+AMD_IL.

  • AMDIL_CAL_HelloWorld.hpas

b) HD77xx+ with OpenCL+GCN_ISA (Use latest drivers, I’ve tested with 12-10 on win7 64) *Note that: this is the most up to date target

  • GCN_OpenCL_HelloWorld.hpas
  • GCN_OpenCL_mandel.hpas – Single Precision mandelbrot renderer
  • GCN_OpenCL_latency_test.hpas – You can measure how many cycles an instruction sequence takes.
  • GCN_OpenCL_Fibonacci_recursive.hpas – Some advanced GCN tricks, like indirect S register addressing and goto to a specific address; this example also demonstrates C style precompiler macros.

c) HD77xx+ with CAL+GCN_ISA   (Use the cat11-12 driver on win7 64bit, or 12-2 on linux 32bit.) This is a bit deprecated but works flawlessly with the right drivers; with the wrong drivers it simply crashes when you access a UAV.

  • GCN_CAL_mandel.hpas – similar to the OpenCL+GCN_ISA version.
  • GCN_CAL_latency_test.hpas – “
  • GCN_OpenCL_Fibonacci_recursive.hpas – “
  • GCN_CAL_FractalComputeUnit.hpas – This is a big one. I’m not sure if it still works (don’t want to reinstall old drivers right now) but I included it because it contains serious macro examples: for example the __for__() macro, and array_aliases.

Why I’m sharing this?

I really like to program efficient hardware in an efficient way. (I also have some experience using SSE.) And I’m kinda amazed by this fresh, well designed architecture called GCN. Unfortunately there’s no official assembler for it. So feel free to try my reduced assembler to get a sneak peek of GCN asm, but don’t expect too much 😀

Some cool things that you can reach when you’re close to the metal:

  • True x86 like program flow. You can do jumps/calls/rets to any location in gpu memory.
  • 32bit integer ADD with carryOUT and optional carryIN, 24bit integer MAD (good for high-precision math)
  • You can use registers like an array (+1 cycle)
  • You can control register usage, so you can stay under 84 or 64 vregs for fast performance, or use all 256 vregs if you have to.
  • It has a QueryPerformanceCounter() equivalent. Though it’s very complicated to relate it to final kernel duration because of latency hiding. It can be a good tool to understand how the chip works internally (You can identify big stalls with it, and possibly reorder your code lines to perform better with less threads)

Posted in Uncategorized | Tagged , , , , , , , | 12 Comments