Implementing Groestl hash function in GCN ASM

There is a new version of HetPas in the Download Area. It contains the Groestl asm project for Cat 14.9. See details at the bottom of this post.

Some months ago I was looking for an interesting GPU project to do in my free time, and I ran into the Groestl hash algorithm. It turned out to be a good algorithm to play with, so here's a case study of what benefits we get if we can go deeper than OpenCL.

The original code

I’ll start from this OpenCL source code: Pallas optimized groestlcoin / diamond etc. opencl kernel


Originally I downloaded it from here: http://devgurus.amd.com/message/1306845#1306845

Groestl documentation is here: http://www.groestl.info/


The Catalyst version I use is 14.9.
The speed of the algorithm on different hardware and/or Catalyst versions is shown below:

[Figure: groestl_speeds — hash rates on different hardware and Catalyst versions]

The baseline for all further performance comparisons will be the Pallas OpenCL version on an HD7770 at 1000 MHz with Cat 14.9, which runs at 4 MH/s.

Examining the OpenCL-compiled binary

Below is the repeating ‘main loop’ from the OpenCL code. It does 2*8 T-table lookups and the XORing. 2*2 values are looked up from RAM, and the remaining 6*2 are read from LDS. T0[] and T1[] are const arrays which effectively reside in memory; T2[]..T7[] are copied into local memory at startup.


a[0x0] ^= QC64(0x00, r); \
a[0x1] ^= QC64(0x10, r); \
...
a[0xE] ^= QC64(0xE0, r); \
a[0xF] ^= QC64(0xF0, r); \
t0[0x0] = B64_0(a[0x0]); \
t1[0x0] = B64_1(a[0x0]); \
...
t6[0xF] = B64_6(a[0xF]); \
t7[0xF] = B64_7(a[0xF]); \
RBTT(a[0x0], 0x1, 0x3, 0x5, 0xB, 0x0, 0x2, 0x4, 0x6); \
RBTT(a[0x1], 0x2, 0x4, 0x6, 0xC, 0x1, 0x3, 0x5, 0x7); \

Refer to Pallas’s source code for details!
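Schematically, my reading of the pattern above (not Pallas’s exact macro text) is that each tK[i] caches byte K of column i, and one RBTT call rebuilds a 64-bit column from eight table lookups, two of them served from global memory and six from the LDS copies:

// RBTT(a[c], i0, i1, i2, i3, i4, i5, i6, i7) amounts to:
//   a[c] = T0[t0[i0]] ^ T1[t1[i1]]                 // 2 lookups from RAM
//        ^ T2[t2[i2]] ^ T3[t3[i3]] ^ T4[t4[i4]]
//        ^ T5[t5[i5]] ^ T6[t6[i6]] ^ T7[t7[i7]];   // 6 lookups from LDS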
Here’s what the Catalyst 14.9 driver compiled from it:

 s_waitcnt vmcnt(2)
 v_xor_b32 v9, v84, v86
 v_xor_b32 v16, v85, v87
 s_waitcnt lgkmcnt(8)
 v_xor_b32 v9, v9, v66
 s_waitcnt lgkmcnt(7)
 v_xor_b32 v16, v16, v47
 s_waitcnt lgkmcnt(6)
 v_xor_b32 v9, v9, v69
 v_xor_b32 v16, v16, v70
 s_waitcnt lgkmcnt(5)
 v_xor_b32 v9, v9, v74
 s_waitcnt lgkmcnt(4)
 v_xor_b32 v16, v16, v73
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v9, v9, v76
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v16, v16, v75
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v9, v9, v78
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v16, v16, v77
 s_waitcnt vmcnt(0)
 v_xor_b32 v43, v91, v93
 v_xor_b32 v47, v92, v94
 v_xor_b32 v59, 0x07000000, v59
 v_not_b32 v133, v40
 v_xor_b32 v134, 0xbfffffff, v59
 v_lshr_b64 v[73:74], v[133:134], 16
 v_bfe_u32 v66, v73, 0, 8
 v_lshlrev_b32 v66, 3, v66
 v_add_i32 v73, vcc, 0x00001800, v66
 v_add_i32 v66, vcc, 0x00001804, v66
 ds_read_b32 v74, v81
 ds_read_b32 v75, v80
 ds_read_b32 v73, v73
 ds_read_b32 v66, v66
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v9, v9, v74
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v16, v16, v75
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v43, v43, v73
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v47, v47, v66
 v_xor_b32 v66, 0x07000000, v90
 v_not_b32 v67, v89
 v_xor_b32 v68, 0x5fffffff, v66
 v_lshr_b64 v[76:77], v[67:68], 24
 v_bfe_u32 v76, v76, 0, 8
 v_lshlrev_b32 v76, 3, v76
 v_bfe_u32 v77, v100, 0, 8
 v_lshlrev_b32 v77, 3, v77
 v_add_i32 v78, vcc, 0x00000800, v77
 v_add_i32 v77, vcc, 0x00000804, v77
 v_bfe_u32 v80, v6, 8, 8
 v_lshlrev_b32 v80, 3, v80
 v_add_i32 v81, vcc, 0x00001000, v80
 ds_read2_b32 v[83:84], v76 offset1:1
 ds_read_b32 v76, v78
 ds_read_b32 v77, v77
 ds_read_b32 v78, v81
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v43, v43, v83
 v_xor_b32 v47, v47, v84
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v43, v43, v76
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v47, v47, v77
 v_add_i32 v76, vcc, 0x00001004, v80
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v43, v43, v78
 v_bfe_u32 v77, v14, 16, 8
 v_lshlrev_b32 v77, 3, v77
 v_add_i32 v78, vcc, 0x00002000, v77
 v_add_i32 v77, vcc, 0x00002004, v77
 v_xor_b32 v15, 0x07000000, v15
 v_not_b32 v4, v4
 v_xor_b32 v15, 0xafffffff, v15
 v_lshrrev_b32 v80, 24, v15
 v_lshlrev_b32 v80, 3, v80
 v_add_i32 v81, vcc, 0x00002800, v80
 ds_read_b32 v76, v76
 ds_read_b32 v78, v78
 ds_read_b32 v77, v77
 ds_read_b32 v81, v81
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v47, v47, v76
 s_waitcnt lgkmcnt(2)
 v_xor_b32 v43, v43, v78
 s_waitcnt lgkmcnt(1)
 v_xor_b32 v47, v47, v77
 v_add_i32 v76, vcc, 0x00002804, v80
 s_waitcnt lgkmcnt(0)
 v_xor_b32 v43, v43, v81
 v_bfe_u32 v77, v99, 0, 8
 v_lshlrev_b32 v77, 3, v77
 v_add_i32 v77, vcc, s0, v77
 v_lshr_b64 v[83:84], v[5:6], 8
 v_bfe_u32 v78, v83, 0, 8
 v_lshlrev_b32 v78, 3, v78
 v_add_i32 v78, vcc, s9, v78
 v_mov_b32 v83, v8
 v_mov_b32 v84, v14
 v_lshr_b64 v[85:86], v[83:84], 16
 v_bfe_u32 v85, v85, 0, 8
 v_lshlrev_b32 v85, 3, v85
 v_add_i32 v86, vcc, 0x00001800, v85
 v_add_i32 v85, vcc, 0x00001804, v85
 v_xor_b32 v11, 0x07000000, v11
 v_not_b32 v87, v7
 v_xor_b32 v88, 0x6fffffff, v11
 v_lshr_b64 v[89:90], v[87:88], 24
 v_bfe_u32 v89, v89, 0, 8
 v_lshlrev_b32 v89, 3, v89
 ds_read_b32 v76, v76
 ds_read_b32 v86, v86
 ds_read_b32 v85, v85
 ds_read2_b32 v[89:90], v89 offset1:1
 s_waitcnt lgkmcnt(3)
 v_xor_b32 v47, v47, v76
 v_bfe_u32 v76, v36, 0, 8
 v_lshlrev_b32 v76, 3, v76
 v_add_i32 v91, vcc, 0x00000800, v76
 v_add_i32 v76, vcc, 0x00000804, v76
 v_bfe_u32 v92, v33, 8, 8
 v_lshlrev_b32 v92, 3, v92
 v_add_i32 v93, vcc, 0x00001000, v92
 v_add_i32 v92, vcc, 0x00001004, v92
 ds_read_b32 v91, v91
 ds_read_b32 v76, v76
 ds_read_b32 v93, v93
 ds_read_b32 v92, v92
 v_bfe_u32 v94, v26, 16, 8
 v_lshlrev_b32 v94, 3, v94
 v_add_i32 v95, vcc, 0x00002000, v94
 v_add_i32 v94, vcc, 0x00002004, v94
 v_lshrrev_b32 v96, 24, v134
 v_lshlrev_b32 v96, 3, v96
 v_add_i32 v97, vcc, 0x00002800, v96
 v_add_i32 v96, vcc, 0x00002804, v96
 ds_read_b32 v95, v95
 ds_read_b32 v94, v94
 v_bfe_u32 v98, v109, 0, 8
 v_lshlrev_b32 v98, 3, v98
 v_add_i32 v98, vcc, s0, v98
 v_lshr_b64 v[101:102], v[99:100], 8
 v_bfe_u32 v17, v101, 0, 8
 v_lshlrev_b32 v17, 3, v17
 v_add_i32 v17, vcc, s9, v17
 tbuffer_load_format_xy v[101:102], v77, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
 tbuffer_load_format_xy v[77:78], v78, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
 tbuffer_load_format_xy v[103:104], v98, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
 tbuffer_load_format_xy v[105:106], v17, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]

Some observations:

  • The maximum number of s_waitcnt instructions is used. Sure, the compiler finds them all, but I think they wouldn’t be that necessary if it read and then processed in larger batches. The scalar ALU could be used for better things, such as helping with address calculations.
  • Memory address calculations can be done with fewer instructions: for RAM, the UAV base and the table offset can be supplied in the tbuffer instruction’s scalar address parameter. No need to v_add; let the address calculation hardware do it. The *8 scaling of the qword index can be done using the memory resource’s stride field and the idxen flag. So altogether this calculation can be hardware accelerated: base_addr = uavbase + tableoffset + tableindex*8.
  • The compiler is clever: it uses v_bfe (Bit Field Extract) when it finds x>>16&0xFF style C code.
  • But if we consider that 64-bit LDS reads are always 64-bit aligned regardless of the lowest 3 bits of the byte address, then we can save the extra <<3 instruction: (x>>16&0xFF)<<3 becomes x>>13&0x7FF (see the sketch below).
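A minimal C illustration of that trick (assuming, as said above, that an aligned 64-bit LDS read ignores the low 3 bits of the byte address; the helper names are mine):

#include <stdint.h>

// Both forms compute the LDS byte address of table entry ((x>>16)&0xFF).
// They differ only in bits [2:0], which ds_read_b64 does not use, so the
// single-bfe form can replace the bfe+shift pair.
static inline uint32_t lds_addr_two_ops(uint32_t x) { return ((x >> 16) & 0xFF) << 3; } // v_bfe + v_lshl
static inline uint32_t lds_addr_one_op (uint32_t x) { return  (x >> 13) & 0x7FF; }      // single v_bfe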

Observing the OpenCL-produced binary:

  • VGPRs = 164. Bad, because it allows only the minimum number of wavefronts per CU (see the quick occupancy math below). No latency hiding; the ALU will sleep while waiting for RAM/LDS.
  • Binary size = 110 KB. Bad, because it doesn’t fit into the 32 KB instruction cache at all.
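For reference, this is the occupancy math behind the first point (my numbers from the GCN VGPR budget, not from the post):

// Each SIMD has a 256-entry VGPR file per lane and a CU has 4 SIMDs, so:
//   waves per SIMD = floor(256 / VGPRs per wavefront)
//   164 VGPRs -> 1 wave/SIMD -> 4 wavefronts/CU   (this binary)
//   128 VGPRs -> 2 waves/SIMD -> 8 wavefronts/CU  (the level worth targeting)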

Making the first assembly version

First I tried to rewrite the OpenCL code into a very simple asm implementation that just works. I haven’t implemented the first- and last-round optimizations; those could add an extra 8% speedup in the future, though.

Regarding kernel parameters: keep the input kernel parameters as simple as they can be: void main(__global ulong* a1). That’s all. This is necessary at the moment because the current version of HetPas doesn’t support non-buffer parameters, and the fewer parameters there are, the fewer buffer-resource calculations and the less S register usage follow. In the kernel header I manually add the UAV base offset into the buffer resource, so it doesn’t need to be added to the byte offset in every tbuffer instruction. I also set stride=8 in the buffer resource, so tbuffer instructions will scale my qword indices to byte indices whenever I use the ‘idxen’ option. In the final version it would be possible to make the kernel parameters compatible with the original OpenCL kernel, but the way this is done could change with every upcoming Catalyst version, so it is better to modify the host code a bit rather than hack it into the asm code. Also, the kernel is so simple that it doesn’t query the kernel domain ranges, so the global index is only 1D and zero based.
This is how the kernel header looks (sorry, no syntax highlighting this time):

//////////////////////////////// kernel header
isa79xx //this kernel must be called on a 0-based 1D kernel domain!
numthreadpergroup 256
ldssize 16384
oclbuffers 1, 0
numvgprs 256 v_temp_range 2..255
numsgprs 48 s_temp_range 1..3, 8..47

//////////////////////////////// Init ids, params
alias lid = v0, tid = s0, gid = v1, UAV = s[4:7]
s_buffer_load_dword s1, s[12:15], 0x00 //load uav base offset
s_mov_b32 tid, s16 //acquire tid
s_lshl_b32 s2, tid, 8 //calculate gid
v_add_i32 gid, vcc, s2, lid
s_waitcnt lgkmcnt(0)
s_add_u32 s4, s4, s1 s_addc_u32 s5, s5, 0 //adjust UAV res with uav base offset
s_andn2_b32 s5, s5, $3FFF0000 s_or_b32 s5, s5, $80000 //set 8byte record size for UAV
s_movk_i32 m0, -1 //disable LDS range checking

Next we initialize the LDS from RAM. It’s a good example of local variable allocation (enter/leave/v_temp/s_temp) and of the __for__() iteration macro. Also note that tbuffer is used with hardware indexing (an 8-byte record size is specified in the header).

  //initialize lds with Groestl T table ////////////////////////////////////////////////
 enter
 v_temp data[16] align:2
 v_temp vaddr
 s_temp saddr

 v_mov_b32 vaddr, lid

 __for__(i in [0..7],
   s_movk_i32 saddr, $100+i*$800 //ram table select
   tbuffer_load_format_xy data[i*2], vaddr, UAV, saddr idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT]
 )
 s_waitcnt vmcnt(0)

 __for__(i in [0..7],
   s_movk_i32 saddr, i*$800 //lds table select
   v_mad_u32_u24 vaddr, lid, 8, saddr
   ds_write_b64 vaddr, data[i*2]
 )
 s_waitcnt lgkmcnt(0)

 s_barrier
 leave
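For reference, this is the LDS map the loop above produces: 8 tables of 256 eight-byte entries, 16 KB total, matching ldssize in the header (my annotation):

// 0x0000..0x07FF : T0[256]  (8-byte entries, written at lid*8 + i*0x800)
// 0x0800..0x0FFF : T1[256]
// 0x1000..0x17FF : T2[256]
// 0x1800..0x1FFF : T3[256]
// 0x2000..0x27FF : T4[256]
// 0x2800..0x2FFF : T5[256]
// 0x3000..0x37FF : T6[256]
// 0x3800..0x3FFF : T7[256]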

Next I defined some macros for copying ulong[16] arrays and XORing the round constants into them:

#define CNST_P(dst, src, r) __for__(i in[0..$F], v_xor_b32 dst[i*2], i*$10+r, src[i*2] v_mov_b32 dst[i*2+1], src[i*2+1])
#define CNST_Q(dst, src, r) __for__(i in[0..$F], v_xor_b32 dst[i*2+1], ![not((i*$10+r)<<24)], src[i*2+1] v_not_b32 dst[i*2], src[i*2])

Simple moves and XORs with constants. Note that arrays in the assembler are always treated as arrays of 32-bit values; that’s why there are so many *2.
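In C terms this is roughly the following, per column and split into the two 32-bit halves the assembler works with (a sketch of my reading of the two macros, not generated code):

#include <stdint.h>

// CNST_P: XOR the round constant (i*0x10 + r) into the low dword, copy the high dword.
static void cnst_p(uint32_t dst[32], const uint32_t src[32], uint32_t r) {
    for (uint32_t i = 0; i < 16; i++) {
        dst[i*2]   = src[i*2] ^ (i*0x10 + r);   // constant lands in the low byte
        dst[i*2+1] = src[i*2+1];
    }
}

// CNST_Q: XOR ~((i*0x10 + r) << 24) into the high dword, invert the low dword.
static void cnst_q(uint32_t dst[32], const uint32_t src[32], uint32_t r) {
    for (uint32_t i = 0; i < 16; i++) {
        dst[i*2+1] = src[i*2+1] ^ ~((i*0x10 + r) << 24);
        dst[i*2]   = ~src[i*2];
    }
}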
Now the most important part follows, which does 8 lookups and XORs the result into an element of an array:

#macro RBTT(dst, didx, src, i0, i1, i2, i3, i4, i5, i6, i7)
enter
  v_temp addr[8]
  v_temp data[16] align:2
  v_and_b32 addr[0], $FF, src[i0*2+0]
  v_lshlrev_b32 addr[0], 3, addr[0]
  v_bfe_u32 addr[1], src[i1*2+0], 8-3, 8+3   v_add_i32 addr[1], vcc, $800*1, addr[1]
  v_bfe_u32 addr[2], src[i2*2+0], 16-3, 8+3  v_add_i32 addr[2], vcc, $800*2, addr[2]
  v_lshrrev_b32 addr[3], 24-3, src[i3*2+0]   v_add_i32 addr[3], vcc, $800*3, addr[3]
  v_and_b32 addr[4], $FF, src[i4*2+1]
  v_lshlrev_b32 addr[4], 3, addr[4]          v_add_i32 addr[4], vcc, $800*4, addr[4]
  v_bfe_u32 addr[5], src[i5*2+1], 8-3, 8+3   v_add_i32 addr[5], vcc, $800*5, addr[5]
  v_bfe_u32 addr[6], src[i6*2+1], 16-3, 8+3  v_add_i32 addr[6], vcc, $800*6, addr[6]
  v_lshrrev_b32 addr[7], 24-3, src[i7*2+1]   v_add_i32 addr[7], vcc, $800*7, addr[7]
  ds_read_b64 dst[didx*2], addr[0]
  __for__(i in[1..7], ds_read_b64 data[i*2], addr[i])
  s_waitcnt lgkmcnt(0)
  __for__(i in[1..7], v_xor_b32 dst[didx*2 ], dst[didx*2 ], data[i*2 ]
                      v_xor_b32 dst[didx*2+1], dst[didx*2+1], data[i*2+1])
leave
#endm

“dst” is the name of the destination array
“didx” is the index into “dst”
“src” is the source array
“i0”..“i7” are the column indexes for the byte-scramble operation.
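To make the data flow explicit, here is a small C model of what one RBTT invocation computes (my interpretation; T stands for the eight 2 KB tables sitting in LDS at offsets k*0x800):

#include <stdint.h>

// dst[didx] = T0[byte0(src[i0])] ^ T1[byte1(src[i1])] ^ ... ^ T7[byte7(src[i7])],
// where byteK(v) is byte K of the 64-bit column v, matching the v_and/v_bfe/v_lshr
// address calculations in the macro above.
static uint64_t rbtt(const uint64_t T[8][256], const uint64_t src[16], const int idx[8])
{
    uint64_t acc = 0;
    for (int k = 0; k < 8; k++)
        acc ^= T[k][(src[idx[k]] >> (8 * k)) & 0xFF];
    return acc;
}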

Issuing RBTT 16 times (once per column) makes up a round:

#macro ROUND(dst, src, i0, i1, i2, i3, i4, i5, i6, i7)
__for__(i in[0..$F], RBTT(dst, i, src, (i0+i)%16, (i1+i)%16, (i2+i)%16, (i3+i)%16, (i4+i)%16, (i5+i)%16, (i6+i)%16, (i7+i)%16) )
#endm
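With the hypothetical rbtt() helper from the previous sketch, the expansion of ROUND is simply (again my C reading, not generated code):

// One round rebuilds all 16 columns; the rotated column indices are the
// byte-scramble mentioned above.
for (int i = 0; i < 16; i++) {
    const int idx[8] = { (i0+i)%16, (i1+i)%16, (i2+i)%16, (i3+i)%16,
                         (i4+i)%16, (i5+i)%16, (i6+i)%16, (i7+i)%16 };
    dst[i] = rbtt(T, src, idx);
}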

Then the work arrays are defined, and the message is loaded into one of them (x):


//load message block /////////////////////////////////
v_temp x[32],g[32], a[32] align:2 //x:message

enter
  v_temp vaddr
  __for__(i in [0..9],
    v_mov_b32 vaddr, $10+i
    tbuffer_load_format_xy x[i*2], vaddr, UAV, 0 idxen format:[BUF_DATA_FORMAT_32_32, BUF_NUM_FORMAT_FLOAT]
  )
  s_waitcnt vmcnt(0)
leave
                         v_mov_b32 x[ 9*2+1], gid
v_mov_b32 x[10*2+0], $80 v_mov_b32 x[10*2+1], 0

The 1D, zero-based GlobalID goes into the last DWORD of the message.

There are 2 Groestl passes in GroestlCoin, so let’s make a PASS() macro:

#macro PASS0Final
  __for__(i in[0.. $F], v_xor_b32 x[i], x[i], g[i+$10] ) //combine

  v_xor_b32 x[ 7*2+1], H15hi, x[7*2+1]
  v_mov_b32 x[ 8*2+0], $80 v_mov_b32 x[ 8*2+1], 0
  v_mov_b32 x[ 9*2+0], 0 v_mov_b32 x[ 9*2+1], 0
  v_mov_b32 x[10*2+0], 0 v_mov_b32 x[10*2+1], 0
#endm

#macro PASS(passIdx) //passIdx: 0..1
  v_mov_b32 x[11*2+0], 0 v_mov_b32 x[11*2+1], 0
  v_mov_b32 x[12*2+0], 0 v_mov_b32 x[12*2+1], 0
  v_mov_b32 x[13*2+0], 0 v_mov_b32 x[13*2+1], 0
  v_mov_b32 x[14*2+0], 0 v_mov_b32 x[14*2+1], 0
  v_mov_b32 x[15*2+0], 0 v_mov_b32 x[15*2+1], M15hi

  v_xor_b32 x[15*2+1], H15hi, x[15*2+1]
  CNST_P(g, x, 0) ROUND_P(a, g) CNST_P(a, a, 1) ROUND_P(g, a)
  CNST_P(g, g, 2) ROUND_P(a, g) CNST_P(a, a, 3) ROUND_P(g, a)
  CNST_P(g, g, 4) ROUND_P(a, g) CNST_P(a, a, 5) ROUND_P(g, a)
  CNST_P(g, g, 6) ROUND_P(a, g) CNST_P(a, a, 7) ROUND_P(g, a)
  CNST_P(g, g, 8) ROUND_P(a, g) CNST_P(a, a, 9) ROUND_P(g, a)
  CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a)
  CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a)
  v_xor_b32 x[15*2+1], H15hi, x[15*2+1]

  CNST_Q(x, x, 0) ROUND_Q(a, x) CNST_Q(a, a, 1) ROUND_Q(x, a)
  CNST_Q(x, x, 2) ROUND_Q(a, x) CNST_Q(a, a, 3) ROUND_Q(x, a)
  CNST_Q(x, x, 4) ROUND_Q(a, x) CNST_Q(a, a, 5) ROUND_Q(x, a)
  CNST_Q(x, x, 6) ROUND_Q(a, x) CNST_Q(a, a, 7) ROUND_Q(x, a)
  CNST_Q(x, x, 8) ROUND_Q(a, x) CNST_Q(a, a, 9) ROUND_Q(x, a)
  CNST_Q(x, x, 10) ROUND_Q(a, x) CNST_Q(a, a, 11) ROUND_Q(x, a)
  CNST_Q(x, x, 12) ROUND_Q(a, x) CNST_Q(a, a, 13) ROUND_Q(x, a)

  __for__(i in[0..$1F], v_xor_b32 g[i], g[i], x[i] ) //combine P and Q
  __for__(i in[0.. $F], v_mov_b32 x[i], g[i+$10] )
  v_xor_b32 g[15*2+1], H15hi, g[15*2+1]

  CNST_P(g, g, 0) ROUND_P(a, g) CNST_P(a, a, 1) ROUND_P(g, a)
  CNST_P(g, g, 2) ROUND_P(a, g) CNST_P(a, a, 3) ROUND_P(g, a)
  CNST_P(g, g, 4) ROUND_P(a, g) CNST_P(a, a, 5) ROUND_P(g, a)
  CNST_P(g, g, 6) ROUND_P(a, g) CNST_P(a, a, 7) ROUND_P(g, a)
  CNST_P(g, g, 8) ROUND_P(a, g) CNST_P(a, a, 9) ROUND_P(g, a)
  CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a)
  CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a)

  __IF__(passIdx=0, PASS0Final)
#endm
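For orientation, each PASS corresponds to one Groestl compression plus the output transformation as defined in the specification linked above (my summary, not taken from the post):

// h_new = P(h ^ m) ^ Q(m) ^ h         -- compression of the single message block
// out   = trunc( P(h_new) ^ h_new )   -- output transformation
// GroestlCoin hashes the result with Groestl once more, hence the two PASS() calls.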

And finally we use those macros and extract a ulong value which will be compared against “target”:

  PASS(0)
  PASS(1)

  v_xor_b32 g[0], g[11*2], x[3*2] v_xor_b32 g[1], g[11*2+1], x[3*2+1] //result value
  dump64($1F, g[0]) //dump64 writes an ulong into the buffer if the globalID is 1234

  s_endpgm

And that’s the whole kernel. It does not compare with “target”; it only returns the ulong value if the current thread’s GID is 1234.

Testing the first assembly version

These parameters are used for the functional tests:

const testBlock := #$6f7037939d1aa4a9863574ddf41a0d371799dfea+
                   #$89b37ecb1ecded76426afa25108feec755347891+
                   #$b3fa9afd2a360cf64f56e4d20f0c8c03ca411b3a+
                   #$29dd28ea4fc0cddf9a1e8c707966b7a700000000; //80 bytes
const testResponse := $9FB391FF6984DFA9;           //^^^^^^^^ gid goes here
#define debugThread 1234

The above block is completed with the gid=1234 value at its end, and then the resulting ulong is checked against the testResponse constant. TestBlock is random, and the testResponse constant was extracted from the original OpenCL kernel.

Checking the speed of the first asm version

Total work items: 256*10*512   (256 = thread group size, and 10 because my card has 10 CUs)
Running the test 4x is enough at this point:

elapsed: 564.523 ms 4.644 MH/s gain: 1.16x
elapsed: 561.300 ms 4.670 MH/s gain: 1.17x
elapsed: 561.266 ms 4.671 MH/s gain: 1.17x
elapsed: 561.303 ms 4.670 MH/s gain: 1.17x

First speedup is 1.17x.

Without the first- and last-round optimizations, but with the simplified LDS addressing, it is a bit faster than the OpenCL version. The gain is measured against the baseline OCL version’s 4 MH/s on my system: HD7770 1000 MHz, Cat 14.9, Win7/64. With Cat 14.6 the gain would be only around 1.00x, as 14.9 somehow produces less optimal output.

For this first version, two characteristics are still close to the OpenCL version:

  • Kernel size: much greater than the 32 KB I-cache, it’s 340 KB (OCL: 110 KB, so it must contain a small loop, compared to my version which is 100% unrolled at the moment).
  • VReg usage: greater than 128, so one CU can hold only the minimum of 4 wavefronts at any time. No latency hiding at all, just like in the OCL version.

Below is the ‘main loop’ extracted from the first asm version. It does the same amount of work as the disassembly extracted from the OCL version: 2*8 lookups and the surrounding XORs.

enter v_temp addr[8] v_temp data[16] align:2
v_and_b32 addr[0], $FF, g[ ( 0+0)%16*2+0]
v_lshlrev_b32 addr[0], 3, addr[0]
v_bfe_u32 addr[1], g[ ( 1+0)%16*2+0], 8-3, 8+3
v_add_i32 addr[1], vcc, $800*1, addr[1]
v_bfe_u32 addr[2], g[ ( 2+0)%16*2+0], 16-3, 8+3
v_add_i32 addr[2], vcc, $800*2, addr[2]
v_lshrrev_b32 addr[3], 24-3, g[ ( 3+0)%16*2+0]
v_add_i32 addr[3], vcc, $800*3, addr[3]
v_and_b32 addr[4], $FF, g[ ( 4+0)%16*2+1]
v_lshlrev_b32 addr[4], 3, addr[4]
v_add_i32 addr[4], vcc, $800*4, addr[4]
v_bfe_u32 addr[5], g[ ( 5+0)%16*2+1], 8-3, 8+3
v_add_i32 addr[5], vcc, $800*5, addr[5]
v_bfe_u32 addr[6], g[ ( 6+0)%16*2+1], 16-3, 8+3
v_add_i32 addr[6], vcc, $800*6, addr[6]
v_lshrrev_b32 addr[7], 24-3, g[ ( 11+0)%16*2+1]
v_add_i32 addr[7], vcc, $800*7, addr[7]
ds_read_b64 a[ 0*2], addr[0]
ds_read_b64 data[1*2], addr[1]
ds_read_b64 data[2*2], addr[2]
ds_read_b64 data[3*2], addr[3]
ds_read_b64 data[4*2], addr[4]
ds_read_b64 data[5*2], addr[5]
ds_read_b64 data[6*2], addr[6]
ds_read_b64 data[7*2], addr[7]
s_waitcnt lgkmcnt(0)
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[1*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[1*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[2*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[2*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[3*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[3*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[4*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[4*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[5*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[5*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[6*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[6*2+1]
v_xor_b32 a[ 0*2 ], a[ 0*2 ], data[7*2 ]
v_xor_b32 a[ 0*2+1], a[ 0*2+1], data[7*2+1]
leave
enter v_temp addr[8] v_temp data[16] align:2
v_and_b32 addr[0], $FF, g[ ( 0+1)%16*2+0]
v_lshlrev_b32 addr[0], 3, addr[0]
v_bfe_u32 addr[1], g[ ( 1+1)%16*2+0], 8-3, 8+3
v_add_i32 addr[1], vcc, $800*1, addr[1]
v_bfe_u32 addr[2], g[ ( 2+1)%16*2+0], 16-3, 8+3
v_add_i32 addr[2], vcc, $800*2, addr[2]
v_lshrrev_b32 addr[3], 24-3, g[ ( 3+1)%16*2+0]
v_add_i32 addr[3], vcc, $800*3, addr[3]
v_and_b32 addr[4], $FF, g[ ( 4+1)%16*2+1]
v_lshlrev_b32 addr[4], 3, addr[4]
v_add_i32 addr[4], vcc, $800*4, addr[4]
v_bfe_u32 addr[5], g[ ( 5+1)%16*2+1], 8-3, 8+3
v_add_i32 addr[5], vcc, $800*5, addr[5]
v_bfe_u32 addr[6], g[ ( 6+1)%16*2+1], 16-3, 8+3
v_add_i32 addr[6], vcc, $800*6, addr[6]
v_lshrrev_b32 addr[7], 24-3, g[ ( 11+1)%16*2+1]
v_add_i32 addr[7], vcc, $800*7, addr[7]
ds_read_b64 a[ 1*2], addr[0]
ds_read_b64 data[1*2], addr[1]
ds_read_b64 data[2*2], addr[2]
ds_read_b64 data[3*2], addr[3]
ds_read_b64 data[4*2], addr[4]
ds_read_b64 data[5*2], addr[5]
ds_read_b64 data[6*2], addr[6]
ds_read_b64 data[7*2], addr[7]
s_waitcnt lgkmcnt(0)
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[1*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[1*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[2*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[2*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[3*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[3*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[4*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[4*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[5*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[5*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[6*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[6*2+1]
v_xor_b32 a[ 1*2 ], a[ 1*2 ], data[7*2 ]
v_xor_b32 a[ 1*2+1], a[ 1*2+1], data[7*2+1]
leave

With a few more registers this could be pipelined (2 stages: the next column’s address calculation and LDS reads overlapped with the previous column’s XORs), but it will be easier to go below 128 VRegs and let the GPU share the resources (ALU and LDS reads) across 2 concurrent wavefronts.
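A rough outline of that 2-stage schedule (hypothetical pseudocode, just to show the overlap; it is not implemented in this version):

// calc_addrs(0); issue_lds_reads(0);
// for i = 0 .. 14:
//     calc_addrs(i+1)          // overlaps the LDS latency of column i's reads
//     issue_lds_reads(i+1)
//     wait_lds(i)              // s_waitcnt only down to column i's reads
//     xor_accumulate(i)
// wait_lds(15); xor_accumulate(15)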
A random piece of information: the source file after macro processing is more than 2.3 MB; the assembler is not a bad compressor, as it compresses that down to a 340 KB binary.

Check it yourself!

For those who want to see this working on their system, I uploaded a special Groestl version of HetPas. Click on the [Download Link] on the menu bar at the top of this page!

Instructions to make it run:

  • A minimum of Win7 (32 or 64 bit) is required; that’s also what the GCN cards require.
  • Use the 14.9 Catalyst! Older versions are guaranteed to crash as they pass the kernel parameters differently. Newer Cat versions are untested.
  • You have to disable Data Execution Prevention (DEP) for the exe, because it patches some Delphi runtime library functionality at startup.
  • Either [Run As Administrator] or put it in a folder where it can write files. This is needed to export the temporary files of the OpenCL compiler. It also writes an .ini file in the exe’s path.
  • Open the “groestl\groestl_ocl.hpas” file to test the original OpenCL kernel (by Pallas). Press F9 to Run.
  • Open the “groestl\groestl_isa.hpas” file to test the GCN assembly version. If it works ok, then it should display “RESULT IS OK”. It’s ok if it says “TEMPDATA IS WRONG”.
  • In the examples folder only OpenCL_OpenCL_HelloWorld.hpas is compatible with Cat 14.9. The others crash because of the changed method of passing kernel parameters in registers. To try the examples, use Cat 13.4 or Cat 12.10. Cat 12.10 has a working disassembler that disassembles binary-only ELF images, but that version is so old that it doesn’t handle new cards. To ‘downgrade’ the Catalyst version you may have to use the Catalyst Clean Uninstall Utility.

To be continued…

Now that the first asm version works correctly, in the next post I’ll examine different optimization methods to make it worth descending from the OpenCL language down to GCN assembly.
