Last post was about making the first assembly version that just runs correctly. Luckily it became 1.17x faster than the original high level version. Now I will apply a series of optimizations and let’s see how effective are they. I’ll start with the most obvious ones.

## ver01: Reduce VGPRS count down to 128 -> 196%

I had the original VGPRS count unchanged in ver00, just to see how important is that let each of the GCN Vector V-ALUs deal not only one but two WaveFronts. When the V-ALU can choose from more than one WFs, then if one WF begins to wait for LDS or RAM, then it can immediately switch to another WF. It effectively hides LDS and MEM latency without the need of changing anything in the program code. (Latency hiding can be realized inside the code as well, but it needs more registers and more clever code. Allowing the GCN chip to parallellize the code in every cycle is much more elegant.)

So reducing VGPRS count from 164 down to 128 produces the speedup from 1.17x->2.30x. It’s** 196% **gain.

It is can be realised in asm like this:

numvgprs 128 v_temp_range 2..127

The “numvgprs” instruction tells the compiler that how many regs it should allocate on the hardware. “v_temp_range” specifies the compiler a range of vregs that make up a heap of regs it can temporarily give to code regions marked with enter/leave regions.

Just a theoretical nonsense at the moment: I have tried other vgprs counts:

85: It’s just a bit bigger than the 3 WafeFronts/V-ALU setup so it’s the same 2WFs/V_ALU. But somehow it’s 2.70x.

84: Every V-ALU can choose from not 2 but 3 WFs at any time. -> 4.20x

But it of course calculates wrong results. To make it work I would have to make it with 38 fewer VRegs. Maybe in the future it worth trying: Long time regions in the algorithm only needs 2x ulong[16] arrays and some temps. That can made under 85 and LDS can be used for swap locally-unused data. But at this early stage it not worth to bother with and later find out that maybe this was an opt unnecessary.

## ver02: Reduce code size to fit into 32KB ICache -> 115%

This is the second most obvious optimization in my opinion.

In order to do this I have to find repetitive tasks in the code and either put them into a loop or call them as a subroutine. For simplicity I’ll try to do it with loops.

Below is the previous code:

#macro PASS v_mov_b32 x[11*2+0], 0 v_mov_b32 x[11*2+1], 0 v_mov_b32 x[12*2+0], 0 v_mov_b32 x[12*2+1], 0 v_mov_b32 x[13*2+0], 0 v_mov_b32 x[13*2+1], 0 v_mov_b32 x[14*2+0], 0 v_mov_b32 x[14*2+1], 0 v_mov_b32 x[15*2+0], 0 v_mov_b32 x[15*2+1], M15hi v_xor_b32 x[15*2+1], H15hi, x[15*2+1] CNST_P(g, x, 0) ROUND_P(a, g) CNST_P(a, a, 1) ROUND_P(g, a) CNST_P(g, g, 2) ROUND_P(a, g) CNST_P(a, a, 3) ROUND_P(g, a) CNST_P(g, g, 4) ROUND_P(a, g) CNST_P(a, a, 5) ROUND_P(g, a) CNST_P(g, g, 6) ROUND_P(a, g) CNST_P(a, a, 7) ROUND_P(g, a) CNST_P(g, g, 8) ROUND_P(a, g) CNST_P(a, a, 9) ROUND_P(g, a) CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a) CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a) v_xor_b32 x[15*2+1], H15hi, x[15*2+1] CNST_Q(x, x, 0) ROUND_Q(a, x) CNST_Q(a, a, 1) ROUND_Q(x, a) CNST_Q(x, x, 2) ROUND_Q(a, x) CNST_Q(a, a, 3) ROUND_Q(x, a) CNST_Q(x, x, 4) ROUND_Q(a, x) CNST_Q(a, a, 5) ROUND_Q(x, a) CNST_Q(x, x, 6) ROUND_Q(a, x) CNST_Q(a, a, 7) ROUND_Q(x, a) CNST_Q(x, x, 8) ROUND_Q(a, x) CNST_Q(a, a, 9) ROUND_Q(x, a) CNST_Q(x, x, 10) ROUND_Q(a, x) CNST_Q(a, a, 11) ROUND_Q(x, a) CNST_Q(x, x, 12) ROUND_Q(a, x) CNST_Q(a, a, 13) ROUND_Q(x, a) __for__(i in[0..$1F], v_xor_b32 g[i], g[i], x[i] ) //combine P and Q __for__(i in[0.. $F], v_mov_b32 x[i], g[i+$10] ) v_xor_b32 g[15*2+1], H15hi, g[15*2+1] CNST_P(g, g, 0) ROUND_P(a, g) CNST_P(a, a, 1) ROUND_P(g, a) CNST_P(g, g, 2) ROUND_P(a, g) CNST_P(a, a, 3) ROUND_P(g, a) CNST_P(g, g, 4) ROUND_P(a, g) CNST_P(a, a, 5) ROUND_P(g, a) CNST_P(g, g, 6) ROUND_P(a, g) CNST_P(a, a, 7) ROUND_P(g, a) CNST_P(g, g, 8) ROUND_P(a, g) CNST_P(a, a, 9) ROUND_P(g, a) CNST_P(g, g, 10) ROUND_P(a, g) CNST_P(a, a, 11) ROUND_P(g, a) CNST_P(g, g, 12) ROUND_P(a, g) CNST_P(a, a, 13) ROUND_P(g, a) #endm #macro PASS_Transition __for__(i in[0.. $F], v_xor_b32 x[i], x[i], g[i+$10] ) //combine v_xor_b32 x[ 7*2+1], H15hi, x[7*2+1] v_mov_b32 x[ 8*2+0], $80 v_mov_b32 x[ 8*2+1], 0 v_mov_b32 x[ 9*2+0], 0 v_mov_b32 x[ 9*2+1], 0 v_mov_b32 x[10*2+0], 0 v_mov_b32 x[10*2+1], 0 #endm //issuing macros PASS PASS_Transition PASS

Processing these macroes resulting in 340KB of code. First I tried to make a loop like this:

s_temp passIdx s_movk_i32 passIdx, 0 s_while_i32(passIdx<2) PASS s_if_i32(passIdx=0) PASS_Transition _endif s_addk_i32 passIdx, 1 _endw

But it was a crashed. Later I’ve found that the s_cbranch instruction can only jump +-128KB away, and the jump needed was 170KB. From now the assembler will raise an exception when it happens again.

So first I have to make the loops inside the PASS macro. Round Constant addition will change as from now SRegs will hold the current roundIndex. There will be two types of round additions: The first one that will initialize the SRegs holding round indexes, and the second will use and increment those SRegs. This way S instructions will be interleaved with V instructions, thus not requiring additional clock cycles. Below are the new RoundConst adder macroes (they can also copy from one array to another):

s_temp RoundCnst[16] //these will be incremented before each round #macro CNST_P_first(dst, src) __for__(i in[0..$F], s_movk_i32 RoundCnst[i], i*$10+0 v_mov_b32 dst[i*2+1], src[i*2+1] v_xor_b32 dst[i*2+0], RoundCnst[i], src[i*2] ) #endm #macro CNST_P_next(dst, src) __for__(i in[0..$F], s_addk_i32 RoundCnst[i], 1 v_mov_b32 dst[i*2+1], src[i*2+1] v_xor_b32 dst[i*2+0], RoundCnst[i], src[i*2] ) #endm #macro CNST_Q_first(dst, src) __for__(i in[0..$F], s_mov_b32 RoundCnst[i], ![not((i*$10+0)<<24)] v_not_b32 dst[i*2+0], src[i*2] v_xor_b32 dst[i*2+1], RoundCnst[i], src[i*2+1] ) #endm #macro CNST_Q_next(dst, src) __for__(i in[0..$F], s_sub_u32 RoundCnst[i], RoundCnst[i], 1<<24 v_not_b32 dst[i*2+0], src[i*2] v_xor_b32 dst[i*2+1], RoundCnst[i], src[i*2+1] ) #endm

Also I made a simple iterator macro:

#macro mRepeat(repetionCount, what) //simple repeat macro. enter s_temp i s_movk_i32 i, 0 s_while_i32(i<repetionCount) what s_addk_i32 i, 1 //should rather check SCC right after this _endw leave #endm

Using these the new Groestl PASS can be squeezed:

#macro PASS v_mov_b32 x[11*2+0], 0 v_mov_b32 x[11*2+1], 0 v_mov_b32 x[12*2+0], 0 v_mov_b32 x[12*2+1], 0 v_mov_b32 x[13*2+0], 0 v_mov_b32 x[13*2+1], 0 v_mov_b32 x[14*2+0], 0 v_mov_b32 x[14*2+1], 0 v_mov_b32 x[15*2+0], 0 v_mov_b32 x[15*2+1], M15hi v_xor_b32 x[15*2+1], H15hi, x[15*2+1] CNST_P_first(a, x) ROUND_P(g, a) mRepeat(13, CNST_P_next( a, g) ROUND_P(g, a)) v_xor_b32 x[15*2+1], H15hi, x[15*2+1] CNST_Q_first(a, x) ROUND_Q(x, a) mRepeat(13, CNST_Q_next( a, x) ROUND_Q(x, a) ) __for__(i in[0..$1F], v_xor_b32 g[i], g[i], x[i] ) //combine P and Q __for__(i in[0.. $F], v_mov_b32 x[i], g[i+$10] ) v_xor_b32 g[15*2+1], H15hi, g[15*2+1] CNST_P_first(a, g) ROUND_P(g, a) mRepeat(13, CNST_P_next( a, g) ROUND_P(g, a) ) #endm

This reduced code size to 50K the instruction cache is now utilized well. But not forget one more loop for the 2 passes:

mRepeat(2, PASS s_if_i32(i=0) PASS_Transition //this must be called between the 2 PASSes _endif )

Final code size is 25KB now which is completely fits into the 32KB instruction cache. Because of this optimization the speed raised from 2.30x to 2.65x, that means a **115%** improvement. I expected this to be higher, but in the current state of the algo the biggest bottleneck is the LDS. So next time I’ll try to make LDS access faster somehow.

Theoretically when I remove the LDS reads then the speed goes up to 7.04x, just by doing only the V-ALU calculations. With this high V-ALU utilization the Instr Cache becomes a bigger issue: When I revert from 25KB to the old 340KB code size, the speed drops from 7.04x down to 4.46x. So if LDS wouldn’t be a bottleneck, the ICache optimization would give 158% benefit instead of 115%.

100% -> 128 -> 196% That is crazy good!

I am building a gcn asm compiler also for windows. It’s not as good as yours but it’s okay. Hopefully I’ll post it with the source code in the next few weeks.

1.00x -> 1.17x -> 2.65 That is crazy good! (Edit:I was looking at the wrong numbers.)

I have been following your posts here and on AMD Forums for a while and they are really insightful. Thanks for sharing.

a short note: gcn devices can execute onl 1 command from each wavefront on every cycle. so if you have one wavefront/simd, you can saturate valus but not the entire CU. if you have two wavefronts/simd, you can execute alu+lds or valu+salu operations in the single cycle

Running the minimum number of streams (getting higher number of vregs) is risky, yeah.

I had many tests long long ago, and it turned out that, you need only 4 WFs on every CU to be able to use s_alu+v_alu in every cycle. Only had make the program in a way, that the instruction stream is not denser than 12bytes/cycle: 8byte valu+4byte salu is ok. But 8byte valu+8byte salu becomes slower. This instruction decoder penalty is lost when we go 8 WFs/CU in excange for losing registers.

That was 3-4 years ago 😀 I totally don’t know how it is in GCN3.