([Download Link] section updated. The produced ELF is NOT compatible with cat14.6b. It works well with cat13.4, and if you want a working disassembler, you should use cat12.10.)
With the #include directive it is possible to inline headers, so I made a stdgcn.inc to help with ‘everyday’ programming tasks.
A kernel code may start like this:
var code:=asm_isa(
  #include stdgcn.inc
  KernelInitUC(64,64,8192)
  ...
KernelInitUC(WorkGroupSize, VRegCount, LDSBytes) is defined in stdgcn.inc and does the following:
- sets important kernel parameters: WorkGroupSize, VRegCount and LDSBytes
- specifies buffers and reads pointers to them. In this case UC means 1 UAV and 1 constant buffer.
- prepares kernel indexes, and stores them in grpId=s0, lid=v0, gid=v1 aliases respectively.
- allocates vector and scalar registers for temp variables (more info later), making sure that the pool does not include resource constants and other important registers.
- measures the start time of the kernel. (well, maybe this should be optional)
Temp variable allocation is a new feature which helps you use variables in a structured form.
Before using this, a register pool must be allocated with s_temp_range and v_temp_range instructions. For example:
s_temp_range 1..7, 27..103
v_temp_range 2..64
From now on there will be a scope for variables allocated with the v_temp and s_temp instructions:
v_temp X, Y, Z          //note: the data type is always 32bit, for 64bit types you can use arrays
s_temp i,j,k
s_temp data align:16    //allocates a 16 dword array of sregs aligned to 16 dword boundary
Managing temp register scope
There are two special instructions for this: enter and leave. In a block between enter and leave, a new scope is created. One can allocate registers with s_temp and v_temp inside the block, and the leave instruction will release all the variables that were allocated inside it. This is very useful inside macros.
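To make the scoping rule concrete, here is a small Python model of a temp register pool with enter and leave. It is only a sketch of the behaviour described above, not the assembler's actual implementation: the class name TempRegPool, its methods, and the lowest-free-index-first allocation order are all assumptions made for illustration.

```python
class TempRegPool:
    """Toy model of a temp register pool with enter/leave scopes (hypothetical)."""

    def __init__(self, *ranges):
        # Each range is an inclusive (first, last) pair, like 1..7 or 27..103.
        self.free = sorted(r for lo, hi in ranges for r in range(lo, hi + 1))
        self.names = {}      # variable name -> register index
        self.scopes = [[]]   # stack of name lists, one list per open scope

    def temp(self, *names):
        # Allocate one 32-bit register per name.
        # Assumption: the lowest free index is handed out first.
        for name in names:
            if name in self.names:
                raise ValueError(f"{name} is already allocated")
            if not self.free:
                raise RuntimeError("register pool exhausted")
            self.names[name] = self.free.pop(0)
            self.scopes[-1].append(name)

    def enter(self):
        # Open a new scope; later allocations belong to it.
        self.scopes.append([])

    def leave(self):
        # Release every register allocated since the matching enter.
        if len(self.scopes) == 1:
            raise RuntimeError("leave without matching enter")
        for name in self.scopes.pop():
            self.free.append(self.names.pop(name))
        self.free.sort()


s_pool = TempRegPool((1, 7), (27, 103))   # mirrors: s_temp_range 1..7, 27..103
s_pool.temp("i", "j", "k")                # mirrors: s_temp i,j,k
s_pool.enter()
s_pool.temp("tmp")                        # only lives until the matching leave
inner_reg = s_pool.names["tmp"]
s_pool.leave()
```

After the leave call, tmp is gone and its register is back in the pool, while i, j and k keep their registers because they were allocated outside the block.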
Program structure macros
_if(), _else, _end: Lets you create if/else statements without using jumps and labels. The _if statement has to know which register you are going to check with it, so the proper form of the _if instruction is one of these:
- s_if(vccz) //scalar IF checking a scalar flag.
- s_if_i32(s6>-32769) //scalar if checking 32bit signed integer relation
- v_if_f64(v10<>s20) //vector if with 64bit float operand (and a 64bit float scalar)
Possible types for s_if are: i32, u32. And for v_if: i32, u32, i64, u64, f32, f64.
_while(), _endw: Makes a while block. You must use the same prefixes and suffixes for _while macro as you would use for the _if macro.
_repeat, _until(): Makes a repeat-until block. Prefix and suffix must be specified for _until().
_break, _continue: Can be used inside a _while-_endw or a _repeat-_until block.
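For readers curious how structured macros like these can replace hand-written jumps, the following Python sketch lowers if/else/while blocks to labels with a block stack. It is not taken from stdgcn.inc: the StructuredEmitter class, the generated label names, and the jump / jump_unless pseudo-instructions are all hypothetical, and it models only plain branching, not how the real scalar and vector macros expand.

```python
class StructuredEmitter:
    """Hypothetical lowering of _if/_else/_end and _while/_endw to labels."""

    def __init__(self):
        self.lines = []    # emitted pseudo-assembly
        self.blocks = []   # stack of currently open blocks
        self.counter = 0

    def _label(self, stem):
        self.counter += 1
        return f"{stem}_{self.counter}"

    def emit(self, text):
        self.lines.append(text)

    def if_(self, cond):
        blk = {"kind": "if", "skip": self._label("skip"),
               "end": self._label("endif"), "has_else": False}
        self.emit(f"  jump_unless {cond}, {blk['skip']}")
        self.blocks.append(blk)

    def else_(self):
        blk = self.blocks[-1]
        assert blk["kind"] == "if" and not blk["has_else"]
        self.emit(f"  jump {blk['end']}")   # the 'then' branch skips the else part
        self.emit(f"{blk['skip']}:")
        blk["has_else"] = True

    def end(self):
        blk = self.blocks.pop()
        assert blk["kind"] == "if"
        self.emit(f"{blk['end'] if blk['has_else'] else blk['skip']}:")

    def while_(self, cond):
        blk = {"kind": "while", "top": self._label("loop"),
               "end": self._label("endw")}
        self.emit(f"{blk['top']}:")
        self.emit(f"  jump_unless {cond}, {blk['end']}")
        self.blocks.append(blk)

    def endw(self):
        blk = self.blocks.pop()
        assert blk["kind"] == "while"
        self.emit(f"  jump {blk['top']}")
        self.emit(f"{blk['end']}:")

    def _innermost_loop(self):
        # _break and _continue always refer to the nearest enclosing loop.
        for blk in reversed(self.blocks):
            if blk["kind"] == "while":
                return blk
        raise RuntimeError("not inside a loop")

    def break_(self):
        self.emit(f"  jump {self._innermost_loop()['end']}")

    def continue_(self):
        self.emit(f"  jump {self._innermost_loop()['top']}")


e = StructuredEmitter()
e.while_("s6 > 0")
e.emit("  ; loop body")
e.if_("vccz")
e.break_()
e.end()
e.endw()
listing = "\n".join(e.lines)
```

The point of the block stack is that a break nested inside an if still finds the loop it belongs to, which is exactly the bookkeeping you would otherwise do by hand with labels.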
Memory IO macros
dwAddr is a dword index. uavId is 0-based. AOption can be one or more options of the tbuffer_ instruction, for example: glc.
uavWrite(uavId,dwaddr,value)
uavWrite(uavId,dwaddr,value,AOption)
uavRead(uavid, dwaddr,value)
uavRead(uavid, dwaddr,value,AOption)
cbRead(dwaddr,value)
note: These macros are so slow that they should not be used in an inner loop, but they provide easy access to memory.
The following are easy-access macros for the bitfields of the HW_INFO value. The result is placed in the provided scalar reg.
getWaveId(ghwRes)
getSIMDId(ghwRes)
getCUId(ghwRes)
getSHId(ghwRes)
getSEId(ghwRes)
getThreadGroupId(ghwRes)
getVirtualMemoryId(ghwRes)
getRingId(ghwRes)
getStateId(ghwRes)
And a complicated one that calculates the Global SIMD Id. You can identify the SIMD on which your program is running.
gwAddr: dword index in GDS memory
gdsWrite(gwAddr,gwData)
gdsRead(gwAddr,gwData)
gdsAdd(gwAddr,gwData)
Global Wave Synch
Id is a unique id chosen by you. gwsThreads: the total number of workgroups (or wavefronts, I’m not sure… the wrong one will crash :D)
Measuring execution time
_getTickInit                    //initializes T0 time. All other timing macros will work relative to this.
getTick(gtRes)                  //returns current time elapsed from T0, with lame 32bit calculations
breakOnTimeOut(botTimeoutMS)    //ensures that a loop cannot be infinite. Calls s_endpgm if botTimeoutMS is reached.
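Since getTick works with 32bit values, it is worth seeing what that implies for elapsed-time math. The Python sketch below shows the usual wrap-around-safe subtraction on a 32bit counter. It is an illustration only, not the code behind the macros: the function names are hypothetical, and the timeout is expressed in ticks because the tick-to-millisecond ratio depends on a clock rate that is not given here.

```python
MASK32 = 0xFFFFFFFF  # keep results in the 32-bit range, as a 32-bit register would


def elapsed_ticks(t0, now):
    """Ticks elapsed from t0 to now, where both are 32-bit counter samples.

    Masking the difference makes the result correct even if the counter
    wrapped around once between the two samples.
    """
    return (now - t0) & MASK32


def timed_out(t0, now, timeout_ticks):
    """True once at least timeout_ticks have elapsed since t0 (hypothetical helper)."""
    return elapsed_ticks(t0, now) >= timeout_ticks


# The counter wrapped between the two samples, yet the difference is still right.
wrapped = elapsed_ticks(0xFFFFFFF0, 0x00000010)
```

The limitation of 32bit arithmetic is that an interval longer than 2^32 ticks is indistinguishable from a short one, so a timeout check like this only works if it is polled often enough.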
KernelInit must be called right after including stdgcn.inc.
AGrpSize: number of workItems in a workGroup.
ANumVGPRS: allocated number of vector regs.
ALdsSizeBytes: as its name says.
KernelInitUUUC(AGrpSize,ANumVGPRS,ALdsSizeBytes) //3 UAVs and 1 ConstBuffer
Other buffer variants implemented: UU, UC, U