GCN Hello World example

This simple example will help you learn how to write simple GCN asm programs and how to test it in the HetPas environment. First look at this small OpenCL program:

__kenel test(__global int *uav, __constant int *cb)
{
  int gid = get_global_id(0);
  uav[gid] = gid + cb[0];
}

This simple thing will be implemented in the next sections.

1. Writing the GCN code

In HetPas it actually means that you have to make a pascal style string. In order to see syntax highlight and to have Ctrl+Space (code completion) you have to put your asm code inside an asm_isa() construct:

const code:=asm_isa(
  //The text you write here will be seen as GCN assembly cod
  //It will be syntax highlighted in the editor
);

Next we have to write a GCN header which will specify important things for the compiler. These instructions aren’t part of the GCN ISA.

isa79xx                  ;every gcn program starts with this 
  numVgprs 64            ;specify number of vector regs (max 256)
  numSgprs 105           ;and scalar regs (105 is the max)
  numThreadPerGroup 256  ;workitems in a workgroup
  ldsSize 0              ;LDS size in bytes (max 32KB)

Then we have to specify the kernel parameters. The current implementation supports any number of UAVs (__global int*) followed by one or zero ConstantBuffer (__const int* cb). The oclBuffers instruction is there to specify UAV_count a CB_count:

  oclBuffers 1,1 ;first value is number of UAVs,
     ;second value is number of const buffers(1 max)

Based on the UAV and CB configuration, the way you access those buffers and the scalar register which will contain the group_id at kernel startup will be different.
For the 1,1 combination here’s what we have to know:

 ; userElements[0] = IMM_UAV, 10, s[4:7]
 ; userElements[1] = IMM_CONST_BUFFER, 1, s[8:11]
 ; userElements[2] = IMM_CONST_BUFFER, 2, s[12:15]
 ; COMPUTE_PGM_RSRC2:USER_SGPR = 16

The above is a part of the disasm of our small OpenCL program.  It tells that:

  • the group_id will be placed in the s16 register.
  • UAV buffer resource (refer to SI ISA manual Table 8.5) is in s[4:7]
  • CB buffer is in s[8:11]
  • UAV offset table is in s[12:15]

So the OpenCL kernel’s first __global int *uav parameter is on the s[4:7] buffer resource offseted with the first dword at s[12:15]. (note s[12:15] is 128bit long and it’s a buffer resource).
The __constant int *cb parameter is at s[8:11] and the offset is 0 inside that buffer resource.
Note that we don’t have to know why is that this way, we’ll only use it. So let’s prepare these things in the GCN asm program:

  alias uav=s[4:7], cb=s[12:15], uavofs=s[8:11]

With the alias instruction we can have simple replace macros. (It works like #define in C but with a different syntax). Now we have simple labels for the 3 buffer resources we’re going to use.

  s_mulk_i32 s16, 256        //scale GroupID with GroupSize
  v_add_i32 v0, vcc, s16, v0 //v0: global threadId

As I mentioned before, we have the get_group_id() value in s16. In order to calculate get_global_id, we have to multiply it with the GroupSize (specified with numThreadPerGroup instruction) and add it to v0 register. At kernel start v0 contains get_local_id(). Note that all id’s are in flattened format.

And now let’s read the 2 constants we need: the UAV’s offset and the cb[0] value.

  s_buffer_load_dword s0, uavofs, 0x00 //s0: uav offset
  s_buffer_load_dword s1, cb, 0x00     //s1: cb[0]
  s_waitcnt lgkmcnt(0) //wait for s_buffer_loads

s_waitcnt is there to wait the s_buffer instructions. Those are executed in the background, and lgkmcnt(0) ensures that s0 and s1 contains valid values read from the buffers.

Now it’s time to implement our simple OpenCL program:

  uav[gid] = gid + cb[0];

We have the globalID in the v0 register and cb[0] in the s1 register and UavOffset is in s0. The following code will calculate the address in the uav buffer resource which is gid*sizeof(int)+uavOffset.

  v_mad_u32_u24 v2, v0, 4, s0   ; v2: address

That’s the byte address inside the UAV buffer resource. Let’s calculate gid+cb[0] too:

  v_add_i32     v3, vcc, s1, v0   ;v3: data

Only one last step remains: Write into memory

tbuffer_store_format_x  v3, v2, uav, 0 offen offset:0 
  format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]

BTW as we know that we don’t need data conversion, we can use typeless buffer_store too:

buffer_store_dword v3, v2, uav, 0 offen offset:0

The buffer store instruction is not finishes immediately and it will need the DATA value (v3) in time when it actually writes into memory.  With the s_waitcnt expcnt(0) instruction we can force the program to wait until the v3 register is accessible again.

The final instruction in our example tells the GPU the end of program:

s_endpgm

2. Testing the kernel in HetPas

The steps are:

  • Select a CL device
  • Compile/Build the kernel
  • Allocate/Initialize buffers
  • Run the kernel
  • Read the buffers
  • Cleanup

To select a device:

var dev:=cl.devices[1];  //specify device index
writeln(dev.dump);       //display some info

Note that cl.devices.count reports how many CL devices present in the system.

Compiling code is also simple:

var kernel:=dev.NewKernel(code);

The kernel object is ready to launch kernels. It also can produce temporary things:

writeln(kernel.ILCode);  //show AMD_IL disasm
writeln(kernel.ISACode); //show ISA disasm
filewrite('c:\test.elf',kernel.ELFImage);

The last line saves the kernel image. Later you can reload it with:

var kernel2:=dev.NewKernel(readfile('c:\test.elf'));

Our next task is to allocate buffers:

const WorkCount:=1536;
var uav:=dev.NewBuffer('rw',WorkCount*4{in bytes});
var cb:=dev.NewBuffer('r',4096{in bytes});
cb.intVArray:=(1000);  //it gets an array of ints
cb.ints[0]:=1000;      //the same as above with indexed property
                       //also there is floats[] and floatVArray

Now that the buffers are initialized, it’s time to run the kernel:

var ev:=kernel.run(WorkCount, uav, cb);//it can have max 4 params
ev.waitfor;
writeln(format('elapsed: %.3f ms',ev.elapsedtime_sec*1000));
ev.free;

We can access the UAV now:

with uav do for i:=0 to WorkCount-1 do
  writeln(format('%6x: %.8x %10d %10f',i,Ints[i],Ints[i],Floats[i]));

This dumps every DWords in the UAV in int, hex and float32 format.
All we got left is to free allocated objects.

uav.free;
cb.free;
kernel.free;

Note: Don’t have to free device because it is handled dynamically inside.

So that’s all.

This example code can be found in the /examples/OpenCL_OpenCL_HelloWorld.hpas file. I tested it with Catalyst 12.10 and with the 2013 May 14th release of HetPas. Download link.
It should work on Catalyst 13.4 also, but in that version the kernel.ISACode feature is broken. (Catalyst produces no disasm when you upload a binary only kernel)

Advertisements
This entry was posted in Uncategorized and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s