Testing the GCN ASM Groestl kernel using sgminer 5.1

Making the original and the new kernel identical (from the outside)

In the previous blog post I reached the point where the desired kernel parameters (char *block, uint *output, ulong target) worked well in a small test kernel. After applying them to the main kernel, only one thing was left to implement in asm: detecting whether result<=target and marking it in the output buffer.

  //compare and report result
    enter \ v_temp addr, data, tmp \ s_temp oldE[2] align:2
    v_mov_b32     addr, $FF
    v_mov_b32     data, 1
    buffer_atomic_add  data, addr, resOutput, 0 idxen glc
    s_waitcnt     vmcnt(0)
    v_min_u32     addr, $FE, data //dont let it overflow

    //reverse byte order of gid
    v_bfe_u32 data, gid, 24, 8
    v_bfe_u32 tmp , gid, 16, 8 \ v_lshlrev_b32 tmp,  8, tmp \ v_or_b32 data, data, tmp
    v_bfe_u32 tmp , gid,  8, 8 \ v_lshlrev_b32 tmp, 16, tmp \ v_or_b32 data, data, tmp
                                 v_lshlrev_b32 tmp, 24, gid \ v_or_b32 data, data, tmp

    tbuffer_store_format_x  data, addr, resOutput, 0 idxen format:[BUF_DATA_FORMAT_32, BUF_NUM_FORMAT_FLOAT]
    dd $BF8C0F00 //s_waitcnt     vmcnt(0) & expcnt(0)

Simple if/relation handling: v_if_u64() is not a GCN instruction. It is a macro that identifies the relational operator and emits the appropriate compare instruction. It also emits the conditional jump and saves/modifies the exec mask based on the compare result.
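To make the exec mask handling a bit more concrete, here is a small Python model of what an if-macro like this typically has to do. It is only a sketch under my own assumptions: the function name v_if, the RELATIONS table and the return values are hypothetical, and the real macro emits GCN instructions (a compare, an exec save/and, a branch when exec is zero) rather than running a loop over lanes.

```python
import operator

# Hypothetical table: relation text -> comparison used for every lane
RELATIONS = {
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
    "==": operator.eq,
    "!=": operator.ne,
}

def v_if(exec_mask, lane_values, relation, rhs):
    """Model of an 'if' on a wavefront.

    exec_mask   : int, bit i set means lane i is active
    lane_values : per-lane left-hand side of the comparison
    Returns (saved_exec, new_exec, skip_body).
    """
    compare = RELATIONS[relation]
    saved_exec = exec_mask              # kept so the matching endif can restore it
    cmp_mask = 0
    for lane, value in enumerate(lane_values):
        if compare(value, rhs):
            cmp_mask |= 1 << lane       # the compare writes one bit per lane
    new_exec = saved_exec & cmp_mask    # only lanes that were active AND passed
    skip_body = new_exec == 0           # no lane left: jump over the body
    return saved_exec, new_exec, skip_body

saved, active, skip = v_if(0b1111, [5, 20, 7, 30], "<=", 10)
```

With four active lanes holding 5, 20, 7 and 30 and the condition value<=10, only lanes 0 and 2 stay active, and the body is not skipped.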

Atomics: The next output index is obtained with an atomic add. In the original OpenCL kernel I had to use atomic_inc(&output[0xFF]) as well, because I’m using a special test that returns more than 100 values in the output buffer, and I had to make sure that no values are lost to concurrent increments of the output index.
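The reporting logic of the asm listing can be modelled in a few lines of Python. This is a sequential sketch only, so the "atomic" add is just a read followed by a write; the names report_result, COUNTER_SLOT and LAST_RESULT_SLOT are mine and do not come from the kernel.

```python
COUNTER_SLOT = 0xFF      # output[0xFF] holds the number of results found
LAST_RESULT_SLOT = 0xFE  # highest slot a result may be written to

def report_result(output, gid):
    """Reserve the next slot in a 256-entry output list and store gid there."""
    old = output[COUNTER_SLOT]                       # value returned by the atomic add
    output[COUNTER_SLOT] = (old + 1) & 0xFFFFFFFF    # counter is a 32-bit uint
    slot = min(old, LAST_RESULT_SLOT)                # don't let the index overflow
    # gid is stored with its byte order reversed
    output[slot] = int.from_bytes(gid.to_bytes(4, "little"), "big")
    return slot

output = [0] * 256
first_slot = report_result(output, 0x11223344)
```

Once more than 0xFE results have been reported, every further result lands in slot 0xFE, while the counter at 0xFF keeps counting, so the host can still see how many hits there were.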

Swapping byte order: Well, that’s not too nice. I should rather find some packing instructions, but this is not the inner loop, so I just don’t care… Thinking about it a bit further, it would be pretty fast this way: swap the low and high words with v_alignbyte_b32; select the odd bytes with v_and and shift them right by 8 bits; select the even bytes, scale them up 256x and add them to the previous result with v_mad_u32_u24. Only 4 simple instructions instead of 9. It’s fun how a functionality can be built from various ‘LEGO’ pieces of the instruction set.
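Both variants are easy to check on the CPU. The Python sketch below mirrors the extract-and-or sequence from the listing and the rotate/mask/multiply-add idea; the function names are mine, and the 32-bit masks stand in for the register width.

```python
MASK32 = 0xFFFFFFFF

def bswap32_extract(x):
    """Byte swap the way the listing does it: extract each byte, shift, or."""
    data = (x >> 24) & 0xFF
    data |= ((x >> 16) & 0xFF) << 8
    data |= ((x >> 8) & 0xFF) << 16
    data |= (x << 24) & MASK32
    return data

def bswap32_rotate(x):
    """Byte swap via word swap, masks and a multiply-add."""
    r = ((x >> 16) | (x << 16)) & MASK32   # swap low and high 16-bit words
    odd = (r & 0xFF00FF00) >> 8            # odd bytes move down by 8 bits
    even = r & 0x00FF00FF                  # even bytes, at most 24 bits wide
    return (even * 256 + odd) & MASK32     # scale up 256x and add

swapped = bswap32_rotate(0x11223344)
```

The masked even bytes fit into 24 bits, which is why a 24-bit multiply-add is enough for the last step.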

New method of functional testing

Now the key is to compare the asm kernel to the original kernel. Here are the testing parameters:

  • block: the same randomly generated 80 bytes as before.
  • target: 0x0008FFFFFFFFFFFF  (must be high enough to generate more than 100 values in the output array)
  • global_work_offset: 567    (quite arbitrary too)
  • global_work_count: 256*10  *512   (1,310,720, approx. 1.3 million)
  • compressing the output[]: Iterate through all the output values, multiply each by a large prime number (402785417) and sum the products. The order of the values is not important; all that counts is that every value is present in the array.
  • Checking whether the 2 kernels are identical: This is the same as comparing the compressed ‘hashes’ of their outputs.
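The compression step can be sketched like this. The exact formula is an assumption on my side: I read the description as a plain sum of value*prime with 64-bit wraparound, and the function name compress_output is made up for the example.

```python
PRIME = 402785417
MASK64 = (1 << 64) - 1

def compress_output(values):
    """Order-independent checksum: sum of value * PRIME, wrapped to 64 bits."""
    acc = 0
    for value in values:
        acc = (acc + value * PRIME) & MASK64
    return acc

# The same values in a different order give the same checksum
a = compress_output([10, 20, 30])
b = compress_output([30, 10, 20])
```

Two kernels then count as identical for this test when compress_output() returns the same number for both output buffers, no matter in which order the work-items wrote their hits.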

Just for the record, the compressed output hash value calculated from the result of the above parameters is: 335485889931504896.

It was checked for both kernels, and the original and the new kernel calculate the *same results.
*Actually it is “pretty much the same”: the check covers the outcome of about 1.3 million GroestlCoin calculations using a relatively high target value.

Testing it live

Running sgminer 5.1 with the precompiled kernel binary (groestlcoinCapeverdegw256l4.bin) replaced by my new binary produced the expected 3.5x speedup. And I was kinda lucky, because I got the first ‘accepted’ after 10 minutes. The next one came 3 hours later. So I’m now more than sure that it works correctly. But of course we can only be 100% sure when it earns 3.5x more coins than the OpenCL version. That I cannot test, because I don’t have a mining rig.



Note that GPU1 did the mining, not GPU0. GPU1 is an HD7770 running at 1000 MHz (stock); it has 640 stream processors and a peak performance of 1.28 TFLOPS. It ran at around 63 °C. That’s kinda cool, because the bottleneck is LDS and L1.

In the next post I’ll write down the instructions on how to build a kernel for a specific GCN GPU and use it with sgminer 5.1.
