* update 160102: Please always check for the latest version in the download area
* update 130514: A new version is available for Cat 13.4
Successfully tested on HD7770 and on HD6970 with the following examples:
Download link -> in the header of this blog.
Software requirements: Windows XP, AMD Catalyst driver
Win7+ users (The app will need a classic XP-like win32 environment):
– Use “Run as Administrator”, because it will generate some temp/result files into C:\.
– Disable Data Execution Prevention, as it will use runtime generated machine code.
What is this?
HetPas is a small script compiler/executor with a small IDE. It supports 3 languages, with syntax highlighting and code insight to speed up development.
The supported languages are:
- Pascal with some C inspired things, this is the main/host language.
- AMD_IL. Mid-level asm-like language for HD4xxx..HD7xxx cards.
- GCN ISA. Lowest level asm language for HD77xx+ gfx cards.
What kernel files can it produce?
- CAL .elf image with AMD_IL code inside (uses AMD’s internal compiler). Works on all cards where AMD_IL works, except HD77xx with new drivers.
- CAL .elf image with GCN ISA binary (generated with own compiler) hd77xx+ only
- OpenCL .elf image loaded with GCN ISA binary, hd77xx+ only
What about this release?
It’s the very first release, so it can contain tons of bugs. The GCN ISA compiler is also a reduced one: it lacks some instruction groups, for example double precision encodings. Anything can change in the future, so don’t use it for serious projects. Just take it as a toy that lets you try out ideas on the GCN architecture.
Is there documentation?
Unfortunately not much: here’s a small reference of language elements -> HetPas Reference
Official documentation for AMD_IL and GCN_ISA -> amd-accelerated-parallel-processing-app-sdk/documentation
Check the documents “AMD Intermediate Language (IL) Specification (v2.0e)” and “AMD Southern Islands Instruction set Architecture”!
Indeed it’s not that much, how to start then?
(First, if you’re a Win7 user, you should disable UAC for this program, because it writes many temporary files to C:\. Use Run as Administrator, XP compatibility mode, or something similar.)
Note that at the moment this project is in an early beta/preview stage, so use it at your own risk only.
I suggest you first check out some hpas programs in the examples folder and learn from them!
- HetPasDemo.hpas – Contains many language elements of the host language.
- mandel.hpas – a small mandelbrot renderer
Then you can choose a gpu target:
a) HD4xxx..HD7xxx with CAL+AMD_IL.
b) HD77xx+ with OpenCL+GCN_ISA (Use the latest drivers, I’ve tested with 12-10 on win7 64) *Note: this is the most up-to-date target
- GCN_OpenCL_mandel.hpas – Single Precision mandelbrot renderer
- GCN_OpenCL_latency_test.hpas – You can measure how many cycles an instruction sequence takes.
- GCN_OpenCL_Fibonacci_recursive.hpas – Some advanced GCN tricks, like indirect S register addressing and goto to a specific address. This example also demonstrates C style precompiler macros.
c) HD77xx+ with CAL+GCN_ISA (Use cat11-12 driver on win7 64bit, or 12-2 on linux 32bit) This is a bit deprecated but works flawlessly with the right drivers; with the wrong drivers it simply crashes when you access UAV.
- GCN_CAL_mandel.hpas – similar to the OpenCL+GCN_ISA version.
- GCN_CAL_latency_test.hpas – “
- GCN_OpenCL_Fibonacci_recursive.hpas – “
- GCN_CAL_FractalComputeUnit.hpas – This is a big one. I’m not sure if it still works (don’t want to reinstall old drivers right now), but I included it because it contains serious macro examples: for example the __for__() macro, and array_aliases.
Why am I sharing this?
I really like to program efficient hardware in an efficient way. (I also have some experience using SSE.) And I’m kinda amazed by this fresh, well designed architecture called GCN. Unfortunately there’s no official assembler for it. So feel free to try my reduced assembler to get a sneak peek of GCN asm, but don’t expect too much 😀
Some cool things that you can reach when you’re close to the metal:
- True x86 like program flow. You can do jumps/calls/rets to any location in gpu memory.
- 32bit integer ADD with carryOUT and optional carryIN, 24bit integer MAD (good for high precision math)
- You can use registers like an array (+1 cycle)
- You can control register usage, so you can stay under 84 or 64 vregs for fast performance, or use all 256 vregs if you have to.
- It has a QueryPerformanceCounter() equivalent. Though it’s very complicated to relate it to final kernel duration because of latency hiding, it can be a good tool to understand how the chip works internally. (You can identify big stalls with it, and possibly reorder your code lines to perform better with fewer threads.)
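
To make two of the points above a bit more concrete, here is a rough sketch in plain Python (not HetPas or GCN asm). The first part emulates a 32bit add with carry in/out and chains it into a multi-word add, which is the usual reason carry flags matter for high precision math. The second part is a guess at why 84 and 64 vregs are natural limits: it assumes the commonly cited GCN figures of 256 vregs per SIMD lane, at most 10 wavefronts per SIMD, and vregs allocated in blocks of 4. Those numbers and all function names are my assumptions for illustration and do not come from HetPas itself.

```python
MASK32 = 0xFFFFFFFF


def add32_carry(a, b, carry_in=0):
    """Emulate a 32-bit add with optional carry-in; returns (sum, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add_multiword(x, y):
    """Add two equally long little-endian lists of 32-bit words.

    The carry-out of each word feeds the carry-in of the next one,
    just like a chain of add / add-with-carry instructions would.
    """
    result, carry = [], 0
    for a, b in zip(x, y):
        word, carry = add32_carry(a, b, carry)
        result.append(word)
    return result, carry


def waves_per_simd(vregs, vreg_file=256, max_waves=10, granularity=4):
    """Estimate resident wavefronts per SIMD for a given vreg count.

    Assumed hardware figures (not taken from the post): 256 vregs per
    lane, at most 10 wavefronts per SIMD, allocation in blocks of 4.
    """
    allocated = ((vregs + granularity - 1) // granularity) * granularity
    return min(max_waves, vreg_file // allocated)


if __name__ == "__main__":
    print(add_multiword([0xFFFFFFFF, 0x00000000], [1, 0]))
    for n in (24, 64, 84, 85, 256):
        print(n, "vregs ->", waves_per_simd(n), "wavefronts per SIMD")
```

Under these assumptions, 84 vregs still leaves room for 3 wavefronts per SIMD and 64 vregs for 4, while a kernel using all 256 vregs runs a single wavefront with nothing else to hide latency behind.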