# Optimization opportunities:

After a bit of googling, a good solution to the problem would be a **partial sorting network** which computes only the median value.

[1] -> https://software.intel.com/sites/default/files/m/d/4/1/d/8/MedianFilter.pdf

Another optimization is to reuse intermediate data by computing **not just one but several median kernels at a time**.

[2] -> https://www.cg.tuwien.ac.at/research/publications/1994/Kopp-1994-EMF/TR-186-2-94-18Paper.pdf

The paper reports a 3x speedup from computing 2 kernels at once and reusing data while moving the kernel forward by 2 pixels.

Also, **SSE** can really speed this up because it offers 16-wide parallel min/max instructions. I think the biggest speedup will come from this, so the whole algorithm must be bent to support SSE as much as possible. This algorithm does a lot of computation with a moderate amount of memory I/O, which is a good fit for SSE.

Finally there’s **multithreading**, but it’s so straightforward that I won’t even implement it.

# Let’s plan the algorithm

First we need some basic sorting network to start from: a 9-element **sorting network**. I chose the one that groups the data into 3-element groups at the beginning. These 3-element groups will represent the columns of our moving 3×3 kernel window.

Here’s an online generator -> http://pages.ripco.net/~jgamble/nw.html

The variant called “Best” seems to suit our needs well. -> http://jgamble.ripco.net/cgi-bin/nw.cgi?inputs=9&algorithm=best&output=svg

When visualizing the data paths of the median value, it turned out that 5 min/max operations aren’t used at all; removing those reduces the operation count from 25 down to 20. This trick is used in the Intel optimization paper [1].

After a bit of reorganizing we can see the 3-element sort ranges at the beginning; these operations will be reused by adjacent kernels. There is also a noticeable 6-element group. And after writing an error checker, it turned out that one more operation can be eliminated.
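To make the operation count concrete, here is a scalar C sketch of such a partial median-of-9 network. The comparator order below follows the widely used `opt_med9` arrangement (19 comparators), which may differ from the exact order in my diagrams, but the idea is the same: keep only the comparators the median output depends on.

```c
#include <stdint.h>

/* One comparator of the network: after the call, a holds the smaller and
   b the larger value (one min plus one max operation). */
#define SORT2(a, b) do { uint8_t t = (a) < (b) ? (a) : (b); \
                         (b) = (a) > (b) ? (a) : (b); (a) = t; } while (0)

/* Partial sorting network for the median of 9: only the comparators that
   can influence the median output are kept -- 19 SORT2 ops instead of the
   25 of a full 9-element network. */
uint8_t median9(uint8_t p0, uint8_t p1, uint8_t p2,
                uint8_t p3, uint8_t p4, uint8_t p5,
                uint8_t p6, uint8_t p7, uint8_t p8)
{
    /* sort the three 3-element columns first (reusable between kernels) */
    SORT2(p1, p2); SORT2(p4, p5); SORT2(p7, p8);
    SORT2(p0, p1); SORT2(p3, p4); SORT2(p6, p7);
    SORT2(p1, p2); SORT2(p4, p5); SORT2(p7, p8);
    /* merge, keeping only the paths that can reach the median */
    SORT2(p0, p3); SORT2(p5, p8); SORT2(p4, p7);
    SORT2(p3, p6); SORT2(p1, p4); SORT2(p2, p5);
    SORT2(p4, p7); SORT2(p4, p2); SORT2(p6, p4);
    SORT2(p4, p2);
    return p4;
}
```

The first three rows are the per-column sorts; those are exactly the results that adjacent kernels can share.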

Now let’s expand it to 2 kernels, where we have 4 input columns (12 values in total) and need to pull out 2 medians.

The workflow:

- At every iteration, we move the 2 kernels forward by 2 pixels: copying 2×6 sorted values (light green) and sorting the 2×3 incoming ones: **6 minmax ops, 6 moves**
- Then comes the purple rectangle, which is a preparation step: **2 minmax ops**
- The next, orange stage saves 5 values for later use: **5 moves**
- And finally the two median calculations: **2×8 minmax ops, 5 moves** (recovering the saved orange values)

**Total min/max ops: 6+2+2×8 = 24, total moves: 6+5+5 = 16.**
That’s 12 minmax ops per median value, about 60% more efficient than the single-kernel partial sorting network.

There were attempts to optimize this further in 2D [1], but the gain was so small compared to the 1D version that I didn’t want to make the algorithm any bigger.

SSE estimation: in the worst case a minmax op takes 3 cycles and a move takes 1. So the total is 12*3+8 = 44 cycles per median calculation. Using SIMD on byte data we get 16 lanes, so the estimate drops to around 2.8 cycles per median. (Not including the fetch and store phases; because of the 24-bit input data those will be the biggest problem in this median filter, and they will be discussed later.)

The two building blocks we need are these:

**MinMax:** uses a temporary register plus an SSE min and an SSE max instruction. The result is always in a register; one of the inputs can come from a memory location. Many input combinations (reg/mem) will be used, and in contrast to non-partial sorting networks, here we will have unused outputs of either the min or the max function. I will not code these manually; I’ll use an automatic instruction-flow generator tool that also handles register allocation, so I only have to watch out for the proper instruction order.
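As an illustration (the real code is generated as raw PMINUB/PMAXUB instructions by the tool, not written with intrinsics), the MinMax building block maps to SSE2 like this:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* The MinMax building block: one temporary register, one PMINUB and one
   PMAXUB. After the call, *lo holds the 16 element-wise minima and *hi the
   16 maxima -- a 16-lane SORT2. In a partial network, either output may
   simply be left unused. */
static inline void minmax16(__m128i *lo, __m128i *hi)
{
    __m128i t = *lo;                /* temporary register */
    *lo = _mm_min_epu8(*lo, *hi);   /* PMINUB */
    *hi = _mm_max_epu8(t, *hi);     /* PMAXUB */
}
```

One of the two inputs could equally be a memory operand in the generated assembly; the intrinsic form leaves that choice to the compiler.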

**Move:** it will be implicit, as I’m using the instruction-flow generator thingie.

# Fetch and Store

This will be quite difficult because of the nature of RGB24.

First we have to decide in which direction to process the data in the 2D bitmap.

There are two variants (for the above scalar example):

- Left to right: this is the natural order of system memory, so it is good for the cache.
- Top to bottom: this is not so good for the cache and requires additional prefetching, but at certain bitmap widths it could still render the cache useless.

Let’s widen it in order to fill an XMM register:

Now the register utilization is much better.

From the two variants I immediately threw out the [left to right] one because it needs so many small reads, and that means a lot of shuffling to fill the register.

With the top-down variant we can read memory in large chunks, and no complex shuffling is needed; the already loaded data can be reused by shifting. Additionally, on recent systems that have the SSSE3 extension it would be even more effective using this ‘super weapon’ called PALIGNR, and even more complex shuffling could be achieved with the PSHUFB byte-permute instruction. But first, I’ll do it with only simple memory reads and simple SSE instructions.

Thinking a bit more about the problem, that 93% register utilization can simply be extended to 100%, because the important thing in the current problem is not to separate the RGB triplets properly, but only to pair each color component with its horizontal neighbors. They are exactly 3 bytes away with RGB, 4 bytes away with RGBA and 1 byte away with grayscale. In other words, it’s irrelevant whether we group the colors as RGB, RGB, RGB, RGB or RGBR, GBRG, BRGB.
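A small scalar C illustration of this pairing idea (the helper name `hmin3_rgb24` is made up, and a 3-tap horizontal min stands in for the full kernel): the partners of byte `x` are `x-3` and `x+3` regardless of which channel `x` happens to be, so a 16-byte chunk never has to start on a pixel boundary.

```c
#include <stdint.h>

/* Horizontal 3-tap minimum over an RGB24 row. Each byte is only ever
   combined with the bytes 3 positions away, i.e. with the same color
   component of the left and right neighbor pixels -- channels never mix,
   no matter where inside a triplet the processed range begins. */
void hmin3_rgb24(const uint8_t *row, uint8_t *out, int from, int to)
{
    for (int x = from; x < to; x++) {
        uint8_t m = row[x];
        if (row[x - 3] < m) m = row[x - 3];  /* left neighbor, same channel */
        if (row[x + 3] < m) m = row[x + 3];  /* right neighbor, same channel */
        out[x] = m;
    }
}
```

`from` can be any byte offset with at least 3 valid bytes on each side, which is exactly why the 100% register utilization works out.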

One last thing remains to deal with: the unnatural memory access pattern. I can think of 2 ways to handle it:

- Prefetching: it’s easy to do but can’t resolve cache conflicts.
- Moving our window down, say, 10 times, then going up and right to do another run downwards, adjacent to the previous one. This may help a lot with very large bitmaps, and it can still be combined with prefetching at no cost.

When I implemented the algorithm, it turned out that we need both of the above: prefetch because of the non-sequential access pattern, and partitioning to be able to stay inside the L3 cache.
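The partitioning can be sketched as a simple block-size search, mirroring the `yblocksize` loop in the Delphi source at the end (the function name `pick_blocksize` is hypothetical; the 4 MB limit comes from the `L3CacheLimit` constant there):

```c
/* Halve the vertical block size until one run over all columns (two
   pixel rows are consumed per kernel iteration, hence the *2) fits into
   the L3 cache budget. */
int pick_blocksize(int rows, long long row_bytes, long long l3_limit)
{
    int bs = rows;
    while (bs > 4 && (long long)bs * row_bytes * 2 > l3_limit)
        bs >>= 1;
    return bs;
}
```

For the 3888×2592 test image (row = 11664 bytes, 1295 kernel rows) and a 4 MB limit, this settles on 161-row strips.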

# Implementation details

It wasn’t so hard to implement: I took the plan, rotated it by 90 degrees and designed it in my SSE programming tool. Only 4 SSE instructions were used: MOVUPS for reads, MOVAPS for stores and reg/reg moves, and PMINUB/PMAXUB. My FX-8350 CPU (AMD Piledriver architecture) can execute 2 MIN/MAX per cycle, and those have only 2-cycle latencies, so this particular algorithm can run quite fast on it.

So here’s what the first design looked like:

After some reordering and simplification it was possible to implement the whole algorithm in the 8 XMM registers, without any temporary registers. Later I noticed that this wasn’t really necessary, because I think the bottleneck is rather the L3 cache, due to the unusual memory access pattern. Anyway, here’s what the cleaned-up version looks like (note that the 16 SSE byte lanes are now visualized too):

# The results:

Personally I was amazed how fast it became.

The test config: **AMD FX-8350 4.2GHz, 8MB cache, 2133MHz DDR3 Memory read speed=11.4 GByte/s**

Tested with a big **3888×2592 RGB24 image** (30 MBytes), the runtime on 1 core was only **9.84 ms**, that is **2928 MByte/s** image processing speed. Smaller images can run even faster because of better cache usage: I measured 3.4 GByte/s on a small 640×480 image.
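As a quick sanity check of that figure, the arithmetic works out in binary megabytes:

```c
/* 3888*2592 RGB24 pixels = 3888*2592*3 = 30233088 bytes; processed in
   9.84 ms that is roughly 2930 binary MByte/s, matching the reported
   2928 MByte/s. */
double mbytes_per_s(long long bytes, double seconds)
{
    return (double)bytes / seconds / (1024.0 * 1024.0);
}
```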

Compared to a non-SSE implementation written in a high-level language, the difference was huge: an initial, lightly optimized, more readable version did it in 3.36 s (8.5 MByte/s). That version can be found here: http://prog.hu/tarsalgo/?fid=194693&no=10#e10 Later, with various optimizations (partial bubble sort, faster data access), a 2x faster version was reached. And finally, with the introduction of FastDib/MMX routines (mixed with high-level code), they got another 7.5x speedup on top of that 2x. Theoretically that’s 8.5*2*7.5 = 127.5 MB/s. Theoretical, because I don’t have the sources to test those; I only rely on Stella_209’s time measurements on the forums. But it’s still very far from this SSE version’s ~2900 MB/s.

This shows the power of low-level optimization: one well-written inner loop can be much better than small, highly optimized routines stitched together with other fast routines in a high-level language. By rewriting the loop from scratch we can avoid unnecessary data movement and conversion, as well as function-call overhead, and most importantly we can twist the whole process around the SSE instructions that do the heavy lifting, in order to feed them the highest amount of data to work on.

# Imperfections:

Yes, there are some, but with a lot of work they could be fixed. This experiment was only meant to measure how fast an SSE median filter can be in general.

- Every 16th column is filtered twice, because the 16-wide kernel works in place and reuses the previous kernel’s output. Solution -> work into a separate output buffer.
- Boundary conditions: I’m completely ignoring them. There can also be a big untouched gap on the left and right sides because the kernel only outputs 16-byte-aligned data. Solution -> handle the border cases with a traditional algo.

# Source code:

The source code is relatively small. The assembly parts are compact because they were pulled out automatically from my SSE design tool, in an order I controlled.

Compiled and tested on Delphi XE3 win32.

```pascal
procedure SSE_median3x3_RGB_x2(dst:pointer; linesize, count:integer);
asm
  push esi
  push edi
  push ebx
  push ebp
  mov ebp,linesize   //ebp: linesize  //must be 16 aligned
  mov ebx,count      //ebx: counter
  mov ecx,dst        //ecx: dst
  //calc src addresses
  mov eax, ecx
  sub eax, ebp
  sub eax, 3         //src = dst-linesize-3
  lea edx, [eax+ebp] //one line down
  //prepare 2*7*$10 bytes of temp buffers
  mov esi, esp
  sub esi, 2*$70
  and esi, not $FFF  //align temps to a 4K page
  lea edi, [esi+$70] //the other half of the temp buffers
  //fetch the first 2 rows to the temp buffer
  movups xmm0,[eax]
  movups xmm1,[eax+3]
  movups xmm2,[eax+6]
  movaps xmm3,xmm0
  pminub xmm0,xmm1
  pmaxub xmm3,xmm1
  movups xmm1,[edx]
  movaps xmm4,xmm3
  pminub xmm3,xmm2
  movups xmm5,[edx+3]
  pmaxub xmm4,xmm2
  movaps xmm2,xmm0
  pminub xmm0,xmm3
  pmaxub xmm2,xmm3
  movups xmm3,[edx+6]
  movaps xmm6,xmm1
  pminub xmm1,xmm5
  pmaxub xmm6,xmm5
  movaps xmm5,xmm6
  pminub xmm6,xmm3
  pmaxub xmm5,xmm3
  movaps [esi],xmm0
  movaps xmm3,xmm1
  pminub xmm1,xmm6
  movaps [esi+$10],xmm2
  pmaxub xmm3,xmm6
  movaps [esi+$20],xmm4
  movaps [esi+$30],xmm1
  movaps [esi+$40],xmm3
  movaps [esi+$50],xmm5
  xchg esi, edi        //swap the temp banks
  lea eax, [eax+ebp*2] //advance the source offsets
  lea edx, [edx+ebp*2]
  cmp ebx, 0
  jle @end
@1:
  //inputs  : eax, edx : image source row0, row1
  //temps   : esi, edi : 2x 7*16 bytes alternating buffers. Last elements are used as 2 temps.
  //output  : ecx      : 2x 16 bytes of output (used as temp internally)
  //remarks : edx = eax+linesize; next iteration: swap(esi,edi); eax += linesize*2;
  //source plan : SSEDesign_1610_Median3x3
  prefetch [eax+ebp*2]
  prefetch [edx+ebp*2] //2.2->2.9 GB/s
  movups xmm0,[eax]
  movups xmm1,[eax+3]
  movups xmm2,[eax+6]
  movaps xmm3,xmm0
  pminub xmm0,xmm1
  pmaxub xmm3,xmm1
  movups xmm1,[edx]
  movaps xmm4,xmm3
  pminub xmm3,xmm2
  pmaxub xmm4,xmm2
  movups xmm2,[edx+3]
  movaps xmm5,xmm0
  pminub xmm0,xmm3
  movups xmm6,[edx+6]
  pmaxub xmm5,xmm3
  movaps xmm3,xmm1
  pminub xmm1,xmm2
  pmaxub xmm3,xmm2
  movaps [esi],xmm0
  movaps xmm2,xmm3
  pminub xmm3,xmm6
  movaps [esi+$10],xmm5
  pmaxub xmm2,xmm6
  movaps [esi+$20],xmm4
  movaps xmm6,xmm1
  pminub xmm1,xmm3
  pmaxub xmm6,xmm3
  movaps [esi+$30],xmm1
  movaps xmm3,[edi+$20]
  movaps [esi+$40],xmm6
  movaps xmm7,xmm2
  pminub xmm2,xmm3
  movaps [edi+$60],xmm2
  movaps [esi+$50],xmm7
  movaps xmm7,[edi+$10]
  movaps xmm3,xmm6
  pminub xmm6,xmm7
  pmaxub xmm3,xmm7
  movaps [esi+$60],xmm6
  pminub xmm4,xmm2
  pmaxub xmm5,xmm6
  movaps xmm6,[edi]
  pminub xmm5,xmm3
  pmaxub xmm0,xmm1
  pmaxub xmm0,xmm6
  movaps xmm2,xmm4
  lea eax, [eax+ebp*2] //advance the source offsets
  pminub xmm4,xmm0
  pmaxub xmm2,xmm0
  pmaxub xmm4,xmm5
  movaps xmm5,[edi+$50]
  movaps xmm0,[esi+$60]
  pminub xmm4,xmm2
  lea edx, [edx+ebp*2]
  movaps xmm2,[edi+$40]
  movaps [ecx+ebp],xmm4
  pmaxub xmm2,xmm0
  movaps xmm0,[edi+$60]
  pminub xmm5,xmm0
  pminub xmm2,xmm3
  movaps xmm3,[edi+$30]
  pmaxub xmm3,xmm1
  pmaxub xmm3,xmm6
  movaps xmm6,xmm5
  pminub xmm5,xmm3
  pmaxub xmm6,xmm3
  pmaxub xmm5,xmm2
  xchg esi, edi        //swap the temp banks
  pminub xmm5,xmm6
  movaps [ecx],xmm5
  lea ecx, [ecx+ebp*2] //advance the dst offset too
  dec ebx
  jg @1
@end:
  pop ebp
  pop ebx
  pop edi
  pop esi
end;

procedure applyMedian3x3(b:TBitmap);

  procedure Error(s:string);
  begin
    raise exception.Create('applyMedian3x3: '+s);
  end;

  function AlignUp(i:integer;const align:integer):integer;
  asm
    sub edx,1; add eax,edx; not edx; and eax,edx
  end;

var
  base, dst0, ycnt, xcnt, w, h, x, y, yblocksize, ysiz, st, en: integer;
const
  L3CacheLimit = 4096 shl 10;
begin
  if b.PixelFormat<>pf24bit then Error('24bit format expected');
  h:=b.Height;
  w:=b.Width*3;
  if (w and 15)<>0 then Error('Scanline size must be a multiple of 16');
  base:=integer(b.ScanLine[h-1]);
  //calculate the first aligned dstByte (on the 0th line first)
  dst0:=AlignUp(base+w+3, 16); //center of the 3x3 window
  xcnt:=((base+w*2-3)-dst0) div 16;
  ycnt:=h shr 1-1;
  yblocksize:=ycnt;
  while (yblocksize>4) and (yblocksize*w*2>L3CacheLimit) do
    yblocksize:=yblocksize shr 1;
  while ycnt>0 do begin
    ysiz:=min(ycnt, yblocksize);
    for x:=0 to xcnt-1 do
      SSE_median3x3_RGB_x2(pointer(dst0+x shl 4), w, ysiz);
    ycnt:=ycnt-ysiz;
    inc(dst0, ysiz*w*2);
  end;
end;
```