The error is in the "b2" variable calculation, defined as "a2 - a4" by Kovac. The correct calculation of "b2" is "a3 - a4". I did my best to update the code and check the text. Be careful anyway, and mind the file versions!
Version Fri Jan 13 14:54:34 CET 2012 : restoration to ygdes.com
Version Thu Mar 3 06:28:16 CET 2016 : moved to ygdes.com's new server
Version Thu Mar 10 01:10:15 CET 2016 : bogus DCT warning
for the F-CPU Core 0
This paper presents some common techniques to optimise data-intensive computations on FC0. Our example is an 8-tap Discrete Cosine Transform, which is widely used for image compression. The paper concentrates on some "basic" manual scheduling techniques and shows a step-by-step optimisation of existing code. The source examples are written in C to allow debugging and verification in the absence of a working F-CPU or assembler.
FC0 is the "F-CPU Core #0", a proof of concept and playground for us wannabe computer hackers. FC0 is characterized by its constraints, design choices and history:
As a result, you can issue several independent operations and use the results later. You can start a long operation, perform several short ones, and finally fetch the long operation's result, because (1) almost all instructions are pipelined and non-blocking, and (2) the many registers allow several long operations to be in flight while short operations execute, without spilling to memory.
FC0 is an interesting proof of concept because it can sustain high loads predictably and remain simple. It is not too difficult to implement (most of the F-CPU contributors are inexperienced) yet it stands comparison with other cores of the same gate count. It exploits all the known techniques (only when needed) but keeps a lot of room for future enhancements, paving the way for FC1, FC2, FC3 and so on. It is distributed in the form of copylefted source code that can be implemented in a variety of technologies, letting users fine-tune their own version to match their needs.
However, this puts some pressure on the compiler ("RISC" can be read as "Relegate the Important Stuff to the Compiler"), and we will see some tricks here which can be used elsewhere as well.
Note that this example focuses on numerical code only: the demonstration C code will be a 1-to-1 translation of the assembly language (or vice versa), so we can concentrate on the scheduling and functional parts, not on the assembly language level. The code is also more readable, a very important factor because any confusion could introduce very nasty bugs. It couldn't be simpler: we will use only addition, subtraction and multiplication.
The algorithm that we will implement and optimize here is a "Discrete Cosine Transform" (or "DCT" for short), a function that is often used in signal processing: principally the analysis and compression of images and sounds. The DCT is a special kind of Fourier Transform with particular characteristics; for example, there is no need for a "window" that fades data in and out. Moreover, it is more efficient than the Fourier Transform at "packing the energy" of a signal, so an entropy coder can compress the result with fewer bits. Mathematically, the algorithm used for compression ("DCT") is called "DCT type II" and the inverse ("iDCT") is called "DCT type III" (in order to avoid confusion when talking to mathematicians).
The Winograd algorithm is a near-perfect implementation of the Fourier and Discrete Cosine transforms because it requires fewer multiplies than the Cooley-Tukey algorithm (which uses the famous "butterfly" structure). It is less symmetrical but very useful when the cost of a multiply is significant (and it often is).
The Discrete Cosine Transform is heavily used in the compression and decompression of still and moving images, mainly in the JPEG and MPEG formats. During compression, each block of 8*8 pixels is transformed into a "spatial spectrum" representation that eases compression, and the operation is performed in reverse during decompression. Though the operations are reversed, the core of the 8*8 DCT can remain the same if the pre- or post-scaling step is adapted.
1-D time-frequency transforms are also used for compressed audio streams such as the MPEG family (MPEG Audio Layer III is best known as "MP3"). But 2-D transforms are almost as easy thanks to the "composition" of 1-D transforms. Most image compression formats use the "separability" property of the 2-D transform: an 8x8 DCT can be performed with eight 1-D, 8-tap DCT passes on the rows, followed by the same passes on the columns. Using the Winograd version of the DCT, this requires 29*2*8 = 464 adds or subtractions and 5*2*8 = 80 multiplies, instead of 4096 multiplies with the most naive approach of a matrix multiply. An even more interesting fact is that all the multiplies of the Winograd (or the Cooley-Tukey) DCT are by constants. The pre- or post-scaling computations are not counted here, but they require 64 multiplies (one per pixel) which can be performed along with other operations.
Optimizing the DCT is very beneficial for both the encoding and the decoding of pictures and movies, for example, and this paper shows that the F-CPU is particularly well suited to this task if the program is adapted a bit. Frankly, before writing this code, I didn't know it would be so easy, but there is a bit of luck too...
A "traditional" DCT dataflow graph, derived from the classical "butterfly" diagram, looks like this:
(2016) Warning: the picture above is bogus, I think I tried to reverse-engineer some code and I must have mistranslated something at the output of the odd half. Do not use this diagram.
One can recognize the underlying "butterfly" on which this method is based. The square boxes perform the add/sub operations and the other nodes perform a trigonometric "rotation": this version totals 13 multiplies and 22 adds or subtractions when operating on scalar (not complex) data.
After a lot of maths, Winograd found a formula that uses fewer multiplies (5) but a few more add/subs (29). This more sophisticated algorithm is often used in hardware accelerators, despite the more difficult scheduling. One reason is that multiplies are often expensive, whereas FC0 is only affected by the operation's latency, which can be "hidden" with good code. The other reason is that it is more stable numerically: the multiplies introduce some "numerical noise" (proportional to the precision), and fewer multiplies mean more accuracy. The graph looks quite different from the previous one:
This is a graphical representation of the algorithm that we use here, which has some additional characteristics.
I found this code in a paper named "Pipelined Fast 2-D DCT Architecture for JPEG Image Compression", authored by Luciano Volcan Agostini <agostini@inf.ufrgs.br>, Ivan Saraiva Silva <ivan@dimap.ufrn.br> and Sergio Bampi <bampi@inf.ufrgs.br>. They present the architecture and the results of an FPGA-based device designed to perform the 2-D DCT for real-time image processing. I don't remember where I found it, but one could search for it on Google or request it from the authors or from me (it is provided in the archive of this article). The file is named sbcci_DCT2D.pdf and weighs 106,271 bytes.
The paper starts by defining a set of constants:

m1 = cos(4pi/16)
m2 = cos(6pi/16)
m3 = cos(2pi/16) - cos(6pi/16)
m4 = cos(2pi/16) + cos(6pi/16)
/* Step 1 */
b0 = a0 + a7;   b1 = a1 + a6;   b2 = a3 - a4;   b3 = a1 - a6;
b4 = a2 + a5;   b5 = a3 + a4;   b6 = a2 - a5;   b7 = a0 - a7;

/* Step 2 */
c0 = b0 + b5;   c1 = b1 - b4;   c2 = b2 + b6;   c3 = b1 + b4;
c4 = b0 - b5;   c5 = b3 + b7;   c6 = b3 + b6;   c7 = b7;

/* Step 3 */
d0 = c0 + c3;   d1 = c0 - c3;   d2 = c2;        d3 = c1 + c4;
d4 = c2 - c5;   d5 = c4;        d6 = c5;        d7 = c6;        d8 = c7;

/* Step 4 */
e0 = d0;        e1 = d1;        e2 = m3 * d2;   e3 = m1 * d7;
e4 = m4 * d6;   e5 = d5;        e6 = m1 * d3;   e7 = m2 * d4;   e8 = d8;

/* Step 5 */
f0 = e0;        f1 = e1;        f2 = e5 + e6;   f3 = e5 - e6;
f4 = e3 + e8;   f5 = e8 - e3;   f6 = e2 + e7;   f7 = e4 + e7;

/* Step 6 */
S0 = f0;        S1 = f4 + f7;   S2 = f2;        S3 = f5 - f6;
S4 = f1;        S5 = f5 + f6;   S6 = f3;        S7 = f4 - f7;
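The six steps can be dropped into a tiny C function as-is and sanity-checked before any optimisation. One invariant falls straight out of the dataflow: S0 = d0 = c0 + c3 = (a0+a7 + a3+a4) + (a1+a6 + a2+a5), i.e. the sum of the eight inputs, so a constant input must put all its energy into S0. A minimal sketch (the wrapper name winograd_dct8 is mine):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef double sample;

/* The six steps above, pasted verbatim (with the corrected b2 = a3 - a4). */
static void winograd_dct8(const sample a[8], sample S[8])
{
    const sample m1 = cos(4*M_PI/16), m2 = cos(6*M_PI/16);
    const sample m3 = cos(2*M_PI/16) - cos(6*M_PI/16);
    const sample m4 = cos(2*M_PI/16) + cos(6*M_PI/16);
    sample a0=a[0], a1=a[1], a2=a[2], a3=a[3],
           a4=a[4], a5=a[5], a6=a[6], a7=a[7];
    sample b0,b1,b2,b3,b4,b5,b6,b7, c0,c1,c2,c3,c4,c5,c6,c7,
           d0,d1,d2,d3,d4,d5,d6,d7,d8, e0,e1,e2,e3,e4,e5,e6,e7,e8,
           f0,f1,f2,f3,f4,f5,f6,f7;

    /* Step 1 */ b0=a0+a7; b1=a1+a6; b2=a3-a4; b3=a1-a6;
                 b4=a2+a5; b5=a3+a4; b6=a2-a5; b7=a0-a7;
    /* Step 2 */ c0=b0+b5; c1=b1-b4; c2=b2+b6; c3=b1+b4;
                 c4=b0-b5; c5=b3+b7; c6=b3+b6; c7=b7;
    /* Step 3 */ d0=c0+c3; d1=c0-c3; d2=c2; d3=c1+c4;
                 d4=c2-c5; d5=c4; d6=c5; d7=c6; d8=c7;
    /* Step 4 */ e0=d0; e1=d1; e2=m3*d2; e3=m1*d7;
                 e4=m4*d6; e5=d5; e6=m1*d3; e7=m2*d4; e8=d8;
    /* Step 5 */ f0=e0; f1=e1; f2=e5+e6; f3=e5-e6;
                 f4=e3+e8; f5=e8-e3; f6=e2+e7; f7=e4+e7;
    /* Step 6 */ S[0]=f0; S[1]=f4+f7; S[2]=f2; S[3]=f5-f6;
                 S[4]=f1; S[5]=f5+f6; S[6]=f3; S[7]=f4-f7;
}
```

A constant input of 1.0 yields S0 = 8 and S1..S7 = 0 exactly, since every multiply then operates on zero.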
There are, however, high risks of human error when one messes with 50 lines of random code like these. And even before one starts to hack into it, there must be no doubt that the original code performs the task as expected. I have created a testbench that allows the code to be compiled with any C compiler, thus catching typing errors. You can also provide your own stimuli, examine the output, and even single-step through the program with a debugger if needed.
This "wrapper" for our code uses floating-point numbers because working with integers requires a lot of alignment and bit counting. For example, the input of the 2-D DCT is generally 8-bit numbers (when compressing images) and the output is 16-bit numbers, increasing by steps of 1 bit per pipeline stage. The data are usually represented in signed fractional format, with 1 sign bit and from 7 to 15 mantissa bits ("1Q7" to "1Q15"). Using floating-point numbers during the algorithm verification phase helps separate the DCT from the scaling: it's easier to find where the errors are.
The testbench simply #includes the DCT code and creates a type called sample. This makes coding easier because we don't have to worry about the data type or the inputs/outputs. The program defines a set of source and destination data, and the #included code must declare its internal variables itself. The following version of dct1d_test.c allows one to (manually) verify that the outputs of two different codes are identical:
/* dct1d_test.c
   created mer jan 2 13:55:12 GMT 2002 by Yann Guidon <whygee@f-cpu.org>

   this is a "wrapper" for the other files which allows to test them.
   Here we use "float" because the code doesn't contain the shifts and
   sign corrections required by integer arithmetics.

   compile with   gcc -lm dct1d_test.c -Wall
*/

#include <stdio.h>
#include <math.h>

/* typedef signed short int sample; */
typedef double sample;

sample
  m1, m2, m3, m4,                  /* twiddle factors */
  a0, a1, a2, a3, a4, a5, a6, a7,  /* the inputs */
  S0, S1, S2, S3, S4, S5, S6, S7;  /* the outputs */

static void DCT1D_01() {
#include "dct1d_01.c"
}

static void DCT1D_05() {
#include "dct1d_05.c"
}

int main() {
  sample l[8];
  int i;

  m1 = cos(4*M_PI/16);
  m2 = cos(6*M_PI/16);
  m3 = cos(2*M_PI/16) - cos(6*M_PI/16);
  m4 = cos(2*M_PI/16) + cos(6*M_PI/16);

  for (i=0; i<8; i++) {
    l[i] = cos((i * M_PI)/4);
  }
  a0=l[0]; a1=l[1]; a2=l[2]; a3=l[3];
  a4=l[4]; a5=l[5]; a6=l[6]; a7=l[7];

  DCT1D_01();
  printf("\n DCT8 input :            DCT8 output :\n\
 a0 = %+.15f   S0 = %+.15f\n\
 a1 = %+.15f   S1 = %+.15f\n\
 a2 = %+.15f   S2 = %+.15f\n\
 a3 = %+.15f   S3 = %+.15f\n\
 a4 = %+.15f   S4 = %+.15f\n\
 a5 = %+.15f   S5 = %+.15f\n\
 a6 = %+.15f   S6 = %+.15f\n\
 a7 = %+.15f   S7 = %+.15f\n\n",
    a0, S0, a1, S1, a2, S2, a3, S3,
    a4, S4, a5, S5, a6, S6, a7, S7);

  S0=S1=S2=S3=S4=S5=S6=S7=0;

  DCT1D_05();
  printf("\n DCT8 input :            DCT8 output :\n\
 a0 = %+.15f   S0 = %+.15f\n\
 a1 = %+.15f   S1 = %+.15f\n\
 a2 = %+.15f   S2 = %+.15f\n\
 a3 = %+.15f   S3 = %+.15f\n\
 a4 = %+.15f   S4 = %+.15f\n\
 a5 = %+.15f   S5 = %+.15f\n\
 a6 = %+.15f   S6 = %+.15f\n\
 a7 = %+.15f   S7 = %+.15f\n\n",
    a0, S0, a1, S1, a2, S2, a3, S3,
    a4, S4, a5, S5, a6, S6, a7, S7);

  return 0;
}
The first thing to do is to clean up the code.
A number of "operations" in the original algorithm only copy a result from one pipeline stage to the next. This is useless because the processor executes from registers, so each copy can be short-circuited with variable renaming. For example, if we find b = a; c = b; we can simply write c = a; and spare one instruction. In fact, the original code contains 50 operations and the Winograd DCT needs 29 + 5 = 34 operations, so we can remove 16 lines. This is not complex at all and takes a few minutes by hand. The following code can be found in dct1d_01.c:
/* DCT1D_01.c version mar jan 1 08:45:57 GMT 2002 whygee@f-cpu.org
   1* 8-bin DCT for a "baseline" JPG compressor.
   originally cut and pasted from : sbcci_DCT2D.pdf
   "Pipelined Fast 2-D DCT Architecture for JPEG Image Compression"
   Luciano Volcan Agostini
Ok, there are 16 red lines, so it looks good :-)
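The renaming pass itself is mechanical: each pure copy like "x = y;" becomes an entry in a substitution table, and every later reference is chased through that table until it reaches a real producer. A small sketch of the chasing step (the table layout and the name resolve are mine, not an F-CPU tool; it assumes straight-line code with no circular copies):

```c
#include <string.h>

/* One entry of the substitution table: "alias" is a pure copy of "source". */
struct copy { const char *alias, *source; };

/* Follow chains like b = a; c = b; so that a reference to "c"
   resolves to "a".  Loops forever on circular copies, which cannot
   occur in a straight-line block like the DCT. */
static const char *resolve(const struct copy *tab, int n, const char *name)
{
    for (int again = 1; again; ) {
        again = 0;
        for (int i = 0; i < n; i++) {
            if (strcmp(tab[i].alias, name) == 0) {
                name = tab[i].source;   /* short-circuit one copy */
                again = 1;
                break;
            }
        }
    }
    return name;
}
```

Running this over the 16 pure copies of the original listing is exactly what produces the shortened code below.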
Now that all the copy instructions are identified, all the references are short-circuited and the code is shortened. We start to see the "skeleton" of the code: a succession of adds and subs with a block of multiplies in the middle. The source code is named dct1d_02.c:
/* DCT1D_02.c version mar jan 1 08:45:57 GMT 2002 whygee@f-cpu.org
   1* 8-bin DCT for a "baseline" JPG compressor.

   Optimisation : part 2 = first cleanup

   All the unnecessary temporary variables are removed.
   A census of the "nodes" is made. Since their number is below 64,
   the register allocation is straight-forward.
   Furthermore, the original code seems to be incredibly well optimised.
   First, the "minimal distance" between operations seems to be
   compatible with FC0. Second, it seems possible to use less than the
   original number of temporary variables : although the "clean" model
   defines 30 variables, it is "sliced" in such a way that maybe 16
   working registers are necessary. Later we will even describe how
   they are reduced to 10.

   b2 bug corrected on Tue Dec 16 06:37:45 CET 2003
*/

sample
  b0, b1, b2, b3, b4, b5, b6, b7,
  c0, c1, c2, c3, c4, c5, c6,
  d0, d1, d3, d4,
  e2, e3, e4, e6, e7,
  f2, f3, f4, f5, f6, f7;

/* Step 1 */
b0 = a0 + a7;
b1 = a1 + a6;
b2 = a3 - a4;
b3 = a1 - a6;
b4 = a2 + a5;
b5 = a3 + a4;
b6 = a2 - a5;
b7 = a0 - a7; /***/

/* Step 2 */
c0 = b0 + b5;
c1 = b1 - b4;
c2 = b2 + b6; /***/
c3 = b1 + b4;
c4 = b0 - b5; /***/
c5 = b3 + b7; /***/
c6 = b3 + b6; /***/

/* Step 3 */
d0 = c0 + c3; /***/
d1 = c0 - c3; /***/
d3 = c1 + c4;
d4 = c2 - c5;

/* Step 4 */
e2 = m3 * c2; /*c2*/
e3 = m1 * c6; /*c6*/
e4 = m4 * c5; /*c5*/
e6 = m1 * d3;
e7 = m2 * d4;

/* Step 5 */
f2 = c4 + e6; /*c4*/
f3 = c4 - e6; /*c4*/
f4 = e3 + b7; /*b7*/
f5 = b7 - e3; /*b7*/
f6 = e2 + e7;
f7 = e4 + e7;

/* Step 6 */
S0 = d0; /*d0*/
S1 = f4 + f7;
S2 = f2;
S3 = f5 - f6;
S4 = d1; /*d1*/
S5 = f5 + f6;
S6 = f3;
S7 = f4 - f7;
Now that all the unnecessary instructions are removed, we can examine whether the pipeline will stall, and where: it is a first static scheduling analysis that will tell us whether we have to modify the code or not.
The short architectural description of FC0 at the beginning of this document now makes sense: we need several pieces of information to perform the analysis, principally the conditions under which the core stalls. We only care about the read-after-write dependencies (when data is not ready when it is needed) and the latency of each instruction. Other parameters are mostly ignored because we use only 2 types of arithmetic operations. The write-back contentions (when more than 2 results are written at the same time) are not checked because they can't happen by construction (there are only 2 latencies, so 3 results can't be written in one cycle).
Now the game is to count how many instructions separate an instruction that computes a result from the instruction that uses this result. If the distance is shorter than the latency of the operation's unit, there is a stall. The following code is available under the name dct1d_03.c:
/* DCT1D_03.c version mar jan 1 08:45:57 GMT 2002 whygee@f-cpu.org
   1* 8-bin DCT for a "baseline" JPG compressor.

   Optimisation : part 3 = distance verification

   Before the registers are renamed (down to around 16 registers),
   we have to check that the original order preserves all the
   dependency distances. Here we use the following rules of thumb :
    * addition or subtraction : the result is not available before 2 cycles
    * multiplication : "mulh" is used, let's consider that it takes 6 cycles.
   Here, "cycles" are the number of instructions that separate those
   with dependencies. For example :
     a = b + c   <= takes 2 cycles
     d = e + f
     g = h + i
     k = l + a   <= a is ready now.
   If there is a conflict, we have to swap several operations.
   If the swap's range is too large, there is an avalanche effect with
   the dependencies, which forces us to re-analyse everything through a
   global dataflow-graph reordering and flattening.

   b2 bug corrected on Tue Dec 16 06:37:45 CET 2003
*/

sample
  b0, b1, b2, b3, b4, b5, b6, b7,
  c0, c1, c2, c3, c4, c5, c6,
  d0, d1, d3, d4,
  e2, e3, e4, e6, e7,
  f2, f3, f4, f5, f6, f7;

/* here, we assume that there is no pending operation that deals with aX. */

/* Step 1 */
b0 = a0 + a7;
b1 = a1 + a6;
b2 = a3 - a4;
b3 = a1 - a6;
b4 = a2 + a5;
b5 = a3 + a4;
b6 = a2 - a5;
b7 = a0 - a7; /***/

/* Step 2 */
c0 = b0 + b5; /* b5 = 2 cycles : limit... if there was a conflict
                 we could have swapped with c1. */
c1 = b1 - b4; /* b4 = 4 cycles */
c2 = b2 + b6; /***/ /* b4 = 4 cycles */
c3 = b1 + b4; /* no more dangerous dependencies here. */
c4 = b0 - b5; /***/
c5 = b3 + b7; /***/
c6 = b3 + b6; /***/

/* Step 3 */
S0 = c0 + c3; /***/ /* c3=3 */
S4 = c0 - c3; /***/ /* c3=4 */
d3 = c1 + c4; /* c4=4 */
d4 = c2 - c5; /* c5=4 */

/* Step 4 */
e2 = m3 * c2; /*c2*/
e3 = m1 * c6; /*c6*/
e4 = m4 * c5; /*c5*/
e6 = m1 * d3; /* d3=4 */
e7 = m2 * d4; /* d4=4 */

/* Step 5 */
S2 = c4 + e6; /*c4*/ /* e6=1 : OUCH !!! */
S6 = c4 - e6; /*c4*/ /* e6=2 : ... */
f4 = e3 + b7; /*b7*/ /* e3=5 */
f5 = b7 - e3; /*b7*/ /* e3=6 */
f6 = e2 + e7; /* e7=4 */
f7 = e4 + e7; /* e7=5 */

/* Step 6 */
/*S0 = d0;*/ S1 = f4 + f7;
/*S2 = f2;*/ S3 = f5 - f6;
/*S4 = d1;*/ S5 = f5 + f6;
/*S6 = f3;*/ S7 = f4 - f7;
The blue lines represent the dependencies which are resolved exactly when needed (a kind of critical datapath). The red lines represent those which are not: the operation is not finished when the result is needed. The red lines must be removed, either by swapping instructions or by inserting some other "work" during the stall. In the middle of the block, it is not possible to interleave other work, and there are a lot of results which are ready long before they are needed.
In fact, everything is fine before the multiplies: there are around 8 data to process with a latency of 2 cycles, so there always remain 5 operations to choose from. However, the multiplies by constants take longer, and there are only 5 of them, so the remaining operations must wait for their completion: 5 multiplies are enough to exhaust the pool of the additions' results.
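This distance counting can itself be automated. The sketch below is a deliberately simplified model of my own (the names insn and count_stalls are not from any F-CPU tool): it applies the rules of thumb above, treating a result as usable once at least "latency" instructions lie between producer and consumer, and it ignores how one stall shifts the instructions that follow it.

```c
#include <string.h>

/* One three-address instruction: dest = src1 (op) src2.
   latency = 2 for add/sub, 6 for mulh, per the rules of thumb. */
struct insn { const char *dest, *src1, *src2; int latency; };

/* Count stall cycles in a straight-line block: for each source operand,
   find the nearest earlier producer and compare the number of
   in-between instructions with the producer's latency.  Each
   instruction waits for its worst-placed source. */
static int count_stalls(const struct insn *code, int n)
{
    int stalls = 0;
    for (int i = 0; i < n; i++) {
        int wait = 0;
        const char *src[2] = { code[i].src1, code[i].src2 };
        for (int s = 0; s < 2; s++) {
            for (int j = i - 1; j >= 0; j--) {
                if (strcmp(code[j].dest, src[s]) == 0) {
                    int between = i - j - 1;   /* instructions in between */
                    if (code[j].latency - between > wait)
                        wait = code[j].latency - between;
                    break;                      /* nearest producer only */
                }
            }
        }
        stalls += wait;
    }
    return stalls;
}
```

On the four-instruction example from the comments above, "a" has 2 instructions between its producer and its consumer, so the block reports zero stalls; remove the two fillers and the same dependency costs 2 cycles.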
Here, things become more serious. We have to swap instructions so that no cycle is lost. If you ask a computer to do that, it's an exponentially complex problem, but a human brain can do it with linear complexity because it can sort the problems and avoid those that don't matter :-)
/* DCT1D_04.c version mar jan 1 08:45:57 GMT 2002 whygee@f-cpu.org
   1* 8-bin DCT for a "baseline" JPG compressor.

   Optimisation : part 4 = rescheduling

   It appears that there is a problem at the end of the block, mainly
   because of the multiply's latency and a slight order problem at the
   beginning of step 5. We have to reorder the operations to reduce the
   overall latency. End-of-block and start-of-block contentions are
   common because the dependencies run out. On top of that, Winograd's
   algorithm is very optimised but the dependency diagram is not as
   symmetrical as in the usual "butterfly" FFT. The problem that arises
   here is not surprising because the dependency tree narrows.

   In a "real" case, there are other things to interleave with, such as
   load/stores, counter or pointer manipulations... However, we will
   reorder the instructions manually. When we can't recover the lost
   cycles, the solution is to interleave some other "useful work" to
   fill the pipeline stalls. As noted before, the first way is to
   interleave data movement or transformations from the next code block.
   The other (desperate) way is to unroll the loop :
    1) copy-paste the block
    2) rename all the registers in the second block
    3) interleave the two blocks in such a position that the stalls
       are filled
    4) copy-paste the resulting block to form the start and the end
       of the loop
    5) don't forget to divide the loop count by two

   Fortunately, in the current case, it is possible to manually reorder
   the instructions. It took a few minutes and it spans a larger part of
   the block than expected, but only one cycle is lost in the end, which
   is not enough to justify loop unrolling. This deep swapping increases
   the register usage, and a few more temporary variables are necessary,
   maybe 20 or 22 instead of 16 or 18.

   b2 bug corrected on Tue Dec 16 06:37:45 CET 2003
*/

typedef signed short int sample;

sample
  b0, b1, b2, b3, b4, b5, b6, b7,
  c0, c1, c2, c3, c4, c5, c6,
  d0, d1, d3, d4,
  e2, e3, e4, e6, e7,
  f2, f3, f4, f5, f6, f7; /* the outputs */

/* here, we assume that there is no pending operation that deals with aX. */

/* Step 1 */
b0 = a0 + a7;
b1 = a1 + a6;
b2 = a3 - a4;
b3 = a1 - a6;
b4 = a2 + a5;
b5 = a3 + a4;
b6 = a2 - a5;
b7 = a0 - a7; /***/

/* Step 2 */
c0 = b0 + b5;
c1 = b1 - b4;
c2 = b2 + b6; /***/
c6 = b3 + b6; /* moved */
c4 = b0 - b5; /***/
c5 = b3 + b7; /***/
c3 = b1 + b4; /* moved */

/* Step 3 */
e3 = m1 * c6; /*3*/ /* moved */
d3 = c1 + c4; /*3*/
d4 = c2 - c5; /*3*/

/* Step 4 */
e2 = m3 * c2; /*c2*/ /*7*/
e4 = m4 * c5; /*c5*/ /*5*/
e6 = m1 * d3; /*3*/
e7 = m2 * d4; /*3*/
S0 = c0 + c3; /*7*/
S4 = c0 - c3; /*8*/

/* Step 5 */
f4 = e3 + b7; /*b7*/ /* e3=8 */
f5 = b7 - e3; /*b7*/ /* e3=9 */
/* one free slot here : the multiply is not ready */
S2 = c4 + e6; /*c4*/ /* e6=5 : OUCH !!! */
f6 = e2 + e7; /* e7=5 (6 after the stall) */
S6 = c4 - e6; /*c4*/ /* e6=already ok */
f7 = e4 + e7; /* e7=already ok */

/* Step 6 (reordered) */
S3 = f5 - f6; /* f6=2 */
S5 = f5 + f6; /* f6=3 */
S1 = f4 + f7; /* f7=2 */
S7 = f4 - f7; /* f7=3 */
I don't remember exactly how I did this (3 months ago), but the principle is to find which instructions to move, and where, so that the stalls disappear. This can force one or more independent instructions to move as well, but there is only one lost cycle now. I did this by hand within some tens of minutes, I think. I'm too lazy to draw the dependency diagram, but the average "waiting time" between a result's delivery and its use has decreased. So the first hint (if you want to do this at home) is to search for the longest "distances"/"waiting times" and balance them so that the stalls disappear.
One stall cycle for 34 instructions is a decent number (97% CPU usage), so I stop the cycle counter here and focus on another issue. I'm happy that the code was already pretty well optimised and I only had to swap some instructions. Starting from another code base could have been much more painful.
The new code uses techniques that are even more complex (for newcomers) and its making is more error-prone: I have saved a "snapshot" of several steps of my ongoing work, to avoid losing time if I detected an error.
/* DCT1D_05.c version mar jan 1 08:45:57 GMT 2002 whygee@f-cpu.org
   1* 8-bin DCT for a "baseline" JPG compressor.

   Optimisation : part 5 = register reallocation

   Once the data dependencies are free from pipeline stalls, we can
   "optimise" the register allocation. It is not a critical part in
   F-CPU because there are 64 registers : that is well enough to avoid
   complex algorithms. We could associate each register to a temporary
   variable, but this would reduce the available resources for the rest
   of the program, particularly for the previous and next code blocks.
   We have to keep data pointers, constants (index advance and twiddle
   factors) and some headroom, in case loop unrolling appears necessary
   in the future.

   For example, a dual-issue FC0 would probably require a special 2x
   unrolling : the duplication of every instruction (not forgetting to
   rename the registers), forming perfectly independent and scheduled
   pairs. This is the simplest solution if you don't want to recalculate
   all the data dependency distances and the register numbers. Of
   course, this only works when all the execution units are duplicated,
   and the multiply unit is too heavy to accept two instructions per
   cycle. Some local reordering would then be necessary, "shifting" the
   code stream by a certain number of instructions so that there is no
   conflict.

   Back to register re-allocation. This algorithm is very simple (once
   you have understood it). Furthermore, there is no register pressure,
   so we can follow it quietly. The basic principle is to identify a
   datum, from the time it is created until its last use. The end of a
   datum's life is identified when the associated register is written
   to, that is : when it is associated to another datum (or more
   exactly, just before; this detail becomes meaningful later).
   For example :
     a = b + c        <-- c contains a first value
     --- anything --- <-- time during which c is unused
     c = d + e        <-- c contains a new value

   The algorithm uses a "scoreboard" approach where each register has a
   flag (either allocated, and to which old register, or free). In a
   forward scan, as in the P6 core family, the scoreboard's entry is
   cleared whenever the register is written to. If it is used before
   being overwritten, the value it contained has died, and we have thus
   detected a datum's death. In order to detect this case, it is more
   efficient to run the algorithm from the end of the block (backwards),
   otherwise the registers will contain unused values during a
   significant fraction of the program. This is an important difference
   with out-of-order cores such as the P6, where the instruction flow
   must be decoded in-order for obvious reasons.

   This register re-allocation method is optimal when there is no
   register pressure and the coded algorithm is already optimised. If
   there are not enough registers for a given code block, things get
   much more difficult, because one has to "swap" data to memory or
   split/slice the block into pipelineable chunks. If that is not
   possible, it is necessary to count which data are most used, to keep
   them in registers. This allocation algorithm is not funny at all
   because its complexity depends on the instruction set's
   orthogonality... No such problem here : we have enough registers in
   the F-CPU to run an even more complex computation. When our
   scoreboard needs a new entry and all the entries are "busy", we can
   simply create a new entry.

   The reallocation algorithm goes as follows :
    1- point at the end of the instruction stream
    2- identify the source operands as "dying data"
    3- for each "dying datum" :
    4-   allocate a register for each datum (set the corresponding
         scoreboard entry)
    5-   go to the previous instruction
    6-   if the datum is written :
    7-     replace the old reference with the new register name
    8-     leave this loop (this is the end of the datum)
    9-   replace all source references with the new register name
         (carefully checking that there is no numbering conflict)
   10-   loop back to 5
   11- if we have reached the beginning of the code block, exit this
       algorithm
   12- when all the current instruction's references are replaced, the
       instruction can be "committed"
   13- go to the previous instruction (upper in the stream)
   14- identify the already re-allocated register references
   15- if all the sources are already allocated, go to 11
   16- all the unallocated sources are "dying data", go to 3

   There is a need to distinguish newly allocated references from the
   original register references. It is safer to use different symbolic
   names, such as a1-a7 for the original names and R1-R63 for the new
   names. While this algorithm is easy, it is also very easy to make
   fatal errors, particularly if you are sleep-deprived... You are
   warned : from experience I can say that it can hurt very, very
   harshly.

   Here is a small example block :
     a0 = a1 + a2   (1)
     a1 = a2 + a3   (2)
     a2 = a3 + a4   (3)
     a5 = a6 + a1   (4)

   We start at the end, where a5 = a6 + a1, and the scoreboard is
   initially cleared. a priori, a5 can be replaced by anything, let's
   say R1. Then we propagate the new names a1=R2 and a6=R3 (we have
   created new scoreboard entries) until these names are "associated"
   with their value :
     R1=a4 R2=a1 R3=a6
     a0 = a1 + a2   (1')
     R2 = a2 + a3   (2')
     a2 = a3 + a4   (3')
     R1 = R3 + R2   (4')

   Now that all the references are replaced in the instruction, we go up
   and see what references are not yet replaced. For each replacement,
   we look in the scoreboard for unused entries : R1 is free (since it
   is written later), and we need to create two new entries. The
   scoreboard then contains :
     R1=a4 R2=a1 R3=a6 R4=a3 R5=a2
     a0 = a1 + a2   (1'')
     R2 = a2 + R4   (2'')
     R5 = R4 + R1   (3'')

   The process continues until all instructions are processed. However,
   we have to process inputs and outputs (terminals) differently : here
   a2 is used both internally and externally. Depending on the case, it
   may be necessary to allocate a new register.
     R1=a4 R2=a1 R3=a6 R4=a3 R5=a2(terminal) R6=a2(internal)
     a0 = a1 + R6   (1''')
     R2 = R6 + R4   (2''')

   and finally, the result is :
     R1=a4 R2=a1 R3=a6 R4=a3 R5=a2(terminal) R6=a2(internal) R7=a1 R8=a0
     R8 = R7 + R6   (1'''')
     R2 = R6 + R4   (2''')
     R5 = R4 + R1   (3'')
     R1 = R3 + R2   (4')

   This small example shows how it works, but it only becomes efficient
   for larger code blocks, such as our DCT. By changing the allocation
   policy, we can require more or fewer registers than previously. In
   our DCT, where a lot of temporary variables are used, this will
   "compress" the number of used registers. If we omit m1-m4, a1-a7 and
   S0-S7, we originally needed 30 registers and now we only need 10,
   after two hours of hacking. This means that despite the instruction
   reordering, which increases the lifetime of some variables, we are
   still near the minimum of 9 as written in the original algorithm.

   * Note : the recoded algorithm is now completely obfuscated ! You
   will not be able to understand it unless you perform a complex and
   exhaustive analysis.

   * Note 2 : the reallocation algorithm must be modified if the CPU has
   2-address instructions (i.e. x86). It can't be applied on the
   instruction stream, and it requires a higher level of optimisation,
   using dataflow graph analysis. This is much more complex because we
   have to determine which data must be copied before the register is
   overwritten by a result. It results in a higher register pressure and
   a greater programming difficulty.

   Do not forget that you have to read from the end of the file if you
   want to understand the comments ;-)

   Warning : this manual algorithm is not difficult, but any error can
   be fatal. It is relatively easy to implement in software, but no
   source processor exists yet. After some tens of lines, it is
   impossible to avoid one or two errors, so here is some advice :
   - run regression tests during the whole process, to detect errors as
     soon as possible
   - save the result under a new name often, in order to always have a
     failsafe, working source code
   - keep the old and new values on two consecutive lines, one being
     commented out : this helps to understand what happens and to
     visually detect errors
   - implement some "guards" that help detect the errors. In our
     example, this is done with the variables' naming : it is illogical
     to find a name like "e4" in your scoreboard when you are
     manipulating the b block, because e4 appears later and should not
     have been created yet. This means that you have forgotten to
     discard it from the scoreboard, or erased the line where it is
     defined, or something even more difficult to trace back
   - another example of error detection is when an instruction writes to
     a register that is not present in the scoreboard but is not a
     "terminal"
/* b2 bug corrected on Tue Dec 16 06:37:45 CET 2003 */

sample R1, R2, R3, R4, R5, R6, R7, R8, R9, R10; /* reallocated temporary registers */

/* deprecated :
   b0, b1, b2, b3, b4, b5, b6, b7,
   c0, c1, c2, c3, c4, c5, c6,
   d0, d1, d3, d4,
   e2, e3, e4, e6, e7,
   f2, f3, f4, f5, f6, f7; */

/* Step 1 */
/* b0 = a0 + a7; */  R7 = a0 + a7;
/* b1 = a1 + a6; */  R6 = a1 + a6;
/* b2 = a3 - a4; */  R4 = a3 - a4;  /* corrected : Kovac's listing reads "a2 - a4" */
/* b3 = a1 - a6; */  R5 = a1 - a6;  /* and here it's a simple substitution :-) */
/* b4 = a2 + a5; */  R9 = a2 + a5;
/* b5 = a3 + a4; */  R10 = a3 + a4;
/* b6 = a2 - a5; */  R3 = a2 - a5;
/* b7 = a0 - a7; */  R8 = a0 - a7;

/* Step 2 */
/* c0 = b0 + b5;   R1=(c0) R2=free R3=b6 R4=b2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
R1 = R7 + R10;  /* c0 is removed <= c0 is "created", b0 and b5 survive */
/* c1 = b1 - b4;   R1=c0 R2=(c1) R3=b6 R4=b2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
R2 = R6 - R9;   /* c1 is removed <= c1 is "created", b1 and b4 survive */
/* c2 = b2 + b6;   R1=c0 R2=c1 R3=b6 R4=c2/b2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
R4 = R4 + R3;   /* replacement b2/c2/R4 <= c2 is "created", b2 dies, b6 survives */
/* c6 = b3 + b6;   R1=c0 R2=c1 R3=c6/b6 R4=c2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
R3 = R5 + R3;   /* replacement c6/b6/R3 <= b3 survives, b6 dies, c6 is "created" */
/* c4 = b0 - b5;   R1=c0 R2=c1 R3=c6 R4=c2 R5=b3 R6=b1 R7=c4/b0 R8=b7 R9=b4 R10=b5 */
R7 = R7 - R10;  /* replacement c4/b0/R7 <= c4 is "created", b0 and b5 die */
/* c5 = b3 + b7;   R1=c0 R2=c1 R3=c6 R4=c2 R5=c5/b3 R6=b1 R7=c4 R8=b7 R9=b4 */
R5 = R5 + R8;   /* replacement c5/b3/R5 <= c5 is "created", b3 dies, b7 survives */
/* c3 = b1 + b4;   R1=c0 R2=c1 R3=c6 R4=c2 R5=c5 R6=b1 R7=c4 R8=b7 R9=c3/b4 */
R9 = R6 + R9;   /* c3 is "created", b4 and b1 die */

/* Step 3 */
/* e3 = m1 * c6;   R1=c0 R2=c1 R3=c6/e3 R4=c2 R5=c5 R6=free R7=c4 R8=b7 R9=c3 */
R3 = m1 * R3;   /* replacement c6/e3/R3 */
/* d3 = c1 + c4;   R1=c0 R2=c1/d3 R3=e3 R4=c2 R5=c5 R6=free R7=c4 R8=b7 R9=c3 */
R2 = R2 + R7;   /* replacement c1/d3/R2 */
/* d4 = c2 - c5;   R1=c0 R2=d3 R3=e3 R4=c2 R5=c5 R6=free R7=c4 R8=b7 R9=c3 */
R6 = R4 - R5;   /* d4/R6 dies */

/* Step 4 */
/* e2 = m3 * c2;   R1=c0 R2=d3 R3=e3 R4=c2 R5=c5 R6=d4 R7=c4 R8=b7 R9=c3 */
R4 = m3 * R4;   /* same as next line with e2/c2/R4 */
/* e4 = m4 * c5;   R1=c0 R2=d3 R3=e3 R4=e2 R5=e4/c5 R6=d4 R7=c4 R8=b7 R9=c3 */
R5 = m4 * R5;   /* same as next line with e4/c5/R5 */
/* e6 = m1 * d3;   R1=c0 R2=e6/d3 R3=e3 R4=e2 R5=e4 R6=d4 R7=c4 R8=b7 R9=c3 */
R2 = m1 * R2;   /* same as next line with e6/d3/R2 */
/* e7 = m2 * d4;   R1=c0 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7/d4 R7=c4 R8=b7 R9=c3 */
R6 = m2 * R6;   /* e7 is "created", its entry replaces d4 which dies. */

/* S0 = c0 + c3;   R1=c0 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 R9=c3 */
S0 = R1 + R9;   /* nothing new. */
/* S4 = c0 - c3;   R1=c0 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 R9=c3 */
S4 = R1 - R9;   /* c0 needs a slot and replaces f4; a new slot is created for c3 */

/* Step 5 */
/* f4 = e3 + b7;   R1=f4 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 */
R1 = R3 + R8;   /* R1/f4 is created and nothing new is allocated : R1 disappears from the scoreboard */
/* f5 = b7 - e3;   R1=f4 R2=e6 R3=f5/e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 */
R3 = R8 - R3;   /* f5 appears, e3 dies, R3 is associated to both */

/* one free slot here : the multiply is not ready */

/* S2 = c4 + e6;   R1=f4 R2=e6 R3=f5 R4=e2 R5=e4 R6=e7 R7=c4 */
S2 = R7 + R2;
/* f6 = e2 + e7;   R1=f4 R2=f7 R3=f5 R4=f6/e2 R5=e4 R6=e7 R7=c4 */
R4 = R4 + R6;   /* e7 exists but not e2. f6 is being created : R4 is associated
                   to both f6 (new) and e2 (old). We spare one new register
                   allocation ! However, on CPUs with imprecise exceptions
                   (Alpha for example) this trick is not allowed, because after
                   an FPU fault the exception handler can't know which value
                   (old or new) the register contains. */
/* S6 = c4 - e6;   R1=f4 R2=e6 R3=f5 R4=f6 R5=e4 R6=e7 R7=c4 */
S6 = R7 - R2;   /* e6 replaces f7 */
/* f7 = e4 + e7;   R1=f4 R2=f7 R3=f5 R4=f6 R5=e4 R6=e7 */
R2 = R5 + R6;   /* f7 is "created" so we remove it from the scoreboard */

/* Step 6 */
/* S3 = f5 - f6;   R1=f4 R2=f7 R3=f5 R4=f6 */
S3 = R3 - R4;
/* S5 = f5 + f6;   R1=f4 R2=f7 R3=f5 R4=f6 */
S5 = R3 + R4;
/* S1 = f4 + f7;   R1=f4 R2=f7 */
S1 = R1 + R2;
/* S7 = f4 - f7;   R1=f4 R2=f7 */
S7 = R1 - R2;
Now we have a code that uses 10 temporary registers and performs a 1-D 8-bin DCT in 35 cycles. As one of the comments above states, if floating point is used, precise exceptions or an "FP barrier" are needed in some places, because an imprecise exception handler could run into trouble. But the F-CPU defines all instructions to be precise, so there is no problem. Here is the final code with all the development comments removed :
/* DCT1D.c
   version : Thursday, April 18, 2002
   (C) 2002 whygee@f-cpu.org
   Winograd 8-bin DCT for a "baseline" JPEG compressor.
   Optimisation : final part, cleanup.
   Most comments from the preceding code version are removed,
   so it can be integrated into another piece of software.
   b2 bug corrected on Tue Dec 16 06:37:45 CET 2003 */

sample R1, R2, R3, R4, R5, R6, R7, R8, R9, R10; /* temporary registers */

R7 = a0 + a7;
R6 = a1 + a6;
R4 = a3 - a4;
R5 = a1 - a6;
R9 = a2 + a5;
R10 = a3 + a4;
R3 = a2 - a5;
R8 = a0 - a7;

R1 = R7 + R10;
R2 = R6 - R9;
R4 = R4 + R3;
R3 = R5 + R3;
R7 = R7 - R10;
R5 = R5 + R8;
R9 = R6 + R9;

R3 = m1 * R3;
R2 = R2 + R7;
R6 = R4 - R5;

R4 = m3 * R4;
R5 = m4 * R5;
R2 = m1 * R2;
R6 = m2 * R6;

S0 = R1 + R9;
S4 = R1 - R9;
R1 = R3 + R8;
R3 = R8 - R3;
/* one free slot here : the multiply is not ready */
S2 = R7 + R2;
R4 = R4 + R6;
S6 = R7 - R2;
R2 = R5 + R6;

S3 = R3 - R4;
S5 = R3 + R4;
S1 = R1 + R2;
S7 = R1 - R2;
Please note that I have not addressed the issues of operand alignment or of the numbers' range : if you use integer arithmetic, you have to manage the fractional (fixed-point) representation yourself.
If you want to use floating-point numbers, be careful : the scheduling presented here will not apply. The actual latencies of the FP operators are not yet known for the F-CPU, but they are likely to be two or three times higher. Use loop unrolling and instruction interleaving to "hide" the latencies.
The proposed code is, however, transparent with respect to data parallelism : you can use integer SIMD instructions to boost performance. For example, when processing pictures, you can compute a full RGB pixel in one 64-bit register. The initial representation is 8-bit, but it gradually grows to 16 bits during the computation. It is your job to find the most suitable internal data representation, the one that needs the fewest packing/unpacking instructions.
If you are programming sound-processing effects, your quest ends here : this one-dimensional code can do the trick. The FFT (and even the Winograd FFT) is used more often than the DCT in that domain, but the same methods apply. I have some 8-bin Winograd FFT code in my archives, just in case.
However, if you program a 2-D DCT for image processing, the code developed in this paper is only the "core" of your loop. Well-written code will not waste all the cycles that were saved in the loop body, and accessing data is where most problems arise. This will be addressed in another paper, whenever I find some time to write it.
A few other parameters must be taken into account in a real program running on an F-CPU computer : SIMD data and superscalar extensions. With SIMD data (that is, multiple data packed in one register), you have to interleave (pack) separate streams and separate (unpack) them later, which can take an unexpected amount of time. This must be designed along with the rest of the program, which must be analysed globally to reduce the packing overhead.
The last problem arises if a 2-issue version of the F-CPU is designed : the latency of each operation virtually doubles, because twice as many instructions must be executed before a result is ready. This applies to 3-issue and 4-issue superscalar versions as well, but I'll show a simple example here. The F-CPU is a configurable, customizable core, so you can fit it to your application, but some units, like the multiplier, can't be duplicated two or three times without an increase in cost and die space. In short, superscalar versions of the F-CPU can't provide the user with all the units that a single-issue core can provide : there can be restrictions on the types of operations. For example, the logic unit can be replicated, but not the multiply unit, so only certain combinations of operations are allowed to enter the pipeline each cycle.
This is not a big problem with the presented DCT, because it requires only 5 multiplies out of 34 instructions. The idea is to duplicate the code and interleave the instructions in such a way that only one multiply is decoded in each pair of instructions. This is equivalent to loop unrolling, with the difference that the second instantiation of the loop is "shifted" to a position where the multiplies don't "collide". I will now represent our DCT code with one letter per instruction, to give a visual example.
a is "addition"
s is "subtraction"
m is "multiply"
aassaassasaasaamasmmmmasasaasasaas
1) Duplication
aassaassasaasaamasmmmmasasaasasaas
aassaassasaasaamasmmmmasasaasasaas
aaaassssaaaassssaassaaaassaaaammaassmmmmmmmmaassaassaaaassaassaaaass
2) Shift
Just shift the instructions of one pipeline, modulo the number of instructions, so that only one m is present in each column. In our case it is easy because there are several legal positions, so let's take the first one that appears. For example, with 8 positions of difference, we get :
aassaassasaasaamasmmmmasasaasasaas
        aassaassasaasaamasmmmmasasaasasaas
3) Prologue and epilogue
If the instruction flows of a loop do not start at the same time, one of them necessarily overflows the loop body's boundaries. Let's take one instance of the original code as a reference (the first one, for example) and place the loop boundary so that it coincides with it. Another instance of the code is added to fill the "hole" :
abcdefghijklmnopqrstuvwxyz|aassaassasaasaamasmmmmasasaasasaas|abcdefgh
aassaassasaasaamasmmmmasas|aasasaasaassaassasaasaamasmmmmasas|aasasaas
          prolog          |             loop body            | epilog
abcdefghijklmnopqrstuvwxyz|aassaassasaasaamasmmmmasasaasasaas|abcdefghijklmnop
abcdefghaassaassasaasaamas|mmmmasasaasasaasaassaassasaasaamas|mmmmasasaasasaas
aassaassasaasaamasmmmmasas|aasasaasaassaassasaasaamasmmmmasas|aasasaasabcdefgh
          prolog          |             loop body            |     epilog
In the 2-D DCT case, the added epilog code can be used to keep as much data as possible inside the register set. For example, the epilog can be interleaved with the prolog of the second pass, and several memory references can be avoided by using registers instead of load/store instructions...
4) Interleaving
Now that we have an instruction stream that is legal and efficient by construction, we have to generate the final source code. There are several ways to do that, from the simplest to the hairiest...
The simplest form simply "scans" the above code with the columns first :
aaabbaccsddseeaffaggshhsiaajasksa...
More complex versions will take architecture-specific parameters into account, such as register access restrictions, execution unit availability and other annoying pairing rules. For example, if the multiply unit is only accessible to the first instruction of a pair, a pair that performs the multiply in the second slot will be swapped. This should be harmless to performance as long as the code runs on a platform with compatible coding rules. On a more constrained architecture, however, this can cause "avalanche" effects that force the whole code to be rescheduled.
Now, if you are a beginner at assembly programming or at general computer architecture, it should be obvious that a good microprocessor must be as orthogonal as possible and impose as few coding rules as possible; otherwise, the coding complexity and the code size explode.
I hope you enjoyed reading this paper and that these many hours of work are useful to you. The F-CPU is a modern architecture that requires modern techniques and a fresh look at the programming paradigm. Although badly scheduled code will still execute, the F-CPU becomes an interesting alternative when its main features are used : an orthogonal instruction set and architecture, a lot of registers, and non-blocking pipelined execution units that can handle SIMD data.
All these features can be used at the same time, but this requires some hacking and a lot of thinking. This paper has presented three techniques, but there are other variants and other ways to do the job. At the time of writing, no optimizing compiler exists for the F-CPU (now you understand that it's not as easy as porting GCC), so good programming habits can save some execution time and help your code benefit from future architectural enhancements without rewriting everything.
Stay tuned for future in-depth algorithm explorations...
These files are (C) 2002-2003 by Yann Guidon
The information they contain is provided in the hope that it is useful;
it is believed to be accurate but there is no warranty of ANY KIND.
If you find a bug or want to correct something, don't hesitate to contact me.
Verbatim copying and distribution of this entire article is permitted
in any medium, provided this notice is preserved.