The error is in the "b2" variable calculation, defined as "a2 - a4" by Kovac. The correct calculation of "b2" is "a3 - a4". I did my best to update the code and check the text. Be careful anyway, and mind the file versions!
Version Fri Jan 13 14:54:34 CET 2012 : restoration to ygdes.com
Version Thu Mar 3 06:28:16 CET 2016 : moved to ygdes.com's new server
Version Thu Mar 10 01:10:15 CET 2016 : bogus DCT warning
for the F-CPU Core 0
This paper presents some common techniques to optimise data-intensive computations on FC0. Our example is an 8-tap Discrete Cosine Transform, which is widely used for image compression. The paper concentrates on some "basic" manual scheduling techniques and shows a step-by-step optimisation of existing code. The source examples are written in C to allow debugging and verification in the absence of a working F-CPU or assembler.
FC0 is the "F-CPU Core #0", a proof of concept and playground for us wannabe computer hackers. FC0 is characterized by its constraints, design choices and history:
As a result, you can issue several independent operations and use the results later. You can start a long operation, perform several short ones, and finally fetch the long operation's result, because (1) almost all instructions are pipelined and non-blocking, and (2) the many registers allow several long operations to be in flight while short operations execute, without spilling to memory.
FC0 is an interesting proof of concept because it can sustain high loads predictably and remain simple. It is not too difficult to implement (most of the F-CPU contributors are inexperienced) yet it stands comparison with other cores of the same gate count. It exploits all the known techniques (only when needed) but keeps a lot of room for future enhancements, paving the way for FC1, FC2, FC3 and so on. It is distributed in the form of copylefted source code that can be implemented in a variety of technologies, letting users fine-tune their own version to match their needs.
However, this puts some pressure on the compiler ("RISC" can be read as "Relegate the Important Stuff to the Compiler"), and we will see some tricks here which can be used elsewhere as well.
Note that this example focuses on numerical code only: the demonstration C code will be a 1-to-1 translation of the assembly language (or vice versa), so we can concentrate on the scheduling and functional parts, not on the assembly language level. The code is also more readable, a very important factor because any confusion could introduce very nasty bugs. It couldn't be simpler: we will use only addition, subtraction and multiplication.
The algorithm that we will implement and optimize here is a "Discrete Cosine Transform" (or "DCT" for short), a function that is often used in signal processing: principally the analysis and compression of images and sounds. The DCT is a special kind of Fourier Transform with particular characteristics; for example, there is no need for a "window" that fades data in and out. Moreover, it is more efficient than the Fourier Transform at "packing the energy" of a signal, so an entropy coder can compress the result with fewer bits. Mathematically, the algorithm used for compression ("DCT") is called "DCT type II" and the inverse ("iDCT") is called "DCT type III" (in order to avoid confusion when talking to mathematicians).
The Winograd algorithm is a near-perfect implementation of the Fourier and Discrete Cosine transforms because it requires fewer multiplies than the Cooley-Tukey algorithm (which uses the famous "butterfly" structure). It is less symmetrical but very useful when the cost of a multiply is significant (and it often is).
The Discrete Cosine Transform is heavily used in the compression and decompression of still and moving images, mainly in the JPEG and MPEG formats. During compression, each block of 8*8 pixels is transformed into a "spatial spectrum" representation that eases compression, and the operation is performed in reverse during decompression. Though the operations are reversed, the core of the 8*8 DCT can remain the same if the pre- or post-scaling step is adapted.
1-D time-frequency transforms are also used for compressed audio streams such as the MPEG family (MPEG Audio Layer III is best known as "MP3"). But 2-D transforms are almost as easy thanks to the "composition" of 1-D transforms. Most image compression formats use the "separability" property of the 2-D transform: an 8x8 DCT can be performed with eight 1-D, 8-tap DCT passes on the rows, followed by the same passes on the columns. Using the Winograd version of the DCT, this requires 29*2*8 = 464 adds or subtractions and 5*2*8 = 80 multiplies, instead of 4096 multiplies with the most naive approach of a matrix multiply. An even more interesting fact is that all the multiplies of the Winograd (or the Cooley-Tukey) DCT are by constants. The pre- or post-scaling computations are not counted here, but they require 64 multiplies (one per pixel) which can be performed along with other operations.
Optimizing the DCT is very beneficial for both the encoding and the decoding of pictures and movies, for example, and this paper shows that the F-CPU is particularly well suited to this task if the program is adapted a bit. Frankly, before writing this code, I didn't know it would be so easy, but there is a bit of luck too...
A "traditional" DCT dataflow graph, derived from the classical "butterfly" diagram, looks like this:
(2016) Warning: the picture above is bogus, I think I tried to reverse-engineer some code and I must have mistranslated something at the output of the odd half. Do not use this diagram.
One can recognize the underlying "butterfly" on which this method is based. The square boxes perform the add/sub operations and the other nodes perform a trigonometric "rotation": this version totals 13 multiplies and 22 adds or subtractions when operating on scalar (not complex) data.
After a lot of maths, Winograd found a formula that uses fewer multiplies (5) but a few more add/subs (29). This more sophisticated algorithm is often used in hardware accelerators, despite the more difficult scheduling. One reason is that multiplies are often expensive, whereas FC0 is only affected by the operation's latency, which can be "hidden" with good code. The other reason is that it is more stable numerically: the multiplies introduce some "numerical noise" (proportional to the precision), and fewer multiplies mean more accuracy. The graph looks quite different from the previous one:
This is a graphical representation of the algorithm that we use here, which has some additional characteristics.
I found this code in a paper named "Pipelined Fast 2-D DCT Architecture for JPEG Image Compression", authored by Luciano Volcan Agostini <agostini@inf.ufrgs.br>, Ivan Saraiva Silva <ivan@dimap.ufrn.br> and Sergio Bampi <bampi@inf.ufrgs.br>. They present the architecture and the results of an FPGA-based device designed to perform the 2-D DCT for real-time image processing. I don't remember where I found it, but one could search for it on Google or request it from the authors or from me (it is provided in the archive of this article). The file is named sbcci_DCT2D.pdf and weighs 106,271 bytes.
The paper starts by defining a set of constants:

m1 = cos(4pi/16)
m2 = cos(6pi/16)
m3 = cos(2pi/16) - cos(6pi/16)
m4 = cos(2pi/16) + cos(6pi/16)
/* Step 1 */
b0 = a0 + a7;   b1 = a1 + a6;   b2 = a3 - a4;   b3 = a1 - a6;
b4 = a2 + a5;   b5 = a3 + a4;   b6 = a2 - a5;   b7 = a0 - a7;

/* Step 2 */
c0 = b0 + b5;   c1 = b1 - b4;   c2 = b2 + b6;   c3 = b1 + b4;
c4 = b0 - b5;   c5 = b3 + b7;   c6 = b3 + b6;   c7 = b7;

/* Step 3 */
d0 = c0 + c3;   d1 = c0 - c3;   d2 = c2;        d3 = c1 + c4;
d4 = c2 - c5;   d5 = c4;        d6 = c5;        d7 = c6;        d8 = c7;

/* Step 4 */
e0 = d0;        e1 = d1;        e2 = m3 * d2;   e3 = m1 * d7;
e4 = m4 * d6;   e5 = d5;        e6 = m1 * d3;   e7 = m2 * d4;   e8 = d8;

/* Step 5 */
f0 = e0;        f1 = e1;        f2 = e5 + e6;   f3 = e5 - e6;
f4 = e3 + e8;   f5 = e8 - e3;   f6 = e2 + e7;   f7 = e4 + e7;

/* Step 6 */
S0 = f0;        S1 = f4 + f7;   S2 = f2;        S3 = f5 - f6;
S4 = f1;        S5 = f5 + f6;   S6 = f3;        S7 = f4 - f7;
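The six steps can be dropped into a tiny C function as-is and sanity-checked before any optimisation. One invariant falls straight out of the dataflow: S0 = d0 = c0 + c3 = (a0+a7 + a3+a4) + (a1+a6 + a2+a5), i.e. the sum of the eight inputs, so a constant input must put all its energy into S0. A minimal sketch (the wrapper name winograd_dct8 is mine):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef double sample;

/* The six steps above, pasted verbatim (with the corrected b2 = a3 - a4). */
static void winograd_dct8(const sample a[8], sample S[8])
{
    const sample m1 = cos(4*M_PI/16), m2 = cos(6*M_PI/16);
    const sample m3 = cos(2*M_PI/16) - cos(6*M_PI/16);
    const sample m4 = cos(2*M_PI/16) + cos(6*M_PI/16);
    sample a0=a[0], a1=a[1], a2=a[2], a3=a[3],
           a4=a[4], a5=a[5], a6=a[6], a7=a[7];
    sample b0,b1,b2,b3,b4,b5,b6,b7, c0,c1,c2,c3,c4,c5,c6,c7,
           d0,d1,d2,d3,d4,d5,d6,d7,d8, e0,e1,e2,e3,e4,e5,e6,e7,e8,
           f0,f1,f2,f3,f4,f5,f6,f7;

    /* Step 1 */ b0=a0+a7; b1=a1+a6; b2=a3-a4; b3=a1-a6;
                 b4=a2+a5; b5=a3+a4; b6=a2-a5; b7=a0-a7;
    /* Step 2 */ c0=b0+b5; c1=b1-b4; c2=b2+b6; c3=b1+b4;
                 c4=b0-b5; c5=b3+b7; c6=b3+b6; c7=b7;
    /* Step 3 */ d0=c0+c3; d1=c0-c3; d2=c2; d3=c1+c4;
                 d4=c2-c5; d5=c4; d6=c5; d7=c6; d8=c7;
    /* Step 4 */ e0=d0; e1=d1; e2=m3*d2; e3=m1*d7;
                 e4=m4*d6; e5=d5; e6=m1*d3; e7=m2*d4; e8=d8;
    /* Step 5 */ f0=e0; f1=e1; f2=e5+e6; f3=e5-e6;
                 f4=e3+e8; f5=e8-e3; f6=e2+e7; f7=e4+e7;
    /* Step 6 */ S[0]=f0; S[1]=f4+f7; S[2]=f2; S[3]=f5-f6;
                 S[4]=f1; S[5]=f5+f6; S[6]=f3; S[7]=f4-f7;
}
```

A constant input of 1.0 yields S0 = 8 and S1..S7 = 0 exactly, since every multiply then operates on zero.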
There are, however, high risks of human error when one messes with 50 lines of random code like these. And even before one starts to hack into it, there must be no doubt that the original code performs the task as expected. I have created a testbench that allows the code to be compiled with any C compiler, thus catching typing errors. You can also provide your own stimuli, examine the output, and even single-step through the program with a debugger if needed.
This "wrapper" for our code uses floating-point numbers because working with integers requires a lot of alignment and bit counting. For example, the input of the 2-D DCT is generally 8-bit numbers (when compressing images) and the output is 16-bit numbers, increasing by steps of 1 bit per pipeline stage. The data are usually represented in signed fractional format, with 1 sign bit and from 7 to 15 mantissa bits ("1Q7" to "1Q15"). Using floating-point numbers during the algorithm verification phase helps separate the DCT from the scaling: it's easier to find where the errors are.
The testbench simply #includes the DCT code and creates a type called sample. This makes coding easier because we don't have to worry about the data type or the inputs/outputs. The program defines a set of source and destination data, and the #included code must declare its internal variables itself. The following version of dct1d_test.c allows one to (manually) verify that the outputs of two different codes are identical:
/* dct1d_test.c
   created mer jan 2 13:55:12 GMT 2002 by Yann Guidon <whygee@f-cpu.org>

   this is a "wrapper" for the other files which allows to test them.
   Here we use "float" because the code doesn't contain the shifts and
   sign corrections required by integer arithmetics.

   compile with   gcc -lm dct1d_test.c -Wall
*/

#include <stdio.h>
#include <math.h>

/* typedef signed short int sample; */
typedef double sample;

sample
  m1, m2, m3, m4,                  /* twiddle factors */
  a0, a1, a2, a3, a4, a5, a6, a7,  /* the inputs */
  S0, S1, S2, S3, S4, S5, S6, S7;  /* the outputs */

static void DCT1D_01() {
#include "dct1d_01.c"
}

static void DCT1D_05() {
#include "dct1d_05.c"
}

int main() {
  sample l[8];
  int i;

  m1 = cos(4*M_PI/16);
  m2 = cos(6*M_PI/16);
  m3 = cos(2*M_PI/16) - cos(6*M_PI/16);
  m4 = cos(2*M_PI/16) + cos(6*M_PI/16);

  for (i=0; i<8; i++) {
    l[i] = cos((i * M_PI)/4);
  }
  a0=l[0]; a1=l[1]; a2=l[2]; a3=l[3];
  a4=l[4]; a5=l[5]; a6=l[6]; a7=l[7];

  DCT1D_01();
  printf("\n DCT8 input :            DCT8 output :\n\
 a0 = %+.15f   S0 = %+.15f\n\
 a1 = %+.15f   S1 = %+.15f\n\
 a2 = %+.15f   S2 = %+.15f\n\
 a3 = %+.15f   S3 = %+.15f\n\
 a4 = %+.15f   S4 = %+.15f\n\
 a5 = %+.15f   S5 = %+.15f\n\
 a6 = %+.15f   S6 = %+.15f\n\
 a7 = %+.15f   S7 = %+.15f\n\n",
    a0, S0, a1, S1, a2, S2, a3, S3,
    a4, S4, a5, S5, a6, S6, a7, S7);

  S0=S1=S2=S3=S4=S5=S6=S7=0;

  DCT1D_05();
  printf("\n DCT8 input :            DCT8 output :\n\
 a0 = %+.15f   S0 = %+.15f\n\
 a1 = %+.15f   S1 = %+.15f\n\
 a2 = %+.15f   S2 = %+.15f\n\
 a3 = %+.15f   S3 = %+.15f\n\
 a4 = %+.15f   S4 = %+.15f\n\
 a5 = %+.15f   S5 = %+.15f\n\
 a6 = %+.15f   S6 = %+.15f\n\
 a7 = %+.15f   S7 = %+.15f\n\n",
    a0, S0, a1, S1, a2, S2, a3, S3,
    a4, S4, a5, S5, a6, S6, a7, S7);

  return 0;
}
The first thing to do is to clean up the code.
A number of "operations" in the original algorithm only copy a result from one pipeline stage to the next. This is useless because the processor executes from registers, so each copy can be short-circuited with variable renaming. For example, if we find b = a; c = b; we can simply write c = a; and spare one instruction. In fact, the original code contains 50 operations and the Winograd DCT needs 29 + 5 = 34 operations, so we can remove 16 lines. This is not complex at all and takes a few minutes by hand. The following code can be found in dct1d_01.c:
/* DCT1D_01.c version mar jan 1 08:45:57 GMT 2002 whygee@f-cpu.org
   1* 8-bin DCT for a "baseline" JPG compressor.
   originally cut and pasted from : sbcci_DCT2D.pdf
   "Pipelined Fast 2-D DCT Architecture for JPEG Image Compression"
   Luciano Volcan Agostini
Ok, there are 16 red lines, so it looks good :-)
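The renaming pass itself is mechanical: each pure copy like "x = y;" becomes an entry in a substitution table, and every later reference is chased through that table until it reaches a real producer. A small sketch of the chasing step (the table layout and the name resolve are mine, not an F-CPU tool; it assumes straight-line code with no circular copies):

```c
#include <string.h>

/* One entry of the substitution table: "alias" is a pure copy of "source". */
struct copy { const char *alias, *source; };

/* Follow chains like b = a; c = b; so that a reference to "c"
   resolves to "a".  Loops forever on circular copies, which cannot
   occur in a straight-line block like the DCT. */
static const char *resolve(const struct copy *tab, int n, const char *name)
{
    for (int again = 1; again; ) {
        again = 0;
        for (int i = 0; i < n; i++) {
            if (strcmp(tab[i].alias, name) == 0) {
                name = tab[i].source;   /* short-circuit one copy */
                again = 1;
                break;
            }
        }
    }
    return name;
}
```

Running this over the 16 pure copies of the original listing is exactly what produces the shortened code below.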
Now that all the copy instructions are identified, all the references are short-circuited and the code is shortened. We start to see the "skeleton" of the code: a succession of adds and subs with a block of multiplies in the middle. The source code is named dct1d_02.c:
/* DCT1D_02.c version mar jan 1 08:45:57 GMT 2002 whygee@f-cpu.org
   1* 8-bin DCT for a "baseline" JPG compressor.

   Optimisation : part 2 = first cleanup

   All the unnecessary temporary variables are removed.
   A census of the "nodes" is made. Since their number is below 64,
   the register allocation is straight-forward.
   Furthermore, the original code seems to be incredibly well optimised.
   First, the "minimal distance" between operations seems to be
   compatible with FC0. Second, it seems possible to use less than the
   original number of temporary variables : although the "clean" model
   defines 30 variables, it is "sliced" in such a way that maybe 16
   working registers are necessary. Later we will even describe how
   they are reduced to 10.

   b2 bug corrected on Tue Dec 16 06:37:45 CET 2003
*/

sample
  b0, b1, b2, b3, b4, b5, b6, b7,
  c0, c1, c2, c3, c4, c5, c6,
  d0, d1, d3, d4,
  e2, e3, e4, e6, e7,
  f2, f3, f4, f5, f6, f7;

/* Step 1 */
b0 = a0 + a7;
b1 = a1 + a6;
b2 = a3 - a4;
b3 = a1 - a6;
b4 = a2 + a5;
b5 = a3 + a4;
b6 = a2 - a5;
b7 = a0 - a7; /***/

/* Step 2 */
c0 = b0 + b5;
c1 = b1 - b4;
c2 = b2 + b6; /***/
c3 = b1 + b4;
c4 = b0 - b5; /***/
c5 = b3 + b7; /***/
c6 = b3 + b6; /***/

/* Step 3 */
d0 = c0 + c3; /***/
d1 = c0 - c3; /***/
d3 = c1 + c4;
d4 = c2 - c5;

/* Step 4 */
e2 = m3 * c2; /*c2*/
e3 = m1 * c6; /*c6*/
e4 = m4 * c5; /*c5*/
e6 = m1 * d3;
e7 = m2 * d4;

/* Step 5 */
f2 = c4 + e6; /*c4*/
f3 = c4 - e6; /*c4*/
f4 = e3 + b7; /*b7*/
f5 = b7 - e3; /*b7*/
f6 = e2 + e7;
f7 = e4 + e7;

/* Step 6 */
S0 = d0; /*d0*/
S1 = f4 + f7;
S2 = f2;
S3 = f5 - f6;
S4 = d1; /*d1*/
S5 = f5 + f6;
S6 = f3;
S7 = f4 - f7;
Now that all the unnecessary instructions are removed, we can examine whether the pipeline will stall, and where: it is a first static scheduling analysis that will tell us whether we have to modify the code or not.
The short architectural description of FC0 at the beginning of this document now makes sense: we need several pieces of information to perform the analysis, principally the conditions under which the core stalls. We only care about the read-after-write dependencies (when data is not ready when it is needed) and the latency of each instruction. Other parameters are mostly ignored because we use only 2 types of arithmetic operations. The write-back contentions (when more than 2 results are written at the same time) are not checked because they can't happen by construction (there are only 2 latencies, so 3 results can't be written in one cycle).
Now the game is to count how many instructions separate an instruction that computes a result from the instruction that uses this result. If the distance is shorter than the latency of the operation's unit, there is a stall. The following code is available under the name dct1d_03.c:
/* DCT1D_03.c version mar jan 1 08:45:57 GMT 2002 whygee@f-cpu.org
   1* 8-bin DCT for a "baseline" JPG compressor.

   Optimisation : part 3 = distance verification

   Before the registers are renamed (down to around 16 registers),
   we have to check that the original order preserves all the
   dependency distances. Here we use the following rules of thumb :
    * addition or subtraction : the result is not available before 2 cycles
    * multiplication : "mulh" is used, let's consider that it takes 6 cycles.
   Here, "cycles" are the number of instructions that separate those
   with dependencies. For example :
     a = b + c   <= takes 2 cycles
     d = e + f
     g = h + i
     k = l + a   <= a is ready now.
   If there is a conflict, we have to swap several operations.
   If the swap's range is too large, there is an avalanche effect with
   the dependencies, which forces us to re-analyse everything through a
   global dataflow-graph reordering and flattening.

   b2 bug corrected on Tue Dec 16 06:37:45 CET 2003
*/

sample
  b0, b1, b2, b3, b4, b5, b6, b7,
  c0, c1, c2, c3, c4, c5, c6,
  d0, d1, d3, d4,
  e2, e3, e4, e6, e7,
  f2, f3, f4, f5, f6, f7;

/* here, we assume that there is no pending operation that deals with aX. */

/* Step 1 */
b0 = a0 + a7;
b1 = a1 + a6;
b2 = a3 - a4;
b3 = a1 - a6;
b4 = a2 + a5;
b5 = a3 + a4;
b6 = a2 - a5;
b7 = a0 - a7; /***/

/* Step 2 */
c0 = b0 + b5; /* b5 = 2 cycles : limit... if there was a conflict
                 we could have swapped with c1. */
c1 = b1 - b4; /* b4 = 4 cycles */
c2 = b2 + b6; /***/ /* b4 = 4 cycles */
c3 = b1 + b4; /* no more dangerous dependencies here. */
c4 = b0 - b5; /***/
c5 = b3 + b7; /***/
c6 = b3 + b6; /***/

/* Step 3 */
S0 = c0 + c3; /***/ /* c3=3 */
S4 = c0 - c3; /***/ /* c3=4 */
d3 = c1 + c4; /* c4=4 */
d4 = c2 - c5; /* c5=4 */

/* Step 4 */
e2 = m3 * c2; /*c2*/
e3 = m1 * c6; /*c6*/
e4 = m4 * c5; /*c5*/
e6 = m1 * d3; /* d3=4 */
e7 = m2 * d4; /* d4=4 */

/* Step 5 */
S2 = c4 + e6; /*c4*/ /* e6=1 : OUCH !!! */
S6 = c4 - e6; /*c4*/ /* e6=2 : ... */
f4 = e3 + b7; /*b7*/ /* e3=5 */
f5 = b7 - e3; /*b7*/ /* e3=6 */
f6 = e2 + e7; /* e7=4 */
f7 = e4 + e7; /* e7=5 */

/* Step 6 */
/*S0 = d0;*/ S1 = f4 + f7;
/*S2 = f2;*/ S3 = f5 - f6;
/*S4 = d1;*/ S5 = f5 + f6;
/*S6 = f3;*/ S7 = f4 - f7;
The blue lines represent the dependencies which are resolved exactly when needed (a kind of critical datapath). The red lines represent those which are not: the operation is not finished when the result is needed. The red lines must be removed, either by swapping instructions or by inserting some other "work" during the stall. In the middle of the block, it is not possible to interleave other work, and there are a lot of results which are ready long before they are needed.
In fact, everything is fine before the multiplies: there are around 8 data to process with a latency of 2 cycles, so there always remain 5 operations to choose from. However, the multiplies by constants take longer, and there are only 5 of them, so the remaining operations must wait for their completion: 5 multiplies are enough to exhaust the pool of the additions' results.
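This distance counting can itself be automated. The sketch below is a deliberately simplified model of my own (the names insn and count_stalls are not from any F-CPU tool): it applies the rules of thumb above, treating a result as usable once at least "latency" instructions lie between producer and consumer, and it ignores how one stall shifts the instructions that follow it.

```c
#include <string.h>

/* One three-address instruction: dest = src1 (op) src2.
   latency = 2 for add/sub, 6 for mulh, per the rules of thumb. */
struct insn { const char *dest, *src1, *src2; int latency; };

/* Count stall cycles in a straight-line block: for each source operand,
   find the nearest earlier producer and compare the number of
   in-between instructions with the producer's latency.  Each
   instruction waits for its worst-placed source. */
static int count_stalls(const struct insn *code, int n)
{
    int stalls = 0;
    for (int i = 0; i < n; i++) {
        int wait = 0;
        const char *src[2] = { code[i].src1, code[i].src2 };
        for (int s = 0; s < 2; s++) {
            for (int j = i - 1; j >= 0; j--) {
                if (strcmp(code[j].dest, src[s]) == 0) {
                    int between = i - j - 1;   /* instructions in between */
                    if (code[j].latency - between > wait)
                        wait = code[j].latency - between;
                    break;                      /* nearest producer only */
                }
            }
        }
        stalls += wait;
    }
    return stalls;
}
```

On the four-instruction example from the comments above, "a" has 2 instructions between its producer and its consumer, so the block reports zero stalls; remove the two fillers and the same dependency costs 2 cycles.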
Here, things become more serious. We have to swap instructions so that no cycle is lost. If you ask a computer to do that, it's an exponentially complex problem, but a human brain can do it with linear complexity because it can sort the problems and avoid those that don't matter :-)
/* DCT1D_04.c version mar jan 1 08:45:57 GMT 2002 whygee@f-cpu.org
   1* 8-bin DCT for a "baseline" JPG compressor.

   Optimisation : part 4 = rescheduling

   It appears that there is a problem at the end of the block, mainly
   because of the multiply's latency and a slight order problem at the
   beginning of step 5. We have to reorder the operations to reduce the
   overall latency. End-of-block and start-of-block contentions are
   common because the dependencies run out. On top of that, Winograd's
   algorithm is very optimised but the dependency diagram is not as
   symmetrical as in the usual "butterfly" FFT. The problem that arises
   here is not surprising because the dependency tree narrows.

   In a "real" case, there are other things to interleave with, such as
   load/stores, counter or pointer manipulations... However, we will
   reorder the instructions manually. When we can't recover the lost
   cycles, the solution is to interleave some other "useful work" to
   fill the pipeline stalls. As noted before, the first way is to
   interleave data movement or transformations from the next code block.
   The other (desperate) way is to unroll the loop :
    1) copy-paste the block
    2) rename all the registers in the second block
    3) interleave the two blocks in such a position that the stalls
       are filled
    4) copy-paste the resulting block to form the start and the end
       of the loop
    5) don't forget to divide the loop count by two

   Fortunately, in the current case, it is possible to manually reorder
   the instructions. It took a few minutes and it spans a larger part of
   the block than expected, but only one cycle is lost in the end, which
   is not enough to justify loop unrolling. This deep swapping increases
   the register usage, and a few more temporary variables are necessary,
   maybe 20 or 22 instead of 16 or 18.

   b2 bug corrected on Tue Dec 16 06:37:45 CET 2003
*/

typedef signed short int sample;

sample
  b0, b1, b2, b3, b4, b5, b6, b7,
  c0, c1, c2, c3, c4, c5, c6,
  d0, d1, d3, d4,
  e2, e3, e4, e6, e7,
  f2, f3, f4, f5, f6, f7; /* the outputs */

/* here, we assume that there is no pending operation that deals with aX. */

/* Step 1 */
b0 = a0 + a7;
b1 = a1 + a6;
b2 = a3 - a4;
b3 = a1 - a6;
b4 = a2 + a5;
b5 = a3 + a4;
b6 = a2 - a5;
b7 = a0 - a7; /***/

/* Step 2 */
c0 = b0 + b5;
c1 = b1 - b4;
c2 = b2 + b6; /***/
c6 = b3 + b6; /* moved */
c4 = b0 - b5; /***/
c5 = b3 + b7; /***/
c3 = b1 + b4; /* moved */

/* Step 3 */
e3 = m1 * c6; /*3*/ /* moved */
d3 = c1 + c4; /*3*/
d4 = c2 - c5; /*3*/

/* Step 4 */
e2 = m3 * c2; /*c2*/ /*7*/
e4 = m4 * c5; /*c5*/ /*5*/
e6 = m1 * d3; /*3*/
e7 = m2 * d4; /*3*/
S0 = c0 + c3; /*7*/
S4 = c0 - c3; /*8*/

/* Step 5 */
f4 = e3 + b7; /*b7*/ /* e3=8 */
f5 = b7 - e3; /*b7*/ /* e3=9 */
/* one free slot here : the multiply is not ready */
S2 = c4 + e6; /*c4*/ /* e6=5 : OUCH !!! */
f6 = e2 + e7; /* e7=5 (6 after the stall) */
S6 = c4 - e6; /*c4*/ /* e6=already ok */
f7 = e4 + e7; /* e7=already ok */

/* Step 6 (reordered) */
S3 = f5 - f6; /* f6=2 */
S5 = f5 + f6; /* f6=3 */
S1 = f4 + f7; /* f7=2 */
S7 = f4 - f7; /* f7=3 */
I don't remember exactly how I did this (3 months ago), but the principle is to find which instructions to move, and where, so that the stalls disappear. This can force one or more independent instructions to move as well, but there is only one lost cycle now. I did this by hand within some tens of minutes, I think. I'm too lazy to draw the dependency diagram, but the average "waiting time" between a result's delivery and its use has decreased. So the first hint (if you want to do this at home) is to search for the longest "distances"/"waiting times" and balance them so that the stalls disappear.
One stall cycle for 34 instructions is a decent number (97% CPU usage), so I stop the cycle counter here and focus on another issue. I'm happy that the code was already pretty well optimised and I only had to swap some instructions. Starting from another code base could have been much more painful.
The new code uses techniques that are even more complex (for newcomers) and its making is more error-prone: I have saved a "snapshot" of several steps of my ongoing work, to avoid losing time if I detected an error.
/* DCT1D_05.c version mar jan 1 08:45:57 GMT 2002 whygee@f-cpu.org
   1* 8-bin DCT for a "baseline" JPG compressor.

   Optimisation : part 5 = register reallocation

   Once the data dependencies are free from pipeline stalls, we can
   "optimise" the register allocation. It is not a critical part in
   F-CPU because there are 64 registers : that is well enough to avoid
   complex algorithms. We could associate each register to a temporary
   variable, but this would reduce the available resources for the rest
   of the program, particularly for the previous and next code blocks.
   We have to keep data pointers, constants (index advance and twiddle
   factors) and some headroom, in case loop unrolling appears necessary
   in the future.

   For example, a dual-issue FC0 would probably require a special 2x
   unrolling : the duplication of every instruction (not forgetting to
   rename the registers), forming perfectly independent and scheduled
   pairs. This is the simplest solution if you don't want to recalculate
   all the data dependency distances and the register numbers. Of
   course, this only works when all the execution units are duplicated,
   and the multiply unit is too heavy to accept two instructions per
   cycle. Some local reordering would then be necessary, "shifting" the
   code stream by a certain number of instructions so that there is no
   conflict.

   Back to register re-allocation. This algorithm is very simple (once
   you have understood it). Furthermore, there is no register pressure,
   so we can follow it quietly. The basic principle is to identify a
   datum, from the time it is created until its last use. The end of a
   datum's life is identified when the associated register is written
   to, that is : when it is associated to another datum (or more
   exactly, just before; this detail becomes meaningful later).
   For example :
     a = b + c        <-- c contains a first value
     --- anything --- <-- time during which c is unused
     c = d + e        <-- c contains a new value

   The algorithm uses a "scoreboard" approach where each register has a
   flag (either allocated, and to which old register, or free). In a
   forward scan, as in the P6 core family, the scoreboard's entry is
   cleared whenever the register is written to. If it is used before
   being overwritten, the value it contained has died, and we have thus
   detected a datum's death. In order to detect this case, it is more
   efficient to run the algorithm from the end of the block (backwards),
   otherwise the registers will contain unused values during a
   significant fraction of the program. This is an important difference
   with out-of-order cores such as the P6, where the instruction flow
   must be decoded in-order for obvious reasons.

   This register re-allocation method is optimal when there is no
   register pressure and the coded algorithm is already optimised. If
   there are not enough registers for a given code block, things get
   much more difficult, because one has to "swap" data to memory or
   split/slice the block into pipelineable chunks. If that is not
   possible, it is necessary to count which data are most used, to keep
   them in registers. This allocation algorithm is not funny at all
   because its complexity depends on the instruction set's
   orthogonality... No such problem here : we have enough registers in
   the F-CPU to run an even more complex computation. When our
   scoreboard needs a new entry and all the entries are "busy", we can
   simply create a new entry.

   The reallocation algorithm goes as follows :
    1- point at the end of the instruction stream
    2- identify the source operands as "dying data"
    3- for each "dying datum" :
    4-   allocate a register for each datum (set the corresponding
         scoreboard entry)
    5-   go to the previous instruction
    6-   if the datum is written :
    7-     replace the old reference with the new register name
    8-     leave this loop (this is the end of the datum)
    9-   replace all source references with the new register name
         (carefully checking that there is no numbering conflict)
   10-   loop back to 5
   11- if we have reached the beginning of the code block, exit this
       algorithm
   12- when all the current instruction's references are replaced, the
       instruction can be "committed"
   13- go to the previous instruction (upper in the stream)
   14- identify the already re-allocated register references
   15- if all the sources are already allocated, go to 11
   16- all the unallocated sources are "dying data", go to 3

   There is a need to distinguish newly allocated references from the
   original register references. It is safer to use different symbolic
   names, such as a1-a7 for the original names and R1-R63 for the new
   names. While this algorithm is easy, it is also very easy to make
   fatal errors, particularly if you are sleep-deprived... You are
   warned : from experience I can say that it can hurt very, very
   harshly.

   Here is a small example block :
     a0 = a1 + a2   (1)
     a1 = a2 + a3   (2)
     a2 = a3 + a4   (3)
     a5 = a6 + a1   (4)

   We start at the end, where a5 = a6 + a1, and the scoreboard is
   initially cleared. a priori, a5 can be replaced by anything, let's
   say R1. Then we propagate the new names a1=R2 and a6=R3 (we have
   created new scoreboard entries) until these names are "associated"
   with their value :
     R1=a4 R2=a1 R3=a6
     a0 = a1 + a2   (1')
     R2 = a2 + a3   (2')
     a2 = a3 + a4   (3')
     R1 = R3 + R2   (4')

   Now that all the references are replaced in the instruction, we go up
   and see what references are not yet replaced. For each replacement,
   we look in the scoreboard for unused entries : R1 is free (since it
   is written later), and we need to create two new entries. The
   scoreboard then contains :
     R1=a4 R2=a1 R3=a6 R4=a3 R5=a2
     a0 = a1 + a2   (1'')
     R2 = a2 + R4   (2'')
     R5 = R4 + R1   (3'')

   The process continues until all instructions are processed. However,
   we have to process inputs and outputs (terminals) differently : here
   a2 is used both internally and externally. Depending on the case, it
   may be necessary to allocate a new register.
     R1=a4 R2=a1 R3=a6 R4=a3 R5=a2(terminal) R6=a2(internal)
     a0 = a1 + R6   (1''')
     R2 = R6 + R4   (2''')

   and finally, the result is :
     R1=a4 R2=a1 R3=a6 R4=a3 R5=a2(terminal) R6=a2(internal) R7=a1 R8=a0
     R8 = R7 + R6   (1'''')
     R2 = R6 + R4   (2''')
     R5 = R4 + R1   (3'')
     R1 = R3 + R2   (4')

   This small example shows how it works, but it only becomes efficient
   for larger code blocks, such as our DCT. By changing the allocation
   policy, we can require more or fewer registers than previously. In
   our DCT, where a lot of temporary variables are used, this will
   "compress" the number of used registers. If we omit m1-m4, a1-a7 and
   S0-S7, we originally needed 30 registers and now we only need 10,
   after two hours of hacking. This means that despite the instruction
   reordering, which increases the lifetime of some variables, we are
   still near the minimum of 9 as written in the original algorithm.

   * Note : the recoded algorithm is now completely obfuscated ! You
   will not be able to understand it unless you perform a complex and
   exhaustive analysis.

   * Note 2 : the reallocation algorithm must be modified if the CPU has
   2-address instructions (i.e. x86). It can't be applied on the
   instruction stream, and it requires a higher level of optimisation,
   using dataflow graph analysis. This is much more complex because we
   have to determine which data must be copied before the register is
   overwritten by a result. It results in a higher register pressure and
   a greater programming difficulty.

   Do not forget that you have to read from the end of the file if you
   want to understand the comments ;-)

   Warning : this manual algorithm is not difficult, but any error can
   be fatal. It is relatively easy to implement in software, but no
   source processor exists yet. After some tens of lines, it is
   impossible to avoid one or two errors, so here is some advice :
   - run regression tests during the whole process, to detect errors as
     soon as possible
   - save the result under a new name often, in order to always have a
     failsafe, working source code
   - keep the old and new values on two consecutive lines, one being
     commented out : this helps to understand what happens and to
     visually detect errors
   - implement some "guards" that help detect the errors. In our
     example, this is done with the variables' naming : it is illogical
     to find a name like "e4" in your scoreboard when you are
     manipulating the b block, because e4 appears later and should not
     have been created yet. This means that you have forgotten to
     discard it from the scoreboard, or erased the line where it is
     defined, or something even more difficult to trace back
   - another example of error detection is when an instruction writes to
     a register that is not present in the scoreboard but is not a
     "terminal"
/* b2 bug corrected on Tue Dec 16 06:37:45 CET 2003 */

sample R1, R2, R3, R4, R5, R6, R7, R8, R9, R10; /* reallocated temporary registers */

/* deprecated :
   b0, b1, b2, b3, b4, b5, b6, b7,
   c0, c1, c2, c3, c4, c5, c6,
   d0, d1, d3, d4,
   e2, e3, e4, e6, e7,
   f2, f3, f4, f5, f6, f7; */

/* Step 1 */
/* b0 = a0 + a7; */  R7 = a0 + a7;
/* b1 = a1 + a6; */  R6 = a1 + a6;
/* b2 = a3 - a4; */  R4 = a3 - a4;  /* corrected : Kovac's listing reads "a2 - a4" */
/* b3 = a1 - a6; */  R5 = a1 - a6;  /* and here it's a simple substitution :-) */
/* b4 = a2 + a5; */  R9 = a2 + a5;
/* b5 = a3 + a4; */  R10 = a3 + a4;
/* b6 = a2 - a5; */  R3 = a2 - a5;
/* b7 = a0 - a7; */  R8 = a0 - a7;

/* Step 2 */
/* c0 = b0 + b5;   R1=(c0) R2=free R3=b6 R4=b2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
R1 = R7 + R10;  /* c0 is removed <= c0 is "created", b0 and b5 survive */
/* c1 = b1 - b4;   R1=c0 R2=(c1) R3=b6 R4=b2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
R2 = R6 - R9;   /* c1 is removed <= c1 is "created", b1 and b4 survive */
/* c2 = b2 + b6;   R1=c0 R2=c1 R3=b6 R4=c2/b2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
R4 = R4 + R3;   /* replacement b2/c2/R4 <= c2 is "created", b2 dies, b6 survives */
/* c6 = b3 + b6;   R1=c0 R2=c1 R3=c6/b6 R4=c2 R5=b3 R6=b1 R7=b0 R8=b7 R9=b4 R10=b5 */
R3 = R5 + R3;   /* replacement c6/b6/R3 <= b3 survives, b6 dies, c6 is "created" */
/* c4 = b0 - b5;   R1=c0 R2=c1 R3=c6 R4=c2 R5=b3 R6=b1 R7=c4/b0 R8=b7 R9=b4 R10=b5 */
R7 = R7 - R10;  /* replacement c4/b0/R7 <= c4 is "created", b0 and b5 die */
/* c5 = b3 + b7;   R1=c0 R2=c1 R3=c6 R4=c2 R5=c5/b3 R6=b1 R7=c4 R8=b7 R9=b4 */
R5 = R5 + R8;   /* replacement c5/b3/R5 <= c5 is "created", b3 dies, b7 survives */
/* c3 = b1 + b4;   R1=c0 R2=c1 R3=c6 R4=c2 R5=c5 R6=b1 R7=c4 R8=b7 R9=c3/b4 */
R9 = R6 + R9;   /* c3 is "created", b4 and b1 die */

/* Step 3 */
/* e3 = m1 * c6;   R1=c0 R2=c1 R3=c6/e3 R4=c2 R5=c5 R6=free R7=c4 R8=b7 R9=c3 */
R3 = m1 * R3;   /* replacement c6/e3/R3 */
/* d3 = c1 + c4;   R1=c0 R2=c1/d3 R3=e3 R4=c2 R5=c5 R6=free R7=c4 R8=b7 R9=c3 */
R2 = R2 + R7;   /* replacement c1/d3/R2 */
/* d4 = c2 - c5;   R1=c0 R2=d3 R3=e3 R4=c2 R5=c5 R6=free R7=c4 R8=b7 R9=c3 */
R6 = R4 - R5;   /* d4/R6 dies */

/* Step 4 */
/* e2 = m3 * c2;   R1=c0 R2=d3 R3=e3 R4=c2 R5=c5 R6=d4 R7=c4 R8=b7 R9=c3 */
R4 = m3 * R4;   /* same as next line with e2/c2/R4 */
/* e4 = m4 * c5;   R1=c0 R2=d3 R3=e3 R4=e2 R5=e4/c5 R6=d4 R7=c4 R8=b7 R9=c3 */
R5 = m4 * R5;   /* same as next line with e4/c5/R5 */
/* e6 = m1 * d3;   R1=c0 R2=e6/d3 R3=e3 R4=e2 R5=e4 R6=d4 R7=c4 R8=b7 R9=c3 */
R2 = m1 * R2;   /* same as next line with e6/d3/R2 */
/* e7 = m2 * d4;   R1=c0 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7/d4 R7=c4 R8=b7 R9=c3 */
R6 = m2 * R6;   /* e7 is "created", its entry replaces d4 which dies. */

/* S0 = c0 + c3;   R1=c0 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 R9=c3 */
S0 = R1 + R9;   /* nothing new. */
/* S4 = c0 - c3;   R1=c0 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 R9=c3 */
S4 = R1 - R9;   /* c0 needs a slot and replaces f4; a new slot is created for c3 */

/* Step 5 */
/* f4 = e3 + b7;   R1=f4 R2=e6 R3=e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 */
R1 = R3 + R8;   /* R1/f4 is created and nothing new is allocated : R1 disappears from the scoreboard */
/* f5 = b7 - e3;   R1=f4 R2=e6 R3=f5/e3 R4=e2 R5=e4 R6=e7 R7=c4 R8=b7 */
R3 = R8 - R3;   /* f5 appears, e3 dies, R3 is associated to both */

/* one free slot here : the multiply is not ready */

/* S2 = c4 + e6;   R1=f4 R2=e6 R3=f5 R4=e2 R5=e4 R6=e7 R7=c4 */
S2 = R7 + R2;
/* f6 = e2 + e7;   R1=f4 R2=f7 R3=f5 R4=f6/e2 R5=e4 R6=e7 R7=c4 */
R4 = R4 + R6;   /* e7 exists but not e2. f6 is being created : R4 is associated
                   to both f6 (new) and e2 (old). We spare one new register
                   allocation ! However, on CPUs with imprecise exceptions
                   (Alpha for example) this trick is not allowed, because after
                   an FPU fault the exception handler can't know which value
                   (old or new) the register contains. */
/* S6 = c4 - e6;   R1=f4 R2=e6 R3=f5 R4=f6 R5=e4 R6=e7 R7=c4 */
S6 = R7 - R2;   /* e6 replaces f7 */
/* f7 = e4 + e7;   R1=f4 R2=f7 R3=f5 R4=f6 R5=e4 R6=e7 */
R2 = R5 + R6;   /* f7 is "created" so we remove it from the scoreboard */

/* Step 6 */
/* S3 = f5 - f6;   R1=f4 R2=f7 R3=f5 R4=f6 */
S3 = R3 - R4;
/* S5 = f5 + f6;   R1=f4 R2=f7 R3=f5 R4=f6 */
S5 = R3 + R4;
/* S1 = f4 + f7;   R1=f4 R2=f7 */
S1 = R1 + R2;
/* S7 = f4 - f7;   R1=f4 R2=f7 */
S7 = R1 - R2;
Now we have a code that uses 10 temporary registers and performs a 1-D 8-bin DCT in 35 cycles. As one of the comments above states, if floating point is used, precise exceptions or an "FP barrier" are needed in some places, because an imprecise exception handler could run into trouble. But the F-CPU defines all instructions to be precise, so there is no problem. Here is the final code with all the development comments removed :
/* DCT1D.c
   version : Thursday, April 18, 2002
   (C) 2002 whygee@f-cpu.org
   Winograd 8-bin DCT for a "baseline" JPEG compressor.
   Optimisation : final part, cleanup.
   Most comments from the preceding code version are removed,
   so it can be integrated into another piece of software.
   b2 bug corrected on Tue Dec 16 06:37:45 CET 2003 */

sample R1, R2, R3, R4, R5, R6, R7, R8, R9, R10; /* temporary registers */

R7 = a0 + a7;
R6 = a1 + a6;
R4 = a3 - a4;
R5 = a1 - a6;
R9 = a2 + a5;
R10 = a3 + a4;
R3 = a2 - a5;
R8 = a0 - a7;

R1 = R7 + R10;
R2 = R6 - R9;
R4 = R4 + R3;
R3 = R5 + R3;
R7 = R7 - R10;
R5 = R5 + R8;
R9 = R6 + R9;

R3 = m1 * R3;
R2 = R2 + R7;
R6 = R4 - R5;

R4 = m3 * R4;
R5 = m4 * R5;
R2 = m1 * R2;
R6 = m2 * R6;

S0 = R1 + R9;
S4 = R1 - R9;
R1 = R3 + R8;
R3 = R8 - R3;
/* one free slot here : the multiply is not ready */
S2 = R7 + R2;
R4 = R4 + R6;
S6 = R7 - R2;
R2 = R5 + R6;

S3 = R3 - R4;
S5 = R3 + R4;
S1 = R1 + R2;
S7 = R1 - R2;
Please note that I have not addressed the issues of operand alignment or of the numbers' range : if you use integer arithmetic, you have to manage the fractional (fixed-point) representation yourself.
If you want to use floating-point numbers, be careful : the scheduling presented here will not apply. The actual latencies of the FP operators are not yet known for the F-CPU, but they are likely to be two or three times higher. Use loop unrolling and instruction interleaving to "hide" the latencies.
The proposed code is, however, transparent with respect to data parallelism : you can use integer SIMD instructions to boost performance. For example, when processing pictures, you can compute a full RGB pixel in one 64-bit register. The initial representation is 8-bit, but it gradually grows to 16 bits during the computation. It is your job to find the most suitable internal data representation, the one that needs the fewest packing/unpacking instructions.
If you are programming sound-processing effects, your quest ends here : this one-dimensional code can do the trick. The FFT (and even the Winograd FFT) is used more often than the DCT in that domain, but the same methods apply. I have some 8-bin Winograd FFT code in my archives, just in case.
However, if you program a 2-D DCT for image processing, the code developed in this paper is only the "core" of your loop. Well-written code will not waste all the cycles that were saved in the loop body, and accessing data is where most problems arise. This will be addressed in another paper, whenever I find some time to write it.
A few other parameters must be taken into account in a real program running on an F-CPU computer : SIMD data and superscalar extensions. With SIMD data (that is, multiple data packed in one register), you have to interleave (pack) separate streams and separate (unpack) them later, which can take an unexpected amount of time. This must be designed along with the rest of the program, which must be analysed globally to reduce the packing overhead.
The last problem arises if a 2-issue version of the F-CPU is designed : the latency of each operation virtually doubles, because twice as many instructions must be executed before a result is ready. This applies to 3-issue and 4-issue superscalar versions as well, but I'll show a simple example here. The F-CPU is a configurable, customizable core, so you can fit it to your application, but some units, like the multiplier, can't be duplicated two or three times without an increase in cost and die space. In short, superscalar versions of the F-CPU can't provide the user with all the units that a single-issue core can provide : there can be restrictions on the types of operations. For example, the logic unit can be replicated, but not the multiply unit, so only certain combinations of operations are allowed to enter the pipeline each cycle.
This is not a big problem with the presented DCT, because it requires only 5 multiplies out of 34 instructions. The idea is to duplicate the code and interleave the instructions in such a way that only one multiply is decoded in each pair of instructions. This is equivalent to loop unrolling, with the difference that the second instantiation of the loop is "shifted" to a position where the multiplies don't "collide". I will now represent our DCT code with one letter per instruction, to give a visual example.
a is "addition"
s is "subtraction"
m is "multiply"
aassaassasaasaamasmmmmasasaasasaas
1) Duplication
aassaassasaasaamasmmmmasasaasasaas
aassaassasaasaamasmmmmasasaasasaas
aaaassssaaaassssaassaaaassaaaammaassmmmmmmmmaassaassaaaassaassaaaass
2) Shift
Just shift the instructions of one pipeline, modulo the number of instructions, so that only one m is present in each column. In our case it is easy because there are several legal positions, so let's take the first one that appears. For example, with 8 positions of difference, we get :
aassaassasaasaamasmmmmasasaasasaas
        aassaassasaasaamasmmmmasasaasasaas
3) Prologue and epilogue
If the instruction flows of a loop do not start at the same time, one of them necessarily overflows the loop body's boundaries. Let's take one instance of the original code as a reference (the first one, for example) and place the loop boundary so that it coincides with it. Another instance of the code is added to fill the "hole" :
abcdefghijklmnopqrstuvwxyz|aassaassasaasaamasmmmmasasaasasaas|abcdefgh
aassaassasaasaamasmmmmasas|aasasaasaassaassasaasaamasmmmmasas|aasasaas
          prolog          |             loop body            | epilog
abcdefghijklmnopqrstuvwxyz|aassaassasaasaamasmmmmasasaasasaas|abcdefghijklmnop
abcdefghaassaassasaasaamas|mmmmasasaasasaasaassaassasaasaamas|mmmmasasaasasaas
aassaassasaasaamasmmmmasas|aasasaasaassaassasaasaamasmmmmasas|aasasaasabcdefgh
          prolog          |             loop body            |     epilog
In the 2-D DCT case, the added epilog code can be used to keep as much data as possible inside the register set. For example, the epilog can be interleaved with the prolog of the second pass, and several memory references can be avoided by using registers instead of load/store instructions...
4) Interleaving
Now that we have an instruction stream that is legal and efficient by construction, we have to generate the final source code. There are several ways to do that, from the simplest to the hairiest...
The simplest form simply "scans" the above code with the columns first :
aaabbaccsddseeaffaggshhsiaajasksa...
More complex versions will take architecture-specific parameters into account, such as register access restrictions, execution unit availability and other annoying pairing rules. For example, if the multiply unit is only accessible to the first instruction of a pair, a pair that performs the multiply in the second slot will be swapped. This should be harmless to performance as long as the code runs on a platform with compatible coding rules. On a more constrained architecture, however, this can cause "avalanche" effects that force the whole code to be rescheduled.
Now, if you are a beginner at assembly programming or at general computer architecture, it should be obvious that a good microprocessor must be as orthogonal as possible and impose as few coding rules as possible; otherwise, the coding complexity and the code size explode.
I hope you enjoyed reading this paper and that these many hours of work are useful to you. The F-CPU is a modern architecture that requires modern techniques and a fresh look at the programming paradigm. Although badly scheduled code will still execute, the F-CPU becomes an interesting alternative when its main features are used : an orthogonal instruction set and architecture, a lot of registers, and non-blocking pipelined execution units that can handle SIMD data.
All these features can be used at the same time, but this requires some hacking and a lot of thinking. This paper has presented three techniques, but there are other variants and other ways to do the job. At the time of writing, no optimizing compiler exists for the F-CPU (now you understand that it's not as easy as porting GCC), so good programming habits can save some execution time and help your code benefit from future architectural enhancements without rewriting everything.
Stay tuned for future in-depth algorithm explorations...
These files are (C) 2002-2003 by Yann Guidon
The information they contain is provided in the hope that it is useful;
it is believed to be accurate but there is no warranty of ANY KIND.
If you find a bug or want to correct something, don't hesitate to contact me.
Verbatim copying and distribution of this entire article is permitted
in any medium, provided this notice is preserved.