Folding[1] is a transformation technique using in DSP architecture implementation for minimizing the number of functional blocks in synthesizing DSP architecture.Folding was first developed by Keshab K. Parhi and his students in 1992.Its concept is contrary to unfolding.Folding transforms an operation from a unit-time processing to N unit-times processing where N is called folding factor.Therefore, multiple same operations (less than N) used in original system could be replaced with a signal operation block in transformed system.Thus, in N unit-times, a functional block in transformed system could be reused to perform N operations in original system.
While the folding transformation reduces the number of functional units in the architecture, it needs more memory element to store the temporary data.The reason is that multiple data produced from an operation block needs to be distinguished from N data produced from original operations.Therefore, the number of register may be increased.Furthermore, it needs additional multiplexer for switching different operation paths.Hence, the number of switching elements would also be increased.To counterattack such issues, the considerations of folding is
The following graph shows the example of folding transformation.The original DSP system produces y(n) at each unit time.The transformed DSP system produces y(n) in each 2 l where each 2 l increase 1 n, index of y.The resource used in original system are 2 adders, and the resource used in transformed system are 1 adder, 1 register, 3 multiplexer.The functional block, adder, is therefore reduced.
The DSP implementation in the folding algorithm is a Data flow graph(DFG), which is a graph composed of functional nodes and delay edges.
Another input for folding algorithm is folding set which is the function maps an operation unit of original DFG to an operation of transformed DFG with the number n <= N indicated the order of reused operation.
Given a DFG, a folding factor N, and folding set.The transformation is performing:
DF(U\xrightarrow{e}V)=Nw(e)-PU+v-u
\scriptstyleDF
\scriptstyleU,V
\scriptstylew(e)
\scriptstyleU,V
\scriptstyleu
\scriptstyleU
\scriptstylev
\scriptstyleV
\scriptstylePU
\scriptstyleU
The following graph show the example of folding algorithm.The folding set is
\scriptstyle\{Si|j\}
\scriptstyleSi
\scriptstylej
\scriptstyleS1,S2
\scriptstyleDF(1 → 2)=4(1)-1+1-3=1
\scriptstyleDF(1 → 5)=4(1)-1+0-3=0
\scriptstyleDF(1 → 6)=4(1)-1+2-3=2
\scriptstyleDF(1 → 7)=4(1)-1+3-3=3
\scriptstyleDF(1 → 8)=4(2)-1+1-3=5
\scriptstyleDF(3 → 1)=4(0)-1+3-2=0
\scriptstyleDF(4 → 2)=4(0)-1+1-0=0
\scriptstyleDF(5 → 3)=4(0)-2+2-0=0
\scriptstyleDF(6 → 4)=4(1)-2+0-2=0
\scriptstyleDF(7 → 3)=4(1)-2+2-3=1
\scriptstyleDF(8 → 4)=4(1)-2+0-1=1
\scriptstyle\{i,j\}
[2] In the above example, if we perform register minimization, we could reduce the number of register significantly.The technique for minimizing register is call lifetime analysis, which analyzes the time for when a data is produced and when a data finally s consumed.The time for producing a data is denoted
\scriptstyleTinput
\scriptstyleToutput
\scriptstyleTinput=u+PU
\scriptstyleU
\scriptstylePU
\scriptstyleu
\scriptstyleToutput
\scriptstyleU
\scriptstyleu+PU+maxV\{DF(U → V)\}
node | \scriptstyleTinput | \scriptstyleToutput | |
---|---|---|---|
1 | 4 | 9 | |
2 | 2 | 2 | |
3 | 3 | 3 | |
4 | 1 | 1 | |
5 | 2 | 2 | |
6 | 4 | 4 | |
7 | 5 | 6 | |
8 | 3 | 4 |
After allocating 2 delay element for storing the temporary data, we need to schedule data stored at which register.The following table shows the data stored in each register R1 and R2, such that the number of multiplexer could be minimized.
Finally, we could reconstruct the data path with fewer delay element and switching element in the folded design.