9.3 RNS Applications in DFT, FFT, DCT, DWT
[Figure 9.24 (continued), panel (b): RNS FIR paths for the moduli m1, m2, ..., mp process the two QRNS components X and X̂, with conversion blocks between TCS and QRNS at the input x(n) and the output y(n).]
Figure 9.24 (continued)
other types and have shown that the QRNS-based design consumes less power and needs less area.
Taylor [60] has described a single-modulus complex ALU using QRNS, as opposed to systems based on multi-moduli sets. He has used Gaussian primes of the form $2^k + 1$ (e.g. 5, 17, 257, 65,537) as well as composite moduli built from them (85 = 5 × 17, 1285 = 5 × 257, 4369 = 17 × 257, 21,845 = 5 × 17 × 257) as the modulus. Compared with multi-moduli RNS, a single-modulus ALU has the advantage of trivial magnitude scaling, sign detection and overflow detection.
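One reason moduli of the form $2^k + 1$ are convenient for QRNS is that, for even k, $2^k \equiv -1$, so $j = 2^{k/2}$ is a square root of −1 and multiplying by j is a shift. The following minimal sketch checks this for the primes quoted above and searches for a root modulo a composite modulus by brute force. The function names are illustrative assumptions and are not taken from [60].

```python
def sqrt_minus_one_fermat(k):
    """Return (p, j) with p = 2**k + 1 and j = 2**(k // 2), so that j*j = -1 (mod p)."""
    if k % 2:
        raise ValueError("k must be even")
    p = (1 << k) + 1
    j = 1 << (k // 2)
    assert (j * j + 1) % p == 0      # 2**k + 1 = p, hence j*j = -1 (mod p)
    return p, j


def sqrt_minus_one_composite(factors):
    """Brute-force search for the smallest j with j*j = -1 modulo the product of `factors`."""
    m = 1
    for f in factors:
        m *= f
    for j in range(2, m):
        if (j * j + 1) % m == 0:
            return m, j
    raise ValueError("no square root of -1 for this modulus")


for k in (2, 4, 8, 16):
    print(sqrt_minus_one_fermat(k))
print(sqrt_minus_one_composite([5, 17]))
```

The brute-force search is only meant for small moduli such as 85; a practical design would combine the per-prime roots with the CRT.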
9 Applications of RNS in Signal Processing
Cardarilli et al. [61] have suggested the implementation of polyphase filters using QRNS. The moduli set used was {13, 17, 29, 37, 41, 53, 61}, which gives a dynamic range of 34 bits. The architecture is shown in Figure 9.25a. The filter is divided into two structures for X and X̂ (both QRNS-coded outputs), plus the input and output conversion blocks. Each RNS path contained eight FIR filters, one for each of the eight channels, and an inverse discrete Fourier transform (IDFT) block. The 8-point IDFT, implemented by the DIF (decimation in frequency) algorithm, needed 12 butterflies in three stages. Each butterfly needed two LUTs for multiplications. The multiplications needed for the FIR filters were implemented using index calculus. The use of QRNS reduced each complex multiplication to just two real multiplications, as described before. The binary-to-QRNS and QRNS-to-binary converters were included in the hardware. The target application was a satellite digital video broadcasting (DVB) system, which required a 367 complex-tap filter realizing a Kaiser window with 0.02 dB in-band ripple and 43 dB out-of-band attenuation. The filters could have fixed coefficients or programmable 12-bit coefficients. Truncation after multiplication was also used; it was achieved by QRNS-to-binary conversion followed, after truncation, by binary-to-QRNS conversion so that the IDFT could be performed subsequently (see Figure 9.25b). A TCS version was also implemented in AMS 0.35 micron technology. The authors show that the area and power of the QRNS design are less than those of the TCS implementation.
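The dynamic ranges quoted for these QRNS moduli sets can be checked with a few lines of arithmetic. The sketch below also confirms that every modulus is a prime congruent to 1 modulo 4, which is the condition for −1 to be a quadratic residue. Counting the range as the largest b with $2^b \le M$ is an assumption about how the bit widths are quoted.

```python
from math import prod

QRNS_MODULI = [13, 17, 29, 37, 41, 53, 61]


def is_prime(n):
    """Trial-division primality test, adequate for small moduli."""
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def dynamic_range_bits(moduli):
    """Largest b such that 2**b does not exceed the product of the moduli."""
    return prod(moduli).bit_length() - 1


for count in (5, 6, 7):
    subset = QRNS_MODULI[:count]
    print(subset, prod(subset), dynamic_range_bits(subset))
```

With this convention the first five, six and seven moduli give 23, 28 and 34 bits respectively.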
Cardarilli et al. [62] have described the realization of a 128-channel polyphase filter derived from a 1024-tap prototype filter. It uses QRNS and the moduli set {13, 17, 29, 37, 41}. The block diagram is presented in Figure 9.26. The front end is a binary-to-QRNS converter. It is followed by a decimator and 128 QRNS-based 8-tap FIR filters. The FIR filters used a single QRNS MACC (complex multiply and accumulate) unit. The output dynamic range is 23 bits. Since truncation cannot be done directly on QRNS channels, the channel outputs are converted into conventional form and scaled to yield a 17-bit word. The 17-bit output is converted into QRNS again, with base extension to a dynamic range of 35 bits using the additional moduli 53 and 61, in the block denoted CTBE (conversion plus truncation and base extension). A seven-moduli RNS-based IDFT is realized using a serial architecture. One hundred and twenty-eight complex multipliers are used, and two adder trees accumulate the results for the real and imaginary parts. The authors have shown that, while the conventional complex two's complement system (CTCS) and QRNS-based designs occupy the same area, the power dissipation of the QRNS-based design is 50 % lower. Note that QRNS filters have higher latency due to the I/O conversions.
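The conversion, truncation and base extension step can be illustrated for one real component as follows. This is a minimal sketch under stated assumptions: the value is reconstructed with a plain CRT, the scaling is a hypothetical right shift by 6 bits (23 bits down to 17), and the function names are illustrative. It does not reproduce the CTBE hardware of [62].

```python
from math import prod

BASE = [13, 17, 29, 37, 41]          # dynamic range domain 1
EXTENDED = BASE + [53, 61]           # dynamic range domain 2


def to_rns(value, moduli):
    """Residues of a (possibly negative) integer."""
    return [value % m for m in moduli]


def crt_signed(residues, moduli):
    """Chinese remainder theorem, mapped to the symmetric range around zero."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # pow(x, -1, m) is the modular inverse
    total %= M
    return total - M if total > M // 2 else total


def ctbe(residues, shift=6):
    """Convert to binary, drop `shift` low-order bits, re-encode on the extended set."""
    value = crt_signed(residues, BASE)
    scaled = value >> shift                # floor division by 2**shift
    return to_rns(scaled, EXTENDED)


sample = -1234567
print(crt_signed(ctbe(to_rns(sample, BASE)), EXTENDED))
```

Because the scaled word is re-encoded on all seven moduli, later stages such as the IDFT can accumulate in the larger range.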
D'Amora et al. [63] have used similar techniques for realizing complex digital filters using QRNS and have demonstrated that, for the same throughput rate, the power dissipation is one-third and the area is half of that of CRNS (complex RNS) filters, whereas the latency is higher. Stouraitis and Paliouras [64] have suggested the use of QRNS for low-power designs.
[Figure 9.25a: the input x(n) passes through a binary-to-QRNS converter into the two QRNS paths X and X̂, each made up of RNS paths mod m1, m2, ..., mP, and a QRNS-to-binary converter delivers the outputs y0(n) to y7(n). Each RNS path mod mi contains decimators (↓M), FIR filters E0 to E7 and an IDFT, with input rate fc and output rate fc/8.]
Figure 9.25 (a) Polyphase filter bank and (b) filter with truncated dynamic range (adapted from [61] ©IEEE 2004)
[Figure 9.25b: X(n) at rate fc enters a binary-to-QRNS converter and the filter banks (dynamic range 23 bits, mi = {13, 17, 29, 37, 41}), followed by QRNS-to-binary conversion, truncation (15 bits), binary-to-QRNS conversion, the IDFT (dynamic range 28 bits, mi = {13, 17, 29, 37, 41, 53}) and QRNS-to-binary conversion, giving y0(n) to y7(n) at rate fc/8.]
DCT Computation
Fernandez et al. [65] have suggested a straightforward implementation of an 8-point RNS-based 1-D DCT using the 5-bit moduli {32, 31, 29, 27}. The data are 8 bits wide and the eight coefficients are each 9 bits wide. A typical modulus channel is illustrated in Figure 9.27. The input data is multiplied by the fixed coefficient, and a modular accumulator accumulates all the products. The modular adder uses a two-stage pipeline scheme. The implementation was on an Altera FLEX10K device and has higher throughput than a distributed arithmetic processor.
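The multiply-and-accumulate channel described above can be sketched in software as follows. The signed 8-bit samples, the quantisation of the cosines to signed 9-bit integers and the omission of the overall normalisation factor are assumptions made for illustration; the look-up table of the hardware is replaced by an explicit modular product.

```python
import math
from math import prod

MODULI = [32, 31, 29, 27]


def coefficients(u, n_points=8, bits=9):
    """Hypothetical quantisation: cosines scaled to signed `bits`-bit integers."""
    scale = (1 << (bits - 1)) - 1
    return [round(scale * math.cos((2 * i + 1) * u * math.pi / (2 * n_points)))
            for i in range(n_points)]


def channel(samples, coeffs, m):
    """One modulus channel: multiply each sample by its coefficient, accumulate mod m."""
    acc = 0
    for x, c in zip(samples, coeffs):
        product = ((x % m) * (c % m)) % m    # a LUT in hardware
        acc = (acc + product) % m            # modular accumulator
    return acc


def crt_signed(residues, moduli):
    """CRT reconstruction mapped to the symmetric range around zero."""
    M = prod(moduli)
    total = sum(r * (M // m) * pow(M // m, -1, m)
                for r, m in zip(residues, moduli)) % M
    return total - M if total > M // 2 else total


def dct_point(samples, u):
    """Unnormalised transform point u, computed channel by channel and decoded."""
    coeffs = coefficients(u)
    return crt_signed([channel(samples, coeffs, m) for m in MODULI], MODULI)
```

With these assumptions the largest possible magnitude of a transform point is 8 × 128 × 255 = 261,120, which fits within half of the product 32 × 31 × 29 × 27 = 776,736, so the decoded value is exact.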
Several authors [66–69] have described DCT computation using QRNS. The N-point 1-D DCT [67] of a sequence {x(0), x(1), ..., x(N − 1)} is given by
\[
X(m) = \sqrt{\frac{2}{N}}\, K_m \sum_{n=0}^{N-1} x(n) \cos\!\left(\frac{m(2n+1)\pi}{2N}\right), \qquad m = 0, 1, \ldots, N-1 \tag{9.16}
\]
where $K_0 = 1/\sqrt{2}$ and $K_1 = K_2 = \cdots = K_{N-1} = 1$.
[Figure 9.26: block diagram comprising a binary-to-QRNS converter, decimator, QRNS FIR bank (FIR1 to FIR128), CTBE block and a serial IDFT built from twiddle ROMs (W), a shift register, a multiplexer, ISOMULT arrays and adder trees. Dynamic range domain 1 uses mi = {13, 17, 29, 37, 41}; dynamic range domain 2 uses mi = {13, 17, 29, 37, 41, 53, 61}.]
Figure 9.26 QRNS polyphase filter architecture (adapted from [62] ©IEEE 2010)
[Figure 9.27: the input x(i) mod mi and a coefficient counter (0:7) address a LUT that produces the products, which are accumulated by a modulo mi adder to give y(u) mod mi.]
Figure 9.27 Modulo mi channel for one transform point of an RNS-based 1-D DCT processor (adapted from [65] ©IEEE 1999)
Ramirez et al. [66] use the fact that the N-point DCT can be computed as the real part of a 2N-point DFT scaled by a complex exponential constant, as follows:
\[
X(m) = \sqrt{\frac{2}{N}}\, K_m \operatorname{Re}\left\{ e^{-j\frac{m\pi}{2N}} \sum_{n=0}^{2N-1} x(n)\, W_{2N}^{mn} \right\} \tag{9.17}
\]
where $W_{2N} = e^{-j\frac{2\pi}{2N}}$ and $x(n) = 0$ for $n = N, N+1, \ldots, 2N-1$.
Initially, the N-point input sequence {x(0), x(1), x(2), ..., x(N − 1)} is reordered into the sequence {y(0), y(1), y(2), ..., y(N − 1)} defined by
\[
y(n) = x(2n), \qquad y(N-n-1) = x(2n+1), \qquad n = 0, 1, \ldots, \frac{N}{2} - 1 \tag{9.18}
\]
Let {Y(0), Y(1), Y(2), ..., Y(N − 1)} be the DFT of the sequence {y(0), y(1), y(2), ..., y(N − 1)}. The DCT sequence {X(0), X(1), X(2), ..., X(N − 1)} of the original sequence can be obtained through the real part of Z(n) [71], defined as
\[
Z(n) = H_n Y(n) = \sqrt{\frac{2}{N}}\, K_n W_{4N}^{n}\, Y(n) \tag{9.19}
\]
where $W_{4N} = e^{-j\frac{2\pi}{4N}}$. By using the property $Z(N-n) = -jZ^{*}(n)$, which implies $\operatorname{Re}[Z(N-n)] = -\operatorname{Im}[Z(n)]$, it is necessary to compute only N/2 + 1 values of Z(n), viz., Z(0), Z(1), ..., Z(N/4), Z(N/2), Z(N/2 + 1), ..., Z(3N/4 − 1). The N-point DCT sequence is given by {Re[Z(0)], Re[Z(1)], ..., Re[Z(N/4)], −Im[Z(3N/4 − 1)], −Im[Z(3N/4 − 2)], ..., −Im[Z(N/2 + 1)], Re[Z(N/2)], Re[Z(N/2 + 1)], ..., Re[Z(3N/4 − 1)], −Im[Z(N/4)], −Im[Z(N/4 − 1)], ..., −Im[Z(1)]}.
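The reordering of (9.18), the scaling of (9.19) and the symmetry property can be checked numerically against the defining sum (9.16). The sketch below uses floating-point arithmetic and a direct O(N²) DFT purely for verification; it assumes N is a multiple of 4, and the helper names are illustrative rather than taken from [66].

```python
import cmath
import math


def dct_direct(x):
    """N-point DCT from the defining sum (9.16)."""
    N = len(x)
    out = []
    for m in range(N):
        K = 1 / math.sqrt(2) if m == 0 else 1.0
        s = sum(x[n] * math.cos(m * (2 * n + 1) * math.pi / (2 * N)) for n in range(N))
        out.append(math.sqrt(2 / N) * K * s)
    return out


def dft(y):
    """Plain N-point DFT, used here only as a reference."""
    N = len(y)
    return [sum(y[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]


def needed_indices(N):
    """Indices 0..N/4 and N/2..3N/4-1: the N/2 + 1 values of Z(n) that are computed."""
    return list(range(N // 4 + 1)) + list(range(N // 2, 3 * N // 4))


def dct_via_dft(x):
    N = len(x)
    y = [0.0] * N
    for n in range(N // 2):                  # reordering (9.18)
        y[n] = x[2 * n]
        y[N - n - 1] = x[2 * n + 1]
    Y = dft(y)
    X = [0.0] * N
    for n in needed_indices(N):
        K = 1 / math.sqrt(2) if n == 0 else 1.0
        z = math.sqrt(2 / N) * K * cmath.exp(-2j * math.pi * n / (4 * N)) * Y[n]  # (9.19)
        X[n] = z.real
        if n != 0 and n != N // 2:
            X[N - n] = -z.imag               # Re[Z(N - n)] = -Im[Z(n)]
    return X
```

For N = 8 the computed indices are 0, 1, 2, 4 and 5, and the remaining DCT outputs X(7), X(6) and X(3) follow from the imaginary parts of Z(1), Z(2) and Z(5).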
The fast algorithms known for the DFT can be used for fast computation of the DCT. A QRNS butterfly for the computation of a DIF radix-2 DFT is shown in Figure 9.28a. Note that since the input sequence is real, each QRNS adder is one modular adder. A butterfly needs a QRNS adder (two modular adders), a QRNS subtractor (two modular subtractors) and a QRNS multiplier (two modular multipliers). The moduli set used is {221, 229, 233, 241}. The multiplier uses isomorphic mapping with the roots {47, 107, 89, 177}, respectively. The 8-point QRNS DCT computation is shown in Figure 9.28b. Note that only five outputs, Z(0), Z(1), Z(2), Z(4) and Z(5), are required for the DCT computation.
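The isomorphic mapping behind the QRNS multiplier can be illustrated with the moduli and roots quoted above. Each root j satisfies $j^2 \equiv -1$ modulo its modulus, so a complex residue a + ib maps to the pair (a + jb, a − jb) and a complex product becomes two independent modular products. The function names in this minimal sketch are illustrative assumptions.

```python
MODULI = [221, 229, 233, 241]
ROOTS = [47, 107, 89, 177]       # j with j*j = -1 (mod m), as quoted in the text


def to_qrns(a, b, m, j):
    """Map a + ib to the pair (a + j*b, a - j*b) modulo m."""
    return ((a + j * b) % m, (a - j * b) % m)


def qrns_mul(p, q, m):
    """Complex product in QRNS: two independent modular multiplications."""
    return ((p[0] * q[0]) % m, (p[1] * q[1]) % m)


def from_qrns(p, m, j):
    """Inverse mapping back to (real, imaginary) residues modulo m."""
    inv2 = pow(2, -1, m)
    inv2j = pow(2 * j, -1, m)
    return (((p[0] + p[1]) * inv2) % m, ((p[0] - p[1]) * inv2j) % m)


# (3 + 4i)(5 - 2i) = 23 + 14i, checked in every modulus channel
for m, j in zip(MODULI, ROOTS):
    product = qrns_mul(to_qrns(3, 4, m, j), to_qrns(5, -2, m, j), m)
    print(m, from_qrns(product, m, j))
```

Addition and subtraction act componentwise in the same way, which is why a QRNS butterfly needs only modular adders, subtractors and two modular multipliers.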
Fernandez et al. [69] have presented an RNS architecture for the computation of a scaled 2-D DCT on field programmable logic (FPL). An eight-pixel 1-D DCT is implemented as shown in Figure 9.29a. The 2-D DCT is computed as
\[
X(u,v) = \frac{2\,e(u)\,e(v)}{N} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i,j) \cos\!\left(\frac{u(2i+1)\pi}{2N}\right) \cos\!\left(\frac{v(2j+1)\pi}{2N}\right), \qquad u, v = 0, 1, \ldots, N-1 \tag{9.20}
\]
where x(i, j) is an N × N matrix of pixels and X(u, v) is the corresponding transformed matrix. Since the 2-D DCT is a separable transform, the row-column decomposition [70] can be used. An N × N 2-D DCT can be performed by first computing N 1-D DCTs on the rows and then N 1-D DCTs on the columns. The use of a transposition structure containing an 8 × 8 matrix of registers and multiplexers allows the transposition of the parallel input data.
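The row-column decomposition can be verified against the double sum of (9.20) with a short floating-point sketch. The normalisation e(0) = 1/√2 and e(k) = 1 otherwise is an assumption made here by analogy with the constants K_m of (9.16), since the text does not define e(·); the function names are illustrative.

```python
import math


def e(k):
    """Assumed normalisation: 1/sqrt(2) for k = 0, otherwise 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct_1d(v):
    """1-D DCT with the sqrt(2/N) * e(u) scaling."""
    N = len(v)
    return [math.sqrt(2 / N) * e(u) *
            sum(v[i] * math.cos((2 * i + 1) * u * math.pi / (2 * N)) for i in range(N))
            for u in range(N)]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct_2d_row_column(x):
    rows = [dct_1d(row) for row in x]                  # N 1-D DCTs on the rows
    cols = [dct_1d(col) for col in transpose(rows)]    # N 1-D DCTs on the columns
    return transpose(cols)


def dct_2d_direct(x):
    """Double sum of (9.20), used only as a reference."""
    N = len(x)
    return [[2 * e(u) * e(v) / N *
             sum(x[i][j]
                 * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                 * math.cos((2 * j + 1) * v * math.pi / (2 * N))
                 for i in range(N) for j in range(N))
             for v in range(N)] for u in range(N)]
```

The two passes each contribute a factor of $\sqrt{2/N}$, which together give the 2/N factor in front of the double sum.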
[Figure 9.28: (a) butterfly built from modulo m adders, subtractors and multipliers; (b) 8-point flow graph with inputs y(0) to y(7), twiddle factors W8^0 to W8^3, multipliers H0, H4, H2, H1 and H5, and outputs Z(0), Z(4), Z(2), Z(1) and Z(5).]
Figure 9.28 (a) QRNS butterfly for a radix-2 FFT and (b) pipelined QRNS DCT implementation (adapted from [66] ©IEEE 2000)
[Figure 9.29: (a) flow graph from inputs x(0) to x(7) to outputs X(0), X(4), X(2), X(6), X(5), X(1), X(7) and X(3) with multipliers k1 to k5; (b) pipelined channel of modulo mi adders and subtractors, delay registers (D) and LUT multipliers ×k1 to ×k5, ×e1 and ×e2.]
Figure 9.29 (a) Flow graph for fast computation of DCT and (b) modulo mi channel of 1-D DCT (adapted from [69] ©IEEE 2000)
Using an algorithm due to Arai et al. [72] (see Figure 9.29a), the 8-pixel 1-D DCT can be realized as shown in Figure 9.29b for one modulus channel, which needs only five multiplications. Note that e1 and e2 are power-of-two scaling factors. The coefficients are k1 = C4, k2 = C6 − C2, k3 = C4, k4 = C6 + C2 and k5 = C6, where $C_q = \cos(q\pi/16)$. The 1-D DCT can be designed to have a single multiplication per stage. Multiplication by the DCT coefficients is by ROM look-ups. In order to obtain the exact value of the DCT, each output needs an additional multiplication, which can be taken care of in the next stage. The hardware consists of adders, registers and LUTs. The moduli set used was {256, 255, 253, 251}. The output conversion to binary is performed using the ε-CRT converter of Griffin, Taylor and Sousa [23] described earlier. The cosine coefficients are 7 bits wide, the signal samples are 8 bits wide and a DR of 32 bits could be achieved.
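Multiplication by a fixed DCT coefficient through a ROM look-up can be sketched as a table indexed by the input residue. In the sketch below the fourth coefficient, C6 + C2, is called k4 to match the figure labels, and the fixed-point scaling by 2^6 is a hypothetical choice made for illustration, not the quantisation used in [69].

```python
import math

MODULI = [256, 255, 253, 251]


def c(q):
    """C_q = cos(q * pi / 16)."""
    return math.cos(q * math.pi / 16)


# Coefficient definitions from the text; the integer scaling below is an assumption.
K = {"k1": c(4), "k2": c(6) - c(2), "k3": c(4), "k4": c(6) + c(2), "k5": c(6)}


def quantise(value, frac_bits=6):
    """Hypothetical fixed-point scaling: round(value * 2**frac_bits)."""
    return round(value * (1 << frac_bits))


def build_lut(k_int, m):
    """ROM contents for 'multiply by k_int modulo m', indexed by the input residue."""
    return [(r * k_int) % m for r in range(m)]


luts = {name: {m: build_lut(quantise(v), m) for m in MODULI} for name, v in K.items()}

# Looking up the residue of a sample gives the residue of the scaled product.
sample = -37
print([luts["k2"][m][sample % m] for m in MODULI])
```

Negative coefficients such as k2 need no special handling, because the table stores the product already reduced modulo mi.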
[Figure 9.30: circular shift register path with LOAD/RUN control taking the input in permuted order, GF(p²) exponents ex(3), ex(4), ex(2), ex(1), multipliers W(3), W(4), W(2), W(1), GF(p²)-to-RNS converters and an adder producing X(k) = [X(3), X(4), X(2), X(1)].]
Figure 9.30 Five-point RNS prime factor DFT implementation (adapted from [73] ©IEEE 1990)
Taylor [73] has described a DFT implementation using RNS, in which a circular shift register stores the data and multipliers are followed by adders (see Figure 9.30). The structure of this five-point prime factor DFT is akin to that of FIR filters. Gaussian primes of the form $2^n + 1$ for n = 2, 4, 8, 16 or 32 have been suggested for single-modulus or multi-moduli RNS architectures. The CRT needs to be used to convert into normal integers, and scaling can then be performed. This implementation has employed CRNS as well as QRNS and index calculus-based multipliers. In order to overcome the overflow problem of QRNS-based FFT implementations, a prime factor transform (PFT) [74] has been suggested. This reduces the dynamic range from $NQ^4$ needed for the FFT to $NQ^2$, where N is the number of points and Q is the word size of the inputs and coefficients. Taylor has suggested CRT computation using distributed arithmetic.
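An index calculus-based multiplier replaces a modular multiplication by an addition of indices (discrete logarithms) modulo p − 1, with zero operands handled separately. The following minimal sketch builds the two look-up tables for a small prime; the brute-force generator search and the function names are illustrative assumptions, not the design of [73].

```python
def primitive_root(p):
    """Smallest generator of the multiplicative group modulo a prime p (brute force)."""
    for g in range(2, p):
        x, seen = 1, set()
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:
            return g
    raise ValueError("no primitive root found")


def build_tables(p):
    """Return (index, power): residue -> exponent and exponent -> residue."""
    g = primitive_root(p)
    index = {}
    power = [1] * (p - 1)
    x = 1
    for exponent in range(p - 1):
        power[exponent] = x
        index[x] = exponent
        x = (x * g) % p
    return index, power


def index_mul(a, b, p, index, power):
    """Multiply via addition of indices modulo p - 1; zero is handled separately."""
    if a % p == 0 or b % p == 0:
        return 0
    return power[(index[a % p] + index[b % p]) % (p - 1)]


index17, power17 = build_tables(17)
print(index_mul(5, 7, 17, index17, power17))
```

In hardware the two tables become ROMs and the index addition becomes a modulo p − 1 adder.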
Tseng et al. [75] have described FFT implementation using RNS and considered the effect of quantization noise. They have shown that radix-4 is the largest radix without internal multiplications in the r-point DFT. The twiddle factors are 0, ±1, ±j in the case of radix-2 or radix-4. The basic calculation of radix-4 decimation-in-time (DIT) is shown in Figure 9.31. The magnitudes of the numbers at subsequent stages increase very rapidly due to the cascaded integer multiplications. Since RNS cannot accommodate this dynamic range, all internal numbers must be scaled by scaling factors chosen a priori to prevent overflow. Tseng et al. [74] have analyzed the scaling techniques for preventing overflow by performing an error analysis covering A/D conversion, scaling factors and twiddle factors. Scaling can also be applied only after every few stages, and the number of scaling stages can be chosen suitably.
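That a 4-point DFT needs no internal multiplications can be seen from a small sketch in which complex integers are held as (re, im) pairs and multiplication by −j is a swap with a sign change. The inter-stage twiddle multiplications and the scaling discussed above are deliberately left out, and the function names are illustrative assumptions.

```python
def mul_minus_j(z):
    """(re, im) * (-j) = (im, -re): a swap and a sign change, no multiplier."""
    re, im = z
    return (im, -re)


def add(a, b):
    return (a[0] + b[0], a[1] + b[1])


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])


def dft4(x0, x1, x2, x3):
    """4-point DFT of integer pairs (re, im) using only additions and subtractions."""
    s02, d02 = add(x0, x2), sub(x0, x2)
    s13, d13 = add(x1, x3), sub(x1, x3)
    jd13 = mul_minus_j(d13)          # -j * (x1 - x3)
    return (add(s02, s13),           # X(0) = x0 + x1 + x2 + x3
            add(d02, jd13),          # X(1) = x0 - j*x1 - x2 + j*x3
            sub(s02, s13),           # X(2) = x0 - x1 + x2 - x3
            sub(d02, jd13))          # X(3) = x0 + j*x1 - x2 - j*x3


print(dft4((1, 2), (3, -4), (-5, 6), (7, 8)))
```

Each output is a sum of four inputs, so its magnitude can be up to four times that of the inputs, which is one source of the growth that the scaling has to contain.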
[Figure 9.31: flow graph taking the real and imaginary parts of xi(0) to xi(3) and of the twiddle factors WN^t, WN^2t and WN^3t to the real and imaginary parts of xi+1(0) to xi+1(3) through a network of adders.]
Figure 9.31 Radix-4 DIT (decimation in time) basic calculation (adapted from [75] ©IEEE 1979)
Taylor et al. [76] have presented a radix-4 FFT using complex RNS arithmetic. In this technique, the complex multipliers needed in the conventional implementation are replaced by QRNS multipliers, thus reducing the hardware. A radix-4 complex RNS (CRNS) butterfly is presented in Figure 9.32a together with the QRNS butterfly in Figure 9.32b. The CRNS butterfly needs 12 real multiplications at level 1, 6 real add/subtract operations at level 2, 8 real add/subtract operations at level 3 and 8 real add/subtract operations at level 4. The QRNS-based design, on the other hand, needs only 6 real multiplications at level 1, 8 real add/subtract operations and 2 multiplications at level 2, and 8 real add/subtract operations at level 3.
Jullien et al. [77] have described a systolic quadratic residue DFT with fault tolerance. In this design, each systolic array cell uses a 16 × 6 ROM in place of a 16 × 4 ROM. The two additional bits hold the parity of the output content of the ROM and the parity of the input address bits. In normal operation, the address parity of a cell must equal the content parity of the previous cell.
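The parity check can be modelled in a few lines. The sketch assumes a simplified chain in which the 4-bit data output of one cell is used directly as the address of the next cell; the ROM contents, the fault model and the function names are hypothetical and are not taken from [77].

```python
def parity(value):
    """Parity bit: 1 if the number of set bits is odd."""
    return bin(value).count("1") & 1


def make_rom(func):
    """16-word ROM; each word holds 4 data bits, a content-parity bit and an address-parity bit."""
    rom = []
    for addr in range(16):
        data = func(addr) & 0xF
        rom.append((data, parity(data), parity(addr)))
    return rom


def run_chain(roms, first_addr, fault_at=None):
    """Pass data from cell to cell; return the index of the cell that detects a mismatch."""
    addr = first_addr
    prev_content_parity = None
    for k, rom in enumerate(roms):
        data, content_parity, addr_parity = rom[addr]
        if prev_content_parity is not None and addr_parity != prev_content_parity:
            return k                 # address parity differs from previous content parity
        prev_content_parity = content_parity
        addr = data
        if fault_at == k:
            addr ^= 0b0001           # single-bit fault on the link to the next cell
    return None


# Hypothetical ROM contents, identical in all four cells
roms = [make_rom(lambda a: (3 * a + 1) % 16) for _ in range(4)]
print(run_chain(roms, 5), run_chain(roms, 5, fault_at=1))
```

A single-bit error on the link between two cells changes the address parity seen by the next cell, so the mismatch is flagged there; an even number of flipped bits would go undetected by this simple check.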
The general form of Number Theoretic Transform (NTT) [78, 79] is described
by the transform pair