
9.3 RNS Applications in DFT, FFT, DCT, DWT



Figure 9.24 (continued)

[Panel (b) block labels: x(n), CONVERSION TCS, CONVERSION QRNS, RNS FIR, moduli paths m1, m2, ..., mp for X and X̂, outputs Y and y(n).]



other types and have shown that QRNS-based design consumes less power and

needs less area.

Taylor [60] has described a single-modulus complex ALU using QRNS, as opposed to multi-moduli systems. He has used Gaussian primes of the type 2^k + 1 (e.g. 5, 17, 257, 65,537) as well as composites of such primes (85 = 5 × 17, 1285 = 5 × 257, 4369 = 17 × 257, 21,845 = 5 × 17 × 257) as the modulus. Compared with multi-moduli RNS, a single-modulus ALU has the advantage of trivial magnitude scaling, sign detection and overflow detection.
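The single-modulus QRNS idea can be illustrated with a short Python sketch. It assumes the modulus 257 (one of the primes listed above) and the root 16, chosen here because 16² = 256 ≡ −1 (mod 257); the function names are hypothetical and only illustrate the mapping, not the ALU design in [60].

```python
# Minimal QRNS sketch for a single modulus (assumed: m = 257, j_hat = 16).
M = 257
J_HAT = 16  # assumption for illustration: 16 * 16 = 256 = -1 (mod 257)
assert (J_HAT * J_HAT) % M == M - 1


def qrns_encode(a, b, m=M, j_hat=J_HAT):
    """Map the complex integer a + jb to the pair (z, z*) modulo m."""
    return ((a + j_hat * b) % m, (a - j_hat * b) % m)


def qrns_decode(z, z_star, m=M, j_hat=J_HAT):
    """Recover the residues of the real and imaginary parts from (z, z*)."""
    a = ((z + z_star) * pow(2, -1, m)) % m
    b = ((z - z_star) * pow(2 * j_hat, -1, m)) % m
    return a, b


def qrns_mul(p, q, m=M):
    """Complex multiplication becomes two independent modular multiplications."""
    return ((p[0] * q[0]) % m, (p[1] * q[1]) % m)


x = qrns_encode(3, 4)   # 3 + 4j
y = qrns_encode(5, 6)   # 5 + 6j
product = qrns_decode(*qrns_mul(x, y))
# (3 + 4j)(5 + 6j) = -9 + 38j, and -9 mod 257 = 248
print(product)  # (248, 38)
```

Negative results appear as residues (here −9 shows up as 248), so a signed interpretation has to be applied after decoding.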






9 Applications of RNS in Signal Processing



Cardarilli et al. [61] have suggested the implementation of polyphase filters using QRNS. The moduli set {13, 17, 29, 37, 41, 53, 61} was used to realize a dynamic range of 34 bits. The architecture is shown in Figure 9.25a. The filter is divided into two structures for X and X̂ (both QRNS-coded outputs) plus the input and output conversion blocks. Each RNS path had eight FIR filters, to cater for eight channels, and an Inverse Discrete Fourier Transform (IDFT) block. The 8-point IDFT, implemented by the DIF (decimation in frequency) algorithm, needed 12 butterflies in three stages. Each butterfly needed two LUTs for multiplications. The multiplications needed for the FIR filters were implemented using index calculus. The use of QRNS reduced each complex multiplication to just two real multiplications, as described before. The binary to QRNS and QRNS to binary converters were included in the hardware. The target application was a satellite Digital Video Broadcasting (DVB) system, which required a 367 complex-tap filter for realizing a Kaiser window with 0.02 dB in-band ripple and 43 dB out-of-band attenuation. The filters could have fixed coefficients or programmable 12-bit coefficients. Truncation after multiplication was also used. A TCS version was also implemented in AMS 0.35 µm technology. The truncation was achieved by QRNS to binary conversion followed, after truncation, by binary to QRNS conversion so that the IDFT could then be performed (see Figure 9.25b). The authors show that the area and power are less than those of the TCS implementation.

Cardarilli et al. [62] have described the realization of a 128-channel polyphase filter derived from a 1024-tap prototype filter. It uses QRNS and the moduli set {13, 17, 29, 37, 41}. The block diagram is presented in Figure 9.26. The front end is a binary to QRNS converter. It is followed by a decimator and 128 QRNS-based 8-tap FIR filters. The FIR filters used a single QRNS MACC (complex multiply and accumulate) unit. The output dynamic range is 23 bits. Since truncation cannot be done directly on the QRNS channels, the channel outputs are converted into conventional form and scaled to yield a 17-bit word. The 17-bit output is converted into QRNS again, with base extension to a dynamic range of 35 bits using the additional moduli 53 and 61, in the block denoted CTBE (conversion plus truncation and base extension). A seven-moduli RNS-based IDFT is realized using a serial architecture. One hundred and twenty-eight complex multipliers are used, and two adder trees accumulate the results for the real and imaginary parts. The authors have shown that while the conventional complex two's complement system (CTCS) and the QRNS-based design occupy the same area, the power dissipation of the QRNS-based design is 50% lower. Note that QRNS filters have higher latency due to the I/O conversions.
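The dynamic ranges quoted for these moduli sets can be checked with a few lines of Python. This is only a sanity-check sketch: the helper names are hypothetical, and the test `p % 4 == 1` is the standard condition for −1 to have a square root modulo an odd prime p, which is what a QRNS modulus requires.

```python
from math import log2, prod


def dynamic_range_bits(moduli):
    """Return log2 of the product of the moduli (dynamic range in bits)."""
    return log2(prod(moduli))


def supports_qrns(p):
    """For an odd prime p, -1 is a quadratic residue exactly when p % 4 == 1."""
    return p % 4 == 1


FILTER_MODULI = [13, 17, 29, 37, 41]        # five-moduli set
IDFT_MODULI = FILTER_MODULI + [53, 61]      # after base extension

for name, moduli in (("filter bank", FILTER_MODULI), ("IDFT", IDFT_MODULI)):
    assert all(supports_qrns(p) for p in moduli)
    print(f"{name}: M = {prod(moduli)}, about {dynamic_range_bits(moduli):.2f} bits")
```

The five-moduli product comes to roughly 23.2 bits and the seven-moduli product to roughly 34.9 bits, which is consistent with the 23-bit figure and with the 34 to 35 bits quoted for the extended set.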

D'Amora et al. [63] have used similar techniques to realize complex digital filters using QRNS and have demonstrated that, for the same throughput rate, the power dissipation is one-third and the area is half of that of CRNS (complex RNS) filters, although the latency is higher. Stouraitis and Paliouras [64] have suggested the use of QRNS for low-power designs.



[Figure 9.25a block labels: input x(n); binary to QRNS converter; QRNS paths X and X̂, each containing RNS paths mod m1, m2, ..., mP; QRNS to binary converter; outputs y0(n) to y7(n). Each RNS path mod mi contains decimators ↓M, FIR filters E0, E1, ..., E7 and an IDFT, with input rate fc and output rate fc/8.]

Figure 9.25 (a) Polyphase filter bank and (b) filter with truncated dynamic range (adapted from [61] © IEEE 2004)



[Figure 9.25b block labels: X(n) at rate fc; binary to QRNS; filter banks; QRNS to binary; truncation (15 bits); binary to QRNS; IDFT; QRNS to binary; outputs y0(n) to y7(n) at rate fc/8. Annotations: dynamic range 23 bits with mi = (13, 17, 29, 37, 41); dynamic range 28 bits with mi = (13, 17, 29, 37, 41, 53).]

Figure 9.25 (continued)



DCT Computation

Fernandez et al. [65] have suggested a straightforward implementation of an 8-point RNS-based 1D-DCT using the 5-bit moduli {32, 31, 29, 27}. The data is 8 bits wide and each of the eight coefficients is 9 bits wide. A typical modulus channel is illustrated in Figure 9.27. The input data is multiplied by the fixed coefficient, and a modular accumulator accumulates all the products. The modular adder used a two-stage pipeline scheme. The implementation was on an Altera Flex10K device and has higher throughput than a distributed arithmetic processor.
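The multiply-accumulate channel described above can be sketched in Python as follows. The moduli are those quoted in the text; everything else is an assumption made for illustration: signed 8-bit samples, coefficients scaled by 512 and rounded so that they fit in 9 signed bits, and a plain CRT for the final conversion. It is not the hardware design of [65].

```python
import math

MODULI = (32, 31, 29, 27)          # pairwise coprime 5-bit moduli from the text
M = math.prod(MODULI)              # 776736


def dct_coeff(m, n, n_pts=8, scale=512):
    """Integer DCT coefficient (assumed scaling: 512, rounded to nearest)."""
    k_m = 1 / math.sqrt(2) if m == 0 else 1.0
    value = math.sqrt(2 / n_pts) * k_m * math.cos(m * (2 * n + 1) * math.pi / (2 * n_pts))
    return round(scale * value)


def channel_mac(samples, coeffs, mod):
    """One modulus channel: multiply by the coefficient, accumulate modulo mod."""
    acc = 0
    for x, c in zip(samples, coeffs):
        acc = (acc + (x % mod) * (c % mod)) % mod
    return acc


def crt_signed(residues):
    """Chinese remainder reconstruction, mapped to the signed range around 0."""
    total = 0
    for r, m_i in zip(residues, MODULI):
        big_m = M // m_i
        total += r * big_m * pow(big_m, -1, m_i)
    total %= M
    return total - M if total > M // 2 else total


def rns_dct_point(samples, m):
    """Compute one scaled DCT output point through the four RNS channels."""
    coeffs = [dct_coeff(m, n) for n in range(8)]
    return crt_signed([channel_mac(samples, coeffs, mod) for mod in MODULI])


samples = [12, -7, 100, -128, 127, 0, 55, -33]
print([rns_dct_point(samples, m) for m in range(8)])
```

With these assumed word lengths the largest possible accumulated magnitude stays below M/2, so the signed CRT result equals the ordinary integer sum of products.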

Several authors [66–69] have described DCT computation using QRNS. The N-point 1D DCT [67] of a sequence {x(0), x(1), ..., x(N − 1)} is given by

$$X(m) = \sqrt{\frac{2}{N}}\, K_m \sum_{n=0}^{N-1} x(n) \cos\left(\frac{m(2n+1)\pi}{2N}\right), \qquad m = 0, 1, \ldots, N-1 \tag{9.16}$$

and $K_0 = \frac{1}{\sqrt{2}}$, $K_1 = K_2 = \cdots = K_{N-1} = 1$.

[Figure 9.26 block labels: BIN to QRNS converter, DECIMATOR, SHIFT REGISTER, QRNS FIR BANK (QRNS-S-FIR-1 to QRNS-S-FIR-128), MUX, CTBE, IDFT Serial with ROM TWIDDLE (W), ISOMULT ARRAY and ADDERS TREE blocks. Dynamic Range Domain 1: mi = {13, 17, 29, 37, 41}. Dynamic Range Domain 2: mi = {13, 17, 29, 37, 41, 53, 61}.]

Figure 9.26 QRNS polyphase filter architecture (adapted from [62] © IEEE 2010)

[Figure 9.27 block labels: input x(i) mod mi, coefficient counter (0:7), LUT, products, modular adder |+|mi, output y(u) mod mi.]

Figure 9.27 Modulo mi channel for one transform point of an RNS-based 1D-DCT processor (adapted from [65] © IEEE 1999)



Ramirez et al. [66] use the fact that an N-point DCT can be computed as the real part of a 2N-point DFT scaled by a complex exponential constant, as follows:

$$X(m) = \sqrt{\frac{2}{N}}\, K_m \,\mathrm{Re}\left\{ e^{-j\frac{m\pi}{2N}} \sum_{n=0}^{2N-1} x(n)\, W_{2N}^{mn} \right\} \tag{9.17}$$

where $W_{2N} = e^{-j\frac{2\pi}{2N}}$ and $x(n) = 0$ for $n = N, N+1, \ldots, 2N-1$.



Initially, the N-point input sequence {x(0), x(1), x(2), ..., x(N − 1)} is reordered into the sequence {y(0), y(1), y(2), ..., y(N − 1)} defined by

$$y(n) = x(2n), \quad y(N-n-1) = x(2n+1), \qquad n = 0, 1, \ldots, \frac{N}{2}-1 \tag{9.18}$$



Let {Y(0), Y(1), Y(2), ..., Y(N − 1)} be the DFT of the sequence {y(0), y(1), y(2), ..., y(N − 1)}. The DCT sequence {X(0), X(1), X(2), ..., X(N − 1)} of the original sequence can be obtained as the real part of Z(n) [71], defined as

$$Z(n) = H_n Y(n) = \sqrt{\frac{2}{N}}\, K_n W_{4N}^{n}\, Y(n) \tag{9.19}$$

where $W_{4N} = e^{-j\frac{2\pi}{4N}}$. By using the property $Z(N-n) = -jZ^{*}(n)$, so that $\mathrm{Re}[Z(N-n)] = -\mathrm{Im}[Z(n)]$, it is necessary to compute only $\frac{N}{2}+1$ values of Z(n), viz., Z(0), Z(1), ..., Z(N/4), Z(N/2), Z(N/2 + 1), ..., Z(3N/4 − 1). The N-point DCT sequence is given by {Re[Z(0)], Re[Z(1)], ..., Re[Z(N/4)], −Im[Z(3N/4 − 1)], −Im[Z(3N/4 − 2)], ..., −Im[Z(N/2 + 1)], Re[Z(N/2)], Re[Z(N/2 + 1)], ..., Re[Z(3N/4 − 1)], −Im[Z(N/4)], −Im[Z(N/4 − 1)], ..., −Im[Z(1)]}.
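The reordering approach can be checked numerically with a floating-point sketch. It assumes N is a multiple of 4, uses a plain O(N²) DFT instead of an FFT or any RNS arithmetic, and the function names are hypothetical. It computes only the N/2 + 1 values Z(0), ..., Z(N/4), Z(N/2), ..., Z(3N/4 − 1) and fills the remaining outputs from the imaginary parts.

```python
import cmath
import math


def dct_direct(x):
    """N-point DCT straight from the defining sum (9.16)."""
    n_pts = len(x)
    out = []
    for m in range(n_pts):
        k_m = 1 / math.sqrt(2) if m == 0 else 1.0
        acc = sum(x[n] * math.cos(m * (2 * n + 1) * math.pi / (2 * n_pts))
                  for n in range(n_pts))
        out.append(math.sqrt(2 / n_pts) * k_m * acc)
    return out


def dft(y):
    """Plain N-point DFT (stand-in for a fast algorithm)."""
    n_pts = len(y)
    return [sum(y[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def dct_via_reordered_dft(x):
    """DCT from the DFT of the reordered sequence; assumes len(x) % 4 == 0."""
    n_pts = len(x)
    y = [0.0] * n_pts
    for n in range(n_pts // 2):
        y[n] = x[2 * n]                    # even-indexed samples, in order
        y[n_pts - n - 1] = x[2 * n + 1]    # odd-indexed samples, reversed
    big_y = dft(y)

    def z(n):
        k_n = 1 / math.sqrt(2) if n == 0 else 1.0
        w = cmath.exp(-2j * cmath.pi * n / (4 * n_pts))   # W_4N ** n
        return math.sqrt(2 / n_pts) * k_n * w * big_y[n]

    result = [0.0] * n_pts
    needed = list(range(n_pts // 4 + 1)) + list(range(n_pts // 2, 3 * n_pts // 4))
    for n in needed:
        value = z(n)
        result[n] = value.real
        if n % (n_pts // 2) != 0:          # skip n = 0 and n = N/2
            result[n_pts - n] = -value.imag
    return result


signal = [1.0, 2.0, -3.0, 4.5, 0.0, 7.0, -2.0, 1.5]
print(dct_via_reordered_dft(signal))
```

For N = 8 the loop evaluates exactly Z(0), Z(1), Z(2), Z(4) and Z(5), and the two functions agree to rounding error.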

The fast algorithms known for the DFT can be used for fast computation of the DCT. A QRNS butterfly for computation of a DIF radix-2 DFT is shown in Figure 9.28a. Note that since the input sequence is real, each QRNS adder is one modular adder. A butterfly needs a QRNS adder (two modular adders), a QRNS subtractor (two modular subtractors) and a QRNS multiplier (two modular multipliers). The moduli set used is {221, 229, 233, 241}. The multiplier uses isomorphic mapping with the roots {47, 107, 89, 177}, respectively. The 8-point QRNS DCT computation is shown in Figure 9.28b. Note that only five outputs, Z(0), Z(1), Z(2), Z(4) and Z(5), are required for the DCT computation.
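The butterfly arithmetic can be sketched in Python using the moduli and roots quoted above, each of which satisfies root² ≡ −1 modulo its modulus. The sample operands and the integer twiddle value 3 − 4j are arbitrary assumptions for illustration; a real design would use scaled, quantized twiddle factors.

```python
MODULI = (221, 229, 233, 241)
ROOTS = (47, 107, 89, 177)

# Each root squares to -1 modulo its modulus.
assert all((r * r) % m == m - 1 for m, r in zip(MODULI, ROOTS))


def encode(a, b, m, r):
    """Map a + jb to the QRNS pair (a + r*b, a - r*b) modulo m."""
    return ((a + r * b) % m, (a - r * b) % m)


def decode(pair, m, r):
    """Recover the residues of the real and imaginary parts."""
    z, z_star = pair
    a = ((z + z_star) * pow(2, -1, m)) % m
    b = ((z - z_star) * pow(2 * r, -1, m)) % m
    return a, b


def dif_butterfly(a, b, w, m):
    """Radix-2 DIF butterfly: outputs a + b and (a - b) * w, componentwise."""
    total = ((a[0] + b[0]) % m, (a[1] + b[1]) % m)         # two modular adders
    diff = ((a[0] - b[0]) % m, (a[1] - b[1]) % m)          # two modular subtractors
    scaled = ((diff[0] * w[0]) % m, (diff[1] * w[1]) % m)  # two modular multipliers
    return total, scaled


# Hypothetical operands: A = 7 + 2j, B = 3 - 5j, twiddle W = 3 - 4j.
results = []
for m, r in zip(MODULI, ROOTS):
    top, bottom = dif_butterfly(encode(7, 2, m, r), encode(3, -5, m, r),
                                encode(3, -4, m, r), m)
    results.append((decode(top, m, r), decode(bottom, m, r)))
print(results)
```

In ordinary complex arithmetic A + B = 10 − 3j and (A − B)·W = 40 + 5j, so each channel should return the residues of those values.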

Fernandez et al. [69] have presented an RNS architecture for computation of the scaled 2D-DCT on field programmable logic (FPL). An eight-pixel 1D-DCT is implemented as shown in Figure 9.29a. The 2D-DCT is computed as

$$X(u,v) = \frac{2e(u)e(v)}{N} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i,j) \cos\left(\frac{u(2i+1)\pi}{2N}\right) \cos\left(\frac{v(2j+1)\pi}{2N}\right), \qquad u, v = 0, 1, \ldots, N-1 \tag{9.20}$$

where x(i, j) is an N × N matrix of pixels and X(u, v) is the corresponding transformed matrix. Since the 2D-DCT is a separable transform, the row-column decomposition [70] can be used. An N × N 2D-DCT can be performed by first computing N 1D-DCTs on the rows and then N 1D-DCTs on the columns. A transposition structure containing an 8 × 8 matrix of registers and multiplexers allows the transposition of the parallel input data.
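The row-column decomposition can be illustrated with a small floating-point sketch. It assumes that e(u) in (9.20) takes the same values as K_m in (9.16), namely 1/√2 for index 0 and 1 otherwise; the source does not define e(u), so this is an assumption. The function names are hypothetical and no RNS arithmetic is involved.

```python
import math


def scale_factor(index):
    """Assumed normalization: 1/sqrt(2) for index 0, otherwise 1."""
    return 1 / math.sqrt(2) if index == 0 else 1.0


def dct_1d(x):
    """N-point 1D DCT as in (9.16)."""
    n_pts = len(x)
    return [math.sqrt(2 / n_pts) * scale_factor(m)
            * sum(x[n] * math.cos(m * (2 * n + 1) * math.pi / (2 * n_pts))
                  for n in range(n_pts))
            for m in range(n_pts)]


def dct_2d_row_column(block):
    """1D DCTs on the rows, transpose, 1D DCTs on the columns, transpose back."""
    row_pass = [dct_1d(row) for row in block]                  # indexed [i][v]
    col_pass = [dct_1d(list(col)) for col in zip(*row_pass)]   # indexed [v][u]
    return [list(row) for row in zip(*col_pass)]               # indexed [u][v]


def dct_2d_direct(block):
    """Direct evaluation of the double sum in (9.20)."""
    n_pts = len(block)
    out = [[0.0] * n_pts for _ in range(n_pts)]
    for u in range(n_pts):
        for v in range(n_pts):
            acc = sum(block[i][j]
                      * math.cos(u * (2 * i + 1) * math.pi / (2 * n_pts))
                      * math.cos(v * (2 * j + 1) * math.pi / (2 * n_pts))
                      for i in range(n_pts) for j in range(n_pts))
            out[u][v] = 2 * scale_factor(u) * scale_factor(v) / n_pts * acc
    return out


pixels = [[52, 55, 61, 66], [63, 59, 55, 90], [62, 59, 68, 113], [63, 58, 71, 122]]
print(dct_2d_row_column(pixels)[0][0])
```

The two `zip(*...)` calls play the role of the transposition structure: they turn columns into rows so that the same 1D routine can be reused.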



Using an algorithm due to Arai et al. [72] (see Figure 9.29a), the 8-pixel 1D-DCT can be realized as shown in Figure 9.29b for one modulus channel, which needs only five multiplications. Note that e1 and e2 are power-of-two scaling factors. The coefficients are $k_1 = C_4$, $k_2 = C_6 - C_2$, $k_3 = C_4$, $k_4 = C_6 + C_2$, $k_5 = C_6$, where $C_q = \cos\left(\frac{q\pi}{16}\right)$. The 1D-DCT can be designed to have a single multiplication per stage. Multiplication by the DCT coefficients is by ROM look-ups. In order to obtain the exact value of the DCT, each output needs an additional multiplication, which can be taken care of in the next stage. The hardware consists of adders, registers and LUTs. The moduli set used was {256, 255, 253, 251}. The output conversion to binary is performed using the ε-CRT converter of Griffin, Taylor and Sousa [23] described earlier. The cosine coefficients are 7 bits wide, the signal samples are 8 bits wide, and a DR of 32 bits could be achieved.

[Figure 9.28a labels: signal pairs (a+, b−), (c+, d−), (e+, f−), (g+, h−), (i+, j−); modular adders |+|m, modular subtractors |−|m and modular multipliers |×e+|m, |×f−|m. Figure 9.28b labels: inputs y(0) to y(7); adder, subtractor and multiplier stages with twiddle factors of W8 and output factors H0, H4, H2, H1, H5; outputs Z(0), Z(4), Z(2), Z(1), Z(5).]

Figure 9.28 (a) QRNS butterfly for a radix-2 FFT and (b) pipelined QRNS DCT implementation (adapted from [66] © IEEE 2000)

[Figure 9.29a labels: stages [a] to [e]; inputs x(0) to x(7); multipliers k1 to k5; outputs X(0), X(4), X(2), X(6), X(5), X(1), X(7), X(3). Figure 9.29b labels: inputs X(0) mod mi to X(7) mod mi; modular adders and subtractors; delay registers D; LUT multipliers ×k1, ×k2, ×k3, ×k4, ×k5, ×e1, ×e2; outputs X(0) to X(7).]

Figure 9.29 (a) Flow graph for fast computation of DCT and (b) moduli mi channel of 1D DCT (adapted from [69] © IEEE 2000)

Taylor [73] has described a DFT implementation using RNS, in which a circular shift register stores the data and multipliers followed by adders compute the outputs (see Figure 9.30). The structure of this five-point prime factor DFT is akin to that of FIR filters. Gaussian primes of the form $2^n + 1$ for n = 2, 4, 8, 16 or 32 have been suggested for single-modulus or multi-moduli RNS architectures. CRT needs to be used to convert into normal integers, and scaling can then be performed. This implementation has employed CRNS as well as QRNS, and index calculus-based multipliers. In order to overcome the overflow problem of QRNS-based FFT implementations, a prime factor transform (PFT) [74] has been suggested. This reduces the dynamic range from $NQ^4$ needed for the FFT to $NQ^2$, where N is the number of points and Q is the word size of the inputs and coefficients. Taylor has suggested CRT computation using distributed arithmetic.

[Figure 9.30 block labels: input in permuted order; circular shift register path with LOAD/RUN control and delay elements T; GF(p²) exponents ex(3), ex(4), ex(2), ex(1); exponent adders with W(3), W(4), W(2), W(1); GF(p²) to RNS converters; ADDER with x(0); output X(k) = [X(3), X(4), X(2), X(1)].]

Figure 9.30 Five-point RNS prime factor DFT implementation (adapted from [73] © IEEE 1990)

Tseng et al. [75] have described an FFT implementation using RNS and considered the effect of quantization noise. They have shown that radix-4 is the largest radix without internal multiplications in the r-point DFT. The twiddle factors are 0, ±1, ±j in the case of radix-2 or radix-4. The basic calculation of radix-4 decimation-in-time (DIT) is shown in Figure 9.31. The magnitudes of the numbers at subsequent stages increase very rapidly due to the cascaded integer multiplications. Since RNS cannot accommodate this dynamic range, all internal numbers must be scaled properly by scaling factors chosen a priori to prevent overflow. Tseng et al. [74] have analyzed the techniques for scaling to prevent overflow by performing an error analysis covering A/D conversion, scaling factors and twiddle factors. The scaling can also be applied after every few stages, and the number of scaling stages can be chosen suitably.
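The observation that a radix-4 core needs no internal multiplications can be illustrated with a short sketch of the 4-point DFT on integer (real, imaginary) pairs. Multiplication by −j is just a swap of the two parts with one sign change. The sketch covers only the multiplier-free core; the twiddle multiplications applied to the inputs in the full DIT butterfly are left out, and the function names are hypothetical.

```python
def add(p, q):
    return (p[0] + q[0], p[1] + q[1])


def sub(p, q):
    return (p[0] - q[0], p[1] - q[1])


def times_minus_j(p):
    """(re + j*im) * (-j) = im - j*re: a swap and a negation, no multiplier."""
    return (p[1], -p[0])


def radix4_core(x0, x1, x2, x3):
    """4-point DFT of complex integers using additions and subtractions only."""
    a = add(x0, x2)
    b = sub(x0, x2)
    c = add(x1, x3)
    d = times_minus_j(sub(x1, x3))
    return add(a, c), add(b, d), sub(a, c), sub(b, d)


def dft4_reference(points):
    """Reference 4-point DFT using the exact twiddles 1, -j, -1, j."""
    twiddles = [1, -1j, -1, 1j]
    values = [complex(re, im) for re, im in points]
    return [sum(values[n] * twiddles[(k * n) % 4] for n in range(4)) for k in range(4)]


inputs = [(1, 2), (3, -1), (0, 4), (-2, 5)]
print(radix4_core(*inputs))
```

Because every operation is an integer addition or subtraction, the same structure carries over to residue channels by reducing each result modulo mi.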



[Figure 9.31 signal labels: real and imaginary parts of the inputs xi(0) to xi(3); real and imaginary parts of the twiddle factors WN^t, WN^2t and WN^3t; adders and multiplications by −1; real and imaginary parts of the outputs xi+1(0) to xi+1(3).]

Figure 9.31 Radix-4 DIT (decimation in time) basic calculation (adapted from [75] © IEEE 1979)



Taylor et al. [76] have presented a radix-4 FFT using complex RNS arithmetic. In this technique, the complex multipliers needed in a conventional implementation are replaced by QRNS multipliers, thus reducing the hardware. A radix-4 complex RNS (CRNS) butterfly is presented in Figure 9.32a together with the QRNS butterfly in Figure 9.32b. The CRNS butterfly needs 12 real multiplications at level 1, 6 real add/subtract operations at level 2, 8 real add/subtract operations at level 3 and 8 real add/subtract operations at level 4. The QRNS-based design, on the other hand, needs only 6 real multiplications at level 1, 8 real add/subtract operations and 2 multiplications at level 2, and 8 real add/subtract operations at level 3.

Jullien et al. [77] have described a systolic quadratic residue DFT with fault tolerance. Here, each systolic array cell uses a 16 × 6 ROM in place of a 16 × 4 ROM. The two additional bits hold the parity of the ROM output content and the parity of the input address bits. In normal operation, the address parity of a cell must equal the content parity of the previous cell.
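The parity rule can be sketched with a toy model in Python. The model is an assumption made for illustration, not the cell design of [77]: each cell is a 16-word ROM whose 4-bit data output directly forms the address of the next cell, the table contents are arbitrary, and a fault is modelled as a single flipped data bit on a cell output.

```python
def parity(value):
    """Parity (0 or 1) of the set bits in value."""
    return bin(value).count("1") & 1


def build_rom(table):
    """Extend each 4-bit word with its content parity and its address parity."""
    return [(data, parity(data), parity(addr)) for addr, data in enumerate(table)]


def run_chain(roms, address, fault_at=None):
    """Pass a value through the cells, checking parity between neighbours."""
    prev_content_parity = None
    for index, rom in enumerate(roms):
        data, content_parity, address_parity = rom[address]
        if prev_content_parity is not None and address_parity != prev_content_parity:
            raise ValueError(f"parity mismatch detected at cell {index}")
        if index == fault_at:
            data ^= 1                      # inject a single-bit output fault
        prev_content_parity = content_parity
        address = data                     # output of this cell addresses the next
    return address


# Hypothetical ROM contents, chosen only to have something to look up.
ROMS = [
    build_rom([(3 * a + 1) % 16 for a in range(16)]),
    build_rom([(5 * a + 2) % 16 for a in range(16)]),
    build_rom([(7 * a + 3) % 16 for a in range(16)]),
]

print(run_chain(ROMS, 4))  # 8
```

In this model a single flipped bit changes the parity of the address seen by the next cell, so the stored address parity no longer matches the content parity delivered by the previous cell and the fault is flagged one cell later.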

The general form of Number Theoretic Transform (NTT) [78, 79] is described

by the transform pair


