

# **DSP Computing**

#### **Chao-Tsung Huang**

#### National Tsing Hua University Department of Electrical Engineering



### Outline

- Fast algorithms
- DSP for VLSI circuits



### Product of integers

• Let x, y be n-digit numbers and n=2m

$$x = x_0 + x_1 \cdot b^m$$
  $y = y_0 + y_1 \cdot b^m$  (e.g. base b=2)

$$z = x \cdot y = (x_0 + x_1 b^m)(y_0 + y_1 b^m) = x_0 y_0 + (x_0 y_1 + x_1 y_0) b^m + x_1 y_1 b^{2m}$$

$$x_0 y_0 = x_0 y_0,$$
  

$$x_0 y_1 + x_1 y_0 = (x_0 - x_1)(y_1 - y_0) + x_0 y_0 + x_1 y_1$$
  

$$x_1 y_1 = x_1 y_1.$$

(3 m-digit operations, instead of 4)

• Recursive implementation:  $n^{\log_2 3}$  operations

Ref: S. Winograd, Arithmetic Complexity of Computations, 1980.



### Two-parallel FIR filter

- Express convolution by polynomial multiplication
  - Then partition both of x and h into even and odd parts





Ref: K. K. Parhi, VLSI Digital Signal Processing Systems Design and Implementation, Wiley.



### Matrix-matrix multiplication

• Let A, B be  $n \times n$  matrices and n=2m

$$\begin{split} P_1 &= (A_{11} + A_{22}) \times (B_{11} + B_{22}), \qquad P_2 = (A_{21} + A_{22}) \times B_{11}, \\ P_3 &= A_{11} \times (B_{12} - B_{22}), \qquad P_4 = A_{22} \times (B_{21} - B_{11}), \qquad C_{11} = P_1 + P_4 - P_5 + P_7, \qquad C_{12} = P_3 + P_5, \\ P_5 &= (A_{11} + A_{12}) \times B_{22}, \qquad P_6 = (A_{21} - A_{11}) \times (B_{11} + B_{12}), \qquad C_{21} = P_2 + P_4, \qquad C_{22} = P_1 + P_3 - P_2 + P_6. \\ P_7 &= (A_{12} - A_{22}) \times (B_{21} + B_{22}), \end{split}$$

(7  $m \times m$  matrix multiplications, instead of 8)

• Recursive implementation:  $n^{\log_2 7}$  multiplications

Ref: S. Winograd, Arithmetic Complexity of Computations, 1980.



### **1D FIR filter**

• F(2,3): 2 outputs and 3-tap filter

$$\begin{pmatrix} z_{0} & z_{1} & z_{2} \\ z_{1} & z_{2} & z_{3} \end{pmatrix} \begin{pmatrix} x_{0} \\ x_{1} \\ x_{2} \end{pmatrix} = \begin{pmatrix} m_{1} + m_{2} + m_{3} \\ m_{2} - m_{3} - m_{4} \end{pmatrix}$$
Filter transform (pre-process)  

$$m_{1} = (z_{0} - z_{2})x_{0}, \quad m_{2} = (z_{1} + z_{2})((x_{0} + x_{1} + x_{2})/2),$$

$$m_{3} = (z_{2} - z_{1})((x_{0} - x_{1} + x_{2})/2), \text{ and } m_{4} = (z_{1} - z_{3})x_{2}.$$
(4 multiplications, instead of 6)

**Point-wise multiplication** 

Proven minimality of multiplication

 $-\mu(F(n,k)) = n + k - 1$ 

Ref: S. Winograd, Arithmetic Complexity of Computations, 1980.



#### Winograd convolution for 2D FIR filter

Recap of F(2,3): 2 outputs and 3-tap filter

Filter

Input







Inverse transform

F(2x2,3x3): 16 multiplications, instead of 36

Ref: "Fast algorithms for convolutional neural networks", CVPR, 2016.



# Fast (magic) inverse square root

- Motivation
  - Inverse square root is heavily used in computer graphics to find normalized vectors in floating point



Ref: https://en.wikipedia.org/wiki/Fast\_inverse\_square\_root



# Fast (magic) inverse square root

- Legend
  - The following magic code was found in the source code of *Quake III Arena* (1999)

```
float Q rsqrt( float number )
{
                                                                   y = -
   long i;
   float x2, y;
    const float threehalfs = 1.5F;
   x2 = number * 0.5F;
   y = number;
   i = * ( long * ) &y;
                                           // evil floating point bit level hacking
   i = 0x5f3759df - ( i >> 1 );
                                              // what the final?
   y = * ( float * ) &i;
   y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * (threehalfs - (x2 * y * y)); // 2nd iteration, this can be removed
    return y;
}
```



### Outline

• Fast algorithms

#### DSP for VLSI circuits

- Pipelining
- Parallel
- Retiming
- Systolic array

# Pipelining



(Timing-reduction; Same throughput, longer latency)

• Cut feed-forward paths into pipelined stages









# Retiming



(Timing-reduction; Same latency and throughput)

Move existing delays around for better timing



Generalization of pipelining



# Systolic Array



(Dimension-reduction; regularity)

Can project in many different ways

:





Matrix-matrix multiplication (3D operations)

a11

: | :

: