## EE3230 Lecture 4: Circuit Characterization and Performance Estimation I

#### Ping-Hsuan Hsieh (謝秉璇)

Delta Building R908 EXT 42590 phsieh@ee.nthu.edu.tw

## Outline

- Delay estimation
- Logical effort and transistor sizing
- Power dissipation
- Interconnect
- Wire engineering
- Design margin
- Reliability
- Scaling

#### **Transient Response**

- DC analysis tells us V<sub>out</sub> if V<sub>in</sub> is constant
- Transient analysis tells us  $V_{out}(t)$  with certain  $V_{in}(t)$

  - Requires solving differential equations
- Input is usually considered to be a step or ramp
  - From 0 to  $V_{DD}$  or vice versa
- Example: inverter with load capacitance $C_{load}$ and step input $V_{in}(t) = u(t - t_0)V_{DD}$
  - Initially $V_{out}(t_0) = V_{DD}$; at $t_0$ the PMOS turns OFF and the NMOS turns ON
  - The NMOS current discharges the capacitor, so the slope of the output is $dV_{out}/dt = -I_{dsn}(t)/C_{load}$
  - The NMOS is initially in saturation: $I_{dsn}(t) = \frac{1}{2} \mu C_{ox} \frac{W}{L} (V_{DD} - V_{th})^{2}$
  - Once $V_{out} < V_{DD} - V_{th}$, the NMOS is in triode: $I_{dsn}(t) = \mu C_{ox} \frac{W}{L} \left(V_{DD} - V_{th} - \frac{1}{2} V_{out}\right) V_{out}$

## **Delay Definitions (I)**

- t<sub>pdr</sub> maximum <u>rising</u> propagation delay
  - From input to rising output crossing  $V_{DD}/2$
- t<sub>pdf</sub> maximum <u>falling</u> propagation delay
  - From input to falling output crossing  $V_{DD}/2$
- *t*<sub>pd</sub> average propagation delay
  - $t_{\rm pd} = (t_{\rm pdr} + t_{\rm pdf})/2$
- **t**<sub>r</sub> rise time
  - For output to go from 0.2  $V_{DD}$  to 0.8  $V_{DD}$
- **t**<sub>f</sub> fall time
  - For output to go from 0.8  $V_{DD}$  to 0.2  $V_{DD}$

#### **Delay Definitions (II)**



## **Delay Definitions (III)**

- t<sub>cdr</sub> minimum <u>rising</u> contamination delay
  - From input to rising output crossing  $V_{DD}/2$
- t<sub>cdf</sub> minimum <u>falling</u> contamination delay
  - From input to falling output crossing  $V_{DD}/2$
- t<sub>cd</sub> average contamination delay
  - $t_{cd} = (t_{cdr} + t_{cdf})/2$



## **Delay Estimation (I)**

- Estimate delay easily
  - Not as accurate as simulations
  - Easier to ask "what if?"
- Step response looks like a 1<sup>st</sup> order RC response (decaying exponential)
- Use RC delay models
  - C = total capacitance on output node
  - Use effective R
  - $t_{pd} = RC$
- $\rightarrow$  Characterize transistors by finding their effective **R** 
  - Depends on the average current as the gate switches
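As a rough illustration of the RC model above, the sketch below evaluates a first-order discharge and finds when the output crosses $V_{DD}/2$. For an ideal exponential this happens at $RC \ln 2$; writing $t_{pd} = RC$ is consistent with that if the $\ln 2$ factor is folded into the effective R, which is an assumption made here. The component values are arbitrary and only for illustration.

```python
import math

def falling_output(t, vdd, r, c):
    """First-order RC discharge: Vout(t) = VDD * exp(-t / (R*C))."""
    return vdd * math.exp(-t / (r * c))

def half_crossing_time(r, c):
    """Time at which the ideal exponential reaches VDD/2, i.e. R*C*ln(2)."""
    return r * c * math.log(2.0)

# Assumed, illustrative values: 10 kOhm effective resistance, 10 fF load.
R_EFF = 10e3
C_LOAD = 10e-15
VDD = 1.8

t50 = half_crossing_time(R_EFF, C_LOAD)
print(f"RC = {R_EFF * C_LOAD * 1e12:.1f} ps, 50% crossing at {t50 * 1e12:.1f} ps")
```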

## **Delay Estimation (II)**

#### **Critical path**

- The signal path with the slowest (most critical) timing
- Affected at 4 different levels
  - Architecture/micro-architecture level
    - Tradeoff of pipeline stages, number of execution units, and size of memory. This is the level with the greatest impact.
  - Logic level
    - Tradeoff of functional block types, number of gates in the cycle, fan-in and fan-out
  - Circuit level
    - Transistor sizes and logic styles/families (e.g., CMOS, NMOS)
  - Layout level
    - Floor-plan, wire length, and parasitics

#### **Critical Path**



### RC Delay Models

- Transistors are assumed to have the minimum channel length, $L = L_{min}$

- Equivalent circuit for MOS transistors
  - Ideal switch + capacitance and ON resistance
  - Unit NMOS has resistance R and capacitance C
  - Unit PMOS has resistance 2R and capacitance C
- Capacitance proportional to width
- Resistance inversely proportional to width



#### **Example: Inverter**





#### Example: NAND3

- Sketch a 3-input NAND with transistor widths chosen to achieve effective rise and fall resistances equal to a unit inverter (R).
- Annotate the 3-input NAND gate with gate and diffusion capacitance
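The sizing and annotation steps can be sketched as follows. This is a minimal sketch that assumes the RC model of the previous slides (a unit PMOS has twice the resistance of a unit NMOS, every unit of width contributes C of gate capacitance and C of diffusion capacitance) and an unshared, contacted layout; the function names are hypothetical.

```python
def nand_widths(n_inputs, pmos_ratio=2):
    """Widths that give an n-input NAND the same worst-case R as a unit inverter.

    The n series NMOS must each be n times wider (n resistances of R/n in series).
    The parallel PMOS keep the unit-inverter width, because the worst case is a
    single PMOS pulling the output up.
    """
    nmos_width = n_inputs
    pmos_width = pmos_ratio
    return nmos_width, pmos_width

def nand_capacitances(n_inputs, pmos_ratio=2):
    """Return (gate cap per input, diffusion cap on the output node), in units of C."""
    wn, wp = nand_widths(n_inputs, pmos_ratio)
    gate_cap_per_input = wn + wp
    # The output node touches every PMOS drain plus the drain of the top NMOS.
    output_diffusion = n_inputs * wp + wn
    return gate_cap_per_input, output_diffusion

cin, cout = nand_capacitances(3)
print(cin, cout)  # 5 9
```

Under these assumptions each NAND3 input presents 5C, which is where a logical effort of 5/3 relative to the 3C unit inverter comes from.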



#### (Worst-Case) Delay of NAND3



## **Elmore Delay Model**

- Pull-up or pull-down network can be modeled as RC ladder
- Elmore delay model of an RC ladder
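For reference, the Elmore estimate for an RC ladder with nodes numbered from the source toward the output can be written as below, where each capacitor $C_i$ is charged through all the resistance between it and the source. The indexing convention is an assumption of this note.

$$t_{pd} \approx \sum_{i} R_{i\text{-to-source}} C_i = R_1 C_1 + (R_1 + R_2) C_2 + \dots + (R_1 + R_2 + \dots + R_N) C_N$$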



## Example: 2-Input NAND

 Estimate worst-case rising and falling delays of 2-input NAND driving *h* identical gates
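One way to work this example is to apply the Elmore model numerically. The sketch below assumes all four transistors have width 2 (so each input presents 4C), the output node carries 6C of diffusion plus the 4hC load, and the node between the series NMOS carries 2C. These values follow the unit-transistor RC model but are assumptions, since the annotated schematic is not reproduced here.

```python
def elmore_delay(ladder):
    """Elmore delay of an RC ladder.

    `ladder` is a list of (R, C) pairs ordered from the supply rail toward the
    output; each C is charged through all the resistance upstream of it.
    """
    delay = 0.0
    r_upstream = 0.0
    for r, c in ladder:
        r_upstream += r
        delay += r_upstream * c
    return delay

def nand2_delays(h, R=1.0, C=1.0):
    """Worst-case (t_pdr, t_pdf) of a NAND2 driving h identical NAND2 gates."""
    c_out = (6 + 4 * h) * C   # 6C of diffusion + h gates of 4C each
    c_internal = 2 * C        # node between the two series NMOS
    # Rising: a single PMOS of resistance R charges the output node.
    t_pdr = elmore_delay([(R, c_out)])
    # Falling: two series NMOS of R/2 each; the internal node is nearest ground.
    t_pdf = elmore_delay([(R / 2, c_internal), (R / 2, c_out)])
    return t_pdr, t_pdf

print(nand2_delays(h=1))  # (10.0, 11.0)
```

In units of RC this gives $t_{pdr} = 6 + 4h$ and $t_{pdf} = 7 + 4h$ under the stated assumptions.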



# Contamination Delay

- Best-case (contamination) delay can be substantially less than worst-case delay
- Example: If both inputs fall simultaneously
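As a worked instance, take a 2-input NAND driving *h* identical gates and assume, as in the unit-transistor RC model, an output capacitance of $(6 + 4h)C$ and a PMOS resistance of R. When both inputs fall together the two PMOS conduct in parallel, so the pull-up resistance is halved:

$$t_{cdr} = \frac{R}{2} (6 + 4h) C = (3 + 2h) RC$$

This is half of the rising delay obtained when only one PMOS conducts, $(6 + 4h)RC$.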



## **Diffusion Capacitance**

- Good layout minimizes diffusion area
- Example: NAND3
  - Sharing diffusion contacts reduces output cap by 2C
  - Merged un-contacted diffusion might help too



#### **Layout Comparison**



#### **Delay Components**

- Parasitic delay *p*
  - Independent of load
- Effort delay *f*
  - Proportional to load capacitance (depends on the fanout *h*)

$$d = f + p$$

- For a unit inverter, $p_{inv} = 3RC$

## Outline

- Delay estimation
- Logical effort and transistor sizing
- Power dissipation
- Interconnect
- Wire engineering
- Design margin
- Reliability
- Scaling

#### Introduction

- Chip designers face a bewildering array of choices
  - What is the best circuit topology for a given function?
  - How many stages of logic give the least delay?
  - How wide should the transistors be?
- Logical effort is a method to make these decisions
  - Uses a simple model of delay
  - Allows back-of-the-envelope calculations
  - Helps make rapid comparison between alternatives
  - Emphasizes remarkable symmetries

- Express delays in **process-independent** unit

$$d = d_{abs} / \tau$$

- Delay has two components: $d = f + p$
- Effort delay (or stage effort) has two components: $f = gh$
- *g*: logical effort
  - Measures relative ability of gate to deliver current
  - $g = 1$ for inverter
- *h*: electrical effort
  - Ratio of output to input capacitance
  - Sometimes called fanout
- Parasitic delay *p*
  - Delay of gate driving no load
  - Due to internal parasitic capacitance
- Combining these, $d = gh + p$

## **Computing Logical Effort**

- Ratio of input capacitance of a gate to that of an inverter delivering the same output current
- Method #1: Measure from delay vs. fanout plots
- Method #2: Estimate by counting transistor widths
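Method #2 can be sketched in a few lines. This assumes the unit-transistor model used earlier (unit inverter with NMOS width 1 and PMOS width 2, so 3 units of input capacitance) and gates sized for the same drive as that inverter; the function names are hypothetical.

```python
from fractions import Fraction

UNIT_INVERTER_CIN = 3  # NMOS width 1 + PMOS width 2

def nand_logical_effort(n):
    """NAND-n: series NMOS of width n, parallel PMOS of width 2, per input."""
    return Fraction(n + 2, UNIT_INVERTER_CIN)

def nor_logical_effort(n):
    """NOR-n: parallel NMOS of width 1, series PMOS of width 2n, per input."""
    return Fraction(1 + 2 * n, UNIT_INVERTER_CIN)

for n in (2, 3, 4):
    print(n, nand_logical_effort(n), nor_logical_effort(n))
```

Because the series PMOS in a NOR must be widened by 2n, its logical effort grows twice as fast with the number of inputs as that of a NAND.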








#### **Delay vs. Fanout Plots**



Note: consider only the effort delay here. For example, a unit inverter ($C_{in} = 3C$) and a NAND2 ($C_{in} = 4C$, with A = 1 and B switching from 0 to 1) that each drive $C_L = 12C$ both have $t_{pdf} = R \cdot (12C)$.





#### Logical effort of common gates

| Gate type      | 1 input | 2 inputs | 3 inputs | 4 inputs     | n inputs |
|----------------|---------|----------|----------|--------------|----------|
| Inverter       | 1       |          |          |              |          |
| NAND           |         | 4/3      | 5/3      | 6/3          | (n+2)/3  |
| NOR            |         | 5/3      | 7/3      | 9/3          | (2n+1)/3 |
| Tri-state, MUX | 2       | 2        | 2        | 2            | 2        |
| XOR, XNOR      |         | 4, 4     | 6, 12, 6 | 8, 16, 16, 8 |          |

#### Parasitic delay of common gates

| Gate type      | 1 input | 2 inputs | 3 inputs | 4 inputs | n inputs |
|----------------|---------|----------|----------|----------|----------|
| Inverter       | 1       |          |          |          |          |
| NAND           |         | 2        | 3        | 4        | n        |
| NOR            |         | 2        | 3        | 4        | n        |
| Tri-state, MUX | 2       | 4        | 6        | 8        | 2n       |
| XOR, XNOR      |         | 4        | 6        | 8        |          |



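Estimate the frequency of an *N*-stage ring oscillator

- Logical effort: $g = 1$
- Electrical effort: $h = 1$
- Parasitic delay: $p = 1$
- Stage delay: $d = gh + p = 2$, i.e. $\Delta t = 2\tau = 3RC + 3RC = 6RC$
- Frequency: the period is $T = 2N\Delta t$ (e.g., $T = 6\Delta t$ for $N = 3$), so $f_{osc} = 1/T = 1/(2N\Delta t)$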
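The frequency estimate can be sketched numerically. The stage count and the value of $\tau$ below are assumed purely for illustration, and the function names are hypothetical.

```python
def stage_delay(g, h, p):
    """Normalized stage delay d = g*h + p (in units of tau)."""
    return g * h + p

def ring_oscillator_frequency(n_stages, tau):
    """Frequency of an n-stage inverter ring with period T = 2 * N * d * tau."""
    if n_stages % 2 == 0:
        raise ValueError("an inverter ring needs an odd number of stages")
    d = stage_delay(g=1, h=1, p=1)  # each inverter drives one identical inverter
    period = 2 * n_stages * d * tau
    return 1.0 / period

# Assumed values for illustration only: 31 stages, tau = 5 ps.
f_osc = ring_oscillator_frequency(31, 5e-12)
print(f"{f_osc / 1e9:.2f} GHz")
```

The factor of 2 in the period reflects that a transition must travel around the ring twice to return the starting node to its original value.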


## **Example:** FO-4 Inverter

• Estimate the delay of an inverter with fanout of 4 (FO4)



- Logical effort: $g = 1$
- Electrical effort: $h = 4$
- Parasitic delay: $p = 1$
- Stage delay: $d = gh + p = 5$
- Rule of thumb: the FO4 delay of a process (in ps) is 1/3 to 1/2 of the minimum channel length (in nm). Example: for 180 nm, FO4 ≈ 60–90 ps
- Highly sensitive to process, voltage, & temperature variations
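Both the normalized FO4 delay and the rule of thumb are simple enough to check with a short sketch. The function names are hypothetical, and the rule of thumb is only the rough range quoted above, not a process model.

```python
def fo4_delay_normalized(g=1, p_inv=1):
    """FO4 inverter delay in units of tau: d = g*h + p with h = 4."""
    return g * 4 + p_inv

def fo4_rule_of_thumb_ps(channel_length_nm):
    """(low, high) FO4 estimate in ps: 1/3 to 1/2 of the channel length in nm."""
    return channel_length_nm / 3, channel_length_nm / 2

print(fo4_delay_normalized())     # 5
print(fo4_rule_of_thumb_ps(180))  # (60.0, 90.0)
```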

## Multistage Logic Networks

• Logical effort generalizes to multistage networks



• Path effort delay

 $D_F = \sum f_i$ 

• Path parasitic delay

 $P = \sum p_i$ 

• Path delay

 $D = \sum d_i = D_F + P$ 

## **Designing Fast Circuits**

 $D = \sum d_i = D_F + P$ 

- Delay is the smallest when each stage bears the same effort
- Minimum delay of N-stage path is

 $D = NF^{\frac{1}{N}} + P$ 

- This is the **key** result of logical effort analysis
  - Find fastest possible delay
  - Doesn't require calculating gate sizes
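A quick numerical check of the equal-effort claim: for a fixed path effort F, any split of stage efforts whose product is F has a larger sum than the equal split. The path below (three inverters with F = 64) is an assumed example, not one taken from the slides.

```python
import math

def path_delay(stage_efforts, parasitics):
    """D = sum of stage efforts + sum of parasitic delays."""
    return sum(stage_efforts) + sum(parasitics)

def min_path_delay(F, N, P):
    """Minimum delay of an N-stage path: D = N * F**(1/N) + P."""
    return N * F ** (1.0 / N) + P

F = 64.0
parasitics = [1, 1, 1]        # three inverters, p_inv = 1 each (assumed)
unequal = [2.0, 4.0, 8.0]     # product is 64, efforts are unbalanced
equal = [F ** (1.0 / 3)] * 3  # product is 64, efforts are balanced (about 4 each)

print(path_delay(unequal, parasitics))  # 17.0
print(round(path_delay(equal, parasitics), 6))
print(round(min_path_delay(F, 3, sum(parasitics)), 6))
```

The balanced case matches the closed-form minimum, which is why the delay can be found before any gate is sized.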

#### Gate Size

• How wide should the gates be for the least delay?

$$\hat{f} = gh = g \frac{C_{out}}{C_{in}}$$
$$\Rightarrow C_{in_i} = \frac{g_i C_{out_i}}{\hat{f}}$$

- Working backwards, apply capacitance transformation to find input capacitance of each gate with given load it drives
- Check work by verifying input cap spec is met

## **Paths with Branches**

- F = GH?
- No! Consider paths with branches



## **Branching Effort**

- Account for branches in path
- Branching effort

$$b = \frac{C_{\text{on path}} + C_{\text{off path}}}{C_{\text{on path}}}$$

- Path branching effort  $B = \prod b_i$
- Now we can compute path effort  $F = GBH$, where $G = \prod g_i$

### **Example: 3-Stage Path**

• Select gate size x and y that minimize the delay from A to B



- Branching effort $B$ (product of the branching efforts along the path)
- Path effort  $F = GBH = 125$
- Best stage effort  $\hat{f} = \sqrt[3]{F} = 5$
- Parasitic delay $P = 2 + 3 + 2 = 7$
- Delay $D = (g_1h_1 + g_2h_2 + g_3h_3) + P = 5 + 5 + 5 + 2 + 3 + 2 = 22$

- Work backwards for sizes
  - $y = 45 \cdot (5/3) / 5 = 15$
  - $x = (15 \cdot 2) \cdot (5/3) / 5 = 10$
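The backward sizing pass can be sketched as below. Since the schematic is not reproduced here, the path is an assumption chosen to be consistent with the numbers above: a NAND2 with input capacitance 8 driving three NAND3 gates of size x, each driving two NOR2 gates of size y, each driving a load of 45. The function name is hypothetical.

```python
import math

def size_path(g, b, c_load, c_in_path):
    """Return (F, f_hat, input caps) for a path, working backwards from the load.

    g[i] is the logical effort of stage i. b[i] is the branching effort at the
    output of stage i, i.e. how many copies of the next on-path gate it drives
    (1 for the last stage, which drives c_load directly).
    """
    n = len(g)
    F = math.prod(g) * math.prod(b) * (c_load / c_in_path)
    f_hat = F ** (1.0 / n)
    c_in = [0.0] * n
    next_cap = c_load
    for i in reversed(range(n)):
        c_out = b[i] * next_cap          # total load seen by stage i
        c_in[i] = g[i] * c_out / f_hat   # capacitance transformation
        next_cap = c_in[i]
    return F, f_hat, c_in

F, f_hat, (c_first, x, y) = size_path(
    g=[4 / 3, 5 / 3, 5 / 3], b=[3, 2, 1], c_load=45, c_in_path=8
)
print(round(F, 3), round(f_hat, 3), round(c_first, 3), round(x, 3), round(y, 3))
```

Recovering the specified input capacitance for the first stage is the check mentioned on the Gate Size slide: if it does not match, an effort or branching factor was miscounted.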



## Best Number of Stages

- How many stages should a path use
  - Minimizing number of stages is not always the fastest
- Example: Drive 64-bit datapath with unit inverter (inverter parasitic delay $p_{inv} = 1$)
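To see why fewer stages is not always faster, the sketch below evaluates the delay of an N-inverter chain for this example. It assumes the path effort is F = 64 (a unit inverter ultimately driving 64 unit loads), that every stage is an inverter with $p_{inv} = 1$, and that stage efforts are equal; the function name is hypothetical.

```python
def inverter_chain_delay(F, n_stages, p_inv=1.0):
    """D = N * F**(1/N) + N * p_inv for a chain of N equal-effort inverters."""
    return n_stages * F ** (1.0 / n_stages) + n_stages * p_inv

for n in range(1, 7):
    print(n, round(inverter_chain_delay(64, n), 1))

best_n = min(range(1, 7), key=lambda n: inverter_chain_delay(64, n))
print("best N:", best_n)
```

A single stage gives 65, two stages give 18, and three stages give about 15, after which adding stages starts to cost more in parasitic delay than it saves in effort delay.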



### Derivation

- Consider inserting inverters into the signal chain
  - How many stages give the least delay?



• Define the best stage effort  $\rho = F^{\frac{1}{N}}$ 

 $p_{inv} + \rho (1 - \ln \rho) = 0$ 
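The condition above can be obtained as follows. Suppose the logic itself has $n_1$ stages with parasitic delays $p_i$, and $N - n_1$ inverters are appended (the symbol $n_1$ is introduced here for the derivation). The path delay with equal stage efforts is

$$D = N F^{\frac{1}{N}} + \sum_{i=1}^{n_1} p_i + (N - n_1) p_{inv}$$

Setting the derivative with respect to N to zero gives

$$\frac{\partial D}{\partial N} = -F^{\frac{1}{N}} \ln F^{\frac{1}{N}} + F^{\frac{1}{N}} + p_{inv} = 0$$

and substituting $\rho = F^{\frac{1}{N}}$ yields $p_{inv} + \rho (1 - \ln \rho) = 0$.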

## **Best Stage Effort**

- $p_{inv} + \rho (1 - \ln \rho) = 0$  has no closed-form solution
- Neglecting parasitics ( $p_{inv} = 0$ ), we find  $\rho = 2.718$  (e)
- For  $p_{inv}$  = 1, solve numerically for  $\rho$  = 3.59
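The numerical solution can be reproduced with a few lines of bisection. This is only a sketch; the bracket and tolerance are arbitrary choices.

```python
import math

def best_stage_effort(p_inv, lo=math.e, hi=20.0, tol=1e-9):
    """Solve p_inv + rho * (1 - ln(rho)) = 0 for rho by bisection."""
    def f(rho):
        return p_inv + rho * (1.0 - math.log(rho))
    # f(e) = p_inv >= 0 and f is decreasing for rho > 1, so the root is in [e, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(best_stage_effort(1.0), 2))  # 3.59
print(round(best_stage_effort(0.0), 3))  # 2.718
```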

## **Sensitivity Analysis**

• How sensitive is the delay to the number of stages near the best number?



• 2.4 <  $\rho$  < 6 gives delay within 15% of the minimum; $\rho = 4$ is a common choice
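A simplified model makes this insensitivity easy to check. Assume every stage behaves like an inverter with $p_{inv} = 1$ and treat the number of stages as continuous, $N = \ln F / \ln \rho$. Then $D = N(\rho + p_{inv})$ and F cancels when comparing two stage efforts. These simplifications are assumptions of the sketch, so its numbers need not match the plotted curve exactly.

```python
import math

def relative_delay(rho, p_inv=1.0, rho_best=3.59):
    """Delay with stage effort rho, relative to the delay at rho_best.

    With N = ln(F) / ln(rho) stages, D = N * (rho + p_inv), so the ratio of two
    delays depends only on (rho + p_inv) / ln(rho).
    """
    def delay_per_log_effort(r):
        return (r + p_inv) / math.log(r)
    return delay_per_log_effort(rho) / delay_per_log_effort(rho_best)

for rho in (2.4, 3.59, 4.0, 6.0):
    print(rho, round(relative_delay(rho), 3))
```

Under this model the penalty at both ends of the 2.4 to 6 range stays below 15%, and a stage effort of 4 is within about 1% of the optimum.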

## **Example: Decoder for a Register File**



## Number of Stages

- Decoder effort is mainly electrical and branching
  - Electrical effort  $H = \frac{96}{10} = 9.6$
  - Branching effort  $B = 8$
- If we neglect logical effort (by assuming G = 1)
  - Path effort $F = GBH = 76.8$
  - Number of stages $N = \log_4 F = 3.1$

#### 3 Stage 4:16 Decoder



## **Gate Sizes and Delay**

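- Logical effort $G = 1 \cdot \frac{6}{3} \cdot 1 = 2$
- Path effort F = GBH = 154
- Stage effort  $\hat{f} = \sqrt[3]{F} = 5.36$
- Path delay $D = 3 \cdot 5.36 + 1 + 4 + 1 = 22.1$ (the last three terms are the parasitic delays)
- Gate sizes $z = 96 \cdot 1 / 5.36 = 18$ and $y = 18 \cdot 2 / 5.36 = 6.7$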

#### Different alternatives

| Design                      | N | G    | P | D           | $\hat{f}$ (as annotated) |
|-----------------------------|---|------|---|-------------|--------------------------|
| NAND4-INV                   | 2 | 2    | 5 | 29.8        |                          |
| NAND2-NOR2                  | 2 | 20/9 | 4 | 30.1        |                          |
| INV-NAND4-INV               | 3 | 2    | 6 | 22.1        | 5.36                     |
| NAND4-INV-INV-INV           | 4 | 2    | 7 | 21.1        | 3.52                     |
| NAND2-NOR2-INV-INV          | 4 | 20/9 | 6 | 20.5        | 3.61                     |
| NAND2-INV-NAND2-INV         | 4 | 16/9 | 6 | 19.7 (best) | 3.41                     |
| INV-NAND2-INV-NAND2-INV     | 5 | 16/9 | 7 | 20.4        | 2.6                      |
| NAND2-INV-NAND2-INV-INV-INV | 6 | 16/9 | 8 | 21.6        |                          |
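The delays in this table can be recomputed from $D = N F^{1/N} + P$ with $F = GBH$. The sketch below takes $H = 9.6$ and $B = 8$; the value of B is an inference, chosen because it reproduces the tabulated delays. The design list is transcribed from the table.

```python
from fractions import Fraction

B = 8      # branching effort (inferred, see lead-in)
H = 9.6    # electrical effort, 96 / 10

# (design, N, G, P) transcribed from the table above
DESIGNS = [
    ("NAND4-INV", 2, Fraction(2), 5),
    ("NAND2-NOR2", 2, Fraction(20, 9), 4),
    ("INV-NAND4-INV", 3, Fraction(2), 6),
    ("NAND4-INV-INV-INV", 4, Fraction(2), 7),
    ("NAND2-NOR2-INV-INV", 4, Fraction(20, 9), 6),
    ("NAND2-INV-NAND2-INV", 4, Fraction(16, 9), 6),
    ("INV-NAND2-INV-NAND2-INV", 5, Fraction(16, 9), 7),
    ("NAND2-INV-NAND2-INV-INV-INV", 6, Fraction(16, 9), 8),
]

def path_delay(n_stages, G, P):
    """Minimum path delay D = N * F**(1/N) + P with F = G * B * H."""
    F = float(G) * B * H
    return n_stages * F ** (1.0 / n_stages) + P

results = {name: path_delay(n, G, P) for name, n, G, P in DESIGNS}
for name, d in results.items():
    print(f"{name:30s} D = {d:.1f}")
print("fastest:", min(results, key=results.get))
```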

# **Key Insights from Logical Effort**

- Logical Effort characterizes the complexity of a logic gate or path
  - Allows comparison of alternative circuit topologies
- NAND structures are faster than NOR structures in static CMOS circuits
- Paths are fastest when
  - Effort delays of each stage are about the same
  - These delays are close to 4 (FO4 inverter delays)
  - Each quadrupling of the load adds about one FO4 inverter delay to the path
- Path delay is insensitive to modest deviations from the optimum
  - Stage efforts of 2.4–6 give designs within 15% of minimum delay.
  - There is no need to make calculations to more than 1–2 significant figures, so many estimations can be made in your head. There is no need to choose transistor sizes exactly according to theory and there is little benefit in tweaking transistor sizes if the design is reasonable.
- Using fewer stages for "fewer gate delays" does not make a circuit faster. Making gates larger also does not make a circuit faster; it only increases the area and power consumption.
- Stage efforts somewhat greater than 4 reduce area and power consumption at a slight cost in speed. Using efforts greater than 6–8 comes at a significant cost in speed.
- Logical Effort of a gate increases as the number of inputs grows
  - Considering both logical effort and parasitic delay, we find a practical limit of about 4 series transistors in logic gates and about 4 inputs to multiplexers.
  - Beyond this fan-in, it is faster to split gates into multiple stages of skinnier gates.
- Inverters or 2-input NAND gates with low logical efforts are best for driving nodes with a large branching effort. Use small gates after the branches to minimize load on the driving gate.
- When a path forks and one leg is more critical than the others, buffer the noncritical legs to reduce the branching effort on the critical path

## Review

|                   | Stage                                                  | Path                                   |
|-------------------|--------------------------------------------------------|----------------------------------------|
| Number of stages  | 1                                                      | N                                      |
| Logical effort    | g                                                      | $G = \prod g_i$                        |
| Electrical effort | $h = \frac{C_{out}}{C_{in}}$                           | $H = \frac{C_{out-path}}{C_{in-path}}$ |
| Branching effort  | $b = \frac{(C_{on-path} + C_{off-path})}{C_{on-path}}$ | $B = \prod b_i$                        |
| Effort            | f = gh                                                 | F = GBH                                |
| Effort delay      | f                                                      | $D_F = \sum f_i$                       |
| Parasitic delay   | p                                                      | $P = \sum p_i$                         |
| Delay             | d = f + p                                              | $D = \sum d_i = D_F + P$               |

## Method of Logical Effort

1. Compute path effort $F = GBH$
2. Estimate the best number of stages  $N = \log_4 F$
3. Sketch path with N stages
4. Estimate the least delay $D = NF^{\frac{1}{N}} + P$
5. Determine the best stage effort $\hat{f} = F^{\frac{1}{N}}$
6. Find gate sizes $C_{in_i} = \frac{g_i C_{out_i}}{\hat{f}}$

# **Limits of Logical Effort**

- Chicken and egg problem
  - Need path to compute G
  - Don't know the number of stages without G
- Simplified delay model
  - Neglects input <u>rise time effects</u>, velocity saturation, body effect, ...
- Neglects interconnect effects
  - Requires iterations to take wire capacitance into account
  - Wire capacitance $C_w$ may dominate
- Maximum speed only
  - Not minimum area/power for constrained delay
- Non-uniform branching factor

## Summary

- Logical effort is useful when considering circuit delay
  - Numerical logical effort characterizes gates
  - NANDs are faster than NORs in CMOS
  - Paths are fastest when effort delays are ~4
  - Path delay is not very sensitive to stages and sizes
  - Using fewer stages doesn't necessarily give a faster result
- Language for discussing fast circuits
  - Practice required to master