

#### Design for Skew and Testability, Special-Purpose Subsystems



# Outline

1. **Design for Skew**
2. Design for Testability
3. Special-Purpose Subsystems

# Design for Skew

- <u>Clock</u> Distribution
- Clock Skew
- Skew-Tolerant Static Circuits
- Traditional Domino Circuits
- Skew-Tolerant Domino Circuits

# Clocking

- Synchronous systems use a clock to keep operations in sequence
  - Distinguish *this* from *previous* or *next*
  - Determine speed at which machine <u>operates</u>
- Clock must be distributed to all the sequencing elements
  - Flip-flops and latches
- Also distribute clock to other elements
  - Domino circuits and memories
    - Domino gates are dynamic circuits with two operating phases (precharge and evaluate)

# **Clock Distribution**

- On a small chip, the clock distribution network is just a wire
  - And possibly an inverter for clkb
- On practical chips, the RC delay of the wire resistance and gate load is very long
  - Variations in this delay cause clock to get to different elements at different times
  - This is called *clock skew*

- Most chips use repeaters to buffer the clock and equalize the delay
  - Restores clock signal quality
  - Reduces but doesn't eliminate skew
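To see why repeaters help, here is a rough sketch under simple assumptions. It models the clock wire as a distributed RC line whose Elmore delay grows with the square of its length, and each repeater as a fixed delay. The function names, the per-millimeter resistance and capacitance, and the buffer delay are illustrative assumptions, not values from these slides.

```python
def unrepeated_wire_delay(r_per_mm, c_per_mm, length_mm):
    """Elmore delay of a distributed RC wire: 0.5 * R_total * C_total.

    With r in ohm/mm and c in pF/mm the result is in picoseconds.
    """
    return 0.5 * (r_per_mm * length_mm) * (c_per_mm * length_mm)


def repeated_wire_delay(r_per_mm, c_per_mm, length_mm, n_segments, t_buf_ps):
    """Delay of the same wire cut into n equal segments, each driven by a
    repeater with an assumed fixed delay t_buf_ps (hypothetical value)."""
    segment_mm = length_mm / n_segments
    per_segment = unrepeated_wire_delay(r_per_mm, c_per_mm, segment_mm) + t_buf_ps
    return n_segments * per_segment


# Hypothetical 10 mm wire: 100 ohm/mm, 0.2 pF/mm, 30 ps per repeater.
plain = unrepeated_wire_delay(100, 0.2, 10)          # about 1000 ps
buffered = repeated_wire_delay(100, 0.2, 10, 5, 30)  # about 350 ps
print(plain, buffered)
```

The quadratic wire term shrinks when the wire is segmented, but every repeater adds its own delay, and that delay can vary from branch to branch. This is consistent with repeaters reducing skew without eliminating it.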

VLSI Design

## Example

- Skew comes from differences in gate and wire delay
  - With right buffer sizing, clk<sub>1</sub> and clk<sub>2</sub> could ideally arrive at the same time.
  - But power supply noise changes buffer delays
  - clk<sub>2</sub> and clk<sub>3</sub> will always see RC skew



## **Clock Skew Type**

- Systematic
  - Cause: Exists even under nominal conditions; predictable by simulation
  - Sol: Careful design and layout
- Random
  - Cause: Manufacturing variations, unpredictable
  - Sol: Calibration with <u>adjustable delay elements</u>
- Drift
  - Cause: <u>Time-dependent environmental variations</u>, e.g., temperature effects
  - Sol: Periodic calibration with adjustable delay elements
- Jitter (dynamic)
  - Cause: High-frequency environmental variations, e.g., power supply noise
  - Sol: Cannot be calibrated out (too fast)


# Cycle Time Trends

- Much of CPU performance comes from higher *f*
  - *f* is improving faster than simple process shrinks
  - Sequencing overhead is a bigger part of the cycle




# Solutions

- Reduce clock skew
  - Careful clock distribution network design
  - Plenty of metal wiring resources
- Analyze clock skew
  - Only budget actual, not worst case skews
  - Local vs. global skew budgets
- Tolerate clock skew
  - Choose circuit structures insensitive to skew

#### Clock Dist. Networks


- Ad hoc
- Grids
- H-tree
- Hybrid

# Clock Grids

- Use grid on two or more levels to carry clock
- Make wires wide to reduce RC delay
- Ensures low skew between nearby points
- But possibly large skew across die
- Systematic skew between the points closest to and farthest from the driver

#### Alpha Clock Grids




# **H**-Trees

- Fractal structure
  - Gets clock arbitrarily close to any point
  - Matched delay along all paths
- Delay variations cause skew
- A and B might see big skew
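The matched-delay property can be checked with a small geometric sketch. It builds an idealized H-tree by alternating horizontal and vertical branches and halving the branch length after each vertical branch. `h_tree_leaves` and its parameters are hypothetical names chosen for illustration; the model ignores buffers, loads, and routing obstructions.

```python
def h_tree_leaves(x, y, half_span, levels, path_len=0.0, horizontal=True):
    """Return (x, y, wire_length_from_root) for every leaf of an idealized H-tree."""
    if levels == 0:
        return [(x, y, path_len)]
    leaves = []
    for sign in (-1, 1):
        # Branch left/right on horizontal levels, up/down on vertical levels.
        nx = x + sign * half_span if horizontal else x
        ny = y if horizontal else y + sign * half_span
        # The branch length is halved once each "H" (horizontal + vertical) is done.
        next_span = half_span if horizontal else half_span / 2
        leaves += h_tree_leaves(nx, ny, next_span, levels - 1,
                                path_len + half_span, not horizontal)
    return leaves


leaves = h_tree_leaves(0.0, 0.0, 4.0, levels=4)
print(len(leaves))                             # 16 leaves
print({length for _, _, length in leaves})     # {12.0}: every path has the same length
```

In this sketch the leaves at x = -2 and x = 2 on the same row are physical neighbors, yet their paths share only the root. Delay variation accumulated along two almost entirely separate paths appears as skew between them, which is presumably the situation the slide illustrates with points A and B.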



#### Itanium 2 H-Tree

- Four levels of buffering:
  - Primary driver
  - Repeater
  - Second-level clock buffer
  - Gater
- Route around obstructions



# Spines

- Drive length-matched serpentine wire
  - Avoid systematic skew of the grid
  - Still have large local skews between nearby elements driven by different serpentines.




## Hybrid Networks

- Use <u>H-tree</u> to distribute clock to many points
- <u>Tie these points together with a grid</u>
  - Compared to Grid, less systematic skew
  - Compared to H, less skew from nonuniform load distribution
  - More regular layout
- Ex: IBM Power4, PowerPC
  - H-tree drives 16-64 sector buffers
  - Buffers drive total of 1024 points
  - All points shorted together with grid

#### Layout Issues

- As uniform as possible
- As fast as possible
  - Skew is a fraction of the total distribution delay, so a faster network has less skew
  - Use top metal as clock distribution routing
- Low Power/GND coupling
  - Shielding
- Inductance effect concern
  - Interdigitated layout





#### Skew Tolerance

- Flip-flops are sensitive to skew because of <u>hard</u> edges
  - Data launches at latest rising edge of clock
  - Must set up before earliest next rising edge of clock
  - Overhead would shrink if we could soften the edges
- Latches tolerate moderate amounts of skew
  - Data can arrive anytime latch is transparent

# Review: Skew Impact

- Ideally full cycle is available for work
- Skew adds sequencing overhead
- Increases hold time too

$$t_{pd} \leq T_c - \underbrace{\left(t_{pcq} + t_{\text{setup}} + t_{\text{skew}}\right)}_{\text{sequencing overhead}}$$

$$t_{cd} \geq t_{\text{hold}} - t_{ccq} + t_{\text{skew}}$$
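The two flip-flop constraints above can be evaluated directly. The sketch below is a plain transcription of them into Python; the function names and the numbers in the example (all in picoseconds) are hypothetical, chosen only to show the effect of skew.

```python
def flop_max_logic_delay(t_c, t_pcq, t_setup, t_skew):
    """Largest allowed propagation delay: t_pd <= Tc - (t_pcq + t_setup + t_skew)."""
    return t_c - (t_pcq + t_setup + t_skew)


def flop_min_contamination_delay(t_hold, t_ccq, t_skew):
    """Smallest allowed contamination delay: t_cd >= t_hold - t_ccq + t_skew."""
    return t_hold - t_ccq + t_skew


# Hypothetical values in ps: Tc = 1000, t_pcq = 80, t_setup = 50,
# t_hold = 30, t_ccq = 60.
for skew in (0, 50):
    print(skew,
          flop_max_logic_delay(1000, 80, 50, skew),
          flop_min_contamination_delay(30, 60, skew))
# skew = 0  -> 870 ps for logic, minimum t_cd of -30 ps (no hold problem)
# skew = 50 -> 820 ps for logic, minimum t_cd of  20 ps (short paths need padding)
```

With these example numbers, 50 ps of skew removes 50 ps from the logic budget and turns a hold constraint that was met automatically into one that short paths must be checked against.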






#### **Skew: Latches**

- 2-Phase Latches

- Pulsed Latches

$$t_{pd} \leq T_c - \underbrace{\max\left(t_{pdq}, t_{pcq} + t_{\text{setup}} - t_{pw} + t_{\text{skew}}\right)}_{\text{sequencing overhead}}$$

$$\begin{split} t_{cd} &\geq t_{\text{hold}} + t_{pw} - t_{ccq} + t_{\text{skew}} \\ t_{\text{borrow}} &\leq t_{pw} - \left(t_{\text{setup}} + t_{\text{skew}}\right) \end{split}$$
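As a sketch, the three pulsed-latch constraints above can be bundled into one helper. The function name, the dictionary keys, and the example values (in picoseconds) are hypothetical.

```python
def pulsed_latch_budget(t_c, t_pdq, t_pcq, t_setup, t_hold, t_ccq, t_pw, t_skew):
    """Evaluate the pulsed-latch constraints for one set of parameters."""
    overhead = max(t_pdq, t_pcq + t_setup - t_pw + t_skew)
    return {
        "sequencing_overhead": overhead,
        "max_t_pd": t_c - overhead,                    # t_pd <= Tc - overhead
        "min_t_cd": t_hold + t_pw - t_ccq + t_skew,    # hold constraint
        "max_t_borrow": t_pw - (t_setup + t_skew),     # time-borrowing limit
    }


# Hypothetical values in ps: Tc = 1000, t_pdq = 70, t_pcq = 80, t_setup = 50,
# t_hold = 30, t_ccq = 60, pulse width = 150, skew = 50.
budget = pulsed_latch_budget(1000, 70, 80, 50, 30, 60, 150, 50)
print(budget)
# {'sequencing_overhead': 70, 'max_t_pd': 930, 'min_t_cd': 170, 'max_t_borrow': 50}
```

In this example the pulse is wide enough that the skew term does not set the overhead (the `max` picks `t_pdq`), but the same pulse width raises the minimum contamination delay to 170 ps. A wider pulse hides skew at the cost of a harder hold constraint.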


## **Dynamic Circuit Review**

- Static circuits are slow because fat pMOS transistors load the inputs
- Dynamic gates use precharge to remove pMOS transistors from the inputs
  - Precharge: $\phi = 0$, output forced high
  - Evaluate: $\phi = 1$, output may pull low



# **Domino Circuits**

- Dynamic inputs must monotonically rise during evaluation
  - Place inverting stage between each dynamic gate
  - Dynamic / static pair called domino gate
- Domino gates can be safely cascaded



# **Domino Timing**

- Domino gates are 1.5-2x faster than static CMOS
  - Lower logical effort because of reduced C<sub>in</sub>
- Challenge is to keep precharge off critical path
- Look at clocking schemes for precharge and eval
  - Traditional schemes have severe overhead
  - Skew-tolerant domino hides this overhead

# Traditional Domino Circuits

- Hide precharge time by ping-ponging between half-cycles
  - One evaluates while other precharges
  - Latches hold results during precharge



# Clock Skew

- Skew increases sequencing overhead
  - Traditional domino has hard edges
  - Evaluate at latest rising edge
  - Setup at latch by earliest falling edge




## **Time Borrowing**

- Logic may not exactly fit half-cycle
  - No flexibility to borrow time to balance logic between half cycles
- Traditional domino sequencing overhead is about 25% of cycle time in fast systems!
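A rough way to see where a figure like 25% can come from is sketched below. It assumes (this is not stated on the slide) that each half-cycle loses one latch setup time plus one clock skew at its hard edges, and that logic imbalance wastes some additional time because nothing can be borrowed. The function names and all numbers are hypothetical.

```python
def traditional_domino_overhead(t_setup, t_skew, t_imbalance=0):
    """Assumed model: both half-cycles pay latch setup and clock skew,
    plus any time wasted because logic does not fit the half-cycles evenly."""
    return 2 * (t_setup + t_skew) + t_imbalance


def overhead_fraction(t_c, overhead):
    """Share of the cycle time that is not available for useful logic."""
    return overhead / t_c


# Hypothetical values in ps: Tc = 1000, t_setup = 50, t_skew = 50, imbalance = 50.
overhead = traditional_domino_overhead(50, 50, 50)
print(overhead, overhead_fraction(1000, overhead))   # 250 ps, 0.25 of the cycle
```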



# Relaxing the Timing

- Sequencing overhead caused by hard edges
  - Data departs dynamic gate on late rising edge
  - Must set up at latch on early falling edge
- Latch functions
  - Prevent glitches on inputs of domino gates
  - Hold results during precharge
- Is the latch really necessary?
  - No glitches if inputs come from other domino
  - Can we hold the results in another way?

#### Skew-Tolerant Domino

- Use overlapping clocks to eliminate latches at phase boundaries.
  - Second phase evaluates using results of first




# Full Keeper

- After second phase evaluates, first phase precharges
- Input to second phase falls
  - Violates monotonicity?
- But we no longer need the value
- Now the second gate has a floating output
  - Need full keeper to hold it either high or low



#### **Time Borrowing**

- Overlap can be used to
  - Tolerate clock skew
  - Permit time borrowing
  - $-t_{hold}$  = required overlap
- No sequencing overhead





 $t_{pd} = T_c$ 


#### **Multiple Phases**

- With more clock phases, each phase overlaps more

- Permits more skew tolerance and time borrowing
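The effect of adding phases can be sketched with simple geometry. Assume (an illustrative assumption, not a statement from the slides) that N phases are evenly spaced by Tc/N and that each phase is high for a fixed fraction of the cycle. The overlap between consecutive phases is then the high time minus the spacing, and it has to cover hold time, skew, and any borrowed time. The function names and numbers are hypothetical, and the time needed for precharge is not modeled.

```python
def phase_overlap(t_c, n_phases, duty=0.5):
    """Overlap between consecutive phases: high time minus phase spacing."""
    return duty * t_c - t_c / n_phases


def max_borrow(t_c, n_phases, t_hold, t_skew, duty=0.5):
    """Time left for borrowing once the overlap has covered hold time and skew."""
    return phase_overlap(t_c, n_phases, duty) - t_hold - t_skew


# Hypothetical values in ps: Tc = 1000, t_hold = 30, t_skew = 50.
for n in (2, 4, 8):
    print(n, phase_overlap(1000, n), max_borrow(1000, n, 30, 50))
```

Under these assumptions two 50% duty-cycle phases do not overlap at all, four phases overlap by a quarter cycle, and eight phases by three eighths of a cycle, so the budget for skew and borrowing grows with the number of phases.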



#### **Clock Generation**


# Summary

- Clock skew effectively increases setup and hold times in systems with hard edges
- Managing skew
  - Reduce: good clock distribution network
  - Analyze: local vs. global skew
  - Tolerate: use systems with soft edges
- Flip-flops and traditional domino are costly
- Latches and skew-tolerant domino perform at full speed even with moderate clock skews.