## A 1.8V 1.1-GHZ N0VEL

## 8 x 8-BIT DIGITAL MULTIPLIER

## by

Amir Ali Khatibzadeh<br>B.Sc. in Electrical and Electronics<br>Khajeh-e Nasir University of Technology<br>Tehran, Iran 1996

A thesis<br>presented to Ryerson University<br>in partial fulfilment of the<br>requirement for the degree of<br>Master of Applied Science<br>in the program<br>Electrical \& Computer Engineering

## UMI Number: EC53466

## INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

## UMİ

## UMI Microform EC53466

Copyright 2009 by ProQuest LLC
All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346

Ann Arbor, MI 48106-1346

I hereby declare that I am the sole author of this thesis.
I authorize Ryerson University to lend this thesis to other institutions or individuals for the purpose of scholarly research.

Amir Ali Khatibzadeh

I further authorize Ryerson University to reproduce this thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research. Amir Ali Khatibzadeh
$\checkmark$

$$
\begin{aligned}
& 1 \\
& \mu
\end{aligned}
$$

Ryerson University requires the signatures of all persons using or photocopying this thesis. Please sign below, and give address and date.


#### Abstract

Amir Ali Khatibzadeh

\title{ A 1.8V 1.1GHz Novel Digital Multiplier }

Master of Applied Science in Electrical \& Computer Engineering

Ryerson University

Toronto, Ontario, Canada, 2004

This thesis presents the design of an $8 \times 8$-bit novel multiplier, which can provide a better performance than its counterparts in the sense that it has a fraction of the silicon area, delay and power consumption of the common architectures such as the conventional linear array multipliers.

At the system-level high performance is obtained by implementing a pair-wise multiplication algorithm. Also, parallel addition algorithm is used to add up partial products. Combining these two algorithms results in an efficient cell-based circuit realization. In the circuit-level, pseudo-NMOS full adder cell is chosen amongst the several existing full adder cells due to its superior speed and power performance.

The performance of this design has been evaluated by comparing it to those of the recently reported multipliers. The results of the comparison, both in theory and simulation, prove the superiority of the proposed multiplier.


## Acknowledgement

During journey through the Master program support and help from friends, family, and faculty can be invaluable. To begin with, I cannot stress enough that the single most important person contributing to the success of the student is the principle dissertation advisor. With this in mind, I would like to thank Prof. Kaamran Raahemifar for his help and guidance. His extensive knowledge of circuit and keen insight into VLSI design were major assets.

I am also grateful to many people who influenced me throughout my career at Ryerson (sorry if I missed any of you). I would like to thank Prof. Sridhar Krishnan, chair and director of Graduate Program in Department of Electrical and Computer Engineering at Ryerson, who has put his effort and intelligence to make a very vigorous atmosphere in Graduate Studies in the department.

More recently, I would like to thank Prof. Vadim Geurkov for serving on my oral exam committee. His feedback and insightful comments on the material of this thesis are greatly appreciated.

I would like to express my appreciation to Prof. Alireza Sadeghian for reading this thesis and for his helpful comments.

My special thanks and acknowledgements go to Shahab Ardalan,, with whom I spent numerous hours designing VLSI circuits. It has truly been an enlightening experience, and I am thankful for having this opportunity.

I would also like to thank Haleh Vahedi and Majid Malekan for their technical advices regarding the layout.

Last, but certainly not least, I would like to thank my dear friend, Dana, and my family and loved ones who stood by me throughout these years. I will not name you all, but I am eternally grateful. Words cannot express the warmth and gratitude I feel for each of you.

Fabrication supports through Canadian Microelectronics Corporation (CMC) is also gratefully acknowledged.

This thesis is dedicated to the late eminent scholar, Professor Abbas Sahab, my grandfather, who has been named the father of Iran's cartography.

## Table of Contents

Abstract ..... iv
Acknowledgements ..... v
List of Tables ..... x
List of Figures ..... xii
Chapter 1: Introduction ..... 1
1.1 Motivation ..... 1
1.2 Application ..... 2
1.3 Original Contributions ..... 3
1.4 Thesis Organization ..... 5
Chapter 2: Basic Concept of Multiplication ..... 6
2.1 Multiplication Definition ..... 7
2.2 Binary Multiplication ..... 7
2.3 Review of Parallel Multiplication Algorithms ..... 8
2.3.1 Bruan Algorithm ..... 9
2.3.2 Baugh-Wooley Algorithm ..... 10
2.3.3 The Modified Booth Algorithm ..... 14
2.3.4 Wallace Tree Algorithm ..... 17
2.3.5 The Proposed Pair-Wised Algorithm ..... 19
2.4 Qualitative Comparisons of Parallel Algorithms ..... 22
Chapter 3: Multiplier Design ..... 24
3.1 Pair-Wise Multiplier ..... 24
3.1.1 Circuit Level Review ..... 25
3.1.1.1 Full Adder ..... 25
3.1.1.1.1 Simulation Strategies ..... 34
3.1.1.1.2 Power Consumption Performance ..... 37
3.1.1.1.3 Delay Performance ..... 38
3.1.1.1.4 Performance Comparison ..... 39
3.1.1.2 Carry Lookahead Adder ..... 41
3.1.1.3 AND/NAND/OR/XOR Gates ..... 45
3.1.2 Cell Design ..... 47
Chapter 4: Simulation Results \& Layout Considerations ..... 59
4.1 Simulation Results of the Individual Circuits ..... 59
4.2 Final Simulation Results ..... 67
4.3 Layout Considerations ..... 86
4.3.1 Layout Strategies ..... 86
4.3.2.Pads, Package and Chip Size ..... 89
Chapter 5: Conclusion ..... 91
5.1 Features of the Designed Multiplier ..... 91
5.2 Comparison Results ..... 93
5.3 Future Work ..... 95
References ..... 98

## List of Tables

Table 2.1 Partial products selection ..... 15
Table 2.2 Partial product generation process ..... 15
Table 2.3 Partial product generation relation ..... 15
Table 3.1(a) Truth table of a full adder ..... 26
Table 3.1(b) Truth table of a half adder ..... 26
Table 3.2 Transistor dimension in complementary CMOS full adder ..... 28
Table 3.3 Transistor dimension in complementary pass-transistor full adder ..... 30
Table 3.4 Transistor dimension in double pass-transistor full adder (Sum) ..... 31
Table 3.5 Transistor dimension in double pass-transistor full adder ( $\mathrm{C}_{\text {out }}$ ) ..... 31
Table 3.6 Transistor dimension in transmission gate full adder ..... 32
Table 3.7 Transistor dimension in pseudo-NMOS full adder ..... 33
Table 3.8 Transistor dimension in XOR \& transmission gate full adder ..... 34
Table 3.9 Characteristic of the input signals ..... 37
Table 3.10 Simulation results for the full adders sorted by power consumption ..... 38
Table 3.11 Simulation results for the full adders sorted by propagation delay ..... 38
Table 3.12 Simulation results for the full adders sorted by power-delay product ..... 40
Table 3.13 Delay of the generate, propagate and sum signals of PFA ..... 44
Table 3.14 Truth table of AND, NAND, OR and XOR ..... 45
Table 3.15 Transistor dimension \& delay of AND, NAND, XOR and OR gates ..... 46
Table 4.1 Transitions covered by input pattern (a) ..... 61
Table 4.2 Transitions covered by input pattern (b) ..... 62
Table 4.3 Transitions covered by input pattern (c) ..... 63
Table 4.4 Transitions covered by input pattern (d) ..... 64
Table 4.5 The results of the worst-case delay measurement ..... 71
Table 4.6 The numerical results of several intentional multiplications ..... 82
Table 4.7 The numerical results of several random multiplications
sorted by delay ..... 82
Table 5.1 Performance of the proposed multiplier ..... 92
Table 5.2 Summary of the performance of the recent publications on
digital multiplier ..... 94

## List of Figures

Fig. 2.1 Partial products of an $8 \times 8$-bit unsigned integer multiplication ..... 10
Fig. 2.2 Braun's array multiplier ..... 10
Fig. 2.3. Illustration of the partial product terms in Baugh-Wooley algorithm ..... 12
Fig. 2.4. Reorganization of the partial product terms of Fig. 2.3 ..... 12
Fig. 2.5. $8 \times 8$-bit Baugh-Wooley two's complement regular array ..... 13
Fig. 2.6 Block diagram of the nx n multiplier using modified Booth algorithm ..... 17
Fig. 2.7 Construction of Wallace's tree for an $8 \times 8$-bit multiplier, reduction of the
8 partial products with 4-2 compressors ..... 18
Fig. 2.8 Architecture of Wallace's tree for an $8 \times 8$-bit multiplier ..... 18
Fig. 2.9 Block diagram for the $8 \times 8$-bit pair-wise multiplier ..... 21
Fig. 3.1 Schematic of complementary CMOS full adder ..... 28
Fig. 3.2 Schematic of complementary pass-transistor full adder ..... 29
Fig. 3.3(a) Schematic of double pass-transistor full adder (Sum) ..... 30
Fig. 3.3(b) Schematic of double pass-transistor full adder ( $\mathrm{C}_{\text {out }}$ ) ..... 31
Fig. 3.4 Schematic of transmission gate full adder ..... 32
Fig. 3.5 Schematic of pseudo-NMOS full adder ..... 33
Fig. 3.6 Schematic of XOR \& transmission gate full adder ..... 34
Fig. 3.7 (a) Input patterns used to evaluate the performance of the adders ..... 35
Fig. 3.7 (b) Input patterns used to evaluate the performance of the adders ..... 35
Fig. 3.7 (c) Input patterns used to evaluate the performance of the adders ..... 36
Fig. 3.7 (d) Input patterns used to evaluate the performance of the adders ..... 36
Fig. 3.8 Propagation delay measurement ..... 39
Fig. 3.9 Block diagram of 4-bit carry lookahead adder ..... 43
Fig. 3.10 Gate-level implementation of partial full adder (PFA) ..... 43


#### Abstract

Fig. 3.11 Block diagram of the 16 -bit carry lookahead adder implemented by cascading four 4-bit carry lookahead modules 44


Fig. 3.12 Schematic of (a) AND (b) NAND (c) XOR (d) OR gates 46
Fig. 3.13 Block diagram of the proposed $8 \times 8$-bit multiplier showing detail of the required cells 47

Fig. 3.14 Block diagram of AND generator 48
Fig. 3.15 Gate-level of the AND plane $\left(X_{i} Y_{j}\right.$ Cell $) 49$
Fig. 3.16 Block diagram of partial products generator 50
Fig. 3.17 Gate-level of one partial product generator (PP Cell) 52
Fig. 4.1 Circuit structure used for simulation of full adder cell 60
Fig. 4.2 Circuit structure used for simulation of AND/NAND/OR/XOR gates 60
Fig. 4.3 The simulation waveforms showing respond of the pseudo full adder to the input pattern (a)61

Fig. 4.4 The simulation waveforms showing respond of the pseudo full adder to the input pattern (b)

Fig. 4.5 The simulation waveforms showing respond of the pseudo full adder to the input pattern (c)63

Fig. 4.6 The simulation waveforms showing respond of the pseudo full adder to the input pattern (d)64

Fig. 4.7 The simulation waveforms showing respond of the AND/NAND gate 65
Fig. 4.8 The simulation waveforms showing respond of OR gate 66
Fig. 4.9 The simulation waveforms showing respond of XOR gate 66
Fig. 4.10 The critical path of the proposed multiplier 68
Fig. 4.11 Delay of AND gate 69
Fig. 4.12 The worst-case delay of pseudo-NMOS adder occurring when
$A=1, B=1$ and $C_{i n}=1$

Fig. 4.13 The worst-case delay of 16-bit carry lookahead adder 70
Fig. 4.14 Input waveforms of $X=11111111$ and $Y=11111111 \quad 72$
Fig. 4.15 The post layout simulation waveforms showing results of "11111111"x"11111111" 73

Fig. 4.16 The measured delay of the proposed multiplier corresponding to
"11111111"x"1111111"74

Fig. 4.17 The waveform of current drawn by $V_{d d}$ node
("11111111"x"11111111")
Fig. 4.18 The waveform of current drawn by $V_{d d}$ node
("10101010"x"01010101") 79
Fig. 4.19 waveform of the input patterns " 10101010 " x "01010101" 80

Fig 4.20 The simulation waveforms of multiplication product of the input patterns shown in Fig 4.1981

Fig. 4.21 Adding two extra pins to $V_{d d}$ and $V_{s s}$ reducing the parasitic inductance 84 Fig. 4.22 The simulation results of the final outputs against temperature variation 85 Fig. 4.23 Layout of the pseudo-NMOS full adder (die size of $22 \times 8.5 \mu \mathrm{~m}^{2}$ ) 87

Fig. 4.24 Layout of AND gate (die size of $7.9 \times 5.6 \mu \mathrm{~m}^{2}$ ) 87
Fig. 4.25 Layout of NAND gate (die size of $5.4 \times 5.6 \mu \mathrm{~m}^{2}$ ) 88
Fig. 4.26 Layout of XOR gate (die size of $10.2 \times 20.7 \mu^{2}$ ) 88
Fig. 4.27 Layout of core of the $8 \times 8$-bit proposed multiplier core
(die size of $0.275 \times 0.38 \mathrm{~mm}^{2}$ ) 89
Fig. 4.28 The proposed $8 \times 8$-bit multiplier chip 90

## Chapter 1

## Introduction

### 1.1 Motivation

The core of every microprocessor, digital signal processor (DSP), and data processing application-specific integrated circuit (ASIC) is its data path. It is often the crucial circuit component if die area, power consumption, and especially operation speed are of concerns. At the heart of data path and addressing units are arithmetic units, such as comparators, adders, and multipliers. Finally, one of the basic operations found in most arithmetic components is binary multiplication. Besides simply multiplying two numbers, multipliers are also used in more complex operations like address calculation and division. Also, simpler operation like magnitude comparison is based on binary multiplication.

Multiplication is also very critical if implemented in hardware because it involves an expensive carry-propagation step, when partial product addition is performed. The efficient implementation of multiplication operation in an integrated circuit is a key problem in VLSI design. Designing fast and power-efficient multiplier has been of great theoretical and practical interest for computer scientists and engineers. Several algorithms and various VLSI implementations have been proposed $[1,2,3,4,5,6,7]$ and practically used.

In order to achieve high performance multiplier, it is necessary to operate very efficiently in terms of speed and power trade-off in all design levels. Increasing the operating speed of the circuits to make more computations with lower power consumption is the main motivation in multiplier design.

The recent progress in use of Ultra Deep Sub-Micron Devices (UDSM) helps to overcome the area constraint. Employing advanced cell-based architectures constantly improves productivity in ASIC design. Taking all of these into account, implementing low-power circuit techniques based on a fast multiplication algorithm is still a costeffective and feasible alternative for increasing the performance of the multipliers substantially. Therefore, this thesis deals with designing a novel multiplier with inherent high-speed characteristic and power efficient performance.

### 1.2 Applications

Wireless communication systems, including third generation cellular radio systems and wireless Local Area Network Systems (LANS), have become tremendously popular in recent years. These systems can be implemented using various platforms, such as digital signal processors, ASICs and Field Programmable Gate Arrays (FPGAs). Most digital signal processing systems incorporate a multiplication unit to implement algorithms such as correlation, convolution, filtering and frequency analysis. These algorithms are used in applications such as finite impulse filters (FIR), infinite impulse filters (IIR), discrete cosine transforms (DCT), and fast Fourier transforms (FFT). Moreover, there has been a rapid increase in the popularity of portable and wireless electronic devices, such as laptop computers, portable video players and cellular phones, which rely on embedded digital processors. Since the desire is to design digital systems for communication applications
at the best performance without scarifying power, high performance and low power multipliers are inevitable.

### 1.3 Original Contributions

This thesis presents the design of novel multiplier architecture, with superior performance in speed, power consumption and area compared to traditional array multipliers.

In order to achieve architecture with high performance several foremost parallel multiplication algorithms have been studied and compared. Braun, Baugh-Wooley, Wallace and pair-wise algorithms have been reviewed in detail. Among these algorithms pair wise algorithm has been chosen due to its superiority in speed of operation.

The $8 \times 8$-bit multiplier based on this algorithm operates by:

1) generating four 8-bit ( $X_{e}, X_{o}, Y_{e}, Y_{o}$ ) numbers using even and odd positions of the multiplicand $(X)$ and multiplier $(Y)$.
2) multiplying these four 8 -bits numbers to generate the four 15 -bit numbers ( $P_{e e}, P_{e o}, P_{o e}, P_{o o}$ ) known as the even and odd elements of the partial products ( $P$ ).
3) adding the result of the multiplication of elements of partial products. The addition is performed in four steps by using 3-to-2 adding technique which results in two 15-bit numbers.
4) adding two final 16-bit numbers $\left(P_{e}, P_{o}\right)$ and thus generating the product of multiplication via a fast carry lookahead.

In the first step of design flow, topology selection, six full adder cells based on CMOS static logic styles are redesigned and examined at transistor-level in standard $0.18 \mu$ CMOS technology. The results of the extensive evaluation, which are further presented in

Chapter 3, prove that 14-transistor pseudo-NMOS full adder cell offers a better speed and power trade-off with less numbers of transistors [8, 9, 10].

The validity of the design strategy is by proven by testing the complete multiplier and measuring the speed and power. All the designs are simulated using Cadence Computer Aided Design (CAD) Tool in $0.18 \mu \mathrm{~m}$ CMOS technology at 1.8 V supply voltage. In summary a speed/power efficient novel multiplier for medium bit width applications is designed in this thesis. Leading by a quantitative analysis of the characteristics of static CMOS logic adders, several topologies are examined to support the final circuit design. The major contributions of the thesis are summarized as follows:

- An in-depth comparative analysis of the characteristics of static CMOS adder cells is conducted, and useful insights are obtained.
- Power reduction through algorithm selection is achieved by:
a) Minimizing the number of operations and, hence, the number of hardware resources (half adder cells used anywhere possible)
b) Reducing the number of complex operations by transforming mathematic expressions (cascading four 4-bit carry lookahead adders instead of implementing a 16-bit carry lookahead structure, which requires complex logic operation)
- Power reduction through circuit/logic is achieved by using static style rather than dynamic style. This causes the architectural level to be free from clock and related clocking issues such as clock skew and high dynamic power.
- Flexibility in delay modeling in system-level in such a way that modifying the entire multiplier for different speed requirements is straightforward.
- The performance of the proposed multiplier is well enhanced by considering transistor chaining, grouping, and signal sequencing in the adder layout which is
proven to provide substantial power saving and speed improvement at no area penalty.

These original contributions have been published in two conference proceedings [9, 10].

### 1.4 Thesis Organization

This thesis consists of 5 chapters and is organized as:
Following the introductory Chapter 1, Chapter 2 describes the basic concept of two's complement multiplication. The most known parallel multiplication algorithms used in VLSI implementation along with the pair-wise multiplication algorithm are introduced and a brief qualitative comparison of these algorithms is presented.

In Chapter 3, first the top-level design of pair-wise multiplier is presented. Topology selection of the main elements as a result of an extensive performance analysis on adder cells further reviewed. The circuit design of the required cells for pair-wise structure is also discussed.

Chapter 4, is dedicated to the simulation results of individual circuits and cells as well as the final simulation results of the proposed multiplier. Layout considerations are also discussed.

Finally, Chapter 5 presents the features of the Designed Multiplier. A comparative study of the previous works on multipliers is presented to better evaluate on this work. Drawing conclusion, summarizing the contributions of this thesis, and outlining the directions for the future investigations bring this chapter to an end.

## Chapter 2

## Basic Concepts of Multiplication

Multiplication is one of the main arithmetic operations. Multiplier represent a fundamental building block which is being widely used in many Very Large-Scale Integrated (VLSI) systems such as application-specific Digital Signal Processing (DSP) architectures, microprocessors and systems which implement filtering, encryption, security processing and image processing. In addition to their main task, which is multiplying two binary numbers, multipliers are the nucleus of many other useful operations such as division and address calculation. In these systems the multipliers are the part of the critical path that determines the overall performance of the system. That is why enhancing the performance of multiplier is a significant goal.

Parallel to high-speed system design [11], low-power systems [1] are highly in demand because of the fast growing technologies in mobile communication and computation. The battery technology does not advance at the same rates as the microelectronics technology. There is a limited amount of power available for mobile systems. Thus, designers are faced with more constraints; high-speed, high throughput, small silicon area and at the same time, low-power consumption. Therefore, low power, high-performance multiplier is of great interest.

Current architectures range from small, low performance array to tree multipliers. Conventional linear array multipliers achieve high performance in a regular structure, but require large area of silicon. Tree structures achieve even higher performance than linear
arrays but the tree interconnection is more complex and less regular, making them even larger than linear arrays. Ideally, one would wish the speed benefits of a tree structure, the regularity of an array multiplier, and the small size of a shift and add multipliers.

The first section of this Chapter explains the basics of binary multiplication. A review on the most known parallel multiplication algorithms is presented in Section 2.3. The pairwise multiplication algorithm that has been used in the proposed multiplier is also described. These algorithms are, then, briefly compared against each other at the end of this Chapter.

### 2.1 Multiplication Definition

Multiplication is defined as "a mathematical operation that at its simplest form is an abbreviated process of adding an integer to itself a specified number of times." A number (multiplicand) is added to itself a number of times as specified by another number (multiplier) to form a result (product). Multiplication starts with placing the multiplicand on top of the multiplier. The multiplicand is then multiplied by each digit of the multiplier beginning with the rightmost, Least Significant Digit (LSD). Intermediate results (partial products) are placed one atop the other, offset by one digit to align digits of the same weight. The final product is determined by summation of all the partial products. This technique applies equally to any base, including binary.

### 2.2 Binary Multiplication

In the binary number system the digits, called bits, are limited to the set $[0,1]$. The result of multiplying any binary number by a single binary bit is either 0 , or the original number. This makes forming the intermediate partial products simple and efficient.

Summing these partial products is the time-consuming task for binary multipliers. One logical approach is to form the partial-products one at a time and sum them as they are generated. This technique works fine but is slow. For applications where this approach does not provide good enough performance, another approach is used which is known as parallel multiplication algorithms. In this latter approach all bit-products are generated in parallel and a multi-operand adder (i.e., an adder tree) is used for their accumulation. Multipliers that operate based on these algorithms are called parallel multipliers. Parallel multipliers are becoming the key components in Reduced Instruction Set Computers (RISCs), DSP and graphic accelerators due to their inherent higher speed of operation. This brings parallel multiplication to the main focus of our discussion.

### 2.3 Review of Parallel Multiplication Algorithms

Since multiplication is one of the most critical operations in many computational systems, many algorithms have been proposed to perform multiplication, each offering different advantages and having tradeoffs in terms of speed, circuit complexity, area and power consumption. Among the multipliers reported parallel multipliers have been of great theoretical and practical interests for VLSI designers not only for their speed of operation but also for their ease of implementation.

The structure of all parallel multipliers can be partitioned into three parts performing three major tasks:
a) Partial product generation.
b) Carry-free addition.
c) Carry-propagation addition.

These three parts can be implemented using different schemes such as simple AND gate or Booth algorithm to generate partial products. The carry-free addition task is often implemented by using a Wallace tree or redundant binary addition tree.

In the following section four well-known parallel algorithms as well as pair-wise algorithm, which have been used in VLSI implementation of digital multipliers, are briefly presented. The readers can consult references [11,12] for more details on parallel multiplication algorithms.

### 2.3.1 Braun Algorithm

Consider two unsigned numbers $X=X_{n-I} \ldots X_{I} X_{0}$ and $Y=Y_{n-I} \ldots Y_{I} Y_{0}$, where

$$
\begin{align*}
& X=\sum_{i=0}^{i=n-1} X_{i} 2^{i}  \tag{2.1}\\
& Y=\sum_{i=0}^{i=n-1} Y_{i} 2^{i} \tag{2.2}
\end{align*}
$$

The product $P=P_{2 n-I} \ldots \ldots P_{l} P_{0}$, which results from multiplying the multiplicand $X$ by the multiplier $Y$, can be written in the following form

$$
\begin{equation*}
P=\sum_{i=0}^{i=n-1} \sum_{j=0}^{j=n-1}\left(X_{i} Y_{j}\right) 2^{i+j} \tag{2.3}
\end{equation*}
$$

Each of the partial product terms $P_{k}=X_{i} Y_{j}$ is called a summand. Fig.2.1 shows an example of an $8 \times 8$-bit multiplication.

The summands are generated in parallel with AND gates. Fig. 2.2 shows the Braun's array multiplier [4]. Such an $n \times n$ multiplier requires $n \times(n-1)$ adders and $n^{2}$ AND gates. The delay of such a multiplier is determined by the delay of the full adder cell and the final adder in the last row. In the multiplier array a full-adder with balanced carry and
sum delays is desirable because the sum and carry signals are both on the critical path.
For the large arrays, the speed and power of the full adder are both very important.


Fig. 2.1 Partial products of an $8 \times 8$-bit unsigned integer multiplication


Fig. 2.2 Braun's array multiplier

### 2.3.2 Baugh-Wooley Algorithm

Baugh-Wooley is one of the developed algorithms for parallel multiplication, which has been used in VLSI architectures [12]. Multipliers based on this algorithm are used for
direct multiplication of two's complement numbers. This direct approach does not need any two's complementing operations prior to multiplication. Using the Baugh-Wooley algorithm, the product of two numbers $X$ and $Y$ expressed in two's complement,

$$
\begin{align*}
& X=-X_{n-1} 2^{n-1}+\sum_{i=0}^{i=n-2} X_{i} 2^{i},  \tag{2.4}\\
& Y=-Y_{n-1} 2^{n-1}+\sum_{i=0}^{i=n-2} Y_{i} 2^{i}, \tag{2.5}
\end{align*}
$$

is given by

$$
\begin{equation*}
P=X Y=X_{n-1} Y_{n-1} 2^{2 n-2}+\sum_{i=0}^{i=n-2} \sum_{j=0}^{j=n-2} X_{i} Y_{j} 2^{i+j}-X_{n-1} \sum_{i=0}^{i=n-2} Y_{i} 2^{n+i-1}-Y_{n-1} \sum_{i=0}^{i=n-2} X_{i} 2^{n+i-1} \tag{2.6}
\end{equation*}
$$

In order to avoid the use of subtractor cells and use only adders, the negative terms should be transformed. So

$$
\begin{equation*}
-X_{n-1} \sum_{i=0}^{i=n-2} Y_{i} 2^{n+i-1}=X_{n-1}\left(-2^{2 n-2}+2^{n-1}+\sum_{i=0}^{i=n-2} \bar{Y}_{i} 2^{n+i-1}\right) . \tag{2.7}
\end{equation*}
$$

Using this property in equation (2.5), the product $P$ becomes

$$
\begin{align*}
P= & X Y=-2^{2 n-1}+\left(\bar{X}_{n-1}+\bar{Y}_{n-1}+X_{n-1} Y_{n-1}\right) \cdot 2^{2 n-2}+ \\
& \sum_{i=0}^{i=n-2} \sum_{j=0}^{j=n-2} X_{i} Y_{j} 2^{i+j}+\left(X_{n-1}+Y_{n-1}\right) \cdot 2^{n-1}+X_{n-1} \sum_{i=0}^{i=n-2} \bar{Y}_{i} 2^{n+i-1}+Y_{n-1} \sum_{i=0}^{i=n-2} \bar{X}_{i} 2^{n+i-1} \tag{2.8}
\end{align*}
$$

From Equation 2.8, it can be seen that the multiplication of two numbers, expressed in two's complement representation, can be written in a form which involves only positive bit products. The product is, then, obtained by adding a constant to the final result. All the partial product terms to generate the above product are explicitly shown in Fig. 2.3. A simple reorganization of Fig. 2.3 results in the array of partial product shown in Fig. 2.4, which is a modified version of the original Baugh-Wooley algorithm.

It can be seen that half adder, full adder, NAND and AND gates are the required elements by Baugh-Wooley algorithm to perform two's complement multiplication.


Fig. 2.3 Illustration of the partial product terms in Baugh-Wooley algorithm

$$
\begin{aligned}
& \overline{X_{1} Y_{8}} \quad X_{1} Y_{7} \quad X_{1} Y_{6} \quad X_{1} Y_{5} \quad X_{1} Y_{4} \quad X_{1} Y_{3} \quad X_{1} Y_{2} \quad X_{1} Y_{1} \\
& \overline{X_{2} Y_{8}} \quad X_{2} Y_{7} \quad X_{2} Y_{6} \quad X_{2} Y_{5} \quad X_{2} Y_{4} \quad X_{2} Y_{3} \quad X_{2} Y_{2} \quad X_{2} Y_{1} \\
& \overline{X_{3} Y_{8}} \quad X_{3} Y_{7} \quad X_{3} Y_{6} \quad X_{3} Y_{5} \quad X_{3} Y_{4} \quad X_{3} Y_{3} \quad X_{3} Y_{2} \quad X_{3} Y_{1} \\
& \overline{X_{4} Y_{8}} \quad X_{4} Y_{7} \quad X_{4} Y_{6} \quad X_{4} Y_{5} \quad X_{4} Y_{4} \quad X_{4} Y_{3} \quad X_{4} Y_{2} \quad X_{4} Y_{1} \\
& \overline{X_{5} Y_{8}} \quad X_{5} Y_{7} \quad X_{5} Y_{6} \quad X_{5} Y_{5} \quad X_{5} Y_{4} \quad X_{5} Y_{3} \quad X_{5} Y_{2} \quad X_{5} Y_{1} \\
& \overline{X_{6} Y_{8}} \quad X_{6} Y_{7} \quad X_{6} Y_{6} \quad X_{6} Y_{5} \quad X_{6} Y_{4} \quad X_{6} Y_{3} \quad X_{6} Y_{2} \quad X_{6} Y_{1} \\
& \begin{array}{lllllllllll} 
& & \overline{X_{7} Y_{8}} & X_{8} Y_{8} & \overline{X_{8} Y_{7}} & \frac{X_{7} Y_{7}}{X_{8} Y_{6}} & \frac{X_{7} Y_{6}}{X_{8} Y_{5}} & \frac{X_{7} Y_{5}}{X_{8} Y_{4}} & \frac{X_{7} Y_{4}}{X_{8} Y_{3}} & \frac{X_{7} Y_{3}}{X_{8} Y_{2}} & \frac{X_{7} Y_{2}}{X_{8} Y_{1}}
\end{array}
\end{aligned}
$$

Fig. 2.4 Reorganization of the partial product terms of Fig. 2.3

This algorithm is suitable for applications where operands with less than 16 bits are to be processed. Digital filters where small operands are used (e.g. 6, 8 and 12), are examples of such applications.

Fig. 2.5 shows the array architecture of an $8 \times 8$-bit Baugh-Wooley two's complement.

For operands equal to or greater than 16 -bits, the Baugh-Wooley scheme becomes area consuming and slow. Hence, techniques to reduce the size of the array, while maintaining the regularity, are required.


Fig. $2.58 \times 8$-bit Baugh-Wooley two's complement regular array

### 2.3.3 The Modified Booth Algorithm

For operands equal to or greater than 16-bits, the modified Booth algorithm [13] has been extensively used. It is based on encoding the two's complement operand (i.e., multiplier) in order to reduce the number of partial products to be added.

This makes the multiplier faster and uses less hardware (area). For example, the modified Radix-2 algorithm is based on partitioning the multiplier into overlapping groups of 3bits, and each group is decoded to generate the correct partial product.

The multiplier, $Y$, in the two's complement can be written as:

$$
\begin{equation*}
Y=-Y_{n-1} 2^{n-1} \sum_{i=0}^{i=n-2} Y_{i} 2^{i} \tag{2.9}
\end{equation*}
$$

This can be rewritten as:

$$
\begin{equation*}
Y=\sum_{i=0}^{i=n-2}\left(Y_{2 i-1}+Y_{2 i}-2 Y_{2 i+1}\right) \cdot 2^{2 i} \text { with } Y_{\cdot I}=0 \tag{2.10}
\end{equation*}
$$

In Equation 2.10, the terms in brackets assume values from the set $\{-2,-1,0,+1,+2\}$. The encoding of $Y$, using the modified Booth algorithm, generates another number with the following five signed digits, $-2,-1,0,+1,+2$. As illustrated in Table 2.1, each encoded digit in the multiplier performs a certain operation on the multiplicand $X$.

The bits of the multiplier $(Y)$ are partitioned into groups of overlapping 3-bits and each group permits the generation of certain partial products. The five possible multiplies of the multiplicand are generated based on the procedure given in Table 2.2.

The general partial product is related to the multiplicand for each encoded digit by the relationships presented in Table 2.3. $P P_{i}$ is the partial product and $P P_{i}$ is also the sign bit of the partial product with $P_{n}=P_{n-I}$ when no shifting of the partial product is performed. Note that the partial product is requested on $(\mathrm{n}+1)$ bits.

Table 2.1. Partial products selection

| $\boldsymbol{Y}_{2 i+1}$ | $\boldsymbol{Y}_{2 i}$ | $\boldsymbol{Y}_{2 i-l}$ | Recoded Digit | Operation on $\boldsymbol{X}$ |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 | $0 \times \boldsymbol{X}$ |
| 0 | 0 | 1 | +1 | $+1 \times \boldsymbol{X}$ |
| 0 | 1 | 0 | +1 | $+1 \times \boldsymbol{X}$ |
| 0 | 1 | 1 | +2 | $+2 \times \boldsymbol{X}$ |
| 1 | 0 | 0 | -2 | $-2 \times \boldsymbol{X}$ |
| 1 | 0 | 1 | -1 | $-1 \times \boldsymbol{X}$ |
| 1 | 1 | 0 | -1 | $-1 \times \boldsymbol{X}$ |
| 1 | 1 | 1 | 0 | $0 \times \boldsymbol{X}$ |

Table 2.2. Partial product generation process

| Recoded Digit | Operation on $X$ |
| :---: | :---: |
| 0 | Add 0 to the partial product |
| +1 | Add $X$ to the partial product |
| +2 | Shift left $\boldsymbol{X}$ one position and add it to the partial |
| product |  |
| -1 | Add two's complement of $\boldsymbol{X}$ to the partial product |
| -2 | Take two's complement of $\boldsymbol{X}$ and shift left one position |

Table 2.3. Partial product generation relation

| Recoded Digit | Operation on $\boldsymbol{X}$ |  | Added to LSB |
| :---: | :--- | :--- | :---: |
| 0 | $\boldsymbol{P} P_{i}=0$ | for $\mathrm{i}=0, \ldots, \mathrm{n}$ | 0 |
| +1 | $\boldsymbol{P} \boldsymbol{P}_{\boldsymbol{i}}=\boldsymbol{X}_{\boldsymbol{i}}$ | for $\mathrm{i}=0, \ldots, \mathrm{n}$ | 0 |
| +2 | $\boldsymbol{P} \boldsymbol{P}_{\boldsymbol{i}}=\boldsymbol{X}_{\boldsymbol{i}-1}$ | for $\mathrm{i}=0, \ldots, \mathrm{n}$ | 0 |
| -1 | $\boldsymbol{P} \boldsymbol{P}_{\boldsymbol{i}}=\boldsymbol{X}_{\boldsymbol{i}}$ | for $\mathrm{i}=0, \ldots, \mathrm{n}$ | 1 |
| -2 | $\boldsymbol{P} \boldsymbol{P}_{\boldsymbol{i}}=\boldsymbol{X}_{\boldsymbol{i}-\boldsymbol{I}}$ | for $\mathrm{i}=0, \ldots, \mathrm{n}$ | 1 |

Bits are grouped into 3-bit groups overlapping by one bit. A bit with a value of zero is added on the right side of $Y$ as $Y_{-l}$. So the multiplication of two 8-bit numbers generates only 4 partial products. The number of partial products is then reduced by half.

In order to make the array rectangular and thus more regular for VLSI implementation, the problem of the sign extension must be addressed. This problem is more crucial when the operand lengths are wide, where each partial product must be sign-extended to the length of the product. The basic idea is to use two extra bits in the partial product. For the first partial product, the two additional bits, $P P_{n+1}$ and $P P_{n+2}$ are equal to the sign bit of the partial product

$$
\begin{equation*}
P P_{n+2}=P P_{n+1}=P P_{n} . \tag{2.11}
\end{equation*}
$$

For the second partial product, if the first partial product was positive, then the two additional bits for this second partial product are given by the Equation 2.11, otherwise we have two different cases

$$
\begin{equation*}
P P_{n+2}=P P_{n+1}=1 \quad \text { if } \quad P P_{n}=0 \tag{2.12}
\end{equation*}
$$

and

$$
\begin{equation*}
P P_{n+2}=P P_{n+1}=1 \quad \text { if } \quad P P_{n}=0 \tag{2.13}
\end{equation*}
$$

So it is more interesting to use a third bit $F$ as a flag to indicate whether there is, from the previous partial, a negative sign bit to be propagated. $F_{l}$ is the flag generated by the first partial product to the next one. This flag is expressed by the following Boolean equation

$$
\begin{equation*}
F_{j+1}=F_{j}+P P_{n j}, \tag{2.14}
\end{equation*}
$$

where $P P_{n, j}$ is the sign bit of the $\mathrm{j}^{\text {th }}$ partial product.
Fig. 2.6 shows the block diagram of an $\mathrm{n} \times \mathrm{n}$ modified Booth multiplier. Furthermore, the figure gives an idea about the floorplan of this subsystem.

The diagram is composed of the following blocks:
a) The multiplier array containing partial product's generation and 1-bit adders.
b) The Booth encoder and the sign extension bits $\left(P P_{n+2}, P P_{n+1}, F\right)$.
c) The Booth encoder generates the five signals $(0,+1,+2 x,-1 x$, and $-2 x)$ for each group of 3-bit of $Y$.
d) The final stage adder performs 2 n bits addition.


Fig. 2.6 Block diagram of the $\mathrm{n} \times \mathrm{n}$ multiplier using modified Booth algorithm

The Booth multiplier exhibits some glitches. The main reason for glitches is the race condition between the multiplicand and the multiplier due to the Booth encoder.

### 2.3.4 Wallace Tree Algorithm

As seen in the previous section, applying the Booth algorithm reduces the number of partial products by half. However, for large multipliers such as 32 -bit and over, the number of the partial products is over 16 bits. In such cases, better performance is achieved by adopting the Wallace tree using 4-2 compressors [12]. A 4-2 compressor accepts 4 numbers and a carry in, and sums them to produce 2 numbers and a carry out. Fig. 2.7 shows an example of such a tree on partial products of an unsigned $8 \times 8$-bit multiplier. Eight partial products are produced. Using 4-2 compressors, two levels of additions (stages) are needed. The final two summands are added using a fast 16 -bits adder. Some zeros are added to the array. This example shows that the bits which are not
used in this $1^{\text {st }}$ stage (level) jump to the next stage to be combined with the ones produced by the compressors.


Fig. 2.7 Construction of Wallace's tree for an $8 \times 8$-bit multiplier, reduction of the 8 partial products with 4-2 compressors

Fig 2.8 shows the architecture of the $8 \times 8$-bit multiplier. As one can see for the first stage of the tree two blocks, A and B , are required.


Fig.2.8 Architecture of Wallace's tree for an $8 \times 8$-bit multiplier

The block A of the compressor would group the first (last) four partial products, respectively.

To further enhance the performance of Wallace tree multiplier, the modified Booth algorithm can be used to reduce the number of partial products by half in a carry-save adder array. This architecture exhibits some irregularities in the layout since it has a complicated interconnection scheme. Hence, the interconnection wires affect the speed and power consumption of the adder.

### 2.3.5 The Proposed Pair-Wise Algorithm

This algorithm is based on generating n-bit numbers using even and odd positions of the two n-bit numbers [14]. Then, parallel addition algorithm is used to add up partial products.

If we assume the two multiplicands X and Y are 8-bit numbers as follow:

$$
\begin{align*}
& \mathrm{X}=\left\langle X_{8}, X_{7}, X_{6}, X_{5}, X_{4}, X_{3}, X_{2}, X_{1}\right\rangle  \tag{2.15}\\
& \mathrm{Y}=\left\langle Y_{8}, Y_{7}, Y_{6}, Y_{5}, Y_{4}, Y_{3}, Y_{2}, Y_{1}\right\rangle \tag{2.16}
\end{align*}
$$

each can be represented by the sum of two numbers, namely, $X=X_{e}+X_{o}$ and $Y=Y_{e}+Y_{o}$, which are defined as follows;

$$
\begin{align*}
& \mathrm{X}_{\mathrm{e}}=\left\langle X_{8}, 0, X_{6}, 0, X_{4}, 0, X_{2}, 0\right\rangle  \tag{2.17a}\\
& \mathrm{X}_{\mathrm{o}}=\left\langle 0, X_{7}, 0, X_{5}, 0, X_{3}, 0, X_{I}\right\rangle  \tag{2.17b}\\
& \mathrm{Y}_{\mathrm{e}}=\left\langle Y_{8}, 0, Y_{6}, 0, Y_{4}, 0, Y_{2}, 0\right\rangle  \tag{2.18a}\\
& \mathrm{Y}_{\mathrm{o}}=\left\langle 0, Y_{7}, 0, Y_{5}, 0, Y_{3}, 0, Y_{I}\right\rangle \tag{2.18b}
\end{align*}
$$

Consequently, the product $X \times Y$ can be written as

$$
\begin{align*}
X x Y= & \left(X_{e}+X_{o}\right)\left(Y_{e}+Y_{o}\right)=X_{e} Y_{e}+X_{e} Y_{o}+X_{o} Y_{e}+X_{o} Y_{o} \\
= & P_{e e}+P_{e o}+P_{o e}+P_{o o} . \tag{2.19}
\end{align*}
$$

Expanding these terms allows one to see the advantages of writing the multiplication in the form of Equation 2.19.

$$
\begin{align*}
& \mathrm{P}_{\mathrm{ee}}=\quad\left(X_{8} Y_{8}\right) \times 2^{14}+0 \times 2^{13}+\left(X_{8} Y_{6}+X_{6} Y_{8}\right) \times 2^{12}+0 \times 2^{11}+\left(X_{8} Y_{4}+X_{6} Y_{6}+X_{4} Y_{8}\right) \\
& \\
& \times 2^{10}+0 \times 2^{9}+\left(X_{8} Y_{2}+X_{6} Y_{4}+X_{4} Y_{6}+X_{2} Y_{8}\right) \times 2^{8}+0 \times 2^{7}+\left(X_{6} Y_{2}+X_{4} Y_{4}+\right. \\
& \\
& \left.X_{2} Y_{6}\right) \times 2^{6}+0 \times 2^{5}+\left(X_{4} Y_{2}+X_{2} Y_{4}\right) \times 2^{4}+0 \times 2^{3}+\left(X_{2} Y_{2}\right) \times 2^{2}+0 \times 2^{1}+0 \times  \tag{2.20}\\
& 2^{0}
\end{align*}
$$

$P_{00}=\quad 0 \times 2^{14}+0 \times 2^{13}+\left(X_{7} Y_{7}\right) \times 2^{12}+0 \times 2^{11}+\left(X_{5} Y_{7}+X_{7} Y_{5}\right) \times 2^{10}+0 \times 2^{9}+$ $\left(X_{3} Y_{7}+X_{5} Y_{5}+X_{7} Y_{3}\right) \times 2^{8}+0 \times 2^{7}+\left(X_{1} Y_{7}+X_{3} Y_{5}+X_{5} Y_{3}+X_{7} Y_{1}\right) \times 2^{6}+0 \times$ $2^{5}+\left(X_{I} Y_{5}+X_{3} Y_{3}+X_{5} Y_{I}\right) \times 2^{4}+0 \times 2^{3}+\left(X_{I} Y_{3}+X_{3} Y_{I}\right) \times 2^{2}+0 \times 2^{1}+\left(X_{I} Y_{1}\right)$ $\mathrm{x} 2^{0}$

$$
\begin{align*}
& \mathrm{P}_{\mathrm{eo}}=\quad 0 \times 2^{14}+\left(X_{8} Y_{7}\right) \times 2^{13}+0 \times 2^{12}+\left(X_{8} Y_{5}+X_{6} Y_{7}\right) \times 2^{11}+0 \times 2^{10}+\left(X_{8} Y_{3}+\right. \\
& \\
& \left.X_{6} Y_{5}+X_{4} Y_{7}\right) \times 2^{9}+0 \times 2^{8}+\left(X_{8} Y_{1}+X_{6} Y_{3}+X_{4} Y_{5}+X_{2} Y_{7}\right) \times 2^{7}+0 \times 2^{6}+ \\
& \left(X_{6} Y_{1}+X_{4} Y_{3}+X_{2} Y_{5}\right) \times 2^{5}+0 \times 2^{4}+\left(X_{4} Y_{1}+X_{2} Y_{3}\right) \times 2^{3}+0 \times 2^{2}+\left(X_{2} Y_{1}\right) \times  \tag{2.22}\\
& 2^{1}+0 \times 2^{0}
\end{align*}
$$

$$
\begin{align*}
\mathrm{P}_{\mathrm{oe}}= & 0 \times 2^{14}+\left(X_{7} Y_{8}\right) \times 2^{13}+0 \times 2^{12}+\left(X_{5} Y_{8}+X_{7} Y_{6}\right) \times 2^{11}+0 \times 2^{10}+\left(X_{3} Y_{8}+\right. \\
& \left.X_{5} Y_{6}+X_{7} Y_{4}\right) \times 2^{9}+0 \times 2^{8}+\left(X_{1} Y_{8}+X_{3} Y_{6}+X_{5} Y_{4}+X_{7} Y_{2}\right) \times 2^{7}+0 \times 2^{6}+ \\
& \left(X_{I} Y_{6}+X_{3} Y_{4}+X_{5} Y_{2}\right) \times 2^{5}+0 \times 2^{4}+\left(X_{I} Y_{4}+X_{3} Y_{2}\right) \times 2^{3}+0 \times 2^{2}+\left(X_{I} Y_{2}\right) \times \\
& 2^{1}+0 \times 2^{0} \tag{2.23}
\end{align*}
$$

Note that the zero positions in the bit pattern alternate with non-zero summations. The zero position can be used to hold the carry from corresponding summation in the nonzero position. A full adder can be used to calculate each of the sums and the carry out of the full adder generates the bit in the zero positions. Here, we have considered $\left(X_{2} Y_{8} \times 2^{8}\right)$ separately. Spares bits are collected together to form one or two distinct numbers. That is,
we have $\left[\left(X_{2} Y_{8}\right) \times 2^{8}+\left(X_{2} Y_{7}+X_{7} Y_{2}\right) \times 2^{7}+\left(X_{7} Y_{1}\right) \times 2^{6}\right]$. These numbers are treated separately. The propagation of carry is preserved in the body of multiplication and postponed at the last stage. This algorithm uses adder to convert three $k$-bit numbers to two $(k+1)$-bit numbers. By using this technique, partial product numbers are, then, summed together via adder planes repeatedly to generate two distinct numbers. At the last stage the final two partial products are added by a fast adder to speed up multiplication operation. This approach is discussed in more details in Section 3.1.2.


Fig. 2.9 Block diagram for the $8 \times 8$-bit pair-wise multiplier

### 2.4 Qualitative Comparisons of Parallel Algorithms

In order to choose the appropriate algorithm for the required applications one has to have a clear view of advantages and drawbacks of different algorithms that have been introduced. In the following a brief comparison of parallel algorithm is presented. The basic array multipliers, such as the Baugh-Wooley scheme, consume low power and exhibit relatively good performance. However, they are limited to applications with the process operands with less than 16 bits. For operands of 16 bits and over, the modified Booth algorithm reduces the partial product's numbers by half and hence the speed of the multiplier is increased. In this case power consumption is comparable to that of BaughWooley multiplier due to the circuitry overhead in Booth algorithm. However, by using circuit techniques one can make this multiplier have low-power characteristic. The fastest multipliers adopt the Wallace tree with modified Booth encoding. Due to its interconnecting wiring a Wallace tree would generally lead to larger power consumption and area. Hence, it is not recommended for low-power applications. Finally, the pair-wise multiplier shows faster operation by preventing the carry propagation in the intermediate stages of multiplication. This multiplication algorithm postpones the carry-propagation to the last stage where $2(\mathrm{n}-1)$-bit numbers are added. By using a fast addition circuitry such as carry lookahead adder (CLA) at the last stage of pair-wise multiplier one can accelerate the multiplication operation performance. Besides high-speed characteristic and simplicity of architecture of this algorithm, employing low power techniques [1] in circuit-level designs makes pair-wise algorithm a viable candidate for high performance multiplier. Baugh-Wooley algorithm is shown to be suitable for medium size (6 or 8) bit words [10]. It can be concluded that Baugh-Wooley multiplier is a suitable candidate to be used as test vehicle for the purpose of quantitative evaluation of pair-wise multiplier.

However, the entire Baugh-Wooley architecture should be redesign in order to perform a fair comparison.

## Chapter 3

## Multiplier Design

In this Chapter, the design of novel $8 \times 8$-bit multiplier is described in the circuit level. The building blocks are identified and the design of the cells based on these building blocks is, then, discussed. This Chapter begins with a brief description of some of the terms used hereafter in order to assess the circuits' performance.

Propagation delay of digital cells: duration from the moment that the first signal (50\% transition point on input waveform) reaches the inputs of the cell to the moment that the last output signal ( $50 \%$ transition point on output waveforms) reaches the output nodes [21].

Power consumption of digital cells: The value of the power consumption of one cell is measured individually during testing the circuits. It means that the power consumed by the other cells in the test circuit is not included in the final measured value. This has been done by inserting a power meter in the form of Analog Hardware Description Language (AHDL) block in Cadence CAD tool in the route of the main supply to measure the power dissipation. This approach has been used as standard power measurement method throughout this work.

### 3.1 Pair-Wise Multiplier

Based on the pair-wise algorithm described in Chapter 2, the top level design of the proposed multiplier is built as shown in Fig 2.9. The following decisions were made in
order to implement pair-wise algorithm. First, Due to inherent speed characteristic of pair-wise algorithm, a frequency of multiplication over 1 GHz is targeted in this design. The power consumption of each element has been taken into account in topology selection. These points are discussed further in this Chapter where the circuit-level design of the proposed multiplier is reviewed. Also several low-power techniques are applied in layout extraction in order to achieve the power efficient design. These techniques are discussed in Section 5 of Chapter 4 where the layout considerations are reviewed.

### 3.1.1 Circuit-Level Review

In this Section first the elements required in pair-wise multiplier are introduced and then topology selection for the key elements is briefly presented. The architecture of the pairwise multiplier (Fig. 2.9) shows that full adder, half adder, carry lookahead adder, AND, NAND, OR and XOR gates are the building blocks of the multiplier.

### 3.1.1.1 Full Adder

Full adder (FA) is the most critical circuit for two reasons. First, full adders cause a large percentage of the core propagation delay. Second, full adders ultimately consume the large percentage of power in the whole multiplier architecture. In order to select the best FA suited for high-performance application, a study was done on the existing FA circuits [2]. The result of this extensive study has directed to the selection the most speed/power efficient circuitry for the pair-wise multipliers. A summary of topology selection is provided next. First, note that the Boolean expression for a half adder (HA) is:

$$
\begin{gather*}
S=A \oplus B  \tag{3.1}\\
C_{o u t}=A \cdot B \tag{3.2}
\end{gather*}
$$

and for full adder (FA) is:

$$
\begin{gather*}
S=A \oplus B \oplus C_{i n},  \tag{3.3}\\
C_{o u t}=A \cdot B+C_{i n}(A \oplus B) .
\end{gather*}
$$

Table 3.1(a) Truth table of a full adder (b) Truth table of a half adder

| $\mathbf{A}$ | $\mathbf{B}$ | $\mathbf{C}_{\text {in }}$ | Sum | $\mathbf{C}_{\text {out }}$ | $\mathbf{A}$ | $\mathbf{B}$ | Sum | $\mathbf{C}_{\text {out }}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 |  |  |  |  |
| 0 | 1 | 0 | 1 | 0 |  |  |  |  |
| 0 | 0 | 1 | 0 | 1 |  |  |  |  |
| 0 | 0 | 0 | 0 | 0 |  |  |  |  |

The above Boolean expressions can be realized by different circuitries, each with their own advantages and disadvantages. In the following a brief review of the result of the study of six most well known CMOS full adder structures is presented. These adders have been compared in a wide range of static logic styles, which is viable candidate for low-power circuit design.

They include:

1. Complementary CMOS full adder cell
2. Complementary pass-transistor full adder cell
3. Double pass-transistor full adder cell
4. Transmission gate CMOS full adder cell
5. Pseudo-NMOS full adder cell
6. XOR and transmission gate full adder cell

The HA circuits are then generated from the optimized FAs by eliminating the circuitry which implements the function of the input carry.

Transistor Sizing: Sizing of the transistors in the full adder cells has been carried out in an iterative process consisted of the following steps.

1) Set all the transistors (NMOS and PMOS) to the minimum length ( $L_{m i n}$ ) and the minimum width size ( $W_{\text {min }}$ )
( $L_{\text {min }}=180 \mathrm{~nm}, W_{\text {min }}=660 \mathrm{~nm}$ in $0.18 \mu \mathrm{~m}$ CMOS process).
2) Simulate the circuit with all possible input pattern transitions (16 transitions).
3) Consider the transitions with the highest delay and mark the transistors involved in those transitions.
4) Size one of the transistors in this critical path.
5) Repeat Steps 2, 3 and 4 until the power-delay product for the cell continues to increase.
6) Record the transistor sizes corresponding to the minimum power-delay product. This method guarantees that only the right transistors (in the critical path) are sized in a proper way. No over-sizing or under-sizing will be incurred, which makes it optimal for power-delay product performance. Although this is a lengthy process, it is guaranteed to give excellent transistor sizing results, especially for small circuits. Following the same method with larger circuits will take much longer.

It should be mentioned that the above transistor sizing method is a time consuming task for the structure such as double pass-transistor. This structure is already out of interest due to high numbers of transistors. Therefore, not much effort has been taken to optimize the size of the transistors for this adder.

## Complementary CMOS full adder

Complementary CMOS full adder (CMOS) [15] has 28 transistors and its operation is based on the regular CMOS structure, pull-up \& down networks (Fig.3.1). One of the advantages of the complementary CMOS full adder cell is high noise margins and thus, reliable operation at low voltages and arbitrary transistor sizes (ratio-less logic). The layout of CMOS gates is straightforward due to the complementary transistor pairs. An often mentioned, the disadvantage of complementary CMOS full adder cell is the substantial number of large PMOS transistors resulting in high input loads, more power consumption and larger silicon area. This adder uses $\mathrm{C}_{\text {out }}$ signal to generate Sum, which produces an unwanted additional delay. Another drawback of CMOS is the relatively weak output driving capability due to series transistors of the output stage.


Fig. 3.1 Schematic of complementary CMOS full adder
Table 3.2 Transistor dimension in complementary CMOS full adder

| WWaty |  |  |
| :---: | :---: | :---: |
| $\begin{gathered} M_{1}, M_{2}, M_{3}, M_{4}, M_{5}, M_{11}, M_{13} \\ M_{14}, M_{15}, M_{16}, M_{21}, M_{22}, M_{23} . M_{28} \end{gathered}$ | 2.14 | 0.18 |
| $\begin{gathered} \mathrm{M}_{6}, \mathrm{M}_{7}, \mathrm{M}_{8}, \mathrm{M}_{9}, \mathrm{M}_{10}, \mathrm{M}_{12}, \mathrm{M}_{17}, \mathrm{M}_{18}, \\ \mathrm{M}_{19}, \mathrm{M}_{20}, \mathrm{M}_{24}, \mathrm{M}_{25}, \mathrm{M}_{26} . \mathrm{M}_{27} \end{gathered}$ | 1.44 | 0.18 |
| $\mathrm{M}_{24}, \mathrm{M}_{25}$ | 1.8 | 0.18 |

## Complementary pass-transistor full adder

Complementary pass-transistor full adder cell has 32 transistors (Fig. 3.2). Using passtransistor logic with CMOS inverters, this circuit features complementary inputs and outputs. This adder generates many intermediate nodes and their complements in order to generate the final signals (Sum and $\mathrm{C}_{\text {out }}$ ). Having a signal and its complement together produces high rate of switching activities. Therefore, complementary pass-transistor full adder cell is not a suitable option for low power applications. In order to lower the power consumption of complementary pass-transistor, two circuit styles are used. These circuits have output levels restored with cross-coupled inverters [16] and latches [17].

Due to irregular transistor arrangements and high wiring requirement, layout of this full adder cell family is also not straightforward and efficient


Fig. 3.2 Schematic of complementary pass-transistor full adder

Table 3.3 Transistor dimension in complementary pass-transistor full adder

|  | W(p) | L( $\mu \mathrm{mm}$ ) |
| :---: | :---: | :---: |
| $\begin{gathered} \mathrm{M}_{1}, \mathrm{M}_{2}, \mathrm{M}_{3}, \mathrm{M}_{4}, \mathrm{M}_{5}, \mathrm{M}_{6}, \mathrm{M}_{7}, \mathrm{M}_{8}, \mathrm{M}_{9}, \mathrm{M}_{10}, \mathrm{M}_{11}, \mathrm{M}_{12} \\ \mathrm{M}_{13}, \mathrm{M}_{14}, \mathrm{M}_{15}, \mathrm{M}_{16}, \mathrm{M}_{19}, \mathrm{M}_{20}, \mathrm{M}_{21}, \mathrm{M}_{22} \\ \hline \end{gathered}$ | 7.2 | 1.8 |
| $\mathrm{M}_{17}, \mathrm{M}_{18}, \mathrm{M}_{23}, \mathrm{M}_{24}$ | 9 | 1.8 |
| $\mathrm{M}_{25}, \mathrm{M}_{26}, \mathrm{M}_{27}, \mathrm{M}_{30}, \mathrm{M}_{32}$ | 14.4 | 1.8 |
| $\mathrm{M}_{28}, \mathrm{M}_{29}, \mathrm{M}_{31}$ | 18 | 1.8 |

## Double pass-transistor full adder

Double pass-transistor full adder cell has 48 transistors and its operation is based on the double pass-transistor logic in which both NMOS and PMOS logic networks are used (Fig.3.3.a \& b)[18]. The structure of this cell is similar to its complementary passtransistor counterparts, but it uses complementary transistors to keep full swing operation and reduces the power consumption.

This eliminates the need for restoration circuitry. One disadvantage of this cell is the large area used due to the presence of PMOS transistors.


Fig. 3.3(a) Schematic of double pass-transistor full adder (Sum)

Table 3.4 Transistor dimension in double pass-transistor full adder (Sum)

|  | W(jimi) | L (1.1m) |
| :---: | :---: | :---: |
| $\mathrm{M}_{2}, \mathrm{M}_{4}, \mathrm{M}_{6}, \mathrm{M}_{8}, \mathrm{M}_{9}, \mathrm{M}_{11}, \mathrm{M}_{13}, \mathrm{M}_{15}, \mathrm{M}_{18}, \mathrm{M}_{20}$ | 0.77 | 0.18 |
| $M_{1}, M_{3}, M_{5}, M_{7}, M_{10}, M_{12}, M_{14}, M_{16}, M_{17}, M_{19}$ | 1.08 | 0.18 |



Fig. 3.3(b) Schematic of double pass-transistor full adder ( $\mathrm{C}_{\text {out }}$ )

Table 3.5 Transistor dimension in double pass-transistor full adder ( $\mathrm{C}_{\text {out }}$ )

|  | W(itim) | $\underline{L}(\mu \mathrm{~m})$ |
| :---: | :---: | :---: |
| $\begin{gathered} \mathrm{M}_{2}, \mathrm{M}_{4}, \mathrm{M}_{6}, \mathrm{M}_{8}, \mathrm{M}_{10}, \mathrm{M}_{12}, \mathrm{M}_{14}, \mathrm{M}_{16}, \mathrm{M}_{17}, \mathrm{M}_{19}, \mathrm{M}_{21} \\ \mathrm{M}_{23}, \mathrm{M}_{26}, \mathrm{M}_{28} \end{gathered}$ | 0.77 | 0.18 |
| $\mathrm{M}_{18}, \mathrm{M}_{20}, \mathrm{M}_{22}, \mathrm{M}_{24}$ | 0.9 | 0.18 |
| $M_{1}, M_{3}, M_{5}, M_{7}, M_{9}, M_{11}, M_{13}, M_{15}, M_{25}, M_{27}$ | 1.08 | 0.18 |

## Transmission gate CMOS full adder

Transmission gate full adder has 20 transistors (Fig. 3.4). This circuit generates ( $\mathrm{A}+\mathrm{B}$ ) and uses this and its complement as selected signals to generate the output signals (Sum \& $C_{\text {out }}$ [19]. It also requires complementary input signals ( $A, B, C_{\text {in }}$ ) similar to the
complementary CMOS full adder. However, it exhibits better speed than CMOS full adder with the same power consumption due to the small transistor stack height [20].


Fig. 3.4 Schematic of transmission gate full adder

Table 3.6 Transistor dimension in transmission gate full adder

| MOST | W ( $\boldsymbol{\mu m})$ | $\mathbf{L}(\mu \mathrm{m})$ |
| :---: | :---: | :---: |
| $\mathrm{M}_{2}, \mathrm{M}_{4}, \mathrm{M}_{6}, \mathrm{M}_{8}, \mathrm{M}_{12}, \mathrm{M}_{14}, \mathrm{M}_{16}, \mathrm{M}_{18}, \mathrm{M}_{20}$ | 0.7 | 0.18 |
| $\mathrm{M}_{5}, \mathrm{M}_{7}, \mathrm{M}_{13}, \mathrm{M}_{15}, \mathrm{M}_{17}, \mathrm{M}_{19}$ | 0.9 | 0.18 |
| $\mathrm{M}_{1}, \mathrm{M}_{3}, \mathrm{M}_{10}, \mathrm{M}_{11}$ | 1.44 | 0.18 |
| $\mathrm{M}_{9}$ | 1.8 | 0.18 |

## Pseudo-NMOS full adder

Pseudo-NMOS full adder operates based on pseudo logic, referred to as ratioed style. This cell uses 14 transistors to realize the negative addition function (Fig. 3.5). The advantage of pseudo-NMOS adder is its higher speed (compared to complementary full adder) and low transistor count. On the negative side is the static power consumption of
the pull-up transistor as well as the reduced output voltage swing, which makes this cell more susceptible to noise. In order to increase the output swing two CMOS inverters are added to this circuit, which increases the total transistors of this cell to 18 transistors.


Fig. 3.5 Schematic of pseudo-NMOS full adder

Table 3.7 Transistor dimension in pseudo-NMOS full adder

|  | W( $\mathrm{\mu}_{\text {m }}$ ) | L( $\mu \mathrm{m}$ ) |
| :---: | :---: | :---: |
| $\mathrm{M}_{7}, \mathrm{M}_{12}, \mathrm{M}_{13}, \mathrm{M}_{14}$ | 0.66 | 0.18 |
| $\mathrm{M}_{1}, \mathrm{M}_{2}, \mathrm{M}_{3}, \mathrm{M}_{4}, \mathrm{M}_{5}, \mathrm{M}_{6}, \mathrm{M}_{8}$ | 0.77 | 0.18 |
| $\mathrm{M}_{9}, \mathrm{M}_{10}, \mathrm{M}_{11}, \mathrm{M}_{16}, \mathrm{M}_{18}$ | 1 | 0.18 |
| $\mathrm{M}_{15}, \mathrm{M}_{17}$ | 2 | 0.18 |

## XOR and transmission gate full adder

This adder shown in Fig. 3.6 has been developed based on an XOR gate [21] combined with transmission gate, which requires a total of 14 transistors [22]. XOR gate generates the sum. Using the transmission gate the second half of the circuit produces the carry out. This cell occupies less area compared with complementary CMOS full adder cell. In
terms of power consumption this adder has a better performance. This is due to its low activity factor and passing a strong signal in fewer number of pass-logic gates, unlike the other cells where the signal had to go through more number of logic gates. Having discussed the high performance of this novel logic, one should note that the irregularity in layout of transmission gate and large average size of transistors are the considerable drawbacks of this circuit.


Fig. 3.6 Schematic of XOR \& transmission gate full adder

Table 3.8 Transistor dimensions in XOR \& transmission gate full adder

|  |  | $\underline{L}(\underline{1 m})$ |
| :---: | :---: | :---: |
| $\mathrm{M}_{6}, \mathrm{M}_{7}, \mathrm{M}_{8}, \mathrm{M}_{10}$ | 0.7 | 0.18 |
| $\mathrm{M}_{3}, \mathrm{M}_{4}, \mathrm{M}_{12}, \mathrm{M}_{14}$ | 0.7 | 0.18 |
| $\mathrm{M}_{11}, \mathrm{M}_{13}$ | 0.9 | 0.18 |
| $\mathrm{M}_{1}, \mathrm{M}_{5}, \mathrm{M}_{9}$ | 1.44 | 0.18 |
| $\mathrm{M}_{2}$ | 1.8 | 0.18 |

### 3.1.1.1.1 Simulation Strategies

In the following the techniques for simulations with regards to input patterns of full adder and output loading are presented.

Input Pattern and Output Loading: In order to compare different adders, input patterns should be in such ways that fairly test all cases. An input pattern which maximizes the power consumption for a given cell, could exhibit less power for another. While another input pattern could have the reversed situation due to different distribution of capacitances in both circuits.


Fig. 3.7 (a) Input patterns used to evaluate the performance of the adders


Fig. 3.7 (b) Input patterns used to evaluate the performance of the adders


Fig. 3.7 (c) Input patterns used to evaluate the performance of the adders


Fig. 3.7 (d) Input patterns used to evaluate the performance of the adders

A good input pattern for power consumption leading to a fair comparison of adder cells should alternate the high frequency at the input and intermediate nodes. A good example is the concatenation of the four patterns shown in Fig $3.7(a, b, c, d)$.

Table 3.9 Characteristic of the input signals

| Patterns, |  |  | Wh |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Inputs | T(ns) | P.W. (ns) | T(ns) | P.W.(ns) | T(ns) | P.W.(ns) | T(ns) | P.W.(ns) |
| A | 2 | 1 | 4 | 2 | 8 | 4 | 4 | 2 |
| B | 4 | 2 | 8 | 4 | 2 | 1 | 4 | 2 |
| $\mathrm{C}_{\text {in }}$, | 8 | 4 | 2 | 1 | 4 | 2 | 8 | 4 |

$$
\text { P.W. }=\text { Pulse width, } T=\text { Period, Rise time }=50 p s \text {, Fall time }=50 p s
$$

As for speed, the input patterns should have all the required input-pattern-to-input-pattern transitions. The delay of the cell should be measured for each transition. The input pattern used for the simulation process is a concatenation of the four-input patterns shown in Fig. 3.7 (a, b, c, d).

The test bench used for simulating the adder cell is shown in Fig. 4.1 of Chapter 4, where the simulation result of the selected adder cell is discussed. The inputs are applied through buffers (two cascade inverters), which load adder cells with more realistic inputs in terms of slope and driving strength. Outputs are also applied to another adder to evaluate the driving capability of each cell.

### 3.1.1.1.2 Power Consumption Performance

Results of the comparison among adders, sorted by power consumption are shown in Table 3.10. The power performance of the second and third adder cells (Fig. 4.1) in the cascade configuration seems to be more realistic because in such a case, the high driving capability of the adder is a must in order to provide the next cell with the clean inputs. Therefore, the power values of either second or third full adder can be considered as the basis for our comparison. These results show that XOR and transmission gate full adder exhibit the lowest power consumption and transmission gate CMOS pseudo-NMOS, complementary CMOS, double pass-transistor and complementary pass-transistor are ranked respectively after it.

One can see that ranking is not necessarily related to the transistor count. It should be also pointed out that this evaluation corresponds to a 1.8 V power supply, and this point has slightly rearranged the previously reported adder ranking. The impact of supply reduction is an incomplete voltage swing at some internal nodes leading to a constant current drain. This, in turn, results in higher power consumption in circuits such as complementary pass-transistor and double pass-transistor.

Table 3.10 Simulation results for the full adders sorted by power consumption

| PAdder Cell (1.8V) | Power (mW) |
| :---: | :---: |
| XOR and transmission gate | 0.0203 |
| Transmission gate CMOS | 0.0305 |
| Pseudo-NMOS | 0.0341 |
| Complementary CMOS | 0.0504 |
| Double pass-transistor | 0.0861 |
| Complementary pass-transistor | 0.0967 |

### 3.1.1.1.3 Delay Performance

The experimental results of the comparison among adders sorted by speed are presented in Table 3.11. The delay values are measured from the moment $\mathrm{A}, \mathrm{B}$ and $\mathrm{C}_{\text {in }}$ signals reach the adder inputs till the last of the Sum and $\mathrm{C}_{\text {out }}$ signals reach the next adder cell inputs. The cell with the lowest-delay values is Complementary pass-transistor.

Table 3.11 Simulation results for the full adders sorted by propagation delay

| And Ader.Cell | Delay (ns) |
| :---: | :---: |
| Complementary pass-transistor | 0.057 |
| XOR and transmission gate | 0.066 |
| Transmission gate CMOS | 0.074 |
| Pseudo-NMOS | 0.080 |
| Double pass-transistor | 0.091 |
| Complementary CMOS | 0.140 |

Fig. 3.8 shows the delay of an adder. This measurement is based on the definition of the propagation delay of digital cells, explained at the beginning of this chapter. The inputs
signals are as $A=1, B=1$, and $C_{i n}=1$, therefore, the adder response will be as $\operatorname{Sum}=1$ and $C_{\text {out }}=1$. Then, the delay between the earliest input signal ( $C_{i n}$ ) and Sum has been measured. The delay is also measured between $C_{i n}$ and $C_{\text {out }}$. This measurement has been performed at $50 \%$ transition point of the signals (which is 0.9 V in our case of $V_{d d}=1.8$ V). The delay values of pseudo-NMOS adder are shown in Fig. 3.8. It can be seen that delay of Sum and $C_{o u t}$ is very close in this cell, which avoids any data hazard, and race effects that may occur later in the proposed architecture.


Fig. 3.8 Propagation delay measurement

### 3.1.1.1.4 Performance Comparisons

The following criteria have been considered in performing the comparison amongst different adder:

Power-delay product: The power-delay product is defined as a compromise between speed and power consumption. The values of the power-delay are presented in Table
3.12. The measurements are performed in identical conditions as it is recorded in Tables 3.10 and 3.11 .

Area: The transistor count, showing area efficiency and layout productivity must be taken into account for choosing the best adder.

Table 3.12 Simulation results for the full adder cells sorted by power-delay product

|  | Power delay product $\mathrm{x} 10^{-12}$ | Transistor No. |
| :---: | :---: | :---: |
| XOR and transmission gate | 0.00133 | 14 |
| Transmission gate CMOS | 0.00222 | 20 |
| Pseudo-NMOS | 0.00272 | 14 |
| Complementary pass-transistor | 0.00551 | 32 |
| Complementary CMOS | 0.00702 | 28 |
| Double pass-transistor | 0.00783 | 48 |

The measurement shows that pseudo-NMOS full adder has average values in both power consumption and delay, while providing a sum signal in good logic level. This leads to average of value in power-delay products.

Pseudo-NMOS adder also has small area occupancy not only due to the number of transistors but also because of the size of PMOSs, which are the main issue when it turns to layout extraction level. These properties make the pseudo-NMOS circuit amenable to use of a lower supply voltage to further reduce the power and at the same time maintaining a specific speed of the multiplication operation.

It is timely to mention that the comparison of the performance of the adder cells based on different logic is a very broad area of study and it is impossible to appreciate fully in a small section. Here, identical conditions such as uniform input pattern, capacitive load and constant $V_{d d}$ have been used during simulation in order to achieve a fair comparison.

However, other factors such as selecting different geometry and physical designs and process variations could be considered as well.

### 3.1.1.2 Carry Lookahead Adder

The carry lookahead adder is a viable candidate to resolve the propagation delay problem by calculating the carry signal in advance based on the input signals. It relies on the fact that a carry signal will be generated in two cases:
a) when both input bits $\left(A_{i}, B_{i}\right)$ are "1",
b) when one of the two bits is " 1 " and the $C_{i n}$ (carry-in of the previous stage) is " 1 ".

Thus, one can write

$$
\begin{equation*}
C_{\text {out }}=C_{i+1}=A_{i} \cdot B_{i}+\left(A_{i} \oplus B_{i}\right) \cdot C_{i} \tag{3.5}
\end{equation*}
$$

The above expression can be rewritten as

$$
\begin{equation*}
C_{i+1}=G i+P_{i} \cdot C_{i} \tag{3.6}
\end{equation*}
$$

in which

$$
\begin{gather*}
G_{i}=A_{i} \cdot B_{i}  \tag{3.7}\\
P_{i}=\left(A_{i} \oplus B_{i}\right) \tag{3.8}
\end{gather*}
$$

$G_{i}$ and $P_{i}$ are called generate and propagate terms, respectively [23].
Note that propagate and generate terms only depend on the input bits. If one uses the above expression to calculate the carry signal, $\mathrm{s} /$ he does not need to wait for the carry to ripple through all the previous stages to find its proper value. Thus, comes the main advantage of the carry lookahead adder: reducing the propagation delay.

In the following the generate and propagate terms are derived for a 4-bit adder.

$$
\begin{gather*}
C_{1}=G_{0}+P_{0} \cdot C_{0}  \tag{3.9}\\
C_{2}=G_{1}+P_{1} \cdot C_{1}=G_{1}+P_{1} \cdot G_{0}+P_{1} \cdot P_{0} \cdot C_{0} \tag{3.10}
\end{gather*}
$$

$$
\begin{gather*}
C_{3}=G_{2}+P_{2} \cdot G_{1}+P_{2} \cdot P_{1} \cdot G_{0}+P_{2} \cdot P_{1} \cdot P_{0} \cdot C_{0}  \tag{3.11}\\
C_{4}=G_{3}+P_{3} \cdot G_{2}+P_{3} \cdot P_{2} \cdot G_{1}+P_{3} \cdot P_{2} \cdot C_{1} \cdot G_{0}+P_{3} \cdot P_{2} \cdot P_{1} \cdot P_{0} \cdot C_{0} \tag{3.12}
\end{gather*}
$$

Note that $C_{\text {out }}$ bit and $C_{i+1}$ of the last stage will be available after four delays (two gate delays to calculate propagate signal and two delays due to AND and OR gates). The sum signal $\left(S_{i}\right)$ can be calculated as follows;

$$
\begin{equation*}
S_{i}=A_{i} \oplus B_{i} \oplus C_{i}=P_{i} \oplus G_{i} \tag{3.13}
\end{equation*}
$$

Thus, the sum bit will be available after two additional gate delays (due to the XOR gate) or total of six gate delays after the input signals $A_{i}$ and $B_{i}$ have been applied. The advantage is that these delays will be the same and independent of the number of bits one needs to add, as opposed to the case of ripple counter.

The carry lookahead adder can be broken up in two modules;

1) The partial full adder, PFA, which generates $G_{i}, P_{i}$ and $S_{i}$ as defined by Equations 3.7, 3.8 and 3.13.
2) The carry lookahead logic, which generates $C_{\text {out }}$ bits according to Equations 3.9 to 3.12 . The 4-bit adder can then be built by using four PFAs and the carry lookahead logic block.

The disadvantage of carry lookahead adder is that the carry logic tends to get quite complicated for more than 4 bits. Therefore, carry lookahead adders are usually implemented as 4-bit modules and are used in a hierarchical structure to realize adders that have multiples of 4-bits. High fan-in OR gate is an unavoidable problem in designing a 16-bit carry lookahead adder. This is shown in Equation 3.12 when $C_{4}$ is calculated. Using high fan-in in logic gate would not only increase the propagation delay, but also contributes to additional power consumption. In order to resolve these issues the cascade of four 4-bit carry lookahead adders have been employed in design of 16-bit carry lookahead adder. The propagation delay of 16-bit carry lookahead adder in this
architecture is approximately equal to that of the 4-bit ripple carry adder. This is because of $C_{\text {out }}$ signals that have to ripple from one module to the next one. This is repeated four times until the final $C_{o u t}$ arrives at the output. Despite the amount of delay, this approach is more power-efficient.

In the following the overview of the sub-cells of the 4-bit carry lookahead adder are described. Figure 3.9 shows the block diagram of 4-bit carry lookahead adder.


Fig. 3.9 Block diagram of 4-bit carry lookahead adder
As seen in Fig. 3.9 partial full adder (PFA) is the first block where inputs are fed. As it is mentioned earlier, this block generates, propagate, generate and sum signals. Fig. 3.10 shows the gate-level implementation of PFA. Sum signal is also generated in this block according to Equation 3.13. In order to generate $C_{\text {out }}$ signal another XOR gate is needed.


Fig. 3.10 Gate-level implementation of partial full adder (PFA)

The delays of signals in the highlighted block of carry lookahead adder (Fig. 3.9) are measured and shown in Table 3.13.

Table 3.13 Delay of the generate, propagate and sum signals of PFA

## Output Delay (ns)

| $G_{i}$ | 0.0552 |
| :---: | :---: |
| $S_{i}$ | 0.0385 |
| $C_{i} P_{i}$ | 0.08 |

The block diagram of the 16 -bit carry lookahead adder is shown in Fig. 3.11. Four 4-bit carry lookahead modules have been used to implement the final stage of the pair-wise multiplier. The labels on this diagram are based on the outputs of the previous stages.


Fig. 3.11 Block diagram of the 16-bit carry lookahead adder implemented by cascading four 4-bit carry lookahead modules

### 3.1.1.3 AND, NAND, OR and XOR Gates

AND, NAND, OR and XOR are the fundamental logic gates, used in most logic circuits to realize the arithmetic operations. The Boolean expressions for two-input AND, NAND, OR and XOR gates, followed by their truth tables are shown in Table 3.14.

$$
\begin{gather*}
A . B  \tag{3.5}\\
A+B \tag{3.6}
\end{gather*}
$$

$$
\begin{equation*}
A \oplus B \tag{3.7}
\end{equation*}
$$

Table 3.14 Truth table of AND, NAND, OR and XOR

| $A$ | $B$ | $A \cdot B$ | $\overline{A \cdot B}$ | $A+B$ | $A \oplus B$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 1 |
| 0 | 1 | 0 | 1 | 1 | 1 |
| 0 | 0 | 0 | 1 | 0 | 0 |

The Boolean expressions representing AND/NAND/OR/XOR operation can be realized by different circuitries. However, the varieties of these structures are not as many as adder circuits. Therefore, very common configurations have been used to implement the required logic tasks. Figure 3.12(a, b) shows the schematics of AND, NAND gates that have been optimized for the required speed in the proposed multiplier [22]. NAND gate is composed of two NMOSs and PMOSs. An inverter is added to the circuit to generate the AND function. Several designs of OR and XOR gates have been reported. Each has its own advantages such as less delay and drawbacks such as poor response to some particular inputs [20]. Figure 3.12 (c, d) shows schematics of the OR and XOR circuits Dimension of NMOS and PMOS transistors have been modified for the required rise and fall times in the pair-wise multiplier.



Fig. 3.12 Schematic of (a) AND (b) NAND (c) $X O R$ (d) $O R$ gates

Table 3.15 Transistor dimensions of AND, NAND, XOR and OR gates

|  |  |  |
| :---: | :---: | :---: |
| MOST | $\mathbf{W}(\mu \mathrm{m})$ | L ( $\mu \mathrm{m}$ ) |
| $\mathrm{M}_{1}, \mathrm{M}_{4}, \mathrm{M}_{2}, \mathrm{M}_{3}$ | 1.4 | 0.18 |
| $\mathrm{M}_{5} \mathrm{M}_{6}$ | 1.8 | 0.18 |
|  |  |  |
| MOST | $\mathbf{W}(\mu \mathrm{m})$ | $\underline{L}(\mu \mathrm{~m})$ |
| $\mathrm{M}_{1}, \mathrm{M}_{4}, \mathrm{M}_{2}, \mathrm{M}_{3}$ | 0.7 | 0.18 |
|  |  |  |
| MOST | W ( $\mu \mathrm{m}$ ) | L (pma) |
| $\mathrm{M}_{1}, \mathrm{M}_{3}$ | 3.5 | 0.18 |
| $\mathrm{M}_{2}, \mathrm{M}_{4}, \mathrm{M}_{12}$ | 1.5 | 0.18 |
| $\begin{gathered} \mathrm{M}_{6}, \mathrm{M}_{3} \mathrm{M}_{8} \mathrm{M}_{10}, \\ \mathrm{M}_{12} \end{gathered}$ | 2. | 0.18 |
| $\mathrm{M}_{5} \mathrm{M}_{9}$ | 3. | 0.18 |
| OTO |  |  |
| MOST | W(\%) | L ( $\mu \mathrm{m}$ ) |
| $M_{1}, M_{2}$ | 2.2 | 0.18 |
| $\mathrm{M}_{3}, \mathrm{M}_{4}$ | 0.7 | 0.18 |
| $\mathrm{M}_{5}$ | 2 | 0.18 |
| $\mathrm{M}_{5}$ | 0.75 | 0.18 |

### 3.1.2 Cell Design

In order to increase the productivity in ASIC design, cell design techniques are highly critical. In cell deign, a basic concept is to design uniform circuits that can perform the same task. In the following the top-level and the circuit-level of the required cells in each stage are described.


Fig. 3.13 Block diagram of the proposed $8 \times 8$-bit multiplier showing detail of the required cells

AND Generator: As seen in the block diagram of the pair-wise $8 \times 8$-bit multiplier (Fig. 3.13) the first stage of this architecture is an AND generator. In order to execute the first step of the pair-wise algorithm discussed in Section 2.3.5 AND combinations of all odd and even positions of two 8 -bit multiplicand and multiplier are required. This task is performed by the AND generator. The block diagram of the AND generator is shown in Fig. 3.14. This stage consists of four AND planes known as:
$\mathrm{X}_{\mathrm{e}} \mathrm{Y}_{\mathrm{e}}$ : generating AND combination of all even bits of the both multiplicand and multiplier. The results are: $X_{2} Y_{2}, X_{2} Y_{4}, X_{2} Y_{6}, X_{2} Y_{8}, X_{4} Y_{2}, X_{4} Y_{4}, X_{4} Y_{6}, X_{4} Y_{8}, X_{6} Y_{2}, X_{6} Y_{4}, X_{6} Y_{6}$, $X_{6} Y_{8}, X_{8} Y_{2}, X_{8} Y_{4}, X_{8} Y_{6}, X_{8} Y_{8}$,
$X_{e} Y_{0}$ : generating AND combination of even bits of the multiplicand and odd bits of the multiplier. The results are: $X_{2} Y_{1}, X_{2} Y_{3}, X_{2} Y_{5}, X_{2} Y_{7}, X_{4} Y_{1}, X_{4} Y_{3}, X_{4} Y_{5}, X_{4} Y_{7}, X_{6} Y_{1}, X_{6} Y_{3}, X_{6} Y_{5}$,
$X_{6} Y_{7}, X_{8} Y_{l}, X_{8} Y_{3}, X_{8} Y_{5}, X_{8} Y_{7}$.
$X_{o} Y_{e}$ : generating AND combination of odd bits of the multiplicand and even bits of the multiplier. The results are: $X_{I} Y_{2}, X_{I} Y_{4}, X_{I} Y_{6}, X_{I} Y_{8}, X_{3} Y_{2}, X_{3} Y_{4}, X_{3} Y_{6}, X_{3} Y_{8}, X_{5} Y_{2}, X_{5} Y_{4}, X_{5} Y_{6}$, $X_{5} Y_{8}, X_{7} Y_{2}, X_{7} Y_{4}, X_{7} Y_{6}, X_{7} Y_{8}$.
$X_{0} Y_{0}$ : generating AND combination of all odd bits of the both multiplicand and multiplier. The results are: $X_{1} Y_{1}, X_{1} Y_{3}, X_{I} Y_{5}, X_{I} Y_{7}, X_{3} Y_{1}, X_{3} Y_{3}, X_{3} Y_{5}, X_{3} Y_{7}, X_{5} Y_{1}, X_{5} Y_{3}, X_{5} Y_{5}$, $X_{5} Y_{7}, X_{7} Y_{1}, X_{7} Y_{3}, X_{7} Y_{5}, X_{7} Y_{7}$.


Fig. 3.14 Block diagram of AND generator

Fig 3.15 shows the gate-level implementation of the AND plane. Combination of four planes consequently constructs AND generation stage. The AND circuit discussed in Section 3.1.1.2 (Fig. 3.12a) is used in the circuit level.


Fig. 3.15 Gate level of the AND plane ( $X_{i} Y_{j}$ Cell)

First Adder Plane: The second stage of the multiplier is the first adder plane where partial products ( $\mathrm{P}_{\mathrm{ee}}, \mathrm{P}_{\mathrm{eo}}, \mathrm{P}_{\mathrm{oe}}, \mathrm{P}_{\mathrm{oo}}$ ) are generated. Equations 2.20, 2.21, 2.22 and 2.23 show the different AND combinations of multiplicand and multipliers' bits required for generating each of partial products. Fig 3.16 shows the block diagram of this stage.


Fig. 3.16 Block diagram of partial products generator

This stage consists of four blocks of the partial product generators known as:
$P_{e e}$ : generating partial products resulted by multiplication of the even bits of both multiplicand and multiplier. $\mathrm{P}_{\mathrm{ee}}$ is a 15 -bit number shown by bit number in parentheses as follows:
$P_{e e}(1)=0$
$P_{e e}(2)=0$
$P_{\mathrm{ee}}(3)=X_{2} Y_{2}$
$P_{\text {ee }}(4)=0$
$P_{e e}(5)=\operatorname{Sum}\left[X_{4} Y_{2}+X_{2} Y_{4}\right]$
$P_{e e}(6)=\mathrm{C}_{\text {out }}\left[X_{4} Y_{2}+X_{2} Y_{4}\right]$
$P_{e e}(7)=\operatorname{Sum}\left[X_{6} Y_{2}+X_{4} Y_{4}+X_{2} Y_{6}\right]$
$P_{e e}(8)=\mathrm{C}_{\mathrm{out}}\left[X_{6} Y_{2}+X_{4} Y_{4}+X_{2} Y_{6}\right]$
$P_{e e}(9)=\operatorname{Sum}\left[X_{8} Y_{2}+X_{6} Y_{4}+X_{4} Y_{6}+X_{2} \boldsymbol{Y}_{8}\right]$
$P_{\text {ee }}(10)=\mathrm{C}_{\text {out }}\left[X_{8} Y_{2}+X_{6} Y_{4}+X_{4} Y_{6}+X_{2} Y_{8}\right]$
$P_{e e}(11)=\operatorname{Sum}\left[X_{8} Y_{4}+X_{6} Y_{6}+X_{4} Y_{8}\right]$
$P_{e e}(12)=\mathrm{C}_{\text {out }}\left[X_{8} Y_{4}+X_{6} Y_{6}+X_{4} Y_{8}\right]$
$P_{e e}(13)=\operatorname{Sum}\left[X_{8} Y_{6}+X_{6} Y_{8}\right] \quad P_{e e}(14)=\mathrm{C}_{\text {out }}\left[X_{8} Y_{6}+X_{6} Y_{8}\right] \quad P_{e e}(15)=X_{8} Y_{8}$
$\mathrm{P}_{\mathrm{e} 0}$ : generating partial products resulted by multiplication of even bits of the multiplicand and odd bits of the multiplier. $\mathrm{P}_{\text {eo }}$ is a 15-bit number shown by bit number in parentheses as follows:
$P_{e o}(1)=0 \quad P_{e o}(2)=X_{2} Y_{l} \quad P_{e o}(3)=0 \quad P_{e o}(4)=\operatorname{Sum}\left[X_{4} Y_{1}+X_{2} Y_{3}\right]$
$P_{e o}(5)=\mathrm{C}_{\mathrm{out}}\left[X_{4} Y_{l}+X_{2} Y_{3}\right]$
$P_{e o}(6)=\operatorname{Sum}\left[X_{6} Y_{1}+X_{4} Y_{3}+X_{2} Y_{5}\right]$
$P_{e o}(7)=\mathrm{C}_{\text {out }}\left[X_{6} Y_{I}+X_{4} Y_{3}+X_{2} Y_{5}\right]$
$P_{e o}(8)=\operatorname{Sum}\left[X_{8} Y_{I}+X_{6} Y_{3}+X_{4} Y_{5}\right]$
$P_{e o}(9)=\mathrm{C}_{\text {out }}\left[X_{8} Y_{1}+X_{6} Y_{3}+X_{4} Y_{5}+X_{2} \boldsymbol{Y}_{7}\right] \quad P_{e o}(10)=\operatorname{Sum}\left[X_{8} Y_{3}+X_{6} Y_{5}+X_{4} Y_{7}+X_{2} \boldsymbol{Y}_{7}\right]$
$P_{e o}(11)=\mathrm{C}_{\text {out }}\left[X_{8} Y_{3}+X_{6} Y_{5}+X_{4} Y_{7}\right]$
$P_{e o}(12)=\operatorname{Sum}\left[X_{8} Y_{5}+X_{6} Y_{7}\right]$
$P_{e o}(13)=\mathrm{C}_{\mathrm{out}}\left[X_{8} Y_{5}+X_{6} Y_{7}\right]$
$P_{e o}(14)=X_{8} Y_{7}$
$P_{e o}(15)=0$
$\mathrm{P}_{\mathrm{oc}}$ : generating partial products resulted by multiplication of odd bits of the multiplicand and even bits of the multiplier. $\mathrm{P}_{\mathrm{oe}}$ is a 15 -bit number shown by bit number in parentheses as follows:

$$
\begin{aligned}
& P_{o e}(1)=0 \quad P_{o e}(2)=X_{I} Y_{2} \quad P_{o e}(3)=0 \quad P_{o e}(4)=\operatorname{Sum}\left[X_{I} Y_{4}+X_{3} Y_{2}\right] \\
& P_{o e}(5)=\mathrm{C}_{\text {out }}\left[X_{I} Y_{4}+X_{3} Y_{2}\right] \\
& P_{o e}(6)=\operatorname{Sum}\left[X_{I} Y_{6}+X_{3} Y_{4}+X_{5} Y_{2}\right] \\
& P_{o e}(7)=\mathrm{C}_{\mathrm{out}}\left[X_{I} Y_{6}+X_{3} Y_{4}+X_{5} Y_{2}\right] \\
& P_{o e}(8)=\operatorname{Sum}\left[X_{1} Y_{8}+X_{3} Y_{6}+X_{5} Y_{4}+X_{7} Y_{2}\right]
\end{aligned}
$$

$\begin{array}{ll}P_{o e}(9)=\mathrm{C}_{\mathrm{out}}\left[X_{I} Y_{8}+X_{3} Y_{6}+X_{5} Y_{4}+X_{7} Y_{2}\right] & P_{o e}(10)=\operatorname{Sum}\left[X_{3} Y_{8}+X_{5} Y_{6}+X_{7} Y_{4}\right] \\ P_{o e}(11)=\mathrm{C}_{\mathrm{out}}\left[X_{3} Y_{8}+X_{5} Y_{6}+X_{7} Y_{4}\right] & P_{o e}(12)=\operatorname{Sum}\left[X_{5} Y_{8}+X_{7} Y_{6}\right] \\ P_{o e}(13)=\mathrm{C}_{\mathrm{out}}\left[X_{5} Y_{8}+X_{7} Y_{6}\right] & P_{o e}(14)=X_{7} Y_{8} \quad P_{o e}(15)=0\end{array}$
$\mathrm{P}_{\mathrm{oo}}$ : generating partial products resulted by multiplication of the odd bits of both multiplicand and multiplier. $\mathrm{P}_{\mathrm{oo}}$ is a 15-bit number shown by bit number in parentheses as follow;
$P_{o o}(1)=X_{I} Y_{I} \quad P_{o o}(2)=0 \quad P_{o o}(3)=\operatorname{Sum}\left[X_{I} Y_{3}+X_{3} Y_{I}\right]$
$P_{o o}(4)=\mathrm{C}_{\text {out }}\left[X_{I} Y_{3}+X_{3} Y_{I}\right] \quad P_{o o}(5)=\operatorname{Sum}\left[X_{I} Y_{5}+X_{3} Y_{3}+X_{5} Y_{I}\right]$
$P_{o o}(6)=\mathrm{C}_{\mathrm{out}}\left[X_{I} Y_{5}+X_{3} Y_{3}+X_{5} Y_{I}\right]$
$P_{o o}(7)=\operatorname{Sum}\left[X_{I} Y_{7}+X_{3} Y_{5}+X_{5} Y_{3}+X_{7} Y_{2}\right]$
$P_{o o}(8)=\mathrm{C}_{\text {out }}\left[X_{I} Y_{7}+X_{3} Y_{5}+X_{5} Y_{3}+X_{7} Y_{2}\right]$
$P_{o o}(9)=\operatorname{Sum}\left[X_{3} Y_{7}+X_{6} Y_{5}+X_{7} Y_{3}\right]$
$P_{o o}(10)=\mathrm{C}_{\text {out }}\left[X_{3} Y_{7}+X_{6} Y_{5}+X_{7} Y_{3}\right]$
$P_{o o}(11)=\operatorname{Sum}\left[X_{5} Y_{7}+X_{7} Y_{5}\right]$
$P_{o o}(12)=\mathrm{C}_{\text {out }}\left[X_{5} Y_{7}+X_{7} Y_{5}\right]$
$P_{o o}(13)=X_{7} Y_{7}$
$P_{o o}(14)=0$

$$
P_{o o}(15)=0
$$

All $\mathrm{P}_{\mathrm{ee}}, \mathrm{P}_{\mathrm{eo}}, \mathrm{P}_{\mathrm{oe}}, \mathrm{P}_{\mathrm{oo}}$ blocks perform a similar task and have the same number of inputs and outputs. This makes it possible to employ the same cell for all four partial products ( $\mathrm{P}_{\mathrm{ee}}$, $\mathrm{P}_{\mathrm{eo}}, \mathrm{P}_{\mathrm{oe}}, \mathrm{P}_{\mathrm{oo}}$ ) generators. This cell has been constructed by two half adders and five adders.


Fig. 3.17 Gate-level of one partial product generator (PP Cell)

Fig 3.17 shows the gate-level diagram of the partial product generator cell. In circuitlevel pseudo-NMOS adder has been used to realize these cells. Using adder to convert three $k$-bit numbers to two $(k+1)$ numbers avoids the carry propagation delay in body of the multiplier. The following is description of this technique.

In order to use 3-to-2 adding technique it is necessary that not more than three inputs be used for generating any elements of partial product $\left(P_{i j}\right)$. This condition is not met when the partial product elements are generated by four terms as it happens in $P_{e t}(9), P_{c e}(10)$, $P_{e o}(9), P_{e o}(10), P_{o e}(8), P_{o e}(9), P_{o o}(7), P_{o o}(8)$ (the fourth terms are highlighted in the relevant equation). To deal with these extra terms called spares terms they are taken out of the equations and collected together to form two distinct numbers which are called N and $\mathrm{M} . N(\mathrm{i})$ is a 15 -bit number with zero in all even and odd positions except for the seventh $[N(7)]$ eighth $\quad[N(8)]$ and ninth $[N(9)]$ positions $\left(0,0,0,0,0,0, X_{7} Y_{1}, X_{2} Y_{7}, X_{2} Y_{8}, 0,0,0,0,0,0\right) . M(\mathrm{i})$ is the second 15 -bit number with zero in all even and odd positions except for the eighth position $[M(8)](0,0,0,0,0,0,0$, $\left.X_{7} Y_{2}, 0,0,0,0,0,0,0\right)$. These two numbers are shown in the block diagram of partial products generator (Fig. 3.16).

Now outputs of the first adder plane are six 15-bit numbers called ( $\mathrm{P}_{\mathrm{ee}}, \mathrm{P}_{\mathrm{e} 0}, \mathrm{P}_{\mathrm{oe}}, \mathrm{P}_{\mathrm{oo}}, \mathrm{N}$, M).

Second \& Third Adder Planes: In order to generate the final product of multiplication (P) of 8-bit X (multiplicand) and 8-bit Y (multiplier) all the individual partial products ( $\mathrm{P}_{\mathrm{ee}}, \mathrm{P}_{\mathrm{e} 0}, \mathrm{P}_{\mathrm{oe}}, \mathrm{P}_{\mathrm{oo}}$ ) generated from summation of even and odd bits of the multiplicand and multiplier and two distinct numbers generated by sparse bits $(M, N)$ in the previous stage should be added together.

$$
\begin{equation*}
\mathrm{P}_{\mathrm{ce}}+\mathrm{P}_{\mathrm{eo}}+\mathrm{P}_{\mathrm{oc}}+\mathrm{P}_{\mathrm{o}}+\mathrm{M}+\mathrm{N}=\mathrm{P} \tag{3.6}
\end{equation*}
$$

This task requires the second and third adder planes. This addition operation has to be performed bit-by-bit resulting in carry out propagation. In order to postpone the carry
propagation delay to the last stage of the multiplier a 3-to-2 adding technique has been used. To facilitate this technique adding of four partial products $\left(\mathrm{P}_{\mathrm{ee}}, \mathrm{P}_{\mathrm{eo}}, \mathrm{P}_{\mathrm{oe}}, \mathrm{P}_{\mathrm{oo}}\right)$ and two extra numbers $(\mathrm{M}, \mathrm{N})$ is broken up into two steps as shown in Equation 3.7. These six numbers are divided to two batches of three numbers.

$$
\begin{equation*}
\mathrm{P}_{\mathrm{ee}}+\mathrm{P}_{\mathrm{eo}}+\mathrm{P}_{\mathrm{oe}}+\mathrm{P}_{\mathrm{oo}}+\mathrm{M}+\mathrm{N}=\left(\mathrm{P}_{\mathrm{ec}}+\mathrm{P}_{\mathrm{eo}}+\mathrm{P}_{o c}\right)+\left(\mathrm{P}_{\infty}+\mathrm{M}+\mathrm{N}\right) \tag{3.7}
\end{equation*}
$$

At this stage three 15-bit numbers ( $\mathrm{P}_{\mathrm{ee}}, \mathrm{P}_{\mathrm{eo}}, \mathrm{P}_{\mathrm{oe}}$ ) are converted to two16-bit numbers ( $\mathrm{P}_{\text {eoeo, }}$, $\mathrm{P}_{\text {eoee }}$ ) and so are the fourth partial product ( $\mathrm{P}_{\mathrm{oo}}$ ) and two distinct numbers $(\mathrm{M}, \mathrm{N})$ which generate ( $\mathrm{P}_{\mathrm{oo}} \mathrm{S}_{\mathrm{e}}, \mathrm{P}_{\mathrm{oo}} \mathrm{S}_{\mathrm{o}}$ ).

This task can be performed by using a similar structure shown in Fig. 3.17 with a total of 14 adders. Note that due to the power and area constraints of the entire architecture using half adder is preferred whenever only two inputs signals need to be added (i.e. no $C_{i n}$ signal exits). The result of this 3-to-2 adding is shown as follows;
$\mathrm{P}_{\text {eooo: }}$ : result of adding all the odd positions of $\mathrm{P}_{\mathrm{ee}}, \mathrm{P}_{\mathrm{e} 0}, \mathrm{P}_{\mathrm{oe}}$.

$$
\begin{array}{ll}
P_{e o o o}(1)=0 \quad P_{e o o o}(2)=0 & P_{e o o o}(3)=P_{e e}(3) \quad P_{e o o o}(4)=0 \\
P_{e o o o}(5)=\operatorname{Sum}\left[P_{e e}(5)+P_{e o}(5)+P_{o e}(5)\right] & P_{e o o o}(6)=\mathrm{C}_{\mathrm{out}}\left[P_{e e}(5)+P_{e o}(5)+P_{o e}(5)\right] \\
P_{e o o o}(7)=\operatorname{Sum}\left[P_{e e}(7)+P_{e o}(7)+P_{o e}(7)\right] & P_{e o o o}(8)=\mathrm{C}_{\text {out }}\left[P_{e e}(7)+P_{e o}(7)+P_{o e}(7)\right] \\
P_{e o o o}(9)=\operatorname{Sum}\left[P_{e e}(9)+P_{e o}(9)+P_{o e}(9)\right] & P_{e o o o}(10)=\mathrm{C}_{\text {out }}\left[P_{e e}(9)+P_{e o}(9)+P_{o e}(9)\right] \\
P_{e o o o}(11)=\operatorname{Sum}\left[P_{e e}(11)+P_{e o}(11)+P_{o e}(11)\right] & P_{e o o o}(12)=\mathrm{C}_{\text {out }}\left[P_{e e}(11)+P_{e o}(11)+P_{o e}(11)\right] \\
P_{e o o o}(13)=\operatorname{Sum}\left[P_{e e}(13)+P_{e o}(13)+P_{o e}(13)\right] & P_{e o o o}(14)=\mathrm{C}_{\text {out }}\left[P_{e e}(13)+P_{e o( }(13)+P_{o e}(13)\right] \\
P_{e o o o}(15)=P_{e e}(15) &
\end{array}
$$

$\mathrm{P}_{\text {eoce: }}$ : result of adding all the even positions of $\mathrm{P}_{\mathrm{ee},}, \mathrm{P}_{\mathrm{eo}}$ and $\mathrm{P}_{\mathrm{oe}}$.
$P_{\text {eooee }}(1)=0$
$P_{\text {eooe }}(2)=\operatorname{Sum}\left[P_{e o}(2)+P_{o e}(2)\right]$

$$
P_{\text {eoee }}(3)=\mathrm{C}_{\text {out }}\left[P_{e o}(2)+P_{o e}(2)\right]
$$

$$
P_{e o e e}(4)=\operatorname{Sum}\left[P_{e o}(4)+P_{o e}(4)\right]
$$

$$
P_{\text {eoee }}(5)=\mathrm{C}_{\mathrm{out}}\left[P_{e o}(4)+P_{o e}(4)\right]
$$

$$
P_{\text {eoee }}(6)=\operatorname{Sum}\left[P_{e e}(6)+P_{e o}(6)+P_{o e}(6)\right]
$$

$$
P_{e o e e}(7)=\mathrm{C}_{\text {out }}\left[P_{e e}(6)+P_{e o}(6)+P_{o e}(6)\right]
$$

$$
P_{e o e e}(8)=\operatorname{Sum}\left[P_{e e}(8)+P_{e o}(8)+P_{o e}(8)\right]
$$

$$
P_{\text {eoee }}(9)=\mathrm{C}_{\text {out }}\left[P_{e e}(8)+P_{e o}(8)+P_{o e}(8)\right]
$$

$P_{e o e e}(10)=\operatorname{Sum}\left[P_{e e}(10)+P_{e o}(10)+P_{o e}(10)\right] \quad P_{e o c e}(11)=\mathrm{C}_{\text {out }}\left[P_{e e}(10)+P_{e o}(10)+P_{o e}(10)\right]$
$P_{\text {eoee }}(12)=\operatorname{Sum}\left[P_{e e}(12)+P_{e o}(12)+P_{o e}(12)\right] \quad P_{\text {eoee }}(13)=\mathrm{C}_{\text {out }}\left[P_{e e}(12)+P_{e o}(12)+P_{o e}(12)\right]$
$\mathrm{P}_{\mathrm{oo}} \mathrm{S}_{\mathrm{e}}$ : result of adding all the even positions of $\mathrm{P}_{\mathrm{oo}}, \mathrm{M}$ and N .

$$
\begin{array}{lll}
P_{o o} S_{e}(1)=0 & P_{o o} S_{e}(2)=0 & P_{o o} S_{e}(3)=0 \\
P_{o o} S_{e}(4)=P_{o o}(4) & P_{o o} S_{e}(5)=0 & P_{o o} S_{e}(6)=P_{o o}(6) \\
P_{o o} S_{e}(7)=0 & P_{o o} S_{e}(8)=\operatorname{Sum}\left[P_{o o}(8)+X_{2} Y_{7}+X_{7} Y_{2}\right] \\
P_{o o} S_{e}(9)=\mathrm{C}_{o u t}\left[P_{o o}(8)+X_{2} Y_{7}+X_{7} Y_{2}\right] & P_{o o} S_{e}(10)=P_{o o}(10) \\
P_{o o} S_{e}(11)=0 & P_{o o} S_{e}(12)=P_{o o}(12) & P_{o o} S_{e}(13)=0 \\
P_{o o} S_{e}(14)=P_{o o}(14) & P_{o o} S_{e}(15)=0 &
\end{array}
$$

$\mathrm{P}_{\mathrm{oo}} \mathrm{S}_{\mathrm{o}}$ : result of adding all the odd positions of $\mathrm{P}_{\mathrm{oo}}, \mathrm{M}$ and N .
$P_{o o} S_{o}(1)=P_{o o}(1)$
$P_{o o} S_{o}(2)=0$

$$
P_{o o} S_{o}(3)=P_{o o}(3)
$$

$$
P_{o o} S_{o}(4)=0
$$

$$
P_{o o} S_{o}(5)=P_{o o}(5)
$$

$$
P_{o o} S_{o}(6)=0
$$

$$
P_{o o} S_{o}(7)=\operatorname{Sum}\left[P_{o o}(7)+X_{7} Y_{I}\right]
$$

$$
P_{o o} S_{o}(8)=\mathrm{C}_{\mathrm{out}}\left[P_{o o}(7)+X_{7} Y_{I}\right]
$$

$$
P_{o o} S_{o}(9)=\operatorname{Sum}\left[P_{o o}(9)+X_{2} Y_{8}\right]
$$

$$
P_{o o} S_{o}(10)=\mathrm{C}_{\mathrm{out}}\left[P_{o o}(10)+X_{2} Y_{8}\right]
$$

$$
P_{o o} S_{o}(11)=P_{o o}(11)
$$

$$
P_{o o} S_{o}(12)=0
$$

$$
P_{o o} S_{o}(13)=P_{o o}(13)
$$

$$
P_{o o} S_{o}(14)=0
$$

$$
P_{o o} S_{o}(15)=P_{o o}(15)
$$

Addition process is completed at this stage and four 16 -bit numbers $\left(\mathrm{P}_{\text {eoce, }}, \mathrm{P}_{\text {eooo }}, \mathrm{P}_{00} \mathrm{~S}_{e}\right.$, $\mathrm{P}_{\mathrm{oo}} \mathrm{S}_{\mathrm{o}}$ ) result of 3-to-2 addition of $\mathrm{P}_{\mathrm{ee}}, \mathrm{P}_{\mathrm{e} 0}, \mathrm{P}_{\mathrm{oe}}, \mathrm{M}$ and N are the outputs of this level.

Equation 3.7 is rewritten as:

$$
\begin{equation*}
\mathrm{P}_{\mathrm{ec}}+\mathrm{P}_{\mathrm{eo}}+\mathrm{P}_{\mathrm{oe}}+\mathrm{P}_{\mathrm{oo}}+\mathrm{M}+\mathrm{N}=\mathrm{P}_{\text {eoce, }}+\mathrm{P}_{\text {eoco }}+\mathrm{P}_{\mathrm{oo}} \mathrm{~S}_{\mathrm{e}}+\mathrm{P}_{\mathrm{oo}} \mathrm{~S}_{\mathrm{o}} . \tag{3.8}
\end{equation*}
$$

At the next stage addition of the four numbers is broken to two steps as shown in Equation 3.9.

$$
\begin{equation*}
\mathrm{P}_{\text {eoee, }}+\mathrm{P}_{\text {eoeo }}+\mathrm{P}_{00} \mathrm{~S}_{\mathrm{e}}+\mathrm{P}_{00} \mathrm{~S}_{0}=\left(\mathrm{P}_{\text {eoce, }}+\mathrm{P}_{\text {eooo }}+\mathrm{P}_{00} \mathrm{~S}_{\mathrm{e}}\right)+\mathrm{P}_{00} \mathrm{~S}_{\mathrm{o}} \tag{3.9}
\end{equation*}
$$

The same technique as the previous stage is used two more times to convert the three numbers $\left(\mathrm{P}_{\text {eoee }}, \mathrm{P}_{\text {eoeo }}, \mathrm{P}_{\text {oo }} \mathrm{S}_{\mathrm{e}}\right)$ to two numbers $\left(\mathrm{PS1}_{\mathrm{e}}, \mathrm{PSI}_{\mathrm{o}}\right)$ as following: $\mathrm{PS}_{\mathrm{e}}$ : result of adding all the even positions of $\mathrm{P}_{\text {eoce, }}, \mathrm{P}_{\text {eooo }}$ and $\mathrm{P}_{00} \mathrm{~S}_{\mathrm{e}}$.

$$
\begin{aligned}
& P S I_{e}(1)=0 \\
& P S I_{c}(2)=\operatorname{Sum}\left[P_{\text {eoee }}(2)+P_{\text {eoev }}(2)+P_{\text {oo }} S_{\ell}(2)\right] \\
& P S I_{e}(3)=\mathrm{C}_{\text {out }}\left[P_{\text {egeee }}(2)+P_{\text {eooo }}(2)+P_{\text {oo }} S_{e}(2)\right] \\
& P S I_{e}(4)=\operatorname{Sum}\left[P_{\text {eoee }}(4)+P_{\text {eoeo }}(4)+P_{\text {oo }} S_{e}(4)\right] \\
& P S 1_{e}(5)=\mathrm{C}_{\text {out }}\left[P_{\text {eoee }}(4)+P_{\text {eoes }}(4)+P_{o o s} S_{e}(4)\right] \\
& P S I_{e}(6)=\operatorname{Sum}\left[P_{\text {eoee }}(6)+P_{\text {eoeo }}(6)+P_{o o} S_{e}(6)\right] \\
& P S I_{e}(7)=\mathrm{C}_{\text {out }}\left[P_{\text {eoee }}(6)+P_{\text {eoeo }}(6)+P_{\text {oos }} S_{e}(6)\right] \\
& P S I_{e}(8)=\operatorname{Sum}\left[P_{\text {eoee }}(8)+P_{\text {eooo }}(8)+P_{\text {oo }} S_{e}(8)\right] \\
& P S I_{e}(9)=\mathrm{C}_{\text {out }}\left[P_{\text {eoee }}(8)+P_{\text {eoeo }}(8)+P_{o o} S_{e}(8)\right] \\
& P S I_{e}(10)=\operatorname{Sum}\left[P_{\text {eocee }}(10)+P_{\text {eooo }}(10)+P_{\text {oo }} S_{e}(10)\right] \\
& P S 1_{e}(11)=\mathrm{C}_{\text {out }}\left[P_{\text {eogec }}(10)+P_{\text {eoeo }}(10)+P_{\text {oo }} S_{e}(10)\right] \\
& P S l_{e}(12)=\operatorname{Sum}\left[P_{\text {eovee }}(12)+P_{\text {eooo }}(12)+P_{\text {oo }} S_{e}(12)\right] \\
& P S I_{e}(13)=\mathrm{C}_{\text {out }}\left[P_{\text {eovel }}(12)+P_{\text {eooe }}(12)+P_{\text {oo }} S_{e}(12)\right] \\
& P S I_{e}(14)=\operatorname{Sum}\left[P_{\text {eocee }}(14)+P_{\text {eooo }}(14)+P_{\text {oo }} S_{e}(14)\right] \\
& P S 1_{e}(15)=\mathrm{C}_{\text {out }}\left[P_{\text {eoee }}(14)+P_{\text {eoevo }}(14)+P_{\text {oo }} S_{e}(14)\right]
\end{aligned}
$$

$\mathrm{PS1}_{\mathrm{o}}$ : result of adding all the odd positions of $\mathrm{P}_{\text {eoce, }}, \mathrm{P}_{\text {eooo }}$ and $\mathrm{P}_{\text {oo }} \mathrm{S}_{\mathrm{e}}$

$$
\begin{aligned}
& P S I_{o}(1)=\operatorname{Sum}\left[P_{\text {eocee }}(1)+P_{\text {eooo }}(1)+P_{o o} S_{e}(1)\right] \\
& P S I_{o}(2)=\mathrm{C}_{\text {out }}\left[P_{\text {eoeee }}(1)+P_{\text {eooo }}(1)+P_{o o} S_{e}(1)\right] \\
& \text { PS1 }{ }_{\imath} \text { (3) }=\operatorname{Sum}\left[P_{\text {evee }}(3)+P_{\text {eovo }}(3)+P_{o o} S_{e}(3)\right] \\
& P S I_{o}(4)=\mathrm{C}_{\text {out }}\left[P_{\text {evee }}(3)+P_{\text {eooo }}(3)+P_{o o} S_{e}(3)\right] \\
& P S I_{o}(5)=\operatorname{Sum}\left[P_{\text {eoce }}(5)+P_{\text {eoeo }}(5)+P_{\text {oo }} S_{e}(5)\right] \\
& P S I_{o}(6)=\mathrm{C}_{\text {out }}\left[P_{\text {eveee }}(5)+P_{\text {eoeo }}(5)+P_{\text {oo }} S_{e}(5)\right] \\
& P S l_{\iota}(7)=\operatorname{Sum}\left[P_{\text {eoee }}(7)+P_{\text {eoes }}(7)+P_{c o} S_{e}(7)\right] \\
& P S I_{o}(8)=\mathrm{C}_{\text {out }}\left[P_{\text {eoee }}(7)+P_{\text {eooo }}(7)+P_{\text {oo }} S_{e}(7)\right] \\
& P S 1_{o}(9)=\operatorname{Sum}\left[P_{\text {eooe }}(9)+P_{\text {eoeo }}(9)+P_{\text {oo }} S_{e}(9)\right] \\
& P S l_{o}(10)=\mathrm{C}_{\text {out }}\left[P_{\text {eoce }}(9)+P_{\text {eoeo }}(9)+P_{\text {oo }} S_{e}(9)\right]
\end{aligned}
$$

$$
\begin{aligned}
& P S 1_{o}(11)=\operatorname{Sum}\left[P_{\text {eoce }}(11)+P_{\text {eveo }}(11)+P_{\text {cr }} S_{e}(11)\right] \\
& P S I_{o}(12)=\mathrm{C}_{\text {out }}\left[P_{\text {eoeec }}(11)+P_{\text {coes }}(11)+P_{c o s} S_{c}(11)\right] \\
& P S 1_{o}(13)=\operatorname{Sum}\left[P_{\text {eocee }}(13)+P_{\text {eoteo }}(13)+P_{\text {or }} S_{e}(13)\right] \\
& P S 1_{o}(14)=\mathrm{C}_{\text {out }}\left[P_{\text {eooe }}(13)+P_{\text {coeoo }}(13)+P_{\text {tro }} S_{c}(13)\right] \\
& P S 1_{o}(15)=\operatorname{Sum}\left[P_{\text {eoce }}(15)+P_{\text {eoso }}(15)+P_{\text {or }} S_{c}(15)\right] \\
& P S l_{o}(16)=\mathrm{C}_{\text {out }}\left[P_{\text {eoees }}(15)+P_{\text {coeso }}(15)+P_{\text {ovo }} S_{c}(15)\right]
\end{aligned}
$$

At the next parallel adder plane the two new words from previous adder plane $\left(\mathrm{PSI}_{\mathrm{c}}\right.$, $\mathrm{PS} 1_{\mathrm{o}}$ ) are added to $\left(\mathrm{P}_{\mathrm{oo}} \mathrm{S}_{\mathrm{o}}\right)$ via another 3-to-2 adder stage to complete the Equation 3.9. This addition process is carried out similar to the one in the previous level. The three input numbers at this level are converted to two new 15 -bit numbers called $P_{c}$ and $P_{0}$. Arithmetic
$\mathrm{P}_{\mathrm{e}}$ : result of adding all the even positions of $\mathrm{P}_{\mathrm{oo}} \mathrm{S}_{\mathrm{o}}, \mathrm{PSI}_{\mathrm{e} .}, \mathrm{PSI}_{\mathrm{o}}$

$$
\begin{aligned}
& P_{e}(2)=\operatorname{Sum}\left[P_{o o} S_{o}(2)+P_{\text {eooo }}(2)+P_{\text {eooee }}(2)\right] \\
& P_{e}(3)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(2)+P_{e o o o}(2)+P_{e o o e}(2)\right] \\
& P_{e}(4)=\operatorname{Sum}\left[P_{o o} S_{o}(4)+P S I_{e}(4)+P S I_{o}(4)\right] \\
& P_{e}(5)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(4)+P S I_{e}(4)+P S I_{o}(4)\right] \\
& P_{e}(6)=\operatorname{Sum}\left[P_{o o} S_{o}(6)+P S I_{e}(6)+P S I_{o}(6)\right] \\
& P_{e}(7)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(6)+P S I_{e}(6)+P S I_{o}(6)\right] \\
& P_{e}(8)=\operatorname{Sum}\left[P_{o o} S_{o}(8)+P S I_{e}(8)+P S I_{o}(8)\right] \\
& P_{e}(9)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(8)+P S I_{e}(8)+P S I_{o}(8)\right] \\
& P_{e}(10)=\operatorname{Sum}\left[P_{o o} S_{o}(10)+P S I_{e}(10)+P S I_{o}(10)\right] \\
& P_{e}(11)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(10)+P S I_{e}(10)+P S I_{o}(10)\right] \\
& P_{e}(12)=\operatorname{Sum}\left[P_{o o} S_{o}(12)+P S I_{e}(12)+P S I_{o}(12)\right] \\
& P_{e}(13)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(12)+P S I_{e}(12)+P S I_{o}(12)\right] \\
& P_{e}(14)=\operatorname{Sum}\left[P_{o o} S_{o}(14)+P S I_{e}(14)+P S I_{o}(14)\right] \\
& P_{e}(15)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(14)+P S I_{e}(14)+P S I_{o}(14)\right]
\end{aligned}
$$

$\mathrm{P}_{0}$ : result of adding all the odd positions of $\mathrm{P}_{00} \mathrm{~S}_{0}, \mathrm{PSI}_{\mathrm{e},}, \mathrm{PSI}_{\mathrm{o}}$
$P_{o}(1)=\operatorname{Sum}\left[P_{o o} S_{o}(1)+P S I_{e}(1)+P S I_{o}(1)\right]$
$P_{o}(2)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(1)+P S I_{e}(1)+P S I_{o}(1)\right]$
$P_{o}(3)=\operatorname{Sum}\left[P_{o o} S_{o}(3)+P S I_{e}(3)+P S I_{o}(3)\right]$
$P_{o}(4)=\mathrm{C}_{\mathrm{out}}\left[P_{o o} S_{o}(3)+P S I_{e}(3)+P S I_{o}(3)\right]$
$P_{o}(5)=\operatorname{Sum}\left[P_{o o} S_{o}(5)+P S I_{e}(5)+P S I_{o}(5)\right]$
$P_{o}(6)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(5)+P S I_{e}(5)+P S I_{o}(5)\right]$
$P_{o}(7)=\operatorname{Sum}\left[P_{o o} S_{o}(7)+P S I_{e}(7)+P S I_{o}(7)\right]$
$P_{o}(8)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(7)+P S I_{e}(7)+P S I_{o}(7)\right]$
$P_{o}(9)=\operatorname{Sum}\left[P_{o o} S_{o}(9)+P S I_{e}(9)+P S I_{o}(9)\right]$
$P_{o}(10)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(9)+P S I_{e}(9)+P S I_{o}(9)\right]$
$P_{o}(11)=\operatorname{Sum}\left[P_{o o} S_{o}(11)+P S I_{e}(11)+P S I_{o}(11)\right]$
$P_{o}(12)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(11)+P S I_{e}(11)+P S I_{o}(11)\right]$
$P_{o}(13)=\operatorname{Sum}\left[P_{o o} S_{o}(13)+P S I_{e}(13)+P S I_{o}(13)\right]$
$P_{o}(14)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(13)+P S I_{e}(13)+P S I_{o}(13)\right]$
$P_{o}(15)=\operatorname{Sum}\left[P_{o o} S_{o}(15)+P S I_{e}(15)+P S I_{o}(15)\right]$
$P_{o}(16)=\mathrm{C}_{\text {out }}\left[P_{o o} S_{o}(15)+P S I_{e}(15)+P S I_{o}(15)\right]$
In the last stage of multiplication process, these two final numbers $\left(\mathrm{P}_{\mathrm{e}}, \mathrm{P}_{\mathrm{o}}\right)$ need to be summed as it is shown in Equation 3.10. This equation is summarized form of Equation 3.6.

$$
\begin{equation*}
P_{e}+P_{o}=P \tag{3.10}
\end{equation*}
$$

At the last step final two numbers $\left(\mathrm{P}_{\mathrm{e}}, \mathrm{P}_{\mathrm{o}}\right)$ are simply added to generate the final product. This addition needs to be performed fast. Therefore, carry lookahead structure, known as a fast adder, has been used to speed up the multiplication.

In the next Chapter the simulation of the major block as well as the final simulation results are presented.

## Chapter 4

## Simulation Results \& Layout Considerations

This Chapter presents the simulation results for the major designed cells and circuits. The simulation results of the final stage of the proposed multiplier for certain given inputs are further discussed. Layout considerations are explained later in this Chapter.

All circuits including individual cells and entire design have been simulated in Cadence environment.

### 4.1 Simulation Results of the Individual Circuits

Before presenting the simulation results of the individual circuit and designed cells, we need to introduce the circuit structure that have been used for simulation purposes. Arranging the proper test circuits has significant impact in increasing the ASIC productivity.

Simulation Circuit Structure: In regular multipliers such as the proposed architecture that uses full-adder cells as the building block, a cascade of full adders is usually utilized. In such cases, the high driving capability of adder is a must for providing the next cell with input signal with proper logic level. Having this point in mind, the circuit structure used to simulate the adder is illustrated in Fig. 4.1. A cascade of four full adder cells is utilized; the inputs are fed from buffers (two cascaded inverters) to give
more realistic signals and outputs are loaded with buffers to give proper loading conditions [28].


Fig. 4.1 Circuit structure used for simulation of full adder cell

The parasitic effects are, therefore, included in the simulation results. The same structure has been used to compare the adder cells discussed in topology selection.


Fig. 4.2 Circuit structure used for simulation of AND/NAND/OR/XOR gates

## Full Adder Simulation Results

Here are the simulation results for pseudo-NMOS full adder using the test circuit structure of Fig. 4.1 corresponding to four different input patterns. The input patterns
were already introduced in Chapter 3 when describing simulation strategy of adder cells. These patterns fairly cover most of the possible input combinations.

Fig. 4.3 shows Sum and $C_{\text {out }}$ signals of pseudo-NMOS full adder to input pattern shown in
Fig. 3.7 (a). This pattern covers 6 transitions of the input signals $\left(A, B, C_{i n}\right)$. These transitions are also shown in Table 4.1 corresponding to those in Fig. 4.3.


Fig. 4.3 The simulation waveforms showing respond of the pseudo full adder to the input pattern (a)

Table 4.1 Transitions covered by input pattern (a)

|  |  |  |  |  |  |  | Outputs |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\boldsymbol{A}$ | $\boldsymbol{B}$ | $C_{\text {in }}$ | Sum $^{\prime}$ | $C_{\text {out }}$ |  |  |  |  |  |
| 0 | 1 | 1 | 0 | 1 |  |  |  |  |  |
| 1 | 1 | 1 | 1 | 1 |  |  |  |  |  |
| 0 | 0 | 1 | 1 | 0 |  |  |  |  |  |
| 1 | 0 | 1 | 0 | 1 |  |  |  |  |  |
| 0 | 1 | 0 | 1 | 0 |  |  |  |  |  |
| 1 | 1 | 0 | 0 | 1 |  |  |  |  |  |

Fig. 4.4 shows Sum and $C_{\text {out }}$ signals of pseudo-NMOS full adder to input pattern shown in Fig 3.7 (b). This pattern covers 6 transitions of the input signals $\left(A, B, C_{i n}\right)$. These transitions are shown in Table 4.2 corresponding to those in Fig. 4.4.


Fig. 4.4 The simulation waveforms showing respond of the pseudo full adder to the input pattern (b)

Table 4.2 Transitions covered by input pattern (b)

| Inputs |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| $\boldsymbol{A}$ | $\boldsymbol{B}$ | $\boldsymbol{C}_{\text {in }}$ | $S_{\text {um }}$ Out | $\boldsymbol{C}_{\text {out }}$ |
| 0 | 1 | 1 | 0 | 1 |
| 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 |

Fig. 4.5 shows Sum and $C_{\text {out }}$ signals of pseudo-NMOS full adder to input pattern shown in Fig. 3.7 (c). This pattern covers 6 transitions of the input signals $\left(A, B, C_{i n}\right)$. These transitions are shown in Table 4.3 corresponding to those in Fig. 4.5.


Fig. 4.5 The simulation waveforms showing respond of the pseudo full adder to the input pattern (c)

Table 4.3 Transitions covered by input pattern (c)

| Inputs |  |  |  | Out ints |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| $\boldsymbol{A}$ | $\boldsymbol{B}$ | $C_{\text {in }}$ | $S_{\text {um }}$ | $C_{\text {out }}$ |  |
| 0 | 1 | 1 | 0 | 1 |  |
| 0 | 0 | 1 | 1 | 0 |  |
| 0 | 1 | 0 | 1 | 0 |  |
| 0 | 0 | 0 | 0 | 0 |  |
| 1 | 1 | 1 | 1 | 1 |  |
| 1 | 0 | 1 | 0 | 1 |  |

Fig. 4.6 shows the Sum and $C_{\text {out }}$ signals of pseudo-NMOS full adder to input pattern discussed in Fig. 3.7 (d). This pattern covers 6 transitions of the input signals ( $A, B, C_{i n}$ ). These transitions are shown in Table 4.4 respectively as it is seen in Fig. 4.6.


Fig. 4.6 The simulation waveforms showing respond of the pseudo full adder to the input pattern (d)

Table 4.4 Transitions covered by input pattern (d)

| Inputs |  |  |  | Outputs |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\boldsymbol{A}$ | $\boldsymbol{B}$ | $C_{\text {in }}$ | Sum | $\boldsymbol{C}_{\text {out }}$ |  |  |
| 0 | 0 | 1 | 1 | 0 |  |  |
| 0 | 1 | 1 | 0 | 1 |  |  |
| 1 | 1 | 1 | 1 | 1 |  |  |
| 1 | 0 | 1 | 0 | 1 |  |  |
| 0 | 0 | 0 | 0 | 0 |  |  |
| 0 | 1 | 0 | 1 | 0 |  |  |

## AND/NAND/OR/XOR Gates Simulation Results

AND/NAND/OR/XOR gates have been described in more details in Chapter 3. Schematics are shown in Fig. 3.15. The test structure used for the simulation is shown in Fig. 4.2. The input signals have $50 \%$ duty cycles with period of 2 ns. Figures 4.7 to 4.9 show the results of the simulation for these gates.


Fig. 4.7 The simulation waveforms showing respond of the AND/NAND gate

MULTIPLIER_PARWISE_SCH OR_TEST_SCH schemotic: Mor 19 02:10:38 2004
Tronsient Response


Fig. 4.8 The simulation waveforms showing respond of OR gate

MLLTIPLIER_PAIRWISE_SCH XOR_TEST_SCH schematic : Mar 19 02:19:02 2004
Transient Response
$\square$




Fig. 4.9 The simulation waveforms showing respond of XOR gate

### 4.2 Final Simulation Results

In order to evaluate the performance of the proposed multiplier three dimensions have to be measured. These dimensions are speed, power consumption and area. In this section speed and power consumption are estimated.

Speed: Speed of the multiplier is translated to the minimum interval (frequency) between two sequential multiplication operations (8-bit x 8 -bit) for which the results of multiplication are successful. To determine the frequency of multiplication, worst-case (maximum) delay of the entire design should be measured. By having the worst-case delay the minimum operating frequency of multiplier can be calculated according to Equation 4.1.

$$
\begin{equation*}
f_{\text {min }}=1 / \tau_{\max }, \tag{4.1}
\end{equation*}
$$

where $f_{\text {min }}$ is minimum operating frequency of multiplication, $\tau_{\max }$ is the worst-case delay of the multiplier.

As shown in Fig 4.10 the operation of the proposed multiplier can be divided to 6 stages as:

1) AND generation
2) $1^{\text {st }}$ Adder level
3) $2^{\text {nd }}$ Adder level
4) $3^{\text {rd }}$ Adder level
5) $4^{\text {th }}$ Adder level
6) Final adder level (Carry lookahead)

Due to parallel operation (AND and addition) in stages 1 to 5 the delay of one AND gate can represent the delay of the first stage (AND generation) and so does the delay of one pseudo-NMOS adder for each of stages 2 to 5 (Adder levels). Delay of carry lookahead adder is separately measured.

In order to evaluate the worst-case delay of the entire design first the worst-case delay of each stage has been measured and, then, the final worst-case delay of the proposed multiplier can be calculated by Equation 4.2:


Fig. 4.10 The critical path of the proposed multiplier

$$
\begin{equation*}
\tau_{\text {Total }}=\tau_{\text {AND Generator }}+4 \times \tau_{\text {Adder level }}+\tau_{\text {Final adder stage }} \tag{4.2}
\end{equation*}
$$

where $\tau_{\text {Total }}$ is the worst-case delay of the multiplier, $\tau_{A N D}$ Generator is the worst-case delay of the AND generator stage which is equal to the delay of an AND gate, $\tau_{\text {Adder level }}$ is worstcase delay of one adder stage which is equal to the worst-case delay of one pseudoNMOS full adder cell, and $\tau_{\text {Final adder stage }}$ is worst-case delay of the final adder stage which is equal to the worst-case delay of the 16 -bit carry lookahead. Delay of AND gate can be simply measured according to propagation delay definition. Fig 4.11 shows the delay of AND gate.

The worst-case delay has a better meaning for pseudo-NMOS full adder due to possibility of different input combinations. The delay of pseudo-NMOS adder has been measured with all input combination. The worst-case delay has been occurred when $A=1, B=1$ and $C_{i n}=1$ as shown in Fig 4.12.


Fig. 4.11 Delay of AND gate


Fig. 4.12 The worst-case delay of pseudo-NMOS adder occurring when $A=1, B=1$ and $C_{\text {t }}=1$

In order to measure the worst-case delay of 16 -bit carry lookahead adder, the same method has been taken. Different input transitions have been applied and delay has been measured between the input and the last output signals at $50 \%$ of transition point. The worst-case delay has been seen when "1111111111111111" and "1111111111111111" are added as it was expected due to rippling $C_{\text {out }}$ signal between every 4-bit carry lookahead adder modules (remember that the 16-bit carry lookahead adder constructed by four 4-bit carry lookahead adder modules).

Fig. 4.13 shows the input and output signals in composite format. The delay occurring between input and output signals is clearly seen in this figure.

Table 4.5 shows the values of worst-case delay of AND gate, pseudo-NMOS full adder and 16-bit carry lookahead adder.


Fig. 4.13 The worst-case delay of 16-bit carry lookahead adder

Table 4.5 The results of the worst-case delay measurement

| estanliplier Stage | The worst-case delay (ps) |
| :---: | :---: |
| AND generation (One AND gate) | 75 |
| Adder stage (One pseudo-NMOS adder) | 120 |
| 16-bit carry lookahead adder | 277 |
| $\tau_{\text {Total }}=832 \mathrm{ps}$ |  |

It should be pointed out that he worst-case delay that has been measured and shown in Table 4.5 is the results of examining each blocks (AND, Adder plane and Final adder stage) separately. It gives an estimation of the worst-case delay of the entire design but as one may notice applying the pattern causing the worst case delay is under control only for the first two multiplier stages which are "AND Generator" and "First Adder Level". By applying pattern " 11111111 "as X and " 11111111 "as Y , AND generator creating " 1 " at all of its outputs. Therefore, all the input of the next stage which is the first adder level are " 1 ". It means the worst-case delay definitely occurs in this stage. So it can be guaranteed that the first two stages operate with their slowest speed (worst-case delay) but this is not necessarily the case in the other stages. However, it is not expected that the entire multiplier's delay exceeds the total worst-case delay ( $\tau_{\text {fotal }}$ ) that have been measured. This has been examined by applying $\mathrm{X}=$ "11111111" and $\mathrm{Y}=$ "11111111" to the proposed multiplier. Fig 4.14 shows this input signals applied to the proposed multiplier. The first curve from the top is the current drawn from the $V_{d d}$ node. Fig 4.15 shows post-layout simulation result corresponding to the applied input (Fig.4.15).
a





$$
=10
$$

$$
\cdots \quad 1.0 \quad \square \quad \square
$$

$$
\underset{\sim}{2} 0
$$

$$
\underset{\sim}{2} \text { 1.B }\left[\begin{array}{l}
-1 \times 3 \\
\end{array}\right.
$$

$$
=1.8[
$$

Fig. 4.14 Input waveforms of $X=11111111$ and $Y=11111111$


Fig. 4.15 The post layout simulation waveforms showing results of "11111111"x"11111111"


Fig. 4.16 The measured delay of the proposed multiplier corresponding to "11111111"x"11111111"

In order to measure the delay occurring during "11111111" $x$ " 11111111 " the composite simulation of the input and output waveforms are used. Then, the delay has been measured in the same manner as previously defined. Fig 4.16 shows the input and output waveforms. The delay ( $\tau_{\text {Total }}$ ) is also shown in Fig 4.16. The measured delay is 793 ps and it is less than 835 ps as it was expected. So now it is fair to assume that the delay of the proposed $8 \times 8$-bit multiplier in worst case is less than summation of worst-case delay of its individual blocks. This worst-case delay may never occur, but it is used to set a point for maximum delay of the multiplier. It is translated to the minimum operating frequency of the proposed multiplier, which is determined as 1.19 GHz . That is, the frequency of the proposed design is guaranteed at 1.19 GHz .

Power: the estimation of power consumed by large digital circuit is a complex task. Measuring the power consumption is critical for low-power design as it permits the designer to optimize power, to meet requirements, and to know the power distribution through the chip. Several heuristic algorithms, statistical, and probabilistic methods have been introduced $[24,25,26]$. These methods become less accurate when the size of the circuits increases. It is better to decompose the large circuit into smaller modules and use this method to estimate the power consumption of each module. These methods are also very helpful approaches to optimize the performance of the decomposed modules. One of these approaches has been employed in the circuit-level design of this work in order to meet the power efficiency as one of the objective. The practical aspect of the method is explained more in detail in the topology selection and transistor sizing of full adder circuitry in Chapter 3.

Nevertheless, in case of complex systems it is wise to use CAD tools for accurate power consumption measurement.

Generally power estimation refers to the techniques of estimating the average power dissipation of circuits. There are several power analysis techniques and tools at the circuit, gate, architectural, and behavioural level of abstraction. The most straightforward method of power estimation is done through circuit simulation; i.e., performing a circuit simulation of the design and measuring the average current drawn from the supply. Therefore, the average power can be estimated which is the average of summation of the three major components as shown in Equation 4.3:

$$
\begin{equation*}
P_{\text {total }}=P_{s}+P_{d}+P_{s c}, \tag{4.3}
\end{equation*}
$$

where $P_{s}$ is static power consumption and it is the power consumed due to leakage and static currents, $P_{s c}$ is short-circuit power consumed because of the current flowing from power supply to ground during transistors switching and $P_{d}$ is referred to as dynamic
power consumption which constitutes the majority of the power consumed in CMOS VLSI circuits.

The method used by CAD tools to measure the power consumption is strongly dependent on the input patterns (pattern-dependent technique). The technique is also called dynamic power simulation, which should not be confused with dynamic power. Equation (4.4) shows the dynamic portion of the total power consumption of the digital circuit. This equation is very similar to the algorithm that has been driven to compute the power consumed by digital circuits in CAD tools such as Spice [27]:

$$
\begin{equation*}
P_{d y n a m i c}=\sum\left(C_{i_{i o a d}} \cdot V_{i_{\text {swing }}} \cdot \alpha_{i}\right) \cdot f_{c l k} \cdot V_{d d}+\sum\left(K_{i} \cdot \alpha_{i}\right)\left(V_{d d}-2 V_{t}\right) \cdot 3 f_{c l k}, \tag{4.4}
\end{equation*}
$$

where $K_{i}=\beta_{t} / 12, C_{i}$ is load capacitance at node $i, V_{i}$ is the voltage swing at node $i, \alpha_{i}$ is known as switching activity factor at node $i, f_{c l k}$ is the system clock frequency, $V_{d d}$ is the power supply voltage, $V_{t}$ is transistor threshold voltage, $\beta$ is the gain factor of the transistor. The summation is over all the nodes of the circuit, which makes the power estimation a very complex calculation. Changing any of the components in Equation 4.4 would result different power consumption values. Some of the components of Equation 4.4 are process-dependent such as $V_{t}$ and $V_{d d}$. Other components such as $C_{i l o a d}, V_{i s w i n g}$ are predetermined by to the design requirements.

Two components in Equation 4.4, which depend only on input pattern, are clock frequency $\left(f_{c l k}\right)$ and the switching factor $\left(\alpha_{i}\right)$. The input frequency of the entire system has been limited at lower level by the delay of the critical path. It means that by taking into account the approximate 800 ps delay of the critical path that has already been measured, the characteristics of the input signals are determined. The required period for input signal is 800 ps at minimum value. By considering $50 \%$ duty cycle as a standard for the input signals the lower input pulse width of 400 ps is required. Thus, the frequency of the
input signals is set at the minimum value of 1.1 GHz . This brings the first condition for power consumption evaluation.

Switching factor $\left(\alpha_{i}\right)$ is the underlying factor of transistors switching. For N periods of 0 $\rightarrow V_{d d}$ and $V_{d d} \rightarrow 0$ transitions, the switching activity $\alpha_{i}$ determines how many $0 \rightarrow V_{d d}{ }^{2}$ transitions occur at the capacitive nodes. In the other words, the $\alpha_{i}$ represents the probability that a transition $0 \rightarrow V_{d d}$ will occur during the period $T=l / f$, where $f$ is the period of the input signals at the node. Considering all internal nodes' transition is a complex task, which is out of the scope of present discussion. However, it is clear at this point that choosing the pattern that makes the high number of transitions in one period of multiplier is a contributing factor to power consumption value. Hence, this brings another condition for the input patterns.

Therefore, due to using Cadence to measure power consumption of the proposed multiplier the two following conditions are considered to govern the power performance;

1) Applying the input signals with the operating frequency of approximate 1.2 GHz .
2) Applying the input pattern causing the maximum switching activity in entire design.

The power consumed by the entire system has been measured by changing the inputs from $X_{I}=00000000 \rightarrow X_{2}=11111111$ and $Y_{I}=00000000 \rightarrow Y_{2}=11111111$. The transition occurring by these input patterns guarantees of charging the load capacitances at all nodes of the circuits to maximum (Fig. 4.17). So one can expect to observe the maximum power consumed by the multiplier by applying this pattern. This pattern is shown in Fig. 4.14. The waveform of the current drawn by $V_{d d}$ node during applying this pattern is shown in Fig 4.17. The average of this current computed by Cadence is 10.5 mA and the power consumption is measured as 18.09 mW . Many different patterns have been applied to the proposed multiplier and delay and power consumption have been recorded (Table 4.6 and 4.7). The maximum power consumption observed belongs to the
pattern multiplication of "11111111" $x$ " 11111111 ", which is expected according to the switching activity definition in complex digital system.


Fig. 4.17 The waveform of current drawn by $V_{d d}$ node ("11111111" $x$ " 11111111 ")

In order to further examine the effect of switching activity in a complex system such as the proposed design another random pattern has been chosen. The power consumed by the entire system has been measured by applying a pattern causing transitions of $X_{I}=$ $00000000 \rightarrow X_{2}=10101010$ and $Y_{1}=00000000 \rightarrow Y_{2}=01010101$. It is expected that the value of the power consumed by the multiplier under this pattern is almost the mean value of the power consumed by the system when all inputs are set to " 0 " which is called the "power down" or "minimum power consumption" value and the maximum power consumption of the system which occurs by applying "11111111" x " 11111111 ". This is shown in Equation (4.5).

$$
\begin{equation*}
P_{\text {average }}=\frac{P_{\min }+P_{\max }}{2}, \tag{4.5}
\end{equation*}
$$

where $P_{\min }$ is the power down value measured as 19.314 nW when no inputs applied and $P_{\max }$ is the power consumed by applying pattern " 11111111 " x " 11111111 " which is
measured as 10.5 mW . The reason for such an assumption is equal random density of " 0 " and " 1 " in the pattern " 101010101 " and " 01010101 " which makes possible to assume that capacitance at $50 \%$ of all the nodes in entire system will be charged. So the assumed value for the power consumption by applying this pattern from Equation 4.5 is 9.054 mW . The actual power consumption measured by Cadence during applying this pattern is 10.347 mW . The difference about $12 \%$ has been seen between the assumed power consumption and the actual power consumption, which is measured by Cadence. This difference is mainly due to power consumed by the interconnections and routing paths.

Fig 4.18 shows the current drawn by $V_{d d}$ node during applying pattern " 10101010 " X " 01010101 ". The average of this current is computed by Cadence as 5.74 mA .


Fig. 4.18 The waveform of current drawn by $V_{d d}$ node ("10101010" x "01010101")

Fig 4.19 shows the simulation waveform of the input patterns by assumption of $50 \%$ switching activity compared to the pattern "11111111" $x$ " 11111111 ". Fig 4.20 shows the multiplication product of the input patterns of Fig 4.19.

IULTIPLIER_PAIRHSE_IWPROMED_5CH TEYา_SCH =hemotic , Hor 20 106:32:53 200
Trangient Pegponge
■


Fig. 4.19 waveform of the input patterns " 10101010 " x " 01010101 "

```
% 1,B n P. P1B
= 1B a:P15
=10
#0, 1,8 [P13 
~0,0
# 18 [8 [i P11
# 1.8 [00 [10
% 1.8
~=1B
```



```
=1,B [PB
\ 1.8 [0:P5
# 18 [-P4
% 1.8 [0.0 [ P3 
```




Fig 4.20 The simulation waveforms of multiplication product of the input patterns shown in Fig 4.19

A total number 100 patterns have been examined as inputs to validate the operation of the
proposed design. These patterns included 80 random patterns and the 20 intentional
patterns propagate the carry from $0^{\text {th }}$ bit to $16^{\text {th }}$ bit.
Tables 4.6 and 4.7 present the delay and power consumption of the proposed multiplier as
results of the applying some of the random and intentional patterns.

Table 4.6 The numerical results of several intentional multiplications

| $\mathbf{x}_{7} \mathrm{X}_{2} \mathbf{x}_{1}$ | Dec. | $\mathbf{Y}_{7} . \mathbf{Y}_{2} \mathrm{Y}_{1}$ | Dec. |  | Dec. | Delay (ns) | $\begin{gathered} \text { Power } \\ \text { Consumption } \\ (\mathrm{mW}) \end{gathered}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| "00000000" | 0 | "00000000" | 0 | "0000000000000000" | 0 | 0 | 0.9504 |
| "00000001" | 1 | "11111111" | 255 | "0000000011111111" | 255 | 0.78 | 7.542 |
| "00000011" | 3 | "11111111" | 255 | "0000001011111101" | 765 | 0.749 | 10.71 |
| "00000111" | 7 | "1111111" | 255 | "0000011011111001" | 1785 | 0.791 | 10.962 |
| "00001111" | 15 | "11111111" | 255 | "0000111011110001" | 3825 | 0.666 | 12.456 |
| "00011111" | 31 | "11111111" | 255 | "0001111011100001" | 7905 | 0.78 | 16.362 |
| "00111111" | 63 | "11111111" | 255 | "0011111011000001" | 16065 | 0.78 | 16.936 |
| "01111111" | 127 | "11111111" | 255 | "0111111010000001" | 32385 | 0.78 | 17.744 |
| "1111111" | 255 | "11111111" | 255 | "1111111000000001" | 65025 | 0.78 | 18.09 |

Table 4.7 The numerical results of several random multiplications sorted by delay

| $\mathbf{X}_{7} . \mathbf{X}_{2} \mathbf{X}_{1}$ | Dec. | $\mathrm{Y}_{7} . \mathrm{Y}_{2} \mathrm{Y}_{1}$ | Dece | $\mathrm{P}_{17} \mathrm{Q}_{\mathrm{P}} \mathrm{P}_{4} \mathrm{P}_{3} \mathrm{P}_{2} \mathrm{P}_{1}$ |  | Delay (ns) | Power: Consumption (mW) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| "10110100" | 180 | "00101000" | 40 | "0001110000100000" | 7200 | 0.662 | 2.54 |
| "10011101" | 157 | "00101100" | 44 | "0001101011111100" | 6908 | 0.662 | 5.17 |
| "10100101" | 165 | "00001100" | 12 | "0000011110111100" | 1980 | 0.662 | 3.91 |
| "10101101" | 173 | "00110100" | 52 | "0010001100100100" | 8996 | 0.670 | 4.42 |
| "00111101" | 61 | "00101100" | 44 | "0000101001111100" | 2684 | 0.689 | 6.93 |
| "10111101" | 189 | "00111100" | 60 | "0010110001001100" | 11340 | 0.703 | 5.92 |
| "10111001" | 185 | "00111100" | 60 | "0010101101011100" | 11100 | 0.703 | 5.99 |
| "10110101" | 181 | "00111000" | 56 | "0010011110011000" | 10136 | 0.704 | 4.50 |
| "00001100" | 12 | "00011000" | 24 | "0000000100100000" | 288 | 0.711 | 7.23 |
| "00111101" | 61 | "00111100" | 60 | "0000111001001100" | 3660 | 0.711 | 9.83 |
| "00110000" | 48 | "00010000" | 16 | "0000001100000000" | 768 | 0.711 | 9.46 |
| "00011001" | 25 | "00110100" | 52 | "0000010100010100" | 1300 | 0.712 | 7.16 |
| "10100101" | 165 | "00001100" | 12 | "0000011110111100" | 1980 | 0.712 | 3.90 |
| "00111101" | 61 | "00011100" | 28 | "0000011010101100" | 1708 | 0.714 | 4.55 |
| "00100100" | 36 | "00001000" | 8 | "0000000100100000" | 288 | 0.716 | 7.13 |
| "10110001" | 177 | "00110000" | 48 | "0010000100110000" | 8496 | 0.729 | 6.84 |

$X_{7} \ldots X_{2} X_{1}=$ Binary representation of multiplier, $Y_{7} \ldots Y_{2} Y_{I}=$ Binary representation of multiplicand, $P_{17} \ldots P_{2} P_{1}=$ Binary representation of the multiplication product, Dec. $=$ Decimal representation of the multiplicand and multiplier

In regard to design robustness, effects of noise have to be evaluated. Noise is the main factors determining the stability of the system. In the following the main sources of the noise have been described and the performance of the proposed design has been examined by considering the noise effects.

Noise: One of the main degrading factors in performance of high-speed VLSI system is noise, which comes from different sources. Noise can be induced through supply and ground of the system during switching transitions. This noise is known as Simultaneous Switching Noise (SSN). Another type of noise is thermal noise. However, this noise is not dominant source, it is inevitable [23].

Following by definition of the SSN and thermal noise the robustness of the proposed multiplier has been examined by considering the effects of these noises on the performance of the entire design.

One of the main sources of the noise in digital system is Simultaneous Switching Noise (SSN). The effects of SSN is getting more attention as a result of the continuous increase in integration level on a single chip and the operating speed. This noise is caused by the large instantaneous current, due to the switching of multiple drivers and switches, through the parasitic inductance at the ground and power node. SSN can have dramatic impacts on the system by:
a) generating glitches on the ground and power supply interconnections
b) decreasing the effective driving strength of the circuit
c) generating output signal distortion
d) reducing the overall noise margin of the system.

The quantify ground bounce or noise is given by Equation 4.6 as:

$$
\begin{equation*}
V_{n}=L_{V s s} \frac{d I}{d t}, \tag{4.6}
\end{equation*}
$$

where, $V_{n}$ is ground bounce, $L_{\text {vss }}$ is bond wire parasitic inductance and $I$ is the current flow in the bond wire inductor.

Equation 4.6 shows that SSN can be lowered by reducing parasitic inductance. In order to reduce parasitic inductance a multiple pads and pins for power supply ( $V_{d d}$ ) and ground $\left(V_{s s}\right)$ are needed. Allocating the extra pins to $V_{d d}$ and $V_{s s}$ reduces $L_{v s s}$ to $L_{v s s}$ effective due to the parallel configuration of parasitic inductors as shown in Fig 4.21.


Fig. 4.21 Adding two extra pins to $V_{d d}$ and $V_{s s}$ reducing the parasitic inductance

The standard package No. 68PGA offered by CMC has 36 pins which allows us to assign 32 pins to the inputs (two 8-bit multiplicand and multiplier) and the outputs (16-bit multiplication product). Therefore, the 2 extra pins are specified to $V_{d d}$ and $V_{s s}$ (one extra for each) and this has been done at no extra cost. These multiple pads reduce the parasitic inductance to half.

Not having glitches also strong driving capability of the output signals validate our approach to reduce the impact of SSN. This proves the robustness of the proposed design against the switching noise.

Thermal noise is another source of noise, which is generated by thermal agitation of electrons in conductors. Equation 4.7 shows the power of this noise as:

$$
\begin{equation*}
P=\frac{k T}{\Delta f} \tag{4.7}
\end{equation*}
$$

where $P$ is noise power, $k$ is Boltzmann'a constant, $T$ is the conductor temperature and $\Delta f$ is the bandwidth. It is often preferred to represent this noise in noise voltage as shown in Eqaution 4.8.

$$
\begin{equation*}
V_{n(\text { Thernal })}=\sqrt{\frac{k T R}{\Delta f}} \tag{4.8}
\end{equation*}
$$

where $R$ is the parasitic resistance. Thermal noise power, per Hz , is equal throughout the frequency spectrum, depending only on $k$ and $T$. So to simply examine the effect of this noise on performance of the entire system the voltage of the final outputs could be simulated within operating temperature ranged from -40 C to 135 C . The voltage variation within this temperature range with capacitive load $\left(C_{l}\right)$ of 5 pf is $0.15 \%$. This shows the system is quite robust against the temperature variation. Fig 4.22 shows the simulation results of the final outputs against temperature variation.


Fig. 4.22 The simulation results of the final outputs against temperature variation

### 4.3 Layout Considerations

The proposed $8 \times 8$-bit multiplier has been laid out in $0.18 \mu \mathrm{~m}$ CMOS technology. In the following the layout issues such as floor planning, routing, pads and packaging of the adder cell are explained.

### 4.3.1 Layout Strategies

Considering transistor chaining, grouping, and signal sequencing in our proposed adder layout, has been shown to bring substantial power saving and speed improvement at no area penalty [9].

The following measures have been taken to reduce parasitic effects:

1) Minimizing the use of diffusion as a routing layer to reduce the overall parasitic by using metal II layer.
2) Placing the transistors switched by $\mathrm{C}_{\text {in }}$ signal close to the output.
3) Minimizing the capacitive load on $\mathrm{C}_{\text {out }}$ signal by minimizing the size of those transistors in Sum gate whose gate signals are connected to $\mathrm{C}_{\text {out }}$.
4) Using matching transistor structure to reduce area.
5) Making the transistor connecting to $\mathrm{C}_{\text {in }}$ closer to the input of the circuit, therefore, reducing input capacitance.


Fig. 4.23 Layout of the pseudo-NMOS full adder (die size of $22 \times 8.5 \mu \mathrm{~m}^{2}$ )


Fig. 4.24 Layout of AND gate (die size of 7.9x $5.6 \mu \mathrm{~m}^{2}$ )


Fig. 4.25 Layout of NAND gate (die size of $5.4 \times 5.6 \mu \mathrm{~m}^{2}$ )


Fig. 4.26 Layout of XOR gate (die size of $10.2 \times 20.7 \mu \mathrm{~m}^{2}$ )


Fig. 4.27 Layout of core of the $8 \times 8$-bit proposed multiplier core (die size of $0.275 \times 0.38 \mathrm{~mm}^{2}$ )

### 4.3.2 Pads, Package and Chip Size

Following are some details on the routing, pad, package and chip size of the proposed multiplier.

Routings: To reduce parasitic capacitances, local interconnections use metal \# 2, metal \# 3 and metal \# 4. The input and output signals go through metal \# 5. Avoiding long overlap between neighbouring metal layers will reduce the coupling capacitances.

Pad: I/O digital pads of TSMC library "tpz973q" from cell "PDIDGZ" have been used for connecting the core to the output. Dummy layers are also added to satisfy the density requirement.

Package: Package 68PGA is used for the chip. This package provides the core chip with 36 pins ( 9 pins in each side). The total area including the area occupied by pads is 1.395 x $1.37 \mathrm{~mm}^{2}$. The core area of the chip is $0.275 \times 0.38 \mathrm{~mm}^{2}$. Fig 4.28 shows the layout the entire chip.


Fig. 4.28 The proposed $8 \times 8$-bit multiplier chip

## Chapter 5

## Conclusion

Digital multipliers are one of the crucial blocks of real-time Digital Signal Processing (DSP) application ranging from digital filtering to image processing. However, speed of the operation is not the only considerations; low power dissipation and small chip area are also needed because of the dense packing of transistors in today's system on-chip (SoC) applications.

This thesis focuses on an application specific integrated circuit (ASIC) implementation of a digital multiplier with speed of operation over 1 GHz . The three main considerations for the design are high multiplication speed, low power consumption, and a small rectangular chip area.

### 5.1 Features of the Designed Multiplier

The performance of the proposed multiplier has met the objectives of this work. The high-performance of the proposed multiplier has been achieved by an efficient design strategy as follow;

- Several multiplication algorithms have been reviewed. Considering speed as the priority in system-level, pair-wise algorithm has been chosen.
- The critical building blocks of pair-wise algorithm have been identified and ranked by their impacts on speed, power and area on the entire design. This emphasizes the full adder importance.
- An extensive study on performance of the main six static full adders has been performed in order to select the most power-speed efficient full adder topology. Six full adders haven been re-designed through an iterative approach in sake of proper transistor sizing (this approach has been used in design of the other required elements to avoid under-sizing and over-sizing transistors). This approach guaranteed power reduction in the circuit design level. The full adders have been examined under equal conditions by a realistic circuit structure.
- Speed and power are treated as same importance during topology selection by using power-delay product as a measuring factor. Area and driving capability are also taken into account. The comparison results in choosing pseudo-NMOS full adder.

Table 5.1 Performance of the proposed multiplier

| Device Characteristics |  |
| :---: | :---: |
| Process | Five-Metal 0.18 $\mu \mathrm{m}$ Digital CMOS |
| Power Supply $\left(\mathrm{V}_{\mathrm{dd}}\right)$ | 1.8 V |
| Chip Characteristics |  |
| Multiplier \& Multiplicand | 8 bits |
| Product | 16 bits |
| Multiplication delay | $666 \sim 793 \mathrm{ps}$ |
| Power Consumption (power down) | 19.314 nW |
| Power Consumption @ Input Frequency 1.1 GHz | 18.09 mW |
| Average Power Consumption | 10.347 mW |
| Core size | 0.1045 mm |
| Operating Temperature | -40 C to +135 C |

The designed multiplier is suitable for high speed and low power applications, which provides multiplication product of two 8 -bit numbers in approximate 800 psec. Accurate functioning in supply ranged from 1.8 V to 0.09 V has proved the suitability of the proposed architecture. The power consumption is 18.09 mW for 1.1 GHz . The design is implemented in TSMC $0.18 \mu \mathrm{~m}$ CMOS technology and analyzed using Cadence's Spectre with BSIM3v3 device models.

The proposed $8 \times 8$-bit multiplier is laid out in $0.18 \mu$ CMOS technology and was verified for design rules and matched with schematics. The total area is $1.395 \times 1.37 \mathrm{~mm}^{2}$. The results of post-layout simulations are in reasonable consistency with those found in the design process.

### 5.2 Comparison Results

In this section the summarized results of investigation in the recent works on digital multipliers are provided in Table 5.2. This selection has been made based on the novelty of the works. Data are extracted form IEEE Journal of the Solid-State circuits and the results are based on measurement of the actual chips.

As it is seen in this Table, the reported multipliers are implemented in different CMOS technologies with different bit words. This can have dramatic impacts on criteria of comparison such as power, speed and silicon area.

As it is seen, different approaches in designing multipliers are taken based on different design dimensions such as area, power consumption, and speed throughput. However, it will not be fair if the results of the conducted survey on digital multiplier are directly compared, as the technology, bit width, target frequency, and simulation methodologies vary widely. The following discussion provides some indications of multiplier results and this leads us to evaluate the performance of the proposed multiplier.

When speed is the main concern Booth encoding scheme and Wallace tree reduction show their abilities for large throughput multiplier. However, combination of these methodologies with GaAs device results in high-speed multiplier, which is not feasible for CMOS device to reach. From this point of view the proposed multiplier shows its
superiority for the medium bit width (4 to 8 bits) applications in speed and power trade off.

Table 5.2 Summary of the performance of the recent publications on digital multiplier

| Year/hef, | Multiplier | Bit- | $\begin{array}{r} \text { Téchnology } \\ (\mathrm{pm}) \end{array}$ | $\begin{aligned} & \text { Power } \\ & \text { Supply } \\ & \text { (V) } \end{aligned}$ | $\begin{aligned} & \text { Speed } \\ & (\mathbf{M H z}) \end{aligned}$ | Corre Area $\left(\mathrm{mm}^{2}\right)$ | Conower Consump $(m W)$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\begin{gathered} \text { N. Itoh } \\ (2001)[28] \end{gathered}$ | RectangularStyled Wallace Tree | 54 | 0.18 | 1.8 | 600 | 0.98 | - |
| $\begin{gathered} \text { J. Butas } \\ (2001)[29] \end{gathered}$ | Asynchronous CrossPipelined | 16 | 0.6 | 1.5~5 | 59~251 | - | 40.59 |
| $\begin{gathered} \text { S. Kim } \\ (2001)[30] \end{gathered}$ | True SinglePhase Adiabatic | 8 | 0.5 | 2.7 | 220 | 0.47 | - |
| $\begin{gathered} \text { J.Lim } \\ (2000)[31] \end{gathered}$ | NRERL Serial | 8 | 0.6 | 2.5 | 0.1~1 | 2.37 | - |
| $\begin{aligned} & \text { J.S. Wang } \\ & (2000)[32] \end{aligned}$ | TSPC Flip- Flops | 8 | 0.6 | 3.3 | 300 | 0.3 | 52.4 |
| $\begin{aligned} & \text { A. Smith } \\ & \text { (1997)[33] } \end{aligned}$ | GaAs | 16 | 0.6 | 0.9 | 416 | 1.98 | 1700 |
| $\begin{gathered} \text { J. Mori } \\ (1991)[34] \end{gathered}$ | $4-2$ <br> Compressor | 54 | 0.5 | 3.3 | 100 | 12.4 | 870 |
| $\begin{gathered} \text { K. Yano } \\ (1990)[35] \end{gathered}$ | $\begin{gathered} \text { Pass- } \\ \text { Transistor } \end{gathered}$ | 54 | 0.25 | 2.5 | 227 | 12.7 | - |
| M. Hatamian $(1986)[36]$ | Parallel Pipelined | 8 | 2.5 | 2.5 | 75 | - | 250 |

In terms of power consumption, asynchronous circuitry and adiabatic logic are viable approaches for applications where speed is not the prime concern. Nevertheless, passtransistor logic has properties of both higher speed and lower power consumption (Table 3.12). Also NMOS reversible energy recovery logic, which is a new reversible adiabatic logic circuit, is employed in ultra-low-power applications. The serial multiplier, which has been implemented by this logic, is suitable for the applications where energy consumption is the top priority [30]. The proposed multiplier still stands ahead in power consumption compared to other designs. However, this structure could be more power efficient specifically for large bit words if pipelining technique is employed.

Where area is the prime concern, the recent progress in use of Deep Sub-micron Devices can help to overcome this constraint. It is also possible to reduce the silicon area by tighter layout style such as rectangular Wallace tree [28].

Therefore, it has been recommended that in multiplier performance and area tradeoffs, combinations of several parameters feature size, encoding scheme should be well considered. Encoding scheme has significant effect in the area of implementation. In design of the proposed multiplier the area is considered as one of the criteria in choosing the building blocks. Pseudo-NMOS shows significant area saving due to having only 14 transistors with maximum transistor width size of $2 \mu \mathrm{~m}$.

### 5.3 Future Work

To further improve the performance of pair-wise multiplier one needs to consider a way to reduce the critical path delay of the multiplier for longer bit width with better trade off between speed and power consumption.

A well-known technique to reduce the critical path in digital architectures is to place pipeline latches at appropriate places so that the functionality of the circuit remains unchanged and no appreciable reduction in the throughput occurs, however it takes a very accurate time scheduling for pipeline tasks.

This methodology can be considered as an alternative design of pair-wise due to the absence of any feedback loop in this architecture (Fig 2.9). The advantages of this methodology are many-fold. Since the proposed architecture permits pipelining, the operation speed can be considerably increased. This increased speed can be traded for reduction in supply voltage to achieve a considerable reduction in power consumption. This approach makes the pair-wise architecture qualitatively a viable configuration for constant data stream in DSP applications, however, extensive quantitative evaluation
based on the proper simulation arrangements is required to show the speed and power trade off.

## References

[1] C. R. Baugh and B. A. Wooley, "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Trans. Computers, vol. C-22, no. 12, pp.1045-1047, Dec 1973.
[2] J. S. Wang, "A New True-Single-Phase-Clocked Double-Edge-Triggered Flip-Flop for Low Power VLSI Designs," in Proc. IEEE ISCAS 1997, pp.1896-1899.
[3] A.D. Booth, "A Signed Binary Multiplication Technique," Quarterly J. Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236-240, 1951.
[4] P.Y. Lu, et al., "A 30-MFLOP 32b CMOS Floating-Point Processor," IEEE SolidState Circuit Conf. Proceedings, vol. XXXI, pp. 28-29, February 1988.
[5] W. McAllister and D. Zuras, "An NMOS 64b Floating-Point Chip Set," IEEE Int. Solid-State Circuits Conf., pp. 34-35, February 1986.
[6] B. J. Benschneider, et al., "A 50MHz Uniformly Pipelined 64b Floating-Point Arithmetic Processor," IEEE Int. Solid-State Circuits Conf., pp.50-51, February 1989.
[7] M. Hatamian and G. L. Cash, "High Speed Signal Processing Pipelining, and VLSI," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp.1173-1176, April 1986.
[8] A. P. Chandrakasan, S. Sheng and W. Brodersen, "Low-Power CMOS Digital Design," IEEE J. Solid State Circuits, vol. 27, No. 4, April 1992.
[9] A. Khatibzadeh and K. Raahemifar, "A 14-Transistor Low Power High-Speed Full Adder Cell," in Proc. 2003 Canadian Conf. on Electrical and Computer Engineering (CCECE2003), Montreal, May 2003.
[10] A. Khatibzadeh and K. Raahemifar, "A Study \& Comparison of Full Adder Cells Based on the Standard Static CMOS Logics," in Proc. 2004 Canadian Conf. on Electrical and Computer Engineering (CCECE2004), Niagara, May 2004.
[11] J. Yuan and C. Svensson, "High-Speed CMOS Circuit Technique," IEEE J. Solid State Circuits, vol. 24, No. 1, February 1989.
[12] K. Hwang, "Computer Arithmetic: Principles, Architecture, and Design," John Wiley and Sons, 1979.
[13] J. J. F. Cavanagh, "Computer Science Series: Digital Computer Arithmetic," McGraw-Hill Book Co., 1984.
[14] K. Raahemifar and M. Ahmadi, "A Fast 32-bit Digital Multiplier," in Proc. of the $8^{\text {th }}$ IEEE International Conf. on Electronics, Circuits and Systems (ICECAS), Malta, Sept. 2001, pp.1413-1416.
[15] R. Zimmermann and W. Fichtner, "Low-power Logic Styles: CMOS versus PassTransistor Logic," IEEE J. Solid State Circuits, vol. 32, pp.1079-90, July 1997.
[16] A. Parameswar et al., "A High Speed, Low Power, Swing Restored Pass-Transistor Logic Based on Multiply and Accumulate Circuit for Multimedia Applications," IEEE CICC, May 1994, pp. 278-281.
[17] Y. Sasaki et al., "Pass-Transistor Based on Gate Array Architectures," in 1995 Symp. VLSI Circuits, Dig. Tech. Papers, June 1995, pp. 123-124.
[18] M. Suzuki et al., "A 1.5-ns 32-b CMOS ALU in Double Pass-Transistor Logic," IEEE J. Solid State Circuits, vol. 28, no. 11, pp. 1145-1151, November 1993.
[19] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, A System Perspective, MA: Addison-Wesley, 1993.
[20] A. Shams and M. Bayoumi, "A Modular Approach for Designing Low Power Adders", Proc. ASILOMAR, June 1997.
[21] J. M. Wang et al., "New Efficient Design for XOR \& XNOR Functions on the Transistor Level," IEEE J. Solid State Circuits, vol. 29, no. 7, pp. 780-786, July 1994.
[22] E. Abu-Shama and M. Bayoumi, "A New Cell for Low-Power Adder," in Proc. Int. Midwest Symp. Circuits Syst., 1995.
[23] J.M. Rabaey, "Digital Integrated Circuits," Prentice Hall, 1996.
[24] A. Bellauar and M. Elmasry, "Low-Power Digital VLSI Design Circuit and System," Kluwer academic Publishers, 1995.
[25] S. M. Kang, "Accurate Simulation of Power Dissipation in VLSI Circuits," IEEE J. Solid State Circuits, vol. 21, no. 5, pp. 889-891, October 1986.
[26] H. J. M. Veendrick, "Short-Circuit Dissipation of Static CMOS Circuitry anf Its Impact on the Design of Buffer Circuits," IEEE J. Solid State Circuits, vol. 19, no. 4, pp. 468-473, August 1984.
[27] G. J. Fisher, "An Enhanced Power Meter for SPICE2 Circuit Simulation," IEEE Trans. Computer-Aided Design, vol. 7, pp. 641-643, May 1988.
[28] N. Itoh, Y. Naemura, H. Makino, Y. Nakase, T. Yoshihara and Y. Horiba, "A 600MHz $54 \times 54$-bit multiplier with Rectangular-Styled Wallace Tree," IEEE J. Solid State Circuits, vol. 36, No. 2, February 2001.
[29] J. Butas, C. Choy, J. Povanzenec and C. F. Chan, "Asynchronous Cross-Pipelined Multiplier," IEEE J. Solid State Circuits, vol. 36, No. 8, August 2001.
[30] S. Kim, C.H. Ziesler and M. C. Papaefthymiou, "A true Single-Phase 8-bit Adiabatic Multiplier," Design Automation Conference, 2001, Preceeding, pp. 758-763.
[31] J. Lim, D. Kim, S. Kang and S. Chae, "An 8 x 8-b NRERL Serial Multiplier for Ultra-low-power Application," IEE Proceeding, vol. 146, pp. 327-333, Dec 2000.
[32] J. S. Wang, P. H. Yang and D. Sheng, "Design of a 3-V 300-MHz Low-Power $8 \times 8$ b Pipelined Multiplier Using Pulse-Triggered TSPC Flip-Flops," IEEE J. Solid State Circuits, vol. 35, No. 4, April 2000.
[33] A. B. Smith, N. Burgess, S. Cui and M. Liebelt, "GaAs Multiplier Design for HighSpeed DSP Application," Thirty-first ASILOMAR Conference, 1997.
[34] J. Mori and et al., " A 10-ns 54 X 54-bit Parallel Structured Full Array Multiplier with $0.5-\mu$ CMOS Technology," IEEE J. Solid State Circuits, vol. 26, No. 4, April 1991.
[35] K.Yano, " A 3.8-ns CMOS 16 X 16-bit Multiplier Using Complementary PassTransistor Logic," IEEE J. Solid State Circuits, vol. 25, pp. 388-395, April 1990.
[36] M. Hatamian and G. Cash "A $70-\mathrm{MHz} 8 \times 8$-bit Parallel Pipelined Multiplier in $2.5 \mu \mathrm{~m}$ CMOS," IEEE J. Solid State Circuits, vol. SC-21, No. 4, August 1986.

