# Design Exploration of Low Power Arithmetic and SRAM circuits using Subthreshold Design technique 

## THESIS

Submitted in partial fulfillment of the requirements for the degree of

## DOCTOR OF PHILOSOPHY

By<br>Priya Gupta<br>ID No. 2011PHXF416P

Under the supervision of
Prof. Anu Gupta

Co-supervision of
Dr. Abhijit Asati


BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE PILANI (RAJASTHAN) INDIA

MARCH 2017

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE PILANI (RAJASTHAN) INDIA

## CERTIFICATE

This is to certify that the thesis entitled "Design Exploration of Low Power Arithmetic and SRAM circuits using Subthreshold Design technique" submitted by Ms. Priya Gupta, ID No 2011PHXF416P for award of Ph.D. of the Institute embodies original work done by her under our supervision.
(Signature of the Supervisor)

Prof. Anu Gupta

Professor

Birla Institute of Technology \& Science Pilani

Pilani - 333031 (Rajasthan) INDIA
(Signature of the Co-Supervisor)
Dr. Abhijit Asati
Assistant Professor

Birla Institute of Technology \& Science Pilani
Pilani - 333031 (Rajasthan) INDIA

Date:

## ACKNOWLEDGEMENTS

I intend to extend my gratitude to my advisor, Prof. Anu Gupta, Professor, and co-advisor, Dr. Abhijit Asati, Assistant Professor, BITS Pilani for their suggestion, guidance and supervision in the present work.

Many thanks to my two thesis committee members, Prof. S Gurunarayanan and Dr. Nitin Chaturvedi, for reviewing my proposal and advising me from time to time during my Ph.D.

I would also like to thank the Departmental Research Committee members, Prof. Navneet Gupta, Prof. H. O. Bansal and Dr. K. K. Gupta for their constructive feedback.

I gratefully acknowledge the support of Pawan Sharma for all kinds of technical discussions, which helped me a lot in my research.

I must acknowledge all my fellow research scholars, Ashish Kumar Sharma, V. Balaji, Yogesh B., Prachi Sharma, Jitendra Kumar, Satish Mohanty, Abhishek Joshi, Jyotirmoy Bhardwaj, Fani Mani, Ravinder, Prashant Upadhyay and Dhananjay Mishra, for all kinds of support in my research work. Working, playing, living, and arguing with my fellow research scholars (Nikita Sharda, Anurada Devi, Arun Nihal Singh, Himanshu Chawla, Pankaj Munjal, Mr. \& Mrs. Dandautiya, Gaurav, Meenakshi, Rini, Rajesh, Sonal etc.) has been enriching and enjoyable.

My heartfelt thanks are due to all those teachers and students who took time out from their busy schedule to help me in this thesis. I thank all those who were involved in typing and printing this project report. I also want to thank staff members of EEE Department: Surender, Manoj Kumar, Amitabh, Mahesh Sani, Birdi Chand and Sanjay for their kind help in laboratory and other official work.

Last but certainly not least, I would also like to thank all my family members for their invaluable support and everlasting trust. Most of all, I would like to thank God for everything, especially his love and grace.

## ABSTRACT

A complete analysis of the arithmetic circuits and SRAM cells used for above-threshold operation has never been carried out in the context of sub-threshold operation. Conventional arithmetic circuits and SRAM cells may not work in the sub-threshold region at nanometer technologies because of the ultra-low supply voltage.

This thesis presents new functional designs and a design space exploration for arithmetic circuits and SRAM cells in the sub-threshold region. The exploration covers three parallel prefix adder architectures, two column compression multipliers and Static Random Access Memory (SRAM) cells with transistor counts ranging from 6 to 12, and compares them in terms of power consumption, propagation delay, and power-delay product. Performance is evaluated at the 45 nm and 180 nm technology nodes to assess the impact of scaling. The work has been carried out in three parts.

In the first part, three power-efficient parallel prefix adder architectures, namely the Carry look-ahead adder (CLA), Kogge-Stone adder (KSA) and Han-Carlson adder (HCA), are selected on the basis of the literature survey. These architectures are designed, simulated and analyzed in the sub-threshold region with three different logic design styles, namely Static-CMOS, Hybrid Pass Transistor (Hybrid PT), and Hybrid Transmission Gate (Hybrid TG), for operand sizes of 8, 16, 32 and 64 bits. At 45 nm technology, a comparison of the three selected adder architectures shows that HCA is the most power-efficient and CLA is the fastest in the sub-threshold region. Relative to the worst (maximum) value obtained, HCA shows $96.8 \%$ lower power consumption, CLA shows $79.8 \%$ lower propagation delay and KSA shows $63.7 \%$ lower power-delay product. The reverse body bias technique yields functional circuits only with Static-CMOS logic, where it reduces propagation delay by $61.87 \%$ but increases average power consumption and power-delay product by $77.03 \%$ and $38.3 \%$ respectively. The same trend of power consumption, delay and power-delay product is observed for all the adder architectures at 180 nm technology.

In the second part, two power-efficient column compression multipliers, namely Wallace tree and Dadda, are selected on the basis of the literature survey. A total of sixteen new designs of these architectures have been developed using two different partial product accumulation schemes, namely lower order compressors (LOC) and mixed lower and higher order compressors (Mixed L/H). These designs are developed with two logic design styles (Static-CMOS and HYB-TG) for two sizes ($4 \times 4$ and $8 \times 8$) at two technology nodes (45 nm and 180 nm) in the sub-threshold region.

At 45 nm, among all eight multipliers implemented, Dadda-45-L/H-HCA has the lowest power-delay product: $49 \%$ (for $4 \times 4$-bit) and $31.5 \%$ (for $8 \times 8$-bit) lower than the highest power-delay product obtained. Static-CMOS and HYB-TG are the most power-delay-product-efficient design styles for LOC-based and Mixed L/H-based multipliers respectively. The same trend of power-delay product is observed at 180 nm technology.

At 45 nm, the power consumption is higher than at 180 nm for all adder and multiplier designs. Because the supply voltage is kept the same at 0.4 V, the increase is attributed to the higher leakage current at 45 nm technology compared with 180 nm technology.

In the third part, five new SRAM cell designs with 7, 8, 9 and 12 transistor configurations are proposed for sub-threshold operation at 45 nm technology with a 0.4 V power supply. Cell stability is analyzed using standard measures such as the hold, read and write static noise margins as well as the N-curve cell stability metrics. For performance analysis, read/write delay and leakage power consumption in hold mode are considered. The results show improvement in all the design parameters over other published designs and the conventional 6 transistor (6T) SRAM cell. A comparison of the proposed designs among themselves shows that three new designs, namely M8T, MPT8T and MI-12T, have low leakage power consumption along with other improved design parameters such as high read stability, high write ability, and fast read and write operation. Thus, these designs can be an attractive choice for low-power applications. In comparison to the results of published designs at 180 nm, all proposed designs show improvement in read stability, write ability, read delay and write delay, but with an increase in leakage power consumption in hold mode.

## LAYOUT TEMPLATE

Layer Information:

| METAL1 (Pin) | NWELL | METAL1 | METAL2 |  | CONT |
| :---: | :---: | :---: | :---: | :---: | :---: |
| DIFF | POLY1 | NIMP | PIMP |  | VIA12 |



## TABLE OF CONTENTS

Certificate
Acknowledgements
Abstract
Layout templates
Table of Contents
List of Tables
List of Figures
List of Abbreviations

1. Introduction
1.1. Background
1.2. Sub-Threshold Operation
1.2.1. MOSFET in Sub-Threshold Region
1.2.2. The MOSFET Drain Current in the Sub-threshold Region
1.3. Sources of Power Consumption in CMOS Logic Gate
1.4. Motivations and Challenges in Sub-threshold Design
1.5. Research Gaps
1.6. Objectives of the thesis
1.7. Organization of the thesis
References
2. Literature Review
2.1. Introduction
2.2. Arithmetic Circuits
2.2.1. Parallel Adder Architectures
2.2.2. Column Compression Multipliers
2.3. SRAM Cells
References
3. Adders
3.1. Introduction
3.2. Adder Architectures
3.2.1. CLA
3.2.2. KSA
3.2.3. HCA
3.3. Logic Families for Sub-threshold Circuit Design
3.3.1. Static-CMOS Logic
3.3.2. Pass Transistor (PT) Logic Family
3.3.3. Complementary Pass-Transistor Logic (CPL)
3.3.4. Swing Restored Pass-Transistor Logic (SRPL)
3.3.5. Double Pass-Transistor Logic (DPL)
3.3.6. Transmission Gate Logic (TG)
3.3.7. Comparative Analysis of Basic Logic Gates using Different Logic Families for Sub-Threshold Operation
3.4. Effect of Reverse Body Bias (RBB) Scheme
3.4.1. Simulation Results of Logic Gates with RBB using Different Logic Families
3.5. Design and Analysis of Parallel Adders
3.5.1. Design Implementation using CLA
3.5.1.1. CLA with Static-CMOS logic (Static-CMOS CLA)
3.5.1.2. CLA with Hybrid TG logic (HYB-TG CLA)
3.5.1.3. CLA with Hybrid pass transistor logic (HYB-PT CLA)
3.5.2. Simulation Methodology and Results of CLA
3.5.3. Design Implementation using KSA
3.5.4. Simulation Methodology and Results of KSA
3.5.5. Design Implementation using HCA Architectures
3.5.6. Simulation Methodology and Results of HCA
3.6. Impact of RBB on Static-CMOS Adders
3.7. Final Results and Discussions
3.8. Conclusions
References
4. Multipliers
4.1. Introduction
4.2. Partial Product Generation
4.3. Partial Product Accumulation
4.3.1. Partial Product Accumulation using LOC in Wallace tree
4.3.2. Partial Product Accumulation using Mixed L/H in Wallace tree
4.3.3. Partial Product Accumulation using LOC in Dadda Tree
4.3.4. Partial Product Accumulation using Mixed L/H in Dadda Tree
4.4. Final Stage Addition in Wallace tree and Dadda Multipliers
4.4.1. RCA
4.4.2. HCA
4.5. Design and Analysis of Partial Product Accumulation Modules
4.5.1. LOC Designs
4.5.2. LOC Design using Static-CMOS
4.5.3. LOC Design using HYB-TG Logic
4.5.4. Higher Order Compressors (HOC)
4.5.4.1. Conventional and Modified Architecture of 4-2 Compressor
4.5.4.2. Design Implementation of Modified HOCs (5-2, 6-2 and 7-2)
4.5.5. Performance Analysis of Compressor Designs using Different Design Styles for
4.6. Design and Analysis of Wallace tree and Dadda Multipliers
4.6.1. Design Implementation using Wallace tree Multipliers
4.6.2. Simulation Methodology and Results of Wallace tree Multipliers
4.6.3. Design Implementation using Dadda Multipliers
4.6.4. Simulation Methodology and Results of Dadda Multipliers
4.7. Final Results and Discussions
4.8. Conclusions
References
5. Static Random Access Memory (SRAM)
5.1. Introduction
5.2. Design and Operation of C6T
5.3. Design and Operation of Proposed SRAM Cells at 45 nm
5.3.1. Design of Proposed MPT8T using PN Access Transistor
5.3.2. Design of Proposed M8T using NN-Parallel Access Transistor
5.3.3. Design of Proposed MI-12T
5.3.4. Design of Proposed M7T
5.3.5. Design of Proposed M9T
5.4. Simulation Results and Discussion at 45 nm
5.4.1. SRAM standby stability analysis (Hold stability)
5.4.2. SRAM Read Stability Analysis
5.4.3. SRAM Write Ability Analysis
5.4.4. Alternative Noise Margins
5.4.5. Read Access Time ($\mathrm{T}_{\mathrm{RA}}$) with Variability
5.4.6. Write Access Time ($\mathrm{T}_{\mathrm{WA}}$) with Variability
5.4.7. Leakage Power Consumption in Hold mode
5.5. Analytical Expressions for HOLD, RSNM \& WSNM of SRAM Cells
5.5.1. Analytical Expressions for Hold SNM of C6T, M7T, MPT8T, M8T, M9T and MI-12T SRAM cells
5.5.2. Analytical Expressions for RSNM of C6T, M7T, MPT8T, M8T, M9T and MI-12T SRAM cells
5.5.3. Analytical Expressions for WSNM of C6T, M7T, MPT8T, M8T, M9T and MI-12T SRAM cells
5.6. Final Results and Discussions
5.7. Conclusions
References
6. Conclusions \& Future Scope of Work
6.1. Conclusions on Adders
6.2. Conclusions on Multipliers
6.3. Conclusions on SRAM Cells
6.4. Future scope of the work
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
List of Publications
Publication in Peer Reviewed Journals
Publication in Peer Reviewed Conferences
Brief Biography of the Candidate
Brief Biography of the Supervisor
Brief Biography of the Co-Supervisor

## LIST OF TABLES

Table 2.1 Comparison of various referenced CLA architectures
Table 2.2 Comparison of various referenced HCA architectures
Table 2.3 Comparison of various referenced KSA architectures
Table 2.4 Comparison of various referenced compressor designs
Table 2.5 Comparison of various referenced low power SRAM cells
Table 3.1 Simulation results for AND Gate
Table 3.2 Simulation results for OR Gate
Table 3.3 Simulation results for XOR Gate
Table 3.4 Noise margin of CMOS Inverter with/without RBB
Table 3.5 Simulation results for Inverter with/without RBB at 0.4 V
Table 3.6 Simulation results for AND Gate with RBB
Table 3.7 Simulation results for OR Gate with RBB
Table 3.8 Simulation results for XOR Gate with RBB
Table 3.9 Operations of TG-2INV block
Table 3.10 Operations of PT-2INV block
Table 3.11 Nomenclature used for the proposed designs of CLA
Table 3.12 Simulation results of CLAs at 45 nm technology
Table 3.13 Simulation results of CLAs at 180 nm technology
Table 3.14 Nomenclature used for the proposed designs of KSA
Table 3.15 Simulation results of KSAs at 45 nm technology
Table 3.16 Simulation results of KSAs at 180 nm technology
Table 3.17 Nomenclature used for the proposed designs of HCA
Table 3.18 Simulation results of HCAs at 45 nm technology
Table 3.19 Simulation results of HCAs at 180 nm technology
Table 3.20 Simulation results of adders at 45 nm technology
Table 3.21 Simulation results of adders at 180 nm technology
Table 3.22 Comparative table between proposed and referenced adder designs at 45 nm
Table 3.23 Comparative table between proposed and referenced adder designs at 180 nm
Table 4.1 Truth table of different Compressors (2-2, 3-2, 4-2, 5-2, 6-2, and 7-2)
Table 4.2 Simulation results of LOC and modified HOC Designs
Table 4.3 Comparative results of proposed modified compressor with designs of published references at 45 nm technology
Table 4.4 Comparative results of proposed LOC and modified HOC compressors with designs of published references at 180 nm technology
Table 4.5 Nomenclature used for the proposed designs of Wallace tree multiplier
Table 4.6 Simulation results of Wallace tree multipliers using 45 nm technology at $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$
Table 4.7 Simulation results of Wallace tree multipliers using 180 nm technology at $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$
Table 4.8 Nomenclature used for the proposed designs of Dadda multiplier
Table 4.9 Simulation results of Dadda multipliers using 45 nm technology at $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$
Table 4.10 Simulation results of Dadda multipliers using 180 nm technology at $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$
Table 4.11 Comparative table between proposed and referenced multiplier designs at 45 nm
Table 4.12 Comparative table between proposed and referenced multiplier designs at 180 nm
Table 5.1(a) Comparison of various SRAM cells at 180 nm technology at 0.4 V
Table 5.1(b) Comparison of low power SRAM cells at 180 nm technology at 0.4 V
Table 5.2 Results of conventional and modified access transistor (P, NP, PP, PN, and NN) at 0.4 V supply for 45 nm
Table 5.3 Generation of Voltages at node X and node Y in write, read, and hold operation
Table 5.4 Comparative analysis of RSNM at 0.4 V
Table 5.5 Comparative analysis of WSNM at 0.4 V
Table 5.6 Comparative analysis of SVNM, SINM, WTI and WTV at 0.4 V
Table 5.7 Comparison of WTV, WTI with WSNM
Table 5.8 Comparison of SVNM, SINM with RSNM
Table 5.9 Comparative analysis of read delay and its variability for all proposed SRAM cells with C6T
Table 5.10 Comparative analysis of write delay and its variability for all proposed SRAM cells with C6T
Table 5.11 Comparative analyses of leakage power consumptions in hold mode at 0.4 V supply
Table 5.12 Comparison of hold SNM between simulated and estimated values through analytical and regression equation at 0.4 V
Table 5.13 Comparison of RSNM between simulated and estimated values through analytical and regression equation at 0.4 V
Table 5.14 Comparison of WSNM between simulated and estimated values through analytical and regression equation at 0.4 V
Table 5.15 Comparison of proposed with referenced SRAM cells (for 7T, 8T, 9T and 12T configurations) at 45 nm technology, $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$
Table 5.16 Write ability metric of proposed SRAM cells at 45 nm
Table 5.17 Read stability metric of proposed SRAM cells at 45 nm
Table 5.18 Percentage comparison of all proposed SRAM cells with C6T at 45 nm
Table 5.19 Write ability metric of SRAM cells at 180 nm
Table 5.20 Read stability metric of SRAM cells at 180 nm

## LIST OF FIGURES

Figure 1.1 A cross-sectional view of an n-channel MOSFET
Figure 1.2 A cross-sectional view of an n-channel MOSFET operated in sub-threshold region
Figure 1.3 $\mathrm{I}_{\mathrm{D,sub}}$-$\mathrm{V}_{\mathrm{DS}}$ characteristics in sub-threshold operation region
Figure 1.4 $\mathrm{I}_{\mathrm{D,sub}}$-$\mathrm{V}_{\mathrm{GS}}$ characteristics
Figure 1.5 Sub-threshold slope plot at 180 nm technology
Figure 1.6 CMOS inverter mode for dynamic power consumption
Figure 1.7 CMOS inverter mode for short-circuit power consumption
Figure 3.1 Block diagram of n-bit CLA
Figure 3.2 Block diagram of n-bit KSA
Figure 3.3 Block diagram of n-bit HCA
Figure 3.4 Schematics and layouts of logic gates using Static-CMOS logic family
Figure 3.5 Schematics and layouts of logic gates using PT logic family
Figure 3.6 Schematics and layouts of logic gates using CPL logic family
Figure 3.7 Schematics and layouts of logic gates using SRPL logic family
Figure 3.8 Schematics and layouts of logic gates using DPL logic family
Figure 3.9 Schematics and layouts of logic gates using TG logic family
Figure 3.10 Cross sectional view of N-MOS with all leakage currents
Figure 3.11 Schematic diagram of CMOS inverter (a) with RBB (b) without RBB
Figure 3.12 VTCs of CMOS inverter (a) with RBB (b) without RBB
Figure 3.13 Current characteristics of NMOS in CMOS inverter with RBB/without RBB at 45 nm technology
Figure 3.14 Basic block diagram of 4-bit CLA
Figure 3.15 Internal blocks of 4-bit CLA
Figure 3.16 Block diagram of 8-bit CLA built using 4-bit CLA unit
Figure 3.17 Block diagram of 16-bit CLA built using 4-bit CLA unit
Figure 3.18 Block diagram of a 32-bit CLA built using 8-bit CLA unit
Figure 3.19 Block diagram of a 64-bit CLA built using 16-bit CLA unit
Figure 3.20 Circuit diagram of 4-bit group generate
Figure 3.21 Circuit diagram of 4-bit group propagate
Figure 3.22 Circuit diagram of 4-bit carry generator
Figure 3.23 Circuit diagram of bit-wise generate/propagate
Figure 3.24 Circuit diagram of bit-wise sum-generator
Figure 3.25 TG-2INV block
Figure 3.26 Graph for TG-2INV block at OFF condition
Figure 3.27 Circuit diagram of bit-wise generate/propagate
Figure 3.28 Circuit diagram of bit-wise sum-generator
Figure 3.29 PT-2INV block
Figure 3.30 Graph for PT-2INV block at OFF condition
Figure 3.31 Circuit diagram of bit-wise generate/propagate
Figure 3.32 Circuit diagram of bit-wise sum-generator
Figure 3.33 The overall power consumption, propagation delay and power-delay product graphs of CLA at 45 nm/180 nm technology
Figure 3.34 Internal architecture of KSA
Figure 3.35 The overall power consumption, propagation delay and power-delay product graphs of KSA at 45 nm/180 nm technology
Figure 3.36 Internal architecture of HCA
Figure 3.37 The overall power consumption, propagation delay and power-delay product graphs of HCA at 45 nm/180 nm technology
Figure 3.38 The power consumption graph between CLA, KSA and HCA at 45 nm/180 nm technology (a) using Static-CMOS logic (b) using HYB-TG logic (c) using HYB-PT logic
Figure 3.39 The propagation delay graph between CLA, KSA and HCA at 45 nm/180 nm technology (a) using Static-CMOS logic (b) using HYB-TG logic (c) using HYB-PT logic
Figure 3.40 The power-delay product graph between CLA, KSA and HCA at 45 nm/180 nm technology (a) using Static-CMOS logic (b) using HYB-TG logic (c) using HYB-PT logic
Figure 3.41 DSE chart of all published referenced and proposed 32-bit CLA, KSA and HCA architecture using different logic design styles (a) power (b) delay (c) power-delay product
Figure 4.1 Block diagram of n x n bit multiplier
Figure 4.2 Partial Products generation of 8x8-bit simple multiplication
Figure 4.3 Partial Products selection logic for simple multiplication
Figure 4.4 Partial product accumulation using LOCs in 8x8-bit Wallace tree
Figure 4.5 Partial product accumulation using Mixed L/H in 8x8-bit Wallace tree
Figure 4.6 Partial product accumulation using LOCs in 8x8-bit Dadda tree
Figure 4.7 Partial product accumulation using Mixed L/H in 8x8-bit Dadda tree
Figure 4.8 Block level diagram of 8-bit RCA
Figure 4.9 Schematic and layout of 2-2 LOC using Static-CMOS logic
Figure 4.10 Schematics and layouts of 3-2 LOC using Static-CMOS logic
Figure 4.11 Schematic and layout of modified 2:1 MUX using HYB-TG logic
Figure 4.12 Schematic and layout of modified 4:1 MUX using HYB-TG logic
Figure 4.13 Schematic and layout of 2-2 LOC using HYB-TG logic
Figure 4.14 Schematic and layout of 3-2 LOC using HYB-TG logic
Figure 4.15 The conventional architecture of 4-2 compressors
Figure 4.16 The conventional architecture of compressors used in 4x4-bit Wallace tree multiplier
Figure 4.17 The modified architecture of 4-2 compressors
Figure 4.18 The modified architecture of compressors used in 4x4-bit Wallace tree multiplier
Figure 4.19 The Schematics and layouts of modified HOCs
Figure 4.20 Multiplication of 8x8-bit multiplier
Figure 4.21 Block diagram of Wallace tree using LOC
Figure 4.22 Block diagram of Wallace tree using Mixed L/H
Figure 4.23 The overall power consumption, propagation delay and power-delay product graphs of Wallace tree multipliers at 45 nm/180 nm technology
Figure 4.24 Block diagram of Dadda tree using LOC
Figure 4.25 Block diagram of Dadda tree using Mixed L/H
Figure 4.26 The overall power consumption, propagation delay and power-delay product graphs of Dadda multipliers at 45 nm/180 nm technology
Figure 4.27 The comparative power consumption graph between Wallace tree and Dadda multipliers for (a) 4x4-bit (b) 8x8-bit
Figure 4.28 The comparative propagation delay graph between Wallace tree and Dadda multipliers for (a) 4x4-bit (b) 8x8-bit
Figure 4.29 The comparative power-delay product graph between Wallace tree and Dadda multipliers for (a) 4x4-bit (b) 8x8-bit
Figure 4.30 DSE chart of all published referenced and proposed 8x8-bit Wallace tree and Dadda architectures (a) power (b) delay (c) power-delay product
Figure 5.1 Basic SRAM cell
Figure 5.2 Schematic and Layout of C6T
Figure 5.3 Test circuit for read operation of C6T
Figure 5.4 Test circuit for write operation of C6T
Figure 5.5 Test circuit for measurement of hold SNM of C6T
Figure 5.6 Schematics of access transistor pairs
Figure 5.7 ON/OFF state analysis of access transistors
Figure 5.8 Schematic and layout of MPT8T
Figure 5.9 Test circuit for read operation of MPT8T
Figure 5.10 Test circuit for write operation of MPT8T
Figure 5.11 Test circuit for measurement of hold SNM of MPT8T
Figure 5.12 Schematic and layout of M8T
Figure 5.13 Test circuit for read operation of M8T
Figure 5.14 Test circuit for write operation of M8T
Figure 5.15 Test circuit for measurement of hold SNM of M8T
Figure 5.16 Schematic and layout of MI-12T
Figure 5.17 Test circuit for read operation of MI-12T
Figure 5.18 Test circuit for write operation of MI-12T
Figure 5.19 Test circuit for measurement of hold SNM of MI-12T
Figure 5.20 Currents plot for MI-12T and C6T during hold operation
Figure 5.21 Schematic and layout of M7T
Figure 5.22 Coupling of parasitic capacitances in M7T during clock feed-through event
Figure 5.23 Model for estimation of voltage at node X and Y during clock feed-through event
Figure 5.24 Read circuit set up for estimation of charge at different nodes during clock feed-through event
Figure 5.25 Write circuit set up for estimation of charge at different nodes during clock feed-through event
Figure 5.26 Hold circuit set up for estimation of charge at different nodes during clock feed-through event
Figure 5.27 Waveforms at nodes X and Y during (a) read/write (b) hold operation
Figure 5.28 Test circuit for read operation of M7T
Figure 5.29 Waveforms at nodes Q and QB during read mode
Figure 5.30 Test circuit for write operation of M7T
Figure 5.31 Waveforms at nodes Q and QB during write mode
Figure 5.32 Test circuit for measurement of hold SNM of M7T
Figure 5.33 Waveforms at nodes Q and QB during hold mode
Figure 5.34 Schematic and layout of M9T
Figure 5.35 Test circuit for read operation of M9T
Figure 5.36 Test circuit for write operation of M9T
Figure 5.37 Test circuit for measurement of hold SNM of M9T
Figure 5.38 Overlapped VTCs of C6T, M7T, MPT8T, M8T and M9T during hold operation
Figure 5.39 VTC of MI-12T during hold operation
Figure 5.40 Combined VTCs of C6T, M7T, MPT8T, M8T and M9T during hold operation at varying $\mathrm{V}_{\mathrm{DD}}$
Figure 5.41 VTCs of MI-12T during hold operation at varying $\mathrm{V}_{\mathrm{DD}}$
Figure 5.42 RSNM "butterfly curve" of (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T
Figure 5.43 WSNM "butterfly curve" of (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T
Figure 5.44 Test circuit for extracting N-curve of C6T during read mode
Figure 5.45 N-curve characteristics of (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T
Figure 5.46 Read delay and its variability of (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T
Figure 5.47 Write delay and its variability of (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T
Figure 5.50 Half part of SRAM cells during read operation (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T
Figure 5.51 Half part of SRAM cells during write operation (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T
Figure 5.52 The comparative histograms of SRAM cells at 45 nm
Figure 5.53 The comparative histograms of SRAM cells at 180 nm
Figure 5.54 DSE chart of all five proposed SRAM cells as compared to C6T at 45 nm technology
Figure 5.55 DSE chart of referenced SRAM cells at 180 nm technology

## LIST OF ABBREVIATIONS

Si: Silicon
DIBL: Drain induced barrier lowering
SCE: Short channel effect
MAC: Multiply and Accumulator
LSB: Least Significant Bit
MSB: Most Significant Bit
PDP: Power-delay Product
WSN: Wireless sensor network
CSA: Carry save adder
CLA: Carry look-ahead adder
KSA: Kogge-Stone adder
HCA: Han-Carlson Adder
PP: Partial product
FA: Full Adder
HA: Half Adder
VLSI: Very large scale integration
MOS: Metal oxide semiconductor
SNM: Static noise margin
RSNM: Read static noise margin
WSNM: Write static noise margin
SRAM: Static random access memory
LOC: Lower order compressor
WTI: Write trip current
SINM: Static current noise margin
SVNM: Static voltage noise margin

## CHAPTER 1 <br> INTRODUCTION

### 1.1. BACKGROUND

Efficient power management has become a critical constraint with the rapid growth of portable, wireless and battery-operated applications. Higher power consumption raises the on-chip temperature, which shortens both the operating life of the chip and the battery life [1][2][3][4]. At present, leakage current is reported to account for more than $50 \%$ of the total power consumption of a VLSI chip [5]. To overcome this leakage problem, a number of low-power design techniques, such as the multi-threshold voltage technique, power gating schemes, the back (substrate) bias scheme and the sub-threshold design technique, have been investigated and explored by many researchers. Among these alternatives, the sub-threshold technique has been found to be one of the most useful for achieving ultra-low power consumption, because it uses the leakage current as the main conduction current. In sub-threshold circuits, the power supply voltage $\left(\mathrm{V}_{\mathrm{DD}}\right)$ is reduced below the threshold voltage $\left(\mathrm{V}_{\mathrm{th}}\right)$ of the metal oxide semiconductor (MOS) transistor, i.e. $\mathrm{V}_{\mathrm{DD}}<\mathrm{V}_{\mathrm{th}}$. Sub-threshold circuits satisfy the ultra-low power requirement because power falls quadratically with the supply voltage [6].
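The quadratic dependence mentioned above can be illustrated with the standard switching-power relation $P_{dyn} = \alpha C V_{DD}^{2} f$. The following is a minimal sketch, not a result from this thesis: the activity factor, load capacitance, clock frequency and the 1.0 V reference supply are hypothetical values chosen only for illustration, while 0.4 V is the sub-threshold supply used in this work.

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """Switching power P = alpha * C * Vdd^2 * f, in watts."""
    return alpha * c_load * vdd ** 2 * freq


# Hypothetical illustration values (assumptions, not thesis data).
ALPHA = 0.1       # switching activity factor
C_LOAD = 1e-15    # switched capacitance in farads
FREQ = 1e6        # clock frequency in hertz
VDD_REF = 1.0     # assumed above-threshold reference supply in volts
VDD_SUB = 0.4     # sub-threshold supply used in this work, in volts

p_ref = dynamic_power(ALPHA, C_LOAD, VDD_REF, FREQ)
p_sub = dynamic_power(ALPHA, C_LOAD, VDD_SUB, FREQ)

# With alpha, C and f held fixed, the ratio depends only on (Vdd_sub / Vdd_ref)^2.
ratio = p_sub / p_ref
print(f"Switching power at {VDD_SUB} V relative to {VDD_REF} V: {ratio:.2f}")
```

Under these assumptions, scaling the supply from 1.0 V to 0.4 V alone reduces switching power to about 16% of its reference value. In practice the achievable clock frequency also drops in the sub-threshold region, so the total saving differs from this fixed-frequency estimate.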

The increasing demand for smaller, lighter and more durable low-power electronic products highlights the importance of the sub-threshold design technique. Sub-threshold circuits are promising for applications where performance can be sacrificed for low power. Some of these applications are wireless sensor networks, electronic watches, radio frequency identification (RFID), cryptographic applications such as electronic passports (where security and power consumption rather than performance are given high priority), battery-operated applications (implantable biomedical devices), and other portable communication devices. These applications need ultra-low power consumption with medium performance (e.g. tens to hundreds of MHz) [7][8][9].

An automatic supply and body biasing controller [10] has been developed to minimize total active power in digital circuits by dynamically adjusting both the threshold voltage and the supply voltage based on circuit operating conditions such as temperature, workload, or circuit architecture.

In system on chip (SOC) architectures, embedded cache memories and arithmetic circuits may occupy more than $90 \%$ of the total die area [11]. In the arithmetic circuits, the adders and multipliers used in the MAC unit are the major power-consuming units. Here, sub-threshold design approaches appear to be suitable for low-power/low-energy applications. There is scope for changes in digital arithmetic circuits and on-chip memory cell designs that can lead to many new efficient designs. This reflects the great need for the study and exploration of alternative methods of efficient sub-threshold logic design.

### 1.2. SUB-THRESHOLD OPERATION

### 1.2.1. MOSFET in Sub-Threshold Region

The cross-sectional view of an n-channel MOSFET is shown in Figure 1.1 [12].


Figure 1.1: A cross-sectional view of an n-channel MOSFET
In the gradual channel approximation for $n$-channel MOS structures, the drain current is considered to be zero if the gate-to-source voltage $\left(\mathrm{V}_{\mathrm{GS}}\right)$ is less than $\mathrm{V}_{\mathrm{th}}$, i.e. when $\mathrm{V}_{\mathrm{GS}} \leq \mathrm{V}_{\mathrm{th}}$. In practice, however, a drain current is present because minority charge carriers are available at the surface under the gate even when the gate voltage is below threshold, i.e. in the 'sub-threshold' region. This population of mobile electrons under the gate provides a mechanism for charge flow between the drain and source even when $\mathrm{V}_{\mathrm{GS}} \leq \mathrm{V}_{\mathrm{th}}$. Thus, there is in fact a small, non-zero drain current through a MOSFET biased below threshold [13].

### 1.2.2. The MOSFET Drain Current in the Sub-threshold Region

Figure 1.2 shows a cross-sectional view of an n-channel MOSFET biased with a positive drain voltage i.e. $\mathrm{V}_{\mathrm{DS}} \geq 0$, a negative substrate voltage, i.e. $\mathrm{V}_{\mathrm{BS}} \leq 0$ and the gate to source $\mathrm{V}_{\mathrm{GS}}$ is biased positively but below threshold voltage $\mathrm{V}_{\text {th }}$.


Figure 1.2: A cross-sectional view of an n-channel MOSFET operated in sub-threshold region

When the gate-to-source voltage, $\mathrm{V}_{\mathrm{GS}}$, is less than the threshold voltage, $\mathrm{V}_{\mathrm{th}}$, the semiconductor surface below the gate is a lightly doped $n$-channel [14][15]. The diffusion of the thermally generated minority carriers of the substrate semiconductor material (due to the difference in the electron concentrations at the drain and source ends) results in some conduction between the source and drain terminals through this weakly inverted channel [16]. This condition is known as weak inversion or the sub-threshold region. The off current of a MOSFET operating in the sub-threshold region is defined here as the current flowing through the transistor when its gate-to-source voltage, $\mathrm{V}_{\mathrm{GS}}$, is equal to 0 V.

The sub-threshold drain current of $n$-channel MOSFET under $\left(V_{G S} \leq V_{t h}\right)$ is given by Eq. (1.1). It shows that sub-threshold drain current $\left(I_{D, S U B}\right)$ is exponentially dependent on $\left(V_{G S}-V_{t h}\right)$ and it decreases with increase in $\mathrm{V}_{\text {th }}$ [17].

$$
I_{\mathrm{D}, \mathrm{SUB}}=\left\{\begin{array}{ll}
\mathrm{I}_{\mathrm{S}} \mathrm{e}^{\frac{\mathrm{V}_{\mathrm{GS}}-\mathrm{V}_{\mathrm{th}}}{\mathrm{n} \mathrm{V}_{\mathrm{T}}}}\left(1-\mathrm{e}^{-\frac{\mathrm{V}_{\mathrm{DS}}}{\mathrm{V}_{\mathrm{T}}}}\right), & 0 \leq \mathrm{V}_{\mathrm{DS}} \leq 3 \mathrm{V}_{\mathrm{T}} \\
\mathrm{I}_{\mathrm{S}} \mathrm{e}^{\frac{\mathrm{V}_{\mathrm{GS}}-\mathrm{V}_{\mathrm{th}}}{\mathrm{n} \mathrm{V}_{\mathrm{T}}}}, & \mathrm{V}_{\mathrm{DS}}>3 \mathrm{V}_{\mathrm{T}}
\end{array}\right\} \tag{1.1}
$$

Where $\mathrm{I}_{\mathrm{S}}$ is the drain current when $\mathrm{V}_{\mathrm{GS}}=\mathrm{V}_{\mathrm{th}}$ and is given by Eq. (1.2).

$$
\begin{equation*}
\mathrm{I}_{\mathrm{S}}=\mu_{0} \mathrm{C}_{\mathrm{ox}} \frac{\mathrm{~W}}{\mathrm{~L}} \mathrm{~V}_{\mathrm{th}}^{2}(\mathrm{n}-1) \tag{1.2}
\end{equation*}
$$

The parameters in Eq. (1.1) and Eq. (1.2) are:

$\mu_{0}$ : Carrier mobility,

W: Channel width,

L: Channel length,

$\mathrm{C}_{\mathrm{ox}}=\varepsilon_{\mathrm{ox}} / \mathrm{t}_{\mathrm{ox}}$ : Gate oxide capacitance (where $\varepsilon_{\mathrm{ox}}$ and $\mathrm{t}_{\mathrm{ox}}$ are the gate oxide dielectric constant and the gate oxide thickness, respectively),

$\mathrm{V}_{\mathrm{T}}$ : Thermal voltage ( $\mathrm{kT} / \mathrm{q}$ ),

k: Boltzmann constant in joules per kelvin,

T: Temperature in kelvin,

q: Electronic charge in coulombs,

n: Sub-threshold slope factor of a long-channel uniformly doped device.

n can be calculated using the following equation:

$$
\begin{equation*}
\mathrm{n}=1+\frac{\mathrm{C}_{\mathrm{b}}}{\mathrm{C}_{\mathrm{g}}} \text {, where } \mathrm{C}_{\mathrm{b}}=\frac{\varepsilon_{\mathrm{si}}}{\omega_{\mathrm{d}}} \text { and } \mathrm{C}_{\mathrm{g}}=\frac{\varepsilon_{\mathrm{ox}}}{\mathrm{t}_{\mathrm{ox}}} \tag{1.3}
\end{equation*}
$$

Where $\mathrm{C}_{\mathrm{g}}$ is the gate capacitance, $\mathrm{C}_{\mathrm{b}}$ is the bulk capacitance, and $\varepsilon_{\mathrm{si}}$ and $\omega_{\mathrm{d}}$ denote the dielectric constant of silicon and the depletion width under the channel, respectively.
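As a rough numerical illustration of Eq. (1.3) and of the thermal voltage $\mathrm{kT} / \mathrm{q}$, the following Python sketch evaluates the slope factor n from an oxide thickness and a depletion width. The geometry values (4 nm oxide, 100 nm depletion width) and the relative permittivities (3.9 for silicon dioxide, 11.7 for silicon) are assumptions chosen for illustration only; they are not taken from the devices simulated in this thesis.

```python
# Physical constants in SI units.
K_BOLTZMANN = 1.380649e-23      # Boltzmann constant, J/K
Q_ELECTRON = 1.602176634e-19    # electronic charge, C
EPS_0 = 8.8541878128e-12        # vacuum permittivity, F/m


def thermal_voltage(temp_kelvin):
    """Thermal voltage V_T = kT/q in volts."""
    return K_BOLTZMANN * temp_kelvin / Q_ELECTRON


def slope_factor(t_ox, w_dep, eps_r_ox=3.9, eps_r_si=11.7):
    """Slope factor n = 1 + C_b / C_g from Eq. (1.3).

    t_ox and w_dep are in metres; capacitances are per unit area.
    The default relative permittivities are assumed textbook values.
    """
    c_g = eps_r_ox * EPS_0 / t_ox     # gate (oxide) capacitance, F/m^2
    c_b = eps_r_si * EPS_0 / w_dep    # bulk (depletion) capacitance, F/m^2
    return 1.0 + c_b / c_g


# Hypothetical geometry, chosen only for illustration.
n = slope_factor(t_ox=4e-9, w_dep=100e-9)
v_t = thermal_voltage(300.0)
print(f"n = {n:.3f}, V_T = {v_t * 1e3:.2f} mV")
```

With these assumed values the sketch gives n of about 1.12 and a thermal voltage of about 25.9 mV at 300 K. A thinner depletion region (larger $\mathrm{C}_{\mathrm{b}}$) pushes n further above 1.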

In sub-threshold operation, the threshold voltage ' $\mathrm{V}_{\text {th }}$ ' is defined as the gate-to-source voltage above which the drain current ceases to depend exponentially on the gate-to-source voltage [18].

## (i) $I_{D, \text { sub }}-V_{D S}$ Characteristics

According to Eq. (1.1), the sub-threshold drain current ( $\mathrm{I}_{\mathrm{D}, \mathrm{SUB}}$ ) does not depend on $\mathrm{V}_{\mathrm{DS}}$ when $\mathrm{V}_{\mathrm{DS}}>3 \mathrm{V}_{\mathrm{T}}$ because $\mathrm{e}^{-3} \ll 1$, while for $0 \leq \mathrm{V}_{\mathrm{DS}} \leq 3 \mathrm{V}_{\mathrm{T}}$ its dependence on $\mathrm{V}_{\mathrm{DS}}$ is $\left(1-\mathrm{e}^{-\mathrm{V}_{\mathrm{DS}} / \mathrm{V}_{\mathrm{T}}}\right)$.

The behavior of current ( $I_{D, S U B}$ ) versus $V_{D S}$ in the sub-threshold region is illustrated in Figure 1.3 [19].


Figure 1.3: $I_{D, ~ S U B}-V_{D S}$ characteristics in sub-threshold operation region
From the plot of $\mathrm{I}_{\mathrm{D}, \mathrm{SUB}}-\mathrm{V}_{\mathrm{DS}}$, it is observed that the MOSFET operates in the non-saturated current region below $3 \mathrm{V}_{\mathrm{T}}$ and in the current saturation region above $3 \mathrm{V}_{\mathrm{T}}$ ( $\mathrm{V}_{\mathrm{T}}$ is the thermal voltage, approximately 26 mV).
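The piecewise behavior of Eq. (1.1) can be sketched in Python as follows. The values of $\mathrm{I}_{\mathrm{S}}$ and n used here are hypothetical placeholders, and the threshold voltage is the 180 nm value quoted in this chapter; the sketch is meant only to show how the $\mathrm{V}_{\mathrm{DS}}$ factor approaches 1 near $3 \mathrm{V}_{\mathrm{T}}$, not to reproduce Figure 1.3.

```python
import math


def sub_threshold_current(v_gs, v_ds, v_th, i_s, n, v_t=0.026):
    """Sub-threshold drain current following the two cases of Eq. (1.1).

    Intended for v_gs <= v_th and v_ds >= 0. All voltages in volts,
    i_s in amperes.
    """
    gate_term = i_s * math.exp((v_gs - v_th) / (n * v_t))
    if v_ds <= 3.0 * v_t:
        # Non-saturated region: current still depends on V_DS.
        return gate_term * (1.0 - math.exp(-v_ds / v_t))
    # Saturated region: the V_DS factor is approximated by 1.
    return gate_term


V_T = 0.026  # thermal voltage used in the text, V
# i_s and n are assumed values for illustration only.
params = dict(v_th=0.496, i_s=1e-7, n=1.3, v_t=V_T)

for v_ds in (0.0, V_T, 3 * V_T, 0.2, 0.4):
    i_d = sub_threshold_current(0.3, v_ds, **params)
    print(f"V_DS = {v_ds:.3f} V  ->  I_D,SUB = {i_d:.3e} A")
```

At $\mathrm{V}_{\mathrm{DS}}=3 \mathrm{V}_{\mathrm{T}}$ the factor $1-\mathrm{e}^{-3}$ is about 0.95, so the two-case form has a step of roughly 5 % at the boundary; this is the approximation behind treating the current as independent of $\mathrm{V}_{\mathrm{DS}}$ above $3 \mathrm{V}_{\mathrm{T}}$.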

## (ii) $I_{D, \text { sub }}-V_{G S}$ Characteristics

The exponential behavior of the sub-threshold drain current ( $\mathrm{I}_{\mathrm{D}, \mathrm{SUB}}$ ) versus $\mathrm{V}_{\mathrm{GS}}$ is illustrated in Figure 1.4 [20][21].


Figure 1.4: $\mathrm{I}_{\mathrm{D}, \text { sub }}-\mathrm{V}_{\mathrm{GS}}$ Characteristics

From the plot of $\mathrm{I}_{\mathrm{D}, \mathrm{SUB}}-\mathrm{V}_{\mathrm{GS}}$, it is observed that the MOSFET operates in the sub-threshold region below the threshold voltage $\left(\mathrm{V}_{\mathrm{th}, \mathrm{n}}\right)$ and in the strong inversion region above the threshold point. There are two states (off and on) in the sub-threshold region. In the on state, the sub-threshold current increases by tenfold and the voltage difference between the two voltage points ( $\mathrm{V}_{\mathrm{th}, \mathrm{sub}}$ and $\mathrm{V}_{\mathrm{th}, \mathrm{n}}$ ) is called the on-state voltage range ( $\mathrm{V}_{\mathrm{DD}, \text { sub }}$ ), whereas below $\mathrm{V}_{\mathrm{th}, \text { sub }}$ (off state), the magnitude of the current is negligible.

For $45 \mathrm{~nm} / 180 \mathrm{~nm}$ process technologies at $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$, the observed values are:

$\mathrm{V}_{\mathrm{th}, \mathrm{n}}=0.496 \mathrm{~V}, \mathrm{~V}_{\mathrm{th}, \mathrm{sub}}=0.028 \mathrm{~V} \quad$ (for 180 nm)

$\mathrm{V}_{\mathrm{th}, \mathrm{n}}=0.416 \mathrm{~V}, \mathrm{~V}_{\mathrm{th}, \text { sub }}=0.02 \mathrm{~V} \quad$ (for 45 nm)

## (iii) Sub-threshold Slope [22][23]

Sub-threshold Slope (SS) is defined as the gate voltage swing required to change the sub-threshold drain current by one decade. It is expressed in mV/decade and is given by the following equation:

$$
\begin{equation*}
\mathrm{SS} \equiv \frac{\mathrm{dV}_{\mathrm{GS}}}{\mathrm{~d}\left(\ln \mathrm{I}_{\mathrm{DS}}\right)} \ln 10=\left(1+\frac{\mathrm{C}_{\mathrm{dep}}}{\mathrm{C}_{\mathrm{ox}}}\right)\left(\frac{\mathrm{kT}}{\mathrm{q}}\right) \ln 10 \tag{1.4}
\end{equation*}
$$

Figure 1.5 shows the sub-threshold slope plot.


Figure 1.5: Sub-threshold slope plot at 180 nm technology
The two curves show identical data that have been plotted using a linear scale (on $y$ axis, right side) and a logarithmic scale (on y axis, left side) on same plot.

When the gate voltage $\left(\mathrm{V}_{\mathrm{G}}\right)$ is increased, the number of electrons in the channel increases. This, in turn, increases the current flowing between the source and the drain. The current at the minimum gate voltage $(0 \mathrm{~V})$ is called the off current, and the current at the maximum gate voltage ( 1 V in this case) is the on current. Above the threshold voltage $\mathrm{V}_{\text {th }}$ (dashed line), the drain current ( $\mathrm{I}_{\mathrm{D}}$ ) increases linearly. Below the threshold voltage, the drain current ( $\mathrm{I}_{\mathrm{D}, \mathrm{SUB}}$ ) increases exponentially with the gate voltage.

In the present thesis work, the sub-threshold slope is observed to be approximately $60 \mathrm{mV} / \mathrm{dec}$ for both the 45 nm and 180 nm process technologies.
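Eq. (1.4) can be checked numerically with a short Python sketch. The capacitance ratios below are assumed values for illustration; only the ideal case $\mathrm{C}_{\mathrm{dep}} / \mathrm{C}_{\mathrm{ox}}=0$ at 300 K corresponds to the well-known room-temperature limit of about 60 mV/decade.

```python
import math

K_BOLTZMANN = 1.380649e-23      # Boltzmann constant, J/K
Q_ELECTRON = 1.602176634e-19    # electronic charge, C


def subthreshold_slope(cap_ratio, temp_kelvin=300.0):
    """Sub-threshold slope in mV/decade from Eq. (1.4).

    cap_ratio is C_dep / C_ox (dimensionless).
    """
    v_t = K_BOLTZMANN * temp_kelvin / Q_ELECTRON   # thermal voltage, V
    return (1.0 + cap_ratio) * v_t * math.log(10.0) * 1e3


# Assumed capacitance ratios, for illustration only.
for ratio in (0.0, 0.2, 0.5):
    ss = subthreshold_slope(ratio)
    print(f"C_dep/C_ox = {ratio:.1f}  ->  SS = {ss:.1f} mV/decade")
```

A smaller slope means that a given gate voltage swing switches the current over more decades, which matters when the whole supply range is only a few hundred millivolts.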

### 1.3. SOURCES OF POWER CONSUMPTION IN CMOS LOGIC GATE

Power consumption is an important property of a design that affects feasibility, cost and reliability. It influences a large number of critical design decisions, such as power supply capacity, battery lifetime, supply line sizing, packaging and cooling requirements [24].

In sub-threshold region of operation, sub-threshold current is the main operating current which is otherwise considered as leakage current in strong inversion region.

The total power consumption in a CMOS circuit is constituted by dynamic, short circuit and static power consumption.

## (i) Dynamic power consumption

This power consumption is due to logic transitions causing logic gates to charge/discharge output capacitances. It depends on the supply voltage, device threshold, the input rise/fall time and the operating frequency of the transistor.

Figure 1.6 shows the current flow in CMOS inverter resulting into dynamic power consumption.


Figure 1.6: CMOS inverter model for dynamic power consumption

The equation for dynamic power consumption is expressed as

$$
\begin{equation*}
\mathrm{P}_{\text {Dynamic }}=\alpha_{0-1} \cdot \mathrm{C}_{\mathrm{L}} \cdot \mathrm{~V}_{\mathrm{DD}}^{2} \cdot \mathrm{f}_{\mathrm{clk}} \tag{1.5}
\end{equation*}
$$

Where $\mathrm{V}_{\mathrm{DD}}$ is the power supply voltage; $\mathrm{C}_{\mathrm{L}}$ is the load capacitance; $\mathrm{f}_{\text {clk }}$ is the operating input signal frequency and $\alpha_{0-1}$ is the probability that a power consuming transition occurs (the activity factor).
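The quadratic dependence on $\mathrm{V}_{\mathrm{DD}}$ in Eq. (1.5) can be illustrated with a minimal Python sketch. The activity factor, load capacitance and clock frequency below are hypothetical, and the 1.8 V reference is an assumed nominal supply used only for comparison with the 0.4 V sub-threshold supply.

```python
def dynamic_power(alpha, c_load, v_dd, f_clk):
    """Dynamic power in watts from Eq. (1.5)."""
    return alpha * c_load * v_dd ** 2 * f_clk


# Hypothetical operating point, for illustration only.
ALPHA = 0.1       # activity factor
C_LOAD = 10e-15   # load capacitance, F
F_CLK = 1e6       # clock frequency, Hz

p_nominal = dynamic_power(ALPHA, C_LOAD, 1.8, F_CLK)  # assumed nominal supply
p_sub = dynamic_power(ALPHA, C_LOAD, 0.4, F_CLK)      # sub-threshold supply

print(f"P at 1.8 V: {p_nominal * 1e9:.3f} nW")
print(f"P at 0.4 V: {p_sub * 1e9:.3f} nW")
print(f"ratio: {p_sub / p_nominal:.4f}")
```

At the same frequency and load, lowering the supply from 1.8 V to 0.4 V scales the dynamic power by $(0.4 / 1.8)^{2}$, i.e. to roughly 5 % of its original value. In practice the achievable clock frequency also drops in the sub-threshold region, which this sketch does not model.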

## (ii) Short-circuit power consumption

This power consumption is due to the direct current path from supply to ground when both NMOS and PMOS transistors are conducting current simultaneously for a brief duration due to non-zero rise/ fall times.

Figure 1.7 shows the current (Isc) flow in CMOS inverter causing short-circuit power consumption.


Figure 1.7: CMOS inverter model for short-circuit power consumption

The equation for short circuit power consumption is expressed as

$$
\begin{equation*}
P_{\text {shortcircuit }}=I_{\mathrm{sc}} \cdot V_{\mathrm{DD}} \tag{1.6}
\end{equation*}
$$

Where $I_{\mathrm{sc}}$ is the short-circuit current.

The short-circuit power consumption is the dominant component of power consumption that increases exponentially with the power-supply voltage. In the sub-threshold region, it is reduced by lowering the power supply voltage and by applying input pulses with short rise/fall times. In the present thesis, input signals have rise/fall times of 1 picosecond, a pulse width (ON time) of 1 microsecond and a pulse period of 5 microseconds [25].

## (iii) Static power consumption

This power consumption occurs due to leakage current when the system is in standby mode. In the sub-threshold region, the sources of leakage are the gate-oxide tunneling current that results from the tunneling of carriers across the very thin gate oxide [26], the edge-direct tunneling that appears between the source and drain extension and the gate overlap [27][28], and the reverse currents of the source-substrate and drain-substrate pn junctions. This power consumption does not depend on the input transition or load capacitance. It remains constant for a logic cell.

Static power is expressed as

$$
\begin{equation*}
P_{\text {Static }}=I_{\text {leakage }} \cdot V_{D D} \tag{1.7}
\end{equation*}
$$

Where $\mathrm{I}_{\text {leakage }}$ is the leakage current as discussed above.
The total power consumption in a digital CMOS circuit is the sum of Eq. (1.5), (1.6) and (1.7), given by

$$
\begin{align*}
& \mathrm{P}_{\text {total }}=\mathrm{P}_{\text {dynamic }}+\mathrm{P}_{\text {shortcircuit }}+\mathrm{P}_{\text {static }} \\
& =\alpha_{0-1} \cdot \mathrm{C}_{\mathrm{L}} \cdot \mathrm{~V}_{\mathrm{DD}}^{2} \cdot \mathrm{f}_{\mathrm{clk}}+\mathrm{I}_{\mathrm{sc}} \cdot \mathrm{~V}_{\mathrm{DD}}+\mathrm{I}_{\text {leakage }} \cdot \mathrm{V}_{\mathrm{DD}} \tag{1.8}
\end{align*}
$$

During sub-threshold operation, all three components of the power consumption increase with $\mathrm{V}_{\mathrm{DD}}$, but at different rates: dynamic power increases quadratically and short-circuit power increases exponentially with $V_{D D}$ [29]. Hence, all these components can be reduced by lowering the power supply voltage. As technology scales down, power consumption increases because of a significant rise in leakage current (for a CMOS inverter, simulation gives leakage currents of 1.05 nA and 0.18 nA at the 45 nm and 180 nm technology nodes, respectively).
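A minimal Python sketch of Eq. (1.8) is given below to show how the three components combine. The leakage currents are the inverter values quoted above; the supply of 0.4 V is assumed here because it is the supply used elsewhere in this chapter, and the activity factor, load capacitance, clock frequency and short-circuit current are hypothetical placeholders.

```python
def total_power(alpha, c_load, v_dd, f_clk, i_sc, i_leak):
    """Return the power components of Eq. (1.8) in watts."""
    p_dynamic = alpha * c_load * v_dd ** 2 * f_clk
    p_short = i_sc * v_dd
    p_static = i_leak * v_dd
    return {
        "dynamic": p_dynamic,
        "short_circuit": p_short,
        "static": p_static,
        "total": p_dynamic + p_short + p_static,
    }


V_DD = 0.4  # assumed supply voltage, V
# Leakage currents quoted in the text for a CMOS inverter.
LEAKAGE = {"45 nm": 1.05e-9, "180 nm": 0.18e-9}

for node, i_leak in LEAKAGE.items():
    # alpha, c_load, f_clk and i_sc are hypothetical illustration values.
    p = total_power(alpha=0.1, c_load=1e-15, v_dd=V_DD, f_clk=200e3,
                    i_sc=0.05e-9, i_leak=i_leak)
    print(f"{node}: static = {p['static'] * 1e9:.3f} nW, "
          f"total = {p['total'] * 1e9:.3f} nW")
```

Under the assumed 0.4 V supply, the quoted leakage currents correspond to a static power of about 0.42 nW at 45 nm and 0.072 nW at 180 nm, which illustrates why leakage grows in importance at the scaled node.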

### 1.4. MOTIVATIONS AND CHALLENGES IN SUB-THRESHOLD DESIGN

In recent years, as metal oxide semiconductor (MOS) transistor technology keeps scaling down to the deep submicron region, sub-threshold design has focused on developing circuits that operate at very low power supply voltage.

This is to support battery-operated applications with very low power consumption, especially in ultra-low power application fields such as portable biosensors, wireless sensor nodes, RFID tags, hearing aids, pacemakers, wearable computing or implants, and personal digital assistants. These are dominated primarily by the need to minimize energy consumption and increase battery lifetime, whereas speed is a secondary consideration.

To design these types of ultra-low power systems, sub-threshold based SRAM and arithmetic circuits (adders and multipliers) are the basic units and the most power-consuming modules [30]. Thus, such systems require low-power memories and arithmetic circuits that are efficient and deliver optimal performance at the same time.

Researchers have explored and designed Carry look-ahead adder, Kogge-Stone adder and Han-Carlson adder architectures, Wallace tree and Dadda multiplier architectures, and SRAM cells operated in the super-threshold region (gate voltage above the threshold voltage) using conventional static CMOS logic [31], transmission gate logic and pass transistor logic families. However, the operation of these published architectures degrades in the sub-threshold region [32].

Therefore, it is important to design fully functional power efficient arithmetic and memory circuits which show least power-delay-product when operated in sub-threshold region.

Various challenges for circuit design in the sub-threshold region have been discussed in [33][34][35][36]. These mainly relate to reduced supply voltage and device scaling, as given below:
i. Exploring Logic Families for Robust Design: The current ratio $I_{on} / I_{\text {off }}$ decreases with low $V_{\mathrm{DD}}$, which leads to functional degradation and hence reduces the robustness of the circuits. Therefore, to design robust sub-threshold logic circuits, existing logic families need to be explored and modified.
ii. Impact of Device Scaling on circuit performance: The study of the influence of scaling on circuits operating in the weak inversion region has been limited to device-level studies and a few circuit-level simulations of simple circuits. Hence, a thorough investigation is necessary to find the impact of scaling on the function of logic gates.
iii. Delay: Another challenging parameter of the circuit operation in the sub-threshold region is the longer propagation delay, hence lower frequency of operation due to weak current flow in the channel. This requires circuit optimization techniques for achieving optimum performance.
iv. Development of Sub-threshold Compatible and Robust Memory Design: SRAM is an important component of many ICs, and contributes a large fraction of the active and leakage power consumption. Process variation and reduced $I_{o n} / I_{\text {off }}$ are the major parameters which degrade the read/write operation and affect read/write access time in sub-threshold region.
v. Noise Margin: It is directly affected by the choice of supply voltage and has high sensitivity to process variations. Therefore, in the presence of process variations, there is need to study noise margin of logic gate and SRAM cell in sub-threshold region.
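The drop in the on/off current ratio mentioned in item (i) follows directly from Eq. (1.1). Taking the on current at $\mathrm{V}_{\mathrm{GS}}=\mathrm{V}_{\mathrm{DD}}$ and the off current at $\mathrm{V}_{\mathrm{GS}}=0$, both with $\mathrm{V}_{\mathrm{DS}}=\mathrm{V}_{\mathrm{DD}}>3 \mathrm{V}_{\mathrm{T}}$, the $\mathrm{I}_{\mathrm{S}}$ and $\mathrm{V}_{\mathrm{th}}$ terms cancel and the ratio reduces to $\mathrm{e}^{\mathrm{V}_{\mathrm{DD}} /\left(\mathrm{nV}_{\mathrm{T}}\right)}$. The Python sketch below evaluates this expression; the slope factor n = 1.3 is an assumed value, and the result holds only while $\mathrm{V}_{\mathrm{DD}}$ stays below the threshold voltage.

```python
import math


def on_off_ratio(v_dd, n=1.3, v_t=0.026):
    """I_on / I_off for a sub-threshold transistor, derived from Eq. (1.1).

    I_on is taken at V_GS = V_DD and I_off at V_GS = 0, both with
    V_DS = V_DD > 3 V_T. n is an assumed slope factor.
    """
    return math.exp(v_dd / (n * v_t))


for v_dd in (0.4, 0.3, 0.2):
    print(f"V_DD = {v_dd:.1f} V  ->  I_on/I_off = {on_off_ratio(v_dd):.3g}")
```

Because the ratio is exponential in $\mathrm{V}_{\mathrm{DD}}$, each reduction of the supply by $\mathrm{nV}_{\mathrm{T}} \ln 10$ costs one decade of on/off ratio, which is why logic families that rely on a strong on current need to be re-examined at sub-threshold supplies.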

### 1.5. RESEARCH GAPS

A detailed literature review of published research work of logic design style, arithmetic circuits, and SRAM cell has been carried out and reported in Chapter 2. Based on it, following research gaps are identified:
i. Exploration of basic logic gates using different logic families in sub-threshold logic is missing in the literature.

Current research focuses on power-efficient architectures of arithmetic circuits and SRAM cell designs that require minimum supply voltage and increase the operating frequency of a circuit in sub-threshold operation. To implement these architectures, an extensive exploration of basic logic gates using different logic families is needed, which is currently missing.
ii. Comprehensive exploration of Adder/ Multiplier/ SRAM circuits has not been carried out in the sub-threshold region.

A limited study of the arithmetic circuits using different architectures has been published in the area of sub-threshold region. Therefore, design and exploration of these circuits with possible modifications in design of internal blocks at circuit and architectural level are required to ensure functional and power-efficient designs in sub-threshold region.

Similarly, the operation and analysis of sub-threshold SRAM cells during read, write and hold modes have been little explored. SRAM cells designed for super-threshold operation give degraded function and performance at scaled-down technologies. Therefore, designs of power-efficient SRAM cells operating in the sub-threshold region with higher noise margin during read/write/hold modes at ultra-low power supply in deep sub-micron technology are required.
iii. Impact of technology scaling on circuits designed in sub-threshold region.

The impact of technology scaling on arithmetic/memory circuits has not been analyzed in the sub-threshold region.

### 1.6. OBJECTIVES OF THE THESIS

The arithmetic circuits and SRAM cells used for super-threshold operation have never been completely analyzed in the context of sub-threshold operation. The conventional arithmetic circuits and SRAM cells may either not be functional or not give efficient results in the sub-threshold region because of the ultra-low supply voltage and process variation effects. Therefore, there is a great need to modify the conventional arithmetic circuits and SRAM cells to improve the functional parameters of the design. The main objective of this work is based on structural optimizations to reduce the very long critical path delay and to improve the overall power-delay product of the arithmetic circuits in sub-threshold operation. Sub-threshold SRAM cells are more sensitive to power, voltage and temperature variations, and their effects have been discussed in published work to some extent [37][38]. However, a comprehensive analysis of SRAM cell design in terms of parameters such as process sensitivity, area overhead, read stability, write ability, hold stability, read/write access time and leakage power consumption has not been carried out in the sub-threshold region [39]. The proposed work aims to design low-power SRAM cells with improvement in the above-mentioned design parameters in the sub-threshold region. The arithmetic circuits and SRAM cells are designed and implemented at two different technology nodes ( $45 \mathrm{~nm} / 180 \mathrm{~nm}$ ), which shows the effect of scaling in the sub-threshold region.

In this thesis, the research is carried out with following objectives:
i. Explorations and design of functional basic logic gates using different logic families in sub-threshold region.
ii. Design and analysis of power efficient functional adder architectures in subthreshold region using two different technology nodes.
iii. Design and analysis of power efficient functional multiplier architectures in subthreshold region using two different technology nodes. Overall comparative analyses to be performed to get a full view of the improvements with respect to power, delay and power-delay product for arithmetic circuits.
iv. Design and analysis of low power functional SRAM cells in sub-threshold region. A comprehensive analysis of SRAM cells with read stability, write ability, hold stability, read/write access time and leakage power consumption is to be carried out.

### 1.7. ORGANIZATION OF THE THESIS

This thesis is divided into six chapters. The organization of this thesis is as follows:

Chapter 1 presents an introduction of sub-threshold design technique. It is followed by subthreshold operation of MOSFET and sources of power consumption. This chapter also explains motivation and challenges of sub-threshold logic design. The research gaps and objectives chosen for this research work are discussed in this chapter.

Chapter 2 describes the detailed literature review on arithmetic circuits and SRAM cells. The characteristics and performance parameters of various power efficient arithmetic circuits such as power, delay and power-delay product at different technology nodes have been discussed. Various low power SRAM cell designs and their parameters in terms of read stability, write ability, hold stability, read/write access time and leakage power consumption have been discussed in detail.

Chapter 3 explores and proposes new designs of frequently used adders, namely Carry look-ahead adder, Kogge-Stone adder and Han-Carlson adder architectures, with different logic design styles and operand sizes in the sub-threshold region.

A comparative analysis is given of the logic gates used to implement the adder architectures in different logic families, along with their post-layout simulation results. For additional power-delay-product optimization, the effects of the reverse body bias scheme on logic gates are discussed. The design and analysis of the chosen adders is done at 45 nm as well as 180 nm technology to find the impact of scaling.

Chapter 4 explores and proposes new designs of column compression multipliers namely Wallace tree and Dadda multiplier architectures, for different bit-widths and different logic design styles in sub-threshold region. The partial product generation scheme, the partial product accumulation scheme used in Wallace tree and Dadda multipliers have been discussed in detail. The design and analysis of chosen multipliers is done at 45 nm as well as 180 nm technology to find the impact of scaling.

Chapter 5 proposes new designs of SRAM cells at 45 nm technology in the sub-threshold region. Their comprehensive analysis includes thorough evaluation in terms of read stability, write ability, hold stability, read/write access time and leakage power consumption in comparison to the conventional 6-transistor SRAM cell. To see the impact of scaling, the results of these SRAM cells are compared with designs at 180 nm technology at 0.4 V power supply voltage.

Analytical equations, obtained under steady state condition for read, write and hold operation, have been developed to obtain WSNM, RSNM, and Hold SNM theoretically and compared with simulated values.

Chapter 6 finally concludes the outcomes of this research and suggests future scope of this work.

## REFERENCES:

[1] K. Roy, S. Mukhopadhyay and H. M. Meimand, "Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits", Proceedings of the IEEE, vol. 91(2), 2003, pp. 305-327.
[2] W. C. Daasch, C. H. Lim and G. Cai, "Design of VLSI CMOS circuits under thermal constraint", IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 49(8), 2002, pp. 589-593.
[3] V. De and S. Borkar, "Technology and design challenges for low power and high performance", Proceedings of International Symposium on Low Power Electronics and Designs, 1999, pp. 163-168.
[4] E. Vittoz, B. Gerber and F. Leuenberger, "Silicon-gate CMOS frequency divider for the electronic wrist watch", IEEE Journal of Solid-State Circuits, vol. 7(2), 1972, pp. 100-104.
[5] R. Udaiyakumar and K. Sankaranarayanan, "Certain investigations on static power dissipation in various nano-scale CMOS D flip-flop structures", International Journal of Engineering and Technology, vol. 2(4), 2012, pp. 644-651.
[6] B. H. Calhoun, A. Wang and A. P. Chandrakasan, "Modeling and sizing for minimum energy operation in subthreshold circuits", IEEE Journal of Solid-State Circuits, vol. 40(9), 2005, pp. 1778-1786.
[7] P. Sunghyun, C. Min and S. H. Cho, "A 95nW ring oscillator-based temperature sensor for RFID tags in $0.13 \mu \mathrm{~m}$ CMOS", Proceedings of IEEE International Symposium on Circuits and Systems ISCAS, 2009, pp. 1153-1156.
[8] S. I. Halder and L. Nazhandali, "Utilizing sub-threshold technology for the creation of secure circuits", IEEE International Symposium on Circuits and Systems, 2008, pp. 3182-3185.
[9] A. P. Chandrakasan, N. Verma and D. C. Daly, "Ultralow-power electronics for biomedical applications", Annu. Rev. Biomed. Eng. 10, 2008, pp. 247-274.
[10] V. Ramesh, R. P. Agarwal, S. D. Gupta and T. T. Kim, "Design and analysis of double-gate MOSFETs for ultra-low power radio frequency identification (RFID): device and circuit co-design", Journal of Low Power Electronics and Applications, vol. 2(1), 2011, pp. 277-302.
[11] P. J. Havinga and G. J. Smit, "Design techniques for energy efficient and low-power systems", In: Mobile multimedia systems, University of Twente publications, 2000, pp.2.1-2.152.
[12] C. K. Jien, K. W. Ang, N. Balasubramanian, M. F. Li, G. S. Samudra and Y. C. Yeo, "N-MOSFET with silicon-carbon source/drain for enhancement of carrier transport", IEEE Transactions on Electron Devices, vol. 54(2), 2007, pp. 249-256.
[13] R. Chao, "Exploiting subthreshold MOSFET behavior in analog applications", EDN-Electronic Design News, vol. 57(3), 2012, pp. 1-20.
[14] S. Hendrawan, K. Roy and B. C. Paul, "Robust subthreshold logic for ultra-low power operation", IEEE Transactions on Very Large Scale Integration (VLSI) System, vol. 9(1), 2001, pp. 90-99.
[15] D. Wolpert and P. Ampadu, "Temperature effects in semiconductors", Managing Temperature Effects in Nanoscale Adaptive Systems, Springer, 2012, pp. 15-33.
[16] D. G. Michael, "CMOS device modeling for subthreshold circuits", IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 39(8), 1992, pp. 532-539.
[17] J. M. Rabaey and M. Pedram, "Low power design methodologies", Springer Science \& Business Media, vol. 336, 2012, pp. 1-367.
[18] S. Hendrawan and K. Roy, "Digital CMOS logic operation in the sub-threshold region", In Proceedings of the 10th Great Lakes symposium on VLSI, ACM, 2000, pp. 107-112.
[19] M. Horiguchi, T. Sakata and K. Itoh, "Switched-source-impedance CMOS circuit for low standby subthreshold current giga-scale LSI's", IEEE Journal of Solid-State Circuits, vol. 28(11), 1993, pp. 1131-1135.
[20] H. Li and Q. Xu, "Sub-threshold-based ultra-low-power neural spike detector", Electronics Letters, vol. 47(6), 2011, pp. 367-368.
[21] K. Joyce and A. P. Chandrakasan, "Variation-driven device sizing for minimum energy sub-threshold circuits", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, 2006, pp. 8-13.
[22] F. Isabelle, C. A. Colinge and J. P. Colinge, "Multigate transistors as the future of classical metal-oxide-semiconductor field-effect transistors", Nature, vol. 479(7373), 2011, pp. 310-316.
[23] B. Calhoun and A. P. Chandrakasan, "Static noise margin variation for subthreshold SRAM in $65-\mathrm{nm}$ CMOS", IEEE Journal of Solid-State Circuits, vol. 41(7), 2006, pp. 1673-1679.
[24] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir and V. K. Narayanan, "Leakage current: Moore's law meets static power", IEEE Transaction of Computer Society, vol. 36(12), 2003, pp. 68-75.
[25] T. James, J. A. Kash and D. P. Vallett, "Time-resolved optical characterization of electrical activity in integrated circuits", Proceedings of the IEEE, vol. 88(9), 2000, pp. 1440-1459.
[26] M. Manghisoni, L. Gaioni, L. Ratti, V. Re, V. Speziali and G. Traversi, "Impact of gate-leakage current noise in sub-100 nm CMOS front-end electronics", IEEE Nuclear Science Symposium Conference Record, vol. 4, 2007, pp. 2503-2508.
[27] E. C. Christian, A. E. Hoiydi, J. D. Decotignie and P. Vincent, "WiseNET: an ultralow-power wireless sensor network solution", Computer, vol. 37(8), 2004, pp. 62-70.
[28] A. P. Chandrakasan, R. W. Brodersen, "Minimizing power consumption in digital CMOS circuits", Proceedings of the IEEE, vol. 83(4), 1995, pp. 498-523.
[29] A. P. Chandrakasan, S. Sheng and R. W. Brodersen, "Low-power CMOS digital design", IEICE Transactions on Electronics, vol. 75, 1992, pp. 371-382.
[30] A. P. Chandrakasan, N. Verma and D. C. Daly, "Ultralow-power electronics for biomedical applications", Annu. Rev. Biomed. Eng. 10, 2008, pp. 247-274.
[31] K. A. Ali and R. Kaamran, "A study and comparison of full adder cells based on the standard static CMOS logic", Canadian Conference on Electrical and Computer Engineering, vol. 4, 2004, pp. 2139-2142.
[32] B. Brent, and J. Nyathi, "Bulk CMOS device optimization for high-speed and ultra-low power operations", 49th IEEE International Midwest Symposium on Circuits and Systems, vol. 2, 2006, pp. 221-225.
[33] R. Vaddi, S. D. Gupta and R. P. Agarwal, "Review article device and circuit design challenges in the digital subthreshold region for ultralow-power applications", Hindawi Publishing Corporation VLSI Design, vol. 2009, 2009, pp. 1-14.
[34] S. Fisher, A. Teman, D. Vaysman, A. Gertsman, O. Y. Pecht and A. Fish, "Digital subthreshold logic design-motivation and challenges", IEEE Convention of Electrical and Electronics Engineers in Israel, 2008, pp. 702-706.
[35] D. Blaauw, J. Kitchener and B. Phillips, "Optimizing addition for sub-threshold logic", IEEE Conference on Signals, Systems and Computers, 2008, pp. 751-756.
[36] M. Anis, "Subthreshold leakage current: challenges and solutions", Proceedings of the 15th International Conference on Microelectronics, 2003, pp. 77-80.
[37] H. Mizuno and T. Nagano, "Driving source-line cell architecture for sub-1-V high-speed low-power applications", IEEE Journal of Solid-State Circuits, vol. 31(4), 1996, pp. 552-557.
[38] A. Islam and M. Hasan, "Single-ended 6T SRAM cell to improve dynamic power dissipation by decreasing activity factor", The Mediterranean Journal of Electronics and Communications, vol. 7(1), 2011, pp. 172-181.
[39] B. David, R. Ambroise, D. Flandre and J. D. Legat, "Interests and limitations of technology scaling for subthreshold logic", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17(10), 2009, pp. 1508-1519.

## CHAPTER 2

## LITERATURE REVIEW

### 2.1. INTRODUCTION

Digital systems normally comprise a central processing unit (CPU) that executes arithmetic functions (for instance addition, subtraction, and division) and an on-chip cache SRAM memory that holds the data. The arithmetic circuits are implemented with a number of interconnected logic gates, or circuit elements, which form a network with multiple logic depths [1].

A low power microprocessor or digital signal processor (DSP) includes a cache SRAM memory to fetch the data and an arithmetic and logic unit that contains at least one adder or multiplier. In most digital systems, the adder and multiplier lie in the critical path, so their performance may determine the performance of the whole system [2][3]. The choice of arithmetic module architecture is therefore of utmost importance.

In recent years, the sub-threshold design technique has been used to trade off the overall power-delay product of arithmetic circuits by removing or reconfiguring blocks. This is especially important in low-power battery-operated devices, where a longer battery life may be preferred over a higher output precision [4]. Many papers have been published on power-efficient arithmetic circuits and low-power SRAM cells designed with different methodologies and design techniques in the sub-threshold region, which shows the importance of this area. Both modules are part of a developing technology that is advancing along with microchip technology [5]. Arithmetic circuits and SRAM cells operated in the sub-threshold region minimize switching and leakage energy, but they are slow because of the ultra-low supply voltage [6].

Hence there is an interesting tradeoff between speed and switching energy in sub-threshold region. For leakage-aware CMOS circuits, it is a major challenge to find the optimal tradeoff between switching speeds and low leakage currents.
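
The speed/energy tradeoff described above can be illustrated with a small numerical sketch. The model below is an assumption and is not taken from the cited works: the on-current is taken to grow exponentially with the supply voltage, so the leakage energy integrated over one cycle scales as `logic_depth * C * VDD^2 * exp(-VDD / (n * vT))`, while the switching energy scales as `alpha * C * VDD^2`. The parameter values (`alpha`, `logic_depth`, `n`) are hypothetical and only chosen to make the minimum visible.

```python
import math


def energy_per_op(vdd, alpha=0.1, logic_depth=20, n=1.5, v_thermal=0.026, c_eff=1.0):
    """Normalized energy per operation of a gate chain (assumed sub-threshold model)."""
    switching = alpha * c_eff * vdd ** 2
    # Leakage current times VDD times cycle time; the cycle time grows
    # exponentially as VDD is lowered, hence the exp(-VDD / (n * vT)) factor.
    leakage = logic_depth * c_eff * vdd ** 2 * math.exp(-vdd / (n * v_thermal))
    return switching + leakage


def relative_delay(vdd, n=1.5, v_thermal=0.026):
    """Gate delay up to a constant factor: C * VDD / I_on with exponential I_on."""
    return vdd * math.exp(-vdd / (n * v_thermal))


def minimum_energy_voltage(v_lo=0.10, v_hi=0.50, step=0.001):
    """Scan the supply voltage and return the value with the lowest energy per operation."""
    count = int(round((v_hi - v_lo) / step)) + 1
    grid = [v_lo + i * step for i in range(count)]
    return min(grid, key=energy_per_op)


v_opt = minimum_energy_voltage()
print(f"minimum-energy supply voltage (assumed model): {v_opt:.3f} V")
print(f"delay penalty relative to 0.5 V: {relative_delay(v_opt) / relative_delay(0.5):.1f}x")
```

Under these assumptions the energy minimum falls well below the nominal supply voltage, at the cost of a much larger delay, which is the tradeoff the text refers to.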

As discussed in Chapter 1, sub-threshold SRAM and arithmetic circuits (adders and multipliers) are the basic units of such ultra-low power systems and the most power-consuming modules in present-day Wireless Sensor Node (WSN) applications.

Sub-threshold SRAM, parallel prefix adders and column compression multipliers have been established as the most efficient circuits in modern digital systems. Different kinds of parallel adder, parallel prefix adder and column compression multiplier architectures are available. Those frequently used in ultra-low power VLSI system designs and analyzed in the present work are:

- Carry Look-ahead Adder
- Han-Carlson Adder
- Kogge-Stone Adder
- Wallace tree multiplier
- Dadda Multiplier
- SRAM Cells

Therefore, this chapter presents a detailed literature review of arithmetic circuits (Carry Look-ahead Adder, Han-Carlson Adder, Kogge-Stone Adder, Wallace tree multiplier, Dadda multiplier) and SRAM cells. The initial sections of the chapter cover the existing architectures of arithmetic circuits with their parameters, such as power, delay and power-delay product, using different logic design styles at different technology nodes.

The last section of this chapter describes various low power SRAM cell designs and their parameters in terms of read stability, write ability, hold stability, read/write access time and hold power consumption.

### 2.2. ARITHMETIC CIRCUITS

With the advent of battery-operated applications like portable computing and personal communication systems, power efficient circuits are needed because of the difficulty in providing adequate cooling to high-density chips and the need to increase battery life [7][8][9]. For portable applications, low power adder and multiplier circuits with optimum performance are needed.

Hence, designers need to estimate the delay as well as the average power consumption accurately before the circuit goes to fabrication. Accurate timing analysis has been the subject of numerous investigations over the years. Auvergne et al. [10] have defined the delays $\mathrm{t}_{\mathrm{HL}}\left(\mathrm{t}_{\mathrm{LH}}\right)$ in an enhancement-depletion MOS logic gate as the time taken by the output to fall (rise) from the static high level (static low level) to $\mathrm{V}_{\mathrm{DD}} / 2$. Callaway et al. [11] have investigated the worst-case delay and average power dissipation of adders and multipliers.

They have analyzed the worst-case delay and the average number of gate-output transitions per addition for 16-bit adders designed in a fully static logic family, using three sources: (1) gate-level simulation, (2) detailed circuit simulation, and (3) actual measurement from a test chip. The circuits are subjected to 10,000 pseudo-random inputs. The results show that the number of transitions for each adder, except the conditional sum adder, increases linearly with the word size, whereas the power consumption is normally distributed. These results are then compared with the circuit simulation results and the measured results. The simple unit-gate delay model is found to be inaccurate for carry look-ahead and conditional sum adders because of the large fan-in or large fan-out gates in the worst-case path. Similarly, the unit power model underestimates the power consumption for adders with large fan-in and fan-out [12].

Hence, the unit-delay, unit-power gate model can be used to generate only a first estimate of the power consumption and worst-case delay of parallel adders, parallel prefix adders and column compression multipliers.
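
To make the idea of a unit-power estimate concrete, the following sketch counts gate-output transitions of an adder under pseudo-random inputs. It is only an illustration under stated assumptions: the adder is a ripple-carry adder (chosen here for brevity, not one of the adders of the cited study), each full adder is decomposed into five two-input gates, and a zero-delay model is used, so glitches are ignored. The function names are hypothetical.

```python
import random


def ripple_adder_gate_outputs(a, b, width):
    """Return the output value of every gate in a ripple-carry adder.

    Each full adder is modelled with two XOR, two AND and one OR gate
    (an assumed textbook decomposition).
    """
    outputs = []
    carry = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        x1 = ai ^ bi        # XOR: propagate
        a1 = ai & bi        # AND: generate
        s = x1 ^ carry      # XOR: sum bit
        a2 = x1 & carry     # AND: propagated carry
        carry = a1 | a2     # OR: carry out
        outputs.extend((x1, a1, s, a2, carry))
    return outputs


def average_transitions(width, vectors=2000, seed=1):
    """Average number of gate outputs that change per addition (unit-power estimate)."""
    rng = random.Random(seed)
    previous = ripple_adder_gate_outputs(0, 0, width)
    total = 0
    for _ in range(vectors):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        current = ripple_adder_gate_outputs(a, b, width)
        total += sum(p != c for p, c in zip(previous, current))
        previous = current
    return total / vectors


for word_size in (8, 16, 32):
    print(word_size, round(average_transitions(word_size), 1))
```

For this simple adder the transition count grows roughly in proportion to the word size. Because every gate is weighted equally and glitching is ignored, the figure is only a first estimate, in line with the limitation noted above.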

As bigger bit widths are desirable for achieving higher precisions, they result in bigger adders and multipliers with high toggling profiles and higher power figures. In recent years, some techniques that trade power for accuracy by removing or reconfiguring blocks of the adder/multiplier have been made available. Choosing the correct technique and implementing it can make a big difference regarding power.

The choice of logic family provides a good comparison between ultra-low power and high-speed adder architectures. Typically, a static implementation is preferred for power-efficient arithmetic circuits because of its low power consumption. Dynamic circuits and other logic styles are not preferred because of their high activity factors in the sub-threshold region [13].

Static CMOS logic (e.g. complementary CMOS, pass transistor, transmission gate) based arithmetic circuits are the most suitable and widely accepted for many VLSI circuit implementations because of important properties such as low power, high speed, large noise margins, no logic degradation and validity of the logic design style at scaled-down technologies. They are generally more reliable, simpler and less power-consuming than dynamic ones.

The following researchers have presented different arithmetic circuit designs using the static logic design style:

- J. D. Lee has proposed a logic redundancy technique which shares the maximum number of transistors among multiple-output static CMOS complex gates to improve speed [14].
- C. Nagendra et al. [15] have adopted a uniform static CMOS layout strategy to design several classes of parallel and synchronous arithmetic modules. The adopted layout technique reduces the short-circuit power consumption and improves the speed of the circuit.
- A. Asati et al. [16] have presented MUX-based 16x16 unsigned multiplier circuits, which utilize an efficient partial product generation and partial product addition technique. The time and space complexity of such a multiplier is much better than that of simpler array multiplier techniques. The multiplier has been designed using optimized static CMOS logic cells to provide the best area, power and delay performance.

Apart from these detailed studies, efforts have also been made to optimize particular adder architectures by I. S. Hwang et al. [17], I. S. Abu-Khater et al. [18], and J. Lim et al. [19].

Besides considering different adder architectures, another approach is to employ different CMOS circuit design styles to design power efficient arithmetic circuits:

- P. Chuang et al. [20] have proposed a constant delay logic (CDL) style targeting full-custom high-speed applications. The proposed logic demonstrates higher speed and an energy-delay product (EDP) reduction across all data activities compared with static and dynamic domino logic.
- Y. Moon et al. [21] have proposed an efficient charge recovery logic (ECRL) technique with a cascade voltage switch logic structure for low-energy adiabatic logic circuits to design an arithmetic module.
- M. M. Nipun et al. [22] have designed a 4-by-4 array multiplier and a fifty-fifth order FIR filter using modified STSCL differential logic techniques to reduce the overall leakage consumption within a system.
- J. Crop et al. [23] have designed a 4-bit asynchronous multiplier in the sub-threshold region. The designed asynchronous multiplier was more tolerant to process variation than conventional synchronous sub-threshold circuits and operated with a low supply voltage and minimum energy voltage.
- A. Tajalli et al. [24] have presented a technique for implementing sub-threshold source-coupled logic (STSCL) gates. Fundamental circuits such as ring oscillators and frequency dividers, as well as more complex digital blocks such as parallel multipliers designed using the STSCL topology, have been experimentally characterized. The bias current of the STSCL gate can be scaled over several decades using the same device dimensions, which makes this circuit topology very suitable for ultra-low power configurable digital systems.
- S. K. Chang et al. [25] have presented a multiplexer-based carry-skip algorithm to design a hybrid adder with reduced delay. The hybrid adder combines carry-lookahead and multiplexer-based carry-skip architectures to speed up its performance.
- C. Burwick et al. [26] have designed digital CMOS threshold logic circuits for low power applications. The threshold logic function reduces the logic depth of the circuit due to smaller capacitance and reduces the power dissipation with improved circuit performance.
- E. Angel et al. [27] have presented and compared sign extension techniques to decrease switching activity and improve the performance of parallel multipliers. A significant reduction in power dissipation has been achieved through the use of an efficient sign extension scheme, which requires neither additional hardware nor any penalty in delay.
- M. Solaz et al. [28] have designed two 16-bit Wallace tree multipliers using operand reduction and truncation techniques, synthesized with 90 nm low power standard cells. The results show that the truncation-based multiplier offers a $30 \%$ reduction in power compared with the operand-reduction-based multiplier, due to the lower weight profile of the discarded bits.

Analysis of different logic styles operating in the sub-threshold region is necessary because the performance parameters (power, delay and power-delay product) of different arithmetic circuits depend on the logic style used.

### 2.2.1. Parallel Adder Architectures

Although adder design is a well-researched area, only limited research has been carried out to improve adder performance in the power-delay space [29][30][31].

On the other hand, several optimization techniques have been proposed that try to reduce the power consumption of the circuit, either by lowering supply voltages in non-critical paths or by trading gate sizing with an increase in the maximum delay of the circuit [32][33].

The arrangement of the prefix network specifies the type of the parallel-prefix adders (PPA). It is apparent that a key advantage of the tree structured adder is that the critical path due to the carry delay is on the order of $\log _{2} \mathrm{~N}$ for an N -bit wide adder. The arrangements of the prefix network give rise to various families of adders. There are many parallel-prefix adders that have been invented so far.

Among them, the radix-2 carry look-ahead adder, which is the most important building block of parallel prefix adders, and two radix-2 parallel prefix adders (the Kogge-Stone adder and the Han-Carlson adder) are widely known. Various architectures exist for the carry calculation part.

Tradeoffs in these architectures involve:

- Area of the Adder
- The fan out of the nodes
- The overall wiring network

The biggest difference between the full adder and the parallel prefix adder is that in the full adder, summation and carry calculation are done in the same one-bit block, whereas in the prefix adder they are separated from the bit block and the whole carry calculation is treated as a single carry graph. The carry graph uses the prefix circuit, which is the origin of the name "Prefix Adder".

Conventionally, a parallel adder structure employs three main parts: the propagate/generate generator, the sum generator, and the carry generator. These parts decide the overall circuit performance: power consumption, speed, area, power-delay product, etc.
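
The three-part structure described above can be sketched in software. The following bit-level model is a minimal illustration, assuming a Kogge-Stone arrangement of the prefix network; the function name and the way the carry-in is folded into bit 0 are choices made for this sketch and are not taken from the cited works.

```python
def kogge_stone_add(a, b, width=16, carry_in=0):
    """Add two unsigned integers with a Kogge-Stone prefix network.

    Returns (sum modulo 2**width, carry_out).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # 1) Propagate/generate generator: one (g, p) pair per bit position.
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]

    # 2) Carry generator: log2(width) prefix stages with doubling distance.
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & carry_in)  # fold the carry-in into bit 0
    distance = 1
    while distance < width:
        new_G, new_P = G[:], P[:]
        for i in range(distance, width):
            # Prefix operator: (G, P) o (G', P') = (G | P & G', P & P')
            new_G[i] = G[i] | (P[i] & G[i - distance])
            new_P[i] = P[i] & P[i - distance]
        G, P = new_G, new_P
        distance *= 2

    # 3) Sum generator: s_i = p_i XOR c_i, where c_i is the carry into bit i.
    carries = [carry_in] + G[:-1]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, G[width - 1]


print(kogge_stone_add(0x1234, 0xABCD))
```

The carry generator is the only part whose depth depends on the word size, which is why the arrangement of the prefix network dominates the delay, fan-out and wiring tradeoffs listed earlier.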

Extensive research has been done to improve the power and performance of these basic building blocks to design power efficient arithmetic circuits in super-threshold region.

Many papers have been published on the optimization of power-efficient adders using different methodologies and design techniques to improve power consumption, performance and size, which shows the importance of this area [35]-[68].

The research outcomes for carry look-ahead, Kogge-Stone and Han-Carlson adder architectures, using different logic design styles at different technology nodes, are discussed below:

## (i) CLA architectures

CLA is a fast adder architecture optimized at the logic level and used to design addition-based arithmetic circuits in cell-based VLSI design. It generates the carry signals in $\mathrm{O}(\log \mathrm{n})$ time for $n$ bits. The parallel CLA adder has been commonly used for fast carry computation in the design of energy-efficient parallel prefix circuits [34]. Some of the most cited CLA architectures, using different techniques and technology nodes, are given in Table 2.1.

Table 2.1: Comparison of various referenced CLA architectures

| Types of CLA's | Techniques | Technology | Improvements |
| :---: | :---: | :---: | :---: |
| 64-bit CLA [35] | Using modified carry select adder block | $0.5 \mu \mathrm{~m}$ | Speed and area |
| 64-bit CLA [36] | Dual-$\mathrm{V}_{\text {th }}$ domino circuit | $0.25 \mu \mathrm{~m}$ | Performance, lower power consumption with reduced standby leakage current |
| 4-bit carry-ripple adder and modified CLA (MCLA) [37] | NAND gate instead of AND gate at the end of the generating path | 65 nm and 90 nm | Delay, power-delay and energy-delay product |
| 16-bit CLA [38] | Multi-threshold CMOS (MTCMOS) | $0.35 \mu \mathrm{~m}$ | Propagation delay time and power consumption |
| 32-bit CLA [39] | Non-full voltage swing true-Single-phase- Clocking Logic (NSTSPC) | $0.35 \mu \mathrm{~m}$ | To speed up the generation of the carry |
| 32-bit tree-structured CLA [40] | Modified ANT (all-N-transistor) logic | $0.35 \mu \mathrm{~m}$ | Speed and area |
| 8-bit CLA [41] | Two-phase all-N-transistor (ANT) logic blocks | $0.25 \mu \mathrm{~m}$ | Dynamic power consumption |
| 64-bit CLA [42] | Sparse CLA trees adder based on buffering techniques | $0.13 \mu \mathrm{~m}$ | Energy efficiency |
| Mixed-radix look ahead cells of the CLA network [43] | The hierarchical expansion of the carry equation for the reverse conversion algorithm creates a regular multilevel structure | $0.18 \mu \mathrm{~m}$ | Speed |
| 64-bit [44] | Optimized transistor sizes in the different Manchester carry chain blocks and adjusted block widths within the carry tree | $0.35 \mu \mathrm{~m}$ | Critical delay paths of the carry signals |
| 4-bit CLA [45] | Differential PTL, a novel technique | Non-self-aligned $1 \mu \mathrm{~m}$ D-MESFET GaAs technology | Circuit performance |
| 32-bit CLAs [46] | Complementary all-N transistor (CANT) comprising ANT logic and inverted ANT logic | 90 nm | Power and performance |
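
The carry look-ahead principle behind the architectures in Table 2.1 can be illustrated with a 4-bit block in which every carry is written directly as a two-level function of the generate and propagate signals and the block carry-in. This is a textbook formulation given here as a sketch; the function name `cla4` is hypothetical.

```python
def cla4(a, b, c0=0):
    """4-bit carry look-ahead block; returns (4-bit sum, carry_out)."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]  # generate

    # Each carry depends only on g, p and c0, never on another carry,
    # so all carries can be evaluated in parallel.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c4


print(cla4(0b1011, 0b0110))
```

The widening product terms show why the fan-in grows with the block size, which is one reason wider adders are usually assembled from such blocks in a hierarchical or prefix arrangement.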

## (ii) HCA architectures

HCA offers a highly efficient solution to the binary addition problem and assures a low computation delay and low power in the sub-threshold region. The hybrid Han-Carlson logarithmic prefix adder combines the Kogge-Stone structure, which has $\log _{2} \mathrm{n}$ stages, and the Brent-Kung structure, which has $2 \log _{2} \mathrm{n}-1$ stages. The combined effect of both structures provides a reasonably high speed at lower complexity [47]. Some of the most cited HCA architectures, using different techniques and technology nodes, are given in Table 2.2.

Table 2.2: Comparison of various referenced HCA architectures

| Types of HCA's | Techniques | Technology | Improvements |
| :--- | :--- | :---: | :---: |
| N $\leq$ 64-bit [48] | A new prefix algorithm for carry computation | $4 \mu \mathrm{~m}$ | Delay and area |
| 64-bit HCA [49] | Optimization of prefix level in carry-computation units | $0.25 \mu \mathrm{~m}$ | Power consumption |
| 8-bit, 16-bit and 32-bit HCA [50] | CMOS logic and transmission gate logic design style | 65 nm | Area, delay and power consumption |
| 32-bit HCA [51] | Reduction in prefix operations by adjusting the number of stages | --- | Area and power consumption |
| 32-bit, 64-bit and 128-bit HCA [52] | A novel variable latency speculative method | 65 nm | High speed |
| 16-bit, 24-bit, 32-bit, 40-bit, 48-bit, 56-bit, 64-bit, 80-bit, 96-bit, 112-bit and 128-bit HCA [53] | Hybrid HCA (combination of two Brent-Kung stages each at the beginning and at the end, with Kogge-Stone stages in the middle) | 45 nm | Area and power consumption |
| 32-bit HCA [54] | Reduces the number of prefix operations | Xilinx ISE 13.1 tool, implemented using Spartan 6 | Area and power consumption |
| 16-bit HCA [55] | HCA implemented with black cell and white cell operations for carry generation and propagation | Xilinx Design Suite 14.5, implemented using Spartan-3E FPGA (XC3S500E-4FG320C) | Speed |
| 32--- | Area and power <br> consumption |  |  |
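
The hybrid structure of the Han-Carlson network can be sketched as follows: one Brent-Kung style stage that merges each odd bit with its even neighbour, Kogge-Stone stages applied to the odd positions only, and a final stage that completes the even positions. The model below is an illustrative sketch that assumes a power-of-two word size and no carry-in; the function name is hypothetical. For an n-bit adder it uses $\log _{2} \mathrm{n}+1$ prefix stages.

```python
def han_carlson_add(a, b, width=16):
    """Add two unsigned integers; returns (sum, carry_out, prefix_stages)."""
    if width < 2 or width & (width - 1):
        raise ValueError("this sketch assumes a power-of-two width")
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]

    def run_stage(pairs):
        # Evaluate every prefix cell of the stage from the previous stage's
        # values, then commit the results together (the cells work in parallel).
        results = [(i, G[i] | (P[i] & G[j]), P[i] & P[j]) for i, j in pairs]
        for i, g_new, p_new in results:
            G[i], P[i] = g_new, p_new

    # Brent-Kung style first stage: odd bits absorb their even neighbour.
    run_stage([(i, i - 1) for i in range(1, width, 2)])
    stages = 1

    # Kogge-Stone stages restricted to the odd bit positions.
    distance = 2
    while distance < width:
        run_stage([(i, i - distance) for i in range(distance + 1, width, 2)])
        stages += 1
        distance *= 2

    # Final stage: even bits take the completed prefix of the odd bit below.
    run_stage([(i, i - 1) for i in range(2, width, 2)])
    stages += 1

    carries = [0] + G[:-1]
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, G[width - 1], stages


print(han_carlson_add(0x1234, 0xABCD))
```

Compared with a full Kogge-Stone network, only half of the bit positions take part in the middle stages, which is where the reduction in cells and wiring comes from, at the cost of one additional stage.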

## (iii) KSA architectures

The Kogge-Stone adder is a parallel prefix form of the carry look-ahead adder. It generates the carry signals in $\mathrm{O}(\log \mathrm{n})$ time, and is widely considered the fastest adder design possible. It is the common design for high-performance adders in industry. It takes more area to implement than the Brent-Kung adder, but has a lower fan-out at each stage, which increases performance. Wiring congestion is often a problem for Kogge-Stone adders as well [57]. Some of the most cited KSA architectures, using different techniques and technology nodes, are given in Table 2.3.

Table 2.3: Comparison of various referenced KSA architectures

| Types of KSA's | Techniques | Technology | Improvements |
| :---: | :---: | :---: | :---: |
| n-bit [58] | Proposed recurrence equation by using recursive doubling technique in an algorithm | All | Speed |
| 64-bit KSA [49] | Reduced switching activity of the novel carry-computation units | $0.25 \mu \mathrm{~m}$ | Power consumption |
| 8-bit, 16-bit and 32-bit KSA [50] | Using both CMOS logic and transmission gate logic design style | 65 nm | Area and delay |
| 64-bit KSA [59] | Dynamic pass-transistor technique based on four new dot-operator cells | $0.35 \mu \mathrm{~m}$ and $0.13 \mu \mathrm{~m}$ | Area and power consumption |
| 16-bit, 32-bit, 64-bit, 128-bit and 256-bit KSA [60] | Reconfigure a wide KSA by using cascading technique | 90 nm | Power-delay product |
| 16-bit, 32-bit and 64-bit [61] | Used conventional equations of KSA | 130 nm, 90 nm, 65 nm and 40 nm | Speed |
| 64-bit adder [62] | Logical effort method has been used on transistor-level | $0.18 \mu \mathrm{~m}$ | Delay |
| 128-bit KSA [63] | A new efficient structure in which the operation of each stage depends only on a less significant stage | Xilinx 14.3 software | Power-delay product |
| 8-bit KSA [64] | Re-routing (wiring) and black-cell reduction | Xilinx ISim simulator, synthesized for the Spartan-3 FPGA XC3S400 | Speed |
| 64-bit KSA [65] | Fault tolerant technique using adaptive clocking | 180 nm | Speed |

### 2.2.2. Column Compression Multipliers

Multipliers consume a large amount of power and delay during the partial product accumulation stage. At this stage, most multipliers are designed with different kinds of compressors that add two or three bits using lower-order (2-2/3-2) compressors, or 4, 5, 6 or 7 bits using higher-order (4-2, 5-2, 6-2 and 7-2) compressors.

These lower/higher order compressors are used to perform parallel computation to accumulate the partial products [66][67].

Therefore, their power and performance determine the overall multiplier response. The power consumption of these partial product accumulation modules in multipliers depends upon the choice of logic design style.
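
The partial product accumulation stage can be illustrated with a small behavioural model. The sketch below generates the partial product bits column by column and repeatedly applies 3-2 compressors (full adders) until no column holds more than two bits, after which a final carry-propagate addition produces the result. It is a generic column-compression scheme written for illustration; it does not reproduce the exact reduction schedule of the Wallace or Dadda multipliers, and the function name is hypothetical.

```python
def column_compression_multiply(a, b, width=8):
    """Multiply two unsigned integers; returns (product, reduction_stages)."""
    # Partial product generation: bit a_i AND b_j has weight 2**(i + j).
    columns = [[] for _ in range(2 * width)]
    for i in range(width):
        for j in range(width):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    stages = 0
    while max(len(col) for col in columns) > 2:
        reduced = [[] for _ in range(len(columns) + 1)]
        for k, col in enumerate(columns):
            idx = 0
            while len(col) - idx >= 3:
                x, y, z = col[idx:idx + 3]
                reduced[k].append(x ^ y ^ z)                         # sum stays in column k
                reduced[k + 1].append((x & y) | (x & z) | (y & z))  # carry moves to column k + 1
                idx += 3
            reduced[k].extend(col[idx:])  # leftover bits pass through unchanged
        columns = reduced
        stages += 1

    # Final carry-propagate addition of the two remaining rows.
    row0 = sum(col[0] << k for k, col in enumerate(columns) if len(col) > 0)
    row1 = sum(col[1] << k for k, col in enumerate(columns) if len(col) > 1)
    return row0 + row1, stages


print(column_compression_multiply(200, 123))
```

Every compressor in one stage operates in parallel, so the number of stages, rather than the number of partial product bits, sets the delay of the accumulation stage.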

- Foster and Stockton have developed a counter implemented with full and half adders, which is used in the partial product reduction process to optimize the overall performance of multipliers [68].
- S. Asif et al. have discussed a strategy to reduce the area of the traditional Wallace (TW) multiplier by reducing the number of half adders [69]. This method uses half adders in such a way that the size of the final adder is also reduced. The speed of the reduced complexity Wallace (RCW) multiplier is expected to be the same as that of the TW multiplier because both multipliers have an equal number of reduction stages.
- P. Ramanathan et al. have presented a decomposition logic based technique which improves the performance of Dadda multipliers with little increase in power dissipation [70]. The designed multiplier was faster and more energy efficient, with a negligible power penalty in spite of the extra logic circuitry. The choice of logic family used to design the adders also improves the overall performance of multipliers [71][72][73].
- L. Dadda has introduced the concept of $(\mathrm{n}, \mathrm{m})$ parallel compressors to reduce the partial product matrix, which generalizes the idea of using HAs and FAs. An $(\mathrm{n}, \mathrm{m})$ parallel compressor is a combinational network with n inputs and m outputs, where the outputs express the count of the inputs that are ones. Thus, an HA and an FA are (2-2) and (3-2) parallel compressors, respectively.
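
The counting behaviour of an (n, m) parallel compressor can be modelled in a few lines. The sketch below is purely behavioural: it expresses the half adder and full adder as (2-2) and (3-2) counters and, as an assumed textbook structure not taken from the cited designs, builds a 4-2 compressor from two cascaded full adders. All function names are hypothetical.

```python
def parallel_counter(bits, m):
    """(n, m) parallel counter: encode the number of ones in `bits` on m outputs (LSB first)."""
    count = sum(bits)
    if count >= (1 << m):
        raise ValueError("m output bits cannot represent the count")
    return [(count >> k) & 1 for k in range(m)]


def half_adder(x, y):
    return parallel_counter([x, y], 2)       # (2-2): [sum, carry]


def full_adder(x, y, z):
    return parallel_counter([x, y, z], 2)    # (3-2): [sum, carry]


def compressor_4_2(x1, x2, x3, x4, cin):
    """4-2 compressor from two cascaded full adders; returns (sum, carry, cout)."""
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    # Invariant: x1 + x2 + x3 + x4 + cin == total + 2 * (carry + cout)
    return total, carry, cout


print(full_adder(1, 0, 1))
print(compressor_4_2(1, 1, 0, 1, 1))
```

In the 4-2 compressor, `cout` does not depend on `cin`, so a row of such cells does not ripple a carry horizontally.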

The compressors are the bit-compressing cells with principal application in multi-operand addition and multiplication hardware.

Details of some of the most cited compressor designs used in Wallace tree and Dadda multipliers, with their techniques and technology nodes, are summarized in Table 2.4.

Table 2.4: Comparison of various referenced compressor designs

| Types of Compressors | Techniques | Technology | Improvements |
| :---: | :---: | :---: | :---: |
| 4-2 [74] | Non-full-swing pass-transistor carry generator | $0.8 \mu \mathrm{~m}$ and $0.35 \mu \mathrm{~m}$ | Power, delay and power-delay product |
| 4-5 and 5-2 [75] | Modified XOR and MUX circuits | $0.35 \mu \mathrm{~m}$ | Delay and power consumption |
| 4-5 and 5-2 [76] | Feedback concept in pass transistor logic | $0.18 \mu \mathrm{~m}$ | Power consumption |
| 4-3, 5-3, 6-3 and 7-3 [77] | Designed only half adder and full adders | 90 nm | Delay |
| 3-2 and 4-2 compressor [78] | Double pass transistor logic (DPL) | $0.6 \mu \mathrm{~m}$ | Delay |
| 3-2, 4-2 and 5-2 [79] | Analyzed using CMOS and CMOS+ implementations of XOR and the MUX blocks | $0.18 \mu \mathrm{~m}$ | Area, power, delay and power-delay product |
| 3-2 and 4-2 [80] | Combination of two logic design style | 65 nm | Power consumption |
| 4-2 [81] | Decomposing each XOR gate to three simpler gates among AND/NAND and OR/NOR | 45 nm | Power, delay and power-delay product |
| 5-2 [82] | Designed with XOR-XNOR circuits | $0.25 \mu \mathrm{~m}$ | Delay |
| 5-3 [83] | Fast 2-bit adder cell, which utilizes two XOR gate delays | $0.225 \mu \mathrm{~m}$ | Delay |

### 2.3 SRAM CELLS

SRAM is an important component in ultra-low power systems. A standard 6T SRAM operated at sub-threshold voltages suffers from degraded static noise margin (SNM) and fluctuations in MOSFET currents caused by process variations. At ultra-low voltages, SRAM cell stability is therefore a major challenge in the sub-threshold region [84]. Numerous prominent publications have appeared in recent years targeting low power SRAM cell designs that improve cell stability in the sub-threshold region.
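
The static noise margin mentioned above is commonly read from the butterfly curve of the cross-coupled inverters as the side of the largest square that fits inside one lobe. The sketch below estimates the hold SNM with that construction, scanning 45-degree lines across the lobe. It is an illustration only: the inverter transfer curve is an idealized tanh shape with a hypothetical `gain` parameter, not a sub-threshold device model, and both inverters are assumed identical.

```python
import math


def inverter_vtc(v_in, vdd, gain):
    """Idealized inverter transfer curve (tanh-shaped; an assumption, not a device model)."""
    return 0.5 * vdd * (1.0 - math.tanh(gain * (v_in - 0.5 * vdd) / vdd))


def _root(func, lo, hi, iterations=60):
    """Bisection for a strictly decreasing function with func(lo) > 0 > func(hi)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if func(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def hold_snm(vdd, gain, steps=400):
    """Side of the largest square inside one lobe of the butterfly curve."""
    best = 0.0
    for k in range(1, steps):
        c = vdd * k / steps  # offset of the 45-degree line y = x + c
        # Intersection with curve A: y = f(x)
        x_a = _root(lambda x: inverter_vtc(x, vdd, gain) - x - c, -vdd, 2.0 * vdd)
        # Intersection with the mirrored curve B: x = f(y)
        x_b = _root(lambda x: inverter_vtc(x + c, vdd, gain) - x, -vdd, 2.0 * vdd)
        # Along a 45-degree line the horizontal distance equals the square side.
        best = max(best, x_a - x_b)
    return best


for gain in (6.0, 30.0):
    print(gain, round(hold_snm(0.3, gain), 4))
```

In this idealized model a lower inverter gain yields a smaller lobe and hence a smaller SNM, which gives a qualitative picture of why stability is harder to maintain at ultra-low supply voltages.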

- B. H. Calhoun et al. [85] have evaluated the static noise margin (SNM) of the conventional 6T (C6T) SRAM bit-cell operating in the sub-threshold region. The detailed analysis of the statistical distribution of SNM with process variation provides a model for the tail of the probability density function (PDF) that dominates SNM failures. The low power C6T SRAM cell fails to achieve reliable sub-threshold operation [86]. The single-ended low power C6T cell suffers from stability problems during read/write operations [87].
- S. Mukhopadhyay et al. [88] have explored the operation of single-ended 6T SRAM cell at 32 nm technology node. This design saves dynamic read/write power more than $50 \%$ with marginal penalty in read/write delay and standby power. The analyzed results show that the design offers narrower spread in read access time, which shows its robustness against process and temperature variations at the expense of 5\% penalty in read static noise margin.
- M. F. Chang et al. [89] have studied a differential data-aware power-supplied (DAP) 8T SRAM cell in 45 nm and 40 nm processes. This study addresses the stability and trade-off issues between write and half-select accesses in the conventional 8T and 6T cells. The designed 8T cell applies differential data-aware supply voltages to its cross-coupled inverters to increase the stability margins for both write and half-select accesses. This DAP-8T cell employs a boosted-BL scheme to improve read speed and read stability. The design improves the write margin, half-select stability, and read stability for low-$\mathrm{V}_{\mathrm{DD}}$ applications.
- T. H. Kim et al. [90] have presented a technique for improving the write margin and read performance of 8T sub-threshold SRAMs by using long channel devices to utilize the pronounced reverse short channel effect. The results show that the designed 8T cell at 0.2 V has a write margin equivalent to a conventional cell at 0.27 V, with improved write margin, better variation tolerance and an increased Ion-to-Ioff ratio in the read port.
- G. Razavipour et al. [91] have presented 8T and 9T SRAM cells at 45 nm technology to reduce the static power dissipation due to gate and sub-threshold leakage currents. The first (8T) cell structure results in reduced gate voltages for the NMOS pass transistors, and thus lowers the gate leakage current. It reduces the sub-threshold leakage current by increasing the ground level during the idle (inactive) mode. The second (9T) cell structure makes use of PMOS pass transistors to lower the gate leakage current. In addition, dual threshold voltage technology with forward body biasing is utilized with this structure to reduce the sub-threshold leakage while maintaining performance. Compared to a conventional SRAM cell, the first cell structure decreases the total gate leakage current by $66 \%$ and the idle power by $58 \%$ and increases the access time by approximately $2 \%$, while the second cell structure reduces the total gate leakage current by $27 \%$ and the idle power by $37 \%$ with no access time degradation.
- N. Verma et al. [92] have designed a high density 8T SRAM cell to achieve a minimum operating voltage of 350 mV at 65 nm technology. To ensure read stability, a buffered read is used, and peripheral control of both the bit-cell supply voltage and the read-buffer's foot voltage enables sub-threshold write and read without degrading the bit-cell's density. The overall result shows that the entire 256 kb SRAM consumes $2.2 \mu \mathrm{~W}$ in leakage power at 350 mV.
- L. Sheng et al. [93] have designed and evaluated a nine-transistor (9T) SRAM cell which utilizes a scheme with separate read and write word lines. The results show improvements in power dissipation, performance and stability.
- L. Zhiyu et al. [94] have designed a nine-transistor (9T) SRAM cell at 65 nm technology for reducing leakage power and enhancing data stability. This cell completely isolates the data from the bit lines during a read operation, and the read stability is thereby enhanced by $2 \times$. The leakage power consumption of a super cutoff 9T SRAM cell was reduced by $22.9 \%$ as compared to a conventional six-transistor SRAM cell.
- M. Zamani et al. [95] have presented a scheme that uses dynamic mechanism cutting the feedback to improve the write SNM and lowering the write access time. The 9T-cell SRAM design shows robust stability, $80 \%$ and $50 \%$ improvement in read and write SNM respectively in comparison to the 6 T .
- M. Majid et al. [96] have presented a nine-transistor (9T) SRAM cell operating in the subthreshold region at 90 nm technology. This 9 T cell uses a common bit-line during read and writes operation. A suitable read operation is achieved by suppressing the drain-induced barrier lowering effect and controlling the body-source voltage dynamically. Proper usage of low-threshold voltage $\left(\mathrm{L}-\mathrm{V}_{\mathrm{th}}\right)$ transistors in 9 T design helps to reduce the read access time and enhance the reliability in the sub-threshold region.
- A. R. Ahmadi Mehr et al. [97] have designed two 12T SRAM cells for a better write margin using 32 nm standard bulk MOSFET technology. In these structures, two new transistors have been added to cut the feedback of the back-to-back inverters. Improved static power consumption and read and write noise margins have been obtained at the cost of additional area as compared to the conventional 6T SRAM cell.
- H. Chen et al. [98] have presented a Schmitt-trigger based 12T SRAM bit-cell which operates at the optimum-energy supply voltage for medical device applications. Results show that the design has $45 \%$ and $30.2 \%$ improvements in the read and hold noise margins as compared to the conventional 6T SRAM bit-cell.
- Y. W. Chiu et al. [99] have designed a bit-interleaving 12T sub-threshold SRAM cell with Data-Aware Power-Cutoff (DAPC) write-assist, which improves the write-ability of the cell. The disturb-free feature facilitates a bit-interleaving architecture that can reduce multiple-bit errors in a single word and enhance soft error immunity by employing error checking and correction (ECC) techniques. The 4 kb SRAM has been implemented in 40 nm general purpose (40GP) CMOS technology. Data can be written successfully down to 300 mV. The measured maximum operating frequency is 11.5 MHz with a total power consumption of $22 \mu \mathrm{~W}$ at 350 mV.
- K. Takahiro et al. [100] have developed a ratio-less full-complementary 12T SRAM cell using 180 nm technology that operates at an ultra-low supply voltage of 0.22 V. The ratio-less SRAM design concept enables a memory cell design that is free from the consideration of the SNM. Furthermore, it enables SRAM function without restrictions on the transistor parameter (W/L) settings and without dependence on the variability of device characteristics. The measured results show that the ratio-less full-complementary 12T SRAM has superior immunity to device variability, as well as an inherent ability to operate in the sub-threshold region, as compared to the conventional 6T SRAM cell.
- D. Roy et al. [101] have presented a low-write-power, variability-aware 13-transistor (13T) SRAM design for the sub-threshold region at the 22 nm technology node. The cell achieves low write power dissipation by reducing the activity factor and breaking the feedback path of the cross-coupled inverters during the write operation. Compared with a standard 6T SRAM cell at nominal $V_{D D}$, it also achieves a higher read static noise margin and a $27 \%$ tighter spread in the read delay distribution, at the expense of a $49.5 \%$ decrease in write static noise margin and a $76 \%$ higher read delay.

Different existing low power SRAM cell designs and their performance, discussed above, are summarized in Table 2.5.

Table 2.5: Comparison of various referenced low power SRAM cells

| Types of SRAM Cells | Techniques | Technology | Improvements |
| :---: | :---: | :---: | :---: |
| 4T [102] | Two word-lines and one bit-line pair | 65 nm | Power consumption and area |
| 5T [103] | Single-ended read and differential write scheme | 40 nm | Static power consumption |
| 5T [104] | Asymmetric cell sizing concepts | 45 nm | Static-noise-margin and area |
| 6T [105] | Single-ended | $0.13 \mu \mathrm{~m}$ | Power consumption |
| 6T [88] | Single ended | 32 nm | Dynamic read/write power consumption |
| 7T [106] | Feedback connection and disconnection are performed through an extra NMOS transistor | 180 nm | Write power consumption |
| 8T [107] | Access pass gates are replaced with full transmission gates | 16 nm | Read/write access time |
| 8T [89] | Differential data-aware power-supply with boosted-BL scheme | 45 nm and 40 nm | Write margin and read stability |
| 8T [90] | Reverse short channel effect in subthreshold design technique | $0.36 \mu \mathrm{~m}$ | Write margin, read delay and better variation tolerance |
| 8T and 9T [91] | Dual threshold voltage technology with forward body biasing | 45 nm | Static power consumption |
| 8T [92] | Buffered read foot voltage enable | 65 nm | Leakage power consumption |
| 9T [94] | Isolates the data from the bit lines during a read operation | 65 nm | Leakage power consumption and enhancing data stability |
| 9T [95] | Dynamic mechanism to cut the feedback | 90 nm | Read and write SNM |
| 9T [96] | Common bit-line for read and write operations | 90 nm | Read access time and reliability |
| 12T [97] | Cut the feedback of the back to back inverters | 32 nm | Static power consumption, read and write SNM |
| 12T [98] | Schmitt trigger | $0.13 \mu \mathrm{~m}$ | Hold and read noise margins |
| 12T [99] | New bit-interleaving | 40 nm | Write-ability, power consumption |
| 12T [100] | Ratio-less full-complementary | 180 nm | Superior immunity to device variability |

Hence, cell structures with more transistors are widely regarded as necessary to support future process technologies as well as operation in the sub-threshold region.

Summary: This chapter has reviewed the published works on low power architectures of arithmetic circuits (CLA, KSA, HCA, Wallace tree and Dadda multipliers) with their performance parameters such as power, delay and power-delay-product using different logic design styles at different technology nodes.

Also, different existing low power SRAM cell designs and their performance parameters in terms of read stability, write ability, hold stability, read/write access time and leakage power consumption have been reviewed.

Outcome: From Table 2.1 to Table 2.4, it is clear that power-efficient carry look-ahead adders, Kogge-Stone adders, Han-Carlson adders, and Wallace tree and Dadda multipliers have been developed using different logic styles for the super-threshold region.

In contrast, few explorations have been done in the area of sub-threshold basic logic gates, arithmetic circuits and SRAM cell designs, including the impact of technology scaling on their performance. Therefore, the design issues and required circuit modifications of arithmetic blocks and SRAM cells (at different technology nodes) need to be explored and analyzed thoroughly at the circuit and architecture levels. This forms the research gaps already mentioned in Chapter 1.

## REFERENCES

[1] G. Randall, P. Allen and N. Strader, "VLSI design techniques for analog and digital circuits", New York: McGraw-Hill, vol. 90, 1990, pp. 1-951.
[2] V. G. Oklobdzija, "High-speed VLSI arithmetic units: adders and multipliers", Design of High-Performance Microprocessor Circuits, IEEE Press, Los Alamitos, 2000, pp. 1-578.
[3] M. Mahesh and S. Sherlekar, "VLSI synthesis of DSP kernels: algorithmic and architectural transformations", Springer Science \& Business Media, 2013, pp. 1-209.
[4] A. Chandrakasan, S. Sheng and R. Brodersen, "Low-power CMOS digital design", IEICE Transactions on Electronics, vol. 75(4), 1992, pp. 371-382.
[5] B. Neil and M. Craig, "Comparing the performance of FPGA based custom computers with general-purpose computers for DSP applications", Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, 1994, pp. 164-171.
[6] V. Ramesh, S. Dasgupta and R. P. Agarwal, "Device and circuit design challenges in the digital subthreshold region for ultralow-power applications", VLSI Design, vol. 2009, 2009, pp. 1563-1571.
[7] G. Benjamin, A. Klimm, O. Sander, M. G. Klaus and J. Becker, "A system architecture for reconfigurable trusted platforms", Conference on Design, Automation and Test, 2008, pp. 541-544.
[8] B. Abdellatif and M. Elmasry, "Low-power digital VLSI design: circuits and systems", Springer Science \& Business Media, 2012, pp. 1-527.
[9] Y. T. Lee, I. C. Park and C. M. Kyung, "Design of compact static CMOS recursive output property", Electronics Letters, 1993, vol. 29, pp. 794-796.
[10] D. Auvergne, G. Cambon, D. Deschacht, M. Robert, G. Sagnes and V. Tempier, "Delay-time evaluation in ED MOS logic LSI", IEEE Journal of Solid-State Circuits, vol. 21(2), 1986, pp. 337-343.
[11] C.Thomas and E. Swartzlander, "Estimating the power consumption of CMOS adders", 11th IEEE Symposium on Computer Arithmetic, 1993, pp. 210-216.
[12] P. Dinesh, "Design of robust energy-efficient digital circuits using geometric programming" PhD Dissertation, Stanford University, 2008, pp. 1-157.
[13] R. Zlatanovici, S. Kao and B. Nikolić, "Energy-delay optimization of 64-bit carry-lookahead adders with a 240 ps 90 nm CMOS design example", IEEE Journal of Solid-State Circuits, vol. 44(2), 2009, pp. 569-583.
[14] Y. T. Lee, I. C. Park and C. M. Kyung, "Design of compact static CMOS recursive output property", Electronics Letters, 1993, vol. 29, pp. 794-796.
[15] C. Nagendra, M. Irwin and R. Owens, "Area-time-power tradeoffs in parallel adders", IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 43, 1996, pp. 689-702.
[16] A. Asati and Chandrashekhar, "A $16 \times 16$ mux based multiplier design using optimized static CMOS logic style", International Journal of Electronic Engineering Research, 2009, vol. 1, pp. 53-61.
[17] I. S. Hwang and A. L. Fisher, "Ultrafast compact 32-bit CMOS adders in multiple-output domino logic", IEEE Journal of Solid-State Circuits, vol. 24(2), 1989, pp. 358-369.
[18] I. S. Abukhater, R. H. Yan, A. Bellaouar and M. I. Elmasry, "A 1-V low-power high-performance 32-bit conditional sum adder", IEEE Symposium on Low Power Electronics, 1994, pp. 66-67.
[19] J. Lim, D. G. Kim and S. I. Chae, "A 16-bit carry-lookahead adder using reversible energy recovery logic for ultra-low-energy systems", IEEE Journal of Solid-State Circuits, vol. 34(6), 1999, pp. 898-903.
[20] P. Chuang, L. David and M. Sachdev, "Constant delay logic style", IEEE Transactions on VLSI Systems, vol. 21(3), 2012, pp. 554-565.
[21] Y. Moon and D. K. Jeong, "An efficient charge recovery logic circuit", IEEE Journal of Solid-State Circuits, vol. 31 (4), 1996, pp. 514-522.
[22] M. M. Nipun, S. Roy, A. Korishe, M. H. Maruf and M. A. Rahman., "Ultra-low power digital system design using sub-threshold logic styles" Proceedings of IEEE Symposium on Industrial Electronics and Applications (ISIEA), 2011, pp.109-113.
[23] J. Crop, S. Fairbanks, R. Pawlowski and P. Chian, " 150 mV sub-threshold asynchronous multiplier for low-power sensor applications", International Symposium on VLSI Design Automation and Test (VLSI-DAT), 2010, pp. 254-257.
[24] A. Tajalli, E. Brauer, Y. Leblebici and E. Vittoz, "Sub-threshold source-coupled logic circuits for ultra-low-power applications", IEEE Journal of Solid-State Circuits, vol. 43(7), 2008, pp. 1699-1710.
[25] S. K. Chang and C. Wey, "A fast 64-bit hybrid adder design in 90nm CMOS process", Proceedings of 55th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), 2012, pp. 414-417.
[26] C. Burwick, M. Thomas, J. Dienstuhl and K. F. Goser, "Threshold-gates in arithmetic circuits", Proceedings of $8^{\text {th }}$ IEEE International Conference on Electronics, Circuits and Systems, vol. 2, 2001, pp. 909-912.
[27] E. Angel and E. E. Swartzlander, "Low power parallel multipliers" Workshop on VLSI Signal Processing, vol. 9, 1996, pp. 199-208.
[28] M. Solaz and R. Conway, "Comparative study on word length reduction and truncation for low power multipliers", Proceedings of $33^{\text {rd }}$ International Convention on MIPRO, 2010, pp. 84-88.
[29] C. Thomas and E. Swartzlander, "Low power arithmetic components", In Low Power Design Methodologies, Springer US, 1996, pp. 161-200.
[30] C. Nagendra, M. J. Irwin and R. M. Owens, "Area-time-power tradeoffs in parallel adders", IEEE Transaction on Circuits Systems II: Analog and Digital Signal Processing, vol. 43(10), 1996, pp. 689-702.
[31] V. G. Oklobdzija, B. R. Zeydel, H. Dao, S. Mathew and R. Krishnamurthy, "Energy delay estimation technique for high-performance microprocessor VLSI adders", 16th IEEE Symposium on Computer Arithmetic, 2003, pp. 15-22.
[32] R. Brodersen, M. Horowitz, D. Markovic, B. Nikolic and V. Stojanovic, "Methods for true power minimization", International Conference on Computer-Aided Design, 2002, pp. 35-42.
[33] Y. Shimazaki, Z. Radu and B. Nikolic, "A shared-well dual-supply-voltage 64-bit ALU", IEEE Journal of Solid-State Circuits, vol. 39(3), 2004, pp. 494-500.
[34] G. Dimitrakopoulos and B. Nikolos, "High-speed parallel-prefix VLSI ling adders", IEEE Transactions on Computers, vol. 54(2), 2005, pp. 225-231.
[35] H. Morinaka, H. Makino, Y. Nakase, H. Suzuki and K. Mashiko, "A 64bit carry lookahead CMOS adder using modified carry select", Proceedings of the IEEE Conference on Custom Integrated Circuits, 1995, pp. 585-588.
[36] K. Fujii, T. Douseki and Y. Kado, "A sub-1v dual-threshold domino circuit using product-of-sum logic", IEEE International Symposium on Low Power Electronics and Design, 2001, pp. 259-262.
[37] F. Karami and H. Ali, "New structure for adder with improved speed, area and power", 2nd IEEE International Conference on Networked Embedded Systems for Enterprise Applications (NESEA), 2011, pp. 1-6.
[38] S. Ziabakhsh, A. R. Hosein, A. R. Mohammad and M. Mortazavi, "The design of a low-power high-speed current comparator in $0.35-\mu \mathrm{m}$ CMOS technology", 10th IEEE International Symposium Conference on Quality of Electronic Design, ISQED, 2009, pp. 107-111.
[39] K. H. Cheng, W. S. Lee and Y. C. Huang, "A 1.2 V 500 MHz 32-bit carry-lookahead adder", Proceedings of $8^{\text {th }}$ IEEE International Conference on Electronics Circuits and Systems, vol. 2, 2001, pp. 765-768.
[40] C. C. Wang, Y. L. Tseng, P. M. Lee, R. C. Lee and C. J. Huang, "A 1.25 GHz 32-bit tree-structured carry lookahead adder using modified ant logic", IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 50(9), 2003, pp. 1208-1215.
[41] C. C. Wang, C. L. Lee and P. L. Liu, "Power-aware pipelining design of an 8-bit CLA using PLA-styled all-N-transistor logic", The $2^{\text {nd }}$ Annual IEEE Northeast Workshop on Circuits and Systems, 2004, pp. 149-152.
[42] S. Sheng and S. Carl, "Post-layout comparison of high performance 64-b static adders in energy-delay space" Proceedings of $25^{\text {th }}$ International Conference on Computer Design ICCD, 2007, pp. 401-408.
[43] H. Yajuan and C. H. Chang, "A power-delay efficient hybrid carry-lookahead/carry-select based redundant binary to two's complement converter", IEEE Transactions on Circuits and Systems I, vol. 55(1), 2008, pp. 336-346.
[44] J. Blackburn, A. Russ and E. Swartzlander, "Optimization of spanning tree carry lookahead adders", Proceedings of the $13^{\text {th }}$ Asilomar Conference on Signals, Systems and Computers, vol. 1, 1996, pp. 177-181.
[45] M. Mittal and A. Salama, "DPTL 4-b carry lookahead adder", IEEE Journal of Solid-State Circuits, vol. 27(11), 1992, pp. 1644-1647.
[46] C. H. Hsu, G. N. Sung, T. Y. Yao, C. Y. Juan, Y. R. Lin and C. C. Wang, "Low-power 7.2 GHz complementary all-N-transistor logic using 90 nm CMOS technology", Proceedings of IEEE International Symposium on Circuits and Systems, 2009, pp. 389-392.
[47] J. Chen, "Parallel-prefix structures for binary and modulo $\{2^{n}-1, 2^{n}+1\}$ adders", PhD Dissertation, Shanghai Jiao Tong University Shanghai, China, 2008, pp. 1-136.
[48] H. Tackdon and D. Carlson, "Fast area-efficient VLSI adders", $8^{\text {th }}$ IEEE Symposium on Computer Arithmetic (ARITH), 1987, pp. 49-56.
[49] G. Dimitrakopoulos, P. Kolovos, P. Kalogerakis and D. Nikolos, "Design of high-speed low-power parallel-prefix VLSI adders", Workshop on Power and Timing Modeling, Optimization and Simulation, Springer Berlin Heidelberg, 2004, pp. 248-257.
[50] D. Yagain, V. Krishna and A. Baliga, "Design of high-speed adders for efficient digital design blocks", International Scholarly Research Network, ISRN Electronics, vol. 2012, 2012, pp. 1-12.
[51] G. Swapna and P. Zode, "FPGA implementation of hybrid Han-Carlson adder", 2nd IEEE International Conference on Devices, Circuits and Systems (ICDCS), 2014, pp. 1-4.
[52] D. Esposito, D. Decaro, E. Napoli, N. Petra and A. G. Strollo, "Variable latency speculative Han-Carlson adder", IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 62(5), 2015, pp. 1353-1361.
[53] S. M. Sudhakar, K. P. Chidambaram and E. Swartzlander "Hybrid Han-Carlson adder", 55th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), 2012, pp. 818-821.
[54] G. Swapna and P. Zode, "Parallel prefix Han-Carlson adder", International Journal of Research in Engineering and Applied Sciences, vol. 02(2), 2014, pp. 81-84.
[55] S. Jeyamala and B. S. Aswathy, "Performance enhancement of Han-Carlson adder", International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE), vol. 5(2), 2016, pp. 226-230.
[56] K. Kaarthik and C. Vivek, "Hybrid Han-Carlson adder architecture for reducing power and delay", Middle-East Journal of Scientific Research 24 (Special Issue on Innovations in Information, Embedded and Communication Systems), vol. 3(4), 2016, pp. 308-313.
[57] S. K. Yezerla and B. R. Naik, "Design and estimation of delay, power and area for parallel prefix adders", International Journal of Scientific Engineering and Technology Research, vol. 4(25), 2015, pp. 4721-4726.
[58] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations", IEEE Transactions on Computers, vol. 100(8), 1973, pp. 786-793.
[59] H. Eriksson and P. L. Edefors, "Dynamic pass-transistor dot operators for efficient parallel-prefix adders", Proceedings of the International Symposium on Circuits and Systems, 2004, pp. 461-464.
[60] Z. Moudallal, I. Issa, M. Mansour, A. Chehab and A. Kayssi, "A low-power methodology for configurable wide kogge-stone adders", International Conference on Energy Aware Computing (ICEAC), 2011, pp. 1-5.
[61] M. Talsania and E. John, "A comparative analysis of parallel prefix adders", 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, 2009, pp. 281-286.
[62] H. Dao, V. Oklobdzija, "Performance comparison of VLSI adders using logical effort", 12th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS, Springer Berlin Heidelberg, 2002, pp. 25-34.
[63] B. Annapurna and V. Laxmi, "Design of 128-bit Kogge-Stone low power parallel prefix VLSI adder for high speed arithmetic circuits", International Journal of Engineering and Advanced Technology (IJEAT), vol. 2(6), 2013, pp. 415-418.
[64] M. Sunil, R. D. Ankith, G. D. Manjunatha and B. S. Premananda, "Design and implementation of faster parallel prefix Kogge Stone adder", International Journal of Electrical and Electronics Engineering \& Telecommunications, vol. 3(1), 2014, pp. 116-121.
[65] G. Swaroop, P. Ndai and K. Roy, "A novel low overhead fault tolerant Kogge-Stone adder using adaptive clocking", In Proceedings of the Conference on Design, Automation and Test, 2008, pp. 366-371.
[66] H. Abdulaziz, "Area and performance optimized CMOS multipliers", PhD Dissertation, Stanford University, 1997, pp. 1-158.
[67] R. Abhilash, S. Dubey and M. C. Chinnaaiah, "High performance and area efficient signed baugh-wooley multiplier with wallace tree using compressors", IEEE International Conference on Electrical, Electronics, Signals, Communication and Optimization (EESCO), 2015, pp. 1-4.
[68] C. Foster and F. Stockton, "Counting responders in an associative memory", IEEE Transactions on Computers, vol. C-20, 1971, pp. 1580-1583.
[69] S. Asif and Y. Kong, "Low-Area Wallace tree Multiplier", VLSI Design, vol. 2014, 2014, pp. 1-6.
[70] P. Ramanathan, P. T. Vanathi and S. Agarwal, "High speed multiplier design using decomposition logic", Serbian Journal of Electrical Engineering, vol. 6(1), 2009, pp. 33-42.
[71] K. Uming, P. Balsara and W. Lee, "Low-power design techniques for high-performance CMOS adders", IEEE Transactions on Very Large Scale Integration Systems, vol. 3(2), 1995, pp. 321-323.
[72] R. Zimmermann and W. Fichtner, "Low-power logic styles: CMOS versus pass-transistor logic", IEEE Journal of Solid-State Circuits, vol. 32(7), 1997, pp. 1079-1090.
[73] D. Markovic, B. Nikolic and V. G. Oklobdzija, "A general method in synthesis of pass-transistor circuits", Microelectronics Journal, vol. 31(11), 2000, pp. 991-998.
[74] C. F. Law, S. S. Rofail and K. S. Yeo, "Low-power circuit implementation for partial-product addition using pass-transistor logic", IEE Proceedings-Circuits Devices Systems, vol. 146(3), 1999, pp. 124-129.
[75] K. Prasad and K. Parhi, "Low-power 4-2 and 5-2 compressors", Conference on Signals, Systems and Computers, 2001, pp. 129-133.
[76] C. H. Chang, G. Jiangmin and M. Zhang, "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits", IEEE Transactions on Circuits and Systems, vol. 51(10), 2004, pp. 1985-1997.
[77] R. Nirlakalla, T. S. Rao and T. J. Prasad, "Performance evaluation of high speed compressors for high speed multipliers", Serbian Journal of Electrical Engineering, vol. 8(3), 2011, pp. 293-306.
[78] S. F. Hsiao, M. R. Jiang and J. S. Yeh, "Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers", Electronics Letters, vol. 34(4), 1998, pp. 341-343.
[79] S. Veeramachaneni, K. M. Krishna, A. Lingamneni, S. R. Puppala and M. B. Srinivas, "Novel architectures for high-speed and low-power 3-2, 4-2 and 5-2 compressors", 20th International Conference on VLSI Design, 2007, pp. 324-329.
[80] J. Tonfat and R. Reis, "Low power 3-2 and 4-2 adder compressors implemented using ASTRAN", IEEE Third Latin American Symposium on Circuits and Systems (LASCAS), 2012, pp. 1-4.
[81] A. Pishvaie, G. Jaberipur and J. Ali, "Redesigned CMOS (4; 2) compressor for fast binary multipliers", Canadian Journal of Electrical and Computer Engineering, vol. 36(3), 2013, pp. 111-115.
[82] R. Menon and D. Radhakrishnan, "High performance 5:2 compressor architectures", IEEE Proceedings of Circuits Devices and Systems, vol. 153(5), 2006, pp. 447-452.
[83] O. Kwon, K. Nowka and E. E. Swartzlander, "A 16-bit by 16-bit MAC design using fast 5:3 compressor cells", Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 31(2), 2002, pp. 77-89.
[84] N. Farid, "A survey of power estimation techniques in VLSI circuits", Proceedings of IEEE Transaction on VLSI, vol. 51, 1994, pp. 446-455.
[85] B. H. Calhoun and A. Chandrakasan, "Analyzing static noise margin for sub-threshold SRAM in 65 nm CMOS", European Conference of Solid-State Circuits, 2005, pp. 363-366.
[86] H. Mizuno and T. Nagano, "Driving source-line cell architecture for sub-1-V high speed low-power applications", IEICE Transactions on Electronics, vol. 79(9), 1996, pp. 963-968.
[87] J. Singh, D. K. Pradhan, S. Hollis and S. P. Mohanty, "A single ended 6T SRAM cell design for ultra-low-voltage applications", IEICE Electronics Express, vol. 5(18), 2008, pp.750-755.
[88] S. Mukhopadhyay, S. Ghosh, K. Keejong and K. Roy, "Low-power and process variation tolerant memories in sub-90nm technologies", IEEE International Conference on SOC, 2006, pp. 155-159.
[89] M. F. Chang, J. J. Wu, K. T. Chen, Y. C. Chen, Y. H. Chen, R. Lee, H. J. Liao and H. Yamauchi, "A differential data-aware power-supplied (DAP) 8T SRAM cell with expanded write/read stabilities for lower vdd min applications", IEEE Journal of Solid-State Circuits, vol. 45(6), 2010, pp. 1234-1245.
[90] T. H. Kim, L. Jason and H. K. Chris, "An 8T sub-threshold SRAM cell utilizing reverse short channel effect for write margin and read performance improvement", IEEE Custom Integrated Circuits Conference, 2007, pp. 241-244.
[91] G. Razavipour, A. K. Ali and M. Pedram, "Design and analysis of two low-power SRAM cell structures", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17 (10), 2009, pp. 1551-1555.
[92] N. Verma and A. P. Chandrakasan, "A 256 kb 65 nm 8T sub-threshold SRAM employing sense-amplifier redundancy", IEEE Journal of Solid-State Circuits, vol. 43(1), 2008, pp. 141-149.
[93] L. Sheng, Y. B. Kim and F. Lombardi, "A low leakage 9T SRAM cell for ultra-low power operation", Proceedings of the 18th ACM Great Lakes symposium on VLSI, ACM, 2008, pp. 123-126.
[94] L. Zhiyu and V. Kursun, "Characterization of a novel nine-transistor SRAM cell", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16(4), 2008, pp. 488-492.
[95] M. Zamani, S. Hassanzadeh, K. Hajsadeghi and R. Saeidi, "A 32kb 90nm 9T-cell Subthreshold SRAM with improved read and write SNM", 8th International Conference on Design \& Technology of Integrated Systems in Nanoscale Era (DTIS), 2013, pp. 104-107.
[96] M. Majid, S. Timarchi, M. H. Moaiyeri and M. Eshgh, "An ultra-low-power 9T SRAM cell based on threshold voltage techniques", Circuits, Systems, and Signal Processing, vol. 35(5), 2015, pp. 1-19.
[97] A. R. Ahmadi Mehr, I. Madadi and A. Afzali-Kusha, "A subthreshold SRAM cell tolerant to random dopant fluctuations," IEEE International Conference of Electron Devices and Solid-State Circuits (EDSSC), 2010, pp. 1-4.
[98] H. Chen, Y. Jun, Z. Meng and W. Xiulong, "A 12T subthreshold SRAM bit-cell for medical device application," International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 2011, pp. 540-543.
[99] Y. W. Chiu, Y. H. Hu, M. H. Tu, J. K. Zhao, Y. H. Chu, S. J. Jou, and C. T. Chuang, "40 nm bit-interleaving 12T subthreshold SRAM with data-aware write-assist," IEEE Transactions on Circuits and Systems, vol. 61(9), 2014, pp. 2578-2585.
[100] K. Takahiro, H. Yamamoto, S. Hoketsu, H. Imi, H. Okamura and K. Nakamura, "Ratioless full-complementary 12-transistor static random access memory for ultralow supply voltage operation", Japanese Journal of Applied Physics, vol. 54(4S), 2015, pp. 04DD11.
[101] D. Roy, A. K. Singh, R. Anand and A. Islam, "Bit line and storage node decoupled 13T SRAM cell in 22-nm technology node", Wulfenia Journal, vol. 20(3), 2013, pp. 40-55.
[102] A. A. Mazreah, M. Reza Sahebi, M. T. Manzuri and J. Hosseini, "A novel zero-aware four-transistor SRAM cell for high density and low power cache application", IEEE International Conference on Advanced Computer Theory and Engineering, 2008, pp. 571-575.
[103] A. Teman, A. Mordakhay, J. Mezhibovsky and A. Fish, "A 40-nm sub-threshold 5T SRAM bit cell with improved read and write stability", IEEE Transactions on Circuits and Systems, vol. 59(12), 2012, pp. 873-877.
[104] N. Satyanand and B. Calhoun, "Asymmetric sizing in a 45 nm 5T SRAM to improve read stability over 6T", IEEE Conference on Custom Integrated Circuits, 2009, pp. 709-712.
[105] B. Zhai, B. David, S. Dennis and S. Hanson, "A sub-200mV 6T SRAM in $0.13 \mu \mathrm{~m}$ CMOS", IEEE International Solid-State Circuits Conference (ISSCC), 2007, pp. 332-333.
[106] A. Ramy, M. Faisal and M. Bayoumi, "Novel 7T SRAM cell for low power cache design", IEEE International Conference on SOC, 2005, pp. 171-174.
[107] A. Islam and M. Hasan, "A technique to mitigate impact of process, voltage and temperature variations on design metrics of SRAM Cell", Microelectronics Reliability, vol. 52(2), 2012, pp. 405-411.

## CHAPTER 3

ADDERS

### 3.1. INTRODUCTION

Arithmetic circuits are designed to handle data in units; each unit has a fixed number of binary digits, chosen as a compromise among available memory capacity, operating speed, required accuracy of numerical data, and other considerations [1]. Adders are the core elements of arithmetic circuits, as they are widely used in arithmetic logic units (ALUs), in floating-point units, and for address generation in cache or memory access. They are required throughout VLSI, from processors to application-specific integrated circuits (ASICs), and in the fundamental arithmetic operations of contemporary electronic systems. Adders are essential not only for addition; they are also the nucleus of the other basic arithmetic operations such as multiplication, subtraction and division. Multiplication is performed by repeated addition, subtraction is performed as an addition when negative numbers are represented in their 2's complement form, and division is done by successive subtraction. Division requires no extra correction operations if either the dividend or the divisor is negative, as this case can also be handled using adder circuits [2][3]. Even for programs that do not perform explicit arithmetic, addition must be performed to increment the program counter and to calculate memory addresses.
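As a behavioural illustration of how subtraction and multiplication reduce to addition, the following Python sketch routes both operations through a single adder primitive. The function names and the 8-bit default width are assumptions made for this example only; it models the arithmetic, not any circuit from this work.

```python
def add_bits(a, b, carry_in, width):
    """Model of a `width`-bit adder: returns (sum, carry_out)."""
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask) + carry_in
    return total & mask, total >> width


def subtract_via_adder(a, b, width=8):
    """a - b computed as a + (~b) + 1, i.e. by adding the 2's complement of b."""
    mask = (1 << width) - 1
    diff, _ = add_bits(a, ~b & mask, 1, width)
    return diff  # result is modulo 2**width


def multiply_via_adder(a, b, width=8):
    """Shift-and-add multiplication: one addition per set bit of b."""
    product = 0
    for i in range(width):
        if (b >> i) & 1:
            product, _ = add_bits(product, a << i, 0, 2 * width)
    return product
```

A negative difference appears in 2's complement form, so `subtract_via_adder(5, 9)` yields the 8-bit encoding of -4.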

In most digital systems, adders lie in the critical path that affects the overall speed of the system, so the performance of the adder may determine the response of the whole system. Adder logic is thus of obvious importance and has received attention from computer designers. In the past, the most important and widely accepted metrics for measuring the quality of adder designs were propagation delay and area, and efforts were focused on increasing the speed of computing systems. As a result, high-speed computation has become an expected norm for the average user. With the rise of portable, battery-operated devices, however, reducing power consumption has become an important design goal. Increased power consumption also leads to increased heat, which makes low power design important even for non-portable applications. Circuit designers now seek to meet the performance requirements for adder circuits with the lowest possible power consumption. Digital arithmetic circuits operating in the sub-threshold region of the transistor are an attractive option for ultra-low power CMOS design. To operate a circuit in the sub-threshold region, the supply voltage is scaled down below the device threshold voltage [4]. However, the weak drive current that results from a supply voltage below the threshold voltage limits the circuit performance.

Power-efficient adders under sub-threshold operation are also an essential part of micro-sensors, wireless sensor networks (WSNs) and MAC core units. Thus, optimizing the adder circuitry for minimum power consumption benefits the low power applications mentioned above.

Extensive research has already been done on the design and implementation of serial adders using different logic families in the sub-threshold region [5][6][7]. The ripple carry adder (RCA) has been found to be the most power-efficient of the serial adder architectures operated in the sub-threshold region [8]. In [9], it is shown that for small operand sizes, serial addition can match the speed of parallel addition when operating in sub-threshold, while still dissipating less power. For large operand sizes, however, the serial adder gives low performance due to the overall carry computation delay. For example, for 32-bit addition a parallel adder is 4.7 times faster than a serial one, and for 64-bit addition the parallel adder would be around 8.1 times faster, based on a similar calculation. Thus, parallel adders are attractive for delivering high computational speed [10]. In [11], it is also observed that the major problem in binary addition is the propagation delay in the carry chain: as the width of the input operands increases, the length of the carry chain increases. To address the carry propagation problem, most modern adder architectures are represented as a parallel prefix adder structure consisting of pre-processing, carry look-ahead and post-processing sections. The carry look-ahead section computes the carry bits of the parallel prefix adder in parallel. These sections determine the overall circuit performance, power consumption and power-delay product.
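The three sections of a parallel prefix adder can be sketched behaviourally in Python. The example below uses the textbook radix-2 Kogge-Stone prefix network as the carry look-ahead section; the function name, the 8-bit default width and the way the carry-in is folded into bit 0 are assumptions of this sketch, which is a software model and not the transistor-level design studied in this chapter.

```python
def kogge_stone_add(a, b, width=8, carry_in=0):
    """Behavioural model of a radix-2 Kogge-Stone adder: returns (sum, carry_out)."""
    # Pre-processing: per-bit generate (g) and propagate (p) signals.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]

    # Fold the carry-in into the generate signal of bit 0.
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & carry_in)

    # Carry look-ahead (prefix) network: each level doubles the span.
    dist = 1
    while dist < width:
        G_next, P_next = G[:], P[:]
        for i in range(dist, width):
            G_next[i] = G[i] | (P[i] & G[i - dist])
            P_next[i] = P[i] & P[i - dist]
        G, P = G_next, P_next
        dist *= 2

    # Post-processing: sum bit i is p[i] XOR the carry into bit i.
    carries = [carry_in] + G[:-1]
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, G[-1]
```

After the prefix network, `G[i]` holds the carry out of bit `i`, so all carries are available after about log2(width) levels instead of rippling through `width` stages.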

Parallel prefix adders have been established as the most efficient circuits for binary addition in digital systems. The power and delay of a parallel prefix adder are directly proportional to the number of levels in the carry propagation stage. In [11], different parallel prefix adders such as the Kogge-Stone adder, Brent-Kung adder, Han-Carlson adder, Sklansky adder, Ladner-Fischer adder and Knowles adder have been designed and analyzed in the super-threshold region. The comparative results show that the Kogge-Stone adder and the Han-Carlson adder give the best results in terms of overall power-delay product [12].
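To make the dependence on the number of levels concrete, the short Python sketch below tabulates the prefix levels and prefix-cell counts of the two networks for a power-of-two width. The expressions are the standard textbook figures for radix-2 Kogge-Stone and Han-Carlson networks and are not taken from the results of this work; the function name is hypothetical.

```python
def prefix_network_stats(width):
    """Levels and prefix cells of radix-2 prefix networks (width = power of two)."""
    if width < 2 or width & (width - 1):
        raise ValueError("width must be a power of two >= 2")
    k = width.bit_length() - 1  # log2(width)
    return {
        # Kogge-Stone: log2(n) levels, n*log2(n) - n + 1 prefix cells.
        "Kogge-Stone": {"levels": k, "cells": width * k - width + 1},
        # Han-Carlson: one extra level, (n/2)*log2(n) prefix cells.
        "Han-Carlson": {"levels": k + 1, "cells": (width // 2) * k},
    }


for n in (8, 16, 32, 64):
    print(n, prefix_network_stats(n))
```

The Han-Carlson network trades one additional level for a reduced cell count, which is the trade-off usually cited between the two structures.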

To design energy-efficient addition-based arithmetic circuits using cell-based VLSI design, designers must rely on power-efficient adder architectures that are optimized at the logic level.

This reduces the number of power-efficient adder architectures suitable for sub-threshold operation to a few classic ones: the carry look-ahead adder, the Kogge-Stone adder and the Han-Carlson adder [13][14][15].

For power-efficient adder design, radix-2 is a suitable choice for the prefix network because the circuit is less complex, whereas radix-4 and higher radices increase the stage effort, power and delay of each stage but reduce the number of required stages [16]. Therefore, the adders mentioned above are implemented in radix-2, as this is the most power-efficient option, is suitable for sub-threshold operation and is conventionally used in practical applications [17].
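The trade-off between radix and stage count can be illustrated with a minimal sketch. It assumes an idealized prefix tree in which every level multiplies the covered group size by the radix; the function name `prefix_levels` is hypothetical and is not taken from the cited works.

```python
def prefix_levels(n_bits: int, radix: int) -> int:
    """Smallest number of levels L such that radix**L >= n_bits."""
    levels = 0
    span = 1  # number of bit positions covered by one group so far
    while span < n_bits:
        span *= radix
        levels += 1
    return levels


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        # Higher radix needs fewer levels, but each level is more complex.
        print(n, prefix_levels(n, 2), prefix_levels(n, 4))
```

The sketch only counts levels; it does not model the higher stage effort, power and delay per stage that the text attributes to radix-4 and above.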

The main aim of this work is the design and implementation of a radix-2 carry look-ahead adder, which is the most important block of parallel prefix adders, and of two radix-2 parallel prefix adders (the Kogge-Stone adder and the Han-Carlson adder) at two technology nodes ( $45 \mathrm{~nm} / 180 \mathrm{~nm}$ ) operated in the sub-threshold region. The performance metrics considered for the analysis of the adders are power, delay and power-delay product.

The rest of the chapter is organized as follows. Section 3.2 presents the internal circuitry of the basic building blocks of the adder architectures and their functional overview. Section 3.3 presents a comparative analysis of the logic gates used to implement the adder architectures in different logic families, with post-layout simulation results for sub-threshold operation. Section 3.4 presents the effect of the reverse body bias scheme on the logic gates, with post-layout simulation results for sub-threshold operation. Section 3.5 describes the design and analysis of the three architectures and their post-layout simulation results for sub-threshold operation. Section 3.6 presents the final results and discussion of the three architectures, and Section 3.7 presents a summary of the chapter and the concluding remarks.

### 3.2. ADDER ARCHITECTURES

Power-efficient adders are used in applications ranging from very low-speed arithmetic calculations through medium-speed applications to high-speed real-time multimedia computations. Depending on the required energy efficiency, trade-offs are made at the architecture and circuit levels to obtain small silicon area and high performance. Designing power-efficient adders that operate in the sub-threshold region at moderate speed is a significant goal of this work. The power consumption of an adder varies with the choice of architecture, logic family and operand size, and different combinations of these choices lead to many different adder designs. This chapter explores three adder architectures, the carry look-ahead adder, the Kogge-Stone adder and the Han-Carlson adder, with different logic design styles and operand sizes in the sub-threshold region. Based on the simulation results for different operand sizes and the preferred figures of merit (power, delay and power-delay product), the best architecture for the sub-threshold region is proposed. This section briefly describes the internal circuitry of the basic building blocks of the adders.

### 3.2.1. CLA

The most commonly used scheme for accelerating carry propagation is the carry look-ahead scheme. The main idea behind carry look-ahead addition is to generate all incoming carries in parallel and so avoid waiting until the correct carry propagates from the adder stage where it was generated. A CLA is formed from three main logical blocks: generate/propagate blocks, which pre-calculate the signals used to speed up carry computation; carry generator blocks, the most computation-intensive part, which work in parallel to reduce the overall computation time; and sum blocks, which generate the sum. The CLA avoids the linear growth of carry delay by generating carries in parallel from the generate and propagate signals. Figure 3.1 shows the block diagram of an $n$-bit CLA.


Figure 3.1: Block diagram of n-bit CLA

The logic expressions of the generate/propagate blocks, the carry generator and the sum generator blocks are given by (3.1) - (3.4) [18].

$$
\begin{equation*}
\mathrm{G}_{\mathrm{i}}=\mathrm{A}_{\mathrm{i}} \bullet \mathrm{B}_{\mathrm{i}} \tag{3.1}
\end{equation*}
$$

$$
\begin{equation*}
\mathrm{P}_{\mathrm{i}}=\mathrm{A}_{\mathrm{i}} \oplus \mathrm{B}_{\mathrm{i}} \tag{3.2}
\end{equation*}
$$

Notice that both the generate and propagate signals depend only on the input bits and thus will be valid after one gate delay. When the adder inputs are loaded in parallel, all the $G_{i}$ and $P_{i}$ signals will be generated at the same time.

The carry-out of the $\mathrm{i}^{\text {th }}$ stage, which is also the input carry of the $(i+1)^{\text {th }}$ stage, can be calculated from the outputs of the $\mathrm{i}^{\text {th }}$ stage using equation (3.3). The logic expressions for the carry-out and sum signals are given by:

$$
\begin{equation*}
\mathrm{C}_{\mathrm{i}+1}=\mathrm{G}_{\mathrm{i}}+\mathrm{P}_{\mathrm{i}} \bullet \mathrm{C}_{\mathrm{i}} \tag{3.3}
\end{equation*}
$$

$$
\begin{equation*}
\mathrm{S}_{\mathrm{i}}=\mathrm{P}_{\mathrm{i}} \oplus \mathrm{C}_{\mathrm{i}} \tag{3.4}
\end{equation*}
$$

where $A_{i}$ and $B_{i}$ are the augend and addend inputs, $C_{i}$ is the carry input, $S_{i}$ and $C_{i+1}$ are the sum and carry-out of the $i^{\text {th }}$ bit position, and the auxiliary functions $\mathrm{G}_{\mathrm{i}}$ and $\mathrm{P}_{\mathrm{i}}$ are the generate and propagate signals respectively.

Equations (3.1) and (3.2) describe the $\mathrm{i}^{\text {th }}$ partial full adder (PFA), whose outputs are the generate $(\mathrm{G})$ and propagate $(\mathrm{P})$ signals needed for the carry-out equations. The values of G and P are then used to find the group generate and group propagate for a group of two successive PFAs at the next level of the tree. For example, to represent the generation of a carry in positions $0,1,2$ and 3 and its propagation to $C_{4}$, we need to consider the generation of a carry in each of the positions, as represented by $\mathrm{G}_{0}$ through $\mathrm{G}_{3}$, and the propagation of each of these four generated carries to position 4. This gives the group generate $\left(G_{0-3}\right)$ for a 4-bit CLA.

$$
\begin{equation*}
\mathrm{G}_{0-3}=\mathrm{G}_{3}+\mathrm{G}_{2} \bullet \mathrm{P}_{3}+\mathrm{G}_{1} \bullet \mathrm{P}_{3} \bullet \mathrm{P}_{2}+\mathrm{G}_{0} \bullet \mathrm{P}_{3} \bullet \mathrm{P}_{2} \bullet \mathrm{P}_{1} \tag{3.5}
\end{equation*}
$$

Similarly, to propagate a carry from $\mathrm{C}_{0}$ to $\mathrm{C}_{4}$, we need to have all four of the propagate functions equal to 1 , giving the group propagate function ( $\mathrm{P}_{0-3}$ )

$$
\begin{equation*}
\mathrm{P}_{0-3}=\mathrm{P}_{0} \bullet \mathrm{P}_{1} \bullet \mathrm{P}_{2} \bullet \mathrm{P}_{3} \tag{3.6}
\end{equation*}
$$
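
As a short consistency check (a derivation added here for illustration, not taken from the cited reference), the carry-out of a 4-bit block obtained by unrolling the recurrence (3.3) four times can be regrouped so that it depends only on the two group signals and the block carry-in $\mathrm{C}_{0}$:

$$
\begin{aligned}
\mathrm{C}_{4} &= \mathrm{G}_{3}+\mathrm{P}_{3} \bullet \mathrm{G}_{2}+\mathrm{P}_{3} \bullet \mathrm{P}_{2} \bullet \mathrm{G}_{1}+\mathrm{P}_{3} \bullet \mathrm{P}_{2} \bullet \mathrm{P}_{1} \bullet \mathrm{G}_{0}+\mathrm{P}_{3} \bullet \mathrm{P}_{2} \bullet \mathrm{P}_{1} \bullet \mathrm{P}_{0} \bullet \mathrm{C}_{0} \\
&= \left(\mathrm{G}_{3}+\mathrm{G}_{2} \bullet \mathrm{P}_{3}+\mathrm{G}_{1} \bullet \mathrm{P}_{3} \bullet \mathrm{P}_{2}+\mathrm{G}_{0} \bullet \mathrm{P}_{3} \bullet \mathrm{P}_{2} \bullet \mathrm{P}_{1}\right)+\left(\mathrm{P}_{0} \bullet \mathrm{P}_{1} \bullet \mathrm{P}_{2} \bullet \mathrm{P}_{3}\right) \bullet \mathrm{C}_{0} \\
&= \mathrm{G}_{0-3}+\mathrm{P}_{0-3} \bullet \mathrm{C}_{0}
\end{aligned}
$$

This has the same form as (3.3), which is why a 4-bit block can be treated as a single generate/propagate cell at the next level of hierarchy.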

Equations (3.3) and (3.4) show that as the number of bits in the CLA increases, the complexity increases because the number of gates in the expression for $\mathrm{C}_{\mathrm{i}+1}$ grows, which increases power and area as well. Applying these equations to a 4-bit CLA:

$$
\begin{align*}
& \mathrm{C}_{1}=\mathrm{G}_{0}+\mathrm{P}_{0} \bullet \mathrm{C}_{0}  \tag{3.7}\\
& \mathrm{C}_{2}=\mathrm{G}_{1}+\mathrm{P}_{1} \bullet \mathrm{C}_{1}=\mathrm{G}_{1}+\mathrm{P}_{1} \bullet\left(\mathrm{G}_{0}+\mathrm{P}_{0} \bullet \mathrm{C}_{0}\right)=\mathrm{G}_{1}+\mathrm{P}_{1} \bullet \mathrm{G}_{0}+\mathrm{P}_{1} \bullet \mathrm{P}_{0} \bullet \mathrm{C}_{0} \tag{3.8}
\end{align*}
$$

$$
\begin{align*}
& \mathrm{C}_{3}=\mathrm{G}_{2}+\mathrm{P}_{2} \bullet \mathrm{C}_{2}=\mathrm{G}_{2}+\mathrm{P}_{2} \bullet \mathrm{G}_{1}+\mathrm{P}_{2} \bullet \mathrm{P}_{1} \bullet \mathrm{G}_{0}+\mathrm{P}_{2} \bullet \mathrm{P}_{1} \bullet \mathrm{P}_{0} \bullet \mathrm{C}_{0}  \tag{3.9}\\
& \mathrm{C}_{4}=\mathrm{G}_{3}+\mathrm{P}_{3} \bullet \mathrm{C}_{3}=\mathrm{G}_{3}+\mathrm{P}_{3} \bullet \mathrm{G}_{2}+\mathrm{P}_{3} \bullet \mathrm{P}_{2} \bullet \mathrm{G}_{1}+\mathrm{P}_{3} \bullet \mathrm{P}_{2} \bullet \mathrm{P}_{1} \bullet \mathrm{G}_{0}+\mathrm{P}_{3} \bullet \mathrm{P}_{2} \bullet \mathrm{P}_{1} \bullet \mathrm{P}_{0} \bullet \mathrm{C}_{0} \tag{3.10}
\end{align*}
$$

These carry values can be used to compute the carry-ins at a higher level of hierarchy. The Generate and Propagate signal can be a single bit or many bits in look ahead block. The meaning of the equations remains the same.
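
The 4-bit carry equations can be checked with a small behavioural sketch. This is a bit-level software model written for illustration, not the gate-level design simulated in this work; the function name `cla4` and the list-based bit representation are assumptions.

```python
def cla4(a: int, b: int, c0: int = 0):
    """Add two 4-bit numbers using the flattened look-ahead carries."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [A[i] & B[i] for i in range(4)]  # generate, (3.1)
    P = [A[i] ^ B[i] for i in range(4)]  # propagate, (3.2)

    # Carries written only in terms of G, P and c0, as in (3.7)-(3.10).
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))

    C = [c0, c1, c2, c3]
    S = [P[i] ^ C[i] for i in range(4)]  # sum bits, (3.4)
    total = sum(S[i] << i for i in range(4))
    return total, c4


if __name__ == "__main__":
    print(cla4(9, 7))  # (0, 1): 9 + 7 = 16, i.e. sum bits 0000 with carry-out 1
```

Because every carry is an expression of the inputs alone, none of them waits for a neighbouring carry, which is the property the look-ahead scheme relies on.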

### 3.2.2. KSA

The KSA is one of the fundamental parallel adders and is widely used in arithmetic circuits built with cell-based VLSI design. The circuit employs a multilevel look-ahead scheme with improved delay and generates the carry signals in $\mathrm{O}(\log \mathrm{n})$ time, where n is the number of bits per input. The addition is performed in three stages, and the basic blocks of this adder are the bitwise propagate/generate logic block, the group propagate/generate logic block and the sum/carry logic block, as illustrated in Figure 3.2.


Figure 3.2: Block diagram of $n$-bit KSA

The first stage, the bitwise pre-processing stage, produces the generate and propagate signals of the two input operands $\mathrm{A}_{\mathrm{i}}$ and $\mathrm{B}_{\mathrm{i}}$:

$$
\begin{equation*}
\mathrm{G}_{\mathrm{i}}=\mathrm{A}_{\mathrm{i}} \bullet \mathrm{B}_{\mathrm{i}} \tag{3.11}
\end{equation*}
$$

$$
\begin{equation*}
\mathrm{P}_{\mathrm{i}}=\mathrm{A}_{\mathrm{i}} \oplus \mathrm{B}_{\mathrm{i}} \tag{3.12}
\end{equation*}
$$

where $\bullet$, $+$ and $\oplus$ represent the AND, OR and XOR operations. Notice that both the propagate and generate signals depend only on the input bits and thus will be valid after one gate delay.

In the second stage, the group generate and group propagate signals are produced, as defined in (3.13) and (3.14) respectively [19].

$$
\begin{align*}
& \mathrm{G}_{\mathrm{i}+1}=\mathrm{G}_{\mathrm{i}}+\mathrm{P}_{\mathrm{i}} \cdot \mathrm{G}_{\mathrm{i}-1}  \tag{3.13}\\
& \mathrm{P}_{\mathrm{i}+1}=\mathrm{P}_{\mathrm{i}} \cdot \mathrm{P}_{\mathrm{i}-1} \tag{3.14}
\end{align*}
$$

This stage is the major functional unit, with a dense arrangement of calculations and logic functions that produce the carry signals. The complexity and rich functionality of this stage have a major impact on power consumption.

The third stage, called post-processing, produces the sum bits by XORing the carry and propagate signals from the previous stages. The first and last stages are intrinsically fast because they involve only simple operations on signals local to each bit position.
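
The three stages can be sketched in software as follows. This is a behavioural model under stated assumptions: the carry-in is taken as 0, the prefix operator is applied at distances 1, 2, 4, ... (one common way of unrolling the group recurrences), and the function name `kogge_stone_add` is hypothetical.

```python
def kogge_stone_add(a: int, b: int, n: int = 8):
    """Behavioural model of an n-bit Kogge-Stone adder (carry-in assumed 0)."""
    # Stage 1: bitwise generate and propagate.
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Stage 2: prefix tree; after each level a group spans twice as many bits.
    G, P = g[:], p[:]
    levels = 0
    d = 1
    while d < n:
        new_G, new_P = G[:], P[:]
        for i in range(d, n):
            new_G[i] = G[i] | (P[i] & G[i - d])  # group generate
            new_P[i] = P[i] & P[i - d]           # group propagate
        G, P = new_G, new_P
        d *= 2
        levels += 1

    # Stage 3: G[i] is now the carry out of bit i, so it feeds bit i + 1.
    carries = [0] + G[: n - 1]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[n - 1], levels


if __name__ == "__main__":
    print(kogge_stone_add(200, 100, 8))  # (44, 1, 3): 300 = 256 + 44, 3 levels
```

The level count returned by the sketch grows as the base-2 logarithm of the operand width, which matches the logarithmic carry delay stated for the KSA.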

### 3.2.3. HCA

The slow speed and high power consumption of conventional CLAs have led to the implementation of parallel prefix-based HCAs, particularly where a large number of adders is required. Carries are first computed for small groups of intermediate prefixes and then for progressively larger groups, until all the carry bits have been computed. In prefix-based adders, the carry computation scheme significantly increases the speed of the adder at the expense of increased complexity; the delay grows on the order of $\log _{b}(n)$, where $b$ is the radix and $n$ is the number of bits per input. The basic block diagram of the n-bit logarithmic parallel-prefix HC structure is shown in Figure 3.3.


Figure 3.3: Block diagram of n-bit HCA

The addition is performed in three stages. In the first stage, also known as the pre-computation logic stage, the generate signals, propagate signals and temporary sum signals of the two input operands $A_{i}$ and $B_{i}$ are computed bitwise. For any given bit position, the $G_{i}$ and $P_{i}$ signal expressions are defined as follows:

$$
\begin{equation*}
\mathrm{G}_{\mathrm{i}}=\mathrm{A}_{\mathrm{i}} \cdot \mathrm{~B}_{\mathrm{i}} \quad \mathrm{P}_{\mathrm{i}}=\mathrm{A}_{\mathrm{i}} \oplus \mathrm{~B}_{\mathrm{i}} \tag{3.15}
\end{equation*}
$$

where $\cdot$, $\oplus$ and $+$ represent the AND, XOR and OR operations and $i$ is an integer with $0 \leq i<n$. The propagate, generate and temporary sum signals all depend only on the input bits and thus will be valid after one gate delay.

The group generate and group propagate signals are produced in the prefix tree block and are given in equation (3.16) and equation (3.17) respectively [20].

$$
\begin{align*}
& \mathrm{G}_{\mathrm{i}+1}=\mathrm{G}_{\mathrm{i}}+\mathrm{P}_{\mathrm{i}} \cdot \mathrm{G}_{\mathrm{i}-1}  \tag{3.16}\\
& \mathrm{P}_{\mathrm{i}+1}=\mathrm{P}_{\mathrm{i}} \cdot \mathrm{P}_{\mathrm{i}-1} \tag{3.17}
\end{align*}
$$

This stage is the major functional unit, with a dense arrangement of calculations and logic functions that produce the carry signals. The complexity and rich functionality of this stage have a major impact on power consumption. In the final stage, known as the post-computation logic block, the final sum and carry-out are computed according to the following equations:

$$
\mathrm{S}_{\mathrm{i}}=\mathrm{P}_{\mathrm{i}} \oplus \mathrm{G}_{\mathrm{i}:-1}
$$

$$
\mathrm{C}_{\text {out }}=\mathrm{G}_{\mathrm{i}-1}+\mathrm{P}_{\mathrm{i}-1} \cdot \mathrm{G}_{\mathrm{i}-2:-1}
$$

where $-1$ is the position of the carry-input $\left(G_{i:-1}=C_{i}\right)$.

All logarithmic prefix structures can be implemented with the equations above; however, to obtain the same carries, equations (3.16) and (3.17) can be interpreted in various ways, which leads to a variety of parallel prefix trees. The first and last stages involve only simple operations on signals local to each bit position, so these stages are intrinsically fast.
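
One such interpretation, the Han-Carlson tree, can be sketched as a behavioural model. The sketch assumes a carry-in of 0 and an operand width that is a power of two; the level structure (one merge of odd positions with their even neighbours, Kogge-Stone style levels on the odd positions only, one final level for the even positions) follows the usual textbook description, and the function name `han_carlson_add` is hypothetical.

```python
def han_carlson_add(a: int, b: int, n: int = 8):
    """Behavioural model of an n-bit Han-Carlson adder (carry-in assumed 0)."""
    assert n >= 4 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]

    def combine(hi, lo):
        # Prefix operator: merge a higher group (hi) with the lower one (lo).
        return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]

    # First level: each odd position merges with its even neighbour.
    for i in range(1, n, 2):
        G[i], P[i] = combine((G[i], P[i]), (G[i - 1], P[i - 1]))
    levels = 1

    # Middle levels: Kogge-Stone style merging on odd positions only.
    d = 2
    while d < n:
        old_G, old_P = G[:], P[:]
        for i in range(d + 1, n, 2):
            G[i], P[i] = combine((old_G[i], old_P[i]), (old_G[i - d], old_P[i - d]))
        levels += 1
        d *= 2

    # Last level: even positions take the prefix of the odd position below.
    for i in range(2, n, 2):
        G[i], P[i] = combine((G[i], P[i]), (G[i - 1], P[i - 1]))
    levels += 1

    carries = [0] + G[: n - 1]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[n - 1], levels


if __name__ == "__main__":
    print(han_carlson_add(200, 100, 8))  # (44, 1, 4): one level more than the KSA model
```

In this model the Han-Carlson tree uses one more level than a Kogge-Stone tree of the same width but applies the prefix operator at only half of the bit positions in the middle levels.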

### 3.3. LOGIC FAMILIES FOR SUB-THRESHOLD CIRCUIT DESIGN

An analysis of different logic styles operating in the sub-threshold region is necessary because the performance parameters (power, delay and power-delay product) of the adder architectures depend on the logic style. The choice of logic families provides a good comparison between ultra-low-power and high-speed adder architectures.

Typically, to design power efficient arithmetic circuits, a static implementation is preferred due to its low power consumption. Dynamic circuits are not preferred due to their high activity factors in sub-threshold region [21].

This study aims to provide some insight into the performance of different logic styles in the sub-threshold region. According to the reported literature, the Static-CMOS, pass-transistor, double pass-transistor, transmission gate, complementary pass-transistor and swing restored pass-transistor logic families are the most suitable for low-power operation in CMOS technology in the sub-threshold region [22].

Of these, the complementary pass-transistor logic and swing restored pass-transistor logic families give differential outputs (OUT/OUT_bar), whereas the others give a single-ended output (OUT).

For the implementation of the Boolean expressions of the CLA, KSA and HCA architectures discussed in section 3.2, AND, OR and XOR are the three most widely used logic gates. Therefore, the design and comparative analysis of these logic gates (AND, OR, XOR) in the sub-threshold region, using the six logic styles mentioned above, along with the trade-offs, is necessary.

The basics of the logic families and their effects on the design of basic logic gates, with some trade-offs, are given below.

### 3.3.1. Static-CMOS Logic [23][24]

The schematics and layouts of logic gates using the Static-CMOS logic family are shown in Figure 3.4. Conventional fully Static-CMOS logic is a common CMOS design style because it offers low power consumption, a large noise margin and full functionality at scaled-down technologies, and thus more reliable operation at low voltages.

Conventional Static-CMOS logic is built from an NMOS pull-down network and a dual PMOS pull-up network. A Static-CMOS logic function can be realized very efficiently by these pull-down and pull-up networks connected between the gate output and the power lines.

Figure 3.4: Schematics and layouts of logic gates using Static-CMOS logic family

This logic style is reported to give negligible DC power consumption, as there is no direct path between power supply and ground for any of the logic input combinations. The main drawback of this logic style is the substantial number of large PMOS transistors, which results in high gate input capacitance and therefore increased delay, area and dynamic power consumption, together with relatively weak output driving capability due to the series transistors in the output stage.

### 3.3.2. Pass Transistor (PT) Logic Family [25][26]

The schematics and layouts of logic gates using the PT logic family are shown in Figure 3.5. Conventionally, PT logic is built from an NMOS network topology. It requires a minimum number of MOS transistors to implement a logic function. Primary input signals drive gate and source-drain terminals only, which facilitates the usage and characterization of logic cells.

This PT topology does not have a direct path from the supply voltage to ground, so the standby power consumption is approximately zero.

Figure 3.5: Schematics and layouts of logic gates using PT logic family

PT logic also gains a speed and power advantage from its high logic functionality, reduced transistor count and lower node capacitance compared to conventional Static-CMOS logic. However, PT logic has the inherent problem of the threshold voltage drop across a transistor, which degrades the high level of the PT output nodes and requires the addition of level-restoring transistors. It also has a poor noise margin.

### 3.3.3. Complementary Pass-Transistor Logic (CPL) [27]

The schematics and layouts of logic gates using the CPL logic family are shown in Figure 3.6. CPL belongs to the conventional PT logic family; it overcomes the problem of the undesirable threshold voltage drop of the NMOS transistor, benefits from small input capacitances (NMOS network only) and improves the output driving capability. The CPL circuit consists of NMOS PT logic driven by complementary inputs and producing complementary outputs, which are driven by two CMOS inverters used as buffers.

Figure 3.6: Schematics and layouts of logic gates using CPL logic family

However, at lower supply voltages the CPL circuit suffers from the inherent threshold drop across the NMOS transistor, which reduces the current drive and slows operation, making it unsuitable for the sub-threshold design technique. Compared to conventional Static-CMOS logic, this logic has larger short-circuit currents and higher wiring overhead due to the dual-rail signals, which increases the overall power consumption.

### 3.3.4. Swing Restored Pass-Transistor Logic (SRPL)[28]

The schematics and layouts of logic gates using the SRPL logic family are shown in Figure 3.7. SRPL is a modified version of CPL that is used to implement high-speed and low-power logic circuits for VLSI applications. The generic construction of an SRPL logic gate consists of two main parts. The first is the latch-type swing restoring circuit, which combines two cross-coupled CMOS inverters to drive the gate output; the second is the PT logic network, built from n-channel devices, which produces complementary outputs. The complementary outputs of the PT logic network are restored to full swing by the swing restoration circuit.

Figure 3.7: Schematics and layouts of logic gates using SRPL logic family

Area and power reduction are also facilitated by its structure, owing to the low transistor count and the low input capacitance. SRPL overcomes the noise margin problems present in CPL. However, the SRPL performance is not as good as claimed in [24]; the reason is unclear. The overall results in terms of power and performance show that it is well suited to the sub-threshold design technique.

### 3.3.5. Double Pass-Transistor Logic (DPL) [29]

The schematics and layouts of logic gates using the DPL logic family are shown in Figure 3.8. DPL overcomes the noise margin and speed degradation problems that occur in CPL and SRPL at reduced supply voltage. It works efficiently in the sub-threshold region and meets the requirements of power-efficient arithmetic circuits. A DPL circuit consists of a symmetrical arrangement of PMOS transistor branches in parallel with the NMOS tree, which avoids the series sizing issues of fully static circuits.

Figure 3.8: Schematics and layouts of logic gates using DPL logic family

DPL attains full-swing operation. For each input combination, the PMOS and NMOS transistors always provide a dual current path, resulting in the smallest equivalent resistance compared to the other logic styles.

DPL gates are symmetrical, so the load is distributed equally among the inputs. The basic gates AND/NAND, OR/NOR and XOR/XNOR can be constructed by simply exchanging the input nodes. The main drawback of DPL is its redundancy, i.e. it requires more transistors than are actually needed to realize a function. DPL is one of the most power-efficient of the logic styles discussed by Uming et al. [24].

### 3.3.6. Transmission Gate Logic (TG) [30]

The schematics and layouts of logic gates using TG logic family are shown in Figure 3.9.


Figure 3.9: Schematics and layouts of logic gates using TG logic family

As discussed, the poor conduction of logic 1 by NMOS transistors in the PT logic families results in voltage degradation and a poor noise margin. To overcome these issues, many modifications of the PT logic family have been proposed. TG logic is a well-known logic family and one of the best: it enables rail-to-rail swing, improves the noise margin and works efficiently in the sub-threshold region. The TG circuit has two paths, but, in contrast to the DPL family, the same input is passed along both paths.

Compared to Static-CMOS logic gates, it has a lower transistor count, which provides high logic functionality and improves speed at low supply voltage, but its input capacitance is unbalanced.

### 3.3.7. Comparative Analysis of Basic Logic Gates using Different Logic Families for Sub-Threshold Operation

This section gives the comparative analysis of AND, OR and XOR gates using the Static-CMOS, PT, CPL, SRPL, DPL and TG logic families.

Tables 3.1-3.3 give the measured power, delay and power-delay product of these three logic gates using the Static-CMOS, TG, PT, SRPL, CPL and DPL logic families at a 0.4 V supply voltage for sub-threshold operation. These results show that all gates function properly at a supply voltage as low as 0.4 V. The schematics and layouts are designed and simulated at two technology nodes ( $45 \mathrm{~nm} / 180 \mathrm{~nm}$ ). The design metrics are characterized in terms of power, delay and power-delay product. The power consumption in the sub-threshold region is least when both NMOS and PMOS are of minimum size. For minimum power-delay product, the W/L ratios of all designed logic gates are kept in the ratio of 2:1 for the pull-up and pull-down networks respectively.

Table 3.1: Simulation results for AND Gate

| Logic Style | Number of <br> Transistors | Power (nW) |  | Delay (ns) |  | Power-Delay Product <br> ($10^{-18}$ W·s) |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | **45 nm** | **180 nm** | **45 nm** | **180 nm** | **45 nm** | **180 nm** |
| Static-CMOS | 6 | 0.938 | 0.255 | 0.181 | 28.14 | 0.1697 | 7.175 |
| TG | 6 | 0.588 | 0.231 | **0.120** | **27.42** | **0.0705** | **6.334** |
| PT | 14 | **0.363** | **0.209** | 1.250 | 39.60 | 0.4537 | 8.276 |
| CPL | 14 | 1.030 | 0.388 | 5.006 | 45.40 | 5.1562 | 17.615 |
| SRPL | 12 | 1.176 | 0.623 | 12.58 | 45.24 | 14.79 | 28.184 |
| DPL | 10 | 4.911 | 0.796 | 4.949 | 86.10 | 24.304 | 68.530 |

Table 3.2: Simulation results for OR Gate

| Logic Style | Number of <br> Transistors | Power (nW) |  | Delay (ns) |  | Power-Delay Product <br> ($10^{-18}$ W·s) |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | **45 nm** | **180 nm** | **45 nm** | **180 nm** | **45 nm** | **180 nm** |
| Static-CMOS | 6 | 1.231 | 0.322 | 1.567 | 52.210 | 1.927 | 16.811 |
| TG | 6 | 0.592 | 0.299 | **0.162** | **20.540** | **0.096** | **6.141** |
| PT | 14 | **0.491** | **0.305** | 0.688 | 57.400 | 0.338 | 17.510 |
| CPL | 14 | 1.639 | 0.572 | 5.258 | 44.900 | 8.617 | 25.682 |
| SRPL | 12 | 1.203 | 0.844 | 13.100 | 48.920 | 15.750 | 41.288 |
| DPL | 10 | 5.610 | 1.089 | 5.245 | 78.700 | 29.424 | 85.704 |

Table 3.3: Simulation results for XOR Gate

| Logic Style | Number of <br> Transistors | Power (nW) |  | Delay (ns) |  | Power-Delay Product <br> ($10^{-18}$ W·s) |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | **45 nm** | **180 nm** | **45 nm** | **180 nm** | **45 nm** | **180 nm** |
| Static-CMOS | 12 | 1.621 | 0.399 | 2.378 | 58.020 | 3.854 | 23.140 |
| TG | 8 | 0.628 | 0.362 | **0.342** | **31.410** | **0.215** | **11.106** |
| PT | 6 | **0.539** | **0.225** | 0.699 | 73.20 | 0.376 | 16.470 |
| CPL | 14 | 1.89 | 1.015 | 6.088 | 52.170 | 11.512 | 52.950 |
| SRPL | 12 | 1.659 | 1.334 | 13.215 | 59.540 | 21.923 | 79.420 |
| DPL | 10 | 6.128 | 2.007 | 6.665 | 55.320 | 40.843 | 111.027 |

From Table 3.1, 3.2 and 3.3, it is observed that at the system level, the single ended logic families do not work properly due to degraded output voltage level for sub-threshold operation. Generally, having a large delay and power spread at the gate level itself is not desirable because when these gates are used at system level the spread will add cumulatively which leads to a poor system power-delay product.

It is observed that CPL, DPL and SRPL have poor power-delay products because of poor noise margin, low noise immunity and output voltage degradation. Therefore, they are unsuitable for practical application at ultra-low supply voltages. Another drawback of these logic families is that they require more transistors because of the differential implementation of a function.

Key Points: The Static-CMOS, TG and PT logic families have the lowest power-delay products (TG the lowest of all) and also offer reliability and simpler designs at both technology nodes ( $45 \mathrm{~nm} / 180 \mathrm{~nm}$ ). Migrating from 180 nm to 45 nm shows the same trend but reduces the overall power-delay product for all logic families.

Hence Static-CMOS, PT and TG have been chosen to implement CLA, KS and HC adders.
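
The selection can be reproduced from the tabulated numbers with a minimal sketch. The power and delay values are copied from the 45 nm columns of Table 3.1 (AND gate); the dictionary and function names are hypothetical helpers introduced for illustration.

```python
# (power in nW, delay in ns) for the AND gate at 45 nm, copied from Table 3.1.
and_gate_45nm = {
    "Static-CMOS": (0.938, 0.181),
    "TG": (0.588, 0.120),
    "PT": (0.363, 1.250),
    "CPL": (1.030, 5.006),
    "SRPL": (1.176, 12.58),
    "DPL": (4.911, 4.949),
}


def pdp_attojoule(power_nw: float, delay_ns: float) -> float:
    """nW * ns = 1e-9 W * 1e-9 s = 1e-18 W*s, the unit used in the tables."""
    return power_nw * delay_ns


# Logic styles ordered from lowest to highest power-delay product.
ranking = sorted(and_gate_45nm, key=lambda name: pdp_attojoule(*and_gate_45nm[name]))

if __name__ == "__main__":
    for name in ranking:
        print(f"{name:12s} {pdp_attojoule(*and_gate_45nm[name]):8.4f}")
```

For this gate and node the three lowest entries of the ranking are TG, Static-CMOS and PT, in that order, which agrees with the families chosen above.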

### 3.4. EFFECT OF REVERSE BODY BIAS (RBB) SCHEME

In order to minimize the power consumption and improve the performance of sub-threshold arithmetic circuits, a body biasing scheme has been investigated that manages the threshold voltage $\left(\mathrm{V}_{\mathrm{th}}\right)$ and the power supply voltage $\left(\mathrm{V}_{\mathrm{DD}}\right)$ at the same time. For instance, effective body biasing techniques can adjust the transistor $\mathrm{V}_{\text {th }}$ to compensate for changes in the transistor as the product ages. They can also adjust the transistor $\mathrm{V}_{\text {th }}$ for temperature fluctuations, maintaining uniform performance and thus controlling the leakage current.

RBB scheme is well suited for devices with low threshold voltages and power supply voltages below threshold because at high power supply voltage band-to-band tunneling raises the junction leakage dramatically. Therefore, it works efficiently in sub-threshold region [31].

In the conventional configuration, the bulk terminals of the NMOS and PMOS devices are tied to ground and $V_{D D}$ respectively. This type of bulk connection prevents forward biasing of the source/drain-to-bulk p/n+ junctions in the normal region of operation ( $\mathrm{V}_{\mathrm{DD}}$ above the threshold voltage). Under RBB, the bulk connections of the two devices are interchanged, i.e. the PMOS bulk is tied to ground and the NMOS bulk to $\mathrm{V}_{\mathrm{DD}}$. In the sub-threshold region of operation, a noticeable increase in the drain current is then observed, which leads to increased switching speed and potentially higher power dissipation. This is due to the high electric field across a reverse biased p-n junction, which causes significant current to flow through the junction due to tunneling of electrons from the valence band of the p-region to the conduction band of the $n$-region (Band-To-Band Tunneling (BTBT)) [32].

The cross-sectional view of an NMOS transistor with its six short-channel leakage current mechanisms ( $\mathrm{I}_{1}-\mathrm{I}_{6}$ ) in sub-threshold operation is illustrated in Figure 3.10. The body terminal is connected to $\mathrm{V}_{\mathrm{DD}}$, in contrast to a conventional NMOS.


Figure 3.10: Cross sectional view of NMOS with all leakage currents

Where,
$\mathrm{I}_{1}=$ reverse-bias pn junction leakage current;
$\mathrm{I}_{2}=$ sub-threshold current;
$\mathrm{I}_{3}=$ oxide tunneling current;
$\mathrm{I}_{4}=$ gate current due to hot-carrier injection;
$\mathrm{I}_{5}=$ gate-induced drain leakage (GIDL) current;
$\mathrm{I}_{6}$ or $\mathrm{I}_{\text {RBB }}=$ reverse body bias current;
$I_{\text {RBB }}$ is the additional current flow in the NMOS due to band-to-band tunneling of electrons.

The effect of RBB on performance is investigated in the CMOS inverter circuit shown in Figure 3.11.


Figure 3.11: Schematic diagram of CMOS inverter (a) with RBB (b) without RBB

The voltage transfer characteristic (VTC) of CMOS inverter with/without RBB at both technology nodes is shown in Figure 3.12.

Figure 3.12: VTC's of CMOS inverter (a) with RBB (b) without RBB

Noise margin calculations of the CMOS inverter with/without RBB are given in Table 3.4.

Table 3.4: Noise margin of CMOS inverter with/without RBB

| Parameters | Inverter at 45 nm |  | Inverter at 180 nm |  |
| :---: | :---: | :---: | :---: | :---: |
|  | Voltage (V) <br> With RBB | Voltage (V) <br> Without RBB | Voltage (V) <br> With RBB | Voltage (V) <br> Without RBB |
| $\mathrm{V}_{\mathrm{IH}}$ | 0.160 | 0.24 | 0.120 | 0.20 |
| $\mathrm{V}_{\mathrm{IL}}$ | 0.128 | 0.18 | 0.108 | 0.17 |
| $\mathrm{V}_{\mathrm{OH}}$ | 0.400 | 0.40 | 0.400 | 0.40 |
| $\mathrm{V}_{\mathrm{OL}}$ | 0.060 | 0.00 | 0.060 | 0.00 |
| $\mathrm{NM}_{\mathrm{L}}$ ($\mathrm{V}_{\mathrm{IL}}-\mathrm{V}_{\mathrm{OL}}$) | **0.068** | 0.18 | **0.048** | 0.17 |
| $\mathrm{NM}_{\mathrm{H}}$ ($\mathrm{V}_{\mathrm{OH}}-\mathrm{V}_{\mathrm{IH}}$) | 0.24 | 0.16 | 0.28 | 0.20 |

Table 3.4 shows that, with RBB, the high noise margin ( $\mathrm{NM}_{\mathrm{H}}$ ) is larger than without RBB at both technology nodes (0.24 V versus 0.16 V at 45 nm and 0.28 V versus 0.20 V at 180 nm). However, at both 45 nm and 180 nm, the low noise margin $\left(\mathrm{NM}_{\mathrm{L}}\right)$ with RBB is very small $(0.068 / 0.048 \mathrm{~V})$, which reduces the noise immunity of the circuits.
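
The noise margin entries follow directly from the four VTC voltages. A minimal sketch, using the 45 nm values of Table 3.4 and the definitions given in its last two rows; the function name `noise_margins` is a hypothetical helper.

```python
def noise_margins(v_il: float, v_ih: float, v_ol: float, v_oh: float):
    """Return (NM_L, NM_H) with NM_L = V_IL - V_OL and NM_H = V_OH - V_IH."""
    return round(v_il - v_ol, 3), round(v_oh - v_ih, 3)


# 45 nm inverter, values in volts copied from Table 3.4.
with_rbb = noise_margins(v_il=0.128, v_ih=0.160, v_ol=0.060, v_oh=0.400)
without_rbb = noise_margins(v_il=0.18, v_ih=0.24, v_ol=0.00, v_oh=0.40)

if __name__ == "__main__":
    print("with RBB   :", with_rbb)
    print("without RBB:", without_rbb)
```

The low noise margin shrinks with RBB mainly because the output low level no longer reaches 0 V in the tabulated data.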

With reference to Figures 3.10 and 3.11, the total current of the PMOS (or NMOS) is $\mathrm{I}_{\mathrm{P}}$ (or $\mathrm{I}_{\mathrm{N}}$) $=\mathrm{I}_{1}+\mathrm{I}_{2}+\mathrm{I}_{3}+\mathrm{I}_{4}+\mathrm{I}_{5}+\mathrm{I}_{\mathrm{RBB}, \mathrm{P}}$ (or $\mathrm{I}_{\mathrm{RBB}, \mathrm{N}}$), where $\mathrm{I}_{\mathrm{RBB}, \mathrm{P}}$ (or $\mathrm{I}_{\mathrm{RBB}, \mathrm{N}}$) is the RBB current in the PMOS (or NMOS), the additional term that appears with RBB.

The graph in Figure 3.13 shows a nearly $2.3 \times$ increase in the output current ($\mathrm{I}_{\text {out }}$) with RBB in the CMOS inverter simulated at 45 nm. This can reduce the delays significantly.


Figure 3.13: Current characteristics of NMOS in CMOS inverter with RBB/without RBB at 45 nm technology

Due to this extra current $\left(\mathrm{I}_{\mathrm{N}}\right)$ in the NMOS, the magnitude of the overall current in the CMOS inverter with RBB increases, which leads to increased switching speed at the cost of an increased power-delay product, as shown in Table 3.5. The operating frequencies of CMOS logic gates with the RBB scheme are observed to be higher than with the conventional body biasing scheme at low voltages [19].

Table 3.5: Simulation results for inverter with/without RBB at 0.4 V

| Module Name | Power (nW) |  | Delay (ns) |  | Power-Delay Product ($\times 10^{-18}$ W·s) |  |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
|  | With RBB | Without RBB | With RBB | Without RBB | With RBB | Without RBB |
| Inverter (45 nm) | 1.18 | **0.421** | **0.014** | 0.033 | 0.016 | **0.014** |
| Inverter (180 nm) | 0.127 | **0.072** | **0.645** | 0.845 | 0.081 | **0.061** |
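The power-delay product column of Table 3.5 can be reproduced from the power and delay columns, since 1 nW × 1 ns = $10^{-18}$ W·s. The following is a minimal Python sketch of that check; the function name and the row keys are illustrative only, with `"row 1"` and `"row 2"` denoting the two data rows of the table in order.

```python
def pdp_1e18(power_nw, delay_ns):
    """Power-delay product in units of 1e-18 W*s (nW * ns = 1e-18 W*s)."""
    return power_nw * delay_ns


# (power in nW, delay in ns) copied from Table 3.5, rows in table order.
TABLE_3_5 = {
    ("row 1", "with RBB"): (1.18, 0.014),
    ("row 1", "without RBB"): (0.421, 0.033),
    ("row 2", "with RBB"): (0.127, 0.645),
    ("row 2", "without RBB"): (0.072, 0.845),
}

for key, (power, delay) in TABLE_3_5.items():
    print(key, f"{pdp_1e18(power, delay):.4f}")
```

The computed products agree with the tabulated values to within rounding of the last digit.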

The detailed analysis of AND, OR and XOR logic gates using six different logic families with RBB scheme is given in section 3.4.1.

### 3.4.1. Simulation Results of Logic Gates with RBB using Different Logic Families

This section presents the design and comparative analysis of AND, OR and XOR gates with the RBB scheme using six logic families: Static-CMOS, TG, PT, SRPL, CPL and DPL. Tables 3.6-3.8 give the power, delay and power-delay product of these gates at a 0.4 V power supply voltage for sub-threshold operation.

The schematics and layouts are designed and simulated at two technology nodes (45 nm / 180 nm).

The obtained simulation waveforms confirmed proper function of all gates at supply voltage as low as 0.4 V . The basic gates are characterized in terms of power, delay and power-delay product.

For minimum power-delay product, the W/L's of all designed logic gates were kept in the ratio of $2: 1$ for the pull up and pull down network respectively.

Table 3.6: Simulation results for AND Gate with RBB

| Logic Style | Number of Transistors | Power (nW) |  | Delay (ns) |  | Power-Delay Product ($\times 10^{-18}$ W·s) |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | **45 nm** | **180 nm** | **45 nm** | **180 nm** | **45 nm** | **180 nm** |
| Static-CMOS | 6 | 1.123 | 0.375 | 0.155 | 21.090 | 0.174 | 7.908 |
| TG | 6 | 0.612 | 0.329 | **0.094** | **17.700** | **0.057** | **5.823** |
| PT | 14 | **0.401** | **0.239** | 0.823 | 25.870 | 0.331 | 6.182 |
| CPL | 14 | 1.144 | 0.545 | 4.501 | 32.210 | 5.149 | 17.55 |
| SRPL | 12 | 1.352 | 0.799 | 10.25 | 35.220 | 13.581 | 28.140 |
| DPL | 10 | 5.041 | 0.899 | 4.758 | 55.400 | 23.985 | 49.800 |

Table 3.7: Simulation results for OR Gate with RBB

| Logic Style | Number of Transistors | Power (nW) |  | Delay (ns) |  | Power-Delay Product ($\times 10^{-18}$ W·s) |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | **45 nm** | **180 nm** | **45 nm** | **180 nm** | **45 nm** | **180 nm** |
| Static-CMOS | 6 | 1.388 | 0.342 | 1.482 | 50.100 | 2.057 | 17.134 |
| TG | 6 | 0.603 | 0.341 | **0.151** | **17.100** | **0.091** | **5.841** |
| PT | 14 | **0.502** | **0.315** | 0.670 | 35.10 | 0.337 | 11.050 |
| CPL | 14 | 1.704 | 0.646 | 4.932 | 38.110 | 8.405 | 24.619 |
| SRPL | 12 | 1.331 | 0.779 | 11.501 | 43.010 | 15.307 | 33.504 |
| DPL | 10 | 5.714 | 1.149 | 5.041 | 70.300 | 28.804 | 80.774 |

Table 3.8: Simulation results for XOR Gate with RBB

| Logic Style | Number of Transistors | Power (nW) |  | Delay (ns) |  | Power-Delay Product ($\times 10^{-18}$ W·s) |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | **45 nm** | **180 nm** | **45 nm** | **180 nm** | **45 nm** | **180 nm** |
| Static-CMOS | 12 | 1.746 | 0.540 | 2.305 | 53.250 | 4.024 | 28.755 |
| TG | 8 | 0.707 | 0.418 | **0.201** | **29.810** | **0.142** | **12.460** |
| PT | 6 | **0.596** | **0.302** | 0.622 | 69.10 | 0.370 | 20.86 |
| CPL | 14 | 1.906 | 1.385 | 6.011 | 49.140 | 11.456 | 68.050 |
| SRPL | 12 | 1.891 | 1.459 | 12.591 | 52.320 | 23.809 | 76.330 |
| DPL | 10 | 6.221 | 2.102 | 6.509 | 50.240 | 40.492 | 105.610 |

Key Points: The Static-CMOS, TG and PT logic families with the RBB scheme show lower power-delay products (with TG the lowest) at both technology nodes (45 nm / 180 nm). Migrating from 180 nm to 45 nm reduces the overall power-delay product for all logic families. The results indicate that Static-CMOS, PT and TG with the RBB scheme are the three most power-efficient logic families. Hence, these are used to implement the CLA, KS and HC adders in the sub-threshold region in this chapter.

### 3.5. DESIGN AND ANALYSIS OF PARALLEL ADDERS

The power consumption and speed of adders depend upon the choice of logic design style and architecture. Since different logic styles and architectures are available, it is important to explore the adders for different bit-widths in the sub-threshold region. In this section, the design and analysis of adders using three architectures (CLA, KSA and HCA) in three different logic families (Static-CMOS, TG and PT logic style, or their combination) operated in the sub-threshold region are presented. These adders are analysed in terms of power, delay and power-delay product. The effect of technology scaling is observed by implementing the designs at the 45 nm / 180 nm technology nodes. The internal architecture of the adders and their circuit implementations are given in the following sections.

### 3.5.1. Design Implementation using CLA

In this section, the parallel adder is implemented using the CLA architecture. The basic blocks of the CLA are implemented using the Boolean expressions given by eq. (1) - eq. (4), discussed in section 3.2.1. The adder is implemented for operand sizes of 4-bit, 8-bit, 16-bit, 32-bit and 64-bit using three different logic families in the sub-threshold region.

Figure 3.14 and Figure 3.15 show block level diagram of 4-bit CLA as basic building unit for large operand size adders.


Figure 3.14: Basic block diagram of 4-bit CLA


Figure 3.15: Internal blocks of 4-bit CLA

4-bit CLA blocks are used as basic building blocks to build large operand size CLA by combining these blocks hierarchically.

Block diagrams of the 8-bit, 16-bit, 32-bit and 64-bit CLA's are shown in Figure 3.16 - Figure 3.19.


Figure 3.16: Block diagram of 8-bit CLA built using 4-bit CLA unit


Figure 3.17: Block diagram of 16-bit CLA built using 4-bit CLA unit


Figure 3.18: Block diagram of a 32-bit CLA built using 8-bit CLA unit


Figure 3.19: Block diagram of a 64-bit CLA built using 16-bit CLA unit

Based on the generate and propagate values, the look-ahead blocks compute the carry bits. The partial full adders (PFA) inside the 4-bit CLA produce the sum values based on the inputs $\left(A_{i}, B_{i}\right)$ and the carry input $C_{i}$ for each bit position. The final carry generator block generates the final carries ( $\mathrm{C}_{8}, \mathrm{C}_{16}, \mathrm{C}_{32}$ and $\mathrm{C}_{64}$ ) for the 8-bit, 16-bit, 32-bit and 64-bit CLA's respectively.

- The 64-bit CLA is implemented using three hierarchical levels. In the first level, the look-ahead blocks compute the carry values together with $\mathrm{G}_{0-3}$ and $\mathrm{P}_{0-3}$, which are defined as the "group generate" and "group propagate" of a group of 4 bits.
- In the second level, the 16-bit carry-out signals are obtained for each adder block from the group generate and group propagate signals, as shown in Figure 3.19. The CLA logic creates three carry signals ($\mathrm{C}_{16}, \mathrm{C}_{32}$ and $\mathrm{C}_{48}$), which are used as carry inputs to the 16-bit adder blocks, and $\mathrm{C}_{64}$, which serves as the sum ($\mathrm{S}_{64}$) signal. In contrast to a conventional 64-bit CLA, an extra final carry generator is used to generate $\mathrm{C}_{64}$ as the sum ($\mathrm{S}_{64}$) signal. This extra final carry generator optimizes the number of levels in the carry chain, which gives minimum area, power and delay in the sub-threshold region. The delay of a CLA depends on the number of levels of carry logic, and not on the length of the adder. A conventionally designed 64-bit CLA using multi-level logic generally gives a larger and slower circuit than the optimized three-level carry logic implementation. Thus, the overall power-delay product of the designed 64-bit CLA is improved.
- The third level contains one carry look-ahead generator. It computes the input carries for the carry look-ahead generators in the second level, and the 'carry-propagated' and 'carry-generated' bits for the entire 64 bits. These bits can then be used either to compute the carry of the entire adder or to further extend the adder.
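The hierarchical look-ahead principle described above can be illustrated with a small behavioural model. The Python sketch below is only a functional illustration, not the circuit implemented in this work: it uses the standard textbook relations $g_i = a_i b_i$, $p_i = a_i \oplus b_i$ and $c_{i+1} = g_i + p_i c_i$ (assumed here to correspond to the Boolean expressions of section 3.2.1), and it shows two levels (4-bit blocks plus one look-ahead unit across the blocks). The function names `bits`, `lookahead` and `cla_add` are hypothetical. The three-level 64-bit structure repeats the same step once more.

```python
def bits(x, n):
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def lookahead(g, p, c_in):
    """Look-ahead over one group.

    Returns (carries, G, P): carries[i] is the carry into position i,
    carries[-1] is the carry out, G and P are the group generate/propagate.
    """
    carries = [c_in]
    G, P = 0, 1                      # generate/propagate of the empty prefix
    for gi, pi in zip(g, p):
        G = gi | (pi & G)            # group generate of positions 0..i
        P = pi & P                   # group propagate of positions 0..i
        carries.append(G | (P & c_in))
    return carries, G, P


def cla_add(a, b, n=16, c_in=0, block=4):
    """Two-level CLA model: 4-bit blocks plus one look-ahead unit across blocks."""
    A, B = bits(a, n), bits(b, n)
    g = [x & y for x, y in zip(A, B)]      # bit-wise generate
    p = [x ^ y for x, y in zip(A, B)]      # bit-wise propagate
    starts = range(0, n, block)
    # Level 1: group generate/propagate of each block (independent of carry-in).
    groups = [lookahead(g[i:i + block], p[i:i + block], 0)[1:] for i in starts]
    # Level 2: look-ahead across the blocks gives the carry into each block.
    block_c, _, _ = lookahead([G for G, _ in groups], [P for _, P in groups], c_in)
    # Partial full adders: sum bit = propagate XOR incoming carry.
    s = []
    for k, i in enumerate(starts):
        carries, _, _ = lookahead(g[i:i + block], p[i:i + block], block_c[k])
        s += [pi ^ ci for pi, ci in zip(p[i:i + block], carries)]
    return sum(bit << i for i, bit in enumerate(s)), block_c[-1]


print(cla_add(12345, 54321))   # (sum modulo 2**16, carry out)
```

In software the loop in `lookahead` runs sequentially, but each carry is expressed only in terms of the group generate, group propagate and the group carry-in, which is the property that lets the hardware evaluate all carries of a group in parallel.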

Conventional Static-CMOS, TG and PT logic families have been identified most suitable for designing more robust adder circuits operated in sub-threshold region [22] as discussed in section 3.3. Internal units of CLA with Static-CMOS, TG and PT logic families are discussed in following sections.

### 3.5.1.1. CLA with Static-CMOS Logic (Static-CMOS CLA)

The internal units (carry generator, group generate/propagate circuitry, bit-wise generate/propagate circuitry and sum generator circuitry) of the CLA's implemented using Static-CMOS logic are shown in Figure 3.20 - Figure 3.24. The bit-wise generate/propagate circuitry and the sum generator circuit are referred to as the pre-computation and post-computation logic blocks respectively.


Figure 3.20: Circuit diagram of 4-bit group generate


Figure 3.21: Circuit diagram of 4-bit group propagate



Figure 3.22: Circuit diagram of 4-bit carry generator


Figure 3.23: Circuit diagram of bit-wise generate/propagate


Figure 3.24: Circuit diagram of bit-wise sum-generator

### 3.5.1.2. CLA with Hybrid TG Logic (HYB-TG CLA)

Conventional TG logic leads to compact design but shows voltage degradation problem in two or more cascaded TG stage at ultra-low power supply voltage for sub-threshold operation. Therefore, to overcome this voltage degradation problem the implementation of circuits is done by coupling TG logic with Static-CMOS buffer at appropriate stages rather than using complete Static-CMOS or TG gate designs. The combination of Static-CMOS buffer with TG logic is represented here as HYB-TG logic.

The output buffer formed by the cascaded inverters is designed in such a way that the first inverter is half the size of the second inverter in order to cut down the power consumption.

To check for proper operation of HYB-TG logic with RBB, a series combination of TG and two Static-CMOS inverters (TG-2INV block), as shown in Figure 3.25, under following different conditions of body bias has been simulated and analysed.

- State a----TG1, 2INV without RBB
- State b----TG1, 2INV with RBB
- State c----TG1 without RBB, 2INV with RBB


Figure 3.25: TG-2INV block

The results are given in Table 3.9 and the waveforms at node 'OUT' are given in Figure 3.26.

Table 3.9: Operations of TG-2INV block

| State | Enable [EN = 1 for NMOS, EN! = 0 for PMOS] (ON condition) |  | Operation (Turn ON) | Enable [EN = 0 for NMOS, EN! = 1 for PMOS] (OFF condition) |  | Operation (Turn OFF) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Input at node IN (V) | Output at node OUT (V) |  | Input at node IN (V) | Output at node OUT (V) |  |
| (a) | 0.4 | 0.400 | Proper | 0.4 | 0.000 | Proper |
| (b) | 0.4 | 0.400 | Proper | 0.4 | 0.348 | Not turning off properly |
| (c) | 0.4 | 0.400 | Proper | 0.4 | 0.000 | Proper |



Figure 3.26: Graph for TG-2INV block at OFF condition

From the above results, it is clear that TG-2INV block operates properly for ON condition for all three states, whereas at OFF condition the output is not proper for state $b$. Therefore, in this adder, HYB-TG logic without RBB is used in design. Each output node in HYB-TG logic is followed by a buffer (which is composed of two inverters).

Further, only internal units like bit-wise generate/propagate circuitry and sum generator circuitry are designed using HYB-TG logic without RBB.

A complete adder implementation using HYB-TG logic with buffers showed increased power consumption in simulation. Therefore, only simple circuits such as the bit-wise generate/propagate circuitry and the sum generator circuit of the CLA's are implemented using HYB-TG logic, as shown in Figure 3.27 and Figure 3.28 respectively, to get the advantage of reduced transistor count. The carry generator and group generate/propagate circuitry of the CLA's are implemented using Static-CMOS logic only.


Figure 3.27: Circuit diagram of bit-wise generate/propagate


Figure 3.28: Circuit diagram of bit-wise sum-generator

### 3.5.1.3. CLA with Hybrid pass transistor Logic (HYB-PT CLA)

Conventional PT logic leads to compact design but shows voltage degradation problem in two or more cascaded PT stage at ultra-low power supply voltage for sub-threshold operation. Therefore, to overcome this voltage degradation problem the implementation of circuits is done by coupling PT logic with Static-CMOS buffer at appropriate stages rather than using complete Static-CMOS or PT gate designs. The combination of Static-CMOS buffer with PT logic is represented here as HYB-PT logic.

To check for proper operation of HYB-PT logic with RBB, a series combination of PT and two Static-CMOS inverters (PT-2INV block), as shown in Figure 3.29, under following different conditions of body bias has been simulated and analysed.

- State a----PT1, 2INV without RBB
- State b----PT1, 2INV with RBB
- State c----PT1 without RBB, 2INV with RBB


Figure 3.29: PT-2INV block

The results are given in Table 3.10 and the waveforms at node 'OUT' are given in Figure 3.30.

Table 3.10: Operations of PT-2INV block

| State | Enable [EN = 1 for NMOS] (ON condition) |  | Operation (Turn ON) | Enable [EN = 0 for NMOS] (OFF condition) |  | Operation (Turn OFF) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Input at node IN (V) | Output at node OUT (V) |  | Input at node IN (V) | Output at node OUT (V) |  |
| (a) | 0.4 | 0.400 | Proper | 0.4 | 0.000 | Proper |
| (b) | 0.4 | 0.400 | Proper | 0.4 | 0.391 | Not turning off properly |
| (c) | 0.4 | 0.400 | Proper | 0.4 | 0.000 | Proper |



Figure 3.30: Graph for PT-2INV block at OFF condition

From the above results, it is clear that the PT-2INV block operates properly in the ON condition for all three states, whereas in the OFF condition the output is not proper for state $b$. Therefore, in this adder, HYB-PT logic without RBB is used in the design. Each output node in HYB-PT logic is followed by a buffer (which is composed of two inverters).

Further, only internal units like the bit-wise generate/propagate circuitry and sum generator circuitry are designed using HYB-PT logic without RBB.

A complete adder implementation using HYB-PT logic with buffers showed increased power consumption in simulation. Therefore, only simple circuits such as the bit-wise generate/propagate circuitry and the sum generator circuit of the CLA's are implemented using HYB-PT logic, as shown in Figure 3.31 and Figure 3.32 respectively, to get the advantage of reduced transistor count. The carry generator and group generate/propagate circuitry of the CLA's are implemented using Static-CMOS logic only.


Figure 3.31: Circuit diagram of bit-wise generate/propagate


Figure 3.32: Circuit diagram of bit-wise sum-generator

The CLA's have been implemented for operand sizes of 8-bit, 16-bit, 32-bit and 64-bit using Static-CMOS, HYB-TG and HYB-PT logic styles in the sub-threshold region.

### 3.5.2. Simulation Methodology and Results of CLA

To obtain the simulation results for the CLA's in sub-threshold region, the following methodology is followed.

The 8-bit, 16-bit, 32-bit and 64-bit versions of each of the above three parallel CLA's were designed in Cadence Virtuoso (schematic and layout design). The pre-computation stage is designed using AND gates and XOR gates to generate the individual bit generate and propagate signals respectively.

A look-ahead block is designed using an AND gate and an AND-OR implementation (AOI) cell. The multiple carries are computed with proper buffering necessary, according to the implementation. The post-computation stage is designed using XOR gates to generate the sum bits.

In sub-threshold circuits, leakage currents play a very important role in power, delay and power-delay product. Leakage current, including sub-threshold leakage current and gate leakage current, becomes more significant below 100 nm. Therefore, to analyze the effect of leakage current above and below 100 nm, two technology nodes, 45 nm and 180 nm, have been considered in the implementation. The schematics and layouts were designed using 45 nm / 180 nm technology libraries and simulated using the BSIM3 (V3.24) model at a supply voltage of 0.4 V.

This setup is common to all circuits designed in Chapters 3, 4 and 5 of the present work.

A sequence of random inputs is applied to identify the input pattern for worst-case power consumption and propagation delay. The worst-case power consumption and propagation delay values are then obtained individually through transient simulations for each design. Transient simulations were done by applying input pulses with rise and fall times of 1 picosecond, a pulse width (ON time) of 1 microsecond and a pulse period of 5 microseconds [33].
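For illustration, the input stimulus described above can be written as a simple trapezoidal pulse. The Python sketch below is a hypothetical helper, not part of the simulation setup itself. It assumes that the pulse swings between 0 V and the 0.4 V supply, that it starts rising at t = 0, and that the 1 microsecond pulse width excludes the rise and fall times (as in the usual SPICE-style pulse source convention).

```python
def pulse(t, v_low=0.0, v_high=0.4, t_rise=1e-12, t_fall=1e-12,
          width=1e-6, period=5e-6):
    """Trapezoidal pulse value (V) at time t (s); the rise starts at t = 0.

    v_low, v_high and the start time are assumptions; the timing defaults
    are the values quoted in the text.
    """
    t = t % period
    if t < t_rise:                                   # rising edge
        return v_low + (v_high - v_low) * t / t_rise
    if t < t_rise + width:                           # ON time
        return v_high
    if t < t_rise + width + t_fall:                  # falling edge
        return v_high - (v_high - v_low) * (t - t_rise - width) / t_fall
    return v_low                                     # OFF time


for t in (0.0, 0.5e-6, 3e-6, 5.5e-6):
    print(f"t = {t:.1e} s -> {pulse(t):.3f} V")
```

With these values the input is high for roughly one fifth of each period, and the 1 ps edges are negligible compared with the nanosecond-scale gate delays reported in this chapter.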

The power, delay and power-delay product values are evaluated for CLA's using Static-CMOS logic, HYB-TG logic and HYB-PT logic style for different operand sizes.

In this thesis, the following nomenclature as shown in Table 3.11, is used to represent the proposed designs of CLA adders:

Architecture (CLA) - technology (45/ 180) - logic family (STA-CMOS, HYB-TG, and HYB-PT)

Table 3.11: Nomenclature used for the proposed designs of CLA

| S. No. | Adder Design Descriptions | Nomenclature |
| :---: | :--- | :--- |
| 1 | CLA at 45 nm using Static-CMOS logic family | CLA-45-STA-CMOS |
| 2 | CLA at 45 nm using hybrid TG logic family | CLA-45-HYB-TG |
| 3 | CLA at 45 nm using hybrid PT logic family | CLA-45-HYB-PT |
| 4 | CLA at 180 nm using Static-CMOS logic family | CLA-180-STA-CMOS |
| 5 | CLA at 180 nm using hybrid TG logic family | CLA-180-HYB-TG |
| 6 | CLA at 180 nm using hybrid PT logic family | CLA-180-HYB-PT |

Table 3.12 and Table 3.13 give the power, delay and power-delay product of the CLA's using different logic families. All modules function properly at a supply voltage as low as 0.4 V.

Table 3.12: Simulation results of CLA's at 45 nm technology

| Module Name | No. of Bits | Power ($\mu$W) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | Area ($\mu \mathrm{m}^{2}$) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| CLA-45-STA-CMOS | 8 | 0.380 | 1.241 | 0.471 | 359.71 |
|  | 16 | 0.625 | 2.369 | 1.481 | 647.14 |
|  | 32 | 0.957 | 3.697 | 3.538 | 1175.25 |
|  | 64 | 3.413 | 4.065 | 13.871 | 1894.81 |
| CLA-45-HYB-TG | 8 | 0.452 | 1.332 | 0.602 | 293.19 |
|  | 16 | 0.747 | 2.655 | 1.983 | 597.74 |
|  | 32 | 2.045 | 3.874 | 7.922 | 1090.31 |
|  | 64 | 4.364 | 4.230 | 18.459 | 1727.87 |
| CLA-45-HYB-PT | 8 | 0.307 | 1.682 | 0.516 | 257.12 |
|  | 16 | 0.587 | 2.970 | 1.743 | 501.78 |
|  | 32 | 1.661 | 4.178 | 6.939 | 997.23 |
|  | 64 | 3.271 | 4.321 | 14.133 | 898.01 |

Table 3.13: Simulation results of CLA's at 180 nm technology

| Module Name | No. of Bits | Power ($\mu$W) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | Area ($\mu \mathrm{m}^{2}$) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| CLA-180-STA-CMOS | 8 | **0.204** | **45** | **9.180** | 15,541 |
|  | 16 | 0.575 | **65** | **37.375** | 26,790 |
|  | 32 | 0.748 | **98** | **73.304** | 77,410 |
|  | 64 | 2.740 | **100** | **274.000** | 128,030 |
| CLA-180-HYB-TG | 8 | 0.396 | 35 | 13.860 | 11,954 |
|  | 16 | 0.799 | 55 | 43.945 | 20,574 |
|  | 32 | 1.274 | 85 | 108.290 | 69,578 |
|  | 64 | 3.081 | 155 | 477.550 | 114,841 |
| CLA-180-HYB-PT | 8 | 0.242 | 50 | 12.100 | **9,741** |
|  | 16 | **0.528** | 72 | 38.016 | **18,655** |
|  | 32 | **0.688** | 112 | 77.056 | **62,178** |
|  | 64 | **1.964** | 180 | 353.520 | **109,221** |

Key Points: The overall results of the CLA show that

- CLA operates down to 0.4 V power supply at both technology nodes in sub-threshold region.
- The simulation results show that the power, delay and power-delay product of the CLA designs increase with the operand size, as expected.
- In comparison to 180 nm technology, at 45 nm the propagation delay is smaller, the power consumption is higher (due to increased leakage currents) and the power-delay product is smaller for all designs of CLA [34]. For the CMOS inverter, the leakage currents are found through simulation to be 1.05 nA and 0.18 nA at 45 nm and 180 nm technology respectively.
- The overall power-delay product of the Static-CMOS design is smaller than that of HYB-TG and HYB-PT logic. This demonstrates that changing the logic style from Static-CMOS to TG or from Static-CMOS to PT logic increases the overall circuit power-delay product for sub-threshold operation in both technologies.
- Fully Static-CMOS logic design style is the most power efficient design style for CLA at both technologies in sub-threshold region.

The overall power consumption, propagation delay and power-delay product graphs of CLA's at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology using different logic families are shown in Figure 3.33.


Figure 3.33: The overall power consumption, propagation delay and power-delay product graphs of CLA at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology

### 3.5.3. Design Implementation using KSA

KSA offers an efficient solution to the binary addition problem and assures a low computation delay and low power in the sub-threshold region. Conventionally, the parallel prefix adders compute addition in two steps: one to obtain the carry at each bit, and the next to compute the sum bit based on the carry bit. The internal architecture of the 16-bit KSA is shown in Figure 3.34. The pre-computation and post-computation blocks are common to all three adders (CLA, KSA and HCA). The transistor-level diagrams of these blocks using Static-CMOS, HYB-TG and HYB-PT logic style are given in sections 3.5.1.1, 3.5.1.2 and 3.5.1.3 respectively.


Figure 3.34: Internal architecture of KSA

Design Methodology: The basic blocks of the KSA are the bitwise propagate/generate logic block, the group propagate/generate logic block and the sum/carry logic block. These basic blocks are implemented using the Boolean expressions of the KSA given by Eq. (3.11) - Eq. (3.14), as discussed in section 3.2.2. AND, OR and XOR gates are the three logic gates widely used in these blocks.

The circuit realization of these basic logic gates with competing logic families has already been discussed in section 3.3. The KSA is implemented at different operand sizes using three different logic families, i.e. Static-CMOS, HYB-TG and HYB-PT.
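The structure of the Kogge-Stone prefix tree can be illustrated with a short behavioural model. The Python sketch below is a functional illustration only and does not represent the transistor-level design. It uses the standard prefix operator $(G, P) \circ (G', P') = (G + P G', P P')$, which is assumed here to correspond to the expressions of section 3.2.2; the function name `kogge_stone_add` is hypothetical, and a carry-in of 0 is assumed.

```python
def kogge_stone_add(a, b, n=16):
    """Add two n-bit numbers with a Kogge-Stone prefix tree (carry-in = 0).

    Returns (sum modulo 2**n, carry out, number of prefix stages).
    """
    # Pre-computation: bit-wise generate (AND) and propagate (XOR).
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]

    # Prefix tree: at distance d every bit i >= d combines with bit i - d.
    G, P = g[:], p[:]
    d, stages = 1, 0
    while d < n:
        new_g, new_p = G[:], P[:]
        for i in range(d, n):
            new_g[i] = G[i] | (P[i] & G[i - d])
            new_p[i] = P[i] & P[i - d]
        G, P = new_g, new_p
        d *= 2
        stages += 1

    # Post-computation: sum bit = propagate XOR carry into that bit.
    carries = [0] + G[:-1]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[-1], stages


print(kogge_stone_add(12345, 54321))   # (sum modulo 2**16, carry out, stages)
```

For a 16-bit operand the model uses four prefix stages, i.e. $\log _{2} n$ levels, which is the source of the low logic depth of this architecture.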

### 3.5.4. Simulation Methodology and Results of KSA

To obtain the simulation results for the KSA's in sub-threshold region, the following methodology is followed.

The 8-bit, 16-bit, 32-bit and 64-bit KSA's using three different logic families were designed in Cadence Virtuoso (schematic and layout design). The pre-computation stage is designed using AND gates and XOR gates to generate the individual bit generate and propagate signals respectively.

A prefix tree for generating group generate and group propagate signals is designed using an AND gate and an AND-OR implementation (AOI) cell. The multiple carries are computed with proper buffering necessary, according to the implementation. The post-computation stage is designed using XOR gates to generate the sum bits. Two different technology nodes i.e. 45 $\mathrm{nm} / 180 \mathrm{~nm}$ technology have been considered in implementation of schematic and layouts and simulated using the BSIM3 (V3.24) model, at a supply voltage of 0.4 V . The methodology followed for simulation of KSA in sub-threshold region is same as given in Section 3.5.2.

The power, delay and power-delay product values are evaluated for KSA's using Static-CMOS logic, HYB-TG logic and HYB-PT logic style for different operand sizes.

In this thesis, the following nomenclature as shown in Table 3.14, is used to represent the proposed designs of KSA adders:

Architecture (KSA) - technology (45/ 180) - logic family (STA-CMOS, HYB-TG, and HYB-PT)

Table 3.14: Nomenclature used for the proposed designs of KSA

| S. No. | Adder Design Descriptions | Nomenclature |
| :---: | :--- | :---: |
| 1 | KSA at 45 nm using Static-CMOS logic family | KSA-45-STA-CMOS |
| 2 | KSA at 45 nm using hybrid TG logic family | KSA-45-HYB-TG |
| 3 | KSA at 45 nm using hybrid PT logic family | KSA-45-HYB-PT |
| 4 | KSA at 180 nm using Static-CMOS logic family | KSA-180-STA-CMOS |
| 5 | KSA at 180 nm using hybrid TG logic family | KSA-180-HYB-TG |
| 6 | KSA at 180 nm using hybrid PT logic family | KSA-180-HYB-PT |

The power, delay and power-delay product of the KSA's at 45 nm / 180 nm technology, operated at a 0.4 V power supply voltage for sub-threshold operation, are given in Table 3.15 and Table 3.16.

These results show that all modules function properly at a supply voltage as low as 0.4 V.

Table 3.15: Simulation results of KSA's at 45 nm technology

| Module Name | No. of Bits | Power ($\mu$W) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | Area ($\mu \mathrm{m}^{2}$) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| KSA-45-STA-CMOS | 8 | 0.014 | 2.571 | 0.037 | 240.7 |
|  | 16 | 0.157 | 4.297 | 0.676 | 619.9 |
|  | 32 | 0.215 | 12.96 | 2.792 | 1521.7 |
|  | 64 | 0.317 | 17.24 | 5.472 | 3079.2 |
| KSA-45-HYB-TG | 8 | 0.101 | 1.597 | 0.161 | 189.7 |
|  | 16 | 0.299 | 2.354 | 0.705 | 498.3 |
|  | 32 | 0.367 | 3.412 | 1.253 | 1374.1 |
|  | 64 | 0.587 | 5.698 | 3.349 | 2873.1 |
| KSA-45-HYB-PT | 8 | 0.389 | 10.63 | 4.138 | 107.1 |
|  | 16 | 0.589 | 16.97 | 10.01 | 288.3 |
|  | 32 | 0.962 | 22.52 | 21.67 | 1187.5 |
|  | 64 | 1.470 | 29.71 | 43.88 | 2542.2 |

Table 3.16: Simulation results of KSA's at 180 nm technology

| Module Name | No. of Bits | Power ($\mu$W) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | Area ($\mu \mathrm{m}^{2}$) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| KSA-180-STA-CMOS | 8 | 0.009 | 260.0 | 2.34 | 16,741 |
|  | 16 | 0.041 | 479.0 | 19.63 | 28,221 |
|  | 32 | 0.103 | 1140.0 | 117.00 | 80,415 |
|  | 64 | 0.234 | 1618.0 | 378.00 | 129,774 |
| KSA-180-HYB-TG | 8 | 0.098 | 67.6 | 6.63 | 13,221 |
|  | 16 | 0.289 | 73.1 | 21.15 | 22,542 |
|  | 32 | 0.323 | 78.7 | 26.29 | 75,471 |
|  | 64 | 0.495 | 81.4 | 40.29 | 120,021 |
| KSA-180-HYB-PT | 8 | 0.221 | 293.1 | 64.77 | 11,512 |
|  | 16 | 0.491 | 358.1 | 175.80 | 18,785 |
|  | 32 | 0.901 | 421.2 | 379.50 | 68,712 |
|  | 64 | 1.040 | 489.0 | 508.56 | 101,146 |

Key points: The overall results of the KSA show that

- KSA operates down to 0.4 V power supply in sub-threshold region at both technology nodes.
- The simulation results show that the power, delay and power-delay product of the KSA increase with the operand size, as expected.
- The KSA's using STA-CMOS logic exhibit the lowest power-delay product for small operands (i.e. 8-bit and 16-bit input data width), whereas for larger operands (i.e. 32-bit and 64-bit), HYB-TG logic has the lower power-delay product at both technologies. Hence, the logic family greatly affects the power-delay product of the circuits.
- The KSA using HYB-PT shows the worst power-delay product at both technologies due to higher power consumption and delay.
- In comparison to 180 nm technology, at 45 nm , the propagation delay is smaller, power consumption is higher (due to increased leakage currents) and power-delay product is smaller for all designs of KSA.

The overall power consumption, propagation delay and power-delay product graphs of KSA's at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology using different logic families are shown in Figure 3.35.



Figure 3.35: The overall power consumption, propagation delay and power-delay product graphs of KSA at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology

### 3.5.5. Design Implementation using HCA

HCA offers an efficient solution to binary addition problem, assures a low computation delay and low power in sub-threshold region.

Conventionally, the parallel prefix adders compute addition in two steps: one to obtain the carry at each bit, and the next to compute the sum bit based on the carry bit. The HCA is a combination of two adders: KSA with $\log _{2} n$ stages and Brent-Kung with $2 \log _{2} n-1$ stages.

The combined effect of both adders provides a reasonably high speed at lower complexity.

The internal architecture of the 16-bit HCA is shown in Figure 3.36, which reveals that, for the same word size, the HC design has one more prefix computation stage (logic level) than the KS design, whereas at the transistor level the number of prefix operations is smaller in the HC design than in the KS design [35].

Thus, the HCA reduces the area in return for one extra stage of delay as compared to the KS adder.

Each logic level of the HC prefix tree places cells at every other bit, and the last logic level accounts for the missing carries. The pre-computation and post-computation blocks are common to all three adders (CLA, KSA and HCA). The transistor-level diagrams of these blocks using Static-CMOS, HYB-TG and HYB-PT logic style are given in sections 3.5.1.1, 3.5.1.2 and 3.5.1.3 respectively.


Figure 3.36: Internal architecture of HCA

Design Methodology: The basic blocks of the HCA are the bitwise propagate/generate logic block, the group propagate/generate logic block and the sum/carry logic block. These basic blocks are implemented using the Boolean expressions of the HCA architecture given by Eq. (3.15) - Eq. (3.19), as discussed in section 3.2.3. AND, OR and XOR gates are the three logic gates widely used in these blocks. The circuit realization of these basic logic gates with competing logic families has already been discussed in section 3.3. The HCA is implemented at different operand sizes using three different logic families, i.e. Static-CMOS, HYB-TG and HYB-PT.
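The trade-off between one extra logic level and fewer prefix operations can be made concrete with a behavioural model of the Han-Carlson tree. The Python sketch below is a functional illustration under stated assumptions, not the implemented circuit: it uses the standard prefix operator $(G, P) \circ (G', P') = (G + P G', P P')$, a carry-in of 0 and an operand width that is a power of two (at least 4). The names `han_carlson_add` and `kogge_stone_cells` are hypothetical.

```python
def han_carlson_add(a, b, n=16):
    """Add two n-bit numbers with a Han-Carlson prefix tree (carry-in = 0).

    Returns (sum modulo 2**n, carry out, prefix stages, prefix cells).
    Assumes n is a power of two and n >= 4.
    """
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]   # bit-wise generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]   # bit-wise propagate
    G, P = g[:], p[:]
    stages = cells = 0

    def stage(pairs):
        """Apply one logic level: (G, P)[i] <- (G, P)[i] o (G, P)[j]."""
        nonlocal G, P, stages, cells
        new_g, new_p = G[:], P[:]
        for i, j in pairs:
            new_g[i] = G[i] | (P[i] & G[j])
            new_p[i] = P[i] & P[j]
        G, P = new_g, new_p
        stages += 1
        cells += len(pairs)

    stage([(i, i - 1) for i in range(1, n, 2)])           # first level: odd bits
    d = 2
    while d < n:                                          # Kogge-Stone levels on odd bits
        stage([(i, i - d) for i in range(d + 1, n, 2)])
        d *= 2
    stage([(i, i - 1) for i in range(2, n, 2)])           # last level: even bits

    carries = [0] + G[:-1]                                # carry into each bit
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[-1], stages, cells


def kogge_stone_cells(n):
    """Prefix cells of an n-bit Kogge-Stone tree: n - d cells at distance d."""
    d, cells = 1, 0
    while d < n:
        cells += n - d
        d *= 2
    return cells


print(han_carlson_add(12345, 54321), kogge_stone_cells(16))
```

For 16-bit operands this model uses five prefix levels ($\log _{2} n+1$) and 32 prefix cells, against four levels and 49 cells for a Kogge-Stone tree of the same width.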

### 3.5.6. Simulation Methodology and Results of HCA

The methodology followed for simulation of HCA in sub-threshold region is same as given in Section 3.5.2. The power, delay and power-delay product values are evaluated for HCA's using Static-CMOS logic, HYB-TG logic and HYB-PT logic style for different operand sizes. In this thesis, the following nomenclature as shown in Table 3.17, is used to represent the proposed designs of HCA adders:

Architecture (HCA) - technology (45/ 180) - logic family (STA-CMOS, HYB-TG, and HYB-PT)

Table 3.17: Nomenclature used for the proposed designs of HCA

| S. No. | Adder Design Descriptions | Nomenclature |
| :---: | :--- | :--- |
| 1 | HCA at 45 nm using Static-CMOS logic family | HCA-45-STA-CMOS |
| 2 | HCA at 45 nm using hybrid TG logic family | HCA-45-HYB-TG |
| 3 | HCA at 45 nm using hybrid PT logic family | HCA-45-HYB-PT |
| 4 | HCA at 180 nm using Static-CMOS logic family | HCA-180-STA-CMOS |
| 5 | HCA at 180 nm using hybrid TG logic family | HCA-180-HYB-TG |
| 6 | HCA at 180 nm using hybrid PT logic family | HCA-180-HYB-PT |

The measured power, delay and power-delay product of the HCAs at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology are given in Table 3.18 and Table 3.19. All designs operate at a 0.4 V power supply voltage for sub-threshold operation.

Table 3.18: Simulation results of HCA's at 45 nm technology

| Module Name | No. of Bits | Power ($\mu\mathrm{W}$) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | Area ($\mu\mathrm{m}^{2}$) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| HCA-45-STA-CMOS | 8 | 0.012 | 03.74 | 0.044 | 0196.17 |
|  | 16 | 0.035 | 07.20 | 0.252 | 0468.36 |
|  | 32 | 0.197 | 15.90 | 3.158 | 1084.08 |
|  | 64 | 0.291 | 19.89 | 5.802 | 2254.74 |
| HCA-45-HYB-TG | 8 | 0.098 | 04.08 | 0.403 | 0114.35 |
|  | 16 | 0.302 | 06.39 | 1.933 | 0217.24 |
|  | 32 | 0.401 | 07.09 | 2.846 | 0893.21 |
|  | 64 | 0.506 | 09.06 | 4.587 | 1835.74 |
| HCA-45-HYB-PT | 8 | 0.279 | 33.67 | 9.421 | 089.351 |
|  | 16 | 0.398 | 40.12 | 15.99 | 0112.41 |
|  | 32 | 0.725 | 55.10 | 39.95 | 0474.35 |
|  | 64 | 1.795 | 64.12 | 115.1 | 1565.54 |

Table 3.19: Simulation results of HCA's at 180 nm technology

| Module Name | No. of Bits | Power ($\mu\mathrm{W}$) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | Area ($\mu\mathrm{m}^{2}$) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| HCA-180-STA-CMOS | 8 | 0.006 | 395.1 | 002.3 | 016,214 |
|  | 16 | 0.018 | 449.0 | 008.1 | 025,784 |
|  | 32 | 0.081 | 534.0 | 043.3 | 072,547 |
|  | 64 | 0.177 | 619.0 | 109.0 | 125,141 |
| HCA-180-HYB-TG | 8 | 0.079 | 044.2 | 003.4 | 012,874 |
|  | 16 | 0.295 | 050.1 | 014.7 | 019,542 |
|  | 32 | 0.338 | 061.4 | 020.7 | 068,562 |
|  | 64 | 0.473 | 073.1 | 034.5 | 114,260 |
| HCA-180-HYB-PT | 8 | 0.196 | 309.8 | 060.7 | 010,780 |
|  | 16 | 0.308 | 356.8 | 109.8 | 015,745 |
|  | 32 | 0.549 | 389.4 | 213.7 | 059,541 |
|  | 64 | 1.692 | 400.5 | 677.6 | 096,347 |

Key Points: The overall results of the HCA show that

- The HCAs operate down to a 0.4 V power supply in the sub-threshold region at both technology nodes.
- The simulation results show that the power, delay and power-delay product of the HCA designs increase with operand size, as expected.
- The HCAs using Static-CMOS logic exhibit the lowest power-delay product for low bit operands (i.e. $8 b \& 16 b$ input data width), whereas for higher bit operands (i.e. 32b \& 64b) HYB-TG logic has the minimum power-delay product at both technologies.
- The HCAs using HYB-PT show the worst power-delay product at both technologies due to higher power consumption and delay.
- In comparison to 180 nm technology, at 45 nm the propagation delay is smaller, the power consumption is higher (due to increased leakage currents) and the power-delay product is smaller for all HCA designs.
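The power-delay product column and the crossover between logic styles can be re-derived from the power and delay columns. The sketch below copies the power and delay values of Table 3.18; the dictionary layout and the helper names `pdp_fj` and `best_style` are illustrative choices made here, not part of the simulation flow. Small differences from the tabulated product arise from rounding of the printed values.

```python
# (power in uW, delay in ns) per operand size, copied from Table 3.18 (45 nm)
TABLE_3_18 = {
    "HCA-45-STA-CMOS": {8: (0.012, 3.74), 16: (0.035, 7.20),
                        32: (0.197, 15.90), 64: (0.291, 19.89)},
    "HCA-45-HYB-TG": {8: (0.098, 4.08), 16: (0.302, 6.39),
                      32: (0.401, 7.09), 64: (0.506, 9.06)},
    "HCA-45-HYB-PT": {8: (0.279, 33.67), 16: (0.398, 40.12),
                      32: (0.725, 55.10), 64: (1.795, 64.12)},
}


def pdp_fj(power_uw, delay_ns):
    """Power-delay product in fJ: 1 uW * 1 ns = 1e-6 W * 1e-9 s = 1e-15 W.s."""
    return power_uw * delay_ns


def best_style(table, bits):
    """Name of the design with the smallest power-delay product at this operand size."""
    return min(table, key=lambda name: pdp_fj(*table[name][bits]))


if __name__ == "__main__":
    for bits in (8, 16, 32, 64):
        print(bits, "bit:", best_style(TABLE_3_18, bits))
```

With these inputs the Static-CMOS design is selected for 8 and 16 bits and the HYB-TG design for 32 and 64 bits, in line with the key points above.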

The overall power consumption, propagation delay and power-delay product graphs of the HCAs at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology using different logic families are shown in Figure 3.37.



Figure 3.37: The overall power consumption, propagation delay and power-delay product graphs of HCA at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology

### 3.6. IMPACT OF RBB ON STATIC-CMOS ADDERS

Table 3.20 and Table 3.21 compare the adders with and without the RBB scheme at $45 \mathrm{~nm}$ and $180 \mathrm{~nm}$ technology respectively. All three adders (CLA, KSA and HCA) operate at a 0.4 V power supply voltage for sub-threshold operation.

Table 3.20: Simulation results of adders at 45 nm technology

| Module Name | No. of Bits | Power ($\mu\mathrm{W}$) |  | Delay (ns) |  | Power-Delay Product ($\times 10^{-15}$ W·s) |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | Without RBB | With RBB | Without RBB | With RBB | Without RBB | With RBB |
| CLA-45-STA-CMOS | 8 | 0.380 | 2.514 | 01.24 | 00.42 | 00.471 | 01.073 |
|  | 16 | 0.625 | 4.061 | 02.36 | 00.87 | 01.480 | 03.549 |
|  | 32 | 0.957 | 7.021 | 03.69 | 01.02 | 03.538 | 07.168 |
|  | 64 | 3.413 | 9.741 | 04.06 | 02.01 | 13.871 | 19.618 |
| KSA-45-STA-CMOS | 8 | 0.014 | 0.041 | 02.57 | 01.74 | 00.037 | 00.071 |
|  | 16 | 0.157 | 0.324 | 04.29 | 02.91 | 00.676 | 00.944 |
|  | 32 | 0.215 | 0.474 | 12.96 | 10.27 | 02.792 | 04.867 |
|  | 64 | 0.317 | 0.511 | 17.24 | 15.98 | 05.472 | 08.165 |
| HCA-45-STA-CMOS | 8 | 0.012 | 0.037 | 03.74 | 02.47 | 00.044 | 00.091 |
|  | 16 | 0.035 | 0.054 | 07.20 | 06.57 | 00.252 | 00.355 |
|  | 32 | 0.197 | 0.285 | 15.96 | 14.69 | 03.158 | 04.186 |
|  | 64 | 0.291 | 0.334 | 19.89 | 18.87 | 05.802 | 06.302 |

Table 3.21: Simulation results of adders at 180 nm technology

| Module Name | No. of Bits | Power ($\mu\mathrm{W}$) |  | Delay (ns) |  | Power-Delay Product ($\times 10^{-15}$ W·s) |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | Without RBB | With RBB | Without RBB | With RBB | Without RBB | With RBB |
| CLA-180-STA-CMOS | 8 | 0.204 | 0.323 | 0045 | 0030.0 | 009.18 | 009.96 |
|  | 16 | 0.575 | 0.711 | 0065 | 0057.0 | 037.37 | 040.52 |
|  | 32 | 0.748 | 0.964 | 0098 | 0079.0 | 073.30 | 076.15 |
|  | 64 | 2.740 | 3.512 | 0100 | 0085.0 | 274.00 | 298.52 |
| KSA-180-STA-CMOS | 8 | 0.009 | 0.034 | 0260 | 0174.0 | 002.34 | 005.91 |
|  | 16 | 0.041 | 0.091 | 0479 | 0347.0 | 019.63 | 031.64 |
|  | 32 | 0.103 | 0.294 | 1140 | 1081.0 | 117.00 | 317.80 |
|  | 64 | 0.234 | 0.374 | 1618 | 1557.0 | 378.00 | 582.30 |
| HCA-180-STA-CMOS | 8 | 0.006 | 0.014 | 0395 | 0257.8 | 002.37 | 003.61 |
|  | 16 | 0.018 | 0.034 | 0449 | 0338.4 | 008.08 | 011.50 |
|  | 32 | 0.081 | 0.107 | 0534 | 0487.3 | 043.30 | 052.14 |
|  | 64 | 0.177 | 0.221 | 0619 | 0540.2 | 109.00 | 119.30 |

Key Points: The overall results of the adders with and without the RBB scheme show that

- With RBB, the propagation delay is smaller, the power consumption is higher (due to increased leakage currents) and the power-delay product is higher for all adder designs than without the RBB scheme.
- At $45 \mathrm{~nm}$ ($180 \mathrm{~nm}$) technology, the RBB scheme reduces the average delay by $61.8 \%$ ( $18.5 \%$ ), increases the average power consumption by $76.8 \%$ ( $22.5 \%$ ) and increases the overall average power-delay product by $38.3 \%$ ( $7.3 \%$ ) with the increase in operand size using Static-CMOS logic.
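The direction of the RBB trade-off can be checked row by row. The sketch below uses the HCA-45-STA-CMOS rows of Table 3.20 and a plain signed percentage change relative to the design without RBB. This per-row definition is an assumption made here for illustration; the sketch does not attempt to reproduce the averaged figures quoted above, whose averaging procedure is not spelled out in this section.

```python
# (without RBB, with RBB) for HCA-45-STA-CMOS, copied from Table 3.20
HCA_45 = {
    8: {"power_uW": (0.012, 0.037), "delay_ns": (3.74, 2.47), "pdp_fJ": (0.044, 0.091)},
    16: {"power_uW": (0.035, 0.054), "delay_ns": (7.20, 6.57), "pdp_fJ": (0.252, 0.355)},
    32: {"power_uW": (0.197, 0.285), "delay_ns": (15.96, 14.69), "pdp_fJ": (3.158, 4.186)},
    64: {"power_uW": (0.291, 0.334), "delay_ns": (19.89, 18.87), "pdp_fJ": (5.802, 6.302)},
}


def percent_change(before, after):
    """Signed change relative to the value without RBB (negative means a reduction)."""
    return 100.0 * (after - before) / before


def rbb_changes(rows):
    """Percentage change of every metric for every operand size."""
    return {
        bits: {metric: percent_change(*pair) for metric, pair in metrics.items()}
        for bits, metrics in rows.items()
    }


if __name__ == "__main__":
    for bits, change in rbb_changes(HCA_45).items():
        print(bits, {metric: round(value, 1) for metric, value in change.items()})
```

For every operand size the delay change is negative while the power and power-delay product changes are positive, which matches the first key point.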


### 3.7. FINAL RESULTS AND DISCUSSION

## Comparison of proposed designs with results of published architectures

This section presents a comparative analysis of the 8-bit, 16-bit, 32-bit and 64-bit CLA, KSA and HCA designs with referenced architectures operated in the sub-threshold region at 45 nm / 180 nm CMOS technology for a 0.4 V supply voltage at the same frequency of operation (200 kHz).

Here, for comparison 16-bit and 32-bit of CLA, KSA and HCA of the referenced architectures [36][37][38][39] are designed using Static-CMOS logic style to obtain their results in the same simulation setup for sub-threshold operation.

Table 3.22 and Table 3.23 show the comparison of the Static-CMOS based proposed adder designs (as obtained from Table 3.12, 3.13, 3.15, 3.16, 3.18 and 3.19) with the referenced architectures in the sub-threshold region.

Table 3.22: Comparative results between proposed and referenced adder designs at 45 nm

| References | Module name | No. of bits | Power ($\mu\mathrm{W}$) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | \% reduction in Power-Delay Product |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ref [36] | CLA | 16 | 1.287 | 8.872 | 11.41826 | (16-bit) - 87.03\% <br> (32-bit) - 91.4\% |
|  |  | 32 | 2.789 | 14.879 | 41.49753 |  |
| Proposed | CLA | 16 | 0.625 | 2.369 | 1.480625 |  |
|  |  | 32 | 0.957 | 3.697 | 3.538029 |  |
| Ref [37] | KSA | 16 | 0.741 | 16.784 | 12.43694 | (16-bit) - 94.5\% <br> (32-bit) - 89.9\% |
|  |  | 32 | 0.987 | 28.221 | 27.85413 |  |
| Proposed | KSA | 16 | 0.157 | 4.297 | 0.674629 |  |
|  |  | 32 | 0.215 | 12.96 | 2.764 |  |
| Ref [37] | HCA | 16 | 0.689 | 20.798 | 14.32982 | (16-bit) - 98.2\% <br> (32-bit) - 92.8\% |
|  |  | 32 | 0.774 | 57.114 | 44.20624 |  |
| Proposed | HCA | 16 | 0.035 | 7.201 | 0.252035 |  |
|  |  | 32 | 0.197 | 15.96 | 3.1441 |  |

Table 3.23: Comparative results between proposed and referenced adder designs at 180 nm

| References | Module name | No. of bits | Power ($\mu\mathrm{W}$) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | \% reduction in Power-Delay Product |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ref [38] | CLA | 16 | 0.977 | 110.5 | 107.9585 | (16-bit) - 65.3\% |
|  |  | 32 | 1.465 | 172.2 | 252.273 |  |
| Proposed | CLA | 16 | 0.575 | 65 | 37.375 | (32-bit) - 70.9\% |
|  |  | 32 | 0.748 | 98 | 73.304 |  |
| Ref [39] | KSA | 16 | 0.551 | 532.2 | 293.2422 | (16-bit) - 93.3\% |
|  |  | 32 | 0.787 | 1232 | 969.584 |  |
| Proposed | KSA | 16 | 0.041 | 479.0 | 19.639 | (32-bit) - 87.8\% |
|  |  | 32 | 0.103 | 1140 | 117.42 |  |
| Ref [11] | HCA | 16 | 0.489 | 511.5 | 250.1235 | (16-bit) - 96.7\% |
|  |  | 32 | 0.674 | 779 | 525.046 |  |
| Proposed | HCA | 16 | 0.018 | 449.0 | 8.082 | (32-bit) - 91.7\% |
|  |  | 32 | 0.081 | 534.0 | 43.254 |  |

The comparative results show that all proposed designs achieve a lower power-delay product than the published referenced architectures.
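The last column of Table 3.22 and Table 3.23 is consistent with the usual relative-reduction formula. Assuming that definition, with $\mathrm{PDP}_{\text{ref}}$ and $\mathrm{PDP}_{\text{prop}}$ denoting the power-delay products of the referenced and proposed designs (symbols introduced here for illustration), the 16-bit CLA entry at 45 nm works out as:

$$
\text{Reduction}(\%)=\frac{\mathrm{PDP}_{\text{ref}}-\mathrm{PDP}_{\text{prop}}}{\mathrm{PDP}_{\text{ref}}} \times 100=\frac{11.41826-1.480625}{11.41826} \times 100 \approx 87.03 \%
$$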

## Comparison of results of proposed designs among themselves

Since all proposed designs improve on the published referenced architectures, they are next compared among themselves to obtain the best design in terms of overall power-delay product at both technology nodes. For comparison purposes, Figure 3.38, Figure 3.39 and Figure 3.40 show the power consumption, propagation delay and power-delay product histograms, respectively, of all eighteen proposed designs using Static-CMOS, HYB-TG and HYB-PT logic.

Figure 3.38: The power consumption graph between CLA, KSA and HCA at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology (a) using Static-CMOS logic (b) using HYB-TG logic (c) using HYB-PT logic


Figure 3.39: The propagation delay graph between CLA, KSA and HCA at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology (a) using Static-CMOS logic (b) using HYB-TG logic (c) using HYB-PT logic

Figure 3.40: The power-delay product graph between CLA, KSA and HCA at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology (a) using Static-CMOS logic (b) using HYB-TG logic (c) using HYB-PT logic

## Observations

The comparative results of the CLA, KSA and HCA show that

- All three adder architectures operate down to 0.4 V with correct functionality using all three logic families (Static-CMOS, HYB-TG and HYB-PT) in the sub-threshold region.
- The power, delay and power-delay product of all three adder designs increase with operand size.

(i) Effect of Logic Design Style:

- Static-CMOS logic gives the lowest power consumption because of simpler logic cells for all three adder designs at both technology nodes in the sub-threshold region.
- HYB-TG gives the lowest propagation delay because of shorter critical paths for all three designs at both technology nodes in the sub-threshold region.
- In KSA and HCA, Static-CMOS logic exhibits the lowest power-delay product for low bit operands (i.e. 8b \& 16b input data width).
- For higher bit operands (i.e. 32b \& 64b), HYB-TG logic gives the lowest power-delay product at both technology nodes in the sub-threshold region.

(ii) Effect of Technology Scaling:

- At the 45 nm technology node, the propagation delay is smaller and the power consumption is higher than at 180 nm for all three adder designs using the three logic design styles. The power consumption increases because the leakage current is larger at 45 nm than at 180 nm while the supply voltage is kept the same at 0.4 V. For a CMOS inverter, simulation gives leakage currents of 1.05 nA and 0.18 nA at 45 nm and 180 nm technology respectively.
- The overall power-delay product is smaller in 45 nm technology.

(iii) Effect of Adder Architecture:

- HCA is the most power efficient architecture for all logic design styles at both technology nodes ( $45 \mathrm{~nm} / 180 \mathrm{~nm}$ ).
- CLA is the high-speed adder architecture for all logic design styles at both technology nodes ( $45 \mathrm{~nm} / 180 \mathrm{~nm}$ ).
- For low bit operands (i.e. 8b \& 16b input data width), KSA and HCA using the Static-CMOS logic style provide the minimum power-delay product.
- For higher bit operands (i.e. 32b \& 64b), KSA and HCA using the HYB-TG logic style provide the minimum power-delay product.
- HCA shows the lowest area at both technology nodes using all three logic design styles.
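The inverter leakage currents quoted under the effect of technology scaling can be turned into a rough static power estimate. Assuming that the static power of a single inverter is simply the product of supply voltage and leakage current, and ignoring switching power (both simplifications made here for illustration), the two nodes compare as:

$$
P_{\text{leak}} \approx V_{DD} \cdot I_{\text{leak}}, \qquad P_{\text{leak}}^{45\,\mathrm{nm}} \approx 0.4\,\mathrm{V} \times 1.05\,\mathrm{nA}=0.42\,\mathrm{nW}, \qquad P_{\text{leak}}^{180\,\mathrm{nm}} \approx 0.4\,\mathrm{V} \times 0.18\,\mathrm{nA}=0.072\,\mathrm{nW}
$$

At the same 0.4 V supply the ratio is $1.05 / 0.18 \approx 5.8$, which is the sense in which leakage raises the power consumption at 45 nm.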


### 3.8. CONCLUSIONS

This chapter explores the design space of CLA, KSA and HCA architectures with different logic design styles and operand sizes at two technology nodes ( $45 \mathrm{~nm} / 180 \mathrm{~nm}$ ) in the sub-threshold region. Selecting appropriate logic families, with an understanding of how each contributes to power, delay and power-delay product in the sub-threshold region, can significantly improve the overall adder computations.

The overall results of the CLA, KSA and HCA lead to the following conclusions at the $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology nodes:

## (i) Power Consumption

The power consumption depends directly on the operand size and the complexity of the adder architecture. Adders with larger operand sizes consume more power for all logic design styles at the same supply voltage and at both technology nodes.

## Comparison of proposed Static-CMOS based (CLA, KSA and HCA) adders with referenced designs [36][37] at 45 nm , [11][38][39] at 180 nm

- At 45 nm: For 16-bit and 32-bit operands, the proposed designs consume less power than the referenced designs. The reduction in power consumption ranges from $34.6 \%$ to $92.8 \%$.
- At 180 nm: For 16-bit and 32-bit operands, the proposed designs consume less power than the referenced designs. The reduction in power consumption ranges from $41.1 \%$ to $96.3 \%$.


## Comparison of all proposed CLA, KSA and HCA designs among themselves

- At 45 nm, HCA has the least power consumption. HCA based designs consume $8.2 \%$ to $77.7 \%$ less power than KSA based designs, and $79.4 \%$ to $96.8 \%$ less power than CLA designs, for all bit operands.
- Similarly, at 180 nm, HCA based designs have the least power consumption. They consume $21.2 \%$ to $56.1 \%$ less power than KSA, and $89.4 \%$ to $97.1 \%$ less power than CLA, for all bit operands.

Therefore, HCA is the most power efficient architecture as compared to KSA and CLA in subthreshold region.
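The 45 nm ranges quoted above can be re-derived from the tabulated Static-CMOS power values. The sketch below copies the "without RBB" power column of Table 3.20 and assumes that the saving is expressed relative to the adder being compared against; the names `POWER_45_UW` and `saving_range` are illustrative.

```python
# Static-CMOS power in uW for 8, 16, 32 and 64 bits at 45 nm
# ("Without RBB" column of Table 3.20)
POWER_45_UW = {
    "CLA": [0.380, 0.625, 0.957, 3.413],
    "KSA": [0.014, 0.157, 0.215, 0.317],
    "HCA": [0.012, 0.035, 0.197, 0.291],
}


def saving_range(base, other):
    """Smallest and largest saving (%) of `base` relative to `other` over all sizes."""
    savings = [100.0 * (o - b) / o for b, o in zip(base, other)]
    return round(min(savings), 1), round(max(savings), 1)


if __name__ == "__main__":
    print("HCA vs KSA:", saving_range(POWER_45_UW["HCA"], POWER_45_UW["KSA"]))
    print("HCA vs CLA:", saving_range(POWER_45_UW["HCA"], POWER_45_UW["CLA"]))
```

Under this definition the sketch returns (8.2, 77.7) against KSA and (79.4, 96.8) against CLA, matching the 45 nm figures above.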

## (ii) Propagation Delay

As the operand size increases, the complexity of the circuits and the gate count in the propagation path also increase, leading to a higher propagation delay. Adders with larger operand sizes show higher propagation delay for all logic design styles at the same supply voltage and at both technology nodes.

## Comparison of proposed Static-CMOS based (CLA, KSA and HCA) adders with referenced designs [11][36][37][38][39]

- At 45 nm: For 16-bit and 32-bit operands, the proposed designs reduce the propagation delay by $54.6 \%$ to $75.1 \%$.
- At 180 nm: For 16-bit and 32-bit operands, the proposed designs reduce the propagation delay by $7.46 \%$ to $43.1 \%$.


## Comparison of all proposed CLA, KSA and HCA designs among themselves

- At 45 nm, CLA based designs have the least propagation delay. Their delay is $44.8 \%$ to $76.4 \%$ lower than that of KSA and $66.6 \%$ to $79.8 \%$ lower than that of HCA based designs, for all bit operands.
- Similarly, at 180 nm, CLA has the least propagation delay. Its delay is $82.6 \%$ to $93.8 \%$ lower than that of KSA and $81.2 \%$ to $88.6 \%$ lower than that of HCA, for all bit operands.

Therefore, CLA is the high-speed adder architecture as compared to KSA and HCA for subthreshold operation.

## (iii) Power-Delay Product

## Comparison of proposed Static-CMOS based (CLA, KSA and HCA) adders with referenced designs [11][36][37][38][39]

- At 45 nm: For 16-bit and 32-bit operands, the proposed designs reduce the power-delay product by $87 \%$ to $98.2 \%$.
- At 180 nm: For 16-bit and 32-bit operands, the proposed designs reduce the power-delay product by $65.3 \%$ to $96.7 \%$.


## Comparison of all proposed CLA, KSA and HCA designs among themselves

In comparison to the other proposed designs showing the highest power-delay product (at both technology nodes):

- For low bit operands (i.e. 8b / 16b), KSA and HCA using Static-CMOS logic give the lowest power-delay product:
  - ($63.7 \%$ / $29.24 \%$) lesser power-delay product for KSA
  - ($32.69 \%$ / $43.64 \%$) lesser power-delay product for HCA
- For higher bit operands (i.e. 32b / 64b), KSA and HCA using HYB-TG logic give the lowest power-delay product:
  - ($40.59 \%$ / $58.99 \%$) lesser power-delay product for KSA
  - ($3.23 \%$ / $15.32 \%$) lesser power-delay product for HCA


## (iv) Effect of RBB Scheme

The HYB-TG and HYB-PT logic families with RBB do not function properly at either technology node in the sub-threshold region.

The use of the RBB scheme improves the sub-threshold conduction current available for circuit operation in Static-CMOS logic. At $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology, the average reduction in propagation delay compared to the circuit design without RBB is approximately $61.87 \% / 18.5 \%$. The average power consumption and power-delay product increase by $77.03 \% / 22.5 \%$ and $38.3 \% / 7.3 \%$ respectively.

The use of the RBB scheme in a CMOS inverter shows that $\mathrm{NM}_{\mathrm{L}}$ reduces by $62.2 \% / 71.7 \%$ at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology respectively.

## (v) Effect of Technology Scaling

At 45 nm, the power consumption is higher than at 180 nm technology for all adder designs using different logic design styles.

The power consumption increases because the leakage current is larger at 45 nm than at 180 nm technology while the supply voltage is kept the same at 0.4 V. Appendix A shows that simulation time has little impact on the power consumption of the adders.

Figure 3.41 shows the Design Space Exploration (DSE) chart of all published and proposed 32-bit CLA, KSA and HCA architectures using different logic design styles at 45 nm technology.

In Figure 3.41, a change of scale on y-axis is shown with a kink ( $\sim$ ) to show the histograms for delay and power-delay product using different logic design styles in its entire range.

This is done to show the comparisons of power-delay product and delay using different logic design styles in same figure.

The same distribution pattern of power, delay and power-delay product across all the adder architectures is found at 180 nm technology.



Figure 3.41: DSE chart of all published referenced and proposed 32-bit CLA, KSA and HCA architectures using different logic design styles (a) power (b) delay (c) power-delay product

## REFERENCES

[1] F. S. Robert, "Arithmetic operations in a binary computer", Proceedings of the Review of Scientific Instruments, vol. 21(8), 1950, pp. 687-693.
[2] H. D. Ross, "The arithmetic element of the IBM type 701 computer", Proceedings of the IRE, 1953, pp. 1287-1294.
[3] M. Eshtawie, S. Hussin and M. Othman, "Analysis of results obtained with a new proposed low area low power high speed fixed point adder", Proceedings of IEEE International Conference on Semiconductor Electronics (ICSE), 2010, pp. 127-130.
[4] A. L. Silburt, A. R. Boothroyd and M. Digiovanni, "Automated parameter extraction and modeling of the MOSFET below threshold", IEEE Transactions on Computer-Aided Design, vol. 7(4), 1988, pp. 484-488.
[5] T. A. Tran and B. M. Bevan, "Design of an energy-efficient 32-bit adder operating at subthreshold voltages in 45-nm CMOS", Third IEEE International Conference on Communications and Electronics (ICCE), 2010, pp. 87-91.
[6] Z. Mohsen and A. Joshi, "Sub-threshold logic circuit design using feedback equalization", IEEE Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014, pp. 1-6.
[7] S. Khanna and B. H. Calhoun, "Serial sub-threshold circuits for ultra-low-power systems", In Proceedings of the ACM/IEEE International Symposium on Low power Electronics and Design, 2009, pp. 27-32.
[8] M. Jucemar, J. L. Güntzel and L. Agostini, "A1CSA: An energy-efficient fast adder architecture for cell-based VLSI design", 18th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2011, pp. 442-445.
[9] B. Valeriu, A. Djupdal and S. Aunet, "Ultra low-power neural inspired addition: when serial might outperform parallel architectures", In Computational Intelligence and Bioinspired Systems Springer Berlin Heidelberg, 2005, pp. 486-493.
[10] A. Snorre, "Nanoelectronics" U.S. Patent No. 7,970,810, 28 Jun, 2011.
[11] M. Talsania and E. John, "A comparative analysis of parallel prefix adders", 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, 2009, pp. 281-286.
[12] A. Abdulmajeed, "Speed comparison of binary adders techniques", PhD Dissertation, University of Victoria, 2015, pp.1-45.
[13] J. Monteiro, J. Güntzel and L. Agostini, "A1CSA: an energy-efficient fast adder architecture for cell-based vlsi design", Proceedings of 18th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2011, pp. 442-445.
[14] H. C. Benton, A. Wang, N. Verma and A. Chandrakasan, "Sub-threshold design: the challenges of minimizing circuit energy", In Proceedings of the 2006 International Symposium on Low Power Electronics and Design, ACM, 2006, pp. 366-368.
[15] R. Nele and W. Dehaene, "Ultra-low-voltage design of energy-efficient digital circuits", Springer International Publishing, 2015, pp. 1-192.
[16] F. K. Gurkaynak, L. Yusuf, L. Chaouat and J. M. Patrik, "Higher radix kogge-stone parallel prefix adder architectures", IEEE International Symposium on Circuits and Systems, ISCAS, vol. 5, 2000, pp. 609-612.
[17] A. Shahzad and M. Vesterbacka, "Performance analysis of radix-4 adders", Integration, The VLSI Journal, vol. 45(2), 2012, pp. 111-120.
[18] J. B. Kim and D. W. Kim, "Low-power carry look-ahead adder with multithreshold voltage cmos technology", International Conference of Semiconductor, 2007, pp. 537-540.
[19] H. Kaur, A. Singh and L. Gupta, "Fbb CMOS tapered buffer with optimal vth selection", Journal on Today's Ideas-Tomorrow's Technologies, vol. 2(2), 2014, pp. 93-105.
[20] K. Uming, P. T. Balsara and W. Lee, "Low-power design techniques for high-performance CMOS adders", IEEE Transactions on Very Large Scale Integration Systems, vol. 3(2), 1995, pp. 321-323.
[21] R. Zlatanovici, S. Kao and B. Nikolić, "Energy-delay optimization of 64-bit carry-lookahead adders with a 240 ps 90 nm CMOS design example", IEEE Journal of Solid-State Circuits, vol. 44(2), 2009, pp. 569-583.
[22] V. G. Oklobdzija, "Digital design and fabrication", CRC Press, 2007, pp. 1-652.
[23] R. Zimmermann and F. Wolfgang, "Low-power logic styles: CMOS versus pass-transistor logic", IEEE Journal of Solid-State Circuits, vol. 32(7), 1997, pp. 1079-1090.
[24] B. Chatterjee, M. Sachdev and R. Krishnamurthy, "A CPL-based dual supply 32-bit ALU for sub 180 nm CMOS technologies", In Proceedings of the International Symposium on Low Power Electronics and Design, ACM, 2004, pp. 248-251.
[25] D. Markovic, B. Nikolic and V. G. Oklobdzija, "A general method in synthesis of pass-transistor circuits", Microelectronics Journal, vol. 31(11), 2000, pp. 991-998.
[26] M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki and Y. Nakagome, "A 1.5ns 32-b CMOS ALU in double pass-transistor logic", IEEE Journal of Solid-State Circuits, vol. 28(11), 1993, pp. 1145-1151.
[27] A. Parameswar, H. Hara and T. Sakurai, "A high speed, low power, swing restored pass-transistor logic based multiply and accumulate circuit for multimedia applications", IEEE Conference on Custom Integrated Circuits, 1994, pp. 278-281.
[28] F. S. Lai and W. Hwang, "Design and Implementation of differential cascode voltage switch with pass-gate (DCVSPG) logic for high-performance digital systems", IEEE Journal of Solid-State Circuits, vol. 32(4), 1997, pp. 563-573.
[29] V. G. Oklobdzija and B. Duchene, "Logic synthesis for pass-transistor design", 4th IEEE International Conference on Solid-State and Integrated Circuit Technology, 1995, pp. 103-105.
[30] A. Baliga and D. Yagain, "Design of high speed adders using CMOS and transmission gates in submicron technology: a comparative study", Fourth International Conference on Emerging Trends in Engineering \& Technology, 2011, pp. 284-289.
[31] S. Narendra, J. Tschanz, J. Hofsheier, B. Bloechel, S. Vangal, Y. Hoskote, S. Tang, D. Somasekhar, A. Keshavarzi, V. Erraguntla and G. Dermer, "Ultra-low voltage circuits and processor in 180 nm to 90 nm technologies with swapped-body biasing technique", IEEE International Conference on Solid-State Circuits, 2004, pp. 156-518.
[32] K. Roy, S. Mukhopadhyay and M. M. Hamid, "Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits", Proceedings of the IEEE, vol. 91(2), 2003, pp. 305-327.
[33] T. C. James, A. K. Jeffrey and D. P. Vallett, "Time-resolved optical characterization of electrical activity in integrated circuits", Proceedings of the IEEE, vol. 88(9), 2000, pp. 1440-1459.
[34] S. Yang, W. Wolf, V. K. Narayanan, Y. Xie and W. Wang, "Accurate stacking effect macro-modeling of leakage power in sub-100 nm circuits", 18th IEEE International Conference on VLSI Design, 2005, pp. 165-170.
[35] P. Ndai, S. L. Lu, D. Somesekhar and K. Roy, "Fine-grained redundancy in adders", Proceedings of the 8th International Symposium on Quality Electronic Design (ISQED'07), 2007, pp. 317-321.
[36] A. Abdulmajeed and F. Gebali, "Performance analysis of 64-bit carry lookahead adders using conventional and hierarchical structure styles", IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), 2015, pp. 80-83.
[37] D. Yagain, V. A. Krishna and A. Baliga, "Design of high-speed adders for efficient digital design blocks", International Scholarly Research Network, ISRN Electronics, vol. 2012, 2012, pp. 1-12.
[38] Y. Chang, "A power-delay efficient hybrid carry-lookahead/carry-select based redundant binary to two's complement converter", IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 55(1), 2008, pp. 336-346.
[39] Sudhakar, M. Sreenivaas, P. C. Kumar and E. E. Swartzlander, "Hybrid han-carlson adder", 55th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), 2012, pp. 818-821.

## CHAPTER 4

## MULTIPLIERS

### 4.1. INTRODUCTION

Power efficient multiplication is an important fundamental function in low power arithmetic operations. Multiplication-based operations such as multiply \& accumulate and inner product are among the frequently used computation-intensive arithmetic functions currently implemented in many Digital Signal Processing applications, such as convolution, fast Fourier transform and digital filters, and in the arithmetic \& logic unit of microprocessors.

In order to sustain the rapid growth of high performance applications, emphasis will be on the incorporation of power efficient modules in present system design. The design of such modules partially relies on reduced power consumption in fundamental arithmetic computation units such as adders and multipliers. Improving multiplier design directly benefits the low power embedded processors used in consumer and industrial electronic products.

In the past five decades, multiplication methods have moved away from the slow add-and-shift techniques to faster, parallel multiplication schemes. In large-scale digital systems, multiplication is performed as a series of additions and shifts. The hardware typically consists of a parallel adder and registers. Therefore, the choice of multiplier architecture is of utmost importance, since its performance determines the whole system response.

In paper [1], it is found that Wallace tree and Dadda multipliers are power efficient architectures and are well suited for super-threshold operation because of reduced complexity and use of efficient adders/compressors. Also, their propagation delay is proportional to the logarithm of the operand word length in comparison to array multipliers whose delay is directly proportional to operand word length as discussed in [2][3].
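The logarithmic growth can be made concrete with the column-height sequence commonly used for Dadda reduction (2, 3, 4, 6, 9, 13, ..., each term being the floor of 1.5 times the previous one). The sketch below counts the reduction stages for an $n \times n$ multiplier under that textbook rule; the function name `dadda_stages` is illustrative, and the count is a stage count only, not a delay figure from this work.

```python
def dadda_stages(n):
    """Number of column-compression stages needed to reduce n rows to 2 rows."""
    stages, height = 0, 2
    while height < n:
        stages += 1
        height = height * 3 // 2   # next target height: floor(1.5 * height)
    return stages


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        # an array multiplier adds its partial products row by row instead,
        # so its depth grows roughly linearly with n
        print(n, "bit operands:", dadda_stages(n), "reduction stages")
```

Doubling the operand size from 8 to 64 bits raises the stage count only from 4 to 10, whereas the number of partial-product rows to be added sequentially in an array multiplier grows from 8 to 64.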

Therefore, the main aim of this work is design and implementation of power efficient column compression multipliers using Wallace tree and Dadda multipliers for sub-threshold region, as few published works are available in this area [4][5].

The multiplication process is a three-step process [6]. The block diagram of an $n \times n$-bit multiplier is shown in Figure 4.1.

The process begins with the generation of all partial products by an AND gate array, which multiplies every multiplicand bit by every multiplier bit. A simple partial product generation method is used for this step.

The second step uses the partial-product reduction module to reduce the partial product matrix to only two operands. To perform parallel computation, lower/higher order compressors (Mixed $\mathrm{L} / \mathrm{H}$ ) are used to accumulate the partial products into two operands in Wallace tree and Dadda multipliers.

The third step is the final computation of the binary result by adders. A ripple carry adder (RCA) and an HCA are used for the final computation in Wallace tree and Dadda multipliers [7].
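The three steps can be traced with a small behavioural model. The sketch below is a simplified column-compression scheme written for illustration: it uses only 3-2 and 2-2 compressors (full and half adders) applied greedily to every column, which is neither the exact Wallace nor the exact Dadda schedule, and the built-in integer addition stands in for the final RCA or HCA. All function names are hypothetical.

```python
def full_adder(x, y, z):
    """3-2 compressor: three bits of equal weight in, (sum, carry) out."""
    return x ^ y ^ z, (x & y) | (y & z) | (x & z)


def half_adder(x, y):
    """2-2 compressor: two bits of equal weight in, (sum, carry) out."""
    return x ^ y, x & y


def partial_products(a, b, n):
    """Step 1: AND-gate array; column k collects the bits of weight 2**k."""
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return cols


def reduce_columns(cols):
    """Step 2: compress until every column holds at most two bits."""
    stages = 0
    while max(len(col) for col in cols) > 2:
        nxt = [[] for _ in range(len(cols) + 1)]
        for k, col in enumerate(cols):
            i = 0
            while len(col) - i >= 3:
                s, c = full_adder(col[i], col[i + 1], col[i + 2])
                nxt[k].append(s)
                nxt[k + 1].append(c)
                i += 3
            if len(col) - i == 2:
                s, c = half_adder(col[i], col[i + 1])
                nxt[k].append(s)
                nxt[k + 1].append(c)
            elif len(col) - i == 1:
                nxt[k].append(col[i])
        cols = nxt
        stages += 1
    return cols, stages


def multiply(a, b, n):
    """Step 3: add the two remaining operands; returns (product, reduction stages)."""
    cols, stages = reduce_columns(partial_products(a, b, n))
    op1 = sum(col[0] << k for k, col in enumerate(cols) if len(col) >= 1)
    op2 = sum(col[1] << k for k, col in enumerate(cols) if len(col) == 2)
    return op1 + op2, stages


if __name__ == "__main__":
    product, stages = multiply(201, 173, 8)
    print(product, "after", stages, "reduction stages")
```

Each compressor preserves the weighted sum of the bits it consumes, so the two operands left after reduction always add up to the product regardless of the compression schedule.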


Figure 4.1: Block diagram of $n \times n$ bit multiplier

Multipliers require a high amount of power and delay during the partial product accumulation stage. At this stage, most multipliers are designed with different kinds of compressors that are capable of adding two or three bits, or at most 4, 5, 6 and 7 bits, by using (2-2/3-2) lower order compressors (LOC's) or (4-2, 5-2, 6-2 and 7-2) high order compressors (HOC's). These compressors are used to perform parallel computation to accumulate the partial products [8].

Therefore, its power and performance determines the overall Wallace tree and Dadda multiplier response. The power consumption of these partial product accumulation modules in multipliers depends upon the choice of logic design style.

Hence, there exist possibilities of making changes in accumulation modules of Wallace tree and Dadda multipliers in sub-threshold region for different bit-widths and different logic design style.

The main aim of this work is design and implementation of Wallace tree and Dadda multipliers by modifying partial product accumulation modules (LOC's and HOC's) using two different technology nodes ( $45 \mathrm{~nm} / 180 \mathrm{~nm}$ ) operated in sub-threshold region. The performance metrics considered for the analysis of the partial product accumulation modules and multipliers are power, delay and power-delay-product.

The rest of the chapter is organized as follows:
Section 4.2 presents the partial product generation scheme. Section 4.3 presents the partial product accumulation scheme using LOC or Mixed L/H in Wallace tree and Dadda multipliers. Section 4.4 presents the internal circuitry of final adders used in Wallace tree and Dadda multipliers. Section 4.5 presents the design and analysis of partial product accumulation modules (LOC's and HOC's) used in Wallace tree and Dadda multiplier and their post layout simulation results for sub-threshold operation.

Section 4.6 presents design and analysis of Wallace tree and Dadda multipliers and their post layout simulation results for sub-threshold operation. Section 4.7 describes the final results and discussions for Wallace tree and Dadda multipliers and Section 4.8 presents a summary of the chapter and the concluding remarks.

### 4.2. PARTIAL PRODUCT GENERATION

Partial product generation is the first step of the multiplication process. Partial products are the intermediate terms generated according to the value of each multiplier bit.

The partial product generation process is conveniently illustrated with a dot diagram. Figure 4.2 shows the dot diagram for the partial products of an $8 \times 8$-bit simple multiplication.


Figure 4.2: Partial products generation of $8 \times 8$-bit simple multiplication

Each partial product is represented by a horizontal row of dots. Each dot in the diagram is a placeholder for a single bit, which can be a zero or a one. If the multiplier bit is '0', the partial product row is all zeros; if it is '1', the row is a copy of the multiplicand. From the second multiplier bit onwards, each partial product row is shifted one position to the left, as shown in Figure 4.2. The final product is represented by the double-length row of dots at the bottom. For the simple multiplication algorithm, the logic consists of a single AND gate per bit, as shown in Figure 4.3.


Figure 4.3: Partial products generation logic for simple multiplication

Figure 4.3 shows the generation logic for a single partial product (a single row of dots). Frequently, this logic can be merged directly into whatever hardware is being used to sum the partial products.

This merging can reduce the delay of the logic elements to the point where the extra time due to the selection elements can be ignored.

However, in a real implementation, there will still be interconnect delay due to the physical separation of the common inputs of each AND gate, and distribution of the multiplicand to the selection elements [9].
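The AND-gate generation scheme of Figures 4.2 and 4.3 can be mimicked behaviourally in a few lines of Python. The sketch below is illustrative only: the helper names `partial_products` and `sum_rows` are assumptions introduced here and are not part of the design flow of this work. It builds one row of AND-ed bits per multiplier bit and checks that the shifted rows add up to the product.

```python
def partial_products(a, b, n=8):
    """Return n rows of partial-product bits (LSB first).

    Row j is the multiplicand ANDed bit by bit with multiplier bit j,
    i.e. one AND gate per dot of the dot diagram.
    """
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    return [[a_i & b_j for a_i in a_bits] for b_j in b_bits]


def sum_rows(rows):
    """Add the rows, with row j shifted left by j positions."""
    total = 0
    for j, row in enumerate(rows):
        for i, bit in enumerate(row):
            total += bit << (i + j)
    return total


rows = partial_products(13, 11)
print(sum_rows(rows) == 13 * 11)
```

A row is all zeros when its multiplier bit is 0 and a copy of the multiplicand otherwise, which is the behaviour described for the dot diagram.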

### 4.3. PARTIAL PRODUCT ACCUMULATION

Partial product accumulation is the second step of the multiplication process, in which the partial product matrix is reduced to only two operands. Conventionally, LOC designs such as the 2-2 and 3-2 LOC's are used to accumulate the partial products into two operands in parallel.

A full adder (FA) is itself a 3-2 LOC, which compresses three bits into two bits. Similarly, the operation of the 2-2 LOC is the same as that of a half adder (HA). Alternatively, circuits that can add four, five, six or seven bits can be designed. These are called higher order compressors (HOC's).

The internal architecture of a HOC is composed of LOC's. The use of Mixed L/H designs simplifies the compression and accumulation process by manipulating the carry propagation within the design [10][11].

The arrangement of the partial products and their reduction stages using LOC's and HOC's in 8x8-bit Wallace tree and Dadda multipliers are described in Sections 4.3.1 to 4.3.4.

### 4.3.1. Partial Product Accumulation using LOC in Wallace Tree

The partial product matrix is reduced to a height of two using the LOC-based partial product accumulation scheme developed by Wallace.

The arrangement of the partial products and the reduction stages using adders in $8 \times 8$-bit Wallace tree is shown in Figure 4.4. The dots represent the partial products.

The partial products are re-arranged in a reverse pyramid style as shown in Figure 4.4 (a).


Figure 4.4: Partial product accumulation using LOC's in 8x8-bit Wallace tree

The iterative procedure is as follows. In Figure 4.4 (a), find the maximum column height in the dot matrix array. If it is greater than 2, reduce the height by following the recursive procedure described below.
i. Check the height of each column. If it is 1, no reduction is done. If it is 2, use a 2-2 LOC; otherwise use a 3-2 LOC and check the height of the column again. Continue the reduction until the height of the column becomes $\leq 1$.
ii. Repeat the above step for all other columns and, at the end, en-queue the 'sum' outputs of all 2-2 and 3-2 LOC's into the same columns and the carry outputs into the adjacent columns.
iii. Again, find the maximum column height and continue the reduction using the above recursive procedure until the maximum height reaches 2.

Figure 4.4 (a), (b), (c), (d) and (e) show the reduction stages for an $8 \times 8$-bit Wallace tree.

Once the height of matrix is reduced to two, final stage adder is used to generate the final product which is described later.
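One possible reading of steps i to iii can be sketched at the level of column heights only, without modelling the bits themselves. The function name `wallace_loc_stages` is hypothetical, and the grouping rule (as many 3-2 LOC's as fit, then a 2-2 LOC for a leftover pair, a single leftover dot passed through) is an assumption about how the column check is repeated.

```python
def wallace_loc_stages(n=8):
    """Apply the column-wise 2-2 / 3-2 reduction to an n x n dot matrix.

    Only column heights are tracked. Returns the list of height vectors,
    one per stage, starting with the initial matrix (column 0 is the LSB).
    """
    heights = [min(i + 1, 2 * n - 1 - i) for i in range(2 * n - 1)]
    history = [heights]
    while max(heights) > 2:
        nxt = [0] * (len(heights) + 1)        # one spare column for a top carry
        for col, h in enumerate(heights):
            full, rest = divmod(h, 3)         # 3-2 LOCs, then what is left over
            half = 1 if rest == 2 else 0      # a 2-2 LOC takes a leftover pair
            passed = 1 if rest == 1 else 0    # a single dot is passed through
            nxt[col] += full + half + passed  # sum outputs stay in the column
            nxt[col + 1] += full + half       # carries go to the adjacent column
        while nxt and nxt[-1] == 0:           # drop unused spare columns
            nxt.pop()
        heights = nxt
        history.append(heights)
    return history


for stage, heights in enumerate(wallace_loc_stages(8)):
    print(stage, heights)
```

Because every column of height 2 is also passed through a 2-2 LOC, this reading can push a carry dot beyond the most significant product column; that dot is always zero in value, and a practical design would simply omit the corresponding cell.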

### 4.3.2. Partial Product Accumulation using Mixed L/H in Wallace Tree

The partial product matrix is reduced to a height of two using the Mixed L/H based partial product accumulation scheme developed by Wallace.

The arrangement of the partial products and the reduction stages using compressors in $8 \times 8$-bit Wallace tree is shown in Figure 4.5. The dots represent the partial products.

The partial products are re-arranged in a reverse pyramid style as shown in Figure 4.5 (a).


Figure 4.5: Partial product accumulation using Mixed L/H in 8x8-bit Wallace tree

The iterative procedure for reduction of column compression matrix to a height of 2 using compressors is described below.

In Figure 4.5 (a), find the maximum column height in the dot matrix array. If it is greater than 2, reduce the height by following the recursive procedure described below.
i. Check the height of each column. If it is 1, no reduction is done. If it is 2, use a 2-2 compressor, which is simply a 2-2 LOC. Use a 3-2 LOC, 4-2 compressor, 5-2 compressor or 6-2 compressor if the height of the column is 3, 4, 5 or 6 respectively; otherwise use a 7-2 compressor and check the height of the column again. Continue the reduction until the height of the column becomes $\leq 1$.
ii. Repeat the above step for all other columns and, at the end, en-queue the 'sum' outputs of all the compressors into the same queues. The single carry of each 2-2 and 3-2 compressor is en-queued into the next queue. For the 4-2, 5-2, 6-2 and 7-2 compressors, the carry c1 is en-queued into the next queue and the carry c2 is en-queued into the queue following it.
iii. Again, find the maximum column height and continue the reduction using the above recursive procedure until the maximum height reaches 2.

Figure 4.5 (a), (b), (c), (d) and (e) show the reduction stages for an $8 \times 8$-bit Wallace tree.
Once the height of the matrix is reduced to two, the final stage adder is used to generate the final product, which is described later.
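The queue-based procedure above can be checked with a small bit-level model. The sketch below is a behavioural approximation under stated assumptions: each compressor is modelled by counting the ones at its inputs (as in a counter), not at gate level, and the names `compress_column`, `wallace_mixed` and `column_value` are hypothetical. Sum outputs stay in their own queue, c1 goes to the next queue and c2 to the queue after it.

```python
def compress_column(bits):
    """Compress one column with 2-2 ... 7-2 cells; a lone bit passes through.

    Returns (sums, c1s, c2s): sums stay in the column, c1 carries go to
    the next column and c2 carries to the column after that.
    """
    sums, c1s, c2s = [], [], []
    while bits:
        group, bits = bits[:7], bits[7:]   # a 7-2 cell is the largest one used
        if len(group) == 1:
            sums.append(group[0])
            continue
        count = sum(group)                 # number of ones at the inputs
        sums.append(count & 1)
        c1s.append((count >> 1) & 1)
        if len(group) >= 4:                # only 4-2 and larger cells have c2
            c2s.append((count >> 2) & 1)
    return sums, c1s, c2s


def wallace_mixed(a, b, n=8):
    """Reduce the n x n partial-product matrix to height two.

    Returns (columns, number of reduction stages).
    """
    cols = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    stages = 0
    while max(len(c) for c in cols) > 2:
        nxt = [[] for _ in range(len(cols) + 2)]
        for k, col in enumerate(cols):
            sums, c1s, c2s = compress_column(col)
            nxt[k] += sums
            nxt[k + 1] += c1s
            nxt[k + 2] += c2s
        cols, stages = nxt, stages + 1
    return cols, stages


def column_value(cols):
    """Weighted sum of all remaining dots (what the final adder computes)."""
    return sum(bit << k for k, col in enumerate(cols) for bit in col)


cols, stages = wallace_mixed(200, 123)
print(stages, column_value(cols) == 200 * 123)
```

Each compressor preserves the weighted sum of its inputs, so the two remaining rows always add up to the product regardless of how many stages the reduction takes.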

### 4.3.3. Partial Product Accumulation using LOC in Dadda Tree

The partial product matrix is reduced to a height of two using the LOC-based partial product accumulation scheme developed by Dadda. The arrangement of the partial products and the reduction stages using adders in $8 \times 8$-bit Dadda tree is shown in Figure 4.6. The dots represent the partial products. The partial products are re-arranged in a reverse pyramid style as shown in Figure 4.6 (a).

The iterative procedure for reduction of column compression matrix to a height of 2 using adders is described below.

In Figure 4.6 (a), find the maximum column height in the dot matrix array. If it is greater than 2, reduce the height by following the recursive procedure described below:
i. Let $h_{1}=2$ and compute $h_{j+1}=\left\lfloor 1.5\, h_{j}\right\rfloor$ for increasing values of $j$. Continue until the largest $j$ is reached for which at least one column in the present stage of the matrix has more dots than $h_{j}$. This gives $h_{1}=2, h_{2}=3, h_{3}=4, h_{4}=6, h_{5}=9$ and so on. For example, in the first stage of the $8 \times 8$ Dadda tree shown in Figure 4.6 (a), the maximum column height is 8; therefore $h_{j}=6$, meaning that the column heights are reduced to a maximum of 6. Similarly, in the second stage, shown in Figure 4.6 (b), the maximum column height is 6 and $h_{j}=4$, so the column heights are reduced to a maximum of 4.
ii. All columns with heights greater than $h_{j}$ are reduced to a height of $h_{j}$ using either 2-2 LOC's or 3-2 LOC's. If the column height has to be reduced by one, use a 2-2 LOC; otherwise use a 3-2 LOC. Continue this step until the column height is reduced to $h_{j}$.
iii. Stop the reduction when the height of the matrix becomes two, after which it can be fed to the final adder.

Figure 4.6: Partial product accumulation using LOC in $8 \times 8$-bit Dadda tree

Figure 4.6 (b), (c), (d) and (e) show the reduction stages for an 8x8-bit Dadda tree.
Once the height of the matrix is reduced to two, the final stage adder is used to generate the final product, which is described later.
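The Dadda height sequence and the LOC-based reduction can be sketched on column heights alone. This is a minimal illustration, assuming the usual right-to-left column sweep in which carries produced in a stage count towards the height of the next column in that same stage; the names `dadda_targets` and `dadda_loc` are hypothetical.

```python
def dadda_targets(max_height):
    """Target heights 2, 3, 4, 6, 9, ... below max_height, largest first."""
    seq = [2]
    while (3 * seq[-1]) // 2 < max_height:
        seq.append((3 * seq[-1]) // 2)   # h_{j+1} = floor(1.5 * h_j)
    return [t for t in reversed(seq) if t < max_height]


def dadda_loc(n=8):
    """Return (final heights, number of 3-2 LOCs, number of 2-2 LOCs, stages)."""
    heights = [min(i + 1, 2 * n - 1 - i) for i in range(2 * n - 1)]
    full = half = stages = 0
    for target in dadda_targets(max(heights)):
        carry_in, nxt = 0, []
        for h in heights:                # column 0 (LSB) first
            h += carry_in                # carries arriving from the previous column
            carry_out = 0
            while h > target:
                if h == target + 1:      # reduce by one: 2-2 LOC
                    half += 1
                    h -= 1
                else:                    # reduce by two: 3-2 LOC
                    full += 1
                    h -= 2
                carry_out += 1
            nxt.append(h)
            carry_in = carry_out
        if carry_in:
            nxt.append(carry_in)
        heights, stages = nxt, stages + 1
    return heights, full, half, stages


print(dadda_targets(8))
print(dadda_loc(8))
```

For the $8 \times 8$ case the targets are 6, 4, 3 and 2, so the matrix reaches height two in four stages, and the sweep uses 35 3-2 LOC's and 7 2-2 LOC's.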

### 4.3.4. Partial Product Accumulation using Mixed L/H in Dadda Tree

The partial product matrix is reduced to a height of two using Mixed L/H based partial product accumulation scheme developed by Dadda. The arrangement of the partial products and the reduction stages using compressors in 8x8-bit Dadda tree is shown in Figure 4.7. The dots represent the partial products.

The partial products are re-arranged in a reverse pyramid style as shown in Figure 4.7 (a).

Figure 4.7: Partial product accumulation using Mixed L/H in 8x8-bit Dadda tree

The iterative procedure for reduction of the column compression matrix to a height of 2 using compressors is described below.
i. Assume the minimum column height $h_{1}=2$ and calculate the remaining column heights using the formula $h_{j+1}=\left\lfloor 1.5\, h_{j}\right\rfloor$ for increasing values of $j$. Continue until the largest $j$ is reached such that the maximum column height of the multiplier to be designed is attained. This gives $h_{1}=2, h_{2}=3, h_{3}=4, h_{4}=6, h_{5}=9$ and so on. For example, in the first stage of the 8x8-bit Dadda multiplication shown in Figure 4.7 (a), the maximum column height is 8; therefore $h_{j}=6$, meaning that the column heights are reduced to a maximum of 6. Similarly, in the second stage, shown in Figure 4.7 (b), the maximum column height is 6 and $h_{j}=4$, meaning that the column heights are reduced to a maximum of 4.
ii. All columns with heights greater than $h_{j}$ are reduced to a height of $h_{j}$ using compressors of different sizes. If the column height has to be reduced by one, use a 2-2 LOC; if it has to be reduced by two, use a 3-2 LOC. A 4-2 compressor is used if the height has to be reduced by 3, a 5-2 compressor if it has to be reduced by 4, and so on. Continue this step until the column height is reduced to $h_{j}$.
iii. The iterations continue until two elements remain in each queue. Once this state has been reached, the reduction phase is complete and the result can be fed to the final adder.
iv. The first elements of all queues form the first input to the adder and the second elements form the second input to the adder.

Figure 4.7 (b), (c), (d) and (e) show the reduction stages for an 8x8-bit Dadda tree. Once the height of the matrix is reduced to two, an RCA or an HCA is used for the final summation in the Wallace tree and Dadda multipliers.

### 4.4. FINAL STAGE ADDITION IN WALLACE TREE AND DADDA MULTIPLIERS

In Wallace's scheme, the partial products are reduced as soon as possible, whereas Dadda's method performs only the minimum reduction necessary at each level. Dadda's method requires the same number of levels as the Wallace tree method and results in a design with fewer adder or compressor modules. Its disadvantage is that it requires a slightly wider, fast final stage adder and has a less regular structure than the Wallace tree. As a result, the final adder in a Wallace tree multiplier is slightly smaller than the final adder in a Dadda multiplier.

Once all partial products have been reduced to two rows or fewer, the elements in the rows are ready to be summed using an adder. The first elements form the first input to the adder and the second elements form the second input to the adder, as shown in Figure 4.7.

In [12], it is shown that for small operand sizes an RCA can match the speed of parallel addition when operating in the sub-threshold region, while still dissipating less power. The HCA has been found to be more power efficient in the sub-threshold region than the KSA and CLA architectures, as discussed in Chapter 3. Therefore, either an RCA or an HCA is used for the final computation of the binary product in Wallace tree and Dadda multipliers.

### 4.4.1. RCA

An 8-bit RCA is built from eight full adders, where the carry input $\left(\mathrm{C}_{\mathrm{in}}\right)$ of each full adder is the carry output ( $\mathrm{C}_{\text {out }}$ ) of the previous full adder. The block diagram of the 8-bit RCA is shown in Figure 4.8. $\mathrm{C}_{\text {in }}$ of the first full adder is assumed to be 0 and the $\mathrm{C}_{\text {out }}$ of each full adder ripples to the next adder. The delay of the RCA is relatively high, since each full adder must wait for the carry bit to be calculated by the previous full adder. The gate delay can easily be calculated by inspection of the full adder circuit.


Figure 4.8: Block level diagram of 8-bit RCA
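The carry chain of Figure 4.8 can be expressed as a short behavioural model. This is only a sketch: `full_adder` and `ripple_carry_add` are hypothetical helper names, and the model captures the logic function, not the sub-threshold timing.

```python
def full_adder(x, y, c_in):
    """One-bit full adder: returns (sum, carry_out)."""
    s = x ^ y ^ c_in
    c_out = (x & y) | (c_in & (x ^ y))
    return s, c_out


def ripple_carry_add(a, b, n=8):
    """Add two n-bit numbers; the carry ripples from bit 0 to bit n-1.

    Returns (n-bit sum, final carry out). C_in of the first stage is 0.
    """
    carry = 0
    result = 0
    for i in range(n):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


print(ripple_carry_add(200, 100))
```

The loop makes the serial dependency explicit: bit `i` cannot be evaluated until the carry of bit `i - 1` is available, which is why the delay grows linearly with the operand width.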

### 4.4.2. HCA

The HCA computes the addition in two steps: the first obtains the carry at each bit position and the second computes the sum bit from that carry. The hybrid construction of the HC logarithmic prefix adder combines two designs: the KS construction, which takes $\log_{2} n$ stages, and the Brent-Kung construction, which takes $2 \log_{2} n-1$ stages. In essence, the HC adder takes the best feature of the KS adder, i.e., high speed, and the best feature of the Brent-Kung design, i.e., low area, and combines both to provide reasonably good speed at low complexity [13][14].
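The two-step structure (prefix carries first, sum bits second) can be illustrated with a behavioural prefix-adder model. The sketch assumes that HCA denotes a Han-Carlson style network, as the KS/Brent-Kung hybrid description suggests: one Brent-Kung style stage on the odd bits, Kogge-Stone style stages on the odd bits only, and one closing stage for the even bits. The function name `han_carlson_add` is hypothetical, and the gate-level design used in this work is the one described in Chapter 3, not this model.

```python
def han_carlson_add(a, b, n=8):
    """Add two n-bit integers (n a power of two) with a hybrid prefix network."""
    x = [(a >> i) & 1 for i in range(n)]
    y = [(b >> i) & 1 for i in range(n)]
    p = [xi ^ yi for xi, yi in zip(x, y)]   # bit propagate (also the half sum)
    g = [xi & yi for xi, yi in zip(x, y)]   # bit generate
    G, P = g[:], p[:]

    # First stage (Brent-Kung style): each odd bit absorbs its even neighbour.
    for i in range(1, n, 2):
        G[i], P[i] = G[i] | (P[i] & G[i - 1]), P[i] & P[i - 1]

    # Middle stages (Kogge-Stone style) on odd bits only, spans 2, 4, ..., n/2.
    d = 2
    while d < n:
        old_g, old_p = G[:], P[:]
        for i in range(d + 1, n, 2):
            G[i] = old_g[i] | (old_p[i] & old_g[i - d])
            P[i] = old_p[i] & old_p[i - d]
        d *= 2

    # Last stage: each even bit (except bit 0) absorbs the odd prefix below it.
    for i in range(2, n, 2):
        G[i] = G[i] | (P[i] & G[i - 1])

    # Step two: sum bits from the carries (carry into bit 0 is 0).
    carries = [0] + G[:-1]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total | (G[n - 1] << n)          # carry out becomes bit n


print(han_carlson_add(200, 100))
```

In this sketch the prefix network has $\log_{2} n+1$ stages, one more than the KS construction, while only half of the bit positions take part in the middle stages.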

The detailed architecture of the HCA has already been discussed in Chapter 3, Section 3.5.4. The gate level architecture of the HCA is used for the final computation of the binary result of Wallace tree and Dadda multipliers in the sub-threshold region.

### 4.5. DESIGN AND ANALYSIS OF PARTIAL PRODUCT ACCUMULATION MODULES

The internal standard modules are similar for both architectures (Wallace tree and Dadda) with the difference occurring in the procedure of reduction of the partial products and the size of the final adder [15].

To avoid high power consumption and long operation times, different accumulation circuits are developed. The implementation of the basic partial product accumulation modules (using Mixed $\mathrm{L} / \mathrm{H}$ ) and their circuits is given in the following sections.

### 4.5.1. LOC DESIGNS

In partial product accumulation stage, each row of the partial product matrix is input to an array of compressors to perform parallel computation.

Researchers have improved the performance of these two multipliers by reducing the number of compressor cells used in conventional multipliers, by developing a counter based Mixed $\mathrm{L} / \mathrm{H}$ and by using the decomposition logic based new technique [16][17][18].

In Chapter 3, three logic design styles (Static-CMOS, HYB-PT and HYB-TG) are used to implement adders. Of these three, the HYB-PT logic based adder designs are compact but have a higher power-delay product for all operand sizes (8 to 64) and all architectures (CLA, KSA, HCA). Thus, in the present chapter, only two design styles (Static-CMOS and HYB-TG) are used in the implementation of multipliers.

Therefore, in this work, 2-2 and 3-2 LOC's are designed using Static-CMOS and HYB-TG logic in the sub-threshold region.

The circuit implementations of the 2-2 and 3-2 LOC's using Static-CMOS and HYB-TG logic are given in the following sections:

### 4.5.2. LOC Design using Static-CMOS Logic

i. The 2-2 LOC is an important partial product accumulation module. According to the literature [19], the 2-2 LOC using the Static-CMOS logic family gives the best results in the sub-threshold region. The transistor level schematic diagram and layout of the 2-2 LOC using Static-CMOS logic are shown in Figure 4.9.


Figure 4.9: Schematic and layout of 2-2 LOC using Static-CMOS logic

ii. The 3-2 LOC is used in the partial product accumulation module to reduce the number of partial product elements in a particular column. Figure 4.10 shows the schematic diagram of the 3-2 LOC using the Static-CMOS logic style [20][21][22].


Figure 4.10: Schematics and Layouts of 3-2 LOC using Static-CMOS logic

### 4.5.3. LOC Design using HYB-TG Logic

In the HYB-TG logic design style, LOC cells are built from combinations of 2:1 and 4:1 MUXs. The circuit implementations of the 2:1 MUX and 4:1 MUX using HYB-TG logic, together with their layouts, are shown in Figures 4.11 and 4.12 respectively.


Figure 4.11: Schematic and layout of 2:1MUX using HYB-TG logic


Figure 4.12: Schematic and layout of 4:1 MUX using HYB-TG logic

i. The 2-2 LOC cell using the HYB-TG logic design style takes two inputs X1, X2 and generates two outputs, Sum and Carry. The transistor level schematic diagram and layout of the 2-2 LOC using HYB-TG logic are shown in Figure 4.13.


Figure 4.13: Schematic and layout of 2-2 LOC using HYB-TG logic

ii. The 3-2 LOC cell using the HYB-TG logic design style takes three inputs $\mathrm{X} 1, \mathrm{X} 2, \mathrm{X} 3$ and generates two outputs, Sum and Carry. The transistor level schematic diagram and layout of the 3-2 LOC using HYB-TG logic are shown in Figure 4.14.


Figure 4.14: Schematic and layout of 3-2 LOC using HYB-TG logic

### 4.5.4. Higher Order Compressors (HOC)

For higher order multiplication, HOC's (4-2, 5-2, 6-2 and 7-2) are used to compress the bits [23][24]. The HOC's can be derived from a single-bit adder circuit. Each HOC has four, five, six or seven inputs and three outputs.

The HOC cells (4-2, 5-2, 6-2 and 7-2) can be implemented in many different logic structures [25][26]. In general, however, each comprises three main modules. The first module generates the $\mathrm{XOR} / \mathrm{XNOR}$ function, the second module generates the sum and the last module produces the carry output.

Conventionally, compressors are implemented as serially connected full adders and MUXs. At gate level, HOC's are decomposed into XOR gates, and the carry generators are normally implemented by MUXs. Different designs can therefore be classified by their critical path delay, expressed as a number of primitive gates. Several designs of XOR gates and MUXs have already been proposed using different logic styles [27][28]. In [29][30][31], 4-2, 5-2, 6-2 and 7-2 compressors have been designed using different circuit techniques to improve both delay and power.

In this work, the compressors follow the standard hierarchical design approach, in which the HOC's are built from 2-2 and 3-2 compressor cells. The input combinations, the corresponding decimal counts and the outputs of all compressors are shown in Table 4.1.

Table 4.1: Truth table of different Compressors (2-2, 3-2, 4-2, 5-2, 6-2, and 7-2)

| Decimal count | Input conditions | 2-2 outputs (C,S) | 3-2 outputs (C,S) | 4-2 outputs (C2,C1,S) | 5-2 outputs (C2,C1,S) | 6-2 outputs (C2,C1,S) | 7-2 outputs (C2,C1,S) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | All the inputs are zero | (0,0) | (0,0) | (0,0,0) | (0,0,0) | (0,0,0) | (0,0,0) |
| 1 | Any one input is one | (0,1) | (0,1) | (0,0,1) | (0,0,1) | (0,0,1) | (0,0,1) |
| 2 | Any two inputs are one | --- | (1,0) | (0,1,0) | (0,1,0) | (0,1,0) | (0,1,0) |
| 3 | Any three inputs are one | --- | --- | (0,1,1) | (0,1,1) | (0,1,1) | (0,1,1) |
| 4 | Any four inputs are one | --- | --- | --- | (1,0,0) | (1,0,0) | (1,0,0) |
| 5 | Any five inputs are one | --- | --- | --- | --- | (1,0,1) | (1,0,1) |
| 6 | Any six inputs are one | --- | --- | --- | --- | --- | (1,1,0) |
| 7 | All the inputs are one | (1,0) | (1,1) | (1,0,0) | (1,0,1) | (1,1,0) | (1,1,1) |

Note: C, C1 and C2 are the carry bits and S is the sum bit of the compressors. C2 is the most significant bit and S is the least significant bit.
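Table 4.1 amounts to writing the number of inputs that are one in binary. A minimal behavioural sketch, with the hypothetical helper name `compressor_outputs`, reproduces the entries of the table:

```python
def compressor_outputs(n_inputs, ones):
    """Outputs of an n-2 compressor when `ones` of its inputs are 1.

    Returns (C, S) for 2-2 and 3-2 cells and (C2, C1, S) for 4-2 to 7-2 cells.
    """
    if not 2 <= n_inputs <= 7 or not 0 <= ones <= n_inputs:
        raise ValueError("unsupported compressor size or input count")
    s = ones & 1            # least significant bit of the count
    c1 = (ones >> 1) & 1
    c2 = (ones >> 2) & 1    # most significant bit of the count
    return (c1, s) if n_inputs <= 3 else (c2, c1, s)


for n in range(2, 8):
    print(n, [compressor_outputs(n, k) for k in range(n + 1)])
```

The last row of the table ("All the inputs are one") corresponds to `ones == n_inputs` for each compressor size.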

### 4.5.4.1. Conventional and Modified Architecture of 4-2 Compressor

i. Conventional architecture of 4-2 compressor [25][26]: Figure 4.15 shows the conventional architecture of 4-2 compressor.


Figure 4.15: The conventional architecture of 4-2 compressor

Figure 4.15 (a) shows the block diagram of the conventional architecture of the 4-2 compressor.
Figure 4.15 (b) shows that a conventional 4-2 compressor consists of two serially connected 3-2 compressors and has a critical path delay of four XOR/MUX stages. It has five input bits: four inputs (X1, X2, X3 and X4) at the $i^{\text {th}}$ position and a carry-in (Cin1) from the neighboring cell at the $(i-1)^{\text {th}}$ position. It has three outputs: SUM at the $i^{\text {th}}$ position, and Carry and carry-out (Cout1) at the $(i+1)^{\text {th}}$ position.

The conventional 4-2 compressor abides by the fundamental equation given in Eq. (4.1):

$$
\begin{equation*}
\mathrm{X1}+\mathrm{X2}+\mathrm{X3}+\mathrm{X4}+\mathrm{Cin1}=2^{i}\, \mathrm{SUM}+2^{i+1}\,(\mathrm{Carry}+\mathrm{Cout1}) \tag{4.1}
\end{equation*}
$$
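The conventional cell can be modelled as two cascaded full adders and checked exhaustively against Eq. (4.1). The sketch below takes the compressor's own column as the reference, so that SUM has weight 1 and both carries have weight 2; the wiring of the two full adders and the names `full_adder` and `conventional_4_2` are assumptions made for illustration.

```python
from itertools import product


def full_adder(x, y, z):
    """3-2 compressor: returns (sum, carry)."""
    return x ^ y ^ z, (x & y) | (y & z) | (x & z)


def conventional_4_2(x1, x2, x3, x4, cin1):
    """Two serially connected 3-2 compressors.

    Returns (SUM, Carry, Cout1); Cout1 does not depend on cin1.
    """
    s1, cout1 = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin1)
    return total, carry, cout1


holds = all(
    sum(bits) == s + 2 * (carry + cout1)
    for bits in product((0, 1), repeat=5)
    for s, carry, cout1 in [conventional_4_2(*bits)]
)
print(holds)
```

Because Cout1 is formed from X1, X2 and X3 only, a row of such cells does not ripple the carry-in across more than one position.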

The operation and implementation of $4 \times 4$-bit Wallace tree multiplier using conventional architecture of compressors is shown in Figure 4.16.



Figure 4.16: The conventional architecture of compressors used in $4 \times 4$-bit Wallace tree multiplier

Figure 4.16 (a) shows the operation, using a dot diagram, of the conventional compressor based $4 \times 4$-bit Wallace tree multiplier. It uses a total of ten compressors: six 2-2 compressors, three 3-2 compressors and one 4-2 compressor cell.

Figure 4.16 (b) shows the implementation and the signal flow in its critical delay path. The critical delay path uses the following number of gates:

Critical Path (Conventional) = (One 4-2 Compressor + One 3-2 Compressor)
= {(Four XORs, Two MUXs) + (Two XORs, One MUX)}

ii. Modified architecture of 4-2 compressor: Figure 4.17 shows the modified architecture of 4-2 compressor.


Figure 4.17: The modified architecture of 4-2 compressor

Figure 4.17 (a) shows the block diagram of the modified architecture of the 4-2 compressor.
Figure 4.17 (b) shows that, in contrast to the conventional design, the modified 4-2 compressor cell is composed of one 3-2 compressor and two 2-2 compressors connected in series. It has four input bits (X1, X2, X3 and X4) at the $i^{\text {th}}$ position. It has three outputs: the final SUM at the $i^{\text {th}}$ position, Carry1 at the $(i+1)^{\text {th}}$ position and Carry2 at the $(i+2)^{\text {th}}$ position. The modified 4-2 compressor cell does not include a carry-in (Cin1) bit from the neighboring cell.

The modified 4-2 compressor abides by the fundamental equation given in Eq. (4.2):

$$
\begin{equation*}
\mathrm{X1}+\mathrm{X2}+\mathrm{X3}+\mathrm{X4}=2^{i}\, \mathrm{SUM}+2^{i+1}\, \mathrm{Carry1}+2^{i+2}\, \mathrm{Carry2} \tag{4.2}
\end{equation*}
$$
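A behavioural model of the modified cell, built from one 3-2 and two 2-2 compressors, can be checked against Eq. (4.2) with the compressor's own column as the reference (weights 1, 2 and 4). The particular interconnection below is one plausible arrangement and is an assumption; the names `half_adder`, `full_adder` and `modified_4_2` are hypothetical.

```python
from itertools import product


def half_adder(x, y):
    """2-2 compressor: returns (sum, carry)."""
    return x ^ y, x & y


def full_adder(x, y, z):
    """3-2 compressor: returns (sum, carry)."""
    return x ^ y ^ z, (x & y) | (y & z) | (x & z)


def modified_4_2(x1, x2, x3, x4):
    """One 3-2 and two 2-2 compressors; returns (SUM, Carry1, Carry2)."""
    s1, c_a = full_adder(x1, x2, x3)       # compress the first three inputs
    total, c_b = half_adder(s1, x4)        # add the fourth input
    carry1, carry2 = half_adder(c_a, c_b)  # merge the two weight-2 carries
    return total, carry1, carry2


holds = all(
    sum(bits) == s + 2 * c1 + 4 * c2
    for bits in product((0, 1), repeat=4)
    for s, c1, c2 in [modified_4_2(*bits)]
)
print(holds)
```

With all four inputs at one the model returns (SUM, Carry1, Carry2) = (0, 0, 1), i.e. (C2, C1, S) = (1, 0, 0), in agreement with the 4-2 column of Table 4.1.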

The operation and implementation of $4 x 4$-bit Wallace tree multiplier using modified architecture of compressors is shown in Figure 4.18.


Figure 4.18: The modified architecture of compressors used in $4 \times 4$-bit Wallace tree multiplier

Figure 4.18 (a) shows the operation, using a dot diagram, of the modified compressor based $4 \times 4$-bit Wallace tree multiplier. It uses a total of six modified compressors: two 2-2 compressors, three 3-2 compressors and one 4-2 compressor cell.

Figure 4.18 (b) shows the implementation and the signal flow in its critical delay path. The critical delay path uses the following number of gates:

Critical Path (Modified) = (One 4-2 Compressor + One 3-2 Compressor)
= {(Six MUXs) + (Two MUXs)}

Therefore, in contrast to the conventional compressor based 4x4-bit Wallace tree multiplier, the modified compressor based multiplier reduces the total number of compressor cells (six instead of ten), which reduces the overall area of the multiplier.

Hence, in this work, based on less hardware requirement, modified compressor cells are used for implementation of partial product accumulation stage of multipliers.

### 4.5.4.2. Design Implementation of Modified HOC's (5-2, 6-2 and 7-2)

All modified HOC cells (5-2, 6-2 and 7-2) are implemented using two different logic styles (Static-CMOS and HYB-TG). The transistor level diagrams of the 2-2 and 3-2 compressors using the Static-CMOS and HYB-TG design styles have been given in Sections 4.5.2 and 4.5.3.

In the modified HOC's (5-2, 6-2 and 7-2), the internal signals (Cout1, Cout2 and Cout3) from one internal block act as the carry inputs to the succeeding block, and the cell finally generates one SUM and two carry (Carry1, Carry2) outputs, as shown in Figure 4.19.

The internal blocks and designs of the compressor cells are shown below.


Figure 4.19: The schematics and layouts of modified HOC's: (a) modified 5-2 compressor, (b) modified 6-2 compressor

### 4.5.5. Performance Analysis of Compressor Designs using Different Logic Design Styles for Sub-Threshold Operation

This section gives the performance analysis of the compressor designs for partial product accumulation using the two selected logic design styles (Static-CMOS and HYB-TG).

Table 4.2 gives the simulated power, delay and power-delay product of all compressor designs using the Static-CMOS and HYB-TG logic design styles.

The simulation results show that all the modules are properly functional in the sub-threshold region at a supply voltage as low as 0.4 V. The schematics and layouts are designed and simulated using two different technologies ( $45 \mathrm{~nm}$ and $180 \mathrm{~nm}$ ). The designs are characterized in terms of power, delay and power-delay product. For minimum power-delay product, the W/L ratios of all MOS transistors are chosen to keep the pull-up to pull-down network ratio at $2: 1$ in all designed logic modules [8].

Table 4.2: Simulation results of LOC and modified HOC Designs

| Module name | Logic style | Number of transistors | Power (nW), 45 nm | Power (nW), 180 nm | Delay (ns), 45 nm | Delay (ns), 180 nm | Power-delay product ($\times 10^{-18}$ W·s), 45 nm | Power-delay product ($\times 10^{-18}$ W·s), 180 nm |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LOC DESIGNS |  |  |  |  |  |  |  |  |
| 2-2 | Static-CMOS | 14 | 1.815 | 0.531 | 0.268 | 52.690 | 0.486 | 27.978 |
| 2-2 | HYB-TG | 26 | 4.819 | 0.947 | 0.209 | 47.575 | 2.866 | 45.053 |
| 3-2 | Static-CMOS | 28 | 2.608 | 1.550 | 1.362 | 71.655 | 3.552 | 111.065 |
| 3-2 | HYB-TG | 42 | 8.438 | 5.262 | 0.902 | 62.575 | 7.611 | 329.200 |

| Module name | Logic style | Number of transistors | Power (nW), 45 nm | Power (nW), 180 nm | Delay (ns), 45 nm | Delay (ns), 180 nm | Power-delay product ($\times 10^{-15}$ W·s), 45 nm | Power-delay product ($\times 10^{-15}$ W·s), 180 nm |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MODIFIED HOC DESIGNS |  |  |  |  |  |  |  |  |
| 4-2 | Static-CMOS | 56 | 14.782 | 9.770 | 8.747 | 374.410 | 0.129 | 3.657 |
| 4-2 | HYB-TG | 94 | 21.150 | 19.860 | 2.953 | 154.891 | 0.624 | 3.076 |
| 5-2 | Static-CMOS | 70 | 28.410 | 22.420 | 15.341 | 411.870 | 0.435 | 9.234 |
| 5-2 | HYB-TG | 110 | 34.750 | 30.231 | 3.435 | 257.790 | 0.119 | 7.792 |
| 6-2 | Static-CMOS | 98 | 38.980 | 34.740 | 21.210 | 552.310 | 0.826 | 19.187 |
| 6-2 | HYB-TG | 152 | 44.260 | 39.990 | 4.358 | 326.540 | 0.192 | 13.058 |
| 7-2 | Static-CMOS | 112 | 47.440 | 41.040 | 33.740 | 614.110 | 1.600 | 25.184 |
| 7-2 | HYB-TG | 168 | 52.371 | 46.730 | 5.633 | 412.370 | 0.295 | 19.270 |
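The power-delay product entries are the product of the corresponding power and delay values. A small sketch of the unit handling, using the hypothetical helper `pdp_joules` and the 45 nm Static-CMOS values of the 2-2, 3-2 and 4-2 cells from Table 4.2, is given below.

```python
def pdp_joules(power_nw, delay_ns):
    """Power-delay product in W*s (joules) from power in nW and delay in ns."""
    return (power_nw * 1e-9) * (delay_ns * 1e-9)


# 45 nm Static-CMOS entries of Table 4.2: (power in nW, delay in ns)
cells = {"2-2": (1.815, 0.268), "3-2": (2.608, 1.362), "4-2": (14.782, 8.747)}
for name, (power_nw, delay_ns) in cells.items():
    print(name, pdp_joules(power_nw, delay_ns))
```

Since 1 nW multiplied by 1 ns equals $10^{-18}$ W·s, the LOC values can be read directly in that unit, whereas the HOC values are quoted in units of $10^{-15}$ W·s.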

For the comparative analysis of the modified and published compressor designs under same simulation settings, we have taken the conventional architectures from referred publications and implemented them using the same technology and supply voltage.

The architecture of conventional 2-2, 3-2, 4-2, 5-2, 6-2 and 7-2 compressors and their design parameters are taken from references [8][11][32][33][34].

Table 4.3 and Table 4.4 show comparative simulation results of LOC and modified HOC compressors with designs of published references at 0.4 V power supply using $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology.

Here, only the best results for the LOC and modified HOC compressors obtained from Table 4.2 are compared.

Table 4.3: Comparative results of proposed modified compressor with designs of published references at 45 nm technology

| References | Design style | Module name | Power (nW) | Delay (ns) | Power-delay product ($\times 10^{-18}$ W·s) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Ref [32] | CPL | 2-2 Compressor | 6.32 | 28.87 | 182.4 |
| Ref [33] | DPL | 3-2 Compressor | 12.78 | 61.12 | 781.1 |
| LOC designs (proposed) | Static-CMOS | 2-2 Compressor | 1.81 | 0.26 | 0.486 |
| LOC designs (proposed) | Static-CMOS | 3-2 Compressor | 2.60 | 1.36 | 3.552 |
| Ref [8] | CPL and DPL | 4-2 Compressor | 18.48 | 27.70 | 511.8 |
| Ref [8] | CPL and DPL | 5-2 Compressor | 25.55 | 87.21 | 2,228 |
| Ref [34] | Static-CMOS | 6-2 Compressor | 28.13 | 184.50 | 5,189 |
| Ref [33] | DPL | 7-2 Compressor | 43.78 | 135.90 | 5,949 |
| Modified HOC designs (proposed) | HYB-TG | 4-2 Compressor | 21.15 | 2.95 | 62.45 |
| Modified HOC designs (proposed) | HYB-TG | 5-2 Compressor | 34.75 | 3.43 | 119.36 |
| Modified HOC designs (proposed) | HYB-TG | 6-2 Compressor | 44.26 | 4.35 | 192.88 |
| Modified HOC designs (proposed) | HYB-TG | 7-2 Compressor | 52.37 | 5.63 | 295.00 |

Table 4.4: Comparative results of proposed LOC and modified HOC compressors with designs of published references at 180 nm technology

| References | Design style | Module name | Power (nW) | Delay ($\mu$s) | Power-delay product ($\times 10^{-15}$ W·s) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Ref [32] | CPL | 2-2 Compressor | 0.441 | 0.357 | 0.157 |
| Ref [33] | DPL | 3-2 Compressor | 3.417 | 0.787 | 2.689 |
| LOC designs (proposed) | Static-CMOS | 2-2 Compressor | 0.531 | 0.052 | 0.027 |
| LOC designs (proposed) | Static-CMOS | 3-2 Compressor | 1.550 | 0.071 | 0.111 |
| Ref [8] | CPL and DPL | 4-2 Compressor | 15.47 | 1.474 | 22.802 |
| Ref [8] | CPL and DPL | 5-2 Compressor | 19.74 | 2.470 | 48.757 |
| Ref [34] | Static-CMOS | 6-2 Compressor | 23.31 | 4.311 | 100.48 |
| Ref [33] | DPL | 7-2 Compressor | 40.88 | 3.235 | 132.24 |
| Modified HOC designs (proposed) | HYB-TG | 4-2 Compressor | 19.86 | 0.154 | 3.076 |
| Modified HOC designs (proposed) | HYB-TG | 5-2 Compressor | 30.23 | 0.257 | 7.792 |
| Modified HOC designs (proposed) | HYB-TG | 6-2 Compressor | 39.99 | 0.326 | 13.058 |
| Modified HOC designs (proposed) | HYB-TG | 7-2 Compressor | 46.73 | 0.412 | 19.270 |

The overall propagation delay and power-delay product of all LOC and modified HOC cells are lower than the published results for conventional compressor cells. However, the modified HOCs have, on average, $23.9 \%$ ( $29.2 \%$ ) higher power consumption at $45 \mathrm{~nm}(180 \mathrm{~nm})$ technology.

Key Points: The results of the modified compressor designs show that

- For LOC designs, the Static-CMOS logic design style provides a lower power-delay product at both technologies in the sub-threshold region.
- For modified HOC designs, the HYB-TG logic design style provides a lower power-delay product at both technologies in the sub-threshold region.
- At 45 nm , for all LOC and modified HOC designs, the propagation delay is smaller, the power consumption is higher and the power-delay product is smaller in comparison to 180 nm technology.
- The overall power-delay product of the LOC and modified HOC designs is lower than that of the published reference designs at 0.4 V supply voltage at both technologies.

Outcome: Based on the above results, the Static-CMOS logic family is used for the final implementation of multipliers using LOCs. Similarly, the HYB-TG logic family is used for the final implementation of multipliers using modified HOCs.

### 4.6. DESIGN AND ANALYSIS OF WALLACE TREE AND DADDA MULTIPLIERS

In this section, the design and analysis of multipliers using two architectures (Wallace tree and Dadda) with two different partial-product accumulation schemes is given. These schemes are given below:

- Using LOC
- Using Mixed L/H
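The two schemes differ in the compressor cells available to the accumulation tree: going by Tables 4.3 and 4.4, the LOC cells are the 2-2 and 3-2 compressors, and the HOC cells are the 4-2 to 7-2 compressors. As a purely behavioural illustration, the sketch below models a 2-2, a 3-2 and a textbook 4-2 compressor (two cascaded 3-2 cells) in Python and checks that each preserves the arithmetic sum of its inputs. The function names are hypothetical, and the 4-2 construction is the generic one, not the modified HOC circuit proposed in this thesis.

```python
from itertools import product


def compressor_2_2(a, b):
    """2-2 compressor (half adder): returns (sum, carry)."""
    return a ^ b, a & b


def compressor_3_2(a, b, c):
    """3-2 compressor (full adder): returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(x1, x2, x3, x4, cin):
    """Generic 4-2 compressor from two cascaded 3-2 compressors.

    Returns (sum, carry, cout); carry and cout both carry weight 2.
    This is an assumed textbook structure, not the thesis circuit.
    """
    s1, cout = compressor_3_2(x1, x2, x3)
    s, carry = compressor_3_2(s1, x4, cin)
    return s, carry, cout


def check_compressors():
    """Exhaustively confirm that every cell preserves the sum of its inputs."""
    for a, b in product((0, 1), repeat=2):
        s, c = compressor_2_2(a, b)
        if a + b != s + 2 * c:
            return False
    for a, b, c in product((0, 1), repeat=3):
        s, carry = compressor_3_2(a, b, c)
        if a + b + c != s + 2 * carry:
            return False
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
    return True
```

Because `cout` depends only on `x1`, `x2` and `x3`, neighbouring 4-2 cells in a column do not form a ripple chain through `cin`, which is the usual argument for using higher-order compressors in the accumulation stage.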

An example of operation of $8 \times 8$ multiplier shown in Figure 4.20 is used to explain the partial product accumulation schemes using Wallace tree and Dadda compression tree.

Here, X0, X1, X2, ..., X7 are the multiplicand bits and Y0, Y1, Y2, ..., Y7 are the multiplier bits. PXY is the partial product of bits X and Y, and Prn is the $\mathrm{n}^{\text {th }}$ bit of the final product of the multiplier.
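The notation above can be made concrete with a small behavioural sketch. It assumes the usual unsigned array arrangement in which each partial-product bit is the AND of one multiplicand bit and one multiplier bit, weighted by the sum of their positions. The function names are illustrative only.

```python
def partial_products(x, y, n=8):
    """Return the n x n matrix pp[j][i] = X_i AND Y_j for n-bit unsigned operands."""
    x_bits = [(x >> i) & 1 for i in range(n)]
    y_bits = [(y >> j) & 1 for j in range(n)]
    return [[x_bits[i] & y_bits[j] for i in range(n)] for j in range(n)]


def product_from_partials(pp):
    """Add every partial-product bit with weight 2^(i + j) to form the product."""
    return sum(bit << (i + j)
               for j, row in enumerate(pp)
               for i, bit in enumerate(row))
```

For an $8 \times 8$-bit multiplier this gives 64 partial-product bits arranged in 15 columns, which the accumulation tree then reduces to two rows for the final adder.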


Figure 4.20: Multiplication of $8 \times 8$-bit multiplier

The internal architectures of the $8 \times 8$-bit multipliers using Wallace tree and Dadda schemes and their circuit implementations are given below.

### 4.6.1. Design Implementation of Wallace tree multiplier

In the Wallace tree scheme, the partial products are reduced as early as possible. The internal architectures of the Wallace tree multiplier using LOC and Mixed L/H are shown in Figure 4.21 and Figure 4.22, respectively. The circuit-level diagrams of the blocks for partial-product generation, partial-product accumulation and final addition are given in Sections 4.3, 4.5 and 4.4, respectively.
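A minimal behavioural sketch of this "reduce as early as possible" strategy is given below. It assumes the textbook row-grouping form of the Wallace tree that uses only 3-2 and 2-2 cells (comparable to the LOC scheme), and it models the final adder with ordinary integer addition. It is not the circuit of Figure 4.21, and all names are hypothetical.

```python
def full_adder(a, b, c):
    """3-2 cell: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def half_adder(a, b):
    """2-2 cell: returns (sum, carry)."""
    return a ^ b, a & b


def partial_product_rows(x, y, n):
    """Row j maps column (bit weight) i + j to the bit X_i AND Y_j."""
    return [{i + j: ((x >> i) & 1) & ((y >> j) & 1) for i in range(n)}
            for j in range(n)]


def wallace_reduce(rows):
    """Compress rows three at a time until only two remain.

    Returns (remaining_rows, number_of_reduction_stages).
    """
    stages = 0
    while len(rows) > 2:
        next_rows = []
        grouped = len(rows) // 3 * 3
        for g in range(0, grouped, 3):
            group = rows[g:g + 3]
            sum_row, carry_row = {}, {}
            for col in sorted(set().union(*group)):
                bits = [r[col] for r in group if col in r]
                if len(bits) == 3:
                    sum_row[col], carry_row[col + 1] = full_adder(*bits)
                elif len(bits) == 2:
                    sum_row[col], carry_row[col + 1] = half_adder(*bits)
                else:
                    sum_row[col] = bits[0]  # a single bit passes through
            next_rows.append(sum_row)
            if carry_row:
                next_rows.append(carry_row)
        next_rows.extend(rows[grouped:])  # leftover rows wait for the next stage
        rows = next_rows
        stages += 1
    return rows, stages


def wallace_multiply(x, y, n=8):
    """Return (product, stages); the final two-row addition uses Python integers."""
    rows, stages = wallace_reduce(partial_product_rows(x, y, n))
    total = sum(bit << col for row in rows for col, bit in row.items())
    return total, stages
```

In this model the row count of an $8 \times 8$-bit multiplier falls as 8, 6, 4, 3, 2, so four reduction stages precede the final adder.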


Figure 4.21: Block diagram of Wallace tree using LOC


Figure 4.22: Block diagram of Wallace tree using Mixed L/H

### 4.6.2. Simulation Methodology and Results of Wallace tree Multipliers

To obtain the simulation results for the Wallace tree multipliers in sub-threshold region, the following methodology is followed:

The $4 \times 4$-bit and $8 \times 8$-bit Wallace tree multipliers using LOC and Mixed $\mathrm{L} / \mathrm{H}$ cells are designed in Cadence Virtuoso (schematic and layout). Two technology nodes, 45 nm and 180 nm , are considered for implementation, and the designs are simulated using the BSIM3 (V3.24) model at a supply voltage of 0.4 V . For both technology nodes, transient simulations are performed by applying input pulses with rise and fall times of 1 ps , a pulse width (ON time) of $1 \mu \mathrm{s}$ and a pulse period of $5 \mu \mathrm{s}$, and the power, delay and power-delay product values are evaluated. A total of eight Wallace tree multiplier designs are implemented using different combinations of partial-product accumulation modules (LOC, Mixed $\mathrm{L} / \mathrm{H}$ ) and final-stage adders (RCA, HCA) at the two technology nodes ( $45 \mathrm{~nm}, 180 \mathrm{~nm}$ ).
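For reference, the pulse period of the input stimulus fixes the operating frequency, which is the 200 kHz figure quoted for the comparisons in Section 4.7:

$$
f=\frac{1}{T}=\frac{1}{5\ \mu\mathrm{s}}=200\ \mathrm{kHz}
$$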

In this thesis, the nomenclature shown in Table 4.5 is used to represent the proposed multiplier designs:

Architecture (Wallace tree) - technology (45/180) - compressor LOC (L) / Mixed L/H (L/H) - final adder (RCA/HCA)

Table 4.5: Nomenclature used for the proposed designs of Wallace tree multiplier

| S. No. | Multiplier Design Descriptions | Nomenclature |
| :---: | :--- | :--- |
| 1 | Wallace tree multiplier at 45 nm using LOC based accumulation scheme with final addition through RCA | Wallace tree-45-L-RCA |
| 2 | Wallace tree multiplier at 45 nm using LOC based accumulation scheme with final addition through HCA | Wallace tree-45-L-HCA |
| 3 | Wallace tree multiplier at 45 nm using Mixed L/H based accumulation scheme with final addition through RCA | Wallace tree-45-L/H-RCA |
| 4 | Wallace tree multiplier at 45 nm using Mixed L/H based accumulation scheme with final addition through HCA | Wallace tree-45-L/H-HCA |
| 5 | Wallace tree multiplier at 180 nm using LOC based accumulation scheme with final addition through RCA | Wallace tree-180-L-RCA |
| 6 | Wallace tree multiplier at 180 nm using LOC based accumulation scheme with final addition through HCA | Wallace tree-180-L-HCA |
| 7 | Wallace tree multiplier at 180 nm using Mixed L/H based accumulation scheme with final addition through RCA | Wallace tree-180-L/H-RCA |
| 8 | Wallace tree multiplier at 180 nm using Mixed L/H based accumulation scheme with final addition through HCA | Wallace tree-180-L/H-HCA |
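The naming rule is simply the Cartesian product of the three design choices. The short sketch below, with a hypothetical helper name, enumerates the eight labels for one architecture in the order of the table above.

```python
from itertools import product


def design_names(architecture):
    """List the eight design labels: technology x accumulation scheme x final adder."""
    return [f"{architecture}-{tech}-{scheme}-{adder}"
            for tech, scheme, adder in product((45, 180),
                                               ("L", "L/H"),
                                               ("RCA", "HCA"))]
```

Calling `design_names("Dadda")` yields the corresponding labels for the Dadda multipliers.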

Table 4.6 and Table 4.7 show power, delay and power-delay product of Wallace tree multipliers using $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology.

Table 4.6: Simulation results of Wallace tree multipliers using 45 nm technology at $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$

| Module Name | Size (in bits) | Power ($\mu$W) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | Area ($\mu \mathrm{m}^{2}$) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Wallace tree-45-L-RCA | 4 x 4 | 0.397 | 13.410 | 05.323 | 0167.2 |
|  | 8x8 | 0.613 | 34.250 | 20.990 | 0638.3 |
| Wallace tree-45-L-HCA | $4 \times 4$ | 0.412 | 10.470 | 04.313 | 0314.7 |
|  | 8x8 | 0.631 | 30.450 | 19.210 | 0807.1 |
| Wallace tree-45-L/H-RCA | 4 x 4 | 0.472 | 05.171 | 02.441 | 0255.9 |
|  | 8x8 | 3.131 | 06.041 | 18.910 | 0957.4 |
| Wallace tree-45-L/H-HCA | $4 \times 4$ | 0.478 | 04.821 | 02.304 | 0411.3 |
|  | 8x8 | 3.457 | 05.011 | 17.320 | 1072.4 |

Table 4.7: Simulation results of Wallace tree multipliers using 180 nm technology at $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$

| Module Name | Size (in bits) | Power ($\mu$W) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | Area ($\mu \mathrm{m}^{2}$) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Wallace tree-180-L-RCA | $4 \times 4$ | 0.071 | 778.70 | 055.280 | 0334.7 |
|  | 8x8 | 0.115 | 997.10 | 114.660 | 1174.7 |
| Wallace tree-180-L-HCA | $4 \times 4$ | 0.073 | 721.50 | 052.660 | 0674.4 |
|  | 8x8 | 0.123 | 811.50 | 099.810 | 1551.1 |
| Wallace tree-180-L/H-RCA | 4x4 | 0.131 | 295.70 | 038.736 | 0497.2 |
|  | 8x8 | 0.244 | 390.40 | 095.257 | 1874.4 |
| Wallace tree-180-L/H-HCA | $4 \times 4$ | 0.150 | 245.10 | 036.765 | 0788.8 |
|  | 8x8 | 0.289 | 311.23 | 089.945 | 2041.7 |

Key Points: The results of the Wallace tree multiplier show that

- Wallace tree multiplier operates down to 0.4 V power supply for selected logic design styles at both technology nodes in sub-threshold region.
- The power, delay and power-delay product of the Wallace tree multipliers increase with operand size, as expected.
- The Wallace tree multiplier using the Mixed L/H based accumulation scheme exhibits the least power-delay product at both technologies.
- The Wallace tree multiplier with HCA gives lower delay and power-delay product than the Wallace tree multiplier with RCA.
- In comparison to 180 nm technology, at 45 nm the propagation delay is smaller, the power consumption is higher (due to increased leakage currents) and the power-delay product is smaller for all designs of the Wallace tree multiplier.
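The power-delay product column is the product of the power and delay columns, and with power in µW and delay in ns the result falls directly in units of $10^{-15}$ W·s. The sketch below recomputes it for the Table 4.6 entries as a consistency check. The dictionary layout and function names are assumptions made for illustration; the numbers are copied from the table.

```python
# (design, size) -> (power in uW, delay in ns, reported PDP in 1e-15 W*s), from Table 4.6
TABLE_4_6 = {
    ("Wallace tree-45-L-RCA", "4x4"): (0.397, 13.410, 5.323),
    ("Wallace tree-45-L-RCA", "8x8"): (0.613, 34.250, 20.990),
    ("Wallace tree-45-L-HCA", "4x4"): (0.412, 10.470, 4.313),
    ("Wallace tree-45-L-HCA", "8x8"): (0.631, 30.450, 19.210),
    ("Wallace tree-45-L/H-RCA", "4x4"): (0.472, 5.171, 2.441),
    ("Wallace tree-45-L/H-RCA", "8x8"): (3.131, 6.041, 18.910),
    ("Wallace tree-45-L/H-HCA", "4x4"): (0.478, 4.821, 2.304),
    ("Wallace tree-45-L/H-HCA", "8x8"): (3.457, 5.011, 17.320),
}


def pdp_femto(power_uw, delay_ns):
    """uW x ns = 1e-6 W x 1e-9 s = 1e-15 W*s, so no extra scaling is needed."""
    return power_uw * delay_ns


def max_pdp_mismatch(table):
    """Largest absolute gap between the recomputed and the reported PDP."""
    return max(abs(pdp_femto(p, d) - reported)
               for p, d, reported in table.values())
```

Small residual differences are to be expected because the tabulated power and delay values are themselves rounded.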

The overall power consumption, propagation delay and power-delay product graphs of Wallace tree multipliers at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology using different accumulation schemes are shown in Figure 4.23.


Figure 4.23: The overall power consumption, propagation delay and power-delay product graphs of Wallace tree multipliers at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology

### 4.6.3. Design Implementation of Dadda Multiplier

In Dadda's method, only the minimum necessary reduction of the partial products is performed at each level. The internal architectures of the Dadda multiplier using LOC and Mixed L/H are shown in Figure 4.24 and Figure 4.25, respectively. The circuit-level diagrams of all these blocks (partial-product generation, partial-product accumulation and final addition) are given in Sections 4.3, 4.5 and 4.4, respectively.
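The "minimum necessary reduction" rule can be sketched behaviourally as follows. The sketch assumes the standard Dadda height sequence 2, 3, 4, 6, 9, ... (each limit is the floor of 1.5 times the previous one) and uses only 3-2 and 2-2 cells, with the final adder modelled by integer addition. It is a generic illustration with hypothetical names, not the circuit of Figure 4.24.

```python
def full_adder(a, b, c):
    """3-2 cell: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def half_adder(a, b):
    """2-2 cell: returns (sum, carry)."""
    return a ^ b, a & b


def partial_product_columns(x, y, n):
    """Column k collects every bit X_i AND Y_j with i + j == k."""
    cols = [[] for _ in range(2 * n - 1)]
    for j in range(n):
        for i in range(n):
            cols[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    return cols


def dadda_limits(max_height):
    """Height limits below max_height, largest first (e.g. 6, 4, 3, 2 for 8)."""
    limits = [2]
    while limits[-1] * 3 // 2 < max_height:
        limits.append(limits[-1] * 3 // 2)
    return limits[::-1]


def dadda_reduce(cols):
    """Reduce every column to at most two bits; returns (cols, full_adders, half_adders)."""
    fa = ha = 0
    for limit in dadda_limits(max(len(c) for c in cols)):
        new_cols, carries_in = [], []
        for col in cols + [[]]:  # spare column for a possible carry-out
            bits, sums, carries_out = list(col), [], []
            # Column height counts untouched bits, new sums and incoming carries.
            while len(bits) + len(sums) + len(carries_in) > limit:
                if len(bits) + len(sums) + len(carries_in) == limit + 1:
                    s, c = half_adder(bits.pop(), bits.pop())
                    ha += 1
                else:
                    s, c = full_adder(bits.pop(), bits.pop(), bits.pop())
                    fa += 1
                sums.append(s)
                carries_out.append(c)
            new_cols.append(bits + sums + carries_in)
            carries_in = carries_out
        while new_cols and not new_cols[-1]:
            new_cols.pop()  # drop the spare column if it stayed empty
        cols = new_cols
    return cols, fa, ha


def dadda_multiply(x, y, n=8):
    """Return (product, full_adders, half_adders) for n-bit unsigned operands."""
    cols, fa, ha = dadda_reduce(partial_product_columns(x, y, n))
    total = sum(bit << k for k, col in enumerate(cols) for bit in col)
    return total, fa, ha
```

In this model an $8 \times 8$-bit reduction uses 35 full adders and 7 half adders over four stages, since each stage inserts only as many cells as are needed to reach the next height limit.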


Figure 4.24: Block diagram of Dadda multiplier using LOC


Figure 4.25: Block diagram of Dadda multiplier using Mixed L/H

### 4.6.4. Simulation Methodology and Results of Dadda Multipliers

The methodology followed for simulation of Dadda multipliers in the sub-threshold region is the same as given in Section 4.6.2.

A total of eight Dadda multiplier designs are implemented using different combinations of partial-product accumulation modules (LOC, Mixed L/H) and final-stage adders (RCA, HCA) at two different technology nodes ( $45 \mathrm{~nm}, 180 \mathrm{~nm}$ ).

In this thesis, the nomenclature shown in Table 4.8 is used to represent the multiplier designs:

Architecture (Dadda) - technology (45/180) - compressor LOC (L) / Mixed L/H (L/H) - final adder (RCA/HCA)

Table 4.8: Nomenclature used for the proposed designs of Dadda multiplier

| S. No. | Multiplier Design Descriptions | Nomenclature |
| :---: | :--- | :--- |
| 1 | Dadda multiplier at 45 nm using LOC based accumulation scheme with final addition through RCA | Dadda-45-L-RCA |
| 2 | Dadda multiplier at 45 nm using LOC based accumulation scheme with final addition through HCA | Dadda-45-L-HCA |
| 3 | Dadda multiplier at 45 nm using Mixed L/H based accumulation scheme with final addition through RCA | Dadda-45-L/H-RCA |
| 4 | Dadda multiplier at 45 nm using Mixed L/H based accumulation scheme with final addition through HCA | Dadda-45-L/H-HCA |
| 5 | Dadda multiplier at 180 nm using LOC based accumulation scheme with final addition through RCA | Dadda-180-L-RCA |
| 6 | Dadda multiplier at 180 nm using LOC based accumulation scheme with final addition through HCA | Dadda-180-L-HCA |
| 7 | Dadda multiplier at 180 nm using Mixed L/H based accumulation scheme with final addition through RCA | Dadda-180-L/H-RCA |
| 8 | Dadda multiplier at 180 nm using Mixed L/H based accumulation scheme with final addition through HCA | Dadda-180-L/H-HCA |

Table 4.9 and Table 4.10 show power, delay and power-delay product of proposed Dadda multipliers using $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology.

Table 4.9: Simulation results of Dadda multipliers using 45 nm technology at $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$

| Module Name | Size (in bits) | Power ($\mu$W) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | Area ($\mu \mathrm{m}^{2}$) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Dadda-45-L-RCA | 4x4 | **0.274** | 10.73 | 02.940 | **122.7** |
|  | 8x8 | **0.465** | 30.63 | 14.240 | **588.5** |
| Dadda-45-L-HCA | 4x4 | 0.294 | 07.44 | 02.187 | 224.9 |
|  | 8x8 | 0.484 | 28.22 | 13.650 | 704.8 |
| Dadda-45-L/H-RCA | 4x4 | 0.303 | 04.14 | 01.254 | 187.6 |
|  | 8x8 | 2.457 | 05.01 | 12.280 | 880.4 |
| Dadda-45-L/H-HCA | 4x4 | 0.310 | **03.79** | **01.175** | 369.9 |
|  | 8x8 | 2.939 | **04.03** | **11.850** | 988.1 |

Table 4.10: Simulation results of Dadda multipliers using 180 nm technology at $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$

| Module Name | Size (in bits) | Power ($\mu$W) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | Area ($\mu \mathrm{m}^{2}$) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Dadda-180-L-RCA | 4 x 4 | 0.065 | 771.4 | 050.140 | 0287.4 |
|  | 8x8 | 0.113 | 989.1 | 111.760 | 1078.6 |
| Dadda-180-L-HCA | 4 x 4 | 0.070 | 705.3 | 049.370 | 0578.2 |
|  | 8x8 | 0.121 | 802.7 | 097.120 | 1478.3 |
| Dadda-180-L/H-RCA | 4 x 4 | 0.114 | 289.2 | 032.960 | 0411.4 |
|  | 8x8 | 0.234 | 384.1 | 089.870 | 1774.2 |
| Dadda-180-L/H-HCA | 4 x 4 | 0.131 | 222.8 | 029.186 | 0668.4 |
|  | 8x8 | 0.284 | 307.7 | 087.386 | 1878.4 |

Key Points: The results of the Dadda multipliers show that

- Dadda multipliers operate down to 0.4 V power supply for selected logic design styles at both technology nodes in sub-threshold region.
- The power, delay and power-delay product of the Dadda multipliers increase with operand size, as expected.
- The Dadda multiplier using the Mixed $\mathrm{L} / \mathrm{H}$ based accumulation scheme exhibits the least power-delay product at both technologies.
- The Dadda multiplier with HCA gives lower delay and power-delay product than the Dadda multiplier with RCA.
- At 45 nm , the propagation delay is smaller, power consumption is higher (due to increased leakage current) and power-delay product is smaller for all designs of Dadda multiplier in comparison to 180 nm technology.

The power consumption, propagation delay and power-delay product graphs of Dadda multipliers at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology using different accumulation schemes are shown in Figure 4.26.


Figure 4.26: The overall power consumption, propagation delay and power-delay product graphs of Dadda multipliers at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology

### 4.7. FINAL RESULTS AND DISCUSSION

## Comparison of proposed Wallace tree and Dadda multiplier designs with results of published architectures

This section presents the comparative analysis of the $4 \times 4$-bit and $8 \times 8$-bit Wallace tree and Dadda multipliers with referenced architectures operated in the sub-threshold region at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology for a 0.4 V supply voltage at the same frequency of operation ( 200 kHz ).

Here, for comparison, the $4 \times 4$-bit and $8 \times 8$-bit Wallace tree and Dadda multipliers of the referenced architectures [17][18][35][36] are designed to obtain their results in the same simulation setup for sub-threshold operation.

All proposed designs show a lower power-delay product than the referenced architectures at both technology nodes. However, Table 4.11 and Table 4.12 compare only the best proposed multiplier designs, i.e. those with the minimum power-delay product (as per the results in Tables 4.6, 4.7, 4.9 and 4.10), with the referenced architectures in the sub-threshold region.

Table 4.11: Comparative results between proposed and referenced multiplier designs at 45 nm

| Referenced/ Proposed designs | Module name | Size (in bits) | Power ($\mu$W) | Delay (ns) | Power-Delay Product ($\times 10^{-15}$ W·s) | % reduction in Power-Delay Product |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ref [35] | Wallace tree | 4x4 | 0.723 | 25.910 | 018.732 | 4x4 size: 87.7% <br> 8x8 size: 90.2% |
|  |  | 8x8 | 2.423 | 73.010 | 176.903 |  |
| Wallace tree-45-L/H-HCA (best) | Wallace tree | 4x4 | 0.478 | 04.821 | 002.304 |  |
|  |  | 8x8 | 3.457 | 05.011 | 017.323 |  |
| Ref [36] | Dadda | 4x4 | 0.562 | 10.721 | 006.023 | 4x4 size: 80.5% <br> 8x8 size: 79.9% |
|  |  | 8x8 | 1.901 | 31.110 | 059.140 |  |
| Dadda-45-L/H-HCA (best) | Dadda | 4x4 | 0.310 | 03.791 | 001.175 |  |
|  |  | 8x8 | 2.939 | 04.033 | 011.852 |  |
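The percentage figures in the last column of Table 4.11 appear to follow the usual relative-reduction formula, (reference minus proposed) divided by reference. The sketch below applies that assumed formula to the tabulated power-delay products. The dictionary and function names are illustrative; the values are copied from Table 4.11.

```python
def pdp_reduction_percent(reference_pdp, proposed_pdp):
    """Relative reduction of the proposed PDP with respect to the reference, in percent."""
    return 100.0 * (reference_pdp - proposed_pdp) / reference_pdp


# (architecture, size) -> (reference PDP, best proposed PDP), both in 1e-15 W*s
TABLE_4_11_PDP = {
    ("Wallace tree", "4x4"): (18.732, 2.304),
    ("Wallace tree", "8x8"): (176.903, 17.323),
    ("Dadda", "4x4"): (6.023, 1.175),
    ("Dadda", "8x8"): (59.140, 11.852),
}

REDUCTIONS_45NM = {key: pdp_reduction_percent(ref, prop)
                   for key, (ref, prop) in TABLE_4_11_PDP.items()}
```

The recomputed values agree with the tabulated percentages to within about 0.1 percentage point, the remaining gap being a matter of rounding.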

Table 4.12: Comparative table between proposed and referenced multiplier designs at 180 nm

| Referenced/ Proposed designs | Module name | Size (in bits) | Power ($\mu$W) | Delay ($\mu$s) | Power-Delay Product ($\times 10^{-12}$ W·s) | % reduction in Power-Delay Product |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ref [17] | Wallace tree | 4x4 | 0.194 | 1.857 | 0.3602 | 4x4 size: 89.8% |
|  |  | 8x8 | 0.282 | 2.586 | 0.7292 |  |
| Wallace tree-180-L/H-HCA (best) | Wallace tree | 4x4 | 0.150 | 0.245 | 0.0367 | 8x8 size: 87.6% |
|  |  | 8x8 | 0.289 | 0.311 | 0.0899 |  |
| Ref [18] | Dadda | 4x4 | 0.187 | 0.849 | 0.1587 | 4x4 size: 81.7% |
|  |  | 8x8 | 0.274 | 1.784 | 0.4888 |  |
| Dadda-180-L/H-HCA (best) | Dadda | 4x4 | 0.131 | 0.222 | 0.0291 | 8x8 size: 82.1% |
|  |  | 8x8 | 0.284 | 0.307 | 0.0873 |  |

All proposed designs show a lower power-delay product than the published referenced architectures. Thus, these designs are compared among themselves to obtain the best design in terms of overall power-delay product at both technologies.

For comparison purpose, Figure 4.27, Figure 4.28 and Figure 4.29 show the power consumption, propagation delay and power-delay product histograms of all proposed sixteen designs.


Figure 4.27: The comparative power consumption graph between Wallace tree and Dadda multipliers for (a) 4x4-bit (b) 8x8-bit


Figure 4.28: The comparative propagation delay graph between Wallace tree and Dadda multipliers for (a) $4 \times 4$-bit (b) $8 \times 8$-bit

Figure 4.29: The comparative power-delay product graph between Wallace tree and Dadda multipliers for (a) $4 \times 4$-bit (b) 8x8-bit

## Observations

The comparative results of the Wallace tree and Dadda multiplier show that in sub-threshold region:

- In selected logic design styles (Static-CMOS and HYB-TG), both multiplier architectures operate down to 0.4 V with correct functionality using all four different combinations of partial-product accumulation scheme and final adder (LOC-RCA, LOC-HCA, L/H-RCA and L/H-HCA).
- The power, delay and power-delay product of both the multiplier designs increase with the increase in operand size.

(i) Effect of Logic Design Style:

- Static-CMOS logic gives the lowest power consumption, because of simpler logic cells, using the LOC based accumulation scheme at both technology nodes.
- HYB-TG logic gives the lowest propagation delay, because of shorter critical paths, using the Mixed L/H based accumulation scheme at both technology nodes.

(ii) Effect of Technology Scaling:

- At the same frequency of operation, at 45 nm the propagation delay is smaller, the power consumption is higher (due to increased leakage current, as the supply voltage is kept the same at 0.4 V ) and the power-delay product is smaller for all implemented combinations of multipliers in comparison to 180 nm technology.

(iii) Effect of Multiplier Architecture:

- The Dadda multiplier is the most power-efficient, high-performance architecture for all implemented combinations as compared to the Wallace tree multiplier at both technology nodes.
- Both multipliers show a reduced power-delay product with the Mixed L/H based partial-product accumulation scheme as compared to the LOC based partial-product accumulation scheme.

In both technologies, multipliers with HCA provide lower propagation delay and lower power-delay product, but higher power consumption because of the more complex HCA architecture, as compared to multipliers with RCA.

### 4.8. CONCLUSIONS

The overall results of the Wallace tree and Dadda multipliers lead to the following conclusions at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology nodes:

## (i) Power Consumption

## Comparison of proposed multipliers with referenced designs [35][36] at 45 nm and [17][18] at 180 nm

- For LOC based proposed Wallace tree and Dadda multipliers:

At 45 nm : For $4 \times 4$-bit and $8 \times 8$-bit, the proposed designs consume less power than the referenced designs. The reduction in power consumption ranges from $43.1 \%$ to $74.5 \%$.

At 180 nm : For $4 \times 4$-bit and $8 \times 8$-bit, the proposed designs consume less power than the referenced designs. The reduction in power consumption ranges from $55.8 \%$ to $62.5 \%$.

- For Mixed L/H based proposed Wallace tree and Dadda multipliers:

At 45 nm : For $4 \times 4$-bit, the proposed designs consume $33.8 \%$ to $44.8 \%$ less power than the referenced designs, whereas for $8 \times 8$-bit they consume $42.6 \%$ to $54.6 \%$ more power.

At 180 nm : For $4 \times 4$-bit, the proposed designs consume $63.3 \%$ to $74.1 \%$ less power than the referenced designs, whereas for $8 \times 8$-bit they consume $2.4 \%$ to $3.5 \%$ more power.

## Comparison of all proposed Wallace tree and Dadda multiplier designs among themselves

- For LOC based architectures:

At both technology nodes, the Dadda multiplier based designs consume less power than the corresponding Wallace tree multiplier based designs, as given below:

At 45 nm : For $4 \times 4$-bit, they consume $28.6 \%$ to $30.9 \%$ less power, whereas for $8 \times 8$-bit they consume $23.2 \%$ to $24.1 \%$ less power.

At 180 nm : For $4 \times 4$-bit, they consume $4.1 \%$ to $8.4 \%$ less power, whereas for $8 \times 8$-bit they consume $1.6 \%$ to $1.7 \%$ less power.

- For Mixed L/H based architectures:

At both technology nodes, the Dadda multiplier based designs consume less power than the corresponding Wallace tree multiplier based designs, as given below:

At 45 nm : For $4 \times 4$-bit, they consume $35.1 \%$ to $35.8 \%$ less power, whereas for $8 \times 8$-bit they consume $14.9 \%$ to $21.5 \%$ less power than the Wallace tree multipliers.

At 180 nm : For $4 \times 4$-bit, they consume $12.6 \%$ to $12.9 \%$ less power, whereas for $8 \times 8$-bit they consume $1.7 \%$ to $4.1 \%$ less power.

- At 45 nm , for both sizes ($4 \times 4$-bit and $8 \times 8$-bit), Dadda-45-L-RCA is the most power-efficient architecture among all eight (LOC as well as Mixed L/H based) proposed multipliers.

It consumes $42.6 \%$ (for $4 \times 4$-bit) and $86.5 \%$ (for $8 \times 8$-bit) less power than the highest power-consuming design, Wallace tree-45-L/H-HCA.

- At 180 nm , for both sizes, Dadda-180-L-RCA is the most power-efficient architecture among all eight proposed designs.

It consumes $56.6 \%$ (for $4 \times 4$-bit) and $60.8 \%$ (for $8 \times 8$-bit) less power than the highest power-consuming design, Wallace tree-180-L/H-HCA.

## (ii) Propagation Delay

## Comparison of proposed multipliers with referenced designs [17][18][35][36]

- For LOC based proposed Wallace tree and Dadda multipliers:

At 45 nm : For $4 \times 4$-bit and $8 \times 8$-bit, the proposed designs give a propagation delay that is $9.28 \%$ to $59.5 \%$ lower.

At 180 nm : For $4 \times 4$-bit and $8 \times 8$-bit, the proposed designs give a propagation delay that is $16.9 \%$ to $68.6 \%$ lower.

- For Mixed L/H based proposed Wallace tree and Dadda multipliers:

At 45 nm : For $4 \times 4$-bit and $8 \times 8$-bit, the proposed designs give a propagation delay that is $64.6 \%$ to $87.1 \%$ lower.

At 180 nm : For $4 \times 4$-bit and $8 \times 8$-bit, the proposed designs give a propagation delay that is $73.7 \%$ to $87.9 \%$ lower.

## Comparison of all proposed Wallace tree and Dadda multipliers among themselves

- For LOC based architectures:

The Dadda multiplier based designs have a lower propagation delay than the corresponding Wallace tree multiplier designs at both technology nodes, as given below:

At 45 nm : For $4 \times 4$-bit, the Dadda multipliers give a propagation delay that is $19.9 \%$ to $28.9 \%$ lower, whereas for $8 \times 8$-bit it is $7.3 \%$ to $10.5 \%$ lower than for the Wallace tree multipliers.

At 180 nm : For $4 \times 4$-bit, the Dadda multipliers give a propagation delay that is $0.9 \%$ to $2.2 \%$ lower, whereas for $8 \times 8$-bit it is $0.8 \%$ to $1.1 \%$ lower than for the Wallace tree multipliers.

- For Mixed L/H based architectures:

The Dadda multiplier based designs have a lower propagation delay than the corresponding Wallace tree multiplier designs at both technology nodes, as given below:

At 45 nm : For $4 \times 4$-bit, the Dadda multipliers give a propagation delay that is $19.9 \%$ to $21.3 \%$ lower, whereas for $8 \times 8$-bit it is $17.2 \%$ to $19.5 \%$ lower than for the Wallace tree multipliers.

At 180 nm : For $4 \times 4$-bit, the Dadda multipliers give a propagation delay that is $2.1 \%$ to $9.1 \%$ lower, whereas for $8 \times 8$-bit it is $1.1 \%$ to $1.6 \%$ lower than for the Wallace tree multipliers.

- Among all eight (LOC as well as Mixed L/H based) proposed multipliers implemented at 45 nm , Dadda-45-L/H-HCA has the least propagation delay. Its propagation delay is $71.7 \%$ (for $4 \times 4$-bit) and $88.2 \%$ (for $8 \times 8$-bit) lower than that of the most delay-intensive design, which per Tables 4.6 and 4.9 is Wallace tree-45-L-RCA.
- Similarly, at 180 nm , Dadda-180-L/H-HCA has the least propagation delay. Its propagation delay is $71.3 \%$ (for $4 \times 4$-bit) and $69.1 \%$ (for $8 \times 8$-bit) lower than that of the most delay-intensive design, which per Tables 4.7 and 4.10 is Wallace tree-180-L-RCA.
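The best-versus-worst comparison for the 45 nm designs can be reproduced directly from the delay columns of Tables 4.6 and 4.9. The sketch below, with hypothetical names, picks the fastest and slowest of the eight designs for a given operand size and reports the relative saving. The baseline that reproduces the quoted $71.7 \%$ and $88.2 \%$ is the largest tabulated delay, which belongs to Wallace tree-45-L-RCA.

```python
# design -> (4x4 delay in ns, 8x8 delay in ns), copied from Tables 4.6 and 4.9
DELAY_45NM_NS = {
    "Wallace tree-45-L-RCA": (13.410, 34.250),
    "Wallace tree-45-L-HCA": (10.470, 30.450),
    "Wallace tree-45-L/H-RCA": (5.171, 6.041),
    "Wallace tree-45-L/H-HCA": (4.821, 5.011),
    "Dadda-45-L-RCA": (10.73, 30.63),
    "Dadda-45-L-HCA": (7.44, 28.22),
    "Dadda-45-L/H-RCA": (4.14, 5.01),
    "Dadda-45-L/H-HCA": (3.79, 4.03),
}


def fastest_vs_slowest(delays, size_index):
    """Return (fastest design, slowest design, delay saving in percent).

    size_index selects the operand size: 0 for 4x4-bit, 1 for 8x8-bit.
    """
    fastest = min(delays, key=lambda name: delays[name][size_index])
    slowest = max(delays, key=lambda name: delays[name][size_index])
    saving = 100.0 * (1.0 - delays[fastest][size_index] / delays[slowest][size_index])
    return fastest, slowest, saving
```

The same helper can be applied to the 180 nm delays of Tables 4.7 and 4.10 by substituting the corresponding dictionary.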


## (iii) Power-Delay Product

## Comparison of proposed multipliers with referenced designs [17][18][35][36]

- For LOC based proposed Wallace tree and Dadda multipliers:

At 45 nm : For $4 \times 4$-bit and $8 \times 8$-bit, the proposed designs give a power-delay product that is $63.6 \%$ to $89.1 \%$ lower.

At 180 nm : For $4 \times 4$-bit and $8 \times 8$-bit, the proposed designs give a power-delay product that is $68.8 \%$ to $86.3 \%$ lower.

- For Mixed L/H based proposed Wallace tree and Dadda multipliers:

At 45 nm : For $4 \times 4$-bit and $8 \times 8$-bit, the proposed designs give a power-delay product that is $70.4 \%$ to $90.2 \%$ lower.

At 180 nm : For $4 \times 4$-bit and $8 \times 8$-bit, the proposed designs give a power-delay product that is $81.6 \%$ to $89.7 \%$ lower.

## Comparison of proposed Wallace tree and Dadda multipliers among themselves

- For LOC based architectures:

At 45 nm : For $4 \times 4$-bit, the Dadda multipliers give a power-delay product that is $44.7 \%$ to $49.2 \%$ lower, whereas for $8 \times 8$-bit it is $28.9 \%$ to $32.1 \%$ lower than for the corresponding Wallace tree multipliers.

At 180 nm : For $4 \times 4$-bit, the Dadda multipliers give a power-delay product that is $6.2 \%$ to $9.2 \%$ lower, whereas for $8 \times 8$-bit it is $2.5 \%$ to $2.6 \%$ lower than for the corresponding Wallace tree multipliers.

- For Mixed L/H based architectures:

At 45 nm : For $4 \times 4$-bit, the Dadda multipliers give a power-delay product that is $48.6 \%$ to $49 \%$ lower, whereas for $8 \times 8$-bit it is $31.5 \%$ to $35.1 \%$ lower than for the corresponding Wallace tree multipliers.

At 180 nm : For $4 \times 4$-bit, the Dadda multipliers give a power-delay product that is $14.9 \%$ to $20.6 \%$ lower, whereas for $8 \times 8$-bit it is $2.8 \%$ to $5.6 \%$ lower than for the corresponding Wallace tree multipliers.

- Among all eight (LOC as well as Mixed L/H based) proposed multipliers implemented at 45 nm , Dadda-45-L/H-HCA has the least power-delay product. Its power-delay product is $49 \%$ (for $4 \times 4$-bit) and $31.5 \%$ (for $8 \times 8$-bit) lower than that of its Wallace tree counterpart, Wallace tree-45-L/H-HCA.
- Similarly, at 180 nm , Dadda-180-L/H-HCA has the least power-delay product. Its power-delay product is $20.6 \%$ (for $4 \times 4$-bit) and $2.5 \%$ (for $8 \times 8$-bit) lower than that of its Wallace tree counterpart, Wallace tree-180-L/H-HCA.
- Static-CMOS and HYB-TG are the most power-delay product efficient logic design styles for LOC and Mixed L/H based Wallace tree and Dadda multipliers, respectively.

## (iv) Effect of Technology Scaling

At the same frequency of operation, at 45 nm the propagation delay is smaller, the power consumption is higher (due to increased leakage current, since the supply voltage is kept the same at 0.4 V ) and the power-delay product is smaller for all implemented combinations of multipliers in comparison to 180 nm technology.

Figure 4.30 shows the Design Space Exploration (DSE) chart of all proposed $8 \times 8$-bit Wallace tree and Dadda multipliers at the 45 nm technology node in the sub-threshold region.

In Figure 4.30, a change of scale on the $y$-axis is marked with a kink ( $\sim$ ) so that the power and delay histograms of all types of multipliers can be shown over their entire range in the same figure.

The same distribution pattern of power, delay and power-delay product of both multipliers is found at 180 nm technology.


Figure 4.30: DSE chart of all published referenced and proposed 8x8-bit Wallace tree and Dadda multipliers (a) power (b) delay (c) power-delay product

## REFERENCES

[1] B. Ramkumar and H. M. Kittur, "Faster and energy-efficient signed multipliers", VLSI Design, vol. 2013, 2013, pp. 1-12.
[2] P. R. Cappello and K. Steiglitz, "A VLSI layout for a pipe-lined Dadda multiplier", ACM Transactions on Computer Systems, vol. 1(2), 1983, pp. 157-17.
[3] K. A. C. Bickerstaff, E. E. Swartzlander and M. J. Schulte, "Analysis of column compression multipliers", Proceedings of the 15th IEEE Symposium on Computer Arithmetic, 2001, pp. 33-39.
[4] C. S. Wallace, "A suggestion for a fast multiplier", IEEE Transactions on Electronic Computers, vol. 13, 1964, pp. 14-17.
[5] L. Dadda, "Some schemes for parallel multipliers", Alta Frequenza, vol. 34, 1965, pp. 349-356.
[6] H. A. Al-Twaijry, M. J. Flynn, D. Giovanni and T. G. I. John, "Area and performance optimized CMOS multipliers", PhD dissertation, Stanford University, 1997, pp. 1-158.
[7] L. Jayaraju, B. N. Srinivasa Rao and V. Srinivasa Rao, " $0.69 \mathrm{~mW}, 700 \mathrm{MHz}$ novel $8 x 8$ digital multiplier", International Journal of Computer Theory and Engineering, vol. 3, 2011, pp. 662-665.
[8] C. H. Chang, J. Gu and M. Zhang, "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits", IEEE Transactions on Circuits and Systems I, vol. 51(10), 2004, pp. 1985-1997.
[9] G. W. Bewick, "Fast multiplication: algorithms and implementation", PhD Dissertation, Stanford University, 1994, pp. 1-170.
[10] S. F. Hsiao, M. R. Jiang and J. S. Yeh, "Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers", Electronics Letters, vol. 34(4), 1998, pp. 341-342.
[11] P. Karuna and K. K. Parhi, "Low-power 4-2 and 5-2 compressors", IEEE Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, vol. 1, 2001, pp. 129-133.
[12] B. Valeriu, A. Djupdal and S. Aunet, "Ultra low-power neural inspired addition: When serial might outperform parallel architectures", Computational Intelligence and Bioinspired Systems, Springer Berlin Heidelberg, 2005, pp. 486-493.
[13] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations", IEEE Transactions on Computers, vol. 22, 1973, pp. 786-792.
[14] M. Talsania and E. John, "A comparative analysis of parallel prefix adders", 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, 2009, pp. 281-286.
[15] W. J. Townsend, E. E. Swartzlander and J. A. Abraham, "A comparison of Dadda and Wallace multiplier delays", Proceedings of the SPIE in Advanced Signal Processing Algorithms, Architectures and Implementations, vol. 5205, 2003, pp. 552-560.
[16] C. C. Foster and F. D. Stockton, "Counting responders in an associative memory", IEEE Transactions on Computers, vol. C-20, 1971, pp. 1580-1583.
[17] S. Asif and Y. Kong, "Low-area wallace tree multiplier", VLSI Design, vol. 2014, 2014, pp. 1-6.
[18] P. Ramanathan, P. T. Vanathi and S. Agarwal, "High speed multiplier design using decomposition logic", Serbian Journal of Electrical Engineering, vol. 6(1), 2009, pp. 33-42.
[19] M. Morris Mano and C. D. Michael, "Digital Design", 4th edition, 2012, pp. 1-143.
[20] C. K. Thomas and E. E. Swartzlander, "Low power arithmetic components", In Low Power Design Methodologies, Springer US, 1996, pp. 161-200.
[21] T. Vigneswaran, B. Mukundhan and P. S. Reddy, "A novel low power, high speed 14 transistor CMOS full adder cell with $50 \%$ improvement in threshold loss problem", Enformatika Transactions on Eng. Comp. Tech, vol. 13, 2006, pp. 82-85.
[22] A. P. Chandrakasan, S. Sheng and R. W. Brodersen, "Low-power CMOS digital design", IEICE Transactions on Electronics, vol. 75(4), 1992, pp. 371-382.
[23] A. Pishvaie, G. Jaberipur and A. Jahanian, "Redesigned CMOS (4;2) compressor for fast binary multipliers", Canadian Journal of Electrical and Computer Engineering, vol. 36, 2013, pp. 111-115.
[24] S. F. Hsiao, M. R. Jiang and J. S. Yeh, "Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers", Electronics Letters, vol. 34, 1998, pp. 341-343.
[25] S. Veeramachaneni, K. M. Krishna, L. Avinash, S. R. Puppala and M. B. Srinivas, "Novel architectures for high-speed and low-power 3-2, 4-2 and 5-2 compressors", 20th International Conference on VLSI Design (VLSID'07), 2007, pp. 324-329.
[26] J. Tonfat and R. Ricardo, "Low power 3-2 and 4-2 adder compressors implemented using ASTRAN", IEEE Third Latin American Symposium on Circuits and Systems (LASCAS), 2012, pp. 1-4.
[27] R. Menon and D. Radhakrishnan, "High performance 5:2 compressor architectures", IEEE Processing on Circuits Devices Systems, vol. 153, 2006, pp. 447-452.
[28] O. Kwon, K. Nowka and E. E. Swartzlander, "A 16-Bit by 16-bit MAC design using fast 5:3 compressor cells", Journal of VLSI Signal Processing, vol. 31, 2002, pp. 77-89.
[29] C. F. Law, S. S. Rofail and K. S. Yeo, "Low-power circuit implementation for partial-product addition using pass-transistor logic", IEE Proceedings-Circuits Devices Systems, vol. 146(3), 1999, pp. 124-129.
[30] K. Prasad and K. K. Parhi, "Low-power 4-2 and 5-2 compressors", Conference on Signals, Systems and Computers, 2001, pp.129-133.
[31] R. Nirlakalla, T. S. Rao and T. J. Prasad, "Performance evaluation of high speed compressors for high speed multipliers", Serbian Journal of Electrical Engineering, vol. 8, 2011, pp. 293-306.
[32] R. Zimmermann and F. Wolfgang, "Low-power logic styles: CMOS versus pass-transistor logic", IEEE Journal of Solid-State Circuits, vol. 32(7), 1997, pp. 1079-1090.
[33] M. Rouholamini, O. Kavehie, A. P. Mirbaha and S. J. Jasbi, "A new design for 7:2 compressors", IEEE/ACS International Conference on Computer Systems and Applications, AICCSA '07, 2007, pp. 474-478.
[34] M. Weinan and L. Shuguo, "A New High Compression Compressor for Large Multiplier", 9th International Conference on Solid-State and Integrated-Circuit Technology, ICSICT, 2008, pp. 1877-1880.
[35] A. Dandapat, S. Ghosal, P. Sarkar and D. Mukhopadhyay, "A 1.2-ns 16×16-Bit Binary Multiplier Using High Speed Compressors", World Academy of Science, Engineering and Technology, vol. 4, 2010, pp. 531-536.
[36] P. Samundiswary and K. Anitha, "Design and analysis of CMOS based Dadda multiplier", International Journal of Computational Engineering \& Management, vol. 16, 2013, pp. 12-17.

## CHAPTER 5

## STATIC RANDOM ACCESS MEMORY (SRAM)

### 5.1. INTRODUCTION

In this era of systems on chip (SoC), embedded SRAM is an essential component in the memory hierarchy of modern computing systems [1][2][3]. On average, SRAM accounts for $20 \%$ to $80 \%$ of the total chip transistor count and therefore consumes a large share of SoC power [4]. As a result, SRAM area, power, performance and leakage have become significant deciding factors in the overall budgeting of an SoC. To satisfy the low power requirement of SRAM cells, the sub-threshold design technique is being introduced. A typical application of a sub-threshold SRAM cell is a low power 16-bit RISC general processor in a 130 nm IBM CMOS process, which can be used for wireless sensor node applications [5].

Sub-threshold memory designs demand low leakage currents, which involves scaling the power supply voltage below the device threshold [6]. Data retention of the SRAM cell, both in standby (hold) mode and during a read access, is an important functional constraint in nanometer technology nodes. The cell becomes less stable with lower supply voltage, increasing leakage currents and increasing variability, all resulting from technology scaling. Hence, the low supply voltage reduces the performance of the memory cell, which makes it necessary to develop new designs with improved performance.

Further, to achieve stable read and successful write operation, the static noise margin (SNM), a critical metric for SRAM bit-cell stability, should be as high as possible under various temperature and voltage conditions. As SRAM is scaled to lower supply voltage ($\mathrm{V}_{\mathrm{DD}}$) and smaller technology nodes, sufficient SNM becomes difficult to maintain in the conventional 6T SRAM cell (C6T). This difficulty arises from the increase in inter-die statistical variation in the process parameters (threshold voltage ($\mathrm{V}_{\mathrm{th}}$), channel length ($\mathrm{L}$) and channel width ($\mathrm{W}$) of the transistors) [7]. These inter-die parameter variations may lead to destructive read (i.e., flipping of the stored data in a cell while reading) and unsuccessful write (inability to write to a cell) in an SRAM cell, thereby degrading the memory design yield in nanometer technologies [8].

To overcome these problems, several technologies such as FinFET, CNTFET, SOI, 3D designs and nano computing have become attractive research areas, but they are very expensive and less reliable under low voltage or temperature gradients [9]. This drawback necessitates improvement in the design of SRAM cells in current CMOS technology.

Further, the impact of scaling on SRAM cell performance needs to be investigated. So, in this thesis work, the impact of scaling has been observed by studying design performance at two technology nodes, 180 nm and 45 nm.

The literature survey shows that, while few papers have been published at 45 nm technology, considerable research has already been carried out on designing ultra-low power SRAM cells, with the number of transistors varying from 4 to 12, at 180 nm and 0.4 V supply in the sub-threshold region [10][11][12][13][14][15][16][17]. The focus is mainly on improving certain important parameters such as hold power, read delay, write delay and read SNM (RSNM). Thus, no new designs are created at 180 nm technology in the present work.

The results of the above published references at the 180 nm technology node have been studied and summarized in Table 5.1(a). It is observed that, for these referenced designs, either the aspect ratios of the transistors differ from each other or different 45 nm libraries are used for simulation. Thus, for the sake of uniformity, the RSNM, WSNM, read delay, write delay and leakage power consumption values have been re-evaluated for all of them at 180 nm technology in our simulation setup, keeping the schematics of the referenced cells the same as in the original references.

Table 5.1(a): Comparison of various SRAM cells at 180 nm technology at 0.4 V

| Type of SRAM cell | Reference | Leakage power in hold mode (W) | RSNM (mV) | Read delay (s) | WSNM (mV) | Write delay (s) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 4 T | $[12]$ | $7.79 \mathrm{E}-10$ | 28.0 | $200 \mathrm{E}-9$ | 32.4 | $100 \mathrm{E}-9$ |
|  | $[12]$ | $7.79 \mathrm{E}-10$ | 18.0 | $120 \mathrm{E}-9$ | 22.1 | $20.0 \mathrm{E}-9$ |
|  | $[10]$ | $03.60 \mathrm{E}-6$ | 21.4 | $65.0 \mathrm{E}-9$ | 35.7 | $1.30 \mathrm{E}-9$ |
|  | $[12]$ | $3.42 \mathrm{E}-10$ | 22.0 | $5.00 \mathrm{E}-9$ | 41.0 | $5.00 \mathrm{E}-9$ |
|  | $[12]$ | $5.65 \mathrm{E}-10$ | 20.1 | $10.0 \mathrm{E}-9$ | 47.7 | $10.0 \mathrm{E}-9$ |
|  | $[16]$ | $81.00 \mathrm{E}-6$ | 19.4 | $8.50 \mathrm{E}-9$ | 32.4 | $6.00 \mathrm{E}-9$ |
|  | $[18]$ | $05.44 \mathrm{E}-6$ | 140 | $7.84 \mathrm{E}-9$ | 28.4 | $4.74 \mathrm{E}-9$ |
|  | $[10]$ | $131.0 \mathrm{E}-6$ | 15.2 | $0.47 \mathrm{E}-9$ | 27.8 | $0.96 \mathrm{E}-9$ |
|  | $[12]$ | $9.29 \mathrm{E}-10$ | 22.4 | $5.00 \mathrm{E}-9$ | 55.7 | $1.00 \mathrm{E}-9$ |
|  | $[12]$ | $1.56 \mathrm{E}-10$ | 20.1 | $41.0 \mathrm{E}-9$ | 47.7 | $35.4 \mathrm{E}-9$ |
|  | $[14]$ | $152 \mathrm{E}-06$ | 40.2 | $2.24 \mathrm{E}-9$ | 22.1 | $8.44 \mathrm{E}-9$ |
| 7 T | $[10]$ | $6.64 \mathrm{E}-09$ | 26.7 | $127 \mathrm{E}-9$ | 38 | $210 \mathrm{E}-9$ |
| 8T | $[13]$ | $3.89 \mathrm{E}-09$ | 59.6 | $8.27 \mathrm{E}-9$ | 70.2 | $7.50 \mathrm{E}-9$ |
| 9 T | $[10]$ | $5.09 \mathrm{E}-09$ | 23.4 | $65.0 \mathrm{E}-9$ | 44.8 | $2.00 \mathrm{E}-9$ |
| 12T | $[15]$ | $4.77 \mathrm{E}-09$ | 32 | $110 \mathrm{E}-9$ | 68.5 | $81.1 \mathrm{E}-9$ |

For ultra-low power SRAM cells, with the number of transistors varying from 4 to 12, only those configurations in Table 5.1(a) with lower leakage power in hold mode are considered for further comparison. Additional performance metrics from N-curve analysis are also estimated for them. Table 5.1(b) shows all performance parameter values of the selected referenced SRAM cells.

Table 5.1(b): Comparison of low power SRAM cells at 180 nm technology at 0.4 V

| SRAM cell | WSNM (mV) | $\lvert\mathrm{WTI}\rvert$ ($\mu$A) | WTV (mV) | RSNM (mV) | SVNM (mV) | SINM ($\mu$A) | Leakage power in hold mode (W) | Write delay (s) | Read delay (s) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $4 \mathrm{~T}[12]$ | 22.1 | 84.0 | 230 | 18.0 | 170 | 32.1 | $7.79 \mathrm{E}-10$ | $20.0 \mathrm{E}-9$ | $120.0 \mathrm{E}-9$ |
| $5 \mathrm{~T}[12]$ | 41.0 | 55.2 | 220 | 22.0 | 180 | 60.1 | $3.42 \mathrm{E}-10$ | $5.00 \mathrm{E}-9$ | $005.0 \mathrm{E}-9$ |
| $6 \mathrm{~T}[12]$ | 47.7 | 32.3 | 250 | 20.1 | 150 | 26.9 | $1.56 \mathrm{E}-10$ | $35.4 \mathrm{E}-9$ | $041.0 \mathrm{E}-9$ |
| $7 \mathrm{~T}[10]$ | 38.0 | 64.1 | 230 | 26.7 | 170 | 14.8 | $6.11 \mathrm{E}-09$ | $210 \mathrm{E}-9$ | $127.0 \mathrm{E}-9$ |
| $8 \mathrm{~T}[13]$ | 70.2 | 13.0 | 220 | 59.6 | 180 | 73 | $3.89 \mathrm{E}-09$ | $7.50 \mathrm{E}-9$ | $8.27 \mathrm{E}-9$ |
| 9T[10] | 44.8 | 23.1 | 200 | 23.4 | 200 | 38.4 | $5.09 \mathrm{E}-09$ | $2.00 \mathrm{E}-9$ | $65.00 \mathrm{E}-9$ |
| 12T[15] | 68.5 | 25.5 | 190 | 32.0 | 210 | 65.4 | $4.77 \mathrm{E}-09$ | $81.1 \mathrm{E}-9$ | $110.0 \mathrm{E}-9$ |

The results summarized in Table 5.1(b) are used to study and obtain performance trends, to compare designs, and to assess the effect of technology scaling (i.e. 180 nm to 45 nm) on the performance metrics while drawing conclusions.
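As an illustration of how such trends can be read from the table, the following minimal Python sketch transcribes a subset of the columns of Table 5.1(b) and reports the best cell for each metric. The dictionary layout and the helper name `best_cell` are assumptions made here for illustration; they are not part of the simulation flow used in this work.

```python
# Selected columns of Table 5.1(b): 180 nm technology, 0.4 V supply.
# Keys are hypothetical names: SNMs in mV, leakage in W, delays in s.
TABLE_5_1B = {
    "4T":  {"wsnm_mv": 22.1, "rsnm_mv": 18.0, "leak_w": 7.79e-10,
            "write_s": 20.0e-9, "read_s": 120.0e-9},
    "5T":  {"wsnm_mv": 41.0, "rsnm_mv": 22.0, "leak_w": 3.42e-10,
            "write_s": 5.00e-9, "read_s": 5.0e-9},
    "6T":  {"wsnm_mv": 47.7, "rsnm_mv": 20.1, "leak_w": 1.56e-10,
            "write_s": 35.4e-9, "read_s": 41.0e-9},
    "7T":  {"wsnm_mv": 38.0, "rsnm_mv": 26.7, "leak_w": 6.11e-09,
            "write_s": 210e-9, "read_s": 127.0e-9},
    "8T":  {"wsnm_mv": 70.2, "rsnm_mv": 59.6, "leak_w": 3.89e-09,
            "write_s": 7.50e-9, "read_s": 8.27e-9},
    "9T":  {"wsnm_mv": 44.8, "rsnm_mv": 23.4, "leak_w": 5.09e-09,
            "write_s": 2.00e-9, "read_s": 65.0e-9},
    "12T": {"wsnm_mv": 68.5, "rsnm_mv": 32.0, "leak_w": 4.77e-09,
            "write_s": 81.1e-9, "read_s": 110.0e-9},
}


def best_cell(metric, larger_is_better):
    """Return the cell name with the best value of the given metric."""
    pick = max if larger_is_better else min
    return pick(TABLE_5_1B, key=lambda cell: TABLE_5_1B[cell][metric])


if __name__ == "__main__":
    for metric, larger in [("rsnm_mv", True), ("wsnm_mv", True),
                           ("leak_w", False), ("read_s", False),
                           ("write_s", False)]:
        print(metric, "->", best_cell(metric, larger))
```

With the tabulated values, this selects 8T for both RSNM and WSNM, 6T for hold-mode leakage, 5T for read delay and 9T for write delay, which shows that no single referenced cell is best on all metrics.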

At 45 nm technology, a major problem is that NMOS as well as PMOS transistors do not turn off completely. A high current through the channel in the OFF state (termed 'leakage current' in this thesis) leads to either improper operation or reduced SNM during read and write operations. Thus, new designs need to be created for proper functional output.

For super-threshold operation, the negative bit-line (BL) scheme, cell-$\mathrm{V}_{\mathrm{DD}}$ ($\mathrm{CV}_{\mathrm{DD}}$) adjustment (1 extra control line), differential CVSS (2 extra control lines), dual-rail supply scheme and write-back scheme (4 extra control lines) have been devised [18][19]. However, these designs can only operate at supply voltages in the 0.45 V to 0.7 V range.

A few reported works at 45 nm technology, for the sub-threshold region, have developed various read or write assist methods to enhance the write margin and read stability of SRAM cells for ultra-low power operation. These schemes include cell virtual-ground (CVSS) bias (1 extra control line) and boosted or reduced word-line (WL) voltage (2 extra control lines) [20].

Reducing the supply voltage further is desirable to achieve very low power consuming memory design.

A comprehensive analysis of sub-threshold SRAM cells and of the combined effects of all design metrics (read stability, write ability, hold stability, read/write access time and leakage power consumption) has not been reported so far at nanometer technology nodes.

Thus, this chapter proposes new functional SRAM cell designs, with a comprehensive study and analysis of all their design metrics (read stability, write ability, hold stability, read/write access time and leakage power consumption) at 45 nm technology in the sub-threshold region, along with a performance comparison with published designs at 45 nm and 180 nm technology.

In this chapter,

- Results of published research papers at 180 nm technology node have been studied and summarized in Table 5.1. The design trends have been analyzed and conclusions are drawn in Section 5.6 and Section 5.7 respectively.
- Five new ultra-low power, low voltage SRAM cells are implemented, and their performance parameters are thoroughly analyzed and compared with C6T in the sub-threshold region at 45 nm technology.

All these five proposed SRAM cells are designed by inserting additional transistors to increase the node currents for stability improvement. This is done without inserting an additional control line to switch the data inside/outside the memory cells thereby saving overall area of the memory cells.

The proposed five SRAM cell designs are
i. Modified 7T SRAM cell (named as M7T)
ii. Modified PMOS pass transistor logic (PTL)-based eight transistors SRAM cell, (named as MPT8T)
iii. Modified eight transistors SRAM cell (named as M8T)
iv. Modified 9T SRAM cell (named as M9T)
v. Modified inverter based twelve transistors SRAM cell (named as MI-12T)

The rest of the chapter is organized as follows:
Section 5.2 presents the architecture of C6T with its functional read, write and hold operations. Section 5.3 presents the architecture of the proposed M7T, MPT8T, M8T, M9T and MI-12T SRAM cells with their functional read, write and hold operations. Section 5.4 describes the simulation methodology and overall post-layout simulation results of all the proposed designs. Section 5.5 presents an analytical model of the C6T and proposed SRAM cells to obtain the read, write and hold SNM values mathematically and to verify the simulated results. Section 5.6 describes the final results and discussion of the SRAM cells, and Section 5.7 presents the summary of the chapter and the concluding remarks.

### 5.2. DESIGN AND OPERATION OF C6T

The conventional SRAM cell consists of two cross-coupled inverters that store one bit of information, and the two nodes (Q and QB) of the inverters are connected to two separate bit-lines, BL and BLB, via two switches (left and right of the cell). Both bit-lines BL/BLB participate in read and write operations [20].

The two switches in Figure 5.1 are used to communicate one-bit information with the outside of the cell. These two switches are implemented with different kinds of pass transistors (each acting as a switch), which are controlled by a word-line (WL). As long as the switches are turned off, the cell keeps one of its two possible steady states, either logic '0' or logic '1'. During the read and write operations, the common WL controls access to the cell nodes Q and QB through these two switches.


Figure 5.1: Basic SRAM cell

During read operation, the data contents of a memory word are read out (non-destructively), and during write operation, data is stored in a memory word, replacing any data that was previously stored there.

- During 'read' operation, the bit-lines (BL/BLB) are pre-charged to a reference voltage, usually close to the positive supply, and the WL is activated. The SRAM cell starts to discharge one of the bit-lines (the one connected to the cell node storing logic 0). The minimum acceptable differential voltage across the bit-line pair directly governs the read access time of the memory and therefore its speed. A larger bit-line differential voltage is advantageous for reliable sensing, but the cell takes longer to develop that differential voltage. The sense amplifier (SA) circuit senses the values on the BL and BLB lines during the read operation by detecting the voltage difference between them.

An optimum exists between time taken by the cell to develop the bit line differential (usually 100 mV [21]) and SA to amplify the input data to full swing CMOS signal levels [22][23].

- During 'write' operation, the write data is transferred to the desired columns by driving the data onto the bit-line pair, i.e. by grounding either the bit-line or its complement. If the cell data differs from the write data, the node storing logic '1' is discharged when the access transistor connects it to the discharged bit-line, thus causing the cell to be written with the bit-line value.
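The read-access trade-off described above can be sketched with a first-order estimate in which a constant cell read current discharges the bit-line capacitance. This is a minimal illustration, not the method used in this work: the function name, the constant-current assumption and the example values of bit-line capacitance and read current are hypothetical. Only the 100 mV differential is taken from the text.

```python
def bitline_develop_time(c_bitline, delta_v, i_read):
    """Time (s) for a constant read current to discharge the bit-line by delta_v.

    c_bitline : bit-line capacitance in farads (assumed value)
    delta_v   : required bit-line differential in volts
    i_read    : cell read current in amperes (assumed constant)
    """
    if i_read <= 0:
        raise ValueError("read current must be positive")
    return c_bitline * delta_v / i_read


if __name__ == "__main__":
    # Hypothetical example: 50 fF bit-line, 100 nA sub-threshold read current.
    for delta_v in (0.05, 0.10, 0.20):
        t = bitline_develop_time(50e-15, delta_v, 100e-9)
        print(f"delta_v = {delta_v:.2f} V -> {t * 1e9:.1f} ns")
```

Under these assumptions the development time grows linearly with the required differential, which is why a larger differential improves sensing reliability only at the cost of read access time.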

The schematic and layout of C6T are shown in Figure 5.2 [24]. The cell consists of a cross-coupled inverter pair (MP1/MN1 and MP2/MN2) that stores one bit of information. The access transistors (MN5 and MN6) are used to communicate one-bit information with the outside of the cell. These two NMOS pass transistors are controlled by WL. As long as MN5 and MN6 are turned off, the cell keeps one of its two possible steady states, either logic '0' or logic '1'.


Figure 5.2: Schematic and layout of C6T

Transistor sizing, i.e. the cell ratio (CR) and pull-up ratio (PR), is an important factor that decides stable read and write operation of an SRAM cell, as indicated by high RSNM and WSNM (the figures of merit). CR is the ratio of the size of the pull-down transistor to that of the access transistor ($=\mathrm{W}_{\text{pull down}} / \mathrm{W}_{\text{access}}$), keeping L the same for both transistors, and is relevant during the read operation. Similarly, PR is the ratio of the size of the pull-up transistor to that of the access transistor ($=\mathrm{W}_{\text{pull up}} / \mathrm{W}_{\text{access}}$), keeping L the same for both transistors, and is relevant during the write operation.
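The two ratios defined above can be written as a small helper. This is only a sketch: the function names and the example widths are hypothetical and are chosen so that the ratios evaluate to 3 and 2.6; they are not the actual transistor dimensions of the cell.

```python
def cell_ratio(w_pull_down, w_access):
    """CR = W_pull_down / W_access, assuming equal channel length L."""
    return w_pull_down / w_access


def pull_up_ratio(w_pull_up, w_access):
    """PR = W_pull_up / W_access, assuming equal channel length L."""
    return w_pull_up / w_access


if __name__ == "__main__":
    # Hypothetical widths in metres, for illustration only.
    w_access = 75e-9
    print("CR =", round(cell_ratio(225e-9, w_access), 2))
    print("PR =", round(pull_up_ratio(195e-9, w_access), 2))
```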

Typically, to maximize both figures of merit in C6T, CR and PR are kept in the range of 1.2 to 3 and $\leq 1.8$, respectively [25][26]. For 45 nm technology, the mobility ratio of NMOS to PMOS transistors is $\mu_{\mathrm{n}} / \mu_{\mathrm{p}}=2.25$, with $\mathrm{V}_{\mathrm{th}, \mathrm{n}}=0.422 \mathrm{~V}$ and $\left|\mathrm{V}_{\mathrm{th}, \mathrm{p}}\right|=0.412 \mathrm{~V}$. Accordingly, CR and PR are selected by equating transistor currents under steady-state conditions during read and write operations.

(i) Read operation of C6T: The circuit setup for the read operation of C6T is shown in Figure 5.3. Assume that logic '0' is stored in the cell initially. Thus, the internal node voltages are $\mathrm{Q}=0 \mathrm{~V}$ and $\mathrm{QB}=1 \mathrm{~V}$, and the access transistors (MN5 and MN6) are turned OFF. The transistors MP2 and MN1 are turned OFF, while the transistors MP1 and MN2 operate in linear mode.

During read operation, the bit-lines (BL/BLB) are pre-charged to a high level ( $\mathrm{V}_{\mathrm{DD}}$ ) and WL is enabled (pulsed to a high level) which turns-on the access transistors MN5 and MN6 [27].


Figure 5.3: Test circuit for read operation of C6T

The voltage at BLB does not vary significantly, as no current flows through MN5 because both of its ends (BLB and QB) are at logic '1'. On the other hand, transistors MN6 and MN2 conduct and the voltage level of the BL line begins to drop slightly, so that a differential voltage develops between the bit-lines, which is sensed by the sense amplifier. For a successful read operation, the node voltage at Q should remain below the threshold voltage of MN1 to prevent false turn-ON of transistor MN1.

Thus, CR is the critical parameter of the SRAM cell during the read operation which is here selected as 3 to keep the node voltage Q less than the threshold voltage of MN1.
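The dependence of the read-disturb voltage at Q on CR can be illustrated by equating the currents of the access and pull-down transistors, as described above. The sketch below uses a generic first-order sub-threshold current expression; this model, the slope factor `n = 1.5`, the thermal voltage value and the neglect of body effect and DIBL are assumptions made here for illustration and are not the analytical model of this thesis. Only $\mathrm{V}_{\mathrm{th}, \mathrm{n}}=0.422$ V and the 0.4 V supply are taken from the text.

```python
import math

V_T = 0.02585  # thermal voltage kT/q near 300 K, in volts


def subthreshold_current(width, v_gs, v_ds, v_th, n=1.5):
    """First-order sub-threshold drain current in arbitrary units (assumed model)."""
    return (width * math.exp((v_gs - v_th) / (n * V_T))
            * (1.0 - math.exp(-v_ds / V_T)))


def read_disturb_voltage(cell_ratio, vdd=0.4, v_th=0.422, n=1.5):
    """Steady-state voltage at the node storing '0' while WL and BL are at VDD.

    The access transistor (unit width) sources current into Q and the
    pull-down transistor (width = cell_ratio) sinks it; Q settles where
    the two currents are equal. Solved by bisection on [0, vdd].
    """
    def imbalance(v_q):
        i_access = subthreshold_current(1.0, vdd - v_q, vdd - v_q, v_th, n)
        i_pull_down = subthreshold_current(cell_ratio, vdd, v_q, v_th, n)
        return i_pull_down - i_access  # increases monotonically with v_q

    lo, hi = 0.0, vdd
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if imbalance(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for cr in (1.0, 2.0, 3.0):
        print(f"CR = {cr:.1f} -> V_Q = {read_disturb_voltage(cr) * 1e3:.1f} mV")
```

In this simplified model a larger CR lowers the steady-state voltage at Q, which is the qualitative reason a high CR improves read stability.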

A high value of CR is desirable to prevent tripping of the cell due to noise at Q, thereby increasing read stability (i.e. RSNM), but at the cost of increased cell area.

(ii) Write operation of C6T: The circuit setup for the write operation of C6T is shown in Figure 5.4. The node voltage QB always remains below the threshold voltage of MN2, since MN1 and MN5 are sized according to the CR. So, it is not sufficient to turn ON MN2.

To change the stored information, i.e. to force $\mathrm{QB}=\mathrm{V}_{\mathrm{DD}}$, the node voltage at Q must be reduced below the threshold voltage of MN 1 to turn it OFF.

So, consider a write ' 0 ' operation at node Q . Thus, internal node voltages are $\mathrm{Q}=$ ' 1 ' and QB = ' 0 ' before WL is enabled (i.e. pulsed to a high level). The transistors MP1 and MN2 are turned OFF, while the transistors MN1 and MP2 operate in linear mode.


Figure 5.4: Test circuit for write operation of C6T

During write operation, the bit-line BLB is at a high level ($\mathrm{V}_{\mathrm{DD}}$), BL is pulled down to a low level (logic '0'), and WL is enabled, which turns on the access transistors MN5 and MN6. For a successful write '0' operation, the node Q discharges through MN6 to a voltage below the threshold voltage of MN1. This turns OFF MN1 and turns ON MP1. The current through MP1 pulls up the voltage at QB to logic '1'.

The critical part of the circuit is the voltage divider formed by the pull up and access transistor. So, PR is an important parameter in write mode which is here selected as $\mathrm{PR}=2.6$.

The strength of the pull-up transistor determines the ease or difficulty of writing data '0' (i.e. flipping the state of the cell from '1' to '0'). With a small PR, it is easier to pull the node Q to GND, thereby increasing the write ability of the cell.

(iii) Hold operation of C6T: One of the primary performance metrics in nano-scale SRAM design is data retention ability, which is analyzed by computing the SNM in hold mode. The hold SNM metric, first defined by Seevinck et al. [28], measures the maximum value of DC noise voltage that can be tolerated by the SRAM cell without changing the stored bit. A higher SNM indicates better stability of the cell. The conceptual test circuit for measuring the hold SNM of C6T is shown in Figure 5.5.


Figure 5.5: Test circuit for measurement of hold SNM of C6T

The hold operation is performed by lowering WL to logic '0', which switches OFF the access transistors MN5 and MN6, while the bit-line pair (BL/BLB) is at a high voltage. This disconnects the cell nodes QB and Q from both bit-lines. Two equal DC voltage sources, VN1 and VN2, are placed between the inverters to represent the DC noise sources. These two voltage sources are swept from 0 to $V_{DD}$ to obtain the voltage transfer curves.

Then the voltage transfer curve (VTC) of the first inverter (MP1, MN1) and the mirrored voltage transfer curve ($\mathrm{VTC}^{-1}$) of the second inverter (MP2, MN2) are plotted on the same axes. The resultant curve is referred to as the 'butterfly curve'. The side length of the largest square that can be embedded inside the lobes of the butterfly curve represents the hold SNM of the C6T cell [29]. The pull-up to pull-down transistor ratio is critical for stability during hold operation and is here selected as 0.8. Figures showing the transient waveforms of $\mathrm{Q}$ and $\mathrm{QB}$ during read, write and hold modes are included in Appendix B.
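The largest-square construction on the butterfly curve can be sketched numerically as follows. The sketch assumes two identical inverters with an idealised tanh-shaped VTC; this shape and the `gain` parameter are hypothetical stand-ins for the simulated transfer curves, and the function names are introduced here for illustration. The square is found by sliding a 45-degree line $y=x+c$ across the upper-left lobe and measuring the span between its intersections with the two curves.

```python
import math


def inverter_vtc(v_in, vdd=0.4, gain=8.0):
    """Idealised symmetric inverter VTC (assumed tanh shape, not a device model)."""
    return 0.5 * vdd * (1.0 - math.tanh(gain * (v_in - 0.5 * vdd) / vdd))


def _root_increasing(func, lo, hi, iters=80):
    """Bisection root of a monotonically increasing function on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if func(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def hold_snm(vtc, vdd=0.4, steps=400):
    """Side of the largest square inside the upper-left butterfly lobe.

    Curve 1 is y = vtc(x); the mirrored curve 2 is x = vtc(y).
    For each 45-degree line y = x + c, the square whose diagonal lies on
    that line has a side equal to the horizontal distance between the
    two intersection points.
    """
    best = 0.0
    for i in range(1, steps):
        c = vdd * i / steps
        # Intersection with curve 1: x + c = vtc(x)
        x1 = _root_increasing(lambda x: x + c - vtc(x), -vdd, 2.0 * vdd)
        # Intersection with curve 2: x = vtc(x + c)
        x2 = _root_increasing(lambda x: x - vtc(x + c), -vdd, 2.0 * vdd)
        best = max(best, x1 - x2)
    return best


if __name__ == "__main__":
    for gain in (4.0, 8.0, 20.0):
        snm = hold_snm(lambda v: inverter_vtc(v, gain=gain))
        print(f"gain = {gain:4.1f} -> hold SNM = {snm * 1e3:.1f} mV")
```

For identical inverters both lobes are equal, so evaluating one lobe is sufficient; with mismatched inverters the smaller of the two lobes would define the SNM. In this idealised model a steeper VTC gives a larger SNM, bounded above by $\mathrm{V}_{\mathrm{DD}}/2$.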

### 5.3. DESIGN AND OPERATION OF PROPOSED SRAM CELLS AT 45 NM

In C6T, due to the high leakage current in the OFF state, the NMOS access transistors do not turn OFF completely in the sub-threshold region. This leads to degraded outputs at nodes Q and QB in both ON and OFF states (as shown in Table 5.2); in particular, the OFF-state output is high compared to 0 V. The high OFF-state leakage currents in the cross-coupled inverter pair likewise degrade the stability of the stored logic. Therefore, C6T must be modified for proper operation.

The C6T design is modified by using following three techniques:
I. Modification in design of access transistor.
II. Modification in design of cross coupled inverter pair.
III. Modification in connection of WL signal to access transistors to generate clock feedthrough effect.

This section presents the design and analysis of all five proposed M7T, MPT8T, M8T, M9T and MI-12T SRAM cells using above techniques.

## I. Technique I: Modification in design of access transistor

Figure 5.6 shows ten possible transistor combinations, N, P, NP, PP, PN, NN, NN-parallel, NP-parallel, PP-parallel and PN-parallel, that can replace the NMOS access transistor MN5 (MN6) of C6T in Figure 5.2.


Figure 5.6: Schematics of access transistor pairs

Each configuration given above is checked for functionality through simulation at 0.4 V supply for 45 nm. For simulation, the aspect ratio of the transistor (either MN1 or MP1) in each configuration is taken as $(\mathrm{W} / \mathrm{L})=75 \mathrm{~nm} / 55 \mathrm{~nm}=1.5$. The aspect ratio of the additional transistor (i.e. either MN2 or MP2) in each configuration is kept constant at $(\mathrm{W} / \mathrm{L})=65 \mathrm{~nm} / 45 \mathrm{~nm}=1.4$. Increasing this value above 1.4 does not improve the observed output of any modified configuration. The simulation results of these ten combinations are given in Table 5.2 and the graphs are shown in Figure 5.7. These show the ON/OFF transient state analysis of the access transistor pairs.

Table 5.2: Results of conventional and modified access transistor (P, NP, PP, PN and NN) at 0.4 V supply for 45 nm

| Access transistor type | ON condition: input at node X (V) | ON condition: output at node Y (V) | OFF condition: input at node X (V) | OFF condition: output at node Y (V) | Turn ON operation | Turn OFF operation |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| N (MN5/MN6) | 0.4 | 0.239 | 0.4 | 0.073 | Degraded | Degraded |
| P | 0.4 | 0.400 | 0.4 | 0.336 | Proper | Faulty output |
| NP | 0.4 | 0.396 | 0.4 | 0.400 | Proper | Faulty output |
| PP | 0.4 | 0.400 | 0.4 | 0.321 | Proper | Faulty output |
| PN | 0.4 | 0.400 | 0.4 | 0.000 | Proper | Proper |
| NN | 0.4 | 0.038 | 0.4 | 0.000 | Faulty output | Proper |
| NN-parallel | 0.4 | 0.400 | 0.4 | 0.078 | Proper | Degraded |
| NP-parallel | 0.4 | 0.219 | 0.4 | 0.078 | Degraded | Degraded |
| PN-parallel | 0.4 | 0.400 | 0.4 | 0.400 | Proper | Faulty output |
| PP-parallel | 0.4 | 0.400 | 0.4 | 0.400 | Proper | Faulty output |

ON condition: enable EN = 1 for NMOS, EN = 0 for PMOS. OFF condition: EN = 0 for NMOS, EN = 1 for PMOS.
The results indicate that:

- The N (NMOS) transistor in Figure 5.6 does not turn OFF properly: its output voltage of 0.073 V is high compared to 0 V. This is due to high OFF-state leakage current and degrades logic '0'. N also degrades logic '1' in the ON state, due to the low sub-threshold conduction current at the low supply voltage of 0.4 V.
- The P (PMOS) transistor also has a turn-OFF problem and does not pass a proper 0 V at its output.

Both N and P access transistors (described above) have been modified by adding an extra transistor at the output. The purpose of this additional transistor is to discharge the output node ($\mathrm{Y}$) to 0 V. The gate terminal of the additional transistor is connected to ground for ease of layout, so that no additional control line is required for the gate voltage. The performance of the modified designs is discussed below:

- The resulting NP and PP combinations show a turn-OFF problem: the output always remains charged because the additional transistor provides insufficient discharge current.
- NN has a turn-ON problem, as the output always remains discharged. Resizing the transistors has not yielded proper operation either.
- PN shows proper operation. Here, the transistors are carefully sized so that they are strong enough to allow data to be changed at the storage nodes during writing but weak enough not to flip the state of the cell during reading.
- NN-parallel does not turn OFF properly: its output voltage of 0.078 V is high compared to 0 V. This degrades logic '0'.
- The NP-parallel combination shows degraded output in both ON and OFF conditions. Its performance is poorer than that of N.
- The PP-parallel and PN-parallel combinations show faulty results in the OFF condition.

The graphs in Figure 5.7 show the ON/OFF performance of above ten access transistor combinations obtained through transient simulation.

Figure 5.7: ON/OFF state analysis of access transistors

- The plotted graphs show that only the PN combination operates correctly in both the ON and OFF states.
- The access transistors N, NN-parallel and NP-parallel show degraded output (i.e. $\mathrm{V}_{\mathrm{OL}}$ is low but greater than 0 V and $\mathrm{V}_{\mathrm{OH}}$ is high but less than 0.4 V) in the ON/OFF state, with NP-parallel performing more poorly than N.
- The access transistor combinations P, NP, PP, NN, PP-parallel and PN-parallel show faulty output in either the ON or the OFF state. Hence, these cannot be used as access transistors in SRAM cells.
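The screening of the ten combinations can be reproduced from the output voltages in Table 5.2 with a simple rule. The sketch below is illustrative only: the tolerance thresholds (2.5 % and 50 % of the supply) and the function names are assumptions chosen here so that the rule matches the Proper/Degraded/Faulty labels of the table; the thesis does not state explicit thresholds.

```python
VDD = 0.4  # supply voltage in volts

# name: (output at node Y in ON condition, output in OFF condition), in volts.
# Values transcribed from Table 5.2; the input at node X is 0.4 V throughout.
TABLE_5_2 = {
    "N": (0.239, 0.073),
    "P": (0.400, 0.336),
    "NP": (0.396, 0.400),
    "PP": (0.400, 0.321),
    "PN": (0.400, 0.000),
    "NN": (0.038, 0.000),
    "NN-parallel": (0.400, 0.078),
    "NP-parallel": (0.219, 0.078),
    "PN-parallel": (0.400, 0.400),
    "PP-parallel": (0.400, 0.400),
}


def classify(v_out, v_expected, proper_tol=0.025, faulty_tol=0.5):
    """Label an output by its error relative to VDD (thresholds are assumed)."""
    error = abs(v_out - v_expected) / VDD
    if error <= proper_tol:
        return "Proper"
    if error < faulty_tol:
        return "Degraded"
    return "Faulty output"


def screen(table=TABLE_5_2):
    """Return {name: (ON label, OFF label)}; ON should pass VDD, OFF should give 0 V."""
    return {name: (classify(v_on, VDD), classify(v_off, 0.0))
            for name, (v_on, v_off) in table.items()}


def usable(table=TABLE_5_2):
    """Combinations that are 'Proper' in both ON and OFF conditions."""
    return [name for name, labels in screen(table).items()
            if labels == ("Proper", "Proper")]


if __name__ == "__main__":
    for name, labels in screen().items():
        print(f"{name:12s} ON: {labels[0]:14s} OFF: {labels[1]}")
    print("Usable:", usable())
```

With the tabulated voltages, only the PN combination passes both checks, in agreement with the conclusion drawn from Figure 5.7.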

Based on above analysis, two new designs of SRAM cell are created by modification in design of access transistor which are discussed in Sections 5.3.1 and 5.3.2.

### 5.3.1. Design of Proposed MPT8T using PN Access Transistor

The proposed MPT8T SRAM cell comprises eight transistors, as shown in Figure 5.8. The internal architecture of the proposed 8T SRAM cell consists of a cross-coupled inverter pair (MP1/MN1 and MP2/MN2), similar to C6T, to store one-bit information.

Modification: The access transistors (MN5, MN6) of the 6T cell are replaced by the MP3-MN3 and MP4-MN4 pairs respectively, thereby making it an 8T SRAM cell [30]. The additional NMOS transistors (MN3/MN4) are always in the OFF state for both read and write operations. During hold operation, the off-state leakage current of MN3/MN4 helps in maintaining '0' at node QB/Q. Also, they do not require an additional control line to turn them ON/OFF. For CR and PR calculation, the aspect ratio of the additional transistor (i.e. either MN3 or MN4) in each configuration is kept constant at $(\mathrm{W}/\mathrm{L}) = 65\,\mathrm{nm}/45\,\mathrm{nm} \approx 1.4$.


Figure 5.8: Schematic and layout of MPT8T

(i) Read operation of MPT8T: The circuit set-up for the read operation of MPT8T is shown in Figure 5.9. Assume that logic '0' is stored in the cell initially. Thus, the internal node voltages are $\mathrm{Q}=0 \mathrm{~V}$ and $\mathrm{QB}=1 \mathrm{~V}$, and the access transistors (MP3 and MP4) are OFF. The transistors MP2 and MN1 are turned OFF, while the transistors MP1 and MN2 operate in linear mode.

During read operation, the bit-lines (BL/BLB) are pre-charged to $\mathrm{V}_{\mathrm{DD}}$ and WL is enabled (pulsed to a low level), which turns ON the MP3 and MP4 transistors of the access transistor pairs.


Figure 5.9: Test circuit for read operation of MPT8T

After the access transistors MP3 and MP4 are switched ON, the voltage at BLB does not change, as no current flows through MP3 because BLB and QB are both at '1'. On the other hand, transistors MP4 and MN2 conduct and the voltage level of the BL line begins to drop slightly, so that a differential voltage develops between the bit-lines, which is sensed by the sense amplifier. The read operation is stable provided the node voltage Q, which tends to rise, does not exceed the threshold voltage of MN1.

For maximum RSNM in MPT8T, the CR has been checked through simulation and chosen to be 3.

(ii) Write operation of MPT8T: The circuit set-up for the write operation of MPT8T is shown in Figure 5.10. Assume that logic '1' is stored in the SRAM cell initially, so that the internal node voltages are $\mathrm{Q}=$ '1' and $\mathrm{QB}=$ '0'. The transistors MP1 and MN2 are turned OFF, while the transistors MN1 and MP2 operate in linear mode.

During the write '0' operation, the bit-line BLB is pre-charged to a high level ($\mathrm{V}_{\mathrm{DD}}$), BL is discharged to a low level (logic '0') and WL is enabled (pulsed to a low voltage), which turns ON the PMOS transistors (MP3 and MP4) of the corresponding access transistor pairs. The voltages at the Q and QB nodes fall and rise respectively to reach logic '0' and '1'.


Figure 5.10: Test circuit for write operation of MPT8T

For a successful write '0' operation, the node voltage Q should remain below the threshold voltage of MN1. For this, the aspect ratios of the transistors (MP4, MN2) and (MP3, MN1) have to be computed accurately. For maximum WSNM in MPT8T, the PR has been obtained through simulation as 1.8.

(iii) Hold operation of MPT8T: The conceptual test circuit for measuring the hold SNM of MPT8T is shown in Figure 5.11.


Figure 5.11: Test circuit for hold SNM of MPT8T

The hold operation is performed by pre-charging the bit-line pair (BL/BLB) to a high voltage ($\mathrm{V}_{\mathrm{DD}}$). WL is disabled (biased at logic '1'), thus switching OFF the MP3 and MP4 access transistors. This disconnects the cell nodes QB and Q from both bit-lines.

The pull-up to pull-down transistor ratio is critical for stability during hold operation and is selected here as 0.6. The hold SNM of the MPT8T in the sub-threshold region is then calculated using the methodology given in Section 5.2. Figures showing the transient waveforms of Q and QB during read, write and hold modes are included in Appendix B.

### 5.3.2. Design of Proposed M8T using NN-Parallel Access Transistor

The proposed M8T SRAM cell comprises eight transistors, as shown in Figure 5.12. The internal architecture of the proposed 8T SRAM cell consists of a cross-coupled inverter pair (MP1/MN1 and MP2/MN2) with NN-parallel type access transistors, i.e. MN5-MN3 (or MN6-MN4).

Modification: Here, in the NN-parallel type access transistor pair, MN3 and MN4 always operate in the OFF state and provide additional current (their off-state leakage current) to nodes Q/QB, which decreases the delay of the memory cell and improves the SNM characteristics. Another advantage is that no additional control lines are required to control the gate voltages of MN3 and MN4. For CR and PR calculation, the aspect ratio of the additional transistor (i.e. either MN3 or MN4) in each configuration is kept constant at $(\mathrm{W}/\mathrm{L}) = 65\,\mathrm{nm}/45\,\mathrm{nm} \approx 1.4$.


Figure 5.12: Schematic and layout of M8T

(i) Read operation of M8T: The circuit set-up for the read operation of M8T is shown in Figure 5.13. Assume that logic '0' is stored in the cell initially. Thus, the internal node voltages are $\mathrm{Q}=0 \mathrm{~V}$ and $\mathrm{QB}=1 \mathrm{~V}$. The access transistors (MN5 and MN6) are turned OFF. The transistors MP2 and MN1 are turned OFF, while the transistors MP1 and MN2 operate in linear mode. The two additional NMOS transistors (MN3 and MN4) operate in the cut-off region (OFF state).

During read operation, the bit-lines (BL/BLB) are pre-charged to $\mathrm{V}_{\mathrm{DD}}$ and WL is enabled (pulsed to a high level), which turns ON the access transistors MN5 and MN6.

After the access transistors are switched ON, the voltage at BLB does not vary significantly, as no current flows through MN5 because BLB and QB are both at '1'.

Figure 5.13: Test circuit for read operation of M8T

On the other hand, transistors MN6 and MN2 conduct and the voltage level of the BL line begins to drop slightly, so that a differential voltage develops between the bit-lines, which is sensed by the sense amplifier. For a successful read operation, the node voltage at Q should remain below the threshold voltage of MN1 to prevent false turn-ON of transistor MN1. Thus, CR is the critical parameter of the SRAM cell during the read operation. For maximum RSNM, the CR has been obtained through simulation as 2.7.
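
The dependence of the read-disturb voltage at node Q on CR can be illustrated with a minimal Python sketch. It is not the simulation set-up used in this work: it assumes a first-order textbook sub-threshold current model, takes CR as the width ratio of the pull-down transistor (MN2) to the access transistor (MN6), ignores the leakage contribution of MN3/MN4, and uses hypothetical device parameters (`vth`, `n`, `i0`) and an assumed supply of 0.4 V.

```python
import math


def subthreshold_current(width, vgs, vds, vth=0.45, n=1.5, vt=0.02585, i0=1e-7):
    """First-order sub-threshold NMOS current (textbook model, assumed parameters)."""
    return width * i0 * math.exp((vgs - vth) / (n * vt)) * (1.0 - math.exp(-vds / vt))


def read_node_voltage(cell_ratio, vdd=0.4, tol=1e-9):
    """Voltage at node Q while reading a stored '0', found by bisection.

    The access transistor sources current from BL (= VDD) into Q and the
    pull-down transistor sinks it to ground; Q settles where both are equal.
    """
    w_access = 1.0                       # normalised access transistor width
    w_pulldown = cell_ratio * w_access   # assumed definition of CR
    lo, hi = 0.0, vdd
    while hi - lo > tol:
        vq = 0.5 * (lo + hi)
        # Access transistor: gate = WL = VDD, drain = BL = VDD, source = Q.
        i_in = subthreshold_current(w_access, vdd - vq, vdd - vq)
        # Pull-down transistor: gate = QB = VDD, drain = Q, source = ground.
        i_out = subthreshold_current(w_pulldown, vdd, vq)
        if i_out > i_in:
            hi = vq   # Q is too high: the pull-down wins, so move down
        else:
            lo = vq   # Q is too low: the access transistor wins, so move up
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for cr in (1.0, 2.2, 2.7, 3.0):
        print(f"CR = {cr:.1f} -> V_Q = {read_node_voltage(cr) * 1e3:.2f} mV")
```

In this simplified model a larger CR gives a lower voltage at Q during the read, which is the qualitative trend behind choosing CR for maximum RSNM; the absolute values depend entirely on the assumed parameters.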

(ii) Write operation of M8T: The circuit set-up for the write operation of M8T is shown in Figure 5.14. Consider a write '0' operation at node Q. The internal node voltages are $\mathrm{Q}=$ '1' and $\mathrm{QB}=$ '0' before WL is enabled (i.e. pulsed to a high level). The transistors MP1 and MN2 are turned OFF, while the transistors MN1 and MP2 operate in linear mode.

During write operation, the bit-line BLB is at a high level ($\mathrm{V}_{\mathrm{DD}}$), BL is pulled down to a low level (logic '0') and WL is enabled, which turns ON the access transistors MN5 and MN6. For a successful write '0' operation, the node voltage Q discharges through MN6 to a low voltage below the threshold voltage of MN1.


Figure 5.14: Test circuit for write operation of M8T

The critical part of the circuit is the voltage divider formed by the pull-up and access transistors. For maximum WSNM, the PR has been obtained through simulation as 2.5.

(iii) Hold operation of M8T: The test circuit for measurement of the hold SNM of M8T is shown in Figure 5.15.


Figure 5.15: Test circuit for measurement of hold SNM of M8T

The hold operation is performed by biasing WL at logic '0', which switches OFF the MN5 and MN6 access transistors, while the bit-line pair (BL/BLB) is at a high voltage. This disconnects the cell nodes QB and Q from both bit-lines. The pull-up to pull-down transistor ratio is critical for stability during hold operation and is selected here as 0.9. The hold SNM of the M8T in the sub-threshold region is then calculated using the methodology given in Section 5.2. Figures showing the transient waveforms of Q and QB during read, write and hold modes are included in Appendix B.

## II. Technique II: Modification in design of cross coupled inverter pair

Section 5.3.3 discusses the new design in which the cross-coupled inverter pair is modified.

### 5.3.3. Design of Proposed MI-12T

The proposed MI-12T SRAM cell comprises twelve transistors, as shown in Figure 5.16. The internal architecture of the proposed MI-12T consists of N-type access transistors (MN5, MN6) and a modified cross-coupled inverter pair (MP1/MN1 and MP2/MN2).

Modification: Here, the cross-coupled inverter pair is modified by adding ON-OFF logic composed of PMOS and NMOS transistors (MP3-MN3-MN7 and MP4-MN4-MN8) in order to minimize the OFF-state leakage current during hold mode [31]. Transistors (MN3, MN1) and (MN4, MN2) form stacked paths. Due to the body bias effect, MN3 and MN1 have an increased threshold voltage, which reduces the OFF-state leakage current through MN1 even with logic '0' at node Q.

Further, the ON-OFF logic transistor MN3 isolates MN1 from node QB. If the voltage at Q rises (due to a destructive read operation), which may falsely turn ON MN1, MN3 remains OFF, so QB remains high. This stops the positive feedback through the cross-coupled inverter path, which could otherwise switch ON the MP2 transistor and change the logic '0' stored at node Q.

The gate voltages of MN3 and MN4 are generated by MP3 and MP4 for both logic '0' and '1' at the QB/Q nodes, as per Table 5.2. Thus, for $\mathrm{QB}=$ '0', node $\mathrm{a}=$ '1', so MN3 turns ON, and vice versa. The MN3 transistor (in the cross-coupled inverter logic) operates in the linear region for $\mathrm{QB}=$ '0' and in the cut-off region for $\mathrm{QB}=$ '1'. The same holds for node Q.


Figure 5.16: Proposed modified inverter based 12T SRAM cell MI-12T

(i) Read operation of MI-12T: The circuit set-up for the read operation of MI-12T is shown in Figure 5.17. Assume that logic '0' is stored in the cell initially. Thus, the internal node voltages are $\mathrm{Q}=0 \mathrm{~V}$ and $\mathrm{QB}=1 \mathrm{~V}$, and the access transistors (MN5 and MN6) are OFF. The transistors MP2 and MN1 are turned OFF, while the transistors MP1 and MN2 operate in linear mode. One of the ON-OFF logic transistors (MN4) is turned ON, and the transistor in the other ON-OFF logic is in the OFF state. During read operation, the bit-lines (BL/BLB) are pre-charged to $\mathrm{V}_{\mathrm{DD}}$ and WL is enabled (pulsed to a high level), which turns ON the access transistors.


Figure 5.17: Test circuit for read operation of MI-12T

The voltage at BLB does not change significantly, as no current flows through MN5 because BLB and QB are both at '1'. On the other hand, MN6, MP4, MN4 and MN2 conduct and the voltage level of the BL line begins to drop slightly, so that a differential voltage develops between the bit-lines, which is sensed by the sense amplifier. The read operation is stable provided the node voltage Q does not exceed the threshold voltage of MN1.

For proper read operation, the aspect ratios of the transistors (MN6, MN4, MN2) and (MN5, MN3, MN1) have to be computed accurately. For maximum RSNM, the CR has been obtained through simulation as 2.9.

The ON-OFF logic transistor MN3 isolates MN1 from node QB. If the voltage at Q rises (destructive read), which may falsely turn ON MN1, MN3 remains OFF because QB is high. This stops the positive feedback through the cross-coupled inverter path, which could otherwise falsely switch ON the MP2 transistor and change the logic '0' stored at node Q.

(ii) Write operation of MI-12T: The circuit set-up for the write operation of MI-12T is shown in Figure 5.18. Consider a write '0' operation, assuming logic '1' is stored in the SRAM cell initially. The internal node voltages are $\mathrm{Q}=$ '1' and $\mathrm{QB}=$ '0', and the access transistors (MN5 and MN6) are in the OFF state. The transistors MP1 and MN2 are turned OFF, while the transistors MN1 and MP2 operate in linear mode. In the ON-OFF logic pairs, transistor MN3 is ON and MN4 is OFF.

During write operation, the bit-line BLB is pre-charged to a high level ($\mathrm{V}_{\mathrm{DD}}$) and BL is set to a low level (logic '0'). WL is enabled (pulsed to a high level), which turns ON the access transistors. The access transistors MN5 and MN6 conduct, causing the voltages at Q and QB to decrease and increase respectively. This causes MN1 and MP2 to turn OFF and MN2 and MP1 to turn ON.

For a successful write operation, the aspect ratios of the transistors (MN6, MP2) and (MN5, MP1) have to be computed accurately. For maximum WSNM in MI-12T, the PR has been obtained through simulation as 1.8.

Because of the MN4/MN2 stack at Q (increased resistance of the pull-down path), the transistor widths need to be increased to obtain the optimum CR.


Figure 5.18: Test circuit for write operation of MI-12T

(iii) Hold operation of MI-12T: The test circuit for measurement of the hold SNM of MI-12T is shown in Figure 5.19.


Figure 5.19: Test circuit for measurement of hold SNM of MI-12T

The hold operation is performed by disabling WL (logic '0'), which switches OFF the MN5 and MN6 access transistors, while the bit-line pair (BL/BLB) is pre-charged to a high voltage ($\mathrm{V}_{\mathrm{DD}}$). This disconnects the cell nodes QB and Q from both bit-lines.

The ON-OFF logic forms a stack of MN3 and MN1. This increases the threshold voltage of MN3, thereby reducing the OFF-state leakage current through the stacked transistor path. This reduces the rate at which the charge stored at QB (or Q) leaks through this path, which improves the hold SNM. Figure 5.20 shows a comparison of the OFF-state leakage current through the pull-down path (MN1) of MI-12T and C6T during hold operation. The leakage current is lower in MI-12T, thereby confirming the stacking effect.
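
As a rough illustration of why a stack of two OFF transistors leaks less than a single OFF transistor, the following Python sketch solves for the intermediate node of a two-transistor stack. It is a simplified sketch, not the simulation behind Figure 5.20: it assumes a first-order textbook sub-threshold current model with hypothetical parameters (`VTH`, `N`, `I0`), identical devices, and an assumed supply of 0.4 V, and it ignores body effect and DIBL. In this model the reduction comes only from the rise of the intermediate node, which gives the upper device a negative gate-source voltage.

```python
import math

VT = 0.02585   # thermal voltage at about 300 K, in volts
N = 1.5        # sub-threshold slope factor (assumed)
VTH = 0.45     # threshold voltage in volts (assumed)
I0 = 1e-7      # current prefactor in amperes (assumed)


def i_sub(vgs, vds):
    """First-order sub-threshold NMOS current (textbook model)."""
    return I0 * math.exp((vgs - VTH) / (N * VT)) * (1.0 - math.exp(-vds / VT))


def single_off_leakage(vdd):
    """Leakage of one OFF transistor (gate and source at 0 V, drain at vdd)."""
    return i_sub(0.0, vdd)


def stacked_off_leakage(vdd, tol=1e-12):
    """Leakage of two series OFF transistors and their intermediate node voltage.

    Both gates are at 0 V. The intermediate node vx settles where the current
    of the upper device equals the current of the lower device.
    """
    lo, hi = 0.0, vdd
    while hi - lo > tol:
        vx = 0.5 * (lo + hi)
        i_upper = i_sub(-vx, vdd - vx)   # source at vx, so Vgs = -vx
        i_lower = i_sub(0.0, vx)         # source at ground, drain at vx
        if i_lower > i_upper:
            hi = vx
        else:
            lo = vx
    vx = 0.5 * (lo + hi)
    return i_sub(0.0, vx), vx


if __name__ == "__main__":
    vdd = 0.4
    i_single = single_off_leakage(vdd)
    i_stack, vx = stacked_off_leakage(vdd)
    print(f"intermediate node = {vx * 1e3:.1f} mV")
    print(f"stack / single leakage = {i_stack / i_single:.2f}")
```

The ratio printed by the sketch depends on the assumed parameters; including body effect and DIBL, as a circuit simulator does, would be expected to change the reduction obtained.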


Figure 5.20: Pull down currents plot for MI-12T and C6T during hold operation

The pull-up to pull-down transistor ratio is critical for stability during hold operation and is selected here as 0.62. The hold SNM of the MI-12T in the sub-threshold region is then calculated using the methodology given in Section 5.2. Figures showing the transient waveforms of Q and QB during read, write and hold modes are included in Appendix B.

## III. Technique III: Modification in connection of WL signal to access transistors to generate clock feed-through effect

Section 5.3.4 discusses the two new designs where modification is done in connection of WL signal to access transistors to generate clock feed-through effect.

### 5.3.4. Design of Proposed M7T

The proposed M7T SRAM cell is very similar to C6T, with the addition of one extra transistor. It comprises seven transistors, as shown in Figure 5.21. The internal architecture of the proposed M7T consists of access transistors (MN5, MN6) and a cross-coupled inverter pair (MP1/MN1 and MP2/MN2) to store one bit of information inside the cell.

Modification: An extra PMOS transistor (MP3), whose gate terminal is controlled by WL, is added. Its source and drain terminals are connected to the gates of the access transistors (MN5/MN6). The voltage generated by the clock feed-through effect through the parasitic capacitances of MP3, MN5 and MN6 is used in two ways:

- To turn ON/ OFF MN5/MN6 transistors
- To generate a negative voltage at node Q , when WL is disabled, due to clock feed through effect (through parasitic capacitances of MP3, MN5, and MN6). This negative voltage reduces the net voltage at node Q during read operation. Hence, voltage at node Q does not exceed the threshold voltage of MN1. This increases the RSNM of M7T in comparison to C6T. Similar analysis is valid for node QB.


## Generation of gate voltage for MN5/MN6 transistors to turn ON/ OFF:

As MP3 does not turn OFF properly at the 45 nm technology node (as shown in Table 5.2), it always remains ON, so node X and node Y are coupled through it and reach the same final voltage (either low or high) at the gate terminals of MN5 and MN6.


Figure 5.21: Schematic and layout of M7T

The transistor MP3 acts as a switch between the two access transistors. When WL is enabled (logic '1'), the source/drain voltage of MP3 is generated by the clock feed-through effect through coupling of the parasitic capacitances of MP3 and the access transistors (MN5/MN6), as shown in Figure 5.22. Figure 5.23 shows the circuit model, with parasitic capacitances only, used to estimate the voltages at nodes X and Y during a clock feed-through event. Transistors MP1 and MN2 are modeled as closed ideal switches.


Figure 5.22: Coupling of parasitic capacitances in M7T during clock feed-through event


Figure 5.23: Model for estimation of voltage at node $X$ and $Y$ during clock feed-through event

Here:

- C1/C2 is the gate-source/gate-drain capacitance of MP3 (or vice versa);
- C3/C5 is the gate-source/gate-drain capacitance of MN5;
- C4/C6 is the gate-source/gate-drain capacitance of MN6.

Other capacitances show negligible values during simulation; hence they are neglected.

Read Operation: Here, WL increases from 0 to $\mathrm{V}_{\mathrm{DD}}$, $\mathrm{BL}=\mathrm{BLB}=\mathrm{V}_{\mathrm{DD}}$ (pre-charged), $\mathrm{Q}=0$ and $\mathrm{QB}=\mathrm{V}_{\mathrm{DD}}$. Through simulation, the values of the parasitic capacitances are obtained as: $\mathrm{C1}=\mathrm{C2}=0.41 \mathrm{~pF}$, $\mathrm{C5}=\mathrm{C6}=0.0001 \mathrm{~pF}$, $\mathrm{C3}=0.002 \mathrm{~pF}$ and $\mathrm{C4}=0.008 \mathrm{~pF}$.


Figure 5.24: Read circuit set up for estimation of charge at different nodes during clock feed-through event

Figure 5.24 (a) shows the model of Figure 5.23 with node voltage of read operation.
Figure 5.24 (b) shows the equivalent circuit of Figure 5.24 (a) for each capacitive path shown separately for estimation of instantaneous charge stored across them.

Figure 5.24 (c) shows the equivalent circuit for estimation of voltage $\mathrm{V}_{\mathrm{X}, \mathrm{Y}}$ obtained under steady state condition.

The expressions for the voltage/charge at different nodes in Figure 5.24 (b) are given below.

Expression for charge $\mathrm{Q}_{\mathrm{X1}}$: Voltage $\mathrm{V}_{\mathrm{X1}}$ is generated at node X1 due to the clock feed-through effect through the C1, C3 and C5 capacitive path.

$$
\mathrm{V}_{\mathrm{X1}}=\left(\mathrm{V}_{\mathrm{DD}}\right)\left(\frac{\mathrm{C1}}{\mathrm{C1}+(\mathrm{C3}+\mathrm{C5})}\right)
$$

Therefore, the final expression for charge $\mathrm{Q}_{\mathrm{X1}}\left(=(\mathrm{C5}+\mathrm{C3}) \cdot \mathrm{V}_{\mathrm{X1}}\right)$ is defined as

$$
\begin{equation*}
\mathrm{Q}_{\mathrm{X} 1}=\left(\mathrm{V}_{\mathrm{DD}}\right)(\mathrm{C} 3+\mathrm{C} 5)\left(\frac{\mathrm{C} 1}{\mathrm{C} 1+(\mathrm{C} 3+\mathrm{C} 5)}\right) \tag{5.1}
\end{equation*}
$$

Expression for charge $\mathrm{Q}_{\mathrm{Y1}}$: Voltage $\mathrm{V}_{\mathrm{Y1}}$ is generated at node Y1 due to the clock feed-through effect through the C2, C4 capacitive path.

$$
\mathrm{V}_{\mathrm{Y1}}=\left(\mathrm{V}_{\mathrm{DD}}\right)\left(\frac{\mathrm{C2}}{\mathrm{C2}+\mathrm{C4}}\right)
$$

Therefore, the final expression for charge $\mathrm{Q}_{\mathrm{Y1}}\left(=\mathrm{C4} \cdot \mathrm{V}_{\mathrm{Y1}}\right)$ is defined as

$$
\begin{equation*}
\mathrm{Q}_{\mathrm{Y1}}=\left(\mathrm{V}_{\mathrm{DD}}\right)(\mathrm{C4})\left(\frac{\mathrm{C2}}{\mathrm{C2}+\mathrm{C4}}\right) \tag{5.2}
\end{equation*}
$$

Expression for charge $\mathrm{Q}_{\mathrm{Y2}}$: Voltage $\mathrm{V}_{\mathrm{Y2}}$ is generated at node Y2 due to the clock feed-through effect through the C2, C6 capacitive path.

$$
\mathrm{V}_{\mathrm{Y2}}=\left(\mathrm{V}_{\mathrm{DD}}\right)\left(\frac{\mathrm{C2}}{\mathrm{C2}+\mathrm{C6}}\right)
$$

Therefore, the final expression for charge $\mathrm{Q}_{\mathrm{Y2}}\left(=\mathrm{C6} \cdot \mathrm{V}_{\mathrm{Y2}}\right)$ is defined as

$$
\begin{equation*}
\mathrm{Q}_{\mathrm{Y2}}=\left(\mathrm{V}_{\mathrm{DD}}\right)(\mathrm{C6})\left(\frac{\mathrm{C2}}{\mathrm{C2}+\mathrm{C6}}\right) \tag{5.3}
\end{equation*}
$$

From Figure 5.24 (c), the expression for the final charge $\left(\mathrm{Q}_{\mathrm{F}}\right)$ is obtained as

$$
\mathrm{Q}_{\mathrm{F}}=\left(\mathrm{V}_{\mathrm{X,Y}}\right)(\mathrm{C3}+\mathrm{C4}+\mathrm{C5}+\mathrm{C6})
$$

The final steady-state voltage $\mathrm{V}_{\mathrm{X,Y}}\left(=\mathrm{V}_{\mathrm{X}}=\mathrm{V}_{\mathrm{Y}}\right)$ is obtained using charge conservation when MP3 is ON (modeled as a closed switch):

$$
\mathrm{Q}_{\mathrm{F}}=\mathrm{Q}_{\mathrm{X1}}+\mathrm{Q}_{\mathrm{Y1}}+\mathrm{Q}_{\mathrm{Y2}}
$$

$$
\begin{equation*}
\mathrm{V}_{\mathrm{X,Y}}=\frac{\left(\mathrm{Q}_{\mathrm{X1}}+\mathrm{Q}_{\mathrm{Y1}}+\mathrm{Q}_{\mathrm{Y2}}\right)}{\mathrm{C3}+\mathrm{C4}+\mathrm{C5}+\mathrm{C6}} \tag{5.4}
\end{equation*}
$$
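
The read-mode estimate can be evaluated numerically with a short Python sketch. It follows the capacitive-divider and charge-conservation expressions of this derivation, ending in equation (5.4), and uses the read-mode capacitances quoted above. The function names are illustrative, and the supply value of 0.4 V is an assumption made for this sketch, so the result need not coincide exactly with the estimate reported in Table 5.3.

```python
def divider_charge(v_step, c_couple, c_load):
    """Charge left on c_load when a step v_step couples through c_couple.

    The node voltage is the capacitive-divider value
    v_step * c_couple / (c_couple + c_load).
    """
    v_node = v_step * c_couple / (c_couple + c_load)
    return c_load * v_node


def read_gate_voltage(vdd, c1, c2, c3, c4, c5, c6):
    """Steady-state voltage V_XY at the access-transistor gates in read mode."""
    q_x1 = divider_charge(vdd, c1, c3 + c5)   # path C1 with C3 and C5, Eq. (5.1)
    q_y1 = divider_charge(vdd, c2, c4)        # path C2 with C4
    q_y2 = divider_charge(vdd, c2, c6)        # path C2 with C6
    # Charge conservation with MP3 modeled as a closed switch, Eq. (5.4).
    return (q_x1 + q_y1 + q_y2) / (c3 + c4 + c5 + c6)


if __name__ == "__main__":
    # Read-mode capacitances in pF; the unit cancels in the final ratio.
    caps_pf = dict(c1=0.41, c2=0.41, c3=0.002, c4=0.008, c5=0.0001, c6=0.0001)
    vdd = 0.4  # assumed supply voltage in volts
    print(f"V_XY (read) = {read_gate_voltage(vdd, **caps_pf):.3f} V")
```

Because C1 and C2 are much larger than the load capacitances in read mode, each divider passes almost the full step, so the estimated gate voltage lies close to the assumed supply.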

Write Operation: Here, WL increases from 0 to $\mathrm{V}_{\mathrm{DD}}$, $\mathrm{BL}=0$, $\mathrm{BLB}=\mathrm{V}_{\mathrm{DD}}$, $\mathrm{Q}=\mathrm{V}_{\mathrm{DD}}$ and $\mathrm{QB}=0$. Through simulation, the values of the parasitic capacitances are obtained as: $\mathrm{C1}=\mathrm{C2}=0.0062 \mathrm{~pF}$, $\mathrm{C5}=\mathrm{C6}=0.0001 \mathrm{~pF}$, $\mathrm{C3}=0.002 \mathrm{~pF}$ and $\mathrm{C4}=0.008 \mathrm{~pF}$.


Figure 5.25: Write circuit set up for estimation of charge at different nodes during clock feed-through event

Figure 5.25 (a) shows the model of Figure 5.23 with node voltage of write operation. Figure 5.25 (b) shows the equivalent circuit of Figure 5.25 (a) for each capacitive path shown separately for estimation of instantaneous charge stored across them.

Figure 5.25 (c) shows the equivalent circuit for estimation of voltage $\mathrm{V}_{\mathrm{X}, \mathrm{Y}}$ obtained under steady state condition.

The expressions for the voltage/charge at different nodes in Figure 5.25 (b) are given below.

Expression for charge $\mathrm{Q}_{\mathrm{X1}}$: Voltage $\mathrm{V}_{\mathrm{X1}}$ is generated due to the clock feed-through effect through the C1, C5 capacitive path.

$$
\mathrm{V}_{\mathrm{X1}}=\left(\mathrm{V}_{\mathrm{DD}}\right)\left(\frac{\mathrm{C1}}{\mathrm{C5}+\mathrm{C1}}\right)
$$

Therefore, the final expression for charge $\mathrm{Q}_{\mathrm{X1}}\left(=\mathrm{C5} \cdot \mathrm{V}_{\mathrm{X1}}\right)$ is defined as

$$
\begin{equation*}
\mathrm{Q}_{\mathrm{X1}}=\left(\mathrm{V}_{\mathrm{DD}}\right)(\mathrm{C5})\left(\frac{\mathrm{C1}}{\mathrm{C5}+\mathrm{C1}}\right) \tag{5.5}
\end{equation*}
$$

Expression for charge $\mathrm{Q}_{\mathrm{X2}}$: Voltage $\mathrm{V}_{\mathrm{X2}}$ is generated due to the clock feed-through effect through the C1, C3 capacitive path.

$$
\mathrm{V}_{\mathrm{X2}}=\left(\mathrm{V}_{\mathrm{DD}}\right)\left(\frac{\mathrm{C1}}{\mathrm{C1}+\mathrm{C3}}\right)
$$

Therefore, the final expression for charge $\mathrm{Q}_{\mathrm{X2}}\left(=\mathrm{C3} \cdot \mathrm{V}_{\mathrm{X2}}\right)$ is defined as

$$
\begin{equation*}
\mathrm{Q}_{\mathrm{X} 2}=\left(\mathrm{V}_{\mathrm{DD}}\right)(\mathrm{C} 3)\left(\frac{\mathrm{C} 1}{\mathrm{C} 1+\mathrm{C} 3}\right) \tag{5.6}
\end{equation*}
$$

Expression for charge $\mathrm{Q}_{\mathrm{Y1}}$: Voltage $\mathrm{V}_{\mathrm{Y1}}$ is generated at node Y1 due to the clock feed-through effect through the C2 and C4 capacitive path.

$$
\mathrm{V}_{\mathrm{Y} 1}=\left(\mathrm{V}_{\mathrm{DD}}\right)\left(\frac{\mathrm{C} 2}{\mathrm{C} 2+\mathrm{C} 4}\right)
$$

Therefore, the final expression for charge $\mathrm{Q}_{\mathrm{Y1}}\left(=\mathrm{C4} \cdot \mathrm{V}_{\mathrm{Y1}}\right)$ is defined as

$$
\begin{equation*}
\mathrm{Q}_{\mathrm{Y} 1}=\left(\mathrm{V}_{\mathrm{DD}}\right)(\mathrm{C} 4)\left(\frac{\mathrm{C} 2}{\mathrm{C} 2+\mathrm{C} 4}\right) \tag{5.7}
\end{equation*}
$$

Expression for charge $Q_{Y 2}$ : Voltage $\mathrm{V}_{\mathrm{Y} 2}$ is generated at node Y 2 due to clock feed-through effect through C2 and C6 capacitive path.

$$
\mathrm{V}_{\mathrm{Y} 2}=\left(\mathrm{V}_{\mathrm{DD}}\right)\left(\frac{\mathrm{C} 2}{\mathrm{C} 2+\mathrm{C} 6}\right)
$$

Therefore, the final expression for charge $\mathrm{Q}_{\mathrm{Y2}}\left(=\mathrm{C6} \cdot \mathrm{V}_{\mathrm{Y2}}\right)$ is defined as

$$
\begin{equation*}
\mathrm{Q}_{\mathrm{Y} 2}=\left(\mathrm{V}_{\mathrm{DD}}\right)(\mathrm{C} 6)\left(\frac{\mathrm{C} 2}{\mathrm{C} 2+\mathrm{C} 6}\right) \tag{5.8}
\end{equation*}
$$

From Figure 5.25 (c), the expression for the final charge $\left(\mathrm{Q}_{\mathrm{F}}\right)$ is obtained as

$$
\mathrm{Q}_{\mathrm{F}}=\left(\mathrm{V}_{\mathrm{X,Y}}\right)(\mathrm{C3}+\mathrm{C4}+\mathrm{C5}+\mathrm{C6})
$$

The final steady-state voltage $\mathrm{V}_{\mathrm{X,Y}}\left(=\mathrm{V}_{\mathrm{X}}=\mathrm{V}_{\mathrm{Y}}\right)$ is obtained using charge conservation when MP3 is ON (modeled as a closed switch):

$$
\mathrm{Q}_{\mathrm{F}}=\mathrm{Q}_{\mathrm{X1}}+\mathrm{Q}_{\mathrm{Y1}}+\mathrm{Q}_{\mathrm{Y2}}
$$

$$
\begin{equation*}
\mathrm{V}_{\mathrm{X,Y}}=\frac{\left(\mathrm{Q}_{\mathrm{X1}}+\mathrm{Q}_{\mathrm{Y1}}+\mathrm{Q}_{\mathrm{Y2}}\right)}{\mathrm{C3}+\mathrm{C4}+\mathrm{C5}+\mathrm{C6}} \tag{5.9}
\end{equation*}
$$

Hold Operation: Here, WL decreases from $\mathrm{V}_{\mathrm{DD}}$ to 0, $\mathrm{BL}=\mathrm{BLB}=\mathrm{V}_{\mathrm{DD}}$, $\mathrm{Q}=0$ and $\mathrm{QB}=\mathrm{V}_{\mathrm{DD}}$. Through simulation, the values of the parasitic capacitances are obtained as: $\mathrm{C1}=\mathrm{C2}=0.41 \mathrm{~pF}$, $\mathrm{C5}=\mathrm{C6}=0.001 \mathrm{~pF}$, $\mathrm{C3}=0.002 \mathrm{~pF}$ and $\mathrm{C4}=0.008 \mathrm{~pF}$.


Figure 5.26: Hold circuit set up for estimation of charge at different nodes during clock feed-through event

Figure 5.26 (a) shows the model of Figure 5.23 with node voltage of hold operation.
Figure 5.26 (b) shows the equivalent circuit of Figure 5.26 (a) for each capacitive path shown separately for estimation of instantaneous charge stored across them.

Figure 5.26 (c) shows the equivalent circuit for estimation of voltage $V_{X, Y}$ obtained under steady state condition.

The expressions for the voltage/charge at different nodes in Figure 5.26 (b) are given below.

Expression for charge $\mathrm{Q}_{\mathrm{X1}}$: Voltage $\mathrm{V}_{\mathrm{X1}}$ is generated due to the clock feed-through effect through the C1, C3 and C5 capacitive path.

$$
\mathrm{V}_{\mathrm{X} 1}=\left(-\mathrm{V}_{\mathrm{DD}}\right)\left(\frac{\mathrm{C} 1}{\mathrm{C} 1+(\mathrm{C} 5+\mathrm{C} 3)}\right)
$$

Therefore, the final expression for charge $\mathrm{Q}_{\mathrm{X1}}\left(=(\mathrm{C5}+\mathrm{C3}) \cdot \mathrm{V}_{\mathrm{X1}}\right)$ is defined as

$$
\begin{equation*}
\mathrm{Q}_{\mathrm{x} 1}=\left(-\mathrm{V}_{\mathrm{DD}}\right)(\mathrm{C} 3+\mathrm{C} 5)\left(\frac{\mathrm{C} 1}{\mathrm{C} 1+(\mathrm{C} 3+\mathrm{C} 5)}\right) \tag{5.10}
\end{equation*}
$$

Expression for charge $Q_{Y 1}$ : Voltage $\mathrm{V}_{\mathrm{Y} 1}$ is generated at node Y 1 due to clock feed-through effect through C 2 and C 4 capacitive path.

$$
\mathrm{V}_{\mathrm{Y} 1}=\left(-\mathrm{V}_{\mathrm{DD}}\right)\left(\frac{\mathrm{C} 2}{\mathrm{C} 2+\mathrm{C} 4}\right)
$$

Therefore, the final expression for charge $\mathrm{Q}_{\mathrm{Y1}}\left(=\mathrm{C4} \cdot \mathrm{V}_{\mathrm{Y1}}\right)$ is defined as

$$
\begin{equation*}
\mathrm{Q}_{\mathrm{Y1}}=\left(-\mathrm{V}_{\mathrm{DD}}\right)(\mathrm{C4})\left(\frac{\mathrm{C2}}{\mathrm{C2}+\mathrm{C4}}\right) \tag{5.11}
\end{equation*}
$$

Expression for charge $Q_{Y 2}$ : Voltage $\mathrm{V}_{\mathrm{Y} 2}$ is generated at node Y 2 due to clock feed-through effect through C 2 and C 6 capacitive path.

$$
\mathrm{V}_{\mathrm{Y} 2}=\left(-\mathrm{V}_{\mathrm{DD}}\right)\left(\frac{\mathrm{C} 2}{\mathrm{C} 2+\mathrm{C} 6}\right)
$$

Therefore, the final expression for charge $\mathrm{Q}_{\mathrm{Y2}}\left(=\mathrm{C6} \cdot \mathrm{V}_{\mathrm{Y2}}\right)$ is defined as

$$
\begin{equation*}
\mathrm{Q}_{\mathrm{Y} 2}=\left(-\mathrm{V}_{\mathrm{DD}}\right)(\mathrm{C} 6)\left(\frac{\mathrm{C} 2}{\mathrm{C} 2+\mathrm{C} 6}\right) \tag{5.12}
\end{equation*}
$$

From Figure 5.26 (c), the expression for the final charge $\left(\mathrm{Q}_{\mathrm{F}}\right)$ is obtained as

$$
\mathrm{Q}_{\mathrm{F}}=\left(\mathrm{V}_{\mathrm{X,Y}}\right)(\mathrm{C3}+\mathrm{C4}+\mathrm{C5}+\mathrm{C6})
$$

The final steady-state voltage $\mathrm{V}_{\mathrm{X,Y}}\left(=\mathrm{V}_{\mathrm{X}}=\mathrm{V}_{\mathrm{Y}}\right)$ is obtained using charge conservation when MP3 is ON (modeled as a closed switch):

$$
\mathrm{Q}_{\mathrm{F}}=\mathrm{Q}_{\mathrm{X} 1}+\mathrm{Q}_{\mathrm{Y} 1}+\mathrm{Q}_{\mathrm{Y} 2}
$$

$$
\begin{equation*}
\mathrm{V}_{\mathrm{X}, \mathrm{Y}}=\frac{\left(\mathrm{Q}_{\mathrm{X} 1}+\mathrm{Q}_{\mathrm{Y} 1}+\mathrm{Q}_{\mathrm{Y} 2}\right)}{\mathrm{C} 3+\mathrm{C} 4+\mathrm{C} 5+\mathrm{C} 6} \tag{5.13}
\end{equation*}
$$

Therefore, the final gate voltage develops at access transistors (MN5/MN6), when PMOS is enabled:

The estimated values of the voltages at node X and node Y, obtained using equations (5.1)-(5.13) for the read, write and hold operations, are given in Table 5.3. The results show close matching between the estimated and simulated values. This confirms the proposition that the access transistors function properly in M7T.

Table 5.3: Generation of Voltages at node $X$ and node $Y$ in write, read, and hold operation

| WL | State of MP3 <br> (as per Table 5.2) | Simulated $\mathrm{V}_{\mathrm{X,Y}}$ (in V) | Estimated $\mathrm{V}_{\mathrm{X,Y}}$ through expressions (5.1)-(5.13) (in V) |
| :--- | :---: | :---: | :---: |
| ON (Logic '1') <br> Write/Read operation <br> MN5 = MN6 = ON | MP3 = ON | 0.211 / 0.391 (write / read) | 0.203 / 0.374 (write / read) |
| OFF (Logic '0') <br> Hold operation <br> MN5 = MN6 = OFF | MP3 = ON | 0.064 | -0.374 |

Figure 5.27 shows the transient waveforms at nodes X and Y obtained through simulations.


Figure 5.27: Waveforms at nodes $X$ and $Y$ during (a) read/ write (b) hold operation

(i) Read operation of M7T: The circuit set-up for the read operation of M7T is shown in Figure 5.28. Assume that logic '0' is stored in the cell initially. Thus, the internal node voltages are $\mathrm{Q}=0 \mathrm{~V}$ and $\mathrm{QB}=1 \mathrm{~V}$ before the access transistors (MN5 and MN6) are turned ON. The transistors MP2 and MN1 are turned OFF, while the transistors MP1 and MN2 operate in linear mode. During read operation, the bit-lines (BL/BLB) are pre-charged to $\mathrm{V}_{\mathrm{DD}}$ and WL is enabled (pulsed to a high level); MP3 remains ON, as it does not turn OFF properly due to its faulty operation. The access transistors then turn ON due to the gate voltage generated through the clock feed-through effect.


Figure 5.28: Test circuit for read operation of M7T

The voltage at BLB does not change significantly, as no current flows through MN5 because BLB and QB are both at '1'. On the other hand, transistors MN6 and MN2 conduct and the voltage level of the BL line begins to drop slightly, so that a differential voltage develops between the bit-lines, which is sensed by the sense amplifier. During this process, the voltages at the Q and QB nodes fall to '0' and rise to '1' respectively.

Here clock feed through effect helps in stability of stored data. Because of clock feed through effect through parasitic capacitances, a negative transient voltage is developed at node Q which reduces the net voltage at node Q thereby decreasing its probability to exceed the threshold voltage of MN1. This increases the RSNM of M7T. For successful read operation, the node voltage at Q should remain below the threshold voltage of MN1 to prevent false turn ON of transistor MN1.

For maximum RSNM, the CR has been obtained through simulation as 2.2.
Figure 5.29 shows the transient waveforms at nodes Q and QB obtained through simulation. The figure shows that the net voltage at the Q and QB nodes in M7T is less than the corresponding value in C6T.



Figure 5.29: Waveforms at nodes $Q$ and $Q B$ during read mode

(ii) Write operation of M7T: The circuit set-up for the write operation of M7T is shown in Figure 5.30. Consider a write '0' operation at node Q. The internal node voltages are $\mathrm{Q}=$ '1' and $\mathrm{QB}=$ '0' before WL is enabled (i.e. pulsed to a high level). The transistors MP1 and MN2 are turned OFF, while the transistors MN1 and MP2 operate in linear mode.


Figure 5.30: Test circuit for write operation of M7T
During the write operation, the bit-line BLB is pre-charged to a high level ($\mathrm{V}_{\mathrm{DD}}$) and BL is pulled down to a low level (logic '0'). WL is enabled, which turns ON the PMOS transistor (MP3 does not turn off properly due to faulty operation) as well as the access transistors MN5 and MN6. After the pass transistors MN5 and MN6 are switched ON, the voltage at node QB should not rise above the threshold voltage of MN2; the stored information is changed by forcing node Q = '0' and QB = '1'.

For a successful write '0' operation, node Q discharges through MN6 to a voltage below the threshold voltage of MN1.

In addition to the choice of aspect ratio, the clock feed-through effect also helps to keep the voltage at node Q below the threshold voltage of MN1. Through the parasitic capacitances, it develops a negative transient voltage at node Q, which reduces the net voltage at node Q and makes it less likely to exceed the threshold voltage of MN1. This increases the WSNM of M7T.

The critical part of the circuit is the voltage divider formed by the pull-up and access transistors, so PR is an important parameter in write mode. For maximum WSNM, the PR has been obtained through simulation as 2.1.

Figure 5.31 shows the transient waveforms at nodes Q and QB obtained through simulation. The figure shows that net voltage at Q and QB node in M7T is less than corresponding value in C6T.


Figure 5.31: Waveforms at nodes Q and QB during write mode
(iii) Hold Operation of M7T: Conceptual test circuit for measuring the hold SNM of M7T is shown in Figure 5.32.


Figure 5.32: Test circuit for measurement of hold SNM of M7T
During hold operation, WL is disabled (goes low) and MP3 turns ON. The voltages at node X and node Y decrease to 0 V due to the clock feed-through effect through the parasitic capacitances of MP3, MN5, and MN6. This switches OFF the access transistors MN5 and MN6. The bit-line pair (BL/BLB) is pre-charged to a high voltage ($\mathrm{V}_{\mathrm{DD}}$). The cell nodes QB and Q are thus disconnected from both bit-lines.

The pull-up to pull-down transistor ratio is critical for stability during hold operation and is here selected as 0.95. The hold SNM of the M7T in the sub-threshold region is then calculated with the same methodology as given in Section 5.2.

Figure 5.33 shows the transient waveforms at nodes Q and QB obtained through simulations. The figure shows that net voltage at $\mathrm{Q} / \mathrm{QB}$ node in M7T is less/ more than corresponding value in C6T.


Figure 5.33: Waveforms at nodes $Q$ and $Q B$ during hold mode

### 5.3.5. DESIGN OF PROPOSED M9T

The proposed M9T SRAM cell comprises nine transistors, as shown in Figure 5.34. This design is a modified version of the TG-based fully differential 8T SRAM bit cell [32]. The internal architecture of this cell consists of a cross-coupled inverter pair (MP1/MN1 and MP2/MN2) with NN-parallel type access transistors, i.e. MN5-MN3 (or MN6-MN4), to store one bit of information.

As explained in Section 5.3.4, an extra PMOS transistor (MP3), whose gate terminal is controlled by WL, is added. Its source and drain terminals are connected to the gates of the access transistors (MN5/MN6). The voltage generated by the clock feed-through effect through the parasitic capacitances of MP3, MN5, and MN6 is used to turn the MN5/MN6 transistors ON/OFF and, when WL is disabled, to generate a negative voltage at node Q.

Modification: The access transistors are implemented with an NN-parallel type access transistor combination (i.e. MN5/MN3 and MN6/MN4). The ON/OFF operation of MN5 and MN6 is controlled by an extra PMOS transistor (MP3). For the CR and PR calculation, the aspect ratio of the additional transistor (i.e. either MN3 or MN4) in each configuration is kept constant at $(W / L)=65 \mathrm{~nm} / 45 \mathrm{~nm} \approx 1.4$.

The same design methodology is followed for designing the M9T in sub-threshold region as for configuration M7T given in Section 5.3.4.


Figure 5.34: Schematic and layout of M9T
(i) Read operation of M9T: The circuit set-up for the read operation of M9T is shown in Figure 5.35. Assume that logic '0' is stored in the cell initially, so that the internal node voltages are $\mathrm{Q}=0 \mathrm{~V}$ and $\mathrm{QB}=1 \mathrm{~V}$ before the access transistors (MN5 and MN6) are turned ON. The transistors MP2 and MN1 are turned OFF, while the transistors MP1 and MN2 operate in linear mode. The two additional NMOS transistors (MN3 and MN4) operate in the cut-off region (OFF state).

During the read operation, the bit-lines (BL/BLB) are pre-charged to $\mathrm{V}_{\mathrm{DD}}$ and WL is enabled (pulsed to a high level), which turns ON MP3 (it does not turn off properly due to faulty operation). The access transistors then turn ON because of the gate voltage generated through the clock feed-through effect.


Figure 5.35: Test circuit for read operation of M9T
The voltage at BLB does not change significantly, because no current flows through MN5 when both of its terminals (BLB and QB) are at '1'. Transistors MN6 and MN2, on the other hand, conduct, and the voltage of the BL line begins to drop slightly, so that a differential voltage develops between the bit-lines, which is sensed by the sense amplifier. Here, the clock feed-through effect helps to stabilize the stored data. Through the parasitic capacitances, it develops a negative transient voltage at node Q, which reduces the net voltage at node Q and makes it less likely to exceed the threshold voltage of MN1.

Thus, CR is the critical parameter of the SRAM cell during the read operation. For maximum RSNM, the CR has been obtained through simulation as 2.9.

(ii) Write operation of M9T: The circuit set-up for the write operation of M9T is shown in Figure 5.36. The node voltage QB always remains below the threshold voltage of MN2, since MN1 and MN5 are designed according to the CR, so it is not sufficient to turn ON MN2. To change the stored information, i.e. to force $\mathrm{QB}=\mathrm{V}_{\mathrm{DD}}$, the voltage at node Q must be reduced below the threshold voltage of MN1 to turn it OFF. Consider, therefore, a write '0' operation at node Q. The internal node voltages are Q = '1' and QB = '0' before WL is enabled (i.e. pulsed to a high level). The transistors MP1 and MN2 are turned OFF, while the transistors MN1 and MP2 operate in linear mode.


Figure 5.36: Test circuit for write operation of M9T
During the write operation, the bit-line BLB is pre-charged to a high level ($\mathrm{V}_{\mathrm{DD}}$) and BL is pulled down to a low level (logic '0'). WL is enabled, which turns ON the PMOS transistor (MP3 does not turn off properly due to faulty operation) as well as the access transistors MN5 and MN6. For a successful write '0' operation, node Q discharges through MN6 to a voltage below the threshold voltage of MN1.

In addition to the choice of aspect ratio, the clock feed-through effect also helps to keep the voltage at node Q below the threshold voltage of MN1. Through the parasitic capacitances, it develops a negative transient voltage at node Q, which reduces the net voltage at node Q and makes it less likely to exceed the threshold voltage of MN1. This increases the WSNM of M9T. For maximum WSNM, the PR has been obtained through simulation as 2.3.

(iii) Hold Operation of M9T: The test circuit for measurement of the hold SNM of M9T is shown in Figure 5.37.


Figure 5.37: Test circuit for measurement of hold SNM of M9T

The hold operation is performed by lowering WL to logic '0', which switches OFF the access transistors MN5 and MN6, while the bit-line pair (BL/BLB) is at a high voltage. This disconnects the cell nodes QB and Q from both bit-lines. The pull-up to pull-down transistor ratio is critical for stability during hold operation and is here selected as 0.79. The hold SNM of the M9T in the sub-threshold region is then calculated with the same methodology as given in Section 5.2.

### 5.4. SIMULATION RESULTS AND DISCUSSION AT 45 nm

Static noise margin metrics have long been the standard for measuring stability and estimating the yield of SRAM arrays. However, in nanometer technologies, under scaled supply voltages, these traditional metrics are no longer sufficient. Alternatively, cell stability based on the N curve uses additional performance metrics like static voltage noise margin (SVNM), static current noise margin (SINM), write trip current (WTI), write trip voltage (WTV) to overcome some of the limitations of static noise margin metrics [36].

However, dynamic stability metrics [47], which capture the inherent dynamic behavior of the SRAM cell and its access operations, have not been considered in the present work for the stability analysis of SRAM cells. Further, this work does not focus on the stability degradation of SRAM cells due to process variation. Hence, these issues are included in the future scope of this chapter. This section presents a comparative analysis of the C6T, M7T, MPT8T, M8T, M9T and MI-12T SRAM cells using design metrics such as hold SNM, RSNM, WSNM, SINM, SVNM, WTI, WTV, read access time ($\mathrm{T}_{\mathrm{RA}}$), write access time ($\mathrm{T}_{\mathrm{WA}}$), and leakage power consumption. The impact of process variations is studied in terms of the mean and standard deviation of the read delay and write delay for C6T and the proposed SRAM cells. These design metrics are estimated using 45 nm technology libraries in the sub-threshold region.

## Simulation setup:

The butterfly curves of SNM during hold and read/write mode, and the NCM curves, are extracted from the stored simulation data using a MATLAB script for both C6T and the proposed SRAM cells.

Monte Carlo analysis was performed on both C6T and the proposed SRAM cells to find the mean and standard deviation of read and write delay only. For Monte Carlo simulation, we have created a design file which includes device models for 45 nm technology that are assigned statistically varying parameter values.

In the Monte Carlo (MC) simulation set-up, $\mathrm{V}_{\mathrm{th}}$ is assumed to have an independent Gaussian distribution with a $\pm 3 \sigma$ variation of $30 \%$. The expected variation in $V_{DD}$ is $10 \%$ in future technology generations such as 45 nm; hence, the design metrics (read and write delay) are measured by varying the supply voltage by $\pm 10 \%$ around the nominal $\mathrm{V}_{\mathrm{DD}}$ of 0.4 V [32]. The design metrics in this work are estimated with 1500 MC runs at a temperature of $25^{\circ} \mathrm{C}$ to achieve high accuracy for all process corners. The simulations were set up in the Cadence ADE XL window using statistical (.scs) files.
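The statistical inputs described above can be illustrated with a short Python sketch. It is not the Cadence set-up used in this work: it only assumes that the "$\pm 3 \sigma$ variation of $30 \%$" means $3\sigma = 0.30 \times V_{th}$ around the nominal NMOS threshold quoted in Section 5.5, and that the supply is evaluated at the nominal value and at $\pm 10 \%$. The function names and the fixed seed are hypothetical.

```python
import random

VTH_NOMINAL = 0.4226          # V, nominal NMOS threshold quoted in Section 5.5
THREE_SIGMA_FRACTION = 0.30   # assumed meaning: 3*sigma = 30 % of nominal
VDD_NOMINAL = 0.4             # V
N_RUNS = 1500                 # number of Monte Carlo runs


def vth_sigma(vth_nominal, three_sigma_fraction):
    """Standard deviation implied by a +/-3 sigma spread given as a fraction."""
    return abs(vth_nominal) * three_sigma_fraction / 3.0


def sample_vth(n_runs, vth_nominal, sigma, seed=0):
    """Draw independent Gaussian threshold-voltage samples."""
    rng = random.Random(seed)
    return [rng.gauss(vth_nominal, sigma) for _ in range(n_runs)]


def supply_points(vdd_nominal, tolerance=0.10):
    """Nominal supply together with its -10 % and +10 % values."""
    return [vdd_nominal * (1.0 - tolerance),
            vdd_nominal,
            vdd_nominal * (1.0 + tolerance)]


sigma = vth_sigma(VTH_NOMINAL, THREE_SIGMA_FRACTION)
vth_samples = sample_vth(N_RUNS, VTH_NOMINAL, sigma)
vdd_points = supply_points(VDD_NOMINAL)
```

Under these assumptions the threshold voltage has a standard deviation of about 42 mV and the supply is evaluated at roughly 0.36 V, 0.40 V and 0.44 V.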

### 5.4.1. SRAM Standby Stability Analysis (Hold Stability)

The primary metric in nano-scale SRAM design is stability (data retention in the cell), which is analyzed by computing the SNM in hold mode. Hold SNM is the side of the smaller of the two largest squares that can be fitted between the dc voltage transfer characteristics (VTCs) of the SRAM cell with WL disabled. A higher SNM indicates better stability of the cell (discussed in Section 5.2). Figure 5.38 shows the combined 'butterfly curve' (i.e. VTCs) of C6T, M7T, MPT8T, M8T and M9T during hold operation.

The five curves overlap one another, as expected, since the cross-coupled inverter pair designs are the same for all these SRAM cells.


Figure 5.38: Overlapped VTC's of C6T, M7T, MPT8T, M8T and M9T during hold operation

The SNM of C6T, M7T, MPT8T, M8T and M9T in hold mode is 120.2 mV at the nominal supply voltage of $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$. The hold-state VTCs have three intersection points, at $(\mathrm{VQ}, \mathrm{VQB})=(0.4,0) \mathrm{V}$, $(0,0.4) \mathrm{V}$, and $(0.150,0.150) \mathrm{V}$. The intersection points $(0.4,0) \mathrm{V}$ and $(0,0.4) \mathrm{V}$ correspond to the stable states, whereas the cell state corresponding to $(0.150,0.150) \mathrm{V}$ is unstable.

The hold SNM is 120.2 mV for all of these cells (C6T, M7T, MPT8T, M8T and M9T); it is the smaller of the sides of the two largest squares that can be fitted inside the lobes of the "butterfly" curve [33]. This indicates that all designs are equally stable in hold mode.
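The largest-square construction can be sketched numerically as follows. This is a minimal Python illustration, not the MATLAB script used in this work. For each point on one curve it searches, by bisection along a 45 degree diagonal, for the square whose opposite corner lies on the other curve. The `tanh`-shaped VTC, its switching point and its steepness are hypothetical placeholders for simulated data, and the VTC functions are assumed to be monotonically decreasing and defined slightly beyond $V_{DD}$.

```python
import math


def largest_square(f_upper, f_lower, vdd, n_points=400, n_iter=50):
    """Side of the largest square in the lobe bounded by y = f_upper(x)
    (upper-right boundary) and x = f_lower(y) (lower-left boundary)."""
    best = 0.0
    for k in range(n_points + 1):
        y0 = vdd * k / n_points
        x0 = f_lower(y0)               # lower-left corner on the mirrored VTC
        if f_upper(x0) - y0 <= 0.0:    # this corner lies outside the lobe
            continue
        lo, hi = 0.0, vdd
        for _ in range(n_iter):        # bisection on the square side
            mid = 0.5 * (lo + hi)
            if f_upper(x0 + mid) - (y0 + mid) > 0.0:
                lo = mid
            else:
                hi = mid
        best = max(best, lo)
    return best


def static_noise_margin(vtc1, vtc2, vdd):
    """SNM = smaller of the largest squares in the two butterfly lobes."""
    return min(largest_square(vtc1, vtc2, vdd),
               largest_square(vtc2, vtc1, vdd))


def example_vtc(v_in, vdd=0.4, v_m=0.2, steepness=200.0):
    """Hypothetical smooth inverter VTC used only to exercise the routine."""
    return 0.5 * vdd * (1.0 - math.tanh(steepness * (v_in - v_m)))


snm = static_noise_margin(example_vtc, example_vtc, 0.4)
```

For a symmetric cell both lobes give the same square, so `static_noise_margin` only matters when the two inverters differ, as under read or write conditions.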

In the proposed MI-12T, the modification has been made in the cross-coupled inverter pair (see Figure 5.16). Therefore, its hold SNM is evaluated separately and is found to be higher than that of the other cells.

Figure 5.39 shows the VTC of MI-12T. SNM of MI-12T is 170 mV at nominal supply voltage of $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$.


Figure 5.39: VTC of MI-12T during hold operation

## Effect of variation in Supply Voltage on Hold Stability:

Figure 5.40 shows the VTCs of C6T, M7T, MPT8T, M8T and M9T during hold operation, on the same axes, for $V_{DD}$ varying from 0.4 V to 0.2 V.


Figure 5.40: Combined VTC's of C6T, M7T, MPT8T, M8T and M9T during hold operation at varying $V_{D D}$

Figure 5.41 shows the VTCs of MI-12T during hold operation for $\mathrm{V}_{\mathrm{DD}}$ varying from 0.4 V to 0.2 V.


Figure 5.41: VTC's of MI-12T during hold operation at varying $\mathrm{V}_{\mathrm{DD}}$

From Figure 5.40 and Figure 5.41, it is observed that all SRAM cells are able to hold the stored information down to a supply voltage of 0.2 V, although a reduction in supply voltage leads to a reduction in hold SNM.

Also, the proposed MI-12T cell exhibits a $2.09 \times$ improvement in average hold SNM compared to the C6T, M7T, MPT8T, M8T and M9T cells, owing to the reduced leakage current of its stacked pull-down path.

### 5.4.2. SRAM Read Stability Analysis

Read stability, analyzed by RSNM, is an important design metric of the SRAM cell. RSNM is measured in the same way as hold SNM, with BL/BLB pre-charged to $V_{DD}$ and WL biased in the enabled state (discussed in Sections 5.2 and 5.3).

RSNM is the side of the smaller of the two largest squares that can be fitted between the dc voltage transfer characteristics (VTCs) of the SRAM cell with WL enabled.

Figure 5.42 shows the RSNM 'butterfly curves' for the C6T, M7T, MPT8T, M8T, M9T and MI-12T SRAM cells.



Figure 5.42: RSNM 'butterfly curve' of (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T

From Figure 5.42 and Table 5.4, it is observed that the proposed SRAM cells have a higher RSNM than the C6T.

Since the M7T, MPT8T, M8T, M9T and MI-12T consume more area than the C6T, it is worthwhile to compare these cells under the 'iso-area' condition. For the iso-area condition, the CR is increased so that all cells (C6T, M7T, MPT8T, M8T, M9T and MI-12T) occupy the same layout area, which allows the impact of increasing the CR value to be analyzed.

Table 5.4 also shows the comparative analysis of RSNM under the iso-area condition. The results show that the RSNM of the C6T and of all proposed SRAM cells increases only marginally with increased CR, which is in agreement with the result published in [34].

Table 5.4: Comparative analysis of RSNM, at 0.4 V

| Type of SRAM cell | RSNM (mV) | % increment w.r.t. C6T | Layout area ($\mu\mathrm{m}^{2}$) | RSNM (mV) under iso-area condition (layout area $= 7.705\ \mu\mathrm{m}^{2}$) |
| :---: | :---: | :---: | :---: | :---: |
| C6T | 34.0 | -- | **3.824** | 37.0 |
| M7T | 60.0 | 43.33% | 4.874 | 67.0 |
| MPT8T | **120.0** | **71.60%** | 5.986 | 123.0 |
| M8T | 85.0 | 60.00% | 6.281 | 88.0 |
| M9T | 43.0 | 20.93% | 6.634 | 49.0 |
| MI-12T | 54.4 | 37.50% | 7.705 | 54.4 |
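
As an arithmetic cross-check, the "% increment w.r.t. C6T" column appears to be reproduced, to within rounding, when the difference is normalized by the value of the proposed cell rather than by the C6T value. The sketch below states this reading as an assumption; the helper name `percent_increment` is hypothetical.

```python
RSNM_MV = {
    "C6T": 34.0,
    "M7T": 60.0,
    "MPT8T": 120.0,
    "M8T": 85.0,
    "M9T": 43.0,
    "MI-12T": 54.4,
}


def percent_increment(value, reference):
    """Increment of `value` over `reference`, normalized by `value`
    (assumed convention, inferred from the tabulated numbers)."""
    return 100.0 * (value - reference) / value


increments = {
    cell: percent_increment(rsnm, RSNM_MV["C6T"])
    for cell, rsnm in RSNM_MV.items()
    if cell != "C6T"
}
```

With this convention M7T gives 43.33% and M9T gives 20.93%, matching the table; normalizing by the C6T value instead would give about 76.5% and 26.5%.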

### 5.4.3. SRAM Write Ability Analysis

Write ability, analyzed by WSNM, is the minimum voltage necessary to drive the bit cell into a mono-stable state during a write '0' operation. WSNM is measured with BL/BLB at '0'/'1' (or vice versa) and the WL of the cell enabled (logic '1', as discussed in Section 5.2). During the write operation, the bit-line pair connects directly to nodes Q and QB and forces them (the stored information) to the required voltage levels. Once the state has changed, the WL signal is driven low and the cross-coupled inverter pair stores the written one-bit information. WSNM is defined as the minimum bit-line dc voltage below $V_{DD}$ needed to write the opposite value into the cell [35]. For a successful write, only one cross point should be found on the butterfly curves, indicating that the cell is mono-stable.

WSNM for writing '1' is the width of the smallest square that can be nested between the write '0' and write '1' static characteristics. A lower WSNM implies poorer write ability of the SRAM cell.

Figure 5.43 shows the estimation of WSNM from the VTC curves of both inverters (of the cross-coupled inverter pair) for the C6T, M7T, MPT8T, M8T, M9T and MI-12T SRAM cells.

Figure 5.43: WSNM 'butterfly curve' of (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T

As observed from Figure 5.43, only one intersection point is found on the butterfly curves for all SRAM cells.

This single stable point signifies a successful write operation, with the cross-coupled inverters of the cell behaving as a mono-stable circuit.

Table 5.5 shows the comparative analysis of WSNM for all of the above SRAM cells; it is observed that the proposed SRAM cells have a higher WSNM than the C6T.

Table 5.5: Comparative analysis of WSNM of all SRAM cells, at 0.4 V

| Type of SRAM cell | WSNM (mV) | % increment w.r.t. C6T |
| :---: | :---: | :---: |
| C6T | 90.0 | -- |
| M7T | 126.2 | 28.6% |
| MPT8T | 218.0 | 58.7% |
| M8T | 186.0 | 51.6% |
| M9T | 160.0 | 43.7% |
| MI-12T | **226.0** | **60.1%** |

### 5.4.4. ALTERNATIVE NOISE MARGINS

Analysis based on the N-curve metrics (NCM) of the cell is used for further evaluation of the robustness of the SRAM cell in terms of additional performance parameters such as SVNM, SINM, WTV, and WTI. The N-curve contains information on both the read stability and the write ability, thus allowing a complete functional analysis of the SRAM cell with only one N-curve [36].

Figure 5.44 shows the test circuit for extracting the NCM of C6T during read mode. For the NCM analysis, the bit-line pair is pre-charged to $\mathrm{V}_{\mathrm{DD}}$, WL is held at logic '1' (ON state), QB = '0', and Q = '1' [37].

An external voltage source ($\mathrm{V}_{\mathrm{IN}}$) is applied at the input storage node 'QB'. $\mathrm{V}_{\mathrm{IN}}$ is swept from 0 V to $\mathrm{V}_{\mathrm{DD}}$, and the corresponding input current ($\mathrm{I}_{\mathrm{IN}}$) produces the NCM characteristics.


Figure 5.44: Test circuit for extracting N-curve of C6T during read mode
Figure 5.45 shows the NCM characteristics of C6T, M7T, MPT8T, M8T, M9T and MI-12T respectively.




Figure 5.45: N-curve characteristics of (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T

The curve is analyzed at the three points (A, B and C) where it crosses zero. Points A and C are the two stable points, while B is a meta-stable point.

The voltage at A is determined by the pull-down (MN2) to access transistor (MN6) ratio, i.e. the cell ratio (CR).

The voltage at B is related to the pull-down (MN2) to pull-up (MP2) ratio and to the access transistor (MN6) of the cell.

The voltage at C is defined by the pull-up (MP2) to access transistor (MN6) ratio, i.e. the pull-up ratio (PR) of the cell.

The performance metrics SVNM, SINM, WTI and WTV are defined as:

- SVNM: The voltage difference between A and B, which indicates the maximum tolerable dc noise voltage at the internal node 'QB'. When points A and B coincide, the cell is at the edge of stability and a destructive read can easily occur.
- SINM: The positive peak current between A and B, which indicates the maximum dc current that can be injected into the SRAM cell before its content flips.
- WTI: The negative current peak between points B and C. WTI is the amount of current needed to write the cell when both bit-lines are kept at $\mathrm{V}_{\mathrm{DD}}$, i.e. the current margin of the cell beyond which its content changes.
- WTV: The voltage difference between points C and B. WTV is the voltage drop needed to flip the internal node '1' of the cell with both bit-lines clamped at $\mathrm{V}_{\mathrm{DD}}$.
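
The four definitions above can be turned into a small extraction routine. The following Python sketch locates the zero crossings A, B and C of a sampled N-curve by linear interpolation and then reads off the metrics. The cubic test curve, its zero crossings and its current scale are hypothetical stand-ins for a simulated $\mathrm{I}_{\mathrm{IN}}$ versus $\mathrm{V}_{\mathrm{IN}}$ sweep, and samples that are exactly zero are not treated specially.

```python
def zero_crossings(volts, currents):
    """Voltages where the current changes sign (linear interpolation)."""
    roots = []
    for k in range(len(volts) - 1):
        i0, i1 = currents[k], currents[k + 1]
        if i0 * i1 < 0.0:
            frac = i0 / (i0 - i1)
            roots.append(volts[k] + frac * (volts[k + 1] - volts[k]))
    return roots


def n_curve_metrics(volts, currents):
    """SVNM, SINM, WTV and WTI from a sampled N-curve."""
    roots = zero_crossings(volts, currents)
    if len(roots) != 3:
        raise ValueError("expected exactly three zero crossings (A, B, C)")
    a, b, c = roots
    between_ab = [i for v, i in zip(volts, currents) if a <= v <= b]
    between_bc = [i for v, i in zip(volts, currents) if b <= v <= c]
    return {
        "SVNM": b - a,             # voltage span from A to B
        "SINM": max(between_ab),   # positive current peak between A and B
        "WTV": c - b,              # voltage span from B to C
        "WTI": min(between_bc),    # negative current peak between B and C
    }


def synthetic_n_curve(v, scale=1e-3, a=0.0205, b=0.2205, c=0.3805):
    """Hypothetical cubic with the sign pattern of an N-curve."""
    return scale * (v - a) * (v - b) * (v - c)


volts = [0.4 * k / 400 for k in range(401)]
currents = [synthetic_n_curve(v) for v in volts]
metrics = n_curve_metrics(volts, currents)
```

The routine returns WTI as a signed (negative) current; the tables in this section report its magnitude.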

Table 5.6 shows the comparative analysis of SVNM, SINM, WTI and WTV for all SRAM cells. It is observed that the proposed SRAM cells have higher SINM and WTI values (and are thus more robust) than the C6T.

Table 5.6: Comparative analysis of SVNM, SINM, WTI and WTV at 0.4V

| Type of SRAM cell | SVNM (mV) | SVNM % decrement w.r.t. C6T | SINM ($\mu\mathrm{A}$) | SINM % increment w.r.t. C6T | $\lvert\mathrm{WTI}\rvert$ ($\mu\mathrm{A}$) | $\lvert\mathrm{WTI}\rvert$ % decrement w.r.t. C6T | WTV (mV) | WTV % increment w.r.t. C6T |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| C6T | 220 | -- | 10.00 | -- | 64.00 | -- | 180 | -- |
| M7T | 220 | 0% | 49.21 | 79.6% | 53.31 | 16.7% | 180 | 0% |
| MPT8T | 220 | 0% | 12.00 | 16.6% | 28.00 | 56.2% | 180 | 0% |
| M8T | 190 | 13.6% | 14.01 | 28.6% | 5.01 | 92.1% | 190 | 5.2% |
| M9T | 220 | 0% | 49.20 | 79.6% | 58.90 | 7.9% | 180 | 0% |
| MI-12T | 175 | 20.4% | 54.50 | 81.6% | 8.70 | 86.4% | 225 | 20% |

Table 5.7 and Table 5.8 show the comparison of WTV, WTI with WSNM and comparison of SVNM, SINM, with RSNM respectively.

Table 5.7: Comparison of WTV, WTI with WSNM

| Type of SRAM cell | $\lvert\mathrm{WTI}\rvert$ ($\mu\mathrm{A}$) | WTV (mV) | WSNM (mV) |
| :---: | :---: | :---: | :---: |
| C6T | 64.00 | **180** | 90.0 |
| M7T | 53.31 | **180** | 126.2 |
| MPT8T | 28.00 | **180** | 218.0 |
| M8T | **5.01** | 190 | 186.0 |
| M9T | 58.90 | **180** | 160.0 |
| MI-12T | 8.70 | 225 | **226.0** |

Performance order from best to worst:

- WSNM in decreasing order: MI-12T > MPT8T > M8T > M9T > M7T > C6T
- WTI in increasing order: M8T < MI-12T < MPT8T < M7T < M9T < C6T
- WTV in increasing order: MPT8T = M7T = M9T = C6T < M8T < MI-12T

Table 5.8: Comparison of SVNM, SINM, with RSNM

| Type of SRAM cell | SVNM (mV) | SINM ($\mu\mathrm{A}$) | RSNM (mV) |
| :---: | :---: | :---: | :---: |
| C6T | **220** | 10.00 | 37.0 |
| M7T | **220** | 49.21 | 60.0 |
| MPT8T | **220** | 12.00 | **123.0** |
| M8T | 190 | 14.01 | 88.0 |
| M9T | **220** | 49.20 | 43.0 |
| MI-12T | 175 | **54.50** | 54.4 |

Performance order from best to worst:

- RSNM in decreasing order: MPT8T > M8T > M7T > MI-12T > M9T > C6T
- SINM in decreasing order: MI-12T > M7T = M9T >> M8T > MPT8T > C6T
- SVNM in decreasing order: C6T = M7T = MPT8T = M9T > M8T > MI-12T

### 5.4.5. Read Access Time ($\mathrm{T}_{\mathrm{RA}}$) with Variability

Read delay, or $\mathrm{T}_{\mathrm{RA}}$, is measured from the point when WL reaches $50 \%$ of its swing (rising from its initial low level) to the point when the bit-line differential voltage has developed, with the discharging bit-line falling from its initial high level.
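
This measurement can be sketched on sampled transient waveforms as follows. The Python example is illustrative only: the 50 mV sensing threshold is taken from the sense-amplifier resolution quoted in this section, while the piecewise-linear waveforms, the time unit (ns) and the function names are hypothetical.

```python
def crossing_time(times, values, level):
    """First time a rising waveform reaches `level` (linear interpolation)."""
    for k in range(len(times) - 1):
        y0, y1 = values[k], values[k + 1]
        if y0 < level <= y1:
            frac = (level - y0) / (y1 - y0)
            return times[k] + frac * (times[k + 1] - times[k])
    raise ValueError("waveform never reaches the requested level")


def read_access_time(times, wl, bl, blb, vdd, sense_margin=0.05):
    """Delay from WL at 50 % of VDD to |BL - BLB| reaching `sense_margin`."""
    t_start = crossing_time(times, wl, 0.5 * vdd)
    differential = [abs(a - b) for a, b in zip(bl, blb)]
    t_stop = crossing_time(times, differential, sense_margin)
    return t_stop - t_start


# Hypothetical waveforms: WL ramps up in 1 ns, then BL discharges at 10 mV/ns.
VDD = 0.4
times = [0.1 * k for k in range(101)]                    # 0 ns to 10 ns
wl = [min(VDD, VDD * t / 1.0) for t in times]
bl = [VDD - 0.01 * max(0.0, t - 1.0) for t in times]
blb = [VDD for _ in times]

t_ra = read_access_time(times, wl, bl, blb, VDD)
```

For these synthetic waveforms WL crosses 0.2 V at 0.5 ns and the differential reaches 50 mV at 6 ns, so the routine returns about 5.5 ns.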

The designed sense amplifier architecture can detect a differential voltage greater than or equal to 50 mV between the bit-line pair (BL and BLB) without generating any read error [38] for sub-micron technology. Standard deviation is a measure used to quantify the amount of variation (or dispersion) of a set of data values.

A low standard deviation indicates that the data points tend to be close to the average value of the data, while a high standard deviation indicates that the data points are spread out over a wider range of values.

Variability is defined as the ratio of the standard deviation ($\sigma$) to the mean ($\mu$) of a design metric. In this section, variability is calculated for the read (and write) access time at varying supply voltages.

Figure 5.46 shows the read delay and its variability for C6T, M7T, MPT8T, M8T, M9T and MI-12T for supply voltages from 0.2 V to 0.4 V.
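
As a minimal illustration of this definition, the Python sketch below computes $\sigma / \mu$ for a list of delay samples. The use of the sample (n − 1) standard deviation and the three example delays are assumptions made for illustration, not data from this work.

```python
import statistics


def variability(samples):
    """Ratio of standard deviation to mean (sigma / mu) of a design metric.

    Uses the sample standard deviation; the population form
    (statistics.pstdev) would give a slightly smaller value.
    """
    return statistics.stdev(samples) / statistics.mean(samples)


# Hypothetical read-delay samples in ns, standing in for Monte Carlo results.
delays_ns = [10.0, 12.0, 14.0]
ratio = variability(delays_ns)
```

Here the mean is 12 ns and the sample standard deviation is 2 ns, so the variability is 1/6, i.e. about 0.17.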


Figure 5.46: Read delay and its variability of (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T

From Figure 5.46, it is observed that M7T, MPT8T, M8T, M9T and MI-12T have a narrower range of variability around the mean value than C6T for supply voltages from 0.2 V to 0.4 V.

Table 5.9 shows the comparative analysis of the average read delay and its variability for all proposed SRAM cells and C6T, where $\sigma$ is the standard deviation and $\mu$ is the mean value of the read delay.

Average read delay in increasing order: M8T << M7T < M9T < MPT8T < MI-12T < C6T

For M7T, the read delay is about $3 \times$ that of M8T.

Read delay and its variability reduce with reduction in power supply voltage.

Table 5.9: Comparative analysis of read delay and its variability for all proposed SRAM cells with C6T

| Type of SRAM cell | Average read delay, $\mathrm{T}_{\mathrm{RA}}$ (ns) | % decrement w.r.t. C6T | Variability ($\sigma / \mu$) range for $V_{DD}$ from 0.2 V to 0.4 V |
| :---: | :---: | :---: | :---: |
| C6T | 13.20 | -- | 0.25 to 0.61 |
| M7T | 11.00 | 16.6% | **0.01** to 0.04 |
| MPT8T | 12.50 | 5.3% | 0.02 to 0.19 |
| M8T | 3.45 | **73.8%** | 0.15 to 0.38 |
| M9T | 12.20 | 7.5% | 0.05 to 0.40 |
| MI-12T | 12.60 | 4.5% | 0.01 to 0.07 |

### 5.4.6. Write Access Time ($\mathrm{T}_{\mathrm{WA}}$) with Variability

Write delay is the estimated time required to flip the cell contents with WL = '1' during a write operation. When writing '0' or '1' at the storage nodes, it is measured from the point when WL reaches $50 \%$ of its full swing (from its initial low level) to the point when the storage node falls to $10 \%$ or rises to $90 \%$ of its full swing from its initial high or low level. This leads to a successful write operation without any errors.

An improvement in $\mathrm{T}_{\mathrm{WA}}$ is observed in M7T, MPT8T, M8T, M9T and MI-12T compared to C6T.

Figure 5.47 shows the write delay and its variability for C6T, M7T, MPT8T, M8T, M9T and MI-12T for supply voltages from 0.2 V to 0.4 V.


Figure 5.47: Write delay and its variability of (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T

From Figure 5.47, it is observed that M7T, MPT8T, M8T, M9T and MI-12T have a narrower range of variability around the mean value than C6T for supply voltages from 0.2 V to 0.4 V. Table 5.10 shows the comparative analysis of the average write delay and its variability for all proposed SRAM cells and C6T.

Average write delay in increasing order: M8T < MPT8T < MI-12T < M7T << M9T < C6T

For M9T, the write delay is about $3 \times$ that of M7T.

Write delay and its variability reduce with reduction in power supply voltage.

Table 5.10: Comparative analysis of write delay and its variability for all proposed SRAM cells with C6T

| Type of SRAM cell | Average write delay, $\mathrm{T}_{\mathrm{WA}}$ (ns) | % decrement w.r.t. C6T | Variability ($\sigma / \mu$) range for $V_{DD}$ from 0.2 V to 0.4 V |
| :---: | :---: | :---: | :---: |
| C6T | 19.20 | -- | 0.05 to 0.730 |
| M7T | 5.85 | 69.5% | 0.07 to 0.180 |
| MPT8T | 3.15 | 83.5% | **0.01** to 0.016 |
| M8T | **2.09** | **89.1%** | 0.06 to 0.100 |
| M9T | 18.11 | 5.6% | 0.09 to 0.190 |
| MI-12T | 5.37 | 72.0% | 0.02 to 0.060 |

### 5.4.7. Leakage Power Consumption in Hold mode

The major leakage components of C6T and proposed cells during hold mode are discussed and given in Appendix C. A comparison of leakage power consumption in hold mode of C6T, M7T, MPT8T, M8T, M9T and MI-12T cells at supply voltage varying from 0.2 V to 0.4 V is shown in Figure 5.48.


Figure 5.48: Leakage power consumptions in hold mode versus supply voltage

From Figure 5.48, the results show that from a supply voltage of 0.3 V onwards, C6T has a significant increase in leakage power consumption in hold mode compared to MPT8T, M8T and MI-12T, whereas M9T, followed by M7T, shows high leakage power consumption for all supply voltages.

Table 5.11 shows the comparative analysis of leakage power consumption for all SRAM cells in hold mode.

Leakage power consumption in hold mode in increasing order is given below:
MPT8T < M8T < MI-12T < C6T < M7T < M9T

Table 5.11: Comparative analyses of leakage power consumptions in hold mode at 0.4 V supply

| Type of SRAM cell | Leakage power consumption (nW) | % less power consumption w.r.t. C6T |
| :---: | :---: | :---: |
| C6T | 5.76 | -- |
| M7T | 6.01 | 10.7% more |
| MPT8T | **2.31** | **54.6%** |
| M8T | 3.14 | 32.1% |
| M9T | 8.10 | 76.4% more |
| MI-12T | 3.67 | 30.7% |

### 5.5. ANALYTICAL EXPRESSIONS FOR HOLD SNM, RSNM \& WSNM OF SRAM CELLS

Stability parameters can be expressed as a function of aspect ratio and supply voltage. Hence these expressions can be directly utilized to determine transistor sizes for a desired value of stability parameters (hold SNM, RSNM and WSNM) or vice versa.

The derivations of these expressions and comparison of value thus obtained with simulated results are given in this section.

For the analysis, the following sub-threshold drain current equation is used, with its parameters taken from the 45 nm BSIM model library [39][40]:

$$
I_{D, SUB}=I_{S} \exp \left(\frac{v_{GS}-V_{th}}{n V_{T}}\right)\left(1-\exp \left(\frac{-v_{DS}}{V_{T}}\right)\right)
$$

where

$$
I_{S, NMOS}=\mu_{n}\left(C_{ox}\right)_{n}\left(\frac{W}{L}\right)_{n}\left(\frac{kT}{q}\right)^{2}\left(e^{1.8}-1\right)
$$

$$
I_{S, PMOS}=\mu_{p}\left(C_{ox}\right)_{p}\left(\frac{W}{L}\right)_{p}\left(\frac{kT}{q}\right)^{2}\left(e^{1.8}-1\right)
$$

$$
\left(C_{ox}\right)_{p}=\frac{\varepsilon_{ox}}{\left(t_{ox}\right)_{p}} \quad \text{and} \quad \left(C_{ox}\right)_{n}=\frac{\varepsilon_{ox}}{\left(t_{ox}\right)_{n}}
$$

Following parameter values are taken from 45 nm BSIM model library
$\left(\mathrm{V}_{\mathrm{th}}\right)_{\mathrm{n}}=0.4226 \mathrm{~V}$
$\left(\mathrm{V}_{\mathrm{th}}\right)_{\mathrm{p}}=-0.412642 \mathrm{~V}$
$\left(\mathrm{t}_{\mathrm{ox}}\right)_{\mathrm{p}}=1.26 \times 10^{-9} \mathrm{~m}$
$\left(\mathrm{t}_{\mathrm{ox}}\right)_{\mathrm{n}}=1.14 \times 10^{-9} \mathrm{~m}$
$\mu_{\mathrm{n}}=0.045 \mathrm{~cm}^{2} / \mathrm{V}-\mathrm{s}$
$\mu_{\mathrm{p}}=0.02 \mathrm{~cm}^{2} / \mathrm{V}-\mathrm{s}$
$\mathrm{n}_{\mathrm{n}}=\mathrm{n}_{\mathrm{p}}=1.5$
Constant values for calculating the analytical expressions:
$\mathrm{V}_{\mathrm{T}}=\frac{\mathrm{kT}}{\mathrm{q}}=0.025 \mathrm{~V}$
$\varepsilon_{\text {ox }}=3.97 \varepsilon_{\text {o }}$
$\varepsilon_{0}=8.85 \times 10^{-12} \mathrm{~F} / \mathrm{m}$
Using the above values and formulas, the following parameters have been computed:
$\left(\mathrm{C}_{\mathrm{ox}}\right)_{\mathrm{p}}=0.027 \mathrm{~F} / \mathrm{m}^{2}$
$\left(\mathrm{C}_{\mathrm{ox}}\right)_{\mathrm{n}}=0.0308 \mathrm{~F} / \mathrm{m}^{2}$
$\mathrm{I}_{\mathrm{S}, \mathrm{NMOS}}=4.37 \times 10^{-6}\left(\frac{\mathrm{W}}{\mathrm{L}}\right)_{\mathrm{n}} \mathrm{~A}$
$\mathrm{I}_{\mathrm{S}, \mathrm{PMOS}}=1.76 \times 10^{-6}\left(\frac{\mathrm{W}}{\mathrm{L}}\right)_{\mathrm{p}} \mathrm{~A}$
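
The computed parameters can be cross-checked with a short script. The sketch below is illustrative only: it assumes all quantities are in SI units (in particular that the mobilities are read as m²/(V·s), the reading under which the quoted prefactors are reproduced) and that the magnitude of the factor $\left(1-\mathrm{e}^{1.8}\right)$ is used. The function names are hypothetical.

```python
import math

# Constants quoted in the text, interpreted in SI units (an assumption).
EPS_0 = 8.85e-12                     # permittivity of free space, F/m
EPS_OX = 3.97 * EPS_0                # oxide permittivity, F/m
V_T = 0.025                          # thermal voltage kT/q, V
MU_N, MU_P = 0.045, 0.02             # mobilities, read as m^2/(V*s)
T_OX_N, T_OX_P = 1.14e-9, 1.26e-9    # oxide thicknesses, m


def oxide_capacitance(t_ox):
    """Oxide capacitance per unit area, C_ox = eps_ox / t_ox, in F/m^2."""
    return EPS_OX / t_ox


def is_prefactor(mu, c_ox):
    """I_S per unit W/L in A, using the magnitude of (1 - e^1.8)."""
    return mu * c_ox * V_T ** 2 * abs(1.0 - math.exp(1.8))


C_OX_N = oxide_capacitance(T_OX_N)
C_OX_P = oxide_capacitance(T_OX_P)
IS_N = is_prefactor(MU_N, C_OX_N)
IS_P = is_prefactor(MU_P, C_OX_P)

print(f"C_ox,n = {C_OX_N:.4f} F/m^2, C_ox,p = {C_OX_P:.4f} F/m^2")
print(f"I_S,NMOS = {IS_N:.3e} A x (W/L), I_S,PMOS = {IS_P:.3e} A x (W/L)")
```

Under these assumptions the script gives values that agree with the quoted $4.37 \times 10^{-6}$ and $1.76 \times 10^{-6}$ prefactors to within rounding.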

### 5.5.1. Analytical Expressions for Hold SNM of C6T, M7T, MPT8T, M8T, M9T and MI-12T SRAM cells

Figure 5.49 shows part of the schematics of C6T, M7T, MPT8T, M8T, M9T and MI-12T (full description given in Section 5.2 and Section 5.3) during hold operation when QB is low and Q is high (i.e. $\mathrm{V}_{\mathrm{QB}}=$ low, $\mathrm{V}_{\mathrm{Q}}=$ high). Crossed and uncrossed transistors represent the OFF and ON states respectively.

The pull-up, pull-down and access transistor current analysis during hold mode for C6T and the proposed SRAM cells is given in Appendix D.


Figure 5.49: Half part of SRAM cells during hold operation (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T

a) The cross-coupled inverter pair designs are the same for C6T, M7T, MPT8T, M8T and M9T. Therefore, the analytical expressions for these cells during hold operation are the same.

During hold operation, in Figure 5.49 (a),

- Voltage conditions taken are: $\mathrm{V}_{\mathrm{Q}}$ is high, $\mathrm{V}_{\mathrm{QB}}$ is low, $\mathrm{BL}=\mathrm{V}_{\mathrm{DD}}$.
- Transistor MN6 is OFF.

For proper logic operation in the sub-threshold region under steady state, assuming $\mathrm{I}_{\mathrm{MN} 6}=0$, applying KCL at node Q gives $\mathrm{I}_{\mathrm{n} 2}=\mathrm{I}_{\mathrm{p} 2}$, which leads to the following expression:

$$
I_{S, n 2} \exp \left(\frac{V_{Q}-V_{t h, n 2}}{n_{n 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{S, p 2} \exp \left(\frac{V_{D D}-V_{Q}-V_{t h, p 2}}{n_{p 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ and $\mathrm{V}_{\mathrm{QB}}$ gives:

$$
\begin{align*}
& V_{Q}=\frac{n_{n 2} \cdot n_{p 2} \cdot V_{T}}{n_{n 2}+n_{p 2}}\left[\ln \left(\frac{I_{S, p 2}}{I_{S, n 2}}\right)+\ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)\right]  \tag{5.14}\\
& +\frac{n_{n 2} \cdot V_{D D}}{n_{p 2}+n_{n 2}}+\frac{n_{p 2} \cdot n_{n 2}}{n_{p 2}+n_{n 2}}\left(\frac{V_{t h, p 2}}{n_{p 2}}+\frac{V_{t h, n 2}}{n_{n 2}}\right) .
\end{align*}
$$

For the 45 nm technology, choosing $\mathrm{L}=45 \mathrm{~nm}$ and $\left(\mathrm{W}_{\mathrm{p} 2} / \mathrm{W}_{\mathrm{n} 2}\right)=51 \mathrm{~nm} / 85 \mathrm{~nm}=0.6$, this equation is used to generate butterfly curves for different values of $\mathrm{V}_{\mathrm{DD}}$ (in the range 0.2 V to 0.4 V). The Hold SNM is then calculated for each supply value.
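
As a hedged illustration of how Eq. (5.14) can be swept to trace one branch of the butterfly curve, the sketch below evaluates the expression exactly as printed. The function name `vq_hold`, the five-point sweep and the use of the signed library value of $V_{th,p}$ are assumptions made for illustration. The second branch follows by exchanging the roles of the two axes, and the SNM extraction itself is not shown.

```python
import math

V_T = 0.025                       # thermal voltage, V
N_N = N_P = 1.5                   # sub-threshold slope factors
VTH_N, VTH_P = 0.4226, -0.412642  # threshold voltages from the library, V
L = 45e-9                         # channel length, m
IS_N2 = 4.37e-6 * (85e-9 / L)     # I_S of MN2 with W = 85 nm, A
IS_P2 = 1.76e-6 * (51e-9 / L)     # I_S of MP2 with W = 51 nm, A


def vq_hold(v_qb, v_dd):
    """Evaluate Eq. (5.14) as printed; valid for 0 < v_qb < v_dd."""
    pref = N_N * N_P * V_T / (N_N + N_P)
    log_is = math.log(IS_P2 / IS_N2)
    log_ds = math.log(
        (1.0 - math.exp((v_qb - v_dd) / V_T))
        / (1.0 - math.exp(-v_qb / V_T))
    )
    supply_term = N_N * v_dd / (N_P + N_N)
    vth_term = N_P * N_N / (N_P + N_N) * (VTH_P / N_P + VTH_N / N_N)
    return pref * (log_is + log_ds) + supply_term + vth_term


# One branch of the butterfly curve at V_DD = 0.4 V (illustrative sweep).
curve = [(v_qb, vq_hold(v_qb, 0.4)) for v_qb in (0.05, 0.1, 0.2, 0.3, 0.35)]
for v_qb, v_q in curve:
    print(f"V_QB = {v_qb:.2f} V -> V_Q = {v_q:.4f} V")
```

At $V_{QB}=V_{DD}/2$ the second logarithm vanishes, so the value there depends only on the $I_S$ ratio, the supply and the threshold terms.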

The relationship between Hold SNM and supply voltage is modeled, with the help of linear regression (least squares estimation) applied to this data set. The following regression equation has been obtained for C6T,

$$
\begin{equation*}
\mathrm{V}_{\text {HoldSNM }, \mathrm{C6T}}=-0.0112+0.375 * V_{D D} \tag{5.15}
\end{equation*}
$$

For $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$, the value of Hold SNM obtained from equation 5.15 is:

$$
\mathrm{V}_{\text {HoldSNM, C6T, M7T, MPT8T, M8T, M9T }}=0.138 \mathrm{~V}
$$
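
The least-squares step can be sketched as follows. The thesis data set is not reproduced here, so the sample points below are placeholders generated from Eq. (5.15) itself purely to show the mechanics; `fit_line` and `hold_snm_c6t` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


def hold_snm_c6t(v_dd):
    """Regression line of Eq. (5.15), in volts."""
    return -0.0112 + 0.375 * v_dd


# Placeholder points lying on Eq. (5.15); NOT the thesis data.
vdd_points = [0.20, 0.25, 0.30, 0.35, 0.40]
snm_points = [hold_snm_c6t(v) for v in vdd_points]

intercept, slope = fit_line(vdd_points, snm_points)
print(f"intercept = {intercept:.4f} V, slope = {slope:.3f}")
print(f"Hold SNM at 0.4 V = {hold_snm_c6t(0.4):.4f} V")
```

Evaluating the line at 0.4 V gives about 0.1388 V, in line with the 0.138 V quoted above.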

The same procedure is repeated for the proposed M7T, MPT8T, M8T and M9T cells. The results are found to be the same, as the structure of the cross-coupled inverters is identical.

b) The analytical expression for the MI-12T SRAM cell, whose cross-coupled inverters use stacked transistors, is derived separately for hold operation and is given below.

During hold operation, in Figure 5.49 (f),

- Voltage conditions taken are: $\mathrm{V}_{\mathrm{Q}}$ is high, $\mathrm{V}_{\mathrm{QB}}$ is low, $\mathrm{BL}=\mathrm{V}_{\mathrm{DD}}$.
- Transistor MN6 is OFF.

For proper logic operation in the sub-threshold region under steady state, assuming $\mathrm{I}_{\mathrm{MN} 6}=0$, applying KCL at node Q gives $\mathrm{I}_{\mathrm{p} 2}=\mathrm{I}_{\mathrm{n} 2}$ (or $\mathrm{I}_{\mathrm{n} 4}$), which leads to the following expression:

$$
I_{S, n 2} \exp \left(\frac{V_{Q}-V_{t h, n 2}}{n_{n 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{S, p 2} \exp \left(\frac{V_{D D}-V_{Q}-V_{t h, p 2}}{n_{p 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ and $\mathrm{V}_{\mathrm{QB}}$ gives:

$$
\begin{align*}
& V_{Q}=\frac{n_{n 2} \cdot n_{p 2} \cdot V_{T}}{n_{n 2}+n_{p 2}}\left[\ln \left(\frac{I_{S, p 2}}{I_{S, n 2}}\right)+\ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)\right] \\
& +\frac{n_{n 2} \cdot V_{D D}}{n_{p 2}+n_{n 2}}+\frac{n_{p 2} \cdot n_{n 2}}{n_{p 2}+n_{n 2}}\left(\frac{V_{t h, p 2}}{n_{p 2}}+\frac{V_{t h, n 2}}{n_{n 2}}\right) . \tag{5.16}
\end{align*}
$$

For the 45 nm technology, choosing $\mathrm{L}=45 \mathrm{~nm}$ and $\left(\mathrm{W}_{\mathrm{p} 2} / \mathrm{W}_{\mathrm{n} 2}\right)=51 \mathrm{~nm} / 85 \mathrm{~nm}=0.6$, this equation is used to generate butterfly curves for different values of $\mathrm{V}_{\mathrm{DD}}$ (in the range 0.2 V to 0.4 V). The Hold SNM is then calculated for each supply value.

The relationship between Hold SNM and supply voltage is modeled, with the help of linear regression (least squares estimation) applied to this data set. The following regression equation has been obtained for MI-12T,

$$
\begin{equation*}
\mathrm{V}_{\mathrm{HoldSNM}, \mathrm{MI-12T}}=-0.0129+0.480 * V_{D D} \tag{5.17}
\end{equation*}
$$

At $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$, the value of Hold SNM obtained from equation 5.17 is:

$$
\mathrm{V}_{\text {HoldSNM,MI-12T }}=0.179 \mathrm{~V}
$$

Comparison of values of Hold SNM between simulated and estimated values through analytical and regression equation at 0.4 V is done in Table 5.12.

Table 5.12: Comparison of hold SNM between simulated and estimated values through analytical and regression equation at 0.4 V

| Type of SRAM cell | Simulated Hold SNM (mV) | Hold SNM (mV) estimated from the analytical equation (butterfly curve) | Hold SNM (mV) estimated from the regression equation |
| :---: | :---: | :---: | :---: |
| C6T | 120.2 | 125 | 138 |
| M7T | 120.2 | 125 | 138 |
| MPT8T | 120.2 | 125 | 138 |
| M8T | 120.2 | 125 | 138 |
| M9T | 120.2 | 125 | 138 |
| MI-12T | 170 | 177 | 179 |

The Hold SNM estimated from the analytical equations differs from the simulated values by 3.84% to 3.9%. The Hold SNM estimated from the regression equations differs from the simulated values by 5% to 12.8%.
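
The percentages quoted for Table 5.12 can be recomputed from the tabulated values. In the sketch below the difference is normalised by the estimated value; this choice is an inference from the numbers (it appears to reproduce the quoted bounds), not something stated in the text, and the names are hypothetical.

```python
# (simulated, analytical estimate, regression estimate) in mV, from Table 5.12.
TABLE_5_12 = {
    "C6T": (120.2, 125.0, 138.0),
    "MI-12T": (170.0, 177.0, 179.0),
}


def deviation_percent(simulated, estimated):
    """Deviation of the estimate from simulation, relative to the estimate."""
    return 100.0 * (estimated - simulated) / estimated


for cell, (sim, analytical, regression) in TABLE_5_12.items():
    print(
        f"{cell}: analytical {deviation_percent(sim, analytical):.2f} %, "
        f"regression {deviation_percent(sim, regression):.2f} %"
    )
```

The M7T, MPT8T, M8T and M9T rows are numerically identical to the C6T row, so they are not repeated in the dictionary.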

### 5.5.2. Analytical Expressions for RSNM of C6T, M7T, MPT8T, M8T, M9T and MI-12T SRAM cells

Figure 5.50 shows part of the schematics of C6T, M7T, MPT8T, M8T, M9T and MI-12T (full description given in Section 5.2 and Section 5.3) during read '0' operation, for which $\mathrm{V}_{\mathrm{Q}}$ is low and $\mathrm{V}_{\mathrm{QB}}$ is high. Crossed and uncrossed transistors show the OFF and ON states respectively. The pull-up, pull-down and access transistor current analysis during read mode for C6T and the proposed SRAM cells is given in Appendix D.


Figure 5.50: Half part of SRAM cells during read operation (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T

a) The analytical expression for C6T during read '0' operation is given below:

During read operation, in Figure 5.50 (a),

- Voltage conditions taken are: $\mathrm{V}_{\mathrm{QB}}$ is high, $\mathrm{V}_{\mathrm{Q}}$ is low, $\mathrm{BL}=\mathrm{V}_{\mathrm{DD}}$,
- Transistor MP2 is turned OFF with $\mathrm{I}_{\mathrm{MP} 2}=0$,
- Transistors MN6 and MN2 are ON,

For the proper logic operation in sub-threshold region under steady state, applying KCL at node Q gives $\mathrm{I}_{\mathrm{n} 2}=\mathrm{I}_{\mathrm{n} 6}$ which leads to following expression;

$$
I_{S, n 2} \exp \left(\frac{V_{Q}-V_{t h, n 2}}{n_{n 2} V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{S, n 6} \exp \left(\frac{V_{D D}-V_{Q B}-V_{t h, n 6}}{n_{n 6} V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives:

$$
\begin{align*}
& V_{Q}=n_{n 2} \cdot V_{T} \ln \left(\frac{I_{S, n 6}}{I_{S, n 2}}\right)+n_{n 2} \cdot V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)+V_{t h, n 2} \\
& +\frac{n_{n 2}}{n_{n 6}}\left(V_{D D}+V_{t h, n 6}-V_{Q B}\right) \tag{5.18}
\end{align*}
$$

For the 45 nm technology, choosing $\mathrm{L}_{\mathrm{n} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm}$ and $\left(\mathrm{W}_{\mathrm{n} 2} / \mathrm{W}_{\mathrm{n} 6}\right)=180 \mathrm{~nm} / 60 \mathrm{~nm}=3$, this equation is used to generate butterfly curves for different values of $\mathrm{V}_{\mathrm{DD}}$ (in the range 0.2 V to 0.4 V) and different values of $\left(\mathrm{W}_{\mathrm{n} 2} / \mathrm{W}_{\mathrm{n} 6}\right)$ (in the range 3 to 5). The RSNM is then calculated for each supply value.

The relationship between RSNM, CR and supply voltage is modeled with the help of multiple linear regression applied to this data set (i.e. RSNM vs. $\mathrm{V}_{\mathrm{DD}}$). The following regression equation has been obtained for C6T:

$$
\begin{equation*}
\mathrm{V}_{\mathrm{SNM}, \mathrm{Read}, \mathrm{C} 6 \mathrm{~T}}=0.0375 \times \ln \left(\frac{W_{n 2} / L_{n 2}}{W_{n 6} / L_{n 6}}\right)+0.829 V_{D D}-0.318 \tag{5.19}
\end{equation*}
$$

For 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{n} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 2} / \mathrm{W}_{\mathrm{n} 6}\right)=180 \mathrm{~nm} / 60 \mathrm{~nm}=3$ (CR ratio chosen as discussed in Section 5.3), value of RSNM is obtained from equation 5.19 as;

$$
V_{R S N M, C 6 T}=0.054 \mathrm{~V}
$$
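
A minimal sketch of how Eq. (5.19) is evaluated is given below; `rsnm_c6t` is a hypothetical helper name, and the only inputs are the coefficients and operating point stated above. The second calculation illustrates the sensitivity implied by the logarithmic term when the ratio is raised from 3 to 5, the upper end of the swept range.

```python
import math


def rsnm_c6t(cell_ratio, v_dd):
    """Regression estimate of the C6T read SNM from Eq. (5.19), in volts.

    cell_ratio is (W_n2/L_n2) / (W_n6/L_n6); v_dd is the supply in volts.
    """
    return 0.0375 * math.log(cell_ratio) + 0.829 * v_dd - 0.318


rsnm_at_cr3 = rsnm_c6t(3.0, 0.4)         # about 0.0548 V
rsnm_at_cr5 = rsnm_c6t(5.0, 0.4)
gain_from_cr = rsnm_at_cr5 - rsnm_at_cr3  # 0.0375 * ln(5/3), about 0.019 V

print(f"RSNM (CR = 3, 0.4 V) = {rsnm_at_cr3:.4f} V")
print(f"Gain from CR 3 -> 5  = {gain_from_cr:.4f} V")
```

Because the ratio enters only through a logarithm with a small coefficient, the supply-voltage term dominates the estimate over the swept range.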

b) The analytical expression for M7T during read ' 0 ' operation is given below:

During read operation, in Figure 5.50 (b),

- Voltage conditions taken are; $\mathrm{V}_{\mathrm{QB}}$ is high, $\mathrm{V}_{\mathrm{Q}}$ is low, $\mathrm{BL}=\mathrm{V}_{\mathrm{DD}}$,
- Transistor MP2 is turned OFF with $\mathrm{I}_{\mathrm{MP} 2}=0$,
- Transistors MP3, MN6 and MN2 are ON,

For proper logic operation in the sub-threshold region under steady state, applying KCL at node Q gives $\mathrm{I}_{\mathrm{n} 2}=\mathrm{I}_{\mathrm{n} 6}$, which leads to the following expression:

$$
I_{S, n 2} \exp \left(\frac{V_{Q}-V_{t h, n 2}}{n_{n 2} V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{S, n 6} \exp \left(\frac{V_{D D}-V_{Q B}-V_{t h, n 6}}{n_{n 6} V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives:

$$
\begin{align*}
& V_{Q}=n_{n 2} \cdot V_{T} \ln \left(\frac{I_{S, n 6}}{I_{S, n 2}}\right)+n_{n 2} \cdot V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)+V_{t h, n 2} \\
& +\frac{n_{n 2}}{n_{n 6}}\left(V_{D D}+V_{t h, n 6}-V_{Q B}\right) \tag{5.20}
\end{align*}
$$

Using the procedure described in Section 5.5.2 (a), the following regression equation has been obtained for M7T:

$$
\begin{equation*}
\mathrm{V}_{\mathrm{SNM}, \mathrm{Read}, \mathrm{M} 7 \mathrm{~T}}=0.0375 \times \ln \left(\frac{W_{n 2} / L_{n 2}}{W_{n 6} / L_{n 6}}\right)+0.962 V_{D D}-0.356 \tag{5.21}
\end{equation*}
$$

For 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{n} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 2} / \mathrm{W}_{\mathrm{n} 6}\right)=180 \mathrm{~nm} / 60 \mathrm{~nm}=3$ (CR ratio chosen as discussed in Section 5.3), value of RSNM is obtained from equation 5.21 as;

$$
V_{R S N M, M 7 T}=0.069 \mathrm{~V}
$$

c) The analytical expression for MPT8T during read '0' operation is given below:

During read operation, in Figure 5.50 (c),

- Voltage conditions taken are; $\mathrm{V}_{\mathrm{QB}}$ is high, $\mathrm{V}_{\mathrm{Q}}$ is low, $\mathrm{BL}=\mathrm{V}_{\mathrm{DD}}$,
- Transistors MP2 and MN4 are turned OFF with $\mathrm{I}_{\mathrm{MP2}}=\mathrm{I}_{\mathrm{MN} 4}=0$,
- Transistors MN6 and MN2 are ON,

For the proper logic operation in sub-threshold region under steady state, applying KCL at node Q gives $\mathrm{I}_{\mathrm{n} 2}=\mathrm{I}_{\mathrm{p} 4}$ which leads to following expression;

$$
I_{S, n 2} \exp \left(\frac{V_{Q}-V_{t h, n 2}}{n_{n 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{S, p 4} \exp \left(\frac{V_{D D}-V_{Q}-V_{t h, p 4}}{n_{p 4} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives:

$$
\begin{align*}
& V_{Q}=\frac{n_{n 2} \cdot n_{p 4} \cdot V_{T}}{n_{n 2}+n_{p 4}}\left[\ln \left(\frac{I_{S, p 4}}{I_{S, n 2}}\right)+\ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)\right] \\
& +\frac{n_{n 2} \cdot V_{D D}}{n_{p 4}+n_{n 2}}+\frac{n_{p 4} \cdot n_{n 2}}{n_{p 4}+n_{n 2}}\left(\frac{V_{t h, p 4}}{n_{p 4}}+\frac{V_{t h, n 2}}{n_{n 2}}\right) \tag{5.22}
\end{align*}
$$

Using the procedure described in Section 5.5.2 (a), the following regression equation has been obtained for MPT8T:

$$
\begin{equation*}
\mathrm{V}_{\mathrm{SNM}, \mathrm{Read}, \mathrm{MPT} 8 \mathrm{~T}}=0.0375 \times \ln \left(\frac{W_{n 2} / L_{n 2}}{W_{p 4} / L_{p 4}}\right)+0.812 V_{D D}-0.231 \tag{5.23}
\end{equation*}
$$

For 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{n} 2}=\mathrm{L}_{\mathrm{p} 4}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 2} / \mathrm{W}_{\mathrm{p} 4}\right)=180 \mathrm{~nm} / 60 \mathrm{~nm}=3$ (CR ratio chosen as discussed in Section 5.3), value of RSNM is obtained from equation 5.23 as;

$$
V_{R S N M, M P T 8 T}=0.134 \mathrm{~V}
$$

d) The analytical expression for M8T during read ' 0 ' operation is given below:

During read operation, in Figure 5.50 (d),

- Voltage conditions taken are $\mathrm{V}_{\mathrm{QB}}$ is high, $\mathrm{V}_{\mathrm{Q}}$ is low, $\mathrm{BL}=\mathrm{V}_{\mathrm{DD}}$,
- Transistor MP2 is turned OFF with $\mathrm{I}_{\mathrm{MP} 2}=0$,
- Transistors MN6 and MN2 are ON; MN4 provides extra OFF-state current to the nodes.

For proper logic operation in the sub-threshold region under steady state, applying KCL at node $Q$ gives $I_{n 2}=I_{n 6}+I_{n 4}$, which leads to the following expression:

$$
\begin{aligned}
& I_{S, n 2} \exp \left(\frac{V_{Q}-V_{t h, n 2}}{n_{n 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{S, n 6} \exp \left(\frac{V_{D D}-V_{Q B}-V_{t h, n 6}}{n_{n 6} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right) \\
& +I_{S, n 4} \exp \left(\frac{V_{D D}-V_{Q B}-V_{t h, n 4}}{n_{n 4} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
\end{aligned}
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives:

$$
\begin{align*}
& V_{Q}=n_{n 2} \cdot V_{T} \ln \left(\frac{I_{S, n 6}+I_{S, n 4}}{I_{S, n 2}}\right)+n_{n 2} \cdot V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)+V_{t h, n 2}  \tag{5.24}\\
& +\frac{n_{n 2}}{n_{n 6}+n_{n 4}}\left(V_{D D}-\left(V_{t h, n 6}+V_{t h, n 4}\right)+V_{Q B}\right)
\end{align*}
$$

Using the procedure as described in Section 5.5.2 (a), the following regression equation has been obtained for M8T by taking fixed $\mathrm{W}_{\mathrm{n} 4} / \mathrm{L}_{\mathrm{n} 4}=65 \mathrm{~nm} / 45 \mathrm{~nm}=1.4$;

$$
\begin{equation*}
\mathrm{V}_{\mathrm{SNM}, \text { Read, M8T }}=0.0375 \times \ln \left(\frac{W_{n 2} / L_{n 2}}{W_{n 6} / L_{n 6}}\right)+1.23 V_{D D}-0.427 \tag{5.25}
\end{equation*}
$$

For 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{n} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 2} / \mathrm{W}_{\mathrm{n} 6}\right)=162 \mathrm{~nm} / 60 \mathrm{~nm}=$ 2.7 (CR ratio chosen as discussed in Section 5.3), value of RSNM is obtained from equation 5.25 as;

$$
V_{\text {RSNM,M8T }}=0.102 \mathrm{~V}
$$

e) The analytical expression for M9T during read ' 0 ' operation is given below:

During read operation, in Figure 5.50 (e),

- Voltage conditions taken are $\mathrm{V}_{\mathrm{QB}}$ is high, $\mathrm{V}_{\mathrm{Q}}$ is low, $\mathrm{BL}=\mathrm{V}_{\mathrm{DD}}$,
- Transistor MP2 is turned OFF with $\mathrm{I}_{\mathrm{MP} 2}=0$,
- Transistors MP3, MN6 and MN2 are ON, MN4 provides extra OFF state current to the nodes

For proper logic operation in the sub-threshold region under steady state, applying KCL at node $Q$ gives $\mathrm{I}_{\mathrm{n} 2}=\mathrm{I}_{\mathrm{n} 6}+\mathrm{I}_{\mathrm{n} 4}$, which leads to the following expression:

$$
\begin{aligned}
& I_{S, n 2} \exp \left(\frac{V_{Q}-V_{t h, n 2}}{n_{n 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{S, n 6} \exp \left(\frac{V_{D D}-V_{Q B}-V_{t h, n 6}}{n_{n 6} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right) \\
& +I_{S, n 4} \exp \left(\frac{V_{D D}-V_{Q B}-V_{t h, n 4}}{n_{n 4} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
\end{aligned}
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives:

$$
\begin{align*}
& V_{Q}=n_{n 2} \cdot V_{T} \ln \left(\frac{I_{S, n 6}+I_{S, n 4}}{I_{S, n 2}}\right)+n_{n 2} \cdot V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)+V_{t h, n 2}  \tag{5.26}\\
& +\frac{n_{n 2}}{n_{n 6}+n_{n 4}}\left(V_{D D}-\left(V_{t h, n 6}+V_{t h, n 4}\right)+V_{Q B}\right)
\end{align*}
$$

Using the procedure as described in Section 5.5.2 (a), the following regression equation has been obtained for M9T by taking fixed $\mathrm{W}_{\mathrm{n} 4} / \mathrm{L}_{\mathrm{n} 4}=65 \mathrm{~nm} / 45 \mathrm{~nm}=1.4$;

$$
\begin{equation*}
\mathrm{V}_{\mathrm{SNM}, \mathrm{Read}, \mathrm{M} 9 \mathrm{~T}}=0.0375 \times \ln \left(\frac{W_{n 2} / L_{n 2}}{W_{n 6} / L_{n 6}}\right)+1.43 V_{D D}-0.551 \tag{5.27}
\end{equation*}
$$

For 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{n} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 2} / \mathrm{W}_{\mathrm{n} 6}\right)=174 \mathrm{~nm} / 60 \mathrm{~nm}=$ 2.9 (CR ratio chosen as discussed in Section 5.3), value of RSNM is obtained from equation 5.27 as;

$$
V_{\text {RSNM,M9T }}=0.060 \mathrm{~V}
$$

f) The analytical expression for MI-12T during read ' 0 ' operation is given below:

During read operation, in Figure 5.50 (f),

- Voltage conditions taken are; $\mathrm{V}_{\mathrm{QB}}$ is high, $\mathrm{V}_{\mathrm{Q}}$ is low, $\mathrm{BL}=\mathrm{V}_{\mathrm{DD}}$,
- Transistor MP2 is turned OFF with $\mathrm{I}_{\mathrm{MP} 2}=0$,
- Transistors MN6, MN4, MP4 and MN2 are ON,

For the proper logic operation in sub-threshold region under steady state, applying KCL at node Q gives $\mathrm{I}_{\mathrm{n} 2}=\mathrm{I}_{\mathrm{n} 6}$ which leads to following expression;

$$
I_{S, n 2} \exp \left(\frac{V_{Q}-V_{t h, n 2}}{n_{n 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{S, n 6} \exp \left(\frac{V_{D D}-V_{Q B}-V_{t h, n 6}}{n_{n 6} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives:

$$
\begin{align*}
& V_{Q}=n_{n 2} \cdot V_{T} \ln \left(\frac{I_{S, n 6}}{I_{S, n 2}}\right)+n_{n 2} \cdot V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)+V_{t h, n 2} \\
& +\frac{n_{n 2}}{n_{n 6}}\left(V_{D D}+V_{t h, n 6}-V_{Q B}\right) \tag{5.28}
\end{align*}
$$

Using the procedure as described in Section 5.5.2 (a), the following regression equation has been obtained for MI-12T;

$$
\begin{equation*}
\mathrm{V}_{\mathrm{SNM}, \text { Read, MI-12T }}=0.0375 \times \ln \left(\frac{W_{n 2} / L_{n 2}}{W_{n 6} / L_{n 6}}\right)+1.43 V_{D D}-0.554 \tag{5.29}
\end{equation*}
$$

For 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{n} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 2} / \mathrm{W}_{\mathrm{n} 6}\right)=174 \mathrm{~nm} / 60 \mathrm{~nm}=$ 2.9 (CR ratio chosen as discussed in Section 5.3), value of RSNM is obtained from equation 5.29 as;

$$
V_{R S N M, \mathrm{MI-12T}}=0.067 \mathrm{~V}
$$

Comparison of values of RSNM between simulated and estimated values through analytical and regression equation at 0.4 V is shown in Table 5.13.

Table 5.13: Comparison of RSNM between simulated and estimated values through analytical and regression equation at 0.4 V

| Type of SRAM cell | Simulated RSNM (mV) | RSNM (mV) estimated from the analytical equation (butterfly curve) | RSNM (mV) estimated from the regression equation |
| :---: | :---: | :---: | :---: |
| C6T | 34.0 | 42 | 53 |
| M7T | 60.0 | 65 | 69 |
| MPT8T | 120.0 | 129 | 134 |
| M8T | 85.0 | 97 | 102 |
| M9T | 43.0 | 57 | 60 |
| MI-12T | 54.4 | 62 | 67 |

The RSNM estimated from the analytical equations differs from the simulated values by 6.9% to 24.5%. The RSNM estimated from the regression equations differs from the simulated values by 10.4% to 34.8%.

### 5.5.3. Analytical Expressions for WSNM of C6T, M7T, MPT8T, M8T, M9T and MI-12T SRAM cells

Figure 5.51 shows part of the schematics of C6T, M7T, MPT8T, M8T, M9T and MI-12T (full description given in Section 5.2 and Section 5.3) during write '0' operation, for which $\mathrm{V}_{\mathrm{Q}}$ is high and $\mathrm{V}_{\mathrm{QB}}$ is low. Crossed and uncrossed transistors show the OFF and ON states respectively. The pull-up, pull-down and access transistor current analysis during write mode for C6T and the proposed SRAM cells is given in Appendix D.


Figure 5.51: Half part of SRAM cells during write operation (a) C6T (b) M7T (c) MPT8T (d) M8T (e) M9T (f) MI-12T

a) The analytical expression for C6T during write '0' operation is given below:

During write operation, in Figure 5.51 (a),

- Voltage conditions taken are; $\mathrm{V}_{\mathrm{QB}}$ is low, $\mathrm{V}_{\mathrm{Q}}$ is high, $\mathrm{BL}=$ low (GND),
- Transistor MN2 is turned OFF,
- Transistors MN6 and MP2 are ON,

For proper logic operation in the sub-threshold region under steady state, applying KCL at node Q gives $\mathrm{I}_{\mathrm{n} 6}=\mathrm{I}_{\mathrm{p} 2}$, which leads to the following expression:

$$
I_{S, n 6} \exp \left(\frac{V_{Q}-V_{t h, n 6}}{n_{n 6} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{S, p 2} \exp \left(\frac{V_{D D}-V_{Q}-V_{t h, p 2}}{n_{p 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives

$$
\begin{align*}
& V_{Q}=n_{p 2} \cdot V_{T} \ln \left(\frac{I_{S, n 6}}{I_{S, p 2}}\right)+n_{p 2} V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)  \tag{5.30}\\
& +V_{t h, p 2}+\frac{n_{p 2}}{n_{n 6}}\left(V_{D D}+V_{t h, n 6}-V_{Q B}\right)
\end{align*}
$$

For the 45 nm technology, choosing $\mathrm{L}_{\mathrm{p} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm}$ and $\left(\mathrm{W}_{\mathrm{n} 6} / \mathrm{W}_{\mathrm{p} 2}\right)=143 \mathrm{~nm} / 55 \mathrm{~nm}=2.6$, this equation is used to generate butterfly curves for different values of $\mathrm{V}_{\mathrm{DD}}$ (in the range 0.2 V to 0.4 V) and of $\left(\mathrm{W}_{\mathrm{n} 6} / \mathrm{W}_{\mathrm{p} 2}\right)$ (in the range 2.6 to 5). The WSNM is then calculated for each supply value.

The relationship between WSNM, PR and supply voltage is modeled with the help of multiple linear regression applied to this data set (i.e. WSNM vs. $\mathrm{V}_{\mathrm{DD}}$). The following regression equation has been obtained for C6T:

$$
\begin{equation*}
\mathrm{V}_{\mathrm{WSNM}, \mathrm{C} 6 \mathrm{~T}}=0.0931 \times \ln \left(\frac{W_{n 6} / L_{n 6}}{W_{p 2} / L_{p 2}}\right)+0.962 V_{D D}-0.319 \tag{5.31}
\end{equation*}
$$

For the 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{p} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 6} / \mathrm{W}_{\mathrm{p} 2}\right)=143 \mathrm{~nm} / 55 \mathrm{~nm}=2.6$ (PR ratio chosen as discussed in Section 5.3), the value of WSNM obtained from equation 5.31 is:

$$
V_{W S N M, \mathrm{C} 6 \mathrm{~T}}=0.154 \mathrm{~V}
$$

b) The analytical expression for M7T during write ' 0 ' operation is given below:

For M7T, during write operation, in Figure 5.51 (b),

- Voltage conditions taken are: $\mathrm{V}_{\mathrm{QB}}$ is low, $\mathrm{V}_{\mathrm{Q}}$ is high, $\mathrm{BL}=$ low (GND).
- Transistor MN2 is turned OFF.
- Transistors MN6, MP3 and MP2 are ON.

For proper logic operation in the sub-threshold region under steady state, applying KCL at node $Q$ gives $\mathrm{I}_{\mathrm{n} 6}=\mathrm{I}_{\mathrm{p} 2}$, which leads to the following expression:

$$
I_{S, n 6} \exp \left(\frac{V_{Q}-V_{t h, n 6}}{n_{n 6} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{S, p 2} \exp \left(\frac{V_{D D}-V_{Q}-V_{t h, p 2}}{n_{p 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives

$$
\begin{align*}
& V_{Q}=n_{p 2} \cdot V_{T} \ln \left(\frac{I_{S, n 6}}{I_{S, p 2}}\right)+n_{p 2} V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)  \tag{5.32}\\
& +V_{t h, p 2}+\frac{n_{p 2}}{n_{n 6}}\left(V_{D D}+V_{t h, n 6}-V_{Q B}\right)
\end{align*}
$$

Using the procedure as described in Section 5.5.3 (a), the following regression equation has been obtained for M7T;

$$
\begin{equation*}
\mathrm{V}_{\mathrm{SNM}, \mathrm{Write}, \mathrm{M} 7 \mathrm{~T}}=0.0931 \times \ln \left(\frac{W_{n 6} / L_{n 6}}{W_{p 2} / L_{p 2}}\right)+0.962 V_{D D}-0.274 \tag{5.33}
\end{equation*}
$$

For the 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{p} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 6} / \mathrm{W}_{\mathrm{p} 2}\right)=126 \mathrm{~nm} / 60 \mathrm{~nm}=2.1$ (PR ratio chosen as discussed in Section 5.3), the value of WSNM obtained from equation 5.33 is:

$$
\mathrm{V}_{\mathrm{WSNM}, \mathrm{M} 7 \mathrm{~T}}=0.179 \mathrm{~V}
$$

c) The analytical expression for MPT8T during write ' 0 ' operation is given below:

For MPT8T, during write operation, in Figure 5.51 (c),

- Voltage conditions taken are $\mathrm{V}_{\mathrm{QB}}$ is low, $\mathrm{V}_{\mathrm{Q}}$ is high, $\mathrm{BL}=$ low (GND),
- Transistors MN2 and MN4 are turned OFF.
- Transistors MP4 and MP2 are ON.

For proper logic operation in the sub-threshold region under steady state, applying KCL at node $Q$ gives $I_{p 4}=I_{p 2}$, which leads to the following expression:

$$
I_{S, p 4} \exp \left(\frac{V_{D D}-V_{Q}-V_{t h, p 4}}{n_{p 4} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)=I_{S, p 2} \exp \left(\frac{V_{D D}-V_{Q}-V_{t h, p 2}}{n_{p 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives

$$
\begin{align*}
& V_{Q}=n_{p 2} \cdot V_{T} \ln \left(\frac{I_{S, p 4}}{I_{S, p 2}}\right)+n_{p 2} V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)  \tag{5.34}\\
& +V_{t h, p 2}+\frac{n_{p 2}}{n_{p 4}}\left(V_{D D}+V_{t h, p 4}-V_{Q B}\right)
\end{align*}
$$

Using the procedure as described in Section 5.5.3 (a), the following regression equation has been obtained for MPT8T;

$$
\begin{equation*}
\mathrm{V}_{\mathrm{WSNM}, \mathrm{MPT8T}}=0.0375 \times \ln \left(\frac{W_{p 4} / L_{p 4}}{W_{p 2} / L_{p 2}}\right)+0.962 V_{D D}-0.120 \tag{5.35}
\end{equation*}
$$

For 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{p} 2}=\mathrm{L}_{\mathrm{p} 4}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{p} 4} / \mathrm{W}_{\mathrm{p} 2}\right)=108 \mathrm{~nm} / 60 \mathrm{~nm}=$ 1.8 (PR ratio chosen as discussed in Section 5.3), value of WSNM is obtained from equation 5.35 as;

$$
\mathrm{V}_{\mathrm{WSNM}, \mathrm{MPT8T}}=0.286 \mathrm{~V}
$$

d) The analytical expression for M8T during write ' 0 ' operation is given below:

For M8T, during write operation, in Figure 5.51 (d),

- Voltage conditions taken are; $\mathrm{V}_{\mathrm{QB}}$ is low, $\mathrm{V}_{\mathrm{Q}}$ is high, $\mathrm{BL}=$ low (GND),
- Transistor MN2 is turned OFF,
- Transistors MN6 and MP2 are ON, MN4 provides extra OFF state current to the nodes.

For proper logic operation in the sub-threshold region under steady state, applying KCL at node $Q$ gives $I_{n 6}+I_{n 4}=I_{p 2}$, which leads to the following expression:

$$
\begin{aligned}
& I_{S, n 6} \exp \left(\frac{V_{Q}-V_{t h, n 6}}{n_{n 6} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)+I_{S, n 4} \exp \left(\frac{V_{Q}-V_{t h, n 4}}{n_{n 4} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right) \\
& =I_{S, p 2} \exp \left(\frac{V_{D D}-V_{Q}-V_{t h, p 2}}{n_{p 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
\end{aligned}
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives

$$
\begin{align*}
& V_{Q}=n_{p 2} \cdot V_{T} \ln \left(\frac{I_{S, n 6}+I_{S, n 4}}{I_{S, p 2}}\right)+n_{p 2} V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)  \tag{5.36}\\
& +V_{t h, p 2}+\frac{n_{p 2}}{n_{n 6}+n_{n 4}}\left(V_{D D}+\left(V_{t h, n 6}+V_{t h, n 4}\right)-V_{Q B}\right)
\end{align*}
$$

Using the procedure as described in Section 5.5.3 (a), the following regression equation has been obtained for M8T by taking fixed $\mathrm{W}_{\mathrm{n} 4} / \mathrm{L}_{\mathrm{n} 4}=65 \mathrm{~nm} / 45 \mathrm{~nm}=1.4$;

$$
\begin{equation*}
\mathrm{V}_{\mathrm{WSNM}, \mathrm{M} 8 \mathrm{~T}}=0.093 \times \ln \left(\frac{W_{n 6} / L_{n 6}}{W_{p 2} / L_{p 2}}\right)+0.963 V_{D D}-0.223 \tag{5.37}
\end{equation*}
$$

For 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{p} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 6} / \mathrm{W}_{\mathrm{p} 2}\right)=150 \mathrm{~nm} / 60 \mathrm{~nm}=$ 2.5 ( PR ratio chosen as discussed in Section 5.3), value of WSNM is obtained from equation 5.37 as;

$$
\mathrm{V}_{\mathrm{WSNM}, \mathrm{M} 8 \mathrm{~T}}=0.247 \mathrm{~V}
$$

e) The analytical expression for M9T during write ' 0 ' operation is given below:

For M9T, during write operation, in Figure 5.51 (e),

- Voltage conditions taken are $\mathrm{V}_{\mathrm{QB}}$ is low, $\mathrm{V}_{\mathrm{Q}}$ is high, $\mathrm{BL}=$ low (GND),
- Transistor MN2 is turned OFF,
- Transistors MN6, MP3 and MP2 are ON, MN4 provides extra OFF state current to the nodes.

For proper logic operation in the sub-threshold region under steady state, applying KCL at node $Q$ gives $I_{n 6}+I_{n 4}=I_{p 2}$, which leads to the following expression:

$$
\begin{aligned}
& I_{S, n 6} \exp \left(\frac{V_{Q}-V_{t h, n 6}}{n_{n 6} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)+I_{S, n 4} \exp \left(\frac{V_{Q}-V_{t h, n 4}}{n_{n 4} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right) \\
& =I_{S, p 2} \exp \left(\frac{V_{D D}-V_{Q}-V_{t h, p 2}}{n_{p 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
\end{aligned}
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives

$$
\begin{align*}
& V_{Q}=n_{p 2} \cdot V_{T} \ln \left(\frac{I_{S, n 6}+I_{S, n 4}}{I_{S, p 2}}\right)+n_{p 2} V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)  \tag{5.38}\\
& +V_{t h, p 2}+\frac{n_{p 2}}{n_{n 6}+n_{n 4}}\left(V_{D D}+\left(V_{t h, n 6}+V_{t h, n 4}\right)-V_{Q B}\right)
\end{align*}
$$

Using the procedure as described in Section 5.5.3 (a), the following regression equation has been obtained for M9T by taking fixed $\mathrm{W}_{\mathrm{n} 4} / \mathrm{L}_{\mathrm{n} 4}=65 \mathrm{~nm} / 45 \mathrm{~nm}=1.4$;

$$
\begin{equation*}
\mathrm{V}_{\mathrm{WSNM}, \mathrm{M} 9 \mathrm{~T}}=0.093 \times \ln \left(\frac{W_{n 6} / L_{n 6}}{W_{p 2} / L_{P 2}}\right)+0.573 V_{D D}-0.106 \tag{5.39}
\end{equation*}
$$

For 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{p} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 6} / \mathrm{W}_{\mathrm{p} 2}\right)=138 \mathrm{~nm} / 60 \mathrm{~nm}=$ 2.3 ( PR ratio chosen as discussed in Section 5.3), value of WSNM is obtained from equation 5.39 as;

$$
\mathrm{V}_{\mathrm{WSNM}, \mathrm{M9T}}=0.200 \mathrm{~V}
$$

f) The analytical expression for MI-12T during the write '0' operation is given below:

For MI-12T, during write operation, in Figure 5.51 (f),

- Voltage conditions taken are $\mathrm{V}_{\mathrm{QB}}$ is low, $\mathrm{V}_{\mathrm{Q}}$ is high, $\mathrm{BL}=$ low (GND),
- Transistor MN2 is OFF,
- Transistors MN6, MP4, MN4 and MP2 are ON.

For proper logic operation in the sub-threshold region under steady state, applying KCL at node $Q$ gives $I_{n 6}=I_{p 2}$, which leads to the following expression:

$$
I_{n 6} \exp \left(\frac{V_{Q}-V_{t h, n 6}}{n_{n 6} \cdot V_{T}}\right)\left(1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)\right)=I_{p 2} \exp \left(\frac{V_{D D}-V_{Q}-V_{t h, p 2}}{n_{p 2} \cdot V_{T}}\right)\left(1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)\right)
$$

Using the method given in [41][42], solving above expression for $\mathrm{V}_{\mathrm{Q}}$ gives

$$
\begin{align*}
& V_{Q}=n_{p 2} \cdot V_{T} \ln \left(\frac{I_{S, n 6}}{I_{S, p 2}}\right)+n_{p 2} V_{T} \ln \left(\frac{1-\exp \left(\frac{V_{Q B}-V_{D D}}{V_{T}}\right)}{1-\exp \left(\frac{-V_{Q B}}{V_{T}}\right)}\right)  \tag{5.40}\\
& +V_{t h, p 2}+\frac{n_{p 2}}{n_{n 6}}\left(V_{D D}+V_{t h, n 6}-V_{Q B}\right)
\end{align*}
$$
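The current balance $I_{n 6}=I_{p 2}$ at node $Q$ can also be solved numerically, which is a convenient cross-check when the device parameters are known. The sketch below applies bisection to the KCL expression for MI-12T exactly as written above. All numerical device parameters (the current prefactors, threshold voltages, slope factors and the value of $V_{QB}$) are placeholder assumptions chosen only for illustration and are not taken from this work; the function names are likewise hypothetical.

```python
import math

V_T = 0.02585   # thermal voltage near 300 K (V)
V_DD = 0.4      # supply voltage used in this chapter (V)
V_QB = 0.05     # assumed small positive QB voltage (placeholder, V)

# Placeholder device parameters, NOT extracted from the thesis.
I_N6, VTH_N6, N_N6 = 2.5e-7, 0.30, 1.5   # prefactor (A), threshold (V), slope factor
I_P2, VTH_P2, N_P2 = 1.0e-7, 0.30, 1.5


def i_n6(v_q):
    """Left-hand side of the KCL expression (current through MN6)."""
    return (I_N6 * math.exp((v_q - VTH_N6) / (N_N6 * V_T))
            * (1.0 - math.exp(-V_QB / V_T)))


def i_p2(v_q):
    """Right-hand side of the KCL expression (current through MP2)."""
    return (I_P2 * math.exp((V_DD - v_q - VTH_P2) / (N_P2 * V_T))
            * (1.0 - math.exp((V_QB - V_DD) / V_T)))


def solve_vq(lo=0.0, hi=V_DD, tol=1e-9):
    """Bisection on f(V_Q) = I_n6 - I_p2, which rises monotonically with V_Q."""
    def f(v):
        return i_n6(v) - i_p2(v)

    if f(lo) > 0 or f(hi) < 0:
        raise ValueError("no sign change of the current balance in [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


v_q = solve_vq()
print(f"V_Q at current balance: {v_q * 1000:.1f} mV")
```

Bisection is used here because the left-hand side grows and the right-hand side decays exponentially with $V_{Q}$, so the balance point is unique within the supply range whenever a sign change exists.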

Using the procedure as described in Section 5.5.3 (a), the following regression equation has been obtained for MI-12T;

$$
\begin{equation*}
\mathrm{V}_{\mathrm{WSNM}, \mathrm{MI-12T}}=0.0931 \times \ln \left(\frac{W_{n 6} / L_{n 6}}{W_{p 2} / L_{p 2}}\right)+0.962 V_{D D}-0.174 \tag{5.41}
\end{equation*}
$$

For 45 nm technology, choosing $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}, \mathrm{~L}_{\mathrm{p} 2}=\mathrm{L}_{\mathrm{n} 6}=45 \mathrm{~nm},\left(\mathrm{~W}_{\mathrm{n} 6} / \mathrm{W}_{\mathrm{p} 2}\right)=108 \mathrm{~nm} / 60 \mathrm{~nm}=$ 1.8 (PR ratio chosen as discussed in Section 5.3), value of WSNM is obtained from equation 5.41 as;

$$
\mathrm{V}_{\mathrm{WSNM}, \mathrm{MI}-12 \mathrm{~T}}=0.265 \mathrm{~V}
$$
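As a quick arithmetic check, the regression equations (5.37), (5.39) and (5.41) can be evaluated at the operating points quoted above. The following is a minimal sketch; the function name `wsnm_regression` and the dictionary layout are illustrative choices and not part of the design flow used in this work.

```python
import math


def wsnm_regression(a, b, c, wl_n6, wl_p2, vdd):
    """Evaluate V_WSNM = a*ln((W_n6/L_n6)/(W_p2/L_p2)) + b*V_DD + c, in volts."""
    return a * math.log(wl_n6 / wl_p2) + b * vdd + c


L_NM = 45.0   # channel length of MN6 and MP2 (nm)
VDD = 0.4     # supply voltage (V)

# Coefficients (a, b, c) from equations (5.37), (5.39) and (5.41),
# followed by W_n6 and W_p2 in nm as quoted in the text.
cells = {
    "M8T": ((0.093, 0.963, -0.223), 150.0, 60.0),
    "M9T": ((0.093, 0.573, -0.106), 138.0, 60.0),
    "MI-12T": ((0.0931, 0.962, -0.174), 108.0, 60.0),
}

wsnm = {
    name: wsnm_regression(*coeffs, w_n6 / L_NM, w_p2 / L_NM, VDD)
    for name, (coeffs, w_n6, w_p2) in cells.items()
}

for name, value in wsnm.items():
    print(f"{name}: WSNM = {value * 1000:.1f} mV")
```

Since $L_{n 6}=L_{p 2}$, the logarithmic term depends only on the width ratio $W_{n 6} / W_{p 2}$.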

Comparison of values of WSNM between simulated and estimated values through analytical and regression equation at 0.4 V is shown in Table 5.14.

Table 5.14: Comparison of WSNM between simulated and estimated values through analytical and regression equation at 0.4 V

| Type of SRAM cell | WSNM (mV), simulated value | WSNM (mV), estimated through analytical equation using butterfly curve | WSNM (mV), estimated through regression equation |
| :---: | :---: | :---: | :---: |
| C6T | 90.0 | 102 | 154 |
| M7T | 126.2 | 143 | 179 |
| MPT8T | 218.0 | 237 | 286 |
| M8T | 186.0 | 200 | 247 |
| M9T | 160.0 | 183 | 200 |
| MI-12T | **226.0** | 239 | 265 |

Note: The WSNM values estimated from the analytical equations are within $5.4 \%$ to $11.7 \%$ of the simulated values, while the WSNM values estimated from the regression equations are within $14.7 \%$ to $41.7 \%$ of the simulated values. This deviation is high in comparison to the Hold SNM and RSNM calculations for the following reason:

Unlike for the hold and read margin case, using the sub- $\mathrm{V}_{\mathrm{T}}$ approximation for $\mathrm{I}_{\mathrm{MN} 6}$ and $\mathrm{I}_{\mathrm{MP} 2}$ does not yield an accurate solution of $\mathrm{V}_{\mathrm{Q}}$ as given in [43]. This may be because the exponential behavior of $\mathrm{I}_{\mathrm{D}}\left(\mathrm{V}_{\mathrm{GS}}\right)$ is accurate only for $\mathrm{V}_{\mathrm{GS}}<200 \mathrm{mV}$ [43].

This error yields a significantly different result for $\mathrm{V}_{\mathrm{Q}}$ when applied to the drive fight between $\mathrm{I}_{\mathrm{MN6}}$ and $\mathrm{I}_{\mathrm{MP2}}$ at $\mathrm{V}_{\mathrm{QB}}=0$. Thus, finding an accurate value of $\mathrm{V}_{\mathrm{Q}}$ depends on accurately modeling the current in the moderate-$\mathrm{V}_{\mathrm{T}}$ region, which is included in the future scope at the end of this chapter.

### 5.6. FINAL RESULTS AND DISCUSSIONS

This section presents the comparative analysis of the proposed SRAM cell designs at 45 nm with the referenced architectures. A comparison of published results is also done at 180 nm technology. The impact of scaling from 180 nm to 45 nm on the performance of the SRAM cell is then obtained.

- At 45 nm

All five proposed SRAM cell designs (M7T, MPT8T, M8T, M9T and MI-12T) are compared with referenced cells having the same number of transistors, operated in the sub-threshold region at 45 nm with a 0.4 V supply voltage. For this comparison, the $7 \mathrm{~T}, 8 \mathrm{~T}, 9 \mathrm{~T}$ and 12T SRAM cells of the referenced architectures [44][32][45][46] are also designed and simulated in the same setup for sub-threshold operation, to maintain uniformity of the simulation environment.

The values of parameters like RSNM, WSNM, SINM, SVNM, WTI, WTV, read delay, write delay and leakage power consumption in hold mode for different proposed and referenced SRAM cells (with same number of transistor count) are shown in Table 5.15 for comparison.

Figure 5.52 (a) shows a histogram comparison of the results of the proposed SRAM designs, along with C6T, in terms of RSNM, WSNM, read/write delay and leakage power in hold mode. Figure 5.52 (b) shows the histogram comparison of WTI, WTV, SINM and SVNM of C6T and all proposed SRAM cells at 45 nm technology.

Table 5.16 and Table 5.17 show the write ability and read stability performance, respectively, of all proposed SRAM cells at 45 nm.

Table 5.18 shows the comparison of all proposed SRAM cells with C6T at 45 nm. These designs are also compared with published designs at 180 nm technology to assess the impact of technology scaling.

Table 5.15: Comparison of proposed with referenced SRAM cells (for $7 \mathrm{~T}, 8 \mathrm{~T}, 9 \mathrm{~T}$ and 12 T configurations) at 45 nm technology, $\mathrm{V}_{\mathrm{DD}}=0.4 \mathrm{~V}$

| Reference / Proposed SRAM cell | Type of SRAM cell | RSNM (mV) | % increment in RSNM in proposed cell | WSNM (mV) | % increment in WSNM in proposed cell | Read delay (ns) | % decrement in read delay in proposed cell | Write delay (ns) | % decrement in write delay in proposed cell | Leakage power in hold mode (nW) | % less leakage power in proposed cell |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ref. [44] | 7T | 42 |  | 102.7 |  | 15.87 |  | 3.74 |  | 5.29 |  |
| Proposed | M7T | 60 | 30.0% | 126.2 | 18.6% | 11.70 | 26.2% | 1.15 | 69.2% | 6.01 | 11.9% (more) |
| Ref. [32] | 8T | 40 |  | 154 |  | 5.87 |  | 4.41 |  | 4.70 |  |
| Proposed | MPT8T | 120 | 66.6% | 218 | 29.3% | 2.11 | 64.0% | 1.53 | 65.3% | 2.31 | 50.8% |
| Proposed | M8T | 85 | 52.9% | 186 | 17.2% | 0.74 | 87.3% | 0.71 | 83.9% | 3.14 | 33.1% |
| Ref. [46] | 9T | 35 |  | 147 |  | 6.58 |  | 13.11 |  | 7.44 |  |
| Proposed | M9T | 43 | 18.6% | 160 | 8.1% | 2.28 | 65.3% | 10.10 | 22.9% | 8.10 | 8.1% (more) |
| Ref. [45] | 12T | 50.1 |  | 158 |  | 3.33 |  | 6.54 |  | 6.82 |  |
| Proposed | MI-12T | 54.4 | 32.7% | 226 | 30.1% | 1.54 | 53.7% | 1.51 | 76.9% | 3.67 | 46.1% |
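The percentage columns of Table 5.15 can be reproduced with a few lines of code. The sketch below assumes, judging from the tabulated numbers, that each percentage is normalised by the larger of the two values being compared (the proposed value for the margin increments, the referenced value for the delay and power decrements). This normalisation is an inference from the table, and the function names are illustrative.

```python
def pct_increment(proposed, reference):
    """Increment of a margin, expressed relative to the proposed value (%)."""
    return 100.0 * (proposed - reference) / proposed


def pct_decrement(proposed, reference):
    """Decrement of a delay or power, expressed relative to the referenced value (%)."""
    return 100.0 * (reference - proposed) / reference


# MPT8T compared with the referenced 8T cell [32], values from Table 5.15.
mpt8t = {
    "RSNM": pct_increment(120.0, 40.0),        # mV
    "WSNM": pct_increment(218.0, 154.0),       # mV
    "read delay": pct_decrement(2.11, 5.87),   # ns
    "write delay": pct_decrement(1.53, 4.41),  # ns
    "hold power": pct_decrement(2.31, 4.70),   # nW
}

for metric, value in mpt8t.items():
    print(f"{metric}: {value:.1f} %")
```

A negative result from `pct_decrement` corresponds to the entries marked "(more)" in the table, for example the hold-mode leakage power of M7T and M9T.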



Figure 5.52: The comparative histograms of SRAM cells at 45 nm

In Figure 5.52 (a), a change of scale on the y-axis is shown with a kink ( $\sim$ ) so that the histograms for RSNM can be shown over their entire range. This allows RSNM, WSNM, read delay, write delay, and leakage power consumption in hold mode to be compared on the same axis.

Robustness of the write operation is given by WSNM, WTI, and WTV, which are independent of each other, as indicated by the performance orders (from best to worst) given below:

- WSNM in decreasing order: MI-12T > MPT8T > M8T > M9T > M7T > C6T
- WTI in increasing order: M8T < MI-12T < MPT8T < M7T < M9T < C6T
- WTV in increasing order: MPT8T = M7T = M9T = C6T < M8T < MI-12T

It is observed that none of the proposed cells has the best value for all three metrics related to write ability (i.e. highest WSNM, least WTI, and least WTV). For example, MI-12T has the highest WSNM and a low WTI, but the highest WTV. Similarly, C6T has the lowest WSNM and the highest WTI, but is among the cells with the lowest WTV. Judging write ability by mere observation of the above performance orders (of WSNM, WTI, and WTV) may therefore lead to a wrong conclusion.

Thus, we need to estimate a performance index value for write ability metric for each proposed SRAM cell through multi criteria decision analysis taking all three metrics i.e. WSNM, WTI, and WTV into account. Table 5.16 shows the write ability index for all proposed SRAM cells.

Table 5.16: Write ability metric of proposed SRAM cells at 45 nm

| SRAM cell | \|WTI\| ($\mu$A) X1 | D1 = max - X1 | I1 = D1/R1 | WTV (mV) X2 | D2 = max - X2 | I2 = D2/R2 | WSNM (mV) X3 | D3 = X3 - min | I3 = D3/R3 | INDEX (I) = I1 + I2 + I3 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| C6T | 64.00 (max) | 0.00 | 0.00 | 180 | 45 | 1.00 | 90.0 (min) | 0.0 | 0.00 | 1.00 |
| M7T | 53.31 | 10.69 | 0.18 | 180 | 45 | 1.00 | 126.2 | 36.2 | 0.26 | 1.44 |
| MPT8T | 28.00 | 36.00 | 0.61 | 180 | 45 | 1.00 | 218.0 | 128.0 | 0.94 | 2.55 |
| M8T | 5.01 | 58.99 | 1.00 | 190 | 35 | 0.77 | 186.0 | 96.0 | 0.70 | 2.47 |
| M9T | 58.90 | 5.10 | 0.08 | 180 | 45 | 1.00 | 160.0 | 70.0 | 0.51 | 1.59 |
| MI-12T | 8.70 | 55.30 | 0.93 | 225 (max) | 0 | 0.00 | 226.0 | 136.0 | 1.00 | 1.93 |
| Range | R1 = 58.9 |  |  | R2 = 45 |  |  | R3 = 136 |  |  |  |

Range (R) of a performance metric (WSNM, WTI, and WTV) = maximum value - minimum value

From Table 5.16, the write ability performance order (in decreasing order of Index I, from best to worst) is: MPT8T > M8T > MI-12T > M9T > M7T > C6T
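The index construction of Table 5.16 can be expressed compactly in code. The sketch below normalises each metric by its range and sums the three normalised terms with equal weight, using the values of Table 5.16. The function name `minmax_index` and the direction flags are illustrative, and the sketch assumes every metric has a non-zero range.

```python
def minmax_index(cells, higher_is_better):
    """Sum of range-normalised scores per cell.

    cells: {name: (x1, x2, ...)}; higher_is_better: one bool per metric.
    Each metric contributes (x - min)/R if higher is better, else (max - x)/R.
    """
    index = {name: 0.0 for name in cells}
    for m, better_high in enumerate(higher_is_better):
        values = [metrics[m] for metrics in cells.values()]
        lo, hi = min(values), max(values)
        rng = hi - lo  # range R of this metric (assumed non-zero)
        for name, metrics in cells.items():
            d = (metrics[m] - lo) if better_high else (hi - metrics[m])
            index[name] += d / rng
    return index


# (|WTI| in uA, WTV in mV, WSNM in mV) taken from Table 5.16.
write_metrics = {
    "C6T": (64.00, 180.0, 90.0),
    "M7T": (53.31, 180.0, 126.2),
    "MPT8T": (28.00, 180.0, 218.0),
    "M8T": (5.01, 190.0, 186.0),
    "M9T": (58.90, 180.0, 160.0),
    "MI-12T": (8.70, 225.0, 226.0),
}

# Lower WTI and WTV are better; higher WSNM is better.
index = minmax_index(write_metrics, (False, False, True))
ranking = sorted(index, key=index.get, reverse=True)

for name in ranking:
    print(f"{name}: I = {index[name]:.2f}")
```

Because the sketch does not round the intermediate terms, individual index values may differ from the tabulated ones in the second decimal place; the resulting order is unchanged.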

Robustness of the read operation is given by RSNM, SINM, and SVNM, which are independent of each other, as indicated by the performance orders (from best to worst) given below:

- RSNM in decreasing order: MPT8T > M8T > M7T > MI-12T > M9T > C6T
- SINM in decreasing order: MI-12T > M7T > M9T >> M8T > MPT8T > C6T
- SVNM in decreasing order: C6T = M7T = MPT8T = M9T > M8T > MI-12T

It is observed that none of the proposed cells has the best value for all three metrics related to read stability (i.e. highest RSNM, highest SINM, and highest SVNM). For example, MPT8T has the highest RSNM and SVNM, but a low SINM. Similarly, C6T has the lowest RSNM and SINM, but a high SVNM.

Thus, we need to estimate a performance index value for read stability metric for each proposed SRAM cell through multi criteria decision analysis taking all three metrics i.e. RSNM, SINM, and SVNM into account.

Table 5.17 shows the read stability index for all proposed SRAM cells.

Table 5.17: Read stability metric of proposed SRAM cells at 45 nm

| SRAM cell | SVNM (mV) X1 | D1 = X1 - min | I1 = D1/R1 | SINM ($\mu$A) X2 | D2 = X2 - min | I2 = D2/R2 | RSNM (mV) X3 | D3 = X3 - min | I3 = D3/R3 | INDEX (I) = I1 + I2 + I3 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| C6T | 220 | 45.00 | 1.00 | 10.00 (min) | 0.00 | 0.00 | 37.0 (min) | 0.00 | 0.00 | 1.00 |
| M7T | 220 | 45.00 | 1.00 | 49.21 | 39.21 | 0.88 | 60.0 | 23.00 | 0.26 | 2.14 |
| MPT8T | 220 | 45.00 | 1.00 | 12.00 | 2.00 | 0.04 | 123.0 | 86.00 | 1.00 | 2.04 |
| M8T | 190 | 15.00 | 0.33 | 14.01 | 4.01 | 0.09 | 88.0 | 51.00 | 0.59 | 1.01 |
| M9T | 220 | 45.00 | 1.00 | 49.20 | 39.20 | 0.88 | 43.0 | 6.00 | 0.07 | 1.95 |
| MI-12T | 175 (min) | 0.00 | 0.00 | 54.50 | 44.50 | 1.00 | 54.4 | 17.40 | 0.20 | 1.20 |
| Range | R1 = 45 |  |  | R2 = 44.5 |  |  | R3 = 86 |  |  |  |

Range (R) of a performance metric (RSNM, SINM, and SVNM) $=$ maximum value - minimum value

From Table 5.17, the read stability performance order (in decreasing order of Index I, from best to worst) is: M7T > MPT8T > M9T > MI-12T > M8T > C6T

Evaluation of read stability and write ability performance order through observation is shown in Appendix F for both $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology nodes.

Table 5.18 shows the comparison of all proposed SRAM cells with C6T at 45 nm.

Table 5.18: Percentage comparison of all proposed SRAM cells with C6T at 45 nm

|  | M7T | MPT8T | M8T | M9T | MI-12T |
| :---: | :---: | :---: | :---: | :---: | :---: |
| % increment in Hold SNM | Same value as C6T | Same value as C6T | Same value as C6T | Same value as C6T | 29.2% |
| % increment in RSNM | 43.3% | 71.6% | 60.0% | 20.9% | 37.5% |
| % increment in SVNM | Same value as C6T | Same value as C6T | 13.6% | Same value as C6T | 20.4% |
| % increment in SINM | 79.6% | 16.6% | 28.6% | 79.6% | 81.9% |
| % increment in WSNM | 28.6% | 58.7% | 51.6% | 43.7% | 60.1% |
| % decrement in WTI | 16.7% | 56.2% | 92.1% | 7.9% | 86.4% |
| % decrement in WTV | Same value as C6T | Same value as C6T | 5.2% (more) | Same value as C6T | 20% (more) |
| % decrement in Read Delay | 16.6% | 5.3% | 73.8% | 7.5% | 4.5% |
| % decrement in Write Delay | 69.6% | 83.5% | 89.1% | 5.6% | 72.0% |
| % less leakage Power Consumption in hold mode | 10.7% (more power consumption here) | 54.6% | 32.1% | 76.4% (more power consumption here) | 30.7% |

- At 180 nm

For 180 nm technology, a literature survey of SRAM cells with 4, 5, 6, 7, 8, 9 and 12 transistors has been done. These referenced designs were simulated in the same simulation setup for sub-threshold operation to maintain uniformity of the simulation environment, and their RSNM, read/write delay and leakage power consumption in hold mode have been obtained. This data has been compiled in Table 5.1(b) and is presented in Figure 5.53 for the sake of comparison.

Figure 5.53 compares only those referenced SRAM designs which show the best results in terms of leakage power (out of all designs that were studied for a fixed transistor count configuration in Table 5.1(a)).

In Figure 5.53 (a), a change of scale on $y$-axis is shown with a kink (~) to show the performance metrics in their entire range. This is done to show the comparisons of RSNM, WSNM, read delay, write delay, and leakage power consumption in hold mode on same axis.

Figure 5.53: The comparative histograms of SRAM cells at 180 nm

Robustness of the write operation is given by WSNM, WTI, and WTV, which are independent of each other, as indicated by the performance orders (from best to worst) given below:

- WSNM in decreasing order: 8T[13] > 12T[15] > 6T[12] > 9T[10] > 7T[10] > 5T[12] > 4T[12]
- WTI in increasing order: 8T[13] < 9T[10] < 12T[15] < 6T[12] < 5T[12] < 7T[10] < 4T[12]
- WTV in increasing order: 6T[12] < 7T[10] = 4T[12] < 5T[12] < 9T[10] < 12T[15] < 8T[13]

It is observed that none of the cells has the best value for all three metrics related to write ability (i.e. highest WSNM, least WTI, and least WTV). For example, 8T[13] has the highest WSNM and the lowest WTI, but the highest WTV. Similarly, 4T[12] has the lowest WSNM and the highest WTI, but a moderate WTV.

Thus, we need to estimate a performance index value for write ability metric for each SRAM cell through multi criteria decision analysis taking all three metrics i.e. WSNM, WTI, and WTV into account. Table 5.19 shows the write ability index for all SRAM cells.

Table 5.19: Write ability metric of SRAM cells at 180 nm

| SRAM cell | WSNM (mV) X1 | D1 = X1 - min | I1 = D1/R1 | \|WTI\| ($\mu$A) X2 | D2 = max - X2 | I2 = D2/R2 | WTV (mV) X3 | D3 = max - X3 | I3 = D3/R3 | INDEX (I) = I1 + I2 + I3 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 4T[12] | 22.1 (min) | 0.00 | 0.00 | 84.0 (max) | 0.00 | 0.00 | 230 | 20.00 | 0.33 | 0.33 |
| 5T[12] | 41.0 | 18.90 | 0.39 | 55.2 | 28.80 | 0.40 | 220 | 30.00 | 0.50 | 1.29 |
| 6T[12] | 47.7 | 25.60 | 0.53 | 32.3 | 51.70 | 0.72 | 250 (max) | 0.00 | 0.00 | 1.25 |
| 7T[10] | 38.0 | 15.90 | 0.34 | 64.1 | 19.90 | 0.28 | 230 | 20.00 | 0.33 | 0.95 |
| 8T[13] | 70.2 | 48.10 | 1.00 | 13.0 | 71.00 | 1.00 | 220 | 30.00 | 0.50 | 2.50 |
| 9T[10] | 44.8 | 22.70 | 0.47 | 23.1 | 60.90 | 0.85 | 200 | 50.00 | 0.83 | 2.15 |
| 12T[15] | 68.5 | 46.40 | 0.96 | 25.5 | 58.50 | 0.82 | 190 | 60.00 | 1.00 | 2.78 |
| Range | R1 = 48.1 |  |  | R2 = 71.0 |  |  | R3 = 60 |  |  |  |

From Table 5.19, write ability performance order (in decreasing order according to Index I) from best to worst:

12T[15] > 8T[13] > 9T[10] > 5T[12] > 6T[12] > 7T[10] > 4T[12]

Robustness of the read operation is given by RSNM, SINM, and SVNM, which are independent of each other, as indicated by the performance orders (from best to worst) given below:

- RSNM in decreasing order: 8T[13] > 12T[15] > 7T[10] > 9T[10] > 5T[12] > 6T[12] > 4T[12]
- SINM in decreasing order: 8T[13] > 12T[15] > 5T[12] > 9T[10] > 4T[12] > 6T[12] > 7T[10]
- SVNM in decreasing order: 12T[15] > 9T[10] > 8T[13] = 5T[12] > 7T[10] = 4T[12] > 6T[12]

It is observed that none of the cells has the best value for all three metrics related to read stability (i.e. highest RSNM, highest SINM, and highest SVNM). For example, 8T[13] has the highest RSNM and SINM, but a lower SVNM. Similarly, 4T[12] has the lowest RSNM and a low SVNM, but a moderate SINM.

Thus, we need to estimate a performance index value for the read stability metric of each SRAM cell through multi criteria decision analysis, taking all three metrics, i.e. RSNM, SINM, and SVNM, into account. Table 5.20 shows the read stability index for all SRAM cells.

Table 5.20: Read stability metric of SRAM cells at 180 nm

| SRAM cell | RSNM (mV) X1 | D1 = X1 - min | I1 = D1/R1 | SVNM (mV) X2 | D2 = X2 - min | I2 = D2/R2 | SINM ($\mu$A) X3 | D3 = X3 - min | I3 = D3/R3 | INDEX (I) = I1 + I2 + I3 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 4T[12] | 18 (min) | 0.00 | 0.00 | 170 | 20.00 | 0.33 | 32.1 | 17.30 | 0.30 | 0.63 |
| 5T[12] | 22.0 | 4.00 | 0.09 | 180 | 30.00 | 0.50 | 60.1 | 45.30 | 0.77 | 1.36 |
| 6T[12] | 20.1 | 2.10 | 0.05 | 150 (min) | 0.00 | 0.00 | 26.9 | 12.10 | 0.21 | 0.26 |
| 7T[10] | 26.7 | 8.70 | 0.21 | 170 | 20.00 | 0.33 | 14.8 (min) | 0.00 | 0.00 | 0.54 |
| 8T[13] | 59.6 | 41.60 | 1.00 | 180 | 30.00 | 0.50 | 73.0 | 58.20 | 1.00 | 2.50 |
| 9T[10] | 23.4 | 5.40 | 0.12 | 200 | 50.00 | 0.84 | 38.4 | 23.60 | 0.41 | 1.37 |
| 12T[15] | 32.0 | 14.00 | 0.33 | 210 | 60.00 | 1.00 | 65.4 | 50.60 | 0.87 | 2.20 |
| Range | R1 = 41.6 |  |  | R2 = 60 |  |  | R3 = 58.2 |  |  |  |

Range (R) of a performance metric (RSNM, SINM, and SVNM) = maximum value - minimum value

From Table 5.20, read stability performance order (in decreasing order according to Index I) from best to worst:

8T[13] > 12T[15] > 9T[10] > 5T[12] > 4T[12] > 7T[10] > 6T[12]

## Observations:

## (i) Comparison of Results of Proposed Designs at 45 nm

Table 5.15 shows that, in comparison to the referenced designs, the proposed SRAM designs have higher RSNM, lower read/write delay and less leakage power consumption in hold mode (except M7T and M9T).

Figure 5.52 shows that, in the sub-threshold region at 0.4 V supply:

- Write Ability of SRAM Cell: Based on the index value, the order of decreasing write ability is MPT8T > M8T > MI-12T > M9T > M7T > C6T, as per Table 5.16.
- Read Stability of SRAM Cell: Based on the index value, the order of decreasing read stability is M7T > MPT8T > M9T > MI-12T > M8T > C6T, as per Table 5.17.
- Read Delay of SRAM Cell: M8T shows the least average read delay. In order of increasing average read delay (best to worst): M8T << M7T < M9T < MPT8T < MI-12T < C6T. The variability of the proposed cells ranges from 0.04 to 0.40 at a supply voltage of 0.4 V, with M7T having the least value. Read delay and its variability reduce with reduction in power supply voltage.
- Write Delay of SRAM Cell: M8T shows the least average write delay at a supply voltage of 0.4 V. In order of increasing average write delay (best to worst): M8T < MPT8T < MI-12T < M7T << M9T < C6T. The variability of the proposed cells ranges from 0.016 to 0.19 at a supply voltage of 0.4 V, with MPT8T having the least value. Write delay and its variability reduce with reduction in power supply voltage.
- Leakage Power Consumption in Hold Mode: MPT8T is the most power-efficient design. In order of increasing leakage power consumption: MPT8T < M8T < MI-12T < C6T < M7T < M9T.


## (ii) Comparison of Proposed SRAM Cells with C6T at 45 nm

The comparative analysis in Table 5.18 shows that the M8T, MPT8T and MI-12T designs have low leakage power consumption along with improved design parameters, i.e. high read stability, high write ability, and fast read and write operation.

M7T and M9T also achieve high read stability, high write ability, and fast read and write operation, but their leakage power consumption increases (by $10.7 \%$ and $76.4 \%$ respectively) in comparison to C6T.

## (iii) Comparison of Results of Referenced Designs at 180 nm Technology Among Themselves

Figure 5.53 shows that, in the sub-threshold region:

- Write Ability of SRAM Cell: Based on the index value, the order of decreasing write ability is 12T[15] > 8T[13] > 9T[10] > 5T[12] > 6T[12] > 7T[10] > 4T[12], as per Table 5.19.
- Read Stability of SRAM Cell: Based on the index value, the order of decreasing read stability is 8T[13] > 12T[15] > 9T[10] > 5T[12] > 4T[12] > 7T[10] > 6T[12], as per Table 5.20.
- Read Delay of SRAM Cell: The 5T[12] SRAM cell has the least read delay.
- Write Delay of SRAM Cell: The 9T[10] SRAM cell has the least write delay.
- Leakage Power Consumption in Hold Mode of SRAM Cell: 6T[12] is the most power-efficient SRAM cell, with the least leakage power consumption in hold mode.


## (iv) Effect of Technology Scaling

At the 45 nm technology node, all proposed designs show increased values of RSNM and WSNM, reduced values of read/write delay and increased values of leakage power consumption in hold mode as compared to the SRAM cells implemented at 180 nm. For a few cells, such as 7T, 8T, and 12T, leakage power consumption reduces due to the different cell design. Appendix E contains a table of the percentage change in all performance metrics of SRAM cells (with the same transistor count) at 45 nm in comparison to 180 nm technology.

### 5.7. CONCLUSIONS

This chapter explores the design space of the proposed M7T, MPT8T, M8T, M9T and MI-12T SRAM cells, implemented at the 45 nm technology node and suitable for sub-threshold operation. Read stability, write ability, average write delay, average read delay and leakage power consumption in hold mode have been analysed thoroughly. The proposed memory cells exhibit improved performance over C6T.

At 45 nm, the proposed SRAM cells are compared with the referenced designs [32][44][45][46] and with each other. At 180 nm, performance parameters of low power referenced designs are compiled in Table 5.1(b) and used for comparison.

The overall results of the SRAM cells show following conclusions at 45 nm and 180 nm technology nodes:

## (i) Comparison of Results of Proposed Designs with Respective Referenced Designs at 45nm Technology

- RSNM: Proposed designs show an RSNM increment ranging from $30 \%$ to $66.6 \%$.
- WSNM: Proposed designs show a WSNM increment ranging from $8.1 \%$ to $30.1 \%$.
- Average Read Delay: Proposed designs show a read delay decrement ranging from $26.2 \%$ to $87.3 \%$.
- Average Write Delay: Proposed designs show a write delay decrement ranging from $22.9 \%$ to $83.9 \%$.
- Leakage Power Consumption in Hold Mode: Proposed designs show a leakage power decrement ranging from $16.8 \%$ to $50.8 \%$, except M7T and M9T, which consume $11.9 \%$ and $8.1 \%$ more leakage power respectively.

## (ii) Comparison of Results of Proposed Designs Among Themselves at 45nm Technology

- Impact of Design Configuration on RSNM of SRAM Cell: Among all proposed SRAM cells, MPT8T has the highest RSNM of 120 mV. The increased RSNM values in MPT8T and M8T are due to the addition of an extra transistor in parallel to the access transistor, which helps in maintaining a proper logic '0' at the internal storage node. This also overcomes the voltage degradation at the BLB/BL nodes, i.e. it increases the node voltage for a stored logic '1'.
- Impact of Design Configuration on WSNM of SRAM Cell: Among all proposed SRAM cells, MI-12T has the highest WSNM of 226 mV. This is because stacking in the pull-down path removes the positive feedback in the bi-stable element during the write operation.
- Impact of Design Configuration on Average Read Delay of SRAM Cell: Among all proposed SRAM cells, M8T has the minimum read delay of 0.74 ns. This is due to the extra current driven by the transistor added in parallel to the access transistor, which decreases the charging/discharging time of the internal node of the SRAM cell and hence the read delay. The variability of the proposed cells ranges from 0.04 to 0.40 at a supply voltage of 0.4 V, with M7T having the least value, making it less sensitive to process variation. Read delay and its variability reduce with reduction in power supply voltage.
- Impact of Design Configuration on Average Write Delay of SRAM Cell: Among all proposed SRAM cells, M8T has the minimum write delay of 0.71 ns. This is due to the extra current driven by the transistor added in parallel to the access transistor, which decreases the charging/discharging time of the internal node of the SRAM cell and hence the write delay. The variability of the proposed cells ranges from 0.016 to 0.19 at a supply voltage of 0.4 V, with MPT8T having the least value, making it less sensitive to process variation. Write delay and its variability reduce with reduction in power supply voltage.
- Impact of Design Configuration on Leakage Power Consumption in Hold Mode: Among all proposed SRAM cells, MPT8T has the least leakage power consumption (in hold mode) of 2.31 nW. This is due to proper logic '0' and logic '1' levels at the internal storage nodes (Q and QB), which fully turn OFF the pull-up and pull-down transistors in the bi-stable element of MPT8T, thereby reducing the leakage current during hold mode.


## (iii) Comparison of All Proposed SRAM Cells with C6T at 45 nm

The comparative analysis shows that the M8T, MPT8T and MI-12T designs have low leakage power consumption along with other improved design parameters such as high read stability, high write ability, and fast read \& write operation. Thus, these designs can be an attractive choice for low power applications in scaled technology. M7T and M9T also show these improved design parameters, but with higher leakage power consumption as compared to C6T.

## (iv) Comparison of Results of Referenced Designs at 180 nm Technology Among Themselves

In comparison with other published low power references with same transistor count, the 8T SRAM cell [13] and 12T SRAM cell [15] are top two cells with higher read stability, \& higher write ability, 5T SRAM cell [12] has lowest read delay, 9T SRAM cell [10] has lowest write delay and 6T SRAM cell [12] has least leakage power consumption in hold mode.

## (v) Effect of Technology Scaling

At the 45 nm technology node, all proposed designs show a $42.7 \%$ average increment in RSNM, a $65.8 \%$ average increment in WSNM, $85.6 \%$ and $24.7 \%$ average reductions in read and write delay respectively, and a $24.9 \%$ increment in leakage power consumption in hold mode as compared to the corresponding SRAM cells implemented at 180 nm. For a few cells such as 7T, 8T, and 12T, leakage power consumption reduces due to the different cell design.

## (vi) Modeling Analysis

At 0.4 V supply voltage, analytical equations, obtained under steady state condition for read, write and hold operation, give WSNM, RSNM, and Hold SNM within ($5.4 \%$ to $11.7 \%$), ($6.9 \%$ to $24.5 \%$), and ($3.84 \%$ to $3.9 \%$) respectively as compared to simulated values.

## (vii) Regression Analysis

At 0.4 V supply voltage, simplified equations which are obtained through multiple regression analysis (to obtain the impact of varying CR, PR, and $\mathrm{V}_{\mathrm{DD}}$ on read, write and hold operation), give WSNM, RSNM, and Hold SNM within ($14.7 \%$ to $41.7 \%$), ($10.4 \%$ to $34.8 \%$), and ($5 \%$ to $12.8 \%$) respectively as compared to simulated values. These equations can be used to get a first-order estimate of the impact of varying CR, PR, and $\mathrm{V}_{\mathrm{DD}}$ on read, write and hold stability. For quick comparison, Figure 5.54 and Figure 5.55 show the comparative Design Space Exploration (DSE) chart of SRAM cells at 45 nm and 180 nm technology respectively.


Figure 5.54: DSE chart of all five proposed SRAM cells as compared to C6T at 45 nm technology


Figure 5.55: DSE chart of referenced SRAM cells at 180 nm technology

## REFERENCES

[1] S. Pal and A. Islam, "Variation tolerant differential 8T SRAM cell for ultralow power applications", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35(4), 2016, pp. 549-558.
[2] J. K. Yadav, P. Das, A. Jain and A. Grover, "Area compact 5T portless SRAM cell for high density cache in 65 nm CMOS", 19th International Symposium on VLSI Design and Test (VDAT), 2015, pp. 1-4.
[3] M. Moghaddam, M. H. Moaiyeri and M. Eshghi, "Ultra low-power 7T SRAM cell design based on CMOS", 23rd Iranian Conference on Electrical Engineering (ICEE), 2015, pp. 1357-1361.
[4] N. Yoshinobu, H. Masahi, K. Takayuki and K. Itoh, "Review and future prospects of low-voltage RAM circuits", IBM Journal of Research and Development, vol. 47, 2003, pp. 525-552.
[5] A. P. Chandrakasan, N. Verma and D. C. Daly, "Ultralow-power electronics for biomedical applications", Annu. Rev. Biomed. Eng, vol. 10, 2008, pp. 247-274.
[6] M. Radfar, K. Shah and J. Singh, "Recent Sub-threshold Design Techniques", Active and Passive Electronic Components, vol. 2012, 2012, pp. 1-11.
[7] H. Mizuno and T. Nagano, "Driving source-line cell architecture for sub-1-V high-speed low-power applications", IEEE Journal of Solid- State Circuits, vol. 31(4), 1996, pp. 552-557.
[8] A. Islam and M. Hasan, "Single-ended 6T SRAM cell to improve dynamic Power Consumption by decreasing activity factor", The Mediterranean Journal of Electronics and Communications, vol. 7(1), 2011, pp. 172-181.
[9] B. Swarup, S. Mukhopadhyay and K. Roy, "Process variations and process-tolerant design", International Conference on Embedded Systems, 2007, pp. 699-704.
[10] A. Agal and B. Krishan, "Comparative analysis of various SRAM cells with low power, high read stability and low area", International Journal of Engineering and Manufacturing (IJEM), vol. 4(3), 2014, pp. 1-12.
[11] W. K. Chen, "The VLSI handbook", CRC Press, 2006, pp. 0-2320.
[12] S. Dhariwal, V. Gupta, S. Kumari, R. Vijay and V. Lamba, "A comparative study \& performance analysis of SRAM cells with symmetric \& asymmetric configuration", Journal of Communication and Computer, vol. 8(4), 2011, pp. 313-317.
[13] S. S. Das, K. B. Ray and P. P. Nanda, "Multi threshold low power SRAM using floating gates", International Journal of Innovative Research in Computer and Communication Engineering, vol. 3(3), 2015, pp. 2370-2376.
[14] M. Mishra, S. Majumdar, D. Malviya and P. P. Bansod, "Design of SRAM cell in 0.18 $\mu \mathrm{m}$ technology", Proceedings of International Conference on Information Communication \& Embedded Systems, 2013, pp. 457-463.
[15] P. V. Thakare and S. Tembhurne, "A power analysis of SRAM Cell using 12T topology for faster data transmission", International Journal of Science Technology \& Engineering, vol. 2(12), 2016, pp. 441-446.
[16] M. Budhaditya and S. Basu, "Low power single bitline 6T SRAM cell with high read stability", IEEE International Conference on Recent Trends in Information Systems (ReTIS), 2011, pp. 169-174.
[17] R. Kumar and V. Kursun, "Temperature-adaptive voltage scaling for enhanced energy efficiency in sub-threshold memory arrays", Microelectronics journal, vol. 40(6), 2009, pp. 1013-1025.
[18] M. Liu, C. Hong, L. Changmeng and W. Zhihua, "An ultra-low-power 1-KB subthreshold SRAM in the 180 nm CMOS process", Journal of Semiconductors, vol. 31(6), 2010, pp. 065013-1-065013-4.
[19] M. F. Chang, S. W. Chang, P. W. Chou and W. C. Wu, "A 130 mV SRAM with expanded write and read margins for sub-threshold applications", IEEE Journal of Solid-State Circuits, vol. 46(2), 2011, pp. 520-529.
[20] S. N. Kumar, N. Rao, D. Ramesh and S. D. Harihara, "Design of low power SRAM architecture with isolated read and write operations at deep submicron CMOS technology", International Journal of Advanced Research in Electrical Electronics and Instrumentation Engineering, vol. 3(7), 2013, pp. 1-6.
[21] K. Itoh, M. Horiguchi and H. Tanaka, "Ultra-low voltage nano-scale memories", Springer, Science \& Business Media, 2007, pp. 1-345.
[22] T. S. Doorn, J. A. Croon, E. J. W. Ter Maten and A. D. Bucchianico, "A yield centric statistical design method for optimization of the SRAM active column", IEEE Proceedings of ESSCIRC, 2009, pp. 352-355.
[23] T. S. Doorn, R. Salters and L. E. Villagra, "SRAM design challenges: A research view on current status and future work", PhD dissertation, TU Delft, Delft University of Technology, 2009, pp. 1-132.
[24] J. P. Kulkarni, K. Kim and K. Roy, "A 160 mV robust schmitt trigger based sub-threshold SRAM", IEEE Journal of Solid-State Circuits, vol. 42(10), 2007, pp. 2303-2313.
[25] N. Verma, J. Kwong and A. P. Chandrakasan, "Nanometer MOSFET variation in minimum energy sub-threshold circuits", IEEE Transactions on Electron Devices, vol. 55(1), 2008, pp. 163-174.
[26] J. M. Rabaey, A. P. Chandrakasan and B. Nikolic, "Digital integrated circuits", Englewood Cliffs: Prentice hall, vol. 2, 2002, pp. 1-761.
[27] S. Ohbayashi, M. Yabuuchi, K. Nii, Y. Tsukamoto, S. Imaoka, Y. Oda and Y. Yamaguchi, "A 65-nm SoC embedded 6T-SRAM designed for manufacturability with read and write operation stabilizing circuits", IEEE Journal of Solid-State Circuits, vol. 42(4), 2007, pp. 820-829.
[28] E. Seevinck, F. J. List and J. Lohstroh, "Static-noise margin analysis of MOS SRAM cells", IEEE Journal of Solid-State Circuits, vol. 22(5), 1987, pp. 748-754.
[29] H. Calhoun and A. P. Chandrakasan, "Static noise margin variation for sub-threshold SRAM in 65-nm CMOS", IEEE Journal of Solid-State Circuits, vol. 42, 2006, pp. 1673-1679.
[30] P. Gupta, A. Gupta and A. Asati, "Leakage immune modified pass transistor based 8T SRAM cell in sub-threshold region", International Journal of Reconfigurable Computing, vol. 2015, 2015, pp. 1-10.
[31] V. K. Sharma, M. Pattanaik and B. Raj, "ONOFIC approach: low power high speed nanoscale VLSI circuits design", International Journal of Electronics, vol. 101(1), 2014, pp. 61-73.
[32] A. Islam and M. Hasan, "A technique to mitigate impact of process, voltage and temperature variations on design metrics of SRAM Cell", Microelectronics Reliability, vol. 52(2), 2012, pp. 405-411.
[33] V. Mukherjee, S. P. Mohanty, E. Kougianos, R. Allawadhi and R. Velagapudi, "Gate leakage current analysis in read/write/idle states of a SRAM cell", IEEE Conference on Region 5, 2006, pp. 196-200.
[34] S. Nalam and B. H. Calhoun, "Asymmetric sizing in a 45 nm 5 T SRAM to improve read stability over 6T", IEEE Custom Integrated Circuits Conference, 2009, pp. 709-712.
[35] K. Zhang, U. Bhattacharya, Z. Chen, F. Hamzaoglu, D. Murray, N. Vallepalli and M. Bohr, "A 3-GHz 70-Mb SRAM in 65-nm CMOS technology with integrated column-based dynamic power supply", IEEE Journal of Solid-State Circuits, vol. 41(1), 2006, pp. 146-151.
[36] E. Grossar, M. Stucchi, K. Maex and W. Dehaene, "Read stability and write-ability analysis of SRAM cells for nanometer technologies", IEEE Journal of Solid-State Circuits, vol. 41(11), 2006, pp. 2577-2588.
[37] S. Lin, Y. B. Kim and A. Lombardi, "Design and analysis of a 32 nm PVT tolerant CMOS SRAM cell for low leakage and high stability", Integration, The VLSI Journal, vol. 43, 2010, pp. 176-187.
[38] S. Ahmad, N. Alam and M. Hasan, "A robust 10T SRAM cell with enhanced read operation", International Journal of Computer Applications, vol. 129(2), 2015, pp. 7-12.
[39] A. J. Bhavnagarwala, S. V. Kosonocky, S. P. Kowalczyk, R. V. Joshi, Y. H. Chan, U. Srinivasan and J. K. Wadhwa, "A transregional CMOS SRAM with single, logic vdd and dynamic power rails", IEEE Symposium on VLSI Circuits, Digest of Technical Papers, 2004, pp. 292-293.
[40] R. M. Swanson and J. D. Meindl, "Ion-implanted complementary MOS transistors in low-voltage circuits", IEEE Journal of Solid-State Circuits, vol. 7(2), 1972, pp. 146-153.
[41] B. H. Calhoun and A. P. Chandrakasan, "Static noise margin variation for sub-threshold SRAM in 65-nm CMOS", IEEE Journal of Solid-State Circuits, vol. 41(7), 2006, pp. 1673-1679.
[42] A. Makosiej, A. Vladimirescu, O. Thomas and A. Amara, "An SNM estimation and optimization model for ULP sub- 45 nm CMOS SRAM in the presence of variability", 8th IEEE International on NEWCAS, 2010, pp. 337-340.
[43] C. George and P. C. Huang, "Power-area tradeoff in sub-threshold SRAM designs", pp. 1-4, www-inst.eecs.berkeley.edu.
[44] R. E. Aly, M. I. Faisal and M. A. Bayoumi, "Novel 7T SRAM cell for low power cache design", IEEE International Conference on SOC, 2005, pp. 171-174.
[45] P. Upadhyay, S. Ghosh, R. Kar, D. Mandal and S. P. Ghoshal, "Low static and dynamic power MTCMOS based 12T SRAM cell for high speed memory system", In IEEE International Joint Conference on Computer Science and Software Engineering (JCSSE), 2014, pp. 212-217.
[46] R. K. Singh, S. Birla and M. Pattanaik, "Characterization of 9T SRAM cell at various process corners at deep sub-micron technology for multimedia applications", International Journal of Engineering and Technology, vol. 3(6), 2011, pp. 696-700.
[47] D. Khalil, M. Khellah, N. S. Kim, Y. Ismail, T. Karnik and V. K. De, "Accurate estimation of SRAM dynamic stability", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16(12), 2008, pp. 1639-1647.

## CHAPTER 6

## CONCLUSIONS \& FUTURE SCOPE

The objective of this thesis is the architectural exploration and implementation of low power arithmetic circuits and SRAM cells in the sub-threshold region of operation. The main focus is to identify the optimum architectures and suitable logic families, in terms of power consumption and propagation delay, for logarithmic prefix adders, column compression multipliers and SRAM cells in the sub-threshold regime at two different nanometer technology nodes.

In this thesis, we have chosen to explore the design space of parallel prefix adder architectures (CLA, KSA, and HCA), two column compression multipliers (Wallace tree and Dadda multipliers) and SRAM cells in terms of power consumption, delay, and power-delay product in the sub-threshold region. These circuits are frequently used in the arithmetic and on-chip memory units of an SoC. Their performance is obtained at both 45 nm and 180 nm technology nodes to find the impact of scaling. The outcomes of this exploration of adders, multipliers and SRAM cells are summarized in Sections 6.1, 6.2, and 6.3 respectively. The future scope of this work is discussed in Section 6.4.

### 6.1. CONCLUSIONS ON ADDERS

The overall results of the CLA, KSA and HCA show following conclusions at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology nodes:

## (i) Power Consumption

The power consumption depends directly on the operand size and the complexity of the adder architecture. Adders with larger operand sizes consume more power for all logic design styles at the same supply voltage and at both technology nodes.

- At 45 nm, the HCA architecture has the least power consumption. HCA based designs consume $8.2 \%$ to $77.7 \%$ less power than KSA based designs and $79.4 \%$ to $96.8 \%$ less power than CLA designs, for all bit operands.
- Similarly, at 180 nm, HCA based designs have the least power consumption. They consume $21.2 \%$ to $56.1 \%$ less power than KSA and $89.4 \%$ to $97.1 \%$ less power than CLA, for all bit operands.

Therefore, HCA is the most power efficient architecture as compared to KSA and CLA architecture in sub-threshold region.

## (ii) Propagation Delay

As the operand size increases, the circuit complexity and the number of gates in the propagation path also increase, leading to an increase in propagation delay. Adders with larger operand sizes show higher propagation delay for all logic design styles at the same supply voltage and at both technology nodes.

- At 45 nm, CLA based designs have the least propagation delay: $44.8 \%$ to $76.4 \%$ lower than KSA and $66.6 \%$ to $79.8 \%$ lower than HCA based designs, for all bit operands.
- Similarly, at 180 nm, the CLA architecture has the least propagation delay: $82.6 \%$ to $93.8 \%$ lower than KSA and $81.2 \%$ to $88.6 \%$ lower than HCA, for all bit operands.

Therefore, CLA is the fastest adder architecture as compared to KSA and HCA for sub-threshold operation.

## (iii) Power-Delay Product:

In comparison to the proposed design showing the highest power-delay product (at both technology nodes):

- For low bit operands (i.e. 8b / 16b), KSA and HCA architectures using Static-CMOS logic give the lowest power-delay product:
  - ($63.7 \%$ / $29.24 \%$) lesser power-delay product for KSA.
  - ($32.69 \%$ / $43.64 \%$) lesser power-delay product for HCA.
- For higher bit operands (i.e. 32b / 64b), KSA and HCA architectures using HYB-TG logic give the lowest power-delay product:
  - ($40.59 \%$ / $58.99 \%$) lesser power-delay product for KSA.
  - ($3.23 \%$ / $15.32 \%$) lesser power-delay product for HCA.
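The percentage savings listed above compare the power-delay product of a design with that of the design having the highest power-delay product. A minimal Python sketch of that calculation is given below. The function names and the numerical values are hypothetical and chosen only for illustration; they are not results from this thesis.

```python
def power_delay_product(power_w, delay_s):
    """Power-delay product in joules: average power times propagation delay."""
    return power_w * delay_s


def percent_reduction(reference, value):
    """Percentage by which `value` is lower than `reference`."""
    return (reference - value) / reference * 100.0


# Hypothetical numbers for illustration only (not taken from the thesis).
worst = power_delay_product(power_w=40e-9, delay_s=50e-9)  # highest-PDP design
best = power_delay_product(power_w=20e-9, delay_s=25e-9)   # design under comparison
saving = percent_reduction(worst, best)
print(f"PDP saving relative to the worst design: {saving:.1f} %")
```

Because the power-delay product is a product of two quantities, halving both power and delay in this illustrative case reduces it to one quarter of the reference value.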


## (iv) Effect of RBB

HYB-TG and HYB-PT logic families with RBB do not function properly at both technology nodes in sub-threshold region.

The use of the RBB scheme improves the sub-threshold conduction current to perform circuit operations in Static-CMOS logic. At $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology, the average decrement in propagation delay compared to the circuit design without RBB is approximately $61.87 \% / 18.5 \%$. The increments in average power consumption and power-delay product are $77.03 \% / 22.5 \%$ and $38.3 \% / 7.3 \%$ respectively.

The use of RBB scheme in CMOS inverter shows that $\mathrm{NM}_{\mathrm{L}}$ reduces by $62.2 \% / 71.7 \%$ at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology respectively.

## (v) Effect of Technology Scaling

At 45 nm, the power consumption is higher than at 180 nm for all adder designs across the different logic design styles. The increase is due to the higher leakage current at 45 nm as compared to 180 nm technology, since the supply voltage is kept the same at 0.4 V.

### 6.2. CONCLUSIONS ON MULTIPLIERS

The overall results of the Wallace tree and Dadda multipliers show following conclusions at $45 \mathrm{~nm} / 180 \mathrm{~nm}$ technology nodes.

## (i) Power Consumption

- For LOC based architectures:

For both technology nodes, Dadda multiplier based designs have lower power consumption than the corresponding Wallace tree multiplier designs, as given below.

At 45 nm: For $4 \times 4$-bit, they consume $28.6 \%$ to $30.9 \%$ less power, whereas for $8 \times 8$-bit, they consume $23.2 \%$ to $24.1 \%$ less power.

At 180 nm: For $4 \times 4$-bit, they consume $4.1 \%$ to $8.4 \%$ less power, whereas for $8 \times 8$-bit, they consume $1.6 \%$ to $1.7 \%$ less power.

- For Mixed L/H based architectures:

For both technology nodes, Dadda multiplier based designs have lower power consumption than the corresponding Wallace tree multiplier based designs, as given below.

At 45 nm: For $4 \times 4$-bit, they consume $35.1 \%$ to $35.8 \%$ less power, whereas for $8 \times 8$-bit, they consume $14.9 \%$ to $21.5 \%$ less power as compared to Wallace tree multipliers.

At 180 nm: For $4 \times 4$-bit, they consume $12.6 \%$ to $12.9 \%$ less power, whereas for $8 \times 8$-bit, they consume $1.7 \%$ to $4.1 \%$ less power.

- At 45 nm, for both sizes ($4 \times 4$-bit and $8 \times 8$-bit), among all eight (LOC as well as Mixed L/H based) proposed multipliers implemented, Dadda-45-L-RCA is the most power efficient architecture.

It consumes $42.6 \%$ (for $4 \times 4$-bit) and $86.5 \%$ (for $8 \times 8$-bit) less power than the highest power consuming design, which is Wallace tree-45-L/H-HCA.

- At 180 nm, for both sizes, Dadda-180-L-RCA is the most power efficient architecture among all eight proposed designs.

It consumes $56.6 \%$ (for $4 \times 4$-bit) and $60.8 \%$ (for $8 \times 8$-bit) less power than the highest power consuming design, which is Wallace tree-180-L/H-HCA.

## (ii) Propagation Delay

- For LOC based architectures:

Dadda multiplier (based designs) has the least propagation delay at both technology nodes in comparison to corresponding Wallace tree multiplier designs as given below.

At 45 nm : For 4x4-bit, Dadda multipliers give lesser propagation delay varying from $19.9 \%$ to $28.9 \%$. Whereas for $8 \times 8$-bit, they give lesser propagation delay varying from $7.3 \%$ to $10.5 \%$ as compared to Wallace tree multipliers.

At 180 nm : For 4x4-bit, Dadda multipliers give lesser propagation delay varying from $0.9 \%$ to $2.2 \%$. Whereas for 8 x 8 -bit, they give lesser propagation delay varying from $0.8 \%$ to $1.1 \%$ as compared to Wallace tree multipliers.

- For Mixed L/H based architectures:

Dadda multiplier based designs have the least propagation delay at both technology nodes in comparison to the corresponding Wallace tree multiplier designs, as given below.

At 45 nm: For $4 \times 4$-bit, Dadda multipliers give lesser propagation delay varying from $19.9 \%$ to $21.3 \%$. Whereas, for $8 \times 8$-bit, they give lesser propagation delay varying from $17.2 \%$ to $19.5 \%$ as compared to Wallace tree multipliers.

At 180 nm : For 4x4-bit, the Dadda multipliers give lesser propagation delay varying from $2.1 \%$ to $9.1 \%$. Whereas, for 8 x8-bit, they give lesser propagation delay varying from $1.1 \%$ to $1.6 \%$ as compared to Wallace tree multipliers.

- Among all eight (LOC as well as Mixed L/H based) proposed multipliers implemented at 45 nm, Dadda-45-L/H-HCA has the least propagation delay. It has $71.7 \%$ (for $4 \times 4$-bit) and $88.2 \%$ (for $8 \times 8$-bit) lesser propagation delay in comparison to the most delay intensive design, which is Wallace tree-45-L/H-HCA.
- Similarly, at 180 nm , Dadda-180-L/H-HCA has the least propagation delay. It has $71.3 \%$ (for $4 \times 4$-bit) and $69.1 \%$ (for $8 \times 8$-bit) less propagation delay in comparison to most delay intensive design which is Wallace tree-180-L/H-HCA.


## (iii) Power-Delay Product

- For LOC based architectures:

At 45 nm : For $4 \times 4$-bit, the Dadda multipliers give lesser power-delay product varying from $44.7 \%$ to $49.2 \%$. Whereas, for $8 x 8$-bit, they give lesser power-delay product varying from $28.9 \%$ to $32.1 \%$ as compared to corresponding Wallace tree multipliers.

At 180 nm : For 4x4-bit, the Dadda multipliers give lesser power-delay product varying from $6.2 \%$ to $9.2 \%$. Whereas, for $8 \times 8$-bit, they give lesser power-delay product varying from $2.5 \%$ to $2.6 \%$ as compared to corresponding Wallace tree multipliers.

- For Mixed L/H based architectures

At 45 nm: For $4 \times 4$-bit, the Dadda multipliers give lesser power-delay product varying from $48.6 \%$ to $49 \%$. Whereas, for $8 \times 8$-bit, they give lesser power-delay product varying from $31.5 \%$ to $35.1 \%$ as compared to corresponding Wallace tree multipliers.

At 180 nm: For $4 \times 4$-bit, the Dadda multipliers give lesser power-delay product varying from $14.9 \%$ to $20.6 \%$. Whereas, for $8 \times 8$-bit, they give lesser power-delay product varying from $2.8 \%$ to $5.6 \%$ as compared to corresponding Wallace tree multipliers.

- Among all eight (LOC as well as Mixed L/H based) proposed multipliers implemented at 45 nm, Dadda-45-L/H-HCA has the least power-delay product. It has $49 \%$ (for $4 \times 4$-bit) and $31.5 \%$ (for $8 \times 8$-bit) lesser power-delay product in comparison to Wallace tree-45-L/H-HCA, which has the highest power-delay product.
- Similarly, at 180 nm , Dadda-180-L/H-HCA has the least power-delay product. It has $20.6 \%$ (for $4 \times 4$-bit), and $2.5 \%$ (for 8 x 8 -bit) lesser power-delay product in comparison to Wallace tree-180-L/H-HCA having highest power-delay product.
- Static-CMOS logic and HYB-TG design style are most power-delay product efficient design styles for LOC and Mixed L/H based Wallace tree and Dadda multipliers respectively.


## (iv) Effect of Technology Scaling:

At same frequency of operation, at 45 nm , the propagation delay is smaller, power consumption is higher (due to increased leakage current since supply voltage is kept same at 0.4 V ) and power-delay product is smaller for all different implemented combinations of multipliers in comparison to 180 nm technology.

### 6.3. CONCLUSIONS ON SRAM CELLS

The overall results of the SRAM cells show the following conclusions at 45 nm and 180 nm technology nodes.

## (i) Comparison of Results of Proposed Designs among themselves at 45 nm Technology

- Impact of Design Configuration on RSNM of SRAM Cell: Among all proposed SRAM cells, MPT8T has the highest RSNM of 120 mV. The increased RSNM values in MPT8T and M8T are due to the addition of an extra transistor in parallel with the access transistor, which helps in maintaining a proper logic '0' at the internal storage node. This also overcomes the voltage degradation at the BLB/BL nodes, i.e. it increases the node voltage for a stored logic '1'.
- Impact of Design Configuration on WSNM of SRAM Cell: Among all proposed SRAM cells, MI-12T has the highest WSNM of 226 mV. This is due to the removal of positive feedback in the bi-stable element during the write operation, which results from stacking in the pull-down path.
- Impact of Design Configuration on Average Read Delay of SRAM Cell: Among all proposed SRAM cells, M8T has the minimum read delay of 0.74 ns. This is due to the extra current driven by the transistor added in parallel with the access transistor, which decreases the charging/discharging time of the internal node of the SRAM cell and hence the read delay. The variability of the proposed cells ranges from 0.04 to 0.40 at a supply voltage of 0.4 V, with M7T having the least value, making it less sensitive to process variation. Read delay and its variability reduce with reduction in power supply voltage.
- Impact of Design Configuration on Average Write Delay of SRAM Cell: Among all proposed SRAM cells, M8T has the minimum write delay of 0.71 ns. This is due to the extra current driven by the transistor added in parallel with the access transistor, which decreases the charging/discharging time of the internal node of the SRAM cell and hence the write delay. The variability of the proposed cells ranges from 0.016 to 0.19 at a supply voltage of 0.4 V, with MPT8T having the least value, making it less sensitive to process variation. Write delay and its variability reduce with reduction in power supply voltage.
- Impact of Design Configuration on Leakage Power Consumption in hold mode: Among all proposed SRAM cells, MPT8T has the least leakage power consumption (in hold mode) of 2.31 nW. This is due to proper logic '0' and logic '1' levels at the internal storage nodes (Q and QB), which fully turn OFF the pull-up and pull-down transistors in the bi-stable element of MPT8T, thereby reducing the leakage current during hold mode.


## (ii) Comparison of all proposed SRAM Cells with C6T at 45 nm

The comparative analysis shows that the M8T, MPT8T and MI-12T designs have low leakage power consumption along with other improved design parameters such as high read stability, high write ability, and fast read \& write operation. Thus, these designs can be an attractive choice for low power applications in scaled technology. M7T and M9T also show these improved design parameters, but with higher leakage power consumption as compared to C6T.

## (iii) Comparison of results of referenced designs at 180 nm technology among themselves

In comparison with other published low power references (in Chapter 5) with same transistor count, the 8T SRAM cell [13] gives highest RSNM, \& highest WSNM, 5T SRAM cell [12] gives lowest read delay, 9T SRAM cell [10] has lowest write delay and 6T SRAM cell [12] gives least leakage power consumption in hold mode.

## (iv) Effect of Technology Scaling

At the 45 nm technology node, all proposed designs show a $42.7 \%$ average increment in RSNM, a $65.8 \%$ average increment in WSNM, $85.6 \% / 24.7 \%$ average reductions in read / write delay respectively, and a $24.9 \%$ increment in leakage power consumption in hold mode as compared to the corresponding SRAM cells implemented at 180 nm. For a few cells such as 7T, 8T, and 12T, leakage power consumption reduces due to the different cell design.

## (v) Modeling Analysis

At 0.4 V supply voltage, analytical equations, obtained under steady state condition for read, write and hold operation, give WSNM, RSNM, and Hold SNM within ($5.4 \%$ to $11.7 \%$), ($6.9 \%$ to $24.5 \%$), and ($3.84 \%$ to $3.9 \%$) respectively as compared to simulated values.

## (vi) Regression Analysis

At 0.4 V supply voltage, simplified equations which are obtained through multiple regression analysis (to obtain the impact of varying CR, PR, and $\mathrm{V}_{\mathrm{DD}}$ on read, write and hold operation), give WSNM, RSNM, and Hold SNM within ($14.7 \%$ to $41.7 \%$), ($10.4 \%$ to $34.8 \%$), and ($5 \%$ to $12.8 \%$) respectively as compared to simulated values. These equations can be used to get a first-order estimate of the impact of varying CR, PR, and $\mathrm{V}_{\mathrm{DD}}$ on read, write and hold stability.
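As a rough illustration of how a multiple regression on CR, PR, and $\mathrm{V}_{\mathrm{DD}}$ can be set up, the sketch below fits a model by ordinary least squares using the normal equations. The linear form $y = b_0 + b_1 \cdot CR + b_2 \cdot PR + b_3 \cdot V_{DD}$, the function names, and the synthetic data are all assumptions made for this example; the regression equations actually used in this work may have a different form.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_linear_model(samples, targets):
    """Least-squares fit of y = b0 + b1*CR + b2*PR + b3*VDD (assumed form)."""
    rows = [[1.0, cr, pr, vdd] for cr, pr, vdd in samples]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Synthetic data generated from made-up coefficients (not thesis results).
true_coeffs = [10.0, 5.0, -3.0, 200.0]
samples = [(cr, pr, vdd)
           for cr in (1.0, 1.5, 2.0)
           for pr in (1.0, 2.0)
           for vdd in (0.3, 0.4, 0.5)]
targets = [true_coeffs[0] + true_coeffs[1] * cr + true_coeffs[2] * pr
           + true_coeffs[3] * vdd for cr, pr, vdd in samples]

coeffs = fit_linear_model(samples, targets)
print([round(c, 6) for c in coeffs])
```

Since the synthetic targets are generated without noise, the fitted coefficients should match the made-up ones up to floating-point error. With simulated SNM data the fit would leave a residual, which is what the error ranges quoted above measure.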

### 6.4. FUTURE SCOPE OF THE WORK

The present thesis work can be further augmented by doing following additional studies for adder, multiplier and SRAM cell architectures in sub-threshold region:

Adder: The study can be further improved by including the techniques of transistor sizing in the critical delay path and transistor reordering to increase the speed of the adders in the sub-threshold region. Also, scaling of the supply voltage and other scaled technology libraries can be used to evaluate the metrics of all the architectures for different operand sizes in terms of figures of merit, to validate their wide applicability.

Multiplier: Exploration of multiplier designs can be further done for large operand sizes. The different multiplier architectures in the sub-threshold region can also be implemented with other parallel prefix adders (Brent-Kung, Ladner-Fischer etc.) which supplement the logarithmic delay of the compression tree for future work and analysis. Also, the impact of scaling of the supply voltage along with other scaled technology libraries can be studied to evaluate the performance metrics of all the architectures.

SRAM cells: The extensive exploration of SRAM cells can be built into a software tool that can be used to get the most suited SRAM architectural choice for a given technology node. The simulation results can be tested for their accuracy by fabricating the proposed M7T, MPT8T, M8T, M9T and MI-12T SRAM cells and obtaining measured results.

Further, to find out the accurate value of the WSNM theoretically, the analytical expressions during write operation can be more accurately modeled. Also, analysis of read stability and write ability can be made more accurate by finding out the dynamic stability metrics which capture the inherent dynamic behavior of SRAM cell.

## APPENDIX-A <br> SIMULATION RESULTS FOR POWER CONSUMPTION OF 8-BIT HCA USING STATIC-CMOS LOGIC AT 45 NM

| Transient Simulation Time | $1 \mu \mathrm{s}$ | $10 \mu \mathrm{s}$ | $20 \mu \mathrm{s}$ | $25 \mu \mathrm{s}$ | $30 \mu \mathrm{s}$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Power ($\mu \mathrm{W}$) | 0.012481 | 0.012483 | 0.012484 | 0.012487 | 0.012489 |

The power consumption value changes only in the sixth decimal place (the fifth significant digit), which indicates that increasing the simulation time has a negligible impact on the power consumption value.
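The size of this variation can be checked directly from the table. The short Python sketch below uses the five tabulated power values; the helper name `relative_spread_percent` is introduced here only for illustration.

```python
# Power values (in microwatts) copied from the Appendix-A table.
sim_time_us = [1, 10, 20, 25, 30]
power_uw = [0.012481, 0.012483, 0.012484, 0.012487, 0.012489]


def relative_spread_percent(values):
    """Return (max - min) / min as a percentage."""
    lowest, highest = min(values), max(values)
    return (highest - lowest) / lowest * 100.0


spread = relative_spread_percent(power_uw)
print(f"Spread across simulation times: {spread:.3f} %")
```

The spread between the smallest and largest value is well under $0.1 \%$, which is consistent with the statement that the simulation time has no practical effect on the reported power.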

## APPENDIX-B <br> TRANSIENT WAVEFORMS OF Q AND QB DURING READ, WRITE AND HOLD MODES FOR C6T AND PROPOSED SRAM CELLS (MPT8T, M8T, M9T, MI-12T)

## I. For C6T:

C6T Write State Analysis

## II. For MPT8T:

## III. For M8T:

## IV. For M9T:

M9T Write State Analysis

## V. For MI-12T:


## APPENDIX-C <br> LEAKAGE POWER CONSUMPTION OF C6T AND PROPOSED SRAM CELLS

In deep-submicron technology, leakage current is a major concern in memory cells and becomes a prominent contributor to the power consumption of an SRAM cell.

In an SRAM cell, the total leakage current ($\mathrm{I}_{\mathrm{LEAK}}$) is the combination of

- OFF state leakage ($\mathrm{I}_{\mathrm{SUB}}$),
- Gate leakage ($\mathrm{I}_{\mathrm{G}}$), and
- Junction leakage ($\mathrm{I}_{\mathrm{JN}}$) through various devices (ignoring minor leakages such as $\mathrm{I}_{\mathrm{GIDL}}$ and $\mathrm{I}_{\mathrm{punchthrough}}$).

All the equations are derived from the leakage current expressions given in [33]. The major leakage components of C6T and the proposed cells during hold mode are discussed below. For each cell, the arrow symbols only mark the leakage component in the NMOS or PMOS device, irrespective of the direction of the current component.

## I. Leakage Power Consumption of C6T

Figure 1 shows the major leakage components of C6T during hold mode.


Figure 1: Major leakage components of C6T during hold mode

The major leakage components ( $\mathrm{I}_{\mathrm{SUB}}, \mathrm{I}_{\mathrm{G}}$ and $\mathrm{I}_{\mathrm{JN}}$ ) are given by the following equations (C5.1-C5.4).

$$
\begin{equation*}
\mathrm{I}_{\mathrm{SUB}, \mathrm{C6T}}=\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN1}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MP2}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN6}}} \tag{C5.1}
\end{equation*}
$$

$$
\begin{equation*}
\mathrm{I}_{\mathrm{JN}, \mathrm{C6T}}=\mathrm{I}_{\mathrm{JND}_{\mathrm{MP2}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN1}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN6}}}+\mathrm{I}_{\mathrm{JNS}_{\mathrm{MN5}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN5}}} \tag{C5.2}
\end{equation*}
$$

$$
\begin{equation*}
\mathrm{I}_{\mathrm{LEAK}, \mathrm{C6T}}=\mathrm{I}_{\mathrm{SUB}, \mathrm{C6T}}+\mathrm{I}_{\mathrm{JN}, \mathrm{C6T}}+\mathrm{I}_{\mathrm{G}, \mathrm{C6T}} \tag{C5.4}
\end{equation*}
$$

The overall leakage power of the cell depends on the total value of $\mathrm{I}_{\text {LEAK }}$ given in Table C5.1.
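
As a hedged illustration of this dependence, the hold-mode leakage power can be written to first order as the product of the supply voltage and the total leakage current. The symbol $P_{\mathrm{LEAK}}$ and the use of $V_{DD}$ as the hold-mode supply are assumptions introduced here for illustration and are not notation taken from [33]:

$$
P_{\mathrm{LEAK}} \approx V_{DD} \cdot \mathrm{I}_{\mathrm{LEAK}}=V_{DD}\left(\mathrm{I}_{\mathrm{SUB}}+\mathrm{I}_{\mathrm{G}}+\mathrm{I}_{\mathrm{JN}}\right)
$$

Under this approximation, the number of leakage components alone does not fix the leakage power; what matters is the magnitude of each component at the given supply voltage.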

## II. Leakage Power Consumption of MPT8T

Figure 2 shows the major leakage components of MPT8T during hold mode.


Figure 2: Major leakage components of MPT8T during hold mode

The major leakage components ( $\mathrm{I}_{\mathrm{SUB}}, \mathrm{I}_{\mathrm{G}}$ and $\mathrm{I}_{\mathrm{JN}}$ ) of MPT8T are given by the following equations (C5.5-C5.8).

$$
\begin{equation*}
\mathrm{I}_{\mathrm{SUB}, \mathrm{MPT8T}}=\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN1}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MP2}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN6}}} \tag{C5.5}
\end{equation*}
$$

$$
\begin{align*}
\mathrm{I}_{\mathrm{G}, \mathrm{MPT8T}}= & \mathrm{I}_{\mathrm{GD}_{\mathrm{MP2}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MN2}}}+\mathrm{I}_{\mathrm{GS}_{\mathrm{MN2}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MP1}}}+\mathrm{I}_{\mathrm{GS}_{\mathrm{MP1}}} \\
& +\mathrm{I}_{\mathrm{GD}_{\mathrm{MN1}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MP3}}}+\mathrm{I}_{\mathrm{GS}_{\mathrm{MP3}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MP4}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MN3}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MN4}}} \tag{C5.7}
\end{align*}
$$

$$
\begin{equation*}
\mathrm{I}_{\mathrm{LEAK}, \mathrm{MPT8T}}=\mathrm{I}_{\mathrm{SUB}, \mathrm{MPT8T}}+\mathrm{I}_{\mathrm{JN}, \mathrm{MPT8T}}+\mathrm{I}_{\mathrm{G}, \mathrm{MPT8T}} \tag{C5.8}
\end{equation*}
$$

The total leakage current components for MPT8T are found to be more than the total leakage current components for C6T as shown in Table C5.1. But it consumes less leakage power due to the following reasons:

In hold mode, assume the node QB is at logic '1' and node Q is at logic '0' (see Figure 5.11). Under this condition, the modified pass transistor based access pairs (MP3-MN3) and (MP4-MN4) are switched OFF, as WL is connected to logic '1'. The gate terminals of the additional NMOS transistors (MN3-MN4) are connected to logic '0'; hence they are in the OFF state and provide a very high resistance path. Therefore, the leakage current components of these transistors have negligible value, and the low off-state leakage current of these transistors prevents leakage from the nodes, which precisely controls the leakage current of the overall MPT8T.

## III. Leakage Power Consumption of M8T

Figure 3 shows the major leakage components of M8T during hold mode.


Figure 3: Major leakage components of M8T during hold mode

All the major leakage components ( $\mathrm{I}_{\mathrm{SUB}}, \mathrm{I}_{\mathrm{G}}$ and $\mathrm{I}_{\mathrm{JN}}$ ) of M8T during hold mode are expressed in equations (C5.9-C5.12).

$$
\begin{equation*}
\mathrm{I}_{\mathrm{SUB}, \mathrm{M8T}}=\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN6}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MP2}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN1}}} \tag{C5.9}
\end{equation*}
$$


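$$
\begin{equation*}
\mathrm{I}_{\mathrm{JN}, \mathrm{M8T}}=\cdots+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN5}}}+\mathrm{I}_{\mathrm{JNS}_{\mathrm{MN4}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN4}}} \tag{C5.10}
\end{equation*}
$$

$$
\begin{equation*}
\mathrm{I}_{\mathrm{G}, \mathrm{M8T}}=\cdots+\mathrm{I}_{\mathrm{GS}_{\mathrm{MN4}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MN4}}}+\mathrm{I}_{\mathrm{GS}_{\mathrm{MN5}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MN5}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MN6}}} \tag{C5.11}
\end{equation*}
$$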

The total leakage current components for M8T are found to be more than the total leakage current components for C6T as shown in Table C5.1. But it consumes less leakage power due to the following reasons:

In hold mode, assume the node QB is at logic '1' and node Q is at logic '0' (see Figure 5.15). Under this condition, the additional transistor pair (MN3-MN4) is switched OFF, as the gate terminals of these additional transistors are connected to logic '0'; in the OFF state they provide a very high resistance path. Therefore, the current components ( $\mathrm{I}_{\mathrm{JND}_{\mathrm{MN3}}}, \mathrm{I}_{\mathrm{JND}_{\mathrm{MN4}}}, \mathrm{I}_{\mathrm{GD}_{\mathrm{MN3}}}, \mathrm{I}_{\mathrm{GD}_{\mathrm{MN4}}}, \mathrm{I}_{\mathrm{JNS}_{\mathrm{MN4}}}, \mathrm{I}_{\mathrm{GS}_{\mathrm{MN4}}}$ ) of MN3 and MN4 have negligible value, and the low off-state leakage current of these transistors prevents leakage from the nodes, which precisely controls the leakage current of the overall M8T.

## IV. Leakage Power Consumption of MI-12T

Figure 4 shows the major leakage components of MI-12T during hold mode.


Figure 4: Major leakage components of MI-12T during hold mode

The major leakage components ( $\mathrm{I}_{\mathrm{SUB}}, \mathrm{I}_{\mathrm{G}}$ and $\mathrm{I}_{\mathrm{JN}}$ ) of MI-12T are given by the following equations (C5.13-C5.16).

$$
\begin{equation*}
\mathrm{I}_{\mathrm{SUB}, \mathrm{MI-12T}}=\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN6}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MP2}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN3}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN1}}} \tag{C5.13}
\end{equation*}
$$

$$
\begin{equation*}
\mathrm{I}_{\mathrm{JN}, \mathrm{MI-12T}}=\cdots+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN5}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN7}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN8}}} \tag{C5.14}
\end{equation*}
$$

$$
\begin{equation*}
\mathrm{I}_{\mathrm{LEAK}, \mathrm{MI-12T}}=\mathrm{I}_{\mathrm{SUB}, \mathrm{MI-12T}}+\mathrm{I}_{\mathrm{JN}, \mathrm{MI-12T}}+\mathrm{I}_{\mathrm{G}, \mathrm{MI-12T}} \tag{C5.16}
\end{equation*}
$$

The total leakage current components for MI-12T are found to be more than the total leakage current components for C6T, as given in equations (C5.1-C5.4) and (C5.13-C5.16). However, it consumes less leakage power for the following reasons:

- In hold mode, assume the node QB is at logic '1' and node Q is at logic '0' (see Figure 5.19). Under this condition, the additional transistor pair (MP3-MN3) is switched OFF and the other pair (MP4-MN4) is switched ON.
- The additional transistor MN3/MN4 is stacked over MN1/MN2. Due to the body bias effect, MN3/MN4 has an increased threshold voltage, which reduces the OFF-state leakage current through it. The stacking is formed within the inverters, (MN3-MN1) and (MN4-MN2), which reduces the leakage current of the overall MI-12T cell.
- Due to this stacking effect, all the current components $\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN3}}}$, $\mathrm{I}_{\mathrm{JNS}_{\mathrm{MN3}}}$, $\mathrm{I}_{\mathrm{JND}_{\mathrm{MN3}}}$, $\mathrm{I}_{\mathrm{GD}_{\mathrm{MP3}}}$, ... MN7, MP4-MN4-MN8 have negligible value.
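
The stacking argument above can be illustrated with the standard first-order textbook expressions for the body effect and the subthreshold current. These are generic forms given here as an assumption for illustration; they are not necessarily the exact expressions used in [33], and the symbols $V_{TH0}$, $\gamma$, $\phi_F$, $n$ and $v_T$ are introduced only for this sketch:

$$
\begin{align*}
V_{TH} & =V_{TH0}+\gamma\left(\sqrt{2 \phi_F+V_{SB}}-\sqrt{2 \phi_F}\right) \\
\mathrm{I}_{\mathrm{SUB}} & \propto \exp \left(\frac{V_{GS}-V_{TH}}{n\, v_T}\right)\left(1-\exp \left(-\frac{V_{DS}}{v_T}\right)\right), \quad v_T=\frac{k T}{q}
\end{align*}
$$

When two OFF transistors are stacked, the intermediate node settles slightly above ground. For the upper transistor this makes $V_{SB}>0$, which raises $V_{TH}$, and $V_{GS}<0$, so both effects reduce $\mathrm{I}_{\mathrm{SUB}}$ exponentially.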


## V. Leakage Power Consumption of M7T

Figure 5 shows the major leakage components of M7T during hold mode.


Figure 5: Major leakage components of M7T during hold mode

The major leakage components ( $\mathrm{I}_{\mathrm{SUB}}, \mathrm{I}_{\mathrm{G}}$ and $\mathrm{I}_{\mathrm{JN}}$ ) are given by the following equations (C5.17-C5.20).

$$
\begin{equation*}
\mathrm{I}_{\mathrm{SUB}, \mathrm{M7T}}=\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN1}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MP2}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN6}}} \tag{C5.17}
\end{equation*}
$$

$$
\begin{equation*}
\mathrm{I}_{\mathrm{G}, \mathrm{M7T}}=\cdots+\mathrm{I}_{\mathrm{GD}_{\mathrm{MN6}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MN5}}}+\mathrm{I}_{\mathrm{GS}_{\mathrm{MN5}}}+\mathrm{I}_{\mathrm{GD}_{\mathrm{MP3}}} \tag{C5.19}
\end{equation*}
$$

$$
\begin{equation*}
\mathrm{I}_{\mathrm{LEAK}, \mathrm{M7T}}=\mathrm{I}_{\mathrm{SUB}, \mathrm{M7T}}+\mathrm{I}_{\mathrm{JN}, \mathrm{M7T}}+\mathrm{I}_{\mathrm{G}, \mathrm{M7T}} \tag{C5.20}
\end{equation*}
$$

The total leakage current components for M7T are found to be more than the total leakage current components for C6T as shown in Table C5.1. Therefore, M7T consumes more leakage power than C6T.

## VI. Leakage Power Consumption of M9T

Figure 6 shows the major leakage components of M9T during hold mode.


Figure 6: Major leakage components of M9T during hold mode

The major leakage components ( $\mathrm{I}_{\mathrm{SUB}}, \mathrm{I}_{\mathrm{G}}$ and $\mathrm{I}_{\mathrm{JN}}$ ) of M9T are given by the following equations (C5.21-C5.24).

$$
\begin{equation*}
\mathrm{I}_{\mathrm{SUB}, \mathrm{M9T}}=\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN1}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MP2}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MN5}}}+\mathrm{I}_{\mathrm{SUB}_{\mathrm{MP5}}} \tag{C5.21}
\end{equation*}
$$

$$
\begin{align*}
\mathrm{I}_{\mathrm{JN}, \mathrm{M9T}}= & \mathrm{I}_{\mathrm{JND}_{\mathrm{MN1}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MP2}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN6}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN5}}}+\mathrm{I}_{\mathrm{JNS}_{\mathrm{MN5}}} \\
& +\mathrm{I}_{\mathrm{JND}_{\mathrm{MN3}}}+\mathrm{I}_{\mathrm{JNS}_{\mathrm{MN3}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MN4}}}+\mathrm{I}_{\mathrm{JND}_{\mathrm{MP5}}} \tag{C5.22}
\end{align*}
$$

$$
\begin{equation*}
\mathrm{I}_{\mathrm{LEAK}, \mathrm{M9T}}=\mathrm{I}_{\mathrm{SUB}, \mathrm{M9T}}+\mathrm{I}_{\mathrm{JN}, \mathrm{M9T}}+\mathrm{I}_{\mathrm{G}, \mathrm{M9T}} \tag{C5.24}
\end{equation*}
$$

The total leakage current components for M9T are found to be more than the total leakage current components for C6T as shown in Table C5.1. Therefore, M9T consumes more leakage power than C6T.

The total number of effective leakage current components of C6T and proposed cells during hold mode along with simulated total current value is shown in Table C5.1.

It shows that the leakage current is highest for M9T, which also has the highest number of leakage current components.

Table C5.1: Leakage current components of conventional and proposed SRAM cells

| Module Name | $\mathrm{I}_{\mathrm{SUB}}$ <br> (i) | $\mathrm{I}_{\mathrm{JN}}$ <br> (ii) | $\mathrm{I}_{\mathrm{G}}$ <br> (iii) | $\mathrm{I}_{\text {leakage}}$ total <br> No. of leakage current components <br> (i) + (ii) + (iii) |
| :---: | :---: | :---: | :---: | :---: |
| C6T | 3 | 5 | 9 | 17 |
| M7T | 3 | 6 | 10 | 19 |
| MPT8T | 3 | 5 | 9 | 17 |
| M8T | 3 | 5 | 9 | 17 |
| M9T | 4 | 8 | 13 | 25 |
| MI-12T | 3 | 5 | 9 | 17 |
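
The totals in Table C5.1 can be cross-checked with a short script. This is a minimal sketch: the counts are copied from the table, and the variable names (`components`, `totals`, `most`) are chosen here for illustration only.

```python
# Component counts copied from Table C5.1: (I_SUB, I_JN, I_G) per cell.
components = {
    "C6T": (3, 5, 9),
    "M7T": (3, 6, 10),
    "MPT8T": (3, 5, 9),
    "M8T": (3, 5, 9),
    "M9T": (4, 8, 13),
    "MI-12T": (3, 5, 9),
}

# Total number of leakage components per cell: (i) + (ii) + (iii).
totals = {cell: sum(counts) for cell, counts in components.items()}

# Cell with the largest number of leakage components.
most = max(totals, key=totals.get)

for cell, total in totals.items():
    print(f"{cell:7s} {total}")
print(f"Most leakage components: {most} ({totals[most]})")
```

The script only counts components; it says nothing about their magnitudes, which is why cells with the same count can still differ in leakage power.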

## APPENDIX D

## PULL-UP, PULL-DOWN AND ACCESS TRANSISTOR CURRENT DURING READ, WRITE, AND HOLD MODE FOR C6T AND PROPOSED SRAM CELLS AT 45 NM

## I. FOR C6T:

| Transistor Name | Current in <br> Hold '1' mode <br> (A) | Current in <br> Read '0' mode <br> (A) | Current in <br> Write '0' mode <br> (A) |
| :---: | :---: | :---: | :---: |
| Pull-up transistor <br> (MP2) | 8.87E-9 | 41.7E-12 | 1.68E-6 |
| Pull-down transistor <br> (MN2) | 8.87E-9 | 2.01E-6 | 3.79E-12 |
| Access transistor <br> (MN6) | 4.77E-24 | 2.01E-6 | 1.68E-6 |

## II. FOR M7T:

| Transistor Name | Current in <br> Hold '1' mode <br> (A) | Current in <br> Read '0' mode <br> (A) | Current in <br> Write '0' mode <br> (A) |
| :---: | :---: | :---: | :---: |
| Pull-up transistor <br> (MP2) | 9.78E-9 | 69.74E-12 | 5.77E-6 |
| Pull-down transistor <br> (MN2) | 9.78E-9 | 8.73E-6 | 8.99E-12 |
| Access transistor <br> (MN6) | 107E-24 | 8.73E-6 | 5.77E-6 |

## III. FOR MPT8T:

| Transistor Name | Current in <br> Hold '1' mode <br> (A) | Current in <br> Read '0' mode <br> (A) | Current in <br> Write '0' mode <br> (A) |
| :---: | :---: | :---: | :---: |
| Pull-up transistor <br> (MP2) | 4.11E-9 | 72.1E-12 | 6.79E-6 |
| Pull-down transistor <br> (MN2) | 4.11E-9 | 12.77E-6 | 10.7E-12 |
| Access transistor <br> (MP4) | 187E-24 | 12.77E-6 | 6.79E-6 |

## IV. FOR M8T:

| Transistor Name | Current in <br> Hold '1' mode <br> (A) | Current in <br> Read '0' mode <br> (A) | Current in <br> Write '0' mode <br> (A) |
| :---: | :---: | :---: | :---: |
| Pull-up transistor <br> (MP2) | 6.59E-9 | 81E-12 | 7.81E-6 |
| Pull-down transistor <br> (MN2) | 6.59E-9 | 14.39E-6 | 11.5E-12 |
| Access transistor <br> (MN6/MN4) | 254E-23 | 14.39E-6 | 7.81E-6 |

## V. FOR M9T:

| Transistor Name | Current in <br> Hold '1' mode <br> (A) | Current in <br> Read '0' mode <br> (A) | Current in <br> Write '0' mode <br> (A) |
| :---: | :---: | :---: | :---: |
| Pull-up transistor <br> (MP2) | 10.11E-9 | 55.77E-12 | 3.44E-6 |
| Pull-down transistor <br> (MN2) | 10.11E-9 | 6.47E-6 | 5.78E-12 |
| Access transistor <br> (MN6/MN4) | 99E-24 | 6.47E-6 | 3.44E-6 |

## VI. FOR MI-12T:

| Transistor Name | Current in <br> Hold '1' mode <br> (A) | Current in <br> Read '0' mode <br> (A) | Current in <br> Write '0' mode <br> (A) |
| :---: | :---: | :---: | :---: |
| Pull-up transistor <br> (MP2) | 7.09E-9 | 71.1E-12 | 6.11E-6 |
| Pull-down transistor <br> (MN2) | 7.09E-9 | 10.74E-6 | 9.98E-12 |
| Access transistor <br> (MN6) | 154E-24 | 10.74E-6 | 6.11E-6 |
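
To compare the hold-mode currents of the six cells at a glance, the values in the Hold '1' column can be expressed relative to C6T. The following is a minimal sketch: the currents are copied from the tables of this appendix (the pull-up and pull-down values coincide in hold mode), while the helper `percent_change` and the choice of C6T as the baseline are assumptions made for illustration.

```python
# Hold '1' mode current in amperes through the pull-up/pull-down
# transistors, copied from the Appendix D tables.
hold_current_a = {
    "C6T": 8.87e-9,
    "M7T": 9.78e-9,
    "MPT8T": 4.11e-9,
    "M8T": 6.59e-9,
    "M9T": 10.11e-9,
    "MI-12T": 7.09e-9,
}


def percent_change(value, reference):
    """Signed percentage change of value relative to reference."""
    return (value - reference) / reference * 100.0


baseline = hold_current_a["C6T"]
change_vs_c6t = {
    cell: percent_change(current, baseline)
    for cell, current in hold_current_a.items()
    if cell != "C6T"
}

# Print the cells from the largest reduction to the largest increase.
for cell, change in sorted(change_vs_c6t.items(), key=lambda kv: kv[1]):
    print(f"{cell:7s} {change:+6.1f} %")
```

With these numbers, MPT8T, M8T and MI-12T come out below the C6T baseline while M7T and M9T come out above it, which agrees with the trend described in Appendix C.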

## APPENDIX E

TABLE OF PERCENTAGE CHANGE IN PERFORMANCE METRICS OF SRAM CELLS (WITH SAME TRANSISTOR COUNT) AT 45 NM IN COMPARISON TO 180 NM TECHNOLOGY

| No. of <br> transistors | % Increment <br> in RSNM <br> at 45 nm | % Increment <br> in WSNM <br> at 45 nm | % Decrement <br> in Read delay <br> at 45 nm | % Decrement <br> in Write delay <br> at 45 nm | % Increment <br> in Power <br> consumption <br> at 45 nm |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 6T | 45.6% | 47.0% | 53.2% | 45.8% | 97.3% |
| 7T | 55.5% | 69.8% | 95.4% | 97.2% | 1.64% <br> (less) |
| 8T | 32.2% | 62.2% | 74.7% | 72.1% | 19.2% <br> (less) |
| 9T | 45.5% | 72.0% | 72.1% | 89.1% <br> (more) | 37.2% |
| 12T | 41.1% | 69.6% | 95.1% | 93.4% | 22.9% <br> (less) |

## APPENDIX F <br> ESTIMATION OF WRITE ABILITY AND READ STABILITY THROUGH OBSERVATION

## - At 45 nm Technology:

## WRITE ABILITY

For robust write operation, performance order of proposed cells from best to worst is given below:


WTI in increasing order:


WTV in increasing order:


As WSNM, WTI, and WTV are independent performance metrics for write operation, we have categorized their value as high $(\mathrm{H})$, moderate $(\mathrm{M})$, and low (L) based on their position from left to right. For MPT8T, WTI value is $3 \times$ more than MI-12T.

Table F1 shows the categorization of write ability metrics of proposed SRAM cells.

Table F1: Write ability metrics of proposed SRAM cells

| SRAM cells | WSNM | WTI | WTV |
| :---: | :---: | :---: | :---: |
| MPT8T | H | M | H |
| M8T | M | H | M |
| M7T | L | M | H |
| MI-12T | H | H | L |
| M9T | M | L | H |
| C6T | L | L | H |

Write ability of SRAM cell: It is observed from the above table that none of the proposed cells has the best value for all three metrics related to write ability (i.e. highest WSNM, least WTI, and least WTV). MI-12T has the highest WSNM and a lower WTI, but the highest WTV. Similarly, C6T has the least WSNM and WTI, but the highest WTV.

Thus, the comparison is made on the basis of the values of two metrics out of three. On this basis, write ability decreases in the order MI-12T, MPT8T, M8T, M7T = M9T, C6T, as per Table F1.

## READ STABILITY

For stable read operation, performance order of proposed cells from best to worst is given below:


As RSNM, SINM, and SVNM are independent performance metrics for read operation, we have categorized their value as high $(\mathrm{H})$, moderate $(\mathrm{M})$, and low (L) based on their position from left to right. For SINM, moderate category is split into moderate high (MH) and moderate low (ML) due to the fact that M9T has SINM $3 \times$ more than M8T.

Table F2 shows the categorization of read stability metrics of proposed SRAM cells.

Table F2: Read stability metrics of proposed SRAM cells

| SRAM cells | RSNM | SINM | SVNM |
| :---: | :---: | :---: | :---: |
| MPT8T | H | L | H |
| M8T | H | ML | M |
| M7T | M | H | H |
| MI-12T | M | H | L |
| M9T | L | MH | H |
| C6T | L | L | H |

Read stability of SRAM cell: It is observed that none of the proposed cells has the best value for all three metrics related to read stability (i.e. highest RSNM, highest SINM, and highest SVNM). MPT8T has the highest RSNM and SVNM, but a lower SINM. Similarly, C6T has the least RSNM and SINM, but a higher SVNM. Thus, the comparison is made on the basis of the values of two metrics out of three. On this basis, read stability decreases in the order M7T, MPT8T, M8T, M9T, MI-12T, C6T, as per Table F2.

## - At 180 nm Technology:

## WRITE ABILITY

For robust write operation, performance order of proposed cells (as per Figure 5.53) from best to worst is given below:


As WSNM, WTI, and WTV are independent performance metrics for write operation, we have categorized their value as high $(\mathrm{H})$, moderate (MH, M, ML), and low (L) based on their position from left to right. Table F3 shows the categorization of write ability metrics of referenced SRAM cells at 180 nm technology.

Table F3: Write ability metrics of SRAM cells at 180 nm

| SRAM cells | WSNM | WTI | WTV |
| :---: | :---: | :---: | :---: |
| 4T[12] | L | L | H |
| 5T[12] | L | ML | M |
| 6T[12] | MH | M | H |
| 7T[10] | ML | L | H |
| 8T[13] | H | H | L |
| 9T[10] | M | H | M |
| 12T[15] | H | MH | L |

Write ability of SRAM cell: Here again, the comparison is made on the basis of the values of two parameters out of three, as obtaining the best value for all three metrics related to write ability simultaneously is not always possible. On this basis, write ability decreases in the order 8T[13], 6T[12], 9T[10], 12T[15], 7T[10], 5T[12], 4T[12], as per Table F3.

## READ STABILITY

For stable read operation, performance order of proposed cells (as per Figure 5.53) from best to worst is given below:


As RSNM, SINM, and SVNM are independent performance metrics for read operation, we have categorized their value as high (H), moderate (MH, M, ML), and low (L) based on their position from left to right. Table F4 shows the categorization of read stability metrics of referenced SRAM cells at 180 nm technology.

Table F4: Read stability metrics of SRAM cells at 180 nm

| SRAM cells | RSNM | SINM | SVNM |
| :---: | :---: | :---: | :---: |
| 4T[12] | L | ML | L |
| 5T[12] | ML | MH | M |
| 6T[12] | L | L | L |
| 7T[10] | MH | L | ML |
| 8T[13] | H | H | MH |
| 9T[10] | M | M | H |
| 12T[15] | H | H | H |

Read stability of SRAM cell: It is observed that obtaining the best value for all three metrics related to read stability simultaneously (i.e. highest RSNM, highest SINM, and highest SVNM) is not always possible. The 12T[15] SRAM cell shows increased values of SVNM and SINM but a moderate value for RSNM. Similarly, 4T has the least RSNM, but a moderate SVNM and SINM. Thus, the comparison is made on the basis of the values of two parameters out of three. On this basis, read stability decreases in the order 12T[15], 8T[13], 9T[10], 5T[12], 7T[10], 4T[12], 6T[12], as per Table F4.

## LIST OF PUBLICATIONS

## Publication in Journals

1. Priya Gupta, Ishan Munje, Nikhil Kaswan, Anu Gupta and Abhijit Asati, "Effectiveness of body bias \& hybrid logic: An energy efficient approach to design adders in sub-threshold regime", International Journal of Circuits and Architecture Design, vol. 2(2), 2016, pp. 155-168.
2. Priya Gupta, Anu Gupta and Abhijit Asati, "Ultra Low Power MUX Based Compressors for Wallace tree and Dadda Multipliers in Sub-Threshold Regime" American Journal of Engineering and Applied Sciences, vol. 8(4), 2015, pp. 702-715. (Scopus Indexed)
3. Priya Gupta, Anu Gupta and Abhijit Asati, "Leakage Immune Modified Pass Transistor based 8T-SRAM Cell in Sub-threshold Region" International Journal of Reconfigurable Computing, vol. 2015, 2015, pp. 1-10. (Scopus Indexed)
4. Priya Gupta, Anu Gupta and Abhijit Asati, "Design and Implementation of n-bit Subthreshold Kogge Stone Adder with improved Power Delay Product" European Journal of Scientific Research, vol. 123(1), 2014, pp. 106-116. (Scopus Indexed)
5. Priya Gupta, Anu Gupta and Abhijit Asati, "Power-Aware Design of Logarithmic Prefix Adders in Sub-Threshold Regime: A Comparative Analysis", Elsevier's Procedia Computer Science, vol. 46, 2015, pp. 1401-1408. (Scopus Indexed)
6. Priya Gupta, Anu Gupta and Abhijit Asati, "A Review on Ultra Low Power Design Technique: Sub-threshold Logic" International Journal of Computer Science and Technology-IJCST, vol. 4(2), 2013, pp. 64-71.
7. Priya Gupta, Anu Gupta and Abhijit Asati, "Leakage Immune 9T-SRAM Cell in Subthreshold Region" Bulletin of Electrical Engineering and Informatics, vol. 5(1), 2016, pp. 126-132. (Scopus Indexed)

## Publication in Conferences:

1. Priya Gupta, Anu Gupta and Abhijit Asati, "Leakage Immune 9T-SRAM Cell in Subthreshold Region", National Conference on Advances in Microelectronics, Instrumentation and Communication (MICOM), BITS Pilani, 2015, pp. 126-132 (Best Paper Award).
2. Priya Gupta, Divya Samnani, Anu Gupta and Abhijit Asati "Design and ASIC Implementation of column compression Wallace tree/Dadda Multiplier in Sub-threshold regime" IEEE International Conference on Computing for Sustainable Global Development, 2015, pp. 680-683.
3. Priya Gupta, Akshay Kumar Sharma, Pratishtha Dehadray and Anu Gupta, "Design and Implementation of low power TG Full Adder design in subthreshold regime" IEEE International Conference on Intelligent Interactive Systems and Assistive Technologies, 2013, pp. 39-42.
4. Nikhil Kaswan, Ishan Munje, Yash Kothari, Priya Gupta and Anu Gupta, "Implementation of high speed energy efficient 4-bit binary CLA based incrementer/decrementer" International Conference on Advanced Electronic Systems (ICAES), 2013, pp. 103-107.
5. Priya Gupta, Ishan Munje, Nikhil Kaswan, Anu Gupta and Abhijit Asati, "Analysis \& Implementation of Ultra Low-Power 4-bit CLA in sub threshold regime" IEEE International Conference on Circuit, Power and Computing Technologies (ICCPCT), 2014, pp. 1074-1076.

## Publication in Book Chapter:

1. Priya Gupta, Anu Gupta and Abhijit Asati, "Detailed analysis of ultra-low power column compression Wallace tree and Dadda multiplier in sub-threshold regime" Advanced Research on Hybrid Intelligent Techniques and Applications, ACIR Book Series published by IGI Global, Nov. 2015, pp. 0-654.

## BRIEF BIOGRAPHY OF THE CANDIDATE

Priya Gupta received the B.Tech degree in Electronics and Communication from Bundelkhand University, Jhansi (U.P.) in 2009. She received the M.Tech degree in VLSI Design from Banasthali University, Rajasthan, in 2011. She started her career as an Assistant Professor at Manav Rachna International University (Faridabad). Later, she joined Birla Institute of Technology and Science (BITS), Pilani as a Research Scholar in January 2012. Her research interests include low power VLSI design.

## BRIEF BIOGRAPHY OF THE SUPERVISOR

Anu Gupta received the M.Sc (Hons) in Physics from Delhi University in 1988, and the M.E and Ph.D. degrees from Birla Institute of Technology and Science (BITS) Pilani in 1995 and 2003, respectively. In 1995, she joined BITS Pilani as Assistant Lecturer. She was designated as Assistant Professor (in 2003), Associate Professor (in 2010), and Professor (in 2016) at BITS Pilani. Her research interests include low power analog/digital/mixed signal design for FPGA/ASIC applications.

## BRIEF BIOGRAPHY OF THE CO-SUPERVISOR

Dr. Abhijit Asati completed his Ph.D. degree (in 2010) and M.E. degree (in 2002) at BITS Pilani. Prior to this, he completed his B.E. degree in electrical engineering at Amravati University in 1996. He has more than 15 years of experience in academia and is currently working as Assistant Professor in the Electrical and Electronics Engineering Department at BITS Pilani. He also served as a faculty member at the Visvesvaraya National Institute of Technology, Nagpur from 1997 to 1999. He has taught several courses such as Microelectronic Circuits, Analog and Digital VLSI Design, CAD for IC Design, and VLSI Test and Testability. Dr. Asati has also supervised around 15 B.E. and M.E. theses and is currently supervising 2 research scholars. He has contributed more than 20 journal papers and more than 25 conference papers in the area of microelectronics and VLSI design. His research contributions are in the areas of high performance VLSI data path design, microprocessor design, NBTI degradation issues and biometric identification. He also visited Cypress Semiconductor Limited, Bangalore, from May to July 2014 as a part of the Industry Immersion Program of BITS, Pilani.

