# Parametric Yield Management for 3D ICs: Models and Strategies for Improvement<sup>1,2</sup>

CESARE FERRI, Division of Engineering, Brown University, Providence, RI 02912, cesare\_ferri@brown.edu

SHERIEF REDA, Division of Engineering, Brown University, Providence, RI 02912, sherief\_reda@brown.edu

R. IRIS BAHAR, Division of Engineering, Brown University, Providence, RI 02912, iris\_bahar@brown.edu

## 1. INTRODUCTION AND MOTIVATION

Three-dimensional (3D) Integrated Circuits (ICs) with through-silicon vias are an exciting new technology that will increase the functionality, scale of integration, and performance of ICs [Banerjee et al. 2001; Topol et al. 2006]. Increasing the scale of integration is particularly attractive considering that optical lithography is approaching its natural limits as predicted by the International Technology Roadmap for Semiconductors (ITRS) [ITRS]. In 3D ICs, multiple die or device layers are integrated and interconnected with *Through-Silicon Vias* (TSVs), also known as *vertical interconnects*, as shown in Figure 1. The theoretical possibility of integrating tens of die in a 3D IC can usher in a new era of computational platforms with capabilities that are far beyond what is currently possible.

Yield loss is one of the greatest challenges that must be overcome in order to reap the full benefits of 3D integration [Banerjee et al. 2001; Topol et al. 2006; Patti 2006]. The fabrication yield of integrated circuits is divided into two categories: *functional yield* and

<sup>1</sup>This work is partially supported by a gift from Qualcomm Corporation.

<sup>2</sup>An earlier version of this paper appeared in the International Conference on Computer-Aided Design 2007. This journal version includes expanded, re-written coverage of the technical material; a new, generic formulation to improve the parametric yield for 3D ICs with an arbitrary number of die; and a new section (and experimental results) on leakage modeling and leakage-constrained parametric yield improvement.



Fig. 1. A simple illustration of a 3D IC with TSVs (via first, front to back integration). More die layers are assumed stacked but not drawn.

*parametric yield*. Functional yield is the number of fabricated functionally good die with no detected manufacturing defects. Parametric yield is the number of functional die meeting the required speed and power specifications [Rao et al. 2005]. Process variations are the main contributors to the loss of parametric yield [Bowman et al. 2002; Orshansky et al. 2002; Borkar et al. 2003; Raj et al. 2004; Datta et al. 2006; Marculescu and Talpes 2005; Bhardwaj et al. 2006]. The theoretical possibility of integrating tens of interconnected die is practically limited by the yield of the 3D fabrication process. Yield loss, whether due to functional or parametric mechanisms, can occur either during the fabrication of the individual planar die or during the process of integrating and interconnecting the different die together in the 3D IC stack [Banerjee et al. 2001; Reif et al. 2002; Davis et al. 2005; Topol et al. 2006].

The objective of this paper is to model the parametric yield of 3D ICs and to provide integration strategies that maximize it. More specifically, the contributions of this paper are as follows.

- (1) This work is the first to examine the impact of process variation on the performance and parametric yield of 3D ICs. We formulate the general problem of optimizing the parametric yield in 3D integration under the presence of general process variations.
- (2) Using a processor as a 3D IC example, we model the impact of process variations on both the CPU and L2 cache die, and then model the outcome of 3D integration on the overall performance of the processors.
- (3) This work is the first to propose 3D integration strategies that maximize the parametric yield of 3D ICs using a number of criteria including performance, leakage, and realistic price models.
- (4) Using extensive simulations and realistic binning strategies, we show that the proposed strategies increase the number of 3D processors in the fastest bins by almost 2×, while simultaneously reducing the number of slow processors by 29.4% in comparison to current integration techniques. Our strategy also leads to an improvement in 3D processor performance (as measured by MIPS) of up to 6.45% and an increase of about 12.48% in total sales revenue using up-to-date market price models.

The organization of this paper is as follows. Section 2 gives an overview of the necessary background for this work. In Section 3, we show how to model the performance of 3D ICs, using a 3D processor as an example, under the presence of process variations. In Section 4, we propose a number of integration strategies to maximize the parametric yield of 3D ICs. Section 5 gives an extensive set of experimental results supporting our methodology, and finally Section 6 summarizes the main results of this paper.

## 2. BACKGROUND AND MOTIVATION

In this section we give the technical background that provides the motivation and context for this work. In particular we provide an overview of the main 3D fabrication methods and discuss the main benefits of 3D integration. Then we discuss the impact of the fabrication method on the final yield. We also provide a brief overview of the impact of process variations in 2D ICs which ultimately leads to the variations in 3D ICs.

**Benefits of 3D Integration.** 3D ICs allow the creation of new systems that are currently not feasible by planar fabrication technology. Using a 3D approach allows the integration of dissimilar technologies to create highly-interconnected hybrid chips that include memory, logic, optical, RF, and analog components. Besides improved functionality and system scaling capabilities, 3D integration also promises to replace long 2-D interconnects by short TSV-based vertical interconnects [Banerjee et al. 2001; Davis et al. 2005; Topol et al. 2006]. Long (or global) 2-D interconnects have large delays [Davis et al. 2001] and require an increasing number of repeaters to appropriately buffer them [Saxena et al. 2004]. By transforming long 2-D interconnects into short TSVs with less capacitive and resistive loading, the system delay is improved [Topol et al. 2006]. Reducing long interconnect delay is especially important for processors as they continuously access memory subsystems. With 3D integration, processors can cut down the memory access time which improves the overall system performance. The quantification of this improvement has been the subject of a number of recent works [Zeng et al. 2005; Liu et al. 2005; Jacob et al. 2005; Tsai et al. 2005; Li et al. 2006; Xie et al. 2006].

**3D IC Fabrication Techniques.** There are four main manufacturing steps during 3D IC fabrication: thinning, alignment, bonding, and through-silicon via fabrication [Burns et al. 2006; Reif et al. 2002; Benkart et al. 2005; Beyne 2004; Davis et al. 2005; Scheiring 2004; Topol et al. 2006; Patti 2006]. *Thinning* involves removing the bulk silicon of a wafer, bringing it to a thickness of only a few tens of microns. *Alignment* places the different wafers or die of a 3D IC on top of each other, with their faces (or backs) aligned within some allowed tolerance. This tolerance imposes a limit on the size and pitch of the through-silicon vias. *Bonding* fuses the different wafers and/or die together. *Through-Silicon Via Fabrication* creates the vertical interconnects that are required for signal communication between the various parts of the design in the 3D IC. The via-creation step can be carried out either before or after the bonding step [Baliga 2004].

While failure in any of the four steps impacts the yield of 3D ICs, bonding is typically the most critical step [Scheiring 2004; Topol et al. 2006; Fukushima et al. 3035]. There are currently three different bonding technologies that offer different trade-offs in production yield, flexibility in die size, and production throughput [Burns et al. 2006; Reif et al. 2002; Benkart et al. 2005; Davis et al. 2005; Topol et al. 2006; Patti 2006].

- (1) *Wafer-to-wafer bonding*. This method results in the lowest yield of all bonding methods since it offers no way to filter out the bad die before integration. It also offers no flexibility in choosing how to "pick and place" the die to optimize both the parametric and functional yield. This method also requires all die to be of the same size. Its main advantages are high production throughput and high TSV density.
- (2) *Die-to-wafer bonding*. This method uses a substrate wafer and integrates diced die on top of it. It has a high yield because the good die can be identified and only those are used during integration. The method is flexible with different die sizes and has a good production throughput.
- (3) *Die-to-die bonding*. This method offers similar high yield and flexibility as in die-to-wafer bonding, but suffers from low production throughput.

Comparing the three bonding methods, it is typically concluded that die-to-wafer bonding is the most promising for future 3D integration [Scheiring 2004; Topol et al. 2006; Fukushima et al. 3035].

**Impact of Process Variations in 2D ICs.** Since the main components of a 3D IC are 2D die, it is natural that the same physical phenomena (manufacturing defects and process variations) that reduce the yield of 2D ICs will also impact 3D ICs. Thus it is important to understand these phenomena in 2D ICs before generalizing them to 3D ICs. Process variations change the electrical parameters of ICs (e.g., speed and leakage) from the original estimates of the designers. Process variations can heavily impact the frequency at which a processor can be clocked, the total power dissipation due to leakage current [Bowman et al. 2002; Borkar et al. 2003; Marculescu and Talpes 2005; Humenay et al. 2006; Kim et al. 2006], and the relative access time of the memory subsystem [Grossar et al. 2006; Meng and Joseph 2006]. Semiconductor foundries typically categorize chips according to their performance by speed binning them and assigning them to appropriate price points [Cory et al. 2003; Datta et al. 2006]. Improving the parametric yield is concerned with optimizing the values of the electrical parameters of chips in order to achieve good overall performance and profits [Rao et al. 2005; Datta et al. 2006]. Since process variations can also increase leakage current by up to $20\times$ [Borkar et al. 2003; Kim et al. 2006], it is crucial to minimize such leakage in chips that are embedded in low-power devices. In this case, the binning strategy could be driven entirely by leakage constraints, where high-leakage chips are essentially discarded.

Yield loss, whether functional or parametric, is considered to be one of the bottlenecks that need to be overcome to bring 3D technology from the lab to the fab and the marketplace [Baliga 2004]. Despite its importance, the problem of yield improvement of 3D ICs has received little attention in the literature. A number of recent efforts [Banerjee et al. 2001; Topol et al. 2006; Patti 2006] point to the importance of functional yield management of 3D ICs. This work is the first to investigate the problem of process variation modeling and parametric yield improvement in 3D ICs.

ACM Journal on Emerging Technologies in Computing Systems, Vol. V, No. N, April 2008.


## 3. PARAMETRIC YIELD MODELING

The objective of this section is to model or quantify the impact of process variations on the parametric yield of 3D ICs. Such modeling is more complex than in 2D ICs as different die that belong to the same 3D IC are fabricated on separate wafers and then integrated and interconnected with TSVs. Thus to model the impact of process variations on 3D ICs, it is necessary to first model the impact of process variation on the individual die and then model the interplay of the process variations on the different die composing a 3D IC.

In 2D ICs, process variations can be categorized as *intra-die variations*, which affect sub-parts of a single chip, and as *inter-die variations*, which affect the performance and power parameters of different chips [Bowman et al. 2002]. The overall impact of intra and inter-die variations is that they lead to considerable discrepancies in the performance of fabricated chips. The distribution of chips as a function of performance typically exhibits a Gaussian-like form [Orshansky et al. 2002; Bowman et al. 2002], where the mean and the standard deviation of the distribution are functions of the *intra-die* and *inter-die* variations respectively.

The final result of the interaction of process variations on different die depends on the functionality and the interface of different die in the 3D stack. Consider for example the 3D processor given in Figure 2. The upper wafer holds a set of die, say L2 cache or memory die. The die have been diced and tested, and the faulty ones have been identified (labeled with **F**) and the good ones have been labeled with their speed and leakage (only speed is shown in Figure 2)<sup>3</sup>. The same testing and labeling procedure has been carried out for the substrate wafer containing the Central Processing Unit (CPU) die. The question we now seek to answer is: What is the impact of the process variations in the individual 2D die on the overall 3D IC performance and leakage?

To quantify the impact of process variations on the overall performance of the 3D processor, we choose the popular *MIPS (millions of instructions per second)* as the performance index. For a given pair (i, j) of CPU *i* and L2 cache *j*, we compute the MIPS in the following way. We first calculate the L2 latency,  $L_{i,j}$ , in terms of CPU cycles, i.e.,

$$L_{i,j} = \left\lceil \frac{\text{L2 } j \text{ access time}}{\text{CPU } i \text{ cycle period}} \right\rceil. \tag{1}$$

While the access time and core frequency may display wide variations, the ceiling operator $\lceil \cdot \rceil$ reduces the number of distinct latency values. Note that the L2 latency varies from a minimum of $L_{\min} = \lceil \min(\text{L2 access time}) \times \min(\text{CPU frequency}) \rceil$ to a maximum of $L_{\max} = \lceil \max(\text{L2 access time}) \times \max(\text{CPU frequency}) \rceil$.
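As a minimal sketch of Equation (1), the latency can be computed directly from the tested access time and frequency. The function name and the choice of units (ns and GHz) are assumptions made for illustration; the extreme values are the ones reported in Table I.

```python
import math

def l2_latency_cycles(access_time_ns: float, cpu_freq_ghz: float) -> int:
    """Equation (1): L2 access time divided by the CPU cycle period, rounded up.

    With the access time in ns and the frequency in GHz, the cycle period is
    1 / cpu_freq_ghz ns, so the ratio is access_time_ns * cpu_freq_ghz.
    """
    return math.ceil(access_time_ns * cpu_freq_ghz)

# Extreme values taken from Table I.
l_min = l2_latency_cycles(1.02, 2.46)   # fastest cache, slowest CPU
l_max = l2_latency_cycles(1.81, 4.01)   # slowest cache, fastest CPU
distinct_latencies = list(range(l_min, l_max + 1))
```

With these extremes the latency can take only the six integer values from 3 to 8 cycles, which illustrates why the ceiling keeps the number of distinct cases small.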

The impact of cache latency on performance depends on the particular application executing on the processor. A memory-intensive application will be heavily impacted by large values of memory latency in comparison to a processing-intensive application. To obtain an accurate estimation of the overall speed of the 3D processor, the cache latency values need to be fed to an architectural simulator to compute the actual *Cycles Per Instruction* 

<sup>3</sup>3D technology offers unique ways to address the reliability issues of 3D memories. The basic assumption is that the higher bandwidth guaranteed by a 3D structure (because of the superior interconnect density) greatly simplifies the remapping of faulty memory cells. In particular, Patti [Patti 2006] proposes a scheme (also commercialized by Tezzaron) that continuously monitors the state of each memory cell in the background and repairs errors on the fly by replacing faulty cells with non-faulty ones.



Fig. 2. Modeling the impact of process variations on 3D processors. **F** indicates a faulty die. The number inside each die represents its speed as measured by testing before 3D integration.

(*CPI*) using typical benchmark applications. The fact that there are only a few possible values for the L2 latency drastically reduces the number of architectural simulations that need to be carried out. Finally, the MIPS of the 3D chip composed of CPU *i* and cache *j* is simply the clock frequency of *i* (in MHz) divided by the CPI of the pair.
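The per-pair performance labels can be sketched as follows, using the standard definition of MIPS as clock frequency in MHz divided by CPI. The CPI table is purely hypothetical; in the flow described here those values would come from architectural simulation, one run per distinct latency.

```python
import math

# Hypothetical CPI per L2 latency (in cycles); placeholder values only.
CPI_BY_LATENCY = {3: 1.00, 4: 1.04, 5: 1.08, 6: 1.12, 7: 1.16, 8: 1.20}

def mips(cpu_freq_ghz, cpi):
    """MIPS = clock frequency in MHz divided by cycles per instruction."""
    return cpu_freq_ghz * 1000.0 / cpi

def mips_matrix(cpu_freqs_ghz, l2_access_ns, cpi_by_latency=CPI_BY_LATENCY):
    """Return M where M[i][j] is the MIPS of CPU i stacked with L2 cache j."""
    matrix = []
    for f in cpu_freqs_ghz:
        row = []
        for t in l2_access_ns:
            latency = math.ceil(t * f)              # Equation (1)
            row.append(mips(f, cpi_by_latency[latency]))
        matrix.append(row)
    return matrix

# Two hypothetical CPUs (GHz) and two hypothetical caches (ns).
M = mips_matrix([2.5, 4.0], [1.0, 1.8])
```

Only one CPI lookup per distinct latency is needed, so the matrix for all CPU/cache pairs is cheap to fill once the few simulations are done.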

While modeling the final performance of a 3D IC requires design knowledge (e.g., the processor and its application programs), modeling the final leakage of a 3D IC is more straightforward. The final leakage will be the sum of the leakage currents of the constituent die in the 3D IC, taking into account the spatial and temporal temperature variations of the 3D IC [Im and Banerjee 2000; Loi et al. 2006].

We note that while we extensively use the 3D processor as a specific potential 3D IC, our general parametric yield modeling and improvement methodology is still applicable to other 3D IC designs. For example, we sketch an outline for modeling two other potential 3D ICs: 3D Field-Programmable Gate Arrays (FPGAs) and 3D embedded Systems-on-a-Chip (SoCs).

- **—3D FPGA.** Heterogeneous 3D FPGAs could be one of the great applications of 3D technology. One can envision a stack of die that has reconfigurable logic in one set of die, hard IPs (e.g., processors or DSPs) on another, and memory on a third. Consider a 3D FPGA where a multi-core system on one die is interfaced to another die of reconfigurable logic. The reconfigurable logic provides the necessary fabric to accelerate key software routines. The speed of the system depends on the rate at which the cores call the reconfigurable logic, which in turn depends on the workloads running on the cores. As a first-order analysis, the overall delay can be considered equal to $\text{processor cycle delay} + \frac{1}{\text{calling rate}} \times \text{reconfigurable logic delay}$.
- **—Embedded SoC 3D IC.** Consider an embedded 3D IC that is designed for multimedia applications, where a critical component of the IC is dedicated to computing the Fast Fourier Transform (FFT). An FFT computational system involves a good number of pipeline stages [Baas 1999]. To reduce the communication overhead between the stages, it is advantageous to place them, especially the ones that use global interconnects in the FFT's butterfly structure [Das et al. 2004; Baas 1999], on separate die. In this case, the maximum operating frequency of the system is determined by the pipeline stage with the largest delay, considering the impact of process variations on the individual pipeline stages located in the different die.

## 4. STRATEGIES FOR IMPROVING THE PARAMETRIC YIELD OF 3D ICS

While wafer-to-wafer bonding dictates the outcome of integrating different wafers, die-to-wafer and die-to-die integration offer flexibility that we propose to exploit by devising 3D integration strategies that maximize the parametric yield. The general problem of optimizing the parametric yield of 3D ICs under process variations can be stated as follows.

**The Parametric Yield Maximization Problem for 3D ICs.** Given *K* different wafers (or wafer lots), each with *N* die that are identical by design yet parametrically different due to process variations, find an integration assignment that maximizes the total parametric yield of the *N* produced 3D ICs, where each IC is composed of *K* stacked die, one from each wafer.

The outline solution for this problem is as follows.

- (1) Model the impact of process variations on the speed and leakage of each die on all *K* wafers.
- (2) Model the performance of the 3D system (composed of *K* different die) for each of the $N^{K}$ possible 3D IC combinations.
- (3) From the $N^K$ possible combinations, find the *N* combinations that maximize the total parametric yield (as measured by performance, leakage, or revenue) such that each die is assigned to exactly one 3D IC package.

The problem is challenging both electrically and combinatorially. First, the impact of process variations on the electrical properties (speed and leakage) has to be modeled for each die and for each possible 3D combination. Second, the *N* 3D IC combinations that maximize the total parametric yield have to be computed and selected. For the case of three or more integrated die, one can prove that maximizing the parametric yield for 3D ICs is NP-hard by reducing the classical NP-hard 3-D matching problem [Garey and Johnson 1979] to it. A more tractable version of the problem is possible in the case of two die, i.e., K = 2, where for example the first wafer holds processor logic and the second wafer holds the processor L2 cache (or memory in general).
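To make the combinatorial growth concrete, the sketch below enumerates every feasible assignment for tiny *N* and *K*. It is an illustrative exhaustive search, not a method proposed in this paper; the function names, the per-die speeds, and the "slowest die limits the stack" yield model are all assumptions.

```python
from itertools import permutations, product

def exhaustive_assignment(N, K, yield_fn):
    """Try every way to form N stacks of K die (one die per wafer per stack).

    Die on wafer 1 keep their order and every other wafer is permuted, which
    gives (N!)**(K-1) candidate assignments. Feasible only for tiny N and K.
    """
    best_total, best_stacks = float("-inf"), None
    for perms in product(permutations(range(N)), repeat=K - 1):
        stacks = [(i,) + tuple(p[i] for p in perms) for i in range(N)]
        total = sum(yield_fn(stack) for stack in stacks)
        if total > best_total:
            best_total, best_stacks = total, stacks
    return best_total, best_stacks

# Hypothetical per-die speeds for K = 3 wafers with N = 3 die each.
speeds = [[3.0, 2.0, 1.0], [1.0, 3.0, 2.0], [2.0, 1.0, 3.0]]

def stack_yield(stack):
    # Assumed toy model: a stack runs at the speed of its slowest die.
    return min(speeds[wafer][die] for wafer, die in enumerate(stack))

total, stacks = exhaustive_assignment(3, 3, stack_yield)
```

Even at N = 10 and K = 3 this search would visit more than 10^13 assignments, which is why the two-die case and the sorting heuristics matter in practice.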

We propose a number of strategies that control and improve the parametric yield of 3D ICs. First, we focus on improving the parametric yield as measured by the speed or performance of the 3D package. Later, we will focus on yield as measured by sales profit or leakage.

### 4.1 Assignment Strategies for Maximizing Performance

The proposed strategies vary in their ability to optimize the parametric yield, and also in their computational complexity.

- **—Random-Random (RR) assignment.** In this naive strategy, the 3D integration process is oblivious to parametric yield and assigns CPUs and L2 caches randomly to form the 3D processor chips. This strategy serves as a baseline against which the other strategies are compared.
- **—Fast-Fast (FF) assignment.** In this strategy, CPU die are sorted in descending order of tested frequency (fastest first), and L2 cache die are sorted in ascending order of tested access time (fastest first). Then the 3D chips are constructed by matching the CPUs and L2 caches in order. This strategy starts by pairing the fastest CPUs and L2 caches together and ends by pairing the slowest CPUs and caches together. It attempts to obtain the fastest possible 3D processor chips, at the cost of also producing the slowest possible 3D chips. This strategy is easily computed in $O(N \log N)$ runtime, and for *K* die stacks, it is computable in $O(KN \log N)$ runtime.
- **—Fast-Slow (FS) assignment.** In this strategy, CPU die are sorted in descending order of tested frequency (fastest first), and L2 caches are sorted in descending order of tested access time (slowest first). Then the 3D chips are constructed by matching the CPUs and L2 caches in order. This strategy starts by pairing the fastest CPUs with the slowest L2 caches and ends by pairing the slowest CPUs with the fastest caches. It attempts to increase the number of processors with medium speed. It can also be helpful if leakage is the main factor driving the integration strategy, because it integrates low-leakage die with high-leakage die, which evens out the leakage across the 3D ICs. This strategy is easily computed in $O(N \log N)$ runtime, and for *K* die stacks, it is computable in $O(KN \log N)$ runtime.
- **—Optimal (OPT) assignment.** To find the optimal integration strategy, we propose an integer linear program (ILP) that maximizes the parametric yield for any number of die in the 3D stack. Let $x_{i_1,i_2,\dots,i_K}$ denote a binary variable that is 1 when die $i_1 \in \{1,\dots,N\}$ from wafer 1, die $i_2 \in \{1,\dots,N\}$ from wafer 2, ..., and die $i_K \in \{1,\dots,N\}$ from wafer *K* are integrated into a 3D IC, and 0 otherwise. Let $Y_{i_1,i_2,\dots,i_K}$ be a constant that gives the parametric yield of the 3D IC formed from die $i_1, i_2, \dots, i_K$, as defined by the speed, leakage, direct revenue, or a combination of them. Given *K* wafers with *N* die each, the parametric yield maximization problem can be formulated as the following ILP:

$$\max \sum_{i_1=1}^{N} \cdots \sum_{i_K=1}^{N} Y_{i_1, i_2, \dots, i_K} \times x_{i_1, i_2, \dots, i_K}, \tag{2}$$

such that there are exactly N produced 3D ICs

$$\sum_{i_1=1}^{N} \cdots \sum_{i_K=1}^{N} x_{i_1,\dots,i_K} = N, \tag{3}$$

and each die on any wafer participates in exactly one 3D IC:

$$\begin{aligned}
\forall i_1 \in \{1, \dots, N\}: &\quad \sum_{i_2=1}^{N} \cdots \sum_{i_K=1}^{N} x_{i_1, \dots, i_K} = 1, \\
&\quad \vdots \\
\forall i_j \in \{1, \dots, N\}: &\quad \sum_{i_1=1}^{N} \cdots \sum_{i_q=1,\, q \neq j}^{N} \cdots \sum_{i_K=1}^{N} x_{i_1, \dots, i_K} = 1, \\
&\quad \vdots \\
\forall i_K \in \{1, \dots, N\}: &\quad \sum_{i_1=1}^{N} \cdots \sum_{i_{K-1}=1}^{N} x_{i_1, \dots, i_K} = 1.
\end{aligned}$$

Fig. 3. Optimal assignment of CPUs to L2 caches to generate 3D processor chips that maximize the total parametric yield as measured by the total system MIPS.
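The three sorting-based pairings listed above (RR, FF, and FS) can be sketched for the two-die case as follows. The function names and the sample test data are hypothetical; each function returns a list of (CPU index, L2 index) pairs.

```python
import random

def random_random(cpu_freqs, l2_times, seed=0):
    """RR: pair CPUs and caches at random."""
    l2_order = list(range(len(l2_times)))
    random.Random(seed).shuffle(l2_order)
    return list(zip(range(len(cpu_freqs)), l2_order))

def fast_fast(cpu_freqs, l2_times):
    """FF: highest-frequency CPU with lowest-access-time (fastest) cache."""
    cpus = sorted(range(len(cpu_freqs)), key=lambda i: cpu_freqs[i], reverse=True)
    l2s = sorted(range(len(l2_times)), key=lambda j: l2_times[j])
    return list(zip(cpus, l2s))

def fast_slow(cpu_freqs, l2_times):
    """FS: highest-frequency CPU with highest-access-time (slowest) cache."""
    cpus = sorted(range(len(cpu_freqs)), key=lambda i: cpu_freqs[i], reverse=True)
    l2s = sorted(range(len(l2_times)), key=lambda j: l2_times[j], reverse=True)
    return list(zip(cpus, l2s))

cpu_freqs = [3.1, 4.0, 2.5]      # GHz, hypothetical test results
l2_times = [1.5, 1.0, 1.8]       # ns, hypothetical test results
rr = random_random(cpu_freqs, l2_times)
ff = fast_fast(cpu_freqs, l2_times)
fs = fast_slow(cpu_freqs, l2_times)
```

Both FF and FS cost one sort per wafer, which matches the $O(N \log N)$ bound stated for these strategies.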

For the case of K = 2, it is possible to find the optimal solution to the ILP in polynomial runtime using a graph-theoretic framework. In this case, *vertices* represent the die, *edges* represent the possible 3D ICs, and *edge weights* represent the yield (speed or revenue) value of the possible ICs. Thus, we construct a bipartite graph, shown in Figure 3, with 2N vertices representing the N CPU die and the N L2 cache die, and $N^2$ edges, where each edge is labeled with the MIPS of the 3D processor produced from the CPU and L2 cache die at its end points. The optimal assignment strategy involves finding the N CPU/L2 pairs that maximize the total MIPS such that each CPU and each L2 cache participates in exactly one 3D IC. The optimal assignment can be found by computing the maximum-weight perfect matching (assignment) in the bipartite graph. This can be computed in polynomial $O(N^3)$ runtime using the classical Hungarian algorithm [Kuhn 1955; Munkres 1957].
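For illustration, a compact pure-Python version of the $O(N^3)$ Hungarian algorithm (the common potentials-based formulation) is sketched below. It is not the matching code used in the experiments; the function name and the MIPS labels are hypothetical.

```python
def max_weight_assignment(weights):
    """Return match[i] = column assigned to row i, maximizing the total weight.

    Square N x N input. Runs the O(N^3) Hungarian algorithm with vertex
    potentials on negated weights, so minimizing cost maximizes weight.
    """
    n = len(weights)
    cost = [[-w for w in row] for row in weights]
    INF = float("inf")
    u = [0.0] * (n + 1)       # row potentials (1-based)
    v = [0.0] * (n + 1)       # column potentials (1-based)
    p = [0] * (n + 1)         # p[j] = row currently matched to column j
    way = [0] * (n + 1)       # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                 # flip the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match

# Hypothetical MIPS labels: rows are CPU die, columns are L2 cache die.
mips = [[3900, 3700, 3500],
        [3200, 3100, 3000],
        [2600, 2550, 2500]]
match = max_weight_assignment(mips)
total_mips = sum(mips[i][match[i]] for i in range(len(mips)))
```

Where SciPy is available, `scipy.optimize.linear_sum_assignment` offers a tested alternative for the same assignment problem.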

The performance of a 3D system determines its speed bin and consequently its price. This is described in the next subsection.

### 4.2 Strategies to Maximize Sales Profits

Chip manufacturers are ultimately interested in maximizing sales profits. Processors with higher performance (measured by MIPS) are naturally sold at higher prices than slower ones. The difference in price is correlated with the number of chips available. Since process variations produce chips with Gaussian-like distributions, it is expected that there are very few chips with extremely high or extremely low performance and that the majority of chips have a performance around some average value. This leads to a non-linear relationship between the performance and price of the chip. For example, the market values of Intel Core Duo processors (according to pricegrabber.com) for its four different speed bins are given in Figure 4. The plot shows an exponential trend for the price. The price of extreme processors is almost double the price of the fast processors, which is in turn double the price of the slow ones. We also note that binning can be partially driven by the



Fig. 4. Market prices (according to pricegrabber.com) of Intel Core Duo as of March 2007.

market dynamics of supply and demand. Fast chips can be either sold as slow or fast depending on the market demand and supply forces. However, slow chips can be only binned as slow. This asymmetric situation means that fast chips offer greater flexibility in meeting market requirements.

Our proposed fast-fast and optimal-assignment strategies are designed to increase the number of the fastest 3D chips (as we will confirm in Section 5). Under the market price model, this is likely to lead to a significant increase in total sales profits and flexibility. It is also possible to drive our optimal assignment strategy, described in Subsection 4.1, directly by the dollar values of the 3D systems rather than by their MIPS values. In that case, we replace the MIPS label of each edge in Figure 3 with the corresponding dollar value and find the optimal assignment as described earlier.
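Relabeling the edges with dollar values can be sketched as a simple bin lookup. The bin boundaries and prices below are hypothetical placeholders, chosen only to mimic the roughly doubling trend described for Figure 4; they are not the market data.

```python
from bisect import bisect_right

# Hypothetical lower bounds (MIPS) of the medium, fast, and extreme bins,
# and hypothetical prices (USD) for the slow, medium, fast, and extreme bins.
BIN_LOWER_BOUNDS = [2400.0, 2800.0, 3200.0]
BIN_PRICES = [100.0, 200.0, 400.0, 800.0]

def price_of(mips_value):
    """Map a stack's MIPS to the price of the speed bin it falls into."""
    return BIN_PRICES[bisect_right(BIN_LOWER_BOUNDS, mips_value)]

def revenue_matrix(mips):
    """Relabel every CPU/L2 edge with dollars instead of MIPS."""
    return [[price_of(m) for m in row] for row in mips]

def total_revenue(mips, pairs):
    """Revenue of an assignment given as (cpu_index, l2_index) pairs."""
    return sum(price_of(mips[i][j]) for i, j in pairs)

mips = [[3300.0, 2500.0], [2900.0, 2300.0]]   # hypothetical edge labels
dollars = revenue_matrix(mips)
straight = total_revenue(mips, [(0, 0), (1, 1)])
crossed = total_revenue(mips, [(0, 1), (1, 0)])
```

Because the price is a step function of MIPS, two assignments with similar total MIPS can differ noticeably in revenue, which is the motivation for optimizing on dollar labels directly.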

### 4.3 Leakage-Aware Assignment Strategies

As described in Section 2, die with the highest performance (or smallest delay) are likely to also have the highest leakage current. Thus optimal, or fast-fast, assignment strategies can produce 3D chips with excessive leakage power, since they produce systems with the highest performance. This excessive leakage power can be problematic for chips that are targeted for low-power mobile devices. Thus we seek to modify our 3D integration strategies to take leakage power into account when improving the parametric yield. There are two possible modifications, depending on the importance of leakage current.

**A. Leakage-constrained Integration.** In this approach, we still use performance-driven integration but under the constraint that a 3D IC should never exceed a certain leakage threshold. This modification can be accommodated in the proposed graph-theoretic approach as follows. After we generate the bipartite graph shown in Figure 3, we delete any edges that correspond to CPU/L2 cache pairs that generate leakage current above the allowed leakage threshold. The new graph is then used to calculate the optimal assignment. The optimal assignment algorithm automatically results in performance-optimized 3D processor chips that produce leakage current/power below the constraining threshold.
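The edge-deletion idea can be sketched on a toy instance. For brevity the sketch searches all pairings exhaustively instead of running a matching algorithm on the pruned graph, so it only suits very small N; the function name, leakage values, MIPS labels, and threshold are all hypothetical.

```python
from itertools import permutations

def constrained_best_pairing(mips, leak, threshold):
    """Best total-MIPS pairing that uses no edge with leakage above threshold.

    Returns (total_mips, match) with match[i] = L2 index for CPU i, or None
    when no pairing avoids all deleted edges.
    """
    n, best = len(mips), None
    for perm in permutations(range(n)):
        if any(leak[i][perm[i]] > threshold for i in range(n)):
            continue                      # uses a deleted edge
        total = sum(mips[i][perm[i]] for i in range(n))
        if best is None or total > best[0]:
            best = (total, list(perm))
    return best

cpu_leak = [17.0, 5.0, 2.0]               # W, hypothetical
l2_leak = [13.0, 8.0, 5.0]                # W, hypothetical
mips = [[3900, 3700, 3500],
        [3200, 3100, 3000],
        [2600, 2550, 2500]]

# Edge leakage is modeled as the sum of the leakage of the two die.
leak = [[c + l for l in l2_leak] for c in cpu_leak]
result = constrained_best_pairing(mips, leak, threshold=24.0)
```

In this example the leakiest CPU can only be paired with the least leaky cache, so the constrained optimum gives up some total MIPS relative to the unconstrained one.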

**B. Leakage-driven Integration.** In this approach (which is more suitable for ultra-low



Fig. 5. The tool chain required to model and evaluate our strategy.

power mobile 3D ICs), we drive the 3D integration process entirely by leakage. Given a stringent leakage threshold, we label every edge of Figure 3 with the sum of the leakage of its end points if and only if that sum is below the given threshold; otherwise, we label the edge with a prohibitively high cost (ideally $\infty$).

## 5. EXPERIMENTAL RESULTS

In this section we quantify the impact of our 3D integration strategies on the parametric yield and profits of 3D ICs. A key input to our models and strategies is the basic speed and leakage test results from the 2D ICs that will form the 3D ICs. Since such test results are not available, we estimate these numbers through simulations of realistic CPU and cache hardware models. We use the following tools:

- —SPICE to calculate the delay of CPUs under the presence of process variations using 70nm technology [PTM].
- —CACTI (version 4.2) [Wilton and Jouppi 1996] and PRACTICS (version 1.0) [Zeng et al. 2005] to calculate the access time of L2 caches using vertical interconnects in 3D chips.
- —SimpleScalar (version 3.0) [Burger and Austin 1997] to model the performance (as measured by Cycles Per Instruction, CPI) of 3D processors given the underlying CPU frequency and the L2 cache access time with vertical interconnects.
- —The matching code by Rohe [Cook and Rohe 1999] to implement the optimal 3D assignment strategy to maximize the parametric yield.


Our tool chain flow, given in Figure 5, starts by modeling the impact of process variations on the speed and leakage of N = 100 CPU die and N = 100 L2 cache die. Then an architectural simulator, SimpleScalar, is used to calculate the performance of the potential 3D processors composed of the different CPU and L2 die. This information, together with the leakage current, is used to construct a bipartite graph as outlined in Section 4, which is then fed, with market price models, to the optimal matching module to find the integration assignment that maximizes the parametric yield and profits. In the following subsections, we describe each tool and step in detail.

### 5.1 CPU Setup

We quantify the impact of process variations on the performance and power of 2D CPUs by simulating with SPICE a typical CPU critical path (i.e., a chain of 9 NAND gates representing the CPU pipeline flow [Bowman et al. 2002]). We use the 70nm Berkeley predictive technology model for all the simulations [PTM]. To model the impact of inter-die process variations, we generate 100 critical path SPICE netlists, where the gate length of each is drawn from a Gaussian distribution with a mean of 70nm and a standard deviation of 5.07% (leading to $\pm$7nm maximum variations). We then execute SPICE on each netlist and record the delay and the leakage current. The frequency of a CPU is the reciprocal of the critical path delay. We plot the distribution of CPU frequencies (GHz) obtained from SPICE simulations in Figure 6(a). Table I gives the maximum, minimum, average, and standard deviation of the resulting CPU frequencies, which have a mean of 3.12GHz and a standard deviation of 10.33%.
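The sampling step that feeds the SPICE runs can be sketched as below. The SPICE simulation itself is not reproduced; the function names are hypothetical, and enforcing the $\pm$7nm bound by redrawing out-of-range samples is an assumption about how that limit arises.

```python
import random

def sample_gate_lengths(n=100, nominal_nm=70.0, sigma_frac=0.0507,
                        max_dev_nm=7.0, seed=1):
    """Draw n gate lengths from a Gaussian, redrawing values beyond +/- max_dev_nm."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        length = rng.gauss(nominal_nm, sigma_frac * nominal_nm)
        if abs(length - nominal_nm) <= max_dev_nm:
            samples.append(length)
    return samples

def frequency_ghz(delay_ns):
    """CPU frequency is the reciprocal of the critical-path delay."""
    return 1.0 / delay_ns

# One gate length per critical-path netlist; each would parameterize a SPICE run.
gate_lengths = sample_gate_lengths()
```

Each sampled gate length would be written into one netlist, and the delay reported by the simulator would then be converted to a frequency with `frequency_ghz`.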

## 5.2 L2 Cache Setup

Assuming a cache configuration of 2MB at the 70nm technology node, we calculate the cache access time using PRACTICS [Zeng et al. 2005], which is a tool for predicting the access time of L2 caches using vertical interconnects in 3D ICs. To obtain our cache leakage variation numbers, we assumed normally distributed gate length variations that vary



(a) CPU frequency distribution as produced by modeling critical path variations on the CPU delay using SPICE. (b) L2 cache access time distribution as produced from modeling process variations using Cacti and PRACTICS.

Fig. 6. Impact of process variations on CPU frequency and L2 cache access time.

| Parameter | Statistic                    | CPU           | Parameter   | Statistic                    | L2 cache      |
|-----------|------------------------------|---------------|-------------|------------------------------|---------------|
| frequency | max                          | 4.01GHz       | access time | max                          | 1.81ns        |
|           | average                      | 3.12GHz       |             | average                      | 1.46ns        |
|           | min                          | 2.46GHz       |             | min                          | 1.02ns        |
|           | std dev.                     | 10.33%        |             | std dev.                     | 11.06%        |
| leakage   | max                          | 17.8W         | leakage     | max                          | 13.4W         |
|           | average                      | 5.09W         |             | average                      | 8.22W         |
|           | min                          | 1.93W         |             | min                          | 4.57W         |
|           | range $(=\frac{\max}{\min})$ | $5.21 \times$ |             | range $(=\frac{\max}{\min})$ | $2.93 \times$ |

Table I. Impact of process variations on the speed and leakage of the CPU and L2 cache die.

 $\pm$ 7 nm around the 70 nm nominal gate length. Using PRACTICS to simulate the latency of caches with such variations resulted in cache access time variations with a standard deviation of 11.06%. Figure 6(b) plots the resulting distribution for the L2 cache access time (ns). The main statistics of the L2 access time and leakage distribution are reported in Table I.

## 5.3 3D Processor Performance Modeling

With the computed CPU frequency vector (100 values in GHz) and the L2 access time vector (100 values in ns), it is possible to calculate the L2 access time in terms of CPU cycles, i.e., $\lceil \frac{L2 \text{ access time}}{CPU \text{ cycle period}} \rceil$, for every possible pair of CPU and L2 cache. While the number of different CPU frequencies and cache access times could be large due to process variations, the number of distinct cache access latencies in cycles is much smaller (e.g., they vary between 3 and 8 cycles). The newly computed values for access cycles are used as configuration parameters for the SimpleScalar simulator (which requires the cache access time expressed in CPU cycles) to simulate the performance of every possible CPU/L2 3D chip combination<sup>4</sup>. Next, we run a suite of six SPEC 2000 benchmarks [SPEC 2000] and compute the average Cycles Per Instruction (CPI) over the six benchmarks: three integer benchmarks (gcc, parser, gzip) and three floating point applications (mgrid, apsi, equake). CPI results are given in Table II. We then use the CPI and clock frequency values to calculate the MIPS of every possible CPU/L2 3D processor.
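The conversion from a (CPU frequency, L2 access time) pair to a latency in cycles and then to MIPS can be sketched as below. This is a minimal sketch with hypothetical function names; the average CPI values are copied from Table II, and MIPS is taken as the clock frequency in MHz divided by the average CPI.

```python
import math

# Average CPI for each L2 latency in cycles, copied from Table II.
AVG_CPI = {3: 0.734, 4: 0.798, 5: 0.819, 6: 0.847, 7: 0.873, 8: 0.899}


def l2_latency_cycles(access_time_ns, cpu_freq_ghz):
    """ceil(L2 access time / CPU cycle period); the period in ns is 1 / f_GHz."""
    return math.ceil(access_time_ns * cpu_freq_ghz)


def processor_mips(access_time_ns, cpu_freq_ghz):
    """MIPS of one CPU/L2 pair: frequency in MHz divided by the average CPI.

    Raises KeyError if the latency falls outside the 3 to 8 cycle range
    covered by Table II.
    """
    cycles = l2_latency_cycles(access_time_ns, cpu_freq_ghz)
    return cpu_freq_ghz * 1000.0 / AVG_CPI[cycles]
```

For example, the average CPU (3.12GHz) paired with the average L2 cache (1.46ns) from Table I sees a 5-cycle L2 latency, which with the Table II CPI of 0.819 gives roughly 3810 MIPS.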

## 5.4 Evaluation of 3D Integration Strategies

With the modeled CPU frequency, L2 access time, and processor MIPS, it is possible to evaluate the effectiveness of our different 3D integration strategies on the parametric yield as measured by the performance of the 3D processor. Given the CPU frequency and L2 access time distributions of Figure 6, we compute the MIPS distributions of 3D processors produced by the different assignment strategies (RR, FF, FS and OPT). We report the performance in terms of average, maximum, and minimum MIPS in Table III. The results of Table III demonstrate that the optimal assignment strategy and the fast-fast strategy produce systems with the maximum MIPS; however, the optimal strategy has the highest average MIPS of all strategies. Compared to the performance-oblivious strategy (the random-random strategy), the optimal assignment strategy produces systems with better performance by up to 6.49%, with an average improvement of 1.71%.
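The four assignment strategies can be illustrated with the sketch below. Several points are assumptions of this sketch rather than details given here: FS is taken to pair the fastest CPU with the slowest cache, `score` is a hypothetical per-pair objective (for instance MIPS or bin price), and the optimal assignment is found by exhaustive search, which is only feasible for a handful of die and stands in for the matching code used in the experiments.

```python
import itertools
import random


def fast_fast(cpu_freqs, l2_times):
    """FF: fastest CPU with the fastest (lowest access time) cache, and so on."""
    cpus = sorted(range(len(cpu_freqs)), key=lambda i: -cpu_freqs[i])
    caches = sorted(range(len(l2_times)), key=lambda j: l2_times[j])
    return list(zip(cpus, caches))


def fast_slow(cpu_freqs, l2_times):
    """FS (assumed definition): fastest CPU with the slowest cache, and so on."""
    cpus = sorted(range(len(cpu_freqs)), key=lambda i: -cpu_freqs[i])
    caches = sorted(range(len(l2_times)), key=lambda j: -l2_times[j])
    return list(zip(cpus, caches))


def random_random(cpu_freqs, l2_times, seed=0):
    """RR: performance-oblivious pairing via a random permutation of caches."""
    rng = random.Random(seed)
    caches = list(range(len(l2_times)))
    rng.shuffle(caches)
    return list(zip(range(len(cpu_freqs)), caches))


def optimal(cpu_freqs, l2_times, score):
    """OPT stand-in: try every pairing and keep the one with the best total score.

    Exhaustive search costs O(N!) and is for illustration only.
    """
    n = len(cpu_freqs)
    best = max(
        itertools.permutations(range(n)),
        key=lambda perm: sum(
            score(cpu_freqs[i], l2_times[perm[i]]) for i in range(n)
        ),
    )
    return list(zip(range(n), best))
```

Each function returns a list of `(cpu_index, l2_index)` pairs. For N = 100 die, a polynomial-time assignment solver such as the Hungarian method [Kuhn 1955] would replace the exhaustive search.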

<sup>&</sup>lt;sup>4</sup>We use the following parameters for simulation: (1) 2-way, 3 cycle L1 cache of 16 Kbyte; (2) 8-way 2MB L2 cache; (3) main memory latency is 50 cycles; (4) the decode/issue/commit width is 4 issue.


| L2 latency (cycles) | Benchmark | CPI | Avg CPI | L2 latency (cycles) | Benchmark | CPI | Avg CPI |
|----------|--------|-------|-------|----------|--------|-------|-------|
| 3        | apsi   | 0.614 | 0.734 | 6        | apsi   | 0.614 | 0.847 |
|          | equake | 0.785 |       |          | equake | 0.905 |       |
|          | gcc    | 1.031 |       |          | gcc    | 1.556 |       |
|          | gzip   | 0.577 |       |          | gzip   | 0.594 |       |
|          | mgrid  | 0.548 |       |          | mgrid  | 0.547 |       |
|          | parser | 0.850 |       |          | parser | 0.863 |       |
| 4        | apsi   | 0.614 | 0.798 | 7        | apsi   | 0.615 | 0.873 |
|          | equake | 0.865 |       |          | equake | 0.927 |       |
|          | gcc    | 1.330 |       |          | gcc    | 1.669 |       |
|          | gzip   | 0.585 |       |          | gzip   | 0.599 |       |
|          | mgrid  | 0.543 |       |          | mgrid  | 0.559 |       |
|          | parser | 0.851 |       |          | parser | 0.868 |       |
| 5        | apsi   | 0.614 | 0.819 | 8        | apsi   | 0.615 | 0.899 |
|          | equake | 0.875 |       |          | equake | 0.955 |       |
|          | gcc    | 1.441 |       |          | gcc    | 1.786 |       |
|          | gzip   | 0.585 |       |          | gzip   | 0.604 |       |
|          | mgrid  | 0.546 |       |          | mgrid  | 0.561 |       |
|          | parser | 0.855 |       |          | parser | 0.876 |       |

Table II. CPI reported for different L2 cache access cycles ( $\lceil \frac{L2 \text{ access time}}{CPU \text{ cycle period}} \rceil$ ). L2 access times and CPU clock periods are taken from the data of Figure 6.

| Strategy      | Max MIPS | Average | Min MIPS | $\Delta$ MIPS (%) |
|---------------|----------|---------|----------|-------------------|
| Fast-Fast     | 4902.62  | 3810.41 | 3006.78  | 63.04%            |
| Fast-Slow     | 4465.68  | 3784.63 | 3221.22  | 38.63%            |
| Optimal       | 4903.00  | 3855.00 | 3138.00  | 56.25%            |
| Random-Random | 4606.61  | 3790.17 | 3082.71  | 49.93%            |

Table III. Impact of different 3D integration strategies on the statistical performance parameters of 3D processor chips. We calculate $\Delta \text{MIPS} (\%) = \frac{\text{Max MIPS} - \text{Min MIPS}}{\text{Min MIPS}}$.

The processor distributions produced from different integration strategies are also given in Figure 7. To create the figure, we use the RR processor distribution to designate four performance bins: extreme, fast, medium, and slow, using Matlab's histogram function. The four bins mimic those of Intel processors as we described earlier in Subsection 4.2. We use the bin boundaries of the RR strategy as the bin boundaries of other 3D integration strategies. This way we guarantee a fair comparison for the different strategies. From the data, we draw the following observations.

- The OPT and FF strategies produce almost twice the number of extreme processors compared to the other strategies.
- While OPT and FF produce the same number of extreme processors, OPT reduces the number of processors in the slow bin by almost half compared to FF. Note that FF produces the highest number of processors in the slow bin.
- FS produces a large number of processors in the medium and fast bins, but produces the fewest processors in the extreme bin.
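The binning procedure described above, in which the bin boundaries derived from the RR distribution are reused for every other strategy, could be sketched as follows. The sketch assumes four equal-width bins spanning the RR range; the exact bin edges used in the experiments are not stated here, and the function names are hypothetical.

```python
BIN_NAMES = ["slow", "medium", "fast", "extreme"]


def bin_edges(reference_mips, n_bins=4):
    """Interior edges of n_bins equal-width bins spanning the reference range."""
    lo, hi = min(reference_mips), max(reference_mips)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def bin_counts(mips_values, edges):
    """Count chips per bin using the fixed edges.

    Values below or above the reference range fall into the slow or
    extreme bin, respectively.
    """
    counts = {name: 0 for name in BIN_NAMES}
    for value in mips_values:
        index = sum(value >= edge for edge in edges)  # number of edges passed
        counts[BIN_NAMES[index]] += 1
    return counts
```

Calling `bin_edges` once on the RR MIPS values and passing the same edges to `bin_counts` for FF, FS, and OPT keeps the comparison across strategies on a common scale.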


Fig. 7. Impact of process variations on 3D processor performance as measured by MIPS using the different proposed 3D integration assignment strategies. Four performance or speed bins are used: slow, medium, fast, and extreme. (Plot title: % of 3D chips in different speed (performance) bins.)

| Bin | Market price (\$) per chip | Fast-Fast #chips (%) | Fast-Fast revenues (\$) | Fast-Slow #chips (%) | Fast-Slow revenues (\$) | Optimal #chips (%) | Optimal revenues (\$) | Random-Random #chips (%) | Random-Random revenues (\$) |
|---------|------|-----|----------|-----|----------|-----|----------|-----|----------|
| EXTREME | 1081 | 14  | 15134    | 3   | 3243     | 14  | 15134    | 8   | 8684     |
| FAST    | 538  | 37  | 19906    | 48  | 25824    | 38  | 20444    | 35  | 18830    |
| MEDIUM  | 325  | 28  | 9100     | 39  | 12675    | 36  | 11700    | 40  | 13000    |
| SLOW    | 240  | 21  | 5040     | 10  | 2400     | 12  | 2880     | 17  | 4080     |
| Total   |      | 100 | 49180    | 100 | 44142    | 100 | 50158    | 100 | 44594    |
| Revenue change vs. Random-Random |  |  | (10.28%) |  | (-1.01%) |  | (12.48%) |  | (0.00%) |

Table IV. Impact of different 3D assignment strategies on the number of processors in each speed bin as well as the total revenue.

## 5.5 Impact on Revenue

As we discussed earlier in Section 4, after fabrication and binning, IC manufacturers price chips according to their bin. We follow the same strategy with the 3D chips produced from our different integration strategies. With the number of processors in each bin in hand from Figure 7, we readily calculate the total revenues from applying the different integration strategies and report them in Table IV. We multiply the number of processors in each bin by the market price of the bin (as given in Figure 4 according to pricegrabber.com) and sum over all bins to give the total revenues. The results show that the optimal strategy yields an increase of 12.48% in total revenues compared to the random-random strategy. The FF strategy comes second with an increase of 10.28%.
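The revenue calculation is a weighted sum over bins, which the following minimal sketch reproduces for the OPT and FF columns of Table IV. The function names are hypothetical, and the Random-Random baseline is taken as printed in Table IV rather than recomputed.

```python
# Market price ($) per chip for each bin, copied from Table IV.
PRICE = {"extreme": 1081, "fast": 538, "medium": 325, "slow": 240}


def total_revenue(chips_per_bin):
    """Sum over bins of (number of chips in the bin) x (market price of the bin)."""
    return sum(PRICE[name] * count for name, count in chips_per_bin.items())


def relative_gain_percent(revenue, baseline_revenue):
    """Revenue change relative to a baseline strategy, in percent."""
    return 100.0 * (revenue - baseline_revenue) / baseline_revenue


# Chip counts per bin for the Optimal and Fast-Fast strategies (Table IV).
opt = total_revenue({"extreme": 14, "fast": 38, "medium": 36, "slow": 12})
ff = total_revenue({"extreme": 14, "fast": 37, "medium": 28, "slow": 21})

RR_REVENUE = 44594  # Random-Random total as printed in Table IV

opt_gain = relative_gain_percent(opt, RR_REVENUE)
ff_gain = relative_gain_percent(ff, RR_REVENUE)
```

Here `opt` evaluates to 50158 and `ff` to 49180, matching the totals in Table IV, and the gains over the printed Random-Random total round to 12.48% and 10.28%.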

## 5.6 Incorporating Leakage Current into Parametric Yield Analysis

The last stage of our experiments takes into account leakage current during parametric yield analysis. As described earlier in Section 3, we model the total 3D IC leakage as the sum of leakage of its constituent die at the operating temperature that is determined

Fig. 8. Impact of incorporating leakage into the different 3D assignment strategies. (a) Number of 3D processors with excessive leakage above $L_{\text{max}}$ (for different $L_{\text{max}}$ thresholds). (b) Total revenues for different $L_{\text{max}}$ thresholds (the values are normalized with respect to RR).

by the spatial and temporal variations in temperature. For our simulations, we follow a simple approach, where we model the leakage consumption of the cache and the CPU at a constant temperature value (i.e. at room temperature).

With a leakage-oblivious integration strategy, it is possible that a particular 3D IC assignment would exceed an imposed *leakage budget or threshold* ( $L_{max}$ ) and consequently be discarded as unusable. To illustrate this point, we choose different values for leakage threshold ( $L_{max}$ ) by multiplying the leakage power of the nominal 3D processor (that is,




Fig. 9. Number of good and bad 3D ICs using leakage as the main binning strategy. (Plot title: #failing / #passing chips distribution.)

with no variations) by 1.5, 2, 2.5 and 3. For example, $L_{\text{max}}=3$ means that the leakage threshold is three times the leakage power consumed by the nominal 3D chip at 70nm. Figure 8(a) shows the number of 3D chips that exceed the leakage threshold for the four matching strategies and for different $L_{\text{max}}$ thresholds. As expected, the number of faulty chips decreases for larger values of $L_{\text{max}}$. Furthermore, the FS and the OPT strategies result in lower losses than the FF and RR strategies for smaller values of the $L_{\text{max}}$ threshold. These results agree with our discussions in Section 4. Because processors with excessive leakage reduce total revenues, we take this into account and compute the total revenue for different integration strategies at different $L_{\text{max}}$ thresholds. We plot the results in Figure 8(b). All the values reported in Figure 8(b) are normalized with respect to the RR strategy with $L_{\text{max}} = 1.5$. At large threshold values, the OPT and FF strategies yield the largest revenue (as anticipated from Table IV). However, as the $L_{\text{max}}$ thresholds are lowered, FF loses ground as it produces systems with excessive leakage, and FS starts to look like a more promising choice. The OPT strategy holds its ground for different thresholds as it optimally maximizes the performance while discarding systems that exceed the $L_{\text{max}}$ threshold.
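The leakage screening step can be sketched as below, with the leakage of a 3D chip taken as the sum of its CPU and L2 leakage. The nominal leakage value and the function names are hypothetical placeholders; the nominal leakage of the 3D processor is not given in this section.

```python
def leakage_budget(multiplier, nominal_leakage_w):
    """Leakage budget in watts: a multiple (e.g. 1.5, 2, 2.5, 3) of the nominal leakage."""
    return multiplier * nominal_leakage_w


def split_by_leakage(pairs, cpu_leak_w, l2_leak_w, budget_w):
    """Partition (cpu, l2) index pairs into passing and failing lists.

    A pair fails when the sum of its CPU and L2 leakage exceeds the budget.
    """
    passing, failing = [], []
    for cpu, l2 in pairs:
        total = cpu_leak_w[cpu] + l2_leak_w[l2]
        if total <= budget_w:
            passing.append((cpu, l2))
        else:
            failing.append((cpu, l2))
    return passing, failing
```

Only the passing pairs would then contribute to the revenue total of a given strategy at that threshold.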

In another experiment, we pursue a leakage-driven strategy where binning is driven solely by leakage (as explained in Section 4.3). Here we set an aggressive leakage threshold equal to the sum of the average CPU leakage and the average L2 cache leakage. From Table I, this threshold is approximately 13W. We then test the various integration strategies and count the number of chips in two leakage bins (passing and failing). Figure 9 gives the percentage of good and bad 3D ICs for the different strategies. OPT\_LKG corresponds to Strategy B of Section 4.3. APRIORI corresponds to a simple strategy where all CPU (and L2) chips that are below the average are thrown out in advance. Here again we see that our optimal assignment strategy wins out over the other matching schemes: half as many 3D ICs end up being marked as failing compared to the next best strategy (FF). Clearly, an *a priori* strategy leads to an unacceptable number of failing chips.
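For reference, the threshold follows directly from the average leakage values reported in Table I (the bar notation for averages is introduced here for illustration only):

$$L_{\text{max}} = \bar{L}_{\text{CPU}} + \bar{L}_{\text{L2}} = 5.09\,\text{W} + 8.22\,\text{W} = 13.31\,\text{W} \approx 13\,\text{W}$$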

## 5.7 Practical Considerations at Fabrication

While we have resorted to modeling and simulations to evaluate the impact of process variations on the CPU and L2 cache dies, the situation is actually simpler at a fabrication facility. Speed testing is executed to label the speed of each die. These speeds are then used to calculate the access time of each L2 cache die in terms of CPU cycles for each possible CPU/L2 combination. The handful of values obtained for cache latency in cycles are used as indices into a pre-computed CPI lookup table (e.g., Table II). The only thing that needs to be computed at fabrication time is the assignment algorithm that determines the optimal way to integrate the various CPU and L2 cache die. The access time of each cache can be determined after fabrication. We could reasonably assume an asynchronous design for big L2/L3 caches. The Intel Montecito processor, for example, includes an asynchronous 24 Mbyte L3 cache [Wuu et al. 2005]. Further, FPGA companies [Lattice] have introduced programmable SRAM controllers that allow the system to select the optimum SRAM latency.

Another important consideration during fabrication is testing costs. 3D technology will only go mainstream if it is cost-effective in comparison to 2D ICs. The costs of manufacturing 3D ICs depend on many factors, including the bonding technology, the method of integration, and the testing costs. A number of recent articles quantify the costs and benefits of using 3D ICs over 2D ICs [Smith et al. 2007a; 2007b]. From a cost perspective, the benefits of our proposed parametric yield improvement strategies should be weighed against any test and assembly costs necessary to carry them out. Smith *et al.* [Smith et al. 2007b] assume that testing costs per die in 3D technology are around \$0.20. We believe that such testing must be carried out anyway to filter out the bad die in the first place. It is also plausible to assume that as 3D IC technology matures, the testing costs will decrease.

# 6. CONCLUSIONS

In this paper, we have investigated the problem of maximizing parametric yield and profits in 3D integrated circuits. We have proposed strategies to model the parametric yield of 3D ICs and to optimally pair different die together such that performance, leakage current, and revenues are improved overall. We have tested our approach using a 3D processor as a potential example of 3D ICs. Compared to a strategy of randomly pairing CPUs and caches together, our optimal assignment scheme leads to an improvement of up to 6.5% in MIPS and a 12.5% increase in revenue. In a market where profit margins for computer systems may be relatively small, this increase in revenue can translate to a substantial increase in profits. It is also significant to compare our optimal strategy to a fast-fast scheme, where the fastest CPUs are paired with the fastest caches, leaving slow processors to be paired with slow caches. While this greedy strategy may increase the total number of fastest possible 3D processors, it does so at the expense of producing a large number of slow processors and a higher overall chance of producing 3D ICs with excessive leakage. In comparison, our optimal matching strategy reduces the total number of slowest processors by almost half, while maximizing the number of fastest processors that do not exceed the imposed maximum leakage thresholds.

# REFERENCES

- BAAS, B. M. 1999. A Low-Power, High-Performance 1024-Point FFT Processor. *IEEE Journal of Solid-State Circuits 34*(3), 380–387.
- BALIGA, J. 2004. Chips Go Vertical. IEEE Spectrum Magazine 41(3), 43-47.
- BANERJEE, K., SOURI, S. J., KAPUR, P., AND SARASWAT, K. C. 2001. 3-D ICs: A Novel Chip Design for Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integration. *Proceedings of the IEEE 89*(5), 602–633.
- BENKART, P., KAISER, A., MUNDING, A., BSCHORR, M., PFLEIDERER, H.-J., KOHN, E., HEITTMANN, A., HUEBNER, H., AND RAMACHER, U. 2005. 3D Chip Stack Technology Using Through-Chip Interconnects. *IEEE Design & Test of Computers 22(6)*, 512–518.

- BEYNE, E. 2004. 3D Interconnection and Packaging: Impending Reality or Still a Dream? In IEEE International Solid-State Circuits Conference. 138–139.
- BHARDWAJ, S., VRUDHULA, S., GHANTA, P., AND CAO, Y. 2006. Modeling of Intra-Die Process Variation for Accurate Analysis and Optimization of Nano-Scale Circuits. In *Proc. Design Automation Conference*. 791–796.
- BORKAR, S., KARNIK, T., NARENDRA, S., TSCHANZ, J., KESHAVARZI, A., AND DE, V. 2003. Parameter Variations and Impact on Circuits and Microarchitecture. In *Proc. Design Automation Conference*. 338–342.
- BOWMAN, K., DUVALL, S., AND MEINDL, J. 2002. Impact of Die-to-Die and Within-Die Parameter Fluctuations on the Maximum Clock Frequency Distribution for Gigascale Integration. *IEEE Journal of Solid State Electronics* 37(2), 183–190.
- BURGER, D. C. AND AUSTIN, T. M. 1997. The SimpleScalar Tool Set, Version 2.0. Tech. Rep. CS-TR-1997-1342.
- BURNS, J. A., AULL, B. F., CHEN, C., CHEN, C.-L., KEAST, C. L., KNECHT, J., SUNTHARALINGAM, V., WARNER, K., WYATT, P., AND YOST, D.-R. 2006. A Wafer-Scale 3-D Circuit Integration Technology. *IEEE Transactions on Electron Devices* 53(10), 2507–2516.
- COOK, W. AND ROHE, A. 1999. Computing Minimum Weight Perfect Matchings, http://www.or.unibonn.de/home/rohe/matching.html. *INFORMS J. Computing* 11, 138–148.
- CORY, B. D., KAPUR, R., AND UNDERWOOD, B. 2003. Speed Binning with Path Delay Test in 150-nm Technology. *IEEE Design & Test of Computers 20(5)*, 41–45.
- DAS, S., CHANDRAKASAN, A., AND REIF, R. 2004. Timing, Energy, and Thermal Performance of Three-Dimensional Integrated Circuits. In Proc. Great Lakes Symposium on VLSI. 338–343.
- DATTA, A., BHUNIA, S., CHOI, J. H., MUKHOPADHYAY, S., AND ROY, K. 2006. Speed Binning Aware Design Methodology to Improve Profit under Parameter Variations. In Proc. Asia and South Pacific Design Automation Conference. 712–717.
- DAVIS, J. A., VENKATESAN, R., KALOYEROS, A., BEYLANSKY, M., SOURI, S. J., BANERJEE, K., SARASWAT, K. C., RAHMAN, A., REIF, R., AND MEINDL, J. 2001. Interconnect Limits on Gigascale Integration (GSI) in the 21st Century. *Proceedings of the IEEE 89(3)*, 305–324.
- DAVIS, W. R., WILSON, J., MICK, S., XU, J., HUA, H., MINEO, C., SULE, A., STEER, M., AND FRANZON, P. D. 2005. Demystifying 3D ICs: The Pros and Cons of Going Vertical. *IEEE Design & Test of Computers* 22(6), 498–510.
- FUKUSHIMA, T., YAMADA, Y., AND KOYANAGI, M. 2006. New Three-Dimensional Integration Technology Using Chip-to-Wafer Bonding to Achieve Ultimate Super-Chip Integration. *Japanese Journal of Applied Physics* 45(4B), 3030–3035.
- GAREY, M. R. AND JOHNSON, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness, first (twenty-third printing) ed. W.H. Freeman and Company.
- GROSSAR, E., STUCCHI, M., MAES, K., AND DEHAENE, W. 2006. Statistically Aware SRAM Memory Array Design. In International Symposium on Quality Electronic Design. 25–30.
- HUMENAY, E., ARJAN, D., AND SKADRON, K. 2006. Impact of Parameter Variations on Multi-Core Chips. In Workshop on Architectural Support for Gigascale Integration. 1–9.
- IM, S. AND BANERJEE, K. 2000. Full Chip Thermal Analysis of Planar (2-D) and Vertically Integrated (3-D) High Performance ICs. In *IEEE International Electron Devices Meeting*. 727–730.
- ITRS. International Technology Roadmap for Semiconductors. http://public.itrs.net.
- JACOB, P., ERDOGAN, O., ZIA, A., BELEMJIAN, P. M., KRAFT, R. P., AND MCDONALD, J. F. 2005. Predicting the Performance of a 3D Processor-Memory Chip Stack. *IEEE Design & Test of Computers 22(6)*, 540–547.
- KIM, C., KIM, J.-J., CHANG, I.-J., AND ROY, K. 2006. PVT-Aware Leakage Reduction for On-Die Caches with Improved Read Stability. *IEEE Journal of Solid-State Circuits* 41(1), 170–178.
- KUHN, H. W. 1955. The Hungarian Method for the Assignment Problem. *Naval Research Logistic Quarterly* 2, 83–97.
- LATTICE. LatticeMico32 Asynchronous SRAM Controller Datasheet. http://www.latticesemi.com/ documents/doc21610x19.pdf.

- LI, F., NICOPOULOS, C., RICHARDSON, T., XIE, Y., NARAYANAN, V., AND KANDEMIR, M. 2006. Design and Management of 3D Chip Multiprocessors Using Network-in-Memory. In *International Symposium on Computer Architecture*. 130–141.
- LIU, C. C., GANUSOV, I., BURTSCHER, M., AND TIWARI, S. 2005. Bridging the Processor-Memory Performance Gap with 3D IC Technology. *IEEE Design & Test of Computers* 22(6), 556–564.
- LOI, G. L., AGRAWAL, B., SRIVASTAVA, N., LIN, S.-C., SHERWOOD, T., AND BANERJEE, K. 2006. A Thermally-Aware Performance Analysis of Vertically Integrated (3-D) Processor-Memory Hierarchy. In *Proc. Design Automation Conference*. 991–996.
- MARCULESCU, D. AND TALPES, E. 2005. Variability and Energy Awareness: A Microarchitecture-Level Perspective. In Proc. Design Automation Conference. 11–16.
- MENG, K. AND JOSEPH, R. 2006. Process Variation Aware Cache Leakage Management. In International Symposium on Low-Power Electronics. 262–267.
- MUNKRES, J. 1957. Algorithms for the Assignment and Transportation Problems. *Journal of the Society of Industrial and Applied Mathematics* 5(1), 32–38.
- ORSHANSKY, M., MILNOR, L., CHEN, P., KEUTZER, K., AND HU, C. 2002. Impact of Spatial Intrachip Gate Length Variability on the Performance of High-Speed Digital Circuits. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 21(5), 544–553.
- PATTI, R. S. 2006. Three-Dimensional Integrated Circuits and the Future of Systems-on-Chip Designs. Proceedings of IEEE 94(6), 1214–1224.
- PTM. Predictive technology model. http://www.eas.asu.edu/~ptm/introduction.html.
- RAJ, S., VRUDHULA, S. B. K., AND WANG, J. 2004. A Methodology to Improve Timing Yield in The Presence of Process Variations. In Proc. Design Automation Conference. 448–453.
- RAO, R. R., BLAAUW, D., SYLVESTER, D., AND DEVGAN, A. 2005. Modeling and Analysis of Parametric Yield Under Power and Performance Constraints. *IEEE Design & Test of Computers* 22(4), 376–385.
- REIF, R., FAN, A., CHEN, K.-N., AND DAS, S. 2002. Fabrication Technologies for Three-Dimensional Integrated Circuits. In *International Symposium on Quality Electronic Design Automation*. 33–37.
- SAXENA, P., MENEZES, N., COCCHINI, P., AND KIRKPATRICK, D. A. 2004. Repeater Scaling and Its Impact on CAD. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 23(4), 451–463.
- SCHEIRING, C. 2004. Advanced-Chip-to-Wafer Technology: Enabling Technology for Volume Production of 3D System Integration on Wafer Level. In *International Microelectronics And Packaging Society*. 1–11.
- SMITH, L., SMITH, G., HOSALI, S., AND ARKALGUD, S. 2007a. 3-D: It All Comes Down to Cost. In 3-D Architectures for Semiconductor Integration and Packaging.
- SMITH, L., SMITH, G., HOSALI, S., AND ARKALGUD, S. 2007b. Yield Considerations in the Choice of 3D Technology. In *IEEE Intl. Sympos. on Semiconductor Manufacturing*. 535–537.
- SPEC. 2000. SPEC 2000 benchmarks. http://www.spec.org/cpu/.
- TOPOL, A. W., LA TULIPE, D. C., JR., SHI, L., FRANK, D. J., BERNSTEIN, K., STEEN, S. E., KUMAR, A., SINGCO, G. U., YOUNG, A. M., GUARINI, K. W., AND IEONG, M. 2006. Three-dimensional Integrated Circuits. *IBM Journal of Res. and Dev. 50(4-5)*, 491–506.
- TSAI, Y.-F., XIE, Y., VIJAYKRISHNAN, N., AND IRWIN, M. J. 2005. Three-Dimensional Cache Design Exploration Using 3D Cacti. In International Conference on Computer Design. 519–524.
- WILTON, S. AND JOUPPI, N. P. 1996. CACTI: An Enhanced Cache Access and Cycle Time Model. IEEE Journal Solid-State Circuits 31(5), 677–688.
- WUU, J., WEISS, D., MORGANTI, C., AND DREESEN, M. 2005. The Asynchronous 24MB On-Chip Level-3 Cache for a Dual-Core Itanium Family Processor. In *International Solid-State Circuits Conference*. 488–612.
- XIE, Y., LOH, G. H., BLACK, B., AND BERNSTEIN, K. 2006. Design Space Exploration for 3D Architectures. *J. Emerg. Technol. Comput. Syst.* 2, 2, 65–103.
- ZENG, A., LI, J., ROSE, K., AND GUTMANN, R. J. 2005. First-Order Performance Prediction of Cache Memory with Wafer-Level 3D Integration. *IEEE Design & Test of Computers* 22(6), 548–555.