Abstract—This paper introduces a novel class of linear soft-input soft-output detectors with boosted communications performance. The detector showed an SNR gain of up to 2.4 dB compared to state-of-the-art linear detectors. We introduce a low-complexity algorithm tailored for VLSI implementation, and propose a suitable architecture. The developed ASIC demonstrates the feasibility and efficiency of the concept, achieving the IEEE 802.11n standard’s peak data rate of 600 Mbit/s.

Keywords—VLSI architecture; Digital signal processing; MIMO Systems; Wireless Communications; Data detection

I. INTRODUCTION

Multi-antenna (MIMO) systems that employ bit-interleaved coded modulation (BICM) are already adopted for wireless standards such as IEEE 802.11n. The extension of BICM with Iterative Decoding (BICM-ID) is a promising option that provides superior communications performance at high spectral efficiency. However, the data detection within these systems poses particular challenges to VLSI implementation.

The exact MIMO detector suffers from an exponential complexity. Linear detectors, e.g. using MMSE filtering, are an economically reasonable option particularly suitable for VLSI designs. These offer a simple algorithmic structure, are pure feed-forward and thus deeply pipelineable. There is only simple signal processing, without any non-linear control flow. The linear algorithms exhibit inherent pipeline, symbol-level and stream-level parallelism.

Additionally, it was shown in [6] that systems using linear detectors can actually have a better spectral efficiency than with close-to-optimal detectors, for a given maximum area constraint. The reported analysis indicates that the VLSI designs’ properties might be more important than the communications performance in certain cases.

Ideally, we would like to improve the communications performance of linear detectors while maintaining all beneficial implementation properties. A recent publication [1] on detection algorithms based on the Expectation Propagation framework highlighted new opportunities to improve linear detection. In this paper, we propose the EPIC detector, the Expectation Propagation based MIMO Detector ASIC, which builds on [1].

Contribution: We show how to build a detector that combines the VLSI implementation efficiency of linear detectors with a greatly improved communications performance. To this end, we propose a low-complexity algorithm and provide an ASIC case study. Our post-layout results demonstrate our concept’s competitive implementation complexity in terms of area and throughput compared to related work.

Outline: After the description of the assumed system model, we give an overview of related work. Subsequently, we present a low-complexity algorithm and a suitable architecture design. In the results section, we evaluate the algorithmic performance and hardware figures of our approach.

II. SYSTEM MODEL

Fig. 1 depicts the considered spatial multiplexing $N_t \times N_r$ MIMO system with BICM-ID. A message $b \in \{0,1\}^{N_b}$ is encoded with rate $r = N_b/N_c$ and interleaved, yielding the code word $c \in \{0,1\}^{N_c}$. Let $\mathcal{X} \subset \mathbb{C}$ be a modulation alphabet with $K = \log_2 |\mathcal{X}|$ bits per symbol. The code word is partitioned into $N_s$ subvectors $c_n \in \{0,1\}^{K N_t}$. They are subsequently mapped to symbol vectors $x_n \in \mathcal{X}^{N_t}$ that are transmitted independently. Assuming a frequency-flat fading channel characterized by $H_n \in \mathbb{C}^{N_r \times N_t}$ and perfect synchronization in time and frequency, the received symbol vector at time $n$ is $y_n = H_n x_n + w_n$, where $w_n \in \mathbb{C}^{N_r}$ is a white Gaussian noise process with $E[w_n w_n^H] = N_0 I_{N_r}$. In the remainder, we drop the time index $n$ for convenience.

Perfect channel knowledge at the receiver is assumed. Using iterative MIMO decoding, detector and channel decoder exchange extrinsic soft information $\lambda^e = \lambda^p - \lambda^a$ in terms of log-likelihood ratios (LLRs), where $\lambda^p$ are the detector’s posterior LLRs and $\lambda^a$ are the prior LLRs fed back from the decoder. A BICM-ID receiver thus requires a soft-input soft-output (SISO) channel decoder and detector. In one receiver iteration, marked in Fig. 1, detector and decoder are executed once, in that order.

A. Related Work

Approaches for MIMO detector design can be divided into two categories: linear and non-linear. Non-linear detectors treat the detection problem as a search for the most likely symbol vector candidate $x$ in the space $\mathcal{X}^{N_t}$. Their VLSI implementation is complex; however, they regularly have better communications performance than linear detectors.

For example, the single tree search sphere decoder (STS-SD) [6] traverses a tree representing the search space in a depth-first manner, to look for the most likely candidate and the best counter-hypotheses. It is shown to achieve max-log optimality. However, to guarantee a minimum throughput, the search has to be broken off early, which results in significant losses.

To alleviate this problem, K-Best detectors perform a breadth-first tree search and keep only the K best candidates per transmit layer. This gives a constant deterministic run-time. The algorithmic performance is suboptimal, as it might miss many good candidates and fail to find good counter-hypotheses. No VLSI implementation of SISO K-Best detection is reported so far.

Fixed-complexity sphere decoders (FCSD) [4] also feature a deterministic run-time, but better communications performance. The computational effort is directed towards nodes that are more likely to change the final result (imbalanced expansion). Usually, a full search is performed for the two weakest transmit layers. On the remaining layers, only the best per-stream symbols are selected. The Parallel Candidate Adding scheme [4] helps to find counter-hypotheses.

Another approach [7] structures the search space as a trellis diagram with $N_t$ columns and $|\mathcal{X}|$ rows. For every trellis node, a specified number of shortest paths is found. This guarantees the availability of suitable information for every bit. The reported VLSI design is quite large while supporting only 16-QAM.

To get particularly small VLSI designs, Markov-Chain Monte-Carlo (MCMC) detection [5] is a good option. The search is performed randomly. Essentially, this removes almost all the control logic for candidate selection, and leaves only the evaluation logic. However, the detectors are rather slow, offering only a low throughput, and the architecture parameter choice is still an open question.

On the other side, linear detectors try to estimate the most likely symbol distribution. They assume that the transmit symbol distribution is Gaussian, i.e. $x \sim \mathcal{CN}$. The idea is to separate the spatially super-imposed transmit streams in order to process them individually. This greatly reduces the processing complexity, however also results in a suboptimal communications performance. A well-known VLSI design is the MMSE-PIC [2], that performs MMSE detection with parallel interference cancellation.

III. EPIC DETECTOR

The following subsections introduce the proposed low-complexity detection algorithm. We tailored the algorithm reported in [1] specifically for VLSI implementation. The concept was partially presented as a two-page abstract [12].

A. Algorithm of the EPIC Detector

The idea is to use regular SISO MMSE MIMO detection, but to boost it in the first receiver iteration, when the decoder’s message is not available yet, using an initial SO-only MMSE MIMO detection.

The proposed detector, depicted in Fig. 1, in principle resembles a standard IC-LMMSE detector. It uses an interference cancellation followed by an LMMSE filter and per-stream demapping. We add an extra LMMSE detector at the front. Upon the reception of a symbol vector, this detector provides additional prior knowledge to the subsequent IC-LMMSE detector. While iterating between detector and decoder, only the IC-LMMSE detector is active.

The LMMSE detector corresponds to the first two blocks of the detector in Fig. 1, while the IC-LMMSE MIMO detector corresponds to the other two blocks. The second internal detector requires a slight modification in order to process the first detector’s output. In the first receiver iteration, when the decoder’s feedback is not available, i.e. $\lambda^a = 0$, the first internal detector provides the prior knowledge in terms of LLRs denoted as $\lambda^{\text{int}}$ and its LMMSE filter’s output to the second internal detector.

In general, the four internal blocks exchange information in terms of Gaussian distributions. We parametrize the messages produced by the equalizers with $\check{\mu}_t$ and $\check{\sigma}_t^2$. The demappers’ messages have the parameters $\tilde{\mu}_t$ and $\tilde{\sigma}_t^2$. Both messages, combined according to

$$\frac{\mu_t}{\sigma_t^2} = \frac{\check{\mu}_t}{\check{\sigma}_t^2} + \frac{\tilde{\mu}_t}{\tilde{\sigma}_t^2} \quad \text{and} \quad \frac{1}{\sigma_t^2} = \frac{1}{\check{\sigma}_t^2} + \frac{1}{\tilde{\sigma}_t^2},$$

(1)

express the EPIC detector’s current belief on the transmit symbols with parameters $\mu_t$ and $\sigma_t^2$. The proposed EPIC detector’s algorithm is listed in Fig. 2.

```
1: if First Receiver Iteration ($\lambda^a = 0$) then  
2: \quad $\check{\mu}_t, \check{\sigma}_t^2 = \text{LMMSE}(y, H, N_0)$  
3: \quad $\lambda^\text{int} = \text{demap}(\check{\mu}_t, \check{\sigma}_t^2)$  
4: \quad $\mu_t, \sigma_t^2 = \text{map}(\lambda^\text{int})$  
5: \quad $\tilde{\mu}_t, \tilde{\sigma}_t^2 = \text{gaussdiv}(\mu_t, \sigma_t^2, \check{\mu}_t, \check{\sigma}_t^2)$  
6: else  
7: \quad $\tilde{\mu}_t, \tilde{\sigma}_t^2 = \text{map}(\lambda^a)$  
8: end if  
9: $\check{\mu}_t, \check{\sigma}_t^2 = \text{IC-LMMSE}(y, H, N_0, \tilde{\mu}_t, \tilde{\sigma}_t^2)$  
10: $\lambda^e = \text{demap}(\check{\mu}_t, \check{\sigma}_t^2)$
```

Figure 2. The EPIC Detector’s underlying algorithm.

The bridge between the two internal detectors lies in line 5. The function `gaussdiv` performs a “division” of Gaussian distributions. It computes

$$\tilde{\sigma}_t^2 = \left( \frac{1}{\sigma_t^2} - \frac{1}{\check{\sigma}_t^2} \right)^{-1} \quad \text{and} \quad \tilde{\mu}_t = \tilde{\sigma}_t^2 \left( \frac{\mu_t}{\sigma_t^2} - \frac{\check{\mu}_t}{\check{\sigma}_t^2} \right)$$

(2)

using the mapper’s outputs $\mu_t$, $\sigma_t^2$ and the LMMSE filter’s outputs $\check{\mu}_t$, $\check{\sigma}_t^2$ as inputs.
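The relation between the combination rule (1) and the division (2) can be illustrated with a minimal sketch for scalar complex Gaussians. The function names `gauss_combine` and `gauss_divide` and the numeric values are illustrative assumptions, not taken from the paper.

```python
def gauss_combine(m1, v1, m2, v2):
    """Multiply two Gaussians: precisions add, precision-weighted means add."""
    var = 1.0 / (1.0 / v1 + 1.0 / v2)
    mean = var * (m1 / v1 + m2 / v2)
    return mean, var


def gauss_divide(m_num, v_num, m_den, v_den):
    """Divide one Gaussian by another: precisions and weighted means subtract."""
    var = 1.0 / (1.0 / v_num - 1.0 / v_den)
    mean = var * (m_num / v_num - m_den / v_den)
    return mean, var


# Combine an equalizer-like message with a second message, then divide
# the equalizer-like message out again to recover the second one.
eq_mean, eq_var = 0.5 + 0.5j, 1.0
other_mean, other_var = 1.0 - 1.0j, 0.25
belief_mean, belief_var = gauss_combine(eq_mean, eq_var, other_mean, other_var)
rec_mean, rec_var = gauss_divide(belief_mean, belief_var, eq_mean, eq_var)
```

In this sketch, dividing the combined belief by one of its factors returns the other factor, which is the sense in which the division extracts the information that was not already present in the equalizer’s message.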

Additionally, the algorithm assumes that it obtains the decoder’s posterior LLRs, and it neglects the prior knowledge in the final demapping. We therefore omit both subtractions between detector and decoder in Fig. 1 and directly compute $\lambda^e$.

B. Algorithmic Building Blocks

In the following subsection, we introduce low-complexity versions of the algorithmic functions used in Fig. 2 that are particularly suitable for VLSI implementation.

1) IC-LMMSE and LMMSE Equalizer: Lines 2 and 9 of Fig. 2 use MMSE equalization. We first describe the IC-LMMSE equalizer and subsequently the differences to the LMMSE equalizer.

From the inputs $y$, $H$, $N_0$ and the prior parameters $\tilde{\mu}_t$ and $\tilde{\sigma}_t^2$, we want to calculate the equalizer’s output parameters $\check{\mu}_t$ and $\check{\sigma}_t^2$. First, compute the Gram matrix $R$ and the matched filter output $y^\text{mf}$ as

$$R = H^H H \in \mathbb{C}^{N_t \times N_t} \quad \text{and} \quad y^\text{mf} = H^H y$$

(3)

which we refer to as preprocessing step in the remainder. Perform interference cancellation on $y^\text{mf}$ to obtain

$$y^\text{mf,ic} = y^\text{mf} - \sum_{t=1}^{N_t} r_t \tilde{\mu}_t$$

(4)

where $r_t$ denotes the $t$-th column of $R$. Then decompose the matrix

$$C_y = R + N_0 C_{x}^{-1} = LDL^H$$

(5)

where $C_{x} = \text{diag}(\tilde{\sigma}_1^2, \ldots, \tilde{\sigma}_{N_t}^2)$ contains the prior variances, into a lower triangular matrix $L = [l_{ij}] \in \mathbb{C}^{N_t \times N_t}$ and a positive real-valued diagonal matrix $D = \text{diag}(d_{ii}) \in \mathbb{R}^{N_t \times N_t}$. The matrix $L$ has a unit diagonal, $l_{ii} = 1$. For numerical reasons, we limit the noise density to a lower bound $N_{\text{min}}$ and the elements $d_{ii}$ to $d_{\text{min}}$. Subsequently, invert the matrices using forward and back substitution to obtain

$$G^H = C_y^{-1} = (L^H)^{-1} D^{-1} L^{-1}$$

(6)

by solving $LDL^H G^H = I_{N_t}$ for $G^H$. The algorithm reuses the reciprocals $d_{ii}^{-1}$ computed for the decomposition in the back substitution step. Compute the bias term, here denoted $\beta_t$,

$$\beta_t = g_t^H r_t$$

(7)

and its reciprocal $\beta_t^{-1}$, then compute new estimates of the per-stream symbol means and variances as

$$\check{\mu}_t = \tilde{\mu}_t + \beta_t^{-1} g_t^H y^\text{mf,ic} \quad \text{and} \quad \check{\sigma}_t^2 = \tilde{\sigma}_t^2 (\beta_t^{-1} - 1)$$

(8)

where $g_t^H$ denotes the $t$-th row of $G^H$, and $\tilde{\mu}_t$, $\tilde{\sigma}_t^2$ are the prior mean and variance of stream $t$. For numerical reasons, we limit $\beta_t$ to a lower value $\beta_{\text{min}}$, and the variances $\check{\sigma}_t^2$ to $\check{\sigma}_{\text{min}}^2$. The last step is to compute the per-stream SNRs $\check{\rho}_t = 1/\check{\sigma}_t^2$.

In line 2, perform LMMSE equalization since there is no prior knowledge available. To this end, we set $\tilde{\mu}_t = 0$ and $\tilde{\sigma}_t^2 = E_s$, where $E_s$ denotes the average transmit symbol energy, in the above description and do not perform interference cancellation, i.e. assume $y^\text{mf,ic} = y^\text{mf}$.
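The equalizer steps (3) to (8) can be sketched in floating point as follows. This is a minimal illustration under stated assumptions, not the paper’s fixed-point implementation: the function names, the example channel and the QPSK-like symbols are hypothetical, and the numerical limiters ($N_{\text{min}}$, $d_{\text{min}}$ and the bounds on bias and variance) are omitted.

```python
def ldl_decompose(c):
    """Return (l, d) with c = l * diag(d) * l^H and l unit lower triangular."""
    n = len(c)
    l = [[1.0 + 0j if i == j else 0j for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = (c[j][j] - sum(abs(l[j][k]) ** 2 * d[k] for k in range(j))).real
        for i in range(j + 1, n):
            acc = c[i][j] - sum(l[i][k] * d[k] * l[j][k].conjugate() for k in range(j))
            l[i][j] = acc / d[j]
    return l, d


def ldl_solve(l, d, b):
    """Solve (l diag(d) l^H) x = b by forward and back substitution."""
    n = len(b)
    z = list(b)
    for i in range(n):  # forward substitution: l z = b
        z[i] = b[i] - sum(l[i][k] * z[k] for k in range(i))
    z = [z[i] / d[i] for i in range(n)]  # diagonal scaling by 1/d
    x = list(z)
    for i in reversed(range(n)):  # back substitution: l^H x = z
        x[i] = z[i] - sum(l[k][i].conjugate() * x[k] for k in range(i + 1, n))
    return x


def ic_lmmse(y, h, n0, prior_mean, prior_var):
    """Per-stream unbiased MMSE estimates (mean, variance) given Gaussian priors."""
    nr, nt = len(h), len(h[0])
    hh = [[h[r][t].conjugate() for r in range(nr)] for t in range(nt)]  # H^H
    r_mat = [[sum(hh[i][k] * h[k][j] for k in range(nr)) for j in range(nt)]
             for i in range(nt)]                                        # Gram matrix
    y_mf = [sum(hh[i][k] * y[k] for k in range(nr)) for i in range(nt)]
    y_ic = [y_mf[i] - sum(r_mat[i][t] * prior_mean[t] for t in range(nt))
            for i in range(nt)]                                         # interference cancellation
    c_y = [[r_mat[i][j] + (n0 / prior_var[i] if i == j else 0.0) for j in range(nt)]
           for i in range(nt)]
    l, d = ldl_decompose(c_y)
    mean, var = [], []
    for t in range(nt):
        unit = [1.0 if i == t else 0.0 for i in range(nt)]
        col = ldl_solve(l, d, unit)            # column t of the inverse of C_y
        g_row = [v.conjugate() for v in col]   # row t, since the inverse is Hermitian
        beta = sum(g_row[i] * r_mat[i][t] for i in range(nt)).real  # bias term
        mean.append(prior_mean[t] + sum(g_row[i] * y_ic[i] for i in range(nt)) / beta)
        var.append(prior_var[t] * (1.0 / beta - 1.0))
    return mean, var


# LMMSE mode (zero prior mean, prior variance E_s = 2) on a noiseless 2x2 example.
h = [[1, 0.5j], [0.2, 1]]
x = [1 + 1j, -1 + 1j]
y = [sum(h[r][t] * x[t] for t in range(2)) for r in range(2)]
mean, var = ic_lmmse(y, h, 1e-6, [0j, 0j], [2.0, 2.0])
```

With a very small noise density the estimates approach the transmitted symbols and the output variances become small, which gives a quick plausibility check of the bias compensation in (8).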

2) Demapper: In lines 3 and 10, we compute the per-stream LLRs $\lambda^\text{int}$ and $\lambda^e$, respectively. Using the max-log approximation and omitting the priors $\lambda^a$, demap the equalizer’s output mean $\check{\mu}_t$ and per-stream SNR $\check{\rho}_t = 1/\check{\sigma}_t^2$ according to

$$\lambda_{t,k} = \check{\rho}_t \left( \min_{x \in \mathcal{X}_k^0} |x - \check{\mu}_t|^2 - \min_{x \in \mathcal{X}_k^1} |x - \check{\mu}_t|^2 \right)$$

(9)

where $\mathcal{X}_k^0$ and $\mathcal{X}_k^1$ are the subsets of $\mathcal{X}$ with the $k$-th bit set to zero or one, respectively. The difference of the min-terms is computed with piece-wise linear functions [9].
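As a reference for the max-log rule in (9), the following sketch evaluates both minima by brute force over the alphabet. It deliberately does not use the piece-wise linear functions of [9]; the Gray-mapped QPSK bit labelling and the name `maxlog_demap` are assumptions made for illustration.

```python
# Assumed Gray-mapped QPSK alphabet: bit tuple -> symbol (labelling is hypothetical).
QPSK = {(0, 0): -1 - 1j, (0, 1): -1 + 1j, (1, 0): 1 - 1j, (1, 1): 1 + 1j}


def maxlog_demap(mean, snr, alphabet=QPSK):
    """Return one max-log LLR per bit position for a single stream."""
    num_bits = len(next(iter(alphabet)))
    llrs = []
    for k in range(num_bits):
        # Smallest squared distance to any symbol whose k-th bit is 0 or 1.
        d0 = min(abs(x - mean) ** 2 for bits, x in alphabet.items() if bits[k] == 0)
        d1 = min(abs(x - mean) ** 2 for bits, x in alphabet.items() if bits[k] == 1)
        llrs.append(snr * (d0 - d1))
    return llrs


llrs_a = maxlog_demap(1 + 1j, 2.0)    # equalizer output exactly on symbol (1, 1)
llrs_b = maxlog_demap(-1 + 1j, 2.0)   # equalizer output exactly on symbol (0, 1)
```

With this sign convention a positive LLR favours a one bit, which matches the logistic conversion used by the mapper.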

3) Mapper: The `map` function computes symbol means $\mu_t$ and variances $\sigma_t^2$ from the LLRs. Select $\lambda^\text{int}$ in line 4 or the decoder’s LLRs $\lambda^a$ in line 7. Convert the LLRs $\lambda$ to bit probabilities according to

$$p_t(k) = \operatorname{logistic}(\lambda_{t,k}) = \frac{1}{1 + \exp(-\lambda_{t,k})}$$

(10)

Then compute the terms $\mathbb{E}[x_t]$ and $\mathbb{E}[x_t^2]$ as in [10]. We limit the symbol variances to a lower bound $\sigma_{\text{min}}^2$ to improve the numerical stability of the next steps.
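A minimal sketch of the mapper follows. The paper computes the moments as in [10]; the brute-force sum over all symbols, the Gray-mapped QPSK labelling, the LLR clipping range and the value of `var_min` below are illustrative assumptions only.

```python
import math

# Assumed Gray-mapped QPSK alphabet: bit tuple -> symbol (labelling is hypothetical).
QPSK = {(0, 0): -1 - 1j, (0, 1): -1 + 1j, (1, 0): 1 - 1j, (1, 1): 1 + 1j}


def logistic(llr, clip=30.0):
    """Probability of a one bit; the LLR is clipped to avoid overflow in exp."""
    llr = max(-clip, min(clip, llr))
    return 1.0 / (1.0 + math.exp(-llr))


def soft_map(llrs, alphabet=QPSK, var_min=1e-6):
    """Return symbol mean and variance from per-bit LLRs, treating bits as independent."""
    p_one = [logistic(value) for value in llrs]
    mean = 0j
    second_moment = 0.0
    for bits, x in alphabet.items():
        prob = 1.0
        for k, bit in enumerate(bits):
            prob *= p_one[k] if bit else 1.0 - p_one[k]
        mean += prob * x
        second_moment += prob * abs(x) ** 2
    # Lower bound on the variance for numerical stability of later divisions.
    var = max(second_moment - abs(mean) ** 2, var_min)
    return mean, var


mean_unknown, var_unknown = soft_map([0.0, 0.0])    # no knowledge about the bits
mean_known, var_known = soft_map([30.0, 30.0])      # both bits almost surely one
```

Without any bit knowledge the sketch returns a zero mean and the full symbol energy as variance; with confident LLRs it returns the corresponding symbol and the lower variance bound.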

Figure 3. EPIC Architecture Design. The two blocks Filter Vect. are identical and each contain two PUs. Forwarding buffers are not drawn. In the first receiver iteration, data flows through the upper chain, then enters the lower one. In subsequent receiver iterations, only the Preproc. PU and the lower chain are active.

4) GaussDiv: The function `gaussdiv` implements (2). It calculates $\tilde{\mu}_t$ and $\tilde{\sigma}_t^2$ from the equalizer’s outputs $\check{\mu}_t$, $\check{\sigma}_t^2$ and the mapper’s outputs $\mu_t$, $\sigma_t^2$. The variance $\tilde{\sigma}_t^2$ might become arbitrarily large if $\sigma_t^2 \approx \check{\sigma}_t^2$, and even negative, e.g. due to rounding errors. This results in a largely increased dynamic range requirement for subsequent computations. We propose a simple ad-hoc heuristic that effectively avoids the increase. To this end, we define that if

$$\tilde{\rho}_t = \frac{1}{\sigma_t^2} - \frac{1}{\check{\sigma}_t^2} < \tilde{\rho}_{\text{thr}}$$

(11)

where $\tilde{\rho}_{\text{thr}}$ is a constant threshold, then we set $\tilde{\mu}_t = \mu_t$ and $\tilde{\sigma}_t^2 = \sigma_t^2$, i.e. we forward the mapper’s unprocessed output. Otherwise, compute

$$\tilde{\sigma}_t^2 = \frac{1}{\tilde{\rho}_t} \quad \text{and} \quad \tilde{\mu}_t = \tilde{\sigma}_t^2 \left( \frac{\mu_t}{\sigma_t^2} - \frac{\check{\mu}_t}{\check{\sigma}_t^2} \right).$$

(12)

The threshold $\tilde{\rho}_{\text{thr}}$ is chosen as a power of two for further simplification.
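The guarded division described above can be sketched as follows. The threshold value $2^{-4}$ is an assumption chosen only for illustration (the text states that a power of two is used, not which one), and the function and argument names are hypothetical.

```python
def gaussdiv_guarded(map_mean, map_var, eq_mean, eq_var, rho_thr=2.0 ** -4):
    """Divide the mapper's Gaussian by the equalizer's Gaussian with a precision guard."""
    rho = 1.0 / map_var - 1.0 / eq_var  # precision of the divided Gaussian
    if rho < rho_thr:
        # Precision too small or negative: forward the mapper's output unchanged.
        return map_mean, map_var
    var = 1.0 / rho
    mean = var * (map_mean / map_var - eq_mean / eq_var)
    return mean, var


# Mapper clearly more confident than the equalizer: the division is carried out.
divided = gaussdiv_guarded(1.0 + 0j, 0.5, 0.5 + 0j, 1.0)
# Nearly equal variances: the guard forwards the mapper's output.
forwarded = gaussdiv_guarded(1.0 + 0j, 1.0, 0.5 + 0j, 1.0)
```

Because the guard tests the precision difference before inverting it, the sketch never divides by a value close to zero and never returns a negative variance.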

C. Explanations

The crux in the EPIC detector is the GaussDiv process. Without it, we simply have two chained linear detectors. Our experiments confirm that the full performance enhancement is only available using the GaussDiv method.

Thus, we want to give a possible interpretation of this phenomenon. The first LMMSE filter produces a continuous-valued distribution with mean $\check{\mu}_t \in \mathbb{C}$ that describes the most likely transmit symbol distribution. The subsequent demapper discretizes this continuous distribution, i.e. it evaluates it at every valid point $x_t \in \mathcal{X}$. The mapper computes a new continuous-valued estimate $\mu_t$, which is assumed to be the mean of a new continuous distribution. The innovation contained in this new message potentially results from imposing the validity constraint $x_t \in \mathcal{X}$. The `gaussdiv` function extracts this innovation by “dividing” the two distributions. Note that instead of computing symbol probabilities, we compute bit probabilities, which is essentially an approximation of the process, as it neglects symbol correlations.

IV. ARCHITECTURE

As a case study, we implemented the algorithm to support the IEEE 802.11n standard’s peak data rate of $\Theta_b = 600$ Mbit/s (for $N_t = 4$, 64-QAM, $r = 5/6$), which corresponds to a coded bit rate of $\Theta_c = 720$ Mbit/s. There are two possible workloads, for the first and the subsequent iterations. Our implementation has a constant deterministic throughput, independent of the receiver iteration and SNR.

A. Overview

We use a coarse-grained pipeline of multi-cycle processing units (PUs) that take 18 clock cycles per pipeline cycle, depicted in Fig. 3. For a reasonable clock frequency of $f_{\text{clk}} = 540$ MHz, we achieve a data rate of $\Theta_b = \frac{r N_t K}{18} f_{\text{clk}} = 600$ Mbit/s.
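The throughput relation above can be checked with a few lines of exact rational arithmetic. The function name and its parameter names are chosen here for illustration; the default values are the ones quoted in the text.

```python
from fractions import Fraction


def throughput_mbit_s(f_clk_mhz, n_t=4, bits_per_symbol=6,
                      code_rate=Fraction(5, 6), cycles_per_vector=18):
    """Information bit rate in Mbit/s when one symbol vector is processed per pipeline cycle."""
    bits_per_vector = code_rate * n_t * bits_per_symbol
    return bits_per_vector * Fraction(f_clk_mhz) / cycles_per_vector


info_rate = throughput_mbit_s(540)                           # information bits
coded_rate = throughput_mbit_s(540, code_rate=Fraction(1))   # coded bits
```

Each symbol vector carries $N_t K = 24$ coded bits, of which 20 are information bits at rate 5/6, so 540 MHz and 18 cycles per vector give 600 Mbit/s of information bits and 720 Mbit/s of coded bits.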

The architecture runs in two possible configurations. For the first receiver iteration, data flows through the Preprocessing PU into the upper internal detector composed of four PUs. Then it enters the lower internal detector comprising seven PUs. For any subsequent receiver iterations, we disable the upper chain. The mapper PU takes the input $\lambda^a$. In this mode, the GaussDiv unit always forwards $\tilde{\mu}_t, \tilde{\sigma}_t^2$ and computes $C_y$. The processing latency depends on the configuration. In the first case, the latency is $L = 11 \cdot 18$ clock cycles, while in the second it is $L = 7 \cdot 18$ clock cycles. The architecture’s throughput is identical for both configurations.

B. Processing Units

The PUs exchange data in one dedicated clock cycle. They use a simple handshake based protocol. All data is kept in registers. Forwarding buffers are used where required, but are not drawn in Fig. 3. All resource allocations of arithmetic units and schedules have been devised manually for each PU.

Besides regular multiplier and adder units, we use lookup tables, e.g. for the logistic function in (10). Newton-Raphson based reciprocal units perform the required divisions. This unit type first normalizes the input $x$, then performs an initial table lookup $\tilde{x}_0 \approx 1/x$, followed by one Newton-Raphson iteration $\tilde{x}_1 = 2\tilde{x}_0 - x\tilde{x}_0^2$ to obtain a refined approximate reciprocal $\tilde{x}_1 \approx 1/x$.
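The reciprocal unit’s principle (normalize, look up a seed, refine once) can be sketched in floating point as below. The table size of 64 entries and the midpoint seeding are assumptions for illustration; the text does not state the table dimensions or the fixed-point formats.

```python
import math

TABLE_BITS = 6  # assumed seed table size of 2**6 entries (not given in the text)
# Seed table: reciprocal of each interval midpoint over the mantissa range [0.5, 1).
SEED = [1.0 / (0.5 + (i + 0.5) / 2 ** (TABLE_BITS + 1)) for i in range(2 ** TABLE_BITS)]


def reciprocal(x):
    """Approximate 1/x for x > 0 with one table lookup and one Newton-Raphson step."""
    m, e = math.frexp(x)                      # normalize: x = m * 2**e, 0.5 <= m < 1
    idx = int((m - 0.5) * 2 ** (TABLE_BITS + 1))
    x0 = SEED[idx]                            # initial guess for 1/m
    x1 = x0 * (2.0 - m * x0)                  # Newton-Raphson: 2*x0 - m*x0**2
    return math.ldexp(x1, -e)                 # undo the normalization


approx = reciprocal(3.0)
```

One iteration squares the relative error of the seed, so with this assumed table the relative error stays below $10^{-4}$ over the whole input range.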

We use saturation for the input, output and intermediate LLRs $\lambda^a$, $\lambda^e$ and $\lambda^{\text{int}}$, respectively. Due to lack of space, we cannot show the PUs’ implementation details here. However, the architectural style resembles those presented in [2] and in [3], [11].

C. Partitioning

The proposed algorithm partition and PU mapping for this case study is given in Tbl. I. It approximately follows the functional boundaries. Note that several PUs are instantiated twice, as they are in use by both internal detectors. Where appropriate, we customized the unique instances individually, e.g. assuming \( \tilde{\mu}_t = 0 \) for the upper Filter & SNR.

V. IMPLEMENTATION RESULTS

As a proof of concept, we designed an ASIC that meets the IEEE 802.11n standard’s peak data rate, in order to show the feasibility and efficiency of the proposed algorithm.

A. Conditions and Assumptions

A 40 MHz 802.11n-like scenario is considered assuming a \( 4 \times 4 \) MIMO system with gray-mapped 4/16/64-QAM modulation, max-log demapping, a spatially uncorrelated Rayleigh channel and perfect channel knowledge. We use a tail-biting convolutional code with polynomials \( [133, 171]_8 \) and puncturing for the code rates 1/2, 2/3, 3/4 and 5/6, and a max-log BCJR decoder. The frame length equals the interleaver’s length, given by the IEEE 802.11n standard, which depends on the modulation and coding scheme. The average signal-to-noise ratio (SNR) per receive antenna is defined as

\[
\text{SNR} = \frac{E[\|Hx\|^2]}{(N_r N_0)}
\]
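For simulations, this definition is usually inverted to obtain the noise density for a target SNR. The sketch below assumes i.i.d. channel entries with unit variance and uncorrelated transmit symbols with average energy $E_s$, so that $E[\|Hx\|^2] = N_r N_t E_s$ and thus $\text{SNR} = N_t E_s / N_0$. The unit-variance normalization and the function name are assumptions, not statements from the paper.

```python
import math


def noise_density(snr_db, n_t=4, symbol_energy=1.0):
    """Noise density N_0 for a target per-antenna SNR in dB (unit-variance channel assumed)."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return n_t * symbol_energy / snr_linear


n0_at_0_db = noise_density(0.0)
n0_at_10_db = noise_density(10.0)
```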

We determined the required word lengths to obtain an SNR loss of \( \leq 0.1 \text{ dB} \) compared to the floating-point model at a frame error rate of 10%. All simulations have been performed with \( 10^5 \) frames.

B. Algorithmic Performance

Fig. 4 shows the frame error rate (FER) over SNR for the \( \Theta_b = 600 \text{ Mbit/s} \) mode of the IEEE 802.11n standard, assuming \( N_t = 4 \), 64-QAM \( (K = 6) \) and the code rate \( r = 5/6 \). The simulation includes the bit-accurate model of our EPIC, and models of the MMSE-PIC [2] detector and the fixed-complexity sphere decoder (FCSD) [4] that both neglect any finite word length effects. The EPIC shows a communications performance in between the FCSD and the MMSE-PIC, as expected [1].

Tbl. II summarizes the communications performance gain of the EPIC over the MMSE-PIC, measured at 10% FER. We observe a possible gain of more than 2 dB. The largest gain is achieved in the first receiver iteration, as could be expected. For subsequent receiver iterations, the EPIC is a default IC-LMMSE detector, thus the observed gains result from the improved initial performance. The gain diminishes over receiver iterations. The EPIC exhibits greater gain at lower code rates. The gain from iterating between detector and decoder, e.g. visible in Fig. 4, appears to be weaker for the EPIC than for the MMSE-PIC, for example approximately 4.2 dB (MMSE-PIC) to 3.1 dB (EPIC) from first to second receiver iteration. This is in accordance with the original algorithm [1] and not due to our implementation.

We also conducted simulations using the low-density parity check (LDPC) codes defined in the IEEE 802.11n standard. The observations are similar. The gain values are lower, but always greater than zero, thus the EPIC performs at least as well as the MMSE-PIC.

### Table II

**SNR gain of the EPIC over the MMSE-PIC at 10% FER**

<table>
<thead>
<tr>
<th>Modulation</th>
<th>Code Rate</th>
<th>SNR Gain [dB] to MMSE-PIC</th>
</tr>
</thead>
<tbody>
<tr>
<td>QPSK</td>
<td>1/2</td>
<td>0.9</td>
</tr>
<tr>
<td></td>
<td>3/4</td>
<td>2.4</td>
</tr>
<tr>
<td>16-QAM</td>
<td>1/2</td>
<td>0.9</td>
</tr>
<tr>
<td></td>
<td>3/4</td>
<td>2.0</td>
</tr>
<tr>
<td>64-QAM</td>
<td>2/3</td>
<td>1.0</td>
</tr>
<tr>
<td></td>
<td>3/4</td>
<td>1.4</td>
</tr>
<tr>
<td></td>
<td>5/6</td>
<td>1.7</td>
</tr>
</tbody>
</table>

Figure 4. Frame Error Rate over SNR. The MMSE-PIC model excludes any finite word length effects. EPIC is our VLSI implementation including all finite word length effects.
C. ASIC Implementation

We synthesized the proposed architecture with Synopsys Design Compiler G-2012.06 in topographical mode and obtained a layout with Cadence SoC Encounter 10.1 using a 1.0 V standard-performance standard cell library for the UMC 90 nm SP-RVT LowK CMOS process. The ASIC achieves a maximum clock frequency of 550 MHz, yielding a maximum throughput of \( \Theta = 733 \text{ Mbit/s} \). This variant occupies 345.8 kGE and 1.08 mm\(^2\). One gate-equivalent (GE) corresponds to the area of one two-input drive-one NAND gate. The per-unit area breakdown is given in Tbl. IV. The matrix inversion, performed by the decomposition and solver units, requires up to 68.9 kGE, which is about half of the MMSE-PIC’s inversion circuitry. This results most likely from the exploitation of the symmetries \( C_y^H = C_y \) and \( G^H = G \). With about 5.1%, the additional GaussDiv unit constitutes only a small overhead, but is crucial for the improved communications performance.

Tbl. III shows data of other reported MIMO detectors and our work. We use the area-throughput efficiency for comparisons. The EPIC is smaller than the MMSE-PIC [2], and slightly slower. Both detectors show a similar efficiency around 2 Mbit/s/kGE. The reduced area might be due to the data symmetries of \( C_y \) and \( G^H \) present in the EPIC algorithm, which are not available in the MMSE-PIC’s algorithm formulation. Post-synthesis results for an IC-LMMSE detector are reported in [3]. Our detector internally comprises two MMSE detectors, and thus has roughly double the area of that design. Compared to the FCSD, the EPIC is smaller, but has a lower area-throughput efficiency. Combined with the better communications performance of the FCSD, the EPIC seems to be the less efficient architecture. However, neither the required QR decomposition, the required matrix multiplication \( Q^H y \) nor the overhead for storing \( Q \) are included in the FCSD’s figures, while the EPIC does not need any further preprocessing. Thus, the FCSD has a better detector area-throughput efficiency, but its system efficiency depends on additional circuitry, which is not needed for the EPIC (and for the MMSE-PIC).

The MCMC detector and the STS-SD offer a significantly lower throughput and lower area-throughput efficiency. Furthermore, the STS-SD has a strong run-time variability of the throughput, which makes system design challenging, and requires additional circuitry to manage the variations (e.g. dynamic data distribution and collection).

The trellis-based detector [7] shows good communications performance, similar to the FCSD, but the reported design supports only 16-QAM. For this modulation order, it already


### Table III

**Comparison to other reported MIMO detectors**

<table>
<thead>
<tr>
<th>This work</th>
<th>[2]</th>
<th>[3]</th>
<th>[4]</th>
<th>[5]</th>
<th>[6]</th>
<th>[7]</th>
<th>[8]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of antennas</td>
<td>4 × 4</td>
<td>4 × 4</td>
<td>4 × 4</td>
<td>4 × 4</td>
<td>≤ 4 × 4</td>
<td>4 × 4</td>
<td>4 × 4</td>
</tr>
<tr>
<td>Modulation order</td>
<td>≤ 64</td>
<td>≤ 64</td>
<td>≤ 64</td>
<td>64</td>
<td>≤ 64</td>
<td>64</td>
<td>16</td>
</tr>
<tr>
<td>Algorithm</td>
<td>EPIC</td>
<td>MMSE-PIC</td>
<td>IC-LMMSE</td>
<td>FCSD</td>
<td>MCMC</td>
<td>STS-SD</td>
<td>Trellis</td>
</tr>
<tr>
<td>Iterative MIMO Decoding</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
</tr>
<tr>
<td>Technology</td>
<td>90 nm</td>
<td>90 nm</td>
<td>90 nm</td>
<td>90 nm</td>
<td>90 nm</td>
<td>65 nm</td>
<td>130 nm</td>
</tr>
<tr>
<td>Core area [mm\(^2\)]</td>
<td>1.08</td>
<td>1.5</td>
<td>–</td>
<td>2.61</td>
<td>–</td>
<td>–</td>
<td>1.58 (3.03)</td>
</tr>
<tr>
<td>Throughput Θ [Mbit/s]</td>
<td>733</td>
<td>757</td>
<td>833</td>
<td>2200</td>
<td>avg. 34.2</td>
<td>avg. 51.1</td>
<td>1700 (1228)</td>
</tr>
<tr>
<td>Preprocessing area [kGE]</td>
<td>345.8</td>
<td>384.2 \(^a\)</td>
<td>178.3</td>
<td>\(^b\)</td>
<td>\(^c\)</td>
<td>\(^b\)</td>
<td>\(^b\)</td>
</tr>
<tr>
<td>Detection area [kGE]</td>
<td>395.5</td>
<td>265</td>
<td>175</td>
<td>1097</td>
<td>174</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Efficiency [Mbit/s/kGE]</td>
<td>2.11</td>
<td>1.97</td>
<td>4.67</td>
<td>3.96</td>
<td>0.13</td>
<td>0.29</td>
<td>1.12</td>
</tr>
</tbody>
</table>

\(^a\) Area for chip IO interface excluded

\(^b\) Area for required QRD not included

\(^c\) Area for optional initial detection not included

### Table IV

**EPIC Implementation Results**

<table>
<thead>
<tr>
<th>Processing Unit</th>
<th>1st</th>
<th>2nd</th>
</tr>
</thead>
<tbody>
<tr>
<td>Preprocessing Unit [kGE]</td>
<td>32.5</td>
<td>38.3</td>
</tr>
<tr>
<td>( L D L^H ) Decomposition [kGE]</td>
<td>32.5</td>
<td>27.5</td>
</tr>
<tr>
<td>Solver ( (L D L^H)^{-1} ) [kGE]</td>
<td>30.6</td>
<td>29.4</td>
</tr>
<tr>
<td>Filter &amp; SNR [kGE]</td>
<td>39.5</td>
<td>52.3</td>
</tr>
<tr>
<td>Max-Log Demap [kGE]</td>
<td>7.8</td>
<td>7.1</td>
</tr>
<tr>
<td>Map [kGE]</td>
<td>12.2</td>
<td>7.8</td>
</tr>
<tr>
<td>GaussDiv [kGE]</td>
<td>17.8</td>
<td>7.8</td>
</tr>
<tr>
<td>Interference Cancellation [kGE]</td>
<td>17.6</td>
<td>7.8</td>
</tr>
<tr>
<td>Forwarding &amp; Multiplexer [kGE]</td>
<td>33.2</td>
<td>7.8</td>
</tr>
<tr>
<td>Total area [kGE]</td>
<td>345.8</td>
<td>384.2</td>
</tr>
<tr>
<td>Clock frequency [MHz]</td>
<td>550</td>
<td>733</td>
</tr>
<tr>
<td>Throughput Θ [Mbit/s]</td>
<td>733</td>
<td>2.11</td>
</tr>
<tr>
<td>Efficiency [Mbit/s/kGE]</td>
<td>733</td>
<td>2.11</td>
</tr>
</tbody>
</table>

\(^d\) We assume 100 visited nodes per symbol vector [2].

\(^e\) We assume \( N_y = 64, N_q = 8, N_p = 8 \) [5].

\(^f\) Scaled to 90 nm CMOS technology assuming \( t_{pd} \sim 1/s \) and \( A \sim 1/s^2 \).
occupies a large area of > 1 MGE. Transitioning to 64-QAM, the internal trellis will have 64 instead of 16 states, which suggests that the area might grow considerably. The area-throughput efficiency is lower than that of the EPIC.

If only SO-only detection is required, then the K-Best detector in [8] is a lot more efficient than the EPIC. We expect that an SO-only variant of the EPIC would save only little area (basically a few multiplexers). This is in contrast to SO-only MMSE detectors, which could save more (e.g. the mapper), whereas the EPIC requires most of the functionality also in the first receiver iteration.

Clearly, the fact that the design is customized for one maximum throughput target is a limitation of our architectural approach. To support higher or lower throughput targets, beyond simple frequency scaling, we need to redesign the PUs for a different number of clock cycles per pipeline cycle. E.g. if the target is $\Theta_L = 300$ Mbit/s, then the PUs might take 36 clock cycles instead of 18. The architectural concept is scalable in principle, but requires quite some development effort.

VI. CONCLUSIONS

A novel class of linear MIMO detectors for BICM-ID systems is introduced. It keeps the desirable VLSI implementation properties of linear detectors, while additionally improving the communications performance by up to 2.4 dB for the considered IEEE 802.11n-like scenarios. The presented ASIC has a similar area-throughput efficiency as the best linear MIMO detector reported in literature. It proves that similar area and throughput figures combined with significant SNR gains over regular linear MMSE detection are possible.

Future MIMO receivers can take advantage of the detector’s observed SNR gains in order to largely improve the network’s transmission range, or increase the data rate in the presence of strong inter-cell interference. In fact, the largest gains have been observed for the non-iterative receiver case. Thus, adoption in current MIMO systems is possible without requiring the complexity of iterative MIMO decoding.

ACKNOWLEDGMENT

This work has been supported by the Ultra High-Speed Mobile Information and Communication Research Centre, RWTH Aachen University.

REFERENCES


