### 1. Introduction

### 2. System Model and Introduction to Vector Perturbation

*n*

*decentralized users that are each equipped with*

_{U}*n*

*receive antennas. Without the loss of generality we assume that*

_{R}*n*

_{T}*=*(

*n*

*×*

_{U}*n*

*), which is the minimum number of required antennas at the BS. The channel coupling transmit and receive antennas are considered to flat fading and slowly time-varying. The system is then converted to the*

_{R}*N*= (2×

*n*

*) real Euclidean system. Let*

_{T}**H**∈ R

^{N}^{×}

*denote the channel matrix and s ∈ R*

^{N}*denote the data symbols, both in the real domain. Then, the precoded vector is given by:*

^{N}*γ*is a scaling factor used to fix the transmit paper to a predefined value; set to unity throughout the paper. The precoding matrix

**P**is given by

**H**

^{−1}and (

**H**

^{T}**H**+ α

**I**)

^{−1}

**H**

*for the linear ZF and MMSE precoders, respectively. Although the MMSE precoder regularizes the channel matrix via the scalar α, the performance is still mediocre.*

^{T}*τ*is an integer given by:

*c*

*| as the absolute value of the constellation point with the largest value and Δ is the spacing between neighboring constellation points. Since the precoding ideally equalizes for the channel, the*

_{max}*i*-th received symbol is given by:

*n*

*denoting the real noise at the*

_{i}*i*-th receiver with a variance of (σn

^{2}/2). The transmitted symbol is then demodulated, without knowing

**t**, using the nonlinear modulo operation below:

*Q*(.) is the demodulation operator. The modulo operation reduces the range of the signal to [-

*K*,

*K*) with

*K*as the square root of the cardinality of the modulation set and

*K*= 2 and 4 for QPSK and 16-QAM schemes, respectively. Fig. 1 depicts the MU-MIMO system with VP, where the matrix

**F**is a power control matrix. In the following section we assume that all users/streams have equal power; hence,

**F = I**.

*τ*= 4. Let the elements of vector t be drawn from the set {0,1} leading to a set of candidates of

**t**that are 23 = 8 in size. Again, for the sake of simplicity, we used ZF precoding, where, consequently, the transmit power is given by:

**t**and the corresponding required transmit power. Note that ZF precoding without VP corresponds to setting all the elements of

**t**to 0 (i.e.,

**t**

_{0}in the equation below).

**t**

_{7},

**t**

_{5}, and

**t**

_{2}are lower than that in case of

**t**

_{0}. The next challenge is to find the “best”

**t**such that the required transmit power is reduced and the performance is improved.

### 3. Statement of Problem and Review of Conventional Algorithms

### 3.1 Statement of Problem

**t**vector such that the total transmit power is reduced [14–17]. That is:

**t**are drawn is decided using simulations (see figures in [14–16]). To solve (6) successively, let the transpose of

**H**be factorized into the product of the unitary matrix

**Q**and upper triangular matrix

**R**. Then the search problem in (6) is expanded to:

*s*

*and*

_{n}*t*

*are the*

_{n}*n*-th elements of the vectors

**s**and

**t**, while the matrix

**L**equals (

**R**

^{−1})

^{T}. In the case of MMSE precoding, the extended matrix

### 3.2 Statement of Problem

**t**. At each search level, the single candidate of

**t**, which reduces the accumulative metric in (7) is retained. The main drawback of this method is that the elements of

**t**are obtained independently, which leads to degradation in the system’s overall performance, in terms of BER and diversity order.

*M*. Fig. 2 shows an example of the QRDM-E with an equal

*Z*and

*M*of 4. At the first VP level, the root node is extended to the four possible branches, where each branch represents a hypothesis of

*t*

_{1}. Since,

*M*=

*T*, all the candidates are retained for the next level. All the retained candidates at the first level are extended, leading to a set of 16 hypotheses for

*t*

_{2}. The metrics are computed based on (7) and sorted where the best

*M*= 4 candidates with the least accumulative metrics are retained for the next level. This strategy is repeated until the last VP level, where the vector with the least accumulative metric is precoded and transmitted. It has been shown that the QRDM-E algorithm with sufficient M achieves quasi-optimum performance and optimum diversity order [13]. The main drawbacks of the QRDM-E are as follows: 1) it has a sequential nature, which limits the parallelization of the VP stage leading to high latency and 2) the computational complexity is high for a large problem size. In the next section, several algorithms are proposed in order to reduce the computational complexity, quantified in terms of the number of visited nodes in the search tree, and the latency, which is quantified by the parallelization capability of the proposed algorithms.

### 4. Fixed-Complexity Vector Perturbation Techniques

### 4.1 Parallel QRDM Encoder

The set of candidates for the elements of the vector

**t**is divided into*G*non-overlapping subsets, U_{1}, U_{2}, …, Uof equal sizes._{G}-
The VP tree-search of the conventional QRDME is then divided into

*G*independent partial VPs (PVPs) that are pipelined, where:In the

*i*-th PVP, the candidates for*t*_{1}are drawn from the subset U._{i}The candidates for the remaining elements of

**t**are drawn from the full set Z.All the candidates of

*t*are retained at the first*p*encoding levels. For example, in PQRD-ME-*p*1 and PQRDME-*p*2 all candidates are retained up to the first and to the second VP levels, respectively.At the remaining levels,

*p*+1 …*N*,*M*/*G*candidates are retained at each encoding level per PVP leading to a total of*M*retained candidates per encoding level.

The vector with the least accumulative metric is retained at each PVP. These vectors are compared in terms of their accumulative metrics, where the one with the least accumulative metric and the one with the least global accumulative metric is encoded and transmitted.

*T*= 4,

*G*= 2,

*M*=

*N*= 4 and

*p*= 1. As such, at the first VP level, the set of four elements of Z is divided into the subsets U

_{1}and U

_{2}, which each contain two elements. The candidates for

*t*

_{1}in the first PVP are drawn from U

_{1}, whereas, the candidates for

*t*

_{1}in the second PVP are drawn from U

_{2}. At the first VP level, all candidates are retained. In each PVP, the candidates for

*t*

_{2}to

*t*

_{4}are drawn from the full set Z, where at each level only

*M/G*= 2 candidates with the least accumulative metrics are retained. The process is repeated up to the last VP level where the vector with the least accumulative metric is selected, precoded, and transmitted.

*M*=

*N*= 4 and

*T*=

*G*=

*p*= 2. Apart from the size of the set of hypothesis Z, the difference with the preceding example is that all candidates are retained at both the first and second VP levels. It worth remembering that the candidates of

*t*

_{2}, …,

*t*

_{4}are drawn from the full set Z.

### 4.2 Fixed-Complexity Sphere Encoder

**Full expansion:**At the first*p*tree search levels, the retained branches are expanded to all possible nodes, and all the resulting branches are retained for the next level.**Single expansion:**Only a single expansion is performed from each retained node at the precedent encoding level. This is done by following the decision-feedback equalization path.

*T*= 7, whereas, the encoding throughput is increased seven-fold.

*T*=

*N*=

*M*= 4 and

*p*= 1. In the first full expansion encoding level, all the candidates are retained for the second level. In the single expansion VP levels, all retained nodes are expanded to all possible candidates. Then, the candidate with the smallest metric is retained as in the case of a successive interference cancellation detector. In the last VP level, the vector with the least accumulative metric is selected, precoded, and transmitted. Due to this tree-search structure of the FSE, it can be pipelined. This results in a tremendous increase in the encoding throughput.

*T*= 3,

*N*= 4,

*M*= 9 and

*p*= 2. Up to the second VP level, all candidates are retained, whereas, a single expansion is employed at the remaining VP levels.

### 4.3 Simulation Results

*T*, is set based on extensive simulations. In the case of THP no further improvement is achieved for

*T*> 5. The performance of the PQRDME-

*p*1, is close to that of the quasi-optimal QRDM-E while the encoding throughput is increased by a factor of

*G*= 2. To achieve more encoding speed, the three-search can be split into four PVPs, whereas, it is clearly seen in the figure that the performance is degraded, but the maximum diversity order is still achieved. In terms of BER performance, FSE-

*p*1 comes last with little degraded performance, while achieving the maximum encoding throughput speed.

*p*= 2 and

*G*= 3 in the case of the PQRDME-

*p*2. The performance of the PQRDME-

*p*2 is closest to the QRDM-E, followed by the FSE-

*p*2. Note that in this scenario, the PQRDME search tree consists of the three PVPs, which leads to high parallelization capability. The THP and linear MMSE encoders achieve much worse BER performance and similar diversity order of unity.

### 5. Vector Perturbation with Block Diagonalization

### 5.1 Block Diagonalization

**B**such that:

##### (9)

**H**

_{i}is the channel coupling of the BS transmit antennas and the receive antennas of the

*i*-th user.

**0**

_{n}is an

*n×n*zero matrix with all elements set to 0.

**H**

_{eff, i}is the effective channel of the

*i*-th user after block diagonalization. The matrix

**B**is obtained using a series of singular value decompositions. Based on the second matrix of (9), users’ data is precoded independently. This leads to the additional parallelization of the VP stage and as a result, improves encoding throughput. In this scenario, VP techniques with random computational complexity generally require different time durations to encode each user’s data. Since all users’ data should be transmitted at the same instant, the encoding stage requires a time duration equivalent to the worst-case delay among the users. On the other hand, fixed-complexity VP algorithms require a fixed encoding duration. That is why they are preferable in such scenarios, in addition to their advantages that we have already discussed in Section 4.

### 5.2 Simulation Results

*n*

_{T}*, n*

_{U}*, n*

*) scenario, where the BS is equipped with*

_{R}*n*

*antennas and*

_{T}*n*

*users that are each equipped with*

_{U}*n*

*antennas. Since the attained diversity order equals min(*

_{R}*n*

_{T}*, n*

*) =*

_{R}*n*

*, as the number of users increases, the diversity order is decreased since part of the degrees of freedom are used in the IUI cancellation. It is worth mentioning that QRDM-E outperforms FSE-*

_{R}*p*2, whereas, FSE-

*p*2 outperforms FSE-

*p*1 in all of the simulated SNR range. The performance of all algorithms coincide when each user is equipped with a single receive antenna. This is reasonable since the diversity order achieved by all algorithms is equal to 1. As such, these algorithms can’t exploit further performance improvement or diversity order.

*p*2 is negligible for all of the scenarios.

### 6. Vector Perturbation with Imperfect Channel Knowledge

### 6.1 Quantization Schemes

**Lloyd-Max Quantization:**Uniform quantization divides the range to which the variable belongs into equal intervals. If the variable to be quantized belongs to a certain interval, then the centroid of the interval is considered to be the quantized value. This quantizer is suitable for uniformly distributed variables, which is not the case for MIMO channels. Therefore, the non-uniform Lloyd-Max quantizer, which takes the probability density function (pdf) of the variables to be quantized into consideration, is more suitable [20–22]. The Lloyd-Max quantizer iteratively finds the intervals’ endpoints so that the mean square error between each channel coefficient and its quantized version is minimized. This can be achieved by allocating shorter intervals when the pdf has high values (i.e., the variable is the most probable) and longer intervals when the pdf has low values.**Uniform Quantization:**In communication systems (WiMAX, LTE/LTE-A, etc.) the channel amplitude is usually quantized and fed back to the BS in order to be used in several schemes, such as scheduling or adaptive modulation and coding. As such, it is more effective to quantize the phase of the channel coefficient as additional feedback. It is known that the phase of the channel coefficient is uniform and, hence, can be uniformly quantized.

*p*2 under the ICSI knowledge at the transmitter as an example of the fixed-complexity VP techniques.

### 6.2 Simulation Results

*p*2 as an example of the VP algorithms for several values of

*B*and the number of quantization bits for the real and imaginary parts of the channel coefficient. For a low

*B*, the BER increases, where, when

*B*= 5 the degradation is tolerable due to the quantization error.

*p*2 for several values of

*B*and the number of quantization bits for each phase value.

*B*= 6 is suitable for quantizing the phase of the channel coefficients, since the degradation due to the quantization error is tolerable.