### 1. Introduction

### 2. Related Work

### 3. Markov Decision Process

T ⊆ [0, ∞) is the set of decision epochs, the points in time at which a decision is made: if the signal reaches the minimum point, the action taken is a handoff. We mainly consider processes with countably many decision epochs, in which case T is said to be discrete, and we usually take T = {1, 2, …, N} or T = {1, 2, …} depending on whether T is finite or countably infinite. In discrete problems, time is divided into periods or stages, and we assume that each decision epoch occurs at the beginning of a time period. Here T is taken as the call duration. An MDP is said to have a finite or an infinite horizon depending on whether the least upper bound of T (i.e., its supremum) is finite or infinite. The call duration may be treated as infinite, since we do not know the exact time consumed in advance; in practice, however, it is usually finite, because the duration of a call is limited by the battery power of the mobile terminal and also by cost. If T = {1, …, N} is finite, we stipulate that no decision is taken at the final decision epoch N. S is the set of states that can be assumed by the process and is called the state space; in this context, S contains the handoff state. Based on the level of the signal, it is decided whether to continue in the same state or to change over to the handoff state. S can be any measurable set, but for the purpose of our research we are mainly concerned with processes that take values in state spaces that are either countable or a compact subset of R^n. For each state

s ∈ S, A_s is the set of actions that are possible when the state of the system is s. The action is determined by the state; in this case, the state is the value of the received signal strength. If it is below the threshold value, a handoff action is taken; otherwise, the call continues on the current network. We write A = ∪_{s ∈ S} A_s for the set of all possible actions, and we usually assume that each A_s is either a countable set or a compact subset of R^n. Actions can be chosen either deterministically or randomly. To describe the second possibility, we write P(A_s) for the set of probability distributions on A_s, in which case a randomly chosen action is specified by a probability distribution Q(·) ∈ P(A_s). For example, if A_s is discrete, then the action a ∈ A_s is chosen with probability q(a). If

T is discrete, then we must specify how the state of the system changes from one decision epoch to the next. Since we are interested in Markov decision processes, these changes are drawn at random from a probability distribution p_t(· | s, a) on S, which may depend on the time t, the current state of the system s (determined by the signal strength), and the action a, i.e., the handoff action chosen by the observer.
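As a minimal sketch of these two random choices (a randomized action drawn from Q(·) and a next state drawn from p_t(· | s, a)), the model below uses hypothetical states, actions, and probabilities, not measured signal statistics:

```python
import random

# Hypothetical handoff model: two states and two actions; the
# transition probabilities p(next_state | state, action) below are
# illustrative placeholders, not measured values.
P = {
    ("connected", "stay"):    {"connected": 0.9, "handoff": 0.1},
    ("connected", "handoff"): {"connected": 0.7, "handoff": 0.3},
    ("handoff",   "stay"):    {"connected": 0.2, "handoff": 0.8},
    ("handoff",   "handoff"): {"connected": 0.8, "handoff": 0.2},
}

def sample_action(q, rng):
    """Choose action a with probability q(a) (a randomized action)."""
    return rng.choices(list(q), weights=list(q.values()))[0]

def sample_next_state(state, action, rng):
    """Draw the next state from p(. | s, a)."""
    dist = P[(state, action)]
    return rng.choices(list(dist), weights=list(dist.values()))[0]

rng = random.Random(0)
a = sample_action({"stay": 0.5, "handoff": 0.5}, rng)
s_next = sample_next_state("connected", a, rng)
print(a, s_next)
```

The same sampling helper serves both roles because a randomized action and a random transition are each just a draw from a finite distribution.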

When an action a is chosen in state s at time t, the observer receives a connection reward r_t(s, a), which can be regarded as a profit when positive and as a cost when negative. We assume that the rewards can be calculated, at least in principle, by the observer prior to selecting a particular action. We will also consider problems in which the reward obtained at time t can be expressed as the expected value of a function r_t(s_t, a_t, s_{t+1}), which depends on the state of the system both at that time and at the next time. For example:

r_t(s, a) = Σ_{ψ ∈ S} r_t(s, a, ψ) p_t(ψ | s, a).
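Taking the expected value of the transition reward r_t(s, a, s_{t+1}) over the next state, weighted by p_t(ψ | s, a), recovers the one-stage reward r_t(s, a). A numeric sketch for one fixed state-action pair, with hypothetical values:

```python
# Hypothetical next-state distribution p(psi | s, a) and transition
# rewards r(s, a, psi) for one fixed state-action pair (s, a).
p = {"s1": 0.6, "s2": 0.4}             # p(psi | s, a); sums to 1
r_transition = {"s1": 5.0, "s2": 2.0}  # r(s, a, psi), in data-rate units

# r(s, a) = sum over psi of r(s, a, psi) * p(psi | s, a)
r_sa = sum(r_transition[psi] * p[psi] for psi in p)
print(r_sa)  # 5.0 * 0.6 + 2.0 * 0.4 = 3.8
```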

A decision rule d_t is used to choose the action to be taken at decision epoch t ∈ T. Here, the decision rule is based on the threshold value: if the signal reaches the threshold (i.e., the point of tolerance), the decision rule is applied to select an action. A decision rule is said to be Markovian if it depends only on the current state of the system (i.e., d_t is a function of s_t). Otherwise, the decision rule is said to be history dependent, in which case it depends on the entire history of states and actions from the first decision epoch through the present. Such histories are denoted h_t = (s_1, a_1, s_2, a_2, …, s_{t−1}, a_{t−1}, s_t). Since a mobile terminal depends only on the current state of the signal, we use a Markovian policy. Decision rules are further classified into deterministic and randomized; a randomized action is chosen from a probability distribution. A policy specifies a decision rule d_1, d_2, d_3, … for every decision epoch. The policy is said to be Markovian if every decision rule it specifies has the corresponding property. Since an MDP is a stochastic process, the successive states and actions realized by the process form a sequence of random variables. The following notation is used for these variables: for each
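The threshold-based Markovian decision rule described above can be sketched as follows; the threshold value here is a hypothetical placeholder, since the actual point of tolerance would be tuned per deployment:

```python
# Hypothetical RSS threshold (dBm) below which a handoff is triggered.
RSS_THRESHOLD = -85.0

def decision_rule(rss_dbm):
    """Markovian decision rule d_t: depends only on the current
    state (the currently received signal strength), not on the
    history of earlier states and actions."""
    return "handoff" if rss_dbm < RSS_THRESHOLD else "stay"

print(decision_rule(-90.0))  # handoff
print(decision_rule(-70.0))  # stay
```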

t ∈ T, let X_t be the random state, taking values in S, and let Y_t be the action taken in that state, where Y_t ∈ A_s. The Markov reward process is then given by (X_t, r_t(X_t, Y_t) : t ∈ T).
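A short simulation of the resulting Markov reward process (X_t, r(X_t, Y_t)) under a fixed Markovian decision rule; all states, probabilities, and rewards below are assumed values for the sketch, not measurements:

```python
import random

# Illustrative two-state chain with a fixed decision rule d(s).
P = {("good", "stay"):    {"good": 0.8, "bad": 0.2},
     ("bad",  "handoff"): {"good": 0.6, "bad": 0.4}}
R = {("good", "stay"): 4.0, ("bad", "handoff"): 1.0}
RULE = {"good": "stay", "bad": "handoff"}   # decision rule d(s)

def simulate(s0, steps, rng):
    """Generate the sequence (X_t, r(X_t, Y_t)) for t = 1..steps."""
    s, trace = s0, []
    for _ in range(steps):
        a = RULE[s]                      # Y_t = d(X_t)
        trace.append((s, R[(s, a)]))     # record (X_t, r(X_t, Y_t))
        dist = P[(s, a)]
        s = rng.choices(list(dist), weights=list(dist.values()))[0]
    return trace

print(simulate("good", 5, random.Random(1)))
```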

### 4. Markov Policy Based Seamless Handoff

We now assume that T = {1, 2, …} and that the reward functions and transition probabilities do not vary over time; that is, for every t ∈ T we have r_t(s, a) = r(s, a) and p_t(ψ | s, a) = p(ψ | s, a). Here p(ψ | s, a) is the probability that ψ is the state occupied at the next decision epoch (e.g., at time t = 2 when starting at t = 1). We deal with stationary policies, denoted Ω = d^∞ = (d, d, …), meaning that the same decision rule is used at every decision epoch; here the decision rule is to choose the optimal network for handoff. Since Ω is Markovian, there are several performance metrics that can be used to assign a value to the policy, which depends on the initial state of the system. The expected total discounted reward of Ω is given as:

v_Ω(s) = lim_{N→∞} E_s^Ω [ Σ_{t=1}^{N} λ^{t−1} r(X_t, Y_t) ],

where

λ ∈ [0, 1) is the discount rate. Notice that the limit is guaranteed to exist if the absolute value of the reward function is uniformly bounded: for example, if |r(s, a)| ≤ N for all s ∈ S and a ∈ A_s, then the total expected discounted reward is also uniformly bounded, |v_Ω(s)| ≤ N / (1 − λ). The value thus depends on the initial state (s) and the policy (Ω), with the actions (Y_t) chosen based on the throughput.
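As a numeric sketch of the discounted criterion, with λ = 0.5 as used later in the paper and a hypothetical reward stream (the reward values are assumptions for illustration):

```python
# Hypothetical reward stream r(X_1, Y_1), r(X_2, Y_2), ... and the
# discounted total reward sum_t lambda^(t-1) * r(X_t, Y_t).
lam = 0.5
rewards = [6.0, 5.0, 9.0, 6.0, 5.0]   # assumed r(X_t, Y_t) values

# enumerate starts t at 0, which matches the exponent t-1 for t = 1, 2, ...
v = sum(lam ** t * r for t, r in enumerate(rewards))
N = max(abs(r) for r in rewards)
bound = N / (1 - lam)                 # uniform bound N / (1 - lambda)
print(v, bound)
assert v <= bound                     # the discounted total stays below the bound
```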

### 5. Handoff Policy Evaluation

Ω = (d_1, d_2, …) is the Markovian policy, and the total expected reward can be expressed as in Section 4. For t = 1, 2, …, we have two states S = (s_1, s_2), in which the WiFi network tries to connect to the other networks that are available, such as GSM, WiMAX, and CDMA. The actions are named A_1, A_2, and A_3, where A_1 refers to a changeover from WiFi to WiMAX, A_2 from WiFi to GSM, and A_3 from WiFi to CDMA. The rewards are represented in data-rate units in Table 1.


The discount rate (λ) has been given as 0.5. We estimated the expected total rewards for the three networks to be R_2(1) = 9.7, R_2(2) = 8.75, and R_2(3) = 15.6; in state 1, WiMAX obtains the highest connection reward for handoff, while in state 2, CDMA obtains the highest reward. At this juncture, we calculate the expected transition times L_Ω from the equations L_Ω(1,3) = 1 + 0.2 L_Ω(1,3) + 0.2 L_Ω(2,3), L_Ω(2,3) = 1 + 0.3 L_Ω(1,3) + 0.4 L_Ω(2,3), and L_Ω(3,3) = 1 + L_Ω(1,3). Finally, by rearranging these equations we obtain L_Ω(1,3) = 3.9, L_Ω(2,3) = 11.6, and L_Ω(3,3) = 4.9. We now estimate the expected rewards obtained from the current state to the next state under policy Ω.
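Equations of this form constitute a small linear system in the unknown transition times, which can be solved directly. A sketch with hypothetical transition probabilities (placeholders, not the paper's measured values):

```python
# Expected transition times satisfying equations of the form
#   L1 = 1 + p11*L1 + p12*L2
#   L2 = 1 + p21*L1 + p22*L2
# form a 2x2 linear system.  The coefficients below are hypothetical.
p11, p12 = 0.3, 0.2
p21, p22 = 0.1, 0.4

# Rewritten as (1-p11)*L1 - p12*L2 = 1 and -p21*L1 + (1-p22)*L2 = 1,
# then solved by Cramer's rule.
a, b = 1 - p11, -p12
c, d = -p21, 1 - p22
det = a * d - b * c
L1 = (d - b) / det    # (1*d - b*1) / det
L2 = (a - c) / det    # (a*1 - 1*c) / det
print(L1, L2)
```

The same rearrangement applies to the paper's equations with its own coefficients.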

Here r_{i,k,Ω} is the reward of a one-step transition from state i to state k (the reward gained in the initial state), with r_{1,k,Ω} = 6, r_{2,k,Ω} = 5, and r_{3,k,Ω} = 9. From Table 2, we obtain the equations R_Ω(1,3) = 6 + 0.2 R_Ω(1,3) + 0.2 R_Ω(2,3), R_Ω(2,3) = 5 + 0.3 R_Ω(1,3) + 0.4 R_Ω(2,3), and R_Ω(3,3) = 9 + R_Ω(1,3). Rearranging these equations, we obtain R_Ω(1,3) = 12.4, R_Ω(2,3) = 19.6, and R_Ω(3,3) = 21.4. The average reward obtained under policy Ω is then computed from these quantities, where

ψ denotes the next state, for which we calculate the connection rewards for each state transition. We used the dataset of RSS measurements [15] collected by the University of Colorado's Wide-Area Radio Testbed, taking the received signal strength for the downstream, as shown in Fig. 1.
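If the average reward under Ω is formed as the ratio of the expected reward to the expected transition time (an assumption about the intended formula, which is not reproduced here), the figures reported above give:

```python
# Expected rewards R_omega(i, 3) and expected transition times
# L_omega(i, 3) as reported above; treating the average reward as
# their ratio is an assumption about the intended formula.
R = {1: 12.4, 2: 19.6, 3: 21.4}
L = {1: 3.9, 2: 11.6, 3: 4.9}

avg = {i: R[i] / L[i] for i in R}
for i, g in sorted(avg.items()):
    print(f"state {i}: average reward per step ~ {g:.2f}")
```

Under this reading, state 3 yields the highest average reward per step.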