References: Reinforcement Learning: An Introduction and David Silver's reinforcement learning lectures.
This chapter mainly follows David Silver's slides. It is best to read the slides directly; here I only point out the places where it is easy to make mistakes.


Markov processes are the foundation of reinforcement learning.


Finite Markov Decision Processes

Markov property

A state $S_t$ is Markov if and only if

$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$$

  • The state captures all relevant information from the history
  • Once the state is known, the history may be thrown away
  • i.e. The state is a sufficient statistic of the future

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.
Markov Process

A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$

  • $\mathcal{S}$ is a (finite) set of states
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
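
As a minimal illustration of a Markov chain defined purely by a transition matrix, the following Python sketch samples a state sequence; the two-state matrix `P` is a made-up example, not taken from the slides.

```python
import numpy as np

# Hypothetical 2-state Markov chain; P[s, s'] = P[S_{t+1} = s' | S_t = s].
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)

def sample_chain(P, s0, steps):
    """Sample a state sequence using only the current state (Markov property)."""
    states = [s0]
    for _ in range(steps):
        s = states[-1]
        states.append(rng.choice(len(P), p=P[s]))
    return states

print(sample_chain(P, s0=0, steps=10))
```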

A Markov reward process is a Markov chain with values.
Markov Reward Process

A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • $\mathcal{S}$ is a (finite) set of states
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$

Note the definition of $\mathcal{P}_{ss'}$ here: it is the probability of transitioning from state $s$ to state $s'$.

The return is easy to confuse later because of its name; note that it is not the same as the single reward above.
Return

The return Gt is the total discounted reward from time-step t.

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

  • The discount $\gamma \in [0,1]$ is the present value of future rewards
  • The value of receiving reward $R$ after $k+1$ time-steps is $\gamma^k R$
    • $\gamma$ close to 0 leads to “myopic” evaluation
    • $\gamma$ close to 1 leads to “far-sighted” evaluation
      Many of the methods discussed later look far into the future (far-sighted); see the sketch after this list.
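
A small sketch of how the discounted return is computed from a finite reward sequence; the reward list and the two discount factors are arbitrary values chosen only to contrast myopic and far-sighted evaluation.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step needs only one multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]            # made-up reward sequence
print(discounted_return(rewards, 0.0))    # myopic: only the first reward counts
print(discounted_return(rewards, 0.99))   # far-sighted: later rewards still matter
```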

Value Function

The state value function $v(s)$ of an MRP is the expected return starting from state $s$

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$

It is worth looking at the Bellman Equation of the MRP and comparing it with the MDP. An MRP does not involve actions at all. Since the MDP is the protagonist of reinforcement learning, we skip the MRP example in David Silver's slides; it can easily cause confusion when understanding MDPs later.
A quick look at the Bellman Equation:

$$v(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$

State transitions in an MRP are not influenced by any action; the influence of actions is considered in the MDP below.
$$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$$

Observe that the computation above is dynamic programming; noting that the Bellman Equation is also called the dynamic programming equation, this computation becomes easy to understand.
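
To make the dynamic-programming view concrete, here is a minimal sketch that solves the MRP Bellman equation $v = \mathcal{R} + \gamma \mathcal{P} v$ both by a direct linear solve and by fixed-point iteration; the 3-state MRP is made up purely for illustration.

```python
import numpy as np

# Made-up 3-state MRP.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([1.0, 2.0, 0.0])     # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Direct solution of v = R + gamma * P v  =>  (I - gamma * P) v = R
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Dynamic-programming (fixed-point) iteration of the same equation
v = np.zeros(3)
for _ in range(1000):
    v = R + gamma * P @ v

print(v_direct, v)   # the two solutions agree
```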


A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Markov Decision Process

A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • $\mathcal{S}$ is a (finite) set of states
  • $\mathcal{A}$ is a finite set of actions
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$

[Figure: MDP state transition diagram from David Silver's slides]
Note the difference from the MRP above: the black dots here are the intermediate stage reached after taking an action, which will later be written as $q(s,a)$; the probability of moving from a black dot to a successor state $s'$ is exactly the $\mathcal{P}_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$ defined in the MDP above.

Policy

A policy π is a distribution over actions given states,

$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$

  • A policy fully defines the behaviour of an agent
  • MDP policies depend on the current state (not the history)
  • i.e. Policies are stationary (time-independent), $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$
  • Given an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$
  • The state sequence $S_1, S_2, \dots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^\pi \rangle$
  • The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$
  • where
    $$\mathcal{P}_{s,s'}^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}_{ss'}^a \qquad \mathcal{R}_s^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}_s^a$$

Pay special attention to the definition of the policy as a distribution: in the off-policy methods discussed later, the policy that generates the samples and the target policy are different.
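
A small sketch of how a fixed policy collapses an MDP into the induced MRP $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$ via the two averaging formulas above; the 2-state, 2-action MDP and the uniform random policy are made-up examples.

```python
import numpy as np

# Made-up MDP: P[a, s, s'] = P[S_{t+1}=s' | S_t=s, A_t=a], R[a, s] = E[R_{t+1} | S_t=s, A_t=a].
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # action 0
    [[0.3, 0.7], [0.6, 0.4]],   # action 1
])
R = np.array([
    [1.0, 0.0],
    [0.5, 2.0],
])
pi = np.array([[0.5, 0.5],      # pi[s, a] = pi(a|s), a uniform random policy
               [0.5, 0.5]])

# Averaging over the policy gives the induced MRP quantities.
P_pi = np.einsum('sa,ast->st', pi, P)   # P^pi_{s,s'} = sum_a pi(a|s) P^a_{s,s'}
R_pi = np.einsum('sa,as->s', pi, R)     # R^pi_s     = sum_a pi(a|s) R^a_s
print(P_pi)
print(R_pi)
```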

Value Function (this one is for the MDP)

The state-value function vπ(s) of an MDP is the expected return starting from state s, and then following policy π

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The action-value function qπ(s,a) is the expected return
starting from state s, taking action a, and then following policy π

$$q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Bellman Expectation Equation for Vπ

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s,a)$$

Bellman Expectation Equation for Qπ
$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s')$$

Substituting $q_\pi$ into $v_\pi$:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \right)$$

Substituting $v_\pi$ into $q_\pi$:

$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$$
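
The Bellman expectation equation for $v_\pi$ turns directly into iterative policy evaluation. The sketch below assumes the `(A, S, S)` transition tensor and `(A, S)` reward layout used in the earlier sketches, which is my own convention, not from the slides.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterate v(s) <- sum_a pi(a|s) [ R^a_s + gamma * sum_s' P^a_{ss'} v(s') ].

    P: (A, S, S) transitions, R: (A, S) expected rewards, pi: (S, A) policy.
    """
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * P @ v                   # q[a, s]
        v_new = np.einsum('sa,as->s', pi, q)    # average q over the policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Same made-up 2-state, 2-action MDP as in the earlier sketch.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, R, pi, gamma=0.9))
```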

Optimal Value Function

The optimal state-value function $v_*(s)$ is the maximum value function over all policies

$$v_*(s) = \max_\pi v_\pi(s)$$

The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies

$$q_*(s,a) = \max_\pi q_\pi(s,a)$$

Once $q_*$ is known, the problem is solved; this is more convenient than knowing $v_*$. Also note that the maximization above is over all policies $\pi$, choosing the one that maximizes $q$; this is how the optimal value function gives rise to the notion of an optimal policy. Of course, there is no direct way to compute it, and various approximation methods for this problem are introduced later.

Optimal Policy
Define a partial ordering over policies

$$\pi \geq \pi' \text{ if } v_\pi(s) \geq v_{\pi'}(s), \forall s$$

Finding an Optimal Policy
An optimal policy can be found by maximising over $q_*(s,a)$,

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in \mathcal{A}} q_*(s,a) \\ 0 & \text{otherwise} \end{cases}$$

If we know $q_*(s,a)$, then we can immediately obtain the optimal policy.
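
A minimal sketch of extracting that deterministic greedy policy from a tabular $q_*$; the $q_*$ table here is a made-up example.

```python
import numpy as np

def greedy_policy(q):
    """Given q*(s, a) as an (S, A) array, return pi*(a|s) = 1 for
    a = argmax_a q*(s, a), and 0 otherwise."""
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), np.argmax(q, axis=1)] = 1.0
    return pi

q_star = np.array([[1.0, 3.0],
                   [2.0, 0.5]])
print(greedy_policy(q_star))
# [[0. 1.]
#  [1. 0.]]
```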

Bellman Expectation Equation (Sutton & Barto notation)

$$\begin{aligned} v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] \\ &= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a)\Big[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\Big] \\ &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v_\pi(s')\Big], \quad \text{for all } s \in \mathcal{S} \end{aligned}$$

The Agent-Environment Interface
  • The learner and decision maker is called the agent.
  • The thing it interacts with, comprising everything outside the agent, is called the environment.

The MDP and the agent together generate a sequence, or trajectory:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$

The following function defines the dynamics of the MDP: the agent is in some state $s$, takes action $a$ in that state, then arrives at state $s'$ and receives reward $r$. This formula is the key to the MDP; everything else can be derived from this four-argument function.

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$

These probabilities satisfy:

$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}(s)$$
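
A minimal sketch of representing the four-argument dynamics $p(s', r \mid s, a)$ as a lookup table, checking the normalization above, and sampling a short trajectory $S_0, A_0, R_1, S_1, \dots$; the states, actions, rewards, and probabilities are all made up for illustration.

```python
import random

# p[(s, a)] is a list of (s_next, r, prob) triples, i.e. p(s', r | s, a).
p = {
    ('s0', 'left'):  [('s0', 0.0, 0.9), ('s1', 1.0, 0.1)],
    ('s0', 'right'): [('s1', 1.0, 1.0)],
    ('s1', 'left'):  [('s0', 0.0, 1.0)],
    ('s1', 'right'): [('s1', 5.0, 0.2), ('s0', 0.0, 0.8)],
}

# The probabilities for each (s, a) must sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-12

def step(s, a):
    """Sample (S_t, R_t) given (S_{t-1}, A_{t-1}) from the dynamics."""
    outcomes = p[(s, a)]
    probs = [prob for _, _, prob in outcomes]
    s_next, r, _ = random.choices(outcomes, weights=probs)[0]
    return s_next, r

# Generate a short trajectory S0, A0, R1, S1, A1, R2, ... under a random policy.
s = 's0'
trajectory = [s]
for _ in range(5):
    a = random.choice(['left', 'right'])
    s, r = step(s, a)
    trajectory += [a, r, s]
print(trajectory)
```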

3.2 Goals and Rewards

The agent's goal is to maximize the total reward it receives.

3.5 Policies and Value Functions

state-value function for policy π

$$\begin{aligned} v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] \\ &= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a)\Big[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\Big] \\ &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v_\pi(s')\Big], \quad \text{for all } s \in \mathcal{S} \end{aligned}$$

action-value function for policy π

$$q_\pi(s,a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$

For any policy $\pi$ and any state $s$, the consistency condition above holds between the value of $s$ and the values of its possible successor states; this is the Bellman equation for $v_\pi$.

3.6 Optimal Policies and Optimal Value Functions

optimal state-value function

$$v_*(s) \doteq \max_\pi v_\pi(s)$$

optimal action-value function

$$q_*(s,a) \doteq \max_\pi q_\pi(s,a)$$

Writing $q_*$ in terms of $v_*$:

$$q_*(s,a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$$

Bellman optimality equation

$$v_*(s) = \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s,a)$$



$$\begin{aligned} v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s,a) \\ &= \max_a \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\ &= \max_a \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\ &= \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\ &= \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_*(s')\big] \end{aligned}$$



$$\begin{aligned} q_*(s,a) &= \mathbb{E}\Big[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Big] \\ &= \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma \max_{a'} q_*(s', a')\big] \end{aligned}$$
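
Turning the Bellman optimality equation into an update rule gives value iteration, which later chapters build on. The sketch below is only meant to show the $\max_a$ backup, reusing the made-up `(A, S, S)` / `(A, S)` MDP layout from the earlier sketches.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate v(s) <- max_a ( R^a_s + gamma * sum_s' P^a_{ss'} v(s') ).

    P: (A, S, S) transitions, R: (A, S) expected rewards.
    Returns v* and the greedy (optimal) deterministic policy.
    """
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * P @ v          # q[a, s]
        v_new = q.max(axis=0)          # max over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)
        v = v_new

# Same made-up 2-state, 2-action MDP used in the earlier sketches.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v_star, pi_star = value_iteration(P, R, gamma=0.9)
print(v_star, pi_star)
```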
