References: Reinforcement Learning: An Introduction and David Silver's reinforcement learning lectures.
This chapter mainly follows David Silver's slides. It is best to read the slides directly; here I only point out the places where it is easy to make mistakes.


Markov processes are the foundation of reinforcement learning.


Finite Markov Decision Processes

Markov property

A state $S_t$ is Markov if and only if

$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$$

  • The state captures all relevant information from the history
  • Once the state is known, the history may be thrown away
  • i.e. The state is a sufficient statistic of the future

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.
Markov Process

A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$

  • $\mathcal{S}$ is a (finite) set of states
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
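
As a minimal illustration of a Markov chain defined purely by a transition matrix, the following Python sketch samples a state sequence; the two-state matrix `P` is a made-up example, not taken from the slides.

```python
import numpy as np

# Hypothetical 2-state Markov chain; P[s, s'] = P[S_{t+1} = s' | S_t = s].
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)

def sample_chain(P, s0, steps):
    """Sample a state sequence using only the current state (Markov property)."""
    states = [s0]
    for _ in range(steps):
        s = states[-1]
        states.append(rng.choice(len(P), p=P[s]))
    return states

print(sample_chain(P, s0=0, steps=10))
```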

A Markov reward process is a Markov chain with values.
Markov Reward Process

A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • $\mathcal{S}$ is a (finite) set of states
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$

Note the definition of $\mathcal{P}_{ss'}$ here: it is the probability of transitioning from state $s$ to state $s'$.

The return is easy to confuse later because of its name; note that it is not the same as the single reward above.
Return

The return Gt is the total discounted reward from time-step t.

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

  • The discount $\gamma \in [0,1]$ is the present value of future rewards
  • The value of receiving reward $R$ after $k+1$ time-steps is $\gamma^k R$
    • $\gamma$ close to 0 leads to “myopic” evaluation
    • $\gamma$ close to 1 leads to “far-sighted” evaluation
      Many of the methods discussed later look far into the future (far-sighted); see the sketch after this list.
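
A small sketch of how the discounted return is computed from a finite reward sequence; the reward list and the two discount factors are arbitrary values chosen only to contrast myopic and far-sighted evaluation.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step needs only one multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]            # made-up reward sequence
print(discounted_return(rewards, 0.0))    # myopic: only the first reward counts
print(discounted_return(rewards, 0.99))   # far-sighted: later rewards still matter
```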

Value Function

The state value function $v(s)$ of an MRP is the expected return starting from state $s$

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$

It is worth looking at the Bellman Equation of the MRP and comparing it with the MDP. An MRP does not involve actions at all. Since the MDP is the protagonist of reinforcement learning, we skip the MRP example in David Silver's slides; it can easily cause confusion when understanding MDPs later.
A quick look at the Bellman Equation:

$$v(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$

State transitions in an MRP are not influenced by any action; the influence of actions is considered in the MDP below.
$$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$$

Observe that the computation above is dynamic programming; noting that the Bellman Equation is also called the dynamic programming equation, this computation becomes easy to understand.
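
To make the dynamic-programming view concrete, here is a minimal sketch that solves the MRP Bellman equation $v = \mathcal{R} + \gamma \mathcal{P} v$ both by a direct linear solve and by fixed-point iteration; the 3-state MRP is made up purely for illustration.

```python
import numpy as np

# Made-up 3-state MRP.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([1.0, 2.0, 0.0])     # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Direct solution of v = R + gamma * P v  =>  (I - gamma * P) v = R
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Dynamic-programming (fixed-point) iteration of the same equation
v = np.zeros(3)
for _ in range(1000):
    v = R + gamma * P @ v

print(v_direct, v)   # the two solutions agree
```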


A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Markov Decision Process

A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • $\mathcal{S}$ is a (finite) set of states
  • $\mathcal{A}$ is a finite set of actions
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$

[Figure: MDP state transition diagram from David Silver's slides]
Note the difference from the MRP above: the black dots here are the intermediate stage reached after taking an action, which will later be written as $q(s,a)$; the probability of moving from a black dot to a successor state $s'$ is exactly the $\mathcal{P}_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$ defined in the MDP above.

Policy

A policy π is a distribution over actions given states,

$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$

  • A policy fully defines the behaviour of an agent
  • MDP policies depend on the current state (not the history)
  • i.e. Policies are stationary (time-independent), $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$
  • Given an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$
  • The state sequence $S_1, S_2, \dots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^\pi \rangle$
  • The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$
  • where
    $$\mathcal{P}_{s,s'}^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}_{ss'}^a \qquad \mathcal{R}_s^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}_s^a$$

Pay special attention to the definition of the policy as a distribution: in the off-policy methods discussed later, the policy that generates the samples and the target policy are different.
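
A small sketch of how a fixed policy collapses an MDP into the induced MRP $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$ via the two averaging formulas above; the 2-state, 2-action MDP and the uniform random policy are made-up examples.

```python
import numpy as np

# Made-up MDP: P[a, s, s'] = P[S_{t+1}=s' | S_t=s, A_t=a], R[a, s] = E[R_{t+1} | S_t=s, A_t=a].
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # action 0
    [[0.3, 0.7], [0.6, 0.4]],   # action 1
])
R = np.array([
    [1.0, 0.0],
    [0.5, 2.0],
])
pi = np.array([[0.5, 0.5],      # pi[s, a] = pi(a|s), a uniform random policy
               [0.5, 0.5]])

# Averaging over the policy gives the induced MRP quantities.
P_pi = np.einsum('sa,ast->st', pi, P)   # P^pi_{s,s'} = sum_a pi(a|s) P^a_{s,s'}
R_pi = np.einsum('sa,as->s', pi, R)     # R^pi_s     = sum_a pi(a|s) R^a_s
print(P_pi)
print(R_pi)
```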

Value Function (this one is for the MDP)

The state-value function vπ(s) of an MDP is the expected return starting from state s, and then following policy π

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The action-value function qπ(s,a) is the expected return
starting from state s, taking action a, and then following policy π

$$q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Bellman Expectation Equation for Vπ

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s,a)$$

Bellman Expectation Equation for Qπ
$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s')$$

Substituting $q_\pi$ into $v_\pi$:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \right)$$

Substituting $v_\pi$ into $q_\pi$:

$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$$
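
The Bellman expectation equation for $v_\pi$ turns directly into iterative policy evaluation. The sketch below assumes the `(A, S, S)` transition tensor and `(A, S)` reward layout used in the earlier sketches, which is my own convention, not from the slides.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterate v(s) <- sum_a pi(a|s) [ R^a_s + gamma * sum_s' P^a_{ss'} v(s') ].

    P: (A, S, S) transitions, R: (A, S) expected rewards, pi: (S, A) policy.
    """
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * P @ v                   # q[a, s]
        v_new = np.einsum('sa,as->s', pi, q)    # average q over the policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Same made-up 2-state, 2-action MDP as in the earlier sketch.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, R, pi, gamma=0.9))
```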

Optimal Value Function

The optimal state-value function $v_*(s)$ is the maximum value function over all policies

$$v_*(s) = \max_\pi v_\pi(s)$$

The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies

$$q_*(s,a) = \max_\pi q_\pi(s,a)$$

Once $q_*$ is known, the problem is solved; this is more convenient than knowing $v_*$. Also note that the maximization above is over all policies $\pi$, choosing the one that maximizes $q$; this is how the optimal value function gives rise to the notion of an optimal policy. Of course, there is no direct way to compute it, and various approximation methods for this problem are introduced later.

Optimal Policy
Define a partial ordering over policies

$$\pi \geq \pi' \text{ if } v_\pi(s) \geq v_{\pi'}(s), \forall s$$

Finding an Optimal Policy
An optimal policy can be found by maximising over $q_*(s,a)$,

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in \mathcal{A}} q_*(s,a) \\ 0 & \text{otherwise} \end{cases}$$

If we know $q_*(s,a)$, then we can immediately obtain the optimal policy.
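
A minimal sketch of extracting that deterministic greedy policy from a tabular $q_*$; the $q_*$ table here is a made-up example.

```python
import numpy as np

def greedy_policy(q):
    """Given q*(s, a) as an (S, A) array, return pi*(a|s) = 1 for
    a = argmax_a q*(s, a), and 0 otherwise."""
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), np.argmax(q, axis=1)] = 1.0
    return pi

q_star = np.array([[1.0, 3.0],
                   [2.0, 0.5]])
print(greedy_policy(q_star))
# [[0. 1.]
#  [1. 0.]]
```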

Bellman Expectation Equation (Sutton & Barto notation)

$$\begin{aligned} v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] \\ &= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a)\Big[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\Big] \\ &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v_\pi(s')\Big], \quad \text{for all } s \in \mathcal{S} \end{aligned}$$

The Agent-Environment Interface
  • The learner and decision maker is called the agent.
  • The thing it interacts with, comprising everything outside the agent, is called the environment.

The MDP and the agent together generate a sequence, or trajectory:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$

The following function defines the dynamics of the MDP: the agent is in some state $s$, takes action $a$ in that state, then arrives at state $s'$ and receives reward $r$. This formula is the key to the MDP; everything else can be derived from this four-argument function.

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$

These probabilities satisfy:

$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}(s)$$
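
A minimal sketch of representing the four-argument dynamics $p(s', r \mid s, a)$ as a lookup table, checking the normalization above, and sampling a short trajectory $S_0, A_0, R_1, S_1, \dots$; the states, actions, rewards, and probabilities are all made up for illustration.

```python
import random

# p[(s, a)] is a list of (s_next, r, prob) triples, i.e. p(s', r | s, a).
p = {
    ('s0', 'left'):  [('s0', 0.0, 0.9), ('s1', 1.0, 0.1)],
    ('s0', 'right'): [('s1', 1.0, 1.0)],
    ('s1', 'left'):  [('s0', 0.0, 1.0)],
    ('s1', 'right'): [('s1', 5.0, 0.2), ('s0', 0.0, 0.8)],
}

# The probabilities for each (s, a) must sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-12

def step(s, a):
    """Sample (S_t, R_t) given (S_{t-1}, A_{t-1}) from the dynamics."""
    outcomes = p[(s, a)]
    probs = [prob for _, _, prob in outcomes]
    s_next, r, _ = random.choices(outcomes, weights=probs)[0]
    return s_next, r

# Generate a short trajectory S0, A0, R1, S1, A1, R2, ... under a random policy.
s = 's0'
trajectory = [s]
for _ in range(5):
    a = random.choice(['left', 'right'])
    s, r = step(s, a)
    trajectory += [a, r, s]
print(trajectory)
```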

3.2 Goals and Rewards

The agent's goal is to maximize the total reward it receives.

3.5 Policies and Value Functions

state-value function for policy π

$$\begin{aligned} v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] \\ &= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a)\Big[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\Big] \\ &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma v_\pi(s')\Big], \quad \text{for all } s \in \mathcal{S} \end{aligned}$$

action-value function for policy π

$$q_\pi(s,a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$

For any policy $\pi$ and any state $s$, the consistency condition above holds between the value of $s$ and the values of its possible successor states; this is the Bellman equation for $v_\pi$.

3.6 Optimal Policies and Optimal Value Functions

optimal state-value function

$$v_*(s) \doteq \max_\pi v_\pi(s)$$

optimal action-value function

$$q_*(s,a) \doteq \max_\pi q_\pi(s,a)$$

Writing $q_*$ in terms of $v_*$:

$$q_*(s,a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$$

Bellman optimality equation

$$v_*(s) = \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s,a)$$



$$\begin{aligned} v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s,a) \\ &= \max_a \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\ &= \max_a \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\ &= \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\ &= \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_*(s')\big] \end{aligned}$$



$$\begin{aligned} q_*(s,a) &= \mathbb{E}\Big[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Big] \\ &= \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma \max_{a'} q_*(s', a')\big] \end{aligned}$$
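
Turning the Bellman optimality equation into an update rule gives value iteration, which later chapters build on. The sketch below is only meant to show the $\max_a$ backup, reusing the made-up `(A, S, S)` / `(A, S)` MDP layout from the earlier sketches.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate v(s) <- max_a ( R^a_s + gamma * sum_s' P^a_{ss'} v(s') ).

    P: (A, S, S) transitions, R: (A, S) expected rewards.
    Returns v* and the greedy (optimal) deterministic policy.
    """
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * P @ v          # q[a, s]
        v_new = q.max(axis=0)          # max over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)
        v = v_new

# Same made-up 2-state, 2-action MDP used in the earlier sketches.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v_star, pi_star = value_iteration(P, R, gamma=0.9)
print(v_star, pi_star)
```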
