Q-Learning Algorithm
```python
import numpy as np
import gym
import random

if __name__ == '__main__':
    env = gym.make("FrozenLake-v0")
    env.render()

    action_size = env.action_space.n
    print("Action size ", action_size)
    state_size = env.observation_space.n
    print("State size ", state_size)

    qtable = np.zeros((state_size, action_size))
    print(qtable)

    total_episodes = 10000  # Total episodes
    learning_rate = 0.8     # Learning rate
    max_steps = 99          # Max steps per episode
    gamma = 0.95            # Discounting rate

    # Exploration parameters
    epsilon = 1.0           # Exploration rate
    max_epsilon = 1.0       # Exploration probability at start
    min_epsilon = 0.01      # Minimum exploration probability
    decay_rate = 0.001      # Exponential decay rate for exploration prob

    # List of rewards
    rewards = []

    # For life or until learning is stopped
    for episode in range(total_episodes):
        # Reset the environment
        state = env.reset()
        step = 0
        done = False
        total_rewards = 0

        for step in range(max_steps):
            # Choose an action a in the current world state (s):
            # first we randomize a number
            exp_exp_tradeoff = random.uniform(0, 1)

            # If this number > epsilon --> exploitation
            # (taking the biggest Q value for this state)
            if exp_exp_tradeoff > epsilon:
                action = np.argmax(qtable[state, :])
            # Else do a random choice --> exploration
            else:
                action = env.action_space.sample()

            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, done, info = env.step(action)

            # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            # qtable[new_state, :] : all the actions we can take from the new state
            qtable[state, action] = qtable[state, action] + learning_rate * (
                reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
            print("action is %d , reward is %d , qtable[state, action] is %f , new_state is %d" % (
                action, reward, qtable[state, action], new_state))
            total_rewards = total_rewards + reward

            # Our new state is state
            state = new_state

            # If done (if we're dead): finish the episode
            if done:
                break

        # Reduce epsilon (because we need less and less exploration)
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        rewards.append(total_rewards)

    print("Score over time: " + str(sum(rewards) / total_episodes))
    print("qtable is ", qtable)
    print("epsilon is", epsilon)
```
Comparison of Q-Learning and Sarsa:

When Sarsa and Q-Learning are in state s, both choose an action a that promises the largest return and thereby reach state s'. The difference lies in the next step. Q-Learning observes which action at s' would bring the largest return (without actually executing that action; the max is only used to update the Q table), and when the time comes to act at s', it chooses again based on the updated Q table. Sarsa is the practitioner: the action it evaluates at s' is exactly the action it will execute next. So the target value of Q(s, a) changes slightly: the max Q is dropped and replaced by the Q value of the action a' actually chosen at s'; finally, as in Q-Learning, the gap between this target and the current estimate is used to update Q(s, a) in the Q table.
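In symbols, with learning rate $\alpha$ and discount $\gamma$ as in the code above, the two update rules differ only in the target term:

$$
\begin{aligned}
\text{Q-learning:}\quad & Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \\
\text{Sarsa:}\quad & Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s',a') - Q(s,a) \right]
\end{aligned}
$$

where in Sarsa the a' is the action actually executed at s'.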
My understanding is as follows:

When Q-learning updates a Q value, it takes action a in state s to reach the new state s', and updates the current Q value as if the next move were the most profitable action at s', i.e. using max Q(s', a'). But the action actually taken next is uncertain: it may be random or chosen from the Q table. The value used for evaluation and the action actually executed can disagree, so Q-learning is an off-policy algorithm.

Sarsa, after executing an action, applies its (epsilon-)greedy policy once to the new state s' and uses exactly the action so chosen to update the current Q table, i.e. Q(s', a'), and then executes that same action at the next step. The evaluated value and the executed action agree, so Sarsa is an on-policy algorithm.

Because of the max Q term, Q-Learning is also a particularly brave algorithm: it always takes what currently looks like the shortest route to success, no matter how dangerous that route is. Sarsa, by contrast, is quite conservative and keeps well away from danger; this is the practical difference when using Sarsa.
Based on Sutton's classic textbook, here is my brief take:

Q-learning and Sarsa are two variants of TD methods; one is off-policy and the other on-policy. Both TD methods and Monte Carlo methods rest on the principle of GPI (generalized policy iteration), which alternates two steps: policy evaluation (the so-called prediction problem) and policy improvement (choosing the best action).

Q-learning is off-policy because during policy evaluation it uses an epsilon-greedy policy (the behavior policy, which guarantees enough exploration of state-action pairs), while during policy improvement it uses the greedy policy (the target policy). Since the behavior policy and the target policy differ, Q-learning is off-policy.

Also, even if the behavior policies of both Sarsa and Q-learning were greedy, the two algorithms could still produce different results.
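A minimal sketch of that distinction, with two hypothetical helper functions (`qtable` is the NumPy array from the scripts above and `n_actions` equals `action_size`):

```python
import random
import numpy as np

def behavior_policy(qtable, state, n_actions, epsilon):
    """Epsilon-greedy: the policy that actually generates experience."""
    if random.uniform(0, 1) > epsilon:
        return int(np.argmax(qtable[state, :]))  # exploit the Q table
    return random.randrange(n_actions)           # explore a random action

def target_policy_value(qtable, state):
    """Greedy: the policy that Q-learning's update evaluates (the max term)."""
    return float(np.max(qtable[state, :]))
```

Q-learning's update uses `target_policy_value(qtable, new_state)` no matter what `behavior_policy` returns next; Sarsa's update instead uses the Q value of the action that `behavior_policy` actually returns.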
The corresponding Sarsa loop (imports, environment setup, and hyperparameters are the same as in the Q-learning script above; the action a' chosen for s' is both used in the update and actually executed next):

```python
for episode in range(total_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0

    # Choose the first action with the epsilon-greedy behavior policy
    exp_exp_tradeoff = random.uniform(0, 1)
    if exp_exp_tradeoff > epsilon:
        action = np.argmax(qtable[state, :])
    else:
        action = env.action_space.sample()

    for step in range(max_steps):
        # Take the action (a) and observe the outcome state (s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Choose the next action a' at s' with the same epsilon-greedy policy
        exp_exp_tradeoff = random.uniform(0, 1)
        if exp_exp_tradeoff > epsilon:
            new_action = np.argmax(qtable[new_state, :])
        else:
            new_action = env.action_space.sample()

        # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * Q(s',a') - Q(s,a)]
        # (the actually chosen a' replaces Q-learning's max over a')
        qtable[state, action] = qtable[state, action] + learning_rate * (
            reward + gamma * qtable[new_state, new_action] - qtable[state, action])
        print("action is %d , reward is %d , qtable[state, action] is %f , new_state is %d" % (
            action, reward, qtable[state, action], new_state))
        total_rewards = total_rewards + reward

        # Our new state is state, and a' becomes the action executed next
        state = new_state
        action = new_action

        # If done (if we're dead): finish the episode
        if done:
            break

    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    rewards.append(total_rewards)
```
Sarsa(lambda) Algorithm

Sarsa(lambda) is an improved version of Sarsa. The main difference between the two: after each action is taken and a reward is received, Sarsa updates only the immediately preceding Q(s, a), while Sarsa(lambda) also updates the steps that led up to the reward.
The flow of the Sarsa(lambda) algorithm is as follows (figure: Sarsa(lambda) pseudocode, omitted here).

Compared with Sarsa, the Sarsa(lambda) algorithm adds a matrix E (the eligibility trace), which records every step experienced along the path, so that each update also touches the steps experienced earlier.

The parameter lambda ranges over [0, 1]. If lambda = 0, Sarsa(lambda) degenerates into Sarsa, updating only the last step taken before the reward; if lambda = 1, Sarsa(lambda) updates every step taken before the reward was received. lambda can be read as a decay on the footprints: steps closer to the cheese matter more, and steps farther away matter less for obtaining it.
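In symbols, one step of tabular Sarsa(lambda) computes the usual Sarsa TD error, bumps the trace for the visited pair, and then updates every state-action pair in proportion to its trace (this matches the learn method shown further below):

$$
\begin{aligned}
\delta &= r + \gamma\, Q(s', a') - Q(s, a) \\
E(s, a) &\leftarrow E(s, a) + 1 \\
Q(\tilde s, \tilde a) &\leftarrow Q(\tilde s, \tilde a) + \alpha\, \delta\, E(\tilde s, \tilde a) \quad \text{for all } \tilde s, \tilde a \\
E(\tilde s, \tilde a) &\leftarrow \gamma\, \lambda\, E(\tilde s, \tilde a) \quad \text{for all } \tilde s, \tilde a
\end{aligned}
$$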
Compared with Sarsa, Sarsa(lambda) has the following advantages:

Although Sarsa updates while it walks, before the cheese is obtained the Q value of the current step does not change at all (with zero intermediate rewards and a zero-initialized table); only after the cheese is obtained is the step immediately before it updated, and all the earlier steps taken on the way are treated as if they had nothing to do with getting the cheese. Sarsa(lambda), by contrast, updates every step taken on the way to the cheese, with steps closer to the cheese weighted as more important and distant ones less so (the decay being controlled by lambda). Sarsa(lambda) can therefore learn the optimal policy faster and more effectively.

In the first few episodes, the mouse has no clue and may circle in place for a long time, forming repeated loops that contribute little to learning. Sarsa(lambda) can address this as follows: before the step E(s,a) ← E(s,a) + 1, first set E(s, ·) = 0, i.e. zero out the row of state s, so that only the most recent action taken at state s keeps a trace (Method 2 in the learn method below).
The key update is the learn method of the Sarsa(lambda) agent (here q_table and eligibility_trace are pandas DataFrames with one row per state and one column per action):

```python
def learn(self, s, a, r, s_, a_):
    self.check_state_exist(s_)
    q_predict = self.q_table.loc[s, a]
    if s_ != 'terminal':
        q_target = r + self.gamma * self.q_table.loc[s_, a_]  # next state is not terminal
    else:
        q_target = r  # next state is terminal
    error = q_target - q_predict

    # Increase the trace amount for the visited state-action pair
    # Method 1 (accumulating trace):
    # self.eligibility_trace.loc[s, a] += 1
    # Method 2 (replacing trace):
    self.eligibility_trace.loc[s, :] *= 0
    self.eligibility_trace.loc[s, a] = 1

    # Q update: every state-action pair is updated in proportion to its trace
    self.q_table += self.lr * error * self.eligibility_trace

    # Decay the eligibility trace after the update
    self.eligibility_trace *= self.gamma * self.lambda_
```
The difference between the two methods can be summarized by a figure (omitted here) that tracks the eligibility value of a single state-action pair over time: the top row marks the time points at which the pair is visited, and the curves below show the resulting "eligibility" values. With the accumulating trace,

```python
self.eligibility_trace.loc[s, a] += 1
```

the trace keeps growing while the pair is visited repeatedly. With the replacing trace,

```python
self.eligibility_trace.loc[s, :] *= 0
self.eligibility_trace.loc[s, a] = 1
```

the trace is reset to exactly 1 on every visit.
My understanding: with the accumulating "+1" method, the mouse can get stuck in a loop, repeating the same flag-planting action over and over; because those visits are always the most recent, gamma and lambda cannot decay the trace exponentially fast enough, and the Q value there grows too large. The replacing method effectively avoids this vicious circle.
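A tiny numerical sketch of why this matters (hypothetical values; the decay factor gamma * lambda is the same one used in the learn method above). With repeated visits the accumulating trace grows past 1, while the replacing trace is capped at 1:

```python
gamma, lambda_ = 0.95, 0.9
decay = gamma * lambda_

acc = rep = 0.0
for t, visited in enumerate([1, 1, 1, 0, 0]):  # the pair is visited 3 steps in a row
    if visited:
        acc += 1.0   # Method 1: accumulating trace
        rep = 1.0    # Method 2: replacing trace
    print("t=%d  accumulating=%.3f  replacing=%.3f" % (t, acc, rep))
    acc *= decay     # both traces decay after every step
    rep *= decay
```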
Sources and references:
https://blog.csdn.net/u010089444/article/details/80516345