Function approximator and Q-learning: answers

【Question title】: Function approximator and q-learning
【Posted】: 2017-09-17 01:48:05
【Question】:

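I am trying to implement Q-learning with an action-value function approximator. I am using openai-gym and the "MountainCar-v0" environment to test my algorithm. My problem is that it does not converge or find the goal at all.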

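Basically, the approximator works as follows: you feed in the 2 features (position and velocity) plus one of the 3 actions in one-hot encoding: 0 -> [1,0,0], 1 -> [0,1,0] and 2 -> [0,0,1]. The output is the action-value approximation Q_approx(s,a) for that one specific action.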
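As a small illustration of the input layout described above, the sketch below builds the 5-element vector [position, velocity, one-hot action]. The helper name `encode_input` is made up for this example and does not appear in the original code.

```python
import numpy as np

N_ACTIONS = 3  # MountainCar-v0 has 3 discrete actions

def encode_input(state, action):
    """Concatenate the 2 state features with a one-hot encoding of the action."""
    one_hot = np.zeros(N_ACTIONS)
    one_hot[action] = 1.0
    return np.concatenate([np.asarray(state, dtype=float), one_hot])

# Example: position -0.5, velocity 0.01, action 2
x = encode_input([-0.5, 0.01], 2)
print(x)
```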

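I know that usually the input is the state (2 features) and the output layer contains one output per action. The big difference I see is that I run the feed-forward pass 3 times (once per action) and take the max, whereas the standard implementation runs it once and takes the max over the outputs.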
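The three forward passes can also be issued as one batched call, which yields the same Q-values and the same max as looping over the actions. This is a minimal sketch only: `q_approx` is a hypothetical stand-in (a fixed random linear map) for the network, and `q_values_for_state` is a name invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=5)  # hypothetical weights standing in for a trained network

def q_approx(batch):
    """Stand-in for a model's predict: maps rows of 5 features to one Q-value each."""
    return batch @ W

def q_values_for_state(state, n_actions=3):
    """Evaluate Q(s, a) for every action with a single batched call."""
    states = np.tile(np.asarray(state, dtype=float), (n_actions, 1))  # shape (3, 2)
    actions = np.eye(n_actions)                                       # one-hot rows, shape (3, 3)
    return q_approx(np.hstack([states, actions]))                     # shape (3,)

q = q_values_for_state([-0.5, 0.01])
greedy_action = int(np.argmax(q))  # action with the largest estimated value
max_q = float(np.max(q))           # value used in the Q-learning target
```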

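Maybe my implementation is completely wrong and I am thinking about this the wrong way. I'll paste the code here; it is a mess, but I am just experimenting: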

import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Written against the gym API of the time:
# reset() returns the observation, step() returns 4 values
env = gym.make('MountainCar-v0')

# Total reward of the last 20 episodes (used for a running mean)
mean_rewards = np.zeros(20)
# Network input: 2 state features (position, velocity) + 3 one-hot action bits
features = np.zeros(5)
# Q(s, a) estimates for the 3 actions
qa_vals = np.zeros(3)

one_hot = {
    0: np.asarray([1, 0, 0]),
    1: np.asarray([0, 1, 0]),
    2: np.asarray([0, 0, 1])
}

model = Sequential()
model.add(Dense(20, activation="relu", input_dim=5))
model.add(Dense(10, activation="relu"))
model.add(Dense(1))
# This is regression on Q-values, so the 'accuracy' metric is dropped
model.compile(optimizer='rmsprop', loss='mse')

epsilon_greedy = 0.1
discount = 0.9
batch_size = 16

# Experience replay: each row holds 5 input features and 1 target
experience = np.ones((10 * 300, 5 + 1))
fill_index = 0        # next row to write (was never initialised)
filled_once = False   # True once the buffer has been filled completely

# Ring buffer
def add_exp(features, target, index):
    global filled_once
    experience[index, 0:5] = features
    experience[index, 5] = target
    index += 1
    # Wrap after writing, so the flag is not set on the very first call
    if index == experience.shape[0]:
        index = 0
        filled_once = True
    return index

for e in range(0, 100000):
    obs = env.reset()
    old_obs = None
    new_obs = obs
    rewards = 0
    loss = 0
    for step in range(0, 300):

        if old_obs is not None:
            # Find max_a Q(s_(t+1), a)
            features[0:2] = new_obs
            for pa in (0, 1, 2):  # separate name, no longer shadows the step counter
                features[2:5] = one_hot[pa]
                qa_vals[pa] = model.predict(features.reshape(-1, 5))[0, 0]

            rewards += reward
            target = reward + discount * np.max(qa_vals)

            features[0:2] = old_obs
            features[2:5] = one_hot[a]

            fill_index = add_exp(features, target, fill_index)

            # Find new action (epsilon-greedy)
            if np.random.random() < epsilon_greedy:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(qa_vals))
        else:
            a = env.action_space.sample()

        obs, reward, done, info = env.step(a)

        old_obs = new_obs
        new_obs = obs

        # Note: the transition that ends the episode is never stored
        if done:
            break

        if filled_once:
            samples_ids = np.random.choice(experience.shape[0], batch_size)
            # Without extra metrics, train_on_batch returns the scalar loss
            loss += model.train_on_batch(experience[samples_ids, 0:5], experience[samples_ids, 5])
    mean_rewards[e % 20] = rewards
    print("e = {} and loss = {}".format(e, loss))
    if e % 50 == 0:
        print("e = {} and mean = {}".format(e, mean_rewards.mean()))

Thanks in advance!

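【Comments】:

I have heard of using the action as a feature before, but I have not heard of it working well. I think you are better off following convention here and using the actions as outputs. Mathematically, the two networks will be quite different.

Tags: reinforcement-learning openai-gym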

【Solution 1】:

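There should not be much difference between feeding the action to the network as an input and giving the network a separate output per action. It does make a huge difference if, for example, your states are images, because conv nets work very well on images and there is no obvious way to integrate the action into the input.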
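For comparison, here is a rough sketch of the action-as-output variant: one forward pass returns all 3 Q-values, and only the output of the action actually taken is moved towards the Q-learning target. The tiny NumPy network, the names `q_all_actions` and `build_targets`, and the terminal-step handling via `dones` are all assumptions made for this illustration. They are not taken from the question's code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical tiny network: 2 state features -> 8 hidden units -> 3 Q-values
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def q_all_actions(states):
    """One forward pass returns Q(s, a) for all 3 actions of each state row."""
    hidden = np.maximum(0.0, states @ W1 + b1)  # ReLU
    return hidden @ W2 + b2

def build_targets(states, actions, rewards, next_states, dones, discount=0.9):
    """Regression targets: only the taken action's output gets the TD target."""
    targets = q_all_actions(states).copy()
    next_max = q_all_actions(next_states).max(axis=1)
    # Assumption: no bootstrapping from the next state on terminal steps
    td = rewards + discount * next_max * (1.0 - dones)
    targets[np.arange(len(actions)), actions] = td
    return targets

states = np.array([[-0.5, 0.0], [0.4, 0.03]])
next_states = np.array([[-0.49, 0.01], [0.5, 0.04]])
targets = build_targets(states, np.array([2, 1]), np.array([-1.0, -1.0]),
                        next_states, np.array([0.0, 1.0]))
```

The other two outputs keep their current predictions as targets, so they contribute zero error for that sample.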

Have you tried the cartpole balancing environment? It is a better way to test whether your model is working correctly.

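The mountain-climbing task is pretty hard. It gives no informative reward until you reach the top (every step is penalised the same), and reaching the top often does not happen at all. The model only starts learning something useful once you have reached the top at least once. If you never get to the top, you should probably spend more time exploring. In other words, take more random actions, a lot more...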
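One common way to take more random actions early on is to start with a high epsilon and decay it over the episodes. A minimal sketch follows. The exponential schedule, its constants (1.0, 0.1, 0.995) and the function names are illustrative assumptions, not values from this answer.

```python
import random

def epsilon_for_episode(episode, start=1.0, end=0.1, decay=0.995):
    """Exponentially decay epsilon from `start` towards the floor `end`."""
    return max(end, start * decay ** episode)

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

eps_first = epsilon_for_episode(0)     # fully random at the start
eps_late = epsilon_for_episode(2000)   # has reached the floor by now
action = choose_action([0.1, 0.5, 0.2], epsilon=0.0)  # purely greedy choice
```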
