【Posted】: 2020-08-03 10:04:25
【Question】:
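I made a small Python game in which the player has to reach the end tile while avoiding traps.
I have tried many different batch sizes, rewards, input shapes, and numbers of nodes in the hidden layers, but the network still does not train.
My current setup uses a batch size of 64 and a memory size of 100000. The input is a 1D array representing the game state, plus the player's coordinates, plus the number of moves remaining before the game ends. The reward starts at -distanceFromEnd + maxDistance / 2. If you reach the end, you get a reward of +500 and the game ends. If you touch a trap, you get a reward of -100 and the game ends. If the game is not finished within 64 moves, you get a reward of -200 and the game ends.
I am using the Adam optimizer and the MSE loss function. For activations, I use ReLU on every layer except the last.
The positions of the player, the end tile, and the traps are randomized after every episode.
Even after 3000 episodes, the average score over the last 100 games (the score is the sum of rewards) is around -30.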
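The "average score of the last 100 games" metric can be tracked as below. This is a minimal sketch, not the code from the question: `make_score_tracker` is a hypothetical helper, and the three scores fed in are made-up values, not real results.

```python
from collections import deque


def make_score_tracker(window=100):
    """Return (record, average) closures over the last `window` episode scores."""
    scores = deque(maxlen=window)  # older scores drop off automatically

    def record(episode_score):
        scores.append(episode_score)

    def average():
        # Avoid dividing by zero before the first episode has finished.
        return sum(scores) / len(scores) if scores else 0.0

    return record, average


record, average = make_score_tracker(window=100)
for episode_score in [-200.0, -100.0, 500.0]:  # made-up scores for illustration
    record(episode_score)
print(average())
```

Until 100 episodes have been played, the average covers only the episodes seen so far.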
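The same DQN works well on the Gym environment LunarLander-v2.
As I said, I have been trying to tweak the values, but it has not helped.
First, here are the labels I use in the state: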
FLOOR = 1
END = 2
TRAP = 3
PLAYER = 4
Here is my step function:
def step(self, action):
    isDone = False
    if action == 0:
        # Move Up
        if self.playerY != 0:
            self.playerY -= 1
    elif action == 1:
        # Move Down
        if self.playerY != 7:
            self.playerY += 1
    elif action == 2:
        # Move Right
        if self.playerX != 0:
            self.playerX -= 1
    elif action == 3:
        # Move Left
        if self.playerX != 7:
            self.playerX += 1
    x = self.playerX - self.endX
    x = x * x
    y = self.playerY - self.endY
    y = y * y
    distance = math.sqrt(x + y)
    reward = -distance + self.maxDist
    #self.lastDist = distance
    if self.state[self.playerX, self.playerY] == self.END:
        reward = 500
        isDone = True
    elif self.state[self.playerX, self.playerY] == self.TRAP:
        reward = -100
        isDone = True
    self.moves -= 1
    if self.moves < 0:
        reward = -200
        isDone = True
    return self.getFlatState(), reward, isDone, 0
The state getter function:
# Adding one to the player's coordinates to avoid 0s, as an attempt to fix the problem
def getFlatState(self):
    return np.concatenate([np.ndarray.flatten(self.state), [self.playerX + 1, self.playerY + 1, self.moves]])
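To make the shape of the network input concrete, the flattening above can be reproduced outside the environment class. This is a sketch under assumptions: the 8x8 grid size is inferred from the `!= 7` bounds checks in step(), and `flat_state` and the tile positions are hypothetical stand-ins.

```python
import numpy as np

GRID_SIZE = 8  # assumption: inferred from the `!= 7` bounds checks in step()
FLOOR, END, TRAP, PLAYER = 1, 2, 3, 4


def flat_state(grid, player_x, player_y, moves_left):
    """Mirror getFlatState(): the flattened grid followed by three scalars."""
    extras = [player_x + 1, player_y + 1, moves_left]
    return np.concatenate([grid.flatten(), extras])


# Hypothetical board: one player, one end tile, one trap.
grid = np.full((GRID_SIZE, GRID_SIZE), FLOOR)
grid[0, 0] = PLAYER
grid[7, 7] = END
grid[3, 4] = TRAP

obs = flat_state(grid, player_x=0, player_y=0, moves_left=64)
print(obs.shape)  # (67,)
```

Under these assumptions the observation has 64 grid entries plus 3 scalars, so `input_dims` would be `(67,)`. The grid entries lie between 1 and 4, while the moves counter starts at 64.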
Here is the DQN/agent script:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import load_model


class ReplayBuffer():
    def __init__(self, max_size, input_dims):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.state_memory = np.zeros((self.mem_size, *input_dims),
                                     dtype=np.float32)
        self.new_state_memory = np.zeros((self.mem_size, *input_dims),
                                         dtype=np.float32)
        self.action_memory = np.zeros(self.mem_size, dtype=np.int32)
        self.reward_memory = np.zeros(self.mem_size, dtype=np.float32)
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.int32)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.reward_memory[index] = reward
        self.action_memory[index] = action
        self.terminal_memory[index] = 1 - int(done)
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)
        states = self.state_memory[batch]
        states_ = self.new_state_memory[batch]
        rewards = self.reward_memory[batch]
        actions = self.action_memory[batch]
        terminal = self.terminal_memory[batch]
        return states, actions, rewards, states_, terminal


def build_dqn(lr, n_actions, input_dims, fc1_dims, fc2_dims):
    model = keras.Sequential([
        keras.layers.Dense(fc1_dims, activation='relu'),
        keras.layers.Dense(fc2_dims, activation='relu'),
        keras.layers.Dense(n_actions, activation=None)])
    model.compile(optimizer=Adam(learning_rate=lr), loss='mean_squared_error')
    return model


class Agent():
    def __init__(self, lr, gamma, n_actions, epsilon, batch_size,
                 input_dims, epsilon_dec=1e-3, epsilon_end=0.01,
                 mem_size=1000000, fname='dqn_model.h5'):
        self.action_space = [i for i in range(n_actions)]
        self.gamma = gamma
        self.epsilon = epsilon
        self.eps_dec = epsilon_dec
        self.eps_min = epsilon_end
        self.batch_size = batch_size
        self.model_file = fname
        self.memory = ReplayBuffer(mem_size, input_dims)
        self.q_eval = build_dqn(lr, n_actions, input_dims, 256, 128)

    def store_transition(self, state, action, reward, new_state, done):
        self.memory.store_transition(state, action, reward, new_state, done)

    def choose_action(self, observation):
        if np.random.random() < self.epsilon:
            action = np.random.choice(self.action_space)
        else:
            state = np.array([observation])
            actions = self.q_eval.predict(state)
            action = np.argmax(actions)
        return action

    def learn(self):
        if self.memory.mem_cntr < self.batch_size:
            return
        states, actions, rewards, states_, dones = \
            self.memory.sample_buffer(self.batch_size)
        q_eval = self.q_eval.predict(states)
        q_next = self.q_eval.predict(states_)
        q_target = np.copy(q_eval)
        batch_index = np.arange(self.batch_size, dtype=np.int32)
        q_target[batch_index, actions] = rewards + \
            self.gamma * np.max(q_next, axis=1)*dones
        self.q_eval.train_on_batch(states, q_target)
        self.epsilon = self.epsilon - self.eps_dec if self.epsilon > \
            self.eps_min else self.eps_min

    def save_model(self):
        self.q_eval.save(self.model_file)

    def load_model(self):
        self.q_eval = load_model(self.model_file)
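The target computation inside learn() can be checked in isolation with plain NumPy. The following is a small worked example under assumptions: the Q-values, rewards, and actions are made-up numbers, and `gamma = 0.99` is an assumed discount factor since the question does not state one. The name `not_done` stands for what the script calls `dones`, which is stored as `1 - int(done)`.

```python
import numpy as np

gamma = 0.99  # assumed discount factor; not given in the question

# Two made-up transitions with 4 actions each.
q_eval = np.array([[1.0, 2.0, 3.0, 4.0],
                   [0.5, 0.5, 0.5, 0.5]])
q_next = np.array([[10.0, 20.0, 5.0, 0.0],
                   [7.0, 8.0, 9.0, 6.0]])
actions = np.array([1, 2])
rewards = np.array([-3.0, 500.0])
not_done = np.array([1, 0])  # 1 - int(done): the second transition is terminal

# Start from the current predictions so that only the taken action
# contributes to the MSE loss.
q_target = np.copy(q_eval)
batch_index = np.arange(len(actions))
q_target[batch_index, actions] = (
    rewards + gamma * np.max(q_next, axis=1) * not_done
)
print(q_target)
```

For the non-terminal row the target is `-3 + 0.99 * 20`, and for the terminal row the mask removes the bootstrap term so the target is the reward alone. All other entries keep their predicted values.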
【Discussion】:
-
@desertnaut colab.research.google.com/drive/…
-
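Hmm... a quick question: how many times did it reach the goal (during random exploration)? Are you using an epsilon-greedy policy? Your question doesn't mention it...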
-
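@neelg Yes, I use epsilon-greedy with epsilon starting at 1 and decaying to 0.01. It reaches the goal about 5% of the time, but I thought that since I reward getting closer to the goal, it should learn to move toward it.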
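The decay rule from Agent.learn() can be replayed on its own to see how quickly epsilon reaches its floor. This is a minimal sketch assuming the default `epsilon_dec=1e-3` and `epsilon_end=0.01` from the Agent signature and a starting epsilon of 1; the helper names are hypothetical, and how often learn() is called per environment step is not shown in the question.

```python
def epsilon_schedule(start=1.0, dec=1e-3, end=0.01):
    """Yield epsilon after each learn() call, using the rule from Agent.learn()."""
    eps = start
    while True:
        eps = eps - dec if eps > end else end
        yield eps


def steps_until_floor(start=1.0, dec=1e-3, end=0.01):
    """Count learn() calls until epsilon first drops to `end` or below."""
    for step, eps in enumerate(epsilon_schedule(start, dec, end), start=1):
        if eps <= end:
            return step


n_calls = steps_until_floor()
print(n_calls)  # roughly a thousand calls with these defaults
```

With these values, exploration is almost entirely switched off after roughly a thousand learn() calls, which is early compared with a 3000-episode run if learn() runs on every step.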
-
I have added the DQN/Agent code to the question.
-
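You said the agent's goal and starting position are generated randomly? Could you first try it with static obstacles? Then we can isolate the problem further by studying the agent's behavior. Right now there are too many factors involved, and it is hard to pin down a solution...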
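The static-layout experiment suggested here could look something like the sketch below. Everything in it is an assumption made for illustration: the 8x8 grid size is inferred from the bounds checks in step(), and `make_fixed_layout` and the chosen coordinates are hypothetical, not taken from the original environment.

```python
import numpy as np

FLOOR, END, TRAP, PLAYER = 1, 2, 3, 4
GRID_SIZE = 8  # assumption: inferred from the `!= 7` bounds checks in step()


def make_fixed_layout():
    """Build the same board on every reset (hypothetical positions)."""
    grid = np.full((GRID_SIZE, GRID_SIZE), FLOOR)
    player_pos = (0, 0)
    end_pos = (7, 7)
    trap_positions = [(3, 3), (4, 5)]
    grid[player_pos] = PLAYER
    grid[end_pos] = END
    for pos in trap_positions:
        grid[pos] = TRAP
    return grid, player_pos, end_pos


grid, player_pos, end_pos = make_fixed_layout()
print(grid)
```

Calling this from the environment's reset instead of randomizing positions gives every episode an identical board, so any remaining failure to learn cannot be blamed on the changing layout.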
Tags: python machine-learning deep-learning tensorflow2.0 reinforcement-learning