[Posted]: 2021-03-21 19:28:00
[Question]:
I am implementing REINFORCE with baseline, but I have a question about the discounted-reward function.
This is how I implemented it:
```python
import numpy as np

def disc_r(rewards, gamma=0.1):
    r = np.zeros_like(rewards, dtype=float)
    tsteps = range(len(rewards))  # timesteps
    sum_reward = 0
    for i in reversed(tsteps):
        sum_reward = rewards[i] + gamma * sum_reward
        r[i] = sum_reward
        print(r[i])
    return r - np.mean(r)
```
So, for example, with discount factor gamma = 0.1 and rewards = [1, 2, 3, 4], the printed values (before the mean subtraction) are:

r = [1.234, 2.34, 3.4, 4.0]
This matches the expression for the return G:
the return is the discounted sum of rewards, accumulated backwards: G = discount_factor * G + reward
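As a sanity check, the backward recursion can be traced by hand for these numbers (a standalone sketch of the same accumulation, separate from my implementation):

```python
# Trace the backward recursion G = reward + gamma * G
# for gamma = 0.1 and rewards = [1, 2, 3, 4].
# These are the per-step returns before the mean is subtracted.
gamma = 0.1
rewards = [1, 2, 3, 4]

G = 0.0
returns = []
for reward in reversed(rewards):   # walk from the last timestep backwards
    G = reward + gamma * G
    returns.append(G)
returns.reverse()                  # restore original time order
print(returns)
```

Walking backwards: G = 4, then 3 + 0.1·4 = 3.4, then 2 + 0.1·3.4 = 2.34, then 1 + 0.1·2.34 = 1.234, which reproduces the list above.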
However, my question comes from this article on Towards Data Science https://towardsdatascience.com/learning-reinforcement-learning-reinforce-with-pytorch-5e8ad7fc7da0, where the same function is defined as follows:
```python
def discount_rewards(rewards, gamma=0.99):
    r = np.array([gamma**i * rewards[i] for i in range(len(rewards))])
    # Reverse the array direction for cumsum and then revert back to the original order
    r = r[::-1].cumsum()[::-1]
    print(r)
    return r - r.mean()
```
Evaluating it with the same gamma = 0.1 and rewards = [1, 2, 3, 4] gives:

r = [1.234, 0.234, 0.034, 0.004]
Here, though, I can't see the recursion; it doesn't seem to follow the rule for G...
Does anyone know what is going on in the second function, and why it might also be correct (or in which cases it might be)?
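For context, comparing the two functions numerically (my own observation, not from the article): the second function's value at step t appears to equal gamma**t times the return G_t produced by the first, i.e. each return is additionally discounted back to t = 0. A standalone check:

```python
import numpy as np

gamma = 0.1
rewards = np.array([1, 2, 3, 4], dtype=float)

# First function's returns (before mean subtraction):
# G_t = r_t + gamma * G_{t+1}, accumulated backwards.
G = np.zeros_like(rewards)
running = 0.0
for i in reversed(range(len(rewards))):
    running = rewards[i] + gamma * running
    G[i] = running

# Second function's core (before mean subtraction):
r = np.array([gamma**i * rewards[i] for i in range(len(rewards))])
r = r[::-1].cumsum()[::-1]

# Check r[t] == gamma**t * G[t] for every t.
scale = gamma ** np.arange(len(rewards))
print(np.allclose(r, scale * G))  # True
```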
[Discussion]:
Tags: python reinforcement-learning reward