How can I change this to use a Q-table for reinforcement learning: answers

【Question title】: How can I change this to use a q table for reinforcement learning
【Posted】: 2025-12-05 17:20:03
【Question description】:

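I'm learning about Q-tables and got a simple version running that only uses a one-dimensional array to move forward and backward. Now I'm trying four-directional movement and am stuck on controlling the agent.

I have the random movement working now, and it will eventually find the goal. But I want it to learn how to reach the goal instead of stumbling onto it at random, so I would appreciate any advice on adding Q-learning to this code. Thanks.

Here is my full code, as it is simple for now.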

import numpy as np
import random
import math

world = np.zeros((5,5))
print(world)
# Make sure that it can never be 0 i.e the start point
goal_x = random.randint(1,4)
goal_y = random.randint(1,4)
goal = (goal_x, goal_y)
print(goal)
world[goal] = 1
print(world)

LEFT = 0
RIGHT = 1
UP = 2
DOWN = 3
map_range_min = 0
map_range_max = 4  # largest valid index in the 5x5 world

class Agent:
    def __init__(self, current_position, my_goal, world):
        self.current_position = current_position
        self.last_postion = current_position
        self.visited_positions = []
        self.goal = my_goal
        self.last_reward = 0
        self.totalReward = 0
        self.q_table = world


    # Update the total reward by the reward
    def updateReward(self, extra_reward):
        # This will either increase or decrease the total reward for the episode
        x = (self.goal[0] - self.current_position[0]) ** 2
        y = (self.goal[1] - self.current_position[1]) ** 2
        dist = math.sqrt(x + y)
        complete_reward = dist + extra_reward
        self.totalReward += complete_reward

    def validate_move(self):
        valid_move_set = []
        # Check for x ranges
        if map_range_min < self.current_position[0] < map_range_max:
            valid_move_set.append(LEFT)
            valid_move_set.append(RIGHT)
        elif map_range_min == self.current_position[0]:
            valid_move_set.append(RIGHT)
        else:
            valid_move_set.append(LEFT)
        # Check for Y ranges
        if map_range_min < self.current_position[1] < map_range_max:
            valid_move_set.append(UP)
            valid_move_set.append(DOWN)
        elif map_range_min == self.current_position[1]:
            valid_move_set.append(DOWN)
        else:
            valid_move_set.append(UP)
        return valid_move_set

    # Make the agent move
    def move_right(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        x += 1
        y = self.current_position[1]
        return (x, y)
    def move_left(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        x -= 1
        y = self.current_position[1]
        return (x, y)
    def move_down(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        y = self.current_position[1]
        y += 1
        return (x, y)
    def move_up(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        y = self.current_position[1]
        y -= 1
        return (x, y)

    def move_agent(self):
        move_set = self.validate_move()
        randChoice = random.randint(0, len(move_set)-1)
        move = move_set[randChoice]
        if move == UP:
            return self.move_up()
        elif move == DOWN:
            return self.move_down()
        elif move == RIGHT:
            return self.move_right()
        else:
            return self.move_left()

    # Update the rewards
    # Return False to end the episode (goal found), True to keep playing
    def checkPosition(self):
        if self.current_position == self.goal:
            print("Found Goal")
            self.updateReward(10)
            return False
        else:
            #Chose new direction
            self.current_position = self.move_agent()
            self.visited_positions.append(self.current_position)
            # Currently get nothing for not reaching the goal
            self.updateReward(0)
            return True


gus = Agent((0, 0), goal, world)
play = gus.checkPosition()
while play:
    play = gus.checkPosition()

print(gus.totalReward)

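【Comments】:

Q is typically a function of state and action, whereas here it only maps one-to-one onto the state. I'd suggest mapping your x-dimensional state representation to a 1D state representation so that Q always has exactly 2 dimensions.
So, like flattening the world (5x5) into a 1D array of length 25?
Yes, and then you need another dimension for the actions, i.e. Q(s,a).
q_table = np.zeros((2,25))
It just occurred to me that you may not be able to solve this with RL. You have an unknown goal position that keeps changing. The problem is that your state representation isn't Markov. One way to fix this is to include the relative position of the goal as part of the state.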
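
To make the comments above concrete, here is a minimal sketch of flattening the 5x5 grid into 25 integer states and giving the Q-table a second axis for the four actions. The names `encode_state`, `decode_state`, `encode_relative` and `GRID_SIZE` are made up for illustration, and the row-major layout is only one possible choice. The last helper shows one way the goal's relative position could be used as the state instead, as the final comment suggests.

```python
import numpy as np

GRID_SIZE = 5   # the world in the question is 5x5
N_ACTIONS = 4   # LEFT, RIGHT, UP, DOWN

def encode_state(x, y):
    """Flatten an (x, y) grid position into one integer in [0, 24]."""
    return x * GRID_SIZE + y

def decode_state(state):
    """Recover the (x, y) position from a flattened state index."""
    return divmod(state, GRID_SIZE)

def encode_relative(pos, goal):
    """Alternative state: the offset from agent to goal, as one integer.

    Each offset runs from -4 to 4, so there are 9 * 9 = 81 such states.
    """
    span = 2 * GRID_SIZE - 1
    dx = goal[0] - pos[0] + GRID_SIZE - 1
    dy = goal[1] - pos[1] + GRID_SIZE - 1
    return dx * span + dy

# One row per state, one column per action: Q(s, a) is q_table[s, a].
q_table = np.zeros((GRID_SIZE * GRID_SIZE, N_ACTIONS))

s = encode_state(2, 3)
greedy_action = int(np.argmax(q_table[s]))  # best known action in state s
```

With this layout the table has shape (25, 4), one row per grid cell and one column per action.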

Tags: python artificial-intelligence reinforcement-learning q-learning

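【Solution 1】:

Based on your code example, I have a few suggestions:

Separate the environment from the agent. The environment needs a method of the form new_state, reward = env.step(old_state, action). This method says how an action transforms your old state into a new state. It's a good idea to encode your states and actions as simple integers. I strongly recommend setting up unit tests for this method.
The agent then needs an equivalent method, action = agent.policy(state, reward). As a first step, you should hand-code an agent that does what you think is right. For example, it might just try to head towards the goal location.
Consider whether the state representation is Markovian. If you could do better at the problem by remembering all the past states you have visited, then the state does not have the Markov property. Ideally, the state representation should be compact (the smallest set that is still Markovian).
Once this structure is set up, you can look at actually learning a Q-table. One possible method (easy to understand, but not necessarily that efficient) is Monte Carlo with either exploring starts or an epsilon-soft greedy policy. A good RL book should give pseudocode for either variant.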
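
The suggestions above could be sketched roughly as follows. This is only an illustration under several assumptions: the function names `step`, `handcoded_policy` and `mc_control` are hypothetical, `step` takes the goal as an extra argument and returns a `done` flag in addition to the signature given above, and the reward scheme (-1 per move, +10 at the goal) and hyperparameters are arbitrary choices. The learning part is every-visit Monte Carlo control with an epsilon-greedy policy.

```python
import random

GRID = 5  # 5x5 world, as in the question
LEFT, RIGHT, UP, DOWN = 0, 1, 2, 3
MOVES = {LEFT: (-1, 0), RIGHT: (1, 0), UP: (0, -1), DOWN: (0, 1)}

def step(state, action, goal_state):
    """Environment transition: returns (new_state, reward, done).

    States are integers x * GRID + y. A move off the grid leaves the
    agent where it is. Rewards are an assumption made for this sketch.
    """
    x, y = divmod(state, GRID)
    dx, dy = MOVES[action]
    nx = min(max(x + dx, 0), GRID - 1)
    ny = min(max(y + dy, 0), GRID - 1)
    new_state = nx * GRID + ny
    if new_state == goal_state:
        return new_state, 10.0, True
    return new_state, -1.0, False

def handcoded_policy(state, goal_state):
    """Baseline agent: close the x gap first, then the y gap."""
    x, y = divmod(state, GRID)
    gx, gy = divmod(goal_state, GRID)
    if x != gx:
        return RIGHT if gx > x else LEFT
    return DOWN if gy > y else UP

def mc_control(goal_state, episodes=2000, epsilon=0.1, gamma=0.95, seed=0):
    """Every-visit Monte Carlo control with an epsilon-greedy policy."""
    rng = random.Random(seed)
    q = [[0.0] * 4 for _ in range(GRID * GRID)]
    counts = [[0] * 4 for _ in range(GRID * GRID)]
    for _ in range(episodes):
        state, done, episode = 0, False, []
        # Generate one episode, capped at 200 moves so it always ends.
        while not done and len(episode) < 200:
            if rng.random() < epsilon:
                action = rng.randrange(4)
            else:
                best = max(q[state])
                ties = [a for a in range(4) if q[state][a] == best]
                action = rng.choice(ties)
            new_state, reward, done = step(state, action, goal_state)
            episode.append((state, action, reward))
            state = new_state
        # Walk backwards, averaging the observed returns into Q.
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + gamma * g
            counts[s][a] += 1
            q[s][a] += (g - q[s][a]) / counts[s][a]
    return q
```

Because `step` is a pure function of integers, it is straightforward to unit test, and the hand-coded policy gives a baseline path length to compare the learned Q-table against.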

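When you feel confident, head over to OpenAI Gym (https://gym.openai.com/) for some more detailed class structures. There are some hints on creating your own environments here: https://gym.openai.com/docs/#environments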
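
As a rough idea of that class structure, the classic Gym interface exposes `reset()`, which returns an observation, and `step(action)`, which returns an observation, a reward, a done flag and an info dict. The sketch below imitates that shape in plain Python without importing Gym, so `GridWorld` is a hypothetical stand-in and not a real `gym.Env` subclass. It is paired with a tabular Q-learning loop, the method the question asks about (not the Monte Carlo approach mentioned above). The reward values, step cap and hyperparameters are assumptions.

```python
import random

class GridWorld:
    """Hypothetical 5x5 grid world with a Gym-like reset()/step() interface."""
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # LEFT, RIGHT, UP, DOWN

    def __init__(self, goal=(3, 2), size=5):
        self.size, self.goal = size, goal
        self.pos = (0, 0)

    def _obs(self):
        # Flatten (x, y) into one integer state.
        return self.pos[0] * self.size + self.pos[1]

    def reset(self):
        self.pos = (0, 0)
        return self._obs()

    def step(self, action):
        dx, dy = self.MOVES[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        done = self.pos == self.goal
        reward = 10.0 if done else -1.0  # assumed reward scheme
        return self._obs(), reward, done, {}

def q_learning(env, episodes=500, alpha=0.5, gamma=0.95, epsilon=0.1,
               max_steps=1000, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0] * 4 for _ in range(env.size * env.size)]
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.randrange(4)
            else:
                action = max(range(4), key=lambda a: q[state][a])
            new_state, reward, done, _ = env.step(action)
            # Bootstrap from the best action in the next state.
            target = reward if done else reward + gamma * max(q[new_state])
            q[state][action] += alpha * (target - q[state][action])
            state = new_state
            if done:
                break
    return q

def greedy_path_length(env, q, max_steps=50):
    """Follow the greedy policy; return the step count, or None if stuck."""
    state, done, steps = env.reset(), False, 0
    while not done and steps < max_steps:
        action = max(range(4), key=lambda a: q[state][a])
        state, _, done, _ = env.step(action)
        steps += 1
    return steps if done else None
```

Calling `greedy_path_length(GridWorld(), q_learning(GridWorld()))` gives a quick check of whether the learned table leads to the goal. The shortest possible path for the default goal is 5 moves.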

【Comments】:

I'm beginning to think you're right and I need to redesign the whole project from scratch. Thanks for the advice.