Train a simple RNN on my own dataset in PyTorch

[Question title]: Train simple RNN from my own dataset in pytorch
[Posted]: 2021-05-19 12:32:04
[Question]:

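Edited after @Nerveless_child's answer. I have a file with word bitstrings as keys and True/False as values, indicating whether the word is in my dictionary.

010000101010000, False

10100010110010001011, True

Each word represents a pattern, and I want to train an RNN (a simple binary classifier) that can recognize whether a word is in the language.

My dataset: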

class myDataset(T.utils.data.Dataset):
    # WORD  T/f
    # 010000101010000  FALSE

    def __init__(self, src_file, m_rows=None):
        tmp_x = np.loadtxt(src_file, max_rows=m_rows,
                           usecols=[0], delimiter=",", skiprows=0,
                           dtype=np.int64)

        tmp_y = np.genfromtxt(src_file, max_rows=m_rows,
                              usecols=[1], delimiter=",", dtype=bool)

        tmp_y = tmp_y.reshape(-1, 1)  # 2-D required

        self.x_data = T.from_numpy(tmp_x).to(device)
        self.y_data = T.from_numpy(tmp_y).to(device)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return len(self.x_data)

When I try to train the network:

net.train()  # set mode
for epoch in range(0, max_epochs):
    T.manual_seed(1 + epoch)  # for reproducibility
    epoch_loss = 0  # for one full epoch

    for (batch_idx, batch) in enumerate(train_ldr):
        (X, Y) = batch  # (predictors, targets)
        optimizer.zero_grad()  # prepare gradients
        oupt = net(X)  # predicted outputs
        loss_val = loss_func(oupt, Y)  # avg per item in batch
        epoch_loss += loss_val.item()  # accumulate avgs
        loss_val.backward()  # compute gradients
        optimizer.step()  # update wts

I get the error:

OverflowError: Python int too large to convert to C long
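
A likely cause, assuming the first column is being parsed as a decimal integer: the 20-character word from the example is larger than the biggest value a signed 64-bit integer can hold, so no integer dtype can store it. The check below is only a sketch of that arithmetic; the names `INT64_MAX`, `as_decimal` and `as_binary` are made up for illustration.

```python
INT64_MAX = 2**63 - 1              # 9223372036854775807, largest signed 64-bit value

word = "10100010110010001011"      # the 20-character example from the question
as_decimal = int(word)             # the text read as a base-10 number
as_binary = int(word, 2)           # the same text read as base-2 digits

print(as_decimal > INT64_MAX)      # True: too large for int64
print(as_binary)                   # 666763
```

Reading the word as base 2 fits easily, but any integer form drops leading zeros (`"010"` and `"10"` become the same number), so it may not suit an RNN that should see the word bit by bit.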

[Comments]:

Why did you comment out these lines: # self.x_data = T.tensor(tmp_x).to(device) # self.y_data = T.tensor(tmp_y).to(device)?
The error is related to your data: self.x_data and self.y_data are strings rather than the integers and booleans you expected.
Hi, otherwise I get the error: TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
Haha! I'll write up my answer.

Tags: pytorch recurrent-neural-network training-data

[Solution 1]:

This should do it:

def __init__(self, src_file, m_rows=None):
    tmp_x = np.loadtxt(src_file, max_rows=m_rows,
                        usecols=[0], delimiter=",", skiprows=0, dtype=int)
    tmp_y = np.loadtxt(src_file, max_rows=m_rows,
                        usecols=[1], delimiter=",", skiprows=0, dtype=bool)

    tmp_y = tmp_y.reshape(-1, 1)  # 2-D required

    self.x_data = T.from_numpy(tmp_x).to(device)
    self.y_data = T.from_numpy(tmp_y).to(device)

I'd also suggest using np.genfromtxt as your data file gets more complex.
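
As a rough sketch of that suggestion, np.genfromtxt can read both columns as text, so no word is ever parsed as an integer. This assumes a two-column, comma-separated file with labels spelled True/False; the `io.StringIO` object stands in for the real file, and the variable names are illustrative only.

```python
import io

import numpy as np

# Stand-in for the real file; the two rows are the examples from the question.
src_file = io.StringIO(
    "010000101010000, False\n"
    "10100010110010001011, True\n"
)

# Read everything as text; autostrip removes the space after the comma.
raw = np.genfromtxt(src_file, delimiter=",", dtype=str, autostrip=True)

words = raw[:, 0]                              # bitstrings, still text
labels = np.char.lower(raw[:, 1]) == "true"    # boolean array

# One small integer array per word, one entry per character.
bits = [np.array([int(ch) for ch in w], dtype=np.int64) for w in words]

print([len(b) for b in bits])   # [15, 20]
print(labels.tolist())          # [False, True]
```

Note that a file with a single row would come back as a 1-D array, so the `raw[:, 0]` indexing would need adjusting in that case.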

[Comments]:

I get OverflowError: Python int too large to convert to C long
Try changing int to np.int64 in this line: tmp_x = np.loadtxt(src_file, max_rows=m_rows, usecols=[0], delimiter=",", skiprows=0, dtype=int)
That doesn't work. It works after changing to tmp_x = np.genfromtxt(src_file, max_rows=m_rows, usecols=[0], delimiter=",", dtype='str') and tmp_y = np.genfromtxt(src_file, max_rows=m_rows, usecols=[1], delimiter=",", dtype=bool)
You're right... there is still a problem when the type is str. TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
That's because there is no tensor type that handles strings, so you have to represent your data some other way.
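
One possible representation, offered as a sketch rather than the thread's confirmed fix: treat each word as a sequence of 0.0/1.0 values, pad the words in a batch to a common length, and pass the true lengths to the RNN. The helpers `encode` and `collate`, the hidden size of 8, and the use of `BCEWithLogitsLoss` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence


def encode(word):
    """Turn a bitstring such as '0101' into a (len, 1) float tensor."""
    return torch.tensor([float(ch) for ch in word]).unsqueeze(-1)


def collate(batch):
    """Pad a list of (word_tensor, label) pairs to a common length."""
    xs, ys = zip(*batch)
    lengths = torch.tensor([x.shape[0] for x in xs])
    padded = pad_sequence(list(xs), batch_first=True)   # (batch, max_len, 1)
    targets = torch.tensor(ys, dtype=torch.float32).unsqueeze(-1)
    return padded, lengths, targets


# The two rows from the question, already split into text and label.
rows = [("010000101010000", False), ("10100010110010001011", True)]
batch = [(encode(word), label) for word, label in rows]
X, lengths, Y = collate(batch)

# Hypothetical layer sizes, chosen only for the demonstration.
rnn = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)

# Packing lets the RNN stop at each word's real end instead of reading padding.
packed = pack_padded_sequence(X, lengths, batch_first=True, enforce_sorted=False)
_, h_n = rnn(packed)              # h_n: (num_layers, batch, hidden_size)
logits = head(h_n[-1])            # (batch, 1), one score per word
loss = nn.BCEWithLogitsLoss()(logits, Y)
```

With a Dataset whose `__getitem__` returns `(encode(word), label)`, a function like `collate` could be passed to the DataLoader as `collate_fn`, and the batch moved to the device after collation.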