【Question Title】: Embedding matrix and one hot vector (Pytorch)
【Posted】: 2023-12-08 01:08:02
【Question】:

I built a bidirectional LSTM for sentiment analysis of news headlines, but while training the model the loss does not improve; it stays around 0.6 to 0.7. I am clearly doing something wrong, and I wonder whether it is related to the embedding layer.

I pass the batches through the network iteratively with a batch size of 10 and a sentence length of 30 words. My vocabulary size is 5745, so after one-hot encoding each batch tensor has shape (10, 30, 5745).

My embedding layer has num_embeddings = 5745 and embed_dim = 100, so when I call self.embedding(input) the output shape is (10, 30, 5745, 100).

I want the output shape to be (10, 30, 100).

So I used this line of code:

        embeddings = torch.max(embeddings, dim=2)

But I am not sure it does what I expect for each word / one-hot vector, namely:

If I have a one-hot vector of shape (5745, 1) representing a word and an embedding matrix of shape (100, 5745), I would get a (100, 1) embedding vector, so would I end up with a (10, 30, 100) output by running the line above? Maybe my reasoning is wrong and that is what hurts my final results.
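
For reference, here is a minimal shape check with dummy data (the sizes are the ones mentioned above); it contrasts what torch.max over dim=2 actually computes with the one-hot times embedding-matrix product I have in mind:

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 5745, 100
    batch_size, seq_len = 10, 30
    embedding = nn.Embedding(vocab_size, embed_dim)

    # dummy word indices and their one-hot encoding
    idx = torch.randint(0, vocab_size, (batch_size, seq_len))    # (10, 30)
    one_hot = torch.nn.functional.one_hot(idx, vocab_size)       # (10, 30, 5745)

    # nn.Embedding treats every entry as an index, so each 0/1 of the
    # one-hot tensor is looked up individually:
    out = embedding(one_hot)
    print(out.shape)       # torch.Size([10, 30, 5745, 100])

    # torch.max over dim=2 takes an element-wise maximum of those 5745
    # vectors (each of which is embedding.weight[0] or embedding.weight[1]),
    # so the result does not depend on which word was encoded:
    max_out, _ = torch.max(out, dim=2)
    print(max_out.shape)   # torch.Size([10, 30, 100])

    # the one-hot x embedding-matrix product described above is instead
    # equivalent to a plain index lookup:
    matmul_out = one_hot.float() @ embedding.weight               # (10, 30, 100)
    print(torch.allclose(matmul_out, embedding(idx)))             # True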

The RNN:

import torch
import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, n_layers, dropout):
        # calling the init function of the RNN parent
        super(RNN, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embed_dim)

        self.encoder = nn.LSTM(embed_dim,
                               hidden_dim,
                               n_layers,
                               dropout=dropout,
                               bidirectional=True
                              )

        # Linear transformation
        self.decoder = nn.Linear(hidden_dim*2, output_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, inputs):

        # (batch_size, timesteps, embed_dim)
        embeddings = self.dropout(self.embedding(inputs))
        embeddings = torch.max(embeddings, dim=2)
        embeddings = embeddings[0].type(torch.cuda.FloatTensor)

        # output of each timestep
        output, (hidden, cell) = self.encoder(embeddings)

        merge = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))

        output = self.decoder(merge)

        return output

Training:

def train(model, text, label, epochs, lr=0.001):

    model.train()

    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # The BCEWithLogitsLoss criterion carries out both the sigmoid and the binary cross entropy steps
    criterion = nn.BCEWithLogitsLoss()

    counter = 0
    for e in range(epochs):
        for x, y in batches(text, label):

            x = one_hot_encode(x, vocab)
            x = torch.from_numpy(x).to(device)

            output = model(x)

            # print(torch.cuda.memory_summary(device=None, abbreviated=False))

            y = torch.from_numpy(y)
            y = torch.unsqueeze(y, 1).to(device)
            # print(output.shape)
            # print(y.shape)

            loss = criterion(output, y.float())

            acc = binaryAccuracy(output, y)

            # In order to avoid gradient accumulation before backpropagation
            opt.zero_grad()
            loss.backward()
            opt.step()

            counter += 1

        print("Epoch {}/{}".format(e+1, epochs),
              "Loss: {}".format(loss.item()),
              "accuracy: {}".format(acc))

One-hot encoding:

import numpy as np


def one_hot_encode(arr, n_labels):

    one_hot = np.zeros((np.multiply(*arr.shape), n_labels), dtype=np.int64)

    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1

    one_hot = one_hot.reshape((*arr.shape, n_labels))

    return one_hot

Batches:

def batches(text, label, num_seqs=10):

    counter = 0
    # create empty arrays with the specified number of columns
    x = np.array([], dtype=int).reshape(0, findMaxLen())
    y = np.array([], dtype=int).reshape(0, 1)

    for sent, l in zip(text, label):
        # create a np array of zeros with length 30
        tmp1 = np.zeros((findMaxLen()), dtype=int)
        # tmp1 = np.randint(0, high=vocab, size=findMaxLen(), dtype=int)
        # create a 1d array
        tmp2 = np.atleast_1d(np.array(l))

        for ind, wrd in enumerate(sent):
            if wrd in uniqueWrds and ind < 30:
                tmp1[ind] = word_to_index[wrd]
        # append the row arrays to the batch arrays
        x = np.vstack([x, tmp1])
        y = np.vstack([y, tmp2])
        counter += 1
        if counter == num_seqs:
            yield x, np.squeeze(y, 1)
            counter = 0
            x = np.array([], dtype=int).reshape(0, findMaxLen())
            y = np.array([], dtype=int).reshape(0, 1)
  



   

【Question Discussion】:

    Tags: python pytorch lstm one-hot-encoding word-embedding


    【Solution 1】:

    If you want your output to have dimension (10, 30, 100), I think you need to feed the word indices of each sentence as an array of dimension (10 x 30). You can probably obtain the indices by applying torch.argmax to the one-hot encoded input tensor.
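
    A minimal sketch of that suggestion, using a hypothetical one-hot batch with the shapes from the question:

        import torch
        import torch.nn as nn

        vocab_size, embed_dim = 5745, 100           # sizes taken from the question
        embedding = nn.Embedding(vocab_size, embed_dim)

        # hypothetical one-hot encoded batch, shape (10, 30, 5745)
        words = torch.randint(0, vocab_size, (10, 30))
        x_onehot = torch.nn.functional.one_hot(words, vocab_size)

        # recover the integer word indices, shape (10, 30)
        x_idx = torch.argmax(x_onehot, dim=2)

        # feeding indices (not one-hot vectors) gives the desired shape directly
        out = embedding(x_idx)
        print(out.shape)   # torch.Size([10, 30, 100])

    Since your batches() generator already seems to yield an integer index array of shape (10, 30), you could also skip the one_hot_encode step in train() entirely and pass x to the model as a LongTensor.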

    【Discussion】: