【Question Title】:How do you design an LSTM to recognize images after extracting features with a CNN?
【Posted】:2022-06-11 10:09:23
【Question】:

I'm building a captcha image recognition system. It first uses a ResNet to extract features from the image, then uses an LSTM to recognize the words and letters in the image. An fc layer is supposed to connect the two. I haven't designed an LSTM model before and I'm very new to machine learning, so I'm quite confused and overwhelmed.
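As a sanity check on that pipeline, here is a minimal sketch of how features flow from a convolutional backbone into a recurrent layer. A tiny stand-in CNN is used instead of a real ResNet, and all dimensions are hypothetical:

```python
import torch
from torch import nn

# Tiny stand-in for the ResNet backbone (hypothetical sizes throughout).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global pooling, like ResNet before its fc layer
    nn.Flatten(),              # -> (N, 16) feature vector per image
)

images = torch.randn(2, 3, 64, 64)   # batch of 2 RGB captcha images
features = backbone(images)          # (2, 16)

# A single-layer LSTM consuming the feature vector as a one-step sequence.
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
out, _ = lstm(features.unsqueeze(1))  # (2, 1, 32)
```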

I'm confused enough that I'm not even entirely sure what questions I should be asking. But a few things stand out to me:

  • If the captcha images are all random, what is the purpose of embedding the captions?
  • Is the linear fc layer in the first part of the for loop the correct way to connect the CNN feature vector to the LSTM?
  • Is this the correct use of an LSTM cell within the LSTM?

And, more generally, any advice on the overall direction would be greatly appreciated.

Here's what I have so far:

class LSTM(nn.Module):
    def __init__(self, cnn_dim, hidden_size, vocab_size, num_layers=1):
        super(LSTM, self).__init__()
    
        self.cnn_dim = cnn_dim      #i think this is the input size
        self.hidden_size = hidden_size   
        self.vocab_size = vocab_size      #i think this should be the output size
        
        # Building your LSTM cell
        self.lstm_cell = nn.LSTMCell(input_size=self.vocab_size, hidden_size=hidden_size)
    

        '''Connect CNN model to LSTM model'''
        # output fully connected layer
        # CNN does not necessarily need the FCC layers, in this example it is just extracting the features, that gets set to the LSTM which does the actual processing of the features
        self.fc_in = nn.Linear(cnn_dim, vocab_size)   #this takes the input from the CNN takes the features from the cnn              #cnn_dim = 512, hidden_size = 128
        self.fc_out = nn.Linear(hidden_size, vocab_size)     # this is the looper in the LSTM           #I think this is correct?
    
        # embedding layer
        self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.vocab_size)
    
        # activations
        self.softmax = nn.Softmax(dim=1)


    def forward(self, features, captions): 
        
        #features: extracted features from ResNet
        #captions: label of images

        batch_size = features.size(0)
        cnn_dim = features.size(1)
        
        hidden_state = torch.zeros((batch_size, self.hidden_size)).cuda()  # Initialize hidden state with zeros
        cell_state = torch.zeros((batch_size, self.hidden_size)).cuda()   # Initialize cell state with zeros
        
        outputs = torch.empty((batch_size, captions.size(1), self.vocab_size)).cuda()   
        captions_embed = self.embed(captions)  
        
        '''Design LSTM model for captcha image recognition'''
        # Pass the caption word by word for each time step
        # It receives an input(x), makes an output(y), and receives this output as an input again recurrently
        '''Defined hidden state, cell state, outputs, embedded captions'''

        # can be designed to be word by word or character by character

        for t in range(captions).size(1):
            # for the first time step the input is the feature vector
            if t == 0:
                # probably have to get the output from the ResNet layer
                # use the LSTM cells in here i presume

                x = self.fc_in(features)
                hidden_state, cell_state = self.lstm_cell(x[t], (hidden_state, cell_state))
                x = self.fc_out(hidden_state)
                outputs.append(hidden_state)

            # for the 2nd+ time steps
            else:
                hidden_state, cell_state = self.lstm_cell(x[t], (hidden_state, cell_state))
                x = self.fc_out(hidden_state)
                outputs.append(hidden_state)
                

            # build the output tensor
            outputs = torch.stack(outputs,dim=0)
    
        return outputs

【Discussion】:

    Tags: python pytorch conv-neural-network lstm captcha


    【Solution 1】:
    • nn.Embedding() is usually used to turn sparse one-hot vectors into dense vectors (e.g. turning 'a' into [0.1, 0.2, ...]) for practical computation. I don't see why you are trying to embed the captions, which look like the ground truth. If you want to use them to compute the loss, try nn.CTCLoss().
    • If you want to feed strings to the LSTM, it's recommended to first embed the characters in the string with nn.Embedding(), which makes them dense and computationally practical. However, if the input to the LSTM is features extracted by a CNN (or another module), it is already dense and computationally practical, and in my opinion there is no need to project it with fc_in.
    • I often use nn.LSTM() rather than nn.LSTMCell(), since the latter is cumbersome.
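    To illustrate the nn.CTCLoss() suggestion above, here is a minimal, hypothetical sketch. The time-step count, batch size, class count, and label length are all made up; index 0 is reserved for the CTC blank:

```python
import torch
from torch import nn

# Hypothetical shapes: T=12 time steps from the LSTM, batch N=2,
# C=11 classes (10 characters + blank at index 0).
T, N, C = 12, 2, 11
log_probs = torch.randn(T, N, C).log_softmax(2)  # (T, N, C), as CTCLoss expects
targets = torch.randint(1, C, (N, 5))            # 5-character captcha labels, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 5, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

    CTC compares the unaligned label sequence against all per-time-step alignments, so no per-character position labels are needed.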

    There are a few bugs in your code; I've fixed them:

    import torch
    from torch import nn
    
    
    class LSTM(nn.Module):
        def __init__(self, cnn_dim, hidden_size, vocab_size, num_layers=1):
            super(LSTM, self).__init__()
    
            self.cnn_dim = cnn_dim  # i think this is the input size
            self.hidden_size = hidden_size
            self.vocab_size = vocab_size  # i think this should be the output size
    
            # Building your LSTM cell
            self.lstm_cell = nn.LSTMCell(input_size=self.vocab_size, hidden_size=hidden_size)
    
            '''Connect CNN model to LSTM model'''
            # output fully connected layer
            # CNN does not necessarily need the FCC layers, in this example it is just extracting the features, that gets set to the LSTM which does the actual processing of the features
            self.fc_in = nn.Linear(cnn_dim,
                                   vocab_size)  # this takes the input from the CNN takes the features from the cnn              #cnn_dim = 512, hidden_size = 128
            self.fc_out = nn.Linear(hidden_size,
                                    vocab_size)  # this is the looper in the LSTM           #I think this is correct?
    
            # embedding layer
            self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.vocab_size)
    
            # activations
            self.softmax = nn.Softmax(dim=1)
    
        def forward(self, features, captions):
    
            # features: extracted features from ResNet
            # captions: label of images
    
            batch_size = features.size(0)
            cnn_dim = features.size(1)
    
            hidden_state = torch.zeros((batch_size, self.hidden_size)).cuda()  # Initialize hidden state with zeros
            cell_state = torch.zeros((batch_size, self.hidden_size)).cuda()  # Initialize cell state with zeros
    
            # outputs = torch.empty((batch_size, captions.size(1), self.vocab_size)).cuda()
            outputs = torch.Tensor([]).cuda()
            captions_embed = self.embed(captions)
    
            '''Design LSTM model for captcha image recognition'''
            # Pass the caption word by word for each time step
            # It receives an input(x), makes an output(y), and receives this output as an input again recurrently
            '''Defined hidden state, cell state, outputs, embedded captions'''
    
            # can be designed to be word by word or character by character
    
            # for t in range(captions).size(1):
            for t in range(captions.size(1)):
                # for the first time step the input is the feature vector
                if t == 0:
                    # probably have to get the output from the ResNet layer
                    # use the LSTM cells in here i presume
    
                    x = self.fc_in(features)
                    # hidden_state, cell_state = self.lstm_cell(x[t], (hidden_state, cell_state))
                    hidden_state, cell_state = self.lstm_cell(x, (hidden_state, cell_state))
                    x = self.fc_out(hidden_state)
                    # outputs.append(hidden_state)
                    outputs = torch.cat([outputs, hidden_state])
    
                # for the 2nd+ time steps
                else:
                    # hidden_state, cell_state = self.lstm_cell(x[t], (hidden_state, cell_state))
                    hidden_state, cell_state = self.lstm_cell(x, (hidden_state, cell_state))
                    x = self.fc_out(hidden_state)
                    # outputs.append(hidden_state)
                    outputs = torch.cat([outputs, hidden_state])
    
                # build the output tensor
                # outputs = torch.stack(outputs, dim=0)
    
            return outputs
    
    
    m = LSTM(16, 32, 10)
    m = m.cuda()
    features = torch.randn((2, 16))
    features = features.cuda()
    captions = torch.randn((2, 10))
    captions = torch.clip(captions, 0, 9)
    captions = captions.long()
    captions = captions.cuda()
    m(features, captions)
    
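    As a sketch of the nn.LSTM() alternative mentioned above, the per-time-step loop can be replaced by a single call. All dimensions are hypothetical, and the CNN feature is fed as the first time step followed by the embedded caption (teacher forcing):

```python
import torch
from torch import nn

class LSTMDecoder(nn.Module):
    def __init__(self, cnn_dim, hidden_size, vocab_size):
        super().__init__()
        self.fc_in = nn.Linear(cnn_dim, hidden_size)   # project CNN feature
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.fc_out = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        f = self.fc_in(features).unsqueeze(1)  # (N, 1, H): first time step
        e = self.embed(captions)               # (N, T, H)
        x = torch.cat([f, e[:, :-1]], dim=1)   # shift captions right, (N, T, H)
        out, _ = self.lstm(x)                  # one call instead of a loop
        return self.fc_out(out)                # (N, T, vocab_size)

m = LSTMDecoder(16, 32, 10)
features = torch.randn(2, 16)
captions = torch.randint(0, 10, (2, 10))
logits = m(features, captions)  # (2, 10, 10)
```

    nn.LSTM also handles hidden-state initialization (zeros by default) and multiple layers for you.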

    This paper may help you: https://arxiv.org/abs/1904.01906

    【Discussion】:
