【Posted】: 2022-06-11 10:09:23
【Problem description】:
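I am building a captcha image recognition system. It first uses a ResNet to extract features from the image, then uses an LSTM to recognize the words and letters in the image. An fc layer is supposed to connect the two. I have never designed an LSTM model before and I am very new to machine learning, so I am quite confused and overwhelmed.
I am confused enough that I am not even sure what questions I should be asking. But here are a few things that stand out to me: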
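- If the captcha images are all random, what is the purpose of embedding the captions?
- Is the linear fc layer in the first part of the for loop the correct way to connect the CNN feature vector to the LSTM?
- Is this a correct use of the LSTM cell inside the LSTM?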
More generally, any advice on the overall direction would be much appreciated.
Here is what I have so far:
class LSTM(nn.Module):
    def __init__(self, cnn_dim, hidden_size, vocab_size, num_layers=1):
        super(LSTM, self).__init__()
        self.cnn_dim = cnn_dim #i think this is the input size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size #i think this should be the output size
        # Building your LSTM cell
        self.lstm_cell = nn.LSTMCell(input_size=self.vocab_size, hidden_size=hidden_size)
        '''Connect CNN model to LSTM model'''
        # output fully connected layer
        # CNN does not necessarily need the FCC layers, in this example it is just extracting the features, that gets set to the LSTM which does the actual processing of the features
        self.fc_in = nn.Linear(cnn_dim, vocab_size) #this takes the input from the CNN takes the features from the cnn #cnn_dim = 512, hidden_size = 128
        self.fc_out = nn.Linear(hidden_size, vocab_size) # this is the looper in the LSTM #I think this is correct?
        # embedding layer
        self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.vocab_size)
        # activations
        self.softmax = nn.Softmax(dim=1)

    def forward(self, features, captions):
        #features: extracted features from ResNet
        #captions: label of images
        batch_size = features.size(0)
        cnn_dim = features.size(1)
        hidden_state = torch.zeros((batch_size, self.hidden_size)).cuda() # Initialize hidden state with zeros
        cell_state = torch.zeros((batch_size, self.hidden_size)).cuda() # Initialize cell state with zeros
        outputs = torch.empty((batch_size, captions.size(1), self.vocab_size)).cuda()
        captions_embed = self.embed(captions)
        '''Design LSTM model for captcha image recognition'''
        # Pass the caption word by word for each time step
        # It receives an input(x), makes an output(y), and receives this output as an input again recurrently
        '''Defined hidden state, cell state, outputs, embedded captions'''
        # can be designed to be word by word or character by character
        for t in range(captions).size(1):
            # for the first time step the input is the feature vector
            if t == 0:
                # probably have to get the output from the ResNet layer
                # use the LSTM cells in here i presume
                x = self.fc_in(features)
                hidden_state, cell_state = self.lstm_cell(x[t], (hidden_state, cell_state))
                x = self.fc_out(hidden_state)
                outputs.append(hidden_state)
            # for the 2nd+ time steps
            else:
                hidden_state, cell_state = self.lstm_cell(x[t], (hidden_state, cell_state))
                x = self.fc_out(hidden_state)
                outputs.append(hidden_state)
        # build the output tensor
        outputs = torch.stack(outputs,dim=0)
        return outputs
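For comparison, here is a minimal sketch of one way the loop above might be made runnable. It is only one reading of the intent, not a confirmed fix. The class name `CaptchaDecoder` and the `embed_dim` parameter are hypothetical and not part of the code above. The sketch assumes `captions` holds integer character indices of shape (batch, seq_len), that step 0 is fed the projected CNN features, and that later steps are fed the embedding of the previous ground-truth character (teacher forcing). It calls the cell on the whole batch rather than on `x[t]`, collects the outputs in a Python list before stacking, uses `range(captions.size(1))`, and returns raw logits without a softmax, since `nn.CrossEntropyLoss` expects unnormalized scores.

```python
import torch
import torch.nn as nn


class CaptchaDecoder(nn.Module):
    """Hypothetical decoder: CNN feature vector in, per-step character logits out."""

    def __init__(self, cnn_dim, hidden_size, vocab_size, embed_dim=64):
        super().__init__()
        self.hidden_size = hidden_size
        # Project the CNN features to the LSTM input size (used at step 0 only).
        self.fc_in = nn.Linear(cnn_dim, embed_dim)
        # Map character indices to dense vectors of the same size (steps 1+).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm_cell = nn.LSTMCell(input_size=embed_dim, hidden_size=hidden_size)
        # Map each hidden state to unnormalized scores over the vocabulary.
        self.fc_out = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # features: (batch, cnn_dim); captions: (batch, seq_len) of int64 indices
        batch_size, seq_len = captions.shape
        device = features.device  # avoids hard-coding .cuda()
        h = torch.zeros(batch_size, self.hidden_size, device=device)
        c = torch.zeros(batch_size, self.hidden_size, device=device)
        captions_embed = self.embed(captions)  # (batch, seq_len, embed_dim)

        logits = []
        for t in range(seq_len):
            if t == 0:
                x = self.fc_in(features)         # first input: image features
            else:
                x = captions_embed[:, t - 1, :]  # teacher forcing: previous character
            h, c = self.lstm_cell(x, (h, c))     # one step for the whole batch
            logits.append(self.fc_out(h))
        return torch.stack(logits, dim=1)        # (batch, seq_len, vocab_size)


# Usage with made-up sizes: 36 characters, captchas of length 5.
decoder = CaptchaDecoder(cnn_dim=512, hidden_size=128, vocab_size=36)
features = torch.randn(4, 512)            # stand-in for ResNet output
captions = torch.randint(0, 36, (4, 5))   # stand-in for encoded labels
logits = decoder(features, captions)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 36), captions.reshape(-1))
```

In this arrangement the embedding is only used to feed the previous character back into the cell. Whether that helps for captchas whose characters are independent of one another is a separate design question that the sketch does not settle.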
【Discussion】:
Tags: python pytorch conv-neural-network lstm captcha