【Question Title】: Build (pre-trained) CNN+LSTM network with keras functional API
【Posted】: 2020-09-09 10:34:18
【Question Description】:

I want to build an LSTM on top of a pre-trained CNN (VGG) to classify video sequences. The LSTM will be fed the features extracted by the last FC layer of VGG.

The architecture is something like:

I wrote this code:

def build_LSTM_CNN_net():
    from keras.applications.vgg16 import VGG16
    from keras.models import Model
    from keras.layers import Dense, Input, Flatten
    from keras.layers.pooling import GlobalAveragePooling2D, GlobalAveragePooling1D
    from keras.layers.recurrent import LSTM
    from keras.layers.wrappers import TimeDistributed
    from keras.optimizers import Nadam

    num_classes = 5
    frames = Input(shape=(5, 224, 224, 3))
    base_in = Input(shape=(224, 224, 3))

    base_model = VGG16(weights='imagenet',
                       include_top=False,
                       input_shape=(224, 224, 3))

    x = Flatten()(base_model.output)
    x = Dense(128, activation='relu')(x)
    x = TimeDistributed(Flatten())(x)  # <-- this line raises the error below
    x = LSTM(units=256, return_sequences=False, dropout=0.2)(x)
    x = Dense(num_classes, activation='softmax')(x)
    
lstm_cnn = build_LSTM_CNN_net()
keras.utils.plot_model(lstm_cnn, "lstm_cnn.png", show_shapes=True)

but got the error:

ValueError: `TimeDistributed` Layer should be passed an `input_shape ` with at least 3 dimensions, received: [None, 128]

Why does this happen, and how can I fix it?

Thanks

【Question Comments】:

  • See that?! You got an answer in less than an hour! That's the difference between a good question with enough detail and asking the same question without any effort or detail; what's more, offering a bounty doesn't help get an answer to a badly asked question. Good luck and have fun :)

标签: python tensorflow keras conv-neural-network lstm


【Solution 1】:

Here is the correct way to build a model for classifying video sequences. Note that I wrap a model instance in TimeDistributed. This model is built beforehand to extract features from each frame separately. In the second part, we process the sequence of frames.

from keras.applications.vgg16 import VGG16
from keras.layers import Input, Dense, LSTM, GlobalAveragePooling2D, TimeDistributed
from keras.models import Model

frames, channels, rows, columns = 5, 3, 224, 224

video = Input(shape=(frames,
                     rows,
                     columns,
                     channels))
cnn_base = VGG16(input_shape=(rows,
                              columns,
                              channels),
                 weights="imagenet",
                 include_top=False)
cnn_base.trainable = False

# Per-frame feature extractor: VGG16 conv features, global-average-pooled
cnn_out = GlobalAveragePooling2D()(cnn_base.output)
cnn = Model(cnn_base.input, cnn_out)

# Apply the same CNN to every frame, then model the sequence with an LSTM
encoded_frames = TimeDistributed(cnn)(video)
encoded_sequence = LSTM(256)(encoded_frames)
hidden_layer = Dense(1024, activation="relu")(encoded_sequence)
outputs = Dense(10, activation="softmax")(hidden_layer)

model = Model(video, outputs)
model.summary()
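For intuition, TimeDistributed simply applies the same per-frame function to every time step of the input. A minimal NumPy sketch of that idea (the `time_distributed` helper and the toy `fake_cnn` below are illustrative stand-ins, not Keras's implementation):

```python
import numpy as np

# Hypothetical helper: apply fn to each frame of each sample and
# stack the results, mimicking what TimeDistributed does to a batch
# of shape (batch, frames, ...).
def time_distributed(fn, batch):
    return np.stack(
        [np.stack([fn(frame) for frame in sample]) for sample in batch],
        axis=0,
    )

# Toy per-frame "CNN": global average pool over H and W -> (channels,)
def fake_cnn(frame):  # frame has shape (rows, columns, channels)
    return frame.mean(axis=(0, 1))

video_batch = np.zeros((2, 5, 224, 224, 3))  # (batch, frames, H, W, C)
features = time_distributed(fake_cnn, video_batch)
print(features.shape)  # (2, 5, 3): one feature vector per frame
```

This is also why the question's code fails: after `Flatten()` and `Dense(128)` the tensor is `[None, 128]`, with no time axis left for TimeDistributed to iterate over, whereas here the 5-frame axis is preserved all the way into the LSTM.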

If you want to use the VGG 1x4096 embedding representation, you can do this:

from keras.applications.vgg16 import VGG16
from keras.layers import Input, Dense, LSTM, TimeDistributed
from keras.models import Model

frames, channels, rows, columns = 5, 3, 224, 224

video = Input(shape=(frames,
                     rows,
                     columns,
                     channels))
cnn_base = VGG16(input_shape=(rows,
                              columns,
                              channels),
                 weights="imagenet",
                 include_top=True)  # <=== include_top=True
cnn_base.trainable = False

cnn = Model(cnn_base.input, cnn_base.layers[-3].output)  # -3 is the 4096 layer
encoded_frames = TimeDistributed(cnn)(video)
encoded_sequence = LSTM(256)(encoded_frames)
hidden_layer = Dense(1024, activation="relu")(encoded_sequence)
outputs = Dense(10, activation="softmax")(hidden_layer)

model = Model(video, outputs)
model.summary()
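A note on the `layers[-3]` index: with `include_top=True`, the Keras VGG16 model ends with a Flatten layer followed by two 4096-unit Dense layers (`fc1`, `fc2`) and the 1000-way `predictions` layer, so `-3` lands on `fc1`. A quick sketch of the indexing (layer names as used by `keras.applications`):

```python
# Final layers of VGG16 with include_top=True, in order;
# fc1 and fc2 are both Dense(4096) layers.
top_layers = ["flatten", "fc1", "fc2", "predictions"]

# Negative indexing from the end: -1 is the 1000-way softmax,
# -2 is fc2 (4096), -3 is fc1 (4096).
print(top_layers[-3])  # fc1
```

Using `-2` (`fc2`) instead would also give a 4096-dimensional embedding, just from one layer deeper in the network.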

【Discussion】:
