I assume you are familiar with the TensorFlow Keras API. I would implement the code as follows.
Assumptions: vocab_size = 4000 and input_image_size = (572, 572, 3).
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 4000
inputs = layers.Input(shape=(572, 572, 3))
# Conv2D defaults to padding='valid', so each 3x3 convolution trims 2 pixels per spatial axis
c0 = layers.Conv2D(64, activation='relu', kernel_size=3)(inputs)
c1 = layers.Conv2D(64, activation='relu', kernel_size=3)(c0)  # layer kept for concatenation in the expansive part
c2 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c1)  # -> (284, 284, 64)
c3 = layers.Conv2D(128, activation='relu', kernel_size=3)(c2)
c4 = layers.Conv2D(128, activation='relu', kernel_size=3)(c3)  # layer kept for concatenation in the expansive part
c5 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c4)  # -> (140, 140, 128)
c6 = layers.Conv2D(256, activation='relu', kernel_size=3)(c5)
c7 = layers.Conv2D(256, activation='relu', kernel_size=3)(c6)  # layer kept for concatenation in the expansive part
c8 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c7)  # -> (68, 68, 256)
c9 = layers.Conv2D(512, activation='relu', kernel_size=3)(c8)
c10 = layers.Conv2D(512, activation='relu', kernel_size=3)(c9)  # layer kept for concatenation in the expansive part
c11 = layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(c10)  # -> (32, 32, 512)
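# Dense acts on the last axis only, so the output stays 4-D: (batch, 32, 32, 4096)
fc1 = layers.Dense(4096)(c11)
fc2 = layers.Dense(4096)(fc1)
# Flatten the 32 x 32 spatial grid into 1024 time steps for the LSTM: (batch, 1024, 4096)
reshape = layers.Reshape((32 * 32, 4096))(fc2)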
rnn1 = layers.LSTM(64, return_sequences=True)(reshape)
rnn2 = layers.LSTM(64)(rnn1)
outputs = layers.Dense(vocab_size, activation='softmax')(rnn2)
model = tf.keras.Model(inputs=inputs, outputs=outputs, name="caption_generate")
model.summary()
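To see which shape actually reaches the Dense layers, it can help to trace the spatial size by hand. The sketch below assumes what the code above uses: 3x3 convolutions with stride 1 and Keras's default 'valid' padding, followed by 2x2 max pooling with stride 2. The helper names (conv_valid, pool_valid, trace_spatial_size) are illustrative and not part of Keras.

```python
def conv_valid(size, kernel=3, stride=1):
    """Output length of a 'valid' convolution along one spatial axis."""
    return (size - kernel) // stride + 1


def pool_valid(size, pool=2, stride=2):
    """Output length of a 'valid' max pooling along one spatial axis."""
    return (size - pool) // stride + 1


def trace_spatial_size(size, blocks=4, convs_per_block=2):
    """Follow one spatial axis through `blocks` of (conv, conv, pool)."""
    sizes = [size]
    for _ in range(blocks):
        for _ in range(convs_per_block):
            size = conv_valid(size)
            sizes.append(size)
        size = pool_valid(size)
        sizes.append(size)
    return sizes


sizes = trace_spatial_size(572)
final = sizes[-1]
print(sizes)                 # every intermediate height/width
print(final, final * final)  # final grid side and number of grid positions
```

For a 572 x 572 input the trace ends at a 32 x 32 grid, so the tensor reaching the Dense layers has 32 * 32 = 1024 spatial positions.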
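The important part here is reshaping the output from 4 dimensions to 3, because an LSTM expects 3-dimensional input of shape (batch, timesteps, features). For a 572 x 572 input the last pooling layer yields a 32 x 32 grid, so the reshape has to produce 32 * 32 = 1024 time steps:
reshape = layers.Reshape((32 * 32, 4096))(fc2)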
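Keras Reshape only rearranges values, so the target shape has to hold exactly as many elements per sample as the input. Here is a minimal sketch of that check in plain Python. The helpers to_sequence_shape and same_size are hypothetical names used for illustration, and the (32, 32, 4096) shape assumes a 572 x 572 input image.

```python
from math import prod


def to_sequence_shape(feature_shape):
    """Collapse (height, width, features) into (timesteps, features)."""
    height, width, features = feature_shape
    return (height * width, features)


def same_size(shape_a, shape_b):
    """True when both shapes hold the same number of elements."""
    return prod(shape_a) == prod(shape_b)


# Per-sample shape after the Dense layers (batch axis excluded), assuming a 572x572 input
dense_output = (32, 32, 4096)
target = to_sequence_shape(dense_output)

print(target)                                # shape to pass to Reshape
print(same_size(dense_output, target))       # element counts match
print(same_size(dense_output, (64, 4096)))   # a 64-step target does not match
```

If you would rather not hard-code the number of time steps, Reshape also accepts -1 for one dimension, for example Reshape((-1, 4096)), and infers it from the input shape.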
The code above should work as a starting point, and you should be able to use it.
I hope this answer helps.