【Posted】: 2020-07-08 12:00:25
【Question】:
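I built a Keras model for an NLP multi-class classification problem. The data consists of titles and labels, and I trained the model on the titles to predict the labels. I converted the labels to one-hot vectors using LabelEncoder and OneHotEncoder from sklearn.preprocessing.
One-hot encoding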
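from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def onehot(df):
    values = array(df)
    # integer encode: map each distinct label to an integer id
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
    # binary encode: one column per integer id
    # (scikit-learn 1.2+ renames `sparse` to `sparse_output`)
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    return label_encoder, onehot_encoded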
My model uses categorical_crossentropy and adam. Here is the model code.
Keras CNN model
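from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dropout, Dense, concatenate
from keras.models import Model

def ConvNet(embeddings, max_sequence_length, num_words, embedding_dim, labels_index):
    # Frozen embedding layer initialised from pretrained vectors
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embeddings],
                                input_length=max_sequence_length,
                                trainable=False)
    sequence_input = Input(shape=(max_sequence_length,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
    # One convolution + global max-pool branch per n-gram width
    convs = []
    filter_sizes = [2, 3, 4, 5, 6]
    for filter_size in filter_sizes:
        l_conv = Conv1D(filters=200, kernel_size=filter_size, activation='relu')(embedded_sequences)
        l_pool = GlobalMaxPooling1D()(l_conv)
        convs.append(l_pool)
    l_merge = concatenate(convs, axis=1)
    x = Dropout(0.1)(l_merge)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.2)(x)
    # labels_index is the number of classes (width of the one-hot vectors).
    # Note: softmax is the usual output activation for single-label training
    # with categorical_crossentropy; sigmoid is kept here as in the original.
    preds = Dense(labels_index, activation='sigmoid')(x)
    model = Model(sequence_input, preds)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    model.summary()
    return model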
Prediction
predictions = model.predict(test_cnn_data, batch_size=1024, verbose=1)
The predictions give me an array like this:
print(predictions[5, :])
Output:
array([8.8067267e-08, 5.1040554e-15, 1.9745098e-16, ..., 8.0959568e-17,
       2.1070798e-17, 1.1202571e-18], dtype=float32)
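My understanding is that these are the probabilities, or confidence scores, that the given sentence belongs to each label.
How do I convert the predicted array into labels so that I can compare them against the test dataset's labels and measure accuracy?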
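One common way to do this, sketched here under stated assumptions, is to take the index of the highest score in each row with np.argmax and pass those indices to the inverse_transform method of the same LabelEncoder that onehot() returned. This relies on the one-hot columns following the encoder's integer ids, which holds for the onehot() function above because OneHotEncoder orders its columns by sorted integer value. The toy labels, the three-class scores, and the names train_labels, test_labels, predicted_ids and predicted_labels are hypothetical stand-ins for the real data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the training labels (assumption: 3 classes, not ~5000).
train_labels = ["sports", "politics", "tech", "sports"]
label_encoder = LabelEncoder().fit(train_labels)  # same role as the encoder returned by onehot()

# Hypothetical model.predict() output: one row per sample, one column per class.
predictions = np.array([
    [0.05, 0.90, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.15, 0.75],
], dtype=np.float32)

# Hypothetical true labels for the three test samples.
test_labels = np.array(["sports", "politics", "sports"])

# Column index of the highest score in each row = predicted integer class id.
predicted_ids = np.argmax(predictions, axis=1)

# Map integer ids back to the original label strings.
predicted_labels = label_encoder.inverse_transform(predicted_ids)

# Fraction of samples whose predicted label matches the true label.
accuracy = np.mean(predicted_labels == test_labels)
print(predicted_labels, accuracy)
```

With the real model, predictions would come from model.predict(test_cnn_data, ...) and label_encoder must be the instance fitted on the training labels; fitting a fresh encoder on the test labels could assign different integer ids.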
【Comments】:
-
How many classes was your classifier trained on?
-
@YoelNisanov I have about 5000 unique labels
-
What is the length of predictions[5, :]?
-
@YoelNisanov 5015