NLP for a multi-feature dataset using TensorFlow

【Question title】: NLP for multi feature data set using TensorFlow
【Posted】: 2020-04-16 14:37:51
【Question】:

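I'm a beginner in this subject. I have tested some neural networks for image recognition, and I have used NLP for sequence classification.

The second topic is the one that interests me. Using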

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
  'some test sentence',
  'and the second sentence'
]
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sentences = tokenizer.texts_to_sequences(sentences)

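will produce an array of size [n,1], where n is the number of words in the sentence. Assuming I have implemented padding correctly, every training example in the set will have size [n,1], where n is the maximum sentence length.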
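
To make the shapes concrete, here is a minimal plain-Python sketch that mimics the tokenize-and-pad step. The helper names (build_word_index, to_sequences, pad) are assumptions for illustration, not Keras APIs, and ids are assigned in first-seen order, whereas the Keras Tokenizer orders them by word frequency. Each padded example ends up as a flat row of maxlen integer ids, so a batch of N sentences has shape (N, maxlen).

```python
# Plain-Python stand-in for Tokenizer + pad_sequences (illustrative only;
# build_word_index, to_sequences and pad are hypothetical helper names).

def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer id; 0 is reserved for padding, 1 for OOV."""
    word_index = {oov_token: 1}
    for text in texts:
        for word in text.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace every word with its id, falling back to the OOV id."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in t.lower().split()] for t in texts]

def pad(sequences, maxlen):
    """Left-pad with 0 (keeping the last maxlen ids) so every row has maxlen entries."""
    return [([0] * maxlen + seq)[-maxlen:] for seq in sequences]

sentences = ["some test sentence", "and the second sentence"]
word_index = build_word_index(sentences)
sequences = to_sequences(sentences, word_index)
maxlen = max(len(s) for s in sequences)
padded = pad(sequences, maxlen)

print(sequences)  # [[2, 3, 4], [5, 6, 7, 4]]
print(padded)     # [[0, 2, 3, 4], [5, 6, 7, 4]]
```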

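I can pass the prepared training set to Keras model.fit.

What do I do when my dataset has multiple features? Say I want to build an event prioritization algorithm, and my data structure looks like this: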

[event_description, event_category, event_location, label]

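Trying to tokenize such an array will result in an [n,m] matrix, where n is the maximum length in words and m is the number of features.

How do I prepare such a dataset so that a model can be trained on it?

Would this approach work: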

# Going through the training set to collect each feature into its own array
for data in dataset:
  training_sentence.append(data['event_description'])
  training_category.append(data['event_category'])
  training_location.append(data['event_location'])
  training_labels.append(data['label'])

# Tokenize each array of raw text
tokenizer.fit_on_texts(training_sentence)
tokenizer.fit_on_texts(training_category)
tokenizer.fit_on_texts(training_location)
sequences = tokenizer.texts_to_sequences(training_sentence)
categories = tokenizer.texts_to_sequences(training_category)
locations = tokenizer.texts_to_sequences(training_location)

# Concatenating arrays with features into one
training_example = numpy.concatenate([sequences,categories, locations])

# omitting the model definition; training the model
model.fit(training_example, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))

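I haven't tested it yet. I just want to make sure I understand everything correctly and that my assumptions are right.

Is this the correct way to do NLP with a NN?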

【Comments】:

Tags: python tensorflow machine-learning deep-learning neural-network

【Solution 1】:

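I know of two common ways to manage multiple input sequences, and your approach falls somewhere between them.

One approach is to design a multi-input model with each of your text columns as a different input. They can share a vocabulary and/or an embedding layer later, but for now you still need a distinct input sub-model for each of description, category, and so on.

Each of these becomes an input to the network, using the Model(inputs=[...], outputs=rest_of_nn) syntax. You will need to design rest_of_nn so that it can take multiple inputs. This can be as simple as your current concatenation, or you can use additional layers to do the synthesis.

It could look something like this: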

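from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Flatten, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer

# Build separate vocabularies. This could be shared.
# Index 0 is reserved for padding, so the vocabulary size is len(word_index) + 1.
desc_tokenizer = Tokenizer()
desc_tokenizer.fit_on_texts(training_sentence)
desc_vocab_size = len(desc_tokenizer.word_index) + 1

categ_tokenizer = Tokenizer()
categ_tokenizer.fit_on_texts(training_category)
categ_vocab_size = len(categ_tokenizer.word_index) + 1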

# Inputs.
desc = Input(shape=(desc_maxlen,))
categ = Input(shape=(categ_maxlen,))

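# Input encodings, opting for different embeddings.
# Descriptions go through an LSTM as a demo of extra processing. The LSTM emits
# categ_embed_size units per step so both encodings share the same last dimension,
# which concatenating along axis=1 (the sequence axis) requires.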
embedded_desc = Embedding(desc_vocab_size, desc_embed_size, input_length=desc_maxlen)(desc)
encoded_desc = LSTM(categ_embed_size, return_sequences=True)(embedded_desc)
encoded_categ = Embedding(categ_vocab_size, categ_embed_size, input_length=categ_maxlen)(categ)

# Rest of the NN, which knows how to put everything together to get an output.
merged = concatenate([encoded_desc, encoded_categ], axis=1)
rest_of_nn = Dense(hidden_size, activation='relu')(merged)
rest_of_nn = Flatten()(rest_of_nn)
rest_of_nn = Dense(output_size, activation='softmax')(rest_of_nn)

# Create the model, assuming some sort of classification problem.
model = Model(inputs=[desc, categ], outputs=rest_of_nn)
model.compile(optimizer='adam', loss='categorical_crossentropy')
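
The model above expects one array per Input, in the same order as the inputs list. As a rough, dependency-free sketch of that data preparation, the snippet below encodes each text column with its own vocabulary and pads it to a fixed length. The encode_field helper, the toy records, and the maxlen values are assumptions for illustration; in practice the Keras Tokenizer and pad_sequences would play this role.

```python
# Dependency-free sketch: encode each text column separately so that every
# column can feed its own Input layer. encode_field is a hypothetical helper.

def encode_field(texts, maxlen):
    """Return (rows, vocab_size): integer ids per text, left-padded with 0."""
    word_index = {}
    rows = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            # Ids start at 1 because 0 is reserved for padding.
            ids.append(word_index.setdefault(word, len(word_index) + 1))
        rows.append(([0] * maxlen + ids)[-maxlen:])
    return rows, len(word_index) + 1  # +1 accounts for the padding id

# Toy records in the shape described in the question (values are made up).
dataset = [
    {"event_description": "server room is on fire", "event_category": "incident", "label": 1},
    {"event_description": "team lunch on friday", "event_category": "social", "label": 0},
]

desc_rows, desc_vocab_size = encode_field(
    [d["event_description"] for d in dataset], maxlen=6)
categ_rows, categ_vocab_size = encode_field(
    [d["event_category"] for d in dataset], maxlen=1)
labels = [d["label"] for d in dataset]

print(desc_rows)   # [[0, 1, 2, 3, 4, 5], [0, 0, 6, 7, 4, 8]]
print(categ_rows)  # [[1], [2]]
```

Converted to arrays, the two row lists would be passed together, for example model.fit([desc_array, categ_array], label_array, ...), with the labels one-hot encoded to match the softmax output and the categorical cross-entropy loss.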

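The second approach is to concatenate all of your data before encoding and then treat everything as a more standard single-sequence problem. It is common to use unique tokens to separate or delimit the different fields, similar to the BOS and EOS tokens that mark the beginning and end of a sequence.

It would look something like this: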

XXBOS XXDESC This event will be fun. XXCATEG leisure XXLOC Seattle, WA XXEOS

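You can also add end markers for the fields, such as DESCXX, omit the BOS and EOS tokens, and generally mix and match however you like. You can even use this to combine some of your input sequences, and then use a multi-input model as above to merge the rest.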

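Speaking of mixing and matching, you also have the option of treating some of your inputs directly as embeddings. Low-cardinality fields such as category and location do not need to be tokenized; each value can be embedded directly without being split into tokens. In other words, they do not need to be sequences.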
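
As a rough illustration of that idea, each distinct field value can be given a single integer id, so that "Seattle, WA" stays one unit instead of becoming two tokens. The build_id_map and encode_ids helpers and the sample values are assumptions made for this sketch.

```python
# Sketch: give each distinct category value one integer id, so the whole
# field is a single id per example rather than a token sequence.
# build_id_map and encode_ids are hypothetical helper names.

def build_id_map(values):
    """Assign ids 1..K to distinct values in sorted order; 0 is kept for unknowns."""
    return {value: i for i, value in enumerate(sorted(set(values)), start=1)}

def encode_ids(values, id_map):
    """Look up each value's id, using 0 for values not seen when building the map."""
    return [id_map.get(value, 0) for value in values]

training_location = ["Seattle, WA", "Austin, TX", "Seattle, WA"]
location_ids = build_id_map(training_location)
encoded = encode_ids(training_location + ["Boise, ID"], location_ids)

print(location_ids)  # {'Austin, TX': 1, 'Seattle, WA': 2}
print(encoded)       # [2, 1, 2, 0]
```

Each id could then feed an Input of shape (1,) followed by an Embedding layer whose input dimension is the number of distinct values plus one, with the result merged alongside the other branches of the model.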

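If you are looking for a reference, I enjoyed this paper on Large Scale Product Categorization using Structured and Unstructured Attributes. It tests all or most of the ideas I have just outlined on large-scale, real-world data.

【Comments】: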