How to improve the accuracy of a model for categorical, non-binary, foreign-language sentiment analysis in TensorFlow? Answers

【Question title】: How to improve accuracy of model for categorical, non-binary, foreign language sentiment analysis in TensorFlow?
【Posted】: 2020-11-06 15:34:40
【Question description】:

TLDR

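My goal is to classify sentences in a foreign language (Hungarian) into 3 sentiment categories: negative, neutral and positive. I would like to improve the accuracy of the model used, which can be found below in the "Define, compile, fit the model" section. The rest of the post is here for completeness and reproducibility.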

I am new to asking questions on machine learning topics, so suggestions on that are welcome here too: How to ask a good question on Machine Learning?

Data preparation

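For this, I have 10000 sentences, given to 5 human annotators and categorised as negative, neutral or positive, available from here. The first few lines look like this: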

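If the sum of the annotators' scores is positive, I classify the sentence as positive (denoted by 2); if the sum is 0, as neutral (denoted by 1); and if the sum is negative, as negative (denoted by 0):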

import pandas as pd
sentences_df = pd.read_excel('/content/OpinHuBank_20130106.xls')

sentences_df['annotsum'] = sentences_df['Annot1'] +\
                           sentences_df['Annot2'] +\
                           sentences_df['Annot3'] +\
                           sentences_df['Annot4'] +\
                           sentences_df['Annot5']

def categorize(integer):
    if 0 < integer:  return 2
    if 0 == integer: return 1
    else: return 0

sentences_df['sentiment'] = sentences_df['annotsum'].apply(categorize)
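
The labelling rule above can also be written in vectorised form. The following is a minimal sketch on a made-up three-row frame (the values in `toy_df` are invented for illustration; only the column names `Annot1` to `Annot5` come from the post). It also prints the class counts, since accuracy is easier to interpret once the share of each class is known.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real spreadsheet; the values are invented.
toy_df = pd.DataFrame({
    'Annot1': [1, 0, -1],
    'Annot2': [1, 0, -1],
    'Annot3': [0, 1, 0],
    'Annot4': [0, -1, 0],
    'Annot5': [1, 0, 1],
})

annot_cols = ['Annot1', 'Annot2', 'Annot3', 'Annot4', 'Annot5']

# Sum the five annotator scores per sentence.
toy_df['annotsum'] = toy_df[annot_cols].sum(axis=1)

# sign() yields -1 / 0 / +1; adding 1 maps that to 0 / 1 / 2,
# the same encoding as categorize() above.
toy_df['sentiment'] = np.sign(toy_df['annotsum']).astype(int) + 1

# Class counts: a quick check of how balanced the labels are.
print(toy_df['sentiment'].value_counts().sort_index())
```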

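Following this tutorial, I continued with SubwordTextEncoder. From here, I downloaded web2.2-freq-sorted.top100k.nofreqs.txt, which contains the 100000 most frequently used words in the target language. (Both the sentiment data and this word list were recommended by this.)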

Reading in the list of most frequent words:

wordlist = pd.read_csv('/content/web2.2-freq-sorted.top100k.nofreqs.txt',sep='\n',header=None,encoding = 'ISO-8859-1')[0].dropna()

Encoding the data, converting to tensors

Initialising the encoder using the build_from_corpus method:

import tensorflow_datasets as tfds
encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
        corpus_generator=(word for word in wordlist), target_vocab_size=2**16)

Building on this, encoding the sentences:

import numpy as np
import tensorflow as tf
def applyencoding(string):
    return tf.convert_to_tensor(np.asarray(encoder.encode(string)))
sentences_df['encoded_sentences'] = sentences_df['Sentence'].apply(applyencoding)

Convert each sentence's sentiment to a tensor:

def tensorise(input):
    return tf.convert_to_tensor(input)
sentences_df['sentiment_as_tensor'] = sentences_df['sentiment'].apply(tensorise)

Defining how much data to hold out for testing:

test_fraction = 0.2
train_fraction = 1-test_fraction
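
As a small illustration of how these fractions turn into slice boundaries, here is a sketch on 10 toy rows. The shuffled variant is an assumption added for illustration and is not part of the original post; it would only matter if the rows in the file happened to be ordered in some way, which is not checked here.

```python
import numpy as np

test_fraction = 0.2
train_fraction = 1 - test_fraction

n_rows = 10  # toy size; the real data set has 10000 sentences
indices = np.arange(n_rows)

# The boundary index used for slicing: first 80% train, last 20% test.
split_at = int(train_fraction * n_rows)
train_idx, test_idx = indices[:split_at], indices[split_at:]

# Hypothetical variant: shuffle the row order once (fixed seed for
# reproducibility) before splitting, so both parts are random samples.
rng = np.random.default_rng(seed=0)
shuffled = rng.permutation(n_rows)
train_idx_shuffled = shuffled[:split_at]
test_idx_shuffled = shuffled[split_at:]
```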

From the pandas dataframe, let's create a numpy array of the encoded-sentence training tensors:

nparrayof_encoded_sentence_train_tensors = \
        np.asarray(sentences_df['encoded_sentences'][:int(train_fraction*len(sentences_df['encoded_sentences']))])

These tensors have different lengths, so let's use padding to give them all the same length:

padded_nparrayof_encoded_sentence_train_tensors = tf.keras.preprocessing.sequence.pad_sequences(
                                            nparrayof_encoded_sentence_train_tensors, padding="post")
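
To make the effect of `padding="post"` concrete, the sketch below mimics it in plain NumPy on three toy sequences. `pad_post` is a hypothetical helper written for this illustration, not a TensorFlow function; it only reproduces the zero-fill-at-the-end behaviour, padding every sequence to the length of the longest one.

```python
import numpy as np

def pad_post(sequences, value=0):
    """Right-pad each sequence with `value` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        # Copy the real token ids to the front; the tail stays `value`.
        padded[row, :len(seq)] = seq
    return padded

# Toy token ids of different lengths (invented values).
toy_sequences = [[5, 8, 2], [7], [3, 9]]
padded_toy = pad_post(toy_sequences)
print(padded_toy)
```

Because the longest sequence sets the width, the training and test arrays padded separately in this post can end up with different widths.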

Let's stack these tensors together:

stacked_padded_nparrayof_encoded_sentence_train_tensors = tf.stack(padded_nparrayof_encoded_sentence_train_tensors)

Let's stack the sentiment tensors together as well:

stacked_nparray_sentiment_train_tensors = \
        tf.stack(np.asarray(sentences_df['sentiment_as_tensor'][:int(train_fraction*len(sentences_df['encoded_sentences']))]))

Define, compile, fit the model (i.e. the main point)

Defining and compiling the model as follows:

### THE QUESTION IS ABOUT THESE ROWS ###
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Conv1D(128, 5, activation='sigmoid'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='sigmoid'),
    tf.keras.layers.Dense(3, activation='sigmoid')
]) 
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True), optimizer='adam', metrics=['accuracy'])
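
One detail that may be worth checking in the rows above: with `from_logits=True` the loss applies a softmax to the model output itself, while the last Dense layer already squashes its output into (0, 1) with a sigmoid. The NumPy-only sketch below (no TensorFlow involved, numbers chosen for illustration) shows what that combination implies. Even in the best case, the probability assigned to the true class cannot exceed e / (e + 2), so the per-example loss cannot drop below roughly 0.55.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Best case when every output is confined to (0, 1) by a sigmoid:
# the true class (index 0) approaches 1, the other two approach 0.
bounded_outputs = np.array([1.0, 0.0, 0.0])
best_prob_bounded = softmax(bounded_outputs)[0]
min_loss_bounded = -np.log(best_prob_bounded)

# With unbounded logits (no activation on the last layer) the
# probability of the true class can get arbitrarily close to 1.
unbounded_logits = np.array([10.0, 0.0, 0.0])
best_prob_unbounded = softmax(unbounded_logits)[0]

print(best_prob_bounded, min_loss_bounded, best_prob_unbounded)
```

Under this reading, either dropping the activation on the last layer or switching to a softmax output with `from_logits=False` would remove the cap. Whether this changes the test accuracy on this data set is not something the sketch can show.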

Fitting it:

NUM_EPOCHS = 40
history = model.fit(stacked_padded_nparrayof_encoded_sentence_train_tensors,
                    stacked_nparray_sentiment_train_tensors,
                    epochs=NUM_EPOCHS)

The first few lines of the output are:

Test results

As in TensorFlow's RNN tutorial, let's plot the results we have obtained so far:

import matplotlib.pyplot as plt

def plot_graphs(history):
  plt.plot(history.history['accuracy'])
  plt.plot(history.history['loss'])
  plt.xlabel("Epochs")
  plt.ylabel('accuracy / loss')
  plt.legend(['accuracy','loss'])
  plt.show()

plot_graphs(history)

This gives us:

Preparing the test data the same way we prepared the training data:

nparrayof_encoded_sentence_test_tensors = \
        np.asarray(sentences_df['encoded_sentences'][int(train_fraction*len(sentences_df['encoded_sentences'])):])

padded_nparrayof_encoded_sentence_test_tensors = tf.keras.preprocessing.sequence.pad_sequences(
                                                 nparrayof_encoded_sentence_test_tensors, padding="post")

stacked_padded_nparrayof_encoded_sentence_test_tensors = tf.stack(padded_nparrayof_encoded_sentence_test_tensors)

stacked_nparray_sentiment_test_tensors = \
        tf.stack(np.asarray(sentences_df['sentiment_as_tensor'][int(train_fraction*len(sentences_df['encoded_sentences'])):]))

Evaluating the model using only the test data:

test_loss, test_acc = model.evaluate(stacked_padded_nparrayof_encoded_sentence_test_tensors,stacked_nparray_sentiment_test_tensors)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

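Which gives:

The full notebook is available here.

The question

How can I change the model definition and compilation rows above to get higher accuracy on the test set after no more than 1000 epochs?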

【Question discussion】:

Tags: python tensorflow machine-learning keras nlp

【Solution 1】:

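You are using WordPiece subwords; you could try BPE instead. You could also build your model on top of BERT and use transfer learning, which is likely to improve your results considerably.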
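First, change the kernel size in the Conv1D layer and try various values; the recommended ones are [3, 5, 7]. Then consider adding layers. Also, increase the number of units in the second-to-last layer (the Dense one), which might help. Alternatively, you can try a network with only LSTM layers, or LSTM layers followed by a Conv1D layer.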
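
The suggestions in this paragraph could be tried along the following lines. This is a sketch under stated assumptions, not a tuned architecture: `build_cnn_classifier` and `build_lstm_classifier` are hypothetical helper names, the `relu` activations, `padding='same'`, the unit counts and the placeholder vocabulary size of 1000 are choices made for illustration (in the question the vocabulary size would be `encoder.vocab_size`).

```python
import tensorflow as tf

def build_cnn_classifier(vocab_size, kernel_size=5, dense_units=64,
                         num_classes=3):
    """Conv1D variant with a configurable kernel size and a wider Dense layer."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 64),
        # padding='same' keeps short sentences usable with larger kernels.
        tf.keras.layers.Conv1D(128, kernel_size, padding='same',
                               activation='relu'),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(dense_units, activation='relu'),
        # No activation: the layer emits raw logits, matching
        # from_logits=True in the compile call below.
        tf.keras.layers.Dense(num_classes),
    ])

def build_lstm_classifier(vocab_size, lstm_units=64, num_classes=3):
    """LSTM-only variant mentioned as an alternative."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 64),
        tf.keras.layers.LSTM(lstm_units),
        tf.keras.layers.Dense(num_classes),
    ])

# One candidate per suggested kernel size, plus the LSTM variant.
candidates = {
    f'cnn_k{k}': build_cnn_classifier(vocab_size=1000, kernel_size=k)
    for k in (3, 5, 7)
}
candidates['lstm'] = build_lstm_classifier(vocab_size=1000)

for candidate in candidates.values():
    candidate.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer='adam',
        metrics=['accuracy'],
    )
```

Each candidate can then be fitted on the same training tensors and compared on held-out data, which keeps the comparison between kernel sizes and architectures like for like.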
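By trying it out: if it works, good; otherwise, repeat. The training loss does give a hint, though: if you see that the loss is not going down smoothly, you may suspect that your network lacks predictive power, i.e. that it is underfitting, and increase the number of neurons in it.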
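
One rough way to put a number on "the loss is not going down" is to compare the mean loss of the last few epochs with the few before them. The helper below is a hypothetical heuristic written for this illustration (the name `relative_improvement` and the window of 5 epochs are assumptions); it would take a list such as `history.history['loss']` as input.

```python
import numpy as np

def relative_improvement(losses, window=5):
    """Relative drop in mean loss between the last two windows of epochs."""
    losses = np.asarray(losses, dtype=float)
    if len(losses) < 2 * window:
        raise ValueError('need at least 2 * window loss values')
    previous = losses[-2 * window:-window].mean()
    recent = losses[-window:].mean()
    return (previous - recent) / previous

# Toy loss curves (invented values): one flat, one steadily decreasing.
stalled_losses = [1.0] * 10
improving_losses = np.linspace(1.0, 0.1, 10)

stalled_score = relative_improvement(stalled_losses)
improving_score = relative_improvement(improving_losses)
print(stalled_score, improving_score)
```

A value near zero over many epochs suggests the loss has stopped moving; what threshold counts as "stalled" is a judgment call and depends on the data.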
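Yes, more data does help. However, if the fault lies in your network, i.e. it is underfitting, then more data will not help. You should explore the model's limitations first, and only then look for faults in the data.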
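Yes, using the most common words is the usual norm, because, probabilistically, less frequently used words occur rarely and therefore do not affect the predictions much.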

【Discussion】: