Cleaning text data for neural machine translation: answers

【Question title】: Cleaning text data for Neural Machine translation
【Posted】: 2020-12-07 04:22:04
【Question】:

I am cleaning my data to get text pairs for machine translation from language X to language Y.

    [['\ufeffMensahe di Pasco di Gobernador di Aruba 2019',
  'Governor’s Christmas speech 2019'],
 ['Gobernador di Aruba Sr. Alfonso Boekhoudt a duna su mensahe di Pasco riba 24 december ultimo',
  'On Christams eve, December 24, the Governor of Aruba Mr. Alfonso Boekhoudt gave his traditional Christmas speech'],
 ['Por a wak e discurso di Pasco di Gobernador via e canalnan di television local',
  "The governor's Christmas speech was shown at the local television stations"],......

The data above is the input to the following code:

from unicodedata import normalize

def clean_pairs(lines):
    cleaned = list()
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # ... (remaining cleaning steps elided)
            # store the tokens as a single string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)


for i in range(10):
    print('[%s]->[%s]' % (cleaned[i,0], cleaned[i,1]))

I expect output like this:

[hi]->[hallo]
[hi]->[gru gott]
[run]->[lauf]
[wow]->[potzdonner]
[wow]->[donnerwetter]

However, I get the following error:

IndexError                                Traceback (most recent call last)
     49
     50 for i in range(10):
---> 51     print('[%s]->[%s]' % (clean_pairs[i,0], clean_pairs[i,1]))

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Can someone help me solve this? Thanks!

【Comments】:

Apparently the list is one-dimensional. Can you print the whole list raw? Maybe try clean_pairs[i][0], or check whether it is a tuple.
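One way to follow this suggestion is to print the container's type and the type of its first element before indexing into it. The snippet below is only a sketch: `describe` is a hypothetical helper, and `pairs` is a small stand-in for the real data.

```python
def describe(obj, name="obj"):
    """Return a one-line summary of a container and its first element."""
    # Guard against empty containers so obj[0] cannot fail.
    first = obj[0] if len(obj) > 0 else None
    return "%s: %s of length %d, first element is %s" % (
        name, type(obj).__name__, len(obj), type(first).__name__)


# Stand-in data shaped like the pairs in the question (an assumption).
pairs = [['hi', 'hallo'], ['run', 'lauf']]
as_tuples = [('hi', 'hallo'), ('run', 'lauf')]

print(describe(pairs, "pairs"))
print(describe(as_tuples, "as_tuples"))
```

If the first element is reported as `list` or `tuple`, the data is a nested Python sequence and needs `obj[i][0]` rather than `obj[i, 0]`.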

Tags: python list data-structures

【Solution 1】:

Your structure is a list of lists. In Python, you index into one like this:

clean[i][0] #  not like clean[i,0]
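As a small illustration of the difference, the sketch below applies chained indexing to a stand-in list of pairs and shows that comma indexing is rejected by a plain list. `format_pairs` is a hypothetical helper, not part of the question's code. The error message in the question is worded like a NumPy one, which suggests the data was converted to an array somewhere not shown in the post. That is an inference, so the example sticks to plain lists.

```python
# Stand-in data using pairs from the question's expected output.
pairs = [['hi', 'hallo'], ['run', 'lauf'], ['wow', 'potzdonner']]


def format_pairs(pairs, limit=10):
    """Format up to `limit` pairs as '[source]->[target]' strings."""
    # Slicing avoids an IndexError when fewer than `limit` pairs exist.
    return ['[%s]->[%s]' % (pair[0], pair[1]) for pair in pairs[:limit]]


for line in format_pairs(pairs):
    print(line)

# Comma indexing passes the tuple (0, 0) as the index, which a list rejects.
try:
    pairs[0, 0]
except TypeError as exc:
    tuple_index_error = str(exc)
    print("pairs[0, 0] failed:", tuple_index_error)

# Chained indexing selects the row first, then the element within it.
print(pairs[0][0])
```

Iterating over `pairs[:limit]` instead of `range(10)` also keeps the loop safe when the dataset has fewer than ten pairs.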

【Comments】:
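For reference, here is a minimal sketch of the normalization step from the question, written so that it returns a plain list of lists that can be indexed as `cleaned[i][0]`. The `to_ascii` helper and the `return` value are assumptions, and the cleaning steps elided in the question are not reconstructed here.

```python
from unicodedata import normalize


def to_ascii(text):
    """Decompose accented characters (NFD), then drop all non-ASCII code points."""
    return normalize('NFD', text).encode('ascii', 'ignore').decode('ascii')


def clean_pairs(lines):
    """Return a list of [source, target] pairs with ASCII-only text."""
    cleaned = []
    for pair in lines:
        # split() + join() collapses runs of whitespace into single spaces.
        cleaned.append([' '.join(to_ascii(line).split()) for line in pair])
    return cleaned


# First pair from the question's data, including the leading BOM (\ufeff).
pairs = [['\ufeffMensahe di Pasco di Gobernador di Aruba 2019',
          'Governor\u2019s Christmas speech 2019']]

cleaned = clean_pairs(pairs)
print('[%s]->[%s]' % (cleaned[0][0], cleaned[0][1]))
```

Note that this step is lossy: the byte-order mark and the curly apostrophe are dropped entirely, and a string such as `grüß gott` becomes `gru gott`, which matches the `[hi]->[gru gott]` line in the expected output.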