How to tokenize a text corpus? Answers

【Question title】: How to tokenize a text corpus?
【Posted】: 2019-12-14 09:29:55
【Question description】:

I want to tokenize a text corpus using the NLTK library.

My corpus looks like:

['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?",

I tried:

tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]

which raises:

AttributeError: 'str' object has no attribute 'decode'

Any help would be appreciated. Thanks.

【Question discussion】:

Possible duplicate of 'str' object has no attribute 'decode'. Python 3 error?

Tags: python pandas numpy recommendation-engine

【Solution 1】:

This page suggests that the word_tokenize method expects a string as its argument, so try

tok_corp = [nltk.word_tokenize(sent) for sent in corpus]

Edit: with the following code I was able to get the tokenized corpus.

Code:

import pandas as pd
from nltk import word_tokenize

corpus = ['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?"]


tok_corp = pd.DataFrame([word_tokenize(sent) for sent in corpus])

Output:

      0     1     2           3        4   ...    13    14    15    16    17
0    Did   you  hear       about      the  ...   tea     ?  None  None  None
1   What    's   the        best     anti  ...  None  None  None  None  None
2   What    do   you        call        a  ...    no  arms   nor  legs     ?
3  Which  Star  Trek   character       is  ...  None  None  None  None  None
4   What    's   the  difference  between  ...  None  None  None  None  None

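I think some objects that are neither strings nor bytes-like have crept into your corpus. I suggest you check it again.

【Discussion】:

'''TypeError: expected string or bytes-like object''' Honestly, I'm just trying to reproduce code I found online, and I think the author left something out there. I tried using nltk.sent_tokenize before word_tokenize, but it didn't work.
Is it possible that the corpus, as a list, contains some items other than strings? It would be great if you could debug the loop.
It has '[', but that is also a string. It occasionally has a string, but that is again a string.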
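
One way to run that check, as a minimal sketch: list the position and type of every element that is not a str. The helper name find_non_strings and the sample data are hypothetical and not from the original post; a float NaN is used here only because it is a typical stand-in for a missing value.

```python
def find_non_strings(corpus):
    """Return (index, type name, repr) for every item that is not a str."""
    return [(i, type(item).__name__, repr(item))
            for i, item in enumerate(corpus)
            if not isinstance(item, str)]

# Hypothetical sample containing two non-string entries
sample = ["Did you hear about the tea?", float("nan"), None,
          "What's the difference?"]

for index, type_name, value in find_non_strings(sample):
    print(index, type_name, value)
```

Any index printed here points at an element that word_tokenize would probably reject with "expected string or bytes-like object".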

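【Solution 2】:

The error says it right there: sent has no attribute decode. You only need to .decode() the items if they were encoded in the first place, i.e. if they are bytes objects rather than str objects. Remove the call and it should be fine.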
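
A small sketch of that rule, assuming the corpus might mix bytes and str items. The helper name to_text is hypothetical; it decodes only what is actually bytes and leaves str untouched.

```python
def to_text(item, encoding="utf-8"):
    """Decode bytes to str, return str unchanged, reject anything else."""
    if isinstance(item, bytes):
        return item.decode(encoding)
    if isinstance(item, str):
        return item
    raise TypeError(f"expected str or bytes, got {type(item).__name__}")

# Hypothetical mixed input: one bytes item, one str item
mixed = [b"What's the best anti diarrheal prescription?",
         "Which Star Trek character is a member of the magic circle?"]

texts = [to_text(item) for item in mixed]
# Every element of texts is now a str and can be passed to nltk.word_tokenize.
```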

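【Discussion】:

'''TypeError: expected string or bytes-like object''' Honestly, I'm just trying to reproduce code I found online, and I think the author left something out there. I tried using nltk.sent_tokenize before word_tokenize, but it didn't work.
So what is the type of the elements in corpus, and does it work on the smaller sample you posted?
It has '[', but that is also a string. It occasionally has a string, but that is again a string.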