word_tokenize TypeError: expected string or buffer

【Question title】: word_tokenize TypeError: expected string or buffer [closed]
【Posted】: 2015-11-18 06:25:57
【Question】:

I get the following error when calling word_tokenize:

File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1322,
    in _slices_from_text for match in
    self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

I have a large text file (1500.txt) from which I want to remove stop words. My code is as follows:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as File_1500:
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(File_1500)
    filtered_sentence = [w for w in words if not w in stop_words]
    print(filtered_sentence)

【Comments】:

What error are you getting?
How do you know it isn't working?
Oh, it says it wants a string, but you are passing it a file. Pass it File_1500.read() to give it a string.
@SaqibAlam Change words = word_tokenize(File_1500) to words = word_tokenize(File_1500.read())
Duplicate of stackoverflow.com/questions/24273662/…

Tags: python python-3.x nlp nltk tokenize

【Solution 1】:

word_tokenize takes a single string as input, typically one sentence, e.g. 'this is sentence 1.'.

File_1500 is a file object, not a string, which is why it does not work.
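
The traceback ends in a regex finditer call, and Python's regex functions reject anything that is not a string. The sketch below reproduces that failure with the standard library only, so it runs without NLTK. It uses io.StringIO as a stand-in for the opened file, and the pattern is a hypothetical placeholder rather than the one punkt uses.

```python
import io
import re

# Stand-in for the opened file: io.StringIO behaves like a text file object.
fake_file = io.StringIO("This is sentence 1. That is sentence 2!")

# Hypothetical pattern, used only to mimic a regex scan over the input.
pattern = re.compile(r"\w+")

try:
    # Passing the file object itself, as the question's code does.
    list(pattern.finditer(fake_file))
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Reading the file first yields a str, which the regex accepts.
text = fake_file.read()
tokens = [m.group() for m in pattern.finditer(text)]

print(error_name)
print(tokens)
```

The exact wording of the TypeError message differs between Python versions, so the sketch only records the exception type.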

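To get a list of sentence strings, first read the file into a string with fin.read(), then use sent_tokenize to split it into sentences (I am assuming your input file is not already sentence-tokenized and is just raw text).

Also, tokenizing the file this way is better and more idiomatic with NLTK: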

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words("english"))

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as fin:
    for sent in sent_tokenize(fin.read()):
        words = word_tokenize(sent)
        filtered_sentence = [w for w in words if w not in stop_words]
        print(filtered_sentence)
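
One caveat with the filter above: the entries in NLTK's English stop-word list are lowercase, so a capitalized token such as "The" passes through a case-sensitive membership test. A minimal sketch of the difference, assuming a tiny hand-written stop-word set and token list (both hypothetical, used here so the example runs without NLTK):

```python
# Hypothetical miniature stop-word set; the real NLTK list is much longer.
stop_words = {"the", "is", "a", "of"}

# Tokens as a tokenizer might return them (hand-written for illustration).
words = ["The", "cat", "is", "a", "friend", "of", "the", "dog", "."]

# Case-sensitive test: "The" is not in the set, so it is kept.
case_sensitive = [w for w in words if w not in stop_words]

# Lowercasing each token before the test also removes "The".
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase depends on the task; for plain stop-word removal the case-insensitive test is usually what is intended.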
