Extract words from string to create featureset nltk (answers)

【Question title】: Extract words from string to create featureset nltk
【Posted】: 2015-07-02 02:10:34
【Question】:

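I'm using NLTK and NLTK-Trainer to do some sentiment analysis. I have an accurate algorithm pickled. When I follow the instructions provided by NLTK-Trainer, everything works well.

Here is what works (returns the desired output):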

>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = dict([(word, True) for word in words])
>>> classifier.classify(feats)

'feats' looks like this:

Out[52]: {'a': True, 'in': True, 'sentence': True, 'some': True, 'words': True}

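However, I don't want to type out the words separated by commas and apostrophes every time. I have a script that does some preprocessing on the text and returns a string that looks like this:

"[['words'], ['in'], ['a'], ['sentence']]"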

However, when I try to define 'feats' from that string, I end up with something that looks like this:

{' ': True,
 "'": True,
 ',': True,
 '[': True,
 ']': True,
 'a': True,
 'b': True,
 'c': True,
 'e': True,
 'h': True,
 'i': True,
 'l': True,
 'n': True,
 'o': True,
 'p': True,
 'r': True,
 's': True,
 'u': True}

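Obviously the classifier doesn't work well with this input. It looks like the 'feats' definition is pulling individual characters out of the text string instead of whole words. How do I fix this?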
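The symptom can be reproduced without a classifier: iterating over a Python string yields its characters one at a time, so a dict comprehension over the string gets one key per distinct character. A minimal sketch, using the example string from the question (the variable name `text` is only for illustration):

```python
# Iterating over a str yields single characters, not words.
text = "[['words'], ['in'], ['a'], ['sentence']]"
feats = {word: True for word in text}

# Every key is one character long, including brackets, quotes and spaces.
print(sorted(feats))
```

So the string has to be turned into a list of words before the feature dict is built.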

【Comments】:

By the way, {word: True for word in words} is more elegant than dict([(word, True) for word in words]).
Why not use nltk.word_tokenize() to split the raw text into a list of tokens?

Tags: python nltk

【Solution 1】:

I'm not sure I understand the question, but I would suggest:

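import nltk

# word_tokenize relies on NLTK's Punkt tokenizer data being installed
words = nltk.word_tokenize("some words in a sentence")
feats = {word: True for word in words}
classifier.classify(feats)  # classifier: the trained classifier loaded earlier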

If you want to use your preprocessed text, try:

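text = "[['words'], ['in'], ['a'], ['sentence']]"
# drop the leading "[['" and trailing "']]", then split on the "'], ['" separators
words = text[3:-3].split("'], ['")
feats = {word: True for word in words}
classifier.classify(feats)  # classifier: the trained classifier loaded earlier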
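Since the preprocessed string is itself a valid Python literal, another option (not part of the original answer) is to parse it with the standard library's `ast.literal_eval` and flatten the result instead of slicing by position. A rough sketch, assuming the script always emits a list of one-word lists as in the example:

```python
import ast

text = "[['words'], ['in'], ['a'], ['sentence']]"

# literal_eval parses a string holding a Python literal without executing code.
nested = ast.literal_eval(text)

# Flatten the sublists into a single list of words.
words = [word for sublist in nested for word in sublist]
feats = {word: True for word in words}
print(feats)  # {'words': True, 'in': True, 'a': True, 'sentence': True}
```

The resulting dict has the same shape as the hand-typed example in the question, so it can be passed to the classifier's classify() method in the same way. Unlike the slicing approach, this does not depend on the exact spacing between the brackets.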

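【Comments】:

Thanks for the help. This put me on the right track, although the basic tokenizer still treated the brackets as words. I was able to strip those out with RegexpTokenizer.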
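The comment does not say which pattern was used with RegexpTokenizer. As a guess at what that might look like, here is a sketch with the pattern `\w+` (an assumption), which keeps runs of word characters and drops brackets, quotes and commas. The `re.findall` fallback is only there so the snippet still runs where NLTK is not installed:

```python
import re

try:
    from nltk.tokenize import RegexpTokenizer
    # Assumed pattern: keep runs of letters, digits and underscores.
    tokenize = RegexpTokenizer(r"\w+").tokenize
except ImportError:
    # Fallback for environments without NLTK: same pattern via the stdlib.
    def tokenize(s):
        return re.findall(r"\w+", s)

text = "[['words'], ['in'], ['a'], ['sentence']]"
words = tokenize(text)
feats = {word: True for word in words}
print(words)  # ['words', 'in', 'a', 'sentence']
```

Note that `\w+` also splits contractions and hyphenated words at the punctuation, so the pattern may need adjusting for real text.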