Python multiprocessing - text processing (answers)

【Question title】: python multiprocessing - text processing
【Posted】: 2010-06-20 21:42:43
【Question】:

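I am trying to create a multiprocessing version of some text classification code that I found here (along with other cool stuff). I have appended the full code below.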

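I have tried a couple of things. First I tried a lambda function, but it complained that it could not be pickled (!?), so I then tried a stripped-down version of the original code: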

  negids = movie_reviews.fileids('neg')
  posids = movie_reviews.fileids('pos')

  p = Pool(2)
  negfeats =[]
  posfeats =[]

  for f in negids:
      words = movie_reviews.words(fileids=[f])
      negfeats = p.map(featx, words)  # not the same form as below - using for debugging

  print len(negfeats)

Unfortunately, even this does not work. I get the following trace:

File "/usr/lib/python2.6/multiprocessing/pool.py", line 148, in map
    return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.6/multiprocessing/pool.py", line 422, in get
    raise self._value
ZeroDivisionError: float division

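Any idea what I might be doing wrong? Should I use pool.apply_async instead (that in itself does not seem to solve the problem either, but perhaps I am barking up the wrong tree)?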

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()
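
A note on the lambda error mentioned in the question: Pool.map sends the function to the worker processes by pickling it, and the standard pickle module serializes a function by its name, so an anonymous lambda is rejected while a named module-level function is accepted. A minimal sketch in Python 3 syntax, where featx_named and can_pickle are hypothetical names used only for illustration (this is not the feature extractor from the original code):

```python
import pickle


def featx_named(words):
    # Hypothetical stand-in feature extractor: a bag-of-words dict.
    return dict((word, True) for word in words)


# The same logic written as a lambda, which has no importable name.
featx_lambda = lambda words: dict((word, True) for word in words)


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("named function picklable:", can_pickle(featx_named))
    print("lambda picklable:", can_pickle(featx_lambda))
```

Defining the feature extractor with def at module level is therefore usually enough to get past the pickling complaint.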

【Question comments】:

Tags: python multithreading multicore

【Solution 1】:

Regarding your stripped-down version, are you using a different feature function from the one used in http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ ?

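The exception most probably happens inside featx, and multiprocessing just re-raises it. It does not include the original traceback, though, which makes the trace rather unhelpful.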
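
One way to recover the missing traceback is to catch the exception inside the worker and re-raise it with the formatted traceback as its message. A minimal sketch in Python 3 syntax, assuming a hypothetical featx that divides by the word length (the real featx from the question is not shown) and a hypothetical wrapper named traced_featx:

```python
import traceback
from multiprocessing import Pool


def featx(word):
    # Hypothetical stand-in that fails on an empty string.
    return 1.0 / len(word)


def traced_featx(arg):
    """Call featx; on failure, re-raise with the worker-side traceback text."""
    try:
        return featx(arg)
    except Exception:
        # format_exc() runs inside the worker process, where the
        # original stack frames still exist.
        raise RuntimeError(
            "featx failed for %r:\n%s" % (arg, traceback.format_exc())
        )


if __name__ == "__main__":
    pool = Pool(2)
    try:
        pool.map(traced_featx, ["good", ""])
    except RuntimeError as err:
        # The message now names the failing input and the line in featx.
        print(err)
    finally:
        pool.close()
        pool.join()
```

The wrapper is passed to pool.map in place of featx, so the rest of the script does not need to change.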

Try running it without pool.map() first (i.e. negfeats = [featx(x) for x in words]), or include something in featx that you can debug.

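If that still does not help, post the entire script you are working on in your original question (simplified, if possible) so that others can run it and provide more targeted answers. Note that the following code fragment actually works (adapted from your stripped-down version):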

from nltk.corpus import movie_reviews
from multiprocessing import Pool

def featx(words):
    return dict([(word, True) for word in words])

if __name__ == "__main__":
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    p = Pool(2)
    negfeats =[]
    posfeats =[]

    for f in negids:
        words = movie_reviews.words(fileids=[f]) 
        negfeats = p.map(featx, words)

    print len(negfeats)
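
Note that in the loop above, p.map(featx, words) calls featx once per word, so each call receives a single string, and negfeats is overwritten on every pass. If the goal is one feature dict per review, one possible arrangement is to map over the file ids and let each worker load its own words. A sketch in Python 3 syntax; DOCS, load_words, labeled_feats and extract_all are hypothetical names, with DOCS standing in for movie_reviews.words(fileids=[f]) so that the example runs without the NLTK corpus:

```python
from multiprocessing import Pool

# Hypothetical stand-in corpus: file id -> list of words.
DOCS = {
    "neg/a.txt": ["bad", "plot", "bad", "acting"],
    "neg/b.txt": ["dull", "plot"],
}


def load_words(fileid):
    # Stand-in for movie_reviews.words(fileids=[fileid]).
    return DOCS[fileid]


def featx(words):
    # Bag-of-words features, as in the answer above.
    return dict((word, True) for word in words)


def labeled_feats(args):
    """Build one (features, label) pair for one file id."""
    fileid, label = args
    return (featx(load_words(fileid)), label)


def extract_all(fileids, label, pool=None):
    """Map over file ids; pass pool=None to run serially for debugging."""
    tasks = [(f, label) for f in fileids]
    mapper = pool.map if pool is not None else map
    return list(mapper(labeled_feats, tasks))


if __name__ == "__main__":
    pool = Pool(2)
    try:
        negfeats = extract_all(sorted(DOCS), "neg", pool)
    finally:
        pool.close()
        pool.join()
    print(len(negfeats))
```

Calling extract_all without a pool runs the same code in a single process, which keeps the full traceback available while debugging.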

【Comments】:

"Try running it without pool.map() first (i.e. negfeats = [featx(x) for x in words])" - thanks a lot.

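【Solution 2】:

Are you trying to parallelize classification, training, or both? You can probably make the word counting and scoring parallel fairly easily, but I am not sure about the feature extraction and training. For classification, I would recommend execnet. I have had good results using it for parallel/distributed part-of-speech tagging.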

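The basic idea with execnet is that you train a single classifier once, then send it to each execnet node. Next, you divide the files among the nodes and have each node classify every file it is given. The results are then sent back to the master node. I have not tried pickling a classifier yet, so I am not sure whether that will work, but if a POS tagger can be pickled, I would assume a classifier can be too.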
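
The flow described above (train once, ship the pickled classifier to every node, split the files, classify, collect) can be sketched without execnet itself. In the sketch below the execnet gateways and channels are replaced by plain function calls, and ToyClassifier, partition, node_classify and classify_distributed are hypothetical names, so this only illustrates the data flow and says nothing about execnet's API (Python 3 syntax):

```python
import pickle


class ToyClassifier(object):
    """Hypothetical stand-in for a trained classifier."""

    def __init__(self, negative_words):
        self.negative_words = set(negative_words)

    def classify(self, feats):
        # Label a document 'neg' if it contains any known negative word.
        return "neg" if self.negative_words & set(feats) else "pos"


def partition(items, n_nodes):
    """Deal items round-robin into n_nodes roughly equal lists."""
    return [items[i::n_nodes] for i in range(n_nodes)]


def node_classify(classifier_bytes, docs):
    """What one node would do: unpickle the classifier, classify its share."""
    classifier = pickle.loads(classifier_bytes)
    return [(name, classifier.classify(feats)) for name, feats in docs]


def classify_distributed(classifier, docs, n_nodes=2):
    # Pickle once on the master, then hand the same bytes to every node.
    payload = pickle.dumps(classifier, protocol=pickle.HIGHEST_PROTOCOL)
    results = []
    for share in partition(docs, n_nodes):
        results.extend(node_classify(payload, share))
    return dict(results)


docs = [
    ("a.txt", {"dull": True, "plot": True}),
    ("b.txt", {"great": True, "cast": True}),
    ("c.txt", {"bad": True}),
]
print(classify_distributed(ToyClassifier(["dull", "bad"]), docs))
```

In a real setup, node_classify would be the code running on each remote node, and the pickled bytes and the file shares would travel over the network instead of being passed as arguments.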

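【Comments】:

I have just started trying pickling. The pickles get fairly large, though (around 100 MB). I will see whether I can get multiprocessing to work somehow; otherwise execnet seems like an alternative. I suspect that training cannot be parallelized (easily), but, like you say, the other bits and bobs should not be that difficult... hopefully. By the way, thanks for the material on streamhacker - it is a treasure trove!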