Stanford NER and POS, multithreading for large data

【Question title】: Stanford NER and POS, Multithreading for a large data
【Posted】: 2017-01-31 04:49:33
【Question】:

I am trying to use Stanford NER and the Stanford POS Tagger to parse about 23,000 documents. I implemented it using the following pseudocode:

`for each in document:
  eachSentences = PunktTokenize(each)
  #code to generate NER Tagger
  #code to generate POS Taggers on the above output`

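On a 4-core machine with 15 GB of RAM, the run time for NER alone is about 945 hours. I tried to speed things up with the "threading" library, but I get the following error: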

`Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "removeStopWords.py", line 75, in partofspeechRecognition
    listOfRes_new = namedEntityRecognition(listRes[min:max])
  File "removeStopWords.py", line 63, in namedEntityRecognition
    listRes_ner.append(namedEntityRecognitionResume(eachResSentence))
  File "removeStopWords.py", line 50, in namedEntityRecognitionResume
    ner2Tags = ner2.tag(each.title().split())
  File "/home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py", line 71, in tag
    return sum(self.tag_sents([tokens]), [])
  File "/home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py", line 98, in tag_sents
    os.unlink(self._input_file_path)
OSError: [Errno 2] No such file or directory: '/tmp/tmpvMNqwB'`

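I am using NLTK 3.2.1, the Stanford NER and POS 3.7.0 jar files, and the threading module. As far as I can tell, this might be caused by thread locking on /tmp. Please correct me if I am wrong, and tell me the best way to run the above with threads, or a better implementation altogether.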
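
One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the failing line is `os.unlink(self._input_file_path)`, so the path of the temporary file is stored on the tagger object itself. If several threads share one tagger, each call can overwrite that attribute, and two threads may end up deleting the same file. The sketch below replays that interleaving deterministically in a single thread; `SharedPathTagger` is a hypothetical stand-in, not NLTK code.

```python
import errno
import os
import tempfile


class SharedPathTagger(object):
    """Hypothetical stand-in that keeps its temp-file path on the instance."""

    def create_input_file(self):
        # Each call overwrites the path stored on the shared instance.
        handle, self._input_file_path = tempfile.mkstemp(text=True)
        os.close(handle)
        return self._input_file_path

    def delete_input_file(self):
        os.unlink(self._input_file_path)


def simulate_shared_path_race():
    """Replay 'two threads, one tagger' step by step; return the errno raised."""
    tagger = SharedPathTagger()
    first_path = tagger.create_input_file()   # thread A creates its file
    tagger.create_input_file()                # thread B overwrites the attribute
    tagger.delete_input_file()                # thread A deletes B's file
    try:
        tagger.delete_input_file()            # thread B finds its file gone
    except OSError as error:
        return error.errno
    finally:
        os.unlink(first_path)                 # clean up A's orphaned file
    return None


print(simulate_shared_path_race() == errno.ENOENT)
```

Under that assumption, giving each thread its own tagger instance would avoid the clash, although it would not by itself reduce the time spent inside each tagging call.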

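I am using the 3-class classifier for NER and the Maxent POS tagger.

P.S. Please ignore the name of the Python file; I have not yet removed stop words or punctuation from the original text.

Edit: using cProfile and sorting by cumulative time, I got the following top 20 calls: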

600792 function calls (595912 primitive calls) in 60.795 seconds

Ordered by: cumulative time
List reduced from 3357 to 20 due to restriction <20>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000   60.811   60.811 removeStopWords.py:1(<module>)
    1    0.000    0.000   58.923   58.923 removeStopWords.py:76(partofspeechRecognition)
   28    0.001    0.000   58.883    2.103 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py:69(tag)
   28    0.004    0.000   58.883    2.103 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py:73(tag_sents)
   28    0.001    0.000   56.927    2.033 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:63(java)
  141    0.001    0.000   56.532    0.401 /usr/lib/python2.7/subprocess.py:769(communicate)
  140    0.002    0.000   56.530    0.404 /usr/lib/python2.7/subprocess.py:1408(_communicate)
  140    0.008    0.000   56.492    0.404 /usr/lib/python2.7/subprocess.py:1441(_communicate_with_poll)
  400   56.474    0.141   56.474    0.141 {built-in method poll}
    1    0.001    0.001   43.522   43.522 removeStopWords.py:69(partofspeechRecognitionRes)
    1    0.000    0.000   15.401   15.401 removeStopWords.py:62(namedEntityRecognition)
    1    0.001    0.001   15.367   15.367 removeStopWords.py:46(namedEntityRecognitionRes)
  141    0.004    0.000    2.302    0.016 /usr/lib/python2.7/subprocess.py:651(__init__)
  141    0.020    0.000    2.287    0.016 /usr/lib/python2.7/subprocess.py:1199(_execute_child)
   56    0.002    0.000    1.933    0.035 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:38(config_java)
   56    0.001    0.000    1.931    0.034 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:599(find_binary)
  112    0.002    0.000    1.930    0.017 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:582(find_binary_iter)
  118    0.009    0.000    1.928    0.016 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:453(find_file_iter)
    1    0.001    0.001    1.318    1.318 /usr/lib/python2.7/pickle.py:1383(load)
    1    0.046    0.046    1.317    1.317 /usr/lib/python2.7/pickle.py:851(load)

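【Question comments】:

Is this about training the classifiers or applying them? 945 h seems far longer than I would expect for tagging 2,300 documents (or training a tagger on them), unless the documents are very large. I suspect something is wrong with your code (for example, creating a new tagger instance for every sentence), and I would focus on fixing that rather than trying multithreading. Try profiling to find out which part takes so long.
There are 23,000 documents, each with about 20-25 sentences. I create one tagger instance at the start and use that same instance to classify every sentence. I am applying the NER classifier to my documents to tag them. I use tqdm to estimate the remaining time, and even the best-case estimate is 600 hours, which seems like a lot.
Ah, OK, 23,000, not 2,300, my mistake. Still, that is far too long; you should do some profiling.
Please elaborate on what profiling means in the context of NER and Python.
I am not familiar with the NLTK wrapper for CoreNLP, but for a collection this large it may be worth annotating with the raw Java code and saving the results. The command line usage documentation may be of particular interest. You can parallelize the computation with the -threads command-line flag. On 4 cores, annotation should take no more than a day; I would guess you can finish in 6-12 hours.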
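
As a minimal sketch of the profiling suggested in these comments: Python's built-in cProfile module records how long each function takes, and pstats can print the calls sorted by cumulative time. The helper name `profile_call` and the toy workload are illustrative assumptions, not part of the original code, and the sketch is written for Python 3 rather than the Python 2.7 seen in the traceback.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, report of the top 20 calls)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the first 20 entries.
    stats.sort_stats("cumulative").print_stats(20)
    return result, stream.getvalue()


def toy_workload(n):
    # Stand-in for the real tagging loop.
    return sum(i * i for i in range(n))


result, report = profile_call(toy_workload, 1000)
print(report)
```

In the report, rows with a large cumulative time but a small total time point at callers; the row where the total time itself is large is where the program actually waits.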

Tags: python multithreading nltk stanford-nlp

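【Solution 1】:

The Python wrapper appears to be the culprit here. The Java implementation does not take nearly as long; it takes roughly what @Gabor Angeli mentioned. Give it a try.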
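
One way to read the profile in the question: `tag` is called 28 times and `java` is called 28 times, at about 2.1 s per call on average, with nearly all of the time spent in `poll`, that is, waiting for a Java subprocess. If that reading is right, every sentence pays for a fresh Java start, and passing many sentences per call should spread that cost. The traceback shows that `tag` delegates to `tag_sents`, which takes a list of tokenized sentences. The sketch below batches sentences for any object exposing `tag_sents`; `tag_in_batches` and `batch_size` are illustrative names, and the speed-up is an expectation, not a measurement.

```python
def tag_in_batches(tagger, sentences, batch_size=1000):
    """Tag tokenized sentences with one tagger call per batch.

    tagger: any object with tag_sents(list_of_token_lists), such as an
            NLTK Stanford tagger (assumed here, not verified on this data).
    sentences: list of token lists, e.g. [["My", "name"], ["John", "Doe"]].
    """
    tagged = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # One tag_sents call per batch; the profile suggests each call
        # launches Java once, so fewer calls should mean fewer launches.
        tagged.extend(tagger.tag_sents(batch))
    return tagged
```

For scale: at about 2.1 s per call, 23,000 documents of 20-25 sentences each come to roughly 270 to 335 hours of per-sentence calls for a single tagger, the same order of magnitude as the estimates in the question.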

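Hope this helps!

【Comments】:

【Solution 2】:

This problem has probably been solved already, but for anyone trying to speed up Stanford NLP in Python, here is a tried-and-tested answer: How to speedup Stanford NLP in Python?

Essentially, it asks you to run the NER server in the background and call it through the sner library to perform the Stanford NLP tagging tasks.

Found the answer.

Start the Stanford NLP server in the background from the folder where Stanford NLP is unzipped.

Part of the answer is given below.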

java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz

Then start the Stanford NLP server tagger in Python using the sner library.

from sner import Ner
tagger = Ner(host='localhost',port=9199)

Then run the tagger.

%%time
classified_text=tagger.get_entities(text)
print (classified_text)
Output:

    [('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 18.2 ms
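
Since the server keeps the model loaded, the multithreading the question asked about can be layered on top of it. The following is a rough sketch, assuming the server copes with concurrent connections (not verified here); `tag_documents` and `workers` are illustrative names. It takes the tagging function as a parameter, so `tagger.get_entities` from the snippet above can be passed in.

```python
from multiprocessing.pool import ThreadPool


def tag_documents(documents, tag_fn, workers=4):
    """Apply tag_fn to every document using a pool of threads.

    tag_fn: callable taking one text and returning its tagged tokens,
            e.g. tagger.get_entities from an sner Ner client.
    Results come back in the same order as documents.
    """
    pool = ThreadPool(workers)
    try:
        return pool.map(tag_fn, documents)
    finally:
        # Release the worker threads whether or not tagging succeeded.
        pool.close()
        pool.join()
```

For example, `tag_documents(texts, tagger.get_entities)` would tag a list of strings `texts`. Whether a single `Ner` client can safely be shared across threads is an assumption; creating one client per thread is the more cautious option.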

【Comments】: