Stanford NLP - OpenIE out of memory when processing list of files

【Question title】: Stanford NLP - OpenIE out of memory when processing list of files
【Posted】: 2016-04-05 16:21:49
【Question description】:

I am trying to extract information from several files with the OpenIE tool from Stanford CoreNLP. When I pass several files as input instead of one, it fails with an out-of-memory error.

All files have been queued; awaiting termination...
java.lang.OutOfMemoryError: GC overhead limit exceeded
at edu.stanford.nlp.graph.DirectedMultiGraph.outgoingEdgeIterator(DirectedMultiGraph.java:508)
at edu.stanford.nlp.semgraph.SemanticGraph.outgoingEdgeIterator(SemanticGraph.java:165)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER$1.advance(GraphRelation.java:267)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1102)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.<init>(GraphRelation.java:1083)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER$1.<init>(GraphRelation.java:257)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER.searchNodeIterator(GraphRelation.java:257)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:320)
at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.matches(CoordinationPattern.java:211)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matchChild(NodePattern.java:514)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:542)
at edu.stanford.nlp.naturalli.RelationTripleSegmenter.segmentVerb(RelationTripleSegmenter.java:541)
at edu.stanford.nlp.naturalli.RelationTripleSegmenter.segment(RelationTripleSegmenter.java:850)
at edu.stanford.nlp.naturalli.OpenIE.relationInFragment(OpenIE.java:354)
at edu.stanford.nlp.naturalli.OpenIE.lambda$relationsInFragments$2(OpenIE.java:366)
at edu.stanford.nlp.naturalli.OpenIE$$Lambda$76/1438896944.apply(Unknown Source)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1540)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at edu.stanford.nlp.naturalli.OpenIE.relationsInFragments(OpenIE.java:366)
at edu.stanford.nlp.naturalli.OpenIE.annotateSentence(OpenIE.java:486)
at edu.stanford.nlp.naturalli.OpenIE.lambda$annotate$3(OpenIE.java:554)
at edu.stanford.nlp.naturalli.OpenIE$$Lambda$25/606198361.accept(Unknown Source)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at edu.stanford.nlp.naturalli.OpenIE.annotate(OpenIE.java:554)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:71)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:499)
at edu.stanford.nlp.naturalli.OpenIE.processDocument(OpenIE.java:630)
DONE processing files. 1 exceptions encountered.

I pass the files as input with this call:

java -mx3g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE file1 file2 file3 etc.

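I tried increasing the memory with -mx3g and other variants. The number of files processed did go up, but not by much (for example, from 5 to 7). Each file is processed correctly on its own, so I have ruled out a single file with very long sentences or many lines being the cause.

Is there an option I have not considered, some OpenIE or Java flag, that I can use to force a dump to output, a cleanup, or a garbage collection between each processed file?

Thank you in advance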

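【Question comments】:

Please post the calling code.
How big are the files you are processing (e.g., in words)? How many threads does your computer have? One thing you can try is setting -threads 1 to disable parallelism when processing documents. If it is loading many large documents at once, this could fix the problem.
@Woot4Moo I am calling OpenIE directly from the shell, using the java call I posted above, without changing the provided source code, but thanks anyway.
@smothP Great! Most likely, increasing the memory by a few GB should also let it work with multiple threads. CoreNLP Annotation objects are quite large, and OpenIE probably produces more intermediate garbage than it should, especially for long sentences. RE different outputs: that is a good idea for a new feature. For now, you can set the output format with -format reverb; the first column will then contain the input file name, which you can use to route the output.
(See reverb.cs.washington.edu/README.html for the ReVerb output format)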
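The routing idea from the last comment can be sketched in a few lines of shell. This is only an illustration under assumptions: it relies solely on the first tab-separated column holding the input file name, as stated above, and the function name `split_by_source` and the `.tsv` suffix are hypothetical choices, not part of CoreNLP or ReVerb.

```shell
#!/usr/bin/env bash
# split_by_source OUT_DIR: read tab-separated lines on stdin and append each
# line to "OUT_DIR/<file name from column 1>.tsv".
# (Hypothetical helper; only the "column 1 = input file name" layout is assumed.)
split_by_source() {
  local out_dir="$1"
  mkdir -p "$out_dir"
  awk -F'\t' -v dir="$out_dir" '
    {
      name = $1
      sub(/^.*\//, "", name)          # drop any directory part of the path
      if (name == "") name = "unknown"
      out = dir "/" name ".tsv"
      print >> out                    # append the whole line unchanged
      close(out)                      # keep the number of open files small
    }'
}
```

Piping the output of the OpenIE command from the question, run with -format reverb, into `split_by_source out_dir` would then leave one file per input document in `out_dir`, assuming the extractions are written to standard output. Input files that share a base name in different directories would end up in the same output file.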

Tags: java stanford-nlp

【Solution 1】:

Run this command to get separate annotations for each file (sample-file-list.txt should list one file per line):

java -Xmx4g -cp "stanford-corenlp-full-2015-12-09/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie -filelist sample-file-list.txt -outputDirectory output_dir -outputFormat text
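The file list passed to -filelist can be generated rather than typed by hand. A minimal sketch, assuming the inputs are `.txt` files under a single directory; the function name `make_filelist` and the directory name `input_dir` are hypothetical:

```shell
#!/usr/bin/env bash
# make_filelist DIR OUT: write the path of every .txt file under DIR to OUT,
# one path per line, sorted so the order is reproducible.
# (Hypothetical helper; the .txt extension is an assumption about the inputs.)
make_filelist() {
  local dir="$1" out="$2"
  find "$dir" -type f -name '*.txt' | sort > "$out"
}

# Example: make_filelist input_dir sample-file-list.txt
```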

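【Comments】:

Note: I just fixed this command, because originally I was using a properties file on my local machine!
There are also other output formats (json, xml). I just like using text for readability, but it may not be suitable for passing to the next step in a pipeline.
Note that this dumps a lot of extra material alongside the OpenIE output, namely all the other CoreNLP annotations.
Thank you both. This works, but I will go with @GaborAngeli's suggestion of outputting in ReVerb format, since I already use ReVerb for other things.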

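【Solution 2】:

From the comments above: I suspect this is a problem of too much parallelism and too little memory. OpenIE is somewhat memory-hungry, especially on long sentences, so running many files in parallel can take up quite a lot of memory.

A simple fix is to force the program to run single-threaded by setting the -threads 1 flag. If possible, increasing the memory should help as well.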
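Applied to the invocation from the question, the single-threaded run might look like the sketch below. The classpath and the -mx3g setting are copied from the question. The wrapper function `openie_single_thread` and the `JAVA_BIN` and `CORENLP_CP` variables are hypothetical conveniences, and placing the flag before the file names is an assumption about argument order.

```shell
#!/usr/bin/env bash
# Classpath copied from the question; adjust it to where the jars actually live.
CORENLP_CP="${CORENLP_CP:-stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar}"
# JAVA_BIN is a hypothetical override hook, e.g. to point at a specific JVM.
JAVA_BIN="${JAVA_BIN:-java}"

# openie_single_thread FILE...: run OpenIE over the given files with a single
# worker thread, so that documents are not annotated in parallel.
openie_single_thread() {
  "$JAVA_BIN" -mx3g -cp "$CORENLP_CP" \
    edu.stanford.nlp.naturalli.OpenIE -threads 1 "$@"
}

# Example: openie_single_thread file1 file2 file3
```

This trades throughput for a lower peak memory footprint; on a machine with more memory, raising the heap size instead may allow the default multithreaded run to finish.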

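【Comments】:

Thanks again! My machine only has 4 GB, so I only tried 3 GB of memory. I will try to get access to a machine with more memory to test it, but this solution is perfect.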