StanfordCoreNLP parse tree generation gets stuck

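【Question title】: The generation of the parse tree of StanfordCoreNLP is stuck
【Posted】: 2017-05-10 02:15:07
【Question description】:

When I use StanfordCoreNLP on Spark to generate parse trees for a large dataset, one of the tasks stays stuck for a long time. I looked at the stack trace of the stuck task, and it shows the following: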

    at edu.stanford.nlp.ling.CoreLabel.<init>(CoreLabel.java:68)
    at edu.stanford.nlp.ling.CoreLabel$CoreLabelFactory.newLabel(CoreLabel.java:248)
    at edu.stanford.nlp.trees.LabeledScoredTreeFactory.newLeaf(LabeledScoredTreeFactory.java:51)
    at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:27)
    at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
    at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
    at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
    at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)

The code I believe is relevant is as follows:

import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation
import edu.stanford.nlp.util.CoreMap
import scala.collection.JavaConversions._

object CoreNLP {
    def transform(Content: String): String = {
        val v = new CoreNLP
        // The English result is computed but discarded;
        // only the Chinese result below is returned.
        v.runEnglishAnnotators(Content)
        v.runChineseAnnotators(Content)
    }
}

class CoreNLP {
    def runEnglishAnnotators(inputContent: String): String = {
        var document = new Annotation(inputContent)
        val props = new Properties
        props.setProperty("annotators", "tokenize, ssplit, parse")
        val coreNLP = new StanfordCoreNLP(props)
        coreNLP.annotate(document)
        parserOutput(document)
    }

    def runChineseAnnotators(inputContent: String): String = {
        var document = new Annotation(inputContent)
        val props = new Properties
        val corenlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties")
        corenlp.annotate(document)
        parserOutput(document)
    }

    def parserOutput(document: Annotation): String = {
        val sentences = document.get(classOf[SentencesAnnotation])
        var result = ""
        for (sentence: CoreMap <- sentences) {
            val tree = sentence.get(classOf[TreeAnnotation])
            // append the tree for this sentence to the output
            result = result + "\n" + tree.toString
        }
        result
    }
}

My classmate says the test data is recursive, so the NLP runs endlessly. I don't know whether that is true.

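【Question discussion】:

How long is the sentence that causes the problem?
About 300 KB of data. I recently found another problem. When I run the program mentioned above (runChineseAnnotators()) and the test text is one very long string, it throws an exception: NumberFormatException: multiple points
NumberFormatException: multiple points at edu.stanford.nlp.ie.ChineseQuantifiableEntityNormalizer.normalizedNumberString
How many tokens are in the sentence?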

Tags: scala stanford-nlp parse-tree

【Solution 1】:

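If you add props.setProperty("parse.maxlen", "100"); to your code, the parser will not parse sentences longer than 100 tokens. This can help prevent the crash. You should experiment to find the best maximum sentence length for your application.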
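
A minimal sketch of the setting described above, assuming the English annotator list from the question. It uses only java.util.Properties, so it runs without CoreNLP on the classpath; the object name `ParserProps`, the helper `parserProps`, and its default of 100 are illustrative and not part of CoreNLP.

```scala
import java.util.Properties

object ParserProps {
  // Hypothetical helper: builds the pipeline properties from the question
  // and adds the parse.maxlen cap suggested in the answer.
  def parserProps(maxLen: Int = 100): Properties = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, parse")
    // Cap on sentence length (in tokens) for the parser.
    props.setProperty("parse.maxlen", maxLen.toString)
    props
  }
}
```

The returned object would then be passed to new StanfordCoreNLP(props), in the same way as in the question's runEnglishAnnotators.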

【Discussion】:

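Well, I did not describe the problem thoroughly. I use NLP to process Chinese, and I don't know how to set the 'props' for Chinese. The only thing I know how to do is: StanfordCoreNLP corenlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties"); Can you tell me how to set the props? Thanks for answering my question.
In Scala I think you want to load the Chinese properties from the file: prop.load(new FileInputStream("/path/to/file")) and then set this additional property.
Well, 'parse.maxlen' is hard to set because my data is larger than 40 GB and I use Spark with RDDs to process it in parallel. I don't know how large the data in each RDD is. But I found another way to handle it: editing the properties file named "StanfordCoreNLP-chinese.properties". My only purpose in using StanfordNLP is to get the parse trees, so I deleted several parameters and left only 'tokenize, ssplit, pos, parse'. I don't know whether this is reasonable; could you give me some advice?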
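
The suggestion in the discussion above, loading the Chinese properties and then setting extra keys on top, could be sketched roughly as follows. This is an illustration under assumptions rather than a definitive implementation: the object name `ChineseProps` and the helper `loadWithOverrides` are hypothetical, and the snippet uses only the Java standard library so that it runs without CoreNLP.

```scala
import java.io.InputStream
import java.util.Properties

object ChineseProps {
  // Hypothetical helper: reads a .properties stream, then applies the
  // overrides on top, so an override replaces any value from the file.
  def loadWithOverrides(in: InputStream,
                        overrides: Map[String, String]): Properties = {
    val props = new Properties()
    // Always close the stream, even if loading fails.
    try props.load(in) finally in.close()
    overrides.foreach { case (key, value) => props.setProperty(key, value) }
    props
  }
}
```

With the real file, the stream would be new FileInputStream("/path/to/file") as in the comment above, the overrides could be Map("annotators" -> "tokenize, ssplit, pos, parse", "parse.maxlen" -> "100"), and the result would be passed to new StanfordCoreNLP(props). This keeps the shipped properties file unedited while still trimming the annotator list and capping the sentence length.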