Stanford CoreNLP: Use partial existing annotation

【Question title】: Stanford CoreNLP: Use partial existing annotation
【Posted】: 2014-10-07 21:14:47
【Question】:

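We are trying to make use of our existing

tokenization
sentence splitting
and named entity tagging

while using Stanford CoreNLP to additionally provide

part-of-speech tagging
lemmatization
and parsing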

Currently, we are trying it the following way:

1) Create a pipeline with the annotators "pos, lemma, parse":

Properties pipelineProps = new Properties();
pipelineProps.put("annotators", "pos, lemma, parse");
pipelineProps.setProperty("parse.maxlen", "80");
pipelineProps.setProperty("pos.maxlen", "80");
StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps);

2) Read in the sentences with a custom method:

List<CoreMap> sentences = getSentencesForTaggedFile(idToDoc.get(docId));

Within that method, the tokens are constructed like this:

CoreLabel clToken = new CoreLabel();
clToken.setValue(stringToken);
clToken.setWord(stringToken);
clToken.setOriginalText(stringToken);
clToken.set(CoreAnnotations.NamedEntityTagAnnotation.class, neTag);
sentenceTokens.add(clToken);

They are assembled into sentences like this:

Annotation sentence = new Annotation(sb.toString());
sentence.set(CoreAnnotations.TokensAnnotation.class, sentenceTokens);
sentence.set(CoreAnnotations.TokenBeginAnnotation.class, tokenOffset);
tokenOffset += sentenceTokens.size();
sentence.set(CoreAnnotations.TokenEndAnnotation.class, tokenOffset);
sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, sentences.size());
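
The running tokenOffset in the snippet above gives each sentence a document-level token span whose end is exclusive. A minimal plain-Java sketch of just that bookkeeping, with no CoreNLP dependency; the class and method names are hypothetical and chosen only for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenOffsetSketch {

    /** Returns one {begin, end} pair per sentence; end is exclusive. */
    static List<int[]> tokenSpans(List<List<String>> sentences) {
        List<int[]> spans = new ArrayList<>();
        int tokenOffset = 0;
        for (List<String> tokens : sentences) {
            int begin = tokenOffset;          // value stored as TokenBeginAnnotation
            tokenOffset += tokens.size();     // advance past this sentence
            spans.add(new int[] {begin, tokenOffset}); // end stored as TokenEndAnnotation
        }
        return spans;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("Stanford", "is", "in", "California", "."),
            Arrays.asList("It", "is", "sunny", "."));
        for (int[] span : tokenSpans(sentences)) {
            System.out.println(span[0] + " .. " + span[1]);
        }
    }
}
```

For the two sample sentences this prints the spans 0 .. 5 and 5 .. 9, so consecutive sentences share a boundary and never overlap.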

3) Pass the list of sentences to the pipeline:

  Annotation document = new Annotation(sentences);
  pipeline.annotate(document);

However, when running this we get the following error:

null: InvocationTargetException: annotator "pos" requires annotator "tokenize"

Any pointers on how we can achieve what we want to do?

【Question comments】:

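When I build a document like this and pass it to CoreMapExpressionExtractor.createExtractorFromFiles(env, rulesFiles).extractExpressions(sentence), I do not get any matched expressions. I am not using pipeline.annotate here. However, passing the text through a regular pipeline with pipelineProps.setProperty("annotators", "tokenize, ssplit") results in matchedExpressions = somevalue. Any ideas?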

Tags: nlp stanford-nlp

【Solution 1】:

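The exception is thrown because the requirements of the "pos" annotator (an instance of the POSTaggerAnnotator class) are not satisfied.

The requirements of the annotators that StanfordCoreNLP knows how to create are defined in the Annotator interface. For the "pos" annotator, two requirements are defined:

tokenize
ssplit

Both requirements must be satisfied, which means that both the "tokenize" annotator and the "ssplit" annotator have to be specified in the annotators list before the "pos" annotator.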
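
To make the ordering rule concrete, here is a simplified plain-Java model of such a check. It is only a sketch and not CoreNLP's actual implementation: the class name, the REQUIRES table, and the check method are assumptions made for illustration, the table lists only the requirements named in this answer, and any annotator missing from the table is treated as having none.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RequirementCheckSketch {

    // Hypothetical requirement table; CoreNLP maintains its own definitions.
    static final Map<String, List<String>> REQUIRES = new HashMap<>();
    static {
        REQUIRES.put("tokenize", Collections.<String>emptyList());
        REQUIRES.put("ssplit", Arrays.asList("tokenize"));
        REQUIRES.put("pos", Arrays.asList("tokenize", "ssplit"));
    }

    /** Lists every requirement that no earlier annotator in the list satisfies. */
    static List<String> check(String annotators, boolean enforceRequirements) {
        List<String> problems = new ArrayList<>();
        if (!enforceRequirements) {
            return problems; // check disabled: nothing is reported
        }
        Set<String> seen = new HashSet<>();
        for (String raw : annotators.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            List<String> needed = REQUIRES.getOrDefault(name, Collections.<String>emptyList());
            for (String requirement : needed) {
                if (!seen.contains(requirement)) {
                    problems.add("annotator \"" + name
                        + "\" requires annotator \"" + requirement + "\"");
                }
            }
            seen.add(name); // later annotators may now depend on this one
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(check("pos, lemma, parse", true));
        System.out.println(check("tokenize, ssplit, pos", true));
        System.out.println(check("pos, lemma, parse", false));
    }
}
```

In this model the list from the question fails on "pos", the list that starts with "tokenize, ssplit" passes, and turning the check off reports nothing regardless of order.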

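Now back to the question... If you want to skip the "tokenize" and "ssplit" annotators in your pipeline, you need to disable the requirements check that is performed during pipeline initialization. I found two equivalent ways to do this: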

Disable requirement enforcement in the properties object passed to the StanfordCoreNLP constructor:

props.setProperty("enforceRequirements", "false");

Set the enforceRequirements parameter of the StanfordCoreNLP constructor to false:

StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
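
Putting the first variant together with the properties from the question might look like the sketch below. It uses only java.util.Properties so that it runs without CoreNLP on the classpath; the class name and the buildProps helper are hypothetical, and the pipeline construction is left as a comment. Note that with the check disabled, it is presumably up to the caller to make sure the document already carries the token and sentence annotations that "pos", "lemma" and "parse" read.

```java
import java.util.Properties;

public class PipelinePropsSketch {

    /** Builds the question's properties plus the enforceRequirements switch. */
    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "pos, lemma, parse");
        props.setProperty("pos.maxlen", "80");
        props.setProperty("parse.maxlen", "80");
        // Skip the start-up requirement check, because tokens and
        // sentences are supplied by hand instead of by "tokenize"/"ssplit".
        props.setProperty("enforceRequirements", "false");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }
        // With CoreNLP available, the next steps would be:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        //   pipeline.annotate(document);
    }
}
```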

【Comments】:

【Solution 2】:

You should add "tokenize" to the annotators list:

pipelineProps.put("annotators", "tokenize, pos, lemma, parse");

【Comments】: