Check for words in input text from a words collection

【Question title】: Check for words in input text from words collection
【Posted】: 2017-08-20 13:39:01
【Question】:

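I have a collection of about 10,000 noun phrases (NPs). I want to check each new piece of input text against this NP collection and extract the sentences that contain any of these NPs. I don't want to run a loop over every phrase, because that makes my code very slow. I am using Java and Stanford CoreNLP.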

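【Comments】:

Have you written any code for the current slow version? If you edit it into the question and show us what you have, someone may be able to improve it and help.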

Tags: stanford-nlp wordnet

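【Solution 1】:

A quick and easy way is to use RegexNER to tag every occurrence of anything in your dictionary, and then check each sentence for tokens whose NER tag is not "O".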

package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;
import java.util.*;
import java.util.stream.Collectors;

public class FindSentencesWithPhrase {

  public static boolean checkForNamedEntity(CoreMap sentence) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
      if (token.ner() != null && !token.ner().equals("O")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,regexner");
    props.setProperty("regexner.mapping", "phrases.rules");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    String exampleText = "This sentence contains the phrase \"ice cream\". " +
        "This sentence is not of interest.  This sentence contains pizza.";
    Annotation ann = new Annotation(exampleText);
    pipeline.annotate(ann);
    for (CoreMap sentence : ann.get(CoreAnnotations.SentencesAnnotation.class)) {
      if (checkForNamedEntity(sentence)) {
        System.out.println("---");
        System.out.println(sentence.get(CoreAnnotations.TokensAnnotation.class).
            stream().map(token -> token.word()).collect(Collectors.joining(" ")));
      }
    }
  }
}
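
Note that checkForNamedEntity above accepts any tag other than "O". That is fine for the pipeline shown, which only runs regexner, but if other annotators that assign NER tags were added, a stricter check on the dictionary label may be wanted. A minimal sketch of that check, written over plain tag strings so it runs without CoreNLP; the class and method names are hypothetical, and in the real pipeline the tags would come from token.ner() for each token:

```java
import java.util.Arrays;
import java.util.List;

public class LabelFilter {

  // Returns true if at least one token tag equals the wanted label exactly.
  // Calling equals on wantedLabel keeps the check safe when a tag is null.
  public static boolean hasLabel(List<String> nerTags, String wantedLabel) {
    for (String tag : nerTags) {
      if (wantedLabel.equals(tag)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Hypothetical tags for one sentence: one token matched the dictionary.
    List<String> tags = Arrays.asList("O", "O", "PHRASE_OF_INTEREST", "O");
    System.out.println(hasLabel(tags, "PHRASE_OF_INTEREST"));
  }
}
```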

The "phrases.rules" file should look like this (RegexNER expects the columns to be separated by tabs):

ice cream       PHRASE_OF_INTEREST      MISC    1
pizza   PHRASE_OF_INTEREST      MISC    1
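
With roughly 10,000 noun phrases, the rules file is probably easier to generate than to write by hand. The following is a rough sketch of one way to do that, not part of the original answer: the class name WritePhraseRules and the helper toRuleLines are hypothetical, and the label, the MISC column and the priority 1 are simply copied from the two example lines above. It also assumes the phrases contain no regular-expression metacharacters, since RegexNER treats each space-separated token in the first column as a pattern.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class WritePhraseRules {

  // Builds one tab-separated mapping line per phrase, using the same
  // columns as the example file: phrase, label, MISC, 1.
  // Blank entries are skipped and surrounding whitespace is trimmed.
  public static List<String> toRuleLines(List<String> phrases) {
    return phrases.stream()
        .map(String::trim)
        .filter(p -> !p.isEmpty())
        .map(p -> p + "\tPHRASE_OF_INTEREST\tMISC\t1")
        .collect(Collectors.toList());
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for the real collection of noun phrases.
    List<String> phrases = Arrays.asList("ice cream", "pizza");
    Path out = Paths.get("phrases.rules");
    Files.write(out, toRuleLines(phrases), StandardCharsets.UTF_8);
  }
}
```

Running main writes phrases.rules into the working directory, which is where the pipeline above looks for it given the relative path in "regexner.mapping".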

【Comments】:

You can set "regexner.ignorecase" to true or false, depending on whether you want matching to ignore case or to be case-sensitive.
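
As a small illustration of that comment, the property can be set alongside the other pipeline properties from the answer. This sketch only builds the java.util.Properties object, so it runs without CoreNLP on the classpath; the class name RegexNerProps and the build helper are hypothetical, and the resulting object would be passed to new StanfordCoreNLP(props) as in the answer.

```java
import java.util.Properties;

public class RegexNerProps {

  // Builds the properties used in the answer and adds the case flag.
  // ignoreCase = true asks RegexNER to match regardless of letter case.
  public static Properties build(boolean ignoreCase) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,regexner");
    props.setProperty("regexner.mapping", "phrases.rules");
    props.setProperty("regexner.ignorecase", Boolean.toString(ignoreCase));
    return props;
  }

  public static void main(String[] args) {
    Properties props = build(true);
    // Print the settings for inspection; the order is not guaranteed.
    props.forEach((key, value) -> System.out.println(key + " = " + value));
  }
}
```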