Extracting all nouns and adjectives from a text via the Stanford Parser

【Question title】: Extracting all nouns and adjectives from a text via the Stanford Parser
【Posted】: 2011-05-18 12:02:58
【Question】:

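I am trying to extract all nouns and adjectives from a given text via the Stanford Parser.

My current attempt is to use pattern matching on the getChildrenAsList() output of the Tree object to locate items such as:

(NN paper), (NN algorithm), (NN information), ...

and save them in an array.

Input sentence:

In this paper we present an algorithm that extracts semantic information from an arbitrary text.

Result string: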

[(S (PP (IN In) (NP (DT this) (NN paper))) (NP (PRP we)) (VP (VBP present) (NP (NP (DT an) (NN algorithm)) (SBAR (WHNP (WDT that)) (S (VP (VBD extracts) (NP (JJ semantic) (NN information)) (PP (IN from) (NP (DT an) (ADJP (JJ arbitrary)) (NN text)))))))) (. .))]

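I resorted to pattern matching because I could not find a method in the Stanford Parser that returns all words of a given word class (for example, all nouns).

Is there a better way to extract these word classes, or does the parser provide a specific method for this?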

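import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

// Class name is a placeholder; the original snippet omitted the class declaration.
public class ParserDemo {
    public static void main(String[] args) {
        String str = "In this paper we present an algorithm that extracts semantic information from an arbitrary text.";
        // Constructor as used in the 2011 parser release; newer releases load models via LexicalizedParser.loadModel(...).
        LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
        Tree parseS = (Tree) lp.apply(str);
        // Print the children of the root node in bracketed form.
        System.out.println("tr.getChildrenAsList().toString()" + parseS.getChildrenAsList().toString());
    }
}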

【Comments】:

Tags: java parsing stanford-nlp

【Answer 1】:

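By the way, if all you want is parts of speech such as nouns and verbs, you should just use a part-of-speech tagger, such as the Stanford POS Tagger. It will run a couple of orders of magnitude faster and be at least as accurate.

But you can do it with the parser. The method you want is taggedYield(), which returns a List<TaggedWord>. So you have: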

List<TaggedWord> taggedWords = ((Tree) lp.apply(str)).taggedYield();
for (TaggedWord tw : taggedWords) {
  if (tw.tag().startsWith("N") || tw.tag().startsWith("J")) {
    System.out.printf("%s/%s%n", tw.word(), tw.tag());
  }
}

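(This method cuts a corner: it relies on the fact that in the Penn Treebank tag set, all and only the adjective and noun tags start with J or N. More generally, you could check for membership in a set of tags.)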
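The set-membership check mentioned above could look roughly like the following sketch. It is not part of the original answer: the class name TagFilter and the helper wordsWithTags are made up for illustration, and the (word, tag) pairs are hard-coded from the parse shown in the question so that the example runs without the parser on the classpath. With the real parser you would iterate over the TaggedWord list from taggedYield() and test tw.tag() against the same sets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class TagFilter {
    // Penn Treebank noun and adjective tags, listed explicitly.
    static final Set<String> NOUN_TAGS =
            new HashSet<>(Arrays.asList("NN", "NNS", "NNP", "NNPS"));
    static final Set<String> ADJECTIVE_TAGS =
            new HashSet<>(Arrays.asList("JJ", "JJR", "JJS"));

    // Returns the words whose tag is in the wanted set.
    // Each entry of 'tagged' is a {word, tag} pair.
    static List<String> wordsWithTags(List<String[]> tagged, Set<String> wanted) {
        List<String> result = new ArrayList<>();
        for (String[] pair : tagged) {
            if (wanted.contains(pair[1])) {
                result.add(pair[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Word/tag pairs copied from the parse tree in the question.
        List<String[]> tagged = Arrays.asList(
                new String[] {"In", "IN"}, new String[] {"this", "DT"},
                new String[] {"paper", "NN"}, new String[] {"we", "PRP"},
                new String[] {"present", "VBP"}, new String[] {"an", "DT"},
                new String[] {"algorithm", "NN"}, new String[] {"that", "WDT"},
                new String[] {"extracts", "VBD"}, new String[] {"semantic", "JJ"},
                new String[] {"information", "NN"}, new String[] {"from", "IN"},
                new String[] {"an", "DT"}, new String[] {"arbitrary", "JJ"},
                new String[] {"text", "NN"}, new String[] {".", "."});

        System.out.println(wordsWithTags(tagged, NOUN_TAGS));      // [paper, algorithm, information, text]
        System.out.println(wordsWithTags(tagged, ADJECTIVE_TAGS)); // [semantic, arbitrary]
    }
}
```

Compared with the startsWith shortcut, explicit sets make it easy to narrow the selection (for example, proper nouns only) and do not depend on the prefix property of the tag set.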

P.S. The stanford-nlp tag is the best one to use for questions about Stanford NLP tools on Stack Overflow.

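【Comments】:

【Answer 2】:

I'm sure you are aware of nltk (the Natural Language Toolkit). Just install this Python library along with its maxent POS tagger, and the following code should do the trick. The tagger has been trained on the Penn Treebank, so the tags are the same. The code above is not Python, but I like nltk, hence this answer.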

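    import nltk

    nouns = []
    adj = []
    # Read the text into the variable "text"; the question's sentence is used here as a stand-in.
    text = "In this paper we present an algorithm that extracts semantic information from an arbitrary text."
    # Requires NLTK's tokenizer and tagger models (installed via nltk.download).
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    for word, tag in tagged:
        # Penn Treebank noun tags start with "N", adjective tags with "J".
        if tag.startswith("N"):
            nouns.append(word)
        elif tag.startswith("J"):
            adj.append(word)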

【Comments】: