Stanford NLP Parser: How to split the tree? Answers

【Question title】: Stanford NLP Parser. How to split the Tree?
【Posted】: 2014-06-25 02:26:43
【Question description】:

If I take the example sentence from the homepage:

The strongest rain ever recorded in India shut down 
the financial hub of Mumbai, snapped communication 
lines, closed airports and forced thousands of people 
to sleep in their offices or walk home during the night, 
officials said today.

and parse it with the Stanford parser:

LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

Tree parse = lexicalizedParser.parse(text);
TreePrint treePrint = new TreePrint("penn, typedDependencies");

treePrint.printTree(parse);

this produces the following tree:

(ROOT
(S
  (S
    (NP
      (NP (DT The) (JJS strongest) (NN rain))
      (VP
        (ADVP (RB ever))
        (VBN recorded)
        (PP (IN in)
          (NP (NNP India)))))
    (VP
      (VP (VBD shut)
        (PRT (RP down))
        (NP
          (NP (DT the) (JJ financial) (NN hub))
          (PP (IN of)
            (NP (NNP Mumbai)))))
      (, ,)
      (VP (VBD snapped)
        (NP (NN communication) (NNS lines)))
      (, ,)
      (VP (VBD closed)
        (NP (NNS airports)))
      (CC and)
      (VP (VBD forced)
        (NP
          (NP (NNS thousands))
          (PP (IN of)
            (NP (NNS people))))
        (S
          (VP (TO to)
            (VP
              (VP (VB sleep)
                (PP (IN in)
                  (NP (PRP$ their) (NNS offices))))
              (CC or)
              (VP (VB walk)
                (NP (NN home))
                (PP (IN during)
                  (NP (DT the) (NN night))))))))))
  (, ,)
  (NP (NNS officials))
  (VP (VBD said)
    (NP-TMP (NN today)))
  (. .)))

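I now want to split the tree according to its structure to obtain the clauses. In this example, I want to split the tree into the following parts:

The strongest rain ever recorded in India
The strongest rain shut down the financial hub of Mumbai
The strongest rain snapped communication lines
The strongest rain closed airports
The strongest rain forced thousands of people to sleep in their offices
The strongest rain forced thousands of people to walk home during the night

How can I do that?

The first answer suggested using a recursive algorithm that prints all root-to-leaf paths.

Here is the code I tried: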

public static void main(String[] args) throws IOException {
    LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

    Tree tree = lexicalizedParser.parse("In a ceremony that was conspicuously short on pomp and circumstance at a time of austerity, Felipe, 46, took over from his father, King Juan Carlos, 76.");

    printAllRootToLeafPaths(tree, new ArrayList<String>());
}

private static void printAllRootToLeafPaths(Tree tree, List<String> path) {
    if(tree != null) {
        if(tree.isLeaf()) {
            path.add(tree.nodeString());
        }

        if(tree.children().length == 0) {
            System.out.println(path);
        } else {
            for(Tree child : tree.children()) {
                printAllRootToLeafPaths(child, path);
            }
        }

        path.remove(tree.nodeString());
    }
}

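Of course this code makes no sense: it only adds leaves to the path, and a leaf has no children, so nothing is ever on the path when the recursion descends further. The problem is that all the actual words are leaves, so the algorithm just prints each leaf word on its own: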

[The]
[strongest]
[rain]
[ever]
[recorded]
[in]
[India]
[shut]
[down]
[the]
[financial]
[hub]
[of]
[Mumbai]
[,]
[snapped]
[communication]
[lines]
[,]
[closed]
[airports]
[and]
[forced]
[thousands]
[of]
[people]
[to]
[sleep]
[in]
[their]
[offices]
[or]
[walk]
[home]
[during]
[the]
[night]
[,]
[officials]
[said]
[today]
[.]
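
For comparison, here is a minimal sketch of the root-to-leaf traversal the attempt above seems to be aiming for. It pushes every node's label on entry, not only the leaves, and pops by index on exit, because removing by value (as `path.remove(tree.nodeString())` does) deletes the first matching entry, which is the wrong one whenever a label such as S or VP repeats along a path. `PathNode` and `RootToLeafPaths` are hypothetical stand-ins so that the example runs without the Stanford jar; with `edu.stanford.nlp.trees.Tree` the corresponding calls would presumably be `children()`, `isLeaf()` and `value()`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for edu.stanford.nlp.trees.Tree: a label plus ordered children.
final class PathNode {
    final String label;
    final List<PathNode> children = new ArrayList<>();

    PathNode(String label, PathNode... kids) {
        this.label = label;
        for (PathNode k : kids) children.add(k);
    }

    boolean isLeaf() { return children.isEmpty(); }
}

final class RootToLeafPaths {
    // Returns one path per leaf; each path lists every label from the root down to the word.
    static List<List<String>> collect(PathNode root) {
        List<List<String>> out = new ArrayList<>();
        if (root != null) walk(root, new ArrayList<>(), out);
        return out;
    }

    private static void walk(PathNode node, List<String> path, List<List<String>> out) {
        path.add(node.label);                   // push on entry, for every node
        if (node.isLeaf()) {
            out.add(new ArrayList<>(path));     // copy, because path keeps changing
        } else {
            for (PathNode child : node.children) walk(child, path, out);
        }
        path.remove(path.size() - 1);           // pop by index, not by value
    }

    public static void main(String[] args) {
        // (NP (DT The) (JJS strongest) (NN rain))
        PathNode np = new PathNode("NP",
            new PathNode("DT", new PathNode("The")),
            new PathNode("JJS", new PathNode("strongest")),
            new PathNode("NN", new PathNode("rain")));
        // Prints [NP, DT, The], [NP, JJS, strongest] and [NP, NN, rain], one per line.
        for (List<String> p : collect(np)) System.out.println(p);
    }
}
```

Every path ends in a single word and consists of labels above it, so root-to-leaf paths alone do not produce clauses.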

【Question comments】:

Tags: java stanford-nlp

【Solution 1】:

Take a look at "print all root to leaf paths in a binary tree" or "split a binary tree".

【Comments】:

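The Stanford edu.stanford.nlp.trees.Tree is not a binary tree, but of course you can still use this recursive algorithm to print all root-to-leaf paths of the tree. The problem is that the nodes are not words. The nodes are labels, and the words are always leaves. So every node on a path is a label, and only the last node of the path (the leaf) is a word.
So if you use this algorithm you end up with: The, The strong, The strong rain, ...
You can treat noun phrases and verb phrases as root nodes. You can also split a VP at CC or "," tokens.
You are using tree.children().length to decide whether to make another recursive call. My suggestion is to check for NP / VP and CC instead. Can you upload your sample code to gist.github.com so that I can set up an environment?
Is this what you mean? gist github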
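
The suggestion in the comments (treat NP and VP as the units, and split a VP at CC or "," tokens) can be sketched as follows. This is a heuristic under stated assumptions, not the method from the thread: it pairs the subject NP of each `S -> NP VP` node with every reading of the VP, where coordinated children count as alternatives and everything else is concatenated. `ClauseNode`, `ClauseSplitter`, `read`, `expand` and `clauses` are hypothetical names, and the small bracket reader only exists so that the example runs without the Stanford jar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for edu.stanford.nlp.trees.Tree: a label plus ordered children.
final class ClauseNode {
    final String label;
    final List<ClauseNode> kids = new ArrayList<>();
    ClauseNode(String label) { this.label = label; }
    boolean isLeaf() { return kids.isEmpty(); }
}

final class ClauseSplitter {
    private final String[] tokens;
    private int pos = 0;

    private ClauseSplitter(String penn) {
        // Pad brackets so that a whitespace split yields "(", ")", labels and words.
        tokens = penn.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    // Reads a well-formed Penn-style bracketing such as "(NP (DT The) (NN rain))".
    static ClauseNode read(String penn) { return new ClauseSplitter(penn).node(); }

    private ClauseNode node() {
        if (!tokens[pos].equals("(")) return new ClauseNode(tokens[pos++]); // a word
        pos++;                                           // consume "("
        ClauseNode n = new ClauseNode(tokens[pos++]);    // phrase or POS label
        while (!tokens[pos].equals(")")) n.kids.add(node());
        pos++;                                           // consume ")"
        return n;
    }

    // True when a CC is present and every other child repeats the parent label or is a comma.
    private static boolean isCoordination(ClauseNode n) {
        boolean hasCc = false;
        for (ClauseNode k : n.kids) {
            if (k.label.equals("CC")) hasCc = true;
            else if (!k.label.equals(n.label) && !k.label.equals(",")) return false;
        }
        return hasCc;
    }

    // All readings of a subtree: coordinated children are alternatives, others are concatenated.
    static List<String> expand(ClauseNode n) {
        List<String> out = new ArrayList<>();
        if (n.isLeaf()) { out.add(n.label); return out; }
        if (isCoordination(n)) {
            for (ClauseNode k : n.kids)
                if (k.label.equals(n.label)) out.addAll(expand(k));
            return out;
        }
        out.add("");
        for (ClauseNode k : n.kids) {
            List<String> next = new ArrayList<>();
            for (String prefix : out)
                for (String s : expand(k)) next.add(prefix.isEmpty() ? s : prefix + " " + s);
            out = next;
        }
        return out;
    }

    // Collects "subject + predicate" strings for every S node with an NP directly before a VP.
    static List<String> clauses(ClauseNode root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(ClauseNode n, List<String> out) {
        for (int i = 0; n.label.equals("S") && i + 1 < n.kids.size(); i++) {
            ClauseNode np = n.kids.get(i), vp = n.kids.get(i + 1);
            if (np.isLeaf() || !np.label.equals("NP") || !vp.label.equals("VP")) continue;
            // If the subject is "NP -> NP modifier", use the inner NP as the bare subject.
            ClauseNode head = np.kids.get(0).label.equals("NP") ? np.kids.get(0) : np;
            if (head != np) out.addAll(expand(np));      // the modified subject on its own
            for (String s : expand(head))
                for (String p : expand(vp)) out.add(s + " " + p);
        }
        for (ClauseNode k : n.kids) collect(k, out);
    }

    public static void main(String[] args) {
        ClauseNode tree = read("(S (NP (DT The) (NN rain)) (VP (VP (VBD closed) (NP (NNS airports)))"
            + " (CC and) (VP (VBD snapped) (NP (NNS lines)))))");
        // Prints "The rain closed airports" and "The rain snapped lines".
        clauses(tree).forEach(System.out::println);
    }
}
```

Because `collect` visits every S node, running it on the full tree from the question would also report the outer clause ("officials said today"). Coordinations that mix labels are left unsplit by design, so that no words are silently dropped.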