【Question title】: How to integrate two Parse Tree data structures from two different NLP Tools
【Posted】: 2015-02-14 18:39:33
【Question description】:

I am currently using both Stanford CoreNLP and Fudan NLP to process Chinese natural language. Both tools generate parse trees: the Stanford CoreNLP parse tree and the Fudan NLP parse tree (call them STree and FTree).

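I need to work with both STree and FTree and apply some decorating methods to them. The methods share the same signatures but differ in their implementation details, so the best practice seems to be defining one class that can be built from either an STree or an FTree.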

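However, the two kinds of parse tree have completely different data structures. I can think of two solutions:

1. Define a generic class Tree, parameterized by the content types of STree and FTree, plus a TreeFactory that takes either content type and builds the corresponding tree. With this approach, I cannot keep the two implementations of the same method separate.

2. Define an interface or abstract class Tree that declares the methods, then extend it with two subclasses, one for STree and one for FTree. With this approach, the children field in each subclass is not a subtype of super.children: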

class TreeNode {
    List<TreeNode> children;
    //...
}
class STree extends TreeNode {
    List<STree> children; // Problem: hides super.children; List<STree> is not a subtype of List<TreeNode>
    //...
}
class FTree extends TreeNode {
    List<FTree> children; // Problem: hides super.children; List<FTree> is not a subtype of List<TreeNode>
    //...
}
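
One way around the typing problem in option 2 is a self-bounded type parameter, so that each subclass sees its children as its own type. The following is only a minimal sketch under that assumption: the `label` and `word` fields, `describe()`, `size()`, and the demo class are hypothetical names chosen for illustration, and the real STree and FTree would wrap the library objects instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: T is the concrete node type, so children are typed as T.
abstract class TreeNode<T extends TreeNode<T>> {
    protected final List<T> children = new ArrayList<>();

    public List<T> getChildren() { return children; }

    public void addChild(T child) { children.add(child); }

    // Same signature for every tree type; each subclass supplies its own implementation.
    public abstract String describe();

    // Shared logic written once against the base type: number of nodes in this subtree.
    public int size() {
        int n = 1;
        for (T child : children) n += child.size();
        return n;
    }
}

// Illustrative subclass; the label field is an assumption, not the Stanford data model.
final class STree extends TreeNode<STree> {
    private final String label;
    STree(String label) { this.label = label; }
    @Override public String describe() { return "S:" + label; }
}

// Illustrative subclass; the word field is an assumption, not the Fudan data model.
final class FTree extends TreeNode<FTree> {
    private final String word;
    FTree(String word) { this.word = word; }
    @Override public String describe() { return "F:" + word; }
}

public class SelfTypedTreeDemo {
    public static void main(String[] args) {
        STree root = new STree("ROOT");
        root.addChild(new STree("NP"));
        // No cast needed: getChildren() on an STree returns List<STree>.
        List<STree> kids = root.getChildren();
        System.out.println(root.describe() + " has " + root.size()
                + " nodes; first child is " + kids.get(0).describe());
    }
}
```

The trade-off is that an STree and an FTree can no longer be mixed in one tree, which is probably what is wanted here, and that code handling both kinds at once has to be generic in `T`.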

I would like to know which option is better, or whether anyone can suggest a more adaptable solution.

Here is a brief definition of FTree:

// Declaration edu.fudan.nlp.parser.dep.DependencyTree;
public class DependencyTree implements Serializable {

    // tree node content
    public String word;
    public String pos;
    // sequence number in sentence
    public int id;
    private int size=1;
    // dependency relation type
    public String relation;

    public List<DependencyTree> leftChilds;
    public List<DependencyTree> rightChilds;
    private DependencyTree parent = null;

    // ...
};

The definition of STree:

// Definition edu.stanford.nlp.trees.LabeledScoredTreeNode
public class LabeledScoredTreeNode extends Tree {

    // Label of the parse tree.
    private Label label; // = null;
    // Score of <code>TreeNode</code>
    private double score = Double.NaN;

    // Daughters of the parse tree.
    private Tree[] daughterTrees; // = null;

    // ...
};

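【Comments】:

The Fudan tree here appears to be a dependency parse tree. Do you mean to mix Fudan dependency parses with Stanford NLP constituency parses? If you also plan to use the dependency parses from CoreNLP, you may want to use GrammaticalStructure.
@JonGauthier Yes, GrammaticalStructure is exactly what I meant by STree. What I really need is to take the dependency trees from both the Fudan and the Stanford tools and implement several methods that apply the same operators to the two kinds of tree, with different implementation details for each.
Another solution is to create your own metamodel. Call it MyMetaTree, and provide methods that convert STree and FTree into your own domain model.
@mike Isn't that the first option? I agree it sounds better, but I worry about hard-to-debug downstream bugs caused by errors or inconsistencies in the conversion.
@stanleyerror A metamodel solves the problem of the differing data structures because it has its own. Don't confuse it with the delegate/decorator pattern. You need a factory with MetaModel createFrom(STree tree) and MetaModel createFrom(FTree tree).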
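
The factory described in the last comment could look roughly like the sketch below. It is an illustration under stated assumptions, not either library's API: `FudanNode` is a stand-in that keeps only fields quoted in the question (`word`, `relation`, `leftChilds`, `rightChilds`), `StanfordNode` and its `dependents` list are entirely hypothetical, and the relation labels in the demo are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for FudanNLP's DependencyTree, reduced to fields quoted in the question.
class FudanNode {
    final String word;
    final String relation;
    final List<FudanNode> leftChilds = new ArrayList<>();
    final List<FudanNode> rightChilds = new ArrayList<>();

    FudanNode(String word, String relation) {
        this.word = word;
        this.relation = relation;
    }
}

// Hypothetical stand-in for a Stanford dependency node. This is NOT the CoreNLP API.
class StanfordNode {
    final String word;
    final String relation;
    final List<StanfordNode> dependents = new ArrayList<>();

    StanfordNode(String word, String relation) {
        this.word = word;
        this.relation = relation;
    }
}

// Tool-independent domain model with its own data structure.
final class MetaModel {
    final String word;
    final String relation;
    final List<MetaModel> children = new ArrayList<>();

    MetaModel(String word, String relation) {
        this.word = word;
        this.relation = relation;
    }

    // Example of an operation implemented once for trees from either tool.
    int size() {
        int n = 1;
        for (MetaModel child : children) n += child.size();
        return n;
    }
}

final class MetaModelFactory {
    private MetaModelFactory() {}

    // Copies a Fudan tree: left children first, then right children.
    static MetaModel createFrom(FudanNode node) {
        MetaModel out = new MetaModel(node.word, node.relation);
        for (FudanNode c : node.leftChilds) out.children.add(createFrom(c));
        for (FudanNode c : node.rightChilds) out.children.add(createFrom(c));
        return out;
    }

    // Copies a (stand-in) Stanford tree in the order of its dependents list.
    static MetaModel createFrom(StanfordNode node) {
        MetaModel out = new MetaModel(node.word, node.relation);
        for (StanfordNode c : node.dependents) out.children.add(createFrom(c));
        return out;
    }
}

public class MetaModelDemo {
    public static void main(String[] args) {
        FudanNode f = new FudanNode("like", "ROOT");      // illustrative labels
        f.leftChilds.add(new FudanNode("I", "SUBJ"));
        f.rightChilds.add(new FudanNode("apples", "OBJ"));
        MetaModel fromFudan = MetaModelFactory.createFrom(f);
        System.out.println(fromFudan.word + " subtree size: " + fromFudan.size());
    }
}
```

Because all conversion logic sits in the two `createFrom` overloads, the inconsistencies raised in the earlier comment would at least be confined to one place where they can be unit-tested.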

Tags: java design-patterns interface abstract-class stanford-nlp

【Solution 1】:

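I think the more fundamental question is how you will use the syntactic analysis results from two different parsers. Since the two trees may have completely different structures, I am not sure how you would use the two parsers' results together, even at the part-of-speech level.

Another possible option: if you annotate each word in the sentence, you can combine the outputs of the Stanford NLP tools and the FudanNLP tools.

Technically, it is always possible to follow Mike's suggestion and have a MetaModelNode interface with concrete implementation classes STreeMetaModelNode and FTreeMetaModelNode.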
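
As a rough sketch of that last point, the interface and its two implementations might be laid out as below. This is one possible reading (thin wrappers around the tool-specific nodes), not a definitive design: `FudanDep` keeps only fields quoted in the question, `StanfordDep` is a hypothetical stand-in rather than the CoreNLP API, and `word()`, `children()`, and `depth()` are assumed method names.

```java
import java.util.ArrayList;
import java.util.List;

// Common view over both tools' trees; shared operations are written against it.
interface MetaModelNode {
    String word();
    List<MetaModelNode> children();

    // Implemented once for every node type: height of the subtree rooted here.
    default int depth() {
        int deepest = 0;
        for (MetaModelNode child : children()) deepest = Math.max(deepest, child.depth());
        return 1 + deepest;
    }
}

// Stand-in for FudanNLP's DependencyTree, reduced to fields quoted in the question.
class FudanDep {
    final String word;
    final List<FudanDep> leftChilds = new ArrayList<>();
    final List<FudanDep> rightChilds = new ArrayList<>();
    FudanDep(String word) { this.word = word; }
}

// Hypothetical stand-in for a Stanford dependency node. This is NOT the CoreNLP API.
class StanfordDep {
    final String word;
    final List<StanfordDep> dependents = new ArrayList<>();
    StanfordDep(String word) { this.word = word; }
}

final class FTreeMetaModelNode implements MetaModelNode {
    private final FudanDep node;
    FTreeMetaModelNode(FudanDep node) { this.node = node; }

    @Override public String word() { return node.word; }

    // Wraps left children first, then right children.
    @Override public List<MetaModelNode> children() {
        List<MetaModelNode> out = new ArrayList<>();
        for (FudanDep c : node.leftChilds) out.add(new FTreeMetaModelNode(c));
        for (FudanDep c : node.rightChilds) out.add(new FTreeMetaModelNode(c));
        return out;
    }
}

final class STreeMetaModelNode implements MetaModelNode {
    private final StanfordDep node;
    STreeMetaModelNode(StanfordDep node) { this.node = node; }

    @Override public String word() { return node.word; }

    @Override public List<MetaModelNode> children() {
        List<MetaModelNode> out = new ArrayList<>();
        for (StanfordDep c : node.dependents) out.add(new STreeMetaModelNode(c));
        return out;
    }
}

public class MetaModelNodeDemo {
    public static void main(String[] args) {
        FudanDep f = new FudanDep("like");
        f.rightChilds.add(new FudanDep("apples"));
        StanfordDep s = new StanfordDep("like");

        // Both kinds of tree can be handled through the same interface.
        List<MetaModelNode> roots = new ArrayList<>();
        roots.add(new FTreeMetaModelNode(f));
        roots.add(new STreeMetaModelNode(s));
        for (MetaModelNode r : roots) System.out.println(r.word() + " depth=" + r.depth());
    }
}
```

Returning `List<MetaModelNode>` from `children()` sidesteps the field-typing problem in the question, since callers only ever see the interface type. Operations that need tool-specific behavior can be overridden in the individual implementation classes.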
