Navigating Stanford CoreNLP parsing results: answers

【Question title】: Navigating Stanford CoreNLP parsing results
【Posted】: 2017-06-21 02:07:16
【Question】:

The Stanford CoreNLP parser produces the following output for the sentence:

"He didn't get a reply" 

(ROOT
  (S
    (NP (PRP He))
    (VP (VBD did) (RB n't)
      (VP (VB get)
        (NP (DT a) (NN reply))))
    (. .)))

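I need a way to navigate it, i.e. to easily extract tags and find children and parents. Currently I am doing this manually (counting parentheses). I was wondering whether there is a Python library that can count the parentheses for me or, even better, something like Beautiful Soup or Scrapy that would let me work with objects.

If there is no such tool, what is the best way to traverse a sentence and get all the tags? I am guessing I need to create some kind of tag object that holds a list of child tag objects.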

【Comments】:

Tags: python parsing stanford-nlp

【Answer 1】:

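I solved this successfully using this Python script.

The script converts Stanford CoreNLP's Lisp-like parse tree format into a nested Python list structure.

You can also use something like Anytree to convert the nested list into a Python data structure that is easier to navigate and that can print the tree as text or as an image.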
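
The script linked in this answer is not reproduced here, so the following is only a minimal stdlib sketch of the same idea, not the author's code. It assumes the bracketed format shown in the question; `Node`, `parse_tree`, and `find_all` are hypothetical names introduced for illustration. Each node keeps a `parent` reference and a `children` list, which is roughly what a library such as Anytree would provide.

```python
import re


class Node:
    """One constituent or word in the parse tree (hypothetical helper)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def leaves(self):
        """Return the words under this node, left to right."""
        if not self.children:
            return [self.label]
        return [word for child in self.children for word in child.leaves()]


def parse_tree(text):
    """Build Node objects from a bracketed parse string."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    root = None
    current = None
    expect_label = False
    for tok in tokens:
        if tok == "(":
            expect_label = True          # the next token names a constituent
        elif tok == ")":
            current = current.parent     # close the current constituent
        elif expect_label:
            current = Node(tok, current)
            if root is None:
                root = current
            expect_label = False
        else:
            Node(tok, current)           # a word: attach it as a leaf
    return root


def find_all(node, label):
    """Collect every node with the given label, in pre-order."""
    found = [node] if node.label == label else []
    for child in node.children:
        found.extend(find_all(child, label))
    return found


parse = ("(ROOT (S (NP (PRP He)) (VP (VBD did) (RB n't) "
         "(VP (VB get) (NP (DT a) (NN reply)))) (. .)))")
tree = parse_tree(parse)
noun_phrases = find_all(tree, "NP")
print([" ".join(np.leaves()) for np in noun_phrases])  # ['He', 'a reply']
print(noun_phrases[1].parent.label)                    # VP
```

The sketch does no error handling, so unbalanced parentheses will raise an exception rather than produce a partial tree.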

【Comments】:

【Answer 2】:

This looks like Lisp. Writing a Lisp program to traverse it and extract what you want seems easy.

You can also convert it to a list in Python and process it there:

from pyparsing import OneOrMore, nestedExpr

# The parse tree on a single line; the quote in n't is escaped for the Python string literal
nlpdata = '(ROOT (S (NP (PRP He)) (VP (VBD did) (RB n\'t) (VP (VB get) (NP (DT a) (NN reply)))) (. .)))'
# nestedExpr() matches balanced parentheses and returns nested lists
data = OneOrMore(nestedExpr()).parseString(nlpdata)
print(data)
# [['ROOT', ['S', ['NP', ['PRP', 'He']], ['VP', ['VBD', 'did'], ['RB', "n't"], ['VP', ['VB', 'get'], ['NP', ['DT', 'a'], ['NN', 'reply']]]], ['.', '.']]]]

Note that I had to escape the quote in "n't", because the string literal is single-quoted.
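
Once the tree is a nested list, getting all the tags is a short recursive walk. The sketch below starts from the list printed above, written out as a plain Python literal (with pyparsing you would first convert the result with `asList()`); `tagged_words` and `labels` are hypothetical helper names, and the code assumes every node has the shape `[label, child, ...]` with words stored as plain strings.

```python
# The nested list from the pyparsing output above, as a plain literal.
tree = ['ROOT', ['S', ['NP', ['PRP', 'He']],
                 ['VP', ['VBD', 'did'], ['RB', "n't"],
                  ['VP', ['VB', 'get'],
                   ['NP', ['DT', 'a'], ['NN', 'reply']]]],
                 ['.', '.']]]


def tagged_words(node):
    """Yield (POS tag, word) pairs from a [label, child, ...] list."""
    label, children = node[0], node[1:]
    for child in children:
        if isinstance(child, str):
            yield (label, child)          # a word directly under its tag
        else:
            yield from tagged_words(child)


def labels(node):
    """Yield every constituent and POS label in pre-order."""
    yield node[0]
    for child in node[1:]:
        if not isinstance(child, str):
            yield from labels(child)


print(list(tagged_words(tree)))
# [('PRP', 'He'), ('VBD', 'did'), ('RB', "n't"), ('VB', 'get'),
#  ('DT', 'a'), ('NN', 'reply'), ('.', '.')]
print(list(labels(tree)))
# ['ROOT', 'S', 'NP', 'PRP', 'VP', 'VBD', 'RB', 'VP', 'VB', 'NP', 'DT', 'NN', '.']
```

Nested lists carry no parent pointers, so finding a node's parent means either passing the parent down during the walk or converting the lists into objects first.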

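【Comments】:

【Answer 3】:

My approach to navigating the output is not to parse the string but to build an object and deserialize into it. You then have objects you can use natively.

The output shown in the question is generated with a pipeline option called "prettyPrint". I changed it to "jsonPrint" to get JSON output. I was then able to take that output and generate classes from it (Visual Studio can generate classes from JSON with its Paste Special option, or there are online resources such as http://json2csharp.com/). The generated classes look like this: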

    public class BasicDependency
    {
        public string dep { get; set; }
        public int governor { get; set; }
        public string governorGloss { get; set; }
        public int dependent { get; set; }
        public string dependentGloss { get; set; }
    }

    public class EnhancedDependency
    {
        public string dep { get; set; }
        public int governor { get; set; }
        public string governorGloss { get; set; }
        public int dependent { get; set; }
        public string dependentGloss { get; set; }
    }

    public class EnhancedPlusPlusDependency
    {
        public string dep { get; set; }
        public int governor { get; set; }
        public string governorGloss { get; set; }
        public int dependent { get; set; }
        public string dependentGloss { get; set; }
    }

    public class Token
    {
        public int index { get; set; }
        public string word { get; set; }
        public string originalText { get; set; }
        public string lemma { get; set; }
        public int characterOffsetBegin { get; set; }
        public int characterOffsetEnd { get; set; }
        public string pos { get; set; }
        public string ner { get; set; }
        public string speaker { get; set; }
        public string before { get; set; }
        public string after { get; set; }
        public string normalizedNER { get; set; }
    }

    public class Sentence
    {
        public int index { get; set; }
        public string parse { get; set; }
        public List<BasicDependency> basicDependencies { get; set; }
        public List<EnhancedDependency> enhancedDependencies { get; set; }
        public List<EnhancedPlusPlusDependency> enhancedPlusPlusDependencies { get; set; }
        public List<Token> tokens { get; set; }
    }

    public class RootObject
    {
        public List<Sentence> sentences { get; set; }
    }

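*Note: unfortunately, this technique does not work well for coref annotations. The JSON does not convert to classes properly. I am working on that now. This model was built from the output produced with the annotators "tokenize, ssplit, pos, lemma, ner, parse".

My code, which differs only slightly from the sample code, looks like this (note "pipeline.jsonPrint"):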

        public static string LanguageAnalysis(string sourceText)
        {
            string json = "";
            // Path to the folder with models extracted from stanford-corenlp-3.7.0-models.jar
            var jarRoot = @"..\..\..\..\packages\Stanford.NLP.CoreNLP.3.7.0.1\";

            // Annotation pipeline configuration
            var props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
            props.setProperty("ner.useSUTime", "0");

            // We should change current directory, so StanfordCoreNLP could find all the model files automatically
            var curDir = Environment.CurrentDirectory;
            Directory.SetCurrentDirectory(jarRoot);
            var pipeline = new StanfordCoreNLP(props);
            Directory.SetCurrentDirectory(curDir);

            // Annotation
            var annotation = new Annotation(sourceText);
            pipeline.annotate(annotation);

            // Result - JSON Print
            using (var stream = new ByteArrayOutputStream())
            {
                pipeline.jsonPrint(annotation, new PrintWriter(stream));
                json = stream.toString();
                stream.close();
            }

            return json;
        }

Deserialization seems to work well with code like this:

using Newtonsoft.Json;
string sourceText = "My text document to parse.";
string json = Analysis.LanguageAnalysis(sourceText);
RootObject document = JsonConvert.DeserializeObject<RootObject>(json);

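【Comments】:

Now that I am working with the resulting objects, I realize that my answer does not actually answer the question! The parser output is still delivered in the same format, as a single string, in the field called "parse". I am now joining @user1700890 in looking for a way to parse it!
Looking again, I found this question, which seems to be the same and has an answer that uses PHP: PHP and NLP: Nested parenthesis (parser output) to array?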
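
Since the tree still arrives as one string in the "parse" field, a Python version of that last step might look like the sketch below. It assumes the JSON has the `sentences` and `parse` fields seen in the generated classes; the sample document is hand-built for illustration and is not real CoreNLP output, and `child_parent_pairs` is a hypothetical helper. A stack of open constituents yields each label with its parent, without building a full tree.

```python
import json
import re

# Hand-built stand-in for the JSON output (assumption: same field names
# as the generated Sentence class, all other fields omitted).
sample_json = json.dumps({
    "sentences": [{
        "index": 0,
        "parse": "(ROOT\n  (S\n    (NP (PRP He))\n    (VP (VBD did) (RB n't)\n"
                 "      (VP (VB get)\n        (NP (DT a) (NN reply))))\n    (. .)))",
    }]
})


def child_parent_pairs(parse):
    """Yield (label, parent_label) for every constituent in the string."""
    stack = []  # labels of the constituents that are currently open
    tokens = re.findall(r"\(|\)|[^\s()]+", parse)
    for prev, tok in zip([None] + tokens, tokens):
        if tok == "(":
            continue
        if tok == ")":
            stack.pop()
        elif prev == "(":
            # A token right after "(" is a label; its parent is the stack top.
            yield tok, (stack[-1] if stack else None)
            stack.append(tok)
        # Any other token is a word and is skipped here.


document = json.loads(sample_json)
for sentence in document["sentences"]:
    pairs = list(child_parent_pairs(sentence["parse"]))
    print(pairs[:4])
    # [('ROOT', None), ('S', 'ROOT'), ('NP', 'S'), ('PRP', 'NP')]
```

The same loop could collect the skipped words as well if the leaves are needed.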