Get the K best parses of a sentence with Stanford Parser

【Question title】: Get the K best parses of a sentence with Stanford Parser
【Posted】: 2012-12-23 20:54:07
【Question】:

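I want to get the K best parses of a sentence. I think this can be done with the ExhaustivePCFGParser class, but I don't know how to use it. More precisely, can I instantiate this class myself? The constructor is ExhaustivePCFGParser(BinaryGrammar bg, UnaryGrammar ug, Lexicon lex, Options op, Index stateIndex, Index wordIndex, Index tagIndex), and I don't know how to supply all of these arguments.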

Is there an easier way to get the K best parses?

【Comments】:

Tags: java parsing stanford-nlp

【Answer 1】:

In general, you work through a LexicalizedParser object, which is a "grammar" that supplies all of these things (the grammars, lexicon, indices, etc.).

From the command line, the following will work:

java -mx500m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -printPCFGkBest 20 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz data/testsent.txt

At the API level, you need to get a LexicalizedParserQuery object. Once you have a LexicalizedParser lp (as in ParserDemo.java), you can do the following:

LexicalizedParser lp = ... // Load / train a model
LexicalizedParserQuery lpq = lp.parserQuery();
lpq.parse(sentence);
List<ScoredObject<Tree>> kBest = lpq.getKBestPCFGParses(20);

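A LexicalizedParserQuery is roughly the equivalent of a Java regex Matcher: the LexicalizedParser is the reusable model, and each query object holds the state for parsing one sentence.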

Note: at present, kBest parsing only works for PCFG (not factored) grammars.

【Comments】:

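Thanks Chris, it works :). I just want to point out that the sentence in "lpq.parse(sentence);" must be a tokenized string.
Agreed, you need to get a list of words first, either with a DocumentPreprocessor or a Tokenizer (as in ParserDemo.java) or with your own code.
@Amine Did you get it to work? I am trying to get the k best parse trees of a sentence through the API, but I get a NullPointerException at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34), on the line: if ((!newChild.isLeaf()) && newChild.label().value().indexOf('@') >= 0)
Just tested again with v3.2.0. Works for me. If you have a reproducible bug, please send it.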

【Answer 2】:

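If you want to use Python, here is a workaround I implemented based on Christopher Manning's answer. The Python wrapper for CoreNLP does not implement "K-best parse trees", so the alternative is to run the terminal command: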

java -mx500m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -printPCFGkBest 20 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz data/testsent.txt
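The command above hard-codes k and the input file. As a minimal sketch, the same arguments can be assembled in Python so that k can vary per call. The helper name `build_kbest_command` and its parameters are hypothetical; the class name, flags, and model path are copied from the command above.

```python
import shlex

# Values copied from the command shown above.
PARSER_CLASS = "edu.stanford.nlp.parser.lexparser.LexicalizedParser"
DEFAULT_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"


def build_kbest_command(k, input_path, model=DEFAULT_MODEL, heap="500m"):
    """Build the argument list for a k-best PCFG parse (hypothetical helper)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [
        "java", "-mx" + heap,
        "-cp", "*",  # passed literally; java expands the classpath wildcard itself
        PARSER_CLASS,
        "-printPCFGkBest", str(k),
        model,
        input_path,
    ]


cmd = build_kbest_command(5, "data/testsent.txt")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

A list like this can be handed to `subprocess.check_output(cmd, cwd=...)` without `shell=True`, which avoids shell quoting issues when the path contains spaces or quotes.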

Note that you need to download Stanford CoreNLP with all of its JAR files into a single directory, and install the required Python libraries (see the import statements).

import re
import subprocess
from nltk.tree import ParentedTree

ip_sent = "a quick brown fox jumps over the lazy dog."

corenlp_dir = "<Your path>/stanford-corenlp-full-2018-10-05" # Directory where the CoreNLP JAR files are stored
data_path = corenlp_dir + "/data/testsent.txt" # Input file that is fed into the LexicalizedParser
with open(data_path, "w") as file:
    file.write(ip_sent) # Write the sentence to the input file

# Run the command from the CoreNLP directory (so that -cp "*" and the relative paths resolve) and capture the output as bytes
terminal_op = subprocess.check_output('java -mx500m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -printPCFGkBest 5 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz data/testsent.txt', shell=True, cwd=corenlp_dir)
op_string = terminal_op.decode('utf-8') # Convert bytes to a string
# Split on the "# Parse N with score -X.Y" header lines; drop empty chunks such as the one before the first header
parse_set = [p for p in re.split(r"# Parse \d+ with score -?\d+\.\d+\n", op_string) if p.strip()]
print(parse_set)

# Print the parse trees in a pretty_print format
for i in parse_set:
    parsetree = ParentedTree.fromstring(i)
    print(type(parsetree))
    parsetree.pretty_print()
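Splitting on the header lines discards the rank and score of each parse. A minimal sketch that keeps them is shown here; the function name `split_kbest` and the sample text are hypothetical, and the header format is assumed to match the pattern used in the split above.

```python
import re

# Header line that precedes each tree, following the pattern used in the answer.
HEADER = re.compile(r"^# Parse (\d+) with score (-?\d+(?:\.\d+)?)[ \t]*$", re.MULTILINE)


def split_kbest(output):
    """Return (rank, score, tree_text) tuples from k-best parser output."""
    headers = list(HEADER.finditer(output))
    results = []
    for i, match in enumerate(headers):
        # The tree text runs from the end of this header to the start of the next.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(output)
        tree_text = output[match.end():end].strip()
        if tree_text:
            results.append((int(match.group(1)), float(match.group(2)), tree_text))
    return results


# Hypothetical sample, made up for illustration; real parser output will differ.
sample = (
    "# Parse 1 with score -52.31\n"
    "(ROOT (NP (DT a) (NN dog)))\n"
    "\n"
    "# Parse 2 with score -54.07\n"
    "(ROOT (FRAG (NP (DT a) (NN dog))))\n"
)

for rank, score, tree in split_kbest(sample):
    print(rank, score, tree)
```

Each `tree_text` can then be passed to `ParentedTree.fromstring` as in the loop above, with the score available for ranking or filtering.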

Hope this helps.

【Comments】: