Segmenting a sentence into subsentences with CoreNLP (answers)

【Question title】: Segmenting sentence into subsentences with CoreNLP
【Posted】: 2018-11-05 13:07:35
【Question】:

I am working on the following problem: I would like to use Stanford CoreNLP to split a sentence into subsentences. An example sentence could be:

"Richard is working with CoreNLP, but does not really understand what he is doing"

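I would now like to split my sentence at its individual "S" nodes, as highlighted in the parse tree figure.

The output should be a list with one entry per "S", like this: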

['Richard is working with CoreNLP', ', but', 'does not really understand what', 'he is doing']

I would be very grateful for any help :)

【Comments】:

Tags: nlp stanford-nlp dependency-parsing natural-language-processing pycorenlp

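【Solution 1】:

I suspect the tool you are looking for is Tregex, which is described in more detail in the PowerPoint slides here or in the Javadoc of the class itself.

In your case, I believe the pattern you are looking for is simply S. So, something like: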

tregex.sh "S" <path_to_file>

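where the file contains trees in Penn Treebank format, i.e., something like (ROOT (S (NP (NNS dogs)) (VP (VB chase) (NP (NNS cats))))).

As an aside: I believe the fragment ", but" is not actually a sentence, contrary to what you highlighted in the figure. Rather, the node you highlighted subsumes the whole sentence "Richard is working with CoreNLP, but does not really understand what he is doing". Tregex would then print this whole sentence as one of the matches. Similarly, "does not really understand what" is not a sentence unless it subsumes the whole SBAR: "does not understand what he is doing".

If you only want the "leaf" sentences (i.e., sentences that are not subsumed by another sentence), you can try a pattern like this: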

S !>> S

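Note: I have not tested these patterns, so use them at your own risk! In particular, Tregex reads A >> B as "A is dominated by B", so S !>> S selects the S nodes that have no S ancestor (the outermost clauses). If the innermost clauses are wanted instead (S nodes that contain no other S), the corresponding pattern would be S !<< S.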
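To see what the three patterns would select without installing Tregex, here is a minimal pure-Python sketch. It is not Tregex and does not call CoreNLP: the helper names (parse_tree, leaves, subtrees, contains_s, clauses) are hypothetical, and TREE is a hand-written Penn Treebank style parse of the example sentence, assumed for illustration rather than taken from real parser output.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]


def subtrees(node, has_s_ancestor=False):
    """Yield (node, has_s_ancestor) for every non-leaf node, top-down."""
    if isinstance(node, str):
        return
    yield node, has_s_ancestor
    below = has_s_ancestor or node[0] == "S"
    for child in node[1]:
        yield from subtrees(child, below)


def contains_s(node):
    """True if some proper descendant of node is labelled S."""
    return any(n[0] == "S" for child in node[1] for n, _ in subtrees(child))


def clauses(tree):
    """Emulate the patterns S, S !>> S and S !<< S on a parsed tree."""
    all_s = [(n, anc) for n, anc in subtrees(tree) if n[0] == "S"]
    return {
        "S": [" ".join(leaves(n)) for n, _ in all_s],
        "S !>> S": [" ".join(leaves(n)) for n, anc in all_s if not anc],
        "S !<< S": [" ".join(leaves(n)) for n, _ in all_s if not contains_s(n)],
    }


# Hand-written parse of the example sentence (an assumption, not CoreNLP output).
TREE = (
    "(ROOT (S (NP (NNP Richard))"
    " (VP (VP (VBZ is) (VP (VBG working) (PP (IN with) (NP (NNP CoreNLP)))))"
    " (, ,) (CC but)"
    " (VP (VBZ does) (RB not) (ADVP (RB really))"
    " (VP (VB understand) (SBAR (WHNP (WP what))"
    " (S (NP (PRP he)) (VP (VBZ is) (VP (VBG doing))))))))))"
)

for pattern, spans in clauses(parse_tree(TREE)).items():
    print(pattern, spans)
```

With this particular tree only two S nodes exist: the whole sentence and "he is doing". That is consistent with the aside above that ", but" and "does not really understand what" are not S constituents on their own.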

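【Comments】:

Thanks for your reply. We are working in Python. Do you know how we could integrate your solution there? That would be great!

【Solution 2】:

OK, I found that it can be done as follows: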

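import requests

# Tregex endpoint of a locally running CoreNLP server (port 9000)
url = "http://localhost:9000/tregex"
# Tregex pattern: match every S (clause) node
request_params = {"pattern": "S"}
text = "Pusheen and Smitha walked along the beach."
# The raw text goes in the POST body, the pattern in the query string
r = requests.post(url, data=text, params=request_params)
print(r.json())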
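The reply is JSON containing the matched subtrees as bracketed strings. As a rough sketch of how those could be turned back into plain text, the snippet below assumes a reply shaped like {"sentences": [{"0": {"match": "(S ...)"}}]}. That layout is an assumption about the /tregex endpoint and is not confirmed anywhere in this thread, so check it against real output. The helper name match_texts and the sample reply are hypothetical.

```python
import re

# Matches "(TAG word)" pre-terminal nodes in a bracketed tree string.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def match_texts(response):
    """Return the surface text of every match in an assumed /tregex reply."""
    texts = []
    for sentence in response.get("sentences", []):
        # Assumption: keys are match indices stored as strings ("0", "1", ...).
        for key in sorted(sentence, key=int):
            tree = sentence[key]["match"]
            texts.append(" ".join(word for _tag, word in LEAF.findall(tree)))
    return texts


# Hypothetical reply, written by hand to illustrate the assumed layout.
sample = {
    "sentences": [
        {
            "0": {
                "match": "(S (NP (NNP Pusheen) (CC and) (NNP Smitha)) "
                         "(VP (VBD walked) (PP (IN along) (NP (DT the) (NN beach)))) "
                         "(. .))\n",
                "namedNodes": [],
            }
        }
    ]
}

print(match_texts(sample))
```

In the request code above, this would be called as match_texts(r.json()). Tokens come back space-separated, so punctuation is detached from the preceding word.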

Does anyone know how to use this with other languages (I need German)?

【Comments】: