Stanford CoreNLP NER outputs: answers

【Question title】: Stanford Core NLP NER outputs
【Posted】: 2020-07-26 02:13:39
【Question description】:

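I have been using grep and awk to extract named entities from the Stanford CRF-NER 'inline XML' output for English text, and I would like to use the same larger workflow for other human languages.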

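I have been experimenting with French (Spanish seems to throw a Java error, but that is another story). Using

java -cp stanford-corenlp-4.0.0/stanford-corenlp-4.0.0.jar:stanford-corenlp-4.0.0-models-french.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -properties StanfordCoreNLP-french.properties -file french.txt -outputFormat text

I get the standard text output, in which every annotation type is broken down sentence by sentence, including multi-word entities that are correctly grouped together, as shown below: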

Extracted the following NER entity mentions:
Puget Sound LOC I-LOC:0.9822963367809222
lac Washington  LOC I-LOC:0.9908561818309122
Canada  LOC I-LOC:0.9804363858330243
États-Unis  LOC I-LOC:0.9973224740712531

I know this can be parsed, but it seems like a lot of wasted processing when all I really want is the list of entities for the whole file.
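If parsing the text output turns out to be acceptable, the mention blocks can be pulled out with a few lines of Python. This is only a minimal sketch: the function name `extract_mentions` is hypothetical, and it assumes every mention line ends with two whitespace-free fields (the entity type and a tag:probability pair), as in the French sample above. Other models may print additional fields, in which case the split would need adjusting.

```python
HEADER = "Extracted the following NER entity mentions:"

def extract_mentions(output_text):
    """Return (entity text, entity type) pairs from CoreNLP text output.

    Assumption: each mention line looks like
    '<entity text> <TYPE> <tag>:<probability>', as in the sample above.
    """
    mentions = []
    in_block = False
    for line in output_text.splitlines():
        if line.startswith(HEADER):
            in_block = True
            continue
        if not in_block:
            continue
        # Split off the last two fields; the remainder is the entity text,
        # which may itself contain spaces (multi-word entities).
        parts = line.rsplit(None, 2)
        if len(parts) == 3 and ":" in parts[2]:
            mentions.append((parts[0].strip(), parts[1]))
        else:
            # A blank or differently shaped line ends the mention block.
            in_block = False
    return mentions
```

Calling `extract_mentions(open("french.txt.out", encoding="utf-8").read())` would give one list for the whole file; the output file name here is an assumption, so substitute whatever file the pipeline actually wrote.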

I was also able to use

java -cp stanford-corenlp-4.0.0/stanford-corenlp-4.0.0.jar:stanford-corenlp-4.0.0-models-french.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -properties StanfordCoreNLP-french.properties -file french.txt -output.columns word,ner -outputFormat conll

to get columns of words and NER types:

Puget   I-LOC
Sound   I-LOC
et  O
le  O
lac I-LOC
Washington  I-LOC
,   O
à   O
environ O
155 O
km  O
à   O
le  O
sud O
de  O
la  O
frontière   O
entre   O
le  O
Canada  I-LOC
et  O
les O
États-Unis  I-LOC
.   O

Besides being a bit messy, this also splits multi-word entities apart, so they cannot easily be stitched back together at scale.
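One possible workaround is to merge consecutive tokens that carry the same non-O tag. The sketch below is not what CoreNLP does internally; `group_entities` is a hypothetical helper, and it assumes two whitespace-separated columns (word, tag) per line with blank lines between sentences. Two distinct entities of the same type that sit directly next to each other would be merged into one, which is a known limitation of this simple approach.

```python
def group_entities(lines):
    """Merge runs of consecutive tokens that share the same non-O tag.

    Assumption: each non-blank line is '<word> <tag>'; anything else
    (for example a blank line between sentences) ends the current run.
    """
    entities = []
    current_words, current_tag = [], None
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            word, tag = parts
        else:
            word, tag = None, "O"  # treat as a boundary
        # Close the open entity when the tag changes.
        if current_words and tag != current_tag:
            entities.append((" ".join(current_words), current_tag))
            current_words = []
        if tag == "O":
            current_tag = None
        else:
            current_tag = tag
            current_words.append(word)
    # Flush an entity that runs to the end of the input.
    if current_words:
        entities.append((" ".join(current_words), current_tag))
    return entities
```

Writing each pair as `"\t".join(pair)` would give a TSV listing with one entity per row, for example `Puget Sound` followed by `I-LOC`.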

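I would prefer to get inline XML (e.g. <LOCATION>Puget</LOCATION><LOCATION>Sound</LOCATION>), since I have already developed a workflow that uses it. If that is not possible, is there at least a way to get TSV output (as in earlier versions) that keeps multi-word entities grouped together, as the text output does?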
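Until a built-in option is available, an inline-XML-like string can be approximated from (word, tag) pairs read from the CoNLL output. This is a rough sketch under several assumptions: `to_inline_xml` is a hypothetical name, the element names are simply the model's own labels with any `I-` style prefix removed (so `LOC` rather than `LOCATION`), tokens are rejoined with single spaces, and each run of same-tag tokens is wrapped once instead of wrapping every token separately.

```python
from itertools import groupby
from xml.sax.saxutils import escape

def to_inline_xml(pairs):
    """Render (word, tag) pairs as text with entity runs wrapped in tags.

    Assumption: 'O' marks non-entity tokens; other tags may carry a
    prefix such as 'I-', which is dropped to form the element name.
    """
    pieces = []
    # groupby yields one group per run of consecutive equal tags.
    for tag, group in groupby(pairs, key=lambda pair: pair[1]):
        text = escape(" ".join(word for word, _ in group))
        if tag == "O":
            pieces.append(text)
        else:
            label = tag.split("-", 1)[-1]
            pieces.append(f"<{label}>{text}</{label}>")
    return " ".join(pieces)
```

Because tokens are rejoined with spaces, punctuation stays detached from the preceding word; that is usually harmless when the goal is only to grep out the tagged spans.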

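I have looked into the entity mentions annotator, but I could not figure it out, and if it requires training I would rather not use it. The grouping in the default text output is sufficient for my needs.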

【Question discussion】:

Tags: stanford-nlp named-entity-recognition

【Solution 1】:

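I have added inlineXML as an outputFormat option in the latest code on GitHub. This change is not available in the just-released 4.1.0 version. The GitHub site has instructions on how to build the code into a jar.

GitHub site: https://github.com/stanfordnlp/CoreNLP

【Discussion】:

Perfect! Thank you very much!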