Which settings should be used for TokensregexNER

【Question title】: Which settings should be used for TokensregexNER
【Posted】: 2017-02-26 16:09:13
【Question description】:

When I try regexner, it works as expected with the following settings and data:

props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, regexner");

Bachelor of Laws	DEGREE
Bachelor of (Arts|Laws|Science|Engineering|Divinity)	DEGREE

What I would like to do is use TokenRegex. For example:

Bachelor of Laws	DEGREE
Bachelor of ([{tag:NNS}] [{tag:NNP}])	DEGREE

I read that to do this I should use TokensregexNERAnnotator.

I tried using it as follows, but it did not work.

pipeline.addAnnotator(new TokensRegexNERAnnotator("expressions.txt", true));

Alternatively, I tried setting up the annotator another way:

props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, tokenregexner");    
props.setProperty("customAnnotatorClass.tokenregexner", "edu.stanford.nlp.pipeline.TokensRegexNERAnnotator");

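I tried different TokenRegex formats, but either the annotator could not find the expression or I got a SyntaxException.

What is the proper way to use TokenRegex (querying tokens by tag) in the NER data file?

By the way, I just noticed a comment in the TokensRegexNERAnnotator.java file. I am not sure whether it is related, but it may mean that POS tags do not work with RegexNerAnnotator.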

if (entry.tokensRegex != null) {
  // TODO: posTagPatterns...
  pattern = TokenSequencePattern.compile(env, entry.tokensRegex);
}

【Question discussion】:

Tags: named-entity-recognition stanford-nlp

【Solution 1】:

First you need to create a TokensRegex rules file (sample_degree.rules). Here is an example:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ pattern: (/Bachelor/ /of/ [{tag:NNP}]), action: Annotate($0, ner, "DEGREE") }

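To explain the rule briefly: the pattern field specifies the token sequence to match. The action field says to annotate every token in the entire match (that is what $0 stands for). The field being set is ner (note that ner = ... is also defined at the top of the rules file), and the third argument says to set that field to the string "DEGREE".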
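To make the rule's semantics concrete, here is a minimal hand-rolled sketch in plain Java. It does not use CoreNLP at all: the `Token` class, the `annotateDegree` method, and the hand-assigned POS tags are hypothetical and exist only for illustration. It mimics what this one rule does, namely find the sequence `Bachelor of <NNP>` and set the NER label of every token in the match to "DEGREE".

```java
import java.util.ArrayList;
import java.util.List;

public class DegreeRuleSketch {

    /** Hypothetical stand-in for a token: word, POS tag, and NER label. */
    static final class Token {
        final String word;
        final String tag;
        String ner = "O"; // "O" means no entity label yet

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /**
     * Mimics the rule
     *   { pattern: (/Bachelor/ /of/ [{tag:NNP}]), action: Annotate($0, ner, "DEGREE") }
     * and returns the number of matches found.
     */
    static int annotateDegree(List<Token> tokens) {
        int matches = 0;
        for (int i = 0; i + 2 < tokens.size(); i++) {
            boolean hit = tokens.get(i).word.equals("Bachelor")
                    && tokens.get(i + 1).word.equals("of")
                    && tokens.get(i + 2).tag.equals("NNP");
            if (hit) {
                // $0 is the whole match, so all three tokens get the label.
                for (int j = i; j <= i + 2; j++) {
                    tokens.get(j).ner = "DEGREE";
                }
                matches++;
                i += 2; // continue after the matched span
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // POS tags are assigned by hand here; a real pipeline would get them from the pos annotator.
        List<Token> sentence = new ArrayList<>();
        sentence.add(new Token("She", "PRP"));
        sentence.add(new Token("has", "VBZ"));
        sentence.add(new Token("a", "DT"));
        sentence.add(new Token("Bachelor", "NNP"));
        sentence.add(new Token("of", "IN"));
        sentence.add(new Token("Science", "NNP"));
        sentence.add(new Token(".", "."));

        annotateDegree(sentence);
        for (Token t : sentence) {
            System.out.println(t.word + "\t" + t.tag + "\t" + t.ner);
        }
    }
}
```

The real TokensRegex engine is far more general (regular expressions over tokens, capture groups, multiple actions); the sketch only shows why all three tokens end up with the same label.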

Then create this .props file for the command (degree_example.props):

customAnnotatorClass.tokensregex = edu.stanford.nlp.pipeline.TokensRegexAnnotator

tokensregex.rules = sample_degree.rules

annotators = tokenize,ssplit,pos,lemma,ner,tokensregex
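If you build the pipeline from Java code instead of a .props file, the same three settings can be placed in a `java.util.Properties` object. The following is a minimal sketch under that assumption; the class name `DegreePipelineProps` and the method `degreeProps` are hypothetical. It only builds and prints the properties, so it runs without CoreNLP on the classpath, and the step of handing them to the pipeline is left as a comment.

```java
import java.util.Properties;

public class DegreePipelineProps {

    /** Builds the same settings as degree_example.props. */
    static Properties degreeProps() {
        Properties props = new Properties();
        // Register TokensRegexAnnotator under the name "tokensregex".
        props.setProperty("customAnnotatorClass.tokensregex",
                "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
        // Point the annotator at the rules file from the answer.
        props.setProperty("tokensregex.rules", "sample_degree.rules");
        // tokensregex runs last, after pos and ner have produced their annotations.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        return props;
    }

    public static void main(String[] args) {
        Properties props = degreeProps();
        // With CoreNLP on the classpath, one would then write:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```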

Then run this command:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props degree_example.props -file sample-degree-sentence.txt -outputFormat text

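You should see that the three tokens you wanted tagged as "DEGREE" are tagged.

I think I will change the code so that tokensregex maps to TokensRegexAnnotator, so that you won't have to specify it as a custom annotator. For now, though, you need to add that line to the .props file.

This example should help you get this working. If you want to learn more, here are some further resources: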

http://nlp.stanford.edu/software/tokensregex.shtml#TokensRegexRules

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/types/Expressions.html

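【Discussion】:

Would you recommend handling around 7000 rules the same way, or does this only make sense for a small number of rules?
I guess regexner is only meant for specifying alternative words ((Arts|Laws|Science|Engineering|Divinity)). It would be much easier with it. (I read your answer again. If I understand the change you are planning correctly, I guess we will be able to specify tag queries in the regexner text file.) Thanks for your help.
Thanks a lot, @StanfordNLPHelp. Clearly explained!
I have a similar setup (though running the server), but ner and tokensregex do not cooperate. Either one works on its own, but not both together.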