Train model using Named entity (answers)

【Question title】: Train model using Named entity
【Posted】: 2015-06-27 15:39:09
【Question】:

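I am looking at Stanford CoreNLP and its Named Entity Recognizer. I have different kinds of input text that I need to tag with my own entities, so I started training my own model, but it does not seem to be working.

For example, my input text string is "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q".

I followed the examples to train my own model, looking only for the few words I am interested in.

My jane-austen-emma-ch1.tsv looks like this: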

Toyota  PERS
Land Cruiser    PERS

From the input text above, I am only interested in these two terms: one is Toyota and the other is Land Cruiser.

austen.prop looks like this:

trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

I run the following command to generate the ner-model.ser.gz file:

java -cp stanford-corenlp-3.4.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop

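import java.util.List;

import edu.stanford.nlp.ie.NERClassifierCombiner;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

// The class name NerDemo is illustrative; the original post showed only main().
public class NerDemo {
    public static void main(String[] args) {
        // Stanford's pretrained 7-class model.
        String serializedClassifier = "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz";
        // Custom model produced by the training command above.
        String serializedClassifier2 = "C:/standford-ner/ner-model.ser.gz";
        try {
            // Custom model first, pretrained model second;
            // numeric classifiers and SUTime are both disabled.
            NERClassifierCombiner classifier = new NERClassifierCombiner(false, false,
                    serializedClassifier2, serializedClassifier);
            String ss = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q";
            System.out.println("---");
            List<List<CoreLabel>> out = classifier.classify(ss);
            // Print every token followed by the label the classifier assigned to it.
            for (List<CoreLabel> sentence : out) {
                for (CoreLabel word : sentence) {
                    System.out.print(word.word() + '/' + word.get(CoreAnnotations.AnswerAnnotation.class) + ' ');
                }
                System.out.println();
            }
        } catch (Exception e) {
            // Covers model-loading failures as well as the ClassCastException caught separately before.
            e.printStackTrace();
        }
    }
}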

This is the output I get:

Book/PERS of/PERS 49/O Magazine/PERS Articles/PERS on/O Toyota/PERS Land/PERS Cruiser/PERS 1956-1987/PERS Gold/O Portfolio/PERS http://t.co/EqxmY1VmLg/PERS http://t.co/F0Vefuoj9Q/PERS

I think this is wrong. I am looking for Toyota/PERS and Land Cruiser/PERS (which is a multi-word value).

Any help is much appreciated.

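【Comments】:

Why are you combining it with the stanford-english-7class classifier? What is the size of your training data (number of sentences/tokens)?
Thanks vihari. My training data is just the two entries I mentioned in the tsv file; I am only experimenting, and it will grow over time. I included the Stanford classifier so that entities can be matched from there, and also from my own training data if a match is found there.

Tags: nlp stanford-nlp sentiment-analysis named-entity-recognition pos-tagger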

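【Solution 1】:

I believe you should also add examples of O (non-entity) tokens to your trainFile. As you have given it, the trainFile is too simple for any learning to happen: the model needs both O and PERS examples so that it does not annotate everything as PERS. You are not teaching it about the tokens you are not interested in. Something like this:

Toyota  PERS
of    O
Portfolio    O
49    O

and so on.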
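The idea above can be sketched as a small generator that labels every token of a sentence, using O for anything that is not a listed entity token. This is only an illustration under stated assumptions: `TrainingTsvSketch` and `toTsvLines` are hypothetical names, not part of Stanford CoreNLP, and plain whitespace splitting stands in for the tokenizer, so real training data should be tokenized the same way as the text you later classify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Stanford CoreNLP.
class TrainingTsvSketch {

    // Emits one "token<TAB>label" line per whitespace-separated token.
    // Tokens in entityTokens get the given label; all others get "O".
    static List<String> toTsvLines(String sentence, Set<String> entityTokens, String label) {
        List<String> lines = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return lines;
        }
        for (String token : trimmed.split("\\s+")) {
            String tag = entityTokens.contains(token) ? label : "O";
            lines.add(token + "\t" + tag);
        }
        return lines;
    }

    public static void main(String[] args) {
        // "Land Cruiser" is listed as two tokens because labels are per token.
        Set<String> entityTokens = new HashSet<>(Arrays.asList("Toyota", "Land", "Cruiser"));
        String sentence = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio";
        for (String line : toTsvLines(sentence, entityTokens, "PERS")) {
            System.out.println(line);
        }
    }
}
```

The printed lines could be saved as the trainFile; a realistic model would need many such sentences, not just one.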

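Also, for phrase-level recognition you should look at regexner, which lets you define patterns (patterns are good for us). I am working on this through the API and have the following code: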

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.put("regexner.mapping", customLocationFilename);

With the following customLocationFilename (columns are tab-separated):

Make Believe Town	figure of speech	ORGANIZATION
( /Hello/ [{ ner:PERSON }]+ )	salut	PERSON
Bachelor of (Arts|Laws|Science|Engineering)	DEGREE
( /University/ /of/ [{ ner:LOCATION }] )	SCHOOL

And the text: Hello Mary Keller was born on 4th of July and took a Bachelor of Science. Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to University of London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney Weaver) says they will pay this on the usual credit terms (30 days).

The output I get:

Hello Mary Keller is a salut
4th of July is a DATE
Bachelor of Science is a DEGREE
$ 100,000 is a MONEY
40 % is a PERCENT
15th August is a DATE
University of London is a ORGANIZATION
Make Believe Town is a figure of speech
Sigourney Weaver is a PERSON
30 days is a DURATION

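For more information on how to do this, you can look at the example that got me going.

【Comments】:

I used the following customLocationFilename: (/Make//Believe//Town) LOCATION (/Hello/[{ner:PERSON}]+) SALUT Bachelor of (Arts|Laws|Science|Engineering) DEGREE (/University//of/[{ner:LOCATION}]) SCHOOL and got the result: '{Mary Keller}' is a {PERSON} '{4th of July}' is a {DATE} '{Bachelor of Science}' is a {DEGREE} '{$ 100,000}' is a {MONEY} '{40 %}' is a {PERCENT} '{15th August}' is a {DATE} '{University of London}' is a {ORGANIZATION} '{Make Believe Town}' is a {ORGANIZATION} ... The results for Make Believe Town, Hello Mary Keller, and University of London are wrong. Please suggest.
One thing you need to do is add a third column to customLocationFilename, which lists the annotations that your custom-trained NERClassifier is allowed to override, like this: (/Make//Believe//Town) LOCATION ORGANIZATION
Thanks for the reply. Change made, but it is still not working. I still get (1) Make Believe Town as ORGANIZATION, (2) University of London as ORGANIZATION, and (3) Mary Keller as PERSON. Bachelor of Science is correct. I am not clear why. Should LOCATION and ORGANIZATION be separated by a space or by a tab?
Yes! All columns are separated by tabs. It was my fault for not emphasizing this.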
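Since the tab requirement is easy to lose when copying rules from a web page, here is a rough sketch of building mapping rows with explicit tabs and flagging rows that contain none. `MappingFileSketch`, `row`, and `linesWithoutTabs` are hypothetical helper names introduced only for illustration; nothing here calls CoreNLP itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Stanford CoreNLP.
class MappingFileSketch {

    // Builds one mapping row; overwritable may be null when there is no third column.
    static String row(String pattern, String type, String overwritable) {
        String base = pattern + "\t" + type;
        return overwritable == null ? base : base + "\t" + overwritable;
    }

    // Returns the 1-based numbers of non-blank lines that have fewer than
    // two tab-separated columns (for example, columns split by spaces only).
    static List<Integer> linesWithoutTabs(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                continue;
            }
            if (line.split("\t").length < 2) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Make Believe Town LOCATION ORGANIZATION",          // spaces only: flagged
                row("Make Believe Town", "LOCATION", "ORGANIZATION") // tabs: accepted
        );
        System.out.println("Lines without tabs: " + linesWithoutTabs(lines));
    }
}
```

Running a check like this on the mapping file before starting the pipeline may save a debugging round when a rule silently fails to match.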

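【Solution 2】:

NERClassifier* works at the word level; that is, it labels words, not phrases. Given that, the classifier seems to be performing fine. If you want, you can join the words that make up a phrase into a single token, for example with an underscore. So in both your labeled examples and your test examples, you would write "Land Cruiser" as "Land_Cruiser".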
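A minimal sketch of that preprocessing step, assuming the phrases of interest are known in advance and appear literally in the text. `PhraseJoinSketch` and `joinPhrase` are hypothetical names, and the same transformation would have to be applied to both the training data and the text being classified.

```java
// Hypothetical helper, not part of Stanford CoreNLP.
class PhraseJoinSketch {

    // Replaces every literal occurrence of phrase in text with a version
    // whose internal whitespace is collapsed to underscores.
    static String joinPhrase(String text, String phrase) {
        String joined = phrase.trim().replaceAll("\\s+", "_");
        return text.replace(phrase, joined);
    }

    public static void main(String[] args) {
        String text = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio";
        System.out.println(joinPhrase(text, "Land Cruiser"));
    }
}
```

The trade-off is that the classifier then sees "Land_Cruiser" as an unrelated new word, so it only helps for phrases that occur in the training data.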

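【Comments】:

Thanks Sonal. If you look at my tsv file, I have only two terms, Toyota and Land Cruiser. When I run the program, only these two should be tagged as PERS, and I believe everything else should be O, but my output is different. I am not sure why other words come out as PERS, for example Book/PERS, of/PERS, and the hyperlinks. Any ideas? Thanks for your help.
How can we get NERClassifier to train and work for phrases? Did the answer using RegEx work for you?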