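Stanford NLP - Sentiment analysis for Chinese language: answers

[Question title]: Stanford NLP - Sentiment analysis for Chinese language
[Posted]: 2014-12-21 16:23:52
[Question]:

I want to build a sentiment analysis program that takes a Chinese dataset and determines whether it contains more positive, negative, or neutral statements. Following this example, I built a sentiment analyzer for English (stanford-corenlp) that does exactly what I want, but I need it to work on Chinese.

Problem: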

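    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
    import edu.stanford.nlp.trees.Tree;
    import edu.stanford.nlp.util.CoreMap;

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
    // gender, lemma, ner, parse, pos, sentiment, ssplit, tokenize
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // read some text in the text variable
    String sentimentText = "Fun day, isn't it?";
    String[] ratings = {"Very Negative", "Negative", "Neutral", "Positive", "Very Positive"};
    Annotation annotation = pipeline.process(sentimentText);
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
        // Key name as of 2014; later CoreNLP releases call it SentimentAnnotatedTree.
        Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
        // Predicted class runs from 0 (very negative) to 4 (very positive).
        int score = RNNCoreAnnotations.getPredictedClass(tree);
        System.out.println("sentence:'" + sentence + "' has a score of " + (score - 2) + " rating: " + ratings[score]);
        System.out.println(tree);
    }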

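At the moment I don't know how to change the code above so that it supports Chinese. I downloaded the Chinese parser and segmenter and looked at the demos, but after several days of trying I have got nowhere. I have also read http://nlp.stanford.edu/software/corenlp.shtml, which is really useful for English. Are there any e-books, tutorials, or examples that would help me understand how Stanford NLP sentiment analysis works for Chinese?

Thanks in advance!

PS: I only started with Java recently, so please forgive me if I have said or done something wrong.

What I have researched:

How to parse languages other than English with Stanford Parser? in java, not command lines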

Using stanford parser to parse Chinese

[Comments]:

Tags: java dataset stanford-nlp sentiment-analysis

[Answer 1]:

Based on my experience with German, you need to do the following:

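1. Get a corpus of Chinese text.
2. Parse each sentence.
3. Binarize the resulting parse trees.
4. For each node in the binarized parse trees, extract the phrase spanned by that node.
5. Annotate each phrase with a sentiment label:
   - 0: very negative
   - 1: slightly negative
   - 2: neutral
   - 3: slightly positive
   - 4: very positive
6. Apply the labels to the parse trees using something like BuildBinarizedDataset. Note that BuildBinarizedDataset is set up for English and will parse your sentences again. I found it more practical to apply the labels to the pre-existing parses from step 3.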

For the annotation: you can do it yourself or use a crowdsourcing service such as CrowdFlower. I found the "Sentiment Analysis" template on CrowdFlower useful.
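Steps 4 to 6 can be sketched without any CoreNLP classes. The code below is only an illustration under stated assumptions: `LabeledTreeSketch`, `Node`, `phrases`, and `toLabeledTree` are made-up names, the tree is bracketed by hand rather than produced by a parser, and the labels are invented. It lists the phrase spanned by every node (the items to hand to annotators) and then writes the tree with a 0 to 4 label on each node, in the bracketed style used by the English sentiment treebank.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a binarized parse tree; not a CoreNLP class. */
final class LabeledTreeSketch {

    /** A node is either a leaf holding one segmented word or an inner node with two children. */
    static final class Node {
        final String word; // non-null only for leaves
        final Node left;
        final Node right;

        Node(String word) { this.word = word; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.word = null; this.left = left; this.right = right; }

        boolean isLeaf() { return word != null; }
    }

    /** Step 4: the phrase a node spans, with its words joined by single spaces. */
    static String phrase(Node n) {
        return n.isLeaf() ? n.word : phrase(n.left) + " " + phrase(n.right);
    }

    /** Step 4: the phrase of every node in the tree, parents before children. */
    static List<String> phrases(Node n) {
        List<String> out = new ArrayList<>();
        collect(n, out);
        return out;
    }

    private static void collect(Node n, List<String> out) {
        out.add(phrase(n));
        if (!n.isLeaf()) {
            collect(n.left, out);
            collect(n.right, out);
        }
    }

    /** Step 6: write the tree with a label on every node; phrases without a label default to 2 (neutral). */
    static String toLabeledTree(Node n, Map<String, Integer> labels) {
        int label = labels.getOrDefault(phrase(n), 2);
        if (n.isLeaf()) {
            return "(" + label + " " + n.word + ")";
        }
        return "(" + label + " " + toLabeledTree(n.left, labels) + " " + toLabeledTree(n.right, labels) + ")";
    }

    public static void main(String[] args) {
        // "这 部 电影 很 好" ("this film is very good"), segmented and bracketed by hand.
        Node tree = new Node(
            new Node(new Node("这"), new Node(new Node("部"), new Node("电影"))),
            new Node(new Node("很"), new Node("好")));

        // Step 5: labels an annotator might assign (invented for this example).
        Map<String, Integer> labels = new LinkedHashMap<>();
        labels.put("好", 3);
        labels.put("很 好", 4);
        labels.put("这 部 电影 很 好", 4);

        phrases(tree).forEach(System.out::println);
        System.out.println(toLabeledTree(tree, labels));
    }
}
```

For the hand-built tree, the last line printed is `(4 (2 (2 这) (2 (2 部) (2 电影))) (4 (2 很) (3 好)))`. Keying labels by phrase text means identical phrases share one label, which keeps the annotation workload down but would hide context-dependent sentiment.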

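[Comments]:

Amazon Mechanical Turk can also be used for the annotation.

[Answer 2]:

I am working on the same problem and have run into trouble as well. Here is how far I have got:

You need to change the properties to support Chinese, as follows: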

    props.setProperty("customAnnotatorClass.segment","edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator");

    props.setProperty("pos.model","edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger");
    props.setProperty("parse.model","edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");

    props.setProperty("segment.model","edu/stanford/nlp/models/segmenter/chinese/ctb.gz");
    props.setProperty("segment.serDictionary","edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz");
    props.setProperty("segment.sighanCorporaDict","edu/stanford/nlp/models/segmenter/chinese");
    props.setProperty("segment.sighanPostProcessing","true");

    props.setProperty("ssplit.boundaryTokenRegex","[.]|[!?]+|[。]|[！？]+");

    props.setProperty("ner.model","edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz");
    props.setProperty("ner.applyNumericClassifiers","false");
    props.setProperty("ner.useSUTime","false");

The remaining problem is that the tokenizer actually used still defaults to PTBTokenizer (English).
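As a side note on the `ssplit.boundaryTokenRegex` value above, the following stand-alone sketch shows which tokens that pattern treats as sentence ends. It assumes the sentence splitter compares the regex against whole tokens; `BoundaryRegexSketch` and `split` are illustrative names, and the token list is segmented by hand rather than by the Chinese segmenter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class BoundaryRegexSketch {

    // Same value as the ssplit.boundaryTokenRegex property set above.
    static final Pattern BOUNDARY = Pattern.compile("[.]|[!?]+|[。]|[！？]+");

    /** Groups tokens into sentences, closing a sentence after each token that fully matches BOUNDARY. */
    static List<List<String>> split(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (BOUNDARY.matcher(token).matches()) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep trailing tokens that were not closed by a boundary token.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hand-segmented tokens; a real pipeline would get these from the segmenter.
        List<String> tokens = Arrays.asList("今天", "很", "开心", "。", "你", "呢", "？");
        System.out.println(split(tokens));
    }
}
```

The pattern covers both ASCII and full-width sentence-final punctuation, so mixed Chinese and English input is split on either kind.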

For Spanish, the corresponding properties are:

    props.setProperty("tokenize.language","es");
    props.setProperty("sentiment.model","src/international/spanish");

    props.setProperty("pos.model","src/models/pos-tagger/spanish/spanish-distsim.tagger");

    props.setProperty("ner.model","src/models/ner/spanish.ancora.distsim.s512.crf.ser.gz");
    props.setProperty("ner.applyNumericClassifiers","false");
    props.setProperty("ner.useSUTime","false");

    props.setProperty("parse.model","src/models/lexparser/spanishPCFG.ser.gz");

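This works for Spanish. Note that the "tokenize.language" property is set to "es". There is no such setting for Chinese: I tried 'ch', 'cn', 'zh', and 'zh-cn', but none of them had any effect. Let me know if you get any further.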
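One thing that may be worth trying, offered as an untested sketch rather than a confirmed fix: `customAnnotatorClass.segment` only registers an annotator named `segment`, and it presumably runs only if the `annotators` list names it. A list that still starts with `tokenize` would explain why PTBTokenizer is used. The code below collects the Chinese properties from this answer with `segment` in place of `tokenize`, assuming a CoreNLP release of that period plus the Chinese models jar. `ChinesePipelineProps` and `build` are made-up names, the NER settings are left out for brevity, and `sentiment` is omitted because, as the first answer implies, a Chinese sentiment model would have to be trained first. Only `java.util` is used, so it runs without CoreNLP and just prints the settings.

```java
import java.util.Properties;

final class ChinesePipelineProps {

    /** Builds pipeline properties that name the custom segmenter instead of the default tokenizer. */
    static Properties build() {
        Properties props = new Properties();

        // Assumption: "segment" replaces "tokenize" so the Chinese segmenter produces the tokens.
        props.setProperty("annotators", "segment, ssplit, pos, parse");
        props.setProperty("customAnnotatorClass.segment",
                "edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator");

        // Segmenter settings, copied from the answer.
        props.setProperty("segment.model", "edu/stanford/nlp/models/segmenter/chinese/ctb.gz");
        props.setProperty("segment.serDictionary",
                "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz");
        props.setProperty("segment.sighanCorporaDict", "edu/stanford/nlp/models/segmenter/chinese");
        props.setProperty("segment.sighanPostProcessing", "true");

        // Sentence splitting on ASCII and full-width punctuation.
        props.setProperty("ssplit.boundaryTokenRegex", "[.]|[!?]+|[。]|[！？]+");

        // Tagger and parser models, copied from the answer.
        props.setProperty("pos.model",
                "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger");
        props.setProperty("parse.model", "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // Print the settings in a stable order for inspection.
        props.stringPropertyNames().stream()
                .sorted()
                .forEach(key -> System.out.println(key + " = " + props.getProperty(key)));
    }
}
```

With CoreNLP and the Chinese models on the classpath, the result of `build()` would be passed to `new StanfordCoreNLP(props)` exactly as in the question.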

[Comments]: