Twitter Sentiment Analysis w R using German language set SentiWS3 with Scores: Answers

【Question title】: Twitter Sentiment Analysis w R using German language set SentiWS3 with Scores
【Posted】: 2014-05-15 11:03:21
【Question description】:

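I am referring to a previously asked question: I would like to run a sentiment analysis on German tweets and have been using the code below, taken from the Stack Overflow thread I mentioned. However, I would like the analysis to return the actual sentiment scores, not just a sum of TRUE/FALSE values for whether a word is positive or negative. Is there an easy way to do this?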

You can also find the word lists in the previous thread.

library(plyr)
library(stringr)

readAndflattenSentiWS <- function(filename) { 
  words = readLines(filename, encoding="UTF-8")
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
  words <- unlist(strsplit(words, ","))
  words <- tolower(words)
  return(words)
}
pos.words <- c(scan("Post3/positive-words.txt",what='character', comment.char=';', quiet=T), 
               readAndflattenSentiWS("Post3/SentiWS_v1.8c_Positive.txt"))
neg.words <- c(scan("Post3/negative-words.txt",what='character', comment.char=';', quiet=T), 
               readAndflattenSentiWS("Post3/SentiWS_v1.8c_Negative.txt"))

score.sentiment = function(sentences, pos.words, neg.words, .progress='none') {
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, pos.words, neg.words) 
  {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # I don't just want a TRUE/FALSE! How can I do this?
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, 
  pos.words, neg.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

sample <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!", 
            "i love you. you are wonderful.",
            "i hate you, die.")
(test.sample <- score.sentiment(sample, 
                                pos.words, 
                                neg.words))

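【Comments】:

Does your code run properly? I would have guessed laply should be lapply, but the post you reference writes it that way too...
Yes, it runs fine. I actually tried changing laply to lapply, and then it stopped working. I am fairly new to these functions, so I don't know why...
Ah, laply is part of plyr! Glad I did not edit to "fix" it :-)

Tags: r sentiment-analysis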

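【Answer 1】:

Is there an easy way to do this?

Well, yes. I am doing the same thing with a lot of tweets. If you are really into sentiment analysis, you should have a look at the Text Mining (tm) package.

You will see that working with a document-term matrix makes life much easier. However, I must warn you: from the journal articles I have read, bag-of-words methods usually classify only about 60% of sentiment correctly. If you are really interested in doing high-quality research, you should dig into the excellent "Artificial Intelligence: A Modern Approach" by Stuart Russell and Peter Norvig.

...so that is certainly not a quick fix for my sentiment problem. However, I was at that same point about two months ago.

However, I would like the analysis to return the actual sentiment scores

Since I have been there: you can convert your SentiWS into a nice csv file like this (for the negative words):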
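To make the document-term matrix remark a bit more concrete, here is a minimal sketch in base R only. The tm package offers DocumentTermMatrix() for building such a matrix from a corpus; the version below builds a tiny one by hand so that nothing has to be installed. The tweets and the lexicon values are made up for illustration and are not taken from SentiWS.

```r
# Toy tweets (made up for illustration).
tweets <- c("ich liebe dich du bist wunderbar",
            "ich hasse dich")

# Hypothetical mini-lexicon: a named numeric vector, word -> score.
lexicon <- c(liebe = 0.5, wunderbar = 0.7, hasse = -0.6)

# Tokenise and collect the vocabulary.
tokens <- strsplit(tolower(tweets), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per tweet, one column per term, cells are counts.
dtm <- t(sapply(tokens, function(tok) {
  tabulate(match(tok, vocab), nbins = length(vocab))
}))
colnames(dtm) <- vocab

# Weight vector aligned with the columns; terms outside the lexicon count as 0.
weights <- unname(lexicon[vocab])
weights[is.na(weights)] <- 0

# A single matrix product yields one sentiment score per tweet.
scores <- as.vector(dtm %*% weights)
print(scores)
```

Once the counts sit in a matrix, scoring the whole collection is one matrix product, which is the main convenience I had in mind.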

NegBegriff  NegWert
Abbau   -0.058
Abbaus  -0.058
Abbaues -0.058
Abbauen -0.058
Abbaue  -0.058
Abbruch -0.0048
...

Then you can import it into R as a nice data.frame. I used this code snippet:

### collect one score per tweet; start from an empty numeric vector
tweets.list.sentiment <- numeric(0)

### for all your words in each tweet in a row
### (words is assumed to be a list with one vector of words per tweet)
for (n in seq_along(words)) {

  ## get the position of the match /in your sentiWS-file/
  tweets.neg.position <- match(unlist(words[n]), neg.words$NegBegriff)
  tweets.pos.position <- match(unlist(words[n]), pos.words$PosBegriff)

  ## now use the positions, to find the matching values and sum 'em up
  score.pos <- sum(pos.words$PosWert[tweets.pos.position], na.rm = TRUE)
  score.neg <- sum(neg.words$NegWert[tweets.neg.position], na.rm = TRUE)
  ## negative values already carry a minus sign, so adding is correct
  score <- score.pos + score.neg

  ## now we have the sentiment for one tweet, push it to the list
  tweets.list.sentiment <- append(tweets.list.sentiment, score)
  ## and go again.
}

## look how beautiful!
summary(tweets.list.sentiment)

### caveat: This code is pretty ugly and not at all good use of R, 
### however it works sufficiently.  I am using approach from above, 
### thus I did not need to rewrite the latter.  Up to you ;- )

Well, I hope it works. (For my example, it did.)
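If the loop feels too clunky, the same lookup can be written without growing a vector by hand. This is only a sketch under assumptions: the two data.frames mimic the csv layout above, the negative values come from the excerpt, the positive words and values are invented, and score.tweet is a hypothetical helper name.

```r
# Lexicon data.frames in the csv layout shown above (lower-cased here).
neg.words <- data.frame(NegBegriff = c("abbau", "abbruch"),
                        NegWert    = c(-0.058, -0.0048),
                        stringsAsFactors = FALSE)
# Positive entries are invented for illustration only.
pos.words <- data.frame(PosBegriff = c("liebe", "wunderbar"),
                        PosWert    = c(0.5, 0.7),
                        stringsAsFactors = FALSE)

# One character vector of lower-cased words per tweet.
words <- list(c("der", "abbau", "ist", "wunderbar"),
              c("abbruch", "abbau"))

# Hypothetical helper: sum the lexicon values of all matched words.
score.tweet <- function(w) {
  pos <- pos.words$PosWert[match(w, pos.words$PosBegriff)]
  neg <- neg.words$NegWert[match(w, neg.words$NegBegriff)]
  # Unmatched words give NA and are dropped by na.rm.
  sum(pos, na.rm = TRUE) + sum(neg, na.rm = TRUE)
}

# vapply returns a plain numeric vector with one score per tweet.
tweets.sentiment <- vapply(words, score.tweet, numeric(1))
summary(tweets.sentiment)
```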

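The trick is to bring SentiWS into a nice form, which can be done with simple text manipulation using Excel macros, GNU Emacs, sed, or whatever else you like to work with.

【Comments】:

【Answer 2】: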
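The reshaping can probably also be done in R itself. The sketch below assumes each SentiWS line looks like word|POS, a tab, the value, and optionally a tab plus a comma-separated list of inflected forms, which is what the regex in the question suggests. parseSentiWS and the column names Begriff and Wert are hypothetical, and the two sample lines are modelled on the excerpt above rather than copied from the real file.

```r
# Hypothetical helper: turn raw SentiWS lines into a word/value data.frame.
parseSentiWS <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  rows <- lapply(fields, function(f) {
    base  <- sub("\\|.*$", "", f[1])   # drop the "|POS" suffix
    value <- as.numeric(f[2])          # sentiment value
    # Inflected forms are optional; split them on commas when present.
    forms <- if (length(f) >= 3 && nzchar(f[3])) {
      strsplit(f[3], ",", fixed = TRUE)[[1]]
    } else {
      character(0)
    }
    # Every form inherits the value of its base word.
    data.frame(Begriff = c(base, forms), Wert = value,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Sample lines for illustration (assumed format).
sample.lines <- c("Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue",
                  "Abbruch|NN\t-0.0048")
neg.words <- parseSentiWS(sample.lines)
print(neg.words)
```

For a real file, the lines would come from readLines(filename, encoding = "UTF-8") as in the question, and you may want tolower() on the Begriff column to match lower-cased tweets.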

As a starting point, this line:

words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)

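is saying "throw away the POS information and the sentiment value (leaving just the word list)".

So, to do what you want, you need to parse the data differently, and you will need a different data structure. readAndflattenSentiWS currently returns a vector, but you will want it to return a lookup table (from string to number: using an env object feels like a good fit, but if I also wanted the POS information, then a @987654325@ starts to feel right).

After that, most of your main loop can stay roughly the same, but you will need to collect the values and sum them up, rather than just adding up the number of positive and negative matches.

【Comments】: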
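A rough sketch of that idea, using an environment as the string-to-number lookup. makeLookup and scoreSentence are hypothetical names, the cleaning steps follow the question's score.sentiment, and the lexicon holds just two negative entries from the excerpt quoted in the other answer.

```r
# Hypothetical helper: build a lookup environment from words and values.
makeLookup <- function(words, values) {
  lookup <- new.env(hash = TRUE)
  for (i in seq_along(words)) {
    assign(words[i], values[i], envir = lookup)
  }
  lookup
}

# Hypothetical helper: sum the values of all words found in the lookup.
scoreSentence <- function(sentence, lookup) {
  # Same cleaning as in the question: strip punctuation, control chars, digits.
  sentence <- gsub("[[:punct:][:cntrl:]]", "", sentence)
  sentence <- gsub("[0-9]+", "", sentence)
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  words <- words[nzchar(words)]  # exists() rejects empty names
  values <- vapply(words, function(w) {
    # Words missing from the lexicon contribute 0.
    if (exists(w, envir = lookup, inherits = FALSE)) {
      get(w, envir = lookup)
    } else {
      0
    }
  }, numeric(1))
  sum(values)
}

# Tiny lexicon for illustration; a real one would hold the parsed SentiWS data.
lookup <- makeLookup(c("abbau", "abbruch"), c(-0.058, -0.0048))
score <- scoreSentence("Der Abbau, der Abbruch!", lookup)
print(score)
```

Because positive and negative values already carry their sign, one lookup table for both lists should be enough; whether you keep them separate is a matter of taste.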