R summary statistics grouped by logical vectors

[Question title]: R Summary Stat by Logical vector grouping
[Posted]: 2015-01-04 12:53:27
[Question]:

I have the following R data frame. I am trying to get summary statistics grouped by the logical vectors in the final "score" data frame.

    #original df
    type <- c("A", "B", "C","D","E")
    user <- c('user1','user2','user3','user4','user5')
    text <-c('this is a tweet','this is a fb post','tweeting is fun','other text','another fb post')
    tweet.mention <- c('TRUE','FALSE','TRUE','FALSE','FALSE')
    fb.mention <- c('FALSE','TRUE','FALSE','FALSE','TRUE')
    df1 <- cbind.data.frame(type, user, text,tweet.mention,fb.mention)
    df1

    #Remove records that are all FALSE
    tweet<-as.logical(tweet.mention)
    fb<-as.logical(fb.mention)
    test<-cbind(tweet,fb)
    true<-rowSums(test)
    all<-cbind(test,true)

    #Create score df
    score<-subset(df1,true>=1)

    #score API return
    sentiment<-c(1,.5,2,-2)

    #scored text
    score<-cbind(score,sentiment)

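As it stands, the score df has record 4 removed and includes the sentiment values. I would then like to get the mean sentiment score, but grouped by tweet.mention (1.5) and fb.mention (-.75). I have tried summary from base R, but that summarizes everything at once, so I think I need some kind of grouping or subset. I then tried describeBy from the psych package. That did not help either.

To complicate matters, I do not always know the number of logical vectors, so I cannot subset them manually by naming each column and testing ==TRUE. I could build a list or vector of the column headers to loop over, but I am not sure what code or function would do the grouping.

I have read the base R and psych vignettes and checked the R Cookbook for an answer, but could not find one. I would greatly appreciate your help.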

[Comments]:

Tags: r dataframe logical-operators summary

[Solution 1]:

Two approaches using base R:

> with(score, tapply(sentiment, list(tweet.mention, fb.mention), mean))
      FALSE  TRUE
FALSE    NA -0.75
TRUE    1.5    NA

and:

> aggregate(sentiment~tweet.mention+fb.mention, data=score, mean)
  tweet.mention fb.mention sentiment
1          TRUE      FALSE      1.50
2         FALSE       TRUE     -0.75
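
Neither call above covers the case raised in the question where the number of flag columns is not known in advance. One possible way to handle that in base R is to loop over the flag column names, as a minimal sketch. It assumes (this is not stated in the original post) that every flag column name ends in `.mention`; `flag_cols` and `flag_means` are illustrative names, and `score` is rebuilt here with only the columns the sketch needs.

```r
# Rebuild the question's `score` data frame (record 4 already removed).
score <- data.frame(
  type          = c("A", "B", "C", "E"),
  tweet.mention = c(TRUE, FALSE, TRUE, FALSE),
  fb.mention    = c(FALSE, TRUE, FALSE, TRUE),
  sentiment     = c(1, 0.5, 2, -2)
)

# Assumption: flag columns are recognisable by a ".mention" suffix.
flag_cols <- grep("\\.mention$", names(score), value = TRUE)

# For each flag column, average sentiment over the rows where the flag is TRUE.
# as.logical() also copes with flags stored as "TRUE"/"FALSE" strings or factors.
flag_means <- sapply(flag_cols, function(col) {
  mean(score$sentiment[as.logical(score[[col]])])
})

flag_means
```

Unlike `tapply` and `aggregate`, which group by each combination of flags, this computes one mean per flag, so a row with several flags set to TRUE would count toward each of them.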

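[Comments]:

[Solution 2]:

Here is another way, using dplyr. You may want to use stringsAsFactors = FALSE when building the data frame, so that the variables are not all turned into factors.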

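library(dplyr)

# Keep rows with at least one flag set, attach the scores, then average per flag combination
df1 %>%
    filter(tweet.mention != FALSE | fb.mention != FALSE) %>%
    mutate(sentiment = c(1, 0.5, 2, -2)) %>%
    group_by(tweet.mention, fb.mention) %>%
    summarize(outcome = mean(sentiment))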

#  tweet.mention fb.mention outcome
#1         FALSE       TRUE   -0.75
#2          TRUE      FALSE    1.50

Data

df1 <-structure(list(type = c("A", "B", "C", "D", "E"), user = c("user1", 
"user2", "user3", "user4", "user5"), text = c("this is a tweet", 
"this is a fb post", "tweeting is fun", "other text", "another fb post"
), tweet.mention = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE"
), fb.mention = c("FALSE", "TRUE", "FALSE", "FALSE", "TRUE")), .Names = c("type", 
"user", "text", "tweet.mention", "fb.mention"), row.names = c(NA, 
-5L), class = "data.frame")
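
On the stringsAsFactors point in this answer, here is a minimal base R sketch of how the flags might be kept as plain strings and then converted to real logicals. The names `df_chr`, `df_lgl`, and `keep` are illustrative and not from the original post, and only the two flag columns are included.

```r
# The two flag columns from the question, stored as strings.
tweet.mention <- c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE")
fb.mention    <- c("FALSE", "TRUE", "FALSE", "FALSE", "TRUE")

# stringsAsFactors = FALSE keeps the flags as character vectors
# (before R 4.0.0 the default was TRUE, which produced factors).
df_chr <- data.frame(tweet.mention, fb.mention, stringsAsFactors = FALSE)

# Convert every column to logical while keeping the data frame shape.
df_lgl <- df_chr
df_lgl[] <- lapply(df_chr, as.logical)

# Rows with at least one TRUE flag, i.e. the rows kept in `score`.
keep <- rowSums(df_lgl) >= 1
which(keep)
```

With logical columns, the row filter no longer depends on comparing strings such as "TRUE" against FALSE.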

[Comments]:

[Solution 3]:

Here is a solution using the data.table package; there are several ways to do this.

library(data.table)
setDT(score)
score[, mean(sentiment), by = list(tweet.mention, fb.mention)]

It uses the by argument in data.table to do the grouping. The output is:

   tweet.mention fb.mention    V1
1:          TRUE      FALSE  1.50
2:         FALSE       TRUE -0.75

[Comments]:

Thank you very much. I will use and explore the data.table package.