【Question Title】: R: Frequency table of multiple categorical variables
【Posted】: 2016-02-05 12:05:25
【Question】:

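I have imported interview data from an SPSS .SAV file as a data.frame, and I am now trying to create frequency tables by question number and interview location. Here is an example data.frame: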

loc<-c("city1","city2","city1","city2","city1","city1","city2","city2","city1","city2")
q1<-c("YES","YES","NO","MAYBE","NO","NO","YES","NO","MAYBE","MAYBE")
q2<-c("YES","NO","MAYBE","YES","NO","MAYBE","MAYBE","YES","YES","NO")
q3<-c("NO","NO","NO","NO","YES","YES","MAYBE","MAYBE","NO","MAYBE")
df<-data.frame(loc,q1,q2,q3)

df
     loc    q1    q2    q3
1  city1   YES   YES    NO
2  city2   YES    NO    NO
3  city1    NO MAYBE    NO
4  city2 MAYBE   YES    NO
5  city1    NO    NO   YES
6  city1    NO MAYBE   YES
7  city2   YES MAYBE MAYBE
8  city2    NO   YES MAYBE
9  city1 MAYBE   YES    NO
10 city2 MAYBE    NO MAYBE

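Now I want to count the occurrences of each answer option ("YES", "NO", "MAYBE") by question number ("q1", "q2", "q3") and location ("city1", "city2"). The resulting data.frame should look like this: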

   loc quest  answ freq
1  city1    q1   YES    1
2  city1    q1    NO    3
3  city1    q1 MAYBE    1
4  city2    q1   YES    2
5  city2    q1    NO    1
6  city2    q1 MAYBE    2
7  city1    q2   YES    2
8  city1    q2    NO    1
9  city1    q2 MAYBE    2
10 city2    q2   YES    2
11 city2    q2    NO    2
12 city2    q2 MAYBE    1
13 city1    q3   YES    2
14 city1    q3    NO    3
15 city1    q3 MAYBE    0
16 city2    q3   YES    0
17 city2    q3    NO    2
18 city2    q3 MAYBE    3

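So far I have played around with count(), ddply(), and summarise() from the plyr package, but with no luck. My current solution is very hacky: it splits df by loc, creates frequency tables with as.data.frame(summary(df_city1)), retrieves the frequencies from the summary strings, and merges the summary data.frames for city1 and city2 back together. I suppose there must be a simpler, more elegant solution.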

【Question Comments】:

    Tags: r dplyr plyr frequency summary


    【Solution 1】:

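    We reshape the dataset from "wide" to "long" (gather does this), then group by "loc", "quest", and "answ" (group_by) and use tally to get the counts. However, if we also want the combinations that do not occur in the dataset, with a count of 0, we need to join onto a dataset that holds all unique combinations of the three columns (complete and unique build that).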

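    library(dplyr)
    library(tidyr)
    # All loc/quest/answ combinations, including those absent from the data;
    # unique() drops the duplicate rows that the long data already contains
    dfN <- gather(df, quest, answ, q1:q3) %>%
                       complete(loc, quest, answ) %>%
                       unique()
    
    # Count the observed combinations, join them onto the full grid,
    # and replace the missing counts with 0
    res <- gather(df, quest, answ, q1:q3) %>%
                   group_by(loc, quest, answ) %>%
                   tally() %>%
                   left_join(dfN, .) %>%
                   mutate(n = ifelse(is.na(n), 0, n))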
    res
    #     loc quest  answ     n
    #   (fctr) (chr) (chr) (dbl)
    #1   city1    q1 MAYBE     1
    #2   city1    q1    NO     3
    #3   city1    q1   YES     1
    #4   city1    q2 MAYBE     2
    #5   city1    q2    NO     1
    #6   city1    q2   YES     2
    #7   city1    q3 MAYBE     0
    #8   city1    q3    NO     3
    #9   city1    q3   YES     2
    #10  city2    q1 MAYBE     2
    #11  city2    q1    NO     1
    #12  city2    q1   YES     2
    #13  city2    q2 MAYBE     1
    #14  city2    q2    NO     2
    #15  city2    q2   YES     2
    #16  city2    q3 MAYBE     3
    #17  city2    q3    NO     2
    #18  city2    q3   YES     0
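
The same idea can probably be written more compactly with newer package versions. The following is only a sketch, not part of the original answer: it assumes tidyr 1.0.0 or later (for pivot_longer) and a dplyr version whose count() accepts the name argument. The object name res2 is made up for illustration, and the count column is called freq to match the layout asked for in the question.

```r
library(dplyr)
library(tidyr)

# Example data from the question
loc <- c("city1","city2","city1","city2","city1","city1","city2","city2","city1","city2")
q1  <- c("YES","YES","NO","MAYBE","NO","NO","YES","NO","MAYBE","MAYBE")
q2  <- c("YES","NO","MAYBE","YES","NO","MAYBE","MAYBE","YES","YES","NO")
q3  <- c("NO","NO","NO","NO","YES","YES","MAYBE","MAYBE","NO","MAYBE")
df  <- data.frame(loc, q1, q2, q3)

res2 <- df %>%
  # wide to long: one row per (loc, quest, answ)
  pivot_longer(q1:q3, names_to = "quest", values_to = "answ") %>%
  # count the observed combinations; count() returns an ungrouped result
  count(loc, quest, answ, name = "freq") %>%
  # add the combinations that never occur, with a count of 0
  complete(loc, quest, answ, fill = list(freq = 0L))

res2
```

Because complete() fills the missing combinations directly, this variant needs neither the separate lookup table nor the join.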
    

    【Comments】:

    • Thanks @akrun, but your solution does not produce the expected result. Now an extra row is saved to "res" for each count.
    • @viktor_r I forgot unique. It should work now.
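
For readers who prefer to avoid extra packages, the counts can also be obtained in base R. This is a minimal sketch rather than anything proposed in the thread: the names long and res_base are hypothetical, and the stacking step assumes the question columns are exactly q1, q2, and q3.

```r
# Example data from the question
loc <- c("city1","city2","city1","city2","city1","city1","city2","city2","city1","city2")
q1  <- c("YES","YES","NO","MAYBE","NO","NO","YES","NO","MAYBE","MAYBE")
q2  <- c("YES","NO","MAYBE","YES","NO","MAYBE","MAYBE","YES","YES","NO")
q3  <- c("NO","NO","NO","NO","YES","YES","MAYBE","MAYBE","NO","MAYBE")
df  <- data.frame(loc, q1, q2, q3)

# Stack the question columns: one row per (loc, quest, answ)
long <- data.frame(
  loc   = rep(df$loc, times = 3),
  quest = rep(c("q1", "q2", "q3"), each = nrow(df)),
  answ  = unlist(df[c("q1", "q2", "q3")], use.names = FALSE)
)

# table() cross-tabulates all three columns, so combinations that
# never occur appear with a count of 0
res_base <- as.data.frame(table(long), responseName = "freq")

res_base
```

Since table() builds the full cross-tabulation, the zero-count rows appear without any extra step.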