Aggregate dataframe into a frequency table: answers

【Question title】: Aggregate dataframe into a frequency table
【Posted】: 2017-07-19 14:22:40
【Question description】:

I would like to reshape a data frame that looks like this, with the variables:

Year, University, Degree, Gender

Each row describes one student's record, for example:

2017, University College London, Social Science, Male 

2017, University of Leeds, Social science, Non-Binary

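I want to create a frequency table from this data to condense the number of rows, so that each university has 19 rows, one per degree category, and each row shows the count/frequency of each gender, looking like this.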

Year University Degree [Gender (Male, Female, Non-Binary)]

2017 UCL Biological Sciences 1 0 2

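I hope this makes sense. Thanks for your help.

Edit: I would now like to plot this data as a line chart using a subset of the data. I currently do the subsetting outside the plotting function, like this: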

subsetucl <- TFtab[which(TFtab$University == 'University College London'),]
ggplot(data = subsetucl, aes(x = Year, y = Female, group = Degree, color = Degree)) +
  geom_line() +
  geom_point(size = 0.8) +
  xlab("Year of application") +
  ylab("Frequency of Females") +
  ggtitle("UCL Applications by Degree (2011-2017)") +
  theme_bw()

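What is the best way to subset the data within the plotting function, and how can I best display lines for all genders rather than only the female frequency? Thank you.

【Discussion】:

Tags: r dataframe subset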

【Solution 1】:

This is a very good problem for dplyr.

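library("dplyr")
library("tidyr")  # spread() comes from tidyr, not dplyr
data %>%
   group_by(University, Degree, Gender) %>%
   count() %>%                                # n = number of students per group
   spread(key = Gender, value = n, fill = 0)  # one column per gender, 0 where absent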

But seriously, use Stack Overflow's search function. Here's a book to help with R

【Discussion】:

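This helps, but the gender frequencies are all in one column rather than in a separate column for each gender level. It also drops the 0 values. Is there a way to keep the 0 values?
data %>% group_by(University, Degree, Gender) %>% count() %>% spread(key = Gender, value = n, fill = 0)
This adds 0 to rows that have some values, but a row that would be all 0 does not appear at all. Is there a way to do this? @svenhalvorson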
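One possible way to keep the all-zero rows raised in the last comment is tidyr's complete(), which adds the combinations missing from the data before reshaping. This is a minimal sketch and not part of the original answer. It assumes the three sample rows from this thread as input, adds Year to the grouping, and uses pivot_wider(), the newer equivalent of spread().

```r
library(dplyr)
library(tidyr)

# Sample rows taken from this thread (assumed input)
DF <- data.frame(
  Year = 2017,
  University = c("University College London", "University of Leeds",
                 "University of Leeds"),
  Degree = c("Social Science", "Social science", "Social science"),
  Gender = c("Male", "Non-Binary", "Female")
)

tab <- DF %>%
  count(Year, University, Degree, Gender) %>%
  # add every Year x University x Degree x Gender combination, with n = 0
  complete(Year, University, Degree, Gender, fill = list(n = 0L)) %>%
  # one column per gender
  pivot_wider(names_from = Gender, values_from = n)

print(tab)
```

Note that complete() builds every combination, including university/degree pairs that may not exist in practice, and that "Social Science" and "Social science" count as different degrees because the comparison is case-sensitive.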

【Solution 2】:

1) aggregate/model.matrix Try this one-line aggregate solution. It uses no packages.

aggregate(model.matrix(~ Gender + 0) ~ Year + University + Degree, DF, sum)

giving:

  Year                University         Degree GenderFemale GenderMale GenderNon-Binary
1 2017       University of Leeds Social science            1          0                1
2 2017 University College London Social Science            0          1                0
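To see why summing works here, it may help to look at model.matrix(~ Gender + 0) on its own. The sketch below is an illustration, not part of the original answer, and uses a small Gender column made up from the thread's sample rows. Each row becomes a 0/1 indicator vector with one column per gender level, so summing those indicators within each group gives the counts.

```r
# Gender values taken from the sample rows in this thread
DF <- data.frame(Gender = c("Male", "Non-Binary", "Female"))

# "+ 0" drops the intercept, so every level gets its own indicator column
mm <- model.matrix(~ Gender + 0, DF)

print(colnames(mm))  # column names are "Gender" pasted onto each level
print(mm)            # exactly one 1 per row
print(colSums(mm))   # per-level counts over all rows
```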

2) aggregate/cbind Alternatively, the model.matrix(...) part can be written out with cbind(...) like this, which may be clearer, though more tedious:

aggregate(cbind(Female = Gender == "Female", Male = Gender == "Male",
            `Non-Binary` = Gender == "Non-Binary") ~ Year + University + Degree, DF, sum)

giving the following, which is the same except for slightly different column names:

  Year                University         Degree Female Male Non-Binary
1 2017       University of Leeds Social science      1    0          1
2 2017 University College London Social Science      0    1          0

Note: The input used in the examples above, in reproducible form, is:

Lines <- "Year, University, Degree, Gender 
2017, University College London, Social Science, Male 
2017, University of Leeds, Social science, Non-Binary
2017, University of Leeds, Social science, Female"
DF <- read.csv(text = Lines, strip.white = TRUE)

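【Discussion】:

Assuming that "doesn't seem to work" means you just want the aggregated values rather than a full n-way table, I have revised the answer to use aggregate.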
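The follow-up in the question's edit (subsetting inside the plotting pipeline and drawing a line for every gender) is not answered in this thread. One possible approach is to filter and reshape to long format before calling ggplot(). This is a rough sketch under stated assumptions: the TFtab values below are made up for illustration, and the Gender and Count column names are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up stand-in for the asker's TFtab (values are illustrative only)
TFtab <- data.frame(
  Year = rep(2015:2017, times = 2),
  University = "University College London",
  Degree = rep(c("Social Science", "Biological Sciences"), each = 3),
  Female = c(3, 4, 6, 2, 2, 5),
  Male = c(2, 5, 4, 1, 3, 3),
  `Non-Binary` = c(0, 1, 1, 0, 0, 2),
  check.names = FALSE
)

# Subset and reshape in one pipeline: one row per Year/Degree/Gender
long <- TFtab %>%
  filter(University == "University College London") %>%
  pivot_longer(c(Female, Male, `Non-Binary`),
               names_to = "Gender", values_to = "Count")

# Colour separates degrees, line type separates genders
p <- ggplot(long, aes(x = Year, y = Count, color = Degree, linetype = Gender,
                      group = interaction(Degree, Gender))) +
  geom_line() +
  geom_point(size = 0.8) +
  labs(x = "Year of application", y = "Frequency",
       title = "UCL Applications by Degree and Gender") +
  theme_bw()
```

With 19 degree categories a single panel is likely to be crowded. Adding facet_wrap(~ Gender) and dropping the linetype mapping is one alternative.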