Combined frequency histogram using two attributes: answers

【Question title】: Combined frequency histogram using two attributes
【Posted】: 2016-02-07 15:18:11
【Question】:

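I am using ggplot2 to create histograms for two different parameters. My current approach is attached at the end of my question (including a dataset that can be used and loaded directly from pastebin.com). It creates:

A histogram visualizing the frequency of the spatial distribution of the logged user data, based on the "location" attribute (either "WITHIN" or "NOT_WITHIN").
A histogram visualizing the frequency of the distribution of the logged user data, based on the "context" attribute (either "Clicked A" or "Clicked B").

This looks like the following: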

# Load my example dataset from pastebin
RawDataSet <- read.csv("http://pastebin.com/raw/uKybDy03", sep=";")
# Load packages
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)

###### Create Frequency Table for Location-Information
LocationFrequency <- ddply(RawDataSet, .(UserEmail), summarize, 
                           All = length(UserEmail),
                           Within_area = sum(location=="WITHIN"),
                           Not_within_area = sum(location=="NOT_WITHIN"))
# Create a column for unique identifiers
LocationFrequency <- mutate(LocationFrequency, id = rownames(LocationFrequency))
# Reorder columns
LocationFrequency <- LocationFrequency[,c(5,1:4)]
# Format id-column as numbers (not as string)
LocationFrequency[,c(1)] <- sapply(LocationFrequency[, c(1)], as.numeric)
# Melt data
LocationFrequency.m = melt(LocationFrequency, id.var=c("UserEmail","All","id"))
# Plot data
p <- ggplot(LocationFrequency.m, aes(x=id, y=value, fill=variable)) +
  geom_bar(stat="identity") +
  theme_grey(base_size = 16)+
  labs(title="Histogram showing the distribution of all spatial information per user.") + 
  labs(x="User", y="Number of notifications interaction within/not within the area") +
  # using IDs instead of UserEmail
  scale_x_continuous(breaks = 1:30, labels = as.character(1:30))
# Change legend Title
p + labs(fill = "Type of location")



##### Create Frequency Table for Interaction-Information
InterationFrequency <- ddply(RawDataSet, .(UserEmail), summarize, 
                             All = length(UserEmail),
                             Clicked_A = sum(context=="Clicked A"),
                             Clicked_B = sum(context=="Clicked B"))
# Create a column for unique identifiers
InterationFrequency <- mutate(InterationFrequency, id = rownames(InterationFrequency))
# Reorder columns
InterationFrequency <- InterationFrequency[,c(5,1:4)]
# Format id-column as numbers (not as string)
InterationFrequency[,c(1)] <- sapply(InterationFrequency[, c(1)], as.numeric)
# Melt data
InterationFrequency.m = melt(InterationFrequency, id.var=c("UserEmail","All","id"))
# Plot data
p <- ggplot(InterationFrequency.m, aes(x=id, y=value, fill=variable)) +
  geom_bar(stat="identity") +
  theme_grey(base_size = 16)+
  labs(title="Histogram showing the distribution of all interaction types per user.") + 
  labs(x="User", y="Number of interaction") +
  # using IDs instead of UserEmail 
  scale_x_continuous(breaks = 1:30, labels = as.character(1:30))
# Change legend Title
p + labs(fill = "Type of interaction")

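But what I am trying to achieve is this: how can I combine both histograms in a single plot? Is it possible to somehow place the corresponding percentage on each part? Something like the following sketch, which represents the total number of observations per user (the full height of the bar) and uses different segments to visualize the corresponding data. Each bar would be divided into parts (within and not_within), and each part would then be divided into two subparts showing the percentage of each interaction type ("Clicked A" or "Clicked B").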

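【Comments】:

Could you incorporate the data into your post? Relying on external resources to stick around is naive at best. Simulate the data or use one of the existing datasets that ship with R or with one of the (common) packages.
By definition, a histogram can only show one variable (unless you put text labels on it). Are you looking for a mosaic plot?
@RomanLuštrik: Sorry about that. I thought the included pastebin link was the ideal way to keep the question manageable, since you can easily use my data via the provided link. Anyway... I will include a snippet of my dataset soon.
@alistaire: Interesting information ;) I searched for "mosaic plot", and that could do the trick, although I don't know how to visualize the different frequencies.
@Jaap Thanks for your answer. I have updated my question and commented on your answer. Maybe you can find some time to look at my reply and/or the edit of my question :)?

Tags: r ggplot2 histogram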

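【Solution 1】:

With the updated description, I would make a combined bar plot with two parts: a negative one and a positive one. To achieve that, you have to get your data into the correct format: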

# load needed libraries
library(dplyr)
library(tidyr)
library(ggplot2)

# summarise your data
new.df <- RawDataSet %>% 
  group_by(UserEmail,location,context) %>% 
  tally() %>%
  mutate(n2 = n * c(1,-1)[(location=="NOT_WITHIN")+1L]) %>%
  group_by(UserEmail,location) %>%
  mutate(p = c(1,-1)[(location=="NOT_WITHIN")+1L] * n/sum(n))

The new.df data frame looks as follows:

> new.df
Source: local data frame [90 x 6]
Groups: UserEmail, location [54]

   UserEmail   location   context     n    n2          p
      (fctr)     (fctr)    (fctr) (int) (dbl)      (dbl)
1      andre NOT_WITHIN Clicked A     3    -3 -1.0000000
2       bibi NOT_WITHIN Clicked A     4    -4 -0.5000000
3       bibi NOT_WITHIN Clicked B     4    -4 -0.5000000
4       bibi     WITHIN Clicked A     9     9  0.6000000
5       bibi     WITHIN Clicked B     6     6  0.4000000
6     corinn NOT_WITHIN Clicked A    10   -10 -0.5882353
7     corinn NOT_WITHIN Clicked B     7    -7 -0.4117647
8     corinn     WITHIN Clicked A     9     9  0.7500000
9     corinn     WITHIN Clicked B     3     3  0.2500000
10  dpfeifer NOT_WITHIN Clicked A     7    -7 -1.0000000
..       ...        ...       ...   ...   ...        ...
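
For readers who find the dplyr pipeline above dense, the same three quantities (the count n, the signed count n2, and the signed within-location share p) can be sketched in base R. The toy data frame below is invented for illustration and is not the asker's dataset; aggregate() and ave() are swapped in for group_by(), tally() and mutate(), so this is a restatement under those assumptions rather than the answer's own code.

```r
# Invented toy data (NOT the pastebin dataset); column names mirror the question
toy <- data.frame(
  UserEmail = c("u1", "u1", "u1", "u1", "u1", "u2", "u2", "u2"),
  location  = c("WITHIN", "WITHIN", "WITHIN", "NOT_WITHIN", "NOT_WITHIN",
                "WITHIN", "NOT_WITHIN", "NOT_WITHIN"),
  context   = c("Clicked A", "Clicked A", "Clicked B", "Clicked A", "Clicked B",
                "Clicked B", "Clicked A", "Clicked A"),
  stringsAsFactors = FALSE
)

# Count rows per user / location / context (the role tally() plays above)
counts <- aggregate(n ~ UserEmail + location + context,
                    data = transform(toy, n = 1L), FUN = sum)

# (location == "NOT_WITHIN") is logical; adding 1L maps FALSE/TRUE to 1/2,
# which indexes c(1, -1): WITHIN -> 1, NOT_WITHIN -> -1
sgn <- c(1, -1)[(counts$location == "NOT_WITHIN") + 1L]

# Signed count: NOT_WITHIN rows become negative so they plot below zero
counts$n2 <- counts$n * sgn

# Signed share of each context within its user/location group
group_total <- ave(as.numeric(counts$n), counts$UserEmail, counts$location,
                   FUN = sum)
counts$p <- sgn * counts$n / group_total

print(counts)
```

An arguably more readable spelling of the sign step is ifelse(location == "NOT_WITHIN", -1, 1); the indexing form is simply more compact.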

Next you can create a plot:

ggplot() +
  geom_bar(data = new.df[new.df$location == "NOT_WITHIN",],
           aes(x = UserEmail, y = n2, color = "darkgreen", fill = context),
           size = 1, stat = "identity", width = 0.7) +
  geom_bar(data = new.df[new.df$location == "WITHIN",],
           aes(x = UserEmail, y = n2, color = "darkred", fill = context),
           size = 1, stat = "identity", width = 0.7) +
  scale_y_continuous(breaks = seq(-20,20,5),
                     labels = c(20,15,10,5,0,5,10,15,20)) +
  scale_color_manual("Location of interaction",
                     values = c("darkgreen","darkred"),
                     labels = c("NOT_WITHIN","WITHIN")) +
  scale_fill_manual("Type of interaction",
                    values = c("lightyellow","lightblue"),
                    labels = c("Clicked A","Clicked B")) +
  guides(color = guide_legend(override.aes = list(color = c("darkred","darkgreen"),
                                                  fill = NA, size = 2), reverse = TRUE),
         fill = guide_legend(override.aes = list(fill = c("lightyellow","lightblue"),
                                                 color = "black", size = 0.5))) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 14),
        axis.title = element_blank(),
        legend.title = element_text(face = "italic", size = 14),
        legend.key.size = unit(1, "lines"),
        legend.text = element_text(size = 11))

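This gives a diverging bar chart per user, with the NOT_WITHIN counts drawn below zero and the WITHIN counts above it.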

If you want to use percentage values, you can plot the p column instead:

ggplot() +
  geom_bar(data = new.df[new.df$location == "NOT_WITHIN",],
           aes(x = UserEmail, y = p, color = "darkgreen", fill = context),
           size = 1, stat = "identity", width = 0.7) +
  geom_bar(data = new.df[new.df$location == "WITHIN",],
           aes(x = UserEmail, y = p, color = "darkred", fill = context),
           size = 1, stat = "identity", width = 0.7) +
  scale_y_continuous(breaks = c(-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1),
                     labels = scales::percent(c(1,0.75,0.5,0.25,0,0.25,0.5,0.75,1))) +
  scale_color_manual("Location of interaction",
                     values = c("darkgreen","darkred"),
                     labels = c("NOT_WITHIN","WITHIN")) +
  scale_fill_manual("Type of interaction",
                    values = c("lightyellow","lightblue"),
                    labels = c("Clicked A","Clicked B")) +
  coord_flip() +
  guides(color = guide_legend(override.aes = list(color = c("darkred","darkgreen"),
                                                  fill = NA, size = 2), reverse = TRUE),
         fill = guide_legend(override.aes = list(fill = c("lightyellow","lightblue"),
                                                 color = "black", size = 0.5))) +
  theme_minimal(base_size = 14) +
  theme(axis.title = element_blank(),
        legend.title = element_text(face = "italic", size = 14),
        legend.key.size = unit(1, "lines"),
        legend.text = element_text(size = 11))

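This gives the same diverging chart on a percentage scale, flipped so that the users run along the vertical axis.

In response to the comments

If you want to put text labels inside the bars, you have to calculate a position variable as well: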

new.df <- RawDataSet %>% 
  group_by(UserEmail,location,context) %>% 
  tally() %>%
  mutate(n2 = n * c(1,-1)[(location=="NOT_WITHIN")+1L]) %>%
  group_by(UserEmail,location) %>%
  mutate(p = c(1,-1)[(location=="NOT_WITHIN")+1L] * n/sum(n),
         pos = (context=="Clicked A")*p/2 + (context=="Clicked B")*(c(1,-1)[(location=="NOT_WITHIN")+1L] * (1 - abs(p)/2)))

Then add the following line to your ggplot code, after the geom_bar calls:

geom_text(data = new.df, aes(x = UserEmail, y = pos, label = n))

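This places the count of each segment inside the corresponding bar section.

Instead of label = n, you can also use label = scales::percent(abs(p)) to display the percentages.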
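
To check what the pos formula above computes, here is a small base R sketch that applies it to the four "bibi" rows from the printed new.df output. It assumes, as the formula does, that the "Clicked A" segment is stacked next to zero and that the shares on each side add up to 100%. Stacking order can differ between ggplot2 versions, so treat this as a check of the arithmetic, not a guarantee about where the labels land in your plot. The sprintf() call is a dependency-free stand-in for scales::percent().

```r
# The four "bibi" rows copied from the new.df printout above
bibi <- data.frame(
  location = c("NOT_WITHIN", "NOT_WITHIN", "WITHIN", "WITHIN"),
  context  = c("Clicked A", "Clicked B", "Clicked A", "Clicked B"),
  p        = c(-0.5, -0.5, 0.6, 0.4),
  stringsAsFactors = FALSE
)

# -1 for the side drawn below zero, +1 for the side drawn above it
sgn <- ifelse(bibi$location == "NOT_WITHIN", -1, 1)

# "Clicked A" spans 0..p, so its midpoint is p / 2.
# "Clicked B" ends at +/-1, so its midpoint is +/-(1 - |p| / 2).
bibi$pos <- (bibi$context == "Clicked A") * bibi$p / 2 +
  (bibi$context == "Clicked B") * sgn * (1 - abs(bibi$p) / 2)

# Percentage labels without the scales package
bibi$label <- sprintf("%.0f%%", 100 * abs(bibi$p))

print(bibi)
```

For these rows the label positions come out at -0.25, -0.75, 0.3 and 0.8, which are the midpoints of the four segments on the percentage axis.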

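【Comments】:

Thanks for your answer. I did not realize that my data is actually so complex that I cannot plot two different attributes in one plot. Thanks for the clarification. The plots in your answer are very useful, although they are not exactly what I am looking for. I have updated my question with a sketch of how the bars should be visualized. ... Nevertheless: if my "target plot" is not possible, would it be possible to place the percentage for each of the four values next to its bar section?
This is awesome :) Just a short question: running the code for the second plot gives me a warning message, "Stacking not well defined when ymin != 0", and the percentage values are displayed incorrectly and the bars are misaligned. The output looks like this: i.imgur.com/qy7MTOi.png. Do you know what is causing the problem?
@schlomm It took quite some time to tinker with it ;-). About the warning: I get the same warning, but that is because the code forces stacked bars to be drawn below 0. Which version of ggplot2 are you using (I am using R 3.2.3 & ggplot2 2.0.0)?
It's working :) Checking my ggplot version was a really useful tip. May I ask one last question? I tried to put the corresponding percentage value of each bar section into the plot using geom_text(data = new.df, aes(x = UserEmail, y = p, label = n), size = 4). However, as you may have noticed, I only managed to put the number of observations into the bars, not the percentages. Is this easy to achieve?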