Data labels for mean and percentiles in a distribution chart: answers

【Question title】: Data labels for mean and percentiles in a distribution chart
【Posted】: 2019-03-30 04:47:11
【Question description】:

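I am creating a custom chart to visualize the distribution of a variable with geom_density. I added 3 vertical lines: one for a custom value, one for the 5th percentile and one for the 95th percentile.

How do I add labels to these lines?

I tried using geom_text, but I don't know how to parameterize the x and y variables.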

library(ggplot2)

ggplot(dataset, aes(x = dataset$`Estimated percent body fat`)) + 
  geom_density() +
  geom_vline(aes(xintercept = dataset$`Estimated percent body fat`[12]), 
             color = "red", size = 1) +
  geom_vline(aes(xintercept = quantile(dataset$`Estimated percent body fat`,
                                       0.05, na.rm = TRUE)), 
             color = "grey", size = 0.5) +
  geom_vline(aes(xintercept = quantile(dataset$`Estimated percent body fat`,
                                       0.95, na.rm = TRUE)), 
             color="grey", size=0.5) +

  geom_text(aes(x = dataset$`Estimated percent body fat`[12], 
                label = "Custom", y = 0), 
            colour = "red", angle = 0)

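I would like to obtain the following:

For the custom value, I would like the label at the top of the chart, just to the right of the line
For the percentile labels, I would like them in the middle of the chart; to the left of the line for the 5th percentile and to the right of the line for the 95th percentile

This is what I was able to obtain: https://i.imgur.com/thSQwyg.png

These are the first 50 rows of my dataset: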

structure(list(`Respondent sequence number` = c(21029L, 21034L, 
21043L, 21056L, 21067L, 21085L, 21087L, 21105L, 21107L, 21109L, 
21110L, 21125L, 21129L, 21138L, 21141L, 21154L, 21193L, 21195L, 
21206L, 21215L, 21219L, 21221L, 21232L, 21239L, 21242L, 21247L, 
21256L, 21258L, 21287L, 21310L, 21325L, 21367L, 21380L, 21385L, 
21413L, 21418L, 21420L, 21423L, 21427L, 21432L, 21437L, 21441L, 
21444L, 21453L, 21466L, 21467L, 21477L, 21491L, 21494L, 21495L
), `Estimated percent body fat` = c(NA, 7.2, NA, NA, 24.1, 25.1, 
30.2, 23.6, 24.3, 31.4, NA, 14.1, 20.5, NA, 23.1, 30.6, 21, 20.9, 
NA, 24, 26.7, 16.6, NA, 26.9, 16.9, 21.3, 15.9, 27.4, 13.9, NA, 
20, NA, 12.8, NA, 33.8, 18.1, NA, NA, 28.4, 10.9, 38.1, 33, 39.3, 
15.9, 32.7, NA, 20.4, 16.8, NA, 29)), row.names = c(NA, 50L), class = 
"data.frame")

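【Question comments】:

Welcome to Stack Overflow! Could you make your problem reproducible by sharing a sample of your data so that others can help (please do not use str(), head() or a screenshot)? You can use the reprex and datapasta packages to assist you with that. See also Help me Help you & How to make a great R reproducible example?
@Luca Please edit your question to provide the additional information. Don't put it in the comments.
Thank you both. I have just added a chart example made with reprex and a sample of the dataset made with dput.
@Luca Are you strongly tied to ggplot? I find this easier to achieve with base plots.
@jay.sf I am happy to use base plots if they allow me to add the labels.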

Tags: r ggplot2

【Solution 1】:

First, I recommend clean column names.

dat <- dataset
names(dat) <- tolower(gsub("\\s", "\\.", names(dat)))

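With base R plotting you can do the following. The key is that you store the quantiles and the custom position when drawing the lines, so you can reuse them later as coordinates for the labels, which gives you dynamic positioning. I am not sure whether or how this is possible with ggplot.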

plot(density(dat$estimated.percent.body.fat, na.rm=TRUE), ylim=c(0, .05), 
     main="Density curve")
abline(v=c1 <- dat$estimated.percent.body.fat[12], col="red")
abline(v=q1 <- quantile(dat$estimated.percent.body.fat, .05, na.rm=TRUE), col="grey")
abline(v=q2 <- quantile(dat$estimated.percent.body.fat, .95, na.rm=TRUE), col="grey")
text(c1 + 4, .05, c(expression("" %<-% "custom")), cex=.8)
text(q1 - 5.5, .025, c(expression("5% percentile" %->% "")), cex=.8)
text(q2 + 5.5, .025, c(expression("" %<-% "95% percentile")), cex=.8)

Note: If you don't like the arrows, just use e.g. "5% percentile" instead of c(expression("5% percentile" %->% "")).

Alternatively, in ggplot you can use annotate.

library(ggplot2)
ggplot(dataset, aes(x = dataset$`Estimated percent body fat`)) + 
  geom_density() +
  geom_vline(aes(xintercept = dataset$`Estimated percent body fat`[12]), 
             color = "red", size = 1) +
  geom_vline(aes(xintercept = quantile(dataset$`Estimated percent body fat`,
                                       0.05, na.rm = TRUE)), 
             color = "grey", size = 0.5) +
  geom_vline(aes(xintercept = quantile(dataset$`Estimated percent body fat`,
                                       0.95, na.rm = TRUE)), 
             color="grey", size=0.5) +
  annotate("text", x=16, y=.05, label="custom") +
  annotate("text", x=9.5, y=.025, label="5% percentile") +
  annotate("text", x=38, y=.025, label="95% percentile")
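
The hard-coded x and y values in the annotate calls above could probably be derived from the data instead. The following is only a sketch, not part of the original answer: it assumes a stand-in vector `fat` (the first 20 values of the thread's data), stores the custom value and the quantiles once, and uses `hjust` to push each label to the left or right of its line. The mid-height `y_mid` is estimated with `density()`, which should be close to, but is not guaranteed to be identical with, the curve that geom_density draws.

```r
library(ggplot2)

# Stand-in for dataset$`Estimated percent body fat` (first 20 values).
fat <- c(NA, 7.2, NA, NA, 24.1, 25.1, 30.2, 23.6, 24.3, 31.4, NA, 14.1,
         20.5, NA, 23.1, 30.6, 21, 20.9, NA, 24)

# Store the positions once so lines and labels share the same coordinates.
custom <- fat[12]
q <- unname(quantile(fat, c(0.05, 0.95), na.rm = TRUE))

# Approximate half height of the density curve (assumption: default bandwidth).
y_mid <- max(density(fat, na.rm = TRUE)$y) / 2

p <- ggplot(data.frame(fat = fat), aes(x = fat)) +
  geom_density(na.rm = TRUE) +
  geom_vline(xintercept = custom, color = "red") +
  geom_vline(xintercept = q, color = "grey") +
  # y = Inf pins the label to the top of the panel; vjust moves it inside.
  annotate("text", x = custom, y = Inf, label = "custom",
           hjust = -0.1, vjust = 1.5, colour = "red") +
  # hjust > 1 places the text left of x, hjust < 0 places it right of x.
  annotate("text", x = q[1], y = y_mid, label = "5% percentile", hjust = 1.1) +
  annotate("text", x = q[2], y = y_mid, label = "95% percentile", hjust = -0.1)

print(p)
```

Because the offsets are expressed relative to the text width rather than in data units, they should not need retuning when the x range changes. A label that sits close to the panel edge may still be clipped, in which case widening the x limits is one option.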

Note that in either solution the result (i.e. the exact label positions) depends on your export size. To learn how to control this, have a look at e.g. How to save a plot as image on the disk?.
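
One way to keep the export size fixed is to open a graphics device with explicit dimensions before plotting. This is a minimal sketch, assuming a PNG-capable R build; the file name, the 7 x 5 inch size and the stand-in vector `fat` are illustrative choices, not values from the thread.

```r
# Stand-in for dataset$`Estimated percent body fat` (first 20 values).
fat <- c(NA, 7.2, NA, NA, 24.1, 25.1, 30.2, 23.6, 24.3, 31.4, NA, 14.1,
         20.5, NA, 23.1, 30.6, 21, 20.9, NA, 24)

out_file <- file.path(tempdir(), "density.png")  # hypothetical output path

# Fix width, height and resolution so label placement is reproducible.
png(out_file, width = 7, height = 5, units = "in", res = 150)
plot(density(fat, na.rm = TRUE), main = "Density curve")
abline(v = fat[12], col = "red")
# pos = 4 draws the label to the right of the given coordinate.
text(fat[12], mean(par("usr")[3:4]), "custom", pos = 4, cex = 0.8)
invisible(dev.off())  # close the device so the file is written
```

For a ggplot object, ggsave() accepts comparable width, height and dpi arguments.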

Data

dataset <- structure(list(`Respondent sequence number` = c(21029L, 21034L, 
21043L, 21056L, 21067L, 21085L, 21087L, 21105L, 21107L, 21109L, 
21110L, 21125L, 21129L, 21138L, 21141L, 21154L, 21193L, 21195L, 
21206L, 21215L, 21219L, 21221L, 21232L, 21239L, 21242L, 21247L, 
21256L, 21258L, 21287L, 21310L, 21325L, 21367L, 21380L, 21385L, 
21413L, 21418L, 21420L, 21423L, 21427L, 21432L, 21437L, 21441L, 
21444L, 21453L, 21466L, 21467L, 21477L, 21491L, 21494L, 21495L
), `Estimated percent body fat` = c(NA, 7.2, NA, NA, 24.1, 25.1, 
30.2, 23.6, 24.3, 31.4, NA, 14.1, 20.5, NA, 23.1, 30.6, 21, 20.9, 
NA, 24, 26.7, 16.6, NA, 26.9, 16.9, 21.3, 15.9, 27.4, 13.9, NA, 
20, NA, 12.8, NA, 33.8, 18.1, NA, NA, 28.4, 10.9, 38.1, 33, 39.3, 
15.9, 32.7, NA, 20.4, 16.8, NA, 29)), row.names = c(NA, 50L), class = 
"data.frame")

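【Comments】:

Thanks @jay.sf, and that is a very good point about storing the quantiles to reuse them later as coordinates; I will do that as soon as the label code works. Your example works, but I would like the positioning of the labels to be dynamic. I want to use the same code with different datasets (the datasets are generated dynamically), and in other datasets the right y position could be 0.07 or 0.03, etc.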
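In the base solution you do get dynamic positioning, don't you?
Unfortunately, I don't have dynamic positioning in my base solution. Dynamic positioning is the part I really want to add.
In e.g. text(c1 + 4..., c1 depends dynamically on your custom value. I would consider that dynamic. The + 4 depends on the label length, which always stays the same.
Good point. I think the main problem is the y positioning: the maximum of the y axis is 0.05 now, but in a different dataset it could be 0.1. Regarding the x axis, I agree that your solution is dynamic, but with a different dataset (e.g. a maximum of 30 instead of 40) the + 4 becomes a bit too much or too little.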
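
The y positioning concern raised in the last comment could be handled in base R by reading the axis limits from the finished plot instead of hard-coding 0.05 and 0.025. This is a sketch under assumptions, not code from the thread: `fat` is a stand-in vector, the plot is drawn without a fixed ylim, and the `pos` argument of text() replaces the + 4 and - 5.5 offsets so that the gap no longer depends on the x range.

```r
# Stand-in for dat$estimated.percent.body.fat (first 20 values).
fat <- c(NA, 7.2, NA, NA, 24.1, 25.1, 30.2, 23.6, 24.3, 31.4, NA, 14.1,
         20.5, NA, 23.1, 30.6, 21, 20.9, NA, 24)

custom <- fat[12]
q <- unname(quantile(fat, c(0.05, 0.95), na.rm = TRUE))

plot(density(fat, na.rm = TRUE), main = "Density curve")
abline(v = custom, col = "red")
abline(v = q, col = "grey")

# par("usr") returns c(xmin, xmax, ymin, ymax) of the current plot region.
usr <- par("usr")
y_top <- usr[4] - 0.05 * (usr[4] - usr[3])  # just below the top edge
y_mid <- mean(usr[3:4])                     # vertical middle of the plot

# pos = 2 puts the text left of the point, pos = 4 puts it to the right;
# the gap is measured in character widths, not in data units.
text(custom, y_top, "custom", pos = 4, cex = 0.8, col = "red")
text(q[1], y_mid, "5% percentile", pos = 2, cex = 0.8)
text(q[2], y_mid, "95% percentile", pos = 4, cex = 0.8)
```

Since both y values are fractions of the plotted range, the same code should place the labels sensibly whether the y axis tops out at 0.03 or at 0.1.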