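【Question title】: Plotting yearly comparison & time distribution in ggplot2 R
【Posted】: 2020-10-16 05:03:31
【Question】:

I am trying to make a ggplot of the following data, which contains information on when (date and time) people (identified by id) synced their data to a server. For simplicity, I have removed the date variable.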

district id year_sync time_sync
    A   1   2020    12:03:19
    A   2   2020    14:33:23
    A   3   2020    13:14:30
    A   4   2020    12:37:07
    A   5   2020    12:45:48
    A   6   2020    02:26:57
    A   7   2020    08:10:03
    A   8   2020    12:08:15
    A   9   2020    15:21:52
    A   10  2020    17:42:33
    A   11  2020    14:23:29
    A   12  2020    23:18:19
    A   13  2020    12:39:14
    A   14  2020    11:31:33
    A   15  2020    13:00:14
    A   16      
    A   17      
    A   18      
    A   19      
    A   20      
    A   21      
    B   22      
    B   23      
    B   24      
    B   25      
    B   26      
    B   27      
    B   28      
    B   29      
    B   30      
    B   31  2019    12:39:31
    B   32  2019    11:44:39
    B   33  2019    10:18:20
    B   34  2019    18:11:48
    B   35  2019    17:22:32
    B   36  2019    12:17:23
    B   37  2019    12:58:30
    B   38  2019    18:50:29
    B   39  2019    12:58:52
    B   40  2019    21:12:36
    B   41  2019    15:57:53
    B   42  2019    12:52:44
    B   43  2019    14:10:48
    B   44  2019    15:40:08
    B   45  2019    14:34:07
    B   46  2019    02:40:28
    B   47  2019    01:37:05
    B   48  2019    14:36:01
    B   49  2019    11:19:45
    B   50  2019    15:33:42
    B   51  2019    21:00:49
    A   52  2020    15:02:01
    A   53  2020    20:28:23
    A   54  2020    17:02:37
    A   55  2020    15:01:24
    A   56  2020    11:29:02
    A   57  2020    18:31:05
    A   58  2020    12:07:51
    A   59  2020    13:00:11
    A   60  2020    09:35:08
    A   61  2020    18:25:53
    B   62  2020    18:12:51
    B   63  2020    14:26:31
    B   64  2020    14:46:51
    B   65  2020    18:04:50
    B   66  2020    07:08:21
    B   67  2020    14:37:16
    B   68  2020    11:56:24
    B   69  2020    13:19:34
    B   70  2019    15:34:24
    B   71  2019    15:02:03
    B   72  2019    11:05:08
    B   73  2019    16:11:18
    A   74  2019    23:51:36
    A   75  2019    13:30:46
    A   76  2019    12:28:43
    A   77  2019    12:38:56
    A   78  2019    11:22:05
    A   79  2019    15:03:20
    A   80  2019    11:27:34

I want to plot a yearly comparison, i.e. how many IDs synced data in 2020 vs. 2019. For that I used the following code:

df1 <- df %>%
     group_by(year_sync) %>%
     dplyr::summarize(non_na_count = sum(!is.na(year_sync))) %>% ## I only want to calculate % based on non-missing values 
     setNames(., c('year', 'count')) %>%
     mutate('share' = count/sum(count), label = paste0(round(share*100, 2), '%'))

     ggplot(df1, aes(y=count, x=year)) +
       geom_bar(stat='identity',
                #color = "black"
                #fill = c("aquamarine4", "bisque3"),
                position = "dodge") +
       geom_text(aes(label = label),
                 position = position_stack(vjust = 1.05),
                 size = 3) +
       xlab ("Year")   +
       ylab ("Number of People")  +
       theme_minimal() +
       theme(plot.title = element_text(hjust = 0.5, face = "bold"),
             plot.subtitle = element_text(hjust = 0.5, face = "italic"))

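This doesn't quite work, because my x-axis shows 2018.0, 2018.5, etc. (as in my plot). I want the x-axis to show only 2019 and 2020.

Note: the plot is based on the original dataset, so don't worry about matching the percentages.

I need help with the following: 1.1 Fixing my x-axis (ADDRESSED)

1.2 A facet grid by district, where the proportions (for the labels) are calculated from the total observations within each district. (PENDING)

1.3 Fix fill: I want the bars in different colors. However, somehow the fill currently isn't working. (ADDRESSED)

I would also like to plot the distribution of time_sync to understand when people usually sync their data. However, I have not been able to do so. (ADDRESSED)

EDIT for point 1.2: I am trying the following code: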

df2 <-
    df %>% dplyr::filter(!is.na(year_sync)) ## filtering NAs

df3 <- df2 %>%
      group_by(district) %>%
      dplyr::mutate(ssum = n()) %>%
      dplyr::count(year_sync, ssum)  %>% 
      mutate(percent = n / ssum,
             label = paste0(round(percent*100, 2), '%')) ## to calculate % based on total number of IDs in each district

Plot

    ggplot(df3, aes(y=ssum, x=factor(year), fill=district)) +
      geom_bar(stat='identity',
               #color='black',
               position = position_dodge(width=0.8), width=0.8) +
      geom_text(aes(label = label, y=count+10),
                position = position_dodge(width=0.8),
                size = 3) +
      xlab ("Year")   +
      ylab ("Number of People")  +
      scale_fill_manual(values=c("aquamarine4", "bisque3")) +
      theme_minimal() +
      theme(plot.title = element_text(hjust = 0.5, face = "bold"),
            plot.subtitle = element_text(hjust = 0.5, face = "italic"))

However, I get the following error: Error in unique.default(x, nmax = nmax) : unique() applies only to vectors. Can anyone tell me what's wrong?

Thanks!

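【Discussion】:

For 1, replace x=year with x=factor(year); for 2, add + facet_grid(factor(district)~.); for 3, you need a new column that holds the colors.
Thanks, Yingw! I will try your suggestions. However, could you clarify what you mean by a new color column? It would be helpful if you could elaborate on why the current color code doesn't work / what is going wrong.

Tags: r ggplot2 bar-chart histogram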

【Solution 1】:

This is a two-in-one question, so here is a two-in-one solution:

Fixing the bar chart

To clarify how to address the three points in your plot:

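Fix the x-axis. Since df1$year is of class int, the x-axis is treated as a numeric/continuous axis, which is why "2019.5" makes sense to ggplot. One way to fix this is simply to tell ggplot that it needs to treat df1$year as a discrete axis, which can be done by coercing year to a factor. You can do this before the ggplot() call, or inline by specifying x=factor(year) instead of x=year inside aes().
Facet grid by district. You can use facet_grid() for this, but you also need to group your dataset by district. That means adjusting some of the code you use to turn df into df1 (adding the extra column name and adding district to your group_by() call). You can then add a call to facet_grid(), passing . ~ district to split the districts into columns, or district ~ . to split them into rows.
Fix the fill color. ggplot works on the principle that a different color should convey some new piece of information about your plot. So if you want the column fill to differ between columns, it should be tied to something in your dataset. Here I assume you want a different color for each district. To let ggplot handle that, you need to put fill= inside the aesthetics (aes()) and link it to the district column of your dataset. You can then accept the default colors or specify them with scale_fill_manual(values=...).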
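As a quick illustration of the first point, here is a minimal sketch on a hypothetical two-row data frame (the name toy and its counts are made up for the example, not taken from your data). It only shows that the same column gives a continuous x scale when left as an integer and a discrete one when wrapped in factor():

```r
library(ggplot2)

# Hypothetical summary table; the counts are placeholders, not real results.
toy <- data.frame(year = c(2019L, 2020L), count = c(35L, 33L))

# Integer year: ggplot picks a continuous x scale and may add breaks such as 2019.5.
p_num <- ggplot(toy, aes(x = year, y = count)) +
  geom_col()

# factor(year): ggplot picks a discrete x scale with one tick per level.
p_fac <- ggplot(toy, aes(x = factor(year), y = count)) +
  geom_col()
```

Printing p_num and p_fac side by side should make the difference in the axis visible.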

Putting all of that together, here is the new code to go from the original dataset to the new plot:

library(dplyr)
library(ggplot2)

df1 <- df %>%
  group_by(district, year_sync) %>%
  dplyr::summarize(non_na_count = sum(!is.na(year_sync))) %>% ## I only want to calculate % based on non-missing values 
  setNames(., c('district', 'year', 'count')) %>%
  mutate('share' = count/sum(count), label = paste0(round(share*100, 2), '%'))


ggplot(df1, aes(y=count, x=factor(year), fill=district)) +
  geom_bar(stat='identity', color='black') +
  # note I've pushed the labels up slightly using count+1.
  # also note you don't want to use position="stack" here for the text.
  geom_text(aes(label = label, y=count+1), size = 3) +
  xlab ("Year")   +
  ylab ("Number of People")  +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic")) +
  scale_fill_manual(values=c("aquamarine4", "bisque3")) +
  facet_grid(. ~ district)

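[Bonus] A different bar chart?

Although it isn't part of your question, I would also suggest using "dodging" to show the two districts instead of faceting. Depending on the point of the plot, dodged columns are a better way to compare districts for any given x value (year). The code changes only slightly, and only in the plotting part. The main thing to note is that you need to use position=position_dodge() and specify the dodge for both geom_bar() and geom_text(). Both will use the fill= aesthetic here as the column in the dataset to "dodge" on: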

ggplot(df1, aes(y=count, x=factor(year), fill=district)) +
  geom_bar(stat='identity', color='black',
           position = position_dodge(width=0.8), width=0.8) +
  geom_text(aes(label = label, y=count+1),
            position = position_dodge(width=0.8), size = 3) +
  xlab ("Year")   +
  ylab ("Number of People")  +
  scale_fill_manual(values=c("aquamarine4", "bisque3")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"))

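Plotting a histogram of the time distribution

To do this, you have to make sure your df$time_sync column is in a recognizable "date" or "datetime" format. @yingw was close, but not quite, because the column needs to be converted with as.POSIXct() to work. After that, you can plot the histogram simply by using geom_histogram() and setting your x= aesthetic to the converted df$time_sync column. The problem you will run into is that the datetime axis now includes both the date and the time by default, even though your data only has times. To drop the date part and show only the time, I use the scales library to control the formatting, via scale_x_datetime() with date_format() for the labels and the date_breaks argument for the breaks.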

library(scales)

df %>% dplyr::filter(!is.na(time_sync)) %>%
  ggplot(aes(as.POSIXct(time_sync, format = "%H:%M:%S"))) +
  geom_histogram(color='black', fill='bisque3') +
  scale_x_datetime(labels=date_format("%H:%M:%S"), date_breaks="3 hours") +
  xlab('Time of Day')
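If the inline conversion is hard to debug, the same idea can be sketched in separate steps on a small hypothetical data frame (toy, times and time_ct are names I made up; the blank string stands in for a missing sync time, and tz = "UTC" is my own choice so that the parsed times match the default time zone of date_format()). This is only a sketch under those assumptions, not a drop-in replacement:

```r
library(dplyr)
library(ggplot2)
library(scales)

# Hypothetical stand-in for a few rows of df; "" and NA both mean "never synced".
toy <- data.frame(
  id = 1:5,
  time_sync = c("12:03:19", "02:26:57", "", "23:18:19", NA),
  stringsAsFactors = FALSE
)

times <- toy %>%
  mutate(time_sync = na_if(time_sync, "")) %>%   # treat blank strings as NA
  filter(!is.na(time_sync)) %>%                  # keep only rows with a sync time
  mutate(time_ct = as.POSIXct(time_sync, format = "%H:%M:%S", tz = "UTC"))

# The date part is filled in with today's date; only the time is shown on the axis.
p <- ggplot(times, aes(x = time_ct)) +
  geom_histogram(bins = 24, color = "black", fill = "bisque3") +
  scale_x_datetime(labels = date_format("%H:%M:%S"), date_breaks = "3 hours") +
  xlab("Time of Day")
p
```

Keeping the converted column in the data frame makes it easy to check class(times$time_ct) and nrow(times) before anything is plotted.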

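【Discussion】:

Thanks a lot, chemdork! Especially for the bonus plot! Three things, however: 1. After introducing x = factor(year) in my plot, ggplot also plots the NA values for me, labeled 0%; I want to get rid of those. 2. In my bonus plot / regular district plot: I want to calculate the percentages based on the total number of people in the district (both years combined). So in our example, of the X people in district A, x1% synced in 2020 and x2% synced in 2019. Using group_by(district, year_sync) doesn't give me the desired result.
3. When plotting the time distribution on my original dataset using your code, I get the following error: Error: Aesthetics must be either length 1 or the same as the data (406): x
Is this the same dataset you shared here, or a different one? The dataset df you shared has 80 observations, but it looks like your data for #3 has 406. I suggest going step by step: 1. Filter out the NA values. 2. Convert the time to POSIXct. 3. Call ggplot with the resulting data frame, applying the x aesthetic to your POSIXct column.
For point #1, I would suggest something similar: filter out your NA values first, then run the plot on the filtered dataset.
For #2, this is probably best asked as a separate question. I'm not sure I follow without seeing what is going on.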
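For what it's worth, the "filter first, then summarise" suggestion could look roughly like the sketch below. It runs on a hypothetical toy data frame with the same column names as the question, and it assumes the denominator should be the number of ids with a non-missing year_sync in each district (as in the df3 attempt in the question); that assumption may not match what is actually wanted:

```r
library(dplyr)

# Hypothetical stand-in for df; NA means the id never synced.
toy <- data.frame(
  district  = c("A", "A", "A", "A", "B", "B", "B"),
  id        = 1:7,
  year_sync = c(2020, 2020, 2019, NA, 2019, 2020, NA)
)

shares <- toy %>%
  filter(!is.na(year_sync)) %>%                    # drop the NA rows before counting
  count(district, year_sync, name = "count") %>%   # ids per district and year
  group_by(district) %>%                           # shares are taken within a district
  mutate(share = count / sum(count),
         label = paste0(round(share * 100, 2), "%")) %>%
  ungroup()
```

A table like shares can then be plotted with aes(x = factor(year_sync), y = count, fill = district). Note that the column names in aes() have to match the table: the df3 in the question has year_sync and n, not year and count. One possible cause of the unique() error quoted there, offered only as a guess, is that year resolved to a function from an attached package (lubridate has one) and factor() was called on that function.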

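【Solution 2】:

For the first question

Replace x=year with x=factor(year)
Add + facet_grid(factor(district)~.)
You need a new column that holds the colors, or fill = district

For the second question, you may want to use geom_histogram() together with the strptime function, e.g.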

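df %>%
    filter(!is.na(time_sync)) %>%
    # strptime() returns POSIXlt; as.POSIXct() turns it into a plain vector for ggplot
    ggplot(aes(as.POSIXct(strptime(time_sync, format = "%H:%M:%S")))) +
    geom_histogram()   # layers are added with +, not %>%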

【Discussion】: