Add column to dataframe depending on specific row values (2): answers

【Question title】: Add column to dataframe depending on specific row values (2)
【Posted】: 2016-08-09 09:59:44
【Question】:

I have to adapt some code that works perfectly on a different data.frame with similar conditions.

Here is a sample of my data.frame:

df <- read.table(text = 'ID    Day Count
    33012   9526    4
    35004   9526    4
    37006   9526    4
    37008   9526    4
    21009   1913    3
    24005   1913    3
    25009   1913    3
    22317   2286    2
    37612   2286    2
    25009   14329   1
    48007   9527    0
    88662   9528    0
    1845    9528    0
    8872    2287    0
    49002   1914    0
    1664    1915    0', header = TRUE)

I need to add a new column (new_col) to my data.frame containing values from 1 to 4. Each new_col value must cover day (x), day (x + 1) and day (x + 2), where x = 9526, 1913, 2286, 14329 (column Day).

My output should be as follows:

   ID    Day Count  new_col
33012   9526    4     1
35004   9526    4     1
37006   9526    4     1
37008   9526    4     1
21009   1913    3     2
24005   1913    3     2
25009   1913    3     2
22317   2286    2     3
37612   2286    2     3
25009   14329   1     4
48007   9527    0     1
88662   9528    0     1
1845    9528    0     1
8872    2287    0     3
49002   1914    0     2
1664    1915    0     2

Sorted by new_col, the data.frame would be:

   ID    Day Count  new_col
33012   9526    4     1
35004   9526    4     1
37006   9526    4     1
37008   9526    4     1
48007   9527    0     1
88662   9528    0     1
1845    9528    0     1
21009   1913    3     2
24005   1913    3     2
25009   1913    3     2
49002   1914    0     2
1664    1915    0     2
22317   2286    2     3
37612   2286    2     3
8872    2287    0     3
25009   14329   1     4

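My real data.frame is more complex than the example (i.e. it has more columns and more values in the Count column).

In my previous question (Add column to dataframe depending on specific row values), @mrbrick suggested the following code: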

x <- c(1913, 2286, 9526, 14329) 
df$new_col <- cut(df$Day, c(-Inf, x, Inf))
df$new_col <- as.numeric(factor(df$new_col, levels=unique(df$new_col)))

But it only works for day x, day x - 1 and day x - 2.

Any suggestions would be really helpful.

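【Question comments】:

Try df$new_col <- cut(df$Day, c(-Inf, x, Inf), right=F) in the cut command.
Do you have more values of df$Day? Are the values belonging to different groups always far apart from each other?
Do you know all the x values you want in the Day column?

Tags: r dataframe add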

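【Solution 1】:

Assuming the Day values in the different sequences are such that dropping the last two digits of Day identifies each group, convert what remains to a factor with sequence numbers as labels. No packages are used.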

 g <- df$Day %/% 100
 u <- unique(g)
 transform(df, new_col = factor(g, levels = u, labels = seq_along(u)))

giving:

      ID   Day Count new_col
1  33012  9526     4       1
2  35004  9526     4       1
3  37006  9526     4       1
4  37008  9526     4       1
5  21009  1913     3       2
6  24005  1913     3       2
7  25009  1913     3       2
8  22317  2286     2       3
9  37612  2286     2       3
10 25009 14329     1       4
11 48007  9527     0       1
12 88662  9528     0       1
13  1845  9528     0       1
14  8872  2287     0       3
15 49002  1914     0       2
16  1664  1915     0       2
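
Note that new_col above is a factor. If a plain integer column is wanted instead, a minimal sketch is to use match() against the unique group keys. This is a variation and not part of the original answer; it rests on the same assumption that dropping the last two digits of Day identifies each group. The sample data is rebuilt here only so that the snippet runs on its own.

```r
# Rebuild the question's sample data so the snippet is self-contained
df <- read.table(text = 'ID    Day Count
33012   9526    4
35004   9526    4
37006   9526    4
37008   9526    4
21009   1913    3
24005   1913    3
25009   1913    3
22317   2286    2
37612   2286    2
25009   14329   1
48007   9527    0
88662   9528    0
1845    9528    0
8872    2287    0
49002   1914    0
1664    1915    0', header = TRUE)

# Group key: Day with its last two digits dropped (assumption from the answer)
g <- df$Day %/% 100

# match() gives each key's position among the unique keys, in order of
# first appearance, so the result is an integer group number
df$new_col <- match(g, unique(g))
```

Because the numbering follows order of first appearance, the groups are numbered in the order their first day occurs in df, not in increasing order of Day.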

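Another possibility is to replace the g <- ... line with one of the following:

(a) Known number of groups: use kmeans with the appropriate number of clusters: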

g <- kmeans(df$Day, 4)$cluster

(b) Set the centers manually and use them to start kmeans:

centers <-  c(1913, 2286, 9526, 14329) + 1
g <- kmeans(df$Day, centers)$cluster

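(c) Derive centers by checking for x-1 and x-2. If a day x has neither x-1 nor x-2 in the data, then x must be the first in its sequence, so we pick out those values and add 1 to get the centers. Unlike (a), which requires knowing the number of clusters, and (b), which requires knowing the actual sequences, this requires knowing neither.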

centers <- with(df, unique(Day[ ! ((Day-1) %in% Day) & ! ((Day-2) %in% Day) ]) + 1)
g <- kmeans(df$Day, centers)$cluster

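(d) A simplification of the last point. If we are guaranteed that x, x+1 and x+2 all appear whenever x is the first in a sequence, then x is the first in its sequence exactly when x-1 is absent, so (c) simplifies to: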

# assumes x, x+1, x+2 all appear for each sequence
centers <- with(df, unique(Day[ ! (Day-1) %in% Day ]) + 1)
g <- kmeans(df$Day, centers)$cluster

The kmeans solutions should work if the groups are sufficiently well separated, and based on the data shown in the question they appear to be.
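
To show how the pieces fit together, here is a minimal end-to-end sketch of variant (c) on the question's sample data. It assumes, as stated above, that the groups are well separated. The match() step at the end is a substitute for the factor() call and is not part of the original answer; it yields an integer column.

```r
# Rebuild the question's sample data so the snippet is self-contained
df <- read.table(text = 'ID    Day Count
33012   9526    4
35004   9526    4
37006   9526    4
37008   9526    4
21009   1913    3
24005   1913    3
25009   1913    3
22317   2286    2
37612   2286    2
25009   14329   1
48007   9527    0
88662   9528    0
1845    9528    0
8872    2287    0
49002   1914    0
1664    1915    0', header = TRUE)

# A day is the start of a sequence if neither Day-1 nor Day-2 is present;
# adding 1 puts the initial center in the middle of x, x+1, x+2
starts  <- with(df, unique(Day[!((Day - 1) %in% Day) & !((Day - 2) %in% Day)]))
centers <- starts + 1

# Start kmeans from those centers; the cluster ids follow the order of centers
g <- kmeans(df$Day, centers)$cluster

# Renumber clusters by order of first appearance in df
df$new_col <- match(g, unique(g))
```

Since kmeans is started from explicit centers here, no random initialisation is involved, unlike variant (a), where set.seed() may be worth calling for reproducibility.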

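【Comments】:

【Solution 2】:

Using base R, you can create a lookup data.frame containing the days you want (x, x+1, x+2) and the new_col value you want for each of them, then merge this data.frame with your original one.

This works as long as you know in advance all the x values in Day.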

df <- read.table(text = 'ID    Day Count
    33012   9526    4
                 35004   9526    4
                 37006   9526    4
                 37008   9526    4
                 21009   1913    3
                 24005   1913    3
                 25009   1913    3
                 22317   2286    2
                 37612   2286    2
                 25009   14329   1
                 48007   9527    0
                 88662   9528    0
                 1845    9528    0
                 8872    2287    0
                 49002   1914    0
                 1664    1915    0', header = TRUE)
# identify the day you want (x variable in your example)
x <- c(9526, 1913, 2286, 14329)
# create new_col for each x as you wish, and repeat for x + i, then rbind the results data.frame
new_col_df <- do.call(rbind, 
                      lapply(seq(0, 2, by = 1), 
                             function(add) data.frame(x = x + add, new_col = seq_along(x))
                             )
                      )
# merge with the original df
output_df <- merge(df, new_col_df, by.x = "Day", by.y = "x")
# ordered output is
output_df[order(output_df$new_col),]
#>      Day    ID Count new_col
#> 9   9526 33012     4       1
#> 10  9526 35004     4       1
#> 11  9526 37006     4       1
#> 12  9526 37008     4       1
#> 13  9527 48007     0       1
#> 14  9528 88662     0       1
#> 15  9528  1845     0       1
#> 1   1913 21009     3       2
#> 2   1913 24005     3       2
#> 3   1913 25009     3       2
#> 4   1914 49002     0       2
#> 5   1915  1664     0       2
#> 6   2286 22317     2       3
#> 7   2286 37612     2       3
#> 8   2287  8872     0       3
#> 16 14329 25009     1       4
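
Two things to be aware of with merge(): by default it performs an inner join, so rows whose Day is not in the lookup table are dropped, and it sorts the result by the key and moves Day to the first column. If the original row and column order should be kept, a possible alternative (a sketch, not part of the original answer) is to index the lookup table with match(). The name lookup below is purely illustrative.

```r
# Rebuild the question's sample data so the snippet is self-contained
df <- read.table(text = 'ID    Day Count
33012   9526    4
35004   9526    4
37006   9526    4
37008   9526    4
21009   1913    3
24005   1913    3
25009   1913    3
22317   2286    2
37612   2286    2
25009   14329   1
48007   9527    0
88662   9528    0
1845    9528    0
8872    2287    0
49002   1914    0
1664    1915    0', header = TRUE)

# Start days, in the order their group numbers should be assigned
x <- c(9526, 1913, 2286, 14329)

# Lookup table (hypothetical name): x, x+1 and x+2 all map to the index of x
lookup <- data.frame(Day     = c(x, x + 1, x + 2),
                     new_col = rep(seq_along(x), times = 3))

# match() finds each Day in the lookup; days not listed there become NA
df$new_col <- lookup$new_col[match(df$Day, lookup$Day)]

# Sorted view, as requested in the question
df_sorted <- df[order(df$new_col), ]
```

Unmatched days end up as NA in new_col instead of disappearing, which makes them easy to spot with is.na(df$new_col).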

【Comments】: