Repeating blocks of rows in a data frame based on another value in the data frame

【Question title】: Repeating blocks of rows in a data frame based on another value in the data frame
【Posted】: 2017-02-20 16:07:44
【Question】:

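There are plenty of questions here about repeating rows a fixed number of times in R, but I could not find one that addresses the specific problem I am asking about.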

I have a data frame of survey responses in which each respondent answered between 5 and 10 questions. As a toy example:

df <- data.frame(ID = rep(1:2, each = 5),
                 Response = sample(LETTERS[1:4], 10, replace = TRUE),
                 Weight = rep(c(2,3), each = 5))

> df
   ID Response Weight
1   1        D      2
2   1        C      2
3   1        D      2
4   1        D      2
5   1        B      2
6   2        D      3
7   2        C      3
8   2        B      3
9   2        D      3
10  2        B      3

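I want to repeat respondent 1's answers twice, as a block, and then repeat respondent 2's answers 3 times, as a block, with each block of answers getting its own unique ID. In other words, I want the final result to look like this: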

     ID Response Weight
1    11        D      2
2    11        C      2
3    11        D      2
4    11        D      2
5    11        B      2
6    12        D      2
7    12        C      2
8    12        D      2
9    12        D      2
10   12        B      2
11   21        D      3
12   21        C      3
13   21        B      3
14   21        D      3
15   21        B      3
16   22        D      3
17   22        C      3
18   22        B      3
19   22        D      3
20   22        B      3
21   23        D      3 
22   23        C      3
23   23        B      3
24   23        D      3
25   23        B      3

The way I currently do this is really clunky, and with more than 3,000 respondents in my data set it is unbearably slow.

Here is my code:

df.expanded <- NULL
for(i in unique(df$ID)) {
  x <- df[df$ID == i,]
  y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
  y$order <- rep(1:max(x$Weight), nrow(x))
  y <- y[with(y, order(order)),]
  y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
  df.expanded <- rbind(df.expanded, y)
}

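Is there a faster way to do this?

【Comments on the question】:

May I ask why you need to perform a task like this?
Sure. I am running a latent class conditional logit analysis on the responses (in the real data set they are 1/0, not the letters above). In Stata, where I actually run the analysis, lclogit does not accept weights, so I am expanding the data by the inverse probability weights I have.
To repeat ID 1 twice: df[df$ID==1,][rep(seq_len(nrow(df[df$ID==1,])), 2), ]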

Tags: r dataframe

【Solution 1】:

There is a simpler solution. Judging from your code, I assume you want to replicate the rows according to Weight.

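# Repeat each row Weight times (the copies of a row end up adjacent)
df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
# Append the copy number (1..Weight) to the ID; ID becomes a character column
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')

# Sort the rows so that each copy forms one contiguous block
df2 <- df2[order(df2$ID), ]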

Is this approach faster? Let's see:

library(microbenchmark)

microbenchmark(
    m1 = {
        df.expanded <- NULL
        for(i in unique(df$ID)) {
            x <- df[df$ID == i,]
            y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
            y$order <- rep(1:max(x$Weight), nrow(x))
            y <- y[with(y, order(order)),]
            y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
            df.expanded <- rbind(df.expanded, y)
        }
    },
    m2 = {
        df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
        df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')

        # sort the rows
        df2 <- df2[order(df2$ID), ]
    }
)

# Unit: microseconds
#  expr     min      lq      mean   median       uq      max neval
#    m1 806.295 862.460 1101.6672 921.0690 1283.387 2588.730   100
#    m2 171.731 194.199  245.7246 214.3725  283.145  506.184   100

There may well be other, more efficient approaches.
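One thing to watch when scaling this to thousands of respondents: pasting ID and copy number with no separator can give different blocks the same key (ID 1, copy 11 and ID 11, copy 1 both become "111"), and sorting the resulting character IDs is lexical rather than numeric. The following is a minimal base R sketch that sidesteps both, assuming it is acceptable to keep the copy number in its own column. The names idx, Copy and BlockID are illustrative and do not come from the answer above.

```r
# Toy data in the same shape as the question (Response values are arbitrary).
df <- data.frame(ID = rep(1:2, each = 5),
                 Response = c("D", "C", "D", "D", "B", "D", "C", "B", "D", "B"),
                 Weight = rep(c(2, 3), each = 5))

# Repeat every row Weight times and record which copy each repeated row is.
idx  <- rep(seq_len(nrow(df)), df$Weight)  # original row position of each copy
copy <- sequence(df$Weight)                # 1..Weight for every original row

expanded <- df[idx, ]
expanded$Copy <- copy

# Sort numerically by respondent, then copy, then original row position,
# so each copy is a contiguous block in the original question order.
expanded <- expanded[order(expanded$ID, expanded$Copy, idx), ]

# A separator keeps the combined key unambiguous ("1_11" vs "11_1").
expanded$BlockID <- paste(expanded$ID, expanded$Copy, sep = "_")
rownames(expanded) <- NULL
```

Keeping ID and Copy as separate numeric columns also means the block key can be rebuilt in whatever form the downstream tool expects.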

【Discussion】:

【Solution 2】:

Another option is to use data.table.

Assuming you start with a data.table named DT, try:

library(data.table)
DT[, list(.id = rep(seq(Weight[1]), each = .N), Weight, Response), .(ID)]

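Rather than pasting the ID columns together, I created a helper column. That seems a bit more flexible to me.

Test data. Change n to create a larger data set to experiment with.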

set.seed(1)
n <- 5
weights <- sample(3:15, n, TRUE)
df <- data.frame(ID = rep(seq_along(weights), weights),
                 Response = sample(LETTERS[1:5], sum(weights), TRUE),
                 Weight = rep(weights, weights))
DT <- as.data.table(df)
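The one-liner above relies on data.table recycling the shorter Weight and Response columns up to the length of .id within each group. As far as I know, some newer data.table releases are stricter about implicit recycling in grouped results, so this is an assumption about version behaviour rather than something the answer states. A sketch that spells the repetition out with rep(), and should not depend on that behaviour, could look like this (the toy DT and the name res are illustrative):

```r
library(data.table)

# Toy data in the same shape as the question (Response values are arbitrary).
DT <- data.table(ID = rep(1:2, each = 5),
                 Response = c("D", "C", "D", "D", "B", "D", "C", "B", "D", "B"),
                 Weight = rep(c(2, 3), each = 5))

res <- DT[, {
  w <- Weight[1]  # the weight is constant within a respondent
  list(.id      = rep(seq_len(w), each = .N),  # copy number: 1,1,...,2,2,...
       Weight   = rep(Weight, times = w),      # whole block repeated w times
       Response = rep(Response, times = w))
}, by = ID]
```

Each (ID, .id) pair then identifies one block, so the pair can be used directly as a grouping key without building a pasted ID.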

【Discussion】: