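【Posted】: 2020-09-14 02:48:27
【Question】:
I want to do pairwise comparisons within each group and return the rows that do not match, along with which columns differ. Below is a sample data set to illustrate the problem; my real data has more rows and columns.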
data=structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), Common_1 = c("A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), Common_2 = c("C", "C", "C", "C", "C", "D",
"D", "D", "D", "D", "C", "C", "C", "C", "C", "D", "D", "D", "D",
"D"), Common_3 = c("X", "X", "X", "X", "X", "X", "X", "X", "X",
"X", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"), G = c(0,
1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0), var_1 = c(1,
3, 3, 3, 3, 1, 3, 2, 4, 3, 5, 5, 3, 4, 5, 1, 3, 5, 1, 4), var_2 = c("lev1",
"lev1", "lev2", "lev2", "lev1", "lev2", "lev2", "lev1", "lev1",
"lev2", "lev2", "lev2", "lev2", "lev1", "lev1", "lev1", "lev1",
"lev1", "lev2", "lev2"), var_3 = c("on", "on", "on", "off", "off",
"on", "on", "on", "off", "off", "on", "on", "on", "off", "off",
"on", "on", "on", "off", "off"), var_4 = c("up", "up", "down",
"down", "up", "down", "up", "down", "up", "up", "up", "up", "down",
"down", "up", "up", "up", "up", "down", "down")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
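ID is a unique identifier; Common_1, Common_2 and Common_3 are the grouping variables; G is the indicator whose two levels I want to compare; the remaining columns var_1:var_4 are the ones used to determine differences. The procedure is to compare each row with G=0 against every row with G=1 in the same group and, if any var column differs, return the mismatched ID pair together with which columns differ.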
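The procedure described above can be sketched as a cross join within each group followed by a column-wise comparison. This is a minimal base R sketch under stated assumptions, not a definitive implementation: the helper name `pairwise_diff`, the `_g1` suffix, and the `toy` data frame (rows 1 to 7 of the sample data) are introduced here for illustration only.

```r
# Toy data: rows 1-7 of the sample data, as a plain data.frame.
toy <- data.frame(
  ID = 1:7,
  Common_1 = "A",
  Common_2 = c("C", "C", "C", "C", "C", "D", "D"),
  Common_3 = "X",
  G = c(0, 1, 0, 0, 1, 1, 0),
  var_1 = c(1, 3, 3, 3, 3, 1, 3),
  var_2 = c("lev1", "lev1", "lev2", "lev2", "lev1", "lev2", "lev2"),
  var_3 = c("on", "on", "on", "off", "off", "on", "on"),
  var_4 = c("up", "up", "down", "down", "up", "down", "up"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pair every G == 0 row with every G == 1 row of the
# same group, then flag each var column with 1 (differs) or 0 (same).
pairwise_diff <- function(df, group_cols, var_cols) {
  g0 <- df[df$G == 0, c("ID", group_cols, "G", var_cols)]
  g1 <- df[df$G == 1, c(group_cols, "ID", var_cols)]
  names(g1) <- c(group_cols, "ID_diff", paste0(var_cols, "_g1"))

  # Merging on the grouping columns yields all G0 x G1 pairs per group.
  pairs <- merge(g0, g1, by = group_cols)

  for (v in var_cols) {
    pairs[[v]] <- as.integer(pairs[[v]] != pairs[[paste0(v, "_g1")]])
  }

  # Keep only pairs with at least one differing column, in a stable order.
  pairs <- pairs[rowSums(pairs[var_cols]) > 0, ]
  pairs <- pairs[order(pairs$ID, pairs$ID_diff),
                 c("ID", group_cols, "G", var_cols, "ID_diff")]
  rownames(pairs) <- NULL
  pairs
}

var_cols <- grep("^var", names(toy), value = TRUE)
group_cols <- grep("^Common", names(toy), value = TRUE)
res <- pairwise_diff(toy, group_cols, var_cols)
print(res)
```

Missing values would need separate handling in this sketch, because `!=` returns NA when either side is NA.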
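Here is the desired result for Common_1=A, Common_2=C, Common_3=X: the ID of the G=0 row, all grouping variables, the ID of the mismatching G=1 row (ID_diff), and indicator variables showing which columns differ.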
results=structure(list(ID = c(1, 1, 3, 3, 4, 4), Common_1 = c("A", "A",
"A", "A", "A", "A"), Common_2 = c("C", "C", "C", "C", "C", "C"
), Common_3 = c("X", "X", "X", "X", "X", "X"), G = c(0, 0, 0,
0, 0, 0), var_1 = c(1, 1, 0, 0, 0, 0), var_2 = c(0, 0, 1, 1,
1, 1), var_3 = c(0, 1, 0, 1, 1, 0), var_4 = c(0, 0, 1, 1, 1,
1), ID_diff = c(2, 5, 2, 5, 2, 5)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
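Update: added an explanation of the results
I am doing pairwise comparisons between G=0 and G=1. The first two rows of the result are derived as follows:
Same group: Common_1=A, Common_2=C, Common_3=X
First compare ID=1 and ID=2
var_1 differs, so a 1 is placed in the var_1 column and the rest are 0. ID_diff=2 because that is the ID that differs from ID=1
Then compare ID=1 and ID=5
var_1 and var_3 differ, so each of those columns gets a 1 and the rest are 0. ID_diff=5 because that is the ID that differs from ID=1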
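The two comparisons walked through above can be checked directly. The following is a small base R sketch; the helper `diff_flags` and the data frame `grp` (the A/C/X group, IDs 1 to 5, with the grouping columns left out) are hypothetical names used only for this illustration.

```r
# The A/C/X group from the sample data (IDs 1-5), as a plain data.frame.
grp <- data.frame(
  ID    = 1:5,
  G     = c(0, 1, 0, 0, 1),
  var_1 = c(1, 3, 3, 3, 3),
  var_2 = c("lev1", "lev1", "lev2", "lev2", "lev1"),
  var_3 = c("on", "on", "on", "off", "off"),
  var_4 = c("up", "up", "down", "down", "up"),
  stringsAsFactors = FALSE
)
var_cols <- grep("^var", names(grp), value = TRUE)

# Hypothetical helper: compare two rows (looked up by ID) column by column.
# Returns a named integer vector: 1 = the column differs, 0 = it matches.
diff_flags <- function(df, id_a, id_b, var_cols) {
  a <- df[df$ID == id_a, var_cols]
  b <- df[df$ID == id_b, var_cols]
  vapply(var_cols,
         function(col) as.integer(a[[col]] != b[[col]]),
         integer(1))
}

diff_flags(grp, 1, 2, var_cols)  # only var_1 is flagged
diff_flags(grp, 1, 5, var_cols)  # var_1 and var_3 are flagged
```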
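I tried writing a function that loops over each case with G=0 and compares it with each case with G=1, but I am having trouble extracting the mismatch information. Any help is appreciated.
Ronak Shah's solution produces the right values, but I cannot get the results to display correctly.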
> var_col <- grep('^var', names(data))
>
> apply_fun <- function(tmp) {
+ df1 <- subset(tmp, G == 0)
+ df2 <- subset(tmp, G == 1)
+ lapply(seq(nrow(df1)), function(x) {
+ df3 <- df1[rep(x, nrow(df2)), ]
+ df3$ID_diff <- df2$ID
+ df3[var_col] <- +(df1[rep(x, nrow(df2)), var_col] != df2[var_col])
+ df3
+ })
+ }
>
>
> library(dplyr)
> data %>%
+ group_by(across(starts_with('Common'))) %>%
+ summarise(data = apply_fun(cur_data_all())) %>%
+ ungroup %>%
+ select(data) %>%
+ tidyr::unnest(data)
`summarise()` regrouping output by 'Common_1', 'Common_2', 'Common_3' (override with `.groups` argument)
# A tibble: 22 x 10
ID Common_1 Common_2 Common_3 G var_1[,1] [,2] [,3] [,4] var_2[,1] [,2] [,3] [,4] var_3[,1] [,2] [,3] [,4] var_4[,1] [,2]
<dbl> <chr> <chr> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 A C X 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0
2 1 A C X 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
3 3 A C X 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1
4 3 A C X 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1
5 4 A C X 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1
6 4 A C X 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1
7 7 A D X 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0
8 8 A D X 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
9 9 A D X 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
10 10 A D X 0 1 0 1 1 1 0 1 1 1 0 1 1 1 0
# ... with 12 more rows, and 3 more variables: [,3] <int>, [,4] <int>, ID_diff <dbl>
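The header `var_1[,1] [,2] [,3] [,4]` in the printout suggests that each var column holds a whole 4-column matrix instead of a single 0/1 vector. One possible cause, stated here as an assumption and not verified against the tibble version in use, is that assigning the comparison matrix into a tibble stores the full matrix in every selected column. The sketch below sidesteps this by converting each group to a plain data.frame before the assignment. The names `apply_fun_df` and `toy` are hypothetical, and `split()` plus `lapply()` stand in for the dplyr pipeline so the example runs in base R.

```r
# Toy data: rows 1-7 of the sample data.
toy <- data.frame(
  ID = 1:7,
  Common_1 = "A",
  Common_2 = c("C", "C", "C", "C", "C", "D", "D"),
  Common_3 = "X",
  G = c(0, 1, 0, 0, 1, 1, 0),
  var_1 = c(1, 3, 3, 3, 3, 1, 3),
  var_2 = c("lev1", "lev1", "lev2", "lev2", "lev1", "lev2", "lev2"),
  var_3 = c("on", "on", "on", "off", "off", "on", "on"),
  var_4 = c("up", "up", "down", "down", "up", "down", "up"),
  stringsAsFactors = FALSE
)
var_cols <- grep("^var", names(toy), value = TRUE)
group_cols <- grep("^Common", names(toy), value = TRUE)

# Hypothetical variant of apply_fun that works on a plain data.frame.
apply_fun_df <- function(tmp, var_cols) {
  tmp <- as.data.frame(tmp)  # drop the tibble class (assumed culprit)
  df1 <- tmp[tmp$G == 0, ]
  df2 <- tmp[tmp$G == 1, ]
  if (nrow(df1) == 0 || nrow(df2) == 0) return(NULL)
  out <- lapply(seq_len(nrow(df1)), function(i) {
    df3 <- df1[rep(i, nrow(df2)), ]  # repeat the G == 0 row once per G == 1 row
    df3$ID_diff <- df2$ID
    # Comparing two data frames gives a logical matrix; unary + makes it 0/1.
    df3[var_cols] <- +(df3[var_cols] != df2[var_cols])
    df3
  })
  do.call(rbind, out)
}

# One data frame per combination of the grouping columns.
groups <- split(toy, toy[group_cols], drop = TRUE)
res <- do.call(rbind, unname(lapply(groups, apply_fun_df, var_cols = var_cols)))
res <- res[order(res$ID, res$ID_diff), ]
rownames(res) <- NULL
print(res)
```

If this assumption holds, the same `as.data.frame()` conversion inside the original `apply_fun` may be enough to get flat columns from the dplyr pipeline as well.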
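【Comments】:
- I am having trouble understanding your question. Can you explain how you get the output in results?
- I edited the question to include an explanation of the desired result. Thank you for taking the time to look at it.
- Within a group (Common_1=A, Common_2=C, Common_3=X is one group), you want to compare all rows with G = 0 against all rows with G = 1 and return only the rows where a difference is observed?
- Yes, along with which columns differ and the G=1 ID.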