Map one Dataframe to a second Dataframe

【Question title】: Map one Dataframe to a second Dataframe
【Posted】: 2018-12-27 12:23:36
【Question】:

I have two data frames and want to map one onto the other, putting the binary value 1 where a category is present and 0 otherwise.

1st DF

id       1_1   1_2   1_3   1_4   1_5   1_6   1_7   1_8   1_9   1_10  1_freq
111.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
112.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
113.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
114.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
115.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
116.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

2nd DF

id                 cats
111.cats           1,7,1
112.cats           1,1,2|1,3,2
113.cats           1,10,1|1,6,2
114.cats           1,4,2
115.cats           1,5,1
116.cats           1,1,2|1,8,1

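In the cats column of the second DF, the first row holds 1,7,1. Here the 1 and the 7 combine to name the column 1_7, which receives the binary value 1 while the remaining columns receive 0, and the last number goes to the 1_freq column. If a row has more than one category, such as 1,10,1|1,6,2, then 1,10,1 maps to column 1_10 and 1,6,2 maps to column 1_6, and the frequencies of both categories are added up and stored in the 1_freq column.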
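The rule above can be sketched for a single cats string in base R. The helper name `parse_cats` is hypothetical (it is not part of the question); it only illustrates which columns get a 1 and what ends up in 1_freq.

```r
# Hypothetical helper: turn one 'cats' string into target column names
# and the summed frequency.
parse_cats <- function(x) {
  # split "1,10,1|1,6,2" into its "|"-separated triples
  triples <- strsplit(x, "|", fixed = TRUE)[[1]]
  # split each triple into its three comma-separated parts
  parts <- strsplit(triples, ",", fixed = TRUE)
  # the first two parts name the column, e.g. "1" and "10" give "1_10"
  cols <- vapply(parts, function(p) paste(p[1], p[2], sep = "_"), character(1))
  # the third part is the frequency; sum it over all triples
  freq <- sum(vapply(parts, function(p) as.integer(p[3]), integer(1)))
  list(cols = cols, freq = freq)
}

res <- parse_cats("1,10,1|1,6,2")
res$cols  # "1_10" "1_6"
res$freq  # 3
```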

The resulting DF should look like this:

id       1_1   1_2   1_3   1_4   1_5   1_6   1_7   1_8   1_9   1_10  1_freq
111.txt  0     0     0     0     0     0     1     0     0     0     1
112.txt  1     0     1     0     0     0     0     0     0     0     4
113.txt  0     0     0     0     0     1     0     0     0     1     3
114.txt  0     0     0     1     0     0     0     0     0     0     2
115.txt  0     0     0     0     1     0     0     0     0     0     1
116.txt  1     0     0     0     0     0     0     1     0     0     3

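I hope the question is clear. Thank you.

【Comments】:

Yes, thanks for pointing that out... edited it.

Tags: r dplyr gsub stringr

【Solution 1】:

Here is an option using the tidyverse. We expand the rows of the dataset by splitting the 'cats' column at |, then use separate to split 'cats' into two columns at the last ,. Grouped by 'id', we take the sum of the 'freq' column, extract the number at the end of 'cats', convert it to a factor with levels specified, create a column of 1s ('val'), and spread it to 'wide' format.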

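library(tidyverse)
o1 <- df2 %>% 
       # one row per "|"-separated entry
       separate_rows(cats, sep = "[|]") %>% 
       # split at the last comma: 'cats' keeps e.g. "1,7", 'freq' gets the count
       separate(cats, into = c('cats', 'freq'), 
           sep=",(?=[^,]+$)", convert = TRUE) %>%
       group_by(id) %>%
       mutate(freq = sum(freq),   # total frequency per id
              cats = factor(str_extract(cats, "\\d+$"), levels = 1:10), 
              val = 1)  %>% 
       # one column per category that occurs in the data, 0 where absent
       spread(cats, val, fill = 0) %>% 
       # prefix every column except 'id' with "1_"
       rename_at(-1, ~ paste0('1_', .))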
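The split at the last comma is the least obvious step, so here it is in isolation. This is a base R sketch rather than the tidyverse call used in the answer: `strsplit()` with `perl = TRUE` stands in for `separate()`, and `sub()` stands in for `str_extract()`.

```r
x <- c("1,7,1", "1,10,1")

# ",(?=[^,]+$)" matches a comma only when no further comma follows it,
# i.e. the last comma in the string
parts <- strsplit(x, ",(?=[^,]+$)", perl = TRUE)

prefix <- vapply(parts, `[`, character(1), 1)             # "1,7" "1,10"
freq   <- as.integer(vapply(parts, `[`, character(1), 2)) # 1 1

# keep only the digits after the remaining comma: the category number
category <- sub(".*,", "", prefix)                        # "7" "10"
```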

Now we assign the values to the columns that the result shares with the initial dataset ('df1'):

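# columns that never occur in df2 (here 1_2 and 1_9) stay at 0
df1[is.na(df1)] <- 0
# assignment is by position, so both data frames must list the ids in the same order
df1[names(o1)[-1]] <- o1[-1]
df1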
#       id 1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10 1_freq
#1 111.txt   0   0   0   0   0   0   1   0   0    0      1
#2 112.txt   1   0   1   0   0   0   0   0   0    0      4
#3 113.txt   0   0   0   0   0   1   0   0   0    1      3
#4 114.txt   0   0   0   1   0   0   0   0   0    0      2
#5 115.txt   0   0   0   0   1   0   0   0   0    0      1
#6 116.txt   1   0   0   0   0   0   0   1   0    0      3
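As a possible variant, the same table can be built with `pivot_wider()`, which supersedes `spread()` in tidyr. This is a sketch, not the answer's method, and it assumes tidyr 1.2.0 or later for the `names_expand` argument. The names `grp`, `cat` and `wide` are illustrative. Because all factor levels are expanded, the result already contains every column from 1_1 to 1_10 and does not need df1 as a template.

```r
library(dplyr)
library(tidyr)

df2 <- data.frame(
  id   = c("111.cats", "112.cats", "113.cats", "114.cats", "115.cats", "116.cats"),
  cats = c("1,7,1", "1,1,2|1,3,2", "1,10,1|1,6,2", "1,4,2", "1,5,1", "1,1,2|1,8,1"),
  stringsAsFactors = FALSE
)

wide <- df2 %>%
  separate_rows(cats, sep = "[|]") %>%
  # three parts per entry: group, category, frequency
  separate(cats, into = c("grp", "cat", "freq"), sep = ",", convert = TRUE) %>%
  group_by(id) %>%
  mutate(freq = sum(freq)) %>%
  ungroup() %>%
  mutate(cat = factor(cat, levels = 1:10), val = 1) %>%
  select(-grp) %>%
  # names_expand = TRUE creates a column for every factor level, used or not
  pivot_wider(names_from = cat, values_from = val, values_fill = 0,
              names_expand = TRUE, names_prefix = "1_") %>%
  rename(`1_freq` = freq) %>%
  relocate(`1_freq`, .after = last_col())

# align the ids with df1 by rewriting the suffix
wide$id <- sub("\\.cats$", ".txt", wide$id)
wide
```

Note that this sketch assumes each category occurs at most once per id, as in the sample data; repeated categories would need to be collapsed before pivoting.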

Data

df1 <- structure(list(id = c("111.txt", "112.txt", "113.txt", "114.txt", 
"115.txt", "116.txt"), `1_1` = c(NA, NA, NA, NA, NA, NA), `1_2` = c(NA, 
NA, NA, NA, NA, NA), `1_3` = c(NA, NA, NA, NA, NA, NA), `1_4` = c(NA, 
NA, NA, NA, NA, NA), `1_5` = c(NA, NA, NA, NA, NA, NA), `1_6` = c(NA, 
NA, NA, NA, NA, NA), `1_7` = c(NA, NA, NA, NA, NA, NA), `1_8` = c(NA, 
NA, NA, NA, NA, NA), `1_9` = c(NA, NA, NA, NA, NA, NA), `1_10` = c(NA, 
NA, NA, NA, NA, NA), `1_freq` = c(NA, NA, NA, NA, NA, NA)),
    class = "data.frame", row.names = c(NA, 
-6L))

df2 <- structure(list(id = c("111.cats", "112.cats", "113.cats", "114.cats", 
"115.cats", "116.cats"), cats = c("1,7,1", "1,1,2|1,3,2", "1,10,1|1,6,2", 
"1,4,2", "1,5,1", "1,1,2|1,8,1")), class = "data.frame", row.names = c(NA, 
-6L))

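【Comments】:

Thanks, it works... I had been struggling with this... thanks

【Solution 2】:

Although the question is tagged with dplyr, I was curious what a data.table answer would look like.

Since df1 is filled with NA except for the id column, and the id columns differ only in their suffix (txt vs. cats), the answer below suggests creating df1 entirely from the data contained in df2: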

library(data.table)
library(magrittr)
long <- setDT(df2)[, strsplit(cats, "[|]"), by = id][
  , c(.(id = id), tstrsplit(V1, ","))][
    , V3 := factor(V3, levels = 1:10)]
df1 <- dcast(long, id ~ V3, function(x) pmax(1, length(x)), 
             value.var = "V3", drop = FALSE, fill = 0)[
               long[, sum(as.integer(V4)), by = id], on = "id", freq := V1][
                 , id := stringr::str_replace(id, "cats$", "txt")][
                   , setnames(.SD, names(.SD)[-1], paste0("1_", names(.SD)[-1]))]
df1

        id 1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10 1_freq
1: 111.txt   0   0   0   0   0   0   1   0   0    0      1
2: 112.txt   1   0   1   0   0   0   0   0   0    0      4
3: 113.txt   0   0   0   0   0   1   0   0   0    1      3
4: 114.txt   0   0   0   1   0   0   0   0   0    0      2
5: 115.txt   0   0   0   0   1   0   0   0   0    0      1
6: 116.txt   1   0   0   0   0   0   0   1   0    0      3

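Explanation

After coercion to data.table, df2 is reshaped from its "stringified" wide format to long form by first splitting the cats column at "|" and then splitting the comma-separated parts into separate columns V2 to V4.

Then V3 is converted from character to factor so that the order of the columns is preserved when dcast() reshapes from long to wide format again. Because the OP asked for a 1 whenever at least one combination is present, the custom function definition function(x) pmax(1, length(x)) is used here instead of simply length. In an update join, the sum of the frequencies is appended as the freq column. Finally, "cats" in the id column is replaced by "txt", and the column names (except the id column) are prefixed with "1_".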
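Why the factor conversion matters can be seen in a small base R sketch, independent of data.table. Category numbers stored as character strings sort as text, and only the values that occur are known; a factor with explicit levels fixes both the order and the full set of columns.

```r
cats <- c("10", "6")

# as plain character, "10" sorts before "6"
sort(cats)           # "10" "6"

# with explicit levels, the order is numeric and unused levels are kept
f <- factor(cats, levels = 1:10)
counts <- table(f)

names(counts)        # "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
as.integer(counts)   # 0 0 0 0 0 1 0 0 0 1
```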

Data

df2 <- data.table::fread("id                 cats
111.cats           1,7,1
112.cats           1,1,2|1,3,2
113.cats           1,10,1|1,6,2
114.cats           1,4,2
115.cats           1,5,1
116.cats           1,1,2|1,8,1", data.table = FALSE)
