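Posted: 2021-09-04 04:21:50
Question:
I have a data.table that is currently in wide format and very large (over 20,000 rows). About 20-30 of its columns hold a value of 0 or 1, and I need to combine them into a single column. I could use melt or do.call(paste()), but I don't know how to tell which column each 0 or 1 came from. My current process, which works, updates each column individually so that every 1 becomes a string equal to that column's name, and then uses do.call(paste()) to merge all of those columns into one. I feel there must be a more elegant way to do this, but I can't think of one. Is there a better approach than my current one (shown below)?
A heavily trimmed-down data.table: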
dput(head(photos01a))
structure(list(photo_name = c("BENT-5023-2-150927-CHECK1 (1).JPG",
"BENT-5023-2-150927-CHECK1 (10).JPG", "BENT-5023-2-150927-CHECK1 (100).JPG",
"BENT-5023-2-150927-CHECK1 (101).JPG", "BENT-5023-2-150927-CHECK1 (102).JPG",
"BENT-5023-2-150927-CHECK1 (103).JPG"), BAAS = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), BIRD = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), CADO = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
CAFA = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), CALA = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), CALL = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), CEEL = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), Crew = c("CREW", NA, NA, NA, NA, NA)), row.names = c(NA,
-6L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000012d1fe71ef0>)
Current approach:
library(data.table)
library(magrittr) # provides %>%, used in the final step
# Adjust the species columns so they can be combined into one column
photos01a[BAAS == 1, BAAS := "BAAS"]
photos01a[BIRD == 1, BIRD := "BIRD"]
photos01a[CADO == 1, CADO := "CADO"]
photos01a[CAFA == 1, CAFA := "CAFA"]
photos01a[CALA == 1, CALA := "CALA"]
photos01a[CALL == 1, CALL := "CALL"]
photos01a[CEEL == 1, CEEL := "CEEL"]
photos01a[Crew == 1, Crew := "CREW"]
# Create a list of all the species columns
species_cols <- c("BAAS", "BIRD", "CADO", "CAFA", "CALA", "CALL", "CEEL", "Crew")
# Merge the species columns into one column; any photo with more than one species tagged will have the 4-letter codes pushed together (e.g. TASP and TADO become TASPTADO)
photos01a[, "org_species" := .(col_test = do.call(paste, c(replace(.SD, is.na(.SD), ""), sep = ""))), .SDcols = species_cols]
# Separate out species for photos with multiple tags
photos01a[, "species_1" := substr(org_species, 1, 4)]
photos01a[nchar(org_species) > 4, "species_2" := substr(org_species, 5, 8)]
photos01a[nchar(org_species) > 8, "species_3" := substr(org_species, 9, nchar(org_species))]
# Bring those separate species back into one column and get rid of unneeded columns
photos01 <- photos01a %>%
  melt(measure.vars = c("species_1", "species_2", "species_3"), # Pivot table so there
       value.name = "species",                                  # is only one species column,
       value.factor = TRUE,                                     # and photos with 2 tags are
       variable.name = "tag_order",                             # duplicated with one tag in each
       variable.factor = TRUE,
       na.rm = TRUE)
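The question mentions melt but not how to recover which column a value came from. As a rough sketch (not the poster's method), melt records the source column name in the column named by `variable.name`, so filtering the long table for 1s may already give one species per row. The toy table `photos` and the names `present`, `long`, and `tagged` are made up for illustration, and the sketch assumes the species columns hold integer 0/1 as described.

```r
library(data.table)

# Hypothetical toy data in the 0/1 form described in the question
photos <- data.table(
  photo_name = c("a.JPG", "b.JPG", "c.JPG"),
  BAAS = c(1L, 0L, 0L),
  BIRD = c(0L, 1L, 0L),
  Crew = c(1L, 0L, 0L)
)

# Every column except the photo name is treated as a species flag (assumption)
species_cols <- setdiff(names(photos), "photo_name")

# melt() writes the name of the source column into `species`
long <- melt(photos,
             id.vars = "photo_name",
             measure.vars = species_cols,
             variable.name = "species",
             value.name = "present")

# Keep only the tagged rows; each row now names the column the 1 came from
tagged <- long[present == 1L, .(photo_name, species)]

# Number the tags within each photo, similar in spirit to tag_order above
tagged[, tag_order := seq_len(.N), by = photo_name]
```

Note that a photo with no tags at all (c.JPG here) drops out of `tagged`, and that the label is the column name itself ("Crew"), not the upper-case "CREW" used in the current approach; both would need handling if that behavior matters.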
Comments:
-
Are you sure the data you shared is correct? There are no 1/0 values in it.
-
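Sorry, I did share the wrong version. The data is the same, with any NA value being 0 and every other value being 1. This is what I want the data to look like in the end.
Tags: r data.table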
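For anyone wanting to reproduce the 0/1 input from the shared dput, the mapping described in this comment (NA is 0, anything else is 1) could be applied roughly as follows. This is only a sketch on a made-up miniature table named `shared`, not the poster's actual data.

```r
library(data.table)

# Hypothetical miniature of the shared (already relabelled) data
shared <- data.table(
  photo_name = c("a.JPG", "b.JPG"),
  BAAS = c(NA_character_, "BAAS"),
  Crew = c("CREW", NA_character_)
)
species_cols <- c("BAAS", "Crew")

# NA becomes 0 and any non-NA value becomes 1, replacing the columns in place
shared[, (species_cols) := lapply(.SD, function(x) as.integer(!is.na(x))),
       .SDcols = species_cols]
```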