Here is a dplyr solution that uses left_join()... but otherwise relies entirely on vectorized operations, which should be considerably more efficient than looping over a large dataset.
While the code might appear long, that is merely a formatting choice: for the sake of clarity, I use
foo(
arg_1 = bar,
arg_2 = baz,
# ...
arg_n = qux
)
rather than the one-liner foo(bar, baz, qux). Also for the sake of clarity, I will elaborate on the line
# Map each row to its house ID.
House_id = data[row_number()[target][cumsum(target)]],
in the Details section.
Solution
Given a file like subset.txt, reproduced here
H18105265_0
R1_0
Mab_3416311514210525745_W923650.80
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W123650.80
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W923650.80
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W923650.80
Mab_5325411210110485554_W723650.80
T1_0
T2_0
T3_0
T4_0
and a reference dataset like df, reproduced here
df <- tibble::tribble(
~House_id, ~id, ~new_weight,
18105265, "Mab", 4567,
18117631, "Maa", 3367,
18121405, "Mab", 4500,
71811763, "Maa", 2455,
71811763, "Mab", 2872
)
the following solution
# For manipulating data.
library(dplyr)
# ...
# Code to generate your reference 'df'.
# ...
# Specify the filepath.
text_filepath <- "subset.txt"
# Define the textual pattern for each data item we want, where the relevant
# values are divided into their own capture groups.
regex_house_id <- "(H)(\\d+)(_)(\\d)"
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
# Read the textual data (into a dataframe).
data.frame(data = readLines(text_filepath)) %>%
# Transform the textual data.
mutate(
# Target (TRUE) the identifying row (house ID) for each (contiguous) group.
target = grepl(
# Use the textual pattern for house IDs.
pattern = regex_house_id,
x = data
),
# Map each row to its house ID.
House_id = data[row_number()[target][cumsum(target)]],
# Extract the underlying numeric ID from the house ID.
House_id = gsub(
pattern = regex_house_id,
# The numeric ID is in the 2nd capture group.
replacement = "\\2",
x = House_id
),
# Treat the numeric ID as a number.
House_id = as.numeric(House_id),
# Target (TRUE) the weighted labels.
target = grepl(
# Use the textual pattern for weighted labels.
pattern = regex_weighted_label,
x = data
),
# Extract the ID from (only) the weighted labels.
id = if_else(
target,
gsub(
pattern = regex_weighted_label,
# The ID is in the 1st capture group.
replacement = "\\1",
x = data
),
# For any data that is NOT a weighted label, give it a blank (NA) ID.
as.character(NA)
),
# Extract from (only) the weighted labels everything else but the weight.
rest = if_else(
target,
gsub(
pattern = regex_weighted_label,
# Everything is in the 2nd, 3rd, and 4th capture groups; ignoring the ID
# (1st) and the weight (5th).
replacement = "\\2\\3\\4",
x = data
),
# For any data that is NOT a weighted label, make it blank (NA) for
# everything else.
as.character(NA)
)
) %>%
# Link (JOIN) each weighted label to its new weight; with blanks (NAs) for
# nonmatches.
left_join(df, by = c("House_id", "id")) %>%
# Replace (only) the weighted labels, with their updated values.
mutate(
data = if_else(
target,
# Generate the updated value by splicing together the original components
# with the new weight.
paste0(id, rest, new_weight),
# For data that is NOT a weighted label, leave it unchanged.
data
)
) %>%
# Extract the column of updated values.
.$data %>%
# Overwrite the original text with the updated values.
writeLines(con = text_filepath)
will transform your textual data and update the original file.
Results
The original file (here subset.txt) will now contain the updated information:
H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0
Details
Regular Expressions
The textual manipulation relies only on the basic functionality of grepl() (to identify matches) and gsub() (to extract components). We divide each textual pattern, regex_house_id and regex_weighted_label, into its components, distinguished as capture groups within the regex:
# The "H" prefix. The "_" separator.
# | | | |
regex_house_id <- "(H)(\\d+)(_)(\\d)"
# | | | |
# The digits following "H". The "0" suffix (or any digit).
# The digits after the 'id'.
# The 'id': "M" then 2 small letters. | | The weight (possibly a decimal).
# | | | | | |
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
# | | | |
# The "_" separator. The "_" separator and "W" prefix before weight.
We can use grepl(pattern = regex_weighted_label, x = my_strings) to check which strings in the vector my_strings match the format of a weighted label (like "Mab_3416311514210525745_W923650.80").
We can also use gsub(pattern = regex_weighted_label, replacement = "\\5", my_labels) to extract the weights (in the 5th capture group) from a vector my_labels of labels in that format.
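As a quick sketch of those two operations (the vector my_strings below is an illustrative stand-in, mixing one weighted label with one non-matching line):

```r
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"

# grepl() flags which strings match the weighted-label format.
my_strings <- c("Mab_3416311514210525745_W923650.80", "T1_0")
grepl(pattern = regex_weighted_label, x = my_strings)
# [1]  TRUE FALSE

# gsub() extracts the weight (the 5th capture group) from a matching label.
gsub(
  pattern = regex_weighted_label,
  replacement = "\\5",
  x = "Mab_3416311514210525745_W923650.80"
)
# [1] "923650.80"
```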
Mapping
Found in the first mutate() statement, the line
# Map each row to its house ID.
House_id = data[row_number()[target][cumsum(target)]],
might appear cryptic. However, it is simply a classic arithmetic trick (also used by @mnist in their solution) for indexing contiguous values into groups.
The code cumsum(target) scans down the target column, which (at this point in the workflow) holds logical values (TRUE FALSE FALSE ...) indicating whether (TRUE) or not (FALSE) a line of text is a house ID (like "H18105265_0"). Whenever it reaches a TRUE (numerically a 1), it increments its running total, whereas a FALSE (numerically a 0) leaves the total unchanged.
Since the textual data column
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
"H18105265_0" "R1_0" ... "H18117631_0" "R1_0" ... "H18121405_0" ...
gives us the logical target column
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE ...
these values (TRUE and FALSE) are coerced to numbers (1 and 0)
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 ...
from which cumsum() generates:
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 ...
Notice how we have now mapped each row to its "group number". So much for cumsum(target).
Now for row_number()[target]! Effectively, row_number() simply "indexes" each position (row)
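In miniature (a shorter target than the one above, purely for illustration):

```r
# Logical values are coerced to 1s and 0s, so cumsum() numbers the groups:
# each TRUE starts a new group, and each FALSE inherits the current group.
target <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
cumsum(target)
# [1] 1 1 1 2 2 3
```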
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 2 ... 8 9 ... 13 ...
in the data column (or any other column):
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
"H18105265_0" "R1_0" ... "H18117631_0" "R1_0" ... "H18121405_0" ...
So subscripting these indices by target
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
TRUE FALSE ... TRUE FALSE ... TRUE ...
selects only the positions that hold house IDs:
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 8 13 ...
So if we take this result from row_number()[target]
# House ID: 1st 2nd 3rd ...
# Position:
1 8 13 ...
and subscript it by cumsum(target)
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 ...
we map each row to the position (in data) of its house ID:
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 1 1 1 1 1 1 8 8 8 8 8 8 13 13 ...
This is the result of row_number()[target][cumsum(target)].
Finally, when we subscript data by these (repeated) positions of the house IDs, we obtain the House_id column
# |----------------- Group 1 -----------------| |----------------- Group 2 -----------------| |-------------------------- ...
"H18105265_0" "H18105265_0" ... "H18105265_0" "H18117631_0" "H18117631_0" ... "H18117631_0" "H18121405_0" "H18121405_0" ...
in which every value in data is mapped to the house ID for its group.
Thanks to this House_id column
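The whole trick can be sketched in base R, where seq_along() plays the role of dplyr's row_number() outside a mutate(); the short IDs below are toy values, not taken from subset.txt:

```r
data   <- c("H1_0", "R1_0", "T1_0", "H2_0", "R1_0", "H3_0")
target <- grepl("^H", data)  # TRUE at each house-ID row

seq_along(data)[target]                  # positions of the house IDs
# [1] 1 4 6
seq_along(data)[target][cumsum(target)]  # each row -> its house-ID position
# [1] 1 1 1 4 4 6
data[seq_along(data)[target][cumsum(target)]]
# [1] "H1_0" "H1_0" "H1_0" "H2_0" "H2_0" "H3_0"
```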
House_id = data[row_number()[target][cumsum(target)]]
alongside our data column, we can map (left_join()) the ids from df to their corresponding textual data.
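A toy sketch of that join step, with made-up rows (only the shape matters): each (House_id, id) pair found in the reference picks up its new_weight, while nonmatches (including the NA ids of non-label rows) are left with an NA new_weight.

```r
library(dplyr)

# Stand-in for the mutated text data: one weighted label, one non-label
# row (NA id), and a second weighted label under another house.
mapped <- tibble::tibble(
  House_id = c(18105265, 18105265, 18117631),
  id       = c("Mab", NA, "Maa")
)
# Stand-in for the reference 'df'.
ref <- tibble::tibble(
  House_id   = c(18105265, 18117631),
  id         = c("Mab", "Maa"),
  new_weight = c(4567, 3367)
)

left_join(mapped, ref, by = c("House_id", "id"))
# Keeps all rows of 'mapped', in order; new_weight is 4567, NA, 3367.
```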