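[Posted]: 2025-12-06 08:30:02
[Question]:
I am writing a script to parse SARS-CoV-2 sequencing results for import into our laboratory information system. I need to test whether the nucleotide positions of key mutations are covered by the consensus sequence data. The positions with missing nucleotide sequence data are supplied as a comma-separated string variable, in which ranges are delimited by "-". I thought I'd write a for loop that tests the nucleotide position of each key mutation against the missing data defined in the string variable. So far: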
library(tidyverse)
Create the test data:
subs <- as.character(c("A", "B", "C", "D"))
subs_pos <- as.numeric(c("1", "30","22700", "13500"))
df <- data.frame("id" = letters[1:5],
"missing" = as.character(c("1-13030,13364-13626,13962-15504,15862-26543,26891-29904",
"1-29,21717,29727-29777,29837-29904",
"19276-19571,22627-22822,29837-29904",
"29837-29904",
"1-10,20-30"
)))
Data frame:
id missing
1 a 1-13030,13364-13626,13962-15504,15862-26543,26891-29904
2 b 1-29,21717,29727-29777,29837-29904
3 c 19276-19571,22627-22822,29837-29904
4 d 29837-29904
5 e 1-10,20-30
The for loop:
for(i in seq_along(subs)) {
new_var = as.character(subs[i])
print(new_var)
nn = as.numeric(subs_pos[i])
print(nn)
df <- df %>%
mutate(!!new_var := ifelse(!!nn %in%
as.numeric(
source(textConnection(paste("c(", gsub("\\-", ":", missing),")")))$value), "I", "N"))
}
This prints the following to the screen and produces this data frame:
>[1] "A"
>[1] 1
>[1] "B"
>[1] 30
>[1] "C"
>[1] 22700
>[1] "D"
>[1] 13500
> df
> id missing A B C D
>1 a 1-13030,13364-13626,13962-15504,15862-26543,26891-29904 I I N N
>2 b 1-29,21717,29727-29777,29837-29904 I I N N
>3 c 19276-19571,22627-22822,29837-29904 I I N N
>4 d 29837-29904 I I N N
>5 e 1-10,20-30 I I N N
Expected data frame:
> id missing A B C D
> 1 a 1-13030,13364-13626,13962-15504,15862-26543,26891-29904 I I N I
> 2 b 1-29,21717,29727-29777,29837-29904 I N N N
> 3 c 19276-19571,22627-22822,29837-29904 N N I N
> 4 d 29837-29904 N N N N
> 5 e 1-10,20-30 I I N N
The test works when it is run on a single instance:
> 13500 %in% as.numeric(source(textConnection(paste("c(", gsub("\\-", ":", df$missing[1]),")")))$value)
[1] TRUE
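It seems that my code causes the result of the last evaluation to be applied to all rows in the data frame. I have confirmed this by changing the test data.
[Discussion]: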
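The behaviour described above can be reproduced outside of mutate(). The following is a minimal sketch, assuming the cause is that `missing` inside mutate() refers to the whole column rather than a single row; the variable names `exprs`, `con` and `last_value` are illustrative only.

```r
# Two of the `missing` strings from the test data
missing <- c("1-29,21717,29727-29777,29837-29904", "1-10,20-30")

# gsub() and paste() are vectorised, so this builds one expression per row
exprs <- paste("c(", gsub("\\-", ":", missing), ")")

# textConnection() treats each element as a separate line of code;
# source() evaluates all of them, but $value keeps only the last result
con <- textConnection(exprs)
last_value <- source(con)$value
close(con)

# Only the ranges of the final string survive
identical(last_value, c(1:10, 20:30))
30 %in% last_value
```

If that assumption holds, every position is tested against the ranges of the last row only, and mutate() then recycles the single "I" or "N" across all rows.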
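One possible way to get a per-row result is to parse each `missing` string on its own instead of evaluating it with source(). The sketch below uses base R only; `pos_is_missing` is a hypothetical helper name, not something from the question, and it assumes every entry is either a single position or a `start-end` range.

```r
# Hypothetical helper: TRUE if `pos` falls inside any range of one string
pos_is_missing <- function(missing_str, pos) {
  parts  <- strsplit(missing_str, ",", fixed = TRUE)[[1]]
  bounds <- strsplit(parts, "-", fixed = TRUE)
  # A single position such as "21717" is treated as the range 21717-21717
  start <- as.numeric(vapply(bounds, function(b) b[1], character(1)))
  end   <- as.numeric(vapply(bounds, function(b) b[length(b)], character(1)))
  any(pos >= start & pos <= end)
}

# Test data copied from the question
subs     <- c("A", "B", "C", "D")
subs_pos <- c(1, 30, 22700, 13500)
df <- data.frame(
  id = letters[1:5],
  missing = c("1-13030,13364-13626,13962-15504,15862-26543,26891-29904",
              "1-29,21717,29727-29777,29837-29904",
              "19276-19571,22627-22822,29837-29904",
              "29837-29904",
              "1-10,20-30"),
  stringsAsFactors = FALSE
)

# One new column per mutation; vapply() applies the helper row by row
for (i in seq_along(subs)) {
  hit <- vapply(df$missing, pos_is_missing, logical(1),
                pos = subs_pos[i], USE.NAMES = FALSE)
  df[[subs[i]]] <- ifelse(hit, "I", "N")
}
print(df)
```

With this sketch, row a gets "I" in column C, because 22700 lies inside the range 15862-26543; the remaining cells agree with the expected data frame in the question. Keeping the original source() expression and adding dplyr's rowwise() before mutate() may be another option, since `missing` would then refer to one row at a time.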