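【Question title】:Dynamically create variables based on character vector and evaluation of numeric vector against variable in R
【Posted】:2025-12-06 08:30:02
【Question description】:

I am writing a script to parse SARS-CoV-2 sequencing results for import into our laboratory information system. I need to test whether the nucleotide positions of key mutations are covered by the consensus sequence data. The missing nucleotide positions are supplied as a comma-separated string variable, with ranges written as "start-stop". I thought I'd write a for loop that tests the nucleotide position of each key mutation against the missing data defined in that string. So far: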

library(tidyverse)

Create test data

subs <- as.character(c("A", "B", "C", "D"))
subs_pos <- as.numeric(c("1", "30","22700", "13500"))
df <- data.frame("id" = letters[1:5], 
                 "missing" = as.character(c("1-13030,13364-13626,13962-15504,15862-26543,26891-29904",
                                            "1-29,21717,29727-29777,29837-29904",
                                            "19276-19571,22627-22822,29837-29904",
                                            "29837-29904",
                                            "1-10,20-30"
                                            )))

Data frame:

id                                                 missing
1  a 1-13030,13364-13626,13962-15504,15862-26543,26891-29904
2  b                      1-29,21717,29727-29777,29837-29904
3  c                     19276-19571,22627-22822,29837-29904
4  d                                             29837-29904
5  e                                              1-10,20-30

For loop

for(i in seq_along(subs)) { 
  new_var = as.character(subs[i])
  print(new_var)
  nn = as.numeric(subs_pos[i])
  print(nn)
  df <- df %>% 
    mutate(!!new_var := ifelse(!!nn %in%
                               as.numeric(
                                 source(textConnection(paste("c(", gsub("\\-", ":", missing),")")))$value), "I", "N"))
}

This prints to the screen and produces the data frame:

>[1] "A"
>[1] 1
>[1] "B"
>[1] 30
>[1] "C"
>[1] 22700
>[1] "D"
>[1] 13500
> df
>  id                                                 missing A B C D
>1  a 1-13030,13364-13626,13962-15504,15862-26543,26891-29904 I I N N
>2  b                      1-29,21717,29727-29777,29837-29904 I I N N
>3  c                     19276-19571,22627-22822,29837-29904 I I N N
>4  d                                             29837-29904 I I N N
>5  e                                              1-10,20-30 I I N N

Expected data frame:

> id                                                 missing A B C D
> 1  a 1-13030,13364-13626,13962-15504,15862-26543,26891-29904 I I I I
> 2  b                      1-29,21717,29727-29777,29837-29904 I N N N
> 3  c                     19276-19571,22627-22822,29837-29904 N N I N
> 4  d                                             29837-29904 N N N N
> 5  e                                              1-10,20-30 I I N N                                            

The test works when run on a single instance

> 13500 %in% as.numeric(source(textConnection(paste("c(", gsub("\\-", ":", df$missing[1]),")")))$value)
[1] TRUE

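It seems that my code causes the result of the last evaluation to be applied to every row of the data frame. I have confirmed this by changing the test data.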
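
A minimal sketch that reproduces the symptom outside mutate(), assuming the cause is that `missing` inside mutate() is the whole column rather than a single row: source() then evaluates one expression per row but returns only the value of the last one. The two-element vector `missing_demo` is made up for illustration.

```r
# Hypothetical two-row stand-in for df$missing
missing_demo <- c("1-3", "5-6")

# Same transformation as in the loop: one "c( a:b )" expression per row
exprs <- paste("c(", gsub("\\-", ":", missing_demo), ")")

con <- textConnection(exprs)
last_value <- source(con)$value  # source() keeps only the last expression's value
close(con)

print(last_value)         # ranges of the LAST row only
print(2 %in% last_value)  # position 2 is in the first row's range but is not found
```

Because the %in% test then yields a single TRUE or FALSE, that one result is recycled down the whole column.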

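【Question discussion】:

    Tags: r variables dplyr


    【Solution 1】:

    We can split 'missing' on "," to expand the rows, then create new columns 'start' and 'stop' by splitting at the "-" delimiter (the result is stored as 'df1')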

    library(dplyr)
    library(tidyr)
    
    df1 <- df %>%
           separate_rows(missing, sep = ",") %>%
           separate(missing, into = c('start', 'stop'), convert = TRUE) 
    

    Now we use the OP's approach to create the new columns from the 'subs' vector

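    for(i in seq_along(subs)) {
         df1 <- df1 %>%
              # 'I' when the position lies inside the range; a single position
              # such as "21717" has stop = NA, so fall back to start
              mutate(!! subs[i] := case_when(
               start <= subs_pos[i] & coalesce(stop, start) >= subs_pos[i] ~ 'I',
               TRUE ~ 'N'))
     }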
    

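    Grouped by 'id', we summarise the 'subs' columns to return 'I' if any row has 'I' and 'N' otherwise, then join with the original data and select the columns in the order we want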

    df1 %>% 
         group_by(id) %>% 
         summarise(across(c(A:D), ~ case_when('I' %in% . ~ 'I', 
                TRUE ~ 'N'))) %>% 
         right_join(df, by = "id") %>%   # explicit key avoids the "Joining, by" message
         select(all_of(names(df)), everything())
    

    -output

    # A tibble: 5 x 6
    #  id    missing                                                 A     B     C     D    
    #  <chr> <chr>                                                   <chr> <chr> <chr> <chr>
    #1 a     1-13030,13364-13626,13962-15504,15862-26543,26891-29904 I     I     I     I    
    #2 b     1-29,21717,29727-29777,29837-29904                      I     N     N     N    
    #3 c     19276-19571,22627-22822,29837-29904                     N     N     I     N    
    #4 d     29837-29904                                             N     N     N     N    
    #5 e     1-10,20-30                                              I     I     N     N    
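
For comparison, a base R sketch that produces the same I/N columns without reshaping the data. The helper name `in_missing` is hypothetical (it is not part of the original post or of any package); it tests one position against one comma-separated range string, so it is applied row by row. It assumes every entry is either a single position or a "start-stop" range.

```r
subs <- c("A", "B", "C", "D")
subs_pos <- c(1, 30, 22700, 13500)
df <- data.frame(
  id = letters[1:5],
  missing = c("1-13030,13364-13626,13962-15504,15862-26543,26891-29904",
              "1-29,21717,29727-29777,29837-29904",
              "19276-19571,22627-22822,29837-29904",
              "29837-29904",
              "1-10,20-30"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: is `pos` inside any range of the string `spec`?
in_missing <- function(spec, pos) {
  parts  <- strsplit(spec, ",", fixed = TRUE)[[1]]
  bounds <- strsplit(parts, "-", fixed = TRUE)
  start  <- as.numeric(vapply(bounds, function(b) b[1], character(1)))
  # A single position such as "21717" is its own start and stop
  stop   <- as.numeric(vapply(bounds, function(b) b[length(b)], character(1)))
  any(start <= pos & pos <= stop)
}

for (i in seq_along(subs)) {
  # One TRUE/FALSE per row, instead of one value recycled to all rows
  hit <- vapply(df$missing, in_missing, logical(1),
                pos = subs_pos[i], USE.NAMES = FALSE)
  df[[subs[i]]] <- ifelse(hit, "I", "N")
}

print(df[c("id", subs)])
```

Comparing against the range bounds avoids materialising every position in ranges such as 15862-26543.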
    
    
     
    

    【Discussion】:

      【Solution 2】:

      Another solution, suggested by a colleague:

      # Start with every column set to "N"; the loop below only ever writes "I"
      df[subs] <- "N"
      for(i in seq_len(nrow(df))){
        ranges <- as.numeric(source(textConnection(paste("c(", gsub("\\-", ":", df$missing[i]),")")))$value)
        for(j in seq_along(subs)){
          if(subs_pos[j] %in% ranges) {
            df[i, subs[j]] <- "I"
          }
        }
      }
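
The source(textConnection(...)) step runs the range string as R code. A hedged alternative, assuming the strings only ever contain single positions and "start-stop" ranges, is to parse them directly. `expand_ranges` is a hypothetical helper name, not an existing function.

```r
# Hypothetical helper: expand "1-3,7,10-11" into c(1, 2, 3, 7, 10, 11)
expand_ranges <- function(spec) {
  parts <- strsplit(spec, ",", fixed = TRUE)[[1]]
  unlist(lapply(strsplit(parts, "-", fixed = TRUE), function(b) {
    b <- as.integer(b)
    # A single position has one element, so start and stop coincide
    seq(b[1], b[length(b)])
  }))
}

positions <- expand_ranges("1-3,7,10-11")
print(positions)

# Same check as the single-instance test in the question
row_a <- "1-13030,13364-13626,13962-15504,15862-26543,26891-29904"
print(13500 %in% expand_ranges(row_a))
```

Inside the loop, `ranges <- expand_ranges(df$missing[i])` could stand in for the source() call, and it leaves no text connections open.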

      【Discussion】: