【Question title】: Loop through a list with a condition in R
【Posted】: 2021-04-09 21:23:17
【Question description】:

A data frame DF and a list mappingList are given below:

DF <- data.frame(
           "colors number 3 former" = c("r","r","?","l","?","r","?","?","r","?"),
           "music number 3 latter" = c("r","l","r","l","r","r","l","l","r","l"),
           "genres number 3 latter" = c("l","r","?","l","?","r","?","l","l","r"),
           "genres number 12 former" = c("r","r","?","l","l","r","l","?","r","?"),
           "music number 12 latter" = c("r","l","?","l","?","r","l","l","r","?"),
           "fabric number 12 latter" = c("l","r","?","l","r","r","r","l","l","r"),
           "colors number 12 latter" = c("r","r","?","r","?","r","?","r","r","?"),
           check.names = FALSE
           )

mappingList <- list("number 3",
                    "genres",
                    "music",
                    "number 12",
                    "music",
                    "fabric",
                    "colors")

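In DF, when a column ends with former and contains the value "?", it needs to be encoded from the columns ending with latter. By encoding I mean that the "?" values in the former column need to be filled with whatever value the corresponding latter column holds. A former column can have multiple latter columns. The latter columns that correspond to a former column are found from mappingList. For example, for "colors number 3 former" there are 2 column indicators in mappingList: genres and music, because they sit under "number 3", which is the group that "colors number 3 former" belongs to (its name contains the substring "number 3"). In a for loop, "colors number 3 former" should first be encoded from "genres number 3 latter", for the rows where it has the value "?". If there are still "?" values left in the former column, it should then be mapped with the second option, which is "music number 3 latter" (the next element under genres in "number 3"). The loop should stop if there are no more "?" left in the former column; if not, it should move down in mappingList for that number. The original data frame is much larger, so manual mapping is not preferred. The expected output is: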

expectedDF <- data.frame(
           "colors number 3 former" = c("r","r","r","l","r","r","l","l","r","r"),
           "music number 3 latter" = c("r","l","r","l","r","r","l","l","r","l"),
           "genres number 3 latter" = c("l","r","?","l","?","r","?","l","l","r"),
           "genres number 12 former" = c("r","r","?","l","l","r","l","l","r","r"),
           "music number 12 latter" =  c("r","l","?","l","?","r","l","l","r","?"),
           "fabric number 12 latter" = c("l","r","?","l","r","r","r","l","l","r"),
           "colors number 12 latter" = c("r","r","?","r","?","r","?","r","r","?"),
           check.names = FALSE
           )

I tried this approach with nested loops, but I cannot find a way to stop the loop once it reaches the next number:

# Take columns with that end with "former"
# Populate former columns in columnsToBeEncoded
columnsToBeEncoded <- list()
for(col in names(DF)){
  if(grepl("former", col)){
    columnsToBeEncoded <- append(columnsToBeEncoded, col)
  }
}

#columnsToBeEncoded

# Encode  "former" columns where row is "?" from "latter" columns by the order in mappingList
for(col in columnsToBeEncoded){
  # extract column number from former column
  colNumber <- paste(strsplit(col, " ")[[1]][2:3], collapse = " ")
  # Find indices where former column has "?"
  j <- which(DF[, col] == "?")
  for(element in mappingList){
    # I think the if statement below is not working
    # Inside the if statement I see elements with "number" in it are involved too
    if(!grepl(colNumber, element)){
      elementNameinColumnForm <- paste(c(element, colNumber, "latter"), collapse = " ")
      print(elementNameinColumnForm)
      DF[j,col] <- DF[j,elementNameinColumnForm]

    }
  }
}

【Question comments】:

    Tags: r list dataframe


    【Solution 1】:

    Try the code below and see whether it works. It works for the example you provided, so hopefully it will work for more cases too:

    columnsToBeEncoded = names(DF)[grepl("former", names(DF))] # alternative, vectorised form
    
    # Create a nested list, where each key is a "number XX" and its elements are the variables needed
    number_smth = which(grepl("number", mappingList))
    mappingListNested = lapply(seq_along(number_smth), function(i){
      if (i+1 <= length(number_smth)){
        return(mappingList[(number_smth[i]+1):(number_smth[i+1]-1)])
      } else {
        return(mappingList[(number_smth[i]+1):length(mappingList)])
      }
    })
    
    names(mappingListNested) = paste('number', stringr::str_extract(mappingList[number_smth], "[[:digit:]]+"))
    
    # Encode  "former" columns where row is "?" from "latter" columns by the order in mappingList
    for(col in columnsToBeEncoded){
      # extract column number from former column
      colNumber <- paste(strsplit(col, " ")[[1]][2:3], collapse = " ")
      # Look up the replacement variables for this number, in mapping order
      replacementVariables = mappingListNested[[colNumber]]
      for (var in replacementVariables){
        varNameinColumnForm <- paste(c(var, colNumber, "latter"), collapse = " ")
        DF[, col] = ifelse(
          DF[, col] == "?", # which elements are "?"
          DF[, varNameinColumnForm], # replace those which are "?" with the values of var
          DF[, col] # otherwise leave unchanged
        )
      }
    }
    

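    Step by step:

    1. I convert mappingList into a nested list to make iteration easier. The nested list has as many elements as there are "number XX" entries in mappingList. Each element holds the variables that sit between one "number XX" entry and the next, which, if I understood correctly, are the replacement candidates in order.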
    number_smth = which(grepl("number", mappingList))
    mappingListNested = lapply(seq_along(number_smth), function(i){
      if (i+1 <= length(number_smth)){
        return(mappingList[(number_smth[i]+1):(number_smth[i+1]-1)])
      } else {
        return(mappingList[(number_smth[i]+1):length(mappingList)])
      }
    })
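The same grouping can also be sketched without the index arithmetic, using cumsum() and split() from base R. This is an alternative illustration, not the answer's code; the names items, is_header, group_id and nested are made up for the example, and the result holds character vectors rather than sub-lists.

```r
# Toy copy of the mapping list from the question
mappingList <- list("number 3", "genres", "music",
                    "number 12", "music", "fabric", "colors")

items <- unlist(mappingList)            # flatten to a character vector
is_header <- grepl("^number", items)    # TRUE for the "number XX" entries
group_id <- cumsum(is_header)           # block index of every entry: 1,1,1,2,2,2,2

# Split the non-header entries by block (hypothetical name: nested)
nested <- split(items[!is_header], group_id[!is_header])

# Name each block after its header; a header with no entries simply has no block
names(nested) <- items[is_header][as.integer(names(nested))]
```

One possible advantage of this form is that a "number XX" entry followed directly by another "number XX" entry does not produce a backwards index range.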
    
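    2. I name the list elements with the corresponding "number XX" keys to make indexing easier. To do this I extract the digits from each "number XX" entry, much as you did, except that I use a regular expression here (this needs the stringr package). If you prefer, you could do it yourself with strsplit().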
    names(mappingListNested) = paste('number', stringr::str_extract(mappingList[number_smth], "[[:digit:]]+"))
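For readers who would rather avoid the stringr dependency, here is a minimal base R sketch of the same key construction, assuming the headers always look like "number" followed by digits. The variables headers, digits and keys are illustrative names only.

```r
# The "number XX" entries of mappingList, written out by hand for the example
headers <- c("number 3", "number 12")

# Drop everything that is not a digit, then rebuild the key
digits <- gsub("[^0-9]", "", headers)
keys <- paste("number", digits)

# The key extracted from a column name, exactly as done inside the loop
col <- "genres number 12 former"
colNumber <- paste(strsplit(col, " ")[[1]][2:3], collapse = " ")

# colNumber can now be used to index a list named with keys
colNumber %in% keys
```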
    
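    3. The for loop looks almost the same as yours, except that I use ifelse() to replace the values directly. ifelse() is a vectorised way to go through a vector, check a condition, and replace the values that meet it with other values. The syntax is ifelse(logical_vector, replacement_TRUE, replacement_FALSE).
      1. In my case, logical_vector is DF[, col] == "?", which checks whether each element of column col of DF equals "?". This gives a vector of TRUEs and FALSEs.
      2. The next argument supplies the replacement for the elements of DF[, col] that are TRUE (i.e. those holding "?"); here it is the current variable from mappingListNested.
      3. The last argument (here the column DF[, col] itself) supplies the value for the elements that are FALSE; in other words, it leaves the column unchanged wherever it is not "?".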
    # Encode  "former" columns where row is "?" from "latter" columns by the order in mappingList
    for(col in columnsToBeEncoded){
      # extract column number from former column
      colNumber <- paste(strsplit(col, " ")[[1]][2:3], collapse = " ")
      # Look up the replacement variables for this number, in mapping order
      replacementVariables = mappingListNested[[colNumber]]
      for (var in replacementVariables){
        varNameinColumnForm <- paste(c(var, colNumber, "latter"), collapse = " ")
        DF[, col] = ifelse(
          DF[, col] == "?", # which elements are "?"
          DF[, varNameinColumnForm], # replace those which are "?" with the values of var
          DF[, col] # otherwise leave unchanged
        )
      }
    }
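To see what a single ifelse() call in that loop does, here is a toy example on two short vectors. The names former, latter and filled are made up for the illustration.

```r
former <- c("r", "?", "l", "?")
latter <- c("l", "l", "?", "r")

# Take the value from latter wherever former is "?", otherwise keep former
filled <- ifelse(former == "?", latter, former)
# filled is c("r", "l", "l", "r")
```

Note that this relies on the columns being character vectors, which is the data.frame() default since R 4.0. ifelse() drops attributes, so on factor columns it would return the underlying integer codes rather than the labels.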
    

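    Because I iterate over the elements of mappingListNested, the iteration is guaranteed to stop when the variables for that number run out. Also, because the column DF[, col] is updated on every iteration, the values are filled from the first variable first, then from the next, and so on.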
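The loop above always visits every candidate for a number, which is harmless because ifelse() changes nothing once no "?" is left. If the explicit early stop described in the question is wanted, a break can be added. The sketch below shows this on a cut-down data frame; the data and the counter n_used are assumptions for the example, not part of the answer.

```r
DF <- data.frame(
  "colors number 3 former" = c("r", "?", "?"),
  "genres number 3 latter" = c("l", "r", "l"),
  "music number 3 latter"  = c("r", "l", "r"),
  check.names = FALSE, stringsAsFactors = FALSE
)

col <- "colors number 3 former"
n_used <- 0   # how many latter columns were actually consulted

for (var in c("genres", "music")) {
  # Stop as soon as the former column has no "?" left
  if (!any(DF[[col]] == "?")) break
  src <- paste(var, "number 3", "latter")
  DF[[col]] <- ifelse(DF[[col]] == "?", DF[[src]], DF[[col]])
  n_used <- n_used + 1
}
```

Here the genres column already fills both gaps, so the music column is never read.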

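    【Comments】:

    • Also, as a general tip (and I hope this does not sound condescending), try to vectorise your R code; it will run much faster. For loops are fine, but try to avoid append(), which slows the code down a lot. Instead, you can define an empty vector, assign the values you need into it, and then filter out the "empty" slots with a logical vector. I once had a script full of appends that used to take 6 hours to run, and doing what I just described cut the time to 3 minutes (!). Just to show you how slow it can get with append, rbind, etc.
    【Solution 2】: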
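The tip in the comment can be sketched as follows. The task (collecting the squares of the even numbers up to n) and the names grown, pre and vec are invented for the illustration; all three approaches produce the same values and differ only in how the result vector is built.

```r
n <- 1000

# 1. Growing a vector with append(): the vector is copied on every call
grown <- c()
for (i in seq_len(n)) {
  if (i %% 2 == 0) grown <- append(grown, i^2)
}

# 2. Preallocate, fill in place, then filter out the untouched slots
pre <- rep(NA_real_, n)
for (i in seq_len(n)) {
  if (i %% 2 == 0) pre[i] <- i^2
}
pre <- pre[!is.na(pre)]

# 3. Fully vectorised: no explicit loop at all
idx <- seq_len(n)
vec <- idx[idx %% 2 == 0]^2
```

Wrapping each variant in system.time() with a larger n is one way to compare them on your own machine.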

    Here is another solution. Please refer to the comments in the code for the explanation.

    #----
    
    #Your data.
    
    DF <- data.frame(
      "colors number 3 former" = c("r","r","?","l","?","r","?","?","r","?"),
      "music number 3 latter" = c("r","l","r","l","r","r","l","l","r","l"),
      "genres number 3 latter" = c("l","r","?","l","?","r","?","l","l","r"),
      "genres number 12 former" = c("r","r","?","l","l","r","l","?","r","?"),
      "music number 12 latter" = c("r","l","?","l","?","r","l","l","r","?"),
      "fabric number 12 latter" = c("l","r","?","l","r","r","r","l","l","r"),
      "colors number 12 latter" = c("r","r","?","r","?","r","?","r","r","?"),
      check.names = FALSE
    )
    
    mappingList <- list("number 3",
                        "genres",
                        "music",
                        "number 12",
                        "music",
                        "fabric",
                        "colors")
    
    
    expectedDF <- data.frame(
      "colors number 3 former" = c("r","r","r","l","r","r","l","l","r","r"),
      "music number 3 latter" = c("r","l","r","l","r","r","l","l","r","l"),
      "genres number 3 latter" = c("l","r","?","l","?","r","?","l","l","r"),
      "genres number 12 former" = c("r","r","?","l","l","r","l","l","r","r"),
      "music number 12 latter" =  c("r","l","?","l","?","r","l","l","r","?"),
      "fabric number 12 latter" = c("l","r","?","l","r","r","r","l","l","r"),
      "colors number 12 latter" = c("r","r","?","r","?","r","?","r","r","?"),
      check.names = FALSE
    )
    
    
    #--------
    
    #Solution.
    
    library(stringr)
    library(magrittr)
    library(tidyr)
    library(dplyr)
    
    #The mappingList isn't handy.
    #Converting this into a data.frame with two columns: 
    #"former", which indicates the former column in DF, 
    #and "latter", which indicates the corresponding latter 
    #column in DF from which the data in the former column 
    #needs to be filled in.
    
    mlist <- data.frame(latter = unlist(mappingList), stringsAsFactors = FALSE)
    
    #A loop to identify former and latter values from the 
    #data.frame's one available column.
    j <- 0
    for(i in 1:nrow(mlist)){
      if(str_detect(mlist$latter[i], "number [0-9]+")){
        j <- j + 1
      }
      mlist$type[i] <- j
    }
    rm(j)
    
    
    #Munging to create the former and latter columns 
    #properly.
    mlist %<>% 
      group_by(type) %>% 
      mutate(former = latter[1]) %>% 
      ungroup()
    
    mlist %<>% filter(latter != former)
    
    mlist %<>% 
      group_by(type) %>% 
      mutate(ord = row_number()) %>% 
      ungroup()
    
    mlist %<>% select(c(former, latter, ord))
    
    #For ease of use, bringing the former and latter 
    #columns contents as close to the column names in 
    #DF as is possible.
    mlist %<>% 
      mutate(latter = paste0(latter, " ", former, " latter"), 
             former = paste0(former, " former"))
    
    
    #Nested loops to fill in the DF rows.
    #Basic logic is: take a row in DF.
    #Loop through the rows of mlist.
    #mlist basically holds the fill-in relationship's 
    #column names. So extract the former and latter 
    #(fcol and lcol) column names respectively.
    #Then check if that particular former column in the 
    #ith row of DF is a "?". If it is, fill it in with 
    #the value from the cell corresponding to the column 
    #name indicated by the jth row of mlist and the ith row 
    #of DF.
    #This also automatically takes care of the fact that the 
    #next latter column's value gets used if a "?" remains.
    
    for(i in 1:nrow(DF)){
      
      for(j in 1:nrow(mlist)){
        
        fcol <- colnames(DF)[str_detect(colnames(DF), mlist$former[j])]
        lcol <- colnames(DF)[str_detect(colnames(DF), mlist$latter[j])]
        
        if(DF[i, fcol] == "?"){
          DF[i, fcol] <- DF[i, lcol]
        }
        
      }
      
    }
    
    
    identical(DF, expectedDF)
    
    # [1] TRUE
    
    #----
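The cell-by-cell loop over the rows of DF can also be written column-wise, which may matter on a much larger data frame. The sketch below is a base R variant of the same fill-in idea, under the assumption that mlist already holds full column names (in the solution above it holds substrings that are matched with str_detect()). The cut-down DF, mlist and the helper name is_missing are made up for the example.

```r
DF <- data.frame(
  "colors number 3 former" = c("r", "?", "?", "?"),
  "genres number 3 latter" = c("l", "r", "?", "?"),
  "music number 3 latter"  = c("r", "l", "l", "?"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# One row per (former, latter) pair, in the order they should be tried
mlist <- data.frame(
  former = c("colors number 3 former", "colors number 3 former"),
  latter = c("genres number 3 latter", "music number 3 latter"),
  stringsAsFactors = FALSE
)

for (j in seq_len(nrow(mlist))) {
  fcol <- mlist$former[j]
  lcol <- mlist$latter[j]
  is_missing <- DF[[fcol]] == "?"               # rows still to be filled
  DF[[fcol]][is_missing] <- DF[[lcol]][is_missing]
}
```

A row whose latter columns are all "?" stays "?", as in the expected output of the question.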
    

    【Comments】:
