【Question Title】: Loading a txt file into R and replacing some values based on another data frame
【Posted】: 2021-11-13 12:30:51
【Description】:

I have a large .txt file with a specific format and structure. My goal is to load the text into R with readLines, and I want to replace the weight value of each record with a new value based on my data frame df. I do not want to change the structure of the .txt format or parse the .txt file. The final output must have exactly the same structure as the original .txt (written back with writeLines()). How can I read the file and update the values? Thanks.

Here is my reference data frame

df <- tibble::tribble(
  ~House_id,   ~id, ~new_weight,
   18105265, "Mab",        4567,
   18117631, "Maa",        3367,
   18121405, "Mab",        4500,
   71811763, "Maa",        2455,
   71811763, "Mab",        2872
)

Here is a small excerpt of my .txt

H18105265_0
R1_0
Mab_3416311514210525745_W923650.80
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W123650.80
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W923650.80
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W923650.80
Mab_5325411210110485554_W723650.80
T1_0
T2_0
T3_0
T4_0

Here is the desired output for the first record, house_id = 18105265: Mab_3416311514210525745_W923650.80 is updated with the new value from df, giving Mab_3416311514210525745_W4567:
H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959

【Question Comments】:

  • How do you know it should be Mab_3416311514210525745_W4567 and not Mab_3416311514210525745_W4500?
  • H refers to House_id and id refers to the individual id. I have updated the question.
  • What do you mean by "I don't want to [...] parse the .txt file"?

Tags: r dplyr stringr readlines


【Solution 1】:

Edit - added id to the lookup, to distinguish non-unique House_id values.

Here is an approach where I read the data, join on the updated weights from df, and then use the new weight to create updated values for the lines starting with "M".

library(tidyverse)
read_fwf("txt_sample.txt" ,  col_positions = fwf_empty("txt_sample.txt")) %>% # edit suggested by DanG

# if the row starts with H, extract 8 digit house number and
# use that to join to the table with new weights
mutate(House_id = if_else(str_starts(X1, "H"), as.numeric(str_sub(X1, 2,9)), NA_real_),
       id = if_else(str_starts(X1, "M"), str_sub(X1, 1,3), NA_character_)) %>%
fill(House_id) %>%
left_join(df, by = c("House_id", "id")) %>%
fill(new_weight) %>%

# make new string using updated weight (or keep existing string)
mutate(X1_new = coalesce(
  if_else(str_starts(X1, "M"),
          paste0(word(X1, end = 2, sep = "_"), "_W", new_weight),
          NA_character_),
  X1)) %>%

pull(X1_new) %>% 
writeLines()

Output

H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0

【Comments】:

  • I guess the last Maa and Mab should have different new weights, rather than the same value for both.
  • Thanks for catching that. I hadn't realized each house could have multiple ids.
  • Nice solution, upvoted.
  • Very nice solution, @JonSpring, thank you. I added read_fwf("txt_sample.txt", col_positions = fwf_empty("txt_sample.txt")) to be able to read the file.
【Solution 2】:

You can try the following base R code

writeLines(
  do.call(
    paste0,
    lapply(
      unlist(
        strsplit(
          readChar("test.txt", file.info("test.txt")$size),
          "(?<=\\d)\n(?=H)",
          perl = TRUE
        )
      ),
      function(x) {
        with(
          df,
          Reduce(
            function(x, ps) sub(ps[[1]], ps[[2]], x),
            asplit(rbind(
              unlist(regmatches(x, gregexpr("W.*(?=\n)", x, perl = TRUE))),
              paste0("W", new_weight[sapply(sprintf("H%s.*%s_\\d+_W", House_id, id), grepl, x)])
            ), 2),
            init = x
          )
        )
      }
    )
  )
)

which gives

H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0

Breaking down the code

  • First, we split the long string into smaller chunks with the code below
      unlist(
        strsplit(
          readChar("test.txt", file.info("test.txt")$size),
          "(?<=\\d)\n(?=H)",
          perl = TRUE
        )
      )
  • For the substrings in each chunk, we find the matching House_id + id and replace the weight part, e.g. Wxxxxxx, with the corresponding new_weight
        with(
          df,
          Reduce(
            function(x, ps) sub(ps[[1]], ps[[2]], x),
            asplit(
              rbind(
              unlist(regmatches(x, gregexpr("W.*(?=\n)", x, perl = TRUE))),
              paste0("W", new_weight[sapply(sprintf("H%s.*%s_\\d+_W", House_id, id), grepl, x)])
            ), 2),
            init = x
          )
        )

Note that the last block contains two different matching ids; we use Reduce to replace the weights iteratively.
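The iterative replacement with Reduce can be illustrated on a small standalone example (a sketch with made-up labels and weights, not the actual file contents):

```r
# Two records in one string, as after splitting the file into chunks
x <- "Maa_111_W923650.80\nMab_222_W723650.80"

# Each element pairs an old weight string with its replacement
pairs <- list(c("W923650.80", "W2455"),
              c("W723650.80", "W2872"))

# Apply each substitution in turn, threading the string through sub()
result <- Reduce(
  function(acc, ps) sub(ps[[1]], ps[[2]], acc, fixed = TRUE),
  pairs,
  init = x
)
cat(result)
# Maa_111_W2455
# Mab_222_W2872
```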

【Comments】:

    【Solution 3】:

    You have to iterate over the individual lines obtained from readLines on the text document. You can use hpatt = 'H[0-9]+_0' as a regular expression to parse the House_id from lines starting with H, then apply the stringr package to process the lines:

    for (i in 1:length(lines)){
      line = lines[[i]]
    
      #detect if line looks like 'H[number]_0'
      if (stringr::str_detect(line, hpatt)){
        #if it does, extract the 'house_id' from the line
        h_id = stringr::str_extract(line, pattern = 'H[0-9]+') %>% 
          stringr::str_replace('H', '')
      }
    

    In the second part, you can replace the original weight with the one obtained from your tibble (which I call replacetibble here). I am using the regular expression mpatt = '^[a-zA-Z]+_[0-9]+_W[0-9\\.]+$', which looks for strings of the form [character-only name]_[number]_W[number with decimals]:

      if (stringr::str_detect(line, mpatt)){
        # split string to get the record's 'id' (use a distinct name so the
        # filter below compares the column against it, not against itself)
        rec_id = stringr::str_split(line, '_')[[1]][[1]]
        # look up weight
        wt = (replacetibble %>% filter(house_id == h_id & id == rec_id) %>% select(weight))
        # replace number in line, split the original line by the 'W'
        # this will of course break if your id contains a W - please
        # adapt logic according to your naming rules
        replaceline = stringr::str_split(line, 'W')[[1]]
        replaceline[length(replaceline)] =wt
        # put the line back together with a 'W' character
        lines[[i]] = paste0(replaceline, collapse = 'W')
      }
    }
    

    Stringr (cheat sheet here) is generally very powerful for working with strings.

    I will leave the loading and saving part to you.
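    For completeness, the load/save round trip can be sketched as follows (the tempfile() names are placeholders; the replacement loop from this answer would run between the read and the write):

```r
# Write a small sample file, read it with readLines(), write it back
# with writeLines(); the line-based structure survives the round trip.
infile <- tempfile(fileext = ".txt")
writeLines(c("H18105265_0", "R1_0", "Mab_3416311514210525745_W923650.80"), infile)

lines <- readLines(infile)
# ... run the replacement loop from this answer over `lines` here ...

outfile <- tempfile(fileext = ".txt")
writeLines(lines, outfile)

identical(readLines(outfile), lines)
# [1] TRUE
```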

    【Comments】:

    • It should be [H]_[id]_w[weight]. H refers to House_id, id refers to the individual id. I updated my question.
    【Solution 4】:

    I tried to put each step into a new object to get a better view of what is going on. If any of the regular expressions is unclear, feel free to ask.

    The house IDs are not limited to any number of digits, the individual IDs only need to start with "Ma(any character)_" and can easily be extended, so one house ID can contain any number of individuals.

    library(tidyverse)
    df <- tibble::tribble(
      ~House_id,  ~id, ~new_weight,
      18105265, "Mab",        4567,
      18117631, "Maa",        3367,
      18121405, "Mab",        4500,
      71811763, "Maa",        2455,
      71811763, "Mab",        2872
    )
    
    # read the data
    dat <- readLines("test.txt")
    
    # convert to tibble
    dat2 <- tibble::tibble(X = dat)
    
    # keep relevant info, i.e. house IDs and individual IDs
    dat3 <- dat2 %>% 
      rowid_to_column() %>% 
      filter(grepl(pattern = "H[0-9]+_0", X) | 
               grepl(pattern = "^Ma._[0-9]+", X))
    dat3
    #> # A tibble: 9 × 2
    #>   rowid X                                 
    #>   <int> <chr>                             
    #> 1     1 H18105265_0                       
    #> 2     3 Mab_3416311514210525745_W923650.80
    #> 3     8 H18117631_0                       
    #> 4    10 Maa_1240111711220682016_W123650.80
    #> 5    13 H18121405_0                       
    #> 6    15 Mab_2467211713110643835_W923650.80
    #> 7    19 H71811763_0                       
    #> 8    21 Maa_5325411210120486554_W923650.80
    #> 9    22 Mab_5325411210110485554_W723650.80
    
    
    # determine which individuals belong to which house
    dat4 <- dat3 %>% 
      mutate(house1 = grepl(pattern = "H[0-9]+_0", X)) %>% 
      mutate(house2 = cumsum(house1))
    dat4
    #> # A tibble: 9 × 4
    #>   rowid X                                  house1 house2
    #>   <int> <chr>                              <lgl>   <int>
    #> 1     1 H18105265_0                        TRUE        1
    #> 2     3 Mab_3416311514210525745_W923650.80 FALSE       1
    #> 3     8 H18117631_0                        TRUE        2
    #> 4    10 Maa_1240111711220682016_W123650.80 FALSE       2
    #> 5    13 H18121405_0                        TRUE        3
    #> 6    15 Mab_2467211713110643835_W923650.80 FALSE       3
    #> 7    19 H71811763_0                        TRUE        4
    #> 8    21 Maa_5325411210120486554_W923650.80 FALSE       4
    #> 9    22 Mab_5325411210110485554_W723650.80 FALSE       4
    
    
    dat4b <- dat4 %>% 
      filter(grepl(pattern = "H[0-9]+_0", X)) %>% 
      select(house_id = X, house2)
    dat4b
    #> # A tibble: 4 × 2
    #>   house_id    house2
    #>   <chr>        <int>
    #> 1 H18105265_0      1
    #> 2 H18117631_0      2
    #> 3 H18121405_0      3
    #> 4 H71811763_0      4
    
    
    # combine house and individual ids next to each other
    dat5 <- dat4 %>% 
      left_join(dat4b,
                by = "house2") %>% 
      mutate(prefix = gsub(pattern = "_.+", replacement = "", x = X),
             house_id = as.numeric(gsub("^H|_0", "", house_id))) %>% 
      select(rowid, house_id, prefix, X) %>% 
      filter(grepl(pattern = "^Ma._[0-9]+", X)) 
    dat5
    #> # A tibble: 5 × 4
    #>   rowid house_id prefix X                                 
    #>   <int>    <dbl> <chr>  <chr>                             
    #> 1     3 18105265 Mab    Mab_3416311514210525745_W923650.80
    #> 2    10 18117631 Maa    Maa_1240111711220682016_W123650.80
    #> 3    15 18121405 Mab    Mab_2467211713110643835_W923650.80
    #> 4    21 71811763 Maa    Maa_5325411210120486554_W923650.80
    #> 5    22 71811763 Mab    Mab_5325411210110485554_W723650.80
    
    
    # add the new information about individual ids
    dat6 <- left_join(dat5, df,
                      by = c("house_id" = "House_id",
                             "prefix" = "id"))
    dat6
    #> # A tibble: 5 × 5
    #>   rowid house_id prefix X                                  new_weight
    #>   <int>    <dbl> <chr>  <chr>                                   <dbl>
    #> 1     3 18105265 Mab    Mab_3416311514210525745_W923650.80       4567
    #> 2    10 18117631 Maa    Maa_1240111711220682016_W123650.80       3367
    #> 3    15 18121405 Mab    Mab_2467211713110643835_W923650.80       4500
    #> 4    21 71811763 Maa    Maa_5325411210120486554_W923650.80       2455
    #> 5    22 71811763 Mab    Mab_5325411210110485554_W723650.80       2872
    
    
    # generate the new ids
    dat7 <- dat6 %>% 
      mutate(Y = gsub(pattern = "(?=W).+", replacement = "", x = X, perl = T),
             X_new = paste0(Y, "W", new_weight)) %>% 
      select(rowid, X_new)
    dat7
    #> # A tibble: 5 × 2
    #>   rowid X_new                        
    #>   <int> <chr>                        
    #> 1     3 Mab_3416311514210525745_W4567
    #> 2    10 Maa_1240111711220682016_W3367
    #> 3    15 Mab_2467211713110643835_W4500
    #> 4    21 Maa_5325411210120486554_W2455
    #> 5    22 Mab_5325411210110485554_W2872
    
    
    # replace the old ids by the new ones
    dat[dat7$rowid] <- dat7$X_new
    dat
    #>  [1] "H18105265_0"                                                                           
    #>  [2] "R1_0"                                                                                  
    #>  [3] "Mab_3416311514210525745_W4567"                                                         
    #>  [4] "T1_0"                                                                                  
    #>  [5] "T2_0"                                                                                  
    #>  [6] "T3_0"                                                                                  
    #>  [7] "V64_0_2_010_ab171900171959"                                                            
    #>  [8] "H18117631_0"                                                                           
    #>  [9] "R1_0"                                                                                  
    #> [10] "Maa_1240111711220682016_W3367"                                                         
    #> [11] "T1_0"                                                                                  
    #> [12] "V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059"
    #> [13] "H18121405_0"                                                                           
    #> [14] "R1_0"                                                                                  
    #> [15] "Mab_2467211713110643835_W4500"                                                         
    #> [16] "T1_0"                                                                                  
    #> [17] "T2_0"                                                                                  
    #> [18] "V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359"          
    #> [19] "H71811763_0"                                                                           
    #> [20] "R1_0"                                                                                  
    #> [21] "Maa_5325411210120486554_W2455"                                                         
    #> [22] "Mab_5325411210110485554_W2872"                                                         
    #> [23] "T1_0"                                                                                  
    #> [24] "T2_0"                                                                                  
    #> [25] "T3_0"                                                                                  
    #> [26] "T4_0"
    
    
    # write back the updated data
    # writeLines(...)
    

    【Comments】:

      【Solution 5】:

      Here is a dplyr solution that uses left_join() but otherwise relies entirely on vectorized operations, which is considerably more efficient than looping for large datasets.

      While the code may look long, that is purely a formatting choice: for clarity I use

      foo(
        arg_1 = bar,
        arg_2 = baz,
        # ...
        arg_n = qux
      ) 
      

      rather than the one-liner foo(bar, baz, qux). Also for clarity, I elaborate on the line

          # Map each row to its house ID.
          House_id = data[row_number()[target][cumsum(target)]],
      

      in the Details section.

      Solution

      Given a file like subset.txt, reproduced here

      H18105265_0
      R1_0
      Mab_3416311514210525745_W923650.80
      T1_0
      T2_0
      T3_0
      V64_0_2_010_ab171900171959
      H18117631_0
      R1_0
      Maa_1240111711220682016_W123650.80
      T1_0
      V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
      H18121405_0
      R1_0
      Mab_2467211713110643835_W923650.80
      T1_0
      T2_0
      V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
      H71811763_0
      R1_0
      Maa_5325411210120486554_W923650.80
      Mab_5325411210110485554_W723650.80
      T1_0
      T2_0
      T3_0
      T4_0
      
      

      and a reference dataset like df, reproduced here

      df <- tibble::tribble(
        ~House_id,   ~id, ~new_weight,
         18105265, "Mab",        4567,
         18117631, "Maa",        3367,
         18121405, "Mab",        4500,
         71811763, "Maa",        2455,
         71811763, "Mab",        2872
      )
      

      the following solution

      # For manipulating data.
      library(dplyr)
      
      
      # ...
      # Code to generate your reference 'df'.
      # ...
      
      
      
      # Specify the filepath.
      text_filepath <- "subset.txt"
      
      # Define the textual pattern for each data item we want, where the relevant
      # values are divided into their own capture groups.
      regex_house_id <- "(H)(\\d+)(_)(\\d)"
      regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
      
      
      
      # Read the textual data (into a dataframe).
      data.frame(data = readLines(text_filepath)) %>%
      
        # Transform the textual data.
        mutate(
          # Target (TRUE) the identifying row (house ID) for each (contiguous) group.
          target = grepl(
            # Use the textual pattern for house IDs.
            pattern = regex_house_id,
            x = data
          ),
      
          # Map each row to its house ID.
          House_id = data[row_number()[target][cumsum(target)]],
      
          # Extract the underlying numeric ID from the house ID.
          House_id = gsub(
            pattern = regex_house_id,
            # The numeric ID is in the 2nd capture group.
            replacement = "\\2",
            x = House_id
          ),
      
          # Treat the numeric ID as a number.
          House_id = as.numeric(House_id),
      
      
      
          # Target (TRUE) the weighted labels.
          target = grepl(
            # Use the textual pattern for weighted labels.
            pattern = regex_weighted_label,
            x = data
          ),
      
          # Extract the ID from (only) the weighted labels.
          id = if_else(
            target,
            gsub(
              pattern = regex_weighted_label,
              # The ID is in the 1st capture group.
              replacement = "\\1",
              x = data
            ),
            # For any data that is NOT a weighted label, give it a blank (NA) ID.
            as.character(NA)
          ),
      
          # Extract from (only) the weighted labels everything else but the weight.
          rest = if_else(
            target,
            gsub(
              pattern = regex_weighted_label,
              # Everything is in the 2nd, 3rd, and 4th capture groups; ignoring the ID
              # (1st) and the weight (5th).
              replacement = "\\2\\3\\4",
              x = data
            ),
            # For any data that is NOT a weighted label, make it blank (NA) for
            # everything else.
            as.character(NA)
          )
        ) %>%
      
        # Link (JOIN) each weighted label to its new weight; with blanks (NAs) for
        # nonmatches.
        left_join(df, by = c("House_id", "id")) %>%
      
        # Replace (only) the weighted labels, with their updated values.
        mutate(
          data = if_else(
            target,
            # Generate the updated value by splicing together the original components
            # with the new weight.
            paste0(id, rest, new_weight),
            # For data that is NOT a weighted label, leave it unchanged.
            data
          )
        ) %>%
      
        # Extract the column of updated values.
        .$data %>%
      
        # Overwrite the original text with the updated values.
        writeLines(con = text_filepath)
      

      will transform your textual data and update the original file.

      Result

      The original file (here subset.txt) will now contain the updated information:

      H18105265_0
      R1_0
      Mab_3416311514210525745_W4567
      T1_0
      T2_0
      T3_0
      V64_0_2_010_ab171900171959
      H18117631_0
      R1_0
      Maa_1240111711220682016_W3367
      T1_0
      V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
      H18121405_0
      R1_0
      Mab_2467211713110643835_W4500
      T1_0
      T2_0
      V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
      H71811763_0
      R1_0
      Maa_5325411210120486554_W2455
      Mab_5325411210110485554_W2872
      T1_0
      T2_0
      T3_0
      T4_0
      
      

      Details

      Regular expressions

      The text manipulation relies only on the basic functionality of grepl() (to identify matches) and gsub() (to extract components). We divide each textual pattern, regex_house_id and regex_weighted_label, into its components, distinguished as capture groups within the regex:

      #      The "H" prefix.      The "_" separator.
      #                  | |      | |
      regex_house_id <- "(H)(\\d+)(_)(\\d)"
      #                     |    |   |   |
      #  The digits following "H".   The "0" suffix (or any digit).
      
      #                                The digits after the 'id'.
      #   The 'id': "M" then 2 small letters.   |    |    The weight (possibly a decimal).
      #                          |          |   |    |    |              |
      regex_weighted_label <-   "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
      #                                      | |      |  |
      #                       The "_" separator.      The "_" separator and "W" prefix before weight.
      

      We can use grepl(pattern = regex_weighted_label, x = my_strings) to check which strings in the vector my_strings match the format of a weighted label (like "Mab_3416311514210525745_W923650.80").

      We can also use gsub(pattern = regex_weighted_label, replacement = "\\5", my_labels) to extract the weights (the 5th capture group) from a vector my_labels of labels in that format.
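      As a quick self-contained check (a sketch reusing the pattern defined above on one label from the sample file):

```r
# The weighted-label pattern from this answer, with the weight in group 5
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
label <- "Mab_3416311514210525745_W923650.80"

grepl(regex_weighted_label, label)        # does it match the format?
# [1] TRUE

gsub(regex_weighted_label, "\\1", label)  # 1st capture group: the id
# [1] "Mab"

gsub(regex_weighted_label, "\\5", label)  # 5th capture group: the weight
# [1] "923650.80"
```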

      Mapping

      The line found in the first mutate() statement

          # Map each row to its house ID.
          House_id = data[row_number()[target][cumsum(target)]],
      

      may look cryptic. However, it is just a classic arithmetic trick (also used by @mnist in their solution) for indexing consecutive values into groups.

      The code cumsum(target) scans the target column, which (at this point in the workflow) holds logical values (TRUE FALSE FALSE ...) indicating whether (TRUE) or not (FALSE) a line of text is a house ID (like "H18105265_0"). Whenever it hits a TRUE (numerically 1) it increments its running total, while a FALSE (numerically 0) leaves the total unchanged.

      Since the textual data

      # |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
        "H18105265_0" "R1_0" ...                 "H18117631_0" "R1_0" ...           "H18121405_0" ...
      

      gives us the logical target

      # |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
        TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE ...
      

      whose values (TRUE and FALSE) are coerced to numerics (1 and 0)

      # |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
        1    0     0     0     0     0     0     1    0     0     0     0     0     1    0     ...
      

      cumsum() generates the running totals

      # |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
        1    1     1     1     1     1     1     2    2     2     2     2     2     3    3     ...  
      

      Notice that we have now mapped each row to its "group number". So much for cumsum(target).

      Now for row_number()[target]! In effect, row_number() simply "indexes" each position (row)

      # |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
        1             2      ...                 8             9      ...           13         ...
      

      in the data column (or any other column):

      # |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
        "H18105265_0" "R1_0" ...                 "H18117631_0" "R1_0" ...           "H18121405_0" ...
      
      

      So subscripting these indices with target

      # |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
        TRUE           FALSE ...                  TRUE          FALSE ...           TRUE       ...
      

      selects only the positions that hold house IDs:

      # |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
        1                                         8                                 13         ...
      

      So if we take this result from row_number()[target]

      # House ID: 1st 2nd 3rd ...
      # Position:
                  1   8   13  ... 
      

      and subscript it with cumsum(target)

      # |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
        1    1     1     1     1     1     1     2    2     2     2     2     2     3    3     ...
      

      we map each row to the position (in data) of its house ID:

      # |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
        1    1     1     1     1     1     1     8    8     8     8     8     8     13   13    ...
      

      This is the result of row_number()[target][cumsum(target)].

      Finally, when we subscript data with these (repeated) positions of the house IDs, we get House_id:

      # |----------------- Group 1 -----------------| |----------------- Group 2 -----------------| |-------------------------- ...
        "H18105265_0" "H18105265_0" ... "H18105265_0" "H18117631_0" "H18117631_0" ... "H18117631_0" "H18121405_0" "H18121405_0" ...
      

      Every value in data is mapped to the house ID of its group.

      Thanks to the House_id column

      House_id = data[row_number()[target][cumsum(target)]]
      

      alongside our data column, we can map (left_join()) the ids from df to their corresponding textual data.
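      The whole trick condenses into a few lines on a toy vector (a sketch; outside of mutate(), seq_along() plays the role of row_number()):

```r
# Toy data: two "houses", each followed by one member line
data   <- c("H1_0", "Mab_1_W9", "H2_0", "Maa_2_W8")
target <- grepl("^H", data)   # TRUE FALSE TRUE FALSE

seq_along(data)[target]       # positions of the house IDs: 1 3
cumsum(target)                # group number of each row:   1 1 2 2

# Each row gets the house-ID line of its own group
data[seq_along(data)[target][cumsum(target)]]
# [1] "H1_0" "H1_0" "H2_0" "H2_0"
```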

      【Comments】:
