Here is a dplyr solution that uses left_join()... but otherwise relies entirely on vectorized operations, which should be considerably more efficient than looping over a large dataset.
While the code might appear long, that is merely a formatting choice: for the sake of clarity, I use
foo(
arg_1 = bar,
arg_2 = baz,
# ...
arg_n = qux
)
rather than the one-liner foo(bar, baz, qux). Also for the sake of clarity, I will elaborate on the line
# Map each row to its house ID.
House_id = data[row_number()[target][cumsum(target)]],
in the Details section.
Solution
Given a file like subset.txt, reproduced here
H18105265_0
R1_0
Mab_3416311514210525745_W923650.80
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W123650.80
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W923650.80
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W923650.80
Mab_5325411210110485554_W723650.80
T1_0
T2_0
T3_0
T4_0
and a reference dataset like df, reproduced here
df <- tibble::tribble(
~House_id, ~id, ~new_weight,
18105265, "Mab", 4567,
18117631, "Maa", 3367,
18121405, "Mab", 4500,
71811763, "Maa", 2455,
71811763, "Mab", 2872
)
the following solution
# For manipulating data.
library(dplyr)
# ...
# Code to generate your reference 'df'.
# ...
# Specify the filepath.
text_filepath <- "subset.txt"
# Define the textual pattern for each data item we want, where the relevant
# values are divided into their own capture groups.
regex_house_id <- "(H)(\\d+)(_)(\\d)"
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
# Read the textual data (into a dataframe).
data.frame(data = readLines(text_filepath)) %>%
# Transform the textual data.
mutate(
# Target (TRUE) the identifying row (house ID) for each (contiguous) group.
target = grepl(
# Use the textual pattern for house IDs.
pattern = regex_house_id,
x = data
),
# Map each row to its house ID.
House_id = data[row_number()[target][cumsum(target)]],
# Extract the underlying numeric ID from the house ID.
House_id = gsub(
pattern = regex_house_id,
# The numeric ID is in the 2nd capture group.
replacement = "\\2",
x = House_id
),
# Treat the numeric ID as a number.
House_id = as.numeric(House_id),
# Target (TRUE) the weighted labels.
target = grepl(
# Use the textual pattern for weighted labels.
pattern = regex_weighted_label,
x = data
),
# Extract the ID from (only) the weighted labels.
id = if_else(
target,
gsub(
pattern = regex_weighted_label,
# The ID is in the 1st capture group.
replacement = "\\1",
x = data
),
# For any data that is NOT a weighted label, give it a blank (NA) ID.
as.character(NA)
),
# Extract from (only) the weighted labels everything else but the weight.
rest = if_else(
target,
gsub(
pattern = regex_weighted_label,
# Everything is in the 2nd, 3rd, and 4th capture groups; ignoring the ID
# (1st) and the weight (5th).
replacement = "\\2\\3\\4",
x = data
),
# For any data that is NOT a weighted label, make it blank (NA) for
# everything else.
as.character(NA)
)
) %>%
# Link (JOIN) each weighted label to its new weight; with blanks (NAs) for
# nonmatches.
left_join(df, by = c("House_id", "id")) %>%
# Replace (only) the weighted labels, with their updated values.
mutate(
data = if_else(
target,
# Generate the updated value by splicing together the original components
# with the new weight.
paste0(id, rest, new_weight),
# For data that is NOT a weighted label, leave it unchanged.
data
)
) %>%
# Extract the column of updated values.
.$data %>%
# Overwrite the original text with the updated values.
writeLines(con = text_filepath)
will transform your textual data and update the original file.
Results
The original file (here subset.txt) will now contain the updated information:
H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0
Details
Regular Expressions
The textual manipulation relies only on the basic functionality of grepl() (to identify matches) and gsub() (to extract components). We divide each textual pattern, regex_house_id and regex_weighted_label, into its components, distinguished as capture groups within the regex:
# The "H" prefix. The "_" separator.
# | | | |
regex_house_id <- "(H)(\\d+)(_)(\\d)"
# | | | |
# The digits following "H". The "0" suffix (or any digit).
# The digits after the 'id'.
# The 'id': "M" then 2 small letters. | | The weight (possibly a decimal).
# | | | | | |
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
# | | | |
# The "_" separator. The "_" separator and "W" prefix before weight.
We can use grepl(pattern = regex_weighted_label, x = my_strings) to check which strings in the vector my_strings match the format of a weighted label (like "Mab_3416311514210525745_W923650.80").
We can also use gsub(pattern = regex_weighted_label, replacement = "\\5", my_labels) to extract the weights (in the 5th capture group) from a vector my_labels of labels in that format.
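As a quick sketch of those two operations (the vector my_strings below is an illustrative stand-in, mixing one weighted label with one non-matching line):

```r
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"

# grepl() flags which strings match the weighted-label format.
my_strings <- c("Mab_3416311514210525745_W923650.80", "T1_0")
grepl(pattern = regex_weighted_label, x = my_strings)
# [1]  TRUE FALSE

# gsub() extracts the weight (the 5th capture group) from a matching label.
gsub(
  pattern = regex_weighted_label,
  replacement = "\\5",
  x = "Mab_3416311514210525745_W923650.80"
)
# [1] "923650.80"
```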
Mapping
Found in the first mutate() statement, the line
# Map each row to its house ID.
House_id = data[row_number()[target][cumsum(target)]],
might appear cryptic. However, it is simply a classic arithmetic trick (also used by @mnist in their solution) for indexing contiguous values into groups.
The code cumsum(target) scans down the target column, which (at this point in the workflow) holds logical values (TRUE FALSE FALSE ...) indicating whether (TRUE) or not (FALSE) a line of text is a house ID (like "H18105265_0"). Whenever it reaches a TRUE (numerically a 1), it increments its running total, whereas a FALSE (numerically a 0) leaves the total unchanged.
Since the textual data column
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
"H18105265_0" "R1_0" ... "H18117631_0" "R1_0" ... "H18121405_0" ...
gives us the logical target column
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE ...
these values (TRUE and FALSE) are coerced to numbers (1 and 0)
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 ...
from which cumsum() generates:
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 ...
Notice how we have now mapped each row to its "group number". So much for cumsum(target).
Now for row_number()[target]! Effectively, row_number() simply "indexes" each position (row)
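In miniature (a shorter target than the one above, purely for illustration):

```r
# Logical values are coerced to 1s and 0s, so cumsum() numbers the groups:
# each TRUE starts a new group, and each FALSE inherits the current group.
target <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
cumsum(target)
# [1] 1 1 1 2 2 3
```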
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 2 ... 8 9 ... 13 ...
in the data column (or any other column):
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
"H18105265_0" "R1_0" ... "H18117631_0" "R1_0" ... "H18121405_0" ...
So subscripting these indices by target
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
TRUE FALSE ... TRUE FALSE ... TRUE ...
selects only the positions that hold house IDs:
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 8 13 ...
So if we take this result from row_number()[target]
# House ID: 1st 2nd 3rd ...
# Position:
1 8 13 ...
and subscript it by cumsum(target)
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 ...
we map each row to the position (in data) of its house ID:
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 1 1 1 1 1 1 8 8 8 8 8 8 13 13 ...
This is the result of row_number()[target][cumsum(target)].
Finally, when we subscript data by these (repeated) positions of the house IDs, we obtain the House_id column
# |----------------- Group 1 -----------------| |----------------- Group 2 -----------------| |-------------------------- ...
"H18105265_0" "H18105265_0" ... "H18105265_0" "H18117631_0" "H18117631_0" ... "H18117631_0" "H18121405_0" "H18121405_0" ...
in which every value in data is mapped to the house ID for its group.
Thanks to this House_id column
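The whole trick can be sketched in base R, where seq_along() plays the role of dplyr's row_number() outside a mutate(); the short IDs below are toy values, not taken from subset.txt:

```r
data   <- c("H1_0", "R1_0", "T1_0", "H2_0", "R1_0", "H3_0")
target <- grepl("^H", data)  # TRUE at each house-ID row

seq_along(data)[target]                  # positions of the house IDs
# [1] 1 4 6
seq_along(data)[target][cumsum(target)]  # each row -> its house-ID position
# [1] 1 1 1 4 4 6
data[seq_along(data)[target][cumsum(target)]]
# [1] "H1_0" "H1_0" "H1_0" "H2_0" "H2_0" "H3_0"
```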
House_id = data[row_number()[target][cumsum(target)]]
alongside our data column, we can map (left_join()) the ids from df to their corresponding textual data.
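A toy sketch of that join step, with made-up rows (only the shape matters): each (House_id, id) pair found in the reference picks up its new_weight, while nonmatches (including the NA ids of non-label rows) are left with an NA new_weight.

```r
library(dplyr)

# Stand-in for the mutated text data: one weighted label, one non-label
# row (NA id), and a second weighted label under another house.
mapped <- tibble::tibble(
  House_id = c(18105265, 18105265, 18117631),
  id       = c("Mab", NA, "Maa")
)
# Stand-in for the reference 'df'.
ref <- tibble::tibble(
  House_id   = c(18105265, 18117631),
  id         = c("Mab", "Maa"),
  new_weight = c(4567, 3367)
)

left_join(mapped, ref, by = c("House_id", "id"))
# Keeps all rows of 'mapped', in order; new_weight is 4567, NA, 3367.
```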