【Question title】: In R, convert/split 1-column dataframe into 4 columns based on splitting content in strings
【Posted】: 2021-09-26 18:04:36
【Question】:

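This feels like a fairly difficult data manipulation / data frame cleanup problem in R. We have the following messy data frame, which currently packs several columns of information into the X2 column. The example below uses fake names, emails, and phone numbers: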

coach_info <- structure(list(X1 = c(NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_), X2 = c("TBA\r\n Head Coach", "Bobby Flowes\r\n Associate Head Women's Basketball Coach", 
"Jimmy Jimm\r\n Assistant Women's Basketball Coach", "Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) 456-7890Tom.Tommy@abc.edu", 
NA, "Gabens Spar\r\n Men's Basketball Graduate Assistant Coachgabensspar@gmail.edu", 
"A.B. Better\r\n Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information Associateabbetter@gmail.edu\r\n   111-222-3333", 
"Nick Romanov\r\n Head Crew Coach\r\n nick.nick@school.edu\r\n 123-123-1234", 
"Name Lasttt\r\n Assistant Coach")), row.names = c(1L, 2L, 3L, 
7L, 12L, 16L, 17L, 25L, 29L), class = "data.frame")

head(coach_info, 4)
    X1                                                                                   X2
1 <NA>                                                                   TBA\r\n Head Coach
2 <NA>                             Bobby Flowes\r\n Associate Head Women's Basketball Coach
3 <NA>                                    Jimmy Jimm\r\n Assistant Women's Basketball Coach
7 <NA> Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) 456-7890Tom.Tommy@abc.edu

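We are trying to split the information in column X2 into four columns: Name, Title, Email, and Phone. Calling strsplit(coach_info$X2, '\r\n') gives a messy nested list, and splitting on \r\n is imperfect because some rows have no \r\n between fields.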

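On top of that, the inner lists have different numbers of elements, because many rows are missing one or more of the name, phone number, or email address: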

> unlist(lapply(strsplit(coach_info$X2, '\r\n'), length))
 [1] 2 2 2 3 1 2 3 4 2
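
To see what those uneven pieces look like side by side, one option is to pad every split to a common length. The following is a minimal base R sketch, assuming a three-row vector `x` that stands in for coach_info$X2; the names `x`, `parts`, `n`, and `padded` are illustrative and not from the original post.

```r
# Three sample values in the same shape as coach_info$X2
x <- c(
  "TBA\r\n Head Coach",
  "Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) 456-7890Tom.Tommy@abc.edu",
  "Nick Romanov\r\n Head Crew Coach\r\n nick.nick@school.edu\r\n 123-123-1234"
)

parts <- strsplit(x, "\r\n", fixed = TRUE)
n <- max(lengths(parts))  # the widest row decides the number of columns

# Trim each piece and pad short rows with NA (indexing past the end yields NA)
padded <- t(vapply(parts, function(p) trimws(p)[seq_len(n)], character(n)))
print(padded)
```

In this sketch the third column holds a fused phone number and email for one row and a bare email for another, which is why position alone cannot identify the fields.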

Our goal is to get as close as possible to this:

output_df <- data.frame(
    Name = c('TBA', 'Bobby Flowes', 'Jimmy Jimm', 'Rod Barber', NA, 'Gabens Spar', 'A.B. Better', 'Nick Romanov', 'Name Lasttt'),
    Title = c('Head Coach', "Associate Head Women's Basketball Coach", "Assistant Women's Basketball Coach", "Head Men's Basketball Coach",
              NA, " Men's Basketball Graduate Assistant", "Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information Associate",
              "Head Crew Coach", "Assistant Coach"),
    Email = c(NA, NA, NA, "Tom.Tommy@abc.edu", NA, "Coachgabensspar@gmail.edu", "abbetter@gmail.edu", "nick.nick@school.edu", NA),
    Phone = c(NA, NA, NA, "(123) 456-7890", NA, NA, "111-222-3333", "123-123-1234", NA),
    stringsAsFactors = FALSE
  )
  

>   head(output_df, 4)
          Name                                   Title             Email          Phone
1          TBA                              Head Coach              <NA>           <NA>
2 Bobby Flowes Associate Head Women's Basketball Coach              <NA>           <NA>
3   Jimmy Jimm      Assistant Women's Basketball Coach              <NA>           <NA>
4   Rod Barber             Head Men's Basketball Coach Tom.Tommy@abc.edu (123) 456-7890

Where no space or \r\n separates two fields, a clean split seems impossible, as the output above shows. At this point we are just trying to get as close as we can.

【Comments】:

    Tags: r dataframe data-manipulation strsplit


    【Solution 1】:

    How about something like this:

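    library(data.table)
    setDT(coach_info)  # convert to a data.table by reference
    
    # Phone: 3-3-4 digits, optionally separated by non-alphanumeric characters
    re.phone <- '.*(\\d{3}[^[:alnum:]]*\\d{3}[^[:alnum:]]*\\d{4}).*'
    # Email: an address preceded by a character that cannot belong to it
    re.email <- ".*[^_[:alnum:]\\-\\.]([_[:alnum:]\\-\\.]+@[[:alnum:]\\.]+).*"
    # Name (letters, digits, blanks) before \r\n, title after it
    re.text1 <- '([[:alnum:][:blank:]]+)\r\n([[:alnum:][:blank:][:punct:]]+).*'
    
    # Working copy of X2 from which matched fields are stripped
    coach_info[,processed:=X2]
    
    # Extract the phone number, then blank it out of the working copy
    coach_info[grepl(re.phone,X2), phone:=gsub(re.phone,'\\1',X2)]
    coach_info[!is.na(phone), processed:=gsub(phone,' ',X2,fixed=TRUE),by=phone]
    
    # Do the same for the email address
    coach_info[grepl(re.email,processed), email:=gsub(re.email,'\\1',processed)]
    coach_info[!is.na(email), processed:=gsub(email,' ',processed,fixed=TRUE),by=email]
    
    # What remains is split into name and title
    coach_info[, Name:=gsub(re.text1,'\\1',processed)]
    coach_info[, Title:=gsub(re.text1,'\\2',processed)]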
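
For readers who would rather not use data.table, the same extract-then-strip idea can be sketched in base R. This is only an illustration under stated assumptions, not the answer's method: the helper `extract_first` and the patterns `phone_re` and `email_re` are hypothetical names and regular expressions chosen here, and the vector `x` is a five-row stand-in for coach_info$X2.

```r
# Hypothetical helper: first match of `pattern` in each element, NA if none
extract_first <- function(x, pattern) {
  out <- rep(NA_character_, length(x))
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m > 0
  out[hit] <- regmatches(x, m)  # regmatches() returns the matched elements only
  out
}

# Illustrative patterns (assumptions, not taken from the answer above)
phone_re <- "\\(?\\d{3}\\)?[ .-]*\\d{3}[ .-]*\\d{4}"
email_re <- "[[:alnum:]._-]+@[[:alnum:].-]+"

x <- c(
  "TBA\r\n Head Coach",
  "Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) 456-7890Tom.Tommy@abc.edu",
  "Gabens Spar\r\n Men's Basketball Graduate Assistant Coachgabensspar@gmail.edu",
  "Nick Romanov\r\n Head Crew Coach\r\n nick.nick@school.edu\r\n 123-123-1234",
  NA
)

# Take the phone out first so its digits cannot be absorbed into the email
phone <- extract_first(x, phone_re)
rest  <- sub(phone_re, " ", x, perl = TRUE)
email <- extract_first(rest, email_re)
rest  <- sub(email_re, " ", rest, perl = TRUE)

# What remains is name and title, separated by the first \r\n
pieces <- strsplit(rest, "\r\n", fixed = TRUE)
result <- data.frame(
  Name  = vapply(pieces, function(p) trimws(p[1]), character(1)),
  Title = vapply(pieces, function(p) trimws(p[2]), character(1)),
  Email = email,
  Phone = phone,
  stringsAsFactors = FALSE
)
print(result)
```

On the fused value `Coachgabensspar@gmail.edu` this sketch keeps `Coach` with the email, as in the question's target output, because nothing in the string marks where the title ends.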
    

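    【Comments】:

    • Yes, this gets us 85-90% of the way there, which is a big help. I'm less familiar with data.table than with the tidyverse, but this is still not too hard to follow.
    • If any records are not handled correctly by this code, feel free to add them to your example, and I or someone else will try to help.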