【Question title】:R split string and keep section
【Posted】:2019-03-11 03:48:27
【Question description】:

I have a string containing the starting lineup for a rugby match (scraped from the web). It looks like this:

 "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"

What I want is basically a table with two columns: one with the player's number and one with the player's name. For example:

position     name
1            Joe Moody
2            Codie Taylor
3            Owen Franks
4            Scott Barrett
...          ...

and so on for all players.

I tried using strsplit to split on ",", but then the first player becomes:

"Crusaders: 15 David Havili"

and numbers 1 and 16 are merged:

"1 Joe MoodyReplacements: 16 Sam Anderson-Heather".

Any ideas?
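
For reference, the attempt described above can be reproduced roughly like this. It is only a sketch on a shortened copy of the string: the exact strsplit call was not posted, so the `", "` separator and the variable names `lineup` and `parts` are assumptions.

```r
# Shortened copy of the lineup string (assumption: only a few players kept).
lineup <- "Crusaders: 15 David Havili, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry"

# Split on the literal ", " separator only.
parts <- strsplit(lineup, ", ", fixed = TRUE)[[1]]

# The first piece still carries the team name, and the piece for player 1
# is fused with "Replacements:" and player 16.
print(parts)
```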

【Comments】:

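  • Your string is not consistently formatted; e.g., in almost all cases "," (comma) is used as the separator, except in the "1 Joe Moody: 16 Sam Anderson-Heather" part, where ":" (colon) is the separator. Is that a typo? What do you expect to happen with the replacement players? Should they be included in the output table?
  • Re-import the data and make sure the line breaks are preserved.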

Tags: r list strsplit


【Solution 1】:

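I agree with @HongOoi's comment; it's better to take a step back and make sure the data is imported in a more sensible way. That said, here is a hacky after-the-fact solution. I'm not sure how well it generalizes, if at all.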

ss <-  "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"


library(tidyverse)
data.frame(ss = ss) %>%
    mutate(ss = str_replace(ss, "Replacements", "")) %>%   # Remove "Replacements"
    mutate(ss = str_split(ss, "(,|:) ")) %>%               # Split on ", " or ": "
    unnest(ss) %>%                                         # One row per piece (bare unnest() is deprecated in tidyr >= 1.0.0)
    separate(ss, c("position", "name"), sep = "(?<=\\d)\\s", fill = "right") %>%  # Split at the space after the number
    filter(!is.na(name))                                   # Remove the first "Crusaders" line
#   position                  name
#1        15          David Havili
#2        14       Seta Tamanivalu
#3        13          Jack Goodhue
#4        12           Ryan Crotty
#5        11         George Bridge
#6        10        Richie Mo’unga
#7         9             Bryn Hall
#8         8           Kieran Read
#9         7             Matt Todd
#10        6 Heiden Bedwell-Curtis
#11        5     Sam Whitelock (c)
#12        4         Scott Barrett
#13        3           Owen Franks
#14        2          Codie Taylor
#15        1             Joe Moody
#16       16  Sam Anderson-Heather
#17       17             Tim Perry
#18       18     Michael Alaalatoa
#19       19           Luke Romano
#20       20             Pete Samu
#21       21     Mitchell Drummond
#22       22         Mitchell Hunt
#23       23         Braydon Ennor
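
The same steps can also be sketched in base R, for anyone who wants to avoid the tidyverse dependency. This is not part of the answer above, just one possible equivalent; the shortened string and the names `cleaned`, `pieces` and `lineup` are made up for illustration.

```r
# Shortened sample in the same format as the question's string
# (assumption: only a few players are kept for readability).
ss <- "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 23 Braydon Ennor"

# Step 1: drop the literal word "Replacements", leaving ": " as a separator.
cleaned <- sub("Replacements", "", ss, fixed = TRUE)

# Step 2: split on ", " or ": ".
pieces <- strsplit(cleaned, "(,|:) ", perl = TRUE)[[1]]

# Step 3: keep only pieces that start with a jersey number (drops "Crusaders").
pieces <- pieces[grepl("^\\d+\\s", pieces, perl = TRUE)]

# Step 4: separate the leading number from the rest of each piece.
lineup <- data.frame(
  position = as.integer(sub("^(\\d+)\\s.*$", "\\1", pieces, perl = TRUE)),
  name     = sub("^\\d+\\s+", "", pieces, perl = TRUE),
  stringsAsFactors = FALSE
)

print(lineup)
```

Filtering on a leading number in step 3 replaces the `filter(!is.na(name))` step, so the team-name piece is dropped without relying on a failed split.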

【Comments】:

    【Solution 2】:

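    Using stringr::str_match_all() and some regex, you can find and extract all matches. Take care to use the non-greedy (?) operator and to match the end of the line, which has no comma: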

    library(dplyr)
    library(stringr)
    ea <- "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"
    ea <- unlist(strsplit(ea, "Replacements: "))
    
    tibble(jersey = str_match_all(ea, "\\d+") %>% unlist(),
    player = str_match_all(ea, "(?<=\\d\\s).*?(?=.$|,)") %>% unlist())
    
    # A tibble: 23 x 2
    #   jersey player
    #   <chr>  <chr>
    # 1 15     David Havili
    # 2 14     Seta Tamanivalu
    # 3 13     Jack Goodhue
    # 4 12     Ryan Crotty
    # 5 11     George Bridge
    # ...
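
To see why the non-greedy operator matters, here is a small base R comparison. It is a sketch, not the answer's exact pattern: the lookahead is reduced to `(?=,)` to isolate the effect of the quantifier, and the two-player string `x` (with a trailing comma) is made up for illustration.

```r
# Two players, each followed by a comma (hypothetical sample).
x <- "15 David Havili, 14 Seta Tamanivalu,"

# Lazy ".*?" stops at the first comma after each jersey number.
lazy <- regmatches(x, gregexpr("(?<=\\d\\s).*?(?=,)", x, perl = TRUE))[[1]]

# Greedy ".*" runs on to the last comma and swallows the second player.
greedy <- regmatches(x, gregexpr("(?<=\\d\\s).*(?=,)", x, perl = TRUE))[[1]]

print(lazy)
print(greedy)
```

With the lazy version each name is returned as its own match; with the greedy version everything from the first name to the last comma comes back as a single match.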
    

    【Comments】:

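    • I hadn't noticed "Replacements", but it now works for all players.
    • Hi @Elio, thank you very much for your answer, it's really helpful. A couple of questions: (1) It doesn't seem to read the last letter of the final name; any idea why? (2) It also only works when it reaches a comma (I know that's how you coded it). However, one particular match has an error where the comma after a player's name is missing: "17 Jacques Van Rooyen, 18 Jacobie Adriaanse 19 Lourens Erasmus, 20 Marvin Orie,". Note that there is no comma after Jacobie Adriaanse. Any ideas what can be done here?
    • Yes, you're right, that's because of the capture group; change it to "(?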
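
On the two problems raised in the comments: in `(?=.$|,)` the `.$` alternative only succeeds when exactly one character is left before the end of the string, so the lazy match stops one character short of the final name. One possible way around both that and the missing comma is to match "a number, then everything that is neither a comma nor a digit". This is a sketch in base R and not the fix suggested in the reply above; it assumes player names never contain digits or commas, and the names `x`, `m`, `jersey` and `player` are illustrative.

```r
# The problematic snippet from the comment: no comma after "Jacobie Adriaanse".
x <- "17 Jacques Van Rooyen, 18 Jacobie Adriaanse 19 Lourens Erasmus, 20 Marvin Orie,"

# A jersey number, whitespace, then a run of characters that are neither
# commas nor digits. The run ends at a comma, at the next number, or at
# the end of the string, so no lookahead is needed.
m <- regmatches(x, gregexpr("\\d+\\s+[^,\\d]+", x, perl = TRUE))[[1]]
m <- trimws(m)  # a name followed directly by a number keeps a trailing space

jersey <- sub("\\s.*$", "", m, perl = TRUE)    # text before the first space
player <- sub("^\\d+\\s+", "", m, perl = TRUE) # text after the number

data.frame(jersey = jersey, player = player, stringsAsFactors = FALSE)
```

On the full lineup string, "Replacements: " would still need to be split off first, as in the answer, because it contains neither a comma nor a digit and would otherwise be absorbed into Joe Moody's name.
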
    【Solution 3】:

    Here is a quick and dirty approach that works for your example string. It will not work on other strings if the team name at the start is missing.

    player.string <- "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"
    
    df <- read.table(text = gsub("(\\d+)", "\\1\t", gsub("Replacements:|(^[^:]*:)|, ", "\n", player.string)), header = FALSE, sep = "\t", col.names = c("Number", "Name")) 
    df[order(df$Number),]
    
    #    Number                   Name
    # 15      1              Joe Moody
    # 14      2           Codie Taylor
    # 13      3            Owen Franks
    # 12      4          Scott Barrett
    # 11      5      Sam Whitelock (c)
    # 10      6  Heiden Bedwell-Curtis
    # 9       7              Matt Todd
    # 8       8            Kieran Read
    # 7       9              Bryn Hall
    # ...
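
The nested one-liner is easier to follow when unpacked into named steps. The sketch below does that on a shortened string; the intermediate names `one.per.line` and `tab.separated` are made up, and `strip.white = TRUE` and `stringsAsFactors = FALSE` are additions that are not in the answer (they trim the leading spaces left by the substitutions and keep names as plain character strings).

```r
# Shortened sample in the question's format (assumption: fewer players).
player.string <- "Crusaders: 15 David Havili, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather"

# Step 1: replace the team prefix, "Replacements:" and every ", " with a
# newline, so that each player ends up on a line of their own.
one.per.line <- gsub("Replacements:|(^[^:]*:)|, ", "\n", player.string)

# Step 2: put a tab after each run of digits to separate number and name.
tab.separated <- gsub("(\\d+)", "\\1\t", one.per.line)

# Step 3: parse the text as a two-column, tab-separated table.
df <- read.table(text = tab.separated, header = FALSE, sep = "\t",
                 col.names = c("Number", "Name"),
                 strip.white = TRUE, stringsAsFactors = FALSE)

# Sort by jersey number, as in the answer.
df[order(df$Number), ]
```

Step 2 inserts a tab after every run of digits, so a name that itself contains a digit would be split into extra columns.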
    

    【Comments】:
