【Question title】: R efficiency challenge: Splitting a long character vector
【Posted】: 2019-03-18 22:35:53
【Question description】:

The question is how to efficiently parse data in this format:

lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

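into a two-column data frame: one column for the position and one for the player.

The names are baseball players, and each name is preceded by the player's position. The positions are exactly the collection {C, P, P, OF, 3B, SS, 1B, OF, 2B, OF}, in some order. That is, exactly those positions always appear.

For example, "C James McCann" should become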

data.frame(position = "C", player = "James McCann")

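In practice I have hundreds of thousands of these strings, and I would like to parse them efficiently. Here is my inefficient solution:

library(stringr)  # also provides the %>% pipe

data.frame(
    position = str_match_all(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]] %>% as.character() %>% str_trim(),
    player = str_split(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]][-1],
    stringsAsFactors = FALSE
)

This tidyverse solution is simple, but I suspect I can do better. Does anyone have any ideas?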

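【Comments】:

  • Was "C James McCann" removed on purpose? He does not make it into your tidyverse solution.
  • Is it guaranteed that nobody's first or last name is the same as a position code?
  • @Maurits, no, he wasn't, sorry. I have edited it now.
  • @qwr, that is guaranteed. Wherever a player's initial appears, it is written with a period.
  • @ThanksABundle Could you provide a more representative example for testing edge cases? For instance, every player currently has exactly a first name and a last name. I assume you can also have players with middle names? What about double-barrelled surnames? You mentioned players' initials. To make sure a method is robust, it is important to have a more complex sample dataset to work with.

Tags: r regex string performance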


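【Solution 1】:

You can use stringi::stri_match_all_regex with a pattern that captures both the position and the player name at once:

library(stringi)
results <- stri_match_all_regex(lineup,
                   pattern = "(C|P|OF|3B|SS|1B|2B) ([A-Z][A-Za-z]+ [A-Z][A-Za-z]+)")
results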
[[1]]
      [,1]                   [,2] [,3]               
 [1,] "C James McCann"       "C"  "James McCann"     
 [2,] "P Robbie Ray"         "P"  "Robbie Ray"       
 [3,] "P Rafael Montero"     "P"  "Rafael Montero"   
 [4,] "OF Giancarlo Stanton" "OF" "Giancarlo Stanton"
 [5,] "3B Derek Dietrich"    "3B" "Derek Dietrich"   
 [6,] "SS Miguel Rojas"      "SS" "Miguel Rojas"     
 [7,] "1B Tommy Joseph"      "1B" "Tommy Joseph"     
 [8,] "OF Marcell Ozuna"     "OF" "Marcell Ozuna"    
 [9,] "OF Christian Yelich"  "OF" "Christian Yelich" 

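My pattern is stricter than yours, because I restrict the one or two characters between the spaces to the actual baseball position codes. You get a list containing one matrix per input string. You should probably post a more complex example to support any further processing you need. Assuming the result of the call above is stored in results, you will need something like:
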
lapply( results, function(x){ as.data.frame(x[ , -1]) })
[[1]]
  V1                V2
1  C      James McCann
2  P        Robbie Ray
3  P    Rafael Montero
4 OF Giancarlo Stanton
5 3B    Derek Dietrich
6 SS      Miguel Rojas
7 1B      Tommy Joseph
8 OF     Marcell Ozuna
9 OF  Christian Yelich
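With hundreds of thousands of strings, a list of small data frames is awkward to work with. One possible way to collapse the list of match matrices into a single data frame is sketched below. The `results` object here is a hand-built stand-in with the same shape as the stringi output (one three-column matrix per lineup), and the names `lineups` and `lineup_id` are made up for illustration.

```r
# Stand-in for the list returned by stri_match_all_regex():
# one matrix per lineup with columns full match, position, player.
results <- list(
  matrix(c("C James McCann", "C", "James McCann",
           "P Robbie Ray",   "P", "Robbie Ray"),
         ncol = 3, byrow = TRUE),
  matrix(c("OF Marcell Ozuna", "OF", "Marcell Ozuna"),
         ncol = 3, byrow = TRUE)
)

# Stack all matrices once instead of building one data frame per lineup.
combined <- do.call(rbind, results)

lineups <- data.frame(
  # Record which input string each row came from.
  lineup_id = rep(seq_along(results), vapply(results, nrow, integer(1))),
  position  = combined[, 2],
  player    = combined[, 3],
  stringsAsFactors = FALSE
)
```

Binding the matrices first and creating the data frame once avoids the per-element overhead of calling as.data.frame() inside lapply().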

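If you need to handle hyphenated names, middle names, or initials, the pattern will probably need to be more complex. Note that the output above has only nine rows: "2B C?sar Hern?ndez" is dropped because "?" is outside the character class [A-Za-z].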
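One way to avoid describing the shape of a name at all is to split at the position codes instead. The following is a minimal base R sketch under that assumption, not the answer's stringi method; `parse_lineups()` is a hypothetical helper name, and the position codes are the ones listed in the question.

```r
# Position codes taken from the question; adjust if your data uses others.
positions <- c("C", "P", "OF", "SS", "1B", "2B", "3B")

# Split at a whitespace character that is followed by "<position><space>".
split_pat <- sprintf("\\s(?=(?:%s)\\s)", paste(positions, collapse = "|"))

# parse_lineups() is a hypothetical helper, vectorised over many strings.
parse_lineups <- function(x) {
  chunks <- strsplit(trimws(x), split_pat, perl = TRUE)
  flat <- unlist(chunks, use.names = FALSE)
  data.frame(
    lineup_id = rep(seq_along(x), lengths(chunks)),
    position  = sub("\\s.*$", "", flat, perl = TRUE),    # text before the first space
    player    = sub("^\\S+\\s+", "", flat, perl = TRUE), # everything after it
    stringsAsFactors = FALSE
  )
}

lineup2 <- " C James P. McCann P Robbie Ray, Jr P Rafael D Montero OF Giancarlo Stanton"
parsed <- parse_lineups(lineup2)
```

Because only the position codes are matched, initials written with a period ("P."), suffixes ("Jr"), and non-ASCII or damaged characters in names pass through unchanged. A bare initial that equals a position code (for example " C ") would still be misread as a position.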

【Comments】:

    【Solution 2】:

    Here is a solution that converts lineup into a string in CSV format, which is then read by fread():

    library(magrittr)  # piping used to improve readability
    lineup %>% 
      stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>% 
      data.table::fread(header = FALSE, col.names = c("position", "player"))
    
        position            player
     1:        C      James McCann
     2:        P        Robbie Ray
     3:        P    Rafael Montero
     4:       OF Giancarlo Stanton
     5:       3B    Derek Dietrich
     6:       SS      Miguel Rojas
     7:       1B      Tommy Joseph
     8:       OF     Marcell Ozuna
     9:       2B   C?sar Hern?ndez
    10:       OF  Christian Yelich
    

    The "trick" is to put a line break before each position code and a column separator after it; for example, " C " becomes "\nC;":

    lineup %>% 
      stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;")
    

    returns

    [1] "\nC;James McCann\nP;Robbie Ray\nP;Rafael Montero\nOF;Giancarlo  Stanton\n3B;Derek Dietrich\nSS;Miguel Rojas\n1B;Tommy Joseph\nOF;Marcell Ozuna\n2B;C?sar Hern?ndez\nOF;Christian Yelich"
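The same newline-plus-separator idea can be tried without stringr or data.table. The sketch below is a base R variant using gsub() and read.table(), shown on a shortened lineup; it is an illustration of the trick, not a claim about which version is faster.

```r
lineup_short <- " C James McCann P Robbie Ray OF Marcell Ozuna"

# Turn " C " into "\nC;" so every player starts a new "position;player" line.
csv_text <- gsub("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\n\\1;", lineup_short, perl = TRUE)

parsed_csv <- read.table(
  text = csv_text,
  sep = ";",
  quote = "",                 # keep apostrophes in names literal
  comment.char = "",          # do not treat "#" as a comment
  colClasses = "character",   # never guess column types
  col.names = c("position", "player")
)
```

The leading empty line produced by the first "\n" is skipped because read.table() ignores blank lines by default.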
    

    This approach does not make many assumptions about the names. It even works with names such as James P. McCann or Robbie Ray, Jr.

    lineup2 %>% 
      stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>% 
      data.table::fread(header = FALSE, col.names = c("position", "player"))
    
        position            player
     1:        C   James P. McCann
     2:        P    Robbie Ray, Jr
     3:        P  Rafael D Montero
     4:       OF Giancarlo Stanton
     5:       3B    Derek Dietrich
     6:       SS      Miguel Rojas
     7:       1B      Tommy Joseph
     8:       OF     Marcell Ozuna
     9:       2B   C?sar Hern?ndez
    10:       OF  Christian Yelich
    

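    Three prerequisites must be met:

    1. The name part must not contain any bare initial that is also used as a position code; for example, the initials C and P must be written with a period to avoid confusion.
    2. The column separator ; must not appear anywhere else in lineup.
    3. The string must start with a space.

    Condition 3 can be waived by using an improved regular expression, and condition 2 can be checked: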

    lineup3 %T>% 
      {stopifnot(!stringr::str_detect(., ";"))} %>% 
      stringr::str_replace_all("(^\\s?|\\s)(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\2;") %>% 
      data.table::fread(header = FALSE, col.names = c("position", "player"))
    
        position            player
     1:        C   James P. McCann
     2:        P    Robbie Ray, Jr
     3:        P    Rafael Montero
     4:       OF Giancarlo Stanton
     5:       3B    Derek Dietrich
     6:       SS      Miguel Rojas
     7:       1B      Tommy Joseph
     8:       OF     Marcell Ozuna
     9:       2B   C?sar Hern?ndez
    10:       OF  Christian Yelich
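The three prerequisites listed above can also be screened for a whole vector of strings before parsing. The sketch below is one possible check in base R; `check_lineups()` and its column names are hypothetical, and the test for prerequisite 1 relies on the question's statement that every lineup contains exactly ten position codes, so an eleventh token suggests a bare initial such as " C ".

```r
positions <- c("C", "P", "OF", "SS", "1B", "2B", "3B")
# A position token: whitespace, a position code, then whitespace (not consumed).
token_pat <- sprintf("\\s(?:%s)(?=\\s)", paste(positions, collapse = "|"))

# check_lineups() is a hypothetical helper; expected_tokens = 10 is taken
# from the question's description of the data.
check_lineups <- function(x, expected_tokens = 10L) {
  padded <- paste0(" ", trimws(x))  # make sure the first token has a leading space
  n_tokens <- lengths(regmatches(padded, gregexpr(token_pat, padded, perl = TRUE)))
  data.frame(
    positions_ok  = n_tokens == expected_tokens,  # prerequisite 1 (heuristic)
    has_semicolon = grepl(";", x, fixed = TRUE),  # prerequisite 2 violated if TRUE
    leading_space = startsWith(x, " "),           # prerequisite 3
    stringsAsFactors = FALSE
  )
}

ok  <- " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B Cesar Hernandez OF Christian Yelich"
bad <- "C James McCann P Robbie Ray; Jr P Rafael C Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B Cesar Hernandez OF Christian Yelich"
checks <- check_lineups(c(ok, bad))
```

Rows where positions_ok is FALSE or has_semicolon is TRUE can be set aside for manual inspection rather than stopping the whole run.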
    

    Data

    # original
    lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
    
    # other use cases
    lineup1 = "C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
    lineup2 = " C James P. McCann P Robbie Ray, Jr P Rafael D Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
    lineup2a = " C James P. McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
    lineup2b = " C James McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
    lineup3 = "C James P. McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
    lineup4 = " C James P. McCann P Robbie Ray; Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
    

    【Comments】:

      【Solution 3】:

      Here is a stringr::str_split option that uses positive lookbehind and lookahead:

      pos <- c("C", "P", "P", "OF", "3B", "SS", "1B", "OF", "2B", "OF")
      pat <- sprintf("(%s)", paste(pos, collapse = "|"))
      
      library(stringr)
      matrix(unlist(str_split(trimws(lineup), sprintf(
          "((?<=(%s))\\s|\\s(?=(%s)))", pat, pat))), ncol = 2, byrow = TRUE)
      #    [,1] [,2]
      #[1,] "C"  "James McCann"
      #[2,] "P"  "Robbie Ray"
      #[3,] "P"  "Rafael Montero"
      #[4,] "OF" "Giancarlo Stanton"
      #[5,] "3B" "Derek Dietrich"
      #[6,] "SS" "Miguel Rojas"
      #[7,] "1B" "Tommy Joseph"
      #[8,] "OF" "Marcell Ozuna"
      #[9,] "2B" "C?sar Hern?ndez"
      #[10,] "OF" "Christian Yelich"
      

      I don't know how well this covers edge cases. A more complex and representative sample string would help with testing.

      【Comments】: