【Question title】: Proper use of gsub / regular expressions in R?
【Posted】: 2012-10-22 10:44:38
【Question】:

I have a long list of strings, such as this machine-readable example:

A <- list(c("Biology","Cell Biology","Art","Humanities, Multidisciplinary; Psychology, Experimental","Astronomy & Astrophysics; Physics, Particles & Fields","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods","Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science","Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"))  

So it looks like this:

> A  
[[1]]  
 [1] "Biology"  
 [2] "Cell Biology"  
 [3] "Art"  
 [4] "Humanities, Multidisciplinary; Psychology, Experimental"  
 [5] "Astronomy & Astrophysics; Physics, Particles & Fields"  
 [6] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"  
 [7] "Geriatrics & Gerontology"  
 [8] "Gerontology"  
 [9] "Management"  
[10] "Operations Research & Management Science"  
[11] "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic"  
[12] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"  

I would like to edit these terms and eliminate duplicates to get this result:

 [1] "Science"  
 [2] "Science"  
 [3] "Arts & Humanities"  
 [4] "Arts & Humanities; Social Sciences"  
 [5] "Science"  
 [6] "Social Sciences; Science"  
 [7] "Science"  
 [8] "Social Sciences"  
 [9] "Social Sciences"  
[10] "Science"  
[11] "Science"  
[12] "Social Sciences; Science"  

So far, I have only got this:

stringedit <- function(A)  
{  
  A <-gsub("Biology", "Science", A)  
  A <-gsub("Cell Biology", "Science", A)  
  A <-gsub("Art", "Arts & Humanities", A)  
  A <-gsub("Humanities, Multidisciplinary", "Arts & Humanities", A)  
  A <-gsub("Psychology, Experimental", "Social Sciences", A)  
  A <-gsub("Astronomy & Astrophysics", "Science", A)  
  A <-gsub("Physics, Particles & Fields", "Science", A)  
  A <-gsub("Economics", "Social Sciences", A)  
  A <-gsub("Mathematics", "Science", A)  
  A <-gsub("Mathematics, Applied", "Science", A)  
  A <-gsub("Mathematics, Interdisciplinary Applications", "Science", A)  
  A <-gsub("Social Sciences, Mathematical Methods", "Social Sciences", A)  
  A <-gsub("Geriatrics & Gerontology", "Science", A)  
  A <-gsub("Gerontology", "Social Sciences", A)  
  A <-gsub("Management", "Social Sciences", A)  
  A <-gsub("Operations Research & Management Science", "Science", A)  
  A <-gsub("Computer Science, Artificial Intelligence", "Science", A)  
  A <-gsub("Computer Science, Information Systems", "Science", A)  
  A <-gsub("Engineering, Electrical & Electronic", "Science", A)  
  A <-gsub("Statistics & Probability", "Science", A)  
}  
B <- lapply(A, stringedit)  

But it does not work properly:

> B  
[[1]]  
 [1] "Science"  
 [2] "Cell Science"  
 [3] "Arts & Humanities"  
 [4] "Arts & Humanities; Social Sciences"  
 [5] "Science; Science"  
 [6] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences"  
 [7] "Science"  
 [8] "Social Sciences"  
 [9] "Social Sciences"  
[10] "Operations Research & Social Sciences Science"  
[11] "Computer Science, Arts & Humanitiesificial Intelligence; Science; Science"  
[12] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences; Science"  

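How can I get the correct output shown above?
Thank you very much for your consideration!

【Comments】:

  • When you find yourself ending up with many similar lines of code, you are violating the lovely DRY principle. So it is time to redesign, most likely as a wrapper passed to some *apply function, or some other loop-like helper.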

Tags: regex r list gsub


【Solution 1】:

I find it easiest to use a two-column data.frame as a lookup, with one column for the course name and one for the category. Here is an example:

course.categories <- data.frame(
  Course = 
  c("Art", "Humanities, Multidisciplinary", "Biology", "Cell Biology", 
    "Astronomy & Astrophysics", "Physics, Particles & Fields", "Mathematics", 
    "Mathematics, Applied", "Mathematics, Interdisciplinary Applications", 
    "Geriatrics & Gerontology", "Operations Research & Management Science", 
    "Computer Science, Artificial Intelligence", 
    "Computer Science, Information Systems", 
    "Engineering, Electrical & Electronic", "Statistics & Probability", 
    "Psychology, Experimental", "Economics", 
    "Social Sciences, Mathematical Methods", 
    "Gerontology", "Management"),
  Category =
  c("Arts & Humanities", "Arts & Humanities", "Science", "Science", 
    "Science", "Science", "Science", "Science", "Science", "Science", 
    "Science", "Science", "Science", "Science", "Science", "Social Sciences", 
    "Social Sciences", "Social Sciences", "Social Sciences", "Social Sciences"))

Then, assuming A is the list from your question:

sapply(strsplit(unlist(A), "; "), 
       function(x) 
         paste(unique(course.categories[match(x, course.categories[["Course"]]),
                                        "Category"]), 
               collapse = "; "))
#  [1] "Science"                            "Science"                           
#  [3] "Arts & Humanities"                  "Arts & Humanities; Social Sciences"
#  [5] "Science"                            "Social Sciences; Science"          
#  [7] "Science"                            "Social Sciences"                   
#  [9] "Social Sciences"                    "Science"                           
# [11] "Science"                            "Social Sciences; Science"

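match compares the values from A against the course names in the course.categories dataset and reports the rows on which the matches occur; those row numbers are used to extract the category each course belongs to. Then unique ensures that each category appears only once, and paste puts things back together.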
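The three steps can be traced on a single element. This is only a sketch: `lookup` is a hypothetical three-row subset of the table above, and `stringsAsFactors = FALSE` is set so that it behaves the same on old and new R versions.

```r
# Hypothetical mini lookup table (a subset of course.categories, for illustration)
lookup <- data.frame(
  Course   = c("Economics", "Mathematics, Interdisciplinary Applications",
               "Social Sciences, Mathematical Methods"),
  Category = c("Social Sciences", "Science", "Social Sciences"),
  stringsAsFactors = FALSE)

x <- "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"

terms  <- strsplit(x, "; ")[[1]]        # the three course names
rows   <- match(terms, lookup$Course)   # row of each course in the table: 1 2 3
cats   <- lookup$Category[rows]         # category per course, duplicates included
result <- paste(unique(cats), collapse = "; ")
result
# [1] "Social Sciences; Science"
```

A course that is missing from the table makes match return NA, which then shows up as "NA" in the pasted result, so it is worth checking for missing values before pasting.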

【Comments】:

  • Thank you very much for your suggestion, @mrdwab!
【Solution 2】:

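Let me start with an example. You have the string "Cell Biology". The first substitution, A <- gsub("Biology", "Science", A), turns it into "Cell Science". After that, the later pattern "Cell Biology" no longer matches anything, so the string is never replaced.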
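The effect is easy to reproduce in isolation. The snippet below only replays two of the substitutions from the question; the variable names are illustrative.

```r
x <- "Cell Biology"
step1 <- gsub("Biology", "Science", x)            # substring match: "Cell Science"
step2 <- gsub("Cell Biology", "Science", step1)   # nothing left to match, unchanged
step2
# [1] "Cell Science"

# The same substring problem: "Art" also matches the start of "Artificial"
art <- gsub("Art", "Arts & Humanities", "Computer Science, Artificial Intelligence")
art
# [1] "Computer Science, Arts & Humanitiesificial Intelligence"
```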

Since you are not really using regular expressions, I would rather use a kind of hash for the replacement:

myhash <- c( "Science", "Science", "Arts & Humanities", "Arts & Humanities", "Social Sciences", 
  "Science", "Science", "Social Sciences", "Science", "Science", "Science", "Social Sciences", 
  "Science", "Social Sciences", "Social Sciences", "Science", "Science", "Science", "Science", 
  "Science" )

names( myhash ) <- c( "Biology", "Cell Biology", "Art", "Humanities, Multidisciplinary", 
  "Psychology, Experimental", "Astronomy & Astrophysics", "Physics, Particles & Fields", "Economics", 
  "Mathematics", "Mathematics, Applied", "Mathematics, Interdisciplinary Applications", 
  "Social Sciences, Mathematical Methods", "Geriatrics & Gerontology", "Gerontology", "Management",
   "Operations Research & Management Science", "Computer Science, Artificial Intelligence", 
  "Computer Science, Information Systems", "Engineering, Electrical & Electronic", 
  "Statistics & Probability" )

Now, given a string such as "Biology", you can quickly look up its category:

myhash[ "Biology" ]
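A short sketch of how lookup by name behaves on a named character vector. `h` is a hypothetical two-entry stand-in for myhash, used so the snippet runs on its own.

```r
# Hypothetical two-entry stand-in for myhash
h <- c("Biology" = "Science", "Art" = "Arts & Humanities")

one  <- h["Biology"]             # a named element: name "Biology", value "Science"
bare <- unname(h["Biology"])     # just the value
many <- h[c("Art", "Biology")]   # lookup is vectorised over several keys
miss <- h["Chemistry"]           # unknown key: NA rather than an error

bare
# [1] "Science"
is.na(unname(miss))
# [1] TRUE
```

Because an unknown key silently yields NA, a misspelt course name will not raise an error; it has to be checked for explicitly.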

I am not sure why you want to use a list rather than a vector of strings, so I will simplify your case a bit:

A <- c("Biology","Cell Biology","Art",
  "Humanities, Multidisciplinary; Psychology, Experimental",
  "Astronomy & Astrophysics; Physics, Particles & Fields",
  "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods",
  "Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science",
  "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic",
  "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability")

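Hash lookup will not work for the compound strings (those containing ";"). You can, however, split them using strsplit. Then you can use unique to avoid repeated terms and the paste function to put everything back together.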

stringedit <- function( x ) { 
  # first, split into subterms
  a.all <- unlist( strsplit( x, "; *" ) ) ; 
  paste( unique( myhash[ a.all ] ), collapse= "; " ) 
}

unlist( lapply( A, stringedit  ) )

As desired, the result is as follows:

[1] "Science"                            "Science"                            "Arts & Humanities"                  "Arts & Humanities; Social Sciences"
[5] "Science"                            "Social Sciences; Science"           "Science"                            "Social Sciences"                   
[9] "Social Sciences"                    "Science"                            "Science"                            "Social Sciences; Science" 

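Of course, you can also call the *apply functions several times, like this:

# strsplit is vectorised, so one call splits every element of A
a.spl <- strsplit( A, "; *" )
# look up each term and drop repeated categories
a.spl <- lapply( a.spl, function( x ) unique( myhash[ x ] ) )
# glue the categories of each element back together
unlist( lapply( a.spl, paste, collapse = "; " ) )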

This is neither more nor less efficient than the previous code.

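Yes, you can achieve the same thing with regular expressions, but first you would have to split the strings anyway, and then use regexes like ^Biology$ to make sure they match "Biology" but not "Cell Biology" and so on (unless you want to use constructs like ".* Biology"). Finally, you would still have to get rid of the duplicates, and all in all it would be (i) less verbose (= more error-prone) and (ii) not worth the effort, in my opinion.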
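For comparison, here is a rough sketch of what the regex route could look like once the strings are split. The `patterns` table and the helpers `categorise` and `recode` are hypothetical names, and only a few courses are covered.

```r
# Hypothetical table: anchored regexes (names) mapped to categories (values)
patterns <- c("^(Cell )?Biology$"          = "Science",
              "^Geriatrics & Gerontology$" = "Science",
              "^Art$"                      = "Arts & Humanities",
              "^Gerontology$"              = "Social Sciences")

# Category of the first pattern matching a single term, NA if none matches
categorise <- function(term) {
  hit <- vapply(names(patterns), grepl, logical(1), x = term)
  if (any(hit)) unname(patterns[hit][1]) else NA_character_
}

# Split a compound string, categorise each term, drop duplicates, re-join
recode <- function(s) {
  terms <- strsplit(s, "; *")[[1]]
  paste(unique(vapply(terms, categorise, character(1))), collapse = "; ")
}

recode("Cell Biology; Art")
# [1] "Science; Arts & Humanities"
recode("Geriatrics & Gerontology; Gerontology")
# [1] "Science; Social Sciences"
```

The ^ and $ anchors are what keep "Gerontology" from matching inside "Geriatrics & Gerontology"; without them the substring problem from the question returns.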

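【Comments】:

  • IMO, a bad idea. You are strsplit-ing a string in every loop iteration. You should do it only once.
  • I only strsplit length( A ) times; in terms of the number of splits, that is no different from lapply( A, strsplit, "; " ).
  • Thank you very much for your solution, @January!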
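The point debated in the comments above can be checked directly: strsplit is vectorised over its input, so one call on the whole vector yields the same pieces as splitting element by element. A small sketch with a hypothetical two-element vector:

```r
A2 <- c("Biology", "Economics; Statistics & Probability")   # hypothetical sample

once        <- strsplit(A2, "; *")                                   # one vectorised call
per_element <- lapply(A2, function(x) unlist(strsplit(x, "; *")))    # one call per element

identical(once, per_element)
# [1] TRUE
```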
【Solution 3】:

How about using switch?

science.category <- function(science){
    switch(science,
           "Biology" =,
           "Cell Biology" =,
           "Astronomy & Astrophysics" =,
           "Physics, Particles & Fields" =,
           "Mathematics" =,
           "Mathematics, Applied" =,
           "Mathematics, Interdisciplinary Applications" =,
           "Geriatrics & Gerontology" =,
           "Operations Research & Management Science" =,
           "Computer Science, Artificial Intelligence" =,
           "Computer Science, Information Systems" =,
           "Engineering, Electrical & Electronic" =,
           "Statistics & Probability" = "Science",
           "Art" =,
           "Humanities, Multidisciplinary" = "Arts & Humanities",
           "Psychology, Experimental" =,
           "Economics" =,
           "Social Sciences, Mathematical Methods" =,
           "Gerontology" =,
           "Management" = "Social Sciences",
           NA
           )
}

a <- unlist(lapply(A, strsplit, split = " *; *"), recursive = FALSE)
a1 <- lapply(a, function(x) unique(sapply(x, science.category)))
sapply(a1, paste, collapse = "; ")

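Of course, this works only as long as you pass the correct strings as the switch argument. A single mismatch and you end up with NA. For more advanced usage, you should write your own wrapper around the grep-family functions, or even agrep (handle with care).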
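One way to catch such mismatches before they turn into NA is to compare the split terms against the known course names first. The sketch below is an assumption about how such a wrapper might start: `known` and `terms` are made-up sample data, and the agrep distance is an arbitrary choice.

```r
known <- c("Biology", "Cell Biology", "Art", "Gerontology", "Management")
terms <- c("Biology", "Cell biology", "Managment")   # two deliberate typos

# Exact comparison: everything not in the known list needs attention
unmatched <- setdiff(terms, known)
unmatched
# [1] "Cell biology" "Managment"

# agrep() does approximate matching; treat its output as candidates to review
# by hand, not as automatic corrections
suggestions <- lapply(unmatched, function(term)
  agrep(term, known, max.distance = 0.2, ignore.case = TRUE, value = TRUE))
names(suggestions) <- unmatched
```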

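【Comments】:

  • You missed a call to science.category between the strsplit and sapply calls, though.
  • Ha ha, great! =) Thanks for spotting it! =)
  • @January, fixed, thanks for the tip.
  • @aL3xa, you should also add a unique in there to match the desired output.
  • Thank you very much for your suggestion, @aL3xa!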