【Question title】: Split vectors by uppercases and lower cases
【Posted】: 2018-10-17 08:35:35
【Question】:

I have read some good questions about splitting uppercase and lowercase text, such as this and this, but I could not get them to work with my data.

# here my data
    data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
                                ,"OTHER UPPER CASES   And other words"
                                , "Some lower cases        AND UPPER CASES"
                                ,"ONLY UPPER CASES"
                                ,"Only lower cases, maybe"
                                ,"UPPER lower UPPER!"))
    data
                                         text
    1 SOME UPPERCASES     And some Lower Cases
    2      OTHER UPPER CASES   And other words
    3  Some lower cases        AND UPPER CASES
    4                         ONLY UPPER CASES
    5                  Only lower cases, maybe
    6                        UPPER lower UPPER!

The desired result should look like this:

       V1                  V2
1      SOME UPPERCASES     And some Lower Cases
2      OTHER UPPER CASES   And other words
3      AND UPPER CASES     Some lower cases        
4      ONLY UPPER CASES    NA
5      NA                  Only lower cases, maybe
6      UPPER UPPER!         lower

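In other words, separate every word made up only of uppercase letters from all the other words.

As a test, I tried a few approaches, but none of them worked well: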

strsplit(x= data$text[1], split="[[:upper:]]")   # error
gsub('([[:upper:]])', ' \\1', data$text[1])      # not good results

library(reshape)
transform(data, FOO = colsplit(data$text[1], split = "[[:upper:]]", names = c('a', 'b')))                                        # neither good results

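【Question comments】:

  • It is not clear what your rules are. You want to drop the "!" in the last row but keep the "," in the row before it. What exactly is your rule here?
  • Thanks a lot, that was a typo: punctuation follows the case of the letter before it.

Tags: r regex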


【Solution 1】:

Data:

data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
                            ,"OTHER UPPER CASES   And other words"
                            , "Some lower cases        AND UPPER CASES"
                            ,"ONLY UPPER CASES"
                            ,"Only lower cases, maybe"
                            ,"UPPER lower UPPER!"))

Code:

library(magrittr)

UpperCol    <- regmatches(data$text , gregexpr("\\b[A-Z]+\\b",data$text)) %>% lapply(paste, collapse = " ") %>% unlist
notUpperCol <- regmatches(data$text , gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b",data$text, perl = TRUE)) %>% lapply(paste, collapse = " ") %>% unlist

result <- data.frame(I(UpperCol), I(notUpperCol))
result[result == ""] <- NA

Result:

#           UpperCol            notUpperCol
#1   SOME UPPERCASES   And some Lower Cases
#2 OTHER UPPER CASES        And other words
#3   AND UPPER CASES       Some lower cases
#4  ONLY UPPER CASES                   <NA>
#5              <NA> Only lower cases maybe
#6       UPPER UPPER                  lower

  • The trick is the regular expression, so it is worth learning regex.
  • Thanks to Wiktor Stribiżew for some optimizations.
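The result above drops the punctuation, while the asker clarified that punctuation should follow the case of the preceding letter. A minimal sketch of one way to keep it, assuming it is enough to let each match absorb trailing punctuation with `[[:punct:]]*`; the helper name `extract_joined` is hypothetical and not part of the original answer:

```r
data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases",
                            "OTHER UPPER CASES   And other words",
                            "Some lower cases        AND UPPER CASES",
                            "ONLY UPPER CASES",
                            "Only lower cases, maybe",
                            "UPPER lower UPPER!"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: pull every match of `pattern` out of each string,
# join the matches with single spaces, and turn "no match" into NA.
extract_joined <- function(text, pattern) {
  matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
  joined <- vapply(matches, paste, character(1), collapse = " ")
  joined[joined == ""] <- NA
  joined
}

# Same patterns as in the answer, with optional trailing punctuation appended.
upper_pattern     <- "\\b[A-Z]+\\b[[:punct:]]*"
not_upper_pattern <- "\\b(?![A-Z]+\\b)[a-zA-Z]+\\b[[:punct:]]*"

result <- data.frame(UpperCol    = extract_joined(data$text, upper_pattern),
                     notUpperCol = extract_joined(data$text, not_upper_pattern),
                     stringsAsFactors = FALSE)
print(result)
```

This only covers punctuation that directly follows a word; leading punctuation or standalone symbols would still be dropped.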

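【Comments】:

  • Be careful with [A-z]+, it does not only match letters. Also, (?![A-Z]+\\b) should be placed right after the leading \b for efficiency (=> "\\b(?![A-Z]+\\b)[a-zA-Z]+\\b").
  • Thanks a lot, this was missing from my toolkit and I need to learn it (+1).
  • @WiktorStribiżew Thanks. Both remarks are very valuable!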
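The warning about [A-z] can be checked directly. A small sketch, assuming PCRE matching (`perl = TRUE`) so that the range is interpreted by code point; the sample characters are chosen for illustration only:

```r
# A few ASCII characters: letters plus the symbols that sit between "Z" and "a".
sample_chars <- c("A", "Z", "[", "\\", "]", "^", "_", "`", "a", "z")

# [A-z] spans code points 65 to 122, so it also accepts the six symbols.
matched_by_A_z  <- sample_chars[grepl("^[A-z]$", sample_chars, perl = TRUE)]

# [A-Za-z] lists the two letter ranges explicitly and accepts letters only.
matched_by_safe <- sample_chars[grepl("^[A-Za-z]$", sample_chars, perl = TRUE)]

print(matched_by_A_z)
print(matched_by_safe)
```

With R's default (non-PCRE) engine, range interpretation can depend on the locale, which is one more reason to spell out `[A-Za-z]` or use `[[:alpha:]]`.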
【Solution 2】:

An approach using the stringi package:

library(stringi)
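l1 <- stri_extract_all_regex(data$text, "\\b[A-Z]+\\b")
# SIMPLIFY = FALSE keeps the result a list even if all rows have equally many words
l2 <- mapply(setdiff, stri_extract_all_words(data$text), l1, SIMPLIFY = FALSE)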

res <- data.frame(all_upper = sapply(l1, paste, collapse = " "),
                  not_all_upper = sapply(l2, paste, collapse = " "),
                  stringsAsFactors = FALSE)
res[res == "NA"] <- NA
res[res == ""] <- NA

which gives:

> res
          all_upper          not_all_upper
1   SOME UPPERCASES   And some Lower Cases
2 OTHER UPPER CASES        And other words
3   AND UPPER CASES       Some lower cases
4  ONLY UPPER CASES                   <NA>
5              <NA> Only lower cases maybe
6       UPPER UPPER                  lower

【Comments】:

    【Solution 3】:
    separate <- function(x) {
      x <- unlist(strsplit(as.character(x), "\\s+"))
      with_lower <- grepl("\\p{Ll}", x, perl = TRUE)
      list(paste(x[!with_lower], collapse = " "),  paste(x[with_lower], collapse = " "))
    }
    
    
    do.call(rbind, lapply(data$text, separate))
    
         [,1]                [,2]                     
    [1,] "SOME UPPERCASES"   "And some Lower Cases"   
    [2,] "OTHER UPPER CASES" "And other words"        
    [3,] "AND UPPER CASES"   "Some lower cases"       
    [4,] "ONLY UPPER CASES"  ""                       
    [5,] ""                  "Only lower cases, maybe"
    [6,] "UPPER UPPER!"      "lower"  
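The matrix above uses empty strings where the question asks for NA. A minimal sketch of how the same idea could return a data frame with V1/V2 columns and NA values, assuming `separate` is changed to return a named character vector instead of a list (that change and the variable name `res` are illustrative, not part of the original answer):

```r
data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases",
                            "OTHER UPPER CASES   And other words",
                            "Some lower cases        AND UPPER CASES",
                            "ONLY UPPER CASES",
                            "Only lower cases, maybe",
                            "UPPER lower UPPER!"),
                   stringsAsFactors = FALSE)

# Split on whitespace, then sort each token by whether it contains
# at least one lowercase letter. Punctuation stays attached to its token.
separate <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  with_lower <- grepl("\\p{Ll}", words, perl = TRUE)
  c(V1 = paste(words[!with_lower], collapse = " "),
    V2 = paste(words[with_lower], collapse = " "))
}

# Stack the per-row vectors into a character matrix, then convert.
res <- as.data.frame(do.call(rbind, lapply(data$text, separate)),
                     stringsAsFactors = FALSE)
res[res == ""] <- NA   # empty side of a row becomes NA, as in the question
print(res)
```

Because tokens are split on whitespace only, "cases," and "UPPER!" keep their punctuation, which matches the rule given in the question comments.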
    

    【Comments】:
