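Split vectors by upper and lower case

【Question title】: Split vectors by uppercases and lower cases
【Posted】: 2018-10-17 08:35:35
【Question description】:

I have read some good questions about splitting upper and lower case, such as this and this, but I cannot get them to work with my data.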

    # here is my data
    data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
                                ,"OTHER UPPER CASES   And other words"
                                , "Some lower cases        AND UPPER CASES"
                                ,"ONLY UPPER CASES"
                                ,"Only lower cases, maybe"
                                ,"UPPER lower UPPER!"))
    data
                                         text
    1 SOME UPPERCASES     And some Lower Cases
    2      OTHER UPPER CASES   And other words
    3  Some lower cases        AND UPPER CASES
    4                         ONLY UPPER CASES
    5                  Only lower cases, maybe
    6                        UPPER lower UPPER!

The desired result should look like this:

       V1                  V2
1      SOME UPPERCASES     And some Lower Cases
2      OTHER UPPER CASES   And other words
3      AND UPPER CASES     Some lower cases        
4      ONLY UPPER CASES    NA
5      NA                  Only lower cases, maybe
6      UPPER UPPER!        lower

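In other words, every word that consists only of upper-case letters should be separated from all the other words.

As a test I tried a few approaches, but none of them works well: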

strsplit(x = data$text[1], split = "[[:upper:]]")   # error
gsub('([[:upper:]])', ' \\1', data$text[1])         # not good results

library(reshape)
transform(data, FOO = colsplit(data$text[1], split = "[[:upper:]]", names = c('a', 'b')))   # no good results either

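【Question comments】:

It is unclear what your rule is. You want to drop the ! in the last row but keep the , in the row before it. What exactly is your rule here?
Thanks a lot, that was a typo; punctuation should follow the case of the letter before it.

Tags: r regex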

【Solution 1】:

Data:

data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
                            ,"OTHER UPPER CASES   And other words"
                            , "Some lower cases        AND UPPER CASES"
                            ,"ONLY UPPER CASES"
                            ,"Only lower cases, maybe"
                            ,"UPPER lower UPPER!"))

Code:

library(magrittr)

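# Words made up only of upper-case letters, pasted back together per row
UpperCol <- regmatches(data$text, gregexpr("\\b[A-Z]+\\b", data$text)) %>%
  lapply(paste, collapse = " ") %>%
  unlist

# All other words; the negative lookahead requires perl = TRUE
notUpperCol <- regmatches(data$text, gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b", data$text, perl = TRUE)) %>%
  lapply(paste, collapse = " ") %>%
  unlist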

result <- data.frame(I(UpperCol), I(notUpperCol))
result[result == ""] <- NA

Result:

#           UpperCol            notUpperCol
#1   SOME UPPERCASES   And some Lower Cases
#2 OTHER UPPER CASES        And other words
#3   AND UPPER CASES       Some lower cases
#4  ONLY UPPER CASES                   <NA>
#5              <NA> Only lower cases maybe
#6       UPPER UPPER                  lower

The trick is in the regular expressions, so learning regex pays off.
Thanks to Wiktor Stribiżew for some optimizations.
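
To make the two patterns easier to follow, here is a minimal sketch that applies them to single strings taken from the question's data. The variable names (`x`, `y`, `upper`, `other`, `other_y`) are only illustrative and not part of the answer. Note that neither pattern matches punctuation, which seems to be why the comma and the exclamation mark are missing from the result above.

```r
x <- "UPPER lower UPPER!"
y <- "Only lower cases, maybe"

# \b[A-Z]+\b : a run of upper-case letters with a word boundary on both sides
upper <- regmatches(x, gregexpr("\\b[A-Z]+\\b", x))[[1]]

# (?![A-Z]+\b) rejects words that are entirely upper case;
# lookaheads are only available with perl = TRUE
not_upper <- "\\b(?![A-Z]+\\b)[a-zA-Z]+\\b"
other   <- regmatches(x, gregexpr(not_upper, x, perl = TRUE))[[1]]
other_y <- regmatches(y, gregexpr(not_upper, y, perl = TRUE))[[1]]

print(upper)                          # "UPPER" "UPPER"
print(other)                          # "lower"
print(paste(other_y, collapse = " ")) # "Only lower cases maybe" (comma dropped)
```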

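【Comments】:

Be careful with [A-z]+, it does not only match letters. Also, (?![A-Z]+\\b) should be placed right after the leading \b for efficiency (=> "\\b(?![A-Z]+\\b)[a-zA-Z]+\\b").
Thanks a lot, that part was missing from my toolkit and I need to learn it (+1).
@WiktorStribiżew Thanks. Both remarks are very valuable!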
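
The `[A-z]` warning in the comments can be checked directly. In ASCII the characters `[`, `\`, `]`, `^`, `_` and the backtick sit between `Z` and `a`, so the range `A-z` includes them. The sketch below uses `perl = TRUE` so that the range is interpreted by code point; the test strings are made up for illustration.

```r
s <- c("abc", "a_b", "x^y", "[z]")

# A-z spans code points 65..122, which includes [ \ ] ^ _ and the backtick
loose  <- grepl("^[A-z]+$", s, perl = TRUE)

# A-Za-z is restricted to the 52 ASCII letters
strict <- grepl("^[A-Za-z]+$", s, perl = TRUE)

print(loose)   # TRUE TRUE TRUE TRUE
print(strict)  # TRUE FALSE FALSE FALSE
```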

【Solution 2】:

An approach using the stringi package:

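library(stringi)

# Words made up only of upper-case letters (NA where a row has none);
# the data frame is called data, as in the question
l1 <- stri_extract_all_regex(data$text, "\\b[A-Z]+\\b")
# All remaining words; SIMPLIFY = FALSE keeps the result a list in every case
l2 <- mapply(setdiff, stri_extract_all_words(data$text), l1, SIMPLIFY = FALSE)

res <- data.frame(all_upper = sapply(l1, paste, collapse = " "),
                  not_all_upper = sapply(l2, paste, collapse = " "),
                  stringsAsFactors = FALSE)
# paste() turns NA into "NA" and an empty set into "", so map both back to NA
res[res == "NA"] <- NA
res[res == ""] <- NA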

Which gives:

> res
          all_upper          not_all_upper
1   SOME UPPERCASES   And some Lower Cases
2 OTHER UPPER CASES        And other words
3   AND UPPER CASES       Some lower cases
4  ONLY UPPER CASES                   <NA>
5              <NA> Only lower cases maybe
6       UPPER UPPER                  lower
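
One property of the `setdiff` step is worth knowing: `setdiff` treats its inputs as sets, so a word that occurs twice in a row is kept only once. The base-R sketch below uses a hand-written word vector as a stand-in for the output of `stri_extract_all_words` (an assumption made so the example runs without stringi) and shows an order-preserving alternative with `%in%`.

```r
# Hypothetical tokens for one row, with a repeated non-upper-case word
words     <- c("lower", "UPPER", "lower", "AGAIN")
all_upper <- c("UPPER", "AGAIN")

# setdiff() removes the upper-case words but also drops the duplicate
as_set <- setdiff(words, all_upper)

# Logical indexing removes the same words and keeps duplicates and order
as_seq <- words[!words %in% all_upper]

print(as_set)  # "lower"
print(as_seq)  # "lower" "lower"
```

For the question's data this makes no difference, since no row repeats a non-upper-case word.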

【Comments】:

【Solution 3】:

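separate <- function(x) {
  # Split on runs of whitespace to get the individual words
  x <- unlist(strsplit(as.character(x), "\\s+"))
  # \p{Ll} matches any lower-case letter, so TRUE marks words that are not all upper case
  with_lower <- grepl("\\p{Ll}", x, perl = TRUE)
  # First element: words without lower-case letters; second: everything else
  list(paste(x[!with_lower], collapse = " "),  paste(x[with_lower], collapse = " "))
}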


do.call(rbind, lapply(data$text, separate))

     [,1]                [,2]                     
[1,] "SOME UPPERCASES"   "And some Lower Cases"   
[2,] "OTHER UPPER CASES" "And other words"        
[3,] "AND UPPER CASES"   "Some lower cases"       
[4,] "ONLY UPPER CASES"  ""                       
[5,] ""                  "Only lower cases, maybe"
[6,] "UPPER UPPER!"      "lower"
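
The output above is a matrix with empty strings rather than the data frame with `NA` that the question asks for. A possible way to get there is sketched below, under the assumption that a variant of `separate` returning a named character vector is acceptable. The helper `split_case`, the column names `V1` and `V2`, and the shortened input vector are illustrative and not part of the answer. Because words are split on whitespace only, punctuation stays attached to its word.

```r
# Hypothetical variant of separate(): returns a named character vector
split_case <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  has_lower <- grepl("\\p{Ll}", words, perl = TRUE)
  c(V1 = paste(words[!has_lower], collapse = " "),
    V2 = paste(words[has_lower], collapse = " "))
}

text <- c("SOME UPPERCASES     And some Lower Cases",
          "ONLY UPPER CASES",
          "Only lower cases, maybe",
          "UPPER lower UPPER!")

# vapply() yields a 2 x n character matrix; transpose it to one row per input
m <- t(vapply(text, split_case, c(V1 = "", V2 = ""), USE.NAMES = FALSE))
out <- as.data.frame(m, stringsAsFactors = FALSE)

# Empty strings mean "no words of that kind", so turn them into NA
out[out == ""] <- NA
print(out)
```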

【Comments】: