【Question title】: r split a column in a data frame based on square brackets
【Posted】: 2017-08-08 19:19:15
【Question】:

I have a data frame:

x <- data.frame(a = letters[1:7], b = letters[2:8], 
   c = c("bla bla    [ text1 ]", "bla bla  [text2]", "how how [text3  ]",
   "wow wow   [ text4a ] [ text4b  ]", "ba ba [ text5a  ][  text5b]", 
    "my text A", "my text B"), stringsAsFactors = FALSE)
x

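I would like to split column c based on what is between the square brackets [...]. If column c contains only one set of square brackets, I want that string to go into a new column. If column c contains two strings surrounded by [], I want only the string between the last [ ] to go into the new column.

Here is how I do it. It seems very convoluted, and I am using a loop. Is it possible to do this in a more parsimonious way?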

library(stringr)

# Counting number of square brackets "[" in column c:
sqrbrack_count <- str_count(x$c, pattern = '\\[')

# Creating a new column:
x$newcolumn <- NA

for(i in seq_len(nrow(x))){          # looping through rows of x
  if(sqrbrack_count[i] == 0) next    # do nothing if there are 0 square brackets
  minilist <- str_split_fixed(x[i, "c"], pattern = '\\[', n = Inf)  # split string
  if(sqrbrack_count[i] == 1) {       # if there is only one square bracket "["
    x[i, "c"] <- minilist[1]
    x[i, "newcolumn"] <- minilist[2]
  } else {                           # if there are >1 square bracket "["
    x[i, "c"] <- paste(minilist[1:2], collapse = "+")
    x[i, "newcolumn"] <- minilist[3]
  }
}
# Removing the remaining closing square brackets we don't need anymore:
x$c <- str_replace(x$c, pattern = " \\]", replacement =  "")
x$c <- str_replace(x$c, pattern = "\\]", replacement =  "")
x$newcolumn <- str_replace(x$newcolumn, pattern = " \\]", replacement =  "")
x$newcolumn <- str_replace(x$newcolumn, pattern = "\\]", replacement =  "")
x

【Comments】:

    Tags: r regex stringr square-bracket


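    【Solution 1】:

    The code below is a bit shorter and maybe easier to understand, since most of the complex logic happens in two lines of code. I added comments above those two lines; I think the rest is fairly clear.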

    library(plyr)
    # find all strings between the characters '[' and ']' (one character vector per row)
    strmatches = lapply(seq_len(nrow(x)), function(y) {regmatches(x$c[y], gregexpr("(?<=\\[).*?(?=\\])", x$c[y], perl = TRUE))[[1]]})
    # turn each vector of matches into a one-row data frame and stack them into 'new_cols'
    new_cols = rbind.fill(lapply(strmatches, function(m) {as.data.frame(t(m), stringsAsFactors = FALSE)}))
    df = cbind(x, new_cols)
    df$c = gsub("\\[.*$", "", x$c) # only keep everything before the first '['
    # rows with two bracket groups: append the first group to c, then move the second group into V1
    df$c[!is.na(df$V2)] = paste0(df$c[!is.na(df$V2)], '+', df$V1[!is.na(df$V2)])
    df$V1[!is.na(df$V2)] = df$V2[!is.na(df$V2)]
    df$V2 = NULL
    colnames(df)[colnames(df) == "V1"] = "newcolumn"
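
To see what the lookaround pattern in the first commented line extracts, here is a minimal sketch on three strings copied from column c of the question. The variable names `s` and `m` are made up for illustration; `regmatches` and `gregexpr` are vectorised in base R, so no `lapply` is needed for this part.

```r
# Sample strings taken from column c of the question's data frame
s <- c("bla bla    [ text1 ]", "wow wow   [ text4a ] [ text4b  ]", "my text A")

# (?<=\\[) requires a '[' just before the match, (?=\\]) a ']' just after it;
# .*? is non-greedy, so each bracket pair yields its own match
m <- regmatches(s, gregexpr("(?<=\\[).*?(?=\\])", s, perl = TRUE))

lengths(m)  # number of bracketed groups found in each string
m[[2]]      # both groups of the second string, surrounding spaces included
```

Strings without brackets give an empty character vector, which is why the rows for "my text A" and "my text B" end up as NA once the matches are stacked.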
    

    Output (the new column is still labelled V1 here, i.e. before the final rename):

      a b                   c        V1
    1 a b         bla bla        text1 
    2 b c           bla bla       text2
    3 c d            how how    text3  
    4 d e wow wow   + text4a   text4b  
    5 e f    ba ba + text5a      text5b
    6 f g           my text A      <NA>
    7 g h           my text B      <NA>
    

    Hope this helps!

    PS: This matches your expected output, but you may want to add some str_trim calls in there.
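
A minimal sketch of that trimming step, assuming base R's `trimws` is an acceptable substitute for `stringr::str_trim`. The small data frame below is a hypothetical stand-in whose values are copied from three rows of the output above, not the full result.

```r
# Hypothetical stand-in for the answer's result (rows 1, 4 and 6 of the output)
df <- data.frame(
  c = c("bla bla    ", "wow wow   + text4a ", "my text A"),
  newcolumn = c(" text1 ", " text4b  ", NA),
  stringsAsFactors = FALSE
)

# trimws() strips leading and trailing whitespace and leaves NA untouched
df$c <- trimws(df$c)
df$newcolumn <- trimws(df$newcolumn)

# Optionally collapse runs of internal whitespace into a single space
df$c <- gsub("\\s+", " ", df$c)
df
```

Whether to collapse the internal whitespace depends on whether the original spacing in column c carries any meaning; the last `gsub` can be dropped if it does.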

    【Discussion】:
