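【Question title】: Create new variables based upon specific values
【Posted】: 2015-09-24 09:54:25
【Question】:

I have read up on regular expressions and Hadley Wickham's stringr and dplyr packages, but I can't figure out how to get this to work.

I have library circulation data in a data frame, with the call number stored as a character variable. I'd like to put the leading capital letters into one new variable, and the digits between those letters and the period into a second new variable.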

Call_Num
HV5822.H4 C47 Circulating Collection, 3rd Floor
QE511.4 .G53 1982 Circulating Collection, 3rd Floor
TL515 .M63 Circulating Collection, 3rd Floor
D753 .F4 Circulating Collection, 3rd Floor
DB89.F7 D4 Circulating Collection, 3rd Floor 

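【Question comments】:

  • It's not clear to me exactly what your data look like. Could you post code that generates the kind of data frame you are working with?

Tags: regex r dplyr stringr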


【Solution 1】:

You can use strapply from the gsubfn package:

library(gsubfn)

m <- strapply(as.character(df$Call_Num), '^([A-Z]+)(\\d+)', 
     ~ c(id = x, num = y), simplify = rbind)

X <- as.data.frame(m, stringsAsFactors = FALSE)

#   id  num
# 1 HV 5822
# 2 QE  511
# 3 TL  515
# 4  D  753
# 5 DB   89

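【Comments】:

    【Solution 2】:

    Using the stringi package is one option. Since your targets sit at the start of the string, stri_extract_first() works well. [:alpha:]{1,} matches a sequence of one or more letters, so stri_extract_first() returns the first run of letters. Likewise, stri_extract_first(x, regex = "\\d{1,}") returns the first run of digits.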

    x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
           "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
           "TL515 .M63 Circulating Collection, 3rd Floor",
           "D753 .F4 Circulating Collection, 3rd Floor",
           "DB89.F7 D4 Circulating Collection, 3rd Floor")
    
    library(stringi)
    
    data.frame(alpha = stri_extract_first(x, regex = "[:alpha:]{1,}"), 
               number = stri_extract_first(x, regex = "\\d{1,}"))
    
    #  alpha number
    #1    HV   5822
    #2    QE    511
    #3    TL    515
    #4     D    753
    #5    DB     89
    

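    【Comments】:

    • Thanks jazzurro, it works great! Here is what I did for my data frame named "circ_data: circ_data_new
    • Just one small problem: when it creates the new variables, it makes them both factors. Can you suggest how to make the first one character type and the second one integer type?
    • @ConceptDelta Thanks for your comment. You want to wrap the code in as.character(). For example, alpha = as.character(stri_extract_first(x, regex = "[:alpha:]{1,}")). Hope this helps.
    • Hi jazzurro. I tried this: circ_data
    • @ConceptDelta You have too many parentheses. I think Call_Num_Alpha = as.character(stri_extract_first(circ_data$Call_Num, regex = "[:alpha:]{1,}")) will work. Let me know if you need more help.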
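
The factor issue raised in these comments can be sketched as follows. This is a minimal example, assuming the stringi package is installed; it uses a shortened copy of the sample vector x, the simplified patterns "[A-Z]+" and "\\d+" (an assumption, chosen because the question asks for capital letters), and an illustrative result name res. Passing stringsAsFactors = FALSE keeps the letters as character, and as.integer() converts the digits.

```r
library(stringi)

# Shortened copy of the sample call numbers from the answer
x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
       "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
       "D753 .F4 Circulating Collection, 3rd Floor")

res <- data.frame(
  # first run of capital letters, kept as character
  alpha  = stri_extract_first(x, regex = "[A-Z]+"),
  # first run of digits, converted from character to integer
  number = as.integer(stri_extract_first(x, regex = "\\d+")),
  stringsAsFactors = FALSE
)

str(res)
```

In R 4.0.0 and later, data.frame() no longer converts strings to factors by default, so the stringsAsFactors argument mainly matters on older versions.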
    【Solution 3】:

    If you want to use stringr, the solution might look like this:

    df <- data.frame(Call_Num = c(
      "HV5822.H4 C47 Circulating Collection, 3rd Floor",
      "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
      "TL515 .M63 Circulating Collection, 3rd Floor",
      "D753 .F4 Circulating Collection, 3rd Floor",
      "DB89.F7 D4 Circulating Collection, 3rd Floor"))
    
    require(stringr)
    
    matches = str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
    df2 <- data.frame(df, letter=matches[,2], number=matches[,3])
    df2
    ##                                                  Call_Num letter number
    ## 1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
    ## 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
    ## 3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
    ## 4          D753 .F4 Circulating Collection, 3rd Floor      D    753
    ## 5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89
    

    I don't think it's worth squeezing the str_match() call into dplyr's mutate(), so I'll leave it at that. Or use rawr's solution.
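
For readers who still want the dplyr form, a minimal sketch of what it could look like is given here. It assumes dplyr and stringr are installed, reuses the pattern from this answer, and works on a two-row copy of the data; the names pattern and df2 are illustrative, and the as.integer() conversion is an addition not present in the answer.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer: letters, digits, optional spaces, a period
pattern <- "([A-Z]+)(\\d+)\\s*\\."

df2 <- df %>%
  mutate(
    # column 1 of str_match() is the full match; 2 and 3 are the groups
    letter = str_match(Call_Num, pattern)[, 2],
    number = as.integer(str_match(Call_Num, pattern)[, 3])
  )

df2
```

The pattern is matched twice here, which is part of why the plain str_match() version may read more cleanly.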

    【Comments】:

      【Solution 4】:

      How about

      rl <- read.table(header = TRUE, text = "Call_Num
      'HV5822.H4 C47 Circulating Collection, 3rd Floor'
                       'QE511.4 .G53 1982 Circulating Collection, 3rd Floor'
                       'TL515 .M63 Circulating Collection, 3rd Floor'
                       'D753 .F4 Circulating Collection, 3rd Floor'
                       'DB89.F7 D4 Circulating Collection, 3rd Floor'",
                       stringsAsFactors = FALSE)
      cbind(rl, read.table(text = gsub('([A-Z]+)([0-9]+).*', '\\1 \\2', rl$Call_Num)))
      
      #                                              Call_Num V1   V2
      # 1     HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
      # 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE  511
      # 3        TL515 .M63 Circulating Collection, 3rd Floor TL  515
      # 4          D753 .F4 Circulating Collection, 3rd Floor  D  753
      # 5        DB89.F7 D4 Circulating Collection, 3rd Floor DB   89
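
A base R variant that also names and types the new columns might look like the sketch below. It assumes a version of R that provides utils::strcapture() (R 3.4.0 or later, to my knowledge); the column names letter and number and the object names parts and out are illustrative, not part of the answer.

```r
x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
       "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
       "DB89.F7 D4 Circulating Collection, 3rd Floor")

# proto gives one column per capture group, with its name and type
parts <- strcapture(
  "^([A-Z]+)([0-9]+)", x,
  proto = data.frame(letter = character(), number = integer(),
                     stringsAsFactors = FALSE)
)

# Put the original call numbers next to the extracted pieces
out <- data.frame(Call_Num = x, parts, stringsAsFactors = FALSE)
str(out)
```

Compared with the gsub() and read.table() round trip, this skips the intermediate text and leaves the column types under explicit control.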
      

      【Comments】:
