【Question title】: Sum the total number of strings separated by comma [duplicate]
【Posted】: 2018-12-08 10:59:28
【Question description】:
structure(list(Other = c(NA_character_, NA_character_, NA_character_,
                         NA_character_, NA_character_),
              Years = c("2005, 2005, 2006, 2006, 2007", "2011, 2014",
                        "2007", "2011, 2011, 2011, 2012, 2012, 2012",
                        "2006, 2006, 2012, 2012, 2015")),
         .Names = c("Other", "Years"), row.names = 1:4, class = "data.frame")

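Given the data frame above, the second column holds a series of years separated by commas. I want to create a new column that counts how many years each element of that column contains, so that the final data frame looks like this: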

structure(list(Other = c(NA_character_, NA_character_, NA_character_,
                         NA_character_, NA_character_),
               Years = c("2005, 2005, 2006, 2006, 2007","2011, 2014", "2007",
                         "2011, 2011, 2011, 2012, 2012, 2012",
                         "2006, 2006, 2012, 2012, 2015"), 
               yearlength = c(5, 2, 1, 6, 5)),
         .Names = c("Other", "Years", "yearlength"), row.names = 1:4, class = "data.frame")

I have tried solutions such as stack$yearlength <- count.fields(textConnection(stack), sep = ","), but I could not get it to work.
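
For reference, the count.fields() attempt most likely fails because textConnection() expects a character vector rather than a whole data frame. A minimal sketch, assuming the data frame is named stack as in the attempt (it is rebuilt here with five rows, which is an assumption), passes only the Years column:

```r
# Hypothetical reconstruction of the question's data, with five rows.
stack <- data.frame(
  Other = NA_character_,
  Years = c("2005, 2005, 2006, 2006, 2007", "2011, 2014", "2007",
            "2011, 2011, 2011, 2012, 2012, 2012",
            "2006, 2006, 2012, 2012, 2015"),
  stringsAsFactors = FALSE
)

# Each element of Years is read as one line; count.fields() returns the
# number of comma-separated fields found on each line.
con <- textConnection(stack$Years)
stack$yearlength <- count.fields(con, sep = ",")
close(con)

stack$yearlength
```

A missing value would presumably be read as the literal text NA and counted as one field, so rows with NA would still need separate handling.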

【Comments】:

    Tags: r


    【Solution 1】:

    You can split each string on the comma and then take the length of the resulting vector.

    > sapply(strsplit(xy$Years, ","), length)
    [1] 5 2 1 6 5
    

    Edited to account for NA (example from @missuse):

    # row.names must cover all five rows (the question's data used 1:4)
    xy <- structure(list(Other = c(NA_character_, NA_character_, NA_character_,
                                   NA_character_, NA_character_),
                         Years = c("2005, 2005, 2006, 2006, 2007", "2011, 2014", "2007",
                                   "2011, 2011, 2011, 2012, 2012, 2012",
                                   "2006, 2006, 2012, 2012, 2015")),
                    .Names = c("Other", "Years"), row.names = 1:5, class = "data.frame")
    
    xy[3, 2] <- NA
    
    sapply(strsplit(xy$Years, ","), FUN = function(x) {
      length(na.omit(x))
    })
    
    [1] 5 2 0 6 5
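
Putting the pieces together, here is a minimal base R sketch that stores the counts in a new column. It uses lengths() in place of sapply(..., length) and zeroes out the missing rows afterwards. The data frame is rebuilt with data.frame() purely for illustration:

```r
# Illustrative data: same Years values as above, with the third one missing.
years <- c("2005, 2005, 2006, 2006, 2007", "2011, 2014", NA,
           "2011, 2011, 2011, 2012, 2012, 2012",
           "2006, 2006, 2012, 2012, 2015")
xy <- data.frame(Other = NA_character_, Years = years,
                 stringsAsFactors = FALSE)

# strsplit() returns NA for a missing string, which lengths() counts as 1,
# so rows where Years is NA are reset to 0 explicitly.
xy$yearlength <- lengths(strsplit(xy$Years, ","))
xy$yearlength[is.na(xy$Years)] <- 0L

xy$yearlength
```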
    

    【Discussion】:

    • lengths(strsplit(xy$Years, ","))
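    • Thanks for your answer. Is there a way to make it not count NA values?
    • @Woe That is why I wrapped the result in sapply. Instead of length, you can supply an anonymous function and handle each row/element however you like.
    【Solution 2】: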

    One approach is to count the commas and add 1:

    df$yearlength <- stringr::str_count(df$Years, ",")+1
    df
    #output
      Other                              Years yearlength
    1  <NA>       2005, 2005, 2006, 2006, 2007          5
    2  <NA>                         2011, 2014          2
    3  <NA>                               2007          1
    4  <NA> 2011, 2011, 2011, 2012, 2012, 2012          6
    5  <NA>       2006, 2006, 2012, 2012, 2015          5
    

    Another approach is to count the runs of digits:

    df$yearlength <- stringr::str_count(df$Years, "\\d+")
    

    A third option (thanks to Sotos's comment) is to count words:

    stringi::stri_count_words(df$Years)
    

    stringr::str_count(df$Years, "\\w+")
    

    A fourth option is to count runs of non-whitespace characters:

    stringr::str_count(df$Years, "\\S+")
    
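    counts <- list(stringr::str_count(df$Years, ",") + 1,
                   stringr::str_count(df$Years, "\\d+"),
                   stringi::stri_count_words(df$Years),
                   stringr::str_count(df$Years, "\\w+"),
                   stringr::str_count(df$Years, "\\S+"))
    # all.equal() compares only its first two arguments,
    # so check every result against the first one
    all(vapply(counts[-1], function(x) isTRUE(all.equal(counts[[1]], x)), logical(1)))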
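
The options agree on this data, but the patterns are not interchangeable in general. A base R sketch (no stringr needed), using made-up inputs and an illustrative helper named count_matches, shows where counting commas and counting digit runs diverge:

```r
# Hypothetical inputs (not in the original data): a clean value, a single
# year, an empty string, and a value with a trailing comma.
x <- c("2011, 2014", "2007", "", "2011,")

# Helper (name is illustrative): number of regex matches per element.
count_matches <- function(pattern, text) {
  lengths(regmatches(text, gregexpr(pattern, text)))
}

commas_plus_one <- count_matches(",", x) + 1L
digit_runs      <- count_matches("[0-9]+", x)

# The two columns differ for the empty string and the trailing comma.
data.frame(x, commas_plus_one, digit_runs)
```

If the column can contain empty strings or stray separators, counting the digit runs is probably the safer choice of the two.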
    

    Edit: when the data set contains NA:

    df[3,2] <- NA
    

    All of the solutions above then produce #output 5 2 NA 6 5

    To change the NA to 0:

    df$yearlength[is.na(df$yearlength)] <- 0
    #output
      Other                              Years yearlength
    1  <NA>       2005, 2005, 2006, 2006, 2007          5
    2  <NA>                         2011, 2014          2
    3  <NA>                               <NA>          0
    4  <NA> 2011, 2011, 2011, 2012, 2012, 2012          6
    5  <NA>       2006, 2006, 2012, 2012, 2015          5
    

    Data (because the data in the question is broken):

    df <- structure(list(Other = c(NA_character_, NA_character_, NA_character_, 
                             NA_character_, NA_character_), Years = c("2005, 2005, 2006, 2006, 2007", 
                                                                      "2011, 2014", "2007", "2011, 2011, 2011, 2012, 2012, 2012", "2006, 2006, 2012, 2012, 2015"
                             )), .Names = c("Other", "Years"), row.names = 1:5, class = "data.frame")
    

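    【Discussion】:

    • You could also use stringi, via stringi::stri_count_words
    • Thanks for your answer. My problem appears when I try to apply it to NA values. It seems to count them as 1 instead of 0.
    • None of the proposed solutions counts them. For example, stringr::str_count(df$Years, "\\w+") does not return 1 for an NA but produces NA in that position. See the edit for how to replace NA with 0.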