【Question Title】: R: sum vector by vector of conditions
【Posted】: 2015-08-16 00:33:51
【Question】:

I am trying to get a vector that contains, for each condition, the sum of the elements that meet it.

    values = runif(5000)
    bin = seq(0, 0.9, by = 0.1)
    sum(values < bin)

I expected sum to return 10 values, one for each element of `bin`.

【Comments】:

    Tags: r


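    【Solution 1】:

    I take this to mean that, for each value in bin, you want the number of elements of values that are less than that value. So I think you want vapply() here: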

    vapply(bin, function(x) sum(values < x), 1L)
    # [1]    0  497 1025 1501 1981 2461 2955 3446 3981 4526
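
To see why the original `sum(values < bin)` yields a single number while `vapply()` yields one count per threshold, here is a minimal sketch. The vectors `values_small` and `bin_small` are hypothetical inputs picked for illustration, not data from the question.

```r
# Hypothetical small inputs (not from the original post); the length of
# values_small is a multiple of the length of bin_small, as in the question.
values_small <- c(0.05, 0.15, 0.25, 0.35, 0.45, 0.95)
bin_small <- c(0, 0.2, 0.4)

# `<` recycles bin_small along values_small and compares element by element;
# sum() then collapses the whole logical vector into one number.
single <- sum(values_small < bin_small)

# vapply() runs the comparison once per threshold and returns one count each.
counts <- vapply(bin_small, function(x) sum(values_small < x), integer(1))

print(single)
print(counts)
```

Here `single` is one number, while `counts` has one entry per element of `bin_small`.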
    

    If you want a small table for reference, you can add names:

    v <- vapply(bin, function(x) sum(values < x), 1L)
    setNames(v, bin)
    #   0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 
    #   0  497 1025 1501 1981 2461 2955 3446 3981 4526 
    

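    【Comments】:

    • Should my answer's cumsum column give the same result as yours, or am I performing a different calculation? Thanks.
    • No, since runif() is used, none of us will get the same results.
    • Sorry, I didn't mention that I set the same seed with your code. Now I see that you are computing cumulative counts, whereas I am computing cumulative sums. I have now included both in my answer.
    • Thanks! That helped me a lot :)
    【Solution 2】:

    Personally, I prefer data.table over tapply or vapply, and findInterval over cut: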

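    set.seed(1)
    library(data.table)
    # values and bin as in the question; assumed to be regenerated after set.seed()
    values <- runif(5000)
    bin <- seq(0, 0.9, by = 0.1)
    dt <- data.table(values, groups = findInterval(values, bin))
    setkey(dt, groups)
    # per-group count and sum, then the cumulative versions of both
    dt[, .(n = .N, v = sum(values)), groups][, list(cumsum(n), cumsum(v)), ]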
    #      V1         V2
    # 1:  537   26.43445
    # 2: 1041  101.55686
    # 3: 1537  226.12625
    # 4: 2059  410.41487
    # 5: 2564  637.18782
    # 6: 3050  904.65876
    # 7: 3473 1180.53342
    # 8: 3951 1540.18559
    # 9: 4464 1976.23067
    #10: 5000 2485.44920
    
    cbind(vapply(bin, function(x) sum(values < x), 1L)[-1],
          cumsum(tapply(values, cut(values, bin), sum)))
    #          [,1]       [,2]
    #(0,0.1]    537   26.43445
    #(0.1,0.2] 1041  101.55686
    #(0.2,0.3] 1537  226.12625
    #(0.3,0.4] 2059  410.41487
    #(0.4,0.5] 2564  637.18782
    #(0.5,0.6] 3050  904.65876
    #(0.6,0.7] 3473 1180.53342
    #(0.7,0.8] 3951 1540.18559
    #(0.8,0.9] 4464 1976.23067
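
The data.table result above has 10 rows while the cut()-based result has 9. A small sketch of how the two functions treat break points with their default arguments may help explain this; `edges` and `x` are hypothetical values chosen for illustration.

```r
# Hypothetical break points and inputs (not from the original post)
edges <- c(0, 0.1, 0.2)
x <- c(0.05, 0.1, 0.15, 0.25)

# findInterval() uses left-closed intervals [a, b) by default and assigns
# values at or beyond the last break to the last index.
fi <- findInterval(x, edges)

# cut() uses right-closed intervals (a, b] by default and returns NA for
# values outside the outermost breaks.
ct <- cut(x, edges)

print(fi)
print(as.integer(ct))
```

A value equal to a break (0.1 here) therefore lands in different groups under the two functions, and a value beyond the last break gets its own group with findInterval() but NA with cut(). For continuous draws such as runif(), exact ties with a break are very unlikely, so in practice only the treatment of values beyond the last break matters.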
    

    【Comments】:

      【Solution 3】:

      Using tapply with an INDEX vector constructed by cut() seems to do it:

       tapply(  values,  cut(values, bin), sum)
        (0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] 
       25.43052  71.06897 129.99698 167.56887 222.74620 277.16395 
      (0.6,0.7] (0.7,0.8] (0.8,0.9] 
      332.18292 368.49341 435.01104 
      

      Although I would guess that you want the cut vector to extend to 1.0:

      bin = seq(0, 1, by = 0.1)
      tapply(  values,  cut(values, bin), sum)
      
        (0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] 
       25.48087  69.87902 129.37348 169.46013 224.81064 282.22455 
      (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1] 
      335.43991 371.60885 425.66550 463.37312 
      

      I see that I understood the question differently than Richard did. If you want his result, you can use cumsum on my result.
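
As the comments under Solution 1 point out, applying cumsum to per-bin sums gives cumulative sums of the values, whereas the vapply() approach gives cumulative counts. A minimal sketch of both, assuming the bins extend to 1 as above (the seed is set only to make the sketch repeatable):

```r
set.seed(1)  # assumption: any seed works, it only fixes the random draws
values <- runif(5000)
bin <- seq(0, 1, by = 0.1)
groups <- cut(values, bin)

per_bin_sum   <- tapply(values, groups, sum)     # sum of the values in each bin
per_bin_count <- tapply(values, groups, length)  # number of values in each bin

cum_sum   <- cumsum(per_bin_sum)    # running total of the values
cum_count <- cumsum(per_bin_count)  # running count of the values

print(cbind(cum_count, cum_sum))
```

The `cum_count` column is the one comparable to the vapply() output; `cum_sum` is what cumsum on the tapply(..., sum) result produces.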

      【Comments】:

        【Solution 4】:

        Using dplyr:

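        set.seed(1)
        library(dplyr)
        # df is not defined in the original answer; the definition below is an
        # assumption reconstructed from the output (groups built with cut())
        values <- runif(5000)
        bin <- seq(0, 0.9, by = 0.1)
        df <- data.frame(values, groups = cut(values, bin))
        df %>% group_by(groups) %>%
          summarise(count = n(), sum = sum(values)) %>%
          mutate(cumcount = cumsum(count), cumsum = cumsum(sum))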
        

        Output:

              groups count       sum cumcount     cumsum
        1    (0,0.1]   537  26.43445      537   26.43445
        2  (0.1,0.2]   504  75.12241     1041  101.55686
        3  (0.2,0.3]   496 124.56939     1537  226.12625
        4  (0.3,0.4]   522 184.28862     2059  410.41487
        5  (0.4,0.5]   505 226.77295     2564  637.18782
        6  (0.5,0.6]   486 267.47094     3050  904.65876
        7  (0.6,0.7]   423 275.87466     3473 1180.53342
        8  (0.7,0.8]   478 359.65217     3951 1540.18559
        9  (0.8,0.9]   513 436.04508     4464 1976.23067
        10        NA   536 509.21853     5000 2485.44920
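
The NA row in the output most likely collects the values above the last break (0.9), which cut() leaves unbinned. A base R sketch to check this, assuming the groups were built with cut() on the question's `bin` (the names `short_bin` and `full_bin` are introduced here for illustration):

```r
set.seed(1)  # assumption: seed used only to make the sketch repeatable
values <- runif(5000)
short_bin <- seq(0, 0.9, by = 0.1)  # breaks as in the question
full_bin  <- seq(0, 1, by = 0.1)    # breaks extended to 1

# Values above the last break receive no interval and become NA
n_na_short <- sum(is.na(cut(values, short_bin)))
n_above    <- sum(values > max(short_bin))

# With breaks extended to 1, every value falls into some interval
n_na_full <- sum(is.na(cut(values, full_bin)))

print(c(n_na_short = n_na_short, n_above = n_above, n_na_full = n_na_full))
```

Extending the breaks to 1, as in Solution 3, removes the NA group.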
        

        【Comments】:
