Frequency-weighted percentile in dataframe with dplyr: answers

【Question title】: Frequency-weighted percentile in dataframe with dplyr
【Posted】: 2020-10-07 21:37:13
【Question】:

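I'm trying to calculate the percentile rank of a value in a data frame, and the data frame also has an associated frequency column to weight by. I'm struggling to come up with a solution that calculates the percentile of each original value as if the overall distribution consisted of that value replicated by its frequency, together with every other value replicated by its own frequency.

For example: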

library(dplyr)

groceries <- tribble(
  ~item, ~price, ~freq,
  "apple",   1, 20,
  "banana",   2, 5,
  "carrot",   3, 1
)

# weighted_percent_rank() does not exist yet; it is the function being asked for
groceries %>% 
    mutate(reg_ptile = percent_rank(price),
           wtd_ptile = weighted_percent_rank(price, wt = freq))

# the expected result would be:

# A tibble: 3 x 5
  item   price  freq reg_ptile wtd_ptile
  <chr>  <dbl> <dbl>     <dbl>     <dbl>
1 apple      1    20       0.0       0.0
2 banana     2     5       0.5       0.8
3 carrot     3     1       1.0       1.0

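percent_rank() is an actual dplyr function. How could the function weighted_percent_rank() be written? I'm not sure how to make this work inside a data frame and a pipe. It would be great if the solution also worked with groups.

EDIT: Using uncount() doesn't really work, because uncounting the data I'm working with produces 800 billion rows. Any other ideas?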

【Question comments】:

Tags: r dplyr statistics

【Answer 1】:

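You can use tidyr::uncount to expand the rows according to their frequency, compute the weighted percentile on the expanded data, and then collapse back to one row per item with summarize, as in this reprex (reproducible example):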

library(dplyr)

groceries <- tribble(
  ~item, ~price, ~freq,
  "apple",   1, 10,
  "banana",   2, 5,
  "carrot",   3, 1
)

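groceries %>% 
  tidyr::uncount(freq) %>%                      # repeat each row freq times (drops the freq column)
  mutate(wtd_ptile = percent_rank(price)) %>%   # percentile rank on the expanded rows
  group_by(item) %>%
  summarize_all(~.[1]) %>%                      # keep the first row of each item
  mutate(ptile = percent_rank(price))           # unweighted rank, for comparison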
#> # A tibble: 3 x 4
#>   item   price wtd_ptile ptile
#>   <chr>  <dbl>     <dbl> <dbl>
#> 1 apple      1     0       0  
#> 2 banana     2     0.667   0.5
#> 3 carrot     3     1       1

Note that you could choose a different ranking function, but with the frequencies used here the weighted percentile for banana is 0.667 (10/(16 - 1)), not 0.8.
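
To make the arithmetic behind 0.667 versus 0.8 concrete, here is a minimal sketch that expands the prices with base rep() instead of uncount(). It relies on percent_rank(x) being (min_rank(x) - 1) / (n - 1); the variable names are made up for illustration.

```r
library(dplyr)

price <- c(1, 2, 3)  # apple, banana, carrot

# Expand each price by its frequency, which is what tidyr::uncount() does to the rows.
expanded_answer   <- rep(price, times = c(10, 5, 1))  # frequencies used in this answer
expanded_question <- rep(price, times = c(20, 5, 1))  # frequencies in the edited question

# percent_rank(x) is (min_rank(x) - 1) / (number of non-missing values - 1).
manual <- (min_rank(expanded_answer) - 1) / (length(expanded_answer) - 1)
stopifnot(isTRUE(all.equal(manual, percent_rank(expanded_answer))))

# Banana: 10 cheaper rows out of 16 - 1, or 20 cheaper rows out of 26 - 1.
banana_answer   <- percent_rank(expanded_answer)[expanded_answer == 2][1]
banana_question <- percent_rank(expanded_question)[expanded_question == 2][1]

round(c(banana_answer, banana_question), 3)
#> [1] 0.667 0.800
```

So the 0.8 expected in the question comes from an apple frequency of 20, while this answer uses 10.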

EDIT

An alternative that doesn't involve creating billions of rows:

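groceries %>% 
  arrange(price) %>%   # sort so the cumulative sum runs from the lowest price upward
  # total frequency of all cheaper rows, divided by (total frequency - 1)
  mutate(wtd_ptile = lag(cumsum(freq), default = 0)/(sum(freq) - 1))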
#> # A tibble: 3 x 4
#>   item   price  freq wtd_ptile
#>   <chr>  <dbl> <dbl>     <dbl>
#> 1 apple      1    10     0    
#> 2 banana     2     5     0.667
#> 3 carrot     3     1     1
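
The question also asked for a reusable weighted_percent_rank() that works with groups. The following is only a sketch of one way to wrap the cumulative-sum idea above. The function body, the store column and the extra rows are hypothetical and not from the original answer. It assumes x and wt contain no NA values and that the total weight is greater than 1. Unlike the one-liner, it gives tied values the same rank, as percent_rank() would on the expanded data.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr: weighted analogue of percent_rank().
# Returns (total weight of values strictly below x[i]) / (total weight - 1).
weighted_percent_rank <- function(x, wt) {
  wt <- as.numeric(wt)                      # avoid integer overflow when totals run into the billions
  ord <- order(x)
  x_sorted <- x[ord]
  wt_sorted <- wt[ord]
  below <- cumsum(wt_sorted) - wt_sorted    # weight ahead of each row in sorted order
  below <- ave(below, x_sorted, FUN = min)  # tied values share the smallest such weight
  out <- numeric(length(x))
  out[ord] <- below / (sum(wt) - 1)         # put results back in the original row order
  out
}

# Example data; the store column and the "south" rows are made up for illustration.
groceries <- tribble(
  ~store,  ~item,    ~price, ~freq,
  "north", "apple",  1,      20,
  "north", "banana", 2,      5,
  "north", "carrot", 3,      1,
  "south", "apple",  1,      3,
  "south", "banana", 1,      2,   # tied with apple on price
  "south", "carrot", 4,      6
)

result <- groceries %>%
  group_by(store) %>%
  mutate(wtd_ptile = weighted_percent_rank(price, wt = freq)) %>%
  ungroup()

result$wtd_ptile
#> [1] 0.0 0.8 1.0 0.0 0.0 0.5
```

Because the helper sorts internally, no arrange() step is needed, and mutate() applies it once per group after group_by().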

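【Comments】:

Ah, I meant for apple to have a frequency of 20, but I set it to 10 in the tribble. My mistake! I've edited the question to reflect that, and you can edit the answer if you like. Going to test this on my data now, thanks!
OK, so the problem with this solution is that using it on my dataset means creating a data frame with 785 billion rows, haha! Can you think of a way around that, perhaps by using proportions instead of frequencies? The data would then look like the result of groceries %>% mutate(prop = freq/sum(freq))
@AdhiR. I didn't expect your numbers to be that big! I've added a more efficient alternative.
Wow, that works, and it's really simple and easy. Thank you very much.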