Bin a vector by width: answers

【Question title】: Bin a vector by width
【Posted】: 2022-12-29 02:57:58
【Question description】:

I have a vector of a continuous variable. For example:

x <- c(0.9000000,  1.2666667,  4.0000000,  5.7333333, 19.7333333, 35.7666667, 44.0000000,  4.4333333,  0.4666667,  0.7000000,  0.9333333,  1.0000000,  1.0000000,  1.0000000,  1.2000000,  1.2333333, 1.2666667,  1.4333333,  1.7000000,  4.0666667,  1.9000000,  2.1000000,  0.9333333,  1.2666667,  3.7333333,  0.9333333,  2.7666667,  3.1333333,  3.9333333,  5.0333333,  6.0666667,  8.2333333)

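I want to bin this vector by width (equal number of values) into three groups (low, medium, and high values). So the low group would contain the lowest third of all the values.

Then I want to merge the low and medium bins, so that I end up with a categorical vector containing "Not high" subjects, which would be the lowest 66%, and "High" subjects, which would be the highest 33%.

I have checked, but I couldn't find any predefined function that does this.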

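【Comments】:

Tags: r vector

【Solution 1】:

You can use cut to split and label a numeric vector. Essentially, you are looking for a single break point: the value two-thirds of the way along the sorted values of x. So you could do the following: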

break_point <- sort(x)[round(2 * length(x)/3)]

break_point
#> [1] 3.733333

Any value above 3.733333 will be "high". So we can do:

y <- cut(x, breaks = c(-Inf, break_point, Inf), labels = c('not high', 'high'))

If we put this in a data frame alongside x, you can see that y labels the highest values appropriately:

data.frame(x, y)
#>             x        y
#> 1   0.9000000 not high
#> 2   1.2666667 not high
#> 3   4.0000000     high
#> 4   5.7333333     high
#> 5  19.7333333     high
#> 6  35.7666667     high
#> 7  44.0000000     high
#> 8   4.4333333     high
#> 9   0.4666667 not high
#> 10  0.7000000 not high
#> 11  0.9333333 not high
#> 12  1.0000000 not high
#> 13  1.0000000 not high
#> 14  1.0000000 not high
#> 15  1.2000000 not high
#> 16  1.2333333 not high
#> 17  1.2666667 not high
#> 18  1.4333333 not high
#> 19  1.7000000 not high
#> 20  4.0666667     high
#> 21  1.9000000 not high
#> 22  2.1000000 not high
#> 23  0.9333333 not high
#> 24  1.2666667 not high
#> 25  3.7333333 not high
#> 26  0.9333333 not high
#> 27  2.7666667 not high
#> 28  3.1333333 not high
#> 29  3.9333333     high
#> 30  5.0333333     high
#> 31  6.0666667     high
#> 32  8.2333333     high

You can see that approximately 2/3 of the cases are "not high" and 1/3 are "high":

table(y) / length(x)
#> y
#> not high     high 
#>  0.65625  0.34375

You can't have exactly 2/3 in the "not high" group, because your vector has length 32, which is not divisible by 3.
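If the intermediate low/medium/high grouping from the question is also wanted, the same cut approach can take two break points instead of one. The following is only a minimal sketch under stated assumptions: x_demo is a small made-up vector (the first nine values of x, rounded), and the break points come from quantile() rather than from the sort()-and-index method above. Note that cut() will stop with an error if ties in the data make two break points identical.

```r
# Illustrative data (an assumption): the first nine values of x, rounded.
x_demo <- c(0.9, 1.3, 4.0, 5.7, 19.7, 35.8, 44.0, 4.4, 0.5)

# Break points at the minimum, the 1/3 and 2/3 quantiles, and the maximum.
breaks <- quantile(x_demo, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE keeps the minimum value inside the first bin.
three_groups <- cut(x_demo, breaks = breaks, include.lowest = TRUE,
                    labels = c("low", "medium", "high"))

# Count how many values fall in each bin.
table(three_groups)
```

With nine distinct values each bin receives three of them; with a length that is not divisible by 3, the bins can only be approximately equal in size.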

【Discussion】:

【Solution 2】:

You can use quantile():

y <- ifelse(x < quantile(x, 2/3), "not high", "high")

proportions(table(y))

#     high not high 
#  0.34375  0.65625
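Because ifelse() returns a character vector, table() lists the groups alphabetically ("high" before "not high"). If the order matters, one option is to wrap the result in factor() with explicit levels. This is a hedged sketch rather than part of the original answer; x_demo and y_demo are made-up names, and x_demo is a small illustrative vector.

```r
# Illustrative data (an assumption): the first nine values of x, rounded.
x_demo <- c(0.9, 1.3, 4.0, 5.7, 19.7, 35.8, 44.0, 4.4, 0.5)

# Cutoff at the 2/3 quantile (default quantile type).
cutoff <- quantile(x_demo, 2/3)

# Explicit levels fix the display order: "not high" first, then "high".
y_demo <- factor(ifelse(x_demo < cutoff, "not high", "high"),
                 levels = c("not high", "high"))

proportions(table(y_demo))
```

proportions() requires a reasonably recent R; prop.table() is the older name for the same operation.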

【Discussion】:

Very nice use of ifelse here, Darren.

【Solution 3】:

You can use santoku::chop_equally():

library(santoku)
chopped <- santoku::chop_equally(x, 3, labels = c("low", "medium", "high"))
data.frame(x, chopped)
            x chopped
1   0.9000000     low
2   1.2666667  medium
3   4.0000000    high
4   5.7333333    high
5  19.7333333    high
6  35.7666667    high
7  44.0000000    high
8   4.4333333    high
9   0.4666667     low
10  0.7000000     low
...

You can then regroup this factor (useful if you want to keep the low/medium/high version as well):

library(forcats)
chopped2 <- forcats::fct_collapse(chopped,
                                  "High" = "high",
                                  other_level = "Not high")
data.frame(x, chopped2)
            x chopped2
1   0.9000000 Not high
2   1.2666667 Not high
3   4.0000000     High
4   5.7333333     High
5  19.7333333     High
6  35.7666667     High
7  44.0000000     High
8   4.4333333     High
9   0.4666667 Not high
10  0.7000000 Not high
...
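If adding forcats is not an option, base R can merge factor levels by assigning a named list to levels(). This is a minimal sketch on a made-up factor (sizes is a hypothetical stand-in for chopped), not the approach used in the answer above.

```r
# Hypothetical three-level factor standing in for `chopped`.
sizes <- factor(c("low", "medium", "high", "high", "low"),
                levels = c("low", "medium", "high"))

# Work on a copy so the three-level version is kept.
collapsed <- sizes

# Each list name is a new level; its value lists the old levels to merge.
levels(collapsed) <- list("Not high" = c("low", "medium"), "High" = "high")

table(collapsed)
```

The order of the list entries determines the order of the new levels.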

Or, if you only want the "High"/"Not high" version, use chop_quantiles():

chopped2 <- santoku::chop_quantiles(x, .66, 
                                    labels = c("Not high", "High"))
data.frame(x, chopped2)
            x chopped2
1   0.9000000 Not high
2   1.2666667 Not high
3   4.0000000     High
4   5.7333333     High
5  19.7333333     High
6  35.7666667     High
7  44.0000000     High
8   4.4333333     High
9   0.4666667 Not high
10  0.7000000 Not high
...

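You said you want to bin by "width (equal number of values)". The bins above contain equal numbers of values, i.e. 1/3 of the values in each of the 3 categories. If you want to bin by width instead, i.e. into intervals of equal width, use santoku::chop_evenly():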

chopped3 <- santoku::chop_evenly(x, 3, labels = c("low", "medium", "high"))
data.frame(x, chopped3)
            x chopped3
1   0.9000000      low
2   1.2666667      low
3   4.0000000      low
4   5.7333333      low
5  19.7333333   medium
6  35.7666667     high
7  44.0000000     high
8   4.4333333      low
9   0.4666667      low
10  0.7000000      low
...
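For comparison, base R's cut() also produces equal-width intervals when breaks is given as a single number. The sketch below is an assumption-laden illustration, not a description of how santoku works internally: x_demo is a small made-up vector, and cut() slightly widens the outermost breaks so that the extreme values fall inside the bins.

```r
# Illustrative data (an assumption): the first nine values of x, rounded.
x_demo <- c(0.9, 1.3, 4.0, 5.7, 19.7, 35.8, 44.0, 4.4, 0.5)

# A single number for `breaks` asks cut() for that many equal-width intervals.
equal_width <- cut(x_demo, breaks = 3, labels = c("low", "medium", "high"))

# Equal-width bins can hold very different numbers of values.
table(equal_width)
```

On skewed data like this, most values land in the "low" interval, which is the practical difference between equal-width and equal-count binning.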

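NB: I am the maintainer of the santoku package.

【Discussion】:

【Solution 4】:

Here is something I found in the R documentation; maybe it helps?

bin(x, bin)

About bin:

"... use "cut::n" to cut the vector into n equal parts, b) use "cut::a]b[" to create the following bins: [min, a], ]a, b[, [b, max]."

This uses the fixest library https://rdrr.io/cran/fixest/, although as far as I checked it only works for integers, sorry.

Source: https://rdrr.io/cran/fixest/man/bin.html

【Discussion】: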