Variables that affect histogram plotted with hist() function in R: Answers

【Question title】: Variables that affect histogram plotted with hist() function in R
【Posted】: 2015-11-22 06:19:16
【Question】:

In R, you can plot a histogram and save its properties to a variable:

> h1=hist(c(1,1,2,3,4,5,5), breaks=0.5:5.5)

These properties can then be read:

> h1
$breaks
[1] 0.5 1.5 2.5 3.5 4.5 5.5

$counts
[1] 2 1 1 1 2

$density
[1] 0.2857143 0.1428571 0.1428571 0.1428571 0.2857143

$mids
[1] 1 2 3 4 5

$xname
[1] "c(1, 1, 2, 3, 4, 5, 5)"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

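How do these properties affect the histogram? So far, I have figured out the following:

The relationship between $breaks and $counts. $breaks gives the intervals that the plotted data can fall into, and $counts gives the number of data points that fall into each interval, for example:

[] denotes a closed interval (endpoints included)

() denotes an open interval (endpoints excluded)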

BREAKS    : COUNTS
[0.5-1.5] : 2 # Two 1s fall into this interval
(1.5-2.5] : 1 # One 2 falls into this interval
(2.5-3.5] : 1 # One 3 falls into this interval
(3.5-4.5] : 1 # One 4 falls into this interval
(4.5-5.5] : 2 # Two 5s fall into this interval
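The table above can be reproduced by hand. This is only a sketch: it uses cut() and table() as a stand-in for the binning, which is not necessarily how hist() works internally, and the names x, brks, bins and counts_by_hand are made up for the illustration.

```r
# Same data and breaks as in the question
x <- c(1, 1, 2, 3, 4, 5, 5)
brks <- 0.5:5.5

# cut() assigns each value to an interval; include.lowest = TRUE closes
# the leftmost interval on the left, giving [0.5,1.5], (1.5,2.5], ...
bins <- cut(x, breaks = brks, right = TRUE, include.lowest = TRUE)
counts_by_hand <- table(bins)
print(counts_by_hand)

# Compare with hist(), computed without drawing (plot = FALSE)
h <- hist(x, breaks = brks, plot = FALSE)
print(h$counts)
```

For this data both approaches give the same five counts.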

The relationship between $breaks and $density is basically the same as above, only written as percentages, for example:

BREAKS    : DENSITY
[0.5-1.5] : 0.2857143 # This interval covers about 29% of the plot
(1.5-2.5] : 0.1428571 # This interval covers about 14% of the plot
(2.5-3.5] : 0.1428571 # This interval covers about 14% of the plot
(3.5-4.5] : 0.1428571 # This interval covers about 14% of the plot
(4.5-5.5] : 0.2857143 # This interval covers about 29% of the plot

Of course, when you add up all these values, you get 1:

> sum(h1$density)
[1] 1

The following is the x-axis name:

$xname
[1] "c(1, 1, 2, 3, 4, 5, 5)"

But what do the rest do, especially $mids?

$mids
[1] 1 2 3 4 5

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

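help(hist) also lists many other values. Shouldn't they be listed in the output above as well, and if not, why? As explained in the following article:

By default, bin counts include values less than or equal to the bin's right break point and strictly greater than the bin's left break point, except for the leftmost bin, which includes its left break point.

So the following: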

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5)

will return a histogram in which 1.5 falls into the 0.5-1.5 interval. One "workaround" is to make the interval size smaller, e.g.

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=seq(0.5,5.5,0.1))

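But this seems impractical to me, and it also adds a bunch of 0s to $counts and $density. Is there a better, automatic way?

Besides that, it has a side effect I cannot explain: why does the last example return a sum of 10 instead of 1?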

> sum(h1$density)
[1] 10
> h1$density[h1$density>0]
[1] 2.50 1.25 1.25 1.25 1.25 2.50

【Comments】:

Please read ?[functionname] before posting questions like this.

Tags: r histogram

【Answer 1】:

Q1: what $mids and $equidist mean. From the help file:

mids: the n cell midpoints.

equidist: logical, indicating if the distances between breaks are all the same.
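Both values can be recomputed from $breaks alone. A minimal sketch, assuming the data from the question; mids_by_hand and equidist_by_hand are names invented here, and the tolerance check is my own choice rather than something taken from the help file:

```r
# Compute the histogram object without drawing it
h <- hist(c(1, 1, 2, 3, 4, 5, 5), breaks = 0.5:5.5, plot = FALSE)

left  <- h$breaks[-length(h$breaks)]  # left edge of every bin
right <- h$breaks[-1]                 # right edge of every bin

# Midpoint of each bin: the average of its two edges
mids_by_hand <- (left + right) / 2

# Are all bin widths the same (up to a small tolerance)?
widths <- right - left
equidist_by_hand <- diff(range(widths)) < 1e-7 * mean(widths)

print(mids_by_hand)      # compare with h$mids
print(equidist_by_hand)  # compare with h$equidist
```

So $mids is simply the x position of the center of each bar, and $equidist says whether all bars have the same width.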

Q2: Yes, with h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5), 1.5 will fall into the 0.5-1.5 category. If you want it to fall into the 1.5-2.5 category instead, you should use:

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.49:5.49)

Or, more neatly:

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5, right=FALSE)
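To see the effect of right=FALSE, the counts of both variants can be compared side by side. This is just an illustrative sketch; h_right and h_left are names made up for it, and plot = FALSE is used so nothing is drawn.

```r
x <- c(1, 1, 2, 3, 4, 5, 5, 1.5)

# Default (right = TRUE): bins are (a, b], so 1.5 lands in 0.5-1.5
h_right <- hist(x, breaks = 0.5:5.5, plot = FALSE)

# right = FALSE: bins are [a, b), so 1.5 lands in 1.5-2.5
h_left <- hist(x, breaks = 0.5:5.5, right = FALSE, plot = FALSE)

print(h_right$counts)  # 3 1 1 1 2
print(h_left$counts)   # 2 2 1 1 2
```

Only the first two bins differ, because 1.5 is the only value that sits exactly on a break.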

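I'm not sure what you are trying to automate here, but hopefully the above answers your question. If not, please make your question clearer to me.

Q3: the density sums to 10 instead of 1 because density is not frequency. From the help file:

density: values f^(x[i]), as estimated density values. If all(diff(breaks) == 1), they are the relative frequencies counts/n and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = breaks[i].

So if your bin widths (the differences between consecutive breaks) are not equal to 1, the densities will not sum to 1.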
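The formula from the help file can be checked on the 0.1-wide bins from the question. A small sketch, with widths and density_by_hand as names invented for the illustration:

```r
x <- c(1, 1, 2, 3, 4, 5, 5, 1.5)
h <- hist(x, breaks = seq(0.5, 5.5, 0.1), plot = FALSE)

# Width of every bin (all about 0.1 here)
widths <- diff(h$breaks)

# Density = share of the data in the bin, divided by the bin width
density_by_hand <- h$counts / (sum(h$counts) * widths)

print(sum(h$density))           # about 10, since every width is 0.1
print(sum(h$density * widths))  # 1, up to floating-point error
```

In other words, it is the total area of the bars (height times width) that equals 1, not the sum of the heights.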

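【Comments】:

Thanks for the reply. I am not a native English speaker, so even with the help file I did not understand the idea behind mids. If I understand correctly, it is the center of a particular vertical bar? Regarding the last question, I found that this returns 1: sum(h$density * diff(h$breaks))
I am not a native speaker either, but my understanding is the same: it is the middle of the histogram class. If the class is 2.5-3.5, the middle is 3.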