How does SciPy calculate the p-value in the pearsonr() function? Answers

【Question title】: How SciPy calculates the p-value in pearsonr() function?
【Posted】: 2018-04-30 00:07:25
【Question description】:

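I have searched a lot, but found no explanation of how SciPy calculates the p-value for the correlation coefficient, or of why it is unreliable for datasets smaller than 500 (as stated by SciPy on the function's documentation page).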

【Question comments】:

It looks like the docstring needs some work. I created an issue for this: github.com/scipy/scipy/issues/8789

Tags: python scipy p-value

【Solution 1】:

scipy.stats.pearsonr computes the p-value using the t distribution. (You can check the source code in the file stats.py on GitHub.) This should definitely be mentioned in the docstring.

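Here is an example. First import pearsonr and SciPy's implementation of the t distribution (NumPy is assumed to be already imported as np):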

In [334]: from scipy.stats import pearsonr, t as tdist

Define x and y for this example:

In [335]: x = np.array([0, 1, 2, 3, 5, 8, 13])

In [336]: y = np.array([1.2, 1.4, 1.6, 1.7, 2.0, 4.1, 6.6])

Compute r and p for this data:

In [337]: r, p = pearsonr(x, y)

In [338]: r
Out[338]: 0.9739566302403544

In [339]: p
Out[339]: 0.0002073053505382502

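Now compute p again, by first computing the t statistic with len(x) - 2 degrees of freedom and then taking twice the survival function at that t value (r is positive here; for a negative r, use abs(t)):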

In [340]: df = len(x) - 2

In [341]: t = r * np.sqrt(df/(1 - r**2))

In [342]: 2*tdist.sf(t, df)  # This is the p value.
Out[342]: 0.0002073053505382502

We get the same p-value, as expected.
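The steps above can be collected into a small helper. This is only a sketch of the t-distribution approach described in this answer, not SciPy's actual implementation; the function name pearson_p_value is made up for illustration, and it assumes |r| < 1 and at least three observations.

```python
import numpy as np
from scipy.stats import pearsonr, t as tdist


def pearson_p_value(r, n):
    """Two-sided p-value for a sample correlation r from n paired points.

    Hypothetical helper (not part of SciPy). Assumes |r| < 1 and n > 2.
    """
    df = n - 2
    # t statistic for testing the null hypothesis of zero correlation.
    t_stat = r * np.sqrt(df / (1.0 - r**2))
    # Two-sided test: twice the upper tail at |t|, so negative r also works.
    return 2.0 * tdist.sf(abs(t_stat), df)


x = np.array([0, 1, 2, 3, 5, 8, 13])
y = np.array([1.2, 1.4, 1.6, 1.7, 2.0, 4.1, 6.6])

# Positive correlation: the same data as in the answer.
r, p = pearsonr(x, y)
p_manual = pearson_p_value(r, len(x))

# Negative correlation: flipping the sign of y flips the sign of r.
r_neg, p_neg = pearsonr(x, -y)
p_neg_manual = pearson_p_value(r_neg, len(x))

print(p, p_manual)
print(p_neg, p_neg_manual)
```

Taking abs(t_stat) is the only change from the session above; without it, a negative r would yield a value greater than 1 instead of a p-value.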

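I don't know the origin of the statement "The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so." If anyone knows a citable reference, it should be added to the pearsonr docstring.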

【Discussion】:

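Compared with this example, in the source on GitHub the first r in the t statistic is squared and the fraction is not square-rooted. Could that indicate a discrepancy? I know the t-statistic you used is correct.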
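A squared statistic need not mean a different result. The sketch below assumes, based only on the description in the comment above, that the source computes t squared and passes it to the regularized incomplete beta function; the exact expression in stats.py is not reproduced here, so the t_squared line is an illustration. It relies on the standard identity that for a t-distributed variable with df degrees of freedom, the two-sided tail probability equals I_x(df/2, 1/2) with x = df/(df + t^2).

```python
import numpy as np
from scipy.special import betainc
from scipy.stats import t as tdist

# Values taken from the example in the answer.
r = 0.9739566302403544
df = 5  # len(x) - 2 for the seven data points

# Assumed "squared" form: r is squared and no square root is taken.
t_squared = r**2 * (df / ((1.0 - r) * (1.0 + r)))
p_beta = betainc(0.5 * df, 0.5, df / (df + t_squared))

# Form used in the answer: t statistic plus the t survival function.
t_stat = r * np.sqrt(df / (1.0 - r**2))
p_t = 2.0 * tdist.sf(abs(t_stat), df)

print(p_beta, p_t)
```

Under that assumption the two forms are the same two-sided test written differently, so they should agree up to floating-point rounding.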