Why do np.corrcoef(x) and df.corr() give different results?

【Question title】: Why do np.corrcoef(x) and df.corr() give different results?
【Posted】: 2021-01-27 21:58:16
【Question】:

Why do the NumPy correlation matrix from np.corrcoef(x) and the pandas correlation matrix from df.corr() differ for the same data?

import numpy as np
import pandas as pd

# Transposed, so each inner list becomes a column of x
x = np.array([[0, 2, 7], [1, 1, 9], [2, 0, 13]]).T
x_df = pd.DataFrame(x)
print("matrix:")
print(x)
print()
print("df:")
print(x_df)
print()

print("np correlation matrix: ")
print(np.corrcoef(x))
print()
print("pd correlation matrix: ")

print(x_df.corr())
print()

This gives me the following output:

matrix:
[[ 0  1  2]
 [ 2  1  0]
 [ 7  9 13]]

df:
   0  1   2
0  0  1   2
1  2  1   0
2  7  9  13

np correlation matrix: 
[[ 1.         -1.          0.98198051]
 [-1.          1.         -0.98198051]
 [ 0.98198051 -0.98198051  1.        ]]

pd correlation matrix: 
          0         1         2
0  1.000000  0.960769  0.911293
1  0.960769  1.000000  0.989743
2  0.911293  0.989743  1.000000

My guess is that they are different kinds of correlation coefficient?

【Comments on the question】:

np.corrcoef(x.T) == x_df.corr(), or use print(np.corrcoef(x, rowvar=False))

Tags: python pandas numpy correlation

【Solution 1】:

@AlexAlex is right: the two calls are correlating different sets of numbers.

Consider it with a 2x3 matrix:

yx = np.array([[0, 2, 7], [1, 1, 9]])
np.corrcoef(yx)

gives

array([[1.        , 0.96076892],
       [0.96076892, 1.        ]])

and

x_df = pd.DataFrame(yx.T)
print(x_df)
x_df[0].corr(x_df[1])

gives

   0  1
0  0  1
1  2  1
2  7  9

0.9607689228305227

The 0.9607... value matches the output of the NumPy calculation.
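To see where 0.9607... comes from, here is a minimal sketch that computes the Pearson coefficient by hand for the two sequences [0, 2, 7] and [1, 1, 9]. The helper name pearson is hypothetical and used only for illustration; it is not part of NumPy or pandas.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two 1-D sequences (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()  # deviations from the mean
    db = b - b.mean()
    # covariance term divided by the product of the standard deviations;
    # the 1/n (or 1/(n-1)) factors cancel, so plain sums are enough
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

r_same_pair = pearson([0, 2, 7], [1, 1, 9])   # the pair both libraries agree on
r_other_pair = pearson([0, 1, 2], [2, 1, 0])  # first two rows of the question's x
print(r_same_pair, r_other_pair)
```

The first value is about 0.9608, the number both libraries report when given the same pair of sequences. The second is -1, which is the entry np.corrcoef(x) shows in the question because it pairs up the rows of x instead of the columns.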

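Computed the way you did it, np.corrcoef is correlating the rows rather than the columns, because it treats each row as a variable by default, whereas df.corr() treats each column as a variable. You can fix the NumPy version with .T or with the argument rowvar=False.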
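As a quick check of that fix, the following sketch reuses the data from the question and compares the three NumPy variants against the pandas result. The variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

x = np.array([[0, 2, 7], [1, 1, 9], [2, 0, 13]]).T
x_df = pd.DataFrame(x)

pd_corr = x_df.corr().to_numpy()             # pandas: columns are the variables

by_rows = np.corrcoef(x)                     # NumPy default (rowvar=True): rows are the variables
by_cols = np.corrcoef(x, rowvar=False)       # columns are the variables
by_cols_t = np.corrcoef(x.T)                 # same effect via transposition

matches_default = bool(np.allclose(by_rows, pd_corr))
matches_rowvar = bool(np.allclose(by_cols, pd_corr))
matches_transpose = bool(np.allclose(by_cols_t, pd_corr))
print(matches_default, matches_rowvar, matches_transpose)  # False True True
```

np.allclose is used instead of == because the two libraries may differ in the last few floating-point digits even when they compute the same coefficient.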

【Comments】: