How to plot the principal vectors of each variable after performing PCA?

【Question Title】: How to plot the principal vectors of each variable after performing PCA?
【Posted】: 2019-12-11 21:20:54
【Question Description】:

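My question mainly comes from this post: https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance

In the post, the author plotted the direction and length of the vector for each variable. As I understand it, after performing PCA all we get are the eigenvectors and eigenvalues. For a dataset of dimension M x N, each eigenvector should be a 1 x N vector. So my guess is that the length of the vector is the eigenvalue, but how do I find the direction of the vector for each variable mathematically? And what is the physical meaning of the vector's length?

Also, if possible, can I do similar work in Python with the scikit PCA function?

Thanks!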

【Question Comments】:

jakevdp.github.io/PythonDataScienceHandbook/…
Let me know if my answer helped.

Tags: python scikit-learn pca

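【Solution 1】:

This plot is called a biplot, and it is very useful for understanding PCA results. The length of each vector is simply the value that the feature/variable takes on each principal component (i.e., the PCA loadings).

Example:

These loadings are accessible via print(pca.components_). Using the Iris dataset, the loadings are: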

  [[ 0.52106591, -0.26934744,  0.5804131 ,  0.56485654],
   [ 0.37741762,  0.92329566,  0.02449161,  0.06694199],
   [-0.71956635,  0.24438178,  0.14212637,  0.63427274],
   [-0.26128628,  0.12350962,  0.80144925, -0.52359713]]

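Here, each row is a PC and each column corresponds to a variable/feature. So feature/variable 1 has a value of 0.52106591 on PC1 and 0.37741762 on PC2. These are the values used to draw the vectors you see in the biplot. Look at the coordinates of Var1 in the plot: they are exactly the values above!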
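To make the row/column reading concrete, here is a minimal sketch, assuming the standardized Iris data used in this answer. It prints the (PC1, PC2) arrow tip of every variable and cross-checks the directions against a plain eigen-decomposition of the covariance matrix. The names `X_std`, `arrow_tips` and `same_directions` are illustrative only and are not part of the original answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)

pca = PCA().fit(X_std)

# Column j of components_ holds the coordinates of variable j on every PC,
# so its first two entries are the arrow tip (PC1, PC2) in the biplot.
arrow_tips = pca.components_[:2].T  # shape (n_features, 2)
for name, (pc1, pc2) in zip(iris.feature_names, arrow_tips):
    print(f"{name:20s} PC1={pc1:+.4f}  PC2={pc2:+.4f}")

# Cross-check: the same directions are the eigenvectors of the covariance matrix.
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder to descending, as PCA reports them
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvectors are only defined up to sign, so compare absolute values.
same_directions = np.allclose(np.abs(eigvecs.T), np.abs(pca.components_))
print("matches eigen-decomposition:", same_directions)
```

The sign of an eigenvector is arbitrary, which is why an arrow may point the opposite way when the same plot is produced by a different tool.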

Finally, to create this plot in Python, you can use sklearn:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
y = iris.target

#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)

pca = PCA()
pca.fit(X,y)
x_new = pca.transform(X)   

def myplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]

    plt.scatter(xs ,ys, c = y) #without scaling
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')

    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

#Call the function. 
myplot(x_new[:,0:2], pca.components_.T) 
plt.show()
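
The function above reads the class colours from the global y and falls back to Var1, Var2, ... when no labels are given. As a hedged variant, the sketch below passes the colours and the Iris feature names explicitly and draws on a given Axes. The function name `biplot`, its parameters and the choice of the headless "Agg" backend are assumptions made for illustration, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def biplot(scores, coeff, classes, labels, ax):
    """Scatter the first two PC scores and overlay one arrow per variable."""
    ax.scatter(scores[:, 0], scores[:, 1], c=classes)
    for (dx, dy), name in zip(coeff[:, :2], labels):
        # Arrow from the origin to the variable's (PC1, PC2) coordinates.
        ax.arrow(0, 0, dx, dy, color="r", alpha=0.5)
        ax.text(dx * 1.15, dy * 1.15, name, color="g", ha="center", va="center")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.grid()
    return ax


iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X_std)
scores = pca.transform(X_std)

fig, ax = plt.subplots()
biplot(scores, pca.components_.T, iris.target, iris.feature_names, ax)
```

To view the result, save it with fig.savefig(...) or remove the backend line and call plt.show() in an interactive session.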

See also this post: https://stackoverflow.com/a/50845697/5025009

and

https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f

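【Discussion】:

Thanks, your answer and the links are very helpful for understanding.
By the way, did you see the link posted by hellpanderr? That link says they use pca.explained_variance_ to determine the length of the vectors. Is that wrong? Or is it for some other reason?
They use the "explained variance" to define the squared length of the vectors. It is equivalent.
Where do these values come from? (> Here, each row is a PC and each column corresponds to a variable/feature. So feature/variable 1 has a value of 0.5223 on PC1 and 0.3723 on PC2.)
They can be printed using print(pca.components_). I stated this explicitly in my answer. Please read my answer again.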
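
The relation between pca.components_ and pca.explained_variance_ raised in this discussion can be checked numerically. The sketch below assumes one common convention, which may or may not be the one used on the linked page: each unit-length eigenvector is scaled by the square root of its eigenvalue. The name `scaled_loadings` is illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X_std)

# Each row of components_ is a unit-length eigenvector.
row_norms = np.linalg.norm(pca.components_, axis=1)

# Scale each PC's eigenvector by the square root of its eigenvalue.
# Rows are variables, columns are PCs.
scaled_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# The squared length of each PC's column equals that PC's explained variance.
squared_lengths = np.sum(scaled_loadings ** 2, axis=0)

print("row norms of components_:", row_norms)
print("squared column lengths:  ", squared_lengths)
print("explained_variance_:     ", pca.explained_variance_)
```

Under this convention the raw components_ carry only direction, while the scaled version also encodes how much variance each PC explains.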

【Solution 2】:

Try the "pca" library. It plots the explained variance and creates a biplot.

pip install pca

A small example:

from pca import pca

# Initialize to reduce the data to the number of components that explain 95% of the variance.
model = pca(n_components=0.95)

# Or reduce the data towards 2 PCs
model = pca(n_components=2)

# Load example dataset
import pandas as pd
import sklearn
from sklearn.datasets import load_iris
X = pd.DataFrame(data=load_iris().data, columns=load_iris().feature_names, index=load_iris().target)

# Fit transform
results = model.fit_transform(X)

# Plot explained variance
fig, ax = model.plot()

# Scatter first 2 PCs
fig, ax = model.scatter()

# Make biplot with the number of features
fig, ax = model.biplot(n_feat=4)

results is a dictionary containing many statistics, such as the PCs, the loadings, etc.

【Discussion】: