Python中的主成分分析（PCA）答案

【问题标题】：Principal Component Analysis (PCA) in PythonPython中的主成分分析（PCA）
【发布时间】：2012-10-24 19:57:34
【问题描述】：

我有一个 (26424 x 144) 数组，我想使用 Python 对其执行 PCA。但是，网络上没有特定的地方可以解释如何完成此任务（有些网站只是根据自己的方式进行 PCA - 我找不到通用的方法）。任何有任何帮助的人都会做得很好。

【问题讨论】：

你的数组是稀疏的（大部分是 0）吗？您是否关心前 2-3 个组件捕获了多少方差——50 %、90 %？
不，它不是稀疏的，我已经过滤了错误值。是的，我有兴趣了解需要多少主成分来解释 > 75% 和 >90% 的方差......但不确定如何。对此有何想法？
在 Doug 的回答中查看从 eigh 排序的 evals - 如果您愿意，可以在此处或新问题发布前几个和总和。并查看维基百科PCA cumulative energy
基本 PCA 方法的比较，仅使用numpy 和/或scipy，可以在here 中找到timeit 结果。

【解决方案1】：

即使已经接受了另一个答案，我还是发布了我的答案；接受的答案依赖于deprecated function；此外，这个已弃用的函数基于奇异值分解 (SVD)，它（尽管完全有效）是计算 PCA 的两种通用技术中内存和处理器密集型的。由于 OP 中数据数组的大小，这在这里特别重要。使用基于协方差的 PCA，计算流程中使用的数组只是 144 x 144，而不是 26424 x 144（原始数据数组的维度）。 p>

这是使用 SciPy 中的 linalg 模块的 PCA 的简单工作实现。由于此实现首先计算协方差矩阵，然后对该数组执行所有后续计算，因此它使用的内存远少于基于 SVD 的 PCA。

（NumPy 中的 linalg 模块也可以在下面的代码中使用，除了 import 语句，即 from numpy import linalg as LA。 )

此 PCA 实施的两个关键步骤是：

计算协方差矩阵；和
获取这个 cov 矩阵的 eivenvectors 和 特征值

在下面的函数中，参数dims_rescaled_data指的是rescaled数据矩阵中所需的维数；此参数的默认值只有二维，但下面的代码不限于二维，它可以是小于原始数据数组列号的任意值。

def PCA(data, dims_rescaled_data=2):
    """
    returns: data transformed in 2 dims/columns + regenerated original data
    pass in: data as 2D NumPy array
    """
    import numpy as NP
    from scipy import linalg as LA
    m, n = data.shape
    # mean center the data
    data -= data.mean(axis=0)
    # calculate the covariance matrix
    R = NP.cov(data, rowvar=False)
    # calculate eigenvectors & eigenvalues of the covariance matrix
    # use 'eigh' rather than 'eig' since R is symmetric, 
    # the performance gain is substantial
    evals, evecs = LA.eigh(R)
    # sort eigenvalue in decreasing order
    idx = NP.argsort(evals)[::-1]
    evecs = evecs[:,idx]
    # sort eigenvectors according to same index
    evals = evals[idx]
    # select the first n eigenvectors (n is desired dimension
    # of rescaled data array, or dims_rescaled_data)
    evecs = evecs[:, :dims_rescaled_data]
    # carry out the transformation on the data using eigenvectors
    # and return the re-scaled data, eigenvalues, and eigenvectors
    return NP.dot(evecs.T, data.T).T, evals, evecs

def test_PCA(data, dims_rescaled_data=2):
    '''
    test by attempting to recover original data array from
    the eigenvectors of its covariance matrix & comparing that
    'recovered' array with the original data
    '''
    _ , _ , eigenvectors = PCA(data, dim_rescaled_data=2)
    data_recovered = NP.dot(eigenvectors, m).T
    data_recovered += data_recovered.mean(axis=0)
    assert NP.allclose(data, data_recovered)
    

def plot_pca(data):
    from matplotlib import pyplot as MPL
    clr1 =  '#2026B2'
    fig = MPL.figure()
    ax1 = fig.add_subplot(111)
    data_resc, data_orig = PCA(data)
    ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)
    MPL.show()

>>> # iris, probably the most widely used reference data set in ML
>>> df = "~/iris.csv"
>>> data = NP.loadtxt(df, delimiter=',')
>>> # remove class labels
>>> data = data[:,:-1]
>>> plot_pca(data)

下图是这个 PCA 函数在虹膜数据上的直观表示。如您所见，2D 变换将 I 类与 II 类和 III 类清晰地分开（但不是 II 类与 III 类，这实际上需要另一个维度）。

【讨论】：

我同意你的建议..看起来很有趣，老实说，消耗内存的方法要少得多。我有大量的多维数据，我将测试这些技术，看看哪一种效果最好。谢谢:-)
如何用这种方法检索第一个主成分？谢谢！ stackoverflow.com/questions/17916837/…
@doug--因为您的测试没有运行（m 是什么？为什么在返回之前定义 PCA 返回中的eigenvalues, eigenvectors？等等），这有点难以以任何有用的方式使用它...
@mmr 我已经发布了一个基于此答案的工作示例（在新答案中）
@doug NP.dot(evecs.T, data.T).T，为什么不简化为np.dot(data, evecs)？

【解决方案2】：

你可以在matplotlib模块中找到一个PCA函数：

import numpy as np
from matplotlib.mlab import PCA

data = np.array(np.random.randint(10,size=(10,3)))
results = PCA(data)

results 将存储 PCA 的各种参数。来自matplotlib的mlab部分，是MATLAB语法的兼容层

编辑：在博客nextgenetics 上，我发现了如何使用 matplotlib mlab 模块执行和显示 PCA 的精彩演示，玩得开心并查看该博客！

【讨论】：

恩里科，谢谢。我正在将此 3D 场景用于 3D PCA 图。再次感谢。有问题我会联系的。
@khan matplot.mlab 中的 PCA 函数已弃用。 (matplotlib.org/api/…)。此外，它使用 SVD，给定 OPs 数据矩阵的大小将是一个昂贵的计算。使用协方差矩阵（请参阅下面的答案），您可以将特征向量计算中的矩阵大小减少 100 倍以上。
@doug：它没有被弃用......他们只是放弃了它的文档。我假设。
我很难过，因为这三行代码不起作用！
我认为您想添加和更改以下命令@user2988577：import numpy as np 和data = np.array(np.random.randint(10,size=(10,3)))。然后我建议按照本教程来帮助您了解如何绘制blog.nextgenetics.net/?e=42

【解决方案3】：

另一个使用 numpy 的 Python PCA。与@doug 相同的想法，但没有运行。

from numpy import array, dot, mean, std, empty, argsort
from numpy.linalg import eigh, solve
from numpy.random import randn
from matplotlib.pyplot import subplots, show

def cov(X):
    """
    Covariance matrix
    note: specifically for mean-centered data
    note: numpy's `cov` uses N-1 as normalization
    """
    return dot(X.T, X) / X.shape[0]
    # N = data.shape[1]
    # C = empty((N, N))
    # for j in range(N):
    #   C[j, j] = mean(data[:, j] * data[:, j])
    #   for k in range(j + 1, N):
    #       C[j, k] = C[k, j] = mean(data[:, j] * data[:, k])
    # return C

def pca(data, pc_count = None):
    """
    Principal component analysis using eigenvalues
    note: this mean-centers and auto-scales the data (in-place)
    """
    data -= mean(data, 0)
    data /= std(data, 0)
    C = cov(data)
    E, V = eigh(C)
    key = argsort(E)[::-1][:pc_count]
    E, V = E[key], V[:, key]
    U = dot(data, V)  # used to be dot(V.T, data.T).T
    return U, E, V

""" test data """
data = array([randn(8) for k in range(150)])
data[:50, 2:4] += 5
data[50:, 2:5] += 5

""" visualize """
trans = pca(data, 3)[0]
fig, (ax1, ax2) = subplots(1, 2)
ax1.scatter(data[:50, 0], data[:50, 1], c = 'r')
ax1.scatter(data[50:, 0], data[50:, 1], c = 'b')
ax2.scatter(trans[:50, 0], trans[:50, 1], c = 'r')
ax2.scatter(trans[50:, 0], trans[50:, 1], c = 'b')
show()

与更短的产生相同的东西

from sklearn.decomposition import PCA

def pca2(data, pc_count = None):
    return PCA(n_components = 4).fit_transform(data)

据我了解，使用特征值（第一种方式）更适合高维数据和更少样本，而如果样本多于维度，则使用奇异值分解更好。

【讨论】：

使用循环违背了 numpy 的目的。您可以通过简单地进行矩阵乘法 C = data.dot(data.T) 更快地获得协方差矩阵
嗯或者使用numpy.cov我猜。不知道为什么我包含我自己的版本。
您的数据测试和可视化的结果似乎是随机的。你能解释一下如何可视化数据的细节吗？喜欢scatter(data[50:, 0], data[50:, 1] 的意义吗？
@PeterZhu 我不确定我是否理解你的问题。 PCA 将您的数据转换为最大化方差的新向量。 scatter 命令显示相互绘制的前两行。因此，将数据投影到所有其他维度上以使其成为 2D。
@Mark dot(V.T, data.T).T 为什么要跳这个舞，应该相当于dot(data, V)？ 编辑： 啊，我看你可能只是从上面复制的。我在面团的答案中添加了评论。

【解决方案4】：

这是numpy 的工作。

这里有一个教程，展示了如何使用numpy 的内置模块（如mean,cov,double,cumsum,dot,linalg,array,rank）进行主要成分分析。

http://glowingpython.blogspot.sg/2011/07/principal-component-analysis-with-numpy.html

注意scipy这里也有很长的解释 - https://github.com/scikit-learn/scikit-learn/blob/babe4a5d0637ca172d47e1dfdd2f6f3c3ecb28db/scikits/learn/utils/extmath.py#L105

【讨论】：

我认为链接的glowingpython博文在代码中有许多错误，请小心。（查看博客上最新的 cmets）
@EnricoGiampieri 同意你的看法 +$\infty$
对不起，我讽刺了。那条发光的蟒蛇不行

【解决方案5】：

这里是 scikit-learn 选项。这两种方法都使用了 StandardScaler，因为PCA is effected by scale

方法 1：让 scikit-learn 选择最小个主成分，以便保留至少 x%（在下面的示例中为 90%）的方差。

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# mean-centers and auto-scales the data
standardizedData = StandardScaler().fit_transform(iris.data)

pca = PCA(.90)

principalComponents = pca.fit_transform(X = standardizedData)

# To get how many principal components was chosen
print(pca.n_components_)

方法2：选择主成分的个数（本例中选择了2个）

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

standardizedData = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2)

principalComponents = pca.fit_transform(X = standardizedData)

# to get how much variance was retained
print(pca.explained_variance_ratio_.sum())

来源：https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

【讨论】：

【解决方案6】：

更新： matplotlib.mlab.PCA 是自 2.2 (2018-03-06) 版以来确实是 deprecated。

~~库matplotlib.mlab.PCA（用于this answer）不已弃用。因此，对于所有通过 Google 来到这里的人，我将发布一个使用 Python 2.7 测试的完整工作示例。~~

小心使用以下代码，因为它使用了现已弃用的库！

from matplotlib.mlab import PCA
import numpy
data = numpy.array( [[3,2,5], [-2,1,6], [-1,0,4], [4,3,4], [10,-5,-6]] )
pca = PCA(data)

现在 `pca.Y' 中的原始数据矩阵以主成分基向量表示。有关 PCA 对象的更多详细信息，请参阅here。

>>> pca.Y
array([[ 0.67629162, -0.49384752,  0.14489202],
   [ 1.26314784,  0.60164795,  0.02858026],
   [ 0.64937611,  0.69057287, -0.06833576],
   [ 0.60697227, -0.90088738, -0.11194732],
   [-3.19578784,  0.10251408,  0.00681079]])

您可以使用matplotlib.pyplot 来绘制这些数据，只是为了让自己相信 PCA 会产生“良好”的结果。 names 列表只是用来标注我们的五个向量。

import matplotlib.pyplot
names = [ "A", "B", "C", "D", "E" ]
matplotlib.pyplot.scatter(pca.Y[:,0], pca.Y[:,1])
for label, x, y in zip(names, pca.Y[:,0], pca.Y[:,1]):
    matplotlib.pyplot.annotate( label, xy=(x, y), xytext=(-2, 2), textcoords='offset points', ha='right', va='bottom' )
matplotlib.pyplot.show()

查看我们的原始向量，我们会看到 data[0] ("A") 和 data[3] ("D") 非常相似，data[1] ("B") 和 data[2 ] （“C”）。这反映在我们的 PCA 转换数据的 2D 图中。

【讨论】：

【解决方案7】：

除了所有其他答案之外，这里还有一些代码可以使用 `sklearn` 和 `matplotlib` 绘制 `biplot`。

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)    

pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    plt.scatter(xs * scalex,ys * scaley, c = y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
plt.xlim(-1,1)
plt.ylim(-1,1)
plt.xlabel("PC{}".format(1))
plt.ylabel("PC{}".format(2))
plt.grid()

#Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()

【讨论】：

【解决方案8】：

我制作了一个用于比较不同 PCA 的小脚本，作为答案出现在这里：

import numpy as np
from scipy.linalg import svd

shape = (26424, 144)
repeat = 20
pca_components = 2

data = np.array(np.random.randint(255, size=shape)).astype('float64')

# data normalization
# data.dot(data.T)
# (U, s, Va) = svd(data, full_matrices=False)
# data = data / s[0]

from fbpca import diffsnorm
from timeit import default_timer as timer

from scipy.linalg import svd
start = timer()
for i in range(repeat):
    (U, s, Va) = svd(data, full_matrices=False)
time = timer() - start
err = diffsnorm(data, U, s, Va)
print('svd time: %.3fms, error: %E' % (time*1000/repeat, err))


from matplotlib.mlab import PCA
start = timer()
_pca = PCA(data)
for i in range(repeat):
    U = _pca.project(data)
time = timer() - start
err = diffsnorm(data, U, _pca.fracs, _pca.Wt)
print('matplotlib PCA time: %.3fms, error: %E' % (time*1000/repeat, err))

from fbpca import pca
start = timer()
for i in range(repeat):
    (U, s, Va) = pca(data, pca_components, True)
time = timer() - start
err = diffsnorm(data, U, s, Va)
print('facebook pca time: %.3fms, error: %E' % (time*1000/repeat, err))


from sklearn.decomposition import PCA
start = timer()
_pca = PCA(n_components = pca_components)
_pca.fit(data)
for i in range(repeat):
    U = _pca.transform(data)
time = timer() - start
err = diffsnorm(data, U, _pca.explained_variance_, _pca.components_)
print('sklearn PCA time: %.3fms, error: %E' % (time*1000/repeat, err))

start = timer()
for i in range(repeat):
    (U, s, Va) = pca_mark(data, pca_components)
time = timer() - start
err = diffsnorm(data, U, s, Va.T)
print('pca by Mark time: %.3fms, error: %E' % (time*1000/repeat, err))

start = timer()
for i in range(repeat):
    (U, s, Va) = pca_doug(data, pca_components)
time = timer() - start
err = diffsnorm(data, U, s[:pca_components], Va.T)
print('pca by doug time: %.3fms, error: %E' % (time*1000/repeat, err))

pca_mark 是pca in Mark's answer。

pca_doug 是pca in doug's answer。

这是一个示例输出（但结果很大程度上取决于数据大小和 pca_components，因此我建议使用您自己的数据运行您自己的测试。此外，facebook 的 pca 针对标准化数据进行了优化，因此它会在这种情况下更快更准确）：

svd time: 3212.228ms, error: 1.907320E-10
matplotlib PCA time: 879.210ms, error: 2.478853E+05
facebook pca time: 485.483ms, error: 1.260335E+04
sklearn PCA time: 169.832ms, error: 7.469847E+07
pca by Mark time: 293.758ms, error: 1.713129E+02
pca by doug time: 300.326ms, error: 1.707492E+02

编辑：

来自 fbpca 的 diffsnorm 函数计算 Schur 分解的谱范数误差。

【讨论】：

准确度与您所说的错误不同。您能否解决此问题并解释该指标，因为它不直观，为什么这被认为是有信誉的？此外，将 Facebook 的“随机 PCA”与协方差版本的 PCA 进行比较是不公平的。最后，您是否考虑过某些库对输入数据进行了标准化？
感谢您的建议，您对准确性/错误差异的看法是正确的，我已经修改了我的答案。我认为根据速度和准确性将随机 PCA 与 PCA 进行比较是有道理的，因为两者都是为了降维。为什么你认为我应该考虑标准化？

【解决方案9】：

为了def plot_pca(data):可以工作，有必要换行

data_resc, data_orig = PCA(data)
ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)

有线条

newData, data_resc, data_orig = PCA(data)
ax1.plot(newData[:, 0], newData[:, 1], '.', mfc=clr1, mec=clr1)

【讨论】：

【解决方案10】：

此示例代码加载日本收益率曲线，并创建 PCA 组件。然后，它使用 PCA 估计给定日期的移动，并将其与实际移动进行比较。

%matplotlib inline

import numpy as np
import scipy as sc
from scipy import stats
from IPython.display import display, HTML
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import datetime
from datetime import timedelta

import quandl as ql

start = "2016-10-04"
end = "2019-10-04"

ql_data = ql.get("MOFJ/INTEREST_RATE_JAPAN", start_date = start, end_date = end).sort_index(ascending= False)

eigVal_, eigVec_ = np.linalg.eig(((ql_data[:300]).diff(-1)*100).cov()) # take latest 300 data-rows and normalize to bp
print('number of PCA are', len(eigVal_))

loc_ = 10
plt.plot(eigVec_[:,0], label = 'PCA1')
plt.plot(eigVec_[:,1], label = 'PCA2')
plt.plot(eigVec_[:,2], label = 'PCA3')
plt.xticks(range(len(eigVec_[:,0])), ql_data.columns)
plt.legend()
plt.show()

x = ql_data.diff(-1).iloc[loc_].values * 100 # set the differences
x_ = x[:,np.newaxis]
a1, _, _, _ = np.linalg.lstsq(eigVec_[:,0][:, np.newaxis], x_) # linear regression without intercept
a2, _, _, _ = np.linalg.lstsq(eigVec_[:,1][:, np.newaxis], x_)
a3, _, _, _ = np.linalg.lstsq(eigVec_[:,2][:, np.newaxis], x_)

pca_mv = m1 * eigVec_[:,0] + m2 * eigVec_[:,1] + m3 * eigVec_[:,2] + c1 + c2 + c3
pca_MV = a1[0][0] * eigVec_[:,0] + a2[0][0] * eigVec_[:,1] + a3[0][0] * eigVec_[:,2]
pca_mV = b1 * eigVec_[:,0] + b2 * eigVec_[:,1] + b3 * eigVec_[:,2]

display(pd.DataFrame([eigVec_[:,0], eigVec_[:,1], eigVec_[:,2], x, pca_MV]))
print('PCA1 regression is', a1, a2, a3)


plt.plot(pca_MV)
plt.title('this is with regression and no intercept')
plt.plot(ql_data.diff(-1).iloc[loc_].values * 100, )
plt.title('this is with actual moves')
plt.show()

【讨论】：

【解决方案11】：

这可能是 PCA 最简单的答案，包括易于理解的步骤。假设我们想要从 144 中保留 2 个主要维度，这提供了最大的信息。

首先，将您的二维数组转换为数据框：

import pandas as pd

# Here X is your array of size (26424 x 144)
data = pd.DataFrame(X)

那么，有两种方法可以使用：

方法一：手动计算

第 1 步：在 X 上应用列标准化

from sklearn import preprocessing

scalar = preprocessing.StandardScaler()
standardized_data = scalar.fit_transform(data)

第二步：求原始矩阵X的协方差矩阵S

sample_data = standardized_data
covar_matrix = np.cov(sample_data)

第 3 步：求 S 的特征值和特征向量（这里是 2D，所以每个 2）

from scipy.linalg import eigh

# eigh() function will provide eigen-values and eigen-vectors for a given matrix.
# eigvals=(low value, high value) takes eigen value numbers in ascending order
values, vectors = eigh(covar_matrix, eigvals=(142,143))

# Converting the eigen vectors into (2,d) shape for easyness of further computations
vectors = vectors.T

第 4 步：转换数据

# Projecting the original data sample on the plane formed by two principal eigen vectors by vector-vector multiplication.

new_coordinates = np.matmul(vectors, sample_data.T)
print(new_coordinates.T)

此new_coordinates.T 的大小为 (26424 x 2)，包含 2 个主成分。

方法 2：使用 Scikit-Learn

第 1 步：在 X 上应用列标准化

from sklearn import preprocessing

scalar = preprocessing.StandardScaler()
standardized_data = scalar.fit_transform(data)

第 2 步：初始化 pca

from sklearn import decomposition

# n_components = numbers of dimenstions you want to retain
pca = decomposition.PCA(n_components=2)

第 3 步：使用 pca 拟合数据

# This line takes care of calculating co-variance matrix, eigen values, eigen vectors and multiplying top 2 eigen vectors with data-matrix X.
pca_data = pca.fit_transform(sample_data)

此pca_data 的大小为 (26424 x 2)，包含 2 个主成分。

【讨论】：

除了所有其他答案之外，这里还有一些代码可以使用 sklearn 和 matplotlib 绘制 biplot。

除了所有其他答案之外，这里还有一些代码可以使用 `sklearn` 和 `matplotlib` 绘制 `biplot`。