【问题标题】:Python PCA plot using Hotelling's T2 for a confidence interval使用 Hotelling 的 T2 作为置信区间的 Python PCA 图
【发布时间】:2018-03-25 17:13:10
【问题描述】:

我正在尝试将 PCA 应用于多变量分析,并在 python 中使用 Hotelling T2 置信椭圆绘制前两个组件的得分图。我能够得到散点图,我想在散点图中添加 95% 的置信椭圆。如果有人知道如何在 python 中完成,那就太好了。

预期输出的示例图片:

【问题讨论】:

标签: python python-3.x statistics pca confidence-interval


【解决方案1】:

这让我很烦,所以我在 python 中采用了来自PCA and Hotelling's T^2 for confidence intervall in R 的答案(并使用了 ggbiplot R 包中的一些源代码)

from sklearn import decomposition
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
import scipy, random

#Generate data and fit PCA
random.seed(1)
data = np.array(np.random.normal(0, 1, 500)).reshape(100, 5)
outliers = np.array(np.random.uniform(5, 10, 25)).reshape(5, 5)
data = np.vstack((data, outliers))
pca = decomposition.PCA(n_components = 2)
scaler = StandardScaler()
scaler.fit(data)
data = scaler.transform(data)
pcaFit = pca.fit(data)
dataProject = pcaFit.transform(data)

#Calculate ellipse bounds and plot with scores
theta = np.concatenate((np.linspace(-np.pi, np.pi, 50), np.linspace(np.pi, -np.pi, 50)))
circle = np.array((np.cos(theta), np.sin(theta)))
sigma = np.cov(np.array((dataProject[:, 0], dataProject[:, 1])))
ed = np.sqrt(scipy.stats.chi2.ppf(0.95, 2))
ell = np.transpose(circle).dot(np.linalg.cholesky(sigma) * ed)
a, b = np.max(ell[: ,0]), np.max(ell[: ,1]) #95% ellipse bounds
t = np.linspace(0, 2 * np.pi, 100)

plt.scatter(dataProject[:, 0], dataProject[:, 1])
plt.plot(a * np.cos(t), b * np.sin(t), color = 'red')
plt.grid(color = 'lightgray', linestyle = '--')
plt.show()

Plot

【讨论】:

    【解决方案2】:

    pca library 提供 Hotelling T2 和 SPE/DmodX 异常值检测。

    pip install pca
    
    from pca import pca
    import pandas as pd
    import numpy as np
    
    # Create dataset with 100 samples
    X = np.array(np.random.normal(0, 1, 500)).reshape(100, 5)
    # Create 5 outliers
    outliers = np.array(np.random.uniform(5, 10, 25)).reshape(5, 5)
    # Combine data
    X = np.vstack((X, outliers))
    
    # Initialize model. Alpha is the threshold for the hotellings T2 test to determine outliers in the data.
    model = pca(alpha=0.05)
    
    # Fit transform
    out = model.fit_transform(X)
    

    打印异常值
    print(out['outliers'])
    
    #            y_proba      y_score  y_bool  y_bool_spe  y_score_spe
    # 1.0   9.799576e-01     3.060765   False       False     0.993407
    # 1.0   8.198524e-01     5.945125   False       False     2.331705
    # 1.0   9.793117e-01     3.086609   False       False     0.128518
    # 1.0   9.743937e-01     3.268052   False       False     0.794845
    # 1.0   8.333778e-01     5.780220   False       False     1.523642
    # ..             ...          ...     ...         ...          ...
    # 1.0   6.793085e-11    69.039523    True        True    14.672828
    # 1.0  2.610920e-291  1384.158189    True        True    16.566568
    # 1.0   6.866703e-11    69.015237    True        True    14.936442
    # 1.0  1.765139e-292  1389.577522    True        True    17.183093
    # 1.0  1.351102e-291  1385.483398    True        True    17.319038
    

    制作剧情

    model.biplot(legend=True, SPE=True, hotellingt2=True)
    

    【讨论】:

      猜你喜欢
      • 2012-08-17
      • 2023-03-25
      • 2019-10-07
      • 1970-01-01
      • 1970-01-01
      • 2020-02-04
      • 2018-07-13
      • 1970-01-01
      • 2023-03-27
      相关资源
      最近更新 更多