Principal component analysis plot in R: answers

【Question title】: Principal component analysis plot in R
【Posted】: 2021-04-30 06:23:07
【Question】:

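I need a PCA plot that shows whether, and how, the data cluster by population (AFR_ACB, AFR_ASW, etc.). I also need a different color for each population and a legend for the population colors. It would also be good if I could add a frame around all African populations, all American populations, all Asian populations, and all European populations, since my real data consist of all of these.

I have data in the following format in a csv file (TLR9.csv) that I created from a results file. In reality, there are 26 columns (26 populations) and 1522 rows.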

nuc_pos AFR_ACB AFR_ASW AMR_PUR AMR_PEL EAS_CHS EAS_JPN EUR_FIN EUR_CEU AMR_MXL AMR_PEL AMR_PUR EAS_CDX  EAS_CHB  EAS_CHS
42809473    0   0   0   0   0   0   0   0   0   0   0   0   0.00971 0
42809498    0.01042 0   0.0201  0.00885 0   0.03488 0.00926 0   0   0   0   0   0   0
42809524    0   0   0   0   0.0201  0   0.00926 0   0   0   0   0   0   0
42809625    0   0   0   0   0   0   0   0.08192 0.01563 0.02339 0.02857 0   0   0
42809638    0   0   0   0.00885 0   0   0   0   0   0   0   0   0   0
42809715    0.30628 0.20485 0.34743 0.36531 0.19059 0.36199 0.34729 0.02116 0.01563 0   0.06536 0   0   0
42809846    0   0   0   0   0   0   0   0   0   0   0   0   0.00971 0.00952
42809910    0   0   0   0   0   0   0   0   0   0.01176 0   0   0   0
42809911    0   0   0   0   0   0   0   0   0   0   0   0   0   0
42809964    0.30628 0.20485 0.34743 0.36531 0.20638 0.38016 0.35241 0.02116 0.01563 0   0.06536 0   0   0
42810034    0.30628 0.20485 0.34743 0.36531 0.19059 0.34918 0.34729 0.02116 0.01563 0   0.06536 0   0   0
42810082    0   0   0   0   0   0.02339 0   0   0   0   0   0   0   0
42810098    0   0   0   0   0   0   0   0   0   0   0   0   0   0
42810103    0   0   0   0   0.0101  0   0   0   0   0   0   0   0   0
42810184    0   0   0   0   0.03    0   0   0   0   0   0   0   0   0
42810189    0.30628 0.20485 0.34743 0.36531 0.19853 0.34918 0.34729 0.02116 0.01563 0   0.06536 0   0   0
42810233    0   0   0   0   0   0   0   0   0   0   0   0   0   0

I made the PCA plot with the following code:

library(ggfortify) # provides autoplot() for prcomp objects
df <- read.csv('TLR9.csv')
pca_res <- prcomp(df, scale. = TRUE)
autoplot(pca_res, data = df, loadings = TRUE, loadings.label = TRUE, frame = TRUE, label = TRUE, shape = FALSE, label.size = 2, loadings.label.size = 3)

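Is the input file format correct for this kind of analysis? Is it also right to have all 26 populations as principal components?

I tried other R packages whose tutorials explain better how to do a PCA in R, but they are not compatible with the version of R that I have. So I tried this one, and it works, but I am not sure whether the output is what it should be.

This is my first time doing a PCA, and I am not very familiar with R. Any help would be greatly appreciated. Thanks in advance!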

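【Comments】:

Did my answer address your question?
Hi Andy. Sorry, I have not been able to test the code at all. I tried to install the FactoMineR package, but I get the following error: Warning in install.packages : installation of package 'FactoMineR' had non-zero exit status The downloaded source packages are in '/tmp/RtmpHtrBPj/downloaded_packages' Warning messages: 1: In .rs.normalizePath(defaultLibraryPath) : path[1]="/home/aahm/R/x86_64-pc-linux-gnu-library/4.0": No such file or directory 2: In .rs.normalizePath(libPaths) : path[1]="/home/aahm/R/x86_64-pc-linux-gnu-library/4.0": No such file or directory.
I searched for the cause and read somewhere that FactoMineR does not work with R version 3.4. I uninstalled R and reinstalled version 3.5, but I ran into the same problem. So I ran: sudo apt --fix-broken install, sudo apt autoremove, sudo apt-get update, sudo apt-get upgrade, sudo apt-get install r-base-dev, and found myself with the same problem again. Could you recommend another package that works with R version 3.4? Thanks.
It looks like you may be running a very old version of R (even 3.5 is old). Try removing R again, go to the website linked below, update to the latest version, and retry. I am running R version 4.0.2 (2020-06-22). r-project.org
I have reinstalled R 4.0.3 on my computer, but I still cannot install the package. I get the following error: ERROR: dependency 'rio' is not available for package 'car' * removing '/home/aahm/R/x86_64-pc-linux-gnu-library/4.0/car' * installing *source* package 'shiny' ... ** package 'shiny' successfully unpacked and MD5 sums checked ** using staged installation ** R ** inst ** byte-compile and prepare package for lazy loading ** help *** installing help indices *** copying figures ** testing if installed package keeps a record of temporary installation path * DONE (shiny

Tags: r pca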

【Solution 1】:

First of all, I cannot use your dataset because you did not provide it to us, so I use an example dataset of my own below.

This is easy to do with library(FactoMineR).

Load the data frame

df <- read.table("https://pastebin.com/raw/6aukL6YW", header = TRUE)

library(FactoMineR) # load package

names(df) # notice that one column is called "treatment"; the others are columns filled with data

Run the PCA

x <- PCA(df, quali.sup = 1) # quali.sup says which column to treat as a category (each category is automatically assigned a color); in your case this would be "population"
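
If your table is laid out like the one in the question, with positions in rows and populations in columns, it probably has to be transposed first so that each population becomes one row (one point on the plot). Here is a minimal base-R sketch with made-up frequencies. The names `freq`, `mat`, `pops` and `super_pop` are ones I invented for illustration, and the values are not real data.

```r
# Toy frequencies: rows are positions, columns are populations (made-up values)
freq <- data.frame(
  nuc_pos = c(101, 102, 103, 104),
  AFR_ACB = c(0.30, 0.01, 0.00, 0.02),
  AFR_ASW = c(0.20, 0.00, 0.00, 0.03),
  EAS_JPN = c(0.36, 0.03, 0.02, 0.00),
  EUR_FIN = c(0.35, 0.01, 0.01, 0.00)
)

# Transpose so that each population is one row and each position one column
mat <- t(as.matrix(freq[, -1]))
colnames(mat) <- paste0("pos_", freq$nuc_pos)

# Drop positions with zero variance, which cannot be scaled to unit variance
mat <- mat[, apply(mat, 2, var) > 0, drop = FALSE]

# Put the grouping column first; the prefix before "_" is used as the group
pops <- data.frame(
  super_pop = factor(sub("_.*", "", rownames(mat))),
  mat,
  row.names = rownames(mat)
)
```

A data frame shaped like `pops` should then be usable in the same way as `df` above, that is with PCA(pops, quali.sup = 1).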

You can also make a scatter plot with the plot.PCA() command that is built into the FactoMineR package

plot.PCA(x, axes = c(1, 2), cex = 1, choix = "ind", habillage = 1) # habillage says which column to treat as a factor; it also assigns a different color to each level (again, in your case, "population")
# this plot automatically adds a legend
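
If FactoMineR will not install, roughly the same scatter plot can be sketched with base R only. This uses the built-in iris data as a stand-in (Species plays the role of your population column). The color choices and object names are my own assumptions, not part of any package.

```r
# Built-in iris data: four numeric columns plus the factor Species
pca <- prcomp(iris[, 1:4], scale. = TRUE)
groups <- iris$Species

# One color per group (arbitrary choices)
palette_cols <- c("steelblue", "darkorange", "forestgreen")
point_cols <- palette_cols[as.integer(groups)]

# Percentage of variance explained by each component, for the axis labels
pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

# Scores on the first two components, colored by group, with a legend
plot(pca$x[, 1], pca$x[, 2], col = point_cols, pch = 19,
     xlab = paste0("PC1 (", pct[1], "%)"),
     ylab = paste0("PC2 (", pct[2], "%)"))
legend("topright", legend = levels(groups), col = palette_cols, pch = 19)
```

Unlike plot.PCA(), the legend is not automatic here, so it is added by hand with legend().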

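Finally, you can draw one more plot, again with plot.PCA(), that tells you which variables contribute most to the variation in the dataset

plot.PCA(x, choix = 'var', select = 'contrib 2') # top 2 contributors to the variation; the rest are not shown in bold; could do 5, 10, etc.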
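
To see the same idea as numbers, here is a small base-R sketch, again on the built-in iris data. Each column of the prcomp rotation matrix has unit length, so 100 times the squared loading gives each variable's share of that component. I have not checked that this matches FactoMineR's own ranking exactly, and `contrib` and `top2_pc1` are names I made up.

```r
# PCA on the four numeric iris columns, scaled to unit variance
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Share (in percent) of each variable in each component; every column sums to 100
contrib <- 100 * pca$rotation^2

# Names of the two variables with the largest share in PC1
top2_pc1 <- names(sort(contrib[, "PC1"], decreasing = TRUE))[1:2]
print(round(contrib, 1))
print(top2_pc1)
```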

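There you go...

【Comments】:

Hi Andy. Thank you very much for your help! I have installed the FactoMineR package, and I get the following message when I run the script: The downloaded source packages are in '/tmp/RtmpKcLvrC/downloaded_packages' > source('~/SoftMaker/Documents/Aahm/pca_test.R') Error in library(FactoMineR) : there is no package called 'FactoMineR' > View(df)
This actually worked, Andy. Thanks again!
Glad I could help @ahusnoo :)
How can I plot 5 principal components on the graph? I tried changing plot.PCA(res.pca, axes=c(1,2), cex=1, choix="ind", habillage=1) to plot.PCA(res.pca, axes=c(1,5), cex=1, choix="ind", habillage=1) but I get the following error: Error in res.pca$ind$coord[, axes, drop = FALSE] : subscript out of bounds
I am not sure I follow. But to answer briefly, it does not matter how many variables are on the graph; there can be 5 or there can be 50, and PCA can take all of them into account. If you have a way to share your dataset, I might be able to figure it out. If you like, you can use pastebin.com to share your dataset with me.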