How to set a pandas DataFrame column as the index when generating a loading matrix for PCA

【Question title】: How to set pandas dataframe column as index when generating loading matrix for PCA
【Posted】: 2019-12-16 16:06:10
【Question description】:

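I am performing principal component analysis (PCA) on gene expression data using sklearn in Python. My data is loaded as a pandas DataFrame; I can call df.head() and the df looks fine. I am generating a loading matrix with sklearn, but the matrix only shows a generic index and will not accept the column name as the index. I have 1722 genes, so it is important to obtain the loading score for each gene computationally.

Here is my PCA code: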

import pandas as pd
from sklearn.decomposition import PCA
from sklearn import preprocessing


# Load the data as pandas dataframe
cols = ['gene', 'FC_TSWV', 'FC_WFT', 'FC_TSWV_WFT']
df = pd.read_csv('./PCA.txt', names = cols, header = None, index_col = 'gene')

# preprocess data:

scaled_df = preprocessing.scale(df.T)


# perform PCA

pca = PCA()
pca.fit(scaled_df)
pca_data = pca.transform(scaled_df)


# Generate loading matrix. HERE IS WHERE THE TROUBLE IS:

loading_scores = pd.Series(pca.components_[0], index = df.gene)


# Print loading matrix

sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
print(loading_scores)

I have tried:

loading_scores = pd.Series(pca.components_[0], index = df.gene)

loading_scores = pd.Series(pca.components_[0], index = df['gene'])

loading_scores = pd.Series(pca.components_[0], index = df.loc['gene'])

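AttributeError: 'DataFrame' object has no attribute 'gene'.

If I don't specify an index at all, the loading scores are labeled with a generic 0-based index.

Does anyone know how to fix this?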

【Question discussion】:

Tags: python pandas scikit-learn pca genetics

【Solution 1】:

Use df.index instead of df.gene or df['gene'].

Once a column has been set as the index, you access it through the .index attribute, not by the column's name.
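A minimal sketch of that point, using only pandas and NumPy. The three gene names and all numbers are made up, the in-memory text stands in for PCA.txt, and the array `first_component` is a placeholder for `pca.components_[0]` rather than a real PCA result:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for PCA.txt: three made-up genes, no header row.
raw = io.StringIO(
    "geneA,1.2,-0.4,0.9\n"
    "geneB,-0.7,0.3,1.5\n"
    "geneC,0.1,2.0,-1.1\n"
)
cols = ['gene', 'FC_TSWV', 'FC_WFT', 'FC_TSWV_WFT']
df = pd.read_csv(raw, names=cols, header=None, index_col='gene')

# 'gene' is now the index, so it no longer appears among the columns.
print(list(df.columns))
print(df.index.name)

# Placeholder for pca.components_[0]: one made-up value per gene.
first_component = np.array([0.5, -0.8, 0.3])

# Label each score with its gene by passing the index itself.
loading_scores = pd.Series(first_component, index=df.index)
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
print(sorted_loading_scores)
```

Because index_col='gene' moves that column out of the data, df.gene and df['gene'] have nothing to look up, while df.index still carries the gene names in row order.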

【Discussion】:

That didn't work. The new error message says: "ValueError: cannot reindex from a duplicate axis"
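That message is what pandas raises when it has to realign data whose index contains repeated labels. The thread does not say where the duplicates come from, so the following is only a sketch, with made-up data, of how one might check whether gene names repeat in the index. Keeping the first row per gene is an assumption for illustration, not a recommendation from the thread:

```python
import pandas as pd

# Hypothetical data: 'geneA' appears twice, as a repeated gene name would.
df = pd.DataFrame(
    {'FC_TSWV': [1.2, -0.7, 0.1], 'FC_WFT': [-0.4, 0.3, 2.0]},
    index=pd.Index(['geneA', 'geneB', 'geneA'], name='gene'),
)

# is_unique is False as soon as any label repeats.
print(df.index.is_unique)

# keep=False marks every occurrence of a repeated label, not only the later ones.
dup_mask = df.index.duplicated(keep=False)
print(df[dup_mask])

# One possible policy (an assumption): keep the first row for each gene.
deduped = df[~df.index.duplicated(keep='first')]
print(deduped)
```

Whether dropping, averaging, or renaming duplicated genes is appropriate depends on the data, so inspecting the rows selected by `dup_mask` first is probably the safer step.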