【Question title】: scikit-learn PCA transform returns incorrect reduced feature length
【Posted】: 2016-08-20 08:32:06
【Question】:

I am trying to apply PCA in my code. I train on my data with the following code:

def gather_train():
    train_data = np.array([])
    train_labels = np.array([])
    with open(training_info, "r") as traincsv:
        for line in traincsv:
            current_image = "train\\{}".format(line.strip().split(",")[0])
            print "Reading data from: {}".format(current_image)
            train_labels = np.append(train_labels, int(line.strip().split(",")[1]))
            with open(current_image, "rb") as img:
                train_data = np.append(train_data, np.fromfile(img, dtype=np.uint8).reshape(-1, 784)/255.0)
    train_data = train_data.reshape(len(train_labels), 784)
    return train_data, train_labels

def get_PCA_train(data):
    print "\nFitting PCA. Components: {} ...".format(PCA_components)
    pca = decomposition.PCA(n_components=PCA_components).fit(data)
    print "\nReducing data to {} components ...".format(PCA_components)
    data_reduced = pca.fit_transform(data)
    return data_reduced

def get_PCA_test(data):
    print "\nFitting PCA. Components: {} ...".format(PCA_components)
    pca = decomposition.PCA(n_components=PCA_components).fit(data)
    print "\nReducing data to {} components ...".format(PCA_components)
    data_reduced = pca.transform(data)
    return data_reduced

def gather_test(imgfile):
    #input is a file, and reads data from it. different from gather_train which gathers all at once
    with open(imgfile, "rb") as img:
        return np.fromfile(img, dtype=np.uint8,).reshape(-1, 784)/255.0

...

train_data, train_labels = gather_train()
train_data_reduced = get_PCA_train(train_data)
print train_data.ndim, train_data.shape
print train_data_reduced.ndim, train_data_reduced.shape

It prints the following, as expected:

2 (1000L, 784L)
2 (1000L, 300L)

But when I go on to reduce my test data:

test_data = gather_test(image_file)
# image_file is 784 bytes (28x28) of pixel values; 1 byte = 1 pixel value
test_data_reduced = get_PCA_test(test_data)
print test_data.ndim, test_data.shape
print test_data_reduced.ndim, test_data_reduced.shape

The output is:

2 (1L, 784L)
2 (1L, 1L)

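This later causes the error:

ValueError: X.shape[1] = 1 should be equal to 300, the number of features at training time

Why is the shape of test_data_reduced (1, 1) instead of (1, 300)? I tried using fit_transform for the training data and transform only for the test data, but I still get the same error.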

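【Comments】:

  • What does your data look like? Can you post a small sample? In any case, you are applying PCA incorrectly: you should call fit_transform on the training data and then only transform the test data. When you refit on the test data, you are effectively ignoring your training data. Also, please post more complete code: how do you define train_data and test_data?
  • @flyingmeatball is right; this happens because you are retraining the PCA model on the test data.
  • @flyingmeatball I added more code. The flow for train_data and test_data is similar, except that test_data is a single item.
  • I used fit_transform on train_data and transform on test_data, but I still get the same error.
  • But the line just above the transform call in get_PCA_test still fits on the test data. You need to fit on the training data.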
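
To make the (1, 1) shape concrete: PCA takes its principal axes from an SVD of the centered data, and a matrix with a single row yields at most one axis, so asking for 300 components cannot be honoured. The sketch below uses only NumPy and random stand-in data (not the asker's images) to mimic that step. It is an illustration of the mechanism, not scikit-learn's actual code. As far as I know, older scikit-learn releases truncated the component count silently, which matches the output in the question, while newer releases raise a ValueError in this situation.

```python
import numpy as np

# Stand-in for a single 28x28 test image (random values, assumption for illustration)
rng = np.random.RandomState(0)
single_image = rng.rand(1, 784)

# PCA centers the data with the mean of whatever it was fitted on;
# with one sample the centered matrix is all zeros.
centered = single_image - single_image.mean(axis=0)

# Thin SVD: vt has min(n_samples, n_features) rows, here just 1
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Asking for 300 axes still returns only the one row that exists
axes = vt[:300]
projected = centered.dot(axes.T)

print(vt.shape)         # (1, 784)
print(projected.shape)  # (1, 1)
```

Fitting on the 1000 training rows instead gives min(1000, 784) = 784 available axes, so 300 components can be kept.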

Tags: python scikit-learn pca


【Solution 1】:

The PCA calls should look roughly like this:

pca = decomposition.PCA(n_components=PCA_components).fit(train_data)
data_reduced = pca.transform(test_data)

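First call fit on the training data, then call transform on the test data you want to reduce. The same fitted pca object must be used for both steps.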
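
A self-contained sketch of that flow, restructured so the fitted object is passed around instead of being rebuilt inside each helper. The helper names fit_pca and reduce_with are hypothetical, the arrays are random stand-ins for the asker's images, and the code is written for Python 3 and a current scikit-learn rather than the Python 2 setup in the question.

```python
import numpy as np
from sklearn.decomposition import PCA

PCA_components = 300

def fit_pca(train_data):
    """Fit PCA once, on the training data only, and return the fitted object."""
    return PCA(n_components=PCA_components).fit(train_data)

def reduce_with(pca, data):
    """Project any data (train or test) with an already fitted PCA."""
    return pca.transform(data)

# Random stand-ins (assumption): 1000 training images and 1 test image, 784 pixels each
rng = np.random.RandomState(0)
train_data = rng.rand(1000, 784)
test_data = rng.rand(1, 784)

pca = fit_pca(train_data)
train_data_reduced = reduce_with(pca, train_data)
test_data_reduced = reduce_with(pca, test_data)

print(train_data_reduced.shape)  # (1000, 300)
print(test_data_reduced.shape)   # (1, 300)
```

Because the test image is projected onto axes learned from the training set, it ends up with the 300 features that the downstream classifier was trained on.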

【Comments】:
