【Question title】: How to loop through multiple sklearn classification models?
【Posted】: 2020-05-03 17:17:21
【Question】:

I am trying to figure out how to feed my dataset into several scikit-learn classification models.

When I run the code, I get the following error:

Traceback (most recent call last):

  File "<ipython-input-515-9a3302837c99>", line 3, in <module>
    X, y = dataset

ValueError: too many values to unpack (expected 2)

Here is my code.

X = np.asarray([np.asarray(df['LRMScore']),np.asarray(df['Spread'])]).T


import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn import cluster, datasets
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler

np.random.seed(0)


colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)

clustering_names = [
    'MiniBatchKMeans', 'AffinityPropagation', 'MeanShift',
    'SpectralClustering', 'Ward', 'AgglomerativeClustering',
    'DBSCAN', 'Birch']

plt.figure(figsize=(len(clustering_names) * 2 + 3, 9.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)

plot_num = 1

datasets = [X]
for i_dataset, dataset in enumerate(datasets):
    X, y = dataset
    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)

    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(X, quantile=0.3)

    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)

    # create clustering estimators
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    two_means = cluster.MiniBatchKMeans(n_clusters=2)
    ward = cluster.AgglomerativeClustering(n_clusters=2, linkage='ward',
                                           connectivity=connectivity)
    spectral = cluster.SpectralClustering(n_clusters=2,
                                          eigen_solver='arpack',
                                          affinity="nearest_neighbors")
    dbscan = cluster.DBSCAN(eps=.2)
    affinity_propagation = cluster.AffinityPropagation(damping=.9,
                                                       preference=-200)

    average_linkage = cluster.AgglomerativeClustering(
        linkage="average", affinity="cityblock", n_clusters=2,
        connectivity=connectivity)

    birch = cluster.Birch(n_clusters=2)
    clustering_algorithms = [
        two_means, affinity_propagation, ms, spectral, ward, average_linkage,
        dbscan, birch]

    for name, algorithm in zip(clustering_names, clustering_algorithms):
        # predict cluster memberships
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        if hasattr(algorithm, 'labels_'):
            y_pred = algorithm.labels_.astype(np.int)
        else:
            y_pred = algorithm.predict(X)

        # plot
        plt.subplot(4, len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)
        plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)

        if hasattr(algorithm, 'cluster_centers_'):
            centers = algorithm.cluster_centers_
            center_colors = colors[:len(centers)]
            plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
        plt.xlim(-2, 2)
        plt.ylim(-2, 2)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1

plt.show()

My X variable is built from two columns of the data frame and looks like this.

array([[ 8.  ,  0.06],
       [ 8.  ,  0.06],
       [ 8.  ,  0.06],
       ...,
       [10.  ,  0.01],
       [ 8.  ,  0.03],
       [ 9.75,  0.06]])

The datasets in the original example each consist of two arrays, X and y:

noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None

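My dataset consists of a single array, and that is the problem. I think my setup has to be slightly different, but I am not sure what it should look like.

I got the code from the link below.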

https://scikit-learn.org/0.18/auto_examples/cluster/plot_cluster_comparison.html

【Comments】:

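  • You have described what you want to do, but what is your actual question? Can you provide a minimal reproducible example of the problem you need help with?
  • I just updated my post. When I run the code, I get the following error: ValueError: too many values to unpack (expected 2)
  • That is more helpful, but please edit your post to include the full error traceback, not just the last line, because it tells you (and us) where the error occurs.
  • Put print('dataset: ', dataset) right before X, y = dataset to see what the for loop's dataset actually contains. It may still be wrapped in something. Also print your array to see how it is enclosed in [ and ].
  • Basically, the error tells you that you are trying to unpack something with more than 2 values into two variables. Whatever dataset is, it has the wrong size, shape, or length to be unpacked automatically into X and y. You may need to index into the array and select the columns you want, or something similar.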
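
The unpacking behaviour described in the comments above can be reproduced with a small array. This is only a minimal sketch: the values below are made up and stand in for the asker's X. Unpacking an array iterates over its rows, so an array with more than two rows cannot be split into two names.

```python
import numpy as np

# Hypothetical stand-in for the asker's X: 5 rows, 2 columns.
X = np.array([[8.0, 0.06],
              [8.0, 0.06],
              [10.0, 0.01],
              [8.0, 0.03],
              [9.75, 0.06]])

# Unpacking iterates over the first axis (rows), so a 5-row array
# yields 5 values where only 2 target names are available.
try:
    a, b = X
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(X.shape)
print(error_message)
```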

Tags: python python-3.x machine-learning scikit-learn


【Answer 1】:

Since your X array has two columns, you need to transpose it before unpacking the values:

x, y = dataset.T
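
As a rough illustration of what the transpose does (the array values are invented for this example): `dataset.T` has shape (2, n), so unpacking yields one 1-D array per column. Note that this separates the two feature columns from each other; it does not produce a feature matrix plus labels, which is what the loop in the question expects.

```python
import numpy as np

# Hypothetical 3-row, 2-column array standing in for `dataset`.
dataset = np.array([[8.0, 0.06],
                    [10.0, 0.01],
                    [9.75, 0.06]])

# After transposing, the first axis has length 2, so unpacking works.
x, y = dataset.T

print(dataset.T.shape)   # two rows, one per original column
print(x.shape, y.shape)  # each result is a 1-D array of length 3
```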

【Comments】:

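  • @guest, I just tried that. Now I get this: Traceback (most recent call last): File "", line 1, in <module> for i_dataset, dataset in enumerate(datasets): TypeError: 'module' object is not iterable
  • I guess you need to debug your code further before moving on.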
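
One likely reading of this TypeError is that the name `datasets` still referred to the imported `sklearn.datasets` module when the loop ran, not to a list. A minimal sketch of one way around both errors, assuming the data should keep the `(X, y)` shape used by the linked example: store the array under a different name (here the hypothetical `my_datasets`) and pair it with `None`, as the example's `no_structure` entry does.

```python
import numpy as np

# Hypothetical stand-in for the asker's two-column X.
X = np.array([[8.0, 0.06],
              [10.0, 0.01],
              [8.0, 0.03],
              [9.75, 0.06]])

# A name other than `datasets` avoids clashing with `from sklearn import datasets`.
# Each entry is an (X, y) pair; y is None because clustering needs no labels.
my_datasets = [(X, None)]

shapes = []
for i_dataset, dataset in enumerate(my_datasets):
    X_i, y_i = dataset  # unpacks the 2-tuple, not the rows of the array
    shapes.append((i_dataset, X_i.shape, y_i))

print(shapes)
```
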

【Answer 2】:

That did it! Thanks, 帕萨. Here is my final working solution.

import time

import numpy as np
import pandas as pd  # required by pd.set_option and pd.read_csv below
import matplotlib.pyplot as plt

from sklearn import cluster, datasets
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

np.random.seed(0)

pd.set_option('display.max_columns', 500)
df = pd.read_csv('C:\\your_path_here\\test.csv')
print('done!')

df = df[:10000]
df = df.fillna(0)
df = df.dropna()


X = df[['RatingScore', 
            'Par', 
            'Term', 
            'TimeToMaturity', 
            'LRMScore', 
            'Coupon', 
            'Price']]
#select your target variable
y = df[['Spread']]
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)

clustering_names = [
    'MiniBatchKMeans', 'AffinityPropagation', 'MeanShift',
    'SpectralClustering', 'Ward', 'AgglomerativeClustering',
    'DBSCAN', 'Birch']

plt.figure(figsize=(len(clustering_names) * 2 + 3, 9.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)

plot_num = 1



# normalize dataset for easier parameter selection
X = StandardScaler().fit_transform(X)

# estimate bandwidth for mean shift
bandwidth = cluster.estimate_bandwidth(X, quantile=0.3)

# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)

# create clustering estimators
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
two_means = cluster.MiniBatchKMeans(n_clusters=2)
ward = cluster.AgglomerativeClustering(n_clusters=2, linkage='ward',
                                       connectivity=connectivity)
spectral = cluster.SpectralClustering(n_clusters=2,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=.2)
affinity_propagation = cluster.AffinityPropagation(damping=.9,
                                                   preference=-200)

# note: scikit-learn 1.2 and later rename `affinity` to `metric` for this estimator
average_linkage = cluster.AgglomerativeClustering(
    linkage="average", affinity="cityblock", n_clusters=2,
    connectivity=connectivity)

birch = cluster.Birch(n_clusters=2)
clustering_algorithms = [
    two_means, affinity_propagation, ms, spectral, ward, average_linkage,
    dbscan, birch]

for name, algorithm in zip(clustering_names, clustering_algorithms):
    # predict cluster memberships
    t0 = time.time()
    algorithm.fit(X)
    t1 = time.time()
    if hasattr(algorithm, 'labels_'):
        y_pred = algorithm.labels_.astype(int)  # np.int was removed from recent NumPy
    else:
        y_pred = algorithm.predict(X)

    # plot
    plt.subplot(4, len(clustering_algorithms), plot_num)
    # only one dataset is plotted here, so every subplot gets a title
    plt.title(name, size=18)
    plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)

    if hasattr(algorithm, 'cluster_centers_'):
        centers = algorithm.cluster_centers_
        center_colors = colors[:len(centers)]
        plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
    plt.xlim(-2, 2)
    plt.ylim(-2, 2)
    plt.xticks(())
    plt.yticks(())
    plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
             transform=plt.gca().transAxes, size=15,
             horizontalalignment='right')
    plot_num += 1

plt.show()
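
For readers mainly interested in the looping pattern itself, here is a smaller sketch that is not part of the original answer. It assumes synthetic data from `make_blobs` in place of the CSV file, keeps the estimators in a dict keyed by name, and relies on `fit_predict`, which scikit-learn clustering estimators provide. The parameter values are illustrative only.

```python
import time

from sklearn import cluster
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with your own feature matrix.
X, _ = make_blobs(n_samples=300, centers=2, random_state=8)
X = StandardScaler().fit_transform(X)

# Name -> estimator mapping keeps names and models in sync without zip().
estimators = {
    "KMeans": cluster.KMeans(n_clusters=2, n_init=10, random_state=0),
    "AgglomerativeClustering": cluster.AgglomerativeClustering(n_clusters=2),
    "DBSCAN": cluster.DBSCAN(eps=0.3),
    "Birch": cluster.Birch(n_clusters=2),
}

results = {}
for name, estimator in estimators.items():
    t0 = time.time()
    labels = estimator.fit_predict(X)  # fit and return one label per sample
    elapsed = time.time() - t0
    results[name] = labels
    # DBSCAN marks noise points with -1; exclude that from the cluster count.
    n_found = len(set(labels) - {-1})
    print(f"{name}: {n_found} cluster(s) in {elapsed:.2f}s")
```

The labels collected in `results` can then be passed to the plotting code above in place of `y_pred`.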

【Comments】:
