sklearn SVC throwing "reshape error" upon execution

【Question Title】: sklearn SVC throwing "reshape error" upon execution
【Posted】: 2019-04-01 15:34:16
【Question】:

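Context

I am trying to apply the method from this cross-validation article to my own data (imported from a CSV; no missing values, everything interpolated, some zeros, values ranging from negative to positive but mostly positive). Because the data is offset with shift, the initial data lacks usable first and last rows, but that is handled by the [1:,][:-1] slicing in the train_test_split call.

Whichever way I try to apply the code to my own data, it raises an error. I can use train_test_split to split my data for most other functions, so I suspect the error is related to how the data is structured?

Link to the CSV

It is read in as
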

input_file = "parsed.csv"

df = pd.read_csv(input_file, header = 0)


x = df.loc[0:,[
...
]]

...

model_testing = sm.OLS(model_training.predict(X_test),y_test,missing='drop').fit()

I initially tried:

clf = svm.SVC(kernel='linear', C=1).fit(X_train,y_train)

which raises the error

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py:578: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-435-eae5045a136b> in <module>

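So I looked at the original code at the URL and saw that the array shapes there differ from mine. I then tried to "flatten" my arrays (to 1-D), but got a different error.

Code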

X_train, X_test, y_train, y_test = train_test_split(set1.loc[1:,][:-1], yFutureYield.loc[1:,][:-1], test_size=0.25)

model_testing = sm.OLS(model_training.predict(X_test),y_test,missing='drop').fit()

print(X_train.shape)
print(y_train.shape)
print(X_train.head(5))
print(y_train.head(5))

clf = svm.SVC(kernel='linear', C=1).fit(np.array(X_train).flatten(), 
np.array(y_train).flatten())

which produces

(285, 47)
(285, 1)
     CSUSHPISA  CUUR0000SETB01  DCOILBRENTEU  RECPROUSM156N  CPIHOSNS  \
149     96.004          98.600     15.863182           0.02   164.100   
272    148.031         220.542     67.646190           0.34   217.178   
171    111.653         132.800     25.657143          32.02   175.400   
187    123.831         120.900     26.651364           0.52   181.700   
309    143.607         322.934    111.710870           0.02   223.708   

     CPALTT01USM661S  PAYNSA  CUUR0000SEHA  CPIAUCSL  LNS12300060     ...      \
149        70.037170  130150       177.100   166.000         81.4     ...       
272        91.074058  130589       248.965   215.861         75.1     ...       
171        74.425041  132349       190.200   176.400         80.9     ...       
187        76.154875  130356       200.200   180.500         79.3     ...       
309        97.730543  135649       262.707   231.638         76.1     ...       

     CUUR0000SEHA      CPIAUCSL  LNS12300060      GS5  CUUR0000SETA01  \
149  31293.570000  27556.000000      6625.96  31.6064    20363.250000   
272  61999.504985  46506.173145      5677.56   6.0909    18043.950080   
171  36061.920000  31064.040000      6577.17  22.0864    20377.560000   
187  39999.960000  32490.000000      6272.63  12.5349    19154.470000   
309  68677.126647  53511.852570      5783.60   0.4757    20697.980975   

         CPILFESL      CPILFENS  PCECTPICTM  CSUSHPINSA  CSUSHPINSA  
149  31169.900000  31187.560000      4.2025      96.393   -0.010515  
272  48271.560320  48341.204652      4.2025     149.631    0.006909  
171  34187.970000  34391.680000      4.2025     111.248   -0.007727  
187  36404.550000  36347.300000      4.2025     124.729   -0.008424  
309  53287.764816  53373.875280      4.2025     143.977    0.002688  

[5 rows x 47 columns]
     CSUSHPINSA
149    0.008579
272   -0.006950
171    0.008584
187    0.006125
309   -0.000042
---------------------------------------------------------------------------

Error

ValueError                                Traceback (most recent call last)
<ipython-input-433-9ef18f8c2bef> in <module>
     90 #np.array(y_train).flatten()
     91 #dir(model_training)
---> 92 clf = svm.SVC(kernel='linear', C=1).fit(np.array(X_train).flatten(), np.array(y_train).flatten())
     93 
     94 #deltas

/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py in fit(self, X, y, sample_weight)
    147         self._sparse = sparse and not callable(self.kernel)
    148 
--> 149         X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
    150         y = self._validate_targets(y)
    151 

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    571     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    572                     ensure_2d, allow_nd, ensure_min_samples,
--> 573                     ensure_min_features, warn_on_dtype, estimator)
    574     if multi_output:
    575         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    439                     "Reshape your data either using array.reshape(-1, 1) if "
    440                     "your data has a single feature or array.reshape(1, -1) "
--> 441                     "if it contains a single sample.".format(array))
    442             array = np.atleast_2d(array)
    443             # To ensure that array flags are maintained

ValueError: Expected 2D array, got 1D array instead:
array=[  9.60040000e+01   9.86000000e+01   1.58631818e+01 ...,   4.20250000e+00
   1.56299000e+02  -7.67852077e-03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
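The shapes printed above suggest why flattening changes the error rather than fixing it. The following is a minimal sketch, assuming synthetic random arrays with the same shapes as in the question (the values are made up, not taken from parsed.csv): flattening y gives the 1-D target that scikit-learn asks for, but flattening X merges the sample and feature axes into one, which is what the "Expected 2D array" message complains about.

```python
import numpy as np

# Synthetic stand-ins with the shapes printed in the question (values are made up).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(285, 47))  # 285 samples, 47 features
y_train = rng.normal(size=(285, 1))   # column vector, as in the question

X_flat = X_train.flatten()  # collapses samples and features into a single axis
y_flat = y_train.flatten()  # turns the column vector into a 1-D array

print(X_flat.shape)  # (13395,)  no longer (n_samples, n_features)
print(y_flat.shape)  # (285,)    the (n_samples,) shape the warning asks for
```

Only y needs to be flattened; X should stay two-dimensional.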

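【Comments】:

Can you show what X_train and y_train look like so that we can reproduce the error?
What do you need? I thought the head output would be enough. The type output? A link to the original CSV?
Ideally the DataFrame constructor would help; otherwise it is hard to get from head() back to a DataFrame.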
df = pd.read_csv(input_file, header = 0)

Tags: scikit-learn svm sklearn-pandas

【Solution 1】:

Use:

clf = svm.SVR(kernel='linear', C=1).fit(X_train,y_train.values.flatten())
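As a minimal, self-contained sketch of that call (assuming a small synthetic dataset; the column names a, b, c and target are hypothetical and not from the question's CSV), the target is stored as a one-column DataFrame, converted to a 1-D array, and passed to SVR while X stays two-dimensional:

```python
import numpy as np
import pandas as pd
from sklearn import svm

# Hypothetical synthetic data standing in for the question's X_train / y_train.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
# Continuous target kept as a one-column DataFrame, shape (60, 1).
y_train = pd.DataFrame(
    {"target": 0.5 * X_train["a"] + rng.normal(scale=0.1, size=60)}
)

# .values gives a (60, 1) NumPy array; ravel() makes it (60,).
y_1d = y_train.values.ravel()

# X stays 2-D (n_samples, n_features); only y is flattened.
reg = svm.SVR(kernel="linear", C=1).fit(X_train, y_1d)
pred = reg.predict(X_train)

print(y_1d.shape, pred.shape)
```

Here ravel() and flatten() give the same shape; flatten() always copies, while ravel() returns a view when it can.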

【Discussion】:

... 88 #pd.DataFrame.from_records(deltas) 89 type(deltas) ---> 90 y_train.reshape(len(y_train),1) /opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name) 4374 if self._info_axis._can_hold_identifiers_and_holds_name(name): 4375 return self[name] -> 4376 return object.__getattribute__(self, name) 4377 4378 def __setattr__(self, name, value): AttributeError: 'DataFrame' object has no attribute 'reshape'
Use DataFrame.values.reshape() if it is a DataFrame.
clf = svm.SVC(kernel='linear', C=1).fit(X_train,y_train.values.reshape(len(y_train),1)) results in /opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py:578: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True) --------------------------------------------------------------------------- ValueError (google drive url with parsed.csv)
How are set1 and yFutureYield constructed?
OK, I just ran all of your code. When running the simplified version clf = svm.SVC(kernel='linear', C=1).fit(X_train,y_train), I get a different error, one related to the classifier: it complains that SVC cannot handle a continuous target. That makes sense, because SVC is a classifier and the output variable is continuous. I swapped it for SVR, which is a regressor, and it worked.
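The behaviour described in the last comment can be reproduced with a short sketch, assuming a synthetic continuous target rather than the question's data. The exact wording of the exception differs between scikit-learn versions, so the code only relies on it being a ValueError:

```python
import numpy as np
from sklearn import svm

# Synthetic data with a continuous (real-valued) target.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=40)

# A classifier expects discrete class labels, so fitting it on a
# continuous target is rejected.
svc_failed = False
try:
    svm.SVC(kernel="linear", C=1).fit(X, y)
except ValueError as exc:
    svc_failed = True
    print("SVC rejected the target:", exc)

# The regressor accepts the same X and y.
reg = svm.SVR(kernel="linear", C=1).fit(X, y)
print(reg.predict(X[:3]).shape)
```

If the goal really is classification, the continuous target would first have to be converted into discrete labels; otherwise a regressor such as SVR is the matching estimator.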