How to use NumPy array in Scikit-learn

【Question title】: How to use NumPy array in Scikit-learn
【Posted】: 2020-08-02 18:49:14
【Question description】:

For a machine learning project, I built a Pandas DataFrame to use as input to Scikit-learn:

  label                                             vector
0      0   1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
1      1   1:0.014463682 2:-0.00076486735 3:0.044999316 ...
2      1   1:0.010583069 2:-0.0072133583 3:0.03766079 4:...
3      0   1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
4      1   1:0.039645035 2:-0.039485127 3:0.0898234 4:-0...
..   ...                                                ...
95     0   1:-0.013014212 2:-0.008092734 3:0.050860845 4...
96     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
97     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
98     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
99     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...

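Here label is the label of each record in the dataset, and vector holds the feature vector of each record.

To pass the DataFrame to Scikit-learn, I create two separate arrays: one for the label column (y) and one for the vector column (X).

Following the suggestion here, I create the X array like this: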

X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)

Everything works, and I get this output:

               1               2            3  ...            298            299           300
0     0.02776011    -0.009072121   0.05915284  ...  0.00035095372    -0.01569933  -0.010564591
1    0.014463682  -0.00076486735  0.044999316  ...   -0.008144852  -0.0066369134  -0.013060478
2    0.010583069   -0.0072133583   0.03766079  ...   0.0041615684    0.008569179  -0.008645372
3     0.02776011    -0.009072121   0.05915284  ...  0.00035095372    -0.01569933  -0.010564591
4    0.039645035    -0.039485127    0.0898234  ...   0.0046293125     0.01663368   0.010215017
..           ...             ...          ...  ...            ...            ...           ...
95  -0.013014212    -0.008092734  0.050860845  ...   0.0021799654   -0.011884902   0.016460473
96  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
97  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
98  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
99  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094

[100 rows x 300 columns]

where the 100 rows are my records and the 300 columns are the vector features.

To create the y array, following the suggestion here, I do this:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)

The output is:

[100 rows x 2 columns]
[[0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [...]
]

My NumPy array contains 100 records, but the output reports 2 columns instead of 1.

I think this is what causes the following warning. Is that right?

/Users/mac-pro/scikit_learn/lib/python3.7/site-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

If so, how can I get output shaped like what I get for the X array?

In case it helps, here is the full code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from scipy.stats import sem


r_filenameTSV = 'TSV/A19784_test3886.tsv'

tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])

df = pd.DataFrame(tsv_read)

df = pd.DataFrame(df.vector.str.split(' ', n=1).tolist(),
                  columns=['label', 'vector'])

print(df)


y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)
#exit()

X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
#exit()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)


clf = svm.SVC(kernel='rbf',
              C=100,
              gamma=0.001,
              )
scores = cross_val_score(clf, X, y, cv=10)

print ("K-Folds scores:")
print (scores) 

#Train the model using the training sets
clf.fit (X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

print ("Metrics and Scoring:")
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1:",metrics.f1_score(y_test, y_pred))

print ("Classification Report:")
print (metrics.classification_report(y_test, y_pred,labels=[0,1]))

Thanks again for your time.

【Question comments】:

Tags: python arrays pandas numpy scikit-learn

【Solution 1】:

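As the warning says, you just need to change the shape of your y array.

"A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel()."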
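To make the two shapes named in the warning concrete, here is a minimal sketch using NumPy only. The `labels` list is made up for illustration and stands in for the question's 100 records. Note also that the `[100 rows x 2 columns]` line in the question's output appears to come from the earlier `print(df)` (the frame has a label column and a vector column), not from `print(y)`, so the y array itself has one column.

```python
import numpy as np

# Hypothetical labels standing in for the question's 100 records.
labels = [0, 1, 1, 0, 1]

# Column vector, shape (n_samples, 1): this is what triggers the warning.
y_column = np.array(labels).reshape(-1, 1)

# 1-D array, shape (n_samples,): this is what scikit-learn expects for y.
y_flat = y_column.ravel()

print(y_column.shape)  # (5, 1)
print(y_flat.shape)    # (5,)
```

The values are unchanged by `ravel()`; only the shape differs.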

So you have 2 options; either of the following lines of code fixes the problem.

Option 1:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1,1).ravel()
print(y.shape)
# Output (from a test frame with 8 rows; the question's 100 records would give (100,))
(8,)

Option 2:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1,)
print(y.shape)
# Output (from a test frame with 8 rows; the question's 100 records would give (100,))
(8,)
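As a quick check that both options give the same result, the sketch below runs them on a small made-up frame. The four-row `df` is hypothetical; its labels are strings because the question builds the label column with `str.split`. It also shows why the reshaping is needed in the first place: `pd.DataFrame([df.label])` puts all labels in a single row.

```python
import pandas as pd

# Hypothetical stand-in for the question's df (labels are strings after str.split).
df = pd.DataFrame({"label": ["0", "1", "1", "0"]})

# Wrapping the Series in a list yields one row with one column per record.
wide = pd.DataFrame([df.label])
print(wide.shape)  # (1, 4)

y1 = wide.astype(int).to_numpy().reshape(-1, 1).ravel()  # option 1
y2 = wide.astype(int).to_numpy().reshape(-1)             # option 2

print(y1.shape, y2.shape)  # (4,) (4,)
print((y1 == y2).all())    # True
```

In option 1 the intermediate `reshape(-1, 1)` is undone by `ravel()`, so calling `ravel()` directly on the `to_numpy()` result would give the same array.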

Hope this helps!

【Comments】:

df.label.to_numpy() should produce the desired 1-D array.
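A minimal end-to-end sketch of that suggestion follows. The four-row `df` and its two-feature vectors are made up for illustration. Two details go beyond the comment and are assumptions about what the asker wants: the labels are cast with `astype(int)` (as the question's own code does, since `str.split` leaves them as strings), and the result of `astype(float)` is assigned to `X`, whereas the question's code only prints it and leaves `X` holding strings.

```python
import warnings

import pandas as pd
from sklearn import svm
from sklearn.exceptions import DataConversionWarning

# Hypothetical miniature of the question's df: string labels, "index:value" features.
df = pd.DataFrame({
    "label": ["0", "1", "1", "0"],
    "vector": ["1:0.1 2:-0.2", "1:0.9 2:0.8", "1:0.7 2:0.6", "1:-0.1 2:-0.3"],
})

# Labels: cast the string column to int and take the 1-D array, shape (n_samples,).
y = df["label"].astype(int).to_numpy()

# Features: parse each "index:value" pair and keep the float conversion.
X = pd.DataFrame(
    [dict(pair.split(":") for pair in row.split()) for row in df["vector"]]
).astype(float).to_numpy()

print(X.shape, y.shape)  # (4, 2) (4,)

# Promote the shape warning to an error so a column-vector y would fail loudly.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=DataConversionWarning)
    clf = svm.SVC(kernel="rbf", C=100, gamma=0.001).fit(X, y)

print(clf.predict(X).shape)  # (4,)
```

With `y` built this way the fit completes without the `DataConversionWarning`, and the same `X` and `y` can be passed to `train_test_split` and `cross_val_score` as in the question.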