【Posted】: 2020-08-02 18:49:14
【Question】:
For a machine learning project, I built a Pandas DataFrame to use as input to scikit-learn:
label vector
0 0 1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
1 1 1:0.014463682 2:-0.00076486735 3:0.044999316 ...
2 1 1:0.010583069 2:-0.0072133583 3:0.03766079 4:...
3 0 1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
4 1 1:0.039645035 2:-0.039485127 3:0.0898234 4:-0...
.. ... ...
95 0 1:-0.013014212 2:-0.008092734 3:0.050860845 4...
96 0 1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
97 0 1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
98 0 1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
99 0 1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
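where label is the label of each record in the dataset and vector holds the feature vector of each record.
To pass the DataFrame to scikit-learn, I create two separate arrays: one for the label column (y) and one for the vector column (X).
To create the X array, following the suggestion here, I am doing: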
X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
Everything works, and I get this output:
1 2 3 ... 298 299 300
0 0.02776011 -0.009072121 0.05915284 ... 0.00035095372 -0.01569933 -0.010564591
1 0.014463682 -0.00076486735 0.044999316 ... -0.008144852 -0.0066369134 -0.013060478
2 0.010583069 -0.0072133583 0.03766079 ... 0.0041615684 0.008569179 -0.008645372
3 0.02776011 -0.009072121 0.05915284 ... 0.00035095372 -0.01569933 -0.010564591
4 0.039645035 -0.039485127 0.0898234 ... 0.0046293125 0.01663368 0.010215017
.. ... ... ... ... ... ... ...
95 -0.013014212 -0.008092734 0.050860845 ... 0.0021799654 -0.011884902 0.016460473
96 -0.038887568 -0.007960074 0.03387617 ... 0.0057248613 0.026993237 0.025746094
97 -0.038887568 -0.007960074 0.03387617 ... 0.0057248613 0.026993237 0.025746094
98 -0.038887568 -0.007960074 0.03387617 ... 0.0057248613 0.026993237 0.025746094
99 -0.038887568 -0.007960074 0.03387617 ... 0.0057248613 0.026993237 0.025746094
[100 rows x 300 columns]
where the 100 rows are my records and the 300 columns are the vector features.
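The parsing step above can be sketched on a tiny made-up sample, assuming each `vector` string holds space-separated `index:value` pairs as shown in the question. The sample rows and the names `sample` and `parsed` are hypothetical. The result of `astype(float)` is assigned here, since the call returns a new object rather than converting the frame in place.

```python
import pandas as pd

# Hypothetical two-row sample in the same "index:value" layout as the question.
sample = pd.DataFrame({
    "label": ["0", "1"],
    "vector": ["1:0.5 2:-0.25 3:0.125", "1:1.5 2:2.0 3:-3.0"],
})

# Each row becomes a dict {feature index: value string};
# pandas aligns the dict keys into columns.
parsed = pd.DataFrame(
    [dict(pair.split(":") for pair in row.split()) for row in sample["vector"]]
)

# astype returns a new frame, so keep the result to actually get floats.
X = parsed.astype(float).to_numpy()

print(X.shape)
print(X.dtype)
```

Without the assignment, `X` would still hold strings when it is passed on to later steps.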
To create the y array, following the suggestion here, I am doing this:
y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)
The output is:
[100 rows x 2 columns]
[[0]
[1]
[1]
[0]
[1]
[1]
[1]
[1]
[0]
[0]
[0]
[0]
[0]
[0]
[1]
[1]
[...]
]
My NumPy array contains 100 records, but the output reports 2 columns instead of 1.
I think this problem is what causes the following error. Is that right?
/Users/mac-pro/scikit_learn/lib/python3.7/site-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
If so, how can I get output similar to what I get for the X array?
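A minimal sketch of the shape difference behind the warning, assuming the labels are integer-like strings in a `label` column as in the question. The three-row frame and the names `y_column` and `y_flat` are made up for illustration. Note also that in the full code `print(df)` runs before `print(y)`, so the `[100 rows x 2 columns]` line is most likely the footer of the DataFrame printout rather than part of `y`.

```python
import pandas as pd

# Hypothetical sample standing in for the real data.
df = pd.DataFrame({"label": ["0", "1", "1"]})

# The approach from the question: wrapping the Series in a list yields a
# 1 x n frame, and reshape(-1, 1) turns it into an (n, 1) column vector.
y_column = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)

# Converting the Series directly gives the 1-D shape (n_samples,)
# that the warning message asks for.
y = df["label"].astype(int).to_numpy()

# ravel() flattens an existing column vector into the same 1-D array.
y_flat = y_column.ravel()

print(y_column.shape)
print(y.shape)
```

Either `y` or `y_flat` has the 1-D shape named in the warning text; the column vector `y_column` is the shape that triggers it.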
In case it helps, here is the full code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from sklearn.model_selection._validation import cross_val_score
from sklearn.model_selection import KFold
from scipy.stats import sem
r_filenameTSV = 'TSV/A19784_test3886.tsv'
tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(' ', n=1).tolist(),
                  columns=['label', 'vector'])
print(df)
y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)
#exit()
X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
#exit()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)
clf = svm.SVC(kernel='rbf',
              C=100,
              gamma=0.001,
              )
scores = cross_val_score(clf, X, y, cv=10)
print ("K-Folds scores:")
print (scores)
#Train the model using the training sets
clf.fit (X_train, y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
print ("Metrics and Scoring:")
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1:",metrics.f1_score(y_test, y_pred))
print ("Classification Report:")
print (metrics.classification_report(y_test, y_pred,labels=[0,1]))
Thanks again for your time.
【Comments】:
Tags: python arrays pandas numpy scikit-learn