【Posted】: 2015-09-12 18:10:54
【Question】:
I have the following dataset, which I classify with SVC (it has 5 labels). When I try to run it with class_weight='auto', like this:
X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
y)
svm_1 = SVC(kernel='linear', class_weight='auto')
svm_1.fit(X, y)
svm_1_prediction = svm_1.predict(X_test)
I get this exception:
Traceback (most recent call last):
File "test.py", line 62, in <module>
svm_1.fit(X, y)
File "/usr/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
y = self._validate_targets(y)
File "/usr/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 474, in _validate_targets
self.class_weight_ = compute_class_weight(self.class_weight, cls, y_)
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/class_weight.py", line 47, in compute_class_weight
raise ValueError("classes should have valid labels that are in y")
ValueError: classes should have valid labels that are in y
Then, following a previous question, I tried this:
svm_1 = SVC(kernel='linear', class_weight='auto')
svm_1.fit(X, y_encoded)
svm_1_prediction = le.inverse_transform(svm_1.predict(X))
The problem is that I then get this exception:
File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py", line 179, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py", line 74, in _check_targets
check_consistent_length(y_true, y_pred)
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 174, in check_consistent_length
"%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 858 2598]
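Can anyone help me understand what is wrong with the approaches above, and how to correctly use SVC's class_weight='auto' parameter to balance the data automatically?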
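One reading of the second traceback, offered as an assumption because the accuracy_score call is not shown: the two sizes it reports (858 and 2598) would fit predictions made on the full X being compared with y_test. The sketch below fits on the training split, predicts on the test split, and scores arrays of equal length. The data is synthetic and every variable name is illustrative. It also assumes a later scikit-learn release, where class_weight='balanced' replaced 'auto' and train_test_split lives in sklearn.model_selection.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Hypothetical stand-in data: 200 samples, 5 imbalanced classes stored as floats.
rng = np.random.RandomState(0)
y = np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], [10, 20, 30, 80, 60])
X = rng.rand(200, 10) + y[:, None]  # features loosely tied to the label

# Encode the labels as consecutive integers 0..4.
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Split features and encoded labels together so they stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, random_state=0, stratify=y_encoded
)

# Fit on the training split only, with class weights derived from y_train.
svm_1 = SVC(kernel='linear', class_weight='balanced')
svm_1.fit(X_train, y_train)

# Predict on the test split, so the prediction length matches y_test.
pred_encoded = svm_1.predict(X_test)
acc = accuracy_score(y_test, pred_encoded)

# Map predictions back to the original label values for reporting.
pred_labels = le.inverse_transform(pred_encoded)
print(len(y_test), len(pred_encoded), acc)
```

The point of the sketch is only that y_true and y_pred passed to accuracy_score come from the same split; the choice of kernel and weighting is carried over from the question.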
Update:
When I run print(y), the output is as follows:
0 5
1 4
2 5
3 4
4 4
5 5
6 4
7 4
8 3
9 5
10 4
11 4
12 1
13 4
14 4
15 5
16 4
17 4
18 5
19 5
20 4
21 4
22 5
23 5
24 3
25 3
26 4
27 5
28 4
29 4
..
2568 4
2569 4
2570 4
2571 3
2572 4
2573 5
2574 5
2575 5
2576 5
2577 3
2578 4
2579 4
2580 2
2581 4
2582 3
2583 4
2584 5
2585 4
2586 5
2587 4
2588 4
2589 3
2590 5
2591 5
2592 4
2593 4
2594 4
2595 2
2596 2
2597 5
Update:
Then I ran the following:
mask = np.array(test)
print y[np.arange(len(y))[~mask]]
This is the output:
0 5
1 4
2 5
3 4
4 4
5 5
6 4
7 4
8 3
9 5
10 4
11 4
12 1
13 4
14 4
15 5
16 4
17 4
18 5
19 5
20 4
21 4
22 5
23 5
24 3
25 3
26 4
27 5
28 4
29 4
..
2568 4
2569 4
2570 4
2571 3
2572 4
2573 5
2574 5
2575 5
2576 5
2577 3
2578 4
2579 4
2580 2
2581 4
2582 3
2583 4
2584 5
2585 4
2586 5
2587 4
2588 4
2589 3
2590 5
2591 5
2592 4
2593 4
2594 4
2595 2
2596 2
2597 5
Name: label, dtype: float64
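For context on what automatic balancing means for labels like these, here is a small sketch of the heuristic that current scikit-learn documents for class_weight='balanced': n_samples / (n_classes * count_of_class). Whether the 2015 'auto' mode computed exactly the same values is not established in this thread, so treat it as an illustration only. The helper name balanced_class_weights is hypothetical, and the input is the first 15 labels from the output above, written as ints.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# First 15 labels from the printed output, as ints (illustrative subset only).
labels = [5, 4, 5, 4, 4, 5, 4, 4, 3, 5, 4, 4, 1, 4, 4]
weights = balanced_class_weights(labels)

# Rare classes receive larger weights than frequent ones.
for cls in sorted(weights):
    print(cls, weights[cls])
```

The weights are inversely proportional to class frequency, so the most common label in the subset (4) gets the smallest weight.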
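【Comments】:
- Could you show a sample of your y labels array? What is its type? If all the data types are as expected, auto should work.
- Just print(y) and copy the result into the post.
- Can you try this to check whether every element in y is an int? test = [type(element) is int for element in y], then print(all(test))
- @ml_guy I modified the line in class_weight.py to suit my needs, and it seems to work well.
- Try this: mask = np.array(test), then y[np.arange(len(y))[~mask]]. What is the result? The code tries to select the elements that are not int.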
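A note on the check suggested in the comments: for a NumPy array or pandas Series, type(element) is int is False for every element whenever the dtype is float64, which matches the output posted in the question (every row is selected, dtype: float64). Inspecting the dtype directly is more informative. The sketch below uses a short hypothetical array and shows one way to cast whole-number floats to integers. Whether float labels actually caused the first ValueError is not confirmed anywhere in this thread.

```python
import numpy as np

# Hypothetical sample of labels stored as float64, as in the question's output.
y = np.array([5.0, 4.0, 5.0, 4.0, 3.0, 1.0, 2.0])

# The element-wise check from the comments: NumPy scalars are never Python int.
is_python_int = [type(element) is int for element in y]
print(all(is_python_int))

# Checking the dtype answers the same question in one step.
print(y.dtype)

# Cast only if every value is a whole number, so no information is lost.
if np.all(np.mod(y, 1) == 0):
    y_int = y.astype(int)
else:
    raise ValueError("labels contain non-integer values")

print(y_int.dtype, y_int)
```

With a pandas Series the same idea applies through df['label'].astype(int), provided the column has no missing values.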
Tags: python numpy machine-learning scikit-learn svm