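【Posted】:2017-02-13 12:45:48
【Question】:
I have an imbalanced dataset with roughly 10,000 entries in the positive class and roughly 800,000 entries in the negative class. I am trying a simple scikit-learn LogisticRegression model as a baseline, with class_weight='balanced' (hoping that this takes care of the imbalance?).
However, I am getting an accuracy score of 0.83 but a precision score of 0.03. What could be the issue? Does the imbalance need to be handled separately?
Here is my current code: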
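>>> import numpy as np
>>> from sklearn import metrics
>>> from sklearn.linear_model import LogisticRegression
>>> # scikit-learn >= 0.18; older releases expose this in sklearn.cross_validation
>>> from sklearn.model_selection import train_test_split
>>>
>>> train = []
>>> target = []
>>> len(posList)
10214
>>> len(negList)
831134
>>> for entry in posList:
...     train.append(entry)
...     target.append(1)
...
>>> for entry in negList:
...     train.append(entry)
...     target.append(-1)
...
>>> train = np.array(train)
>>> target = np.array(target)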
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.3, random_state=42)
>>>
>>> model = LogisticRegression(class_weight='balanced')
>>> model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
>>>
>>> predicted = model.predict(X_test)
>>>
>>> metrics.accuracy_score(y_test, predicted)
0.835596671213
>>>
>>> metrics.precision_score(y_test, predicted, average='weighted')
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py:976: DeprecationWarning: From version 0.18, binary input will not be handled specially when using averaged precision/recall/F-score. Please use average='binary' to report only the positive class performance.
'positive class performance.', DeprecationWarning)
0.033512518766
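One way to see how an accuracy near 0.83 and a precision near 0.03 can coexist is to work through the confusion-matrix arithmetic. The sketch below is illustrative only: the helper `scores_from_rates`, the test-set counts (about 30% of the data), the recall of 0.5 and the false positive rate of 0.165 are all assumed values, not numbers measured from the model above.

```python
def scores_from_rates(n_pos, n_neg, recall, false_positive_rate):
    """Return (accuracy, precision) implied by a recall and a false positive rate."""
    tp = recall * n_pos                 # positives correctly flagged
    fp = false_positive_rate * n_neg    # negatives wrongly flagged
    tn = n_neg - fp                     # negatives correctly left alone
    accuracy = (tp + tn) / float(n_pos + n_neg)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return accuracy, precision


# Hypothetical test split: roughly 30% of 10214 positives and 831134 negatives.
test_pos, test_neg = 3064, 249341

# Assumed rates, chosen only to illustrate the effect of the class ratio.
example_accuracy, example_precision = scores_from_rates(
    test_pos, test_neg, recall=0.5, false_positive_rate=0.165
)
print("accuracy=%.3f precision=%.3f" % (example_accuracy, example_precision))
```

Under these assumed rates the accuracy comes out near 0.83 while the precision stays below 0.04, because even a moderate false positive rate on about 250,000 negatives produces far more false positives than there are positives to find.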
【Comments】:
- "I am getting an accuracy score of 0.83, but a precision score of 0.03. What could be the issue": it may help to check what accuracy and precision scores you get when you randomly predict 10,000 positives and label the rest negative.
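The baseline suggested in the comment above can be worked out without fitting anything. The sketch below computes the expected scores analytically, using the class counts from the question; the helper name `random_baseline` is made up for illustration, and the counts are for the full dataset rather than the 30% test split (the class ratio should be about the same in either).

```python
def random_baseline(n_pos, n_neg, n_pred_pos):
    """Expected (accuracy, precision) when n_pred_pos rows are labelled
    positive uniformly at random and all other rows are labelled negative."""
    total = float(n_pos + n_neg)
    prevalence = n_pos / total      # fraction of rows that are truly positive
    tp = n_pred_pos * prevalence    # expected true positives among the guesses
    fp = n_pred_pos - tp            # remaining guesses are false positives
    tn = n_neg - fp                 # negatives that were not flagged
    precision = tp / n_pred_pos     # equals the prevalence
    accuracy = (tp + tn) / total
    return accuracy, precision


baseline_accuracy, baseline_precision = random_baseline(10214, 831134, 10000)
print("accuracy=%.3f precision=%.3f" % (baseline_accuracy, baseline_precision))
```

Random guessing gives a precision equal to the positive-class prevalence, about 0.012 here, together with an accuracy of about 0.976. So a precision of 0.03 is only a few times better than chance, and an accuracy of 0.83 is lower than what the random baseline achieves, which is why accuracy alone says little on data this imbalanced.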
Tags: python python-2.7 scikit-learn