【Question Title】: How to improve performance for imbalanced dataset using SVM
【Posted】: 2020-09-16 02:44:03
【Question】:

I'm trying to classify data at the token level using scikit-learn. I already have a train/test split. The data is in the following tab-separated format:

-----------------
token       label
-----------------
way          6
to           6
reduce       6
the          6
amount       6
of           6
traffic      6
   ....
public       2
transport    5
is           5
a            5
key          5
factor       5
to           5 
minimize     5
   ....

The class distribution is as follows:

                              Training Data                    Test Data
# Total:                        119490                          29699
# Class 0:                      52631                           13490
# Class 1:                      35116                           8625
# Class 2:                      17968                           4161
# Class 3:                      8658                            2088
# Class 4:                      3002                            800
# Class 5:                      1201                            302
# Class 6:                      592                             153

I'm trying an SVM, but the F1-score is very poor.

The code is:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold

if __name__ == '__main__':
    # reading Files
    train_df = pd.read_csv(TRAINING_DATA_PATH, names=['token', 'label'], sep='\t')
    test_df = pd.read_csv(TEST_DATA_PATH, names=['token', 'label'], sep='\t')

    # getting training and testing data
    train_X = train_df['token'].astype('U')
    test_X = test_df['token'].astype('U')
    train_y = train_df['label']
    test_y = test_df['label']

    # Linear SVM
    sgd = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf',   SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=100, tol=None))
                   ])
    f1_list = []
    acc_list = []
    cv = KFold(n_splits=5)
    for train_index, test_index in cv.split(train_X):
        X_train, X_val = train_X.iloc[train_index], train_X.iloc[test_index]
        y_train, y_val = train_y.iloc[train_index], train_y.iloc[test_index]
        sgd.fit(X_train, y_train)
        predicted = sgd.predict(X_val)
        f1 = f1_score(y_val, predicted, average='macro')
        acc = accuracy_score(y_val, predicted)
        f1_list.append(f1)
        acc_list.append(acc)
    print(f1_list)
    print(acc_list)
    sgd_pred = sgd.predict(test_X)
    print('SVM accuracy: %s' % accuracy_score(test_y, sgd_pred))
    print('SVM F1-macro: %s' % f1_score(test_y, sgd_pred, average='macro'))
    print('SVM F1-weighted: %s' % f1_score(test_y, sgd_pred, average='weighted'))

The results of the linear SVM are as follows:

SVM accuracy: 0.49493248930940437
SVM F1-macro: 0.2677988484198396

How can I improve the performance?

【Comments】:

    Tags: python machine-learning scikit-learn svm


    【Answer 1】:

    There are several things you can consider to improve predictive performance:

    1. Use the class_weight parameter to penalize mistakes on the minority classes more heavily

      sgd = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf',   SGDClassifier(loss='hinge',
                                              penalty='l2',
                                              class_weight='balanced', # add this
                                              alpha=1e-3,
                                              max_iter=100,
                                              tol=None))
                     ])
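    For intuition, class_weight='balanced' gives each class the weight n_samples / (n_classes * class_count), so rare classes count for more in the loss. A quick sketch of the resulting weights, using the training counts from the question's table:

```python
# class_weight='balanced' assigns each class the weight
# n_samples / (n_classes * class_count); the counts below are the
# per-class training counts from the question's table
counts = {0: 52631, 1: 35116, 2: 17968, 3: 8658, 4: 3002, 5: 1201, 6: 592}
n_samples = sum(counts.values())
n_classes = len(counts)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
for c, w in sorted(weights.items()):
    print(f"class {c}: weight {w:.2f}")
```

    With this data the rarest class (class 6) ends up weighted roughly 90 times more than the majority class, which is exactly the correction an imbalanced loss needs.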
      
    2. Resample your data with the imblearn library before training, e.g.

      from imblearn.over_sampling import SMOTE
      from sklearn.feature_extraction.text import CountVectorizer

      # SMOTE interpolates between samples in feature space, so the
      # tokens must be vectorized to numeric features before resampling
      vect = CountVectorizer()
      train_X_vec = vect.fit_transform(train_df['token'].astype('U'))
      train_y = train_df['label']

      sm = SMOTE(random_state=42)
      train_X_res, train_y_res = sm.fit_resample(train_X_vec, train_y)
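    If imblearn is not available, plain random oversampling (duplicating minority-class samples until every class matches the majority count) is a simpler baseline that works on raw tokens. A minimal sketch in pure Python, with illustrative toy data not taken from the original post:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate minority-class samples until every class matches
    the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_res.append(rng.choice(pool))
            y_res.append(label)
    return X_res, y_res

# toy example: class 0 has four samples, class 1 only one
X = ['way', 'to', 'reduce', 'traffic', 'public']
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 4 samples
```

    Unlike SMOTE it creates exact duplicates rather than synthetic interpolations, which is cruder but has no numeric-feature requirement.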
      
    3. Try another classifier, e.g. Logistic Regression

      sgd = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf',   SGDClassifier(loss='log', # change here ('log_loss' in scikit-learn >= 1.1)
                                              penalty='l2',
                                              alpha=1e-3,
                                              max_iter=100,
                                              tol=None))
                     ])

    You can try these methods individually or in combination, but the first two in particular are classic measures for dealing with imbalanced datasets.
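    As context for the reported scores: macro-F1 averages the per-class F1 values with equal weight, so rare classes that are never predicted drag it far below accuracy. A minimal sketch of the computation, with hypothetical labels rather than the question's data:

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# majority class predicted well, minority class entirely missed:
# accuracy is 0.8, but macro-F1 collapses to 4/9
y_true = [0, 0, 0, 0, 6]
y_pred = [0, 0, 0, 0, 0]
print(macro_f1(y_true, y_pred))
```

    This is why class weighting or resampling, which push the model to predict the rare classes at all, tend to move macro-F1 much more than accuracy.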

    【Discussion】:

    • Thanks for your answer. I tried all three things, individually and in combination, but unfortunately the results were not much different. I suspect that because the data is at the token level and every word can appear in every class, there isn't much for the model to learn until it somehow captures context as well. In my dataset, words belonging to one class occur in runs, e.g. the first 10 words may belong to class0, the next 7 words to class1, and so on. So I'm thinking about collecting n-grams, but I'm not sure how that would work, since the test data is again at the token level.
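    Regarding the context idea in this comment: one standard trick for token-level classification is to build each token's feature string from a window of neighboring tokens, so a bag-of-words vectorizer sees local context while predictions stay per-token. A minimal sketch (the helper name, padding token, and window size are illustrative, not from the original post):

```python
def window_features(tokens, size=1):
    """Join each token with its neighbors so a bag-of-words
    vectorizer can pick up local context."""
    pad = ['<PAD>'] * size
    padded = pad + list(tokens) + pad
    # one feature string per original token, covering size tokens
    # on each side
    return [' '.join(padded[i:i + 2 * size + 1]) for i in range(len(tokens))]

tokens = ['public', 'transport', 'is', 'a', 'key']
print(window_features(tokens))
# each entry contains the previous token, the token itself, and the next token
```

    Because the same transformation is applied to the test sentences before vectorizing, this sidesteps the concern that test data arrives at the token level: there is still exactly one feature string, and one prediction, per token.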