【Question title】: calibrated classifier ValueError: could not convert string to float
【Posted】: 2021-11-16 11:46:26
【Question】:

DataFrame:

id    review                                              name         label
1     it is a great product for turning lights on.        Ashley       
2     plays music and have a good sound.                  Alex        
3     I love it, lots of fun.                             Peter        

I want to use a probabilistic classifier (linear_svc) to predict the label (the probability of label 1) from the review. My code:

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets

#Load  dataset
X = training['review']
y = training['label']

linear_svc = LinearSVC()     #The base estimator

# Wrap the base estimator so that it can output calibrated probabilities
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',  # 'sigmoid' applies Platt scaling; see the documentation for other methods
                                        cv=3)
calibrated_svc.fit(X, y)


# predict
prediction_data = predict_data['review']
predicted_probs = calibrated_svc.predict_proba(prediction_data)

calibrated_svc.fit(X, y) raises the following error:

ValueError: could not convert string to float: 'it is a great product for turning...'

Thanks for your help.

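【Comments】:

  • Text data needs to be encoded in some way, e.g. one-hot encoding, word embeddings, etc.
  • Thanks, @tdy. I just ran one-hot encoding and it still doesn't work.
  • Why is your label column empty?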
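
To make the first comment concrete, here is a minimal sketch in plain Python of why the error appears and what "encoding" the text means. The tokenizer and the helper names (`tokenize`, `to_counts`) are made up for illustration; scikit-learn's vectorizers do the same kind of job with far more care.

```python
import re
from collections import Counter

reviews = [
    "it is a great product for turning lights on.",
    "plays music and have a good sound.",
    "I love it, lots of fun.",
]

# Numeric estimators end up calling float() on every value they receive.
# On a raw sentence that fails with the same kind of message as in the question.
try:
    float(reviews[0])
    conversion_failed = False
except ValueError:
    conversion_failed = True


def tokenize(text):
    # Deliberately crude tokenizer (an assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


# Vocabulary: every distinct token seen in the reviews, in a fixed order.
vocabulary = sorted({token for review in reviews for token in tokenize(review)})


def to_counts(text):
    # Bag of words: one column per vocabulary term, holding that term's count.
    counts = Counter(tokenize(text))
    return [float(counts[term]) for term in vocabulary]


X_numeric = [to_counts(review) for review in reviews]
print(len(X_numeric), "rows x", len(vocabulary), "columns")
```

Each review becomes a row of floats of the same length, which is the kind of input a classifier such as LinearSVC can consume.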

Tags: scikit-learn text-classification valueerror


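【Solution 1】:

An SVM model cannot work on text data directly. You first need to extract numeric features from the text. I recommend reading up on some NLP basics, such as Bag of Words and TF-IDF. In any case, for the example you gave, a minimal working pipeline would be: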

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

#Load  dataset
X = training['review']
y = training['label']

linear_svc = make_pipeline(TfidfVectorizer(), LinearSVC())  # vectorize the text, then classify

# Wrap the pipeline so that it can output calibrated probabilities
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',
                                        cv=3) 
calibrated_svc.fit(X, y)


# predict
prediction_data = predict_data['review']
predicted_probs = calibrated_svc.predict_proba(prediction_data)

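You may also want to clean the text a bit by removing special characters, lowercasing, stemming, and so on. Take a look at spaCy, a library for text processing.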
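
As a rough illustration of that cleaning step, here is a minimal sketch using only the standard library. The function name `clean_text` and the exact rules are assumptions, and stemming is not covered (that is where a library such as spaCy comes in).

```python
import re


def clean_text(text):
    """Lowercase the text and replace anything that is not a letter, digit or space."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and other special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse the whitespace left behind
    return text


print(clean_text("I love it, lots of fun."))
print(clean_text("Plays MUSIC & has a good sound!!"))
```

A function like this can be passed to TfidfVectorizer through its `preprocessor` argument, in which case it replaces the vectorizer's built-in preprocessing step.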


    【Solution 2】:

    Try this:

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    X = training['review']
    y = training['label']    
    prediction_data = predict_data['review']
    
    tfv = TfidfVectorizer(min_df=1, stop_words = 'english')
    tfv.fit(list(X) + list(prediction_data))
    X =  tfv.transform(X) 
    prediction_data = tfv.transform(prediction_data)
    

    Then build the model:

    from sklearn.svm import LinearSVC
    from sklearn.calibration import CalibratedClassifierCV

    linear_svc = LinearSVC()
    calibrated_svc = CalibratedClassifierCV(linear_svc, method='sigmoid', cv=3)
    calibrated_svc.fit(X, y)
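
The question asks for the probability of label 1, so here is a hedged sketch of how that column could be read from `predict_proba` once the model is fitted. The toy reviews and labels are invented, and unlike the snippet above the vectorizer is fitted on the training text only, which is a choice made for this sketch.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data, invented for illustration only.
train_reviews = [
    "great product, the lights work perfectly",
    "good sound and nice music",
    "love it, lots of fun",
    "easy setup, excellent value",
    "broke after two days",
    "awful sound, poor quality",
    "hate it, waste of money",
    "keeps crashing, never connects",
]
train_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
new_reviews = ["excellent sound, love it", "poor quality, broke quickly"]

tfv = TfidfVectorizer(min_df=1, stop_words="english")
X_train = tfv.fit_transform(train_reviews)  # learn the vocabulary from training text only
X_new = tfv.transform(new_reviews)          # words unseen in training are ignored

calibrated_svc = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
calibrated_svc.fit(X_train, train_labels)

probs = calibrated_svc.predict_proba(X_new)
# Columns follow the order of calibrated_svc.classes_, so look the column up
# rather than assuming that label 1 is the second column.
col_of_one = list(calibrated_svc.classes_).index(1)
prob_of_one = probs[:, col_of_one]
print(prob_of_one)
```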
    
