[Question title]: Python sklearn - could not convert string to float Error
[Posted]: 2021-10-10 19:17:09
[Question]:

I am trying to train a decision tree classifier:

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('./vgsales.csv')
X = dataset[["Name"]]
Y = dataset[["Global_Sales"]]

model = DecisionTreeClassifier()
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)

model.fit(X_train,Y_train)

When the model.fit line runs, I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-58-84aa7640ed28> in <module>
     12 X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)
     13 
---> 14 model.fit(X_train,Y_train)

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    896         """
    897 
--> 898         super().fit(
    899             X, y,
    900             sample_weight=sample_weight,

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    154             check_X_params = dict(dtype=DTYPE, accept_sparse="csc")
    155             check_y_params = dict(ensure_2d=False, dtype=None)
--> 156             X, y = self._validate_data(X, y,
    157                                        validate_separately=(check_X_params,
    158                                                             check_y_params))

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    428                 # :(
    429                 check_X_params, check_y_params = validate_separately
--> 430                 X = check_array(X, **check_X_params)
    431                 y = check_array(y, **check_y_params)
    432             else:

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    614                     array = array.astype(dtype, casting="unsafe", copy=False)
    615                 else:
--> 616                     array = np.asarray(array, order=order, dtype=dtype)
    617             except ComplexWarning as complex_warning:
    618                 raise ValueError("Complex data not supported\n"

~/opt/anaconda3/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order, like)
    100         return _asarray_with_like(a, dtype=dtype, order=order, like=like)
    101 
--> 102     return array(a, dtype, copy=False, order=order)
    103 
    104 

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __array__(self, dtype)
   1897 
   1898     def __array__(self, dtype=None) -> np.ndarray:
-> 1899         return np.asarray(self._values, dtype=dtype)
   1900 
   1901     def __array_wrap__(

~/opt/anaconda3/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order, like)
    100         return _asarray_with_like(a, dtype=dtype, order=order, like=like)
    101 
--> 102     return array(a, dtype, copy=False, order=order)
    103 
    104 

ValueError: could not convert string to float: 'Tiger Woods PGA Tour 14'

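I have already tried the code from this question -> sklearn-LinearRegression: could not convert string to float: '--'

When I use the apply() method to convert the values to numeric/float, all of my data values become NaN.

The dataset I am using is https://www.kaggle.com/gregorut/videogamesales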

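[Comments]:

    Tags: python pandas machine-learning scikit-learn


    [Solution 1]:

    Here is a working example in which every column is converted to a "tree-friendly" type. You may not want to convert all of the columns; that depends on the problem you are solving, but it works for this example.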

    import pandas as pd
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn import preprocessing
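    # A single encoder object is re-fitted for every column below.
    le = preprocessing.LabelEncoder()
    
    dataset = pd.read_csv('./vgsales.csv')
    # Cast Publisher to str so the column holds a single type
    # (any missing value becomes the string 'nan').
    dataset.Publisher = dataset.Publisher.astype(str)
    
    # Replace every column (including the numeric ones) with integer category codes.
    for column in dataset.columns:
        dataset[column] = le.fit_transform(dataset[column].astype('category'))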
    
    X = dataset[["Name"]]
    Y = dataset[["Global_Sales"]]
    
    model = DecisionTreeClassifier()
    X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)
    
    model.fit(X_train,Y_train)
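
As a side note on the NaN values mentioned in the question, here is a minimal sketch, assuming the conversion was done with something like `pd.to_numeric(..., errors="coerce")` (the question does not show the exact call). Game titles are not numeric strings, so coercion has nothing to parse and yields NaN for every row, whereas an encoder assigns each distinct string an integer code. The four titles below are a made-up stand-in for the `Name` column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the "Name" column of vgsales.csv.
names = pd.Series(["Tiger Woods PGA Tour 14", "Wii Sports", "Tetris", "Wii Sports"])

# Coercion tries to parse each string as a number; titles cannot be parsed,
# so every value becomes NaN.
coerced = pd.to_numeric(names, errors="coerce")
print(coerced.isna().all())  # True

# Label encoding maps each distinct string to an integer code instead
# (codes follow the sorted order of the distinct strings).
codes = LabelEncoder().fit_transform(names)
print(codes)  # [1 2 0 2]
```

The codes carry no numeric meaning of their own; they only make the column acceptable to the estimator.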
    

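    Now a few tips. First, if you intend to train a model that is actually useful, you should probably:

    • Check the number of unique values in each column before using it; with too many, any tree model will struggle. For example, in this case you have 11493 unique names.
    • It looks like you are trying to do regression (predicting a number), so use a regressor instead.
    • In practice, you should generally prefer a random forest over a single decision tree; it is usually better (less prone to overfitting). Single decision trees are rarely used on their own.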
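
The three tips above can be sketched together as follows. This is only an illustration under stated assumptions: the data frame is synthetic (the real vgsales.csv is not loaded), the cardinality threshold of 50 is arbitrary, and `RandomForestRegressor` is just one possible regressor.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for vgsales.csv (all values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
dataset = pd.DataFrame({
    "Name": [f"Game {i}" for i in range(n)],  # every value is unique
    "Platform": rng.choice(["Wii", "PS3", "X360"], n),
    "Genre": rng.choice(["Sports", "Puzzle", "Racing", "Action"], n),
    "Global_Sales": rng.gamma(2.0, 1.0, n).round(2),
})

# Tip 1: check the number of unique values per column.
print(dataset.nunique())

# Keep only low-cardinality text columns (the threshold is an arbitrary choice).
candidates = ["Name", "Platform", "Genre"]
features = [c for c in candidates if dataset[c].nunique() <= 50]

# Encode the remaining text columns as integer codes.
X = dataset[features].apply(LabelEncoder().fit_transform)
y = dataset["Global_Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Tips 2 and 3: the target is a continuous number, so fit a regressor,
# here a random forest rather than a single tree.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(mean_absolute_error(y_test, predictions))
```

On the synthetic data the sales figures are random, so the error value itself is meaningless; the point is the shape of the workflow.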

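    [Comments]:

      [Solution 2]:

      Machine learning models cannot work with raw text data directly; you have to convert it to numeric form, for example with LabelEncoder: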

      import pandas as pd
      from sklearn.preprocessing import LabelEncoder
      dataset = pd.read_csv('vgsales.csv')
      # Drop rows with missing values before encoding.
      dataset = dataset.dropna()
      
      # Encode each text column as integer codes.
      text_columns = ['Name', 'Platform', 'Genre', 'Publisher']
      dataset[text_columns] = dataset[text_columns].apply(LabelEncoder().fit_transform)
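
One limitation of applying `LabelEncoder().fit_transform` column by column is that the fitted mappings are not kept, so new rows cannot be encoded consistently later. A possible alternative, not part of the answer above, is `OrdinalEncoder`, which is designed for feature columns and stores one category list per column. The sketch below assumes scikit-learn 0.24 or newer (for `handle_unknown="use_encoded_value"`) and uses made-up rows.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows standing in for the training data and for data seen later.
train = pd.DataFrame({
    "Platform": ["Wii", "PS3", "Wii"],
    "Genre": ["Sports", "Action", "Puzzle"],
})
new = pd.DataFrame({
    "Platform": ["PS3", "GB"],
    "Genre": ["Puzzle", "Sports"],
})

# Categories not seen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

print(encoder.categories_)      # learned categories, one array per column
encoded_new = encoder.transform(new)
print(encoded_new)              # "GB" was not seen during fit, so it becomes -1
```

Fitting the encoder on the training split only and reusing it on the test split keeps the integer codes consistent between the two.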
      

      [Comments]:
