【Question Title】: CountVectorizer does not work on training data in Python
【Posted】: 2019-07-23 23:14:43
【Question】:

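I am using scikit-learn to classify text, with CountVectorizer for feature extraction. I believe CountVectorizer should be fitted only on the training data, not on all of the data (features).

When I apply it to all of the data (features), the code works, but when I apply it only to the training data, I get this error:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Here is my code (deliberately simple, just an example):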

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import tree
from sklearn.metrics import accuracy_score


df = pd.DataFrame({"second":["yes ofc", "not a chance", " hell no", "yes yes yes", "yes",'yes maybe', 'yes ofc', 'no not'],
                  "third":["true","false", "false", "true", "false", "true","false", "false"]})

##CHANGE HERE
results = df['third']
features = df['second']

cv = CountVectorizer()  
#features = cv.fit_transform(features) #it worked

features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)

#features_train = cv.fit_transform(features_train).toarray() #it does not work
#result_train = cv.fit_transform(result_train).toarray() #it does not work

cls = tree.DecisionTreeClassifier()
model = cls.fit(features_train, result_train)

acc_prediction  = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)

print(accuracy_test)

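【Comments】:

  • You may want to convert the strings to integers or booleans. It will not solve your problem, but it is better practice.
  • This is only an example. The real values are much longer strings, and there are many distinct values.
  • You would still want to treat them as categorical.
  • I guess results should be df['third'], because results holds the labels here (true, false).
  • result_train should not be fed to CountVectorizer.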
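
The comments above suggest two things: turn the string labels into booleans, and keep the labels away from CountVectorizer. A minimal sketch of that idea, using the labels from the question's toy DataFrame (the variable name labels is introduced here only for illustration):

```python
import pandas as pd

# Label column "third" from the question's example data.
results = pd.Series(["true", "false", "false", "true",
                     "false", "true", "false", "false"])

# Convert the label strings to booleans with a plain comparison.
# The labels are never passed through CountVectorizer.
labels = results == "true"

print(labels.tolist())
```

The classifier accepts the original strings as labels too, so this step is optional.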

Tags: python scikit-learn


【Answer 1】:

You should fit the count vectorizer on the training data only, but apply it (transform) to both the training and the test data.

After creating the CountVectorizer:

cv = CountVectorizer()

and splitting the data into training and test sets:

features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)

call fit_transform on features_train before going any further, because you want to train the actual classifier on the data transformed by the count vectorizer:

features_train = cv.fit_transform(features_train)

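After this step, cv has been fitted on the training data and the training data has been transformed. Now train the actual classifier on the transformed data: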

cls = tree.DecisionTreeClassifier()
model = cls.fit(features_train, result_train)

Your classifier is now trained on the count-vectorized training data. To measure accuracy on the test data, first transform the test data with the same count vectorizer:

features_test = cv.transform(features_test)

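Note that the vectorizer is not fitted again; the already fitted count vectorizer is only used to transform the test data. Now predict with the trained decision tree classifier: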

acc_prediction = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)
print(accuracy_test)

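【Comments】:

  • So I should run CountVectorizer on features_train and features_test, not on features_train and result_train, right? If I do it the way you describe, will that prevent overfitting?
  • Running CountVectorizer on features_train and features_test rather than on result_train has nothing to do with overfitting. You simply do not run the count vectorizer on the labels (result_train); you run it only on the data the model is trained on.
  • @AhmadKhan Since you split the data into training and test sets, both will have the same dimensions. But how do you predict on test data that is separate from the training data? Running cv.transform would fail with a dimension mismatch.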
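
On the dimension-mismatch concern in the last comment: transform reuses the vocabulary learned during fit, so the test matrix gets the same number of columns as the training matrix, and words that were not seen during fitting are ignored. A small sketch with made-up document lists (train_docs and test_docs are illustrative names, not from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["yes ofc", "not a chance", "yes yes yes"]
test_docs = ["hell no", "yes maybe"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_docs)  # learns the vocabulary from the training texts only
X_test = cv.transform(test_docs)        # reuses that vocabulary, no refitting

# With the default tokenizer, single-character tokens such as "a" are dropped.
print(sorted(cv.vocabulary_))
print(X_train.shape, X_test.shape)      # same number of columns in both
print(X_test.toarray())                 # unseen words ("hell", "no", "maybe") contribute nothing
```

Calling fit_transform a second time on the test texts, by contrast, would learn a different vocabulary, and that is what leads to mismatched columns.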
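【Answer 2】:

To apply the fitted vectorizer to the test data, use .transform(). The code below is my suggestion.

Also, .toarray() is an expensive operation that converts a sparse matrix into a dense one, so avoid it unless it is strictly necessary. Decision trees can work with the sparse matrix directly.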

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import tree
from sklearn.metrics import accuracy_score


df = pd.DataFrame({"second":["yes ofc", "not a chance", " hell no", "yes yes yes", "yes",'yes maybe', 'yes ofc', 'no not'],
                  "third":["true","false", "false", "true", "false", "true","false", "false"]})

##CHANGE HERE
results = df['third']
features = df['second']

cv = CountVectorizer()  

features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)

features_train = cv.fit_transform(features_train) 
features_test = cv.transform(features_test) 

cls = tree.DecisionTreeClassifier()
model = cls.fit(features_train, result_train)

acc_prediction  = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)

print(accuracy_test)
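
The remark about .toarray() can be checked with a short sketch. It assumes a fixed random_state and a four-row subset of the question's data, both chosen only for illustration: the tree is fitted once on the sparse matrix and once on its dense copy.

```python
from scipy import sparse
from sklearn import tree
from sklearn.feature_extraction.text import CountVectorizer

texts = ["yes ofc", "not a chance", "hell no", "yes yes yes"]
labels = ["true", "false", "false", "true"]

X = CountVectorizer().fit_transform(texts)
print(sparse.issparse(X))  # CountVectorizer returns a SciPy sparse matrix

# Fit directly on the sparse matrix; no .toarray() needed.
clf_sparse = tree.DecisionTreeClassifier(random_state=0).fit(X, labels)

# Fit on the dense copy for comparison only.
clf_dense = tree.DecisionTreeClassifier(random_state=0).fit(X.toarray(), labels)

pred_sparse = clf_sparse.predict(X).tolist()
pred_dense = clf_dense.predict(X.toarray()).tolist()
print(pred_sparse, pred_dense)
```

On real text data the dense copy can be far larger than the sparse matrix, which is the practical reason to avoid the conversion.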

【Comments】:

    【Answer 3】:

    Try this:

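    from sklearn.model_selection import train_test_split  # cross_validation was removed in scikit-learn 0.20

    features = cv.fit_transform(features)  # note: this fits the vectorizer on all of the data
    # One call splits the features and the labels together.
    X_train, X_test, Y_train, Y_test = train_test_split(features, results, test_size=0.3, random_state=0)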
    

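    【Comments】:

    • But what do I gain from this? I thought I should transform X_train and Y_train.
    【Answer 4】:

    The following code works. I guess you made a mistake when assigning results and features.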

    import pandas as pd
    
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    
    from sklearn import tree
    from sklearn.metrics import accuracy_score
    
    
    df = pd.DataFrame({"second":["yes ofc", "not a chance", " hell no", "yes yes yes", "yes",'yes maybe', 'yes ofc', 'no not'],
                      "third":["true","false", "false", "true", "false", "true","false", "false"]})
    
    ##CHANGE HERE
    results = df['third']
    features = df['second']
    
    cv = CountVectorizer()  
    features = cv.fit_transform(features) #it worked
    
    features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)
    
    #features_train = cv.fit_transform(features_train) #it does not work
    #result_train = cv.fit_transform(result_train) #it does not work
    
    cls = tree.DecisionTreeClassifier()
    model = cls.fit(features_train, result_train)
    
    acc_prediction  = model.predict(features_test)
    accuracy_test = accuracy_score(result_test, acc_prediction)
    
    print(accuracy_test)
    

    If you want to apply CountVectorizer to the training and test sets separately, this is how to do it:

    {SAME AS ABOVE TILL HERE}
    
    results = df['third']
    features = df['second']
    
    cv = CountVectorizer()  
    
    features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)
    
    features_train = cv.fit_transform(features_train) # fit on the training data only
    
    cls = tree.DecisionTreeClassifier()
    model = cls.fit(features_train, result_train)
    
    acc_prediction  = model.predict(cv.transform(features_test))
    accuracy_test = accuracy_score(result_test, acc_prediction)
    
    print(accuracy_test)
    

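    【Comments】:

    • Yes, I did that wrong. But this does not answer my question. Should I run CountVectorizer on the training data only or on all of the data? How do I run CountVectorizer on the training data only?
    • You need to run CountVectorizer on all of the data. You have to transform the training data and the test data in the same way. You cannot transform only the training data and use the test data in its raw form.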
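
One way to reconcile the two comments above is to fit the vectorizer on the training data only while still transforming both sets in the same way. The sketch below does this with scikit-learn's make_pipeline, which is a swapped-in convenience and not something the answers in this thread use; the fixed random_state on the tree is also an assumption added for repeatability.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "second": ["yes ofc", "not a chance", " hell no", "yes yes yes",
               "yes", "yes maybe", "yes ofc", "no not"],
    "third": ["true", "false", "false", "true",
              "false", "true", "false", "false"],
})

features_train, features_test, result_train, result_test = train_test_split(
    df["second"], df["third"], test_size=0.3, random_state=42)

# fit() calls fit_transform on the vectorizer with the training texts only;
# score() and predict() call transform on whatever texts they receive.
model = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(features_train, result_train)

accuracy_test = model.score(features_test, result_test)
print(accuracy_test)
```

With the pipeline, the raw test strings are passed straight to score or predict, so there is no separate transform call to forget.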