Unsatisfactory output from Tf-Idf

【Question title】: Unsatisfactory output from Tf-Idf
【Posted】: 2020-09-11 18:51:16
【Question description】:

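I have a document in a text file, split across several lines, as shown below. I want to apply tf-idf to it, but I get the error shown below. I am not sure where the int object in my file is. Why is this error thrown?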

Environment:

Jupyter Notebook, Python 3.7

Error:

AttributeError: 'int' object has no attribute 'lower'

file.txt:

  Random person from the random hill came to a running mill and I have a count of the hill. This is my house. 

  A person is from a great hill and he loves to run a mill. 

  Sub-disciplines of biology are defined by the research methods employed and the kind of system studied: theoretical biology uses mathematical methods to formulate quantitative models while experimental biology performs empirical experiments.

  The objects of our research will be the different forms and manifestations of life, the conditions and laws under which these phenomena occur, and the causes through which they have been effected. The science that concerns itself with these objects we will indicate by the name biology.

Code:

import pandas as pd
import spacy
import csv
import collections
import sys
import itertools
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
from nltk.tokenize import sent_tokenize
from gensim import corpora, models
from stop_words import get_stop_words
from nltk.stem import PorterStemmer

data = pd.read_csv('file.txt', sep="\n", header=None)

data.dtypes
# 0    object
# dtype: object

data.shape  # shape is an attribute, not a method
# (4, 1)

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)
print(X)

【Question discussion】:

Tags: python-3.x tf-idf tfidfvectorizer

【Solution 1】:

I solved it by reading the file like this:

with open('file.txt') as f:
    lines = [line.rstrip() for line in f]
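
As for where the int comes from: iterating over a pandas DataFrame yields its column labels, not its rows. With header=None the only column is labelled with the integer 0, so TfidfVectorizer receives 0 as a "document" and fails when it tries to lowercase it. The sketch below illustrates this and shows the fix end to end. It writes a shortened, hypothetical stand-in for file.txt to a temporary directory so it runs on its own, and the variable names are assumptions, not part of the original post.

```python
import os
import tempfile

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for file.txt: a shortened sample, one document per line.
sample = (
    "Random person from the random hill came to a running mill.\n"
    "A person is from a great hill and he loves to run a mill.\n"
)
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as f:
    f.write(sample)

# Read the file into a plain list of strings, skipping blank lines.
with open(path) as f:
    lines = [line.strip() for line in f if line.strip()]

# A list of strings is what fit_transform expects: one string per document.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(lines)
print(X.shape[0])  # number of documents (rows)

# Why the DataFrame failed: iterating it gives column labels, here the int 0.
data = pd.DataFrame(lines)
print(list(data))

# Passing the column itself (a Series of strings) also works.
X_from_column = TfidfVectorizer(stop_words="english").fit_transform(data[0])
```

If the rest of the notebook needs the DataFrame, passing the single column (data[0]) instead of the whole frame should be enough, since iterating over a Series yields its values.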

【Discussion】: