【Question title】: Problems fitting vocabulary in scikit-learn?
【Posted】: 2015-02-22 06:29:33
【Question description】:

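I have a directory full of .txt files (documents). I first load the files and strip some brackets and quotation marks, so the files look like this, for example: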

document1:
is a scientific discipline that explores the construction and study of algorithms that can learn from data Such algorithms operate by building a model

document2:
Machine learning can be considered a subfield of computer science and statistics It has strong ties to artificial intelligence and optimization which deliver methods

I load the files from the directory like this:

preprocessDocuments =[[' '.join(x) for x in sample[:-1]] for sample in load(directory)]


documents = ''.join( i for i in ''.join(str(v) for v
                                              in preprocessDocuments) if i not in "',()")

Then I try to vectorize document1 and document2 to create a training matrix, like this:

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(analyzer='word')
X = HashingVectorizer.fit_transform(documents)
X.toarray()

This is the output:

    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

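Given this, how can I create a vector representation? I thought documents held the loaded files, but it seems the vectorizer cannot be fitted on them.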

【Comments】:

    Tags: python machine-learning nlp scikit-learn


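    【Solution 1】:

    What does documents contain? It looks like it should be a list of filenames or a list of strings holding the tokens. Also, you should call fit_transform on the vectorizer object rather than as if it were a static method, i.e. vectorizer.fit_transform(documents).

    For example, this works here: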

    from sklearn.feature_extraction.text import HashingVectorizer
    documents=['this is a test', 'another test']
    vectorizer = HashingVectorizer(analyzer='word')
    X = vectorizer.fit_transform(documents)
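
As a quick sanity check on the example above, the following sketch inspects the result. It assumes the default HashingVectorizer settings, under which the number of columns is fixed at 2**20 regardless of the input; the printed values are for illustration only and are not part of the original answer.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Same toy input as above: a list with one string per document.
documents = ['this is a test', 'another test']

# Call fit_transform on the instance, not on the class.
vectorizer = HashingVectorizer(analyzer='word')
X = vectorizer.fit_transform(documents)

# X is a sparse matrix with one row per document.
n_docs, n_features = X.shape
print(n_docs)       # one row for each element of documents
print(n_features)   # default n_features of HashingVectorizer (2**20)
```

If the number of rows does not match the number of files you loaded, documents is probably not a list with one string per file.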
    

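    【Discussion】:

    • Thanks for the feedback. When I print documents I get the following: [a very large text][another very large text][a third very large text]. The three lists correspond to the 3 .txt files I have in the directory. Any suggestions?
    • Yes, your documents should be a list in which each element is a string containing one tokenized document. Something like: documents=['word_1_doc_1 word_2_doc_1 word_3_doc_1', 'word_1_doc_2 word_2_doc_2 ...', 'word_1_doc_3, word_2_doc_3 ...']. If you do something like documents=[' '.join(ii) for ii in documents], it might work.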
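
The join suggested in the last comment can be sketched as follows. This is only an illustration under the assumption that each file has been loaded as a list of tokens; the variable name tokenized and the sample tokens are made up and do not come from the question.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical input: one list of tokens per file (invented sample data).
tokenized = [
    ["machine", "learning", "explores", "algorithms"],
    ["machine", "learning", "is", "a", "subfield", "of", "computer", "science"],
    ["it", "has", "strong", "ties", "to", "optimization"],
]

# Join each token list into a single string, so that every element
# of documents is one whole document.
documents = [" ".join(tokens) for tokens in tokenized]

vectorizer = HashingVectorizer(analyzer="word")
X = vectorizer.fit_transform(documents)

print(documents[0])
print(X.shape[0])  # number of rows equals the number of documents
```

The key point is that the vectorizer receives a list of strings, one per file, rather than one long string or a nested list.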