【Question title】: Vectorizing files using sklearn
【Posted】: 2015-10-18 08:29:17
【Question】:

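I am trying to read 100 training files and vectorize them with sklearn. The files contain words that represent system calls. Once they are vectorized, I want to print the vectors. My first attempt was: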

import re
import os
import sys
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import numpy.linalg as LA

trainingdataDir = 'C:\data\Training data'

def readfile():
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles):
         data = open(trainingfiles, "rb").read()

    return data 

train_set = [readfile()]

vectorizer = CountVectorizer()
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray

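However, this returns the vector for the last file only. I concluded that the print call should go inside the for loop, which led to my second attempt: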

def readfile():
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles):
         data = open(trainingfiles, "rb").read()
    trainVectorizerArray = vectorizer.fit_transform(data).toarray()
    print 'Fit Vectorizer to train set', trainVectorizerArray          

However, this does not return anything. Can you help me with this? Why can't I see the vectors printed out?

【Comments】:

    Tags: python-2.7 scikit-learn vectorization pythonxy


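    【Solution 1】:

    The problem was that the list of documents passed to the vectorizer was empty. I managed to vectorize the set of 100 files by opening each file, reading it, and appending its contents to a list. The 'tfidf_vectorizer' then works on that list.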

    import re
    import os
    import sys
    import numpy as np
    import numpy.linalg as LA
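    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    # TfidfVectorizer builds the TF-IDF matrix; cosine_similarity compares its rows
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity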
    
    trainingdataDir = 'C:\\data\\Training data'
    
    tfidf_vectorizer = TfidfVectorizer()
    
    transformer = TfidfTransformer()
    def readfile(trainingdataDir):
        train_set = []
        for file in os.listdir(trainingdataDir):
            trainingfiles = os.path.join(trainingdataDir, file)
            if os.path.isfile(trainingfiles): 
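                # Read the file and close it again once done
                with open(trainingfiles, 'r') as data:
                    # Decode the raw bytes to unicode (ASCII is the Python 2 default)
                    data_set = data.read().decode()
                train_set.append(data_set)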
        return train_set 
    
    
    tfidf_matrix_train = tfidf_vectorizer.fit_transform(readfile(trainingdataDir))
    print 'Fit Vectorizer to train set',tfidf_matrix_train
    print "cosine scores ==> ",cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train)
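To see why the first attempt in the question yields a vector for the last file only, here is a minimal sketch that needs nothing but the standard library. The function names, the sample file names, and their contents are made up for illustration, and `sorted()` is added only so the order is predictable; none of this comes from the original post.

```python
import os
import tempfile


def read_last_only(directory):
    """Mirrors the first attempt: `data` is overwritten on every iteration."""
    data = None
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "r") as handle:
                data = handle.read()
    return data  # only the contents of the last file survive


def read_all(directory):
    """Collects one string per file, i.e. a list of documents."""
    documents = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "r") as handle:
                documents.append(handle.read())
    return documents


# Hypothetical sample data: three tiny "system call" files in a temp directory.
sample_dir = tempfile.mkdtemp()
samples = {
    "a.txt": "open read close",
    "b.txt": "open write close",
    "c.txt": "fork exec wait",
}
for name, text in samples.items():
    with open(os.path.join(sample_dir, name), "w") as handle:
        handle.write(text)

last_only = read_last_only(sample_dir)
all_documents = read_all(sample_dir)
print(repr(last_only))     # a single string: one document
print(len(all_documents))  # one entry per file
```

A list like `all_documents` is what `fit_transform` expects: an iterable with one string per document. As for the second attempt, `readfile()` is defined but never called in the snippet as posted, which would explain why nothing is printed.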
    

    【Discussion】:
