【Question Title】: Gensim word2vec online training
【Posted】: 2017-03-28 14:18:55
【Question】:

I am training a word2vec model in gensim using sentences from a csv file, as follows:

import string
import gensim
import csv
import nltk

path = '/home/neel/Desktop/csci544_proj/test/sample.csv'
translator = str.maketrans({key: None for key in string.punctuation})

class gen(object):

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(path) as infile:
            reader = csv.reader(infile)
            for row in reader:
                rev = row[4]
                l = nltk.sent_tokenize(rev)
                for sent in l:
                    sent = sent.translate(translator)
                    yield sent.lower().split()

sentences = [path]
for p in gen(path):
    model = gensim.models.Word2Vec(p, min_count=1, iter=1)

print(model.vocab.keys())

I get the following result: (['b', 'u', 'm', 'h', 'e', 'n', 'r', 'v', 'i', 'a', 't', 's', 'k', 'w', 'o', 'l'])

The vocabulary contains single characters rather than words. Where does the program go wrong?

【Comments】:

Tags: python gensim word2vec yield-keyword


【Solution 1】:

I fixed your code:

import string
import gensim
import csv
import nltk

path = '/home/neel/Desktop/csci544_proj/test/sample.csv'
translator = str.maketrans({key: None for key in string.punctuation})

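class Generator(object):
    """Restartable iterable that yields one tokenized sentence at a time."""

    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        for path in self.paths:
            with open(path) as infile:
                for row in csv.reader(infile):
                    # row[4] holds the text; split it into sentences.
                    for sent in nltk.sent_tokenize(row[4]):
                        # Strip punctuation, lowercase, split on whitespace.
                        yield sent.translate(translator).lower().split()


# Pass the whole corpus (an iterable of token lists), not a single sentence.
corpus = Generator([path])
# gensim 3.x API; in gensim 4, `iter` became `epochs` and `wv.vocab` became `wv.key_to_index`.
model = gensim.models.Word2Vec(min_count=1, iter=1)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=2)
model.wv.vocab.keys()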

You will get: dict_keys(['wassup', 'where', 'fresh', 'new', 'about', 'juice', 'whats', 'are', 'im', 'hello', 'wtf', 'd', 'hi', 'you', 'world', 'bro', 'friend'])
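
As for why the original code produced characters: `Word2Vec` expects an iterable of sentences, where each sentence is a list of tokens. The loop in the question passed one tokenized sentence `p` at a time, so each word string was treated as a sentence and each of its characters as a token. The model was also rebuilt on every loop iteration, so only the last sentence survived. The sketch below reproduces the effect in plain Python without gensim; `collect_tokens` is a hypothetical helper that only mimics how a vocabulary is gathered from a corpus, it is not a gensim function.

```python
def collect_tokens(corpus):
    """Mimic vocabulary building: iterate sentences, then tokens in each sentence."""
    vocab = set()
    for sentence in corpus:
        for token in sentence:
            vocab.add(token)
    return vocab


sentence = "hello world".split()  # one tokenized sentence: ['hello', 'world']

# Wrong: a single sentence passed as the corpus. Each word is iterated as if
# it were a sentence, so its characters become the "tokens".
wrong_vocab = collect_tokens(sentence)

# Right: a corpus is an iterable of tokenized sentences.
right_vocab = collect_tokens([sentence])

print(sorted(wrong_vocab))  # single characters
print(sorted(right_vocab))  # whole words
```

Wrapping the sentence in a list (or, as in the fix, handing over the whole iterable) is what makes the tokens come out as words.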

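【Comments】:

  • Hello, it would be better to explain what you changed in the code and why; a description of the code can also help other users. Thanks.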
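
On what changed and why: the corrected code passes the whole corpus to the model once, and it wraps the reader in a class with `__iter__` because the corpus is iterated more than once (once in `build_vocab`, then again during `train`). A plain generator object would be exhausted after the first pass. A minimal plain-Python sketch of the difference follows; `RestartableCorpus`, `one_shot_corpus`, and the sample sentences are illustrative names and data, not taken from the original post.

```python
SENTENCES = ["Hello world", "Whats new"]  # illustrative data


def one_shot_corpus():
    """Generator function: the returned generator can be consumed only once."""
    for text in SENTENCES:
        yield text.lower().split()


class RestartableCorpus:
    """Iterable class: each call to __iter__ starts a fresh pass."""

    def __iter__(self):
        for text in SENTENCES:
            yield text.lower().split()


gen = one_shot_corpus()
first_pass = list(gen)
second_pass = list(gen)  # empty: the generator is already exhausted

corpus = RestartableCorpus()
first = list(corpus)
second = list(corpus)  # the same sentences again

print(len(first_pass), len(second_pass))
print(len(first), len(second))
```

With the one-shot generator, a second pass sees no sentences at all, which is why the class-based iterable is the safer choice for a corpus that has to be read several times.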