[Posted]: 2017-06-17 23:20:19
[Question]:
I have a list of job titles:
> print(data)
>
0 Manager
1 Electrician
3 Carpenter
4 Electrician & Carpenter
...
I am trying to find the most closely related titles using gensim.
My code is:
import os
import pandas as pd
import nltk
import gensim
from gensim import corpora, models, similarities
from nltk.tokenize import word_tokenize
df = pd.read_csv('df.csv')
corpus = pd.DataFrame(df, columns=['Job Title'])
tokenized_sents = [word_tokenize(i) for i in corpus]
model = gensim.models.Word2Vec(tokenized_sents, min_count=1)
model.most_similar("Electrician")
When I run the tokenization step, which should treat each title as one sentence (the tokenized_sents variable), it only tokenizes the column header:
> tokenized_sents
> [['Job', 'Title']]
What am I doing wrong?
[Comments]:
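A likely cause, sketched under some assumptions: iterating over a pandas DataFrame yields its column labels rather than its rows, so `for i in corpus` only ever sees the string 'Job Title'. Iterating over the column itself (a Series) yields the cell values. In the sketch below, the inline DataFrame is a hypothetical stand-in for df.csv built from the titles shown in the question, and `str.split()` is a simple whitespace tokenizer used in place of `word_tokenize` so that the example runs without NLTK data.

```python
import pandas as pd

# Hypothetical stand-in for df.csv, using the titles shown in the question.
df = pd.DataFrame(
    {"Job Title": ["Manager", "Electrician", "Carpenter", "Electrician & Carpenter"]}
)

# Iterating over a DataFrame yields column labels, not row values.
column_labels = [label for label in df]

# Selecting the column gives a Series; iterating over it yields the cell values.
titles = df["Job Title"].dropna().astype(str)

# Assumption: whitespace splitting stands in for nltk's word_tokenize here.
tokenized_sents = [title.split() for title in titles]

print(column_labels)
print(tokenized_sents)
```

With NLTK installed, `word_tokenize(title)` can replace `title.split()`, and the resulting list of token lists can be passed to `gensim.models.Word2Vec` as in the question. Note that in gensim 4.0 and later, similarity queries go through the `wv` attribute, for example `model.wv.most_similar("Electrician")`.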
Tags: python pandas nltk tokenize gensim