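【Posted】: 2018-05-23 07:14:09
【Question】:
I downloaded a pretrained word embedding model from Word Embeddings by M. Baroni et al. and want to visualize the embeddings of the words in a sentence. I have two sentences: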
sentence1 = "Four people died in an accident."
sentence2 = "4 men are dead from a collision"
I have a function that loads the embedding file from the link above:
def load_data(FileName = './EN-wform.w.5.cbow.neg10.400.subsmpl.txt'):
    embeddings = {}
    file = open(FileName,'r')
    i = 0
    print("Loading word embeddings first time")
    for line in file:
        # print line
        tokens = line.split('\t')
        # the last token of each line ends with '\n', so strip it
        tokens[-1] = tokens[-1].strip()
        # each line has 400 tokens
        for i in range(1, len(tokens)):
            tokens[i] = float(tokens[i])
        # note: the [1:-1] slice also drops the last value on the line
        embeddings[tokens[0]] = tokens[1:-1]
    file.close()
    print("finished")
    return embeddings

e = load_data()
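The loader above assumes each line holds a word followed by tab-separated floats. A minimal sketch of that parsing step on an in-memory sample follows; the `parse_embeddings` name and the three-dimensional toy vectors are made up for illustration, and the layout of the real file should be checked before relying on it:

```python
import io

def parse_embeddings(lines):
    """Parse 'word<TAB>v1<TAB>v2...' lines into a dict of word -> list of floats."""
    embeddings = {}
    for line in lines:
        tokens = line.rstrip("\n").split("\t")
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        embeddings[tokens[0]] = [float(t) for t in tokens[1:]]
    return embeddings

# Toy stand-in for the real file (hypothetical values, 3 dimensions instead of 400)
sample = io.StringIO("accident\t0.1\t0.2\t0.3\ncollision\t0.1\t0.25\t0.3\n")
toy = parse_embeddings(sample)
print(len(toy["accident"]))
```

Unlike `tokens[1:-1]`, the slice `tokens[1:]` keeps every value on the line, which may or may not be what the original loader intended.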
From these two sentences I compute the lemmas of the words and drop stopwords and punctuation, so the sentences become:
sentence1 = ['Four', 'people', 'died', 'accident']
sentence2 = ['4', 'men', 'dead', 'collision']
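The post does not show how the filtering was done. As a rough sketch only, the stopword and punctuation removal could look like the following; the `STOPWORDS` set and the `content_words` helper are hypothetical, and no lemmatization is attempted here:

```python
import string

# Hypothetical, hand-picked stopword list; a real pipeline would use a fuller one
STOPWORDS = {"a", "an", "are", "from", "in", "the"}

def content_words(sentence):
    """Split on whitespace, strip surrounding punctuation, and drop stopwords."""
    words = []
    for raw in sentence.split():
        word = raw.strip(string.punctuation)
        if word and word.lower() not in STOPWORDS:
            words.append(word)
    return words

print(content_words("Four people died in an accident."))
print(content_words("4 men are dead from a collision"))
```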
Now, to visualize the embeddings with t-SNE (t-distributed Stochastic Neighbor Embedding), I first store the labels and tokens for each sentence:
# for each sentence, store labels and embeddings in lists
# tokens contains a 400-dimensional vector for each label
labels1 = []
tokens1 = []
for i in sentence1:
    if i in e:
        labels1.append(i)
        tokens1.append(e[i])
    else:
        print(i)

labels2 = []
tokens2 = []
for i in sentence2:
    if i in e:
        labels2.append(i)
        tokens2.append(e[i])
    else:
        print(i)
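The two loops above are identical apart from their variable names, so they could be folded into one helper. A small sketch, where `lookup` and the toy embedding values are illustrative assumptions rather than part of the original code:

```python
def lookup(words, embeddings):
    """Return (labels, vectors, missing) for the given words."""
    labels, vectors, missing = [], [], []
    for word in words:
        if word in embeddings:
            labels.append(word)
            vectors.append(embeddings[word])
        else:
            missing.append(word)
    return labels, vectors, missing

# Toy embeddings (hypothetical values) in which 'Four' is absent
toy = {"people": [0.1, 0.2], "died": [0.3, 0.4], "accident": [0.5, 0.6]}
labels, vectors, missing = lookup(["Four", "people", "died", "accident"], toy)
print(missing)
```

Dictionary lookup is case-sensitive, so a capitalised token such as 'Four' is found only if the embedding file stores it with exactly that casing.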
For t-SNE:
from sklearn.manifold import TSNE

tsne_model = TSNE(perplexity=40, n_components=2, init='random', n_iter=2000, random_state=23)
# fit_transform for the tokens of both sentences
new_values = tsne_model.fit_transform(tokens1)
new_values1 = tsne_model.fit_transform(tokens2)
# Plot values
import matplotlib.pyplot as plt

x = []
y = []
x1 = []
y1 = []
for value in new_values:
    x.append(value[0])
    y.append(value[1])
for value in new_values1:
    x1.append(value[0])
    y1.append(value[1])

plt.figure(figsize=(10, 10))
for i in range(len(x)):
    plt.scatter(x[i],y[i])
    plt.annotate(labels1[i],
                 xy=(x[i], y[i]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
for i in range(len(x1)):
    plt.scatter(x1[i],y1[i])
    plt.annotate(labels2[i],
                 xy=(x1[i], y1[i]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
plt.show()
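My question is: why do synonyms such as "collision" and "accident", or "people" and "men", have different coordinates? If words are the same or synonymous, shouldn't they be closer together?
from sklearn.metrics.pairwise import euclidean_distances
distance = euclidean_distances(tokens1) # returns shape (8,8)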
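When `euclidean_distances` is called with a single argument, it compares every row with every other row, so n vectors give an (n, n) matrix. A plain-Python sketch of the same computation on made-up 2-D vectors (the `pairwise_euclidean` helper is hypothetical and not part of scikit-learn):

```python
import math

def pairwise_euclidean(vectors):
    """Return an n x n list of Euclidean distances between all pairs of vectors."""
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
             for v in vectors]
            for u in vectors]

# Made-up 2-D vectors standing in for the 400-dimensional embeddings
toy_vectors = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
dist = pairwise_euclidean(toy_vectors)
print(len(dist), len(dist[0]))
print(dist[0][1])
```

Distances computed this way are measured in the original embedding space, before any projection to two dimensions.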
【Discussion】:
Tags: python nlp word2vec word-embedding