Hashing_trick in Keras. How it works?

【Question title】: Hashing_trick in Keras. How it works?
【Posted】: 2019-08-23 04:13:41
【Question】:

I need a basic understanding of one-hot encoding and the hashing trick in Keras.

from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
print(words)
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(text)
print(result)

Output:

{'over', 'the', 'lazy', 'dog', 'quick', 'brown', 'jumped', 'fox'}
8
The quick brown fox jumped over the lazy dog dog.
[6, 4, 1, 2, 7, 5, 6, 2, 6, 6]

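Conclusion: each token is assigned an integer, e.g. "quick" is assigned 4.
the -> 6, quick -> 4, brown -> 1, fox -> 2, jumped -> 7, over -> 5, the -> 6, lazy -> 2, dog -> 6

I would like to understand how "the" and "dog" end up with the same integer, 6. Please correct me if I am wrong, and explain how this happens.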

【Comments】:

Tags: python-3.x tensorflow keras deep-learning

【Solution 1】:

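This is an example of a hash collision. A hash function is just a function computed over the input word. For example, Java's default string hash does something like multiplying the last character by 1, the second-to-last by 31, the third-to-last by 31^2, and so on, then adding the results together.

Nothing guarantees that two different strings will not produce the same number.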
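To make the collision idea concrete, here is a minimal sketch of a Java-style polynomial string hash written in Python. The name java_string_hash is made up for this illustration, and the sketch assumes ASCII input (Java hashes UTF-16 code units, which match ord() only for such characters). It shows two different strings landing on the same value.

```python
def java_string_hash(s: str) -> int:
    """Java-style string hash: h = 31 * h + char, kept to 32 bits."""
    h = 0
    for ch in s:
        # Keep only the low 32 bits, as Java int arithmetic overflows silently.
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the 32-bit value as a signed integer, like a Java int.
    return h - 0x100000000 if h >= 0x80000000 else h


# Two different strings, one hash value: a collision.
print(java_string_hash("Aa"), java_string_hash("BB"))
```

Both calls give 2112, because 65*31 + 97 and 66*31 + 66 are the same number.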

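The problem becomes more pronounced with a small vocabulary size. For example, with a plain modulo of 10, a hash of 11 "wraps around" to 1. (A modulo operation maps arbitrarily large integers into a fixed range. Keras's hashing_trick maps them into 1 to n - 1, where n is the size you pass in, because index 0 is reserved.) In the question, n is round(8 * 1.3) = 10, so 8 distinct words are squeezed into only 9 possible values.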
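The wrap-around can be reproduced without Keras. The sketch below is modeled on what hashing_trick appears to do with hash_function='md5': take the MD5 digest as an integer, reduce it modulo n - 1, and add 1. That formula is my understanding of the library's behaviour rather than something shown in the question, and md5_bucket is a hypothetical helper, not a Keras function.

```python
from hashlib import md5


def md5_bucket(word: str, n: int) -> int:
    """Map a word to an integer in [1, n - 1] using MD5 and a modulo."""
    digest = int(md5(word.encode()).hexdigest(), 16)  # a very large integer
    return digest % (n - 1) + 1                        # wrapped into 1..n-1


words = "the quick brown fox jumped over the lazy dog dog".split()
n = 10  # round(8 * 1.3), as in the question
ids = [md5_bucket(w, n) for w in words]

for word, idx in zip(words, ids):
    print(word, idx)
```

The same word always gets the same integer, but nothing stops two different words from sharing one, since only n - 1 values are available.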

If you want to make collisions less likely, a larger size such as vocab_size = 10*len(words) can reduce their number.

I am not sure what the downstream cost of a larger vocabulary is, though.
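A quick way to check the effect of a larger size is to count how many distinct words lose their unique integer at each size. This is only a sketch: md5_bucket and count_collisions are hypothetical helpers, the size factors are arbitrary, and the MD5-plus-modulo formula is assumed to mirror Keras rather than taken from it.

```python
from hashlib import md5


def md5_bucket(word: str, n: int) -> int:
    """Map a word to an integer in [1, n - 1] using MD5 and a modulo."""
    return int(md5(word.encode()).hexdigest(), 16) % (n - 1) + 1


def count_collisions(words, n: int) -> int:
    """Distinct words minus distinct integers assigned to them at size n."""
    unique = set(words)
    return len(unique) - len({md5_bucket(w, n) for w in unique})


words = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog"]
for factor in (1.3, 10, 100):
    n = round(len(words) * factor)
    print(n, count_collisions(words, n))
```

A larger size lowers the chance of a collision but never guarantees zero, so it is worth measuring on the actual vocabulary.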

【Comments】: