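Get embedding vectors from Embedding Column in TensorFlow: answers

【Question title】: Get embedding vectors from Embedding Column in TensorFlow
【Posted】: 2020-03-28 18:42:33
【Question description】:

I want to obtain, as NumPy vectors, the embeddings created by an "embedding column" in TensorFlow.

For example, create a sample DataFrame: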

import pandas as pd

sample_column1 = ["Apple","Apple","Mango","Apple","Banana","Mango","Mango","Banana","Banana"]
sample_column2 = [1,2,1,3,4,6,2,1,3]
ds = pd.DataFrame(sample_column1, columns=["A"])
ds["B"] = sample_column2
ds

Convert the pandas DataFrame into a TensorFlow dataset:

import tensorflow as tf

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()  # avoid mutating the caller's DataFrame
    labels = dataframe.pop('B')   # column 'B' is used as the label
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

Create the embedding column:

from tensorflow import feature_column

tf_ds = df_to_dataset(ds)
# embedding cols
col_a = feature_column.categorical_column_with_vocabulary_list(
  'A', ['Apple', 'Mango', 'Banana'])
col_a_embedding = feature_column.embedding_column(col_a, dimension=8)

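Is there a way to get the embeddings out of the 'col_a_embedding' object as NumPy vectors?

For example,

the category "Apple" would be embedded into a vector of size 8:

[a1 a2 a3 a4 a5 a6 a7 a8]

Can we get that vector?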

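【Comments】:

It's really hard to understand what you need. Can you give an example?
What do you mean by "bumpy vectors"?
@thushv89 I want to get the embedding vectors. Each category is embedded into a vector of a given dimension, and I want to get that vector.
@greeness Sorry, that was a typo. I meant numpy.

Tags: numpy tensorflow deep-learning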

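【Solution 1】:

I don't see a way to get what you want using feature columns (I don't see a function named sequence_embedding_column or similar among the available functions in tf.feature_column), because the result of a feature column appears to be a fixed-length tensor. Feature columns achieve this by using a combiner (sum, mean, sqrtn, etc.) to aggregate the individual embedding vectors, so the dimension over the sequence of categories is effectively lost.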
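To illustrate what a combiner does to the per-category vectors, here is a minimal NumPy sketch. The embedding values and ids are made up for illustration, and the sqrtn line assumes unit weights (sum divided by the square root of the number of ids); it is not the feature-column implementation itself.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings for three category ids.
embedding_table = np.array([[1.0, 2.0],
                            [3.0, 4.0],
                            [5.0, 6.0]])

# One example that contains two categories (ids 0 and 2).
ids = np.array([0, 2])
vectors = embedding_table[ids]  # shape (2, 2): one row per category

# Each combiner collapses the category axis into a single fixed-length vector.
combined_sum = vectors.sum(axis=0)
combined_mean = vectors.mean(axis=0)
combined_sqrtn = vectors.sum(axis=0) / np.sqrt(len(ids))  # assumes unit weights
```

Whichever combiner is used, the output has shape (2,), so the individual rows of `vectors` can no longer be recovered from it.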

But it is entirely doable if you use the lower-level API. First, you can build a lookup table that converts the categorical strings to ids.

features = tf.constant(["apple", "banana", "apple", "mango"])
table = tf.lookup.index_table_from_file(
    vocabulary_file="fruit.txt", num_oov_buckets=1)
ids = table.lookup(features)

#Content of "fruit.txt"
apple
mango
banana
unknown

Now you can initialize the embeddings as a 2-D variable. Its shape is [number of categories, embedding dimension].

num_categories = 3
embedding_dim = 64
category_emb = tf.get_variable(
                "embedding_table", [num_categories, embedding_dim],
                initializer=tf.truncated_normal_initializer(stddev=0.02))

Then you can look up the category embeddings like this:

ids_embeddings = tf.nn.embedding_lookup(category_emb, ids)

Note that the result in ids_embeddings is one long concatenated tensor. Feel free to reshape it into whatever shape you want.
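For TensorFlow 2.x, the same three steps (string to id, embedding table, lookup) can be sketched as follows. This is an illustration under assumptions rather than the answer's original code: it uses an in-memory vocabulary instead of "fruit.txt", tf.lookup.StaticVocabularyTable for the lookup table, and a plain tf.Variable for the embedding table. The table is given one row per vocabulary entry plus one per OOV bucket so that every id the lookup can return has a row.

```python
import numpy as np
import tensorflow as tf

# Assumed in-memory vocabulary (the answer reads it from "fruit.txt").
vocab = ["Apple", "Mango", "Banana"]
num_oov_buckets = 1  # unseen strings are mapped to this extra bucket
embedding_dim = 8

# Step 1: lookup table that maps each string to its index in `vocab`.
initializer = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(vocab),
    values=tf.range(len(vocab), dtype=tf.int64),
)
table = tf.lookup.StaticVocabularyTable(initializer, num_oov_buckets=num_oov_buckets)

# Step 2: embedding table with one row per vocabulary entry plus OOV buckets.
embedding_table = tf.Variable(
    tf.random.truncated_normal(
        [len(vocab) + num_oov_buckets, embedding_dim], stddev=0.02),
    name="embedding_table",
)

# Step 3: convert strings to ids, then ids to embedding vectors.
features = tf.constant(["Apple", "Banana", "Apple", "Mango"])
ids = table.lookup(features)
ids_embeddings = tf.nn.embedding_lookup(embedding_table, ids)

# In eager mode, .numpy() returns the values as NumPy arrays.
apple_vector = ids_embeddings[0].numpy()  # vector for "Apple"
all_vectors = embedding_table.numpy()     # the whole table, one row per id
```

The vectors here are only the random initial values; after training, reading `embedding_table.numpy()` again would return the learned embeddings.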

【Comments】:

That's a good idea. Thanks.
When I try this with TensorFlow 2.0, I get the following error: AttributeError: module 'tensorflow_core._api.v2.lookup' has no attribute 'index_table_from_file'
In TF 2.0 there is a similar module, tf.lookup. You may want to switch to tf.lookup.StaticVocabularyTable. See tensorflow.org/api_docs/python/tf/lookup/….

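【Solution 2】:

The simplest and fastest approach I can suggest is the following, which is what I do in my own application:

Use pandas read_csv to read your file, using the dtype parameter to give the string column the pandas "category" type. Let's call this field "F". It is the raw string column, not yet a numeric column.

Still in pandas, create a new column and copy the original column's pandas cat.codes into it. Let's call this field "f_code". pandas automatically encodes it as a compactly represented numeric column. It will contain the numbers you need to pass to the neural network.

Now, for the Embedding layer in your Keras functional API neural network model, pass f_code to the model's input layer. The values in f_code will now be numbers, e.g. int8, and the Embedding layer will handle them correctly. Do not pass the original column to the model.

Below are some sample lines of code copied from my project that follow exactly the steps above.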

all_col_types_readcsv = {'userid':'int32','itemid':'int32','rating':'float32','user_age':'int32','gender':'category','job':'category','zipcode':'category'}

<some code omitted>

d = pd.read_csv(fn, sep='|', header=0, dtype=all_col_types_readcsv, encoding='utf-8', usecols=usecols_readcsv)

<some code omitted>

from pandas.api.types import is_string_dtype
# Select the columns to add code columns to. Numeric cols work fine with Embedding layer so ignore them.

cat_cols = [cn for cn in d.select_dtypes('category')]
print(cat_cols)
str_cols = [cn for cn in d.columns if is_string_dtype(d[cn])]
print(str_cols)
add_code_columns = [cn for cn in d.columns if (cn in cat_cols) and (cn in str_cols)]
print(add_code_columns)

<some code omitted>

# Actually add _code column for the selected columns
for cn in add_code_columns:
  codecolname = cn + "_code"
  if codecolname not in d.columns:
    d[codecolname] = d[cn].cat.codes

You can see the numeric codes that pandas created for you:

d.info()
d.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99991 entries, 0 to 99990
Data columns (total 5 columns):
userid      99991 non-null int32
itemid      99991 non-null int32
rating      99991 non-null float32
job         99991 non-null category
job_code    99991 non-null int8
dtypes: category(1), float32(1), int32(2), int8(1)
memory usage: 1.3 MB
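As a smaller illustration of the cat.codes step, here is a sketch using the fruit column from the question instead of the answer's CSV file. The column name "A_code" and the `code_to_category` mapping are introduced here for illustration only.

```python
import pandas as pd

# The string column from the question, stored with the pandas "category" dtype.
ds = pd.DataFrame({"A": ["Apple", "Apple", "Mango", "Apple", "Banana",
                         "Mango", "Mango", "Banana", "Banana"]})
ds["A"] = ds["A"].astype("category")

# cat.codes gives each category an integer id (categories are sorted by default).
ds["A_code"] = ds["A"].cat.codes

# Mapping from integer code back to the original string, useful later for
# matching rows of an embedding matrix to category names.
code_to_category = dict(enumerate(ds["A"].cat.categories))
```

With this data the codes follow the sorted category order, so "Apple" is 0, "Banana" is 1, and "Mango" is 2.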

Finally, you can drop the job column and keep the job_code column (in this example) to pass into your Keras neural network model. Here is some of my model code:

v = Lambda(lambda z: z[:, field_num0_X_cols[cn]], output_shape=(), name="Parser_" + cn)(input_x)
emb_input = Lambda(lambda z: tf.expand_dims(z, axis=-1), output_shape=(1,), name="Expander_" + cn)(v)
a = Embedding(input_dim=num_uniques[cn]+1, output_dim=emb_len[cn], input_length=1, embeddings_regularizer=reg, name="E_" + cn)(emb_input)
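To connect this back to the original question, the following sketch shows one way to read the vectors out of a Keras Embedding layer as NumPy arrays. It is a standalone illustration, not part of the answer's model: the category list, the codes, and the layer name "E_A" are assumptions, and the weights shown are the untrained initial values.

```python
import numpy as np
import tensorflow as tf

# Assumed categories and their integer codes (e.g. produced by pandas cat.codes).
categories = ["Apple", "Banana", "Mango"]
codes = np.array([0, 0, 2, 0, 1, 2, 2, 1, 1], dtype="int32")

# One embedding row per category, each of size 8.
embedding = tf.keras.layers.Embedding(
    input_dim=len(categories), output_dim=8, name="E_A")

# Calling the layer once creates its weight matrix.
_ = embedding(tf.constant(codes))

# get_weights() returns NumPy arrays; the first one is the embedding matrix.
weights = embedding.get_weights()[0]  # shape (3, 8)

# Row i of the matrix is the vector for the category with code i.
vectors = {name: weights[i] for i, name in enumerate(categories)}
```

In a trained model the same call on the fitted layer (for example via `model.get_layer("E_A").get_weights()[0]`) would return the learned vectors.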

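By the way, when passing any pandas DataFrames to model.fit(), wrap them in np.array(). It is not well documented, and apparently not checked at runtime either, that pandas DataFrames cannot safely be passed in. Otherwise you get huge memory allocations that can crash the host.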
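Following that advice, the conversion could look like the sketch below. The small DataFrame is a hypothetical stand-in for the answer's `d`, and the commented model.fit call only indicates where the arrays would go.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the answer's DataFrame `d`.
d = pd.DataFrame({
    "job_code": pd.Series([0, 2, 1, 2], dtype="int8"),
    "rating": pd.Series([4.0, 3.0, 5.0, 1.0], dtype="float32"),
})

# Convert the inputs and targets to plain NumPy arrays before training.
x = np.array(d[["job_code"]])  # 2-D array, one column per selected field
y = np.array(d["rating"])      # 1-D array of targets

# model.fit(x, y, ...) would then receive NumPy arrays rather than DataFrames.
```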

【Comments】: