According to the TensorFlow documentation on embedding columns:

Suppose instead of having just a few possible strings, we have thousands
(or more) values per category. For a number of reasons, as the number of
categories grows large, it becomes infeasible to train a neural network
using one-hot encodings. We can use an embedding column to overcome this
limitation. Instead of representing the data as a one-hot vector of many
dimensions, an embedding column represents that data as a lower-dimensional,
dense vector in which each cell can contain any number, not just 0 or 1.
An embedding column is best used when a categorical column has many possible values.
The input to tf.feature_column.embedding_column must be a CategoricalColumn created by any of the categorical_column_* functions.
Syntax:
tf.feature_column.embedding_column(
categorical_column, dimension, combiner='mean', initializer=None,
ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True,
use_safe_embedding_lookup=True
)
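As a minimal sketch of the intended usage (the column name "pet_id" and the numbers below are made up for illustration, not from the post): a categorical column with many possible ids is embedded into a small dense vector instead of a wide one-hot vector.

```python
import numpy as np
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers

# A categorical column with 1000 possible integer ids...
pet = feature_column.categorical_column_with_identity("pet_id", num_buckets=1000)
# ...embedded into an 8-dimensional dense vector instead of a 1000-wide one-hot.
pet_embedding = feature_column.embedding_column(pet, dimension=8)

layer = layers.DenseFeatures(pet_embedding)
out = layer({"pet_id": np.array([3, 42, 999])})
print(out.shape)  # one 8-dimensional dense vector per example
```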
When I pass a numeric_column as input instead of a categorical_column, I get AttributeError: 'NumericColumn' object has no attribute 'num_buckets':
age_embedding = feature_column.embedding_column(age, dimension=8)
demo(age_embedding)
Output:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-94a5fc74016e> in <module>()
1 age_embedding = feature_column.embedding_column(age, dimension=8)
----> 2 demo(age_embedding)
4 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/feature_column/feature_column_v2.py in create_state(self, state_manager)
3181 """Creates the embedding lookup variable."""
3182 default_num_buckets = (self.categorical_column.num_buckets
-> 3183 if self._is_v2_column
3184 else self.categorical_column._num_buckets) # pylint: disable=protected-access
3185 num_buckets = getattr(self.categorical_column, 'num_buckets',
AttributeError: 'NumericColumn' object has no attribute 'num_buckets'
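The traceback shows that embedding_column looks up num_buckets on its input, which only a CategoricalColumn provides. One possible workaround (a sketch, not from the original post; the bucket boundaries are arbitrary) is to discretize the numeric feature with tf.feature_column.bucketized_column first, since the result is categorical and can be embedded:

```python
import numpy as np
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers

# A numeric column alone cannot be embedded; bucketize it first.
age = feature_column.numeric_column("age")
# 4 boundaries -> 5 buckets, a valid CategoricalColumn for embedding_column.
age_buckets = feature_column.bucketized_column(age, boundaries=[30, 40, 50, 60])
age_embedding = feature_column.embedding_column(age_buckets, dimension=8)

feature_layer = layers.DenseFeatures(age_embedding)
batch = {"age": np.array([25, 45, 63])}
print(feature_layer(batch).shape)  # one 8-dimensional vector per example
```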
When I pass a categorical_column as input, it converts the values to a dense representation. Here is the full code:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
URL = 'https://storage.googleapis.com/applied-dl/heart.csv'
dataframe = pd.read_csv(URL)
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds
batch_size = 5 # A small batch size is used for demonstration purposes
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
example_batch = next(iter(train_ds))[0]
def demo(feature_column):
  feature_layer = layers.DenseFeatures(feature_column)
  print(feature_layer(example_batch).numpy())
age = feature_column.numeric_column("age")
thal = feature_column.categorical_column_with_vocabulary_list(
    'thal', ['fixed', 'normal', 'reversible'])
thal_embedding = feature_column.embedding_column(thal, dimension=8)
demo(thal_embedding)
Output:
193 train examples
49 validation examples
61 test examples
[[-0.4675103 0.61985296 0.06297898 0.00818724 0.05449321 -0.6865342
-0.05250816 -0.13339798]
[-0.4675103 0.61985296 0.06297898 0.00818724 0.05449321 -0.6865342
-0.05250816 -0.13339798]
[-0.4675103 0.61985296 0.06297898 0.00818724 0.05449321 -0.6865342
-0.05250816 -0.13339798]
[ 0.3212179 0.29932576 -0.44579896 -0.4998746 0.064592 0.16934885
0.02404759 0.5051637 ]
[-0.4675103 0.61985296 0.06297898 0.00818724 0.05449321 -0.6865342
-0.05250816 -0.13339798]]
For more details, refer here.