According to the TensorFlow documentation on embedding columns:

Suppose instead of having just a few possible strings, we have thousands
(or more) values per category. For a number of reasons, as the number of
categories grows large, it becomes infeasible to train a neural network
using one-hot encodings. We can use an embedding column to overcome this
limitation. Instead of representing the data as a one-hot vector of many
dimensions, an embedding column represents that data as a lower-dimensional,
dense vector in which each cell can contain any number, not just 0 or 1.
An embedding column is best used when a categorical column has many possible values.
The input to tf.feature_column.embedding_column must be a CategoricalColumn created by any of the categorical_column_* functions.
Syntax:
tf.feature_column.embedding_column(
categorical_column, dimension, combiner='mean', initializer=None,
ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True,
use_safe_embedding_lookup=True
)
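As a minimal sketch of the intended usage (the column name "pet_id" and the numbers below are made up for illustration, not from the post): a categorical column with many possible ids is embedded into a small dense vector instead of a wide one-hot vector.

```python
import numpy as np
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers

# A categorical column with 1000 possible integer ids...
pet = feature_column.categorical_column_with_identity("pet_id", num_buckets=1000)
# ...embedded into an 8-dimensional dense vector instead of a 1000-wide one-hot.
pet_embedding = feature_column.embedding_column(pet, dimension=8)

layer = layers.DenseFeatures(pet_embedding)
out = layer({"pet_id": np.array([3, 42, 999])})
print(out.shape)  # one 8-dimensional dense vector per example
```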
When I pass a numeric_column as input instead of a categorical_column, I get AttributeError: 'NumericColumn' object has no attribute 'num_buckets':
age_embedding = feature_column.embedding_column(age, dimension=8)
demo(age_embedding)
Output:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-94a5fc74016e> in <module>()
1 age_embedding = feature_column.embedding_column(age, dimension=8)
----> 2 demo(age_embedding)
4 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/feature_column/feature_column_v2.py in create_state(self, state_manager)
3181 """Creates the embedding lookup variable."""
3182 default_num_buckets = (self.categorical_column.num_buckets
-> 3183 if self._is_v2_column
3184 else self.categorical_column._num_buckets) # pylint: disable=protected-access
3185 num_buckets = getattr(self.categorical_column, 'num_buckets',
AttributeError: 'NumericColumn' object has no attribute 'num_buckets'
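The traceback shows that embedding_column looks up num_buckets on its input, which only a CategoricalColumn provides. One possible workaround (a sketch, not from the original post; the bucket boundaries are arbitrary) is to discretize the numeric feature with tf.feature_column.bucketized_column first, since the result is categorical and can be embedded:

```python
import numpy as np
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers

# A numeric column alone cannot be embedded; bucketize it first.
age = feature_column.numeric_column("age")
# 4 boundaries -> 5 buckets, a valid CategoricalColumn for embedding_column.
age_buckets = feature_column.bucketized_column(age, boundaries=[30, 40, 50, 60])
age_embedding = feature_column.embedding_column(age_buckets, dimension=8)

feature_layer = layers.DenseFeatures(age_embedding)
batch = {"age": np.array([25, 45, 63])}
print(feature_layer(batch).shape)  # one 8-dimensional vector per example
```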
When I pass a categorical_column as input, it converts the values to a dense representation. Here is the full code:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
URL = 'https://storage.googleapis.com/applied-dl/heart.csv'
dataframe = pd.read_csv(URL)
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds
batch_size = 5 # A small batch size is used for demonstration purposes
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
example_batch = next(iter(train_ds))[0]
def demo(feature_column):
  feature_layer = layers.DenseFeatures(feature_column)
  print(feature_layer(example_batch).numpy())
age = feature_column.numeric_column("age")
thal = feature_column.categorical_column_with_vocabulary_list(
    'thal', ['fixed', 'normal', 'reversible'])
thal_embedding = feature_column.embedding_column(thal, dimension=8)
demo(thal_embedding)
Output:
193 train examples
49 validation examples
61 test examples
[[-0.4675103 0.61985296 0.06297898 0.00818724 0.05449321 -0.6865342
-0.05250816 -0.13339798]
[-0.4675103 0.61985296 0.06297898 0.00818724 0.05449321 -0.6865342
-0.05250816 -0.13339798]
[-0.4675103 0.61985296 0.06297898 0.00818724 0.05449321 -0.6865342
-0.05250816 -0.13339798]
[ 0.3212179 0.29932576 -0.44579896 -0.4998746 0.064592 0.16934885
0.02404759 0.5051637 ]
[-0.4675103 0.61985296 0.06297898 0.00818724 0.05449321 -0.6865342
-0.05250816 -0.13339798]]
For more details, refer here.