【Question title】: Create a TensorFlow Dataset from a Pandas data frame with numerous labels?
【Posted】: 2021-12-11 20:46:49
【Question】:

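I'm trying to load a pandas DataFrame into a TensorFlow dataset. The columns are text [string] and labels [a list stored as a string].

A row looks like: text: "Hi, I'm here, ...." labels: [0, 1, 1, 0, 1, 0, 0, 0, ...]

Each text has probabilities for 17 labels.

I can't find a way to load the dataset as arrays and call model.fit(). I read a lot of answers and tried the following code inside df_to_dataset().

I can't figure out what I'm missing here.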

labels = labels.apply(lambda x: np.asarray(literal_eval(x)))  # Cast to a list
labels = labels.apply(lambda x: [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # Straight out list ..

#  ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

Printing one row (from the returned dataset) shows:

({'text': <tf.Tensor: shape=(), dtype=string, numpy=b'Text in here'>}, <tf.Tensor: shape=(), dtype=string, numpy=b'[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0, 0]'>)

When I don't apply any conversion, model.fit raises an exception because it can't work with strings.

UnimplementedError:  Cast string to float is not supported
     [[node sparse_categorical_crossentropy/Cast (defined at <ipython-input-102-71a9fbf2d907>:4) ]] [Op:__inference_train_function_1193273]

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('labels')

  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  return ds

train_ds = df_to_dataset(df_train, batch_size=batch_size)
val_ds = df_to_dataset(df_val, batch_size=batch_size)
test_ds = df_to_dataset(df_test, batch_size=batch_size)

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')

  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)

  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.2)(net)
  net = tf.keras.layers.Dense(17, activation='softmax', name='classifier')(net)

  return tf.keras.Model(text_input, net)


classifier_model = build_classifier_model()

loss = 'sparse_categorical_crossentropy'
metrics = ["accuracy"]
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)

history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=epochs)

【Comments on the question】:

    Tags: pandas tensorflow keras tensorflow-datasets


    【Answer 1】:

    Maybe try preprocessing your data frame before using tf.data.Dataset.from_tensor_slices. Here is a simple working example:

    import tensorflow as tf
    import tensorflow_text as tf_text  # not called directly; importing it registers the ops the BERT preprocessing layer needs
    import tensorflow_hub as hub
    import pandas as pd
    
    def build_classifier_model():
      text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    
      preprocessing_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/1', name='preprocessing')
      encoder_inputs = preprocessing_layer(text_input)
    
      encoder = hub.KerasLayer('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2', trainable=True, name='BERT_encoder')
      outputs = encoder(encoder_inputs)
      net = outputs['pooled_output']
      net = tf.keras.layers.Dropout(0.2)(net)
      net = tf.keras.layers.Dense(5, activation='softmax', name='classifier')(net)
      return tf.keras.Model(text_input, net)
    
    def remove_and_split(s):
      s = s.replace('[', '') 
      s = s.replace(']', '')  
      return s.split(',')
     
    def df_to_dataset(dataframe, shuffle=True, batch_size=2):
      dataframe = dataframe.copy()
      labels = tf.squeeze(tf.constant([dataframe.pop('labels')]), axis=0)
      ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels)).batch(
            batch_size)
      return ds
    
    dummy_data = {'text': [
    "Improve the physical fitness of your goldfish by getting him a bicycle",
    "You are unsure whether or not to trust him but very thankful that you wore a turtle neck",
    "Not all people who wander are lost", 
    "There is a reason that roses have thorns",
    "Charles ate the french fries knowing they would be his last meal",
    "He hated that he loved what she hated about hate",
    ], 'labels': ['[0, 1, 1, 1, 1]', '[1, 1, 1, 0, 0]', '[1, 0, 1, 0, 0]', '[1, 0, 1, 0, 0]', '[1, 1, 1, 0, 0]', '[1, 1, 1, 0, 0]']}  
    
    df = pd.DataFrame(dummy_data)  
    df["labels"] = df["labels"].apply(lambda x: [int(i) for i in remove_and_split(x)])
    batch_size = 2
    
    train_ds = df_to_dataset(df, batch_size=batch_size)
    val_ds = df_to_dataset(df, batch_size=batch_size)
    test_ds = df_to_dataset(df, batch_size=batch_size)
    
    loss = 'categorical_crossentropy'
    metrics = ["accuracy"]
    
    classifier_model = build_classifier_model()
    classifier_model.compile(optimizer='adam',
                             loss=loss,
                             metrics=metrics)
    
    history = classifier_model.fit(x=train_ds,
                                   validation_data=val_ds,
                                   epochs=5)
    

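    Also, when you use the BERT preprocessing layer, don't forget to batch the dataset you build with tf.data.Dataset.from_tensor_slices (here via .batch(batch_size)). I also changed your loss function to categorical_crossentropy, because you are using one-hot encoded labels (at least that is what I can infer from your question). The sparse_categorical_crossentropy loss function expects integer labels rather than one-hot encodings.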
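    To make the difference between the two label formats concrete, here is a minimal plain-Python sketch. The helper functions and the probability values are made up for illustration; they only mirror the idea behind the Keras losses of the same names and are not the Keras implementations (which also clip probabilities, for example).

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Cross-entropy when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(class_index, probs):
    """Cross-entropy when the target is a single integer class index."""
    return -math.log(probs[class_index])

# Hypothetical softmax output for one example with three classes.
probs = [0.1, 0.7, 0.2]

one_hot = [0, 1, 0]              # format expected by categorical_crossentropy
class_index = one_hot.index(1)   # format expected by sparse_categorical_crossentropy

cce = categorical_crossentropy(one_hot, probs)
scce = sparse_categorical_crossentropy(class_index, probs)

# Both formats describe the same target, so the two losses agree.
print(math.isclose(cce, scce))
```

    Note that the dummy rows in the example above, such as [0, 1, 1, 1, 1], contain several 1s, so strictly speaking they are multi-hot rather than one-hot. If the 17 labels in the question are independent of each other (an assumption about the data), a sigmoid output layer with the binary_crossentropy loss is the more common Keras pairing.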

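    【Comments】:

    • Your example works perfectly. Your answer made me realize that one of my main problems is my lack of understanding of tensor structure.
    【Answer 2】:

    You can use the tf.strings functions inside the map method.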

    import tensorflow as tf
    
    x = ['[0, 1, 0]', '[1, 1, 0]']
    
    
    def splitter(string):
        string = tf.strings.substr(string, 1, tf.strings.length(string) - 2) # no brackets
        string = tf.strings.split(string, ', ')                              # isolate int
        string = tf.strings.to_number(string, out_type=tf.int32)             # as integer
        return string
    
    
    ds = tf.data.Dataset.from_tensor_slices(x).map(splitter)
    
    next(iter(ds))
    
    <tf.Tensor: shape=(3,), dtype=int32, numpy=array([0, 1, 0])>
    

    That said, you might as well change your DataFrame so that the targets are already one-hot encoded.
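    One way to make that change up front is to parse each label string into a fixed-length list of floats before the data reaches TensorFlow. The sketch below uses only the standard library; parse_labels and NUM_LABELS are hypothetical names, and the sample row is the one printed in the question (it contains 1.0, so the values are kept as floats rather than ints).

```python
from ast import literal_eval

NUM_LABELS = 17  # number of labels per text, taken from the question

def parse_labels(raw, num_labels=NUM_LABELS):
    """Turn a string such as '[0, 1, 1.0]' into a list of floats of fixed length."""
    values = literal_eval(raw)
    if len(values) != num_labels:
        raise ValueError(f"expected {num_labels} labels, got {len(values)}")
    return [float(v) for v in values]

# Label string as printed from the dataset in the question.
row = '[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0, 0]'
parsed = parse_labels(row)
print(len(parsed), parsed[10])
```

    With pandas this could be applied as df['labels'] = df['labels'].apply(parse_labels). Converting the column with np.array(df['labels'].tolist(), dtype='float32') should then give a single (n_rows, 17) array that tf.data.Dataset.from_tensor_slices accepts, instead of an object column of Python lists.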

    【Comments】:
