【Question Title】: How to preprocess tensorflow imdb_review dataset
【Posted】: 2021-02-25 07:31:14
【Question】:

I am using the tensorflow imdb_review dataset, and I want to preprocess it with Tokenizer and pad_sequences.

When I use a Tokenizer instance and run the following code:

tokenizer=Tokenizer(num_words=100)
tokenizer.fit_on_texts(df['text'])
word_index = tokenizer.word_index
sequences=tokenizer.texts_to_sequences(df['text'])

print(word_index)
print(sequences)

I get the error TypeError: a bytes-like object is required, not 'dict'.

My attempt

I stored the dataset as a DataFrame, then iterated over the text column, appended each entry to a list, and tokenized that list.

df = tfds.as_dataframe(ds.take(4), info)
# list to store corpus
corpus = []
for sentences in df['text'].iteritems():
  corpus.append(sentences)

tokenizer=Tokenizer(num_words=100)
tokenizer.fit_on_texts(corpus)
word_index=tokenizer.word_index
print(word_index)

But then I get the error AttributeError: 'tuple' object has no attribute 'lower'.

How can I take the 'text' column and preprocess it so I can feed it to my neural network?

【Question Comments】:

    Tags: python pandas tensorflow


    【Solution 1】:

    You need to convert the ['text'] column to numpy first, then do the necessary tokenization and padding. Below is the complete working code. Enjoy.

    Dataset

    import numpy as np
    import tensorflow as tf
    import tensorflow_datasets as tfds
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    
    # get the data first
    imdb = tfds.load('imdb_reviews', as_supervised=True)
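
    As a quick sanity check: with as_supervised=True, tfds.load returns (text, label) tuples, which is why the data-preparation loop below unpacks two values per example.

    # peek at a single (text, label) pair; tf.string tensors come back as Python bytes
    for sentence, label in imdb['train'].take(1):
        print(sentence.numpy()[:60])  # e.g. b'This was an absolutely terrible movie...'
        print(label.numpy())          # e.g. 0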
    

    Data preparation

    # we will only take train_data (for demonstration purpose)
    # do the same for test_data in your case 
    train_data, test_data = imdb['train'], imdb['test']
    
    training_sentences = []
    training_labels = []
    
    for sentence, label in train_data:
        training_sentences.append(str(sentence.numpy()))
        training_labels.append(str(label.numpy()))
    
    training_labels_final = np.array(training_labels).astype(np.float32)  # np.float32, since the np.float alias is removed in newer NumPy
    print(training_sentences[0])    # first samples
    print(training_labels_final[0]) # first label 
    
    # b"This was an absolutely terrible movie. ...."
    # 0.0
    

    Preprocessing - Tokenizer + padding

    vocab_size = 2000 # The maximum number of words to keep, based on word frequency. 
    embed_size = 30   # Dimension of the dense embedding.
    max_len = 100     # Length of input sequences, when it is constant.
    
    # https://keras.io/api/preprocessing/text/
    tokenizer = Tokenizer(num_words=vocab_size, 
                          filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True,
                          split=" ",
                          oov_token="<OOV>")
    tokenizer.fit_on_texts(training_sentences)
    print(tokenizer.word_index) 
    # {'<OOV>': 1, 'the': 2, 'and': 3, 'a': 4, 'of': 5, 'to': 6, 'is': 7, ...
    
    # tokenized and padding 
    training_sequences = tokenizer.texts_to_sequences(training_sentences)
    training_padded = pad_sequences(training_sequences, maxlen=max_len, truncating='post')
    print(training_sentences[0])
    print()
    print(training_padded[0])
    
    # b"This was an absolutely terrible movie. ...."
    #
    [  59   12   14   35  439  400   18  174   29    1    9   33 1378    1
       42  496    1  197   25   88  156   19   12  211  340   29   70  248
      213    9  486   62   70   88  116   99   24    1   12    1  657  777
       12   18    7   35  406    1  178    1  426    2   92 1253  140   72
      149   55    2    1    1   72  229   70    1   16    1    1    1    1
     1506    1    3   40    1  119 1608   17    1   14  163   19    4 1253
      927    1    9    4   18   13   14    1    5  102  148 1237   11  240
      692   13]
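
    The comment at the top of the data-preparation block says to do the same for test_data. A minimal sketch of that step, reusing the tokenizer already fitted on the training sentences (the testing_* names are mine, not from the original answer):

    testing_sentences = []
    testing_labels = []

    for sentence, label in test_data:
        testing_sentences.append(str(sentence.numpy()))
        testing_labels.append(str(label.numpy()))

    testing_labels_final = np.array(testing_labels).astype(np.float32)

    # reuse the fitted tokenizer; refitting on the test split would leak its vocabulary
    testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
    testing_padded = pad_sequences(testing_sequences, maxlen=max_len, truncating='post')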
    

    Model

    An example model.

    # Input for variable-length sequences of integers
    inputs = tf.keras.Input(shape=(None,), dtype="int32")
    # Embed each integer 
    x = tf.keras.layers.Embedding(input_dim = vocab_size, 
                                  output_dim = embed_size,
                                  input_length=max_len)(inputs)
    # Add 2 bidirectional LSTMs
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
    # Add a classifier
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    
    # Compile and Run 
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    model.fit(training_padded,
              training_labels_final,
              epochs=10,
              verbose=1)
    
    Epoch 1/10
    782/782 [==============================] - 25s 18ms/step - loss: 0.5548 - accuracy: 0.6915
    Epoch 2/10
    782/782 [==============================] - 14s 18ms/step - loss: 0.3921 - accuracy: 0.8248
    ...
    782/782 [==============================] - 14s 18ms/step - loss: 0.2171 - accuracy: 0.9121
    Epoch 9/10
    782/782 [==============================] - 14s 17ms/step - loss: 0.1807 - accuracy: 0.9275
    Epoch 10/10
    782/782 [==============================] - 14s 18ms/step - loss: 0.1486 - accuracy: 0.9428
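
    If you prepared testing_padded and testing_labels_final as sketched above, evaluating on the held-out split is one call (a hypothetical continuation, not output from the original answer):

    loss, acc = model.evaluate(testing_padded, testing_labels_final, verbose=0)
    print(f'test loss: {loss:.4f} - test accuracy: {acc:.4f}')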
    

    【Comments】:

    • Thanks! I was stuck on the data-preparation part; this was very helpful.

    【Solution 2】:

    You can convert the df['text'] column to a NumPy array by calling the to_numpy() method. See the documentation here. Also see the documentation for Tokenizer.fit_on_texts here.

    corpus = df[ 'text' ].to_numpy()
    tokenizer = Tokenizer( num_words=100 )
    tokenizer.fit_on_texts(corpus)
    

    The Tokenizer.fit_on_texts method internally calls text_elem.lower(). Since you are not passing it a list of strings, you hit that exception. Here is a snippet from the source.

      ...
      for text in texts:
          self.document_count += 1
          if self.char_level or isinstance(text, list):
              if self.lower:
                  if isinstance(text, list):
                      text = [text_elem.lower() for text_elem in text]
                  else:
      ...
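
    Based on this, a minimal corrected version of the loop from the question (a sketch, assuming the df built with tfds.as_dataframe): iteritems() yields (index, value) pairs, so keep only the value, and decode it from bytes to str, which is how a tfds-backed DataFrame stores tf.string data.

    corpus = []
    for _, sentence in df['text'].items():  # items() yields (index, value) pairs
        # Tokenizer expects str, but the column holds bytes, so decode first
        corpus.append(sentence.decode('utf-8') if isinstance(sentence, bytes) else str(sentence))

    tokenizer = Tokenizer(num_words=100)
    tokenizer.fit_on_texts(corpus)
    print(tokenizer.word_index)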
    

    【Comments】:

    • I tried .to_numpy() after reading your answer, but I still get the same error, which is strange because when I check the type of corpus, the output is dtype=object, which is what the error seems to call for. I even tried a list of strings as shown in the tf docs. I will try downloading the dataset and using it externally, and I'll update when I find a solution.
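
    A plausible cause for the error persisting after to_numpy(), though the thread never confirms it: dtype=object looks identical whether the elements are bytes or str, so checking the array's dtype cannot distinguish the two. A quick test (using the df from the question):

    corpus = df['text'].to_numpy()
    print(corpus.dtype)      # object, for both bytes and str elements
    print(type(corpus[0]))   # <class 'bytes'> for a tfds-backed DataFrame
    corpus = [t.decode('utf-8') for t in corpus]  # decode to str before fit_on_texts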