Setting up a Keras model with a pandas DataFrame

【Question title】: Setting up Keras model with pandas dataframe
【Posted】: 2018-06-29 16:07:11
【Question】:

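This is my first time using Python and Keras for machine learning; I am used to MATLAB. I have a Parquet file with the labels in one column and the text in another. I take the text and vectorize it using GloVe embeddings, so after all that I am left with two columns: vectorized, where each row holds a NumPy ndarray of 4000 numbers, and label. I then try to use the vectorized column as the input to my model, and this is where I run into trouble.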

pd_df.head(1) #pd_df is my dataframe

Output:

    vectorized  label
0   [-0.10767000168561935, 0.11052999645471573, 0....   0

I then split my data and convert it to ndarrays:

from sklearn.model_selection import train_test_split

train, test = train_test_split(pd_df, test_size=0.3)

trainLabels = train.as_matrix(columns=['label'])
train = train.as_matrix(columns=['vectorized'])

testLabels = test.as_matrix(columns=['label'])
test = test.as_matrix(columns=['vectorized'])

Then I check the shape of the data:

train.shape
(410750, 1)

This is where my lack of NumPy knowledge shows, because this shape makes no sense to me. It seems like it should be (410750, 4000), since each element is an ndarray of 4000 items.

After this I set up my model:

from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import SGD
from keras.losses import binary_crossentropy
from keras.metrics import binary_accuracy

inputs = Input(shape=(4000,))

x = Dense(units=2000, activation='relu')(inputs)
x = Dense(units=500, activation='relu')(x)
output = Dense(units=2, activation='softmax')(x)

model = Model(inputs=inputs, outputs=output)
model.compile(optimizer=SGD(), loss=binary_crossentropy, metrics=['accuracy'])
model.fit(train, 
          trainLabels, 
          epochs=50,
          batch_size=50)

Then I keep getting this error:

ValueError: Error when checking input: expected input_13 to have shape (4000,) but got array with shape (1,)

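As I said, I am new to machine learning in the Python world, so any help would be great.

Thanks for your help.

【Comments】:

I have found the answer, sorry to have bothered anyone. I will post my solution in 24 hours.

Tags: python pandas numpy keras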

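【Solution 1】:

Your training data has only 1 dimension per sample (its shape is (410750, 1)), while your Input layer specifies 4000. Also, if you are using pre-trained word embeddings such as GloVe, you should use an Embedding layer. See this Keras blog post: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html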
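
For readers who have not seen the Embedding-layer approach, here is a minimal NumPy-only sketch of the idea behind it: build a matrix whose row i holds the pre-trained vector for the word with integer id i, then look rows up by id. The names `embeddings_index`, `word_index`, and `EMBEDDING_DIM`, as well as the toy vectors, are assumptions made for illustration and are not taken from the question.

```python
import numpy as np

EMBEDDING_DIM = 3  # toy size, chosen only to keep the example readable

# Hypothetical stand-in for vectors parsed from a GloVe text file: word -> vector
embeddings_index = {
    "good": np.array([0.1, 0.2, 0.3]),
    "bad": np.array([-0.1, -0.2, -0.3]),
}

# Hypothetical tokenizer vocabulary: word -> integer id, with 0 reserved for padding
word_index = {"good": 1, "bad": 2, "unseen": 3}

# Row i of the matrix is the vector for the word with id i.
# Words without a pre-trained vector (and the padding id 0) stay all zeros.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, idx in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector

# An embedding lookup is just row indexing with a sequence of ids
sequence = np.array([1, 2, 3, 0])
embedded = embedding_matrix[sequence]
print(embedded.shape)  # (4, 3): one vector per token
```

As I understand the linked post, a matrix like this is handed to a Keras Embedding layer as its initial, frozen weights, so the model receives sequences of integer ids rather than pre-vectorized text.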

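【Comments】:

Yes, I have been looking at that. I think the problem is that the shape says there is only one dimension, but it is really an array of 4000 x 1 arrays, so it has more than one dimension. I guess I am trying to figure out how to convert my array of arrays into the dimensions it should have.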
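
The "array of arrays" situation can be reproduced on a small scale. The sketch below uses a toy DataFrame (3 rows of 4-element vectors instead of 410750 rows of 4000) and `DataFrame.to_numpy()` in place of `as_matrix`, which newer pandas versions no longer provide. The toy data is an assumption, only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: each cell of "vectorized" holds its own ndarray
pd_df = pd.DataFrame({
    "vectorized": [np.arange(4, dtype=float) + i for i in range(3)],
    "label": [0, 1, 0],
})

# Selecting a one-column frame mirrors as_matrix(columns=['vectorized'])
train = pd_df[["vectorized"]].to_numpy()

print(train.shape)        # (3, 1): one cell per row
print(train.dtype)        # object: each cell is a reference to an ndarray
print(train[0, 0].shape)  # (4,): the inner vector is hidden inside the cell
```

Because the outer array has dtype object, NumPy only counts the cells, not the numbers inside them, which is why Keras sees one feature per sample.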

【Solution 2】:

To solve this, I had to unpack my array of arrays. The way I chose to do it:

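import numpy as np

# train has shape (n_samples, 1); each cell holds a 4000-element ndarray
xTrain = np.zeros((train.shape[0], 4000))

for i, vector in enumerate(train):  # train is my numpy array of arrays
    xTrain[i] = vector[0]           # copy the inner vector into row i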
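
The same unpacking can probably be done without an explicit loop by stacking the inner vectors with `np.stack`. This is a sketch on toy data (3 samples of length 4 rather than 410750 of length 4000). The toy `train` array is built by hand here and is an assumption about what the array in the question looks like.

```python
import numpy as np

# Toy stand-in for `train`: an object array of shape (n_samples, 1)
# whose cells each hold a 1-D float vector
n_samples, dim = 3, 4
train = np.empty((n_samples, 1), dtype=object)
for i in range(n_samples):
    train[i, 0] = np.full(dim, float(i))

# Take the single column, turn it into a list of vectors,
# and stack them along a new first axis
xTrain = np.stack(list(train[:, 0]))

print(xTrain.shape)  # (3, 4)
print(xTrain.dtype)  # float64
```

The result is an ordinary 2-D float array, so with the real data it should have the (n_samples, 4000) shape that `Input(shape=(4000,))` expects. All inner vectors must have the same length for the stack to succeed.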

【Comments】: