Classification of string of characters (modified): answers

【Question title】: Classification of string of characters (modified)
【Posted】: 2018-05-31 09:07:52
【Question description】:

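I am working on a problem with 32514 lines of scrambled characters such as "wewlsfnskfddsl...eredsda", each line 406 characters long. We need to predict which class each line belongs to. The classes are the titles of 12 books, labeled 1-12.

After searching the internet, I tried the approach below, but I get an error. Any help is much appreciated.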

#code
y = ytrain.values
#ytrain = y.ravel()
y = to_categorical(y, num_classes=12)

 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

X = X.reshape((1,32514,1))


# define model
model = Sequential()
model.add(LSTM(75, input_shape=(32514,1)))
model.add(Dense(12, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# fit model
model.fit(X, y, epochs=100, verbose=2)

# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

#(batch_size, input_dim)
#(batch_size, timesteps, input_dim)

#### I get the following error:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_19 (LSTM)               (None, 75)                23100     
_________________________________________________________________
dense_13 (Dense)             (None, 12)                912       
=================================================================
Total params: 24,012
Trainable params: 24,012
Non-trainable params: 0
_________________________________________________________________
None
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-503a6273e5d0> in <module>()
      7 
      8 # fit model
----> 9 model.fit(X, y, epochs=100, verbose=2)
     10 
     11 # save the model to file

/usr/local/lib/python3.6/dist-packages/keras/models.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1000                               initial_epoch=initial_epoch,
   1001                               steps_per_epoch=steps_per_epoch,
-> 1002                               validation_steps=validation_steps)
   1003 
   1004     def evaluate(self, x=None, y=None,

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1628             sample_weight=sample_weight,
   1629             class_weight=class_weight,
-> 1630             batch_size=batch_size)
   1631         # Prepare validation data.
   1632         do_validation = False

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size)
   1478                                     output_shapes,
   1479                                     check_batch_axis=False,
-> 1480                                     exception_prefix='target')
   1481         sample_weights = _standardize_sample_weights(sample_weight,
   1482                                                      self._feed_output_names)

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in _standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
    121                             ': expected ' + names[i] + ' to have shape ' +
    122                             str(shape) + ' but got array with shape ' +
--> 123                             str(data_shape))
    124     return data
    125 

ValueError: Error when checking target: expected dense_13 to have shape (1,) but got array with shape (12,)

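【Question discussion】:

This is called sequence classification. So basically you are trying to do character-level sequence classification. LSTMs are well suited to this.

Tags: deep-learning lstm random-forest rnn

【Solution 1】:

Machine learning and deep learning models for text classification are complicated to build. Here is a reference to help you get started.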

https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf

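Hope this helps! :-)

【Discussion】:

Hi Vaibhav, this paper does not use Python. Thanks anyway.

【Solution 2】:

In my opinion, you can solve this problem with an LSTM. Long short-term memory (LSTM) units (or blocks) are the building blocks of the layers of a recurrent neural network (RNN).

LSTMs help capture sequential information and are typically used when we want to learn sequential patterns in data.

You can tackle this problem with a character-level LSTM.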

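In this approach, you feed each character of the text into the LSTM cell, one character per timestep. At the last timestep the network outputs a class, which is compared against the true label.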
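To make "one character per timestep" concrete, here is a minimal pure-Python sketch of how the lines could be encoded before they reach an LSTM. The helper names (`build_vocab`, `encode_lines`, `shape_of`) and the toy data are illustrative assumptions, not part of the original post. The point is the layout: the question reshapes `X` to `(1, 32514, 1)`, which reads as a single sample with 32514 timesteps, whereas classifying each line separately would presumably call for `(32514, 406, 1)`, matching the `(batch_size, timesteps, input_dim)` comment in the question.

```python
# Hypothetical helpers; only the (samples, timesteps, features) layout
# mirrors the data described in the question.

def build_vocab(lines):
    """Map every distinct character to an integer index starting at 0."""
    chars = sorted(set("".join(lines)))
    return {ch: i for i, ch in enumerate(chars)}


def encode_lines(lines, vocab):
    """Encode each line as a list of timesteps, each holding one feature.

    The result is nested as (samples, timesteps, features): one sample per
    line, one timestep per character, one integer feature per timestep.
    """
    return [[[vocab[ch]] for ch in line] for line in lines]


def shape_of(nested):
    """Return (samples, timesteps, features) for a rectangular nested list."""
    return (len(nested), len(nested[0]), len(nested[0][0]))


# Toy stand-in for the real data: 3 lines of 6 characters instead of
# 32514 lines of 406 characters.
lines = ["wewlsf", "nskfdd", "eredsd"]
vocab = build_vocab(lines)
X = encode_lines(lines, vocab)
print(shape_of(X))  # (3, 6, 1)
```

With the real data the same layout would give 32514 samples of 406 timesteps each, so the LSTM's `input_shape` would be `(406, 1)` rather than `(32514, 1)`. Raw integer indices are the simplest encoding. A one-hot or embedding representation per character is a common alternative.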

You can use a cross-entropy loss function.

https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/

This will give you the full idea.
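The cross-entropy loss suggested above comes in two flavors that differ only in how the target is encoded, and the error in the question ("expected dense_13 to have shape (1,) but got array with shape (12,)") is consistent with mixing them. In Keras, `sparse_categorical_crossentropy` expects one integer class index per sample, while `categorical_crossentropy` expects a one-hot vector per sample. The sketch below uses plain Python and made-up probabilities (the function names are illustrative, not Keras APIs) to show that both forms compute the same value from differently shaped targets.

```python
import math


def one_hot(index, num_classes):
    """Return a one-hot list of length num_classes with a 1.0 at index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def sparse_cross_entropy(class_index, probs):
    """Cross-entropy when the target is a single integer class index."""
    return -math.log(probs[class_index])


def categorical_cross_entropy(target, probs):
    """Cross-entropy when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


num_classes = 12
book_label = 11                # assumed label in the 1..12 range
class_index = book_label - 1   # shift to 0..11 before encoding

# Made-up predicted distribution: 0.45 on the true class, 0.05 elsewhere.
probs = [0.05] * num_classes
probs[class_index] = 0.45

target = one_hot(class_index, num_classes)
loss_sparse = sparse_cross_entropy(class_index, probs)
loss_onehot = categorical_cross_entropy(target, probs)
print(len(target), loss_sparse, loss_onehot)
```

For the code in the question, this suggests two possible fixes: keep the `to_categorical` call and switch the loss to `categorical_crossentropy`, or drop `to_categorical` and pass integer labels to `sparse_categorical_crossentropy`. If the labels really run from 1 to 12, they would also need shifting to 0-11 first, as done for `class_index` here.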

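【Discussion】:

Hi Naaviiii, thanks for sharing this link. It is very detailed. Any suggestions on how to modify Jason's code from the link? Does my problem still need text generation, given that it is a multi-class classification problem?
Hi sm_, you only need to change the num_classes parameter in that code, since num_classes is 12 in your case. The rest is fine as is. Just prepare your data accordingly. The blog's problem is also a multi-class classification problem, which is why it uses the softmax function. Basically, softmax gives you a probability for each class.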
Hi naavii, I edited his code but I still get an error. Thanks.
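As a small illustration of the comment about softmax giving a probability for each class, here is a plain-Python sketch. The raw scores are made up, and the function is a generic softmax, not code from the linked blog post.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw scores for the 12 classes.
scores = [0.1, 2.0, -1.0, 0.5, 0.0, 1.2, -0.3, 0.7, 3.1, 0.2, -2.0, 0.9]
probs = softmax(scores)

# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(len(probs), round(sum(probs), 6), predicted)
```

The predicted index is zero-based, so it would need mapping back to a book label (for example by adding 1 if the labels run from 1 to 12).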