Tensorflow tf.train.Saver not saving all variables

【Question title】: Tensorflow tf.train.Saver not saving all variables
【Posted】: 2018-12-03 14:12:40
【Question】:

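I thought the Tensorflow saver saves all variables, as stated here:

"If you do not pass any arguments to tf.train.Saver(), the saver handles all variables in the graph. Each variable is saved under the name that was passed when the variable was created."

https://www.tensorflow.org/programmers_guide/saved_model

However, the variable epochCount in my code below does not seem to get saved. This variable keeps track of the total number of epochs the model has been trained on the data.

When I restore a graph, it resets to its initializer value rather than the value it had when the checkpoint was last saved.

It seems to me that only the variables used to compute the loss are being saved.

Here is my code.

This is where I declare my graph: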

graph = tf.Graph()

with graph.as_default(): 

  valid_examples = np.array(random.sample(range(1, valid_window), valid_size)) #put inside graph to get new words each time

  train_dataset = tf.placeholder(tf.int32, shape=[batch_size, cbow_window*2 ])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  valid_datasetSM = tf.constant(valid_examples, dtype=tf.int32)

  epochCount = tf.get_variable( 'epochCount', initializer= 0) # stores the epoch count so the total number of epochs is known

  embeddings = tf.get_variable( 'embeddings', 
    initializer= tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

  softmax_weights = tf.get_variable( 'softmax_weights',
    initializer= tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.get_variable('softmax_biases', 
    initializer= tf.zeros([vocabulary_size]),  trainable=False )

  embed = tf.nn.embedding_lookup(embeddings, train_dataset) #train data set is
  embed_reshaped = tf.reshape( embed, [batch_size*cbow_window*2, embedding_size] )
  segments= np.arange(batch_size).repeat(cbow_window*2)
  averaged_embeds = tf.segment_mean(embed_reshaped, segments, name=None)

  loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=averaged_embeds,
                               labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))

  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss) #Original learning rate was 1.0

  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset) 
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings)) 

  saver = tf.train.Saver()

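If I restore the graph from a checkpoint, embeddings and softmax_biases are restored, but epochCount is reset to its initial value. (Note that I am not calling tf.global_variables_initializer().run(), which is a common cause of variables being incorrectly reset after restoring a checkpoint.)

This is the code that runs the graph: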

num_steps = 1000001

with tf.Session(graph=graph) as session:

  saver.restore(session, './checkpointsBook2VecCbowWindow2Downloaded/bookVec.ckpt' )
  average_loss = 0
  saveIteration = 1
  for step in range(1, num_steps):

    batch_data, batch_labels = generate_batch(
      batch_size, cbow_window)
    feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict) 

    if step % 20000 == 0:
      recEpoch_indexA =  epoch_index - recEpoch_indexA
      epochCount = tf.add(  epochCount, recEpoch_indexA, name=None )
      recEpoch_indexA = epoch_index

      save_path = saver.save(session, "checkpointsBook2VecCbowWindow2/bookVec.ckpt") 
      chptName = 'B2VCbowW2Embed256ckpt'+str(saveIteration)
      zipfolder(chptName, 'checkpointsBook2VecCbowWindow2')
      uploadModel.SetContentFile(chptName+".zip")
      uploadModel.Upload()

      print("Checkpoint uploaded to Google Drive")
      saveIteration += 1

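This is the code I use to print all the variables saved in the checkpoint after training. I restore the graph and print out all the saved variables.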

with tf.Session() as sess:
  saver = tf.train.import_meta_graph('./MODEL/bookVec.ckpt.meta')
  saver.restore(sess, './MODEL/bookVec.ckpt' )
  for v in tf.get_default_graph().get_collection("variables"):
    print('From variables collection ', v)

This is the output of the code above:

From variables collection  <tf.Variable 'embeddings:0' shape=(10001, 256) dtype=float32_ref>
From variables collection  <tf.Variable 'softmax_weights:0' shape=(10001, 256) dtype=float32_ref>
From variables collection  <tf.Variable 'softmax_biases:0' shape=(10001,) dtype=float32_ref>

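As you can see, epochCount has not been saved.

【Comments】:

Probably because it is never actually used in the graph. You only use it in the training loop. By the way, calling TF ops (in this case tf.add) after the session is launched is considered bad practice, especially in the training loop.
Wow, thanks for this info! "By the way, calling TF ops (in this case tf.add) after the session is launched is considered bad practice, especially in the training loop." Why is that? Also, is there a more established way to keep track of the total number of epochs?
See my answer here for the TF issue (in your case it is not as dramatic, since you only do it once per epoch). As for how to keep track of the epochs, I will put that in an answer.

Tags: python tensorflow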

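【Solution 1】:

The variable is restored as 0 because it is never actually updated (i.e. it is restored correctly)! During the session you overwrite epochCount with the result of a tf.add call, which only returns an operation, not an actual value. That is, the variable (in the Tensorflow sense) is "orphaned" and will stay at 0 forever.

You can use tf.assign to update the variable instead. It could look something like this: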

# where you define the graph
epochCount = tf.get_variable( 'epochCount', initializer= 0)
update_epoch = tf.assign(epochCount, epochCount + 1)
...
# after you launched the session
for step in range(1, num_steps):
    if step % 20000 == 0:
        sess.run(update_epoch)
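
As a minimal end-to-end sketch of the fix above (not the asker's full model), the snippet below builds only the counter, increments it through an assign op, saves a checkpoint, and restores it in a fresh session. It assumes the TF1-style graph/session API is reached through tensorflow.compat.v1. The temporary checkpoint path and the names epoch_count, ckpt_path, restored_epochs and saved_names are made up for illustration.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF1-style API via the compat module

graph = tf.Graph()
with graph.as_default():
    # Same variable name as in the question; initializer=0 makes it an int32 scalar.
    epoch_count = tf.get_variable("epochCount", initializer=0, trainable=False)
    # Build the update op once, while the graph is being defined.
    update_epoch = tf.assign(epoch_count, epoch_count + 1)
    init_op = tf.global_variables_initializer()
    saver = tf.train.Saver()

# Hypothetical throwaway checkpoint location for this demo.
ckpt_path = os.path.join(tempfile.mkdtemp(), "demo.ckpt")

with tf.Session(graph=graph) as sess:
    sess.run(init_op)
    for _ in range(3):
        sess.run(update_epoch)  # only runs the existing op, adds nothing to the graph
    saver.save(sess, ckpt_path)

with tf.Session(graph=graph) as sess:
    saver.restore(sess, ckpt_path)  # no initializer is run in this session
    restored_epochs = int(sess.run(epoch_count))

# Names actually stored in the checkpoint file, read without building a graph.
saved_names = [name for name, _shape in tf.train.list_variables(ckpt_path)]
print(restored_epochs, saved_names)
```

Reading the names back with tf.train.list_variables checks the checkpoint file itself, which is a more direct test of what was saved than listing a graph collection.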

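【Comments】:

Thanks! One more question: does this eliminate the issue you mentioned earlier about "calling TF ops (in this case tf.add) after the session is launched, especially in the training loop", since the operation is now invoked with sess.run(update_epoch)? My interpretation is that in my original code I was trying to change the tensorflow variable directly in the Python environment while the session was running on the same graph in which that variable was defined. In your code, the variable is now changed within the running session. Or is it something else?
My original comment on your question was actually a bit misleading, sorry. One problem is that you only called the tf.add function, which merely "creates" the add operation but does not actually perform the addition. By using sess.run, we actually perform the addition. However, your original formulation was also broken because you overwrote the tf.Variable with something else. Even if you were to run the tf.add, this would overwrite the Python variable holding the tf.Variable with another value rather than update the TF variable. As a rule of thumb, if you want to update a TF variable, use tf.assign.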
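
The difference between rebinding a Python name and updating the stored state can be illustrated without TensorFlow. The following is only a toy analogy: ToyVariable, ToyAddOp and toy_assign_add are hypothetical stand-ins invented for this sketch, not TensorFlow classes or functions.

```python
class ToyVariable:
    """Stand-in for a graph variable: the state a checkpoint would save."""

    def __init__(self, value):
        self.value = value


class ToyAddOp:
    """Stand-in for what tf.add returns: a recipe, evaluated only on run()."""

    def __init__(self, source, delta):
        self.source = source
        self.delta = delta

    def run(self):
        # Reads the source but never writes back to it.
        return self.source.value + self.delta


def toy_assign_add(variable, delta):
    """Stand-in for running an assign op: mutates the stored state."""
    variable.value += delta
    return variable.value


stored = ToyVariable(0)  # the object a saver would look at
epoch_count = stored     # Python name that initially refers to it

# Mirrors `epochCount = tf.add(epochCount, ...)`: the name now refers to an op.
epoch_count = ToyAddOp(epoch_count, 5)
rebind_result = epoch_count.run()
stored_after_rebind = stored.value

# Mirrors `sess.run(tf.assign(...))`: the stored state itself changes.
toy_assign_add(stored, 5)
stored_after_assign = stored.value

print(rebind_result, stored_after_rebind, stored_after_assign)
```

In this analogy, the rebinding step yields the sum when run but leaves the stored value untouched, while the assign-style step changes what a checkpoint would later contain.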