【Question title】: tensorflow: invalid fastbin entry (free): 0x00007f2fa8023940
【Posted】: 2016-11-04 02:39:35
【Question】:

I am training the Inception model from scratch with TensorFlow, following this. Here is my environment:

  • TensorFlow version: 0.11.0rc1 (compiled from source)
  • Operating system: CentOS Linux release 7.0.1406 (Core), 64-bit
  • Model: models/inception

After about 11,500 steps, training aborts with this error:

...
...
2016-11-03 22:37:06.142819: step 11540, loss = 9.38 (66.9 examples/sec; 0.957 sec/batch)
2016-11-03 22:37:15.753609: step 11550, loss = 9.22 (67.4 examples/sec; 0.950 sec/batch)
2016-11-03 22:37:25.332004: step 11560, loss = 9.51 (65.6 examples/sec; 0.975 sec/batch)
*** Error in `/home/software/anaconda2/bin/python': invalid fastbin entry (free): 0x00007f2fa8023940 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7d19d)[0x7f315d7b919d]
/home/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(+0x248ff48)[0x7f314baa2f48]
/home/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(+0x244520f)[0x7f314ba5820f]
/home/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN10tensorflow19LocalRendezvousImpl4SendERKNS_10Rendezvous9ParsedKeyERKNS1_4ArgsERKNS_6TensorEb+0xf9)[0x7f314bb9e7f9]
/home/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN10tensorflow22IntraProcessRendezvous4SendERKNS_10Rendezvous9ParsedKeyERKNS1_4ArgsERKNS_6TensorEb+0xb4)[0x7f314ba57b74]
/home/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN10tensorflow6SendOp7ComputeEPNS_15OpKernelContextE+0x346)[0x7f314baa3736]
/home/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(+0x242ea59)[0x7f314ba41a59]
/home/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(+0x2422e30)[0x7f314ba35e30]
/home/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN5Eigen26NonBlockingThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x3c8)[0x7f314bc474a8]
/home/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x22)[0x7f314bc46c72]
/home/software/anaconda2/bin/../lib/libstdc++.so.6(+0xb4870)[0x7f3149153870]
/lib64/libpthread.so.0(+0x7df3)[0x7f315e20ddf3]
/lib64/libc.so.6(clone+0x6d)[0x7f315d8321ad]
======= Memory map: ========
00400000-00401000 r-xp 00000000 fd:02 34476856                           /home/software/anaconda2/bin/python2.7
00600000-00601000 rw-p 00000000 fd:02 34476856                           /home/software/anaconda2/bin/python2.7
0067e000-42ae4000 rw-p 00000000 00:00 0                                  [heap]
200000000-200100000 rw-s 1026d71000 00:05 221089                         /dev/nvidiactl
200100000-204100000 ---p 00000000 00:00 0 
204100000-204200000 rw-s f70ee2000 00:05 221089                          /dev/nvidiactl
204200000-204300000 ---p 00000000 00:00 0 
204300000-204400000 rw-s f75483000 00:05 221089                          /dev/nvidiactl
204400000-204500000 ---p 00000000 00:00 0 
204500000-204600000 rw-s 1014d38000 00:05 221089                         /dev/nvidiactl
204600000-208600000 ---p 00000000 00:00 0 
208600000-208700000 rw-s f7735a000 00:05 221089                          /dev/nvidiactl
208700000-208800000 ---p 00000000 00:00 0 
208800000-208900000 rw-s f7777d000 00:05 221089                          /dev/nvidiactl
208900000-208a00000 ---p 00000000 00:00 0 
208a00000-208b00000 rw-s f77eaa000 00:05 221089                          /dev/nvidiactl
208b00000-20cb00000 ---p 00000000 00:00 0 
...
...

【Comments】:

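  • Is this a problem you have seen more than once, or was it a one-off error?
  • It happens every time. I changed the batch size from 256 to 64 and it still raises this error.
  • Can you boil it down to code you can post that reproduces the error?
  • I am using TensorFlow's Inception model training code: github.com/tensorflow/models/blob/master/inception/inception/…
  • If you are using the standard Inception model, my first reaction to this error is memory corruption caused by a hardware fault. Does it crash at step 11560 every time? Does it depend on the batch size?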
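
One way to answer the last comment's question (does the crash always land on the same step?) is to pull the last logged step out of each run's output and compare across runs. The following is only a minimal sketch: the helper name `last_logged_step` and the regular expression are assumptions based on the log format shown in the question, not part of the Inception training code.

```python
import re

# Matches lines such as:
# 2016-11-03 22:37:25.332004: step 11560, loss = 9.51 (65.6 examples/sec; 0.975 sec/batch)
STEP_RE = re.compile(r"step (\d+), loss = ([\d.]+)")


def last_logged_step(log_text):
    """Return the last step number found in the training log, or None."""
    last = None
    for line in log_text.splitlines():
        match = STEP_RE.search(line)
        if match:
            last = int(match.group(1))
    return last


# Sample taken from the log excerpt in the question.
sample_log = """\
2016-11-03 22:37:06.142819: step 11540, loss = 9.38 (66.9 examples/sec; 0.957 sec/batch)
2016-11-03 22:37:15.753609: step 11550, loss = 9.22 (67.4 examples/sec; 0.950 sec/batch)
2016-11-03 22:37:25.332004: step 11560, loss = 9.51 (65.6 examples/sec; 0.975 sec/batch)
*** Error in `/home/software/anaconda2/bin/python': invalid fastbin entry (free): 0x00007f2fa8023940 ***
"""

print(last_logged_step(sample_log))
```

Running this over the saved output of several crashed runs, with different batch sizes, would show whether the failing step is stable or varies from run to run.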

Tags: c++ python-2.7 tensorflow deep-learning


【Answer 1】:

Could it be that your glibc is still not thread-safe, as described in this erratum: https://rhn.redhat.com/errata/RHBA-2014-0480.html
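
To follow up on this suggestion, it may help to see which C library the Python process is actually running against. The sketch below uses the standard-library call `platform.libc_ver()`; the wrapper name `glibc_version` is made up for illustration. Note that it reports the upstream glibc version only, so comparing against the erratum would still require the distribution package release (for example from `rpm -q glibc`).

```python
import platform


def glibc_version():
    """Return (library_name, version) for the C library of this Python binary.

    Either string may be empty if the library cannot be identified.
    """
    lib, version = platform.libc_ver()
    return lib, version


lib, version = glibc_version()
print("C library: %s %s" % (lib or "unknown", version or "unknown"))
```

This only identifies the library in use; it does not by itself confirm or rule out the erratum as the cause of the crash.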

【Comments】:
