Virtual method wrong (0x0) address: answers

【Question title】: virtual method wrong (0x0) address
【Posted】: 2019-05-04 21:51:25
【Question】:

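Some code that calls a virtual member function occasionally crashes with a strange segfault. The segfault happens on average about once in 30k calls.

I am using virtual methods to implement the template method pattern.

The line of code where it occurs is the first line of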

void GenericDevice::updateValue()
{
     ...
     double tmpValue=getValue();
     Value=tmpValue;
     ...
}

with

class GenericDevice
{
    public: 
    void updateValue();
    void print(string& result);
    ...
    protected:
    virtual double getValue()const=0;
    ...
    private:
    std::atomic<double> Value;
    ...
};

A specialization of GenericDevice is provided later by a dynamic library that is loaded at runtime:

class SpecializedDevice : public GenericDevice
{
    ...
    virtual double getValue()const final;
    ... 
};
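For readers unfamiliar with the setup, a minimal self-contained sketch of the template method arrangement described above might look like the following. Only `GenericDevice`, `updateValue()`, and `getValue()` come from the question. `ConstantDevice`, the `value()` accessor, and the constant it returns are hypothetical names added for illustration, and no dynamic loading is involved.

```cpp
#include <atomic>
#include <cassert>

// Base class: updateValue() is the fixed "template" step,
// getValue() is the hook that a derived class must supply.
class GenericDevice {
public:
    virtual ~GenericDevice() = default;

    // Calls the virtual hook and stores the result atomically.
    void updateValue() {
        double tmpValue = getValue();  // virtual call through the vtable
        value_.store(tmpValue);
    }

    // Hypothetical accessor, added only so the sketch can be checked.
    double value() const { return value_.load(); }

protected:
    virtual double getValue() const = 0;

private:
    std::atomic<double> value_{0.0};
};

// Hypothetical specialization; the real one lives in a shared library.
class ConstantDevice final : public GenericDevice {
protected:
    double getValue() const final { return 42.0; }
};
```

Calling `updateValue()` on a `ConstantDevice` and then checking `assert(d.value() == 42.0)` exercises the same kind of virtual dispatch that the disassembly in the question shows.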

When the problem occurred, I was able to obtain a core dump and looked at the assembly code:

0x55cd3ef036f4 GenericDevice::updateValue()+92   mov    -0x38(%rbp),%rax   
0x55cd3ef036f8 GenericDevice::updateValue()+96   mov    (%rax),%rax 
0x55cd3ef036fb GenericDevice::updateValue()+99   add    $0x40,%rax  
0x55cd3ef036ff GenericDevice::updateValue()+103  mov    (%rax),%rax 
0x55cd3ef03702 GenericDevice::updateValue()+106  mov   -0x38(%rbp),%rdx
0x55cd3ef03706 GenericDevice::updateValue()+110  mov   %rdx,%rdi         
0x55cd3ef03709 GenericDevice::updateValue()+113  callq  *%rax
0x55cd3ef0370b <GenericDevice::updateValue()+115>  movq   %xmm0,%rax          
0x55cd3ef03710 <GenericDevice::updateValue()+120>  mov    %rax,-0x28(%rbp) 
0x55cd3ef03714 <GenericDevice::updateValue()+124>  mov    -0x38(%rbp),%rax  
0x55cd3ef03718 <GenericDevice::updateValue()+128>  lea    0x38(%rax),%rdx     
0x55cd3ef0371c <GenericDevice::updateValue()+132>  mov    -0x28(%rbp),%rax    
0x55cd3ef03720 <GenericDevice::updateValue()+136>  mov    %rax,-0x40(%rbp)    
0x55cd3ef03724 <GenericDevice::updateValue()+140>  movsd  -0x40(%rbp),%xmm0

The segfault is assumed to occur at 0x55cd3ef03709 GenericDevice::updateValue()+113.

where
#0  0x000055cd3ef0370a in MyNamespace::GenericDevice::updateValue (this=0x55cd40586698) at ../src/GenericDevice.cpp:22
#1  0x000055cd3ef038d2 in MyNamespace::GenericDevice::print (this=0x55cd40586698,result="REDACTED"...) at ../src/GenericDevice.cpp:50
...

The function GenericDevice::updateValue() was called as expected:

<GenericDevice::print(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+301>  callq  0x55cd3ef03698 <GenericDevice::updateValue()>

The cause appears to be that rax is set to 0x0.

Register group: general
rax            0x0              0  
rbx            0x5c01b8a2       1543616674  
rcx            0x2              2  
rdx            0x28             40  
rsi            0x2              2  
rdi            0x55cd40586630   94340036191792  
rbp            0x7ffe39086e60   0x7ffe39086e60  
rsp            0x7ffe39086e20   0x7ffe39086e20  
r8             0x7fbb06e7e8a0   140441251473568  
r9             0x3              3  
r10            0x33             51  
r11            0x206            518                       
r12            0x55cd3ef19438   94340012676152  
r13            0x7ffe39089010   140729855283216   
r14            0x0              0   
r15            0x0              0  
rip            0x55cd3ef0370a   0x55cd3ef0370a <GenericDevice::updateValue()+114>
eflags         0x10206          [ PF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0  
es             0x0      0  
fs             0x0      0   
gs             0x0      0

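By carrying out the computations in the assembly excerpt by hand, I was able to confirm that the assembly code and the data it uses match the expected virtual function call, starting from the correct data:

The this pointer of the object is used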

(gdb) x /g $rbp-0x38  
0x7ffe39086e28: 0x000055cd40586698   
(gdb) p this  
$1 = (GenericDevice * const) 0x55cd40586698

The pointer to the vtable is correct (first element of *this)

(gdb) x 0x000055cd40586698  
0x55cd40586698: 0x00007fbb070c1aa0
(gdb) info vtbl this  
vtable for 'GenericDevice' @ 0x7fbb070c1aa0 (subobject @ 0x55cd40586698):

The vtable contains the address of the method we are looking for.

(gdb) info vtbl this  
vtable for 'GenericDevice' @ 0x7fbb070c1aa0 (subobject @ 0x55cd40586698):  
...  
[8]: 0x7fbb06e7bf50 non-virtual thunk to MyNamespace::SpecializedDevice::getValue() const.

The correct vtable offset is used

(gdb) x 0x00007fbb070c1aa0+0x40  
0x7fbb070c1ae0 <_ZTVN12MyNamespace11SpecializedDeviceE+168>: 0x00007fbb06e7bf50
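The check above can be reproduced as plain arithmetic. The sketch below is only an illustration of that calculation, assuming 8-byte vtable entries on x86-64. The addresses are copied from the gdb session, and the helper names (`slotAddress`, `slotIndex`) are hypothetical.

```cpp
#include <cstdint>

// Values copied from the gdb output above.
constexpr std::uint64_t kVtablePtr   = 0x00007fbb070c1aa0;  // first word of *this
constexpr std::uint64_t kSlotOffset  = 0x40;                // from "add $0x40,%rax"
constexpr std::uint64_t kPointerSize = 8;                   // assumption: x86-64

// Address of the vtable slot that the call loads its target from.
constexpr std::uint64_t slotAddress(std::uint64_t vptr, std::uint64_t offset) {
    return vptr + offset;
}

// Index of that slot, as printed by "info vtbl".
constexpr std::uint64_t slotIndex(std::uint64_t offset) {
    return offset / kPointerSize;
}
```

With these inputs, `slotAddress(kVtablePtr, kSlotOffset)` is 0x7fbb070c1ae0, the address examined with `x`, and `slotIndex(kSlotOffset)` is 8, the `[8]` entry that `info vtbl` reports for `getValue()`.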

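Conclusion so far: by stepping through the assembly code, I verified that the correct data and instructions are used.

The correct data is used: memory corruption can be ruled out.
The assembly instructions look correct: a coding or compiler error can be ruled out.
The vtable looks fine: an error while loading the library at runtime can be ruled out, since the function normally works for tens of thousands of calls.

Please feel free to point out any mistakes in my reasoning.

Yet the value in register rax is still zero instead of the expected 0x7fbb070c1ae0.

Does this point to a hardware fault in one (rarely used) CPU core? That would explain the rare and random occurrence, but I would expect other programs and the operating system to have problems as well.

The processor model is Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

Thanks in advance!

Update: I found the $RIP marker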
0x55cd3ef0370a MyNamespace::GenericDevice::updateValue()+114 shlb 0x48(%rsi)

The assembly shown by gdb seems to change after scrolling, which is why I did not see the marker on my first attempt. After starting gdb and entering layout asm, I get:

>0x55cd3ef0370a <MyNamespace::GenericDevicer::updateValue()+114>  shlb   0x48(%rsi)           
0x55cd3ef0370d <MyNamespace::GenericDevicer::updateValue()+117>  movd   %mm0,%eax            
0x55cd3ef03710 <MyNamespace::GenericDevicer::updateValue()+120>  mov    %rax,-0x28(%rbp)     
0x55cd3ef03714 <MyNamespace::GenericDevicer::updateValue()+124>  mov    -0x38(%rbp),%rax     
0x55cd3ef03718 <MyNamespace::GenericDevicer::updateValue()+128>  lea    0x38(%rax),%rdx   
0x55cd3ef0371c <MyNamespace::GenericDevicer::updateValue()+132>  mov    -0x28(%rbp),%rax
0x55cd3ef03720 <MyNamespace::GenericDevicer::updateValue()+136>  mov    %rax,-0x40(%rbp)
0x55cd3ef03724 <MyNamespace::GenericDevicer::updateValue()+140>  movsd  -0x40(%rbp),%xmm0

...

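After scrolling the asm view in gdb, I get the code posted in the original question. The code in the original question matches the code in the executable. The code posted above partially deviates from the executable.

The shlb instruction makes no sense to me. I could not even find it in the Intel® 64 and IA-32 Architectures Software Developer’s Manual. The closest match is shl.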
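A short sketch may help explain where `shlb` comes from: it is what a disassembler prints when decoding starts one byte too late. The byte values below are not taken from the dump. They are assumed from the standard x86-64 encodings, `ff d0` for `callq *%rax` and `66 48 0f 7e c0` for `movq %xmm0,%rax`. Starting at the second byte of the call, the bytes `d0 66 48` decode as a byte shift whose ModRM byte is 0x66. The helper `decodeModRM` is a hypothetical name.

```cpp
#include <cstdint>

// Addresses copied from the question.
constexpr std::uint64_t kCallAddress = 0x55cd3ef03709;  // callq *%rax
constexpr std::uint64_t kReportedRip = 0x55cd3ef0370a;  // rip in the core dump
constexpr std::uint64_t kCallLength  = 2;               // assumption: encoded as ff d0

// The three fields of an x86 ModRM byte.
struct ModRM {
    unsigned mod;  // addressing mode (1 = register plus 8-bit displacement)
    unsigned reg;  // opcode extension for group opcodes such as d0
    unsigned rm;   // base register number (6 = rsi)
};

// Split a ModRM byte into its mod (bits 7-6), reg (5-3) and rm (2-0) fields.
constexpr ModRM decodeModRM(std::uint8_t b) {
    return ModRM{(b >> 6) & 0x3u, (b >> 3) & 0x7u, b & 0x7u};
}
```

Under these assumptions, `decodeModRM(0x66)` yields mod 1, reg 4 (the shl variant of opcode d0), and rm 6 (rsi), with 0x48 as the displacement, which is exactly `shlb 0x48(%rsi)`. The reported rip also equals `kCallAddress + kCallLength - 1`, one byte before the address the call should return to.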

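【Comments】:

If we can trust your register dump, rdx and rdi are wrong as well and, more importantly, they are not equal even though mov %rdx,%rdi has just been executed. Either your register dump is not from that location, or something very strange is going on.
Your function returns non-void, doesn't it? What is the return type? The assembly looks weird.
rip 0x55cd3ef0370a is very wrong: it is in the middle of an instruction. That is the actual cause of your fault. Execution arrived from somewhere else and landed in the middle of that instruction.
Could you please clarify "execution arrived from somewhere else"?
Yes. The call probably executed correctly, but when the called function tried to return, it used a corrupted address and therefore continued in the middle of the call instead of at the instruction that follows it.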

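Tags: c++ assembly gdb cpu-registers vtable

【Solution 1】:

As @Jester said, your other register values do not match the code that you say crashed.

When the problem occurred, I was able to obtain a coredump and looked at the assembly code: ... the segfault occurs at the last line of the assembly excerpt.

How do you know that? What is the output of where?

Normally there should be a current $RIP marker, like this: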

   0x55cd3ef036f4 GenericDevice::updateValue()+92   mov    -0x38(%rbp),%rax   
   0x55cd3ef036f8 GenericDevice::updateValue()+96   mov    (%rax),%rax 
   0x55cd3ef036fb GenericDevice::updateValue()+99   add    $0x40,%rax  
   0x55cd3ef036ff GenericDevice::updateValue()+103  mov    (%rax),%rax 
   0x55cd3ef03702 GenericDevice::updateValue()+106  mov   -0x38(%rbp),%rdx
   0x55cd3ef03706 GenericDevice::updateValue()+110  mov   %rdx,%rdi         
   0x55cd3ef03709 GenericDevice::updateValue()+113  callq  *%rax
=> 0x55cd3ef0370e GenericDevice::updateValue()+118  ....

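Do you see that marker?

If not, your crash is most likely somewhere else (but good job analyzing your data).

If you do see the marker, other details such as the exact processor make and model may matter (see e.g. this question and answer).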

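【Comments】:

It would also help not to remove rip from the register dump.
Use the edit link to add that information to the original question.
@TLepold Please edit your question and provide the complete output of where and info registers. You may be misinterpreting where the crash actually is.
@TLepold $RIP is in the middle of an instruction, much like stackoverflow.com/questions/4703844/unexplainable-core-dump/…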

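【Solution 2】:

The call instruction pushes the return address onto the stack before the called function is executed (source: Intel® 64 and IA-32 Architectures Software Developer’s Manual, page 225). Another thread held an invalid reference to a variable on that same stack and decremented it, and what it decremented was the stored return address. The thread was meant to hold a reference to a counter that tracks how many GenericDevice::updateValue() jobs are still pending. After a timeout the counter goes out of scope, but the executing thread still holds the now-invalid reference. The timeout occurs rarely, and only when reading from a device rather than from a model. As a result, the return address stored on the stack was occasionally corrupted.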
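One common way to avoid this class of bug is to give the counter shared ownership, so that it outlives whichever side finishes last. The sketch below is not the author's actual fix, which the answer does not show. It is a minimal illustration that assumes the counter can live on the heap, and every name in it (`PendingCounter`, `Job`, `startJob`, `finishJob`, `waitForJobs`) is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for the counter of pending updateValue() jobs.
using PendingCounter = std::atomic<int>;

// Worker side: takes the shared_ptr by value, so the counter stays alive
// until the worker is done, even if the caller has stopped waiting.
void finishJob(std::shared_ptr<PendingCounter> pending,
               std::chrono::milliseconds workTime) {
    assert(pending != nullptr);
    std::this_thread::sleep_for(workTime);  // simulated device read
    pending->fetch_sub(1);                  // safe: never a dangling stack slot
}

struct Job {
    std::shared_ptr<PendingCounter> pending;
    std::thread worker;
};

// Caller side: creates the counter on the heap and starts one job.
Job startJob(std::chrono::milliseconds workTime) {
    auto pending = std::make_shared<PendingCounter>(1);
    std::thread worker(finishJob, pending, workTime);
    return Job{std::move(pending), std::move(worker)};
}

// Polls until no jobs are pending or the timeout expires.
// Returns false on timeout.
bool waitForJobs(const PendingCounter& pending,
                 std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (pending.load() != 0) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
}
```

If `waitForJobs` times out and the caller drops its `shared_ptr`, the worker's own copy keeps the counter valid, so the late decrement lands on heap memory that still belongs to the counter rather than on a reused stack slot such as a saved return address.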

【Comments】: