Malloc segmentation fault: answers

【Question title】: Malloc segmentation fault
【Posted】: 2014-03-29 20:46:47
【Question】:

This is the piece of code where the segmentation fault occurs (perror is not called):

job = malloc(sizeof(task_t));
if(job == NULL)
    perror("malloc");

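More precisely, gdb reports that the segfault happens inside the __int_malloc call, which is a subroutine call made by malloc.

Since malloc is called in parallel with other threads, I initially thought that might be the problem. I am using glibc version 2.19.

The data structures: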

typedef struct rv_thread thread_wrapper_t;

typedef struct future
{
  pthread_cond_t wait;
  pthread_mutex_t mutex;
  long completed;
} future_t;

typedef struct task
{
  future_t * f;
  void * data;
  void *
  (*fun)(thread_wrapper_t *, void *);
} task_t;

typedef struct
{
  queue_t * queue;
} pool_worker_t;

typedef struct
{
  task_t * t;
} sfuture_t;

struct rv_thread
{
  pool_worker_t * pool;
};

Now the future implementation:

future_t *
create_future()
{
  future_t * new_f = malloc(sizeof(future_t));
  if(new_f == NULL)
    perror("malloc");
  new_f->completed = 0;
  pthread_mutex_init(&(new_f->mutex), NULL);
  pthread_cond_init(&(new_f->wait), NULL);
  return new_f;
}

int
wait_future(future_t * f)
{
  pthread_mutex_lock(&(f->mutex));
  while (!f->completed)
    {
      pthread_cond_wait(&(f->wait),&(f->mutex));
    }
  pthread_mutex_unlock(&(f->mutex));
  return 0;
}

void
complete(future_t * f)
{
  pthread_mutex_lock(&(f->mutex));
  f->completed = 1;
  pthread_mutex_unlock(&(f->mutex));
  pthread_cond_broadcast(&(f->wait));
}

The thread pool itself:

pool_worker_t *
create_work_pool(int threads)
{
  pool_worker_t * new_p = malloc(sizeof(pool_worker_t));
  if(new_p == NULL)
    perror("malloc");
  threads = 1;
  new_p->queue = create_queue();
  int i;
  for (i = 0; i < threads; i++){
    thread_wrapper_t * w = malloc(sizeof(thread_wrapper_t));
    if(w == NULL)
      perror("malloc");
    w->pool = new_p;
    pthread_t n;
    pthread_create(&n, NULL, work, w);
  }
  return new_p;
}

task_t *
try_get_new_task(thread_wrapper_t * thr)
{
  task_t * t = NULL;
  try_dequeue(thr->pool->queue, t);
  return t;
}

void
submit_job(pool_worker_t * p, task_t * t)
{
  enqueue(p->queue, t);
}

void *
work(void * data)
{
  thread_wrapper_t * thr = (thread_wrapper_t *) data;
  while (1){
    task_t * t = NULL;
    while ((t = (task_t *) try_get_new_task(thr)) == NULL);
    future_t * f = t->f;
    (*(t->fun))(thr,t->data);
    complete(f);
  }
  pthread_exit(NULL);
}

And finally task.c:

pool_worker_t *
create_tpool()
{
  return (create_work_pool(8));
}

sfuture_t *
async(pool_worker_t * p, thread_wrapper_t * thr, void *
(*fun)(thread_wrapper_t *, void *), void * data)
{
  task_t * job = NULL;
  job = malloc(sizeof(task_t));
  if(job == NULL)
    perror("malloc");
  job->data = data;
  job->fun = fun;
  job->f = create_future();
  submit_job(p, job);
  sfuture_t * new_t = malloc(sizeof(sfuture_t));
  if(new_t == NULL)
    perror("malloc");
  new_t->t = job;
  return (new_t);
}

void
mywait(thread_wrapper_t * thr, sfuture_t * sf)
{
  if (sf == NULL)
    return;
  if (thr != NULL)
    {
      while (!sf->t->f->completed)
        {
          task_t * t_n = try_get_new_task(thr);
          if (t_n != NULL)
            {
              future_t * f = t_n->f;
              (*(t_n->fun))(thr,t_n->data);
              complete(f);
            }
        }
      return;
    }
  wait_future(sf->t->f);
  return ;
}

The queue is an lfds lock-free queue.

#define enqueue(q,t) {                                 \
    if(!lfds611_queue_enqueue(q->lq, t))             \
      {                                               \
        lfds611_queue_guaranteed_enqueue(q->lq, t);  \
      }                                               \
  }

#define try_dequeue(q,t) {                            \
    lfds611_queue_dequeue(q->lq, &t);               \
  }

The problem shows up whenever the number of calls to async is very large.

Valgrind output:

Process terminating with default action of signal 11 (SIGSEGV)
==12022==  Bad permissions for mapped region at address 0x5AF9FF8
==12022==    at 0x4C28737: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)

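【Comments】:

Could something else be corrupting malloc's bookkeeping?
Sounds like memory is being corrupted somewhere else.
That is the only explanation; I will post the whole code. (It really is a minimal model, with memory leaks and so on.)
"If needed, I can put the complete source code here": yes, that is probably what you should do, because the code above cannot by itself point to the source of the segfault.
Any chance you can run the program under valgrind? If memory corruption is happening, valgrind may tell you when and where.

Tags: c segmentation-fault malloc stack-overflow buffer-overflow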

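【Solution 1】:

I have figured out the problem: a stack overflow.

First, let me explain why the stack overflow occurs inside malloc (which is probably why you are reading this). While my program ran, the stack kept growing each time it started executing another task (recursively), because of the way I had written it. Each time that happened, I also had to allocate a new task with malloc. But malloc makes subroutine calls of its own, so during the allocation the stack is deeper than it is during a plain call that executes another task. In other words, I would have hit a stack overflow even without malloc. Because malloc was there, the stack overflowed inside malloc before it could overflow on the next recursive call. The diagrams below show what was happening:

Initial stack state: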

-------------------------
| recursive call n - 3  |
-------------------------
| recursive call n - 2  |
-------------------------
| recursive call n - 1  |
-------------------------
|        garbage        |
-------------------------
|        garbage        | <- If the stack passes this point, the stack overflows.
-------------------------

The stack during the malloc call:

-------------------------
| recursive call n - 3  |
-------------------------
| recursive call n - 2  |
-------------------------
| recursive call n - 1  |
-------------------------
|        malloc         |
-------------------------
|     __int_malloc      | <- If the stack passes this point, the stack overflows.
-------------------------

Then the stack shrinks again, and my code enters a new recursive call:

-------------------------
| recursive call n - 3  |
-------------------------
| recursive call n - 2  |
-------------------------
| recursive call n - 1  |
-------------------------
| recursive call n      |
-------------------------
|        garbage        | <- If the stack passes this point, the stack overflows.
-------------------------

Then it calls malloc again inside this new recursive call. This time, however, it overflows:

-------------------------
| recursive call n - 3  |
-------------------------
| recursive call n - 2  |
-------------------------
| recursive call n - 1  |
-------------------------
| recursive call n      |
-------------------------
|        malloc         | <- If the stack passes this point, the stack overflows.
-------------------------
|     __int_malloc      | <- This is when the stack overflow occurs.
-------------------------

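[The rest of this answer focuses on why the problem occurs in my code.]

Usually, when computing Fibonacci recursively for some number n, the stack size grows linearly with n. In this case, however, I create tasks, store them in a queue, and dequeue a (fib) task to execute it. If you draw this on paper, you will see that the number of tasks grows exponentially with n rather than linearly. (Note also that if I had used a stack to store the tasks as they were created, both the number of allocated tasks and the stack size would only grow linearly with n.) So the stack grows exponentially with n, which leads to the stack overflow. As for why the overflow happens in the call to malloc: as explained above, it happens there because that is where the stack is at its deepest. The stack was already close to its limit, and since malloc calls further functions internally, the stack grows by more than it does for a call to mywait and fib alone.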

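Thanks, everyone! I could not have figured this out without your help!

【Comments】:

That was my guess, since I could not find any problem. But to make sure this is the issue, could you dump the top output to a file and check how memory usage grows? +1 for both the answer and the question.
When I removed all the threads, valgrind said it might be a stack overflow, although that seemed unlikely. I set ulimit higher, and then I could run larger fib numbers. When I doubled the stack size, I could only go one number higher than before. But I will do as you say, just to confirm.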

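【Solution 2】:

A SIGSEGV (segmentation fault) raised inside malloc is usually caused by heap corruption. Heap corruption does not cause a segmentation fault by itself, so you only see the fault later, when malloc touches the damaged region. The difficulty is that the code that corrupted the heap can be anywhere, even far from the place where malloc is called. Typically the corruption overwrites a next-block pointer in malloc's bookkeeping with an invalid address, so the next call to malloc dereferences that invalid pointer and you get a segmentation fault.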

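I think you could run parts of your code in isolation from the rest of the program to narrow down where the bug can be.

Also, I see that you never free the memory here, so there may be a memory leak.

To check for a memory leak, you can run the top command, top -b -n 1, and check: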

RPRVT - resident private address space size
RSHRD - resident shared address space size
RSIZE - resident memory size
VPRVT - private address space size
VSIZE - total memory size

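【Comments】:

The problem is that the segmentation fault only happens after many calls.
Have you checked for a memory leak? I do not see any free here... do you ever free the memory?
If memory is never freed, sooner or later there will be problems... because this program only allocates here...
This is just a minimal model; the original version has no memory leaks. In this one I am only trying to find the cause of the error by removing as much code as possible. So in this version I only want to track down the malloc problem.
If you suspect a malloc problem, you should allocate and free (to avoid running out of heap) a large number of times (malloc knows nothing about your structs), so you do not need the full program, but that is unlikely to be the cause @guilhermemtr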