【Question title】: Facing memory error on my Django celery worker instance
【Posted】: 2021-04-04 15:12:17
【Question】:

I am using Django with Celery and Redis (as the broker). I am seeing the following error on one of my worker instances.

[2020-12-27 02:26:15,920: INFO/MainProcess] missed heartbeat from worker@ip-xxx-xx-xx-xxx.ec2.internal
[2020-12-27 02:26:40,937: INFO/MainProcess] missed heartbeat from worker@ip-xxx-xx-xx-xxx.ec2.internal
[2020-12-27 02:27:00,943: INFO/MainProcess] missed heartbeat from worker@ip-xxx-xx-xx-xxx.ec2.internal
[2020-12-27 02:27:15,955: INFO/MainProcess] missed heartbeat from worker@ip-xxx-xx-xx-xxx.ec2.internal
[2020-12-27 02:27:45,971: INFO/MainProcess] missed heartbeat from worker@ip-xxx-xx-xx-xxx.ec2.internal
[2020-12-27 02:28:02,118: INFO/MainProcess] missed heartbeat from worker@ip-xxx-xx-xx-xxx.ec2.internal
[2020-12-27 02:28:36,496: CRITICAL/MainProcess] Unrecoverable error: MemoryError()
Traceback (most recent call last):
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/celery/worker/worker.py", line 205, in start
    self.blueprint.start(self)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/celery/worker/loops.py", line 83, in asynloop
    next(loop)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 364, in create_loop
    cb(*cbargs)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/kombu/transport/redis.py", line 1074, in on_readable
    self.cycle.on_readable(fileno)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/kombu/transport/redis.py", line 359, in on_readable
    chan.handlers[type]()
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/kombu/transport/redis.py", line 694, in _receive
    ret.append(self._receive_one(c))
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/kombu/transport/redis.py", line 700, in _receive_one
    response = c.parse_response()
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/redis/client.py", line 3036, in parse_response
    return self._execute(connection, connection.read_response)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/redis/client.py", line 3013, in _execute
    return command(*args)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/redis/connection.py", line 637, in read_response
    response = self._parser.read_response()
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/redis/connection.py", line 330, in read_response
    response = [self.read_response() for i in xrange(length)]
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/redis/connection.py", line 330, in <listcomp>
    response = [self.read_response() for i in xrange(length)]
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/redis/connection.py", line 324, in read_response
    response = self._buffer.read(length)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/redis/connection.py", line 205, in read
    self._read_from_socket(length - self.length)
  File "/home/ec2-user/.virtualenvs/xxxxx/lib/python3.7/site-packages/redis/connection.py", line 186, in _read_from_socket
    buf.write(data)
MemoryError
[2020-12-27 06:44:31,570: INFO/MainProcess] Connected to redis://xxxxxxxxxx.cache.amazonaws.com:6379//
[2020-12-27 06:44:31,585: INFO/MainProcess] mingle: searching for neighbors
[2020-12-27 06:44:32,611: INFO/MainProcess] mingle: sync with 1 nodes

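I would like to confirm whether this memory error is caused by a memory leak somewhere in my code, by a worker-specific problem, or by something else. I would really appreciate any help or suggestions for finding the root cause.

Note: my worker runs on AWS, and the instance type is t2.small.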

【Question discussion】:

    Tags: django redis celery celerybeat


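    【Solution 1】:

    A memory error is plausible on such a small instance, but I am more worried about the failed health checks (the missed heartbeats).

    Here are some ideas:

    • Try profiling your celery task to understand how much memory it consumes. Is it more than the 2 GB this instance type provides?
    • What concurrency level did you define for the worker? Have you tried reducing it? If c == 2 and each task consumes 2 GB (for example), that would explain your problem.
    • Use CloudWatch metrics (in the AWS console) to look at CPU and memory utilization, and check whether the error times correlate with peaks in the graphs. Note that EC2 does not report memory utilization by default; it has to be published by the CloudWatch agent.
    • If the error is reproducible, run htop while it occurs to confirm that this is a resource limit (memory/CPU).
    • Collect these metrics yourself; it always helps in situations like this.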
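To make the first suggestion concrete, here is a minimal sketch of measuring the peak memory of a task body with the standard-library `tracemalloc` module. The decorator `log_peak_memory` and the function `build_report` are hypothetical names used only for illustration; in a real project the decorated function would be the body of your own Celery task. Keep in mind that `tracemalloc` only sees allocations made through Python's allocator, so memory used inside C extensions may not show up.

```python
import functools
import tracemalloc


def log_peak_memory(func):
    """Report the peak Python-level allocation made while func runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            return func(*args, **kwargs)
        finally:
            # get_traced_memory() returns (current, peak) in bytes.
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            wrapper.last_peak_bytes = peak
            print(f"{func.__name__}: peak {peak / 2**20:.1f} MiB")
    return wrapper


@log_peak_memory
def build_report(n):
    # Hypothetical stand-in for the work a Celery task would do:
    # allocate n blocks of 1 KiB each and return how many were built.
    rows = [bytes(1024) for _ in range(n)]
    return len(rows)


if __name__ == "__main__":
    build_report(100_000)
```

Comparing the reported peak, multiplied by the worker concurrency, against the 2 GB available on the instance gives a rough idea of whether the tasks alone can exhaust memory.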

    Good luck!

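    【Discussion】:

    • @WaqasAli any progress?
    • Concurrency is not set explicitly (we use the default), but we set --max-tasks-per-child=1 and things look much better now. Is that okay, or do you have a recommendation for this value?
    • Can you recommend any tools for profiling celery tasks? Or should we use the Python profiler?
    • With your instance type, the default concurrency should be 1 (1 core). For metrics collection, there is github.com/prometheus/node_exporter and many other solutions that collect this data so you can view it later in Grafana or elsewhere.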
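The worker limits mentioned in these comments can also be written down as Celery settings instead of command-line flags. The following is only a sketch, not the poster's actual configuration: the module name `celeryconfig.py` and the 300 000 KiB memory limit are assumptions chosen for illustration. `worker_max_tasks_per_child` is the setting that corresponds to `--max-tasks-per-child`, and `worker_max_memory_per_child` is an alternative that recycles a pool process based on its resident memory rather than its task count.

```python
# celeryconfig.py (hypothetical module name)

# One pool process, matching the single vCPU of a t2.small.
worker_concurrency = 1

# Replace each pool process after it has executed one task,
# equivalent to --max-tasks-per-child=1.
worker_max_tasks_per_child = 1

# Alternative to the task-count limit: replace a pool process once its
# resident memory exceeds this many kilobytes. The value is an assumption;
# pick it from measurements of your own tasks.
worker_max_memory_per_child = 300_000
```

Such a module would typically be loaded with `app.config_from_object("celeryconfig")`. Both limits apply to the pool (child) processes, whereas the traceback in the question was raised in MainProcess, so they may reduce memory pressure on the instance without necessarily addressing that specific failure.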