【Question title】: Apache Beam SIGKILL
【Posted】: 2021-05-25 08:16:55
【Question description】:

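Question

How can I best run a memory-intensive pipeline in Apache Beam?

Background

I wrote a pipeline that takes the Naemura Bird dataset and converts the images and annotations into TFRecords containing TF Examples in the required format for the TF Object Detection API.

I tested it with the DirectRunner on a small subset of images (4 or 5) and it worked fine.

Problem

When I run the pipeline on a larger dataset (day 1 of 3, about 21 GB), it crashes after a while with a non-descriptive SIGKILL. I do see a memory spike before the crash and assume the process is killed because of excessive memory load.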
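One way to check the memory-spike assumption is to log the peak resident memory of the process at a few points in the pipeline. The following is a minimal sketch that uses only the standard library; the helper name `peak_rss_mib` is made up for illustration, and it only measures the process it is called in, so it would not see separate worker processes if the runner starts any.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    print(f"peak RSS so far: {peak_rss_mib():.1f} MiB")
```

Calling this helper from inside a transform (for example once per processed element, through the logging module) would show whether the peak keeps climbing toward the machine's physical memory before the kill.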

I ran the pipeline through strace. These are the last lines of the trace:

[pid 53702] 10:00:09.105069 poll([{fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}], 11, 100) = 0 (Timeout)
[pid 53702] 10:00:09.205826 poll([{fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}], 11, 100 <unfinished ...>
[pid 53534] 10:00:09.259806 mmap(NULL, 63082496, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3aa43d7000
[pid 53694] 10:00:09.297140 <... clock_nanosleep resumed>NULL) = 0
[pid 53694] 10:00:09.297273 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=200000000},  <unfinished ...>
[pid 53702] 10:00:09.306409 <... poll resumed>) = 0 (Timeout)
[pid 53702] 10:00:09.306478 poll([{fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}], 11, 100) = 0 (Timeout)
[pid 53702] 10:00:09.406866 poll([{fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}], 11, 100 <unfinished ...>
[pid 53710] 10:03:55.844910 <... futex resumed>) = ?
[pid 53709] 10:03:57.797618 <... futex resumed>) = ?
[pid 53708] 10:03:57.797737 <... futex resumed>) = ?
[pid 53707] 10:03:57.797793 <... futex resumed>) = ?
[pid 53706] 10:03:57.797847 <... futex resumed>) = ?
[pid 53705] 10:03:57.797896 <... futex resumed>) = ?
[pid 53704] 10:03:57.797983 <... futex resumed>) = ?
[pid 53703] 10:03:57.798035 <... futex resumed>) = ?
[pid 53702] 10:03:57.798085 +++ killed by SIGKILL +++
[pid 53701] 10:03:57.798124 <... futex resumed>) = ?
[pid 53700] 10:03:57.798173 <... futex resumed>) = ?
[pid 53699] 10:03:57.798224 <... futex resumed>) = ?
[pid 53698] 10:03:57.798272 <... futex resumed>) = ?
[pid 53697] 10:03:57.798321 <... accept4 resumed> <unfinished ...>) = ?
[pid 53694] 10:03:57.798372 <... clock_nanosleep resumed> <unfinished ...>) = ?
[pid 53693] 10:03:57.798426 <... futex resumed>) = ?
[pid 53660] 10:03:57.798475 <... futex resumed>) = ?
[pid 53641] 10:03:57.798523 <... futex resumed>) = ?
[pid 53640] 10:03:57.798572 <... futex resumed>) = ?
[pid 53639] 10:03:57.798620 <... futex resumed>) = ?
[pid 53710] 10:03:57.798755 +++ killed by SIGKILL +++
[pid 53709] 10:03:57.798792 +++ killed by SIGKILL +++
[pid 53708] 10:03:57.798828 +++ killed by SIGKILL +++
[pid 53707] 10:03:57.798864 +++ killed by SIGKILL +++
[pid 53706] 10:03:57.798900 +++ killed by SIGKILL +++
[pid 53705] 10:03:57.798937 +++ killed by SIGKILL +++
[pid 53704] 10:03:57.798973 +++ killed by SIGKILL +++
[pid 53703] 10:03:57.799008 +++ killed by SIGKILL +++
[pid 53701] 10:03:57.799044 +++ killed by SIGKILL +++
[pid 53700] 10:03:57.799079 +++ killed by SIGKILL +++
[pid 53699] 10:03:57.799116 +++ killed by SIGKILL +++
[pid 53698] 10:03:57.799152 +++ killed by SIGKILL +++
[pid 53697] 10:03:57.799187 +++ killed by SIGKILL +++
[pid 53694] 10:03:57.799245 +++ killed by SIGKILL +++
[pid 53693] 10:03:57.799282 +++ killed by SIGKILL +++
[pid 53660] 10:03:57.799318 +++ killed by SIGKILL +++
[pid 53641] 10:03:57.799354 +++ killed by SIGKILL +++
[pid 53640] 10:03:57.799390 +++ killed by SIGKILL +++
[pid 53639] 10:03:57.910349 +++ killed by SIGKILL +++
10:03:57.910381 +++ killed by SIGKILL +++

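【Comments】:

  • Are you using clean data? The log line pointing at [...] mmap(NULL, [...] might indicate bad input data.
  • @vdolez, thanks for your comment. I'll look into how null values could have snuck in.
  • How large are the individual files that you read in ImageToTfExample in your code? Is any file larger than the memory on your dev machine? You may want to try a production runner such as Dataflow or Flink, with workers that have a larger memory footprint.
  • @vdolez, if you post an answer, I'll accept it. I went with the more elegant solution of using tagged outputs. That way I get one PCollection for the successful conversions and one for the failed ones, instead of ugly None values. I'm not 100% sure this was the root cause. @RezaRokni may also be onto something. But even though the images are large, I did not suspect an OOM on a PC with 16 GB of RAM. I did try different shard numbers.
  • I'll try to fix up a few things during the day :)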

Tags: tensorflow tensorflow2.0 apache-beam apache-beam-io sigkill


【Solution 1】:

Several factors could cause this behavior. Since the pipeline runs fine with less data, analyzing what has changed may lead us to a solution.

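Option 1: Clean the input data

The third line of the log you provided (the mmap(NULL, call) may indicate that the larger pipeline is processing unclean data, which could mean that | "Get Content" >> beam.Map(lambda x: x.read_utf8()) is trying to read a null value.

Is there an empty file somewhere? Are your files UTF-8 encoded?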
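Both questions can be answered by scanning the input files directly, which is more reliable than reading the trace: in an mmap call, a NULL first argument only lets the kernel choose the mapping address, so that line by itself does not prove the input is bad. The sketch below uses only the standard library; the function name `check_utf8_file` and the three status strings are made up for illustration.

```python
import codecs
from pathlib import Path


def check_utf8_file(path, chunk_size=1 << 20):
    """Return "empty", "not-utf8" or "ok" for one file, reading it in chunks."""
    path = Path(path)
    if path.stat().st_size == 0:
        return "empty"
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with path.open("rb") as handle:
            while chunk := handle.read(chunk_size):
                decoder.decode(chunk)
            # Flush the decoder so a truncated multi-byte sequence is reported.
            decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "not-utf8"
    return "ok"
```

Running it over the annotation files before starting the pipeline, for example with `Path(root).rglob("*")`, would list the inputs that `read_utf8()` is likely to choke on.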

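Option 2: Use smaller files as input

My guess is that fileio.ReadMatches() tries to load each whole file into memory, which could cause errors if your files are larger than memory. Can you split the data into smaller files?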
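Before splitting anything, it may help to find out how large the biggest inputs really are. The following standard-library sketch lists the largest files under a directory; `largest_files` is a hypothetical helper name and the dataset path is whatever directory the pipeline reads from.

```python
import os


def largest_files(root, top=5):
    """Return the `top` largest files under `root` as (size_in_bytes, path)."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            sizes.append((os.path.getsize(full_path), full_path))
    # Largest first; ties are ordered by path.
    return sorted(sizes, reverse=True)[:top]
```

If the largest file is only a small fraction of the machine's RAM, a single oversized input is unlikely to be the cause, and the number of elements held in memory at once becomes the more plausible suspect.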

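Option 3: Use a larger infrastructure

If the files are too large for the machine on which you currently run the DirectRunner, you could try another runner in the cloud, such as the DataflowRunner, to use on-demand infrastructure.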
