Logstash 消耗大量 CPU答案

【问题标题】：Logstash consuming huge amount of CPULogstash 消耗大量 CPU
【发布时间】：2016-05-17 17:57:29
【问题描述】：

详情：

dockerized 环境，隔离，只有 ELK
logstash 版本 2.3.1（2016 年 4 月 4 日）
logstash-input-kafka-2.0.6
弹性搜索输出
200 条记录/秒
8 个内核、8GB 内存、2GB 日志存储堆（所有 8 个内核的 CPU 消耗（“ni”）达到 100%）
除了 elasticsearch 之外，没有其他进程消耗超过 1% 的 cpu/mem
一段时间后（通常是几个小时）logstash 停止向 ES 发送数据
logstash 日志中没有任何内容
当 logstash 服务重新启动后，一段时间内一切正常

logstash 进程：

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
systemd+ 15395  457  5.3 6953124 424628 ?      SNl  08:57  63:19 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/logstash -XX:HeapDumpPath=/opt/logstash/heapdump.hprof -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log --auto-reload

Logstash 子进程（顶部 -Hp）

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                               
 4637 systemd+  39  19 7045796 2,201g   5728 R 97,7 28,8 292:53.17 [main]>worker7                                                                                        
 4632 systemd+  39  19 7045796 2,201g   5728 R 97,3 28,8 140:04.15 [main]>worker2                                                                                        
 4635 systemd+  39  19 7045796 2,201g   5728 R 97,3 28,8 223:43.32 [main]>worker5                                                                                        
 4630 systemd+  39  19 7045796 2,201g   5728 R 97,0 28,8 281:26.59 [main]>worker0                                                                                        
 4631 systemd+  39  19 7045796 2,201g   5728 R 96,3 28,8 160:18.87 [main]>worker1                                                                                        
 4636 systemd+  39  19 7045796 2,201g   5728 R 96,3 28,8 134:52.64 [main]>worker6                                                                                        
 4517 systemd+  39  19 7045796 2,201g   5728 R 79,7 28,8 183:59.43 LogStash::Runne                                                                                       
 4493 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:23.33 java                                                                                                  
 4495 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:18.66 java                                                                                                  
 4497 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:22.12 java                                                                                                  
 4498 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:21.07 java                                                                                                  
 4499 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:22.47 java                                                                                                  
 4494 systemd+  39  19 7045796 2,201g   5728 S  1,3 28,8  33:23.83 java                                                                                                  
 4496 systemd+  39  19 7045796 2,201g   5728 S  0,7 28,8  33:22.12 java                                                                                                  
 4500 systemd+  39  19 7045796 2,201g   5728 S  0,7 28,8  33:22.44 java                                                                                                  
 4633 systemd+  39  19 7045796 2,201g   5728 S  0,7 28,8   3:22.78 [main]>worker3                                                                                        
20526 systemd+  39  19 7045796 2,201g   5728 S  0,7 28,8   0:15.27 [main]<kafka                                                                                          
 4492 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:23.78 LogStash::Runne                                                                                       
 4532 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:55.25 [main]<kafka                                                                                          
 4536 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   1:29.43 [main]<kafka                                                                                          
 4606 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:03.88 [main]<kafka                                                                                          
 4638 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:48.34 pipeline.main                                                                                         
20529 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:14.41 [main]<kafka                                                                                          
20533 systemd+  39  19 7045796 2,201g   5728 S  0,3 28,8   0:15.62 [main]<kafka

Logstash 配置

像这样的几个块：

kafka {
    zk_connect => "pbdcnt1:2181,pbdcnt2:2181,pbdcnt3:2181"
    topic_id => "transactions"
    tags => ["transactions"]
    group_id => "prod_elastic_fraud"
    codec => plain {
        charset => "Windows-1250"
    }
}

几个这样的：

filter {
    if "transactions" in [tags] {
        grok{
            match=>["message","(?<log_time>.*?)? ?(?<src>.*?)? ?(?<insert_time>.*?);(?<col1>.*?);(?<col2>.*?);...;(?<colN>.*?)"]
        }
    }

输出是这样的：

output {
  if [tags] {
    elasticsearch {
      hosts => ["localhost"]
      index => "%{tags}-%{+YYYY.MM.dd}"
    }
  }
}

我的错在哪里？如果需要更多数据，请询问。

谢谢，埃迪

更新：

在某处泄漏？垃圾回收时间不断增加（jstat）

S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC  

64.69   0.00  38.41  13.47  59.96  57906 2977.233  4916  206.461 3183.694 Allocation Failure   No GC               
  0.00  51.04  27.61  13.48  59.96  57907 2977.291  4916  206.461 3183.753 Allocation Failure   No GC               
 38.59  51.04 100.00  13.48  59.96  57908 2977.291  4916  206.461 3183.753 Allocation Failure   Allocation Failure  
 57.04   0.00  77.47  13.48  59.96  57908 2977.342  4916  206.461 3183.803 Allocation Failure   No GC               
  0.00  72.64  81.57  13.48  59.96  57909 2977.399  4916  206.461 3183.860 Allocation Failure   No GC               
 61.65   0.00  34.59  13.48  59.96  57910 2977.448  4916  206.461 3183.909 Allocation Failure   No GC               
  0.00  45.05   8.99  13.48  59.96  57911 2977.492  4916  206.461 3183.953 Allocation Failure   No GC               
 48.47   0.00   7.30  13.49  59.96  57912 2977.539  4916  206.461 3184.000 Allocation Failure   No GC               
 48.47   0.00  72.43  13.49  59.96  57912 2977.539  4916  206.461 3184.000 Allocation Failure   No GC               
  0.00  53.80  43.27  13.49  59.96  57913 2977.588  4916  206.461 3184.050 Allocation Failure   No GC               
 59.96   0.00  49.70  13.49  59.96  57914 2977.636  4916  206.461 3184.098 Allocation Failure   No GC               
 59.96   0.00  95.84  13.49  59.96  57914 2977.636  4916  206.461 3184.098 Allocation Failure   No GC               
  0.00  54.05  84.24  13.49  59.96  57915 2977.683  4916  206.461 3184.145 Allocation Failure   No GC               
 65.37   0.00  84.28  13.49  59.96  57916 2977.743  4916  206.461 3184.205 Allocation Failure   No GC               
  0.00  48.62  48.73  13.49  59.96  57917 2977.798  4916  206.461 3184.259 Allocation Failure   No GC               
 53.00   0.00  24.35  13.50  59.96  57918 2977.845  4916  206.461 3184.307 Allocation Failure   No GC               
  0.00  52.87  19.08  13.50  59.96  57919 2977.900  4916  206.461 3184.361 Allocation Failure   No GC               
  0.00  52.87  84.52  13.50  59.96  57919 2977.900  4916  206.461 3184.361 Allocation Failure   No GC               
 72.04   0.00  60.68  13.50  59.96  57920 2977.954  4916  206.461 3184.415 Allocation Failure   No GC               
  0.00  65.02  64.85  13.51  59.96  57921 2978.009  4916  206.461 3184.471 Allocation Failure   No GC               
 58.54   0.00  40.52  13.51  59.96  57922 2978.058  4916  206.461 3184.520 Allocation Failure   No GC               
 58.54   0.00  98.01  13.51  59.96  57922 2978.058  4916  206.461 3184.520 Allocation Failure   No GC               
  0.00  58.06  99.74  13.51  59.96  57924 2978.107  4916  206.461 3184.569 Allocation Failure   Allocation Failure  
 69.63   0.00  81.55  13.51  59.96  57924 2978.163  4916  206.461 3184.624 Allocation Failure   No GC               
  0.00  56.36  38.21  13.51  59.96  57925 2978.211  4916  206.461 3184.672 Allocation Failure   No GC               
 52.32   0.00  46.66  13.51  59.96  57926 2978.260  4916  206.461 3184.721 Allocation Failure   No GC               
  0.00  67.84  18.10  13.51  59.96  57927 2978.312  4916  206.461 3184.774 Allocation Failure   No GC

jstack 用于进程： https://www.dropbox.com/s/u6tz0w4v0in8lev/logstash.jstack?dl=0

【问题讨论】：

标签： elasticsearch logstash kafka-consumer-api logstash-grok logstash-configuration

【解决方案1】：

遗憾的是，正则表达式占用了过多的 CPU。当我删除正则表达式并使用带有列选项的 CSV 过滤器时（这基本上是我最初想要做的），CPU 现在一直保持在 150% 左右（对于 200 个事务/秒来说这似乎太高了）。但是，嘿，它就是现在的样子。

令人困惑的是为什么它会工作一段时间然后就停止了。也许是为了让更多的工人......

【讨论】：