Prometheus + Grafana + Alertmanager + node_exporter + blackbox_exporter + cAdvisor + DingTalk alerts

For a high-availability cluster setup, see https://www.cnblogs.com/xiaoyou2018/p/14243099.html

 

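Server public IP: 122.226.xx.220

Server private IP: 192.168.1.190

Prometheus, Grafana, Alertmanager, and cAdvisor are installed with Docker.

The setup monitors server hardware, containers, websites, API response content, and certificates.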

mkdir -p /data/prometheus

cd !$

mkdir -p {conf,prometheus,rules}

cd /data/prometheus/conf

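vi prometheus.yml          (YAML is whitespace-sensitive: indentation must be aligned and consistent, otherwise Prometheus reports an error. Hot-reload the Prometheus service after every change.)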

global:
  scrape_interval: 15s # Scrape targets every 15 seconds. The default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.1.190:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/rules/*.yml"
  - "rules.yml"
  #- "node_down.yml"
  #- "memory.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['122.226.xx.220:9090']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['122.226.xx.220:8080','192.168.1.213:8080','192.168.1.215:8080','192.168.1.216:8080','192.168.1.53:8080','192.168.1.54:8080']

  # The jobs below group nodes by type
  # Data warehouse servers
  - job_name: 'data-warehouse-servers'
    scrape_interval: 8s
    static_configs:
      - targets: ['192.168.1.45:9100','192.168.1.46:9100','192.168.1.47:9100','192.168.1.48:9100','192.168.1.44:9100','192.168.1.51:9100','192.168.1.52:9100','192.168.1.23:9100','192.168.1.211:9100','192.168.1.202:9100','192.168.1.203:9100','192.168.1.23:9100','192.168.1.61:9100']

  # Test-environment K8S servers
  - job_name: 'test-k8s-servers'
    scrape_interval: 8s
    static_configs:
      - targets: ['192.168.1.213:9100','192.168.1.215:9100','192.168.1.216:9100','192.168.1.53:9100','192.168.1.54:9100']

  # Website checks
  - job_name: "blackbox_web"
    metrics_path: /probe
    params:
      module: [http_2xx] # Look for a HTTP 200 response.
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - "/etc/prometheus/blackbox-dis.yml"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.1.190:9115

  # API response content checks
  - job_name: "blackbox_check"
    metrics_path: /probe
    params:
      module: [http_2xx_check] # Look for a HTTP 200 response and check the body.
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - "/etc/prometheus/blackbox-check.yml"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.1.190:9115

  # Port checks
  - job_name: 'blackbox_tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 192.168.1.45:9100
          - 192.168.1.190:9093
          - 192.168.1.212:6380
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.1.190:9115 # Blackbox exporter
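
The three relabel steps used by the blackbox jobs can be hard to follow in YAML. The Python sketch below is only a simplified model of what they do to a single target; the function names are hypothetical and this is not how Prometheus implements relabeling internally.

```python
from urllib.parse import urlencode


def blackbox_relabel(target: str, exporter: str = "192.168.1.190:9115") -> dict:
    """Apply the three relabel steps from the blackbox jobs to one target."""
    labels = {"__address__": target}
    # Step 1: copy the original target into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed address as the instance label shown in alerts.
    labels["instance"] = labels["__param_target"]
    # Step 3: send the actual scrape to the blackbox exporter instead.
    labels["__address__"] = exporter
    return labels


def probe_url(labels: dict, module: str) -> str:
    """Build the URL that ends up being scraped (metrics_path is /probe)."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}"


labels = blackbox_relabel("https://meeuapp.cn")
print(probe_url(labels, "http_2xx"))
```

The point of the design is that the site being probed stays visible as the instance label, while every request goes to the exporter on port 9115.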

Hot reload (this works because Prometheus is started with --web.enable-lifecycle)

curl -X POST http://122.226.xx.220:9090/-/reload
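
If the reload needs to be triggered from a script rather than by hand, the same POST can be sent with the Python standard library. This is a minimal sketch; the helper names are made up for illustration and the base URL is whatever address your Prometheus listens on.

```python
from urllib.request import Request, urlopen


def build_reload_request(base_url: str) -> Request:
    """Create the POST request for the /-/reload lifecycle endpoint."""
    return Request(base_url.rstrip("/") + "/-/reload", data=b"", method="POST")


def reload_prometheus(base_url: str, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    with urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Calling reload_prometheus("http://127.0.0.1:9090") should return 200 when the new configuration was accepted; an invalid configuration makes the request fail with an HTTP error instead.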

 

vi alertmanager.yml

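global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']   # labels used to group alerts
  receiver: webhook
  group_wait: 30s           # after the first alert of a new group arrives, wait 30 seconds for more alerts and send them together
  group_interval: 1m        # how long to wait before notifying about new alerts added to a group that was already notified
  repeat_interval: 48h      # how long to wait before re-sending a notification for alerts that are still firing

receivers:
- name: webhook
  webhook_configs:
  - url: http://192.168.1.190:8060/dingtalk/webhook1/send
    send_resolved: true
inhibit_rules:            # alert inhibition: suppress warning alerts while a matching critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']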
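
To make the inhibit rule concrete, here is a rough Python model of the decision it describes: a warning alert is muted while a critical alert with the same alertname, dev and instance labels is firing. The function name is hypothetical and the real Alertmanager logic handles more cases; the sketch treats a missing label as an empty value.

```python
def is_inhibited(target: dict, firing: list,
                 equal=("alertname", "dev", "instance")) -> bool:
    """Return True if a warning alert is muted by a firing critical alert."""
    if target.get("severity") != "warning":
        return False  # the rule only targets severity=warning
    for source in firing:
        if source.get("severity") != "critical":
            continue  # only severity=critical alerts can inhibit
        # All labels listed under `equal` must have the same value.
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False
```

Note that the rule only matches alerts that carry a severity label, so rules that use a different label name for the level are not affected by it.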

vi  docker-compose-monitor.yml

version: '2'

networks:
  monitor:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    hostname: prometheus
    restart: always
    volumes:
      - /data/prometheus/conf/prometheus.yml:/etc/prometheus/prometheus.yml
      - /data/prometheus/prometheus:/prometheus
      - /data/prometheus/rules/:/etc/prometheus/rules
      - /etc/localtime:/etc/localtime
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--web.enable-lifecycle"
      - "--web.enable-admin-api"
    ports:
      - '9090:9090'
    networks:
      - monitor

  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    hostname: alertmanager
    restart: always
    volumes:
      - /data/prometheus/conf/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - /etc/localtime:/etc/localtime
    ports:
      - '9093:9093'
    networks:
      - monitor

  grafana:
    image: grafana/grafana
    container_name: grafana
    hostname: grafana
    restart: always
    ports:
      - '3000:3000'
    networks:
      - monitor

 # node-exporter:
 #  image: quay.io/prometheus/node-exporter
 #  container_name: node-exporter
 #   hostname: node-exporter
 #   restart: always
 #   ports:
 #     - '9100:9100'
 #   networks:
 #     - monitor

  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    hostname: cadvisor
    restart: always
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - '8080:8080'
    networks:
      - monitor

# Start the containers defined in the yml file with docker-compose

docker-compose -f /data/prometheus/conf/docker-compose-monitor.yml up -d
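
After the containers start, it can help to confirm that each service answers on its published port. The sketch below assumes the ports from the compose file and the health endpoints these projects expose (/-/ready for Prometheus and Alertmanager, /api/health for Grafana, /healthz for cAdvisor); the helper names are illustrative only.

```python
from urllib.request import urlopen

# Service name -> (published port, health path); ports as in the compose file.
HEALTH_ENDPOINTS = {
    "prometheus": (9090, "/-/ready"),
    "alertmanager": (9093, "/-/ready"),
    "grafana": (3000, "/api/health"),
    "cadvisor": (8080, "/healthz"),
}


def health_urls(host: str) -> dict:
    """Build one health-check URL per service for the given host."""
    return {name: f"http://{host}:{port}{path}"
            for name, (port, path) in HEALTH_ENDPOINTS.items()}


def check_stack(host: str, timeout: float = 3.0) -> dict:
    """Return {service: True/False} depending on whether it answered 200."""
    results = {}
    for name, url in health_urls(host).items():
        try:
            with urlopen(url, timeout=timeout) as resp:
                results[name] = resp.status == 200
        except OSError:  # connection refused, timeout, HTTP error
            results[name] = False
    return results
```

For example, check_stack("192.168.1.190") returns a dictionary with one boolean per service.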

# Stop and remove all the containers that were created:
docker-compose -f /data/prometheus/conf/docker-compose-monitor.yml kill
docker-compose -f /data/prometheus/conf/docker-compose-monitor.yml rm

Install node-exporter with a script

#!/bin/bash
#Supports System:Ubuntu16.04,CentOS7



cd /opt
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
tar -zxvf node_exporter-1.0.1.linux-amd64.tar.gz
mv /opt/node_exporter-1.0.1.linux-amd64  node_exporter
#rm -rf /opt/node_exporter-1.0.1.linux-amd64.tar.gz


groupadd prometheus
useradd -g prometheus -s /sbin/nologin prometheus -M
chown -R prometheus:prometheus /opt/node_exporter

cat > node_exporter.service << EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/opt/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

mv /opt/node_exporter.service /etc/systemd/system/
chown prometheus:prometheus /etc/systemd/system/node_exporter.service

systemctl daemon-reload
systemctl start node_exporter.service
systemctl enable node_exporter.service

echo "Run 'curl localhost:9100' to check that the installation succeeded"
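
Beyond a manual curl, the text returned by the exporter's /metrics endpoint can be checked for the metrics that the alert rules rely on. The following is a small sketch of such a check; the function names are hypothetical, and the parser assumes lines without trailing timestamps, which is what node_exporter normally emits.

```python
def parse_metrics(text: str) -> list:
    """Parse Prometheus text exposition lines into (name, value) pairs."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        left, value = line.rsplit(None, 1)  # the value is the last token
        name = left.split("{", 1)[0]        # drop the {label="..."} part
        samples.append((name, float(value)))
    return samples


def missing_metrics(text: str, required: list) -> list:
    """Return the required metric names that do not appear in the text."""
    present = {name for name, _ in parse_metrics(text)}
    return [name for name in required if name not in present]
```

Feeding it the body of http://localhost:9100/metrics together with names such as node_cpu_seconds_total and node_memory_MemAvailable_bytes shows quickly whether the rules will have data to work with.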

Installing cAdvisor

docker run -d -p 8080:8080 --name cadvisor -v /:/rootfs:ro -v /var/run:/var/run:rw -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro -v /dev/disk/:/dev/disk:ro google/cadvisor:latest    

Installing blackbox_exporter

wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.18.0/blackbox_exporter-0.18.0.linux-amd64.tar.gz
tar -zxvf blackbox_exporter-0.18.0.linux-amd64.tar.gz  -C /usr/local/
mv /usr/local/blackbox_exporter-0.18.0.linux-amd64/  /usr/local/blackbox
vi /etc/systemd/system/blackbox_exporter.service 
[Unit]
Description=blackbox_exporter
After=network.target 

[Service]
WorkingDirectory=/usr/local/blackbox
ExecStart=/usr/local/blackbox/blackbox_exporter \
         --config.file=/usr/local/blackbox/blackbox.yml
[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl start blackbox_exporter
systemctl enable blackbox_exporter

Edit the configuration file to monitor websites and the content returned by websites and APIs (restart the blackbox service after editing)

cd /usr/local/blackbox/

vi blackbox.yml

modules:
  http_2xx:
    prober: http 
  http_2xx_check:
    prober: http
  # The section below is what needs to be added
    timeout: 5s
    http:
      #valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      valid_status_codes: []
      method: GET
      #headers:
        #Host:test.kaboy.net/MessageMon.aspx
        #Accept-Language: en-US
        #Origin:test.kaboy.net
      fail_if_body_matches_regexp:    # if the body returned by the GET request matches "#fail#", the probe fails and probe_success is 0
        - "#fail#"
      fail_if_body_not_matches_regexp:
        - "#SUCCESS#"    # if the body returned by the GET request does not match "#SUCCESS#", the probe fails and probe_success is 0

  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
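
The http_2xx_check module combines a status check with two body checks. The Python sketch below models that decision for a single response; it is a simplification under the assumption that an empty valid_status_codes list means any 2xx status, and it is not the exporter's own code. The function name is made up for illustration.

```python
import re


def probe_success(status: int, body: str,
                  fail_if_matches=("#fail#",),
                  fail_if_not_matches=("#SUCCESS#",)) -> int:
    """Return 1 if the response passes the module's checks, else 0."""
    if not 200 <= status < 300:
        return 0  # default when valid_status_codes is empty: 2xx only
    if any(re.search(pattern, body) for pattern in fail_if_matches):
        return 0  # a forbidden pattern was found in the body
    if any(not re.search(pattern, body) for pattern in fail_if_not_matches):
        return 0  # a required pattern is missing from the body
    return 1
```

The patterns are case-sensitive, so a body containing "#success#" in lower case would not satisfy the "#SUCCESS#" pattern.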

Enter the container and create blackbox-dis.yml and blackbox-check.yml

docker exec -it prometheus /bin/sh

 

vi /etc/prometheus/blackbox-dis.yml

- targets:
  - https://meeuapp.cn
  #- https://test.kaboy.net/MessageMon.aspx
  #- https://www.baidu.com

vi /etc/prometheus/blackbox-check.yml

- targets:
  #- https://meeuapp.cn
  - https://test.kaboy.net/MessageMon.aspx   # this site returns success
  #- https://www.baidu.com

Back on the host (outside the container), restart blackbox_exporter:

systemctl restart blackbox_exporter

 

Create the rule files

vi /data/prometheus/rules/node_exporter.yml

groups:
    - name: host-status-alerts
      rules:
      - alert: HostDown
        expr: up == 0
        for: 1m
        labels:
          status: critical
        annotations:
          summary: "{{$labels.instance}}: server is down"
          description: "{{$labels.instance}}: target has been unreachable for more than 1 minute"

      - alert: HighCpuUsage
        expr: 100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 80
        for: 1m
        labels:
          status: general
        annotations:
          summary: "{{$labels.instance}}: CPU usage too high"
          description: "{{$labels.instance}}: CPU usage is above 80% (current: {{$value}}%)"

      - alert: HighMemoryUsage
        expr: round(100- node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100) > 90
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage too high"
          description: "Current usage: {{ $value }}%"

      - alert: HighDiskIO
        expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100 > 60
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{$labels.instance}}: disk I/O utilization too high"
          description: "{{$labels.instance}}: disk I/O utilization is above 60% (current: {{$value}}%)"

      - alert: HighNetworkReceive
        expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{$labels.instance}}: inbound network bandwidth too high"
          description: "{{$labels.instance}}: inbound traffic is above the threshold of 10,240,000 bytes/s (about 82 Mbit/s). Receive rate divided by 100: {{$value}}"

      - alert: HighTcpConnections
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{$labels.instance}}: too many established TCP connections"
          description: "{{$labels.instance}}: more than 1000 established TCP connections (current: {{$value}})"

      - alert: DiskSpace
        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 90
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{$labels.mountpoint}}: disk partition usage too high"
          description: "{{$labels.mountpoint}}: disk partition usage is above 90% (current: {{$value}}%)"
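
The CPU rule is easier to read with numbers plugged in. The sketch below reproduces its arithmetic in Python under simplifying assumptions: irate is approximated from the last two samples of each core's idle counter, and the helper name and sample layout are invented for this example.

```python
def cpu_usage_percent(idle_samples: list) -> float:
    """Compute 100 - avg(idle rate) * 100 across the cores of one instance.

    idle_samples holds one ((t0, idle0), (t1, idle1)) pair per core: the
    last two samples of node_cpu_seconds_total{mode="idle"}.
    """
    # Per-core idle rate: idle seconds gained per second of wall time.
    rates = [(idle1 - idle0) / (t1 - t0)
             for (t0, idle0), (t1, idle1) in idle_samples]
    avg_idle = sum(rates) / len(rates)
    return 100 - avg_idle * 100


# Two cores sampled 15 s apart: one was idle for 12 s, the other for 3 s.
usage = cpu_usage_percent([((0, 100.0), (15, 112.0)),
                           ((0, 200.0), (15, 203.0))])
print(usage)
```

With these sample values the average idle fraction is 0.5, so the usage is 50%, which stays below the 80% threshold of the rule.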

vi /data/prometheus/rules/blackbox_exporter.yml

groups:
- name: site-status-alerts
  rules:
  - alert: ProbeFailed
    expr: probe_success == 0
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} is unreachable"
      description: "{{$labels.instance}} is unreachable"

vi /data/prometheus/rules/ssl.yml

groups:
- name: check_ssl_status
  rules:
  - alert: "SSL certificate expiry warning"
    expr: (probe_ssl_earliest_cert_expiry - time())/86400 <15
    for: 1h
    labels:
      severity: warn
    annotations:
      description: 'The certificate for {{$labels.instance}} expires in {{ printf "%.1f" $value }} days; please renew it soon'
      summary: "SSL certificate expiry warning"
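
The certificate rule subtracts the current Unix time from the certificate's expiry timestamp and divides by 86400 to get days. As a worked example, the same arithmetic in Python looks like this; the function names are illustrative and the timestamps are arbitrary sample values.

```python
SECONDS_PER_DAY = 86400


def days_until_expiry(cert_expiry: float, now: float) -> float:
    """Days left, as in (probe_ssl_earliest_cert_expiry - time()) / 86400."""
    return (cert_expiry - now) / SECONDS_PER_DAY


def should_alert(cert_expiry: float, now: float, threshold_days: float = 15) -> bool:
    """True when fewer than threshold_days remain, matching the `< 15` rule."""
    return days_until_expiry(cert_expiry, now) < threshold_days


now = 1_700_000_000
expiry = now + 10 * SECONDS_PER_DAY
print("%.1f" % days_until_expiry(expiry, now), should_alert(expiry, now))
```

A certificate with exactly 15 days left does not trigger the rule, because the comparison is strictly less than 15.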

vi /data/prometheus/rules/docker.yml

groups:
- name:  Docker containers monitoring
  rules: 
  - alert: ContainerKilled
    expr: time() - container_last_seen > 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container killed (instance {{ $labels.instance }})"
      description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage (instance {{ $labels.instance }})"
      description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_usage_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes) BY (instance, name) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container Memory usage (instance {{ $labels.instance }})"
      description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: ContainerVolumeUsage
    expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container Volume usage (instance {{ $labels.instance }})"
      description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: ContainerVolumeIoUsage
    expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container Volume IO usage (instance {{ $labels.instance }})"
      description: "Container Volume IO usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container high throttle rate (instance {{ $labels.instance }})"
      description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: PgbouncerActiveConnections
    expr: pgbouncer_pools_server_active_connections > 200
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "PGBouncer active connections (instance {{ $labels.instance }})"
      description: "PGBouncer pools are filling up\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: PgbouncerErrors
    expr: increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "PGBouncer errors (instance {{ $labels.instance }})"
      description: "PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: PgbouncerMaxConnections
    expr: rate(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[1m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "PGBouncer max connections (instance {{ $labels.instance }})"
      description: "The number of PGBouncer client connections has reached max_client_conn.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: SidekiqQueueSize
    expr: sidekiq_queue_size{} > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Sidekiq queue size (instance {{ $labels.instance }})"
      description: "Sidekiq queue {{ $labels.name }} is growing\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: SidekiqSchedulingLatencyTooHigh
    expr: max(sidekiq_queue_latency) > 120
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Sidekiq scheduling latency too high (instance {{ $labels.instance }})"
      description: "Sidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: ConsulServiceHealthcheckFailed
    expr: consul_catalog_service_node_healthy == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Consul service healthcheck failed (instance {{ $labels.instance }})"
      description: "Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: ConsulMissingMasterNode
    expr: consul_raft_peers < 3
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Consul missing master node (instance {{ $labels.instance }})"
      description: "Numbers of consul raft peers should be 3, in order to preserve quorum.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: ConsulAgentUnhealthy
    expr: consul_health_node_status{status="critical"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Consul agent unhealthy (instance {{ $labels.instance }})"
      description: "A Consul agent is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
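
The ContainerVolumeUsage rule is meant to express inode usage as a percentage compared against 80. The position of the closing parenthesis matters for that: the used fraction has to be computed first and then multiplied by 100. A short Python sketch with made-up inode counts shows the difference; the function names are hypothetical.

```python
def inode_usage_percent(free: float, total: float) -> float:
    """Used inodes as a percentage: (1 - free / total) * 100."""
    return (1 - free / total) * 100


def misplaced_parenthesis(free: float, total: float) -> float:
    """1 - (free / total) * 100: never exceeds 1, so `> 80` cannot be true."""
    return 1 - (free / total) * 100


print(inode_usage_percent(25, 100))    # 75% of the inodes are in use
print(misplaced_parenthesis(25, 100))  # a negative number, not a percentage
```

When reusing rules from public collections, it is worth evaluating the expression once in the Prometheus expression browser to confirm that the value lands in the expected range.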

 

Prometheus

http://122.226.xx.220:9090/


Grafana

http://122.226.xx.220:3000/

node exporter dashboard template: 8919

blackbox exporter dashboard templates: 9965, 7587

Docker dashboard template: 193

DingTalk alerts

Add a robot in DingTalk.

DingTalk robot webhook: https://oapi.dingtalk.com/robot/send?access_token=xxx

Install prometheus-webhook-dingtalk with Docker:

docker pull timonwong/prometheus-webhook-dingtalk
docker run -d --restart always --name dingding -p 8060:8060 -v /etc/localtime:/etc/localtime timonwong/prometheus-webhook-dingtalk --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx"
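
To illustrate what a bridge between Alertmanager and DingTalk has to do, the sketch below turns an Alertmanager webhook payload into a DingTalk text message body. It is a simplified model with a hypothetical function name, not the implementation of prometheus-webhook-dingtalk, which renders its own message templates.

```python
def to_dingtalk_text(webhook_payload: dict) -> dict:
    """Convert an Alertmanager webhook payload into a DingTalk text message."""
    lines = []
    for alert in webhook_payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown")  # "firing" or "resolved"
        name = labels.get("alertname", "?")
        summary = annotations.get("summary", "")
        lines.append(f"[{status}] {name}: {summary}")
    # DingTalk robots accept a JSON body with msgtype and a matching section.
    return {"msgtype": "text", "text": {"content": "\n".join(lines)}}
```

The resulting dictionary would be sent as JSON to the robot webhook URL. If the robot is configured with a keyword filter, the message content has to contain that keyword.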

When a rule fires, alerts arrive in DingTalk, here for the website check and the API response content check.


Problems:

1. cAdvisor reports an error when started with Docker

Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
Failed to start container manager: inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory

Solution:

mount -o remount,rw '/sys/fs/cgroup'
ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu
docker restart cadvisor

 

2. The blackbox exporter dashboard template reports an error

Panel plugin not found: grafana-piechart-panel

Solution:

grafana-cli plugins install grafana-piechart-panel

 
