|
导航:这里主要是列出一个prometheus一些系统的学习过程,最后按照章节顺序查看,由于写作该文档经历了不同时期,所以在文中有时出现 的云环境不统一,但是学习具体使用方法即可,在最后的篇章,有一个完整的腾讯云的实战案例。 8.kube-state-metrics 和 metrics-server 13.Grafana简单用法 参考: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config https://www.bookstack.cn/read/prometheus_practice/introduction-README.md |
有了上述几章的学习之后,这里将总结一下之前所有内容;正好,在我司架构中正好是多数据中心,多项目,且k8s环境,宿主机环境共存,这里正好将学习的知识用在实战中.
以下可能不会贴出所有配置,但是会将重要的展示出来.
基础概念:
高可用: 2台汇总所有数据的prometheus 做负载均衡,这样即使有一台down机也不会影响任务,这2台机器汇总所有联邦prometheus的数据.
持久:2台汇总所有数据的prometheus 数据保存在本地这样不利于迁移和更长之间的持久,所以2台prometheus的数据保存至远端数据库.
联邦:简单来说就是zabbix-proxy的概念,这些prometheus不需要去持久化数据,只要能采集数据,主prometheus需要数据时能采集到即可.
这样,多数据中心,多集群,多环境就可以使用这种架构解决.
1.环境信息
| 角色 | 部署方式 | 信息 | 云环境 | 作用 |
| Prometheus 主(总)/alertmanager01 | 宿主 | 10.1.1.10 | 腾讯云 | 汇总prometheus |
| Prometheus 备(总)/alertmanager02 | 宿主 | 10.1.1.5 | 腾讯云 | 汇总prometheus |
| 联邦 Prometheus | K8s | K8s | 阿里云 | Lcm k8s环境 |
| 联邦 Prometheus | 宿主 | 阿里云内网 | 阿里云 | 罪恶王冠游戏 |
| 软件对应版本 | |
| 软件名称 | 软件版本 |
| 主prometheus | 2.13.1 |
| 联邦子prometheus | 2.13.1 |
| Node-exporter | 0.18.1 |
| blackbox_exporter | 0.16.0 |
| Consul | 1.6.1 |
| Metrics-server | 0.3.5 |
| Kube-state-metrics | 1.8.0 |
这里我们由小变大来配置这个项目.
本章不讲解任何prometsql和alert告警规则的方法,只讲解集群的配置方式。
2.子联邦配置
2.1 K8s集群联邦点
因为子联邦节点是监控k8s集群的,为了方便,肯定是部署在prometheus里,下面看看deploy的配置文件。
这里不讲解获取数据来源工具组件等一些授权以及安装,默认当作已安装完成的情况,需要的去查看kube-state-metrics和metrics-server章节。
apiVersion: v1 kind: "Service" metadata: name: prometheus namespace: monitoring labels: name: prometheus spec: ports: - name: prometheus protocol: TCP port: 9090 targetPort: 9090 nodePort: 30946 selector: app: prometheus type: NodePort --- apiVersion: apps/v1 kind: Deployment metadata: labels: name: prometheus name: prometheus namespace: monitoring spec: replicas: 1 selector: matchLabels: app: prometheus template: metadata: labels: app: prometheus spec: serviceAccountName: prometheus containers: - name: prometheus image: prom/prometheus:v2.3.0 env: - name: ver value: "15" command: - "/bin/prometheus" args: - "--config.file=/etc/prometheus/prometheus.yml" - "--log.level=debug" ports: - containerPort: 9090 protocol: TCP volumeMounts: - mountPath: "/etc/prometheus" name: prometheus-config volumes: - name: prometheus-config configMap: name: prometheus-config
当然也不能少了RBAC的授权。
apiVersion: rbac.authorization.k8s.io/v1beta1 kind: ClusterRole metadata: name: prometheus rules: - apiGroups: [""] resources: - nodes #node发现模式的授权资源,不然通过kubelet自带的发现模式不授权这个资源,会在prometheus爆出403错误 - nodes/metrics - nodes/proxy - services - endpoints - pods - namespaces verbs: ["get", "list", "watch"] - apiGroups: - extensions resources: - ingresses verbs: ["get", "list", "watch"] - nonResourceURLs: ["/metrics","/api/*"] verbs: ["get"] --- apiVersion: v1 kind: ServiceAccount metadata: name: prometheus namespace: monitoring --- apiVersion: rbac.authorization.k8s.io/v1beta1 kind: ClusterRoleBinding metadata: name: prometheus roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: prometheus subjects: - kind: ServiceAccount name: prometheus namespace: monitoring
上面的deploy挂载了一个卷,就是配置文件,如下的配置文件,没有告警条目,没有持久化存储,因为他们不需要,只要总prometheus来向它收集数据就可以了.它也是一个prometheus,只是做的工作比较少罢了.
这个prometheus包含了以下采集任务:
- kubernetes-kubelet
- kubernetes-cadvisor
- kubernetes-pods
- kubernetes-apiservers
- kubernetes-services
- kubernetes-ingresses
- kubernetes-service-endpoints
基本涵盖了k8s 的大部分的key,所以k8s内联邦prometheus角色需要注意这么几点就可以.
下面的配置文件如下,可能(cn-lcm-prod)项目标识不太一样,这里可以忽略,改成自己对应的项目即可
apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config namespace: monitoring data: prometheus.yml: | global: scrape_interval: 15s evaluation_interval: 15s scrape_configs: - job_name: 'prometheus' static_configs: - targets: ['localhost:9090'] - job_name: 'cn-lcm-prod-kubernetes-kubelet' scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token kubernetes_sd_configs: - role: node relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) - target_label: __address__ replacement: kubernetes.default.svc:443 - source_labels: [__meta_kubernetes_node_name] regex: (.+) target_label: __metrics_path__ replacement: /api/v1/nodes/${1}/proxy/metrics - job_name: 'cn-web-prod-kubernetes-cadvisor' scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token kubernetes_sd_configs: - role: node relabel_configs: - target_label: __address__ replacement: kubernetes.default.svc:443 - source_labels: [__meta_kubernetes_node_name] regex: (.+) target_label: __metrics_path__ replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor - action: labelmap regex: __meta_kubernetes_node_label_(.+) - job_name: 'cn-lcm-prod-kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] action: replace regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 target_label: __address__ - action: labelmap regex: __meta_kubernetes_pod_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: kubernetes_pod_name - job_name: 'cn-lcm-prod-kubernetes-apiservers' kubernetes_sd_configs: - role: endpoints scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token relabel_configs: - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] action: keep regex: default;kubernetes;https - target_label: __address__ replacement: kubernetes.default.svc:443 - job_name: 'cn-lcm-prod-kubernetes-services' metrics_path: /probe params: module: [http_2xx] kubernetes_sd_configs: - role: service relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe] action: keep regex: true - source_labels: [__address__] target_label: __param_target - target_label: __address__ replacement: blackbox-exporter.monitoring.svc.cluster.local:9115 - source_labels: [__param_target] target_label: instance - action: labelmap regex: __meta_kubernetes_service_label_(.+) - source_labels: [__meta_kubernetes_namespace] target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_service_name] target_label: kubernetes_name - job_name: 'cn-lcm-prod-kubernetes-ingresses' metrics_path: /probe params: module: [http_2xx] kubernetes_sd_configs: - role: ingress relabel_configs: - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe] action: keep regex: true - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path] regex: (.+);(.+);(.+) replacement: ${1}://${2}${3} target_label: __param_target - target_label: __address__ replacement: blackbox-exporter.monitoring.svc.cluster.local:9115 - source_labels: [__param_target] target_label: instance - action: labelmap regex: __meta_kubernetes_ingress_label_(.+) - source_labels: [__meta_kubernetes_namespace] target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_ingress_name] target_label: kubernetes_name - job_name: 'cn-lcm-prod-kubernetes-service-endpoints' scrape_interval: 10s scrape_timeout: 10s #这个job配置不太一样,采集时间是10秒,因为使用全局配置的15秒,会出现拉取数据闪断的情况,所以,这里单独配置成10秒 kubernetes_sd_configs: - role: endpoints relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme] action: replace target_label: __scheme__ regex: (https?) - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] action: replace target_label: __address__ regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 - action: labelmap regex: __meta_kubernetes_service_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_service_name] action: replace target_label: kubernetes_name