Prometheus + Grafana Monitoring Deployment

Introduction

Prometheus is an open-source monitoring system that originated as an alerting toolkit at SoundCloud. Since 2012, many companies and organizations have adopted it. The project's developer and user community is very active, with more and more contributors joining over time. It is now a standalone open-source project, independent of any single company. To emphasize this and to formalize the project's governance, Prometheus joined the Cloud Native Computing Foundation in 2016, the second project to do so after Kubernetes.

Installation steps

All services below are installed on the monitoring host: 192.168.1.24

  1. Install prometheus, the monitoring server
  2. Install node-exporter, the node metrics collector
  3. Install alertmanager, the alerting service
  4. Install grafana, the dashboard and visualization service

I. Install prometheus on the monitoring host

1. Installation commands

# Download
wget https://github.com/prometheus/prometheus/releases/download/v2.12.0/prometheus-2.12.0.linux-amd64.tar.gz

tar xf prometheus-2.12.0.linux-amd64.tar.gz
cd prometheus-2.12.0.linux-amd64

# Fill in the required settings per the yml configuration below
vim prometheus.yml

# Start
nohup ./prometheus --config.file=./prometheus.yml &

# Access prometheus
http://192.168.1.24:9090
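
Running under nohup is fine for a quick test, but the process dies with the shell. A minimal systemd unit keeps prometheus running across logins and reboots; this is a sketch that assumes the tarball was unpacked to /opt/prometheus (a hypothetical path, adjust to your layout):

# Hypothetical install path: /opt/prometheus
cat > /etc/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus monitoring server
After=network.target

[Service]
ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now prometheus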

2. Edit the configuration: vim prometheus.yml

global:
  # Scrape metrics from targets every 15 seconds
  scrape_interval: 15s
  # Evaluate the configured rule sets every 15 seconds
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    # Address of alertmanager, the alerting service
    - targets: ['192.168.1.24:9093']

# Alert rule files; a template is given below
rule_files:
- "rules.yml"

scrape_configs:
# One job per group of monitored hosts; multiple entries may be added
- job_name: 'test-25'
  static_configs:
  - targets: ['192.168.1.25:9100']

3. rules.yml alert rule template

groups:
- name: down
  rules:
  - alert: "InstanceDown"
    expr: up == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance down alert"
      description: "Alert time:"
      value: "Current value: {{ $value }}"
- name: memory
  rules:
  - alert: "MemoryUsageHigh"
    expr: ((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes) * 100 > 90
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Memory alert"
      description: "Alert time:"
      value: "Used: {{ $value }}%"
- name: cpu
  rules:
  - alert: "CpuUsageHigh"
    expr: 100 - ((avg by (instance,job,env) (irate(node_cpu_seconds_total{mode="idle"}[30s]))) * 100) > 90
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "CPU alert"
      description: "Alert time:"
      value: "Used: {{ $value }}%"
- name: disk
  rules:
  - alert: "DiskUsageHigh"
    expr: 100 - (node_filesystem_avail_bytes{fstype!~"nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint!~"/boot|/net|/selinux"} / node_filesystem_size_bytes{fstype!~"nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint!~"/boot|/net|/selinux"}) * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Disk alert"
      description: "Alert time:"
      value: "Used: {{ $value }}%"
- name: net
  rules:
  - alert: "NetworkTransmitHigh"
    expr: (irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) / 1000) > 80000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Network alert"
      description: "Alert time:"
      value: "Current: {{ $value }}KB/s"

II. Install node-exporter on the monitoring host

node-exporter collects host-level metrics and must run on every host you want monitored (for example 192.168.1.25, the target configured in prometheus.yml above). The steps below install it on the monitoring host itself.

# Download
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz

tar xf node_exporter-0.18.1.linux-amd64.tar.gz
cd node_exporter-0.18.1.linux-amd64

# Start
nohup ./node_exporter &

# Access
http://192.168.1.24:9100
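
To confirm the exporter is serving data, pull a metric from its /metrics endpoint:

# node_exporter exposes plain-text metrics on port 9100
curl -s http://192.168.1.24:9100/metrics | grep node_memory_MemTotal_bytes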

III. Install alertmanager on the monitoring host

1. Installation commands

# Download
wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz

tar xf alertmanager-0.19.0.linux-amd64.tar.gz
cd alertmanager-0.19.0.linux-amd64

# Edit the configuration; a sample is given below
vim alertmanager.yml

# Start
nohup ./alertmanager --config.file=./alertmanager.yml &

# Access
http://192.168.1.24:9093

2. alertmanager.yml alerting configuration

The following configures email alerting:

global:
  # Resolve timeout, defaults to 5m
  resolve_timeout: 5m
  # SMTP server for outgoing mail
  smtp_smarthost: 'smtp.163.com:25'
  # Sender address
  smtp_from: '123123@163.com'
  # SMTP account name
  smtp_auth_username: '123123@163.com'
  # Mailbox password or authorization code
  smtp_auth_password: '123123'
  smtp_require_tls: false

route:
  # Labels used to group alerts
  group_by: ['alertname']
  # How long to wait before sending the first notification for a new group
  group_wait: 10s
  # How long to wait before sending a notification about new alerts added to a group
  group_interval: 1h
  # How long to wait before re-sending a notification. For email, do not set this too low, or the SMTP server may reject the mail for being sent too frequently
  repeat_interval: 1h
  # Receiver name; must match a name under receivers below
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: '123123@163.com'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
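
The file can be validated with amtool, which ships in the alertmanager tarball:

# Validate alertmanager.yml before starting the service
./amtool check-config alertmanager.yml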

IV. Install grafana on the monitoring host

1. Installation commands

# Download
wget https://dl.grafana.com/oss/release/grafana-6.3.5.linux-amd64.tar.gz

tar xf grafana-6.3.5.linux-amd64.tar.gz
cd grafana-6.3.5

# Start
nohup ./bin/grafana-server &

# Access
http://192.168.1.24:3000

2. Add a data source

1. In settings: Configuration -> Data Sources
2. Configure the prometheus server URL: http://192.168.1.24:9090 (this can also be scripted through the HTTP API; see the sketch below)
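
A sketch of adding the data source through grafana's HTTP API instead of the UI, assuming the default admin:admin credentials are still in place:

# Create a Prometheus data source via the Grafana HTTP API (assumes default admin:admin login)
curl -X POST http://admin:admin@192.168.1.24:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://192.168.1.24:9090","access":"proxy","isDefault":true}'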

3. Add dashboards

1. From the create menu: Create -> Import
2. Browse https://grafana.com/grafana/dashboards and download a dashboard, or copy its template ID
3. Enter the ID, or upload the downloaded dashboard template

V. Kubernetes deployment

Prometheus deployment

---
# Source: prometheus/templates/server/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    component: "server"
    app: prometheus
    release: prometheus
    chart: prometheus-11.16.2
    heritage: Helm
  name: prometheus
  namespace: default
  annotations:
    {}
---
# Source: prometheus/templates/server/cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    component: "server"
    app: prometheus
    release: prometheus
    chart: prometheus-11.16.2
    heritage: Helm
  name: prometheus
  namespace: default
data:
  alerting_rules.yml: |
    {}
  alerts: |
    {}
  prometheus.yml: |
    global:
      evaluation_interval: 1m
      scrape_interval: 15s
      scrape_timeout: 10s
    rule_files:
    - /etc/config/recording_rules.yml
    - /etc/config/alerting_rules.yml
    - /etc/config/rules
    - /etc/config/alerts
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090
    - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
    - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      job_name: kubernetes-nodes
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - replacement: kubernetes.default.svc:443
        target_label: __address__
      - regex: (.+)
        replacement: /api/v1/nodes/$1/proxy/metrics
        source_labels:
        - __meta_kubernetes_node_name
        target_label: __metrics_path__
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
    - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - replacement: kubernetes.default.svc:443
        target_label: __address__
      - regex: (.+)
        replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
        source_labels:
        - __meta_kubernetes_node_name
        target_label: __metrics_path__
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_node_name
        target_label: kubernetes_node
    - job_name: kubernetes-service-endpoints-slow
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_node_name
        target_label: kubernetes_node
      scrape_interval: 5m
      scrape_timeout: 30s
    - honor_labels: true
      job_name: prometheus-pushgateway
      kubernetes_sd_configs:
      - role: service
      relabel_configs:
      - action: keep
        regex: pushgateway
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
      - action: drop
        regex: Pending|Succeeded|Failed
        source_labels:
        - __meta_kubernetes_pod_phase
    - job_name: kubernetes-pods-slow
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
      - action: drop
        regex: Pending|Succeeded|Failed
        source_labels:
        - __meta_kubernetes_pod_phase
      scrape_interval: 5m
      scrape_timeout: 30s
  recording_rules.yml: |
    {}
  rules: |
    {}
---
# Source: prometheus/templates/server/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    component: "server"
    app: prometheus
    release: prometheus
    chart: prometheus-11.16.2
    heritage: Helm
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics
  - services
  - endpoints
  - pods
  - ingresses
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "extensions"
  - "networking.k8s.io"
  resources:
  - ingresses/status
  - ingresses
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
---
# Source: prometheus/templates/server/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    component: "server"
    app: prometheus
    release: prometheus
    chart: prometheus-11.16.2
    heritage: Helm
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
---
# Source: prometheus/templates/server/service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    component: "server"
    app: prometheus
    release: prometheus
    chart: prometheus-11.16.2
    heritage: Helm
  name: prometheus
  namespace: default
spec:
  ports:
  - name: http
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    component: "server"
    app: prometheus
    release: prometheus
  sessionAffinity: None
  type: "ClusterIP"
---
# Source: prometheus/templates/server/deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: "server"
    app: prometheus
    release: prometheus
    chart: prometheus-11.16.2
    heritage: Helm
  name: prometheus
  namespace: default
spec:
  selector:
    matchLabels:
      component: "server"
      app: prometheus
      release: prometheus
  replicas: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        component: "server"
        app: prometheus
        release: prometheus
        chart: prometheus-11.16.2
        heritage: Helm
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus-server-configmap-reload
        image: "jimmidyson/configmap-reload:v0.4.0"
        imagePullPolicy: "IfNotPresent"
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://127.0.0.1:9090/-/reload
        resources:
          {}
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
          readOnly: true

      - name: prometheus-server
        image: "prom/prometheus:v2.21.0"
        imagePullPolicy: "IfNotPresent"
        args:
        - --storage.tsdb.retention.time=15d
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
        ports:
        - containerPort: 9090
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 30
          failureThreshold: 3
          successThreshold: 1
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          periodSeconds: 15
          timeoutSeconds: 30
          failureThreshold: 3
          successThreshold: 1
        resources:
          {}
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: storage-volume
          mountPath: /data
          subPath: ""
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config-volume
        configMap:
          name: prometheus
      - name: storage-volume
        emptyDir:
          {}
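
Assuming the manifest above is saved as prometheus.yaml (a hypothetical filename), apply it and verify the server comes up:

kubectl apply -f prometheus.yaml

# Wait for the rollout, then reach the UI through a local port-forward
kubectl -n default rollout status deploy/prometheus
kubectl -n default port-forward svc/prometheus 9090:9090
# then browse http://127.0.0.1:9090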

alertmanager deployment

---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: default
  name: alertmanager
  annotations:
    k8s.kuboard.cn/workload: alertmanager
    deployment.kubernetes.io/revision: '6'
    k8s.kuboard.cn/service: ClusterIP
    k8s.kuboard.cn/ingress: 'false'
  labels:
    k8s.kuboard.cn/layer: ''
    k8s.kuboard.cn/name: alertmanager
spec:
  selector:
    matchLabels:
      k8s.kuboard.cn/layer: ''
      k8s.kuboard.cn/name: alertmanager
  revisionHistoryLimit: 10
  template:
    metadata:
      labels:
        k8s.kuboard.cn/layer: ''
        k8s.kuboard.cn/name: alertmanager
    spec:
      securityContext:
        seLinuxOptions: {}
      imagePullSecrets: []
      restartPolicy: Always
      initContainers: []
      containers:
      - image: prom/alertmanager
        imagePullPolicy: IfNotPresent
        name: alertmanager
        volumeMounts:
        - name: conf
          readOnly: true
          mountPath: /etc/alertmanager/alertmanager.yml
          subPath: alertmanager.yml
        resources:
          limits: {}
          requests: {}
        env: []
        lifecycle: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      volumes:
      - name: conf
        configMap:
          name: alertmanager-conf
          items:
          - key: alertmanager.yml
            path: alertmanager.yml
          defaultMode: 420
      dnsPolicy: ClusterFirst
      dnsConfig:
        options: []
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
  progressDeadlineSeconds: 600
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  replicas: 1

---
apiVersion: v1
kind: Service
metadata:
  namespace: default
  name: alertmanager
  annotations:
    k8s.kuboard.cn/workload: alertmanager
  labels:
    k8s.kuboard.cn/layer: ''
    k8s.kuboard.cn/name: alertmanager
spec:
  selector:
    k8s.kuboard.cn/layer: ''
    k8s.kuboard.cn/name: alertmanager
  type: ClusterIP
  ports:
  - port: 9093
    targetPort: 9093
    protocol: TCP
    name: 8jxjpr
    nodePort: 0
  sessionAffinity: None

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-conf
  namespace: default
data:
  alertmanager.yml: |-
    global:
      # If no further events arrive within this time, send a resolved notification
      resolve_timeout: 5m
    route:
      # Default receiver name
      receiver: 'webhook_mention_all'
      # How long to wait for other alerts of the same group; alerts arriving within this window are batched into one notification
      group_wait: 30s
      # How long to wait before sending a notification about new alerts added to an existing group
      group_interval: 5m
      # If an alert is still firing after this interval, re-send the notification
      repeat_interval: 24h
      # Group alerts by prometheus labels; alerts in one group are merged into a single notification to the receiver
      group_by: ['alertname']
      routes:
      - receiver: 'webhook_mention_all'
        group_wait: 10s
    receivers:
    - name: 'webhook_mention_all'
      webhook_configs:
      # DingTalk alerting
      # - url: 'http://dingtalk:8060/dingtalk/webhook_mention_all/send'
      #   send_resolved: true
      # Feishu alerting (via PrometheusAlert)
      - url: 'http://10.0.1.120:8087/prometheusalert?type=fs&tpl=prometheus-fsv2&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxx'
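
Apply the three manifests and port-forward to check the alertmanager UI (the filename is hypothetical):

kubectl apply -f alertmanager.yaml
kubectl -n default rollout status deploy/alertmanager
kubectl -n default port-forward svc/alertmanager 9093:9093
# then browse http://127.0.0.1:9093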

PrometheusAlert alert receiver

version: "2"
services:
prometheusalert:
image: feiyu563/prometheus-alert:latest
container_name: prometheusalert
restart: always
mem_limit: 2000M
ports:
- 8080:8080
volumes:
- /data/prometheusalert/app:/app
- /etc/localtime:/etc/localtime:ro
environment:
TZ: Asia/Shanghai
PA_LOGIN_USER: prometheusalert
PA_LOGIN_PASSWORD: prometheusalert
PA_TITLE: test
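
Save the file as docker-compose.yml and start the receiver; the web console then listens on the published port with the PA_LOGIN_USER / PA_LOGIN_PASSWORD credentials set above. Note that the webhook URL configured in alertmanager must point at this container's host and published port (the manifest above used 10.0.1.120:8087, so adjust the port mapping or the URL so they match):

docker-compose up -d
docker-compose logs -f prometheusalert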

kube-state-metrics deployment

---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.1.0
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.1.0
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - list
  - watch

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.1.0
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.1.0
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: 2.1.0
    spec:
      containers:
      - image: bitnami/kube-state-metrics:2.1.0
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
        securityContext:
          runAsUser: 65534
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.1.0
  name: kube-state-metrics
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  type: ClusterIP
  selector:
    app.kubernetes.io/name: kube-state-metrics
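
Apply the manifests and confirm the metrics endpoint answers (the filename is hypothetical):

kubectl apply -f kube-state-metrics.yaml
kubectl -n kube-system rollout status deploy/kube-state-metrics

# The container exposes metrics on port 8080
kubectl -n kube-system port-forward deploy/kube-state-metrics 8080:8080
curl -s http://127.0.0.1:8080/metrics | head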

Prometheus scrape configuration

Add the following job under scrape_configs in the prometheus ConfigMap so the kube-state-metrics endpoints are discovered and scraped:

- job_name: "kube-state-metrics"
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_endpoints_name]
regex: kube-system;kube-state-metrics
action: keep
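
After prometheus reloads, kube-state-metrics series become queryable; for example, through the HTTP API (assuming the port-forward from the Prometheus section is still running):

# Pod phase metrics exported by kube-state-metrics
curl -s 'http://127.0.0.1:9090/api/v1/query?query=kube_pod_status_phase' | head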

Unless otherwise noted, all posts on this blog are licensed under CC BY-SA 4.0. Please credit the source when republishing!