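Victoria Metrics keeps crashing
Background
Earlier I set up home sensor data collection on a Raspberry Pi, scraping with Prometheus and persisting to Victoria Metrics, and wrote an article about it. After running for a while, Victoria Metrics, which is supposed to be good at long-term storage, unexpectedly started crashing.
This post covers the investigation into the cause and the fix.
Prometheus looks fine, while Victoria Metrics is clearly having trouble.
Its graph is full of gaps like this.
(I restarted it several times, but it crashed again shortly afterwards each time.)
This is the previous article.
I had planned to keep two years of data, and it did not even come close...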
# cat victoria-metrics.service
[Unit]
Description=Victoria Metrics
Before=prometheus.service
[Service]
User=prometheus
ExecStart=/usr/local/bin/victoria-metrics-prod \
-storageDataPath /var/lib/victoria/ \
-retentionPeriod=24
[Install]
WantedBy=multi-user.target
Environment
- Raspberry Pi 3 Model B+
- Raspbian GNU/Linux 10 (buster)
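The setup is as follows:
(sensors) -> (Python script) -> (Node exporter) -> (Prometheus) -> (Victoria Metrics) -> (Grafana)
The first graph is collected on the Raspberry Pi 3; the third graph is collected on a separate Raspberry Pi Zero.
Investigation
/var/log/syslog contained the following.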
Sep 18 17:05:14 rasp3 victoria-metrics-prod[379]: 2022-09-18T08:05:14.695Z#011panic#011VictoriaMetrics/lib/fs/dir_remover.go:50#011FATAL: cannot remove "/var/lib/victoria/data/small/2022_09/tmp/1715E5652798B8B9": openfdat /var/lib/victoria/data/small/2022_09/tmp/1715E5652798B8B9: too many open files
Sep 18 17:05:14 rasp3 victoria-metrics-prod[379]: panic: FATAL: cannot remove "/var/lib/victoria/data/small/2022_09/tmp/1715E5652798B8B9": openfdat /var/lib/victoria/data/small/2022_09/tmp/1715E5652798B8B9: too many open files
Sep 18 17:05:14 rasp3 victoria-metrics-prod[379]: goroutine 362 [running]:
(snip)
Sep 18 17:05:14 rasp3 systemd[1]: victoria-metrics.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 18 17:05:14 rasp3 systemd[1]: victoria-metrics.service: Failed with result 'exit-code'.
Sep 18 17:05:17 rasp3 prometheus[385]: ts=2022-09-18T08:05:17.359Z caller=dedupe.go:112 comp
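"Too many open files." Hmm, what was that again... After a fair amount of searching I realized it was a file descriptor problem. (I studied this for the LPIC exam, but had forgotten it.)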
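To check whether a crash was caused by descriptor exhaustion without reading the whole log by eye, the log can be filtered for that message. This is a minimal sketch; `find_fd_exhaustion` is a name made up for illustration, and the sample lines are shortened stand-ins for the real entries.

```python
def find_fd_exhaustion(lines, needle="too many open files"):
    """Return the log lines that report file descriptor exhaustion.

    `lines` can be any iterable of strings, including an open file.
    The match is case-insensitive.
    """
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


if __name__ == "__main__":
    # Shortened sample lines, not the full syslog entries.
    sample = [
        "rasp3 victoria-metrics-prod[379]: panic: FATAL: too many open files\n",
        "rasp3 systemd[1]: victoria-metrics.service: Failed with result 'exit-code'.\n",
    ]
    for hit in find_fd_exhaustion(sample):
        print(hit)
```

To run it against the real log, pass an open file instead of the sample list, for example `find_fd_exhaustion(open("/var/log/syslog", errors="replace"))`. Reading that file usually requires root or membership in the `adm` group.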
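Before the change, the file descriptor limit was 1024.
The 524288 in the output is only the hard limit; the soft limit is what actually applies, so the process is cut off at 1024 open files.
# cat /proc/(PID of Victoria Metrics)/limits | grep 'Max open'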
Max open files 1024 524288 files
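The same check can be scripted. This is a minimal sketch in Python, assuming the `Max open files` line has the layout shown above (soft limit, hard limit, units). The names `parse_max_open_files` and `read_limits` are made up for illustration.

```python
from pathlib import Path


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    A value of "unlimited" is returned as None.
    """
    def to_int(value):
        return None if value == "unlimited" else int(value)

    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # The last three fields are: soft limit, hard limit, units.
            soft, hard, _units = line.split()[-3:]
            return to_int(soft), to_int(hard)
    raise ValueError("no 'Max open files' line found")


def read_limits(pid):
    """Read the open-file limits of a running process (Linux only)."""
    return parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())


if __name__ == "__main__":
    sample = "Max open files 1024 524288 files"
    print(parse_max_open_files(sample))  # (1024, 524288)
```

The first number is the soft limit that the process actually runs into; the second is the ceiling up to which the soft limit may be raised.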
Fix
I made the change with reference to this.
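No matter how you edit /etc/security/limits.conf, it has no effect on the file descriptor limit of daemons started by systemd. Instead, add the setting as a drop-in configuration file for the service.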
# mkdir /etc/systemd/system/victoria-metrics.service.d
# vi /etc/systemd/system/victoria-metrics.service.d/00-limits.conf
Set it to 65536 for now. (An arbitrary value.)
[Service]
LimitNOFILE=65536:65536
Apply the change
# systemctl daemon-reload
# systemctl start victoria-metrics
# systemctl status victoria-metrics
● victoria-metrics.service - Victoria Metrics
Loaded: loaded (/etc/systemd/system/victoria-metrics.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/victoria-metrics.service.d
└─00-limits.conf
Active: active (running) since Sun 2022-09-18 23:19:30 JST; 3s ago
Main PID: 4450 (victoria-metric)
Tasks: 9 (limit: 2059)
CGroup: /system.slice/victoria-metrics.service
└─4450 /usr/local/bin/victoria-metrics-prod -storageDataPath /var/lib/victoria/ -retentionPeriod=24
Sep 18 23:19:32 rasp3 victoria-metrics-prod[4450]: 2022-09-18T14:19:32.152Z info VictoriaMetrics/lib/storage/partition.go:1530 opened part "/v
Sep 18 23:19:32 rasp3 victoria-metrics-prod[4450]: 2022-09-18T14:19:32.154Z info VictoriaMetrics/lib/storage/partition.go:1530 opened part "/v
Sep 18 23:19:32 rasp3 victoria-metrics-prod[4450]: 2022-09-18T14:19:32.156Z info VictoriaMetrics/lib/storage/partition.go:1530 opened part "/v
Sep 18 23:19:32 rasp3 victoria-metrics-prod[4450]: 2022-09-18T14:19:32.195Z info VictoriaMetrics/lib/storage/partition.go:1530 opened part "/v
Sep 18 23:19:32 rasp3 victoria-metrics-prod[4450]: 2022-09-18T14:19:32.201Z info VictoriaMetrics/app/vmstorage/main.go:105 successfully opened
Sep 18 23:19:32 rasp3 victoria-metrics-prod[4450]: 2022-09-18T14:19:32.223Z info VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:106
Sep 18 23:19:32 rasp3 victoria-metrics-prod[4450]: 2022-09-18T14:19:32.408Z info VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:132
Sep 18 23:19:32 rasp3 victoria-metrics-prod[4450]: 2022-09-18T14:19:32.409Z info VictoriaMetrics/app/victoria-metrics/main.go:61 started Victo
Sep 18 23:19:32 rasp3 victoria-metrics-prod[4450]: 2022-09-18T14:19:32.409Z info VictoriaMetrics/lib/httpserver/httpserver.go:83 starting http
Sep 18 23:19:32 rasp3 victoria-metrics-prod[4450]: 2022-09-18T14:19:32.410Z info VictoriaMetrics/lib/httpserver/httpserver.go:84 pprof handler
The limit is now 65536.
# cat /proc/4450/limits | grep 'Max open'
Max open files 65536 65536 files
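To see how much headroom the new limit leaves, the number of descriptors a process currently holds can be compared with the limit. This is a minimal sketch, assuming a Linux `/proc` filesystem; `count_open_fds` and `fd_usage_ratio` are names made up for illustration, and the demo inspects the Python process itself rather than Victoria Metrics.

```python
import os


def count_open_fds(pid):
    """Count the file descriptors a process currently holds (Linux only).

    Each entry in /proc/<pid>/fd is one open descriptor. Inspecting a
    process owned by another user requires root.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage_ratio(open_fds, soft_limit):
    """Fraction of the soft limit that is currently in use."""
    return open_fds / soft_limit


if __name__ == "__main__":
    # Demo on this Python process. To inspect the real service, pass the
    # PID of victoria-metrics-prod instead (4450 in the status output).
    used = count_open_fds(os.getpid())
    print(f"{used} descriptors open, {fd_usage_ratio(used, 65536):.2%} of 65536")
```

If the count keeps climbing toward the limit over days, raising the limit has only postponed the crash and the cause of the growth is worth a closer look.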
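Result
About six hours after the change, it looks healthy.
A typhoon is approaching, and I wonder how the air pressure readings will change. I hope everyone stays safe.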