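Liveness monitoring with IBM Cloud Monitoring with Sysdig and blackbox_exporter
Introduction
In this post I set up liveness (up/down) monitoring with blackbox_exporter. Sysdig itself has no built-in liveness or external (synthetic) monitoring, so I set up a proxy server like this one to complement infrastructure monitoring.
In addition, Sysdig Agent 10.5.0 adds support for native Prometheus scrape configuration, so I try that feature here as well.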
- Sysdig Agent Release Notes – 10.5.0 September 23, 2020
- Enable Prometheus Native Service Discovery
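Starting with agent v10.5.0, Sysdig supports native Prometheus service discovery, and you can configure it in prometheus.yaml just as you would for native Prometheus.
The new version of promscrape is called promscrape.v2.
promscrape.v2 supports all types of scrape configuration, such as federation and blackbox-exporter.
Log in with the IBM Cloud CLI
I do the setup from the command line.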
export REGION="jp-tok"
export RESOURCE_GROUP="khayama-rg"
ibmcloud login -a cloud.ibm.com -r $REGION -g $RESOURCE_GROUP
Create the Sysdig service instance
Provision IBM Cloud Monitoring with Sysdig as the place where the monitoring results are aggregated.
# ibmcloud catalog service sysdig-monitor
export SYSDIG_NAME=khayama-sysdig
export SYSDIG_PLAN=graduated-tier # or lite
ibmcloud resource service-instance-create $SYSDIG_NAME sysdig-monitor $SYSDIG_PLAN $REGION
Create credentials for the Sysdig service
Get the ID of the provisioned Sysdig instance.
export SYSDIG_ID=$(ibmcloud resource service-instance --output JSON $SYSDIG_NAME | jq -r '.[].id')
echo $SYSDIG_ID
Create the credentials needed to install the agent.
ibmcloud resource service-key-create "$SYSDIG_NAME"-service-key --instance-id $SYSDIG_ID
ibmcloud resource service-keys
Check the Sysdig access key.
export SYSDIG_ACCESS_KEY=$(ibmcloud resource service-key --output JSON "$SYSDIG_NAME"-service-key | jq -r '.[].credentials."Sysdig Access Key"')
echo $SYSDIG_ACCESS_KEY
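Note that jq -r prints the literal string null when a key is missing, so an empty or null value can slip into the later install command unnoticed. The following is a minimal sketch of a guard, assuming a hypothetical helper named require_value (it is not part of the IBM Cloud CLI or jq).

```shell
# Hypothetical helper (not part of ibmcloud or jq): fail early when a value
# extracted with `jq -r` is empty or the literal string "null".
require_value() {
  name="$1"
  value="$2"
  if [ -z "$value" ] || [ "$value" = "null" ]; then
    echo "ERROR: $name is empty or null" >&2
    return 1
  fi
  echo "OK: $name is set"
}

# Demonstration with placeholder values; in practice pass "$SYSDIG_ID" and
# "$SYSDIG_ACCESS_KEY" from the commands above.
require_value SYSDIG_ID "example-id"
require_value SYSDIG_ACCESS_KEY "null" || echo "would stop here"
```

Running the check right after each export makes a wrong instance name or key name visible before the agent installer is called.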
Prepare the proxy server
I use Red Hat Enterprise Linux for the proxy server.
[root@khayama-proxy ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
Install the Sysdig Agent
Choose the private ingestion endpoint for your region, then install the Sysdig agent on the proxy node with that setting.
[root@khayama-proxy ~]# SYSDIG_ACCESS_KEY=xxxxxxxx-yyyy-zzzz-aaaa-bbbbbbcccccc
[root@khayama-proxy ~]# COLLECTOR_ENDPOINT=ingest.private.jp-tok.monitoring.cloud.ibm.com
[root@khayama-proxy ~]# curl -sL https://ibm.biz/install-sysdig-agent | sudo bash -s -- --access_key $SYSDIG_ACCESS_KEY --collector $COLLECTOR_ENDPOINT --collector_port 6443 --secure true --tags role:proxy,location:tok04 --additional_conf 'sysdig_capture_enabled: false'
* Detecting operating system
* Installing Sysdig public key
* Installing Sysdig repository
* Installing kernel headers
* Installing Sysdig Agent
* Setting access key
* Setting tags
* Setting collector endpoint
* Setting collector port
* Setting connection security
* Adding additional configuration to dragent.yaml
Restarting dragent (via systemctl): [ OK ]
Confirm that the Sysdig Agent version is 10.5.0 or later.
[root@khayama-proxy ~]# /opt/draios/bin/dragent --version
10.5.0
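The version requirement can also be checked in a script rather than by eye. This is a minimal sketch, assuming a hypothetical helper named version_ge and GNU sort with the -V option (version sort), which RHEL ships.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Assumes GNU sort with -V (version sort).
version_ge() {
  lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
  [ "$lowest" = "$2" ]
}

# Demonstration with literal versions; in practice the first argument would be
# the output of: /opt/draios/bin/dragent --version
if version_ge "10.5.0" "10.5.0"; then
  echo "native Prometheus scraping is available"
fi
if ! version_ge "10.4.1" "10.5.0"; then
  echo "agent too old, upgrade first"
fi
```

Version sort is used instead of a plain string comparison so that, for example, 10.10.0 is treated as newer than 10.5.0.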
You can view the agent configuration with the following command.
[root@khayama-proxy ~]# cat /opt/draios/etc/dragent.yaml
customerid: xxxxxxxx-yyyy-zzzz-aaaa-bbbbbbcccccc
tags: role:proxy,location:tok04
collector: ingest.private.jp-tok.monitoring.cloud.ibm.com
collector_port: 6443
ssl: true
sysdig_capture_enabled: false
Install blackbox_exporter
Following the prometheus/blackbox_exporter repository, install the blackbox exporter on the proxy server.
Download the binary from GitHub and extract it.
wget -c https://github.com/prometheus/blackbox_exporter/releases/download/v0.17.0/blackbox_exporter-0.17.0.linux-amd64.tar.gz -O - | tar -xz
cd blackbox_exporter-0.17.0.linux-amd64
./blackbox_exporter -h
Place the binary and the configuration file as follows.
mv blackbox_exporter /usr/local/bin
mkdir -p /etc/blackbox
mv blackbox.yml /etc/blackbox
Add extra settings to the ICMP probe.
- blackbox_exporter/CONFIGURATION.md at master · prometheus/blackbox_exporter
- blackbox_exporter/example.yml at master · prometheus/blackbox_exporter
# Indentation matters: these lines extend the icmp module at the end of the default blackbox.yml.
cat <<EOF >> /etc/blackbox/blackbox.yml
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
EOF
blackbox_exporter --config.check --config.file="/etc/blackbox/blackbox.yml"
Register it as a systemd service so that it starts automatically and is restarted if it exits.
cat <<EOF > /lib/systemd/system/blackbox.service
[Unit]
Description=Blackbox Exporter Service
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/blackbox_exporter \
--config.file=/etc/blackbox/blackbox.yml \
--web.listen-address=":9115"
Restart=always
[Install]
WantedBy=multi-user.target
EOF
Finally, grant the required permission and start the blackbox service.
# to allow any user the ability to use ping
sysctl -w net.ipv4.ping_group_range='0 2147483647'
systemctl enable blackbox.service
systemctl start blackbox.service
systemctl status blackbox.service
Verify that blackbox_exporter works
A successful probe is reported as up (probe_success 1).
To show the details, add the debug=true parameter as follows.
[root@khayama-proxy ~]# curl "http://localhost:9115/probe?module=icmp&target=10.193.37.176&debug=true"
Logs for the probe:
ts=2020-09-23T07:22:53.810006347Z caller=main.go:304 module=icmp target=10.193.37.176 level=info msg="Beginning probe" probe=icmp timeout_seconds=5
ts=2020-09-23T07:22:53.810227284Z caller=icmp.go:84 module=icmp target=10.193.37.176 level=info msg="Resolving target address" ip_protocol=ip4
ts=2020-09-23T07:22:53.810278023Z caller=icmp.go:84 module=icmp target=10.193.37.176 level=info msg="Resolved target address" ip=10.193.37.176
ts=2020-09-23T07:22:53.810324592Z caller=main.go:119 module=icmp target=10.193.37.176 level=info msg="Creating socket"
ts=2020-09-23T07:22:53.810535428Z caller=main.go:119 module=icmp target=10.193.37.176 level=info msg="Creating ICMP packet" seq=34885 id=25639
ts=2020-09-23T07:22:53.810577682Z caller=main.go:119 module=icmp target=10.193.37.176 level=info msg="Writing out packet"
ts=2020-09-23T07:22:53.810673503Z caller=main.go:119 module=icmp target=10.193.37.176 level=info msg="Waiting for reply packets"
ts=2020-09-23T07:22:53.813713568Z caller=main.go:119 module=icmp target=10.193.37.176 level=info msg="Found matching reply packet"
ts=2020-09-23T07:22:53.813791262Z caller=main.go:304 module=icmp target=10.193.37.176 level=info msg="Probe succeeded" duration_seconds=0.003687872
Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 4.7291e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.003687872
# HELP probe_icmp_duration_seconds Duration of icmp request by phase
# TYPE probe_icmp_duration_seconds gauge
probe_icmp_duration_seconds{phase="resolve"} 4.7291e-05
probe_icmp_duration_seconds{phase="rtt"} 0.003104699
probe_icmp_duration_seconds{phase="setup"} 0.00025286
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 3.984007035e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
Module configuration:
prober: icmp
timeout: 5s
http:
  ip_protocol_fallback: true
tcp:
  ip_protocol_fallback: true
icmp:
  preferred_ip_protocol: ip4
  ip_protocol_fallback: true
dns:
  ip_protocol_fallback: true
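For scripted checks, the only line that matters in the probe output is the probe_success sample. The following is a minimal sketch that turns it into an exit status, assuming a hypothetical helper named probe_is_up that reads the /probe response on standard input.

```shell
# Hypothetical helper: reads blackbox_exporter /probe output on stdin and
# succeeds only when a "probe_success 1" sample is present.
probe_is_up() {
  awk '$1 == "probe_success" { found = 1; up = ($2 == 1) }
       END { exit ((found && up) ? 0 : 1) }'
}

# Demonstration on canned input; against the real exporter one would pipe
#   curl -s "http://localhost:9115/probe?module=icmp&target=10.193.37.176"
# into probe_is_up instead.
if printf 'probe_ip_protocol 4\nprobe_success 1\n' | probe_is_up; then
  echo "target is up"
fi
if ! printf 'probe_success 0\n' | probe_is_up; then
  echo "target is down"
fi
```

The "# HELP" and "# TYPE" comment lines are ignored because their first field is "#", not the metric name.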
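Configure the blackbox_exporter probe targets
Following the Prometheus configuration format, set up the ICMP liveness targets. This time I register two targets, localhost and 10.193.37.176, so that the data can be viewed per segment in Sysdig.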
cat <<EOF > /opt/draios/etc/prometheus.yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [icmp]  # Look for an ICMP response.
    static_configs:
      - targets:
        - localhost      # Target to probe with icmp
        - 10.193.37.176  # Target to probe with icmp
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.
EOF
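The relabeling above is easier to follow when the resulting request is written out: each listed target becomes the target query parameter (and the instance label), while the request itself goes to the exporter address. Here is a minimal sketch of that mapping, assuming a hypothetical helper named probe_url; it only builds the URL and does not contact the exporter.

```shell
# Hypothetical helper that mirrors the relabel_configs above: the listed target
# becomes the "target" query parameter, and the request itself goes to the
# exporter address.
probe_url() {
  exporter="$1"
  module="$2"
  target="$3"
  printf 'http://%s/probe?module=%s&target=%s\n' "$exporter" "$module" "$target"
}

for t in localhost 10.193.37.176; do
  probe_url "127.0.0.1:9115" "icmp" "$t"
done
# prints:
#   http://127.0.0.1:9115/probe?module=icmp&target=localhost
#   http://127.0.0.1:9115/probe?module=icmp&target=10.193.37.176
```

Passing either URL to curl on the proxy server should return the same kind of output as the manual probe shown earlier.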
Configure scraping of the blackbox_exporter endpoint.
Because Sysdig Agent 10.5.0 implements native Prometheus scrape configuration, add the following settings and restart the agent.
cat <<EOF >> /opt/draios/etc/dragent.yaml
prometheus:
  enabled: true
  prom_service_discovery: true
EOF
service dragent restart
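Because the settings are appended with >>, running the step twice would leave two prometheus: keys in dragent.yaml. The following is a minimal sketch of an append that is safe to repeat, assuming a hypothetical helper named append_once; it is demonstrated on a temporary file rather than on the real agent configuration.

```shell
# Hypothetical helper: append a YAML block to a file only when its top-level
# key is not there yet. The key is used as a plain grep pattern, so this
# sketch assumes simple keys such as "prometheus" or "log".
append_once() {
  file="$1"
  key="$2"
  block="$3"
  if grep -q "^${key}:" "$file" 2>/dev/null; then
    echo "skip: ${key} already present"
  else
    printf '%s\n' "$block" >> "$file"
    echo "added: ${key}"
  fi
}

# Demonstration on a temporary file instead of /opt/draios/etc/dragent.yaml.
conf="$(mktemp)"
block="$(printf 'prometheus:\n  enabled: true\n  prom_service_discovery: true')"
append_once "$conf" prometheus "$block"   # prints "added: prometheus"
append_once "$conf" prometheus "$block"   # prints "skip: prometheus already present"
rm -f "$conf"
```

The same helper would apply to any other top-level block appended to dragent.yaml in this way.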
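Check the data
In IBM Cloud Monitoring with Sysdig, go to Dashboards > Add Dashboard and confirm that a panel can be added with the following settings.
You can see that the same metric can be viewed per instance.
- Metrics: probe_success
- Segmentation: instance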
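Configure notifications
Once the panel has been created, you can set up email notification from "Create Alert". (Make sure an email notification channel has been added beforehand.) If you create the alert as "Multiple Alerts", it can fire separately for each "instance" and "job" segment. In the body of the notification email you can use {{instance}} and {{job}} as variables.
Check the notification
I shut down 10.193.37.176 and check the notification.
I confirmed the data in Sysdig.
I confirmed that the notification email arrived.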
Conclusion
You can implement liveness monitoring with blackbox_exporter and Sysdig.
For redundancy, it is best to run more than one proxy server.
References
- Configuring Sysdig Agent – Sysdig Documentation
- Working with Prometheus Metrics
- How To Install and Configure Blackbox Exporter for Prometheus
Reference: troubleshooting
Adding the following to dragent.yaml and restarting the agent lets you view detailed logs.
cat <<EOF >> /opt/draios/etc/dragent.yaml
log:
  file_priority: debug
EOF
service dragent restart && tail -f /opt/draios/logs/draios.log
If you see log lines like the following, the setup in this post is working.
2020-09-28 04:32:47.025, 21727.21762, Debug, promscrape:1010: have metrics for job 1
2020-09-28 04:32:47.025, 21727.21762, Debug, promscrape:1010: have metrics for job 2
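Watching tail -f works interactively, but a scripted check needs an exit status. This is a minimal sketch, assuming a hypothetical helper named has_promscrape_metrics and assuming the message format matches the lines shown above (the number after promscrape: may differ between agent versions).

```shell
# Hypothetical helper: reads agent log lines on stdin and succeeds when
# promscrape reports metrics for at least one job.
has_promscrape_metrics() {
  grep -q 'promscrape:[0-9]*: have metrics for job'
}

# Demonstration on a line copied from the log above; on the agent host one
# could pipe `tail -n 500 /opt/draios/logs/draios.log` into the function.
sample='2020-09-28 04:32:47.025, 21727.21762, Debug, promscrape:1010: have metrics for job 1'
if printf '%s\n' "$sample" | has_promscrape_metrics; then
  echo "promscrape is delivering metrics"
fi
```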
Reference: running with Python 3
If you run the Sysdig Agent without Python 3 installed, you see the following error.
Error, sdchecks[0] /opt/draios/lib/python-deps2.7/OpenSSL/crypto.py:12: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release.
Install Python 3.
dnf install python38 -y
python3 -V
alternatives --set python /usr/bin/python3
With this change alone, the following error appears.
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] Traceback (most recent call last):
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/bin/sdchecks", line 33, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from sdchecks import Application
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python/sdchecks.py", line 28, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] import config
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python/config.py", line 22, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from util import get_os, yLoader
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python/util.py", line 44, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from utils.platform import Platform
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python/utils/platform.py", line 6, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from utils.dockerutil import get_client
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python/utils/dockerutil.py", line 15, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from docker import Client
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python-deps/docker/__init__.py", line 6, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from .client import Client, AutoVersionClient, from_env # flake8: noqa
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python-deps/docker/client.py", line 5, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] import requests
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python-deps/requests/__init__.py", line 112, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] from . import utils
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python-deps/requests/utils.py", line 39, in <module>
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] DEFAULT_CA_BUNDLE_PATH = certs.where()
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/opt/draios/lib/python-deps/certifi/core.py", line 37, in where
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] _CACERT_PATH = str(_CACERT_CTX.__enter__())
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/usr/lib64/python3.8/contextlib.py", line 113, in __enter__
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] return next(self.gen)
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/usr/lib64/python3.8/importlib/resources.py", line 201, in path
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] with open_binary(package, resource) as fp:
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "/usr/lib64/python3.8/importlib/resources.py", line 91, in open_binary
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] return reader.open_resource(resource)
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] File "<frozen importlib._bootstrap_external>", line 988, in open_resource
2020-09-28 03:36:57.979, 18226.18241, Error, sdchecks[0] FileNotFoundError: [Errno 2] No such file or directory: '/opt/draios/lib/python-deps/certifi/cacert.pem'
If this happens, run the following commands.
python3 -m pip install --upgrade pip
python3 -m pip --version
python3 -m pip install certifi
cp "$(python3 -m certifi)" /opt/draios/lib/python-deps/certifi/cacert.pem
Finally, run this command and make sure that no errors are printed.
service dragent restart && tail -f /opt/draios/logs/draios.log | grep Error
Reference: metrics list
[root@khayama-proxy ~]# curl "http://localhost:9115/metrics"
# HELP blackbox_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which blackbox_exporter was built.
# TYPE blackbox_exporter_build_info gauge
blackbox_exporter_build_info{branch="HEAD",goversion="go1.14.4",revision="1bc768014cf6815f7e9d694e0292e77dd10f3235",version="0.17.0"} 1
# HELP blackbox_exporter_config_last_reload_success_timestamp_seconds Timestamp of the last successful configuration reload.
# TYPE blackbox_exporter_config_last_reload_success_timestamp_seconds gauge
blackbox_exporter_config_last_reload_success_timestamp_seconds 1.6008451361406033e+09
# HELP blackbox_exporter_config_last_reload_successful Blackbox exporter config loaded successfully.
# TYPE blackbox_exporter_config_last_reload_successful gauge
blackbox_exporter_config_last_reload_successful 1
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 10
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.14.4"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 834256
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 834256
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.444856e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 274
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 3.436808e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 834256
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 6.49216e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.794048e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 4258
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 6.49216e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.6715648e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 4532
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 3472
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 28152
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 32768
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.473924e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 787456
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 393216
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 393216
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.2827136e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 7
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 9
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.8173952e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.60084513566e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.33978624e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes -1
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0